1. Introduction
As an integral part of natural ecosystems, forests not only regulate the circulation of air and water and protect the soil from wind and rain, but also reduce the harm that environmental pollution causes to people. The complete or partial destruction of forest and vegetation cover generally has many adverse consequences for the ecosystem, including intensified runoff and erosion, loss of biodiversity, and an impact on global warming [1].
In recent years, climate change has made extreme weather increasingly frequent. Hurricanes and fires are two major natural disturbances to forest ecosystems [2,3]. A severe hurricane can broadly affect the vegetation composition, structure, and succession of forests, and thereby influence the terrestrial carbon sink [4,5]. Damage from fire, a common and widespread forest disturbance [3,6,7,8], can be severe, reducing species richness and above-ground live biomass [9]. Between 2015 and 2020, roughly 10 million hectares of the world's forests were lost each year, according to the records of the Food and Agriculture Organization of the United Nations [10].
It is of great significance for post-disaster management to map windfall damage to forests and estimate the affected areas [11]. However, field-based studies are time-consuming and laborious, and usually cover only small areas given the time and cost of making the observations. Moreover, forest disasters generally occur in remote regions [12] that can be hard to reach for in situ measurements [13]. Consequently, such methods are constrained to an extremely limited spatial and temporal extent. By contrast, as exemplified in [14,15,16,17], remote sensing offers an opportunity to assess all damaged areas in a reliable, cost-effective, and time-efficient manner, to the extent that machine learning provides accurate classification tools.
A substantial number of approaches have been devised for monitoring forest changes through the classification of remote sensing images [18,19]. For example, Zarco-Tejada et al. detected forest decline using high-resolution hyperspectral and Sentinel-2a imagery [20]. White et al. evaluated the utility of a spectral recovery index based on the Normalized Burn Ratio (NBR), the years to recovery, as an indicator of the return of forest vegetation following forest harvest (clear-cutting) [21]. Bar et al. identified pre-monsoon forest fire patches over the Uttarakhand Himalaya using medium-resolution optical satellite data [22]. However, most of these methods rely solely on spectral features derived from image pixels. Spectral features alone cannot resolve the problem of different objects sharing similar reflectance characteristics, which leads to a decrease in classification accuracy.
In later research endeavors, texture features have been widely used alongside spectral information in image classification to help discriminate target classes with similar spectral signatures [23,24]. They provide an opportunity to improve the classification of forest disturbances, specifically hurricane and fire disturbance mapping. For example, Jiang et al. computed texture features to recognize forest fire smog [25], reporting a recognition accuracy of 98% with robustness on their smog image database. Benjamin et al. extracted fire regions from forest fire images using color rules and texture analysis [26]. Kulkarni et al. used the local fractal TPSA method to compute textural features and evaluated the impact of Hurricane Hugo on the land cover of Francis Marion National Forest, South Carolina [27]. Textural features are statistical measures of the spatial arrangement of the gray-level intensities of an image [28,29]. Generally, texture feature extraction is applied to the pixels of the input image by evaluating some type of difference among neighboring pixels within square windows that overlap over the entire image [30]. The accuracy of window-based texture extraction depends on the choice of several parameters. For example, Marceau et al. [31] showed that the window size is the most crucial parameter affecting classification accuracy. Murray et al. systematically analyzed the influence of window size and texture feature selection on classification accuracy [32]. Garcia et al. analyzed the roles of both the shape and the size of the window, and showed that texture features are much more affected by the window size than by its shape [33].
In fact, the size of the texture window should ideally match the spatial scale of the object or class under consideration. However, the aforementioned texture extraction methods are all based on a single window. In other words, single-window approaches do not adequately consider the scale of different objects and always face a trade-off between window size and classification accuracy [32]. On the one hand, the window should be large enough to capture the relevant patterns, but if it is too large, edge effects can dominate the results [34]. On the other hand, precisely locating boundary edges between adjacent regions is a fundamental goal of segmentation, and it can only be ensured with relatively small windows [30]. Consequently, obtaining a good texture characterization is an important research direction for improving image classification accuracy.
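As an illustration of the window-based extraction discussed above, the following NumPy sketch computes a simple variance texture with a single square window; the toy image, the window sizes, and the choice of variance as the texture measure are ours, not from the cited studies. Note how a larger window leaves fewer valid center locations, one facet of the window-size trade-off.

```python
import numpy as np

def window_texture(img, size):
    """Variance texture at every valid window center (stride 1).

    `img` is a 2-D grayscale array and `size` an odd window size.
    Centers whose window would exceed the image borders are discarded.
    """
    h = size // 2
    rows, cols = img.shape
    out = np.empty((rows - 2 * h, cols - 2 * h))
    for i in range(h, rows - h):
        for j in range(h, cols - h):
            out[i - h, j - h] = img[i - h:i + h + 1, j - h:j + h + 1].var()
    return out

img = np.arange(36, dtype=float).reshape(6, 6)  # toy 6 x 6 image
t3 = window_texture(img, 3)  # 4 x 4 valid centers: sharp localization
t5 = window_texture(img, 5)  # only 2 x 2 valid centers: more context
```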
Ensemble learning techniques have been successfully applied to remote sensing image classification and are attracting growing attention for their gains in accuracy [35,36,37,38,39,40]. They carry out various training tasks inducing multiple base classifiers whose outcomes are fused into a final prediction. Random Forest (RF) is a very powerful ensemble algorithm: a collection of decision trees usually induced with the Classification and Regression Trees (CART) algorithm [41]. RF has been widely used for forest disturbance mapping. For example, Einzmann et al. used an RF classifier to detect hurricane-damaged areas [42]. Their methodology was evaluated on two test sites in Bavaria with RapidEye data at 5 m pixel resolution and identified over 90% of the windthrow areas. Ramo et al. developed an RF algorithm for MODIS global burned area classification [43]; to assess its performance, their models were used to classify daily MCD43A4 images in three test sites over three consecutive years (2006–2008). Considering the advantages of random forest, such as fast training and few parameters, it is also used for forest disaster monitoring in this paper. In the application under focus, we assume that a fair number of already-classified pixels are available for training. These pixels should be relevant in the sense of having sufficiently similar reflectance characteristics, yet in general they should not be expected to be part of the studied image. In RF, the base classifiers are decision trees, and diversity is achieved by multiple random samplings of both the training set and the feature set. The outcomes of all decision trees are combined by majority rule. As a supplementary benefit, and with no extra computation, the training of the decision trees provides measurements of feature importance. These measurements are used for feature selection, a dimension reduction technique that improves the ability to discriminate between classes.
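As a minimal, self-contained sketch of this workflow (synthetic data; scikit-learn's RandomForestClassifier as the RF implementation, which the paper does not prescribe), the following shows bagged CART trees fused by majority rule, and the feature importances obtained as a by-product of training:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic "featured pixels": only the first two of eight features
# carry class information; the remaining six are noise.
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Bootstrap sampling of rows and random feature subsets at each split;
# the trees' votes are fused by majority rule inside predict().
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0)
rf.fit(X, y)

# Feature importances come at no extra cost and can drive feature selection.
ranked = np.argsort(rf.feature_importances_)[::-1]
```

With this setup, the two informative features dominate the importance ranking, which is exactly the property exploited for dimension reduction later in the paper.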
Our aim is to classify the pixels of a multispectral image captured over a forest landscape into three classes: damaged, not damaged, and unknown. The specific objectives of this study are to (1) develop a new algorithm for extracting texture features, specifically one using a composite window; (2) utilize the feature importances computed while training the RF for feature selection, as a class-discriminative dimension reduction tool; and (3) validate this algorithm. The texture features extracted in this study could improve the classification accuracy of remote sensing images and enhance the monitoring of forest disturbance.
The structure of this paper is as follows. Section 2 describes the study area and the imagery used. Section 3 presents the proposed method. Section 4 presents the results and their analysis, including the overall accuracy, the kappa coefficient, the per-class accuracy, the per-class area, and color maps for evaluating forest damage. Section 5 presents the discussion, and Section 6 the conclusions drawn from the methods and results.
3. The Proposed Method
The proposed training method is summarized in the flowchart of Figure 8. On the left of the flowchart, a multispectral image is depicted as a stack of images (one image per wavelength). The "preprocessing" indicated with a right arrow is here a normalization mapping pixel values into a fixed range. At the upper center of the flowchart, a large frame entitled "Texture feature extraction" transforms images into feature values. The pixels inside the windows have spatial coordinates assembled in two sets, one for the smaller window and one for the larger. The images are first scanned by a composite window, as described in Section 3.1. From the pixels inside each window, five feature values are then extracted, as enumerated in Section 3.2. On the right of the flowchart, a frame entitled "Feature selection and classification" depicts two tasks. The extracted feature values feed an RF ensemble classifier, as summarized in Section 3.3. In addition, while the base classifiers are being trained, measurements of feature importance are read out and then used for feature selection, as explained in Section 3.4. The proposed training method is sketched in Algorithm 1. The actual feature values used to train the ensemble classifier come from the training set, and the labels of the test set are compared to the predictions of the trained ensemble classifier. These connections are indicated by two long arrows at the bottom of the flowchart, and the comparison of predictions is depicted by a small frame entitled "Prediction evaluation" at the bottom right. The equations for these comparisons are given in Section 3.5, and the parameter setting is presented in Section 3.6.
3.1. Scanning with a Composite Window
The composite window proposed here is a set of two windows sharing the same center, the first larger than the second. To ease notation, the windows are assumed to be square and their sizes to be odd integers. Multispectral images are processed in raster scanning order with a stride of 1; both windows are moved together over the set of all possible locations, indexed by k. Locations where the windows would exceed the image's borders are discarded. At each retained location, the two windows share a common center, and from each window the pixel values at every wavelength, indexed by d, are extracted.
This composite windowing technique is illustrated on the left of Figure 9, which shows a flowchart of the feature extraction process with values specific to the Nezer dataset. The four superimposed darker squares depict a multispectral image with four wavelengths, from which four single-wavelength images can be extracted. In the upper left corner of the first frame, a white square contains a gray square; the two squares depict the first location of the two windows sharing a common center, as indicated by a left-oriented arrow. The central part of the Nezer-specific flowchart lists eight windows, two for each captured wavelength. These windows are specific to each location, so a multispectral image yields two windows per wavelength at every location.
Note an unusual consequence of this windowing process in terms of machine learning: because features are extracted from the pixels inside moving windows, the real training and test sets should not each be considered a set of pixels, but rather sets of pixels contained in windows whose centers are, respectively, in the so-called training and test sets. Some training and test windows may therefore share pixels. While the reader should bear this oddity in mind, we do not consider it a major issue, as this experimental setting is used here for comparison purposes. For the sake of simplicity, the training and test sets mentioned in the following sections are described as featured pixels, not as sets of pixels.
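The scanning of Section 3.1 can be sketched as follows (NumPy; the sizes 3 × 3 and 5 × 5 and all variable names are illustrative, not the paper's notation): two square windows share a center, and only centers for which the larger window fits inside the image are retained.

```python
import numpy as np

def composite_window(img, i, j, s1, s2):
    """Pixel values of the smaller (s1 x s1) and larger (s2 x s2)
    windows sharing the center (i, j); s1 < s2, both odd."""
    h1, h2 = s1 // 2, s2 // 2
    w1 = img[i - h1:i + h1 + 1, j - h1:j + h1 + 1]
    w2 = img[i - h2:i + h2 + 1, j - h2:j + h2 + 1]
    return w1, w2

def valid_centers(shape, s2):
    """Raster-scan locations (stride 1) where the larger window fits."""
    h2 = s2 // 2
    rows, cols = shape
    return [(i, j) for i in range(h2, rows - h2)
                   for j in range(h2, cols - h2)]

img = np.arange(49, dtype=float).reshape(7, 7)  # one wavelength, toy size
w1, w2 = composite_window(img, 3, 3, s1=3, s2=5)
```

For a multispectral image, the same pair of windows would be read from each wavelength at every valid center.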
3.2. Texture Features Extracted
The flowchart of Figure 9 shows on its right a list of five features (data range, mean, variance, entropy, and skewness), extracted from each of the windows available at each k-indexed location. These features are defined in the following equations, all computed using only the values of the pixels inside a given window; each feature is extracted once from the smaller window and once from the larger one, the feature index r ranging from 1 to 5. The data ranges are defined as the difference between the maximum and minimum pixel values. The means are computed by adding all pixel values and dividing by their number. The variances are computed by averaging the squared differences between the pixel values and their mean. The fourth pair of features has a strong flavor of entropy and is for that reason referred to as entropy in this paper. Being computed on the pixel values themselves and not on their distribution, these features should not be considered equivalent to entropy; rather, they can be thought of as aggregates of nonlinearly transformed pixel values. Up to some normalization, they can be approximated by the average absolute differences between the pixel values and a constant. Figure 10 supports this claim by showing the nonlinear mapping as a smooth curve together with a fairly similar triangle function. The fifth pair of features consists of estimates of the skewness.
For each sample, the extracted features are recorded as a row vector in a fixed order. The row vectors corresponding to the samples of the training set are then stacked into a matrix, where each row is related to a specific location and each column is a feature defined by its window, its wavelength, and its defining equation, one of (5)–(9). A ground truth label c is assigned to each row, and these labels are stacked into a column vector.
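The five per-window statistics of Section 3.2 can be sketched as below (NumPy). The data range, mean, variance, and skewness follow the textual definitions; the paper's exact nonlinear transform for the entropy-like feature is not reproduced here, so the common reading −x·log(x) applied to normalized pixel values is assumed and should be treated as an assumption.

```python
import numpy as np

def texture_features(w):
    """Five statistics of the pixel values inside one window.

    The 'entropy' term applies a nonlinear transform to the pixel
    values directly (not to their histogram); -x*log(x) is assumed.
    """
    x = w.ravel().astype(float)
    data_range = x.max() - x.min()
    mean = x.mean()
    variance = ((x - mean) ** 2).mean()
    xn = np.clip(x, 1e-12, 1.0)  # assumes values normalized to (0, 1]
    entropy_like = np.mean(-xn * np.log(xn))
    std = np.sqrt(variance)
    skewness = 0.0 if std == 0 else np.mean(((x - mean) / std) ** 3)
    return np.array([data_range, mean, variance, entropy_like, skewness])

feats = texture_features(np.array([[0.2, 0.4], [0.6, 0.8]]))
```

Extracting these five values from both windows of the composite pair, for every wavelength, and concatenating them yields the row vector of features assigned to each location.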
3.3. Classification
The ensemble classifier considered here is an RF. The following three tasks are repeated V times, V being the number of decision trees to induce.
First, the rows of the training matrix, representing the training instances, are sampled with replacement; this is equivalent to a left multiplication by a selection matrix in which each row contains a single component equal to 1 whose position is drawn uniformly at random. The assigned labels are transformed accordingly. Second, a random G-sized subset of the columns is selected; this is equivalent to a right multiplication by the identity matrix deprived of a random set of columns, the labels remaining unmodified. Finally, the resulting matrix is used to train a decision tree following the Classification And Regression Trees (CART) methodology. The trained decision tree maps any feature row vector into an integer class index ranging from 1 to C, C being the number of classes.
The predictions of the V decision trees are fused according to the majority rule: each tree casts one vote, counted with an indicator function equal to one when the tree predicts the given class and zero otherwise, and the class gathering the most votes is predicted.
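The three repeated tasks and the majority-rule fusion can be made concrete with the following synthetic sketch (scikit-learn's DecisionTreeClassifier standing in for the CART learner; the data, sizes, and feature-subset size of 4 are illustrative only):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 6))
y = (X[:, 0] > 0).astype(int)  # class carried by feature 0

V, C = 25, 2
votes = np.zeros((len(X), C), dtype=int)
for v in range(V):
    rows = rng.integers(0, len(X), size=len(X))   # 1) bootstrap rows
    cols = rng.choice(6, size=4, replace=False)   # 2) random feature subset
    tree = DecisionTreeClassifier(random_state=v)
    tree.fit(X[rows][:, cols], y[rows])           # 3) train one tree
    pred = tree.predict(X[:, cols])
    votes[np.arange(len(X)), pred] += 1           # one vote per tree

h = votes.argmax(axis=1)  # majority rule over the V trees
```

Even though individual trees trained on feature subsets without the informative feature are weak, the fused prediction recovers the class structure well.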
3.4. Measurements of Feature Importance
As side information, the training of the decision trees brings measurements of feature importance. Consider the v-indexed decision tree and one of its internal nodes (including the root), indexed by n. From the previous splits, there remains at this node a subset of the samples and labels, given by a set of row indexes into the sampled training matrix and label vector. At this node, the samples are split upon a feature, randomly selected without repetition, and a cutting point, together defining two complementary subsets. Splitting modifies the class distribution at the node into two class distributions, one associated with each subset.
The importance of the feature used at node n is here the amount by which the split reduces entropy. The relative number of samples handled by each node is used as a weight when averaging these reductions over nodes and trees, and the resulting measures of feature importance are scaled by a normalization factor. Note that this measure does not take into account the class to which each sample is assigned; it assumes that at each leaf, the predicted class is the most likely one. The rationale is that a decision tree makes more reliable predictions when the training instances reaching a leaf more often belong to the same class, and entropy precisely measures that "purity".
As exemplified in [48], we use feature selection to improve performance and to help prevent the "curse of dimensionality". Compared to principal component analysis, which is generally used for dimension reduction, importance-driven feature selection is expected to be more effective at discriminating between classes. We select the G most important features and train a new model using only those G features, G being a new parameter. To select them, the features are first ranked in decreasing order of importance, and the first G features are then kept. This selection is equivalent to a right multiplication by a matrix obtained by removing from the identity matrix the columns of the discarded features. The new ensemble classifier is trained as in Section 3.3, but using only the selected features.
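A sketch of this importance-driven selection (synthetic data; scikit-learn as in the other sketches; G = 5 chosen arbitrarily): the features are ranked by the importances read out of a first forest, the G most important are kept, and a new forest is trained on them.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))
y = (X[:, 3] - X[:, 7] > 0).astype(int)  # only features 3 and 7 matter

# First forest: its importances are the side information of training.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

G = 5
# Rank by decreasing importance and keep the G most important features.
top = np.argsort(rf.feature_importances_)[::-1][:G]
rf_small = RandomForestClassifier(n_estimators=100,
                                  random_state=0).fit(X[:, top], y)
```

Keeping the selected columns plays the role of the right multiplication by the reduced identity matrix described above.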
Algorithm 1: Composite Window Based Feature Extraction Method
Input:
multispectral image
set of locations
set of ground truth labels
size of the smaller window
size of the larger window
V: number of decision trees to be trained
G: number of features to be sampled for training decision trees
number of features to be selected for training a new ensemble classifier
Output:
ensemble classifier
1: for each location do
2:   Compute the two window pixel sets with (4)
3:   for each wavelength do
4:     Compute the features of the smaller window with (5)–(9)
5:     Compute the features of the larger window with (5)–(9)
6:   end for
7:   Stack the computed values into a row vector with (10)
8: end for
9: Stack the row vectors and labels into the training matrix and label vector
10: for each of the V trees do
11:   Sample rows with replacement
12:   Sample G features
13:   Train a decision tree on the sampled data
14:   Compute the feature importance measurements with (13)–(15)
15: end for
16: Compute h with (11)
17: Compute the aggregated feature importance with (16)
18: Select the most important features with (18)
19: for each of the V trees do
20:   Sample rows with replacement
21:   Sample G features
22:   Train a decision tree on the sampled data
23: end for
24: Compute the final ensemble classifier with (11)
3.5. Precision Evaluation
Three classical evaluation metrics are used here to test a classifier h on a test set for which the true labels are known. The Overall Accuracy measures the rate of correct predictions over the number of test samples. The kappa statistic adjusts the overall accuracy for chance agreement. The per-class accuracy measures the rate of correct predictions when testing only the samples of class c, relative to the number of test samples belonging to class c. The per-class area measures the area (in km²) correctly detected as belonging to class c, obtained by multiplying the number of correctly detected pixels by the area covered by a pixel.
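Under the definitions above, the four metrics can be sketched as follows (NumPy; the chance-agreement term of kappa is computed from the class marginals, and the per-class area multiplies the count of correctly detected pixels by the area of one pixel, here an illustrative 0.01 km²):

```python
import numpy as np

def evaluate(y_true, y_pred, pixel_area_km2):
    """Overall accuracy, kappa, per-class accuracy, per-class area (km^2)."""
    classes = np.unique(y_true)
    oa = (y_true == y_pred).mean()
    # Expected agreement by chance, from the class marginals.
    pe = sum((y_true == c).mean() * (y_pred == c).mean() for c in classes)
    kappa = (oa - pe) / (1 - pe)
    pca = {c: (y_pred[y_true == c] == c).mean() for c in classes}
    area = {c: ((y_true == c) & (y_pred == c)).sum() * pixel_area_km2
            for c in classes}
    return oa, kappa, pca, area

oa, kappa, pca, area = evaluate(np.array([0, 0, 1, 1]),
                                np.array([0, 1, 1, 1]),
                                pixel_area_km2=0.01)
```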
3.6. Parameter Setting
Two parameters need to be set in the RF classifier: the number of decision trees to be generated (Ntree) and the number of variables to be selected and tested for the best split when growing the trees (Mtry). In this study, the default value of 100 was used for Ntree, and Mtry was set to the square root of the number of input variables. For comparison purposes, five learning techniques were trained on the same training set and tested on the same test set. In addition, since the Blue Mountain Forest site is a fire burn area, the differenced Normalized Burn Ratio (dNBR) was added to classify that area in this study.
The first technique is an RF trained on spectral features only. As in Algorithm 1, feature preparation (steps 2–7) is replaced by a direct use of the pixel spectral values; RF training remains unchanged, using steps 9–16 with step 14 excluded.
The next three techniques are RFs trained on texture features computed using a single window of size 3 × 3, 5 × 5, or 7 × 7, respectively. As in Algorithm 1, feature preparation follows steps 1–9 with step 5 excluded; RF training follows steps 10–16 with step 14 excluded.
The fifth technique is an RF trained on texture features computed using a composite window of sizes 5 × 5 and 7 × 7. Feature preparation follows steps 1–9 of Algorithm 1; RF training follows steps 10–16 with step 14 excluded.
The proposed learning technique is the one described in Algorithm 1. When applied to the Nezer Forest, it uses a selected subset of the 40 available features; for the Blue Mountain Forest, a selected subset of the 130 available features is used.