
IEEE copyright notice

© 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Accepted for publication in: 3rd IEEE International Conference on Computer Vision and Machine Intelligence (IEEE CVMI), October 19-20, 2024, IIIT Allahabad, Prayagraj, India.

StrideNET: Swin Transformer for Terrain Recognition with Dynamic Roughness Extraction

Maitreya Shelare  Neha Shigvan  Atharva Satam  Poonam Sonar
Rajiv Gandhi Institute of Technology, University of Mumbai, India
{maitreya.cse,nehatshigvan,atharvajsatam17}@gmail.com, poonam.sonar@mctrgit.ac.in
Corresponding Author
Abstract

The field of remote-sensing image classification has seen immense progress with the rise of convolutional neural networks and, more recently, vision transformers. These models, with their self-attention mechanism, can effectively capture global relationships and long-range dependencies between image patches, in contrast with traditional convolutional models. This paper introduces StrideNET, a dual-branch transformer-based model developed for terrain recognition and surface roughness extraction. The terrain recognition branch employs the Swin Transformer to classify varied terrains by leveraging its capability to capture both local and global features. Complementing this, the roughness extraction branch utilizes a statistical texture-feature analysis technique to dynamically extract important land surface properties such as roughness and slipperiness. The model was trained on a custom dataset consisting of four terrain classes (grassy, marshy, sandy, and rocky), and it outperforms benchmark CNN and transformer-based models, achieving an average test accuracy of over 99% across all classes. The applications of this work extend to domains such as environmental monitoring, land use and land cover classification, disaster response, and precision agriculture.

Index Terms:
Swin Transformer, Land Surface Roughness, Remote Sensing

I Introduction

Terrain recognition and the extraction of terrain properties such as roughness and slipperiness, through the fusion of deep learning and remote sensing techniques, offer meaningful benefits across diverse domains. These domains include land use and land cover (LULC) classification [8848484], ecological monitoring [WILLIS2015233], geographical mapping [https://doi.org/10.1111/gcb.13388], natural feature detection [doi:10.1080/13658816.2018.1542697], and disaster management [HOQUE2017345].

Traditionally, terrain recognition was performed manually by experts, which was a time-consuming and expensive process. To automate it, various image processing techniques were proposed [10.5555/3137503]. However, these techniques failed to classify rapidly changing terrain accurately. This drawback was later overcome by using deep learning techniques for classification.

Convolutional Neural Networks (CNNs) are among the most extensively used deep learning techniques. CNN-based methods excel at classifying images with high accuracy, even in very challenging situations [9324261].

However, their locality bias [rs15071860] limits their ability to capture long-range dependencies and global relationships within the image, and they often lack the ability to explain the rationale behind their inferences [10.1145/2939672.2939778].

These limitations of CNN-based methods are overcome by the Vision Transformer [dosovitskiy2021image] and its variants [touvron2021training], which have shown promising results in numerous computer vision tasks.

Thus, a novel dual-branch transformer-based model is proposed, named StrideNET: Swin Transformer for Terrain Recognition with Dynamic Roughness Extraction.

The Terrain Recognition branch uses the Swin Transformer to classify different terrains. The Swin Transformer is a variant of the Vision Transformer architecture that constructs a hierarchical representation of the input image. By utilizing shifted window-based self-attention, it also establishes cross-window connections while maintaining computationally efficient local window computations.

By restricting computations to non-overlapping local windows, the Swin Transformer achieves a linear time complexity of $O(mn)$, unlike the $O(n^{2})$ complexity of the traditional Vision Transformer, provided the window size $m$ is kept reasonably small [liu2021swin]. Moreover, it offers higher generalization capabilities than CNNs by considering the relationship between different features of an image [rs14020359].

The Roughness Extraction branch uses a statistical texture-feature analysis technique to dynamically extract surface properties such as roughness and slipperiness. It determines how pixels interact within local areas of an image by capturing changes in the grayscale levels.

It utilizes statistical methods to model surface texture as a random field and then fits a probability distribution to the intensity distribution within that texture [Bhuyan2020-li]. Using this model, the variance of each image patch is calculated first, and the corresponding roughness factor is then computed.

The key contributions of this work are summarized below:

  1. A novel algorithm for extracting land surface properties such as roughness and slipperiness is proposed, which utilizes statistical texture-feature analysis for inference.

  2. The StrideNET model achieves exceptional classification accuracy in terrain recognition, surpassing other benchmark models and further validating its effectiveness.

The remainder of this paper is structured as follows. Section II provides an overview of related work in terrain recognition and roughness extraction. Section III explains the proposed StrideNET model. Section IV discusses the experiments and results. Section V summarizes our work.

Figure 1: StrideNET Architecture

II Related Works

Over the years, numerous deep learning techniques have been proposed for terrain recognition. The Faster-RCNN model [10.1145/3149808.3149814] uses a deep convolutional neural network (D-CNN) for accurate detection of craters in aerial and remote sensing imagery, although it is limited to binary classification.

W. Li et al. [Li2020-jf] conduct more advanced experiments on the classification of natural terrain features by comparing multiple convolutional models, and report that the Inception-ResNet hybrid model outperforms other traditional CNN models.

Z. Yu [9236884] explores the fusion of AlexNet with the K-nearest neighbors algorithm for remote sensing terrain classification, remarking that the hybrid method yields improved performance compared to other methods.

A. A. Aleissaee et al. [Aleissaee2023-vc] review the performance of various transformer models across different remote sensing tasks, reporting that these models outperform their convolutional counterparts due to their ability to effectively capture long-range dependencies within images.

The TRS model [zhang] combines traditional convolutional neural networks with vision transformers, by replacing their spatial convolutions with multi-head self-attention, which leads to significant improvements in classification performance.

Y. Bazi et al. [bazi] demonstrate that the attention mechanism of vision transformers can be used to capture the contextual relations between different image patches effectively, which results in better classification accuracy.

V. Suryamurthy et al. [Suryamurthy] propose a deep neural network using SegNet and ERFNet for pixel-wise terrain labeling and roughness prediction, leveraging low-level CNN features and up-projection blocks to restore spatial resolution.

Z. Yu [YuZ] presents a self-supervised model using a whiskered robot to capture vibrations for terrain classification and roughness estimation with low computational cost.

III Proposed Model - StrideNET

Figure 2: Terrain Recognition branch

The model architecture is illustrated in Fig. 1, where the input image is processed through two distinct branches: the Terrain Recognition branch and the Roughness Extraction branch.

III-A Mathematical Background

The StrideNET model is built on the Swin Transformer architecture, which differs from the standard Vision Transformer in three key aspects, discussed below:

III-A1 Self-attention in non-overlapped windows

Swin Transformer utilizes a hierarchical structure using self-attention for enabling efficient processing of high-resolution images.

The shifted window self-attention further improves the capacity of the model to capture long-range dependencies by introducing cross-window connections.

Multi-head self-attention ($\mathcal{MSA}$) is a mechanism designed to capture long-range dependencies among pixels in an image, and its computational complexity is given by Eq. 1.

\[
\Omega(\mathcal{MSA}) = 4hwC^{2} + 2(hw)^{2}C \tag{1}
\]

The windowed multi-head self-attention ($\mathcal{W\text{-}MSA}$) block in the Swin Transformer is a more computationally efficient variant of $\mathcal{MSA}$. The complexity of $\mathcal{W\text{-}MSA}$, which depends on four quantities, is given in Eq. 2.

\[
\Omega(\mathcal{W\text{-}MSA}) = 4hwC^{2} + 2M^{2}hwC \tag{2}
\]

In Eq. 1 and Eq. 2, $h$ and $w$ represent the height and width of the feature map, $C$ denotes the number of channels in the input feature map, and $M$ is the window size, so that each window contains $M \times M$ patches [liu2021swin].
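As a rough numerical illustration of Eq. 1 and Eq. 2, the sketch below evaluates both expressions for a first-stage feature map with $h = w = 56$ and $C = 96$. It treats $M$ as the window size, following the Swin Transformer formulation [liu2021swin], and sets it to 7. The function names are hypothetical and are not taken from the authors' implementation.

```python
def msa_cost(h, w, c):
    """Eq. 1: cost of global self-attention on an h x w map with c channels."""
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c


def wmsa_cost(h, w, c, m):
    """Eq. 2: cost of windowed self-attention; m is the (assumed) window size."""
    return 4 * h * w * c**2 + 2 * m**2 * h * w * c


# Assumed setting: 56 x 56 tokens, 96 channels, 7 x 7 windows.
global_cost = msa_cost(56, 56, 96)
window_cost = wmsa_cost(56, 56, 96, 7)
print(global_cost, window_cost)
```

Under these assumptions the windowed form needs roughly $1.45 \times 10^{8}$ operations against about $2.0 \times 10^{9}$ for global attention, because the term that is quadratic in $hw$ is replaced by one that is linear in $hw$.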

III-A2 Shifted window partition in successive blocks

Swin Transformer adopts a shifted windowing approach which confines self-attention to disjoint windows while enabling cross-window connectivity.

By alternating the partitioning configuration across blocks, this technique enhances the model’s ability to capture long-range dependencies and global context within images more effectively.
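The effect of alternating the partition can be seen on a toy grid. The sketch below labels each position of an $8 \times 8$ token grid with the window it falls into, first for the regular partition and then after a cyclic shift by half the window size. The grid size, window size, and function name are illustrative assumptions, and the cyclic shift follows the Swin Transformer design [liu2021swin] rather than any detail stated in this paper.

```python
def window_ids(h, w, m, shift=0):
    """Label every grid position with its window index.

    The grid is cyclically shifted by `shift` positions towards the
    top-left before being cut into non-overlapping m x m windows.
    """
    windows_per_row = w // m
    ids = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            rr = (r - shift) % h  # row of this token after the shift
            cc = (c - shift) % w  # column of this token after the shift
            ids[r][c] = (rr // m) * windows_per_row + (cc // m)
    return ids


regular = window_ids(8, 8, 4)            # partition used by one block
shifted = window_ids(8, 8, 4, shift=2)   # partition used by the next block

for row in shifted:
    print(row)
```

Positions (3, 3) and (4, 4) lie in different windows under the regular partition but share a window after the shift, which is how information crosses window boundaries. Tokens that become neighbours only because of the wrap-around are masked out in the reference Swin design; that masking is omitted here.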

\[
\begin{array}{l}
\hat{\mathbf{z}}^{l} = \mathcal{W\text{-}MSA}\left(\mathcal{LN}\left(\mathbf{z}^{l-1}\right)\right) + \mathbf{z}^{l-1}, \\
\mathbf{z}^{l} = \mathcal{MLP}\left(\mathcal{LN}\left(\hat{\mathbf{z}}^{l}\right)\right) + \hat{\mathbf{z}}^{l}, \\
\hat{\mathbf{z}}^{l+1} = \mathcal{SW\text{-}MSA}\left(\mathcal{LN}\left(\mathbf{z}^{l}\right)\right) + \mathbf{z}^{l}, \\
\mathbf{z}^{l+1} = \mathcal{MLP}\left(\mathcal{LN}\left(\hat{\mathbf{z}}^{l+1}\right)\right) + \hat{\mathbf{z}}^{l+1},
\end{array}
\tag{3}
\]

In Eq. 3, the $\mathcal{W\text{-}MSA}(\cdot)$ operation applies self-attention within local windows by partitioning the input feature map into disjoint regions, while $\mathcal{LN}(\cdot)$ denotes layer normalization.

The $\mathcal{MLP}(\cdot)$ consists of two fully connected layers with a $\mathrm{GELU}$ activation function between them [liu2021swin], and it follows each of the $\mathcal{W\text{-}MSA}(\cdot)$ and $\mathcal{SW\text{-}MSA}(\cdot)$ operations, where $\mathcal{SW\text{-}MSA}(\cdot)$ applies the same windowed self-attention on the shifted window partition. The variables $\hat{\mathbf{z}}^{l}$ and $\mathbf{z}^{l}$ represent the output features of the attention module and of the $\mathcal{MLP}$ in block $l$, respectively.
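The residual structure of Eq. 3 can be written out as a short sketch. Feature maps are plain Python lists here, and the attention, MLP, and normalization operations are passed in as callables, so the code shows only the order of operations and the skip connections. All names are hypothetical, and real blocks would hold separate learned parameters for each sub-layer.

```python
def add(a, b):
    """Element-wise sum used for the residual (skip) connections."""
    return [x + y for x, y in zip(a, b)]


def swin_block_pair(z_prev, w_msa, sw_msa, mlp_a, mlp_b, ln):
    """Apply two consecutive blocks in the order given by Eq. 3."""
    z_hat = add(w_msa(ln(z_prev)), z_prev)          # regular-window attention
    z = add(mlp_a(ln(z_hat)), z_hat)                # MLP of block l
    z_hat_next = add(sw_msa(ln(z)), z)              # shifted-window attention
    z_next = add(mlp_b(ln(z_hat_next)), z_hat_next) # MLP of block l + 1
    return z, z_next


# Stand-in operations: with identity functions each residual step doubles the input.
identity = lambda v: v
z_l, z_l_plus_1 = swin_block_pair([1.0, 2.0], identity, identity,
                                  identity, identity, identity)
print(z_l, z_l_plus_1)
```

The identity stand-ins are only there to make the data flow easy to follow; they are not a model of attention.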

III-A3 Relative position bias

Swin Transformer employs relative positional bias to improve the model’s performance. This technique allows the attention mechanism to more effectively focus on different segments of the input sequence by considering their relative positions.

\[
\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\left(QK^{T}/\sqrt{d} + B\right)V \tag{4}
\]

where $Q$, $K$ and $V$ are the query, key and value matrices respectively, $B$ is the relative position bias matrix, and $d$ denotes the dimension of the key vectors.

The relative position bias matrix $B$ is a learned matrix that encodes the relative positions of elements in the input sequence. It enhances attention mechanisms by enabling more effective learning of how to attend to different parts of the sequence.
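This paper does not describe how $B$ is stored. In the Swin Transformer design [liu2021swin], its entries are looked up from a smaller learned table with $(2M-1)^{2}$ entries, indexed by the offset between two positions inside an $M \times M$ window. The sketch below builds that lookup index in plain Python; the function name is hypothetical and the indexing scheme is an assumption based on that design.

```python
def relative_position_index(m):
    """Return an (m*m) x (m*m) table of indices into a (2m-1)^2 bias table."""
    coords = [(r, c) for r in range(m) for c in range(m)]
    index = []
    for r1, c1 in coords:
        row = []
        for r2, c2 in coords:
            dr = r1 - r2 + (m - 1)  # row offset moved into the range 0 .. 2m-2
            dc = c1 - c2 + (m - 1)  # column offset moved into the range 0 .. 2m-2
            row.append(dr * (2 * m - 1) + dc)
        index.append(row)
    return index


# Window size 7, as used in the training configuration of this paper.
index = relative_position_index(7)
distinct = {v for row in index for v in row}
print(len(index), len(index[0]), len(distinct))
```

Every pair of positions with the same relative offset shares one learned bias value, so a $49 \times 49$ matrix $B$ would need only 169 parameters per attention head under this scheme.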

The StrideNET model leverages the capabilities of the Swin Transformer to achieve accurate terrain recognition.

III-B Terrain Recognition Branch

Algorithm 1 Swin Transformer for Terrain Recognition
Input:
    Image: $\mathbf{I} \in \mathbb{R}^{224 \times 224 \times 3}$
Parameters:
    Patch size: $P = 4 \times 4$
    Embedded dimension: $D = 96$
    Heads in MSA: $H$
    Transformer blocks: $T = \{2, 2, 6, 2\}$
Initial Embeddings:
    Split $\mathbf{I}$ into patches: $I_{p} \in \mathbb{R}^{56 \times 56 \times 48}$
    Embed: $E_{1} = \mathrm{Linear}(I_{p})$; $E_{1} \in \mathbb{R}^{56 \times 56 \times D}$
for $s = 2$ to $4$ do
    a. Patch Merge:
    $E_{s} = \mathrm{Merge}(E_{s-1})$
    b. Process:
    for $t = 1$ to $T[s]$ do
        $E_{s} = \mathrm{TransformerBlock}(E_{s})$
    end for
end for
Global Average Pooling:
    $V = \mathrm{GAP}(E_{4})$; $V \in \mathbb{R}^{D}$
Classifier:
    $O = \mathrm{Softmax}(\mathrm{Linear}(V))$
Output:
    $O$: Class probabilities of the different terrains

The Terrain Recognition branch is illustrated in Fig. 2. The input image $I$ is first processed by an augmentation layer that applies operations such as cropping and flipping to enhance the model's robustness. Next, a patch extraction process divides the image into patches of size $P$, forming a three-dimensional tensor $I_{p}$ that represents distinct local regions.

Positional embedding is then utilized to encode the spatial information of the patches. This is achieved by applying a linear transformation to each image patch, and embedding it into a D-dimensional vector, resulting in a new tensor E. The image patches, along with their corresponding positional embeddings, are then fed into the encoding layer of the Swin Transformer.

Figure 3: Terrain Dataset [Aras_2023]. Panels: (a) Grassy, (b) Marshy, (c) Rocky, (d) Sandy.

The encoding layers comprise successive transformer blocks, each equipped with multiple attention heads and a feedforward neural network. The shifted windowed self-attention mechanism is used to efficiently capture long-range dependencies within the image. The output is subsequently passed through the feedforward neural network for feature transformation.

Following this, patch merging is performed to downsample the feature maps between each stage of the transformer, except at the final stage.

Next, global average pooling is applied to compress the spatial information from the feature maps into a fixed-length vector, ensuring a uniform input size.

The output $V$ from the global average pooling layer is forwarded to a dense layer, which maps the high-dimensional feature representation to the desired output dimension of four, corresponding to the number of terrain classes.

Finally, Softmax activation is used to assign a probability to each terrain class. This branch is summarized in Algorithm 1.
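To make the tensor sizes in this branch concrete, the sketch below traces the feature-map shape through the four stages for a $224 \times 224 \times 3$ input with $4 \times 4$ patches and an embedding dimension of 96. It assumes the standard Swin patch-merging rule [liu2021swin], in which each merge halves the spatial resolution and doubles the channel width; this rule is not spelled out in the text, and the function name is hypothetical.

```python
def swin_stage_shapes(image_size=224, patch=4, embed_dim=96, num_stages=4):
    """Return the (height, width, channels) of the feature map at each stage."""
    side = image_size // patch  # tokens per side after patch extraction
    dim = embed_dim
    shapes = [(side, side, dim)]
    for _ in range(num_stages - 1):
        side //= 2  # patch merging groups 2 x 2 neighbouring tokens
        dim *= 2    # assumed: merged features are projected to twice the width
        shapes.append((side, side, dim))
    return shapes


patch_dim = 4 * 4 * 3  # raw values in one 4 x 4 RGB patch
shapes = swin_stage_shapes()
print(patch_dim, shapes)
```

Under this assumption the feature map passed to global average pooling is $7 \times 7$ with 768 channels, so the pooled vector would have 768 entries before the four-way classifier.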

Figure 4: Roughness Extraction

III-C Roughness Extraction Branch

The Roughness Extraction branch uses image characteristics such as texture and variance to compute the value of the roughness factor $R$.

Texture describes the recurring pattern of localized intensity variations in an image. It quantifies how these intensities are arranged within a specific region and is usually represented as a feature vector. Statistical methods are particularly effective for analyzing small texture elements, which contribute to microtextures.

Variance measures the spread of intensity values within an image. It is a dimensionless indicator of how much these values deviate from the mean intensity. A high variance indicates a wide distribution of intensity values, characteristic of high-contrast images. Conversely, low variance indicates that intensity values are closely clustered, typical of low-contrast images.

Algorithm 2 Terrain Roughness Extraction
Input:
    Image: $\mathbf{I} \in \mathbb{R}^{224 \times 224 \times 3}$
Parameters:
    Patch size: $P = W \times H$
    Step size: $s$
procedure ImplicitProperties($I, s$)
    $Patches \leftarrow \mathrm{Patchify}(I, P, s)$
    for $p \in Patches$ do
        $PatchList \leftarrow PatchList \cup \{p\}$
        $WindowStride \leftarrow s$
    end for
    for $V \in Variance(PatchList)$ do
        $Roughness \leftarrow Roughness \cup \{1 - \frac{1}{1+V}\}$
    end for
end procedure
$GlobalRoughness \leftarrow \mathrm{GlobalAverage}(Roughness)$
procedure Visualize($GlobalRoughness$)
    $Data \leftarrow \mathrm{CreateMatrix}(GlobalRoughness)$
    $Data \leftarrow \mathrm{ColorMap}(Data)$
    $O \leftarrow \mathrm{Blend}(I, Data)$
    return $O$
end procedure
Output:
    $O$: Terrain image with extracted properties

The variance, computed from the image histogram, is given by Eq. 5.

\[
\sigma^{2} = \sum_{i=0}^{j-1} (z_{i} - m)^{2} \, p(z_{i}) \tag{5}
\]

where $z_{i}$ is the $i^{\text{th}}$ intensity level, $j$ is the number of intensity levels, $m$ denotes the mean intensity value, and $p(z_{i})$ is the probability of a pixel having intensity value $z_{i}$, estimated from the normalized histogram.

The roughness factor $R$ is a measure of the texture of an image, and it is computed using Eq. 6.

\[
R = 1 - \frac{1}{1 + \sigma^{2}} \tag{6}
\]

It is a dimensionless quantity ranging from 0 to 1, where 0 represents a completely smooth image and values approaching 1 represent an increasingly rough image [Bhuyan2020-li].
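Eq. 5 and Eq. 6 can be checked on a handful of pixel values. The sketch below builds a normalized histogram from a flat list of grayscale intensities, computes the variance from it, and converts the variance to a roughness factor. The function names are hypothetical, and the intensity scale is an assumption, since the text does not specify one.

```python
from collections import Counter


def histogram_variance(pixels):
    """Eq. 5: variance of intensities computed from their normalized histogram."""
    n = len(pixels)
    probs = {z: count / n for z, count in Counter(pixels).items()}
    mean = sum(z * p for z, p in probs.items())
    return sum((z - mean) ** 2 * p for z, p in probs.items())


def roughness_factor(variance):
    """Eq. 6: map a non-negative variance to a value in [0, 1)."""
    return 1.0 - 1.0 / (1.0 + variance)


flat_patch = [10] * 16          # constant intensity, so zero variance
textured_patch = [0, 0, 2, 2]   # mean 1, variance 1

print(roughness_factor(histogram_variance(flat_patch)))
print(roughness_factor(histogram_variance(textured_patch)))
```

Intensity levels that do not occur in the patch have zero probability and contribute nothing to the sum, so iterating over the observed levels only is equivalent to summing over all $j$ levels.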

The Roughness Extraction branch, outlined in Algorithm 2, uses the roughness factor $R$ to dynamically extract roughness and slipperiness from the image.

First, the input image I is segmented into patches of size P, creating a collection of patches known as PatchList. The algorithm then processes each patch in PatchList by calculating its variance to determine its roughness value. These roughness values are stored in the Roughness matrix. The global roughness value is subsequently obtained by averaging the roughness values from the Roughness matrix, and this average is recorded in the GlobalRoughness matrix.
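The patch-wise procedure described above can be sketched as follows. The code assumes a grayscale image stored as a list of rows, square patches, and a step size equal to or smaller than the patch size. These choices and all function names are illustrative assumptions rather than the authors' implementation.

```python
def patchify(image, patch, step):
    """Yield flattened square patches taken with a sliding window."""
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - patch + 1, step):
        for c in range(0, cols - patch + 1, step):
            yield [image[r + i][c + j] for i in range(patch) for j in range(patch)]


def variance(values):
    """Population variance, equal to the histogram variance of the same values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def roughness_values(image, patch, step):
    """Roughness factor of every patch, in scan order."""
    return [1.0 - 1.0 / (1.0 + variance(p)) for p in patchify(image, patch, step)]


def global_roughness(image, patch, step):
    """Average of the per-patch roughness factors."""
    values = roughness_values(image, patch, step)
    return sum(values) / len(values)


# Toy 4 x 4 image: smooth on the left half, textured on the right half.
toy = [
    [0, 0, 0, 2],
    [0, 0, 2, 0],
    [0, 0, 0, 2],
    [0, 0, 2, 0],
]
print(roughness_values(toy, patch=2, step=2), global_roughness(toy, patch=2, step=2))
```

A step size smaller than the patch size produces overlapping patches and a denser roughness map, at the cost of more variance computations.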

To visualize the extracted roughness value, a Data matrix is generated. This matrix is normalized to a common range and resized to match the dimensions of the original image. A colormap is then applied to depict varying roughness levels. The final output O is obtained by blending the original image I with the matrix, as illustrated in Fig. 4.
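A minimal sketch of this visualization step is given below, using plain Python lists instead of an image library. It assumes min-max normalization, nearest-neighbour resizing by an integer factor, a hypothetical two-colour ramp from blue (smooth) to red (rough), and alpha blending. None of these specific choices are stated in the text.

```python
def normalize(values):
    """Min-max normalize a flat list to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def upsample(grid, factor):
    """Nearest-neighbour resize of a 2-D list by an integer factor."""
    return [[grid[r // factor][c // factor] for c in range(len(grid[0]) * factor)]
            for r in range(len(grid) * factor)]


def colormap(t):
    """Map t in [0, 1] to an RGB colour: blue for smooth, red for rough."""
    return (round(255 * t), 0, round(255 * (1 - t)))


def blend(pixel, colour, alpha):
    """Alpha-blend one RGB pixel with one overlay colour."""
    return tuple(round((1 - alpha) * p + alpha * c) for p, c in zip(pixel, colour))


def visualize(image_rgb, roughness_grid, factor, alpha=0.5):
    """Overlay a colour-coded roughness grid on an RGB image of matching size."""
    cols = len(roughness_grid[0])
    flat = normalize([v for row in roughness_grid for v in row])
    grid = [flat[i * cols:(i + 1) * cols] for i in range(len(roughness_grid))]
    heat = upsample(grid, factor)
    return [[blend(image_rgb[r][c], colormap(heat[r][c]), alpha)
             for c in range(len(heat[0]))] for r in range(len(heat))]


# 2 x 4 image covered by a 1 x 2 roughness grid, each cell spanning 2 x 2 pixels.
image = [[(10, 20, 30)] * 4 for _ in range(2)]
overlay = visualize(image, [[0.0, 1.0]], factor=2, alpha=1.0)
print(overlay[0][0], overlay[1][3])
```

With `alpha=1.0` the output shows the colour map alone, which makes the mapping easy to inspect; intermediate values keep the terrain visible underneath.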

IV Experiments and Results

IV-A Dataset Description and Training Details

The model is trained on a custom dataset consisting of over 45,000 images, with approximately 10,000 images for each terrain type: Sandy, Rocky, Grassy, and Marshy, as shown in Fig. 3. This dataset is publicly available at [Aras_2023].

All experiments are performed using PyTorch on a system equipped with an Intel Core i7-11800H CPU, 16 GB DDR4 RAM, and an Nvidia GeForce RTX 3050 Ti GPU.

For model training, the input image is resized from $256 \times 256$ to $224 \times 224$. The patch size is set to 4, and the window size is configured to 7. Label Smoothing Cross-Entropy loss is used as the objective function, with AdamW as the optimizer.
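For reference, a label-smoothed cross-entropy for a single sample can be sketched in plain Python as shown below. The smoothing factor `eps = 0.1` is a hypothetical value, since the paper does not report the one used, and the function name is illustrative.

```python
import math


def label_smoothing_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy against a smoothed target distribution.

    The true class receives weight (1 - eps) + eps / K and every other
    class receives eps / K, where K is the number of classes.
    """
    k = len(logits)
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    log_probs = [x - log_norm for x in logits]
    smooth = [eps / k + (1.0 - eps) * (1.0 if i == target else 0.0)
              for i in range(k)]
    return -sum(q * lp for q, lp in zip(smooth, log_probs))


# Four terrain classes; a confident, correct prediction for class 0.
logits = [2.0, 0.0, 0.0, 0.0]
print(label_smoothing_cross_entropy(logits, target=0, eps=0.0))
print(label_smoothing_cross_entropy(logits, target=0, eps=0.1))
```

With smoothing enabled the loss for a confident correct prediction is slightly higher than the plain cross-entropy, which discourages over-confident outputs.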

The model is trained for a total of 10 epochs, with a 70:30 data split. The learning rate is initialized to 0.001, and a StepLR scheduler is applied with a step size of 3 epochs and a decay factor of 0.97.
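The learning rate implied by these settings can be tabulated directly. The sketch below reproduces the step-decay rule, in which the rate is multiplied by the decay factor once every `step_size` epochs; the function name is hypothetical, and epochs are counted from zero.

```python
def step_lr(initial_lr, gamma, step_size, epoch):
    """Learning rate at a given epoch under step decay."""
    return initial_lr * gamma ** (epoch // step_size)


# Settings reported above: 10 epochs, initial rate 0.001, step 3, decay 0.97.
schedule = [step_lr(0.001, 0.97, 3, epoch) for epoch in range(10)]
for epoch, lr in enumerate(schedule):
    print(epoch, f"{lr:.6f}")
```

With a decay factor this close to 1, the rate falls by less than 9% over the 10 epochs, so the schedule acts as a gentle decay rather than a sharp drop.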

To prevent the model from overfitting, a dropout rate of 0.3 is applied. The model architecture consists of transformer layers configured as [2, 2, 6, 2] and attention heads configured as [3, 6, 12, 24].

TABLE I: Classwise Accuracy (in %) of different methods on the Terrain Recognition dataset.
\begin{tabular}{lcccccc}
\hline
Class & MobileNet-V2 & EfficientNet-B0 & ResNet-101 & ViT & DeiT & StrideNET \\
\hline
Grassy & 95.98 & 95.42 & 97.69 & 97.54 & 99.37 & 99.56 \\
Marshy & 93.83 & 94.01 & 98.32 & 96.28 & 98.93 & 99.05 \\
Rocky & 97.89 & 93.87 & 98.44 & 97.00 & 99.29 & 99.20 \\
Sandy & 94.30 & 94.32 & 97.03 & 98.17 & 98.54 & 99.93 \\
\hline
\end{tabular}
TABLE II: Performance Metrics (in %) of different methods on the Terrain Recognition dataset.
\begin{tabular}{lcccccc}
\hline
Method & OA & AA & Kappa & Precision & Recall & F1 Score \\
\hline
MobileNet-V2 & 95.30 & 95.50 & 90.83 & 75.50 & 65.32 & 70.30 \\
EfficientNet-B0 & 95.02 & 94.40 & 93.42 & 95.02 & 92.30 & 94.30 \\
ResNet-101 & 96.83 & 97.87 & 95.94 & 96.32 & 95.69 & 96.69 \\
ViT & 98.61 & 97.25 & 98.84 & 98.61 & 99.03 & 99.03 \\
DeiT & 99.86 & 99.03 & 99.18 & 99.86 & 98.49 & 99.49 \\
StrideNET & 99.98 & 99.41 & 99.96 & 99.98 & 99.95 & 99.99 \\
\hline
\end{tabular}
Figure 5: Graph representing model accuracy and model loss for the training and validation sets of the proposed StrideNET model. Panels: (a) Accuracy Curve, (b) Loss Curve.

IV-B Classification Results

To evaluate the model, we use Overall Accuracy (OA), Average Accuracy (AA), and the Kappa Coefficient as primary metrics. Additionally, macro-averaged values of precision, recall, and F1 score are used for comprehensive evaluation.

Overall Accuracy represents the proportion of correctly classified instances across all classes relative to the total number of instances. Average Accuracy represents the mean accuracy across all classes.

The Kappa Coefficient is a robust statistical measure that evaluates the agreement between predicted and actual classifications, adjusted for chance agreement. Macro-averaged values of precision, recall, and F1 score are selected due to the balanced class distribution in our dataset.
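The metrics defined above can all be derived from a confusion matrix. The sketch below does so in plain Python, taking rows as actual classes and columns as predicted classes. The confusion matrix is a made-up toy example, not data from this paper, and Average Accuracy is computed as the mean of the per-class accuracies (per-class recall), which is an assumption about the definition used.

```python
def classification_metrics(cm):
    """Compute OA, AA, kappa and macro precision/recall/F1 from a square matrix.

    Assumes every class occurs at least once as an actual and as a predicted label.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]                              # actual counts
    col_sums = [sum(cm[r][c] for r in range(k)) for c in range(k)]   # predicted counts
    diag = [cm[i][i] for i in range(k)]

    oa = sum(diag) / total
    recall = [diag[i] / row_sums[i] for i in range(k)]
    precision = [diag[i] / col_sums[i] for i in range(k)]
    f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

    # Agreement expected by chance, from the marginal distributions.
    expected = sum(row_sums[i] * col_sums[i] for i in range(k)) / total**2
    kappa = (oa - expected) / (1 - expected)

    return {
        "OA": oa,
        "AA": sum(recall) / k,
        "Kappa": kappa,
        "Precision": sum(precision) / k,
        "Recall": sum(recall) / k,
        "F1": sum(f1) / k,
    }


# Hypothetical 4-class confusion matrix with 50 samples per class.
toy_cm = [
    [50, 0, 0, 0],
    [0, 45, 5, 0],
    [0, 5, 45, 0],
    [0, 0, 0, 50],
]
metrics = classification_metrics(toy_cm)
print(metrics)
```

On a balanced matrix such as this one, OA and AA coincide, while kappa is lower because it discounts the agreement expected by chance.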

The class-wise accuracy of the proposed StrideNET model is compared in Table I with standard CNN models, including MobileNet-V2, EfficientNet-B0, and ResNet-101, as well as transformer-based models such as ViT and DeiT. StrideNET achieves a test accuracy exceeding 99% for each class.

Table II shows a comparison of StrideNET against other models using our primary metrics: Overall Accuracy (OA), Average Accuracy (AA), and the Kappa Coefficient ($\kappa$). Our proposed model also achieves values exceeding 99% in these metrics.

To assess potential overfitting, we examined the accuracy-loss curves shown in Fig. 5. These curves, which illustrate the model’s performance across varying epochs, indicate that the model maintains a balanced fit, avoiding both overfitting and underfitting. This empirical evidence suggests that our model is reliable and robust, and it can generalize well to unseen data.

In our comparative analysis (Table II), the StrideNET model demonstrated performance metrics on par with state-of-the-art models, including ResNet-101 and DeiT. While these models achieved similar accuracy levels, StrideNET's efficiency in terms of data requirements and computational resources makes it the preferred choice for real-world terrain recognition tasks.

V Conclusion

This paper introduces StrideNET, a model based on the Swin Transformer, to perform terrain recognition and surface roughness extraction from remote sensing images. It comprises two branches: Terrain Recognition and Roughness Extraction. The model is trained to identify four terrain classes, namely grassy, marshy, rocky and sandy. A novel algorithm is proposed in this paper, which uses statistical texture-feature analysis to dynamically extract the terrain characteristics of roughness and slipperiness from the input image. The model is trained on a custom dataset and is benchmarked against standard convolutional and transformer-based models. Experimental results demonstrate that the proposed model achieves a classification accuracy of over 99% for all terrain types, outperforming other models. Thus, it can be utilized for applications such as environmental monitoring, LULC classification, and precision agriculture.
