Applications of Knowledge Distillation in Remote Sensing: A Survey

Yassine Himeur yhimeur@ud.ac.ae    Nour Aburaed    Omar Elharrouss    Iraklis Varlamis    Shadi Atalla    Wathiq Mansoor    Hussain Al Ahmad College of Engineering and Information Technology, University of Dubai, Dubai, UAE MBRSC Lab, University of Dubai, Dubai 2713, UAE Department of Computer Science and Software Engineering, United Arab Emirates University, UAE Department of Informatics and Telematics, Harokopio University of Athens, GR-17778 Athens, Greece
Abstract

With the ever-growing complexity of models in the field of remote sensing (RS), there is an increasing demand for solutions that balance model accuracy with computational efficiency. Knowledge distillation (KD) has emerged as a powerful tool to meet this need, enabling the transfer of knowledge from large, complex models to smaller, more efficient ones without significant loss in performance. This review article provides an extensive examination of KD and its innovative applications in RS. KD, a technique developed to transfer knowledge from a complex, often cumbersome model (teacher) to a more compact and efficient model (student), has seen significant evolution and application across various domains. Initially, we introduce the fundamental concepts and historical progression of KD methods. The advantages of employing KD are highlighted, particularly in terms of model compression, enhanced computational efficiency, and improved performance, which are pivotal for practical deployments in RS scenarios. The article provides a comprehensive taxonomy of KD techniques, where each category is critically analyzed to demonstrate the breadth and depth of the alternative options, and illustrates specific case studies that showcase the practical implementation of KD methods in RS tasks, such as instance segmentation and object detection. Further, the review discusses the challenges and limitations of KD in RS, including practical constraints and prospective future directions, providing a comprehensive overview for researchers and practitioners in the field of RS. Through this organization, the paper not only elucidates the current state of research in KD but also sets the stage for future research opportunities, thereby contributing significantly to both academic research and real-world applications.

keywords:
Knowledge distillation; Model Compression; Model and Data Distillation; Remote Sensing; Urban Planning and Precision Agriculture

1 Introduction

1.1 Preliminary

Remote sensing (RS) image analysis plays a pivotal role in interpreting and managing Earth’s natural and human-made environments [1]. This technology harnesses data captured by satellites or high-altitude aircraft, providing crucial insights across a broad spectrum of applications—from agricultural monitoring and disaster management to urban planning and climate science [2]. By enabling timely and efficient observation of vast, inaccessible, or dangerous areas, RS becomes indispensable for tracking environmental changes, predicting weather patterns, and managing natural resources [3]. Consequently, the ability to quickly process and analyze RS images leads to more informed decision-making, enhancing our global capability to respond to challenges such as food security, natural disasters, and climate change [4, 5].

However, the complexity of RS tasks varies significantly depending on the specific application, and many of these tasks are inherently challenging and computationally intensive [6, 7]. Key tasks such as image classification, object detection, change detection, and segmentation involve processing high-dimensional data, often characterized by large spatial and spectral resolutions [8, 9]. For instance, distinguishing between different land cover types or detecting minute changes over time in vast geographical areas necessitates sophisticated algorithms capable of handling enormous datasets [10]. Moreover, the presence of noise, variability in lighting conditions, atmospheric distortions, and the need for high precision further compound the complexity of these tasks [11, 12]. These challenges lead to extensive training times, particularly for deep learning models, which require large datasets to achieve high accuracy and generalization. Therefore, optimizing these models to balance accuracy and computational efficiency remains an ongoing challenge in RS [13, 14].

Artificial Intelligence (AI), particularly machine learning (ML) and deep learning (DL), has revolutionized RS image analysis by introducing levels of precision and efficiency previously unattainable with traditional methods [15, 16, 17]. DL models, especially those based on Convolutional Neural Networks (CNNs), are highly adept at handling high-dimensional data from RS imagery [18]. These models excel in tasks such as pattern recognition, object detection, and semantic segmentation, where they can automatically identify features like roads, buildings, or vegetation changes [19]. Furthermore, the deployment of AI enables the processing of large datasets in real-time, significantly improving the accuracy of predictions and analyses. Moreover, DL’s ability to learn feature representations without manual intervention reduces reliance on expert-driven feature design, thus scaling up the analytical capabilities of RS technologies [20].

Despite these advances, the integration of AI and DL into RS presents several significant challenges. One of the foremost issues is the requirement for substantial computational resources, particularly for training large neural network models [21, 22, 23]. This becomes a critical barrier for organizations with limited access to high-performance computing infrastructure [24]. Additionally, DL models often require vast labeled datasets for training, which can be difficult and costly to acquire in the context of RS. Furthermore, these models are prone to overfitting, especially when trained on limited datasets, reducing their ability to generalize well to new, unseen data [25]. Another pressing concern is the "black box" nature of DL models, which often leads to difficulties in interpreting their decision-making processes—a critical requirement in applications where transparency and understanding are paramount, such as in environmental compliance and strategic planning [26, 27].

To address some of these challenges, knowledge distillation (KD) emerges as a promising technique. KD involves training a smaller, more efficient student model to replicate the performance of a larger, more complex teacher model [28]. By transferring knowledge from a high-performing neural network to a compact model, KD reduces the computational resources required for deployment, making advanced AI-driven RS technologies more accessible [29]. Moreover, the student model can often achieve comparable accuracy with less data, mitigating the issues of extensive data requirements and overfitting [30]. In resource-constrained environments, KD proves particularly advantageous, as it enables energy-efficient deployment, thereby reducing the carbon footprint of AI systems [31]. Additionally, KD facilitates the transfer of pre-trained models to other domains through fine-tuning, extending the versatility of AI applications even in scenarios with scarce data. Furthermore, KD techniques can assist in generating synthetic training data when annotated data is limited, thus addressing one of the critical bottlenecks in RS [32, 30]. The resulting simpler models from the distillation process also offer easier interpretability, providing clearer insights into their decision-making mechanisms [33, 34]. This interpretability is essential for applications requiring transparency, such as environmental monitoring and regulatory compliance. As a result, KD not only democratizes AI capabilities within RS but also enhances the practical utility of these technologies in critical applications, ensuring a balance between performance, energy efficiency, and scalability across diverse domains [31].

Abbreviation | Full Form | Abbreviation | Full Form
KD | Knowledge Distillation | YOLOv8 | You Only Look Once version 8
RS | Remote Sensing | MS2RGB | Multispectral to RGB Knowledge Distillation
CNN | Convolutional Neural Network | PseKD | Phase-shift Encoded Knowledge Distillation
S-T | Student-Teacher | GSGNet | Graph Semantic Guided Network
ARSD | Adaptive Reinforcement Supervision Distillation | LPIS | Land Parcel Identification System
RGB | Red, Green, Blue | DOTA | Dataset for Object Detection in Aerial Images
R-CNN | Region-based Convolutional Neural Network | DIOR | Dataset for Object Detection in Remote Sensing
FPN | Feature Pyramid Network | AID | Aerial Image Dataset
MCFI | Multiscale Core Features Imitation | SSKDNet | Self-supervised Knowledge Distillation Network
SSRD | Strict Supervision Regression Distillation | MSKA | Multi-level Semantic Knowledge Alignment
CFKD | Cross-layer Fusion for Knowledge Distillation | ViTs | Vision Transformers
YOLO | You Only Look Once | FPN | Feature Pyramid Network
HSI | Hyperspectral Image | ERKT | Efficient and Robust Knowledge Transfer
CKD | Collaborative Consistent Knowledge Distillation | TWA | Two-way Adaptive Distillation
GKD | Generalized Knowledge Distillation | NLD | Noisy Label Distillation
DKD | Decoupled Knowledge Distillation | CAMs | Class Activation Maps
SSFD | Spatial Feature Blurring for Distillation | RS-SSKD | Remote Sensing Self-supervised Knowledge Distillation
LEVIR | Large-scale Earth Vision Image Recognition | SAR SSDD | Synthetic Aperture Radar Ship Detection Dataset
UCMerced | University of California Merced Land-use Dataset | NWPU-RESISC | Northwestern Polytechnical University Remote Sensing Image Scene Classification
CMD | Class Mean Distillation | MSW | Maximum Sustained Wind

1.2 Comparison with Existing Reviews

Several recent reviews and surveys have provided comprehensive analyses of various aspects of Knowledge Distillation (KD) and its applications across different domains. These works highlight the evolution, challenges, and future directions of KD, focusing on areas such as computer vision, medical applications, and large language models. For instance, [35] offers an in-depth examination of KD within the framework of the Student-Teacher (S-T) learning model, providing a thorough overview of KD’s core concepts, methods, and applications, particularly in vision tasks. The study also identifies key challenges and potential future research directions. Similarly, [36] explores the significance of cross-stage connection paths between teacher and student networks, introducing a novel approach that enhances the effectiveness of KD while maintaining low computational overhead. This framework is shown to improve performance across various tasks such as classification and object detection. Additionally, [37] presents a survey focusing on KD as a model compression and acceleration technique, categorizing KD methods by knowledge types, training schemes, and architectures. The paper discusses challenges like the trade-off between model size and performance and suggests potential research avenues to advance the field further.

In another study, Yadikar et al. [38] examine the application of KD in target detection within computer vision, focusing on the challenges of balancing detection speed and accuracy. The study highlights how knowledge compression techniques, particularly knowledge refinement, can enhance the performance of target detection algorithms on edge devices with limited computational power. The authors also propose potential improvements and future trends in integrating distillation learning with target detection [38]. Similarly, Alkhulaifi et al. [39] explore KD as a solution for deploying deep learning models on resource-constrained devices. They introduce a "distillation metric" to compare different KD methods based on model size and accuracy, providing a detailed survey of techniques such as soft label distillation and logit and feature map distillation, both offline and online. The study also discusses real-world KD applications in domains such as autonomous vehicles, healthcare, and IoT, outlining current challenges and future research directions. Furthermore, Yu et al. [40] review dataset distillation (DD), a technique related to KD that focuses on creating smaller, synthetic datasets that retain the performance of models trained on larger datasets. The study presents an algorithmic framework for DD methods, categorizes existing approaches, and identifies challenges such as privacy, copyright, and data storage, offering insights into future research directions for this emerging field.

Additionally, Meng et al. [41] explore the use of KD in the medical field, addressing challenges such as deploying large models on lightweight devices and the difficulty of sharing medical datasets. The study reviews various KD applications in healthcare, demonstrating how KD can compress complex models while improving their performance in medical tasks. It highlights the potential of KD to alleviate issues related to medical resource shortages by optimizing model deployment effectively. Similarly, Li et al. [42] present a survey on KD in object detection (OD), discussing the evolution of KD-based OD models and their advantages in performance and resource efficiency. The study analyzes different distillation techniques and explores their applications in domains like remote sensing (RS) and the management of 3D point cloud datasets, offering a comprehensive comparison of model performance across various datasets.

Furthering the exploration of KD, Luo et al. [43] provide an overview of modern approaches to distilling Diffusion Models (DMs), focusing on distilling DMs into neural vector fields and reviewing stochastic and deterministic implicit generators. The authors also examine accelerated diffusion sampling algorithms as a training-free method for distillation, offering valuable insights for researchers interested in DM distillation. Additionally, Acharya et al. [44] address the emerging field of symbolic KD in large language models (LLMs), emphasizing the transformation of implicit knowledge within these models into a more explicit, symbolic form. This survey categorizes existing research, highlights the importance of symbolic KD in enhancing interpretability and efficiency, and proposes future research directions to advance this growing field.

In the context of computer vision, Kaleem et al. [45] provide a comprehensive review of KD techniques, covering major methods such as response-based, feature-based, and relation-based knowledge transfer. The study discusses the benefits and challenges of using KD to compress and optimize deep learning models, especially in resource-constrained environments. It explores the application of KD in tasks such as image classification, object detection, and video captioning, and highlights recent developments in multimodal models with KD. Similarly, Habib et al. [46] focus on KD in Vision Transformers (ViTs), addressing the challenges of deploying these models in environments with limited computational resources. The study reviews various KD approaches for compressing ViTs, emphasizing KD’s role in reducing computational and memory requirements while maintaining model performance. It also provides a comparative analysis of different KD techniques for ViTs and identifies unresolved challenges that warrant further research. Table 1 presents a comparison of several KD surveys and reviews across various aspects such as focus on vision tasks, the use of teacher-student frameworks, real-world and medical applications, distillation techniques, and future research directions. It highlights which aspects are covered by each reference, along with the proposed study, indicating areas of focus and gaps in the existing literature.

The proposed review offers a comprehensive and structured analysis of KD, expanding upon previous works by integrating a wide-ranging taxonomy and exploring its applications across various domains, particularly in RS. Unlike existing reviews, which tend to focus on specific aspects of KD such as its role in model compression or its application in computer vision, this review provides a holistic overview, categorizing KD models based on architecture, distillation techniques, and application areas. Furthermore, it delves into advanced topics such as dynamic distillation, layer-wise distillation, and the integration of KD with real-time processing and edge AI, areas that remain relatively underexplored in prior literature. Additionally, the review addresses practical challenges such as data heterogeneity, scalability, and the balance between efficiency and accuracy, offering insights into emerging trends and future directions. This approach not only contextualizes KD within the broader landscape of machine learning but also highlights its potential for innovation in areas like precision agriculture, urban planning, and oceanographic monitoring. Thus, this review serves as a resource for researchers and practitioners aiming to leverage KD in diverse and complex environments. Overall, the key contributions of this review to KD in RS can be summarized as follows:

  • Provides a comprehensive and structured analysis of KD, significantly expanding on previous works by integrating a wide-ranging taxonomy.

  • Explores the diverse applications of KD across various domains, with a particular focus on RS.

  • Categorizes KD models based on architecture, distillation techniques, and application areas, offering a holistic overview.

  • Delves into advanced topics such as dynamic distillation, layer-wise distillation, and the integration of KD with real-time processing and edge-AI, which are underexplored in prior literature.

  • Addresses practical challenges, including data heterogeneity, scalability, and the balance between efficiency and accuracy, providing insights into emerging trends and future directions.

  • Contextualizes KD within the broader landscape of machine learning over RS data, highlighting its potential for innovation in areas like precision agriculture, urban planning, and oceanographic monitoring.

Table 1: Comparison of KD Surveys and Reviews
Aspect [35] [36] [37] [38] [39] [40] [41] [42] [45] [46] Proposed
Focus on Vision Tasks
Teacher-Student Framework
Real-world Applications
Medical Applications
RS Applications
Distillation Techniques
Discussion on Challenges
Future Research Directions
Model Compression Techniques
Introduction of New Metrics
Multimodal Model Applications
Discussion of Existing Datasets

1.3 Literature Screening Approach

1.3.1 Inclusion and Exclusion Criteria

The inclusion criteria cover studies that are directly relevant to KD, particularly in the context of RS, with a focus on research published within the last 5 years to capture the latest advancements. Peer-reviewed articles, conference papers, preprints from reputable platforms, and book chapters published in English are prioritized, including empirical studies, reviews, case studies, and theoretical papers. Excluded are studies that do not specifically address KD in RS or that focus on unrelated technologies, research published more than 5 years ago unless it is seminal, and non-peer-reviewed sources such as blog posts, opinion pieces, and non-academic publications.

1.3.2 Search Strategy

The search strategy uses multiple academic databases and publisher platforms, including IEEE Xplore, Scopus, Web of Science, Elsevier, Springer Nature, Wiley, Taylor & Francis, MDPI, and Google Scholar, to conduct a comprehensive search with keywords such as "Knowledge Distillation", "Model Compression", "Model Distillation", "Feature Distillation", "Data Distillation", "Remote Sensing", "Urban Planning", "Precision Agriculture", and "Land Cover Classification", refined with Boolean operators (AND, OR, NOT). Initial screening of titles, abstracts, and keywords is performed manually to identify potentially relevant studies. Studies that meet the inclusion criteria are shortlisted for a full-text review, where a detailed evaluation confirms their relevance and quality. Additionally, the reference lists of selected studies are screened to identify further relevant studies that may have been overlooked. Fig. 1 summarizes the literature screening approach adopted in this study.

Figure 1: Summary of literature screening approach used in this review.

1.4 Organization of the Paper

The organization of the paper is meticulously structured to provide a thorough exploration of KD and its applications in remote sensing. Section 2 lays the groundwork by covering the fundamentals of KD, starting with a brief overview, defining essential concepts, and discussing the historical evolution of KD techniques. This section also delves into the basic principles and mechanisms of KD, including the objective function and overall loss that guide the distillation process. Additionally, the benefits of KD are highlighted, such as model compression, improved efficiency, enhanced performance on smaller models, and the broader implications for various applications. Following this, Section 3 focuses on RS tasks and the public datasets that are pivotal for applying KD in this domain. Section 4 introduces a comprehensive taxonomy of KD models, categorizing them based on variations in the model or input data, the type of transferred knowledge (including response-based, feature-based, and relation-based distillation), distillation targets (data, model, and feature distillation), and structural relationships within network layers (layer-to-layer and cross-layer distillation). In Section 5, the paper transitions to discussing the applications of KD in remote sensing, with a detailed examination of its use in image/scene classification, object detection, land cover classification, semantic segmentation, precision agriculture, urban planning, and oceanographic monitoring. Section 6 then addresses the challenges and limitations associated with KD, including model complexity, data heterogeneity, overfitting, scalability, real-time applicability, dependency on high-quality data, balancing efficiency and accuracy, and integration complexity. Looking ahead, Section 7 outlines future directions for KD research. 
It suggests advancements such as dynamic distillation, layer-wise distillation, efficient training and inference techniques, low-cost training algorithms, hardware-aware distillation, and improvements in data quality and robustness. The section also discusses scalability solutions like distributed and incremental distillation, real-time processing enhancements, and the integration of cross-modal and multi-modal distillation. Additionally, it explores the potential for seamless integration with existing workflows through plug-and-play distillation modules, toolkits, and frameworks, as well as enhancing model interpretability through explainable distillation and feature importance preservation. The potential of hybrid approaches, combining KD with other techniques and developing adaptive distillation frameworks, is also considered. Finally, Section 8 offers a comprehensive conclusion, synthesizing the insights gained throughout the paper and highlighting the potential for future advancements in the field of KD in remote sensing.

2 Fundamentals of KD

2.1 A Brief Overview

2.1.1 Definition and Basic Concepts of KD

KD is an ML technique in which a smaller, simpler model (the student) is trained to emulate the behavior of a larger, more complex model (the teacher) [19]. As shown in Fig. 2, both models are deep neural networks. The core idea is to transfer knowledge from the teacher, which typically performs better owing to its greater capacity, to the student, which is less resource-intensive [47]. This can be achieved by aligning the student’s outputs with those of the teacher through a distillation loss function that compares the two. Usually, the teacher’s soft target probabilities (the softmax outputs before the final decision function is applied, as depicted in Fig. 2) are used for this purpose. These soft targets provide richer information than hard labels because they encode the relative probabilities of the incorrect classes, giving the student clues about the data structure and feature relationships that the teacher has learned [48]. Beyond this response-based strategy, the student can also learn from the outputs of intermediate teacher layers or from other representations, which makes KD flexible and powerful.
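To make the idea of learning from intermediate teacher layers more concrete, the following is a minimal sketch of a feature-matching loss in the spirit of hint-based training. It is an illustration under stated assumptions rather than a method taken from any surveyed work: the function names, the toy feature vectors, and the linear projection `weights` (which would normally be learned) are all hypothetical, and the mean squared error is only one possible choice of distance.

```python
def project(features, weights):
    """Apply a linear map; `weights` holds one row per output dimension."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def feature_distillation_loss(teacher_features, student_features, weights):
    """Mean squared error between teacher features and projected student features."""
    projected = project(student_features, weights)
    if len(projected) != len(teacher_features):
        raise ValueError("projection output must match the teacher feature size")
    squared_errors = [(t - p) ** 2 for t, p in zip(teacher_features, projected)]
    return sum(squared_errors) / len(teacher_features)


# Hypothetical intermediate activations: the teacher layer is wider than the student's.
teacher_features = [0.5, -1.0, 2.0]
student_features = [1.0, 2.0]

# Hypothetical projection that maps the 2-d student features into the 3-d teacher space.
weights = [[0.5, 0.0], [0.0, -0.5], [1.0, 0.5]]
matched_loss = feature_distillation_loss(teacher_features, student_features, weights)

# An all-zero projection ignores the student features and leaves a large mismatch.
zero_weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
unmatched_loss = feature_distillation_loss(teacher_features, student_features, zero_weights)

print(matched_loss, unmatched_loss)
```

The projection is needed only because the student layer is assumed to be narrower than the teacher layer; when both layers have the same width, it can be dropped.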

Figure 2: An overview of the knowledge distillation principle.

2.1.2 Brief History and Evolution of the KD Technique

The concept of KD can be traced back to earlier works in model compression and hints training, where simpler models were trained to mimic more complex ones using additional information from those models. However, the term “knowledge distillation” was popularized by Hinton et al. [49] in a seminal 2015 paper, where they demonstrated the effectiveness of using soft targets to train neural networks. Since then, the field has seen rapid development and broad applications across various domains of Artificial Intelligence. Originally, KD was primarily used to reduce the size and computational demands of large neural networks so that they could be deployed on devices with limited hardware capabilities, such as mobile phones and embedded systems. This was particularly valuable for applications that require real-time processing, such as speech recognition and mobile vision [50].

As research progressed, the scope of KD expanded beyond model compression. Researchers began exploring its potential to improve generalization by smoothing the student’s decision boundaries and making them more stable. Distilling an ensemble of teacher networks into a single student further stabilizes training by transferring the collective knowledge of multiple models, and it facilitates transfer learning across different domains or tasks [51]. The technique has also been adapted and refined to go beyond direct output mimicry, with feature-based and relation-based distillation transferring intermediate representations and relationships between data points from the teacher to the student [52]. Today, KD is an active area of research, with ongoing innovations that aim to enhance its effectiveness and expand its applicability. These include cross-modal distillation, which transfers knowledge between different types of data (for example, video-to-text), and self-distillation, in which a model is iteratively trained on its own softened outputs to refine its capabilities [53].

2.2 Basic Principle and Mechanism

This section provides the mathematical background of KD [54]. Let us denote the output logits (pre-softmax activations) of the teacher model as $z^{T}$ and those of the student as $z^{S}$. The softmax function applied to these logits is given by:

$$\sigma(z_{i},T)=\frac{e^{z_{i}/T}}{\sum_{j}e^{z_{j}/T}}, \tag{1}$$

where $i$ indexes the output classes, and $T$ is the temperature parameter that controls the softness of the probability distribution. A higher value of $T$ produces a softer probability distribution [55]. Note that the superscript in $z^{T}$ labels the teacher and is unrelated to the temperature $T$.
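The effect of the temperature in Eq. (1) can be illustrated with a short sketch. It uses only the Python standard library; the function name and the example logits are hypothetical and chosen purely for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Eq. (1): softmax of the logits after dividing them by a temperature T > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtracting the maximum improves numerical stability and leaves the result unchanged.
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three classes.
logits = [4.0, 2.0, 0.5]
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=4.0)

print("T=1:", [round(p, 3) for p in sharp])
print("T=4:", [round(p, 3) for p in soft])
```

Raising the temperature keeps the ranking of the classes but moves probability mass from the top class toward the others, which is what exposes the relative similarity of the non-target classes to the student.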

2.2.1 Objective Function

The training of the student network involves minimizing a loss function that typically comprises two terms: the distillation loss and the traditional hard target loss [56].

  • Distillation Loss: This loss measures the difference between the softened outputs of the teacher and the student, encouraging the student to mimic the teacher’s generalized behavior. It is often computed using the Kullback-Leibler ($KL$) divergence [57]:

    $$L_{\text{KD}}=T^{2}\cdot KL\big(\sigma(z^{T},T)\,\|\,\sigma(z^{S},T)\big) \tag{2}$$

    The factor of $T^{2}$ compensates for the fact that the gradients produced by the soft targets scale as $1/T^{2}$, so that the relative contribution of the distillation term stays roughly unchanged when the temperature is varied [57].

  • Hard Target Loss: This is a standard loss, such as cross-entropy (CE), used in training neural networks, calculated between the student’s output (at $T=1$) and the true labels [58]:

    $$L_{\text{CE}}=CE\big(y,\sigma(z^{S},1)\big) \tag{3}$$

    where $y$ denotes the true labels.
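The two terms in Eqs. (2) and (3) can be sketched for a single sample as follows. This is a minimal illustration in plain Python, assuming a one-hot true label given as a class index; the function names and the example logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, as in Eq. (1)."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature):
    """Eq. (2): T^2 * KL(teacher soft targets || student soft predictions)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


def hard_target_loss(true_class, student_logits):
    """Eq. (3): cross-entropy between a one-hot label and the student output at T = 1."""
    q = softmax(student_logits, 1.0)
    return -math.log(q[true_class])


# Hypothetical logits for one sample whose true class is index 0.
teacher_logits = [5.0, 2.0, 0.5]
student_logits = [2.5, 2.0, 1.0]

kd = distillation_loss(teacher_logits, student_logits, temperature=4.0)
ce = hard_target_loss(0, student_logits)
print("L_KD:", kd, "L_CE:", ce)
```

In a training framework, these quantities would be averaged over a mini-batch and differentiated with respect to the student parameters only, with the teacher kept fixed.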

2.2.2 Overall Loss

The total loss function used to train the student model is a weighted sum of the distillation and hard target losses:

$$L=\alpha L_{\text{CE}}+(1-\alpha)L_{\text{KD}} \tag{4}$$

where $\alpha$ is a hyperparameter that balances the two loss components. By optimizing this loss, the student learns not only the explicit knowledge represented by the class labels but also the implicit, richer information embedded in the teacher’s output distribution, thus achieving better generalization from a more compact model [59].

Fig. 3 summarizes the main steps of applying KD in RS applications. Fig. 4 illustrates the architecture of a KD framework based on YOLOv8, designed for precision agriculture applications such as weed recognition and variable rate spraying. In this framework, the YOLOv8l model, which has the highest recognition accuracy, was chosen as the teacher network, while the YOLOv8n model, which has the lowest recognition accuracy and the smallest model size, was selected as the student network. The resulting KD model, named YOLOv8n-DT, is tailored for rice field weed recognition and comprises three main components: the teacher network, the student network, and the distillation loss function module, which performs both target and feature distillation.

Figure 3: Principal steps of applying KD in RS applications.
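The weighted objective in Eq. (4) can be sketched as a small helper that combines the two loss values once they have been computed. The function name and the numeric loss values are hypothetical, and restricting $\alpha$ to the interval [0, 1] is an assumption made here so that the two weights form a convex combination.

```python
def total_student_loss(l_ce, l_kd, alpha):
    """Eq. (4): alpha * L_CE + (1 - alpha) * L_KD."""
    # Assumption: alpha lies in [0, 1], so the two weights sum to one.
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * l_ce + (1.0 - alpha) * l_kd


# Hypothetical loss values, chosen only to illustrate the weighting.
l_ce, l_kd = 0.40, 0.10

for alpha in (0.0, 0.3, 1.0):
    print("alpha =", alpha, "-> L =", total_student_loss(l_ce, l_kd, alpha))
```

At the extremes, $\alpha = 1$ reduces the objective to ordinary supervised training on the hard labels, while $\alpha = 0$ trains the student on the teacher’s soft targets alone.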

Figure 4: The YOLOv8n-DT network architecture is structured into three primary components: the teacher network, the student network, and the distillation loss function module. This architecture incorporates both feature loss and logit loss within the distillation process to transfer knowledge from the teacher to the student network, thereby enhancing the student’s performance while maintaining efficiency.

2.3 Benefits of KD

KD offers several compelling advantages that make it an attractive technique in the field of ML, particularly when deploying models in resource-constrained environments.

2.3.1 Model Compression

One of the primary benefits of KD is model compression. Traditional DL models often require substantial computational resources due to their depth and complexity, which limits their deployment on devices with restricted hardware capabilities such as mobile phones, IoT devices, and embedded systems. KD addresses this challenge by enabling the training of smaller, lighter models (students) that mimic the behavior of larger, more complex models (teachers). This process involves transferring the intricate knowledge and insights learned by the teacher model into a more compact form within the student model. The student thereby learns to approximate the function of the teacher but with fewer parameters and lower computational demands. This compression not only reduces the size of the model but also lessens the energy consumption and heat production, which are critical factors for battery-powered devices.

2.3.2 Improved Efficiency

Efficiency in model training and inference is another significant advantage of KD. By distilling a cumbersome model into a smaller one, KD effectively reduces the time and computational power needed for training and deploying AI systems. This improved efficiency is particularly beneficial for applications requiring real-time data processing, such as autonomous driving and real-time surveillance. Smaller models also allow for more frequent updates and easier maintenance, which is crucial for systems that need to adapt to changing conditions or data streams.

2.3.3 Enhanced Performance on Smaller Models

KD not only compresses models but often also improves the performance of the resulting small models. Smaller neural networks are typically prone to underfitting and may not capture the complex patterns in large datasets as effectively as their larger counterparts. However, when trained through the KD process, these smaller models inherit refined insights from the teacher models, including soft probabilities and inter-class relationships that are not visible through traditional hard-label training. This enriched supervision signal helps the student models perform better than if they were trained independently from scratch. Moreover, the transferred knowledge can include the teacher’s handling of edge cases and anomalies, which can improve the robustness and generalizability of the student models.

2.3.4 Broader Implications

The advantages of KD extend beyond individual model improvements. In educational settings, distillation techniques can democratize access to advanced AI capabilities by enabling more institutions to deploy high-performing AI solutions without the need for expensive infrastructure. Furthermore, in a research context, KD facilitates greater experimental flexibility and faster iteration speeds, accelerating the pace of innovation in AI.

3 RS Tasks and Public Datasets

RS has become a pivotal tool for monitoring and understanding changes in both urban and agricultural environments. RS involves the acquisition and analysis of data from satellite or airborne sensors to observe and interpret features on the Earth’s surface. The main RS tasks encompass a variety of applications that leverage spectral, spatial, and temporal information. These tasks include image classification, object detection, change detection, segmentation, and data fusion. The primary RS tasks are centered around image classification and analysis. Image classification categorizes pixels in an image into distinct classes, such as different land cover types, using methods like convolutional neural networks (CNNs) for high accuracy. Object detection identifies and locates specific objects within an image, such as vehicles or buildings, making it crucial for applications in urban planning and agriculture. Change detection focuses on identifying differences in images taken at different times, which is essential for monitoring environmental changes like deforestation. Segmentation further refines this process by partitioning an image into meaningful regions, helping extract detailed information about specific features like roads or rivers. Collectively, these tasks highlight that RS primarily involves sophisticated image classification and analysis.

Data fusion tasks arise from the need to integrate and analyze data from various sources, such as multispectral, hyperspectral, or LiDAR data, to enhance the comprehensiveness and accuracy of RS applications. This integration is vital when dealing with the complex nature of environmental features that cannot be fully captured by a single sensor type. Consequently, data fusion is an essential approach to addressing the limitations of individual datasets, providing a more holistic view of the Earth’s surface. All the aforementioned tasks are fundamental to various environmental, agricultural, and urban studies, providing essential insights for decision-making and resource management. Over the years, diverse datasets have been developed to support the advancement of instance segmentation techniques in this field, each tailored to specific challenges and applications.

SpaceNet 7 [60] and SpaceNet 4 [61] represent significant contributions to urban development analysis. SpaceNet 7 offers insights into the evolution of building footprints across 100 global locations over two years, using Planet imagery. This dataset is crucial for tracking urban expansion and infrastructure development. Conversely, SpaceNet 4 focuses on the technical challenge of detecting buildings from steep observation angles—up to 54 degrees off-nadir. This is particularly valuable in emergency response situations where quick, accurate assessments are necessary. Similarly, the Microsoft BuildingFootprints dataset [62] provides detailed building footprints across several countries, extracted from Bing imagery. This resource supports urban planning and management by offering extensive building delineations. Additionally, the xView 2 Building Damage Assessment Challenge [63] leverages high-resolution Worldview-3 imagery to assess building damage from natural disasters, a critical component of effective disaster response. In the agricultural sector, datasets like PASTIS [64] and the Agriculture-Vision Database [65, 66] are invaluable. PASTIS provides panoptic labels for over 124,000 agricultural parcels in France, captured across Sentinel-2 timeseries images. This dataset aids in the precise monitoring and management of agricultural lands. The Agriculture-Vision challenge, on the other hand, focuses on identifying field anomalies from aerial imagery across the United States, promoting enhanced agricultural practices through detailed monitoring.

For more specialized applications, datasets like RarePlanes [67], which includes both synthetic and real data for plane detection, and iSAID [68], which covers a wide range of categories from planes to bridges, are particularly noteworthy. RarePlanes is essential for developing models that differentiate between aircraft types, useful in both civilian and defense sectors. iSAID facilitates broad applications in aerial image analysis by providing extensive annotations for diverse objects. Furthermore, the introduction of SpaceNet 6: Multi-Sensor All-Weather Mapping [69] combines SAR data and optical imagery to enhance building footprint detection in challenging weather conditions, illustrating the value of multi-sensor data integration in RS. The technological advancements in datasets like Airbus Ship Detection Challenge [70], which focuses on ship detection using satellite imagery, and novel methodologies in the LPIS agricultural field boundaries dataset highlight the industry’s shift towards more sophisticated and fine-grained analysis capabilities. The PASTIS dataset [71] provides detailed panoptic labels for over 124,000 agricultural parcels across France, captured in 2,433 Sentinel-2 image timeseries. This dataset is instrumental for applications in agricultural monitoring, allowing for the differentiation of crops at the parcel level through both instance and semantic segmentation. It is particularly useful for tracking changes in agricultural land over time.

4 Taxonomy of KD Models

KD methods in RS can be categorized into several key approaches, each with unique attributes and applications. As depicted in Fig. 5, the variations may stem from differences in the data or architecture used by the teacher and student networks, resulting in Heterogeneous and Cross-modal KD approaches built on the Teacher-Student Architecture, or from the different types of knowledge distilled between the teacher and the student, resulting in Response-based, Feature-based, and Relation-based approaches. These approaches are tailored to optimize RS models by transferring knowledge from complex teacher models to more efficient student models, using all the available data in each case, thereby enhancing performance in tasks such as object detection, scene classification, and image segmentation. Further variations depend on the training methodology, the application area, the structural representation, the distillation strategy, and so on, as shown in Fig. 5 and explained in the following.

Figure 5: A Comprehensive Taxonomy of Existing KD Techniques.

4.1 Varying the Model or Input Data

4.1.1 Heterogeneous KD

Heterogeneous KD (HKD) is a method of transferring knowledge from a teacher model to a student model where the teacher and student models have significantly different architectures [72]. Traditional KD methods typically assume that the teacher and student models have similar architectures, which allows for straightforward layer-by-layer transfer of knowledge. However, in HKD, the architectures may vary greatly, posing a challenge for direct knowledge transfer.

The study in [73] presents a Generalized KD (GKD) framework for multi-source Earth Observation analysis, specifically for land cover mapping using radar and optical satellite image time series data. This approach tackles data misalignment due to atmospheric conditions or acquisition costs, using radar data consistently and treating optical data as privileged information. This makes it a case of heterogeneous distillation, where different modalities (radar and optical) are involved, requiring the student model to adapt to a less data-rich environment at test time compared to training. The authors in [74] propose using information from deep convolutional networks to guide the training of shallow Grassmannian manifold networks, addressing the need for high-performance yet small-sized networks in resource-limited scenarios. The approach bridges DL with manifold learning, fitting well within the heterogeneous category, as it involves transferring knowledge between fundamentally different architectures. Moving forward, Yang et al. [75] introduce a two-way assistant distillation method for lightweight object detection in RS. This method incorporates compression and multiscale adaptive modules to address feature disparities and background noise, utilizing a heterogeneous distillation approach by applying complex operations from larger models to enhance smaller, simpler ones.

Besides, Nabi et al. [76] propose a compound loss computed on a Transformer-based student and a CNN teacher for single-label scene classification in RS. The use of heterogeneous architectures, where a CNN and a Transformer are involved, leverages the long-range visual capabilities of the Transformer and the inductive biases of the CNN, aiming to enhance classification accuracy in complex scenes. Similarly, the research in [77] involves a teacher-ensemble learning approach using KD in cross-source content-based image retrieval for high-resolution RS images. The method combines source-shared and source-specific classifiers, constructing an effective heterogeneous ensemble of teacher models to transfer useful information to the student model.

4.1.2 Cross-Modal KD

Cross-modal KD (CMKD) refers to the process of transferring knowledge from a model trained with superior modalities (e.g., depth maps or point clouds) to another model trained with weaker modalities (e.g., RGB images) [78]. The goal is to improve the performance of the student model trained on the weaker modality by leveraging the knowledge from the teacher model trained on the superior modality. This transfer is achieved by aligning the intermediate feature representations and activation maps between the teacher and student models [78]. In CMKD, the knowledge from the teacher model is used as an additional supervision signal to guide the training of the student model, enhancing its learning process and performance.
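The idea of using the teacher as an additional supervision signal can be sketched as follows. This is a minimal illustration rather than the formulation of [78]: it assumes that the teacher (superior modality) and the student (weaker modality) each produce a feature vector of the same length for the same scene, and that alignment is measured as the mean squared difference between L2-normalized features. The names `alignment_loss`, `cmkd_objective`, and `weight` are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a feature vector to unit Euclidean norm (unchanged if all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def alignment_loss(student_feat, teacher_feat):
    """Mean squared difference between L2-normalized feature vectors."""
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature vectors must have the same length")
    s = l2_normalize(student_feat)
    t = l2_normalize(teacher_feat)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)

def cmkd_objective(task_loss, student_feat, teacher_feat, weight=1.0):
    """Student objective: its own task loss plus a weighted alignment term.

    teacher_feat comes from a frozen teacher fed the superior modality;
    student_feat comes from the student fed the weaker modality.
    """
    return task_loss + weight * alignment_loss(student_feat, teacher_feat)
```

Normalizing before comparison makes the alignment term insensitive to the overall feature scale, which may differ considerably between sensors; whether this is desirable depends on the task.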

Expanding on this concept, Geng et al. [79] propose a topological space network for road extraction, where a denser teacher network focused on topological feature extraction guides a lighter student network. This distillation process transfers knowledge about complex road topology from a heavy network, illustrating a clear case of cross-modal architecture distillation by integrating high-dimensional topological features into a simplified network. Similarly, Xiong et al. [80] introduce a discriminative distillation network for cross-source Content-Based RS Image Retrieval (CBRSIR), addressing the challenge of harmonizing features between multispectral and panchromatic images, thereby further exemplifying cross-modal architecture by handling variations between different types of RS data sources. Additionally, Liu et al. [81] propose a cross-modal KD framework designed to improve multispectral scene classification by transferring knowledge from teacher models pre-trained on RGB images to a student model processing multispectral images. This approach highlights the adaptability of CMKD by addressing the differences between modalities and enhancing the student’s performance, particularly in scenarios with limited samples.

Furthermore, Pande et al. [82] contribute to the field with an adversarial training-driven hallucination architecture for modality distillation in RS image classification, focusing on learning discriminative feature representations from multiple sensor modalities, even in the presence of missing data during the model inference phase. This work aligns closely with cross-modal architectures as it effectively distills features across varying sensor modalities, enhancing model robustness. Lastly, Liu et al. [83] present a universal Super-Resolution-Assisted Learning (SRAL) framework aimed at improving the performance and efficiency of salient object detection in RS images. By incorporating super-resolution techniques into a multitask learning framework, this approach distills domain knowledge from the super-resolution task to significantly boost object detection performance, further showcasing the potential of cross-modal knowledge transfer in enhancing model accuracy and efficiency.

4.2 Varying the Type of Transferred Knowledge

4.2.1 Response-Based (Soft Targets Distillation)

Response-based KD focuses on the knowledge extracted from the final layer of the teacher model. It aims to align the final predictions between the teacher and the student models. The primary goal is outcome-driven learning, which involves distilling the class probability distribution via a softened softmax function, known as ’soft labels.’ This method guides the student model by matching the output distributions of the teacher and student models using various distance functions such as Kullback-Leibler divergence, mean squared error, or Pearson correlation coefficient [84].
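The three distance functions mentioned above can be sketched for a single sample as follows. This is a minimal illustration: the soft labels are obtained with a temperature-scaled softmax, the Pearson-based distance is taken as one minus the correlation coefficient (an assumed convention), and all function names are hypothetical.

```python
import math

def soft_labels(logits, temperature):
    """Temperature-softened class probabilities ('soft labels')."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) between teacher distribution p and student distribution q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_squared_error(p, q):
    """Mean squared difference between the two output vectors."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / len(p)

def pearson_distance(p, q):
    """One minus the Pearson correlation between the two output vectors."""
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    var_p = sum((a - mp) ** 2 for a in p)
    var_q = sum((b - mq) ** 2 for b in q)
    if var_p == 0 or var_q == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return 1.0 - cov / math.sqrt(var_p * var_q)
```

A higher temperature flattens both distributions, so that small teacher probabilities assigned to non-target classes contribute more to the chosen distance.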

The study in [85] introduces a KD framework applied to RS scene classification. By using the high-temperature softmax outputs from a large, deep teacher model to train a smaller, shallow student model, the study shows how KD can improve the performance of less complex models on multiple public datasets, increasing accuracy significantly even on smaller and unbalanced datasets. This approach directly employs the response-based distillation technique by leveraging the teacher’s softened output probabilities to enhance the student’s learning process. Moving on, Zhao et al. [86] introduce a pairwise similarity KD method that reduces the complexity of CNN models in RS image scene classification, maintaining accuracy while using fewer computational resources. This method focuses on distilling discriminative information between sample pairs.

4.2.2 Feature-Based (Intermediate Representations)

Feature-based KD addresses the limitation of response-based KD by providing supervision at intermediate layers of the network. This method focuses on transferring intermediate feature representations, such as feature maps, attention mechanisms, activation boundaries, and probability distributions, from the teacher to the student model [84]. The goal is to ensure that the student model learns more meaningful semantic information throughout its hidden layers. The distillation loss in feature-based KD measures the similarity between the transformed intermediate features of the teacher and student models using various distance functions. For example, Chen et al. [87] introduce a new architecture for incremental object detection in RS, which utilizes a Feature Pyramid Network (FPN) to handle objects of various sizes and orientations. Importantly, the study incorporates KD to maintain previously learned information during incremental learning, applying it to outputs from different layers of the FPN. This approach is particularly aligned with feature-based KD, as it focuses on preserving and transferring detailed feature representations across various scales and model iterations, which is essential for detection tasks in dynamically changing datasets. Similarly, to address the efficiency challenge in lightweight object detectors, Yang et al. [88] propose a training method called adaptive reinforcement supervision distillation (ARSD). This method enhances lightweight models by using a multiscale core features imitation (MCFI) module and a strict supervision regression distillation (SSRD) module, improving feature selection and regression accuracy during training. Furthermore, Li et al. [89] introduce a dual KD model that incorporates dual attention and spatial structure modules to enhance the local feature extraction and high-level semantic representation abilities of lightweight CNN models for RS image scene classification. 
This approach, where knowledge in the teacher network about dual attention and spatial structure is transferred to the student network, directly targets the transfer of intermediate representations. In the same vein, Wang et al. [90] propose a fine-grained object recognition method for high-resolution RS images, which utilizes two stages of KD, emphasizing efficient fine-grained object recognition through feature learning and category correction.

Moreover, Shin et al. [91] focus on transferring detailed spectral feature representations from a teacher model to a student model, allowing the latter to perform complex scene classification tasks traditionally dependent on multispectral data inputs, using simpler RGB inputs. The approach involves intricate feature imitation and retention of critical spectral information, spanning across different data modalities, making it a clear example of feature-based KD. Similarly, Chi et al. [92] propose a self-supervised learning method with KD for hyperspectral image classification, focusing on generating soft labels for unlabeled samples by considering spatial and spectral distances, fitting well under feature-based distillation as it involves creating and transferring complex spectral features. Additionally, Jiang et al. [93] introduce the Deep Distillation Recursive Network (DDRN) for satellite image super-resolution, utilizing ultra-dense residual blocks and a multi-scale purification unit to enhance feature sharing and compensation, particularly focusing on high-frequency components in image super-resolution. Yuan et al. [94] also contribute by proposing a CNN framework for building change detection that uses self-attention KD strategies to refine features for detecting changes in building regions. This approach integrates globally changed information, emphasizing intermediate feature enhancement and integration for improved accuracy in change detection. Finally, Liu et al. [95] introduce ZoomInNet, a cross-scale KD method designed to improve the detection of small objects in drone-based imagery, which often presents complex and dynamic backgrounds. 
Utilizing a feature pyramid network, the method trains teacher and student networks with differently scaled images to enhance feature harmonization across scales, incorporating layer adaptation, feature level alignment, and an adaptive key distillation point algorithm to refine and distill essential features, showcasing significant advancements in the precision of object detection.
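The generic feature-based distillation loss described at the start of this subsection, in which transformed intermediate features of the student are compared with those of the teacher, can be sketched as follows. This is a minimal illustration that does not reproduce any of the cited methods: it assumes a linear adapter that maps the student's features to the teacher's dimensionality and a mean squared error as the distance function. The names `project`, `adapter`, and `feature_distillation_loss` are hypothetical.

```python
def project(features, adapter):
    """Apply a linear adapter to a feature vector.

    adapter is a list of rows, one row per teacher channel, each row holding
    one weight per student channel.
    """
    return [sum(w * f for w, f in zip(row, features)) for row in adapter]

def feature_distillation_loss(student_feat, teacher_feat, adapter):
    """Mean squared error between adapted student features and teacher features."""
    projected = project(student_feat, adapter)
    if len(projected) != len(teacher_feat):
        raise ValueError("adapter output must match the teacher feature length")
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_feat)]
    return sum(squared) / len(squared)

# Example: a 2-channel student feature mapped to a 3-channel teacher space.
adapter = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0]]
loss = feature_distillation_loss([1.0, 2.0], [1.0, 2.0, 5.0], adapter)
```

In practice the adapter would be trained jointly with the student and discarded at inference time, so it adds no cost to the deployed model.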

4.2.3 Relation-Based (Learning Relationships Between Different Data Layers)

Relation-based KD explores the relationships between different data samples or across different layers within the neural network. Unlike response-based and feature-based KD, which typically handle individual samples, relation-based KD captures cross-sample or cross-layer relationships as meaningful knowledge [84]. This method constructs relational graphs to model dependencies and similarities between instances or layers and uses similarity metrics and distance functions to measure these relationships. The goal is to transfer structured knowledge that encapsulates higher-order dependencies and interactions within the dataset.
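A cross-sample variant of this idea can be sketched as follows. This is a minimal illustration under stated assumptions: the relational structure is the set of pairwise Euclidean distances among the embeddings of a batch, each set is normalized by its mean so that teacher and student embeddings of different dimensionality and scale remain comparable, and the two structures are compared with a squared error. The function names are hypothetical, and the sketch does not correspond to any specific method cited in this survey.

```python
import math
from itertools import combinations

def pairwise_distances(embeddings):
    """Euclidean distance for every unordered pair of samples in the batch."""
    return [math.dist(a, b) for a, b in combinations(embeddings, 2)]

def mean_normalized(distances):
    """Divide distances by their mean so that only relative structure remains."""
    mean = sum(distances) / len(distances)
    return [d / mean for d in distances] if mean > 0 else list(distances)

def relation_loss(student_emb, teacher_emb):
    """Mean squared difference between normalized pairwise-distance structures."""
    if len(student_emb) != len(teacher_emb) or len(student_emb) < 2:
        raise ValueError("need the same batch of at least two samples")
    s = mean_normalized(pairwise_distances(student_emb))
    t = mean_normalized(pairwise_distances(teacher_emb))
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)
```

Because only relations between samples are matched, the student is free to use a different embedding dimension than the teacher, provided that it preserves the relative geometry of the batch.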

Chen et al. [96] develop consistency- and dependence-guided KD methods for object detection in RS images. They introduce modules that focus on extracting and transferring discriminative spatial locations and channels, as well as establishing the consistency and dependence of features between the teacher and student models. This approach utilizes relation-based distillation by focusing on the inter-layer and inter-feature relationships to guide the student model’s learning process. Moving on, Li et al. [97] introduce an instance-aware distillation method, which combines feature-based and relation-based distillation techniques. The method enhances the student model’s performance by focusing on instance-related foreground information and constructing relationships between different instances to improve detection accuracy in complex RS images. Zhao et al. [98] propose a self-supervised KD network (SSKDNet) that uses feature maps of the backbone as supervision signals and transfers the "dark knowledge" through KD. This method focuses on enhancing the discriminative feature extraction capabilities by learning the relationships between different data layers in a self-supervised setting.

Dong et al. [99] present a cross-model KD framework, distilling segmenters from CNNs and transformers, which uses a channel-weighted attention-guided feature distillation and a target–nontarget KD module to guide the student model in learning complex representations and decision boundaries. This study distinctly focuses on relation-based distillation by leveraging the interdependencies of features and classification decisions between different network architectures. On the other hand, Zhou et al. [100] introduce the Multi-level Semantic Transfer Network (MSTNet), a KD framework designed for dense prediction of RS images. This network utilizes a Multi-level Semantic Knowledge Alignment (MSKA) framework to distill semantic information from a complex teacher model to a more compact student model. The MSKA framework emphasizes cross-layer semantic alignment, dynamic semantic aggregation, and softening learning to adaptively transfer knowledge and optimize the learning of semantic information, thus addressing the complexities of deploying models in practical scenarios.

4.3 Varying Distillation Target

4.3.1 Data Distillation

Data distillation refers to techniques that aim to synthesize small, high-fidelity data summaries which capture the most important knowledge from a given dataset [101]. These distilled summaries are optimized to serve as effective substitutes for the original dataset in various data-usage applications such as model training, inference, and architecture search. The goal is to create a concise representation of the data that maintains its critical characteristics, allowing for faster and more efficient model training and evaluation [101].

Building on this concept, Zhang et al. [102] introduce a novel noisy label distillation method within an end-to-end teacher-student framework, which distills knowledge from labels across various noise levels. This approach exemplifies data distillation by effectively utilizing knowledge from noisy data to improve classification performance in RS image scene classification. Extending the application of data distillation, Zhao et al. [86] propose a pair-wise similarity KD method for RS image scene classification. By distilling discriminative information from a cumbersome model to a compact model, this study aims to maintain high accuracy while reducing model complexity, demonstrating another facet of data distillation. Furthermore, Yue et al. [103] contribute to this field with a self-supervised learning method that incorporates adaptive distillation for hyperspectral image classification. Their approach, which focuses on generating adaptive soft labels based on spatial-spectral similarity, underscores the importance of utilizing extensive unlabeled data in the data distillation process.

4.3.2 Model Distillation

Model distillation refers to the process of replacing a complex ML model with a simpler model that approximates the original model’s performance [104]. This technique is used to improve computational efficiency by distilling large or ensemble models into smaller, more manageable models that maintain similar accuracy. The primary goal is to reduce the computational cost associated with deploying large models while preserving their predictive capabilities [104]. Model distillation also aids in model interpretability by converting “black-box” models, such as neural networks, into more transparent forms.

In the context of model distillation for RS applications, a variety of approaches have been developed to enhance the performance and efficiency of lightweight models. Zhang et al. [105] introduce a dynamic KD framework that enables CNN models to be lightweight while maintaining high detection accuracy, with an emphasis on selective learning through a dynamic instance selection distillation module. Building on the concept of model distillation, Yang et al. [106] develop a lightweight semantic segmentation network that combines KD with a multiscale pyramidal pooling module and attention mechanisms, resulting in a pruned model that retains high accuracy. Similarly, Wang et al. [107] propose a change detection method that integrates prototypical contrastive distillation and channel-spatial-normalized distillation, allowing the student model to learn complex feature distributions from the teacher, thereby fitting into the model distillation framework.

Further advancing the field, Chen et al. [108] propose a multi-teacher collaborative distillation approach that uses adaptive weight and feature knowledge exchange to enhance the robustness of student models, while Gu et al. [109] introduce a Context-aware Dense Feature Distillation (CDFD) strategy for CubeSat-based RS object detection, integrating multiple teacher networks to optimize a lightweight detector. Chai et al. [110] contribute to the model distillation category with their Bidirectional Self-Attention Distillation (Bi-SAD) approach, aimed at enhancing cloud detection models by enabling compact models to learn detailed textural and semantic information.

Addressing the challenge of few-shot learning, Liu et al. [111] present a ranking-preserving KD method that improves the generalization capabilities of student models in RS scene classification. Similarly, Wang et al. [112] explore the enhancement of lightweight models through a Phase-shift encoded KD method (PseKD) that improves object orientation prediction. In a broader application, Chen et al. [113] propose a semi-supervised KD framework for global-scale urban object mapping, emphasizing the handling of urban diversity and large-scale sample growth.

Complementing these efforts, Zhao et al. [114] propose a weakly correlated distillation learning framework for RS object recognition with limited samples, leveraging large-scale natural image datasets to enhance small-scale RS datasets. Lin et al. [115] address the issue of denoising by presenting a lightweight model that uses KD to efficiently extract spatial and spectral features while maintaining computational efficiency. Yu et al. [116] focus on incremental learning, introducing a dual KD method to mitigate catastrophic forgetting, which aligns with the incremental learning approach proposed by Xu et al. [117] and Xu et al. [118], who use KD to enhance multimodal learning and hyperspectral image classification, respectively.

Lastly, Zhou et al. [119] introduce a graph semantic guided network (GSGNet) for optical RS scene analysis, utilizing knowledge refinement to maintain high inference speed and contextual inference capability. Zhao et al. [120] propose a target detection model distillation framework that uses feature transition and label registration to improve the learning ability of lightweight networks in RS imagery, further contributing to the body of work on model distillation.

4.3.3 Feature Distillation

Feature distillation refers to a method in which the student network learns to mimic the hidden feature values of a teacher network [121]. This process involves transferring the intermediate representations (features) learned by the teacher network to the student network. Unlike traditional KD that focuses on the output probabilities (logits), feature distillation emphasizes the transfer of internal activations or feature maps. The primary goal is to improve the student network’s performance by leveraging the knowledge encapsulated in the teacher’s feature representations [121].

Building upon this concept, Zhou et al. [122] propose a lightweight student network framework for semantic segmentation of high-resolution RS images. By employing a graph attention guidance network, they distill knowledge from a large teacher network to optimize image features, thereby enhancing segmentation accuracy. This method aligns with feature distillation, where the objective is to boost the student’s feature representation capabilities to closely match those of the teacher. Similarly, Zhang et al. [123] introduce a few-shot classification method for RS scene classification, which also falls under the feature distillation category. This approach utilizes a novel two-branch network and incorporates self-KD during training to generate powerful representations, prevent overfitting, and enhance overall performance.

In parallel, Hu et al. [124] contribute to the field with a variational self-distillation network designed for RS scene classification. This method hierarchically distills class entanglement information from deep to shallow layers, further illustrating the application of feature distillation by refining and transferring feature information across different network layers. Expanding on these ideas, Xing et al. [125] present a collaborative consistent KD method aimed at improving classification accuracy for RS image scenes on embedded devices. Their approach emphasizes feature distillation across multiple network branches, focusing on reducing parameter redundancy and enhancing model efficiency, thus reinforcing the relevance of feature distillation in RS applications.

Table 2: Summary of Studies on KD in RS
Ref. | Model Used | Main Contribution | Database | Task/Appl. | Best Performance Value | Limitation
[85] | Small and shallow student models | Introduced a KD framework for scene classification. | AID, UCMerced, NWPU-RESISC, EuroSAT | Scene Classification | Increased accuracy by 1% to 5% | Performance on small and unbalanced datasets
[88] | Lightweight object detector | Developed ARSD to enhance detection capability through feature and regression distillation. | DOTA, DIOR, NWPU VHR | Object Detection | Outperforms SOTA methods | Noise in training due to complicated backgrounds
[96] | Consistency and dependence-guided model (CDKD) | Improved object detection with structured discriminative modules and consistency techniques. | RSOD | Object Detection | 92% mean average precision | High model volume and computation in RS images
[87] | Incremental learning model with FPN | Employed feature pyramid and KD for incremental learning in object detection. | Various RS datasets | Object Detection | Comparative performance to SOTA | Challenges with object size diversity and directions
[105] | Dynamic KD (DKD) | Developed a dynamic KD framework to improve model performance on edge devices. | DOTA, NWPU VHR-10 | Object Detection | SOTA performance | Complex model deployment on low-computation devices
[122] | GAGNet with KD | Utilized graph attention and dense fusion for semantic segmentation. | Potsdam, Vaihingen | Semantic Segmentation | Excellent performance on datasets | Resource-intensive model deployment
[123] | RS-SSKD for few-shot classification | Introduced a two-branch network with self-KD for few-shot classification. | NWPU-RESISC45, RSD46-WHU | Scene Classification | Surpasses current SOTA | Requires high model adaptability to new data
[97] | Instance-aware distillation (InsDist) | Combined feature-based and relation-based KD for object detection. | DIOR, DOTA | Object Detection | Noticeable gains over other methods | Integration complexity with existing detectors
[124] | Variational self-distillation network (VSDNet) | Implemented a VKT module for robust and end-to-end optimization. | Multiple RS datasets | Scene Classification | Significant improvement over backbones | Managing uncertainty and perturbation in images
[125] | Collaborative consistent KD (CKD) | Designed a KD method for high classification accuracy on embedded devices. | SIRI-WHU, NWPU-RESISC45 | Scene Classification | 0.943 and 0.916 average accuracy | Redundancy and parameter management on devices
[89] | DKD Model with DA and SS | Dual KD with dual attention and spatial structure modules | AID, NWPU-45 | RSI Scene Classification | Improved accuracy by 7.57% and 7.28% | Model complexity and computational cost
[83] | SRAL Framework | Super-resolution-assisted learning for salient object detection | Multiple datasets | Object Detection in RSIs | Superior to 20+ algorithms | High computational cost of high-resolution processing
[90] | Oriented R-CNN, CF-ORNet | Two-stage fine-grained object recognition with KD | VEDAI, HRSC2016 | Object Recognition in HR-RSIs | Competitive performance | Limited by size of geospatial objects
[98] | SSKDNet | Self-supervised KD network for feature learning | Multiple datasets | Scene Classification | Effective feature extraction | Difficulty in training self-supervised networks
[106] | KD-MSANet | Lightweight semantic segmentation with multiscale pooling and attention | Vaihingen, Potsdam | Semantic Segmentation | Accuracy near 99.30% of teacher model | Reduced model size might impact some complex scene parsing
[107] | CDKD Method | Change detection with prototypical contrastive and channel-spatial-normalized distillation | Public CD datasets | Change Detection | Comparable to large models | Requires careful tuning of distillation parameters
[102] | NLD Method | Noisy label distillation for robust training on noisy datasets | UC Merced Land-use, NWPU-RESISC45, AID | Scene Classification | Outperforms directly fine-tuning methods | Performance variability with noise levels
[99] | DSCT Framework | Cross-model KD from CNNs and transformers for semantic segmentation | ISPRS Potsdam, Vaihingen, GID, LoveDA | Semantic Segmentation | Outperforms state-of-the-art KD methods | Complexity of integrating CNNs and transformers
[91] | MS2RGB-KD | MS-to-RGB KD for scene classification using RGB images | EuroSAT | Scene Classification | Effective compared to KD baselines | Dependent on quality of MS teacher model
[100] | MSTNet with MSKA | Dense prediction using multilevel semantic transfer and KD | Vaihingen, Potsdam | Dense Prediction in RSIs | Excellent performance with reduced parameters | Balancing between model complexity and performance

4.4 Varying the Structural Relationship of Network Layers

4.4.1 Layer-to-Layer Distillation

Layer-to-layer distillation refers to the process where the teacher model’s intermediate layers directly guide the corresponding layers of the student model. This method ensures that, at each stage of its depth, the student model learns feature representations similar to those of the teacher model [126, 127].

Direct Mapping: In this approach, each layer of the teacher model is aligned with the corresponding layer in the student model. The outputs of each intermediate layer in the teacher model are used as targets for the corresponding layer in the student model. This direct mapping can help the student model learn hierarchical features similar to those learned by the teacher model [128].

Feature Representation: By mimicking the intermediate representations of the teacher, the student model can capture complex features and patterns, which might be difficult to learn solely from the final output. This method is particularly useful when the student model has a similar or reduced architecture compared to the teacher.

Loss Function: Often, additional loss terms are introduced to minimize the difference between the teacher’s and student’s intermediate layer outputs. This can include mean squared error (MSE) or other similarity measures.

Suppose a deep CNN is used as the teacher model with layers $T_{1}, T_{2}, T_{3}, \ldots, T_{n}$. The student model has corresponding layers $S_{1}, S_{2}, S_{3}, \ldots, S_{n}$. During training, the output of $T_{1}$ will guide $S_{1}$, $T_{2}$ will guide $S_{2}$, and so on, ensuring that each student layer learns to mimic the feature maps of the corresponding teacher layer.
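The direct mapping between $T_{i}$ and $S_{i}$ can be sketched as a sum of per-layer MSE terms. The snippet below is a minimal NumPy illustration, assuming that paired layers already produce feature maps of identical shape (otherwise an adapter would be needed) and that each pair receives a scalar weight; the function name and the weighting scheme are hypothetical.

```python
import numpy as np

def layer_to_layer_loss(teacher_feats, student_feats, layer_weights=None):
    """Sum of weighted MSE terms between paired layers (T_i guides S_i).

    teacher_feats, student_feats: lists of arrays, one per guided layer.
    layer_weights: optional list of scalars, one per layer (defaults to 1.0).
    """
    if len(teacher_feats) != len(student_feats):
        raise ValueError("direct mapping needs the same number of guided layers")
    if layer_weights is None:
        layer_weights = [1.0] * len(teacher_feats)
    total = 0.0
    for t, s, w in zip(teacher_feats, student_feats, layer_weights):
        if t.shape != s.shape:
            raise ValueError("paired layers must have matching shapes")
        total += w * float(np.mean((t - s) ** 2))
    return total

# Hypothetical three-stage feature pyramids with shrinking spatial size.
teacher_feats = [np.ones((2, 4, 4)), np.ones((4, 2, 2)), np.ones((8, 1, 1))]
student_feats = [t + 0.1 for t in teacher_feats]  # student is off by 0.1 everywhere

loss = layer_to_layer_loss(teacher_feats, student_feats)
print(f"layer-to-layer loss: {loss:.4f}")
```

Because every pair contributes its own term, the per-layer weights offer a simple way to emphasize shallow or deep stages, although how they are set is a design choice that the text above does not prescribe.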

4.4.2 Cross-Layer Distillation

Cross-layer distillation refers to the process where the teacher and student models do not have a direct correspondence between layers. Instead, the knowledge transfer happens between non-matching layers, for example, higher layers of the teacher model guiding lower layers of the student model or vice versa.

Non-Matching Layers: In this approach, there is no strict one-to-one correspondence between the layers of the teacher and the student. The knowledge from higher (more abstract) layers of the teacher model can be distilled into lower (more detailed) layers of the student model, allowing for flexible guidance. Chen et al. [129] propose Semantic Calibration for Cross-layer Knowledge Distillation (SemCKD), which automatically assigns target layers from a teacher model to each student layer using an attention mechanism. This method allows student layers to distill knowledge from multiple teacher layers rather than following a fixed, one-to-one correspondence. Building on this concept, Wang et al. [130] further refine the idea of non-matching layers by using a learned attention distribution to assign appropriate teacher layers to student layers, thereby enhancing cross-layer supervision and subsequently improving student model performance. In addition, Nath et al. [131] introduce Robust Neural Architecture Search by Cross-Layer Knowledge Distillation (RNAS-CL), which searches for the best teacher layer to supervise each student layer, thus allowing for non-matching layer associations that enhance robustness in neural architectures. Furthermore, Zhao et al. [132] develop Cross-Architecture Knowledge Distillation (CAKD), where non-matching layers are utilized to transfer knowledge from a Transformer-based teacher model to a CNN-based student model, involving the alignment of pixel-wise spatial information across different architectures and expanding the applicability of non-matching layers in cross-architecture scenarios.

Layer Interaction: This method leverages the hierarchical nature of neural networks, where different layers capture different levels of abstraction. By using high-level features from the teacher to guide the student’s learning process, the student can gain a richer understanding of the data. Yao et al. [133] propose Dense Cross-layer Mutual-distillation (DCM), which involves layer interaction by integrating auxiliary classifiers and bidirectional knowledge distillation operations across different layers of the teacher and student models, thereby enhancing knowledge representation and performance. Building on this concept, Su et al. [134] present Deep Cross-layer Collaborative Learning (DCCL), focusing on layer interaction through intermediate cross-layer supervision among peer student models, which integrates features from different layers to enhance representation and learning outcomes. Similarly, Zhu et al. [135] introduce Cross-layer Fusion for Knowledge Distillation (CFKD), which aggregates features from both teacher and student models, allowing for rich layer interactions that further enhance the student model’s learning process. In a related effort, Hu et al. [136] propose an online knowledge distillation method with layer-level feature fusion modules that connect sub-networks, thereby facilitating mutual learning through enhanced layer interaction among student networks. Expanding on the concept, Nguyen et al. [137] develop CLAFusion, a framework that employs cross-layer alignment for fusing neural networks with different numbers of layers, leveraging layer interaction to improve model accuracy and efficiency. Finally, Zhang et al. [138] propose Patch Aware Knowledge Distillation (PAKD), which emphasizes cross-layer patch alignment and interaction within and across instances, guiding the student’s learning of multi-level information and further reinforcing the importance of layer interaction in knowledge distillation.

Hierarchical Guidance: Cross-layer distillation can help in scenarios where the student model is significantly smaller or has a different architecture compared to the teacher. It allows the student to learn abstract representations earlier in its layers. Imagine a teacher model with layers $T_{1}, T_{2}, T_{3}, \ldots, T_{n}$, and a student model with layers $S_{1}, S_{2}, S_{3}, \ldots, S_{m}$. In cross-layer distillation, $T_{n}$ (the final layer of the teacher) might guide $S_{3}$ (a middle layer of the student), $T_{3}$ might guide $S_{1}$, and so on, depending on the distillation strategy and the specific architecture of the models. In this regard, Zou et al. [139] develop CoCo DistillNet, which utilizes cross-layer correlations to guide the student model in learning abstract representations from a teacher model in the context of pathological image segmentation, thereby enhancing the student model’s performance in resource-constrained environments. Building on this concept, Zou et al. [140] propose Graph Flow Distillation, a method that transfers cross-layer variations from a large teacher network to a compact student network in medical image segmentation, enabling the student model to learn from both high-level and low-level abstractions of the teacher. In a similar vein, Zhai et al. [141] introduce a method that uses the deepest feature maps from the teacher to guide the shallow layers of the student model, providing hierarchical guidance that effectively balances performance and efficiency. Furthermore, Guo et al. [142] propose Alignahead++, an online knowledge distillation framework for GNNs that transfers structure and feature information across layers, facilitating hierarchical guidance and significantly improving performance on edge devices. Together, these studies underscore the importance of hierarchical guidance in enhancing the efficiency and effectiveness of knowledge distillation across various architectures.
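To make the notion of non-matching layer assignment more concrete, the following NumPy sketch weights every teacher-student layer pair by a softmax attention score, loosely in the spirit of the attention-based assignment described for SemCKD [129]. It is a simplified illustration rather than the authors' implementation: each layer is assumed to be summarized by a pooled feature vector of a common dimension, and the attention scores are plain scaled dot products instead of learned query-key projections.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def cross_layer_loss(teacher_feats, student_feats):
    """Attention-weighted distillation over all teacher-student layer pairs.

    teacher_feats: (n, d) array, one pooled feature vector per teacher layer.
    student_feats: (m, d) array, one pooled feature vector per student layer.
    returns: (loss, attn), where attn[j, i] is the weight of teacher layer i
             for student layer j, and every row of attn sums to 1.
    """
    d = teacher_feats.shape[1]
    scores = student_feats @ teacher_feats.T / np.sqrt(d)          # (m, n)
    attn = softmax(scores, axis=1)
    diff = student_feats[:, None, :] - teacher_feats[None, :, :]   # (m, n, d)
    pair_mse = np.mean(diff ** 2, axis=2)                          # (m, n)
    return float(np.sum(attn * pair_mse)), attn

# Hypothetical setting: 4 teacher layers, 2 student layers, 6-dim descriptors.
rng = np.random.default_rng(0)
teacher_feats = rng.normal(size=(4, 6))
student_feats = rng.normal(size=(2, 6))

loss, attn = cross_layer_loss(teacher_feats, student_feats)
print("attention weights per student layer:\n", attn.round(3))
```

A fixed cross-layer strategy, such as the final teacher layer guiding a middle student layer, corresponds to the special case where each row of the attention matrix is one-hot.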

5 Tasks and Applications of KD in RS

5.1 Tasks

As previously described, KD has emerged as a transformative approach in RS, enabling the development of more efficient models that handle RS tasks with the same or even better performance across various applications. The main applications of KD in RS are depicted in Fig. 6.

Figure 6: Applications of KD in RS.

5.1.1 Image/Scene Classification

In the context of the classification of RS images/scenes, KD can be particularly beneficial. High-resolution satellite or hyperspectral images, which are rich in spatial and spectral information, can be computationally intensive to process online using large models. By employing KD, a large, powerful model (teacher) that has been trained on such images can pass on its learned representations and decision-making capabilities to a smaller, more efficient model (student). This allows the student model to achieve high classification accuracy while significantly reducing computational and storage requirements. Techniques such as spatial feature blurring can be incorporated to enhance the student’s learning by making the training data more challenging, which helps in better generalization and improved classification performance. Various studies have been proposed in the literature to enhance RS image classification, focusing on KD, model efficiency, feature extraction, and handling noisy or incomplete data. Table 3 provides the main features of these works. Building on this, Xu et al. [118] propose a hyperspectral image classification method based on class-incremental learning to learn new land-cover types without forgetting the old ones. This method uses a KD strategy to recall information of old classes and a channel attention mechanism to effectively utilize spatial-spectral information, demonstrating high accuracy on three hyperspectral image datasets. Similarly, Chi et al. [92] introduce a self-supervised learning method with KD for HSI classification, termed SSKD, which generates soft labels for unlabeled samples by considering spatial and spectral distances. This method significantly improves classification accuracy on three public HSI datasets. In addition, Xing et al. [125] address the challenge of using large deep neural networks on embedded devices by proposing a collaborative consistent KD (CKD) method. 
This method reduces the number of redundant parameters and improves the classification accuracy when tested on the SIRI-WHU and NWPU-RESISC45 datasets. Furthermore, Chen et al. [85] focus on scene classification using a KD framework to improve the performance of smaller and shallower network models. Their method increases the overall accuracy when tested on AID, UCMerced, NWPU-RESISC, and EuroSAT datasets. Along similar lines, Song et al. [143] present ERKT-Net, an efficient and robust knowledge transfer network designed for lightweight yet accurate CNN classifiers, demonstrating superior accuracy and compactness on three RSI datasets. Likewise, Wu et al. [1] propose the TAKD method, which reduces background disturbance and improves the accuracy of student models for RS scene classification on three benchmark datasets. Moreover, Ienco et al. [73] propose a Generalized KD (GKD) framework to manage information misalignment between training and test data, demonstrating improved classification results using radar and optical satellite image time series data. Similarly, Zhang et al. [102] address the challenge of noisy labels in RS image scene classification by proposing a noisy label distillation (NLD) method, which effectively distills knowledge from labels across a range of noise levels, achieving high accuracy on UC Merced Land-use, NWPU-RESISC45, and AID datasets.

In another approach, Zhao et al. [98] propose a self-supervised KD network (SSKDNet) that uses feature maps as supervision signals and dynamically fuses feature maps to extract discriminating features, showing excellent performance on three datasets. Furthermore, Yang et al. [75] introduce the TWA distillation method for RS object detection, reducing background information and addressing feature disparities, achieving superior performance on the LEVIR and SAR SSDD datasets. Additionally, Pande et al. [82] tackle the problem of missing modalities in RS image classification by proposing an adversarial training-driven hallucination architecture. This method shows that the student model can surpass the teacher model’s performance on HSI datasets. In a similar vein, Yu et al. [116] propose a two-stage training method for incremental learning that includes dual KD to prevent catastrophic forgetting, improving accuracy on CIFAR100 and RESISC45 datasets. Finally, Xie et al. [2] introduce an improved decoupled KD (DKD) strategy for HSI classification using a spatial feature blurring (SFB) module, achieving high overall accuracy on the Salinas dataset.

Table 3: Comparison of KD-based RS Image/Scene Classification Studies
Ref. | Model(s) Used | Dataset/Data Type | Main Contribution | Best Performance Value Achieved | Limitation
[118] | Class-incremental learning | PaviaU | KD with channel attention mechanism | 99.91% OA | Bias towards new classes
[125] | Collaborative consistent KD | SIRI-WHU, NWPU-RESISC45 | Multi-branch fused redundant feature mapping | 0.943 accuracy (SIRI-WHU) | Parameter redundancy
[92] | Self-supervised learning with KD | Three HSI datasets | Adaptive generation of soft labels | 7.09% improvement | Limited labeled samples
[85] | KD framework | AID, UCMerced, NWPU-RESISC, EuroSAT | KD training method for small and shallow models | 5% accuracy improvement (UCMerced) | Computationally expensive
[143] | ERKT-Net | Three RSI datasets | Efficient and robust KD network | 22.4% OA (NWPU45) | Slight accuracy sacrifice
[98] | SSKDNet | AID | Self-supervised KD network | 95.98% accuracy | Complex training
[82] | Adversarial training | HSI datasets | Handling missing modalities with hallucination architecture | 98.17% accuracy (Houston) | Modality dependency
[123] | RS-SSKD | NWPU-RESISC45, RSD46-WHU | Few-shot classification with CAMs and KD | 86.26% accuracy (NWPU-RESISC45) | Overfitting risk
[102] | NLD | UC Merced, NWPU-RESISC45, AID | Handling noisy labels with end-to-end KD | 99.08% accuracy (UC Merced) | Noisy data handling
[73] | GKD framework | Dordogne study site | Handling data misalignment with privileged information | 64.27% F-Measure | Incomplete coverage
[75] | TWA distillation | LEVIR, SAR SSDD | Reducing background noise and feature disparities | 95.4% AP50 (SAR SSDD) | Background interference
[116] | Incremental learning | CIFAR100, RESISC45 | Dual KD to prevent catastrophic forgetting | 6.9% accuracy improvement | Stability-plasticity dilemma
[2] | DKD with SFB module | Four HSI datasets | Spatial feature blurring for better KD | 97.55% OA (Salinas) | Fixed receptive fields

Moving forward, Zhang et al. [123] present RS-SSKD for few-shot RS scene classification, which uses Class Activation Maps (CAMs) and self-KD to generate powerful representations, achieving high accuracy on NWPU-RESISC45 and RSD46-WHU datasets. As the availability of airborne and satellite imagery increases, the challenge in RS scene classification has shifted from data scarcity to the lack of ground truth samples. For unfamiliar environments with limited training data in particular, few-shot classification offers a promising solution within meta-learning by extracting rich knowledge from minimal data. RS-SSKD therefore focuses on generating robust representations for downstream meta-learners. It features a two-branch network that uses three pairs of original-transformed images and incorporates CAMs to focus on the most relevant category-specific regions, ensuring the creation of discriminative embeddings. Additionally, self-KD is applied to prevent overfitting and enhance performance (see Fig. 7).

Figure 7: The overall framework includes the SSKD module for embedding learning and the meta-learning module based on ProtoNets. The parameter $\gamma$ is used to adjust cosine similarity in the meta-learning process.
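The role of $\gamma$ in a ProtoNet-style meta-learner can be sketched as follows. This is a generic illustration rather than the implementation of [123]: it assumes that class prototypes are the mean support embeddings and that $\gamma$ multiplies the cosine similarity between a query and each prototype before classification. The function names and the toy embeddings are hypothetical.

```python
import numpy as np

def class_prototypes(support_emb, support_labels, n_classes):
    """Mean embedding of the support samples of each class, shape (n_classes, d)."""
    return np.stack(
        [support_emb[support_labels == c].mean(axis=0) for c in range(n_classes)]
    )

def cosine_logits(query_emb, prototypes, gamma=10.0):
    """Cosine similarity between queries and prototypes, scaled by gamma."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return gamma * (q @ p.T)

# Toy 2-way 2-shot episode with 2-dimensional embeddings.
support_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
support_labels = np.array([0, 0, 1, 1])
query_emb = np.array([[0.8, 0.2], [0.2, 0.8]])

prototypes = class_prototypes(support_emb, support_labels, n_classes=2)
logits = cosine_logits(query_emb, prototypes, gamma=10.0)
predictions = logits.argmax(axis=1)
print("predicted classes:", predictions)
```

Since cosine similarity is bounded in [-1, 1], a scale factor of this kind controls how sharply the resulting class probabilities separate once a softmax is applied.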

5.1.2 Object Detection

In RS, object detection is crucial for identifying specific features such as buildings, vehicles, and vegetation. KD helps in creating lightweight models that maintain high accuracy, making it feasible to run these models on devices with limited computational power. Several studies focus on the use of KD for improving object detection in RS images, each introducing innovative strategies to address specific challenges. Algorithm 1 outlines a process for KD in RS object detection. It starts by training a teacher model on a dataset, and then defines a student model with a simpler architecture. The teacher model generates soft targets, which are probability distributions over classes, using a softened softmax function. The student model is trained using a combined loss function that includes the cross-entropy loss and the Kullback-Leibler divergence between the teacher’s and student’s outputs. The process iterates over several epochs, optimizing the student model to mimic the teacher while also learning from the original labels. Finally, the trained student model is deployed. The main works on the use of KD in the object detection task in RS images and their main features are summarized in Table 4.

Input: Training data $D$, Teacher model $T$, Student model architecture $S$, Temperature $T_{emp}$, Loss weights $\alpha, \beta$, Number of epochs $N$
Output: Trained Student Model $S$
Step 1: Train the Teacher Model
    $T \leftarrow \text{TrainTeacherModel}(D, T)$
Step 2: Define the Student Model
    $S \leftarrow \text{DefineStudentModel}(S)$
Step 3: Compute the Soft Targets from the Teacher Model
    for each batch $(x, y) \in D$ do
        $z_{T} \leftarrow T(x)$
        $p_{T} \leftarrow \text{Softmax}(z_{T} / T_{emp})$
    end for
Step 4: Define the Loss Functions
    $\mathcal{L}_{CE} \leftarrow \text{CrossEntropy}(S(x), y)$
    $\mathcal{L}_{KD} \leftarrow \text{KLDiv}(\text{LogSoftmax}(S(x) / T_{emp}), p_{T}) \times T_{emp}^{2}$
    $\mathcal{L}_{total} \leftarrow \alpha \mathcal{L}_{CE} + \beta \mathcal{L}_{KD}$
Step 5: Train the Student Model
    for epoch = 1 to $N$ do
        for each batch $(x, y) \in D$ do
            $p_{T} \leftarrow \text{ComputeSoftTargets}(T, x, T_{emp})$
            $z_{S} \leftarrow S(x)$
            $\mathcal{L} \leftarrow \text{ComputeLoss}(z_{S}, y, p_{T}, \alpha, \beta, T_{emp})$
            Update $S$ by minimizing $\mathcal{L}$
        end for
    end for
Step 6: Deploy the Student Model
    $S \leftarrow \text{Trained Student Model}$
Algorithm 1: KD for RS Object Detection
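The loss computed in Steps 3 to 5 of Algorithm 1 can be sketched in NumPy as shown below. The sketch operates on class logits only, so for a detector it would apply to the classification head; the batch-mean reduction, the default values of the temperature and loss weights, and the function names are assumptions made for illustration.

```python
import numpy as np

def log_softmax(z, axis=-1):
    """Numerically stable log-softmax."""
    z = z - np.max(z, axis=axis, keepdims=True)
    return z - np.log(np.sum(np.exp(z), axis=axis, keepdims=True))

def kd_loss(student_logits, teacher_logits, labels, temp=4.0, alpha=0.5, beta=0.5):
    """Combined loss L_total = alpha * L_CE + beta * L_KD from Algorithm 1.

    student_logits, teacher_logits: (batch, classes) arrays.
    labels: (batch,) integer class indices.
    returns: (total, ce, kd)
    """
    # L_CE: cross-entropy between student predictions and hard labels.
    log_p_student = log_softmax(student_logits)
    ce = -float(np.mean(log_p_student[np.arange(len(labels)), labels]))

    # Soft targets p_T = Softmax(z_T / T_emp) from the teacher.
    log_p_teacher = log_softmax(teacher_logits / temp)
    p_teacher = np.exp(log_p_teacher)

    # L_KD: KL(p_T || student) at temperature T_emp, scaled by T_emp^2.
    log_q_student = log_softmax(student_logits / temp)
    kl = float(np.mean(np.sum(p_teacher * (log_p_teacher - log_q_student), axis=1)))
    kd = kl * temp ** 2

    return alpha * ce + beta * kd, ce, kd

# Hypothetical logits for a batch of two samples and three classes.
teacher_logits = np.array([[4.0, 1.0, -2.0], [0.0, 3.0, 0.5]])
student_logits = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
labels = np.array([0, 1])

total, ce, kd = kd_loss(student_logits, teacher_logits, labels)
print(f"CE = {ce:.4f}, KD = {kd:.4f}, total = {total:.4f}")
```

The factor $T_{emp}^{2}$ keeps the gradient magnitude of the soft-target term comparable to that of the cross-entropy term as the temperature changes, which is why it appears in $\mathcal{L}_{KD}$.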

For instance, Yang et al. [88] propose an adaptive reinforcement supervision distillation (ARSD) framework to enhance lightweight object detectors. This method focuses on multiscale core features imitation and strict supervision regression distillation to improve performance, especially for small objects in complex backgrounds. Zhang et al. [105] introduce a dynamic KD (DKD) framework, leveraging dynamic global distillation and instance selection distillation to enhance object detection in cluttered scenes. Another study by Zhang et al. [144] presents Orientation Distillation (OD) to address issues with boundary discontinuity and spatial feature ossification for detecting arbitrary-oriented objects in RS images. The authors further propose an adaptive composite feature generation (ACFG) strategy to improve feature mapping and the handling of foreground and background loss in object detection [145].

Feng et al. [146] introduce an Instance-aware Distillation approach for Class-incremental Object Detection (IDCOD), which helps in preserving old class knowledge while learning new classes, thus mitigating catastrophic forgetting. Chen et al. [3] propose Discretized Position KD (DPKD), which focuses on transferring high-quality bounding box position and pose information to improve object detection performance. Pang et al. [4] present a pyramid KD (PKD) framework to handle the limitations of model compression, utilizing a hybrid online–offline smooth distillation strategy to enhance recognition accuracy while avoiding knowledge explosion and offset.

Du et al. [55] add a detection head specifically for small targets in the YOLOv5 model, proposing a network KD framework for improved small-scale target detection in RS images. Gao et al. [147] design a feature super-resolution fusion framework using cross-scale distillation to improve the detection accuracy of small objects by enhancing feature expression capability. Yang et al. [148] propose a weakly supervised object detection method using self-attention distillation and instance-aware mining to handle varying scales and dense object proximity in RS images.

Other studies address different aspects of RS. Zhang et al. [53] combine detection and tracking in a joint framework (OKD-JDT), using KD to improve tracking efficiency and accuracy. Sun et al. [149] develop the Efficient Multidimensional Global Feature Adaptive Fusion Network (MGFAFNET) for UAV platforms, incorporating a dual-branch multidimensional aggregation backbone and a localized compensation dual-mask distillation strategy to balance detection speed and accuracy. Yang et al. [150] introduce DC-KD, a distillation scheme for object detection in satellite images, addressing data distribution differences between aerial and satellite images. Song et al. [151] present HMKD-Net, a hybrid-model KD approach combining CNNs and vision transformers to enhance classification performance in RS images. Zhang et al. [152] propose a visual knowledge-oriented approach using pseudo labels to improve object detection in complex and dense RS images.

Lian et al. [153] introduce a multitask learning framework combining image translation and saliency detection networks with KD to enhance feature expressiveness and reduce model complexity. Zeng et al. [154] propose TDKD-Net, a tensor decomposition and KD-based network for UAV detection, focusing on small object detection and handling imbalanced issues. Wan et al. [155] develop a coarse-to-fine detection method integrating density-aware scale adaptation and KD to improve small object detection in UAV images. Jia et al. [156] suggest a multi-scale self-distillation approach to improve small target detection accuracy without using a teacher model. Lin et al. [157] propose DTCNet, a distillation Transform-CNN network for super-resolution reconstruction in RS images, enhancing reconstruction quality while maintaining a smaller parameter count. Tang et al. [158] introduce a text-guided tail-class generation network (TGN) to address long-tailed data distribution in RS datasets, improving tail-class accuracy by generating diverse and consistent tail-class images.

Table 4: Comparison of Studies on KD-based Object Detection in RS Imagery.
Ref. | Model(s) Used | Dataset/Data Type | Main Contribution | Best Performance | Limitation
[88] | ARSD framework | DOTA, DIOR, NWPU VHR-10 | Adaptive reinforcement supervision distillation for lightweight object detection | Outperforms SOTA methods | High complexity due to adaptive modules
[105] | DKD framework | DOTA, NWPU VHR-10 | Dynamic KD for multi-scale feature imitation | Suitable for various detectors | Potential overfitting to specific datasets
[144] | Orientation Distillation (OD) | Multiple datasets | Anti-ambiguous location prediction and feature calibration | Improved performance on non-axially arranged objects | Limited accuracy in complex scenes
[145] | ACFG strategy | DIOR, DOTA | Adaptive composite feature generation for KD | Better performance than SOTA KD algorithms | Complexity in composite mask generation
[146] | IDCOD | DOTA, DIOR, RTDOD, PASCAL VOC | Instance-aware distillation for class-incremental detection | mAP@0.5 of 74.0% on DIOR | Challenge in handling new classes post-deployment
[3] | DPKD | DOTA, HRSID | Discretized position KD for object detection | mAP of 79.82% on DOTA | Overlooks certain localization knowledge
[4] | PKD framework | Aircraft, FGSC-23 | Pyramid KD to avoid knowledge explosion and offset | Effective with ResNet and VGG networks | Complexity in finding optimal configuration
[55] | Enhanced YOLOv5 | NWPU VHR-10 | KD framework for small-scale target detection | Detection accuracy of 43.9% | High computational cost
[147] | SSRFPN with CSD | NWPU VHR-10, DIOR | Feature super-resolution fusion for small-object detection | AP0.5 of 95.0% on NWPU VHR-10 | Difficulty in feature extraction for very small objects
[148] | WSOD with SAD and IAM | NWPU VHR-10, DIOR | Weakly supervised learning for object detection | Accurate bounding boxes | Struggles with varying scales and dense objects
[53] | OKD-JDT | JiLin-1 | Joint detection and tracking framework | State-of-the-art performance | Limited to certain types of satellite videos
[149] | MGFAFNET | SyluDrone | Efficient detection method for UAV platforms | AP of 52.7%, AP50 of 93.6% on SyluDrone | Balancing detection speed and accuracy
[150] | DC-KD | xView | Distillation scheme for object detection in satellite images | 3.88% mAP50 improvement on xView | Data distribution differences
[151] | HMKD-Net | Multiple datasets | Hybrid-model KD with CNN-ViT ensemble | Max accuracy improvement of 22.8% | Handling variances during KD
[152] | Visual knowledge-oriented WSOD | NWPU VHR-10, DIOR | Leveraging visual cues as pseudo labels | mAP of 84.25% on NWPU VHR-10 | Handling noise in object proposals
[153] | WSA-GAN, BGNet | Various RS datasets | Multitask learning for image translation and saliency detection | Outperforms other approaches | Complexity in multimodal context learning
[154] | TDKD-Net | Various RS datasets | Tensor decomposition and KD for UAV detection | High generalization and robustness | Handling imbalanced issues
[155] | Coarse-to-fine network | VisDrone, UAVDT | Density-aware scale adaptation for small object detection | Superior detection in UAV images | Issues with scale variation
[156] | Self-distillation YOLO | KITTI | Multi-scale self-distillation for object detection | 2.8% accuracy improvement | Inefficiencies in knowledge transfer
[157] | DTCNet | AID | Distillation Transform-CNN for super-resolution | PSNR of 28.73 dB, SSIM of 0.7904 | High model complexity
[158] | TGN with KMDN and CDTG | DIOR, FGSC-23, DOTA | Text-guided tail-class generation for long-tailed distribution | Superior performance on tail classes | Data distribution imbalance

Traditional KD-based object detection methods, however, have limitations, such as ignoring crucial background information and focusing solely on global context. To overcome these issues, Attention-based Feature Distillation (AFD), which distills both local and global information from the teacher network, is proposed in [159]. AFD enhances local distillation with a multi-instance attention mechanism that allows the model to distinguish effectively between background and foreground elements. It also reconstructs the relationships between pixels, so that both local details and broader context are transferred from the teacher to the student detector. The method achieves state-of-the-art object detection performance while remaining efficient. Fig. 8 illustrates the AFD architecture.

Figure 8: The AFD architecture proposed in [159], which improves KD-based object detection in three ways. First, the method extracts both local and global information from the teacher network. Second, for local distillation, a multi-instance attention mechanism is introduced to effectively distinguish foreground elements from the background. Third, the approach reconstructs the relationships between different pixels, facilitating a more comprehensive transfer of knowledge from the teacher to the student detector through attention-guided local and global distillation.
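To make the combination of local and global feature distillation more concrete, the following is a minimal NumPy sketch and not the implementation of [159]. It assumes a single feature map per detector, approximates the attention mechanism with one spatial map derived from the teacher, uses ground-truth boxes as the foreground mask, and models pixel relationships with a cosine-similarity matrix. All function names, weights, and the temperature value are hypothetical.

```python
import numpy as np

def spatial_attention(features, temperature=0.5):
    """Teacher-derived spatial attention over a (C, H, W) feature map.

    The map is a softmax over pixel positions, rescaled to sum to H * W.
    """
    _, h, w = features.shape
    energy = np.abs(features).mean(axis=0).reshape(-1) / temperature
    energy = energy - energy.max()          # numerical stability
    weights = np.exp(energy)
    weights /= weights.sum()
    return (h * w * weights).reshape(h, w)

def box_mask(shape, boxes):
    """Boolean foreground mask from (x0, y0, x1, y1) boxes in feature coordinates."""
    mask = np.zeros(shape, dtype=bool)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = True
    return mask

def pixel_relations(features):
    """Cosine similarity between every pair of pixels, shape (H*W, H*W)."""
    flat = features.reshape(features.shape[0], -1)
    flat = flat / (np.linalg.norm(flat, axis=0, keepdims=True) + 1e-8)
    return flat.T @ flat

def local_global_distill_loss(student_feat, teacher_feat, boxes,
                              fg_weight=1.0, bg_weight=0.5, global_weight=0.1):
    """Attention-weighted local loss (foreground and background kept separate)
    plus a global loss on pixel-to-pixel relations. Weights are assumptions."""
    attention = spatial_attention(teacher_feat)
    foreground = box_mask(teacher_feat.shape[1:], boxes)
    squared_error = ((student_feat - teacher_feat) ** 2).mean(axis=0)  # (H, W)
    weighted = attention * squared_error
    fg_loss = weighted[foreground].sum() / max(foreground.sum(), 1)
    bg_loss = weighted[~foreground].sum() / max((~foreground).sum(), 1)
    global_loss = ((pixel_relations(student_feat)
                    - pixel_relations(teacher_feat)) ** 2).mean()
    return fg_weight * fg_loss + bg_weight * bg_loss + global_weight * global_loss
```

Keeping separate foreground and background terms reflects the point made above that background regions also carry useful knowledge, while the relation term stands in for the global context that per-pixel losses miss.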

5.1.3 Semantic Segmentation

KD is beneficial for semantic segmentation in RS applications, which involves classifying each pixel in an image into predefined categories. The teacher model is first trained on the segmentation task using high-resolution RS data [160]. Due to its complexity and larger capacity, it can learn intricate patterns and detailed features from the data. Once trained, the teacher model’s predictions, along with its internal representations, are used to guide the training of the student model. The student model, being smaller and more efficient, aims to mimic the performance of the teacher model while maintaining lower computational costs and faster inference times [161]. Moreover, KD is particularly advantageous because it allows for the deployment of effective semantic segmentation models on edge devices or in scenarios with limited computational resources [162]. By leveraging the distilled knowledge from the teacher model, the student model can achieve high segmentation accuracy despite its reduced size. This is crucial for applications such as real-time environmental monitoring, disaster response, and agricultural analysis, where timely and accurate segmentation of satellite or aerial imagery is needed. The distillation process also helps the student model generalize better to new and unseen data, enhancing its robustness and reliability in diverse RS tasks [163].
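The teacher-to-student transfer described above can be sketched as a pixel-wise distillation loss. The example below is a generic, response-based formulation (cross-entropy on the labels plus a temperature-softened KL term per pixel) and is not taken from any specific work cited here. The function names, the temperature, and the mixing weight alpha are assumptions, and feature-level terms on internal representations are omitted for brevity.

```python
import numpy as np

def log_softmax(logits, axis=0):
    """Numerically stable log-softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def pixelwise_kd_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Distillation loss for dense prediction.

    student_logits, teacher_logits: arrays of shape (C, H, W).
    labels: integer array of shape (H, W) with values in [0, C).
    Returns (total, cross_entropy, kl), each averaged over pixels.
    """
    # Hard-label term: cross-entropy of the student against the ground truth.
    log_p = log_softmax(student_logits)
    rows, cols = np.indices(labels.shape)
    cross_entropy = -log_p[labels, rows, cols].mean()

    # Soft-label term: KL(teacher || student) on temperature-softened outputs.
    log_ps = log_softmax(student_logits / temperature)
    log_pt = log_softmax(teacher_logits / temperature)
    kl = (np.exp(log_pt) * (log_pt - log_ps)).sum(axis=0).mean()

    # The T^2 factor keeps the soft-term gradients on a comparable scale.
    total = (1.0 - alpha) * cross_entropy + alpha * temperature ** 2 * kl
    return total, cross_entropy, kl
```

With alpha set to 0 the loss reduces to ordinary supervised training, which gives a convenient baseline for measuring how much the teacher contributes.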

The studies on semantic segmentation in RS show significant advancements but also face several limitations. For instance, Gao et al. [164] introduced the FoMA framework, which significantly improves segmentation performance by leveraging foundation models, but it struggles with data scarcity in novel classes and with balancing segmentation performance across classes. Similarly, Zhou et al. [122] proposed a lightweight student network (GAGNet-S*) with KD that achieves excellent segmentation performance but faces challenges related to scalability and complexity in deployment on resource-limited equipment. Dong et al. [99] addressed the limitations of CNNs and transformers by proposing the DSCT framework, which enhances segmentation performance through cross-model KD. However, this approach incurs high computational complexity and requires massive data resources.

Studies focusing on KD methods, such as MGSAD by Zhang et al. [165] and MTKD by Li et al. [166], contribute innovative techniques but encounter challenges such as the need for extensive computation and the handling of domain shifts. Liu et al. [167] proposed a three-stage UDA method that outperforms state-of-the-art methods but relies heavily on large-scale annotated data and struggles with domain shift handling. Similarly, Shi et al. [168] introduced DSANet, which effectively handles spatial and semantic feature enhancement but reduces the model’s characterization ability for these features.

Incremental learning and domain adaptation are other areas where significant contributions have been made but also face limitations. Rong et al. [169] proposed a generalized framework for CSS but struggled with the challenge of old classes collapsing into the background. Rui et al. [170] and Le et al. [171] focused on incremental learning methods but faced high computational costs and the complexity of adapting to incremental domains and partial multi-task learning, respectively. Shan et al. [172, 173] developed class-incremental segmentation methods that address catastrophic forgetting but require balancing old and new class learning and managing feature generation complexity. Li [174] proposed DSSN with weakly-supervised constraints to handle cross-domain segmentation, but the method heavily depends on labeled data and struggles with geographic variation.

Guo et al. [175] and Cao et al. [176] proposed methods to balance effectiveness and compactness in segmentation models, but they face high computational demands and challenges in handling noise and redundant features. Zhou et al. [119] introduced GSGNet with high inference speed but had to balance this with contextual reasoning capabilities. Bai et al. [177] and Wang et al. [178] focused on domain adaptation, but they faced difficulties in aligning high-dimensional image representations and managing intermediate domain learning. Michieli et al. [179] addressed incremental learning with various KD techniques but struggled with catastrophic forgetting and internal feature representation complexity. Lastly, Pena et al. [180] introduced DeepAqua for water detection, which improves segmentation accuracy but lacks specific details on datasets and segmentation scenarios. Table 5 provides a summary of the works that use KD for semantic segmentation of RS images.

Table 5: Comparison of Various Studies on Semantic Segmentation in RS
Ref. | Model(s) Used | Dataset/Data Type | Main Contribution | Best Performance Value Achieved | Limitation(s)
[164] | FoMA Framework | OpenEarthMap | Introduces GFSS with three strategies: SLE, DGK, and VFE for improved segmentation | Improvement of 28.94% in segmentation performance, with 31.79% for novel classes and 24.64% for base classes | Data scarcity in novel classes and complex balancing in segmentation performance
[122] | GAGNet-S* (with KD) | Potsdam, Vaihingen | Proposes a lightweight student network framework with KD | Achieved excellent segmentation performance on Potsdam and Vaihingen datasets | Scalability and complexity in deployment on resource-limited equipment
[99] | DSCT Framework | ISPRS Potsdam, Vaihingen, GID, LoveDA | Cross-model KD using CNNs and transformers | Outperforms state-of-the-art KD methods on four datasets | High computational complexity and massive data resource requirements
[165] | MGSAD | Not specified | Proposes a multi-granularity semantic alignment distillation method for semantic segmentation | Not specified | Details on datasets and specific performance metrics are not provided
[166] | MTKD | Not specified | Multi-task KD for weather-degraded image segmentation | Achieves 0.038 s in semantic segmentation for a 2048 × 1024 image | Specific performance values not provided, computation-intensive
[167] | Covariance-based Channel Attention Module | ISPRS 2-D Semantic Labeling, Urban Drone Dataset (UDD) | Proposes three-stage UDA method with KD for RS images | Shows better performance compared with state-of-the-art methods | Domain shift handling and reliance on large-scale annotated data
[168] | DSANet | ISPRS Potsdam, Vaihingen | Effective deep supervision-based attention network for RSIs | 79.19% mIoU on Potsdam, 72.26% mIoU on Vaihingen with 470.07 FPS on 512 × 512 images | Reduces model characterization ability for spatial and semantic features
[181] | Cross-modal KD | Not specified | Uses optical images to train a student model for SAR images through cross-modal KD | Increase of 5-20% IoU score compared to training from scratch | Small training datasets and complexity in cross-modal learning
[169] | Generalized Framework for CSS | iSAID, GCSS | Proposes historical information-guided modules for CSS in RS images | Outperforms state-of-the-art methods in most incremental settings | Challenge of old classes collapsing into the background
[170] | Domain-Incremental Learning | LoveDA-rural | Proposes domain-incremental learning for multi-source RS data | Achieves mIoU of 0.6233 on LoveDA-rural at step 5 | High computational cost and complexity in incremental domain learning
[171] | Partial Multi-Task Learning with KD | ISPRS 2D Semantic Labeling Contest | Enhances partial multi-task learning performance using KD | mIoU of 68.97% on Vaihingen dataset | Lack of all-task annotations and reliance on soft labels
[172] | DFD and LM Modules | Aerial images dataset | Proposes class-incremental segmentation method without old data storage | 6.2% and 15% mIoU gains from DFD and LM modules respectively | Catastrophic forgetting and balancing old and new class learning
[173] | PFG and TKD Modules | Not specified | Effective class-incremental segmentation framework without storing old data | More than 4.5% gains compared with state-of-the-art methods | Limited detail on dataset performance and complexity in feature generation
[174] | DSSN with Weakly-Supervised Constraints | Not specified | Proposes DSSN for cross-domain RS image segmentation | Mean F1Score: 60.76%, Mean IoU: 44.53% | High dependency on labeled data and difficulty in geographic variation handling
[175] | CLNet-T and CLNet-S (with KD) | MFNet, PST900 | Proposes a balance between effectiveness and compactness using KD | MFNet: mAcc 76.6%, mIoU 58.2%; PST900: mAcc 95.59%, mIoU 80.77% | High computational demands and complexity in terminal device deployment
[176] | C3Net with Multi-Level KD | ISPRS Vaihingen | Proposes efficient C3Net for multi-modal data semantic segmentation | Overall Accuracy: 91.3%, High mean F1 score for car class | Noise and redundant feature handling and high running time
[119] | GSGNet with KD | Vaihingen, Potsdam | Proposes GSGNet for ORSI scenario analysis with high inference speed | Outperforms most advanced methods with 19.61 M parameters | Balancing high inference speed and contextual reasoning capability
[177] | Contrastive and Adversarial Learning | Not specified | Proposes a model for domain adaptation in representation space and spatial layout | Not specified | Specific performance values and dataset details not provided
[178] | TDARS | Three domain adaptation datasets | Proposes transitive domain adaptation for RS images | Effectively handles domain shift problem compared to other methods | High complexity in intermediate domain learning and transfer
[179] | Various KD Techniques | Pascal VOC2012, MSRC-v2 | Proposes incremental learning for semantic segmentation with KD | Highest Accuracy: 97.5% (Abisoye et al. 2024), Lowest Error: 0.032 MAE (De 2024) | Catastrophic forgetting and complexity in internal feature representation handling
[180] | DeepAqua | Not specified | Proposes an unsupervised method for water detection in RS | Improves accuracy by 3%, IoU by 11%, F1-score by 6% | Specific details on datasets and segmentation scenarios not provided

In addition, [164] proposes a Foundation Model Assisted (FoMA) framework for Generalized Few-Shot Semantic Segmentation (GFSS) in RS images, aimed at improving segmentation performance under data scarcity conditions. FoMA leverages foundation models through three strategies: Support Label Enrichment (SLE) to enhance support labels, Distillation of General Knowledge (DGK) to transfer generalizable knowledge, and Voting Fusion of Experts (VFE) to combine zero-shot and few-shot predictions. The method demonstrates state-of-the-art performance on the OpenEarthMap few-shot challenge dataset. Fig. 9 illustrates the architecture of the FoMA framework, which integrates a vision-language foundation model’s general knowledge into the GFSS task for RS images.

(a) The FoMA GFSS framework

(b) SLE

(c) DGK

Figure 9: The FoMA architecture proposed in [164] incorporates the general knowledge of a vision-language foundation model, initially trained on natural images, into the GFSS task for RS images. This is achieved through two key modules: SLE, which integrates the foundation model’s predictions as pseudo-labels into the GFSS learner’s training on support images, and DGK, which transfers the foundation model’s superior performance on novel classes from query images to the learner. Additionally, a voting fusion strategy combines the results of the foundation model and the GFSS learner for enhanced accuracy.
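Two of the ideas behind FoMA, enriching support labels with pseudo-labels and fusing the outputs of several predictors, can be illustrated with a short sketch. The code below is not the implementation of [164]: the ignore index, the confidence threshold, and the use of weighted soft voting are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

IGNORE = 255  # assumed marker for pixels without an annotation

def enrich_support_labels(support_labels, foundation_probs, min_confidence=0.7):
    """Fill unannotated support pixels with confident foundation-model predictions.

    support_labels: (H, W) integer map, IGNORE where no label exists.
    foundation_probs: (C, H, W) class probabilities from the foundation model.
    Annotated pixels are never overwritten.
    """
    predicted = foundation_probs.argmax(axis=0)
    confidence = foundation_probs.max(axis=0)
    fill = (support_labels == IGNORE) & (confidence >= min_confidence)
    enriched = support_labels.copy()
    enriched[fill] = predicted[fill]
    return enriched

def fuse_experts(prob_maps, weights=None):
    """Weighted soft voting over several (C, H, W) probability maps.

    Returns the (H, W) map of winning class indices.
    """
    stacked = np.stack(prob_maps)                       # (E, C, H, W)
    if weights is None:
        weights = np.ones(len(prob_maps))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    fused = np.tensordot(weights, stacked, axes=1)      # (C, H, W)
    return fused.argmax(axis=0)
```

In this sketch, the threshold trades label coverage against pseudo-label noise, and the expert weights could be tuned on a validation split so that the zero-shot predictor dominates on novel classes and the few-shot learner on base classes.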

5.2 Specific Applications

5.2.1 Land Cover Classification

KD improves the classification of land cover types by refining the feature extraction capabilities of student models. This leads to better segmentation and classification of different land cover types, essential for environmental monitoring and urban planning. Several studies have proposed innovative methods to improve land cover classification and other RS tasks using KD and multimodal data fusion. For example, Xu et al. [117] developed a two-branch patch-based CNN with an encoder-decoder (ED) module to fuse multimodal RS data. They introduced a KD in model (DIM) module for better multimodal data fusion and a cross-model (DCM) module to enhance single-modal classification using multimodal knowledge. Their approach demonstrated superior performance on hyperspectral (HS) and light detection and ranging (LiDAR) data as well as on HS and synthetic aperture radar (SAR) data. Fig. 10 depicts the approach proposed in [117]. Wang et al. [182] proposed the cross-modal graph knowledge representation and distillation learning (CGKR-DL) framework, which combines a CNN and a graph convolutional network (GCN) to enhance land cover classification. Their method addresses the limitations of traditional CNN-based cross-modal distillation methods and significantly improves performance on various multimodal RS datasets.

The Generalized KD (GKD) framework was introduced by Ienco et al. [73] to handle data misalignment between the training and test phases. Applied to radar and optical satellite image time series data, their method improved land use and land cover mapping, especially for agricultural classes. For land use/cover classification using optical and SAR images, Liu et al. [11] proposed a multimodal online KD (MMOKD) framework that supports both multimodal and cross-modal learning and shows superior performance in both scenarios. Finally, Li et al. [183] introduced the dynamic-hierarchical attention distillation network (DH-ADNet) with multimodal synergetic instance selection (MSIS) for land cover classification with missing data modalities. Their method emphasizes selective instance enhancement and hierarchical attention distillation, achieving state-of-the-art results.

Several other studies focused on specific RS challenges. For instance, Lu et al. [54] developed a weakly supervised change detection technique via KD and Multiscale Sigmoid Inference (KD-MSI), significantly improving change detection performance on multiple datasets. Similarly, Zhang et al. [184] proposed a transfer learning framework using a teacher-student structure for better generalizability and performance in land cover classification. Kanagavelu et al. [185] and Gbodjo et al. [186] explored federated learning and multisensor data integration, respectively, to enhance land cover mapping and monitoring. The former integrated KD into a federated UNet model to reduce communication costs while maintaining high accuracy, whereas the latter developed a self-distillation strategy within a CNN framework to combine multitemporal SAR and optical data for improved land cover classification. The works that use KD for the classification of land cover and their main characteristics are summarized in Table 6.

Table 6: Comparison of Studies on KD-based Land Cover Classification
Ref. | Model(s) Used | Dataset/Data Type | Main Contribution | Best Performance Value Achieved | Limitation
[117] | Two-branch patch-based CNN with ED and DIM modules | Hyperspectral (HS) and LiDAR data (Houston2013) | Developed a KD in model (DIM) and cross-model (DCM) module for better LC classification | Improved LC classification performance on two multimodal RS datasets | The study mainly focuses on LC classification; does not cover other RS applications
[182] | CGKR-DL framework with CNN and GCN | HS-LiDAR, HS-SAR, HS-SAR-DSM datasets | Proposed cross-modal graph knowledge representation and distillation learning | Significant improvement in land cover classification accuracy | Focuses on classification; not on other types of RS tasks
[73] | Generalized KD (GKD) framework | Radar (Sentinel-1) and optical (Sentinel-2) SITS | Managed information misalignment between training and test data | Accuracy: 65.01%, F-Measure: 64.27%, Kappa: 0.5775 | Limited to cases where radar data is always available
[11] | MMOKD framework | Optical and SAR images | Developed multimodal online KD framework for land use/cover classification | Outperformed other networks in both full- and missing-modality scenarios | Large semantic gap between modalities poses a challenge
[185] | Federated UNet model with KD | Satellite and street view images | Improved efficiency and privacy of real-time climate tracking | Accuracy above 95% | Focus on semantic segmentation, not other RS tasks
[183] | DH-ADNet with MSIS | Coregistered optical and SAR datasets | Introduced dynamic-hierarchical attention distillation for land cover classification | State-of-the-art results in the privileged information scenario | Limited to privileged information scenarios
[54] | KD-MSI with CAMs | WHU-CD, DSIFN-CD, LEVIR-CD datasets | Weakly supervised change detection using KD | F1-score: 0.854 on WHU-CD | Focuses on change detection; not applicable to other RS tasks
[184] | Transfer learning framework with CMD and high-temperature softmax | Various RS datasets | Improved land cover classification using teacher-student structure | Average increase in mIoU: 9.9%, 2.1%, 4.3% | Requires large datasets for teacher model training
[187] | DAGDNet with IG-FGM and MS-ADL | Coregistered optical and SAR datasets | Efficient dense adaptive grouping distillation network for MLCC | Superior performances on representative datasets | Limited to scenarios with privileged modality
[186] | Patch-based multibranch CNN | Multitemporal SAR/optical data | Integrated multisensor RS data using self-distillation strategy | Accuracy: 94% (Reunion island), 88% (Dordogne) | Requires sparsely annotated ground-truth data
[188] | Hallucination network with KD | PAN-MS image pairs, hyperspectral dataset | Provided robust solution for missing modalities using hallucination module | Overall accuracy: 97.01% | Focused on scene recognition and image classification
[189] | CloudSeg framework with multi-task learning | M3M-CR, WHU-OPT-SAR datasets | Addressed semantic segmentation under cloud cover using KD | mIoU improvement: 3.16% (M3M-CR), 5.56% (WHU-OPT-SAR) | Focuses on cloudy conditions; not applicable to cloud-free scenarios
[190] | Segment Anything (SAM) model | Planetary images | Rapid annotation for geological mapping using KD | Comparable to state-of-the-art on mapping planetary skylights | Limited to geological mapping tasks
[191] | Distill and refine strategy with CNN | Sentinel-1 data | Addressed spatial transfer challenge for mapping irrigated areas | Best performance in spatial transferability | Focused on spatial transfer; not on other RS tasks
[192] | Lightweight model with KD | UC Merced Land Use dataset | High accuracy and efficiency for RS image retrieval | mAP: 0.9680 with 3.8M parameters | Limited to image retrieval tasks
[193] | MRF-NAS with self-training UDA | OpenEarthMap, FLAIR #1 datasets | Lightweight neural networks for UDA in RS | mIoU: 59.38% (OpenEarthMap), 51.19% (FLAIR #1) | Focus on UDA; not on other RS tasks
[194] | Cross-modal distillation framework | Sen1Floods11 dataset | Improved flood detection with cross-modal distillation | IoU improvement: 6.53% on test split | Limited to flood detection; not other RS tasks
[195] | GCPNet with GCN and ASPM | Various satellite datasets | Enhanced pansharpening using GCN and KD | Outperformed state-of-the-art visually and quantitatively | Limited to pansharpening tasks
[196] | Domain knowledge-guided self-supervised learning | Onera Satellite Change Detection dataset | Improved unsupervised change detection using domain knowledge | Kap: 53.34%, F1: 55.69% | Focused on change detection; not other RS tasks
[197] | VGG13 (teacher), ResNet8 (student) | SMAP satellite data | Improved soil moisture prediction using KD | High prediction accuracy with efficient student model | Focused on soil moisture prediction; not other RS tasks
[198] | LSAW with adaptive weights | CCF, Potsdam, Vaihingen datasets | Addressed catastrophic forgetting in incremental learning | Best results on three datasets | Focus on incremental learning; not other RS tasks

Significant research has also been devoted to enhancing land cover (LC) classification using multimodal RS data, which outperforms single-modal methods owing to its richer information content. To advance this field, a two-branch, patch-based CNN with an encoder-decoder (ED) module for effective multimodal data fusion is proposed in [117]. A KD in model (DIM) module is introduced to guide per-modality encoder learning, ensuring more efficient fusion. In addition, the authors explore guiding single-modal learning with multimodal information through the KD cross-model (DCM) module. This approach treats the multimodal method as a teacher and transfers its knowledge to single-modal methods. Extensive experiments on the Houston2013 and Berlin datasets, which combine hyperspectral (HS) data with LiDAR and synthetic aperture radar (SAR) data, respectively, demonstrate the superiority of this multimodal fusion strategy over state-of-the-art methods. The DCM module also significantly enhances LC classification performance for single-modal methods.

Figure 10: Illustration of the framework proposed in [117]. The "Conv" block comprises a 3 × 3 convolutional layer, followed by batch normalization, a 2 × 2 max-pooling layer, and a ReLU activation function. The "FC" block includes a fully connected layer, batch normalization, and a ReLU activation function. Both the "Shared Classifier" and "Classifier" share the same structure, composed of "FC" blocks and a softmax layer for final classification.
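The idea of a multimodal teacher guiding a single-modal student can be illustrated with a toy example. The sketch below is not the DIM/DCM design of [117] nor the GKD framework of [73]. It assumes a linear softmax student, synthetic data in place of HS and LiDAR features, a temperature of 1, and a hypothetical imitation weight. At temperature 1, mixing the hard-label and teacher cross-entropy terms is equivalent to a single cross-entropy against a blended target, which keeps the gradient simple.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax for a (N, K) array of logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def train_single_modal_student(x_student, teacher_probs, labels,
                               imitation=0.5, lr=0.5, steps=200):
    """Fit a linear softmax student on one modality only.

    The loss is (1 - imitation) * CE(labels) + imitation * CE(teacher_probs),
    i.e. a cross-entropy against the blended target built below.
    Returns the learned weights and the loss recorded at every step.
    """
    n, d = x_student.shape
    k = teacher_probs.shape[1]
    one_hot = np.eye(k)[labels]
    target = (1.0 - imitation) * one_hot + imitation * teacher_probs
    weights = np.zeros((d, k))
    losses = []
    for _ in range(steps):
        probs = softmax(x_student @ weights)
        losses.append(float(-(target * np.log(probs + 1e-12)).sum(axis=1).mean()))
        # Gradient of the blended cross-entropy with respect to the weights.
        weights -= lr * x_student.T @ (probs - target) / n
    return weights, losses

# Synthetic stand-ins: the teacher sees both modalities, the student only one.
rng = np.random.default_rng(0)
x_hs = rng.normal(size=(200, 8))        # stand-in for hyperspectral features
x_lidar = rng.normal(size=(200, 4))     # stand-in for LiDAR features
teacher_weights = rng.normal(size=(12, 3))
teacher_probs = softmax(np.hstack([x_hs, x_lidar]) @ teacher_weights)
labels = teacher_probs.argmax(axis=1)   # labels assumed to agree with the teacher

student_weights, losses = train_single_modal_student(x_hs, teacher_probs, labels)
```

Because the student never receives `x_lidar`, whatever it learns about the second modality reaches it only through the teacher's soft outputs, which is the mechanism that cross-modal distillation relies on when a modality is missing at inference time.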

5.2.2 Precision Agriculture

KD is crucial in smart agriculture as it allows for the development of lightweight models that maintain high accuracy while being deployable on resource-constrained edge devices, such as drones or sensors. This is particularly important for precision agriculture tasks like early weed detection and crop monitoring, where efficient and accurate models are needed for real-time decision-making. By transferring knowledge from larger, more complex models to smaller ones, KD helps optimize these tasks, enhancing agricultural productivity and sustainability. Numerous studies have investigated the application of KD in precision agriculture. These efforts focus on enhancing model efficiency and accuracy in tasks such as crop monitoring, weed detection, and resource management. For instance, Liangde et al. [199] develop a model distillation approach to enhance agricultural named entity recognition, leveraging a BERT-based model enhanced by BiLSTM and CRF for precise entity detection from a constructed agriculture knowledge graph. Ghofrani and Mahdian Toroghi [200] focus on plant disease detection, using a KD approach to enable smaller CNN architectures, like MobileNet, to achieve accuracy close to that of high-end models on the PlantVillage dataset. Along the same lines, Hu et al. [201] and Dong et al. [202] address crop disease detection. The former optimize YOLOv5s for maize disease detection, while the latter use ECA-KDNet for efficient apple disease diagnosis on mobile devices. Finally, Huang et al. [203] develop multistage KD to create lightweight models for diagnosing multiple crop diseases effectively.

In the task of image segmentation, Angarano et al. [204] introduce a method for robust crop segmentation using KD, aimed at improving the generalization across different environmental conditions for robotic field management. Similarly, Li et al. [205] employ KD for efficient panoptic segmentation, creating lightweight networks capable of detailed scene understanding at high speeds, whereas Jung et al. [206] improve plant leaf segmentation using KD to maintain high-quality instance segmentation. The work of Pagé-Fortin [207] investigates class-incremental learning methods to address the challenge of learning new plant species and diseases incrementally, focusing on mitigating catastrophic forgetting.

In the domain of precise image analysis and preventive detection in vineyards, Wang et al. [208] propose a lightweight semantic segmentation model for identifying grape picking points, enhancing picking efficiency in vineyard environments, whereas Hollard and Mohimont [209] apply KD to enhance grapevine detection for early yield prediction, focusing on lightweight model deployment on embedded devices. As far as disease detection is concerned, Musa et al. [210] propose a low-power DL model for detecting plant diseases in hydroponic systems, aiming at efficiency and reduced resource consumption, and Zhang and Wang [211] improve plant leaf disease recognition using a novel data augmentation-based KD framework, enhancing recognition accuracy in natural environments.

More advanced and complex tasks have also been addressed using KD. In the aquaculture domain, Yin et al. [212] propose a novel fish individual recognition method using KD within a vision transformer framework, improving accuracy. In the same application domain, Li et al. [213] focus on underwater fish species classification using a novel two-tier KD method to enhance model accuracy and reduce computational demands. Returning to plant and tree imagery, Yang et al. [214] develop a fast pest detection algorithm using lightweight feature extraction and KD to enhance performance on edge devices, and Wu et al. [215] present Deep BarkID, a lightweight CNN for tree species identification from bark images, tailored for use in forest environments with limited computing resources. Finally, Yamamoto [216] utilizes CNNs to distill crop models in order to accelerate the understanding of plant physiology, applying DL to evaluate environmental impact on grain yield.

Researchers have also contributed to the development and deployment of lightweight student models on low-capacity edge devices. Wenjie et al. [217] discuss structured model compression via KD, transferring knowledge from a complex VGG16 model to a lightweight MobileNet. This approach significantly reduces model size and improves performance, making it suitable for deployment on devices with limited resources. Wang et al. [218] explore lightweight model development for leaf image analysis, particularly for coffee leaf pest and disease identification. Using VGG as a teacher network, a student network is trained with KD, achieving high recognition accuracy and speed, which is crucial for real-time analysis. Along the same lines, Li et al. [219] delve into KD for instance-based semantic segmentation, applying it to transform complex transformer models into more efficient DCNN architectures, which proves effective in agricultural applications on datasets like BUP20 and SB20. Arablouei et al. [220] use KD to create compact models suitable for classifying animal behavior from accelerometry data on wearable devices. The models are optimized for real-time, in-situ performance on devices with limited computational resources.

Finally, there are works that take advantage of lightweight student models and combine them with heterogeneous remote sensor data to improve prediction accuracy. Castellano et al. [221] study the application of KD for mapping weeds using UAVs in precision agriculture, developing a lightweight Vision Transformer-based model that provides effective weed mapping with minimal computational resources, and Bansal et al. [222] develop a transformer-based network for plant growth monitoring, utilizing KD to enhance the model’s performance by fusing RGB and depth image data for more accurate growth predictions. Table 7 provides a summary of the main works that employ KD in the agriculture domain.

Table 7: Comparison of Studies on KD in Agriculture
| Ref. | Model(s) Used | Dataset/Data Type | Main Contribution | Best Performance | Limitation |
|---|---|---|---|---|---|
| [199] | BERT-ALA + BiLSTM + CRF | Agriculture named entity data | Enhanced agricultural entity recognition using model distillation | Macro-F1 increased by 3.3% | High time and space complexity |
| [200] | MobileNet, Xception | PlantVillage dataset | Plant disease recognition with KD | Accuracy of 97.58% | Limited to small architectures |
| [201] | Improved YOLOv5s | Maize leaf disease dataset | Lightweight model for maize leaf disease detection | mAP(0.5): Increased by 3.8% | Only focuses on maize; may not generalize to other crops |
| [205] | ResNet-34 | Various datasets for panoptic segmentation | KD for panoptic segmentation | Improved panoptic quality by up to 4.1 points | Requires extensive fine-tuning of balancing weights |
| [206] | Identical architecture for teacher and student | Large dataset for plant leaf segmentation | Improved instance segmentation using spatial embedding and KD | Enhanced segmentation accuracy | High dependency on the quality and size of the dataset |
| [202] | ECA-KDNet | Apple leaf dataset | Lightweight model for apple leaf disease diagnosis | Accuracy of 98.28% | Focused only on apple leaves, might not generalize |
| [203] | YOLOR model variants | PlantDoc dataset | Multistage KD for plant disease detection | 60.4% mAP@.5 | Model complexity and distillation stages may be challenging to manage |
| [223] | Multilevel distillation framework | CIFAR100 and CIFAR10 | Addressing low resolution identification problems | Improved low-resolution recognition accuracy | Specific to low-resolution datasets |
| [208] | Lightweight semantic segmentation model | Custom dataset for grape picking point localization | Efficient grape picking point localization in complex environments | 91.08% accuracy in picking point localization | Limited to grape picking, may not extend to other fruits |
| [210] | Low-power DL model | Hydroponic systems | Plant disease detection in low-power IoT devices | Accuracy of 99.4% | Focus on hydroponics; broader application unknown |
| [211] | Data augmentation-based KD framework | PlantDoc dataset | Enhanced recognition accuracy for plant leaf diseases | Improved accuracy by up to 3.06% | Performance heavily dependent on data augmentation quality |
| [209] | Knowledge-distilled models | Datasets for grapevine detection | Early grape detection and yield prediction with KD | Improvement in various metrics, e.g., 13.63% in mAP50-95 | Predominantly focused on early detection stages |
| [218] | Lightweight model using VGG for KD | Coffee leaf dataset | High accuracy in coffee leaf disease identification with a lightweight model | Accuracy of 96.73% | Generalization to other crop diseases not demonstrated |
| [220] | GRU-MLP models, ResNet | Animal behavior datasets | In-situ animal behavior classification on wearable devices | MCC of 0.882 (ResNet) | Mainly applicable to animal behavior, not crops |
| [217] | Distilled-MobileNet | Common diseases of crops | Lightweight disease recognition model for limited-resource devices | Accuracy of 97.62% | Limited to specific diseases and crops |
| [219] | Instance-based semantic segmentation with Mask2Former | Agricultural datasets | KD for instance semantic segmentation | AP improvement of 1.8 for ResNet-50 | Focused on specific types of segmentation |
| [221] | Lightweight Vision Transformer | WeedMap dataset | Mapping weeds with drones using KD | F1 score of 0.863 | Specific to drone-based RS |
| [222] | PA-RDFKNet | Various datasets for plant growth monitoring | RGB-depth fusion for plant age estimation with KD | MSE reduced from 2 to 0.14 weeks | Focused on plant growth, might not extend to other agricultural tasks |
| [224] | KD from Multi-head Teacher (KDM) | Bio-HSI | Efficient hyperspectral image segmentation with a compact student network | mIoU of 90.03% | Over-compression degrades performance without medium-sized teacher assistants |
| [225] | UNet with various backbones | On-field images of pomegranate fruit | Effective segmentation of pomegranate fruits for agricultural automation | F1 score of 90.35% for VGG19 backbone | Dependency on the choice of backbone for performance |
| [212] | Vision Transformer with chunking method | DlouFish dataset | Enhanced fish individual recognition using a novel KD strategy | Accuracy of 93.19% | Specific to underwater environments |
| [214] | C3Faster with KD | CropPest6 dataset | Fast and efficient crop pest detection suitable for edge devices | 97.5% mAP | Reduced feature extraction capability in lightweight models |
| [215] | Lightweight CNN models | Indiana Bark Dataset | Portable tree species identification system for smartphones | 96.12% accuracy | Limited to specific tree species in Indiana |
| [216] | CNN | Crop growth dataset generated by a crop model | Learning plant physiology from crop models to enhance model portability | MSE of 52.9 during training | Limited by synthetic data generation from crop models |
| [213] | Two-tier KD (T-KD) | Fish37 dataset | Improved accuracy and reduced parameters for underwater fish species classification | Top-1 accuracy of 97.20% | Requires large model sizes for initial training |
| [226] | KD from multispectral to RGB models | Mullus Marbatus family dataset | Fish quality estimation using RGB cameras with knowledge from multispectral images | Classification accuracy of 84.3% | Limited to specific types of fish and conditions |
| [227] | ResNet50 and a lightweight student model | Dataset of Ethiopian medicinal plants | Accurate identification of medicinal plants using a distilled knowledge approach | 96.91% accuracy | High accuracy dependent on extensive data preprocessing |

5.2.3 Urban Planning

KD has emerged as a vital technique in urban planning, particularly in the context of enhancing the efficiency and accuracy of models used for complex tasks such as environmental monitoring, infrastructure management, and resource allocation. For instance, KD has been effectively used to improve the real-time detection of building defects, optimize building extraction from noisy datasets, and enhance the accuracy of traffic flow prediction and travel time estimation. This technique is particularly valuable in scenarios involving large-scale urban data, where it enables the deployment of sophisticated models on resource-constrained devices, such as UAVs and edge computing frameworks, facilitating more efficient management of urban infrastructure and services. By enabling the transfer of knowledge from powerful teacher models to more efficient student models, KD supports the development of robust, scalable solutions that are essential for modern urban planning and the creation of smarter, more responsive cities. For example, Rithanasophon et al. [228] proposed a method that leverages deep CNNs (DCNNs) and KD to evaluate quality of life (QoL) for pedestrians using walkability data collected through virtual reality tools, achieving significant improvements in model performance and computational efficiency. Similarly, Liu et al. [229] introduced UrbanKG, a knowledge graph system that integrates KD for urban data fusion, showing promising results in boosting performance across various urban computing applications. Xu et al. [230] also addressed the challenges of limited training samples in building polygon extraction by proposing BPDNet, a KD-based framework that effectively integrates generalization knowledge from large datasets with task-specific characteristics, resulting in superior performance in complex urban environments.

Federated learning frameworks have also benefited from KD, particularly in the context of land use monitoring and environmental impact assessment. Kanagavelu et al. [185] demonstrated the potential of integrating KD with federated UNet models for the semantic segmentation of satellite and street view images, achieving high accuracy and significant model compression. In a similar vein, Xu et al. [231] developed a KD-based building extraction method that reduces the impact of noise on model performance while maintaining generalization, achieving notable improvements in precision, recall, and IoU metrics. In the context of transportation systems, KD has been utilized to enhance travel time estimation (TTE) models and improve traffic flow prediction. Yang et al. [106] proposed KDTTE, a deep neural network model that employs KD to reduce computation and memory costs while increasing accuracy, significantly outperforming state-of-the-art baselines in TTE tasks. In a different but related task, Li et al. [232] applied deep KD to traffic flow prediction in spatio-temporal networks, demonstrating improvements in both local and global feature perception and achieving better accuracy in traffic predictions.

In the autonomous driving domain and the task of off-road environment segmentation, KD has been instrumental in improving model efficiency and accuracy. Pan et al. [233] developed an end-to-end lane detection method using KD to guide polynomial regression under complex road conditions, achieving competitive results in efficiency and accuracy. Similarly, Kim and An [234] proposed a KD method for segmenting off-road environment range images, resulting in a favorable trade-off between segmentation performance and computational cost, highlighting its effectiveness for autonomous systems.

Lee et al. [235] proposed a high-speed detection method for multi-class defects on residential building façades using KD. The study demonstrated that applying KD to a lightweight DL model significantly improved mean average precision (mAP) by approximately 20% and reduced inference time by 2.5 times, making it more suitable for real-time applications. Moving on, Chen et al. [108] introduced a novel approach to building extraction that utilizes KD to enhance the robustness of the distilled student model. The study employed a multi-teacher collaborative distillation strategy to transfer comprehensive feature knowledge from teacher networks to the student model. The approach demonstrated state-of-the-art performance on multiple datasets, including the Massachusetts Roads Dataset, LRSNY Roads Dataset, and WHU Building Dataset, achieving high IoU scores and improving learning capabilities. Geng et al. [79] developed a lightweight topological space network for road extraction from optical RS images, leveraging KD. The study addressed the challenge of extracting topological features from complex road networks by proposing a topological space loss calculation model. The method resulted in significant improvements in accuracy and computational efficiency, demonstrating a good balance between performance and model size.

In addition, Li et al. [236] proposed an off-policy imitation learning method for autonomous driving that employs task KD. This approach was designed to clone human driving behavior and transfer driving strategies to new, unseen scenarios. The method showed promising results in transferring knowledge to different illumination and weather conditions, enhancing route-following performance in realistic urban driving scenes. Hong et al. [237] introduced a hierarchical edge-decision framework for intelligent transportation systems (ITS) that incorporates KD. The framework enables vehicle-road-cloud cooperation to enhance real-time motion planning by distilling complex spatial-temporal event reasoning into efficient decision-making processes. The method was validated on autonomous driving scenarios, demonstrating improved adaptability to complex environments. Luo et al. [238] presented the KeepEdge framework, which integrates deep neural networks into an edge computing system for UAV-assisted parcel delivery. By employing KD, the study created a lightweight model that maintained high accuracy while reducing the computational load on UAVs. This approach proved effective in complex environments where traditional GPS-based positioning might fail. Pelizari et al. [239] developed a deep multitask learning (MTL) architecture for building characterization using street-level imagery. The study incorporated KD to encode cross-task interdependencies, which improved the generalization capabilities of the model across multiple natural hazards. The proposed MTL methods outperformed traditional single-task learning (STL) models, achieving higher accuracy and efficiency. Table 8 summarizes the aforementioned studies, which demonstrate the versatility of KD in enhancing the efficiency, accuracy, and scalability of models for prediction and classification tasks in urban planning and intelligent transportation systems using RS data.

Table 8: Comparison of KD Studies for Urban Planning
| Ref. | Model(s) Used | Dataset/Data Type | Main Contribution | Best Performance Value Achieved | Limitation |
|---|---|---|---|---|---|
| [228] | DCNNs, LSTM, KD | VR-based questionnaire data | Evaluates walkability using AI and enhances real-time performance through KD | MSE of $7.19\times 10^{-3}$ (within-city) and $9.73\times 10^{-3}$ (across-cities) | Limited to VR data, may not generalize to all environments |
| [229] | FedUKD, UNet | Satellite and street view images | Integrates knowledge graphs for urban data fusion | 97% accuracy on Chennai land use dataset | May struggle with dynamic, heterogeneous urban data |
| [230] | BPDNet | Building polygons | Distills knowledge for generalization in building extraction tasks | IoU of 66.54% | Performance may drop in complex urban settings |
| [185] | FedUKD | Satellite and street view images | Reduces communication costs in land use classification via federated learning | Above 95% accuracy with significant compression | Scalability to other urban data types may be limited |
| [106] | KDTTE | Travel time estimation datasets | Improves travel time estimation with KD | 86.8% accuracy improvement on Porto dataset | Limited generalization to diverse traffic conditions |
| [240] | UrbanKG | Urban spatial-temporal data | Develops an urban knowledge graph for data fusion | Effective in various urban applications | Requires extensive setup and integration |
| [231] | UPerNet, Swin Transformer | Noisy RS images | Enhances building extraction from noisy images with KD | IoU of 81.61% | Dependent on noisy label quality |
| [235] | DCNN | Building façade images | Accelerates defect detection on building façades using KD | 20% mAP increase, 2.5x faster inference | Limited to façade defects, may not generalize |
| [108] | U-Net, DeepLabV3Plus | Road and building datasets | Enhances model robustness via multi-teacher distillation | IoU scores: 48.56%, 79.51%, 81.35% | Teacher weight optimization is still needed |
| [232] | Deep KD Model | Traffic flow datasets | Improves spatio-temporal traffic flow prediction using KD | Accuracy improvement of 0.19 and 0.18 on respective datasets | Focused on local data, may miss global patterns |
| [233] | End-to-End Lane Detection with KD | TuSimple, CULane Datasets | Lane detection method using auxiliary supervision | Competitive accuracy, high efficiency | Post-processing still needed for some tasks |
| [241] | Interaction-aware Trajectory Planning with KD | Real-world driving scenarios | Combines DL with optimization for trajectory planning | Fivefold improvement in computation time | Integration with control paradigms is complex |
| [242] | Lightweight Next Location Prediction Model | Mobility data | Efficient next-location prediction with reduced inference time | 6.57% error reduction, 99.8% faster inference | Focuses on reducing computational load |
| [243] | MJPNet-S* | RGB-T/D data | Trimodal joint-perception network for crowd density estimation | 92% faster, 83% fewer parameters | Reduced resource consumption may impact generalization |
| [79] | TSKD-Road | RS images | Topological network for road extraction with KD | Road IoU: 59.16%, mIoU: 78.49%, F1: 74.15% | Limited to road extraction tasks |
| [234] | MobileNet_v2 DeepLabV3+ with SLKD | Off-road environment dataset | Lightweight model for off-road segmentation using KD | mIoU of 57.28%, low computational cost | Trade-off between accuracy and efficiency |
| [237] | GSCNN | Autonomous driving scenarios | Edge-decision framework for motion skill enhancement | Improved adaptation to dynamic environments | Complexity in real-time implementation |
| [238] | DNN | UAV delivery environments | Edge intelligence framework for UAV positioning | High accuracy with reduced model complexity | Dependent on visual data quality |
| [239] | Deep MTL | Street-level imagery | Cross-task interdependency modeling for building characterization | Accuracy of 88.43% | Complexity in MTL model training |

5.2.4 Oceanographic Monitoring

KD can significantly enhance the efficiency and practicality of AI applications in ocean and sea studies by simplifying complex models for deployment on resource-constrained devices. This includes improving marine wildlife detection, real-time oceanographic monitoring, and underwater object detection by compressing large models into smaller, more efficient versions without compromising accuracy. Additionally, it can aid in climate change prediction and fisheries management, making advanced AI models more accessible and effective for monitoring and analysis in remote or resource-limited environments. In this direction, the authors in [244] explore the application of CNNs in ocean RS, highlighting their effectiveness in tasks such as 3D ocean field reconstruction, image super-resolution, and ocean phenomena forecasting. The study demonstrates significant improvements in classification accuracy for sea ice and open water areas in SAR images and a notable enhancement in image resolution using CNN-based models.

Several studies focus on underwater environments, where detection and analysis face unique challenges due to poor visibility and environmental complexities. Chen et al. [245] propose an online KD framework, Online-XKD, to enhance the accuracy and generalizability of underwater object detection models while maintaining their lightweight nature. Similarly, Ben Tamou et al. [246] present a CNN-based approach for classifying live reef fish species in underwater environments, using incremental learning to maintain high accuracy as new species are added. Another underwater-focused study introduces WaterMono [247], a framework for depth estimation and image enhancement in underwater scenes, leveraging KD to address challenges such as dynamic scenes and image degradation.

In the domain of geophysical field reconstruction, AdaptDeep [248], a self-supervised framework, has been proposed to reconstruct fine-grained spatial structures from coarse-scale geophysical data. The method effectively identifies and recovers detailed information in sea surface temperature fields, demonstrating the potential of domain adaptation techniques in enhancing data resolution and accuracy. Tropical cyclone (TC) wind radii estimation is the focus of Jin et al. [249], who propose a multimodal fusion network, MT-TCNet, and its distillation variant, MT-TCNet-Distill. These models utilize a combination of satellite infrared images, wind field reanalysis, and maximum sustained wind speed data to estimate TC wind radii, achieving superior performance even in scenarios with incomplete data.

In the domain of water segmentation, the challenge of accurately segmenting water areas for unmanned surface vehicles (USVs) has been presented in [250]. The study introduced a multimodal fusion method combining 2D camera images and 3D LiDAR point clouds, utilizing transformers and KD to improve segmentation accuracy and processing speed. Lastly, Yang et al. [251] focus on sea ice segmentation, proposing a CNN-based method enhanced with data augmentation, a novel loss function, and multiscale strategies. Their study achieves high segmentation accuracy using the HRNet-W48 backbone, demonstrating the effectiveness of innovative DL techniques in environmental monitoring. Table 9 provides a summary of the studies that employ KD techniques to improve model performance in the oceanographic remote imaging domain.

Table 9: Summary of Oceanographic Studies that employ KD
| Ref. | Model(s) Used | Dataset/Data Type | Brief Description of Main Contribution | Best Performance Value Achieved | Limitation |
|---|---|---|---|---|---|
| [244] | CNNs | Various ocean RS data | Applied CNNs across multiple ocean RS tasks, including 3D ocean field reconstruction and image super-resolution. | ACC=92.36% for sea ice and open water areas in SAR images | High computational cost and model interpretability challenges. |
| [245] | Online-XKD | URPC2020 dataset | Enhanced feature extraction and generalization in underwater object detection using mutual knowledge transfer in a distillation framework. | 3.6 mAP improvement in student model detection accuracy | Complexity may hinder deployment in low-resource environments. |
| [246] | CNN with incremental learning | LifeClef 2015 Fish dataset | Developed an incremental learning strategy for live reef fish species classification, maintaining high performance on previously learned species. | 81.83% accuracy on LifeClef 2015 Fish benchmark dataset | Scaling to larger datasets or complex environments could be challenging. |
| [248] | AdaptDeep | Coarse and fine-scale geophysical field data | Proposed a self-supervised framework for fine-grained reconstruction of geophysical data using domain adaptation and contrastive learning. | Recovered 81.2% detailed information in sea surface temperature fields | Performance depends on the availability of coarse-scale data and temporal correlations. |
| [247] | WaterMono | Underwater images | Introduced a self-supervised depth estimation framework with image enhancement for underwater environments using KD. | RMSE: 0.945, RMSE log: 0.152 | Limited generalization to diverse camera angles and extreme conditions. |
| [249] | MT-TCNet, MT-TCNet-Distill | Multimodal data including satellite IR images, reanalysis wind fields, and MSW speed | Developed a multimodal fusion network and distillation method for robust TC wind radii estimation with both complete and missing modalities. | R34 estimation: RMSE 22.458 nmi, MAE 16.577 nmi, R-value 0.855; RMW estimation: RMSE 7.958 nmi, MAE 5.689 nmi, R-value 0.738 | Reliance on reanalysis data limits real-time applicability. |
| [250] | Transformer-based multimodal fusion | 2D camera images, 3D LiDAR point clouds | Proposed a water segmentation method using transformers and KD for improved 2D image-based segmentation with faster speed. | Approx. 1.5% improvement in accuracy and MaxF, speed of 15-110 fps | High computational load during training phase, though reduced with distillation. |
| [251] | CNN with HRNet-W48 backbone | Large sea-ice segmentation dataset | Introduced innovative data augmentation, loss function, and multiscale strategies for accurate sea ice segmentation with KD for real-time application. | FWIoU score of 97.8439 | High computational resource requirement for real-time processing. |
| [252] | Tiny YOLO-Lite | SSDD, HRSID, large-scene SAR images | Developed a lightweight SAR ship detector using network pruning and KD to reduce model size and computation while maintaining high accuracy. | Average Precision (AP) of 89.07%, 2.8 MB model size, inference speed >200 fps | Performance may decline with further model size reduction. |

6 Challenges and Limitations

Despite the many advantages that KD techniques offer and the wide range of their applications, they still face several limitations as portrayed in Fig. 11. These challenges are mainly related to the deployment of the models to resource-constrained devices and to keeping the performance of these models high when handling heterogeneous data or data from new, unseen distributions. Finding the balance between model efficiency and prediction accuracy is the key challenge as explained in the following.

Figure 11: Challenges and Limitations of KD in RS.

6.1 Model Complexity and Deployment

In RS applications, KD is often employed to create smaller, more efficient student models by transferring knowledge from a larger, more complex teacher model. The main goal of KD is to retain the high accuracy of the teacher model while reducing the computational load and model size, which is crucial for deployment on resource-constrained devices commonly used in RS. However, the process of optimizing the distillation loss function to achieve this balance between model size and performance is inherently complex and computationally demanding. The distillation loss $\mathcal{L}_{\text{KD}}$, which combines the cross-entropy loss $\mathcal{L}_{\text{CE}}$ and the Kullback-Leibler divergence $\mathcal{L}_{\text{KL}}$, needs to be carefully minimized to ensure that the student model effectively approximates the teacher model. This optimization process becomes more challenging as the complexity of the teacher model increases, leading to a higher computational burden during training [253]. Furthermore, when deploying the distilled student model on resource-constrained devices, such as those used in RS for on-board data processing, the reduced model complexity must still meet the real-time processing requirements and maintain high accuracy. The complexity of optimizing the KD process for deployment is expressed by the computational cost associated with the gradient of the distillation loss with respect to the student model parameters. As this complexity increases, it can lead to longer training times, higher energy consumption, and potentially suboptimal model performance, particularly when deployed in environments with limited computational resources.
In this regard, given a teacher model $T$ with parameters $\theta_T$ and a student model $S$ with parameters $\theta_S$, the distillation loss is defined as:

$$\mathcal{L}_{\text{KD}}(\theta_S) = \alpha\,\mathcal{L}_{\text{CE}}(y, S(x;\theta_S)) + (1-\alpha)\,\mathcal{L}_{\text{KL}}(T(x;\theta_T), S(x;\theta_S)) \tag{5}$$
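As an illustration only, Eq. 5 can be sketched in plain Python for a single sample. The function names, the use of softmax over raw logits, and the optional `temperature` argument are assumptions made for this sketch; Eq. 5 itself contains no temperature, so the default of 1.0 leaves the formula unchanged.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; `temperature` is an assumption, not part of Eq. 5."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(label, probs):
    """Cross-entropy between a hard label (class index) and predicted probabilities."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kd_loss(label, teacher_logits, student_logits, alpha=0.5, temperature=1.0):
    """Weighted sum of the hard-label CE term and the teacher-student KL term (Eq. 5)."""
    student_probs = softmax(student_logits)
    ce = cross_entropy(label, student_probs)
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = kl_divergence(teacher_soft, student_soft)
    return alpha * ce + (1 - alpha) * kl
```

With `alpha = 1` the student is trained on the ground-truth labels only, while `alpha = 0` makes it imitate the teacher's output distribution alone; intermediate values trade the two off.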

The challenge related to the complexity of optimizing this loss function for deployment is defined as follows:

$$\text{Complexity} \propto O\left(\frac{\partial \mathcal{L}_{\text{KD}}}{\partial \theta_S}\right) \tag{6}$$

This complexity can affect the feasibility and effectiveness of deploying KD models in real-world RS scenarios, where computational efficiency and model robustness are critical.
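To make the gradient in Eq. 6 more concrete, it can be written out under simplifying assumptions that are not stated in the text: the student and teacher outputs are softmax distributions $s$ and $t$ over the classes, the label $y$ is one-hot, the KL term is taken as $\mathrm{KL}(t \,\|\, s)$, and no temperature is used. The symbols $z$, $s$, $t$, and $k$ are introduced here purely for illustration. With $z_k$ denoting the student logit for class $k$:

$$\frac{\partial \mathcal{L}_{\text{KD}}}{\partial z_k} = \alpha\,(s_k - y_k) + (1-\alpha)\,(s_k - t_k) = s_k - \big(\alpha\, y_k + (1-\alpha)\, t_k\big)$$

Under these assumptions, each gradient evaluation needs the teacher output $t$, and hence a forward pass through the teacher, in addition to the student's own forward and backward passes. This is one way to see why the training cost grows with the size of the teacher even though only $\theta_S$ is updated.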

6.2 Data Heterogeneity

RS data often comes from multiple modalities, such as optical, SAR, and multispectral sensors. Integrating knowledge across these heterogeneous data sources while maintaining accuracy is challenging, as the characteristics of the data can vary significantly. For multi-modal RS data $x_1, x_2, \dots, x_m$ from different modalities, the aggregate loss is:

$$\mathcal{L}_{\text{multi-modal}}(\theta_S) = \sum_{i=1}^{m} w_i \cdot \mathcal{L}_{\text{KD}}(T_i(x_i;\theta_T), S(x_i;\theta_S)) \tag{7}$$

Typically, the heterogeneity error is defined as:

$$\text{Heterogeneity Error} = \sum_{i=1}^{m} \Big| \mathcal{L}_{\text{KD}}(T_i(x_i;\theta_T), S(x_i;\theta_S)) - \mathcal{L}_{\text{KD}}(T_j(x_j;\theta_T), S(x_j;\theta_S)) \Big| \tag{8}$$

The performance of knowledge distillation can be reduced by data heterogeneity for the following reasons. RS data from different modalities often exhibit vastly different characteristics, leading to inconsistencies in feature representations. The student model may struggle to generalize well across all modalities, resulting in higher overall prediction errors. Moreover, the loss landscape associated with each modality differs due to the inherent characteristics of the data. The aggregate loss function, $\mathcal{L}_{\text{multi-modal}}(\theta_S)$, combines the knowledge from multiple teacher models. The heterogeneity in data causes the loss functions for each modality to diverge, making it difficult to effectively minimize the combined loss. Additionally, the heterogeneity error quantifies the discrepancy between the losses associated with different modalities. A large heterogeneity error indicates misalignment in the distilled knowledge, which can cause the student model to perform poorly on certain modalities. Lastly, determining the appropriate weights $w_i$ for each modality's contribution to the overall loss is challenging. Incorrectly weighting a modality can lead to an imbalance, where the student model prioritizes less important or noisier data, further degrading performance.
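The aggregate loss of Eq. 7 and the heterogeneity error of Eq. 8 can be sketched as follows, assuming the per-modality distillation losses have already been computed. Eq. 8 does not specify how the index $j$ is chosen; this sketch assumes that the sum runs over all unordered pairs of modalities, which is only one possible reading. The function names and the example values are hypothetical.

```python
from itertools import combinations


def multimodal_loss(per_modality_losses, weights):
    """Weighted sum of per-modality KD losses, following Eq. 7."""
    if len(per_modality_losses) != len(weights):
        raise ValueError("one weight is required per modality")
    return sum(w * loss for w, loss in zip(weights, per_modality_losses))


def heterogeneity_error(per_modality_losses):
    """Sum of absolute loss differences over all modality pairs.

    Assumption: Eq. 8 leaves the index j unspecified, so every
    unordered pair (i, j) with i < j is counted once here.
    """
    return sum(abs(a - b) for a, b in combinations(per_modality_losses, 2))


# Hypothetical per-modality KD losses for optical, SAR, and multispectral inputs.
losses = [0.2, 0.5, 0.9]
weights = [0.5, 0.3, 0.2]
aggregate = multimodal_loss(losses, weights)
spread = heterogeneity_error(losses)
```

A small `spread` suggests the student fits all modalities about equally well, whereas a large value points to at least one modality whose teacher is matched much less closely, which is where re-weighting via `weights` would come into play.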

6.3 Overfitting and Generalization

While KD helps reduce the size of models, it can also lead to overfitting, particularly when the student model is trained on a limited dataset. This results in poor generalization to new, unseen data, which is critical for the success of RS applications [35, 254]. The generalization error is given by:

$$\text{Generalization Error} = \mathcal{L}_{\text{KD}}(\theta_S; D_{\text{test}}) - \mathcal{L}_{\text{KD}}(\theta_S; D_{\text{train}}) \tag{9}$$

Overfitting occurs when $\text{Generalization Error} \gg 0$. Its impact on knowledge distillation in RS applications includes the following: (i) the student model performs well on the training data but poorly on unseen test data, which can significantly reduce its effectiveness in real-world RS applications where data variability is high; (ii) overfitted models are often overly sensitive to noise in the training data, which can lead to incorrect predictions on noisy or outlier data in RS, where data quality varies widely across sensors and conditions; (iii) RS data is often collected from multiple modalities (e.g., optical, SAR, multispectral), and an overfitted student model might fail to generalize across them, leading to inconsistent performance and reduced reliability in practical applications; and (iv) overfitting can limit the student model’s ability to transfer learned knowledge to new tasks or domains within RS, reducing the versatility and adaptability of the distilled model.
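As a minimal sketch of Eq. (9), the generalization error can be estimated from per-sample KD losses measured on held-out and training data. The function name, the choice to average over samples, and the loss values are illustrative assumptions rather than part of any surveyed method.

```python
def generalization_gap(test_losses, train_losses):
    """Eq. (9): mean KD loss on test data minus mean KD loss on training data."""
    mean_test = sum(test_losses) / len(test_losses)
    mean_train = sum(train_losses) / len(train_losses)
    return mean_test - mean_train


# Made-up per-sample KD losses for a student that fits the training set closely.
train_losses = [0.20, 0.25, 0.15]
test_losses = [0.80, 0.90, 0.70]

gap = generalization_gap(test_losses, train_losses)
print(f"generalization gap: {gap:.2f}")
```

A gap close to zero suggests the student generalizes about as well as it fits; a large positive gap is the overfitting signature described above.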

6.4 Scalability

As the size of RS datasets increases, the computational complexity of applying KD also increases. This scalability issue can limit the practicality of deploying distilled models on large datasets.

For a dataset of size $N$, the computational complexity scales as:

$$\text{Scalability} \propto O(N \cdot \mathcal{C}(\mathcal{L}_{\text{KD}})) \tag{10}$$

where $\mathcal{C}(\mathcal{L}_{\text{KD}})$ is the computational cost of evaluating the distillation loss.

Scalability challenges in knowledge distillation (KD) for RS applications can significantly affect performance by increasing training times, requiring more computational resources, and decreasing model efficiency. As datasets grow larger, the time needed to train student models becomes prohibitively long, making iterative improvements difficult. Resource constraints, particularly on devices used in RS, limit the ability to handle large datasets effectively. Additionally, scalability issues can reduce the efficiency of distilled models, complicate real-time data processing, and hinder the integration of data from multiple sources, ultimately leading to suboptimal knowledge transfer.

6.5 Limited Real-Time Applicability

In RS applications, real-time processing is often critical, as delays in data processing can render the information outdated and less useful for immediate decision-making. KD aims to create more efficient models, but even distilled models can struggle to reach the required inference speed, especially when the student model itself is complex. The inference time $t_{\text{inference}}$, which depends on both the size of the student model $\theta_S$ and the computational complexity of evaluating the distillation loss $\mathcal{C}(\mathcal{L}_{\text{KD}})$, must be kept within the real-time processing limit $t_{\text{real-time}}$. If $t_{\text{inference}}$ exceeds $t_{\text{real-time}}$, the model cannot process data quickly enough to be useful in time-sensitive RS applications. This limitation can reduce the effectiveness of KD in scenarios where immediate data analysis and decision-making are required [35, 254].

$$t_{\text{inference}} \leq t_{\text{real-time}} \tag{11}$$

where $t_{\text{inference}}$ is the inference time and $t_{\text{real-time}}$ is the allowable time for real-time processing. The inference time can be approximated as:

$$t_{\text{inference}} \approx O(\theta_S) + O(\mathcal{C}(\mathcal{L}_{\text{KD}})) \tag{12}$$
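The constraint in Eq. (11) can be checked against measured latencies. The sketch below is one possible way to do so and rests on assumptions not found in the text: latencies are given in milliseconds, and a tail percentile (nearest-rank method) is compared with the budget instead of a single measurement. The function names and numbers are hypothetical.

```python
def nearest_rank_percentile(values, q):
    """Nearest-rank percentile for an integer q between 1 and 100."""
    ordered = sorted(values)
    rank = (q * len(ordered) + 99) // 100  # integer form of ceil(q * n / 100)
    return ordered[rank - 1]


def meets_real_time(latencies_ms, budget_ms, q=95):
    """Check t_inference <= t_real-time using the q-th percentile latency."""
    return nearest_rank_percentile(latencies_ms, q) <= budget_ms


# Made-up student latencies; one slow outlier dominates the tail.
latencies_ms = [12, 15, 11, 14, 30, 13, 12, 16, 14, 13]

print(meets_real_time(latencies_ms, budget_ms=20))        # uses the 95th percentile
print(meets_real_time(latencies_ms, budget_ms=20, q=50))  # uses the median
```

Checking a tail percentile is a design choice: a student whose median latency fits the budget can still miss deadlines on occasional slow inputs.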

6.6 Dependency on High-Quality Data

The effectiveness of KD is highly contingent on the quality of the training data. In RS applications, the data is often noisy, sparse, or collected under varying conditions, leading to inconsistencies that can adversely affect the distillation process. The distillation loss combines the cross-entropy loss $\mathcal{L}_{\text{CE}}$ between the student model’s predictions and the true labels with the Kullback-Leibler divergence $\mathcal{L}_{\text{KL}}$ between the teacher and student model outputs, and it implicitly assumes that the input data is of high quality. When the data is of poor quality, the student model may struggle to learn effectively from the teacher model, leading to increased errors and reduced generalization capability. As the proportion of low-quality samples (the data quality impact in Eq. (14)) increases, the overall performance of the KD model diminishes. This can result in less robust models that fail to accurately process and interpret RS data, ultimately hindering the effectiveness of KD in real-world RS tasks [255, 256, 96].

$$\mathcal{L}_{\text{KD}}(\theta_S) = \sum_{i=1}^{n} \left( \mathcal{L}_{\text{CE}}(y_i, S(x_i;\theta_S)) + \mathcal{L}_{\text{KL}}(T(x_i;\theta_T), S(x_i;\theta_S)) \right) \tag{13}$$

The impact of data quality can be quantified as the fraction of training samples whose quality score falls below a threshold $\epsilon$:

$$\text{Data Quality Impact} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}(\text{quality}(x_i, y_i) < \epsilon) \tag{14}$$
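To make Eqs. (13) and (14) concrete, the following sketch evaluates both on toy values using only the Python standard library. Several choices are assumptions made for illustration: the models output logits that are converted to probabilities with a softmax, the KL term is taken in the direction teacher-to-student, and the per-sample quality scores are supplied by some external (hypothetical) quality function.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(label, probs):
    """Cross-entropy between an integer class label and predicted probabilities."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kd_loss(labels, teacher_logits, student_logits):
    """Eq. (13): sum over samples of CE(student, label) + KL(teacher || student)."""
    total = 0.0
    for y, t, s in zip(labels, teacher_logits, student_logits):
        p_teacher, p_student = softmax(t), softmax(s)
        total += cross_entropy(y, p_student) + kl_divergence(p_teacher, p_student)
    return total


def data_quality_impact(quality_scores, eps):
    """Eq. (14): fraction of samples whose quality score is below eps."""
    return sum(1 for q in quality_scores if q < eps) / len(quality_scores)


labels = [0, 1]
teacher_logits = [[2.0, 0.5], [0.2, 1.5]]
student_logits = [[1.0, 0.8], [0.4, 0.9]]
print(kd_loss(labels, teacher_logits, student_logits))
print(data_quality_impact([0.9, 0.2, 0.4, 0.8], eps=0.5))  # two of four samples fall below 0.5
```

Eq. (13) as written uses no temperature or weighting between the two terms, so none is introduced here.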

6.7 Balancing Efficiency and Accuracy

One of the key challenges in Knowledge Distillation (KD) is striking the right balance between model efficiency and accuracy. In the context of RS applications, where the stakes are often high, such as in disaster monitoring or environmental protection, compressing the student model too much in the pursuit of efficiency can lead to a significant loss in accuracy. This reduction in accuracy could result in the failure to correctly interpret RS data, leading to erroneous decisions [257, 258]. The trade-off between efficiency and accuracy is represented by the following relationships:

$$\text{Efficiency} \propto \frac{1}{\text{Model Size}(\theta_S)}, \quad \text{Accuracy} \propto \frac{1}{\mathcal{L}_{\text{KD}}(\theta_S)} \tag{15}$$

where $\text{Model Size}(\theta_S)$ refers to the number of parameters in the student model, and $\mathcal{L}_{\text{KD}}(\theta_S)$ is the distillation loss, which is inversely proportional to the accuracy of the student model.

The optimization problem, therefore, involves maximizing the product of efficiency and accuracy:

$$\max_{\theta_S} \left( \frac{1}{\text{Model Size}(\theta_S)} \cdot \frac{1}{\mathcal{L}_{\text{KD}}(\theta_S)} \right) \tag{16}$$

However, this optimization is complex because increasing efficiency (i.e., reducing model size) often leads to a rise in the distillation loss $\mathcal{L}_{\text{KD}}$, which in turn decreases accuracy. Conversely, maintaining high accuracy may require a larger model size, reducing efficiency. In RS, where both computational resources and model performance are critical, failing to achieve an optimal balance can limit the effectiveness of KD. This trade-off must be carefully managed to ensure that the compressed model performs adequately in practical RS scenarios, where both speed and accuracy are essential.
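The objective in Eq. (16) can be illustrated by scoring a handful of candidate students. In this sketch the candidate names, parameter counts, and loss values are made up, and the optional `max_loss` filter is a hypothetical addition that is not part of Eq. (16).

```python
def tradeoff_score(num_params, kd_loss):
    """Objective of Eq. (16): (1 / model size) * (1 / distillation loss)."""
    return 1.0 / (num_params * kd_loss)


def select_student(candidates, max_loss=None):
    """Return the best-scoring candidate, optionally requiring kd_loss <= max_loss."""
    eligible = {
        name: (params, loss)
        for name, (params, loss) in candidates.items()
        if max_loss is None or loss <= max_loss
    }
    if not eligible:
        raise ValueError("no candidate satisfies the loss constraint")
    return max(eligible, key=lambda name: tradeoff_score(*eligible[name]))


# name -> (number of parameters, distillation loss); illustrative values only
candidates = {
    "student_small": (1_000_000, 0.90),
    "student_medium": (5_000_000, 0.40),
    "student_large": (20_000_000, 0.30),
}

print(select_student(candidates))                # unconstrained product objective
print(select_student(candidates, max_loss=0.5))  # with an accuracy floor
```

On these numbers the unconstrained product favors the smallest model, because parameter counts vary far more than losses do. A loss ceiling of this kind is one simple way to keep accuracy from being traded away entirely.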

6.8 Integration Complexity

Integrating KD with other techniques such as multi-modal fusion or domain adaptation introduces significant complexity to the model and its training process, affecting its performance in RS applications. These techniques require careful balancing of multiple loss functions, as the overall performance now depends on the combined effectiveness of KD, multi-modal learning, and domain adaptation. This added complexity can make training more computationally expensive, harder to optimize, and more prone to issues such as overfitting or convergence to suboptimal solutions. For instance, when KD is integrated with other techniques, the overall loss function is expressed as a combination of multiple loss components, each weighted by a coefficient (e.g., $\alpha$, $\beta$, $\gamma$). The need to fine-tune these coefficients to achieve the desired balance between KD, multi-modal fusion, and domain adaptation further complicates the training process [116, 259]. Moreover, the integration complexity, represented by the derivative of the total loss function with respect to the student model parameters, reflects the increased difficulty of optimizing the student model. As the complexity increases, the risk of inefficient training or suboptimal performance also rises, making it challenging to achieve the desired accuracy and efficiency in real-world RS applications.
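The combined loss and its gradient described above are not written out in this section. One plausible form, given here only as an illustrative assumption, is a weighted sum of three components; the subscripts "fusion" and "DA" are labels chosen for this sketch to denote the multi-modal fusion and domain adaptation terms.

```latex
\mathcal{L}_{\text{total}}(\theta_S)
  = \alpha\,\mathcal{L}_{\text{KD}}(\theta_S)
  + \beta\,\mathcal{L}_{\text{fusion}}(\theta_S)
  + \gamma\,\mathcal{L}_{\text{DA}}(\theta_S),
\qquad
\nabla_{\theta_S}\mathcal{L}_{\text{total}}
  = \alpha\,\nabla_{\theta_S}\mathcal{L}_{\text{KD}}
  + \beta\,\nabla_{\theta_S}\mathcal{L}_{\text{fusion}}
  + \gamma\,\nabla_{\theta_S}\mathcal{L}_{\text{DA}}
```

Under this form, the three gradient terms may point in conflicting directions, which is one way to read the optimization difficulty noted above.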

7 Future Directions

7.1 Advanced Model Compression Techniques

7.1.1 Dynamic Distillation

Dynamic Distillation is a technique within the broader category of advanced model compression. It aims to optimize the student model’s performance by dynamically adjusting its complexity based on the specific characteristics of the input data or the task at hand. The core idea behind dynamic distillation is to create a flexible and adaptive student model that can efficiently learn from the teacher model without being overly constrained by a fixed architecture or predetermined level of complexity [260, 261]. In many RS applications, the complexity of the input data can vary significantly. For example, a satellite image of a dense urban area may contain more intricate features than an image of a rural landscape. A static, one-size-fits-all student model may struggle to balance performance across such diverse inputs. Dynamic distillation allows the student model to adapt its architecture or parameters based on the specific task, ensuring that it allocates resources efficiently [262, 263]. Not all inputs require the same level of processing. Dynamic distillation enables the student model to adjust its complexity (e.g., the number of layers, the size of feature maps, or the degree of feature extraction) depending on the input. For instance, simpler inputs might be processed with a reduced version of the student model, while more complex inputs trigger a more detailed processing approach [264]. The term "dynamic" implies that these adjustments occur in real-time or near-real-time, during the inference phase. This is particularly useful in resource-constrained environments, such as edge devices or real-time RS applications, where computational resources are limited. By making on-the-fly adjustments, the model can maintain high performance while conserving resources [105]. The dynamic distillation process can be expressed as an optimization problem:

$$\min_{\theta_S} \mathbb{E}_{x \sim \mathcal{D}} \left[ \lambda(x) \cdot \mathcal{L}_{\text{KD}}(T(x;\theta_T), S(x;\theta_S)) \right] \tag{17}$$

where $\theta_S$ represents the parameters of the student model that need to be optimized. The input sample, denoted as $x$, is drawn from the data distribution $\mathcal{D}$. The teacher model, parameterized by $\theta_T$, produces an output $T(x;\theta_T)$ for the input $x$, while the student model, with parameters $\theta_S$, generates an output $S(x;\theta_S)$ for the same input. The KD loss function, $\mathcal{L}_{\text{KD}}$, typically quantifies the difference between the teacher’s and student’s outputs. Additionally, $\lambda(x)$ is a dynamic weighting function that adjusts the contribution of each input $x$ to the overall loss, depending on its complexity or the specific requirements of the task.

Specifically, in dynamic distillation, the factor $\lambda(x)$ acts as a critical gatekeeper, adjusting the influence of each input on the student model’s training. For complex or critical inputs, $\lambda(x)$ increases, prompting the student model to allocate more resources, such as deeper layers or enhanced feature extraction. Conversely, simpler inputs lead to a lower $\lambda(x)$, allowing the student model to process the data more efficiently with reduced resources. The KD loss function $\mathcal{L}_{\text{KD}}$ is central to this process, focusing on minimizing the difference between the teacher and student models’ outputs to ensure the student effectively mimics the teacher. This approach is generalized across the entire dataset, as captured by the expectation $\mathbb{E}_{x \sim \mathcal{D}}$, optimizing the student model’s performance across diverse inputs.
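The weighted expectation in Eq. (17) can be approximated on a batch as the mean of $\lambda(x)$ times the per-sample KD loss. The sketch below assumes that per-sample KD losses and a complexity score in [0, 1] are already available; the linear form of $\lambda(x)$ and all numbers are hypothetical choices for illustration.

```python
def dynamic_kd_objective(kd_losses, complexities, weight_fn):
    """Batch estimate of Eq. (17): mean of lambda(x) * L_KD over the samples."""
    if len(kd_losses) != len(complexities):
        raise ValueError("one complexity score is needed per sample")
    weighted = [weight_fn(c) * loss for c, loss in zip(complexities, kd_losses)]
    return sum(weighted) / len(weighted)


def linear_weight(complexity):
    """Hypothetical lambda(x): grows linearly with a complexity score in [0, 1]."""
    return 0.5 + complexity


kd_losses = [0.2, 0.6]     # per-sample KD losses (made-up values)
complexities = [0.1, 0.9]  # e.g. a rural scene and a dense urban scene

objective = dynamic_kd_objective(kd_losses, complexities, linear_weight)
print(f"weighted KD objective: {objective:.2f}")
```

With this weighting, the complex sample contributes 1.4 times its loss and the simple one 0.6 times, so gradient updates are dominated by the harder input.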

Dynamic distillation enhances resource efficiency by dynamically adjusting the student model’s complexity based on the input, ensuring that computational resources are used optimally, especially in resource-constrained environments. This adaptability allows the student model to maintain or even surpass the performance of static models, particularly when dealing with heterogeneous datasets like those in RS. Additionally, the scalability of dynamic distillation makes it a versatile solution that is suitable for deploying ML models across various environments, from cloud-based systems to edge devices. Typically, in RS, dynamic distillation could be applied to urban monitoring where satellite images vary significantly between dense urban areas and sparse rural regions. For instance, when analyzing satellite imagery for urban heat island detection, the student model could dynamically adjust its complexity, using more layers and features for complex urban environments with varied structures, while simplifying its approach for less complex rural landscapes, thus optimizing processing efficiency and accuracy.

7.1.2 Layer-Wise Distillation

Layer-wise distillation is an advanced technique in model compression that focuses on transferring knowledge from a teacher model to a student model at a more granular level [265]. Unlike traditional KD, which typically focuses on aligning the final outputs of the teacher and student models, layer-wise distillation involves aligning the outputs of corresponding layers in both models [266]. This approach ensures that the student model learns not only the final output distribution but also the intermediate representations that the teacher model uses to arrive at that output [267]. Typically, layer-wise distillation enables a more effective transfer of knowledge by focusing on the outputs of different layers within a complex model, where each layer captures varying levels of abstraction—from basic features like edges to more complex patterns [268]. This approach allows for tailored compression by assigning different importance to each layer, ensuring that critical features are preserved while less important layers are compressed more aggressively. As a result, the student model becomes more compact and maintains or improves performance, particularly in tasks requiring detailed understanding, such as detecting and classifying intricate patterns in RS applications [269]. The layer-wise distillation process can be expressed as follows:

$$\mathcal{L}_{\text{KD-layer}} = \sum_{l=1}^{L} w_l \cdot \mathcal{L}_{\text{KD}}(T^l(x), S^l(x)) \tag{18}$$

where $T^l(x)$ and $S^l(x)$ denote the outputs of the $l$-th layer of the teacher and student models, respectively, for a given input $x$; $\mathcal{L}_{\text{KD}}$ is the knowledge distillation loss applied to these corresponding layer outputs; $w_l$ is the weight assigned to the $l$-th layer, indicating its significance in the distillation process; and $L$ is the total number of layers in the model.

Applying the KD loss layer-wise encourages the student model to learn the same feature representations as the teacher model at each stage, ensuring close alignment between corresponding layers. The layer-specific weight $w_l$ allows the importance of each layer to be tuned, with critical layers in the teacher model given higher weights so that their knowledge is transferred effectively. Finally, the summation over all $L$ layers ensures that the distillation process covers the entire model, enabling the student model to replicate the full range of the teacher model’s capabilities.
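A minimal sketch of Eq. (18) follows. It assumes, for illustration only, that the per-layer loss is a mean squared error between flattened feature vectors and that teacher and student features at each aligned layer have the same length; when they differ, some projection of the student features is usually needed first. The feature values and weights are made up.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("teacher and student features must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def layerwise_kd_loss(teacher_feats, student_feats, weights):
    """Eq. (18): weighted sum over layers of the per-layer distillation loss."""
    if not (len(teacher_feats) == len(student_feats) == len(weights)):
        raise ValueError("need one weight and one feature pair per layer")
    return sum(w * mse(t, s) for w, t, s in zip(weights, teacher_feats, student_feats))


# Two aligned layers; the student matches the teacher only at the second one.
teacher_feats = [[1.0, 2.0], [0.5, 0.5, 0.5]]
student_feats = [[1.0, 1.0], [0.5, 0.5, 0.5]]
weights = [0.25, 0.75]

loss = layerwise_kd_loss(teacher_feats, student_feats, weights)
print(loss)
```

Raising the weight of the first layer would increase the penalty for the mismatch there, which is how $w_l$ steers which representations the student prioritizes.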

Layer-wise distillation offers several practical benefits, including enhanced feature preservation, where focusing on each layer ensures that the student model retains the critical features learned by the teacher, leading to greater accuracy and capability. The flexibility in compression, enabled by layer-specific weights, allows for optimizing the trade-off between model size and performance, depending on the application’s needs. Additionally, this approach fosters better generalization, as the student model is trained to replicate the hierarchical representations of the teacher model, making it more adept at handling new data, particularly in tasks that require detailed feature extraction and classification. In RS, layer-wise distillation can be applied to multispectral image classification, where different spectral bands capture varying levels of detail. By aligning the outputs of corresponding layers in both the teacher and student models, this technique ensures that the student model effectively learns the intermediate features critical for distinguishing complex land cover types, such as differentiating between various crop types or identifying subtle changes in vegetation health.

7.2 Efficient Training and Inference

7.2.1 Low-Cost Training Algorithms

Low-cost training algorithms aim to reduce the computational burden of training both teacher and student models, which is particularly important in KD, where the goal is to make the student model as efficient as possible. These algorithms optimize various aspects of the training process to minimize cost while maintaining or even enhancing the performance of the distilled models [270]. Relevant techniques include federated learning, transfer learning, and the use of smaller proxy datasets for initial training [271]. Federated learning is a distributed approach that allows training to occur across multiple devices or servers without the need to centralize the data. This can significantly reduce the computational cost associated with data processing and model training by distributing the workload [272]. Each device trains a local model using its own data and periodically shares updates with a central server, which aggregates these updates to improve the global model. This approach not only reduces the computational burden on individual devices but also enhances privacy, since raw data is not shared [273, 274]. Transfer learning involves taking a pre-trained model (often trained on a large dataset) and fine-tuning it on a smaller, task-specific dataset. This approach can drastically reduce the training cost because the model has already learned general features from the larger dataset, and only minimal additional training is required to adapt it to the new task. In the context of KD, transfer learning can be used to initialize the teacher model, which then distills its knowledge to a student model with minimal additional training [275, 276]. Using smaller proxy datasets for initial training can also reduce costs. Proxy datasets are subsets of the original data or synthetic datasets that approximate the characteristics of the full dataset but are much smaller in size. Training on these datasets requires fewer resources and can provide a good initial model that can be further refined with the full dataset. This approach is particularly useful in scenarios where obtaining labeled data is expensive or time-consuming [277, 278]. The cost of training both the teacher and student models can be expressed as:

$$C_{\text{train}} = \sum_{i=1}^{N} \left( C_{\text{data}}(x_i) + C_{\text{model}}(T, S) \right) \tag{19}$$

where $C_{\text{data}}(x_i)$ denotes the cost associated with processing each data sample $x_i$, which encompasses activities such as data loading, augmentation, and preprocessing, while $C_{\text{model}}(T, S)$ refers to the cost incurred during training for updating both the teacher model $T$ and the student model $S$. This formulation highlights that the total training cost $C_{\text{train}}$ is the sum of the costs associated with processing all data samples and updating the models. Optimizing these components, for example by reducing the number of data samples with proxy datasets, distributing the training workload with federated learning, or leveraging pre-trained models with transfer learning, can significantly reduce the overall training cost [271]. Low-cost training algorithms enhance resource efficiency, allowing organizations to train effective models even in environments with limited computational resources, which is particularly valuable in fields like RS that involve large datasets and complex models. These algorithms also support scalability, enabling the handling of extensive datasets and sophisticated models without a proportional increase in resource demands, making them adaptable to various environments from cloud servers to edge devices [279]. Additionally, by reducing training costs and time, these algorithms facilitate faster development cycles, allowing for quicker iteration and deployment of ML models, which is essential in rapidly evolving fields like AI [280].
Low-cost training algorithms can be applied to disaster response scenarios, where real-time analysis of satellite imagery is crucial. For example, federated learning can be used to train models across multiple local servers situated near disaster zones, enabling rapid analysis of satellite images for damage assessment without the need for extensive centralized computing resources. Transfer learning can further enhance this process by fine-tuning pre-trained models on smaller, region-specific datasets, ensuring swift deployment and effective monitoring during critical events.
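Two of the ideas above can be sketched in a few lines: the cost model of Eq. (19) and the server-side aggregation step of federated learning. The aggregation shown is a dataset-size-weighted average of client parameters, which is an assumed scheme in the style of federated averaging; the function names, cost units, and numbers are hypothetical.

```python
def training_cost(data_costs, model_cost):
    """Eq. (19): sum over samples of data cost plus per-sample model-update cost."""
    return sum(c + model_cost for c in data_costs)


def federated_average(client_params, client_sizes):
    """Average client parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[j] * n for params, n in zip(client_params, client_sizes)) / total
        for j in range(dim)
    ]


# A proxy dataset with 10% of the samples cuts the cost of Eq. (19) tenfold
# when per-sample costs are uniform (arbitrary cost units).
full_cost = training_cost([1.0] * 1000, model_cost=2.0)
proxy_cost = training_cost([1.0] * 100, model_cost=2.0)
print(full_cost, proxy_cost)

# Two local servers with 1 and 3 image tiles; the larger one dominates the average.
global_params = federated_average([[1.0, 2.0], [3.0, 4.0]], client_sizes=[1, 3])
print(global_params)
```

Only parameter vectors cross the network in the aggregation step, which matches the privacy argument made for federated learning above.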

7.2.2 Hardware-Aware Distillation

Hardware-aware distillation integrates KD with hardware-specific design and optimization strategies so that the resulting models are tailored to the hardware on which they will be deployed, such as edge devices or GPUs. The approach balances the distillation process against the computational capabilities of the target hardware [281]. It is particularly useful when the model must run on edge devices, GPUs, or other specialized hardware, since it ensures that the distilled model operates within the physical and computational limits of the target platform [282].

Hardware-aware distillation seeks to balance the effectiveness of the knowledge transfer process with the computational efficiency required by the target hardware. The goal is to ensure that the student model retains as much of the teacher model’s performance as possible while also fitting within the hardware’s resource constraints [283]. This involves careful consideration of factors such as memory usage, processing speed, and power consumption, which are critical in environments like mobile devices, embedded systems, or cloud-based GPUs [284]. Different hardware platforms have varying capabilities and limitations. For example, GPUs excel at parallel processing but may have limited memory bandwidth, while edge devices often have strict power and computational limits. Hardware-aware distillation tailors the student model to leverage the strengths of the target hardware while minimizing its weaknesses. This could involve optimizing the model’s architecture to reduce the number of parameters, simplify computations, or increase parallelism, depending on the hardware’s characteristics [285]. The regularization parameter $\lambda$ in the hardware-aware distillation framework controls the trade-off between the accuracy of the distilled model and its hardware efficiency. A higher $\lambda$ places more emphasis on minimizing computational costs, potentially sacrificing some accuracy for greater efficiency. Conversely, a lower $\lambda$ prioritizes accuracy, allowing for more complex models that may require more computational resources. The choice of $\lambda$ depends on the specific requirements of the application and the hardware. The optimization objective for hardware-aware distillation can be expressed as:

$$\min_{\theta_S} \mathcal{L}_{\text{KD}}(T, S) + \lambda \cdot C_{\text{hardware}}(\theta_S) \tag{20}$$

where $\mathcal{L}_{\text{KD}}(T, S)$ is the knowledge distillation loss, which quantifies the discrepancy between the outputs of the teacher model $T$ and the student model $S$. The term $C_{\text{hardware}}(\theta_S)$ represents the computational cost of running the student model $S$, with parameters $\theta_S$, on specific hardware, encompassing factors such as inference time, memory usage, and power consumption. The regularization parameter $\lambda$ balances the trade-off between reducing the distillation loss and optimizing hardware efficiency. This formulation encourages a student model that is not only accurate but also suited to the computational environment in which it will be deployed. By tailoring the student model to specific hardware, hardware-aware distillation makes it more feasible to deploy advanced ML models in resource-constrained environments, which is particularly valuable for edge computing scenarios with strict power, memory, and processing limits. The approach can improve performance on the target hardware through faster inference, lower power consumption, and more efficient memory usage. It also allows customization to the requirements of different deployment environments, whether high-performance GPUs in data centers or low-power microcontrollers in IoT devices [286]. Hardware-aware distillation can be applied to real-time monitoring on edge devices, such as drones used for precision agriculture. By optimizing the student model to operate efficiently within the power and computational constraints of these drones, the model can quickly process high-resolution imagery to detect crop health issues or identify weeds, enabling swift, in-field decision-making without relying on cloud-based resources. Advanced analysis can thus be performed directly on the edge, even in remote or resource-limited environments.
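The objective in Eq. (20) can be illustrated with a small Python sketch. It is not taken from any of the cited works: the KD loss is assumed to be a temperature-softened KL divergence (one common choice), `hardware_cost` is a hypothetical proxy for $C_{\text{hardware}}$ that adds budget-normalized parameter count and latency, and all logits, budgets, and cost figures are made-up illustrative values.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Assumed KD loss: KL(teacher || student) on softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def hardware_cost(num_params, latency_ms, param_budget, latency_budget_ms):
    """Hypothetical proxy for C_hardware: normalized size plus normalized latency."""
    return num_params / param_budget + latency_ms / latency_budget_ms

def objective(teacher_logits, student_logits, cost, lam):
    """Eq. (20): KD loss plus lambda times the hardware cost."""
    return kd_loss(teacher_logits, student_logits) + lam * cost

def pick_student(teacher_logits, candidates, lam):
    """Return the name of the candidate student with the lowest objective."""
    return min(
        candidates,
        key=lambda name: objective(
            teacher_logits,
            candidates[name]["logits"],
            candidates[name]["cost"],
            lam,
        ),
    )

# Made-up example: the larger student tracks the teacher more closely but
# exceeds the assumed budgets; the smaller student fits within them.
TEACHER_LOGITS = [4.0, 1.0, 0.5]
CANDIDATES = {
    "large": {"logits": [3.8, 1.1, 0.6],
              "cost": hardware_cost(2_000_000, 40.0, 1_000_000, 20.0)},
    "small": {"logits": [2.5, 1.5, 1.0],
              "cost": hardware_cost(250_000, 5.0, 1_000_000, 20.0)},
}

if __name__ == "__main__":
    for lam in (0.0, 0.01, 1.0):
        print(lam, pick_student(TEACHER_LOGITS, CANDIDATES, lam))
```

In this toy setting, sweeping $\lambda$ shifts the preferred candidate from the more faithful student to the cheaper one, which is the trade-off the text describes. In practice the cost term would come from profiling on the target device rather than from a fixed formula.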

7.3 Improving Data Quality and Robustness

7.3.1 Robust Distillation Against Noisy Data

The effectiveness of KD relies heavily on the quality and quantity of training data. Robust distillation techniques are needed to handle the noisy, sparse, or imbalanced datasets that are common in RS. Robust distillation against noisy data focuses on enhancing the resilience of the KD process when dealing with imperfect data [287]. In real-world applications, particularly in RS, datasets often contain noise, inconsistencies, or imbalances that can degrade the performance of ML models. This approach aims to mitigate such issues by incorporating robustness into the distillation process, so that the student model can still learn effectively when the data is not ideal [288]. In RS, noise arises from factors such as sensor errors, atmospheric interference, or mislabeling during data collection. Standard KD techniques may struggle with such data, leading to poor model performance [289, 290]. Robust distillation techniques address this by explicitly modeling and compensating for the noise during training, for example by identifying noisy samples and either correcting them or down-weighting their influence on the student model [291, 292]. The key to robust distillation is modifying the loss function to account for the presence of noise. The traditional KD loss, which measures the difference between the teacher and student models, is augmented with a term that penalizes the model based on the amount of noise in the data. This term is controlled by a weighting factor $\eta$, which determines how much influence the noise has on the overall learning process. As a result, the student model becomes more robust to the effects of noise, learning to focus on cleaner, more reliable data [293]. Robust distillation techniques must strike a balance between learning from noisy and clean data. While it is important to minimize the negative impact of noise, completely ignoring noisy data could result in a loss of valuable information [294]. These techniques therefore aim to let the student model extract useful knowledge from noisy data while minimizing the distortion it causes [295]. The formulation for robust distillation against noisy data can be expressed as:

$$\mathcal{L}_{\text{KD-robust}} = \mathbb{E}_{x, \tilde{x} \sim \mathcal{D}} \left[ \mathcal{L}_{\text{KD}}(T(\tilde{x}), S(x)) + \eta \cdot \text{Noise}(x, \tilde{x}) \right] \tag{21}$$

where $\mathcal{L}_{\text{KD}}(T(\tilde{x}), S(x))$ denotes the standard knowledge distillation loss, capturing the discrepancy between the teacher model’s output on noisy data $\tilde{x}$ and the student model’s output on the corresponding clean data $x$. The weighting factor $\eta$ regulates the influence of the noise term in the overall loss function, with a higher $\eta$ placing greater emphasis on noise correction, thereby enhancing the model’s robustness to noise. The term $\text{Noise}(x, \tilde{x})$ quantifies the noise level between the clean and noisy data, using metrics such as mean squared error (MSE) or other relevant measures of data corruption.

This formulation keeps the distillation process effective in the presence of noisy data by integrating a noise-handling mechanism directly into the training objective. Robust distillation techniques enable student models to handle real-world data imperfections, resulting in more reliable performance in practical applications like RS, where data quality can vary [296]. These techniques also improve generalization across diverse data conditions by incorporating noise handling into the training process, so that models perform well in both clean and noisy environments. Additionally, robust distillation is particularly beneficial for sparse or imbalanced datasets, allowing student models to learn effectively from limited or unevenly distributed data while reducing the risk of overfitting to noisy or rare examples [297]. In RS, robust distillation against noisy data can be applied to cloud detection in satellite imagery, where cloud cover often introduces noise that obscures ground features. With robust distillation, a student model can be trained to detect clouds accurately even in images with varying levels of noise caused by atmospheric conditions, yielding more reliable and consistent results for subsequent analyses such as land use classification or vegetation monitoring.
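A minimal Python sketch of Eq. (21) is given below, under stated assumptions: the KD loss is taken to be a temperature-softened KL divergence, $\text{Noise}(x, \tilde{x})$ is taken to be the MSE between clean and noisy inputs, and the expectation is approximated by a mean over a small batch of pairs. The toy linear "models" and all numbers are hypothetical and serve only to make the example runnable.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Assumed KD loss: KL(teacher || student) on softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def mse(x, x_noisy):
    """Assumed Noise(x, x_noisy): mean squared error between the two inputs."""
    return sum((a - b) ** 2 for a, b in zip(x, x_noisy)) / len(x)

def linear_model(weights):
    """Toy stand-in for a network: each row of `weights` yields one logit."""
    def forward(x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]
    return forward

def robust_kd_loss(pairs, teacher, student, eta):
    """Eq. (21): batch mean of KD(T(x_noisy), S(x)) + eta * Noise(x, x_noisy)."""
    total = 0.0
    for x, x_noisy in pairs:
        total += kd_loss(teacher(x_noisy), student(x)) + eta * mse(x, x_noisy)
    return total / len(pairs)

# Hypothetical toy models and (clean, noisy) input pairs.
TEACHER = linear_model([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
STUDENT = linear_model([[0.8, 0.1], [0.1, 0.9], [0.4, 0.4]])
PAIRS = [([1.0, 2.0], [1.1, 1.9]), ([0.5, 0.0], [0.4, 0.2])]

if __name__ == "__main__":
    for eta in (0.0, 1.0, 10.0):
        print(eta, robust_kd_loss(PAIRS, TEACHER, STUDENT, eta))
```

One observation from this sketch: with $\text{Noise}(x, \tilde{x})$ computed purely from the inputs, the second term does not depend on the student parameters, so it changes the value of the loss but not its gradient. Implementations that need the noise estimate to influence training would have to use it differently, for instance as a per-sample weight on the KD term; that variant is an assumption and is not shown here.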

7.3.2 Semi-Supervised and Unsupervised Distillation

Semi-supervised and unsupervised distillation is an emerging area of research in KD, particularly relevant for fields like RS where labeled data is often scarce or expensive to obtain. The traditional KD process relies heavily on labeled datasets to transfer knowledge from a teacher model to a student model [298]. However, in many practical scenarios, especially in RS, obtaining a large volume of labeled data is challenging. To address this, semi-supervised and unsupervised distillation techniques leverage the abundant unlabeled data available, reducing the dependency on labeled datasets and making the distillation process more robust and scalable [299].

Semi-supervised and unsupervised distillation techniques leverage both labeled and unlabeled data during training to enhance model generalization and scalability. Labeled data provides accurate supervision, while unlabeled data exposes the model to a broader range of scenarios, improving its ability to generalize to new situations [300]. In semi-supervised distillation, a parameter $\alpha$ balances the influence of labeled and unlabeled data, allowing the model to rely more on labeled data initially and gradually incorporate more from the unlabeled data. In purely unsupervised distillation, the student model learns directly from the teacher model’s predictions, using them as pseudo-labels [301]. These approaches are particularly valuable in RS, where large quantities of data are available but often lack comprehensive labeling, enabling the development of robust models that efficiently handle vast amounts of unlabeled data [302]. The loss function for semi-supervised KD can be expressed as:

$$\mathcal{L}_{\text{KD-semi}} = \alpha \cdot \mathcal{L}_{\text{KD}}(T, S; \mathcal{D}_{\text{labeled}}) + (1 - \alpha) \cdot \mathcal{L}_{\text{KD}}(T, S; \mathcal{D}_{\text{unlabeled}}) \tag{22}$$

The overall loss function includes $\mathcal{L}_{\text{KD}}(T, S; \mathcal{D}_{\text{labeled}})$, the KD loss computed on the labeled dataset $\mathcal{D}_{\text{labeled}}$, and $\mathcal{L}_{\text{KD}}(T, S; \mathcal{D}_{\text{unlabeled}})$, the distillation loss computed on the unlabeled dataset $\mathcal{D}_{\text{unlabeled}}$, where the teacher model’s predictions serve as pseudo-labels. The weighting factor $\alpha$ adjusts the relative influence of the labeled and unlabeled data on the overall loss function.

The above formulation allows the student model to learn from both labeled and unlabeled data, making the training process more flexible and less dependent on extensive labeled datasets. Semi-supervised and unsupervised distillation techniques offer practical advantages by reducing the reliance on large, high-quality labeled datasets, making model training more feasible in resource-constrained environments. These methods enhance generalization by exposing models to a wider variety of data patterns, allowing them to perform well on new, unseen data, an essential capability in fields like RS, where data diversity is high. Additionally, these techniques provide a scalable approach to model training, enabling organizations to leverage vast amounts of unlabeled data to build robust models without extensive manual labeling efforts. Semi-supervised and unsupervised distillation can be applied to satellite imagery for land cover classification, where acquiring labeled data for every land type is impractical. With semi-supervised distillation, a model can initially learn from a small set of labeled images and then leverage the large pool of unlabeled satellite images, for which the teacher model provides pseudo-labels to refine the student model’s performance.
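The weighting in Eq. (22) can be sketched in a few lines of Python. This is an illustration under assumptions rather than a reference implementation: the KD loss is assumed to be a temperature-softened KL divergence, each batch is represented directly as pairs of teacher and student logits, and `alpha_schedule` is a hypothetical linear schedule for the "rely on labeled data first" behavior described above. All numeric values are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Assumed KD loss: KL(teacher || student) on softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def mean_kd(batch):
    """Mean KD loss over a batch of (teacher_logits, student_logits) pairs."""
    return sum(kd_loss(t, s) for t, s in batch) / len(batch)

def semi_supervised_kd_loss(labeled_batch, unlabeled_batch, alpha):
    """Eq. (22): alpha * KD on labeled data + (1 - alpha) * KD on unlabeled data."""
    return alpha * mean_kd(labeled_batch) + (1.0 - alpha) * mean_kd(unlabeled_batch)

def pseudo_label(teacher_logits):
    """Hard pseudo-label: index of the teacher's highest logit."""
    return max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])

def alpha_schedule(step, total_steps, start=0.9, end=0.5):
    """Hypothetical linear schedule moving alpha from `start` to `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

# Made-up logits for one labeled sample and two unlabeled samples.
LABELED = [([3.0, 0.5], [2.0, 1.0])]
UNLABELED = [([0.2, 2.5], [0.5, 1.5]), ([1.0, 1.2], [1.5, 0.5])]

if __name__ == "__main__":
    for step in (0, 5, 10):
        alpha = alpha_schedule(step, 10)
        print(step, alpha, semi_supervised_kd_loss(LABELED, UNLABELED, alpha))
```

Setting `alpha` to 0 recovers the purely unsupervised case, in which the student is trained only against the teacher's predictions on unlabeled data.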

7.4 Scalability Solutions

7.4.1 Distributed Distillation

Distributed distillation is an advanced approach designed to scale the KD process to larger datasets by leveraging distributed learning frameworks. This method spreads the computational load of both the teacher and student models across multiple nodes or devices, making it more feasible to handle large-scale data. The key idea is to perform distillation in a parallel or distributed manner, where each node or device processes a subset of the data, thus reducing the individual computational burden and allowing for more efficient training [303]. In distributed distillation, the overall training task is divided among multiple nodes in a distributed learning framework. Each node independently performs a portion of the distillation process, working with its local subset of data. This division of labor helps manage the computational load, enabling the distillation process to scale with the size of the dataset. The use of multiple nodes allows for parallel processing, which significantly speeds up training [304]. Each node in the distributed system hosts a teacher model and a student model, denoted $T_k$ and $S_k$, respectively, where $k$ is the node index. These models operate on the local data available to that node. The student model on each node learns from its corresponding teacher model, capturing knowledge specific to that subset of the data. This localized learning allows the student models, when combined, to capture diverse knowledge from the entire dataset [305]. After the distributed distillation process is complete, the student models from all nodes are aggregated to form a comprehensive model that incorporates the knowledge distilled from all parts of the dataset. The aggregation can be done by averaging the model weights or by combining the outputs of the student models. The objective function $\mathcal{L}_{\text{KD-distributed}}$ is averaged over all nodes, so that the final student model reflects balanced learning from the entire distributed dataset [306]. The primary advantage of distributed distillation is its scalability. By distributing the workload, it becomes feasible to train on extremely large datasets that would otherwise be impractical to process on a single machine. This approach is particularly useful in scenarios like RS, where data is often collected in large quantities from multiple sources and needs to be processed efficiently. Distributed distillation allows for more rapid training and deployment of models, making it an effective solution for handling big data in ML [307, 308]. The loss function for distributed distillation is given by:

$$\mathcal{L}_{\text{KD-distributed}} = \frac{1}{K} \sum_{k=1}^{K} \mathcal{L}_{\text{KD}}(T_k, S_k) \tag{23}$$

where $\mathcal{L}_{\text{KD}}(T_k, S_k)$ represents the KD loss computed on the $k$-th node, with $T_k$ and $S_k$ being the teacher and student models on that node. The term $K$ denotes the total number of nodes involved in the distributed learning process. By averaging the distillation losses across all nodes, the overall objective function ensures that the student model benefits from the collective knowledge distributed across the dataset [306]. Distributed distillation can be utilized in the analysis of global satellite imagery, where the vast amount of data is processed across multiple computational nodes. For instance, in mapping land use changes over large geographic regions, distributed distillation allows each node to handle different segments of the satellite images, collectively training a student model that integrates insights from all segments, leading to a comprehensive and scalable approach for detecting and classifying land cover changes [309].
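The averaging in Eq. (23) and the weight-averaging aggregation mentioned above can be sketched as follows. The sketch simulates the $K$ nodes sequentially in a single Python process, which is an assumption made for brevity; a real deployment would rely on a distributed training framework. The KD loss is assumed to be a temperature-softened KL divergence, and the per-node logits and weight vectors are made-up values.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Assumed KD loss: KL(teacher || student) on softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def node_loss(local_batch):
    """KD loss of node k: mean over its local (T_k output, S_k output) pairs."""
    return sum(kd_loss(t, s) for t, s in local_batch) / len(local_batch)

def distributed_kd_loss(node_batches):
    """Eq. (23): unweighted average of the per-node KD losses."""
    return sum(node_loss(batch) for batch in node_batches) / len(node_batches)

def average_weights(node_weights):
    """Aggregate student models by element-wise averaging of flat weight vectors."""
    num_nodes = len(node_weights)
    return [sum(column) / num_nodes for column in zip(*node_weights)]

# Made-up data for K = 2 nodes; the second node holds twice as many samples.
NODE_BATCHES = [
    [([2.0, 0.0], [1.0, 0.5])],
    [([0.0, 3.0], [0.5, 2.0]), ([1.0, 1.0], [1.0, 1.0])],
]
NODE_WEIGHTS = [[1.0, 2.0], [3.0, 4.0]]

if __name__ == "__main__":
    print("global loss:", distributed_kd_loss(NODE_BATCHES))
    print("averaged student weights:", average_weights(NODE_WEIGHTS))
```

Note that Eq. (23) weights every node equally regardless of how much data it holds. Weighting nodes by local sample count is a common alternative in distributed training, but it is not part of the formulation above.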

7.4.2 Incremental Distillation

Incremental distillation is a technique designed to efficiently handle the continuous influx of new data without the need to retrain the student model from scratch each time new information becomes available. This method is particularly beneficial in scenarios where datasets grow over time, such as in real-time applications or when new data is periodically added, as it allows the model to be updated incrementally [310]. The key idea behind incremental distillation is to enable the student model to learn from new data as it arrives, while also retaining the knowledge gained from previous training sessions. At each time step $t$, the student model $S_t$ is trained to mimic the behavior of the teacher model $T_t$, which is typically derived from the latest data [311]. A critical aspect of incremental distillation is the preservation of previously learned knowledge. As the student model is updated with new data, it is important to ensure that it does not forget what it learned in earlier stages. This is achieved by incorporating a historical distillation term $\mathcal{L}_{\text{history}}$ in the loss function, which measures the difference between the current student model $S_t$ and its previous version $S_{t-1}$. This regularization term helps maintain continuity and stability in the learning process, preventing catastrophic forgetting [312]. The overall loss function for incremental distillation, $\mathcal{L}_{\text{KD-incremental}}$, includes two components:

  • KD Loss: $\mathcal{L}_{\text{KD}}(T_t, S_t)$ ensures that the student model learns from the current teacher model at time $t$.

  • Historical Distillation Loss: $\mathcal{L}_{\text{history}}(S_t, S_{t-1})$ penalizes deviations from the knowledge previously acquired by the student model, helping it retain important information from past iterations.

The parameter $\beta$ controls the balance between learning new information and preserving old knowledge. A higher $\beta$ places more emphasis on retaining historical knowledge, while a lower $\beta$ allows the model to adapt more quickly to new data [313].

Incremental distillation provides significant practical benefits by allowing the student model to be updated incrementally, thereby avoiding the high computational costs associated with full model retraining. This approach is especially advantageous for large-scale applications where datasets are continuously evolving, such as in RS or streaming analytics [314]. It ensures that the model remains adaptive to new data trends, maintaining its relevance and accuracy over time. Additionally, the use of a historical distillation term enhances the stability of the learning process, minimizing the risk of performance degradation as new data is incorporated [315]. Incremental distillation can be used for real-time monitoring of deforestation in satellite imagery. As new satellite images become available, the student model is updated incrementally to learn from the latest data while preserving its ability to recognize previously identified deforestation patterns, thus allowing continuous and efficient tracking of forest loss over time without needing to retrain the model from scratch [316].
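The text does not give an explicit equation for the incremental loss, so the Python sketch below assumes the natural weighted sum $\mathcal{L}_{\text{KD}}(T_t, S_t) + \beta \cdot \mathcal{L}_{\text{history}}(S_t, S_{t-1})$. Two further assumptions are made: both terms use a temperature-softened KL divergence, and the history term compares the outputs of the previous and current student on the same input. The logits are made-up values chosen so that the new teacher disagrees with the old student.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(target_logits, student_logits, temperature=2.0):
    """Assumed distillation loss: KL(target || student) on softened distributions."""
    p = softmax(target_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def incremental_kd_loss(teacher_logits, student_logits, prev_student_logits, beta):
    """Assumed form: KD(T_t, S_t) + beta * history(S_t, S_{t-1})."""
    new_knowledge = kd_loss(teacher_logits, student_logits)
    # History term: stay close to what the previous student S_{t-1} predicted.
    history = kd_loss(prev_student_logits, student_logits)
    return new_knowledge + beta * history

# Made-up outputs on one input: the new teacher favors class 1, while the
# previous student favored class 0.
TEACHER_T = [0.5, 3.0, 0.2]
PREV_STUDENT = [2.5, 0.5, 0.3]
PLASTIC = [0.6, 2.8, 0.3]  # candidate S_t that follows the new teacher
STABLE = [2.4, 0.6, 0.3]   # candidate S_t that stays near S_{t-1}

if __name__ == "__main__":
    for beta in (0.0, 1.0, 10.0):
        plastic = incremental_kd_loss(TEACHER_T, PLASTIC, PREV_STUDENT, beta)
        stable = incremental_kd_loss(TEACHER_T, STABLE, PREV_STUDENT, beta)
        print(beta, "plastic:", plastic, "stable:", stable)
```

With a small `beta` the candidate that follows the new teacher scores lower, and with a large `beta` the candidate that stays close to the previous student scores lower, mirroring the adaptation-versus-retention trade-off described for $\beta$.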

7.5 Real-Time Processing Enhancements

7.5.1 Real-Time Distillation Algorithms

Real-time distillation algorithms are critical in scenarios where timely decision-making is essential, such as RS applications that involve disaster monitoring or autonomous systems. These algorithms are designed to optimize both the accuracy and the speed of inference, so that the student model can deliver predictions within the stringent time constraints of real-time operation [307]. In real-time applications, there is a trade-off between model accuracy and inference speed. A highly accurate model may be too slow for real-time processing, while a faster model might sacrifice accuracy. Real-time distillation algorithms aim to strike a balance by training the student model to achieve an acceptable level of accuracy while making predictions quickly enough to meet real-time requirements [317]. The inference time, denoted $t_{\text{inference}}$, is the time the model takes to process an input and produce an output. For real-time applications, this must be less than or equal to a predetermined threshold $t_{\text{real-time}}$. The distillation process incorporates this requirement into training by adding a penalty term to the loss function whenever the student model’s inference time exceeds the threshold, so that the final model is optimized not only for accuracy but also for speed [318]. The penalty term $\gamma \cdot \mathbb{I}(t_{\text{inference}} > t_{\text{real-time}})$ in the loss function introduces a strong disincentive for any delay beyond the threshold. Here, $\mathbb{I}(\cdot)$ is an indicator function that activates the penalty whenever the inference time $t_{\text{inference}}$ is greater than the real-time threshold $t_{\text{real-time}}$. This encourages the model to adhere strictly to time constraints, making it suitable for deployment in environments where every millisecond counts, such as disaster response or real-time traffic management [319]. In RS, where data often needs to be processed and acted upon quickly (e.g., detecting natural disasters, monitoring climate change, or guiding autonomous vehicles), real-time distillation ensures that models are both accurate and fast enough to be practical. This is particularly important in disaster monitoring, where delays in data processing could result in missed opportunities to mitigate damage or save lives [320]. The loss function for real-time distillation, $\mathcal{L}_{\text{KD-real-time}}$, combines the traditional KD loss with a penalty for exceeding the inference time threshold:

$$\mathcal{L}_{\text{KD-real-time}} = \mathcal{L}_{\text{KD}}(T, S) + \gamma \cdot \mathbb{I}(t_{\text{inference}} > t_{\text{real-time}}) \tag{24}$$

where $\mathcal{L}_{\text{KD}}(T, S)$ represents the standard KD loss, which measures how effectively the student model replicates the teacher model’s behavior. The hyperparameter $\gamma$ controls the strength of the penalty applied when the inference time exceeds the real-time threshold, and the indicator function $\mathbb{I}(\cdot)$ activates this penalty if the inference time $t_{\text{inference}}$ surpasses the allowed real-time limit $t_{\text{real-time}}$. This formulation encourages a student model that not only learns effectively from the teacher model but also adheres to the timing constraints required for real-time deployment [321]. Real-time distillation algorithms provide several practical benefits, including the ability to make timely decisions, which is critical in applications such as disaster monitoring where delays can have severe consequences. They balance the trade-off between speed and accuracy, so that models are both effective and practical for real-time use. Their versatility also makes them suitable for a wide range of fields that require real-time processing, from RS to autonomous driving and other time-sensitive applications [322]. Real-time distillation algorithms can be utilized in monitoring wildfires through satellite imagery, where immediate detection and response are crucial. The student model is trained to rapidly process satellite images and detect fire outbreaks in real time while adhering to stringent time constraints, enabling timely alerts for emergency services and minimizing potential damage [323].
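The latency-penalized loss can be sketched in Python as follows, implementing the prose description in which the penalty is applied when the measured inference time exceeds the threshold. The KD loss is assumed to be a temperature-softened KL divergence; `measure_latency_ms` is a hypothetical helper based on wall-clock timing, and `hinge_latency_penalty` is an assumed alternative that is not part of the source formulation.

```python
import math
import time

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Assumed KD loss: KL(teacher || student) on softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def measure_latency_ms(model_fn, x, repeats=5):
    """Hypothetical helper: mean wall-clock time of model_fn(x) in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        model_fn(x)
    return (time.perf_counter() - start) * 1000.0 / repeats

def realtime_kd_loss(teacher_logits, student_logits, t_inference, t_real_time, gamma):
    """KD loss plus gamma whenever the inference time exceeds the threshold."""
    penalty = gamma if t_inference > t_real_time else 0.0
    return kd_loss(teacher_logits, student_logits) + penalty

def hinge_latency_penalty(t_inference, t_real_time, gamma):
    """Assumed alternative: penalty grows linearly with the overshoot."""
    return gamma * max(0.0, t_inference - t_real_time)

if __name__ == "__main__":
    teacher, student = [3.0, 0.5, 0.2], [2.0, 1.0, 0.5]
    latency = measure_latency_ms(lambda v: softmax(v), student)
    print("measured latency (ms):", latency)
    print("within budget:", realtime_kd_loss(teacher, student, 5.0, 10.0, gamma=2.0))
    print("over budget:  ", realtime_kd_loss(teacher, student, 15.0, 10.0, gamma=2.0))
```

Because the indicator is piecewise constant, it provides no gradient signal with respect to the student parameters. In a sketch like this it is therefore most useful for comparing or rejecting candidate student architectures, while a graded penalty such as the hinge variant distinguishes small overshoots from large ones.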

7.5.2 Edge-AI Distillation

Edge-AI distillation focuses on optimizing ML models specifically for deployment on edge devices, such as sensors, drones, or other low-power, resource-constrained hardware commonly used in RS applications [324]. These devices often have limited computational power, memory, and battery life, which necessitates the development of highly efficient models that can perform complex tasks with minimal resources. The goal of edge-AI distillation is to ensure that the distilled student model is not only accurate but also energy-efficient and capable of running with low latency on edge devices [325].

The key to achieving this lies in balancing the KD process with the energy consumption constraints of the target hardware. The distillation loss function, denoted $\mathcal{L}_{\text{KD}}(T, S)$, is typically used to align the outputs of the student model $S$ with those of the teacher model $T$. However, in the context of edge-AI, an additional term, $\text{Energy}(\theta_S)$, is introduced into the objective function to account for the energy consumption of the student model on the edge device [326].

This approach involves minimizing not just the distillation loss but also the energy consumption associated with running the student model. The regularization parameter $\delta$ is used to balance the importance of energy efficiency against the need to maintain high model performance. By carefully tuning this parameter, it is possible to develop models that are both effective in their predictive capabilities and efficient in terms of energy usage [327].
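No explicit objective is given for edge-AI distillation, so the Python sketch below assumes the form $\mathcal{L}_{\text{KD}}(T, S) + \delta \cdot \text{Energy}(\theta_S)$ suggested by the description. The energy term is a hypothetical proxy: the number of multiply-accumulate operations (MACs) in a fully connected student multiplied by an assumed per-MAC energy. The default of 1 pJ per MAC is a placeholder, not a measured figure, and the KD loss is assumed to be a temperature-softened KL divergence.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Assumed KD loss: KL(teacher || student) on softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def dense_macs(layer_sizes):
    """MACs for one forward pass of a fully connected network (biases ignored)."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def energy_proxy_nj(layer_sizes, picojoules_per_mac=1.0):
    """Hypothetical Energy(theta_S) in nanojoules: MACs times per-MAC energy."""
    return dense_macs(layer_sizes) * picojoules_per_mac * 1e-3

def edge_objective(teacher_logits, student_logits, layer_sizes, delta,
                   picojoules_per_mac=1.0):
    """Assumed objective: KD loss + delta * energy proxy."""
    energy = energy_proxy_nj(layer_sizes, picojoules_per_mac)
    return kd_loss(teacher_logits, student_logits) + delta * energy

if __name__ == "__main__":
    teacher = [3.0, 0.5, 0.2]
    # Made-up candidates: (student logits on one input, layer widths).
    wide = ([2.8, 0.6, 0.3], [64, 256, 3])
    narrow = ([2.0, 1.0, 0.5], [64, 16, 3])
    for delta in (0.0, 0.01, 0.1):
        print(delta,
              "wide:", edge_objective(teacher, wide[0], wide[1], delta),
              "narrow:", edge_objective(teacher, narrow[0], narrow[1], delta))
```

A MAC count is only a coarse stand-in for energy. On real edge hardware, memory traffic and idle power also matter, so measurements from the target device would normally replace this proxy.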

Edge-AI distillation can be employed in drones for real-time wildlife monitoring, where models need to process and analyze video feeds on the device itself. By optimizing the model for low power consumption and fast inference, drones can efficiently identify and track animal movements without draining their batteries or relying on constant data transmission to central servers.

7.6 Cross-Modal and Multi-Modal Distillation

7.6.1 Cross-Modal Knowledge Transfer

Cross-modal distillation involves transferring knowledge from one modality (e.g., optical images) to another (e.g., SAR or multispectral images). This approach can improve model generalization across different data sources [328, 329].

$$\mathcal{L}_{\text{KD-cross}} = \sum_{m=1}^{M} w_m \cdot \mathcal{L}_{\text{KD}}(T_m(x_m), S(x_m)) \tag{25}$$

where $M$ is the number of modalities, and $w_m$ is the weight associated with each modality $m$.

Cross-modal knowledge transfer is a specialized technique within the broader context of KD, focusing on transferring knowledge from one data modality to another. In RS, data is often collected from various sources, each providing unique and complementary information about the environment. For instance, optical images capture visible light, SAR (Synthetic Aperture Radar) images provide microwave data, and multispectral images cover a range of wavelengths beyond the visible spectrum. Each modality offers distinct advantages and limitations, making it valuable to develop models that can generalize across these diverse data types [330]. The essence of cross-modal distillation lies in the ability to leverage a teacher model trained on one modality (e.g., optical images) to improve the performance of a student model operating on a different modality (e.g., SAR or multispectral images). This process enhances the student model’s ability to generalize and perform well across different data sources, which is crucial in RS tasks that require robust performance across varying environmental conditions and sensor types [329].

The formulation of cross-modal distillation is encapsulated in the loss function $\mathcal{L}_{\text{KD-cross}}$, which aggregates the KD losses over $M$ modalities. Each modality $m$ is assigned a weight $w_{m}$ that reflects its importance or relevance in the distillation process. The objective is to minimize the weighted sum of the per-modality KD losses $\mathcal{L}_{\text{KD}}$, where $T_{m}(x_{m})$ is the output of the teacher model for modality $m$ and $S(x_{m})$ is the corresponding output of the student model [331]. By selecting and tuning the weights $w_{m}$, the cross-modal distillation process can be tailored to emphasize certain modalities over others, depending on the requirements of the application. For example, in scenarios where optical images are more informative, the weight $w_{m}$ for optical data can be increased so that the student model learns more from that modality [332]. Cross-modal knowledge transfer is particularly beneficial in RS, where data from different modalities may be abundant but not always consistently available. By enabling models to transfer knowledge across modalities, this approach helps the student model remain robust and effective even when some data sources are missing or degraded. 
This capability is critical in applications such as environmental monitoring, disaster response, and resource management, where reliable and accurate information from diverse data sources is essential for decision-making [333]. As mentioned above, cross-modal knowledge transfer can be applied in integrating SAR and optical satellite imagery for flood monitoring. By leveraging a teacher model trained on high-resolution optical images, a student model can be optimized to accurately interpret SAR images, enhancing the detection of flood extents and water levels in areas where optical data might be obstructed or unavailable.
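The weighted sum in Eq. (25) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-modality KD loss is taken to be a KL divergence between temperature-softened class distributions (one common choice, not necessarily the one used in the cited works), and the function names, modality keys, logits, and weights are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between softened class distributions (assumed form of L_KD)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def cross_modal_kd_loss(teacher_logits, student_logits, weights):
    """Weighted sum over modalities m of w_m * L_KD(T_m(x_m), S(x_m))."""
    return sum(
        w * kd_loss(teacher_logits[m], student_logits[m])
        for m, w in weights.items()
    )


# Toy logits for one scene observed in two modalities (made-up numbers).
teacher_logits = {"optical": [2.0, 0.5, -1.0], "sar": [1.5, 0.2, -0.3]}
student_logits = {"optical": [1.0, 0.8, -0.5], "sar": [1.5, 0.2, -0.3]}
weights = {"optical": 0.7, "sar": 0.3}  # hypothetical w_m values

loss = cross_modal_kd_loss(teacher_logits, student_logits, weights)
print(f"cross-modal KD loss: {loss:.4f}")
```

In this toy setting the student already matches the SAR teacher, so only the optical term contributes; raising the optical weight would increase the pressure on the student to close that gap.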

7.6.2 Multi-Task Distillation

Multi-task distillation is an advanced technique in KD where a single student model is trained to perform multiple tasks simultaneously, such as classification, segmentation, and object detection. This approach aims to create a more versatile and efficient model that can handle a variety of tasks without compromising performance in any of them. By distilling knowledge from teacher models specialized in different tasks, the student model learns to balance these tasks effectively, making it highly valuable in applications where multi-functionality is essential, such as in RS or autonomous systems [334]. Each task has its own teacher model, and the student model learns from all of these teachers concurrently. For example, in RS, one teacher model might be specialized in land cover classification, while another is focused on detecting specific objects like vehicles or buildings. The student model is designed to learn from both teachers, allowing it to perform land cover classification and object detection within the same framework [335]. The loss function for multi-task distillation, denoted as $\mathcal{L}_{\text{KD-multi-task}}$, is a weighted sum of the distillation losses from each task. The weight $\lambda_{t}$ assigned to each task allows for prioritization based on the importance or difficulty of the task. For instance, if segmentation is more critical for a specific application than classification, a higher weight can be assigned to the segmentation task so that the student model focuses more on that aspect [336].

$$\mathcal{L}_{\text{KD-multi-task}} = \sum_{t=1}^{T} \lambda_{t} \cdot \mathcal{L}_{\text{KD}}(T_{t}, S_{t}) \tag{26}$$

where $T$ represents the total number of tasks, and $\mathcal{L}_{\text{KD}}(T_{t}, S_{t})$ is the KD loss for task $t$ between the teacher $T_{t}$ and the student model $S_{t}$ (note that $T$ without a subscript counts tasks, whereas $T_{t}$ denotes the teacher for task $t$). One of the key challenges in multi-task distillation is balancing the learning of different tasks. Since each task might require different levels of focus or complexity, the student model must be carefully trained to avoid underperforming in one task while excelling in another. The weight $\lambda_{t}$ helps manage this balance by allowing certain tasks to have more influence on the model’s learning process [337]. By combining multiple tasks into a single model, multi-task distillation reduces the need for deploying separate models for each task, saving computational resources and simplifying the deployment process. This makes the distilled model more efficient and versatile, particularly in environments where resources are limited or where real-time processing of multiple tasks is required [338].
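As a rough illustration of Eq. (26), the following Python sketch computes the weighted multi-task loss and also returns the per-task terms, which makes it easier to see which task dominates the total. The task names, the toy outputs, and the use of mean squared error as a stand-in for the per-task KD loss are assumptions made for this example only.

```python
def mse(teacher_out, student_out):
    """Mean squared error between two equally sized output vectors."""
    return sum((t - s) ** 2 for t, s in zip(teacher_out, student_out)) / len(teacher_out)


def multi_task_kd_loss(teacher_outputs, student_outputs, task_weights, task_losses):
    """Return the weighted total and the per-task terms lambda_t * L_KD(T_t, S_t)."""
    terms = {
        task: task_weights[task]
        * task_losses[task](teacher_outputs[task], student_outputs[task])
        for task in task_weights
    }
    return sum(terms.values()), terms


# Hypothetical outputs of two task-specific teachers and of the shared student.
teacher_outputs = {"land_cover": [0.9, 0.1], "detection": [0.2, 0.8, 0.5]}
student_outputs = {"land_cover": [0.7, 0.3], "detection": [0.2, 0.5, 0.5]}
task_weights = {"land_cover": 1.0, "detection": 2.0}  # lambda_t, chosen arbitrarily
task_losses = {"land_cover": mse, "detection": mse}   # stand-ins for L_KD per task

total, terms = multi_task_kd_loss(
    teacher_outputs, student_outputs, task_weights, task_losses
)
for task, value in terms.items():
    print(f"{task}: {value:.3f}")
print(f"total: {total:.3f}")
```

Inspecting the per-task terms during training is one simple way to notice when a single task is dominating the objective, which is the balancing problem described above.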

In practical scenarios, such as RS, a multi-task distilled model could simultaneously analyze satellite images for land classification, detect changes over time, and identify specific objects or features. This approach is not only more efficient but also enables the model to leverage shared information across tasks, leading to better overall performance and more coherent results across the different tasks [339]. While multi-task distillation offers numerous advantages, it also comes with challenges, such as the potential for task interference, where learning one task might negatively impact another. Careful design of the distillation process and appropriate weighting of tasks are essential to ensure that the student model performs well across all tasks [340]. A multi-task distilled model can be used for comprehensive urban monitoring from satellite images, where it simultaneously performs land cover classification, detects infrastructure changes (such as new buildings or roads), and identifies specific objects (like vehicles or trees). This enables efficient and integrated analysis of diverse data, enhancing the ability to track urban development and infrastructure changes in a single model [341].

7.7 Seamless Integration with Existing Workflows

7.7.1 Plug-and-Play Distillation Modules

Creating plug-and-play distillation modules that can be easily integrated into existing RS workflows offers several benefits [342]. These modules are designed to be seamlessly incorporated with minimal adjustments to the current infrastructure, reducing the barriers to adopting knowledge distillation (KD) in RS applications [33]. The formulation of this integration can be expressed as:

$$\mathcal{L}_{\text{KD-plug}} = \mathcal{L}_{\text{KD}}(T, S) + \sum_{m=1}^{M} \mathcal{L}_{\text{mod}}(S, M_{m})$$

where $\mathcal{L}_{\text{KD}}(T, S)$ represents the standard KD loss between the teacher model $T$ and the student model $S$, and $\mathcal{L}_{\text{mod}}(S, M_{m})$ denotes the additional loss terms introduced by integrating existing modules $M_{m}$ with the student model $S$. The modularity of this approach allows different components to be swapped or upgraded independently, facilitating the adoption of KD in diverse scenarios. It also supports scalability, keeping KD applicable across a wide range of RS tasks, from small UAV datasets to extensive satellite imagery. By leveraging pre-built modular KD techniques, developers and researchers can save time on implementation, accelerating the development and deployment of RS models, which ultimately enhances their performance [343, 344].
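One way to picture the modular loss above is as a small registry of interchangeable loss terms. The Python sketch below is hypothetical: the class name `PlugInDistiller`, its methods, and the example module (an L1 penalty on the student outputs) are invented for illustration and do not correspond to any existing RS toolkit.

```python
class PlugInDistiller:
    """Combines a base KD loss with swappable module loss terms L_mod(S, M_m)."""

    def __init__(self, kd_loss_fn):
        self.kd_loss_fn = kd_loss_fn
        self.module_losses = {}  # module name -> callable(student_out) -> float

    def register(self, name, loss_fn):
        """Plug a module loss term into the objective."""
        self.module_losses[name] = loss_fn

    def unregister(self, name):
        """Remove a module loss term without touching the others."""
        self.module_losses.pop(name, None)

    def loss(self, teacher_out, student_out):
        """L_KD(T, S) plus the sum of all registered module terms."""
        base = self.kd_loss_fn(teacher_out, student_out)
        extras = sum(fn(student_out) for fn in self.module_losses.values())
        return base + extras


def mse(teacher_out, student_out):
    """Mean squared error, used here as a stand-in for the KD loss."""
    return sum((t - s) ** 2 for t, s in zip(teacher_out, student_out)) / len(teacher_out)


def l1_penalty(student_out):
    """Hypothetical module term: small L1 penalty on the student outputs."""
    return 0.01 * sum(abs(s) for s in student_out)


distiller = PlugInDistiller(mse)
distiller.register("l1", l1_penalty)
with_module = distiller.loss([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

distiller.unregister("l1")
without_module = distiller.loss([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(with_module, without_module)
```

Because each term is registered by name, a component can be replaced or dropped without changing the rest of the training loop, which is the property the plug-and-play formulation is meant to capture.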

7.7.2 Toolkits and Frameworks

Developing comprehensive toolkits and frameworks for KD in RS can significantly enhance its performance and adoption. These toolkits provide standardized implementations of KD methods, ensuring consistency and reliability across different RS tasks [345]. The complexity of integrating various modules within the KD process can be expressed as:

$$\text{Toolkit Complexity} \propto \sum_{i=1}^{N} C_{\text{integration}}(M_{i}, \mathcal{L}_{\text{KD}})$$

where $C_{\text{integration}}(M_{i}, \mathcal{L}_{\text{KD}})$ represents the complexity of integrating module $M_{i}$ with the KD process, summed over the $N$ modules involved. Standardized toolkits reduce the variability in performance that can arise from ad-hoc implementations, leading to more predictable results and making KD a more reliable option for RS applications. Additionally, these toolkits lower the technical barriers for practitioners by providing user-friendly interfaces and comprehensive documentation. Frameworks often include optimized routines for tasks such as hyperparameter tuning or data preprocessing, which can lead to better model performance and faster training times, especially in computationally intensive RS tasks. The community-driven development of these toolkits and frameworks leads to faster identification of bugs, new feature additions, and overall better support for the technology. This collective improvement makes KD more practical and effective for RS tasks, contributing to enhanced model performance and more efficient applications [346].

7.8 Enhancing Model Interpretability

7.8.1 Explainable Distillation

Explainable distillation is an advanced approach in KD that not only focuses on transferring the predictive performance of the teacher model to the student model but also ensures that the student model’s decision-making process is interpretable. The goal is to create models that are not just accurate but also transparent, providing insights into how they arrive at their predictions [347]. This is particularly important in critical applications like RS, healthcare, and autonomous systems, where understanding the model’s reasoning is crucial for trust and accountability [348]. Traditional KD emphasizes performance, often at the cost of interpretability. However, in many applications, it is essential to know why a model makes a certain decision. Explainable distillation aims to bridge this gap by integrating interpretability into the distillation process. The student model is trained not only to mimic the teacher’s outputs but also to generate explanations for its decisions that are comprehensible to humans [349]. The process of explainable distillation involves an additional term in the loss function, denoted as $\mathcal{L}_{\text{explain}}(S)$, which penalizes the student model if its explanations are not clear or do not align with certain interpretability criteria. This ensures that the student model learns to provide meaningful insights alongside accurate predictions [350].

$$\mathcal{L}_{\text{KD-explain}} = \mathcal{L}_{\text{KD}}(T, S) + \xi \cdot \mathcal{L}_{\text{explain}}(S) \tag{27}$$

where $\mathcal{L}_{\text{KD}}(T, S)$ is the standard KD loss, and $\xi$ is a regularization parameter that controls the trade-off between accuracy and interpretability. The term $\mathcal{L}_{\text{explain}}(S)$ guides the student model toward generating explanations that are either inherently interpretable or match the explanations provided by the teacher model, if available [348].

In domains such as RS, where decisions based on model predictions can have significant real-world impacts, the ability to explain model outputs is critical. For example, when a model identifies a potential disaster area in satellite imagery, it is not enough to simply flag the area; stakeholders need to understand the reasoning behind the decision, such as the specific patterns or features that led to the prediction [349]. Various explainable AI techniques can be incorporated into the distillation process, such as saliency maps, attention mechanisms, or feature attribution methods [351]. These techniques help to visualize and understand which parts of the input data are most influential in the model’s decision-making process. By integrating these techniques into the distillation process, the student model becomes not only a distilled version of the teacher but also a more interpretable and transparent model [352].
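To make Eq. (27) concrete, the sketch below instantiates the explanation term for the case where teacher explanations are available, using saliency maps (one of the techniques named above). Measuring the mismatch as one minus the cosine similarity of the flattened maps is an assumption made for this example, as are the function names and the toy values.

```python
import math


def explanation_loss(teacher_saliency, student_saliency):
    """1 - cosine similarity between flattened saliency maps (assumed form of L_explain)."""
    dot = sum(t * s for t, s in zip(teacher_saliency, student_saliency))
    norm_t = math.sqrt(sum(t * t for t in teacher_saliency))
    norm_s = math.sqrt(sum(s * s for s in student_saliency))
    if norm_t == 0.0 or norm_s == 0.0:
        return 1.0  # treat an all-zero map as maximally mismatched
    return 1.0 - dot / (norm_t * norm_s)


def explainable_kd_loss(kd_value, teacher_saliency, student_saliency, xi=0.5):
    """L_KD(T, S) + xi * L_explain(S), with kd_value computed elsewhere."""
    return kd_value + xi * explanation_loss(teacher_saliency, student_saliency)


# Flattened 2x2 saliency maps with made-up values.
teacher_map = [0.9, 0.1, 0.0, 0.0]
aligned_student_map = [0.8, 0.2, 0.0, 0.0]
misaligned_student_map = [0.0, 0.0, 0.7, 0.3]

print(explainable_kd_loss(0.2, teacher_map, aligned_student_map))
print(explainable_kd_loss(0.2, teacher_map, misaligned_student_map))
```

With the same KD value, the student whose saliency map highlights different pixels than the teacher's receives the larger total loss, and $\xi$ sets how strongly that mismatch is penalized.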

Overall, explainable distillation holds significant potential in fields where trust in AI systems is paramount. By producing models that are both accurate and interpretable, it enhances the usability and acceptance of AI systems in sensitive areas. Moreover, it aids in regulatory compliance, as interpretable models can more easily meet legal and ethical standards regarding AI transparency [353]. However, explainable distillation also comes with challenges, such as defining and quantifying interpretability in a way that aligns with both human understanding and model performance. Future research may focus on developing more sophisticated explainability metrics and integrating them seamlessly into the distillation process, ensuring that models are not only effective but also understandable and trustworthy [354].

7.8.2 Feature Importance Preservation

During the distillation process, there is a risk that the simplified student model might not fully capture or retain the importance of the features that the teacher model relies on most. If the student model fails to recognize the same features as important, it might lead to reduced performance and decreased interpretability. Moreover, the loss of feature importance can erode trust in the model, especially in high-stakes scenarios where the rationale behind predictions needs to be transparent [355]. To address this challenge, the distillation process can incorporate a feature importance preservation mechanism. This involves adding a term to the standard KD loss function that explicitly penalizes differences in feature importance between the teacher and student models [356]. The modified loss function can be expressed as:

$$\mathcal{L}_{\text{KD-feature}} = \mathcal{L}_{\text{KD}}(T, S) + \zeta \cdot \sum_{i=1}^{d} \left| f_{i}^{T} - f_{i}^{S} \right| \tag{28}$$

where $\mathcal{L}_{\text{KD}}(T, S)$ is the standard KD loss, which aligns the outputs of the student model $S$ with those of the teacher model $T$. $f_{i}^{T}$ and $f_{i}^{S}$ represent the importance of the $i$-th feature in the teacher and student models, respectively. $\zeta$ is a regularization parameter that controls the weight given to the feature importance preservation term in the overall loss function. The summation term $\sum_{i=1}^{d} \left| f_{i}^{T} - f_{i}^{S} \right|$ measures the discrepancy between the feature importance values of the teacher and student models across all $d$ features in the input space.

By ensuring that the student model not only mimics the predictions of the teacher model but also recognizes the same features as important, the feature importance preservation approach enhances the interpretability of the student model. This is particularly useful in applications where the end-users, such as analysts or domain experts, need to understand the reasoning behind the model’s decisions. In RS, for instance, preserving feature importance can ensure that the student model correctly identifies key environmental indicators, such as vegetation health or water quality, that are critical for accurate predictions [252]. The preservation of feature importance can also play a significant role in building trust in AI systems. When users can see that the model is focusing on the right features, they are more likely to trust its predictions. This is particularly crucial in regulated industries or applications involving ethical considerations, where transparency is mandatory [357]. While feature importance preservation is beneficial, it introduces a trade-off between model simplicity and interpretability. The regularization parameter $\zeta$ needs to be carefully tuned to balance the preservation of feature importance with the overall performance of the student model. If $\zeta$ is too high, the student model may overemphasize feature alignment at the cost of predictive accuracy. Conversely, if $\zeta$ is too low, the student model might neglect important features, compromising its interpretability and trustworthiness [358, 359].
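The penalty in Eq. (28) and the effect of the regularization parameter can be illustrated with a short Python sketch. The importance scores are assumed to be given (for example, from an attribution method run on each model); the four-feature vectors, the KD value, and the function name are made up for this example.

```python
def feature_importance_kd_loss(kd_value, teacher_importance, student_importance, zeta=0.1):
    """L_KD(T, S) + zeta * sum_i |f_i^T - f_i^S|, with kd_value computed elsewhere."""
    if len(teacher_importance) != len(student_importance):
        raise ValueError("importance vectors must have the same length d")
    gap = sum(abs(ft - fs) for ft, fs in zip(teacher_importance, student_importance))
    return kd_value + zeta * gap


# Hypothetical importance scores for d = 4 input features (e.g., spectral bands).
teacher_importance = [0.50, 0.30, 0.15, 0.05]
student_importance = [0.40, 0.30, 0.20, 0.10]
kd_value = 0.3  # placeholder for the standard KD loss

# Sweep zeta to see how much of the total loss comes from the importance gap.
for zeta in (0.0, 0.5, 5.0):
    total = feature_importance_kd_loss(
        kd_value, teacher_importance, student_importance, zeta
    )
    print(f"zeta={zeta}: total={total:.3f}")
```

At $\zeta = 0$ the objective reduces to plain KD, while at large $\zeta$ the importance gap dominates the total, which mirrors the two failure modes described above.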

7.9 Hybrid Approaches

7.9.1 Combining Distillation with Other Techniques

Hybrid approaches in ML involve the integration of multiple learning paradigms to create more powerful and versatile models. When applied to KD, these hybrid approaches can significantly enhance the performance and efficiency of models, especially in complex and data-rich fields like RS [360]. For instance, a hybrid approach can combine KD with transfer learning and reinforcement learning (RL). Transfer learning involves leveraging knowledge gained from one task or domain to improve the performance on a related but different task. In the context of hybrid approaches, transfer learning can be integrated into the distillation process to help the student model acquire additional knowledge from pre-trained models on related tasks [361]. The transfer learning loss component, $\mathcal{L}_{\text{Transfer}}(S)$, represents the cost associated with adapting the student model to the new task or domain. RL, in turn, is a paradigm where models learn to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties. In hybrid distillation approaches, RL can be incorporated to optimize the student model’s performance through trial and error, especially in scenarios where decision-making under uncertainty is critical. The RL loss component, $\mathcal{L}_{\text{Reinforce}}(S)$, measures how well the student model performs in achieving its objectives within the environment [362].

Besides, active learning is another technique that can be integrated with KD. It involves selectively querying the most informative data points for training, thereby improving the model’s performance with fewer labeled examples. While not explicitly included in the equation, active learning can complement the distillation process by ensuring that the most critical data points are used for training the student model [363, 364]. The hybrid loss function in this approach combines the losses from KD, transfer learning, and reinforcement learning:

$$\mathcal{L}_{\text{KD-hybrid}} = \alpha \cdot \mathcal{L}_{\text{KD}}(T, S) + \beta \cdot \mathcal{L}_{\text{Transfer}}(S) + \gamma \cdot \mathcal{L}_{\text{Reinforce}}(S) \tag{29}$$

In this context, $\alpha$, $\beta$, and $\gamma$ are weighting factors that determine the contribution of each component to the overall loss function. The term $\mathcal{L}_{\text{KD}}(T, S)$ ensures that the student model closely mimics the behavior of the teacher model, while $\mathcal{L}_{\text{Transfer}}(S)$ aids the student model in adapting to new tasks or domains by utilizing previously acquired knowledge. Additionally, $\mathcal{L}_{\text{Reinforce}}(S)$ enhances the student model’s decision-making abilities through interaction with an environment, thereby optimizing its performance [365].
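The survey leaves the transfer and reinforcement terms of Eq. (29) abstract, so the Python sketch below picks one possible instantiation purely for illustration: the transfer term is taken to be the squared distance of the student weights from a pretrained starting point, and the reinforcement term a REINFORCE-style surrogate (negative reward-weighted log-probability of the chosen actions). Both choices, and all names and numbers, are assumptions rather than the formulation of any cited work.

```python
import math


def transfer_loss(student_weights, pretrained_weights):
    """Squared L2 distance from the pretrained weights (assumed form of L_Transfer)."""
    return sum((w - w0) ** 2 for w, w0 in zip(student_weights, pretrained_weights))


def reinforce_loss(action_probs, rewards):
    """REINFORCE-style surrogate: -mean(reward * log pi(action)) (assumed form of L_Reinforce)."""
    return -sum(r * math.log(p) for p, r in zip(action_probs, rewards)) / len(rewards)


def hybrid_kd_loss(kd_value, transfer_value, reinforce_value,
                   alpha=1.0, beta=0.1, gamma=0.1):
    """alpha * L_KD + beta * L_Transfer + gamma * L_Reinforce."""
    return alpha * kd_value + beta * transfer_value + gamma * reinforce_value


kd_value = 0.4  # placeholder for the standard KD loss
transfer_value = transfer_loss([0.9, -0.2, 0.5], [1.0, 0.0, 0.5])
reinforce_value = reinforce_loss(action_probs=[0.8, 0.6], rewards=[1.0, 0.5])

total = hybrid_kd_loss(kd_value, transfer_value, reinforce_value)
print(f"hybrid loss: {total:.4f}")
```

Keeping the three terms as separate values before weighting makes it straightforward to log each one and to adjust $\alpha$, $\beta$, and $\gamma$ when one objective starts to dominate.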

By integrating multiple learning paradigms, hybrid approaches can create models that are not only smaller and faster (through distillation) but also more knowledgeable and capable (through transfer learning) and more adaptive (through reinforcement learning). Hybrid models are better equipped to handle a variety of tasks and environments, as they combine the strengths of different learning methods. This is particularly useful in RS, where data can vary widely in type, quality, and context. The ability to combine different learning strategies makes hybrid approaches highly versatile. For example, a hybrid model might use transfer learning to understand basic image recognition tasks while using reinforcement learning to make real-time decisions based on that understanding [366].

Hybrid distillation approaches are particularly relevant in complex fields such as RS, where models need to process and analyze vast amounts of data from multiple sources. For instance, in disaster monitoring, a hybrid model could use transfer learning to recognize different types of terrain, reinforcement learning to predict the spread of a fire, and KD to ensure the model is efficient enough to run on edge devices deployed in the field [367]. However, combining multiple learning paradigms can increase the complexity of the model and its training process. Careful tuning of the weighting factors ($\alpha$, $\beta$, $\gamma$) is necessary to achieve the desired balance between the different learning objectives. Moreover, while the goal of distillation is to create efficient models, the integration of transfer learning and reinforcement learning can demand additional computational resources during training, which may limit the scalability of the approach [368].

7.9.2 Adaptive Distillation Frameworks

Developing adaptive distillation frameworks that can adjust the distillation process based on the complexity of the task or the availability of data could lead to more flexible and robust models. Traditional knowledge distillation methods apply a fixed strategy for transferring knowledge from the teacher model to the student model, which might not be optimal for all scenarios, particularly in RS where data heterogeneity and varying task requirements are common [269]. An adaptive distillation framework dynamically adjusts the distillation process by incorporating additional mechanisms that account for task complexity, data quality, or computational constraints. This dynamic adjustment allows the distillation process to be more responsive to the specific needs of the application, improving both the efficiency and effectiveness of the student model [369]. For instance, in scenarios where the task is relatively simple or where data is abundant and high-quality, the framework could prioritize rapid model convergence by assigning higher weights to the knowledge distillation loss. Conversely, in more complex tasks or when dealing with sparse or noisy data, the framework could allocate more resources to the adaptation process, fine-tuning the student model to handle these challenges more effectively [370]. The adaptive distillation process can be represented as:

$$\mathcal{L}_{\text{KD-adaptive}} = \sum_{t=1}^{T} \left[ \alpha_{t} \cdot \mathcal{L}_{\text{KD}}(T_{t}, S_{t}) + \mu_{t} \cdot \mathcal{L}_{\text{adaptive}}(S_{t}) \right] \tag{30}$$

where $\alpha_{t}$ and $\mu_{t}$ are time-varying weights for the KD and adaptation loss components, respectively. These weights are not static but rather evolve over time $t$ based on factors such as the current performance of the student model, the difficulty of the current task, and the quality of the data available at each step [371].

The term $\mathcal{L}_{\text{KD}}(T_{t}, S_{t})$ represents the traditional knowledge distillation loss, which measures the discrepancy between the outputs of the teacher and student models. The term $\mathcal{L}_{\text{adaptive}}(S_{t})$ is an additional adaptation loss that could include penalties for model complexity, regularization terms to prevent overfitting, or other factors that enhance the student model’s ability to generalize across different tasks or datasets [372]. By allowing the distillation process to adapt to varying conditions, these frameworks can produce student models that are not only smaller and faster but also more versatile and capable of maintaining high performance across a range of tasks and environments. This adaptability is particularly valuable in RS, where the conditions under which models operate can vary significantly, and the ability to generalize effectively is crucial [373].
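A minimal Python sketch of Eq. (30) follows, assuming one simple weighting rule suggested by the discussion above: when the data at step $t$ is of high quality, more weight goes to the KD term, and when it is sparse or noisy, more weight goes to the adaptation term. The data-quality score, the linear rule, and all numeric values are hypothetical.

```python
def adaptive_weights(data_quality):
    """Map a data-quality score in [0, 1] to (alpha_t, mu_t); a hypothetical rule."""
    q = min(max(data_quality, 0.0), 1.0)  # clamp to the valid range
    return q, 1.0 - q


def adaptive_kd_loss(kd_values, adaptive_values, quality_scores):
    """Sum over steps t of alpha_t * L_KD(T_t, S_t) + mu_t * L_adaptive(S_t)."""
    total = 0.0
    for kd_t, ad_t, q_t in zip(kd_values, adaptive_values, quality_scores):
        alpha_t, mu_t = adaptive_weights(q_t)
        total += alpha_t * kd_t + mu_t * ad_t
    return total


# Per-step loss values and data-quality scores (made-up numbers).
kd_values = [0.8, 0.6, 0.5]
adaptive_values = [0.2, 0.3, 0.1]
quality_scores = [0.9, 0.5, 0.2]

total = adaptive_kd_loss(kd_values, adaptive_values, quality_scores)
print(f"adaptive KD loss: {total:.2f}")
```

In practice the weighting rule could also depend on the student's current performance or on task difficulty, as noted above; the linear rule here is only the simplest choice that makes the time dependence visible.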

8 Conclusion

This review provides a comprehensive analysis of knowledge distillation (KD) and its applications in remote sensing (RS), making significant contributions to the understanding and advancement of this technique. The review begins by offering a detailed overview of the fundamentals of KD, including its definition, basic concepts, historical evolution, and underlying mechanisms. The advantages of KD, such as model compression, improved efficiency, and enhanced performance in smaller models, are highlighted, particularly in the context of RS tasks like image classification, object detection, land cover classification, and semantic segmentation. A key contribution of this review is the taxonomy of KD models, which categorizes the variations in the models and input data, the type of knowledge transferred, the distillation target, and the structural relationships between network layers. This taxonomy serves as a valuable resource for researchers and practitioners, providing a clear framework for understanding the diverse applications and implementations of KD in RS.

However, the review also addresses the challenges and limitations of applying KD in remote sensing. These include the complexity of model deployment, the heterogeneity of data sources, overfitting and generalization issues, scalability, real-time applicability, dependency on high-quality data, and the difficulty of balancing efficiency and accuracy. They underscore the need for continued innovation and refinement of KD techniques to meet the unique demands of remote sensing applications. Moving forward, the review identifies several important future directions for the field. These include the development of advanced model compression techniques, dynamic and layer-wise distillation, efficient training algorithms, and hardware-aware distillation. The importance of improving data quality and robustness, through distillation that is robust to noisy data and through semi-supervised or unsupervised approaches, is also emphasized. Additionally, the review suggests scalability solutions such as distributed and incremental distillation, real-time processing enhancements, cross-modal and multi-modal distillation, and the seamless integration of KD into existing workflows through plug-and-play modules and standardized toolkits.

Finally, the review calls for efforts to enhance model interpretability through explainable distillation and feature importance preservation, as well as exploring hybrid approaches that combine KD with other techniques. These future directions highlight the ongoing potential for KD to revolutionize remote sensing applications, offering pathways to overcome current limitations and achieve greater accuracy, efficiency, and applicability in this critical field.

Acknowledgement

This work received no funding.

Conflict of Interest

The authors declare no conflicts of interest.

References

  • [1] J. Wu, L. Fang, J. Yue, Takd: Target-aware knowledge distillation for remote sensing scene classification, IEEE Transactions on Circuits and Systems for Video Technology (2024).
  • [2] W. Xie, Z. Zhang, L. Jiao, J. Wang, Decoupled knowledge distillation via spatial feature blurring for hyperspectral image classification, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (2024).
  • [3] C. Chen, H. Ding, M. Duan, Discretization and decoupled knowledge distillation for arbitrary oriented object detection, Digital Signal Processing (2024) 104512.
  • [4] Y. Pang, Y. Zhang, Y. Wang, X. Wei, B. Chen, Exploring model compression limits and laws: A pyramid knowledge distillation framework for satellite-on-orbit object recognition, IEEE Transactions on Geoscience and Remote Sensing (2024).
  • [5] A. Paranata, R. Adha, H. T. P. Thao, E. E. Sasanti, Fafurida, The catastrophe of corruption in the sustainability of foreign aid: A prediction of artificial neural network method in indonesia, Fudan Journal of the Humanities and Social Sciences 16 (2) (2023) 239–257.
  • [6] H. Wu, Z. Xue, S. Zhou, H. Su, Beyond spectral shift mitigation: Knowledge swap net for cross-domain few-shot hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing (2024) 1–1.
  • [7] S. Atalla, S. Tarapiah, A. Gawanmeh, M. Daradkeh, H. Mukhtar, Y. Himeur, W. Mansoor, K. F. B. Hashim, M. Daadoo, Iot-enabled precision agriculture: Developing an ecosystem for optimized crop management, Information 14 (4) (2023) 205.
  • [8] Y. Ma, S. Chen, S. Ermon, D. B. Lobell, Transfer learning in environmental remote sensing, Remote Sensing of Environment 301 (2024) 113924.
  • [9] J. Liu, Z. Wu, L. Xiao, A spectral diffusion prior for unsupervised hyperspectral image super-resolution, IEEE Transactions on Geoscience and Remote Sensing (2024) 1–1. doi:10.1109/TGRS.2024.3449073.
  • [10] M. Salem, N. Tsurusaki, Deep learning for land cover mapping using sentinel-2 imagery: A case study at greater cairo, egypt, in: IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium, IEEE, 2023, pp. 6748–6751.
  • [11] X. Liu, F. Jin, S. Wang, J. Rui, X. Zuo, X. Yang, C. Cheng, Multimodal online knowledge distillation framework for land use/cover classification using full or missing modalities, IEEE Transactions on Geoscience and Remote Sensing (2024).
  • [12] A. Ouamane, A. Chouchane, Y. Himeur, A. Debilou, A. Amira, S. Atalla, W. Mansoor, H. A. Ahmad, Enhancing plant disease detection: A novel cnn-based approach with tensor subspace learning and howsvd-md, arXiv preprint arXiv:2405.20058 (2024).
  • [13] J. Lu, H. Fu, X. Tang, Z. Liu, J. Huang, W. Zou, H. Chen, Y. Sun, X. Ning, J. Li, Goa-optimized deep learning for soybean yield estimation using multi-source remote sensing data, Scientific Reports 14 (1) (2024) 7097.
  • [14] Y. Zhang, M. Ye, G. Zhu, Y. Liu, P. Guo, J. Yan, Ffca-yolo for small object detection in remote sensing images, IEEE Transactions on Geoscience and Remote Sensing (2024).
  • [15] Y. Himeur, S. Al-Maadeed, H. Kheddar, N. Al-Maadeed, K. Abualsaud, A. Mohamed, T. Khattab, Video surveillance using deep transfer learning and deep domain adaptation: Towards better generalization, Engineering Applications of Artificial Intelligence 119 (2023) 105698.
  • [16] T. Lopes, D. Capela, D. Guimarães, M. F. Ferreira, P. A. Jorge, N. A. Silva, From sensor fusion to knowledge distillation in collaborative libs and hyperspectral imaging for mineral identification, Scientific Reports 14 (2024).
  • [17] S. S. Sohail, Y. Himeur, H. Kheddar, A. Amira, F. Fadli, S. Atalla, A. Copiaco, W. Mansoor, Advancing 3d point cloud understanding through deep transfer learning: A comprehensive survey, Information Fusion (2024) 102601.
  • [18] C. Yu, X. Zhao, B. Gong, Y. Hu, M. Song, H. Yu, C.-I. Chang, Distillation-constrained prototype representation network for hyperspectral image incremental classification, IEEE Transactions on Geoscience and Remote Sensing (2024).
  • [19] B. Xu, H. Zheng, Z. Hu, L. Yang, M. Zheng, X. Feng, W. Lin, Double reverse regularization network based on self-knowledge distillation for sar object classification, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2024, pp. 7800–7804.
  • [20] X. Cheng, Y. Sun, W. Zhang, Y. Wang, X. Cao, Y. Wang, Application of deep learning in multitemporal remote sensing image classification, Remote Sensing 15 (15) (2023).
    URL https://www.mdpi.com/2072-4292/15/15/3859
  • [21] O. Kerdjidj, Y. Himeur, S. S. Sohail, A. Amira, F. Fadli, S. Attala, W. Mansoor, A. Copiaco, A. Gawanmeh, S. Miniaoui, et al., Uncovering the potential of indoor localization: Role of deep and transfer learning, IEEE Access (2024).
  • [22] A. N. Sayed, Y. Himeur, F. Bensaali, From time-series to 2d images for building occupancy prediction using deep transfer learning, Engineering Applications of Artificial Intelligence 119 (2023) 105786.
  • [23] N. C. Thompson, K. Greenewald, K. Lee, G. F. Manso, The computational limits of deep learning, MIT INITIATIVE ON THE DIGITAL ECONOMY RESEARCH BRIEF 4 (2020).
    URL https://ide.mit.edu/wp-content/uploads/2020/09/RBN.Thompson.pdf
  • [24] Y. Himeur, B. Rimal, A. Tiwary, A. Amira, Using artificial intelligence and data fusion for environmental monitoring: A review and future perspectives, Information Fusion 86 (2022) 44–75.
  • [25] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (56) (2014) 1929–1958.
    URL http://jmlr.org/papers/v15/srivastava14a.html
  • [26] A. Konya, P. Nematzadeh, Recent applications of ai to environmental disciplines: A review, Science of The Total Environment 906 (2024) 167705. doi:10.1016/j.scitotenv.2023.167705.
    URL https://www.sciencedirect.com/science/article/pii/S0048969723063325
  • [27] R. Biloslavo, D. Edgar, E. Aydin, C. Bulut, Artificial intelligence (ai) and strategic planning process within vuca environments: A research agenda and guidelines, Management Decision (Jan 2024). doi:10.1108/MD-10-2023-1944.
    URL https://doi.org/10.1108/MD-10-2023-1944
  • [28] J. Ji, X. Xi, X. Lu, Y. Guo, H. Xie, From coarse to fine: Knowledge distillation for remote sensing scene classification, in: IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium, IEEE, 2023, pp. 5427–5430.
  • [29] G. Yu, Data-free knowledge distillation for privacy-preserving efficient uav networks, in: 2022 6th International Conference on Robotics and Automation Sciences (ICRAS), IEEE, 2022, pp. 52–56.
  • [30] H. Kheddar, Y. Himeur, A. I. Awad, Deep transfer learning for intrusion detection in industrial control networks: A comprehensive review, Journal of Network and Computer Applications 220 (2023) 103760.
  • [31] S. Zhang, H. Liu, K. He, Knowledge distillation via token-level relationship graph based on the big data technologies, Big Data Research 36 (2024) 100438.
  • [32] O. Kerdjidj, Y. Himeur, S. Atalla, A. Copiac, S. S. Sohail, F. Fadli, A. Amira, W. Mansoor, A. Gawanmeh, Exploring 2d representation and transfer learning techniques for people identification in indoor localization, in: 2023 6th International Conference on Signal Processing and Information Security (ICSPIS), IEEE, 2023, pp. 173–177.
  • [33] S. Sun, W. Ren, J. Li, R. Wang, X. Cao, Logit standardization in knowledge distillation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15731–15740.
  • [34] A. Bechar, Y. Elmir, R. Medjoudj, Y. Himeur, A. Amira, Transfer learning for cancer detection based on images analysis, Procedia Computer Science 239 (2024) 1903–1910, CENTERIS – International Conference on ENTERprise Information Systems / ProjMAN – International Conference on Project MANagement / HCist – International Conference on Health and Social Care Information Systems and Technologies 2023.
  • [35] Y. Wang, X. Li, M. Shi, K. Xian, Z. Cao, Knowledge distillation for fast and accurate monocular depth estimation on mobile devices, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2457–2465.
  • [36] P. Chen, S. Liu, H. Zhao, J. Jia, Distilling knowledge via knowledge review, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 5008–5017.
  • [37] J. Gou, B. Yu, S. J. Maybank, D. Tao, Knowledge distillation: A survey, International Journal of Computer Vision 129 (6) (2021) 1789–1819.
  • [38] N. Yadikar, K. Ubul, et al., A review of knowledge distillation in object detection, IEEE Access (2023).
  • [39] A. Alkhulaifi, F. Alsahli, I. Ahmad, Knowledge distillation in deep learning and its applications, PeerJ Computer Science 7 (2021) e474.
  • [40] R. Yu, S. Liu, X. Wang, Dataset distillation: A comprehensive review, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
  • [41] H. Meng, Z. Lin, F. Yang, Y. Xu, L. Cui, Knowledge distillation in medical data mining: a survey, in: 5th International Conference on Crowd Science and Engineering, 2021, pp. 175–182.
  • [42] Z. Li, P. Xu, X. Chang, L. Yang, Y. Zhang, L. Yao, X. Chen, When object detection meets knowledge distillation: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (8) (2023) 10555–10579.
  • [43] W. Luo, A comprehensive survey on knowledge distillation of diffusion models, arXiv preprint arXiv:2304.04262 (2023).
  • [44] K. Acharya, A. Velasquez, H. H. Song, A survey on symbolic knowledge distillation of large language models, IEEE Transactions on Artificial Intelligence (2024).
  • [45] S. M. Kaleem, T. Rouf, G. Habib, B. Lall, et al., A comprehensive review of knowledge distillation in computer vision, arXiv preprint arXiv:2404.00936 (2024).
  • [46] G. Habib, T. J. Saleem, B. Lall, Knowledge distillation in vision transformers: A critical review, arXiv preprint arXiv:2302.02108 (2023).
  • [47] A.-A. Liu, B. Yang, W. Li, D. Song, Z. Sun, T. Ren, Z. Wei, Text-guided knowledge transfer for remote sensing image-text retrieval, IEEE Geoscience and Remote Sensing Letters (2024).
  • [48] W. Ma, O. Karakus, P. L. Rosin, Knowledge distillation for road detection based on cross-model semi-supervised learning, arXiv preprint arXiv:2402.05305 (2024).
  • [49] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531 (2015).
  • [50] J. Xue, J. Li, Y. Han, Z. Wang, C. Deng, T. Xu, Feature-based knowledge distillation for infrared small target detection, IEEE Geoscience and Remote Sensing Letters (2024).
  • [51] Z. Liu, S. Wang, Y. Gu, Sar image compression with inherent denoising capability through knowledge distillation, IEEE Geoscience and Remote Sensing Letters (2024).
  • [52] F. Han, H. Dong, L. Si, L. Zhang, Improving sar automatic target recognition via trusted knowledge distillation from simulated data, IEEE Transactions on Geoscience and Remote Sensing (2024).
  • [53] W. Zhang, W. Deng, Z. Cui, J. Liu, L. Jiao, Object knowledge distillation for joint detection and tracking in satellite videos, IEEE Transactions on Geoscience and Remote Sensing (2024).
  • [54] B. Lu, C. Ding, J. Bi, D. Song, Weakly supervised change detection via knowledge distillation and multiscale sigmoid inference, arXiv preprint arXiv:2403.05796 (2024).
  • [55] Z. Du, Y. Liang, Object detection of remote sensing image based on multi-scale feature fusion and attention mechanism, IEEE Access (2024).
  • [56] W. Zhao, Z. Zhang, J. Liu, Y. Liu, Y. He, H. Lu, Center-wise feature consistency learning for long-tailed remote sensing object recognition, IEEE Transactions on Geoscience and Remote Sensing (2024).
  • [57] R. Miles, K. Mikolajczyk, Understanding the role of the projector in knowledge distillation, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, No. 5, 2024, pp. 4233–4241.
  • [58] H. Oki, M. Abe, J. Miyao, T. Kurita, Triplet loss for knowledge distillation, in: 2020 International Joint Conference on Neural Networks (IJCNN), IEEE, 2020, pp. 1–7.
  • [59] Z. Yang, A. Zeng, Z. Li, T. Zhang, C. Yuan, Y. Li, From knowledge distillation to self-knowledge distillation: A unified approach with normalized loss and customized soft labels, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17185–17194.
  • [60] A. Van Etten, D. Hogan, J. M. Manso, J. Shermeyer, N. Weir, R. Lewis, The multi-temporal urban development spacenet dataset, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 6398–6407.
  • [61] A. Van Etten, D. Lindenbaum, T. M. Bacastow, Spacenet: A remote sensing dataset and challenge series, arXiv preprint arXiv:1807.01232 (2018).
  • [62] (Apr 2023). [link].
    URL https://planetarycomputer.microsoft.com/dataset/ms-buildings
  • [63] R. Gupta, B. Goodman, N. Patel, R. Hosfelt, S. Sajeev, E. Heim, K. Doshi, Jigar Lucas, H. Choset, M. Gaston, Creating xbd: A dataset for assessing building damage from satellite imagery, in: CVPR, IEEE, 2019, pp. 10–17.
  • [64] V. S. F. Garnot, L. Landrieu, N. Chehata, Multi-modal temporal attention models for crop mapping from satellite time series, ISPRS Journal of Photogrammetry and Remote Sensing 187 (2022) 294–305.
  • [65] M. T. Chiu, X. Xu, K. Wang, J. Hobbs, N. Hovakimyan, T. S. Huang, H. Shi, The 1st agriculture-vision challenge: Methods and results, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 48–49.
  • [66] M. T. Chiu, X. Xu, Y. Wei, Z. Huang, A. G. Schwing, R. Brunner, H. Khachatrian, H. Karapetyan, I. Dozier, G. Rose, et al., Agriculture-vision: A large aerial image database for agricultural pattern analysis, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2828–2838.
  • [67] J. Shermeyer, T. Hossler, A. Van Etten, D. Hogan, R. Lewis, D. Kim, Rareplanes: Synthetic data takes flight, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 207–217.
  • [68] S. Waqas Zamir, A. Arora, A. Gupta, S. Khan, G. Sun, F. Shahbaz Khan, F. Zhu, L. Shao, G.-S. Xia, X. Bai, isaid: A large-scale dataset for instance segmentation in aerial images, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 28–37.
  • [69] J. Shermeyer, D. Hogan, J. Brown, A. Van Etten, N. Weir, F. Pacifici, R. Hänsch, A. Bastidas, S. Soenen, T. Bacastow, R. Lewis, Spacenet 6: Multi-sensor all weather mapping dataset, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020, pp. 768–777. doi:10.1109/CVPRW50498.2020.00106.
  • [70] M. inversion, Jeff Faudi, Airbus ship detection challenge (2018).
    URL https://kaggle.com/competitions/airbus-ship-detection
  • [71] V. Sainte Fare Garnot, L. Landrieu, Pastis: Panoptic agricultural satellite time series (optical and radar), https://github.com/VSainteuf/pastis-benchmark (2022).
  • [72] N. Passalis, M. Tzelepi, A. Tefas, Heterogeneous knowledge distillation using information flow modeling, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2336–2345. doi:10.1109/CVPR42600.2020.00241.
  • [73] D. Ienco, Y. J. E. Gbodjo, R. Gaetano, R. Interdonato, Generalized knowledge distillation for multi-sensor remote sensing classification: an application to land cover mapping, ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences 2 (2020) 997–1003.
  • [74] L. Tian, Z. Wang, B. He, C. He, D. Wang, D. Li, Knowledge distillation of grassmann manifold network for remote sensing scene classification, Remote Sensing 13 (22) (2021) 4537.
  • [75] X. Yang, S. Zhang, W. Yang, Two-way assistant: A knowledge distillation object detection method for remote sensing images, IEEE Transactions on Geoscience and Remote Sensing (2024).
  • [76] M. Nabi, L. Maggiolo, G. Moser, S. B. Serpico, A cnn-transformer knowledge distillation for remote sensing scene classification, in: IGARSS 2022-2022 IEEE International Geoscience and Remote Sensing Symposium, IEEE, 2022, pp. 663–666.
  • [77] J. Ma, D. Shi, X. Tang, X. Zhang, X. Han, L. Jiao, Cross-source image retrieval based on ensemble learning and knowledge distillation for remote sensing images, in: 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, IEEE, 2021, pp. 2803–2806.
  • [78] L. Zhao, X. Peng, Y. Chen, M. Kapadia, D. N. Metaxas, Knowledge as priors: Cross-modal knowledge generalization for datasets without superior knowledge, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6528–6537.
  • [79] K. Geng, X. Sun, Z. Yan, W. Diao, X. Gao, Topological space knowledge distillation for compact road extraction in optical remote sensing images, Remote Sensing 12 (19) (2020) 3175.
  • [80] W. Xiong, Z. Xiong, Y. Cui, Y. Lv, A discriminative distillation network for cross-source remote sensing image retrieval, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 13 (2020) 1234–1247.
  • [81] H. Liu, Y. Qu, L. Zhang, Multispectral scene classification via cross-modal knowledge distillation, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–12.
  • [82] S. Pande, A. Banerjee, S. Kumar, B. Banerjee, S. Chaudhuri, An adversarial approach to discriminative modality distillation for remote sensing image classification, in: Proceedings of the IEEE/CVF international conference on computer vision workshops, 2019, pp. 0–0.
  • [83] Y. Liu, Z. Xiong, Y. Yuan, Q. Wang, Distilling knowledge from super resolution for efficient remote sensing salient object detection, IEEE Transactions on Geoscience and Remote Sensing (2023).
  • [84] C. Yang, X. Yu, Z. An, Y. Xu, Categories of response-based, feature-based, and relation-based knowledge distillation, in: Advancements in Knowledge Distillation: Towards New Horizons of Intelligent Systems, Springer, 2023, pp. 1–32.
  • [85] G. Chen, X. Zhang, X. Tan, Y. Cheng, F. Dai, K. Zhu, Y. Gong, Q. Wang, Training small networks for scene classification of remote sensing images via knowledge distillation, Remote Sensing 10 (5) (2018) 719.
  • [86] H. Zhao, X. Sun, F. Gao, J. Dong, Pair-wise similarity knowledge distillation for rsi scene classification, Remote Sensing 14 (10) (2022) 2483.
  • [87] J. Chen, S. Wang, L. Chen, H. Cai, Y. Qian, Incremental detection of remote sensing objects with feature pyramid and knowledge distillation, IEEE Transactions on Geoscience and Remote Sensing 60 (2020) 1–13.
  • [88] Y. Yang, X. Sun, W. Diao, H. Li, Y. Wu, X. Li, K. Fu, Adaptive knowledge distillation for lightweight remote sensing object detectors optimizing, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–15.
  • [89] D. Li, Y. Nan, Y. Liu, Remote sensing image scene classification model based on dual knowledge distillation, IEEE Geoscience and Remote Sensing Letters 19 (2022) 1–5.
  • [90] L. Wang, J. Zhang, J. Tian, J. Li, L. Zhuo, Q. Tian, Efficient fine-grained object recognition in high-resolution remote sensing images from knowledge distillation to filter grafting, IEEE Transactions on Geoscience and Remote Sensing 61 (2023) 1–16.
  • [91] H.-K. Shin, K.-H. Uhm, S.-W. Jung, S.-J. Ko, Multispectral-to-rgb knowledge distillation for remote sensing image scene classification, IEEE Geoscience and Remote Sensing Letters 20 (2023) 1–5.
  • [92] Q. Chi, G. Lv, G. Zhao, X. Dong, A novel knowledge distillation method for self-supervised hyperspectral image classification, Remote Sensing 14 (18) (2022) 4523.
  • [93] K. Jiang, Z. Wang, P. Yi, J. Jiang, J. Xiao, Y. Yao, Deep distillation recursive network for remote sensing imagery super-resolution, Remote Sensing 10 (11) (2018) 1700.
  • [94] Q. Yuan, N. Wang, Buildings change detection using high-resolution remote sensing images with self-attention knowledge distillation and multiscale change-aware module, The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 46 (2022) 225–231.
  • [95] B.-Y. Liu, H.-X. Chen, Z. Huang, X. Liu, Y.-Z. Yang, Zoominnet: A novel small object detector in drone images with cross-scale knowledge distillation, Remote Sensing 13 (6) (2021) 1198.
  • [96] Y. Chen, M. Lin, Z. He, K. Polat, A. Alhudhaif, F. Alenezi, Consistency-and dependence-guided knowledge distillation for object detection in remote sensing images, Expert Systems with Applications 229 (2023) 120519.
  • [97] C. Li, G. Cheng, G. Wang, P. Zhou, J. Han, Instance-aware distillation for efficient object detection in remote sensing images, IEEE Transactions on Geoscience and Remote Sensing 61 (2023) 1–11.
  • [98] Y. Zhao, J. Liu, J. Yang, Z. Wu, Remote sensing image scene classification via self-supervised learning and knowledge distillation, Remote Sensing 14 (19) (2022) 4813.
  • [99] Z. Dong, G. Gao, T. Liu, Y. Gu, X. Zhang, Distilling segmenters from cnns and transformers for remote sensing images semantic segmentation, IEEE Transactions on Geoscience and Remote Sensing (2023).
  • [100] W. Zhou, Y. Li, J. Huang, Y. Liu, Q. Jiang, Mstnet-kd: Multilevel transfer networks using knowledge distillation for the dense prediction of remote-sensing images, IEEE Transactions on Geoscience and Remote Sensing (2024).
  • [101] N. Sachdeva, J. McAuley, Data distillation: A survey, Transactions on Machine Learning Research (2023).
    URL https://openreview.net/forum?id=lmXMXP74TO
  • [102] R. Zhang, Z. Chen, S. Zhang, F. Song, G. Zhang, Q. Zhou, T. Lei, Remote sensing image scene classification with noisy label distillation, Remote Sensing 12 (15) (2020) 2376.
  • [103] J. Yue, L. Fang, H. Rahmani, P. Ghamisi, Self-supervised learning with adaptive distillation for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 60 (2021) 1–13.
  • [104] E. Boix-Adsera, Towards a theory of model distillation, arXiv preprint arXiv:2403.09053 (2024).
  • [105] Y. Zhang, Z. Yan, X. Sun, W. Diao, K. Fu, L. Wang, Learning efficient and accurate detectors with dynamic knowledge distillation in remote sensing imagery, IEEE Transactions on Geoscience and Remote Sensing 60 (2021) 1–19.
  • [106] Y. Yang, Y. Wang, J. Dong, B. Yu, A knowledge distillation-based ground feature classification network with multiscale feature fusion in remote sensing images, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (2023).
  • [107] G. Wang, N. Zhang, J. Wang, W. Liu, Y. Xie, H. Chen, Knowledge distillation-based lightweight change detection in high-resolution remote sensing imagery for on-board processing, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (2024).
  • [108] Z. Chen, L. Deng, J. Gou, C. Wang, J. Li, D. Li, Building and road detection from remote sensing images based on weights adaptive multi-teacher collaborative distillation using a fused knowledge, International Journal of Applied Earth Observation and Geoinformation 124 (2023) 103522.
  • [109] L. Gu, Q. Fang, Z. Wang, E. Popov, G. Dong, Learning lightweight and superior detectors with feature distillation for onboard remote sensing object detection, Remote Sensing 15 (2) (2023) 370.
  • [110] Y. Chai, K. Fu, X. Sun, W. Diao, Z. Yan, Y. Feng, L. Wang, Compact cloud detection with bidirectional self-attention knowledge distillation, Remote Sensing 12 (17) (2020) 2770.
  • [111] Y. Liu, L. Zhang, Z. Han, C. Chen, Integrating knowledge distillation with learning to rank for few-shot scene classification, IEEE Transactions on Geoscience and Remote Sensing 60 (2021) 1–12.
  • [112] C. Wang, Y. Yue, B. Luo, Y. Chen, J. Xue, Psekd: Phase-shift encoded knowledge distillation for oriented object detection in remote sensing images, in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2024, pp. 2680–2684.
  • [113] D. Chen, A. Ma, Y. Zhong, Semi-supervised knowledge distillation framework for global-scale urban man-made object remote sensing mapping, International Journal of Applied Earth Observation and Geoinformation 122 (2023) 103439.
  • [114] W. Zhao, X. Lv, H. Wang, Y. Liu, Y. He, H. Lu, Weakly correlated distillation for remote sensing object recognition, IEEE Transactions on Geoscience and Remote Sensing (2023).
  • [115] Y. Lin, Z. Cai, J. Li, J. Zhang, Lightweight remote sensing image denoising via knowledge distillation, in: 2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP), IEEE, 2022, pp. 1–7.
  • [116] C.-C. Yu, T.-Y. Chen, C.-W. Hsu, H.-Y. Cheng, Incremental scene classification using dual knowledge distillation and classifier discrepancy on natural and remote sensing images, Electronics 13 (3) (2024) 583.
  • [117] G. Xu, X. Jiang, Y. Zhou, S. Li, X. Liu, P. Lin, Robust land cover classification with multi-modal knowledge distillation, IEEE Transactions on Geoscience and Remote Sensing (2023).
  • [118] M. Xu, Y. Zhao, Y. Liang, X. Ma, Hyperspectral image classification based on class-incremental learning with knowledge distillation, Remote Sensing 14 (11) (2022) 2556.
  • [119] W. Zhou, Y. Li, J. Huang, W. Yan, M. Fang, Q. Jiang, Gsgnet-s*: Graph semantic guidance network via knowledge distillation for optical remote sensing image scene analysis, IEEE Transactions on Geoscience and Remote Sensing 61 (2023) 1–12.
  • [120] B. Zhao, Q. Wang, Y. Wu, Q. Cao, Q. Ran, Target detection model distillation using feature transition and label registration for remote sensing imagery, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 15 (2022) 5416–5426.
  • [121] B. Heo, J. Kim, S. Yun, H. Park, N. Kwak, J. Y. Choi, A comprehensive overhaul of feature distillation, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 1921–1930. doi:10.1109/ICCV.2019.00201.
  • [122] W. Zhou, X. Fan, W. Yan, S. Shan, Q. Jiang, J.-N. Hwang, Graph attention guidance network with knowledge distillation for semantic segmentation of remote sensing images, IEEE Transactions on Geoscience and Remote Sensing (2023).
  • [123] P. Zhang, Y. Li, D. Wang, J. Wang, Rs-sskd: Self-supervision equipped with knowledge distillation for few-shot remote sensing scene classification, Sensors 21 (5) (2021) 1566.
  • [124] Y. Hu, X. Huang, X. Luo, J. Han, X. Cao, J. Zhang, Variational self-distillation for remote sensing scene classification, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–13.
  • [125] S. Xing, J. Xing, J. Ju, Q. Hou, X. Ding, Collaborative consistent knowledge distillation framework for remote sensing image scene classification network, Remote Sensing 14 (20) (2022) 5186.
  • [126] Y. Wu, P. Passban, M. Rezagholizadeh, Q. Liu, Why skip if you can combine: A simple knowledge distillation technique for intermediate layers, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 1016–1021.
  • [127] H.-J. Chang, N. Dong, R. Mavlyutov, S. Popuri, Y.-A. Chung, Colld: Contrastive layer-to-layer distillation for compressing multilingual pre-trained speech encoders, in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2024, pp. 10801–10805.
  • [128] C. Deepa, A. Shetty, A. Narasimhadhan, Knowledge distillation: a novel approach for deep feature selection, The Egyptian Journal of Remote Sensing and Space Science 26 (1) (2023) 63–73.
  • [129] D. Chen, J.-P. Mei, Y. Zhang, C. Wang, Z. Wang, Y. Feng, C. Chen, Cross-layer distillation with semantic calibration, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 35, No. 8, 2021, pp. 7028–7036.
  • [130] C. Wang, D. Chen, J.-P. Mei, Y. Zhang, Y. Feng, C. Chen, Semckd: Semantic calibration for cross-layer knowledge distillation, IEEE Transactions on Knowledge and Data Engineering 35 (6) (2022) 6305–6319.
  • [131] U. Nath, Y. Wang, P. Turaga, Y. Yang, Rnas-cl: Robust neural architecture search by cross-layer knowledge distillation, International Journal of Computer Vision (2024) 1–20.
  • [132] W. Zhao, X. Zhu, Z. He, X.-Y. Zhang, Z. Lei, Cross-architecture distillation for face recognition, in: Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 8076–8085.
  • [133] A. Yao, D. Sun, Knowledge transfer via dense cross-layer mutual-distillation, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16, Springer, 2020, pp. 294–311.
  • [134] T. Su, Q. Liang, J. Zhang, Z. Yu, Z. Xu, G. Wang, X. Liu, Deep cross-layer collaborative learning network for online knowledge distillation, IEEE Transactions on Circuits and Systems for Video Technology 33 (5) (2022) 2075–2087.
  • [135] H. Zhu, N. Jiang, J. Tang, X. Huang, H. Qing, W. Wu, P. Zhang, Cross-layer fusion for feature distillation, in: International Conference on Neural Information Processing, Springer, 2022, pp. 433–445.
  • [136] G. Hu, Y. Ji, X. Liang, Y. Han, Layer-fusion for online mutual knowledge distillation, Multimedia Systems 29 (2) (2023) 787–796.
  • [137] D. Nguyen, T. Nguyen, K. Nguyen, D. Phung, H. Bui, N. Ho, On cross-layer alignment for model fusion of heterogeneous neural networks, in: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1–5.
  • [138] Y. Zhang, Y. Gao, H. Zhang, X. Lei, L. Liu, Cross-layer patch alignment and intra-and-inter patch relations for knowledge distillation, in: 2023 IEEE International Conference on Image Processing (ICIP), IEEE, 2023, pp. 535–539.
  • [139] W. Zou, X. Qi, Z. Wu, Z. Wang, M. Sun, C. Shan, Coco distillnet: a cross-layer correlation distillation network for pathological gastric cancer segmentation, in: 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE, 2021, pp. 1227–1234.
  • [140] W. Zou, X. Qi, W. Zhou, M. Sun, Z. Sun, C. Shan, Graph flow: Cross-layer graph flow distillation for dual efficient medical image segmentation, IEEE Transactions on Medical Imaging 42 (4) (2022) 1159–1171.
  • [141] Z. Zhai, J. Liang, B. Cheng, L. Zhao, J. Qian, Strengthening attention: knowledge distillation via cross-layer feature fusion for image classification, International Journal of Multimedia Information Retrieval 13 (2) (2024) 1–13.
  • [142] J. Guo, D. Chen, C. Wang, Online cross-layer knowledge distillation on graph neural networks with deep supervision, Neural Computing and Applications 35 (30) (2023) 22359–22374.
  • [143] H. Song, Y. Li, X. Li, Y. Zhang, Y. Zhu, Y. Zhou, Erkt-net: Implementing efficient and robust knowledge distillation for remote sensing image classification, EAI Endorsed Transactions on Industrial Networks and Intelligent Systems 11 (3) (2024).
  • [144] Y. Zhang, W. Zhang, J. Li, X. Qi, X. Lu, L. Wang, Y. Hou, Empowering lightweight detectors: Orientation distillation via anti-ambiguous spatial transformation for remote sensing images, ISPRS Journal of Photogrammetry and Remote Sensing 214 (2024) 244–260.
  • [145] Z. Zhang, S. Mei, M. Ma, Z. Han, Adaptive composite feature generation for object detection in remote sensing images, IEEE Transactions on Geoscience and Remote Sensing (2024).
  • [146] H. Feng, L. Zhang, X. Yang, Z. Liu, Enhancing class-incremental object detection in remote sensing through instance-aware distillation, Neurocomputing 583 (2024) 127552.
  • [147] Y. Gao, Y. Wang, Y. Zhang, Z. Li, C. Chen, H. Feng, Feature super-resolution fusion with cross-scale distillation for small object detection in optical remote sensing images, IEEE Geoscience and Remote Sensing Letters (2024).
  • [148] P. Yang, S. Zhou, L. Wang, G. Yang, Weakly supervised object detection from remote sensing images via self-attention distillation and instance-aware mining, Multimedia Tools and Applications 83 (13) (2024) 39073–39095.
  • [149] J. Sun, H. Gao, Z. Yan, X. Qi, J. Yu, Z. Ju, Lightweight uav object-detection method based on efficient multidimensional global feature adaptive fusion and knowledge distillation, Electronics 13 (8) (2024) 1558.
  • [150] H. Yang, S. Qiu, X. Feng, Dc-kd: double-constraint knowledge distillation for optical satellite imagery object detection based on yolox model, in: Fourth International Conference on Machine Learning and Computer Application (ICMLCA 2023), Vol. 13176, SPIE, 2024, pp. 476–482.
  • [151] H. Song, Y. Yuan, Z. Ouyang, Y. Yang, H. Xiang, Efficient knowledge distillation for hybrid models: A vision transformer-convolutional neural network to convolutional neural network approach for classifying remote sensing images, IET Cyber-Systems and Robotics 6 (3) (2024) e12120.
  • [152] J. Zhang, B. Ye, Q. Zhang, Y. Gong, J. Lu, D. Zeng, A visual knowledge oriented approach for weakly supervised remote sensing object detection, Neurocomputing (2024) 128114.
  • [153] Y. Lian, X. Shi, S. Shen, J. Hua, Multitask learning for image translation and salient object detection from multimodal remote sensing images, The Visual Computer 40 (3) (2024) 1395–1414.
  • [154] N. Zeng, X. Li, P. Wu, H. Li, X. Luo, A novel tensor decomposition-based efficient detector for low-altitude aerial objects with knowledge distillation scheme, IEEE/CAA Journal of Automatica Sinica 11 (2) (2024) 487–501.
  • [155] Y. Wan, Z. Luan, J. Hu, H. Ji, W. Song, Z. Gao, Small object detection in unmanned aerial vehicle images leveraging density-aware scale adaptation and knowledge distillation, in: 2024 IEEE 18th International Conference on Control & Automation (ICCA), IEEE, 2024, pp. 699–704.
  • [156] Z. Jia, S. Sun, G. Liu, B. Liu, Mssd: multi-scale self-distillation for object detection, Visual Intelligence 2 (1) (2024) 8.
  • [157] C. Lin, X. Mao, C. Qiu, L. Zou, Dtcnet: Transformer-cnn distillation for super-resolution of remote sensing image, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (2024).
  • [158] H. Tang, W. Zhao, G. Hu, Y. Xiao, Y. Li, H. Wang, Text-guided diverse image synthesis for long-tailed remote sensing object classification, IEEE Transactions on Geoscience and Remote Sensing (2024).
  • [159] P. Shamsolmoali, J. Chanussot, H. Zhou, Y. Lu, Efficient object detection in optical remote sensing imagery via attention-based feature distillation, IEEE Transactions on Geoscience and Remote Sensing (2023).
  • [160] Y. Sun, L. Huang, Q. Zhu, D. Liang, Cs-kd: Confused sample knowledge distillation for semantic segmentation of aerial imagery, in: International Conference on Intelligent Computing, Springer, 2024, pp. 266–278.
  • [161] J. Yuan, M. H. Phan, L. Liu, Y. Liu, Fakd: Feature augmented knowledge distillation for semantic segmentation, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 595–605.
  • [162] R. Naushad, T. Kaur, E. Ghaderpour, Deep transfer learning for land use and land cover classification: A comparative study, Sensors 21 (23) (2021) 8083.
  • [163] Y. Wang, Y. Wang, J. Cai, T. K. Lee, C. Miao, Z. J. Wang, Ssd-kd: A self-supervised diverse knowledge distillation method for lightweight skin lesion classification using dermoscopic images, Medical Image Analysis 84 (2023) 102693.
  • [164] T. Gao, W. Ao, X.-A. Wang, Y. Zhao, P. Ma, M. Xie, H. Fu, J. Ren, Z. Gao, Enrich distill and fuse: Generalized few-shot semantic segmentation in remote sensing leveraging foundation model’s assistance, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 2771–2780.
  • [165] D. Zhang, Y. Zhou, J. Zhao, Z. Yang, H. Dong, R. Yao, H. Ma, Multi-granularity semantic alignment distillation learning for remote sensing image semantic segmentation, Frontiers of Computer Science 16 (4) (2022) 164351.
  • [166] Z. Li, X. Wu, J. Wang, Y. Guo, Weather-degraded image semantic segmentation with multi-task knowledge distillation, Image and Vision Computing 127 (2022) 104554.
  • [167] Y. Liu, X. Kang, Y. Huang, K. Wang, G. Yang, Unsupervised domain adaptation semantic segmentation for remote-sensing images via covariance attention, IEEE Geoscience and Remote Sensing Letters 19 (2022) 1–5.
  • [168] W. Shi, Q. Meng, L. Zhang, M. Zhao, C. Su, T. Jancsó, Dsanet: A deep supervision-based simple attention network for efficient semantic segmentation in remote sensing imagery, Remote Sensing 14 (21) (2022) 5399.
  • [169] X. Rong, X. Sun, W. Diao, P. Wang, Z. Yuan, H. Wang, Historical information-guided class-incremental semantic segmentation in remote sensing images, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–18.
  • [170] X. Rui, Z. Li, Y. Cao, Z. Li, W. Song, Dilrs: Domain-incremental learning for semantic segmentation in multi-source remote sensing data, Remote Sensing 15 (10) (2023) 2541.
  • [171] H.-Â. Lê, M.-T. Pham, Leveraging knowledge distillation for partial multi-task learning from multiple remote sensing datasets, arXiv preprint arXiv:2405.15394 (2024).
  • [172] L. Shan, W. Wang, K. Lv, B. Luo, Class-incremental learning for semantic segmentation in aerial imagery via distillation in all aspects, IEEE Transactions on Geoscience and Remote Sensing 60 (2021) 1–12.
  • [173] L. Shan, W. Wang, K. Lv, B. Luo, Class-incremental semantic segmentation of aerial images via pixel-level feature generation and task-wise distillation, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–17.
  • [174] Y. Li, T. Shi, Y. Zhang, W. Chen, Z. Wang, H. Li, Learning deep semantic segmentation network under multiple weakly-supervised constraints for cross-domain remote sensing image semantic segmentation, ISPRS Journal of Photogrammetry and Remote Sensing 175 (2021) 20–33.
  • [175] X. Guo, W. Zhou, T. Liu, Contrastive learning-based knowledge distillation for rgb-thermal urban scene semantic segmentation, Knowledge-Based Systems 292 (2024) 111588.
  • [176] Z. Cao, W. Diao, X. Sun, X. Lyu, M. Yan, K. Fu, C3net: Cross-modal feature recalibrated, cross-scale semantic aggregated and compact network for semantic segmentation of multi-modal high-resolution aerial images, Remote Sensing 13 (3) (2021) 528.
  • [177] L. Bai, S. Du, X. Zhang, H. Wang, B. Liu, S. Ouyang, Domain adaptation for remote sensing image semantic segmentation: An integrated approach of contrastive learning and adversarial learning, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–13.
  • [178] H. Wang, C. Tao, J. Qi, R. Xiao, H. Li, Avoiding negative transfer for semantic segmentation of remote sensing images, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–15.
  • [179] U. Michieli, P. Zanuttigh, Knowledge distillation for incremental learning in semantic segmentation, Computer Vision and Image Understanding 205 (2021) 103167.
  • [180] F. J. Peña, C. Hübinger, A. H. Payberah, F. Jaramillo, Deepaqua: Semantic segmentation of wetland water surfaces with sar imagery using deep neural networks without manually annotated data, International Journal of Applied Earth Observation and Geoinformation 126 (2024) 103624.
  • [181] R. N. Nair, R. Hänsch, Let me show you how it’s done-cross-modal knowledge distillation as pretext task for semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 595–603.
  • [182] W. Wang, F. Liu, W. Liao, L. Xiao, Cross-modal graph knowledge representation and distillation learning for land cover classification, IEEE Transactions on Geoscience and Remote Sensing (2023).
  • [183] X. Li, L. Lei, Y. Sun, G. Kuang, Dynamic-hierarchical attention distillation with synergetic instance selection for land cover classification using missing heterogeneity images, IEEE Transactions on Geoscience and Remote Sensing 60 (2021) 1–16.
  • [184] X. Zhang, X. Li, G. Chen, P. Liao, T. Wang, H. Yang, C. He, W. Zhou, Y. Sun, A deep transfer learning framework using teacher-student structure for land cover classification of remote sensing imagery, IEEE Geoscience and Remote Sensing Letters (2023).
  • [185] R. Kanagavelu, K. Dua, P. Garai, N. Thomas, S. Elias, S. Elias, Q. Wei, L. Yong, G. S. M. Rick, Fedukd: Federated unet model with knowledge distillation for land use classification from satellite and street views, Electronics 12 (4) (2023) 896.
  • [186] Y. J. E. Gbodjo, O. Montet, D. Ienco, R. Gaetano, S. Dupuy, Multisensor land cover classification with sparsely annotated data based on convolutional neural networks and self-distillation, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14 (2021) 11485–11499.
  • [187] X. Li, L. Lei, C. Zhang, G. Kuang, Dense adaptive grouping distillation network for multimodal land cover classification with privileged modality, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–14.
  • [188] S. Kumar, B. Banerjee, S. Chaudhuri, Improved landcover classification using online spectral data hallucination, Neurocomputing 439 (2021) 316–326.
  • [189] F. Xu, Y. Shi, W. Yang, G.-S. Xia, X. X. Zhu, Cloudseg: A multi-modal learning framework for robust land cover mapping under cloudy conditions, ISPRS Journal of Photogrammetry and Remote Sensing 214 (2024) 21–32.
  • [190] S. Julka, M. Granitzer, Knowledge distillation with segment anything (sam) model for planetary geological mapping, in: International Conference on Machine Learning, Optimization, and Data Science, Springer, 2023, pp. 68–77.
  • [191] H. Bazzi, D. Ienco, N. Baghdadi, M. Zribi, V. Demarez, Distilling before refine: Spatio-temporal transfer learning for mapping irrigated areas using sentinel-1 time series, IEEE Geoscience and Remote Sensing Letters 17 (11) (2020) 1909–1913.
  • [192] K.-A. C. Quan, V.-T. Nguyen, M.-T. Tran, A lightweight model for remote sensing image retrieval with knowledge distillation and mining interclass characteristics, in: 2021 8th NAFOSTED Conference on Information and Computer Science (NICS), IEEE, 2021, pp. 217–223.
  • [193] C. Broni-Bediako, J. Xia, N. Yokoya, Unsupervised domain adaptation architecture search with self-training for land cover mapping, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 543–553.
  • [194] S. Garg, B. Feinstein, S. Timnat, V. Batchu, G. Dror, A. G. Rosenthal, V. Gulshan, Cross-modal distillation for flood extent mapping, Environmental Data Science 2 (2023) e37.
  • [195] K. Yan, M. Zhou, L. Liu, C. Xie, D. Hong, When pansharpening meets graph convolution network and knowledge distillation, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–15.
  • [196] L. Yan, J. Yang, J. Wang, Domain knowledge-guided self-supervised change detection for remote sensing images, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 16 (2023) 4167–4179.
  • [197] A. Matin, P. Khandelwal, S. Pallickara, S. L. Pallickara, Discern: Leveraging knowledge distillation to generate high resolution soil moisture estimation from coarse satellite data, in: 2023 IEEE International Conference on Big Data (BigData), IEEE, 2023, pp. 1222–1229.
  • [198] B. Ren, Z. Wang, B. Hou, B. Liu, Z. Wu, J. Chanussot, L. Jiao, Incremental land cover classification via label strategy and adaptive weights, IEEE Transactions on Geoscience and Remote Sensing 61 (2023) 1–15.
  • [199] L. Liangde, W. Xiujuan, K. Mengzhen, H. Jing, F. Menghan, Agricultural named entity recognition based on semantic aggregation and model distillation, Smart Agriculture 3 (1) (2021) 118.
  • [200] A. Ghofrani, R. Mahdian Toroghi, Knowledge distillation in plant disease recognition, Neural Computing and Applications 34 (17) (2022) 14287–14296.
  • [201] Y. Hu, G. Liu, Z. Chen, J. Liu, J. Guo, Lightweight one-stage maize leaf disease detection model with knowledge distillation, Agriculture 13 (9) (2023) 1664.
  • [202] Q. Dong, R. Gu, S. Chen, J. Zhu, Apple leaf disease diagnosis based on knowledge distillation and attention mechanism, IEEE Access (2024).
  • [203] Q. Huang, X. Wu, Q. Wang, X. Dong, Y. Qin, X. Wu, Y. Gao, G. Hao, Knowledge distillation facilitates the lightweight and efficient plant diseases detection model, Plant Phenomics 5 (2023) 0062.
  • [204] S. Angarano, M. Martini, A. Navone, M. Chiaberge, Domain generalization for crop segmentation with standardized ensemble knowledge distillation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 5450–5459.
  • [205] M. Li, M. Halstead, C. McCool, Knowledge distillation for efficient panoptic semantic segmentation: Applied to agriculture, in: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2023, pp. 4204–4211.
  • [206] J.-Y. Jung, S.-H. Lee, J.-O. Kim, Plant leaf segmentation using knowledge distillation, in: 2022 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), IEEE, 2022, pp. 1–3.
  • [207] M. Pagé-Fortin, Class-incremental learning of plant and disease detection: Growing branches with knowledge distillation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 593–603.
  • [208] J. Wang, X. Lin, L. Luo, M. Chen, H. Wei, L. Xu, S. Luo, Cognition of grape cluster picking point based on visual knowledge distillation in complex vineyard environment, Computers and Electronics in Agriculture 225 (2024) 109216.
  • [209] L. Hollard, L. Mohimont, Applying knowledge distillation on pre-trained model for early grapevine detection, in: Workshop Proceedings of the 19th International Conference on Intelligent Environments (IE2023), IOS Press, 2023, pp. 149–156.
  • [210] A. Musa, M. Hassan, M. Hamada, F. Aliyu, Low-power deep learning model for plant disease detection for smart-hydroponics using knowledge distillation techniques, Journal of Low Power Electronics and Applications 12 (2) (2022) 24.
  • [211] H. Zhang, M. Wang, Mixkd: Mix data augmentation guided knowledge distillation for plant leaf disease recognition, in: International Conference on Green, Pervasive, and Cloud Computing, Springer, 2022, pp. 169–177.
  • [212] J. Yin, J. Wu, C. Gao, H. Yu, L. Liu, S. Guo, A novel fish individual recognition method for precision farming based on knowledge distillation strategy and the range of the receptive field, Journal of Fish Biology (2024).
  • [213] B. Li, Y. Liu, Q. Duan, T-kd: two-tier knowledge distillation for a lightweight underwater fish species classification model, Aquaculture International 32 (3) (2024) 3107–3128.
  • [214] Z. Yang, X. Jiang, G. Jin, J. Huang, J. Bai, D. Yu, Fast crop pest detection using lightweight feature extraction and knowledge distillation, in: 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), IEEE, 2023, pp. 2277–2282.
  • [215] F. Wu, R. Gazo, B. Benes, E. Haviarova, Deep barkid: a portable tree bark identification system by knowledge distillation, European Journal of Forest Research 140 (6) (2021) 1391–1399.
  • [216] K. Yamamoto, Distillation of crop models to learn plant physiology theories using machine learning, PLoS ONE 14 (5) (2019) e0217075.
  • [217] Q. Wenjie, Y. Jin, H. Liangqing, Y. Juan, L. Qili, M. Jianyou, Y. Wanmao, Distilled-mobilenet model of convolutional neural network simplified structure for plant disease recognition, Smart Agriculture 3 (1) (2021) 109.
  • [218] Z. Wang, Z. Ren, X. Li, Identification of coffee leaf pests and diseases based on transfer learning and knowledge distillation, Frontiers in Computing and Intelligent Systems 5 (1) (2023) 15–18.
  • [219] M. Li, M. Halstead, C. McCool, Knowledge distillation for efficient instance semantic segmentation with transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 5432–5439.
  • [220] R. Arablouei, L. Wang, C. Phillips, L. Currie, J. Yates, G. Bishop-Hurley, In-situ animal behavior classification using knowledge distillation and fixed-point quantization, Smart Agricultural Technology 4 (2023) 100159.
  • [221] G. Castellano, P. De Marinis, G. Vessio, Applying knowledge distillation to improve weed mapping with drones, in: 2023 18th Conference on Computer Science and Intelligence Systems (FedCSIS), IEEE, 2023, pp. 393–400.
  • [222] S. Bansal, M. Singh, S. Barda, N. Goel, M. Saini, Pa-rdfknet: Unifying plant age estimation through rgb-depth fusion and knowledge distillation, IEEE Transactions on AgriFood Electronics (2024).
  • [223] L. Shen, J. Lin, D. Bai, Z. Zhang, C. Wang, X. Lei, Multi-level relational knowledge distillation for low resolution image recognition, in: Proceedings of the 2021 10th International Conference on Computing and Pattern Recognition, 2021, pp. 31–35.
  • [224] M. H. Phan, S. L. Phung, K. Luu, A. Bouzerdoum, Efficient hyperspectral image segmentation for biosecurity scanning using knowledge distillation from multi-head teacher, Neurocomputing 504 (2022) 189–203.
  • [225] S. Mane, P. Bartakke, T. Bastewad, Efficient pomegranate segmentation with unet: A comparative analysis of backbone architectures and knowledge distillation, in: ITM Web of Conferences, Vol. 54, EDP Sciences, 2023, p. 01001.
  • [226] G. Tsagkatakis, S. Roumpakis, S. Nikolidakis, E. Petra, A. Mantes, A. Kapantagakis, K. Grigorakis, G. Katselis, N. Vlahos, P. Tsakalides, Knowledge distillation from multispectral images for fish freshness estimation, Electronic Imaging 33 (2021) 1–7.
  • [227] S. O. Mengisti Berihu Girmay, Explainable ai: Leaf-based medicinal plant classification using knowledge distillation, in: 44. GIL-Jahrestagung, Biodiversität fördern durch digitale Landwirtschaft, Gesellschaft für Informatik eV, 2024, pp. 23–34.
  • [228] T. Rithanasophon, K. Thitisiriwech, P. Kantavat, B. Kijsirikul, Y. Iwahori, S. Fukui, K. Nakamura, Y. Hayashi, Quality of life prediction on walking scenes using deep neural networks and performance improvement using knowledge distillation, Electronics 12 (13) (2023) 2907.
  • [229] Y. Liu, J. Ding, Y. Fu, Y. Li, Urbankg: An urban knowledge graph system, ACM Transactions on Intelligent Systems and Technology 14 (4) (2023) 1–25.
  • [230] H. Xu, G. Xu, G. Sun, J. Chen, J. Hao, Building polygon extraction from high-resolution remote sensing imagery using knowledge distillation, Applied Sciences 13 (16) (2023) 9239.
  • [231] G. Xu, M. Deng, G. Sun, Y. Guo, J. Chen, Improving building extraction by using knowledge distillation to reduce the impact of label noise, Remote Sensing 14 (22) (2022) 5645.
  • [232] Y. Li, P. Li, D. Yan, Y. Liu, Z. Liu, Deep knowledge distillation: A self-mutual learning framework for traffic prediction, Expert Systems with Applications 252 (2024) 124138.
  • [233] H. Pan, X. Chang, W. Sun, Multitask knowledge distillation guides end-to-end lane detection, IEEE Transactions on Industrial Informatics 19 (9) (2023) 9703–9712.
  • [234] N. Kim, J. An, Knowledge distillation for traversable region detection of lidar scan in off-road environments, Sensors 24 (1) (2023) 79.
  • [235] K. Lee, S. Lee, H. Kim, Accelerating multi-class defect detection of building façades using knowledge distillation of dcnn-based model, International Journal of Sustainable Building Technology and Urban Development 12 (2) (2021) 80–95.
  • [236] G. Li, Z. Ji, S. Li, X. Luo, X. Qu, Driver behavioral cloning for route following in autonomous vehicles using task knowledge distillation, IEEE Transactions on Intelligent Vehicles 8 (2) (2022) 1025–1033.
  • [237] Z. Hong, Q. Lin, B. Hu, Knowledge distillation-based edge-decision hierarchies for interactive behavior-aware planning in autonomous driving system, IEEE Transactions on Intelligent Transportation Systems (2024).
  • [238] H. Luo, T. Chen, X. Li, S. Li, C. Zhang, G. Zhao, X. Liu, Keepedge: A knowledge distillation empowered edge intelligence framework for visual assisted positioning in uav delivery, IEEE Transactions on Mobile Computing 22 (8) (2022) 4729–4741.
  • [239] P. A. Pelizari, C. Geiß, S. Groth, H. Taubenböck, Deep multitask learning with label interdependency distillation for multicriteria street-level image classification, ISPRS Journal of Photogrammetry and Remote Sensing 204 (2023) 275–290.
  • [240] Y. Liu, J. Ding, Y. Li, Developing knowledge graph based system for urban computing, in: Proceedings of the 1st ACM SIGSPATIAL International Workshop on Geospatial Knowledge Graphs, 2022, pp. 3–7.
  • [241] P. Gupta, D. Isele, S. Bae, Towards scalable & efficient interaction-aware planning in autonomous vehicles using knowledge distillation, arXiv preprint arXiv:2404.01746 (2024).
  • [242] S. Tsanakas, A. Hameed, J. Violos, A. Leivadeas, A light-weight edge-enabled knowledge distillation technique for next location prediction of multitude transportation means, Future Generation Computer Systems 154 (2024) 45–58.
  • [243] W. Zhou, X. Yang, X. Dong, M. Fang, W. Yan, T. Luo, Mjpnet-s*: Multistyle joint-perception network with knowledge distillation for drone rgb-thermal crowd density estimation in smart cities, IEEE Internet of Things Journal (2024).
  • [244] H. Wang, X. Li, Deepblue: Advanced convolutional neural network applications for ocean remote sensing, IEEE Geoscience and Remote Sensing Magazine (2023).
  • [245] X. Chen, X. Chen, F. Wu, H. Wang, H. Yao, Online_xkd: An online knowledge distillation model for underwater object detection, Computers and Electrical Engineering 119 (2024) 109501.
  • [246] A. Ben Tamou, A. Benzinou, K. Nasreddine, Live fish species classification in underwater images by using convolutional neural networks based on incremental learning with knowledge distillation loss, Machine Learning and Knowledge Extraction 4 (3) (2022) 753–767.
  • [247] Y. Ding, K. Li, H. Mei, S. Liu, G. Hou, Watermono: Teacher-guided anomaly masking and enhancement boosting for robust underwater self-supervised monocular depth estimation, arXiv preprint arXiv:2406.13344 (2024).
  • [248] L. Wang, Q. Li, T. Wang, Q. Lv, X. Peng, A self-supervised framework for refined reconstruction of geophysical fields via domain adaptation, Earth and Space Science 11 (7) (2024) e2023EA003197.
  • [249] Y. Jin, J. Liu, K. Ren, X. Wang, K. Deng, Z. Fan, C. Deng, Y. Yue, Towards robust tropical cyclone wind radii estimation with multi-modality fusion and missing-modality distillation, IEEE Transactions on Geoscience and Remote Sensing (2024).
  • [250] J. Zhang, J. Gao, J. Liang, Y. Wu, B. Li, Y. Zhai, X. Li, Efficient water segmentation with transformer and knowledge distillation for usvs, Journal of Marine Science and Engineering 11 (5) (2023) 901.
  • [251] Z.-A. Yang, N.-R. Zheng, X.-Z. Shi, W.-B. Zhou, F. Wang, Precise and fast segmentation of sea ice in high-resolution images based on multiscale and knowledge distillation, in: IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium, IEEE, 2023, pp. 4946–4949.
  • [252] S. Chen, R. Zhan, W. Wang, J. Zhang, Learning slimming sar ship object detector through network pruning and knowledge distillation, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14 (2020) 1267–1282.
  • [253] S. Li, M. Lin, Y. Wang, Y. Wu, Y. Tian, L. Shao, R. Ji, Distilling a powerful student model via online knowledge distillation, IEEE Transactions on Neural Networks and Learning Systems 34 (11) (2022) 8743–8752.
  • [254] Z. Li, J. Ye, M. Song, Y. Huang, Z. Pan, Online knowledge distillation for efficient pose estimation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11740–11750.
  • [255] K. Binici, N. T. Pham, T. Mitra, K. Leman, Preventing catastrophic forgetting and distribution mismatch in knowledge distillation via synthetic data, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 663–671.
  • [256] W. Zhang, X. Miao, Y. Shao, J. Jiang, L. Chen, O. Ruas, B. Cui, Reliable data distillation on graph convolutional network, in: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020, pp. 1399–1414.
  • [257] A. Mishra, D. Marr, Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy, in: International Conference on Learning Representations, 2018.
    URL https://openreview.net/forum?id=B1ae1lZRb
  • [258] S. Stanton, P. Izmailov, P. Kirichenko, A. A. Alemi, A. G. Wilson, Does knowledge distillation really work?, Advances in Neural Information Processing Systems 34 (2021) 6906–6919.
  • [259] X. Lin, W. Zhong, X. Lin, Y. Zhou, L. Jiang, L. Du-Ikonen, L. Huang, Component modeling and updating method of integrated energy systems based on knowledge distillation, Energy and AI 16 (2024) 100350.
  • [260] S. Zhu, R. Shang, B. Yuan, W. Zhang, W. Li, Y. Li, L. Jiao, Dynamickd: An effective knowledge distillation via dynamic entropy correction-based distillation for gap optimizing, Pattern Recognition 153 (2024) 110545.
  • [261] M. Liang, S. Huang, W. Liu, Dynamic semantic structure distillation for low-resolution fine-grained recognition, Pattern Recognition 148 (2024) 110216.
  • [262] D. Yu, C. Fang, Urban remote sensing with spatial big data: a review and renewed perspective of urban studies in recent decades, Remote Sensing 15 (5) (2023) 1307.
  • [263] F. Ye, T. Ai, J. Wang, Y. Yao, Z. Zhou, A method for classifying complex features in urban areas using video satellite remote sensing data, Remote Sensing 14 (10) (2022) 2324.
  • [264] L. Zhang, C. Bao, K. Ma, Self-distillation: Towards efficient and compact neural networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (8) (2021) 4388–4403.
  • [265] H. Du, R. Yu, L. Bai, L. Bai, W. Wang, Learning structure perception mlps on graphs: a layer-wise graph knowledge distillation framework, International Journal of Machine Learning and Cybernetics (2024) 1–16.
  • [266] S. Kokane, M. R. Uddin, M. Xu, Improving knowledge distillation in transfer learning with layer-wise learning rates, arXiv preprint arXiv:2407.04871 (2024).
  • [267] S. Kim, Y. Kim, K. Moon, M. Jang, Ladimo: Layer-wise distillation inspired moefier, arXiv preprint arXiv:2408.04278 (2024).
  • [268] L. Zhong, S. Yan, Self knowledge distillation based on layer-wise weighted feature imitation for efficient object detection, in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2024, pp. 9851–9855.
  • [269] C. Liang, J. Yu, M.-H. Yang, M. Brown, Y. Cui, T. Zhao, B. Gong, T. Zhou, Module-wise adaptive distillation for multimodality foundation models, Advances in Neural Information Processing Systems 36 (2024).
  • [270] S. Park, D. Kang, J. Paik, Cosine similarity-guided knowledge distillation for robust object detectors, Scientific Reports 14 (1) (2024) 18888.
  • [271] Z. Lu, J. Wang, C. Jiang, Data-free knowledge filtering and distillation in federated learning, IEEE Transactions on Big Data (2024).
  • [272] J. Shao, F. Wu, J. Zhang, Selective knowledge sharing for privacy-preserving federated distillation without a good teacher, Nature Communications 15 (1) (2024) 349.
  • [273] Y. Qiao, A. Adhikary, K. T. Kim, C. Zhang, C. S. Hong, Knowledge distillation assisted robust federated learning: Towards edge intelligence, in: ICC 2024-IEEE International Conference on Communications, IEEE, 2024, pp. 843–848.
  • [274] Y. Yang, C. Liu, X. Cai, S. Huang, H. Lu, Y. Ding, Unideal: Curriculum knowledge distillation federated learning, in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2024, pp. 7145–7149.
  • [275] Q. Zhong, L. Ding, J. Liu, B. Du, D. Tao, Panda: Prompt transfer meets knowledge distillation for efficient model adaptation, IEEE Transactions on Knowledge and Data Engineering (2024).
  • [276] J. Gou, Y. Hu, L. Sun, Z. Wang, H. Ma, Collaborative knowledge distillation via filter knowledge transfer, Expert Systems with Applications 238 (2024) 121884.
  • [277] Z. Wu, S. Sun, Y. Wang, M. Liu, Q. Pan, J. Zhang, Z. Li, Q. Liu, Exploring the distributed knowledge congruence in proxy-data-free federated distillation, ACM Transactions on Intelligent Systems and Technology 15 (2) (2024) 1–34.
  • [278] H. Q. Le, M. N. Nguyen, S. R. Pandey, C. Zhang, C. S. Hong, Cdkt-fl: Cross-device knowledge transfer using proxy dataset in federated learning, Engineering Applications of Artificial Intelligence 133 (2024) 108093.
  • [279] K. Xu, L. Wang, H. Zhang, B. Yin, Self-knowledge distillation with learning from role-model samples, in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2024, pp. 5185–5189.
  • [280] S. Zhao, T. Liao, L. Fu, C. Chen, J. Bian, Z. Zheng, Data-free knowledge distillation via generator-free data generation for non-iid federated learning, Neural Networks (2024) 106627.
  • [281] K. Balaskas, A. Karatzas, C. Sad, K. Siozios, I. Anagnostopoulos, G. Zervakis, et al., Hardware-aware dnn compression via diverse pruning and mixed-precision quantization, IEEE Transactions on Emerging Topics in Computing (2024).
  • [282] H. Wang, P. Ling, X. Fan, T. Tu, J. Zheng, H. Chen, Y. Jin, E. Chen, All-in-one hardware-oriented model compression for efficient multi-hardware deployment, IEEE Transactions on Circuits and Systems for Video Technology (2024).
  • [283] Z. Li, A. Lu, Y. Xie, Z. Kong, M. Sun, H. Tang, Z. J. Xue, P. Dong, C. Ding, Y. Wang, et al., Quasar-vit: Hardware-oriented quantization-aware architecture search for vision transformers, in: Proceedings of the 38th ACM International Conference on Supercomputing, 2024, pp. 324–337.
  • [284] M. I. E. Ghebriout, H. Bouzidi, S. Niar, H. Ouarnoughi, Harmonic-nas: Hardware-aware multimodal neural architecture search on resource-constrained devices, in: Asian Conference on Machine Learning, PMLR, 2024, pp. 374–389.
  • [285] J.-Y. Baek, D.-H. Hur, D.-W. Kim, Y.-S. Yoo, H.-J. Shin, D.-H. Park, S.-H. Bae, Bit-width aware generator and intermediate layer knowledge distillation using channel-wise attention for generative data-free quantization, Journal of The Korea Society of Computer and Information 29 (7) (2024) 11–20.
  • [286] H. Bouzidi, Efficient deployment of deep neural networks on hardware devices for edge ai, Ph.D. thesis, Université Polytechnique Hauts-de-France (2024).
  • [287] N. Wang, H. Bi, F. Li, C. Xu, J. Gao, Self-distillation-based polarimetric image classification with noisy and sparse labels, Remote Sensing 15 (24) (2023) 5751.
  • [288] C. Fang, Q. Wang, L. Cheng, Z. Gao, C. Pan, Z. Cao, Z. Zheng, D. Zhang, Reliable mutual distillation for medical image segmentation under imperfect annotations, IEEE Transactions on Medical Imaging 42 (6) (2023) 1720–1734.
  • [289] X. Tian, D. Hou, S. Wang, X. Liu, H. Xing, An adaptive weighted method for remote sensing image retrieval with noisy labels, Applied Sciences 14 (5) (2024) 1756.
  • [290] Y. Shao, S. Li, P. Yang, F. Cheng, Y. Ding, J. Sun, Jointnet: Multitask learning framework for denoising and detecting anomalies in hyperspectral remote sensing, Remote Sensing 16 (14) (2024) 2619.
  • [291] M.-T. Tran, T. Le, X.-M. Le, M. Harandi, Q. H. Tran, D. Phung, Nayer: Noisy layer data generation for efficient and effective data-free knowledge distillation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 23860–23869.
  • [292] J. Wang, D. Huang, X. Wu, Y. Tang, L. Lan, Continuous review and timely correction: Enhancing the resistance to noisy labels via self-not-true distillation, in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2024, pp. 5700–5704.
  • [293] H. J. Park, W. Shin, J. S. Kim, S. W. Han, Leveraging non-causal knowledge via cross-network knowledge distillation for real-time speech enhancement, IEEE Signal Processing Letters (2024).
  • [294] H. Liu, M. Sheng, Z. Sun, Y. Yao, X.-S. Hua, H.-T. Shen, Learning with imbalanced noisy data by preventing bias in sample selection, IEEE Transactions on Multimedia (2024).
  • [295] Z. Li, H. Zhao, Z. Li, T. Liu, D. Guo, X. Wan, Extracting clean and balanced subset for noisy long-tailed classification, arXiv preprint arXiv:2404.06795 (2024).
  • [296] J. Tang, N. Jiang, H. Zhu, J. T. Zhou, C. Gong, Learning student network under universal label noise, IEEE Transactions on Image Processing (2024).
  • [297] H. Liu, Y. Wang, H. Liu, F. Sun, A. Yao, Small scale data-free knowledge distillation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 6008–6016.
  • [298] L. Zhang, T. Xu, C. Zeng, Q. Hao, Z. Chen, X. Liang, Semantic-aware contrastive adaptation bridges domain discrepancy for unsupervised remote sensing, IEEE Access (2024).
  • [299] S. Lee, J.-H. Kim, Semi-supervised scene change detection by distillation from feature-metric alignment, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 1226–1235.
  • [300] K. Heidler, I. Nitze, G. Grosse, X. X. Zhu, Pixeldino: Semi-supervised semantic segmentation for detecting permafrost disturbances in the arctic, IEEE Transactions on Geoscience and Remote Sensing (2024).
  • [301] J. Yang, X. Zhu, A. Bulat, B. Martinez, G. Tzimiropoulos, Knowledge distillation meets open-set semi-supervised learning, International Journal of Computer Vision (2024) 1–20.
  • [302] W. Pan, T. Gao, Y. Zhang, X. Zheng, Y. Shen, K. Li, R. Hu, Y. Liu, P. Dai, Semi-supervised blind image quality assessment through knowledge distillation and incremental learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 4388–4396.
  • [303] I. Bistritz, A. Mann, N. Bambos, Distributed distillation for on-device learning, Advances in Neural Information Processing Systems 33 (2020) 22593–22604.
  • [304] A. Malinin, B. Mlodozeniec, M. Gales, Ensemble distribution distillation, arXiv preprint arXiv:1905.00076 (2019).
  • [305] R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, G. E. Hinton, Large scale distributed neural network training through online distillation, arXiv preprint arXiv:1804.03235 (2018).
  • [306] M. Ryabinin, A. Malinin, M. Gales, Scaling ensemble distribution distillation to many classes with proxy targets, Advances in Neural Information Processing Systems 34 (2021) 6023–6035.
  • [307] Y. Shen, Z. Zhang, M. R. Sabuncu, L. Sun, Real-time uncertainty estimation in computer vision via uncertainty-aware distribution distillation, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 707–716.
  • [308] Y. Fathullah, M. J. Gales, Self-distribution distillation: efficient uncertainty estimation, in: Uncertainty in Artificial Intelligence, PMLR, 2022, pp. 663–673.
  • [309] A. Taya, T. Nishio, M. Morikura, K. Yamamoto, Decentralized and model-free federated learning: Consensus-based distillation in function space, IEEE Transactions on Signal and Information Processing over Networks 8 (2022) 799–814.
  • [310] H. Ruan, J. Peng, Y. Chen, S. He, Z. Zhang, H. Li, A class-incremental detection method of remote sensing images based on selective distillation, Symmetry 14 (10) (2022) 2100.
  • [311] M. Shen, D. Chen, S. Hu, G. Xu, Class incremental learning of remote sensing images based on class similarity distillation, PeerJ Computer Science 9 (2023) e1583.
  • [312] W.-b. GUAN, X.-h. WU, H.-g. CHEN, et al., Class-incremental few-shot object detection with distillation response in remote sensing images, New Generation of Information Technology 6 (22) (2023) 01–07.
  • [313] X. Lu, X. Sun, W. Diao, Y. Feng, P. Wang, K. Fu, Lil: Lightweight incremental learning approach through feature transfer for remote sensing image scene classification, IEEE Transactions on Geoscience and Remote Sensing 60 (2021) 1–20.
  • [314] Z. Ye, Y. Zhang, J. Zhang, W. Li, L. Bai, A multiscale incremental learning network for remote sensing scene classification, IEEE Transactions on Geoscience and Remote Sensing (2024).
  • [315] J. Xie, B. Pan, X. Xu, Z. Shi, Missnet: Memory-inspired semantic segmentation augmentation network for class-incremental learning in remote sensing images, IEEE Transactions on Geoscience and Remote Sensing (2024).
  • [316] E. Arnaudo, F. Cermelli, A. Tavera, C. Rossi, B. Caputo, A contrastive distillation approach for incremental semantic segmentation in aerial images, in: International Conference on Image Analysis and Processing, Springer, 2022, pp. 742–754.
  • [317] J. Wu, R. Ji, J. Liu, M. Xu, J. Zheng, L. Shao, Q. Tian, Real-time semantic segmentation via sequential knowledge distillation, Neurocomputing 439 (2021) 134–145.
  • [318] L. Zhuo, G. Wang, S. Li, W. Wu, Z. Liu, Fast-vid2vid++: Spatial-temporal distillation for real-time video-to-video synthesis, IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).
  • [319] F. Li, C. Fu, F. Lin, Y. Li, P. Lu, Training-set distillation for real-time uav object tracking, in: 2020 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2020, pp. 9715–9721.
  • [320] F. Grünenfelder, A. Boaron, G. V. Resta, M. Perrenoud, D. Rusca, C. Barreiro, R. Houlmann, R. Sax, L. Stasi, S. El-Khoury, et al., Fast single-photon detectors and real-time key distillation enable high secret-key-rate quantum key distribution systems, Nature Photonics 17 (5) (2023) 422–426.
  • [321] M. Thakker, S. E. Eskimez, T. Yoshioka, H. Wang, Fast real-time personalized speech enhancement: End-to-end enhancement network (e3net) and knowledge distillation, arXiv preprint arXiv:2204.00771 (2022).
  • [322] M. R. Islam, M. Abdel-Aty, D. Wang, Z. Islam, Spatial ensemble distillation learning for large-scale real-time crash prediction, IEEE Transactions on Intelligent Transportation Systems (2024).
  • [323] D. J. Dave, M. Z. Dabhiya, S. Satyadev, S. Ganguly, D. N. Saraf, Online tuning of a steady state crude distillation unit model for real time applications, Journal of Process Control 13 (3) (2003) 267–282.
  • [324] S. Angarano, F. Salvetti, M. Martini, M. Chiaberge, Generative adversarial super-resolution at the edge with knowledge distillation, Engineering Applications of Artificial Intelligence 123 (2023) 106407.
  • [325] M. Sepahvand, F. Abdali-Mohammadi, A. Taherkordi, An adaptive teacher–student learning algorithm with decomposed knowledge distillation for on-edge intelligence, Engineering Applications of Artificial Intelligence 117 (2023) 105560.
  • [326] S. Dey, A. Mukherjee, A. Ukil, A. Pal, Towards a task-agnostic distillation methodology for creating edge foundation models, in: Proceedings of the Workshop on Edge and Mobile Foundation Models, 2024, pp. 10–15.
  • [327] C. Wang, G. Yang, G. Papanastasiou, H. Zhang, J. J. Rodrigues, V. H. C. De Albuquerque, Industrial cyber-physical systems-based cloud iot edge for federated heterogeneous distillation, IEEE Transactions on Industrial Informatics 17 (8) (2020) 5511–5521.
  • [328] F. Huo, W. Xu, J. Guo, H. Wang, S. Guo, C2kd: Bridging the modality gap for cross-modal knowledge distillation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 16006–16015.
  • [329] L. Zhu, Y. Wang, Y. Hu, X. Su, K. Fu, Cross-modal contrastive learning with spatio-temporal context for correlation-aware multi-scale remote sensing image retrieval, IEEE Transactions on Geoscience and Remote Sensing (2024).
  • [330] P. Li, G. Liu, J. He, X. Meng, S. Zhong, X. Chen, Rsmodm: Multimodal momentum distillation model for remote sensing visual question answering, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (2024).
  • [331] D. Ienco, C. F. Dantas, Discom-kd: Cross-modal knowledge distillation via disentanglement representation and adversarial learning, arXiv preprint arXiv:2408.07080 (2024).
  • [332] Y. Chen, C. Du, Y. Zi, S. Xiong, X. Lu, Scale-aware adaptive refinement and cross interaction for remote sensing audio-visual cross-modal retrieval, IEEE Transactions on Geoscience and Remote Sensing (2024).
  • [333] A. Zavras, D. Michail, B. Demir, I. Papoutsis, Mind the modality gap: Towards a remote sensing vision-language model via cross-modal alignment, arXiv preprint arXiv:2402.09816 (2024).
  • [334] X. Zhang, W. Li, X. Wang, L. Wang, F. Zheng, L. Wang, H. Zhang, A fusion encoder with multi-task guidance for cross-modal text–image retrieval in remote sensing, Remote Sensing 15 (18) (2023) 4637.
  • [335] A. Dong, S. Liu, Multi-scale field distillation for multi-task semantic segmentation, in: International Conference on Artificial Neural Networks, Springer, 2023, pp. 508–519.
  • [336] D. Hong, C. Qiu, A. Yu, Y. Quan, B. Liu, X. Chen, Multi-task learning for building extraction and change detection from remote sensing images, Applied Sciences 13 (2) (2023) 1037.
  • [337] H. Zhou, X. Du, S. Li, Self-supervision and self-distillation with multilayer feature contrast for supervision collapse in few-shot remote sensing scene classification, Remote Sensing 14 (13) (2022) 3111.
  • [338] B. Liu, S. Wei, F. Zhang, N. Guo, H. Fan, W. Yao, Tomato leaf disease recognition based on multi-task distillation learning, Frontiers in Plant Science 14 (2024) 1330527.
  • [339] Z. Zhu, J. Kang, W. Diao, Y. Feng, J. Li, J. Ni, Sirs: Multi-task joint learning for remote sensing foreground-entity image-text retrieval, IEEE Transactions on Geoscience and Remote Sensing (2024).
  • [340] J. Zhang, J. Zhang, X. Huang, W. Zhou, H. Fu, Y. Chen, Z. Zhan, Dual-task network for terrace and ridge extraction: Automatic terrace extraction via multi-task learning, Remote Sensing 16 (3) (2024) 568.
  • [341] B. Yuan, D. Zhao, Z. Liu, W. Li, T. Li, Continual panoptic perception: Towards multi-modal incremental interpretation of remote sensing images, arXiv preprint arXiv:2407.14242 (2024).
  • [342] X. Jin, T. Ge, F. Wei, Plug and play knowledge distillation for knn-lm with external logits, in: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2022, pp. 463–469.
  • [343] Y.-T. Hsiao, S. Khodadadeh, K. Duarte, W.-A. Lin, H. Qu, M. Kwon, R. Kalarot, Plug-and-play diffusion distillation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13743–13752.
  • [344] S. Lao, G. Song, B. Liu, Y. Liu, Y. Yang, Unikd: Universal knowledge distillation for mimicking homogeneous or heterogeneous object detectors, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6362–6372.
  • [345] Z. Yang, Y. Cui, Z. Chen, W. Che, T. Liu, S. Wang, G. Hu, Textbrewer: An open-source knowledge distillation toolkit for natural language processing, arXiv preprint arXiv:2002.12620 (2020).
  • [346] Y. Matsubara, torchdistill: A modular, configuration-driven framework for knowledge distillation, in: International Workshop on Reproducible Research in Pattern Recognition, Springer, 2021, pp. 24–44.
  • [347] D. Batic, G. Tanoni, L. Stankovic, V. Stankovic, E. Principi, Improving knowledge distillation for non-intrusive load monitoring through explainability guided learning, in: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1–5.
  • [348] M. Liu, C. Guo, S. Guo, An explainable knowledge distillation method with xgboost for icu mortality prediction, Computers in Biology and Medicine 152 (2023) 106466.
  • [349] G. Taskin, A model distillation approach for explaining black-box models for hyperspectral image classification, in: IGARSS 2022-2022 IEEE International Geoscience and Remote Sensing Symposium, IEEE, 2022, pp. 3592–3595.
  • [350] G. Y. Lee, T. Dam, M. M. Ferdaus, D. P. Poenar, V. N. Duong, Unlocking the capabilities of explainable few-shot learning in remote sensing, Artificial Intelligence Review 57 (7) (2024) 169.
  • [351] H. Lee, S. Kim, Explaining neural networks using attentive knowledge distillation, Sensors 21 (4) (2021) 1280.
  • [352] C. Termritthikun, A. Umer, S. Suwanwimolkul, F. Xia, I. Lee, Explainable knowledge distillation for on-device chest x-ray classification, IEEE/ACM Transactions on Computational Biology and Bioinformatics (2023).
  • [353] X. Li, Q. Shen, A hybrid framework based on knowledge distillation for explainable disease diagnosis, Expert Systems with Applications 238 (2024) 121844.
  • [354] J. Mi, L. Wang, Y. Liu, J. Zhang, Kde-gan: A multimodal medical image-fusion model based on knowledge distillation and explainable ai modules, Computers in Biology and Medicine 151 (2022) 106273.
  • [355] Y. Xiao, L. Wang, W. Li, X. Zeng, Knowledge distillation with feature enhancement mask, in: International Conference on Artificial Neural Networks, Springer, 2023, pp. 432–443.
  • [356] G. Yang, S. Yu, Y. Sheng, H. Yang, Attention and feature transfer based knowledge distillation, Scientific Reports 13 (1) (2023) 18369.
  • [357] M. Zhou, J. Huang, X. Fu, F. Zhao, D. Hong, Effective pan-sharpening by multiscale invertible neural network and heterogeneous task distilling, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–14.
  • [358] Y. Lv, W. Xiong, X. Zhang, Y. Cui, Fusion-based correlation learning model for cross-modal remote sensing image retrieval, IEEE Geoscience and Remote Sensing Letters 19 (2021) 1–5.
  • [359] K. Xu, P. Deng, H. Huang, Vision transformer: An excellent teacher for guiding small networks in remote sensing image scene classification, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–15.
  • [360] N. Aghli, E. Ribeiro, Combining weight pruning and knowledge distillation for cnn compression, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 3191–3198.
  • [361] L. Malihi, G. Heidemann, Matching the ideal pruning method with knowledge distillation for optimal compression, Applied System Innovation 7 (4) (2024) 56.
  • [362] A. Kuldashboy, S. Umirzakova, S. Allaberdiev, R. Nasimov, A. Abdusalomov, Y. Im Cho, Efficient image classification through collaborative knowledge distillation: A novel alexnet modification approach, Heliyon 10 (14) (2024).
  • [363] B.-w. Kwak, Y. Kim, Y. J. Kim, S.-w. Hwang, J. Yeo, Trustal: Trustworthy active learning using knowledge distillation, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 36, 2022, pp. 7263–7271.
  • [364] Y. Boreshban, S. M. Mirbostani, G. Ghassem-Sani, S. A. Mirroshandel, S. Amiriparian, Improving question answering performance using knowledge distillation and active learning, Engineering Applications of Artificial Intelligence 123 (2023) 106137.
  • [365] H. Zhang, R. K. Wong, V. W. Chu, Hybrid learning with teacher-student knowledge distillation for recommenders, in: 2020 International Conference on Data Mining Workshops (ICDMW), IEEE, 2020, pp. 227–235.
  • [366] J. Xie, L. Gong, S. Shao, S. Lin, L. Luo, Hybrid knowledge distillation from intermediate layers for efficient single image super-resolution, Neurocomputing 554 (2023) 126592.
  • [367] J. Zhang, Z. Tao, S. Zhang, Z. Qiao, K. Guo, Soft hybrid knowledge distillation against deep neural networks, Neurocomputing 570 (2024) 127142.
  • [368] J. Zhang, Z. Tao, K. Guo, H. Li, S. Zhang, Hybrid mix-up contrastive knowledge distillation, Information Sciences 660 (2024) 120107.
  • [369] G. Li, R. Togo, T. Ogawa, M. Haseyama, Importance-aware adaptive dataset distillation, Neural Networks 172 (2024) 106154.
  • [370] D. Zhang, H. Yan, Y. Chen, D. Li, C. Hao, Cross-domain few-shot learning based on feature adaptive distillation, Neural Computing and Applications 36 (8) (2024) 4451–4465.
  • [371] J. Mi, S. Wermter, J. Zhang, Adaptive knowledge distillation and integration for weakly supervised referring expression comprehension, Knowledge-Based Systems 286 (2024) 111437.
  • [372] Z. Yu, J. Peng, Adaptive multi-information distillation network for image dehazing, Multimedia Tools and Applications 83 (6) (2024) 18407–18426.
  • [373] Z. Huang, W. Li, R. Tao, Extracting and distilling direction-adaptive knowledge for lightweight object detection in remote sensing images, in: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2022, pp. 2360–2364.