Rethinking Early-Fusion Strategies for Improved Multispectral Object Detection

Xue Zhang, Si-Yuan Cao, Fang Wang, Runmin Zhang, Zhe Wu, Xiaohan Zhang, Xiaokai Bai, and
Hui-Liang Shen, Senior Member, IEEE
This work was supported in part by the National Key Research and Development Program of China under grant 2023YFB3209800, in part by the Natural Science Foundation of Zhejiang Province under grant D24F020006, in part by the National Natural Science Foundation of China under grant 62301484, and in part by the Jinhua Science and Technology Bureau Project. (Corresponding authors: Si-Yuan Cao and Hui-Liang Shen.) X. Zhang, R. Zhang, Z. Wu, X. Zhang, and X. Bai are with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China (e-mail: zxue2019@zju.edu.cn, runmin_zhang@zju.edu.cn, jeffw@zju.edu.cn, zhangxh2023@zju.edu.cn, shawnnnkb@zju.edu.cn). S.-Y. Cao is with the Ningbo Research Institute, College of Information Science and Electronic Engineering, Zhejiang University, China (e-mail: cao_siyuan@zju.edu.cn). F. Wang is with the School of Information and Electrical Engineering, Hangzhou City University, and also with the Hangzhou City University Binjiang Innovation Center, China (e-mail: wangf@zucc.edu.cn). H.-L. Shen is with the College of Information Science and Electronic Engineering, Zhejiang University, the Jinhua Institute of Zhejiang University, and the Key Laboratory of Collaborative Sensing and Autonomous Unmanned Systems of Zhejiang Province, China (e-mail: shenhl@zju.edu.cn).
Abstract

Most recent multispectral object detectors employ a two-branch structure to extract features from RGB and thermal images. While the two-branch structure achieves better performance than a single-branch structure, it overlooks inference efficiency. This conflict is becoming increasingly pronounced, as recent works pursue higher performance alone rather than both performance and efficiency. In this paper, we address this issue by improving the performance of efficient single-branch structures. We revisit the causes of the performance gap between these structures. For the first time, we reveal the information interference problem in the naive early-fusion strategy adopted by previous single-branch structures. In addition, we find that the domain gap between multispectral images and the weak feature representation of the single-branch structure are also key obstacles to performance. Focusing on these three problems, we propose corresponding solutions: a novel shape-priority early-fusion strategy, a weakly supervised learning method, and a core knowledge distillation technique. Experiments demonstrate that single-branch networks equipped with these three contributions achieve significant performance enhancements while retaining high efficiency. Our code will be available at https://github.com/XueZ-phd/Efficient-RGB-T-Early-Fusion-Detection.

Index Terms:
Multispectral object detection; feature fusion; weakly supervised learning; knowledge distillation
Figure 1: Multispectral object detection and fusion strategies. (a) In Scene-1, objects are easier to detect in the thermal image. (b) In Scene-2, objects are easier to detect in the RGB image. (c) Early-fusion strategy. (d) Medium-fusion strategy. (e) Late-fusion strategy. (f) Detection results of different strategies on the M3FD dataset [1]. YOLOv5 [2] is adopted as the baseline in this experiment. The area of each circle denotes the number of parameters.

I Introduction

Multispectral object detection has been widely studied, since multispectral images can provide complementary information to achieve consistent detection in various lighting conditions [3, 4, 5, 6, 7, 8, 9, 10]. This complementarity is illustrated in Fig. 1 (a) and (b). Given the multispectral inputs, modern multispectral detectors adopt one of three fusion strategies: early-fusion, medium-fusion, and late-fusion, shown in Fig. 1 (c) - (e). The medium-fusion and late-fusion strategies often achieve superior performance compared to early-fusion [11, 12, 13, 5, 4, 14]. However, they use a two-branch structure, making model deployment on edge devices expensive. In contrast, the early-fusion strategy adopts a simple single-branch structure, facilitating deployment on edge devices. Nevertheless, its performance is low, and few works address this problem, resulting in a widening gap between high performance and high efficiency.

The main motivation for this work is to resolve the conflict between detection performance and inference efficiency. To this end, we focus on improving the performance of the early-fusion strategy while maintaining its high computational efficiency. We first conduct pilot studies and observe that a plain early-fusion strategy cannot consistently obtain improved performances compared to single-modality inputs. Based on this observation, we rethink the early-fusion strategy and summarize three key obstacles: 1) the information interference problem when simply concatenating the multispectral images, 2) the domain gap existing in thermal and RGB images, and 3) the weak feature representation of the single-branch structure. Focusing on these obstacles, we propose corresponding solutions.
- The information interference problem refers to the potential suppression of important information in one modality by another. In the plain early-fusion strategy, previous works [15] typically feed concatenated multispectral images into a convolution layer to generate a fused feature. The convolution layer generally has a small receptive field. Therefore, with only limited context, it is hard for this approach to determine which modality carries the important information. We address this issue by first recognizing that object shapes are agnostic to visible and infrared wavelengths, and then devising a module that fuses multispectral images based on object shape saliency, named the shape-priority early-fusion (ShaPE) module.
- The domain gap between RGB and thermal images is usually neglected in previous works, which generally adopt an RGB pre-trained backbone network to extract features from both RGB and thermal images [13, 5]. However, the domain gap may cause a shift in the representation distribution. This issue is also recognized in the work [16] on an RGB-D task. Different from previous works, we introduce a weakly supervised learning method to address this issue. Within this method, the backbone network jointly uses RGB and thermal images to learn the representation of CLIP [17], since CLIP has demonstrated promising zero-shot generalizability in bridging the domain gap [18]. Additionally, we introduce a segmentation auxiliary branch. Our method allows the backbone network to reduce representation shifts and improve semantic localization ability.
- The weak feature representation problem results from the early-fusion strategy employing a single-branch structure. This structure has fewer parameters and simpler fusion modules compared to medium-fusion and late-fusion strategies. We address this issue by introducing the knowledge distillation (KD) technique [19]. In KD, a key problem is how to align the feature dimensions between teacher and student models. Previous works generally introduce a convolution layer for the student model to learn all knowledge from the teacher model [20, 21]. However, we show that not all information in the teacher model is helpful for downstream tasks. Therefore, we introduce core knowledge distillation (CoreKD) to transfer the most crucial knowledge for specific downstream tasks, resembling the human learning process where the teacher highlights key knowledge for quick understanding and absorption by the students.

Experimental results validate that our efficient multispectral early-fusion (EME) detector achieves a significant performance improvement without considerably increasing the number of parameters, as shown in Fig. 1 (f). Besides, our EME outperforms the previous state-of-the-art approaches. In summary, our contributions are threefold:

  • We systematically analyze the causes of the performance gap between single-branch and two-branch structures. Unlike previous works, we identify and summarize three key obstacles limiting the single-branch early-fusion strategy: information interference, domain gap, and weak feature representation. Notably, information interference between multispectral images is revealed for the first time in this work.

  • For each obstacle, we propose a corresponding solution: we develop 1) a ShaPE module to address the information interference issue, 2) a weakly supervised learning method to reduce the domain gap and improve semantic localization ability, and 3) a CoreKD technique to enhance the feature representation of single-branch networks.

  • Extensive experiments validate that the early-fusion strategy, equipped with our ShaPE module, weakly supervised learning, and CoreKD technique, demonstrates significant improvement. These three modules benefit various common detectors, such as YOLOv5 [2], RetinaNet [22], and GFL [23]. Importantly, only the ShaPE module is retained during the inference phase. Consequently, our method achieves both high performance and efficiency.

II Related Work

In this section, we offer a brief overview of multispectral object detection and introduce related works in weakly supervised learning and knowledge distillation.

II-A Multispectral Object Detection

Multi-source information fusion [24, 25, 26, 27] has exhibited promising application potential in computer vision tasks. In this work, we focus on the multispectral object detection task that uses RGB and thermal image pairs to detect objects. According to the fusion strategy, multispectral object detection can be classified into three categories: early-fusion, medium-fusion, and late-fusion. Previous works [11, 12, 28] confirm that both medium-fusion and late-fusion strategies outperform the early-fusion strategy.

However, both the medium-fusion and late-fusion strategies adopt a two-branch structure that limits their use on resource-limited edge devices. Previous works notice this weakness and provide some solutions. For example, in [14], a model using the medium-fusion strategy is first trained as a teacher, and its knowledge is transferred to a student model. The student model only receives RGB images as inputs. Although it saves resources, it discards important complementary information from thermal images. The work [13] introduces a domain adaptation technique. It uses a medium-fusion model to guide the learning of a single-branch model, which receives only thermal images as inputs and therefore discards complementary information from RGB images. To employ complementary information while saving computational resources, [29] transfers knowledge from a medium-fusion model to an early-fusion model. Nevertheless, it neglects the information interference problem. Some works in the image fusion field [1, 30, 31] demonstrate that fused images can improve detectors, but the fusion process still introduces an additional computational burden.

Different from previous works, we identify the information interference problem in early-fusion strategies. By addressing this problem, we fully employ the complementary information in multispectral images, without significantly increasing computational burden.

II-B Weakly Supervised Learning in Object Detection

Weakly supervised learning has received much attention in object localization and detection, as comprehensively surveyed in [32]. Recent works in multispectral object detection adopt this technique. Based on the weak annotations they utilize, we can coarsely divide them into image-level and box-level weakly supervised learning approaches.

In image-level weakly supervised learning approaches, previous works mainly employ the illumination condition of RGB images as weighting factors to determine the modality importance [14, 5, 33, 13]. In box-level approaches, previous works [15, 34] mainly employ the bounding-box annotations to generate masks. They use these masks to construct spatial attention mechanisms, highlighting representations within target regions.

Different from previous works, we use weakly supervised learning to address the domain gap problem in RGB and thermal images. We employ image-level labels to construct a multi-label classification auxiliary task. This task can fully exploit the complementary information in multispectral images, instead of solely using information from one modality. Along with the powerful CLIP model [17] and box-level weak labels, our method can reduce the domain gap and obtain precise semantic localization abilities.

II-C Knowledge Distillation

Knowledge distillation was first introduced in [19]. It aims to improve a lightweight student model by learning knowledge from a high-capacity teacher model. According to the distillation approach, this technique can be roughly divided into two groups: logit distillation [19] and feature distillation [20]. The former lets a student model learn the logits of a teacher model, while the latter lets a student model learn the features of a teacher model. These distillation approaches are also applied to object detection [35, 21]. Recently, some works in multispectral object detection also employ the knowledge distillation technique [29, 14]. In the distillation process, they generally introduce a projection layer to align the number of feature channels between the teacher and the student. The purpose of this approach is to learn all representations in the teacher model.
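As a minimal sketch of the two distillation families described above, the following NumPy code computes a temperature-softened logit distillation loss and a feature distillation loss with a channel projection. The function names, the temperature value, and the use of a plain matrix `proj` as a stand-in for a 1×1 projection layer are illustrative assumptions and do not reproduce any cited implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def logit_distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened class distributions."""
    p_t = softmax(teacher_logits / temperature)
    p_s = softmax(student_logits / temperature)
    kl = (p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return float(kl.mean() * temperature ** 2)

def feature_distillation_loss(student_feat, teacher_feat, proj):
    """MSE between projected student features and teacher features.

    student_feat: (C_s, H, W); teacher_feat: (C_t, H, W); proj: (C_t, C_s).
    The projection (assumed here) plays the role of a 1x1 convolution that
    aligns the student channel number with the teacher channel number.
    """
    projected = np.einsum("ts,shw->thw", proj, student_feat)
    return float(np.mean((projected - teacher_feat) ** 2))
```

In this formulation the student is pushed to match every teacher channel, which is the behavior the paragraph above attributes to projection-based feature distillation.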

Different from previous works, we first confirm that not all information in teacher features is beneficial to downstream tasks, including classification and regression. Based on this, we propose a core knowledge distillation technique to transfer the most important features for the downstream tasks to the student model.

III Method

Fig. 2 illustrates the overview of our method, where the training process and the inference process are marked in green and blue, respectively. We adopt a single-branch structure as the baseline model considering its low memory cost. To boost its performance, we develop three key modules: shape-priority early-fusion (ShaPE), weakly supervised auxiliary learning, and core knowledge distillation (CoreKD). Note that only the training process requires weakly supervised auxiliary learning and CoreKD, and both are removed during the inference phase. Consequently, our method adds only the ShaPE module to the early-fusion single-branch structure during the inference phase. In the following sections, we describe the ShaPE module in Section III-A, the weakly supervised auxiliary learning method in Section III-B, and CoreKD in Section III-C.

Figure 2: Overview of our method. We adopt the single-branch structure as the baseline model and develop three key modules: shape-priority early-fusion (ShaPE), weakly supervised auxiliary learning, and core knowledge distillation. The ShaPE module remains in both the inference and training phases, while the other two modules are removed in the inference phase.

III-A Shape-Priority Early-Fusion Module

Observation. Given a pair of RGB-T images, the plain early-fusion strategy concatenates them in the channel dimension and then feeds them into a detector. With the plain strategy, we conduct pilot studies on the M3FD [1] dataset. We first train three commonly used one-stage detectors: RetinaNet [22], GFL [23] and YOLOv5 [2]. Then, we compute the mean values and standard deviations of their detection results and illustrate them in Fig. 3. For comparison, we also train these detectors using single-modality images as input. We make the following two observations. First, the plain early-fusion strategy cannot achieve consistent improvement compared with single-modality input. Second, for objects that require color to identify, such as ‘Traffic Light’, the plain early-fusion strategy yields worse results than the RGB input.

Figure 3: Pilot studies conducted on the M3FD [1] dataset. We use three detectors as baselines: RetinaNet [22], GFL [23] and YOLOv5 [2]. Each bar and its error bar represent the mean value and standard deviation of the results obtained by these three detectors. ‘RGB’ represents detectors that only take RGB images as inputs, while ‘T’ represents detectors that only take thermal images as inputs. ‘PlainRGB-T’ denotes detectors that use the plain early-fusion strategy. The ‘All’ column illustrates the mAP50 for all classes, and the other columns illustrate the AP50 for specific classes. Red lines mark cases where the plain RGB-T early-fusion strategy obtains worse results than detectors that use single-modality inputs.

Motivation. We attribute the above phenomena to the convolutional inductive bias, namely, local connectivity and weight sharing. The process of 2D convolution involves two steps: (1) sampling across the concatenated RGB-T images using a regular grid $\mathcal{R}$; (2) summing the sampled values with weighting factor $\mathbf{W}$. The grid $\mathcal{R}$ determines both the receptive field size and dilation. For example,

$$\mathcal{R}=\left\{(-3,-3),(-3,-2),\dots,(2,3),(3,3)\right\}$$

defines a $7\times 7$ kernel with dilation 1. For each position $\mathbf{p}_{0}$ on the output feature map $\mathrm{O}$, we have

$$\mathrm{O}(\mathbf{p}_{0})=\sum_{\mathbf{p}_{n}\in\mathcal{R}}\sum_{j\in\{\mathrm{rgb},\mathrm{t}\}}\mathbf{W}_{j}(\mathbf{p}_{n})\,\mathbf{I}_{j}(\mathbf{p}_{0}+\mathbf{p}_{n}), \tag{1}$$

where $\mathbf{p}_{n}$ enumerates the positions in $\mathcal{R}$.

This process indicates that the plain early-fusion strategy is a pixel-level weighting method, with weights learned from data. However, the limited receptive field of pixel-level weighting makes it difficult for the learned weights to determine which modality is important. This weakness may result in valuable information from one modality being suppressed by another. As an example, Fig. 4 (c) depicts the feature map generated from the RGB-T images of Fig. 4 (a) and (b) using the plain early-fusion strategy. The close-up shows that the ‘Traffic Light’ in the fused feature map does not preserve the significant information of the RGB image.
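To make the pixel-level weighting of Eq. (1) concrete, the following NumPy sketch evaluates it for a single output channel. The function name `plain_early_fusion`, the zero padding at the borders, and the split of the kernel into the slices `w_rgb` and `w_t` are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def plain_early_fusion(i_rgb, i_t, w_rgb, w_t):
    """Evaluate Eq. (1) for one output channel with zero padding.

    i_rgb: (3, H, W) RGB image; i_t: (1, H, W) thermal image.
    w_rgb: (3, k, k) and w_t: (1, k, k) are the per-modality kernel slices W_j.
    """
    k = w_rgb.shape[-1]
    r = k // 2                      # the grid R spans offsets -r..r
    h, w = i_rgb.shape[-2:]
    pad = ((0, 0), (r, r), (r, r))
    padded = {"rgb": np.pad(i_rgb, pad), "t": np.pad(i_t, pad)}
    weights = {"rgb": w_rgb, "t": w_t}
    out = np.zeros((h, w))
    for dy in range(-r, r + 1):     # p_n enumerates the positions in R
        for dx in range(-r, r + 1):
            for j in ("rgb", "t"):  # sum over the two modalities
                # I_j(p_0 + p_n) for every output position p_0 at once
                shifted = padded[j][:, r + dy:r + dy + h, r + dx:r + dx + w]
                # W_j(p_n): one weight per input channel of modality j
                tap = weights[j][:, r + dy, r + dx][:, None, None]
                out += (tap * shifted).sum(axis=0)
    return out
```

Each output value depends only on the samples inside $\mathcal{R}$, so the learned weights have no wider context from which to decide which modality to trust at a given location.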

The straightforward solutions to this weakness are: (1) enlarging the receptive field by using a larger kernel or more convolutional layers so that the model can judge the modality importance based on a broader range of contexts, or (2) increasing the number of convolutional kernels so that the model can learn more representations. However, these solutions increase memory costs and computational burden, making them unfriendly to edge devices.

ShaPE Module. We observe that shape is an inherent attribute of an object: any object visible in RGB and thermal images has a consistent shape across the two modalities. Thus, we consider the salience of shape as a modifying factor to adaptively determine the modality importance, and design the shape-priority early-fusion (ShaPE) module. In the ShaPE module, the RGB and thermal images are modified by self-gating masks. In this context, Eq. (1) becomes:

$$\mathrm{O}(\mathbf{p}_{0})=\sum_{\mathbf{p}_{n}\in\mathcal{R}}\sum_{j\in\{\mathrm{rgb},\mathrm{t}\}}\mathbf{W}_{j}(\mathbf{p}_{n})\,\mathbf{M}_{j}(\mathbf{p}_{0}+\mathbf{p}_{n})\,\mathbf{I}_{j}(\mathbf{p}_{0}+\mathbf{p}_{n}), \tag{2}$$

where $\mathbf{M}_{\mathrm{rgb}}$ and $\mathbf{M}_{\mathrm{t}}$ denote the self-gating masks of RGB and thermal images, respectively.

In the following, we describe the generation process of self-gating masks $\mathbf{M}_{\mathrm{rgb}}$ and $\mathbf{M}_{\mathrm{t}}$. Since our ShaPE module focuses on the shapes of objects and structural contributions of different modalities to the fused features, we employ the gradients and structural similarities in our method. For easy understanding, we visualize some important intermediate results in Fig. 4. Given the RGB-T images as shown in Fig. 4 (a) and (b), we compute their gradients

$$\nabla\mathbf{I}_{\mathrm{rgb}}(\mathbf{p}_{0})=\sqrt{\left(\nabla_{x}\mathbf{I}_{\mathrm{rgb}}(\mathbf{p}_{0})\right)^{2}+\left(\nabla_{y}\mathbf{I}_{\mathrm{rgb}}(\mathbf{p}_{0})\right)^{2}},$$
$$\nabla\mathbf{I}_{\mathrm{t}}(\mathbf{p}_{0})=\sqrt{\left(\nabla_{x}\mathbf{I}_{\mathrm{t}}(\mathbf{p}_{0})\right)^{2}+\left(\nabla_{y}\mathbf{I}_{\mathrm{t}}(\mathbf{p}_{0})\right)^{2}},$$

as shown in Fig. 4 (d) and (e). We then generate the union gradient as the reference using

$$\nabla\mathbf{I}^{\prime}_{\mathrm{ref}}(\mathbf{p}_{0})=\max\left(\nabla\mathbf{I}_{\mathrm{rgb}}(\mathbf{p}_{0}),\nabla\mathbf{I}_{\mathrm{t}}(\mathbf{p}_{0})\right).$$

We further use max-pooling within a $3\times 3$ neighborhood $\mathcal{R}^{\prime}$ to boost the reference gradient, which is written as

$$\nabla\mathbf{I}_{\mathrm{ref}}(\mathbf{p}_{0})=\max_{\mathbf{p}_{n}\in\mathcal{R}^{\prime}}\nabla\mathbf{I}^{\prime}_{\mathrm{ref}}(\mathbf{p}_{0}+\mathbf{p}_{n}),$$

as shown in Fig. 4 (f).
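The gradient, union, and boosting steps above can be sketched in NumPy as follows. The source does not state the derivative operator or how the RGB image is reduced to one channel, so this sketch assumes single-channel inputs (for example a grayscale version of the RGB image), central finite differences via `np.gradient`, and edge padding for the pooling; all function names are hypothetical.

```python
import numpy as np

def gradient_magnitude(img):
    """Gradient magnitude of a 2-D image (central differences, an assumption)."""
    gy, gx = np.gradient(img.astype(np.float64))
    return np.sqrt(gx ** 2 + gy ** 2)

def max_pool_3x3(x):
    """3x3 max-pooling with stride 1; borders are edge-padded (an assumption)."""
    h, w = x.shape
    p = np.pad(x, 1, mode="edge")
    windows = [p[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    return np.max(windows, axis=0)

def reference_gradient(gray_rgb, thermal):
    """Return the two gradient images and the boosted reference gradient."""
    g_rgb = gradient_magnitude(gray_rgb)
    g_t = gradient_magnitude(thermal)
    union = np.maximum(g_rgb, g_t)           # pixel-wise union gradient
    return g_rgb, g_t, max_pool_3x3(union)   # boosted reference gradient
```

By construction, the boosted reference is never smaller than either single-modality gradient at any pixel, so edges present in only one modality are retained in the reference.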

Figure 4: Illustration of fused feature map generation process for the plain early-fusion strategy and our ShaPE module. (a) RGB image. (b) Thermal image. (c) Fused feature map generated using the plain early-fusion strategy, with a close-up indicated by a white circle line. (d) and (e) are gradient images of the RGB and thermal images, respectively. (f) Boosted reference gradient image. (g) and (h) are self-gating masks of the RGB and thermal images, respectively. (i) Fused feature map generated by our ShaPE module.

To determine the structural contributions of each modality to the fused features, we compute the structural similarities between single-modality gradient images $\{\nabla\mathbf{I}_{\mathrm{rgb}}, \nabla\mathbf{I}_{\mathrm{t}}\}$ and the reference gradient image $\nabla\mathbf{I}_{\mathrm{ref}}$. Inspired by [36], for each patch $\mathcal{R}$, we compute three fundamental properties: the means $\{\mu_{\mathrm{rgb}},\mu_{\mathrm{t}},\mu_{\mathrm{ref}}\}$, the standard deviations $\{\sigma_{\mathrm{rgb}},\sigma_{\mathrm{t}},\sigma_{\mathrm{ref}}\}$, and the covariances $\{\sigma_{(\mathrm{rgb},\mathrm{ref})}, \sigma_{(\mathrm{t},\mathrm{ref})}\}$ between the single-modality gradient images and the reference gradient images. In this context, we generate the self-gating masks:

$$\mathbf{M}_{\mathrm{rgb}}^{\prime}=\frac{(2\mu_{\mathrm{rgb}}\mu_{\mathrm{ref}}+\xi_{1})(2\sigma_{(\mathrm{rgb},\mathrm{ref})}+\xi_{2})}{(\mu_{\mathrm{rgb}}^{2}+\mu_{\mathrm{ref}}^{2}+\xi_{1})(\sigma_{\mathrm{rgb}}^{2}+\sigma_{\mathrm{ref}}^{2}+\xi_{2})},$$
$$\mathbf{M}_{\mathrm{t}}^{\prime}=\frac{(2\mu_{\mathrm{t}}\mu_{\mathrm{ref}}+\xi_{1})(2\sigma_{(\mathrm{t},\mathrm{ref})}+\xi_{2})}{(\mu_{\mathrm{t}}^{2}+\mu_{\mathrm{ref}}^{2}+\xi_{1})(\sigma_{\mathrm{t}}^{2}+\sigma_{\mathrm{ref}}^{2}+\xi_{2})},$$

where $\xi_{1}=(k_{1}L)^{2}$ and $\xi_{2}=(k_{2}L)^{2}$ are used to prevent instability. $L$ is the dynamic range of the gradient images, $k_{1}=0.01$, and $k_{2}=0.03$.

Since the ranges of both $\mathbf{M}_{\mathrm{rgb}}^{\prime}$ and $\mathbf{M}_{\mathrm{t}}^{\prime}$ are $[-1, 1]$, we then normalize the self-gating masks and obtain

$$\mathbf{M}_{\mathrm{rgb}}=\frac{\exp(\mathbf{M}^{\prime}_{\mathrm{rgb}})}{\sum\limits_{j\in\{\mathrm{rgb},\mathrm{t}\}}\exp(\mathbf{M}^{\prime}_{j})},\quad \mathbf{M}_{\mathrm{t}}=\frac{\exp(\mathbf{M}^{\prime}_{\mathrm{t}})}{\sum\limits_{j\in\{\mathrm{rgb},\mathrm{t}\}}\exp(\mathbf{M}^{\prime}_{j})}, \tag{3}$$

as shown in Fig. 4 (g) and (h). According to Eq. (2), we can finally generate the fused feature map as shown in Fig. 4 (i).
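The normalization in Eq. (3) is a two-way softmax applied independently at every pixel. The following is a minimal NumPy sketch, assuming the raw masks are arrays of identical shape with values in $[-1, 1]$; the function name `normalize_gating_masks` and the toy values are illustrative and not taken from the authors' code.

```python
import numpy as np

def normalize_gating_masks(m_rgb_raw, m_t_raw):
    """Pixel-wise two-way softmax over the raw RGB and thermal masks."""
    stacked = np.stack([m_rgb_raw, m_t_raw], axis=0)            # shape (2, H, W)
    exp = np.exp(stacked - stacked.max(axis=0, keepdims=True))  # numerically stable
    weights = exp / exp.sum(axis=0, keepdims=True)
    return weights[0], weights[1]

# Toy 2x2 masks with hypothetical values in [-1, 1].
m_rgb_raw = np.array([[1.0, 0.0], [-1.0, 0.5]])
m_t_raw = np.array([[-1.0, 0.0], [1.0, 0.5]])
m_rgb, m_t = normalize_gating_masks(m_rgb_raw, m_t_raw)
```

Because the raw masks are bounded in $[-1, 1]$, each normalized weight lies between $1/(1+e^{2})\approx 0.12$ and $e^{2}/(1+e^{2})\approx 0.88$, so neither modality is ever fully suppressed at a pixel.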

III-B Weakly Supervised Learning Method

In RGB-T object detection, a non-negligible issue is the lack of backbone networks pre-trained on large-scale RGB-T datasets, because the RGB-T image recognition field has few datasets comparable in scale to ImageNet [37] and COCO [38]. Previous works generally use backbone networks pre-trained on ImageNet. However, the domain gap between thermal and RGB images causes representation distribution shifts, as illustrated in Fig. 5 (a) and (b), since the backbone network is trained solely on RGB images but is applied to thermal images.

Figure 5: T-SNE visualization of RGB and thermal image features. (a) and (b) visualize the image features of the M3FD [1] and FLIR [39] datasets using the ImageNet pre-trained ResNet-50 backbone network. (c) and (d) visualize the image features of the same datasets using the ResNet-50 trained with our weakly supervised learning method. Additionally, we present corresponding images of six pairs of feature points.

To handle this issue, we turn to the powerful Contrastive Language-Image Pre-training (CLIP) [17] model. It has been confirmed that CLIP can bridge domain gaps [18, 40, 41, 42, 43], since it is trained using a huge number of (image, text) pairs. In this context, we feed both RGB and thermal images into the backbone network, and let it learn the representation generated by the CLIP model. Specifically, we first present a CLIP-driven image-level weakly supervised learning method. This method enables the network to recognize the classes of objects in a pair of RGB-T images while locating their coarse regions. For fine-grained localization, we then introduce a box-level weakly supervised learning method. Fig. 6 illustrates the architecture of the weakly supervised learning method.

CLIP-Driven Image-Level Weak Supervision. To learn the CLIP model’s knowledge, we construct the image-level weak supervision method. Based on three considerations, we adopt the multi-label classification task as the image-level weak supervision: (1) the CLIP model can be viewed as a classifier, (2) this auxiliary task can fully use the complementary information in the RGB-T images, and (3) by summarizing all classes and removing duplicates in an image, we can easily construct the ground-truth multi-label targets based on detection annotations.

Nevertheless, the original CLIP model is only trained to recognize a single object per image [17] and is not suitable for multi-label classification [44, 45]. To address this issue, we introduce a Divide-and-Aggregation CLIP (DA-CLIP) model. DA-CLIP first divides input images into multiple crops. Each crop is then fed into CLIP. The predictions of all crops are finally aggregated by a max-pooling operation on each class. Considering that DA-CLIP may generate inaccurate predictions, we construct a learnable adapter, which consists of three fully-connected (FC) layers, to fine-tune the result of DA-CLIP. To prevent overfitting, we add a dropout layer in the adapter. We denote the predicted probability from the adapter as $\mathbf{\hat{q}}_{\rm ad}\in\mathbb{R}^{c}$, where $c$ denotes the number of classes.
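The divide-and-aggregate step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the crop size of 224 and stride of 112 follow the Fig. 6 caption, while `toy_scorer` is a hypothetical stand-in for CLIP (it simply returns per-channel means as three class scores) and all function names are illustrative.

```python
import numpy as np

def divide_into_crops(image, crop=224, stride=112):
    """Slide a crop x crop window over an (H, W, C) image with the given stride."""
    h, w = image.shape[:2]
    crops = []
    for top in range(0, h - crop + 1, stride):
        for left in range(0, w - crop + 1, stride):
            crops.append(image[top:top + crop, left:left + crop])
    return crops

def aggregate_predictions(per_crop_scores):
    """Class-wise max over crops: (num_crops, c) -> (c,)."""
    return np.max(np.asarray(per_crop_scores), axis=0)

def toy_scorer(crop_img):
    """Hypothetical stand-in for CLIP: per-channel mean used as three class scores."""
    return crop_img.mean(axis=(0, 1))

# Toy image in which only the top-left 224x224 region activates "class 0".
image = np.zeros((448, 448, 3))
image[:224, :224, 0] = 1.0

crops = divide_into_crops(image)
q_da = aggregate_predictions([toy_scorer(c) for c in crops])
```

The class-wise maximum lets a class that is confidently detected in a single crop survive aggregation, even when it is absent from every other crop.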

Figure 6: Illustration of the weakly supervised learning method. It consists of a divide-and-aggregation CLIP model (DA-CLIP), an adapter, a backbone, two auxiliary heads used for classification and segmentation, and weakly supervised losses. The crops are obtained using PyTorch’s function torch.nn.functional.unfold(image, kernel_size=224, stride=112). The image-level label is determined through a two-step process: 1) gather all classes present in the image according to bounding-box annotations, and 2) remove duplicated classes. Note that all modules except the DA-CLIP are updated, and only the backbone network remains in the inference phase.

For the backbone network, we add an auxiliary classification head on its top. The head consists of a global average pooling (GAP) operation and one FC layer. We denote the predicted probability from the classification head as $\mathbf{\hat{q}}_{\rm bb}\in\mathbb{R}^{c}$.

We adopt the mutual learning approach [46] to train the backbone network and the adapter simultaneously. In this approach, an important step is that one model generates soft targets for the other model using the softmax function. However, this approach cannot be directly applied to the multi-label classification problem, since it requires the sum of predicted probabilities to be one, which is rarely satisfied in multi-label classification. To address this issue, we draw inspiration from self-training KD [47] and construct the soft targets for the adapter and backbone network as

$$\mathbf{\tilde{q}}_{\rm ad}=(1-\lambda)\mathbf{q}+\lambda\mathbf{\hat{q}}_{\rm ad},\quad\mathbf{\tilde{q}}_{\rm bb}=(1-\lambda)\mathbf{q}+\lambda\mathbf{\hat{q}}_{\rm bb},$$

where $\mathbf{q}\in\mathbb{R}^{c}$ denotes a ground-truth multi-label target, and $\lambda$ denotes a balancing factor set to 0.1. In this context, we compute the binary cross-entropy (BCE) losses

$$\mathcal{H}(\mathbf{\tilde{q}}_{\rm ad},\mathbf{\hat{q}}_{\rm bb})=-\sum_{i=1}^{c}\left[\tilde{q}_{{\rm ad},i}\log(\hat{q}_{{\rm bb},i})+(1-\tilde{q}_{{\rm ad},i})\log(1-\hat{q}_{{\rm bb},i})\right], \tag{4a}$$

$$\mathcal{H}(\mathbf{\tilde{q}}_{\rm bb},\mathbf{\hat{q}}_{\rm ad})=-\sum_{i=1}^{c}\left[\tilde{q}_{{\rm bb},i}\log(\hat{q}_{{\rm ad},i})+(1-\tilde{q}_{{\rm bb},i})\log(1-\hat{q}_{{\rm ad},i})\right]. \tag{4b}$$
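The soft-target construction and the two cross-supervised BCE terms can be illustrated with a small NumPy sketch. The prediction values below are made up, and the helper names `soft_target` and `bce` are illustrative; only $\lambda=0.1$ and the form of the losses come from the text.

```python
import numpy as np

def soft_target(q_gt, q_pred, lam=0.1):
    """Blend the ground-truth multi-label target with a model prediction."""
    return (1.0 - lam) * q_gt + lam * q_pred

def bce(target, pred, eps=1e-12):
    """Binary cross-entropy summed over classes."""
    pred = np.clip(pred, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.sum(target * np.log(pred)
                         + (1.0 - target) * np.log(1.0 - pred)))

q = np.array([1.0, 0.0, 1.0])        # ground-truth multi-label target
q_ad = np.array([0.8, 0.3, 0.6])     # adapter prediction (made-up values)
q_bb = np.array([0.7, 0.2, 0.9])     # backbone prediction (made-up values)

q_tilde_ad = soft_target(q, q_ad)    # soft target built from the adapter
q_tilde_bb = soft_target(q, q_bb)    # soft target built from the backbone
loss_bb = bce(q_tilde_ad, q_bb)      # Eq. (4a): adapter target supervises backbone
loss_ad = bce(q_tilde_bb, q_ad)      # Eq. (4b): backbone target supervises adapter
```

With $\lambda=0.1$ the soft targets stay close to the ground truth (here `q_tilde_ad` is `[0.98, 0.03, 0.96]`), and unlike softmax-based soft targets they do not need to sum to one.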
Figure 7: Illustration of the class activation map (CAM) of the backbone network. Each row’s triplet of images represents the CAM for a specific class, using (a) image-level auxiliary learning only, (b) box-level auxiliary learning only, and (c) both image-level and box-level auxiliary learning.
Figure 8: Illustration of feature maps generated by the backbone network. (a) and (b) present the RGB and thermal images. (c) and (d) present their corresponding feature maps. (e) and (f) present the feature maps generated by the ResNet-50 trained without and with our weakly supervised learning, respectively. The close-up is highlighted with a red box.

To showcase the semantic localization effect of our CLIP-driven image-level weak supervision, we visualize the class activation map (CAM) of the backbone network in Fig. 7 (a). CAM is a useful tool for understanding which regions the network focuses on to predict a class. We can observe that the backbone network can coarsely localize regions of ‘Person’, ‘Car’, and ‘Traffic Light’ in the image.

Box-Level Weak Supervision. To precisely localize the semantic regions, we introduce box-level weak supervision. The ground-truth box-level target is generated by directly filling the area within an annotation box with its corresponding class index. In this context, we add an auxiliary segmentation head on top of the backbone network to predict the target. Denoting the ground-truth box-level target mask as $\mathbf{G}$, and the predicted mask as $\mathbf{\hat{G}}$, we compute the BCE loss between them as

$$\mathcal{H}(\mathbf{G},\mathbf{\hat{G}})=-\sum_{n=1}^{N}\left[G_{n}\log(\hat{G}_{n})+(1-G_{n})\log(1-\hat{G}_{n})\right], \tag{5}$$

where $N$ denotes the number of elements in the mask.
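The construction of the box-level target can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation: the function name, the `(x1, y1, x2, y2, class_index)` box format, the use of 0 for background, and the rule that later boxes overwrite earlier ones in overlapping regions are all assumptions.

```python
import numpy as np

def box_level_target(height, width, boxes):
    """Fill each (x1, y1, x2, y2, class_index) box with its class index.

    Assumptions: 0 marks background, class indices start at 1, and later
    boxes overwrite earlier ones where they overlap.
    """
    mask = np.zeros((height, width), dtype=np.int64)
    for x1, y1, x2, y2, cls in boxes:
        mask[y1:y2, x1:x2] = cls
    return mask

# Two hypothetical, partially overlapping annotation boxes.
boxes = [(1, 1, 4, 3, 1), (3, 2, 6, 5, 2)]
G = box_level_target(6, 8, boxes)
```

The target only needs detection annotations, which is why this supervision is weak: every pixel inside a box receives the box's label, including background pixels that happen to fall within it.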

Figure 9: Illustration of the knowledge distillation technique. The student model adopts an early-fusion single-branch structure, while the teacher model adopts a medium-fusion two-branch structure. In the training phase, both the pre-trained teacher model and the core knowledge convolution module are fixed, while only the student model is updated. After training, only the student model is used for deployment. In this diagram, we use YOLOv5 [2] as an example, and it can be easily extended to other detectors.

We visualize attention maps of the backbone network for different classes, as shown in Fig. 7 (b). Using the box-level weak supervision, the backbone network can precisely localize objects of interest, such as ‘Car’. Nevertheless, it may miss some useful information in the image. Therefore, we combine the CLIP-driven image-level weak supervision and the box-level weak supervision. The results presented in Fig. 7 (c) show that our weakly supervised learning method effectively enables the backbone network to localize the important semantic regions.

Effect Validation. When our weakly supervised learning method is employed, Fig. 5 (c) and (d) demonstrate that the domain gap between RGB and thermal features is reduced. This implies that the backbone network can extract information from RGB and thermal images without bias. To further illustrate this effect, we visualize the feature map generated by the ResNet-50 [48] in Fig. 8. The generation process of these feature maps is as follows: First, we resize all features of the ResNet-50 across four stages to the same resolution as the input images. Then, we aggregate these features along the channel dimension using $\texttt{sum}(\texttt{softmax}(\mathbf{F},\texttt{dim=0})\otimes\mathbf{F},\texttt{dim=0})$, where $\mathbf{F}\in\mathbb{R}^{D\times H\times W}$ represents the concatenated feature. $D$, $H$, and $W$ denote its depth, height, and width, respectively. $\otimes$ denotes the element-wise product.
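The channel aggregation used for this visualization can be written out in a few lines. The sketch below uses NumPy instead of PyTorch and skips the resizing step, assuming the features are already concatenated into a single $D\times H\times W$ array; the function name and toy values are illustrative.

```python
import numpy as np

def aggregate_features(F):
    """Softmax-weighted sum over the channel axis: (D, H, W) -> (H, W)."""
    shifted = F - F.max(axis=0, keepdims=True)      # numerically stable softmax
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)   # softmax(F, dim=0)
    return (weights * F).sum(axis=0)                # sum(weights * F, dim=0)

# Toy feature with D=2, H=1, W=2.
F = np.array([[[0.0, 2.0]],
              [[0.0, 0.0]]])
fmap = aggregate_features(F)
```

At each pixel the result is a convex combination of the channel responses, weighted toward the strongest channel, so it always lies between the smallest and largest activation at that location.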

Fig. 8 (a) and (b) present the RGB and thermal images in one example scene. Fig. 8 (c) and (d) illustrate their corresponding feature maps. Fig. 8 (e) shows the RGB-T feature map without using our weakly supervised learning method. Fig. 8 (f) shows the feature map using our weakly supervised learning method. Observing Fig. 8 (e), we note that the ResNet-50 tends to acquire information primarily from the RGB image. In contrast, the feature map in Fig. 8 (f) demonstrates that our method enables the ResNet-50 to gather important information from both RGB and thermal images.

III-C Core Knowledge Distillation

Problem Description. To further improve the detection accuracy of the early-fusion strategy without increasing its computational cost, we introduce the knowledge distillation technique [19]. To achieve knowledge transfer, we instruct the student model to mimic intermediate features of the teacher model. In this process, a primary obstacle is that the student's features have a different number of channels from the teacher's. Previous works introduce convolution layers to align the feature channel numbers [20, 21], while neglecting whether the teacher’s knowledge is helpful to the student. To address this issue, we propose core knowledge distillation (CoreKD).

CoreKD Architecture. We use YOLOv5 [2] as an example and illustrate the knowledge distillation architecture in Fig. 9. In its architecture, we use the early-fusion single-branch structure as the student model and the medium-fusion two-branch structure as the teacher model. In the student model, a pair of RGB-T images is first concatenated, then fed into different network modules, and finally converted into predicted results. In the teacher model, the RGB and thermal images are respectively fed into different backbone networks. The generated multispectral features are fused in the feature space through concatenation and convolution operations. The fused features are then fed into the subsequent network modules and converted into predicted results. The predicted results of both the student and teacher models consist of bounding boxes and class-specific confidence scores.

CoreKD Formulation. Since we apply the same distillation techniques to different feature pyramid levels, we only describe the technique at one level and omit the subscript for simplicity. In the head modules of Fig. 9, we denote the input features of the student and teacher models as $\mathbf{X}^{\rm S}$ and $\mathbf{X}^{\rm T}$, respectively. Feature distillation typically transfers the teacher’s knowledge to the student by minimizing the loss [20]

$$\mathcal{L}^{\prime\prime}_{\rm feat}=||\mathcal{A}(\mathbf{X}^{\rm S})-\mathbf{X}^{\rm T}||_{2}^{2}, \tag{6}$$

where $\mathcal{A}$ denotes an adaptation layer used to match the channel dimensions between the student and teacher features. Previous works usually use a convolution layer as the adaptation layer [20, 21]. This approach aims to make $\mathcal{A}(\mathbf{X}^{\rm S})$ learn all information in the teacher feature $\mathbf{X}^{\rm T}$. However, they neglect whether all the information in $\mathbf{X}^{\rm T}$ is beneficial for downstream tasks, including classification and regression.

To address this problem, we revisit the structure of the head module in the teacher model. As shown in Fig. 9, the official implementation of YOLOv5 uses a ‘$1\times 1\;\texttt{Conv}$’ layer to output the predicted results

$$\mathbf{\hat{Y}}^{\rm T}=\texttt{Conv}(\mathbf{X}^{\rm T};\mathbf{W}^{\rm T}),$$

where $\mathbf{W}^{\rm T}$ denotes the weighting factor in the teacher’s head module. According to the 2D convolution formulation in Eq. (1), we can infer that the weighting factor $\mathbf{W}^{\rm T}$ reflects the importance of a channel map in $\mathbf{X}^{\rm T}$ for the downstream feature. We visualize the histogram of $\mathbf{W}^{\rm T}$ in Fig. 10. It is evident that most of the values in $\mathbf{W}^{\rm T}$ are close to 0. This implies that only a few feature representations in $\mathbf{X}^{\rm T}$ are important for the downstream tasks. We call these important feature representations the core knowledge in the teacher model.

To learn this core knowledge, we modify the feature loss Eq. (6) into

$$\mathcal{L}^{\prime}_{\rm feat}=||\texttt{Conv}(\mathcal{A}(\mathbf{X}^{\rm S});\mathbf{W}^{\rm T})-\texttt{Conv}(\mathbf{X}^{\rm T};\mathbf{W}^{\rm T})||_{2}^{2}. \tag{7}$$

This modification ensures that $\mathcal{A}(\mathbf{X}^{\rm S})$ and $\mathbf{X}^{\rm T}$ are projected into an identical space constructed by $\mathbf{W}^{\rm T}$, and that the projected features are close to each other. Furthermore, to avoid introducing the adaptation layer $\mathcal{A}$, we construct a core knowledge convolution (Core Knowledge Conv) operator by sampling the weighting factor $\mathbf{W}^{\rm T}$. We denote the sampling process as $\mathcal{S}(\cdot)$. In the process, we first obtain the channel dimension $d$ of the student feature $\mathbf{X}^{\rm S}$, then sample the top-$d$ values along the ‘in_channel’ axis from $\mathbf{W}^{\rm T}$ based on their absolute values. Finally, we obtain the sampled weighting factor $\mathcal{S}(\mathbf{W}^{\rm T})$. In this context, we rewrite the feature loss given in Eq. (7) as

$$\begin{aligned}\mathcal{L}_{\rm feat}&=||\mathbf{\hat{Y}}^{\rm CT}-\mathbf{\hat{Y}}^{\rm T}||_{2}^{2}\\&=||\texttt{Conv}(\mathbf{X}^{\rm S};\mathcal{S}(\mathbf{W}^{\rm T}))-\texttt{Conv}(\mathbf{X}^{\rm T};\mathbf{W}^{\rm T})||_{2}^{2},\end{aligned} \tag{8}$$

where $\mathbf{\hat{Y}}^{\rm CT}$ denotes the output of core knowledge convolution. When using this feature loss, we keep the weighting factor $\mathbf{W}^{\rm T}$ fixed and only compute the gradient with respect to the student feature $\mathbf{X}^{\rm S}$.
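The sampling operator and the resulting feature loss can be sketched in NumPy as follows. This is one possible reading of the description, not the authors' implementation: it assumes a bias-free $1\times 1$ convolution, selects the top-$d$ weights independently for each output row of $\mathbf{W}^{\rm T}$, and keeps the selected weights in their original channel order. All function names and toy values are hypothetical.

```python
import numpy as np

def sample_core_weights(w_teacher, d):
    """Per output row, keep the d entries with the largest absolute value.

    w_teacher: (out_channels, teacher_in_channels). Returns (out_channels, d).
    Assumption: kept entries stay in their original channel order.
    """
    idx = np.argsort(-np.abs(w_teacher), axis=1)[:, :d]   # top-d per row
    idx = np.sort(idx, axis=1)                            # restore channel order
    return np.take_along_axis(w_teacher, idx, axis=1)

def conv1x1(x, w):
    """Bias-free 1x1 convolution: x is (C, H, W), w is (O, C)."""
    return np.einsum("oc,chw->ohw", w, x)

def corekd_loss(x_student, x_teacher, w_teacher):
    """Squared L2 distance between the projected student and teacher features."""
    w_core = sample_core_weights(w_teacher, x_student.shape[0])
    diff = conv1x1(x_student, w_core) - conv1x1(x_teacher, w_teacher)
    return float(np.sum(diff ** 2))

# Toy teacher head weights (2 outputs, 4 input channels), mostly near zero.
w_t = np.array([[0.01, -2.0, 0.02, 1.5],
                [3.0, 0.0, -0.5, 0.03]])
w_core = sample_core_weights(w_t, d=2)

x_t = np.ones((4, 1, 1))   # teacher feature with 4 channels
x_s = np.ones((2, 1, 1))   # student feature with 2 channels
loss = corekd_loss(x_s, x_t, w_t)
```

In this toy case the discarded weights are the near-zero ones, so the student's projection differs from the teacher's by only 0.03 per output and the loss is 0.0018. During training the sampled weights would stay fixed, and only the student feature would receive gradients.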

Figure 10: Weighting factor histograms of the teacher’s head module in Fig. 9. (a), (b), and (c) correspond to the level-0, level-1, and level-2 convolution weighting factor histograms, respectively.

Mathematical Foundation of CoreKD. We first explain the mathematical foundation of traditional feature distillation and analyze its weaknesses. Then, we introduce the mathematical foundation of our CoreKD. Finally, we compare the results of our CoreKD with those of the traditional one.

Traditional feature distillation uses a convolution layer to align feature channel numbers between student and teacher models. We denote the convolution layer as a function $\mathcal{A}(\cdot)$ in Eq. (6). Next, we denote the weighting factor of $\mathcal{A}(\cdot)$ as $\mathbf{A}\in\mathbb{R}^{d^{\prime}\times d}$. This function is used to convert an input feature $\mathbf{X}^{\rm S}\in\mathbb{R}^{h\times w\times d}$ into an output feature $\mathbf{Z}\in\mathbb{R}^{h\times w\times d^{\prime}}$, i.e., $\mathbf{Z}=\mathcal{A}(\mathbf{X}^{\rm S})$.
We denote the vector at an arbitrary spatial location of $\mathbf{X}^{\rm S}$ as $\mathbf{x}^{\rm S}\in\mathbb{R}^{d\times 1}$, the $i$-th row vector in the weighting factor $\mathbf{A}$ as $\mathbf{a}_{i}\in\mathbb{R}^{1\times d}$, and the corresponding value in $\mathbf{Z}$ as $z_{i}\in\mathbb{R}^{1\times 1}$. The mathematical relation between $\mathbf{x}^{\rm S}$, $\mathbf{a}_{i}$, and $z_{i}$ can be represented as

$$z_{i}=\texttt{Conv}(\mathbf{x}^{\rm S};\mathbf{a}_{i})=\mathbf{a}_{i}\cdot\mathbf{x}^{\rm S}, \tag{9}$$

where the operator ‘$\cdot$’ indicates a dot product. The dot product computation can be viewed as the projection of the vector $\mathbf{x}^{\rm S}$ onto the vector $\mathbf{a}_{i}$, as shown in Fig. 11 (a). We can infer that the generation of $z_{i}$ is related to $\mathbf{a}_{i}$ and $\mathbf{x}^{\rm S}$ but has no relation to the teacher’s features. This implies that traditional feature distillation merely forces the student to learn all information from the teacher without considering whether the teacher’s features are beneficial to downstream tasks.

On the contrary, our CoreKD uses the weighting factor $\mathbf{W}^{\rm T}$ of the teacher model to align feature channel numbers between the student and teacher models, as shown in Eq. (8). We denote the $i$-th row vector of $\mathbf{W}^{\rm T}$ as $\mathbf{w}_{i}^{\rm T}$ and the vector at an arbitrary spatial location of $\mathbf{X}^{\rm T}$ as $\mathbf{x}^{\rm T}$. Then we can write the $i$-th loss value of $\mathcal{L}_{\rm feat}$ as

$$\begin{aligned}{\ell}^{i}_{\rm feat}&=\texttt{Conv}(\mathbf{x}^{\rm S};\mathcal{S}(\mathbf{w}_{i}^{\rm T}))-\texttt{Conv}(\mathbf{x}^{\rm T};\mathbf{w}_{i}^{\rm T})\\&=\mathcal{S}(\mathbf{w}^{\rm T}_{i})\cdot\mathbf{x}^{\rm S}-\mathbf{w}^{\rm T}_{i}\cdot\mathbf{x}^{\rm T}.\end{aligned} \tag{10}$$

This loss value calculation process is illustrated in Fig. 11 (b). From the above analyses, we have two key observations: 1) our CoreKD projects the student and teacher features into an identical space constructed by $\mathbf{W}^{\rm T}$, and 2) our CoreKD does not enforce the student feature to be the same as the teacher feature but rather focuses on minimizing the projected distances. Since the values within the weighting factor $\mathbf{W}^{\rm T}$ reflect the importance of teacher features, our CoreKD enables the student model to learn beneficial information for downstream tasks from the teacher model.

Figure 11: Schematic diagram of the mathematical foundation of feature distillation: (a) Convolution operation in the traditional feature distillation; (b) The loss calculation process in our CoreKD.

Since the comparison depends on the datasets and implementation details introduced later, we present the results of our CoreKD against traditional feature distillation in the experimental section. For details, please refer to Section IV-C.

III-D Loss Function

Our efficient multispectral early-fusion (EME) single-branch model is trained using all the losses described above. The total loss is

$$\mathcal{L}_{\rm total}=\mathcal{L}_{\rm cls}+\mathcal{L}_{\rm reg}+\mathcal{L}_{\rm weak}+\mathcal{L}_{\rm feat}, \tag{11}$$

where $\mathcal{L}_{\rm cls}$ and $\mathcal{L}_{\rm reg}$ represent the classification and regression losses defined by a detector [22, 23, 2], respectively. $\mathcal{L}_{\rm weak}$ is the summation of weakly supervised losses defined in Eq. (4) and Eq. (5):

$$\mathcal{L}_{\rm weak}=\mathcal{H}(\mathbf{\tilde{q}}_{\rm ad},\mathbf{\hat{q}}_{\rm bb})+\mathcal{H}(\mathbf{\tilde{q}}_{\rm bb},\mathbf{\hat{q}}_{\rm ad})+\mathcal{H}(\mathbf{G},\mathbf{\hat{G}}).$$

IV Experiments

IV-A Experimental Setup

Datasets. Our experiments are conducted on the M3FD dataset [1] and the FLIR dataset [39]. The M3FD dataset contains 4200 pairs of well-aligned RGB and thermal images covering 6 object classes: ‘Person’, ‘Car’, ‘Bus’, ‘Motorcycle’, ‘Traffic Light’, and ‘Truck’. Since this dataset does not provide a unified data split, previous works have used random splitting to determine the train and validation sets [1]. However, images in this dataset are sampled from video sequences, so adjacent frames may contain nearly identical content. In this context, random splitting results in information leakage between the train and validation sets. To address this problem, we first manually divide the dataset into 73 video segments based on different scenes. Then, we collect the first 70% of images in each video segment as the train set and the remaining images as the validation set. Finally, we obtain 2905 and 1295 pairs of RGB-T images in the train and validation sets, respectively. We name this data split ‘M3FD-zxSplit’ and release it to the public (https://github.com/XueZ-phd/Efficient-RGB-T-Early-Fusion-Detection). For the performance evaluation in Section IV-B, we use this data split. When comparing with state-of-the-art approaches in Section IV-D, we employ both ‘M3FD-zxSplit’ and random splitting. Our random splitting refers to randomly selecting 80% of the images as the train set and the remaining images as the validation set. The FLIR dataset originally contains unaligned RGB-T image pairs. The work [49] develops a data-processing approach to align these images and obtains 7381, 1056, and 2111 image pairs in the train, validation, and test sets, respectively. This dataset contains 3 classes: ‘Person’, ‘Bicycle’, and ‘Car’.
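The segment-wise split described above can be sketched as follows. This is an illustrative sketch, not the released splitting script: the function name, the dictionary layout, and the rounding rule (floor of 70% per segment) are assumptions, since the text does not state how fractional counts are handled.

```python
def split_by_segment(segments, train_percent=70):
    """Split each video segment chronologically into train and validation parts.

    `segments` maps a segment name to its frame identifiers in temporal order.
    Assumption: the number of train frames per segment is
    floor(len * train_percent / 100); integer arithmetic avoids float rounding.
    """
    train, val = [], []
    for name in sorted(segments):
        frames = segments[name]
        cut = len(frames) * train_percent // 100
        train.extend(frames[:cut])   # earliest frames of the segment
        val.extend(frames[cut:])     # remaining, later frames
    return train, val

# Hypothetical toy input: two segments with 10 and 5 frames.
toy_segments = {"scene_a": list(range(10)), "scene_b": list(range(100, 105))}
train_ids, val_ids = split_by_segment(toy_segments)
```

Because the cut is made inside each segment along the time axis, adjacent frames cannot be scattered at random across the two sets, unlike a random split over all images.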

Evaluation Metrics. We use the standard mean Average Precision (mAP) with IoU thresholds ranging from 0.5 to 0.95 across various object scales as metrics.
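For readers less familiar with this metric, the sketch below shows the IoU computation that underlies it and the set of thresholds over which AP is averaged. The step of 0.05 between thresholds follows the common COCO convention and is an assumption here, since the text only gives the range; the helper names are hypothetical.

```python
import numpy as np

# Assumption: thresholds 0.50, 0.55, ..., 0.95, as in the COCO convention.
IOU_THRESHOLDS = np.linspace(0.5, 0.95, 10)

def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def matched_thresholds(iou):
    """Thresholds at which a prediction with this IoU can count as a true positive."""
    return [float(t) for t in IOU_THRESHOLDS if iou >= t]
```

Under this convention, mAP50 corresponds to the single threshold 0.5, while mAP averages the AP over all ten thresholds.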

Inference Efficiency Evaluations. We assess the inference efficiency of our method (Python implementation) on the edge device NVIDIA AGX Orin with 64GB memory. We also evaluate the complexity of our method using FLOPs and the number of parameters. All results are presented in Tables I and II.

Implementation Details. We incorporate our three key modules into commonly used one-stage detectors, including RetinaNet [22], GFL [23], and YOLOv5 [2]. For RetinaNet and GFL, we adopt the implementations in the MMDetection toolbox [50]. For YOLOv5, we use its official implementation [2].

In the early-fusion strategy based on RetinaNet [22] and GFL [23] detectors, we use ResNet-50 [48] as the backbone network. For a fair comparison, we use the same backbone network in the medium-fusion strategy. Notably, in the CoreKD technique, the teacher model utilizes ResNet-101 [48] as the backbone network. For both strategies, we train for 12 epochs using the SGD optimizer with a batch size of 4. The initial learning rate is set to 0.01 and is decayed by a factor of 0.1 at epochs 8 and 11. Random horizontal flipping is employed as a data augmentation technique.
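The step schedule described above can be written out as a small function. This is a sketch under stated assumptions: epochs are counted from 0 and the decay takes effect once the epoch index reaches a milestone, which mirrors common step schedulers; the function name is hypothetical.

```python
def step_lr(epoch, base_lr=0.01, milestones=(8, 11), gamma=0.1):
    """Learning rate used during `epoch` under a step schedule.

    Assumption: `epoch` is 0-based and each milestone that has been reached
    multiplies the learning rate by `gamma` once.
    """
    num_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** num_decays

# Learning rate for each of the 12 training epochs.
schedule = [step_lr(e) for e in range(12)]
```

With these settings, the learning rate stays at 0.01 for the first eight epochs, drops to 0.001 for the next three, and ends at 0.0001 for the final epoch.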

For the early-fusion strategy based on YOLOv5, we use YOLOv5-small as the baseline detector, and we use YOLOv5-large to construct a teacher model in the CoreKD technique. For both strategies, we train for 36 epochs with a batch size of 16. We keep all other hyperparameters consistent with the official settings of the YOLOv5 repository [2].

To standardize data for RetinaNet and GFL detectors, we calculate the mean value and standard deviation of RGB and thermal images for the M3FD and FLIR datasets. All experiments use the $640\times 512$ image resolution. For the M3FD dataset, we obtain $\mathrm{mean}_{\rm rgb}$ = [128.2, 129.3, 125.3], $\mathrm{std}_{\rm rgb}$ = [49.1, 50.2, 53.5], $\mathrm{mean}_{\rm t}$ = [84.1, 84.1, 84.1], and $\mathrm{std}_{\rm t}$ = [50.6, 50.6, 50.6]. For the FLIR dataset, we obtain $\mathrm{mean}_{\rm rgb}$ = [149.4, 148.7, 141.7], $\mathrm{std}_{\rm rgb}$ = [49.3, 52.8, 59.0], $\mathrm{mean}_{\rm t}$ = [135.7, 135.7, 135.7], and $\mathrm{std}_{\rm t}$ = [63.6, 63.6, 63.6]. For the YOLOv5 detector, we normalize both RGB and thermal images to the range of [0, 1] following its official implementation.
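A minimal sketch of this standardization step, using the M3FD statistics reported above, is given below. The array layout (height 512, width 640, three channels) and the constant-valued toy inputs are assumptions made for illustration; the dictionary and function names are hypothetical.

```python
import numpy as np

# Per-channel statistics reported above for the M3FD dataset.
M3FD_STATS = {
    "rgb": {"mean": [128.2, 129.3, 125.3], "std": [49.1, 50.2, 53.5]},
    "thermal": {"mean": [84.1, 84.1, 84.1], "std": [50.6, 50.6, 50.6]},
}

def standardize(image, mean, std):
    """Standardize an H x W x 3 image with per-channel mean and std."""
    image = np.asarray(image, dtype=np.float32)
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    return (image - mean) / std  # broadcasts over the channel axis

# Hypothetical constant-valued inputs; assumes 640 is the width and 512 the height.
rgb = np.full((512, 640, 3), 128, dtype=np.uint8)
thermal = np.full((512, 640, 3), 84, dtype=np.uint8)
rgb_n = standardize(rgb, **M3FD_STATS["rgb"])
thermal_n = standardize(thermal, **M3FD_STATS["thermal"])
```

Each modality is standardized with its own statistics, so that the two inputs are on a comparable scale before they are fed to the detector.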

TABLE I: Inference efficiency and detection performance on the M3FD dataset [1]. The inference time is evaluated on an edge device: NVIDIA AGX Orin. The best results in the mAP and mAP50 columns are highlighted in bold and marked in red, while the second best ones are underlined and marked in green. All detection results are obtained by running three independent experiments. The mean value and standard deviation of these results are reported.
Detector FLOPs (↓) Parameter (↓) Time (↓) mAP (↑) mAP50 (↑) Person (↑) Car (↑) Bus (↑) Motor (↑) TrafficLight (↑) Truck (↑)
RGB RetinaNet-Res50 61.893G 36.434M 0.106s 31.03±0.09 51.30±0.16 44.57±0.12 74.87±0.26 57.80±0.24 44.30±0.22 36.63±0.58 49.70±0.16
Thermal RetinaNet-Res50 61.893G 36.434M 0.106s 29.27±0.05 46.97±0.09 59.17±0.34 71.17±0.29 54.17±0.69 35.90±0.22 10.43±0.09 50.83±0.21
RGB-T Medium Fusion RetinaNet-Res50 94.611G 47.582M 0.170s 33.43±0.05 53.63±0.05 60.10±0.08 77.27±0.05 61.63±0.05 45.43±0.09 25.67±0.12 51.80±0.00
Baseline: RGB-T Early Fusion RetinaNet-Res50 62.164G 36.434M 0.110s 32.03±0.05 50.70±0.16 58.93±0.26 75.37±0.09 58.97±0.39 39.97±0.21 21.67±0.47 49.33±0.05
+ ShaPE RetinaNet-Res50 62.218G 36.434M 0.149s 32.80±0.08 51.97±0.24 60.20±0.59 77.10±0.08 58.83±0.33 39.77±0.58 24.57±0.54 51.33±0.17
+ ShaPE + WeakSup. RetinaNet-Res50 62.218G 36.434M 0.149s 33.53±0.05 52.90±0.16 59.10±0.71 77.07±0.41 62.00±0.78 40.97±1.67 24.03±0.85 54.30±0.71
+ ShaPE + WeakSup. + CoreKD RetinaNet-Res50 62.218G 36.434M 0.149s 33.53±0.17 53.23±0.09 61.47±0.49 76.40±0.08 59.20±0.08 43.10±0.22 25.97±0.12 53.17±0.19
RGB GFL-Res50 61.392G 32.270M 0.110s 32.23±0.12 53.10±0.16 48.67±0.12 77.43±0.26 60.27±0.60 43.50±0.16 39.07±0.94 49.63±0.39
Thermal GFL-Res50 61.392G 32.270M 0.110s 29.50±0.22 48.27±0.42 64.27±0.34 73.73±0.12 52.50±1.99 36.50±0.00 15.10±0.14 47.37±0.25
RGB-T Medium Fusion GFL-Res50 94.110G 43.419M 0.172s 34.17±0.33 54.47±0.60 65.37±0.12 79.83±0.05 61.20±1.00 37.00±1.79 34.80±0.57 48.73±0.77
Baseline: RGB-T Early Fusion GFL-Res50 61.663G 32.271M 0.114s 33.50±0.28 52.77±0.25 64.03±0.12 78.20±0.16 53.00±1.31 39.27±0.26 30.47±0.56 51.57±0.54
+ ShaPE GFL-Res50 61.718G 32.271M 0.151s 35.17±0.17 55.50±0.22 65.80±0.24 79.10±0.08 62.33±1.24 41.33±0.42 30.80±0.42 53.67±0.41
+ ShaPE + WeakSup. GFL-Res50 61.718G 32.271M 0.151s 35.23±0.53 55.97±0.24 65.73±0.49 79.57±0.17 60.50±4.02 43.10±1.06 33.73±2.01 53.10±1.94
+ ShaPE + WeakSup. + CoreKD GFL-Res50 61.718G 32.271M 0.151s 37.03±0.09 57.70±0.08 68.43±0.25 81.23±0.09 63.37±0.33 43.90±0.86 35.77±0.12 53.53±0.29
Figure 12: Detection results of the GFL [23] detector on two example scenes from the M3FD [1] dataset. (a) and (e) display results using only RGB images. (b) and (f) show results using only thermal images. (c) and (g) demonstrate results using the plain RGB-T early-fusion strategy. (d) and (h) depict results using our EME method. Solid boxes represent detection results. Green dashed boxes mark missed objects (false negatives) while yellow dashed boxes mark false positives.
TABLE II: Inference efficiency and detection performance on the FLIR dataset [39]. The inference time is evaluated on an edge device: NVIDIA AGX Orin. The best results in the mAP and mAP50 columns are highlighted in bold and marked in red, while the second best ones are underlined and marked in green. All detection results are obtained by running three independent experiments. The mean value and standard deviation of these results are reported.
Detector FLOPs (↓) Parameter (↓) Time (↓) mAP (↑) mAP50 (↑) Person (↑) Bicycle (↑) Car (↑)
RGB RetinaNet-Res50 61.893G 36.434M 0.106s 28.10±0.00 59.47±0.12 44.93±0.39 55.70±0.08 77.80±0.08
Thermal RetinaNet-Res50 61.893G 36.434M 0.106s 35.53±0.05 70.93±0.05 62.17±0.17 66.37±0.09 84.27±0.05
RGB-T Medium Fusion RetinaNet-Res50 94.611G 47.582M 0.170s 38.50±0.08 71.57±0.05 61.93±0.29 67.60±0.14 85.17±0.09
Baseline: RGB-T Early Fusion RetinaNet-Res50 62.164G 36.434M 0.110s 37.47±0.05 69.57±0.05 60.70±0.22 63.77±0.09 84.37±0.05
+ ShaPE RetinaNet-Res50 62.218G 36.434M 0.149s 38.70±0.14 71.60±0.22 61.40±0.54 68.60±0.36 84.70±0.22
+ ShaPE + WeakSup. RetinaNet-Res50 62.218G 36.434M 0.149s 38.80±0.29 72.07±0.37 62.50±0.96 68.77±0.97 85.03±0.21
+ ShaPE + WeakSup. + CoreKD RetinaNet-Res50 62.218G 36.434M 0.149s 38.83±0.17 72.23±0.31 62.27±0.26 69.23±0.66 85.10±0.22
RGB GFL-Res50 61.392G 32.270M 0.110s 31.73±0.12 63.77±0.05 51.70±0.08 57.87±0.25 81.77±0.05
Thermal GFL-Res50 61.392G 32.270M 0.110s 42.40±0.14 75.07±0.05 69.80±0.16 68.47±0.24 86.93±0.05
RGB-T Medium Fusion GFL-Res50 94.110G 43.419M 0.172s 42.60±0.08 76.07±0.21 70.07±0.19 70.57±0.41 87.63±0.05
Baseline: RGB-T Early Fusion GFL-Res50 61.663G 32.271M 0.114s 41.90±0.22 74.77±0.17 69.70±0.33 67.73±0.26 87.00±0.00
+ ShaPE GFL-Res50 61.718G 32.271M 0.151s 42.40±0.16 75.77±0.17 69.97±0.21 70.13±0.37 87.23±0.09
+ ShaPE + WeakSup. GFL-Res50 61.718G 32.271M 0.151s 42.93±0.24 76.90±0.22 71.30±0.14 71.40±0.50 87.97±0.09
+ ShaPE + WeakSup. + CoreKD GFL-Res50 61.718G 32.271M 0.151s 44.00±0.00 78.17±0.05 73.03±0.05 72.63±0.17 88.80±0.00
Figure 13: Detection results of the GFL [23] detector on two example scenes from the FLIR [39] dataset. (a) and (e) display results using only RGB images. (b) and (f) show results using only thermal images. (c) and (g) demonstrate results using the plain RGB-T early-fusion strategy. (d) and (h) depict results using our EME method. Solid boxes represent detection results. Green dashed boxes mark missed objects (false negatives) while yellow dashed boxes mark false positives.

IV-B Performance Evaluation of Proposed Modules

Table I and Table II present the performance of our method on the M3FD [1] and FLIR [39] datasets. Key observations include: (1) the medium-fusion strategy requires more parameters and FLOPs than the early-fusion strategy; (2) the medium-fusion strategy achieves better performance than single-modality inputs, whereas the plain early-fusion strategy does not consistently improve performance; (3) our EME method, which incorporates the ShaPE module, weakly supervised learning, and the CoreKD technique into the plain early-fusion strategy, achieves significant performance improvement without significantly increasing parameters and FLOPs (e.g., with GFL on M3FD, mAP rises from 33.50 to 37.03 while FLOPs increase only from 61.663G to 61.718G); (4) the inference time of our EME method is longer than that of the baseline method, since the structural similarity computation used to calculate the self-gating mask has not been optimized; (5) our EME method can outperform the medium-fusion strategy in both performance and efficiency to some extent; (6) the two architectures “Baseline + ShaPE + WeakSup.” and “Baseline + ShaPE + WeakSup. + CoreKD” have the same FLOPs, parameters, and inference time as “Baseline + ShaPE”, because both the weakly supervised learning method and CoreKD are removed in the inference phase, while only the ShaPE module is retained.

Fig. 12 and Fig. 13 present visualization results for two example scenes from the M3FD [1] and FLIR [39] datasets, respectively. We observe that false positives or false negatives in the single-modality results may affect the plain early-fusion strategy. For instance, the person missed in Fig. 13 (e) is also absent in Fig. 13 (g), despite being detected in Fig. 13 (f). Moreover, false positives in Fig. 13 (f) affect the detection results of plain early-fusion, as shown in Fig. 13 (g). These phenomena confirm that the problem of information interference is a key obstacle to performance in the plain early-fusion strategy. Clearly, our EME effectively alleviates this problem.

IV-C Performance Evaluation of Variants

Feature Distillation. We compare the results of traditional feature distillation with those of our CoreKD. Table III and Table IV present comparison results on the M3FD and FLIR datasets, respectively. For comprehensive comparisons, we adopt RetinaNet-Res50 and GFL-Res50 as baseline detectors in these two tables. From the results, we can observe that our CoreKD consistently achieves superior performance compared to traditional feature distillation. For example, our CoreKD (72.23%) obtains a 1.26% mAP50 absolute gain over the traditional one (70.97%) when using RetinaNet-Res50 on the FLIR dataset.

TABLE III: Comparison of traditional feature distillation with our CoreKD on the M3FD dataset [1].
Method Detector mAP (↑) mAP50 (↑)
Baseline: Plain RGB-T Early Fusion RetinaNet-Res50 32.03±0.05 50.70±0.16
Traditional Feature Distill + Baseline + ShaPE + WeakSup. RetinaNet-Res50 32.17±0.12 52.43±0.05
CoreKD + Baseline + ShaPE + WeakSup. RetinaNet-Res50 33.53±0.17 53.23±0.09
Baseline: Plain RGB-T Early Fusion GFL-Res50 33.50±0.28 52.77±0.25
Traditional Feature Distill + Baseline + ShaPE + WeakSup. GFL-Res50 35.83±0.17 57.07±0.09
CoreKD + Baseline + ShaPE + WeakSup. GFL-Res50 37.03±0.09 57.70±0.08
TABLE IV: Comparison of traditional feature distillation with our CoreKD on the FLIR dataset [39].
Method Detector mAP (↑) mAP50 (↑)
Baseline: Plain RGB-T Early Fusion RetinaNet-Res50 37.47±0.05 69.57±0.05
Traditional Feature Distill + Baseline + ShaPE + WeakSup. RetinaNet-Res50 38.10±0.14 70.97±0.12
CoreKD + Baseline + ShaPE + WeakSup. RetinaNet-Res50 38.83±0.17 72.23±0.31
Baseline: Plain RGB-T Early Fusion GFL-Res50 41.90±0.22 74.77±0.17
Traditional Feature Distill + Baseline + ShaPE + WeakSup. GFL-Res50 43.63±0.09 77.80±0.08
CoreKD + Baseline + ShaPE + WeakSup. GFL-Res50 44.00±0.00 78.17±0.05

Backbone Network. We evaluate our EME method using ResNet-101 [48] as the backbone network on the M3FD and FLIR datasets, and present the results in Table V. We observe that detectors using ResNet-101 consistently achieve better performance than those using ResNet-50. For example, on the FLIR dataset, RetinaNet with ResNet-101 (74.87%) obtains a 2.64% mAP50 absolute gain over RetinaNet with ResNet-50 (72.23%).

TABLE V: Results of our EME method on the M3FD and FLIR datasets using different baseline detectors and backbone networks.
Datasets Detector Backbone FLOPs mAP (↑) mAP50 (↑)
M3FD RetinaNet ResNet-50 62.218G 33.53±0.17 53.23±0.09
RetinaNet ResNet-101 65.392G 34.23±0.12 54.63±0.45
GFL ResNet-50 61.718G 37.03±0.09 57.70±0.08
GFL ResNet-101 64.892G 37.37±0.12 58.60±0.08
FLIR RetinaNet ResNet-50 62.218G 38.83±0.17 72.23±0.31
RetinaNet ResNet-101 65.392G 40.67±0.05 74.87±0.34
GFL ResNet-50 61.718G 44.00±0.00 78.17±0.05
GFL ResNet-101 64.892G 44.47±0.05 79.57±0.05
TABLE VI: Comparisons with state-of-the-art approaches on the M3FD dataset [1]. The best results are highlighted in bold and marked in red, while the second-best ones are underlined and marked in green. The detection results of our EME are obtained by running three independent experiments. The mean values and standard deviations of these results are reported.
(a) Dataset Splitting Method: Random Splitting
Thermal [2] RGB [2] AUIF [51] CDDF [30] DDcGAN [52] DIVF [31] DenseF [53] PSF [54] RFN [55] SeAF [56] TarDAL [1] U2F [57] EME (Ours)
mAP 49.10 52.40 53.30 53.00 52.20 52.70 53.40 53.10 53.50 53.10 52.50 53.40 54.00±0.28
mAP50 77.30 81.90 81.90 80.90 81.60 81.50 81.70 82.00 81.70 82.20 81.00 81.90 82.90±0.37
Person 79.30 68.40 76.70 76.30 73.60 74.50 76.50 76.70 75.30 77.00 79.10 77.00 79.53±0.26
Car 87.90 90.80 91.00 91.00 90.70 91.10 91.40 90.80 91.00 91.10 90.50 91.20 91.90±0.29
Bus 87.20 92.20 90.00 90.10 90.70 91.60 89.40 90.10 89.40 91.20 89.40 90.70 89.80±0.45
Motor 70.00 74.00 72.60 69.20 74.80 73.50 72.80 73.30 73.30 72.20 70.30 71.30 74.87±0.95
TrafficLight 55.90 80.30 77.40 75.40 76.90 74.80 77.20 78.20 77.40 77.60 72.70 77.70 77.40±1.13
Truck 83.40 85.70 83.70 83.10 82.90 83.40 82.90 82.90 83.90 84.10 84.00 83.60 84.00±0.93
(b) Dataset Splitting Method: M3FD-zxSplit
Thermal [2] RGB [2] AUIF [51] CDDF [30] DDcGAN [52] DIVF [31] DenseF [53] PSF [54] RFN [55] SeAF [56] TarDAL [1] U2F [57] EME (Ours)
mAP 34.90 36.10 38.30 38.60 37.10 37.10 38.90 38.00 38.20 38.90 39.10 38.70 41.10±0.29
mAP50 57.20 60.20 62.00 61.90 61.00 60.80 62.40 61.10 61.30 62.20 61.90 61.90 66.23±0.40
Person 74.60 55.90 72.20 71.90 67.30 67.60 72.30 71.70 70.50 72.50 75.50 72.40 77.23±0.26
Car 80.20 84.80 85.50 85.60 84.90 85.20 85.90 85.50 85.80 85.50 85.00 85.50 87.13±0.19
Bus 58.30 65.70 58.60 61.80 61.60 59.80 61.40 58.30 61.30 61.50 60.90 60.10 62.33±1.96
Motor 48.00 45.10 49.10 47.60 49.00 48.70 49.60 45.80 44.60 47.50 46.80 50.80 55.33±0.21
TrafficLight 27.30 56.80 49.80 48.70 49.10 51.20 48.60 50.90 49.70 50.80 46.90 48.00 55.33±0.37
Truck 54.80 52.70 56.70 55.50 53.80 52.60 56.80 54.70 55.80 55.70 56.70 54.70 60.10±0.16
Figure 14: Detection results of the YOLOv5 [2] detector on one example scene from the M3FD [1] dataset. (a) and (b) respectively show the results using only a thermal image and only an RGB image. (c)-(l) display the detection results using fused images obtained from 10 different image fusion approaches. (m) demonstrates the results using our EME method.
TABLE VII: Comparisons with state-of-the-art approaches on the FLIR [39] dataset. The best results are highlighted in bold and marked in red, while the second-best ones are underlined and marked in green. The detection results of our EME are obtained by running three independent experiments. The mean values and standard deviations of these results are reported.
CBF [58] MCG [59] MUN [59] ODS [60] CFR [61] GAFF [15] BU [62] SMPD [63] ThDe [64] MSAT [65] CSAA [66] MFPT [67] ProbEn3 [28] EME (Ours)
mAP50 67.20 61.40 61.54 69.62 72.39 72.90 73.20 73.58 74.60 76.20 79.20 80.00 83.76 84.63±0.12
Bicycle 60.50 50.26 49.43 55.53 55.77 - 57.40 56.20 60.04 - - 67.70 73.49 79.73±0.09
Car 83.60 70.63 70.72 82.33 84.91 - 86.50 85.80 85.52 - - 89.00 90.14 92.67±0.12
Person 57.60 63.31 64.47 71.01 74.49 - 75.60 78.74 78.24 - - 83.20 87.65 81.27±0.09

IV-D Comparison with the State-of-The-Art Approaches

We use the one-stage YOLOv5 [2] detector as the baseline, and incorporate our proposed modules to construct the effective multispectral early-fusion (EME) model. Table VI and Table VII compare our EME with previous state-of-the-art approaches on the M3FD [1] and FLIR [39] datasets.

In Table VI, we compare our EME with 10 state-of-the-art image-fusion-based object detection approaches [51, 30, 52, 31, 53, 54, 55, 56, 1, 57]. We first generate fused images based on their official implementations, and then train YOLOv5 [2] using these fused images with the same training settings. The results show that our EME achieves the best mAP and mAP50 under both splitting methods. We observe that the results in Table VI (a) are markedly higher than those in Table VI (b), and that the mAP margin of our EME over the second-best method is smaller in Table VI (a) (0.50) than in Table VI (b) (2.00). This is consistent with random splitting causing information leakage, which inflates the scores and makes improvements harder to observe. Fig. 14 presents an example scene for visualization. Compared to other approaches, a weakness of our EME detector is that it does not generate a fused image for direct visualization. This is because our method focuses on detection rather than image fusion. We will address this issue in future work.

In Table VII, we compare our EME with 13 multispectral object detection approaches. These approaches include (1) medium-fusion strategies, such as CBF [58], MCG [59], MUN [59], CFR [61], GAFF [15], SMPD [63], MSAT [65], CSAA [66], and MFPT [67]; (2) domain adaptation and single-modality detection approaches, such as ODS [60], BU [62], and ThDe [64]; and (3) a late-fusion strategy, ProbEn3 [28]. The results show that our EME also achieves the best mAP50 (84.63%) on the FLIR dataset [39], although ProbEn3 remains better on the ‘Person’ class (87.65% versus 81.27%).

IV-E Comparison of Inference Efficiency

We compare the inference efficiency of our EME method with previous state-of-the-art approaches on an edge device: the NVIDIA AGX Orin with 64GB of memory. We select open-source approaches for comparison and adopt YOLOv5-small as the baseline detector for all methods.

Table VIII presents the FLOPs, number of parameters, and inference time for various methods. Experimental results show that our EME method is the fastest (0.077 s). Interestingly, we notice that a reduction in FLOPs does not directly lead to a similar reduction in inference time. For example, AUIF [51] has the fewest FLOPs (12.185G) but takes 5.12 s. This phenomenon may be attributed to frequent memory access by operators, as confirmed in PConv [68]. This observation inspires us to further speed up our EME method by reducing memory access in the future.
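Since FLOPs alone do not predict latency, wall-clock time has to be measured directly. The snippet below is a generic sketch of such a measurement, not the protocol used for Table VIII, which the text does not describe: the number of warm-up and timed runs and the stand-in workload are assumptions, and the helper name is hypothetical. When timing a model on a GPU, the device should also be synchronized before the clock is read, because kernel launches are asynchronous.

```python
import time

def mean_latency(fn, warmup=3, runs=10):
    """Average wall-clock seconds per call of `fn`, measured after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up calls are excluded from the measurement
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

# Hypothetical stand-in workload; a real measurement would call the detector here.
latency = mean_latency(lambda: sum(range(10_000)))
```

Averaging over several runs after a warm-up phase reduces the influence of one-off costs such as memory allocation and cache population.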

TABLE VIII: Comparison of inference efficiency. All methods use YOLOv5-small as the baseline detector. Inference time is evaluated on an edge device: the NVIDIA AGX Orin.
Method FLOPs Parameters Time (seconds)
AUIF [51] 12.185G 7.037M 5.12s
CDDF [30] 2816.279G 8.214M 9.507s
DenseF [53] 151.596G 7.100M 1.769s
PSF [54] 939.024G 52.925M 2.512s
RFN [55] 1859.908G 9.759M 3.473s
SeAF [56] 272.945G 7.192M 1.344s
TarDAL [1] 478.474G 7.323M 1.446s
EME (Ours) 15.780G 7.063M 0.077s

V Conclusions

In this paper, we propose the effective multispectral early-fusion (EME) detector, which achieves both high performance and efficiency. We identify and address performance obstacles in a plain early-fusion strategy, such as information interference, domain gaps, and weak feature representation, by proposing solutions including shape-priority early-fusion modules, weakly supervised learning, and core knowledge distillation. Extensive experiments on representative datasets demonstrate the effectiveness and efficiency of our EME detector.

The main advantage of our EME detector is that it improves the performance of an efficient single-branch early-fusion strategy without significantly increasing its computational burden. We demonstrate that our EME detector has similar FLOPs and parameters to a plain early-fusion strategy, while achieving better performance than a cumbersome two-branch structure. We also show that our EME detector has higher inference efficiency than the two-branch structure on an edge device: the NVIDIA AGX Orin.

A limitation of our current EME detector is its inefficient two-stage training paradigm in the knowledge distillation technique. In the future, we will work towards an optimized one-stage paradigm to accelerate the training process and further improve detection accuracy.

References

  • [1] J. Liu, X. Fan, Z. Huang, G. Wu, R. Liu, W. Zhong, and Z. Luo, “Target-Aware Dual Adversarial Learning and a Multi-Scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2022.
  • [2] G. Jocher, “YOLOv5 by Ultralytics,” 2020. [Online]. Available: https://github.com/ultralytics/yolov5
  • [3] Z. Chen and X. Huang, “Pedestrian Detection for Autonomous Vehicle Using Multi-Spectral Cameras,” IEEE Transactions on Intelligent Vehicles, vol. 4, no. 2, pp. 211–219, 2019.
  • [4] W. Zhou, S. Dong, M. Fang, and L. Yu, “CACFNet: Cross-Modal Attention Cascaded Fusion Network for RGB-T Urban Scene Parsing,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 1, pp. 1919–1929, 2024.
  • [5] Y. Liu, C. Hu, B. Zhao, Y. Huang, and X. Zhang, “Region-Based Illumination-Temperature Awareness and Cross-Modality Enhancement for Multispectral Pedestrian Detection,” IEEE Transactions on Intelligent Vehicles, pp. 1–12, 2024.
  • [6] M. A. Farooq, W. Shariff, and P. Corcoran, “Evaluation of Thermal Imaging on Embedded GPU Platforms for Application in Vehicular Assistance Systems,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 2, pp. 1130–1144, 2022.
  • [7] M. Ding, W.-H. Chen, and Y.-F. Cao, “Thermal Infrared Single-Pedestrian Tracking for Advanced Driver Assistance System,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 814–824, 2023.
  • [8] W. Zhou, S. Dong, J. Lei, and L. Yu, “MTANet: Multitask-Aware Network With Hierarchical Multimodal Fusion for RGB-T Urban Scene Understanding,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 48–58, 2023.
  • [9] Y. Guo, H. Kong, and S. Gu, “Unsupervised Multi-Spectrum Stereo Depth Estimation for All-Day Vision,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 1, pp. 501–511, 2024.
  • [10] Y. Zhu, C. Li, J. Tang, and B. Luo, “Quality-Aware Feature Aggregation Network for Robust RGB-T Tracking,” IEEE Transactions on Intelligent Vehicles, vol. 6, no. 1, pp. 121–130, 2021.
  • [11] J. Liu, S. Zhang, S. Wang, and D. N. Metaxas, “Multispectral Deep Neural Networks for Pedestrian Detection,” in Proceedings of the British Machine Vision Conference, 2016.
  • [12] J. Wagner, V. Fischer, M. Herman, S. Behnke et al., “Multispectral Pedestrian Detection using Deep Fusion Convolutional Neural Networks,” in Proceedings of the European Symposium on Artificial Neural Networks, vol. 587, 2016, pp. 509–514.
  • [13] Q. Xie, T.-Y. Cheng, Z. Dai, V. Tran, N. Trigoni, and A. Markham, “Illumination-Aware Hallucination-Based Domain Adaptation for Thermal Pedestrian Detection,” IEEE Transactions on Intelligent Transportation Systems, 2023.
  • [14] T. Liu, K.-M. Lam, R. Zhao, and G. Qiu, “Deep Cross-Modal Representation Learning and Distillation for Illumination-Invariant Pedestrian Detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 1, pp. 315–329, 2021.
  • [15] H. Zhang, E. Fromont, S. Lefèvre, and B. Avignon, “Guided Attentive Feature Fusion for Multispectral Pedestrian Detection,” in Proceedings of the Winter Conference on Applications of Computer Vision, 2021, pp. 72–80.
  • [16] B. Yin, X. Zhang, Z. Li, L. Liu, M.-M. Cheng, and Q. Hou, “DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation,” in Proceedings of the International Conference on Learning Representations, 2024.
  • [17] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning Transferable Visual Models from Natural Language Supervision,” in Proceedings of the International Conference on Machine Learning, vol. 139, 2021, pp. 8748–8763.
  • [18] Z. Lai, N. Vesdapunt, N. Zhou, J. Wu, C. P. Huynh, X. Li, K. K. Fu, and C.-N. Chuah, “PADCLIP: Pseudo-Labeling with Adaptive Debiasing in CLIP for Unsupervised Domain Adaptation,” in Proceedings of the International Conference on Computer Vision, 2023, pp. 16 109–16 119.
  • [19] G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” in Proceedings of the Advances in Neural Information Processing Systems Workshop, 2015.
  • [20] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “FitNets: Hints for Thin Deep Nets,” in Proceedings of the International Conference on Learning Representations, 2015.
  • [21] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, “Learning Efficient Object Detection Models with Knowledge Distillation,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [22] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal Loss for Dense Object Detection,” in Proceedings of the International Conference on Computer Vision, 2017, pp. 2980–2988.
  • [23] X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang, “Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection,” Advances in Neural Information Processing Systems, vol. 33, pp. 21 002–21 012, 2020.
  • [24] J. Wang, M. Zhang, W. Li, and R. Tao, “A Multistage Information Complementary Fusion Network Based on Flexible-Mixup for HSI-X Image Classification,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–13, 2023.
  • [25] J. Wang, W. Li, Y. Gao, M. Zhang, R. Tao, and Q. Du, “Hyperspectral and SAR Image Classification via Multiscale Interactive Fusion Network,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 12, pp. 10 823–10 837, 2023.
  • [26] M. Zhang, W. Li, X. Zhao, H. Liu, R. Tao, and Q. Du, “Morphological Transformation and Spatial-Logical Aggregation for Tree Species Classification Using Hyperspectral Imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–12, 2023.
  • [27] Y. Gao, M. Zhang, J. Wang, and W. Li, “Cross-Scale Mixing Attention for Multisource Remote Sensing Data Fusion and Classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2023.
  • [28] Y.-T. Chen, J. Shi, Z. Ye, C. Mertz, D. Ramanan, and S. Kong, “Multimodal Object Detection via Probabilistic Ensembling,” in Proceedings of the European Conference on Computer Vision, 2022, pp. 139–158.
  • [29] H. Zhang, E. Fromont, S. Lefèvre, and B. Avignon, “Low-Cost Multispectral Scene Analysis with Modality Distillation,” in Proceedings of the Winter Conference on Applications of Computer Vision, 2022, pp. 803–812.
  • [30] Z. Zhao, H. Bai, J. Zhang, Y. Zhang, S. Xu, Z. Lin, R. Timofte, and L. Van Gool, “CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2023, pp. 5906–5916.
  • [31] L. Tang, X. Xiang, H. Zhang, M. Gong, and J. Ma, “DIVFusion: Darkness-Free Infrared and Visible Image Fusion,” Information Fusion, vol. 91, pp. 477–493, 2023.
  • [32] D. Zhang, J. Han, G. Cheng, and M.-H. Yang, “Weakly Supervised Object Localization and Detection: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 5866–5885, 2021.
  • [33] Y. Zhang, H. Yu, Y. He, X. Wang, and W. Yang, “Illumination-Guided RGBT Object Detection with Inter- and Intra-Modality Fusion,” IEEE Transactions on Instrumentation and Measurement, vol. 72, pp. 1–13, 2023.
  • [34] X. Zhang, X. Zhang, Z. Sheng, and H.-L. Shen, “TFDet: Target-Aware Fusion for RGB-T Pedestrian Detection,” arXiv preprint arXiv:2305.16580, 2023.
  • [35] Z. Li, P. Xu, X. Chang, L. Yang, Y. Zhang, L. Yao, and X. Chen, “When Object Detection Meets Knowledge Distillation: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [36] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image Quality Assessment: From Error Visibility to Structural Similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [38] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common Objects in Context,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 740–755.
  • [39] “FREE FLIR Thermal Dataset for Algorithm Training,” https://www.flir.com/oem/adas/adas-dataset-form/.
  • [40] Z. Chen, Z. Zhang, X. Tan, Y. Qu, and Y. Xie, “Unveiling the Power of CLIP in Unsupervised Visible-Infrared Person Re-Identification,” in Proceedings of the ACM International Conference on Multimedia, 2023, pp. 3667–3675.
  • [41] X. Yi, H. Xu, H. Zhang, L. Tang, and J. Ma, “Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2024, pp. 27 026–27 035.
  • [42] X. Yu, N. Dong, L. Zhu, H. Peng, and D. Tao, “CLIP-Driven Semantic Discovery Network for Visible-Infrared Person Re-Identification,” 2024. [Online]. Available: https://arxiv.org/abs/2401.05806
  • [43] Z. Wang, Y. Li, X. Chen, S.-N. Lim, A. Torralba, H. Zhao, and S. Wang, “Detecting Everything in the Open World: Towards Universal Object Detection,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, June 2023, pp. 11 433–11 443.
  • [44] R. Abdelfattah, Q. Guo, X. Li, X. Wang, and S. Wang, “CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification,” in Proceedings of the International Conference on Computer Vision, 2023, pp. 1348–1357.
  • [45] Y. Zhong, J. Yang, P. Zhang, C. Li, N. Codella, L. H. Li, L. Zhou, X. Dai, L. Yuan, Y. Li, and J. Gao, “RegionCLIP: Region-Based Language-Image Pretraining,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, June 2022, pp. 16 793–16 803.
  • [46] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep Mutual Learning,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2018, pp. 4320–4328.
  • [47] L. Yuan, F. E. Tay, G. Li, T. Wang, and J. Feng, “Revisiting Knowledge Distillation via Label Smoothing Regularization,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2020, pp. 3903–3911.
  • [48] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [49] V. Sam, K. Ali, M. Christian, K. Laurent, and E. Lutz, “Robust Environment Perception for Automated Driving: A Unified Learning Pipeline for Visual-Infrared Object Detection,” in IEEE Intelligent Vehicles Symposium, 2022, pp. 367–374.
  • [50] MMDetection Contributors, “OpenMMLab Detection Toolbox and Benchmark,” 2018. [Online]. Available: https://github.com/open-mmlab/mmdetection
  • [51] Z. Zhao, S. Xu, J. Zhang, C. Liang, C. Zhang, and J. Liu, “Efficient and Model-Based Infrared and Visible Image Fusion via Algorithm Unrolling,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 3, pp. 1186–1196, 2022.
  • [52] J. Ma, H. Xu, J. Jiang, X. Mei, and X.-P. Zhang, “DDcGAN: A Dual-Discriminator Conditional Generative Adversarial Network for Multi-Resolution Image Fusion,” IEEE Transactions on Image Processing, vol. 29, pp. 4980–4995, 2020.
  • [53] H. Li and X.-J. Wu, “DenseFuse: A Fusion Approach to Infrared and Visible Images,” IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2614–2623, 2019.
  • [54] L. Tang, H. Zhang, H. Xu, and J. Ma, “Rethinking the Necessity of Image Fusion in High-Level Vision Tasks: A Practical Infrared and Visible Image Fusion Network Based on Progressive Semantic Injection and Scene Fidelity,” Information Fusion, vol. 99, p. 101870, 2023.
  • [55] H. Li, X.-J. Wu, and J. Kittler, “RFN-Nest: An End-to-End Residual Fusion Network for Infrared and Visible Images,” Information Fusion, vol. 73, pp. 72–86, 2021.
  • [56] L. Tang, J. Yuan, and J. Ma, “Image Fusion in the Loop of High-Level Vision Tasks: A Semantic-Aware Real-Time Infrared and Visible Image Fusion Network,” Information Fusion, vol. 82, pp. 28–42, 2022.
  • [57] H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling, “U2Fusion: A Unified Unsupervised Image Fusion Network,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [58] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “CBAM: Convolutional Block Attention Module,” in Proceedings of the European Conference on Computer Vision, 2018.
  • [59] C. Devaguptapu, N. Akolekar, M. M. Sharma, and V. N. Balasubramanian, “Borrow from Anywhere: Pseudo Multi-Modal Object Detection in Thermal Imagery,” in Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops, 2019.
  • [60] F. Munir, S. Azam, M. A. Rafique, A. M. Sheri, and M. Jeon, “Thermal Object Detection Using Domain Adaptation through Style Consistency,” arXiv preprint arXiv:2006.00821, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:219176719
  • [61] H. Zhang, E. Fromont, S. Lefevre, and B. Avignon, “Multispectral Fusion for Object Detection with Cyclic Fuse-and-Refine Blocks,” in Proceedings of the International Conference on Image Processing, 2020, pp. 276–280.
  • [62] M. Kieu, A. D. Bagdanov, and M. Bertini, “Bottom-Up and Layerwise Domain Adaptation for Pedestrian Detection in Thermal Images,” ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 17, no. 1, 2021.
  • [63] Q. Li, C. Zhang, Q. Hu, P. Zhu, H. Fu, and L. Chen, “Stabilizing Multispectral Pedestrian Detection with Evidential Hybrid Fusion,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 4, pp. 3017–3029, 2024.
  • [64] Y. Cao, T. Zhou, X. Zhu, and Y. Su, “Every Feature Counts: An Improved One-Stage Detector in Thermal Imagery,” in Proceedings of the International Conference on Computer and Communications, 2019, pp. 1965–1969.
  • [65] S. You, X. Xie, Y. Feng, C. Mei, and Y. Ji, “Multi-Scale Aggregation Transformers for Multispectral Object Detection,” IEEE Signal Processing Letters, vol. 30, pp. 1172–1176, 2023.
  • [66] Y. Cao, J. Bin, J. Hamari, E. Blasch, and Z. Liu, “Multimodal Object Detection by Channel Switching and Spatial Attention,” in Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops, 2023, pp. 403–411.
  • [67] Y. Zhu, X. Sun, M. Wang, and H. Huang, “Multi-Modal Feature Pyramid Transformer for RGB-Infrared Object Detection,” IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 9, pp. 9984–9995, 2023.
  • [68] J. Chen, S.-h. Kao, H. He, W. Zhuo, S. Wen, C.-H. Lee, and S.-H. G. Chan, “Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks,” in Proceedings of the Conference on Computer Vision and Pattern Recognition, June 2023, pp. 12021–12031.