Article

High-Performance Binocular Disparity Prediction Algorithm for Edge Computing

1 Collaborative Innovation Center on Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, Nanjing 210044, China
2 Department of Computer Science, University of Reading, Whiteknights, Reading RG6 6DH, UK
3 Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China
4 University of Chinese Academy of Sciences, Beijing 100089, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(14), 4563; https://doi.org/10.3390/s24144563
Submission received: 23 May 2024 / Revised: 5 July 2024 / Accepted: 11 July 2024 / Published: 14 July 2024
(This article belongs to the Topic Applications in Image Analysis and Pattern Recognition)

Abstract: End-to-end disparity estimation algorithms based on cost volumes face structural adaptation problems when deployed on edge neural network accelerators and must preserve accuracy under the constraint of the operators those accelerators support. This paper therefore proposes a novel disparity calculation algorithm that uses low-rank approximation to replace 3D convolution and transposed 3D convolution, WReLU to reduce the data compression caused by the activation function, and unimodal cost volume filtering together with a confidence estimation network to regularize the cost volume. The approach alleviates the problem of the disparity-matching cost distribution deviating from the true distribution and greatly reduces the computational complexity and number of parameters of the algorithm while improving accuracy. Experimental results show that, compared with a typical disparity estimation network, the absolute error of the proposed algorithm is reduced by 38.3%, the three-pixel error is reduced to 1.41%, and the number of parameters is reduced by 67.3%. The calculation accuracy is better than that of other algorithms, the network is easier to deploy, and it has strong structural adaptability and better practicability.

1. Introduction

Obtaining the depth information of objects is important in many advanced vision tasks. Binocular stereo matching imitates the human eyes: it computes disparity from the horizontal offset between the views of the left and right cameras, perceives depth from that disparity, and thereby obtains three-dimensional scene information. It is a key technology and an important topic in the field of stereo vision and has been widely used in 3D ranging, 3D reconstruction [1], autonomous navigation [2], robot control [3], virtual reality [4], and other fields. The binocular ranging method has wide application prospects because of its low cost, high precision, and simple deployment. Similar to the development of convolutional neural networks in other visual tasks, although the accuracy of stereo matching networks keeps improving, the depth of the networks is also increasing, followed by rapid growth in the number of network parameters and in computational cost.
CASSANN-v2 [5], designed by the Institute of Semiconductors, Chinese Academy of Sciences, is a high-performance accelerator architecture with on-chip adaptive memory tuning, enabling neural network acceleration for edge computing. With the rapid development of artificial intelligence, more and more intelligent applications need to be deployed on mobile, embedded, and edge computing devices. Existing deep learning-based disparity algorithms have made significant strides in the field of stereo matching. Žbontar and LeCun [6] designed a convolutional neural network that estimates disparity by calculating the similarity between image patches. Güney and Geiger [7] developed a network focused on resolving stereo ambiguities that is capable of predicting the depths of textureless, reflective, and transparent surfaces. Pang et al. [8] proposed a two-stage convolutional network in which the first stage uses an improved version of DispNet [9] for disparity estimation and the second stage corrects the preliminary results. Eigen et al. [10] refined depth map estimation by stacking two deep neural networks. Laina et al. [11] studied the mapping between RGB images and depth maps with a supervised deep neural network optimized with the Huber loss. Although capable of high-precision disparity calculation, such methods typically suffer from high power consumption, high computational cost, and insufficient computational efficiency, making them unsuitable for deployment on edge computing devices like CASSANN-v2. Even lightweight networks contain modules that edge computing devices cannot directly execute and that therefore require separate design considerations. Before being ported to edge computing devices, neural networks need to be optimized and improved to meet the requirements of being lightweight, efficient, and low-cost.
In order to solve the above problems, we propose a disparity calculation algorithm based on low-rank approximation and unimodal cost volume filtering. The algorithm is optimized for both computational complexity and network modeling ability, yielding a binocular disparity calculation algorithm suitable for high-performance terminals that realizes a trade-off between accuracy and computational cost. The main contributions of this work are summarized as follows:
  • Following the low-rank approximation principle, the three-dimensional convolutions in the matching cost aggregation stage are replaced by equivalent one- and two-dimensional convolutions, and the way these convolutions act on the output is demonstrated and determined, which greatly reduces the number of network weights.
  • In terms of disparity accuracy, an activation function with pixel-level modeling capability is used to optimize gradient propagation in the disparity computing network after network compression and approximation, improving the performance of the network. In this way, the edge computing device needs only one convolution layer and one max operation to realize an activation function with pixel-wise modeling capability.
  • The matching cost volume is regularized using unimodal cost volume filtering and a confidence estimation network, whose parameters are updated by an independent loss function only during the training stage. This reduces the video memory required at run time and alleviates the problem of the disparity-matching cost distribution being far from the real distribution.

2. Related Works

2.1. Disparity Estimation

In recent decades, stereo matching technology has made significant progress, driven by the evolution from traditional algorithms to deep learning methods. Traditional stereo matching techniques typically involve the following four steps: matching cost computation, cost aggregation, disparity calculation, and disparity refinement [12]. Although effective, these methods are often limited by slow processing speeds and reduced accuracy, which restrict their wider application.
In recent years, the success of convolutional neural networks (CNNs) in various visual tasks, such as object detection and semantic segmentation [13], has encouraged researchers to apply deep learning to the field of stereo matching. For example, MC-CNN [14] was the first to apply CNNs to the computation of matching costs, calculating the similarity between stereo image pairs by extracting abstract features. This was followed by Mayer et al.’s introduction of DispNetC [9], an end-to-end stereo matching network that directly produces disparity maps from stereo images, marking a new direction in stereo matching research.
Additionally, GCNet (Geometry and Context Network) [15] introduces a novel approach by forming a cost volume from cascaded feature maps under different disparities, clearly representing the geometric features of the image. Then, 3D convolution is applied to this cost volume, extracting features across the dimensions of height, width, and disparity, which is crucial for learning environmental information and improving stereo matching results. PSMNet [16] expands on these ideas by integrating a deep residual network (ResNet) [17] for feature extraction and employing a Spatial Pyramid Pooling (SPP) structure [18]. This design enables the network to capture both global and local information at various scales, forming a comprehensive matching cost volume that significantly enhances the accuracy of disparity estimation.
The latest algorithms, like RAFT-Stereo [19] and HitNet [20], continue to advance the field. RAFT-Stereo utilizes a recursive multi-scale approach, progressively refining the disparity map across different iterations, enhancing both accuracy and convergence speed. Meanwhile, HitNet, through its hierarchical iterative tile network, focuses on efficiently resolving disparity issues within small regional blocks, achieving high precision and efficiency in complex scenes.
These developments demonstrate that the integration of deep learning with stereo matching algorithms not only significantly improves speed and accuracy but also expands their potential applications in areas such as autonomous driving, robotic navigation, and augmented reality. With ongoing technological advancements and the introduction of new datasets, the future of stereo matching technology looks increasingly widespread and efficient.

2.2. Three-Dimensional Convolution and Its Optimization

The application of 3D convolution for the extraction of spatio-temporal features from video data was originally proposed by Ji et al. [21]. This pioneering approach laid the groundwork for subsequent innovations in 3D convolutional neural networks (CNNs), such as the C3D model developed by Tran et al. [22] for human action recognition. This model has since become a benchmark within the field, demonstrating the significant advantages of 3D convolutional techniques over traditional 2D convolution in analyzing complex video data.
Building on this foundational work, Hara et al. [23] further advanced the application of 3D convolution by integrating it into the ResNet101 [17] architecture. This integration significantly enhanced the model’s capabilities, allowing for more detailed and precise temporal and spatial analyses of video sequences. Such advancements underscore the superiority of 3D convolution in capturing the dynamic and intricate patterns of movement and behavior within video data.
Despite the efficacy of 3D convolution in handling video data, its adoption in real-world applications has been limited by the high computational cost associated with its algorithmic complexity. This complexity introduces a significant overhead, making it challenging to deploy these models in environments where processing efficiency is paramount [24].
In response to these computational challenges, the research community has explored various strategies to mitigate the intensive demands of 3D convolution. One notable approach has been the development of hybrid models that amalgamate the spatial analytical power of 2D convolutions with the temporal depth of 1D convolutions. These hybrid models aim to optimize processing requirements while maintaining robust feature extraction capabilities, thus addressing the dual needs of efficiency and effectiveness.
Among the innovative solutions in this area, the Pseudo-3D Residual Network (P3D ResNet) developed by Qiu et al. [25] represents a significant breakthrough. This architecture cleverly decomposes traditional 3D convolutions into a combination of 2D and 1D operations, thereby reducing both computational load and parameter count. While the P3D model has successfully reduced computational demands, its impact on processing speeds has been constrained by limited compression efficiency.
The ongoing research and development in 3D convolutional technologies reflect a concerted effort to strike an optimal balance between the depth of feature extraction and computational efficiency. This balance is crucial for the broader adoption of these technologies in real-time applications such as video surveillance, interactive media, and sports analytics. With continued advancements, 3D convolutional networks are expected to become even more effective and feasible for widespread use, paving the way for smarter and more responsive video analysis systems.

3. Methods

The design of a binocular disparity calculation algorithm usually includes the following four steps: feature extraction, matching cost calculation and aggregation, disparity calculation, and disparity refinement. In traditional methods, each module needs to be designed and trained independently. In contrast, the algorithm proposed in this study is an end-to-end binocular disparity calculation algorithm based on a deep neural network: gradients can be propagated between the modules, and the network weights are updated by a supervised training loss function. The framework of the proposed algorithm is shown in Figure 1; the entire process is carried out by the backbone network shown in the unboxed middle row of Figure 1 (outside of the yellow background box above and the gray dashed box below). Below, we first introduce the overall structure and then detail the advantages of our algorithm.

3.1. Overview

Our network takes the left and right stereo images as input to a weight-sharing CNN that computes their feature maps. These feature maps are then fed into a depth-separable SPP (spatial pyramid pooling) module for further feature extraction. The first part of the SPP module processes the input in sub-regions of varying sizes, and the results are concatenated with the output of a convolution layer designed for feature fusion, producing merged features for both the left and right images. These features are then used to construct a four-dimensional matching cost volume. The cost volume is processed by a pseudo 3D module designed for edge computing (EC-P3D) and a transposed EC-P3D module to complete cost aggregation and disparity regression. During training, the loss between the network output and the ground-truth labels (the smooth L1 regression loss defined in Section 3.6) is continuously computed to optimize the network parameters, ultimately yielding the predicted disparity.
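As an illustration of the cost-volume construction step only, the following is a minimal PyTorch sketch; the concatenation-based construction, the 1/4-resolution features, and the function name are assumptions in the style of PSMNet-like networks rather than the exact implementation used here.

import torch

def build_cost_volume(feat_l, feat_r, max_disp):
    # feat_l, feat_r: [B, C, H, W] feature maps from the shared-weight extractor
    B, C, H, W = feat_l.shape
    D = max_disp // 4  # assumes features are at 1/4 of the input resolution
    cost = feat_l.new_zeros(B, 2 * C, D, H, W)
    for d in range(D):
        if d == 0:
            cost[:, :C, d, :, :] = feat_l
            cost[:, C:, d, :, :] = feat_r
        else:
            cost[:, :C, d, :, d:] = feat_l[:, :, :, d:]
            cost[:, C:, d, :, d:] = feat_r[:, :, :, :-d]
    return cost  # 4D volume [B, 2C, D, H, W] fed to the EC-P3D aggregation stage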

3.2. Pseudo 3D Convolution

A four-dimensional feature volume of size [C, D, H, W] is obtained after the matching cost calculation. In order to fuse the contextual relationship between the spatial domain and the disparity domain, 3D convolution with a nonlinear activation function (ReLU) and batch normalization is needed to learn features in this dimensionality. Compared with 2D convolution applied to 3D feature volumes of size [C, H, W], the number of parameters and the computational complexity of 3D convolution applied to 4D feature volumes of size [C, D, H, W] increase significantly, occupying a large amount of computational resources and making training more difficult. Therefore, in order to reduce the number of parameters and the computational complexity of 3D convolution, this study performs matching cost aggregation based on a low-rank approximation of the 3D convolution. Table 1 compares the number of parameters and the computational complexity before and after the low-rank approximation of 3D convolution.
For a video clip, we can abstract it as a tensor of size C × L × H × W, where C, L, H, and W represent the channel number, frame number, frame height, and frame width, respectively. The most direct way to extract features from this tensor is 3D convolution, which can model spatial information and extract temporal information between frames. Suppose we have a 3 × 3 × 3 convolution kernel; it can be naturally decomposed into a 1 × 3 × 3 2D kernel in the spatial domain and a 3 × 1 × 1 1D kernel in the temporal domain. Decomposing the 3D convolution in this way can greatly reduce the size of the model.
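Using the symbols of Table 1, the saving for a 3 × 3 × 3 kernel with D input and D' output channels works out as D D' F_T F_H F_W = 27 D D' parameters for the standard 3D convolution versus D D' (F_H F_W + F_T) = (9 + 3) D D' = 12 D D' for the spatial-plus-temporal decomposition, i.e., roughly a 2.25-fold reduction per layer; the computational complexity shrinks by the same factor, since both terms scale with T H W.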
Based on the above ideas, the P3D block extends the 2D residual unit in ResNet and realizes spatio-temporal encoding of video in a ResNet-like structure. A residual network consists of a large number of residual units, which can usually be expressed as follows:
x_{t+1} = h(x_t) + F(x_t)
where x_t and x_{t+1} represent the input and output of the residual unit, respectively; h(x_t) = x_t denotes the identity mapping; and F is a nonlinear residual function. With the shortcut, ResNet no longer learns a nonlinear mapping directly from input to output but instead learns the residual between the output and the input.
The design idea of the P3D block is to expand all the convolution kernels in the above 2D residual unit into 3D and then decompose each 3D kernel into a 1 × 3 × 3 2D spatial convolution and a 3 × 1 × 1 1D temporal convolution. Since the original 3D convolution is decomposed into two filters, it is necessary to consider whether the result of the spatial convolution is fed directly into the temporal convolution or whether the two are applied in parallel rather than sequentially. Another point to consider is whether the outputs of both kernels directly affect the output of the residual unit. Based on these two design questions, P3D defines three block structures [25], namely P3D-A, P3D-B, and P3D-C. A schematic of the three structures is shown in Figure 2.
P3D-A: As shown in Figure 2a, P3D-A first performs a 2D spatial convolution followed by a 1D temporal convolution. These two convolutions are directly connected, with only the 1D temporal convolution and the final output connected. The relationship between input x t and output x t + 1 can be represented as follows:
(I + T · S) · x_t := x_t + T(S(x_t)) = x_{t+1}
where I indicates an identity mapping.
P3D-B: As shown in Figure 2b, P3D-B performs 2D spatial convolution and 1D temporal convolution simultaneously, the results of which are added together. The relationship between input x t and output x t + 1 can be represented as follows:
(I + S + T) · x_t := x_t + S(x_t) + T(x_t) = x_{t+1}
P3D-C: As shown in Figure 2c, P3D-C is a compromise between P3D-A and P3D-B. It first performs a 2D spatial convolution, followed by a shortcut branch that combines the results of the 2D spatial convolution and 1D temporal convolution. The relationship between input x t and output x t + 1 can be represented as follows:
(I + S + T · S) · x_t := x_t + S(x_t) + T(S(x_t)) = x_{t+1}
In the process of stereo disparity calculation, the spatial features from the left and right images need to be matched to produce disparity outputs. The choice among the three structures does not directly determine the results, but experimental evidence shows that P3D-A and P3D-C are both capable of performing this task. However, P3D-C involves too many connections, making it unsuitable for on-chip data transmission, so P3D-A is adopted. This choice resolves both the chip's inability to support 3D convolutions and the problem of excessive parameter volume, and it aligns with the characteristics of disparity calculation algorithms.
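The following PyTorch sketch illustrates the P3D-A ordering (a 1 × 3 × 3 spatial convolution followed by a 3 × 1 × 1 convolution along the disparity dimension, with an identity shortcut). The equal input/output channel count and the BN/ReLU placement are illustrative assumptions, not the exact configuration of the network described here.

import torch
import torch.nn as nn

class P3DABlock(nn.Module):
    """Residual block with a 1x3x3 spatial conv followed by a 3x1x1 conv over D."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), bias=False)
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                             # x: [B, C, D, H, W] cost volume
        out = self.relu(self.bn1(self.spatial(x)))    # S(x_t)
        out = self.bn2(self.temporal(out))            # T(S(x_t))
        return self.relu(x + out)                     # x_t + T(S(x_t)) = x_{t+1}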

3.3. WReLU Activation Function with Pixel-Level Modeling Capability

In this study, the WReLU activation function [26] is used to avoid information loss and to ensure pixel-level modeling capability. WReLU extends the ReLU activation function by adding a residual spatial condition at negligible additional computational cost. It can generate complex gradients and achieve a better training process without vanishing gradients, which gives it an advantage in terms of performance. For each input pixel x_{i,j}, WReLU defines an operation window of size k × k centered at x_{i,j} and first computes an inner product in which the pixels in the window are viewed as a vector p_{i,j}:
T(x_{i,j}) = BN(u · p_{i,j})
where u is a channel-wise parameter containing the k × k learnable weights of the window (shared within each channel to reduce the number of network parameters) and BN denotes the batch normalization operation. WReLU is then defined as follows:
W(x_{i,j}) = max(x_{i,j}, x_{i,j} + T(x_{i,j}))
where T(x_{i,j}) represents the residual between x and the function fitted by the network, and the max(·) operation makes the activation function nonlinear and ensures pixel-level modeling capability. A schematic and the expression of our activation function are shown in Figure 3.
WReLU improves the generalization performance of the model by adding only a small number of parameters, and the window it introduces captures spatial dependencies for a better representation, which is why it performs well in practice.
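Because the windowed inner product with a channel-wise k × k parameter is exactly a depthwise convolution, WReLU can be sketched in PyTorch as follows (a minimal sketch assuming k = 3 and a 2D feature map; the class name and defaults are illustrative):

import torch
import torch.nn as nn

class WReLU(nn.Module):
    """W(x) = max(x, x + T(x)), where T(x) = BN(depthwise k x k conv over x)."""
    def __init__(self, channels, k=3):
        super().__init__()
        # Channel-wise (depthwise) k x k window: one k*k weight vector per channel.
        self.window = nn.Conv2d(channels, channels, kernel_size=k,
                                padding=k // 2, groups=channels, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):                  # x: [B, C, H, W]
        t = self.bn(self.window(x))        # residual spatial condition T(x)
        return torch.max(x, x + t)         # pixel-wise nonlinearity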

3.4. EC-P3D Module and Transposed EC-P3D Module

A lightweight stereo matching module, EC-P3D, based on pseudo 3D convolution and low-rank approximation, is proposed for high-performance terminals. This module enables the deployment of disparity calculation algorithms on AI chips that cannot directly implement 3D convolution. Replacing the bulky, computationally intensive, and hard-to-deploy 3D and transposed 3D convolutions with this new module reduces the number of network parameters while maintaining network accuracy.
EC-P3D consists of two main parts, as shown in Figure 4. The first part is inspired by P3D and based on low-rank approximation, using 1D and 2D convolutions to approximate 3D convolution, but there are significant differences between them. Unlike P3D-A, we use Conv1 to compress the channels of the input feature map so as to avoid occupying too much video memory. The compressed feature map is then combined with the output of the subsequent pseudo 3D module to obtain the final result. In the pseudo 3D part, we first apply a 3 × 3 2D convolution to each compressed H × W feature map along dimension D, then stack the per-slice results along dimension D to generate a new feature map and change its data indexing. A 3 × 1 × 1 convolution kernel (size 3 along dimension D) is then used to perform 1D convolution on the newly generated feature map, and the result is reshaped back to H × W, achieving the equivalent of 3D convolution.
Due to the equivalency of the process, the new convolution kernel loses a large amount of correlation information inside the 3D tensor during feature extraction (that is, the correlation of pixels at different corresponding positions in each feature map in dimension D within the 3 × 3 × 3 element area), so we use WReLU to restore this correlation pixel by pixel. Meanwhile, in order to further reduce the number of parameters in the model and improve the realizability of chip deployment, basic 2D blocks of feature extraction are replaced with depth-wise convolution similar to MobileNet.
As up-sampling cannot be completed directly by the CNN chip and requires auxiliary FPGA design, we propose a transposed EC-P3D module to realize up-sampling of the feature map. The transposed EC-P3D module is shown in Figure 5. Since transposed 3D convolution actually implements the up-sampling process, the transposed EC-P3D module does not require a shortcut. As in the EC-P3D module, Conv2 and Conv3 are composed of a 1 × 3 × 3 2D convolution kernel and a 3 × 1 × 1 1D convolution kernel, which raise the resolution in the H and W dimensions and in the D dimension, respectively. The final results are obtained through BN and WReLU.
For the data storage of edge computing devices, the 1 × 3 × 3 and 3 × 1 × 1 kernels are equivalent to a 3 × 3 2D convolution and a 1D convolution of size 3, which greatly reduces the difficulty of deployment and improves the utilization of the chip.
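The sketch below shows how the core pseudo-3D operation inside EC-P3D can be realized with only 2D and 1D convolutions on a device without native 3D support; the channel-compressing Conv1, the shortcut, and WReLU are omitted, and folding the disparity dimension into the batch (and H × W into the 1D sequence) is an illustrative assumption about the data layout.

import torch
import torch.nn as nn

class PseudoConv3d(nn.Module):
    """Approximates a 3x3x3 convolution with a 2D 3x3 conv over each disparity
    slice followed by a 1D conv of size 3 along the disparity dimension."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv2d = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.conv1d = nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1, bias=False)

    def forward(self, x):                       # x: [B, C, D, H, W]
        B, C, D, H, W = x.shape
        # 2D spatial conv applied slice by slice: fold D into the batch dimension.
        y = self.conv2d(x.permute(0, 2, 1, 3, 4).reshape(B * D, C, H, W))
        Co = y.shape[1]
        # 1D conv along D: fold H*W into the batch dimension.
        y = y.reshape(B, D, Co, H * W).permute(0, 3, 2, 1).reshape(B * H * W, Co, D)
        y = self.conv1d(y)
        # Restore the [B, C', D, H, W] layout expected by the next block.
        return y.reshape(B, H * W, Co, D).permute(0, 2, 3, 1).reshape(B, Co, D, H, W)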

3.5. Adaptive Unimodal Cost Volume Filtering

Matching cost volumes indirectly supervised by disparity regression tend to be highly ambiguous; an infinite number of matching cost distributions can produce the same disparity regression results. Specifically, the three types of matching probability distributions illustrated in Figure 6a–c can all yield the same disparity values. However, only distributions (a) and (b)—where the peak is sharp and high at the true disparity, indicating low uncertainty—are considered reasonable.
Consider a typical pixel's matching cost distribution, in which a bimodal distribution can also yield the correct disparity. This flexibility of the matching cost, which lacks direct supervision, means that incorrectly learned cost volumes can still approximate the true disparities, leading to severe overfitting and reduced network accuracy. Therefore, it is necessary to regularize the matching cost based on its unimodal characteristics so that the cost distribution aligns more closely with the true distribution. We propose unimodal cost volume filtering and a confidence estimation network to regularize the matching cost volume.
In this study, we filter the matching cost volume with a unimodal distribution that peaks at the real disparity and add constraints directly to the matching cost through an additional regularization network. In addition, the network estimates the variance of the unimodal distribution of each pixel and explicitly models the uncertainty of matching in different environments. In order to avoid excessive video memory consumption during network inference, we design this as a plug-and-play module that supervises the matching cost only during training; at inference time, this part of the network can be removed and only the backbone weights optimized by the module are used, reducing the video memory required during the inference stage.
Given the true disparity d_{gt}, the unimodal distribution is defined as follows:
P(d) = Softmax(-|d - d_{gt}| / σ) = exp(c_d) / Σ_{d'=0}^{D-1} exp(c_{d'})
c_d = -|d - d_{gt}| / σ
where σ > 0 is the variance, which controls the sharpness of the peaks around the true disparity.
However, the matching cost volume constructed from P(d) in this way and used as the ground-truth label has the same peak sharpness (i.e., variance) for every pixel, which cannot reflect the differences between the distributions to which different pixels belong. In order to construct a more reasonable ground-truth label for the matching cost volume, a confidence estimation network f_p is added to the algorithm to adaptively predict the variance σ_p of the distribution of each pixel:
σ_p = s(1 - f_p) + ε
where s ≥ 0 is a scale factor reflecting the sensitivity of σ_p to changes in the confidence f_p, and ε > 0 defines the lower bound of σ_p, avoiding division by zero. Thus, σ_p ∈ [ε, s + ε]. Untextured pixels and occluded pixels have larger σ_p values, because untextured pixels tend to have multiple matches, while occluded pixels have no correct match. The pixel-wise adaptively estimated σ_p modifies the ground-truth cost volume defined above to generate the standard cost volume used as the supervised training label, namely
P(d) = Softmax(-|d - d_{gt}| / σ_p).
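A minimal sketch of how this adaptive supervision target can be generated from the true disparity and the estimated confidence f_p; the function name and the default values of s and ε are illustrative assumptions, and the masking of pixels without ground truth is omitted.

import torch
import torch.nn.functional as F

def unimodal_gt_volume(d_gt, confidence, max_disp, s=1.0, eps=0.1):
    # d_gt:       [B, H, W] ground-truth disparity
    # confidence: [B, H, W] output f_p of the confidence estimation network, in [0, 1]
    sigma = s * (1.0 - confidence) + eps                            # sigma_p in [eps, s + eps]
    d = torch.arange(max_disp, device=d_gt.device).view(1, max_disp, 1, 1).float()
    cost = -torch.abs(d - d_gt.unsqueeze(1)) / sigma.unsqueeze(1)   # -|d - d_gt| / sigma_p
    return F.softmax(cost, dim=1)                                   # P(d): [B, D, H, W]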
Based on the above discussion, for each pixel position p in the matching cost volume there is a matching cost distribution P̂_p(d) estimated by the network and a ground-truth label P_p(d). The loss between them could be defined by cross-entropy, but the serious sample imbalance problem then needs to be solved [6]. A stereo focal loss is therefore used in the binocular disparity calculation to focus the loss on positive disparity samples and to prevent the total loss from being dominated by negative disparity samples.
L_SF = (1 / |P|) Σ_{p∈P} Σ_{d=0}^{D-1} (1 - P_p(d))^{-α} · (-P_p(d) · log P̂_p(d))
where α ≥ 0 is a parameter that controls the degree of focusing. When α = 0, the stereo focal loss degenerates into the cross-entropy loss; when α > 0, more weight is proportionally assigned to the positive disparity samples according to P_p(d). Simple negative disparity samples are thus explicitly suppressed with considerably smaller weights, and positive disparity samples only have to compete with a few difficult samples.
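A minimal PyTorch sketch of this stereo focal loss; the function name, the default α, and the clamping used for numerical stability are illustrative assumptions.

import torch

def stereo_focal_loss(p_hat, p_gt, alpha=5.0, eps=1e-8):
    # p_hat: [B, D, H, W] matching probability estimated by the network (softmax over D)
    # p_gt:  [B, D, H, W] unimodal ground-truth distribution (see unimodal_gt_volume)
    weight = (1.0 - p_gt).clamp(min=eps) ** (-alpha)       # emphasize positive samples
    ce = -(p_gt * torch.log(p_hat.clamp(min=eps)))         # per-disparity cross-entropy
    return (weight * ce).sum(dim=1).mean()                 # sum over D, mean over pixels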

3.6. Loss Function of Multi-Module Fusion Training

The disparity computing network discussed in this section uses two loss functions. The first is a regression loss defined between the predicted disparity map and the true disparity map; it is divided into three stages, as shown by the structure within the gray dashed box in Figure 1, and the output disparity maps of all stages are supervised, so the entire network is supervised. The second is the stereo focal loss, defined in a supervised way between the generated ground-truth matching cost and the matching cost produced by the network.
For the loss between the predicted disparity map and the real disparity map, the smooth_{L1} loss of stage k is defined over the true disparity d_p of each pixel p, namely
L_k = (1 / |P|) Σ_{p∈P} smooth_{L1}(d_p - d̂_p)
smooth_{L1}(x) = 0.5 x^2 if |x| < 1, and |x| - 0.5 otherwise,
where d̂_p is the predicted disparity value and P represents the set of pixels with true disparity labels. The smooth_{L1} loss is insensitive to outliers and remains robust to outliers in the disparity map (e.g., noise). For each stage, the smooth_{L1} loss measures the error between the predicted disparity map and the true disparity map. The entire regression loss is the weighted sum of the losses of all stages, namely
L_reg = Σ_{k=1}^{3} λ_k · L_k,
where λ_k is the weight of the regression loss at stage k; the weights of the three stages are 0.25, 0.5, and 1.0, respectively.
For the stereo focal loss, in order to push more pixels toward high confidence values, it is necessary to add L_conf as a regularization term, as follows:
L_conf = (1 / |P|) Σ_{p∈P} (-log f_p)
In summary, the loss function of the binocular disparity calculation algorithm based on low-rank approximation and unimodal cost volume filtering is defined as follows:
L_all = λ_reg L_reg + L_SF + λ_conf L_conf,
where λ_reg and λ_conf are hyperparameters used to balance the weights of the loss terms. L_reg supervises the disparity training, and L_SF supervises the cost volume.
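Putting the pieces together, a minimal sketch of the combined loss could look as follows; the stage weights come from the text above, while the default λ_reg, λ_conf, and the helper names are illustrative assumptions (stereo_focal_loss refers to the sketch in Section 3.5).

import torch
import torch.nn.functional as F

def total_loss(pred_disps, d_gt, valid_mask, p_hat, p_gt, confidence,
               stage_weights=(0.25, 0.5, 1.0), lambda_reg=1.0, lambda_conf=1.0):
    # pred_disps: list of per-stage predicted disparity maps, each [B, H, W]
    # valid_mask: [B, H, W] bool mask of pixels with ground-truth disparity
    l_reg = sum(w * F.smooth_l1_loss(d[valid_mask], d_gt[valid_mask])
                for w, d in zip(stage_weights, pred_disps))
    l_sf = stereo_focal_loss(p_hat, p_gt)              # see the sketch in Section 3.5
    l_conf = -torch.log(confidence[valid_mask].clamp(min=1e-8)).mean()
    return lambda_reg * l_reg + l_sf + lambda_conf * l_conf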

4. Experiments

In order to verify the performance of the proposed disparity calculation algorithm based on low-rank approximation and unimodal cost volume filtering, comparative experiments were designed on public datasets. First, the two datasets used (SceneFlow [9] and KITTI 2015 [27]) are introduced. Second, the implementation details and training strategy of the network are described. Finally, ablation experiments on different network structures test the influence of the network structure and parameter settings on the results.

4.1. Datasets

The datasets used in this study are the large SceneFlow and KITTI 2015 datasets, which contain images with ground-truth disparity values; 80% of the samples are randomly selected as the training set and the remaining 20% as the test set. SceneFlow is a synthetic dataset whose training set contains 168,357 stereo images and whose test set contains 19,854 stereo images used to evaluate model performance during training. KITTI 2015 is a dataset collected from the real world with a maximum disparity of 192, a training set containing 400 images, and a validation set containing 800 images without ground-truth labels. SceneFlow is used for pre-training and testing, and the model is then fine-tuned and tested on KITTI 2015.

4.2. Training Details

The EC-P3D module proposed in this study is implemented with the PyTorch deep learning framework under Ubuntu. The model is trained end-to-end on four NVIDIA TITAN Xp GPUs with 11 GB of video memory each, using the Adam optimizer with a batch size of 4. For all datasets, the training images are cropped to 512 × 256, the RGB values of all images are normalized to the range [-1, 1], and the maximum disparity D_max is set to 192. For the SceneFlow dataset, 10 epochs are trained at a fixed learning rate of 0.001. For the KITTI 2015 dataset, the model pre-trained on SceneFlow is further optimized for 300 epochs, with a learning rate of 0.001 for the first 200 epochs and 0.0001 for the last 100 epochs.
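The schedule above can be expressed with a standard optimizer and learning-rate scheduler, as in the sketch below; model, train_loader, and compute_loss are placeholders for the actual pipeline and are assumptions, not part of the released code.

import torch

# Adam optimizer and the two-stage learning-rate schedule described above.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[200], gamma=0.1)

for epoch in range(300):                      # KITTI 2015 fine-tuning
    for left, right, d_gt in train_loader:    # 512 x 256 crops, RGB normalized to [-1, 1]
        optimizer.zero_grad()
        loss = compute_loss(model(left, right), d_gt)
        loss.backward()
        optimizer.step()
    scheduler.step()                          # lr: 0.001 -> 0.0001 after epoch 200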

4.3. Metrics

Since we use multiple depth estimation datasets and different datasets are evaluated with different metrics, we list the metrics used below.
End-point error (EPE): The end-point error is used for the evaluation of the SceneFlow dataset. Formally, the difference between the result and the ground truth can be written as EPE(d*, d̂) = ||d* - d̂||_2, where d* is the network output and d̂ is the ground truth.
The percentage of erroneous pixels: The percentage of erroneous pixels is used for the evaluation of the KITTI 2012 and KITTI 2015 datasets. Specifically, a pixel is considered erroneous when its disparity error is greater than t pixels; the percentage of erroneous pixels is then calculated over the non-occluded area (Out-NOC) and the total area (Out-All). For KITTI 2012, t ∈ {2, 3, 4, 5}. For KITTI 2015, a pixel is considered erroneous when its disparity error is greater than three pixels and greater than 5% of its true disparity.
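For reference, the two metrics can be computed as in the following sketch; the function names are illustrative, and the 3 px / 5% rule follows the KITTI 2015 convention described above.

import torch

def epe(d_pred, d_gt, mask):
    # Mean absolute end-point error over valid pixels.
    return (d_pred[mask] - d_gt[mask]).abs().mean()

def three_pixel_error(d_pred, d_gt, mask):
    # A pixel is counted as erroneous if its error exceeds both 3 px
    # and 5% of the true disparity (KITTI 2015 convention).
    err = (d_pred[mask] - d_gt[mask]).abs()
    bad = (err > 3.0) & (err > 0.05 * d_gt[mask])
    return bad.float().mean() * 100.0          # percentage of erroneous pixels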

4.4. Results and Evaluation

In the comparison experiments, the performance of the proposed algorithm is first compared quantitatively and qualitatively with other dense disparity computing algorithms. Then, ablation experiments on the three proposed improvements, namely the WReLU activation function, the EC-P3D adaptation module, and matching cost regularization, are conducted to test the influence of the network structure and parameter settings on the results. Finally, the video memory occupancy of the network is tested.

4.4.1. Qualitative Analysis and Comparison of Algorithm Performance

For pre-training on the SceneFlow dataset, Figure 7 shows a comparison between the proposed algorithm and other deep learning-based algorithms, where the first and second rows show the input left and right stereo image pairs, the third row shows the network predictions, the fourth row shows the true disparity, and the last row is the error heat map. The cooler the color in the heat map, the lower the error; the warmer the color, the higher the error. The heat map makes the errors clearly visible, which is why the occluded areas appear warmer in the error heat map.
For fine tuning on the KITTI dataset, Figure 8 shows a comparison of the algorithm proposed in this study with other deep learning-based algorithms in real-world scenarios. According to the analysis of the experimental results presented in Figure 7 and Figure 8, the following conclusions can be drawn:
(1)
In the error heat map, the overall color of the algorithm proposed in this study is cooler, so its overall accuracy is higher than that of the other compared algorithms.
(2)
The output disparity map is dense, and the disparity changes continuously in semantically related regions. Although the algorithm is compressed and approximated by low-rank decomposition to suit the terminal, the quality of the disparity output is still guaranteed.
(3)
Due to the regularization operation on the matching cost volume, the algorithm proposed in this study achieves sharper and clearer boundaries in the edge areas compared to other algorithms. As shown in the black box in Figure 9, our algorithm demonstrates superior capabilities in representing fine structures and edges over other networks. The output of the confidence network is shown in Figure 10 and Figure 11, which not only represent the reliability of the disparity predicted by the network for these pixels but also objectively reflect the possibility of the real scene point corresponding to the pixel being in the edge, occlusion, or subtle structure.

4.4.2. Quantitative Analysis and Comparison of Algorithm Performance

Table 2 shows the quantitative results of the proposed algorithm for the absolute error (EPE) and the three-pixel error on the SceneFlow and KITTI datasets, together with the results of other algorithms under the same evaluation criteria. When testing the algorithm, each metric is the average over 10 repeated experiments. The running time is measured on the experimental platform used in this study; since this metric depends directly on the performance of the CPU, it serves only as a relative comparison between algorithms and as a reference for other platforms.
By analyzing the data in Table 2, it can be seen that the algorithm proposed in this study is significantly superior to other methods in terms of the end-point error (EPE) and the three-pixel error and has clear advantages in the number of parameters and running time compared with high-precision disparity calculation networks such as PSMNet and GC-Net. Although the running time on the current platform shows no obvious advantage over the lightweight algorithm, acceleration on newer terminal accelerators offers the possibility of real-time operation in the future.

4.4.3. Ablation Study

In order to test the contributions of the proposed sub-modules of network compression, WReLU, and cost volume regularization to the overall binocular disparity calculation algorithm, ablation experiments were designed. Starting from the basic structures of PSMNet and AnyNet, two classical high-precision and lightweight disparity calculation networks, the modules were added one by one in the proposed order, and the network was evaluated after each addition. The experimental results are shown in Table 3 and Table 4. Both network compression and the EC-P3D module significantly reduce the number of parameters and the running time of the network, providing a lightweight network and hardware adaptation with little loss of accuracy. WReLU and cost volume regularization greatly improve the performance of the network, with considerable accuracy gains at an almost unchanged computational cost, meeting the application requirements of high-performance terminals.
We conducted a performance analysis of the WReLU activation function on foundational vision models including SqueezeNet 1.0, SqueezeNet 1.1, SqNxt-23, MobileNetV2, and ShuffleNetV2 × 0.5. Research on and improvements of these models can be extended to other visual tasks, helping various applications achieve better network accuracy. Therefore, this section conducts tests on these foundational models to verify the versatility of the WReLU activation function. The tests conducted across these five visual classification base models all achieve excellent performance, as shown in Table 5. The WReLU activation function significantly enhances the networks' top-1 and top-5 accuracy rates, with almost no change in the number of parameters.

5. Conclusions

For application scenarios involving high-performance edge computing devices with high computing power, this paper studies algorithm performance optimization under hardware resource adaptation and proposes a disparity calculation algorithm based on low-rank approximation and unimodal cost volume filtering. In the matching cost aggregation part, an EC-P3D network structure is proposed in which three-dimensional convolution is replaced by equivalent two-dimensional and one-dimensional convolutions using low-rank approximation, greatly reducing the number of network weights. In terms of disparity accuracy, a WReLU activation function with pixel-level modeling ability is adopted, which avoids information loss during backpropagation and enhances the expressive power of the network; on mobile chips, this activation function requires only one convolution layer and one comparison operation. Unimodal cost volume filtering and a confidence estimation network are used to regularize the matching cost volume, which alleviates the problem of the disparity-matching cost distribution being far from the true distribution. The unimodal cost volume filtering and confidence estimation network form a plug-and-play module that can be combined with other recent cost-volume-based methods to further enhance depth estimation performance. Future work is expected to focus on hardware adaptation and transformer-based algorithm optimization for depth estimation, which will open up broader application prospects.
Compared with PSMNet, a typical deep learning disparity calculation network, the proposed disparity calculation algorithm based on low-rank approximation and unimodal cost volume filtering reduces the absolute error by 38.3%, the three-pixel error to 1.41%, and the number of parameters by 67.3%. Its calculation accuracy is better than that of the other compared algorithms, it is easier to deploy, its number of parameters and computational cost are greatly reduced, and it has better practicability.

Author Contributions

Conceptualization, Y.L. and Y.S.; Data curation, Y.S.; Formal analysis, H.Z. and Y.S.; Funding acquisition, Y.L.; Investigation, Y.C. and Y.S.; Methodology, Y.L.; Project administration, Y.L.; Resources, F.L.; Software, Y.C.; Supervision, F.L.; Validation, Y.C. and H.Z.; Visualization, Y.C.; Writing—original draft, Y.L.; Writing—review and editing, Y.C. and H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Startup Foundation for Introducing Talent of NUIST (No. 2023r124) and the Enterprise Cooperation Project (No. 2023h852).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets we used in this study are the KITTI and SceneFlow datasets, and they are openly available at http://www.cvlibs.net/datasets/kitti/ (accessed on 7 August 2022) and https://lmb.informatik.uni-freiburg.de/resources/datasets/SceneFlowDatasets.en.html (accessed on 7 August 2022), respectively.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Hu, K.; Wang, T.; Shen, C.; Weng, C.; Zhou, F.; Xia, M.; Weng, L. Overview of underwater 3D reconstruction technology based on optical images. J. Mar. Sci. Eng. 2023, 11, 949.
2. Janai, J.; Güney, F.; Behl, A.; Geiger, A. Computer vision for autonomous vehicles: Problems, datasets and state of the art. Found. Trends Comput. Graph. Vis. 2020, 12, 1–308.
3. Schmid, K.; Tomic, T.; Ruess, F.; Hirschmüller, H.; Suppa, M. Stereo vision based indoor/outdoor navigation for flying robots. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, 3–7 November 2013; pp. 3955–3962.
4. Zenati, N.; Zerhouni, N. Dense stereo matching with application to augmented reality. In Proceedings of the 2007 IEEE International Conference on Signal Processing and Communications, Dubai, United Arab Emirates, 24–27 November 2007; pp. 1503–1506.
5. Liu, F.; Qiao, R.; Chen, G.; Gong, G.; Lu, H. CASSANN-v2: A high-performance CNN accelerator architecture with on-chip memory self-adaptive tuning. IEICE Electron. Express 2022, 19, 20220124.
6. Žbontar, J.; LeCun, Y. Stereo matching by training a convolutional neural network to compare image patches. J. Mach. Learn. Res. 2016, 17, 1–32.
7. Guney, F.; Geiger, A. Displets: Resolving stereo ambiguities using object knowledge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2015; pp. 4165–4175.
8. Pang, J.; Sun, W.; Ren, J.S.; Yang, C.; Yan, Q. Cascade residual learning: A two-stage convolutional neural network for stereo matching. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 887–895.
9. Mayer, N.; Ilg, E.; Hausser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; Brox, T. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4040–4048.
10. Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. Adv. Neural Inf. Process. Syst. 2014, 27, 2366–2374.
11. Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper depth prediction with fully convolutional residual networks. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 239–248.
12. Scharstein, D.; Szeliski, R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 2002, 47, 7–42.
13. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
14. Zbontar, J.; LeCun, Y. Computing the stereo matching cost with a convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1592–1599.
15. Kendall, A.; Martirosyan, H.; Dasgupta, S.; Henry, P.; Kennedy, R.; Bachrach, A.; Bry, A. End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 66–75.
16. Chang, J.R.; Chen, Y.S. Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5410–5418.
17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
18. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
19. Teed, Z.; Deng, J. RAFT: Recurrent all-pairs field transforms for optical flow. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part II; pp. 402–419.
20. Tankovich, V.; Hane, C.; Zhang, Y.; Kowdle, A.; Fanello, S.; Bouaziz, S. HITNet: Hierarchical iterative tile refinement network for real-time stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14362–14372.
21. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 221–231.
22. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497.
23. Hara, K.; Kataoka, H.; Satoh, Y. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6546–6555.
24. Fan, H.; Niu, X.; Liu, Q.; Luk, W. F-C3D: FPGA-based 3-dimensional convolutional neural network. In Proceedings of the 2017 27th International Conference on Field Programmable Logic and Applications (FPL), Ghent, Belgium, 4–8 September 2017; pp. 1–4.
25. Qiu, Z.; Yao, T.; Mei, T. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5533–5541.
26. Liu, Y.; Guo, X.; Tan, K.; Gong, G.; Lu, H. Novel activation function with pixelwise modeling capacity for lightweight neural network design. Concurr. Comput. Pract. Exp. 2021, 35, e6350.
27. Menze, M.; Geiger, A. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3061–3070.
Figure 1. Framework of the proposed algorithm.
Figure 2. Three designs of P3D blocks.
Figure 3. An example of the WReLU activation function (the blue line shows a possible case; the orange dashed line is the function interval that is discarded after WReLU is formed).
Figure 4. EC-P3D module.
Figure 5. Transposed EC-P3D module.
Figure 6. Three types of matching probability distribution, where the blue curve is the disparity probability distribution and the orange dashed line is the predicted disparity value.
Figure 7. Visualization results of the proposed algorithm on the SceneFlow dataset.
Figure 8. Visualization results of the proposed algorithm on the KITTI 2015 dataset.
Figure 9. The ability to represent fine structures and edges.
Figure 10. Confidence network output on the SceneFlow dataset.
Figure 11. Confidence network output on the KITTI 2015 dataset.
Table 1. Comparison of 3D convolution with low-rank approximation.
Name | Parameters | Computational Complexity
Standard 3D convolution | D D' F_T F_H F_W | O(D D' T H W F_T F_H F_W)
2D convolution in the spatial direction | D D' F_H F_W | O(D D' T H W F_H F_W)
1D convolution in the temporal direction | D D' F_T | O(D D' T H W F_T)
Spatial + temporal | D D' (F_H F_W + F_T) | O(D D' T H W (F_H F_W + F_T))
(D and D' denote the numbers of input and output channels; F_T, F_H, and F_W are the kernel sizes in the temporal, height, and width directions; T, H, and W are the input dimensions.)
Table 2. Algorithm performance comparison.
Method | EPE | Three-Pixel Error (%) | Parameters | Running Time
PSMNet | 1.09 | 4.35 | 5.2 M | 0.50 s
AnyNet | 3.19 | 6.20 | 0.04 M | 97.3 ms
DeepPruner | 0.86 | 2.15 | N/A | 182 ms
AANet | 0.87 | 2.55 | N/A | 62 ms
AcfNet | 0.86 | 1.89 | 5.6 M | 0.48 s
GC-Net | 2.51 | 2.87 | 3.5 M | 0.95 s
Proposed method | 0.77 | 1.41 | 1.7 M | 0.48 s
Table 3. Ablation study of PSMNet.
Configuration | EPE | Three-Pixel Error (%) | Parameters | Running Time
PSMNet infrastructure | 1.090 | 4.346 | 5.22 M | 500 ms
Increased network compression | 1.135 | 4.864 | 3.84 M | 452.4 ms
EC-P3D module added | 1.187 | 5.124 | 1.72 M | 476.2 ms
WReLU added | 1.073 | 3.573 | 1.74 M | 483.1 ms
Cost volume regularization added | 0.770 | 1.41 | 1.74 M | 483.1 ms
Table 4. Ablation study of AnyNet.
Configuration | EPE (Stage 0 / Stage 1 / Stage 2) | Three-Pixel Error (%) | Parameters | FPS
AnyNet infrastructure | 5.44 / 4.88 / 4.51 | 7.25 | 34,629 | 88.1
EC-P3D module added | 5.79 / 5.12 / 4.74 | 7.62 | 22,683 | 92.5
WReLU added | 5.11 / 4.63 / 4.11 | 6.85 | 22,683 | 91.6
Table 5. Experimental results of WReLU in a basic visual network.
Network | Method | Top-1 Accuracy | Top-5 Accuracy | Parameters
SqueezeNet 1.0 | Original | 57.50% | 80.30% | 1.25 M
SqueezeNet 1.0 | This Design | 64.55% | 85.09% | 1.25 M
SqueezeNet 1.1 | Original | 57.10% | 80.30% | 1.24 M
SqueezeNet 1.1 | This Design | 64.08% | 84.98% | 1.24 M
SqNxt-23 | Original | 57.80% | 80.90% | 0.72 M
SqNxt-23 | This Design | 65.15% | 86.33% | 0.77 M
MobileNetV2 | Original | 71.88% | 90.29% | 3.51 M
MobileNetV2 | This Design | 73.80% | 91.64% | 3.59 M
ShuffleNetV2 ×0.5 | Original | 58.62% | 81.14% | 1.37 M
ShuffleNetV2 ×0.5 | This Design | 62.16% | 83.45% | 1.37 M

