Depth Estimation Based on 3D Gaussian Splatting Siamese Defocus

Jinchang Zhang*, Ningning Xu*, Hao Zhang, Guoyu Lu

* indicates equal contribution. Jinchang Zhang, Ningning Xu, and Guoyu Lu are with the University of Georgia (guoyulu62@gmail.com). Hao Zhang is with the University of Massachusetts Amherst.

Abstract

Depth estimation is a fundamental task in 3D geometry. While stereo depth estimation can be achieved through triangulation methods, it is not as straightforward for monocular methods, which require the integration of global and local information. The Depth from Defocus (DFD) method utilizes camera lens models and parameters to recover depth information from blurred images and has been proven to perform well. However, these methods rely on All-In-Focus (AIF) images for depth estimation, which are nearly impossible to obtain in real-world applications. To address this issue, we propose a self-supervised framework based on 3D Gaussian splatting and Siamese networks. By learning the blur levels at different focal distances of the same scene in the focal stack, the framework predicts the defocus map and Circle of Confusion (CoC) from a single defocused image, using the defocus map as input to DepthNet for monocular depth estimation. The 3D Gaussian splatting model renders defocused images using the predicted CoC, and the differences between these and the real defocused images provide additional supervision signals for the Siamese Defocus self-supervised network. This framework has been validated on both artificially synthesized and real blurred datasets. Subsequent quantitative and visualization experiments demonstrate that our proposed framework is highly effective as a DFD method.

I Introduction

Depth estimation is crucial for 3D reconstruction and understanding, serving as the foundation for tasks like scene understanding [6], autonomous driving [7], and augmented reality [13]. Its success relies on overcoming challenges related to size, speed, accuracy, and cost. Traditional methods use 3D geometric constraints through techniques like structure-from-motion (SfM) [28, 1], image sequences [38, 2], stereo pairs [8, 9, 23], and structured light [25]. However, monocular SfM faces scale ambiguity, and stereo imaging struggles with calibration and translating disparities into accurate depth. Traditional methods also struggle with defocus blur due to varying focal planes.

Figure 1: An overview of SDNet. We adopt a Siamese network structure with the MPViT model [17] and convolutional layers to enhance defocus map modeling. We use the defocus loss module [29] to learn the relationship between distance and blurriness while training the Siamese network. After training, a single blurred image is used to predict the defocus map for depth inference.

Depth from Defocus (DFD) is an alternative approach to depth recovery that relies on defocus blur. Unlike traditional Structure-from-Motion (SfM), DFD estimates depth by utilizing the geometry of the camera lens and depth variations. Previous research [27] has shown that generating a focal stack of images with different levels of blur and analyzing the blur in each image can provide depth information at various focal distances. Supervised and self-supervised DFD methods based on deep learning have since been developed to estimate depth using focal stacks and All-In-Focus (AIF) images. However, current DFD methods that use AIF images often rely on approximations, treating small-aperture photos as AIF images, which can lead to inaccuracies. Additionally, using focal stacks in real-world scenarios is impractical due to the need for frequent focus adjustments. To overcome these limitations, this paper introduces Siamese-Defocus-Net (SDNet), a neural network capable of simultaneously estimating defocus information and depth from a single image, eliminating the need for multiple images or focus adjustments and simplifying depth estimation in dynamic environments.

To estimate depth from a single defocused image, we begin with the camera lens model and focus on estimating the defocus map, a crucial step in depth recovery. This paper introduces a defocus depth estimation method, trained on focal stacks but designed to estimate depth from a single defocused image during testing. We develop a self-supervised framework combining the Siamese Defocus Network and 3D Gaussian splatting, training both models jointly. The Siamese Defocus Network takes the same image with varying levels of defocus as input and outputs the corresponding defocus map and Circle of Confusion (CoC) for each image. To ensure accurate prediction of defocus maps across different levels of blur, we leverage Siamese networks and train the model using defocus loss. This approach allows the network to effectively capture defocus characteristics from the focal stack, validating its ability to distinguish varying blur levels and improving its sensitivity to defocus features. This training strategy enhances the network’s performance when processing images with different focal lengths, and it can extract and integrate features across multiple scales, improving the accuracy of defocus map generation. To further improve CoC and defocus map predictions, the CoC is fed into the 3D Gaussian splatting model to verify its accuracy. Specifically, we input a series of defocused images into the 3D Gaussian splatting model, combining the CoC predicted by the Siamese Defocus Network with 2D projection images generated by the splatting model. We render defocused images and compute the differences between synthetic and real defocused images, providing additional supervision to the Siamese Defocus Network. This approach explores the connection between defocus characteristics and the camera lens model, enabling effective depth estimation. 
We validated our method across multiple datasets using the point spread function (PSF), rendering images with varying focal lengths and demonstrating the feasibility and effectiveness of our approach. The framework is illustrated in Figure 1.

In summary, the contributions of this paper include: 1. We propose a system that can simultaneously estimate defocus maps and scene depth. 2. We design a self-supervised framework based on Siamese networks and 3D Gaussian splatting, capable of generating defocus maps and Circle of Confusion (CoC) at different focal lengths. 3. We embed 3D Gaussian splatting into the Siamese Defocus Network, using the CoC as input to the 3D Gaussian splatting to calculate the blur reconstruction loss, thereby improving the training of the Siamese Defocus Network. 3D Gaussian splatting generates the initial depth, which, combined with the predicted defocus map, serves as input for the depth estimation network. The depth estimation network uses the defocus map to optimize the initial depth and predict depth residuals.

II Related Work

Monocular depth estimation. Monocular depth estimation aims to reconstruct depth information from a single camera by utilizing scene geometry as a training constraint. [10] employs three consecutive images, using both depth and pose networks to predict depth. Subsequent works [19, 12] have successfully applied monocular depth estimation algorithms to high-resolution images, achieving impressive results. In contrast, stereo depth estimation reconstructs depth using the baseline between two cameras and the disparity map, avoiding issues with scale ambiguities. [30] and [34] calculate 3D and 4D cost volumes to recover depth information, demonstrating efficient memory and computational usage. However, due to significant information loss in the encoded volume, these methods do not achieve optimal accuracy.

Defocus map and depth from defocus. A defocus map quantifies the level of defocus blur or the size of the circle of confusion (CoC) for each pixel in a blurred image. For estimating defocus maps, Depth from defocus (DFD) methods derive depth by measuring image blurriness caused by lens effects. [37] made advancements by applying coded aperture cameras to measure defocus blur. [11] predicts depth by simulating defocus on datasets like KITTI and Make3D, while [20] reconstructs both all-in-focus images and depth using supervised learning. [35] improves depth estimation by incorporating focus volume and differential focus volume into their model, enhancing accuracy. [26] introduces a fully self-supervised framework that estimates depth from a sparse focal stack. These deep learning approaches rely on focal stacks or all-in-focus (AIF) images, using multiple images with varying degrees of blurriness for depth estimation. However, this is impractical in real-world applications. To address these limitations, we leverage prior knowledge from the camera lens model and extract multi-scale features from a single blurred image, enabling depth estimation without requiring multiple images.

III Depth from Defocus Framework

In this section, we first introduce the foundational background knowledge on the camera lens model and the Depth from Defocus (DFD) method, followed by a detailed presentation of our network architecture for defocus and depth estimation. We developed a Siamese network architecture to estimate defocus maps and the Circle of Confusion (CoC) from two differently blurred images of the same scene. 3D Gaussian splatting is integrated to render blurred images based on the predicted CoC, and blur reconstruction loss is applied to optimize the CoC predictions. The final depth is then estimated using the predicted defocus maps and the initial depth information obtained from 3D Gaussian splatting.

III-A Camera Lens Model

Depth from defocus methods are based on the thin-lens model and its geometric properties, as shown in Fig. 2 (a), which explains the origin of blur. When an object is situated on the focal plane, any given point on the object corresponds to a specific point on the image plane. However, when the object is moved away from the focal plane, each point on the object corresponds to a circular region of radius $\sigma$ on the image plane, thereby resulting in blurriness. This phenomenon is referred to as the "Circle of Confusion" (CoC). Based on the Gaussian lens formula and the properties of similar triangles, we can derive the equation:

$$ CoC = A \, \frac{|d - F_d|}{d} \, \frac{f}{F_d - f}, \tag{1} $$
Figure 2: (a) An illustration of the camera Thin-Lens model. Objects on the focal plane (indicated by the orange line) are sharply imaged, while objects off the focal plane appear blurred due to the Circle of Confusion (CoC). (b) The CoC curve derived from the NYUv2 dataset demonstrates the relationship between object depth and the blur radius, where the blur radius initially decreases as depth increases and then enlarges.

where $A$ is the diameter of the lens, $F_d$ is the focus distance, $f$ is the focal length, and $d$ is the distance of the object to the lens (depth). In general, the unit of depth is meters, so we must use the CMOS pixel size $p$ to convert the unit of CoC to meters as well. For convenience, we incorporate the camera parameter f-number $N$, defined as $f/A$, into the formula, and $\sigma$ is the radius of CoC. This gives the following equation:

$$ \sigma = \frac{|d - F_d|}{d} \, \frac{f^2}{N (F_d - f)} \, \frac{1}{2p} \tag{2} $$

Figure 2 (b) reveals that CoC sharply decreases as the subject distance approaches the focus distance, denoting a clear boundary between in-focus and out-of-focus regions. Beyond the focus distance, the CoC incrementally increases, indicating a gradual onset of blur with depth.
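To make the shape of this curve concrete, Equation (2) can be evaluated directly. The following is a minimal sketch; the function name and the camera values (focus distance, focal length, f-number, pixel size) are hypothetical choices for illustration and are not taken from the paper's experiments.

```python
def coc_radius_px(d, focus_dist, focal_len, f_number, pixel_size):
    """Blur radius sigma from Eq. (2). All lengths are in meters."""
    return (abs(d - focus_dist) / d) \
        * focal_len ** 2 / (f_number * (focus_dist - focal_len)) \
        / (2.0 * pixel_size)


# Hypothetical camera: focused at 2 m, 50 mm lens at f/2, 10 micron pixels.
params = dict(focus_dist=2.0, focal_len=0.05, f_number=2.0, pixel_size=1e-5)

for depth in (0.5, 1.0, 2.0, 4.0, 8.0):
    # The radius shrinks towards the focus distance and grows again beyond it.
    print(f"d = {depth:4.1f} m  ->  sigma = {coc_radius_px(depth, **params):7.2f}")
```

Under these assumed values the radius is zero exactly at the focus distance and rises on both sides of it, more steeply on the near side, which matches the qualitative behaviour described for Figure 2 (b).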

III-B Defocus Generation from Camera Lens Model

As mentioned in the previous section, we can generate defocus-blurred images from Equation (2) by simulating the CoC. The most commonly used method for generating defocus-blurred images is the point spread function (PSF) method. The PSF describes the response of the camera lens model to a point source or point object. We employ a Gaussian kernel function to generate synthetic training images, following the methodologies of previous work [26, 11].

$$ G_{x,y}(u,v) = \frac{1}{2\pi\Sigma_{x,y}^{2}} \exp\left(-\frac{u^{2}+v^{2}}{2\Sigma_{x,y}^{2}}\right), \tag{3} $$

where $\Sigma_{x,y}$ represents the defocus map, which is generated based on depth and the camera optical model. Let $I_A$ denote the all-in-focus image, and $I_{defocus}$ represent the image generated with defocus blur. In practice, $I_A$ is very difficult to obtain. Usually, we acquire an image $I_R$ with some depth of field, which already contains some blur. We can assume that $I_R$ can be represented as $I_A$ blurred by an initial defocus blur kernel $G_0$. The blurred image is generated by convolving the all-in-focus image with the Gaussian kernel, as shown in the following equation:

$$ \begin{aligned} I_{defocus} &= I_A \otimes G, \quad I_R = I_A \otimes G_0, \\ I_{defocus} &= I_R \otimes G = I_A \otimes (G_0 \otimes G). \end{aligned} \tag{4} $$

We can represent $I_R$ as a blurred version of $I_A$. Although $I_A$ is the most fundamental image in an ideal case, it is difficult to obtain in practice. The actual image $I_R$ already contains some level of blur. Therefore, we can directly use $I_R$ as input and apply additional blur $G$ to generate a stronger defocus-blurred image. Here, $G$ is the spatially varying Gaussian kernel, which changes with pixel positions $x, y$ to reflect varying degrees of defocus. The blur kernel $G_0$ represents the initial level of blur in the image $I_R$, which may be determined by factors such as the optical properties of the camera and the depth of field. Since the defocus map $\Sigma_{x,y}$ is pixel-dependent, the convolution kernel $G$ also varies spatially. We directly adopt the PSF layer proposed in [11] to generate the blurred images. The window size of the convolution kernel is set to 7, and we apply thresholding to the blur radius as $\sigma = \sigma \cdot \mathbf{1}_{\sigma \geq 1}$ to avoid negligible blur effects from small radii.
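The blur generation described above can be illustrated with a small, unoptimized sketch. It is not the PSF layer of [11]: it is a plain per-pixel "gather" convolution on a grayscale image stored as nested lists, it normalizes the sampled kernel so that its weights sum to one (instead of using the continuous $1/(2\pi\Sigma^2)$ factor), and it clamps coordinates at the image border. The function names are hypothetical.

```python
import math


def gaussian_kernel(sigma, window=7):
    """Sample the Gaussian of Eq. (3) on a window x window grid and normalize it."""
    r = window // 2
    k = [[math.exp(-(u * u + v * v) / (2.0 * sigma * sigma))
          for u in range(-r, r + 1)] for v in range(-r, r + 1)]
    total = sum(map(sum, k))
    return [[w / total for w in row] for row in k]


def defocus_blur(image, sigma_map, window=7):
    """Blur a grayscale image with a spatially varying Gaussian PSF."""
    h, w = len(image), len(image[0])
    r = window // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sigma = sigma_map[y][x]
            if sigma < 1.0:
                # Thresholding sigma * 1[sigma >= 1]: tiny radii leave the pixel sharp.
                out[y][x] = float(image[y][x])
                continue
            k = gaussian_kernel(sigma, window)
            acc = 0.0
            for v in range(-r, r + 1):
                for u in range(-r, r + 1):
                    yy = min(max(y + v, 0), h - 1)  # clamp at the image border
                    xx = min(max(x + u, 0), w - 1)
                    acc += k[v + r][u + r] * image[yy][xx]
            out[y][x] = acc
    return out
```

Feeding a defocus map computed from Equation (2) into `defocus_blur` yields one slice of a synthetic focal stack; repeating this for several focus distances gives the stack used for training.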

III-C Siamese Defocus Net for Defocus Mapping

The Circle of Confusion (CoC) is the core principle behind the phenomenon of defocus. Based on CoC mapping, we built a self-supervised framework that combines Siamese Defocus Net and 3D Gaussian splatting to accurately estimate defocus maps and CoC. As described in Section III-B, we generated defocused images at different focal lengths from the camera lens model, forming a focal stack. To enhance Siamese Defocus Net’s learning of defocus maps, we adopted a Siamese network structure that takes images with varying focal distances as input. To optimize the prediction of CoC and defocus maps, we integrated CoC into the 3D Gaussian splatting model to generate synthetic blurred images. We then compared the synthetic images with real defocused images, using the differences for self-supervised training to optimize network parameters and improve prediction accuracy.

Leveraging our Siamese network design, we effectively learn varying degrees of blurriness in the same region at different focus distances. Specifically, we incorporate the multi-path transformer from [17], which allows for both fine and coarse feature representations. The network uses a multi-scale patch embedding strategy through overlapping convolutions, processing these embeddings in parallel paths within the Transformer framework. This enables independent and efficient handling of multi-scale features, capturing both detailed and broader contextual information for dense prediction tasks. For predicting defocus maps, it is essential to extract both local and global features to understand the blur characteristics associated with different focus distances. After extracting local features from each patch, we apply a max-pooling operation to fuse local and global features, allowing the network to accurately capture the blur level for each focus distance. This approach is inspired by the layer-wise global pooling concept from [20]. During training, defocus loss is applied to paired patches to measure the similarity between corresponding patches, effectively capturing the variations in defocus blur across different focus distances. SDNet plays a vital role in our pipeline by accurately predicting the defocus map of a given input blurred image. This defocus map is integral to our methodology, as it serves as the foundation for reconstructing the original blurred input image to be sharper and clearer. Importantly, we utilize the defocus map to derive the depth of the image, which is essential for understanding the spatial relationships and focus levels within the scene. Therefore, the precision of our defocus map prediction directly impacts the overall effectiveness and reliability of the image reconstruction process. Here, we employ not only a reconstruction loss but also a loss that assesses the degree of image blurring, as well as a smoothness loss. 
We first introduce the defocus loss, following [29]. The defocus loss is applied to image patches to learn defocus within each focal stack. The smoothness loss is utilized to ensure that the gradients of the target image are not excessively large, thus encouraging smoother gradient changes. Let $D_1, D_2$ denote the defocus maps predicted by SDNet. The loss constraints are:

$$ \mathcal{L}_{defocus} = \mathbf{E}\left[ -\mathrm{cosine}(D_1, D_2) \right] \tag{5} $$
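As a rough illustration of Equation (5), the sketch below averages the negative cosine similarity over paired patches. It assumes each patch has already been flattened to a list of numbers; how patches are extracted and paired, and any additional machinery used in [29], are not specified here and are left out. The function names are hypothetical.

```python
import math


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two flattened patches."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)  # eps guards against all-zero patches


def defocus_loss(patches_1, patches_2):
    """Mean negative cosine similarity over paired patches, as in Eq. (5)."""
    sims = [cosine(p, q) for p, q in zip(patches_1, patches_2)]
    return -sum(sims) / len(sims)
```

The loss reaches its minimum of -1 when every pair of patches is perfectly aligned in direction, and it is 0 for orthogonal patches.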

We utilize the Laplacian operator to manipulate the edge map and compute its variance. For the prediction of the defocus map, we aim to exhibit distinct boundaries between blurred and non-blurred regions, which forces large gradient changes. Following the work in [18], we take the negative logarithm, sum it up, and normalize the result, which is expressed as:

$$ \mathcal{L}_{blur} = -\frac{1}{N} \sum \beta \log\left( \frac{\sum_{i}\sum_{j} \left(\nabla^{2}\hat{I}(i,j)\right)^{2}}{M - \mu^{2}} \right) \tag{6} $$

where $\nabla^{2}$ is the Laplacian operator, $M$ is the number of pixels, $\mu$ is the mean value of pixels, and $\beta$ is a scaling factor, which we set to 0.01 for the subsequent experiments.
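The idea behind this loss, that sharper content produces a Laplacian response with larger variance, can be sketched as follows. This is one possible reading rather than a literal transcription of Equation (6): it uses the variance of a 4-neighbour Laplacian response as the argument of the logarithm, which is an assumption based on the statement that the variance of the Laplacian-filtered map is computed. The function names and the small `eps` term are also assumptions.

```python
import math


def laplacian(img):
    """4-neighbour discrete Laplacian evaluated on the interior pixels."""
    h, w = len(img), len(img[0])
    return [img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1]
            - 4.0 * img[y][x]
            for y in range(1, h - 1) for x in range(1, w - 1)]


def blur_loss(images, beta=0.01, eps=1e-12):
    """Negative, beta-scaled log of the Laplacian variance, averaged over images."""
    total = 0.0
    for img in images:
        lap = laplacian(img)
        mean = sum(lap) / len(lap)
        var = sum(v * v for v in lap) / len(lap) - mean * mean
        total += beta * math.log(var + eps)  # eps avoids log(0) on flat images
    return -total / len(images)
```

With this reading, an image with crisp edges receives a lower loss than a contrast-reduced copy of itself, so minimizing the loss pushes predictions towards distinct boundaries.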

III-D 3D Gaussian Splatting Siamese

3D Gaussian Splatting [14] is a technique used in point cloud data processing. Consider a 3D point cloud $\mathbf{P} = \{\mathbf{p}_i \in \mathbb{R}^{3} \mid i = 1, 2, \ldots, N\}$, where $\mathbf{p}_i$ represents a point in the point cloud. 3D Gaussian Splatting can be expressed as associating each point with a Gaussian $G_i(\mathrm{P})$:

$$ G_i(\mathrm{P}) = \exp\left( -\frac{1}{2} (\mathrm{P} - \mu_i)^{T} \Sigma_i^{-1} (\mathrm{P} - \mu_i) \right) \tag{7} $$

Specifically, each Gaussian function is defined by the following attributes: a center position $\mu$, a covariance matrix $\Sigma$ derived from anisotropic scaling $s$ and a quaternion vector $q$, as well as opacity $o$ and spherical harmonics coefficients $h$. To evaluate the accuracy of the CoC calculation, we need to project the 3D Gaussian points into 2D screen space. The 2D Gaussian in screen space is formulated as:

$$ G_i'(x) = \exp\left( -\frac{1}{2} (x - \mu_i')^{T} (\Sigma_i')^{-1} (x - \mu_i') \right) \tag{8} $$
$$ \Sigma_i' = J W \Sigma_i W^{T} J^{T} \tag{9} $$

Here, $J$ is the Jacobian matrix of the projective transformation, which can be computed while projecting the 3D Gaussian points onto 2D screen space, $W$ is the view matrix transforming points from world space to camera space, and $\mu_i'$ is the 2D center position after projection.
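Equation (9) can be checked numerically with a few lines of code. The sketch below assumes a pinhole projection $(x, y, z) \mapsto (f_x x / z, f_y y / z)$, whose Jacobian is the standard local affine approximation used in splatting, and it takes $W$ to be the $3 \times 3$ rotation part of the view matrix. The helper names and the intrinsics `fx`, `fy` are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def perspective_jacobian(p_cam, fx, fy):
    """2x3 Jacobian of (x, y, z) -> (fx*x/z, fy*y/z) at a camera-space point."""
    x, y, z = p_cam
    return [[fx / z, 0.0, -fx * x / (z * z)],
            [0.0, fy / z, -fy * y / (z * z)]]


def project_covariance(sigma3d, w_rot, p_cam, fx, fy):
    """Screen-space covariance Sigma' = J W Sigma W^T J^T of Eq. (9)."""
    jw = matmul(perspective_jacobian(p_cam, fx, fy), w_rot)
    return matmul(matmul(jw, sigma3d), transpose(jw))


# An isotropic Gaussian on the optical axis, 2 m from an unrotated camera.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sigma3d = [[0.04, 0.0, 0.0], [0.0, 0.04, 0.0], [0.0, 0.0, 0.04]]
cov2d = project_covariance(sigma3d, identity, (0.0, 0.0, 2.0), 100.0, 100.0)
print(cov2d)
```

In this example the projected covariance is isotropic and scales with $(f_x / z)^2$, so the same Gaussian appears smaller on screen when it is farther from the camera.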

To help Siamese Defocus Net better predict the defocus map and the Circle of Confusion (CoC), we incorporate a depth-of-field rendering process into the 3D Gaussian splatting. We begin by estimating the camera pose and the initial sparse point cloud from the defocused image. For each viewpoint, we introduce the Circle of Confusion predicted by the Siamese Defocus Net for the blurred image. During the optimization process, for each sampled viewpoint, we render a defocused image using the CoC to fit the target view. We hypothesize that a blurred image can be obtained by first applying depth-dependent blurring to individual Gaussians in the scene and then rendering the image from these blurred Gaussians [15]. As illustrated in Figure 1, during the blur rendering process, the 3D Gaussian functions $G_i$ are projected onto the 2D screen space. Each 2D Gaussian $G'_k$ (representing the projection of a 3D Gaussian point) is convolved with a Gaussian blur kernel proportional to the CoC generated by a camera lens model with a finite aperture, as shown in Figure 2 (a). The adopted camera lens model includes an aperture parameter $Q$, which differs from the pinhole model. Object points that deviate from the focal distance $f$ form a region known as the CoC, rather than a single point. The final color is obtained by compositing the convolved Gaussians $G''_i$. We assume that the depth across each 2D Gaussian's support region is uniform, set as $z_k$, which is the z-coordinate of the transformed center position in camera space.
Based on the CoC radius $\sigma$ calculated using Equation 2, we construct the blur kernel $g_i^{\sigma} = 1/2 \exp\left(x^{T} \Sigma_i^{\sigma} x\right)$. When $\Sigma_i^{\sigma} = aI$, $a = \frac{1}{2\ln 4}(\sigma_i)$ [33], $g_i^{\sigma}$ is similar to a uniform intensity distribution within the CoC. Using Gaussian kernels for blur convolution ensures that the subsequent color composition remains similar to the original rasterization process, as convolving two Gaussians yields another Gaussian. The convolved 2D Gaussian is defined as $G''_i = G'_i \ast g_i^{\sigma}$, where $\ast$ denotes convolution.
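The closure property used here can be written out explicitly. The symbol $C_i^{\sigma}$ below is introduced only for this illustration and denotes the covariance of the blur kernel; how it relates to $\Sigma_i^{\sigma}$ depends on the parametrization adopted from [33]. For normalized Gaussian densities, convolution adds means and covariances:

$$ \mathcal{N}(x; \mu_1, C_1) \ast \mathcal{N}(x; \mu_2, C_2) = \mathcal{N}(x; \mu_1 + \mu_2, C_1 + C_2). $$

Applied to a projected Gaussian with center $\mu_i'$ and covariance $\Sigma_i'$ and a zero-mean blur kernel with covariance $C_i^{\sigma}$, this gives, up to a constant amplitude factor,

$$ G''_i(x) \propto \exp\left( -\frac{1}{2} (x - \mu_i')^{T} \left(\Sigma_i' + C_i^{\sigma}\right)^{-1} (x - \mu_i') \right). $$

Under this assumption the defocus blur only enlarges the screen-space covariance of each Gaussian, which is why the rasterization step can remain structurally unchanged.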
Although $G''_i$ has infinite support in theory, in practice it is truncated by a cutoff radius $t$ and is evaluated only over a limited range. Therefore, each pixel $x$ is associated with only a subset of the Gaussians in the scene, whose number is denoted as $N_x$. Finally, the projected Gaussians are rendered through alpha blending:

$$ \hat{I}(x) = \sum_{i=1}^{N_x} T_i \alpha_i c_i, \quad \alpha_i = G''_i(x), \quad T_i = \prod_{j=1}^{i-1} (1 - \alpha_j) \tag{10} $$

$T_i$ is the transmittance, and $c_i$ denotes the view-dependent color of the $i$-th Gaussian associated with the queried pixel $x$. As Equation 10 is fully differentiable, 3DGS reconstructs a 3D scene by minimizing errors between its renderings and training views.
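The front-to-back compositing in Equation 10 can be illustrated for a single pixel with a minimal NumPy sketch. The function name `composite_color` and the toy inputs are hypothetical, and the opacities $\alpha_i$ are assumed to be already evaluated from $G''_i(x)$ and sorted from near to far.

```python
import numpy as np

def composite_color(alphas, colors):
    """Front-to-back alpha blending for one pixel, following Equation 10.

    alphas: (N,) opacities alpha_i in [0, 1], sorted near to far (assumed given).
    colors: (N, 3) colors c_i of the same Gaussians.
    Returns the blended color and the per-Gaussian weights T_i * alpha_i.
    """
    alphas = np.asarray(alphas, dtype=float)
    colors = np.asarray(colors, dtype=float)
    # T_i = prod_{j<i} (1 - alpha_j); the first Gaussian sees T_1 = 1.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = transmittance * alphas
    return weights @ colors, weights

# Toy example: a half-transparent red Gaussian in front of a blue one.
pixel, weights = composite_color([0.5, 0.5], [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
# pixel -> [0.5, 0.0, 0.25]
```

Because the weights depend only on products and sums of the opacities, gradients can flow back to every $\alpha_i$, which is what makes the rendering usable as a training signal.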

$$\min_{\{S\}} \sum_{m=1}^{M} \mathcal{L}_{rec}\left(\hat{I}_m, I_m\right) \tag{11}$$
$$\mathcal{L}_{rec} = \left(\alpha\frac{1 - SSIM(\hat{I}, I)}{2} + (1 - \alpha)\|\hat{I} - I\|_1\right) \tag{12}$$

As shown in Equation 11, the optimizable parameters now include the underlying 3D scene $S$ and the CoC parameters $\{M_m\}_{m=1}$ for the training views, where $m$ is the index running over the training views. $I$ is the real blurred image, and $\hat{I}$ is the CoC-synthesized blurred image. The 3D Gaussian splatting module generates a blurred image $\hat{I}(x)$ along with its corresponding depth $\hat{D}(x)$, defined as:

$$\hat{D}(x) = \sum_{i=1}^{N_x} T_i \alpha_i z_i, \quad \alpha_i = G''_i(x), \quad T_i = \prod_{j=1}^{i-1}\left(1 - \alpha_j\right) \tag{13}$$

By introducing the defocus map and CoC calculated by the Siamese Defocus network, the scale of the depth obtained through 3D Gaussian splatting is made closer to the true depth. However, 3D Gaussian splatting relies on sparse point cloud data, and the generated depth may have some inaccuracies. The limitations of the sparse point cloud data itself can lead to incomplete depth information, and errors introduced during projection and rendering are also non-negligible. Therefore, it is necessary to further refine and optimize the depth.
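As a rough illustration of Equation 13, the sketch below composites per-Gaussian depths $z_i$ for one pixel with the same blending weights used for color. The function name `composite_depth` and the numbers are hypothetical; the opacities are assumed to be given and sorted near to far.

```python
import numpy as np

def composite_depth(alphas, depths):
    """Alpha-blended depth for one pixel, following Equation 13.

    alphas: (N,) opacities alpha_i in [0, 1], sorted near to far (assumed given).
    depths: (N,) depths z_i of the same Gaussians.
    Returns the blended depth and the accumulated weight sum_i T_i * alpha_i.
    """
    alphas = np.asarray(alphas, dtype=float)
    depths = np.asarray(depths, dtype=float)
    # T_i = prod_{j<i} (1 - alpha_j), with T_1 = 1.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = transmittance * alphas
    return float(weights @ depths), float(weights.sum())

# Toy example: two Gaussians at 2 m and 4 m.
depth, coverage = composite_depth([0.6, 0.5], [2.0, 4.0])
```

In this toy case the accumulated weight is below 1, so the blended value is not a normalized average of the contributing depths. This is one simple way in which a rendered depth can deviate from the true surface depth, which is in line with the need for refinement discussed above.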

III-E Joint Optimization

We simultaneously trained the 3D Gaussian Splatting model and Siamese Defocus Net. As described in Section III-B, we generated blurred images using the given CoC parameters. For Siamese Defocus Net, we input images with varying degrees of blur to predict the CoC and defocus map for each image. We selected a set of scenes with the same level of blur as input for COLMAP to estimate the camera poses and initial sparse point cloud. The output from COLMAP was used as the initialization for the 3D Gaussian Splatting model. During the optimization process, we rendered blurred images using the CoC parameters predicted by Siamese Defocus Net and compared them to the original blurred images to assess whether the reconstructed 3D scene could accurately reproduce the input blur effects. At the same time, this process helped Siamese Defocus Net further refine its CoC predictions. Then, we obtained the total loss as follows:

$$\mathcal{L} = \mu_1\mathcal{L}_{defocus} + \mu_2\mathcal{L}_{blur} + \mu_3\mathcal{L}_{recon} \tag{14}$$
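A minimal sketch of how the reconstruction term of Equation 12 and the weighted sum of Equation 14 could be computed is given below. Several points are assumptions rather than details from the paper: SSIM is simplified to a single global window, the L1 norm is averaged over pixels, $\mathcal{L}_{recon}$ is taken to be the loss of Equation 12, and the values of `alpha` and `mu` are placeholders. The defocus and blur terms are passed in as precomputed numbers.

```python
import numpy as np

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM over whole images in [0, 1] (a simplification)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )

def reconstruction_loss(rendered, target, alpha=0.85):
    """Equation 12 with a pixel-averaged L1 term; alpha is a placeholder value."""
    ssim_term = (1.0 - global_ssim(rendered, target)) / 2.0
    l1_term = np.abs(rendered - target).mean()
    return alpha * ssim_term + (1.0 - alpha) * l1_term

def total_loss(l_defocus, l_blur, l_recon, mu=(1.0, 1.0, 1.0)):
    """Equation 14: weighted sum of the three terms; mu values are placeholders."""
    return mu[0] * l_defocus + mu[1] * l_blur + mu[2] * l_recon

# Toy check: a rendering identical to the target gives zero reconstruction loss.
image = np.linspace(0.0, 1.0, 16).reshape(4, 4)
loss_same = reconstruction_loss(image, image)
loss_shifted = reconstruction_loss(np.clip(image + 0.2, 0.0, 1.0), image)
```

A windowed SSIM, as used in most image-reconstruction losses, would replace `global_ssim` in practice.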

III-F Defocus-based Depth Estimation

This paper introduces DepthNet, designed to improve depth estimation accuracy by utilizing the defocus map generated by DefocusNet and initial depth data from 3D Gaussian Splatting. DepthNet enhances accuracy by predicting the residual between the estimated depth and the ground truth. Using an encoder-decoder structure, the encoder is built on ResNet, enhanced by integrating an Atrous Spatial Pyramid Pooling (ASPP) module [5]. This setup leverages ResNet’s feature extraction strengths, while the ASPP module improves the network’s ability to capture multi-scale contextual information, essential for accurate depth estimation. We further propose an improved ASPP module with dense connections between the 1x1 convolution and Atrous convolution layers, promoting better feature integration to capture both local and global details. A self-attention mechanism is also included to refine feature extraction. The decoding process incorporates three upsampling blocks, with skip connections to maintain high-resolution details, ensuring accurate defocus depth prediction. DepthNet is trained using L1 loss and smoothness loss [9], estimating the residual between the 3D Gaussian Splatting depth and the ground truth.

$$\mathcal{L}_{res} = \frac{1}{N}\sum^{N}\left|(\hat{D}_x + D_{res}) - D_{gt}\right|_1 \tag{15}$$
$$\mathcal{L}_{sm} = \frac{1}{N}\sum^{N}\left(|\partial_x\hat{D}|e^{-|\partial_x I|} + |\partial_y\hat{D}|e^{-|\partial_y I|}\right) \tag{16}$$

$\hat{D}_x$ is the depth obtained from 3D Gaussian splatting, $D_{res}$ is the DepthNet result, and $D_{gt}$ is the ground truth.
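The two DepthNet losses in Equations 15 and 16 can be sketched in NumPy as follows. This is an illustrative approximation: gradients are taken as forward differences, the image gradient is averaged over color channels, and each directional term is averaged separately. The function names are hypothetical.

```python
import numpy as np

def residual_l1(depth_3dgs, depth_res, depth_gt):
    """Equation 15: mean absolute error of the residual-corrected depth."""
    return np.abs((depth_3dgs + depth_res) - depth_gt).mean()

def edge_aware_smoothness(depth, image):
    """Equation 16: depth gradients down-weighted where the image has edges.

    depth: (H, W) array; image: (H, W, 3) array.
    """
    dx_d = np.abs(depth[:, 1:] - depth[:, :-1])
    dy_d = np.abs(depth[1:, :] - depth[:-1, :])
    dx_i = np.abs(image[:, 1:] - image[:, :-1]).mean(axis=-1)
    dy_i = np.abs(image[1:, :] - image[:-1, :]).mean(axis=-1)
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()

# Toy example: a depth ramp along x over a textureless image.
ramp = np.tile(np.arange(4, dtype=float), (3, 1))   # (3, 4), slope 1 along x
flat_image = np.zeros((3, 4, 3))
smooth_flat = edge_aware_smoothness(ramp, flat_image)

# The same ramp is penalized less when the image has matching edges.
edgy_image = np.repeat(ramp[..., None], 3, axis=-1)
smooth_edgy = edge_aware_smoothness(ramp, edgy_image)
```

The exponential weighting means that depth discontinuities aligned with image edges are tolerated, while depth changes in textureless regions are penalized.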

IV Experiment

We present the quantitative and visual results of our experiments as well as the ablation study. Our experiments are conducted on both synthetically generated datasets and real defocus datasets.

IV-A Implementation Details

Synthetic Dataset: The FoD500 dataset [20] contains 1000 scenes, each focal stack comprising 5 RGB images, 5 defocus maps, 1 depth map, and 1 all-in-focus image. The dataset has a maximum distance of 3 meters, and the focus distances are set to 0.3, 0.45, 0.75, 1.2, and 1.8 meters.

Synthetic Defocus with Real Images: The synthetic defocus is generated using the method described in Section III-B. We used the NYUv2 indoor dataset [22], with a maximum depth limit of 10 meters, and set the focus distances to [1, 1.5, 2.5, 4, 6] meters to generate defocus blur.

Real Focal Stack Dataset: The MobileDFF dataset [27] contains 11 scenes, with the number of focal stacks per scene ranging from 14 to 33.
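To give a feel for how per-pixel blur varies across such a synthetic focal stack, the sketch below evaluates the standard thin-lens circle-of-confusion relation at the NYUv2 focus distances listed above. This relation is a textbook formula and may differ in form from the paper's Equation 2; the focal length and f-number are hypothetical values, not the settings used in the experiments.

```python
import numpy as np

def coc_diameter(depth, focus_dist, focal_len=0.05, f_number=2.0):
    """Thin-lens circle of confusion on the sensor, in meters.

    depth, focus_dist and focal_len are in meters. The camera parameters
    are hypothetical defaults chosen only for illustration.
    """
    depth = np.asarray(depth, dtype=float)
    aperture = focal_len / f_number  # aperture diameter
    return (aperture * np.abs(depth - focus_dist) / depth
            * focal_len / (focus_dist - focal_len))

focus_distances = [1.0, 1.5, 2.5, 4.0, 6.0]   # NYUv2 focus distances (meters)
depths = np.array([1.0, 2.0, 10.0])           # sample scene depths (meters)
coc_stack = np.stack([coc_diameter(depths, d_f) for d_f in focus_distances])
# coc_stack[k, j] is the blur of depth j when focused at focus_distances[k]
```

A point exactly at the focus distance has zero blur, and the blur grows as the point moves away from the focal plane, which is the cue a defocus-based network learns to invert.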

IV-B Comparison with State-of-the-art methods

| Split | Methods | $\delta_1$ | $\delta_2$ | $\delta_3$ | RMSE | AbsRel |
|---|---|---|---|---|---|---|
| Regular | DefocusNet [18] | 0.912 | 0.967 | 0.983 | 0.194 | 0.090 |
| Regular | DFF-FV [35] | 0.883 | 0.953 | 0.980 | 0.231 | 0.107 |
| Regular | DFF-DFV [35] | 0.921 | 0.977 | 0.990 | 0.219 | 0.104 |
| Regular | DAIF-Net [26] | 0.746 | 0.883 | 0.938 | 0.351 | 0.177 |
| Regular | Ours | 0.849 | 0.930 | 0.983 | 0.256 | 0.173 |
| 0.5m | DefocusNet [18] | 0.911 | 0.933 | 0.938 | 0.062 | 0.069 |
| 0.5m | DFF-FV [35] | 0.977 | 0.996 | 0.999 | 0.023 | 0.032 |
| 0.5m | DFF-DFV [35] | 0.976 | 0.996 | 0.999 | 0.023 | 0.031 |
| 0.5m | DAIF-Net [26] | 0.889 | 0.987 | 0.992 | 0.072 | 0.138 |
| 0.5m | Ours | 0.930 | 0.990 | 0.996 | 0.057 | 0.079 |

TABLE I: The quantitative depth comparison on the FoD500 dataset.

FoD500 Dataset: Following the previous DFD work [26], we conducted evaluations using two data splits: regular (including all results) and depths below 0.5 meters. We split the 500 focal stacks into 400 training stacks and 100 testing stacks. For comparison, we selected currently open-sourced DFD methods (DAIF-Net [26], AiFDepthNet [31], DFF-DFV/FV [35]) and present the quantitative comparison in Table I. Our approach takes only a single defocused image as input, whereas DFF-FV [35], DFF-DFV [35], and DAIF-Net [26] use focal stacks (5 images) as input. Despite this, our model achieves results comparable to the methods that use focal stacks as input.
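The metrics reported in the tables follow the usual depth-estimation definitions: $\delta_k$ is the fraction of pixels with $\max(\hat{d}/d, d/\hat{d}) < 1.25^k$, RMSE is the root mean squared error, and AbsRel is the mean of $|\hat{d} - d|/d$. The sketch below computes them with NumPy; the function name and the optional `max_depth` mask (for example, 0.5 for the short-range split) are illustrative choices, and predictions are assumed to be strictly positive.

```python
import numpy as np

def depth_metrics(pred, gt, max_depth=None):
    """Standard depth metrics over valid pixels (gt > 0, optionally gt < max_depth)."""
    pred = np.asarray(pred, dtype=float).ravel()
    gt = np.asarray(gt, dtype=float).ravel()
    mask = gt > 0
    if max_depth is not None:
        mask &= gt < max_depth
    pred, gt = pred[mask], gt[mask]
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "delta1": float((ratio < 1.25).mean()),
        "delta2": float((ratio < 1.25 ** 2).mean()),
        "delta3": float((ratio < 1.25 ** 3).mean()),
        "rmse": float(np.sqrt(((pred - gt) ** 2).mean())),
        "abs_rel": float((np.abs(pred - gt) / gt).mean()),
    }

# Toy example: two exact predictions and one that is off by a factor of two.
metrics = depth_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 2.0])
```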

| Methods | Input | $\delta_1\uparrow$ | $\delta_2\uparrow$ | $\delta_3\uparrow$ | RMSE$\downarrow$ | AbsRel$\downarrow$ |
|---|---|---|---|---|---|---|
| Moeller [21] | focal stack | 0.670 | 0.778 | 0.912 | 0.985 | 0.263 |
| Suwajanakorn [27] | focal stack | 0.688 | 0.802 | 0.917 | 0.950 | 0.250 |
| Gur and Wolf [11] | in-focus | 0.720 | 0.887 | 0.951 | 0.649 | 0.184 |
| Defocus-Net [18] | defocus | 0.732 | 0.887 | 0.951 | 0.623 | 0.176 |
| Focus-Net [18] | focal stack | 0.748 | 0.892 | 0.949 | 0.611 | 0.172 |
| AiFDepth-Net [32] | focal stack | 0.688 | 0.944 | 0.961 | 0.669 | 0.289 |
| DAIF-Net [26] | focal stack | 0.950 | 0.979 | 0.987 | 0.325 | 0.170 |
| DFF-FV [35] | focal stack | 0.956 | 0.979 | 0.988 | 0.285 | 0.470 |
| DFF-DFV [35] | focal stack | 0.967 | 0.980 | 0.990 | 0.235 | 0.445 |
| Ours | defocus | 0.964 | 0.998 | 0.999 | 0.201 | 0.026 |

TABLE II: The quantitative depth comparison on the NYUv2 dataset.

NYUv2 Dataset: For the Depth from Defocus task on the NYUv2 dataset, we used the focal stack generation technique from [26] and compared our model with two DFD methods and four self-supervised deep learning methods. We evaluated DAIF-Net and AiFDepth-Net using their released code. For the other focal stack-based methods, whose source code was unavailable, we report the results given in [26]. As shown in Table II, despite using only a single defocused image as input, our method performs comparably to current state-of-the-art methods on key metrics. Additionally, we compared our approach with several monocular depth estimation methods. As shown in Table III, our method achieves the lowest RMSE and AbsRel among the compared methods and matches or exceeds them on $\delta_2$ and $\delta_3$, although EVP [16] attains a higher $\delta_1$. This highlights the broad applicability of the technique.

| Methods | RMSE$\downarrow$ | AbsRel$\downarrow$ | $\delta_1\uparrow$ | $\delta_2\uparrow$ | $\delta_3\uparrow$ |
|---|---|---|---|---|---|
| SharpNet [24] | 0.502 | 0.139 | 0.836 | 0.966 | 0.993 |
| AdaBins [3] | 0.364 | 0.103 | 0.903 | 0.984 | 0.997 |
| NeWCRFs [36] | 0.334 | 0.041 | 0.922 | 0.992 | 0.998 |
| ZoeDepth [4] | 0.270 | 0.032 | 0.955 | 0.995 | 0.999 |
| EVP [16] | 0.224 | 0.027 | 0.976 | 0.997 | 0.999 |
| Ours | 0.201 | 0.026 | 0.964 | 0.998 | 0.999 |

TABLE III: Comparison with monocular depth estimation methods on NYUv2.
Figure 3: Depth estimation results on the MobileDFF dataset. Warmer colors indicate larger depth. We choose DAIFNet [26], AiFDepthNet [31], DFV [35], and MobileDFF [27] for comparison.

Figure 4: 3D maps generated on the KITTI Odometry dataset. Each 3D map is created by merging ten consecutive point clouds.

KITTI Dataset: We also evaluated our model on the KITTI dataset. For a sequence of 10 consecutive images, we performed 3D reconstruction by combining the estimated depth with the ground truth pose, as shown in Figure 4. The point cloud results demonstrate the accuracy of our model.
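The merging step described above can be sketched as a standard pinhole back-projection followed by a rigid transform into a common frame. The helper names, the intrinsic matrix `K`, and the camera-to-world pose convention are assumptions made for this illustration, not details taken from the paper.

```python
import numpy as np

def backproject(depth, K, T_world_cam):
    """Lift an (H, W) depth map to world-frame 3D points.

    K: (3, 3) pinhole intrinsics; T_world_cam: (4, 4) camera-to-world pose.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pixels @ np.linalg.inv(K).T            # pixel -> normalized camera ray
    points_cam = rays * depth.reshape(-1, 1)      # scale each ray by its depth
    points_h = np.concatenate([points_cam, np.ones((points_cam.shape[0], 1))], axis=1)
    return (points_h @ T_world_cam.T)[:, :3]

def merge_clouds(depths, K, poses):
    """Concatenate the world-frame point clouds of several frames."""
    return np.concatenate([backproject(d, K, T) for d, T in zip(depths, poses)], axis=0)

# Toy example: two 2x2 frames, the second camera shifted 1 m along z.
K = np.eye(3)
pose_a = np.eye(4)
pose_b = np.eye(4)
pose_b[2, 3] = 1.0
depth_map = np.full((2, 2), 2.0)
cloud = merge_clouds([depth_map, depth_map], K, [pose_a, pose_b])
```

With accurate depth and poses, points from consecutive frames that observe the same surface land close together, so the visual consistency of the merged cloud serves as a qualitative check on depth quality.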

MobileDFF Dataset: Because the MobileDFF dataset lacks ground truth defocus maps, we trained jointly on the FoD500 and NYUv2 datasets before evaluating on MobileDFF. We selected four methods for comparison: AiFDepthNet [31], DFF-DFV/FV [35], and DAIFNet [26]. As the visual results in Fig. 3 show, our method produces more distinct depth variations, even where the depth differences are subtle.

IV-C Ablation studies

Loss functions: We conducted ablation studies on the loss functions of our model, separately verifying the effectiveness of the blurred reconstruction loss and the defocus loss. These experiments were carried out on the NYUv2 dataset, and the quantitative results are presented in Table IV. For the defocus loss ablation, we replaced the defocus loss with a triplet loss. From the table, we observe that the blurred reconstruction loss and the defocus loss improve the accuracy of our model by 2.2% and 3.4%. With both losses, our model achieves higher accuracy and lower error.

| Setting (blurred, defocus) | $\delta_1\uparrow$ | $\delta_2\uparrow$ | $\delta_3\uparrow$ | RMSE$\downarrow$ | AbsRel$\downarrow$ |
|---|---|---|---|---|---|
| ✓ | 0.943 | 0.982 | 0.992 | 0.254 | 0.079 |
| ✓ | 0.932 | 0.976 | 0.989 | 0.312 | 0.137 |
| ✓ ✓ | 0.964 | 0.998 | 0.999 | 0.201 | 0.026 |

TABLE IV: Ablation experiments for loss functions.
| Siamese | $\delta_1\uparrow$ | $\delta_2\uparrow$ | $\delta_3\uparrow$ | RMSE$\downarrow$ | AbsRel$\downarrow$ |
|---|---|---|---|---|---|
| without | 0.897 | 0.974 | 0.985 | 0.453 | 0.297 |
| with | 0.964 | 0.998 | 0.999 | 0.201 | 0.026 |

TABLE V: Ablation experiments for model structure.

Model Structure: We conducted ablation studies on the model structure to verify the effect of the Siamese network on depth estimation performance. In this ablation, a single MPViT model predicts the defocus map, and DepthNet is left unchanged. The results are shown in Table V.

V Conclusion

This paper introduces a novel framework for depth estimation from defocused images. By incorporating the camera lens model, the network generates images with varying blur levels as input. The framework, built on Siamese networks and 3D Gaussian splatting, is trained in a self-supervised manner. The Siamese network predicts defocus maps and the Circle of Confusion, which are used by the 3D Gaussian splatting model to generate synthetic blurred images. The model parameters are optimized by comparing synthetic and real blurred images. Additionally, the depth corresponding to the blurred images is fed into DepthNet, enabling high-precision depth estimation across various datasets.

References

  • [1] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building rome in a day. Communications of the ACM, 2011.
  • [2] V Madhu Babu, Kaushik Das, Anima Majumdar, and Swagat Kumar. Undemon: Unsupervised deep network for depth and ego-motion estimation. In IROS. IEEE, 2018.
  • [3] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In CVPR, 2021.
  • [4] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023.
  • [5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2017.
  • [6] Po-Yi Chen, Alexander H Liu, Yen-Cheng Liu, and Yu-Chiang Frank Wang. Towards scene understanding: Unsupervised monocular depth estimation with semantic-aware representation. In CVPR, 2019.
  • [7] Xingshuai Dong, Matthew A Garratt, Sreenatha G Anavatti, and Hussein A Abbass. Towards real-time monocular depth estimation for robotics: A survey. TIST, 2022.
  • [8] Ravi Garg, Vijay Kumar BG, Gustavo Carneiro, and Ian Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In ECCV, 2016.
  • [9] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017.
  • [10] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. In ICCV, 2019.
  • [11] Shir Gur and Lior Wolf. Single image depth estimation trained via depth from defocus cues. In CVPR, 2019.
  • [12] Mu He, Le Hui, Yikai Bian, Jian Ren, Jin Xie, and Jian Yang. Ra-depth: Resolution adaptive self-supervised monocular depth estimation. In ECCV. Springer, 2022.
  • [13] Megha Kalia, Nassir Navab, and Tim Salcudean. A real-time interactive augmented reality depth estimation technique for surgical robotics. In ICRA. IEEE, 2019.
  • [14] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 2023.
  • [15] Jaroslav Krivánek, Jiri Zara, and Kadi Bouatouch. Fast depth of field rendering with surface splatting. In Proceedings Computer Graphics International 2003. IEEE, 2003.
  • [16] Mykola Lavreniuk, Shariq Farooq Bhat, Matthias Müller, and Peter Wonka. Evp: Enhanced visual perception using inverse multi-attentive feature refinement and regularized image-text alignment. arXiv preprint arXiv:2312.08548, 2023.
  • [17] Youngwan Lee, Jonghee Kim, Jeffrey Willette, and Sung Ju Hwang. Mpvit: Multi-path vision transformer for dense prediction. In CVPR, 2022.
  • [18] Yawen Lu, Garrett Milliron, John Slagter, and Guoyu Lu. Self-supervised single-image depth estimation from focus and defocus clues. RAL, 2021.
  • [19] Xiaoyang Lyu, Liang Liu, Mengmeng Wang, Xin Kong, Lina Liu, Yong Liu, Xinxin Chen, and Yi Yuan. Hr-depth: High resolution self-supervised monocular depth estimation. In AAAI, 2021.
  • [20] Maxim Maximov, Kevin Galim, and Laura Leal-Taixé. Focus on defocus: bridging the synthetic to real domain gap for depth estimation. In CVPR, 2020.
  • [21] Michael Moeller, Martin Benning, Carola Schönlieb, and Daniel Cremers. Variational depth from focus reconstruction. TIP, 2015.
  • [22] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
  • [23] Matteo Poggi, Fabio Tosi, and Stefano Mattoccia. Learning monocular depth estimation with unsupervised trinocular assumptions. In 3DV. IEEE, 2018.
  • [24] Michael Ramamonjisoa and Vincent Lepetit. Sharpnet: Fast and accurate recovery of occluding contours in monocular depth estimation. In ICCV Workshops, 2019.
  • [25] Sean Ryan Fanello, Christoph Rhemann, Vladimir Tankovich, Adarsh Kowdle, Sergio Orts Escolano, David Kim, and Shahram Izadi. Hyperdepth: Learning depth from structured light without matching. In CVPR, 2016.
  • [26] Haozhe Si, Bin Zhao, Dong Wang, Yunpeng Gao, Mulin Chen, Zhigang Wang, and Xuelong Li. Fully self-supervised depth estimation from defocus clue. In CVPR, 2023.
  • [27] Supasorn Suwajanakorn, Carlos Hernandez, and Steven M Seitz. Depth from focus with your mobile phone. In CVPR, 2015.
  • [28] Chris Sweeney, Torsten Sattler, Tobias Hollerer, Matthew Turk, and Marc Pollefeys. Optimizing the viewing graph for structure-from-motion. In ICCV, 2015.
  • [29] Chenxin Tao, Xizhou Zhu, Weijie Su, Gao Huang, Bin Li, Jie Zhou, Yu Qiao, Xiaogang Wang, and Jifeng Dai. Siamese image modeling for self-supervised vision representation learning. In CVPR, 2023.
  • [30] Alessio Tonioni, Fabio Tosi, Matteo Poggi, Stefano Mattoccia, and Luigi Di Stefano. Real-time self-adaptive deep stereo. In CVPR, 2019.
  • [31] Ning-Hsu Wang, Ren Wang, Yu-Lun Liu, Yu-Hao Huang, Yu-Lin Chang, Chia-Ping Chen, and Kevin Jou. Bridging unsupervised and supervised depth from focus via all-in-focus supervision. In ICCV, 2021.
  • [32] Ning-Hsu Wang, Ren Wang, Yu-Lun Liu, Yu-Hao Huang, Yu-Lin Chang, Chia-Ping Chen, and Kevin Jou. Bridging unsupervised and supervised depth from focus via all-in-focus supervision. In ICCV, 2021.
  • [33] Yujie Wang, Praneeth Chakravarthula, and Baoquan Chen. Dof-gs: Adjustable depth-of-field 3d gaussian splatting for refocusing, defocus rendering and blur removal. arXiv preprint arXiv:2405.17351, 2024.
  • [34] Gangwei Xu, Junda Cheng, Peng Guo, and Xin Yang. Attention concatenation volume for accurate and efficient stereo matching. In CVPR, 2022.
  • [35] Fengting Yang, Xiaolei Huang, and Zihan Zhou. Deep depth from focus with differential focus volume. In CVPR, 2022.
  • [36] Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. New crfs: Neural window fully-connected crfs for monocular depth estimation. arXiv preprint arXiv:2203.01502, 2022.
  • [37] Changyin Zhou, Stephen Lin, and Shree Nayar. Coded aperture pairs for depth from defocus. In ICCV. IEEE, 2009.
  • [38] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.