With GPU ray tracing being very popular of late, I decided to do a comparison between BVH-based and SDF-based ray tracing.
I plan to add a better global illumination model. This will require tracing some rays on the GPU, but I cannot depend on a hardware ray tracer being available.
I was already generating a BVH for each object, so I took the BVH generated for the static world and added ray tracing against it.
This BVH is mostly a BVH4 (4-wide), and the typical maximum depth is about 22 levels. The static world's total number of AABBs varies as you move around, but was generally around 60k, the vast majority being leaves.
This same static world is also available as an SDF, not a discrete version but a hierarchical/implicit one, and it can be traced just like the BVH.
I added a single ray that traces from the camera's position in the direction the camera is looking until it hits something.
It does this every frame and then reports stats about it for both the BVH and the SDF trace.
For the BVH it counts the number of box/ray intersection tests performed to hit a leaf or to conclude that the ray hit nothing (sky). This is the scalar count.
For the SDF it counts the number of steps taken to reach the target.
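For reference, a step count like this is typically accumulated with a sphere-tracing style march along the ray. The sketch below is illustrative only; the placeholder sceneSDF, the epsilon, and the step loop are assumptions, not the engine's actual tracing code.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Placeholder distance function: a sphere of radius 5 at the origin stands in
// for the hierarchical/implicit SDF of the static world.
static float sceneSDF(Vec3 p)
{
    return std::sqrt(p.x * p.x + p.y * p.y + p.z * p.z) - 5.0f;
}

// Marches the ray and returns the number of steps taken, i.e. the "SDF step
// count" reported by the stats. hitT receives the distance along the ray.
static int traceSDF(Vec3 origin, Vec3 dir, float maxT, float* hitT)
{
    const float epsilon = 0.01f;  // assumed hit threshold
    float t = 0.0f;
    int steps = 0;
    while (t < maxT)
    {
        ++steps;
        Vec3 p = { origin.x + dir.x * t, origin.y + dir.y * t, origin.z + dir.z * t };
        float d = sceneSDF(p);
        if (d < epsilon) { *hitT = t; return steps; }  // hit a surface
        t += d;  // the distance value is a safe step size
    }
    *hitT = maxT;  // reached maxT without hitting anything (sky)
    return steps;
}
```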
BVH notes
The BVH trace is a series of ray vs. AABB intersection tests, along with sorting so that the nearest intersection is followed first. It only traces 1 ray per call: it is 1 ray vs. N boxes.
My BVH AABBs are fairly tight. I also happened to have a bounding-sphere radius, so I augmented the test by also checking the sphere for rejection. This resulted in about 10-15% fewer intersection tests in some cases. In reality the sphere probably isn't worth it, so the actual intersection counts would go up slightly without it.
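Roughly what one of these per-node tests looks like (a sketch for illustration; the Ray/Node layout here is made up, not the engine's actual data): a standard slab test against the AABB, with the bounding sphere used purely as an early-out. Each call is one test in the scalar count.

```cpp
#include <algorithm>

struct Ray {
    float ox, oy, oz;     // origin
    float dx, dy, dz;     // normalized direction
    float idx, idy, idz;  // 1 / direction, precomputed for the slab test
};

struct Node {
    float minX, minY, minZ, maxX, maxY, maxZ;  // AABB
    float cx, cy, cz, radiusSq;                // bounding sphere (center, radius^2)
};

// Returns true and the entry distance tNear if the ray can hit the node.
static bool intersectNode(const Ray& r, const Node& n, float tMax, float& tNear)
{
    // Sphere rejection: if the infinite line misses the sphere, the ray misses
    // the AABB inside it too. This is the ~10-15% saving mentioned above.
    float ocx = n.cx - r.ox, ocy = n.cy - r.oy, ocz = n.cz - r.oz;
    float proj   = ocx * r.dx + ocy * r.dy + ocz * r.dz;
    float distSq = ocx * ocx + ocy * ocy + ocz * ocz - proj * proj;
    if (distSq > n.radiusSq)
        return false;

    // Standard slab test using the precomputed inverse direction.
    float tx0 = (n.minX - r.ox) * r.idx, tx1 = (n.maxX - r.ox) * r.idx;
    float ty0 = (n.minY - r.oy) * r.idy, ty1 = (n.maxY - r.oy) * r.idy;
    float tz0 = (n.minZ - r.oz) * r.idz, tz1 = (n.maxZ - r.oz) * r.idz;
    float tmin = std::max({ std::min(tx0, tx1), std::min(ty0, ty1), std::min(tz0, tz1), 0.0f });
    float tmax = std::min({ std::max(tx0, tx1), std::max(ty0, ty1), std::max(tz0, tz1), tMax });
    if (tmin > tmax)
        return false;

    tNear = tmin;  // used to sort children nearest-first
    return true;
}
```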
On the CPU, SIMD is possible but not as easily applied as with the SDF.
For a CPU version with AVX2 I'd need a BVH8, but a BVH8 is less optimal than a BVH4. The other alternative is tracing multiple rays at once, but then you have to deal with divergence.
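As a concrete picture of the horizontal approach (a sketch only; the SoA node layout and hit-mask interface are assumptions): store the 4 children of a BVH4 node in struct-of-arrays form and test one ray against all 4 boxes in a single pass, so the loop body maps onto one 4-wide register per component.

```cpp
#include <algorithm>

struct BVH4Node {
    // 4 child AABBs, one array per component (SoA layout).
    float minX[4], minY[4], minZ[4];
    float maxX[4], maxY[4], maxZ[4];
    int   child[4];  // child node index, or a leaf/invalid marker
};

// Tests one ray (origin o*, inverse direction id*) against all 4 child boxes.
// Returns a 4-bit hit mask and writes the entry distance for each child, which
// the traversal then uses to visit the nearest child first.
static int intersect4(const BVH4Node& n,
                      float ox, float oy, float oz,
                      float idx, float idy, float idz,
                      float tMax, float tNear[4])
{
    int mask = 0;
    for (int i = 0; i < 4; ++i)  // same slab test as above, one child per SIMD lane
    {
        float tx0 = (n.minX[i] - ox) * idx, tx1 = (n.maxX[i] - ox) * idx;
        float ty0 = (n.minY[i] - oy) * idy, ty1 = (n.maxY[i] - oy) * idy;
        float tz0 = (n.minZ[i] - oz) * idz, tz1 = (n.maxZ[i] - oz) * idz;
        float tmin = std::max({ std::min(tx0, tx1), std::min(ty0, ty1), std::min(tz0, tz1), 0.0f });
        float tmax = std::min({ std::max(tx0, tx1), std::max(ty0, ty1), std::max(tz0, tz1), tMax });
        tNear[i] = tmin;
        if (tmin <= tmax)
            mask |= 1 << i;
    }
    return mask;
}
```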
For a GPU software version it is hard to see how you wouldn't end up with very large divergence given that the SIMD width is 32 or 64.
I'm not sure what the preferred BVH width is on the GPU; perhaps BVH2 is preferred, and you simply process 32 rays at once. I don't have a BVH2 so I'd have to convert, and that would double the depth. A BVH2 would mean slightly fewer intersection tests but more random memory access; at least the sorting would be simple. Update: this paper suggests an 8-wide BVH works well; they process the 8 children per lane rather than the CPU approach of doing it horizontally.
My use case is highly divergent rays, so optimizations for BVHs that focus on coherent rays don't seem relevant.
Observations:
* A BVH4 with 22 levels means a minimum of around 80 intersection tests to get a hit (a straight descent testing 4 children at each of ~20 internal levels is on the order of 80 box tests). For surfaces near the ray start it was typically 80-120 tests; with longer traces along the ground, counts went up to 200, and in some cases 400.
* The BVH4 when looking directly down at the ground is still around 100 intersection tests, and looking up at the sky generally resulted in 115 to 140. The SDF by comparison takes 2-6 steps looking at the ground and 10-15 looking at the sky. This is a big advantage for the SDF, as this type of trace would be common.
* SDF step counts when looking into the distance typically varied from 15-30, but in some cases went up to around 100.
* Memory-wise the BVH takes about 1.5 MB for the AABBs. Currently the leaves do not contain any triangles or a small SDF, so the trace just reports the hit point on the leaf AABB.
* The SDF is intended to be computed and uploaded to the GPU in discrete form, most likely as a sequence of progressively larger volumes centered around the camera, as with clip maps. Memory usage will depend on the size of each of these volumes, but it will most likely use more than the BVH (see the sizing sketch after this list).
* On the CPU with a 4-wide BVH you can divide the scalar count by 4, which means 20-100 intersection tests per ray, but only 1 ray is processed at a time! The SDF, on the other hand, always processes a full SIMD width of rays at a time. For a GPU version with a BVH2 the iteration count would be very high, probably 80-400, but it would process a SIMD width of rays at a time.
* An 8-wide BVH would not scale linearly: since you cannot collapse 16 children into 8, the depth wouldn't drop much, so the effective scaling would be more like 6 or 7 rather than 8. The sorting also becomes more complex and slower.
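The sizing sketch referenced above, with purely hypothetical numbers; the level count, resolution, and bits per voxel are made-up illustration values, not measurements from the engine.

```cpp
#include <cstdio>

int main()
{
    // All numbers here are assumptions for illustration only.
    const int   levels        = 4;     // nested clip-map-style volumes
    const int   resolution    = 128;   // voxels per side of each volume
    const float bytesPerVoxel = 0.5f;  // e.g. 4 bits per voxel with a block-compressed format

    const float mbPerLevel = resolution * float(resolution) * resolution
                           * bytesPerVoxel / (1024.0f * 1024.0f);
    const float totalMB = mbPerLevel * levels;

    // 128^3 at 4 bits per voxel is 1 MB per level, so 4 levels is ~4 MB,
    // versus roughly 1.5 MB for the BVH AABBs.
    std::printf("%.1f MB per level, %.1f MB total\n", mbPerLevel, totalMB);
    return 0;
}
```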
Conclusion
It would appear that, at least for the scene I tested, the SDF is the winner: it scales perfectly with SIMD width on the GPU thanks to hardware-accelerated volume sampling, and it almost always takes fewer steps than the BVH takes intersection tests to reach the target, especially in common cases such as rays heading toward the sky or straight down at the ground.
A software BVH on the GPU also sounds pretty slow.
The primary advantage of the BVH is likely that it is faster to build, so it would, for instance, be better suited to rebuilding each frame for an animated character. And of course hardware acceleration exists for it, even if the performance seems lacking.
The SDF also results in a hit on the actual surface here, whereas the BVH still needs something in the leaves, like a 2^3 SDF or triangles.
*Side Note:
I really wish MS/Nvidia/AMD would get together and create a better format for block-compressed SDFs.
Currently the only available format worth mentioning is BC4, but its blocks are 4x4 2D slices; there is no 3D block format available on Windows.
BC4 is also less than ideal in other regards, as it has 2 modes, with the 2nd mode being fairly useless for an SDF.
I created 2 formats for SDFs that I use in software. The block size is 2^3 voxels; one format is 4 bits per voxel and the other is 8 bits per voxel, giving block sizes of 4 bytes and 8 bytes.
The 4-byte format is: an 8-bit min, an 8-bit ratio between the min and 255, and 8 x 2-bit indices.
The 8-byte format is: an 8-bit min, an 8-bit ratio between the min and 255, and 8 x 6-bit indices. <-- (this is my preferred format)
These are easy to encode and easily beat BC4 for quality; please add something like this for GPU formats!
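A decode sketch for the 8-byte format follows, assuming one plausible reading of the layout: the ratio scales the block's range from min up toward 255, and each 6-bit index interpolates linearly within that range. The struct layout and reconstruction formula are assumptions for illustration, not a spec.

```cpp
#include <cstdint>

struct SDFBlock8 {
    uint8_t minValue;    // 8-bit minimum value in the 2x2x2 block
    uint8_t ratio;       // 8-bit ratio defining the range between min and 255
    uint8_t indices[6];  // 8 x 6-bit indices packed into 48 bits
};
static_assert(sizeof(SDFBlock8) == 8, "block must be 8 bytes");

// Decodes voxel i (0..7) of a block back to an 8-bit-scale distance value.
static float decodeVoxel(const SDFBlock8& b, int i)
{
    // Gather the 48 index bits into one integer and pull out voxel i's 6 bits.
    uint64_t bits = 0;
    for (int byteIdx = 0; byteIdx < 6; ++byteIdx)
        bits |= uint64_t(b.indices[byteIdx]) << (8 * byteIdx);
    uint32_t index = uint32_t(bits >> (6 * i)) & 0x3F;

    // Assumed reconstruction: the block covers the range
    // [min, min + ratio/255 * (255 - min)], and the index interpolates within it.
    float minV  = float(b.minValue);
    float range = (float(b.ratio) / 255.0f) * (255.0f - minV);
    return minV + range * (float(index) / 63.0f);
}
```

The 4-byte version would decode the same way with 2-bit indices (mask 0x3 and divide by 3 instead of 63).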