
SV_PrimitiveID without the perf hit

  On Nvidia and AMD, accessing SV_PrimitiveID in a pixel shader incurs a fairly substantial performance hit.

For the scene shown, rasterizing the visibility buffer went from 3.35 ms to 5.45 ms when SV_PrimitiveID was accessed in the pixel shader. The scene rasterizes 30.5 million triangles per frame.


The thing is you kind of need primitive ID for visibility buffers...

 It turns out that you can access SV_PrimitiveID without the performance hit on Nvidia if you do a little song and dance involving creating a so-called fast geometry shader.

 You need NVAPI, Nvidia's driver extension library.

 They have a function NvAPI_D3D11_CreateFastGeometryShader you can use instead of the standard D3D11 CreateGeometryShader.

If you feed it a GS that follows a restricted set of rules, it will produce a very fast GS without the overhead usually associated with geometry shaders. And oddly enough, you can read SV_PrimitiveID in this GS and pass it down to the PS, and what do you know, no performance hit.

 So yes, adding a GS that does almost nothing makes the shader complete in about 60% of the time.

 

What about AMD?

  An alternative to emitting the primitive ID is to pass down the 3 vertex IDs and write those out instead, but this requires a larger visibility buffer (64 bits instead of 32).

 AMD has driver extensions for accessing any of the 3 vertices from the PS, so you can pass down SV_VertexID, then access and write these out. With D3D12 this functionality is standard.

 On Nvidia I can toggle between the triangle ID and 3-vertex-ID variants and I do not see any performance difference (RTX 2060), so performance appears to be on par.

Intel? 

Supposedly Intel has no performance issues accessing SV_PrimitiveID in the PS, but I do not have an Intel GPU to test on.


Why does SV_PrimitiveID cause such performance issues on Nvidia etc?

 It is unclear to me exactly what is going on, but one thing I've noticed is that vertex cache optimizations (Tipsify/Forsyth/random) make no difference when SV_PrimitiveID is accessed in the PS on Nvidia.

So perhaps it is somehow disabling the vertex cache.


BVH vs SDF for ray tracing

  With GPU ray tracing being very popular of late I decided to do a comparison between BVH and SDF based ray tracing.

  I plan to add a better global illumination model; this will require tracing some rays on the GPU, but I cannot depend on hardware ray tracing being available.

 I was already generating a BVH for each object, so I took the BVH generated for the static world and added ray tracing against it. 

 This BVH happens to be mostly a BVH4 (4 wide), and the maximum depth is typically around 22 levels. The static world's total AABB count varies as you move around, but was generally around 60k, the vast majority being leaves.

 The same static world is also available as an SDF; not a discrete version, but a hierarchical/implicit one. Either way, it can be traced just like the BVH.

I added a single ray that traces from the camera's position in the direction the camera is looking until it hits something.

It does this every frame, then reports stats about it for both the BVH and the SDF trace.

For BVH it counts the number of box/ray intersections performed to hit the leaf or to conclude that it hit nothing (sky). This is the scalar count.

For SDF it counts the number of steps taken to reach the target.
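For reference, the SDF step loop is just standard sphere tracing: sample the distance field, step forward by that distance, repeat. A minimal Python sketch; the analytic sphere SDF here is a stand-in for the hierarchical world SDF, and the constants are illustrative:

```python
# Minimal sphere-tracing sketch. The SDF here is a stand-in analytic
# sphere; the real tracer samples the hierarchical/implicit world SDF.

def sphere_sdf(p, center=(0.0, 0.0, 5.0), radius=1.0):
    # Signed distance from point p to a sphere.
    d = [p[i] - center[i] for i in range(3)]
    return (d[0] * d[0] + d[1] * d[1] + d[2] * d[2]) ** 0.5 - radius

def trace(origin, direction, sdf, max_steps=64, eps=1e-3, t_max=100.0):
    # March along the ray, stepping by the sampled distance each time.
    # Returns (hit, t, steps) so the step count can be reported as a stat.
    t = 0.0
    for step in range(1, max_steps + 1):
        p = [origin[i] + direction[i] * t for i in range(3)]
        d = sdf(p)
        if d < eps:
            return True, t, step       # close enough: report a hit
        t += d
        if t > t_max:
            return False, t, step      # left the scene: sky
    return False, t, max_steps

hit, t, steps = trace((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), sphere_sdf)
```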

BVH notes

The BVH trace is a series of AABB-vs-ray intersections, with sorting so that it follows the nearest intersection first. It can only trace one ray per call; it is 1 ray vs N boxes.
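The traversal loop looks roughly like this. A Python sketch with an illustrative node layout (not my actual structures), counting every box test performed, which corresponds to the scalar count above:

```python
# 1-ray-vs-N-boxes BVH trace sketch. Node layout is illustrative:
# internal nodes have a "children" list, leaves just have an AABB.

def ray_aabb(orig, inv_dir, lo, hi):
    # Slab test: returns entry t if the ray hits the box, else None.
    tmin, tmax = 0.0, float("inf")
    for i in range(3):
        t0 = (lo[i] - orig[i]) * inv_dir[i]
        t1 = (hi[i] - orig[i]) * inv_dir[i]
        if t0 > t1:
            t0, t1 = t1, t0
        tmin = max(tmin, t0)
        tmax = min(tmax, t1)
    return tmin if tmin <= tmax else None

def trace(orig, direction, root):
    # Depth-first, following the nearest child box first, counting
    # every box/ray test performed.
    inv = [1.0 / d if d != 0.0 else float("inf") for d in direction]
    tests = 0
    stack = [root]
    while stack:
        node = stack.pop()
        if "children" not in node:      # leaf: report hit on its AABB
            tests += 1
            t = ray_aabb(orig, inv, node["lo"], node["hi"])
            if t is not None:
                return t, tests
            continue
        hits = []
        for child in node["children"]:
            tests += 1
            t = ray_aabb(orig, inv, child["lo"], child["hi"])
            if t is not None:
                hits.append((t, child))
        # Push far-to-near so the nearest child is popped first.
        for t, child in sorted(hits, key=lambda h: h[0], reverse=True):
            stack.append(child)
    return None, tests                  # hit nothing (sky)

root = {"children": [
    {"lo": (0.0, 0.0, 2.0), "hi": (1.0, 1.0, 3.0)},   # near leaf
    {"lo": (0.0, 0.0, 5.0), "hi": (1.0, 1.0, 6.0)},   # far leaf
]}
t, tests = trace((0.5, 0.5, -1.0), (0.0, 0.0, 1.0), root)
```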

My BVH AABBs are fairly tight. I also happened to have a bounding sphere radius, so I augmented the test by also checking against the sphere for rejection. This resulted in about 10-15% fewer intersections in some cases. In reality the sphere probably isn't worth it, so the actual intersection count would go up slightly.

On the CPU, SIMD is possible but not as easily applied as with the SDF.

 For a CPU version with AVX2 I'd need a BVH8, but BVH8 is less optimal than BVH4. The other alternative is tracing multiple rays at once, but then you have to deal with divergence.

For a GPU software version it is hard to see how you wouldn't end up with very large divergence given that the SIMD width is 32 or 64. 

I'm not sure what the preferred BVH width is on the GPU; perhaps BVH2 is preferred, and you simply process 32 rays at once. I don't have a BVH2, so I'd have to convert, and it would double the depth. BVH2 would mean slightly fewer intersections but more random memory access; at least the sorting would be simple. Update: this paper suggests an 8-wide BVH works well; they do 8 per lane rather than the CPU approach of doing it horizontally.

 My use case is highly divergent rays, so optimizations for BVHs that focus on coherent rays don't seem relevant.


Observations:

* BVH4 with 22 levels means a minimum of around 80 intersections to hit; typically, for surfaces near the ray start, it was 80-120 intersections. With longer ray traces along the ground, intersection counts went up to 200, and in some cases 400.

* BVH4 when looking directly down at the ground is still around 100 intersections, and looking up at the sky generally resulted in 115 to 140. The SDF by comparison is 2-6 steps looking at the ground and 10-15 looking at the sky. This is a big advantage for the SDF, as this type of trace would be common.

* SDF step count when looking into the distance typically varied from 15-30, but in some cases went up to around 100.

* Memory-wise the BVH takes about 1.5 MB for the AABBs; currently the leaves do not contain any triangles or a small SDF, so it just reports the hit point on the leaf AABB.

* The SDF is intended to be computed and uploaded to the GPU in discrete form, most likely as a sequence of progressively larger volumes centered around the camera, as with clip maps. Memory usage will depend on the size of each of these, but it will most likely use more than the BVH.

* On the CPU with a 4-wide BVH you can divide the count by 4, which means 20-100 intersection tests per ray, but we only process one ray at a time! The SDF, on the other hand, always processes a full SIMD width of rays at a time. For a GPU version with BVH2 the iteration count would be very high, probably 80-400, but it would process a SIMD width of rays at a time.

* An 8-wide BVH would not scale linearly, as you cannot collapse 16 children into 8, so the depth wouldn't drop much; scaling would be worse, perhaps 6 or 7 rather than 8. Also, the sorting becomes more complex and slower.

Conclusion

It would appear that, at least for the scene I tested, the SDF is the winner: it scales perfectly with SIMD width on the GPU thanks to hardware-accelerated volume sampling, and it almost always takes fewer steps than the BVH takes intersection tests to reach the target, especially in common cases such as rays heading toward the sky or straight down at the ground.

Also the software BVH on the GPU sounds pretty slow..

The primary advantage of the BVH is likely that it is faster to build, so it would be better suited to, for instance, rebuilding each frame for an animated character. And of course hardware acceleration exists for it, even if the performance seems lacking.

The SDF also yields a hit on the actual surface here, whereas the BVH still needs something in the leaves, like a 2^3 SDF or triangles.



*Side Note:
I really wish MS/Nvidia/AMD would get together and create a better format for block compressed SDFs. 

Currently the only available format worth mentioning is BC4, but it is a 2D 4x4 slice; there is no 3D block format available on Windows.

BC4 is also less than ideal in other regards, as it has 2 modes, with the 2nd mode being fairly useless for an SDF.

I created 2 formats for SDFs that I use in software. Block size is 2^3; one is 4 bits per texel and the other is 8 bits per texel, resulting in block sizes of 4 bytes and 8 bytes.

Format for the 4-byte version: 8-bit min, 8-bit ratio between min/255, and eight 2-bit indices.

Format for the 8-byte version: 8-bit min, 8-bit ratio between min/255, and eight 6-bit indices. <--(this is my preferred format)
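For illustration, here is roughly how the 8-byte version could encode and decode. Treating the second field as the range above the min is my reading of the "ratio" field, so take the exact scale as an assumption:

```python
# Sketch of the 8-byte 2x2x2 block format described above: an 8-bit min,
# an 8-bit range field (my reading of the "ratio"; the exact scale is an
# assumption), and eight 6-bit indices. Total: 16 + 48 bits = 8 bytes.

def encode_block(texels):
    # texels: 8 values in [0, 255] for one 2x2x2 block.
    lo, hi = min(texels), max(texels)
    rng = hi - lo
    idx = []
    for v in texels:
        # Quantize each texel to a 6-bit index within [lo, lo + rng].
        idx.append(round((v - lo) / rng * 63) if rng else 0)
    return lo, rng, idx

def decode_block(lo, rng, idx):
    # Reconstruct each texel from its 6-bit index.
    return [lo + rng * i / 63 for i in idx]

block = [10, 20, 30, 40, 50, 60, 70, 80]
lo, rng, idx = encode_block(block)
decoded = decode_block(lo, rng, idx)
```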

These are easy to encode and easily beat BC4 for quality, please add something like this for GPU formats!

Dynamic Shadows with SDFs

 For dynamic shadows I use SDF volumes; they are generated at runtime for each dynamic object.

Storage support: f16, u8 or BC4

 I can toggle between these formats at runtime to compare them; there isn't much difference, although fp16 obviously has the best overall quality.

 Performance is quite good. The tracing is done during the visibility buffer shading pass, with each draw call ID having an associated list of shadows to process.

There is a fast shadow mode that uses the nearest point on the line from the sphere to get a quick sample location; this only requires a single volume read but is very approximate.

Or you can enable tracing, where it loops and traces the SDF; 8 loops gets a solid result, but 16 is better.

 It exits early once the shadow reaches 0, or once we have traveled to a point where no further shadowing can occur. The early exits are very important; without them, perf is *not* good.

 There is also an option to enable screen space shadows to augment the SDF shadows, but it is mostly only useful for objects with really skinny bits that aren't well captured by the SDF. 

 It uses the SDF closest hit location as the starting point for screen space tracing.

Perf is worse for screen space shadows than for SDF shadows, and it doesn't add much overall, so it is off by default.

 The dynamic shadows are intended to look very similar to the baked shadows, so they use a nearly identical method to produce results.

The reason they should look similar is that when a dynamic object stops moving, after some time it gets baked into the world; this transition is less noticeable if the lighting & shadows match up.

In this image the left shadow is baked while the right shadow is dynamic.

 

Other tidbits:

A hash of the program used to generate the SDF is computed; this is used to share the SDF between identical dynamic objects.

 If the first hash fails, then after generating the SDF and encoding it, a hash of the encoded SDF is also computed; this is to share the uploaded GPU SDF slot, as it is possible that the program was different but the final results after range normalization and compression are identical. This primarily comes about from changes in the scale of the shape or one of the ops. BC4 benefits from this the most, as it is far more likely to end up identical.
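The two-level sharing can be sketched like this. The names, the use of SHA-1, and the cache shapes are illustrative, not my actual code:

```python
# Sketch of the two-level SDF sharing scheme: first key on a hash of the
# generating program, and only on a miss, generate + encode and key on a
# hash of the encoded bytes to share the uploaded GPU slot.
import hashlib

program_cache = {}   # program hash  -> encoded SDF bytes
gpu_slot_cache = {}  # encoded hash  -> GPU slot id

def get_sdf_slot(program_desc, generate_and_encode):
    prog_key = hashlib.sha1(program_desc.encode()).hexdigest()
    if prog_key in program_cache:
        encoded = program_cache[prog_key]          # first-level hit
    else:
        encoded = generate_and_encode(program_desc)
        program_cache[prog_key] = encoded
    # Second level: different programs can still compress to identical
    # bytes (e.g. scale changes that vanish after range normalization).
    data_key = hashlib.sha1(encoded).hexdigest()
    if data_key not in gpu_slot_cache:
        gpu_slot_cache[data_key] = len(gpu_slot_cache)  # new upload
    return gpu_slot_cache[data_key]

def fake_encode(desc):
    # Stand-in encoder: both shapes happen to compress to the same bytes.
    return b"encoded-sdf-blob"

slot_a = get_sdf_slot("sphere r=1.0", fake_encode)
slot_b = get_sdf_slot("sphere r=1.001", fake_encode)
```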