Here is a screenshot showing a debug mode in the game that renders the bounding boxes for edits near the player.
As you can see, there are many of them, and this is in a relatively small area.
Even a small world can end up with millions of these edits.
Now imagine you want to query a region of space that is quite large, perhaps 1km across!
First we compile a copy of the world that overlaps the 1km bounds.
This gives us a program we can run to evaluate the SDF within those bounds. We did manage to remove everything outside the bounds, but we are still left with an extremely large program...
Now we need a second culling system! This one runs during the evaluation of points.
How it works
The way it works in the game is that we also generate a list of instruction AABBs, which get stored in a SIMD-friendly structure. For culling, if we find that none of our points overlap an instruction's bounds, we skip the program instruction range it would otherwise have evaluated.
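As a rough illustration, a SIMD-friendly layout for these bounds might look like the following structure-of-arrays sketch (the type and field names here are illustrative, not the game's actual code): each axis is stored contiguously so eight bounds at a time load straight into AVX2 registers, and each AABB carries the instruction range to skip on a miss.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of a SIMD-friendly instruction-AABB store.
struct InstructionBounds
{
    // One entry per instruction AABB, each axis contiguous so
    // _mm256_loadu_ps can fetch 8 bounds at once.
    std::vector<float> minX, minY, minZ;
    std::vector<float> maxX, maxY, maxZ;

    // Program instruction range to skip when the AABB overlaps no points.
    std::vector<uint32_t> firstInstruction;
    std::vector<uint32_t> instructionCount;
};
```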
Evaluating every point against every instruction AABB would be slow and defeat the purpose of this, so multiple coarse layers are used.
The first layer takes in our SIMD_WIDTH * 32 points (so for AVX2 this is 8*32 = 256), and computes a single AABB that contains all of them.
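A minimal sketch of how that coarse AABB might be built, assuming AVX2 and points stored one axis per contiguous array (the helper name is hypothetical):

```cpp
#include <immintrin.h>
#include <cfloat>

// Min/max-reduce one axis of the SIMD_WIDTH * 32 = 256 point batch.
// Called once each for x, y, z to produce the single coarse point AABB.
inline void AxisMinMax( const float* v, float& outMin, float& outMax )
{
    __m256 mn = _mm256_set1_ps(  FLT_MAX );
    __m256 mx = _mm256_set1_ps( -FLT_MAX );
    for( int i = 0; i < 256; i += 8 )
    {
        __m256 p = _mm256_loadu_ps( v + i );
        mn = _mm256_min_ps( mn, p );
        mx = _mm256_max_ps( mx, p );
    }
    // Horizontal reduction of the 8 lanes down to scalars.
    alignas(32) float lanesMin[8], lanesMax[8];
    _mm256_store_ps( lanesMin, mn );
    _mm256_store_ps( lanesMax, mx );
    outMin = lanesMin[0]; outMax = lanesMax[0];
    for( int i = 1; i < 8; ++i )
    {
        if( lanesMin[i] < outMin ) outMin = lanesMin[i];
        if( lanesMax[i] > outMax ) outMax = lanesMax[i];
    }
}
```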
Once we have this first coarse point AABB, we loop over all instruction AABBs and process them N at a time with SIMD. This generates a 1-bit overlap mask per instruction AABB.
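Here is a sketch of what that N-at-a-time test could look like with AVX2 (N = 8), assuming the SoA layout above and the coarse point AABB splatted into registers; a single movemask yields the 1-bit-per-AABB result:

```cpp
#include <immintrin.h>
#include <cstdint>

// Test the coarse point AABB against 8 instruction AABBs at once.
// Returns one overlap bit per instruction AABB.
inline uint8_t OverlapMask8(
    const float* minX, const float* minY, const float* minZ,
    const float* maxX, const float* maxY, const float* maxZ,
    __m256 coarseMinX, __m256 coarseMinY, __m256 coarseMinZ,
    __m256 coarseMaxX, __m256 coarseMaxY, __m256 coarseMaxZ )
{
    // Overlap on an axis: pointMax >= boundsMin AND pointMin <= boundsMax.
    __m256 x = _mm256_and_ps(
        _mm256_cmp_ps( coarseMaxX, _mm256_loadu_ps( minX ), _CMP_GE_OQ ),
        _mm256_cmp_ps( coarseMinX, _mm256_loadu_ps( maxX ), _CMP_LE_OQ ) );
    __m256 y = _mm256_and_ps(
        _mm256_cmp_ps( coarseMaxY, _mm256_loadu_ps( minY ), _CMP_GE_OQ ),
        _mm256_cmp_ps( coarseMinY, _mm256_loadu_ps( maxY ), _CMP_LE_OQ ) );
    __m256 z = _mm256_and_ps(
        _mm256_cmp_ps( coarseMaxZ, _mm256_loadu_ps( minZ ), _CMP_GE_OQ ),
        _mm256_cmp_ps( coarseMinZ, _mm256_loadu_ps( maxZ ), _CMP_LE_OQ ) );

    // One bit per instruction AABB: set only if all three axes overlap.
    return (uint8_t)_mm256_movemask_ps(
        _mm256_and_ps( x, _mm256_and_ps( y, z ) ) );
}
```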
If there was an overlap, we enter the second layer.
This is a more refined culling pass that checks the SIMD_WIDTH points against that specific instruction AABB; it does this for all 32 groups and outputs 1 bit for each.
This results in a 32-bit mask: if a bit is set we must process that SIMD_WIDTH set of points, and if all bits are 0 we can skip all 256 points. The loop then operates using BitScanForward/zero lowest bit, rather than the traditional iteration from 0 to 32.
The instruction bounds also have a similar bit mask that allows for skipping up to 64 bounds at a time with BitScanForward.
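Both masks use the classic iterate-set-bits idiom. A sketch, assuming C++20 (std::countr_zero is the portable equivalent of BitScanForward, and mask &= mask - 1 zeroes the lowest set bit); EvaluateGroup is a hypothetical stand-in for the real point evaluation:

```cpp
#include <bit>
#include <cstdint>

// Hypothetical: evaluate one group of SIMD_WIDTH points against the program.
void EvaluateGroup( int groupIndex );

// Walk only the set bits of the layer-2 mask; the same idiom applies to the
// 64-bit mask over instruction bounds (with uint64_t).
void ProcessSurvivors( uint32_t mask )
{
    while( mask )
    {
        int group = std::countr_zero( mask ); // next surviving point group
        mask &= mask - 1;                     // zero the lowest set bit
        EvaluateGroup( group );               // only SIMD_WIDTH points touched
    }
    // mask == 0 on entry means all 256 points were culled outright.
}
```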
Results
(this was written before I added the 2nd layer so it is slightly out of date; with the 2nd layer the culling is more accurate and would show even better results than this)
Now given that most evaluations aren't processing one point but hundreds to thousands at a time, and our culling is somewhat coarse (N = multiples of 32) because we don't wish it to dominate program time, does the order in which the points are passed in matter? Well yes, yes it does.
The default order is more or less xyz, with some pseudo-randomness based on whatever the program happened to do to spit the points out. This is not a great order, and results in significantly more bounds overlaps being true.
Anyway, I instrumented the program and aggregated the *not* culled rate using xyz, Morton, and Hilbert order.
Here are the results, lower being better:
- xyz: 0.222
- Morton: 0.1159
- Hilbert: 0.09
While these numbers don't look too different, keep in mind the speedup is relative, so going from xyz to Hilbert is 0.222/0.09 ≈ 2.5x. I'll take that!
The speedup from Morton to Hilbert is 1.287x (0.1159/0.09), and I find this to be worth the extra cost (Hilbert is about 4x slower to compute, but computing the indices isn't a huge factor).
The Morton/Hilbert approaches do require sorting and generating the respective indices; thankfully I have SIMD-optimized code for this, so the cost is pretty minor.
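For reference, a scalar sketch of the Morton variant (the game's index generation is SIMD optimized, and Hilbert needs a different encoding): quantize each coordinate to 10 bits, interleave the bits, then sort point indices by the resulting key.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Spread the low 10 bits of x so bit i lands at bit 3i (standard trick).
static uint32_t Part1By2( uint32_t x )
{
    x &= 0x000003ff;
    x = ( x ^ ( x << 16 ) ) & 0xff0000ff;
    x = ( x ^ ( x <<  8 ) ) & 0x0300f00f;
    x = ( x ^ ( x <<  4 ) ) & 0x030c30c3;
    x = ( x ^ ( x <<  2 ) ) & 0x09249249;
    return x;
}

// 30-bit Morton key from three 10-bit quantized coordinates.
static uint32_t Morton3D( uint32_t x, uint32_t y, uint32_t z )
{
    return Part1By2( x ) | ( Part1By2( y ) << 1 ) | ( Part1By2( z ) << 2 );
}

// Reorder point indices along the curve so nearby points land in the same
// SIMD_WIDTH * 32 batches, tightening the coarse point AABBs.
std::vector<uint32_t> CurveOrder( const std::vector<uint32_t>& keys )
{
    std::vector<uint32_t> order( keys.size() );
    std::iota( order.begin(), order.end(), 0u );
    std::sort( order.begin(), order.end(),
        [&]( uint32_t a, uint32_t b ) { return keys[a] < keys[b]; } );
    return order;
}
```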
I enable this dynamically based on the number of bounds: once it goes over a certain threshold, Hilbert order is imposed.