
Bounds Culling: xyz vs morton vs hilbert

 


Here is a screen shot showing a debug mode in the game that renders the bounding boxes for edits near the player.

As you can see there are many of them, and this is in a relatively small area.

Even a small world can end up with millions of these edits.

Now imagine you want to query a region of space that is quite large, perhaps 1km across!  

First we compile a copy of the world that overlaps the 1km bounds.

This is the program we can run to evaluate the SDF within those bounds. We did manage to remove everything outside the bounds, but we still have an extremely large program...

 Now we need a second culling system! This one runs during the evaluation of points.  


 How it works 

The way it works in the game is that we also generate a list of instruction AABBs, which get stored in a SIMD-friendly structure. For culling, if we find that none of our points overlap an instruction's bounds, we skip the program instruction range that would otherwise have been evaluated.

Evaluating every point against every instruction AABB would be slow and defeat the purpose of this, so multiple coarse layers are used.

The first layer takes in SIMD_WIDTH * 32 points (so for AVX2 this is 8 * 32 = 256) and computes a single AABB that contains all of them.

Once we have this first coarse point AABB, we loop over all instruction AABBs and process them N at a time with SIMD. This generates a 1-bit mask per instruction AABB.
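As a rough sketch (not the game's actual code; the SoA layout and function names are my assumptions), layer one might look something like this with AVX2: build the coarse AABB of the 256 points, then test it against 8 instruction AABBs per iteration and collect one bit per AABB.

    #include <immintrin.h>
    #include <cstdint>

    // Assumed SIMD-friendly (SoA) storage for the instruction AABBs: six parallel
    // float arrays, padded so they can be consumed 8 at a time with AVX2.
    struct InstructionBoundsSoA {
        const float *minX, *minY, *minZ;
        const float *maxX, *maxY, *maxZ;
        int count;
    };

    // Min/max of one axis over the SIMD_WIDTH * 32 = 256 points (repeat per axis
    // to build the coarse point AABB, then broadcast with _mm256_set1_ps).
    static void CoarseMinMax256(const float* p, float& outMin, float& outMax)
    {
        __m256 mn = _mm256_loadu_ps(p), mx = mn;
        for (int i = 8; i < 256; i += 8) {
            __m256 v = _mm256_loadu_ps(p + i);
            mn = _mm256_min_ps(mn, v);
            mx = _mm256_max_ps(mx, v);
        }
        alignas(32) float lo[8], hi[8];
        _mm256_store_ps(lo, mn);
        _mm256_store_ps(hi, mx);
        outMin = lo[0]; outMax = hi[0];
        for (int i = 1; i < 8; ++i) {
            outMin = lo[i] < outMin ? lo[i] : outMin;
            outMax = hi[i] > outMax ? hi[i] : outMax;
        }
    }

    // Test the (broadcast) coarse point AABB against 8 instruction AABBs starting
    // at index i, returning one overlap bit per instruction AABB.
    static uint32_t OverlapMask8(const InstructionBoundsSoA& b, int i,
                                 __m256 pMinX, __m256 pMinY, __m256 pMinZ,
                                 __m256 pMaxX, __m256 pMaxY, __m256 pMaxZ)
    {
        __m256 ox = _mm256_and_ps(
            _mm256_cmp_ps(pMaxX, _mm256_loadu_ps(b.minX + i), _CMP_GE_OQ),
            _mm256_cmp_ps(pMinX, _mm256_loadu_ps(b.maxX + i), _CMP_LE_OQ));
        __m256 oy = _mm256_and_ps(
            _mm256_cmp_ps(pMaxY, _mm256_loadu_ps(b.minY + i), _CMP_GE_OQ),
            _mm256_cmp_ps(pMinY, _mm256_loadu_ps(b.maxY + i), _CMP_LE_OQ));
        __m256 oz = _mm256_and_ps(
            _mm256_cmp_ps(pMaxZ, _mm256_loadu_ps(b.minZ + i), _CMP_GE_OQ),
            _mm256_cmp_ps(pMinZ, _mm256_loadu_ps(b.maxZ + i), _CMP_LE_OQ));
        __m256 all = _mm256_and_ps(ox, _mm256_and_ps(oy, oz));
        return (uint32_t)_mm256_movemask_ps(all);   // bit k set = AABB i+k overlaps
    }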

If there was an overlap, then we enter the second layer.

This is a more refined culling pass that checks each group of SIMD_WIDTH points against that specific instruction AABB; it does this for all 32 groups and outputs 1 bit for each.

This results in a 32-bit mask: if a bit is on we must process that SIMD_WIDTH-sized set of points, and if all bits are 0 we can skip all 256 points. The loop then operates using BitScanForward / zero-the-lowest-set-bit, rather than the traditional iteration from 0 to 32.
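The skip loop itself is just the usual bit-iteration trick; a minimal sketch (names are mine, not the game's):

    #include <cstdint>
    #ifdef _MSC_VER
    #include <intrin.h>
    #endif

    // Visit only the set bits of the 32-bit mask from layer two. Each set bit
    // selects one SIMD_WIDTH-sized group of points to evaluate; a zero mask skips
    // all 256 points without entering the loop at all.
    template <typename Fn>
    static void ForEachSetBit(uint32_t mask, Fn&& visit)
    {
        while (mask) {
    #ifdef _MSC_VER
            unsigned long group;
            _BitScanForward(&group, mask);   // index of the lowest set bit
    #else
            unsigned group = (unsigned)__builtin_ctz(mask);
    #endif
            visit(group);                    // evaluate points [group * SIMD_WIDTH, ...)
            mask &= mask - 1;                // zero the lowest set bit
        }
    }

(The same pattern works for the 64-bit instruction-bounds mask below, using _BitScanForward64 / __builtin_ctzll.)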

The instruction bounds also have a similar bit mask that allows skipping up to 64 bounds at a time with BitScanForward.

 

Results

(This was written before I added the 2nd layer, so it is slightly out of date; with the 2nd layer the culling is more accurate and would show even better results than this.)

Now given that most evaluations aren't processing one point, but hundreds to thousands at a time, and our culling is somewhat coarse (N = multiples of 32) because we don't wish it to dominate program time, does the order in which the points are passed in matter? Well yes, yes it does.

The default order is more or less xyz, with some pseudo-randomness based on whatever order the program happened to spit the points out in. This is not a great order, and it results in significantly more bounds overlaps being true.

 Anyway I instrumented the program and aggregated the *not* culled rate using xyz, morton and hilbert. 

Here are the results, lower being better.

  • xyz      0.222
  • Morton   0.1159
  • Hilbert  0.09

While these numbers don't look too different, keep in mind the speedup is relative, so going from xyz to Hilbert is 0.222 / 0.09 ≈ 2.5x. I'll take that!

The speedup from Morton to Hilbert is 1.287x, and I find this to be worth the extra cost (Hilbert is about 4x slower to compute, but computing the indices isn't a huge factor).

The Morton/Hilbert approaches do require generating the respective indices and sorting by them; thankfully I have SIMD-optimized code for this, so the cost is pretty minor.
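For reference, a scalar Morton-order sketch of the sort-by-key idea (the game's version is SIMD and also supports Hilbert; the Point layout and 10-bit quantization here are assumptions):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Spread the low 10 bits of v so two zero bits sit between consecutive bits.
    static uint32_t Part1By2(uint32_t v)
    {
        v &= 0x000003ff;
        v = (v ^ (v << 16)) & 0xff0000ff;
        v = (v ^ (v << 8))  & 0x0300f00f;
        v = (v ^ (v << 4))  & 0x030c30c3;
        v = (v ^ (v << 2))  & 0x09249249;
        return v;
    }

    // 30-bit Morton code from 10-bit quantized x/y/z cell coordinates.
    static uint32_t Morton3D(uint32_t x, uint32_t y, uint32_t z)
    {
        return (Part1By2(z) << 2) | (Part1By2(y) << 1) | Part1By2(x);
    }

    // Hypothetical point with pre-quantized cell coordinates.
    struct Point { uint32_t cx, cy, cz; /* plus the actual float position */ };

    // Produce an index order so that spatially nearby points end up in the same
    // 256-point group before evaluation.
    static std::vector<uint32_t> MortonOrder(const std::vector<Point>& pts)
    {
        std::vector<uint32_t> order(pts.size());
        for (uint32_t i = 0; i < (uint32_t)pts.size(); ++i) order[i] = i;
        std::sort(order.begin(), order.end(), [&](uint32_t a, uint32_t b) {
            return Morton3D(pts[a].cx, pts[a].cy, pts[a].cz)
                 < Morton3D(pts[b].cx, pts[b].cy, pts[b].cz);
        });
        return order;
    }

Hilbert order follows the same sort-by-key pattern, just with a more expensive key function.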

I am enabling this dynamically based on the number of bounds; once it goes over a certain threshold, Hilbert order is imposed.
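Conceptually that gate is just a size check; a trivial sketch (the threshold value is made up, not the actual number used):

    #include <cstddef>

    // Placeholder threshold; the real value is tuned, not given here.
    constexpr std::size_t kHilbertBoundsThreshold = 4096;

    // Only pay for index generation + sorting when there are enough instruction
    // bounds for the better ordering to matter.
    static bool ShouldUseHilbertOrder(std::size_t numInstructionBounds)
    {
        return numInstructionBounds > kHilbertBoundsThreshold;
    }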


Displacement Format


 I'm changing the displacement map format. 

The current format is a histogram normalized u8, generated from the source maps which are u16.

When zooming in there simply isn't enough precision, and we start to see stair stepping from the u8.


So the format I am switching to is roughly based on BC4, but as I'm decoding on the CPU I decided to simplify it. 

BC4 stores, per 4x4 pixels, a min and max value, and a 3-bit interpolation weight per pixel. The total size is 64 bits per 4x4 block (4 bits per pixel), half that of u8.

First I removed the 2nd mode, where the order of min/max is swapped; this simplifies the decoder.

Second I changed it so that rather than storing a min and max value per 4x4, we store a min and a ratio between min and 255.

This changes the precision range from 3-11 to 3-19 bits and is only possible because I removed the 2nd mode.

Third I changed the way the indices are stored, so that 0-7 maps directly to the lerp ratio. 

I have no idea why BC4 stores its indices in such a random order, but it makes it much harder to decode on the CPU.

I'll refer to my format as DBlock.
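To make the format concrete, here is a scalar decode sketch. The exact bit packing isn't spelled out here, so the layout below (8-bit min, 8-bit ratio, 16 x 3-bit weights) and the "ratio of the remaining range up to 255" interpretation are my assumptions, and the real decoder is SIMD rather than scalar.

    #include <cstdint>

    // Assumed DBlock layout for one 4x4 tile (64 bits total):
    //   bits  0..7  : min   (8 bits)
    //   bits  8..15 : ratio (8 bits) - fraction of the remaining [min, 255] range
    //   bits 16..63 : 16 x 3-bit weights, pixel 0 in the low bits; a weight of
    //                 0..7 maps directly to a lerp ratio of 0..1
    static float DecodeDBlockTexel(uint64_t block, uint32_t px, uint32_t py)
    {
        float mn    = (float)( block       & 0xFF);
        float ratio = (float)((block >> 8) & 0xFF) * (1.0f / 255.0f);
        float range = (255.0f - mn) * ratio;          // assumed reading of "min and ratio"

        uint32_t texel  = (py & 3) * 4 + (px & 3);    // 0..15 within the 4x4 tile
        uint32_t weight = (uint32_t)(block >> (16 + texel * 3)) & 0x7;

        return mn + range * ((float)weight * (1.0f / 7.0f));
    }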

I wrote a fail-fast test to compare the DBlock vs U8 encoding. One immediate downside of DBlock is that the decoder requires more instructions than U8. On the other hand, the total memory is half, so fewer cache misses. Also, the blocks are in 4x4 format, which is essentially a partial Morton order, so again better cache access patterns.

Visually DBlock wins in virtually every case despite being half the memory. This is because the vast majority of 4x4 blocks are better captured by a local gradient encoding. In theory, for blocks that contain large differences between min/max, DBlock should look worse, but I am not easily able to spot the difference.

The only real downside is the complexity of the decoder. In particular, with U8 I was taking advantage of the fact that I could load multiple pixels with a single gather. This is far more difficult with DBlock, since each "pixel" is now a 4x4 block of pixels, 8 bytes in size.

Also, this is all SIMD based; I don't have scalar decoders.

Thus branching on whether the X coordinate crosses from one block to another doesn't work well; it is very likely at least one of the lanes will cross blocks.

The initial decoder simply does many more gathers than the U8 one. For a 2x2 bilinear sample the U8 decoder required 2 gathers, with an extra 2 only when the X coordinate wrapped around (repeat sampling).

The initial DBlock decoder for bilinear requires 8 gathers (in the best case we theoretically only need 1 or 2, but the worst case is 8), although in many cases they are gathering from the same block, so it should be possible to eliminate some of these once I write a more optimized decoder.

One interesting thing is that for quadratic sampling (3x3), DBlock's number of gathers doesn't change from bilinear, and we can get everything we need with 8.

One perf improvement I quickly added was to switch the gathers to 64-bit rather than 32-bit, as our blocks are 64 bits, and 64-bit gathers do run faster.
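With AVX2 that swap is essentially replacing an 8-wide 32-bit gather with a 4-wide 64-bit gather; a sketch (the block-index addressing is an assumption):

    #include <immintrin.h>
    #include <cstdint>

    // Fetch 4 whole DBlocks at once with a 64-bit gather. blockIndex holds 4
    // linear block indices; scale = 8 because each block is 8 bytes.
    static inline __m256i GatherBlocks4(const uint64_t* blocks, __m128i blockIndex)
    {
        return _mm256_i32gather_epi64((const long long*)blocks, blockIndex, 8);
    }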

 Another option would be to drop the gathers and use 128 bit loads per lane, then shuffle everything around. I'll probably test this at some point, but it is a large # of instructions, so it may not be a win despite dropping the gathers.


*Update: I wrote the 128-bit load/swizzle based version and it does outperform the gather based implementation; it is now the default version. This is on Zen2, where gather is particularly slow and decodes to an obscene number of uops.


Future: At some point I'd like to port the volumes to also use this format. (Edit: this has been done, although it is a slightly different format since it targets 3D blocks.)