
Quasi Virtual Texturing System

 

This is an overview of a quasi virtual texturing system that I implemented for my game.

I use this system instead of a traditional virtual texturing system for the following reasons.

  1. Supports all levels of anisotropic/trilinear filtering, even on old hardware
  2. Doesn't use tiled resources, which require newer GPUs and aren't available on Windows 7 with D3D11
  3. I use triplanar texturing exclusively, and do not have traditional UVs

Algorithm

This algorithm uses multiple texture arrays and a structured buffer that indicates, per material, which array to sample from.

It uses a fixed amount of GPU memory for textures.

Textures are limited to square power-of-two sizes; the only sizes supported are 256, 512, 1024, and 2048. This reduces the number of arrays we will need to a reasonable number.

I pre-generated a texture array that contains all of the mips of size 256 and under, for all textures.

I also allocate fixed-size arrays for 512, 1024, and 2048, with the larger sizes having progressively fewer slots.

Number of slots: here are the number of slots I have per array; these numbers are mostly arbitrary.

  • 256: # of textures
  • 512: 128 slots
  • 1024: 32 slots
  • 2048: 8 slots

Here is a visualization of the arrays.

SDFShape2 

 Yellow is 256, Green is 512, Blue is 1024, and Red is 2048.
I force everything to use 256 past a certain distance to reduce divergence.

A structured buffer is used as an indirection mapping; pass in the material ID, and retrieve which array to sample from, and which slot in the array.
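
A minimal sketch of what one entry of that indirection buffer could look like on the CPU side (field names here are illustrative, not my exact layout):

#include <cstdint>

// One entry of the indirection structured buffer, indexed by material ID.
struct MaterialTextureEntry {
    uint32_t arrayIndex; // which texture array: 0 = 256, 1 = 512, 2 = 1024, 3 = 2048
    uint32_t slot;       // slice within that array
};

// When a texture is paged in or out, only the entry for that material changes;
// the material IDs referenced by the geometry never do.
void UpdateIndirection(MaterialTextureEntry* entries, uint32_t materialId,
                       uint32_t arrayIndex, uint32_t slot) {
    entries[materialId] = { arrayIndex, slot };
    // ...followed by uploading the dirty range of the structured buffer.
}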

Prioritizing Textures

As there is a limited number of high resolution textures that can be on the GPU at any given time, a system was needed to determine which textures should be resident.

This is done by storing the N most important textures, along with a score, for each voxel patch.

The score is simply the number of vertices that reference the texture.

Each frame, for visible patches, the score is added to the appropriate texture/mip based on the patch's distance from the camera.

Patches which are out of the frustum or occluded do not get added.

The texture/mip with the highest score that is not resident on the GPU is then considered for upload, if it has a higher score than a resident texture at the same mip level.

Only one texture is uploaded per frame, to limit the amount of data that needs to be transferred.

The structured buffer is also updated to reflect where the GPU can find the texture.

Latency is reduced by having the CPU side also store a cache of recently accessed textures; the cache's size is adjustable, but is currently set to 300 MB.
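
Here is a rough sketch of the per-frame selection, assuming requests have already been accumulated for a single array size (types and names are illustrative, not my engine's actual interfaces):

#include <algorithm>
#include <cstdint>
#include <vector>

// One (texture, mip) request accumulated from the visible patches this frame.
struct TextureRequest {
    uint32_t textureId;
    float    score;     // sum of vertex counts from visible patches, weighted by distance
    bool     resident;  // already in the corresponding array?
};

// What is currently living in one slot of that array.
struct ResidentSlot {
    uint32_t textureId;
    float    score;     // score it accumulated this frame (0 if nothing referenced it)
};

// Pick at most one upload this frame: the best non-resident request, and only if it beats
// the weakest resident slot of the same array. Returns false if nothing should change.
bool SelectUpload(std::vector<TextureRequest> requests,
                  const std::vector<ResidentSlot>& slots,
                  uint32_t& outTextureId, uint32_t& outSlotToEvict) {
    std::sort(requests.begin(), requests.end(),
              [](const TextureRequest& a, const TextureRequest& b) { return a.score > b.score; });
    for (const TextureRequest& req : requests) {
        if (req.resident)
            continue;                          // already on the GPU, nothing to do
        auto weakest = std::min_element(slots.begin(), slots.end(),
            [](const ResidentSlot& a, const ResidentSlot& b) { return a.score < b.score; });
        if (weakest == slots.end() || req.score <= weakest->score)
            return false;                      // nothing resident is worth evicting
        outTextureId   = req.textureId;
        outSlotToEvict = (uint32_t)(weakest - slots.begin());
        return true;                           // only one upload per frame
    }
    return false;
}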

Results

The system works well and transparently pages in texture data based on visibility.

It never suffers from horribly blurry textures since the 256 mips are always available as a fallback.

Even standing still and turning the camera around can and will cause textures to be paged in/out.

Limitations

  • D3D11 has a max texture array size of 2048. This means this system can only support 2048 separate materials. I am nowhere near this limit, so this isn't an issue for me.

GPU Texture Format for PBR

Visuals with PBR.

Objective: encode a full PBR set of channels into as few bits as possible, with good performance.

From testing it appears that each additional texture adds significant cost: using 3 textures to encode the PBR signal at 2 bytes per texel is much more costly than using 2 textures at the same overall number of bytes per texel.

The PBR system I'm using has 8 channels:

3 base color
2 normal 
1 roughness
1 metallic
1 AO

If we attempt to store the normal in BC5, a two channel format designed specifically for tangent space normals, we have 6 channels remaining and cannot fit them into a single texture, as no block compressed format supports more than 4 channels.
So we cannot use BC5.

There are two good options I've found instead, both using the same layout: two textures, each with 4 channels.
  
On anything D3D11 and newer, BC7 can be used.
For pre-D3D11 systems, BC3 textures can be used instead. 

The normal will be split and stored into the alpha channels, which should help preserve precision of the normal.

Texture1: RGB: base color A: normal.x 
Texture2: RGB: ao, roughness, metallic.  A: normal.y 

*Texture1 can be safely set to sRGB, as both BC3 and BC7 treat the alpha channel as linear.

Uncompressed signal: 8 bytes 
Compressed: 2 bytes in both BC3 and BC7 formats
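
The packing itself is just a channel shuffle before the two images are handed to the block compressor; a minimal sketch (illustrative names, with the two normal channels already remapped to [0, 255]):

#include <cstdint>

// One texel of each of the two source images, matching the layout above:
// Texture1 = base color + normal.x, Texture2 = AO/roughness/metallic + normal.y.
struct PackedTexel1 { uint8_t r, g, b, a; };
struct PackedTexel2 { uint8_t r, g, b, a; };

void PackPbrTexel(uint8_t baseR, uint8_t baseG, uint8_t baseB,
                  uint8_t ao, uint8_t roughness, uint8_t metallic,
                  uint8_t normalX, uint8_t normalY,
                  PackedTexel1& out1, PackedTexel2& out2) {
    out1 = { baseR, baseG, baseB, normalX };
    out2 = { ao, roughness, metallic, normalY };
}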

Encoding speed:

Using AMD Compressonator, BC3 is fast to encode; even with quality set high it churns through textures fairly quickly.

Another encoder I tested was Crunch, a BC1/BC3 compressor that applies a lossy entropy reduction algorithm on top of the lossy block compression; this enables crunched BC1/BC3 files to compress much smaller on disk.
I decided not to use it because the compressor was very slow, and I feel that BC1 already looks less than stellar (the endpoints are 5:6:5)-- throw in even more artifacts from Crunch and the textures just didn't look very good.


AMD Compressonator's BC7 encoding is not nearly as fast as its BC3.
This is understandable as the format is vastly more complex.

With the quality set to low, it still takes much longer than BC3 at high quality. 



BC format Impact on Rendering
There is no observable difference in rendering performance between BC3 and BC7 on my AMD 280x.  
Both are observably faster than uncompressed, not surprising given that uncompressed is 4x larger.

BC3 vs BC7 Visual Quality: 

I have only run BC7 high quality on a few images; I'd probably have to run it overnight and then some to generate high quality BC7 for everything.

 Comparing low quality BC7 vs high quality BC3:

BC3's RGB part (identical to BC1) can only encode 4 possible colors in each 4x4 block; BC7 is far less limited.

For noisy images the difference isn't all that noticeable, but if you look closely BC7 generally has slightly more detail.

For anything with smooth gradients BC7 is clearly superior.

Normals:

BC3 has dedicated 8 bit end points and 3 bit indices for the alpha channel, while BC7 may or may not even have dedicated indices for alpha, as this is chosen on a per block basis. 

There is no obvious difference in the normals, but when I zoom in I can occasionally spot areas where BC3 appears to have done a better job. This is rare though, and the overall improvement in the other channels outweighs this small loss. Running BC7 at high quality may also change this.

 Size on Disk: 
Both BC3 and BC7 are 8 bits per pixel
When further compressed on disk, in this case with zstd, the BC7 files are generally about 1-2% smaller.

I tried lzham (an LZMA variant), but the files were only about 5% smaller than zstd level 19, not worth the 5x slower decode.





Possible/Future Improvements:

1.  Quality of all channels can be improved by tracking min/max for the entire image and then re-normalizing it. This would require 2 floats per channel in the shader to decode though.

2. The normals in the normal map are in Euclidean space, which wastes bits since some values are never used. Octahedral coordinates make better use of the available bits, and decoding isn't really much different; a sketch follows below.
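
For reference, here is a sketch of the standard octahedral encode/decode written out in plain C++ (this is the well known mapping, not code from my pipeline):

#include <cmath>

static float SignNotZero(float v) { return (v >= 0.0f) ? 1.0f : -1.0f; }

// Encode a unit direction (x, y, z) into octahedral coordinates (u, v) in [-1, 1].
void OctEncode(float x, float y, float z, float& u, float& v) {
    float invL1 = 1.0f / (std::fabs(x) + std::fabs(y) + std::fabs(z));
    u = x * invL1;
    v = y * invL1;
    if (z < 0.0f) { // fold the lower hemisphere over the diagonals
        float pu = u, pv = v;
        u = (1.0f - std::fabs(pv)) * SignNotZero(pu);
        v = (1.0f - std::fabs(pu)) * SignNotZero(pv);
    }
}

// Decode octahedral coordinates back into an (approximately unit) direction.
void OctDecode(float u, float v, float& x, float& y, float& z) {
    x = u;
    y = v;
    z = 1.0f - std::fabs(u) - std::fabs(v);
    if (z < 0.0f) { // undo the fold
        float px = x, py = y;
        x = (1.0f - std::fabs(py)) * SignNotZero(px);
        y = (1.0f - std::fabs(px)) * SignNotZero(py);
    }
    float len = std::sqrt(x * x + y * y + z * z);
    x /= len; y /= len; z /= len;
}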




Metal channel is active for many of the objects seen here

Adding SDF Collision to Bullet Physics

 

Bullet Physics is an open source physics engine that is sometimes used for games and movies.

It supports many narrow phase collision types such as spheres, boxes, capsules, triangle meshes, and convex hulls.

None of these were a good match for my signed distance fields(SDF).

Triangle meshes might seem like a solution, but they have two obvious issues.

  1. No volume which leads to penetration issues and bad collision detection
  2. Uses lots of memory to store the mesh

Convex hulls might also seem like a solution, but they also have downsides:

  1. They only work with convex data. To work with concave data you must generate multiple hulls and stick them together.
  2. Not a very accurate representation of the original shape unless you stitch many of them together; this would be infeasible for the world/terrain shape, which is many kilometres in size

So I decided to extend Bullet to directly support SDF vs SDF collision.

As everything in my game world is represented by an SDF, I do not need any of Bullet's built-in collision types and only use SDF vs SDF collision.

Collision Algorithm

The solution I went with is based on generating a point hull for the SDF; a point hull is a list of points that lie within the shape's negative space. In my implementation all points in a given hull use the same radius.

To perform collision detection between two SDFs, you treat one of them as a point hull and the other as an SDF. You transform each point in the point hull into the space of the SDF and sample the SDF. If the distance to the negative space is <= the point hull's radius, you have a collision.
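
In rough C++, the core test boils down to something like this (the types and callbacks are illustrative, not my engine's actual interfaces):

#include <functional>
#include <vector>

struct Vec3 { float x, y, z; };

// sampleSdf returns the signed distance at a point given in the SDF's local space
// (negative inside the shape); toSdfSpace transforms a hull point into that space.
using SdfSample  = std::function<float(const Vec3&)>;
using ToSdfSpace = std::function<Vec3(const Vec3&)>;

struct PointHull {
    std::vector<Vec3> points; // points inside the shape's negative space
    float radius;             // shared radius for every point in this hull
};

// Returns the indices of hull points that are colliding with the SDF.
std::vector<int> CollidePointHullVsSdf(const PointHull& hull,
                                       const ToSdfSpace& toSdfSpace,
                                       const SdfSample& sampleSdf) {
    std::vector<int> hits;
    for (int i = 0; i < (int)hull.points.size(); ++i) {
        Vec3 p  = toSdfSpace(hull.points[i]); // move the point into the SDF's space
        float d = sampleSdf(p);               // signed distance, negative inside
        if (d <= hull.radius)                 // within one point radius of the negative space
            hits.push_back(i);
    }
    return hits;
}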

I decide which object will act as the point hull based on which has the smaller point hull radius; this ensures consistent collision, and allows the smaller object to collide against the larger object's full SDF.

For the world/terrain shape I disable the point hull generation pass; it is always treated as the SDF when colliding.

For points that are found to be colliding, a second pass is run to generate the normal and collision depth, and to reduce the number of impact points down to four (this is the number of points Bullet wants fed to it).

Result

SDF vs SDF supports both convex and concave shapes.

It also has fewer issues with penetration than triangle meshes since SDFs have proper volume; even if a small object penetrates into a larger one, it will still be pushed out in the correct direction.

SDFShape1 

Here is a shape for which we will generate a point hull

SDFShape2

And here is a visualization of the point hull.

Optimizations (sometimes known as midphase)

There are numerous ways to optimize this, but here are some that I use. The underlying SDF algorithms are already all SIMD, so this is focused on algorithmic optimizations.

Surface Approximation

Instead of sampling the full SDF against all of the points, I first run a pass that only samples eight points on the point hull's AABB within the SDF. It then uses those eight points to perform a quick rejection of points that are incapable of colliding because they are too far away. This often eliminates >90% of the points, so we can skip full SDF evaluation.
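
One way such a rejection can work relies on the fact that a proper SDF can change by at most one unit per unit of travel; a sketch (illustrative names, xyz triples for points):

#include <cmath>

static float Dist3(const float a[3], const float b[3]) {
    float dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// corners: the 8 corners of the point hull's AABB in the SDF's space.
// cornerDist: the SDF sampled at each of those corners.
// The SDF value at p is at least cornerDist[i] - |p - corners[i]|, so if that lower
// bound already exceeds the point radius, p cannot collide and the full SDF is skipped.
bool CanSkipPoint(const float p[3], float pointRadius,
                  const float corners[8][3], const float cornerDist[8]) {
    for (int i = 0; i < 8; ++i) {
        if (cornerDist[i] - Dist3(p, corners[i]) > pointRadius)
            return true;  // provably too far from any surface to collide
    }
    return false;         // cannot be ruled out; evaluate the full SDF
}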

Temporal Collision Frame

This is a frame that is specific to a given object vs object collision; it records a rough approximation of the previous collision attempt between the two shapes. If not enough movement or rotation has occurred, it can early out and skip performing the full narrow phase, and the previous contact points are reused.

Handling SDFs with broken distance formulas

Otherwise known as Lipschitz continuity: we want gradients whose magnitude is as close to 1 as possible.

Some SDFs do not have Euclidean-correct distances; for these cases I run a fix-up step which calculates a correction factor based on the rate of change in the local space of the SDF.

My correction algorithm takes the distances calculated for the AABB of the point hull projected into the SDF's space, and compares the true distance between the points against the distance returned by the SDF. It compares all of the points against each other, and uses the largest ratio of (sdf distance / true distance) as the correction factor.
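
In sketch form, using the eight AABB corners (illustrative names):

#include <algorithm>
#include <cmath>

static float Dist3(const float a[3], const float b[3]) {
    float dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// corners: the point hull's AABB corners projected into the SDF's space.
// cornerDist: the value the SDF returned at each corner.
// A true distance field can never change faster than the distance travelled, so the
// largest observed ratio estimates how much this formula over-reports distance locally.
float EstimateDistanceCorrection(const float corners[8][3], const float cornerDist[8]) {
    float worstRatio = 1.0f;
    for (int i = 0; i < 8; ++i) {
        for (int j = i + 1; j < 8; ++j) {
            float trueDist = Dist3(corners[i], corners[j]);
            if (trueDist <= 0.0f)
                continue;
            float sdfDelta = std::fabs(cornerDist[i] - cornerDist[j]);
            worstRatio = std::max(worstRatio, sdfDelta / trueDist);
        }
    }
    return worstRatio; // divide sampled distances by this before comparing against the radius
}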

This allows for colliding against fractals and other complex formulas, which often have incorrect distance formulas.

Performance Notes

In scenes with many moving & colliding objects:

Initially my SDF narrow phase and Bullet's solver were about equal in cycle usage.

After adding various optimizations to reduce the time spent in the SDF narrow phase, the solver is now the primary waster of cycles.

Bullet's solver has a few SSE-based implementations, but they are AoS rather than SoA, so the performance gain is minimal.

Bullet has a few inefficiencies in how it works; it loves to iterate over every single CollisionObject just to access a single bool. While this is not a problem for small scenes, it does not scale to the number of physics objects I plan to use.

I plan to rework this part of Bullet in my local copy, and have already rewritten/removed a few passes Bullet was performing that I did not need.

Pseudo Continuous Collision Detection

Bullet has some support for CCD(Continuous Collision Detection), but I’m using my own pseudo CCD instead.

My CCD solution is very simple:

Calculate the maximum distance an object can travel in a given physics tick based on its current velocity, and expand the point hull's radius to incorporate it.
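
A sketch of that expansion, with dt being the fixed tick length (for example 1/120):

#include <cmath>

// Grow the point hull radius so the swept volume for this physics tick is covered.
float ExpandRadiusForCcd(float baseRadius, float vx, float vy, float vz, float dt) {
    float speed = std::sqrt(vx * vx + vy * vy + vz * vz);
    return baseRadius + speed * dt; // maximum distance the object can travel this tick
}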

At high speeds this causes expansion, but I cannot visually tell that this is happening.

I run physics at 120 hertz, but I did test it at 60 hertz and it seemed to work fine there also.

I’m not currently incorporating angular velocity, but I will probably add that at some point.

Future Work

  1. The algorithm for placing points within the hull could always use more work.
  2. It might also be worth looking into storing a per-point radius; this would allow for fewer points in some situations
  3. Angular Velocity for CCD

Higher Quality Vertex Normals

I use the Oct16 format to encode my vertex normals; this format is two 8-bit channels in an octahedral mapping.

Most of the time this was sufficient, but under certain conditions artifacts were visible, such as on the surface of a smoothly varying sphere using triplanar texturing, whose weights are based on the normals.


Here is a visualization of the Triplanar weights generated from the Oct16 normals.

There is a very obvious diamond pattern visible.
Even switching to Oct20 (10 bits per channel) does not completely solve this; the diamonds are much smaller, but they persist.


Oct16, but with custom scale/bias

Instead of adding bits, I decided to take advantage of the fact that most triangle patches only use a limited range of the world space normals.

I track min/max per channel for the entire patch, then encode the normals so that the full range of bits is used.

Decoding in the shader requires a custom scale and bias parameter per channel(4 floats for the two channel Oct16).

There are no extra instructions, as a fixed scale of 2 and bias of -1 were previously being used to transform from the [0,1] to the [-1,1] range.
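
A sketch of the encode side (illustrative names; the shader then decodes with value = byte / 255 * scale + bias):

#include <algorithm>
#include <cstdint>
#include <vector>

// Per-patch parameters the shader needs; the old path was effectively scale = 2, bias = -1.
struct OctScaleBias {
    float scaleX, biasX;
    float scaleY, biasY;
};

// octX/octY: octahedral normal coordinates in [-1, 1] for every vertex of the patch
// (assumed non-empty). outX/outY receive the quantized 8-bit channels.
OctScaleBias EncodePatchNormals(const std::vector<float>& octX, const std::vector<float>& octY,
                                std::vector<uint8_t>& outX, std::vector<uint8_t>& outY) {
    float minX = *std::min_element(octX.begin(), octX.end());
    float maxX = *std::max_element(octX.begin(), octX.end());
    float minY = *std::min_element(octY.begin(), octY.end());
    float maxY = *std::max_element(octY.begin(), octY.end());

    // Avoid a zero range when a patch is perfectly flat in one channel.
    float rangeX = std::max(maxX - minX, 1e-6f);
    float rangeY = std::max(maxY - minY, 1e-6f);

    outX.resize(octX.size());
    outY.resize(octY.size());
    for (size_t i = 0; i < octX.size(); ++i) {
        outX[i] = (uint8_t)std::min(255.0f, (octX[i] - minX) / rangeX * 255.0f + 0.5f);
        outY[i] = (uint8_t)std::min(255.0f, (octY[i] - minY) / rangeY * 255.0f + 0.5f);
    }
    return { rangeX, minX, rangeY, minY };
}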


The second image was encoded this way; the normals are still Oct16, so only 16 bits per normal, but with a custom scale/bias per patch.

 In the majority of cases this provides many extra bits of precision, and in the worst case it degrades back to standard Oct16.

Faster Triplanar Texturing

Here is a method I created to improve performance when using Triplanar texturing.
I also think it looks better.


So the standard triplanar texturing algorithm you will find in various places on the internet looks something like this:

float3 TriPlanarBlendWeightsStandard(float3 normal) {
    // Weight each axis by how closely the normal faces it.
    float3 blend_weights = abs(normal.xyz);
    // Push the blend zone out so that only near-diagonal normals blend.
    blend_weights = blend_weights - 0.55;
    blend_weights = max(blend_weights, 0);
    // Normalize so the three weights sum to 1.
    float rcpBlend = 1.0 / (blend_weights.x + blend_weights.y + blend_weights.z);
    return blend_weights * rcpBlend;
}

If we visualize the blend zones this is what it looks like.

Red/Green/Blue represent one texture sample.

Yellow/pink/cyan represent two textures samples.

And in the white corner we need all three.

As we can see the blend width is not constant; it is very small in the corner and quite wide along axis-aligned edges.

The corner has barely any blending, as we have pushed our blend zone out as far as possible by subtracting 0.55 (anything over 1/sqrt(3), or about 0.577, results in negative blend zones in the corner).

This results in needless texture sampling along axis-aligned edges, stealing away our precious bandwidth.

Constant Blend Width

What we want is something more like this-- constant blend width.

We do this by working in max-norm distance instead of Euclidean, as our planes are axis aligned anyway.

Here is the modified code that generates this:
float3 TriPlanarBlendWeightsConstantOverlap(float3 normal) {
    // normal*normal behaves like abs(normal) but is cheaper and sharpens the falloff.
    //float3 blend_weights = abs(normal);
    float3 blend_weights = normal * normal;
    // Measure against the dominant axis (max norm) instead of a fixed threshold.
    float maxBlend = max(blend_weights.x, max(blend_weights.y, blend_weights.z));
    blend_weights = blend_weights - maxBlend * 0.9f;
    blend_weights = max(blend_weights, 0);
    // Normalize so the three weights sum to 1.
    float rcpBlend = 1.0 / (blend_weights.x + blend_weights.y + blend_weights.z);
    return blend_weights * rcpBlend;
}


 You can adjust the blend width by changing the scalar .9 value.

On my GPU the constant version runs slightly faster, likely because there are fewer pixels where more than one texture sample is required.

I believe it also looks better--as there is less smearing along axis aligned edges.


Here is a shadertoy I created if you want to play with it



Barycentric Coordinates in Pixel Shader

EDIT: this is out of date; it is better to use a visibility buffer and manually calculate barycentrics

Recently I was in need of a way to perform smooth blending between per-vertex materials.

Basically I needed barycentric coordinates + access to each vertex's material in the pixel shader.

Unfortunately this isn’t built into the common rendering APIs, and so requires some extra effort.

Here is a list of some possible solutions:

Geometry Shader: Assign the coordinates: (1,0,0), (0,1,0), (0,0,1) to the vertices of the triangle. Also write the three materials to each vertex. This method is easy to implement but has terrible performance on many cards, since it requires a geometry shader. When enabled on my AMD card, FPS drops to half or less.

The following two methods are D3D11/12 focused

AMD AGS Driver extension: AMD has a library called AGS_SDK which exposes driver extensions; one of these is direct access to barycentric coordinates in the pixel shader. It also allows for direct access to any of the attributes from the 3 vertices that make up the triangle. This method is very fast and works well if you have an AMD card that supports it.

 float2 bary2d = AmdDxExtShaderIntrinsics_IjBarycentricCoords(AmdDxExtShaderIntrinsicsBarycentric_PerspCenter);
 //reconstruct the 3rd coordinate
 float3 bary = float3(1.0 - bary2d.x - bary2d.y, bary2d.y, bary2d.x);

//extract materials
 float m0 = AmdDxExtShaderIntrinsics_VertexParameterComponent(0, 1, 0);
 float m1 = AmdDxExtShaderIntrinsics_VertexParameterComponent(1, 1, 0);
 float m2 = AmdDxExtShaderIntrinsics_VertexParameterComponent(2, 1, 0);

Nvidia FastGeometryShader: Nvidia also has driver extensions (NVAPI), and one of these is the "fast geometry shader" for when you only need a subset of the features geometry shaders offer. It should be possible to use this to pass down barycentric coordinates & materials, but I do not have an Nvidia card to test this on.

Embed Into Vertex Data: Another option is to enlarge the vertex, and embed the barycentric coordinates and the 3 materials directly into it. This is probably a better fallback than the GS, although it does have the downside of reducing vertex reuse, since many vertices that were previously identical would now differ.

Domain Shader?: I haven’t tried this method, but I think it might be possible to pass down barycentric coordinates from a domain shader

Visual Comparison

BaryOn 

 Ground rendered using barycentrics to perform smooth blending between materials

BaryOff 

 Ground rendered without barycentrics, the material is selected from the base vertex and there is no blending between materials

AVX2 Gather


Masked Gather vs Unmasked Gather

AVX2 has masked gather instructions (_mm_mask_i32gather_epi32 etc.); these have two additional parameters: a mask, and a default value that is used when the mask is false.

  I was hoping masked gathers would be accelerated, such that when most of the lanes were masked off, the gather would complete sooner, but this does not appear to be the case.

The performance of masked and unmasked gathers was very similar, but the masked version was consistently slower.


 Load vs Gather vs Software Gather

To compare gather with load, I created a buffer and ran through it in linear order summing the values.
I forced the gathers to load from the same indices the loads were operating on: indices (0,1,2,3,4,5,6,7), incremented by 8 each loop iteration.

Software gather loaded each index using scalar loads instead of the hardware intrinsics.
Gather was generally ~1.2-1.5x faster than software gather.
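
For reference, here is a sketch of the three kernels (illustrative code, compiled with AVX2 enabled; the buffer is assumed 32-byte aligned and a multiple of 8 ints long):

#include <immintrin.h>

// Reduce the 8 lanes of an __m256i to a single int.
static int HorizontalSum(__m256i v) {
    alignas(32) int lanes[8];
    _mm256_store_si256((__m256i*)lanes, v);
    int s = 0;
    for (int i = 0; i < 8; ++i) s += lanes[i];
    return s;
}

// Plain aligned vector loads, walking the buffer in linear order.
int SumWithLoads(const int* data, int count) {
    __m256i sum = _mm256_setzero_si256();
    for (int i = 0; i < count; i += 8)
        sum = _mm256_add_epi32(sum, _mm256_load_si256((const __m256i*)(data + i)));
    return HorizontalSum(sum);
}

// Hardware gather, forced to fetch the same linear indices (i, i+1, ..., i+7).
int SumWithGather(const int* data, int count) {
    const __m256i laneOffsets = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i sum = _mm256_setzero_si256();
    for (int i = 0; i < count; i += 8) {
        __m256i indices = _mm256_add_epi32(_mm256_set1_epi32(i), laneOffsets);
        sum = _mm256_add_epi32(sum, _mm256_i32gather_epi32(data, indices, 4));
    }
    return HorizontalSum(sum);
}

// "Software gather": the same indices, but each lane is fetched with a scalar load.
int SumWithSoftwareGather(const int* data, int count) {
    __m256i sum = _mm256_setzero_si256();
    for (int i = 0; i < count; i += 8) {
        __m256i vals = _mm256_setr_epi32(data[i + 0], data[i + 1], data[i + 2], data[i + 3],
                                         data[i + 4], data[i + 5], data[i + 6], data[i + 7]);
        sum = _mm256_add_epi32(sum, vals);
    }
    return HorizontalSum(sum);
}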

Performance depended on the cache level that the buffer fit into.


  • Buffer fits in L1: load is ~10x faster than gather
  • Buffer fits in L2: load is ~3.5x faster than gather
  • Buffer larger than L2: load tapers off to ~2x faster than gather


This was all run on a Haswell; newer chips might perform differently.




Moment of Inertia of a Distance Field

While adding physics support to my voxel engine I needed a reasonably accurate and fast method to calculate the moment of inertia, volume, and center of mass for an arbitrary distance field.

I've written this C++ code to do so.
Feed it a regularly spaced grid of points and distances.
The points must fully bound the negative space of the field to be accurate.
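
In outline, the approach treats every sample inside the shape as a small cube of uniform density; a simplified sketch of that (not the exact code, names are illustrative):

struct MassProperties {
    float volume;
    float comX, comY, comZ;   // center of mass
    float ixx, iyy, izz;      // inertia tensor diagonal, about the center of mass
    float ixy, ixz, iyz;      // products of inertia (off-diagonal terms)
};

// distances: an n*n*n grid of SDF samples (negative = inside), laid out as
// distances[(z*n + y)*n + x]. origin is the position of sample (0,0,0), spacing is the
// grid step. Density is assumed to be 1; scale the results by the real density afterwards.
MassProperties ComputeMassProperties(const float* distances, int n,
                                     float originX, float originY, float originZ,
                                     float spacing) {
    double cellVolume = (double)spacing * spacing * spacing;
    double mass = 0, cx = 0, cy = 0, cz = 0;
    // First pass: volume and center of mass.
    for (int z = 0; z < n; ++z)
    for (int y = 0; y < n; ++y)
    for (int x = 0; x < n; ++x) {
        if (distances[(z * n + y) * n + x] > 0.0f) continue; // outside the shape
        double px = originX + x * spacing, py = originY + y * spacing, pz = originZ + z * spacing;
        mass += cellVolume;
        cx += px * cellVolume; cy += py * cellVolume; cz += pz * cellVolume;
    }
    MassProperties out = {};
    if (mass <= 0.0) return out;
    cx /= mass; cy /= mass; cz /= mass;
    // Second pass: inertia about the center of mass, treating each inside cell as a point mass.
    double ixx = 0, iyy = 0, izz = 0, ixy = 0, ixz = 0, iyz = 0;
    for (int z = 0; z < n; ++z)
    for (int y = 0; y < n; ++y)
    for (int x = 0; x < n; ++x) {
        if (distances[(z * n + y) * n + x] > 0.0f) continue;
        double px = originX + x * spacing - cx;
        double py = originY + y * spacing - cy;
        double pz = originZ + z * spacing - cz;
        ixx += (py * py + pz * pz) * cellVolume;
        iyy += (px * px + pz * pz) * cellVolume;
        izz += (px * px + py * py) * cellVolume;
        ixy -= px * py * cellVolume;
        ixz -= px * pz * cellVolume;
        iyz -= py * pz * cellVolume;
    }
    out.volume = (float)mass;
    out.comX = (float)cx; out.comY = (float)cy; out.comZ = (float)cz;
    out.ixx = (float)ixx; out.iyy = (float)iyy; out.izz = (float)izz;
    out.ixy = (float)ixy; out.ixz = (float)ixz; out.iyz = (float)iyz;
    return out;
}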

For the common primitives, such as a sphere, there are established formulas that we can compare against.

For a 10^3 distance field of a sphere, the estimate is off by about 1%, close enough for me.
If more accuracy is needed, the sample rate can be increased.

SDF Based Occlusion Culling

 

I’m going to describe an occlusion culling algorithm I came up with about 4 or 5 years ago. I use it in my game and it has worked well for me.

If you do not know what occlusion culling is, it is culling objects that are in the frustum but are blocked from the user's view, and so do not contribute to the scene.

Prior to this I used a software rasterizer based implementation, but I found it problematic for the following reasons.

  • Low resolution: not a huge issue, but it is much lower resolution than the GPU rasterizer so it isn’t completely correct
  • False Occlusion: since my occluders had to be auto-generated at run time from arbitrary SDFs, I used a convex hull approximation which was not always conservative; this caused some false occlusion
  • Memory: each occluder had its own set of triangles that needed to be stored. This also meant jumping around in memory to access each of these sets.

SDF Occlusion Algorithm Overview

My scene is represented with triangular patches, each containing perhaps 500 to 2000 triangles. They are approximately equal size in screen space.

SDFShape2 

 I’ve scaled the patches down here so that we can see gaps between them

Basically we want to shoot rays from the viewer's eye to each patch.

Because our scene is represented with a signed distance field, we always know the distance to the nearest surface.

If there is ever a point along the ray where we are inside of something, and the distance to the surface is sufficient to occlude our patch, we know the patch isn’t visible.
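
A minimal sketch of that ray test, with the sphere-vs-patch check factored out (a sketch of that check appears further down; types and callbacks are illustrative):

#include <algorithm>
#include <cmath>
#include <functional>

struct Vec3 { float x, y, z; };

static Vec3  Sub(const Vec3& a, const Vec3& b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static Vec3  Add(const Vec3& a, const Vec3& b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
static Vec3  Mul(const Vec3& a, float s)       { return { a.x * s, a.y * s, a.z * s }; }
static float Len(const Vec3& a)                { return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z); }

// sampleSdf: signed distance to the scene (negative inside solid geometry).
// sphereOccludesPatch(center, radius): does a solid ball of that size hide the patch?
bool IsPatchOccluded(const Vec3& eye, const Vec3& patchCenter, float patchRadius,
                     const std::function<float(const Vec3&)>& sampleSdf,
                     const std::function<bool(const Vec3&, float)>& sphereOccludesPatch) {
    Vec3 toPatch = Sub(patchCenter, eye);
    float rayLen = Len(toPatch);
    Vec3 dir     = Mul(toPatch, 1.0f / rayLen);
    const float minStep = rayLen * 0.001f;
    float t = 0.0f;
    while (t < rayLen - patchRadius) {
        Vec3 p  = Add(eye, Mul(dir, t));
        float d = sampleSdf(p);
        // Inside something: the ball of radius -d around p is entirely solid.
        if (d < 0.0f && sphereOccludesPatch(p, -d))
            return true;
        t += std::max(std::fabs(d), minStep);  // sphere-trace step, never smaller than minStep
    }
    return false;
}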

Optimizing to make it practical

The number of patches depends on the settings, but a typical number after frustum culling might be 10,000.

Shooting 10,000 rays through complex SDFs is quite slow; thankfully it turns out that each frame is generally very similar to the previous one, so we can take advantage of temporal stability.

Typically I only end up tracing around 100 rays per frame.

This is done by tracking previous occlusion results, and storing a sphere that represents the most occluded point.

The next time around, for patches that were occluded, we can quickly retest them without needing the SDF by using a sphere-occludes-sphere test.
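
A sketch of a conservative sphere-occludes-sphere test, with the patch represented by its bounding sphere (it only ever answers "definitely hidden"; anything else falls through to the other paths):

#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

static float Dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static float Length(const Vec3& a)             { return std::sqrt(Dot(a, a)); }

// Does the solid sphere (occCenter, occRadius) hide the whole bounding sphere
// (patchCenter, patchRadius) from the eye? True means definitely hidden.
bool SphereOccludesSphere(const Vec3& eye,
                          const Vec3& occCenter, float occRadius,
                          const Vec3& patchCenter, float patchRadius) {
    Vec3 toOcc   = { occCenter.x - eye.x,   occCenter.y - eye.y,   occCenter.z - eye.z };
    Vec3 toPatch = { patchCenter.x - eye.x, patchCenter.y - eye.y, patchCenter.z - eye.z };
    float dOcc   = Length(toOcc);
    float dPatch = Length(toPatch);

    if (dOcc <= occRadius)
        return true;   // the eye is inside solid geometry; everything is hidden
    if (dPatch - patchRadius < dOcc + occRadius)
        return false;  // the patch is not safely behind the occluder

    // Every ray within the occluder's angular radius hits it, so the patch is hidden
    // if its entire angular extent fits inside that cone.
    float alpha = std::asin(occRadius / dOcc);
    float beta  = std::asin(std::min(1.0f, patchRadius / dPatch));
    float cosT  = std::max(-1.0f, std::min(1.0f, Dot(toOcc, toPatch) / (dOcc * dPatch)));
    float theta = std::acos(cosT);
    return theta + beta <= alpha;
}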

If that fails, we can still optimize by starting the ray test near the point of previous occlusion.

For patches that were previously not occluded, we can actually perform a fairly similar test: if there hasn't been enough relative movement, there is no need to test the SDF, and we can assume the object is visible.

I differentiate between dynamic and static parts of the scene, as these temporal tests only work reliably for the static part.

I also use a timer to set a max time that this tracing process can run per frame.

Other Benefits

I track an occlusion ratio for each patch; this is between 0 and 1.

So even if something isn’t completely occluded I can still tell when it is partially occluded.

I use this occlusion ratio to adjust the priority for patch splitting and generation.

  • Patches that are completely occluded are never split.
  • Patches that are partially occluded, have a much lower score/priority, and won’t split until the user gets closer.

Future Work

Dynamic objects are occluded by static objects, but I haven’t yet added the code for dynamic objects to occlude anything. It isn’t terribly difficult, but I don’t think it will contribute much so I haven’t prioritized that work.

Potential downsides

  • While this algorithm works well for me, this is largely due to the precise way that my engine works; it is probably not well suited to a more typical game engine like Unreal, which does not work with equally sized (in screen space) triangular patches.
  • Thin objects don't occlude as well as with a software rasterizer based implementation, because this method relies on the occluder having volume
  • Occluder fusion isn’t the best

Video of my Distance Field Engine

This is a short clip showing my engine. 
It uses distance fields to represent all geometry.
Don't confuse it with a minecraft like engine--those use 1 bit filled/not filled.
It is written in very SIMD heavy C++.