Hi,
I'm working on a game in UE4 that uses a compute shader to perform 3D rotations on 2D images. The final rotated image is drawn to a render target after the calculations are done.
Problem: on certain hardware, we're seeing "striping", where rows of pixels are missing from the final image (example shown in the video at the bottom). This is known to happen on specific Nvidia cards such as the RTX 3080 Ti, 3090, and 4080, but it most likely indicates an underlying issue that only shows up on certain hardware. It looks like a race condition, but I can't find where in the code I'm causing it.
Code: Here's a high-level version of the HLSL, obviously very simplified.
I noted the things relevant to the overall flow and possible memory sharing issues:
//--------------------------------------------------------------------------------------
// Global variables
//--------------------------------------------------------------------------------------
//Input texture (read only) - 256x256. Set infrequently, not every frame
Texture2D InputTex;
//SRV (read only) buffers, used to pass inputs from C++ to HLSL
//These buffers must be set almost every frame (rotation data for the frame, etc)
StructuredBuffer<InputStruct1> InputBuffer1;
StructuredBuffer<InputStruct2> InputBuffer2; //Note: I'm not sure about the max data size for these, but I think
//DX11 specifies that buffers can hold 2 GB of data at a minimum
//(and we're under that)
//UAV (read/write) buffers, used to hold data that needs to be shared between passes.
//Buffer length is 65536 (256x256), corresponding with pixels in the final texture.
//"globallycoherent" causes memory barriers and syncs to flush data across the entire GPU
globallycoherent RWStructuredBuffer<float4> DataBuffer1;
globallycoherent RWStructuredBuffer<float4> DataBuffer2;
//Result texture - 256x256, displayed on render target
RWTexture2D<float4> ResultTex;
//--------------------------------------------------------------------------------------
// ClearBuffersPass:
// clear out the data from the previous run
//--------------------------------------------------------------------------------------
[numthreads(THREADGROUPSIZE_X, THREADGROUPSIZE_Y, THREADGROUPSIZE_Z)]
void ClearBuffersPass(uint3 Gid : SV_GroupID,
uint3 DTid : SV_DispatchThreadID,
uint3 GTid : SV_GroupThreadID,
uint GI : SV_GroupIndex)
{
//Just for the heck of it, I've tried putting a memory barrier here to make sure there are no
//operations still in progress from the previous run before we write to the buffers
DeviceMemoryBarrier();
DataBuffer1[DTid.x] = float4(0,0,0,0);
DataBuffer2[DTid.x] = float4(0,0,0,0);
}
//--------------------------------------------------------------------------------------
// Pass1:
// Read colors from the input texture, do some math, and populate a data buffer
//--------------------------------------------------------------------------------------
[numthreads(THREADGROUPSIZE_X, THREADGROUPSIZE_Y, THREADGROUPSIZE_Z)]
void Pass1(uint3 Gid : SV_GroupID,
uint3 DTid : SV_DispatchThreadID,
uint3 GTid : SV_GroupThreadID,
uint GI : SV_GroupIndex)
{
//Read color from the input texture
float4 color = InputTex.mips[0][uint2(DTid.x,DTid.y)];
//Read values from our input buffers
float property1 = InputBuffer1[0].Property1;
//Perform calculations
//...
//Before we try to write to a shared buffer, make sure the previous pass is done accessing it
//I've tried all manner of these ("AllMemoryBarrier", "DeviceMemoryBarrierWithGroupSync", etc.);
//none of them fix the issue.
DeviceMemoryBarrier();
//Set data in the buffers- use Interlocked operations in cases where multiple threads could access
//the same index in the buffer
InterlockedAdd(DataBuffer1[index].x, value); //Index and value are calculated above
}
//--------------------------------------------------------------------------------------
// Pass2:
// Read from the data buffer (after all threads in the previous pass are done),
// and populate the result texture
//--------------------------------------------------------------------------------------
[numthreads(THREADGROUPSIZE_X, THREADGROUPSIZE_Y, THREADGROUPSIZE_Z)]
void Pass2(uint3 Gid : SV_GroupID,
uint3 DTid : SV_DispatchThreadID,
uint3 GTid : SV_GroupThreadID,
uint GI : SV_GroupIndex)
{
//Before we try to read from a shared buffer, make sure the previous pass is done
//accessing it. The interlocked operations above don't guarantee ordering of the passes.
DeviceMemoryBarrier();
//Read color from the data buffer
float4 color = DataBuffer1[index]; //In some cases, we get a 1-dimensional index from an (i,j)
//pair using index = i + j*width
//Perform more calculations. We have about 5 passes like this that all rely on having the previous
//pass complete and using data from the buffers to perform the next round of calculations, but all
//of them follow the same kind of structure in terms of reading/writing to buffers.
//...
//Write pixel to the final texture
ResultTex[uint2(DTid.x,DTid.y)] = finalColor;
}
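To make the `index = i + j*width` mapping mentioned in Pass2 concrete, here is a minimal host-side C++ sketch of it. It assumes a row-major layout where `i` is the column and `j` is the row; the helper names `FlattenIndex` and `UnflattenIndex` are hypothetical and not part of the actual shader or engine code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the original shader): maps a 2D pixel
// coordinate (i = column, j = row) to a 1D buffer index, row-major.
constexpr std::uint32_t FlattenIndex(std::uint32_t i, std::uint32_t j, std::uint32_t width)
{
    return i + j * width;
}

// Hypothetical inverse: recovers (i, j) from a 1D index.
struct PixelCoord { std::uint32_t i; std::uint32_t j; };

constexpr PixelCoord UnflattenIndex(std::uint32_t index, std::uint32_t width)
{
    return PixelCoord{ index % width, index / width };
}

// For a 256x256 texture, the first and last pixels map to the first and
// last slots of a 65536-element buffer.
static_assert(FlattenIndex(0, 0, 256) == 0, "first pixel maps to slot 0");
static_assert(FlattenIndex(255, 255, 256) == 65535, "last pixel maps to the last slot");
```

Under this assumed layout, every (i, j) pair inside the 256x256 texture lands on a distinct slot of the 65536-element buffers.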
Other possibly relevant things:
- The impacted graphics cards are all very recent, high end cards, which makes me think that the issue may happen when the GPU is so fast that it overlaps the previous run
- I've tried running with different values for THREADGROUPSIZE_X and THREADGROUPSIZE_Y. For whatever reason, the best performance was obtained when I just maxed out THREADGROUPSIZE_X at 1024 and left y and z at 1 (I'm not sure of the significance, but I've heard that the thread group size impacts VRAM and other memory allocation)
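For reference, the arithmetic behind that thread group size can be sketched in C++ as follows. This is only an illustration under an assumption: that each pass is dispatched in one dimension over the 65536-element buffers. The actual dispatch call is not shown in this post, and the names `kThreadGroupSizeX`, `kBufferLength`, and `GroupCount` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Values taken from the post: group size of 1024 in x, 256x256 buffers.
constexpr std::uint32_t kThreadGroupSizeX = 1024;
constexpr std::uint32_t kBufferLength = 256 * 256;

// Hypothetical helper: number of thread groups needed to cover `elements`
// items with `groupSize` threads per group (rounded up).
constexpr std::uint32_t GroupCount(std::uint32_t elements, std::uint32_t groupSize)
{
    return (elements + groupSize - 1) / groupSize;
}

// Assuming a 1D dispatch, 65536 elements at 1024 threads per group
// need 64 groups.
static_assert(GroupCount(kBufferLength, kThreadGroupSizeX) == 64,
              "65536 / 1024 = 64 groups");
```

The rounding up only matters when the element count is not a multiple of the group size; here it divides evenly.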
If there's anything else that would be helpful for diagnosing, let me know! I'm not a graphics specialist, but I've been at this for a while (this issue is blocking completion of a 5-year project for my small team, oof!). I'd really appreciate any advice.
What the issue looks like for impacted players: striping occurs primarily in the top half of the 256x256 render target, and pixels flicker randomly. Video:
https://www.youtube.com/watch?v=bjrzPupKnGI