Shortly after I posted my previous dev journal entry, I got a direct message asking how I determined the number of vertices that the stream output buffer should be able to hold in the test program. I think that is a great question and while I replied discussing it for the test program, I think it warrants another entry to discuss the minimum size of the stream output buffer for the test program and more generally.
Vertices Available to the Stream Output Stage
The stream output stage places vertices from up to 4 output streams into corresponding stream output buffers, if bound. One of these output streams can also go to the rasterizer stage for drawing to a render target. If the geometry shader is not used, then only 1 output stream is available in the graphics pipeline.
In the previous entry I discussed determining how many vertices were written to a stream output buffer (basically reading back BufferFilledSizeLocation and converting bytes to vertices). In the event there are more vertices in the output stream than there is room in the buffer, then the buffer is filled in the order the vertices are in the output stream and the extraneous vertices are discarded and the count doesn't include them. If the overflow is in the output stream for the rasterizer stage, then the vertices that don't fit into the stream output buffer are still sent along to the rasterizer with the rest.
Test Program
The test program tries to be relatively simple for testing stream output capability. The two triangles on the right are trying to duplicate the two triangles on the left, where the ones on the left went through all the graphics pipeline stages including sending the tessellated vertices to a stream output buffer. So the stream output buffer needs to have space for at least as many vertices as are used for the left image. For the left image, I am sending a 4 point control patch through the pipeline, which the hull shader is setting all edge and inside tessellation factors to 1 and with an output topology of triangle. The tessellation factors of 1 will prevent subdivision of the edges and interior, so it's the output topology change that matters in this case. Since the control patch is a quad and the hull shader output topology is a triangle, the tessellator stage needs to create vertices for two objects, triangles in this case. Rather than re-use shared vertices between objects, the tessellator stage duplicates the vertices. That leads to domain shader being invoked 6 times for the control patch (3 per triangle). The geometry shader isn't doing any further amplification, so the stream output stage will get and need capacity for 6 vertices.
Vertex Shader Without Tessellation and Geometry Shader
The stream output stage can be used without tessellation or the geometry shader. In this case, there is 1 output stream available and it is comprised of the vertices returned by the vertex shader. There are basically 4 ways to use the graphics pipeline in this configuration:
-
Just using a vertex buffer using a list topology. There is no amplification of vertices in this case. The stream output buffer should be sized for the number of vertices passed to the input assembler stage. This is not necessarily equal to the number of vertices in the vertex buffer since the first argument to ID3D12GraphicsCommandList::DrawInstanced specifies the number of vertices to draw. Documenting this case made me realize that while the framework allows for setting the number of vertices to draw, it assumed the start index would always be 0 (the third argument to ID3D12GraphicsCommandList::DrawInstanced). So, I have added an overload to D3D12_CommandList::DrawInstanced to allow the start index to be configurable.
-
Using a vertex and index buffers in a list topology. Since the index buffer could reference the same vertex multiple times or skip vertices entirely, the stream output buffer should be sized for the number of indices passed to the input assembler stage. Like the previous case, ID3D12GraphicsCommandList::DrawIndexedInstanced has arguments to use a subset of the index buffer. Also like the previous case, documenting this made me realize that D3D12_CommandList::DrawIndexedInstanced assumed the starting index into the index buffer would be 0. And again I have added an overload to allow the start index to be configurable.
- Either of the previous cases using a strip topology. The shared vertices get duplicated effectively converting to a list topology. So, this will cause vertex amplification based on which primitive shape is used for the topology.
- Using an instance buffer with any of the previous cases. Instance buffers have a multiplicative effect on the number of vertices. Like the other buffers, ID3D12GraphicsCommandList::DrawInstanced and ID3D12GraphicsCommandList::DrawIndexedInstanced take arguments to allow a subset of the instance buffer to be used. So, the stream output buffer would need to be sized for the number of vertices for a single instance multiplied by the number of entries in the instance buffer passed to the graphics pipeline.
Adding Tessellation
When the tessellation stages are enabled, then the number of vertices produced depends on output topology and the tessellation factors. As mentioned when discussing the test program, the same vertex can be duplicated by the tessellator if they are for different objects in the output topology, which obviously increases the vertex count. Increasing the tessellation factors increases the subdivision of the control patch, which by definition increases the vertex count. The test program made things simple for these stages by using fixed values. In the case of dynamic tessellation values, such as doing camera distance LOD over a large terrain patch when viewed along the surface, the exact number of vertices produced is a bit more difficult to calculate and it is of questionable value to do so since the number of vertices could change frame to frame as the camera moves. Instead, calculating the upper bound for the vertices produced for a particular hull shader would allow for the stream output buffer to be appropriately sized, and the number of vertices actually written can be retrieved as needed.
Adding a Geometry Shader
The geometry shader allows for using multiple output streams with a maximum of 4, one of which can go to the rasterizer stage. This stage is also interesting in that the shader is responsible for adding vertices to the output streams. This means it can effectively discard a vertex by not adding to an output stream. Or conversely, it could add more vertices than there are in its input patch. The result of these possibilities is the amount of vertex amplification or reduction is implementation dependent and the shader bound to this stage will need to be analyzed for its upper bound.
Oversized Stream Output Buffer
In the course of doing some additional testing to make sure I discussed all the factors involved here, I had updated the test program (which subsequently got spun off into its own) so that the stream output buffer was massively oversized for the data set and would log the number of vertices in the stream output buffer. That logging made it obvious that the stream output buffer is appended to if there is space each time the pipeline uses it instead of it always starting from the beginning.
Stream output buffer num verts: 6
Stream output buffer num verts: 12
Stream output buffer num verts: 18
Stream output buffer num verts: 24
Stream output buffer num verts: 30
Stream output buffer num verts: 36
Stream output buffer num verts: 42
This means that the UINT64 for BufferFilledSizeLocation needs to be reset when the buffer should be overwritten. So, I have added a new function in the framework's StreamOutputBuffer/D3D12_StreamOutputBuffer classes to do exactly that:
virtual void PrepReset(CommandList& command_list, ConstantBuffer& scratch_buffer) = 0
Rather than take the approach of GetNumVerticesWrittenD3D12 where the function takes care of executing the command list and waiting on the fence, PrepReset only adds the commands to a command list. The reason for this difference is GetNumVerticesWrittenD3D12 needs to execute the command list and wait on the fence for the data to be available to a host readable buffer to do its computation. Whereas PrepReset doesn't need to do any computation, it just needs the sizeof(UINT64) bytes at the start of scratch_buffer to be 0 when the command list executes and the fence is waited on. There is a remark in the function comment to document this requirement. That also means that I can't just re-use the buffer the test program has for the camera and a separate one needs to be created. I could have made the creation of this scratch buffer internal to the class where when the first StreamOutputBuffer is created, then so is this scratch buffer. However I expect there may be other uses for a scratch buffer as I continue development of this framework.
Now that the test program is updated to call this during of the draw function, but before the stream output buffer is bound to the stream output stage, the logging of the number of vertices written to the stream output buffer produces the expected result:
Stream output buffer num verts: 6
Stream output buffer num verts: 6
Stream output buffer num verts: 6
Stream output buffer num verts: 6
Stream output buffer num verts: 6
Stream output buffer num verts: 6
Resetting the buffer also ensures that each frame the two triangles on the right are getting the stream output of the triangles on the left for the same frame.