DirectX 11 Compute Shader tutorial

Original from: http://www.gamedev.net/community/forums/topic.asp?topic_id=516043.

With the introduction of DirectX 11 come a number of exciting new features that you, as a game developer (or a graphics technology enthusiast), will definitely want to play around with. The most prominent of them are the tessellation shaders and the compute shaders. Since my personal interest in graphics leans more towards rendering, I am specifically interested in the compute shader; someone with an inclination towards geometry would likely be more keen on diving into the tessellation shader. But whatever your focus of interest may be, I believe these new emerging features have a much more promising outlook than the geometry shaders, which in my opinion turned out to be a complete dud.

Microsoft released their beta DirectX 11 SDK in the November update, which means that you can start playing around with these new toys right away! I know what you are thinking: doesn't DirectX 11 require Windows 7? And is DirectX 11 compatible hardware even out yet? Fully DirectX 11 compatible hardware has not been released yet by either manufacturer (NVIDIA or ATI), but Microsoft has defined a subset of DirectX 11 features that will run on DirectX 10 class hardware, and luckily the compute shader is one of them! The formal name for this initiative is "DirectX 11 Compute on DirectX 10 hardware". We will start seeing public driver releases from NVIDIA and ATI for these new features soon. The point, however, is that you can start developing your first DirectX 11 application right away, with a reference device of course.
This tutorial walks you through the steps of creating a simple application that utilizes compute shaders through the DirectX 11 API. The following code is meant to run on DirectX 10 class hardware. I assume that you already have experience writing DirectX code, therefore I will not provide any redundant details.

1 Creating the device, context, and swap chain:

DXGI_SWAP_CHAIN_DESC sd;
ZeroMemory( &sd, sizeof( sd ) );
sd.BufferCount = 1;
sd.BufferDesc.Width = width;
sd.BufferDesc.Height = height;
sd.BufferDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
sd.BufferDesc.RefreshRate.Numerator = 60;
sd.BufferDesc.RefreshRate.Denominator = 1;
sd.BufferUsage = DXGI_USAGE_RENDER_TARGET_OUTPUT;
sd.OutputWindow = g_hWnd;
sd.SampleDesc.Count = 1;
sd.SampleDesc.Quality = 0;
sd.Windowed = TRUE;
D3D_FEATURE_LEVEL level;
D3D_FEATURE_LEVEL levelsWanted[] = { D3D_FEATURE_LEVEL_11_0,
                                                       D3D_FEATURE_LEVEL_10_1,
                                                       D3D_FEATURE_LEVEL_10_0};
UINT numLevelsWanted = sizeof( levelsWanted ) / sizeof( levelsWanted[0] );
D3D_DRIVER_TYPE driverTypes[] ={ D3D_DRIVER_TYPE_HARDWARE,
                                                 D3D_DRIVER_TYPE_REFERENCE,};
UINT numDriverTypes = sizeof( driverTypes ) / sizeof( driverTypes[0] );
for( UINT driverTypeIndex = 0; driverTypeIndex < numDriverTypes; driverTypeIndex++ )
{
    g_driverType = driverTypes[ driverTypeIndex ];
    hr = D3D11CreateDeviceAndSwapChain( NULL,
                                        g_driverType,
                                        NULL,
                                        createDeviceFlags,
                                        levelsWanted,
                                        numLevelsWanted,
                                        D3D11_SDK_VERSION,
                                        &sd,
                                        &g_pSwapChain,
                                        &g_pd3dDevice,
                                        &level,
                                        &g_pd3dContext );
    if( SUCCEEDED( hr ) )
        break;
    else if( g_driverType == D3D_DRIVER_TYPE_HARDWARE )
        MessageBox(NULL, L"Could not create hardware device", L"Device creation failed", MB_OK);
}
The code above cycles through the various driver types and selects the highest feature level that is available. If you were using the "DirectX 11 Compute on DirectX 10 hardware" drivers, the DirectX runtime would end up selecting D3D_FEATURE_LEVEL_10_0 or D3D_FEATURE_LEVEL_10_1, depending on what graphics card you have. Note that D3D_FEATURE_LEVEL_11_0 is only for graphics cards that support the entire DirectX 11 feature set; cards that support only the subset are still considered 10.x feature level cards.
I will assume that with the code above, you are expecting D3D_DRIVER_TYPE_HARDWARE to be selected along with D3D_FEATURE_LEVEL_10_0. Of course that will not be the case if you don't have the proper drivers, but let's just assume it is for the sake of demonstration.
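If you want to confirm which feature level the runtime actually picked, you can inspect the level variable that D3D11CreateDeviceAndSwapChain filled in. A trivial sketch:

if( level < D3D_FEATURE_LEVEL_11_0 )
{
    // A 10.x feature level device: only the "DirectX 11 Compute on
    // DirectX 10 hardware" subset of features is available.
    OutputDebugString( L"Running at feature level 10.x\n" );
}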

2 Check for Compute feature support

D3D11_FEATURE_DATA_D3D10_X_HARDWARE_OPTIONS options;
hr = g_pd3dDevice->CheckFeatureSupport( D3D11_FEATURE_D3D10_X_HARDWARE_OPTIONS,
                                                         &options,
                                                         sizeof(D3D11_FEATURE_D3D10_X_HARDWARE_OPTIONS));
if( !options.ComputeShaders_Plus_RawAndStructuredBuffers_Via_Shader_4_x )
{
    MessageBox(NULL, L"Compute Shaders are not supported on your hardware", L"Unsupported HW", MB_OK);
    return E_FAIL;
}
Now we get to the compute shader stuff. If this is the first time you are hearing about compute shaders, then let me give you a quick description of what they are before I show any code related to them. Compute shaders are basically the same as any other shader, the pixel shader for example. Just like the pixel shader is invoked for each pixel, the compute shader is invoked for each "thread". A thread is a generic and independent execution entity that doesn't really require any sort of geometry. Up until now, if you wanted to do general purpose computation on the GPU (to exploit this parallel computing beast), you had to resort to 3D geometry trickery that didn't really make any sense for the problem you were trying to solve. For example, if you wanted to do matrix multiplication on the GPU, you'd have to draw a quad to force rasterization of certain pixels, which would then allow you to run the pixel shader on them. Draw a quad for a matrix multiplication? What? This is exactly the problem that the compute shader solves. All you have to do now is dispatch a number of threads, and your shader will be executed for each of these threads. That's all. Clean and simple.
In DirectX, these threads are organized into "groups". You have X * Y * Z threads in each group, and U * V * W thread groups in your application; you can think of them as 3D blocks of threads and groups. Threads are organized into groups for synchronization purposes, but I will not get into that. One thing to note though: for "DirectX 11 Compute on DirectX 10 hardware", the third dimension is always 1, i.e. there are X * Y * 1 threads in each group, and U * V * 1 total groups. The number of groups is specified at dispatch time, and the number of threads per group is hardcoded in the compute shader. We will see this later when we get to the code.
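To make the group/thread arithmetic concrete, here is one way to define the constants used in the snippets throughout the rest of this tutorial. The macro names appear in the code below; the specific values are an arbitrary choice for illustration:

// Threads per group (hardcoded in the compute shader via [numthreads])
// and number of groups (specified at Dispatch() time).
// The Z dimension is 1 so this runs on DirectX 10 class hardware.
#define THREAD_GROUP_SIZE_X 16
#define THREAD_GROUP_SIZE_Y 16
#define THREAD_GROUPS_X     4
#define THREAD_GROUPS_Y     4

// Total thread grid; the structured buffer holds one element per thread.
#define THREAD_GRID_SIZE_X (THREAD_GROUPS_X * THREAD_GROUP_SIZE_X)   // 64
#define THREAD_GRID_SIZE_Y (THREAD_GROUPS_Y * THREAD_GROUP_SIZE_Y)   // 64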
The compute shader is just like any other shader in that you can access buffer and texture resources. But what, and how, do you output from a compute shader?
[*] From vertex shaders you output transformed vertices
[*] From geometry shaders you output primitives
[*] From pixel shaders you output color and depth
But what in the world do you output from a compute shader? Well, compute shaders actually don't output anything; you store your calculation results in a buffer, at any location you like. This is done via what is known as an "Unordered Access View" (UAV). These are just like the other resource views we have in DirectX 10, except they let you read and write at any location. UAVs can be created from buffers and textures on DirectX 11 hardware, but on DirectX 10 hardware you cannot create UAVs from typed resources (i.e. textures). Instead, we use two new types of buffers:
[*] Structured buffers are arrays of structures (AoS)
[*] Raw buffers are arrays of bytes
I will not go into the details on these because they are just, well, buffers :/ Also, I think I have provided you with just enough information to finally show you some code now.
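On the C++ side, the distinction between the two shows up in the buffer description. A minimal sketch for contrast (this tutorial only uses the structured kind, created for real in the next step):

// Structured buffer: addressed per element, with an explicit stride.
D3D11_BUFFER_DESC structuredDesc;
ZeroMemory( &structuredDesc, sizeof( structuredDesc ) );
structuredDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
structuredDesc.StructureByteStride = 16;   // size of one element

// Raw (byte address) buffer: addressed per 4-byte word, no stride.
D3D11_BUFFER_DESC rawDesc;
ZeroMemory( &rawDesc, sizeof( rawDesc ) );
rawDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS;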

3 Creating a structured buffer

struct BufferStruct{
UINT color[4];   // must match the uint4 in the HLSL BufferStruct (see step 6)
};
D3D11_BUFFER_DESC sbDesc;
sbDesc.BindFlags = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
sbDesc.CPUAccessFlags = 0;
sbDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
sbDesc.StructureByteStride = sizeof(BufferStruct);
sbDesc.ByteWidth = sizeof(BufferStruct) * THREAD_GRID_SIZE_X * THREAD_GRID_SIZE_Y;
sbDesc.Usage = D3D11_USAGE_DEFAULT;
// No initial data is needed here; the compute shader fills the buffer.
hr = g_pd3dDevice->CreateBuffer(&sbDesc, NULL, &g_pStructuredBuffer);
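To upload initial contents instead of starting with an uninitialized buffer, you would fill a D3D11_SUBRESOURCE_DATA and pass it as the second argument to CreateBuffer. A minimal sketch (assuming <vector> is included):

// Zero-initialized CPU-side data, one element per thread.
std::vector<BufferStruct> initial( THREAD_GRID_SIZE_X * THREAD_GRID_SIZE_Y );
D3D11_SUBRESOURCE_DATA InitData;
InitData.pSysMem = initial.data();
InitData.SysMemPitch = 0;
InitData.SysMemSlicePitch = 0;
hr = g_pd3dDevice->CreateBuffer( &sbDesc, &InitData, &g_pStructuredBuffer );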

4 Creating a UAV of a structured buffer

D3D11_UNORDERED_ACCESS_VIEW_DESC sbUAVDesc;
sbUAVDesc.Buffer.FirstElement = 0;
sbUAVDesc.Buffer.Flags = 0;
sbUAVDesc.Buffer.NumElements = THREAD_GRID_SIZE_X * THREAD_GRID_SIZE_Y;
sbUAVDesc.Format = DXGI_FORMAT_UNKNOWN;   // structured buffer views use DXGI_FORMAT_UNKNOWN
sbUAVDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
hr = g_pd3dDevice->CreateUnorderedAccessView(g_pStructuredBuffer, &sbUAVDesc, &g_pStructuredBufferUAV);
By the way, we can also create views of this same buffer that can be bound to other stages of the graphics pipeline, such as the pixel shader. This is done via Shader Resource Views (SRVs), which are read-only views. (Note: UAVs are available in pixel shaders as well, but only on DirectX 11 hardware.)
So for instance, you can have your compute shader perform some sort of simulation, store the result in a buffer via the UAV, and then read the result from the pixel shader via the SRV and render something based on that. Anyway, here is how you create an SRV:

5 Creating an SRV of a structured buffer

D3D11_SHADER_RESOURCE_VIEW_DESC sbSRVDesc;
// Note: ElementOffset/ElementWidth share a union with FirstElement/NumElements,
// so only the latter pair is set here.
sbSRVDesc.Buffer.FirstElement = 0;
sbSRVDesc.Buffer.NumElements = THREAD_GRID_SIZE_X * THREAD_GRID_SIZE_Y;
sbSRVDesc.Format = DXGI_FORMAT_UNKNOWN;
sbSRVDesc.ViewDimension = D3D11_SRV_DIMENSION_BUFFER;
hr = g_pd3dDevice->CreateShaderResourceView( g_pStructuredBuffer,
                                             &sbSRVDesc,
                                             &g_pStructuredBufferSRV );
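Once created, the SRV can be bound to the pixel shader stage for read-only access; slot 0 here is an arbitrary choice:

// Bind the structured buffer's SRV to pixel shader register t0.
g_pd3dContext->PSSetShaderResources( 0, 1, &g_pStructuredBufferSRV );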
Alright! Time to create the compute shader.

6 A simple compute shader

Here is the source code of a very simple compute shader. It really does not do any calculations per se; it simply identifies its thread ID and writes a value to the UAV.

struct BufferStruct {
uint4 color;
};
RWStructuredBuffer<BufferStruct> g_OutBuff;

// The number of threads per group is hardcoded here via [numthreads].
[numthreads( THREAD_GROUP_SIZE_X, THREAD_GROUP_SIZE_Y, 1 )]
void main( uint3 threadIDInGroup : SV_GroupThreadID, uint3 groupID : SV_GroupID ){
float4 color = float4( (float)threadIDInGroup.x / THREAD_GROUP_SIZE_X,
                             (float)threadIDInGroup.y / THREAD_GROUP_SIZE_Y, 0, 1
                            ) * 255;
// Flatten the 2D global thread coordinate into a linear buffer index.
int buffIndex = ( groupID.y * THREAD_GROUP_SIZE_Y + threadIDInGroup.y )
                      * THREAD_GROUPS_X * THREAD_GROUP_SIZE_X
                      + ( groupID.x * THREAD_GROUP_SIZE_X + threadIDInGroup.x );
g_OutBuff[ buffIndex ].color = color;
}
Note how we have hardcoded the number of threads per group in the shader via the [numthreads] attribute. Also note the input parameters to the main() function: each thread is identified by a 3D thread ID within its group (SV_GroupThreadID), and also a 3D group ID (SV_GroupID); both of these values are provided to us as input to the shader. In this compute shader, I'm simply using these IDs to compute an index into my structured buffer and output some value. Nothing fancy. Oh, and the THREAD_GROUP_SIZE_* macros are something I defined; they are not provided by the runtime. The number of groups, on the other hand, is specified when you execute the compute pass; we'll see that in a bit.

7 Compile and create the compute shader object

// Note: cs_src must have the THREAD_GROUP_SIZE_* and THREAD_GROUPS_* macros
// defined; one way to supply them is shown after this snippet.
hr = D3DCompile(cs_src, strlen(cs_src), NULL, NULL, NULL, "main", "cs_4_0", 0, 0, &pByteCodeBlob, NULL);
hr = g_pd3dDevice->CreateComputeShader( pByteCodeBlob->GetBufferPointer(),
                                                           pByteCodeBlob->GetBufferSize(),
                                                           NULL,
                                                           &g_pComputeShader);
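Since the shader source references THREAD_GROUP_SIZE_X, THREAD_GROUP_SIZE_Y, and THREAD_GROUPS_X, those macros have to reach the compiler somehow: either bake them into the source string, or hand them to D3DCompile through its pDefines parameter. A sketch of the latter; the stringified values assume the example constants from earlier:

// Feed the thread group dimensions to the shader as preprocessor defines.
D3D_SHADER_MACRO defines[] =
{
    { "THREAD_GROUP_SIZE_X", "16" },
    { "THREAD_GROUP_SIZE_Y", "16" },
    { "THREAD_GROUPS_X",     "4"  },
    { NULL, NULL }   // the array must be NULL-terminated
};
hr = D3DCompile( cs_src, strlen(cs_src), NULL, defines, NULL,
                 "main", "cs_4_0", 0, 0, &pByteCodeBlob, NULL );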

8 And finally we have our compute pass, where we dispatch the threads

UINT initCounts = 0;
// Bind the UAV to the compute shader stage (register u0) and run the shader.
g_pd3dContext->CSSetUnorderedAccessViews( 0, 1, &g_pStructuredBufferUAV, &initCounts );
g_pd3dContext->CSSetShader( g_pComputeShader, NULL, 0 );
g_pd3dContext->Dispatch( THREAD_GROUPS_X, THREAD_GROUPS_Y, 1 );
// Unbind the UAV afterwards so the buffer can be read through its SRV elsewhere.
ID3D11UnorderedAccessView* pNullUAV = NULL;
g_pd3dContext->CSSetUnorderedAccessViews( 0, 1, &pNullUAV, &initCounts );
Look at the Dispatch() call: this is where we specify the number of groups that we want to execute. Note that the last parameter (i.e. the Z dimension) is 1, since we want to be able to run this on DirectX 10 class hardware.
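If you want to verify the results on the CPU, the usual pattern is to copy the default-usage buffer into a staging buffer and map that. A minimal sketch, reusing sbDesc and the globals from the earlier steps:

// Create a CPU-readable staging copy of the structured buffer.
D3D11_BUFFER_DESC stagingDesc = sbDesc;            // same ByteWidth as the original
stagingDesc.Usage          = D3D11_USAGE_STAGING;
stagingDesc.BindFlags      = 0;                    // staging buffers cannot be bound
stagingDesc.MiscFlags      = 0;
stagingDesc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
ID3D11Buffer* pStagingBuffer = NULL;
hr = g_pd3dDevice->CreateBuffer( &stagingDesc, NULL, &pStagingBuffer );

// Copy the GPU results over and map them for reading.
g_pd3dContext->CopyResource( pStagingBuffer, g_pStructuredBuffer );
D3D11_MAPPED_SUBRESOURCE mapped;
if( SUCCEEDED( g_pd3dContext->Map( pStagingBuffer, 0, D3D11_MAP_READ, 0, &mapped ) ) )
{
    BufferStruct* pData = (BufferStruct*)mapped.pData;
    // ... inspect pData[0 .. THREAD_GRID_SIZE_X * THREAD_GRID_SIZE_Y - 1] ...
    g_pd3dContext->Unmap( pStagingBuffer, 0 );
}
pStagingBuffer->Release();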
And that's all folks! I hope you enjoyed the tutorial.