This is the first follow-up article to the "how the GPU works" series, and this one too is about GPU versus CPU.
Note: if you already know how and why a GPU works, how it compares to a modern CPU, and why the two are converging, you might want to skip directly to the last paragraph...
NVidia Cuda Architecture
Cuda is the NVidia API for general purpose GPU computation. What does that mean? It's an API that lets us access the GPU hardware, same as OpenGL or DirectX, but it was designed not to render shaded triangles, but to execute programs, generic programs. Well, generic... there are some restrictions. It's stream computing: you specify a smallish kernel, which executes over a huge array of data and can access memory resources only in very specific ways (compared to what we can do on a CPU).
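Just to make that concrete, here is a minimal sketch of what such a kernel could look like (the function name and the scaling operation are made-up examples, not anything from the Cuda SDK; I'll come back to the built-in index variables later). Conceptually, this tiny function runs once for every element of a huge array:

    // hypothetical example: scale every element of a big array by a constant
    __global__ void scaleArray(float* data, float k, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique index of this thread
        if (i < n)
            data[i] *= k;                               // each thread processes one element
    }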
The API itself is not too interesting; there are a lot of similar ones (most notably OpenCL and DirectX 11 compute shaders, but also RapidMind, ATI CTM/FireStream and others). What is interesting is that a new API also sheds some new light on the underlying architecture, and that architecture is what really matters today (in that respect the recent NVAPI looks really promising too, and the open ATI GPU documents are a good read as well).
Why does that matter?
For us, of course, knowing the GPU architecture is fundamental. The more information we have, the better. Plus, the compute shader model allows for some nifty new effects, especially in post-processing, and someone is even starting to map REYES (Pixar!) onto the GPU.
But GPGPU has a relevance outside the realm of graphics (so much so that most Cuda documentation is not aimed at rendering engineers). It sheds some light on the future of CPUs in general.
The future of CPUs... Multicore! Multicore is the hype! But what comes with it? How do you fit all those cores on a single chip? What about power consumption?
Well, it turns out, and it's no news, that you have to simplify... If you've only experienced the Intel desktop line of processors that's not so clear yet, even though the new Core architectures stepped back from the extremely power-hungry Pentium 4 design and looked again at the Pentium M (i.e. some of the first Centrino chips).
For the Larrabee chip, which will have many more cores than current desktop CPUs, each core is derived from the design of the original Pentium.
When it comes to computing power per watt, our current CPUs are faring pretty badly. It has been said that GPU computational power is outpacing Moore's law, while CPUs merely keep up with it. That's false: Moore's law is not about gigaflops but about the number of transistors. The fact that desktop CPUs show a Moore's-law-like trend in gigaflops means they're wasting a lot of those transistors, because gigaflops should grow not only with the transistor count, but also with better use of those transistors, and with increased clock rates as well! That's why the smart guys at Google preferred to stick with older Pentium processors in their server farms...
Where did all those transistors go, if not directly into operations per second? Well, they went into increasing performance, which is a different thing. They increased the execution speed of the applications the chips were designed to run, spending most of the budget on complex logic to handle caching, predict branches, decode and reschedule instructions... That's why "the free lunch is over": chips are getting simpler, more gigaflops are out there, but we have to do work to use them, we have to change our code.
That's really evident if you work on the current generation of gaming consoles. The processors powering the Xbox 360 and the PlayStation 3, Xenon and Cell respectively, are very similar; in fact, the former was derived from the design of the latter.
Multicore, in-order execution (no rescheduling of your assembly instructions!), little or no branch prediction, SIMD instructions with long pipelines, lots of registers (but you can't move data from one register set to another without going through memory), a few cores, and a punishing penalty for cache misses.
The idea is to minimize the power needed to achieve the result. That is also a general rule we should follow when designing software!
How to squeeze performance out of those CPUs?
Well, it turns out that's not so hard. Actually they can be incredibly fast, if you have a small-ish function that iterates over a lot of data, and that data is arranged in a good way. We use SIMD for sure, but SIMD units have long instruction pipelines and no reordering, so we need to feed them lots of instructions that have no dependencies on previous ones in order to keep them happy.
How? Well, we go wider, by unrolling (and inlining) our loop! We try to avoid branches, using conditional moves for example. We exploit data parallelism, splitting our loop across threads and prefetching memory into the cache. That's why data is king!
If, unluckily, your memory accesses are not linear, and thus prefetching does not work well, you go even wider! The idea is to have something similar to fibers: different execution contexts running in your thread, switching between them before each memory access. Of course, you first fire a prefetch and then switch, otherwise when you switch back you won't have gained anything...
Usually you don't need explicit fibers: you organize your algorithm into an outer loop that prefetches the random memory accesses, and inner, unrolled loops that provide enough ALU work to give the prefetch time to complete.
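To make that pattern concrete, here is a rough sketch in plain C (the function and variable names are made up; __builtin_prefetch is the GCC/Clang prefetch intrinsic, other compilers have equivalents). The outer step fires prefetches for the next group of random accesses while the unrolled body works on the current one:

    // hypothetical example: sum values gathered through an index table (non-linear accesses)
    // assumes count is a multiple of 4 to keep the sketch short
    float gatherSum(const float* values, const int* indices, int count)
    {
        float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f; // independent accumulators
        for (int i = 0; i < count; i += 4)
        {
            if (i + 8 <= count)                              // prefetch the next group of accesses
            {
                __builtin_prefetch(&values[indices[i + 4]]);
                __builtin_prefetch(&values[indices[i + 5]]);
                __builtin_prefetch(&values[indices[i + 6]]);
                __builtin_prefetch(&values[indices[i + 7]]);
            }
            acc0 += values[indices[i + 0]];                  // unrolled ALU work hides the latency
            acc1 += values[indices[i + 1]];
            acc2 += values[indices[i + 2]];
            acc3 += values[indices[i + 3]];
        }
        return acc0 + acc1 + acc2 + acc3;
    }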
Note that, in a way, many CPUs already implement something like this internally. Both Xenon and Cell do something similar to Intel Hyper-Threading: they have more hardware threads than cores. That is done to give each core more independent code paths to execute, so if a core is waiting on a memory request or a pipeline stall, it can still execute something by switching to the other thread and keep the pipelines happy.
How does a GPU map to a standard CPU?
Well, if you look at a GPU from the perspective of an older, non-multicore CPU, the two chips could seem totally unrelated. But nowadays, with many simpler cores starting to appear in CPUs, and GPUs becoming more and more general purpose, the gap is getting much smaller. Just take this CPU trend, push it to the extreme, and you'll have something close to a GPU (as a side note, that's why some of us rendering engineers can seem to have already seen the future, in terms of how to do threading...).
Imagine a CPU core, and think about packing a LOT of those into a single chip. You'll pretty soon run out of resources, space and power. So you have to remove stuff, simplify.
We have already seen that we can give up many technologies that were made to squeeze performance out of sequential code, like fancy memory units and branch predictors, out-of-order instruction schedulers, long pipelines and so on. On the other hand, we want to maximize our computing power, so it makes sense to go wide, add ALUs and process a lot of data in parallel (in a SIMD fashion).
The picture we have so far is of a computational component (a GPU usually has many of those) with a simple instruction decoding unit, some kind of memory access (a GPU usually has separate units for texture and vertex memory, with different caching policies and functionality), and a lot of ALUs. This looks like a pain to program! As we saw, the problem with such raw power is that it works really well only if we don't have stalls, and avoiding stalls requires no instruction dependencies and no memory accesses. That's impossible, so we have to find a way around it.
What did we say about modern CPU programming? The solution is to identify and unroll loops, to have fewer dependencies between instructions; and when things go really wrong, in case of memory stalls, we can switch execution context altogether, with fibers, in order to find new instructions that are not waiting on the stall.
In other words, we can always find instructions to process if our loops are way wider than our wide processing elements, so we don't just fill their SIMD instructions, we "overflow" them. If our CPU has 4-way SIMD vectors and our computation is unrolled as if it could use 16-way ones, it means we have 4 independent 4-way instruction chains the CPU can execute!
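As a (hedged) illustration, using SSE intrinsics and made-up names: a 4-wide loop unrolled four times gives the CPU four independent dependency chains per iteration:

    #include <xmmintrin.h>  // SSE intrinsics

    // hypothetical example: scale an array in place, 16 floats per iteration
    // (assumes data is 16-byte aligned and count is a multiple of 16)
    void scale16(float* data, float k, int count)
    {
        __m128 vk = _mm_set1_ps(k);
        for (int i = 0; i < count; i += 16)
        {
            // four independent 4-way multiplies: none depends on another's result
            __m128 a = _mm_mul_ps(_mm_load_ps(data + i + 0), vk);
            __m128 b = _mm_mul_ps(_mm_load_ps(data + i + 4), vk);
            __m128 c = _mm_mul_ps(_mm_load_ps(data + i + 8), vk);
            __m128 d = _mm_mul_ps(_mm_load_ps(data + i + 12), vk);
            _mm_store_ps(data + i + 0, a);
            _mm_store_ps(data + i + 4, b);
            _mm_store_ps(data + i + 8, c);
            _mm_store_ps(data + i + 12, d);
        }
    }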
Well, GPUs were made to process vertices and pixels; we usually have a lot of them, a lot of data to process with the same code, the same kernel. That's a big loop! There is hope of implementing the same solution in hardware!
Here things become really dependent on the specific GPU, but in general we can still draw a picture. Think about a single instruction decoder that feeds the same instruction to a lot of ALUs. Those ALUs can be SIMD or not; it doesn't really matter, because the same instruction is shared among the ALUs anyway, thus achieving parallel execution (interestingly, even though shader code can use vectors of four floats, and packing data correctly can make a big difference in performance, especially when it comes to register usage, not all GPU ALUs work natively on those vectors).
NVIDIA calls that SIMT execution, but the same concept applies to ATI GPUs as well (recent ATI units can process two instructions per cycle, one four-way vector and one scalar; NVIDIA ones can do two scalars; and there are some other limitations on the operations that can be performed too). Intel Larrabee differs from those GPUs in that it will only have explicit SIMD, 16-wide: there is no control unit dispatching the same instruction to different ALUs operating on different registers, it's all encoded in instructions that access 16-float-wide registers.
To hold the data for all those ALUs, there is a big shared register space. This register space not only provides enough registers to unroll our computation, allowing the same instruction to be repeated for more than one clock cycle on different data (usually an instruction is repeated two or four times, thus masking dependencies between instructions!), but also lets us keep data for different execution contexts of the same code, different fibers! As we're executing the same instruction on a lot of ALUs for several cycles, our fibers are not independent: they are scheduled in small blocks that share the same destiny (the same instruction flow).
The instruction decoder will automatically, in hardware, take care of switching contexts on stalls. A context has to record the state of all the (temporary) registers needed to execute the shader (or loop kernel) code, exactly like a fiber (which has its own stack and copy of the CPU registers). So, depending on the complexity of the code, the GPU can have more or fewer contexts in flight, and thus do a better or worse job of hiding big (memory) latencies.
Again, we need a lot of registers because we want our data to be way wider than our capability to execute on it! Multiply that by a lot of units made of the same components (instruction decoder, memory interface, a lot of ALUs and even more registers), and you'll have a GPU (read this)!
How does Cuda map to our view of the GPU as a rendering device?
With all that in mind, we now have a pretty good picture of what a GPU is. All of this is really well explained in many Cuda tutorials and presentations as well, which focus more specifically on the computational architecture of recent Nvidia GPUs.
The problem I've found reading all of that is that it's not only not general (i.e. it's Nvidia specific), but it also leaves open the question of how that architecture maps to the common concepts we are used to. All the terminology is totally different from that of the graphics APIs, which is understandable, as concepts like vertices, quads and texels do not make much sense in a general programming environment... But those are concepts we're used to, we know how those entities work in a GPU, so knowing which is which in Cuda terminology is useful.
What I'll do now is try to explain the basic Cuda concepts, mapping them to what I wrote before and to common graphics programming terminology. Disclaimer: I've found no document that confirms my intuitions, so the following may be very wrong!
You might already know what Cuda looks like. It's a C-like programming language (and an API): you can write functions that look very similar to shaders and execute on the GPU, using floats and SIMD vectors, but in the same language you also write code for the CPU. The CPU code is mostly about allocating memory on the GPU and filling it with data, then launching GPU functions to process that data, getting the results back, rinse and repeat.
That's very similar to what we do, usually in C/C++, in our graphics engines: we create and set GPU resources, streams, textures and shader constants, and then we execute a shader that uses those resources and gives us something back (in our case, pixels). What Cuda hides from you are the vertex and pixel shader stages; you can only declare a generic GPU function. That makes sense, because modern GPUs are unified, they don't really have different units for the two kinds of shaders. It also means that you lose control over some fixed-function GPU stages: you don't control the rasterizer, for example, nor the blending operator, and you can't use the vertex attribute interpolators either.
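A minimal sketch of that CPU-side flow, reusing the hypothetical scaleArray kernel from before (the names and sizes are made up; cudaMalloc, cudaMemcpy, cudaFree and the <<< >>> launch syntax are the standard Cuda runtime API):

    #include <cuda_runtime.h>

    // hypothetical host code: upload data, run the kernel, read the result back
    void scaleOnGpu(float* hostData, int n)
    {
        float* deviceData = 0;
        cudaMalloc((void**)&deviceData, n * sizeof(float));                          // allocate GPU memory
        cudaMemcpy(deviceData, hostData, n * sizeof(float), cudaMemcpyHostToDevice); // fill it with data
        const int threadsPerBlock = 256;
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scaleArray<<<blocks, threadsPerBlock>>>(deviceData, 2.0f, n);                // launch the GPU function
        cudaMemcpy(hostData, deviceData, n * sizeof(float), cudaMemcpyDeviceToHost); // get the result back
        cudaFree(deviceData);
    }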
The GPU functions are also called Cuda kernels; a kernel is executed in parallel, over many threads. Here, when you read "thread", you really have to think of a fiber, a lightweight execution context. Threads are scheduled in "warps", units of 32 threads, and all the threads in a warp execute the same instruction (thus, if one thread branches, all the others in the warp have to follow the same branch too, and discard the result later if it was not what they wanted). This is the very same thing that happens to vertices and quads when they get shaded: they are always processed in blocks, and each block is usually a multiple of the number of ALUs a GPU core has, as a single instruction gets fetched and executed over all the ALUs for more than one clock cycle (changing the register context the ALUs operate on at each cycle).
A first interesting thing is that threads get their own thread Id, from which they can decide which data to fetch. Compared to rendering, that means you don't get linear data streams and an indexing over them; instead, you can do arbitrary fetches. That's somewhat similar to the Xbox 360 custom vfetch functionality, where a vertex shader can receive only the vertex index, instead of the stream data, and fetch the data from the streams explicitly (e.g. for hardware instancing).
The thread Id can be organized in a linear fashion, or it can be two- or three-dimensional, which comes in handy if you're executing a kernel over data that has a grid topology.
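For example, a sketch of a kernel indexed over a two-dimensional grid (the names are made up; threadIdx, blockIdx and blockDim are the Cuda built-in variables):

    // hypothetical example: add a constant to every texel of a width x height image
    // launched, say, with dim3 block(16, 16) and dim3 grid((width + 15) / 16, (height + 15) / 16)
    __global__ void addConstant2D(float* image, float value, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // 2D coordinates of this thread
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            image[y * width + x] += value;
    }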
The second, and probably most important, thing is memory. In Cuda you have different kinds of GPU (device) memory.
Let's see them, starting from the ones "closest" to the threads: the registers. Registers are the space a function uses to hold temporary results and do arithmetic. Unsurprisingly, knowing how shaders work, you don't actually get a fixed number of them; they are allocated depending on the needs of the function. Minimizing the number of registers used is fundamental, because it controls how many threads a GPU core (what Cuda calls a "streaming multiprocessor") can hold (and as usual, the more threads we have, the more opportunities we have to find one that is not stalled on a memory access). For example, on a hypothetical core with 8192 registers, a kernel using 16 registers per thread leaves room for 512 thread contexts, while one using 64 registers leaves room for only 128.
Registers are bound to the unit (the "streaming processor") that is executing a given thread; they can't be seen across threads.
A level above the registers we have the shared memory. Shared memory can be seen across threads, but only if they are executing on the same core, and it is especially useful for implementing data caching (that's also where having 2D and 3D thread identifiers comes in handy, otherwise it could be impossible to have Cuda execute threads in a way where shared memory can be shared meaningfully). I'm not sure, but I suspect that shared memory is what, in the normal shader world, would hold shader constants; if so, the interesting addition Cuda makes is that you can now also write to that memory from a shader.
In order to control thread execution on a given core, threads are organized into thread blocks of a fixed size, and a kernel executes over them. Blocks also get their own Id (they are part of what Cuda calls a compute "grid"), and as with the thread Id it can be one- to three-dimensional.
Now, this is where things get complicated. In order to run a kernel, you have to specify the number and dimensions of the blocks you want to split it into, and the number and dimensions of the threads inside each block. The problem is that those numbers can't be arbitrary: blocks are dispatched to cores and they can't be split, so a core has to have enough resources to hold all the threads in a block. That means it has to have enough registers to create contexts to execute the kernel function on all those threads, and it has to have enough shared memory to hold the shared data of the function as well.
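As a (hedged) illustration of threads in a block cooperating through shared memory, here is a sketch of a per-block sum, assuming 256 threads per block; __shared__ and __syncthreads() are real Cuda keywords, everything else is made up:

    // hypothetical example: each block of 256 threads reduces its input values to one partial sum
    __global__ void blockSum(const float* in, float* out, int n)
    {
        __shared__ float cache[256];                    // visible only to the threads of this block
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        cache[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                                // wait until every thread has stored its value
        for (int s = blockDim.x / 2; s > 0; s >>= 1)    // tree reduction inside shared memory
        {
            if (tid < s)
                cache[tid] += cache[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            out[blockIdx.x] = cache[0];                 // one partial sum per block
    }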
Last, but not least, there is the global device memory. This is split into constant memory (read-only and cached), texture memory (read-only again, but with filtering and swizzling for fast, cached 2D access), and global memory, which is read/write but uncached.
Constant memory is most probably what rendering APIs use for vertex stream data, texture memory is obviously associated with texture samplers, and we're left with global memory. Well, generally the only resource shaders can write to is the render target; and looking at the Cuda documentation, we see that global memory writes are coalesced into bigger block writes if a series of very restrictive conditions are met.
That also smells of render target / stream-out memory, so we can guess the same GPU unit is used to implement it. Again, the good news is that this memory is now accessible for both reads and writes (render targets are only writable, even if the fixed-function blend stage does read from them). Also, we can finally implement scattering, writing to multiple, arbitrary positions. Slowly, as it disables write coalescing, but it's possible!
As a last note, Cuda also has the concept of local memory. This is really just global memory, read/write and uncached, that Cuda allocates per thread if the kernel function has local arrays or explicitly tagged local variables. Even though it's per thread, it does not live on a GPU core, but outside, in the graphics card RAM.
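To tie the memory spaces together, here is a sketch of how they show up in Cuda source (the names are made up; __constant__ and __shared__ are real qualifiers, and texture memory is left out as it goes through a separate texture-reference API):

    __constant__ float weights[4];                      // constant memory: read-only, cached

    // hypothetical example: 'in' and 'out' point to global memory, read/write, uncached
    __global__ void memorySpaces(const float* in, float* out, int n)
    {
        __shared__ float tile[256];                     // shared memory: per block, on chip
        float scratch[32];                              // per-thread array: may be placed in "local"
                                                        // memory, which really lives off chip
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;

        tile[tid] = (i < n) ? in[i] : 0.0f;             // stage a value in shared memory
        __syncthreads();

        for (int k = 0; k < 32; ++k)
            scratch[k] = tile[tid] * weights[k & 3];    // read constant memory, fill the local array

        if (i < n)
            out[i] = scratch[31];                       // write the result to global memory
    }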