So how does the actual shader execution unit work? This is where things get very platform (GPU) specific, but let's try to get the general picture, without infringing any NDA. Even for developers it's not always easy to find in-depth information on all the details; luckily, most of the time they aren't needed either.
For the interested reader, a nice starting point is the documentation that ATI-AMD and Intel have recently disclosed about their GPUs. The Intel ones are more interesting than you might imagine, considering they describe low-end graphics chips. On the NVidia side, the best documents you can find (as of now) are the CUDA/G80 ones (and as CUDA is kinda popular now, there are interesting investigations done even by third parties)...
Enough said, let's start. Every GPU has a number of arithmetic units (ALUs) and texture units. As in every modern processor, memory access is a couple of orders of magnitude slower than instruction processing, so what we want is to keep our execution units always full, in order to amortize those costs: we accept high latency in exchange for high throughput.
That is not something new at all; CPUs started this trade a long time ago. A single instruction was split into a number of simpler stages, and those stages were arranged in a deep pipeline. If the pipeline is always full we have high latency, as every instruction has to go through all those pipeline stages, but if we have no bubbles in the pipeline we get high throughput: the mean number of instructions per second that we're able to process is high. Simpler stages meant more gigahertz; deep pipelines meant that a pipeline stall was, and is, incredibly expensive (i.e. a branch misprediction). Even more similar to what GPUs do is hyperthreading: we get more "hardware threads" per functional CPU core, because that way the CPU has different independent streams of instructions to compute, and if one is stalled on a memory access, there's another one to keep its ALUs busy...
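You can see the same trick in miniature even in plain CPU code: give the processor several independent chains of work and it can overlap their latencies. A tiny sketch (host-side C++, purely illustrative):

// Summing with one accumulator creates a single dependency chain:
// each add has to wait for the previous one to complete.
float sum_single(const float* data, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += data[i];
    return acc;
}

// Four independent accumulators give the CPU four independent
// "streams" to schedule: while one add is in flight another can
// issue, hiding part of the latency (the same idea as
// hyperthreading, just at the instruction level).
float sum_quad(const float* data, int n)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    int i = 0;
    for (; i + 3 < n; i += 4)
    {
        a0 += data[i + 0];
        a1 += data[i + 1];
        a2 += data[i + 2];
        a3 += data[i + 3];
    }
    for (; i < n; ++i) // leftovers
        a0 += data[i];
    return (a0 + a1) + (a2 + a3);
}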
GPUs employ the same ideas, but in a way more radical fashion than CPUs do. The stream execution model is such that we want to execute the same instruction sequence (shaders) on a huge amount of data (vertices or pixels), where all the computations are independent (even geometry shaders have access to topological information, i.e. how a vertex is connected to other vertices, but the computation on one vertex has no influence on the ones done for the other vertices).
So what we do is partition the input data into big groups; within a group, all the data has to be processed in the same way (with the same shader/pipeline configuration).
Execution in a group happens in parallel: if we have a shader of ten instructions and the group is made of one hundred different inputs, the ALUs compute the first instruction for each of the one hundred inputs, then the second, and so on. Usually more than a single ALU works on a given group (ALUs are split into different pipelines, and each pipeline can process a different group/shader). The problem with this approach is that we need to store not only all the different inputs that make up a group, but also all the intermediate data we need during the execution of the shader.
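To make that concrete, here is a minimal CUDA sketch (CUDA being the closest public thing to those NVidia documents; the kernel and the sizes are made up for illustration). Every thread runs the same "shader" on its own independent input, and the hardware executes the threads in fixed-size groups, one instruction at a time:

#include <cuda_runtime.h>

// The same "shader" (kernel) runs over a huge number of independent
// inputs; the hardware splits the threads into fixed-size groups
// that execute each instruction in lockstep (warps of 32 on G80).
__global__ void transform(const float4* in, float4* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 v = in[i];        // independent input per thread
    v.x = v.x * 2.0f + 1.0f; // each instruction is done for the
    v.y = v.y * 2.0f + 1.0f; // whole group before moving on to
    v.z = v.z * 2.0f + 1.0f; // the next one
    out[i] = v;              // independent output per thread
}

// Host side, one thread per input element:
// transform<<<(n + 255) / 256, 256>>>(d_in, d_out, n);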
The input of a vertex shader, for example, can be made of only four floats (the vertex position), but the shader itself could require a higher number of floats as temporary storage during its execution. That's why, when shaders are compiled, the maximum number of registers used is also recorded, for the GPU to know. Each register is made of four floats. Each GPU pipeline has a limited number of registers available to process its execution group, so the size of that group is limited by the space each input requires for processing.
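A back-of-the-envelope example (all the numbers here are made up, the real ones are in the vendor documentation):

// Hypothetical figures, for illustration only.
const int register_file = 8192; // 4-float registers per pipeline
const int regs_light    = 4;    // registers used by a cheap shader
const int regs_heavy    = 32;   // registers used by a hungry shader

// Threads that can be kept in flight at the same time:
const int in_flight_light = register_file / regs_light; // 2048
const int in_flight_heavy = register_file / regs_heavy; // 256

// Eight times fewer threads to switch between means eight times
// less memory latency that the pipeline is able to hide.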
In other words, the more registers a shader needs, the fewer parallel threads a group will be made of, and the less latency hiding we get. That's a key concept of GPU processing. This execution grouping is also the very same reason why dynamic branching on the GPU is not really dynamic, but happens in groups: if all the pixels or vertices in a group go through the same branch, only that one is evaluated; otherwise both are, and the right result is applied to each input using conditional moves.
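In (made-up) CUDA-flavoured code, grouped branching looks like this: if the whole group agrees on the condition only one side is executed, otherwise both sides run and each thread keeps the result of its own path, exactly as a conditional move would do. The shade_a/shade_b functions are hypothetical stand-ins for some expensive work:

__device__ float shade_a(float x) { return sqrtf(x) * 2.0f; } // hypothetical
__device__ float shade_b(float x) { return x * x + 1.0f; }    // hypothetical

__global__ void shade(const float* input, float* output, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = input[i];
    float result;
    if (x > 0.5f)             // if every thread in the group takes
        result = shade_a(x);  // the same side, only that side runs;
    else                      // if they diverge, the group executes
        result = shade_b(x);  // both sides and each thread selects
                              // its own result
    output[i] = result;
}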
Of course, in real GPUs there are also other limiting factors: usually, even if there are enough registers, only a fixed number of pixel quads or vertices can be in flight at any given moment, and pipeline bubbles occur when we need a change in the pipeline configuration, as execution in a group has to be exactly the same. Unfortunately, knowing when those state changes happen depends on the specific GPU platform, and sometimes things get weird.
Understanding latencies and pipeline stages in the entire GPU is crucial in order to write effective shaders. To further hide memory latencies, some GPUs can execute different instructions of the same shader on the different threads of a group, so if a thread is busy on a memory access, ALU processing can be immediately scheduled for other threads that have already finished with the same access. That also means that for each memory access you get a number of ALU instructions "for free", as they're hidden by the memory access anyway, and vice versa.
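Again with made-up numbers, the back-of-the-envelope version of that "for free" claim:

// Hypothetical figures, for illustration only.
const int fetch_latency = 400; // cycles for a texture fetch to complete
const int threads       = 100; // threads in flight in the pipeline

// While one thread waits on its fetch the scheduler runs ALU work
// from the other threads; the fetch is fully hidden as long as each
// thread has at least this many ALU instructions per fetch:
const int free_alu = fetch_latency / threads; // = 4

// Fewer ALU instructions than that per fetch: fetch bound, ALUs idle.
// More: ALU bound, and the fetches come for free instead.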
That's true for other latencies in the GPU as well. For example, writing to render targets always takes a number of cycles: even if your pixel shader executes in fewer than those cycles, overall that pipeline stage is a bottleneck for pixel processing anyway, so your shader won't go any faster. The same applies to interpolator latencies, triangle setup/clipping and vertex fetching. Usually the memory-related latencies are so big that the only thing you should care about is balancing ALUs with fetching/writing, but still, for some simple shaders, the other latencies can be the limiting factors.
The key to GPU performance is balancing this huge pipeline: every stage has to be always busy, and if it is, enormous throughput can be obtained. The key to shader performance is balancing the ALU count with the texture/vertex fetches, while at the same time trying to keep our register count as small as possible. Pipeline bubbles are performance killers. Bubbles are caused by configuration changes. Those are the general rules. Exceptions to those rules can really kill you, but in general, this is what you should aim for.
Next time (if South Park does not drain all my free time) we'll see some shader coding advice, now that we've got the idea of how everything works.