If some of you are thinking of writing a CUDA program, here are a few things to keep in mind:
1.) Reduce the number of registers used so that more threads can run in parallel
2.) Reduce the number of memory accesses
3.) Store runtime variables in registers
4.) Do not use local arrays in your code like int a[3]={1,2,3} - better use individual variables such as a0=1; a1=... etc. if possible, so the values can stay in registers (see the first sketch after this list)
5.) Write small kernels. If you have one large kernel, try to split it up into multiple small ones - it might be faster because each one uses fewer registers.
6.) Use textures to store your data where possible. Texture reads are cached - global memory reads aren't (see the texture sketch after this list)
7.) Conditional jumps should take the same branch for all threads of a warp (see the divergence sketch after this list)
8.) Avoid loops that are run by only a minority of threads while the others sit idle
9.) Use fast math routines where possible (see the fast math sketch after this list)
10.) A complex calculation is often faster than a large lookup table
11.) Writing your own cache manager that uses shared memory for caching might not be an advantage
12.) Try to avoid multiple threads accessing the same memory element (such accesses get serialized - this also applies to shared memory)
13.) Try to coalesce global memory accesses (see the coalescing sketch after this list)
14.) Try to avoid bank conflicts when reading shared memory
15.) Small lookup tables can be stored in shared memory (see the lookup table sketch after this list)
16.) Experiment with the number of parallel threads to find the optimum. If you run out of registers, use --maxrregcount=... (see the last sketch after this list)
17.) If you can implement your method in GLSL, it might be faster than CUDA. In GLSL you get a lot of calculations for free, such as alpha blending, fog, z-buffer testing and interpolation of variables between pixels, and perhaps better thread handling too. You also don't have to copy the rendered image around as a PBO, and you'll save development time since there is no bluescreen from a bad pointer.
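Below are a few minimal sketches for some of the points above. They are illustrative only - kernel names, data and parameters are made up and not taken from any particular project.

For point 4: a small per-thread array is easily spilled to slow local memory (especially when it is indexed dynamically), while individual scalar variables can stay in registers.

__global__ void evalPoly(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Keep the coefficients in individual scalars instead of a local
    // array like float c[4] = {...}, so the compiler can hold them in
    // registers rather than spilling them to local memory.
    float c0 = 1.0f, c1 = 0.5f, c2 = -0.25f, c3 = 0.125f;
    float x  = in[i];
    out[i]   = c0 + x * (c1 + x * (c2 + x * c3));   // Horner scheme
}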
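For point 6: the post refers to the old texture reference API (cudaBindTexture); the sketch below uses the texture object API that later replaced it. The caching argument applied mainly to early GPUs, where ordinary global memory reads were not cached.

#include <cuda_runtime.h>
#include <cstring>

__global__ void scale(cudaTextureObject_t tex, float *out, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * tex1Dfetch<float>(tex, i);   // cached texture read
}

void runScale(const float *d_in, float *d_out, int n, float s)
{
    // Describe the linear device buffer as a texture resource.
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType                = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr      = const_cast<float*>(d_in);
    resDesc.res.linear.desc        = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

    scale<<<(n + 255) / 256, 256>>>(tex, d_out, n, s);

    cudaDestroyTextureObject(tex);
}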
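For points 7 and 8: in the first (made-up) kernel the branch splits every warp, so both paths are executed one after the other; in the second the condition is the same for all 32 threads of a warp, so each warp runs only one path.

// Divergent: even and odd threads of the same warp take different branches.
__global__ void divergent(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}

// Warp-uniform: whole warps of 32 threads take the same branch.
__global__ void warpUniform(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / 32) % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}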
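For point 9: the __sinf/__expf intrinsics trade some accuracy for much higher throughput than sinf/expf; alternatively the whole file can be compiled with nvcc --use_fast_math. The formula here is just an example.

__global__ void dampedSine(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __expf(-in[i]) * __sinf(in[i]);   // fast, slightly less accurate
}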
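For point 13: in the first sketch consecutive threads read and write consecutive elements, so the hardware can combine the accesses of a warp into few memory transactions; the strided version breaks this pattern apart.

// Coalesced: thread i accesses element i.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Not coalesced: neighbouring threads access elements that are far apart.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i * stride] = in[i * stride];
}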
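For points 14 and 15: a small table is loaded into shared memory once per block and then read many times. The table size and kernel are made up; note that data-dependent indices can still cause bank conflicts, so keeping the table small and the access pattern simple helps.

#define LUT_SIZE 256

__global__ void applyLut(const unsigned char *in, float *out,
                         const float *lutGlobal, int n)
{
    __shared__ float lut[LUT_SIZE];

    // Each thread copies a few entries, then all threads wait.
    for (int j = threadIdx.x; j < LUT_SIZE; j += blockDim.x)
        lut[j] = lutGlobal[j];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = lut[in[i]];   // fast shared memory lookup
}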
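For point 16: the register limit can be set per compilation unit on the nvcc command line, or per kernel with __launch_bounds__; the limit of 32 registers and the block size of 256 below are just example values to tune.

// Per file:   nvcc --maxrregcount=32 -O3 kernels.cu
// Per kernel: __launch_bounds__(maxThreadsPerBlock) hints the compiler
//             to keep register usage low enough for that block size.
__global__ void __launch_bounds__(256) scaleAll(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}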