Scheduling FFT computation on SMP and multicore systems

A Ali, L Johnsson, J Subhlok - Proceedings of the 21st annual …, 2007 - dl.acm.org
Proceedings of the 21st annual international conference on Supercomputing, 2007
The increased complexity of memory systems, introduced to narrow the gap between processor and memory speeds, has made it increasingly hard for compilers to optimize arbitrary code within a palatable amount of time. With the emergence of multicore (CMP), multiprocessor (SMP) and hybrid shared memory multiprocessor architectures, achieving high efficiency is becoming even more challenging. To address the challenge of achieving high efficiency in performance-critical applications, domain-specific frameworks have been developed that aid compilers in scheduling the computations. We have developed a portable framework for the Fast Fourier Transform (FFT) that achieves high efficiency by automatically adapting to various architectural features. Adapting to parallel architectures by searching through all combinations of schedules (plans) is an expensive task, even when the search is conducted in parallel. In this paper, we develop heuristics to simplify the generation of better schedules for parallel FFT computations on CMP/SMP systems. We evaluate the performance of OpenMP and PThreads implementations of FFT on a number of recent architectures. The performance of parallel FFT schedules is compared with that of the best plan generated for sequential FFT, and the speedup for different numbers of processors is reported. Finally, we present a performance comparison between the UHFFT and FFTW implementations.