OpenACC
Matrix Multiply
TAU v2.25.1 supports the OpenACC directives available in PGI 12.3 and later. TAU instruments the PGI runtime library layer and records detailed source information. The simple matrix multiply application used here is written with OpenACC annotations and was compiled with the PGI -ta=nvidia flag. To profile this application with TAU:
Configure TAU:
./configure -c++=pgCC -cc=pgcc -fortran=pgi
make install
export TAU_MAKEFILE=<path to TAU>/x86_64/lib/Makefile.tau-pgi
Compile:
make
Run:
tau_exec -T pgi -openacc ./mm
Use TAU's analysis tools to view the performance data:
pprof
paraprof
Here we see the time spent in the PGI runtime library routines. The download time for variable a in the source code dominates the execution. The nature of each operation is shown in parentheses.
Next, this data is presented in ParaProf's thread statistics window.
The driver code.
By clicking on a runtime layer routine, we can see the application function in which the kernel was invoked, along with the associated variable, the source line number, and the size of the array. By right-clicking and choosing 'Show Source Code', we can see the source line where this transfer takes place. For the downloadxx_multiply_matrices routine with the variable 'a', the time is attributed on the host at the source location shown below. It represents the transfer time plus the time the host spends waiting for results to be returned from the GPU.
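To make the attribution above concrete, here is a minimal sketch of what an OpenACC matrix multiply routine of this kind might look like. It is not the source file shipped with TAU: the function name multiply_matrices and the result variable a are taken from the text above, while the signature, the flat row-major layout, the operand names b and c, and the choice of data clauses are assumptions made for illustration. The copyout clause on a is the device-to-host transfer that a profile would report as a download for that variable.

```c
#include <assert.h>

/*
 * Hypothetical sketch, not the TAU example source.
 * Computes a = b * c for n x n matrices stored in row-major order.
 */
void multiply_matrices(int n, float *a, const float *b, const float *c)
{
    assert(n > 0);

    /*
     * copyin:  b and c are transferred host -> device before the kernel runs.
     * copyout: a is transferred device -> host afterwards; the host waits
     *          here until the GPU has produced the result.
     * A compiler without OpenACC support ignores these pragmas and runs
     * the loops serially on the host.
     */
    #pragma acc kernels copyin(b[0:n*n], c[0:n*n]) copyout(a[0:n*n])
    {
        #pragma acc loop independent
        for (int i = 0; i < n; i++) {
            #pragma acc loop independent
            for (int j = 0; j < n; j++) {
                float sum = 0.0f;
                for (int k = 0; k < n; k++) {
                    sum += b[i * n + k] * c[k * n + j];
                }
                a[i * n + j] = sum;
            }
        }
    }
}
```

With a layout like this, the time charged to the transfer of a includes both the copy itself and the host-side wait for the kernel to finish, which is consistent with that entry dominating the profile.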
OpenACC example source code
Matrix Multiply using the OpenACC directives and the Makefile to run with TAU.