Guide:TAUCrayOpenAcc
Jacobi example
Let's look at a simple Jacobi example written in Cray OpenACC:
!**********************************************************************
! matmult.f90 - simple matrix multiply implementation
!************************************************************************
subroutine initialize(a, b, n)
  integer n, i, j
  real a(n,n)
  real b(n,n)

  ! first initialize the A matrix
  do i = 1,n
    do j = 1,n
      a(j,i) = i
    end do
  end do

  ! then initialize the B matrix
  do i = 1,n
    do j = 1,n
      b(j,i) = i
    end do
  end do
end subroutine initialize

subroutine multiply_matrices(a, b, c, matsize)
  IMPLICIT NONE
  integer i, j, k, matsize
  real a(matsize, matsize)
  real b(matsize, matsize)
  real c(matsize, matsize)

  ! a and b are only read on the device; c is only written back
  !$acc data copyin(a,b) copyout(c)
  !$acc kernels loop
  do k = 1,matsize
    do i = 1,matsize
      ! c is copyout only, so it must be zeroed on the device before accumulating
      c(i,k) = 0.0
      do j = 1,matsize
        c(i,k) = c(i,k) + a(i,j) * b(j,k)
      enddo
    enddo
  enddo
  !$acc end kernels loop
  !$acc end data
end subroutine multiply_matrices

program main
  integer SIZE_OF_MATRIX
  parameter (SIZE_OF_MATRIX = 1000)
  real a(SIZE_OF_MATRIX,SIZE_OF_MATRIX)
  real b(SIZE_OF_MATRIX,SIZE_OF_MATRIX)
  real c(SIZE_OF_MATRIX,SIZE_OF_MATRIX)
  integer matsize

  matsize = SIZE_OF_MATRIX
  call initialize(a, b, matsize)

  ! multiply the matrices here using C(i,j) += (A(i,k)* B(k,j))
  call multiply_matrices(a, b, c, matsize)
end program main
We will start with a simple OpenACC parallel loop directive placed right before the Jacobi computation. Here is the TAU profile:
We have profiles for the Jacobi kernel ("jacobi_$ck_L215_2"), memory copies, and CPU synchronization. The time spent copying data to the GPU completely dominates the runtime. Let's look at some details:
Nearly 26,000 memory copies for a total of 99 GB. That is a lot of memory being moved. As an improvement, let's try to keep as much data on the GPU as possible.
Next, we initialize the matrices on the GPU instead of on the host. This is the profile we see:
Much better performance: memory copies to the GPU are now about a quarter of what they were. The second kernel ("jacobi_$ck_L281_6") is the final reduction. And the number of bytes copied:
Only 25 GB in about 11,500 copies.
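The profiled source is not listed on this page, so the following is only a minimal sketch of the two changes described above: keeping the arrays resident on the GPU across iterations and initializing them on the device. The names u, unew, n and niter, the boundary condition, and the loop structure are all assumptions made for illustration, not the code that produced the profiles.

```fortran
! Hypothetical sketch: u, unew, n and niter are illustrative names,
! not taken from the profiled application.
program jacobi_resident
  implicit none
  integer, parameter :: n = 256, niter = 100
  real, allocatable :: u(:,:), unew(:,:)
  integer :: i, j, iter

  allocate(u(n,n), unew(n,n))

  ! One data region around the whole solve: nothing is copied in,
  ! unew lives only on the device, and u is copied back once at the end.
  !$acc data copyout(u) create(unew)

  ! Initialize on the device instead of copying host data in.
  !$acc parallel loop collapse(2)
  do j = 1, n
     do i = 1, n
        u(i,j) = merge(1.0, 0.0, i == 1)   ! fixed value on one edge
     end do
  end do

  do iter = 1, niter
     ! Jacobi sweep over the interior points.
     !$acc parallel loop collapse(2)
     do j = 2, n-1
        do i = 2, n-1
           unew(i,j) = 0.25 * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1))
        end do
     end do
     ! Copy the updated interior back into u, still on the device.
     !$acc parallel loop collapse(2)
     do j = 2, n-1
        do i = 2, n-1
           u(i,j) = unew(i,j)
        end do
     end do
  end do

  !$acc end data

  print *, 'u(2,2) after', niter, 'sweeps:', u(2,2)
end program jacobi_resident
```

Without the enclosing data region, each parallel loop would typically transfer its arrays at entry and exit, which is the kind of per-iteration copy traffic seen in the first profile. Built without OpenACC support, the directives are treated as comments and the program runs serially.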
Configuring
Here is how to configure and use TAU to collect Cray OpenACC profiles:
./configure -arch=craycnl -cuda=/opt/nvidia/cudatoolkit/4.1.28 -cudalibrary=-L/opt/nvidia/cudatoolkit/4.1.28/lib64\ -L/opt/nvidia/cudatoolkit/4.1.28/extras/CUPTI/lib64\ -lcupti\ -L/opt/cray/nvidia/default/lib64\ -lcuda -bfd=none -mpi -useropt=-DTAU_MPICH3
And run this way:
export TAU_CUPTI_API=driver
aprun -n 8 tau_exec -T mpi,cray,cupti -cupti ./himeno