Latest revision as of 23:02, 20 February 2013

Building MADNESS with TAU for profiling/tracing requires some modifications to the code so that TAU can parse it for instrumentation. An additional modification fixes a bug in MADNESS.

Linux

To begin, get the source from svn:

svn co -r1177 http://m-a-d-n-e-s-s.googlecode.com/svn/local/trunk madness

Next, patch it with the following patch:

tau-madness-r1177.diff

Build TAU with -pthread support.

Next, make sure you have built your own BLAS and LAPACK (system-installed ones will rarely work). Also install Google perftools. Configure as follows:

export TAU_MAKEFILE=</path/to/tau-stub-makefile>
export TAU_OPTIONS="-optVerbose -optKeepFiles -optNoRevert -optHeaderInst -optTauSelectFile=$HOME/apps/madness/select.tau"

MPICC=tau_cc.sh MPICXX=tau_cxx.sh ../configure --prefix=$HOME/apps/madness/install-tau \
LIBS="-L/usr/local/packages/lapack -llapack -lblas -lgfortran \
-L/usr/local/packages/google-perftools-1.3/lib -ltcmalloc_minimal" --disable-dependency-tracking

Build the code

make
make install

Run the code

export MADNESS_ROOT=$HOME/apps/madness/install-tau
export MAD_NTHREAD=7
export MRA_DATA_DIR=${MADNESS_ROOT}/share
export TAU_VERBOSE=1
export TAU_METRICS=LINUX_TIMERS
time mpiexec -n 1 ${MADNESS_ROOT}/bin/moldft

Examine results...
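
As a first sanity check before opening the data in pprof or ParaProf, it can help to confirm that profile files were actually written. The sketch below assumes TAU's usual profile.<node>.<context>.<thread> file naming in the run directory; the function name count_profiles is purely illustrative and is not part of TAU.

```shell
#!/bin/bash
# count_profiles DIR
# Prints how many files in DIR match the profile.<node>.<context>.<thread>
# naming pattern (an assumption; adjust if your TAU setup writes elsewhere).
count_profiles() {
  cp_dir="${1:-.}"
  cp_count=0
  for cp_file in "$cp_dir"/profile.*.*.*; do
    # The glob stays literal when nothing matches, so test for existence.
    [ -f "$cp_file" ] && cp_count=$((cp_count + 1))
  done
  echo "$cp_count"
}

# Count the profiles in the current run directory.
count_profiles .
```

If the count is zero, the run most likely was not linked against the instrumented build, or the profiles were written to a different directory.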

BG/Q

Get the source from svn (revision 3131).

Use this script to configure MADNESS (from the MADNESS wiki):

#!/bin/bash -x

export LANG=C

export XL_TOP="${IBM_MAIN_DIR}"

export LIBS="${LIBS} -L/soft/libraries/alcf/current/xl/LAPACK/lib -llapack"
export LIBS="${LIBS} -L/soft/libraries/essl/current/lib64 -lesslbg"
export LIBS="${LIBS} -L${XL_TOP}/xlf/bg/14.1/bglib64 -lxlf90_r -lxlfmath -lxlopt -lxl"
export LIBS="${LIBS} -L${XL_TOP}/xlsmp/bg/3.1/bglib64 -lxlsmp "

# this should not be necessary any more but there is no harm in including it
export CPPFLAGS="-D__bgq__"

export CXXFLAGS="${CPPFLAGS} -O2 -qdebug=IgnoreCvOnTopOfFunctionTypes" 
export CFLAGS=" -O2"
export FFLAGS=" -O2"

export MPICC=mpixlc_r 
export MPICXX=mpixlcxx_r 
export CC=mpixlc_r 
export CXX=mpixlcxx_r 
#export F77=bgxlf90_r 
export F77=mpixlf77_r

./configure \
   --prefix=`pwd` \
   --with-google-test=/home/naromero/gtest-1.6.0-XLMAY2012 \
   --enable-debugging=yes \
   --enable-optimal=no \
   --enable-optimization=no \
   --enable-warning=IBM \
   --host=powerpc64-bgq-linux \
   --with-fortran-int32

Then issue make.

The build will fail at the link step. Correct it with these commands:

cd src/lib/tensor/
make new_mtxmq/bests/bgq_mtxm.o
 ar cru libMADtensor.a tensor.o tensoriter.o basetensor.o mtxmq.o vmath.o new_mtxmq/bests/bgq_mtxm.o
cd ../../..
make

        Total wall time    297.2s
        Total  cpu time    297.2s

Overhead of Instrumentation Methods

Of the available instrumentation options, the only viable solution we found was to use header instrumentation with a selective instrumentation file (automatically generated).

Method	Number of Profiled Events	Runtime (seconds)	Overhead (%)
Uninstrumented		654s
Compiler-based Instrumentation	1321	19625s	2901%
Regular Source Instrumentation	183	748s	14.4%
Source Instrumentation with headers (-optHeaderInst)	806	1628s	150%
-optHeaderInst and selective instrumentation (auto)	539	685s	4.7%
callpath depth 2, -optHeaderInst and selective instrumentation (auto)	1773	693s	6%
callpath depth 100, -optHeaderInst and selective instrumentation (auto)	8535	893s	36.5%
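
The overhead column follows from the runtimes as (instrumented - uninstrumented) / uninstrumented. A minimal sketch of that arithmetic, assuming awk is available; the function name overhead_pct is only illustrative:

```shell
#!/bin/bash
# overhead_pct BASE INSTRUMENTED
# Prints the percent overhead of an instrumented run relative to the
# uninstrumented baseline (both in seconds), rounded to one decimal place.
overhead_pct() {
  awk -v base="$1" -v inst="$2" \
    'BEGIN { printf "%.1f\n", (inst - base) / base * 100 }'
}

# Runtimes taken from the table (uninstrumented baseline: 654s).
overhead_pct 654 685    # -optHeaderInst and selective instrumentation
overhead_pct 654 748    # regular source instrumentation
overhead_pct 654 893    # callpath depth 100
```

Computed this way, the three rows come out at 4.7, 14.4 and 36.5 percent, matching the table.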

Discussion of Overhead

Non-header instrumentation would have an acceptable overhead if selective instrumentation were used. However, the majority of the executable code (by time) in MADNESS is contained in the headers. Routines defined there are then not instrumented, so the time and counter values accumulated in them are attributed to the enclosing instrumented event. For any thread other than thread 0, this is the "ThreadBase::main" event that we added in the patch. For thread 0, the story is similar.

We investigated compiler-based instrumentation for MADNESS merely for completeness' sake: compiler instrumentation on a C++ code that makes heavy use of templates, the STL, or getters/setters is bound to result in excessive overhead. Indeed, the measured overhead is in the thousands of percent (2901%). Selective instrumentation can be applied on a file-by-file basis, but we have no automated way to do this.

To properly instrument MADNESS, we need TAU's header instrumentation facility, which is enabled by setting -optHeaderInst in the $TAU_OPTIONS variable while compiling the code. Without selective instrumentation, the overhead is quite large (150%) because of the many small one-line routines (getters/setters, etc.) that are called hundreds of millions of times. To eliminate these routines automatically, we run the application once with full instrumentation and then use the TAU tools (either tau_reduce or ParaProf) to generate a selective instrumentation file. With that file in place, we recompile the code and run it again, which brings the overhead down to 4.7%.
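
To make the idea concrete, the sketch below shows the kind of rule such a reduction applies: routines that are called very often but cost little per call go into an exclude list. The input rows, routine names, and thresholds are hypothetical and chosen only for illustration. The script is a stand-in for tau_reduce or ParaProf, not a reproduction of either. The BEGIN_EXCLUDE_LIST / END_EXCLUDE_LIST markers are the ones TAU's selective instrumentation file uses.

```shell
#!/bin/bash
# make_select_file MIN_CALLS MAX_USEC_PER_CALL
# Reads "signature|calls|usec_per_call" rows on stdin and prints a selective
# instrumentation file that excludes cheap, frequently called routines.
# Thresholds are supplied by the caller (illustrative, not TAU settings).
make_select_file() {
  awk -F'|' -v min_calls="$1" -v max_usec="$2" '
    BEGIN { print "BEGIN_EXCLUDE_LIST" }
    # Exclude only when BOTH hold: many calls and a tiny per-call time.
    $2 + 0 > min_calls && $3 + 0 < max_usec { print $1 }
    END { print "END_EXCLUDE_LIST" }
  '
}

# Hypothetical profile summary (made-up names and numbers).
make_select_file 1000000 2 <<'EOF'
int Example::get_size() const|250000000|0.4
void Example::heavy_kernel(int)|1200|85000.0
double Example::coeff(int) const|90000000|0.9
EOF
```

With the made-up input, only the two accessor-style routines end up in the exclude list, while the expensive kernel stays instrumented. The resulting file is what -optTauSelectFile in TAU_OPTIONS points to at compile time.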

Flat Profile Performance Data

Shown below are the flat profiles for thread 0 and thread 1:

The remaining threads look very similar to thread 1:

So the majority of the time (62%) is spent in madness::SeparatedConvolution<Q, NDIM>::muopxv_fast [{operator.h} {127,9}-{198,9}] and in any uninstrumented routines below it, such as madness::SeparatedConvolution<Q, NDIM>::apply_transformation [{operator.h} {70,9}-{124,9}].

Callpath Profile Performance Data

Shown below are a couple of views from ParaProf showing the callpath data:
