Difference between revisions of "Guide:TAUChapel"
Revision as of 19:01, 5 October 2013
Chapel
MonteCarlo example
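To test out some of Chapel's language features, let us program a Monte Carlo simulation to calculate PI. We can estimate PI by counting how many uniformly random points with coordinates (x, y), for example drawn from the square [0,1] x [0,1], fall inside the unit circle, i.e. satisfy x^2 + y^2 <= 1. The fraction of points that land inside approaches PI/4, which is why the routines below multiply the count by 4.0 / n.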
Basic
Here is the basic routine that computes PI:
proc compute_pi(p_x: [] real(64), p_y: [] real(64)) : real {
  var c : sync int;
  c = 0;
  forall i in 1..n {
    // count the i-th point if it lies inside the unit circle
    if (p_x[i] ** 2 + p_y[i] ** 2 <= 1) then
      c += 1;
  }
  return c * 4.0 / n;
}
Notice that the forall loop may execute its iterations in parallel, hence the need to declare c as a sync variable. Performance is limited by the need to synchronize access to c. Take a look at this profile:
70% of the time is spent in synchronization. Let's see if we can do better.
Procedure promotion
One feature of Chapel is procedure promotion: calling a procedure that takes scalar arguments with arrays instead behaves as if each element of the arrays were passed to the procedure, potentially in parallel:
proc compute_pi(p_x: [] real(64), p_y: [] real(64)) : real {
  var c : sync int;
  c = 0;  // a sync variable starts out empty, so give it a value before the first update
  forall i in in_circle(p_x, p_y) {
    c += i;
  }
  return c * 4.0 / n;
}

proc in_circle(x: real(64), y: real(64)): bool {
  return (x ** 2 + y ** 2) <= 1;
}
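To see promotion in isolation, here is a minimal sketch that calls in_circle on two small arrays. The names xs, ys and inside, and the sample values, are made up for illustration; only in_circle is taken from the routine above.

```chapel
// Scalar procedure, identical to the one used by compute_pi above.
proc in_circle(x: real(64), y: real(64)): bool {
  return (x ** 2 + y ** 2) <= 1;
}

// Hypothetical sample points, chosen only for illustration.
var xs: [1..3] real(64) = [0.0, 0.5, 1.0];
var ys: [1..3] real(64) = [0.0, 0.5, 1.0];

// Passing arrays where scalars are expected promotes the call:
// in_circle is applied element-wise to each pair (xs[i], ys[i]).
var inside = in_circle(xs, ys);

writeln(inside);
```

The first two points lie inside the unit circle and the third (1.0, 1.0) does not, so exactly two elements of inside should be true.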
Reduction
Furthermore, this reorganization allows us to take advantage of Chapel's built-in reduction:
proc compute_pi(p_x: [] real(64), p_y: [] real(64)) : real {
  var c : int;
  c = + reduce in_circle(p_x, p_y);
  return c * 4.0 / n;
}
This also improves performance:
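For reference, the reduction version can be assembled into a complete program roughly as follows. This is a sketch under assumptions, not the guide's actual pi.chpl: the default value of n, the use of fillRandom from the Random module to generate the points, and the two seed values are all choices made here for illustration.

```chapel
use Random;

// Number of sample points. `config` allows overriding it at run time
// (for example with --n=10000000); the default is an arbitrary choice.
config const n = 1000000;

proc in_circle(x: real(64), y: real(64)): bool {
  return (x ** 2 + y ** 2) <= 1;
}

proc compute_pi(p_x: [] real(64), p_y: [] real(64)) : real {
  // The promoted call yields one bool per point; + reduce counts the trues.
  const c : int = + reduce in_circle(p_x, p_y);
  return c * 4.0 / n;
}

var p_x: [1..n] real(64);
var p_y: [1..n] real(64);

// Fill both arrays with uniform random values between 0 and 1.
// The seeds are arbitrary (assumed) odd numbers; distinct seeds keep
// the x and y coordinates from being identical sequences.
fillRandom(p_x, 13);
fillRandom(p_y, 17);

writeln("pi is approximately ", compute_pi(p_x, p_y));
```

With a million points the statistical error of the estimate is on the order of 0.002, so the result should agree with PI to roughly two decimal places.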
Multiple Locales
Let's look at how the arrays of x and y values are allocated:
var p_x: [1..n] real(64);
var p_y: [1..n] real(64);
Declared this way, both arrays live entirely on a single locale. However, Chapel provides a way to distribute these arrays across multiple locales:
use BlockDist;  // provides the Block distribution

const space = {1..n};
var Dom: domain(1) dmapped Block(boundingBox=space) = space;
var p_x: [Dom] real(64);
var p_y: [Dom] real(64);
This Block mapping will allocate the elements block-wise among the locales. Furthermore the reduction used earlier will continue to work.
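One way to check the block-wise placement is to record, for each element, the locale that owns it. The sketch below reuses the declaration from this section as written, so it is tied to the Chapel release the guide targets; newer releases may spell the distribution differently. The small default for n is an assumption made here to keep the output readable.

```chapel
use BlockDist;

// Small, arbitrary size so the printed array stays readable.
config const n = 8;

// Same declarations as in the guide (Block spelling as used there).
const space = {1..n};
var Dom: domain(1) dmapped Block(boundingBox=space) = space;
var p_x: [Dom] real(64);

// A forall over a distributed domain runs each iteration on the locale
// that owns index i, so here.id records where that element is stored.
forall i in Dom do
  p_x[i] = here.id;

writeln(p_x);
```

Running the executable with -nl 4 starts it on four locales, and the printed values should then appear in contiguous runs, one run per locale.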
Performance Results
There are a couple of options for collecting Chapel performance data with TAU. To begin, configure TAU with PDT, pthreads and bfd (for sampling).
Compiling a Chapel program with --savec c_code stores the intermediate C source files in the c_code directory. Compiling that C code with TAU is easy:
make -f c_code/Makefile CC=tau_cc.sh
But since each source file is included as a header, none of them will be instrumented. However, these source files can be modified to add TAU probes directly. Furthermore, sampling can be added to get more detail (time spent in the pthread library, for example).
Using PDT is also an option; here is a profile from Titan (Cray XK7) using PDT for instrumentation: