PathScale Compiler Suite (Fortran, C and C++ compilers) flag
descriptions, for SPEC OMP2001 submissions.

By default, the EKOPath compiler generates code for the 64-bit ABI for
AMD64 processors. To generate a 32-bit executable, the -m32 flag is
used.

Portability Flags:
-----------------

-extend_source
    Specifies a 132-character line length for fixed-format source
    lines rather than the default of 72 characters.

-fixedform
    Tells the f90 compiler to use F77 fixed format instead of F90
    free format.

-gnu[N]
    (For C/C++ only) Enables the compiler to generate code compatible
    with the GNU N series of compilers, where N is either 3 or 4. On
    systems whose system compiler is GCC 3, the default is -gnu3; on
    GCC 4 systems the default is -gnu4. Use -show-defaults to display
    the default. This flag was required to compile C/C++ with -mp on
    sles10 sp1. It is not necessary on sles10 sp1 when compiling
    C/C++ without -mp.

Optimization Flags:
------------------

Some suboptions either enable or disable a feature. To enable a
feature, either specify only the suboption name or specify =1, =ON,
or =TRUE. To disable a feature, specify =0, =OFF, or =FALSE. These
values are case-insensitive: 'on' and 'ON' mean the same thing.
Below, ON and OFF indicate the enabling or disabling of a feature.

-CG[:...]
    Code Generation option group: controls the optimizations and
    transformations of the instruction-level code generator.

-CG:cflow=(ON|OFF)
    OFF disables control flow optimization in the code generator.
    The default is ON.

-CG:gcm=(ON|OFF)
    OFF disables the instruction-level global code motion
    optimization phase. The default is ON.

-CG:load_exe=n
    Specifies the threshold for subsuming a memory load operation
    into the operand of an arithmetic instruction. A value of 0 turns
    off this subsumption optimization. The default is 1, in which
    case subsumption is performed only when the result of the load
    has only one use.
    This subsumption is not performed if the number of times the
    result of the load is used exceeds the value n, a non-negative
    integer.

-CG:local_fwd_sched=(ON|OFF)
    Changes the instruction scheduling algorithm to work forward
    instead of backward for the instructions in each basic block.
    The default is OFF.

-CG:movnti=N
    Convert ordinary stores to non-temporal stores when writing
    memory blocks of size larger than N KB. When N is set to 0, this
    transformation is avoided. The default value is 1000 (KB).

-col72, -col80, -col132
    Specifies the line width for fixed format Fortran code. By
    default fixed format lines are 72 characters. Specifying -col132
    implies the flag -extend-source as well.

-extend-source
    Specifies a 132 character wide line for fixed format Fortran
    code.

-fb_create
    Used to specify that an instrumented executable program is to be
    generated. Such an executable is suitable for producing feedback
    data files with the specified prefix for use in feedback-directed
    compilation (FDO). The commonly used prefix is "fbdata". This is
    OFF by default.

-fb_opt
    Used to specify feedback-directed compilation (FDO) by extracting
    feedback data from files with the specified prefix, which were
    previously generated using -fb_create. The commonly used prefix
    is "fbdata". This optimization is off by default.

-fno-math-errno
    Do not set ERRNO after calling math functions that are executed
    with a single instruction, e.g., sqrt. A program that relies on
    IEEE exceptions for math error handling may want to use this flag
    for speed while maintaining IEEE arithmetic compatibility. This
    is implied by -Ofast. The default is -fmath-errno.

-INLINE:aggressive=(on|off)
    Tells the compiler to be more aggressive about inlining. The
    default setting is off.

-IPA[:...]
    IPA option group: control the inter-procedural analyses and
    transformations performed. Note that giving just the group name
    without any options, i.e., -IPA, will invoke the interprocedural
    analyzer.
    -IPA is off by default unless -Ofast is specified.

-ipa
    Same as -IPA alone.

-IPA:callee_limit=(n)
    Functions whose size exceeds this limit will never be
    automatically inlined by the compiler. The default is n=2000.

-IPA:linear=(ON|OFF)
    Controls conversion of a multi-dimensional array to a single
    dimensional (linear) array that covers the same block of memory.
    When inlining Fortran subroutines, IPA tries to map formal array
    parameters to the shape of the actual parameter. In the case that
    it cannot map the parameter, it linearizes the array reference.
    By default, IPA will not inline such callsites because they may
    cause performance problems. The default is OFF.

-IPA:plimit=(n)
    Inline calls to a procedure until the procedure has grown to a
    size of n. The default is 2500.

-IPA:pu_reorder=(0|1|2)
    Controls the phase that optimizes the layout of the program units
    (functions) in the program.
        0 = Disables procedure reordering (default)
        1 = Reorder based on the frequency with which different
            procedures are invoked.
        2 = Reorder based on caller-callee relationship.

-L/opt/acml2.5.1/pathscale64/lib -lacml
    The flags above are needed to use the PathScale compiler to link
    with the ACML (AMD Core Math Library) 2.5.1 library. The
    PathScale-compiled, 64-bit version of ACML is installed at
    /opt/acml2.5.1/pathscale64 by default. ACML is available as a
    free download from http://www.developwithamd.com/acml.

-LNO:
    Option group that specifies options and transformations performed
    on loop nests. The -LNO: option group is enabled only if the -O3
    option is also specified on the compiler command line.

-LNO:blocking[=(ON|OFF)]
    Enable/disable the cache blocking transformation. The default is
    on at -O3 or higher.

-LNO:fission=(0|1|2)
    This option controls loop fission.
    The options can be one of the following:
        0 = Disables loop fission (default)
        1 = Performs normal fission as necessary
        2 = Specifies that fission be tried before fusion
    If -LNO:fission=1:fusion=1 or -LNO:fission=2:fusion=2 is
    specified, then fusion is performed.

-LNO:full_unroll,fu=N
    Fully unroll innermost loops with trip_count <= N inside LNO. N
    can be any integer between 0 and 100. The default value for N is
    5. Setting this flag to 0 disables full unrolling of small trip
    count loops inside LNO.

-LNO:full_unroll_size=N
    Fully unroll innermost loops with unrolled loop size <= N inside
    LNO. N can be any integer between 0 and 10000. The conditions
    implied by the full_unroll option must also be satisfied for the
    loop to be fully unrolled. The default value for N is 1600.

-LNO:full_unroll_outer=(ON|OFF)
    Fully unroll outer innermost loops (i.e., stand-alone loops not
    belonging to any loop nest) with known trip count. The conditions
    implied by both the full_unroll and the full_unroll_size options
    must be satisfied for the loop to be fully unrolled. The default
    is OFF.

-LNO:fusion=n
    Perform loop fusion, n: 0 - off, 1 - conservative,
    2 - aggressive. The default is 1.

-LNO:interchange[=(ON|OFF)]
    Specifying OFF disables the loop interchange transformation in
    the loop nest optimizer. The default is ON.

-LNO:opt=1
    Turns on Loop Nest Optimizations. The LNO feature is only active
    at optimization levels of -O3 or higher.

-LNO:ou_prod_max=n
    Indicates that the product of unrolling of the various outer
    loops in a given loop nest is not to exceed n, where n is a
    positive integer. The default is 16.

-LNO:outer_unroll_max,ou_max=(n)
    Outer_unroll_max indicates that the compiler may unroll outer
    loops in a loop nest by as many as n per loop, but no more. The
    default is 4.

-LNO:prefetch[=(0|1|2|3)]
    Specify level of prefetching.
        0 = Prefetch disabled.
        1 = Prefetch is done only for arrays that are always
            referenced in each iteration of a loop (default).
        2 = Prefetch is done without the above restrictions.
        3 = Most aggressive.

-LNO:prefetch_ahead=n
    Prefetch n cache line(s) ahead. The default is 2.

-LNO:simd=(0|1|2)
    This option enables or disables inner loop vectorization.
        0 = Turn off the vectorizer.
        1 = (Default) Vectorize only if the compiler can determine
            that there is no undesirable performance impact due to
            sub-optimal alignment. Vectorize only if vectorization
            does not introduce accuracy problems with floating-point
            operations.
        2 = Vectorize without any constraints (most aggressive).

-LNO:vintr=(0|1|2)
    Controls use of vectorized functions in the math library such as
    sin and cosine. A value of 0 turns off vectorization of math
    intrinsics, while 1 is the default. A value of 2 will vectorize
    all math functions but may affect the accuracy of some functions.
    Note, vectorization of user code is controlled by the separate
    flag "-LNO:simd=...".

-m3dnow
    Enable use of 3DNow instructions. The default is OFF.

-mcmodel=medium
    Select the code size model to use when generating offsets within
    object files. Most programs will work with -mcmodel=small (using
    32-bit data relocations), but some need -mcmodel=medium (using
    32-bit relocations for code and 64-bit relocations for data).

-march=
    Compiler will optimize code for the selected cpu type: opteron,
    athlon, athlon64, athlon64fx, barcelona, em64t, pentium4, xeon,
    core, anyx86, auto. auto means to optimize for the platform that
    the compiler is running on, which the compiler determines by
    reading /proc/cpuinfo. The default is auto.

-mcpu=
    same as -march option.

-mp
    Interpret OpenMP directives to explicitly parallelize regions of
    code for execution by multiple threads on a multi-processor
    system.

-msse3
    Enable use of SSE3 instructions. Default is ON under
    -march=em64t. Otherwise, it is OFF by default.

-noextend-source
    Specifies a 72 character wide line for fixed format Fortran code.

-O or -O2
    Turn on extensive optimization.
    The optimizations at this level are generally conservative, in
    the sense that they (1) are virtually always beneficial, (2)
    provide improvements commensurate to the compile time spent to
    achieve them, and (3) avoid changes which affect such things as
    floating point accuracy.

-O3
    Turn on aggressive optimization. The optimizations at this level
    are distinguished from -O2 by their aggressiveness, generally
    seeking highest-quality generated code even if it requires
    extensive compile time. They may include optimizations which are
    generally beneficial but occasionally hurt performance. This
    includes but is not limited to turning on the Loop Nest
    Optimizer, -LNO:opt=1, and setting
    -OPT:ro=1:IEEE_arith=2:Olimit=9000.

-Ofast
    Equivalent to "-O3 -ipa -OPT:Ofast -fno-math-errno". -OPT:Ofast
    is described below.

-OPT:alias=
    Specifies the pointer aliasing model to be used. By specifying
    one or more of the following values, the compiler is able to make
    assumptions throughout the compilation:

    typed
        Assume that the code adheres to the ANSI/ISO C standard,
        which states that two pointers of different types cannot
        point to the same location in memory. This is on by default
        when -Ofast is specified.
    restrict
        Specifies that distinct pointers are assumed to point to
        distinct, non-overlapping objects. This is off by default.
    disjoint
        Specifies that any two pointer expressions are assumed to
        point to distinct, non-overlapping objects. This is off by
        default.

-OPT:div_split=(ON|OFF|0|1)
    Enable or disable changing x/y into x*(recip(y)). This is OFF by
    default, but enabled by -OPT:Ofast or -OPT:IEEE_arithmetic=3.
    This transformation generates fairly accurate code.

-OPT:early_mp=(ON|OFF)
    This flag has an effect only under -mp compilation. It controls
    whether the transformation of code to run under multiple threads
    should take place before (=ON) or after (=OFF) the loop nest
    optimization (LNO) phase in the compilation process. The default
    is OFF, implying that the transformation occurs after LNO.
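
The -OPT:div_split entry above describes rewriting x/y as
x*(recip(y)). The following is a minimal sketch in Python (not
compiler output) of why that rewrite can change a result: the
reciprocal is rounded once and the product is rounded again, whereas
a plain division is rounded only once. The function names are
hypothetical, and Python floats are assumed to be IEEE 754 doubles.

```python
def divide(x, y):
    """Plain division: a single correctly rounded operation."""
    return x / y


def divide_split(x, y):
    """Reciprocal form: 1/y is rounded first, then the product is rounded."""
    return x * (1.0 / y)


if __name__ == "__main__":
    x, y = 3.0, 10.0
    print(divide(x, y))
    print(divide_split(x, y))
    # The two forms are mathematically equal but need not be bit-identical.
    print(divide(x, y) == divide_split(x, y))
```

For most operands the two forms agree or differ only in the last bit,
which is consistent with the entry's remark that the transformation
generates fairly accurate code.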

-OPT:fast_complex=(ON|OFF)
    Setting fast_complex=ON enables fast calculations for values
    declared to be of type complex. When this is set to ON, complex
    absolute value (norm) and complex division use fast algorithms
    that are more likely to overflow or underflow than the standard
    algorithms. OFF is the default. fast_complex=ON is enabled if
    -OPT:roundoff=3 is in effect.

-OPT:IEEE_arithmetic,IEEE_arith=(n)
    Specifies the level of conformance to IEEE 754 floating-point
    roundoff/overflow behavior. n can be one of the following:
        1   Adheres to IEEE accuracy. This is the default when
            optimization levels -O0, -O1 and -O2 are in effect.
        2   May produce inexact results not conforming to IEEE 754.
            This is the default when -O3 is in effect.
        3   All mathematically valid transformations are allowed.

-OPT:IEEE_NaN_Inf=(ON|OFF)
    OFF specifies non-IEEE-754 results in operations that might have
    IEEE 754 NaN or infinity operands; this enables many
    optimizations which would be invalid for NaN or infinity
    operands. The default is ON.

-OPT:Ofast
    Use optimizations selected to maximize performance. Although the
    optimizations are generally safe, they may affect floating point
    accuracy due to rearrangement of computations. This effectively
    turns on the following optimizations:
    -OPT:ro=2:Olimit=0:div_split=ON:alias=typed

-OPT:Olimit=(n)
    Disable optimization when the size of a program unit is > n. When
    n is 0, program unit size is ignored and the optimization process
    will not be disabled due to the compile time limit. The default
    is 0 when -Ofast is specified, otherwise the default is 6000
    under -O2 and 9000 under -O3.

-OPT:roundoff,ro=(n)
    Specifies the level of acceptable departure from source language
    floating-point, round-off, and overflow semantics. n can be one
    of the following:
        0   Inhibits optimizations that might affect the
            floating-point behavior. This is the default when
            optimization levels -O0, -O1, and -O2 are in effect.
        1   Allows simple transformations that might cause limited
            round-off or overflow differences. Compounding such
            transformations could have more extensive effects. This
            is the default level when -O3 is in effect.
        2   Allows more extensive transformations, such as the
            reordering of reduction loops. This is the default level
            when -Ofast is specified.
        3   Enables any mathematically valid transformation.

-OPT:treeheight=(ON|OFF)
    The value ON turns on re-association in expressions to reduce the
    expressions' tree height. The default value is OFF.

-OPT:unroll_analysis=(ON|OFF)
    The default value of ON lets the compiler analyze the content of
    the loop to determine the best unrolling parameters, instead of
    strictly adhering to the -OPT:unroll_times_max and
    -OPT:unroll_size parameters.

-OPT:unroll_times_max,unroll_times=(n)
    Unroll inner loops by a maximum of n. The default is 4.

-OPT:unroll_size=(n)
    Sets the ceiling of maximum number of instructions for an
    unrolled inner loop. If n = 0, the ceiling is disregarded.

-static
    Suppresses dynamic linking at run-time for shared libraries; uses
    static linking instead.

-TENV:X=(0|1|2|3|4)
    Specify the level of enabled exceptions that will be assumed for
    purposes of performing speculative code motion (default is 1 at
    all optimization levels). In general, an instruction will not be
    speculated (i.e. moved above a branch by the optimizer) unless
    any exceptions it might cause are disabled by this option.
        At level 0, no speculative code motion may be performed.
        At level 1, safe speculative code motion may be performed,
            with IEEE-754 underflow and inexact exceptions disabled.
        At level 2, all IEEE-754 exceptions are disabled except
            divide by zero.
        At level 3, all IEEE-754 exceptions are disabled including
            divide by zero.
        At level 4, memory exceptions may be disabled or ignored.
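
The -OPT:roundoff and -OPT:treeheight entries above mention
reordering of reduction loops and re-association of expressions. The
following is a minimal sketch in Python (not compiler output) of why
such reordering is treated as a departure from source floating-point
semantics: summing the same values in a different grouping can give a
different result. The two-accumulator split is only an assumed
example of a reordered reduction, the function names are
hypothetical, and Python floats are assumed to be IEEE 754 doubles.

```python
def sequential_sum(values):
    """Sum in source order with a single accumulator."""
    total = 0.0
    for v in values:
        total += v
    return total


def two_lane_sum(values):
    """Sum even and odd positions separately, then combine the partial sums."""
    even = 0.0
    odd = 0.0
    for i, v in enumerate(values):
        if i % 2 == 0:
            even += v
        else:
            odd += v
    return even + odd


if __name__ == "__main__":
    data = [1e16, 1.0, -1e16, 1.0]
    print(sequential_sum(data))
    print(two_lane_sum(data))
```

In source order the first 1.0 is absorbed by the large intermediate
value, while the reordered sum cancels the large terms first and
keeps both small terms.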

-WOPT:mem_opnds=(ON|OFF)
    ON makes the scalar optimizer preserve any memory operands of
    arithmetic operations so as to promote subsumption of memory
    loads into the operands of arithmetic operations. The default is
    OFF.

-WOPT:retype_expr=(ON|OFF)
    ON enables the optimization in the compiler that converts 64-bit
    address computation to use 32-bit arithmetic as much as possible.
    The default is OFF.

Flags or Variables to invoke non-PathScale Libraries
----------------------------------------------------

+ACML, a shorthand for flags like:

    -L/opt/acml3.5.0/pathscale64/lib -lacml

    "/opt/acml3.5.0" can be replaced by the path to where you
    installed the version of ACML you are using. The flags above are
    needed to link with ACML (AMD Core Math Library) version 3.5.0.
    ACML is available as a free download from
    http://developer.amd.com/acml.aspx. ACML includes BLAS and LAPACK
    routines needed for 178.galgel when the RM_SOURCES make variable
    described below is used.

+MKL, a shorthand for flags like:

    -L$(SPEC)/mkl -lmkl_lapack -lmkl

    "$(SPEC)/mkl" can be replaced by the path to where you installed
    the version of Intel MKL you are using. The flags above are
    needed to link with MKL (Intel's Math Kernel Library) version
    8.1. Intel MKL is available from
    http://www.intel.com/cd/software/products/asmo-na/eng/perflib/mkl/index.htm
    MKL includes BLAS and LAPACK routines needed for 178.galgel when
    the following RM_SOURCES make variable is used.

    RM_SOURCES = lapak.f90
    EXTRA_LIBS = -L -lacml
or
    EXTRA_LIBS = -L -lmkl_lapack -lmkl

    These settings of the SPEC make variables RM_SOURCES and
    EXTRA_LIBS allow building 178.galgel without its copy of the
    LAPACK and BLAS sources in lapak.f90; instead, it links the
    LAPACK and BLAS routines from the scientific library specified
    (either ACML or MKL).

The following describes the environment variables associated with the
PathScale OpenMP library.
-------------------------------------------------

OMP_DYNAMIC
    Enables or disables dynamic adjustment of the number of threads
    available for execution. Default is FALSE, since this mechanism
    is not supported.

OMP_NUM_THREADS
    Set the number of threads to use during execution. Default is the
    number of CPUs in the machine.

PSC_OMP_AFFINITY
    When TRUE, the operating system's affinity mechanism (where
    available) is used to assign threads to CPUs; otherwise no
    affinity assignments are made. The default value is TRUE.

PSC_OMP_AFFINITY_MAP
    This environment variable allows the mapping from threads to CPUs
    to be fully specified by the user. It must be set to a list of
    CPU identifiers separated by commas. The list must contain at
    least one CPU identifier, and entries in the list beyond the
    maximum number of threads supported by the implementation (256)
    are ignored. Each CPU identifier is a decimal number between 0
    and one less than the number of CPUs in the system (inclusive).
    The implementation generates a mapping table that enumerates the
    mapping from each thread to CPUs. The CPU identifiers in the
    PSC_OMP_AFFINITY_MAP list are inserted in the mapping table
    starting at the index for thread 0 and increasing upwards. If the
    list is shorter than the maximum number of threads, then it is
    simply repeated over and over again until there is a mapping for
    each thread. This repeat feature allows short lists to be used to
    specify repetitive thread mappings for all threads.

PSC_OMP_GUIDED_CHUNK_MAX
    This is the maximum chunk size that will be used by the loop
    scheduler for guided scheduling. The default value for this is
    300. Note that a minimum chunk size can already be set by the
    user on a guided schedule directive. This environment variable
    allows the user to set a maximum too (though it applies to the
    whole program).

PSC_OMP_STATIC_FAIR (Set or not set)
    The default static scheduling policy when no chunk size is
    specified is as follows.
    The number of iterations of the loop is divided by the number of
    threads in the team and rounded up to give the chunk size. Loop
    iterations are grouped into chunks of this size and assigned to
    threads in order of increasing thread id (within the team). If
    the division was not exact then the last thread will have fewer
    iterations, and possibly none at all.

PSC_OMP_THREAD_SPIN (Integer value)
    This takes a numeric value and sets the number of times that the
    spin loops will spin at user-level before falling back to O/S
    schedule/reschedule mechanisms. By default it is 100. If there
    are more active threads than processors and this is set very
    high, then the thread contention will typically cause a
    performance drop. Synchronization using the O/S schedule and
    reschedule mechanisms is significantly more expensive but frees
    up execution resources for other threads.

Submit command example:

    submit= numactl --interleave=0,1,2,3 $command

numactl
    Control NUMA policy for processes or shared memory.

numactl --interleave=nodes
    Set a memory interleave policy. Memory will be allocated using
    round robin on nodes. When memory cannot be allocated on the
    current interleave target, fall back to other nodes.

Machine Configuration:

chkconfig
    Updates and queries runlevel information for system services on
    Redhat Linux systems.

cpuspeed
    The daemon that controls frequency scaling (AMD PowerNow!
    technology) on Redhat Linux systems.

chkconfig --levels 12345 cpuspeed off
    In Redhat Linux, this disables starting the dynamic cpuspeed
    daemon for runlevels 1, 2, 3, 4, and 5.

powersaved
    Daemon in SuSE Linux and SLES that supports dynamic frequency
    scaling and other APM and ACPI functionality.

powersave -f
    Command in SuSE Linux and SLES to cause the powersaved daemon to
    run cpus in "performance" mode. This disables dynamic frequency
    scaling and causes the cpus to run at their maximum frequency.
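
The PSC_OMP_AFFINITY_MAP and PSC_OMP_STATIC_FAIR descriptions in the
OpenMP environment variable section spell out two small procedures:
repeating a short CPU list until every thread has a mapping, and
splitting loop iterations into rounded-up chunks. The following is a
minimal sketch of both in Python, written from those descriptions
only. It is not the PathScale runtime's code; the function names are
hypothetical, and only the limit of 256 threads comes from the text.

```python
MAX_THREADS = 256  # maximum number of threads stated in the description


def build_affinity_table(map_string, max_threads=MAX_THREADS):
    """Return a list where entry t is the CPU assigned to thread t."""
    # The list must contain at least one CPU identifier; int() raises
    # ValueError on an empty or malformed entry.
    cpus = [int(token) for token in map_string.split(",")]
    # Entries beyond the maximum number of threads are ignored.
    cpus = cpus[:max_threads]
    # A short list is repeated until every thread has a mapping.
    return [cpus[t % len(cpus)] for t in range(max_threads)]


def static_chunks(iterations, threads):
    """Return the iteration range for each thread under the default policy."""
    # Iterations divided by threads, rounded up, gives the chunk size.
    chunk = -(-iterations // threads)
    ranges = []
    for t in range(threads):
        start = min(t * chunk, iterations)
        end = min(start + chunk, iterations)
        ranges.append(range(start, end))
    return ranges


if __name__ == "__main__":
    print(build_affinity_table("0,2,1,3")[:6])
    print([len(r) for r in static_chunks(10, 4)])
    print([len(r) for r in static_chunks(9, 4)])
```

With 9 iterations and 4 threads the chunk size is 3, so the last
thread receives no iterations, which matches the "possibly none at
all" case in the description.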