Copyright © 2007, 2008 PathScale LLC. Copyright © 2006, 2007 QLogic Corporation. All rights reserved.
HEADER for OPTIMIZATION
Specify the basic level of optimization desired.
The options can be one of the following:
0 Turn off all optimizations.
1 Turn on local optimizations that can be done quickly. Do peephole optimizations and instruction scheduling.
2 Turn on extensive optimization.
This is the default.
The optimizations at this level are generally conservative,
in the sense that they are virtually always beneficial and
avoid changes which affect
such things as floating point accuracy. In addition to the level
1 optimizations, do inner loop
unrolling, if-conversion, two passes of instruction scheduling,
global register allocation, dead store elimination,
instruction scheduling across basic blocks,
and partial redundancy elimination.
3 Turn on aggressive optimization.
The optimizations at this level are distinguished from -O2
by their aggressiveness, generally seeking highest-quality
generated code even if it requires extensive compile time.
They may include optimizations that are generally beneficial
but may hurt performance.
This includes but is not limited to turning on the
Loop Nest Optimizer, -LNO:opt=1, and setting
-OPT:roundoff=1:IEEE_arithmetic=2:Olimit=9000:reorg_common=ON.
s Specify that code size is to be given priority in tradeoffs with execution time.
If no value is specified, 2 is assumed.
-Ofast
Equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno -ffast-math.
Use optimizations selected to maximize performance.
Although the optimizations are generally safe, they may affect
floating point accuracy due to rearrangement of computations.
NOTE: -Ofast enables -ipa (inter-procedural analysis), which places limitations on how libraries and .o files are built.
-apo <path>
This auto-parallelizing option signals the compiler to automatically
convert sequential code into parallel code where it is safe and beneficial to do so.
The default number of threads used at run-time is the number of CPUs available in the machine. This number of threads can also be controlled by setting the OMP_NUM_THREADS environment variable.
-fb_create <path>
Used to specify that an instrumented executable program is to be
generated. Such an executable is suitable for producing feedback
data files with the specified prefix for use in feedback-directed
optimization (FDO).
The commonly used prefix is "fbdata".
This is OFF by default.
During the training run, the instrumented executable produces information regarding execution paths and data values, but does not generate information by using hardware performance counters.
-fb_opt <prefix for feedback data files>
Used to specify feedback-directed optimization (FDO) by extracting
feedback data from files with the specified prefix, which were
previously generated using -fb_create.
The commonly used prefix is "fbdata".
The same optimization flags should be used
for both the -fb_create and -fb_opt compile steps.
Feedback data files created from executables compiled
with different optimization flags may give checksum errors.
FDO is OFF by default.
During the -fb_opt compilation phase, information regarding execution paths and data values is used to improve the information available to the optimizer. FDO enables some optimizations that are performed only when the feedback data file is available. The safety of optimizations performed under FDO is consistent with the level of safety implied by the other optimization flags (outside of -fb_create and -fb_opt) specified on the compile and link lines.
Disable the use of SSE2/SSE3 instructions. SSE2 cannot be disabled under -m64 and will result in a warning.
Enable the use of 3DNow instructions.
The compiler optimizes code for the selected platform. The default value, auto, means to optimize for the platform on which the compiler is running, as determined by reading /proc/cpuinfo. anyx86 means a generic 32-bit x86 processor without SSE2 support.
(For C++ only) -fexceptions enables exception handling. This is the default. -fno-exceptions disables exception handling.
-ffast-math improves FP speed by relaxing ANSI & IEEE rules. -fno-fast-math tells the compiler to conform to ANSI and IEEE math rules at the expense of speed. -ffast-math implies -OPT:IEEE_arithmetic=2 -fno-math-errno. -fno-fast-math implies -OPT:IEEE_arithmetic=1 -fmath-errno.
Do not set ERRNO after calling math functions that are executed with a single instruction, e.g. sqrt. A program that relies on IEEE exceptions for math error handling may want to use this flag for speed while maintaining IEEE arithmetic compatibility. This is implied by -Ofast. The default is -fmath-errno.
Invoke inter-procedural analysis (IPA). Specifying this option is identical to specifying -IPA or -IPA:. Default settings for the individual IPA suboptions are used.
The Code Generation option group -CG: controls the optimizations and transformations of the instruction-level code generator.
-CG:flow=(on|off|0|1): Specifying OFF disables control flow optimization in the code generation. Default is ON.
-CG:cse_regs=N : When performing common subexpression elimination during code generation, assume there are N extra integer registers available over the number provided by the CPU. N can be positive, zero, or negative. The default is positive infinity. See also -CG:sse_cse_regs.
-CG:gcm=(on|off|0|1): Specifying OFF disables the instruction-level global code motion optimization phase. The default is ON.
-CG:ignore_loop_loadstore_dep=(on|off|0|1): Assume loads and stores in innermost loops don't alias unless proven otherwise. This assumption can lead to faster code by giving the instruction scheduler more flexibility to reorder instructions. The default is OFF because this assumption is not true for all programs.
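The following sketch suggests why the no-alias assumption is not safe for every program. It is written in Python purely for illustration (the compiler itself operates on C, C++, and Fortran); the function names and the pairwise reordering are hypothetical and only model a scheduler that moves a load above an earlier store.

```python
def shift_in_order(buf, src, dst, n):
    """Reference semantics: each load happens after the previous store."""
    for i in range(n):
        buf[dst + i] = buf[src + i] + 1
    return buf


def shift_loads_hoisted(buf, src, dst, n):
    """Model of a schedule that issues both loads of an iteration pair
    before either store. Valid only if the loads and stores never alias."""
    i = 0
    while i + 1 < n:
        a = buf[src + i]        # load for iteration i
        b = buf[src + i + 1]    # load for iteration i+1, hoisted above the store below
        buf[dst + i] = a + 1
        buf[dst + i + 1] = b + 1
        i += 2
    if i < n:                   # leftover iteration when n is odd
        buf[dst + i] = buf[src + i] + 1
    return buf


# Disjoint source and destination regions: both schedules agree.
print(shift_in_order([0, 0, 0, 0, 9, 9, 9, 9], 0, 4, 4))
print(shift_loads_hoisted([0, 0, 0, 0, 9, 9, 9, 9], 0, 4, 4))

# Overlapping regions (dst = src + 1): the reordered schedule gives a different answer.
print(shift_in_order([0] * 5, 0, 1, 4))
print(shift_loads_hoisted([0] * 5, 0, 1, 4))
```

When the regions overlap, each store feeds the next load, so any reordering that ignores that dependence changes the result.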
-CG:load_exe=N : Specify the threshold for subsuming a memory load
operation into the operand of an arithmetic instruction.
The value of 0 turns off this subsumption optimization.
If N is 1, this subsumption is performed only when the result of
the load has only one use.
This subsumption is not performed if the number of times the result
of the load is used exceeds the value N, a non-negative integer.
If the ABI is 64-bit and the language is Fortran, the default for N
is 2, otherwise the default is 1.
-CG:local_fwd_sched=(on|off|0|1): This optimization option is deprecated.
-CG:local_sched_alg=(0|1|2): Select the basic block instruction scheduling algorithm. If 0, perform backward scheduling, where instructions are scheduled from the bottom of the basic block to the top. If 1, perform forward scheduling. If 2, schedule the instructions twice - once in the forward direction and once in the backward direction - and take the better of the two schedules. The default value of this option is determined by the compiler during compilation.
-CG:locs_shallow_depth=(on|off|0|1): When performing local instruction scheduling to reduce register usage, give priority to instructions that have shallow depths in the dependence graph. The default is OFF.
-CG:movnti=N : Convert ordinary stores to non-temporal stores when writing memory blocks of size larger than N KB. When N is set to 0, this transformation is avoided. The default value is 1000 (KB).
-CG:p2align=(on|off|0|1): Align loop heads to 64-byte boundaries. The default is OFF.
-CG:post_local_sched=(on|off|0|1): Enable the local scheduler phase after register allocation. The default is ON.
-CG:prefetch=(on|off|0|1): Enable or disable generation of prefetch instructions in the code generator. The default is ON.
-CG:prefer_legacy_regs=(on|off|0|1): Tell the local register allocator to use the first 8 integer and SSE registers whenever possible (%rax-%rbp,%xmm0-%xmm7). Instructions using these registers have smaller instruction sizes. The default is OFF.
-CG:prefer_lru_reg=(on|off|0|1): Tell the local register allocator to use the least-recently-used register among the available registers. The default is ON.
-CG:ptr_load_use=N: Add a latency of N cycles between an instruction that loads a pointer and an instruction that uses the pointer. The extra latency will force the instruction scheduler to schedule the pointer load earlier. In general, it is beneficial to load pointers as soon as possible so that dependent memory instructions can begin execution. N is 4 by default. ("Load pointer" instructions include load-execute instructions that compute a pointer result.)
-CG:push_pop_int_saved_regs=(on|off|0|1): Use the X86 push and pop instructions to save the integer callee-saved registers at function prologs and epilogs instead of mov instructions to and from memory locations based off the stack pointer. The default is ON when the CPU target is barcelona, and OFF otherwise.
-CG:sse_cse_regs=N : When performing common subexpression elimination during code generation, assume there are N extra SSE registers available over the number provided by the CPU. N can be positive, zero, or negative. The default is positive infinity.
-CG:use_prefetchnta=(on|off|0|1): Prefetch when data is non-temporal at all levels of the cache hierarchy. This is for data streaming situations in which the data will not need to be re-used soon. The default is OFF.
-INLINE:aggressive=(on|off|0|1): Tell the compiler to be more aggressive about inlining. The default is -INLINE:aggressive=OFF.
The inter-procedural analyzer option group -IPA: controls application of inter-procedural analysis and optimization.
-IPA:callee_limit=N : Functions whose size exceeds this limit will never be automatically inlined by the compiler. The default is 500.
-IPA:field_reorder=(on|off|0|1): Enable the re-ordering of fields in large structs based on their reference patterns in feedback compilation to minimize data cache misses. The default is OFF.
-IPA:linear=(on|off|0|1): Controls conversion of a multi-dimensional array to a single dimensional (linear) array that covers the same block of memory. When inlining Fortran subroutines, IPA tries to map formal array parameters to the shape of the actual parameter. In the case that it cannot map the parameter, it linearizes the array reference. By default, IPA will not inline such callsites because they may cause performance problems. The default is OFF.
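As a rough illustration of what linearizing an array reference means, the sketch below maps a two-dimensional Fortran-style reference A(i, j) onto a flat block of memory. It assumes 1-based indices and column-major storage, as in Fortran; the helper names are hypothetical and say nothing about how IPA represents the reference internally.

```python
def linear_index(i, j, nrows):
    """Offset of Fortran element A(i, j) (1-based, column-major) in a flat array."""
    return (i - 1) + (j - 1) * nrows


def get_2d(flat, i, j, nrows):
    """Read A(i, j) through the linearized index."""
    return flat[linear_index(i, j, nrows)]


# A 3x2 array stored column by column: column 1 is 11, 21, 31; column 2 is 12, 22, 32.
flat = [11, 21, 31, 12, 22, 32]
print(get_2d(flat, 2, 2, 3))   # element in row 2, column 2
print(get_2d(flat, 3, 1, 3))   # element in row 3, column 1
```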
-IPA:plimit=N : This option stops inlining into a specific subprogram once it reaches size N in the intermediate representation. Default is 2500.
-IPA:pu_reorder=(0|1|2) : Control re-ordering the layout of program units based on their invocation patterns in feedback compilation to minimize instruction cache misses. This option is ignored unless under feedback compilation.
0 Disable procedure reordering. This is the default for non-C++ programs.
1 Reorder based on the frequency in which different procedures are invoked. This is the default for C++ programs.
2 Reorder based on caller-callee relationship.
-IPA:space=N : Inline until a program expansion of N % is reached. For example, -IPA:space=20 limits code expansion due to inlining to approximately 20 %. Default is no limit.
Specify options and transformations performed on loop nests by the Loop Nest Optimizer (LNO). The -LNO options are enabled only if -O3 is also specified on the pathf95 command line.
-LNO:blocking=(on|off|0|1): Enable or disable the cache blocking transformation. The default is ON.
-LNO:full_unroll,fu=N : Fully unroll loops with trip_count <= N inside LNO. N can be any integer between 0 and 100. The default value for N is 5. Setting this flag to 0 disables full unrolling of small trip count loops inside LNO.
-LNO:full_unroll_outer=(on|off|0|1) : Control the full unrolling of loops with known trip count that do not contain a loop and are not contained in a loop. The conditions implied by both the full_unroll and the full_unroll_size options must be satisfied for the loop to be fully unrolled. The default is OFF.
-LNO:full_unroll_size=N : Fully unroll loops with unrolled loop size <= N inside LNO. N can be any integer between 0 and 10000. The conditions implied by the full_unroll option must also be satisfied for the loop to be fully unrolled. The default value for N is 2000.
-LNO:fission=N : Perform loop fission. N can be one of the following:
0 = Disable loop fission (default)
1 = Perform normal loop fission as necessary
2 = Specify that fission be tried before fusion
Because -LNO:fusion is on by default, turning on fission without turning off fusion may result in their effects being nullified. Ordinarily, fusion is applied before fission. Specifying -LNO:fission=2 will turn on fission and cause it to be applied before fusion.
-LNO:fusion=N : Perform loop fusion. N can be one of the following:
0 = Loop fusion is off
1 = Perform conservative loop fusion
2 = Perform aggressive loop fusion
The default is 1.
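Fission and fusion are inverse transformations, which is why enabling both can cancel out. A minimal source-level sketch, in Python for illustration only and with hypothetical function names, shows the two forms of the same computation:

```python
def separate_loops(a, b):
    """Two loops over the same range (the form fission produces)."""
    n = len(a)
    c = [0.0] * n
    d = [0.0] * n
    for i in range(n):
        c[i] = a[i] + b[i]
    for i in range(n):
        d[i] = c[i] * 2.0
    return c, d


def fused_loop(a, b):
    """The same work after fusion: one pass, with c[i] reused immediately."""
    n = len(a)
    c = [0.0] * n
    d = [0.0] * n
    for i in range(n):
        c[i] = a[i] + b[i]
        d[i] = c[i] * 2.0
    return c, d


a = [1.0, 2.0, 3.0]
b = [0.5, 0.5, 0.5]
print(separate_loops(a, b) == fused_loop(a, b))
```

Fusion tends to help when the second loop re-reads data the first loop just produced; fission tends to help when a single loop body touches too many arrays or registers at once.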
-LNO:ignore_feedback=(on|off|0|1) : If the flag is ON then feedback information from the loop annotations will be ignored in LNO transformations. The default is ON.
-LNO:interchange=(on|off|0|1) : Enable or disable the loop interchange transformation in the loop nest optimizer. Default is ON.
-LNO:minvariant=(on|off|0|1): Enable or disable moving loop-invariant expressions out of loops. The default is ON.
-LNO:opt=(0|1) : This option controls the LNO optimization level. The options can be
one of the following:
0 = Disable nearly all loop nest optimizations.
1 = Perform full loop nest transformations. This is the default.
-LNO:outer_unroll_max,ou_max=N : The outer_unroll_max option indicates that the compiler may unroll outer loops in a loop nest by as many as N per loop, but no more. The default is 5.
-LNO:ou_prod_max=N : This option indicates that the product of unrolling of the various outer loops in a given loop nest is not to exceed N, where N is a positive integer. The default is 16.
-LNO:parallel_overhead=N : Effective only when specified with -apo, the parallel_overhead option controls the auto-parallelizing compiler's estimate of the overhead (in processor cycles) incurred by invoking the parallel version of a loop. When the compiler parallelizes a loop, it generates both a serial and a parallel version. If the amount of work performed by the loop is small, it may not be beneficial to use the parallel version during execution. The set value of parallel_overhead is used in this determination during execution time when the number of processors and the iteration count of the loop are taken into account. The default value is 4096. Because the optimal value varies across systems and programs, this option can be used for parallel performance tuning.
-LNO:prefetch_ahead=N : Prefetch N cache line(s) ahead. The default is 2.
-LNO:prefetch=(0|1|2|3) : This option specifies the level of prefetching.
0 = Prefetch disabled.
1 = Prefetch is done only for arrays that are always referenced in each iteration of a loop.
2 = Prefetch is done without the above restriction. This is the default.
3 = Most aggressive.
-LNO:pf2=(on|off|0|1): This option selectively disables or enables prefetching for Level 2 caches. The default is ON.
-LNO:sclrze=(on|off): Turn ON or OFF the optimization that replaces an array by a scalar variable. The default is ON.
-LNO:simd=(0|1|2) : This option enables or disables inner loop vectorization.
0 = Turn off the vectorizer.
1 = (Default) Vectorize only if the compiler can determine that there is no undesirable performance impact due to sub-optimal alignment. Vectorize only if vectorization does not introduce accuracy problems with floating-point operations.
2 = Vectorize without any constraints (most aggressive).
-LNO:trip_count=N : This flag provides an assumed loop trip count for loops whose trip count is unknown at compile time. LNO uses this information for loop transformations, prefetching, and similar decisions. N can be any positive integer, and the default value is 1000.
-LNO:vintr=(0|1|2) : This flag controls loop vectorization to make use of vector intrinsic routines (Note: a vector intrinsic routine is called once to compute a math intrinsic for the entire vector). -LNO:vintr=1 is the default. -LNO:vintr=0 turns off the vintr optimization. Under -LNO:vintr=2 the compiler will do aggressive optimization for all vector intrinsic routines. Note that -LNO:vintr=2 could be unsafe in that some of these routines could have accuracy problems.
Compile for 32-bit ABI, also known as x86 or IA32.
Compile for 64-bit ABI, also known as AMD64, x86_64, or IA32e. On a 32-bit host, the default is 32-bit ABI. On a 64-bit host, the default is 64-bit ABI if the target platform (-march/-mcpu/-mtune) is 64-bit; otherwise the default is 32-bit.
The -OPT: option group controls miscellaneous optimizations. These options override defaults based on the main optimization level.
-OPT:alias=<name>
Specify the pointer aliasing model
to be used. By specifying one or more of the following for <name>,
the compiler is able to make assumptions throughout the compilation:
typed
Assume that the code adheres to the ANSI/ISO C standard
which states that two pointers of different types cannot point
to the same location in memory.
This is ON by default when -OPT:Ofast is specified.
restrict
Specify that distinct pointers are assumed to point to distinct,
non-overlapping objects. This is OFF by default.
disjoint
Specify that any two pointer expressions are assumed to point
to distinct, non-overlapping objects. This is OFF by default.
no_f90_pointer_alias
Specify that any two Fortran 90 pointer expressions are assumed to point
to distinct, non-overlapping objects. This is OFF by default.
-OPT:div_split=(on|off|0|1)
Enable or disable changing x/y into x*(recip(y)). This is OFF by default,
but enabled by -OPT:Ofast or -OPT:IEEE_arithmetic=3. This transformation
generates fairly accurate code.
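The qualifier "fairly accurate" can be made concrete: a division is rounded once, while a reciprocal followed by a multiplication is rounded twice. The sketch below uses Python floats (IEEE double precision) to show one input where the two differ in the last bit; the function names are hypothetical.

```python
def divide(x, y):
    """Plain division: a single rounding."""
    return x / y


def divide_split(x, y):
    """Model of the split form x * recip(y): two roundings."""
    recip = 1.0 / y
    return x * recip


print(divide(49.0, 49.0))        # exactly 1.0
print(divide_split(49.0, 49.0))  # one unit in the last place below 1.0
print(divide(3.0, 4.0) == divide_split(3.0, 4.0))  # many inputs agree exactly
```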
-OPT:fast_complex=(on|off|0|1)
Setting fast_complex=ON enables fast
calculations for values declared to be of the type complex.
When this is set to ON, complex absolute value (norm) and complex
division use fast algorithms that overflow for an operand
(the divisor, in the case of division) that has an absolute value
that is larger than the square root of the largest representable
floating-point number.
This would also apply to an underflow for a value that is smaller
than the square root of the smallest representable floating point
number.
OFF is the default.
fast_complex=ON is enabled if -OPT:roundoff=3 is in effect.
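The overflow and underflow behaviour described above follows from squaring the operand components. The sketch below models a straightforward fast norm in Python (IEEE double precision) and compares it with the range-safe math.hypot; it is an assumption that the compiler's fast algorithm behaves like this simple formula, and the function names are hypothetical.

```python
import math


def fast_abs(re, im):
    """Naive complex norm: squares can overflow or underflow."""
    return math.sqrt(re * re + im * im)


def careful_abs(re, im):
    """Range-safe norm provided by the standard library."""
    return math.hypot(re, im)


print(fast_abs(3.0, 4.0), careful_abs(3.0, 4.0))          # both fine in normal range
print(fast_abs(1e200, 1e200), careful_abs(1e200, 1e200))  # fast version overflows
print(fast_abs(1e-200, 0.0), careful_abs(1e-200, 0.0))    # fast version underflows to zero
```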
-OPT:fold_unsigned_relops=(on|off|0|1)
This option folds unsigned relational operators in
the presence of possible integer overflow. Default is OFF.
-OPT:goto=(on|off|0|1)
Disable or enable the conversion of GOTOs into higher-level
structures like FOR loops. The default is ON for -O2 or higher.
-OPT:IEEE_arithmetic,IEEE_arith,IEEE_a=(1|2|3)
Specify the level of conformance to IEEE 754 floating-point
roundoff/overflow behavior.
The options can be one of the following:
1 Adhere to IEEE accuracy. This is the default when optimization levels -O0, -O1 and -O2 are in effect.
2 May produce inexact result not conforming to IEEE 754. This is the default when -O3 is in effect.
3 All mathematically valid transformations are allowed.
-OPT:IEEE_NaN_Inf=(on|off|0|1)
-OPT:IEEE_NaN_inf=ON forces all operations that might have IEEE-754 NaN
or infinity operands to yield results that conform to ANSI/IEEE 754-1985,
the IEEE Standard for Binary Floating-point Arithmetic, which describes a
standard for NaN and inf operands. Default is ON.
-OPT:IEEE_NaN_inf=OFF produces non-IEEE results for various operations.
For example, x=x is treated as TRUE without executing a test and x/x is
simplified to 1 without dividing. OFF can enable many common optimizations
that can help performance.
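The two simplifications mentioned above are only wrong for NaN and infinity operands. A small Python sketch (IEEE double precision; the function names are hypothetical) contrasts the IEEE-conforming results with what the folded forms would return:

```python
import math


def is_self_equal(x):
    """IEEE-conforming comparison: false when x is NaN."""
    return x == x


def self_ratio(x):
    """IEEE-conforming division: NaN when x is NaN or infinite."""
    return x / x


def is_self_equal_folded(x):
    """Model of the simplification: the test is never executed."""
    return True


def self_ratio_folded(x):
    """Model of the simplification: the division is never executed."""
    return 1.0


nan = float("nan")
print(is_self_equal(nan), is_self_equal_folded(nan))
print(self_ratio(math.inf), self_ratio_folded(math.inf))
print(self_ratio(2.0), self_ratio_folded(2.0))
```

For ordinary finite, nonzero values the folded and unfolded forms agree, which is why the setting is attractive for programs that never produce NaN or infinity.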
-OPT:malloc_alg=(0|1)
Select an alternate malloc algorithm which may improve speed.
The compiler adds setup code in the C/C++/Fortran "main" function to enable the chosen algorithm.
The default is 0.
-OPT:Ofast
Use optimizations selected to maximize performance.
Although the optimizations are generally safe, they may affect
floating point accuracy due to rearrangement of computations.
This effectively turns on the following optimizations:
-OPT:ro=2:Olimit=0:div_split=ON:alias=typed.
-OPT:Olimit=N
Disable optimization when size of program unit is > N. When N is 0,
program unit size is ignored and optimization process will not be
disabled due to compile time limit.
The default is 0 when -OPT:Ofast is specified,
9000 when -O3 is specified; otherwise the default is 6000.
-OPT:roundoff,ro=(0|1|2|3)
Specify the level of acceptable departure from source language
floating-point, round-off, and overflow semantics.
The options can be one of the following:
0 = Inhibit optimizations that might affect the floating-point behavior. This is the default when optimization levels -O0, -O1, and -O2 are in effect.
1 = Allow simple transformations that might cause limited round-off or overflow differences. Compounding such transformations could have more extensive effects. This is the default when -O3 is in effect.
2 = Allow more extensive transformations, such as the reordering of reduction loops. This is the default level when -OPT:Ofast is specified.
3 = Enable any mathematically valid transformation.
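Reordering a reduction loop, as permitted from level 2 upward, changes the order in which rounding errors accumulate. The sketch below models this with Python floats (IEEE double precision); splitting the sum into two interleaved partial sums is an assumed reordering chosen for illustration, not a description of what the compiler actually generates.

```python
def sum_in_order(values):
    """Source-order reduction."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_two_lanes(values):
    """Model of a reordered reduction: even- and odd-indexed elements
    are accumulated separately and combined at the end."""
    even = 0.0
    odd = 0.0
    for i, v in enumerate(values):
        if i % 2 == 0:
            even += v
        else:
            odd += v
    return even + odd


data = [1e16, 1.0, -1e16, 1.0]
print(sum_in_order(data))    # the first 1.0 is absorbed by 1e16
print(sum_two_lanes(data))   # the large terms cancel before the small ones are added
```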
-OPT:rsqrt=(0|1|2)
This option specifies if the RSQRT machine instruction should be used
to calculate reciprocal square root. RSQRT is faster but potentially
less accurate than the regular square root operation.
0 means not to use RSQRT.
1 means to use RSQRT followed by instructions to refine the result.
2 means to use RSQRT by itself.
Default is 1 when -OPT:roundoff=2 or greater, else the default is 0.
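The difference between settings 1 and 2 is the refinement step. The sketch below is a model only: it stands in for the hardware estimate by rounding the exact value to IEEE half precision (an assumption made purely for illustration, since the real instruction's error characteristics differ) and then applies one Newton-Raphson step, a common way to refine such an estimate. The function names are hypothetical.

```python
import math
import struct


def rsqrt_estimate(x):
    """Stand-in for a low-precision estimate of 1/sqrt(x): the exact value
    rounded to half precision (assumed, not the real instruction)."""
    exact = 1.0 / math.sqrt(x)
    return struct.unpack("<e", struct.pack("<e", exact))[0]


def rsqrt_refined(x):
    """Estimate followed by one Newton-Raphson step: y * (1.5 - 0.5 * x * y * y)."""
    y = rsqrt_estimate(x)
    return y * (1.5 - 0.5 * x * y * y)


x = 2.0
exact = 1.0 / math.sqrt(x)
print(abs(rsqrt_estimate(x) - exact))  # error of the raw estimate
print(abs(rsqrt_refined(x) - exact))   # much smaller error after refinement
```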
-OPT:treeheight=(on|off|0|1)
The value ON enables re-association in expressions to reduce
the expressions' tree height. The default is OFF.
-OPT:unroll_size=N
Set the ceiling of maximum number of instructions for an
unrolled inner loop. If N=0, the ceiling is disregarded.
The default is 40.
-OPT:unroll_times_max=N
Unroll inner loops by a maximum of N. The default is 4.
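For readers unfamiliar with the transformation, the sketch below shows an inner loop unrolled by the default factor of 4, with a remainder loop for trip counts that are not a multiple of 4. It is a source-level model in Python with hypothetical function names; the compiler performs the equivalent rewrite on its own intermediate code.

```python
def scale_rolled(xs, k):
    """Original loop: one element per iteration."""
    out = [0] * len(xs)
    for i in range(len(xs)):
        out[i] = xs[i] * k
    return out


def scale_unrolled_by_4(xs, k):
    """Same loop unrolled 4 times, plus a remainder loop."""
    n = len(xs)
    out = [0] * n
    i = 0
    while i + 4 <= n:
        out[i] = xs[i] * k
        out[i + 1] = xs[i + 1] * k
        out[i + 2] = xs[i + 2] * k
        out[i + 3] = xs[i + 3] * k
        i += 4
    while i < n:            # handles the last n % 4 elements
        out[i] = xs[i] * k
        i += 1
    return out


print(scale_unrolled_by_4(list(range(10)), 3) == scale_rolled(list(range(10)), 3))
```

Each unrolled copy adds instructions to the loop body, which is the quantity -OPT:unroll_size bounds.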
-L<library directory> -lsmartheap,
when used as an EXTRA_CLIB or EXTRA_CXXLIB variable,
results in linking with MicroQuill's SmartHeap 8 (32-bit) library
for Linux. This is a library that optimizes calls to new, delete, malloc and free.
-L<library directory> -lhugetlbfs
Link with the hugetlbfs library for Linux. This is a library that utilizes
hugepages.
-static
Suppress dynamic linking at runtime for shared libraries;
use static linking instead.
-Wl,-Tscriptfile
Instruct the linker to use the scriptfile as the linker script.
In the example, the scriptfile, elf_x86_64.xBDT, is needed to configure
the .bss, .data and .text sections to utilize hugepages.
The -WOPT: option group specifies options that affect the global optimizer. The options are enabled at -O2 or above.
-WOPT:aggstr=N
This controls the aggressiveness of the strength reduction optimization
performed by the scalar optimizer, in which induction expressions
within a loop are replaced by temporaries that are incremented
together with the loop variable. When strength reduction is overdone,
the additional temporaries increase register pressure, resulting in
excessive register spills that decrease performance.
The value specified must be a non-negative integer, which specifies
the maximum number of induction expressions that will be strength-reduced
across an index variable increment.
When set at 0, strength reduction is only performed for non-trivial
induction expressions. The default is 11.
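A minimal sketch of strength reduction, in Python for illustration and with hypothetical names: the induction expression base + i * stride is replaced by a temporary that is incremented alongside the loop variable. Each such temporary occupies a register for the whole loop, which is the pressure the aggstr limit is meant to bound.

```python
def addresses_multiply(base, stride, n):
    """Original form: one multiplication per iteration."""
    return [base + i * stride for i in range(n)]


def addresses_strength_reduced(base, stride, n):
    """Strength-reduced form: the multiplication becomes a running addition."""
    out = []
    addr = base            # temporary that tracks base + i * stride
    for _ in range(n):
        out.append(addr)
        addr += stride     # incremented together with the loop variable
    return out


print(addresses_multiply(1000, 8, 5))
print(addresses_strength_reduced(1000, 8, 5))
```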
-WOPT:if_conv=(0|1|2):
Controls the optimization that translates simple IF
statements to conditional move instructions in the
target CPU. Setting to 0 suppresses this optimization.
The value of 1 designates conservative if-conversion,
in which the context around the IF statement is used
in deciding whether to if-convert. The value of 2
enables aggressive if-conversion by causing it to be
performed regardless of the context. The default is 1.
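If-conversion replaces a control dependence (a branch) with a data dependence (a select). The sketch below models this in Python for integers, using a mask in place of a conditional move; the mask arithmetic is an illustrative stand-in, not the instruction sequence the compiler emits, and the function names are hypothetical.

```python
def clamp_branch(x, limit):
    """Original form: a simple IF statement."""
    if x > limit:
        x = limit
    return x


def clamp_select(x, limit):
    """If-converted form: both values are available and a mask selects one."""
    mask = -int(x > limit)                # -1 (all ones) when the condition holds, else 0
    return (limit & mask) | (x & ~mask)   # picks limit or x without branching on the data


print([clamp_branch(x, 3) for x in range(6)])
print([clamp_select(x, 3) for x in range(6)])
```

The converted form always does the work of both paths, so it pays off mainly when the branch is hard to predict, which is one reason the conservative setting looks at the surrounding context.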
-WOPT:mem_opnds=(on|off|0|1)
Makes the scalar optimizer preserve any memory operands of arithmetic
operations so as to help bring about subsumption of memory loads into
the operands of arithmetic operations. Load subsumption is the combining
of an arithmetic instruction and a memory load into one instruction.
Default is OFF.
-WOPT:retype_expr=(on|off|0|1)
Enables the optimization in the compiler that converts 64-bit address
computation to use 32-bit arithmetic as much as possible.
Default is OFF.
-WOPT:unroll=(0|1|2) : Control the unrolling of innermost loops in the scalar optimizer. Setting to 0 suppresses this unroller. The default is 1, which makes the scalar optimizer unroll only loops that contain IF statements. Setting to 2 makes the unrolling also apply to loop bodies that are straight-line code, which duplicates the unrolling done in the code generator and is thus unnecessary. The default setting of 1 makes this unrolling complementary to what is done in the code generator. This unrolling is not affected by the unrolling options under the -OPT group.
-WOPT:val=(0|1|2) : Control the number of times the value-numbering optimization is performed in the global optimizer, with the default being 1. This optimization tries to recognize expressions that will compute identical runtime values and changes the program to avoid re-computing them.
The -GRA: option group controls the Global Register Allocator.
-GRA:optimize_boundary=(on|off|0|1)
Allow the Global Register Allocator to allocate the same register to different variables
in the same basic-block. Default is OFF.
-GRA:prioritize_by_density=(on|off|0|1)
Tell the Global Register Allocator to prioritize register assignment to variables based on the variable's
reference density instead of the variable's reference count. Default is OFF.
The -LANG: option group controls language-specific options.
-LANG:copyinout=(on|off)
When an array section is passed as the actual argument in a call, the compiler sometimes copies the
array section to a temporary array and passes the temporary array, thus promoting locality in the
accesses to the array argument. This optimization is relevant only to Fortran, and this flag controls
the aggressiveness of this optimization. The default is ON for -O2 or higher and OFF otherwise.
The -TENV: option group specifies the target environment. These options control the target environment assumed and/or produced by the compiler.
-TENV:frame_pointer=(on|off)
Default is ON for C++ and OFF otherwise. Local variables in the function stack frame are addressed
via the frame pointer register. Ordinarily, the compiler will replace this use of frame pointer by
addressing local variables via the stack pointer when it determines that the stack pointer is fixed
throughout the function invocation. This frees up the frame pointer for other purposes. Turning this
flag on forces the compiler to use the frame pointer to address local variables. This flag defaults
to ON for C++ because the exception handling mechanism relies on the frame pointer register being
used to address local variables. This flag can be turned OFF for C++ for programs that do not throw
exceptions.
HEADER for PORTABILITY
CFP2006:
If -funderscoring is in effect, and the original Fortran external identifier contained an underscore, -fsecond-underscore appends a second underscore to the one added by -funderscoring. -fno-second-underscore does not append a second underscore. The default is both -funderscoring and -fsecond-underscore, the same defaults as g77 uses. -fno-second-underscore corresponds to the default policies of PGI Fortran and Intel Fortran.
HEADER for COMPILER
Invoke the PathScale C compiler.
Also used to invoke linker for C programs.
Invoke the PathScale C++ compiler.
Also used to invoke linker for C++ programs.
Invoke the PathScale Fortran 77, 90 and 95 compilers.
Also used to invoke linker for Fortran programs and
for mixed C / Fortran. pathf90 and pathf95 are synonymous.
HEADER for OTHER
-IPA:max_jobs=N : This option limits the maximum parallelism when invoking the compiler after IPA to (at most) N compilations running at once. The option can take the following values:
0 = The parallelism chosen is equal to either the number of CPUs, the number of cores, or the number of hyperthreading units in the compiling system, whichever is greatest.
1 = Disable parallelization during compilation (default)
>1 = Specifically set the degree of parallelism