IBM Power platform file
- ulimit -s 1048576
Sets the maximum stack size to "1048576 KB".
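As a quick check of the unit: ulimit -s takes its value in 1024-byte units, so 1048576 KB corresponds to 1024 MB (1 GB) of stack. The following is a minimal sketch of that arithmetic; the helper name stack_kb_to_mb is illustrative and not part of the original configuration.

```shell
# Hypothetical helper: convert a "ulimit -s" value (1024-byte units) to MB.
stack_kb_to_mb() {
    echo $(( $1 / 1024 ))
}

stack_kb_to_mb 1048576
# To inspect the limit currently in effect for this shell, run: ulimit -s
```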
- To reserve 200 huge pages from the physical memory pool, issue the following command:
echo 200 > /proc/sys/vm/nr_hugepages
or
echo 200 > /proc/sys/vm/nr_overcommit_hugepages
to allocate from the dynamic hugepage pool.
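One way to confirm that the reservation took effect is to read the hugepage counters that the kernel reports in /proc/meminfo. The sketch below assumes the usual "Name: value" layout of that file; the helper name meminfo_field is hypothetical, and the sample input stands in for a live system.

```shell
# Hypothetical helper: print one counter from meminfo-formatted text on stdin.
meminfo_field() {
    awk -v key="$1:" '$1 == key { print $2 }'
}

# On a live system: meminfo_field HugePages_Total < /proc/meminfo
# Here a two-line sample is used instead of the real file.
printf 'HugePages_Total:     200\nHugePages_Free:      200\n' | meminfo_field HugePages_Total
```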
- chsyscfg -m system -r prof -i name=profile,lpar_name=partition,lpar_proc_compat_mode=POWER6_enhanced
This command enables the optional PowerPC architecture instructions supported on POWER6.
Usage: chsyscfg -r lpar | prof | sys | sysprof | frame
-m <managed system> | -e <managed frame>
-f <configuration file> | -i "<configuration data>"
[--help]
Changes partitions, partition profiles, system profiles, or the attributes of a
managed system or a managed frame.
-r - the type of resource(s) to be changed:
lpar - partition
prof - partition profile
sys - managed system
sysprof - system profile
frame - managed frame
-m <managed system> - the managed system's name
-e <managed frame> - the managed frame's name
-f <configuration file> - the name of the file containing the
configuration data for this command.
The format is:
attr_name1=value,attr_name2=value,...
or
"attr_name1=value1,value2,...",...
-i "<configuration data>" - the configuration data for this command.
The format is:
"attr_name1=value,attr_name2=value,..."
or
""attr_name1=value1,value2,...",..."
--help - prints this help
The valid attribute names for this command are:
-r prof required: name, lpar_id | lpar_name
optional: ...
lpar_proc_compat_mode (default | POWER6_enhanced)
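The configuration data passed with -i is a single comma-separated list of attribute=value pairs. As a rough sketch, the command line can be assembled as below. The helper name compat_mode_cmd is hypothetical, the four arguments are placeholders, and the function only prints the command rather than running it.

```shell
# Hypothetical helper: print (not run) the chsyscfg command for one profile.
# Arguments: managed system, profile name, partition name, compatibility mode.
compat_mode_cmd() {
    printf 'chsyscfg -m %s -r prof -i "name=%s,lpar_name=%s,lpar_proc_compat_mode=%s"\n' \
        "$1" "$2" "$3" "$4"
}

compat_mode_cmd system profile partition POWER6_enhanced
```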
- Each process was bound to a CPU by using the numactl command in the submit= setting:
submit = numactl --membind=\$SPECCOPYNUM --physcpubind=\$SPECCOPYNUM $command
- numactl : Control NUMA policy for processes or shared memory
--membind=nodes
Only allocate memory from nodes. Allocation will fail when
there is not enough memory available on these nodes.
--physcpubind=cpus
Only execute process on cpus. This accepts physical cpu numbers
as shown in the processor fields of /proc/cpuinfo.
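To illustrate how the submit line above expands, the sketch below prints the command that would result for a given copy number. The helper name build_submit and the benchmark path are hypothetical; the function only builds the string and does not invoke numactl.

```shell
# Hypothetical helper: print the numactl command line produced for one copy.
# $1 is the copy number (the value of SPECCOPYNUM); the rest is the command.
build_submit() {
    copynum=$1
    shift
    printf 'numactl --membind=%s --physcpubind=%s %s\n' "$copynum" "$copynum" "$*"
}

for n in 0 1 2; do
    build_submit "$n" ./benchmark_exe
done
```

Because the same copy number is used for both options, copy N runs on CPU N and allocates memory only from node N.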
- Environment variables that can be set before the run:
HUGETLB_VERBOSE=0 : Turn off any debugging messages from libhugetlbfs.
HUGETLB_MORECORE=yes : Instructs libhugetlbfs to override libc's normal morecore() function with a hugepage version and use it for malloc().
HUGETLB_MORECORE_HEAPBASE=0x50000000 : Specifies that the hugepage heap starts at address 0x50000000.
HUGETLB_ELFMAP=R : Instructs libhugetlbfs to place the text segment in hugepages.
HUGETLB_ELFMAP=W : Instructs libhugetlbfs to place the data and BSS segments in hugepages.
HUGETLB_ELFMAP=RW : Instructs libhugetlbfs to place all segments in hugepages.
HUGETLB_ELFMAP=no : Instructs libhugetlbfs not to place any segment in hugepages.
XLFRTEOPTS=intrinthrds=1 : Causes the Fortran runtime to only use a single thread.
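A minimal sketch of setting these variables in the shell before starting a run follows. The choice of HUGETLB_ELFMAP=RW is an assumption made for illustration; any of the four values listed above could be used instead.

```shell
# Export the variables so that child processes (the benchmark run) inherit them.
export HUGETLB_VERBOSE=0
export HUGETLB_MORECORE=yes
export HUGETLB_MORECORE_HEAPBASE=0x50000000
export HUGETLB_ELFMAP=RW          # assumption: one of R, W, RW, no
export XLFRTEOPTS=intrinthrds=1

# Show what will be passed on to the run.
env | grep -E '^(HUGETLB_|XLFRTEOPTS=)' | sort
```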
- IBM Post-Link Optimization (fdprpro):
- First, copy the original executable (baseexe) to baseexe.orig.
- Then instrument the executable and generate its initial profile, as follows:
$ fdprpro -a instr baseexe
The output will be generated (by default) in baseexe.instr and its profile in baseexe.nprof.
- Next, run baseexe.instr using the training data. This will fill the profile file with information that characterizes the training workload.
- Finally, re-run FDPR-Pro with the profile file provided, as follows:
$ fdprpro -a opt -f baseexe.nprof [optimization options] baseexe
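The steps above can be summarized as a dry-run sketch that prints the commands in order without executing them. The helper name fdpr_plan is hypothetical, -O3 is used only as an example optimization option, and the way the instrumented binary is launched on the training data is an assumption (a real run would pass the training workload's arguments).

```shell
# Hypothetical helper: print the FDPR-Pro workflow for one executable.
# $1 is the executable name, $2 the optimization options for the final step.
fdpr_plan() {
    exe=$1
    opts=$2
    echo "cp $exe $exe.orig"                        # keep the original
    echo "fdprpro -a instr $exe"                    # writes $exe.instr and $exe.nprof
    echo "./$exe.instr"                             # training run fills the profile
    echo "fdprpro -a opt -f $exe.nprof $opts $exe"  # optimize using the profile
}

fdpr_plan baseexe -O3
```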
Instrumentation Options Descriptions:
-ei, --embedded-instrumentation
Perform embedded instrumentation. The profile will be collected
into global variables.
-fd Fdesc, --file-descriptor Fdesc
Set the file descriptor number to be used when opening the profile
file. The default of Fdesc is set to the maximum-allowed number of
open files.
-imullX, --mullX-instrumentation
Perform value profiling of RA and RB operands in mullX
instructions.
-issu, --instrumentation-safe-stack-usage
Ensure additional stack space is properly allocated for the
instrumented run. Use this option if your application uses stack
extensively (e.g., when the program uses alloca()). Note that this
option adds extra overhead on instrumentation code.
-iso offset, --instrumentation-stack-offset offset
Set the offset from the stack, a negative number, where the
instrumentation's area for saving registers is kept at runtime.
Use with care.
-M addr, --profile-map addr
Set shared memory segment address for profiling. Alternative
shared memory addresses are needed when the instrumented program
application creates a conflict with the shared-memory addresses
preserved for the profiling. Typical alternative values are
0x40000000, 0x50000000, ... up to 0xC0000000. The default is set
to 0x3000000.
-[no]ri, --[no]register-instrumentation
Instrument the input program file to collect profile information
about indirect branches via registers. The default is set to
collect the profile information.
-[no]sfp, --[no]save-floating-point-registers
Save floating point registers in instrumented code. The default is
set to save floating point registers.
Optimization Options Descriptions:
-A alignment, --align-code alignment
Align program so that hot code will be aligned on alignment-byte
addresses.
-abb factor, --align-basic-blocks factor
Align basic blocks that are hotter than the average by a given
(float) factor. This is a lower-level machine-specific alignment
compared to --align-code. Value of -1 (the default) disables this
option.
-bf, --branch-folding
Eliminate branch to branch instructions.
-bldcg, --build-dcg
Build a Data Connectivity Graph (DCG) for enhanced data reordering
(applicable only with the -RD flag).
-bp, --branch-prediction
Set branch prediction bit for conditional branches according to
the collected profile.
-btcar, --branch-table-csect-anchor-removal
Eliminate load instructions used when accessing branch tables.
-cbtd, --convert-bss-to-data
Convert BSS section into a data section. This is useful for more
aggressive tocload and RD optimizations.
-cRD, --conservativeRD
Perform conservative static data reordering by packing together
all frequently referenced static variables.
-dce, --dead-code-elimination
Eliminate instructions related to unused local variables within
frequently executed functions. This is useful mainly after
applying function inlining optimization.
-dp, --data-prefetch
Insert data-cache prefetch instructions to improve data-cache
performance.
-dpht threshold, --data-placement-hotness-threshold threshold
Set data placement algorithm hotness threshold between (0,1),
where 0 reorders the static variables in large groups based on the
control flow, and 1 reorders the variables in very small groups
based on their access frequency. (This is applicable only with the
-RD flag).
-dpnf factor, --data-placement-normalization-factor factor
Set data placement algorithm normalization factor between (0,1),
where 0 causes static variables to be reordered regardless of
their size, and 1 locates only small sized variables first.
(applicable only with the -RD flag).
-ece, --epilog-code-eliminate
Reduce code size by grouping common instructions in function
epilogs into a single unified piece of code.
-fc, --function-cloning
Enable function cloning phase only during function inlining
optimizations (applicable only with function inlining flags: -i, -si,
-ihf, -isf, -shci).
-hr, --hco-reschedule
Relocate instructions from frequently executed code to rarely
executed code areas, when possible.
-hrf factor, --hco-resched-factor factor
Set the aggressiveness of the -hr optimization option according to
a factor value between (0,1), where 0 is the least aggressive
factor (applicable only with the -hr option).
-i, --inline
Same as --selective-inline with --inline-small-funcs 12.
-ihf pct, --inline-hot-functions pct
Inline all function call sites to functions that have a frequency
count greater than the given pct frequency percentage.
-isf size, --inline-small-funcs size
Inline all functions that are smaller than or equal to the given
size in bytes.
-kr, --killed-registers
Eliminate stores and restores of registers that are killed
(overwritten) after frequently executed function calls.
-lap, --load-address-propagation
Eliminate load instructions of variable addresses by re-using
preloaded addresses of adjacent variables.
-las, --load-after-store
Add NOP instructions to place each load instruction further apart
following a store instruction that references the same memory
address.
-lro, --link-register-optimization
Eliminate saves and restores of the link register in frequently-
executed functions.
-lu aggressiveness_factor, --loop-unroll aggressiveness_factor
Unroll short loops containing one to several basic blocks
according to an aggressiveness factor between (1,9), where 1 is the
least aggressive unrolling option for very hot and short loops.
-lun unrolling_number, --loop-unrolling-number unrolling_number
Set the number of unrolled iterations in each unrolled loop. The
allowed range is between (2,50). Default is set to 2. (Applicable
only with the -lu flag).
-nop, --nop-removal
Remove NOP instructions from reordered code.
-O Switch on basic optimizations only. Same as -RC -nop -bp -bf.
-O2 Switch on less aggressive optimization flags. Same as -O -hr -pto
-isf 8 -tlo -kr.
-O3 Switch on aggressive optimization flags. Same as -O2 -RD -isf 12
-si -dp -lro -las -vro -btcar -lu 9 -rt 0 -so.
-O4 Switch on aggressive optimization flags together with aggressive
function inlining. Same as -O3 -sidf 50 -ihf 20 -sdp 9 -shci 90
and -bldcg (for XCOFF files).
-O5 Switch on aggressive optimization flags together with HLR
optimization. Same as -O4 -sa -gcpyp -gcnstp -dce -vrox.
-omullX, --mullX-optimization
Optimize mullX instructions by adding a run-time check on RA and
RB and performing equivalent operations with lower penalty. The
optimization requires the use of -imullX in the instrumentation
phase.
-pbsi, --path-based-selective-inline
Perform selective inlining of dominant hot function calls based on
the control flow paths leading to hot functions.
-pc, --preserve-csects
Preserve CSects' boundaries in reordered code.
-pca, --propagate-constant-area
Relocate the constant variables area to the top of the code
section when possible.
-pfb, --preserve-first-bb
Preserve the original location of the entry point basic block in
the program.
-pp, --preserve-functions
Preserve functions' boundaries in reordered code.
-[no]pr, --[no]ptrgl-r11
Perform removal of R11 load instruction in _ptrgl csect.
-pto, --ptrgl-optimization
Perform optimization of indirect call instructions via registers
by replacing them with conditional direct jumps.
-ptoht heatness_threshold, --ptrgl-optimization-heatness-threshold
heatness_threshold
Set the frequency threshold for indirect calls that are to be
optimized by -pto optimization. Allowed range between 0 and 1.
Default is set to 0.8. (Applicable only with -pto flag).
-ptosl limit_size, --ptrgl-optimization-size-limit limit_size
Set the limit of the number of conditional statements generated by
-pto optimization. Allowed values are between 1 and 100. Default
value is set to 3. (Applicable only with the -pto flag).
-RC, --reorder-code
Perform code reordering.
-rcaf aggressiveness_factor, --reorder-code-aggressivenes-factor
aggressiveness_factor
Set the aggressiveness of code reordering optimization. Allowed
values are [0 1 2], where 0 preserves the original code order
and 2 is the most aggressive. Default is set to 1. (Applicable
only with the -RC flag).
-rccrf reversal_factor, --reorder-code-condition-reversal-factor
reversal_factor
Set the threshold fraction that determines when to enable
condition reversal for each conditional branch during code reordering.
Allowed input range is between 0.0 and 1.0, where 0.0 tries to
preserve original condition direction and 1.0 ignores it. Default is
set to 0.8. (Applicable only with the -RC flag).
-rcctf termination_factor, --reorder-code-chain-termination-factor
termination_factor
Set the threshold fraction that determines when to terminate each
chain of basic blocks during code reordering. Allowed input range
is between 0.0 and 1.0, where 0.0 generates long chains and 1.0
creates single basic block chains. Default is set to 0.05.
(Applicable only with the -RC flag).
-RD, --reorder-data
Perform static data reordering.
-rmte, --remove-multiple-toc-entries
Remove multiple TOC entries pointing to the same location in the
input program file.
-rt removal_factor, --reduce-toc removal_factor
Perform removal of TOC entries according to a removal factor
between (0,1), where 0 removes non-accessed TOC entries only and 1
removes all possible TOC entries.
-rtb, --remove-traceback-tables
Remove traceback tables in reordered code.
-sdp aggressiveness_factor, --stride-data-prefetch
aggressiveness_factor
Perform data prefetching within frequently executed loops based on
stride analysis, according to an aggressiveness factor between
(1,9), where 1 is the least aggressive.
-sdpla iterations_number, --stride-data-prefetch-look-ahead
iterations_number
Set the number of iterations for which data is prefetched into the
cache ahead of time. Default value is set to 4 iterations.
(Applicable only with the -sdp flag).
-sdpms stride_min_size, --stride-data-prefetch-min-size stride_min_size
Set the minimal stride size in bytes, for which data will be
considered a candidate for prefetching. Default value is set to 128
bytes. (Applicable only with the -sdp flag).
-see level
Use simplified prolog/epilog for functions that perform
conditional early-exit. Use basic optimization with level=0 and maximal
with level=1.
-shci pct, --selective-hot-code-inline pct
Perform selective inlining of functions in order to decrease the
total number of execution counts, so that only functions with
hotness above the given percentage are inlined.
-si, --selective-inline
Perform selective inlining of dominant hot function calls.
-sidf percentage_factor, --selective-inline-dominant-factor
percentage_factor
Set a dominant factor percentage for selective inline
optimization. The allowed range is between 0 and 100. Default is set to
80. (Applicable only with the -si and -pbsi flags).
-siht frequency_factor, --selective-inline-hotness-threshold
frequency_factor
Set a hotness threshold factor percentage for selective inline
optimization to inline all dominant function calls that have a
frequency count greater than the given frequency percentage.
Default is set to 100. (Applicable only with the -si and -pbsi flags).
-slbp, --spinlock-branch-prediction
Perform branch prediction bit setting for conditional branches in
spinlock code containing l*arx and st*cx instructions. (Applicable
after -bp flag).
-sldp, --spinlock-data-prefetch
Perform data prefetching for memory access instructions preceding
spinlock code containing l*arx and st*cx instructions.
-sll Lib1:Prof1,...,LibN:ProfN, --static-link-libraries
Lib1:Prof1,...,LibN:ProfN
Statically link hot code from specified dynamically linked
libraries to the input program. The parameter consists of a
comma-separated list of libraries and their profiles. IMPORTANT:
Licensing rights of specified libraries should be observed when applying
this copying optimization.
-sllht hotness_threshold, --static-link-libraries-hotness-threshold
hotness_threshold
Set hotness threshold for the --static-link-libraries
optimization. The allowed input range is between 0 (least aggressive) and
1, or -1, which does not require a profile and selects all code
that might be called by the input program from the given
libraries. Default is set at 0.5.
-so, --stack-optimization
Reduce the stack frame size of functions that are called with a
small number of arguments.
-spc, --shortcut-plt-calls
Shortcut PLT calls in shared libraries to local functions if they
exist. Note: Resolving to external symbols is disabled for such
calls.
-stf, --stack-flattening
Merge the stack frames of inlined functions with the frames of the
calling functions.
-tb, --preserve-traceback-tables
Force the restructuring of traceback tables in reordered code. If
-tb option is omitted, traceback tables are automatically included
only for C++ applications that use the Try & Catch mechanism.
-tlo, --tocload-optimization
Replace each load instruction that references the TOC with a
corresponding add-immediate instruction via the TOC anchor register,
where possible.
-ucde, --unreachable-code-data-elimination
Remove unreachable code and non-accessed static data.
-vro, --volatile-registers-optimization
Eliminate stores and restores of non-volatile registers in
frequently executed functions by using available volatile registers.
-vrox, --volatile-registers-extended-optimization
Eliminate stores and restores of non-volatile registers in
frequently executed functions by using available volatile registers.
The extended version supports FP registers and transparency.
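Each of the levels -O through -O5 described above is defined as the previous level plus additional flags. The sketch below expands a level into the flag list given in those descriptions. The helper name expand_opt_level is hypothetical, the -bldcg flag that -O4 adds for XCOFF files is left out, and repeated flags (such as -isf 8 followed by -isf 12) are listed in the order stated, with no claim about how fdprpro resolves them.

```shell
# Hypothetical helper: expand an fdprpro -O level into the flags listed in
# the option descriptions. -bldcg (added by -O4 for XCOFF files) is omitted.
expand_opt_level() {
    case $1 in
        -O)  echo "-RC -nop -bp -bf" ;;
        -O2) echo "$(expand_opt_level -O) -hr -pto -isf 8 -tlo -kr" ;;
        -O3) echo "$(expand_opt_level -O2) -RD -isf 12 -si -dp -lro -las -vro -btcar -lu 9 -rt 0 -so" ;;
        -O4) echo "$(expand_opt_level -O3) -sidf 50 -ihf 20 -sdp 9 -shci 90" ;;
        -O5) echo "$(expand_opt_level -O4) -sa -gcpyp -gcnstp -dce -vrox" ;;
        *)   echo "unknown level: $1" >&2; return 1 ;;
    esac
}

expand_opt_level -O2
```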
General Options:
-h, --help
Print online help.
-m machine-model, --machine machine-model
Generate code for the specified machine model. Target machine can be one of the following models: power2, power3, ppc405, ppc440,
power4, ppc970, power5, power6, ppe, spe, spe_edp, z10, z9. Default is set to no machine.
-q, --quiet
Set quiet output mode, suppressing informational messages.
-st stat_file, --statistics stat_file
Output statistics information to stat_file. If stat_file is '-', the output goes to standard output. See --verbose for the default.
-v level, --verbose level
Set verbose output mode level. When set, various statistics about the target optimized program are printed into the file
program.stat. Allowed level range is between 0 and 3. Default is set to 0.
-V, --version
Print version.