SPEC CPU2006 software OS and BIOS Settings Descriptions for Cisco UCS Intel-based systems

Firmware / BIOS / Microcode Settings

Intel Turbo Boost Technology:

Enabling this option allows the processor cores to automatically increase their frequency, and thereby performance, when they are running below their power and temperature limits.

Intel Hyper-Threading Technology:

Enabling this option lets the processor use its resources more efficiently by running multiple threads on each core. This increases processor throughput and improves overall performance on threaded software.

Enhanced Intel SpeedStep:

Enabling this option allows the system to dynamically adjust processor voltage and core frequency. This technology can result in decreased average power consumption and decreased average heat production.

Core Multi Processing:

This option specifies the number of logical processor cores that can run on the server by setting the state of the logical processor cores in a package. If you disable this setting, Hyper-Threading is also disabled.

Virtualization Technology:

This option controls whether the processor uses Intel Virtualization Technology, which allows a platform to run multiple operating systems and applications in independent partitions. Users should disable this option when performing application benchmarking.

Direct Cache Access:

Enabling this option allows processors to increase I/O performance by placing data from I/O devices directly into the processor cache. This setting helps to reduce cache misses.

Power Technology:

This BIOS option configures the CPU power management settings, such as Enhanced Intel SpeedStep Technology, Intel Turbo Boost Technology, and Processor Power State C6. The Custom setting allows the individual CPU power management settings to be changed. The Energy Efficient setting lets the system determine the best settings for these BIOS parameters. The Disabled setting performs no CPU power management and ignores any settings for these BIOS parameters.

Processor C1 Enhanced:

Enabling this option allows the processor to transition to its minimum frequency upon entering C1. This setting does not take effect until after you have rebooted the server. In the disabled state, the CPU continues to run at its maximum frequency in the C1 state. Users should disable this option when performing application benchmarking.

Processor State C6:

Enabling this option allows the processor to send the C6 report to the operating system. Users should disable this option when performing application benchmarking.

Energy Performance:

This BIOS option lets you specify whether system performance or energy efficiency is more important on the server. It can be set to one of the following: Balanced Energy, Balanced Performance, Energy Efficient, or Performance. Note: Power Technology must be set to Custom to expose this BIOS option.

CPU Performance:

This BIOS option selects one of three processor performance modes: Enterprise, High-Throughput, and HPC. The Enterprise and High-Throughput modes enable all the prefetchers and disable Data Reuse technology. The HPC mode enables all the prefetchers and enables Data Reuse technology.

Low Voltage DDR Mode and DRAM Clock Throttling:

This BIOS option determines how the system prioritizes memory operations. In Power-saving-mode, the system prioritizes low-voltage memory operations over high-frequency memory operations; this mode may lower the memory frequency in order to keep the voltage low. In Performance-mode, the system prioritizes high-frequency operations over low-voltage operations.

Closed Loop Thermal Throttling:

This BIOS option enables or disables the temperature-based memory throttling feature. It is enabled by default. When it is enabled, the system BIOS initiates memory throttling to manage memory performance by limiting bandwidth to the DIMMs, thereby capping power consumption and preventing the DIMMs from overheating.

Memory RAS Configuration:

This BIOS option configures memory reliability, availability, and serviceability (RAS). With the maximum performance setting, system performance is optimized. With the mirroring setting, system reliability is optimized by using half the system memory as backup. With the lockstep setting, if the DIMM pairs in the server have an identical type, size, and organization and are populated across the SMI channels, lockstep mode can be enabled to minimize memory access latency and provide better performance. With the sparing setting, system reliability is enhanced with a degree of memory redundancy while making more memory available to the operating system than mirroring.

DRAM Refresh Rate:

This option controls the refresh interval for internal memory. By default, the refresh rate is set to Auto, which uses 2X refresh: DRAM cells are refreshed every 32 ms. With the 1X setting, DRAM cells are refreshed every 64 ms.

QPI Snoop Configuration:

There are 3 snoop mode options for how to maintain cache coherency across the Intel QPI fabric, each with varying memory latency and bandwidth characteristics depending on how the snoop traffic is generated.

Cluster on Die (COD) mode logically splits a socket into 2 NUMA domains that are exposed to the OS, with half of the cores and LLC assigned to each NUMA domain in a socket. This mode utilizes an on-die directory cache and in-memory directory bits to determine whether a snoop needs to be sent. Use this mode for highly NUMA-optimized workloads to get the lowest local memory latency and highest local memory bandwidth.

In Home Snoop and Early Snoop modes, snoops are always sent; they just originate from different places: the caching agent (earlier) in Early Snoop mode and the home agent (later) in Home Snoop mode.

Use Home Snoop mode for NUMA workloads that are memory bandwidth sensitive and need both local and remote memory bandwidth.

Use Early Snoop mode for workloads that are memory latency sensitive or for workloads that benefit from fast cache-to-cache transfer latencies from the remote socket. Snoops are sent out earlier, which is why memory latency is lower in this mode.

High Bandwidth:

Enabling this option allows the chipset to defer memory transactions and process them out of order for optimal performance.

ulimit -s <n>

Sets the stack size to n kbytes, or unlimited to allow the stack size to grow without limit.

numactl --interleave=all "runspec command"

Launching a process with numactl --interleave=all sets the memory interleave policy so that memory is allocated round-robin across all nodes. When memory cannot be allocated on the current interleave target, allocation falls back to other nodes.

Free the file system page cache

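The command "echo 1 > /proc/sys/vm/drop_caches" is used to free up the filesystem page cache. Note the space between "1" and ">": without it, the shell treats "1>" as a redirection of standard output and writes only an empty line.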
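
As a minimal sketch of the step above, the function below wraps the write in a helper. The name drop_page_cache and its optional path argument are hypothetical; the argument exists only so the function can be tried against an ordinary file without root privileges. Running sync first is an addition to the command quoted above: dropping caches does not free dirty pages, so flushing them beforehand lets more of the cache be released.

```shell
# Free the filesystem page cache by writing "1" to the drop_caches control file.
#   $1 = target path (hypothetical override for dry runs; defaults to the real file)
drop_page_cache() {
    target="${1:-/proc/sys/vm/drop_caches}"
    sync                   # write dirty pages back so they can be dropped
    echo 1 > "$target"     # note the space between "1" and ">"
}

# Real use requires root privileges:
#   drop_page_cache
```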

Using numactl to bind processes and memory to cores

For multi-copy runs or single-copy runs on systems with multiple sockets, it is advantageous to bind a process to a particular core. Otherwise, the OS may arbitrarily move your process from one core to another, which can affect performance. To help, SPEC allows the use of a "submit" command where users can specify a utility to use to bind processes. We have found the utility 'numactl' to be the best choice.

numactl runs processes with a specific NUMA scheduling or memory placement policy. The policy is set for a command and inherited by all of its children. The numactl flag "--physcpubind" specifies which core(s) to bind the process to. "-l" instructs numactl to keep a process's memory on the local node, while "-m" specifies the node(s) on which to place a process's memory. For full details on using numactl, please refer to your Linux documentation, 'man numactl'.
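
The flags described above can be combined as in the following sketch, which only prints the command line it would run. The helper name numactl_bind_cmd is hypothetical, and mapping copy number N to logical CPU N is an assumption that may not suit every core numbering. "--localalloc" is the long form of "-l".

```shell
# Print a numactl command line that pins a benchmark copy to one logical CPU
# and keeps its memory on the local NUMA node.
#   $1     = copy number, used directly as the CPU number (assumption)
#   $2...  = the command to run
numactl_bind_cmd() {
    copynum="$1"
    shift
    echo "numactl --physcpubind=$copynum --localalloc $*"
}

numactl_bind_cmd 3 ./benchmark_binary
```

In a SPEC config file the same idea would typically appear as a submit line that uses $SPECCOPYNUM for the CPU number and $command for the program.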

Linux Huge Page settings

In order to take advantage of large pages, your system must be configured to use large pages. To configure your system for huge pages perform the following steps:

Create a mount point for the huge pages: "mkdir /mnt/hugepages"

The huge page file system needs to be mounted when the system reboots. Add the following to a system boot configuration file before any services are started: "mount -t hugetlbfs nodev /mnt/hugepages"

Set vm/nr_hugepages=N in your /etc/sysctl.conf file, where N is the maximum number of pages the system may allocate.

Reboot to have the changes take effect. (Not necessary on some operating systems, such as Red Hat Enterprise Linux 5.5.)

Note that further information about huge pages may be found in your Linux documentation file: /usr/src/linux/Documentation/vm/hugetlbpage.txt
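
Choosing N for vm/nr_hugepages is simple arithmetic, sketched below. The helper name hugepages_for_mib is hypothetical, and the 2048 KiB default page size is an assumption; the actual value is reported as Hugepagesize in /proc/meminfo.

```shell
# Number of huge pages needed to back a given amount of memory.
#   $1 = memory to reserve, in MiB
#   $2 = huge page size in KiB (assumed default: 2048, i.e. 2 MiB pages)
hugepages_for_mib() {
    mem_mib="$1"
    page_kib="${2:-2048}"
    # Round up so the reservation is never smaller than requested.
    echo $(( (mem_mib * 1024 + page_kib - 1) / page_kib ))
}

hugepages_for_mib 16384    # 16 GiB backed by 2 MiB pages -> 8192
```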

Transparent Huge Pages

On Red Hat EL 6 and later, Transparent Hugepages increase the memory page size from 4 kilobytes to 2 megabytes. Transparent Hugepages provide significant performance advantages on systems with highly contended resources and large memory workloads. If memory utilization is too high or memory is badly fragmented, preventing hugepages from being allocated, the kernel will assign smaller 4k pages instead. Hugepages are used by default if /sys/kernel/mm/redhat_transparent_hugepage/enabled is set to always.
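
The current setting can be checked as in the sketch below. The control file lists the available modes and marks the active one in square brackets. The helper name thp_mode is hypothetical; the second path in the loop is the mainline kernel location, included on the assumption that the system may not use the Red Hat-specific path.

```shell
# Print the active Transparent Hugepage mode: the word shown in [brackets].
#   $1 = path to the "enabled" control file
thp_mode() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "$1"
}

# Query the live system only where a control file exists.
for f in /sys/kernel/mm/redhat_transparent_hugepage/enabled \
         /sys/kernel/mm/transparent_hugepage/enabled; do
    if [ -r "$f" ]; then
        echo "$f: $(thp_mode "$f")"
    fi
done
```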

HUGETLB_MORECORE

Set this environment variable to "yes" to enable applications to use large pages.

LD_PRELOAD=/usr/lib64/libhugetlbfs.so

Setting this environment variable is necessary to enable applications to use large pages.

KMP_STACKSIZE

Specifies the stack size to be allocated for each thread.

KMP_AFFINITY

KMP_AFFINITY = < physical | logical >, starting-core-id

Specifies the static mapping of user threads to physical cores. For example, if you have a system configured with 8 cores, OMP_NUM_THREADS=8 and KMP_AFFINITY=physical,0, then thread 0 will be mapped to core 0, thread 1 will be mapped to core 1, and so on in a round-robin fashion.

KMP_AFFINITY = granularity=fine,scatter

The value of the environment variable KMP_AFFINITY affects how the threads from an auto-parallelized program are scheduled across processors. Specifying granularity=fine selects the finest granularity level and causes each OpenMP thread to be bound to a single thread context. This ensures that there is only one thread per core on cores supporting Hyper-Threading Technology. Specifying scatter distributes the threads as evenly as possible across the entire system. Hence a combination of these two options will spread the threads evenly across sockets, with one thread per physical core.

OMP_NUM_THREADS

Sets the maximum number of threads to use for OpenMP* parallel regions if no other value is specified in the application. This environment variable applies to both -openmp and -parallel (Linux and Mac OS X) or /Qopenmp and /Qparallel (Windows). Example syntax on a Linux system with 8 cores: export OMP_NUM_THREADS=8
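
For illustration, the OpenMP-related variables described above might be set together as follows before launching a run. The thread count matches the 8-core example; the KMP_STACKSIZE value is a hypothetical placeholder, not a recommendation.

```shell
# Illustrative values for an 8-core system; adjust for the system under test.
export OMP_NUM_THREADS=8
export KMP_AFFINITY=granularity=fine,scatter
export KMP_STACKSIZE=200M    # hypothetical per-thread stack size

echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
echo "KMP_AFFINITY=$KMP_AFFINITY"
echo "KMP_STACKSIZE=$KMP_STACKSIZE"
```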

submit= MYMASK=`printf '0x%x' \$((1<< \$SPECCOPYNUM))`; /usr/bin/taskset \$MYMASK $command

When running multiple copies of benchmarks, the SPEC config file feature submit is sometimes used to cause individual jobs to be bound to specific processors. This specific submit command is used for Linux. The elements of the command are described below:

/usr/bin/taskset [options] [mask] [pid | command [arg] ... ] :
taskset is used to set or retrieve the CPU affinity of a running process given its PID, or to launch a new COMMAND with a given CPU affinity. The CPU affinity is represented as a bitmask, with the lowest-order bit corresponding to the first logical CPU and the highest-order bit corresponding to the last logical CPU. When taskset returns, it is guaranteed that the given program has been scheduled to a legal CPU.
The default behaviour of taskset is to run a new command with a given affinity mask:
taskset [mask] [command] [arguments]

$MYMASK: The bitmask (in hexadecimal) corresponding to a specific SPECCOPYNUM. For example, the $MYMASK value for the first copy of a rate run will be 0x1 (equivalent to 0x00000001), for the second copy 0x2 (0x00000002), and so on. Thus, the first copy of the rate run will have a CPU affinity of CPU0, the second copy will have an affinity of CPU1, and so on.

$command: Program to be started, in this case, the benchmark instance to be started.
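
The mask computation in the submit line can be tried on its own, as in this sketch. The function name mask_for_copy is hypothetical; the printf format and the shift expression are the ones used in the submit command.

```shell
# Affinity mask for SPEC copy number N: a single bit set at position N.
mask_for_copy() {
    printf '0x%x\n' $((1 << $1))
}

for copy in 0 1 2 5; do
    echo "copy $copy -> mask $(mask_for_copy "$copy")"
done
```

Copy 5, for example, yields 0x20, which taskset interprets as logical CPU 5.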

mysubmit.pl

This Perl script is used to ensure that, for a system with N cores, the first N/2 benchmark copies are each bound to a core that does not share its L2 cache with any of the other copies. The script does this by retrieving and using CPU data from /proc/cpuinfo. Note that this script will only work for 6-core CPUs; as written, it also assumes 24 logical processors.

Source
******************************************************************************************************
#!/usr/bin/perl

use strict;
use Cwd;

# The order in which we want copies to be bound to cores
# Copies: 0, 1, 2, 3
# Cores: 0, 1, 3, 6

my $rundir = getcwd;

my $copynum = shift @ARGV;

my $i;
my $j;
my $tag;
my $num;
my $core;

my @proc;
my @cores;

open(INPUT, "/proc/cpuinfo") or
    die "can't open /proc/cpuinfo\n";

#open(OUTPUT, "STDOUT");

# proc[i][0] = logical processor ID
# proc[i][1] = physical processor ID
# proc[i][2] = core ID

$i = 0;

while(<INPUT>)
{
    chop;

    ($tag, $num) = split(/\s+:\s+/, $_);

    if ($tag eq "processor") {
        $proc[$i][0] = $num;
    }

    if ($tag eq "physical id") {
        $proc[$i][1] = $num;
    }

    if ($tag eq "core id") {
        $proc[$i][2] = $num;
        $i++;
    }
}

$i = 0;
$j = 0;

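# Build the binding order: for each core ID, in the order listed,
# append every logical processor that has that core ID.
for $core (0, 4, 2, 1, 5, 3) {
    # 24 is the hardcoded number of logical processors
    while ($i < 24) {
        if ($proc[$i][2] == $core) {
            $cores[$j] = $proc[$i][0];
            $j++;
        }
        $i++;
    }
    $i=0;
}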

open RUNCOMMAND, "> runcommand" or die "failed to create run file";
print RUNCOMMAND "cd $rundir\n";
print RUNCOMMAND "@ARGV\n";
close RUNCOMMAND;
system 'taskset', '-c', $cores[$copynum], 'sh', "$rundir/runcommand";
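
Before relying on the script, it can help to inspect the processor, physical id, and core id fields it reads. The following is a minimal shell sketch for that purpose, not part of the script itself; the function name cpu_topology is hypothetical.

```shell
# Print "logical_cpu physical_id core_id" for each logical processor.
#   $1 = cpuinfo-format file (defaults to /proc/cpuinfo)
cpu_topology() {
    awk -F'[ \t]*:[ \t]*' '
        $1 == "processor"   { cpu  = $2 }
        $1 == "physical id" { phys = $2 }
        $1 == "core id"     { print cpu, phys, $2 }
    ' "${1:-/proc/cpuinfo}"
}

# Query the live system only where /proc/cpuinfo is available.
if [ -r /proc/cpuinfo ]; then
    cpu_topology
fi
```

Logical processors that report the same physical id and core id are hardware threads of one core.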