SPEC CPU2017 Platform Settings for Nettrix Systems

Operating System Tuning Parameters

ulimit

Used to set user limits of system-wide resources. Provides control over resources available to the shell and processes started by it. Some common ulimit commands may include:

The ulimit -s [n | unlimited]: Set the stack size to n kbytes, or unlimited to allow the stack size to grow without limit.

The ulimit -l (number): Set the maximum size that can be locked into memory.

Disabling Linux services

Certain Linux services may be disabled to minimize tasks that may consume CPU cycles.

irqbalance

Disabled through "service irqbalance stop". Depending on the workload involved, the irqbalance service reassigns various IRQ's to system CPUs. Though this service might help in some situations, disabling it can also help environments which need to minimize or eliminate latency to more quickly respond to events.

cpupower tool

Use the cpupower tool to read your supported CPU frequencies, and to set them. For rhel linux, you can set the cpu frequency in three modes with the command "cpupower frequency-set -g [options]", where "[options]" could be:

Values for this setting can be:

userspace: Allows the frequency to be set manually.

ondemand: Allows the CPU to run at different speed depending on the workloads.

performance: Set the CPU frequency to the maximum allowed. The default governor mode is "performance".

Tuning Kernel parameters

The following Linux Kernel parameters were tuned to better optimize performance of some areas of the system:

The swappiness: The swappiness value can range from 1 to 100. A value of 100 will cause the kernel to swap out inactive processes frequently in favor of file system performance, resulting in large disk cache sizes. A value of 1 tells the kernel to only swap processes to disk if absolutely necessary. This can be set through a command like "echo 1 > /proc/sys/vm/swappiness".

The sleep_millisecs: Set through "echo 200 > /sys/kernel/mm/ksm/sleep_millisecs". This setting controls how many milliseconds the ksmd (KSM daeomn) should sleep before the next scan.

The scan_sleep_millisecs: Set through "echo 50000 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs". This setting controls how many milliseconds to wait in khugepaged is there is a hugepage allocation failure to throttle the next allocation attempt.

Zone Reclaim Mode: Zone reclaim allows the reclaiming of pages from a zone if the number of free pages falls below a watermark even if other zones still have enough pages available. Reclaiming a page can be more beneficial than taking the performance penalties that are associated with allocating a page on a remote zone, especially for NUMA machines. To tell the kernel to free local node memory rather than grabbing free memory from remote nodes, use a command like "echo 1 > /proc/sys/vm/zone_reclaim_mode"

The max_map_count-n: The maximum number of memory map areas a process may have. Memory map areas are used as a side-effect of calling malloc, directly by mmap and mprotect, and also when loading shared libraries.

The sched_cfs_bandwidth_slice_us: This OS setting controls the amount of run-time(bandwidth) transferred to a run queue from the task's control group bandwidth pool. Small values allow the global bandwidth to be shared in a fine-grained manner among tasks, larger values reduce transfer overhead.

The sched_latency_ns: This OS setting configures targeted preemption latency for CPU bound tasks. The default value is 24000000 (ns).

The sched_rt_runtime_us: A global limit on how much time realtime scheduling may use.

The sched_migration_cost_ns: Amount of time after the last execution that a task is considered to be "cache hot" in migration decisions. A "hot" task is less likely to be migrated to another CPU, so increasing this variable reduces task migrations.

The sched_min_granularity_ns: This OS setting controls the minimal preemption granularity for CPU bound tasks. As the number of runnable tasks increases, CFS(Complete Fair Scheduler), the scheduler of the Linux kernel, decreases the timeslices of tasks. If the number of runnable tasks exceeds sched_latency_ns/sched_min_granularity_ns, the timeslice becomes number_of_running_tasks * sched_min_granularity_ns.

The sched_wakeup_granularity_ns: This OS setting controls the wake-up preemption granularity. Increasing this variable reduces wake-up preemption, reducing disturbance of compute bound tasks. Lowering it improves wake-up latency and throughput for latency critical tasks, particularly when a short duty cycle load component must compete with CPU bound components.

The numa_balancing: This OS setting controls automatic NUMA balancing on memory mapping and process placement. Setting 0 disables this feature. It is enabled by default 1.

The dirty_ratio: This OS setting controls the absolute maximum amount of system memory (here expressed as a percentage) that can be filled with dirty pages before everything must get committed to disk. When the system gets to this point, all new I/O operations are blocked until dirty pages have been written to disk. This is often the source of long I/O pauses, but is a safeguard against too much data being cached unsafely in memory.

The dirty_background_ratio: This OS setting controls the percentage of system memory that can be filled with "dirty" pages before the pdflush/flush/kdmflush background processes kick in to write it to disk. "Dirty" pages are memory pages that still need to be written to disk. As an example, if you set this value to 10 (it means 10%), and your server has 256 GB of memory, then 25.6 GB of data could be sitting in RAM before something is done.

The dirty_writeback_centisecs: The kernel flusher threads will periodically wake up and write old data out to disk. This OS setting controls the interval between those wakeups, in 100'ths of a second. Setting this to zero disables periodic writeback altogether.

The dirty_expire_centisecs: This OS setting is used to define when dirty data is old enough to be eligible for writeout by the kernel flusher threads. It is expressed in 100'ths of a second. Data which has been dirty in-memory for longer than this interval will be written out next time a flusher thread wakes up.

Transparent Huge Pages (THP)

THP is an abstraction layer that automates most aspects of creating, managing, and using huge pages. THP is designed to hide much of the complexity in using huge pages from system administrators and developers, as normal huge pages must be assigned at boot time, can be difficult to manage manually, and often require significant changes to code in order to be used effectively. Transparent Hugepages increase the memory page size from 4 kilobytes to 2 megabytes. Transparent Hugepages provide significant performance advantages on systems with highly contended resources and large memory workloads. If memory utilization is too high or memory is badly fragmented which prevents hugepages being allocated, the kernel will assign smaller 4k pages instead. Most recent Linux OS releases have THP enabled by default.

Linux Huge Page settings

If you need finer control and manually set the Huge Pages you can follow the below steps:

Create a mount point for the huge pages: "mkdir /mnt/hugepages"

The huge page file system needs to be mounted when the systems reboots. Add the following to a system boot configuration file before any services are started: "mount -t hugetlbfs nodev /mnt/hugepages"

Set vm/nr_hugepages=N in your /etc/sysctl.conf file where N is the maximum number of pages the system may allocate. Reboot to have the changes take effect. For further information about huge pages may be found in your Linux documentation file: /usr/src/linux/Documentation/vm/hugetlbpage.txt

Firmware / BIOS / Microcode Settings

Package C State (Default = "C0/C1 state"):

This feature selects the processor's lowest idle package power state (C-state) that is enabled. The processor will automatically transition into the package C-states based on the Core C-states, in which cores on the processor have transitioned. The higher the package C-state, the lower the power usage of that idle package state.

Values for this BIOS setting can be:

C0/C1 state: C-States range from C0 to Cn. C0 indicates an active state. All other C-states (C1-Cn) represent idle sleep states where the processor clock is inactive (cannot execute instructions) and different parts of the processor are powered down.

C2 state: All IA cores requested C6 or deeper + Processor Graphic cores in RC6 but there are constraints (LTR, programmed timer events in the near future and so forth) prevent entry to any state deeper than C2 state.

C6(non Retention) state: All cores have saved their architectural state and have had their core voltages reduced to zero volts. The LLC does not retain context, and no accesses can be made to the LLC in this state, the cores must break out to the internal state package C2 for snoops to occur.

C6(Retention) state: All cores have saved their architectural state and have had their core voltages reduced to zero volts. The LLC retains context, but no accesses can be made to the LLC in this state, the cores must break out to the internal state package C2 for snoops to occur.

Enhanced Halt State (C1E) (Default = "Disabled"):

Enabling C1E (C1 enhanced) state can save power by halting CPU cores that are idle. Values for this BIOS option can be: Enabled or Disabled.

Turbo Mode (Default = "Enabled"):

Enabling turbo mode can boost the overall CPU performance when all CPU cores are not being fully utilized. A CPU core can run above its rated frequency for a short perios of time when it is in turbo mode. Values for this BIOS option can be: Enabled or Disabled.

Enable LP [Global] (Default = "All LPs"):

The Intel Hyper-Threading knob has been renamed Enable LP [Global] to represent the number of logical processors (LP). This feature allows enabling or disabling of logical processor cores on processors supporting Intel Hyper-Threading.

Values for this BIOS setting can be:

All LPs: Setting All LPs let operating system addresses two virtual or logical cores for a physical presented core. Workloads can be shared between virtual or logical cores when possible. The main function is to increase the number of independent instructions in the pipeline for using the processor resources more efficiently.

Single LP: Each physical core operates as only one logical processor core.

Hardware P-States (Default = "Disabled"):

This setting allows the user to select between OS and hardware-controlled P-states.

Values for this BIOS setting can be:

Native Mode: Allows the OS to choose a P-state.

Out of Band Mode: Allows the hardware to autonomously choose a P-state without OS guidance.

Native Mode with No Legacy Support: Support functions as Native Mode with no support for older hardware.

Disabled: Hardware chooses a P-state based on OS Request (Legacy P-States).

SNC (Sub NUMA) (Default = "Auto"):

SNC breaks up the last level cache (LLC) into disjoint clusters based on address range, with each cluster bound to a subset of the memory controllers in the system. SNC improves average latency to the LLC and memory. SNC is a replacement for the cluster on die (COD) feature found in previous processor families. For a multi-socketed system, all SNC clusters are mapped to unique NUMA domains.

Values for this BIOS setting can be:

Auto: Recommended setting.

Disabled: The LLC is treated as one cluster when this option is disabled.

Enable SNC2 (2-clusters): Utilizes LLC capacity more efficiently and reduces latency due to core/IMC proximity. This may provide performance improvement on NUMA-aware operating systems. When "Enable SNC2 (2-clusters)", the interleaving between the Integrated Memory Controllers (IMCs) is set to 1-way interleave autonomously.

Enable SNC4 (4-clusters): Four-way sub-NUMA clustering (SNC4) is an extension of SNC2. SNC4 is recommended for best latencies for core to cache/memory.It divides the XCC die into four NUMA domains/clusters and affinitizes the cores with CHAs and the memory channels within each cluster. It requires symmetric memory population and OS support is needed for NUMA-aware memory management.

XPT Prefetch (Default = "Auto"):

XPT prefetch is a mechanism that enables a read request that is being sent to the last level cache to speculatively issue a copy of that read to the memory controller prefetching.

Values for this BIOS setting can be:

Disabled: The CPU does not use the XPT Prefetch option.

Enabled: The CPU enables the XPT prefetcher option.

Auto: Recommended setting.

KTI Prefetch (Default = "Auto"):

KTI prefetch is a mechanism to get the memory read started early on a DDR bus.

Values for this BIOS setting can be:

Disabled: The processor does not preload any cache data.

Enabled: The KTI prefetcher preloads the L1 cache with the data it determines to be the most relevant.

Auto: Recommended setting.

Patrol Scrub (Default = "Enable at End of POST"):

Patrol Scrub is a memory RAS feature which runs a background memory scrub against all DIMMs. Can negatively impact performance. This option allows for correction of soft memory errors. Over the length of system runtime, the risk of producing multi-bit and uncorrected errors is reduced with this option.

Values for this BIOS setting can be:

Enable at End of POST: Correction of soft memory errors can occur during runtime.

Disabled: Soft memory error correction is turned off during runtime.

DCU Streamer Prefetcher (Default = "Enabled"):

DCU (Level 1 Data Cache) streamer prefetcher is an L1 data cache prefetcher. Lightly threaded applications and some benchmarks can benefit from having the DCU streamer prefetcher enabled. Values for this BIOS option can be: Enabled or Disabled.

Hardware Prefetcher (Default = "Enabled"):

When this option is enable, a dedicated hardware mechanism in the processor is supported to watch the stream of instructions or data being requested by the executing program, recognize the next few elements that the program might need based on this stream and prefetch into the processor's cache. The program with good instruction and data locality will benefit from this feature when this option is enable. Values for this BIOS option can be: Enabled or Disabled.

Page Policy (Default = "Closed"):

Adaptive Open Page Policy can improve performance for applications with a highly localized memory access pattern; Closed Page Policy can benifit applications that access memory more randomly. Values for this BIOS option can be: Closed or Adaptive.

LLC dead line alloc (Default = "Enabled"):

If the LLC dead line alloc feature is disabled, dead lines will always be dropped and will never fill into the LLC. This can help save space in the LLC and prevent the LLC from evicting useful data. However, if the Dead Line LLC Alloc feature is enabled, the LLC can opportunistically fill dead lines into the LLC if there is free space available. Values for this BIOS option can be: Enabled or Disabled.

Stale AtoS (Default = "Enabled"):

Stale AtoS is the transition of a directory line state. The inmemory directory has three states: I, A, and S. I (invalid) state means the data is clean and does not exist in any other socket's cache. The A (snoopAll) state means the data may exist in another socket in exclusive or modified state. S (Shared) state means the data is clean and may be shared across one or more socket's caches. When doing a read to memory, if the directory line is in the A state, we must snoop all the other sockets because another socket may have the line in modified state. If this is the case,the snoop will return the modified data. However, it may be the case that a line is read in A state and all the snoops come back a miss. This can happen if another socket read the line earlier and then silently dropped it from its cache without modifying it.

Values for this BIOS setting can be:

Auto: Automatically set the options according to the actual machine hardware configuration.

Disabled: Disabling this option allows the feature to process memory directory states as described above.

Enabled: In the situation where a line in A state returns only snoop misses, the line will transition to S state. That way, subsequent reads to the line will encounter it in S state and not have to snoop, saving latency and snoop bandwidth.

Energy Efficient Turbo (Default = "Enabled"):

This option controls whether the processor uses an energy efficiency based policy when engaging turbo range frequencies. This option is only applicable when Turbo Mode is enabled. Values for this BIOS setting can be: Enabled or Disabled.

CPU C6 report (Default = "Disabled"):

Controls the BIOS to report the CPU C6 State (ACPI C3) to the operating system. During the CPU C6 State, the power to all cache is turned off.

Values for this BIOS setting can be:

Enabled: Enable BIOS to report the CPU C6 State (ACPI C3) to the operating system.

Disabled: Disable BIOS to report the CPU C6 State (ACPI C3) to the operating system.

Auto: BIOS automatically decides to report the CPU C6 State (ACPI C3) to the operating system or not depends on Power Technology setting.

Power Performance Tuning (Default = "BIOS Controls EPB"):

Allows the OS or BIOS to control the Energy Performance Bias.

Values for this BIOS setting can be:

OS Controls EPB: Allows the OS to control the Energy Performance Bias.

BIOS Controls EPB: Allows the BIOS to control the Energy Performance Bias.

Intel VT for Directed I/O (VT-d) (Default = "Enabled"):

If enabled, a hypervisor or operating system supporting this option can use hardware capabilities provided by Intel VT for Directed I/O. You can leave this set to enabled even if you are not using a hypervisor or an operating system that uses this option. With default BIOS settings as shipped with most systems, the default state for this setting is Enabled. However, this setting can change it's default setting depending on the Workload Profile that is selected, or what Workload Profile is default for the a certain system. Values for this BIOS option can be: Enabled or Disabled.

SR-IOV Support (Default = "Enabled"):

In virtualization, single root input/output virtualization or SR-IOV is a specification that allows the isolation of the PCI Express resources for manageability and performance reasons. A single physical PCI Express can be shared on a virtual environment using the SR-IOV specification. If system has SR-IOV capable PCIe Devices, this option Enables or Disables Single Root IO Virtualization Support. Values for this BIOS option can be: Enabled or Disabled.

LLC Prefetch (Default = "Disabled"):

The LLC prefetcher is a new prefetcher added to the Intel Xeon Scalable family as a result of the non-inclusive cache architecture. The LLC prefetcher is an additional prefetch mechanism on top of the existing prefetchers that prefetch data into the core DCU and the MLC. Enabling LLC prefetch gives the core prefetcher the ability to prefetch data directly into the LLC without necessarily filling into the MLC. In some cases, setting this option to disabled can improve performance.

Values for this BIOS setting can be:

Enabled: Give the core prefetcher the ability to prefetch data directly to the LLC.

Disabled: Disable the LLC prefetcher. The other core prefetchers are unaffected.

Homeless Prefetch (Default = "Disabled"):

The Homeless Prefetch allows early fetch of the demand miss into the mid-level cache when we don't have enough resources to track this demand in the L1 cache. Special data cache unit (DCU) prefetch which has no allocated fill buffer (FB) entry - the data is not returned to the DCU (DCU is the L1 data cache). Allows for DCU prefetching when the FB is full.

Values for this BIOS setting can be:

Disabled: Disables Homeless Prefetch, preventing early fetch from taking place.

Enabled: Enables the Homeless Prefetch.

Auto: This setting maps to the hardware default setting: Disabled for XCC die, Enabled for MCC die.