IBM POWER Systems Overview
Table of Contents
- Abstract
- Evolution of IBM's POWER Architectures
- Hardware Overview
- System Components
- POWER4 Processor
- POWER5 Processor
- Nodes
- Frames
- Switch Network
- GPFS Parallel File System
- LC POWER Systems
- Software and Development Environment
- Parallel Operating Environment (POE) Overview
- Compilers
- MPI
- Running on LC's POWER Systems
- Large Pages
- SLURM
- Understanding Your System Configuration
- Setting POE Environment Variables
- Invoking the Executable
- Monitoring Job Status
- Interactive Job Specifics
- Batch Job Specifics
- Optimizing CPU Usage
- RDMA
- Other Considerations
- Simultaneous Multi-Threading (SMT)
- POE Co-Scheduler
- Debugging With TotalView
- References and More Information
- Exercise
This tutorial provides an overview of IBM POWER hardware and software
components with a practical emphasis on how to develop and run parallel
programs on IBM POWER systems. It does not attempt to cover the entire range
of IBM POWER products, however. Instead, it focuses on the types of IBM POWER
machines and their environment as implemented by Livermore Computing (LC).
As a matter of historical interest, the tutorial begins by providing
a succinct history of IBM's POWER architectures. Each of the major hardware
components of a parallel POWER system is then discussed
in detail, including processor architectures, frames, nodes and
the internal high-speed switch network. A description of each of LC's
IBM POWER systems follows.
The remainder, and majority, of the tutorial then progresses through "how to
use" an IBM POWER system for parallel programming, with an emphasis on IBM's
Parallel Operating Environment (POE) software. POE provides the facilities
for developing and running parallel Fortran, C/C++ programs on
parallel POWER systems.
POE components are explained and their usage is demonstrated. The tutorial
concludes with a brief discussion of LC specifics and mention of several
miscellaneous POE components/tools. A lab exercise follows the presentation.
Level/Prerequisites: Intended for those who are new to developing
parallel programs in the IBM POWER environment. A basic understanding of parallel
programming in C or Fortran is assumed. The material covered by EC3501 -
Introduction to Livermore Computing
Resources would also be useful.
Evolution of IBM's POWER Architectures
This section provides a brief history of the IBM POWER architecture.
POWER1:
- 1990: IBM announces the RISC System/6000 (RS/6000)
family of superscalar workstations and servers based
upon its new POWER architecture:
- RISC = Reduced Instruction Set Computer
- Superscalar = Multiple chip units (floating point unit, fixed
point unit, load/store unit, etc.) execute instructions simultaneously
with every clock cycle
- POWER = Performance Optimized With Enhanced RISC
- Initial configurations had a clock speed of 25 MHz, single floating point
and fixed point units, and a peak performance of 50 MFLOPS.
- Clusters are not new: networked configurations of POWER machines became
common as distributed memory parallel computing started to become popular.
SP1:
- IBM's first SP (Scalable POWERparallel) system was the SP1. It was
the logical evolution of clustered POWER1 computing. It was also
short-lived, serving as a foot-in-the-door to the rapidly growing
market of distributed computing. The SP2 (shortly after) was IBM's
real entry point into distributed computing.
- The key innovations of the SP1 included:
- Reduced footprint: all of those real-estate consuming stand-alone
POWER1 machines were put into a rack
- Reduced maintenance: new software and hardware made it possible
for a system administrator to manage many machines from a
single console
- High-performance interprocessor communications over an internal switch
network
- Parallel Environment software made it much easier to develop and run
distributed memory parallel programs
- The SP1 POWER1 processor had a 62.5 MHz clock with peak performance of
125 MFLOPS
POWER2 and SP2:
- 1993: Continued improvements in the POWER1 processor architecture led to
the POWER2 processor. Some of the POWER2 processor improvements included:
- Floating point and fixed point units increased to two each
- Increased data cache size
- Increased memory to cache bandwidth
- Clock speed of 66.5 MHz with peak performance of 254 MFLOPS
- Improved instruction set (quad-word load/store, zero-cycle branches,
hardware square root, etc)
- Lessons learned from the SP1 led to the SP2, which incorporated the
improved POWER2 processor.
- SP2 improvements were directed at greater scalability and included:
- Better system and system management software
- Improved Parallel Environment software for users
- Higher bandwidth internal switch network
P2SC:
- 1996: The P2SC (POWER2 Super Chip) debuted. The P2SC was an improved
POWER2 processor with a clock speed of 160 MHz. This effectively doubled
the performance of POWER2 systems.
- Otherwise, it was virtually identical to the POWER2 architecture that
it replaced.
PowerPC:
- Introduced in 1993 as the result of a partnership between IBM, Apple and
Motorola, the PowerPC processor included most of the POWER
instructions. New instructions and features were added to support
SMPs.
- The PowerPC line had several iterations, finally ending with the 604e.
Its primary advantages over the POWER2 line were:
- Multiple CPUs
- Faster clock speeds
- Introduction of an L2 cache
- Increased memory, disk, I/O slots, memory bandwidth....
- Not much was heard in SP circles about PowerPC until the 604e,
a 4-way SMP with a 332 MHz clock. The ASC Blue-Pacific
system, at one time ranked as the most powerful
computer on the planet, was based upon the 604e processor.
- The PowerPC architecture was IBM's entry into the SMP world and
eventually replaced all previous, uniprocessor architectures in the
SP evolution.
POWER3:
- 1998: The POWER3 SMP architecture is announced. POWER3 represents a
merging of the POWER2 uniprocessor architecture and the PowerPC
SMP architecture.
- Key improvements:
- 64-bit architecture
- Increased clock speeds
- Increased memory, cache, disk, I/O slots, memory bandwidth....
- Increased number of SMP processors
- Several varieties were produced with very different specs. At the
time they were made available, they were known as
Winterhawk-1, Winterhawk-2, Nighthawk-1 and Nighthawk-2 nodes.
- ASC White was based upon the POWER3 (Nighthawk-2)
processor. Like ASC Blue-Pacific, ASC White also ranked as the world's #1
computer at one time.
POWER4:
- In 2001 IBM introduced its 64-bit POWER4 architecture.
It is very different from its POWER3 predecessor.
- The basic building block is a two processor SMP chip with shared L2 cache.
Four chips are then joined to make an 8-way SMP "module".
Combining modules creates 16, 24 and 32-way SMP machines.
- Key improvements over POWER3 include:
- Increased CPUs - up to 32 per node
- Faster clock speeds - over 1 GHz. Later POWER4 models reached 1.9 GHz.
- Increased memory, L2 cache, disk, I/O slots, memory bandwidth....
- New L3 cache - logically shared between modules
POWER5:
- Introduced in 2004. IBM now offers a full line of POWER5 products that range
from desktops to supercomputers.
- Similar in design and looks to POWER4, but with some new
features/improvements:
- Increased CPUs - up to 64 per node
- Clock speeds up to 2.5 GHz, using a
90 nm Cu/SOI process manufacturing technology
- L3 Cache improvements - larger, faster, on-chip
- Improved chip-memory bandwidth - ~16 GB/sec. This is 4x faster than
POWER4.
- Additional rename registers for better floating point performance
- Simultaneous multithreading - two threads executing simultaneously
per processor. Takes advantage of unused execution unit cycles for
better performance.
- Dynamic power management - chips that aren't busy use less power and
generate less heat.
- Micro-partitioning: running up to 10 copies of the OS on a processor.
- ASC Purple is based upon POWER5 technology.
IBM, POWER and Linux:
- IBM's POWER systems run under Linux in addition to IBM's proprietary
AIX operating system.
- IBM also offers clustered Linux solutions that are based on Intel
processors, combined with hardware/software from other vendors
(such as Myrinet, Redhat) along with its own hardware/software.
- AIX is "Linux friendly". In fact, beginning with AIX version 5,
IBM now refers to AIX as AIX 5L, with the "L" alluding to Linux.
The "friendliness" means:
- That many solutions developed under Linux will run under AIX 5L by
simply recompiling the source code.
- IBM will provide (at no charge) the "AIX Toolbox for Linux
Applications" which is a collection of Open Source and GNU software
commonly found with Linux distributions.
To Sum It Up, an IBM POWER Timeline:
Adapted from
www.rootvg.net/column_risc.htm, which, although full
of broken links and typos, provides an interesting history of the developments that led to, and interleaved with, the POWER architecture.
But What About the Future of POWER?
- Stay tuned for continued evolution of the POWER line:
- May 2007: POWER6. Up to 4.7 GHz clock, 2-16 cpus/node.
- POWER6+
- POWER7? POWER8?
- One thing is clear: unlike some other vendors
(HP/Compaq and the Alpha chip), IBM definitely plans to continue to
develop its own proprietary chips and not completely give in to Intel
(even though it offers systems with Intel chips).
- See IBM's website and Google for more information.
System Components
- There are five basic physical components in a parallel POWER system,
described briefly below and in more detail later:
- Nodes with POWER processors
- Frames
- Switch Network
- Parallel File Systems
- Hardware Management Console
- Nodes: Comprise the heart of a system. All nodes are SMPs,
containing multiple POWER processors. Nodes are rack mounted
in a frame and directly connected to the switch network. The
majority of nodes are dedicated as compute nodes to run user jobs.
Other nodes serve as file servers and login machines.
- Frames: The containment units that physically house nodes, switch
hardware, and other control/supporting hardware.
- Switch Network: The internal network fabric that enables high-speed
communication between nodes. Also called the High Performance Switch (HPS).
- Parallel File Systems: Each LC POWER system mounts one or more GPFS
parallel file system(s).
- Hardware Management Console: A stand-alone workstation that possesses
the hardware and software required to monitor and control the frames, nodes
and switches of an entire system by one person from a single point. With
larger systems, the Hardware Management Console function is actually
distributed over a cluster of machines with a single management console.
- Additionally, LC systems are connected to external networks and NFS file
systems.
POWER4 Processor
Architecture:
- Dual-core chip (2 cpus per chip)
- 64-bit architecture and address space
- Clock speeds range from 1 - 1.9 GHz
- Fast I/O interface (GX bus) onto chip - ~1.7 TB/sec
- Superscalar CPU - 8 execution units that can operate simultaneously:
- 2 floating point units
- 2 fixed point units
- 2 load/store units
- Branch resolution unit
- Condition Register Unit
- Memory/Cache:
- L1 Data Cache: 32 KB, 128 byte line, 2-way associative
- L1 Instruction Cache: 64 KB, 128 byte line, direct mapped
- L2 Cache: 1.44 MB shared per chip, 128 byte line, 8-way associative
- L3 Cache: 32 MB per chip, 512 byte line, 8-way associative.
The total 128 MB of L3 cache is logically shared by all chips on a
module.
- Memory Bandwidth: 4 GB/sec/chip
- Chip-to-chip Bandwidth: 35 GB/sec
- Maximum Memory: depends on model - for example:
- p655 8-way SMP: 64 GB
- p690 32-way SMP: 1 TB
- Scalable up to 32 CPUs per node
POWER5 Processor
Architecture:
- Similar to its POWER4 predecessor in many ways
- Dual-core chip (2 cpus per chip)
- 64-bit architecture
- Superscalar, out of order execution with multiple functional units -
including two fixed point and two floating point units
- Clock speeds of 1.65 - 2+ GHz
- Memory/Cache:
- On-chip memory controller and L3 cache directory
- 1 GB - 256 GB memory
- ~16 GB/sec memory-CPU bandwidth (versus 4 GB/sec for POWER4)
- L1 data cache: 32KB per processor, 128 byte line, 4-way associative
- L1 instruction cache: 64KB per processor, 128 byte line,
2-way associative
- L2 cache: 1.9MB per chip (shared between dual processors), 128 byte
line, 10-way associative
- L3 cache: 36 MB per chip (shared between dual processors), 256 byte
line, 12-way associative. Improvements over POWER4 include:
- 4 MB larger
- Physically resides on the chip module
- Now an "extension" of the L2 cache - directly connected,
putting it physically closer to the CPUs.
- Access to L3 cache is now ~80 clock cycles versus 118 cycles for
POWER4. 30.4 GB/sec bandwidth.
- The diagram below highlights the POWER4 / POWER5 L3 cache and memory
controller configuration differences.
- POWER5 systems can scale up to
64-way SMPs, versus the maximum of 32-way for POWER4 systems.
- POWER5 introduces simultaneous multithreading, where two threads can
execute at the same time on the same processor by taking advantage of
unused execution unit cycles. Note that this same idea is employed in
the Intel IA32 systems and is called "hyperthreading".
- Increase in floating point registers to 120 from 72 in POWER4.
- Dynamic power management - chips that aren't busy use less power and
generate less heat.
- Micro-partitioning: permits running up to 10 copies of the OS per
processor.
Modules:
- Dual-core POWER5 chips are combined with other components to form modules.
IBM produces the following types of POWER5 modules:
- Dual-chip Module (DCM): includes one dual-core POWER5 processor chip
and one L3 cache chip (2-way SMP).
- Quad-core Module (QCM): includes two dual-core POWER5 processor chips
and two L3 cache chips (4-way SMP).
- Multi-chip Module (MCM): includes four dual-core POWER5 processor chips
and four L3 cache chips (8-way SMP).
- Several diagrams and pictures of POWER5 modules are shown below.
- Modules can be combined to form larger SMPs. For example, a 16-way SMP
can be constructed from two MCMs, and is called a "book" building block.
Four books can be used to make a 64-way SMP. Diagrams demonstrating both
of these are shown below.
ASC Purple Chips and Modules:
- ASC Purple compute nodes are p5 575 nodes, which differ from standard p5
nodes in having only one active core in a dual-processor chip.
- With only one active cpu in a chip, the entire L2 and L3 cache is dedicated.
This design benefits scientific HPC applications by providing better
cpu-memory bandwidth.
- ASC Purple nodes are built from Dual-chip Modules (DCMs). Each node has a
total of eight DCMs. A photo showing these appears in the next section
below.
Nodes
Node Characteristics:
- A node, in terms of a POWER system, is defined as a single,
stand-alone, multi-processor machine, self-contained in a "box" which
is mounted in a frame. Nodes follow the "shared nothing"
model, which allows them to be "hot swappable".
- There is considerable variation in the types of POWER nodes IBM offers.
However, some common characteristics include:
- Uses one of the POWER family of processors
- SMP design - multiple processors (number of CPUs varies widely)
- Independent internal disk drives
- Network adapters, including adapter(s) for the internal switch network
- Memory resources, including memory cards and caches
- Power and cooling equipment
- Expansion slots for additional network and I/O devices
- Some types of nodes can be logically partitioned to look like multiple
nodes. For example, a 32-way POWER4 node could be made to look and
operate like 4 independent 8-way nodes.
- Typically, one copy of the operating system runs per node. However,
IBM does provide the means to run different operating systems on the
same node through partitioning.
Node Types:
- IBM packages its POWER nodes in a variety of ways, ranging from desktop
machines to supercomputers. Each variety is given an "official" IBM
model number/name. This is true for all POWER architectures.
- Within any given POWER architecture, models differ radically in how they
are configured: clock speed, physical size, number of CPUs, number of
I/O expansion slots, memory, price, etc.
- Most POWER nodes also have a "nickname" used in customer circles. The
nickname is often the "code name" born during the product's
"confidential" development cycle, which just happens to stick around
forever after.
- Some examples for nodes used by LC systems (past, present, future):
LC Systems | Node Model/Description | Node Nickname
BLUE, SKY | 604e SMP 332 MHz, 4 CPUs | Silver
WHITE, FROST | POWER3 SMP 375 MHz High, 16 CPUs | Nighthawk2, NH2
UM, UV | POWER4 pSeries p655, 1.5 GHz, 8 CPUs | ???
BERG, NEWBERG | POWER4 pSeries p690, 1.3 GHz, 32 CPUs | Regatta
ASC PURPLE | POWER5 p5 575, 1.9 GHz, 8 CPUs | Squadron
p5 575 Node:
- ASC Purple systems use p5 575 POWER5 nodes.
- "Innovative" 2U packaging - a novel design to minimize space requirements
and achieve "ultra dense" CPU distribution. Up to 192 CPUs per frame
(12 nodes * 16 CPUs/node).
- Eight Dual-chip Modules (DCMs) with associated memory
- Comprised of 4 "field swappable" component modules:
- I/O subsystem
- DC power converter/lid
- processor and memory planar
- cooling system
- I/O: standard configuration of two hot-swappable SCSI disk drives.
Expansion via an I/O drawer to 16 additional SCSI bays with a maximum
of 1.17 TB of disk storage.
- Adapters: standard configuration of four 10/100/1000 Mb/s ethernet
ports and two HMC ports. Expansion up to 4 internal PCI-X slots and
20 external PCI-X slots via the I/O drawer.
- Support for the High Performance Switch and InfiniBand
Frames
Frame Characteristics:
- Typical frames contain nodes, switch hardware, and frame hardware.
- Frame hardware includes:
- Redundant power supply units
- Air cooling hardware
- Control and diagnostic components
- Networking
- Frames are also used to house intermediate switch hardware
(covered later) and additional I/O, media and networking hardware.
- Frames can be used to "mix and match"
POWER components in a wide variety of ways.
- Frames vary in size and appearance, depending upon the type of
POWER system and what is required to support it.
- Several example POWER4 frame configurations (doors removed)
are shown below.
- Power Supply
- 32-way POWER4 node
- 8-way POWER4 node
- Switch
- Intermediate switch
- I/O drawer
- Media Drawer
ASC Purple Frames:
- At LC, the early delivery ASC Purple machines are POWER4 nodes. The
final delivery systems contain POWER5 nodes.
- The frames that house ASC Purple compute nodes are shown below. They are
approximately 31"w x 80"h x 70"d (depth can vary).
- Also, as described previously, some frames are used to house
required intermediate switch hardware.
Frame photos: UM / UV and PURPLE / UP.
Switch Network
Quick Intro:
- The switch network provides the internal, high
performance message passing fabric that connects every node to every
other node in a parallel POWER system.
- It has evolved along with the POWER architecture, and has been
called by various names along the way:
- High Performance Switch (HiPS)
- SP Switch
- SP Switch2
- Colony
- Federation
- HPS
- Currently, IBM is referring to its latest switch network as the
"High Performance Switch" or just "HPS", which, interestingly enough,
is similar to what the very first switch network was called in the
days of the SP1.
- As would be expected, there are considerable differences between the
various switches, especially with regard to performance and
the types of nodes they are compatible with.
- For the interested, a history (and much, much more)
of the switch is presented in the IBM
Redbook "An Introduction to the New IBM eServer pSeries High
Performance Switch". Currently, this publication is available in PDF
format at:
www.redbooks.ibm.com/redbooks/pdfs/sg246978.pdf.
- The discussion here is limited (more or less) to a user's view of the
switch network and is highly simplified. In reality, the switch network
is very complicated. The same IBM redbook mentioned above covers in much
greater detail the "real" switch network, for the curious.
Topology:
- Bidirectional: Any-to-any internode connection allows all
processors to send messages simultaneously. Each point-to-point
connection between nodes is comprised of two channels
(full duplex) that can carry data in opposite directions
simultaneously.
- Multistage: On larger systems, additional
intermediate switches are required to scale the system upwards.
For example, with ASC Purple, three levels of switches are
required in order for every node to communicate with every other
node.
Switch Network Characteristics:
- Packet-switched network (versus circuit-switched). Messages are broken
into discrete packets and sent to their final destination, possibly
following different routes and arriving out of sequence. All of this is
invisible to the user.
- Support for multi-user environment - multiple jobs may run
simultaneously over the switch (one user does not monopolize switch)
- Path redundancy - multiple routings between any two nodes. Permits
routes to be generated even when there are faulty components in
the system.
- Built-in error detection
- Hardware redundancy for reliability - the switch board (discussed below)
actually uses twice as many hardware components as it minimally requires,
for RAS purposes.
- Architected for expansion to 1000s of ports. ASC Purple is the
first real system to prove this.
- Hardware components: in reality, the switch network is a very
sophisticated system with many complex components. From a user's
perspective however, there are only a few hardware components worth
mentioning:
- Switch drawers: house the switch boards and other support hardware.
Mounted in a frame.
- Switch boards: the heart of the switch network
- Switch Network Interface (SNI): an adapter that plugs into a node
- Cables: to connect nodes to switch boards and
switch boards to other switch boards
Switch Drawer:
- The switch drawer fits into a slot in a frame. For frames with
nodes, this is usually the bottom slot of a frame. For systems requiring
intermediate switches, there are frames dedicated to housing only switch
drawers.
- The switch drawer contains most of the components that comprise the
switch network, including but not limited to:
- Switchboard with switch chips
- Power supply
- Fans for cooling
- Switch port connector cards (riser cards)
Switch Board:
- The switch board is really the heart of the switch network.
The main features of the switch board are listed below.
- There are 8 logical Switch Chips, each of which is connected to 4 other
Switch Chips to form an internal 4x4 crossbar switch.
- A total of 32 ports controlled by Link Driver Chips on riser cards, are
used to connect to nodes and/or other switch boards.
- Depending upon how the Switch Board is used, it will be called a
Node Switch Board (NSB) or Intermediate Switch Board (ISB):
- NSB: 16 ports are configured for node connections. The other 16 ports
are configured for connections to switch boards in other frames.
- ISB: all ports are used to cascade to other switch boards.
- Practically speaking, the distinction between an NSB and ISB is
only one of topology. An ISB is just located higher up in the
network hierarchy.
- Switch-node connections are by copper cable. Switch-switch connections can
be either copper or optical fiber cable.
- Minimal hardware latency: approximately 59 nanoseconds to cross each Switch
Chip.
- Two simple configurations (96 node and 128 node systems)
using both NSB and ISB switch boards
are shown below. The number "4" refers to the number of ports connecting
each ISB to each NSB. Nodes are not shown, but each NSB may connect to 16
nodes.
Switch Network Interface (SNI):
- The Switch Network Interface (SNI) is an adapter card that plugs into a
node's GX bus slot, allowing it to use the switch to communicate with other
nodes in the system. Every node that is connected to the switch must have
at least one switch adapter.
- There are different types of SNI cards. For example, p5 575 nodes use a
single, 2-link adapter card.
- One of the key features of the SNI is providing the ability for a process
to communicate via Direct Memory Access (DMA). Using DMA for communications
eliminates additional copies of data to system buffers; a process can
directly read/write to another process's memory.
- The node's adapter is directly cabled via a rather bulky copper
cable into a corresponding port on the switch board.
- There is much more to say about SNIs, but we'll leave that to the curious
to pursue in the previously mentioned (and other) IBM documentation.
Switch Communication Protocols:
- Applications can use one of two communications protocols, either
US or IP
- US - User Space Protocol. Preferred protocol due to performance.
Default at LC.
- IP - Internet Protocol. Slower but more "flexible". Used for
communications by jobs that span disjoint systems. Also used for
small systems that don't have a switch.
- Usage details are covered later in this tutorial.
Switch Application Performance:
- An application's communication performance over the switch is
dependent upon several factors:
- Node type
- Switch and switch adapter type
- Communications protocol used
- On-node vs. off-node proximity
- Application specific characteristics
- Network tuning parameters
- Competing network traffic
- Theoretical peak bi-directional performance: 4 GB/sec for POWER4/5
HPS (Federation) Switch
- Hardware latency: in practical terms, the switch hardware latency is
almost negligible when compared to the software latency involved in
sending data. Between any two nodes, hardware latency is in the range
of hundreds of nanoseconds (59 nanoseconds per switch chip crossed).
- Software latency comprises most of the delay in sending a message
between processes. To send MPI messages through the software stack over the switch incurs a latency of ~5 microseconds for the HPS switch.
- The table below demonstrates performance metrics for a 2 task,
MPI point-to-point (blocking) message passing program run on various
LC IBM systems.
- Note that these figures aren't even close to the theoretical peak figures.
Adding more MPI tasks would take full advantage of the switch/adapter
bandwidth and come closer to the theoretical peak.
Switch Type / Node Type | Protocol | Latency (usec) | Pt to Pt Bandwidth (MB/sec)
Colony / POWER3 375 MHz | IP | 105 | 77
Colony / POWER3 375 MHz | US | 20 | 390
HPS (Federation) / POWER4 1.5 GHz | IP | 32 | 318
HPS (Federation) / POWER4 1.5 GHz | US | 6 | 3100
HPS (Federation) / POWER5 1.9 GHz | IP | n/a | n/a
HPS (Federation) / POWER5 1.9 GHz | US | 5 | 3100
GPFS Parallel File System
Overview:
- GPFS is IBM's General Parallel File System product.
- All of LC's parallel production POWER systems have at least one
GPFS file system.
- "Looks and feels" like any other UNIX file system from a user's
perspective.
- Architecture:
- Most nodes in a system are application/compute nodes
where programs actually run. A subset of the system's nodes are
dedicated to serve as storage nodes for conducting
I/O activities between the compute nodes and physical disk. Storage
nodes are the interface to disk resources.
- For performance reasons, data transfer between the application nodes
and storage nodes typically occurs over the internal switch network.
- Individual files are stored as a series of "blocks" that are striped
across the disks of different storage nodes. This permits concurrent
access by a multi-task application when tasks read/write to different
segments of a common file.
- Internally, GPFS's file striping is set to a specific block
size that is configurable. At LC, the most efficient use of GPFS is
with large files. The use of many small files in a GPFS file system
is not advised if performance is important.
- IBM's implementation of MPI-IO routines depends upon an underlying GPFS
system to accomplish parallel I/O within MPI programs (a minimal MPI-IO
sketch appears after this list).
- GPFS Parallelism:
- Simultaneous reads/writes to non-overlapping regions of the same file
by multiple tasks
- Concurrent reads and writes to different files by multiple tasks
- I/O will be serial if tasks attempt to use the same stripe of a file
simultaneously.
- Additional information: Search
ibm.com for "GPFS".
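To make the parallel access pattern concrete, here is a minimal MPI-IO sketch in C. It is only an illustration: the file name (following the /p/gscratch naming scheme described in the next section) and the block size are hypothetical examples. Each task writes its own non-overlapping block of a single shared file, which is the kind of concurrent access that GPFS striping is designed to support.
/* Minimal MPI-IO sketch: each task writes a non-overlapping block of one file
   in a GPFS file system.  File name and block size are hypothetical examples. */
#include <mpi.h>

#define BLOCK_INTS 1024                 /* integers written per task (example value) */

int main(int argc, char *argv[])
{
    int i, rank, buf[BLOCK_INTS];
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < BLOCK_INTS; i++)
        buf[i] = rank;                  /* fill the buffer with task-specific data */

    /* Each task computes its own file offset, so writes never overlap. */
    offset = (MPI_Offset)rank * BLOCK_INTS * sizeof(int);

    MPI_File_open(MPI_COMM_WORLD, "/p/gscratch1/username/mpiio_example",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at(fh, offset, buf, BLOCK_INTS, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
Because the tasks write to non-overlapping regions, GPFS can service the writes concurrently across its storage nodes rather than serializing them.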
LC Configuration Details:
- Naming scheme: /p/gscratch#/username where:
- # = one digit number on the SCF; one character alpha on the OCF
(ex: gscratch1, gscratcha).
- username = your user name on that machine. Established automatically.
- Symbolic links allow for standardized naming in scripts, etc.:
/p/glocal1, /p/glocal2 link to local GPFS file systems.
- Configurations:
- Sizes of GPFS file systems vary between systems, and they change
from time to time. du -k will tell you what the current configuration is.
- A machine may have more than one GPFS file system.
- GPFS file systems are not global; they are local to a specific system.
- At LC, GPFS file systems are configured optimally for use with large
data files.
- Temporary location:
- No backup
- Purge policies are in effect, since a full file system reduces
performance.
- Not reliable for long term storage
LC POWER Systems
General Configuration:
- All of LC’s production, parallel POWER systems follow the general
configuration schematic shown below.
- Most nodes are designated as compute nodes
- Some nodes are dedicated as file servers to the GPFS parallel file
system(s)
- A few nodes serve exclusively as login machines
- All nodes are internally connected by a high speed switch
- Access to HPSS storage and external systems is over a GigE network
SCF POWER Systems
ASC PURPLE:
Purpose: ASC Capability Resource
Nodes: 1532
CPUs/Node: 8
CPU Type: POWER5 p5 575 @1.9 GHz
Peak Performance: 93 TFLOPS
Memory/Node: 32 GB
Memory Total: 49 TB
Cache: 32 KB L1 data; 64 KB L1 instruction; 1.9 MB L2; 36 MB L3
Interconnect: IBM High Performance Switch (HPS)
Parallel File System: GPFS
OS: AIX
Notes: Photo below. Login nodes differ from compute nodes: two 32-way POWER5 machines partitioned to look like four 16-way machines, 64 GB memory. Additional info: asc.llnl.gov/computing_resources/purple
UM and UV:
Purpose: ASC Capacity Resources. Two nearly identical resources; the information below is for each system.
Nodes: 128
CPUs/Node: 8
CPU Type: POWER4 p655 @1.5 GHz
Peak Performance: 6.1 TFLOPS
Memory/Node: 16 GB
Memory Total: 2 TB
Cache: 32 KB L1 data; 64 KB L1 instruction; 1.44 MB L2; 32 MB L3
Interconnect: IBM High Performance Switch (HPS)
Parallel File System: GPFS
OS: AIX
Notes: Part of ASC Purple "early delivery". Photo below.
TEMPEST:
Purpose: ASC single-node or serial computing
Nodes: 12
CPUs/Node: 9 nodes with 4 CPUs/node; 3 nodes with 16 CPUs/node
CPU Type: 4-way nodes: POWER5 p5 550 @1.65 GHz; 16-way nodes: POWER5 p5 570 @1.65 GHz
Peak Performance: 554 GFLOPS
Memory/Node: 4-way nodes: 32 GB; 16-way nodes: 64 GB
Memory Total: 480 GB
Cache: 32 KB L1 data; 64 KB L1 instruction; 1.9 MB L2; 36 MB L3
Interconnect: None
Parallel File System: None
OS: AIX
Notes: Only 4-way or 16-way parallel jobs (depending upon the node) because there is no switch interconnect (single-node parallelism). See /usr/local/docs/tempest.basics for more information.
OCF POWER Systems
UP:
Purpose: ASC Capacity Resource
Nodes: 108
CPUs/Node: 8
CPU Type: POWER5 p5 575 @1.9 GHz
Peak Performance: 6.6 TFLOPS
Memory/Node: 32 GB
Memory Total: 3 TB
Cache: 32 KB L1 data; 64 KB L1 instruction; 1.9 MB L2; 36 MB L3
Interconnect: IBM High Performance Switch (HPS)
Parallel File System: GPFS
OS: AIX
Notes: UP stands for "Unclassified Purple"
BERG, NEWBERG:
Purpose: Non-production, testing, prototyping
Nodes: 2
CPUs/Node: 32
CPU Type: POWER4 p690 @1.3 GHz
Peak Performance: 166 GFLOPS
Memory/Node: 32 GB
Memory Total: 64 GB
Cache: 32 KB L1 data; 64 KB L1 instruction; 1.44 MB L2; 32 MB L3
Interconnect: IBM High Performance Switch (HPS)
Parallel File System: None
OS: AIX
Notes: BERG is a single 32-processor machine configured into 3 logical nodes; NEWBERG is a single 32-processor machine configured into 4 logical nodes. Photo.
Software and Development Environment
The software and development environment for the IBM SPs at LC
is similar to what is described in the Introduction to LC Resources tutorial. Items specific
to the IBM SPs are discussed below.
AIX Operating System:
- AIX is IBM's proprietary version of UNIX.
- In the past, AIX was the only choice of operating system for
POWER machines. Now, Linux can be used on POWER systems also.
- As mentioned earlier, AIX is "Linux friendly", which means:
- That many solutions developed under Linux will run under AIX 5L by
simply recompiling the source code.
- IBM provides the "AIX Toolbox for Linux
Applications" which is a collection of Open Source and GNU software
commonly found with Linux distributions.
- LC currently uses only AIX for all of its IBM systems. Every SMP node
runs under a single copy of the AIX operating system, which is
threaded for all CPUs.
- Beginning with POWER5 and AIX 5.3, simultaneous multithreading is supported.
Micro-partitioning (multiple copies of an OS on a single processor)
is also supported.
- AIX product information and complete documentation are
available from IBM on the web at www-03.ibm.com/servers/aix
Parallel Environment:
- IBM's Parallel Environment is a collection of software tools and
libraries designed for developing, executing, debugging and profiling
parallel C, C++ and Fortran applications on POWER systems running AIX.
- The Parallel Environment consists of:
- Parallel Operating Environment (POE) software for submitting and
managing jobs
- IBM's MPI library
- A parallel debugger (pdbx) for debugging parallel programs
- Parallel utilities for simplified file manipulation
- PE Benchmarker performance analysis toolset
- Parallel Environment documentation can be found in
IBM's Parallel Environment manuals.
Parallel Environment topics are also discussed in the
POE section below.
Compilers:
- IBM - C/C++ and Fortran compilers. Covered
later.
- gcc, g++, g77 - GNU C, C++ and Fortran compilers
- Guide - KAI OpenMP C, C++ and Fortran compilers. Available but no
longer officially supported by LC.
Math Libraries Specific to IBM SPs:
- ESSL - IBM's Engineering Scientific Subroutine Library.
- PESSL - IBM's Parallel Engineering Scientific Subroutine Library.
A subset of ESSL that has been parallelized. Documentation is located with
ESSL documentation mentioned above.
- MASS - Math Acceleration Subsystem. High performance versions of
most math intrinsic functions. Scalar versions and vector versions. See
/usr/local/lpp/mass or search IBM's web pages for more information.
Batch Systems:
- LCRM - LC's legacy batch system. Covered in the
LCRM Tutorial.
Currently being migrated to Moab.
- Moab - New Tri-lab batch system. Covered in the
Moab Tutorial.
- SLURM - LC's native resource manager system, which resides
"under" LCRM/Moab. Stands for
"Simple Linux Utility for Resource Management".
More information available at: computing.llnl.gov/linux/slurm.
User Filesystems:
- As usual - home directories, /nfs/tmp, /var/tmp, /tmp,
/usr/gapps, archival storage. For more information see the
Introduction to LC Resources tutorial.
- General Parallel File System (GPFS) - IBM's parallel filesystem
available on LC's parallel production IBM systems. GPFS is discussed in the
Parallel File Systems section of the Introduction
to Livermore Computing Resources tutorial.
Software Tools:
- In addition to compilers, LC's Development Environment Group (DEG)
supports a wide variety of software tools including:
- Debuggers
- Memory tools
- Profilers
- Tracing and instrumentation tools
- Correctness tools
- Performance analysis tools
- Various utilities
- Most of these tools are simply listed below.
For detailed information see
computing.llnl.gov/code/content/software_tools.php.
- Debugging/Memory Tools:
- TotalView
- dbx
- pdbx
- gdb
- decor
- Tracing, Profiling, Performance Analysis and Other Tools:
- prof
- gprof
- PE Benchmarker
- IBM HPC Toolkit
- TAU
- VampirGuideView (VGV)
- Paraver
- mpiP
- Xprofiler
- mpi_trace
- PAPI
- PMAPI
- Jumpshot
- Dimemas
- Assure
- Umpire
- DPCL
Video and Graphics Services:
- LC's Information Management and Graphics Group (IMGG) provides a range of
visualization hardware, software and services including:
- Parallel visualization clusters
- PowerWalls
- Video production
- Consulting for scientific visualization issues
- Installation and support of visualization and graphics software
- Support for LLNL Access Grid nodes
- Contacts and more information:
Parallel Operating Environment (POE) Overview
Most of what you'll do on any parallel IBM AIX POWER system will be under IBM's Parallel Operating Environment (POE) software. This section provides a quick
overview. Other sections provide the details for actually using POE.
PE vs POE:
- IBM's Parallel Environment (PE) software product encompasses a
collection of software tools designed to provide a complete
environment for developing, executing, debugging and profiling
parallel C, C++ and Fortran programs.
- As previously mentioned, PE's primary components include:
- Parallel compiler scripts
- Facilities to manage your parallel execution environment (environment
variables and command line flags)
- Message Passing Interface (MPI) library
- Low-level API (LAPI) communication library
- Parallel file management utilities
- Authentication utilities
- pdbx parallel debugger
- PE Benchmarker performance analysis toolset
- Technically, the Parallel Operating Environment (POE) is a subset
of PE that actually contains the majority of the PE product.
- However, to the user this distinction is not really necessary
and probably serves more to confuse than enlighten. Consequently,
this tutorial will consider PE and POE synonymous.
Types of Parallelism Supported:
- POE is primarily designed for process level (MPI) parallelism, but fully
supports threaded and hybrid (MPI + threads) parallel programs also.
- Process level MPI parallelism is directly managed by POE from
compilation through execution.
- Thread level parallelism is "handed off" to the compiler, threads
library and OS.
- For hybrid programs, POE manages the MPI tasks, and
lets the compiler, threads library and OS manage the threads.
- POE fully supports the Single Program Multiple Data (SPMD) and
Multiple Program Multiple Data (MPMD) models for parallel programming.
- For more information about parallel programming, MPI, OpenMP and
POSIX threads, see the tutorials listed on the
LC Training web page.
Interactive and Batch:
- POE can be used both interactively and within a batch scheduler
system to compile, load and run parallel jobs.
- There are many similarities between interactive and batch POE usage.
There are also important differences. These will
be pointed out later as appropriate.
Typical Usage Progression:
- The typical progression of steps for POE usage is outlined below,
and discussed in more detail in following sections.
- Understand your system's configuration (always changing?)
- Establish POE authorization on all nodes that you will use (one-time
event for some. Not even required at LC.)
- Compile and link the program using one of the POE parallel
compiler scripts. Best to do this on the actual platform you want
to run on.
- Set up your execution environment by setting the necessary POE
environment variables. Of course, depending upon your application,
and whether you are running interactively or batch, you may need
to do a lot more than this. But we're only talking about POE here...
- Invoke the executable - with or without POE options
A Few Miscellaneous Words About POE:
- POE is unique to the IBM AIX environment. It runs only on the IBM POWER
platforms under AIX.
- Much of what POE does is designed to be transparent to the user.
Some of these tasks include:
- Linking to the necessary parallel libraries during compilation (via
parallel compiler scripts)
- Finding and acquiring requested machine resources for your parallel job
- Loading and starting parallel tasks
- Handling all stdin, stderr and stdout for each parallel task
- Signal handling for parallel jobs
- Providing parallel communications support
- Managing the use of processor and network resources
- Retrieving system and job status information
- Error detection and reporting
- Providing support for run-time profiling and analysis tools
- POE can also be used to run serial jobs and shell commands concurrently
across a network of machines. For example, issuing the command
poe hostname
will cause each machine in your partition
to tell you its name. Run just about any other shell command or
serial job under poe and it will work the same way.
- POE limits (number of tasks, message sizes, etc.) can be found in the
Parallel Environment MPI Programming Guide
manual (see the chapter on Limits).
Some POE Terminology:
Before learning how to use POE, understanding some basic definitions may
be useful. Note that some of these terms are common to parallel programming
in general while others are unique or tailored to POE.
- Node
- Within POE, a node usually refers to a single machine, running
its own copy of the AIX operating system. A node has a unique network
name/address. All current model IBM nodes are SMPs (next).
- SMP
- Symmetric Multi-Processor. A computer (single machine/node) with
multiple CPUs that share a common memory. Different types of SMP nodes may
vary in the number of CPUs they possess and the manner in which the
shared memory is accessed.
- Process / Task
- Under POE, an executable (a.out) that may be scheduled to run
by AIX on any available physical processor as a UNIX process is
considered a task. Task and process are synonymous. For MPI applications,
each MPI process is referred to as a "task" with a unique identifier
starting at zero up to the number of processes minus one.
- Job
- A job refers to the entire parallel application and typically consists
of multiple processes/tasks.
- Interprocess
- Between different processes/tasks. For example, interprocess
communications can refer to the exchange of data between different
MPI tasks executing on different physical processors. The processors
can be on the same node (SMP) or on different nodes, but with POE, are
always part of the same job.
- Pool
- A pool is an arbitrary collection of nodes assigned by system managers.
Pools are typically used to
separate nodes into disjoint groups, each of which is used for specific
purposes. For example, on a given system, some nodes may be designated
as "login" nodes, while others are reserved for "batch" or "testing"
use only.
- Partition
- The group of nodes used to run a parallel job is
called a partition. Across a system, there is one discrete
partition for each user's job. Typically, the nodes in a
partition are used exclusively by a single user for the
duration of a job. (Technically though, POE allows multiple users to
share a partition, but in practice, this is not common, for obvious
reasons.) After a job completes, the nodes may be
allocated for other users' partitions.
- Partition Manager
- The Partition Manager, also known as the poe daemon,
is a process that is automatically started for each parallel job.
The Partition Manager is responsible for overseeing the parallel
execution of the job by communicating with daemon processes on each node
in the partition and with the system scheduler. It operates transparently
to the user and terminates after the job completes.
- Home Node / Remote Node
- The home node is the node where the parallel job is initiated
and where the Partition Manager process lives.
The home node may or may not be considered part of your partition
depending upon how the system is configured, interactive vs. batch, etc.
A Remote Node is any other node in your partition.
Compilers
Compilers and Compiler Scripts:
- In IBM's Parallel Environment, there are a number of compiler invocation
commands, depending upon what you want to do. However, underlying all
of these commands are the same AIX C/C++ and Fortran compilers.
- The POE parallel compiler commands are actually scripts that
automatically link to the necessary Parallel Environment libraries,
include files, etc. and then call the appropriate native AIX compiler.
- For the most part, the native IBM compilers and their parallel
compiler scripts support a common command line syntax.
- See the References and More Information section
for links to IBM compiler documentation. Versions change frequently and
downloading the relevant documentation from IBM is probably the best
source of information for the version of compiler you are using.
Compiler Syntax:
[compiler] [options] [source_files]
For example:
mpxlf -g -O3 -qlistopt -o myprog myprog.f
Common Compiler Invocation Commands:
- Note that all of the IBM compiler invocation commands are not shown.
Other compiler commands are available to select IBM compiler extensions
and features. Consult the appropriate IBM compiler man page and compiler
manuals for details. Man pages are linked below for convenience.
- Also note that, since POE version 4, all commands actually use the
_r (thread-safe) version of the command. In other words,
even though you compile with xlc, you will really get the
xlc_r thread-safe version of the compiler.
IBM Compiler Invocation Commands
- Serial:
- xlc - ANSI C compiler
- cc - Extended C compiler (not strict ANSI)
- xlC - C++ compiler
- xlf / f77 - Extended Fortran, Fortran 77 compatible (f77 is an alias for xlf)
- xlf90 / f90 - Full Fortran 90 with IBM extensions (f90 is an alias for xlf90)
- xlf95 / f95 - Full Fortran 95 with IBM extensions (f95 is an alias for xlf95)
- Threads (OpenMP, Pthreads, IBM threads):
- xlc_r / cc_r - xlc / cc for use with threaded programs
- xlC_r - xlC for use with threaded programs
- xlf_r - xlf for use with threaded programs
- xlf90_r - xlf90 for use with threaded programs
- xlf95_r - xlf95 for use with threaded programs
- MPI:
- mpxlc / mpcc - Parallel xlc / cc compiler scripts
- mpCC - Parallel xlC compiler script
- mpxlf - Parallel xlf compiler script
- mpxlf90 - Parallel xlf90 compiler script
- mpxlf95 - Parallel xlf95 compiler script
- MPI with Threads (OpenMP, Pthreads, IBM threads):
- mpxlc_r / mpcc_r - Parallel xlc / cc compiler scripts for hybrid MPI/threads programs
- mpCC_r - Parallel xlC compiler script for hybrid MPI/threads programs
- mpxlf_r - Parallel xlf compiler script for hybrid MPI/threads programs
- mpxlf90_r - Parallel xlf90 compiler script for hybrid MPI/threads programs
- mpxlf95_r - Parallel xlf95 compiler script for hybrid MPI/threads programs
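As a small, hypothetical example of how the hybrid scripts above might be used, the following C program combines MPI tasks with OpenMP threads. A build line such as mpcc_r -qsmp=omp -o hybrid hybrid.c (file names are illustrative) lets the mpcc_r script link the MPI library while -qsmp=omp enables the OpenMP directives.
/* hybrid.c - minimal hybrid MPI/OpenMP sketch (illustrative file name).
   Possible build line: mpcc_r -qsmp=omp -o hybrid hybrid.c */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);                 /* POE manages the MPI tasks */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The compiler, threads library and OS manage the threads */
    #pragma omp parallel
    {
        printf("MPI task %d, OpenMP thread %d\n", rank, omp_get_thread_num());
    }

    MPI_Finalize();
    return 0;
}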
Compiler Options:
- IBM compilers include many options - too
numerous to be covered here. For a full discussion, consult the IBM
compiler documentation. An abbreviated summary of some common/useful
options is listed in the table below.
Option | Description
-blpdata | Enable executable for large pages. Linker option. See the Large Pages discussion.
-bbigtoc | Required if the image's table of contents (TOC) is greater than 64 KB.
-c | Compile only, producing a ".o" file. Does not link object files.
-g | Produce information required by debuggers and some profiler tools.
-I | Names directories for additional include files.
-L | Specifies pathnames where additional libraries reside. Directories will be searched in the order of their occurrence on the command line.
-l | Names additional libraries to be searched.
-O -O2 -O3 -O4 -O5 | Various levels of optimization. See discussion below.
-o | Specifies the name of the executable (a.out by default).
-p -pg | Generate profiling support code. -p is required for use with the prof utility and -pg is required for use with the gprof utility.
-q32, -q64 | Specifies generation of 32-bit or 64-bit objects. See discussion below.
-qhot | Determines whether or not to perform high-order transformations on loops and array language during optimization, and whether or not to pad array dimensions and data objects to avoid cache misses.
-qipa | Specifies interprocedural analysis optimizations.
-qarch=arch, -qtune=arch | Permits maximum optimization for the processor architecture being used. Can improve performance at the expense of portability. It's probably best to use auto and let the compiler optimize for the platform where you actually compile. See the man page for other options.
-qautodbl=setting | Automatic conversion of single precision to double precision, or double precision to extended precision. See the man page for correct setting options.
-qreport | Displays information about loop transformations if -qhot or -qsmp are used.
-qsmp=omp | Specifies OpenMP compilation.
-qstrict | Turns off aggressive optimizations which have the potential to alter the semantics of a user's program.
-qlist, -qlistopt, -qsource, -qxref | Compiler listing/reporting options. -qlistopt may be of use if you want to know the setting of ALL options.
-qwarn64 | Aids in porting code from a 32-bit environment to a 64-bit environment by detecting the truncation of an 8 byte integer to 4 bytes. Statements which may cause problems will be identified through informational messages.
-v -V | Display verbose information about the compilation.
-w | Suppress informational, language-level, and warning messages.
-bmaxdata:bytes | Historical. This is actually a loader (ld) flag required for use on 32-bit objects that exceed the default data segment size, which is only 256 MB, regardless of the machine's actual memory. At LC, this option would not normally be used because all of its IBM systems are now 64-bit since the retirement of the ASC Blue systems. Codes that link to old libraries compiled in 32-bit mode may still need this option, however.
32-bit versus 64-bit:
- LC's POWER4 and POWER5 machines default to 64-bit compilations.
- In the past, LC operated the 32-bit ASC Blue machines.
- Because 32-bit and 64-bit executables are incompatible, users needed to
be aware of the compilation mode for all files in their application. An
executable has to be entirely 32-bit or entirely 64-bit. An example of
when this might be a problem would be trying to link old 32-bit
executables/libraries into your 64-bit application.
- Recommendation: explicitly specify your compilations with either -q32 or
-q64 to avoid any problems encountered by accepting the defaults.
Optimization:
- Default is no optimization
- Without the correct -O option specified, the defaults for
-qarch and -qtune are not optimal!
Only -O4 and -O5 automatically select the best
architecture related optimizations.
- -O4 and -O5 can perform optimizations specific to L1
and L2 caches on a given platform. Use the -qlistopt flag
with either of these and then look at the listing file for this
information.
- Any level of optimization above -O2 can be aggressive and change
the semantics of your program, possibly reducing performance or causing
wrong results. You can use the -qstrict flag with the
higher levels of optimization to restrict semantic changing optimizations.
- The compiler uses a default amount of memory to perform optimizations.
If it thinks it needs more memory to do a better job, you may get a
warning message about setting MAXMEM to a higher value. If you specify
-qmaxmem=-1 the compiler is free to use as much memory as it needs
for its optimization efforts.
- Optimizations may cause the compiler to relax conformance to the IEEE
Floating-Point Standard.
Miscellaneous:
- Conformance to IEEE Standard Floating-Point Arithmetic: the IBM C/C++
and Fortran compilers "mostly" follow the standard, however, the
exceptions and discussions are too involved to cover here.
- The IBM documentation states that the
C/C++ and Fortran compilers support OpenMP version 2.5.
- All of the IBM compiler commands have default options, which can
be configured by a site's system administrators. It may be useful to
review the files /etc/*cfg* to learn exactly what the
defaults are for the system you're using.
- Static Linking and POE: POE executables that use MPI are
dynamically linked with the appropriate communications library at
run time. Beginning with POE version 4, there is no support for
building statically bound executables.
- The IBM C/C++ compilers automatically support POSIX threads - no
special compile flag(s) are needed. Additionally, the IBM Fortran
compiler provides an API and support for pthreads even though there
is no POSIX API standard for Fortran.
See the IBM Documentation - Really!
MPI
Implementations:
- On LC's POWER4 and POWER5 systems, the only MPI library available is IBM's
thread-safe MPI. It includes MPI-1 and most of MPI-2 (excludes dynamic
processes).
- MPI is automatically linked into your build when you use any of the
compiler commands below. Recall that, as discussed earlier, the thread-safe
version of the compiler (the _r version) is automatically used at LC even
if you call the non-_r version.
IBM MPI Compiler Invocation Commands
- MPI:
- mpxlc / mpcc - Parallel xlc / cc compiler scripts
- mpCC - Parallel xlC compiler script
- mpxlf - Parallel xlf compiler script
- mpxlf90 - Parallel xlf90 compiler script
- mpxlf95 - Parallel xlf95 compiler script
- MPI with Threads (OpenMP, Pthreads, IBM threads):
- mpxlc_r / mpcc_r - Parallel xlc / cc compiler scripts for hybrid MPI/threads programs
- mpCC_r - Parallel xlC compiler script for hybrid MPI/threads programs
- mpxlf_r - Parallel xlf compiler script for hybrid MPI/threads programs
- mpxlf90_r - Parallel xlf90 compiler script for hybrid MPI/threads programs
- mpxlf95_r - Parallel xlf95 compiler script for hybrid MPI/threads programs
Notes:
- All MPI compiler commands are actually "scripts" that automatically link
in the necessary MPI libraries, include files, etc. and then call the
appropriate native AIX compiler.
- Documentation for the IBM implementation is available
from IBM.
- LC's MPI tutorial describes
how to create MPI programs; a minimal example is sketched below.
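The sketch below is only an illustration (it is not the benchmark code used for the earlier switch performance figures): task 0 sends a message to task 1 and receives it back. It could be built with one of the scripts above, for example mpcc -o pingpong pingpong.c (file names are illustrative).
/* pingpong.c - minimal two-task MPI point-to-point sketch.
   Possible build line: mpcc -o pingpong pingpong.c */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, ntasks, msg = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    if (ntasks < 2) {
        if (rank == 0) printf("Run this with at least 2 tasks\n");
    } else if (rank == 0) {
        msg = 42;                                        /* arbitrary payload */
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
        printf("Task 0 received %d back from task 1\n", msg);
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}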
Running on LC's POWER Systems
Large Pages
Large Page Overview:
- IBM AIX page size has typically been 4KB. Beginning with POWER4 and AIX 5L
version 5.1, 16MB large page support was implemented.
- The primary purpose of large pages is to provide performance improvements
for memory intensive HPC applications. The performance improvement
results from:
- Reducing translation look-aside buffer (TLB) misses through mapping
more virtual memory into the TLB. TLB memory coverage for large pages is
16 GB vs. 4 MB for small pages.
- Improved memory prefetching by eliminating the need to restart prefetch
operations on 4KB boundaries. Large pages hold 131,072 cache lines vs.
32 cache lines for 4KB pages.
- AIX treats large pages as pinned memory - an application's data remains
in physical memory until the application completes. AIX does not provide
paging support for large pages.
- According to IBM, memory bandwidth can be increased up to 3x for some
applications when using large pages. In practice this may translate to
an overall application speedup of 5-20%.
- However, some applications may demonstrate a marked decrease in
performance with large pages:
- Short running applications (measured in minutes)
- Applications that perform fork() and exec() operations
- Shell scripts
- Compilers
- Graphics tools
- GUIs for other tools (such as TotalView)
- If large pages are exhausted, enabled applications silently fail over to
use small pages, with possible ramifications to performance. However,
the converse is not true: applications that use only small pages cannot
access large-page memory.
- Large page configuration is controlled by system managers. Using this
configuration is entirely up to the user. It is not automatic.
- More information: AIX Support For
Large Pages whitepaper.
Large Pages and Purple:
- Purple systems are configured to allocate the maximum AIX permitted
amount (85%) of a machine's memory for large pages. This means that there
is a relatively small amount of memory available for regular 4KB pages.
- IMPORTANT: Because LC has allocated most of memory for large pages,
applications which aren't enabled for large pages will default to using
the limited 4KB page pool. It is quite likely in such cases that excessive
paging will occur and the job will have to be terminated to prevent
it from hanging or crashing the system.
How to Enable Large Pages:
- As mentioned, even though a system has large pages configured, making
use of them is up to the user.
- Use any of three ways to enable an application for large page use:
- At build time: link with the -blpdata flag.
Recommended.
- After build: use the ldedit -blpdata executable
command on your executable. Recommended.
- At runtime: set the LDR_CNTRL=LARGE_PAGE_DATA=Y environment
variable. For example:
setenv LDR_CNTRL LARGE_PAGE_DATA=Y
Note that if you forget to unset this environment variable after your
application runs, it will affect every other task in your
login session. Routine, non-application tasks will probably run
very slowly, so this method is NOT recommended for interactive sessions
where you are using other tools/utilities.
When NOT to Use Large Pages:
- In most cases, large pages should not be used for non-application tasks
such as editing, compiling, running scripts, debugging, using
GUIs or running non-application tools. Using large pages for these tasks
will cause them to perform poorly in most cases.
- Using the LDR_CNTRL=LARGE_PAGE_DATA=Y environment variable will cause
all tasks to use large pages, not just your executable. For this reason,
it is not recommended.
- Of special note: do not run TotalView under large pages. Discussed
later in the Debugging section of this
tutorial.
- To change a large page executable to not use large pages, use the
command ldedit -bnolpdata executable
Miscellaneous Large Page Info:
- To see if the large page bit is set on an executable, use the command
shown below. Note that the example executable's name is "bandwidth"
and that the output will contain LPDATA if the bit is set.
% dump -Xany -ov bandwidth | grep "Flags"
Flags=( EXEC DYNLOAD LPDATA DEP_SYSTEM )
|
- On AIX 5.3 and later, "ps -Z" will show 16M in the DPGSZ column (data
page size) for jobs using large pages. The SPGSZ (stack) and TPGSZ
(text) columns will remain at 4K regardless. For example:
% ps -Z
PID TTY TIME DPGSZ SPGSZ TPGSZ CMD
135182 pts/68 0:00 4K 4K 4K my_small_page_job
177982 pts/68 0:00 16M 4K 4K my_large_page_job
|
- The sysconf(_SC_LARGE_PAGE_SIZE) function call will return the large
page size on systems that have large pages.
- The vmgetinfo() function returns information about large page pools
size and other large page related information.
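- As a small hedged sketch of the sysconf() call mentioned above (the
messages and error handling are illustrative assumptions, not IBM-documented
behavior), a program might query the large page size at runtime like this:
#include <stdio.h>
#include <unistd.h>     /* sysconf(), _SC_LARGE_PAGE_SIZE (AIX) */

int main(void)
{
    /* AIX-specific query for the configured large page size */
    long lpsize = sysconf(_SC_LARGE_PAGE_SIZE);

    if (lpsize > 0)
        printf("Large page size: %ld bytes\n", lpsize);
    else
        printf("No large page size reported on this system\n");
    return 0;
}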
Running on LC's POWER Systems
|
SLURM
LC's Batch Schedulers:
- The native (low level) scheduling software provided by IBM for parallel
POWER systems is LoadLeveler. At LC, LoadLeveler has been replaced by
SLURM.
- SLURM resource manager:
- SLURM = Simple Linux Utility for Resource Management
- Developed by LC; open source
- Also used on all of LC's Linux clusters
- SLURM manages a single cluster - it does not know about other
clusters - hence "low level".
- LC's system-wide batch schedulers (high-level; workload managers)
are LCRM and Moab:
- LCRM/Moab "talk" to the low-level SLURM resource manager on each
system.
- Tie multiple clusters into an enterprise-wide system
- LCRM and Moab are discussed in detail in the
LCRM tutorial and
Moab tutorial.
SLURM Architecture:
- SLURM is implemented with two daemons:
- slurmctld - central management daemon.
Monitors all other SLURM daemons and resources, accepts work
(jobs), and allocates resources to those jobs. Given the
critical functionality of slurmctld, there may be a backup
daemon to assume these functions in the event that the
primary daemon fails.
- slurmd - compute node daemon. Monitors all
tasks running on the compute node, accepts work (tasks),
launches tasks, and kills running tasks upon request.
- SLURM significantly alters the way POE behaves, especially in batch.
- There are differences between the SLURM implementation on POWER systems
and the SLURM implementation on LC's Linux systems.
SLURM Commands:
- SLURM provides six user-level commands, described below. Note that the IBM
implementation does not support all six commands. See each command's
man page for usage details.
SLURM Command |
Description |
Supported on IBMs? |
scancel |
Cancel or signal a job |
INTERACTIVE ONLY |
scontrol |
Administration tool; configuration |
YES |
sinfo |
Reports general system information |
YES |
smap |
Displays an ASCII-graphical version of squeue |
YES |
squeue |
Reports job information |
YES |
srun |
Submits/initiates a job |
NO |
SLURM Environment Variables:
- The srun man page describes a
number of SLURM environment variables. However, under AIX, only a few
of these are supported (described below).
SLURM Environment Variable |
Description |
SLURM_JOBID |
Set by SLURM to the job's id. Can be echoed or queried by scripts and
programs to identify the job (see the sketch after this table). |
SLURM_NETWORK |
Specifies switch and adapter settings such as communication protocol, RDMA
and number of adapter ports. Replaces the use
of the POE MP_EUILIB and MP_EUIDEVICE environment
variables. In most cases, users should not modify the default settings, but
if needed, they can. For example to run with IP protocol over a single
switch adapter port:
setenv SLURM_NETWORK ip,sn_single
The default setting is to use User Space protocol over both switch adapter
ports and to permit RDMA.
setenv SLURM_NETWORK us,bulk_xfer,sn_all |
SLURM_NNODES |
Specify the number of nodes to use. Currently not working
on IBM systems. |
SLURM_NPROCS |
Specify the number of processes to run. The default is one process
per node. |
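- As a small sketch (the variable names come from the table above; everything
else is illustrative), a program or batch step can read these variables
with getenv(). For example:
#include <stdio.h>
#include <stdlib.h>     /* getenv() */

int main(void)
{
    const char *jobid  = getenv("SLURM_JOBID");
    const char *nprocs = getenv("SLURM_NPROCS");

    /* Either variable may be unset outside a SLURM-launched job */
    printf("SLURM_JOBID  = %s\n", jobid  ? jobid  : "(not set)");
    printf("SLURM_NPROCS = %s\n", nprocs ? nprocs : "(not set)");
    return 0;
}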
Additional Information:
Running on LC's POWER Systems
|
Understanding Your System Configuration
First Things First:
- Before building and running your parallel application, it is important
to know a few details regarding the system you intend to use.
This is especially important if you use multiple systems, as they will
be configured differently.
- Also, things at LC are in a continual state of flux. Machines change,
software changes, etc.
- Several information sources and simple configuration commands
are available for understanding LC's IBM systems.
System Configuration/Status Information:
- LC's Home Page: computing.llnl.gov
- Important Notices and News - timely information for every system
- OCF Machine Status (LLNL internal) - shows if the machine is up or
down and then links to the following information for each machine:
- Message of the Day
- Announcements
- Load Information
- Machine Configuration (and other) information
- Job Limits
- Purge Policy
- Computing Resources - detailed machine configuration information
for all machines.
- When you login, be sure to check the login banner & news items
- Machine status email lists.
- Each machine has a status list which provides the most timely
status information and plans for upcoming system maintenance or
changes. For example:
uv-status@llnl.gov
um-status@llnl.gov
up-status@llnl.gov
...
- LC support initially adds people to the list, but just in case you
find you aren't on a particular list (or want to get off), just use
the usual majordomo commands in an email sent to
Majordomo@lists.llnl.gov.
LC Configuration Commands:
-
ju: Displays a summary of node availability
and usage within each pool. Sample output, partially truncated, shown below.
uv006% ju
Partition total down used avail cap Jobs
systest 4 0 0 4 0%
pdebug 2 0 1 1 50% degrt-2
pbatch 99 0 92 7 93% halbac-8, fdy-8, fkdd-8, kuba-16, dael-6 |
-
spjstat: Displays a summary of pool information followed
by a listing of all running jobs, one job per line.
Sample output, partially truncated, shown below.
up041% spjstat
Scheduling pool data:
--------------------------------------------------------
Pool Memory Cpus Nodes Usable Free Other traits
--------------------------------------------------------
pbatch 31616Mb 8 99 99 8
pdebug 31616Mb 8 2 1 1
systest 31616Mb 8 4 4 4
Running job data:
-------------------------------------------------------
Job ID User Name Nodes Pool Status
-------------------------------------------------------
11412 dael 6 pbatch Running
28420 hlbac 8 pbatch Running
28040 rtyg 6 pbatch Running
30243 kubii 16 pbatch Running |
-
sinfo: SLURM systems only.
Displays a summary of the node/pool configuration. Sample output below.
up041% sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
pbatch* up infinite 91 alloc up[001-016,021-036,042-082,089-106]
systest up 2:00:00 4 idle up[085-088]
pdebug up 2:00:00 1 idle up037
pdebug up 2:00:00 1 comp up040
pbatch* up infinite 8 idle up[017-020,083-084,107-108] |
IBM Configuration Commands:
- Several standard IBM AIX and Parallel Environment commands that may prove
useful are shown below. See the man pages for usage details.
- lsdev: This command lists all of the available
physical devices (disk, memory, adapters, etc.)
on a single machine. Beyond providing that list, its most useful
purpose is telling you the names of devices
you can use with the lsattr command for detailed information.
Sample output
- lsattr: Allows you to obtain detailed information for
a specific device on a single machine. The catch is you need to
know the device name, and
for that, the previously mentioned lsdev command is used.
Sample output
- lslpp -al | grep poe: The lslpp command is used
to check on installed software. If you grep the output for poe
it will show you which version of POE is installed on that machine.
- Note that the serial lsattr, lsdev, and lslpp commands
can be used in parallel simply by calling them via poe. For example,
if the following command was put in your batch script, it would show
the memory configuration for every node in your partition:
poe lsattr -El mem0
Running on LC's POWER Systems
|
Setting POE Environment Variables
In General:
- Application behavior under POE is very much determined by a number of POE
environment variables. They control important factors in how a program
runs.
- POE environment variables fall into several categories:
- Partition Manager control
- Job specification
- I/O control
- Diagnostic information generation
- MPI behavior
- Corefile generation
- Miscellaneous
- There are over 50 POE environment variables. Most of them also
have a corresponding command line flag that temporarily overrides the
variable's setting.
- At LC, interactive POE and batch POE usage differ greatly. In
particular, LC's batch
scheduler systems (covered in the
LCRM and Moab tutorials)
modify and/or override the behavior of most basic
POE environment variables. Knowing how to use POE in batch requires
also knowing how to use LCRM and/or Moab.
- A complete discussion and list of the POE environment variables and
their corresponding command line flags can be found in the Parallel
Environment Operation and Use Volume 1
manual. They can also be reviewed (in less detail) in the
POE man page.
| Different versions of POE software are not identical in the environment
variables they support. Things change.
|
How to Set POE Environment Variables:
Basic Interactive POE Environment Variables:
- Although there are many POE environment variables, you really only need
to be familiar with a few basic ones. Specifically, those that answer
three essential questions:
- How many nodes and how many tasks does my job require?
- How will nodes be allocated?
- Which communications protocol and network interface should be used?
- The environment variables that answer these questions are discussed below.
Their corresponding command line flags are shown in parentheses.
- Note that at LC, the basic POE environment variables are mostly ignored,
usurped or overridden by the LCRM batch system. Their usage as shown
here is for interactive jobs.
|
QUESTION 1: How many nodes and how many tasks does my job require?
|
- MP_PROCS (-procs)
- The total number of MPI processes/tasks for your parallel job.
May be used alone or
in conjunction with MP_NODES and/or MP_TASKS_PER_NODE to specify how many
tasks are loaded onto a physical node. The maximum value for
MP_PROCS is dependent upon the version of POE software installed. For
version 4.2 the limit is 8192 tasks. The default is 1.
- MP_NODES (-nodes)
- Specifies the number of physical nodes on which to run the parallel
tasks. May be used alone or in conjunction with MP_TASKS_PER_NODE
and/or MP_PROCS.
- MP_TASKS_PER_NODE (-tasks_per_node)
- Specifies the number of tasks to be run on each of the physical nodes.
May be used in conjunction with MP_NODES and/or MP_PROCS.
|
QUESTION 2: How will nodes be allocated - should I
choose them myself or let POE automatically choose them for me?
|
- MP_RMPOOL (-rmpool)
- Specifies the system pool where your job should run. At LC, this
environment variable is only used for interactive jobs. Available
pools can be determined by using the ju, mjstat, spjstat or
sinfo commands.
The following two environment variables are preset for LC users and
in most cases should not be changed. They are only mentioned here FYI.
- MP_RESD (-resd)
- Specifies whether your nodes should be selected automatically for you
by POE (non-specific node allocation), or whether you want to select
the nodes yourself (specific node allocation).
Valid values are either yes or no.
- MP_HOSTFILE (-hostfile)
- This environment variable is used only if you wish to explicitly select
the nodes that will be allocated for your job (specific node allocation).
If used, this variable specifies the name of a file which contains the
actual machine (domain) names of nodes you wish to use.
|
QUESTION 3: Which communications protocol and network
interface should be used? (non-Purple systems only)
|
These environment variables are preset for LC users and in most cases
should not be changed. They are only mentioned here FYI.
- MP_EUILIB (-euilib)
- Specifies which of two protocols should be used for task communications.
Valid values are either ip for Internet Protocol (slow) or
us for User Space (fast) protocol. User Space is
the default and preferred protocol at LC.
- MP_EUIDEVICE (-euidevice)
- A node may be physically connected to different networks. This environment
variable is used to specify which network adapter should be used for
communications. The value must match a real physical adapter installed
on the node. The IBM documentation provides a list of possible values,
and can easily confuse the user. For LC production machines, the
recommendation is to just accept the default setting for this variable
(and note that sometimes it isn't even set). It will be optimal for the
particular machine you are using.
Example Basic Interactive Environment Variable Settings:
- The example below demonstrates how to set the basic POE environment
variables for an interactive job which will:
- Use 16 tasks on 2 nodes
- Request nodes from a pool called "pdebug"
- Allow the Resource Manager to select nodes (non-specific allocation)
- Use User Space protocol with HPS switch adapter(s)
- Note that the last two items above are accomplished by accepting the default
LC settings for MP_RESD, MP_EUILIB and MP_EUIDEVICE.
csh / tcsh |
ksh / bsh |
setenv MP_PROCS 16
setenv MP_NODES 2
setenv MP_RMPOOL pdebug
|
export MP_PROCS=16
export MP_NODES=2
export MP_RMPOOL=pdebug
|
Other Common/Useful POE Environment Variables
- A list of some commonly used and potentially useful POE environment
variables appears below.
A complete list of the POE environment variables can be
viewed quickly in the POE
man page. A much fuller discussion is available in the
Parallel Environment Operation and Use Volume 1
manual.
- Unlike the basic POE environment variables covered above, a number of
these can also be used for batch jobs at LC.
- For IBM systems using the SLURM scheduler, some POE variables
not shown are ignored, such as MP_RETRY and MP_RETRYCOUNT.
Variable |
Description |
MP_SHARED_MEMORY |
Allows MPI programs with more than one task on a node to use shared
memory instead of the switch for communications. Can significantly
improve on-node communication bandwidth. Valid values are "yes" and
"no". Default is "yes" at LC. |
MP_LABELIO |
Determines whether or not output from the parallel tasks are labeled
by task id. Valid values are yes or no. The default is yes at LC. |
MP_PRINTENV |
Can be used to generate a report of your job's parallel environment
setup information, which may be useful for diagnostic purposes. Default
value is "no"; set to "yes" to enable the report. The report goes to stdout
by default, but the variable can instead be set to the name of a user
script whose output is appended to the report. |
MP_STATISTICS |
Allows you to obtain certain statistical information about your job's
communications. The default setting is "no". Set to "print" and the
statistics will appear on stdout after your job finishes. Note that
there may be a slight impact on your job's performance if you use
this feature. |
MP_INFOLEVEL |
Determines the level of message reporting. Default is 1. Valid values are:
0 = error
1 = warning and error
2 = informational, warning, and error
3 = informational, warning, and error. Also
reports diagnostic messages for use by the
IBM Support Center.
4,5,6 = Informational, warning, and error. Also
reports high- and low-level diagnostic
messages for use by the IBM Support Center.
|
MP_COREDIR
MP_COREFILE_FORMAT
MP_COREFILE_SIGTERM |
Allow you to control how, when and where core files are created. See
the POE man page and/or IBM documentation for details. Note that
LC is currently setting MP_COREFILE_FORMAT to "core.light" by default,
which may/may not be what you want for debugging purposes. |
MP_STDOUTMODE |
Enables you to manage the STDOUT from your parallel tasks. If set to
"unordered" all tasks write output data to STDOUT asynchronously.
If set to "ordered" output data from each parallel task is written to
its own buffer. Later, all buffers are flushed in task order to
stdout. If a task id is specified, only the task indicated writes
output data to stdout. The default is unordered. Warning: use
"unordered" if your interactive program prompts for input - otherwise
your prompts may not appear. |
MP_SAVEHOSTFILE |
Specifies the file name where POE should record the hosts used by your job.
Can be used to "save" the names of the execution nodes. |
MP_CHILD |
Is an undocumented, "read-only" variable set by POE. Each task will
have this variable set to equal its unique taskid (0 through MP_PROCS-1).
Can be queried in scripts or batch jobs to determine "who I am"
(see the sketch after this table). |
MP_PGMMODEL |
Determines the programming model you are using. Valid
values are "spmd" or "mpmd". The default is "spmd". If set to "mpmd"
you will be enabled to load different executables individually on the
nodes of your partition. |
MP_CMDFILE |
Is generally used when MP_PGMMODEL=mpmd, but doesn't have to be.
It specifies the name of a file that lists the commands that are to be
run by your job. Nodes are loaded with these commands in the order they
are listed in the file. If set, POE will read the commands file rather
than try to use STDIN - such as in a batch job. |
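- As referenced in the MP_CHILD entry above, the hedged sketch below uses
MP_CHILD so that each task can write to its own output file. The file-name
pattern is purely illustrative, not an LC convention.
#include <stdio.h>
#include <stdlib.h>     /* getenv(), atoi() */

int main(void)
{
    /* POE sets MP_CHILD to this task's id (0 through MP_PROCS-1) */
    const char *child = getenv("MP_CHILD");
    int taskid = child ? atoi(child) : 0;

    char fname[64];
    snprintf(fname, sizeof(fname), "output.task%d.log", taskid);

    FILE *fp = fopen(fname, "w");
    if (fp != NULL) {
        fprintf(fp, "Hello from task %d\n", taskid);
        fclose(fp);
    }
    return 0;
}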
LLNL Preset POE Environment Variables:
- LC automatically sets several POE environment variables for all
users. In most cases, these are the "best" settings. For the most
current settings, check the /etc/environment file.
Note that these will vary by machine. An example is shown below.
# POE default environment variables
MP_COREFILE_SIGTERM=NO
MP_CPU_USE=unique
MP_EUILIB=us
MP_HOSTFILE=NULL
MP_LABELIO=yes
MP_RESD=yes
MP_SHARED_MEMORY=yes
....
# Set Poe Environment Variables
MP_COREFILE_FORMAT=core.light
MP_INFOLEVEL=1
MP_RMLIB=/opt/freeware/lib/slurm_ll_api.so
MP_PRIORITY_NTP=yes
MP_EUILIB=us
# For Warning about using Large Pages
MP_TLP_REQUIRED=WARN
# o) Constrains AIX MPI tasks to physical CPUs
# o) Allows TV to work with poe
MP_S_POE_AFFINITY=YES
|
Running on LC's POWER Systems
|
Invoking the Executable
Syntax:
Multiple Program Multiple Data (MPMD) Programs:
- By default, POE follows the Single Program Multiple Data parallel
programming model: all parallel tasks execute the same program but may use
different data.
- For some applications, parallel tasks may need to run different programs
as well as use different data. This parallel programming model is
called Multiple Program Multiple Data (MPMD).
- For MPMD programs, the following steps must be performed:
Interactive:
- Set the MP_PGMMODEL environment variable to "mpmd". For example:
setenv MP_PGMMODEL mpmd
export MP_PGMMODEL=mpmd
- Enter poe at the Unix prompt. You will then
be prompted to enter the executable which should be loaded on each
node. The example below
loads a "master" program on the first node and 4 "worker" tasks
on the remaining four nodes.
0:node1> master
1:node2> worker
2:node3> worker
3:node4> worker
4:node5> worker
- Execution starts automatically after the last node has been loaded.
- Note that if you don't want to type each command in line by line,
you can put the commands in a file, one per line, to match the
number of MPI tasks, and then set MP_CMDFILE to the name of that file.
POE will then read that file instead of prompting you to input each
executable.
Batch:
- Create a file which contains a list of the program names,
one per line, that must be loaded onto your nodes. There should be
one command per MPI task that you will be using.
- Set the MP_PGMMODEL environment variable to "mpmd" - usually done in
your batch submission script
- Set the environment variable MP_CMDFILE to the name of the file
you created in step 1 above - usually done in your batch submission
script also.
- When your application is invoked within the batch system, POE will
automatically load the nodes as specified by your file.
- Execution starts automatically after the last node has been loaded.
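- To make the MPMD model concrete, the hedged sketch below shows a trivially
small master/worker pair of the kind that could be listed in an MP_CMDFILE
file or typed at the poe prompts above. The programs and the message they
exchange are purely illustrative; the master is assumed to be loaded first
and therefore runs as task 0. The matching command file would simply list
master once, followed by worker once for each remaining task.
/* master.c - loaded first, so it runs as MPI task 0 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, ntasks, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* Hand each worker one unit of "work" (here, just an integer) */
    for (i = 1; i < ntasks; i++)
        MPI_Send(&i, 1, MPI_INT, i, 0, MPI_COMM_WORLD);

    printf("master (task %d) sent work to %d workers\n", rank, ntasks - 1);
    MPI_Finalize();
    return 0;
}

/* worker.c - loaded on the remaining tasks */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, item;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Receive one work item from the master (task 0) */
    MPI_Recv(&item, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    printf("worker (task %d) received item %d\n", rank, item);

    MPI_Finalize();
    return 0;
}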
Using POE with Serial Programs:
POE Error Messages:
Running on LC's POWER Systems
|
Monitoring Job Status
- POE does not provide any basic commands for monitoring your job.
- LC does provide several different commands to accomplish this, however.
Some of these commands were previously discussed in the
Understanding Your System Configuration
section, as they serve more than one purpose.
- The most useful LC commands for monitoring your job's status are:
- ju - succinct display of running jobs
- spjstat / spj - running jobs (non Moab systems)
- mjstat - running jobs (Moab systems)
- squeue - SLURM command for running jobs. All systems.
- pstat - LCRM command for both running and queued jobs. Available on
Moab systems via a wrapper also.
- mshow, showq, checkjob - Moab systems
- Examples are shown below, some with truncated output for readability.
See the man pages for more information.
up041% ju
Partition total down used avail cap Jobs
systest 4 0 0 4 0%
pdebug 2 0 0 2 0%
pbatch 99 0 90 9 91% hac-8, fdy-8, fdy-8, gky-3,
dhkel-32, kuta-1, kuta-16, lski-2, danl-6, danl-6
up041% spjstat
Scheduling pool data:
--------------------------------------------------------
Pool Memory Cpus Nodes Usable Free Other traits
--------------------------------------------------------
pbatch 31616Mb 8 99 99 8
pdebug 31616Mb 8 2 1 1
systest 31616Mb 8 4 4 4
Running job data:
-------------------------------------------------------
Job ID User Name Nodes Pool Status
-------------------------------------------------------
11412 dael 6 pbatch Running
28420 hlbac 8 pbatch Running
28040 rtyg 6 pbatch Running
30243 kubii 16 pbatch Running
34087 dhrrtel 32 pbatch Running
34433 gddy 3 pbatch Running…
…
up041% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
28420 pbatch uu4r elbac R 11:25:18 8 up[001-008]
28040 pbatch rr5t16b danl R 8:43:30 6 up[095-100]
30243 pbatch www34r14 jj4ota R 8:38:49 16 up[064-079]
28058 pbatch rr5t16c danl R 8:38:37 6 up[101-106]
33257 pbatch www34r14 jj4ota R 7:49:20 1 up036
33882 pbatch BH44T2-t fdy R 5:13:34 8 up[009-016]
34087 pbatch YY015 nekel R 4:46:09 32 up[029-035,042-047,054-063,080-082,089-094]
34433 pbatch ruunra6 ekay R 3:18:51 3 up[023-025]
34664 pbatch BHJRL01k fdy R 1:34:27 8 up[017-022,026-027]
34640 pbatch RUUE80 edeski R 54:24 2 up[083-084]
up041% pstat
JID NAME USER BANK STATUS EXEHOST CL
28033 33er16a danl a_phys *WPRIO up N
28040 phjje6b danl a_phys RUN up N
28420 rsr22erv10 haslr4 illinois RUN up N
29267 B7eee0ms fddt illinois *DEPEND up N
30243 wol8une409 wwiota a_cms RUN up N
33675 phrr33a qqw3el a_phys *DEPEND up N
34071 inhiyyrrr76 weertler illinois *WCPU up N
34087 RTT15 dqrtkel axicf RUN up N
34433 runwww6 robr a_engr RUN up N
34435 runwww6 robr a_engr *DEPEND up N
34640 RTTT80 lssgh axicf RUN up N
34653 RTT081 lssgh axicf *WPRIO up N
34661 B7eee0ms fddt illinois *DEPEND up N
34749 rsff450v10 haslr4 illinois *DEPEND up N
35221 nm-hhj99.inp cnbvvdy bdivp *WPRIO up N
|
Running on LC's POWER Systems
|
Interactive Job Specifics
The pdebug Interactive Pool/Partition:
Insufficient Resources:
- Interactive partitions are small and must be shared by all users on
the system. It is easy to exhaust the available nodes. When this happens,
you will receive an informational message from SLURM as shown below.
SLURMINFO: Job 84233 is pending allocation of resources. |
- If this happens, you can either wait until SLURM runs the job when nodes
become free, or press CTRL-C to quit.
Killing Interactive Jobs:
- You can use CTRL-C to terminate a running, interactive
POE job that has not been put in the background. POE will propagate the
termination signal to all tasks.
- On SLURM systems, the scancel command can also be used to
kill an interactive job:
up041% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18352 pbatch 4ye_Conv 4enollin R 1:00:37 1 up100
18354 pbatch 4ye_Conv yeioan R 59:47 1 up074
18378 pbatch 4ye_Conv yeioan R 48:14 1 up101
66004 pdebug poe blaise R 0:13 1 up037
up041% scancel 66004
up041% ERROR: 0031-250 task 3: Terminated
ERROR: 0031-250 task 2: Terminated
ERROR: 0031-250 task 0: Terminated
ERROR: 0031-250 task 1: Terminated
SLURMERROR: slurm_complete_job: Job/step already completed
[1] Exit 143 poe hangme
up041% |
- Another way to kill your interactive job is to kill the poe
process on the node where you started the job (usually your login node).
For example:
up041% ps
PID TTY TIME CMD
3223668 pts/68 0:00 ps
3952660 pts/68 0:00 -tcsh
4313092 pts/68 0:00 poe
up041% kill 4313092
up041% ERROR: 0031-250 task 2: Terminated
ERROR: 0031-250 task 0: Terminated
ERROR: 0031-250 task 3: Terminated
ERROR: 0031-250 task 1: Terminated
up041% |
- Yet another alternative is to use the poekill command to kill the
poe process. For example:
up041% poekill poe
Terminating process 4173960 program poe
ERROR: 0031-250 task 0: Terminated
ERROR: 0031-250 task 1: Terminated
ERROR: 0031-250 task 2: Terminated
ERROR: 0031-250 task 3: Terminated
up041% |
Running on LC's POWER Systems
|
Batch Job Specifics
Things Are Changing:
- Currently, LC is migrating from its LCRM workload manager to Moab. At
this time (6/07), all production IBM POWER systems at LC are running LCRM.
LCRM is covered in detail in the LCRM
Tutorial. This section only provides a quick summary of LCRM usage.
- In the near future, LCRM will be discontinued, however a subset of the most
common LCRM commands will be supported by Moab via wrapper scripts.
Moab is covered in detail in the Moab
Tutorial.
Submitting Batch Jobs:
- LC production systems allocate the majority of their nodes for batch use.
Batch nodes are configured into the pbatch pool/partition,
which is also the default pool for batch jobs.
- The first step in running a batch job is to create an LCRM job control
script. A sample job control script appears below.
# Sample LCRM script to be submitted with psub
#PSUB -c up # which machine to use
#PSUB -pool pbatch # which pool to use
#PSUB -r myjob # specify job name
#PSUB -tM 1:00 # set maximum total CPU time
#PSUB -b micphys # set bank account
#PSUB -ln 4 # use 4 nodes
#PSUB -g 16 # use 16 tasks
#PSUB -x # export current env var settings
#PSUB -o myjob.log # set output log name
#PSUB -e myjob.err # set error log name
#PSUB -nr # do not rerun job after system reboot
#PSUB -mb # send email at execution start
#PSUB -me # send email at execution finish
# no more psub commands
# job commands start here
set echo
setenv MP_INFOLEVEL 4
setenv MP_SAVEHOSTFILE myhosts
setenv MP_PRINTENV yes
echo LCRM job id = $PSUB_JOBID
cd ~/db/myjob
./my_mpiprog
rm -rf tempfiles
echo 'ALL DONE'
|
- Submit your LCRM job control script using the psub
command. For example, if the above script were named run.cmd,
it would be submitted as:
psub run.cmd
- You may then check your job's progress as discussed in the
Monitoring Job Status section above.
Quick Summary of Common LCRM Batch Commands:
Command |
Description |
psub |
Submits a job to LCRM |
pstat |
LCRM job status command |
prm |
Remove a running or queued job |
phold |
Place a queued job on hold |
prel |
Release a held job |
palter |
Modify job attributes (limited subset) |
lrmmgr |
Show host configuration information |
pshare |
Queries the LCRM database for bank share allocations, usage statistics,
and priorities. |
defbank |
Set default bank for interactive sessions |
newbank |
Change interactive session bank |
Batch Jobs and POE Environment Variables:
- Certain POE environment variables will affect batch jobs just as they
do interactive jobs. For example, MP_INFOLEVEL and
MP_PGMMODEL. These can be placed in your batch job command
script.
- However, other POE environment variables are ignored by the batch
scheduler for obvious reasons. For example, the
following POE variables will have no effect if used in a batch job
control script:
MP_PROCS
MP_NODES
MP_TASKS_PER_NODE
MP_RMPOOL
MP_HOSTFILE
MP_PMDSUFFIX
|
MP_RESD
MP_RETRY
MP_RETRYCOUNT
MP_ADAPTER_USE
MP_CPU_USE
|
- Be aware that POE environment variables in your .login, .cshrc,
.profile, etc. files may also affect your batch job.
Killing Batch Jobs:
- The best way to kill batch jobs is to use the LCRM prm
command. It can be used to terminate both running and queued jobs.
For example:
% pstat
JID NAME USER BANK STATUS EXEHOST CL
42991 batch_run3 joe33 cs RUN up N
42999 batch_run4 joe33 cs *DEPEND up N
% prm 42991
remove running job 42991 (joe33, cs)? [y/n] y
% pstat
JID NAME USER BANK STATUS EXEHOST CL
42999 batch_run4 joe33 cs *DEPEND up N
|
Running on LC's POWER Systems
|
Optimizing CPU Usage
SMP Nodes:
- All IBM SP nodes are shared memory SMPs. Each SMP node has multiple CPUs,
and is thus capable of running multiple tasks simultaneously.
- All of LC's production IBM POWER systems have 8 cpus/node:
- POWER4 p655 - 8 CPUs per node (UM, UV)
- POWER5 p5 575 - 8 CPUs per node (UP, PURPLE)
- Optimizing CPU usage on these nodes means using the available CPUs as
fully as possible.
Effectively Using Available CPUs:
When Not to Use All CPUs:
- For MPI codes that use OpenMP or Pthread threads, you probably do not
want to place an MPI task on each CPU, as the threads will need someplace
to run (see the sketch after this list).
- For tasks that use a substantial portion of a node's memory, you may likewise
not want to put a task on every CPU if doing so would lead to memory exhaustion.
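- As a hedged illustration of the first point above, a hybrid MPI/OpenMP code
typically places only one or a few MPI tasks on each node and lets OpenMP
threads occupy the remaining CPUs. The sketch below is generic; the
"mpcc_r -qsmp=omp" compile line in the comment reflects the usual IBM XL
conventions but should be checked against your system's compiler
documentation, and the thread count is normally controlled with the
OMP_NUM_THREADS environment variable.
/* Hybrid MPI + OpenMP sketch: few MPI tasks per node, threads fill the
   remaining CPUs.  Compile with something like: mpcc_r -qsmp=omp ...   */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each MPI task spawns a team of OpenMP threads; e.g., on an 8-CPU
       node with 2 tasks per node, OMP_NUM_THREADS=4 fills the node. */
    #pragma omp parallel
    {
        printf("MPI task %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}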
Running on LC's POWER Systems
|
RDMA
What is RDMA?
- One definition:
A communications protocol that provides transmission of data from the
memory of one computer to the memory of another without involving the
CPU, cache or context switches.
- Since there is no CPU involvement, data transfer can occur in parallel
with other system operations. Overlapping computation with communication
is one of the major benefits of RDMA.
- Zero-copy data transport reduces memory subsystem load.
- Implemented in hardware on the network adapter (NIC). Transmission occurs
over the High Performance Switch.
- Other advantages:
- Offload message fragmentation and reassembly to the adapter
- Reduced packet arrival interrupts
- One-sided shared memory programming model
- IBM also refers to RDMA as "Bulk Transfer".
How to Use RDMA:
Running on LC's POWER Systems
|
Other Considerations
Simultaneous Multi-Threading (SMT)
- LC's POWER5 systems have been enabled for SMT.
- SMT is a combination of POWER5 hardware and AIX software that creates
and manages two independent instruction streams (threads) on the same
physical CPU.
- SMT makes one processor appear as two processors.
- The primary intent of SMT is to improve performance by overlapping
instructions so that idle execution units are used.
- Users don't need to do anything specific to take advantage of SMT,
though there are a few considerations:
- Use no more than 1 MPI task per CPU. Regard the second virtual CPU
as being reserved for auxiliary threads or system daemons.
- Some applications will note a marked improvement in performance.
For example, the LLNL benchmark codes sPPM and UMT2K both realized
a 20-22% performance gain (according to IBM).
- Some applications may experience a performance degradation.
- IBM documentation states that performance can range from -20% to +60%.
POE Co-Scheduler
- Currently enabled on Purple only.
- Under normal execution, every process will be interrupted periodically
in order to allow system daemons and other processes to use the CPU.
- On multi-CPU nodes, CPU interruptions are not synchronized. Furthermore,
CPU interruptions across the nodes of a system are not synchronized.
- For MPI programs, the non-synchronized CPU interruptions can significantly
affect performance, particularly for collective operations. Some tasks will
be executing while other tasks will be waiting for a CPU being used by
system processes.
- IBM has enabled POE to elevate user task priority and to force system
tasks into a common time slice. This is accomplished by the POE
co-scheduler daemon.
- Configuring and enabling the POE co-scheduler is performed by
system administrators. Two primary components:
- /etc/poe.priority file specifies priority "class" configurations
- /etc/environment sets the MP_PRIORITY environment variable to the
desired priority class.
- Note that if you unset or change the setting of MP_PRIORITY you may
defeat the co-scheduler's purpose.
Debugging With TotalView
|
- TotalView remains the debugger of choice when working with parallel programs
on LC's IBM AIX machines.
- TotalView is a complex and sophisticated tool which requires much more than
a few paragraphs of description before it can be used effectively. This
section serves only as a quick and convenient "getting started" summary.
- Using TotalView is covered in great detail in LC's
Totalview tutorial.
The Very Basics:
- Be sure to compile your program with the -g option
- When starting TotalView, specify the poe process and then
use TotalView's -a option for your program and any other
arguments (including POE arguments). For example:
totalview poe -a myprog -procs 4
- TotalView will then load the poe process and open its Root and Process
windows as usual. Note that the poe process appears in the Process Window.
- Use the Go command in the Process Window to start poe with your executable.
- TotalView will then attempt to acquire your partition and load your job.
When it is ready to run your job, you
will be prompted about stopping your parallel job. In most cases,
answering yes is the right thing to do.
- Your executable should then appear in the Process Window. You are now
ready to begin debugging your parallel program.
- For debugging in batch, see
Batch System
Debugging in LC's TotalView tutorial.
A Couple LC Specific Notes:
- For non-MPI jobs on LC's IBMs, you will need to put at least one
poe command in your batch script if you plan to login to
a batch node where your job is running. Something as simple as
poe date or poe hostname will do the trick.
Otherwise you will be prompted for a password, which will never be
recognized.
- For Tri-lab cross-cell authentication users: instead of using
ssh for connecting to a batch node, use
rsh. The above note for non-MPI jobs also applies.
TotalView and Large Pages:
- TotalView will perform poorly if run with
large pages.
- Do not set the LDR_CNTRL=LARGE_PAGE_DATA=Y environment variable when you
are using TotalView. Instead, enable your application for large pages
with the -blpdata flag at build time, or by using the
ldedit -blpdata executable_name command before starting
it with TotalView.
- This will allow your application to use large pages, but keep TotalView
using standard AIX 4 KB pages, where it performs best.
- TotalView will warn you if you are trying to run it with Large Pages, as
shown below:
************************************************************************
* WARNING: This TotalView session may run SLOWLY because this *
* machine has a large page pool, and you have set LDR_CNTRL, but *
* it is not set to LARGE_PAGE_DATA=N . *
* *
* TotalView will run at its normal speed if you exit this session, *
* unsetenv LDR_CNTRL, and flag your executable to use large pages *
* by issuing ``ldedit -blpdata <your executable>''. Later, you *
* may unflag it with ``ldedit -bnolpdata <your executable>''. To *
* list the flag, run ``dump -ov <your executable> | grep LPDATA''. *
************************************************************************
This completes the tutorial.
|
Please complete the online evaluation form - unless you are doing the exercise,
in which case please complete it at the end of the exercise. |
References and More Information
|
- Author: Blaise Barney, Livermore
Computing.
- IBM Parallel Environment (PE), ESSL/PESSL and GPFS documentation:
http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/com.ibm.cluster.pe.doc/pebooks.html
- IBM Compiler Documentation:
- Numerous web pages at IBM's web site:
ibm.com.
- Presentation materials from LC's IBM seminar on Power4/5 Tools and
Technologies. To access these materials, see
computing.llnl.gov/mpi/news_events.html
and scroll down to the Thursday, July 8-9, 2004 event.
- Photos/Graphics: Permission to use some of IBM's photos/graphics
has been obtained by the author from
photo@us.ibm.com and is on file. Other photos/graphics have been
created by the author,
created by other LLNL employees, obtained from non-copyrighted sources,
or used with the permission of authors from other presentations
and web pages.
- Benchmarks: sPPM and UMT2K, the LLNL benchmark codes mentioned in the
SMT discussion above.
- Some portions of this tutorial have been adapted from the
"Parallel Operating Environment (POE)" tutorial by the author during
his employment with the
Maui High Performance Computing Center,
which grants permission to reproduce its materials in whole or in part
for United States Government or educational purposes.