Chapter 2. Configuring Your System

This chapter provides information on configuring your system and covers the following topics:


Note: For information on system configuration and operation on SGI ProPack 4 for Linux or SGI ProPack 5 for Linux systems, see the 007-4633-009 or 007-4633-012 version of this manual, respectively. From the current 007-4633-016 version of this manual on the SGI Technical Publications Library, select the additional info link. Click on 007-4633-009 under Other Versions.


PCI or PCI-X Card Hot-Plug Software

The Linux PCI/X hot-plug feature supports inserting a PCI or PCI-X card into an empty slot and preparing that card for use or deactivating a PCI or PCI-X card and then removing it from its slot, while the system is running. Hot-plug operations can be initiated using a series of shell commands.


Note: PCI hot-plug is only supported on SGI Altix 3000 series systems on the IX-Brick, PX-Brick, IA-Brick, and PA-Brick. It is not supported on SGI Altix XE systems or Altix 350 systems. For SGI ProPack 6 systems running RHEL5, the I3X blade (3 slot double-wide PCI-X I/O) on SGI Altix 450 or SGI Altix 4700 systems supports PCI hot-plug. PCI-X cards can be added, removed, or replaced in the I3X blade while the I3X blade is installed in the 1955 chassis and the system is operating.


This section describes hot-swap operations and covers the following topics:


Note: The hot-plug feature is not configured on by default. It is a loadable module. To load it, see “Loading Hot-plug Software”.


Introduction to PCI or PCI-X Card Hot-plug Operations

A hot-swap operation is the combination of a remove and insert operation targeting the same slot. Single function cards, multi-function cards, and PCI/X-to-PCI/X bridges are supported.

A hot-plug insert operation consists of attaching a card to an SGI card carrier, inserting the carrier in an empty slot, and using software commands to initiate the software controlled power-up and initialization of the card.

A hot-plug remove operation consists of manually terminating any users of the card, and then using software commands to initiate the remove operation to deactivate and power-down the card.

The Altix system L1 hardware controller imposes the following hot-plug restrictions:

  • A 33 MHz PCI/X card cannot be inserted into an empty PCI bus

  • The last card cannot be removed from a bus running at 33 MHz

If the Linux kernel detects a violation of these restrictions, it reports the error to the user and the requested hot-plug operation fails.

For detailed instructions on how to install or remove a PCI or PCI-X card on the SGI Altix 3000 series systems, see “Adding or Replacing a PCI or PCI-X Card” in Chapter 12, “Maintenance and Upgrade Procedures” in SGI Altix 3000 User's Guide.

For more information on the SGI L1 and L2 controller software, see the SGI L1 and L2 Controller Software User's Guide.

Loading Hot-plug Software

The hot-plug feature is not configured on by default. To load the sgi_hotplug module, perform the following steps:

  1. Load the sgi_hotplug module, as follows:

    % modprobe sgi_hotplug

  2. Make sure the module is loaded, as follows:

    % lsmod | grep sgi_hotplug
    sgi_hotplug           145168  0 
    pci_hotplug           189124  2 sgi_hotplug,shpchp

  3. Change directory (cd) to the /sys/bus/pci/slots directory and verify its contents, as follows:

    % ls -l
        total 0
        drwxr-xr-x 2 root root 0 Aug 22 10:54 0021:00:01
        drwxr-xr-x 2 root root 0 Aug 22 10:54 0022:00:01
        drwxr-xr-x 2 root root 0 Aug 22 10:54 0022:00:02
        drwxr-xr-x 2 root root 0 Aug 22 10:54 0023:00:01
        drwxr-xr-x 2 root root 0 Aug 22 10:54 0024:00:01
        drwxr-xr-x 2 root root 0 Aug 22 10:54 0024:00:02

Controlling Hot-plug Operations

This section describes hot-plug operations and the format of a slot name. It covers the following topics:

Slot Name Format

Hot-plug operations target a particular slot using the name of the slot. All slots that are eligible for a hot-plug operation have a directory in the hot-plug file system that is mounted at /sys/bus/pci/slots. The name of the target slot is based on the hardware location of the slot in the system. For the SGI ProPack 6 release, slot directories are in the form that the lspci(8) command uses, that is, as follows:

segment:bus:slot

Change directory (cd) to the /sys/bus/pci/slots directory and use the ls or ls -l command to view its contents, as follows:

pci/slots> ls
0001:00:01  0002:00:01  0002:00:02  0003:00:01  0004:00:01  0004:00:02

A slot is part of a PCI domain. On an SGI Altix system, a PCI domain is a functional entity that includes a root bridge, subordinate buses under the root bridge, and the peripheral devices it controls. For more information, see “PCI Domain Support for SGI Altix Systems”.

Each slot directory contains two files called path and power. For example, change directory to /sys/bus/pci/slots/0001:00:01 and perform the ls command, as follows:

slots/0001:00:01> ls
path  power 

The power file provides the current hot-plug status of the slot. A value of 0 indicates that the slot is powered-down, and a value of 1 indicates that the slot is powered-up.

The path file contains the module ID of the brick where the slot resides.

Hot-plug Insert Operation

A hot-plug insert operation first instructs the L1 hardware controller to power-up the slot and reset the card. The L1 controller then checks that the card to be inserted is compatible with the running bus. Compatibility is defined, as follows:

  • The card must support the same mode as the running bus, for example, PCI or PCI-X

  • The card must be able to run at the current bus speed

  • A 33 MHz card must not be inserted into an empty bus

Any incompatibilities or errors detected by the L1 controller are reported to the user, and the insert operation fails.

Once the slot has been successfully powered-up by the L1 controller, the Linux hot-plug infrastructure notifies the driver of the card that the card is available and needs to be initialized. After the driver has initialized the card, the hot-plug insert operation is complete and the card is ready for use.

Hot-plug Remove Operation

Before initiating a hot-plug remove operation, the system administrator must manually terminate any processes using the target card.


Warning: Failure to properly terminate any outstanding accesses to the target card may result in a system failure or data corruption when the hot-plug operation is initiated.

For a hot-plug remove operation, the hot-plug infrastructure verifies that the target slot is eligible to be powered-down. The L1 hardware controller restrictions do not permit the last card to be removed from a bus running at 33 MHz, and an attempt to remove the last card fails. The hot-plug infrastructure then notifies the driver of the card of the pending hot-plug remove operation, and the driver deactivates the card. The L1 hardware controller is then instructed to power-down the slot.

Attempts to power-down a slot that is already powered-down, or power-up a slot that is already powered-up are ignored.

Using Shell Commands To Control a Hot-plug Operation

A hot-plug operation can be initiated by writing to the target power file of the slot. After composing the name of the slot based on its location, change into the directory of the slot in the hot-plug virtual file system.

Change directory (cd) to the /sys/bus/pci/slots directory and use the ls or ls -l command to view its contents, as follows:

pci/slots> ls
 0001:00:01  0002:00:01  0002:00:02  0003:00:01  0004:00:01  0004:00:02

For example, to target hot-plug operations to slot 1, segment 1 in module 001i03 (IA-Brick in position 3 of rack 1), change to directory 0001:00:01 and then perform the ls command, as follows:

slots/0001:00:01> ls
path  power 

To query the current hot-plug status of the slot, read the power file of the slot, as follows:

slots/0001:00:01> cat power
1

A value of 0 indicates that the slot is powered-down, and a value of 1 indicates that the slot is powered-up.

The path file contains the module ID where the brick resides, as follows:

slots/0001:00:01> cat path
module_001i03

To initiate an insert operation to a slot that is powered-down, write the character 1 to the power file of the slot, as follows:

slots/0001:00:01> echo 1 > power

Detailed status messages for the insert operation are written to the syslog by the hot-plug infrastructure. These messages can be displayed using the Linux dmesg user command, as follows:

slots/0001:00:01> dmesg

A hot-plug remove operation is initiated to a slot that is powered-up by writing the character 0 to the power file of the slot, as follows:

slots/0001:00:01> echo 0 > power

Detailed status messages for the remove operation are written to the syslog by the hot-plug infrastructure. These messages can be displayed using the Linux dmesg user command, as follows:

slots/0001:00:01> dmesg

Faster SCSI Device Booting

Systems that have many logical unit numbers (LUNs) for attached SCSI devices (1,000 LUNs or more) can be made to boot faster by modifying three files in the /etc/udev/rules.d directory, as follows:

  • 50-udev-default.rules

  • 58-xscsi.rules

  • 60-persistent-storage.rules

This section describes these rules.

50-udev-default.rules


Note: This section only applies to SGI Altix systems with SGI ProPack 6 running SLES10 or SLES11.


The rules in the 50-udev-default.rules file cause the SCSI drivers sd_mod, osst, st, sr_mod, and sg to be loaded automatically when appropriate devices are found. These rules are, as follows:

SUBSYSTEM=="scsi_device", ACTION=="add", SYSFS{type}=="0|7|14", RUN+="/sbin/modprobe sd_mod"
SUBSYSTEM=="scsi_device", ACTION=="add", SYSFS{type}=="1", SYSFS{vendor}=="On[sS]tream", RUN+="/sbin/modprobe osst"
SUBSYSTEM=="scsi_device", ACTION=="add", SYSFS{type}=="1", RUN+="/sbin/modprobe st"                              
SUBSYSTEM=="scsi_device", ACTION=="add", SYSFS{type}=="[45]", RUN+="/sbin/modprobe sr_mod"                       
SUBSYSTEM=="scsi_device", ACTION=="add", RUN+="/sbin/modprobe sg"

You can comment out all of these rules to save a call to the modprobe(8) command for each SCSI device and thereby reduce boot time, as follows:

#SUBSYSTEM=="scsi_device", ACTION=="add", SYSFS{type}=="0|7|14", RUN+="/sbin/modprobe sd_mod"
#SUBSYSTEM=="scsi_device", ACTION=="add", SYSFS{type}=="1", SYSFS{vendor}=="On[sS]tream", RUN+="/sbin/modprobe osst"
#SUBSYSTEM=="scsi_device", ACTION=="add", SYSFS{type}=="1", RUN+="/sbin/modprobe st"                              
#SUBSYSTEM=="scsi_device", ACTION=="add", SYSFS{type}=="[45]", RUN+="/sbin/modprobe sr_mod"                       
#SUBSYSTEM=="scsi_device", ACTION=="add", RUN+="/sbin/modprobe sg"

Make sure the drivers are loaded by adding them to the INITRD_MODULES variable in the /etc/sysconfig/kernel file, and then run the mkinitrd(8) command to make sure the changes are picked up.
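
For example, a hedged sketch of this change follows; the driver list is illustrative, and the <existing modules> placeholder stands for whatever entries your /etc/sysconfig/kernel file already contains. In the /etc/sysconfig/kernel file:

INITRD_MODULES="<existing modules> sd_mod st sr_mod sg"

Then rebuild the initial RAM disk, as root:

mkinitrd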

58-xscsi.rules


Note: This section only applies to SGI Altix systems with SGI ProPack 6 running SLES10 or SLES11.


The rules in the 58-xscsi.rules file create the /dev/xscsi/pci... persistent device symbolic links. The rules are, as follows:

# This rule creates sg symlinks for all SCSI devices (/dev/xscsi/pci..../sg)
KERNEL=="sg*", PROGRAM="/sbin/udev_xscsi %p", SYMLINK+="%c"

# This rule creates symlinks for entire disks/luns (/dev/xscsi/pci..../disc)
KERNEL=="sd*[a-z]", PROGRAM="/sbin/udev_xscsi %p", SYMLINK+="%c"

# This rule creates symlinks for disk partitions (/dev/xscsi/pci..../partX)
KERNEL=="sd*[a-z]*[0-9]", PROGRAM="/sbin/udev_xscsi %p", SYMLINK+="%c"

You need the rule that creates the disk/lun symbolic links for XVM in this file (the middle rule). You can comment out the rule that creates the sg symbolic links and the rule that creates the disk partition symbolic links (the top and bottom rules, respectively).
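
For example, with the top and bottom rules commented out, the file would look like the following, leaving only the disk/lun rule active:

# This rule creates sg symlinks for all SCSI devices (/dev/xscsi/pci..../sg)
#KERNEL=="sg*", PROGRAM="/sbin/udev_xscsi %p", SYMLINK+="%c"

# This rule creates symlinks for entire disks/luns (/dev/xscsi/pci..../disc)
KERNEL=="sd*[a-z]", PROGRAM="/sbin/udev_xscsi %p", SYMLINK+="%c"

# This rule creates symlinks for disk partitions (/dev/xscsi/pci..../partX)
#KERNEL=="sd*[a-z]*[0-9]", PROGRAM="/sbin/udev_xscsi %p", SYMLINK+="%c"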

60-persistent-storage.rules


Note: This section only applies to SGI Altix 4000 series systems with SGI ProPack 6 running SLES10 or SLES11.


The rules in the 60-persistent-storage.rules file create persistent storage links. They are not necessary and can all be commented out, or you can add GOTO="persistent_storage_end" at the top of the file to accomplish the same thing.
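
A minimal sketch of the second approach follows; it assumes, as the GOTO target implies, that the distributed rules file ends with a matching LABEL="persistent_storage_end" line:

# Added near the top of /etc/udev/rules.d/60-persistent-storage.rules,
# before the first rule, to skip all of the persistent storage rules:
GOTO="persistent_storage_end"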

Filesystems Changes

The tmpfs filesystem memory allocations have changed for the SGI ProPack 6 release. Prior to the SGI ProPack 6 release, allocations were always done round-robin on all nodes. With SGI ProPack 6, this behavior is controlled by a tmpfs mount option. This is actually a Linux kernel change that applies to the SLES10, SLES11, and RHEL5 base releases.


Note: To maintain SGI ProPack 4 tmpfs filesystem memory allocation default behavior, use the tmpfs filesystem mpol=interleave mount option.


If the CONFIG_NUMA flag is enabled in the kernel, the tmpfs filesystem has a mount option to set the NUMA memory allocation policy for all files in that filesystem. You can adjust this on a running system, as follows:

mount -o remount ... 

The following mount options apply:
mpol=default

Prefers to allocate memory from the local node

mpol=prefer:Node

Prefers to allocate memory from the given node

mpol=bind:NodeList

Allocates memory only from nodes in NodeList

mpol=interleave

Prefers to allocate from each node in turn

mpol=interleave:NodeList

Allocates from each node of NodeList in turn

The NodeList format is a comma-separated list of decimal numbers and ranges; a range is two hyphen-separated decimal numbers, the smallest and largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15.

Trying to mount a tmpfs filesystem with an mpol option fails if the running kernel does not support the NUMA architecture. It also fails if its NodeList argument specifies a node greater than or equal to MAX_NUMNODES.

If your system relies on that tmpfs filesystem being mounted, but from time to time runs a kernel built without NUMA capability (such as a safe recovery kernel) or configured to support fewer nodes, it is advisable to omit the mpol option from automatic mount options. It can be added later, when the tmpfs is already mounted on MountPoint, using the following:

mount -o remount,mpol=Policy:NodeList MountPoint
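
For example, a hedged sketch follows; the mount point /mnt/tmp and the node list are hypothetical and assume a NUMA-capable kernel with at least four nodes:

mount -t tmpfs -o size=4g,mpol=interleave tmpfs /mnt/tmp
mount -o remount,mpol=bind:0-3 /mnt/tmp

The first command mounts a tmpfs filesystem that interleaves its allocations across all nodes; the second restricts allocations to nodes 0 through 3 without unmounting the filesystem.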

For more information on the tmpfs filesystem, see the tmpfs documentation under /usr/src/linux-2.6.x.x-x/Documentation/filesystems on your system.

I/O Subsystems

Although some HPC workloads might be mostly CPU bound, others involve processing large amounts of data and require an I/O subsystem capable of moving data between memory and storage quickly, as well as having the ability to manage large storage farms effectively. The XSCSI subsystem, XFS filesystem, XVM volume manager, and data migration facilities were leveraged from the IRIX operating system and ported to provide a robust, high-performance, and stable storage I/O subsystem on Linux.

The following sections describe persistent PCI-X bus numbering, persistent naming of Ethernet devices, the XSCSI subsystem, the XSCSI-SCSI subsystem, the XFS filesystem, and the XVM Volume Manager.

This section covers the following topics:

XSCSI Naming Systems on SGI ProPack Systems

This section describes XSCSI naming systems on SGI ProPack systems.


Note: The XSCSI subsystem on SGI ProPack 3 systems is an I/O infrastructure that leverages technology from the IRIX operating system to provide more robust error handling, failover, and storage area network (SAN) infrastructure support, as well as long-term, large system performance tuning. This subsystem is not necessary on SGI ProPack 4 or later systems. However, the XSCSI naming convention is still used on SGI ProPack 3, SGI ProPack 4, SGI ProPack 5, and SGI ProPack 6 systems. XSCSI naming provides persistent naming for devices by using persistent PCI bus numbering. For SGI Altix 450 and 4700 systems, see “PCI Domain Support for SGI Altix Systems”.


This section covers the following topics:

XSCSI Names on Non-blade Systems

For SGI ProPack 3 and SGI ProPack 4 non-blade systems, the XSCSI name has the following forms.

For direct attached devices, the form of the XSCSI name is, as follows:

/dev/xscsi/pciBB.SS.F[-C]/targetX/lunL/partT

For SAN attached devices, it is, as follows:

/dev/xscsi/pciBB.SS.F[-P]/nodeWWN/portP/lunL/partT

Where:

BB is the PCI bus number

SS is the slot

F is the function

C is the channel ("-C" for QLA12160 HBA cards only)

For direct attach devices:

X is the target number

For SAN attached devices:

WWN is the world wide node number of the device

P is the port number of the device

For either direct attach or SAN attach devices:

L is the logical unit number (LUN) in lunL

T is the partition number

There are two ways of handling multiple port host bus adapter (HBA) cards in PCI. One way is to have each port be a different function. The other way is to have one function but have multiple channels off of the function. Most HBA cards use multiple functions. Therefore, most HBA cards have differing F (function) numbers and the -C is absent. The QLA12160 (Qlogic parallel SCSI) HBA uses multiple channels. Therefore, it has one function, "0", and multiple channels, that is, "0-1" and "0-2".

An example of a direct attached device with partition 1 is, as follows:

/dev/xscsi/pci01.02.0/target1/lun0/part1

The same device attached off of a Qlogic SCSI HBA card (or IO9 base I/O) would be, as follows:

/dev/xscsi/pci01.02.0/target1/lun0/part1

An example of a SAN attached device is, as follows:

/dev/xscsi/pci22.01.1/node20000004cf2c84de/port1/lun0/part1

Domain-Based XSCSI Names

All SGI Altix systems running SGI ProPack 6 for Linux use domain-based XSCSI names. The XSCSI names change to accommodate PCI Express. They basically have the same form except that the PCI numbering takes the following form:

/dev/xscsi/pciDDDD:BB:SS.F[-C]/...

Where:

DDDD is the domain number

BB is the bridge number

SS is the slot number

F is the function

C is the channel

An example of a direct attach device with partition 1 is, as follows:

/dev/xscsi/pci0001:00:03.0-1/target1/lun0/part1

An example of a SAN attached device with partition 1 is, as follows:

/dev/xscsi/pci0002:00:01.0/node20000004cf2c8d0c/port2/lun0/part1

Note that the device number (slot number), function, WWN, logical unit, and port number are fixed. These will never change. However, the system bus number could change because of a hardware problem (such as an I/O brick not booting) or because the system is reconfigured.

For non-PCI Express host bus adapter (HBA) cards in an SGI Altix 4700 system, the domain number is equivalent to the old bus number and the bridge number is 0. For more information, see “PCI Domain Support for SGI Altix Systems”.
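
As a hedged illustration of this mapping (assuming the same HBA were installed in an SGI Altix 4700 system), the SAN attached device shown earlier as:

/dev/xscsi/pci22.01.1/node20000004cf2c84de/port1/lun0/part1

would appear under domain-based naming with the old bus number 22 becoming domain 0022 and the bridge number set to 00:

/dev/xscsi/pci0022:00:01.1/node20000004cf2c84de/port1/lun0/part1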

Persistent Network Interface Names


Note: This section only applies to SGI Altix systems with SGI ProPack 6 running SLES10 or SLES11.


Ethernet persistent naming behavior changes in SGI ProPack 6 for Linux from prior releases. This functionality is provided by the base operating system.

The basic change is that the first time an SGI ProPack 6 system is booted after installation, a udev rule defined in /etc/udev/rules.d/31-net_create_names.rules is invoked that enumerates all of the Ethernet devices on the system. It then writes another rule file called /etc/udev/rules.d/30-net_persistent_names.rules. This file contains a mapping of the media access control (MAC) addresses to Ethernet device names. A specific physical interface is always mapped to the same Ethernet device name.

When a system is rebooted, the same Ethernet device names are always mapped back to the same MAC addresses, even if some of the interfaces have been removed in the interim. For example, if the Ethernet 1 and Ethernet 3 devices were removed, a sparsely populated Ethernet space of Ethernet 0, Ethernet 2, and Ethernet 4 would result.

To re-enumerate the devices attached to your system, delete the /etc/udev/rules.d/30-net_persistent_names.rules file. When the system is rebooted, a new rules file is created for the current complement of network devices.
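
For example, a brief sketch of the re-enumeration procedure follows (run as root):

rm /etc/udev/rules.d/30-net_persistent_names.rules
reboot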

For more information, see the SGI ProPack 6 /usr/share/doc/packages/sysconfig/README.Persistent_Interface_Names file.

PCI Domain Support for SGI Altix Systems


Note: This section does not apply to SGI Altix XE systems.


On an SGI Altix system, a PCI domain is a functional entity that includes a root bridge, subordinate buses under the root bridge, and the peripheral devices it controls. Separation, management, and protection of PCI domains is implemented and controlled by system software.

Previously, a PCI device was identified by bus:slot:function. Now, a PCI device is identified by domain:bus:slot:function.

Domains (sometimes referred to as PCI segments) are numbered from 0 to ffff, buses from 0 to ff, slots from 0 to 1f, and functions from 0 to 7.

A domain consists of a root bridge (with its bus numbered zero); subordinate buses under that root bridge are numbered 1 through 255.

In the past, a PA-brick was numbered bus 01, 02, 03, 04 (a bus for each root bridge, that is, each TIO ASIC). With PCI domain support, each root bridge is numbered with a domain/bus number: 0001:00, 0002:00, 0003:00, 0004:00. If a subordinate bus is plugged into the root bridge bus, it has the same domain number as the root bridge but a different bus number (for example, 0001:01).

For PCI Express, the PCIe root complex is its own domain. Each port is a subordinate bus under that domain.

Domain numbers are allocated starting from the lowest module ID (that is, rack/slot/blade location).

System Partitioning

This section describes how to partition an SGI ProPack server and contains the following topics:

This section does not apply to SGI Altix XE systems.

Overview

A single SGI ProPack for Linux server can be divided into multiple distinct systems, each with its own console, root filesystem, and IP network address. Each of these software-defined groups of processors is a distinct system referred to as a partition. Each partition can be rebooted, loaded with software, powered down, and upgraded independently. The partitions communicate with each other over an SGI NUMAlink connection. Collectively, all of these partitions compose a single, shared-memory cluster.

Direct memory access between partitions, sometimes referred to as global shared memory, is made available by the XPC and XPMEM kernel modules. This allows processes in one partition to access physical memory located on another partition. The benefits of global shared memory are currently available via SGI's Message Passing Toolkit (MPT) software.

It is relatively easy to configure a large SGI Altix system into partitions and reconfigure the machine for specific needs. No cable changes are needed to partition or repartition an SGI Altix machine. Partitioning is accomplished by commands sent to the system controller. For details on system controller commands, see the SGI L1 and L2 Controller Software User's Guide.

Advantages of Partitioning

This section describes the advantages of partitioning an SGI ProPack server as follows:

Create a Large, Shared-memory Cluster

You can use SGI's NUMAlink technology and the XPC and XPMEM kernel modules to create a very low latency, very large, shared-memory cluster for optimized use of Message Passing Interface (MPI) software and logically shared, distributed memory access (SHMEM) routines. The globally addressable, cache coherent, shared memory is exploited by MPI and SHMEM to deliver high performance.

Provides Fault Containment

Another reason for partitioning a system is fault containment. In most cases, a single partition can be brought down (because of a hardware or software failure, or as part of a controlled shutdown) without affecting the rest of the system. Hardware memory protections prevent any unintentional accesses to physical memory on a different partition from reaching and corrupting that physical memory. For current fault containment caveats, see “Limitations of Partitioning”.

You can power off and “warm swap” a failing C-brick in a down partition while other partitions are powered up and booted. For information see “Adding or Replacing a PCI or PCI-X Card” in chapter 12, “Maintenance and Upgrade Procedures” in the SGI Altix 3000 User's Guide or see “PCI and PCI-X Cards” in Chapter 6, “Installing and Removing Customer-replaceable Units” in the SGI Altix 350 User's Guide or see the appropriate chapter in the new SGI Altix 4700 User's Guide.

Allows Variable Partition Sizes

Partitions can be of different sizes, and a particular system can be configured in more than one way. For example, a 128-processor system could be configured into four partitions of 32 processors each or configured into two partitions of 64 processors each. (See "Supported Configurations" for a list of supported configurations for system partitioning.)

Your choice of partition size and number of partitions affects both fault containment and scalability. For example, you may want to dedicate all 64 processors of a system to a single large application during the night, but then partition the system in two 32 processor systems for separate and isolated use during the day.

Provide High Performance Clusters

One of the fundamental factors that determines the performance of a high-end computer is the bandwidth and latency of the memory. The SGI NUMAflex technology gives an SGI ProPack partitioned, shared-memory cluster a huge performance advantage over a cluster of commodity Linux machines (white boxes). If a cluster of N white boxes, each with M CPUs, is connected via Ethernet, Myrinet, or InfiniBand, an SGI ProPack system with N partitions of M CPUs provides superior performance because of the significantly lower latency of the NUMAlink interconnect, which is exploited by the XPNET kernel module.

Limitations of Partitioning

Partitioning can increase the reliability of a system because power failures and other hardware errors can be contained within a particular partition. There are still cases where the whole shared-memory cluster is affected; for example, during upgrades of hardware that is shared by multiple partitions.

If a partition is sharing its memory with other partitions, the loss of that partition may take down all other partitions that were accessing its memory. This is currently possible when an MPI or SHMEM job is running across partitions using the XPMEM kernel module.

Failures can usually be contained within a partition even when memory is being shared with other partitions. XPC is invoked using normal shutdown commands such as reboot(8) and halt(8) to ensure that all memory shared between partitions is revoked before the partition resets. This is also done if you remove the XPC kernel modules using the rmmod(8) command. Unexpected failures such as kernel panics or hardware failures almost always force the affected partition into the KDB kernel debugger or the LKCD crash dump utility. These tools also invoke XPC to revoke all memory shared between partitions before the partition resets. XPC cannot be invoked for unexpected failures such as power failures and spontaneous resets (not generated by the operating system), and thus all partitions sharing memory with the partition may also reset.

Supported Configurations

See the SGI Altix 3000 User's Guide, SGI Altix 350 User's Guide, SGI Altix 450 User's Guide, or the SGI Altix 4700 User's Guide for information on configurations that are supported for system partitioning. Currently, the following guidelines are valid for the SGI ProPack 6 for Linux release:

  • Maximum number of partitions supported is 48

  • Maximum partition size is 1024 cores

  • Maximum system size is 9726 cores

For additional information about configurations that are supported for system partitioning, see your sales representative.

Installing Partitioning Software and Configuring Partitions

To enable or disable partitioning software, see “Partitioning Software”. To use the system partitioning capabilities, see “Partitioning Guidelines for SGI Altix 3000 Series Systems” and “Partitioning a System”.

This section covers the following topics:

Partitioning Software

SGI ProPack for Linux servers have XP, XPC, XPNET, and XPMEM kernel modules installed by default to provide partitioning support. XPC and XPNET are configured off by default in the /etc/sysconfig/sgi-xpc and /etc/sysconfig/sgi-xpnet files, respectively. XPMEM is configured on by default in the /etc/sysconfig/sgi-xpmem file. To enable or disable any of these features, edit the appropriate /etc/sysconfig/ file and execute the /etc/init.d/sgi-xp script.

On SGI ProPack systems running SLES10 or SLES11, if you intend to use the cross-partition functionality of XPMEM, you need to add xpc to the line in the /etc/sysconfig/kernel file that begins with MODULES_LOADED_ON_BOOT. Once that is added, you may either reboot the system or issue a modprobe xpc command to get the cross-partition functionality to start working. For more information on using modprobe, see the modprobe(8) man page.
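
For example, a hedged sketch of this change follows; the <existing modules> placeholder stands for whatever module names are already on the line in /etc/sysconfig/kernel:

MODULES_LOADED_ON_BOOT="<existing modules> xpc"

Then either reboot, or load the module immediately as root:

modprobe xpc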

To activate xpc on future boots on SGI ProPack 6 systems running RHEL5, you need to add a modprobe xpc line to the /etc/sysconfig/modules/sgi-propack.modules file instead of adding the module name to the MODULES_LOADED_ON_BOOT line.
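
For example, the following hedged sketch appends the required line as root:

echo "modprobe xpc" >> /etc/sysconfig/modules/sgi-propack.modules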

The XP kernel module is a simple module which coordinates activities between XPC, XPMEM, and XPNET. All of the other cross-partition kernel modules require XP to function.

The XPC kernel module provides fault-tolerant, cross-partition communication channels over NUMAlink for use by the XPNET and XPMEM kernel modules.

The XPNET kernel module implements an Internet protocol (IP) interface on top of XPC to provide high-speed network access via NUMAlink. XPNET can be used by applications to communicate between partitions via NUMAlink, to mount file systems across partitions, and so on. The XPNET driver is configured using the ifconfig command. For more information, see the ifconfig(1M) man page. The procedure for configuring the XPNET kernel module as a network driver is essentially the same as the procedure used to configure the Ethernet driver. You can configure the XPNET driver at boot time like the Ethernet interfaces by using the configuration files in /etc/sysconfig/network-scripts. To configure the XPNET driver as a network driver, see Procedure 2-1, which follows.
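
For boot-time configuration, a hedged sketch of such a configuration file follows; the file name ifcfg-xp0, the IP address, and the netmask are hypothetical, and the exact set of variables depends on your distribution:

# /etc/sysconfig/network-scripts/ifcfg-xp0
DEVICE=xp0
BOOTPROTO=static
IPADDR=192.168.100.1
NETMASK=255.255.255.0
ONBOOT=yes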

Procedure 2-1. Setting up Networking Between Partitions

    The procedure for configuring the XPNET driver as a network driver is essentially the same as the procedure used to configure the Ethernet driver (eth0), as follows:

    1. Log in as root.

    2. On an SGI ProPack 3 system, configure the xp0 IP address, as follows:

      netconfig -d xp0

      For later SGI ProPack systems, configure the xp0 IP address using yast2. For information on using yast2, see the SUSE LINUX Enterprise Server 9 Installation and Administration manual. The driver's full name inside yast2 is SGI Cross Partition Network adapter.

    3. Add the network address for the xp0 interface by editing the /etc/hosts file.

    4. Reboot your system or restart networking.

    The XPMEM kernel module provides direct access to memory located on other partitions. It uses XPC internally to communicate with XPMEM kernel modules on other partitions to accomplish this. XPMEM is currently used by SGI's Message Passing Toolkit (MPT) software (MPI and SHMEM).

    Partitioning Guidelines

    Partitioning rules define the set of valid configurations for a partitioned system. Fault isolation is one of the major reasons for partitioning a system. A software or hardware failure in one partition should not cause a failure in another partition. This section describes the restrictions that are placed on partitions to accomplish this. This section covers these topics:

    Partitioning Guidelines for SGI Altix 3000 Series Systems

    Follow these guidelines when partitioning your system:

    • A partition must be made up of one or more C-bricks. The number of C-bricks in your system determines the number of partitions you can create. The number of partitions cannot exceed the number of C-bricks your system contains. The first C-brick in each partition must have an IX-brick attached to it via the XIO connection of the C-brick.

    • You need at least as many IX-bricks for base IO as partitions you wish to use.

    • Each partition needs to have an IX-brick with a valid system disk in it. Since each partition is a separate running system, each system disk should be configured with a different IP address/system name, and so on.

    • Each partition must have a unique partition ID number between 1 and 63, inclusively.

    • All bricks in a partition must be physically contiguous. The route between any two processors in the same partition must be contained within that partition, and not through any other partition. If the bricks in a partition are not contiguous, the system will not boot.

    • Each partition must contain the following components:

      • At least one C-brick for system sizes of 64 C-bricks and below, or multiples of 4 C-bricks for system sizes of 65 C-bricks and above (minimum)

      • One IX-brick (minimum)

      • One root disk

      • One console connection

    Partitioning Guidelines for SGI Altix 450 Series Systems

    Partitioning guidelines on an SGI Altix 450 system are, as follows:

    • The minimum granularity for a partition is one IRU (ideally with its own power supply setup). On Altix 450 systems, this means four compute blades is the minimum level of hardware isolation.

    • Each partition must have the infrastructure to run as a standalone system. This infrastructure includes a system disk and console connection.

    • An I/O blade belongs to the partition that the attached IRU belongs to. I/O blades cannot be shared by two partitions.

    • Peripherals, such as dual-ported disks, can be shared the same way two nodes in a cluster can share peripherals.

    • Partitions must be contiguous in the topology (for example, the route between any two nodes in the same partition must be contained within that partition - and not route through any other partition). This allows intra-partition communication to be independent of other partitions.

    • Partitions must be fully interconnected. That is to say, for any two partitions, there is a direct route between those partitions without passing through a third. This is required to fulfill true isolation of a hardware or software fault to the partition in which it occurs.

    Partitioning Guidelines for SGI Altix 4000 Series Systems

    Partitioning guidelines on an SGI Altix 4700 system are, as follows:

    • The minimum granularity for a partition is one individual rack unit (IRU) (ideally with its own power supply setup). On SGI Altix 4700 systems, this means eight compute blades is the minimum level of hardware isolation.

    • Each partition must have the infrastructure to run as a standalone system. This infrastructure includes a system disk and console connection.

    • An I/O blade belongs to the partition to which the attached IRU belongs. I/O blades cannot be shared by two partitions.

    • Peripherals, such as dual-ported disks, can be shared the same way two nodes in a cluster can share peripherals.

    • Partitions must be contiguous in the topology (for example, the route between any two nodes in the same partition must be contained within that partition; and not route through any other partition). This allows intra-partition communication to be independent of other partitions.

    • Quad-dense meta-routers (32-port routers) are a shared resource. A single quad-dense meta-router can connect to multiple partitions.

    • Partitions must be fully interconnected. That is, for any two partitions, there is a direct route between those partitions without passing through a third. This is required to fulfill true isolation of a hardware or software fault to the partition in which it occurs.

    • When the total system is greater than 16 IRUs (128 SHubs), it runs in coarse mode. In coarse mode, the minimum partition size is two IRUs (16 SHubs).

    Partitioning a System

    This section describes how to partition your system.

    Procedure 2-2. Partitioning a System Into Four Partitions

      To partition your system, perform the following steps:

      1. Make sure your system can be partitioned. See “Partitioning Guidelines for SGI Altix 3000 Series Systems”.

      2. You can use the Connect to System Controller task of the SGIconsole Console Manager GUI to connect to the L2 controller of the system you want to partition. The L2 controller must appear as a node in the SGIconsole configuration. For information on how to use SGIconsole, see the Console Manager for SGIconsole Administrator's Guide.

      3. Using the L2 terminal (l2term), connect to the L2 controller of the system you wish to partition. After a connection to the L2 controller, an L2> prompt appears, indicating that the L2 is ready to accept commands, for example:

        cranberry-192.168.11.92-L2>:

        If the L2 prompt does not appear, you can type Ctrl-T. To remain at the L2 command prompt, type l2 (lowercase letter 'L') at the L2> prompt.


        Note: Each partition has its own set of the following PROM environment variables: ConsolePath, OSLoadPartition, SystemPartition, netaddr, and root.

        For more information on using the L2 controller, see the SGI L1 and L2 Controller Software User's Guide.

        You can partition a system from an SGIconsole Console Manager system console connection; however, SGIconsole does not include any GUI awareness of partitions (in the node tree view, for instance) or in the commands, and there is no way to power down a group of partitions, get all the logs of a partitioned system, or actually partition a system. If you partition a node that is managed by SGIconsole, make sure to edit the partition number of the node using the Modify a Node task. For more information, see the Console Manager for SGIconsole Administrator's Guide.


      4. Use the L2 sel command to list the available consoles, as follows:

        cranberry-192.168.11.92-L2>sel


        Note: If the Linux operating system is currently executing, perform the proper shutdown procedures before you partition your system.


      5. This step shows an example of how to partition a system into four separate partitions.

        1. To see the current configuration of your system, use the L2 cfg command to display the available bricks, as follows:

          cranberry-192.168.11.92-L2>cfg
               L2 192.168.11.92: - --- (no rack ID set) (LOCAL)
               L1 192.168.11.92:8:0    - 001c31    
               L1 192.168.11.92:8:1    - 001i34    
               L1 192.168.11.92:11:0   - 001c31    
               L1 192.168.11.92:11:1   - 001i34    
               L1 192.168.11.92:6:0    - 001r29    
               L1 192.168.11.92:9:0    - 001r27    
               L1 192.168.11.92:7:0    - 001c24    
               L1 192.168.11.92:7:1    - 101i25    
               L1 192.168.11.92:10:0   - 001c24    
               L1 192.168.11.92:10:1   - 101i25    
               L1 192.168.11.92:0:0    - 001r22    
               L1 192.168.11.92:3:0    - 001r20    
               L1 192.168.11.92:2:0    - 001c14    
               L1 192.168.11.92:2:1    - 101i21    
               L1 192.168.11.92:5:0    - 001c14    
               L1 192.168.11.92:5:1    - 101i21    
               L1 192.168.11.92:1:0    - 001c11    
               L1 192.168.11.92:1:1    - 001i07    
               L1 192.168.11.92:4:0    - 001c11    
               L1 192.168.11.92:4:1    - 001i07   

        2. In this step, you need to decide which bricks to put into which partitions.

          You can determine which C-bricks are directly attached to IX-bricks by looking at the output from the cfg command. Consult the hardware configuration guide for the partitioning layout for your particular system. In the cfg output above, you can check the number after the IP address. For example, 001c31 is attached to 001i34, which is indicated by the fact that they both have 11 after their respective IP addresses.


          Note: On some systems, you will have a rack ID in place of the IP address. 001c31 is a C-brick (designated by the c in 001c31) and 001i34 is an IX-brick (designated with an i in 001i34).


          Another pair is 101i25 and 001c24. They both have 10 after the IP address. The brick names containing an r designation are routers. Routers do not need to be designated to a specific partition number.

          In this example, the maximum number of partitions this system can have is four. There are only four IX-bricks total: 001i07, 101i21, 101i25, and 001i34.


          Note: Some IX-brick names appear twice. This occurs because some IX-bricks have dual XIO connections.

          You do not have to explicitly assign IX-bricks to a partition. The IX-bricks assigned to a partition are inherited from the C-bricks.


        3. When you specify bricks to L2 commands, you use a rack.slot naming convention. To configure the system into four partitions, do not specify the whole brick name (001c31) but rather use the designation 1.31 as follows:

          cranberry-192.168.11.92-L2>1.31 brick part 1
               001c31:
               brick partition set to 1.
               cranberry-192.168.11.92-L2>1.34 brick part 1
               001#34:
               brick partition set to 1.
               cranberry-192.168.11.92-L2>1.24 brick part 2
               001c24:
               brick partition set to 2.
               cranberry-192.168.11.92-L2>101.25 brick part 2
               101#25:
               brick partition set to 2.
               cranberry-192.168.11.92-L2>1.14 brick part 3
               001c14:
               brick partition set to 3.
               cranberry-192.168.11.92-L2>101.21 brick part 3
               101i21:
               brick partition set to 3.
               cranberry-192.168.11.92-L2>1.11 brick part 4
               001c11:
               brick partition set to 4.
               cranberry-192.168.11.92-L2>1.07 brick part 4
               001#07:
               brick partition set to 4.

        4. To confirm your settings, enter the cfg command again, as follows:


          Note: This may take up to 30 seconds.


          cranberry-192.168.11.92-L2>cfg
               L2 192.168.11.92: - --- (no rack ID set) (LOCAL)
               L1 192.168.11.92:8:0    - 001c31.1  
               L1 192.168.11.92:8:1    - 001i34.1  
               L1 192.168.11.92:11:0   - 001c31.1  
               L1 192.168.11.92:11:1   - 001i34.1  
               L1 192.168.11.92:6:0    - 001r29    
               L1 192.168.11.92:9:0    - 001r27    
               L1 192.168.11.92:7:0    - 001c24.2  
               L1 192.168.11.92:7:1    - 101i25.2  
               L1 192.168.11.92:10:0   - 001c24.2  
               L1 192.168.11.92:10:1   - 101i25.2  
               L1 192.168.11.92:0:0    - 001r22    
               L1 192.168.11.92:3:0    - 001r20    
               L1 192.168.11.92:2:0    - 001c14.3  
               L1 192.168.11.92:2:1    - 101i21.3  
               L1 192.168.11.92:5:0    - 001c14.3  
               L1 192.168.11.92:5:1    - 101i21.3  
               L1 192.168.11.92:1:0    - 001c11.4  
               L1 192.168.11.92:1:1    - 001i07.4  
               L1 192.168.11.92:4:0    - 001c11.4  
               L1 192.168.11.92:4:1    - 001i07.4

        5. The system is now partitioned. However, you need to reset each partition to complete the configuration, as follows:

          cranberry-192.168.11.92-L2>p 1,2,3,4 rst


          Note: You can use a shortcut to reset every partition, as follows:
          cranberry-192.168.11.92-L2>p * rst 



        6. To get to the individual console of a partition, such as partition 2, enter the following:

          cranberry-192.168.11.92-L2>sel p 2

          For more information on accessing the console of a partition, see “Accessing the Console on a Partitioned System”.

      Procedure 2-3. Partitioning a System into Two Partitions

        To partition your system, perform the following steps:

        1. Perform steps 1 through 5 in Procedure 2-2.

        2. To configure the system into two partitions, enter the following commands:

          cranberry-192.168.11.92-L2>1.31 brick part 1
               001c31:
               brick partition set to 1.
               cranberry-192.168.11.92-L2>1.34 brick part 1
               001#34:
               brick partition set to 1.
               cranberry-192.168.11.92-L2>1.24 brick part 1
               001c24:
               brick partition set to 1.
               cranberry-192.168.11.92-L2>101.25 brick part 1
               101#25:
               brick partition set to 1.
               cranberry-192.168.11.92-L2>1.14 brick part 2
               001c14:
               brick partition set to 2.
               cranberry-192.168.11.92-L2>101.21 brick part 2
               101i21:
               brick partition set to 2.
               cranberry-192.168.11.92-L2>1.11 brick part 2
               001c11:
               brick partition set to 2.
               cranberry-192.168.11.92-L2>1.7 brick part 2
               001#07:
               brick partition set to 2.

        3. To confirm your settings, issue the cfg command again, as follows:


          Note: This may take up to 30 seconds.


          cranberry-192.168.11.92-L2>cfg
               L2 192.168.11.92: - --- (no rack ID set) (LOCAL)
               L1 192.168.11.92:8:0    - 001c31.1  
               L1 192.168.11.92:8:1    - 001i34.1  
               L1 192.168.11.92:11:0   - 001c31.1  
               L1 192.168.11.92:11:1   - 001i34.1  
               L1 192.168.11.92:6:0    - 001r29    
               L1 192.168.11.92:9:0    - 001r27    
               L1 192.168.11.92:7:0    - 001c24.1  
               L1 192.168.11.92:7:1    - 101i25.1  
               L1 192.168.11.92:10:0   - 001c24.1  
               L1 192.168.11.92:10:1   - 101i25.1  
               L1 192.168.11.92:0:0    - 001r22    
               L1 192.168.11.92:3:0    - 001r20    
               L1 192.168.11.92:2:0    - 001c14.2  
               L1 192.168.11.92:2:1    - 101i21.2  
               L1 192.168.11.92:5:0    - 001c14.2  
               L1 192.168.11.92:5:1    - 101i21.2  
               L1 192.168.11.92:1:0    - 001c11.2  
               L1 192.168.11.92:1:1    - 001i07.2  
               L1 192.168.11.92:4:0    - 001c11.2  
               L1 192.168.11.92:4:1    - 001i07.2  
                   

        4. Now the system has two partitions. To complete the configuration, reset the two partitions as follows:

        cranberry-192.168.11.92-L2>p 1,2 rst

        Determining If a System is Partitioned

        Procedure 2-4. Determining If a System Is Partitioned

          To determine whether a system is partitioned or not, perform the following steps:

          1. Use the L2term to connect to the L2 controller of the system.


            Note: If you are connected to the L2 controller, but do not have the L2 prompt, try typing the following: CTRL-t.


          2. Use the cfg command to determine if the system is partitioned, as follows:

            cranberry-192.168.11.92-L2>cfg
            L2 192.168.11.92: -(no rack ID set) (LOCAL)
            L1 192.168.11.92:8:0    - 001c31.1  
            L1 192.168.11.92:8:1    - 001i34.1  
            L1 192.168.11.92:11:0   - 001c31.1  
            L1 192.168.11.92:11:1   - 001i34.1  
            L1 192.168.11.92:6:0    - 001r29    
            L1 192.168.11.92:9:0    - 001r27    
            L1 192.168.11.92:7:0    - 001c24.2  
            L1 192.168.11.92:7:1    - 101i25.2  
            L1 192.168.11.92:10:0   - 001c24.2  
            L1 192.168.11.92:10:1   - 101i25.2  
            L1 192.168.11.92:0:0    - 001r22    
            L1 192.168.11.92:3:0    - 001r20    
            L1 192.168.11.92:2:0    - 001c14.3  
            L1 192.168.11.92:2:1    - 101i21.3  
            L1 192.168.11.92:5:0    - 001c14.3  
            L1 192.168.11.92:5:1    - 101i21.3  
            L1 192.168.11.92:1:0    - 001c11.4  
            L1 192.168.11.92:1:1    - 001i07.4  
            L1 192.168.11.92:4:0    - 001c11.4  
            L1 192.168.11.92:4:1    - 001i07.4

          3. See the explanation of the output from the cfg command in Procedure 2-2.

          Accessing the Console on a Partitioned System

          Procedure 2-5. Access the Console on a Partitioned System

            To access the console on a partition, perform the following steps:

            1. Use the L2term to connect to the L2 controller of the system.


              Note: If you are connected to the L2 controller, but do not have the L2 prompt, try typing the following: CTRL-t.


            2. To see output that shows which C-bricks have system consoles, enter the sel command without options on a partitioned system as follows:

              cranberry-192.168.11.92-L2>sel
              
                   known system consoles (partitioned)
              
                           partition  1: 001c31 - L2 detected
                           partition  2: 001c24 - L2 detected
                           partition  3: 001c14 - L2 detected
                           partition  4: 001c11 - L2 detected
              
                   current system console
              
                   console input: not defined
                   console output: not filtered

              The output from the sel command shows that there are four partitions defined.

            3. To get to the console of partition 2, for example, enter the following:

              cranberry-192.168.11.92-L2>sel p 2

            4. To connect to the console of partition 2, enter Ctrl-d.

            When a system is partitioned, the L2 prompt shows the partition number of the partition you selected, as follows:

            cranberry-001-L2>sel p 2
            console input: partition 2, 001c24 console0
            console output: any brick partition 2
            cranberry-001-L2:p2>

            Unpartitioning a System

            Procedure 2-6. Unpartitioning a System

              To remove the partitions from a system, perform the following steps:

              1. Use the L2term to connect to the L2 controller of the system.


                Note: If you are connected to the L2 controller, but do not have the L2 prompt, try typing the following: CTRL-t.


              2. Shut down the Linux operating system running on each partition before unpartitioning a system.

              3. To set the partition ID on all bricks to zero, enter the following command:

                cranberry-192.168.11.92-L2>r * brick part 0

              4. To confirm that all the partitions on your system have been removed, enter the following command:

                cranberry-192.168.11.92-L2>cfg

                The names of the bricks in the list no longer have a dot followed by a number (see “Determining If a System is Partitioned”).

              5. To reset all of the bricks, enter the following command:

                cranberry-192.168.11.92-L2>r * rst 

              6. To get to the system console for the newly unpartitioned system, you need to reset the select setting as follows:

                 cranberry-192.168.11.92-L2>sel reset

              7. To get the console (assuming you still have the L2 prompt), enter Ctrl-d.

              Connecting the System Console to the Controller

              System partitioning is an administrative function. The system console is connected to the controller as required by the configuration selected when an SGI ProPack system is installed. For additional information or recabling, contact your service representative.

              Making Array Services Operational

              This section describes how to get Array Services operational on your system. For detailed information on Array Services, see chapter 3, “Array Services”, in the Linux Resource Administration Guide.


              Note: This section does not apply to SGI Altix XE systems.


              Standard Array Services is installed by default on an SGI ProPack 6 system. To install Secure Array Services, use YaST Software Management and use the Filter -> Search function to search for Secure Array Services by name (sarraysvcs).

              Procedure 2-7. Making Array Services Operational

                To make Array Services operational on your system, perform the following steps:


                Note: Most of the steps to install Array Services are now performed automatically when the Array Services RPM is installed. To complete installation, perform the steps that follow.


                1. Make sure that the setting in the /usr/lib/array/arrayd.auth file is appropriate for your site.


                  Caution: Changing the AUTHENTICATION parameter from NOREMOTE to NONE may have a negative security impact on your site.


                2. Make sure that the list of machines in your cluster is included in one or more array definitions in the /usr/lib/array/arrayd.conf file (see the sketch after this procedure).

                3. To determine if Array Services is correctly installed, run the following command:

                  array who

                  You should see yourself listed.
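
                A hedged sketch of an array definition for step 2 follows; the array name and hostnames are hypothetical, and the exact directive syntax is described in the comments of the distributed arrayd.conf file:

                  array mycluster
                          machine host1
                          machine host2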

                Floating Point Assist Warnings from Applications

                Some applications can generate an excessive number of kernel KERN_WARN "floating point assist" warning messages. This section describes how you can make these messages disappear for specific applications or for specific users.


                Note: This section does not apply to SGI Altix XE systems.


                An application generates a "floating point assist" trap when a floating point operation involves a corner case of operand value(s) that the Itanium processor cannot handle in hardware and requires kernel emulation software to complete.

                Sometimes the application can be recompiled with specific options that will avoid such problematic operand value(s). Using the -ffast-math option with gcc or the -ftz option with the Intel compiler might be effective. Other instances can be avoided by recoding the application to avoid denormalized or non-finite operands.

                When recoding or recompiling is ineffective or infeasible, the user running the application or the system administrator can avoid having these "floating point assist" syslog messages appear by using the prctl(1) command, as follows:

                 % prctl --fpemu=silent command

                The command and every child process of the command invoked with prctl will produce no "floating point assist" messages. The command can be a single application that is producing unwanted syslog messages or it may be any shell script. For example, the "floating point assist" messages for the user as a whole, that is, for all applications that the user may execute, can be silenced by changing the /etc/passwd entry of the user to invoke a custom script at login, instead of executing (for instance) /bin/bash. That custom script then launches the user's high level shell using the following:

                prctl --fpemu=silent /bin/bash

                Even if the syslog messages are silenced, the "assist" traps will continue to occur and the kernel software will still handle the problem. If these traps occur at a high enough frequency, the application performance may suffer and notifications of these occurrences are not logged.

                The syslog messages can be made to reappear by executing the prctl command again, as follows:

                 % prctl --fpemu=default command

                Unaligned Access Warnings


                Note: This section does not apply to SGI Altix XE systems.


                This section describes unaligned access warnings, as follows:

                • The kernel generates unaligned access warnings in the syslog and on the console when applications perform misaligned loads and stores. This is normally a sign of badly written code and an indication that the application should be fixed.

                • Use the prctl(1) command to disable these messages on a per application basis.

                • SLES10 offers a way for system administrators to disable these messages on a system-wide basis. This is generally discouraged, but it is useful for the case where a system is used to run third-party applications that cannot be fixed by the system owner.

                  To disable these messages on a system-wide basis, do the following as root:

                  echo 1 > /proc/sys/kernel/ignore-unaligned-usertrap