This chapter describes the operation of SGI Altix systems. It covers the following topics:
Note: This chapter does not apply to SGI Altix XE or SGI Altix ICE systems.
This section describes how to boot an SGI Altix series computer system.
To boot an SGI Altix system, perform the following:
Obtain the system console as described in “Getting Console Access” if you are using SGIconsole, or telnet to the L2 controller as described in “Connecting to the L2 Controller”.
By default, a menu of boot options appears when the system boots. On a properly configured system (as shipped from the factory), you can boot directly to the Linux operating system. A system administrator can change the default using the boot maintenance menus, the extensible firmware interface (EFI) shell> command, or the efibootmgr command (see the efibootmgr options usage statement for more information).
Note: To see the boot menus properly on an SGI Altix system, make sure the debug switches are set to 0 or 1.
A screen similar to the following appears after booting your machine:
    EFI Boot Manager ver 1.02 [12.38]

    Partition 0:
                                     Enabled  Disabled
    CBricks     1        Nodes          2        0
    RBricks     0        CPUs           4        0
    IOBricks    1        Mem(GB)        4        0

    Please select a boot option

        UnitedLinux ProPack
        Boot option maintenance menu

    Use the arrow keys to change option(s). Use Enter to select an option
One of the menu options appears highlighted. Pressing arrow keys moves the highlight.
An example of selecting the EFI Boot Maintenance Manager menu is as follows:
    EFI Boot Maintenance Manager ver 1.02 [12.38]

    Main Menu. Select an Operation

        Boot from a File
        Add a Boot Option
        Delete Boot Option(s)
        Change Boot Order
        Manage BootNext setting
        Set Auto Boot TimeOut
        Select Active Console Output Devices
        Select Active Console Input Devices
        Select Active Standard Error Devices
        Cold Reset
        Exit
An example of selecting the Boot from a File option is as follows:
    EFI Boot Maintenance Manager ver 1.02 [12.38]

    Boot From a File. Select a Volume

        NO VOLUME LABEL [Pci(1|1)/Scsi(Pun0,Lun1)/HD(Part1,Sig7CFD016D-A
        NO VOLUME LABEL [Pci(1|1)/Scsi(Pun0,Lun2)/HD(Part1,Sig1F9EFFAD-7
        Default Boot [Pci(1|1)/Scsi(Pun0,Lun1)]
        Default Boot [Pci(1|1)/Scsi(Pun0,Lun2)]
        Load File [Pci(4|0)/Mac(08006913DB7D)/NicName(tg0)]
        Load File [EFI Shell [Built-in]]
        Exit
An example of selecting Load File EFI Shell is as follows:
    Device Path VenHw(D65A6B8C-71E5-4DF0-A909-F0D23000000099B40000002 BB40000005AB4000000A9)
    EFI Shell version 1.02 [12.38]
    Device mapping table
      fs0  : Pci(1|1)/Scsi(Pun0,Lun1)/HD(Part1,Sigg1)
      fs1  : Pci(1|1)/Scsi(Pun0,Lun2)/HD(Part1,Sigg2)
      blk0 : Pci(1|1)/Scsi(Pun0,Lun1)
      blk1 : Pci(1|1)/Scsi(Pun0,Lun1)/HD(Part1,Sigg1)
      blk2 : Pci(1|1)/Scsi(Pun0,Lun1)/HD(Part2,Sigg3)
      blk3 : Pci(1|1)/Scsi(Pun0,Lun1)/HD(Part3,Sigg4)
      blk4 : Pci(1|1)/Scsi(Pun0,Lun2)
      blk5 : Pci(1|1)/Scsi(Pun0,Lun2)/HD(Part1,Sigg2)
      blk6 : Pci(1|1)/Scsi(Pun0,Lun2)/HD(Part2,Sigg5)
      blk7 : Pci(1|1)/Scsi(Pun0,Lun2)/HD(Part3,Sigg6)
    Shell>
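When several EFI filesystems are present, it can help to see which disk each fsN entry lives on before choosing one. The following is a minimal Python sketch, not an SGI tool, that reads a saved copy of the device mapping table and reports the Pun and Lun values for each filesystem entry. The function names and the sample text are illustrative assumptions; the line format is taken only from the example screen above.

```python
import re

# Sample lines copied from the example device mapping table (assumed format).
SAMPLE_MAP = """\
fs0  : Pci(1|1)/Scsi(Pun0,Lun1)/HD(Part1,Sigg1)
fs1  : Pci(1|1)/Scsi(Pun0,Lun2)/HD(Part1,Sigg2)
blk0 : Pci(1|1)/Scsi(Pun0,Lun1)
blk1 : Pci(1|1)/Scsi(Pun0,Lun1)/HD(Part1,Sigg1)
"""

# One mapping entry: a name such as fs0 or blk3, a colon, then the device path.
ENTRY = re.compile(r"^\s*(?P<name>(?:fs|blk)\d+)\s*:\s*(?P<path>\S+)\s*$")
# The SCSI node inside a device path, for example Scsi(Pun0,Lun1).
SCSI = re.compile(r"Scsi\(Pun(?P<pun>\d+),Lun(?P<lun>\d+)\)")


def parse_device_map(text):
    """Return {mapping name: device path} for every fsN/blkN line."""
    table = {}
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            table[match.group("name")] = match.group("path")
    return table


def filesystems_by_lun(table):
    """Return {fs name: (pun, lun)} for the EFI filesystem entries only."""
    result = {}
    for name, path in table.items():
        scsi = SCSI.search(path)
        if name.startswith("fs") and scsi:
            result[name] = (int(scsi.group("pun")), int(scsi.group("lun")))
    return result


if __name__ == "__main__":
    found = filesystems_by_lun(parse_device_map(SAMPLE_MAP))
    for name, (pun, lun) in sorted(found.items()):
        print(f"{name}: Pun {pun}, Lun {lun}")
```

In the sample, fs0 and fs1 sit on different LUNs, which matches the two boot disks used in the booting examples.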
From the system EFI shell> prompt, you can proceed with booting the system. Optional booting steps follow:
You can optionally select the EFI partition from which to load your kernel. If you choose not to do this, fs0: is searched by default and the following prompt appears:
    fs0:
If there are multiple EFI filesystems (for example, fs1, fs2, and so on), change to the one from which you wish to load the kernel. Then perform the following:
On an SGI ProPack 6 system, to boot the default kernel from the first boot disk, enter the following command at the prompt:
    Booting disk 1:efi/SuSE/elilo
To boot the default kernel from the second root disk, change directory (cd) to efi/sgi (efi/SuSE on SGI ProPack 6) and enter the following command at the prompt:
    Booting disk 2:elilo sgilinux root=/dev/xscsi/pci01.03.0-1/target2/lun0/part3
The preceding XSCSI path points to the second disk of the primary IX-brick.
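To double-check which disk and partition a root= argument refers to, the XSCSI path can be split into its components. The following Python sketch is illustrative only: the function names are hypothetical, and the pattern is inferred from the single example path above, so other XSCSI path shapes may not match.

```python
import re

# Pattern inferred from the one example path in this section (an assumption).
XSCSI = re.compile(
    r"^/dev/xscsi/(?P<controller>pci[^/]+)"
    r"/target(?P<target>\d+)/lun(?P<lun>\d+)/part(?P<part>\d+)$"
)


def root_argument(command_line):
    """Return the value of the root= word in an elilo command line, or None."""
    for word in command_line.split():
        if word.startswith("root="):
            return word[len("root="):]
    return None


def parse_xscsi(path):
    """Split an XSCSI device path into controller, target, lun, and part."""
    match = XSCSI.match(path)
    if match is None:
        raise ValueError(f"not an XSCSI partition path: {path!r}")
    return {
        "controller": match.group("controller"),
        "target": int(match.group("target")),
        "lun": int(match.group("lun")),
        "part": int(match.group("part")),
    }


if __name__ == "__main__":
    line = "elilo sgilinux root=/dev/xscsi/pci01.03.0-1/target2/lun0/part3"
    print(parse_xscsi(root_argument(line)))
```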
If the system is at the kernel debugger kdb> prompt or is not responding at all, try resetting the system from the system controller. To get to the system controller prompt, enter Ctrl-T. At the system controller (L1 or L2) prompt, enter rst. The EFI shell> prompt appears.
To give up control of the console, perform one of the following:
For k consoles, enter Ctrl-] and then enter Ctrl-D.
To exit a session from SGIconsole, from the Console Manager File pulldown menu, choose Exit. To exit from the SGIconsole text-based user interface (tscm), enter 8 for quit.
To exit a session in either the graphical or text version of IRISconsole, enter the following: ~x.
To end a telnet session to the L2 controller, enter the following: Ctrl-] and then Ctrl-D.
This section describes how to halt the Linux operating system and power down your system.
To halt the Linux operating system and power down your system, perform the following:
Connect to the L2> controller by following the steps in Procedure 3-4.
Enter Ctrl-D to connect to the system console.
Enter the halt command.
You can also reset the system by entering Ctrl-T to get to the L2> prompt and then entering the rst command.
To power on and power off individual bricks or your entire Altix system, see the “Powering the System On and Off” section in the appropriate SGI Altix system hardware manual (see the “Related Publications” section in the SGI ProPack 6 for Linux Start Here).
SGI Altix 4000 series systems can recover from Machine Check Architecture (MCA) memory error correction code (ECC) errors that occur in user space (that is, while not executing kernel code). The mca_recovery kernel module logs the process and memory page to the /var/log/messages file. The salinfo_decoded module logs the MCA to the /var/log/salinfo/decoded directory. The affected user process is killed.
An application that uses a page of memory and encounters an ECC error produces an entry, similar to the following, in the /var/log/messages file:
    Oct 25 16:54:56 7A: kernel: OS_MCA: process [pid: 6094](errit) encounters MCA.
    Oct 25 16:54:56 7A: kernel: Page isolation: ( 1301113a480 ) success.
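A quick way to find recovered MCA events is to scan the messages file for these two kinds of lines. The following Python sketch assumes the message wording matches the two sample lines above exactly; the function name and regular expressions are illustrative and are not part of the mca_recovery module.

```python
import re

# Patterns inferred from the two sample log lines (an assumption).
PROCESS = re.compile(
    r"OS_MCA: process \[pid:\s*(?P<pid>\d+)\]\((?P<comm>[^)]*)\) encounters MCA"
)
PAGE = re.compile(
    r"Page isolation: \(\s*(?P<addr>[0-9a-fA-F]+)\s*\)\s*(?P<status>\w+)"
)


def scan_messages(lines):
    """Return (killed, pages) found in an iterable of syslog lines.

    killed is a list of (pid, command name) tuples; pages is a list of
    (address as int, status string) tuples.
    """
    killed, pages = [], []
    for line in lines:
        match = PROCESS.search(line)
        if match:
            killed.append((int(match.group("pid")), match.group("comm")))
            continue
        match = PAGE.search(line)
        if match:
            pages.append((int(match.group("addr"), 16), match.group("status")))
    return killed, pages


if __name__ == "__main__":
    sample = [
        "Oct 25 16:54:56 7A: kernel: OS_MCA: process [pid: 6094](errit) encounters MCA.",
        "Oct 25 16:54:56 7A: kernel: Page isolation: ( 1301113a480 ) success.",
    ]
    killed, pages = scan_messages(sample)
    for pid, comm in killed:
        print(f"process {comm} (pid {pid}) was killed")
    for addr, status in pages:
        print(f"page {addr:#x}: isolation {status}")
```

On a live system the same function could be fed an open file object for /var/log/messages, since it only iterates over lines.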
    /var/log/salinfo/decoded # ls -lrt
    total 8
    drwxr-xr-x  2 root root    6 Nov  5  2004 old
    drwxr-xr-x  4 root root   30 Jul 18 20:29 ..
    drwxr-xr-x  3 root root   67 Oct 25 16:54 .
    -rw-r--r--  1 root root 2209 Oct 25 16:54 oemdata
    -rw-r--r--  1 root root 2722 Oct 25 16:54 2005-10-25-21_54_54-cpu0-cpe.0
View the error message, as follows:
    /var/log/salinfo/decoded # cat 2005-10-25-21_54_54-cpu0-cpe.0
    BEGIN HARDWARE ERROR STATE from cpe on cpu 0
    Err Record ID: 785412062707715    SAL Rev: 0.02
    Time: 2005-10-25 21:54:54    Severity 0

    Platform Memory Device Error Info Section
    Mem Error Detail
        Physical Address: 0x1301113a490
        Address Mask: 0x3ffffffffffff
        Node: 4
        Bank: 0
    OEM Specific Data
    UNCORRECTED MEMORY ERROR
        :module/001c14/slab/0/node
        :Loc DIMM1 N0_L_BUS_Y and/or DIMM1 N0_R_BUS_Y
        :Address 0x000001301113a490
        :Syn 0x30 (Multi)
        :Bad Data 0x0003000000000000

    MBIST read address  Data                ECC   SYN   FSB Bit
    0x000001301113a480  0x0000000000000000  0x00  0x00  good
    0x000001301113a488  0x0000000000000000  0x00  0x00  good
    0x000001301113a490  0x0003000000000000  0x00  0x30  multi
    0x000001301113a498  0x0000000000000000  0x00  0x00  good
    0x000001301113a4a0  0x0000000000000000  0x00  0x00  good
    0x000001301113a4a8  0x0000000000000000  0x00  0x00  good
    0x000001301113a4b0  0x0003000000000000  0x00  0x30  multi
    0x000001301113a4b8  0x0000000000000000  0x00  0x00  good
    0x000001301113a4c0  0x0000000000000000  0x00  0x00  good
    0x000001301113a4c8  0x0000000000000000  0x00  0x00  good
    0x000001301113a4d0  0x0003000000000000  0x00  0x30  multi
    0x000001301113a4d8  0x0000000000000000  0x00  0x00  good
    0x000001301113a4e0  0x0000000000000000  0x00  0x00  good
    0x000001301113a4e8  0x0000000000000000  0x00  0x00  good
    0x000001301113a4f0  0x0003000000000000  0x00  0x30  multi
    0x000001301113a4f8  0x0000000000000000  0x00  0x00  good

    Platform Specific Error Info Section
    Platform Specific Error Detail
    OEM Specific Data
    UNCORRECTED ECC ERROR
        :module/001c14/slab/0/node
        :Processor received bad data from SHub
    SH_EVENT_OCCURRED : 0x0000000018000100
        PI Uncorrectable Error Interrupt Pending
    SH_FIRST_ERROR : 0x0000000000000100
        PI Uncorrectable Error Interrupt Pending
    SH_PI_ERROR_SUMMARY : 0x0000000020000000
        PI_UCE_INT: SHub-to-FSB Uncorrectable Data Error
    SH_PI_FIRST_ERROR : 0x0000000020000000
        PI_UCE_INT: SHub-to-FSB Uncorrectable Data Error
    SH_PI_ERROR_OVERFLOW : 0x0000000020000000
        PI_UCE_INT: SHub-to-FSB Uncorrectable Data Error
    SH_PI_UNCORRECTED_DETAIL_1 : 0x0030002602227492
        Address: 0x000001301113a490
        Nasid: 0x4
        Syndrome: 0x30
        ECC: 0x00
    SH_PI_UNCORRECTED_DETAIL_2 : 0x0003000000000000
        Failing Dbl-word Data: 0x0003000000000000
    SH_PI_UNCOR_TIME_STAMP : 0x8000013523cc4805
    END HARDWARE ERROR STATE from cpe on cpu 0
The MCA recovery code does not attempt recovery if the CPU is in privileged mode (in kernel context). The MCA record will show if this is the case.
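The decoded record is long, and usually only a few fields matter when deciding which DIMM to inspect. The following Python sketch pulls the physical address, node, bank, DIMM location, and any MBIST reads not marked good out of a record. It is an illustration under stated assumptions: the field labels are taken from the example record above, the exact line layout of a real record may differ, and the function names and abbreviated sample are hypothetical.

```python
import re

# Abbreviated sample built from fields of the example record (assumed layout).
SAMPLE_RECORD = """\
BEGIN HARDWARE ERROR STATE from cpe on cpu 0
Physical Address: 0x1301113a490
Node: 4
Bank: 0
:Loc DIMM1 N0_L_BUS_Y and/or DIMM1 N0_R_BUS_Y
0x000001301113a480 0x0000000000000000 0x00 0x00 good
0x000001301113a490 0x0003000000000000 0x00 0x30 multi
END HARDWARE ERROR STATE from cpe on cpu 0
"""

# An MBIST row: read address, data, ECC byte, syndrome byte, status word.
MBIST_ROW = re.compile(
    r"(0x[0-9a-fA-F]{16})\s+(0x[0-9a-fA-F]{16})\s+"
    r"0x[0-9a-fA-F]{2}\s+0x[0-9a-fA-F]{2}\s+(\w+)"
)


def _first(pattern, text, convert):
    """Return convert(group 1) of the first match, or None if absent."""
    match = re.search(pattern, text, re.MULTILINE)
    return convert(match.group(1)) if match else None


def summarize_record(text):
    """Extract the fields most useful for locating the failing memory."""
    bad_reads = [
        (int(addr, 16), int(data, 16), status)
        for addr, data, status in MBIST_ROW.findall(text)
        if status != "good"
    ]
    return {
        "physical_address": _first(
            r"Physical Address:\s*(0x[0-9a-fA-F]+)", text, lambda s: int(s, 16)
        ),
        "node": _first(r"\bNode:\s*(\d+)", text, int),
        "bank": _first(r"\bBank:\s*(\d+)", text, int),
        "location": _first(r":Loc\s+(.*?)(?=\s+:|\s*$)", text, str),
        "bad_reads": bad_reads,
    }


if __name__ == "__main__":
    summary = summarize_record(SAMPLE_RECORD)
    print(f"address {summary['physical_address']:#x}, "
          f"node {summary['node']}, bank {summary['bank']}")
    print("location:", summary["location"])
    for addr, data, status in summary["bad_reads"]:
        print(f"  {addr:#018x} data {data:#018x} ({status})")
```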
You can monitor the L1 controller status and error messages on the L1 controller's liquid crystal display (LCD) located on the front panel of the individual bricks. The L1 controller and L2 controller status and error messages can also be monitored at your system console. The system console allows you to monitor and manage your server or graphics system by entering L1 controller commands. You can also enter L2 controller commands to monitor and manage your system if your system has L2 controller hardware and a system console or if you are using an SGIconsole as your system console. For information on connecting to the system console, see “Getting Console Access”. For detailed information on using the L2 controller software, see the SGI L1 and L2 Controller Software User's Guide.
From the Tasks pulldown menu of SGIconsole Console Manager GUI, choose Connect to a System Controller.
To get to the L2> controller prompt, enter Ctrl-T.
To get back to the L1> controller prompt, enter Ctrl-D.
To access the L2 controller firmware, you must connect a system console, such as SGIconsole or a dumb terminal, to the L2 controller. The L2 firmware is always running as long as power is supplied to the L2 controller. If you connect a system console to the L2 controller's console port, the L2 prompt appears. For instructions on connecting a console to the L2 controller, see your server or graphics system owner's guide or the SGIconsole Hardware Connectivity Guide.
The SGIconsole Console Manager graphical user interface (GUI) or text-based user interface (tscm(1)) can be used to securely access a system console and connect to an L2 controller. For information on using Console Manager to access an SGI Altix system or to access an SGI Altix system in secure mode using the ssh(1) command, see the Console Manager for SGIconsole Administrator's Guide.
Your SGI Altix system should have an L2 controller on your network that you can access. This section describes how you can connect to an L2 controller if you are not using SGIconsole.
To connect to a system L2 controller, perform the following steps:
From the Tasks pulldown menu of SGIconsole Console Manager GUI, choose Node Tasks -> Get/Steal/Spy. Follow the instructions in the Console Manager for SGIconsole Administrator's Guide to connect to the console. You can also use the tscm(1) command line interface to Console Manager. If you do not have SGIconsole installed, proceed to the next step.
Use the telnet(1) command to connect to the L2 controller as follows:
    telnet L2-system-name.domain-name.company.com
Once connected, press the Enter key and a prompt similar to the following appears:
    system_name-001-L2>
To connect to the system console, enter Ctrl-D.
If your system is partitioned, a message similar to the following appears:
    INFO: ERROR: no system console defined
For information on working with partitioned systems, see “System Partitioning” in Chapter 2.
Note: For detailed information on using the L2 controller software, see the SGI L1 and L2 Controller Software User's Guide and the SGI Altix 3000 User's Guide.
This section describes how to access a system console.
From the Tasks pulldown menu of SGIconsole Console Manager GUI, choose Node Tasks -> Get/Steal/Spy. Follow the instructions in the Console Manager for SGIconsole Administrator's Guide to connect to the console. You can also use the tscm(1) command line interface to Console Manager. For information on using Console Manager or tscm(1), see the Console Manager for SGIconsole Administrator's Guide.
If you do not have Console Manager installed, proceed to the next step.
To connect to a system L2 controller, perform the steps in Procedure 3-4.
Once connected, press the Enter key and a prompt similar to the following appears:
    system_name-001-L2>
To connect to the system console, enter Ctrl-D.
If your system is partitioned, a message similar to the following appears:
    INFO: ERROR: no system console defined
For information on working with partitioned systems, see “System Partitioning” in Chapter 2.
To return to the L2 controller, enter Ctrl-T.
To return to the telnet prompt, enter Ctrl-] (Control-right bracket).
Note: For detailed information on using the L2 controller software, see the SGI L1 and L2 Controller Software User's Guide.
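The key sequences in this procedure can be summarized as a small state table. The following Python sketch is only a memory aid and does not talk to a controller: the state names and the walk function are hypothetical, the table covers just the unpartitioned case described above, and the Ctrl-] entry is shown from the L2 prompt only.

```python
# (current prompt, key) -> next prompt, as described in the steps above.
TRANSITIONS = {
    ("telnet", "Enter"): "L2",       # Enter after connecting shows the L2> prompt
    ("L2", "Ctrl-D"): "console",     # L2 controller to system console
    ("console", "Ctrl-T"): "L2",     # system console back to L2 controller
    ("L2", "Ctrl-]"): "telnet",      # back to the telnet prompt
}


def walk(start, keys):
    """Follow a sequence of keys from a starting prompt and return the result."""
    state = start
    for key in keys:
        try:
            state = TRANSITIONS[(state, key)]
        except KeyError:
            raise ValueError(
                f"{key} is not described for the {state} prompt"
            ) from None
    return state


if __name__ == "__main__":
    print(walk("telnet", ["Enter", "Ctrl-D"]))
    print(walk("console", ["Ctrl-T", "Ctrl-]"]))
```

On a partitioned system, Ctrl-D from the L2 prompt produces the "no system console defined" message instead of a console, which this table does not model.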
This section describes diskless booting supported on SGI Altix systems and covers the following topics:
This section describes an SGI approach to diskless booting of SGI Altix systems. Other approaches to diskless booting have been covered thoroughly by Linux HOWTO documents. You can use your favorite web search engine to look for "linux diskless boot howto". You should find ample information on the concepts and methods available through Linux.
Unlike other diskless client-server models, Altix diskless systems do not rely on a server to operate after being booted. This allows a client to boot over a satellite link and continue to operate after that link has been broken.
Altix diskless booting uses a ramdisk for the root filesystem instead of NFS mounting the root filesystem. Many diskless clients operate with a root filesystem provided over a network using a remote filesystem protocol such as NFS. This approach suffers from several disadvantages, including the following:
A strong dependence on the file server, which typically results in the client hanging if communication to the file server is lost
Poor performance as many normal system operations (for example, opening a temporary file in /tmp) need several round trips over the network
The file server needs to share writable root filesystem images for every client, which is complex and tedious to manage
For example, a well-equipped Altix system running diskless might have 16 GB of system memory divided as follows: a 4 GB ramdisk for the operating system, another 4 GB ramdisk for applications such as OpenGL Performer, and 8 GB or more left over for system memory (16 GB minus the 8 GB of ramdisk, plus any unused ramdisk space). Applications requiring large amounts of data can run CXFS.
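The memory arithmetic in this example can be written out as a short calculation. The following Python sketch is illustrative only; the function name and the ramdisk labels are hypothetical, and it treats unused ramdisk space as memory that remains available to the system, which is how the example above counts it.

```python
def memory_budget(total_gb, ramdisks_gb, unused_ramdisk_gb=0):
    """Return the memory (in GB) left for the system after ramdisks.

    total_gb          -- installed system memory
    ramdisks_gb       -- mapping of ramdisk label to its size in GB
    unused_ramdisk_gb -- ramdisk space not currently holding data
    """
    reserved = sum(ramdisks_gb.values())
    if reserved > total_gb:
        raise ValueError("ramdisks are larger than installed memory")
    if not 0 <= unused_ramdisk_gb <= reserved:
        raise ValueError("unused ramdisk space must be between 0 and the ramdisk total")
    return total_gb - reserved + unused_ramdisk_gb


if __name__ == "__main__":
    ramdisks = {"operating system": 4, "applications": 4}
    print(memory_budget(16, ramdisks), "GB available with full ramdisks")
    print(memory_budget(16, ramdisks, unused_ramdisk_gb=2), "GB with 2 GB of ramdisk unused")
```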
Although system swapping is not required if there is sufficient memory, swapping may be added and verified as shown in the last step of Procedure 3-6.
Because the root filesystem is in memory, system performance is higher than that of systems with a local disk.
You can use the Altix Standalone Maintenance CD, without installing it on a hard drive, to perform the following:
Flash snprom images directly from CD.
Boot SGI ProPack 6 for Linux from CD in about three minutes, using the 2.6.5-xxxx-rtgfx kernel and modules, to perform system maintenance. Once booted, it has the same capabilities to telnet to the system as an L3 controller on CD. Note that the CD can be unmounted after booting so that other CDs can be mounted.
Note: You can use any supported 2.6.x kernel.
To use the CD in rescue mode to accomplish diskless booting, perform the following steps:
Boot the CD, as follows:
    fs0:\>bootia64
Press the Enter key, wait, then press the Enter key again. A message similar to the following appears:
    Uncompressing Linux... done
    Loading initrd initrd-2.6.5-7.201-rtgfx.../
Mount the DVD/CD, as follows:
    mount /dev/hda /media/dvd
Identify disks, as follows:
    dmesg | grep SCSI
Print the partition table, as follows:
    machine-A# parted /dev/sda print
    Disk geometry for /dev/sda: 0.000-78533.437 megabytes
    Disk label type: gpt
    Minor    Start       End     Filesystem  Name   Flags
    1          0.017    500.000  fat16              boot
    2        500.000  20500.000  xfs
    4      20500.000  21000.000  fat16
    5      21000.000  42000.000  xfs
    6      42000.000  68000.009  xfs
    3      68000.010  78533.421  linux-swap
    Information: Don't forget to update /etc/fstab, if necessary.
Mount the partition, as follows:
    mount /dev/sda2 /mnt
Determine which filesystem is the root filesystem and then mount /boot/efi onto it, as follows:
    mount /dev/sda1 /mnt/boot/efi
Typical output is similar to the following:
    /dev/sda2  20469760  6461768  14007992  32%  /mnt
    /dev/sda1    511712    62480    449232  13%  /mnt/boot/efi
Use the chroot(1) command to set up the root directory for installing RPMs and so on, as follows:
    cd /mnt; chroot .
Back up the root filesystem to a system on your network, as follows:
    xfsdump -l0 - /mnt | gzip -c | ssh root@fax.americas.sgi.com dd of=/bigdisk/altixbackup.dgz
Note: Answer yes and then enter the password.
Warning: The dd of= string in this command overwrites the named file if it already exists!
Restore a backup from a remote system, as follows:
    ssh root@backup.eng.sgi.com "dd if=/bigdisk/altixbackup.dgz" | gunzip -c | xfsrestore - /mnt
Note: Use the quotes to keep the gunzip command from running on the remote system.
If a DHCP server did not start the network, start it manually, as follows:
    ifconfig eth0 149.xxx.xxx.xx netmask 255.255.xxx.xxx up
From the system booted from CD, use the wget(1) command to download an RPM at the location below, as follows:
    wget http://rpms.eng.sgi.com/package.rpm
Log in from another system using anonymous ftp, as follows:
    ftp IP-address-of-Altix-system
For example:
    Name (ptc-tulip.americas:system-user): ftp
Note: Name can be ftp or anonymous.
Telnet from another system, as follows:
    % telnet 169.239.221.86
    Trying 169.239.221.86...
    Connected to 169.239.221.86.
    Escape character is '^]'.
    Linux 2.6.5-7.201-rtgfx (sgialtix) (0)
    bash-2.05b#
Although system swapping is not required if there is sufficient memory, swapping may be added and verified, as follows:
    bash-2.05b# swapon -a /dev/sda3
    Adding 10786176k swap on /dev/sda3. Priority:-1 extents:1
    bash-2.05b# swapon -s
    Filename    Type       Size      Used  Priority
    /dev/sda3   partition  10786176  0     -1
Note: Swap partitions can be identified using the parted command, see step 4.
    bash-2.05b# mkswap /dev/sda3
    Setting up swapspace version 1, size = 11045060 kB
Warning: All data in the target partition is destroyed.
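Because mkswap destroys the contents of its target, it is worth confirming which partition parted reports as swap before running it. The following Python sketch reads saved parted print output and lists the linux-swap partitions with their sizes. It is an illustration, not an SGI utility: the function name is hypothetical, the column layout is assumed to match the parted listing shown in step 4, and the device name is formed by appending the Minor number to the disk name.

```python
import re

# Rows copied from the parted listing in step 4 (assumed column layout).
SAMPLE_PARTED = """\
Disk geometry for /dev/sda: 0.000-78533.437 megabytes
Disk label type: gpt
Minor    Start       End     Filesystem  Name   Flags
1          0.017    500.000  fat16              boot
2        500.000  20500.000  xfs
4      20500.000  21000.000  fat16
5      21000.000  42000.000  xfs
6      42000.000  68000.009  xfs
3      68000.010  78533.421  linux-swap
"""

# A partition row: minor number, start and end in megabytes, filesystem type.
ROW = re.compile(
    r"^\s*(?P<minor>\d+)\s+(?P<start>\d+\.\d+)\s+(?P<end>\d+\.\d+)\s+(?P<fs>\S+)"
)


def swap_partitions(parted_output, disk="/dev/sda"):
    """Return [(device, size in MB)] for rows whose filesystem is linux-swap."""
    found = []
    for line in parted_output.splitlines():
        match = ROW.match(line)
        if match and match.group("fs") == "linux-swap":
            size_mb = float(match.group("end")) - float(match.group("start"))
            found.append((f"{disk}{match.group('minor')}", size_mb))
    return found


if __name__ == "__main__":
    for device, size_mb in swap_partitions(SAMPLE_PARTED):
        print(f"{device}: about {size_mb:.0f} MB of swap")
```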
Go to Supportfolio, the SGI support web site, for additional information about Altix Standalone Maintenance CD and diskless booting at: https://support.sgi.com/login
This section describes diskless booting from the network.
Diskless booting from the network requires the following three network services:
DHCP server
TFTP server
Anonymous FTP or HTTP server
The Boot Option Maintenance menu may be used to set a network device as the default boot device.
An example EFI boot device is, as follows:
    [Pci(4|0)/Mac(08006913F0B4)/NicName(tg0) ]
When a network device is selected as the boot device, a DHCP client request is made to a DHCP server. Diskless booting requires two special parameters, next-server and filename, to identify the TFTP server and the boot loader file.
An example entry for these parameters in the dhcpd.conf file is as follows:
    next-server 169.238.221.85;
    filename "/tftpboot/merged/bootia64.efi";
The next-server parameter may be the same system as the system providing the DHCP service or it may be another system. Using the parameters provided by the DHCP server, the Altix diskless client initializes the network and makes a TFTP request to the next-server.
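A missing next-server or filename statement is an easy mistake to make when preparing the DHCP server. The following Python sketch checks a configuration fragment for both statements. It is a simple text scan under stated assumptions, not a full parser for the DHCP server configuration syntax: it ignores comments and scoping, and the function names are hypothetical.

```python
import re

# Matches statements such as:  next-server 169.238.221.85;
#                              filename "/tftpboot/merged/bootia64.efi";
STATEMENT = re.compile(r'\b(next-server|filename)\s+("?)([^";]+)\2\s*;')

REQUIRED = ("next-server", "filename")


def boot_parameters(conf_text):
    """Return {parameter: value} for the two diskless boot statements found."""
    params = {}
    for key, _quote, value in STATEMENT.findall(conf_text):
        params[key] = value.strip()
    return params


def missing_parameters(conf_text):
    """Return the required parameters that the fragment does not define."""
    params = boot_parameters(conf_text)
    return [key for key in REQUIRED if key not in params]


if __name__ == "__main__":
    fragment = 'next-server 169.238.221.85; filename "/tftpboot/merged/bootia64.efi";'
    print(boot_parameters(fragment))
    print("missing:", missing_parameters("next-server 169.238.221.85;"))
```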
The diskless client contacts the TFTP server, downloads the boot loader, and executes it. The elilo.conf configuration file, in the same directory as the boot loader, provides the name of a kernel and an initial ramdisk.
Three special parameters in the elilo.conf file establish the identity of the diskless client. Examples of these parameters are, as follows:
    config_server=ftp://169.238.221.102/hosts
    config_server=http://169.238.221.102/hosts
    config_file=ITSECDEMO
    hostname=tulip
The config_server parameter may define an anonymous FTP server or an HTTP server. The config_file parameter is equivalent to the rc.sysinit script in a disk boot and is executed once by init. The hostname parameter is used to identify the system name.
This setup simplifies administration by allowing common configuration settings such as desired software to be specified in config_file and host-specific files such as licensing and network settings to be specified by hostname.
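Before booting a client, the three identity parameters can be checked for completeness. The following Python sketch assumes the parameters are available as whitespace-separated key=value words, as in the examples above; where exactly they appear in elilo.conf is not covered here, and the function name is hypothetical. It accepts only ftp and http for config_server because those are the two server types this section describes.

```python
from urllib.parse import urlsplit

REQUIRED = ("config_server", "config_file", "hostname")
ALLOWED_SCHEMES = ("ftp", "http")


def client_identity(words):
    """Parse key=value words and return the three identity parameters.

    Raises ValueError if a parameter is missing or if config_server does
    not name an FTP or HTTP server.
    """
    params = {}
    for word in words.split():
        key, separator, value = word.partition("=")
        if separator and key in REQUIRED:
            params[key] = value
    missing = [key for key in REQUIRED if key not in params]
    if missing:
        raise ValueError("missing parameter(s): " + ", ".join(missing))
    scheme = urlsplit(params["config_server"]).scheme
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"config_server must use ftp or http, not {scheme!r}")
    return params


if __name__ == "__main__":
    example = (
        "config_server=ftp://169.238.221.102/hosts "
        "config_file=ITSECDEMO hostname=tulip"
    )
    print(client_identity(example))
```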
The diskless client loads the operating system into memory along with the host-specific configuration files. Before control of the system is returned to init, DHCP client services are terminated.
When control of the boot process is returned to init, the system will boot in the same manner as a system with local disk. The network is re-initialized as specified in the host-specific parameters and services such as CXFS are started.