Chapter 5. XVM Failover

Failover creates an infrastructure for the definition and management of multiple paths to a single disk device or LUN. XVM uses this infrastructure to select the path used for access to logical volumes created on the storage devices.

If your XVM configuration requires that you spread I/O across controllers, you must define a complete failover configuration file. This is necessary to ensure that I/O is restricted to the path that you select. For example, if you want a striped volume to span two host bus adapters, you must configure a failover configuration file to specify the preferred paths.

You should configure failover paths in order to get the maximum bandwidth and avoid LUN ownership movement (or changes) between RAID controllers; accessing the same LUN through different RAID controllers can degrade performance considerably. In general, you want to evenly distribute the I/O to LUNs across all available host bus adapters and RAID controllers and attempt to avoid blocking in the SAN fabric.

The ideal case, from a performance standpoint, is to use as many paths as there are connection endpoints between two nodes in the fabric, with as few blocking paths as possible in the intervening SAN fabric.

There are two failover mechanisms that XVM uses to select the preferred I/O path, each with its associated failover configuration file:

  • Failover version 1 (V1), configured with the /etc/failover.conf file

  • Failover version 2 (V2), configured with the /etc/failover2.conf file

These failover mechanisms are described in the following sections. Information on failover V2 can also be found in the xvm man page.

For information on using XVM failover with CXFS, see CXFS Administration Guide for SGI InfiniteStorage.

Selecting a Failover Version

Whether you use failover V1 or failover V2 depends on a number of considerations, including the RAID mode you are running.

The TP9100 and RM610/660 RAID units do not have any host type failover configuration. Each LUN should be accessed via the same RAID controller from every node in the cluster for performance reasons. These RAID units behave like, and have the same characteristics as, the SGIAVT mode discussed below.

The TP9300, TP9500, and TP9700 RAID units will behave differently depending on the host type that is configured:

  • SGIRDAC mode requires all I/O for a LUN to take place through the RAID controller that currently owns the LUN. Any I/O sent to a RAID controller that does not own the LUN will return an error to the host that sent the request. For the LUN to be accessed via the alternate controller in a RAID array, the failover driver software on a host must send a command to the backup controller instructing it to take ownership of the specified LUN. At that point, ownership of the LUN is transferred to the other controller and I/O can take place via the new owner. Other hosts in the cluster will detect this change and update their I/O for the LUN to use a path to the RAID controller that now owns the LUN. Only XVM failover V1 can successfully control RAIDs in SGIRDAC mode.

  • SGIAVT mode also has the concept of LUN ownership by a single RAID controller. However, a LUN ownership change will take place if any I/O for a given LUN is received by the RAID controller that is not the current owner. The change of ownership is automatic, based on where I/O for a LUN is received, and is not done by a specific request from a host failover driver. The concern with this mode of operation is that if one host in the cluster directs I/O for a LUN to a different RAID controller than the rest of the cluster, the result can be severe performance degradation for that LUN because of the overhead involved in constantly changing its ownership. Either XVM failover V1 or V2 can successfully control RAIDs in SGIAVT mode (TP9400 does not accept SGIAVT mode).

If you are using XVM failover version 2, note the following:

  • TP9100 1 GB and 2 GB:

    • SGIRDAC mode requires that the array is set to multiport

    • SGIAVT mode requires that the array is set to multitid

  • TP9300/9500/S330 Fiber or SATA use of SGIAVT requires 06.12.18.xx code or later be installed

  • TP9700 use of SGIAVT requires that 06.15.17.xx code or later be installed

SGIRDAC mode is supported under all RAID firmware revisions for these models.

Note that Failover V1 is not available on all operating systems:

  • IRIX operating systems support both Failover V1 and Failover V2

  • SGI ProPack 3 for Linux supports both Failover V1 and Failover V2

  • SGI ProPack 4 for Linux and SGI ProPack 5 for Linux support Failover V2 only

  • CXFS clients support Failover V2 only.

For information on choosing an appropriate failover version for a CXFS cluster, see CXFS Administration Guide for SGI InfiniteStorage.

Failover V1

Failover V1 is the original IRIX failover mechanism. Failover V1 can be used with SGI TP9100, SGI TP9300, SGI TP9400, SGI TP9500, and SGI TP9700 RAID devices. It can also be used with third party storage by defining the path entries for the storage device in the failover.conf file.

When using failover V1, you can manually specify failover groups with the failover.conf file. See the failover(7M) man page for information on the failover.conf file, as well as additional information on failover V1.

XVM uses failover V1 when the operating system supports failover V1 (as indicated in “Selecting a Failover Version”) and either of the following conditions is met:

  • The SGI RAID device that contains the XVM physvol is not set to Automatic Volume Transfer (AVT) mode

  • A failover.conf file has been defined (even if you have defined a failover2.conf file as well)

If the operating system does not support Failover V1, XVM always uses Failover V2, whether or not a failover.conf file is present.

It is not necessary to define a failover.conf file in order to use failover V1.

Failover V1 is the version of failover supported by the XLV logical volume manager. If you are upgrading from XLV to XVM, you must replace the failover.conf file with a failover2.conf file if you choose to use failover V2, as described in “The failover2.conf File”.


Note: If a failover.conf file is missing or is not correctly defined on a host system, you may see “Illegal request” messages such as the following:

Mar 1 14:44:20 6A:houu19 unix: dksc 200200a0b80cd8da/lun1vol/c8p1: [Alert]
Illegal request: (asc=0x94, asq=0x1) CDB: 28 0 0 0 0 0 0 0 1 0

This message indicates that the host is trying to access a LUN by means of the alternate controller on a TP9500, TP9400, or TP9300 RAID. This message does not indicate a problem. You can eliminate this message by supplying a failover.conf file or by addressing existing errors in the failover.conf file.

This message can be generated by running the XVM probe command. The XVM probe command is run once automatically for every CXFS membership transition.


Failover V2

Failover V2 can be used with SGI TP9100, SGI TP9300, SGI TP9400, SGI TP9500, and SGI TP9700 RAID devices, as well as with third-party RAID devices. When using failover V2, you can manually specify the attributes associated with a storage path by using the failover2.conf file, as described in “The failover2.conf File”.

XVM uses failover V2 when both of the following conditions are met:

  • The SGI RAID device that contains the XVM physvol is set to Automatic Volume Transfer (AVT) mode

  • There is no failover.conf file

It is not necessary to define a failover2.conf file in order to use failover V2.

The failover2.conf File

The configuration file for failover V2 is /etc/failover2.conf. The entries in this file define failover attributes associated with a path to the storage. Entries can be in any order.

In a failover2.conf file, you use the preferred keyword to specify the preferred path for accessing each XVM physvol; there is no default preferred path. The paths to a physvol are assigned an affinity value. This value is used to associate paths with a specific RAID controller and to determine the priority order in which groups of paths from a node to a LUN are tried in the case of a failure: all affinity 0 paths are tried, then all affinity 1 paths, then all affinity 2 paths, and so on.

Usually, all paths to the same controller are configured with the same affinity value and thus only two affinity values are used. You can, however, use more than two affinity values. What is important is that an affinity group for a LUN should not contain paths that go to different RAID groups.

The valid range of affinity values is 0 (lowest) through 15 (highest); the default is affinity 0. Paths with the same affinity number are all tried before failover V2 moves to the next highest affinity number; at 15, failover V2 wraps back to affinity 0 and starts over.

The paths to one controller of the RAID device should be affinity 0, which is the default affinity value. You should set the paths to the second controller to affinity 1.

Since the default affinity is 0, it would be sufficient to include entries only for those paths that have a non-zero affinity. It would also be sufficient to include an entry for the preferred path only. SGI recommends including definitions for all paths, however.
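
For example, an entry consists of a device path, an affinity assignment, and optionally the preferred keyword. The following two lines (using device paths from the examples later in this chapter) give a physvol a preferred affinity 0 path on one controller and an affinity 1 alternate on the other controller:

  /dev/xscsi/pci04.01.0/node200900a0b813b982/port2/lun3/disc affinity=0 preferred
  /dev/xscsi/pci04.01.1/node200800a0b813b982/port1/lun3/disc affinity=1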

In a multi-host environment, it is recommended that the affinity values for a particular RAID controller be identical on every host in the CXFS cluster.

You can use the affinity value in association with the XVM foswitch command to switch an XVM physvol to a physical path of a defined affinity value, as described in “Switching physvol Path Interactively”.

If a failover.conf file is configured, XVM will employ failover V1 and use the configuration information in that file, even if you configure a failover2.conf file and the RAID devices are in AVT mode. You should remove or comment out an existing failover.conf file when you configure a failover2.conf file.

For instructions on generating a failover2.conf file, see “How to Create a failover2.conf File”.

Example failover2.conf Files

The following example for SGI ProPack groups the paths for lun3 and the paths for lun4. The order of paths in the file is not significant.

Paths to the same LUN are detected automatically if the LUN has been labeled by XVM. A label command initiates a reprobe to discover new alternate paths. If storage that has already been labeled is connected to a live system, you must run an XVM probe for XVM to recognize the disk as an XVM disk.

Without this file, all paths to each LUN would have affinity 0 and there would be no preferred path.

Setting a preferred path allows the administrator to guarantee that not all traffic to multiple LUNs goes through the same HBA, for example, by selecting preferred paths that spread the load out. Otherwise, the path used is the first one discovered, which usually leads to almost all of the load going through the first discovered HBA that is attached to a specific RAID controller.

If no path is designated as preferred, the path used to the LUN is arbitrary based on the order of device discovery. There is no interaction between the preferred path and the affinity values.

This file uses affinity values to group the paths that go to a particular RAID controller. Each controller has been assigned an affinity value. The file shows the following:

  • There is one PCI card with two ports off of the HBA (pci04.01.1 and pci04.01.0)

  • There are two RAID controllers, node200800a0b813b982 and node200900a0b813b982

  • Each RAID controller has two ports that are identified by port1 or port2

  • Each LUN has eight paths (via two HBA ports, two RAID controllers, and two ports on each controller)

  • There are two affinity groups for each LUN, affinity=0 and affinity=1

  • There is a preferred path for each LUN

      /dev/xscsi/pci04.01.1/node200900a0b813b982/port1/lun3/disc,  affinity=0
      /dev/xscsi/pci04.01.1/node200900a0b813b982/port2/lun3/disc,  affinity=0
      /dev/xscsi/pci04.01.0/node200900a0b813b982/port1/lun3/disc,  affinity=0
      /dev/xscsi/pci04.01.0/node200900a0b813b982/port2/lun3/disc,  affinity=0   preferred
      /dev/xscsi/pci04.01.1/node200800a0b813b982/port1/lun3/disc,  affinity=1
      /dev/xscsi/pci04.01.0/node200800a0b813b982/port1/lun3/disc,  affinity=1
      /dev/xscsi/pci04.01.1/node200800a0b813b982/port2/lun3/disc,  affinity=1
      /dev/xscsi/pci04.01.0/node200800a0b813b982/port2/lun3/disc,  affinity=1
    
      /dev/xscsi/pci04.01.1/node200900a0b813b982/port1/lun4/disc, affinity=0
      /dev/xscsi/pci04.01.1/node200900a0b813b982/port2/lun4/disc, affinity=0
      /dev/xscsi/pci04.01.0/node200900a0b813b982/port1/lun4/disc, affinity=0
      /dev/xscsi/pci04.01.0/node200900a0b813b982/port2/lun4/disc, affinity=0   
      /dev/xscsi/pci04.01.1/node200800a0b813b982/port1/lun4/disc, affinity=1 
      /dev/xscsi/pci04.01.1/node200800a0b813b982/port2/lun4/disc, affinity=1 preferred
      /dev/xscsi/pci04.01.0/node200800a0b813b982/port1/lun4/disc, affinity=1
      /dev/xscsi/pci04.01.0/node200800a0b813b982/port2/lun4/disc, affinity=1
    

Given the above, failover will exhaust all paths to lun3 through RAID controller node200900a0b813b982 (with affinity=0 and the preferred path) before moving to the paths through RAID controller node200800a0b813b982 (with affinity=1).

The following example for SGI ProPack shows an additional grouping, by controller port. The preferred path has an affinity of 2. If that path is not available, the failover mechanism will try the other affinity=2 path, which reaches the same controller port through the other HBA port. If that is not successful, it will move to affinity=3, which is the other port on the same RAID controller (node200800a0b813b982).

  /dev/xscsi/pci04.01.1/node200900a0b813b982/port1/lun4/disc, affinity=0
  /dev/xscsi/pci04.01.1/node200900a0b813b982/port2/lun4/disc, affinity=1 
  /dev/xscsi/pci04.01.0/node200900a0b813b982/port1/lun4/disc, affinity=0
  /dev/xscsi/pci04.01.0/node200900a0b813b982/port2/lun4/disc, affinity=1
  /dev/xscsi/pci04.01.1/node200800a0b813b982/port1/lun4/disc, affinity=3
  /dev/xscsi/pci04.01.1/node200800a0b813b982/port2/lun4/disc, affinity=2 preferred
  /dev/xscsi/pci04.01.0/node200800a0b813b982/port1/lun4/disc, affinity=3
  /dev/xscsi/pci04.01.0/node200800a0b813b982/port2/lun4/disc, affinity=2

The following example for IRIX shows two RAID controllers, 200800a0b818b4de and 200900a0b818b4de, for lun4vol:

 /dev/dsk/200800a0b818b4de/lun4vol/c2p2 affinity=0 preferred
 /dev/dsk/200800a0b818b4de/lun4vol/c2p1 affinity=0
 /dev/dsk/200900a0b818b4de/lun4vol/c2p2 affinity=1
 /dev/dsk/200900a0b818b4de/lun4vol/c2p1 affinity=1

Parsing the failover2.conf File

The configuration information in the failover2.conf file becomes available to the system and takes effect when the system is rebooted. You can also parse the failover2.conf file on a running system by means of the XVM foconfig command:

xvm:cluster> foconfig -init

You can also execute the foconfig command directly from the shell prompt:

% xvm foconfig -init

The XVM foconfig command allows you to override the default /etc/failover2.conf filename with the -f option. The following command parses the failover information in the file myfailover2.conf:

xvm:cluster> foconfig -f myfailover2.conf

Additionally, the XVM foconfig command provides a -verbose option.

Running the foconfig command does not change any paths, even if new preferred paths are specified in the new failover file. To change the current path, use the foswitch command, as described in “Switching physvol Path Interactively”.
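
For example, after editing /etc/failover2.conf you could parse the new settings and then move all physvols to their newly specified preferred paths by combining the two commands (both are described in this chapter):

% xvm foconfig -init
% xvm foswitch -preferred phys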

Switching physvol Path Interactively

When using failover V2, you can switch the path used to access an XVM physvol by using the XVM foswitch command. This enables you to set up a new current path on a running system, without rebooting.


Note: The XVM foswitch command does not switch paths for storage being managed by failover V1. For failover V1, use the (x)scsifo command.


Returning to a preferred path

The following command switches all XVM physvols back to their preferred path:

xvm:cluster> foswitch -preferred phys

You can also execute the foswitch command directly from the shell prompt:

% xvm foswitch -preferred phys

You may need to use the -preferred option of the foswitch command when, for example, a hardware problem causes the system to switch the path to an XVM physvol. After addressing the problem, you can use this option to return to the preferred path.

Switching to a new device

The following command switches physvol phys/lun22 to use device 345 (as indicated in the output of a show -v command):

xvm:cluster> foswitch -dev 345 phys/lun22

Setting a new affinity

The following command switches physvol phys/lun33 to a path of affinity 2 if the current path does not already have that affinity. If the current path already has that affinity, no switch is made.

xvm:cluster> foswitch -setaffinity 2 phys/lun33

The following command switches physvol phys/lun33 to the next available path of affinity 2, if there is one available.

xvm:cluster> foswitch -setaffinity 2 -movepath phys/lun33

The -affinity option of the foswitch command is being deprecated. Its functionality is the same as using -setaffinity x -movepath, as in the above example.

Switching paths for all nodes in a cluster

You can use the -cluster option of the foswitch command to perform the indicated operation on all nodes in a cluster.

The following command switches physvol phys/lun33 to a path of affinity 2 for all nodes in the cluster if the current path does not already have that affinity. Where the current path already has that affinity, no switch is made.

xvm:cluster> foswitch -cluster -setaffinity 2 phys/lun33

The following command switches to the preferred path for phys/lun33 for all nodes in the cluster:

xvm:cluster> foswitch -cluster -preferred phys/lun33

Automatic Probe after Labeling a Device

Under failover V2, after you label a device, XVM must probe the device to locate the alternate paths to the device. Disks are probed when the system is booted and when you execute an XVM probe command.

When using failover V2, unlabeled disks are probed automatically when the XVM command exits after you label a device. This allows XVM failover to discover alternate paths for newly-labeled devices.

A probe can be slow, and it is necessary to probe a newly-labeled device only once. XVM allows you to disable the automatic probe feature of failover V2.

You can disable automatic probe in the following ways:

  • Use the -noprobe option of the label command when you label the disk as an XVM physvol, as shown in the example after this list.

  • Use the set autoprobe command to set autoprobe to disabled (or 0), as in the following example:

    xvm:cluster> set autoprobe disabled

    You can re-enable the automatic probe feature with the XVM set autoprobe enabled (or set autoprobe 1) command.
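
The following is a minimal sketch of the first approach. The disk names (dks5d5, dks5d6) and physvol names are hypothetical, and the label arguments shown (-name followed by the unlabeled disk) and the probe argument form are assumptions; check the xvm man page for the exact syntax at your site:

# Label several disks without triggering an automatic probe after each label
# (disk and physvol names are hypothetical).
% xvm label -noprobe -name lun5 dks5d5
% xvm label -noprobe -name lun6 dks5d6
# Probe once afterward so that XVM discovers the alternate paths
# (the physvol argument form is an assumption; see the xvm man page).
% xvm probe phys/lun5
% xvm probe phys/lun6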

How to Create a failover2.conf File

This section provides a procedure for creating a failover2.conf file.

Create an initial /etc/failover2.conf file

You can easily create an initial failover2.conf file, which you can then edit to change affinity and preferred path settings, with the following command.

xvm show -v phys | grep affinity > /etc/failover2.conf

Values in “< >” within the file are considered comments and can be deleted or ignored.

The entries in the file only apply to already labeled devices. You might want to run the command in both xvm domains, local and cluster, in order to get all defined devices.

Set affinity for each path in /etc/failover2.conf

To make it easier to understand and maintain the /etc/failover2.conf file, it is best to follow a consistent strategy for setting path affinity. When a path fails, failover prefers another path that has the same affinity as the current path, regardless of the affinity value. Here is a simple strategy that works well for most sites:

  • Set to affinity=0 all paths to a physvol that go through controller A.

  • Set to affinity=1 all paths to a physvol that go through controller B.

Note: For SGI InfiniteStorage platforms, the WWN of a controller A path always starts with an even number in the first 4 digits (e.g. 2002, 2004, 2006) and the WWN of a controller B path always starts with an odd number in the first 4 digits (e.g. 2003, 2005, 2007).
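
Based on this naming convention, the following is a minimal sketch (not an SGI-supplied tool) of how you might assign affinities mechanically. It assumes a draft file containing one /dev/xscsi-style path per line in its first field (the draft filename is arbitrary), and it does not set preferred paths, which you would still add by hand:

# Assign affinity=0 to controller A paths (even first four WWN digits)
# and affinity=1 to controller B paths (odd first four WWN digits).
awk '{
    path = $1
    wwn = path; sub(/.*\/node/, "", wwn)
    aff = (substr(wwn, 1, 4) % 2 == 0) ? 0 : 1
    print path, "affinity=" aff
}' /etc/failover2.conf.draft > /etc/failover2.conf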

Set the preferred path for each physvol

Make sure that every host in the cluster uses the same controller for a given LUN. Otherwise, a LUN ownership change could result with each I/O operation if the hosts access the LUN at the same time.

When setting the preferred path, you should have it match the preferred controller owner for the LUN. You can get this information from the TPSSM GUI or from the RAID array profile.

Initialize the XVM configuration in the kernel

If you want to have XVM initialize its pathing configuration without going through a reboot, you can do that with the following command.

# xvm foconfig -init

This can be done on a live system without any ill effect because it will not initiate any path failovers. The current path will stay the current path even though the defined preferred path and path affinities may change.

Pay attention to any messages that are generated by this command; they will tell you about any mistakes in your /etc/failover2.conf file.

Set all LUNs to their preferred path

You can set all physvols to their preferred path using the following command. In a cluster configuration, be sure all of your /etc/failover2.conf files are correct and consistent, and run this command from every system to avoid trespass “storms”.

# xvm foswitch -preferred phys

Sample /etc/failover2.conf file

# failover v2 configuration file
#
# Note: All controller A paths are affinity=0
#      All controller B paths are affinity=1
# Make sure preferred path matches the preferred owner of the LUN
#

# RAID Array A
/dev/xscsi/pci02.01.1/node200500a0b813c606/port2/lun0/disc affinity=1
/dev/xscsi/pci02.01.1/node200400a0b813c606/port2/lun0/disc affinity=0
/dev/xscsi/pci02.01.0/node200500a0b813c606/port1/lun0/disc affinity=1
/dev/xscsi/pci02.01.0/node200400a0b813c606/port1/lun0/disc affinity=0 preferred

/dev/xscsi/pci02.01.1/node200500a0b813c606/port2/lun1/disc affinity=1
/dev/xscsi/pci02.01.1/node200400a0b813c606/port2/lun1/disc affinity=0
/dev/xscsi/pci02.01.0/node200500a0b813c606/port1/lun1/disc affinity=1 preferred
/dev/xscsi/pci02.01.0/node200400a0b813c606/port1/lun1/disc affinity=0

/dev/xscsi/pci02.01.1/node200500a0b813c606/port2/lun2/disc affinity=1
/dev/xscsi/pci02.01.1/node200400a0b813c606/port2/lun2/disc affinity=0 preferred
/dev/xscsi/pci02.01.0/node200500a0b813c606/port1/lun2/disc affinity=1
/dev/xscsi/pci02.01.0/node200400a0b813c606/port1/lun2/disc affinity=0

/dev/xscsi/pci02.01.1/node200500a0b813c606/port2/lun3/disc affinity=1 preferred
/dev/xscsi/pci02.01.1/node200400a0b813c606/port2/lun3/disc affinity=0
/dev/xscsi/pci02.01.0/node200500a0b813c606/port1/lun3/disc affinity=1
/dev/xscsi/pci02.01.0/node200400a0b813c606/port1/lun3/disc affinity=0