Chapter 5. Kernel Tunable Parameters on SGI ProPack Servers

This section identifies and describes the settings for kernel tunable parameters appropriate for large SGI ProPack servers.


Note: This chapter does not apply to SGI Altix XE or SGI Altix ICE systems.


This information about Linux kernel tunable parameters is included in your Linux release and can be found in the following directory on an SGI ProPack 6 system:

/usr/src/linux/Documentation/sysctl

This section covers the following topics:

  • “CPU Scheduler /proc/sys/sched Directory”
  • “/etc/sysconfig/dump File”
  • “Resetting System Limits”
  • “File Descriptor Limits for MPI Jobs”
  • “Understanding RSS, SIZE, and SHARE Values for MPI Jobs”
  • “Memory (Swap) sysctl Parameters”
  • “Load-balancing Algorithms sysctl Parameters”
  • “Virtual Memory hugetlb Parameter”

Please note that sysctl parameters are also described in the following files on your system:

/usr/src/linux/Documentation/filesystems/proc.txt
/usr/src/linux/Documentation/filesystems/xfs.txt
/usr/src/linux/Documentation/networking/ip-sysctl.txt 

CPU Scheduler /proc/sys/sched Directory

This section describes tunable parameters for CPU scheduling in the /proc/sys/sched directory and applies only to SGI ProPack 3 for Linux systems. On SGI ProPack 4 for Linux systems, these tunable parameters can be found in the /proc/sys/kernel directory.

The contents of the /proc/sys/sched directory are similar to the following:

[root@profit sched]# ls
busy_node_rebalance_ratio  idle_node_rebalance_ratio_max  min_timeslice
child_penalty              max_loadbal_rejects            sched_exec_threshold
idle_node_rebalance_ratio  max_timeslice                  sched_node_threshold

Do not change the min_timeslice value to less than 10, which is the current value of cache_decay_ticks; otherwise, the scheduler's load-balancing will be adversely affected on some workloads.
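
You can inspect or change these values through the /proc interface or with the sysctl(8) command. The following is a minimal sketch; the value written is purely illustrative, not a recommendation:

# View the current value
cat /proc/sys/sched/min_timeslice

# Equivalent sysctl name on SGI ProPack 3
sysctl sched.min_timeslice

# Change the value at runtime (not persistent across reboots)
echo 20 > /proc/sys/sched/min_timeslice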


Note: Be very careful in changing any of the values in this directory. You risk adversely affecting CPU scheduling performance.


/etc/sysconfig/dump File

This file contains the configuration variables for the Linux Kernel Crash Dump (LKCD) facility that creates files in the /var/log/dump directory.

The following variables are defined in this file:

  • DUMP_ACTIVE

    The DUMP_ACTIVE variable indicates whether the dump process is active or not. If this variable is 0, the dump kernel process is not activated.

  • DUMPDEV

    The DUMPDEV variable represents the name of the dump device. It is typically the primary swap partition on the local system, although any disk device can be used.


    Caution: Be careful when defining this value to avoid unintended problems.


  • DUMPDIR

    The DUMPDIR variable defines the location where crash dumps are saved. In that directory, a file called bounds is created that is the current index of the last crash dump saved. The bounds file is updated with an incremented index once a new crash dump or crash report is saved.

    If LKCD takes a crash dump, it can easily consume multiple gigabytes of space in /var. This is why the default root filesystem is larger. For this reason, you may wish to make a separate /var/dump filesystem or change the LKCD configuration. For more information on LKCD, see the lkcd_config(1) man page.

    To save crash dumps to a different location, change the DUMPDIR value in the /etc/sysconfig/dump file.
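
    For example, to direct dumps to a hypothetical dedicated filesystem mounted at /dump (the path is an assumption for illustration), you could edit the variable and then re-apply the LKCD configuration:

      # In /etc/sysconfig/dump, change:
      #   DUMPDIR=/var/log/dump
      # to:
      #   DUMPDIR=/dump

      # Re-apply the configuration (if the lkcd command is available on your
      # release; otherwise see the lkcd_config(1) man page)
      lkcd config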

  • DUMP_SAVE

    The DUMP_SAVE variable defines whether to save the memory image to disk or not. If the value is 1, the vmcore image is stored, and a crash report is created from the saved dump. If it is not set to 1, only a crash report is created and the dump is not saved. Use this option if you do not want your system's disk space consumed by large crash dump images.

  • DUMP_LEVEL

    The DUMP_LEVEL variable has a number of possible values, as follows:

    DUMP_NONE (0) 

    Do nothing, just return if called.

    DUMP_HEADER (1) 

    Dump the dump header and first 128K bytes.

    DUMP_KERN (2) 

    Everything in DUMP_HEADER, plus kernel pages only.

    DUMP_USED (4) 

    Everything except the kernel free pages.

    DUMP_ALL (8) 

    All memory is dumped.


    Note: You must use the numeric value, not the name of the variable.


  • DUMP_COMPRESS

    The DUMP_COMPRESS variable indicates which compression mechanism the kernel should attempt to use for the dump. By default, dump compression is not used unless you specifically request it. Multiple types of compression are available. Currently, running modprobe dump_rle installs the dump_rle.o module, which enables RLE compression of the dump pages (see the example after the value list below). The RLE compression algorithm used in the kernel gives, on average, 40% compression of the memory image, although this varies depending on how much memory is in use on the system. Other compression modules (such as gzip) are also planned. The current values for the DUMP_COMPRESS variable are as follows:

    DUMP_COMPRESS_NONE(0) 

    Do not compress this dump.

    DUMP_COMPRESS_RLE(1) 

    Use RLE compression.

    DUMP_COMPRESS_GZIP(2) 

    Use GZIP compression.
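
    A minimal sketch of enabling RLE compression, assuming the dump_rle module is available on your release:

      # Load the RLE compression module
      modprobe dump_rle

      # Then, in /etc/sysconfig/dump, select RLE compression:
      #   DUMP_COMPRESS=1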

  • PANIC_TIMEOUT

    The PANIC_TIMEOUT variable represents the timeout (in seconds) before the system reboots after a panic occurs. Typically, this is set to 0, which means the kernel sits and spins until someone resets the machine. This is not the preferred behavior if you want to recover the dump after the reboot.

The following is an example of a /etc/sysconfig/dump file on an SGI ProPack 3 system:

DUMP_ACTIVE=1
DUMPDEV=/dev/vmdump
DUMPDIR=/var/log/dump
DUMP_SAVE=1
DUMP_LEVEL=2
DUMP_FLAGS=0
DUMP_COMPRESS=0
PANIC_TIMEOUT=5

The following is an example of a /etc/sysconfig/dump file on an SGI ProPack 4 system:

DUMP_ACTIVE="1"
DUMPDEV="/dev/vmdump"
DUMPDIR="/var/log/dump"
DUMP_LEVEL="2"
DUMP_COMPRESS="2"
DUMP_FLAGS="0x80000004"
DUMP_SAVE="1"
PANIC_TIMEOUT="5"
BOUNDS_LIMIT=10
KEXEC_IMAGE=/boot/vmlinuz
KEXEC_CMDLINE="root console=tty0"
TARGET_HOST=""
TARGET_PORT=6688
SOURCE_PORT=6688
ETH_ADDRESS=ff:ff:ff:ff:ff:ff
DUMP_MAX_CONCURRENT=4
NETDUMP_VERBOSE=no

For descriptions of these SGI ProPack 4 dump parameters, see the /etc/sysconfig/dump file on your system.

Resetting System Limits

To regulate these limits on a per-user basis (for applications that do not rely on limits.h), the limits.conf file can be modified. System limits that can be modified include the maximum file size, the maximum number of open files, the maximum stack size, and so on. You can view this file as follows:

[user@machine user]# cat /etc/security/limits.conf
# /etc/security/limits.conf
#
#Each line describes a limit for a user in the form:
#
#<domain>        <type>  <item>  <value>
#Where:
#<domain> can be:
#        - an user name
#        - a group name, with @group syntax
#        - the wildcard *, for default entry
#
#<type> can have the two values:
#        - "soft" for enforcing the soft limits
#        - "hard" for enforcing hard limits
#
#<item> can be one of the following:
#        - core - limits the core file size (KB)
#        - data - max data size (KB)
#        - fsize - maximum filesize (KB)
#        - memlock - max locked-in-memory address space (KB)
#        - nofile - max number of open files
#        - rss - max resident set size (KB)
#        - stack - max stack size (KB)
#        - cpu - max CPU time (MIN)
#        - nproc - max number of processes
#        - as - address space limit
#        - maxlogins - max number of logins for this user
#        - priority - the priority to run user process with
#        - locks - max number of file locks the user can hold
#
#<domain>      <type>  <item>         <value>

#*               soft    core            0
#*               hard    rss             10000
#@student        hard    nproc           20
#@faculty        soft    nproc           20
#@faculty        hard    nproc           50
#ftp             hard    nproc           0
#@student        -       maxlogins       4

# End of file

For instructions on how to change these limits, follow the procedure in “File Descriptor Limits for MPI Jobs”.

File Descriptor Limits for MPI Jobs

Because of the large number of file descriptors that MPI jobs require, you might need to increase the system-wide limit on the number of open files on your Altix system. The default value for the file limit resource is 1024. You can change the file descriptor limit by editing the /etc/security/limits.conf file.

Procedure 5-1. Increasing File Descriptor Limits for MPI Jobs

    To change the default value for all users to 8196 file descriptors, perform the following:

    1. Add the following line to /etc/pam.d/login file:

      session    required     /lib/security/pam_limits.so

    2. Add the following lines to /etc/security/limits.conf file:

      *     soft    nofile      8196
      *     hard    nofile      8196
      

    The default of 1024 file descriptors allows for approximately 199 MPI processes per host. Increasing the file descriptor value to 8196 allows for more than 512 MPI processes per host.
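
    After logging in again, you can verify the new limits with the shell's ulimit builtin; the output shown is illustrative:

      $ ulimit -Sn     # soft limit on open files
      8196
      $ ulimit -Hn     # hard limit on open files
      8196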

Understanding RSS, SIZE, and SHARE Values for MPI Jobs

    You can use the top(1) and ps(1) commands to view the RSS, SIZE, and SHARE values when running MPI jobs on an Altix system. In particular, the RSS number for an MPI job can seem reasonable, while the SIZE and SHARE values are much higher.

    The RSS value reflects the amount of memory that has been accessed and is currently resident in memory (faulted in). If swapping occurs, this value should go down. This tracks real memory usage of an application.

    The SIZE value includes the faulted-in memory plus all the pages the application _could_ fault in (pages that have been previously allocated by the mmap function). This value includes the length of memory-mapped (mmap) regions, even if they are never touched.

    The SHARE value includes the faulted-in memory plus all the pages the application _could_ fault in that are marked with the MAP_SHARED attribute. This value includes the length of MAP_SHARED memory-mapped (mmap) regions, even if they are never touched.

    The reason that the SIZE and SHARE values are so high for MPI jobs is that MPI cross-maps a significant amount of memory from each MPI process onto every other MPI process via XPMEM. This is done to allow single-copy transfers, fast MPI-2 one-sided transfers, and SHMEM capability. MPI programs currently cross-map the static region, stack, and a good portion of the private heap. All of these mappings use the MAP_SHARED attribute.

    MPI programs cross-map these regions at init time, but none of the memory is actually touched or faulted in until the particular application accesses these mapped pages. The only resource the MPI programs consume is some virtual address space.

    A high SIZE and/or SHARE value should not indicate any additional need for swapping, since these pages are not faulted in.

    The RSS value should reasonably reflect the application memory usage, except it does not have any way to indicate shared resources.
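
    As a quick illustration (output will vary by system and by tool version), you can compare the resident and virtual sizes of a given process with ps(1); older versions of top(1) also report a SHARE column:

      # RSS and total virtual size (both in KB) for an MPI process of interest;
      # replace 1234 with an actual process ID
      ps -o pid,rss,vsz,cmd -p 1234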

Memory (Swap) sysctl Parameters


    Note: This section applies to SGI ProPack 3 systems only.


    The following kernel parameters can be modified at runtime with the sysctl(8) command to affect kernel swap behavior:

    vm.min_swap_page_calls

    Minimum number of swap calls in a swap watch interval before a decision is made to determine if the system is out of swap space or not.

    Defaults to 1000

    vm.oom_killer_nap_jiffies

    How long the oom_killer_thread naps after it is first woken up and before it kills some process; also the time it naps after killing a process and before killing the next process. This is done to make sure that the system does not kill a process unless it has been out of swap space for a while. It also gives the system time to react (and perhaps stop swapping so much) after a process has been killed and before the system decides whether or not to kill another process. (Note that the oom_killer() function checks whether the system is still out of swap space every time it wakes up from a nap. If the system is not out of swap space, it goes back to a long-term sleep and waits until start_oom_killer() is called again.)

    Defaults to 10*HZ

    vm.swap_watch_interval

    How long between resets of the swap out statistics collected by the get_swap_page() function.

    Defaults to 10*HZ

    vm.min_jiffies_out

    Minimum time that the try_to_free_pages_zone() function has to have consistently failed before the out_of_memory() function will start up the oom_killer() function.

    Defaults to 5*HZ

    vm.print_get_swap_page

    If set to 1, the vm.print_get_swap_page parameter causes the system to log swap-out statistics every swap_watch_interval, provided that swapping is active. This parameter may be removed in a future release.

    Defaults to 0

    vm.min_free_swap_pages

    If this much swap space is free, the system decides that it is no longer out of swap (out of memory) on the next pass through the oom_killer() function. Although this parameter is settable with the sysctl(8) command, it is reset to 1% of the available swap pages after each swapon() call. If some other value is desired, it must be set again with another sysctl call.

    Defaults to 1% of total swap pages

    sched.child_penalty

    The sched.child_penalty parameter controls how much or how little a forking child process inherits of one of the scheduling characteristics of the parent process, that is, its "interactivity" assessment. A forking child typically inherits only a fraction of the parent's "interactivity" assessment in order to avoid a potential denial-of-service attack on the system's CPU resource.

    Each process is regularly assessed with a quantitative "interactivity" level and is assigned a value in a numerical continuum that ranges between the extremes of "totally computebound" and "executes for brief periods of time on rare occasions." If a process is deemed to be more and more "interactive," the scheduler gives it more and more of a transitory boost in priority when the process wakes up and wants the CPU resource. That is, a process that appears to be "interactive," such as a shell that responds to user keyboard inputs, is given more timely access to the CPU than a process which appears to be computebound.

    This is a very heuristic assessment, and as such it is prone to approximation, confusion, and errors. One of the potential problems is the denial-of-service effect of having an interactive parent (which executes with that priority boost) being able to fork numerous children that would inherit the same high-priority "interactive" label as the parent and would themselves also preempt other lower-priority less-interactive processes.

    The remedy for this potential problem is to not allow a forked child to inherit the exact same "interactive" quantitative value as the parent. Instead, a forked child is assessed a child_penalty, which is a percentage of the parent's "interactive" assessment. The default child_penalty is 50, or 50% of the parent's value.
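
    A minimal sketch of inspecting and adjusting these values with the sysctl(8) command; the numbers written are examples only, not tuning recommendations:

      # Read the current values
      sysctl vm.min_swap_page_calls vm.min_free_swap_pages sched.child_penalty

      # Change a value at runtime (not persistent across reboots)
      sysctl -w sched.child_penalty=40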

Load-balancing Algorithms sysctl Parameters


    Note: This section applies to SGI ProPack 3 systems only.


    The kernel parameters that can be modified at runtime using the sysctl(8) command to affect kernel load-balancing algorithms are as follows:
    sched.sched_loadbal_max = 1
    sched.sched_exec_threshold = 1
    sched.max_loadbal_rejects = 100
    sched.sched_node_threshold = 125
    sched.busy_node_rebalance_ratio = 10
    sched.idle_node_rebalance_ratio_max = 50
    sched.idle_node_rebalance_ratio = 10
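
    The values shown above are the defaults. To review the current settings on a running system, you can list them with sysctl(8), for example:

      # List the CPU Scheduler parameters and their current values
      sysctl -a | grep '^sched\.'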

    These CPU Scheduler parameters affect the behavior of the load-balancing algorithms that attempt to equalize the runqueue lengths. This section describes these parameters.

    The sched_exec_threshold parameter affects the aggressiveness of load-balancing at process exec time. When a parent process forks, the resulting child process is by default assigned to the same CPU as the parent. This is often the optimal behavior, for example, due to shared virtual address spaces. However, if the child process issues an exec call itself, it is reasonable to assume that the child process typically gains little advantage in executing on the CPU of the parent process. Instead, the child process should be migrated to a lesser loaded CPU. The sched_exec_threshold parameter is the runqueue length above which the child issuing an exec call will search for a lesser-loaded CPU. The default value of one (1) means that the child process will always search. This is the most aggressive behavior. Raising the value makes the search less aggressive, which trades off a slight decrease in exec overhead for a less load-balanced set of CPUs.

    The remaining parameters described in this section control the behavior of the load-balancing algorithms that periodically execute on each CPU. During each 1024Hz timer tick, the CPU Scheduler decides whether or not to spend effort to compare its runqueue length against the runqueue lengths of the other CPUs. This is to determine if one or more processes should be pull-migrated from the runqueue of another CPU onto the runqueue of this CPU. This runqueue examination can be an expensive operation in terms of system resources, especially with large numbers of CPUs. Therefore, a CPU must trade off the cost of executing it too frequently against the inefficiency of executing it too infrequently and having the CPU remain underutilized.

    At each timer tick, the first decision made is whether or not to execute the basic load-balancing algorithm at all. The more frequently a CPU performs this load-balancing scan, the more evenly balanced are the runqueues of each and every CPU relative to the other runqueues. However, there are two significant downsides to performing overly frequent load-balancing. The first is that frequent load-balancing is invasive and causes contention on the busiest CPUs' runqueue spinlocks. High contention levels will affect context-switching performance and may in fact produce so much contention (especially at high CPU counts) that the system "livelocks" on the busiest CPU's runqueue lock. The second downside to frequent load-balancing is that processes may be migrated away from local physical memory and thus may suffer substantially longer memory access latencies. The trade-off is giving a process access to more CPU cycles at the cost of having those CPU cycles be less efficient because of longer latencies. In some cases, a process is much better off remaining on a busier CPU because the process remains close to the physical memory it can most efficiently access.

    An idle CPU is more tempted to perform this relatively costly load-balancing scan than a non-idle ("busy") CPU, since the system would generally benefit (ignoring issues of NUMA memory locality) from migrating a not-currently-executing process from another CPU onto this idle CPU. Every 1024Hz tick (roughly every millisecond) an idle CPU performs the load-balance scan within the node, that is, examining only the other CPU in the two-CPU Altix node. Every idle_node_rebalance_ratio ticks (currently a default value of 10, or roughly every ten milliseconds) an idle CPU performs a load-balance scan of all nodes in the system. Therefore, increasing the idle_node_rebalance_ratio value makes the idle CPU's full-system rebalancing less frequent; decreasing the value makes it more frequent.

    If an idle CPU finds no process to pull-migrate from a busier CPU, then the delay (the "infrequency") of these idle scans is dynamically increased by one, up to a maximum value of idle_node_rebalance_ratio_max. Therefore, with a default maximum of 50, an idle CPU does an all-CPU load-balance scan after 10 milliseconds. If no pull-migration occurs, the next scan occurs 11 milliseconds later, then 12 milliseconds later, and so on, up to a maximum of 50 milliseconds. When one of the scans finds a process to pull-migrate, the delay is reset to the basic idle_node_rebalance_ratio value, which defaults to 10. The higher the value of the idle_node_rebalance_ratio_max parameter, the longer the likely interval between all-CPU load-balancing scans, which means that when a "busier" CPU does emerge, the other idle CPUs will be slower to recognize it and to off-load it. Systems with larger CPU counts may well benefit from higher idle_node_rebalance_ratio and idle_node_rebalance_ratio_max values. Each individual idle CPU may be slower to see a suddenly overloaded CPU, but because there are likely to be many idle CPUs, some idle CPU will likely recognize the overloaded CPU and perform load-balancing. Never specify an idle_node_rebalance_ratio_max value less than idle_node_rebalance_ratio.

    A non-idle "busy" CPU performs the same within-the-local-node load-balancing scan at 1/100th the frequency of an idle CPU, or about every 100 milliseconds, and it performs the all-CPUs scan at a multiplier of busy_node_rebalance_ratio of that. With a busy_node_rebalance_ratio value of 10, this means an all-CPUs scan about once per second. These scan rates are definitely less frequent for a non-idle CPU than for an idle CPU. Once again, the CPU Scheduler is reluctant to migrate processes between nodes and potentially away from low-latency, local physical memory.

    Once a CPU decides to perform the load-balancing scan, there are more tuning parameters that can affect behavior. The first is the sched_node_threshold parameter, which is the threshold ratio of "imbalance" (relative to 100) that determines whether to begin pulling processes. The default value of 125 defines an imbalance of 25%. An alternative value of 150 would define an imbalance of 50%. You should never use a value less than 100 and should avoid using values less than the default 125.

    Once "busier" CPUs are identified that have processes that can be pull-migrated to this less-loaded CPU, the sched_loadbal_max is the number of processes that may be pull-migrated. The default value of one means that each load-balance scan will pull-migrate at most one process. Raising this value, increases the number of processes that may be migrated. The higher the value, the greater the likelihood that the load-balancing migrations may become overly aggressive. However, this is a heuristic algorithm, and some workloads might benefit from a value greater than the default of one.

    Consider a system with 128 CPUs in which the runqueue workload of one CPU suddenly skyrockets to 128 processes while the other 127 CPUs are idle. With a sched_loadbal_max value of one, as each idle CPU executes the load-balancing scan at whatever frequency that scan occurs (as noted earlier, depending upon both fixed constants and tuning parameters), each will pick off one process at a time from this overloaded runqueue until each CPU has a runqueue of one.

    If, however, the sched_loadbal_max parameter is a high value, the first idle CPU to execute the load-balance algorithm would pull half of the processes from the busy CPU; the 128-CPU system would then have two CPUs with equal runqueue lengths of 64-64. The next CPU to execute the load-balance algorithm would pull 32 processes from one of these busy CPUs to equalize that one bilateral imbalance, producing runqueue lengths of 32-64-32, and then would pull 16 processes from the second CPU to equalize that bilateral imbalance, thus producing final runqueue lengths of 32-48-48. Note that most of the migrated processes will not have actually executed, but will merely have moved from waiting on one runqueue to waiting on a second runqueue. A third idle CPU then does its load-balancing and produces another readjustment of 32-24-36-36. Once again, processes are migrated around in groups, from CPU to CPU, likely before actually executing. Thus, a higher sched_loadbal_max value may result in more active migrations and certainly in a different pattern of change in the runqueue lengths, but it is unclear whether higher values are more, or less, effective than lower values.

    Finally, the max_loadbal_rejects parameter puts a limit on the number of pull-migration candidate processes that a load-balancing CPU will examine, and reject as being unsuitable, before releasing the runqueue spinlocks and giving up the search. A candidate is rejected if it has not been in the runqueue of the busy CPU for a long enough time and thus is deemed to be "cache hot", or because the candidate process's cpus_allowed mask (as set by CpuMemSets or by the sys_sched_setaffinity() system call) does not allow that process to execute on the load-balancing CPU. The higher the max_loadbal_rejects value, the more effort the searching CPU makes to find a process to pull-migrate. Once again, this is a trade-off: the cost of acquiring and holding valuable runqueue spinlocks for potentially longer periods of time, versus the benefit of successfully load-balancing the CPU runqueues.
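
    If experimentation shows that non-default values help a particular workload, the settings can be made persistent by adding them to /etc/sysctl.conf and applying them with sysctl -p. The values below are purely illustrative:

      # /etc/sysctl.conf (excerpt; example values, not recommendations)
      sched.sched_loadbal_max = 4
      sched.max_loadbal_rejects = 200

      # Apply the file without rebooting
      sysctl -p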

Virtual Memory hugetlb Parameter

    The /usr/src/linux/Documentation/vm/hugetlbpage.txt file on your system provides a brief summary of hugetlb page support in the Linux kernel. The Intel Itanium architecture supports multiple page sizes (4K, 8K, 64K, 256K, 1M, 4M, 16M, 256M, and so on). A translation lookaside buffer (TLB) is a cache of virtual-to-physical translations and is typically a very scarce resource on a processor. Operating systems try to make the best use of the limited number of TLB entries. This optimization is more critical now that larger physical memories (several GBs) are more available.

    You can use the huge page support in the Linux kernel either with the mmap(2) system call or with the standard shared memory system calls shmget(2) and shmat(2) (see the shmop(2) man page).

    For information on using hugetlb, see /usr/src/linux/Documentation/vm/hugetlbpage.txt.

    You can also boot your system with the kernel hugepages=X parameter set, where X is a number of pages. Setting the hugepages parameter allows the kernel to allocate as many of the huge pages as possible early on; after the system has been running for a while, it becomes harder to find contiguous memory. You can also set the number of huge pages early in the init scripts by writing to /proc/sys/vm/nr_hugepages, as described in the /usr/src/linux/Documentation/vm/hugetlbpage.txt file.
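
    A minimal sketch of reserving huge pages after boot and confirming the result; the count of 64 is an arbitrary example and may be satisfied only partially if memory is already fragmented:

      # Request 64 huge pages
      echo 64 > /proc/sys/vm/nr_hugepages

      # Confirm how many pages were actually allocated
      grep HugePages /proc/meminfo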

    Some considerations on using the hugetlb parameter on your Altix system are as follows:

    • Starting with the SGI ProPack 3 for Linux Service Pack 1 release, allocation of hugetlb pages is NUMA aware. That is, pages are allocated either on the node, or as close as possible to the node where the mmap or shmget(2) system call is executed. The hugetlb pages are allocated and mapped into the requesting address space at mmap() or shmat() time; the hugetlb pages are not demand faulted into the address space as are regular pages. The hugetlb pages are allocated and zeroed by the thread that calls mmap() or shmget().

    • Starting with the SGI ProPack 3, Service Pack 1 release, hugetlb pages are allocated on first touch, much like how regular pages are allocated into an address space by a call to the mmap routine. The hugetlb page allocation is NUMA aware; this means that the huge page is allocated on (or as close as possible to) the node where the thread that first touched the page is executing. In previous SGI ProPack releases, tasks that allocated more hugetlb pages than were available at the time of the mmap() or shmget() call caused the mmap() or shmget() call to fail. To keep this behavior the same in the ProPack 3, Service Pack 1 release, the system implements a "reservation" algorithm to ensure that, once the mmap() or shmget() call has completed successfully, there are sufficient pages at first-touch time to satisfy the (delayed) storage allocation request that occurs at first-touch time. Page reservation is not subject to NUMA allocation; that is, you cannot reserve pages on a particular node, nor does the node on which the mmap() or shmget() request executes have any influence on where the hugetlb pages are finally placed. Placement of hugetlb pages is determined by the node where the first touch occurs and the locations of the hugetlb pages available at that time. The number of hugetlb pages currently reserved can be found in the /proc/meminfo file, as shown in the following example:

      [root@revenue3 proc]# cat meminfo
              total:    used:    free:  shared: buffers:  cached:
      Mem:  32301940736 4025974784 28275965952        0    98304 2333163520
      Swap: 10737319936  9928704 10727391232
      MemTotal:     31544864 kB
      MemFree:      27613248 kB
      MemShared:           0 kB
      Buffers:            96 kB
      Cached:        2271280 kB
      SwapCached:       7200 kB
      Active:         459360 kB
      Inactive:      1845840 kB
      HighTotal:           0 kB
      HighFree:            0 kB
      LowTotal:     31544864 kB
      LowFree:      27613248 kB
      SwapTotal:    10485664 kB
      SwapFree:     10475968 kB
      HugePages_Total:     0
      HugePages_Free:      0
      Hugepagesize:    262144 kB

    • You can change the hugetlb page size at system boot time by specifying the hugepagesz=NNNN parameter at the ELILO boot command prompt (this parameter can also be specified using the append command in the /boot/efi/efi/sgi/elilo.conf file). The NNNN parameter is the size of the hugetlb pages in bytes, and it must be a hugetlb page size supported by the hardware platform where the kernel is booted.
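
    For example, a hypothetical elilo.conf entry might pass both parameters on the kernel command line; the page size (256 MB, expressed in bytes) and page count here are purely illustrative, and the image, label, and root values must match your own installation:

      # Excerpt from /boot/efi/efi/sgi/elilo.conf (hypothetical entry)
      image=vmlinuz
          label=propack
          root=/dev/sda3
          append="hugepagesz=268435456 hugepages=16"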