Chapter 2. GRU Driver and GRU Libraries Environment Variables

This chapter describes environment variables that can be used to specify options to the global reference unit (GRU) driver and GRU libraries. For a description of the GRU, see Chapter 1, “Altix UV GRU Direct Access API”.

GRU_TLBMISS_MODE

If an instruction references a virtual address that is not in the GRU translation lookaside buffer (TLB), a TLB miss occurs. TLB misses can be handled in several ways:

  • user_polling

    TLB dropins are done as a side effect of users calling gru_wait or gru_check_status on the coherence buffer request (CBR).

  • interrupt

    The GRU sends an interrupt to the CPU. The TLB dropin is done in the GRU interrupt handler.

The default mode is interrupt, although you can override this default using an option on the gru_create_context() request. The GRU_TLBMISS_MODE environment variable overrides both, as follows:

    setenv GRU_TLBMISS_MODE [interrupt|user_polling]
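
The precedence described above (built-in default, then the gru_create_context() option, then the environment variable) can be sketched in a few lines of Python. This is only an illustration of the ordering, not the GRU library's code; the function name and the context_option parameter are hypothetical stand-ins for the real option passed to gru_create_context().

```python
import os

VALID_MODES = ("interrupt", "user_polling")


def effective_tlbmiss_mode(context_option=None, environ=None):
    """Return the TLB miss mode after applying the documented precedence."""
    environ = os.environ if environ is None else environ
    mode = "interrupt"                 # documented default
    if context_option is not None:     # hypothetical stand-in for the
        mode = context_option          # gru_create_context() option
    env_value = environ.get("GRU_TLBMISS_MODE")
    if env_value:                      # the environment variable overrides both
        mode = env_value
    if mode not in VALID_MODES:
        raise ValueError(f"unknown TLB miss mode: {mode!r}")
    return mode
```

For example, effective_tlbmiss_mode("user_polling", environ={"GRU_TLBMISS_MODE": "interrupt"}) returns "interrupt", because the environment variable has the last word.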

GRU_CCH_REQUEST_SLICE

The GRU execution unit timeslices across all active instructions. By default, the GRU issues four NUMAlink get/put messages for an active instruction, then switches to the next active instruction. You can override the default, as follows:

setenv GRU_CCH_REQUEST_SLICE [0|1|2|3]
 
 0 - issue 4 requests
 1 - issue 8 requests
 2 - issue 16 requests
 3 - not sliced. All requests are issued
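
The value table above can be captured as a small lookup, which is handy when a launch script needs to validate the setting before starting a job. The sketch below is a convenience wrapper under that assumption; the helper names are hypothetical and are not part of the GRU libraries.

```python
# Requests issued per timeslice for each GRU_CCH_REQUEST_SLICE value.
# None means the instruction is not sliced (all requests are issued).
REQUESTS_PER_SLICE = {0: 4, 1: 8, 2: 16, 3: None}


def slice_env(value):
    """Return the environment entry for a validated slice setting."""
    if value not in REQUESTS_PER_SLICE:
        raise ValueError("GRU_CCH_REQUEST_SLICE must be 0, 1, 2 or 3")
    return {"GRU_CCH_REQUEST_SLICE": str(value)}


def describe_slice(value):
    """Return a human-readable description of a slice setting."""
    requests = REQUESTS_PER_SLICE[value]
    if requests is None:
        return "not sliced"
    return f"{requests} requests per slice"
```

The returned dictionary can be merged into the environment of a child process, for example with subprocess.run(cmd, env={**os.environ, **slice_env(2)}).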

GRU_TLB_PRELOAD

The GRU driver can be configured to do anticipatory TLB dropins for GRU BCOPY instructions that take a TLB miss. When a TLB miss occurs and the instruction is a BCOPY, the GRU driver drops in multiple TLB entries. To configure the GRU driver to do anticipatory TLB dropins for GRU BCOPY instructions, perform the following:

setenv GRU_EXCEPTION_RETRY <num>
<num> number of consecutive retries before returning an error

GRU_STATISTICS_FILE

You can collect statistics on a task's usage of GRU contexts by using this option to specify a statistics file, as follows:

setenv GRU_STATISTICS_FILE <filename> 

Whenever a task exits or a GRU context is destroyed, statistics are written to this file. A sample file follows:

 Pid: 23020                          Mon Oct 19 20:46:56 2009
 Command: ./sgup2
 CBRs: 4
 DSRs: 24576 bytes
 Gseg vaddr: 0x7fe3a1e80000
    46740 instructions
       23 instruction_wait
        0 exceptions
     9903 FMM tlb dropin
        1 UPM tlb dropin
     1040 context stolen
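
A statistics record like the sample above is straightforward to post-process. The following Python sketch splits one record into header fields and integer counters. It assumes the layout shown in the sample (header lines of the form "name: value", counter lines of the form "count name") and handles a single record; the function name and the "Timestamp" key are illustrative choices, not part of any GRU tool.

```python
import re

SAMPLE = """\
 Pid: 23020                          Mon Oct 19 20:46:56 2009
 Command: ./sgup2
 CBRs: 4
 DSRs: 24576 bytes
 Gseg vaddr: 0x7fe3a1e80000
    46740 instructions
       23 instruction_wait
        0 exceptions
     9903 FMM tlb dropin
        1 UPM tlb dropin
     1040 context stolen
"""

COUNTER = re.compile(r"^(\d+)\s+(.+)$")


def parse_gru_statistics(text):
    """Split one statistics record into header fields and integer counters."""
    header, counters = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = COUNTER.match(line)
        if match:                      # e.g. "46740 instructions"
            counters[match.group(2)] = int(match.group(1))
        elif ":" in line:              # e.g. "CBRs: 4"
            key, value = line.split(":", 1)
            header[key.strip()] = value.strip()
    # The Pid line also carries a timestamp after a run of spaces.
    pid_fields = re.split(r"\s{2,}", header.get("Pid", ""), maxsplit=1)
    if len(pid_fields) == 2:
        header["Pid"], header["Timestamp"] = pid_fields
    return header, counters
```

Calling parse_gru_statistics(SAMPLE) yields a header dictionary (Pid, Command, CBRs, and so on) and a counter dictionary keyed by the names in the file, such as "FMM tlb dropin".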

GRU_TRACE_FILE

You can collect a detailed trace of GRU instructions. Use this option to specify the name of the file for the trace information. There are three levels of tracing, as follows:

  • All GRU instructions

  • GRU instructions that return error EXCEPTIONS to users

  • GRU instructions that fail and are automatically retried

To collect a detailed trace of GRU instructions, perform the following:

setenv GRU_TRACE_FILE <filename> 

GRU_TRACE_INSTRUCTIONS

Setting this option enables tracing of every GRU instruction, as follows:

setenv GRU_TRACE_INSTRUCTIONS

GRU_TRACE_EXCEPTIONS

This option enables tracing of GRU instructions that cause exceptions. Note that some exceptions for GRU MESQ instructions are automatically handled by the GRU mesq library routines. These exceptions are not traced if <val> is equal to 1 (or not specified). If you want to see these exceptions (mesq_full, amo_nacked, and so on), set <val> to 2.

setenv GRU_TRACE_EXCEPTIONS <val>
  

GRU_TRACE_INSTRUCTION_RETRY

This option enables tracing of GRU instructions that fail due to transient errors. The GRU library routines normally retry the instruction, and the failure is hidden from the user. If you want to see these failures that are retried successfully, enable this option, as follows:

setenv GRU_TRACE_INSTRUCTION_RETRY

An example output file follows:

 Pid: 25276 - gru_wait
         opc: NOP, xtype: BYTE, ima: ImmResp
         istatus: IDLE
 Pid: 25276 - gru_wait
         opc: VLOAD, xtype: DWORD, ima: DelResp, baddr0: 0x604450, tri0: 0x0, nelem: 0x1, stride: 0x1
         istatus: IDLE
 Pid: 25276 - gru_wait
         opc: VSTORE, xtype: DWORD, ima: DelResp, baddr0: 0x604450, tri0: 0x0, nelem: 0x1, stride: 0x1
         istatus: IDLE
 Pid: 25276 - gru_wait
         opc: IVLOAD, xtype: DWORD, ima: DelResp, baddr0: 0x0, tri0: 0x0, tri1: 0x40, nelem: 0x1
         istatus: IDLE
 Pid: 25276 - gru_wait
         opc: IVSTORE, xtype: DWORD, ima: DelResp, baddr0: 0x0, tri0: 0x0, tri1: 0x40, nelem: 0x1
         istatus: IDLE
 Pid: 25276 - gru_wait
         opc: VSET, xtype: DWORD, ima: DelResp, baddr0: 0x604450, value: 0x483966aa127ded1d, nelem: 0x1, stride: 0x1
         istatus: IDLE
 Pid: 25284, Tid: 25289 - gru_wait
         opc: MESQ, xtype: CACHELINE, ima: DelResp, baddr0: 0x606000, tri0: 0x0, nelem: 0x1
         istatus: EXCEPTION, isubstatus: QLIMIT, avalue: 0f0000000f
             execstatus: EXCEPTION
             state: 0x1, exceptdet0: 0x606000, exceptdet1: 0x8
 Pid: 25284, Tid: 25288 - gru_wait
         opc: MESQ, xtype: CACHELINE, ima: DelResp, baddr0: 0x606000, tri0: 0x0, nelem: 0x1
         istatus: EXCEPTION, isubstatus: AMO_NACKED, avalue: 00
             execstatus: EXCEPTION
             state: 0x1, exceptdet0: 0x606000, exceptdet1: 0x8
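
Trace output in this shape can be summarized with a short script. The Python sketch below groups the lines into records (one per "Pid: ... - call" header) and counts exception substatus values. It assumes the format shown in the example above; the function names are illustrative, and the embedded sample is a three-record excerpt of that example.

```python
import re
from collections import Counter

SAMPLE = """\
 Pid: 25276 - gru_wait
         opc: NOP, xtype: BYTE, ima: ImmResp
         istatus: IDLE
 Pid: 25284, Tid: 25289 - gru_wait
         opc: MESQ, xtype: CACHELINE, ima: DelResp, baddr0: 0x606000, tri0: 0x0, nelem: 0x1
         istatus: EXCEPTION, isubstatus: QLIMIT, avalue: 0f0000000f
             execstatus: EXCEPTION
             state: 0x1, exceptdet0: 0x606000, exceptdet1: 0x8
 Pid: 25284, Tid: 25288 - gru_wait
         opc: MESQ, xtype: CACHELINE, ima: DelResp, baddr0: 0x606000, tri0: 0x0, nelem: 0x1
         istatus: EXCEPTION, isubstatus: AMO_NACKED, avalue: 00
             execstatus: EXCEPTION
             state: 0x1, exceptdet0: 0x606000, exceptdet1: 0x8
"""

HEADER = re.compile(r"^\s*Pid: (\d+)(?:, Tid: (\d+))? - (\w+)\s*$")
FIELD = re.compile(r"(\w+):\s*([^,\s]+)")


def parse_trace(text):
    """Return one dictionary per traced instruction."""
    records = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:                      # start of a new record
            records.append({
                "pid": int(match.group(1)),
                "tid": int(match.group(2)) if match.group(2) else None,
                "call": match.group(3),
            })
        elif records:                  # "key: value" detail lines
            records[-1].update(FIELD.findall(line))
    return records


def exception_counts(records):
    """Count records with istatus EXCEPTION, grouped by isubstatus."""
    return Counter(
        r.get("isubstatus", "UNKNOWN")
        for r in records
        if r.get("istatus") == "EXCEPTION"
    )
```

All field values are kept as strings, since the trace mixes hexadecimal values with and without a 0x prefix.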

GRU Files in /proc

The /proc/sgi_uv/gru directory contains several files that have information about GRU state, as follows:

  • gru_options

    Bit-field that can be used to enable or disable options

  • cch_status

    List of tasks using GRU contexts

  • gru_status

    List of available GRU resources

  • statistics

    Detailed GRU driver statistics (if enabled)

  • mcs_status

    Timing information for kernel GRU commands

Some examples of the files in /proc/sgi_uv/gru follow:

Example 2-1. gru_status - Available Resources

This file shows the free resources available in each GRU chiplet, as follows:

 % cat gru_status
 #  gid  nid    ctx   cbr   dsr     ctx   cbr   dsr
 #             busy  busy  busy    free  free  free
      0    0      8    36 32768       8    92     0
      1    0      1     4  4096      15   124 28672
      2    1      7    56 28672       9    72  4096
      3    1      7    28 28672       9   100  4096
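
The gru_status table is easy to consume from a script, for example to find chiplets that still have room for a new context. The Python sketch below parses the listing shown above into one dictionary per chiplet. The column names are taken from the two header lines; treating the dsr columns as byte counts is an assumption, and the function names are illustrative.

```python
COLUMNS = ("gid", "nid", "ctx_busy", "cbr_busy", "dsr_busy",
           "ctx_free", "cbr_free", "dsr_free")

SAMPLE = """\
 #  gid  nid    ctx   cbr   dsr     ctx   cbr   dsr
 #             busy  busy  busy    free  free  free
      0    0      8    36 32768       8    92     0
      1    0      1     4  4096      15   124 28672
      2    1      7    56 28672       9    72  4096
      3    1      7    28 28672       9   100  4096
"""


def parse_gru_status(text):
    """Return one dictionary per GRU chiplet row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS) or not all(f.isdigit() for f in fields):
            continue                   # skip '#' header lines and blank lines
        rows.append(dict(zip(COLUMNS, map(int, fields))))
    return rows


def chiplets_with_room(rows, cbrs, dsr_bytes):
    """Return rows with a free context and at least the requested CBRs and DSR space."""
    return [
        row for row in rows
        if row["ctx_free"] > 0
        and row["cbr_free"] >= cbrs
        and row["dsr_free"] >= dsr_bytes
    ]
```

With the sample above, a request for 4 CBRs and 4096 bytes of DSR space excludes chiplet 0, which has no free DSR space.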


Example 2-2. gru_options - Enable or Disable Driver Features

Various GRU options (mostly for debugging) can be enabled or disabled by writing values to the /proc/sgi_uv/gru/gru_options file. Use the cat command to view the file, which shows the current settings and a description of the various options.

 % cat debug_options
  # bitmask: 1=trace, 2=statistics, 0x10=No_4k_dsr_AU_war
  # bitmask: 0x20=no_iabort_war, 0x40=no_chiplet_affinity
  # bitmask: 0x80=no_tlb_war, 0x100=no_mesq_war

  0x0001 - enable VERY verbose driver trace information to /var/log/messages
  0x0002 - enable statistics (they are not free)
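
Because the options value is a bit-field, it helps to convert between option names and the number written to the file. The Python sketch below uses the bit assignments from the "# bitmask:" comment lines in the listing above (1=trace, 2=statistics, and so on); the helper names are hypothetical and the meaning of each workaround bit is not interpreted here.

```python
# Bit assignments as listed in the "# bitmask:" comment lines.
GRU_OPTION_BITS = {
    0x001: "trace",
    0x002: "statistics",
    0x010: "No_4k_dsr_AU_war",
    0x020: "no_iabort_war",
    0x040: "no_chiplet_affinity",
    0x080: "no_tlb_war",
    0x100: "no_mesq_war",
}


def decode_gru_options(value):
    """Return the names of the option bits set in value."""
    names = [name for bit, name in sorted(GRU_OPTION_BITS.items()) if value & bit]
    unknown = value & ~sum(GRU_OPTION_BITS)   # bits not named in the listing
    if unknown:
        names.append(f"unknown({unknown:#x})")
    return names


def encode_gru_options(*names):
    """Return the bitmask value for the given option names."""
    lookup = {name: bit for bit, name in GRU_OPTION_BITS.items()}
    value = 0
    for name in names:
        value |= lookup[name]
    return value
```

For instance, encode_gru_options("statistics") gives 2, which matches the value used to enable statistics collection.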


Example 2-3. statistics - Very Detailed Driver Statistics

You can collect detailed driver statistics, as follows:

% echo 2 > /proc/sgi_uv/gru/gru_options

With this option enabled, detailed statistics collection occurs in numerous places in the driver. This collection adds system overhead, especially on large systems.

% cat /proc/sgi_uv/gru/statistics
          45806 vdata_alloc
          45771 vdata_free
         195712 gts_alloc
         195668 gts_free
          34351 gms_alloc
          34333 gms_free
         149398 gts_double_allocate
         ... (lots more)
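
The counters in this file come in "count name" pairs, so they can be loaded into a dictionary and compared. The Python sketch below parses the listing and, for every *_alloc counter that has a matching *_free counter, reports the difference. Reading that difference as the number of objects still allocated is an assumption about the driver, not something the listing states; the function names are illustrative.

```python
import re

SAMPLE = """\
          45806 vdata_alloc
          45771 vdata_free
         195712 gts_alloc
         195668 gts_free
          34351 gms_alloc
          34333 gms_free
         149398 gts_double_allocate
         ... (lots more)
"""

LINE = re.compile(r"^\s*(\d+)\s+(\w+)\s*$")


def parse_driver_statistics(text):
    """Return a {counter_name: value} dictionary; other lines are ignored."""
    stats = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            stats[match.group(2)] = int(match.group(1))
    return stats


def alloc_free_balance(stats):
    """Return alloc minus free for each counter pair present in stats."""
    balance = {}
    for name, count in stats.items():
        if name.endswith("_alloc"):
            base = name[:-len("_alloc")]
            if base + "_free" in stats:
                balance[base] = count - stats[base + "_free"]
    return balance
```

Sampling the file twice and subtracting the two dictionaries gives per-interval rates, similar in spirit to an incremental display.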


grustats Command

You can use the grustats command to view GRU statistics. You will see output similar to the following:

uv15-sys    TOTAL GRU STATISTICS SINCE COMMAND START
         0  vdata_alloc                             0  copy_gpa
         0  vdata_open                              0  read_gpa
         0  vdata_free                              0  mesq_receive
         0  gts_alloc                               0  mesq_receive_none
         0  gts_free                                0  mesq_send
         0  gms_alloc                               0  mesq_send_failed
         0  gms_free                                0  mesq_noop
         0  gts_double_allocate                     0  mesq_send_unexpected_error
         0  assign_context                          0  mesq_send_lb_overflow
         0  assign_context_failed                   0  mesq_send_qlimit_reached
         0  free_context                            0  mesq_send_amo_nacked
         0  load_user_context                       0  mesq_send_put_nacked
         0  load_kcontext                           0  mesq_qf_locked
         0  load_kcontext_assign                    0  mesq_qf_noop_not_full
         0  load_kcontext_steal                     0  mesq_qf_switch_head_failed
         0  lock_kcontext                           0  mesq_qf_unexpected_error
         0  unlock_kcontext                         0  mesq_noop_unexpected_error
         0  get_kcontext_cbr                        0  mesq_noop_lb_overflow
         0  get_kcontext_cbr_busy                   0  mesq_noop_qlimit_reached
         0  lock_async_resource                     0  mesq_noop_amo_nacked
         0  unlock_async_resource                   0  mesq_noop_put_nacked
         0  steal_user_context                      0  mesq_noop_page_overflow
         0  steal_kernel_context                    0  implicit_abort
         0  steal_context_failed                    0  implicit_abort_retried
... and much more

To see a usage statement while the grustats command is running, enter the letter h. The help screen appears, as follows:

Intstats help:
    h            - help (this screen)
    q            - quit
    r            - reset command-start statistics
    t or <TAB>   - toggle between total and incremental mode
    CTL-L        - redraw screen

        CR - to return to display