This chapter describes environment variables that can be used to specify options to the global reference unit (GRU) driver and GRU libraries. For a description of the GRU, see Chapter 1, “Altix UV GRU Direct Access API”.
If an instruction references a virtual address that is not in the GRU translation lookaside buffer (TLB), a TLB miss occurs. TLB misses can be handled in several ways:
user_polling
TLB dropins are done as a side effect of users calling gru_wait or gru_check_status on the coherence buffer request (CBR).
interrupt
The GRU sends an interrupt to the CPU. The TLB dropin is done in the GRU interrupt handler.
The default mode is interrupt. You can override the default with an option on the gru_create_context() request. The GRU_TLBMISS_MODE environment variable overrides both the default and the gru_create_context() option, as follows:
    setenv GRU_TLBMISS_MODE [interrupt|user_polling]
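The setenv syntax is for csh-style shells. In a Bourne-style shell (sh or bash), the variable is exported instead. The following is a minimal sketch rather than part of the GRU software: the helper name set_gru_tlbmiss_mode is hypothetical, and its check only mirrors the two modes listed above.

```shell
# Hypothetical helper: export GRU_TLBMISS_MODE only when the value is one of
# the two documented modes; otherwise leave the environment unchanged.
set_gru_tlbmiss_mode() {
    case "$1" in
        interrupt|user_polling)
            GRU_TLBMISS_MODE=$1
            export GRU_TLBMISS_MODE
            ;;
        *)
            echo "unsupported TLB miss mode: $1" >&2
            return 1
            ;;
    esac
}

set_gru_tlbmiss_mode user_polling
echo "GRU_TLBMISS_MODE=$GRU_TLBMISS_MODE"
```

Rejecting unknown values before exporting avoids passing a misspelled mode to the GRU libraries, whose handling of invalid values is not described here.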
The GRU execution unit timeslices across all active instructions. By default, the GRU issues four NUMAlink get/put messages for an active instruction, then switches to the next active instruction. You can override the default, as follows:
    setenv GRU_CCH_REQUEST_SLICE [0|1|2|3]
        0 - issue 4 requests
        1 - issue 8 requests
        2 - issue 16 requests
        3 - not sliced. All requests are issued
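As a worked illustration of these settings, the values 0 through 2 correspond to 4, 8, and 16 requests, that is, 4 shifted left by the setting. The following sketch assumes a Bourne-style shell; the function name requests_per_slice is hypothetical and is not part of the GRU software.

```shell
# Hypothetical helper: translate a GRU_CCH_REQUEST_SLICE setting into the
# number of requests issued per timeslice, per the values documented above.
requests_per_slice() {
    case "$1" in
        0|1|2) echo $((4 << $1)) ;;     # 0 -> 4, 1 -> 8, 2 -> 16
        3)     echo "unsliced" ;;       # all requests are issued
        *)     echo "invalid GRU_CCH_REQUEST_SLICE value: $1" >&2
               return 1 ;;
    esac
}

GRU_CCH_REQUEST_SLICE=2
export GRU_CCH_REQUEST_SLICE
echo "requests per slice: $(requests_per_slice "$GRU_CCH_REQUEST_SLICE")"
```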
The GRU driver can be configured to do anticipatory TLB dropins for GRU BCOPY instructions that take a TLB miss. When a TLB miss occurs and the instruction is a BCOPY, the GRU driver drops in multiple TLB entries. To configure the GRU driver to do anticipatory TLB dropins for GRU BCOPY instructions, perform the following:
    setenv GRU_EXCEPTION_RETRY <num>
        <num> - number of consecutive retries before returning an error
You can collect statistics of a task's usage of GRU contexts by using this option to specify a statistics file, as follows:
    setenv GRU_STATISTICS_FILE <filename>
Whenever a task exits or a GRU context is destroyed, statistics are written to this file. A sample file follows:
    Pid: 23020    Mon Oct 19 20:46:56 2009
    Command: ./sgup2
    CBRs: 4 DSRs: 24576 bytes
    Gseg vaddr: 0x7fe3a1e80000
         46740 instructions
            23 instruction_wait
             0 exceptions
          9903 FMM tlb dropin
             1 UPM tlb dropin
          1040 context stolen
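To pull a single counter out of a statistics file, a short awk filter is usually enough. The following sketch assumes that the file places each counter on its own line in the form <count> <name>. The function name gru_stat_counter is hypothetical, and the sample data is the file shown above.

```shell
# Hypothetical helper: read a GRU statistics file on stdin and print the
# value of the counter named in $1. Returns non-zero if it is not present.
gru_stat_counter() {
    awk -v want="$1" '
        $1 ~ /^[0-9]+$/ {
            # Rebuild the counter name from the remaining fields.
            name = $2
            for (i = 3; i <= NF; i++) name = name " " $i
            if (name == want) { print $1; found = 1; exit }
        }
        END { if (!found) exit 1 }
    '
}

fmm_dropins=$(gru_stat_counter "FMM tlb dropin" <<'EOF'
Pid: 23020    Mon Oct 19 20:46:56 2009
Command: ./sgup2
CBRs: 4 DSRs: 24576 bytes
Gseg vaddr: 0x7fe3a1e80000
     46740 instructions
        23 instruction_wait
         0 exceptions
      9903 FMM tlb dropin
         1 UPM tlb dropin
      1040 context stolen
EOF
)
echo "FMM tlb dropins: $fmm_dropins"
```

In practice, the input would be redirected from the file named by GRU_STATISTICS_FILE rather than from an inline sample.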
You can collect a detailed trace of GRU instructions. Use this option to specify the name of the file for the trace information. There are three levels of tracing, as follows:
All GRU instructions
GRU instructions that return error EXCEPTIONS to users
GRU instructions that fail and are automatically retried
To collect a detailed trace of GRU instructions, perform the following:
    setenv GRU_TRACE_FILE <filename>
Setting this option enables tracing of every GRU instruction, as follows:
    setenv GRU_TRACE_INSTRUCTIONS
This option enables tracing of GRU instructions that cause exceptions. Note that some exceptions for GRU MESQ instructions are automatically handled by the GRU mesq library routines. These exceptions are not traced if <val> is equal to 1 (or is not specified). To see these exceptions (mesq_full, amo_nacked, and so on), set <val> to 2.
This option enables tracing of GRU instructions that fail due to transient errors. The GRU library routines normally retry the instruction, and the failure is hidden from the user. To see failures that are retried successfully, enable this option, as follows:
    setenv GRU_TRACE_INSTRUCTION_RETRY
A sample trace file follows:
    Pid: 25276 - gru_wait
        opc: NOP, xtype: BYTE, ima: ImmResp
        istatus: IDLE
    Pid: 25276 - gru_wait
        opc: VLOAD, xtype: DWORD, ima: DelResp, baddr0: 0x604450, tri0: 0x0, nelem: 0x1, stride: 0x1
        istatus: IDLE
    Pid: 25276 - gru_wait
        opc: VSTORE, xtype: DWORD, ima: DelResp, baddr0: 0x604450, tri0: 0x0, nelem: 0x1, stride: 0x1
        istatus: IDLE
    Pid: 25276 - gru_wait
        opc: IVLOAD, xtype: DWORD, ima: DelResp, baddr0: 0x0, tri0: 0x0, tri1: 0x40, nelem: 0x1
        istatus: IDLE
    Pid: 25276 - gru_wait
        opc: IVSTORE, xtype: DWORD, ima: DelResp, baddr0: 0x0, tri0: 0x0, tri1: 0x40, nelem: 0x1
        istatus: IDLE
    Pid: 25276 - gru_wait
        opc: VSET, xtype: DWORD, ima: DelResp, baddr0: 0x604450, value: 0x483966aa127ded1d, nelem: 0x1, stride: 0x1
        istatus: IDLE
    Pid: 25284, Tid: 25289 - gru_wait
        opc: MESQ, xtype: CACHELINE, ima: DelResp, baddr0: 0x606000, tri0: 0x0, nelem: 0x1
        istatus: EXCEPTION, isubstatus: QLIMIT, avalue: 0f0000000f
        execstatus: EXCEPTION state: 0x1, exceptdet0: 0x606000, exceptdet1: 0x8
    Pid: 25284, Tid: 25288 - gru_wait
        opc: MESQ, xtype: CACHELINE, ima: DelResp, baddr0: 0x606000, tri0: 0x0, nelem: 0x1
        istatus: EXCEPTION, isubstatus: AMO_NACKED, avalue: 00
        execstatus: EXCEPTION state: 0x1, exceptdet0: 0x606000, exceptdet1: 0x8
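When a trace file is large, a per-cause count of exceptions is often more useful than the raw entries. The following sketch tallies the isubstatus values that appear in a trace. It assumes only that each isubstatus: key is followed on the same line by its value, as in the sample entries. The function name count_isubstatus is hypothetical.

```shell
# Hypothetical helper: read GRU trace text on stdin and print one line per
# distinct isubstatus value, in the form "<count> <value>", sorted by value.
count_isubstatus() {
    awk '
        {
            for (i = 1; i < NF; i++)
                if ($i == "isubstatus:") {
                    v = $(i + 1)
                    sub(/,$/, "", v)    # drop the trailing comma
                    n[v]++
                }
        }
        END { for (v in n) print n[v], v }
    ' | sort -k2
}

summary=$(count_isubstatus <<'EOF'
Pid: 25276 - gru_wait
    opc: NOP, xtype: BYTE, ima: ImmResp
    istatus: IDLE
Pid: 25284, Tid: 25289 - gru_wait
    opc: MESQ, xtype: CACHELINE, ima: DelResp, baddr0: 0x606000, tri0: 0x0, nelem: 0x1
    istatus: EXCEPTION, isubstatus: QLIMIT, avalue: 0f0000000f
    execstatus: EXCEPTION state: 0x1, exceptdet0: 0x606000, exceptdet1: 0x8
Pid: 25284, Tid: 25288 - gru_wait
    opc: MESQ, xtype: CACHELINE, ima: DelResp, baddr0: 0x606000, tri0: 0x0, nelem: 0x1
    istatus: EXCEPTION, isubstatus: AMO_NACKED, avalue: 00
    execstatus: EXCEPTION state: 0x1, exceptdet0: 0x606000, exceptdet1: 0x8
EOF
)
echo "$summary"
```

In practice, the input would be redirected from the file named by GRU_TRACE_FILE.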
The /proc/sgi_uv/gru directory contains several files that have information about GRU state, as follows:
gru_options
Bit-field that can be used to enable or disable options
cch_status
List of tasks using GRU contexts
gru_status
List of available GRU resources
statistics
Detailed GRU driver statistics (if enabled)
mcs_status
Timing information for kernel GRU commands
Some examples of the files in /proc/sgi_uv/gru follow:
Example 2-1. gru_status - Available Resources
The file shows the free resources available in each GRU chiplet, as follows:
    % cat gru_status
    #  gid  nid   ctx   cbr    dsr   ctx   cbr    dsr
    #            busy  busy   busy  free  free   free
         0    0     8    36  32768     8    92      0
         1    0     1     4   4096    15   124  28672
         2    1     7    56  28672     9    72   4096
         3    1     7    28  28672     9   100   4096
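The busy and free columns can be combined to show how much of each chiplet is in use. The following sketch reports CBR usage per chiplet. It assumes one row per chiplet with the eight columns in the order of the header (gid, nid, then ctx, cbr, and dsr busy, then ctx, cbr, and dsr free). The function name cbr_usage is hypothetical.

```shell
# Hypothetical helper: read gru_status text on stdin and report, for each
# chiplet, how many CBRs are busy out of the total (busy + free).
cbr_usage() {
    awk '
        /^[ \t]*#/ || NF < 8 { next }   # skip header comments and short lines
        { printf "gid %s: %d of %d CBRs busy\n", $1, $4, $4 + $7 }
    '
}

usage=$(cbr_usage <<'EOF'
#  gid  nid   ctx   cbr    dsr   ctx   cbr    dsr
#            busy  busy   busy  free  free   free
     0    0     8    36  32768     8    92      0
     1    0     1     4   4096    15   124  28672
     2    1     7    56  28672     9    72   4096
     3    1     7    28  28672     9   100   4096
EOF
)
echo "$usage"
```

On a live system, the input would be redirected from /proc/sgi_uv/gru/gru_status.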
Example 2-2. gru_options - Enable or Disable Driver Features
Various GRU options (mostly for debugging) can be enabled or disabled by writing values to the /proc/sgi_uv/gru/gru_options file. Use the cat command to view the file and see the current settings or a description of the various options.
    % cat debug_options
    # bitmask: 1=trace, 2=statistics, 0x10=No_4k_dsr_AU_war
    # bitmask: 0x20=no_iabort_war, 0x40=no_chiplet_affinity
    # bitmask: 0x80=no_tlb_war, 0x100=no_mesq_war

    0x0001 - enable VERY verbose driver trace information to /var/log/messages
    0x0002 - enable statistics (they are not free)
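Because the options value is a bit-field, several options can be set at once by adding their bit values. The following sketch decodes a value into option names. The names and bit values are taken from the bitmask comment lines in the output above. The function name decode_gru_options is hypothetical.

```shell
# Hypothetical helper: print the name of every option bit set in $1.
# $1 may be decimal or hexadecimal (for example, 66 or 0x42).
decode_gru_options() {
    mask=$(($1))
    # Bit value and name pairs, from the "# bitmask:" comment lines.
    set -- 1 trace 2 statistics 16 No_4k_dsr_AU_war 32 no_iabort_war \
           64 no_chiplet_affinity 128 no_tlb_war 256 no_mesq_war
    while [ $# -gt 0 ]; do
        if [ $((mask & $1)) -ne 0 ]; then
            echo "$2"
        fi
        shift 2
    done
}

# 0x42 = 0x40 (no_chiplet_affinity) + 0x02 (statistics)
decode_gru_options 0x42
```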
Example 2-3. statistics - Very Detailed Driver Statistics
You can collect detailed driver statistics, as follows:
    % echo 2 > /proc/sgi_uv/gru/gru_options
When this option is enabled, detailed statistics collection occurs in numerous places in the driver. This collection adds system overhead, especially on large systems.
    % cat /proc/sgi_uv/gru/statistics
         45806 vdata_alloc
         45771 vdata_free
        195712 gts_alloc
        195668 gts_free
         34351 gms_alloc
         34333 gms_free
        149398 gts_double_allocate
        ... (lots more)
You can use the grustats command to view GRU statistics. You will see output similar to the following:
    uv15-sys              TOTAL GRU STATISTICS SINCE COMMAND START
        0 vdata_alloc                  0 copy_gpa
        0 vdata_open                   0 read_gpa
        0 vdata_free                   0 mesq_receive
        0 gts_alloc                    0 mesq_receive_none
        0 gts_free                     0 mesq_send
        0 gms_alloc                    0 mesq_send_failed
        0 gms_free                     0 mesq_noop
        0 gts_double_allocate          0 mesq_send_unexpected_error
        0 assign_context               0 mesq_send_lb_overflow
        0 assign_context_failed        0 mesq_send_qlimit_reached
        0 free_context                 0 mesq_send_amo_nacked
        0 load_user_context            0 mesq_send_put_nacked
        0 load_kcontext                0 mesq_qf_locked
        0 load_kcontext_assign         0 mesq_qf_noop_not_full
        0 load_kcontext_steal          0 mesq_qf_switch_head_failed
        0 lock_kcontext                0 mesq_qf_unexpected_error
        0 unlock_kcontext              0 mesq_noop_unexpected_error
        0 get_kcontext_cbr             0 mesq_noop_lb_overflow
        0 get_kcontext_cbr_busy        0 mesq_noop_qlimit_reached
        0 lock_async_resource          0 mesq_noop_amo_nacked
        0 unlock_async_resource        0 mesq_noop_put_nacked
        0 steal_user_context           0 mesq_noop_page_overflow
        0 steal_kernel_context         0 implicit_abort
        0 steal_context_failed         0 implicit_abort_retried
        ... and much more
For a usage statement, enter the letter h while the grustats command is running. A usage statement appears, as follows:
    Intstats help:
        h           - help (this screen)
        q           - quit
        r           - reset command-start statistics
        t or <TAB>  - toggle between total and incremental mode
        CTL-L       - redraw screen
        CR          - to return to display