Chapter 1. Altix UV GRU Direct Access API

This chapter provides an overview of the SGI Altix UV global reference unit (GRU) development kit. It describes the application programming interface (API) that allows an application direct access to GRU functionality.

The GRU is part of the SGI Altix UV Hub application-specific integrated circuit (ASIC). The UV Hub is the heart of the SGI Altix UV compute blade. It connects to two Intel Xeon 7500 series processor sockets through the Intel QuickPath Interconnect (QPI) ports and to the high speed SGI NUMAlink interconnect fabric through one of four NUMAlink 5 ports.

For more information on the SGI Altix UV hub, Altix UV compute blades, QPI, and NUMAlink 5, see the SGI Altix UV 1000 System User's Guide.

SGI High Level APIs Supporting GRU Access

Message Passing Interface (MPI), SHMEM, and Unified Parallel C (UPC) are high level APIs and programming models, implemented and supported by SGI, that provide access to GRU functionality. For more information, see the mpi(1), shmem(3), or sgiupc(1) man pages and the Message Passing Toolkit (MPT) User's Guide and Unified Parallel C (UPC) User's Guide.

Overview of API for Direct GRU Access

The Direct GRU Access API has four components, as follows:

  • GRU resource allocators

    The GRU resource allocator functions manage GRU resources so that independent software components in the same program can access the GRU without oversubscribing its resources.

  • GRU memory access functions

    The GRU memory access functions perform GRU operations such as memory reads, memory writes, memory-to-memory copies, and atomic memory operations.

  • XPMEM address mapping functions

    The XPMEM address mapping functions map target memory from anywhere in the system into local GRU-mapped virtual addresses.

  • MPT address mapping functions

    The MPT address mapping functions are a layer on top of XPMEM, and expose mapped memory regions already set up for MPI and SHMEM to the user application.

GRU Resource Allocators

The UV global reference unit (GRU) has control block (CB) and data segment (DSEG) resources associated with it. User applications need to allocate CB resources and usually DSEG resources for use in GRU memory access functions.

There are two categories of GRU resources used by any thread: temporarily allocated and permanently allocated. A program starts running with all the available GRU resources in the temporary pool until some resources are allocated permanently via the gru_pallocate() function.

The preferred way to get access to all the GRU temporary CBs and DSEG is through the lightweight gru_temp_reserve() and gru_temp_release() functions. These functions should wrap any use of the GRU memory access functions, with an exception described later.

#include <gru_alloc.h>

void gru_temp_reserve(gru_alloc_thdata_t *gat);

typedef struct {
      gru_segment_t       *gruseg;
      gru_control_block_t *cbp;
      void                *dsegp;
      int                 cb_cnt;
      int                 dseg_size;
} gru_alloc_thdata_t;

The gru_alloc_thdata_t structure returned from this function will describe the GRU resources available for use until the next call to gru_temp_release().

The following code example shows the gru_temp_reserve() function reserving the GRU resources, after which the GRU memory access function gru_gamirr() is called. The gru_wait_abort() function then waits for completion of the operation, followed by a call to gru_temp_release() to release the temporary GRU resources.

Example 1-1. GRU Memory Access Function (gru_gamirr())

gru_alloc_thdata_t gat;

gru_temp_reserve(&gat);                  /* reserve temporary GRU resources */
gru_gamirr(gat.cbp, EOP_IRR_DECZ, address, XTYPE_DW, IMA_CB_DELAY);
gru_wait_abort(gat.cbp);                 /* wait for the operation to complete */
gru_temp_release();                      /* release the temporary resources */


The effect of the gru_temp_reserve() and gru_temp_release() functions is thread-private, so related POSIX threads or OpenMP threads can execute the above sequence concurrently.
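For illustration, here is a minimal sketch, assuming OpenMP and a hypothetical per-thread array addrs[] of GRU-accessible target addresses, in which each thread independently runs the reserve/access/release sequence from Example 1-1.

#include <omp.h>
#include <gru_alloc.h>
#include <uv/gru/gru_instructions.h>   /* GRU memory access functions (gru-devel RPM) */

/* Sketch only: addrs[] is a hypothetical array holding one GRU-accessible
   doubleword address per thread, set up elsewhere in the program. */
void decrement_counters(void *addrs[])
{
   #pragma omp parallel
   {
      gru_alloc_thdata_t gat;

      gru_temp_reserve(&gat);                  /* thread-private reservation */
      gru_gamirr(gat.cbp, EOP_IRR_DECZ,
                 addrs[omp_get_thread_num()],
                 XTYPE_DW, IMA_CB_DELAY);      /* atomic decrement, as in Example 1-1 */
      gru_wait_abort(gat.cbp);                 /* wait for completion */
      gru_temp_release();                      /* release this thread's resources */
   }
}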

An alternative allocation scheme is permanent allocation. The gru_pallocate() function returns CB and DSEG resources that can be used at any time thereafter. This can simplify the allocation strategy, but it has the disadvantage of reducing the number of GRU resources available to other software. An example is gru_bcopy(), which accepts a DSEG work buffer of any size; the bandwidth achieved by gru_bcopy() is higher with larger DSEG work buffers.
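As an illustration, the following sketch permanently allocates one CB and a DSEG work buffer at program startup; the choice of four granularity units for the buffer size and the error handling are assumptions for the example only.

#include <stdio.h>
#include <stdlib.h>
#include <gru_alloc.h>

static gru_segment_t *pgruseg;   /* GRU segment holding the permanent resources */
static int           pcbnum;     /* ordinal of the first permanently allocated CB */
static void          *pdseg;     /* permanently allocated DSEG work buffer */
static int           pdseg_sz;   /* size of the permanent DSEG buffer, in bytes */

void setup_permanent_gru(void)
{
   /* dseg_sz must be a multiple of the DSEG allocation granularity */
   pdseg_sz = 4 * gru_pallocate_dseg_granularity();

   if (gru_pallocate(1, pdseg_sz, &pgruseg, &pcbnum, &pdseg) != 0) {
      perror("gru_pallocate");
      exit(1);
   }
   /* pdseg can now be passed, for example, as the DSEG work buffer
      for gru_bcopy() at any later time */
}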

You can find more detailed information in the gru_temp_reserve(3), gru_pallocate(3), and gru_resource(3) man pages. Use the man(1) command to view these man pages online. For your convenience, copies of the GRU-related man pages are included in the following section.

GRU Man Pages

This section contains GRU-related man pages.

gru_temp_reserve(3)

NAME

gru_temp_reserve, gru_temp_release - temporary GRU resource allocator

SYNOPSIS

#include <gru_alloc.h>

void gru_temp_reserve(gru_alloc_thdata_t *gat);
int gru_temp_reserve_try(gru_alloc_thdata_t *gat);
void gru_temp_release(void);

typedef struct {
      gru_segment_t       *gruseg;
      gru_control_block_t *cbp;
      void                *dsegp;
      int                 cb_cnt;
      int                 dseg_size;
} gru_alloc_thdata_t;

LIBRARY

-lgru_alloc

DESCRIPTION

The gru_temp_reserve() and gru_temp_reserve_try() functions allocate and reserve the temporary use GRU resources for a thread. The gru_alloc_thdata_t structure returned in gat describes the number and locations of the temporary use GRU resources, which may be used until the next call to gru_temp_release().

The fields are defined, as follows:

gruseg 

The GRU segment

cbp 

A convenient pointer to the first control block (CB). Equal to gru_get_cb_pointer(gat->gruseg, 0).

dsegp 

A pointer to the data segment (DSEG) space available for temporary use.

cb_cnt 

The number of consecutive CBs in the GRU segment that are available for temporary use.

dseg_size 

The size of the DSEG region available for temporary use (bytes).

The first call to gru_temp_reserve() allocates a GRU segment for the calling thread, and this same segment is assigned to the thread after every subsequent call to gru_temp_reserve().

Every call to gru_temp_reserve() sets a thread-private "temporary resources in use" (TRU) flag. The temporary GRU resources identified by the gat structure are valid and may be referenced only when the TRU flag is set. Note that later calls to gru_temp_reserve() may return different values in the gat structure.

The program will abort if the TRU flag is already set when a call is made to gru_temp_reserve() or gru_pallocate().

The GRU allocation library attempts to provide a quantity of temporary use GRU resources that is equal to the quantity on each UV hub divided by the number of processors per hub. This quantity will be reduced by any GRU resources permanently allocated via the gru_pallocate() function.

SUGGESTED USAGE CONVENTIONS

The above usage rules suggest two natural usage conventions that are equally valid:

A. Users surround code blocks that use temporary GRU resources with gru_temp_reserve() and gru_temp_release() calls.

or

B. Users insert calls to gru_temp_reserve() at the beginning of each GRU-using function and calls to gru_temp_release() at each return point of that function. In addition, every call site that might end up calling a GRU function that uses temporary GRU resources should have a call to gru_temp_release() prior to the call site and a call to gru_temp_reserve() upon return, as shown in the sketch below.
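The following sketch illustrates convention B for a hypothetical GRU-using function do_gru_work() and one of its callers; the function names are illustrative only.

#include <gru_alloc.h>

/* Hypothetical function that uses temporary GRU resources:
   reserve on entry, release at every return point (convention B). */
void do_gru_work(void)
{
   gru_alloc_thdata_t gat;

   gru_temp_reserve(&gat);
   /* ... GRU memory access functions using gat.cbp and gat.dsegp ... */
   gru_wait_abort(gat.cbp);
   gru_temp_release();
}

/* A caller holding its own reservation releases it before any call site
   that may reach a GRU-using function, then re-reserves upon return. */
void caller(void)
{
   gru_alloc_thdata_t gat;

   gru_temp_reserve(&gat);
   /* ... GRU work ... */
   gru_temp_release();        /* release before the call site */

   do_gru_work();

   gru_temp_reserve(&gat);    /* re-reserve upon return; gat may differ now */
   /* ... more GRU work ... */
   gru_temp_release();
}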

Note that use of GRU functions with temporary storage in signal handlers is dangerous. The program will abort if the TRU flag is set when a signal handler is entered that also calls gru_temp_reserve().

ENVIRONMENT VARIABLES

See the gru_resource(3) man page for information about environment variables that can control the amount of GRU resources that are allocated.

RETURN VALUE

gru_temp_reserve_try() returns 0 if it is able to reserve the temporary GRU resources, and -1 otherwise. Failure to reserve the temporary resources results from a previous reservation on the temporary resources still being in effect. gru_temp_reserve() aborts if it is unable to reserve the temporary GRU resources.
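For example, a code path that cannot be sure whether the calling thread already holds the temporary reservation might use gru_temp_reserve_try() and fall back to a CPU path, as in this sketch (the function and the memcpy() fallback are illustrative only):

#include <string.h>
#include <gru_alloc.h>

void copy_data(void *dst, const void *src, size_t len)
{
   gru_alloc_thdata_t gat;

   if (gru_temp_reserve_try(&gat) == 0) {
      /* ... GRU memory access functions using gat.cbp and gat.dsegp ... */
      gru_wait_abort(gat.cbp);
      gru_temp_release();
   } else {
      /* reservation already in effect; use an ordinary CPU copy instead */
      memcpy(dst, src, len);
   }
}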

NOTES

The deprecated gru_all_reserve() function has the same effect as gru_temp_reserve().

The deprecated gru_all_release() function has the same effect as gru_temp_release().

SEE ALSO

gru_pallocate(3), gru_all_reserve(3), and gru_resource(3)

gru_pallocate(3)

NAME

gru_pallocate - permanently allocate GRU resources

SYNOPSIS

#include <gru_alloc.h>

int gru_pallocate(int num_cbs, int dseg_sz, gru_segment_t **gruseg,
      int *cbnum, void **dseg);

int gru_pallocate_dseg_granularity(void);

LIBRARY

-lgru_alloc

DESCRIPTION

The gru_pallocate() function will permanently reserve a specified number of GRU control blocks (CBs) and data segment space (DSEG).

Arguments are, as follows:

num_cbs 

(input) the number of CBs desired.

dseg_sz 

(input) the number of bytes of DSEG space desired. dseg_sz must be a multiple of the DSEG allocation granularity.

gruseg 

(output) assigned the pointer to the GRU segment containing the returned resources.

cbnum 

(output) assigned the ordinal value of the first CB in the GRU segment that is part of the allocation.

dseg 

(output) assigned the pointer to the allocated DSEG space.

The gru_pallocate() function may not be called between calls to gru_temp_reserve() and gru_temp_release(). After gru_pallocate() is called, the amount of GRU resources available to the caller of gru_temp_reserve() will be decreased.

The gru_pallocate_dseg_granularity() function returns the DSEG allocation granularity, which is the smallest number of bytes of DSEG space that may be allocated.
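For example, a caller that needs at least nbytes of DSEG space could round the request up to a legal multiple of the granularity, as in this illustrative helper (not part of the API):

#include <gru_alloc.h>

/* Round a requested DSEG size up to a multiple of the allocation
   granularity so it is acceptable to gru_pallocate(). */
static int round_dseg_size(int nbytes)
{
   int gran = gru_pallocate_dseg_granularity();

   return ((nbytes + gran - 1) / gran) * gran;
}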

RETURN VALUE

gru_pallocate() returns 0 on success, -1 on error with one of the following errno values set:

ENOMEM - the library GRU segment local to this thread had insufficient CB or DSEG space to satisfy the request.

EINVAL - the dseg_sz value is not a multiple of the DSEG allocation granularity.

SEE ALSO

gru_temp_reserve(3), gru_temp_release(3)

gru_resource(3)

NAME

gru_resource - tuning the GRU allocator run-time library

LIBRARY

libgru_alloc run-time library

DESCRIPTION

The GRU allocator run-time library is linked into some programs and libraries to manage available GRU resources. The amount of GRU resource that can be allocated defaults to a logical CPU's share of the GRU resources on an Altix UV hub. However, the user can modify and tune the quantities of GRU resource by setting environment variables, as described in the following section of this man page.

ENVIRONMENT VARIABLES

GRU_RESOURCE_FACTOR

Multiplies the quantity of control blocks (CB) and data segment space (DSEG) resources assigned to each thread by the factor given. For example, when parallel jobs are run with only one user thread per core, a factor of 2 could be specified. If only one GRU-using thread or process will be run on each socket, and each socket has 16 hyperthreads, then a factor of 16 could be specified. If GRU_THREAD_CBS or GRU_THREAD_DSEG_SZ are specified, they override GRU_RESOURCE_FACTOR.

GRU_THREAD_CBS

Overrides the number of per-thread CBs assigned to the caller of gru_temp_reserve(). The default is a processor's fair portion of the available CBs, which is 8 on systems with 8 cores per socket and 10 on systems with 6 cores per socket.

GRU_THREAD_DSEG_SZ

Overrides the amount of per-thread DSEG space assigned to the caller of gru_temp_reserve(). The default is a processor's fair portion of the available DSEG space, which is 2048 bytes.

SEE ALSO

cpumap(1)

GRU Memory Access Functions

The GRU memory access functions perform GRU operations such as memory reads, memory writes, memory-to-memory copies, and atomic memory operations. These functions use an ordinary virtual address or a GRU-mapped virtual address to reference the remote memory.

The interfaces to these functions are declared in the uv/gru/gru_instructions.h header file installed by the gru-devel RPM. For functional descriptions of these operations, see the appropriate hardware reference manual.

The following code example of a GRU memory access function illustrates the basic call structure.

Example 1-2. GRU Memory Access Function Basic Call Structure

static inline 
void gru_vload(gru_control_block_t *cb, void *mem_addr,
         unsigned int tri0, unsigned char xtype, unsigned long nelem,
         unsigned long stride, unsigned long hints);

Arguments are:
	cb	 - pointer to CB
	mem_addr - address of targeted memory
	tri0     - index to DSEG buffer.  Compute it
		   using gru_get_tri().
	xtype	 - log2 of data type byte size (XTYPE_B ...)
	nelem	 - number of elements to transfer
	stride   - memory stride, scaled in elements
	hints    - IMA_CB_DELAY is commonly used



All memory access operations are asynchronous. The wait functions, such as gru_wait_abort(), take the CB handle and are used to wait for completion.
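For illustration, the following sketch uses gru_vload() with the temporary resources from gru_temp_reserve() to load nelem contiguous doublewords into the DSEG and then waits for completion. It assumes gru_get_tri() computes the DSEG buffer index from the DSEG pointer; consult gru_instructions.h for the exact form.

#include <gru_alloc.h>
#include <uv/gru/gru_instructions.h>   /* installed by the gru-devel RPM */

void load_doublewords(void *mem_addr, unsigned long nelem)
{
   gru_alloc_thdata_t gat;

   gru_temp_reserve(&gat);
   gru_vload(gat.cbp, mem_addr,
             gru_get_tri(gat.dsegp),   /* index of the temporary DSEG buffer */
             XTYPE_DW,                 /* doubleword elements */
             nelem,
             1,                        /* unit stride, in elements */
             IMA_CB_DELAY);            /* commonly used hint */
   gru_wait_abort(gat.cbp);            /* operations are asynchronous */
   gru_temp_release();
}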

XPMEM Library Functions

The XPMEM interface can map a virtual address range in one process into a GRU-mapped virtual address in another process. The XPMEM interface was designed to meet the needs of MPI and SHMEM implementations, and it provides ways to map any data region. As a GRU API user, you need to find a way to map the needed memory regions into the processes or threads involved. The Linux operating system offers many options for doing this, as follows:

  • mmap

  • System V shared memory

  • memory sharing among pthreads

  • memory sharing among OpenMP threads

These methods are the likely first choice for most potential GRU users.

The sn/xpmem.h header file installed by the xpmem-devel-noship RPM has interface definitions for all the XPMEM functions.

The following example shows the main XPMEM functions:

Example 1-3. Main XPMEM Functions

extern __s64 xpmem_make_2(void *, size_t, int, void *);
extern int xpmem_remove_2(__s64);
extern __s64 xpmem_get_2(__s64, int, int, void *);
extern int xpmem_release_2(__s64);
extern void *xpmem_attach_2(__s64, off_t, size_t, void *);
extern void *xpmem_attach_high_2(__s64, off_t, size_t, void *);
extern int xpmem_detach_2(void *, size_t size);
extern void *xpmem_reserve_high_2(size_t, size_t);
extern int xpmem_unreserve_high_2(void *, size_t);


For more information on using XPMEM, see the SGI internal XPMEM API document.
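As a rough sketch only: a typical make/get/attach sequence might look like the following. The permit arguments and the XPMEM_PERMIT_MODE and XPMEM_RDWR constants are assumptions based on the generic XPMEM interface; check sn/xpmem.h for the definitions that actually apply to the _2 entry points.

#include <sn/xpmem.h>

/* Exporting process: publish 'len' bytes of its address space and
   return the segment id to pass (out of band) to the importer. */
__s64 export_region(void *buf, size_t len)
{
   /* permit type/value assumed to follow the usual XPMEM convention */
   return xpmem_make_2(buf, len, XPMEM_PERMIT_MODE, (void *)0600);
}

/* Importing process: attach the exported segment into a local
   GRU-mapped virtual address. */
void *import_region(__s64 segid, size_t len)
{
   __s64 apid = xpmem_get_2(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE, (void *)0600);

   if (apid < 0)
      return NULL;

   return xpmem_attach_2(apid, 0, len, NULL);   /* offset 0, no address hint */
}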

MPT Address Mapping Functions

The MPT libmpi library uses XPMEM to cross-map virtual memory among all the processes in an MPI job. Several functions are available to look up mapped virtual addresses that are pre-attached in the virtual address space of a process by MPI. The addresses returned by the lookups may be passed to the GRU library functions.

Not all GRU API users can require their code to execute within an MPI job, but if yours does, you may find the MPT address mapping functions a convenient way to reference remote data arrays and objects.

The MPT address mapping functions are shown below. They reference ordinary virtual addresses or addresses of symmetric data objects. Symmetric data is static data or arrays, as defined in the intro_shmem(3) man page.

The following example shows an MPI_SGI_gam_type:

Example 1-4. MPI_SGI_gam_type

#include <mpi_ext.h>
   
int
MPI_SGI_gam_type(int rank, MPI_Comm comm)
      
Return value is the XPMEM accessibility of the specified rank.
    
  MPI_GAM_NONE       - not referenceable by load/store or GRU
  MPI_GAM_CPU_NONCOH - Altix 3700 noncoherent
  MPI_GAM_CPU        - referenceable by load/store only
  MPI_GAM_GRU        - referenceable by GRU only
  MPI_GAM_CPU_PREF   - referenceable by either load/store
                         or GRU; load/store preferred
  MPI_GAM_GRU_PREF   - referenceable by either load/store
                         or GRU; GRU preferred


The MPT address mapping functions are influenced by the MPI_GSM_NEIGHBORHOOD environment variable. This variable may be used to specify the "neighborhood size" for shared memory accesses. Contiguous groups of ranks within a host can be considered to be in the same neighborhood. The MPI_GSM_NEIGHBORHOOD variable specifies the size of these neighborhoods, as follows:

  • MPI processes within a neighborhood will return gam_type MPI_GAM_CPU_PREF.

  • MPI processes outside a neighborhood but within the same host will return gam_type MPI_GAM_GRU_PREF.

  • MPI processes from a different host within an Altix UV system will return gam_type MPI_GAM_GRU.

When MPI_GSM_NEIGHBORHOOD is not set, the neighborhood size defaults to all ranks in the current host.

MPI_SGI_gam_ptr Function

The MPI_SGI_gam_ptr function is, as follows:

#include <mpi_ext.h>

void * MPI_SGI_gam_ptr(void *rem_addr, size_t len, int remote_rank,
  MPI_Comm comm, int acc_mode);

Given a virtual address in a specified MPI process rank, MPI_SGI_gam_ptr() returns a general virtual address that may be used to directly reference that memory.

This function is for general users.

acc_mode 

Chooses whether a CPU-referenceable or a GRU-referenceable address is returned

MPI_GAM_CPU 

Requests an address that can be referenced by CPU load/store instructions

MPI_GAM_GRU 

Requests an address that can be referenced by the GRU

This function prints an error message when error conditions occur and then aborts.
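Because MPI_SGI_gam_ptr() aborts on error, a caller can first check accessibility with MPI_SGI_gam_type(), as in this sketch (the wrapper function is illustrative only):

#include <stddef.h>
#include <mpi.h>
#include <mpi_ext.h>

/* Return a GRU-referenceable address for 'len' bytes at 'rem_addr' in MPI
   process 'rank', or NULL if the rank is not reachable by the GRU. */
void *get_gru_address(void *rem_addr, size_t len, int rank, MPI_Comm comm)
{
   int gam = MPI_SGI_gam_type(rank, comm);

   if (gam == MPI_GAM_GRU || gam == MPI_GAM_GRU_PREF || gam == MPI_GAM_CPU_PREF)
      return MPI_SGI_gam_ptr(rem_addr, len, rank, comm, MPI_GAM_GRU);

   return NULL;   /* MPI_GAM_NONE, MPI_GAM_CPU, or MPI_GAM_CPU_NONCOH */
}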

MPI_SGI_symmetric_addr Function

The MPI_SGI_symmetric_addr function is, as follows:

void *MPI_SGI_symmetric_addr(void *local_addr, size_t len,
	    int remote_rank, MPI_Comm comm)

For symmetric objects, returns the virtual address (VA) of the corresponding object in a specified MPI process.

shmem_ptr Function

The shmem_ptr function is, as follows:

#include <mpp/shmem.h>

       void *shmem_ptr(void *target, int pe);

Returns a processor-referenceable address that can be used to reference the symmetric data object target on a specified MPI process. See shmem_ptr(3) for more details.
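For example, a program might use shmem_ptr() to update a remote symmetric counter with an ordinary store when a direct address is available, as in this sketch (the fallback uses the standard SHMEM atomic increment):

#include <mpp/shmem.h>

static long counter;                 /* symmetric data object */

void bump_remote_counter(int pe)
{
   long *p = shmem_ptr(&counter, pe);

   if (p != NULL)
      (*p)++;                        /* direct load/store access to the remote PE */
   else
      shmem_long_inc(&counter, pe);  /* fall back to a SHMEM library call */
}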