| ======================================= |
| Oracle Data Analytics Accelerator (DAX) |
| ======================================= |
| |
| DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8 |
| (DAX2) processor chips, and has direct access to the CPU's L3 caches |
| as well as physical memory. It can perform several operations on data |
| streams with various input and output formats. A driver provides a |
| transport mechanism and has limited knowledge of the various opcodes |
| and data formats. A user space library provides high level services |
| and translates these into low level commands which are then passed |
| into the driver and subsequently the Hypervisor and the coprocessor. |
| The library is the recommended way for applications to use the |
| coprocessor, and the driver interface is not intended for general use. |
| This document describes the general flow of the driver, its |
| structures, and its programmatic interface. It also provides example |
| code sufficient to write user or kernel applications that use DAX |
| functionality. |
| |
| The user library is open source and available at: |
| |
| https://oss.oracle.com/git/gitweb.cgi?p=libdax.git |
| |
| The Hypervisor interface to the coprocessor is described in detail in |
| the accompanying document, dax-hv-api.txt, which is a plain text |
| excerpt of the (Oracle internal) "UltraSPARC Virtual Machine |
| Specification" version 3.0.20+15, dated 2017-09-25. |
| |
| |
| High Level Overview |
| =================== |
| |
| A coprocessor request is described by a Command Control Block |
| (CCB). The CCB contains an opcode and various parameters. The opcode |
| specifies what operation is to be done, and the parameters specify |
| options, flags, sizes, and addresses. The CCB (or an array of CCBs) |
| is passed to the Hypervisor, which handles queueing and scheduling of |
| requests to the available coprocessor execution units. A status code |
| returned indicates if the request was submitted successfully or if |
| there was an error. One of the addresses given in each CCB is a |
| pointer to a "completion area", which is a 128 byte memory block that |
| is written by the coprocessor to provide execution status. No |
| interrupt is generated upon completion; the completion area must be |
| polled by software to find out when a transaction has finished, but |
| the M7 and later processors provide a mechanism to pause the virtual |
| processor until the completion status has been updated by the |
| coprocessor. This is done using the monitored load and mwait |
| instructions, which are described in more detail later. The DAX |
| coprocessor was designed so that after a request is submitted, the |
| kernel is no longer involved in the processing of it. The polling is |
| done at the user level, which results in almost zero latency between |
| completion of a request and resumption of execution of the requesting |
| thread. |
| |
| |
| Addressing Memory |
| ================= |
| |
| The kernel does not have access to physical memory in the Sun4v |
| architecture, as there is an additional level of memory virtualization |
| present. This intermediate level is called "real" memory, and the |
| kernel treats this as if it were physical. The Hypervisor handles the |
| translations between real memory and physical so that each logical |
| domain (LDOM) can have a partition of physical memory that is isolated |
| from that of other LDOMs. When the kernel sets up a virtual mapping, |
| it specifies a virtual address and the real address to which it should |
| be mapped. |
| |
| The DAX coprocessor can only operate on physical memory, so before a |
| request can be fed to the coprocessor, all the addresses in a CCB must |
| be converted into physical addresses. The kernel cannot do this since |
| it has no visibility into physical addresses. So a CCB may contain |
| either the virtual or real addresses of the buffers or a combination |
| of them. An "address type" field is available for each address that |
| may be given in the CCB. In all cases, the Hypervisor will translate |
| all the addresses to physical before dispatching to hardware. Address |
| translations are performed using the context of the process initiating |
| the request. |
| |
| |
| The Driver API |
| ============== |
| |
| An application makes requests to the driver via the write() system |
| call, and gets results (if any) via read(). The completion areas are |
| made accessible via mmap(), and are read-only for the application. |
| |
| The request may either be an immediate command or an array of CCBs to |
| be submitted to the hardware. |
| |
| Each open instance of the device is exclusive to the thread that |
| opened it, and must be used by that thread for all subsequent |
| operations. The driver open function creates a new context for the |
| thread and initializes it for use. This context contains pointers and |
| values used internally by the driver to keep track of submitted |
| requests. The completion area buffer is also allocated, and this is |
| large enough to contain the completion areas for many concurrent |
| requests. When the device is closed, any outstanding transactions are |
| flushed and the context is cleaned up. |
| |
| On a DAX1 system (M7), the device will be called "oradax1", while on a |
| DAX2 system (M8) it will be "oradax2". If an application requires one |
| or the other, it should simply attempt to open the appropriate |
| device. Only one of the devices will exist on any given system, so the |
| name can be used to determine what the platform supports. |
| |
| The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For |
| all of these, success is indicated by a return value from write() |
| equal to the number of bytes given in the call. Otherwise -1 is |
| returned and errno is set. |
| |
| CCB_DEQUEUE |
| ----------- |
| |
| Tells the driver to clean up resources associated with past |
| requests. Since no interrupt is generated upon the completion of a |
| request, the driver must be told when it may reclaim resources. No |
| further status information is returned, so the user should not |
| subsequently call read(). |
| |
| CCB_KILL |
| -------- |
| |
| Kills a CCB during execution. The CCB is guaranteed to not continue |
| executing once this call returns successfully. On success, read() must |
| be called to retrieve the result of the action. |
| |
| CCB_INFO |
| -------- |
| |
| Retrieves information about a currently executing CCB. Note that some |
| Hypervisors might return 'notfound' when the CCB is in 'inprogress' |
| state. To ensure a CCB in the 'notfound' state will never be executed, |
| CCB_KILL must be invoked on that CCB. Upon success, read() must be |
| called to retrieve the details of the action. |
| |
| Submission of an array of CCBs for execution |
| --------------------------------------------- |
| |
| A write() whose length is a multiple of the CCB size is treated as a |
| submit operation. The file offset is treated as the index of the |
| completion area to use, and may be set via lseek() or using the |
| pwrite() system call. If -1 is returned then errno is set to indicate |
| the error. Otherwise, the return value is the length of the array that |
| was actually accepted by the coprocessor. If the accepted length is |
| equal to the requested length, then the submission was completely |
| successful and there is no further status needed; hence, the user |
| should not subsequently call read(). Partial acceptance of the CCB |
| array is indicated by a return value less than the requested length, |
| and read() must be called to retrieve further status information. The |
| status will reflect the error caused by the first CCB that was not |
| accepted, and status_data will provide additional data in some cases. |
| |
| MMAP |
| ---- |
| |
| The mmap() function provides access to the completion area allocated |
| in the driver. Note that the completion area is not writeable by the |
| user process, and the mmap call must not specify PROT_WRITE. |
| |
| |
| Completion of a Request |
| ======================= |
| |
| The first byte in each completion area is the command status which is |
| updated by the coprocessor hardware. Software may take advantage of |
| new M7/M8 processor capabilities to efficiently poll this status byte. |
| First, a "monitored load" is achieved via a Load from Alternate Space |
| (ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY). Second, a |
| "monitored wait" is achieved via the mwait instruction (a write to |
| %asr28). This instruction is like pause in that it suspends execution |
| of the virtual processor for the given number of nanoseconds, but in |
| addition will terminate early when one of several events occur. If the |
| block of data containing the monitored location is modified, then the |
| mwait terminates. This causes software to resume execution immediately |
| (without a context switch or kernel to user transition) after a |
| transaction completes. Thus the latency between transaction completion |
| and resumption of execution may be just a few nanoseconds. |
| |
| |
| Application Life Cycle of a DAX Submission |
| ========================================== |
| |
| - open dax device |
| - call mmap() to get the completion area address |
| - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc. |
| - submit CCB via write() or pwrite() |
| - go into a loop executing monitored load + monitored wait and |
| terminate when the command status indicates the request is complete |
| (CCB_KILL or CCB_INFO may be used any time as necessary) |
| - perform a CCB_DEQUEUE |
| - call munmap() for completion area |
| - close the dax device |
| |
| |
| Memory Constraints |
| ================== |
| |
| The DAX hardware operates only on physical addresses. Therefore, it is |
| not aware of virtual memory mappings and the discontiguities that may |
| exist in the physical memory that a virtual buffer maps to. There is |
| no I/O TLB or any scatter/gather mechanism. All buffers, whether input |
| or output, must reside in a physically contiguous region of memory. |
| |
| The Hypervisor translates all addresses within a CCB to physical |
| before handing off the CCB to DAX. The Hypervisor determines the |
| virtual page size for each virtual address given, and uses this to |
| program a size limit for each address. This prevents the coprocessor |
| from reading or writing beyond the bound of the virtual page, even |
| though it is accessing physical memory directly. A simpler way of |
| saying this is that a DAX operation will never "cross" a virtual page |
| boundary. If an 8k virtual page is used, then the data is strictly |
| limited to 8k. If a user's buffer is larger than 8k, then a larger |
| page size must be used, or the transaction size will be truncated to |
| 8k. |
| |
| Huge pages. A user may allocate huge pages using standard interfaces. |
| Memory buffers residing on huge pages may be used to achieve much |
| larger DAX transaction sizes, but the rules must still be followed, |
| and no transaction will cross a page boundary, even a huge page. A |
| major caveat is that Linux on Sparc presents 8Mb as one of the huge |
| page sizes. Sparc does not actually provide a 8Mb hardware page size, |
| and this size is synthesized by pasting together two 4Mb pages. The |
| reasons for this are historical, and it creates an issue because only |
| half of this 8Mb page can actually be used for any given buffer in a |
| DAX request, and it must be either the first half or the second half; |
| it cannot be a 4Mb chunk in the middle, since that crosses a |
| (hardware) page boundary. Note that this entire issue may be hidden by |
| higher level libraries. |
| |
| |
| CCB Structure |
| ------------- |
| A CCB is an array of 8 64-bit words. Several of these words provide |
| command opcodes, parameters, flags, etc., and the rest are addresses |
| for the completion area, output buffer, and various inputs:: |
| |
| struct ccb { |
| u64 control; |
| u64 completion; |
| u64 input0; |
| u64 access; |
| u64 input1; |
| u64 op_data; |
| u64 output; |
| u64 table; |
| }; |
| |
| See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of |
| each of these fields, and see dax-hv-api.txt for a complete description |
| of the Hypervisor API available to the guest OS (ie, Linux kernel). |
| |
| The first word (control) is examined by the driver for the following: |
| - CCB version, which must be consistent with hardware version |
| - Opcode, which must be one of the documented allowable commands |
| - Address types, which must be set to "virtual" for all the addresses |
| given by the user, thereby ensuring that the application can |
| only access memory that it owns |
| |
| |
| Example Code |
| ============ |
| |
| The DAX is accessible to both user and kernel code. The kernel code |
| can make hypercalls directly while the user code must use wrappers |
| provided by the driver. The setup of the CCB is nearly identical for |
| both; the only difference is in preparation of the completion area. An |
| example of user code is given now, with kernel code afterwards. |
| |
| In order to program using the driver API, the file |
| arch/sparc/include/uapi/asm/oradax.h must be included. |
| |
| First, the proper device must be opened. For M7 it will be |
| /dev/oradax1 and for M8 it will be /dev/oradax2. The simplest |
| procedure is to attempt to open both, as only one will succeed:: |
| |
| fd = open("/dev/oradax1", O_RDWR); |
| if (fd < 0) |
| fd = open("/dev/oradax2", O_RDWR); |
| if (fd < 0) |
| /* No DAX found */ |
| |
| Next, the completion area must be mapped:: |
| |
| completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0); |
| |
| All input and output buffers must be fully contained in one hardware |
| page, since as explained above, the DAX is strictly constrained by |
| virtual page boundaries. In addition, the output buffer must be |
| 64-byte aligned and its size must be a multiple of 64 bytes because |
| the coprocessor writes in units of cache lines. |
| |
| This example demonstrates the DAX Scan command, which takes as input a |
| vector and a match value, and produces a bitmap as the output. For |
| each input element that matches the value, the corresponding bit is |
| set in the output. |
| |
| In this example, the input vector consists of a series of single bits, |
| and the match value is 0. So each 0 bit in the input will produce a 1 |
| in the output, and vice versa, which produces an output bitmap which |
| is the input bitmap inverted. |
| |
| For details of all the parameters and bits used in this CCB, please |
| refer to section 36.2.1.3 of the DAX Hypervisor API document, which |
| describes the Scan command in detail:: |
| |
| ccb->control = /* Table 36.1, CCB Header Format */ |
| (2L << 48) /* command = Scan Value */ |
| | (3L << 40) /* output address type = primary virtual */ |
| | (3L << 34) /* primary input address type = primary virtual */ |
| /* Section 36.2.1, Query CCB Command Formats */ |
| | (1 << 28) /* 36.2.1.1.1 primary input format = fixed width bit packed */ |
| | (0 << 23) /* 36.2.1.1.2 primary input element size = 0 (1 bit) */ |
| | (8 << 10) /* 36.2.1.1.6 output format = bit vector */ |
| | (0 << 5) /* 36.2.1.3 First scan criteria size = 0 (1 byte) */ |
| | (31 << 0); /* 36.2.1.3 Disable second scan criteria */ |
| |
| ccb->completion = 0; /* Completion area address, to be filled in by driver */ |
| |
| ccb->input0 = (unsigned long) input; /* primary input address */ |
| |
| ccb->access = /* Section 36.2.1.2, Data Access Control */ |
| (2 << 24) /* Primary input length format = bits */ |
| | (nbits - 1); /* number of bits in primary input stream, minus 1 */ |
| |
| ccb->input1 = 0; /* secondary input address, unused */ |
| |
| ccb->op_data = 0; /* scan criteria (value to be matched) */ |
| |
| ccb->output = (unsigned long) output; /* output address */ |
| |
| ccb->table = 0; /* table address, unused */ |
| |
| The CCB submission is a write() or pwrite() system call to the |
| driver. If the call fails, then a read() must be used to retrieve the |
| status:: |
| |
| if (pwrite(fd, ccb, 64, 0) != 64) { |
| struct ccb_exec_result status; |
| read(fd, &status, sizeof(status)); |
| /* bail out */ |
| } |
| |
| After a successful submission of the CCB, the completion area may be |
| polled to determine when the DAX is finished. Detailed information on |
| the contents of the completion area can be found in section 36.2.2 of |
| the DAX HV API document:: |
| |
| while (1) { |
| /* Monitored Load */ |
| __asm__ __volatile__("lduba [%1] 0x84, %0\n" |
| : "=r" (status) |
| : "r" (completion_area)); |
| |
| if (status) /* 0 indicates command in progress */ |
| break; |
| |
| /* MWAIT */ |
| __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::); /* 1000 ns */ |
| } |
| |
| A completion area status of 1 indicates successful completion of the |
| CCB and validity of the output bitmap, which may be used immediately. |
| All other non-zero values indicate error conditions which are |
| described in section 36.2.2:: |
| |
| if (completion_area[0] != 1) { /* section 36.2.2, 1 = command ran and succeeded */ |
| /* completion_area[0] contains the completion status */ |
| /* completion_area[1] contains an error code, see 36.2.2 */ |
| } |
| |
| After the completion area has been processed, the driver must be |
| notified that it can release any resources associated with the |
| request. This is done via the dequeue operation:: |
| |
| struct dax_command cmd; |
| cmd.command = CCB_DEQUEUE; |
| if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) { |
| /* bail out */ |
| } |
| |
| Finally, normal program cleanup should be done, i.e., unmapping |
| completion area, closing the dax device, freeing memory etc. |
| |
| Kernel example |
| -------------- |
| |
| The only difference in using the DAX in kernel code is the treatment |
| of the completion area. Unlike user applications which mmap the |
| completion area allocated by the driver, kernel code must allocate its |
| own memory to use for the completion area, and this address and its |
| type must be given in the CCB:: |
| |
| ccb->control |= /* Table 36.1, CCB Header Format */ |
| (3L << 32); /* completion area address type = primary virtual */ |
| |
| ccb->completion = (unsigned long) completion_area; /* Completion area address */ |
| |
| The dax submit hypercall is made directly. The flags used in the |
| ccb_submit call are documented in the DAX HV API in section 36.3.1/ |
| |
| :: |
| |
| #include <asm/hypervisor.h> |
| |
| hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64, |
| HV_CCB_QUERY_CMD | |
| HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY | |
| HV_CCB_VA_PRIVILEGED, |
| 0, &bytes_accepted, &status_data); |
| |
| if (hv_rv != HV_EOK) { |
| /* hv_rv is an error code, status_data contains */ |
| /* potential additional status, see 36.3.1.1 */ |
| } |
| |
| After the submission, the completion area polling code is identical to |
| that in user land:: |
| |
| while (1) { |
| /* Monitored Load */ |
| __asm__ __volatile__("lduba [%1] 0x84, %0\n" |
| : "=r" (status) |
| : "r" (completion_area)); |
| |
| if (status) /* 0 indicates command in progress */ |
| break; |
| |
| /* MWAIT */ |
| __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::); /* 1000 ns */ |
| } |
| |
| if (completion_area[0] != 1) { /* section 36.2.2, 1 = command ran and succeeded */ |
| /* completion_area[0] contains the completion status */ |
| /* completion_area[1] contains an error code, see 36.2.2 */ |
| } |
| |
| The output bitmap is ready for consumption immediately after the |
| completion status indicates success. |
| |
| Excer[t from UltraSPARC Virtual Machine Specification |
| ===================================================== |
| |
| .. include:: dax-hv-api.txt |
| :literal: |