Linux/Documentation/arch/sparc/oradax/oracle-dax.rst

=======================================
Oracle Data Analytics Accelerator (DAX)
=======================================

DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
(DAX2) processor chips, and has direct access to the CPU's L3 caches
as well as physical memory. It can perform several operations on data
streams with various input and output formats.  A driver provides a
transport mechanism and has limited knowledge of the various opcodes
and data formats. A user space library provides high level services
and translates these into low level commands which are then passed
into the driver and subsequently the Hypervisor and the coprocessor.
The library is the recommended way for applications to use the
coprocessor, and the driver interface is not intended for general use.
This document describes the general flow of the driver, its
structures, and its programmatic interface. It also provides example
code sufficient to write user or kernel applications that use DAX
functionality.

The user library is open source and available at:

    https://oss.oracle.com/git/gitweb.cgi?p=libdax.git

The Hypervisor interface to the coprocessor is described in detail in
the accompanying document, dax-hv-api.txt, which is a plain text
excerpt of the (Oracle internal) "UltraSPARC Virtual Machine
Specification" version 3.0.20+15, dated 2017-09-25.


High Level Overview
===================

A coprocessor request is described by a Command Control Block
(CCB). The CCB contains an opcode and various parameters. The opcode
specifies what operation is to be done, and the parameters specify
options, flags, sizes, and addresses.  The CCB (or an array of CCBs)
is passed to the Hypervisor, which handles queueing and scheduling of
requests to the available coprocessor execution units. The returned
status code indicates whether the request was submitted successfully
or an error occurred.  One of the addresses given in each CCB is a
pointer to a "completion area", which is a 128 byte memory block that
is written by the coprocessor to provide execution status. No
interrupt is generated upon completion; the completion area must be
polled by software to find out when a transaction has finished.
However, the M7 and later processors provide a mechanism to pause the
virtual processor until the completion status has been updated by the
coprocessor. This is done using the monitored load and mwait
instructions, which are described in more detail later.  The DAX
coprocessor was designed so that after a request is submitted, the
kernel is no longer involved in the processing of it.  The polling is
done at the user level, which results in almost zero latency between
completion of a request and resumption of execution of the requesting
thread.


Addressing Memory
=================

The kernel does not have access to physical memory in the Sun4v
architecture, as there is an additional level of memory virtualization
present. This intermediate level is called "real" memory, and the
kernel treats this as if it were physical.  The Hypervisor handles the
translations between real memory and physical so that each logical
domain (LDOM) can have a partition of physical memory that is isolated
from that of other LDOMs.  When the kernel sets up a virtual mapping,
it specifies a virtual address and the real address to which it should
be mapped.

The DAX coprocessor can only operate on physical memory, so before a
request can be fed to the coprocessor, all the addresses in a CCB must
be converted into physical addresses. The kernel cannot do this since
it has no visibility into physical addresses. Instead, a CCB may
contain either the virtual or real addresses of the buffers, or a
combination of them. An "address type" field is available for each
address that may be given in the CCB. In all cases, the Hypervisor
translates all the addresses to physical before dispatching to
hardware. Address translations are performed using the context of the
process initiating the request.


The Driver API
==============

An application makes requests to the driver via the write() system
call, and gets results (if any) via read(). The completion areas are
made accessible via mmap(), and are read-only for the application.

The request may either be an immediate command or an array of CCBs to
be submitted to the hardware.

Each open instance of the device is exclusive to the thread that
opened it, and must be used by that thread for all subsequent
operations. The driver open function creates a new context for the
thread and initializes it for use.  This context contains pointers and
values used internally by the driver to keep track of submitted
requests. The completion area buffer is also allocated, and this is
large enough to contain the completion areas for many concurrent
requests.  When the device is closed, any outstanding transactions are
flushed and the context is cleaned up.

On a DAX1 system (M7), the device will be called "oradax1", while on a
DAX2 system (M8) it will be "oradax2". If an application requires one
or the other, it should simply attempt to open the appropriate
device. Only one of the devices will exist on any given system, so the
name can be used to determine what the platform supports.

The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For
all of these, success is indicated by a return value from write()
equal to the number of bytes given in the call. Otherwise -1 is
returned and errno is set.
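
The success rule for immediate commands can be wrapped in a small
helper. The following is a minimal sketch and not part of the driver
or of libdax: the name dax_send_command() is hypothetical, and the
command is passed as opaque bytes so that the sketch does not depend
on the layout of struct dax_command. Mapping a short write to EIO is
likewise an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: issue one immediate command on an open file
 * descriptor. Returns 0 when write() reports exactly len bytes, and
 * -1 with errno set otherwise.
 */
static int dax_send_command(int fd, const void *cmd, size_t len)
{
	ssize_t n;

	assert(cmd != NULL && len > 0);

	n = write(fd, cmd, len);
	if (n == (ssize_t)len)
		return 0;
	if (n >= 0)
		errno = EIO;	/* assumption: report a short write as EIO */
	return -1;
}
```

With the real device, cmd would point to a struct dax_command whose
command field is set to CCB_DEQUEUE, CCB_KILL, or CCB_INFO.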

CCB_DEQUEUE
-----------

Tells the driver to clean up resources associated with past
requests. Since no interrupt is generated upon the completion of a
request, the driver must be told when it may reclaim resources.  No
further status information is returned, so the user should not
subsequently call read().

CCB_KILL
--------

Kills a CCB during execution. The CCB is guaranteed to not continue
executing once this call returns successfully. On success, read() must
be called to retrieve the result of the action.

CCB_INFO
--------

Retrieves information about a currently executing CCB. Note that some
Hypervisors might return 'notfound' when the CCB is in 'inprogress'
state. To ensure a CCB in the 'notfound' state will never be executed,
CCB_KILL must be invoked on that CCB. Upon success, read() must be
called to retrieve the details of the action.

Submission of an array of CCBs for execution
---------------------------------------------

A write() whose length is a multiple of the CCB size is treated as a
submit operation. The file offset is treated as the index of the
completion area to use, and may be set via lseek() or using the
pwrite() system call. If -1 is returned then errno is set to indicate
the error. Otherwise, the return value is the length of the array that
was actually accepted by the coprocessor. If the accepted length is
equal to the requested length, then the submission was completely
successful and there is no further status needed; hence, the user
should not subsequently call read(). Partial acceptance of the CCB
array is indicated by a return value less than the requested length,
and read() must be called to retrieve further status information.  The
status will reflect the error caused by the first CCB that was not
accepted, and status_data will provide additional data in some cases.
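
The three possible outcomes of a submit can be told apart from the
write() return value alone. The sketch below encodes that decision;
the enum and function names are hypothetical and are not defined by
the driver, and the 64-byte CCB size is taken from the CCB layout of
eight 64-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

enum submit_outcome {
	SUBMIT_ERRNO,		/* write() returned -1; consult errno */
	SUBMIT_PARTIAL,		/* fewer bytes accepted; call read() for status */
	SUBMIT_COMPLETE		/* whole array accepted; do not call read() */
};

/* Classify the return value of a submit write() of `requested` bytes. */
static enum submit_outcome classify_submit(ssize_t ret, size_t requested)
{
	assert(requested > 0 && requested % 64 == 0);

	if (ret < 0)
		return SUBMIT_ERRNO;
	if ((size_t)ret < requested)
		return SUBMIT_PARTIAL;
	return SUBMIT_COMPLETE;
}

/* Number of whole CCBs accepted, given the 64-byte CCB size. */
static size_t ccbs_accepted(ssize_t ret)
{
	return ret < 0 ? 0 : (size_t)ret / 64;
}
```

Only the SUBMIT_PARTIAL case calls for a subsequent read().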

MMAP
----

The mmap() function provides access to the completion area allocated
in the driver.  Note that the completion area is not writeable by the
user process, and the mmap call must not specify PROT_WRITE.


Completion of a Request
=======================

The first byte in each completion area is the command status which is
updated by the coprocessor hardware. Software may take advantage of
new M7/M8 processor capabilities to efficiently poll this status byte.
First, a "monitored load" is achieved via a Load from Alternate Space
(ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY).  Second, a
"monitored wait" is achieved via the mwait instruction (a write to
%asr28). This instruction is like pause in that it suspends execution
of the virtual processor for the given number of nanoseconds, but in
addition will terminate early when one of several events occurs. If the
block of data containing the monitored location is modified, then the
mwait terminates. This causes software to resume execution immediately
(without a context switch or kernel to user transition) after a
transaction completes. Thus the latency between transaction completion
and resumption of execution may be just a few nanoseconds.
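
When the monitored load and mwait instructions are not available, for
example when exercising the surrounding logic on another
architecture, the same control flow can be approximated with a plain
volatile read. The sketch below is only a portable stand-in: it
busy-polls, so it has none of the latency or power benefits described
above, and wait_for_completion() and its spin limit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Poll the command status byte (the first byte of a completion area)
 * until it becomes non-zero or max_spins reads have been made.
 * Returns the last status read; 0 means still in progress.
 */
static unsigned char wait_for_completion(const volatile unsigned char *ca,
					 unsigned long max_spins)
{
	unsigned char status = 0;
	unsigned long i;

	assert(ca != NULL);

	for (i = 0; i < max_spins; i++) {
		status = ca[0];
		if (status)	/* non-zero: a result has been written */
			break;
	}
	return status;
}
```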


Application Life Cycle of a DAX Submission
==========================================

 - open dax device
 - call mmap() to get the completion area address
 - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc.
 - submit CCB via write() or pwrite()
 - go into a loop executing monitored load + monitored wait and
   terminate when the command status indicates the request is complete
   (CCB_KILL or CCB_INFO may be used any time as necessary)
 - perform a CCB_DEQUEUE
 - call munmap() for completion area
 - close the dax device


Memory Constraints
==================

The DAX hardware operates only on physical addresses. Therefore, it is
not aware of virtual memory mappings and the discontiguities that may
exist in the physical memory that a virtual buffer maps to. There is
no I/O TLB or any scatter/gather mechanism. All buffers, whether input
or output, must reside in a physically contiguous region of memory.

The Hypervisor translates all addresses within a CCB to physical
before handing off the CCB to DAX. The Hypervisor determines the
virtual page size for each virtual address given, and uses this to
program a size limit for each address. This prevents the coprocessor
from reading or writing beyond the bound of the virtual page, even
though it is accessing physical memory directly. A simpler way of
saying this is that a DAX operation will never "cross" a virtual page
boundary. If an 8k virtual page is used, then the data is strictly
limited to 8k. If a user's buffer is larger than 8k, then a larger
page size must be used, or the transaction size will be truncated to
8k.
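
The effect of the page limit on a transaction can be computed ahead of
submission. The following sketch, with hypothetical helper names,
returns how many bytes remain before the end of the page that contains
a given address, and clamps a requested length accordingly. The page
size is supplied by the caller and is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bytes available from addr up to the end of the page containing it.
 * page_size must be a power of two.
 */
static size_t bytes_to_page_end(uintptr_t addr, size_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return page_size - (size_t)(addr & (page_size - 1));
}

/* Largest transaction length starting at addr that stays in one page. */
static size_t clamp_to_page(uintptr_t addr, size_t len, size_t page_size)
{
	size_t room = bytes_to_page_end(addr, page_size);

	return len < room ? len : room;
}
```

For a 16k request at the start of an 8k page, clamp_to_page() yields
8k, which matches the truncation described above.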

Huge pages. A user may allocate huge pages using standard interfaces.
Memory buffers residing on huge pages may be used to achieve much
larger DAX transaction sizes, but the rules must still be followed,
and no transaction will cross a page boundary, even a huge page.  A
major caveat is that Linux on SPARC presents 8MB as one of the huge
page sizes. SPARC does not actually provide an 8MB hardware page size,
and this size is synthesized by pasting together two 4MB pages. The
reasons for this are historical, and it creates an issue because only
half of this 8MB page can actually be used for any given buffer in a
DAX request, and it must be either the first half or the second half;
it cannot be a 4MB chunk in the middle, since that crosses a
(hardware) page boundary. Note that this entire issue may be hidden by
higher level libraries.
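
The 8MB caveat reduces to checking that a buffer does not straddle a
4MB hardware page. A minimal sketch of that check follows; the macro
and function names are hypothetical, and the check assumes that 8MB
huge pages are 8MB aligned so that each half is a 4MB-aligned
hardware page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HW_PAGE_4M ((uintptr_t)4 << 20)	/* hardware page inside an 8MB huge page */

/*
 * True if [addr, addr + len) lies entirely within one 4MB hardware
 * page, i.e. entirely in the first or the second half of an 8MB page.
 */
static bool in_single_4m_page(uintptr_t addr, size_t len)
{
	assert(len > 0);
	return (addr / HW_PAGE_4M) == ((addr + len - 1) / HW_PAGE_4M);
}
```

A 4MB buffer starting 2MB into an 8MB page fails this check, while
the same buffer placed at offset 0 or at offset 4MB passes.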


CCB Structure
-------------
A CCB is an array of 8 64-bit words. Several of these words provide
command opcodes, parameters, flags, etc., and the rest are addresses
for the completion area, output buffer, and various inputs::

   struct ccb {
       u64   control;
       u64   completion;
       u64   input0;
       u64   access;
       u64   input1;
       u64   op_data;
       u64   output;
       u64   table;
   };

See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of
each of these fields, and see dax-hv-api.txt for a complete description
of the Hypervisor API available to the guest OS (i.e., the Linux kernel).
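
Because the submit path identifies CCBs purely by size, it can be
worth verifying the layout at build time. The sketch below mirrors
the structure above for user space, with uint64_t standing in for
u64; the CCB_SIZE macro is a hypothetical name introduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the CCB layout; uint64_t stands in for u64. */
struct ccb {
	uint64_t control;
	uint64_t completion;
	uint64_t input0;
	uint64_t access;
	uint64_t input1;
	uint64_t op_data;
	uint64_t output;
	uint64_t table;
};

#define CCB_SIZE 64	/* 8 words x 8 bytes */

/* Fail the build if the compiler lays the structure out differently. */
static_assert(sizeof(struct ccb) == CCB_SIZE, "CCB must be exactly 64 bytes");
static_assert(offsetof(struct ccb, output) == 48, "output is the seventh word");
```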

The first word (control) is examined by the driver for the following:

 - CCB version, which must be consistent with hardware version
 - Opcode, which must be one of the documented allowable commands
 - Address types, which must be set to "virtual" for all the addresses
   given by the user, thereby ensuring that the application can
   only access memory that it owns


Example Code
============

The DAX is accessible to both user and kernel code.  The kernel code
can make hypercalls directly while the user code must use wrappers
provided by the driver. The setup of the CCB is nearly identical for
both; the only difference is in preparation of the completion area. An
example of user code is given now, with kernel code afterwards.

In order to program using the driver API, the file
arch/sparc/include/uapi/asm/oradax.h must be included.

First, the proper device must be opened. For M7 it will be
/dev/oradax1 and for M8 it will be /dev/oradax2. The simplest
procedure is to attempt to open both, as only one will succeed::

        fd = open("/dev/oradax1", O_RDWR);
        if (fd < 0)
                fd = open("/dev/oradax2", O_RDWR);
        if (fd < 0) {
                /* No DAX found */
        }

Next, the completion area must be mapped::

        completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0);

All input and output buffers must be fully contained in one hardware
page because, as explained above, the DAX is strictly constrained by
virtual page boundaries.  In addition, the output buffer must be
64-byte aligned and its size must be a multiple of 64 bytes because
the coprocessor writes in units of cache lines.
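
One way to satisfy the alignment and size rule for the output buffer
is sketched below using the C11 aligned_alloc() function. The helper
names are hypothetical. Note that this sketch only handles the
64-byte rule; whether the buffer also fits inside a single page must
still be checked separately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DAX_OUT_ALIGN 64	/* alignment and size granule of the output */

/* Round n up to the next multiple of 64 bytes. */
static size_t round_up_64(size_t n)
{
	return (n + DAX_OUT_ALIGN - 1) & ~(size_t)(DAX_OUT_ALIGN - 1);
}

/*
 * Allocate a zeroed, 64-byte aligned output buffer of at least nbytes.
 * The rounded size is stored in *alloc_len. Returns NULL on failure;
 * release the buffer with free().
 */
static void *alloc_output(size_t nbytes, size_t *alloc_len)
{
	size_t len = round_up_64(nbytes);
	void *p;

	assert(nbytes > 0 && alloc_len != NULL);

	p = aligned_alloc(DAX_OUT_ALIGN, len);
	if (p == NULL)
		return NULL;
	memset(p, 0, len);
	*alloc_len = len;
	return p;
}
```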

This example demonstrates the DAX Scan command, which takes as input a
vector and a match value, and produces a bitmap as the output. For
each input element that matches the value, the corresponding bit is
set in the output.

In this example, the input vector consists of a series of single bits,
and the match value is 0. So each 0 bit in the input will produce a 1
in the output, and vice versa, so the output bitmap is the inverse of
the input bitmap.
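
Since the expected output is simply the complement of the input, a
software reference is easy to write and can be used to validate what
the coprocessor returns. The function below is such a sketch; its
name is hypothetical, and it is restricted to whole bytes so that it
makes no assumption about the bit order within a byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Reference result for the scan described above: every input bit that
 * equals 0 yields a 1 in the output, so the output is the bitwise
 * complement of the input. nbits must be a multiple of 8.
 */
static void scan_zero_reference(const uint8_t *in, uint8_t *out, size_t nbits)
{
	size_t i;

	assert(in != NULL && out != NULL);
	assert(nbits % 8 == 0);

	for (i = 0; i < nbits / 8; i++)
		out[i] = (uint8_t)~in[i];
}
```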

For details of all the parameters and bits used in this CCB, please
refer to section 36.2.1.3 of the DAX Hypervisor API document, which
describes the Scan command in detail::

        ccb->control =          /* Table 36.1, CCB Header Format */
                  (2L << 48)    /* command = Scan Value */
                | (3L << 40)    /* output address type = primary virtual */
                | (3L << 34)    /* primary input address type = primary virtual */
                                /* Section 36.2.1, Query CCB Command Formats */
                | (1 << 28)     /* 36.2.1.1.1 primary input format = fixed width bit packed */
                | (0 << 23)     /* 36.2.1.1.2 primary input element size = 0 (1 bit) */
                | (8 << 10)     /* 36.2.1.1.6 output format = bit vector */
                | (0 <<  5)     /* 36.2.1.3 First scan criteria size = 0 (1 byte) */
                | (31 << 0);    /* 36.2.1.3 Disable second scan criteria */

        ccb->completion = 0;    /* Completion area address, to be filled in by driver */

        ccb->input0 = (unsigned long) input; /* primary input address */

        ccb->access =           /* Section 36.2.1.2, Data Access Control */
                  (2 << 24)     /* Primary input length format = bits */
                | (nbits - 1);  /* number of bits in primary input stream, minus 1 */

        ccb->input1 = 0;        /* secondary input address, unused */

        ccb->op_data = 0;       /* scan criteria (value to be matched) */

        ccb->output = (unsigned long) output;   /* output address */

        ccb->table = 0;         /* table address, unused */
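
The same two words can be built from named fields, which makes the
shifts easier to audit. This is only a sketch: every macro and
function name below is hypothetical, the shift positions and values
are copied from the example above, and the 24-bit limit on the length
is an assumption inferred from the position of the length format
field.

```c
#include <assert.h>
#include <stdint.h>

#define FIELD(val, shift)	((uint64_t)(val) << (shift))

/* Control word fields, positions as used in the example above. */
#define CTL_CMD(v)		FIELD(v, 48)	/* command */
#define CTL_OUT_ATYPE(v)	FIELD(v, 40)	/* output address type */
#define CTL_IN0_ATYPE(v)	FIELD(v, 34)	/* primary input address type */
#define CTL_IN0_FMT(v)		FIELD(v, 28)	/* primary input format */
#define CTL_IN0_ELEM(v)		FIELD(v, 23)	/* primary input element size */
#define CTL_OUT_FMT(v)		FIELD(v, 10)	/* output format */
#define CTL_CRIT1_SIZE(v)	FIELD(v, 5)	/* first scan criteria size */
#define CTL_CRIT2_SIZE(v)	FIELD(v, 0)	/* second scan criteria size */

/* Control word for the 1-bit Scan Value example. */
static uint64_t scan_control_word(void)
{
	return CTL_CMD(2)		/* Scan Value */
	     | CTL_OUT_ATYPE(3)		/* primary virtual */
	     | CTL_IN0_ATYPE(3)		/* primary virtual */
	     | CTL_IN0_FMT(1)		/* fixed width bit packed */
	     | CTL_IN0_ELEM(0)		/* 1 bit */
	     | CTL_OUT_FMT(8)		/* bit vector */
	     | CTL_CRIT1_SIZE(0)	/* 1 byte */
	     | CTL_CRIT2_SIZE(31);	/* second criteria disabled */
}

/* Access word: length format = bits, length = nbits - 1. */
static uint64_t scan_access_word(uint64_t nbits)
{
	/* assumption: the length occupies the bits below bit 24 */
	assert(nbits >= 1 && nbits - 1 < (UINT64_C(1) << 24));
	return FIELD(2, 24) | (nbits - 1);
}
```

Using 64-bit constants throughout also avoids relying on long being
64 bits wide, which the 2L << 48 form in the example depends on.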

The CCB is submitted with a write() or pwrite() system call to the
driver. If the call fails, read() must be used to retrieve the
status::

        if (pwrite(fd, ccb, 64, 0) != 64) {
                struct ccb_exec_result status;
                read(fd, &status, sizeof(status));
                /* bail out */
        }

After a successful submission of the CCB, the completion area may be
polled to determine when the DAX is finished. Detailed information on
the contents of the completion area can be found in section 36.2.2 of
the DAX HV API document::

        while (1) {
                /* Monitored Load */
                __asm__ __volatile__("lduba [%1] 0x84, %0\n"
                                     : "=r" (status)
                                     : "r"  (completion_area));

                if (status)          /* 0 indicates command in progress */
                        break;

                /* MWAIT */
                __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
        }

A completion area status of 1 indicates successful completion of the
CCB and validity of the output bitmap, which may be used immediately.
All other non-zero values indicate error conditions which are
described in section 36.2.2::

        if (completion_area[0] != 1) {  /* section 36.2.2, 1 = command ran and succeeded */
                /* completion_area[0] contains the completion status */
                /* completion_area[1] contains an error code, see 36.2.2 */
        }
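
The status and error bytes can be gathered into a small result
structure. The sketch below is illustrative only: all names are
hypothetical, only the two status values documented above (0 and 1)
are given names, and locating the completion area for index i at
byte offset i * 128 is an assumption based on the 128 byte
completion area size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CA_SIZE 128	/* size of one completion area in bytes */

enum { CA_IN_PROGRESS = 0, CA_SUCCESS = 1 };

struct ca_result {
	int done;	/* non-zero once a status has been written */
	int ok;		/* non-zero if the command ran and succeeded */
	uint8_t status;	/* byte 0: completion status */
	uint8_t err;	/* byte 1: error code, see 36.2.2 */
};

/* Assumption: completion areas are packed back to back in the mapping. */
static const volatile uint8_t *ca_at(const volatile uint8_t *base, size_t index)
{
	return base + index * CA_SIZE;
}

/* Snapshot the first two bytes of one completion area. */
static struct ca_result ca_decode(const volatile uint8_t *ca)
{
	struct ca_result r;

	assert(ca != NULL);

	r.status = ca[0];
	r.err = ca[1];
	r.done = r.status != CA_IN_PROGRESS;
	r.ok = r.status == CA_SUCCESS;
	return r;
}
```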

After the completion area has been processed, the driver must be
notified that it can release any resources associated with the
request. This is done via the dequeue operation::

        struct dax_command cmd;
        cmd.command = CCB_DEQUEUE;
        if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) {
                /* bail out */
        }

Finally, normal program cleanup should be done, i.e., unmapping the
completion area, closing the DAX device, freeing memory, etc.

Kernel example
--------------

The only difference in using the DAX in kernel code is the treatment
of the completion area. Unlike user applications which mmap the
completion area allocated by the driver, kernel code must allocate its
own memory to use for the completion area, and this address and its
type must be given in the CCB::

        ccb->control |=      /* Table 36.1, CCB Header Format */
                (3L << 32);     /* completion area address type = primary virtual */

        ccb->completion = (unsigned long) completion_area;   /* Completion area address */

The dax submit hypercall is made directly. The flags used in the
ccb_submit call are documented in the DAX HV API in section 36.3.1::

        #include <asm/hypervisor.h>

        hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64,
                                 HV_CCB_QUERY_CMD |
                                 HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY |
                                 HV_CCB_VA_PRIVILEGED,
                                 0, &bytes_accepted, &status_data);

        if (hv_rv != HV_EOK) {
                /* hv_rv is an error code, status_data contains */
                /* potential additional status, see 36.3.1.1 */
        }

After the submission, the completion area polling code is identical to
that in user land::

        while (1) {
                /* Monitored Load */
                __asm__ __volatile__("lduba [%1] 0x84, %0\n"
                                     : "=r" (status)
                                     : "r"  (completion_area));

                if (status)          /* 0 indicates command in progress */
                        break;

                /* MWAIT */
                __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
        }

        if (completion_area[0] != 1) {  /* section 36.2.2, 1 = command ran and succeeded */
                /* completion_area[0] contains the completion status */
                /* completion_area[1] contains an error code, see 36.2.2 */
        }

The output bitmap is ready for consumption immediately after the
completion status indicates success.

Excerpt from UltraSPARC Virtual Machine Specification
=====================================================

 .. include:: dax-hv-api.txt
    :literal:
