.. SPDX-License-Identifier: GPL-2.0-only

=============
 QAIC driver
=============

The QAIC driver is the Kernel Mode Driver (KMD) for the AIC100 family of AI
accelerator products.

Interrupts
==========

IRQ Storm Mitigation
--------------------

While the AIC100 DMA Bridge hardware implements an IRQ storm mitigation
mechanism, it is still possible for an IRQ storm to occur. A storm can happen
if the workload is particularly quick, and the host is responsive. If the host
can drain the response FIFO as quickly as the device can insert elements into
it, then the device will frequently transition the response FIFO from empty to
non-empty and generate MSIs at a rate equivalent to the speed of the
workload's ability to process inputs. The lprnet (license plate reader network)
workload is known to trigger this condition, and can generate in excess of 100k
MSIs per second. It has been observed that most systems cannot tolerate this
for long, and will crash due to some form of watchdog because of the overhead
of the interrupt controller interrupting the host CPU.


To mitigate this issue, the QAIC driver implements specific IRQ handling. When
QAIC receives an IRQ, it disables that line. This prevents the interrupt
controller from interrupting the CPU. Then QAIC drains the FIFO. Once the FIFO
is drained, QAIC implements a "last chance" polling algorithm where QAIC will
sleep for a time to see if the workload will generate more activity. The IRQ
line remains disabled during this time. If no activity is detected, QAIC exits
polling mode and reenables the IRQ line.

This mitigation in QAIC is very effective. The same lprnet use case that
generates 100k IRQs per second (per /proc/interrupts) is reduced to roughly 64
IRQs over 5 minutes while keeping the host system stable, and having the same
workload throughput performance (within run to run noise variation).

Single MSI Mode
---------------

MultiMSI is not well supported on all systems; virtualized ones even less so
(circa 2023). Between hypervisors masking the PCIe MSI capability structure to
large memory requirements for vIOMMUs (required for supporting MultiMSI), it is
useful to be able to fall back to a single MSI when needed.


To support this fallback, we allow the case where only one MSI is able to be
allocated, and share that one MSI between MHI and the DBCs. The device detects
when only one MSI has been configured and directs the interrupts for the DBCs
to the interrupt normally used for MHI. Unfortunately this means that the
interrupt handlers for every DBC and MHI wake up for every interrupt that
arrives; however, the DBC threaded irq handlers are only started when work to
be done is detected (MHI will always start its threaded handler).

If the DBC is configured to force MSI interrupts, this can circumvent the
software IRQ storm mitigation mentioned above. Since the MSI is shared it is
never disabled, allowing each new entry to the FIFO to trigger a new interrupt.


Neural Network Control (NNC) Protocol
=====================================

The implementation of NNC is split between the KMD (QAIC) and UMD. In general
QAIC understands how to encode/decode NNC wire protocol, and elements of the
protocol which require kernel space knowledge to process (for example, mapping
host memory to device IOVAs).
QAIC understands the structure of a message, and
all of the transactions. QAIC does not understand commands (the payload of a
passthrough transaction).

QAIC handles and enforces the required little endianness and 64-bit alignment,
to the degree that it can. Since QAIC does not know the contents of a
passthrough transaction, it relies on the UMD to satisfy the requirements.

The terminate transaction is of particular use to QAIC. QAIC is not aware of
the resources that are loaded onto a device since the majority of that activity
occurs within NNC commands. As a result, QAIC does not have the means to
roll back userspace activity. To ensure that a userspace client's resources
are fully released in the case of a process crash, or a bug, QAIC uses the
terminate command to let QSM know when a user has gone away, and the resources
can be released.

QSM can report a version number of the NNC protocol it supports. This is in the
form of a Major number and a Minor number.

Major number updates indicate changes to the NNC protocol which impact the
message format, or transactions (impacts QAIC).


Minor number updates indicate changes to the NNC protocol which impact the
commands (does not impact QAIC).

uAPI
====

QAIC creates an accel device per physical PCIe device. This accel device exists
for as long as the PCIe device is known to Linux.

The PCIe device may not be in the state to accept requests from userspace at
all times. QAIC will trigger KOBJ_ONLINE/OFFLINE uevents to advertise when the
device can accept requests (ONLINE) and when the device is no longer accepting
requests (OFFLINE) because of a reset or other state transition.

QAIC defines a number of driver specific IOCTLs as part of the userspace API.

DRM_IOCTL_QAIC_MANAGE
  This IOCTL allows userspace to send an NNC request to the QSM. The call will
  block until a response is received, or the request has timed out.

DRM_IOCTL_QAIC_CREATE_BO
  This IOCTL allows userspace to allocate a buffer object (BO) which can send
  or receive data from a workload. The call will return a GEM handle that
  represents the allocated buffer. The BO is not usable until it has been
  sliced (see DRM_IOCTL_QAIC_ATTACH_SLICE_BO).


DRM_IOCTL_QAIC_MMAP_BO
  This IOCTL allows userspace to prepare an allocated BO to be mmap'd into the
  userspace process.

DRM_IOCTL_QAIC_ATTACH_SLICE_BO
  This IOCTL allows userspace to slice a BO in preparation for sending the BO
  to the device. Slicing is the operation of describing what portions of a BO
  get sent where to a workload. This requires a set of DMA transfers for the
  DMA Bridge, and as such, locks the BO to a specific DBC.

DRM_IOCTL_QAIC_EXECUTE_BO
  This IOCTL allows userspace to submit a set of sliced BOs to the device. The
  call is non-blocking. Success only indicates that the BOs have been queued
  to the device, but does not guarantee they have been executed.

DRM_IOCTL_QAIC_PARTIAL_EXECUTE_BO
  This IOCTL operates like DRM_IOCTL_QAIC_EXECUTE_BO, but it allows userspace
  to shrink the BOs sent to the device for this specific call. If a BO
  typically has N inputs, but only a subset of those is available, this IOCTL
  allows userspace to indicate that only the first M bytes of the BO should be
  sent to the device to minimize data transfer overhead. This IOCTL dynamically
  recomputes the slicing, and therefore has some processing overhead before the
  BOs can be queued to the device.

DRM_IOCTL_QAIC_WAIT_BO
  This IOCTL allows userspace to determine when a particular BO has been
  processed by the device. The call will block until either the BO has been
  processed and can be re-queued to the device, or a timeout occurs.

DRM_IOCTL_QAIC_PERF_STATS_BO
  This IOCTL allows userspace to collect performance statistics on the most
  recent execution of a BO. This allows userspace to construct an end to end
  timeline of the BO processing for a performance analysis.

DRM_IOCTL_QAIC_PART_DEV
  This IOCTL allows userspace to request a duplicate "shadow device". This extra
  accelN device is associated with a specific partition of resources on the
  AIC100 device and can be used for limiting a process to some subset of
  resources.


DRM_IOCTL_QAIC_DETACH_SLICE_BO
  This IOCTL allows userspace to remove the slicing information from a BO that
  was originally provided by a call to DRM_IOCTL_QAIC_ATTACH_SLICE_BO. This
  is the inverse of DRM_IOCTL_QAIC_ATTACH_SLICE_BO. The BO must be idle for
  DRM_IOCTL_QAIC_DETACH_SLICE_BO to be called. After a successful detach slice
  operation the BO may have new slicing information attached with a new call
  to DRM_IOCTL_QAIC_ATTACH_SLICE_BO. After detach slice, the BO cannot be
  executed until after a new attach slice operation. Combining attach slice
  and detach slice calls allows userspace to use a BO with multiple workloads.

Userspace Client Isolation
==========================

AIC100 supports multiple clients. Multiple DBCs can be consumed by a single
client, and multiple clients can each consume one or more DBCs. Workloads
may contain sensitive information; therefore, only the client that owns the
workload should be allowed to interface with the DBC.

Clients are identified by the instance associated with their open().
A client
may only use memory they allocate, and DBCs that are assigned to their
workloads. Attempts to access resources assigned to other clients will be
rejected.

Module parameters
=================

QAIC supports the following module parameters:

**datapath_polling (bool)**

Configures QAIC to use a polling thread for datapath events instead of relying
on the device interrupts. Useful for platforms with broken multiMSI. Must be
set at QAIC driver initialization. Default is 0 (off).

**mhi_timeout_ms (unsigned int)**

Sets the timeout value for MHI operations in milliseconds (ms). Must be set
at the time the driver detects a device. Default is 2000 (2 seconds).

**control_resp_timeout_s (unsigned int)**

Sets the timeout value for QSM responses to NNC messages in seconds (s). Must
be set at the time the driver is sending a request to QSM. Default is 60 (one
minute).

**wait_exec_default_timeout_ms (unsigned int)**

Sets the default timeout for the wait_exec ioctl in milliseconds (ms). Must be
set prior to the wait_exec ioctl call.
A value specified in the ioctl call
overrides this for that call. Default is 5000 (5 seconds).

**datapath_poll_interval_us (unsigned int)**

Sets the polling interval in microseconds (us) when datapath polling is active.
Takes effect at the next polling interval. Default is 100 (100 us).

**timesync_delay_ms (unsigned int)**

Sets the time interval in milliseconds (ms) between two consecutive timesync
operations. Default is 1000 (1000 ms).