.. SPDX-License-Identifier: GPL-2.0-only

===============================
 Qualcomm Cloud AI 100 (AIC100)
===============================

Overview
========

The Qualcomm Cloud AI 100/AIC100 family of products (including SA9000P - part of
Snapdragon Ride) are PCIe adapter cards which contain a dedicated SoC ASIC for
the purpose of efficiently running Artificial Intelligence (AI) Deep Learning
inference workloads. They are AI accelerators.

The PCIe interface of AIC100 is capable of PCIe Gen4 speeds over eight lanes
(x8). An individual SoC on a card can have up to 16 NSPs for running workloads.
Each SoC has an A53 management CPU. On card, there can be up to 32 GB of DDR.

Multiple AIC100 cards can be hosted in a single system to scale overall
performance. AIC100 cards are multi-user capable and able to execute workloads
from multiple users in a concurrent manner.

Hardware Description
====================

An AIC100 card consists of an AIC100 SoC, on-card DDR, and a set of misc
peripherals (PMICs, etc).

An AIC100 card can either be a PCIe HHHL form factor (a traditional PCIe card),
or a Dual M.2 card. Both use PCIe to connect to the host system.

As a PCIe endpoint/adapter, AIC100 uses the standard VendorID(VID)/
DeviceID(DID) combination to uniquely identify itself to the host. AIC100
uses the standard Qualcomm VID (0x17cb). All AIC100 SKUs use the same
AIC100 DID (0xa100).

AIC100 does not implement FLR (function level reset).

AIC100 implements MSI but does not implement MSI-X. AIC100 requires 17 MSIs to
operate (1 for MHI, 16 for the DMA Bridge).

As a PCIe device, AIC100 utilizes BARs to provide host interfaces to the device
hardware. AIC100 provides three 64-bit BARs.

* The first BAR is 4K in size, and exposes the MHI interface to the host.

* The second BAR is 2M in size, and exposes the DMA Bridge interface to the
  host.

* The third BAR is variable in size based on an individual AIC100's
  configuration, but defaults to 64K. This BAR currently has no purpose.
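
The identification, interrupt, and BAR facts above can be captured as a handful
of constants. The following is a minimal sketch and not the qaic driver's code:
every macro, enum, and function name is hypothetical, and only the numeric
values come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Numeric values are from the text; all names are illustrative only. */
#define AIC100_PCI_VID      0x17cb  /* standard Qualcomm vendor ID */
#define AIC100_PCI_DID      0xa100  /* shared by all AIC100 SKUs */

#define AIC100_MSI_MHI      1       /* one MSI for MHI */
#define AIC100_MSI_DBC      16      /* one MSI per DMA Bridge channel */
#define AIC100_MSI_TOTAL    (AIC100_MSI_MHI + AIC100_MSI_DBC)

/* The three BARs in the order the text lists them. */
enum aic100_bar {
    AIC100_BAR_MHI,     /* first BAR: MHI interface */
    AIC100_BAR_DBC,     /* second BAR: DMA Bridge interface */
    AIC100_BAR_UNUSED,  /* third BAR: currently no purpose */
};

#define AIC100_BAR_MHI_SIZE (4u * 1024u)            /* 4K */
#define AIC100_BAR_DBC_SIZE (2u * 1024u * 1024u)    /* 2M */

static_assert(AIC100_MSI_TOTAL == 17, "AIC100 requires 17 MSIs");

/* Return true if a vendor/device ID pair identifies an AIC100. */
static inline bool aic100_id_matches(uint16_t vendor, uint16_t device)
{
    return vendor == AIC100_PCI_VID && device == AIC100_PCI_DID;
}
```

Because all SKUs share one device ID, a match on the ID pair alone cannot tell
SKUs apart.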

From the host perspective, AIC100 has several key hardware components:

* MHI (Modem Host Interface)
* QSM (QAIC Service Manager)
* NSPs (Neural Signal Processor)
* DMA Bridge
* DDR

MHI
---

AIC100 has one MHI interface over PCIe. MHI itself is documented at
Documentation/mhi/index.rst. MHI is the mechanism the host uses to communicate
with the QSM. Except for workload data via the DMA Bridge, all interaction with
the device occurs via MHI.

QSM
---

QAIC Service Manager. This is an ARM A53 CPU that runs the primary
firmware of the card and performs on-card management tasks. It also
communicates with the host via MHI. Each AIC100 has one of
these.

NSP
---

Neural Signal Processor. Each AIC100 has up to 16 of these. These are
the processors that run the workloads on AIC100. Each NSP is a Qualcomm Hexagon
(Q6) DSP with HVX and HMX. Each NSP can only run one workload at a time, but
multiple NSPs may be assigned to a single workload. Since each NSP can only run
one workload, AIC100 is limited to 16 concurrent workloads.
Workload "scheduling" is under the purview of the host. AIC100 does not
automatically timeslice.

DMA Bridge
----------

The DMA Bridge is a custom DMA engine that manages the flow of data
in and out of workloads. AIC100 has one of these. The DMA Bridge has 16
channels, each consisting of a set of request/response FIFOs. Each active
workload is assigned a single DMA Bridge channel. The DMA Bridge exposes
hardware registers to manage the FIFOs (head/tail pointers), but requires host
memory to store the FIFOs.

DDR
---

AIC100 has on-card DDR. In total, an AIC100 can have up to 32 GB of DDR.
This DDR is used to store workloads, data for the workloads, and is used by the
QSM for managing the device. NSPs are granted access to sections of the DDR by
the QSM. The host does not have direct access to the DDR, and must make
requests to the QSM to transfer data to the DDR.

High-level Use Flow
===================

AIC100 is a multi-user, programmable accelerator typically used for running
neural networks in inferencing mode to efficiently perform AI operations.
AIC100 is not intended for training neural networks. AIC100 can be utilized
for generic compute workloads.

Assuming a user wants to utilize AIC100, they would follow these steps:

1. Compile the workload into an ELF targeting the NSP(s)
2. Make requests to the QSM to load the workload and related artifacts into the
   device DDR
3. Make a request to the QSM to activate the workload onto a set of idle NSPs
4. Make requests to the DMA Bridge to send input data to the workload to be
   processed, and other requests to receive processed output data from the
   workload.
5. Once the workload is no longer required, make a request to the QSM to
   deactivate the workload, thus putting the NSPs back into an idle state.
6. Once the workload and related artifacts are no longer needed for future
   sessions, make requests to the QSM to unload the data from DDR. This frees
   the DDR to be used by other users.


Boot Flow
=========

AIC100 uses a flashless boot flow, derived from Qualcomm MSMs.

When AIC100 is first powered on, it begins executing PBL (Primary Bootloader)
from ROM.
PBL enumerates the PCIe link, and initializes the BHI (Boot Host
Interface) component of MHI.

Using BHI, the host points PBL to the location of the SBL (Secondary Bootloader)
image. The PBL pulls the image from the host, validates it, and begins
execution of SBL.

SBL initializes MHI, and uses MHI to notify the host that the device has entered
the SBL stage. SBL performs a number of operations:

* SBL initializes the majority of hardware (anything PBL left uninitialized),
  including DDR.
* SBL offloads the bootlog to the host.
* SBL synchronizes timestamps with the host for future logging.
* SBL uses the Sahara protocol to obtain the runtime firmware images from the
  host.

Once SBL has obtained and validated the runtime firmware, it brings the NSPs out
of reset, and jumps into the QSM.

The QSM uses MHI to notify the host that the device has entered the QSM stage
(AMSS in MHI terms). At this point, the AIC100 device is fully functional, and
ready to process workloads.
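
The boot stages described above form a strictly ordered sequence, which a host
could track with a small state machine. This is only a sketch under that
reading of the text. The enum and function names are hypothetical and are not
taken from the MHI core or the qaic driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Boot stages in the order the text describes them (names are illustrative). */
enum aic100_stage {
    AIC100_STAGE_PBL,   /* Primary Bootloader, running from ROM */
    AIC100_STAGE_SBL,   /* Secondary Bootloader, pulled from the host via BHI */
    AIC100_STAGE_QSM,   /* runtime firmware (AMSS in MHI terms) */
};

/*
 * Accept a stage notification only if it is the immediate successor of the
 * current stage. Returns true and updates *cur on a valid transition.
 */
static inline bool aic100_stage_advance(enum aic100_stage *cur,
                                        enum aic100_stage next)
{
    if ((int)next != (int)*cur + 1)
        return false;
    *cur = next;
    return true;
}

/* The device is ready to process workloads only once the QSM stage is reached. */
static inline bool aic100_ready(enum aic100_stage stage)
{
    return stage == AIC100_STAGE_QSM;
}
```

Rejecting out-of-order notifications is a design choice of this sketch. The
text does not say how a host should react to an unexpected stage.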

Userspace components
====================

Compiler
--------

An open compiler for AIC100 based on upstream LLVM can be found at:
https://github.com/quic/software-kit-for-qualcomm-cloud-ai-100-cc

Usermode Driver (UMD)
---------------------

An open UMD that interfaces with the qaic kernel driver can be found at:
https://github.com/quic/software-kit-for-qualcomm-cloud-ai-100

Sahara loader
-------------

An open implementation of the Sahara protocol called kickstart can be found at:
https://github.com/andersson/qdl

MHI Channels
============

AIC100 defines a number of MHI channels for different purposes. This is a list
of the defined channels, and their uses.

+----------------+---------+----------+----------------------------------------+
| Channel name   | IDs     | EEs      | Purpose                                |
+================+=========+==========+========================================+
| QAIC_LOOPBACK  | 0 & 1   | AMSS     | Any data sent to the device on this    |
|                |         |          | channel is sent back to the host.      |
+----------------+---------+----------+----------------------------------------+
| QAIC_SAHARA    | 2 & 3   | SBL      | Used by SBL to obtain the runtime      |
|                |         |          | firmware from the host.                |
+----------------+---------+----------+----------------------------------------+
| QAIC_DIAG      | 4 & 5   | AMSS     | Used to communicate with QSM via the   |
|                |         |          | DIAG protocol.                         |
+----------------+---------+----------+----------------------------------------+
| QAIC_SSR       | 6 & 7   | AMSS     | Used to notify the host of subsystem   |
|                |         |          | restart events, and to offload SSR     |
|                |         |          | crashdumps.                            |
+----------------+---------+----------+----------------------------------------+
| QAIC_QDSS      | 8 & 9   | AMSS     | Used for the Qualcomm Debug Subsystem. |
+----------------+---------+----------+----------------------------------------+
| QAIC_CONTROL   | 10 & 11 | AMSS     | Used for the Neural Network Control    |
|                |         |          | (NNC) protocol. This is the primary    |
|                |         |          | channel between host and QSM for       |
|                |         |          | managing workloads.                    |
+----------------+---------+----------+----------------------------------------+
| QAIC_LOGGING   | 12 & 13 | SBL      | Used by the SBL to send the bootlog to |
|                |         |          | the host.                              |
+----------------+---------+----------+----------------------------------------+
| QAIC_STATUS    | 14 & 15 | AMSS     | Used to notify the host of Reliability,|
|                |         |          | Availability, Serviceability (RAS)     |
|                |         |          | events.                                |
+----------------+---------+----------+----------------------------------------+
| QAIC_TELEMETRY | 16 & 17 | AMSS     | Used to get/set power/thermal/etc      |
|                |         |          | attributes.                            |
+----------------+---------+----------+----------------------------------------+
| QAIC_DEBUG     | 18 & 19 | AMSS     | Not used.                              |
+----------------+---------+----------+----------------------------------------+
| QAIC_TIMESYNC  | 20 & 21 | SBL/AMSS | Used to synchronize timestamps in the  |
|                |         |          | device side logs with the host time    |
|                |         |          | source.                                |
+----------------+---------+----------+----------------------------------------+

DMA Bridge
==========

Overview
--------

The DMA Bridge is one of the main interfaces to the host from the device
(the other being MHI).
As part of activating a workload to run on NSPs, the QSM
assigns that workload a DMA Bridge channel. A workload's DMA Bridge channel
(DBC for short) is solely for the use of that workload and is not shared with
other workloads.

Each DBC is a pair of FIFOs that manage data in and out of the workload. One
FIFO is the request FIFO. The other FIFO is the response FIFO.

Each DBC contains 4 registers in hardware:

* Request FIFO head pointer (offset 0x0). Read only by the host. Indicates the
  latest item in the FIFO the device has consumed.
* Request FIFO tail pointer (offset 0x4). Read/write by the host. Host
  increments this register to add new items to the FIFO.
* Response FIFO head pointer (offset 0x8). Read/write by the host. Indicates
  the latest item in the FIFO the host has consumed.
* Response FIFO tail pointer (offset 0xc). Read only by the host. Device
  increments this register to add new items to the FIFO.

The values in each register are indexes in the FIFO. To get the location of the
FIFO element pointed to by the register: FIFO base address + register * element
size.
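
The register offsets and the element address formula above can be sketched as
follows. The offsets and the formula come from the text. The enum and function
names are illustrative assumptions, and no particular element size is implied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-DBC register offsets as listed in the text (names are illustrative). */
enum dbc_reg {
    DBC_REQ_HEAD = 0x0, /* read only by the host */
    DBC_REQ_TAIL = 0x4, /* read/write by the host */
    DBC_RSP_HEAD = 0x8, /* read/write by the host */
    DBC_RSP_TAIL = 0xc, /* read only by the host */
};

/*
 * Location of the FIFO element a head or tail register refers to:
 * FIFO base address + register value * element size.
 */
static inline uint64_t fifo_elem_addr(uint64_t fifo_base, uint32_t reg_value,
                                      size_t elem_size)
{
    return fifo_base + (uint64_t)reg_value * elem_size;
}
```

For example, with a FIFO based at 0x1000 and 64-byte elements, a register value
of 3 refers to the element at 0x10c0.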

DBC registers are exposed to the host via the second BAR. Each DBC consumes
4KB of space in the BAR.

The actual FIFOs are backed by host memory. When sending a request to the QSM
to activate a workload, the host must donate memory to be used for the FIFOs.
Due to internal mapping limitations of the device, a single contiguous chunk of
memory must be provided per DBC, which hosts both FIFOs. The request FIFO will
consume the beginning of the memory chunk, and the response FIFO will consume
the end of the memory chunk.

Request FIFO
------------

A request FIFO element has the following structure:

.. code-block:: c

  struct request_elem {
          u16 req_id;
          u8  seq_id;
          u8  pcie_dma_cmd;
          u32 reserved;
          u64 pcie_dma_source_addr;
          u64 pcie_dma_dest_addr;
          u32 pcie_dma_len;
          u32 reserved;
          u64 doorbell_addr;
          u8  doorbell_attr;
          u8  reserved;
          u16 reserved;
          u32 doorbell_data;
          u32 sem_cmd0;
          u32 sem_cmd1;
          u32 sem_cmd2;
          u32 sem_cmd3;
  };

Request field descriptions:

req_id
        request ID. A request FIFO element and a response FIFO element with
        the same request ID refer to the same command.

seq_id
        sequence ID within a request. Ignored by the DMA Bridge.

pcie_dma_cmd
        describes the DMA element of this request.

        * Bit(7) is the force msi flag, which overrides the DMA Bridge MSI logic
          and generates an MSI when this request is complete, and QSM
          configures the DMA Bridge to look at this bit.
        * Bits(6:5) are reserved.
        * Bit(4) is the completion code flag, and indicates that the DMA Bridge
          shall generate a response FIFO element when this request is
          complete.
        * Bit(3) indicates if this request is a linked list transfer(0) or a bulk
          transfer(1).
        * Bit(2) is reserved.
        * Bits(1:0) indicate the type of transfer. No transfer(0), to device(1),
          from device(2). Value 3 is illegal.

pcie_dma_source_addr
        source address for a bulk transfer, or the address of the linked list.

pcie_dma_dest_addr
        destination address for a bulk transfer.

pcie_dma_len
        length of the bulk transfer. Note that the size of this field
        limits transfers to 4G in size.

doorbell_addr
        address of the doorbell to ring when this request is complete.

doorbell_attr
        doorbell attributes.

        * Bit(7) indicates if a write to a doorbell is to occur.
        * Bits(6:2) are reserved.
        * Bits(1:0) contain the encoding of the doorbell length. 0 is 32-bit,
          1 is 16-bit, 2 is 8-bit, 3 is reserved. The doorbell address
          must be naturally aligned to the specified length.

doorbell_data
        data to write to the doorbell. Only the bits corresponding to
        the doorbell length are valid.

sem_cmdN
        semaphore command.

        * Bit(31) indicates this semaphore command is enabled.
        * Bit(30) is the to-device DMA fence. Block this request until all
          to-device DMA transfers are complete.
        * Bit(29) is the from-device DMA fence. Block this request until all
          from-device DMA transfers are complete.
        * Bits(28:27) are reserved.
        * Bits(26:24) are the semaphore command. 0 is NOP. 1 is init with the
          specified value. 2 is increment. 3 is decrement. 4 is wait
          until the semaphore is equal to the specified value. 5 is wait
          until the semaphore is greater than or equal to the specified value.
          6 is "P", wait until semaphore is greater than 0, then
          decrement by 1. 7 is reserved.
        * Bit(23) is reserved.
        * Bit(22) is the semaphore sync. 0 is post sync, which means that the
          semaphore operation is done after the DMA transfer. 1 is
          presync, which gates the DMA transfer. Only one presync is
          allowed per request.
        * Bit(21) is reserved.
        * Bits(20:16) are the index of the semaphore to operate on.
        * Bits(15:12) are reserved.
        * Bits(11:0) are the semaphore value to use in operations.
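
The bit layouts of pcie_dma_cmd and sem_cmdN described above can be packed with
a few helpers. The following is a sketch only. The helper, enum, and macro
names are hypothetical. The reserved fields are given unique names purely so
that the structure compiles as standalone C, and fixed-width standard types
stand in for the kernel's u8/u16/u32/u64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirror of the documented layout; reservedN names are illustrative. */
struct request_elem {
    uint16_t req_id;
    uint8_t  seq_id;
    uint8_t  pcie_dma_cmd;
    uint32_t reserved0;
    uint64_t pcie_dma_source_addr;
    uint64_t pcie_dma_dest_addr;
    uint32_t pcie_dma_len;
    uint32_t reserved1;
    uint64_t doorbell_addr;
    uint8_t  doorbell_attr;
    uint8_t  reserved2;
    uint16_t reserved3;
    uint32_t doorbell_data;
    uint32_t sem_cmd[4];    /* sem_cmd0..sem_cmd3 */
};
/* The documented fields are naturally aligned and sum to 64 bytes. */
static_assert(sizeof(struct request_elem) == 64, "unexpected padding");

enum dma_xfer_type { DMA_NONE = 0, DMA_TO_DEVICE = 1, DMA_FROM_DEVICE = 2 };

enum sem_op {
    SEM_NOP = 0, SEM_INIT = 1, SEM_INC = 2, SEM_DEC = 3,
    SEM_WAIT_EQ = 4, SEM_WAIT_GE = 5, SEM_P = 6,
};

#define SEM_FENCE_TO_DEV    (UINT32_C(1) << 30)
#define SEM_FENCE_FROM_DEV  (UINT32_C(1) << 29)

/* Pack pcie_dma_cmd: bit 7 force MSI, bit 4 completion, bit 3 bulk, bits 1:0 type. */
static inline uint8_t pcie_dma_cmd_encode(bool force_msi, bool completion,
                                          bool bulk, enum dma_xfer_type type)
{
    unsigned int cmd = (unsigned int)type & 0x3u;

    if (bulk)
        cmd |= 1u << 3;
    if (completion)
        cmd |= 1u << 4;
    if (force_msi)
        cmd |= 1u << 7;
    return (uint8_t)cmd;
}

/* Pack an enabled semaphore command; fences is a mask of SEM_FENCE_* bits. */
static inline uint32_t sem_cmd_encode(enum sem_op op, bool presync,
                                      uint32_t index, uint32_t value,
                                      uint32_t fences)
{
    uint32_t cmd = UINT32_C(1) << 31;               /* bit 31: enabled */

    cmd |= fences & (SEM_FENCE_TO_DEV | SEM_FENCE_FROM_DEV);
    cmd |= ((uint32_t)op & 0x7u) << 24;             /* bits 26:24: command */
    if (presync)
        cmd |= UINT32_C(1) << 22;                   /* bit 22: presync */
    cmd |= (index & 0x1fu) << 16;                   /* bits 20:16: index */
    cmd |= value & 0xfffu;                          /* bits 11:0: value */
    return cmd;
}
```

As an example, a presync "wait until semaphore 2 equals 1" command encodes to
0x84420001, and a bulk to-device transfer that requests a response element
encodes to 0x19.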

Overall, a request is processed in 4 steps:

1. If specified, the presync semaphore condition must be true
2. If enabled, the DMA transfer occurs
3. If specified, the postsync semaphore conditions must be true
4. If enabled, the doorbell is written

By using the semaphores in conjunction with the workload running on the NSPs,
the data pipeline can be synchronized such that the host can queue multiple
requests of data for the workload to process, but the DMA Bridge will only copy
the data into the memory of the workload when the workload is ready to process
the next input.

Response FIFO
-------------

Once a request is fully processed, a response FIFO element is generated if
specified in pcie_dma_cmd. The structure of a response FIFO element:

.. code-block:: c

  struct response_elem {
          u16 req_id;
          u16 completion_code;
  };

req_id
        matches the req_id of the request that generated this element.

completion_code
        status of this request. 0 is success. Non-zero is an error.
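
A host-side walk over the response FIFO might look like the sketch below. It
rests on several assumptions the text does not state: the head index refers to
the next unread element, the FIFO is empty when head equals tail, indexes wrap
modulo the element count, and the host is little endian so no byte swapping is
shown. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Response element layout from the text, using standard fixed-width types. */
struct response_elem {
    uint16_t req_id;
    uint16_t completion_code;   /* 0 is success, non-zero is an error */
};
static_assert(sizeof(struct response_elem) == 4, "unexpected padding");

/*
 * Consume every element between head and tail (assumed semantics: head is the
 * next unread index, empty when head == tail, indexes wrap modulo nelem).
 * Counts elements with a non-zero completion code in *errors and returns the
 * new head value, which the host would then write back to the head register.
 */
static inline uint32_t response_fifo_drain(const struct response_elem *fifo,
                                           uint32_t nelem, uint32_t head,
                                           uint32_t tail, unsigned int *errors)
{
    while (head != tail) {
        if (fifo[head].completion_code != 0)
            (*errors)++;
        head = (head + 1) % nelem;
    }
    return head;
}
```

A real host would re-read the tail register after draining, since the device
can append elements while the loop runs.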

The DMA Bridge will generate an MSI to the host as a reaction to activity in the
response FIFO of a DBC. The DMA Bridge hardware has an IRQ storm mitigation
algorithm, where it will only generate an MSI when the response FIFO transitions
from empty to non-empty (unless force MSI is enabled and triggered). In
response to this MSI, the host is expected to drain the response FIFO, and must
take care to handle any race conditions between draining the FIFO, and the
device inserting elements into the FIFO.

Neural Network Control (NNC) Protocol
=====================================

The NNC protocol is how the host makes requests to the QSM to manage workloads.
It uses the QAIC_CONTROL MHI channel.

Each NNC request is packaged into a message. Each message is a series of
transactions. A passthrough type transaction can contain elements known as
commands.

QSM requires NNC messages be little endian encoded and the fields be naturally
aligned. Since there are 64-bit elements in some NNC messages, 64-bit alignment
must be maintained.
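
The encoding rules above (little endian fields, natural alignment, 64-bit
alignment overall) can be met with byte-wise stores and an alignment helper.
This is a generic sketch: the function names are hypothetical, and no actual
NNC message or transaction layout is implied, since the text does not give one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round an offset up to the next multiple of 8 bytes (64-bit alignment). */
static inline size_t nnc_align8(size_t off)
{
    return (off + 7u) & ~(size_t)7u;
}

/* Store a 32-bit value in little endian order, independent of host endianness. */
static inline void put_le32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Store a 64-bit value in little endian order, independent of host endianness. */
static inline void put_le64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}
```

Byte-wise stores avoid any dependence on the host's byte order, so the same
code works on little and big endian hosts.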

A message contains a header and then a series of transactions. A message may be
at most 4K in size from QSM to the host. From the host to the QSM, a message
can be at most 64K (maximum size of a single MHI packet), but there is a
continuation feature where message N+1 can be marked as a continuation of
message N. This is used for exceedingly large DMA xfer transactions.

Transaction descriptions
------------------------

passthrough
        Allows userspace to send an opaque payload directly to the QSM.
        This is used for NNC commands. Userspace is responsible for managing
        the QSM message requirements in the payload.

dma_xfer
        DMA transfer. Describes an object that the QSM should DMA into the
        device via address and size tuples.

activate
        Activate a workload onto NSPs. The host must provide memory to be
        used by the DBC.

deactivate
        Deactivate an active workload and return the NSPs to idle.

status
        Query the QSM about its NNC implementation. Returns the NNC version,
        and if CRC is used.

terminate
        Release a user's resources.

dma_xfer_cont
        Continuation of a previous DMA transfer. If a DMA transfer
        cannot be specified in a single message (highly fragmented), this
        transaction can be used to specify more ranges.

validate_partition
        Query the QSM to determine if a partition identifier is valid.

Each message is tagged with a user id, and a partition id. The user id allows
QSM to track resources, and release them when the user goes away (e.g. the
process crashes). A partition id identifies the resource partition that QSM
manages, which this message applies to.

Messages may have CRCs. Messages should have CRCs applied until the QSM
reports via the status transaction that CRCs are not needed. The QSM on the
SA9000P requires CRCs for black channel safing.

Subsystem Restart (SSR)
=======================

SSR is the concept of limiting the impact of an error. An AIC100 device may
have multiple users, each with their own workload running. If the workload of
one user crashes, the fallout of that should be limited to that workload and not
impact other workloads. SSR accomplishes this.

If a particular workload crashes, QSM notifies the host via the QAIC_SSR MHI
channel. This notification identifies the workload by its assigned DBC. A
multi-stage recovery process is then used to clean up both sides, and get the
DBC/NSPs into a working state.

When SSR occurs, any state in the workload is lost. Any inputs that were in
process, or queued but not yet serviced, are lost. The loaded artifacts will
remain in on-card DDR, but the host will need to re-activate the workload if
it desires to recover the workload.

Reliability, Availability, Serviceability (RAS)
===============================================

AIC100 is expected to be deployed in server systems where RAS ideology is
applied. Simply put, RAS is the concept of detecting, classifying, and
reporting errors. While PCIe has AER (Advanced Error Reporting) which factors
into RAS, AER does not allow for a device to report details about internal
errors. Therefore, AIC100 implements a custom RAS mechanism. When a RAS event
occurs, QSM will report the event with appropriate details via the QAIC_STATUS
MHI channel. A sysadmin may determine that a particular device needs
additional service based on RAS reports.

Telemetry
=========

QSM has the ability to report various physical attributes of the device, and in
some cases, to allow the host to control them. Examples include thermal limits,
thermal readings, and power readings. These items are communicated via the
QAIC_TELEMETRY MHI channel.