.. SPDX-License-Identifier: GPL-2.0-only

===============================
 Qualcomm Cloud AI 100 (AIC100)
===============================

Overview
========

The Qualcomm Cloud AI 100/AIC100 family of pro
Snapdragon Ride) are PCIe adapter cards which
the purpose of efficiently running Artificial
inference workloads. They are AI accelerators.

The PCIe interface of AIC100 is capable of PCI
(x8). An individual SoC on a card can have up
Each SoC has an A53 management CPU. On card, t

Multiple AIC100 cards can be hosted in a singl
performance. AIC100 cards are multi-user capab
from multiple users in a concurrent manner.

Hardware Description
====================

An AIC100 card consists of an AIC100 SoC, on-c
peripherals (PMICs, etc.).

An AIC100 card can either be a PCIe HHHL form
or a Dual M.2 card. Both use PCIe to connect t

As a PCIe endpoint/adapter, AIC100 uses the st
DeviceID(DID) combination to uniquely identify
uses the standard Qualcomm VID (0x17cb). All A
AIC100 DID (0xa100).

AIC100 does not implement FLR (function level

AIC100 implements MSI but does not implement M
operate (1 for MHI, 16 for the DMA Bridge). Fa
scenarios where reserving 32 MSIs isn't feasib

As a PCIe device, AIC100 utilizes BARs to prov
hardware. AIC100 provides three 64-bit BARs.

* The first BAR is 4K in size, and exposes the

* The second BAR is 2M in size, and exposes th
  host.

* The third BAR is variable in size based on a
  configuration, but defaults to 64K. This BAR

From the host perspective, AIC100 has several

* MHI (Modem Host Interface)
* QSM (QAIC Service Manager)
* NSPs (Neural Signal Processor)
* DMA Bridge
* DDR

MHI
---

AIC100 has one MHI interface over PCIe. MHI it
Documentation/mhi/index.rst MHI is the mechani
with the QSM. Except for workload data via the
the device occurs via MHI.

QSM
---

QAIC Service Manager. This is an ARM A53 CPU t
firmware of the card and performs on-card mana
communicates with the host via MHI. Each AIC10
these.

NSP
---

Neural Signal Processor. Each AIC100 has up to
the processors that run the workloads on AIC10
(Q6) DSP with HVX and HMX. Each NSP can only r
multiple NSPs may be assigned to a single work
one workload, AIC100 is limited to 16 concurre
"scheduling" is under the purview of the host.
timeslice.

DMA Bridge
----------

The DMA Bridge is a custom DMA engine that manag
in and out of workloads. AIC100 has one of the
channels, each consisting of a set of request/
workload is assigned a single DMA Bridge chann
hardware registers to manage the FIFOs (head/t
memory to store the FIFOs.

DDR
---

AIC100 has on-card DDR. In total, an AIC100 ca
This DDR is used to store workloads, data for
QSM for managing the device. NSPs are granted
the QSM. The host does not have direct access
requests to the QSM to transfer data to the DD

High-level Use Flow
===================

AIC100 is a multi-user, programmable accelerat
neural networks in inferencing mode to efficie
AIC100 is not intended for training neural net
for generic compute workloads.

Assuming a user wants to utilize AIC100, they

1. Compile the workload into an ELF targeting
2. Make requests to the QSM to load the worklo
   device DDR
3. Make a request to the QSM to activate the w
4. Make requests to the DMA Bridge to send inp
   processed, and other requests to receive pr
   workload.
5. Once the workload is no longer required, ma
   deactivate the workload, thus putting the N
6. Once the workload and related artifacts are
   sessions, make requests to the QSM to unloa
   the DDR to be used by other users.

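
The ordering of the six steps above can be sketched as a small state machine.
This is only an illustration of the documented sequence, not part of the qaic
driver or of the QSM interface: the ``wl_state`` names and
``wl_transition_ok()`` are hypothetical. Two transitions are assumptions
rather than statements from this document: that a deactivated workload may be
activated again before it is unloaded, and that a loaded workload may be
unloaded without ever being activated.

```c
#include <stdbool.h>

/* Hypothetical states, one per stage of the use flow; not driver API. */
enum wl_state {
	WL_COMPILED,	/* step 1: workload compiled into an ELF on the host */
	WL_LOADED,	/* step 2: workload and artifacts placed in device DDR */
	WL_ACTIVE,	/* step 3: activated; step 4 data transfers happen here */
	WL_INACTIVE,	/* step 5: deactivated, still resident in device DDR */
	WL_UNLOADED,	/* step 6: DDR released for other users */
};

/* Returns true if moving from 'from' to 'to' respects the documented order. */
bool wl_transition_ok(enum wl_state from, enum wl_state to)
{
	switch (from) {
	case WL_COMPILED:
		return to == WL_LOADED;
	case WL_LOADED:
		/* Unloading without activating is an assumption. */
		return to == WL_ACTIVE || to == WL_UNLOADED;
	case WL_ACTIVE:
		return to == WL_INACTIVE;
	case WL_INACTIVE:
		/* Reactivation before unload is an assumption. */
		return to == WL_ACTIVE || to == WL_UNLOADED;
	case WL_UNLOADED:
		return false;
	}
	return false;
}
```

A caller tracking a workload would check ``wl_transition_ok()`` before issuing
the corresponding QSM or DMA Bridge request, so that, for example, an attempt
to unload a workload that is still active is rejected on the host side.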

Boot Flow
=========

AIC100 uses a flashless boot flow, derived fro

When AIC100 is first powered on, it begins exe
from ROM. PBL enumerates the PCIe link, and in
Interface) component of MHI.

Using BHI, the host points PBL to the location
image. The PBL pulls the image from the host,
execution of SBL.

SBL initializes MHI, and uses MHI to notify th
the SBL stage. SBL performs a number of operat

* SBL initializes the majority of hardware (an
  including DDR.
* SBL offloads the bootlog to the host.
* SBL synchronizes timestamps with the host fo
* SBL uses the Sahara protocol to obtain the r
  host.

Once SBL has obtained and validated the runtim
of reset, and jumps into the QSM.

The QSM uses MHI to notify the host that the d
(AMSS in MHI terms). At this point, the AIC100
ready to process workloads.

Userspace components
====================

Compiler
--------

An open compiler for AIC100 based on upstream
https://github.com/quic/software-kit-for-qualc

Usermode Driver (UMD)
---------------------

An open UMD that interfaces with the qaic kern
https://github.com/quic/software-kit-for-qualc

Sahara loader
-------------

An open implementation of the Sahara protocol
https://github.com/andersson/qdl

MHI Channels
============

AIC100 defines a number of MHI channels for di
of the defined channels, and their uses.

+----------------+---------+----------+--------+
| Channel name   | IDs     | EEs      | Purpos |
+================+=========+==========+========+
| QAIC_LOOPBACK  | 0 & 1   | AMSS     | Any da |
|                |         |          | channe |
+----------------+---------+----------+--------+
| QAIC_SAHARA    | 2 & 3   | SBL      | Used b |
|                |         |          | firmwa |
+----------------+---------+----------+--------+
| QAIC_DIAG      | 4 & 5   | AMSS     | Used t |
|                |         |          | DIAG p |
+----------------+---------+----------+--------+
| QAIC_SSR       | 6 & 7   | AMSS     | Used t |
|                |         |          | restar |
|                |         |          | crashd |
+----------------+---------+----------+--------+
| QAIC_QDSS      | 8 & 9   | AMSS     | Used f |
+----------------+---------+----------+--------+
| QAIC_CONTROL   | 10 & 11 | AMSS     | Used f |
|                |         |          | (NNC)  |
|                |         |          | channe |
|                |         |          | managi |
+----------------+---------+----------+--------+
| QAIC_LOGGING   | 12 & 13 | SBL      | Used b |
|                |         |          | the ho |
+----------------+---------+----------+--------+
| QAIC_STATUS    | 14 & 15 | AMSS     | Used t |
|                |         |          | Access |
|                |         |          | events |
+----------------+---------+----------+--------+
| QAIC_TELEMETRY | 16 & 17 | AMSS     | Used t |
|                |         |          | attrib |
+----------------+---------+----------+--------+
| QAIC_DEBUG     | 18 & 19 | AMSS     | Not us |
+----------------+---------+----------+--------+
| QAIC_TIMESYNC  | 20 & 21 | SBL      | Used t |
|                |         |          | device |
|                |         |          | source |
+----------------+---------+----------+--------+
| QAIC_TIMESYNC  | 22 & 23 | AMSS     | Used t |
| _PERIODIC      |         |          | timest |
|                |         |          | the ho |
+----------------+---------+----------+--------+

DMA Bridge
==========

Overview
--------

The DMA Bridge is one of the main interfaces t
(the other being MHI). As part of activating a
assigns that network a DMA Bridge channel. A w
(DBC for short) is solely for the use of that
other workloads.

Each DBC is a pair of FIFOs that manage data i
FIFO is the request FIFO.
The other FIFO is th

Each DBC contains 4 registers in hardware:

* Request FIFO head pointer (offset 0x0). Read
  latest item in the FIFO the device has consu
* Request FIFO tail pointer (offset 0x4). Read
  increments this register to add new items to
* Response FIFO head pointer (offset 0x8). Rea
  the latest item in the FIFO the host has con
* Response FIFO tail pointer (offset 0xc). Rea
  increments this register to add new items to

The values in each register are indexes in the
FIFO element pointed to by the register: FIFO
size.

DBC registers are exposed to the host via the
4KB of space in the BAR.

The actual FIFOs are backed by host memory. Wh
to activate a network, the host must donate me
Due to internal mapping limitations of the dev
memory must be provided per DBC, which hosts b
consume the beginning of the memory chunk, and
the end of the memory chunk.

Request FIFO
------------

A request FIFO element has the following struc

.. code-block:: c

  /* Reserved members are numbered only to keep the member names unique. */
  struct request_elem {
          u16 req_id;
          u8  seq_id;
          u8  pcie_dma_cmd;
          u32 reserved0;
          u64 pcie_dma_source_addr;
          u64 pcie_dma_dest_addr;
          u32 pcie_dma_len;
          u32 reserved1;
          u64 doorbell_addr;
          u8  doorbell_attr;
          u8  reserved2;
          u16 reserved3;
          u32 doorbell_data;
          u32 sem_cmd0;
          u32 sem_cmd1;
          u32 sem_cmd2;
          u32 sem_cmd3;
  };

Request field descriptions:

req_id
        request ID. A request FIFO element and
        the same request ID refer to the same

seq_id
        sequence ID within a request. Ignored

pcie_dma_cmd
        describes the DMA element of this requ

        * Bit(7) is the force MSI flag, which
          and generates an MSI when this reques
          configures the DMA Bridge to look at
        * Bits(6:5) are reserved.
        * Bit(4) is the completion code flag,
          shall generate a response FIFO eleme
          complete.
        * Bit(3) indicates if this request is
          transfer(1).
        * Bit(2) is reserved.
        * Bits(1:0) indicate the type of trans
          from device(2). Value 3 is illegal.

pcie_dma_source_addr
        source address for a bulk transfer, or

pcie_dma_dest_addr
        destination address for a bulk transfe

pcie_dma_len
        length of the bulk transfer. Note that
        limits transfers to 4G in size.

doorbell_addr
        address of the doorbell to ring when t

doorbell_attr
        doorbell attributes.

        * Bit(7) indicates if a write to a doo
        * Bits(6:2) are reserved.
        * Bits(1:0) contain the encoding of th
          1 is 16-bit, 2 is 8-bit, 3 is reserv
          must be naturally aligned to the spe

doorbell_data
        data to write to the doorbell. Only th
        the doorbell length are valid.

sem_cmdN
        semaphore command.

        * Bit(31) indicates this semaphore com
        * Bit(30) is the to-device DMA fence.
          to-device DMA transfers are complete
        * Bit(29) is the from-device DMA fence
          from-device DMA transfers are comple
        * Bits(28:27) are reserved.
        * Bits(26:24) are the semaphore comman
          specified value. 2 is increment. 3 i
          until the semaphore is equal to the
          until the semaphore is greater or eq
          6 is "P", wait until semaphore is gr
          decrement by 1. 7 is reserved.
        * Bit(23) is reserved.
        * Bit(22) is the semaphore sync. 0 is
          semaphore operation is done after th
          presync, which gates the DMA transfe
          allowed per request.
        * Bit(21) is reserved.
        * Bits(20:16) are the index of the sema
        * Bits(15:12) are reserved.
        * Bits(11:0) are the semaphore value t

Overall, a request is processed in 4 steps:

1. If specified, the presync semaphore conditi
2. If enabled, the DMA transfer occurs
3. If specified, the postsync semaphore condit
4. If enabled, the doorbell is written

By using the semaphores in conjunction with th
the data pipeline can be synchronized such tha
requests of data for the workload to process,
the data into the memory of the workload when
the next input.

Response FIFO
-------------

Once a request is fully processed, a response
specified in pcie_dma_cmd. The structure of a

.. code-block:: c

  struct response_elem {
          u16 req_id;
          u16 completion_code;
  };

req_id
        matches the req_id of the request that

completion_code
        status of this request. 0 is success.

The DMA Bridge will generate an MSI to the host
response FIFO of a DBC. The DMA Bridge hardwar
algorithm, where it will only generate an MSI w
from empty to non-empty (unless force MSI is e
response to this MSI, the host is expected to
take care to handle any race conditions betwee
device inserting elements into the FIFO.

Neural Network Control (NNC) Protocol
=====================================

The NNC protocol is how the host makes request
It uses the QAIC_CONTROL MHI channel.

Each NNC request is packaged into a message. E
transactions. A passthrough type transaction c
commands.

QSM requires NNC messages be little endian enc
aligned. Since there are 64-bit elements in so
must be maintained.

A message contains a header and then a series
at most 4K in size from QSM to the host. From
can be at most 64K (maximum size of a single M
continuation feature where message N+1 can be
message N. This is used for exceedingly large

Transaction descriptions
------------------------

passthrough
        Allows userspace to send an opaque pay
        This is used for NNC commands. Userspa
        the QSM message requirements in the pa

dma_xfer
        DMA transfer. Describes an object that
        device via address and size tuples.

activate
        Activate a workload onto NSPs. The hos
        used by the DBC.

deactivate
        Deactivate an active workload and retu

status
        Query the QSM about its NNC implement
        and if CRC is used.

terminate
        Release a user's resources.

dma_xfer_cont
        Continuation of a previous DMA transfe
        cannot be specified in a single messag
        transaction can be used to specify mor

validate_partition
        Query to QSM to determine if a partiti

Each message is tagged with a user id, and a p
QSM to track resources, and release them when
crashes). A partition id identifies the resour
which this message applies to.

Messages may have CRCs. Messages should have C
reports via the status transaction that CRCs a
SA9000P requires CRCs for black channel safing

Subsystem Restart (SSR)
=======================

SSR is the concept of limiting the impact of a
have multiple users, each with their own workl
one user crashes, the fallout of that should b
impact other workloads. SSR accomplishes this.

If a particular workload crashes, QSM notifies
channel. This notification identifies the work
multi-stage recovery process is then used to c
DBC/NSPs into a working state.

When SSR occurs, any state in the workload is
process, or queued but not yet serviced, are lo
remain in on-card DDR, but the host will need
it desires to recover the workload.

Reliability, Accessibility, Serviceability (RA
==============================================

AIC100 is expected to be deployed in server sy
applied. Simply put, RAS is the concept of det
reporting errors. While PCIe has AER (Advanced
into RAS, AER does not allow for a device to r
errors. Therefore, AIC100 implements a custom
occurs, QSM will report the event with appropr
MHI channel.
A sysadmin may determine that a p
additional service based on RAS reports.

Telemetry
=========

QSM has the ability to report various physical
some cases, to allow the host to control them.
thermal readings, and power readings. These it
QAIC_TELEMETRY MHI channel.