TOMOYO Linux Cross Reference
Linux/Documentation/accel/qaic/aic100.rst

Differences between /Documentation/accel/qaic/aic100.rst (Version linux-6.12-rc7) and /Documentation/accel/qaic/aic100.rst (Version linux-5.7.19)


.. SPDX-License-Identifier: GPL-2.0-only

===============================
 Qualcomm Cloud AI 100 (AIC100)
===============================

Overview
========

The Qualcomm Cloud AI 100/AIC100 family of pro
Snapdragon Ride) are PCIe adapter cards which
the purpose of efficiently running Artificial
inference workloads. They are AI accelerators.

The PCIe interface of AIC100 is capable of PCI
(x8). An individual SoC on a card can have up
Each SoC has an A53 management CPU. On card, t

Multiple AIC100 cards can be hosted in a singl
performance. AIC100 cards are multi-user capab
from multiple users in a concurrent manner.

Hardware Description
====================

An AIC100 card consists of an AIC100 SoC, on-c
peripherals (PMICs, etc.).

An AIC100 card can either be a PCIe HHHL form
or a Dual M.2 card. Both use PCIe to connect t

As a PCIe endpoint/adapter, AIC100 uses the st
DeviceID(DID) combination to uniquely identify
uses the standard Qualcomm VID (0x17cb). All A
AIC100 DID (0xa100).

AIC100 does not implement FLR (function level reset).

AIC100 implements MSI but does not implement M
operate (1 for MHI, 16 for the DMA Bridge). Fa
scenarios where reserving 32 MSIs isn't feasib
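
As a rough illustration of where the figure of 32 comes from: PCI multi-message MSI can only grant power-of-two vector counts (1, 2, 4, 8, 16 or 32), so the 1 + 16 vectors named above round up to 32. The sketch below shows that rounding, plus one hypothetical way to share vectors when fewer are granted. `msi_vectors_to_reserve()` and `dbc_vector()` are illustrative names, and the fallback policy is an assumption, not the qaic driver's actual behavior.

```c
#include <stdint.h>

/* Vector counts named in the text: one for MHI, one per DMA Bridge channel. */
#define AIC100_MHI_VECTORS	1u
#define AIC100_DBC_VECTORS	16u

/*
 * Multi-message MSI grants 1, 2, 4, 8, 16 or 32 vectors. Round a request
 * up to the next count the capability can encode; 0 means MSI cannot
 * satisfy the request at all.
 */
static uint32_t msi_vectors_to_reserve(uint32_t wanted)
{
	uint32_t n = 1;

	if (wanted == 0 || wanted > 32)
		return 0;
	while (n < wanted)
		n <<= 1;
	return n;
}

/*
 * Hypothetical fallback policy: vector 0 serves MHI, and DBCs share the
 * remaining vectors round-robin (or vector 0 if only one was granted).
 */
static uint32_t dbc_vector(uint32_t dbc, uint32_t granted)
{
	if (granted <= 1)
		return 0;
	return 1 + dbc % (granted - 1);
}
```

With 32 vectors every DBC gets its own interrupt; with a single vector everything lands on vector 0 and the handler has to check every possible source.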

As a PCIe device, AIC100 utilizes BARs to prov
hardware. AIC100 provides three 64-bit BARs.

* The first BAR is 4K in size, and exposes the

* The second BAR is 2M in size, and exposes th
  host.

* The third BAR is variable in size based on a
  configuration, but defaults to 64K. This BAR

From the host perspective, AIC100 has several

* MHI (Modem Host Interface)
* QSM (QAIC Service Manager)
* NSPs (Neural Signal Processors)
* DMA Bridge
* DDR

MHI
---

AIC100 has one MHI interface over PCIe. MHI it
Documentation/mhi/index.rst. MHI is the mechani
with the QSM. Except for workload data via the
the device occurs via MHI.

QSM
---

QAIC Service Manager. This is an ARM A53 CPU t
firmware of the card and performs on-card mana
communicates with the host via MHI. Each AIC10
these.

NSP
---

Neural Signal Processor. Each AIC100 has up to
the processors that run the workloads on AIC10
(Q6) DSP with HVX and HMX. Each NSP can only r
multiple NSPs may be assigned to a single work
one workload, AIC100 is limited to 16 concurre
"scheduling" is under the purview of the host.
timeslice.
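
Because each NSP runs at most one workload while one workload may span several NSPs, host-side bookkeeping of that constraint can be as simple as a bitmask of busy NSPs. This is a hypothetical sketch (the type and function names are invented, and it assumes 16 NSPs per device, matching the limit of 16 concurrent workloads); in the real flow the host asks the QSM to activate a workload onto NSPs.

```c
#include <stdint.h>

#define AIC100_MAX_NSPS	16	/* assumed from the 16-workload limit */

/* Bit n set means NSP n is busy with some workload (hypothetical bookkeeping). */
typedef uint16_t nsp_mask_t;

/*
 * Claim `count` idle NSPs for one workload. Returns the mask of NSPs
 * claimed, or 0 if not enough are idle (no partial allocation).
 */
static nsp_mask_t nsp_claim(nsp_mask_t *busy, unsigned int count)
{
	nsp_mask_t claimed = 0;
	unsigned int n;

	for (n = 0; n < AIC100_MAX_NSPS && count > 0; n++) {
		nsp_mask_t bit = (nsp_mask_t)(1u << n);

		if (!(*busy & bit)) {
			claimed |= bit;
			count--;
		}
	}
	if (count > 0)
		return 0;	/* not enough idle NSPs; *busy is left untouched */
	*busy |= claimed;
	return claimed;
}

/* Return a workload's NSPs to the idle pool, e.g. after deactivation. */
static void nsp_release(nsp_mask_t *busy, nsp_mask_t claimed)
{
	*busy &= (nsp_mask_t)~claimed;
}
```

The all-or-nothing claim mirrors the rule that an NSP is never shared: a workload either gets every NSP it asked for or none.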

DMA Bridge
----------

The DMA Bridge is a custom DMA engine that manag
in and out of workloads. AIC100 has one of the
channels, each consisting of a set of request/
workload is assigned a single DMA Bridge chann
hardware registers to manage the FIFOs (head/t
memory to store the FIFOs.

DDR
---

AIC100 has on-card DDR. In total, an AIC100 ca
This DDR is used to store workloads, data for
QSM for managing the device. NSPs are granted
the QSM. The host does not have direct access
requests to the QSM to transfer data to the DD

High-level Use Flow
===================

AIC100 is a multi-user, programmable accelerat
neural networks in inferencing mode to efficie
AIC100 is not intended for training neural net
for generic compute workloads.

Assuming a user wants to utilize AIC100, they

1. Compile the workload into an ELF targeting
2. Make requests to the QSM to load the worklo
   device DDR
3. Make a request to the QSM to activate the w
4. Make requests to the DMA Bridge to send inp
   processed, and other requests to receive pr
   workload.
5. Once the workload is no longer required, ma
   deactivate the workload, thus putting the N
6. Once the workload and related artifacts are
   sessions, make requests to the QSM to unloa
   the DDR to be used by other users.


Boot Flow
=========

AIC100 uses a flashless boot flow, derived fro

When AIC100 is first powered on, it begins exe
from ROM. PBL enumerates the PCIe link, and in
Interface) component of MHI.

Using BHI, the host points PBL to the location
image. The PBL pulls the image from the host,
execution of SBL.

SBL initializes MHI, and uses MHI to notify th
the SBL stage. SBL performs a number of operat

* SBL initializes the majority of hardware (an
  including DDR.
* SBL offloads the bootlog to the host.
* SBL synchronizes timestamps with the host fo
* SBL uses the Sahara protocol to obtain the r
  host.

Once SBL has obtained and validated the runtim
of reset, and jumps into the QSM.

The QSM uses MHI to notify the host that the d
(AMSS in MHI terms). At this point, the AIC100
ready to process workloads.
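
The boot sequence above is strictly ordered, which a host can check as MHI reports each stage. The following is a minimal sketch with hypothetical stage names (they are not the MHI core's identifiers); it deliberately ignores resets and error recovery, which would send the device back to an earlier stage.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical host-side names for the stages described above. */
enum aic_stage { STAGE_RESET, STAGE_PBL, STAGE_SBL, STAGE_AMSS };

/*
 * Normal boot only moves forward: reset -> PBL -> SBL -> QSM (AMSS).
 * Resets and error recovery, which return to an earlier stage, are
 * not modeled here.
 */
static bool stage_transition_ok(enum aic_stage from, enum aic_stage to)
{
	return (int)to == (int)from + 1;
}

/* The host's job while the device sits in each stage. */
static const char *host_duty(enum aic_stage s)
{
	switch (s) {
	case STAGE_PBL:
		return "point PBL at the SBL image via BHI";
	case STAGE_SBL:
		return "serve runtime firmware over Sahara, accept the bootlog";
	case STAGE_AMSS:
		return "submit workloads";
	default:
		return NULL;
	}
}
```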

Userspace components
====================

Compiler
--------

An open compiler for AIC100 based on upstream
https://github.com/quic/software-kit-for-qualc

Usermode Driver (UMD)
---------------------

An open UMD that interfaces with the qaic kern
https://github.com/quic/software-kit-for-qualc

Sahara loader
-------------

An open implementation of the Sahara protocol
https://github.com/andersson/qdl

MHI Channels
============

AIC100 defines a number of MHI channels for di
of the defined channels, and their uses.

+----------------+---------+----------+-------
| Channel name   | IDs     | EEs      | Purpos
+================+=========+==========+=======
| QAIC_LOOPBACK  | 0 & 1   | AMSS     | Any da
|                |         |          | channe
+----------------+---------+----------+-------
| QAIC_SAHARA    | 2 & 3   | SBL      | Used b
|                |         |          | firmwa
+----------------+---------+----------+-------
| QAIC_DIAG      | 4 & 5   | AMSS     | Used t
|                |         |          | DIAG p
+----------------+---------+----------+-------
| QAIC_SSR       | 6 & 7   | AMSS     | Used t
|                |         |          | restar
|                |         |          | crashd
+----------------+---------+----------+-------
| QAIC_QDSS      | 8 & 9   | AMSS     | Used f
+----------------+---------+----------+-------
| QAIC_CONTROL   | 10 & 11 | AMSS     | Used f
|                |         |          | (NNC)
|                |         |          | channe
|                |         |          | managi
+----------------+---------+----------+-------
| QAIC_LOGGING   | 12 & 13 | SBL      | Used b
|                |         |          | the ho
+----------------+---------+----------+-------
| QAIC_STATUS    | 14 & 15 | AMSS     | Used t
|                |         |          | Access
|                |         |          | events
+----------------+---------+----------+-------
| QAIC_TELEMETRY | 16 & 17 | AMSS     | Used t
|                |         |          | attrib
+----------------+---------+----------+-------
| QAIC_DEBUG     | 18 & 19 | AMSS     | Not us
+----------------+---------+----------+-------
| QAIC_TIMESYNC  | 20 & 21 | SBL      | Used t
|                |         |          | device
|                |         |          | source
+----------------+---------+----------+-------
| QAIC_TIMESYNC  | 22 & 23 | AMSS     | Used t
| _PERIODIC      |         |          | timest
|                |         |          | the ho
+----------------+---------+----------+-------
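
Every entry in the table occupies a consecutive pair of IDs starting at 0, so a channel ID maps to its entry by halving it. The sketch below encodes the table's names and execution environments that way. The struct and function names are hypothetical, and the sketch says nothing about which ID of a pair carries which direction, since the table lists each pair without assigning directions.

```c
#include <stddef.h>

/* Entry i of the table covers channel IDs 2*i and 2*i + 1. */
struct qaic_chan_pair {
	const char *name;
	const char *ee;		/* execution environment the channel runs in */
};

static const struct qaic_chan_pair qaic_chans[] = {
	{ "QAIC_LOOPBACK",          "AMSS" },	/*  0 &  1 */
	{ "QAIC_SAHARA",            "SBL"  },	/*  2 &  3 */
	{ "QAIC_DIAG",              "AMSS" },	/*  4 &  5 */
	{ "QAIC_SSR",               "AMSS" },	/*  6 &  7 */
	{ "QAIC_QDSS",              "AMSS" },	/*  8 &  9 */
	{ "QAIC_CONTROL",           "AMSS" },	/* 10 & 11 */
	{ "QAIC_LOGGING",           "SBL"  },	/* 12 & 13 */
	{ "QAIC_STATUS",            "AMSS" },	/* 14 & 15 */
	{ "QAIC_TELEMETRY",         "AMSS" },	/* 16 & 17 */
	{ "QAIC_DEBUG",             "AMSS" },	/* 18 & 19 */
	{ "QAIC_TIMESYNC",          "SBL"  },	/* 20 & 21 */
	{ "QAIC_TIMESYNC_PERIODIC", "AMSS" },	/* 22 & 23 */
};

/* Map an MHI channel ID to its table entry, or NULL if the ID is undefined. */
static const struct qaic_chan_pair *qaic_chan_lookup(unsigned int id)
{
	size_t idx = id / 2;

	if (idx >= sizeof(qaic_chans) / sizeof(qaic_chans[0]))
		return NULL;
	return &qaic_chans[idx];
}
```

Keeping the EE alongside the name makes it easy to reject traffic on, say, QAIC_SAHARA once the device has left SBL.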

DMA Bridge
==========

Overview
--------

The DMA Bridge is one of the main interfaces t
(the other being MHI). As part of activating a
assigns that network a DMA Bridge channel. A w
(DBC for short) is solely for the use of that
other workloads.

Each DBC is a pair of FIFOs that manage data i
FIFO is the request FIFO. The other FIFO is th

Each DBC contains 4 registers in hardware:

* Request FIFO head pointer (offset 0x0). Read
  latest item in the FIFO the device has consu
* Request FIFO tail pointer (offset 0x4). Read
  increments this register to add new items to
* Response FIFO head pointer (offset 0x8). Rea
  the latest item in the FIFO the host has con
* Response FIFO tail pointer (offset 0xc). Rea
  increments this register to add new items to

The values in each register are indexes in the
FIFO element pointed to by the register: FIFO
size.

DBC registers are exposed to the host via the
4KB of space in the BAR.
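
The register offsets and index semantics above translate into two small address computations. Beyond what the text states, this sketch assumes that the 4KB DBC windows are packed back to back from offset 0 of the BAR, and that an element's address is the FIFO base plus the index times the element size. The macro and function names are hypothetical.

```c
#include <stdint.h>

/* Register offsets within one DBC's register window, from the list above. */
#define DBC_REQ_HEAD	0x0u	/* latest request the device has consumed */
#define DBC_REQ_TAIL	0x4u	/* advanced to add new requests */
#define DBC_RSP_HEAD	0x8u	/* latest response the host has consumed */
#define DBC_RSP_TAIL	0xcu	/* advanced to add new responses */

/* Assumption: DBC n's window starts at n * 4KB from the start of the BAR. */
#define DBC_WINDOW_SIZE	0x1000u

static uint64_t dbc_reg_offset(unsigned int dbc, unsigned int reg)
{
	return (uint64_t)dbc * DBC_WINDOW_SIZE + reg;
}

/* Registers hold FIFO indexes; turn an index into the element's address. */
static uint64_t fifo_elem_addr(uint64_t fifo_base, uint32_t index,
			       uint32_t elem_size)
{
	return fifo_base + (uint64_t)index * elem_size;
}
```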

The actual FIFOs are backed by host memory. Wh
to activate a network, the host must donate me
Due to internal mapping limitations of the dev
memory must be provided per DBC, which hosts b
consume the beginning of the memory chunk, and
the end of the memory chunk.
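
Placing both FIFOs in one donated chunk amounts to a small layout calculation. The sketch assumes that the request FIFO starts at the beginning of the chunk and the response FIFO ends exactly at its end, and uses element sizes of 64 and 4 bytes, obtained by adding up the field widths of `struct request_elem` and `struct response_elem` in the sections that follow. The helper and struct names are hypothetical.

```c
#include <stdint.h>

/* Element sizes implied by the request and response element layouts. */
#define REQ_ELEM_SIZE	64u
#define RSP_ELEM_SIZE	4u

struct dbc_chunk_layout {
	uint64_t req_off;	/* request FIFO offset from chunk start */
	uint64_t rsp_off;	/* response FIFO offset from chunk start */
	uint64_t size;		/* total bytes the host donates */
};

/*
 * Request FIFO at the start of the chunk, response FIFO flush with its
 * end; any slack sits between them. Returns 0 on success, -1 if the two
 * FIFOs do not fit in `chunk_size` bytes.
 */
static int dbc_layout_chunk(uint32_t nreq, uint32_t nrsp, uint64_t chunk_size,
			    struct dbc_chunk_layout *out)
{
	uint64_t req_bytes = (uint64_t)nreq * REQ_ELEM_SIZE;
	uint64_t rsp_bytes = (uint64_t)nrsp * RSP_ELEM_SIZE;

	if (req_bytes + rsp_bytes > chunk_size)
		return -1;
	out->req_off = 0;
	out->rsp_off = chunk_size - rsp_bytes;
	out->size = chunk_size;
	return 0;
}
```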

Request FIFO
------------

A request FIFO element has the following struc

.. code-block:: c

  struct request_elem {
        u16 req_id;
        u8  seq_id;
        u8  pcie_dma_cmd;
        u32 reserved0;
        u64 pcie_dma_source_addr;
        u64 pcie_dma_dest_addr;
        u32 pcie_dma_len;
        u32 reserved1;
        u64 doorbell_addr;
        u8  doorbell_attr;
        u8  reserved2;
        u16 reserved3;
        u32 doorbell_data;
        u32 sem_cmd0;
        u32 sem_cmd1;
        u32 sem_cmd2;
        u32 sem_cmd3;
  };
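
A userspace mirror of the element layouts, using `<stdint.h>` types, lets the compiler confirm that every field is naturally aligned and that a request element is 64 bytes and a response element 4 bytes. Folding `sem_cmd0` to `sem_cmd3` into an array is a presentation choice made for this sketch, not part of the hardware description.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of struct request_elem. */
struct request_elem {
	uint16_t req_id;
	uint8_t  seq_id;
	uint8_t  pcie_dma_cmd;
	uint32_t reserved0;
	uint64_t pcie_dma_source_addr;
	uint64_t pcie_dma_dest_addr;
	uint32_t pcie_dma_len;
	uint32_t reserved1;
	uint64_t doorbell_addr;
	uint8_t  doorbell_attr;
	uint8_t  reserved2;
	uint16_t reserved3;
	uint32_t doorbell_data;
	uint32_t sem_cmd[4];	/* sem_cmd0 .. sem_cmd3 */
};

/* Userspace mirror of struct response_elem. */
struct response_elem {
	uint16_t req_id;
	uint16_t completion_code;
};

/* Every field is naturally aligned, so the compiler inserts no padding. */
_Static_assert(offsetof(struct request_elem, pcie_dma_source_addr) == 8, "src");
_Static_assert(offsetof(struct request_elem, doorbell_addr) == 32, "doorbell");
_Static_assert(offsetof(struct request_elem, sem_cmd) == 48, "sem_cmd");
_Static_assert(sizeof(struct request_elem) == 64, "request element size");
_Static_assert(sizeof(struct response_elem) == 4, "response element size");
```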

Request field descriptions:

req_id
        request ID. A request FIFO element and
        the same request ID refer to the same

seq_id
        sequence ID within a request. Ignored

pcie_dma_cmd
        describes the DMA element of this requ

        * Bit(7) is the force MSI flag, which
          and generates an MSI when this reques
          configures the DMA Bridge to look at
        * Bits(6:5) are reserved.
        * Bit(4) is the completion code flag,
          shall generate a response FIFO eleme
          complete.
        * Bit(3) indicates if this request is
          transfer(1).
        * Bit(2) is reserved.
        * Bits(1:0) indicate the type of trans
          from device(2). Value 3 is illegal.
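
The bit layout of `pcie_dma_cmd` lends itself to a host-side sanity check before a request is queued. The sketch only encodes the bits whose meaning is spelled out above (force MSI, completion code, the reserved bits, the from-device value and the illegal value 3 of Bits(1:0)); Bit(3) passes through unchecked. The macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define DMA_CMD_FORCE_MSI	0x80u	/* Bit(7) */
#define DMA_CMD_RESERVED	0x64u	/* Bits(6:5) and Bit(2) */
#define DMA_CMD_COMPLETION	0x10u	/* Bit(4): emit a response FIFO element */
#define DMA_CMD_XFER_MASK	0x03u	/* Bits(1:0): transfer type */
#define DMA_CMD_XFER_FROM_DEV	0x02u
#define DMA_CMD_XFER_ILLEGAL	0x03u

/* Reject reserved bits and the illegal transfer type; Bit(3) is not checked. */
static bool dma_cmd_valid(uint8_t cmd)
{
	if (cmd & DMA_CMD_RESERVED)
		return false;
	return (cmd & DMA_CMD_XFER_MASK) != DMA_CMD_XFER_ILLEGAL;
}

/* True if completing the request places an element in the response FIFO. */
static bool dma_cmd_wants_response(uint8_t cmd)
{
	return cmd & DMA_CMD_COMPLETION;
}
```

A host that relies on response elements to track completion would want `dma_cmd_wants_response()` to hold for every request it queues.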

pcie_dma_source_addr
        source address for a bulk transfer, or

pcie_dma_dest_addr
        destination address for a bulk transfe

pcie_dma_len
        length of the bulk transfer. Note that
        limits transfers to 4G in size.

doorbell_addr
        address of the doorbell to ring when t

doorbell_attr
        doorbell attributes.

        * Bit(7) indicates if a write to a doo
        * Bits(6:2) are reserved.
        * Bits(1:0) contain the encoding of th
          1 is 16-bit, 2 is 8-bit, 3 is reserv
          must be naturally aligned to the spe

doorbell_data
        data to write to the doorbell. Only th
        the doorbell length are valid.
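
The doorbell rules above come down to a width lookup, an alignment test and a data mask. This sketch assumes that length encoding 0 means 32-bit (encodings 1 and 2 being 16-bit and 8-bit) and that the valid `doorbell_data` bits are the low-order ones; the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define DB_ATTR_WRITE	0x80u	/* Bit(7): perform the doorbell write */
#define DB_ATTR_LEN	0x03u	/* Bits(1:0): doorbell length encoding */

/*
 * Doorbell width in bytes for a length encoding, 0 for the reserved
 * value 3. Encoding 0 is assumed to mean 32-bit.
 */
static unsigned int db_width(uint8_t attr)
{
	static const unsigned int width[4] = { 4, 2, 1, 0 };

	return width[attr & DB_ATTR_LEN];
}

/* The doorbell address must be naturally aligned to the doorbell width. */
static bool db_addr_ok(uint64_t addr, uint8_t attr)
{
	unsigned int w = db_width(attr);

	return w != 0 && (addr & (w - 1)) == 0;
}

/* Keep only the (assumed low-order) data bits that fit the doorbell width. */
static uint32_t db_data_mask(uint32_t data, uint8_t attr)
{
	unsigned int w = db_width(attr);

	return w >= 4 ? data : data & ((1u << (8 * w)) - 1);
}
```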

sem_cmdN
        semaphore command.

        * Bit(31) indicates this semaphore com
        * Bit(30) is the to-device DMA fence.
          to-device DMA transfers are complete
        * Bit(29) is the from-device DMA fence
          from-device DMA transfers are comple
        * Bits(28:27) are reserved.
        * Bits(26:24) are the semaphore comman
          specified value. 2 is increment. 3 i
          until the semaphore is equal to the
          until the semaphore is greater or eq
          6 is "P", wait until semaphore is gr
          decrement by 1. 7 is reserved.
        * Bit(23) is reserved.
        * Bit(22) is the semaphore sync. 0 is
          semaphore operation is done after th
          presync, which gates the DMA transfe
          allowed per request.
        * Bit(21) is reserved.
        * Bits(20:16) are the index of the sema
        * Bits(15:12) are reserved.
        * Bits(11:0) are the semaphore value t
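
Packing a `sem_cmdN` word from the fields listed above is a matter of shifts and masks, and the one-presync-per-request rule can be checked across the four words of a request. This sketch assumes that Bit(31) set means the command is enabled and that Bit(22) set means presync (cleared meaning postsync). Only the opcodes for increment (2), "P" (6) and the reserved value (7) are named, and every helper name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEM_ENABLE		(1u << 31)	/* assumed: command enabled */
#define SEM_FENCE_TO_DEV	(1u << 30)
#define SEM_FENCE_FROM_DEV	(1u << 29)
#define SEM_PRESYNC		(1u << 22)	/* assumed: set = presync */

/* Opcodes for Bits(26:24) that the text names explicitly. */
#define SEM_OP_INC		2u
#define SEM_OP_P		6u	/* wait until > 0, then decrement by 1 */
#define SEM_OP_RSVD		7u

/*
 * Pack one sem_cmdN word. Returns 0 (enable bit clear) if an argument
 * does not fit its field or the reserved opcode is requested.
 */
static uint32_t sem_cmd(unsigned int op, unsigned int index, unsigned int value,
			bool presync, uint32_t fences)
{
	if (op >= SEM_OP_RSVD || index > 0x1f || value > 0xfff)
		return 0;
	if (fences & ~(SEM_FENCE_TO_DEV | SEM_FENCE_FROM_DEV))
		return 0;
	return SEM_ENABLE | fences | (presync ? SEM_PRESYNC : 0) |
	       ((uint32_t)op << 24) | ((uint32_t)index << 16) | value;
}

static unsigned int sem_op(uint32_t cmd)    { return (cmd >> 24) & 0x7; }
static unsigned int sem_index(uint32_t cmd) { return (cmd >> 16) & 0x1f; }
static unsigned int sem_value(uint32_t cmd) { return cmd & 0xfff; }

/* At most one enabled presync command per request (sem_cmd0..sem_cmd3). */
static bool sem_cmds_ok(const uint32_t cmds[4])
{
	int presyncs = 0;

	for (int i = 0; i < 4; i++)
		if ((cmds[i] & SEM_ENABLE) && (cmds[i] & SEM_PRESYNC))
			presyncs++;
	return presyncs <= 1;
}
```

For example, `sem_cmd(SEM_OP_P, 3, 0, true, 0)` describes a presync "P" on semaphore 3, which holds the DMA transfer until the workload has raised that semaphore above zero.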

Overall, a request is processed in 4 steps:

1. If specified, the presync semaphore conditi
2. If enabled, the DMA transfer occurs
3. If specified, the postsync semaphore condit
4. If enabled, the doorbell is written

By using the semaphores in conjunction with th
the data pipeline can be synchronized such tha
requests of data for the workload to process,
the data into the memory of the workload when
the next input.

Response FIFO
-------------

Once a request is fully processed, a response
specified in pcie_dma_cmd. The structure of a

.. code-block:: c

  struct response_elem {
        u16 req_id;
        u16 completion_code;
  };

req_id
        matches the req_id of the request that

completion_code
        status of this request. 0 is success.

The DMA Bridge will generate an MSI to the host
response FIFO of a DBC. The DMA Bridge hardwar
algorithm, where it will only generate an MSI w
from empty to non-empty (unless force MSI is e
response to this MSI, the host is expected to
take care to handle any race conditions betwee
device inserting elements into the FIFO.
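
The race described above exists because, absent force MSI, no new MSI is raised when the device adds an element while the FIFO still looks non-empty to it. A common way to close the window is to publish the new head and then re-read the tail before returning. The sketch models the two response-side registers as plain memory; a real driver would use MMIO accessors with appropriate barriers, and the struct, typedef and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct response_elem {
	uint16_t req_id;
	uint16_t completion_code;
};

/*
 * Model of one DBC's response side: head and tail stand in for the
 * registers at offsets 0x8 and 0xc (plain volatile memory here, not MMIO).
 */
struct rsp_fifo {
	struct response_elem *elems;	/* host memory backing the FIFO */
	uint32_t nelem;			/* FIFO length in elements */
	volatile uint32_t head;		/* written by the host */
	volatile uint32_t tail;		/* written by the device */
};

typedef void (*rsp_handler)(const struct response_elem *rsp, void *ctx);

/*
 * Drain the response FIFO, e.g. from the MSI handler. After publishing the
 * new head, re-read the tail: an element added in the meantime raised no
 * MSI, because the FIFO never looked empty to the device.
 * Returns the number of elements consumed.
 */
static size_t drain_responses(struct rsp_fifo *f, rsp_handler fn, void *ctx)
{
	uint32_t head = f->head;
	uint32_t tail;
	size_t done = 0;

	while ((tail = f->tail) != head) {
		while (head != tail) {
			fn(&f->elems[head], ctx);
			head = (head + 1) % f->nelem;
			done++;
		}
		f->head = head;	/* publish progress, then look again */
	}
	return done;
}

/* Example handler: count completions with a non-zero (error) code. */
static void count_errors(const struct response_elem *rsp, void *ctx)
{
	if (rsp->completion_code != 0)
		++*(size_t *)ctx;
}
```

The outer loop is what handles the race: a response that arrives between the last tail read and the head update is picked up on the next pass instead of waiting for an MSI that may never arrive.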

Neural Network Control (NNC) Protocol
=====================================

The NNC protocol is how the host makes request
It uses the QAIC_CONTROL MHI channel.

Each NNC request is packaged into a message. E
transactions. A passthrough type transaction c
commands.

QSM requires NNC messages be little endian enc
aligned. Since there are 64-bit elements in so
must be maintained.

A message contains a header and then a series
at most 4K in size from QSM to the host. From
can be at most 64K (maximum size of a single M
continuation feature where message N+1 can be
message N. This is used for exceedingly large
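
The encoding rules above (little endian, natural alignment, 64-bit alignment where 64-bit fields appear, and the 64K host-to-QSM limit) can be enforced mechanically while building a message. The builder below is a hypothetical sketch: it does not reproduce the real NNC header or transaction layout, it measures alignment from the start of the message, and the continuation helper assumes a fixed per-message header overhead.

```c
#include <stddef.h>
#include <stdint.h>

#define NNC_MAX_TO_QSM		(64u * 1024)	/* host -> QSM limit from the text */
#define NNC_MAX_FROM_QSM	(4u * 1024)	/* QSM -> host limit from the text */

/* Hypothetical message buffer; offsets are measured from the message start. */
struct nnc_buf {
	uint8_t data[NNC_MAX_TO_QSM];
	size_t len;
};

/*
 * Append v as an n-byte little-endian field (n = 1, 2, 4 or 8), refusing
 * to place it at an offset that is not naturally aligned for its size.
 */
static int put_le(struct nnc_buf *b, uint64_t v, size_t n)
{
	if (n == 0 || n > 8 || (n & (n - 1)) != 0)
		return -1;
	if (b->len % n != 0 || b->len + n > sizeof(b->data))
		return -1;
	for (size_t i = 0; i < n; i++)
		b->data[b->len++] = (uint8_t)(v >> (8 * i));
	return 0;
}

/* Zero-pad so the next transaction starts on a 64-bit boundary. */
static int pad_to_8(struct nnc_buf *b)
{
	while (b->len % 8 != 0) {
		if (b->len >= sizeof(b->data))
			return -1;
		b->data[b->len++] = 0;
	}
	return 0;
}

/*
 * Messages needed to carry `payload` bytes when each message spends `hdr`
 * bytes on headers and message N+1 continues message N. Returns 0 if
 * there is nothing to send or `hdr` leaves no room.
 */
static size_t nnc_msgs_needed(size_t payload, size_t hdr)
{
	size_t room;

	if (hdr >= NNC_MAX_TO_QSM)
		return 0;
	room = NNC_MAX_TO_QSM - hdr;
	return (payload + room - 1) / room;
}
```

Writing bytes explicitly keeps the output little endian regardless of the host CPU's byte order.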

Transaction descriptions
------------------------

passthrough
        Allows userspace to send an opaque pay
        This is used for NNC commands. Userspa
        the QSM message requirements in the pa

dma_xfer
        DMA transfer. Describes an object that
        device via address and size tuples.

activate
        Activate a workload onto NSPs. The hos
        used by the DBC.

deactivate
        Deactivate an active workload and retu

status
        Query the QSM about its NNC implement
        and if CRC is used.

terminate
        Release a user's resources.

dma_xfer_cont
        Continuation of a previous DMA transfe
        cannot be specified in a single messag
        transaction can be used to specify mor

validate_partition
        Query the QSM to determine if a partiti

Each message is tagged with a user id and a p
QSM to track resources, and release them when
crashes). A partition id identifies the resour
which this message applies to.

Messages may have CRCs. Messages should have C
reports via the status transaction that CRCs a
SA9000P requires CRCs for black channel safing

Subsystem Restart (SSR)
=======================

SSR is the concept of limiting the impact of a
have multiple users, each with their own workl
one user crashes, the fallout of that should b
impact other workloads. SSR accomplishes this.

If a particular workload crashes, QSM notifies
channel. This notification identifies the work
multi-stage recovery process is then used to c
DBC/NSPs into a working state.

When SSR occurs, any state in the workload is
process, or queued but not yet serviced, are lo
remain in on-card DDR, but the host will need
it desires to recover the workload.

Reliability, Accessibility, Serviceability (RAS)
================================================

AIC100 is expected to be deployed in server sy
applied. Simply put, RAS is the concept of det
reporting errors. While PCIe has AER (Advanced
into RAS, AER does not allow for a device to r
errors. Therefore, AIC100 implements a custom
occurs, QSM will report the event with appropr
MHI channel. A sysadmin may determine that a p
additional service based on RAS reports.

Telemetry
=========

QSM has the ability to report various physical
some cases, to allow the host to control them.
thermal readings, and power readings. These it
QAIC_TELEMETRY MHI channel.
                                                      
