.. SPDX-License-Identifier: GPL-2.0

===========================================
Shared Virtual Addressing (SVA) with ENQCMD
===========================================

Background
==========

Shared Virtual Addressing (SVA) allows the processor and device to use the
same virtual addresses, avoiding the need for software to translate virtual
addresses to physical addresses. SVA is what PCIe calls Shared Virtual
Memory (SVM).

In addition to the convenience of letting the device use application virtual
addresses, SVA doesn't require pinning pages for DMA.
PCIe Address Translation Services (ATS) along with Page Request Interface
(PRI) allow devices to function much the same way as the CPU handling
application page-faults. For more information please refer to the PCIe
specification Chapter 10: ATS Specification.

Use of SVA requires IOMMU support in the platform. IOMMU is also
required to support the PCIe features ATS and PRI. ATS allows devices
to cache translations for virtual addresses. The IOMMU driver uses the
mmu_notifier() support to keep the device TLB cache and the CPU cache in
sync. When an ATS lookup fails for a virtual address, the device should
use the PRI in order to request the virtual address to be paged into the
CPU page tables. The device must use ATS again in order to fetch the
translation before use.

Shared Hardware Workqueues
==========================

Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits
the use of Shared Work Queues (SWQ) by both applications and Virtual
Machines (VMs). This allows better hardware utilization vs. hard
partitioning resources that could result in underutilization. In order to
allow the hardware to distinguish the context for which work is being
executed in the hardware by SWQ interface, SIOV uses Process Address Space
ID (PASID), which is a 20-bit number defined by the PCIe SIG.
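
As a small illustration of the 20-bit width mentioned above, the following
sketch checks whether a value fits in a PASID field. The names PASID_BITS,
PASID_MAX and pasid_fits() are hypothetical and chosen for this example only;
they are not kernel or PCIe definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* PASID width as stated in the text: 20 bits. */
#define PASID_BITS 20u

/* Hypothetical name: largest value a 20-bit PASID field can hold. */
#define PASID_MAX ((UINT32_C(1) << PASID_BITS) - 1u)

/* Hypothetical helper: true if 'pasid' fits in the 20-bit field. */
static inline bool pasid_fits(uint32_t pasid)
{
    return pasid <= PASID_MAX;
}
```

A 20-bit field allows 1,048,576 distinct values, so a platform could in
principle distinguish about one million address spaces per device.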

The PASID value is encoded in all transactions from the device. This allows
the IOMMU to track I/O on a per-PASID granularity in addition to using the
PCIe Requester ID (RID), which is the Bus/Device/Function.


ENQCMD
======

ENQCMD is a new instruction on Intel platforms that atomically submits a
work descriptor to a device. The descriptor includes the operation to be
performed, virtual addresses of all parameters, virtual address of a completion
record, and the PASID (process address space ID) of the current process.

ENQCMD works with non-posted semantics and returns a status indicating
whether the command was accepted by hardware. This lets the submitter know
whether the submission needs to be retried, or whether other device-specific
mechanisms should be used to implement fairness or ensure forward progress.

ENQCMD is the glue that ensures applications can directly submit commands
to the hardware and also permits hardware to be aware of application context
to perform I/O operations via use of PASID.

Process Address Space Tagging
=============================

A new thread-scoped MSR (IA32_PASID) provides the connection between
user processes and the rest of the hardware. When an application first
accesses an SVA-capable device, this MSR is initialized with a newly
allocated PASID. The driver for the device calls an IOMMU-specific API
that sets up the routing for DMA and page-requests.

For example, the Intel Data Streaming Accelerator (DSA) uses
iommu_sva_bind_device(), which will do the following:

- Allocate the PASID, and program the process page-table (%cr3 register) in the
  PASID context entries.
- Register for mmu_notifier() to track any page-table invalidations to keep
  the device TLB in sync. For example, when a page-table entry is invalidated,
  the IOMMU propagates the invalidation to the device TLB.
  This will force any
  future access by the device to this virtual address to participate in
  ATS. If the IOMMU responds that a page is not
  present, the device would request the page to be paged in via the PCIe PRI
  protocol before performing I/O.

This MSR is managed with the XSAVE feature set as "supervisor state" to
ensure the MSR is updated during context switch.

PASID Management
================

The kernel must allocate a PASID on behalf of each process which will use
ENQCMD and program it into the new MSR to communicate the process identity to
platform hardware. ENQCMD uses the PASID stored in this MSR to tag requests
from this process. When a user submits a work descriptor to a device using the
ENQCMD instruction, the PASID field in the descriptor is auto-filled with the
value from MSR_IA32_PASID. Requests for DMA from the device are also tagged
with the same PASID. The platform IOMMU uses the PASID in the transaction to
perform address translation. The IOMMU APIs set up the corresponding PASID
entry in IOMMU with the process address used by the CPU (e.g. %cr3 register in
x86).

The MSR must be configured on each logical CPU before any application
thread can interact with a device. Threads that belong to the same
process share the same page tables, thus the same MSR value.

PASID Life Cycle Management
===========================

PASID is initialized as IOMMU_PASID_INVALID (-1) when a process is created.

Only processes that access SVA-capable devices need to have a PASID
allocated. This allocation happens when a process opens/binds an SVA-capable
device but finds no PASID for this process. Subsequent binds of the same, or
other devices will share the same PASID.

Although the PASID is allocated to the process by opening a device,
it is not active in any of the threads of that process.
It's loaded to the
IA32_PASID MSR lazily when a thread tries to submit a work descriptor
to a device using ENQCMD.

That first access will trigger a #GP fault because the IA32_PASID MSR
has not been initialized with the PASID value assigned to the process
when the device was opened. The Linux #GP handler notes that a PASID has
been allocated for the process, and so initializes the IA32_PASID MSR
and returns so that the ENQCMD instruction is re-executed.

On fork(2) or exec(2) the PASID is removed from the process as it no
longer has the same address space that it had when the device was opened.

On clone(2) the new task shares the same address space, so will be
able to use the PASID allocated to the process. The IA32_PASID MSR is not
preemptively initialized because the PASID value might not be allocated yet,
or the kernel might not know whether this thread is going to access the
device. In addition, a cleared IA32_PASID MSR reduces context switch overhead
through the xstate init optimization. Since #GP faults have to be handled on
any threads that were created before the PASID was assigned to the mm of the
process, newly created threads might as well be treated in a consistent way.

Because freeing the PASID and clearing the IA32_PASID MSR in all threads on
unbind is complex, the PASID is freed lazily, only on mm exit.

If a process does a close(2) of the device file descriptor and munmap(2)
of the device MMIO portal, then the driver will unbind the device. The
PASID is still marked VALID in the IA32_PASID MSR for any threads in the
process that accessed the device. But this is harmless as without the
MMIO portal they cannot submit new work to the device.

Relationships
=============

 * Each process has many threads, but only one PASID.
 * Devices have a limited number (from tens to thousands) of hardware workqueues.
   The device driver manages allocating hardware workqueues.
 * A single mmap() maps a single hardware workqueue as a "portal" and
   each portal maps down to a single workqueue.
 * For each device with which a process interacts, there must be
   one or more mmap()'d portals.
 * Many threads within a process can share a single portal to access
   a single device.
 * Multiple processes can separately mmap() the same portal, in
   which case they still share one device hardware workqueue.
 * The single process-wide PASID is used by all threads to interact
   with all devices. There is not, for instance, a PASID for each
   thread or each thread<->device pair.

FAQ
===

* What is SVA/SVM?

Shared Virtual Addressing (SVA) permits I/O hardware and the processor to
work in the same address space, i.e., to share it. Some call it Shared
Virtual Memory (SVM), but the Linux community wanted to avoid confusing it
with POSIX Shared Memory and Secure Virtual Machines, which were terms
already in circulation.

* What is a PASID?

A Process Address Space ID (PASID) is a PCIe-defined Transaction Layer Packet
(TLP) prefix. A PASID is a 20-bit number allocated and managed by the OS.
PASID is included in all transactions between the platform and the device.

* How are shared workqueues different?

Traditionally, in order for userspace applications to interact with hardware,
there is a separate hardware instance required per process. For example,
consider doorbells as a mechanism of informing hardware about work to process.
Each doorbell is required to be spaced 4k (or page-size) apart for process
isolation. This requires hardware to provision that space and reserve it in
MMIO. This doesn't scale as the number of threads becomes quite large. The
hardware also manages the queue depth for Shared Work Queues (SWQ), and
consumers don't need to track queue depth.
If there is no space to accept
a command, the device will return an error indicating retry.

A user should check Deferrable Memory Write (DMWr) capability on the device
and only submit ENQCMD when the device supports it. In the new DMWr PCIe
terminology, devices need to support DMWr completer capability. In addition,
all switch ports must support DMWr routing, and DMWr must be enabled by
the PCIe subsystem, much like how PCIe atomic operations are managed.

SWQ allows hardware to provision just a single address in the device. When
used with ENQCMD to submit work, the device can distinguish the process
submitting the work since it will include the PASID assigned to that
process. This helps the device scale to a large number of processes.

* Is this the same as a user space device driver?

Communicating with the device via the shared workqueue is much simpler
than a full blown user space driver. The kernel driver does all the
initialization of the hardware. User space only needs to worry about
submitting work and processing completions.

* Is this the same as SR-IOV?

Single Root I/O Virtualization (SR-IOV) focuses on providing independent
hardware interfaces for virtualizing hardware. Hence, each instance is
required to present an almost fully functional interface to software,
supporting the traditional BARs, space for interrupts via MSI-X, and its own
register layout.
Virtual Functions (VFs) are assisted by the Physical Function (PF)
driver.

Scalable I/O Virtualization builds on the PASID concept to create device
instances for virtualization. SIOV requires host software to assist in
creating virtual devices; each virtual device is represented by a PASID
along with the bus/device/function of the device. This allows device
hardware to optimize device resource creation, and resources can grow
dynamically on demand.
SR-IOV creation and management is very static in nature. Consult
references below for more details.

* Why not just create a virtual function for each app?

Creating PCIe SR-IOV type Virtual Functions (VF) is expensive. VFs require
duplicated hardware for PCI config space and interrupts such as MSI-X.
Resources such as interrupts have to be hard partitioned between VFs at
creation time, and cannot scale dynamically on demand. The VFs are not
completely independent from the Physical Function (PF). Most VFs require
some communication and assistance from the PF driver. SIOV, in contrast,
creates a software-defined device where all the configuration and control
aspects are mediated via the slow path. The work submission and completion
happen without any mediation.

* Does this support virtualization?

ENQCMD can be used from within a guest VM. In these cases, the VMM helps
with setting up a translation table to translate from Guest PASID to Host
PASID. Please consult the ENQCMD instruction set reference for more
details.

* Does memory need to be pinned?

When devices support SVA along with platform hardware such as IOMMU
supporting such devices, there is no need to pin memory for DMA purposes.
Devices that support SVA also support other PCIe features that remove the
pinning requirement for memory.

Device TLB support - Device requests the IOMMU to look up an address before
use via Address Translation Service (ATS) requests. If the mapping exists
but there is no page allocated by the OS, IOMMU hardware returns that no
mapping exists.

Device requests the virtual address to be mapped via Page Request
Interface (PRI). Once the OS has successfully completed the mapping, it
returns the response back to the device. The device then requests a
translation again and continues.
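
The ATS and PRI sequence described above can be sketched as a toy model in
plain C. This is only an illustration of the ordering (ATS lookup, PRI
request on a miss, ATS lookup again), not how any driver or device implements
it. Every name here (toy_iommu, toy_ats_translate(), toy_pri_request(),
toy_device_access(), TOY_PAGES) is hypothetical, and the "OS" is reduced to an
array recording which pages are present.

```c
#include <stdbool.h>

#define TOY_PAGES 8u  /* hypothetical size of the toy address space, in pages */

/* Toy stand-in for OS/IOMMU state: which virtual pages are present. */
struct toy_iommu {
    bool present[TOY_PAGES];
    unsigned int page_requests;  /* number of PRI requests serviced */
};

/* ATS translation request: succeeds only if the page is present. */
static bool toy_ats_translate(const struct toy_iommu *mmu, unsigned int vpn)
{
    return vpn < TOY_PAGES && mmu->present[vpn];
}

/* PRI page request: the toy OS pages the address in and responds. */
static bool toy_pri_request(struct toy_iommu *mmu, unsigned int vpn)
{
    if (vpn >= TOY_PAGES)
        return false;  /* invalid address: the request fails */
    mmu->present[vpn] = true;
    mmu->page_requests++;
    return true;
}

/* Device-side access: ATS first, PRI on a miss, then ATS again before use. */
static bool toy_device_access(struct toy_iommu *mmu, unsigned int vpn)
{
    if (toy_ats_translate(mmu, vpn))
        return true;
    if (!toy_pri_request(mmu, vpn))
        return false;
    return toy_ats_translate(mmu, vpn);
}
```

In this model the first toy_device_access() to a page costs one page request,
and later accesses to the same page succeed on the first ATS lookup, which
mirrors why no up-front pinning is needed.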

IOMMU works with the OS in managing consistency of page-tables with the
device. When removing pages, it interacts with the device to remove any
device TLB entry that might have been cached before removing the mappings from
the OS.

References
==========

VT-D:
https://01.org/blogs/ashokraj/2018/recent-enhancements-intel-virtualization-technology-directed-i/o-intel-vt-d

SIOV:
https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux

ENQCMD in ISE:
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf

DSA spec:
https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf