.. SPDX-License-Identifier: GPL-2.0

===============
DMA and swiotlb
===============

swiotlb is a memory buffer allocator used by the Linux kernel DMA layer. It is
typically used when a device doing DMA can't directly access the target memory
buffer because of hardware limitations or other requirements. In such a case,
the DMA layer calls swiotlb to allocate a temporary memory buffer that conforms
to the limitations. The DMA is done to/from this temporary memory buffer, and
the CPU copies the data between the temporary buffer and the original target
memory buffer. This approach is generically called "bounce buffering", and the
temporary memory buffer is called a "bounce buffer".

Device drivers don't interact directly with swiotlb. Instead, drivers inform
the DMA layer of the DMA attributes of the devices they are managing, and use
the normal DMA map, unmap, and sync APIs when programming a device to do DMA.
These APIs use the device DMA attributes and kernel-wide settings to determine
if bounce buffering is necessary. If so, the DMA layer manages the allocation,
freeing, and sync'ing of bounce buffers. Since the DMA attributes are per
device, some devices in a system may use bounce buffering while others do not.

Because the CPU copies data between the bounce buffer and the original target
memory buffer, doing bounce buffering is slower than doing DMA directly to the
original memory buffer, and it consumes more CPU resources. So it is used only
when necessary for providing DMA functionality.

Usage Scenarios
---------------
swiotlb was originally created to handle DMA for devices with addressing
limitations. As physical memory sizes grew beyond 4 GiB, some devices could
only provide 32-bit DMA addresses. By allocating bounce buffer memory below
the 4 GiB line, these devices with addressing limitations could still work and
do DMA.

More recently, Confidential Computing (CoCo) VMs have the guest VM's memory
encrypted by default, and the memory is not accessible by the host hypervisor
and VMM. For the host to do I/O on behalf of the guest, the I/O must be
directed to guest memory that is unencrypted. CoCo VMs set a kernel-wide option
to force all DMA I/O to use bounce buffers, and the bounce buffer memory is set
up as unencrypted. The host does DMA I/O to/from the bounce buffer memory, and
the Linux kernel DMA layer does "sync" operations to cause the CPU to copy the
data to/from the original target memory buffer. The CPU copying bridges between
the unencrypted and the encrypted memory. This use of bounce buffers allows
device drivers to "just work" in a CoCo VM, with no modifications needed to
handle the memory encryption complexity.

Other edge case scenarios arise for bounce buffers. For example, when IOMMU
mappings are set up for a DMA operation to/from a device that is considered
"untrusted", the device should be given access only to the memory containing
the data being transferred. But if that memory occupies only part of an IOMMU
granule, other parts of the granule may contain unrelated kernel data. Since
IOMMU access control is per-granule, the untrusted device can gain access to
the unrelated kernel data. This problem is solved by bounce buffering the DMA
operation and ensuring that unused portions of the bounce buffers do not
contain any unrelated kernel data.
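
In all of these scenarios, the driver-side code is the same: the driver uses
the ordinary DMA map, unmap, and sync APIs and never references swiotlb
directly. A minimal sketch of the pattern (the device, buffer, and function
names here are hypothetical)::

    #include <linux/dma-mapping.h>
    #include <linux/errno.h>

    /* Hypothetical helper: receive "len" bytes from a device into "buf". */
    static int example_dma_read(struct device *dev, void *buf, size_t len)
    {
            dma_addr_t dma_addr;

            /* The DMA layer may transparently substitute a bounce buffer. */
            dma_addr = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
            if (dma_mapping_error(dev, dma_addr))
                    return -ENOMEM;

            /* ... program the device with dma_addr and wait for completion ... */

            /* If a bounce buffer was used, this copies the data back to "buf". */
            dma_unmap_single(dev, dma_addr, len, DMA_FROM_DEVICE);
            return 0;
    }
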
Core Functionality
------------------
The primary swiotlb APIs are swiotlb_tbl_map_single() and
swiotlb_tbl_unmap_single(). The "map" API allocates a bounce buffer of a
specified size in bytes and returns the physical address of the buffer. The
buffer memory is physically contiguous. The expectation is that the DMA layer
maps the physical memory address to a DMA address, and returns the DMA address
to the driver for programming into the device. If a DMA operation specifies
multiple memory buffer segments, a separate bounce buffer must be allocated for
each segment. swiotlb_tbl_map_single() always does a "sync" operation (i.e., a
CPU copy) to initialize the bounce buffer to match the contents of the original
buffer.

swiotlb_tbl_unmap_single() does the reverse. If the DMA operation might have
updated the bounce buffer memory and DMA_ATTR_SKIP_CPU_SYNC is not set, the
unmap does a "sync" operation to cause a CPU copy of the data from the bounce
buffer back to the original buffer. Then the bounce buffer memory is freed.

swiotlb also provides "sync" APIs that correspond to the dma_sync_*() APIs that
a driver may use when control of a buffer transitions between the CPU and the
device. The swiotlb "sync" APIs cause a CPU copy of the data between the
original buffer and the bounce buffer. Like the dma_sync_*() APIs, the swiotlb
"sync" APIs support doing a partial sync, where only a subset of the bounce
buffer is copied to/from the original buffer.

Core Functionality Constraints
------------------------------
The swiotlb map/unmap/sync APIs must operate without blocking, as they are
called by the corresponding DMA APIs which may run in contexts that cannot
block. Hence the default memory pool for swiotlb allocations must be
pre-allocated at boot time (but see Dynamic swiotlb below). Because swiotlb
allocations must be physically contiguous, the entire default memory pool is
allocated as a single contiguous block.

The need to pre-allocate the default swiotlb pool creates a boot-time tradeoff.
The pool should be large enough to ensure that bounce buffer requests can
always be satisfied, as the non-blocking requirement means requests can't wait
for space to become available. But a large pool potentially wastes memory, as
this pre-allocated memory is not available for other uses in the system. The
tradeoff is particularly acute in CoCo VMs that use bounce buffers for all DMA
I/O. These VMs use a heuristic to set the default pool size to ~6% of memory,
with a max of 1 GiB, which has the potential to be very wasteful of memory.
Conversely, the heuristic might produce a size that is insufficient, depending
on the I/O patterns of the workload in the VM. The dynamic swiotlb feature
described below can help, but has limitations. Better management of the swiotlb
default memory pool size remains an open issue.

A single allocation from swiotlb is limited to IO_TLB_SIZE * IO_TLB_SEGSIZE
bytes, which is 256 KiB with current definitions. When a device's DMA settings
are such that the device might use swiotlb, the maximum size of a DMA segment
must be limited to that 256 KiB. This value is communicated to higher-level
kernel code via dma_max_mapping_size() and swiotlb_max_mapping_size(). If the
higher-level code fails to account for this limit, it may make requests that
are too large for swiotlb, and get a "swiotlb full" error.
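
For example, a driver that builds large transfers might cap its per-segment
size using dma_max_mapping_size(); a minimal sketch (the hardware limit and
helper name below are made up for illustration)::

    #include <linux/dma-mapping.h>
    #include <linux/minmax.h>
    #include <linux/sizes.h>

    /*
     * Hypothetical helper: limit this driver's segment size so that each
     * DMA segment still fits in a single swiotlb bounce buffer (256 KiB
     * with current definitions) if bounce buffering is in effect.
     */
    static size_t example_max_segment_bytes(struct device *dev)
    {
            size_t hw_limit = SZ_1M;        /* example hardware limit */

            return min(hw_limit, dma_max_mapping_size(dev));
    }
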
A key device DMA setting is "min_align_mask", which is a power of 2 minus 1
so that some number of low order bits are set, or it may be zero. swiotlb
allocations ensure these min_align_mask bits of the physical address of the
bounce buffer match the same bits in the address of the original buffer. When
min_align_mask is non-zero, it may produce an "alignment offset" in the address
of the bounce buffer that slightly reduces the maximum size of an allocation.
This potential alignment offset is reflected in the value returned by
swiotlb_max_mapping_size(), which can show up in places like
/sys/block/<device>/queue/max_sectors_kb. For example, if a device does not use
swiotlb, max_sectors_kb might be 512 KiB or larger. If a device might use
swiotlb, max_sectors_kb will be 256 KiB. When min_align_mask is non-zero,
max_sectors_kb might be even smaller, such as 252 KiB.

swiotlb_tbl_map_single() also takes an "alloc_align_mask" parameter. This
parameter specifies that the allocation of bounce buffer space must start at a
physical address with the alloc_align_mask bits set to zero. But the actual
bounce buffer might start at a larger address if min_align_mask is non-zero.
Hence there may be pre-padding space that is allocated prior to the start of
the bounce buffer. Similarly, the end of the bounce buffer is rounded up to an
alloc_align_mask boundary, potentially resulting in post-padding space. Any
pre-padding or post-padding space is not initialized by swiotlb code. The
"alloc_align_mask" parameter is used by IOMMU code when mapping for untrusted
devices. It is set to the granule size - 1 so that the bounce buffer is
allocated entirely from granules that are not used for any other purpose.
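
A driver whose device requires the min_align_mask offset-matching behavior
described above declares it through the DMA API rather than by calling swiotlb
directly; the NVMe PCI driver, for example, sets min_align_mask based on its
controller page size. A minimal sketch (hypothetical driver, assuming
dev->dma_parms has already been set up by the bus code)::

    #include <linux/dma-mapping.h>
    #include <linux/sizes.h>

    static int example_probe(struct device *dev)
    {
            /*
             * Ask the DMA layer (and therefore swiotlb) to preserve the
             * low 12 bits of any buffer address, so a bounce buffer keeps
             * the original buffer's offset within a 4 KiB boundary.
             */
            dma_set_min_align_mask(dev, SZ_4K - 1);
            return 0;
    }
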
Data structures concepts
------------------------
Memory used for swiotlb bounce buffers is allocated from overall system memory
as one or more "pools". The default pool is allocated during system boot with a
default size of 64 MiB. The default pool size may be modified with the
"swiotlb=" kernel boot line parameter. The default size may also be adjusted
due to other conditions, such as running in a CoCo VM, as described above. If
CONFIG_SWIOTLB_DYNAMIC is enabled, additional pools may be allocated later in
the life of the system. Each pool must be a contiguous range of physical
memory. The default pool is allocated below the 4 GiB physical address line so
it works for devices that can only address 32 bits of physical memory (unless
architecture-specific code provides the SWIOTLB_ANY flag). In a CoCo VM, the
pool memory must be decrypted before swiotlb is used.

Each pool is divided into "slots" of size IO_TLB_SIZE, which is 2 KiB with
current definitions. IO_TLB_SEGSIZE contiguous slots (128 slots) constitute
what might be called a "slot set". When a bounce buffer is allocated, it
occupies one or more contiguous slots. A slot is never shared by multiple
bounce buffers. Furthermore, a bounce buffer must be allocated from a single
slot set, which leads to the maximum bounce buffer size being IO_TLB_SIZE *
IO_TLB_SEGSIZE. Multiple smaller bounce buffers may co-exist in a single slot
set if the alignment and size constraints can be met.

Slots are also grouped into "areas", with the constraint that a slot set exists
entirely in a single area. Each area has its own spin lock that must be held to
manipulate the slots in that area. The division into areas avoids contending
for a single global spin lock when swiotlb is heavily used, such as in a CoCo
VM. The number of areas defaults to the number of CPUs in the system for
maximum parallelism, but since an area can't be smaller than IO_TLB_SEGSIZE
slots, it might be necessary to assign multiple CPUs to the same area. The
number of areas can also be set via the "swiotlb=" kernel boot parameter.
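
Both the default pool size and the number of areas described in this section
can be chosen on the kernel boot line. Assuming the format documented in
Documentation/admin-guide/kernel-parameters.txt (a slot count, optionally
followed by an area count), a boot line entry like the following would request
a 128 MiB default pool (65536 slots of 2 KiB each) divided into 4 areas::

    swiotlb=65536,4
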
When allocating a bounce buffer, if the area associated with the calling CPU
does not have enough free space, areas associated with other CPUs are tried
sequentially. For each area tried, the area's spin lock must be obtained before
trying an allocation, so contention may occur if swiotlb is relatively busy
overall. But an allocation request does not fail unless all areas do not have
enough free space.

IO_TLB_SIZE, IO_TLB_SEGSIZE, and the number of areas must all be powers of 2 as
the code uses shifting and bit masking to do many of the calculations. The
number of areas is rounded up to a power of 2 if necessary to meet this
requirement.

The default pool is allocated with PAGE_SIZE alignment. If an alloc_align_mask
argument to swiotlb_tbl_map_single() specifies a larger alignment, one or more
initial slots in each slot set might not meet the alloc_align_mask criterion.
Because a bounce buffer allocation can't cross a slot set boundary, eliminating
those initial slots effectively reduces the max size of a bounce buffer.
Currently, there's no problem because alloc_align_mask is set based on the
IOMMU granule size, and granules cannot be larger than PAGE_SIZE. But if that
were to change in the future, the initial pool allocation might need to be done
with alignment larger than PAGE_SIZE.

Dynamic swiotlb
---------------
When CONFIG_SWIOTLB_DYNAMIC is enabled, swiotlb can do on-demand expansion of
the amount of memory available for allocation as bounce buffers. If a bounce
buffer request fails due to lack of available space, an asynchronous background
task is kicked off to allocate memory from general system memory and turn it
into a swiotlb pool. Creating an additional pool must be done asynchronously
because the memory allocation may block, and as noted above, swiotlb requests
are not allowed to block. Once the background task is kicked off, the bounce
buffer request creates a "transient pool" to avoid returning a "swiotlb full"
error. A transient pool has the size of the bounce buffer request, and is
deleted when the bounce buffer is freed. Memory for this transient pool comes
from the general system memory atomic pool so that creation does not block.
Creating a transient pool has relatively high cost, particularly in a CoCo VM
where the memory must be decrypted, so it is done only as a stopgap until the
background task can add another non-transient pool.

Adding a dynamic pool has limitations. As with the default pool, the memory
must be physically contiguous, so the size is limited to MAX_PAGE_ORDER pages
(e.g., 4 MiB on a typical x86 system). Due to memory fragmentation, a
maximum-size allocation may not be available. The dynamic pool allocator tries
smaller sizes until it succeeds, but with a minimum size of 1 MiB. Given
sufficient system memory fragmentation, dynamically adding a pool might not
succeed at all.

The number of areas in a dynamic pool may be different from the number of areas
in the default pool. Because the new pool size is typically a few MiB at most,
the number of areas will likely be smaller. For example, with a new pool size
of 4 MiB and the 256 KiB minimum area size, only 16 areas can be created. If
the system has more than 16 CPUs, multiple CPUs must share an area, creating
more lock contention.

New pools added via dynamic swiotlb are linked together in a linear list.
swiotlb code frequently must search for the pool containing a particular
swiotlb physical address, so that search is linear and performs poorly when
there are many dynamic pools. The data structures could be improved to allow
faster searches.

Overall, dynamic swiotlb works best for small configurations with relatively
few CPUs. It allows the default swiotlb pool to be smaller so that memory is
not wasted, with dynamic pools making more space available if needed (as long
as fragmentation isn't an obstacle). It is less useful for large CoCo VMs.

Data Structure Details
----------------------
swiotlb is managed with four primary data structures: io_tlb_mem, io_tlb_pool,
io_tlb_area, and io_tlb_slot. io_tlb_mem describes a swiotlb memory allocator,
which includes the default memory pool and any dynamic or transient pools
linked to it. Limited statistics on swiotlb usage are kept per memory allocator
and are stored in this data structure. These statistics are available under
/sys/kernel/debug/swiotlb when CONFIG_DEBUG_FS is set.

io_tlb_pool describes a memory pool, either the default pool, a dynamic pool,
or a transient pool. The description includes the start and end addresses of
the memory in the pool, a pointer to an array of io_tlb_area structures, and a
pointer to an array of io_tlb_slot structures that are associated with the pool.

io_tlb_area describes an area. The primary field is the spin lock used to
serialize access to slots in the area. The io_tlb_area array for a pool has an
entry for each area, and is accessed using a 0-based area index derived from the
calling processor ID. Areas exist solely to allow parallel access to swiotlb
from multiple CPUs.

io_tlb_slot describes an individual memory slot in the pool, with size
IO_TLB_SIZE (2 KiB currently). The io_tlb_slot array is indexed by the slot
index computed from the bounce buffer address relative to the starting memory
address of the pool. The size of struct io_tlb_slot is 24 bytes, so the
overhead is about 1% of the slot size.
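
The per-slot bookkeeping described in the following paragraphs is held in the
fields of struct io_tlb_slot. A simplified sketch of those fields (based on
kernel/dma/swiotlb.c; the exact definition may change from one kernel version
to the next)::

    /* Simplified sketch of the per-slot metadata in kernel/dma/swiotlb.c. */
    struct io_tlb_slot {
            phys_addr_t orig_addr;          /* original buffer address, for syncs */
            size_t alloc_size;              /* size recorded for sanity checks */
            unsigned short list;            /* contiguous free slots starting here */
            unsigned short pad_slots;       /* pre-padding slots (first slot only) */
    };
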
The io_tlb_slot array is designed to meet several requirements. First, the DMA
APIs and the corresponding swiotlb APIs use the bounce buffer address as the
identifier for a bounce buffer. This address is returned by
swiotlb_tbl_map_single(), and then passed as an argument to
swiotlb_tbl_unmap_single() and the swiotlb_sync_*() functions. The original
memory buffer address obviously must be passed as an argument to
swiotlb_tbl_map_single(), but it is not passed to the other APIs. Consequently,
swiotlb data structures must save the original memory buffer address so that it
can be used when doing sync operations. This original address is saved in the
io_tlb_slot array.

Second, the io_tlb_slot array must handle partial sync requests. In such cases,
the argument to swiotlb_sync_*() is not the address of the start of the bounce
buffer but an address somewhere in the middle of the bounce buffer, and the
address of the start of the bounce buffer isn't known to the swiotlb code. But
the swiotlb code must be able to calculate the corresponding original memory
buffer address to do the CPU copy dictated by the "sync". So an adjusted
original memory buffer address is populated into the struct io_tlb_slot for
each slot occupied by the bounce buffer. An adjusted "alloc_size" of the bounce
buffer is also recorded in each struct io_tlb_slot so a sanity check can be
performed on the size of the "sync" operation. The "alloc_size" field is not
used except for the sanity check.

Third, the io_tlb_slot array is used to track available slots. The "list" field
in struct io_tlb_slot records how many contiguous available slots exist starting
at that slot. A "0" indicates that the slot is occupied. A value of "1"
indicates only the current slot is available. A value of "2" indicates the
current slot and the next slot are available, etc. The maximum value is
IO_TLB_SEGSIZE, which can appear in the first slot in a slot set, and indicates
that the entire slot set is available. These values are used when searching for
available slots to use for a new bounce buffer. They are updated when allocating
a new bounce buffer and when freeing a bounce buffer. At pool creation time, the
"list" field is initialized to IO_TLB_SEGSIZE down to 1 for the slots in every
slot set.

Fourth, the io_tlb_slot array keeps track of any "padding slots" allocated to
meet the alloc_align_mask requirements described above. When
swiotlb_tbl_map_single() allocates bounce buffer space to meet alloc_align_mask
requirements, it may allocate pre-padding space across zero or more slots. But
when swiotlb_tbl_unmap_single() is called with the bounce buffer address, the
alloc_align_mask value that governed the allocation, and therefore the
allocation of any padding slots, is not known. The "pad_slots" field records
the number of padding slots so that swiotlb_tbl_unmap_single() can free them.
The "pad_slots" value is recorded only in the first non-padding slot allocated
to the bounce buffer.

Restricted pools
----------------
The swiotlb machinery is also used for "restricted pools", which are pools of
memory separate from the default swiotlb pool, and that are dedicated for DMA
use by a particular device. Restricted pools provide a level of DMA memory
protection on systems with limited hardware protection capabilities, such as
those lacking an IOMMU. Such usage is specified by DeviceTree entries and
requires that CONFIG_DMA_RESTRICTED_POOL is set. Each restricted pool is based
on its own io_tlb_mem data structure that is independent of the main swiotlb
io_tlb_mem.

Restricted pools add swiotlb_alloc() and swiotlb_free() APIs, which are called
from the dma_alloc_*() and dma_free_*() APIs. The swiotlb_alloc/free() APIs
allocate/free slots from/to the restricted pool directly and do not go through
swiotlb_tbl_map/unmap_single().
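
A restricted pool is declared as a reserved-memory region in the DeviceTree
and referenced by the device that must use it. A hedged sketch of what such a
declaration might look like (the node names, consumer device, addresses, and
sizes are made up; the reserved-memory node uses the "restricted-dma-pool"
compatible string)::

    reserved-memory {
            #address-cells = <1>;
            #size-cells = <1>;
            ranges;

            restricted_dma: restricted-dma-pool@50000000 {
                    compatible = "restricted-dma-pool";
                    reg = <0x50000000 0x400000>;    /* 4 MiB pool */
            };
    };

    some-device@12340000 {
            compatible = "vendor,example-device";
            reg = <0x12340000 0x1000>;
            memory-region = <&restricted_dma>;
    };
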