.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)

===============
VM_BIND locking
===============

This document attempts to describe what's needed to get VM_BIND locking right,
including the userptr mmu_notifier locking. It also discusses some
optimizations that avoid looping through all userptr mappings and
external / shared object mappings, as the simplest implementation
requires. In addition, there is a section describing the VM_BIND locking
required for implementing recoverable pagefaults.

The DRM GPUVM set of helpers
============================

There is a set of helpers for drivers implementing VM_BIND, and this
set of helpers implements much, but not all of the locking described
in this document. In particular, it is currently lacking a userptr
implementation. This document does not intend to describe the DRM GPUVM
implementation in detail, but it is covered in :ref:`its own
documentation <drm_gpuvm>`. It is highly recommended for any driver
implementing VM_BIND to use the DRM GPUVM helpers and to extend them if
common functionality is missing.

Nomenclature
============

* ``gpu_vm``: Abstraction of a virtual GPU address space with
  meta-data. Typically one per client (DRM file-private), or one per
  execution context.
* ``gpu_vma``: Abstraction of a GPU address range within a gpu_vm with
  associated meta-data. The backing storage of a gpu_vma can either be
  a GEM object or anonymous or page-cache pages mapped also into the CPU
  address space for the process.
* ``gpu_vm_bo``: Abstracts the association of a GEM object and
  a VM. The GEM object maintains a list of gpu_vm_bos, where each gpu_vm_bo
  maintains a list of gpu_vmas.
* ``userptr gpu_vma or just userptr``: A gpu_vma, whose backing store
  is anonymous or page-cache pages as described above.
* ``revalidating``: Revalidating a gpu_vma means making the latest version
  of the backing store resident and making sure the gpu_vma's
  page-table entries point to that backing store.
* ``dma_fence``: A struct dma_fence that is similar to a struct completion
  and which tracks GPU activity. When the GPU activity is finished,
  the dma_fence signals. Please refer to the ``DMA Fences`` section of
  the :doc:`dma-buf doc </driver-api/dma-buf>`.
* ``dma_resv``: A struct dma_resv (a.k.a. reservation object) that is used
  to track GPU activity in the form of multiple dma_fences on a
  gpu_vm or a GEM object. The dma_resv contains an array / list
  of dma_fences and a lock that needs to be held when adding
  additional dma_fences to the dma_resv. The lock is of a type that
  allows deadlock-safe locking of multiple dma_resvs in arbitrary
  order. Please refer to the ``Reservation Objects`` section of the
  :doc:`dma-buf doc </driver-api/dma-buf>`.
* ``exec function``: An exec function is a function that revalidates all
  affected gpu_vmas, submits a GPU command batch and registers the
  dma_fence representing the GPU command's activity with all affected
  dma_resvs. For completeness, although not covered by this document,
  it's worth mentioning that an exec function may also be the
  revalidation worker that is used by some drivers in compute /
  long-running mode.
* ``local object``: A GEM object which is only mapped within a
  single VM. Local GEM objects share the gpu_vm's dma_resv.
* ``external object``: a.k.a. shared object: A GEM object which may be shared
  by multiple gpu_vms and whose backing storage may be shared with
  other drivers.

Locks and locking order
=======================

One of the benefits of VM_BIND is that local GEM objects share the gpu_vm's
dma_resv object and hence the dma_resv lock. So, even with a huge
number of local GEM objects, only one lock is needed to make the exec
sequence atomic.

The following locks and locking orders are used:

* The ``gpu_vm->lock`` (optionally an rwsem). Protects the gpu_vm's
  data structure keeping track of gpu_vmas. It can also protect the
  gpu_vm's list of userptr gpu_vmas. With a CPU mm analogy this would
  correspond to the mmap_lock. An rwsem allows several readers to walk
  the VM tree concurrently, but the benefit of that concurrency most
  likely varies from driver to driver.
* The ``userptr_seqlock``. This lock is taken in read mode for each
  userptr gpu_vma on the gpu_vm's userptr list, and in write mode during mmu
  notifier invalidation. This is not a real seqlock but described in
  ``mm/mmu_notifier.c`` as a "Collision-retry read-side/write-side
  'lock' a lot like a seqcount. However this allows multiple
  write-sides to hold it at once...". The read side critical section
  is enclosed by ``mmu_interval_read_begin() /
  mmu_interval_read_retry()`` with ``mmu_interval_read_begin()``
  sleeping if the write side is held.
  The write side is held by the core mm while calling mmu interval
  invalidation notifiers.
* The ``gpu_vm->resv`` lock. Protects the gpu_vm's list of gpu_vmas needing
  rebinding, as well as the residency state of all the gpu_vm's local
  GEM objects.
  Furthermore, it typically protects the gpu_vm's list of evicted and
  external GEM objects.
* The ``gpu_vm->userptr_notifier_lock``. This is an rwsem that is
  taken in read mode during exec and write mode during an mmu notifier
  invalidation. The userptr notifier lock is per gpu_vm.
* The ``gem_object->gpuva_lock``. This lock protects the GEM object's
  list of gpu_vm_bos. This is usually the same lock as the GEM
  object's dma_resv, but some drivers protect this list differently,
  see below.
* The ``gpu_vm list spinlocks``. With some implementations they are needed
  to be able to update the gpu_vm evicted- and external object
  list. For those implementations, the spinlocks are grabbed when the
  lists are manipulated. However, to avoid locking order violations
  with the dma_resv locks, a special scheme is needed when iterating
  over the lists.

.. _gpu_vma lifetime:

Protection and lifetime of gpu_vm_bos and gpu_vmas
==================================================

The GEM object's list of gpu_vm_bos, and the gpu_vm_bo's list of gpu_vmas
are protected by the ``gem_object->gpuva_lock``, which is typically the
same as the GEM object's dma_resv, but if the driver
needs to access these lists from within a dma_fence signalling
critical section, it can instead choose to protect them with a
separate lock, which can be locked from within the dma_fence signalling
critical section. Such drivers then need to pay additional attention
to what locks need to be taken from within the loop when iterating
over the gpu_vm_bo and gpu_vma lists to avoid locking-order violations.

The DRM GPUVM set of helpers provides lockdep asserts that this lock is
held in relevant situations and also provides a means of making itself
aware of which lock is actually used: :c:func:`drm_gem_gpuva_set_lock`.

Each gpu_vm_bo holds a reference counted pointer to the underlying GEM
object, and each gpu_vma holds a reference counted pointer to the
gpu_vm_bo. When iterating over the GEM object's list of gpu_vm_bos and
over the gpu_vm_bo's list of gpu_vmas, the ``gem_object->gpuva_lock`` must
not be dropped, otherwise, gpu_vmas attached to a gpu_vm_bo may
disappear without notice since those are not reference-counted. A
driver may implement its own scheme to allow this at the expense of
additional complexity, but this is outside the scope of this document.

In the DRM GPUVM implementation, each gpu_vm_bo and each gpu_vma
holds a reference count on the gpu_vm itself. Due to this, and to avoid circular
reference counting, cleanup of the gpu_vm's gpu_vmas must not be done from the
gpu_vm's destructor. Drivers typically implement a gpu_vm close
function for this cleanup. The gpu_vm close function will abort gpu
execution using this VM, unmap all gpu_vmas and release page-table memory.

Revalidation and eviction of local objects
==========================================

Note that in all the code examples given below we use simplified
pseudo-code. In particular, the dma_resv deadlock avoidance algorithm
as well as reserving memory for dma_resv fences is left out.

Revalidation
____________

With VM_BIND, all local objects need to be resident when the gpu is
executing using the gpu_vm, and the objects need to have valid
gpu_vmas set up pointing to them. Typically, each gpu command buffer
submission is therefore preceded by a re-validation section:

.. code-block:: C

   dma_resv_lock(gpu_vm->resv);

   // Validation section starts here.
   for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) {
           validate_gem_bo(&gpu_vm_bo->gem_bo);

           // The following list iteration needs the GEM object's
           // dma_resv to be held (it protects the gpu_vm_bo's list of
           // gpu_vmas), but since local GEM objects share the gpu_vm's
           // dma_resv, it is already held at this point.
           for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
                   move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list);
   }

   for_each_gpu_vma_on_rebind_list(&gpu_vm->rebind_list, &gpu_vma) {
           rebind_gpu_vma(&gpu_vma);
           remove_gpu_vma_from_rebind_list(&gpu_vma);
   }
   // Validation section ends here, and job submission starts.

   add_dependencies(&gpu_job, &gpu_vm->resv);
   job_dma_fence = gpu_submit(&gpu_job);

   add_dma_fence(job_dma_fence, &gpu_vm->resv);
   dma_resv_unlock(gpu_vm->resv);

The reason for having a separate gpu_vm rebind list is that there
might be userptr gpu_vmas that are not mapping a buffer object that
also need rebinding.
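
To illustrate why the rebind list is kept separate from the evict list,
the following is a minimal user-space sketch, not driver code. All
``toy_*`` names are hypothetical, the ``resv_locked`` flag merely stands
in for holding ``gpu_vm->resv``, and the lists are plain singly-linked
lists rather than kernel lists. An invalidated userptr gpu_vma is queued
directly on the rebind list, while the gpu_vmas of an evicted local
object reach it through the evict list:

.. code-block:: C

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>

   /* Hypothetical stand-ins; none of these types exist in the kernel. */
   struct toy_vma {
           bool valid;                     /* PTEs point at current backing store */
           struct toy_vma *next_in_bo;     /* link in the gpu_vm_bo's gpu_vma list */
           struct toy_vma *next_rebind;    /* link in the gpu_vm's rebind list */
   };

   struct toy_vm_bo {
           bool resident;
           struct toy_vma *vmas;
           struct toy_vm_bo *next_evicted; /* link in the gpu_vm's evict list */
   };

   struct toy_vm {
           bool resv_locked;               /* models holding gpu_vm->resv */
           struct toy_vm_bo *evict_list;
           struct toy_vma *rebind_list;
   };

   /* Eviction of a local object: only queue the gpu_vm_bo, no unbinding. */
   static void toy_evict(struct toy_vm *vm, struct toy_vm_bo *bo)
   {
           struct toy_vma *vma;

           assert(vm->resv_locked);
           if (!bo->resident)
                   return;                 /* already on the evict list */
           bo->resident = false;
           for (vma = bo->vmas; vma; vma = vma->next_in_bo)
                   vma->valid = false;     /* PTEs are now stale */
           bo->next_evicted = vm->evict_list;
           vm->evict_list = bo;
   }

   /* Validation section of an exec function; returns the number of rebinds. */
   static int toy_revalidate(struct toy_vm *vm)
   {
           struct toy_vma *vma;
           int rebound = 0;

           assert(vm->resv_locked);
           while (vm->evict_list) {
                   struct toy_vm_bo *bo = vm->evict_list;

                   vm->evict_list = bo->next_evicted;
                   bo->next_evicted = NULL;
                   bo->resident = true;    /* models validate_gem_bo() */
                   for (vma = bo->vmas; vma; vma = vma->next_in_bo) {
                           vma->next_rebind = vm->rebind_list;
                           vm->rebind_list = vma;
                   }
           }
           while ((vma = vm->rebind_list)) {
                   vm->rebind_list = vma->next_rebind;
                   vma->next_rebind = NULL;
                   vma->valid = true;      /* models rebind_gpu_vma() */
                   rebound++;
           }
           return rebound;
   }

   /* Two gpu_vmas of one evicted object plus one invalidated userptr. */
   int toy_revalidate_demo(void)
   {
           struct toy_vma userptr = { .valid = false };
           struct toy_vma b = { .valid = true };
           struct toy_vma a = { .valid = true, .next_in_bo = &b };
           struct toy_vm_bo bo = { .resident = true, .vmas = &a };
           struct toy_vm vm = { .resv_locked = true };
           int rebound;

           toy_evict(&vm, &bo);
           vm.rebind_list = &userptr;      /* userptr has no gpu_vm_bo */
           rebound = toy_revalidate(&vm);
           vm.resv_locked = false;

           return (bo.resident && a.valid && b.valid && userptr.valid) ?
                   rebound : -1;
   }

In this model ``toy_revalidate_demo()`` rebinds three gpu_vmas in a
single pass: the two that belong to the evicted object and the userptr
one, which never appears on the evict list.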

Eviction
________

Eviction of one of these local objects will then look similar to the
following:

.. code-block:: C

   obj = get_object_from_lru();

   dma_resv_lock(obj->resv);
   for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo)
           add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);

   add_dependencies(&eviction_job, &obj->resv);
   job_dma_fence = gpu_submit(&eviction_job);
   add_dma_fence(job_dma_fence, &obj->resv);

   dma_resv_unlock(obj->resv);
   put_object(obj);

Note that since the object is local to the gpu_vm, it will share the gpu_vm's
dma_resv lock such that ``obj->resv == gpu_vm->resv``.
The gpu_vm_bos marked for eviction are put on the gpu_vm's evict list,
which is protected by ``gpu_vm->resv``. During eviction all local
objects have their dma_resv locked and, due to the above equality, also
the gpu_vm's dma_resv protecting the gpu_vm's evict list is locked.

With VM_BIND, gpu_vmas don't need to be unbound before eviction,
since the driver must ensure that the eviction blit or copy will wait
for GPU idle or depend on all previous GPU activity. Furthermore, any
subsequent attempt by the GPU to access freed memory through the
gpu_vma will be preceded by a new exec function, with a revalidation
section which will make sure all gpu_vmas are rebound. The eviction
code holding the object's dma_resv while revalidating will ensure a
new exec function may not race with the eviction.

A driver can be implemented in such a way that, on each exec function,
only a subset of vmas are selected for rebind. In this case, all vmas that are
*not* selected for rebind must be unbound before the exec
function workload is submitted.

Locking with external buffer objects
====================================

Since external buffer objects may be shared by multiple gpu_vms they
can't share their reservation object with a single gpu_vm. Instead
they need to have a reservation object of their own. The external
objects bound to a gpu_vm using one or many gpu_vmas are therefore put on a
per-gpu_vm list which is protected by the gpu_vm's dma_resv lock or
one of the :ref:`gpu_vm list spinlocks <Spinlock iteration>`. Once
the gpu_vm's reservation object is locked, it is safe to traverse the
external object list and lock the dma_resvs of all external
objects. However, if instead a list spinlock is used, a more elaborate
iteration scheme needs to be used.

At eviction time, the gpu_vm_bos of *all* the gpu_vms an external
object is bound to need to be put on their gpu_vm's evict list.
However, when evicting an external object, the dma_resvs of the
gpu_vms the object is bound to are typically not held. Only
the object's private dma_resv can be guaranteed to be held. If there
is a ww_acquire context at hand at eviction time we could grab those
dma_resvs but that could cause expensive ww_mutex rollbacks. A simple
option is to just mark the gpu_vm_bos of the evicted gem object with
an ``evicted`` bool that is inspected before the next time the
corresponding gpu_vm evicted list needs to be traversed, for example when
traversing the list of external objects and locking them. At that time,
both the gpu_vm's dma_resv and the object's dma_resv are held, and the
gpu_vm_bo marked evicted can then be added to the gpu_vm's list of
evicted gpu_vm_bos. The ``evicted`` bool is formally protected by the
object's dma_resv.

The exec function becomes

.. code-block:: C

   dma_resv_lock(gpu_vm->resv);

   // External object list is protected by the gpu_vm->resv lock.
   for_each_gpu_vm_bo_on_extobj_list(gpu_vm, &gpu_vm_bo) {
           dma_resv_lock(gpu_vm_bo.gem_obj->resv);
           if (gpu_vm_bo_marked_evicted(&gpu_vm_bo))
                   add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
   }

   for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) {
           validate_gem_bo(&gpu_vm_bo->gem_bo);

           for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
                   move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list);
   }

   for_each_gpu_vma_on_rebind_list(&gpu_vm->rebind_list, &gpu_vma) {
           rebind_gpu_vma(&gpu_vma);
           remove_gpu_vma_from_rebind_list(&gpu_vma);
   }

   add_dependencies(&gpu_job, &gpu_vm->resv);
   job_dma_fence = gpu_submit(&gpu_job);

   add_dma_fence(job_dma_fence, &gpu_vm->resv);
   for_each_external_obj(gpu_vm, &obj)
           add_dma_fence(job_dma_fence, &obj->resv);
   dma_resv_unlock_all_resv_locks();

And the corresponding shared-object aware eviction would look like:

.. code-block:: C

   obj = get_object_from_lru();

   dma_resv_lock(obj->resv);
   for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo)
           if (object_is_vm_local(obj))
                   add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
           else
                   mark_gpu_vm_bo_evicted(&gpu_vm_bo);

   add_dependencies(&eviction_job, &obj->resv);
   job_dma_fence = gpu_submit(&eviction_job);
   add_dma_fence(job_dma_fence, &obj->resv);

   dma_resv_unlock(obj->resv);
   put_object(obj);

.. _Spinlock iteration:

Accessing the gpu_vm's lists without the dma_resv lock held
===========================================================

Some drivers will hold the gpu_vm's dma_resv lock when accessing the
gpu_vm's evict list and external objects lists. However, there are
drivers that need to access these lists without the dma_resv lock
held, for example due to asynchronous state updates from within the
dma_fence signalling critical path. In such cases, a spinlock can be
used to protect manipulation of the lists. However, since higher level
sleeping locks need to be taken for each list item while iterating
over the lists, the items already iterated over need to be
temporarily moved to a private list and the spinlock released
while processing each item:

.. code-block:: C

    struct list_head still_in_list;

    INIT_LIST_HEAD(&still_in_list);

    spin_lock(&gpu_vm->list_lock);
    do {
            struct list_head *entry = list_first_entry_or_null(&gpu_vm->list, head);

            if (!entry)
                    break;

            list_move_tail(&entry->head, &still_in_list);
            list_entry_get_unless_zero(entry);
            spin_unlock(&gpu_vm->list_lock);

            process(entry);

            spin_lock(&gpu_vm->list_lock);
            list_entry_put(entry);
    } while (true);

    list_splice_tail(&still_in_list, &gpu_vm->list);
    spin_unlock(&gpu_vm->list_lock);

Due to the additional locking and atomic operations, drivers that *can*
avoid accessing the gpu_vm's list outside of the dma_resv lock
might also want to avoid this iteration scheme, particularly if the
driver anticipates a large number of list items. For lists where the
anticipated number of list items is small, where list iteration doesn't
happen very often or if there is a significant additional cost
associated with each iteration, the atomic operation overhead
associated with this type of iteration is, most likely, negligible. Note that
if this scheme is used, it is necessary to make sure this list
iteration is protected by an outer level lock or semaphore, since list
items are temporarily pulled off the list while iterating. It is
also worth mentioning that the local list ``still_in_list`` should
also be considered protected by the ``gpu_vm->list_lock``, and it is
thus possible that items can be removed also from the local list
concurrently with list iteration.

Please refer to the :ref:`DRM GPUVM locking section
<drm_gpuvm_locking>` and its internal
:c:func:`get_next_vm_bo_from_list` function.
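
The iteration scheme can also be exercised outside the kernel. The
following is a single-threaded user-space sketch under stated
assumptions: every ``toy_*`` name is hypothetical, a boolean with
asserts stands in for the spinlock so that incorrect lock pairing is
caught, and a plain integer replaces the ``get_unless_zero`` style
reference count. It is meant to show the move-to-private-list pattern
and the final splice, not to be a drop-in implementation:

.. code-block:: C

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>

   /* Minimal circular doubly-linked list, loosely modeled on list_head. */
   struct toy_node {
           struct toy_node *prev, *next;
   };

   static void toy_list_init(struct toy_node *h) { h->prev = h->next = h; }
   static bool toy_list_empty(const struct toy_node *h) { return h->next == h; }

   static void toy_list_move_tail(struct toy_node *n, struct toy_node *h)
   {
           /* Unlink from the current list ... */
           n->prev->next = n->next;
           n->next->prev = n->prev;
           /* ... and append to the list headed by h. */
           n->prev = h->prev;
           n->next = h;
           h->prev->next = n;
           h->prev = n;
   }

   struct toy_item {
           struct toy_node head;   /* first member: node pointer == item pointer */
           int refcount;           /* protected by the list lock in this sketch */
           int processed;
   };

   struct toy_vm {
           bool list_locked;       /* models gpu_vm->list_lock */
           struct toy_node list;
   };

   static void toy_lock(struct toy_vm *vm)
   {
           assert(!vm->list_locked);
           vm->list_locked = true;
   }

   static void toy_unlock(struct toy_vm *vm)
   {
           assert(vm->list_locked);
           vm->list_locked = false;
   }

   /* Stands in for work that takes sleeping locks: must run unlocked. */
   static void toy_process(struct toy_vm *vm, struct toy_item *item)
   {
           assert(!vm->list_locked);
           item->processed++;
   }

   static int toy_iterate(struct toy_vm *vm)
   {
           struct toy_node still_in_list;
           int visited = 0;

           toy_list_init(&still_in_list);
           toy_lock(vm);
           while (!toy_list_empty(&vm->list)) {
                   struct toy_item *item = (struct toy_item *)vm->list.next;

                   toy_list_move_tail(&item->head, &still_in_list);
                   item->refcount++;       /* keep the item alive while unlocked */
                   toy_unlock(vm);

                   toy_process(vm, item);
                   visited++;

                   toy_lock(vm);
                   item->refcount--;
           }
           /* Put everything back, preserving the original order. */
           while (!toy_list_empty(&still_in_list))
                   toy_list_move_tail(still_in_list.next, &vm->list);
           toy_unlock(vm);

           return visited;
   }

   /* Returns 1 if three items were each processed once and order was kept. */
   int toy_iterate_demo(void)
   {
           struct toy_vm vm = { .list_locked = false };
           struct toy_item items[3];
           bool ok;
           int i;

           toy_list_init(&vm.list);
           for (i = 0; i < 3; i++) {
                   items[i].refcount = 0;
                   items[i].processed = 0;
                   toy_list_init(&items[i].head);
                   toy_list_move_tail(&items[i].head, &vm.list);
           }

           ok = toy_iterate(&vm) == 3;
           ok = ok && vm.list.next == &items[0].head &&
                vm.list.prev == &items[2].head;
           for (i = 0; i < 3; i++)
                   ok = ok && items[i].processed == 1 && items[i].refcount == 0;
           return ok ? 1 : 0;
   }

Because the sketch is single-threaded it cannot show concurrent removal
from ``still_in_list``; in a real driver that possibility is the reason
the private list must be treated as protected by the same lock.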


userptr gpu_vmas
================

A userptr gpu_vma is a gpu_vma that, instead of mapping a buffer object to a
GPU virtual address range, directly maps a CPU mm range of anonymous-
or file page-cache pages.
A very simple approach would be to just pin the pages using
pin_user_pages() at bind time and unpin them at unbind time, but this
creates a Denial-Of-Service vector since a single user-space process
would be able to pin down all of system memory, which is not
desirable. (For special use-cases and assuming proper accounting, pinning might
still be a desirable feature, though). What we need to do in the
general case is to obtain a reference to the desired pages, make sure
we are notified using an MMU notifier just before the CPU mm unmaps the
pages, dirty them if they are not mapped read-only to the GPU, and
then drop the reference.
When we are notified by the MMU notifier that the CPU mm is about to drop the
pages, we need to stop GPU access to the pages by waiting for VM idle
in the MMU notifier and make sure that before the next time the GPU
tries to access whatever is now present in the CPU mm range, we unmap
the old pages from the GPU page tables and repeat the process of
obtaining new page references. (See the :ref:`notifier example
<Invalidation example>` below). Note that when the core mm decides to
launder pages, we get such an unmap MMU notification and can mark the
pages dirty again before the next GPU access. We also get similar MMU
notifications for NUMA accounting which the GPU driver doesn't really
need to care about, but so far it has proven difficult to exclude
certain notifications.

Using an MMU notifier for device DMA (and other methods) is described in
:ref:`the pin_user_pages() documentation <mmu-notifier-registration-case>`.

Now, the method of obtaining struct page references using
get_user_pages() unfortunately can't be used under a dma_resv lock
since that would violate the locking order of the dma_resv lock vs the
mmap_lock that is grabbed when resolving a CPU pagefault. This means
the gpu_vm's list of userptr gpu_vmas needs to be protected by an
outer lock, which in our example below is the ``gpu_vm->lock``.

The MMU interval seqlock for a userptr gpu_vma is used in the following
way:

.. code-block:: C

   // Exclusive locking mode here is strictly needed only if there are
   // invalidated userptr gpu_vmas present, to avoid concurrent userptr
   // revalidations of the same userptr gpu_vma.
   down_write(&gpu_vm->lock);
   retry:

   // Note: mmu_interval_read_begin() blocks until there is no
   // invalidation notifier running anymore.
   seq = mmu_interval_read_begin(&gpu_vma->userptr_interval);
   if (seq != gpu_vma->saved_seq) {
           obtain_new_page_pointers(&gpu_vma);
           dma_resv_lock(&gpu_vm->resv);
           add_gpu_vma_to_revalidate_list(&gpu_vma, &gpu_vm);
           dma_resv_unlock(&gpu_vm->resv);
           gpu_vma->saved_seq = seq;
   }

   // The usual revalidation goes here.

   // Final userptr sequence validation may not happen before the
   // submission dma_fence is added to the gpu_vm's resv, from the POV
   // of the MMU invalidation notifier. Hence the
   // userptr_notifier_lock that will make them appear atomic.

   add_dependencies(&gpu_job, &gpu_vm->resv);
   down_read(&gpu_vm->userptr_notifier_lock);
   if (mmu_interval_read_retry(&gpu_vma->userptr_interval, gpu_vma->saved_seq)) {
          up_read(&gpu_vm->userptr_notifier_lock);
          goto retry;
   }

   job_dma_fence = gpu_submit(&gpu_job);

   add_dma_fence(job_dma_fence, &gpu_vm->resv);

   for_each_external_obj(gpu_vm, &obj)
          add_dma_fence(job_dma_fence, &obj->resv);

   dma_resv_unlock_all_resv_locks();
   up_read(&gpu_vm->userptr_notifier_lock);
   up_write(&gpu_vm->lock);

The code between ``mmu_interval_read_begin()`` and the
``mmu_interval_read_retry()`` marks the read side critical section of
what we call the ``userptr_seqlock``. In reality, the gpu_vm's userptr
gpu_vma list is looped through, and the check is done for *all* of its
userptr gpu_vmas, although we only show a single one here.

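
The retry protocol of the exec function can be illustrated with a small,
single-threaded userspace model. This is only a sketch under stated
assumptions: the ``model_*`` names are hypothetical and not part of any
kernel API, the sequence count is reduced to a plain counter that each
invalidation bumps, and all locking, page pinning and job submission are
left out. It only shows that page pointers are re-obtained whenever the
saved sequence number is stale, and that an attempt is retried when an
invalidation lands between the begin and the final check.

.. code-block:: C

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>

   /* Hypothetical model types; these are NOT kernel structures. */
   struct model_interval {
           unsigned long seq;              /* bumped by every invalidation */
   };

   struct model_userptr {
           struct model_interval interval;
           unsigned long saved_seq;        /* sequence the page pointers belong to */
           unsigned int repin_count;       /* times page pointers were re-obtained */
   };

   /* Stand-in for mmu_interval_read_begin(); the model never blocks. */
   static unsigned long model_read_begin(const struct model_interval *iv)
   {
           return iv->seq;
   }

   /* Stand-in for mmu_interval_read_retry(). */
   static bool model_read_retry(const struct model_interval *iv,
                                unsigned long seq)
   {
           return iv->seq != seq;
   }

   /* Stand-in for one run of the invalidation notifier. */
   static void model_invalidate(struct model_interval *iv)
   {
           iv->seq++;
   }

   /*
    * Model of one exec call. racing_invalidations is the number of attempts
    * that get an invalidation injected between the begin and the final
    * check. Returns the number of attempts needed before "submission".
    */
   static unsigned int model_exec(struct model_userptr *uptr,
                                  unsigned int racing_invalidations)
   {
           unsigned int attempts = 0;

           assert(uptr != NULL);
           for (;;) {
                   unsigned long seq = model_read_begin(&uptr->interval);

                   attempts++;
                   if (seq != uptr->saved_seq) {
                           /* Stands in for obtain_new_page_pointers(). */
                           uptr->repin_count++;
                           uptr->saved_seq = seq;
                   }
                   if (racing_invalidations) {
                           racing_invalidations--;
                           model_invalidate(&uptr->interval);
                   }
                   if (!model_read_retry(&uptr->interval, uptr->saved_seq))
                           return attempts; /* gpu_submit() would go here */
           }
   }

Starting with ``interval.seq == 1`` and a stale ``saved_seq == 0``, a call
``model_exec(&uptr, 2)`` needs three attempts and re-obtains the page
pointers three times. A following ``model_exec(&uptr, 0)`` succeeds on its
first attempt without re-obtaining anything.
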

The userptr gpu_vma MMU invalidation notifier might be called from
reclaim context and, again, to avoid locking order violations, we can't
take any dma_resv lock nor the ``gpu_vm->lock`` from within it.

.. _Invalidation example:
.. code-block:: C

  bool gpu_vma_userptr_invalidate(userptr_interval, cur_seq)
  {
          // Make sure the exec function either sees the new sequence
          // and backs off or we wait for the dma-fence:

          down_write(&gpu_vm->userptr_notifier_lock);
          mmu_interval_set_seq(userptr_interval, cur_seq);
          up_write(&gpu_vm->userptr_notifier_lock);

          // At this point, the exec function can't succeed in
          // submitting a new job, because cur_seq is an invalid
          // sequence number and will always cause a retry. When all
          // invalidation callbacks have run, the mmu notifier core will
          // flip the sequence number to a valid one. However we need to
          // stop gpu access to the old pages here.

          dma_resv_wait_timeout(&gpu_vm->resv, DMA_RESV_USAGE_BOOKKEEP,
                                false, MAX_SCHEDULE_TIMEOUT);
          return true;
  }

When this invalidation notifier returns, the GPU can no longer be
accessing the old pages of the userptr gpu_vma and needs to redo the
page-binding before a new GPU submission can succeed.

Efficient userptr gpu_vma exec_function iteration
_________________________________________________

If the gpu_vm's list of userptr gpu_vmas becomes large, it's
inefficient to iterate through the complete list of userptrs on each
exec function to check whether each userptr gpu_vma's saved
sequence number is stale. A solution to this is to put all
*invalidated* userptr gpu_vmas on a separate gpu_vm list and
only check the gpu_vmas present on this list on each exec
function.
This list will then lend itself very well to the spinlock
locking scheme that is
:ref:`described in the spinlock iteration section <Spinlock iteration>`, since
in the mmu notifier, where we add the invalidated gpu_vmas to the
list, it's not possible to take any outer locks like the
``gpu_vm->lock`` or the ``gpu_vm->resv`` lock. Note that the
``gpu_vm->lock`` still needs to be taken while iterating to ensure the list is
complete, as also mentioned in that section.

If using an invalidated userptr list like this, the retry check in the
exec function trivially becomes a check for whether the invalidated list
is empty.

Locking at bind and unbind time
===============================

At bind time, assuming a GEM object backed gpu_vma, each
gpu_vma needs to be associated with a gpu_vm_bo and that
gpu_vm_bo in turn needs to be added to the GEM object's
gpu_vm_bo list, and possibly to the gpu_vm's external object
list. This is referred to as *linking* the gpu_vma, and typically
requires that the ``gpu_vm->lock`` and the ``gem_object->gpuva_lock``
are held. When unlinking a gpu_vma the same locks should be held,
which ensures that when iterating over ``gpu_vmas``, either under
the ``gpu_vm->resv`` or the GEM object's dma_resv, the gpu_vmas
stay alive as long as the lock under which we iterate is not released. For
userptr gpu_vmas it's similarly required that during vma destroy, the
outer ``gpu_vm->lock`` is held, since otherwise when iterating over
the invalidated userptr list as described in the previous section,
there is nothing keeping those userptr gpu_vmas alive.

Locking for recoverable page-fault page-table updates
=====================================================

There are two important things we need to ensure with locking for
recoverable page-faults:

* At the time we return pages back to the system / allocator for
  reuse, there should be no remaining GPU mappings and any GPU TLB
  must have been flushed.
* The unmapping and mapping of a gpu_vma must not race.


Since the unmapping (or zapping) of GPU ptes is typically taking place
where it is hard or even impossible to take any outer level locks we
must either introduce a new lock that is held at both mapping and
unmapping time, or look at the locks we do hold at unmapping time and
make sure that they are held also at mapping time. For userptr
gpu_vmas, the ``userptr_seqlock`` is held in write mode in the mmu
invalidation notifier where zapping happens. Hence, if the
``userptr_seqlock`` as well as the ``gpu_vm->userptr_notifier_lock``
is held in read mode during mapping, it will not race with the
zapping. For GEM object backed gpu_vmas, zapping will take place under
the GEM object's dma_resv and ensuring that the dma_resv is held also
when populating the page-tables for any gpu_vma pointing to the GEM
object, will similarly ensure we are race-free.

If any part of the mapping is performed asynchronously
under a dma-fence with these locks released, the zapping will need to
wait for that dma-fence to signal under the relevant lock before
starting to modify the page-table.

Since modifying the
page-table structure in a way that frees up page-table memory
might also require outer level locks, the zapping of GPU ptes
typically focuses only on zeroing page-table or page-directory entries
and flushing TLB, whereas freeing of page-table memory is deferred to
unbind or rebind time.

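
The split between zapping and freeing can be sketched with a toy userspace
page-table model. Everything here is an assumption made for illustration:
the ``model_*`` names, the single-level table and the ``tlb_flushes``
counter are hypothetical and do not correspond to any driver's data
structures. The zap path only zeroes entries and records a TLB flush, while
the page-table memory itself is released later from the unbind path.

.. code-block:: C

   #include <assert.h>
   #include <stdbool.h>
   #include <stdint.h>
   #include <stdlib.h>

   #define MODEL_PTES 8u
   #define MODEL_PTE_VALID 1ull

   /* Hypothetical single-level page table; NOT a driver structure. */
   struct model_pt {
           uint64_t *ptes;             /* page-table memory, NULL once freed */
           unsigned int tlb_flushes;   /* number of modelled TLB flushes */
   };

   static bool model_pt_init(struct model_pt *pt)
   {
           pt->ptes = calloc(MODEL_PTES, sizeof(*pt->ptes));
           pt->tlb_flushes = 0;
           return pt->ptes != NULL;
   }

   /* Bind or rebind: populate entries [first, first + count). */
   static void model_map(struct model_pt *pt, unsigned int first,
                         unsigned int count, uint64_t addr)
   {
           assert(first + count <= MODEL_PTES);
           for (unsigned int i = 0; i < count; i++)
                   pt->ptes[first + i] = (addr + 4096ull * i) | MODEL_PTE_VALID;
   }

   /* Zap: zero the entries and flush; no memory is freed here. */
   static void model_zap(struct model_pt *pt, unsigned int first,
                         unsigned int count)
   {
           assert(first + count <= MODEL_PTES);
           for (unsigned int i = 0; i < count; i++)
                   pt->ptes[first + i] = 0;
           pt->tlb_flushes++;
   }

   /* Unbind: the deferred freeing of the page-table memory happens here. */
   static void model_unbind(struct model_pt *pt)
   {
           free(pt->ptes);
           pt->ptes = NULL;
   }

   /* Count the entries that still have the valid bit set. */
   static unsigned int model_valid_ptes(const struct model_pt *pt)
   {
           unsigned int valid = 0;

           for (unsigned int i = 0; i < MODEL_PTES; i++)
                   valid += (pt->ptes[i] & MODEL_PTE_VALID) ? 1 : 0;
           return valid;
   }

After ``model_map()`` followed by ``model_zap()`` on the same range, no
valid entries remain and one flush has been recorded, yet ``pt->ptes`` is
still allocated. Only ``model_unbind()`` releases it.
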