.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)

===============
VM_BIND locking
===============

This document attempts to describe what's needed to get VM_BIND locking right,
including the userptr mmu_notifier locking. It also discusses some
optimizations to get rid of the looping through of all userptr mappings and
external / shared object mappings that is needed in the simplest
implementation. In addition, there is a section discussing the locking
required for implementing recoverable pagefaults.

The DRM GPUVM set of helpers
============================

There is a set of helpers for drivers implementing VM_BIND, and this
set of helpers implements much, but not all of the locking described
in this document. In particular, it is currently lacking a userptr
implementation. This document does not intend to describe the DRM GPUVM
implementation in detail, but it is covered in :ref:`its own
documentation <drm_gpuvm>`. It is highly recommended for any driver
implementing VM_BIND to use the DRM GPUVM helpers and to extend them if
common functionality is missing.

Nomenclature
============

* ``gpu_vm``: Abstraction of a virtual GPU address space with
  meta-data. Typically one per client (DRM file-private), or one per
  execution context.
* ``gpu_vma``: Abstraction of a GPU address range within a gpu_vm with
  associated meta-data. The backing storage of a gpu_vma can either be
  a GEM object or anonymous or page-cache pages mapped also into the CPU
  address space for the process.
* ``gpu_vm_bo``: Abstracts the association of a GEM object and
  a VM. The GEM object maintains a list of gpu_vm_bos, and each gpu_vm_bo
  maintains a list of gpu_vmas.
* ``userptr gpu_vma or just userptr``: A gpu_vma, whose backing store
  is anonymous or page-cache pages as described above.
* ``revalidating``: Revalidating a gpu_vma means making the latest version
  of the backing store resident and making sure the gpu_vma's
  page-table entries point to that backing store.
* ``dma_fence``: A struct dma_fence that is similar to a struct completion
  and which tracks GPU activity. When the GPU activity is finished,
  the dma_fence signals. Please refer to the ``DMA Fences`` section of
  the :doc:`dma-buf doc </driver-api/dma-buf>`.
* ``dma_resv``: A struct dma_resv (a.k.a reservation object) that is used
  to track GPU activity in the form of multiple dma_fences on a
  gpu_vm or a GEM object. The dma_resv contains an array / list
  of dma_fences and a lock that needs to be held when adding
  additional dma_fences to the dma_resv. The lock is of a type that
  allows deadlock-safe locking of multiple dma_resvs in arbitrary
  order. Please refer to the ``Reservation Objects`` section of the
  :doc:`dma-buf doc </driver-api/dma-buf>`.
* ``exec function``: An exec function is a function that revalidates all
  affected gpu_vmas, submits a GPU command batch and registers the
  dma_fence representing the GPU command's activity with all affected
  dma_resvs. For completeness, although not covered by this document,
  it's worth mentioning that an exec function may also be the
  revalidation worker that is used by some drivers in compute /
  long-running mode.
* ``local object``: A GEM object which is only mapped within a
  single VM. Local GEM objects share the gpu_vm's dma_resv.
* ``external object``: a.k.a shared object: A GEM object which may be shared
  by multiple gpu_vms and whose backing storage may be shared with
  other drivers.

Locks and locking order
=======================

One of the benefits of VM_BIND is that local GEM objects share the gpu_vm's
dma_resv object and hence the dma_resv lock. So, even with a huge
number of local GEM objects, only one lock is needed to make the exec
sequence atomic.

The following locks and locking orders are used; a sketch of how a
driver might group them in its gpu_vm follows the list:

* The ``gpu_vm->lock`` (optionally an rwsem). Protects the gpu_vm's
  data structure keeping track of gpu_vmas. It can also protect the
  gpu_vm's list of userptr gpu_vmas. With a CPU mm analogy this would
  correspond to the mmap_lock. An rwsem allows several readers to walk
  the VM tree concurrently, but the benefit of that concurrency most
  likely varies from driver to driver.
* The ``userptr_seqlock``. This lock is taken in read mode for each
  userptr gpu_vma on the gpu_vm's userptr list, and in write mode during mmu
  notifier invalidation. This is not a real seqlock but described in
  ``mm/mmu_notifier.c`` as a "Collision-retry read side/write side
  'lock' a lot like a seqcount. However this allows multiple
  write-sides to hold it at once...". The read side critical section
  is enclosed by ``mmu_interval_read_begin() /
  mmu_interval_read_retry()`` with ``mmu_interval_read_begin()``
  sleeping if the write side is held.
  The write side is held by the core mm while calling mmu interval
  invalidation notifiers.
* The ``gpu_vm->resv`` lock. Protects the gpu_vm's list of gpu_vmas needing
  rebinding, as well as the residency state of all the gpu_vm's local
  GEM objects.
  Furthermore, it typically protects the gpu_vm's lists of evicted and
  external GEM objects.
* The ``gpu_vm->userptr_notifier_lock``. This is an rwsem that is
  taken in read mode during exec and write mode during a mmu
  notifier invalidation. The userptr notifier lock is per gpu_vm.
* The ``gem_object->gpuva_lock``. This lock protects the GEM object's
  list of gpu_vm_bos. This is usually the same lock as the GEM
  object's dma_resv, but some drivers protect this list differently,
  see below.
* The ``gpu_vm list spinlocks``. With some implementations they are needed
  to be able to update the gpu_vm evicted- and external object
  lists. For those implementations, the spinlocks need to be held when
  the lists are manipulated. However, to avoid locking order violations
  with the dma_resv locks, a special scheme is needed when iterating
  over the lists.
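To make the locking orders above concrete, the following is a minimal
structural sketch of how a driver might group these locks and lists in
its gpu_vm. The field layout is an illustrative assumption only, not
part of the DRM GPUVM API:

.. code-block:: C

   struct gpu_vm {
           // Outer lock. Protects the gpu_vma tree and, here, the
           // userptr list. Corresponds to the CPU mm mmap_lock.
           struct rw_semaphore lock;

           // Shared by all local GEM objects. Protects the rebind
           // list and, typically, the evicted- and external object
           // lists.
           struct dma_resv *resv;

           // Taken in read mode by the exec function, in write mode
           // by userptr mmu notifier invalidation.
           struct rw_semaphore userptr_notifier_lock;

           // Innermost, for drivers that must update the evicted-
           // and external object lists without dma_resv held.
           spinlock_t list_lock;

           struct list_head userptr_list;
           struct list_head rebind_list;
           struct list_head evict_list;
           struct list_head external_objects;
   };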
.. _gpu_vma lifetime:

Protection and lifetime of gpu_vm_bos and gpu_vmas
==================================================

The GEM object's list of gpu_vm_bos, and the gpu_vm_bo's list of gpu_vmas
is protected by the ``gem_object->gpuva_lock``, which is typically the
same as the GEM object's dma_resv, but if the driver
needs to access these lists from within a dma_fence signalling
critical section, it can instead choose to protect them with a
separate lock, which can be locked from within the dma_fence signalling
critical section. Such drivers then need to pay additional attention
to what locks need to be taken from within the loop when iterating
over the gpu_vm_bo and gpu_vma lists to avoid locking-order violations.

The DRM GPUVM set of helpers provide lockdep asserts that this lock is
held in relevant situations and also provides a means of making itself
aware of which lock is actually used: :c:func:`drm_gpuvm_resv_protected`.

Each gpu_vm_bo holds a reference counted pointer to the underlying GEM
object, and each gpu_vma holds a reference counted pointer to the
gpu_vm_bo. When iterating over the GEM object's list of gpu_vm_bos and
over the gpu_vm_bo's list of gpu_vmas, the ``gem_object->gpuva_lock`` must
not be dropped, otherwise gpu_vmas attached to a gpu_vm_bo may
disappear without notice since those are not reference-counted. A
driver may implement its own scheme to allow this at the expense of
additional complexity, but this is outside the scope of this document.

In the DRM GPUVM implementation, each gpu_vm_bo and each gpu_vma
holds a reference count on the gpu_vm itself. Due to this, and to avoid circular
reference counting, cleanup of the gpu_vm's gpu_vmas must not be done from the
gpu_vm's destructor. Drivers typically implement a gpu_vm close
function for this cleanup. The gpu_vm close function will abort gpu
execution using this VM, unmap all gpu_vmas and release page-table memory.
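In the simplified pseudo-code style used throughout this document, a
gpu_vm close function could look like the sketch below. The helper
names are illustrative assumptions; the important points are the
teardown order and that the gpu_vm destructor itself no longer touches
any gpu_vmas:

.. code-block:: C

   void gpu_vm_close(gpu_vm)
   {
           // Stop accepting new jobs and wait for pending GPU activity.
           abort_gpu_execution(gpu_vm);
           dma_resv_wait_timeout(&gpu_vm->resv, DMA_RESV_USAGE_BOOKKEEP,
                                 false, MAX_SCHEDULE_TIMEOUT);

           // Unmap and unlink all gpu_vmas. Unlinking drops the gpu_vma
           // references on the gpu_vm_bos, which in turn drop their
           // references on the gpu_vm itself. For external objects,
           // the object's own gpuva_lock / dma_resv needs to be taken
           // as well for the unlinking.
           down_write(&gpu_vm->lock);
           dma_resv_lock(gpu_vm->resv);
           for_each_gpu_vma_of_gpu_vm(gpu_vm, &gpu_vma) {
                   unmap_gpu_vma(&gpu_vma);
                   unlink_gpu_vma(&gpu_vma);
           }
           dma_resv_unlock(gpu_vm->resv);
           up_write(&gpu_vm->lock);

           // With all gpu_vmas gone, page-table memory can be released.
           release_page_table_memory(gpu_vm);
   }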
Revalidation and eviction of local objects
==========================================

Note that in all the code examples given below we use simplified
pseudo-code. In particular, the dma_resv deadlock avoidance algorithm
as well as reserving memory for dma_resv fences is omitted.

Revalidation
____________
With VM_BIND, all local objects need to be resident when the gpu is
executing using the gpu_vm, and the objects need to have valid
gpu_vmas set up pointing to them. Typically, each gpu command buffer
submission is therefore preceded with a re-validation section:

.. code-block:: C

   dma_resv_lock(gpu_vm->resv);

   // Validation section starts here.
   for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) {
           validate_gem_bo(&gpu_vm_bo->gem_bo);

           // The following list iteration needs the GEM object's
           // dma_resv to be held (it protects the gpu_vm_bo's list of
           // gpu_vmas), but since local GEM objects share the gpu_vm's
           // dma_resv, it is already held at this point.
           for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
                   move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list);
   }

   for_each_gpu_vma_on_rebind_list(&gpu_vm->rebind_list, &gpu_vma) {
           rebind_gpu_vma(&gpu_vma);
           remove_gpu_vma_from_rebind_list(&gpu_vma);
   }
   // Validation section ends here, and job submission starts.

   add_dependencies(&gpu_job, &gpu_vm->resv);
   job_dma_fence = gpu_submit(&gpu_job);

   add_dma_fence(job_dma_fence, &gpu_vm->resv);
   dma_resv_unlock(gpu_vm->resv);

The reason for having a separate gpu_vm rebind list is that there
might be userptr gpu_vmas that are not mapping a buffer object that
also need rebinding.

Eviction
________

Eviction of one of these local objects will then look similar to the
following:

.. code-block:: C

   obj = get_object_from_lru();

   dma_resv_lock(obj->resv);
   for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo)
           add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);

   add_dependencies(&eviction_job, &obj->resv);
   job_dma_fence = gpu_submit(&eviction_job);
   add_dma_fence(job_dma_fence, &obj->resv);

   dma_resv_unlock(obj->resv);
   put_object(obj);

Note that since the object is local to the gpu_vm, it will share the gpu_vm's
dma_resv lock such that ``obj->resv == gpu_vm->resv``.
The gpu_vm_bos marked for eviction are put on the gpu_vm's evict list,
which is protected by ``gpu_vm->resv``. During eviction all local
objects have their dma_resv locked and, due to the above equality, also
the gpu_vm's dma_resv protecting the gpu_vm's evict list is locked.

With VM_BIND, gpu_vmas don't need to be unbound before eviction,
since the driver must ensure that the eviction blit or copy will wait
for GPU idle or depend on all previous GPU activity. Furthermore, any
subsequent attempt by the GPU to access freed memory through the
gpu_vma will be preceded by a new exec function, with a revalidation
section which will make sure all gpu_vmas are rebound. The eviction
code holding the object's dma_resv while revalidating will ensure a
new exec function may not race with the eviction.

A driver can be implemented in such a way that, on each exec function,
only a subset of vmas are selected for rebind. In that case, all vmas that are
*not* selected for rebind must be unbound before the exec
function workload is submitted.
Locking with external buffer objects
====================================

Since external buffer objects may be shared by multiple gpu_vm's they
can't share their reservation object with a single gpu_vm. Instead
they need to have a reservation object of their own. The external
objects bound to a gpu_vm using one or many gpu_vmas are therefore put on a
per-gpu_vm list which is protected by the gpu_vm's dma_resv lock or
one of the :ref:`gpu_vm list spinlocks <Spinlock iteration>`. Once
the gpu_vm's reservation object is locked, it is safe to traverse the
external object list and lock the dma_resvs of all external
objects. However, if instead a list spinlock is used, a more elaborate
iteration scheme needs to be used.

At eviction time, the gpu_vm_bos of *all* the gpu_vms a shared
object is bound to need to be put on their gpu_vm's evict list.
However, when evicting an external object, the dma_resvs of the
gpu_vms the object is bound to are typically not held. Only
the object's private dma_resv can be guaranteed to be held. If there
is a ww_acquire context at hand at eviction time we could grab those
dma_resvs but that could cause expensive ww_mutex rollbacks. A simple
option is to just mark the gpu_vm_bos of the evicted GEM object with
an ``evicted`` bool that is inspected before the next time the
corresponding gpu_vm evicted list needs to be traversed, for example when
traversing the list of external objects and locking them. At that point,
both the gpu_vm's dma_resv and the object's dma_resv are held, and a
gpu_vm_bo marked evicted can then be added to the gpu_vm's list of
evicted gpu_vm_bos. The ``evicted`` bool is formally protected by the
object's dma_resv.

The exec function then becomes

.. code-block:: C

   dma_resv_lock(gpu_vm->resv);

   // External object list is protected by the gpu_vm->resv lock.
   for_each_gpu_vm_bo_on_extobj_list(gpu_vm, &gpu_vm_bo) {
           dma_resv_lock(gpu_vm_bo.gem_obj->resv);
           if (gpu_vm_bo_marked_evicted(&gpu_vm_bo))
                   add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
   }

   for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) {
           validate_gem_bo(&gpu_vm_bo->gem_bo);

           for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
                   move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list);
   }

   for_each_gpu_vma_on_rebind_list(&gpu_vm->rebind_list, &gpu_vma) {
           rebind_gpu_vma(&gpu_vma);
           remove_gpu_vma_from_rebind_list(&gpu_vma);
   }

   add_dependencies(&gpu_job, &gpu_vm->resv);
   job_dma_fence = gpu_submit(&gpu_job);

   add_dma_fence(job_dma_fence, &gpu_vm->resv);
   for_each_external_obj(gpu_vm, &obj)
           add_dma_fence(job_dma_fence, &obj->resv);
   dma_resv_unlock_all_resv_locks();

And the corresponding shared-object aware eviction would look like:

.. code-block:: C

   obj = get_object_from_lru();

   dma_resv_lock(obj->resv);
   for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo)
           if (object_is_vm_local(obj))
                   add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
           else
                   mark_gpu_vm_bo_evicted(&gpu_vm_bo);

   add_dependencies(&eviction_job, &obj->resv);
   job_dma_fence = gpu_submit(&eviction_job);
   add_dma_fence(job_dma_fence, &obj->resv);

   dma_resv_unlock(obj->resv);
   put_object(obj);

.. _Spinlock iteration:

Accessing the gpu_vm's lists without the dma_resv lock held
===========================================================

Some drivers will hold the gpu_vm's dma_resv lock when accessing the
gpu_vm's evict list and external objects lists. However, there are
drivers that need to access these lists without the dma_resv lock
held, for example due to asynchronous state updates from within the
dma_fence signalling critical path. In such cases, a spinlock can be
used to protect manipulation of the lists. However, since higher level
sleeping locks need to be taken for each list item while iterating
over the lists, the items already iterated over need to be
temporarily moved to a private list and the spinlock released
while processing each item:

.. code-block:: C

   struct list_head still_in_list;

   INIT_LIST_HEAD(&still_in_list);

   spin_lock(&gpu_vm->list_lock);
   do {
           struct list_head *entry = list_first_entry_or_null(&gpu_vm->list, head);

           if (!entry)
                   break;

           list_move_tail(&entry->head, &still_in_list);
           list_entry_get_unless_zero(entry);
           spin_unlock(&gpu_vm->list_lock);

           process(entry);

           spin_lock(&gpu_vm->list_lock);
           list_entry_put(entry);
   } while (true);

   list_splice_tail(&still_in_list, &gpu_vm->list);
   spin_unlock(&gpu_vm->list_lock);

Due to the additional locking and atomic operations, drivers that *can*
avoid accessing the gpu_vm's lists outside of the dma_resv lock
might want to avoid also this iteration scheme, particularly if the
driver anticipates a large number of list items. For lists where the
anticipated number of list items is small, where list iteration doesn't
happen very often or if there is a significant additional cost
associated with each iteration, the atomic operation overhead
associated with this type of iteration is, most likely, negligible. Note that
if this scheme is used, it is necessary to make sure this list
iteration is protected by an outer level lock or semaphore, since list
items are temporarily pulled off the list while iterating. It is
also worth mentioning that the local list ``still_in_list`` should
also be considered protected by the ``gpu_vm->list_lock``, and it is
thus possible that items can be removed from the local list
concurrently with list iteration.

Please refer to the :ref:`DRM GPUVM locking section
<drm_gpuvm_locking>` and its internal
:c:func:`get_next_vm_bo_from_list` function.
userptr gpu_vmas
================

A userptr gpu_vma is a gpu_vma that, instead of mapping a buffer object to a
GPU virtual address range, directly maps a CPU mm range of anonymous-
or file page-cache pages.
A very simple approach would be to just pin the pages using
pin_user_pages() at bind time and unpin them at unbind time, but this
creates a Denial-Of-Service vector since a single user-space process
would be able to pin down all of system memory, which is not
desirable. (For special use-cases and assuming proper accounting, pinning might
still be a desirable feature, though). What we need to do in the
general case is to obtain a reference to the desired pages, make sure
we are notified using a MMU notifier just before the CPU mm unmaps the
pages, dirty them if they are not mapped read-only to the GPU, and
then drop the reference.
When we are notified by the MMU notifier that the CPU mm is about to drop the
pages, we need to stop GPU access to the pages by waiting for VM idle
in the MMU notifier and make sure that before the next time the GPU
tries to access whatever is now present in the CPU mm range, we unmap
the old pages from the GPU page tables and repeat the process of
obtaining new page references. (See the :ref:`notifier example
<Invalidation example>` below). Note that when the core mm decides to
laundry pages, we get such an unmap MMU notifier call as well, and can mark the
pages dirty again before the next GPU access. We also get similar MMU
notifications for NUMA accounting, which the GPU driver doesn't really
need to care about, but so far it has proven difficult to exclude
certain notifications.

Using a MMU notifier for device DMA (and other methods) is described in
:ref:`the pin_user_pages() documentation <mmu-notifier-registration-case>`.

Now, the method of obtaining struct page references using
get_user_pages() unfortunately can't be used under a dma_resv lock
since that would violate the locking order of the dma_resv lock vs the
mmap_lock that is grabbed when resolving a CPU pagefault. This means
the gpu_vm's list of userptr gpu_vmas needs to be protected by an
outer lock, which in our example below is the ``gpu_vm->lock``.

The MMU interval seqlock for a userptr gpu_vma is used in the following
way:

.. code-block:: C

   // Exclusive locking mode here is strictly needed only if there are
   // invalidated userptr gpu_vmas present, to avoid concurrent userptr
   // revalidations of the same userptr gpu_vma.
   down_write(&gpu_vm->lock);
   retry:

   // Note: mmu_interval_read_begin() blocks until there is no
   // invalidation notifier running anymore.
   seq = mmu_interval_read_begin(&gpu_vma->userptr_interval);
   if (seq != gpu_vma->saved_seq) {
           obtain_new_page_pointers(&gpu_vma);
           dma_resv_lock(gpu_vm->resv);
           add_gpu_vma_to_revalidate_list(&gpu_vma, &gpu_vm);
           dma_resv_unlock(gpu_vm->resv);
           gpu_vma->saved_seq = seq;
   }

   // The usual revalidation goes here.

   // Final userptr sequence validation may not happen before the
   // submission dma_fence is added to the gpu_vm's resv, from the
   // point of view of the MMU invalidation notifier. Hence the
   // userptr_notifier_lock that will make them appear atomic.

   add_dependencies(&gpu_job, &gpu_vm->resv);
   down_read(&gpu_vm->userptr_notifier_lock);
   if (mmu_interval_read_retry(&gpu_vma->userptr_interval, gpu_vma->saved_seq)) {
           up_read(&gpu_vm->userptr_notifier_lock);
           goto retry;
   }

   job_dma_fence = gpu_submit(&gpu_job);

   add_dma_fence(job_dma_fence, &gpu_vm->resv);

   for_each_external_obj(gpu_vm, &obj)
           add_dma_fence(job_dma_fence, &obj->resv);

   dma_resv_unlock_all_resv_locks();
   up_read(&gpu_vm->userptr_notifier_lock);
   up_write(&gpu_vm->lock);

The code between ``mmu_interval_read_begin()`` and the check for
``mmu_interval_read_retry()`` marks the read side critical section of
what we call the ``userptr_seqlock``. In reality, the gpu_vm's userptr
gpu_vma list is looped through, and the check is done for *all* of its
userptr gpu_vmas, although we only show a single one here.

The userptr gpu_vma MMU invalidation notifier might be called from
reclaim context and, again, to avoid locking order violations, we can't
take any dma_resv lock nor the gpu_vm->lock from within it.
.. _Invalidation example:
.. code-block:: C

   bool gpu_vma_userptr_invalidate(userptr_interval, mm_range, cur_seq)
   {
           // Make sure the exec function either sees the new sequence
           // and backs off or we wait for the dma-fence:

           down_write(&gpu_vm->userptr_notifier_lock);
           mmu_interval_set_seq(userptr_interval, cur_seq);
           up_write(&gpu_vm->userptr_notifier_lock);

           // At this point, the exec function can't succeed in
           // submitting a new job, because cur_seq is an invalid
           // sequence number and will always cause a retry. When all
           // invalidation callbacks are done, the mmu notifier core
           // will flip the sequence number to a valid one. However,
           // we need to stop gpu access to the old pages here.

           dma_resv_wait_timeout(&gpu_vm->resv, DMA_RESV_USAGE_BOOKKEEP,
                                 false, MAX_SCHEDULE_TIMEOUT);
           return true;
   }

When this invalidation notifier returns, the GPU can no longer be
accessing the old pages of the userptr gpu_vma and needs to redo the
page-binding before a new GPU submission can succeed.

Efficient userptr gpu_vma exec_function iteration
_________________________________________________

If the gpu_vm's list of userptr gpu_vmas becomes large, it's
inefficient to iterate through the complete list of userptrs on each
exec function to check whether each userptr gpu_vma's saved
sequence number is stale. A solution to this is to put all
*invalidated* userptr gpu_vmas on a separate gpu_vm list and
only check the gpu_vmas present on this list on each exec
function. This list will then lend itself very well to the spinlock
locking scheme that is
:ref:`described in the spinlock iteration section <Spinlock iteration>`, since
in the mmu notifier, where we add the invalidated gpu_vmas to the
list, it's not possible to take any outer locks like the
``gpu_vm->lock`` or the ``gpu_vm->resv`` lock. Note that the
``gpu_vm->lock`` still needs to be taken while iterating to ensure the list is
complete, as also mentioned in that section.

If using an invalidated userptr list like this, the retry check in the
exec function trivially becomes a check for the invalidated list being
empty. A sketch of the corresponding notifier-side update follows.
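The following is a rough sketch of that notifier-side update, extending
the invalidation notifier shown above. The ``invalidated_userptr_list``
and the spinlock name are illustrative assumptions; the point is that
the list spinlock, unlike the outer locks, may be taken from the
notifier:

.. code-block:: C

   bool gpu_vma_userptr_invalidate(userptr_interval, mm_range, cur_seq)
   {
           down_write(&gpu_vm->userptr_notifier_lock);
           mmu_interval_set_seq(userptr_interval, cur_seq);

           // Unlike the gpu_vm->lock or any dma_resv lock, the list
           // spinlock may be taken in reclaim context.
           spin_lock(&gpu_vm->list_lock);
           list_move_tail(&gpu_vma->invalidated_link,
                          &gpu_vm->invalidated_userptr_list);
           spin_unlock(&gpu_vm->list_lock);
           up_write(&gpu_vm->userptr_notifier_lock);

           dma_resv_wait_timeout(&gpu_vm->resv, DMA_RESV_USAGE_BOOKKEEP,
                                 false, MAX_SCHEDULE_TIMEOUT);
           return true;
   }

The exec function then revalidates only the gpu_vmas found on this
list, removing them from it under the ``gpu_vm->lock`` once
revalidation has succeeded.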
Locking at bind and unbind time
===============================

At bind time, assuming a GEM object backed gpu_vma, each
gpu_vma needs to be associated with a gpu_vm_bo and that
gpu_vm_bo in turn needs to be added to the GEM object's
gpu_vm_bo list, and possibly to the gpu_vm's external object
list. This is referred to as *linking* the gpu_vma, and typically
requires that the ``gpu_vm->lock`` and the ``gem_object->gpuva_lock``
are held. When unlinking a gpu_vma the same locks should be held,
and that ensures that when iterating over ``gpu_vmas``, either under
the ``gpu_vm->resv`` or the GEM object's dma_resv, the gpu_vmas
stay alive as long as the lock under which we iterate is not released. For
userptr gpu_vmas it's similarly required that during vma destroy, the
outer ``gpu_vm->lock`` is held, since otherwise when iterating over
the invalidated userptr list as described in the previous section,
there is nothing keeping those userptr gpu_vmas alive. A sketch of
bind-time linking follows.
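Keeping with the pseudo-code style of the previous examples, bind-time
linking could look like the following sketch, where all helper names
are illustrative assumptions:

.. code-block:: C

   int gpu_vm_bind(gpu_vm, gem_obj, ...)
   {
           down_write(&gpu_vm->lock);
           // The gpuva_lock; here assumed to be the object's dma_resv.
           dma_resv_lock(gem_obj->resv);

           // Find or create the gpu_vm_bo connecting @gem_obj and
           // @gpu_vm, create the new gpu_vma and link it.
           gpu_vm_bo = get_or_create_gpu_vm_bo(gpu_vm, gem_obj);
           gpu_vma = create_gpu_vma(&gpu_vm_bo, ...);
           add_gpu_vma_to_gpu_vm_bo_list(&gpu_vma, &gpu_vm_bo);

           // External objects are also put on the gpu_vm's
           // external object list.
           if (!object_is_vm_local(gem_obj))
                   add_obj_to_extobj_list(gpu_vm, &gpu_vm_bo);

           dma_resv_unlock(gem_obj->resv);
           up_write(&gpu_vm->lock);
           return 0;
   }

Unbinding takes the same locks in the same order, unlinking and
destroying the gpu_vma instead.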
Locking for recoverable page-fault page-table updates
=====================================================

There are two important things we need to ensure with sane locking when
handling recoverable page-faults:

* At the time we return pages back to the system / allocator for
  reuse, there should be no remaining GPU mappings and any GPU TLB
  must have been flushed.
* The unmapping and mapping of a gpu_vma must not race.

Since the unmapping (or zapping) of GPU ptes is typically taking place
where it is hard or even impossible to take any outer level locks we
must either introduce a new lock that is held at both mapping and
unmapping time, or look at the locks we do hold at unmapping time and
make sure that they are held also at mapping time. For userptr
gpu_vmas, the ``userptr_seqlock`` is held in write mode in the mmu
invalidation notifier where zapping happens. Hence, if the
``userptr_seqlock`` as well as the ``gpu_vm->userptr_notifier_lock``
is held in read mode during mapping, it will not race with the
zapping. For GEM object backed gpu_vmas, zapping will take place under
the GEM object's dma_resv, and ensuring that the dma_resv is held
when populating the page-tables for any gpu_vma pointing to the GEM
object will similarly ensure we are race-free.

If any part of the mapping is performed asynchronously
under a dma-fence with these locks released, the zapping will need to
wait for that dma-fence to signal under the relevant lock before
starting to modify the page-table.

Since modifying the page-table structure in a way that frees up
page-table memory might also require outer level locks, the zapping of
GPU ptes typically focuses only on zeroing page-table or page-directory
entries and flushing TLB, whereas freeing of page-table memory is
deferred to unbind or rebind time. A sketch of such a zapping helper
follows.
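Under those constraints, and again as an illustrative sketch with
hypothetical helper names, a zapping helper for a userptr gpu_vma
called from the invalidation notifier could look like:

.. code-block:: C

   // Called from the mmu invalidation notifier. The userptr_seqlock
   // is already held in write mode by the core mm, and the caller
   // holds the gpu_vm->userptr_notifier_lock in write mode. Since the
   // mapping side holds both in read mode, it can't race with us.
   void zap_gpu_vma(gpu_vma)
   {
           // Only zero the page-table and page-directory entries and
           // flush the TLB. Freeing page-table memory could require
           // outer level locks and is deferred to unbind / rebind time.
           zero_page_table_entries(&gpu_vma);
           flush_gpu_tlb(&gpu_vma->gpu_vm);
   }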