.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)

===============
VM_BIND locking
===============

This document attempts to describe what's needed to get VM_BIND locking right,
including the userptr mmu_notifier locking. It also discusses some
optimizations to avoid looping through all userptr mappings and
external / shared object mappings, which is needed in the simplest
implementation. In addition, there is a section describing the VM_BIND locking
required for implementing recoverable pagefaults.

The DRM GPUVM set of helpers
============================

There is a set of helpers for drivers implementing VM_BIND, and this
set of helpers implements much, but not all, of the locking described
in this document. In particular, it is currently lacking a userptr
implementation. This document does not intend to describe the DRM GPUVM
implementation in detail, but it is covered in :ref:`its own
documentation <drm_gpuvm>`. It is highly recommended for any driver
implementing VM_BIND to use the DRM GPUVM helpers and to extend them if
common functionality is missing.

Nomenclature
============

* ``gpu_vm``: Abstraction of a virtual GPU address space with
  meta-data. Typically one per client (DRM file-private), or one per
  execution context.
* ``gpu_vma``: Abstraction of a GPU address range within a gpu_vm with
  associated meta-data. The backing storage of a gpu_vma can either be
  a GEM object or anonymous / page-cache pages that are also mapped into
  the CPU address space for the process.
* ``gpu_vm_bo``: Abstracts the association of a GEM object and
  a VM. The GEM object maintains a list of gpu_vm_bos, where each gpu_vm_bo
  maintains a list of gpu_vmas.
* ``userptr gpu_vma or just userptr``: A gpu_vma, whose backing store
  is anonymous or page-cache pages as described above.
* ``revalidating``: Revalidating a gpu_vma means making the latest version
  of the backing store resident and making sure the gpu_vma's
  page-table entries point to that backing store.
* ``dma_fence``: A struct dma_fence that is similar to a struct completion
  and which tracks GPU activity. When the GPU activity is finished,
  the dma_fence signals. Please refer to the ``DMA Fences`` section of
  the :doc:`dma-buf doc </driver-api/dma-buf>`.
* ``dma_resv``: A struct dma_resv (a.k.a reservation object) that is used
  to track GPU activity in the form of multiple dma_fences on a
  gpu_vm or a GEM object. The dma_resv contains an array / list
  of dma_fences and a lock that needs to be held when adding
  additional dma_fences to the dma_resv. The lock is of a type that
  allows deadlock-safe locking of multiple dma_resvs in arbitrary
  order. Please refer to the ``Reservation Objects`` section of the
  :doc:`dma-buf doc </driver-api/dma-buf>`.
* ``exec function``: An exec function is a function that revalidates all
  affected gpu_vmas, submits a GPU command batch and registers the
  dma_fence representing the GPU command's activity with all affected
  dma_resvs. For completeness, although not covered by this document,
  it's worth mentioning that an exec function may also be the
  revalidation worker that is used by some drivers in compute /
  long-running mode.
* ``local object``: A GEM object which is only mapped within a
  single VM. Local GEM objects share the gpu_vm's dma_resv.
* ``external object``: a.k.a shared object: A GEM object which may be shared
  by multiple gpu_vms and whose backing storage may be shared with
  other drivers.

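To make the relationships between these abstractions concrete, here is a
minimal sketch of how they might be tied together in a driver. It is
illustrative only, not the actual DRM GPUVM structures, and all field
names are hypothetical:

.. code-block:: C

   /* Illustrative sketch only; not the DRM GPUVM structures. */
   struct gpu_vm_bo {
           struct gem_object *obj;      /* Refcounted pointer to the GEM object. */
           struct gpu_vm *vm;           /* The gpu_vm of this association. */
           struct list_head obj_link;   /* Entry in the GEM object's list of gpu_vm_bos. */
           struct list_head vma_list;   /* gpu_vmas mapping this object in this VM. */
   };

   struct gpu_vma {
           u64 va_start, va_end;        /* GPU virtual address range. */
           struct gpu_vm_bo *vm_bo;     /* Refcounted; NULL for a userptr gpu_vma. */
           struct list_head vm_bo_link; /* Entry in vm_bo->vma_list. */
   };
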
Locks and locking order
=======================

One of the benefits of VM_BIND is that local GEM objects share the gpu_vm's
dma_resv object and hence the dma_resv lock. So, even with a huge
number of local GEM objects, only one lock is needed to make the exec
sequence atomic.

The following locks and locking orders are used:

* The ``gpu_vm->lock`` (optionally an rwsem). Protects the gpu_vm's
  data structure keeping track of gpu_vmas. It can also protect the
  gpu_vm's list of userptr gpu_vmas. With a CPU mm analogy this would
  correspond to the mmap_lock. An rwsem allows several readers to walk
  the VM tree concurrently, but the benefit of that concurrency most
  likely varies from driver to driver.
* The ``userptr_seqlock``. This lock is taken in read mode for each
  userptr gpu_vma on the gpu_vm's userptr list, and in write mode during mmu
  notifier invalidation. This is not a real seqlock but described in
  ``mm/mmu_notifier.c`` as a "Collision-retry read-side/write-side
  'lock' a lot like a seqcount. However this allows multiple
  write-sides to hold it at once...". The read side critical section
  is enclosed by ``mmu_interval_read_begin() /
  mmu_interval_read_retry()`` with ``mmu_interval_read_begin()``
  sleeping if the write side is held.
  The write side is held by the core mm while calling mmu interval
  invalidation notifiers.
* The ``gpu_vm->resv`` lock. Protects the gpu_vm's list of gpu_vmas needing
  rebinding, as well as the residency state of all the gpu_vm's local
  GEM objects.
  Furthermore, it typically protects the gpu_vm's lists of evicted and
  external GEM objects.
* The ``gpu_vm->userptr_notifier_lock``. This is an rwsem that is
  taken in read mode during exec and write mode during a mmu notifier
  invalidation. The userptr notifier lock is per gpu_vm.
* The ``gem_object->gpuva_lock``. This lock protects the GEM object's
  list of gpu_vm_bos. This is usually the same lock as the GEM
  object's dma_resv, but some drivers protect this list differently,
  see below.
* The ``gpu_vm list spinlocks``. Some implementations need them
  to be able to update the gpu_vm's evicted- and external-object
  lists. For those implementations, the spinlocks are grabbed when the
  lists are manipulated. However, to avoid locking order violations
  with the dma_resv locks, a special scheme is needed when iterating
  over the lists.

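As a rough illustration of where these locks might live, the following
sketch collects them in a hypothetical driver's gpu_vm structure, with
comments summarizing the rules above. This is not the DRM GPUVM API;
all names are made up:

.. code-block:: C

   /* Illustrative sketch only; field names are hypothetical. */
   struct gpu_vm {
           /* Outer lock: protects the gpu_vma tree and the list of
            * userptr gpu_vmas. CPU mm analogy: the mmap_lock. */
           struct rw_semaphore lock;

           /* Shared by all local GEM objects. Protects the rebind
            * list, local-object residency and, typically, the
            * evicted- and external-object lists. */
           struct dma_resv *resv;

           /* Read mode during exec, write mode during mmu notifier
            * invalidation. */
           struct rw_semaphore userptr_notifier_lock;

           /* Optional innermost lock for manipulating the evicted-
            * and external-object lists from within dma_fence
            * signalling critical sections. */
           spinlock_t list_lock;

           struct list_head rebind_list;
           struct list_head evict_list;
           struct list_head extobj_list;
           struct list_head userptr_list;
   };
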
.. _gpu_vma lifetime:

Protection and lifetime of gpu_vm_bos and gpu_vmas
==================================================

The GEM object's list of gpu_vm_bos and the gpu_vm_bo's list of gpu_vmas
are protected by the ``gem_object->gpuva_lock``, which is typically the
same as the GEM object's dma_resv, but if the driver
needs to access these lists from within a dma_fence signalling
critical section, it can instead choose to protect them with a
separate lock, which can be locked from within the dma_fence signalling
critical section. Such drivers then need to pay additional attention
to what locks need to be taken from within the loop when iterating
over the gpu_vm_bo and gpu_vma lists to avoid locking-order violations.

The DRM GPUVM set of helpers provides lockdep asserts that this lock is
held in relevant situations and also provides a means of making itself
aware of which lock is actually used: :c:func:`drm_gem_gpuva_set_lock`.

Each gpu_vm_bo holds a reference counted pointer to the underlying GEM
object, and each gpu_vma holds a reference counted pointer to the
gpu_vm_bo. When iterating over the GEM object's list of gpu_vm_bos and
over the gpu_vm_bo's list of gpu_vmas, the ``gem_object->gpuva_lock`` must
not be dropped. Otherwise, gpu_vmas attached to a gpu_vm_bo may
disappear without notice since those are not reference-counted. A
driver may implement its own scheme to allow this at the expense of
additional complexity, but this is outside the scope of this document.

In the DRM GPUVM implementation, each gpu_vm_bo and each gpu_vma
holds a reference count on the gpu_vm itself. Due to this, and to avoid circular
reference counting, cleanup of the gpu_vm's gpu_vmas must not be done from the
gpu_vm's destructor. Drivers typically implement a gpu_vm close
function for this cleanup. The gpu_vm close function will abort GPU
execution using this VM, unmap all gpu_vmas and release page-table memory.

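A minimal sketch of such a close function, in the same simplified
pseudo-code style as the examples below and with hypothetical helper
names:

.. code-block:: C

   void gpu_vm_close(struct gpu_vm *gpu_vm)
   {
           // Stop GPU execution using this VM.
           abort_gpu_execution(gpu_vm);

           // The outer lock also keeps userptr gpu_vmas alive while
           // they are being destroyed.
           down_write(&gpu_vm->lock);
           dma_resv_lock(gpu_vm->resv);
           // Unmapping drops the gpu_vmas' references on their
           // gpu_vm_bos, which in turn drop their references on the
           // gpu_vm and the GEM objects, breaking the potential
           // reference cycle with the gpu_vm.
           for_each_gpu_vma_of_gpu_vm(gpu_vm, &gpu_vma)
                   unmap_and_destroy_gpu_vma(&gpu_vma);
           release_page_table_memory(gpu_vm);
           dma_resv_unlock(gpu_vm->resv);
           up_write(&gpu_vm->lock);
   }
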
Revalidation and eviction of local objects
==========================================

Note that in all the code examples given below we use simplified
pseudo-code. In particular, the dma_resv deadlock avoidance algorithm
as well as reserving memory for dma_resv fences is left out.

Revalidation
____________
With VM_BIND, all local objects need to be resident when the GPU is
executing using the gpu_vm, and the objects need to have valid
gpu_vmas set up pointing to them. Typically, each GPU command buffer
submission is therefore preceded with a re-validation section:

.. code-block:: C

   dma_resv_lock(gpu_vm->resv);

   // Validation section starts here.
   for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) {
           validate_gem_bo(&gpu_vm_bo->gem_bo);

           // The following list iteration needs the GEM object's
           // dma_resv to be held (it protects the gpu_vm_bo's list of
           // gpu_vmas), but since local GEM objects share the gpu_vm's
           // dma_resv, it is already held at this point.
           for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
                  move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list);
   }

   for_each_gpu_vma_on_rebind_list(&gpu_vm->rebind_list, &gpu_vma) {
           rebind_gpu_vma(&gpu_vma);
           remove_gpu_vma_from_rebind_list(&gpu_vma);
   }
   // Validation section ends here, and job submission starts.

   add_dependencies(&gpu_job, &gpu_vm->resv);
   job_dma_fence = gpu_submit(&gpu_job);

   add_dma_fence(job_dma_fence, &gpu_vm->resv);
   dma_resv_unlock(gpu_vm->resv);

The reason for having a separate gpu_vm rebind list is that there
might be userptr gpu_vmas that are not mapping a buffer object and that
also need rebinding.

Eviction
________

Eviction of one of these local objects will then look similar to the
following:

.. code-block:: C

   obj = get_object_from_lru();

   dma_resv_lock(obj->resv);
   for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo)
           add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);

   add_dependencies(&eviction_job, &obj->resv);
   job_dma_fence = gpu_submit(&eviction_job);
   add_dma_fence(job_dma_fence, &obj->resv);

   dma_resv_unlock(obj->resv);
   put_object(obj);

Note that since the object is local to the gpu_vm, it will share the gpu_vm's
dma_resv lock such that ``obj->resv == gpu_vm->resv``.
The gpu_vm_bos marked for eviction are put on the gpu_vm's evict list,
which is protected by ``gpu_vm->resv``. During eviction all local
objects have their dma_resv locked and, due to the above equality, also
the gpu_vm's dma_resv protecting the gpu_vm's evict list is locked.

With VM_BIND, gpu_vmas don't need to be unbound before eviction,
since the driver must ensure that the eviction blit or copy will wait
for GPU idle or depend on all previous GPU activity. Furthermore, any
subsequent attempt by the GPU to access freed memory through the
gpu_vma will be preceded by a new exec function, with a revalidation
section which will make sure all gpu_vmas are rebound. Since both the
eviction code and the exec function's revalidation section hold the
object's dma_resv, a new exec function cannot race with the eviction.

A driver can be implemented in such a way that, on each exec function,
only a subset of vmas are selected for rebind. In this case, all vmas that are
*not* selected for rebind must be unbound before the exec
function workload is submitted.

Locking with external buffer objects
====================================

Since external buffer objects may be shared by multiple gpu_vms they
can't share their reservation object with a single gpu_vm. Instead
they need to have a reservation object of their own. The external
objects bound to a gpu_vm using one or more gpu_vmas are therefore put on a
per-gpu_vm list which is protected by the gpu_vm's dma_resv lock or
one of the :ref:`gpu_vm list spinlocks <Spinlock iteration>`. Once
the gpu_vm's reservation object is locked, it is safe to traverse the
external object list and lock the dma_resvs of all external
objects. However, if instead a list spinlock is used, a more elaborate
iteration scheme needs to be used.

At eviction time, the gpu_vm_bos of *all* the gpu_vms an external
object is bound to need to be put on their gpu_vm's evict list.
However, when evicting an external object, the dma_resvs of the
gpu_vms the object is bound to are typically not held. Only
the object's private dma_resv can be guaranteed to be held. If there
is a ww_acquire context at hand at eviction time we could grab those
dma_resvs but that could cause expensive ww_mutex rollbacks. A simple
option is to just mark the gpu_vm_bos of the evicted GEM object with
an ``evicted`` bool that is inspected before the next time the
corresponding gpu_vm's evict list needs to be traversed, for example
when traversing the list of external objects and locking them. At that
time, both the gpu_vm's dma_resv and the object's dma_resv are held,
and a gpu_vm_bo marked evicted can then be added to the gpu_vm's list
of evicted gpu_vm_bos. The ``evicted`` bool is formally protected by
the object's dma_resv.

The exec function then becomes:

.. code-block:: C

   dma_resv_lock(gpu_vm->resv);

   // External object list is protected by the gpu_vm->resv lock.
   for_each_gpu_vm_bo_on_extobj_list(gpu_vm, &gpu_vm_bo) {
           dma_resv_lock(gpu_vm_bo.gem_obj->resv);
           if (gpu_vm_bo_marked_evicted(&gpu_vm_bo))
                   add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
   }

   for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) {
           validate_gem_bo(&gpu_vm_bo->gem_bo);

           for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
                  move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list);
   }

   for_each_gpu_vma_on_rebind_list(&gpu_vm->rebind_list, &gpu_vma) {
           rebind_gpu_vma(&gpu_vma);
           remove_gpu_vma_from_rebind_list(&gpu_vma);
   }

   add_dependencies(&gpu_job, &gpu_vm->resv);
   job_dma_fence = gpu_submit(&gpu_job);

   add_dma_fence(job_dma_fence, &gpu_vm->resv);
   for_each_external_obj(gpu_vm, &obj)
          add_dma_fence(job_dma_fence, &obj->resv);
   dma_resv_unlock_all_resv_locks();

And the corresponding shared-object aware eviction would look like:

.. code-block:: C

   obj = get_object_from_lru();

   dma_resv_lock(obj->resv);
   for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo)
           if (object_is_vm_local(obj))
                add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
           else
                mark_gpu_vm_bo_evicted(&gpu_vm_bo);

   add_dependencies(&eviction_job, &obj->resv);
   job_dma_fence = gpu_submit(&eviction_job);
   add_dma_fence(job_dma_fence, &obj->resv);

   dma_resv_unlock(obj->resv);
   put_object(obj);

.. _Spinlock iteration:

Accessing the gpu_vm's lists without the dma_resv lock held
===========================================================

Some drivers will hold the gpu_vm's dma_resv lock when accessing the
gpu_vm's evict list and external objects lists. However, there are
drivers that need to access these lists without the dma_resv lock
held, for example due to asynchronous state updates from within the
dma_fence signalling critical path. In such cases, a spinlock can be
used to protect manipulation of the lists. However, since higher level
sleeping locks need to be taken for each list item while iterating
over the lists, the items already iterated over need to be
temporarily moved to a private list and the spinlock released
while processing each item:

.. code-block:: C

    struct list_head still_in_list;

    INIT_LIST_HEAD(&still_in_list);

    spin_lock(&gpu_vm->list_lock);
    do {
            struct list_head *entry = list_first_entry_or_null(&gpu_vm->list, head);

            if (!entry)
                    break;

            list_move_tail(&entry->head, &still_in_list);
            list_entry_get_unless_zero(entry);
            spin_unlock(&gpu_vm->list_lock);

            process(entry);

            spin_lock(&gpu_vm->list_lock);
            list_entry_put(entry);
    } while (true);

    list_splice_tail(&still_in_list, &gpu_vm->list);
    spin_unlock(&gpu_vm->list_lock);

Due to the additional locking and atomic operations, drivers that *can*
avoid accessing the gpu_vm's list outside of the dma_resv lock
might want to avoid this iteration scheme as well, particularly if the
driver anticipates a large number of list items. For lists where the
anticipated number of list items is small, where list iteration doesn't
happen very often or if there is a significant additional cost
associated with each iteration, the atomic operation overhead
associated with this type of iteration is, most likely, negligible. Note that
if this scheme is used, it is necessary to make sure this list
iteration is protected by an outer level lock or semaphore, since list
items are temporarily pulled off the list while iterating. It is
also worth mentioning that the local list ``still_in_list`` should
be considered protected by the ``gpu_vm->list_lock`` as well, and it is
thus possible that items are removed from the local list
concurrently with list iteration.

Please refer to the :ref:`DRM GPUVM locking section
<drm_gpuvm_locking>` and its internal
:c:func:`get_next_vm_bo_from_list` function.


userptr gpu_vmas
================

A userptr gpu_vma is a gpu_vma that, instead of mapping a buffer object to a
GPU virtual address range, directly maps a CPU mm range of anonymous
or file page-cache pages.
A very simple approach would be to just pin the pages using
pin_user_pages() at bind time and unpin them at unbind time, but this
creates a Denial-Of-Service vector since a single user-space process
would be able to pin down all of system memory, which is not
desirable. (For special use-cases and assuming proper accounting, pinning might
still be a desirable feature, though). What we need to do in the
general case is to obtain a reference to the desired pages, make sure
we are notified using an MMU notifier just before the CPU mm unmaps the
pages, dirty them if they are not mapped read-only to the GPU, and
then drop the reference.
When we are notified by the MMU notifier that the CPU mm is about to drop the
pages, we need to stop GPU access to the pages by waiting for VM idle
in the MMU notifier and make sure that before the next time the GPU
tries to access whatever is now present in the CPU mm range, we unmap
the old pages from the GPU page tables and repeat the process of
obtaining new page references. (See the :ref:`notifier example
<Invalidation example>` below). Note that when the core mm decides to
launder pages, we get such an unmap MMU notification and can mark the
pages dirty again before the next GPU access. We also get similar MMU
notifications for NUMA accounting which the GPU driver doesn't really
need to care about, but so far it has proven difficult to exclude
certain notifications.

Using an MMU notifier for device DMA (and other methods) is described in
:ref:`the pin_user_pages() documentation <mmu-notifier-registration-case>`.

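As a rough sketch of the bind-time setup, a driver could register an MMU
interval notifier covering the userptr range when the gpu_vma is
created. The gpu_vma layout and the bind helper below are hypothetical;
the invalidation callback itself is shown in full in the
:ref:`notifier example <Invalidation example>` below:

.. code-block:: C

   #include <linux/mmu_notifier.h>

   // Hypothetical driver callback; see the invalidation example below.
   static bool
   gpu_vma_userptr_invalidate(struct mmu_interval_notifier *userptr_interval,
                              const struct mmu_notifier_range *range,
                              unsigned long cur_seq);

   static const struct mmu_interval_notifier_ops gpu_vma_userptr_ops = {
           .invalidate = gpu_vma_userptr_invalidate,
   };

   int gpu_vma_userptr_bind(struct gpu_vma *gpu_vma, unsigned long start,
                            unsigned long length)
   {
           // Register against the current process' mm, so that
           // gpu_vma_userptr_invalidate() is called before the CPU mm
           // unmaps any pages in [start, start + length).
           return mmu_interval_notifier_insert(&gpu_vma->userptr_interval,
                                               current->mm, start, length,
                                               &gpu_vma_userptr_ops);
   }
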
Now, the method of obtaining struct page references using
get_user_pages() unfortunately can't be used under a dma_resv lock
since that would violate the locking order of the dma_resv lock vs the
mmap_lock that is grabbed when resolving a CPU pagefault. This means
the gpu_vm's list of userptr gpu_vmas needs to be protected by an
outer lock, which in our example below is the ``gpu_vm->lock``.

The MMU interval seqlock for a userptr gpu_vma is used in the following
way:

.. code-block:: C

   // Exclusive locking mode here is strictly needed only if there are
   // invalidated userptr gpu_vmas present, to avoid concurrent userptr
   // revalidations of the same userptr gpu_vma.
   down_write(&gpu_vm->lock);
   retry:

   // Note: mmu_interval_read_begin() blocks until there is no
   // invalidation notifier running anymore.
   seq = mmu_interval_read_begin(&gpu_vma->userptr_interval);
   if (seq != gpu_vma->saved_seq) {
           obtain_new_page_pointers(&gpu_vma);
           dma_resv_lock(gpu_vm->resv);
           add_gpu_vma_to_revalidate_list(&gpu_vma, &gpu_vm);
           dma_resv_unlock(gpu_vm->resv);
           gpu_vma->saved_seq = seq;
   }

   // The usual revalidation goes here.

   // Final userptr sequence validation may not happen before the
   // submission dma_fence is added to the gpu_vm's resv, from the
   // point of view of the MMU invalidation notifier. Hence the
   // userptr_notifier_lock that will make them appear atomic.

   add_dependencies(&gpu_job, &gpu_vm->resv);
   down_read(&gpu_vm->userptr_notifier_lock);
   if (mmu_interval_read_retry(&gpu_vma->userptr_interval, gpu_vma->saved_seq)) {
          up_read(&gpu_vm->userptr_notifier_lock);
          goto retry;
   }

   job_dma_fence = gpu_submit(&gpu_job);

   add_dma_fence(job_dma_fence, &gpu_vm->resv);

   for_each_external_obj(gpu_vm, &obj)
          add_dma_fence(job_dma_fence, &obj->resv);

   dma_resv_unlock_all_resv_locks();
   up_read(&gpu_vm->userptr_notifier_lock);
   up_write(&gpu_vm->lock);

The code between ``mmu_interval_read_begin()`` and the
``mmu_interval_read_retry()`` marks the read side critical section of
what we call the ``userptr_seqlock``. In reality, the gpu_vm's userptr
gpu_vma list is looped through, and the check is done for *all* of its
userptr gpu_vmas, although we only show a single one here.

The userptr gpu_vma MMU invalidation notifier might be called from
reclaim context and, again, to avoid locking order violations, we can't
take any dma_resv lock nor the gpu_vm->lock from within it.

.. _Invalidation example:
.. code-block:: C

  bool gpu_vma_userptr_invalidate(userptr_interval, cur_seq)
  {
          // Make sure the exec function either sees the new sequence
          // and backs off or we wait for the dma-fence:

          down_write(&gpu_vm->userptr_notifier_lock);
          mmu_interval_set_seq(userptr_interval, cur_seq);
          up_write(&gpu_vm->userptr_notifier_lock);

          // At this point, the exec function can't succeed in
          // submitting a new job, because cur_seq is an invalid
          // sequence number and will always cause a retry. When all
          // invalidation callbacks are done, the mmu notifier core
          // will flip the sequence number to a valid one. However,
          // we need to stop GPU access to the old pages here.

          dma_resv_wait_timeout(&gpu_vm->resv, DMA_RESV_USAGE_BOOKKEEP,
                                false, MAX_SCHEDULE_TIMEOUT);
          return true;
  }

When this invalidation notifier returns, the GPU can no longer be
accessing the old pages of the userptr gpu_vma and needs to redo the
page-binding before a new GPU submission can succeed.

Efficient userptr gpu_vma exec_function iteration
_________________________________________________

If the gpu_vm's list of userptr gpu_vmas becomes large, it's
inefficient to iterate through the complete lists of userptrs on each
exec function to check whether each userptr gpu_vma's saved
sequence number is stale. A solution to this is to put all
*invalidated* userptr gpu_vmas on a separate gpu_vm list and
only check the gpu_vmas present on this list on each exec
function. This list will then lend itself very well to the spinlock
locking scheme that is
:ref:`described in the spinlock iteration section <Spinlock iteration>`, since
in the mmu notifier, where we add the invalidated gpu_vmas to the
list, it's not possible to take any outer locks like the
``gpu_vm->lock`` or the ``gpu_vm->resv`` lock. Note that the
``gpu_vm->lock`` still needs to be taken while iterating to ensure the list is
complete, as also mentioned in that section.

If using an invalidated userptr list like this, the retry check in the
exec function trivially becomes a check for the invalidated list being
empty, as sketched below.

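A minimal sketch of that retry check, in the same simplified
pseudo-code style as the examples above; the
``invalidated_userptr_list`` name is made up for illustration:

.. code-block:: C

   // Replaces the per-gpu_vma mmu_interval_read_retry() loop. Called
   // with the gpu_vm->lock and all relevant dma_resv locks held, after
   // the gpu_vmas on the invalidated list have been revalidated and
   // removed from the list.
   add_dependencies(&gpu_job, &gpu_vm->resv);
   down_read(&gpu_vm->userptr_notifier_lock);
   if (!list_empty(&gpu_vm->invalidated_userptr_list)) {
           up_read(&gpu_vm->userptr_notifier_lock);
           goto retry;
   }

   job_dma_fence = gpu_submit(&gpu_job);
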
Locking at bind and unbind time
===============================

At bind time, assuming a GEM object backed gpu_vma, each
gpu_vma needs to be associated with a gpu_vm_bo and that
gpu_vm_bo in turn needs to be added to the GEM object's
gpu_vm_bo list, and possibly to the gpu_vm's external object
list. This is referred to as *linking* the gpu_vma, and typically
requires that the ``gpu_vm->lock`` and the ``gem_object->gpuva_lock``
are held. When unlinking a gpu_vma the same locks should be held,
which ensures that when iterating over ``gpu_vmas``, either under
the ``gpu_vm->resv`` or the GEM object's dma_resv, the gpu_vmas
stay alive as long as the lock under which we iterate is not released. For
userptr gpu_vmas it's similarly required that during vma destroy, the
outer ``gpu_vm->lock`` is held, since otherwise when iterating over
the invalidated userptr list as described in the previous section,
there is nothing keeping those userptr gpu_vmas alive.

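As an illustration, linking and unlinking could be structured as
follows. This is a sketch under the assumptions above, not the DRM
GPUVM API, and the helper names are hypothetical:

.. code-block:: C

   void link_gpu_vma(struct gpu_vma *gpu_vma, struct gpu_vm_bo *gpu_vm_bo)
   {
           lockdep_assert_held(&gpu_vm->lock);
           lockdep_assert_held(gem_object_gpuva_lock(gpu_vm_bo->obj));

           add_gpu_vma_to_gpu_vm_bo(&gpu_vma, &gpu_vm_bo);
           if (is_external_object(gpu_vm_bo->obj))
                   add_gpu_vm_bo_to_extobj_list(&gpu_vm_bo, &gpu_vm->extobj_list);
   }

   void unlink_gpu_vma(struct gpu_vma *gpu_vma)
   {
           // Same locks as at link time, so that concurrent iteration
           // under the gpu_vm->resv or the GEM object's dma_resv never
           // sees a half-unlinked gpu_vma.
           lockdep_assert_held(&gpu_vm->lock);
           lockdep_assert_held(gem_object_gpuva_lock(gpu_vma->vm_bo->obj));

           remove_gpu_vma_from_gpu_vm_bo(&gpu_vma);
   }
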
Locking for recoverable page-fault page-table updates
=====================================================

There are two important things we need to ensure with locking for
recoverable page-faults:

* At the time we return pages back to the system / allocator for
  reuse, there should be no remaining GPU mappings and any GPU TLB
  must have been flushed.
* The unmapping and mapping of a gpu_vma must not race.

Since the unmapping (or zapping) of GPU ptes typically takes place
where it is hard or even impossible to take any outer level locks, we
must either introduce a new lock that is held at both mapping and
unmapping time, or look at the locks we do hold at unmapping time and
make sure that they are also held at mapping time. For userptr
gpu_vmas, the ``userptr_seqlock`` is held in write mode in the mmu
invalidation notifier where zapping happens. Hence, if the
``userptr_seqlock`` as well as the ``gpu_vm->userptr_notifier_lock``
are held in read mode during mapping, it will not race with the
zapping. For GEM object backed gpu_vmas, zapping will take place under
the GEM object's dma_resv, and ensuring that the dma_resv is held also
when populating the page-tables for any gpu_vma pointing to the GEM
object will similarly ensure we are race-free.

If any part of the mapping is performed asynchronously
under a dma-fence with these locks released, the zapping will need to
wait for that dma-fence to signal under the relevant lock before
starting to modify the page-table.

Since modifying the
page-table structure in a way that frees up page-table memory
might also require outer level locks, the zapping of GPU ptes
typically focuses only on zeroing page-table or page-directory entries
and flushing TLB, whereas freeing of page-table memory is deferred to
unbind or rebind time.
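A sketch of what zapping a GEM object backed gpu_vma might look like
under these rules; as with the other examples, this is simplified
pseudo-code with hypothetical helper names:

.. code-block:: C

   void zap_gpu_vma(struct gpu_vma *gpu_vma)
   {
           // Zapping is protected by the GEM object's dma_resv;
           // page-table population for any gpu_vma pointing to this
           // object must hold the same dma_resv to avoid racing.
           dma_resv_assert_held(gpu_vma->vm_bo->obj->resv);

           // Only zero the entries and flush the TLB here. Freeing
           // page-table memory could require outer level locks and is
           // therefore deferred to unbind or rebind time.
           zero_page_table_entries(&gpu_vma);
           flush_gpu_tlb_range(gpu_vma->va_start, gpu_vma->va_end);
   }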
