======================================
Wound/Wait Deadlock-Proof Mutex Design
======================================

Please read mutex-design.rst first, as it applies to wait/wound mutexes too.

Motivation for WW-Mutexes
-------------------------

GPUs do operations that commonly involve many buffers.  Those buffers
can be shared across contexts/processes, exist in different memory
domains (for example VRAM vs system memory), and so on.  And with
PRIME / dmabuf, they can even be shared across devices.  So there are
a handful of situations where the driver needs to wait for buffers to
become ready.  If you think about this in terms of waiting on a buffer
mutex for it to become available, this presents a problem because
there is no way to guarantee that buffers appear in an execbuf/batch in
the same order in all contexts.  That is directly under control of
userspace, and a result of the sequence of GL calls that an application
makes, which results in the potential for deadlock.  The problem gets
more complex when you consider that the kernel may need to migrate the
buffer(s) into VRAM before the GPU operates on the buffer(s), which
may in turn require evicting some other buffers (and you don't want to
evict other buffers which are already queued up to the GPU), but for a
simplified understanding of the problem you can ignore this.

The algorithm that the TTM graphics subsystem came up with for dealing with
this problem is quite simple.  For each group of buffers (execbuf) that need
to be locked, the caller would be assigned a unique reservation id/ticket,
from a global counter.  In case of deadlock while locking all the buffers
associated with an execbuf, the one with the lowest reservation ticket (i.e.
the oldest task) wins, and the one with the higher reservation id (i.e. the
younger task) unlocks all of the buffers that it has already locked, and then
tries again.

In the RDBMS literature, a reservation ticket is associated with a transaction,
and the deadlock handling approach is called Wait-Die. The name is based on
the actions of a locking thread when it encounters an already locked mutex.
If the transaction holding the lock is younger, the locking transaction waits.
If the transaction holding the lock is older, the locking transaction backs off
and dies. Hence Wait-Die.
There is also another algorithm called Wound-Wait:
If the transaction holding the lock is younger, the locking transaction
wounds the transaction holding the lock, requesting it to die.
If the transaction holding the lock is older, it waits for the other
transaction. Hence Wound-Wait.
The two algorithms are both fair in that a transaction will eventually succeed.
However, the Wound-Wait algorithm is typically stated to generate fewer backoffs
compared to Wait-Die, but is, on the other hand, associated with more work than
Wait-Die when recovering from a backoff. Wound-Wait is also a preemptive
algorithm in that transactions are wounded by other transactions, and that
requires a reliable way to pick up the wounded condition and preempt the
running transaction. Note that this is not the same as process preemption. A
Wound-Wait transaction is considered preempted when it dies (returning
-EDEADLK) following a wound.

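To make the two policies concrete, here is a minimal sketch of the decision a
transaction faces when it contends for a held lock. This is purely
illustrative: struct tx, its stamp field and the helper functions are
hypothetical stand-ins for a transaction carrying a reservation ticket, not
the kernel's actual data structures::

  /* Hypothetical transaction; a lower stamp means an older transaction. */
  struct tx {
        unsigned long stamp;    /* reservation id/ticket */
  };

  static bool is_older(const struct tx *a, const struct tx *b)
  {
        return a->stamp < b->stamp;
  }

  enum action { WAIT, BACK_OFF_AND_DIE, WOUND_HOLDER };

  /* Wait-Die: an older contender waits; a younger one backs off and dies. */
  static enum action wait_die(struct tx *self, struct tx *holder)
  {
        return is_older(self, holder) ? WAIT : BACK_OFF_AND_DIE;
  }

  /* Wound-Wait: an older contender wounds the holder; a younger one waits. */
  static enum action wound_wait(struct tx *self, struct tx *holder)
  {
        return is_older(self, holder) ? WOUND_HOLDER : WAIT;
  }
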
Concepts
--------

Compared to normal mutexes, two additional concepts/objects show up in the lock
interface for w/w mutexes:

Acquire context: To ensure eventual forward progress it is important that a task
trying to acquire locks doesn't grab a new reservation id, but keeps the one it
acquired when starting the lock acquisition. This ticket is stored in the
acquire context. Furthermore the acquire context keeps track of debugging state
to catch w/w mutex interface abuse. An acquire context represents a
transaction.

W/w class: In contrast to normal mutexes the lock class needs to be explicit for
w/w mutexes, since it is required to initialize the acquire context. The lock
class also specifies what algorithm to use, Wound-Wait or Wait-Die.

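In code, the two objects appear as follows; a minimal sketch, with an
illustrative class name::

  static DEFINE_WW_CLASS(example_ww_class);       /* the w/w class */

  void transaction(void)
  {
        struct ww_acquire_ctx ctx;                /* the acquire context */

        ww_acquire_init(&ctx, &example_ww_class); /* grabs the ticket */
        /* ... acquire w/w mutexes of this class, passing &ctx ... */
        ww_acquire_done(&ctx);
        /* ... critical work ... */
        ww_acquire_fini(&ctx);
  }
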
Furthermore there are three different classes of w/w lock acquire functions:

* Normal lock acquisition with a context, using ww_mutex_lock.

* Slowpath lock acquisition on the contending lock, used by the task that just
  killed its transaction after having dropped all already acquired locks.
  These functions have the _slow postfix.

  From a simple semantics point-of-view the _slow functions are not strictly
  required, since simply calling the normal ww_mutex_lock functions on the
  contending lock (after having dropped all other already acquired locks) will
  work correctly. After all, if no other ww mutex has been acquired yet there's
  no deadlock potential and hence the ww_mutex_lock call will block and not
  prematurely return -EDEADLK. The advantage of the _slow functions is in
  interface safety:

  - ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow
    has a void return type. Note that since ww mutex code needs loops/retries
    anyway the __must_check doesn't result in spurious warnings, even though the
    very first lock operation can never fail.
  - When full debugging is enabled ww_mutex_lock_slow checks that all acquired
    ww mutexes have been released (preventing deadlocks) and makes sure that we
    block on the contending lock (preventing spinning through the -EDEADLK
    slowpath until the contended lock can be acquired).

* Functions to only acquire a single w/w mutex, which results in the exact same
  semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL
  context.

  Again this is not strictly required. But often you only want to acquire a
  single lock, in which case it's pointless to set up an acquire context (and so
  better to avoid grabbing a deadlock avoidance ticket).

Of course, all the usual variants for handling wake-ups due to signals are also
provided.

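For instance, the interruptible variants follow the same calling pattern, with
-EINTR as an additional possible return value. A brief sketch (unwinding of
other held locks elided, obj as in the examples below)::

  ret = ww_mutex_lock_interruptible(&obj->lock, ctx);
  /* ret is 0, -EALREADY, -EDEADLK, or -EINTR if a signal arrived */
  if (ret == -EDEADLK) {
        /* ... after dropping all other already acquired locks ... */
        ret = ww_mutex_lock_slow_interruptible(&obj->lock, ctx);
        /* ret is 0, or -EINTR if interrupted by a signal */
  }
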
Usage
-----

The algorithm (Wait-Die vs Wound-Wait) is chosen by using either
DEFINE_WW_CLASS() (Wound-Wait) or DEFINE_WD_CLASS() (Wait-Die).
As a rough rule of thumb, use Wound-Wait iff you
expect the number of simultaneous competing transactions to be typically small,
and you want to reduce the number of rollbacks.

There are three different ways to acquire locks within the same w/w class.
Common definitions for methods #1 and #2::

  static DEFINE_WW_CLASS(ww_class);

  struct obj {
        struct ww_mutex lock;
        /* obj data */
  };

  struct obj_entry {
        struct list_head head;
        struct obj *obj;
  };

Method 1, using a list in execbuf->buffers that's not allowed to be reordered.
This is useful if a list of required objects is already tracked somewhere.
Furthermore the lock helper can propagate the -EALREADY return code back to
the caller as a signal that an object appears twice on the list. This is useful
if the list is constructed from userspace input and the ABI requires userspace
to not have duplicate entries (e.g. for a gpu commandbuffer submission ioctl)::

  int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
        struct obj *res_obj = NULL;
        struct obj_entry *contended_entry = NULL;
        struct obj_entry *entry;
        int ret;

        ww_acquire_init(ctx, &ww_class);

  retry:
        list_for_each_entry (entry, list, head) {
                if (entry->obj == res_obj) {
                        res_obj = NULL;
                        continue;
                }
                ret = ww_mutex_lock(&entry->obj->lock, ctx);
                if (ret < 0) {
                        contended_entry = entry;
                        goto err;
                }
        }

        ww_acquire_done(ctx);
        return 0;

  err:
        list_for_each_entry_continue_reverse (entry, list, head)
                ww_mutex_unlock(&entry->obj->lock);

        if (res_obj)
                ww_mutex_unlock(&res_obj->lock);

        if (ret == -EDEADLK) {
                /* we lost out in a seqno race, lock and retry.. */
                ww_mutex_lock_slow(&contended_entry->obj->lock, ctx);
                res_obj = contended_entry->obj;
                goto retry;
        }
        ww_acquire_fini(ctx);

        return ret;
  }

Method 2, using a list in execbuf->buffers that can be reordered. Same semantics
of duplicate-entry detection using -EALREADY as in method 1 above. But the
list-reordering allows for a bit more idiomatic code::

  int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
        struct obj_entry *entry, *entry2;
        int ret;

        ww_acquire_init(ctx, &ww_class);

        list_for_each_entry (entry, list, head) {
                ret = ww_mutex_lock(&entry->obj->lock, ctx);
                if (ret < 0) {
                        entry2 = entry;

                        list_for_each_entry_continue_reverse (entry2, list, head)
                                ww_mutex_unlock(&entry2->obj->lock);

                        if (ret != -EDEADLK) {
                                ww_acquire_fini(ctx);
                                return ret;
                        }

                        /* we lost out in a seqno race, lock and retry.. */
                        ww_mutex_lock_slow(&entry->obj->lock, ctx);

                        /*
                         * Move entry to the head of the list; this points
                         * entry->next at the first unlocked entry,
                         * restarting the for loop.
                         */
                        list_del(&entry->head);
                        list_add(&entry->head, list);
                }
        }

        ww_acquire_done(ctx);
        return 0;
  }

Unlocking works the same way for both methods #1 and #2::

  void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
        struct obj_entry *entry;

        list_for_each_entry (entry, list, head)
                ww_mutex_unlock(&entry->obj->lock);

        ww_acquire_fini(ctx);
  }

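Putting the two helpers together, a caller could look roughly like this; a
sketch only, where submit() and its surroundings are hypothetical::

  int submit(struct list_head *buffers)
  {
        struct ww_acquire_ctx ctx;
        int ret;

        /* lock_objs() calls ww_acquire_init() internally */
        ret = lock_objs(buffers, &ctx);
        if (ret)
                return ret;     /* e.g. -EALREADY for duplicate entries */

        /* ... operate on the locked buffers ... */

        /* unlock_objs() calls ww_acquire_fini() internally */
        unlock_objs(buffers, &ctx);
        return 0;
  }
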
Method 3 is useful if the list of objects is constructed ad-hoc and not upfront,
e.g. when adjusting edges in a graph where each node has its own ww_mutex lock,
and edges can only be changed when holding the locks of all involved nodes. w/w
mutexes are a natural fit for such a case for two reasons:

- They can handle lock-acquisition in any order, which allows us to start
  walking a graph from a starting point and then iteratively discovering new
  edges and locking down the nodes those edges connect to.
- Due to the -EALREADY return code signalling that a given object is already
  held there's no need for additional book-keeping to break cycles in the graph
  or keep track of which locks are already held (when using more than one node
  as a starting point).

Note that this approach differs in two important ways from the above methods:

- Since the list of objects is dynamically constructed (and might very well be
  different when retrying due to hitting the -EDEADLK die condition) there's
  no need to keep any object on a persistent list when it's not locked. We can
  therefore move the list_head into the object itself.
- On the other hand the dynamic object list construction also means that the
  -EALREADY return code can't be propagated.

Note also that methods #1 and #2 and method #3 can be combined, e.g. to first
lock a list of starting nodes (passed in from userspace) using one of the above
methods, and then lock any additional objects affected by the operations using
method #3 below. The backoff/retry procedure will be a bit more involved, since
when the dynamic locking step hits -EDEADLK we also need to unlock all the
objects acquired with the fixed list. But the w/w mutex debug checks will catch
any interface misuse for these cases.

Also, method 3 can't fail the lock acquisition step, since it handles
-EALREADY internally instead of returning it. Of course this would be different
when using the _interruptible variants, but that's outside of the scope of
these examples here::

  struct obj {
        struct ww_mutex ww_mutex;
        struct list_head locked_list;
  };

  static DEFINE_WW_CLASS(ww_class);

  void __unlock_objs(struct list_head *list)
  {
        struct obj *entry, *temp;

        list_for_each_entry_safe (entry, temp, list, locked_list) {
                /*
                 * Need to do this before unlocking, since only the current
                 * lock holder is allowed to use the object.
                 */
                list_del(&entry->locked_list);
                ww_mutex_unlock(&entry->ww_mutex);
        }
  }

  void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
        struct obj *obj;
        int ret;

        ww_acquire_init(ctx, &ww_class);

  retry:
        /* re-init loop start state */
        loop {
                /* magic code which walks over a graph and decides which objects
                 * to lock */

                ret = ww_mutex_lock(&obj->ww_mutex, ctx);
                if (ret == -EALREADY) {
                        /* we have that one already, get to the next object */
                        continue;
                }
                if (ret == -EDEADLK) {
                        __unlock_objs(list);

                        ww_mutex_lock_slow(&obj->ww_mutex, ctx);
                        list_add(&obj->locked_list, list);
                        goto retry;
                }

                /* locked a new object, add it to the list */
                list_add_tail(&obj->locked_list, list);
        }

        ww_acquire_done(ctx);
  }

  void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
  {
        __unlock_objs(list);
        ww_acquire_fini(ctx);
  }

Method 4: Only lock one single object. In that case deadlock detection and
prevention is obviously overkill, since with grabbing just one lock you can't
produce a deadlock within just one class. To simplify this case the w/w mutex
api can be used with a NULL context.

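For example, reusing struct obj from the common definitions above (a minimal
sketch)::

  int use_single_obj(struct obj *obj)
  {
        int ret;

        /* NULL context: plain mutex semantics, no -EDEADLK / -EALREADY */
        ret = ww_mutex_lock(&obj->lock, NULL);
        if (ret)
                return ret;     /* satisfies __must_check; always 0 here */

        /* ... use obj ... */

        ww_mutex_unlock(&obj->lock);
        return 0;
  }
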
Implementation Details
----------------------

Design:
^^^^^^^

  ww_mutex currently encapsulates a struct mutex; this means no extra overhead
  for normal mutex locks, which are far more common. As such there is only a
  small increase in code size if wait/wound mutexes are not used.

  We maintain the following invariants for the wait list:

  (1) Waiters with an acquire context are sorted by stamp order; waiters
      without an acquire context are interspersed in FIFO order.
  (2) For Wait-Die, among waiters with contexts, only the first one can have
      other locks acquired already (ctx->acquired > 0). Note that this waiter
      may come after other waiters without contexts in the list.

  The Wound-Wait preemption is implemented with a lazy-preemption scheme:
  The wounded status of the transaction is checked only when there is
  contention for a new lock and hence a true chance of deadlock. In that
  situation, if the transaction is wounded, it backs off, clears the
  wounded status and retries. A great benefit of implementing preemption in
  this way is that the wounded transaction can identify a contending lock to
  wait for before restarting the transaction. Just blindly restarting the
  transaction would likely make the transaction end up in a situation where
  it would have to back off again.

  In general, not much contention is expected. The locks are typically used to
  serialize access to resources for devices, and optimization focus should
  therefore be directed towards the uncontended cases.

Lockdep:
^^^^^^^^

  Special care has been taken to warn for as many cases of api abuse
  as possible. Some common api abuses will be caught with
  CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended.

  Some of the errors which will be warned about:
   - Forgetting to call ww_acquire_fini or ww_acquire_init.
   - Attempting to lock more mutexes after ww_acquire_done.
   - Attempting to lock the wrong mutex after -EDEADLK and
     unlocking all mutexes.
   - Attempting to lock the right mutex after -EDEADLK,
     before unlocking all mutexes.
   - Calling ww_mutex_lock_slow before -EDEADLK was returned.
   - Unlocking mutexes with the wrong unlock function.
   - Calling one of the ww_acquire_* twice on the same context.
   - Using a different ww_class for the mutex than for the ww_acquire_ctx.
   - Normal lockdep errors that can result in deadlocks.

  Some of the lockdep errors that can result in deadlocks:
   - Calling ww_acquire_init to initialize a second ww_acquire_ctx before
     having called ww_acquire_fini on the first.
   - 'normal' deadlocks that can occur.

FIXME:
  Update this section once we have the TASK_DEADLOCK task state flag magic
  implemented.
