==============================
RT-mutex implementation design
==============================

Copyright (c) 2006 Steven Rostedt

Licensed under the GNU Free Documentation License, Version 1.2


This document tries to describe the design of the rtmutex.c implementation.
It doesn't describe the reasons why rtmutex.c exists.  For that please see
Documentation/locking/rt-mutex.rst.  Although this document does explain
problems that can happen without this code, that is only to help understand
what the code actually is doing.

The goal of this document is to help others understand the priority
inheritance (PI) algorithm that is used, as well as the reasons for the
design decisions that were made to implement PI in the manner that was done.
 19                                                   
 20                                                   
Unbounded Priority Inversion
----------------------------

Priority inversion is when a lower priority process executes while a higher
priority process wants to run.  This happens for several reasons, and
most of the time it can't be helped.  Anytime a high priority process wants
to use a resource that a lower priority process has (a mutex for example),
the high priority process must wait until the lower priority process is done
with the resource.  This is a priority inversion.  What we want to prevent
is something called unbounded priority inversion.  That is when the high
priority process is prevented from running by a lower priority process for
an undetermined amount of time.

The classic example of unbounded priority inversion is where you have three
processes, let's call them processes A, B, and C, where A is the highest
priority process, C is the lowest, and B is in between.  A tries to grab a lock
that C owns and must wait and lets C run to release the lock.  But in the
meantime, B executes, and since B is of a higher priority than C, it preempts C,
but by doing so, it is in fact preempting A which is a higher priority process.
Now there's no way of knowing how long A will be sleeping waiting for C
to release the lock, because for all we know, B is a CPU hog and will
never give C a chance to release the lock.  This is called unbounded priority
inversion.
 44                                                   
Here's a little ASCII art to show the problem::
 46                                                   
 47      grab lock L1 (owned by C)                    
 48        |                                          
 49   A ---+                                          
 50           C preempted by B                        
 51             |                                     
 52   C    +----+                                     
 53                                                   
 54   B         +-------->                            
 55                   B now keeps A from running.     
 56                                                   
 57                                                   
 58 Priority Inheritance (PI)                         
 59 -------------------------                         
 60                                                   
There are several ways to solve this issue, but other ways are out of scope
for this document.  Here we only discuss PI.

PI is where a process inherits the priority of another process if the other
process blocks on a lock owned by the current process.  To make this easier
to understand, let's use the previous example, with processes A, B, and C again.

This time, when A blocks on the lock owned by C, C would inherit the priority
of A.  So now if B becomes runnable, it would not preempt C, since C now has
the high priority of A.  As soon as C releases the lock, it loses its
inherited priority, and A then can continue with the resource that C had.
 72                                                   
Terminology
-----------

Here I explain some terminology that is used in this document to help describe
the design that is used to implement PI.

PI chain
         - The PI chain is an ordered series of locks and processes that cause
           processes to inherit priorities from a previous process that is
           blocked on one of its locks.  This is described in more detail
           later in this document.

mutex
         - In this document, to differentiate from locks that implement
           PI and spin locks that are used in the PI code, from now on
           the PI locks will be called a mutex.

lock
         - In this document from now on, I will use the term lock when
           referring to spin locks that are used to protect parts of the PI
           algorithm.  These locks disable preemption for UP (when
           CONFIG_PREEMPT is enabled) and on SMP prevent multiple CPUs from
           entering critical sections simultaneously.

spin lock
         - Same as lock above.

waiter
         - A waiter is a struct that is stored on the stack of a blocked
           process.  Since the scope of the waiter is within the code for
           a process being blocked on the mutex, it is fine to allocate
           the waiter on the process's stack (local variable).  This
           structure holds a pointer to the task, as well as the mutex that
           the task is blocked on.  It also has rbtree node structures to
           place the task in the waiters rbtree of a mutex as well as the
           pi_waiters rbtree of a mutex owner task (described below).  A
           simplified sketch of this structure is shown after this
           terminology section.

           waiter is sometimes used in reference to the task that is waiting
           on a mutex. This is the same as waiter->task.

waiters
         - A list of processes that are blocked on a mutex.

top waiter
         - The highest priority process waiting on a specific mutex.

top pi waiter
              - The highest priority process waiting on one of the mutexes
                that a specific process owns.

Note:
       task and process are used interchangeably in this document, mostly to
       differentiate between two processes that are being described working
       together.
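
For reference, here is a simplified sketch of what a waiter structure
conceptually contains.  This is only an illustration based on the description
above; the real struct rt_mutex_waiter lives in the kernel sources and its
exact fields and layout differ between kernel versions::

  /* illustrative sketch only -- not the kernel's struct rt_mutex_waiter */
  struct waiter_sketch {
        struct rb_node     tree_entry;    /* node in the mutex's waiters rbtree */
        struct rb_node     pi_tree_entry; /* node in the owner's pi_waiters rbtree */
        struct task_struct *task;         /* the blocked task */
        struct rt_mutex    *lock;         /* the mutex the task is blocked on */
        int                prio;          /* priority used to order both rbtrees */
  };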
126                                                   
127                                                   
128 PI chain                                          
129 --------                                          
130                                                   
The PI chain is a list of processes and mutexes that may cause priority
inheritance to take place.  Multiple chains may converge, but a chain
would never diverge, since a process can't be blocked on more than one
mutex at a time.
135                                                   
136 Example::                                         
137                                                   
138    Process:  A, B, C, D, E                        
139    Mutexes:  L1, L2, L3, L4                       
140                                                   
141    A owns: L1                                     
142            B blocked on L1                        
143            B owns L2                              
144                   C blocked on L2                 
145                   C owns L3                       
146                          D blocked on L3          
147                          D owns L4                
                                E blocked on L4
149                                                   
150 The chain would be::                              
151                                                   
152    E->L4->D->L3->C->L2->B->L1->A                  
153                                                   
To show where two chains merge, we could add another process F and
another mutex L5 where B owns L5 and F is blocked on L5.
156                                                   
157 The chain for F would be::                        
158                                                   
159    F->L5->B->L1->A                                
160                                                   
Since a process may own more than one mutex, but never be blocked on more than
one, the chains merge.
163                                                   
164 Here we show both chains::                        
165                                                   
166    E->L4->D->L3->C->L2-+                          
167                        |                          
168                        +->B->L1->A                
169                        |                          
170                  F->L5-+                          
171                                                   
For PI to work, the processes at the right end of these chains (or we may
also call it the Top of the chain) must be equal to or higher in priority
than the processes to the left or below in the chain.
175                                                   
Also since a mutex may have more than one process blocked on it, we can
have multiple chains merge at mutexes.  If we add another process G that is
blocked on mutex L2::
179                                                   
180   G->L2->B->L1->A                                 
181                                                   
And once again, to show how this can grow I will show the merging chains
again::
184                                                   
185    E->L4->D->L3->C-+                              
186                    +->L2-+                        
187                    |     |                        
188                  G-+     +->B->L1->A              
189                          |                        
190                    F->L5-+                        
191                                                   
If process G has the highest priority in the chain, then all the tasks up
the chain (A and B in this example), must have their priorities increased
to that of G.
195                                                   
196 Mutex Waiters Tree                                
197 ------------------                                
198                                                   
Every mutex keeps track of all the waiters that are blocked on itself.  The
mutex has a rbtree to store these waiters by priority.  This tree is protected
by a spin lock that is located in the struct of the mutex.  This lock is called
wait_lock.
203                                                   
204                                                   
205 Task PI Tree                                      
206 ------------                                      
207                                                   
To keep track of the PI chains, each process has its own PI rbtree.  This is
a tree of all top waiters of the mutexes that are owned by the process.
Note that this tree only holds the top waiters and not all waiters that are
blocked on mutexes owned by the process.

The top of the task's PI tree is always the highest priority task that
is waiting on a mutex that is owned by the task.  So if the task has
inherited a priority, it will always be the priority of the task that is
at the top of this tree.

This tree is stored in the task structure of a process as a rbtree called
pi_waiters.  It is protected by a spin lock also in the task structure,
called pi_lock.  This lock may also be taken in interrupt context, so when
locking the pi_lock, interrupts must be disabled.
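
To make the above concrete, here is a rough sketch of where the two trees and
their locks live.  The field names follow the text above, but this is only an
illustration; the actual kernel structures contain many more fields and their
types vary between versions::

  /* illustrative sketch only -- not the kernel's actual definitions */
  struct rt_mutex_sketch {
        raw_spinlock_t     wait_lock;   /* protects the waiters tree below */
        struct rb_root     waiters;     /* all waiters, ordered by priority */
        struct task_struct *owner;      /* owner; bit 0 doubles as "Has Waiters" */
  };

  struct task_struct_sketch {
        /* ... */
        raw_spinlock_t     pi_lock;     /* protects pi_waiters; IRQs disabled */
        struct rb_root     pi_waiters;  /* top waiter of each mutex this task owns */
        /* ... */
  };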
222                                                   
223                                                   
224 Depth of the PI Chain                             
225 ---------------------                             
226                                                   
The maximum depth of the PI chain is not dynamic, and could actually be
defined.  But it is very complex to figure out, since it depends on all
the nesting of mutexes.  Let's look at the example where we have 3 mutexes,
L1, L2, and L3, and four separate functions func1, func2, func3 and func4.
The following shows a locking order of L1->L2->L3, but may not actually
be directly nested that way::
233                                                   
234   void func1(void)                                
235   {                                               
236         mutex_lock(L1);                           
237                                                   
238         /* do anything */                         
239                                                   
240         mutex_unlock(L1);                         
241   }                                               
242                                                   
243   void func2(void)                                
244   {                                               
245         mutex_lock(L1);                           
246         mutex_lock(L2);                           
247                                                   
248         /* do something */                        
249                                                   
250         mutex_unlock(L2);                         
251         mutex_unlock(L1);                         
252   }                                               
253                                                   
254   void func3(void)                                
255   {                                               
256         mutex_lock(L2);                           
257         mutex_lock(L3);                           
258                                                   
259         /* do something else */                   
260                                                   
261         mutex_unlock(L3);                         
262         mutex_unlock(L2);                         
263   }                                               
264                                                   
265   void func4(void)                                
266   {                                               
267         mutex_lock(L3);                           
268                                                   
269         /* do something again */                  
270                                                   
271         mutex_unlock(L3);                         
272   }                                               
273                                                   
Now we add 4 processes that run each of these functions separately.
Processes A, B, C, and D which run functions func1, func2, func3 and func4
respectively, and such that D runs first and A last.  With D being preempted
in func4 in the "do something again" area, we have a locking that follows::
278                                                   
279   D owns L3                                       
280          C blocked on L3                          
281          C owns L2                                
282                 B blocked on L2                   
283                 B owns L1                         
284                        A blocked on L1            
285                                                   
  And thus we have the chain A->L1->B->L2->C->L3->D.
287                                                   
This gives us a PI depth of 4 (four processes), but looking at any of the
functions individually, it seems as though they only have at most a locking
depth of two.  So, although the locking depth is defined at compile time,
it still is very difficult to find the possibilities of that depth.
292                                                   
Now since mutexes can be defined by user-land applications, we don't want a DOS
type of application that nests large amounts of mutexes to create a large
PI chain, and have the code holding spin locks while looking at a large
amount of data.  So to prevent this, the implementation not only implements
a maximum lock depth, but also only holds at most two different locks at a
time, as it walks the PI chain.  More about this below.
299                                                   
300                                                   
301 Mutex owner and flags                             
302 ---------------------                             
303                                                   
The mutex structure contains a pointer to the owner of the mutex.  If the
mutex is not owned, this owner is set to NULL.  Since all architectures
have the task structure on at least a two byte alignment (and if this is
not true, the rtmutex.c code will be broken!), this allows for the least
significant bit to be used as a flag.  Bit 0 is used as the "Has Waiters"
flag.  It's set whenever there are waiters on a mutex.
310                                                   
See Documentation/locking/rt-mutex.rst for further details.
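
As an illustration of how the flag bit can share space with the owner pointer,
here is a small user-space style sketch.  It is not the kernel's rtmutex code;
it only demonstrates the masking that the 2-byte (or better) alignment of the
task structure makes possible::

  #include <stdint.h>

  #define HAS_WAITERS_BIT  1UL     /* bit 0 of the owner field */

  struct task;                     /* stand-in for struct task_struct */

  /* recover the real owner pointer by masking off the flag bit */
  static struct task *owner_of(uintptr_t owner_field)
  {
        return (struct task *)(owner_field & ~HAS_WAITERS_BIT);
  }

  /* test whether the "Has Waiters" flag is set */
  static int has_waiters(uintptr_t owner_field)
  {
        return (int)(owner_field & HAS_WAITERS_BIT);
  }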
312                                                   
313 cmpxchg Tricks                                    
314 --------------                                    
315                                                   
Some architectures implement an atomic cmpxchg (Compare and Exchange).  This
is used (when applicable) to keep the fast path of grabbing and releasing
mutexes short.
319                                                   
cmpxchg is basically the following function performed atomically::
321                                                   
  unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C)
323   {                                               
324         unsigned long T = *A;                     
325         if (*A == *B) {                           
326                 *A = *C;                          
327         }                                         
328         return T;                                 
329   }                                               
330   #define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c)       
331                                                   
This is really nice to have, since it allows you to only update a variable
if the variable is what you expect it to be.  You know if it succeeded if
the return value (the old value of A) is equal to B.
335                                                   
The macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes.  If
the architecture does not support CMPXCHG, then this macro is simply set
to fail every time.  But if CMPXCHG is supported, then this helps
enormously to keep the fast path short.
340                                                   
The use of rt_mutex_cmpxchg with the flags in the owner field helps optimize
the system for architectures that support it.  This will also be explained
later in this document.
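
To illustrate how cmpxchg keeps the uncontended case short, here is a small
user-space sketch using C11 atomics.  It is not the kernel's rt_mutex_cmpxchg
or fast path code; it only shows the idea: the lock is taken in one atomic
step if and only if the owner field is still NULL::

  #include <stdatomic.h>
  #include <stdbool.h>

  struct fake_rtmutex {
        _Atomic(void *) owner;     /* NULL means "not owned" */
  };

  static bool fast_trylock(struct fake_rtmutex *m, void *me)
  {
        void *expected = NULL;

        /* succeeds only if owner was NULL; otherwise take the slow path */
        return atomic_compare_exchange_strong(&m->owner, &expected, me);
  }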
344                                                   
345                                                   
346 Priority adjustments                              
347 --------------------                              
348                                                   
The implementation of the PI code in rtmutex.c has several places that a
process must adjust its priority.  With the help of the pi_waiters of a
process, it is rather easy to know what needs to be adjusted.
352                                                   
The functions implementing the task adjustments are rt_mutex_adjust_prio
and rt_mutex_setprio.  rt_mutex_setprio is only used in rt_mutex_adjust_prio.
355                                                   
rt_mutex_adjust_prio examines the priority of the task, and the highest
priority process that is waiting on any of the mutexes owned by the task. Since
the pi_waiters of a task holds an order by priority of all the top waiters
of all the mutexes that the task owns, we simply need to compare the top
pi waiter to its own normal/deadline priority and take the higher one.
Then rt_mutex_setprio is called to adjust the priority of the task to the
new priority. Note that rt_mutex_setprio is defined in kernel/sched/core.c
to implement the actual change in priority.
364                                                   
Note:
        For the "prio" field in task_struct, the lower the number, the
        higher the priority.  A "prio" of 5 is of higher priority than a
        "prio" of 10.
369                                                   
It is interesting to note that rt_mutex_adjust_prio can either increase
or decrease the priority of the task.  In the case that a higher priority
process has just blocked on a mutex owned by the task, rt_mutex_adjust_prio
would increase/boost the task's priority.  But if a higher priority task
were for some reason to leave the mutex (timeout or signal), this same function
would decrease/unboost the priority of the task.  That is because the pi_waiters
always contains the highest priority task that is waiting on a mutex owned
by the task, so we only need to compare the priority of that top pi waiter
to the normal priority of the given task.
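
The comparison that rt_mutex_adjust_prio conceptually performs can be sketched
as below.  This is illustrative only; the real code also handles deadline
scheduling, and the actual priority change is carried out by rt_mutex_setprio
in kernel/sched/core.c, as noted above::

  /* illustrative sketch; remember: a lower "prio" number is a higher priority */
  static int effective_prio(int normal_prio, int top_pi_waiter_prio,
                            int has_pi_waiters)
  {
        if (has_pi_waiters && top_pi_waiter_prio < normal_prio)
              return top_pi_waiter_prio;   /* boosted by the top pi waiter */

        return normal_prio;                /* no boost (or unboost back) */
  }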
379                                                   
380                                                   
381 High level overview of the PI chain walk          
382 ----------------------------------------          
383                                                   
The PI chain walk is implemented by the function rt_mutex_adjust_prio_chain.
385                                                   
The implementation has gone through several iterations, and has ended up
with what we believe is the best.  It walks the PI chain by only grabbing
at most two locks at a time, and is very efficient.
389                                                   
The rt_mutex_adjust_prio_chain can be used either to boost or lower process
priorities.
392                                                   
rt_mutex_adjust_prio_chain is called with a task to be checked for PI
(de)boosting (the owner of a mutex that a process is blocking on), a flag to
check for deadlocking, the mutex that the task owns, a pointer to a waiter
that is the process's waiter struct that is blocked on the mutex (although this
parameter may be NULL for deboosting), a pointer to the mutex on which the task
is blocked, and a top_task as the top waiter of the mutex.
399                                                   
For this explanation, I will not mention deadlock detection.  This explanation
will try to stay at a high level.
402                                                   
When this function is called, there are no locks held.  That also means
that the state of the owner and lock can change while inside this function.
405                                                   
Before this function is called, the task has already had rt_mutex_adjust_prio
performed on it.  This means that the task is set to the priority that it
should be at, but the rbtree nodes of the task's waiter have not been updated
with the new priorities, and this task may not be in the proper locations
in the pi_waiters and waiters trees that the task is blocked on.  This function
solves all that.
412                                                   
The main operation of this function is summarized by Thomas Gleixner in
rtmutex.c.  See the 'Chain walk basics and protection scope' comment for further
details.
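
For orientation only, the general shape of the walk can be outlined as below.
This is a deliberately rough sketch, not the control flow of
rt_mutex_adjust_prio_chain; deadlock detection, requeueing details and all of
the conditions for dropping out of the loop are omitted::

  /* rough conceptual outline -- see the comment in rtmutex.c for the real rules */
  for (;;) {
        /* take the blocked task's pi_lock (lock #1) */
        /* re-check which mutex the task is blocked on; stop if it no longer is */

        /* take that mutex's wait_lock (lock #2); never hold more than two locks */
        /* requeue the task's waiter in the mutex's waiters tree with its new priority */

        /* drop the task's pi_lock, take the mutex owner's pi_lock */
        /* update the owner's pi_waiters tree and adjust the owner's priority */

        /* drop the locks and continue the walk with the owner as the next task */
  }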
416                                                   
417 Taking of a mutex (The walk through)              
418 ------------------------------------              
419                                                   
OK, now let's take a look at the detailed walk through of what happens when
taking a mutex.
422                                                   
The first thing that is tried is the fast taking of the mutex.  This is
done when we have CMPXCHG enabled (otherwise the fast taking automatically
fails).  Only when the owner field of the mutex is NULL can the lock be
taken with the CMPXCHG and nothing else needs to be done.
427                                                   
If there is contention on the lock, we go about the slow path
(rt_mutex_slowlock).
430                                                   
The slow path function is where the task's waiter structure is created on
the stack.  This is because the waiter structure is only needed for the
scope of this function.  The waiter structure holds the nodes to store
the task on the waiters tree of the mutex, and if need be, the pi_waiters
tree of the owner.
436                                                   
The wait_lock of the mutex is taken since the slow path of unlocking the
mutex also takes this lock.
439                                                   
We then call try_to_take_rt_mutex.  This is where the architecture that
does not implement CMPXCHG would always grab the lock (if there's no
contention).
443                                                   
try_to_take_rt_mutex is used every time the task tries to grab a mutex in the
slow path.  The first thing that is done here is an atomic setting of the
"Has Waiters" flag of the mutex's owner field.  By setting this flag now,
the current owner of the mutex being contended for can't release the mutex
without going into the slow unlock path, and it would then need to grab the
wait_lock, which this code currently holds.  So setting the "Has Waiters" flag
forces the current owner to synchronize with this code.
451                                                   
452 The lock is taken if the following are true:      
453                                                   
454    1) The lock has no owner                       
   2) The current task is the highest priority against all other
456       waiters of the lock                         
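
Condensed into a small sketch, the check above looks roughly like the
following.  This is only an illustration of the two conditions; the real
try_to_take_rt_mutex() also handles lock stealing and the "Has Waiters"
bookkeeping described above::

  #include <stdbool.h>

  /* rough sketch of the two conditions above, not try_to_take_rt_mutex() itself */
  static bool can_take_now(bool lock_has_owner, bool lock_has_waiters,
                           bool current_is_top_waiter)
  {
        if (lock_has_owner)
              return false;              /* condition 1 fails */

        if (lock_has_waiters && !current_is_top_waiter)
              return false;              /* condition 2 fails */

        return true;                     /* safe to take the lock */
  }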
457                                                   
If the task succeeds to acquire the lock, then the task is set as the
owner of the lock, and if the lock still has waiters, the top_waiter
(highest priority task waiting on the lock) is added to this task's
pi_waiters tree.
462                                                   
If the lock is not taken by try_to_take_rt_mutex(), then the
task_blocks_on_rt_mutex() function is called.  This will add the task to
the lock's waiter tree and propagate the pi chain of the lock as well
as the lock's owner's pi_waiters tree.  This is described in the next
section.
468                                                   
469 Task blocks on mutex                              
470 --------------------                              
471                                                   
The accounting of a mutex and process is done with the waiter structure of
the process.  The "task" field is set to the process, and the "lock" field
to the mutex.  The rbtree node of the waiter is initialized to the process's
current priority.
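
In pseudo-code, the initialization described above amounts to something like
the following (field names as in the waiter sketch earlier; this is not the
actual kernel code)::

  waiter->task = current;          /* the task about to block */
  waiter->lock = lock;             /* the mutex it is blocking on */
  waiter->prio = current->prio;    /* ordering key for both rbtrees */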
476                                                   
Since the wait_lock was taken at the entry of the slow lock, we can safely
add the waiter to the task waiter tree.  If the current process is the
highest priority process currently waiting on this mutex, then we remove the
previous top waiter process (if it exists) from the pi_waiters of the owner,
and add the current process to that tree.  Since the pi_waiter of the owner
has changed, we call rt_mutex_adjust_prio on the owner to see if the owner
should adjust its priority accordingly.
484                                                   
If the owner is also blocked on a lock, and had its pi_waiters changed
(or deadlock checking is on), we unlock the wait_lock of the mutex
and run rt_mutex_adjust_prio_chain on the owner, as described above.
488                                                   
Now all locks are released, and if the current process is still blocked on a
mutex (waiter "task" field is not NULL), then we go to sleep (call schedule).
491                                                   
492 Waking up in the loop                             
493 ---------------------                             
494                                                   
The task can then wake up for a couple of reasons:

  1) The previous lock owner released the lock, and the task now is top_waiter
  2) we received a signal or timeout
498                                                   
In both cases, the task will try again to acquire the lock.  If it
does, then it will take itself off the waiters tree and set itself back
to the TASK_RUNNING state.
502                                                   
In the first case, if the lock was acquired by another task before this task
could get the lock, then it will go back to sleep and wait to be woken again.
505                                                   
The second case is only applicable for tasks that are grabbing a mutex
that can wake up before getting the lock, either due to a signal or
a timeout (i.e. rt_mutex_timed_futex_lock()).  When woken, it will try to
take the lock again; if it succeeds, then the task will return with the
lock held, otherwise it will return with -EINTR if the task was woken
by a signal, or -ETIMEDOUT if it timed out.
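
From a caller's point of view, the return values discussed above are handled
in the usual kernel style.  The fragment below assumes an interruptible lock
call such as rt_mutex_lock_interruptible(); it is only meant to make the
error codes concrete, not to document a specific in-kernel user::

  ret = rt_mutex_lock_interruptible(&lock);
  if (ret == -EINTR)
        return ret;        /* a signal arrived before the lock was acquired */

  /* ret == 0: the lock is held; do the work, then rt_mutex_unlock(&lock) */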
512                                                   
513                                                   
514 Unlocking the Mutex                               
515 -------------------                               
516                                                   
The unlocking of a mutex also has a fast path for those architectures with
CMPXCHG.  Since the taking of a mutex on contention always sets the
"Has Waiters" flag of the mutex's owner, we use this to know if we need to
take the slow path when unlocking the mutex.  If the mutex doesn't have any
waiters, the owner field of the mutex would equal the current process and
the mutex can be unlocked by just replacing the owner field with NULL.
523                                                   
If the owner field has the "Has Waiters" bit set (or CMPXCHG is not available),
the slow unlock path is taken.
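
The fast unlock path can be sketched the same way as the fast lock path shown
in the cmpxchg section.  Again this is a user-space illustration, not the
kernel code: the cmpxchg only succeeds when the owner field is exactly the
current task with bit 0 clear, so a mutex whose "Has Waiters" bit is set
always falls through to the slow path::

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdint.h>

  static bool fast_unlock(_Atomic uintptr_t *owner_field, uintptr_t me)
  {
        uintptr_t expected = me;   /* current task pointer, flag bit clear */

        /* fails (returns false) if the "Has Waiters" bit was set */
        return atomic_compare_exchange_strong(owner_field, &expected,
                                              (uintptr_t)0);
  }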
526                                                   
The first thing done in the slow unlock path is to take the wait_lock of the
mutex.  This synchronizes the locking and unlocking of the mutex.
529                                                   
A check is made to see if the mutex has waiters or not.  On architectures that
do not have CMPXCHG, this is the location that the owner of the mutex will
determine if a waiter needs to be awoken or not.  On architectures that
do have CMPXCHG, that check is done in the fast path, but it is still needed
in the slow path too.  If a waiter of a mutex woke up because of a signal
or timeout between the time the owner failed the fast path CMPXCHG check and
the grabbing of the wait_lock, the mutex may not have any waiters, thus the
owner still needs to make this check.  If there are no waiters then the mutex
owner field is set to NULL, the wait_lock is released and nothing more is
needed.
540                                                   
If there are waiters, then we need to wake one up.
542                                                   
On the wake up code, the pi_lock of the current owner is taken.  The top
waiter of the lock is found and removed from the waiters tree of the mutex
as well as the pi_waiters tree of the current owner.  The "Has Waiters" bit is
marked to prevent lower priority tasks from stealing the lock.
547                                                   
Finally we unlock the pi_lock of the pending owner and wake it up.
549                                                   
550                                                   
551 Contact                                           
552 -------                                           
553                                                   
For updates on this document, please email Steven Rostedt <rostedt@goodmis.org>
555                                                   
556                                                   
557 Credits                                           
558 -------                                           
559                                                   
560 Author:  Steven Rostedt <rostedt@goodmis.org>      
561                                                   
Updated: Alex Shi <alex.shi@linaro.org> - 7/6/2017
563                                                   
564 Original Reviewers:                               
                     Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and
566                      Randy Dunlap                 
567                                                   
Update (7/6/2017) Reviewers: Steven Rostedt and Sebastian Siewior
569                                                   
570 Updates                                           
571 -------                                           
572                                                   
This document was originally written for 2.6.17-rc3-mm1 and
574 was updated on 4.12                               
                                                      
