~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~

TOMOYO Linux Cross Reference
Linux/Documentation/locking/robust-futexes.rst

Version: ~ [ linux-6.11.5 ] ~ [ linux-6.10.14 ] ~ [ linux-6.9.12 ] ~ [ linux-6.8.12 ] ~ [ linux-6.7.12 ] ~ [ linux-6.6.58 ] ~ [ linux-6.5.13 ] ~ [ linux-6.4.16 ] ~ [ linux-6.3.13 ] ~ [ linux-6.2.16 ] ~ [ linux-6.1.114 ] ~ [ linux-6.0.19 ] ~ [ linux-5.19.17 ] ~ [ linux-5.18.19 ] ~ [ linux-5.17.15 ] ~ [ linux-5.16.20 ] ~ [ linux-5.15.169 ] ~ [ linux-5.14.21 ] ~ [ linux-5.13.19 ] ~ [ linux-5.12.19 ] ~ [ linux-5.11.22 ] ~ [ linux-5.10.228 ] ~ [ linux-5.9.16 ] ~ [ linux-5.8.18 ] ~ [ linux-5.7.19 ] ~ [ linux-5.6.19 ] ~ [ linux-5.5.19 ] ~ [ linux-5.4.284 ] ~ [ linux-5.3.18 ] ~ [ linux-5.2.21 ] ~ [ linux-5.1.21 ] ~ [ linux-5.0.21 ] ~ [ linux-4.20.17 ] ~ [ linux-4.19.322 ] ~ [ linux-4.18.20 ] ~ [ linux-4.17.19 ] ~ [ linux-4.16.18 ] ~ [ linux-4.15.18 ] ~ [ linux-4.14.336 ] ~ [ linux-4.13.16 ] ~ [ linux-4.12.14 ] ~ [ linux-4.11.12 ] ~ [ linux-4.10.17 ] ~ [ linux-4.9.337 ] ~ [ linux-4.4.302 ] ~ [ linux-3.10.108 ] ~ [ linux-2.6.32.71 ] ~ [ linux-2.6.0 ] ~ [ linux-2.4.37.11 ] ~ [ unix-v6-master ] ~ [ ccs-tools-1.8.9 ] ~ [ policy-sample ] ~
Architecture: ~ [ i386 ] ~ [ alpha ] ~ [ m68k ] ~ [ mips ] ~ [ ppc ] ~ [ sparc ] ~ [ sparc64 ] ~

Diff markup

Differences between /Documentation/locking/robust-futexes.rst (Version linux-6.11.5) and /Documentation/locking/robust-futexes.rst (Version linux-4.20.17)


  1 ========================================          
  2 A description of what robust futexes are          
  3 ========================================          
  4                                                   
  5 :Started by: Ingo Molnar <mingo@redhat.com>        
  6                                                   
  7 Background                                        
  8 ----------                                        
  9                                                   
 10 what are robust futexes? To answer that, we fi    
 11 what futexes are: normal futexes are special t    
 12 noncontended case can be acquired/released fro    
 13 to enter the kernel.                              
 14                                                   
 15 A futex is in essence a user-space address, e.    
 16 field. If userspace notices contention (the lo    
 17 someone else wants to grab it too) then the lo    
 18 that says "there's a waiter pending", and the     
 19 syscall is used to wait for the other guy to r    
 20 creates a 'futex queue' internally, so that it    
 21 waiter with the waker - without them having to    
 22 When the owner thread releases the futex, it n    
 23 value) that there were waiter(s) pending, and     
 24 sys_futex(FUTEX_WAKE) syscall to wake them up.    
 25 taken and released the lock, the futex is agai    
 26 state, and there's no in-kernel state associat    
 27 completely forgets that there ever was a futex    
 28 method makes futexes very lightweight and scal    
 29                                                   
 30 "Robustness" is about dealing with crashes whi    
 31 process exits prematurely while holding a pthr    
 32 also shared with some other process (e.g. yum     
 33 pthread_mutex_t, or yum is kill -9-ed), then w    
 34 to be notified that the last owner of the lock    
 35 way.                                              
 36                                                   
 37 To solve such types of problems, "robust mutex    
 38 created: pthread_mutex_lock() returns an error    
 39 prematurely - and the new owner can decide whe    
 40 the lock can be recovered safely.                 
 41                                                   
 42 There is a big conceptual problem with futex b    
 43 the kernel that destroys the owner task (e.g.     
 44 the kernel cannot help with the cleanup: if th    
 45 (and in most cases there is none, futexes bein    
 46 then the kernel has no information to clean up    
 47 Userspace has no chance to clean up after the     
 48 the one that crashes, so it has no opportunity    
 49                                                   
 50 In practice, when e.g. yum is kill -9-ed (or s    
 51 is needed to release that futex based lock. Th    
 52 bugreports against yum.                           
 53                                                   
 54 To solve this problem, the traditional approac    
 55 (virtual memory area descriptor) concept to ha    
 56 robust futexes attached to this area'. This ap    
 57 syscall variants to sys_futex(): FUTEX_REGISTE    
 58 FUTEX_RECOVER. At do_exit() time, all vmas are    
 59 they have a robust_head set. This approach has    
 60 left:                                             
 61                                                   
 62  - it has quite complex locking and race scena    
 63    approach had been pending for years, but th    
 64    reliable.                                      
 65                                                   
 66  - they have to scan _every_ vma at sys_exit()    
 67                                                   
 68 The second disadvantage is a real killer: pthr    
 69 microsecond on Linux, but with thousands (or t    
 70 every pthread_exit() takes a millisecond or mo    
 71 destroying the CPU's L1 and L2 caches!            
 72                                                   
 73 This is very much noticeable even for normal p    
 74 calls: the kernel has to do the vma scanning u    
 75 because the kernel has no knowledge about how     
 76 are to be cleaned up, because a robust futex m    
 77 in another task, and the futex variable might     
 78 into this process's address space).               
 79                                                   
 80 This huge overhead forced the creation of CONF    
 81 normal kernels can turn it off, but worse than    
 82 robust futexes impractical for any type of gen    
 83                                                   
 84 So something had to be done.                      
 85                                                   
 86 New approach to robust futexes                    
 87 ------------------------------                    
 88                                                   
 89 At the heart of this new approach there is a p    
 90 robust locks that userspace is holding (mainta    
 91 userspace list is registered with the kernel v    
 92 registration happens at most once per thread l    
 93 time, the kernel checks this user-space list:     
 94 locks to be cleaned up?                           
 95                                                   
 96 In the common case, at do_exit() time, there i    
 97 the cost of robust futexes is just a simple cu    
 98 comparison. If the thread has registered a lis    
 99 is empty. If the thread/process crashed or ter    
100 way then the list might be non-empty: in this     
101 walks the list [not trusting it], and marks al    
102 this thread with the FUTEX_OWNER_DIED bit, and    
103 any).                                             
104                                                   
105 The list is guaranteed to be private and per-t    
106 so it can be accessed by the kernel in a lockl    
107                                                   
108 There is one race possible though: since addin    
109 list is done after the futex is acquired by gl    
110 instructions window for the thread (or process    
111 the futex hung. To protect against this possib    
112 also maintains a simple per-thread 'list_op_pe    
113 kernel to clean up if the thread dies after ac    
114 before it could have added itself to the list.    
115 list_op_pending field before it tries to acqui    
116 it after the list-add (or list-remove) has fin    
117                                                   
118 That's all that is needed - all the rest of ro    
119 in userspace [just like with the previous patc    
120                                                   
121 Ulrich Drepper has implemented the necessary g    
122 mechanism, which fully enables robust mutexes.    
123                                                   
124 Key differences of this userspace-list based a    
125 vma based method:                                 
126                                                   
127  - it's much, much faster: at thread exit time    
128    over every vma (!), which the VM-based meth    
129    simple 'is the list empty' op is done.         
130                                                   
131  - no VM changes are needed - 'struct address_    
132                                                   
133  - no registration of individual locks is need    
134    need any extra per-lock syscalls. Robust mu    
135    lightweight primitive - so they don't force    
136    to do a hard choice between performance and    
137    mutexes are just as fast.                      
138                                                   
139  - no per-lock kernel allocation happens.         
140                                                   
141  - no resource limits are needed.                 
142                                                   
143  - no kernel-space recovery call (FUTEX_RECOVE    
144                                                   
145  - the implementation and the locking is "obvi    
146    interactions with the VM.                      
147                                                   
148 Performance                                       
149 -----------                                       
150                                                   
151 I have benchmarked the time needed for the ker    
152 million (!) held locks, using the new method [    
153                                                   
154  - with FUTEX_WAIT set [contended mutex]: 130     
155  - without FUTEX_WAIT set [uncontended mutex]:    
156                                                   
157 I have also measured an approach where glibc d    
158 [which it currently does for !pshared robust m    
159 msecs - clearly slower, due to the 1 million F    
160 userspace had to do.                              
161                                                   
162 (1 million held locks are unheard of - we expe    
163 locks to be held at a time. Nevertheless it's     
164 approach scales nicely.)                          
165                                                   
166 Implementation details                            
167 ----------------------                            
168                                                   
169 The patch adds two new syscalls: one to regist    
170 one to query the registered list pointer::        
171                                                   
172  asmlinkage long                                  
173  sys_set_robust_list(struct robust_list_head _    
174                      size_t len);                 
175                                                   
176  asmlinkage long                                  
177  sys_get_robust_list(int pid, struct robust_li    
178                      size_t __user *len_ptr);     
179                                                   
180 List registration is very fast: the pointer is    
181 current->robust_list. [Note that in the future    
182 widespread, we could extend sys_clone() to reg    
183 for new threads, without the need of another s    
184                                                   
185 So there is virtually zero overhead for tasks     
186 and even for robust futex users, there is only    
187 thread lifetime, and the cleanup operation, if    
188 straightforward. The kernel doesn't have any i    
189 robust and normal futexes.                        
190                                                   
191 If a futex is found to be held at exit time, t    
192 following bit of the futex word::                 
193                                                   
194         #define FUTEX_OWNER_DIED        0x4000    
195                                                   
196 and wakes up the next futex waiter (if any). U    
197 the cleanup.                                      
198                                                   
199 Otherwise, robust futexes are acquired by glib    
200 the futex field atomically. Waiters set the FU    
201                                                   
202         #define FUTEX_WAITERS           0x8000    
203                                                   
204 and the remaining bits are for the TID.           
205                                                   
206 Testing, architecture support                     
207 -----------------------------                     
208                                                   
209 I've tested the new syscalls on x86 and x86_64    
210 parsing of the userspace list is robust [ ;-)     
211 deliberately corrupted.                           
212                                                   
213 i386 and x86_64 syscalls are wired up at the m    
214 tested the new glibc code (on x86_64 and i386)    
215 robust-mutex testcases.                           
216                                                   
217 All other architectures should build just fine    
218 the new syscalls yet.                             
219                                                   
220 Architectures need to implement the new futex_    
221 inline function before writing up the syscalls    
                                                      

~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~

kernel.org | git.kernel.org | LWN.net | Project Home | SVN repository | Mail admin

Linux® is a registered trademark of Linus Torvalds in the United States and other countries.
TOMOYO® is a registered trademark of NTT DATA CORPORATION.

sflogo.php