~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~

TOMOYO Linux Cross Reference
Linux/Documentation/filesystems/path-lookup.txt

Version: ~ [ linux-6.12-rc7 ] ~ [ linux-6.11.7 ] ~ [ linux-6.10.14 ] ~ [ linux-6.9.12 ] ~ [ linux-6.8.12 ] ~ [ linux-6.7.12 ] ~ [ linux-6.6.60 ] ~ [ linux-6.5.13 ] ~ [ linux-6.4.16 ] ~ [ linux-6.3.13 ] ~ [ linux-6.2.16 ] ~ [ linux-6.1.116 ] ~ [ linux-6.0.19 ] ~ [ linux-5.19.17 ] ~ [ linux-5.18.19 ] ~ [ linux-5.17.15 ] ~ [ linux-5.16.20 ] ~ [ linux-5.15.171 ] ~ [ linux-5.14.21 ] ~ [ linux-5.13.19 ] ~ [ linux-5.12.19 ] ~ [ linux-5.11.22 ] ~ [ linux-5.10.229 ] ~ [ linux-5.9.16 ] ~ [ linux-5.8.18 ] ~ [ linux-5.7.19 ] ~ [ linux-5.6.19 ] ~ [ linux-5.5.19 ] ~ [ linux-5.4.285 ] ~ [ linux-5.3.18 ] ~ [ linux-5.2.21 ] ~ [ linux-5.1.21 ] ~ [ linux-5.0.21 ] ~ [ linux-4.20.17 ] ~ [ linux-4.19.323 ] ~ [ linux-4.18.20 ] ~ [ linux-4.17.19 ] ~ [ linux-4.16.18 ] ~ [ linux-4.15.18 ] ~ [ linux-4.14.336 ] ~ [ linux-4.13.16 ] ~ [ linux-4.12.14 ] ~ [ linux-4.11.12 ] ~ [ linux-4.10.17 ] ~ [ linux-4.9.337 ] ~ [ linux-4.4.302 ] ~ [ linux-3.10.108 ] ~ [ linux-2.6.32.71 ] ~ [ linux-2.6.0 ] ~ [ linux-2.4.37.11 ] ~ [ unix-v6-master ] ~ [ ccs-tools-1.8.12 ] ~ [ policy-sample ] ~
Architecture: ~ [ i386 ] ~ [ alpha ] ~ [ m68k ] ~ [ mips ] ~ [ ppc ] ~ [ sparc ] ~ [ sparc64 ] ~

Diff markup

Differences between /Documentation/filesystems/path-lookup.txt (Version linux-6.12-rc7) and /Documentation/filesystems/path-lookup.txt (Version linux-2.6.32.71)


  1 Path walking and name lookup locking              
  2 ====================================              
  3                                                   
  4 Path resolution is the finding a dentry corres    
  5 performing a path walk. Typically, for every o    
  6 will be resolved. Paths are resolved by walkin    
  7 with the first component of the pathname (eg.     
  8 then finding the child of that dentry, which i    
  9 path string. Then repeating the lookup from th    
 10 child with the next element, and so on.           
 11                                                   
 12 Since it is a frequent operation for workloads    
 13 web servers, it is important to optimize this     
 14                                                   
 15 Path walking synchronisation history:             
 16 Prior to 2.5.10, dcache_lock was acquired in d    
 17 thus in every component during path look-up. S    
 18 algorithm changed this by holding the dcache_l    
 19 as many cached path component dentries as poss    
 20 decreases the number of acquisition of dcache_    
 21 the lock hold time significantly and affects p    
 22 Since 2.5.62 kernel, dcache has been using a n    
 23 make dcache look-up lock-free.                    
 24                                                   
 25 All the above algorithms required taking a loc    
 26 dentry that was looked up, so that may be used    
 27 next path element. This is inefficient and uns    
 28 because of the locks and atomic operations req    
 29 slows things down. It is not scalable because     
 30 are path-walk intensive tend to do path lookup    
 31 (usually, the root "/" or current working dire    
 32 common path elements causes lock and cacheline    
 33                                                   
 34 Since 2.6.38, RCU is used to make a significan    
 35 (including dcache look-up) completely "store-f    
 36 even stores into cachelines of common dentries    
 37 path walking.                                     
 38                                                   
 39 Path walking overview                             
 40 =====================                             
 41                                                   
 42 A name string specifies a start (root director    
 43 sequence of elements (directory entry names),     
 44 the namespace. A path is represented as a (den    
 45 elements are sub-strings, separated by '/'.       
 46                                                   
 47 Name lookups will want to find a particular pa    
 48 (usually the final element, or parent of final    
 49 the path given by the name's starting point (w    
 50 current->fs->cwd or current->fs->root) as the     
 51 iteratively for each subsequent name element,     
 52 parent with the given name and if it is not th    
 53 parent for the next lookup.                       
 54                                                   
 55 A parent, of course, must be a directory, and     
 56 permissions on the parent inode to be able to     
 57                                                   
 58 Turning the child into a parent for the next l    
 59 procedures. Symlinks essentially substitute th    
 60 name in the name string, and require some recu    
 61 must be followed into (thus changing the vfsmo    
 62 refer to), switching from the mount point path    
 63 mounted vfsmount. These behaviours are various    
 64 exact path walking flags.                         
 65                                                   
 66 Path walking then must, broadly, do several pa    
 67 - find the start point of the walk;               
 68 - perform permissions and validity checks on i    
 69 - perform dcache hash name lookups on (parent,    
 70 - traverse mount points;                          
 71 - traverse symlinks;                              
 72 - lookup and create missing parts of the path     
 73                                                   
 74 Safe store-free look-up of dcache hash table      
 75 ============================================      
 76                                                   
 77 Dcache name lookup                                
 78 ------------------                                
 79 In order to lookup a dcache (parent, name) tup    
 80 and use that to select a bucket in the dcache-    
 81 in that bucket is then walked, and we do a ful    
 82 against our (parent, name) tuple.                 
 83                                                   
 84 The hash lists are RCU protected, so list walk    
 85 concurrent updates (insertion, deletion from t    
 86 list application with the exception of renames    
 87                                                   
 88 Parent and name members of a dentry, as well a    
 89 hash, and its inode are protected by the per-d    
 90 reference is taken on the dentry (while the fi    
 91 and this stabilises its d_inode pointer and ac    
 92 point to perform the next step of our path wal    
 93                                                   
 94 These members are also protected by d_seq seql    
 95 read-only protection and no durability of resu    
 96 using d_seq for synchronisation (see seqcount     
 97                                                   
 98 Renames                                           
 99 -------                                           
100 Back to the rename case. In usual RCU protecte    
101 will happen to an object is insertion, and the    
102 list. The object will not be reused until an R    
103 This ensures the RCU list traversal primitives    
104 problems (see RCU documentation for how this w    
105                                                   
106 However when a dentry is renamed, its hash val    
107 moved to a new hash list. Allocating and inser    
108 expensive and also problematic for directory d    
109 high to wait for a grace period after removing    
110 it in the new hash bucket. So what is done is     
111 new list immediately.                             
112                                                   
113 However, when the dentry's list pointers are u    
114 new list before waiting for a grace period, th    
115 lookup of the old list veering off into the ne    
116 the remaining dentries on the list.               
117                                                   
118 There is no fundamental problem with walking d    
119 dentry comparisons will never match. However i    
120 dentry. So a seqlock is used to detect when a     
121 lookup can be retried.                            
122                                                   
123          1      2      3                          
124         +---+  +---+  +---+                       
125 hlist-->| N-+->| N-+->| N-+->                     
126 head <--+-P |<-+-P |<-+-P |                       
127         +---+  +---+  +---+                       
128                                                   
129 Rename of dentry 2 may require it deleted from    
130 into a new list. Deleting 2 gives the followin    
131                                                   
132          1             3                          
133         +---+         +---+     (don't worry,     
134 hlist-->| N-+-------->| N-+->    impose a meas    
135 head <--+-P |<--------+-P |      on modern CPU    
136         +---+         +---+                       
137           ^      2      ^                         
138           |    +---+    |                         
139           |    | N-+----+                         
140           +----+-P |                              
141                +---+                              
142                                                   
143 This is a standard RCU-list deletion, which le    
144 pointers intact, so a concurrent list walker t    
145 object 2 will correctly continue to object 3 w    
146 next object.                                      
147                                                   
148 However, when inserting object 2 onto a new li    
149                                                   
150          1             3                          
151         +---+         +---+                       
152 hlist-->| N-+-------->| N-+->                     
153 head <--+-P |<--------+-P |                       
154         +---+         +---+                       
155                  2                                
156                +---+                              
157                | N-+---->                         
158           <----+-P |                              
159                +---+                              
160                                                   
161 Because we didn't wait for a grace period, the    
162 still at 2. Now when it follows 2's 'next' poi    
163 another list without ever having checked objec    
164                                                   
165 A related, but distinctly different, issue is     
166 lookup operations. If a file is renamed from '    
167 find either 'A' or 'B'. So if a lookup of 'A'     
168 of 'B' must succeed (note the reverse is not t    
169                                                   
170 Between deleting the dentry from the old hash     
171 hash list, a lookup may find neither 'A' nor '    
172 rename seqlock is also used to cover this race    
173 retrying a negative lookup result if a rename     
174                                                   
175 Seqcount based lookups                            
176 ----------------------                            
177 In refcount based dcache lookups, d_lock is us    
178 the dentry, stabilising it while comparing its    
179 taking a reference count (the reference count     
180 start the next part of the path walk from).       
181                                                   
182 As explained above, we would like to do path w    
183 reference counts on intermediate dentries alon    
184 dentry seqlock (d_seq) is used to take a "cohe    
185 looks like (its name, parent, and inode). That    
186 the next part of the path walk. When loading t    
187 care must be taken to load the members up-fron    
188 than reloading from the dentry later on (other    
189 like d_inode going NULL underneath us, if the     
190                                                   
191 Also important is to avoid performing any dest    
192 no non-atomic stores to shared data), and to r    
193 "done" with the operation. Retry or abort if t    
194 Avoiding destructive or changing operations me    
195 failure.                                          
196                                                   
197 What this means is that a caller, provided the    
198 protect the dentry object from disappearing, c    
199 lookup which does not increment the refcount o    
200 it in any way. This returned dentry can be use    
201 provided that d_seq is rechecked after that op    
202                                                   
203 Inodes are also rcu freed, so the seqcount loo    
204 queried for permissions.                          
205                                                   
206 With this two parts of the puzzle, we can do p    
207 locks or refcounts on dentry elements.            
208                                                   
209 RCU-walk path walking design                      
210 ============================                      
211                                                   
212 Path walking code now has two distinct modes,     
213 is the traditional[*] way of performing dcache    
214 serialise concurrent modifications to the dent    
215 it. ref-walk is simple and obvious, and may sl    
216 walking is operating on each dentry. rcu-walk     
217 lookups, and can perform lookup of intermediat    
218 shared data in the dentry or inode. rcu-walk c    
219 eg. if the filesystem must sleep or perform no    
220 must be switched to ref-walk mode.                
221                                                   
222 [*] RCU is still used for the dentry hash look    
223     path walk.                                    
224                                                   
225 Where ref-walk uses a stable, refcounted ``par    
226 path string, rcu-walk uses a d_seq protected s    
227 child of this parent snapshot, we open d_seq c    
228 before closing d_seq critical section on the p    
229 ladder of snapshots to walk down.                 
230                                                   
231                                                   
232      proc 101                                     
233       /----------------\                          
234      / comm:    "vi"    \                         
235     /  fs.root: dentry0  \                        
236     \  fs.cwd:  dentry2  /                        
237      \                  /                         
238       \----------------/                          
239                                                   
240 So when vi wants to open("/home/npiggin/test.c    
241 start from current->fs->root, which is a pinne    
242 "./test.c" would start from cwd; both names re    
243 the context of proc101.                           
244                                                   
245      dentry 0                                     
246     +---------------------+   rcu-walk begins     
247     | name:    "/"        |   inode's permissi    
248     | inode:   10         |   path element whi    
249     | children:"home", ...|                       
250     +---------------------+                       
251               |                                   
252      dentry 1 V                                   
253     +---------------------+   ... which brings    
254     | name:    "home"     |   hash lookup, the    
255     | inode:   678        |   string and paren    
256     | children:"npiggin"  |   we now recheck t    
257     +---------------------+   check inode and     
258               |                                   
259      dentry2  V                                   
260     +---------------------+   Note: if dentry0    
261     | name:    "npiggin"  |   not necessarily     
262     | inode:   543        |   parent for d_seq    
263     | children:"a.c", ... |   can be forgotten    
264     +---------------------+                       
265               |                                   
266      dentry3  V                                   
267     +---------------------+   At this point we    
268     | name:    "a.c"      |   We now take its     
269     | inode:   14221      |   dentry. If that     
270     | children:NULL       |   its refcount bec    
271     +---------------------+                       
272                                                   
273 Taking a refcount on a dentry from rcu-walk mo    
274 re-checking its d_seq, and then incrementing i    
275 "dropping rcu" or dropping from rcu-walk into     
276                                                   
277 It is, in some sense, a bit of a house of card    
278 parent snapshot fails, the house comes down, b    
279 section on the grandparent, so we have nothing    
280 the path walk must be fully restarted (which w    
281 live locks). It is costly to have a full resta    
282 quite rare.                                       
283                                                   
284 When we reach a point where sleeping is requir    
285 requires ref-walk, then instead of restarting     
286 at the last known good dentry we have. Avoidin    
287 these cases is fundamental for performance and    
288 operations such as creates and unlinks are not    
289                                                   
290 The detailed design for rcu-walk is like this:    
291 * LOOKUP_RCU is set in nd->flags, which distin    
292 * Take the RCU lock for the entire path walk,     
293   of the starting path (eg. root/cwd/fd-path).    
294   not required for dentry persistence.            
295 * synchronize_rcu is called when unregistering    
296   access d_ops and i_ops during rcu-walk.         
297 * Similarly take the vfsmount lock for the ent    
298   refcounts are not required for persistence.     
299   lookups, and to assume dentry mount points a    
300   down the path.                                  
301 * Have a per-dentry seqlock to protect the den    
302   so we can load this tuple atomically, and al    
303   members have changed.                           
304 * Dentry lookups (based on parent, candidate s    
305   sequence after the child is found in case an    
306   during the path walk.                           
307 * inode is also RCU protected so we can load d    
308   limited things.                                 
309 * i_mode, i_uid, i_gid can be tested for exec     
310 * i_op can be loaded.                             
311 * When the destination dentry is reached, drop    
312   verify d_seq, increment refcount).              
313 * If seqlock verification fails anywhere along    
314   of the path lookup in ref-walk mode. -ECHILD    
315   a better errno) to signal an rcu-walk failur    
316                                                   
317 The cases where rcu-walk cannot continue are:     
318 * NULL dentry (ie. any uncached path element)     
319 * Following links                                 
320                                                   
321 It may be possible eventually to make followin    
322                                                   
323 Uncached path elements will always require dro    
324 very least because i_mutex needs to be grabbed    
325                                                   
326 Final note:                                       
327 "store-free" path walking is not strictly stor    
328 and refcounts (both of which can be made per-c    
329 stack (which is essentially CPU-local), and we    
330 refcount on final dentry.                         
331                                                   
332 The point is that shared data, where practical    
333 or stored into. The result is massive improvem    
334 scalability of path resolution.                   
335                                                   
336                                                   
337 Interesting statistics                            
338 ======================                            
339                                                   
340 The following table gives rcu lookup statistic    
341 (2s12c24t Westmere, debian non-graphical syste    
342 drop rcu that fail due to d_seq failure and re    
343 again. Other cases are successful rcu-drops th    
344 element, nodentry for missing dentry, revalida    
345 routine requiring rcu drop, permission for per    
346 and link for symlink traversal requiring drop.    
347                                                   
348      rcu-lookups     restart  nodentry            
349 bootup     47121           0      4624            
350 dbench  25386793           0   6778659(26.7%)     
351 kbuild   2696672          10     64442(2.3%)      
352 git diff   39605           0        28            
353 vfstest 24185492        4945    708725(2.9%) 1    
354                                                   
355 What this shows is that failed rcu-walk lookup    
356 entirely with ref-walk, are quite rare. Even t    
357 specifically has concurrent renames/mkdir/rmdi    
358 such races is not showing a huge amount of res    
359                                                   
360 Dropping from rcu-walk to ref-walk mean that w    
361 the reference count needs to be taken for some    
362 we have reached the target of the path walk, o    
363 condition that can't be resolved in rcu-walk m    
364 only when we have reached the target dentry, s    
365 this does not happen.                             
366                                                   
367 Note that a graceful drop from rcu-walk mode d    
368 dentry not existing (which can be common) is n    
369 rcu-walk scheme, because some elements of the     
370 rcu-walk mode. The further we get from common     
371 root), the less contended the dentry is likely    
372 common path elements, the more likely they wil    
373                                                   
374                                                   
375 Papers and other documentation on dcache locki    
376 ==============================================    
377                                                   
378 1. Scaling dcache with RCU (https://linuxjourn    
379                                                   
380 2. http://lse.sourceforge.net/locking/dcache/d    
381                                                   
382 3. path-lookup.md in this directory.              
                                                      

~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~

kernel.org | git.kernel.org | LWN.net | Project Home | SVN repository | Mail admin

Linux® is a registered trademark of Linus Torvalds in the United States and other countries.
TOMOYO® is a registered trademark of NTT DATA CORPORATION.

sflogo.php