1 Path walking and name lookup locking 2 ==================================== 3 4 Path resolution is the finding a dentry corres 5 performing a path walk. Typically, for every o 6 will be resolved. Paths are resolved by walkin 7 with the first component of the pathname (eg. 8 then finding the child of that dentry, which i 9 path string. Then repeating the lookup from th 10 child with the next element, and so on. 11 12 Since it is a frequent operation for workloads 13 web servers, it is important to optimize this 14 15 Path walking synchronisation history: 16 Prior to 2.5.10, dcache_lock was acquired in d 17 thus in every component during path look-up. S 18 algorithm changed this by holding the dcache_l 19 as many cached path component dentries as poss 20 decreases the number of acquisition of dcache_ 21 the lock hold time significantly and affects p 22 Since 2.5.62 kernel, dcache has been using a n 23 make dcache look-up lock-free. 24 25 All the above algorithms required taking a loc 26 dentry that was looked up, so that may be used 27 next path element. This is inefficient and uns 28 because of the locks and atomic operations req 29 slows things down. It is not scalable because 30 are path-walk intensive tend to do path lookup 31 (usually, the root "/" or current working dire 32 common path elements causes lock and cacheline 33 34 Since 2.6.38, RCU is used to make a significan 35 (including dcache look-up) completely "store-f 36 even stores into cachelines of common dentries 37 path walking. 38 39 Path walking overview 40 ===================== 41 42 A name string specifies a start (root director 43 sequence of elements (directory entry names), 44 the namespace. A path is represented as a (den 45 elements are sub-strings, separated by '/'. 46 47 Name lookups will want to find a particular pa 48 (usually the final element, or parent of final 49 the path given by the name's starting point (w 50 current->fs->cwd or current->fs->root) as the 51 iteratively for each subsequent name element, 52 parent with the given name and if it is not th 53 parent for the next lookup. 54 55 A parent, of course, must be a directory, and 56 permissions on the parent inode to be able to 57 58 Turning the child into a parent for the next l 59 procedures. Symlinks essentially substitute th 60 name in the name string, and require some recu 61 must be followed into (thus changing the vfsmo 62 refer to), switching from the mount point path 63 mounted vfsmount. These behaviours are various 64 exact path walking flags. 65 66 Path walking then must, broadly, do several pa 67 - find the start point of the walk; 68 - perform permissions and validity checks on i 69 - perform dcache hash name lookups on (parent, 70 - traverse mount points; 71 - traverse symlinks; 72 - lookup and create missing parts of the path 73 74 Safe store-free look-up of dcache hash table 75 ============================================ 76 77 Dcache name lookup 78 ------------------ 79 In order to lookup a dcache (parent, name) tup 80 and use that to select a bucket in the dcache- 81 in that bucket is then walked, and we do a ful 82 against our (parent, name) tuple. 83 84 The hash lists are RCU protected, so list walk 85 concurrent updates (insertion, deletion from t 86 list application with the exception of renames 87 88 Parent and name members of a dentry, as well a 89 hash, and its inode are protected by the per-d 90 reference is taken on the dentry (while the fi 91 and this stabilises its d_inode pointer and ac 92 point to perform the next step of our path wal 93 94 These members are also protected by d_seq seql 95 read-only protection and no durability of resu 96 using d_seq for synchronisation (see seqcount 97 98 Renames 99 ------- 100 Back to the rename case. In usual RCU protecte 101 will happen to an object is insertion, and the 102 list. The object will not be reused until an R 103 This ensures the RCU list traversal primitives 104 problems (see RCU documentation for how this w 105 106 However when a dentry is renamed, its hash val 107 moved to a new hash list. Allocating and inser 108 expensive and also problematic for directory d 109 high to wait for a grace period after removing 110 it in the new hash bucket. So what is done is 111 new list immediately. 112 113 However, when the dentry's list pointers are u 114 new list before waiting for a grace period, th 115 lookup of the old list veering off into the ne 116 the remaining dentries on the list. 117 118 There is no fundamental problem with walking d 119 dentry comparisons will never match. However i 120 dentry. So a seqlock is used to detect when a 121 lookup can be retried. 122 123 1 2 3 124 +---+ +---+ +---+ 125 hlist-->| N-+->| N-+->| N-+-> 126 head <--+-P |<-+-P |<-+-P | 127 +---+ +---+ +---+ 128 129 Rename of dentry 2 may require it deleted from 130 into a new list. Deleting 2 gives the followin 131 132 1 3 133 +---+ +---+ (don't worry, 134 hlist-->| N-+-------->| N-+-> impose a meas 135 head <--+-P |<--------+-P | on modern CPU 136 +---+ +---+ 137 ^ 2 ^ 138 | +---+ | 139 | | N-+----+ 140 +----+-P | 141 +---+ 142 143 This is a standard RCU-list deletion, which le 144 pointers intact, so a concurrent list walker t 145 object 2 will correctly continue to object 3 w 146 next object. 147 148 However, when inserting object 2 onto a new li 149 150 1 3 151 +---+ +---+ 152 hlist-->| N-+-------->| N-+-> 153 head <--+-P |<--------+-P | 154 +---+ +---+ 155 2 156 +---+ 157 | N-+----> 158 <----+-P | 159 +---+ 160 161 Because we didn't wait for a grace period, the 162 still at 2. Now when it follows 2's 'next' poi 163 another list without ever having checked objec 164 165 A related, but distinctly different, issue is 166 lookup operations. If a file is renamed from ' 167 find either 'A' or 'B'. So if a lookup of 'A' 168 of 'B' must succeed (note the reverse is not t 169 170 Between deleting the dentry from the old hash 171 hash list, a lookup may find neither 'A' nor ' 172 rename seqlock is also used to cover this race 173 retrying a negative lookup result if a rename 174 175 Seqcount based lookups 176 ---------------------- 177 In refcount based dcache lookups, d_lock is us 178 the dentry, stabilising it while comparing its 179 taking a reference count (the reference count 180 start the next part of the path walk from). 181 182 As explained above, we would like to do path w 183 reference counts on intermediate dentries alon 184 dentry seqlock (d_seq) is used to take a "cohe 185 looks like (its name, parent, and inode). That 186 the next part of the path walk. When loading t 187 care must be taken to load the members up-fron 188 than reloading from the dentry later on (other 189 like d_inode going NULL underneath us, if the 190 191 Also important is to avoid performing any dest 192 no non-atomic stores to shared data), and to r 193 "done" with the operation. Retry or abort if t 194 Avoiding destructive or changing operations me 195 failure. 196 197 What this means is that a caller, provided the 198 protect the dentry object from disappearing, c 199 lookup which does not increment the refcount o 200 it in any way. This returned dentry can be use 201 provided that d_seq is rechecked after that op 202 203 Inodes are also rcu freed, so the seqcount loo 204 queried for permissions. 205 206 With this two parts of the puzzle, we can do p 207 locks or refcounts on dentry elements. 208 209 RCU-walk path walking design 210 ============================ 211 212 Path walking code now has two distinct modes, 213 is the traditional[*] way of performing dcache 214 serialise concurrent modifications to the dent 215 it. ref-walk is simple and obvious, and may sl 216 walking is operating on each dentry. rcu-walk 217 lookups, and can perform lookup of intermediat 218 shared data in the dentry or inode. rcu-walk c 219 eg. if the filesystem must sleep or perform no 220 must be switched to ref-walk mode. 221 222 [*] RCU is still used for the dentry hash look 223 path walk. 224 225 Where ref-walk uses a stable, refcounted ``par 226 path string, rcu-walk uses a d_seq protected s 227 child of this parent snapshot, we open d_seq c 228 before closing d_seq critical section on the p 229 ladder of snapshots to walk down. 230 231 232 proc 101 233 /----------------\ 234 / comm: "vi" \ 235 / fs.root: dentry0 \ 236 \ fs.cwd: dentry2 / 237 \ / 238 \----------------/ 239 240 So when vi wants to open("/home/npiggin/test.c 241 start from current->fs->root, which is a pinne 242 "./test.c" would start from cwd; both names re 243 the context of proc101. 244 245 dentry 0 246 +---------------------+ rcu-walk begins 247 | name: "/" | inode's permissi 248 | inode: 10 | path element whi 249 | children:"home", ...| 250 +---------------------+ 251 | 252 dentry 1 V 253 +---------------------+ ... which brings 254 | name: "home" | hash lookup, the 255 | inode: 678 | string and paren 256 | children:"npiggin" | we now recheck t 257 +---------------------+ check inode and 258 | 259 dentry2 V 260 +---------------------+ Note: if dentry0 261 | name: "npiggin" | not necessarily 262 | inode: 543 | parent for d_seq 263 | children:"a.c", ... | can be forgotten 264 +---------------------+ 265 | 266 dentry3 V 267 +---------------------+ At this point we 268 | name: "a.c" | We now take its 269 | inode: 14221 | dentry. If that 270 | children:NULL | its refcount bec 271 +---------------------+ 272 273 Taking a refcount on a dentry from rcu-walk mo 274 re-checking its d_seq, and then incrementing i 275 "dropping rcu" or dropping from rcu-walk into 276 277 It is, in some sense, a bit of a house of card 278 parent snapshot fails, the house comes down, b 279 section on the grandparent, so we have nothing 280 the path walk must be fully restarted (which w 281 live locks). It is costly to have a full resta 282 quite rare. 283 284 When we reach a point where sleeping is requir 285 requires ref-walk, then instead of restarting 286 at the last known good dentry we have. Avoidin 287 these cases is fundamental for performance and 288 operations such as creates and unlinks are not 289 290 The detailed design for rcu-walk is like this: 291 * LOOKUP_RCU is set in nd->flags, which distin 292 * Take the RCU lock for the entire path walk, 293 of the starting path (eg. root/cwd/fd-path). 294 not required for dentry persistence. 295 * synchronize_rcu is called when unregistering 296 access d_ops and i_ops during rcu-walk. 297 * Similarly take the vfsmount lock for the ent 298 refcounts are not required for persistence. 299 lookups, and to assume dentry mount points a 300 down the path. 301 * Have a per-dentry seqlock to protect the den 302 so we can load this tuple atomically, and al 303 members have changed. 304 * Dentry lookups (based on parent, candidate s 305 sequence after the child is found in case an 306 during the path walk. 307 * inode is also RCU protected so we can load d 308 limited things. 309 * i_mode, i_uid, i_gid can be tested for exec 310 * i_op can be loaded. 311 * When the destination dentry is reached, drop 312 verify d_seq, increment refcount). 313 * If seqlock verification fails anywhere along 314 of the path lookup in ref-walk mode. -ECHILD 315 a better errno) to signal an rcu-walk failur 316 317 The cases where rcu-walk cannot continue are: 318 * NULL dentry (ie. any uncached path element) 319 * Following links 320 321 It may be possible eventually to make followin 322 323 Uncached path elements will always require dro 324 very least because i_mutex needs to be grabbed 325 326 Final note: 327 "store-free" path walking is not strictly stor 328 and refcounts (both of which can be made per-c 329 stack (which is essentially CPU-local), and we 330 refcount on final dentry. 331 332 The point is that shared data, where practical 333 or stored into. The result is massive improvem 334 scalability of path resolution. 335 336 337 Interesting statistics 338 ====================== 339 340 The following table gives rcu lookup statistic 341 (2s12c24t Westmere, debian non-graphical syste 342 drop rcu that fail due to d_seq failure and re 343 again. Other cases are successful rcu-drops th 344 element, nodentry for missing dentry, revalida 345 routine requiring rcu drop, permission for per 346 and link for symlink traversal requiring drop. 347 348 rcu-lookups restart nodentry 349 bootup 47121 0 4624 350 dbench 25386793 0 6778659(26.7%) 351 kbuild 2696672 10 64442(2.3%) 352 git diff 39605 0 28 353 vfstest 24185492 4945 708725(2.9%) 1 354 355 What this shows is that failed rcu-walk lookup 356 entirely with ref-walk, are quite rare. Even t 357 specifically has concurrent renames/mkdir/rmdi 358 such races is not showing a huge amount of res 359 360 Dropping from rcu-walk to ref-walk mean that w 361 the reference count needs to be taken for some 362 we have reached the target of the path walk, o 363 condition that can't be resolved in rcu-walk m 364 only when we have reached the target dentry, s 365 this does not happen. 366 367 Note that a graceful drop from rcu-walk mode d 368 dentry not existing (which can be common) is n 369 rcu-walk scheme, because some elements of the 370 rcu-walk mode. The further we get from common 371 root), the less contended the dentry is likely 372 common path elements, the more likely they wil 373 374 375 Papers and other documentation on dcache locki 376 ============================================== 377 378 1. Scaling dcache with RCU (https://linuxjourn 379 380 2. http://lse.sourceforge.net/locking/dcache/d 381 382 3. path-lookup.md in this directory.
Linux® is a registered trademark of Linus Torvalds in the United States and other countries.
TOMOYO® is a registered trademark of NTT DATA CORPORATION.