.. SPDX-License-Identifier: GPL-2.0

.. _kernel_hacking_locktypes:

==========================
Lock types and their rules
==========================

Introduction
============

The kernel provides a variety of locking primitives which can be divided
into three categories:

- Sleeping locks
- CPU local locks
- Spinning locks

This document conceptually describes these lock types and provides rules
for their nesting, including the rules for use under PREEMPT_RT.


Lock categories
===============

Sleeping locks
--------------

Sleeping locks can only be acquired in preemptible task context.

Although implementations allow try_lock() from other contexts, it is
necessary to carefully evaluate the safety of unlock() as well as of
try_lock(). Furthermore, it is also necessary to evaluate the debugging
versions of these primitives. In short, don't acquire sleeping locks from
other contexts unless there is no other option.

Sleeping lock types:

- mutex
- rt_mutex
- semaphore
- rw_semaphore
- ww_mutex
- percpu_rw_semaphore

On PREEMPT_RT kernels, these lock types are converted to sleeping locks:

- local_lock
- spinlock_t
- rwlock_t


CPU local locks
---------------

- local_lock

On non-PREEMPT_RT kernels, local_lock functions are wrappers around
preemption and interrupt disabling primitives. Contrary to other locking
mechanisms, disabling preemption or interrupts is a pure CPU-local
concurrency control mechanism and not suited for inter-CPU concurrency
control.


Spinning locks
--------------

- raw_spinlock_t
- bit spinlocks

On non-PREEMPT_RT kernels, these lock types are also spinning locks:

- spinlock_t
- rwlock_t

Spinning locks implicitly disable preemption and the lock / unlock functions
can have suffixes which apply further protections:

===================  ====================================================
_bh()                Disable / enable bottom halves (soft interrupts)
_irq()               Disable / enable interrupts
_irqsave/restore()   Save and disable / restore interrupt disabled state
===================  ====================================================


Owner semantics
===============

The aforementioned lock types except semaphores have strict owner
semantics:

  The context (task) that acquired the lock must release it.

rw_semaphores have a special interface which allows non-owner release for
readers.


rtmutex
=======

RT-mutexes are mutexes with support for priority inheritance (PI).

PI has limitations on non-PREEMPT_RT kernels due to preemption and
interrupt disabled sections.

PI clearly cannot preempt preemption-disabled or interrupt-disabled
regions of code, even on PREEMPT_RT kernels. Instead, PREEMPT_RT kernels
execute most such regions of code in preemptible task context, especially
interrupt handlers and soft interrupts. This conversion allows spinlock_t
and rwlock_t to be implemented via RT-mutexes.


semaphore
=========

semaphore is a counting semaphore implementation.

Semaphores are often used for both serialization and waiting, but new use
cases should instead use separate serialization and wait mechanisms, such
as mutexes and completions.
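
As a hedged illustration of that guidance, the following sketch separates
the two roles: a mutex serializes updates to a result and a completion
handles the waiting. The foo_* names and the foo_result variable are
invented for this example and are not kernel APIs::

  #include <linux/mutex.h>
  #include <linux/completion.h>

  static DEFINE_MUTEX(foo_lock);           /* serialization */
  static DECLARE_COMPLETION(foo_done);     /* waiting */
  static int foo_result;

  static void foo_produce(int val)
  {
          mutex_lock(&foo_lock);           /* serialize access to foo_result */
          foo_result = val;
          mutex_unlock(&foo_lock);

          complete(&foo_done);             /* wake up one waiter */
  }

  static int foo_consume(void)
  {
          int val;

          wait_for_completion(&foo_done);  /* sleep until foo_produce() ran */

          mutex_lock(&foo_lock);
          val = foo_result;
          mutex_unlock(&foo_lock);
          return val;
  }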

semaphores and PREEMPT_RT
-------------------------

PREEMPT_RT does not change the semaphore implementation because counting
semaphores have no concept of owners, thus preventing PREEMPT_RT from
providing priority inheritance for semaphores. After all, an unknown
owner cannot be boosted. As a consequence, blocking on semaphores can
result in priority inversion.


rw_semaphore
============

rw_semaphore is a multiple readers and single writer lock mechanism.

On non-PREEMPT_RT kernels the implementation is fair, thus preventing
writer starvation.

rw_semaphore complies by default with the strict owner semantics, but there
exist special-purpose interfaces that allow non-owner release for readers.
These interfaces work independently of the kernel configuration.

rw_semaphore and PREEMPT_RT
---------------------------

PREEMPT_RT kernels map rw_semaphore to a separate rt_mutex-based
implementation, thus changing the fairness:

  Because an rw_semaphore writer cannot grant its priority to multiple
  readers, a preempted low-priority reader will continue holding its lock,
  thus starving even high-priority writers. In contrast, because readers
  can grant their priority to a writer, a preempted low-priority writer will
  have its priority boosted until it releases the lock, thus preventing that
  writer from starving readers.


local_lock
==========

local_lock provides a named scope to critical sections which are protected
by disabling preemption or interrupts.

On non-PREEMPT_RT kernels local_lock operations map to the preemption and
interrupt disabling and enabling primitives:

===============================  ======================
local_lock(&llock)               preempt_disable()
local_unlock(&llock)             preempt_enable()
local_lock_irq(&llock)           local_irq_disable()
local_unlock_irq(&llock)         local_irq_enable()
local_lock_irqsave(&llock)       local_irq_save()
local_unlock_irqrestore(&llock)  local_irq_restore()
===============================  ======================

The named scope of local_lock has two advantages over the regular
primitives:

- The lock name allows static analysis and is also clear documentation
  of the protection scope, while the regular primitives are scopeless and
  opaque.

- If lockdep is enabled, the local_lock gains a lockmap which allows
  validating the correctness of the protection. This can detect cases
  where, e.g., a function using preempt_disable() as protection mechanism
  is invoked from interrupt or soft-interrupt context. Aside from that,
  lockdep_assert_held(&llock) works as with any other locking primitive.

local_lock and PREEMPT_RT
-------------------------

PREEMPT_RT kernels map local_lock to a per-CPU spinlock_t, thus changing
semantics:

- All spinlock_t changes also apply to local_lock.

local_lock usage
----------------

local_lock should be used in situations where disabling preemption or
interrupts is the appropriate form of concurrency control to protect
per-CPU data structures on a non-PREEMPT_RT kernel.

local_lock is not suitable to protect against preemption or interrupts on a
PREEMPT_RT kernel due to the PREEMPT_RT-specific spinlock_t semantics.
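
As a hedged sketch of the recommended usage above, the example below
protects a per-CPU counter with a local_lock. The foo_pcpu structure and
the function name are invented for this illustration and are not kernel
APIs::

  #include <linux/local_lock.h>
  #include <linux/percpu.h>

  struct foo_pcpu {
          local_lock_t lock;
          unsigned int count;
  };

  static DEFINE_PER_CPU(struct foo_pcpu, foo_pcpu) = {
          .lock = INIT_LOCAL_LOCK(lock),
  };

  static void foo_count(void)
  {
          /*
           * non-PREEMPT_RT: disables preemption.
           * PREEMPT_RT: acquires the underlying per-CPU spinlock_t.
           */
          local_lock(&foo_pcpu.lock);
          this_cpu_inc(foo_pcpu.count);
          local_unlock(&foo_pcpu.lock);
  }

On a PREEMPT_RT kernel the same code keeps the task preemptible while still
serializing access to the per-CPU data, as described in the local_lock and
PREEMPT_RT section above.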


raw_spinlock_t and spinlock_t
=============================

raw_spinlock_t
--------------

raw_spinlock_t is a strict spinning lock implementation in all kernels,
including PREEMPT_RT kernels. Use raw_spinlock_t only in real critical
core code, low-level interrupt handling and places where disabling
preemption or interrupts is required, for example, to safely access
hardware state. raw_spinlock_t can sometimes also be used when the
critical section is tiny, thus avoiding RT-mutex overhead.

spinlock_t
----------

The semantics of spinlock_t change with the state of PREEMPT_RT.

On a non-PREEMPT_RT kernel spinlock_t is mapped to raw_spinlock_t and has
exactly the same semantics.

spinlock_t and PREEMPT_RT
-------------------------

On a PREEMPT_RT kernel spinlock_t is mapped to a separate implementation
based on rt_mutex which changes the semantics:

- Preemption is not disabled.

- The hard interrupt related suffixes for spin_lock / spin_unlock
  operations (_irq, _irqsave / _irqrestore) do not affect the CPU's
  interrupt disabled state.

- The soft interrupt related suffix (_bh()) still disables softirq
  handlers.

  Non-PREEMPT_RT kernels disable preemption to get this effect.

  PREEMPT_RT kernels use a per-CPU lock for serialization which keeps
  preemption enabled. The lock disables softirq handlers and also
  prevents reentrancy due to task preemption.

PREEMPT_RT kernels preserve all other spinlock_t semantics:

- Tasks holding a spinlock_t do not migrate. Non-PREEMPT_RT kernels
  avoid migration by disabling preemption. PREEMPT_RT kernels instead
  disable migration, which ensures that pointers to per-CPU variables
  remain valid even if the task is preempted.

- Task state is preserved across spinlock acquisition, ensuring that the
  task-state rules apply to all kernel configurations. Non-PREEMPT_RT
  kernels leave task state untouched. However, PREEMPT_RT must change
  task state if the task blocks during acquisition. Therefore, it saves
  the current task state before blocking and the corresponding lock wakeup
  restores it, as shown below::

     task->state = TASK_INTERRUPTIBLE
     lock()
       block()
         task->saved_state = task->state
         task->state = TASK_UNINTERRUPTIBLE
         schedule()
                                   lock wakeup
           task->state = task->saved_state

  Other types of wakeups would normally unconditionally set the task state
  to RUNNING, but that does not work here because the task must remain
  blocked until the lock becomes available. Therefore, when a non-lock
  wakeup attempts to awaken a task blocked waiting for a spinlock, it
  instead sets the saved state to RUNNING. Then, when the lock
  acquisition completes, the lock wakeup sets the task state to the saved
  state, in this case setting it to RUNNING::

     task->state = TASK_INTERRUPTIBLE
     lock()
       block()
         task->saved_state = task->state
         task->state = TASK_UNINTERRUPTIBLE
         schedule()
                                   non lock wakeup
                                     task->saved_state = TASK_RUNNING

                                   lock wakeup
           task->state = task->saved_state

  This ensures that the real wakeup cannot be lost.


rwlock_t
========

rwlock_t is a multiple readers and single writer lock mechanism.

Non-PREEMPT_RT kernels implement rwlock_t as a spinning lock and the
suffix rules of spinlock_t apply accordingly. The implementation is fair,
thus preventing writer starvation.
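
As a hedged usage sketch, a reader/writer pair protecting invented data
could look like this; foo_rwlock, struct foo_data and the helper functions
are made up for the illustration::

  #include <linux/spinlock.h>

  struct foo_data {
          int a;
          int b;
  };

  static struct foo_data foo_data;
  static DEFINE_RWLOCK(foo_rwlock);

  static void foo_read(struct foo_data *snapshot)
  {
          read_lock(&foo_rwlock);          /* concurrent readers are allowed */
          *snapshot = foo_data;
          read_unlock(&foo_rwlock);
  }

  static void foo_write(const struct foo_data *update)
  {
          write_lock(&foo_rwlock);         /* excludes readers and writers */
          foo_data = *update;
          write_unlock(&foo_rwlock);
  }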

rwlock_t and PREEMPT_RT
-----------------------

PREEMPT_RT kernels map rwlock_t to a separate rt_mutex-based
implementation, thus changing semantics:

- All the spinlock_t changes also apply to rwlock_t.

- Because an rwlock_t writer cannot grant its priority to multiple
  readers, a preempted low-priority reader will continue holding its lock,
  thus starving even high-priority writers. In contrast, because readers
  can grant their priority to a writer, a preempted low-priority writer
  will have its priority boosted until it releases the lock, thus
  preventing that writer from starving readers.


PREEMPT_RT caveats
==================

local_lock on RT
----------------

The mapping of local_lock to spinlock_t on PREEMPT_RT kernels has a few
implications. For example, on a non-PREEMPT_RT kernel the following code
sequence works as expected::

  local_lock_irq(&local_lock);
  raw_spin_lock(&lock);

and is fully equivalent to::

  raw_spin_lock_irq(&lock);

On a PREEMPT_RT kernel this code sequence breaks because local_lock_irq()
is mapped to a per-CPU spinlock_t which neither disables interrupts nor
preemption. The following code sequence works correctly on both
PREEMPT_RT and non-PREEMPT_RT kernels::

  local_lock_irq(&local_lock);
  spin_lock(&lock);

Another caveat with local locks is that each local_lock has a specific
protection scope. So the following substitution is wrong::

  func1()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock_1, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_1, flags);
  }

  func2()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock_2, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_2, flags);
  }

  func3()
  {
    lockdep_assert_irqs_disabled();
    access_protected_data();
  }

On a non-PREEMPT_RT kernel this works correctly, but on a PREEMPT_RT kernel
local_lock_1 and local_lock_2 are distinct and cannot serialize the callers
of func3(). Also the lockdep assert will trigger on a PREEMPT_RT kernel
because local_lock_irqsave() does not disable interrupts due to the
PREEMPT_RT-specific semantics of spinlock_t. The correct substitution is::

  func1()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
  }

  func2()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
  }

  func3()
  {
    lockdep_assert_held(&local_lock);
    access_protected_data();
  }
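
The corrected example above references a shared local_lock without showing
its declaration. A hedged sketch of a matching per-CPU declaration, with
the variable name simply mirroring the example, could be::

  #include <linux/local_lock.h>
  #include <linux/percpu.h>

  /* One lock instance per CPU; INIT_LOCAL_LOCK() also initializes the
     lockdep map when lockdep is enabled. */
  static DEFINE_PER_CPU(local_lock_t, local_lock) =
          INIT_LOCAL_LOCK(local_lock);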

spinlock_t and rwlock_t
-----------------------

The changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels
have a few implications. For example, on a non-PREEMPT_RT kernel the
following code sequence works as expected::

  local_irq_disable();
  spin_lock(&lock);

and is fully equivalent to::

  spin_lock_irq(&lock);

The same applies to rwlock_t and the _irqsave() suffix variants.

On a PREEMPT_RT kernel this code sequence breaks because RT-mutex requires
a fully preemptible context. Instead, use spin_lock_irq() or
spin_lock_irqsave() and their unlock counterparts. In cases where the
interrupt disabling and locking must remain separate, PREEMPT_RT offers a
local_lock mechanism. Acquiring the local_lock pins the task to a CPU,
allowing things like per-CPU interrupt disabled locks to be acquired.
However, this approach should be used only where absolutely necessary.

A typical scenario is protection of per-CPU variables in thread context::

  struct foo *p = get_cpu_ptr(&var1);

  spin_lock(&p->lock);
  p->count += this_cpu_read(var2);

This is correct code on a non-PREEMPT_RT kernel, but on a PREEMPT_RT kernel
this breaks. The PREEMPT_RT-specific change of spinlock_t semantics does
not allow acquiring p->lock because get_cpu_ptr() implicitly disables
preemption. The following substitution works on both kernels::

  struct foo *p;

  migrate_disable();
  p = this_cpu_ptr(&var1);
  spin_lock(&p->lock);
  p->count += this_cpu_read(var2);

migrate_disable() ensures that the task is pinned on the current CPU which
in turn guarantees that the per-CPU accesses to var1 and var2 stay on the
same CPU while the task remains preemptible.

The migrate_disable() substitution is not valid for the following
scenario::

  func()
  {
    struct foo *p;

    migrate_disable();
    p = this_cpu_ptr(&var1);
    p->val = func2();

This breaks because migrate_disable() does not protect against reentrancy from
a preempting task. A correct substitution for this case is::

  func()
  {
    struct foo *p;

    local_lock(&foo_lock);
    p = this_cpu_ptr(&var1);
    p->val = func2();

On a non-PREEMPT_RT kernel this protects against reentrancy by disabling
preemption. On a PREEMPT_RT kernel this is achieved by acquiring the
underlying per-CPU spinlock.


raw_spinlock_t on RT
--------------------

Acquiring a raw_spinlock_t disables preemption and possibly also
interrupts, so the critical section must avoid acquiring a regular
spinlock_t or rwlock_t. For example, the critical section must avoid
allocating memory. Thus, on a non-PREEMPT_RT kernel the following code
works perfectly::

  raw_spin_lock(&lock);
  p = kmalloc(sizeof(*p), GFP_ATOMIC);

But this code fails on PREEMPT_RT kernels because the memory allocator is
fully preemptible and therefore cannot be invoked from truly atomic
contexts. However, it is perfectly fine to invoke the memory allocator
while holding normal non-raw spinlocks because they do not disable
preemption on PREEMPT_RT kernels::

  spin_lock(&lock);
  p = kmalloc(sizeof(*p), GFP_ATOMIC);


bit spinlocks
-------------

PREEMPT_RT cannot substitute bit spinlocks because a single bit is too
small to accommodate an RT-mutex. Therefore, the semantics of bit
spinlocks are preserved on PREEMPT_RT kernels, so that the raw_spinlock_t
caveats also apply to bit spinlocks.

Some bit spinlocks are replaced with regular spinlock_t for PREEMPT_RT
using conditional (#ifdef'ed) code changes at the usage site, as sketched
below. In contrast, usage-site changes are not needed for the spinlock_t
substitution. Instead, conditionals in header files and the core locking
implementation enable the compiler to do the substitution transparently.
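
As a hedged sketch of such a usage-site substitution, an invented data
structure with an embedded lock bit might be handled like this; struct foo,
the helpers and the choice of bit 0 are made up for the illustration::

  #include <linux/bit_spinlock.h>
  #include <linux/spinlock.h>

  struct foo {
          unsigned long state;   /* bit 0 doubles as the lock bit */
  #ifdef CONFIG_PREEMPT_RT
          spinlock_t lock;       /* substitutes the bit spinlock on RT;
                                    needs spin_lock_init() at setup time */
  #endif
  };

  static inline void foo_lock(struct foo *f)
  {
  #ifdef CONFIG_PREEMPT_RT
          spin_lock(&f->lock);
  #else
          bit_spin_lock(0, &f->state);
  #endif
  }

  static inline void foo_unlock(struct foo *f)
  {
  #ifdef CONFIG_PREEMPT_RT
          spin_unlock(&f->lock);
  #else
          bit_spin_unlock(0, &f->state);
  #endif
  }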


Lock type nesting rules
=======================

The most basic rules are:

- Lock types of the same lock category (sleeping, CPU local, spinning)
  can nest arbitrarily as long as they respect the general lock ordering
  rules to prevent deadlocks.

- Sleeping lock types cannot nest inside CPU local and spinning lock types.

- CPU local and spinning lock types can nest inside sleeping lock types.

- Spinning lock types can nest inside all lock types.

These constraints apply both in PREEMPT_RT and otherwise.

The fact that PREEMPT_RT changes the lock category of spinlock_t and
rwlock_t from spinning to sleeping and substitutes local_lock with a
per-CPU spinlock_t means that they cannot be acquired while holding a raw
spinlock. This results in the following nesting order:

1) Sleeping locks
2) spinlock_t, rwlock_t, local_lock
3) raw_spinlock_t and bit spinlocks

Lockdep will complain if these constraints are violated, both in
PREEMPT_RT and otherwise.
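
As a hedged illustration of this nesting order, the sketch below acquires
one lock of each category in the permitted direction; the lock names are
invented for the example::

  #include <linux/mutex.h>
  #include <linux/spinlock.h>

  static DEFINE_MUTEX(m);          /* sleeping lock */
  static DEFINE_SPINLOCK(s);       /* spinning on !PREEMPT_RT, sleeping on PREEMPT_RT */
  static DEFINE_RAW_SPINLOCK(r);   /* always spinning */

  static void nest_in_order(void)
  {
          mutex_lock(&m);          /* 1) sleeping lock outermost */
          spin_lock(&s);           /* 2) spinlock_t may nest inside it */
          raw_spin_lock(&r);       /* 3) raw_spinlock_t innermost */

          raw_spin_unlock(&r);
          spin_unlock(&s);
          mutex_unlock(&m);
  }

Reversing the inner two steps, i.e. acquiring the spinlock_t while holding
the raw_spinlock_t, violates the nesting order above and lockdep will
complain on both PREEMPT_RT and non-PREEMPT_RT kernels.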