This document provides options for those wishing to keep their
memory-ordering lives simple, as is necessary for those whose domain
is complex.  After all, there are bugs other than memory-ordering bugs,
and the time spent gaining memory-ordering knowledge is not available
for gaining domain knowledge.  Furthermore, the Linux-kernel memory model
(LKMM) is quite complex, with subtle differences in code often having
dramatic effects on correctness.

The options near the beginning of this list are quite simple.  The idea
is not that kernel hackers don't already know about them, but rather
that they might need the occasional reminder.

Please note that this is a generic guide, and that specific subsystems
will often have special requirements or idioms.  For example, developers
of MMIO-based device drivers will often need to use mb(), rmb(), and
wmb(), and therefore might find smp_mb(), smp_rmb(), and smp_wmb()
to be more natural than smp_load_acquire() and smp_store_release().
On the other hand, those coming in from other environments will likely
be more familiar with these last two.


Single-threaded code
====================

In single-threaded code, there is no reordering, at least assuming
that your toolchain and hardware are working correctly.  In addition,
it is generally a mistake to assume your code will only run in a
single-threaded context as the kernel can enter the same code path on
multiple CPUs at the same time.  One important exception is a function
that makes no external data references.

In the general case, you will need to take explicit steps to ensure that
your code really is executed within a single thread that does not access
shared variables.  A simple way to achieve this is to define a global lock
that you acquire at the beginning of your code and release at the end,
taking care to ensure that all references to your code's shared data are
also carried out under that same lock.  Because only one thread can hold
this lock at a given time, your code will be executed single-threaded.
This approach is called "code locking".
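
For example, a minimal sketch of code locking might look like this
(foo_lock, foo_count, and foo_count_inc() are hypothetical names used
only for illustration):

	#include <linux/mutex.h>

	/* One global lock guards this code path and all of its data. */
	static DEFINE_MUTEX(foo_lock);
	static unsigned long foo_count;	/* Accessed only under foo_lock. */

	unsigned long foo_count_inc(void)
	{
		unsigned long ret;

		mutex_lock(&foo_lock);
		ret = ++foo_count;	/* Single-threaded by construction. */
		mutex_unlock(&foo_lock);
		return ret;
	}

Because mutex_lock() and mutex_unlock() supply all needed ordering, no
memory-ordering knowledge is required to reason about foo_count.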

Code locking can severely limit both performance and scalability, so it
should be used with caution, and only on code paths that execute rarely.
After all, a huge amount of effort was required to remove the Linux
kernel's old "Big Kernel Lock", so let's please be very careful about
adding new "little kernel locks".

One of the advantages of locking is that, in happy contrast with the
year 1981, almost all kernel developers are very familiar with locking.
The Linux kernel's lockdep (CONFIG_PROVE_LOCKING=y) is very helpful with
the formerly feared deadlock scenarios.

Please use the standard locking primitives provided by the kernel rather
than rolling your own.  For one thing, the standard primitives interact
properly with lockdep.  For another thing, these primitives have been
tuned to deal better with high contention.  And for one final thing, it is
surprisingly hard to correctly code production-quality lock acquisition
and release functions.  After all, even simple non-production-quality
locking functions must carefully prevent both the CPU and the compiler
from moving code in either direction across the locking function.

Despite the scalability limitations of single-threaded code, RCU
takes this approach for much of its grace-period processing and also
for early-boot operation.  The reason RCU is able to scale despite
single-threaded grace-period processing is its use of batching, where all
updates that accumulated during one grace period are handled by the
next one.  In other words, slowing down grace-period processing makes
it more efficient.  Nor is RCU unique:  Similar batching optimizations
are used in many I/O operations.


Packaged code
=============

Even if performance and scalability concerns prevent your code from
being completely single-threaded, it is often possible to use library
functions that handle the concurrency nearly or entirely on their own.
This approach delegates any LKMM worries to the library maintainer.

In the kernel, what is the "library"?  Quite a bit.  It includes the
contents of the lib/ directory and much of the include/linux/ directory,
along with a lot of other heavily used APIs.  Particularly heavily used
examples include the list macros (for example, include/linux/{,rcu}list.h),
workqueues, smp_call_function(), and the various hash tables and
search trees.
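
For example, a workqueue allows work to be handed off to another
execution context without the handoff itself requiring any attention to
memory ordering, because the workqueue implementation supplies the needed
synchronization.  Here is a minimal sketch, in which foo_work,
foo_work_fn(), and foo_kick() are hypothetical names:

	#include <linux/workqueue.h>

	static void foo_work_fn(struct work_struct *unused)
	{
		/* Runs later in process context, courtesy of the
		 * system workqueue, which handles the concurrency. */
	}
	static DECLARE_WORK(foo_work, foo_work_fn);

	void foo_kick(void)
	{
		schedule_work(&foo_work);	/* Safe from most contexts. */
	}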


Data locking
============

With code locking, we use single-threaded code execution to guarantee
serialized access to the data that the code is accessing.  However,
we can also achieve this by instead associating the lock with specific
instances of the data structures.  This creates a "critical section"
in the code execution that will execute as though it is single-threaded.
By placing all the accesses and modifications to a shared data structure
inside a critical section, we ensure that the execution context that
holds the lock has exclusive access to the shared data.

The poster boy for this approach is the hash table, where placing a lock
in each hash bucket allows operations on different buckets to proceed
concurrently.  This works because the buckets do not overlap with each
other, so that an operation on one bucket does not interfere with any
other bucket.

As the number of buckets increases, data locking scales naturally.
In particular, if the amount of data increases with the number of CPUs,
increasing the number of buckets as the number of CPUs increases results
in a naturally scalable data structure.
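
A minimal sketch of per-bucket data locking might look like this (the
structure and function names are hypothetical, and bucket-lock
initialization via spin_lock_init() is not shown):

	#include <linux/list.h>
	#include <linux/spinlock.h>

	#define FOO_NBUCKETS 64

	struct foo_bucket {
		spinlock_t lock;
		struct hlist_head head;
	};
	static struct foo_bucket foo_table[FOO_NBUCKETS];

	struct foo {
		struct hlist_node node;
		unsigned long key;
	};

	static void foo_insert(struct foo *fp)
	{
		struct foo_bucket *b = &foo_table[fp->key % FOO_NBUCKETS];

		spin_lock(&b->lock);	/* Serializes only this bucket. */
		hlist_add_head(&fp->node, &b->head);
		spin_unlock(&b->lock);
	}

Insertions into (or lookups within) different buckets acquire different
locks, and thus can proceed concurrently.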


Per-CPU processing
==================

Partitioning processing and data over CPUs allows each CPU to take
a single-threaded approach while providing excellent performance and
scalability.  Of course, there is no free lunch:  The dark side of this
excellence is substantially increased memory footprint.

In addition, it is sometimes necessary to update some global view of
this processing and data, in which case something like locking must
be used to protect this global view.  This is the approach taken
by the percpu_counter infrastructure.  In many cases, there are already
generic/library variants of commonly used per-CPU constructs available.
Please use them rather than rolling your own.

RCU uses DEFINE_PER_CPU*() declarations to create a number of per-CPU
data sets.  For example, each CPU does private quiescent-state processing
within its instance of the per-CPU rcu_data structure, and then uses data
locking to report quiescent states up the grace-period combining tree.
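
A minimal sketch of this pattern might look like this (foo_count and its
accessors are hypothetical, and production code should normally use
percpu_counter instead of rolling its own):

	#include <linux/percpu.h>

	static DEFINE_PER_CPU(unsigned long, foo_count);

	static void foo_inc(void)
	{
		this_cpu_inc(foo_count);	/* CPU-local, no lock needed. */
	}

	/* Sum all CPUs' counters.  The result is approximate because
	 * the counters keep changing, but that is good enough for
	 * statistics.  READ_ONCE() tames the compiler, but provides
	 * no ordering whatsoever. */
	static unsigned long foo_approx_total(void)
	{
		unsigned long sum = 0;
		int cpu;

		for_each_possible_cpu(cpu)
			sum += READ_ONCE(per_cpu(foo_count, cpu));
		return sum;
	}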


Packaged primitives: Sequence locking
=====================================

Lockless programming is considered by many to be more difficult than
lock-based programming, but there are a few lockless design patterns that
have been built out into an API.  One of these APIs is sequence locking.
Although this API can be used in extremely complex ways, there are simple
and effective ways of using it that avoid the need to pay attention to
memory ordering.

The basic keep-things-simple rule for sequence locking is "do not write
in read-side code".  Yes, you can do writes from within sequence-locking
readers, but it won't be so simple.  For example, such writes will be
lockless and should be idempotent.

For more sophisticated use cases, LKMM can guide you, including use
cases involving combining sequence locking with other synchronization
primitives.  (LKMM does not yet know about sequence locking, so it is
currently necessary to open-code it in your litmus tests.)

Additional information may be found in include/linux/seqlock.h.
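
A minimal sketch that follows the do-not-write rule might look like this
(foo_seq, foo_x, foo_y, and the two functions are hypothetical):

	#include <linux/seqlock.h>

	static DEFINE_SEQLOCK(foo_seq);
	static int foo_x, foo_y;	/* Written only by updaters. */

	/* Reader: writes nothing, retries if an updater intervened. */
	int foo_sum(void)
	{
		unsigned int seq;
		int sum;

		do {
			seq = read_seqbegin(&foo_seq);
			sum = foo_x + foo_y;
		} while (read_seqretry(&foo_seq, seq));
		return sum;
	}

	/* Updater: serialized by the seqlock's underlying spinlock. */
	void foo_set(int x, int y)
	{
		write_seqlock(&foo_seq);
		foo_x = x;
		foo_y = y;
		write_sequnlock(&foo_seq);
	}

The read_seqbegin() and read_seqretry() calls supply the needed
ordering, so foo_sum() always returns a consistent snapshot.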


Packaged primitives: RCU
========================

Another lockless design pattern that has been baked into an API
is RCU.  The Linux kernel makes sophisticated use of RCU, but the
keep-things-simple rules for RCU are "do not write in read-side code",
"do not update anything that is visible to and accessed by readers",
and "protect updates with locking".

These rules are illustrated by the functions foo_update_a() and
foo_get_a() shown in Documentation/RCU/whatisRCU.rst.  Additional
RCU usage patterns may be found in Documentation/RCU and in the
source code.


Packaged primitives: Atomic operations
======================================

Back in the day, the Linux kernel had three types of atomic operations:

1.	Initialization and read-out, such as atomic_set() and atomic_read().

2.	Operations that did not return a value and provided no ordering,
	such as atomic_inc() and atomic_dec().

3.	Operations that returned a value and provided full ordering, such as
	atomic_add_return() and atomic_dec_and_test().  Note that some
	value-returning operations provide full ordering only conditionally.
	For example, cmpxchg() provides ordering only upon success.

More recent kernels have operations that return a value but do not
provide full ordering.  These are flagged with either a _relaxed()
suffix (providing no ordering), or an _acquire() or _release() suffix
(providing limited ordering).

Additional information may be found in these files:

Documentation/atomic_t.txt
Documentation/atomic_bitops.txt
Documentation/core-api/refcount-vs-atomic.rst

Reading code using these primitives is often also quite helpful.
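
For example, here is a common reference-counting sketch contrasting the
unordered and the fully ordered flavors (struct foo, foo_get(), and
foo_put() are hypothetical):

	#include <linux/atomic.h>
	#include <linux/slab.h>

	struct foo {
		atomic_t refcount;
		/* Other fields. */
	};

	void foo_get(struct foo *fp)
	{
		atomic_inc(&fp->refcount);	/* No value returned, no ordering. */
	}

	void foo_put(struct foo *fp)
	{
		/* Value-returning, hence fully ordered:  Accesses to
		 * *fp preceding the final foo_put() are seen by all
		 * CPUs as happening before the kfree(). */
		if (atomic_dec_and_test(&fp->refcount))
			kfree(fp);
	}

That said, production code should normally use refcount_t rather than
atomic_t for reference counting, as described in the refcount-vs-atomic
document listed above.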


Lockless, fully ordered
=======================

When using locking, there often comes a time when it is necessary
to access some variable or another without holding the data lock
that serializes access to that variable.

If you want to keep things simple, use the initialization and read-out
operations from the previous section only when there are no racing
accesses.  Otherwise, use only fully ordered operations when accessing
or modifying the variable.  This approach guarantees that code prior
to a given access to that variable will be seen by all CPUs as having
happened before any code following any later access to that same variable.

Please note that per-CPU functions are not atomic operations and
hence they do not provide any ordering guarantees at all.

If the lockless accesses are frequently executed reads that are used
only for heuristics, or if they are frequently executed writes that
are used only for statistics, please see the next section.


Lockless statistics and heuristics
==================================

Unordered primitives such as atomic_read(), atomic_set(), READ_ONCE(), and
WRITE_ONCE() can safely be used in some cases.  These primitives provide
no ordering, but they do prevent the compiler from carrying out a number
of destructive optimizations (for which please see the next section).
One example use for these primitives is statistics, such as per-CPU
counters exemplified by the rt_cache_stat structure's routing-cache
statistics counters.  Another example use case is heuristics, such as
the jiffies_till_first_fqs and jiffies_till_next_fqs kernel parameters
controlling how often RCU scans for idle CPUs.

But be careful.  "Unordered" really does mean "unordered".  It is all
too easy to assume ordering, and this assumption must be avoided when
using these primitives.


Don't let the compiler trip you up
==================================

It can be quite tempting to use plain C-language accesses for lockless
loads from and stores to shared variables.  Although this is both
possible and quite common in the Linux kernel, it does require a
surprising amount of analysis, care, and knowledge about the compiler.
Yes, some decades ago it was not unfair to consider a C compiler to be
an assembler with added syntax and better portability, but the advent of
sophisticated optimizing compilers means that those days are long gone.
Today's optimizing compilers can profoundly rewrite your code during the
translation process, and have long been ready, willing, and able to do so.

Therefore, if you really need to use C-language assignments instead of
READ_ONCE(), WRITE_ONCE(), and so on, you will need to have a very good
understanding of both the C standard and your compiler.  Here are some
introductory references and some tooling to start you on this noble quest:

Who's afraid of a big bad optimizing compiler?
	https://lwn.net/Articles/793253/
Calibrating your fear of big bad optimizing compilers
	https://lwn.net/Articles/799218/
Concurrency bugs should fear the big bad data-race detector (part 1)
	https://lwn.net/Articles/816850/
Concurrency bugs should fear the big bad data-race detector (part 2)
	https://lwn.net/Articles/816854/
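
For a taste of what an optimizing compiler can do to plain accesses,
consider this hypothetical polling loop, in which need_to_stop is a
shared variable set by some other thread and do_something() is an
illustrative placeholder:

	/* BUGGY: Seeing no write to need_to_stop within the loop, the
	 * compiler is within its rights to load it only once, before
	 * the loop, turning this into an infinite loop. */
	while (!need_to_stop)
		do_something();

	/* Better: READ_ONCE() forces a fresh load on each iteration. */
	while (!READ_ONCE(need_to_stop))
		do_something();

Note that READ_ONCE() tames only the compiler:  It provides no ordering
against accesses to other variables.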


More complex use cases
======================

If the alternatives above do not do what you need, please look at the
recipes-pairs.txt file to peel off the next layer of the memory-ordering
onion.