This document provides "recipes", that is, litmus tests for commonly
occurring situations, as well as a few that illustrate subtly broken but
attractive nuisances.  Many of these recipes include example code from
v5.7 of the Linux kernel.

The first section covers simple special cases, the second section
takes off the training wheels to cover more involved examples,
and the third section provides a few rules of thumb.


Simple special cases
====================

This section presents two simple special cases, the first being where
there is only one CPU or where only one memory location is accessed, and
the second being use of that old concurrency workhorse, locking.


Single CPU or single memory location
------------------------------------

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order.  There are (as
usual) some things to be careful of:

1.      Some aspects of the C language are unordered.  For example,
        in the expression "f(x) + g(y)", the order in which f and g are
        called is not defined; the object code is allowed to use either
        order or even to interleave the computations.

2.      Compilers are permitted to use the "as-if" rule.  That is, a
        compiler can emit whatever code it likes for normal accesses,
        as long as the results of a single-threaded execution appear
        just as if the compiler had followed all the relevant rules.
        To see this, compile with a high level of optimization and run
        the debugger on the resulting binary.

3.      If there is only one variable but multiple CPUs, that variable
        must be properly aligned and all accesses to that variable must
        be full sized.  Variables that straddle cachelines or pages void
        your full-ordering warranty, as do undersized accesses that load
        from or store to only part of the variable.

4.      If there are multiple CPUs, accesses to shared variables should
        use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
        tearing, load/store fusing, and invented loads and stores (see
        the sketch following this list).  There are exceptions to this
        rule, including:

        i.      When there is no possibility of a given shared variable
                being updated by some other CPU, for example, while
                holding the update-side lock, reads from that variable
                need not use READ_ONCE().

        ii.     When there is no possibility of a given shared variable
                being either read or updated by other CPUs, for example,
                when running during early boot, reads from that variable
                need not use READ_ONCE() and writes to that variable
                need not use WRITE_ONCE().
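
The following is a minimal sketch of item 4, not taken from any kernel
source file; the variable shared_state and the two helper functions are
made up for illustration.  The marked accesses tell the compiler that the
variable may change (or be observed) at any time, which rules out tearing,
fusing, and invented loads and stores:

        /* Hypothetical illustration of item 4; not from the kernel tree. */
        int shared_state;       /* read and written by multiple CPUs */

        void set_state(int new_state)
        {
                /* Marked store: no store tearing, fusing, or invention. */
                WRITE_ONCE(shared_state, new_state);
        }

        int get_state(void)
        {
                /* Marked load: no load tearing, fusing, or invention. */
                return READ_ONCE(shared_state);
        }

A plain C assignment to or read from shared_state would instead allow the
compiler to apply the "as-if" rule from item 2, with results that can
surprise code running on other CPUs.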


Locking
-------

Locking is well-known and straightforward, at least if you don't think
about it too hard.  And the basic rule is indeed quite simple: Any CPU that
has acquired a given lock sees any changes previously seen or made by any
CPU before it released that same lock.  Note that this statement is a bit
stronger than "Any CPU holding a given lock sees all changes made by any
CPU during the time that CPU was holding this same lock".  For example,
consider the following pair of code fragments:

        /* See MP+polocks.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                spin_lock(&mylock);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                spin_unlock(&mylock);
                r1 = READ_ONCE(x);
        }

The basic rule guarantees that if CPU0() acquires mylock before CPU1(),
then both r0 and r1 must be set to the value 1.  This also has the
consequence that if the final value of r0 is equal to 1, then the final
value of r1 must also be equal to 1.  In contrast, the weaker rule would
say nothing about the final value of r1.
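
The litmus-test files named in comments such as the one above live in
tools/memory-model/litmus-tests/ and can be run on the Linux-kernel memory
model using the herd7 tool, as described in tools/memory-model/README.
As a rough sketch (simplified from the actual file), the fragments above
correspond to a litmus test of the following general shape, where the
"exists" clause at the end asks whether the undesired outcome can occur:

        C MP+polocks

        {}

        P0(int *x, int *y, spinlock_t *mylock)
        {
                WRITE_ONCE(*x, 1);
                spin_lock(mylock);
                WRITE_ONCE(*y, 1);
                spin_unlock(mylock);
        }

        P1(int *x, int *y, spinlock_t *mylock)
        {
                int r0;
                int r1;

                spin_lock(mylock);
                r0 = READ_ONCE(*y);
                spin_unlock(mylock);
                r1 = READ_ONCE(*x);
        }

        exists (1:r0=1 /\ 1:r1=0)

The tool reports whether the "exists" condition can ever be satisfied;
for this test, as the preceding paragraph explains, it cannot.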

The converse to the basic rule also holds, as illustrated by the
following litmus test:

        /* See MP+porevlocks.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(y);
                spin_lock(&mylock);
                r1 = READ_ONCE(x);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                spin_unlock(&mylock);
                WRITE_ONCE(y, 1);
        }

This converse to the basic rule guarantees that if CPU0() acquires
mylock before CPU1(), then both r0 and r1 must be set to the value 0.
This also has the consequence that if the final value of r1 is equal
to 0, then the final value of r0 must also be equal to 0.  In contrast,
the weaker rule would say nothing about the final value of r0.

These examples show only a single pair of CPUs, but the effects of the
locking basic rule extend across multiple acquisitions of a given lock
across multiple CPUs.

However, it is not necessarily the case that accesses ordered by
locking will be seen as ordered by CPUs not holding that lock.
Consider this example:

        /* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

Counter-intuitive though it might be, it is quite possible to have
the final value of r0 be 1, the final value of z be 2, and the final
value of r1 be 0.  The reason for this surprising outcome is that
CPU2() never acquired the lock, and thus did not benefit from the
lock's ordering properties.

Ordering can be extended to CPUs not holding the lock by careful use
of smp_mb__after_spinlock():

        /* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */
        void CPU0(void)
        {
                spin_lock(&mylock);
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                spin_unlock(&mylock);
        }

        void CPU1(void)
        {
                spin_lock(&mylock);
                smp_mb__after_spinlock();
                r0 = READ_ONCE(y);
                WRITE_ONCE(z, 1);
                spin_unlock(&mylock);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

This addition of smp_mb__after_spinlock() strengthens the lock acquisition
sufficiently to rule out the counter-intuitive outcome.


Taking off the training wheels
==============================

This section looks at more complex examples, including message passing,
load buffering, release-acquire chains, and store buffering.
Many classes of litmus tests have abbreviated names, which may be found
here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf


Message passing (MP)
--------------------

The MP pattern has one CPU execute a pair of stores to a pair of variables
and another CPU execute a pair of loads from this same pair of variables,
but in the opposite order.  The goal is to avoid the counter-intuitive
outcome in which the first load sees the value written by the second store
but the second load does not see the value written by the first store.
In the absence of any ordering, this goal may not be met, as can be seen
in the MP+poonceonces.litmus litmus test, sketched below.  This section
therefore looks at a number of ways of meeting this goal.
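
As a rough sketch in the style of the fragments used throughout this
document (see MP+poonceonces.litmus for the real test), the unordered
version looks like this:

        /* A sketch of the unordered version; see MP+poonceonces.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                r1 = READ_ONCE(x);
        }

Here nothing orders either CPU's pair of accesses, so the counter-intuitive
outcome r0 == 1 && r1 == 0 is possible.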


Release and acquire
~~~~~~~~~~~~~~~~~~~

Use of smp_store_release() and smp_load_acquire() is one way to force
the desired MP ordering.  The general approach is shown below:

        /* See MP+pooncerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                r1 = READ_ONCE(x);
        }

The smp_store_release() macro orders any prior accesses against the
store, while the smp_load_acquire() macro orders the load against any
subsequent accesses.  Therefore, if the final value of r0 is the value 1,
the final value of r1 must also be the value 1.

The init_stack_slab() function in lib/stackdepot.c uses release-acquire
in this way to safely initialize a slab of the stack depot.  Working out
the mutual-exclusion design is left as an exercise for the reader.


Assign and dereference
~~~~~~~~~~~~~~~~~~~~~~

Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the
use of smp_store_release() and smp_load_acquire(), except that both
rcu_assign_pointer() and rcu_dereference() operate on RCU-protected
pointers.  The general approach is shown below:

        /* See MP+onceassign+derefonce.litmus. */
        int z;
        int *y = &z;
        int x;

        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                rcu_assign_pointer(y, &x);
        }

        void CPU1(void)
        {
                rcu_read_lock();
                r0 = rcu_dereference(y);
                r1 = READ_ONCE(*r0);
                rcu_read_unlock();
        }

In this example, if the final value of r0 is &x then the final value of
r1 must be 1.

The rcu_assign_pointer() macro has the same ordering properties as does
smp_store_release(), but the rcu_dereference() macro orders the load only
against later accesses that depend on the value loaded.  A dependency
is present if the value loaded determines the address of a later access
(address dependency, as shown above), the value written by a later store
(data dependency), or whether or not a later store is executed in the
first place (control dependency).  Note that the term "data dependency"
is sometimes casually used to cover both address and data dependencies.

In lib/math/prime_numbers.c, the expand_to_next_prime() function invokes
rcu_assign_pointer(), and the next_prime_number() function invokes
rcu_dereference().  This combination mediates access to a bit vector
that is expanded as additional primes are needed.
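
In real kernel code, this pattern typically publishes a newly allocated
and initialized structure.  The following is a minimal sketch, not taken
from the kernel tree; the structure foo and the RCU-protected pointer gp
are made up for illustration:

        struct foo {
                int a;
        };
        struct foo __rcu *gp;   /* hypothetical RCU-protected pointer */

        /* Updater: initialize first, then publish with release semantics. */
        int publish_foo(void)
        {
                struct foo *p = kzalloc(sizeof(*p), GFP_KERNEL);

                if (!p)
                        return -ENOMEM;
                p->a = 1;
                rcu_assign_pointer(gp, p);
                return 0;
        }

        /*
         * Reader: the address dependency from rcu_dereference() orders
         * the later load from ->a.
         */
        int read_foo(void)
        {
                struct foo *p;
                int r0 = -1;

                rcu_read_lock();
                p = rcu_dereference(gp);
                if (p)
                        r0 = READ_ONCE(p->a);
                rcu_read_unlock();
                return r0;
        }

Any reader that sees the new pointer is thus guaranteed to also see the
assignment to ->a that preceded the rcu_assign_pointer().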


Write and read memory barriers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is usually better to use smp_store_release() instead of smp_wmb()
and to use smp_load_acquire() instead of smp_rmb().  However, the older
smp_wmb() and smp_rmb() APIs are still heavily used, so it is important
to understand their use cases.  The general approach is shown below:

        /* See MP+fencewmbonceonce+fencermbonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_wmb();
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r0 = READ_ONCE(y);
                smp_rmb();
                r1 = READ_ONCE(x);
        }

The smp_wmb() macro orders prior stores against later stores, and the
smp_rmb() macro orders prior loads against later loads.  Therefore, if
the final value of r0 is 1, the final value of r1 must also be 1.

The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
the following write-side code fragment:

        log->l_curr_block -= log->l_logBBsize;
        ASSERT(log->l_curr_block >= 0);
        smp_wmb();
        log->l_curr_cycle++;

And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains
the corresponding read-side code fragment:

        cur_cycle = READ_ONCE(log->l_curr_cycle);
        smp_rmb();
        cur_block = READ_ONCE(log->l_curr_block);

Alternatively, consider the following comment in function
perf_output_put_handle() in kernel/events/ring_buffer.c:

         *   kernel                             user
         *
         *   if (LOAD ->data_tail) {            LOAD ->data_head
         *                      (A)             smp_rmb()       (C)
         *      STORE $data                     LOAD $data
         *      smp_wmb()       (B)             smp_mb()        (D)
         *      STORE ->data_head               STORE ->data_tail
         *   }

The B/C pairing is an example of the MP pattern using smp_wmb() on the
write side and smp_rmb() on the read side.

Of course, given that smp_mb() is strictly stronger than either smp_wmb()
or smp_rmb(), any code fragment that would work with smp_rmb() and
smp_wmb() would also work with smp_mb() replacing either or both of the
weaker barriers.


Load buffering (LB)
-------------------

The LB pattern has one CPU load from one variable and then store to a
second, while another CPU loads from the second variable and then stores
to the first.  The goal is to avoid the counter-intuitive situation where
each load reads the value written by the other CPU's store.  In the
absence of any ordering it is quite possible that this may happen, as
can be seen in the LB+poonceonces.litmus litmus test, sketched below.
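
As with the MP pattern, here is a rough sketch of the unordered version
(see LB+poonceonces.litmus for the real test):

        /* A sketch of the unordered version; see LB+poonceonces.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(x);
                WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r1 = READ_ONCE(y);
                WRITE_ONCE(x, 1);
        }

With nothing ordering either CPU's load against its subsequent store, the
counter-intuitive outcome r0 == 1 && r1 == 1 is possible.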

One way of avoiding the counter-intuitive outcome is through the use of a
control dependency paired with a full memory barrier:

        /* See LB+fencembonceonce+ctrlonceonce.litmus. */
        void CPU0(void)
        {
                r0 = READ_ONCE(x);
                if (r0)
                        WRITE_ONCE(y, 1);
        }

        void CPU1(void)
        {
                r1 = READ_ONCE(y);
                smp_mb();
                WRITE_ONCE(x, 1);
        }

This pairing of a control dependency in CPU0() with a full memory
barrier in CPU1() prevents r0 and r1 from both ending up equal to 1.

The A/D pairing from the ring-buffer use case shown earlier also
illustrates LB.  Here is a repeat of the comment in
perf_output_put_handle() in kernel/events/ring_buffer.c, showing a
control dependency on the kernel side and a full memory barrier on
the user side:

         *   kernel                             user
         *
         *   if (LOAD ->data_tail) {            LOAD ->data_head
         *                      (A)             smp_rmb()       (C)
         *      STORE $data                     LOAD $data
         *      smp_wmb()       (B)             smp_mb()        (D)
         *      STORE ->data_head               STORE ->data_tail
         *   }
         *
         * Where A pairs with D, and B pairs with C.

The kernel's control dependency between the load from ->data_tail
and the store to data combined with the user's full memory barrier
between the load from data and the store to ->data_tail prevents
the counter-intuitive outcome where the kernel overwrites the data
before the user gets done loading it.


Release-acquire chains
----------------------

Release-acquire chains are a low-overhead, flexible, and easy-to-use
method of maintaining order.  However, they do have some limitations that
need to be fully understood.  Here is an example that maintains order:

        /* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                r1 = smp_load_acquire(&z);
                r2 = READ_ONCE(x);
        }

In this case, if r0 and r1 both have final values of 1, then r2 must
also have a final value of 1.

The ordering in this example is stronger than it needs to be.  For
example, ordering would still be preserved if CPU1()'s smp_load_acquire()
invocation were replaced with READ_ONCE().

It is tempting to assume that CPU0()'s store to x is globally ordered
before CPU1()'s store to z, but this is not the case:

        /* See Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_store_release(&y, 1);
        }

        void CPU1(void)
        {
                r0 = smp_load_acquire(&y);
                smp_store_release(&z, 1);
        }

        void CPU2(void)
        {
                WRITE_ONCE(z, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

One might hope that if the final value of r0 is 1 and the final value
of z is 2, then the final value of r1 must also be 1, but it really is
possible for r1 to have the final value of 0.  The reason, of course,
is that in this version, CPU2() is not part of the release-acquire chain.
This situation is accounted for in the rules of thumb below.

Despite this limitation, release-acquire chains are low-overhead as
well as simple and powerful, at least as memory-ordering mechanisms go.


Store buffering
---------------

Store buffering can be thought of as upside-down load buffering, so
that one CPU first stores to one variable and then loads from a second,
while another CPU stores to the second variable and then loads from the
first.  Preserving order requires nothing less than full barriers:

        /* See SB+fencembonceonces.litmus. */
        void CPU0(void)
        {
                WRITE_ONCE(x, 1);
                smp_mb();
                r0 = READ_ONCE(y);
        }

        void CPU1(void)
        {
                WRITE_ONCE(y, 1);
                smp_mb();
                r1 = READ_ONCE(x);
        }

Omitting either smp_mb() will allow both r0 and r1 to have final
values of 0, but providing both full barriers as shown above prevents
this counter-intuitive outcome.

This pattern most famously appears as part of Dekker's locking
algorithm, but it has a much more practical use within the Linux kernel
of ordering wakeups.  The following comment taken from waitqueue_active()
in include/linux/wait.h shows the canonical pattern:

         *      CPU0 - waker                    CPU1 - waiter
         *
         *                                      for (;;) {
         *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
         *      smp_mb();                         // smp_mb() from set_current_state()
         *      if (waitqueue_active(wq_head))         if (@cond)
         *        wake_up(wq_head);                      break;
         *                                        schedule();
         *                                      }
         *                                      finish_wait(&wq_head, &wait);

On CPU0, the store is to @cond and the load is in waitqueue_active().
On CPU1, prepare_to_wait() contains both a store to wq_head and a call
to set_current_state(), which contains an smp_mb() barrier; the load is
"if (@cond)".  The full barriers prevent the undesirable outcome where
CPU1 puts the waiting task to sleep and CPU0 fails to wake it up.
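
Spelled out as code rather than as a comment, the pattern looks roughly
like the following sketch, where wq_head and cond are hypothetical
variables shared between the waker and the waiter:

        static DECLARE_WAIT_QUEUE_HEAD(wq_head);
        static bool cond;

        /*
         * Waker: the smp_mb() orders the store to cond before the load
         * in waitqueue_active().
         */
        void waker(void)
        {
                WRITE_ONCE(cond, true);
                smp_mb();
                if (waitqueue_active(&wq_head))
                        wake_up(&wq_head);
        }

        /*
         * Waiter: prepare_to_wait() supplies the matching full barrier
         * by way of its call to set_current_state().
         */
        void waiter(void)
        {
                DEFINE_WAIT(wait);

                for (;;) {
                        prepare_to_wait(&wq_head, &wait, TASK_UNINTERRUPTIBLE);
                        if (READ_ONCE(cond))
                                break;
                        schedule();
                }
                finish_wait(&wq_head, &wait);
        }

The WRITE_ONCE()/READ_ONCE() markings follow the advice from the first
section; the ordering itself comes from the pair of full barriers.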

Note that use of locking can greatly simplify this pattern.


Rules of thumb
==============

There might seem to be no pattern governing what ordering primitives are
needed in which situations, but this is not the case.  There is a pattern
based on the relation between the accesses linking successive CPUs in a
given litmus test.  There are three types of linkage:

1.      Write-to-read, where the next CPU reads the value that the
        previous CPU wrote.  The LB litmus-test patterns contain only
        this type of relation.  In formal memory-modeling texts, this
        relation is called "reads-from" and is usually abbreviated "rf".

2.      Read-to-write, where the next CPU overwrites the value that the
        previous CPU read.  The SB litmus test contains only this type
        of relation.  In formal memory-modeling texts, this relation is
        often called "from-reads" and is sometimes abbreviated "fr".

3.      Write-to-write, where the next CPU overwrites the value written
        by the previous CPU.  The Z6.0 litmus test pattern contains a
        write-to-write relation between the last access of CPU1() and
        the first access of CPU2().  In formal memory-modeling texts,
        this relation is often called "coherence order" and is sometimes
        abbreviated "co".  In the C++ standard, it is instead called
        "modification order" and often abbreviated "mo".

The strength of memory ordering required for a given litmus test to
avoid a counter-intuitive outcome depends on the types of relations
linking the memory accesses for the outcome in question:

o       If all links are write-to-read links, then the weakest
        possible ordering within each CPU suffices.  For example, in
        the LB litmus test, a control dependency was enough to do the
        job.

o       If all but one of the links are write-to-read links, then a
        release-acquire chain suffices.  Both the MP and the ISA2
        litmus tests illustrate this case.

o       If more than one of the links are something other than
        write-to-read links, then a full memory barrier is required
        between each successive pair of non-write-to-read links.  This
        case is illustrated by the Z6.0 litmus tests, both in the
        locking and in the release-acquire sections.

However, if you find yourself having to stretch these rules of thumb
to fit your situation, you should consider creating a litmus test and
running it on the model.