==============================================
Cluster-wide Power-up/power-down race avoidanc
==============================================

This file documents the algorithm which is use
cluster setup and teardown operations and to m
controls safely.

The section "Rationale" explains what the algo
needed. "Basic model" explains general concep
of the system. The other sections explain the
algorithm in use.


Rationale
---------

In a system containing multiple CPUs, it is de
ability to turn off individual CPUs when the s
power consumption and thermal dissipation.

In a system containing multiple clusters of CP
to have the ability to turn off entire cluster

Turning entire clusters off and on is a risky
involves performing potentially destructive op
of independently running CPUs, while the OS co
means that we need some coordination in order
cluster-level operations are only performed wh
so.

Simple locking may not be sufficient to solve
mechanisms like Linux spinlocks may rely on co
are not immediately enabled when a cluster pow
disabling those mechanisms may itself be a non
writing some hardware registers and invalidati
methods of coordination are required in order
power-down and power-up at the cluster level.

The mechanism presented in this document descr
based protocol for performing the needed coord
lightweight as possible, while providing the r


Basic model
-----------

Each cluster and CPU is assigned a state, as f

	- DOWN
	- COMING_UP
	- UP
	- GOING_DOWN

::

	    +---------> UP ----------+
	    |                        v

	COMING_UP                GOING_DOWN

	    ^                        |
	    +--------- DOWN <--------+


DOWN:
	The CPU or cluster is not coherent, an
	suspended, or is ready to be powered o

COMING_UP:
	The CPU or cluster has committed to mo
	It may be part way through the process
	enabling coherency.

UP:
	The CPU or cluster is active and coher
	level. A CPU in this state is not nec
	actively by the kernel.

GOING_DOWN:
	The CPU or cluster has committed to mo
	state. It may be part way through the
	coherency exit.


Each CPU has one of these states assigned to i
The CPU states are described in the "CPU state

Each cluster is also assigned a state, but it
state value into two parts (the "cluster" stat
to introduce additional states in order to avo
CPUs in the cluster simultaneously modifying t
level states are described in the "Cluster sta

To help distinguish the CPU states from cluste
discussion, the state names are given a `CPU_`
and a `CLUSTER_` or `INBOUND_` prefix for the


CPU state
---------

In this algorithm, each individual core in a m
referred to as a "CPU". CPUs are assumed to b
therefore, a CPU can only be doing one thing a

This means that CPUs fit the basic model close

The algorithm defines the following states for

	- CPU_DOWN
	- CPU_COMING_UP
	- CPU_UP
	- CPU_GOING_DOWN

::

	    cluster setup and
	   CPU setup complete       policy dec
	      +-----------> CPU_UP -----------
	      |

	CPU_COMING_UP                   CPU_GO

	      ^
	      +----------- CPU_DOWN <---------
	 policy decision           CPU teardow
	or hardware event


The definitions of the four states correspond
the basic model.

Transitions between states occur as follows.

A trigger event (spontaneous) means that the C
next state as a result of making local progres
requirement for any external event to happen.


CPU_DOWN:
	A CPU reaches the CPU_DOWN state when
	power-down. On reaching this state, t
	power itself down or suspend itself, v
	firmware call.

	Next state:
		CPU_COMING_UP
	Conditions:
		none

	Trigger events:
		a) an explicit hardware power-
		   from a policy decision on a

		b) a hardware event, such as a


CPU_COMING_UP:
	A CPU cannot start participating in ha
	cluster is set up and coherent. If th
	then the CPU will wait in the CPU_COMI
	cluster has been set up.

	Next state:
		CPU_UP
	Conditions:
		The CPU's parent cluster must
	Trigger events:
		Transition of the parent clust

	Refer to the "Cluster state" section f
	CLUSTER_UP state.


CPU_UP:
	When a CPU reaches the CPU_UP state, i
	start participating in local coherency

	This is done by jumping to the kernel'

	Note that the definition of this state
	from the basic model definition: CPU_U
	CPU is coherent yet, but it does mean
	the kernel. The kernel handles the re
	procedure, so the remaining steps are
	race avoidance algorithm.

	The CPU remains in this state until an
	is made to shut down or suspend the CP

	Next state:
		CPU_GOING_DOWN
	Conditions:
		none
	Trigger events:
		explicit policy decision


CPU_GOING_DOWN:
	While in this state, the CPU exits coh
	operations required to achieve this (s
	caches).

	Next state:
		CPU_DOWN
	Conditions:
		local CPU teardown complete
	Trigger events:
		(spontaneous)


Cluster state
-------------

A cluster is a group of connected CPUs with so
Because a cluster contains multiple CPUs, it c
things at the same time. This has some implic
CPU can start up while another CPU is tearing

In this discussion, the "outbound side" is the
as seen by a CPU tearing the cluster down. Th
view of the cluster state as seen by a CPU set

In order to enable safe coordination in such s
that a CPU which is setting up the cluster can
independently of the CPU which is tearing down
reason, the cluster state is split into two pa

	"cluster" state: The global state of t
		on the outbound side:

		- CLUSTER_DOWN
		- CLUSTER_UP
		- CLUSTER_GOING_DOWN

	"inbound" state: The state of the clus

		- INBOUND_NOT_COMING_UP
		- INBOUND_COMING_UP


	The different pairings of these states
	states for the cluster as a whole::

	                            CLUSTER_UP
	          +==========> INBOUND_NOT_COM
	          #

	     CLUSTER_UP     <----+
	  INBOUND_COMING_UP      |

	          ^             CLUSTER_GOING_
	          #              INBOUND_COMIN

	    CLUSTER_DOWN         |
	  INBOUND_COMING_UP <----+

	          ^
	          +===========     CLUSTER_DOW
	                       INBOUND_NOT_COM

	Transitions -----> can only be made by
	only involve changes to the "cluster"

	Transitions ===##> can only be made by
	involve changes to the "inbound" state
	further transition possible on the out
	outbound CPU has put the cluster into

	The race avoidance algorithm does not
	which exact CPUs within the cluster pl
	be decided in advance by some other me
	"Last man and first man selection" for


	CLUSTER_DOWN/INBOUND_NOT_COMING_UP is
	cluster can actually be powered down.

	The parallelism of the inbound and out
	the existence of two different paths f
	INBOUND_NOT_COMING_UP (corresponding t
	model) to CLUSTER_DOWN/INBOUND_COMING_
	COMING_UP in the basic model). The se
	teardown completely.

	CLUSTER_UP/INBOUND_COMING_UP is equiva
	model. The final transition to CLUSTE
	is trivial and merely resets the state
	next cycle.
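The pairings and transitions described in this section can be sketched as a small lookup table. This is an illustrative sketch only, not the code in mcpm_entry.c: `struct cluster_pair`, `enum side` and `transition_allowed()` are hypothetical names, and the side attributed to each transition is an assumption based on the per-state "Next state" entries in this section.

```c
#include <stdbool.h>
#include <stddef.h>

/* The two halves of the cluster state, named as in this document. */
enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };

/* Hypothetical: the side attempting a transition. */
enum side { SIDE_INBOUND, SIDE_OUTBOUND };

/* Hypothetical container for one cluster/inbound pairing. */
struct cluster_pair {
	enum cluster_state cluster;
	enum inbound_state inbound;
};

struct transition {
	struct cluster_pair from;
	struct cluster_pair to;
	enum side by;	/* assumed transitioner, see lead-in */
};

/* One row per arrow in the state diagram. */
static const struct transition allowed[] = {
	{ { CLUSTER_DOWN, INBOUND_NOT_COMING_UP },
	  { CLUSTER_DOWN, INBOUND_COMING_UP }, SIDE_INBOUND },
	{ { CLUSTER_DOWN, INBOUND_COMING_UP },
	  { CLUSTER_UP, INBOUND_COMING_UP }, SIDE_INBOUND },
	{ { CLUSTER_UP, INBOUND_COMING_UP },
	  { CLUSTER_UP, INBOUND_NOT_COMING_UP }, SIDE_INBOUND },
	{ { CLUSTER_UP, INBOUND_NOT_COMING_UP },
	  { CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP }, SIDE_OUTBOUND },
	{ { CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP },
	  { CLUSTER_DOWN, INBOUND_NOT_COMING_UP }, SIDE_OUTBOUND },
	{ { CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP },
	  { CLUSTER_GOING_DOWN, INBOUND_COMING_UP }, SIDE_INBOUND },
	{ { CLUSTER_GOING_DOWN, INBOUND_COMING_UP },
	  { CLUSTER_UP, INBOUND_COMING_UP }, SIDE_OUTBOUND },
	{ { CLUSTER_GOING_DOWN, INBOUND_COMING_UP },
	  { CLUSTER_DOWN, INBOUND_COMING_UP }, SIDE_OUTBOUND },
};

static bool pair_equal(struct cluster_pair a, struct cluster_pair b)
{
	return a.cluster == b.cluster && a.inbound == b.inbound;
}

/* Return true if side "by" may move the cluster from "from" to "to". */
bool transition_allowed(struct cluster_pair from, struct cluster_pair to,
			enum side by)
{
	for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++) {
		if (pair_equal(allowed[i].from, from) &&
		    pair_equal(allowed[i].to, to) &&
		    allowed[i].by == by)
			return true;
	}
	return false;
}
```

For example, the table rejects an outbound attempt to go directly from CLUSTER_UP/INBOUND_NOT_COMING_UP to CLUSTER_DOWN/INBOUND_NOT_COMING_UP, because teardown has to pass through CLUSTER_GOING_DOWN first.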

	Details of the allowable transitions f

	The next state in each case is notated

		<cluster state>/<inbound state

	where the <transitioner> is the side o
	can occur; either the inbound or the o


CLUSTER_DOWN/INBOUND_NOT_COMING_UP:
	Next state:
		CLUSTER_DOWN/INBOUND_COMING_UP
	Conditions:
		none

	Trigger events:
		a) an explicit hardware power-
		   from a policy decision on a

		b) a hardware event, such as a


CLUSTER_DOWN/INBOUND_COMING_UP:

	In this state, an inbound CPU sets up
	enabling of hardware coherency at the
	other operations (such as cache invali
	in order to achieve this.

	The purpose of this state is to do suf
	setup to enable other CPUs in the clus
	safely.

	Next state:
		CLUSTER_UP/INBOUND_COMING_UP (
	Conditions:
		cluster-level setup and hardwa
	Trigger events:
		(spontaneous)


CLUSTER_UP/INBOUND_COMING_UP:

	Cluster-level setup is complete and ha
	enabled for the cluster. Other CPUs i
	enter coherency.

	This is a transient state, leading imm
	CLUSTER_UP/INBOUND_NOT_COMING_UP. All
	should treat these two states

	Next state:
		CLUSTER_UP/INBOUND_NOT_COMING_
	Conditions:
		none
	Trigger events:
		(spontaneous)


CLUSTER_UP/INBOUND_NOT_COMING_UP:

	Cluster-level setup is complete and ha
	enabled for the cluster. Other CPUs i
	enter coherency.

	The cluster will remain in this state
	made to power the cluster down.

	Next state:
		CLUSTER_GOING_DOWN/INBOUND_NOT
	Conditions:
		none
	Trigger events:
		policy decision to power down


CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP:

	An outbound CPU is tearing the cluster
	must wait in this state until all CPUs
	CPU_DOWN state.

	When all CPUs are in the CPU_DOWN stat
	down, for example by cleaning data cac
	cluster-level coherency.

	To avoid wasteful unnecessary teardown
	should check the inbound cluster state
	transitions to INBOUND_COMING_UP. Alt
	CPUs can be checked for entry into CPU


	Next states:

	CLUSTER_DOWN/INBOUND_NOT_COMING_UP (ou
		Conditions:
			cluster torn down and
		Trigger events:
			(spontaneous)

	CLUSTER_GOING_DOWN/INBOUND_COMING_UP (
		Conditions:
			none

		Trigger events:
			a) an explicit hardwar
			   resulting from a po
			   CPU;

			b) a hardware event, s


CLUSTER_GOING_DOWN/INBOUND_COMING_UP:

	The cluster is (or was) being torn dow
	come online in the meantime and is try
	again.

	If the outbound CPU observes this stat

		a) back out of teardown, resto
		   CLUSTER_UP state;

		b) finish tearing the cluster
		   in the CLUSTER_DOWN state;
		   set up the cluster again fr

	Choice (a) permits the removal of some
	unnecessary teardown and setup operati
	the cluster is not really going to be


	Next states:

	CLUSTER_UP/INBOUND_COMING_UP (outbound
		Conditions:
			cluster-level
			coherency comp

		Trigger events:
			(spontaneous)

	CLUSTER_DOWN/INBOUND_COMING_UP (outbou
		Conditions:
			cluster torn down and

		Trigger events:
			(spontaneous)


Last man and First man selection
--------------------------------

The CPU which performs cluster tear-down opera
is commonly referred to as the "last man".

The CPU which performs cluster setup on the in
referred to as the "first man".

The race avoidance algorithm documented above
mechanism to choose which CPUs should play the


Last man:

When shutting down the cluster, all the CPUs i
executing Linux and hence coherent. Therefore
be used to select a last man safely, before th
non-coherent.
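Because every CPU is still coherent at this point, last man selection can be sketched with an ordinary lock and a count of the CPUs still using the cluster. The following is a user-space illustration under that assumption, not the kernel's implementation: `struct cluster_usage`, `cluster_usage_init()` and `cpu_leave()` are hypothetical names, and a C11 `atomic_flag` stands in for a kernel spinlock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-cluster bookkeeping for last man selection. */
struct cluster_usage {
	atomic_flag lock;		/* stand-in for an ordinary spinlock */
	unsigned int cpus_in_use;	/* CPUs that have not yet left */
};

static void usage_lock(struct cluster_usage *u)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&u->lock,
						 memory_order_acquire))
		;
}

static void usage_unlock(struct cluster_usage *u)
{
	atomic_flag_clear_explicit(&u->lock, memory_order_release);
}

void cluster_usage_init(struct cluster_usage *u, unsigned int ncpus)
{
	atomic_flag_clear(&u->lock);
	u->cpus_in_use = ncpus;
}

/*
 * Called by a CPU that has decided to go down, while it is still
 * coherent.  Returns true if the caller is the last man, i.e. the
 * CPU that should go on to attempt cluster teardown.
 */
bool cpu_leave(struct cluster_usage *u)
{
	bool last = false;

	usage_lock(u);
	if (u->cpus_in_use > 0) {
		u->cpus_in_use--;
		last = (u->cpus_in_use == 0);
	}
	usage_unlock(u);

	return last;
}
```

With two CPUs in the cluster, the first caller of `cpu_leave()` gets false and the second gets true. This kind of locking is only safe here because it runs before any CPU has exited coherency. The first man case cannot rely on it.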


First man:

Because CPUs may power up asynchronously in re
events, a dynamic mechanism is needed to make
attempts to play the first man role and do the
initialisation: any other CPUs must wait for t
proceeding.

Cluster-level initialisation may involve actio
coherency controls in the bus fabric.

The current implementation in mcpm_head.S uses
mechanism to do this arbitration. This mechan
detail in vlocks.txt.


Features and Limitations
------------------------

Implementation:

	The current ARM-based implementation i
	arch/arm/common/mcpm_head.S (low-level
	arch/arm/common/mcpm_entry.c (everythi

	__mcpm_cpu_going_down() signals the tr
	CPU_GOING_DOWN state.

	__mcpm_cpu_down() signals the transiti
	state.

	A CPU transitions to CPU_COMING_UP and
	low-level power-up code in mcpm_head.S
	involve CPU-specific setup code, but i
	implementation it does not.

	__mcpm_outbound_enter_critical() and _
	handle transitions from CLUSTER_UP to
	and from there to CLUSTER_DOWN or back
	the case of an aborted cluster power-d

	These functions are more complex than
	functions due to the extra inter-CPU c
	is needed for safe transitions at the

	A cluster transitions from CLUSTER_DOW
	the low-level power-up code in mcpm_he
	typically involves platform-specific s
	provided by the platform-specific powe
	function registered via mcpm_sync_init

Deep topologies:

	As currently described and implemented
	support CPU topologies involving more
	clusters of clusters are not supported
	extended by replicating the cluster-le
	additional topological levels, and mod
	rules for the intermediate (non-outerm


Colophon
--------

Originally created and documented by Dave Mart
collaboration with Nicolas Pitre and Achin Gup

Copyright (C) 2012-2013 Linaro Limited
Distributed under the terms of Version 2 of th
License, as defined in linux/COPYING.