1 ========================= 2 Capacity Aware Scheduling 3 ========================= 4 5 1. CPU Capacity 6 =============== 7 8 1.1 Introduction 9 ---------------- 10 11 Conventional, homogeneous SMP platforms are co 12 CPUs. Heterogeneous platforms on the other han 13 different performance characteristics - on suc 14 considered equal. 15 16 CPU capacity is a measure of the performance a 17 the most performant CPU in the system. Heterog 18 asymmetric CPU capacity systems, as they conta 19 20 Disparity in maximum attainable performance (I 21 from two factors: 22 23 - not all CPUs may have the same microarchitec 24 - with Dynamic Voltage and Frequency Scaling ( 25 physically able to attain the higher Operati 26 27 Arm big.LITTLE systems are an example of both. 28 performance-oriented than the LITTLE ones (mor 29 smarter predictors, etc), and can usually reac 30 can. 31 32 CPU performance is usually expressed in Millio 33 (MIPS), which can also be expressed as a given 34 per Hz, leading to:: 35 36 capacity(cpu) = work_per_hz(cpu) * max_freq( 37 38 1.2 Scheduler terms 39 ------------------- 40 41 Two different capacity values are used within 42 ``original capacity`` is its maximum attainabl 43 attainable performance level. This original ca 44 the function arch_scale_cpu_capacity(). A CPU' 45 capacity`` to which some loss of available per 46 handling IRQs) is subtracted. 47 48 Note that a CPU's ``capacity`` is solely inten 49 while ``original capacity`` is class-agnostic. 50 the term ``capacity`` interchangeably with ``o 51 brevity. 52 53 1.3 Platform examples 54 --------------------- 55 56 1.3.1 Identical OPPs 57 ~~~~~~~~~~~~~~~~~~~~ 58 59 Consider an hypothetical dual-core asymmetric 60 61 - work_per_hz(CPU0) = W 62 - work_per_hz(CPU1) = W/2 63 - all CPUs are running at the same fixed frequ 64 65 By the above definition of capacity: 66 67 - capacity(CPU0) = C 68 - capacity(CPU1) = C/2 69 70 To draw the parallel with Arm big.LITTLE, CPU0 71 be a LITTLE. 72 73 With a workload that periodically does a fixed 74 execution trace like so:: 75 76 CPU0 work ^ 77 | ____ ____ 78 | | | | | 79 +----+----+----+----+----+----+---- 80 81 CPU1 work ^ 82 | _________ _________ 83 | | | | 84 +----+----+----+----+----+----+---- 85 86 CPU0 has the highest capacity in the system (C 87 work W in T units of time. On the other hand, 88 CPU0, and thus only completes W/2 in T. 89 90 1.3.2 Different max OPPs 91 ~~~~~~~~~~~~~~~~~~~~~~~~ 92 93 Usually, CPUs of different capacity values als 94 OPPs. Consider the same CPUs as above (i.e. sa 95 96 - max_freq(CPU0) = F 97 - max_freq(CPU1) = 2/3 * F 98 99 This yields: 100 101 - capacity(CPU0) = C 102 - capacity(CPU1) = C/3 103 104 Executing the same workload as described in 1. 105 maximum frequency results in:: 106 107 CPU0 work ^ 108 | ____ ____ 109 | | | | | 110 +----+----+----+----+----+----+---- 111 112 workload on CPU1 113 CPU1 work ^ 114 | ______________ _________ 115 | | | | 116 +----+----+----+----+----+----+---- 117 118 1.4 Representation caveat 119 ------------------------- 120 121 It should be noted that having a *single* valu 122 performance is somewhat of a contentious point 123 difference between two different µarchs could 124 floating point operations, Z% on branches, and 125 simple approach have been satisfactory for now 126 127 2. Task utilization 128 =================== 129 130 2.1 Introduction 131 ---------------- 132 133 Capacity aware scheduling requires an expressi 134 regards to CPU capacity. Each scheduler class 135 while task utilization is specific to CFS, it 136 in order to introduce more generic concepts. 137 138 Task utilization is a percentage meant to repr 139 of a task. A simple approximation of it is the 140 141 task_util(p) = duty_cycle(p) 142 143 On an SMP system with fixed frequencies, 100% 144 busy loop. Conversely, 10% utilization hints i 145 spends more time sleeping than executing. Vari 146 asymmetric CPU capacities complexify this some 147 expand on these. 148 149 2.2 Frequency invariance 150 ------------------------ 151 152 One issue that needs to be taken into account 153 directly impacted by the current OPP the CPU i 154 periodic workload at a given frequency F:: 155 156 CPU work ^ 157 | ____ ____ 158 | | | | | 159 +----+----+----+----+----+----+---- 160 161 This yields duty_cycle(p) == 25%. 162 163 Now, consider running the *same* workload at f 164 165 CPU work ^ 166 | _________ _________ 167 | | | | 168 +----+----+----+----+----+----+---- 169 170 This yields duty_cycle(p) == 50%, despite the 171 behaviour (i.e. executing the same amount of w 172 173 The task utilization signal can be made freque 174 formula:: 175 176 task_util_freq_inv(p) = duty_cycle(p) * (cur 177 178 Applying this formula to the two examples abov 179 task utilization of 25%. 180 181 2.3 CPU invariance 182 ------------------ 183 184 CPU capacity has a similar effect on task util 185 identical workload on CPUs of different capaci 186 duty cycles. 187 188 Consider the system described in 1.3.2., i.e.: 189 190 - capacity(CPU0) = C 191 - capacity(CPU1) = C/3 192 193 Executing a given periodic workload on each CP 194 result in:: 195 196 CPU0 work ^ 197 | ____ ____ 198 | | | | | 199 +----+----+----+----+----+----+---- 200 201 CPU1 work ^ 202 | ______________ _________ 203 | | | | 204 +----+----+----+----+----+----+---- 205 206 IOW, 207 208 - duty_cycle(p) == 25% if p runs on CPU0 at it 209 - duty_cycle(p) == 75% if p runs on CPU1 at it 210 211 The task utilization signal can be made CPU in 212 formula:: 213 214 task_util_cpu_inv(p) = duty_cycle(p) * (capa 215 216 with ``max_capacity`` being the highest CPU ca 217 system. Applying this formula to the above exa 218 invariant task utilization of 25%. 219 220 2.4 Invariant task utilization 221 ------------------------------ 222 223 Both frequency and CPU invariance need to be a 224 order to obtain a truly invariant signal. The 225 utilization that is both CPU and frequency inv 226 task p:: 227 228 curr_freq 229 task_util_inv(p) = duty_cycle(p) * --------- 230 max_frequ 231 232 In other words, invariant task utilization des 233 if it were running on the highest-capacity CPU 234 maximum frequency. 235 236 Any mention of task utilization in the followi 237 invariant form. 238 239 2.5 Utilization estimation 240 -------------------------- 241 242 Without a crystal ball, task behaviour (and th 243 accurately be predicted the moment a task firs 244 maintains a handful of CPU and task signals ba 245 Tracking (PELT) mechanism, one of those yieldi 246 opposed to instantaneous). 247 248 This means that while the capacity aware sched 249 considering a "true" task utilization (using a 250 will only ever be able to use an estimator the 251 252 3. Capacity aware scheduling requirements 253 ========================================= 254 255 3.1 CPU capacity 256 ---------------- 257 258 Linux cannot currently figure out CPU capacity 259 needs to be handed to it. Architectures must d 260 for that purpose. 261 262 The arm, arm64, and RISC-V architectures direc 263 CPU scaling data, which is derived from the ca 264 Documentation/devicetree/bindings/cpu/cpu-capa 265 266 3.2 Frequency invariance 267 ------------------------ 268 269 As stated in 2.2, capacity-aware scheduling re 270 utilization. Architectures must define arch_sc 271 purpose. 272 273 Implementing this function requires figuring o 274 have been running at. One way to implement thi 275 whose increment rate scale with a CPU's curren 276 AMU on arm64). Another is to directly hook int 277 when the kernel is aware of the switched-to fr 278 arm/arm64). 279 280 4. Scheduler topology 281 ===================== 282 283 During the construction of the sched domains, 284 whether the system exhibits asymmetric CPU cap 285 case: 286 287 - The sched_asym_cpucapacity static key will b 288 - The SD_ASYM_CPUCAPACITY_FULL flag will be se 289 level that spans all unique CPU capacity val 290 - The SD_ASYM_CPUCAPACITY flag will be set for 291 CPUs with any range of asymmetry. 292 293 The sched_asym_cpucapacity static key is inten 294 cater to asymmetric CPU capacity systems. Do n 295 *system-wide*. Imagine the following setup usi 296 297 capacity C/2 C 298 ________ ________ 299 / \ / \ 300 CPUs 0 1 2 3 4 5 6 7 301 \__/ \______________/ 302 cpusets cs0 cs1 303 304 Which could be created via: 305 306 .. code-block:: sh 307 308 mkdir /sys/fs/cgroup/cpuset/cs0 309 echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset. 310 echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.me 311 312 mkdir /sys/fs/cgroup/cpuset/cs1 313 echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset. 314 echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.me 315 316 echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_ 317 318 Since there *is* CPU capacity asymmetry in the 319 sched_asym_cpucapacity static key will be enab 320 hierarchy of CPUs 0-1 spans a single capacity 321 set in that hierarchy, it describes an SMP isl 322 323 Therefore, the 'canonical' pattern for protect 324 asymmetric CPU capacities is to: 325 326 - Check the sched_asym_cpucapacity static key 327 - If it is enabled, then also check for the pr 328 the sched_domain hierarchy (if relevant, i.e 329 CPU or group thereof) 330 331 5. Capacity aware scheduling implementation 332 =========================================== 333 334 5.1 CFS 335 ------- 336 337 5.1.1 Capacity fitness 338 ~~~~~~~~~~~~~~~~~~~~~~ 339 340 The main capacity scheduling criterion of CFS 341 342 task_util(p) < capacity(task_cpu(p)) 343 344 This is commonly called the capacity fitness c 345 task "fits" on its CPU. If it is violated, the 346 work than what its CPU can provide: it will be 347 348 Furthermore, uclamp lets userspace specify a m 349 value for a task, either via sched_setattr() o 350 Documentation/admin-guide/cgroup-v2.rst). As i 351 clamp task_util() in the previous criterion. 352 353 5.1.2 Wakeup CPU selection 354 ~~~~~~~~~~~~~~~~~~~~~~~~~~ 355 356 CFS task wakeup CPU selection follows the capa 357 above. On top of that, uclamp is used to clamp 358 which lets userspace have more leverage over t 359 tasks. IOW, CFS wakeup CPU selection searches 360 361 clamp(task_util(p), task_uclamp_min(p), task 362 363 By using uclamp, userspace can e.g. allow a bu 364 on any CPU by giving it a low uclamp.max value 365 periodic task (e.g. 10% utilization) to run on 366 giving it a high uclamp.min value. 367 368 .. note:: 369 370 Wakeup CPU selection in CFS can be eclipsed 371 (EAS), which is described in Documentation/s 372 373 5.1.3 Load balancing 374 ~~~~~~~~~~~~~~~~~~~~ 375 376 A pathological case in the wakeup CPU selectio 377 sleeps, if at all - it thus rarely wakes up, i 378 379 w == wakeup event 380 381 capacity(CPU0) = C 382 capacity(CPU1) = C / 3 383 384 workload on CPU0 385 CPU work ^ 386 | _________ _________ 387 | | | | 388 +----+----+----+----+----+----+---- 389 w w 390 391 workload on CPU1 392 CPU work ^ 393 | _____________________________ 394 | | 395 +----+----+----+----+----+----+---- 396 w 397 398 This workload should run on CPU0, but if the t 399 400 - was improperly scheduled from the start (ina 401 utilization estimation) 402 - was properly scheduled from the start, but s 403 processing power 404 405 then it might become CPU-bound, IOW ``task_uti 406 the CPU capacity scheduling criterion is viola 407 wakeup event to fix this up via wakeup CPU sel 408 409 Tasks that are in this situation are dubbed "m 410 put in place to handle this shares the same na 411 leverages the CFS load balancer, more specific 412 (which caters to migrating currently running t 413 a misfit active load balance will be triggered 414 to a CPU with more capacity than its current o 415 416 5.2 RT 417 ------ 418 419 5.2.1 Wakeup CPU selection 420 ~~~~~~~~~~~~~~~~~~~~~~~~~~ 421 422 RT task wakeup CPU selection searches for a CP 423 424 task_uclamp_min(p) <= capacity(task_cpu(cpu) 425 426 while still following the usual priority const 427 CPUs can satisfy this capacity criterion, then 428 is followed and CPU capacities are ignored. 429 430 5.3 DL 431 ------ 432 433 5.3.1 Wakeup CPU selection 434 ~~~~~~~~~~~~~~~~~~~~~~~~~~ 435 436 DL task wakeup CPU selection searches for a CP 437 438 task_bandwidth(p) < capacity(task_cpu(p)) 439 440 while still respecting the usual bandwidth and 441 none of the candidate CPUs can satisfy this ca 442 task will remain on its current CPU.
Linux® is a registered trademark of Linus Torvalds in the United States and other countries.
TOMOYO® is a registered trademark of NTT DATA CORPORATION.