=========================
Capacity Aware Scheduling
=========================

1. CPU Capacity
===============

1.1 Introduction
----------------

Conventional, homogeneous SMP platforms are composed of purely identical
CPUs. Heterogeneous platforms on the other hand are composed of CPUs with
different performance characteristics - on such platforms, not all CPUs can be
considered equal.

CPU capacity is a measure of the performance a CPU can reach, normalized against
the most performant CPU in the system. Heterogeneous systems are also called
asymmetric CPU capacity systems, as they contain CPUs of different capacities.

Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems
from two factors:

- not all CPUs may have the same microarchitecture (µarch).
- with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be
  physically able to attain the higher Operating Performance Points (OPP).

Arm big.LITTLE systems are an example of both.
The big CPUs are more
performance-oriented than the LITTLE ones (more pipeline stages, bigger caches,
smarter predictors, etc.), and can usually reach higher OPPs than the LITTLE ones
can.

CPU performance is usually expressed in Millions of Instructions Per Second
(MIPS), which can also be expressed as a given amount of instructions attainable
per Hz, leading to::

  capacity(cpu) = work_per_hz(cpu) * max_freq(cpu)

1.2 Scheduler terms
-------------------

Two different capacity values are used within the scheduler. A CPU's
``original capacity`` is its maximum attainable capacity, i.e. its maximum
attainable performance level. This original capacity is returned by
the function arch_scale_cpu_capacity(). A CPU's ``capacity`` is its ``original
capacity`` from which some loss of available performance (e.g. time spent
handling IRQs) is subtracted.

Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
while ``original capacity`` is class-agnostic. The rest of this document will use
the term ``capacity`` interchangeably with ``original capacity`` for the sake of
brevity.

1.3 Platform examples
---------------------

1.3.1 Identical OPPs
~~~~~~~~~~~~~~~~~~~~

Consider a hypothetical dual-core asymmetric CPU capacity system where:

- work_per_hz(CPU0) = W
- work_per_hz(CPU1) = W/2
- all CPUs are running at the same fixed frequency

By the above definition of capacity:

- capacity(CPU0) = C
- capacity(CPU1) = C/2

To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would
be a LITTLE.

With a workload that periodically does a fixed amount of work, you will get an
execution trace like so::

 CPU0 work ^
           |     ____                ____                ____
           |    |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

 CPU1 work ^
           |     _________           _________           ____
           |    |         |         |         |         |
           +----+----+----+----+----+----+----+----+----+----+-> time

CPU0 has the highest capacity in the system (C), and completes a fixed amount of
work W in T units of time. On the other hand, CPU1 has half the capacity of
CPU0, and thus only completes W/2 in T.
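
The definition from 1.1 and the example above can be checked with a few lines
of Python. This is only an illustrative sketch: the helper names, the values
picked for W and the frequency, and the scale of 1024 (chosen to mirror the
kernel's SCHED_CAPACITY_SCALE) are assumptions made for the example, not kernel
code.

```python
# Illustrative sketch only: names and numbers are assumptions, not kernel code.
SCALE = 1024  # value given to the most performant CPU in this sketch


def raw_capacity(work_per_hz, max_freq):
    """capacity(cpu) = work_per_hz(cpu) * max_freq(cpu), before normalization."""
    return work_per_hz * max_freq


def normalized_capacities(cpus):
    """Normalize raw capacities against the most performant CPU."""
    raw = {name: raw_capacity(w, f) for name, (w, f) in cpus.items()}
    highest = max(raw.values())
    return {name: SCALE * value // highest for name, value in raw.items()}


# The system from 1.3.1: CPU1 does half the work per Hz, same fixed frequency.
W, F = 2, 1000  # hypothetical work-per-Hz and frequency values
capacities = normalized_capacities({"CPU0": (W, F), "CPU1": (W // 2, F)})
print(capacities)  # {'CPU0': 1024, 'CPU1': 512}
```

Integer division is used because this sketch treats capacities as integers, so
C/2 shows up as 512 when C is 1024.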

1.3.2 Different max OPPs
~~~~~~~~~~~~~~~~~~~~~~~~

Usually, CPUs of different capacity values also have different maximum
OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with:

- max_freq(CPU0) = F
- max_freq(CPU1) = 2/3 * F

This yields:

- capacity(CPU0) = C
- capacity(CPU1) = C/3

Executing the same workload as described in 1.3.1, with each CPU running at its
maximum frequency, results in::

 CPU0 work ^
           |     ____                ____                ____
           |    |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

                            workload on CPU1
 CPU1 work ^
           |     ______________      ______________      ____
           |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

1.4 Representation caveat
-------------------------

It should be noted that having a *single* value to represent differences in CPU
performance is somewhat of a contentious point. The relative performance
difference between two different µarchs could be X% on integer operations, Y% on
floating point operations, Z% on branches, and so on.
Still, results using this
simple approach have been satisfactory for now.

2. Task utilization
===================

2.1 Introduction
----------------

Capacity aware scheduling requires an expression of a task's requirements with
regard to CPU capacity. Each scheduler class can express this differently, and
while task utilization is specific to CFS, it is convenient to describe it here
in order to introduce more generic concepts.

Task utilization is a percentage meant to represent the throughput requirements
of a task. A simple approximation of it is the task's duty cycle, i.e.::

  task_util(p) = duty_cycle(p)

On an SMP system with fixed frequencies, 100% utilization suggests the task is a
busy loop. Conversely, 10% utilization hints it is a small periodic task that
spends more time sleeping than executing. Variable CPU frequencies and
asymmetric CPU capacities complicate this somewhat; the following sections will
expand on these.

2.2 Frequency invariance
------------------------

One issue that needs to be taken into account is that a workload's duty cycle is
directly impacted by the current OPP the CPU is running at. Consider running a
periodic workload at a given frequency F::

  CPU work ^
           |     ____                ____                ____
           |    |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

This yields duty_cycle(p) == 25%.

Now, consider running the *same* workload at frequency F/2::

  CPU work ^
           |     _________           _________           ____
           |    |         |         |         |         |
           +----+----+----+----+----+----+----+----+----+----+-> time

This yields duty_cycle(p) == 50%, despite the task having the exact same
behaviour (i.e. executing the same amount of work) in both executions.

The task utilization signal can be made frequency invariant using the following
formula::

  task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu))

Applying this formula to the two examples above, with F being the CPU's maximum
frequency, yields a frequency invariant task utilization of 25% in both cases.
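
The arithmetic behind the two traces above can be sketched as follows. The
helper names, the time units and the numeric value of F are assumptions made
for the example; F is taken to be the CPU's maximum frequency.

```python
# Illustrative sketch only: helper names and time units are assumptions.

def duty_cycle(busy_time, period):
    """Fraction of each period the task spends running."""
    return busy_time / period


def task_util_freq_inv(duty, curr_freq, max_freq):
    """Scale a duty cycle by the ratio of current to maximum frequency."""
    return duty * (curr_freq / max_freq)


F = 1000  # hypothetical maximum frequency

# At frequency F the task runs for 1 time unit out of every 4.
at_f = task_util_freq_inv(duty_cycle(1, 4), curr_freq=F, max_freq=F)

# At F/2 the same work takes twice as long: 2 time units out of every 4.
at_half_f = task_util_freq_inv(duty_cycle(2, 4), curr_freq=F / 2, max_freq=F)

print(at_f, at_half_f)  # 0.25 0.25
```

Both runs report the same value because the longer busy time at F/2 is exactly
offset by the frequency ratio.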

2.3 CPU invariance
------------------

CPU capacity has a similar effect on task utilization in that running an
identical workload on CPUs of different capacity values will yield different
duty cycles.

Consider the system described in 1.3.2, i.e.:

- capacity(CPU0) = C
- capacity(CPU1) = C/3

Executing a given periodic workload on each CPU at its maximum frequency would
result in::

 CPU0 work ^
           |     ____                ____                ____
           |    |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

 CPU1 work ^
           |     ______________      ______________      ____
           |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

IOW,

- duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency
- duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency

The task utilization signal can be made CPU invariant using the following
formula::

  task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity)

with ``max_capacity`` being the highest CPU capacity value in the
system.
Applying this formula to the example above yields a CPU
invariant task utilization of 25%.

2.4 Invariant task utilization
------------------------------

Both frequency and CPU invariance need to be applied to task utilization in
order to obtain a truly invariant signal. The pseudo-formula for a task
utilization that is both CPU and frequency invariant is thus, for a given
task p::

                                     curr_frequency(cpu)   capacity(cpu)
  task_util_inv(p) = duty_cycle(p) * ------------------- * -------------
                                     max_frequency(cpu)    max_capacity

In other words, invariant task utilization describes the behaviour of a task as
if it were running on the highest-capacity CPU in the system, running at its
maximum frequency.

Any mention of task utilization in the following sections will imply its
invariant form.

2.5 Utilization estimation
--------------------------

Without a crystal ball, task behaviour (and thus task utilization) cannot
accurately be predicted the moment a task first becomes runnable.
The CFS class
maintains a handful of CPU and task signals based on the Per-Entity Load
Tracking (PELT) mechanism, one of those yielding an *average* utilization (as
opposed to instantaneous).

This means that while the capacity aware scheduling criteria will be written
considering a "true" task utilization (using a crystal ball), the implementation
will only ever be able to use an estimator thereof.

3. Capacity aware scheduling requirements
=========================================

3.1 CPU capacity
----------------

Linux cannot currently figure out CPU capacity on its own; this information thus
needs to be handed to it. Architectures must define arch_scale_cpu_capacity()
for that purpose.

The arm, arm64, and RISC-V architectures directly map this to the arch_topology
driver CPU scaling data, which is derived from the capacity-dmips-mhz CPU
binding; see Documentation/devicetree/bindings/cpu/cpu-capacity.txt.

3.2 Frequency invariance
------------------------

As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task
utilization. Architectures must define arch_scale_freq_capacity(cpu) for that
purpose.

Implementing this function requires figuring out at which frequency each CPU
has been running. One way to implement this is to leverage hardware counters
whose increment rate scales with a CPU's current frequency (APERF/MPERF on x86,
AMU on arm64). Another is to directly hook into cpufreq frequency transitions,
when the kernel is aware of the switched-to frequency (also employed by
arm/arm64).

4. Scheduler topology
=====================

During the construction of the sched domains, the scheduler will figure out
whether the system exhibits asymmetric CPU capacities. Should that be the
case:

- The sched_asym_cpucapacity static key will be enabled.
- The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain
  level that spans all unique CPU capacity values.
- The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans
  CPUs with any range of asymmetry.

The sched_asym_cpucapacity static key is intended to guard sections of code that
cater to asymmetric CPU capacity systems. Do note however that said key is
*system-wide*. Imagine the following setup using cpusets::

  capacity    C/2          C
            ________    ________
           /        \  /        \
  CPUs     0  1  2  3  4  5  6  7
           \__/  \______________/
  cpusets   cs0         cs1

Which could be created via:

.. code-block:: sh

  mkdir /sys/fs/cgroup/cpuset/cs0
  echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus
  echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems

  mkdir /sys/fs/cgroup/cpuset/cs1
  echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus
  echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems

  echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance

Since there *is* CPU capacity asymmetry in the system, the
sched_asym_cpucapacity static key will be enabled.
However, the sched_domain
hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't
set in that hierarchy; it describes an SMP island and should be treated as such.

Therefore, the 'canonical' pattern for protecting codepaths that cater to
asymmetric CPU capacities is to:

- Check the sched_asym_cpucapacity static key
- If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in
  the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific
  CPU or group thereof)

5. Capacity aware scheduling implementation
===========================================

5.1 CFS
-------

5.1.1 Capacity fitness
~~~~~~~~~~~~~~~~~~~~~~

The main capacity scheduling criterion of CFS is::

  task_util(p) < capacity(task_cpu(p))

This is commonly called the capacity fitness criterion, i.e. CFS must ensure a
task "fits" on its CPU. If it is violated, the task will need to achieve more
work than what its CPU can provide: it will be CPU-bound.
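
A minimal sketch of the criterion as written above, using a 0..1024 scale and
the C and C/3 capacities from 1.3.2. The function name and the numbers are
assumptions made for the example, and any additional detail of the kernel's own
check is deliberately left out.

```python
# Illustrative sketch only: values use a 0..1024 scale chosen for the example.

def fits_capacity(task_util, cpu_capacity):
    """Capacity fitness criterion: task_util(p) < capacity(task_cpu(p))."""
    return task_util < cpu_capacity


# Hypothetical capacities matching the C and C/3 system from 1.3.2.
capacity = {"CPU0": 1024, "CPU1": 341}

small_task = 256  # about 25% invariant utilization
large_task = 512  # about 50% invariant utilization

for name, util in (("small", small_task), ("large", large_task)):
    fitting = [cpu for cpu, cap in capacity.items() if fits_capacity(util, cap)]
    print(name, fitting)
# small ['CPU0', 'CPU1']
# large ['CPU0']
```

The larger task would be CPU-bound on CPU1, which is the situation the
criterion is meant to avoid.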

Furthermore, uclamp lets userspace specify a minimum and a maximum utilization
value for a task, either via sched_setattr() or via the cgroup interface (see
Documentation/admin-guide/cgroup-v2.rst). As its name implies, this can be used
to clamp task_util() in the previous criterion.

5.1.2 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

CFS task wakeup CPU selection follows the capacity fitness criterion described
above. On top of that, uclamp is used to clamp the task utilization values,
which lets userspace have more leverage over the CPU selection of CFS
tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies::

  clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)

By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run
on any CPU by giving it a low uclamp.max value. Conversely, it can force a small
periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by
giving it a high uclamp.min value.

.. note::

  Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling
  (EAS), which is described in Documentation/scheduler/sched-energy.rst.

5.1.3 Load balancing
~~~~~~~~~~~~~~~~~~~~

A pathological case in the wakeup CPU selection occurs when a task rarely
sleeps, if at all - it thus rarely wakes up, if at all. Consider::

  w == wakeup event

  capacity(CPU0) = C
  capacity(CPU1) = C / 3

                           workload on CPU0
  CPU work ^
           |     _________           _________           ____
           |    |         |         |         |         |
           +----+----+----+----+----+----+----+----+----+----+-> time
                w                   w                   w

                           workload on CPU1
  CPU work ^
           |     ____________________________________________
           |    |
           +----+----+----+----+----+----+----+----+----+----+->
                w

This workload should run on CPU0, but if the task either:

- was improperly scheduled from the start (inaccurate initial
  utilization estimation)
- was properly scheduled from the start, but suddenly needs more
  processing power

then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``;
the CPU capacity
scheduling criterion is violated, and there may not be any further
wakeup event to fix this up via wakeup CPU selection.

Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism
put in place to handle this shares the same name. Misfit task migration
leverages the CFS load balancer, more specifically the active load balance part
(which caters to migrating currently running tasks). When load balance happens,
a misfit active load balance will be triggered if a misfit task can be migrated
to a CPU with more capacity than its current one.

5.2 RT
------

5.2.1 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

RT task wakeup CPU selection searches for a CPU that satisfies::

  task_uclamp_min(p) <= capacity(cpu)

while still following the usual priority constraints. If none of the candidate
CPUs can satisfy this capacity criterion, then strict priority based scheduling
is followed and CPU capacities are ignored.
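
The RT selection rule and its fallback can be sketched as below. The function
name, the CPU numbering and the capacity values are hypothetical, and the
candidate list is assumed to have already been filtered by the usual priority
constraints; this is not the kernel's implementation.

```python
# Illustrative sketch only: function and variable names are hypothetical.

def rt_pick_cpus(uclamp_min, candidates, capacity):
    """Keep candidates whose capacity satisfies uclamp_min <= capacity(cpu).

    `candidates` are CPUs already allowed by the usual priority constraints.
    If none of them is big enough, capacities are ignored and every candidate
    stays eligible.
    """
    fitting = [cpu for cpu in candidates if uclamp_min <= capacity[cpu]]
    return fitting if fitting else list(candidates)


capacity = {0: 341, 1: 341, 2: 1024, 3: 1024}  # hypothetical LITTLE/big split

print(rt_pick_cpus(512, [0, 1, 2], capacity))  # [2]
print(rt_pick_cpus(512, [0, 1], capacity))     # [0, 1]
```

In the second call no candidate satisfies the capacity criterion, so the sketch
falls back to the priority-only candidate set.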

5.3 DL
------

5.3.1 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

DL task wakeup CPU selection searches for a CPU that satisfies::

  task_bandwidth(p) < capacity(task_cpu(p))

while still respecting the usual bandwidth and deadline constraints. If
none of the candidate CPUs can satisfy this capacity criterion, then the
task will remain on its current CPU.
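
The DL rule differs from the RT one mainly in its fallback, which the following
sketch tries to show. Expressing task bandwidth as runtime over period on a
0..1024 scale, picking the first fitting candidate, and all names and values
are assumptions made for the example; the candidates are assumed to already
respect the bandwidth and deadline constraints.

```python
# Illustrative sketch only: names, units and the bandwidth definition are
# assumptions made for the example.
SCALE = 1024


def task_bandwidth(runtime, period):
    """Fraction of CPU time reserved by the task, on a 0..SCALE scale."""
    return SCALE * runtime // period


def dl_pick_cpu(bandwidth, current_cpu, candidates, capacity):
    """Return a candidate with bandwidth < capacity, else stay on current_cpu."""
    for cpu in candidates:
        if bandwidth < capacity[cpu]:
            return cpu
    return current_cpu


capacity = {0: 341, 1: 1024}  # hypothetical C/3 and C CPUs
bw = task_bandwidth(runtime=5, period=10)  # 512

print(dl_pick_cpu(bw, current_cpu=0, candidates=[0, 1], capacity=capacity))  # 1
print(dl_pick_cpu(bw, current_cpu=0, candidates=[0], capacity=capacity))     # 0
```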