1 ========= 1 ========= 2 Schedutil 2 Schedutil 3 ========= 3 ========= 4 4 5 .. note:: 5 .. note:: 6 6 7 All this assumes a linear relation between 7 All this assumes a linear relation between frequency and work capacity, 8 we know this is flawed, but it is the best 8 we know this is flawed, but it is the best workable approximation. 9 9 10 10 11 PELT (Per Entity Load Tracking) 11 PELT (Per Entity Load Tracking) 12 =============================== 12 =============================== 13 13 14 With PELT we track some metrics across the var 14 With PELT we track some metrics across the various scheduler entities, from 15 individual tasks to task-group slices to CPU r 15 individual tasks to task-group slices to CPU runqueues. As the basis for this 16 we use an Exponentially Weighted Moving Averag 16 we use an Exponentially Weighted Moving Average (EWMA), each period (1024us) 17 is decayed such that y^32 = 0.5. That is, the 17 is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute 18 half, while the rest of history contribute the 18 half, while the rest of history contribute the other half. 19 19 20 Specifically: 20 Specifically: 21 21 22 ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ... 22 ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ... 23 23 24 ewma(u) = ewma_sum(u) / ewma_sum(1) 24 ewma(u) = ewma_sum(u) / ewma_sum(1) 25 25 26 Since this is essentially a progression of an 26 Since this is essentially a progression of an infinite geometric series, the 27 results are composable, that is ewma(A) + ewma 27 results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property 28 is key, since it gives the ability to recompos 28 is key, since it gives the ability to recompose the averages when tasks move 29 around. 29 around. 30 30 31 Note that blocked tasks still contribute to th 31 Note that blocked tasks still contribute to the aggregates (task-group slices 32 and CPU runqueues), which reflects their expec 32 and CPU runqueues), which reflects their expected contribution when they 33 resume running. 33 resume running. 34 34 35 Using this we track 2 key metrics: 'running' a 35 Using this we track 2 key metrics: 'running' and 'runnable'. 'Running' 36 reflects the time an entity spends on the CPU, 36 reflects the time an entity spends on the CPU, while 'runnable' reflects the 37 time an entity spends on the runqueue. When th 37 time an entity spends on the runqueue. When there is only a single task these 38 two metrics are the same, but once there is co 38 two metrics are the same, but once there is contention for the CPU 'running' 39 will decrease to reflect the fraction of time 39 will decrease to reflect the fraction of time each task spends on the CPU 40 while 'runnable' will increase to reflect the 40 while 'runnable' will increase to reflect the amount of contention. 41 41 42 For more detail see: kernel/sched/pelt.c 42 For more detail see: kernel/sched/pelt.c 43 43 44 44 45 Frequency / CPU Invariance 45 Frequency / CPU Invariance 46 ========================== 46 ========================== 47 47 48 Because consuming the CPU for 50% at 1GHz is n 48 Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU 49 for 50% at 2GHz, nor is running 50% on a LITTL 49 for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on 50 a big CPU, we allow architectures to scale the 50 a big CPU, we allow architectures to scale the time delta with two ratios, one 51 Dynamic Voltage and Frequency Scaling (DVFS) r 51 Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio. 52 52 53 For simple DVFS architectures (where software 53 For simple DVFS architectures (where software is in full control) we trivially 54 compute the ratio as:: 54 compute the ratio as:: 55 55 56 f_cur 56 f_cur 57 r_dvfs := ----- 57 r_dvfs := ----- 58 f_max 58 f_max 59 59 60 For more dynamic systems where the hardware is 60 For more dynamic systems where the hardware is in control of DVFS we use 61 hardware counters (Intel APERF/MPERF, ARMv8.4- 61 hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio. 62 For Intel specifically, we use:: 62 For Intel specifically, we use:: 63 63 64 APERF 64 APERF 65 f_cur := ----- * P0 65 f_cur := ----- * P0 66 MPERF 66 MPERF 67 67 68 4C-turbo; if available and turbo 68 4C-turbo; if available and turbo enabled 69 f_max := { 1C-turbo; if turbo enabled 69 f_max := { 1C-turbo; if turbo enabled 70 P0; otherwise 70 P0; otherwise 71 71 72 f_cur 72 f_cur 73 r_dvfs := min( 1, ----- ) 73 r_dvfs := min( 1, ----- ) 74 f_max 74 f_max 75 75 76 We pick 4C turbo over 1C turbo to make it slig 76 We pick 4C turbo over 1C turbo to make it slightly more sustainable. 77 77 78 r_cpu is determined as the ratio of highest pe 78 r_cpu is determined as the ratio of highest performance level of the current 79 CPU vs the highest performance level of any ot 79 CPU vs the highest performance level of any other CPU in the system. 80 80 81 r_tot = r_dvfs * r_cpu 81 r_tot = r_dvfs * r_cpu 82 82 83 The result is that the above 'running' and 'ru 83 The result is that the above 'running' and 'runnable' metrics become invariant 84 of DVFS and CPU type. IOW. we can transfer and 84 of DVFS and CPU type. IOW. we can transfer and compare them between CPUs. 85 85 86 For more detail see: 86 For more detail see: 87 87 88 - kernel/sched/pelt.h:update_rq_clock_pelt() 88 - kernel/sched/pelt.h:update_rq_clock_pelt() 89 - arch/x86/kernel/smpboot.c:"APERF/MPERF freq 89 - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation." 90 - Documentation/scheduler/sched-capacity.rst: 90 - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization" 91 91 92 92 93 UTIL_EST !! 93 UTIL_EST / UTIL_EST_FASTUP 94 ======== !! 94 ========================== 95 95 96 Because periodic tasks have their averages dec 96 Because periodic tasks have their averages decayed while they sleep, even 97 though when running their expected utilization 97 though when running their expected utilization will be the same, they suffer a 98 (DVFS) ramp-up after they are running again. 98 (DVFS) ramp-up after they are running again. 99 99 100 To alleviate this (a default enabled option) U 100 To alleviate this (a default enabled option) UTIL_EST drives an Infinite 101 Impulse Response (IIR) EWMA with the 'running' 101 Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is 102 highest. UTIL_EST filters to instantly increas !! 102 highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR >> 103 filter to instantly increase and only decay on decrease. 103 104 104 A further runqueue wide sum (of runnable tasks 105 A further runqueue wide sum (of runnable tasks) is maintained of: 105 106 106 util_est := \Sum_t max( t_running, t_util_es 107 util_est := \Sum_t max( t_running, t_util_est_ewma ) 107 108 108 For more detail see: kernel/sched/fair.c:util_ 109 For more detail see: kernel/sched/fair.c:util_est_dequeue() 109 110 110 111 111 UCLAMP 112 UCLAMP 112 ====== 113 ====== 113 114 114 It is possible to set effective u_min and u_ma 115 It is possible to set effective u_min and u_max clamps on each CFS or RT task; 115 the runqueue keeps an max aggregate of these c 116 the runqueue keeps an max aggregate of these clamps for all running tasks. 116 117 117 For more detail see: include/uapi/linux/sched/ 118 For more detail see: include/uapi/linux/sched/types.h 118 119 119 120 120 Schedutil / DVFS 121 Schedutil / DVFS 121 ================ 122 ================ 122 123 123 Every time the scheduler load tracking is upda 124 Every time the scheduler load tracking is updated (task wakeup, task 124 migration, time progression) we call out to sc 125 migration, time progression) we call out to schedutil to update the hardware 125 DVFS state. 126 DVFS state. 126 127 127 The basis is the CPU runqueue's 'running' metr 128 The basis is the CPU runqueue's 'running' metric, which per the above it is 128 the frequency invariant utilization estimate o 129 the frequency invariant utilization estimate of the CPU. From this we compute 129 a desired frequency like:: 130 a desired frequency like:: 130 131 131 max( running, util_est ); if UTI 132 max( running, util_est ); if UTIL_EST 132 u_cfs := { running; otherw 133 u_cfs := { running; otherwise 133 134 134 clamp( u_cfs + u_rt , u_min, u_ 135 clamp( u_cfs + u_rt , u_min, u_max ); if UCLAMP_TASK 135 u_clamp := { u_cfs + u_rt; 136 u_clamp := { u_cfs + u_rt; otherwise 136 137 137 u := u_clamp + u_irq + u_dl; [appro 138 u := u_clamp + u_irq + u_dl; [approx. see source for more detail] 138 139 139 f_des := min( f_max, 1.25 u * f_max ) 140 f_des := min( f_max, 1.25 u * f_max ) 140 141 141 XXX IO-wait: when the update is due to a task 142 XXX IO-wait: when the update is due to a task wakeup from IO-completion we 142 boost 'u' above. 143 boost 'u' above. 143 144 144 This frequency is then used to select a P-stat 145 This frequency is then used to select a P-state/OPP or directly munged into a 145 CPPC style request to the hardware. 146 CPPC style request to the hardware. 146 147 147 XXX: deadline tasks (Sporadic Task Model) allo 148 XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min 148 required to satisfy the workload. 149 required to satisfy the workload. 149 150 150 Because these callbacks are directly from the 151 Because these callbacks are directly from the scheduler, the DVFS hardware 151 interaction should be 'fast' and non-blocking. 152 interaction should be 'fast' and non-blocking. Schedutil supports 152 rate-limiting DVFS requests for when hardware 153 rate-limiting DVFS requests for when hardware interaction is slow and 153 expensive, this reduces effectiveness. 154 expensive, this reduces effectiveness. 154 155 155 For more information see: kernel/sched/cpufreq 156 For more information see: kernel/sched/cpufreq_schedutil.c 156 157 157 158 158 NOTES 159 NOTES 159 ===== 160 ===== 160 161 161 - On low-load scenarios, where DVFS is most r 162 - On low-load scenarios, where DVFS is most relevant, the 'running' numbers 162 will closely reflect utilization. 163 will closely reflect utilization. 163 164 164 - In saturated scenarios task movement will c 165 - In saturated scenarios task movement will cause some transient dips, 165 suppose we have a CPU saturated with 4 task 166 suppose we have a CPU saturated with 4 tasks, then when we migrate a task 166 to an idle CPU, the old CPU will have a 'ru 167 to an idle CPU, the old CPU will have a 'running' value of 0.75 while the 167 new CPU will gain 0.25. This is inevitable 168 new CPU will gain 0.25. This is inevitable and time progression will 168 correct this. XXX do we still guarantee f_m 169 correct this. XXX do we still guarantee f_max due to no idle-time? 169 170 170 - Much of the above is about avoiding DVFS di 171 - Much of the above is about avoiding DVFS dips, and independent DVFS domains 171 having to re-learn / ramp-up when load shif 172 having to re-learn / ramp-up when load shifts. 172 173
Linux® is a registered trademark of Linus Torvalds in the United States and other countries.
TOMOYO® is a registered trademark of NTT DATA CORPORATION.