=========
Schedutil
=========

.. note::

   All this assumes a linear relation between [...]
   we know this is flawed, but it is the best [...]


PELT (Per Entity Load Tracking)
===============================

With PELT we track some metrics across the var[...]
individual tasks to task-group slices to CPU r[...]
we use an Exponentially Weighted Moving Averag[...]
is decayed such that y^32 = 0.5. That is, the [...]
half, while the rest of history contributes the [...]

Specifically:

  ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...

  ewma(u) = ewma_sum(u) / ewma_sum(1)

Since this is essentially a progression of an [...]
results are composable, that is ewma(A) + ewma[...]
is key, since it gives the ability to recompos[...]
around.

Note that blocked tasks still contribute to th[...]
and CPU runqueues), which reflects their expec[...]
resume running.

Using this we track 2 key metrics: 'running' a[...]
reflects the time an entity spends on the CPU, [...]
time an entity spends on the runqueue. When th[...]
two metrics are the same, but once there is co[...]
will decrease to reflect the fraction of time [...]
while 'runnable' will increase to reflect the [...]

For more detail see: kernel/sched/pelt.c


Frequency / CPU Invariance
==========================

Because consuming the CPU for 50% at 1GHz is n[...]
for 50% at 2GHz, nor is running 50% on a LITTL[...]
a big CPU, we allow architectures to scale the [...]
Dynamic Voltage and Frequency Scaling (DVFS) r[...]

For simple DVFS architectures (where software [...]
compute the ratio as::

            f_cur
  r_dvfs := -----
            f_max

For more dynamic systems where the hardware is [...]
hardware counters (Intel APERF/MPERF, ARMv8.4-[...]
For Intel specifically, we use::

           APERF
  f_cur := ----- * P0
           MPERF

             4C-turbo;  if available and turbo [...]
  f_max := { 1C-turbo;  if turbo enabled
             P0;        otherwise

                    f_cur
  r_dvfs := min( 1, ----- )
                    f_max

We pick 4C turbo over 1C turbo to make it slig[...]

r_cpu is determined as the ratio of highest pe[...]
CPU vs the highest performance level of any ot[...]

  r_tot = r_dvfs * r_cpu

The result is that the above 'running' and 'ru[...]
of DVFS and CPU type. IOW, we can transfer and [...]

For more detail see:

 - kernel/sched/pelt.h:update_rq_clock_pelt()
 - arch/x86/kernel/smpboot.c:"APERF/MPERF freq[...]
 - Documentation/scheduler/sched-capacity.rst:[...]


UTIL_EST
========

Because periodic tasks have their averages dec[...]
though when running their expected utilization [...]
(DVFS) ramp-up after they are running again.

To alleviate this (a default enabled option) U[...]
Impulse Response (IIR) EWMA with the 'running' [...]
highest. UTIL_EST filters to instantly increas[...]

A further runqueue wide sum (of runnable tasks [...]

  util_est := \Sum_t max( t_running, t_util_es[...]

For more detail see: kernel/sched/fair.c:util_[...]


UCLAMP
======

It is possible to set effective u_min and u_ma[...]
the runqueue keeps a max aggregate of these c[...]

For more detail see: include/uapi/linux/sched/[...]


Schedutil / DVFS
================

Every time the scheduler load tracking is upda[...]
migration, time progression) we call out to sc[...]
DVFS state.

The basis is the CPU runqueue's 'running' metr[...]
the frequency invariant utilization estimate o[...]
a desired frequency like::

             max( running, util_est );  if UTI[...]
  u_cfs := { running;                   otherw[...]

               clamp( u_cfs + u_rt, u_min, u_[...]
  u_clamp := { u_cfs + u_rt;                 [...]

  u := u_clamp + u_irq + u_dl;          [appro[...]

  f_des := min( f_max, 1.25 u * f_max )

XXX IO-wait: when the update is due to a task [...]
boost 'u' above.

This frequency is then used to select a P-stat[...]
CPPC style request to the hardware.

XXX: deadline tasks (Sporadic Task Model) allo[...]
required to satisfy the workload.

Because these callbacks are directly from the [...]
interaction should be 'fast' and non-blocking. [...]
rate-limiting DVFS requests for when hardware [...]
expensive, this reduces effectiveness.

For more information see: kernel/sched/cpufreq[...]


NOTES
=====

 - On low-load scenarios, where DVFS is most r[...]
   will closely reflect utilization.

 - In saturated scenarios task movement will c[...]
   suppose we have a CPU saturated with 4 task[...]
   to an idle CPU, the old CPU will have a 'ru[...]
   new CPU will gain 0.25. This is inevitable [...]
   correct this. XXX do we still guarantee f_m[...]

 - Much of the above is about avoiding DVFS di[...]
   having to re-learn / ramp-up when load shif[...]
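
As a rough illustration of two formulas above, the following Python sketch
evaluates the PELT average (ewma_sum and ewma, with y^32 = 0.5) and the
desired frequency f_des from the Schedutil / DVFS section. It is not kernel
code: the function names, the list-of-samples representation (most recent
period first), and the choice to treat history older than the supplied
samples as zero are assumptions made only for this example.

```python
# Illustrative sketch only; names and data layout are hypothetical and do
# not correspond to kernel identifiers.

# Per-period decay factor, chosen so that Y**32 == 0.5.
Y = 0.5 ** (1 / 32)


def ewma_sum(samples):
    """u_0 + u_1*y + u_2*y^2 + ... with samples[0] the most recent period."""
    return sum(u * Y**i for i, u in enumerate(samples))


def ewma(samples):
    """ewma_sum(u) / ewma_sum(1).

    Assumption: ewma_sum(1) is taken over an infinite history, i.e. the
    geometric series limit 1 / (1 - y); periods older than the supplied
    samples are treated as contributing zero.
    """
    return ewma_sum(samples) * (1 - Y)


def desired_freq(u, f_max):
    """f_des := min(f_max, 1.25 * u * f_max), for a utilization u in [0, 1]."""
    return min(f_max, 1.25 * u * f_max)
```

Under these assumptions, an entity that ran for the full 32 most recent
periods and not at all before gets ewma([1.0] * 32) of about 0.5. Because
ewma() is linear in its samples, the averages of two entities add up to the
average of their per-period sum, which is the composability property the
PELT section relies on. For f_des, the 1.25 factor means any u of 0.8 or
above already requests f_max.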