~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~

TOMOYO Linux Cross Reference
Linux/Documentation/scheduler/schedutil.rst

Version: ~ [ linux-6.12-rc7 ] ~ [ linux-6.11.7 ] ~ [ linux-6.10.14 ] ~ [ linux-6.9.12 ] ~ [ linux-6.8.12 ] ~ [ linux-6.7.12 ] ~ [ linux-6.6.60 ] ~ [ linux-6.5.13 ] ~ [ linux-6.4.16 ] ~ [ linux-6.3.13 ] ~ [ linux-6.2.16 ] ~ [ linux-6.1.116 ] ~ [ linux-6.0.19 ] ~ [ linux-5.19.17 ] ~ [ linux-5.18.19 ] ~ [ linux-5.17.15 ] ~ [ linux-5.16.20 ] ~ [ linux-5.15.171 ] ~ [ linux-5.14.21 ] ~ [ linux-5.13.19 ] ~ [ linux-5.12.19 ] ~ [ linux-5.11.22 ] ~ [ linux-5.10.229 ] ~ [ linux-5.9.16 ] ~ [ linux-5.8.18 ] ~ [ linux-5.7.19 ] ~ [ linux-5.6.19 ] ~ [ linux-5.5.19 ] ~ [ linux-5.4.285 ] ~ [ linux-5.3.18 ] ~ [ linux-5.2.21 ] ~ [ linux-5.1.21 ] ~ [ linux-5.0.21 ] ~ [ linux-4.20.17 ] ~ [ linux-4.19.323 ] ~ [ linux-4.18.20 ] ~ [ linux-4.17.19 ] ~ [ linux-4.16.18 ] ~ [ linux-4.15.18 ] ~ [ linux-4.14.336 ] ~ [ linux-4.13.16 ] ~ [ linux-4.12.14 ] ~ [ linux-4.11.12 ] ~ [ linux-4.10.17 ] ~ [ linux-4.9.337 ] ~ [ linux-4.4.302 ] ~ [ linux-3.10.108 ] ~ [ linux-2.6.32.71 ] ~ [ linux-2.6.0 ] ~ [ linux-2.4.37.11 ] ~ [ unix-v6-master ] ~ [ ccs-tools-1.8.12 ] ~ [ policy-sample ] ~
Architecture: ~ [ i386 ] ~ [ alpha ] ~ [ m68k ] ~ [ mips ] ~ [ ppc ] ~ [ sparc ] ~ [ sparc64 ] ~

Diff markup

Differences between /Documentation/scheduler/schedutil.rst (Version linux-6.12-rc7) and /Documentation/scheduler/schedutil.rst (Version linux-6.6.60)


  1 =========                                           1 =========
  2 Schedutil                                           2 Schedutil
  3 =========                                           3 =========
  4                                                     4 
  5 .. note::                                           5 .. note::
  6                                                     6 
  7    All this assumes a linear relation between       7    All this assumes a linear relation between frequency and work capacity,
  8    we know this is flawed, but it is the best       8    we know this is flawed, but it is the best workable approximation.
  9                                                     9 
 10                                                    10 
 11 PELT (Per Entity Load Tracking)                    11 PELT (Per Entity Load Tracking)
 12 ===============================                    12 ===============================
 13                                                    13 
 14 With PELT we track some metrics across the var     14 With PELT we track some metrics across the various scheduler entities, from
 15 individual tasks to task-group slices to CPU r     15 individual tasks to task-group slices to CPU runqueues. As the basis for this
 16 we use an Exponentially Weighted Moving Averag     16 we use an Exponentially Weighted Moving Average (EWMA), each period (1024us)
 17 is decayed such that y^32 = 0.5. That is, the      17 is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute
 18 half, while the rest of history contribute the     18 half, while the rest of history contribute the other half.
 19                                                    19 
 20 Specifically:                                      20 Specifically:
 21                                                    21 
 22   ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...       22   ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...
 23                                                    23 
 24   ewma(u) = ewma_sum(u) / ewma_sum(1)              24   ewma(u) = ewma_sum(u) / ewma_sum(1)
 25                                                    25 
 26 Since this is essentially a progression of an      26 Since this is essentially a progression of an infinite geometric series, the
 27 results are composable, that is ewma(A) + ewma     27 results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property
 28 is key, since it gives the ability to recompos     28 is key, since it gives the ability to recompose the averages when tasks move
 29 around.                                            29 around.
 30                                                    30 
 31 Note that blocked tasks still contribute to th     31 Note that blocked tasks still contribute to the aggregates (task-group slices
 32 and CPU runqueues), which reflects their expec     32 and CPU runqueues), which reflects their expected contribution when they
 33 resume running.                                    33 resume running.
 34                                                    34 
 35 Using this we track 2 key metrics: 'running' a     35 Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
 36 reflects the time an entity spends on the CPU,     36 reflects the time an entity spends on the CPU, while 'runnable' reflects the
 37 time an entity spends on the runqueue. When th     37 time an entity spends on the runqueue. When there is only a single task these
 38 two metrics are the same, but once there is co     38 two metrics are the same, but once there is contention for the CPU 'running'
 39 will decrease to reflect the fraction of time      39 will decrease to reflect the fraction of time each task spends on the CPU
 40 while 'runnable' will increase to reflect the      40 while 'runnable' will increase to reflect the amount of contention.
 41                                                    41 
 42 For more detail see: kernel/sched/pelt.c           42 For more detail see: kernel/sched/pelt.c
 43                                                    43 
 44                                                    44 
 45 Frequency / CPU Invariance                         45 Frequency / CPU Invariance
 46 ==========================                         46 ==========================
 47                                                    47 
 48 Because consuming the CPU for 50% at 1GHz is n     48 Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
 49 for 50% at 2GHz, nor is running 50% on a LITTL     49 for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
 50 a big CPU, we allow architectures to scale the     50 a big CPU, we allow architectures to scale the time delta with two ratios, one
 51 Dynamic Voltage and Frequency Scaling (DVFS) r     51 Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.
 52                                                    52 
 53 For simple DVFS architectures (where software      53 For simple DVFS architectures (where software is in full control) we trivially
 54 compute the ratio as::                             54 compute the ratio as::
 55                                                    55 
 56             f_cur                                  56             f_cur
 57   r_dvfs := -----                                  57   r_dvfs := -----
 58             f_max                                  58             f_max
 59                                                    59 
 60 For more dynamic systems where the hardware is     60 For more dynamic systems where the hardware is in control of DVFS we use
 61 hardware counters (Intel APERF/MPERF, ARMv8.4-     61 hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio.
 62 For Intel specifically, we use::                   62 For Intel specifically, we use::
 63                                                    63 
 64            APERF                                   64            APERF
 65   f_cur := ----- * P0                              65   f_cur := ----- * P0
 66            MPERF                                   66            MPERF
 67                                                    67 
 68              4C-turbo;  if available and turbo     68              4C-turbo;  if available and turbo enabled
 69   f_max := { 1C-turbo;  if turbo enabled           69   f_max := { 1C-turbo;  if turbo enabled
 70              P0;        otherwise                  70              P0;        otherwise
 71                                                    71 
 72                     f_cur                          72                     f_cur
 73   r_dvfs := min( 1, ----- )                        73   r_dvfs := min( 1, ----- )
 74                     f_max                          74                     f_max
 75                                                    75 
 76 We pick 4C turbo over 1C turbo to make it slig     76 We pick 4C turbo over 1C turbo to make it slightly more sustainable.
 77                                                    77 
 78 r_cpu is determined as the ratio of highest pe     78 r_cpu is determined as the ratio of highest performance level of the current
 79 CPU vs the highest performance level of any ot     79 CPU vs the highest performance level of any other CPU in the system.
 80                                                    80 
 81   r_tot = r_dvfs * r_cpu                           81   r_tot = r_dvfs * r_cpu
 82                                                    82 
 83 The result is that the above 'running' and 'ru     83 The result is that the above 'running' and 'runnable' metrics become invariant
 84 of DVFS and CPU type. IOW. we can transfer and     84 of DVFS and CPU type. IOW. we can transfer and compare them between CPUs.
 85                                                    85 
 86 For more detail see:                               86 For more detail see:
 87                                                    87 
 88  - kernel/sched/pelt.h:update_rq_clock_pelt()      88  - kernel/sched/pelt.h:update_rq_clock_pelt()
 89  - arch/x86/kernel/smpboot.c:"APERF/MPERF freq     89  - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
 90  - Documentation/scheduler/sched-capacity.rst:     90  - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"
 91                                                    91 
 92                                                    92 
 93 UTIL_EST                                       !!  93 UTIL_EST / UTIL_EST_FASTUP
 94 ========                                       !!  94 ==========================
 95                                                    95 
 96 Because periodic tasks have their averages dec     96 Because periodic tasks have their averages decayed while they sleep, even
 97 though when running their expected utilization     97 though when running their expected utilization will be the same, they suffer a
 98 (DVFS) ramp-up after they are running again.       98 (DVFS) ramp-up after they are running again.
 99                                                    99 
100 To alleviate this (a default enabled option) U    100 To alleviate this (a default enabled option) UTIL_EST drives an Infinite
101 Impulse Response (IIR) EWMA with the 'running'    101 Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
102 highest. UTIL_EST filters to instantly increas !! 102 highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR
                                                   >> 103 filter to instantly increase and only decay on decrease.
103                                                   104 
104 A further runqueue wide sum (of runnable tasks    105 A further runqueue wide sum (of runnable tasks) is maintained of:
105                                                   106 
106   util_est := \Sum_t max( t_running, t_util_es    107   util_est := \Sum_t max( t_running, t_util_est_ewma )
107                                                   108 
108 For more detail see: kernel/sched/fair.c:util_    109 For more detail see: kernel/sched/fair.c:util_est_dequeue()
109                                                   110 
110                                                   111 
111 UCLAMP                                            112 UCLAMP
112 ======                                            113 ======
113                                                   114 
114 It is possible to set effective u_min and u_ma    115 It is possible to set effective u_min and u_max clamps on each CFS or RT task;
115 the runqueue keeps an max aggregate of these c    116 the runqueue keeps an max aggregate of these clamps for all running tasks.
116                                                   117 
117 For more detail see: include/uapi/linux/sched/    118 For more detail see: include/uapi/linux/sched/types.h
118                                                   119 
119                                                   120 
120 Schedutil / DVFS                                  121 Schedutil / DVFS
121 ================                                  122 ================
122                                                   123 
123 Every time the scheduler load tracking is upda    124 Every time the scheduler load tracking is updated (task wakeup, task
124 migration, time progression) we call out to sc    125 migration, time progression) we call out to schedutil to update the hardware
125 DVFS state.                                       126 DVFS state.
126                                                   127 
127 The basis is the CPU runqueue's 'running' metr    128 The basis is the CPU runqueue's 'running' metric, which per the above it is
128 the frequency invariant utilization estimate o    129 the frequency invariant utilization estimate of the CPU. From this we compute
129 a desired frequency like::                        130 a desired frequency like::
130                                                   131 
131              max( running, util_est );  if UTI    132              max( running, util_est );  if UTIL_EST
132   u_cfs := { running;                   otherw    133   u_cfs := { running;                   otherwise
133                                                   134 
134                clamp( u_cfs + u_rt , u_min, u_    135                clamp( u_cfs + u_rt , u_min, u_max );    if UCLAMP_TASK
135   u_clamp := { u_cfs + u_rt;                      136   u_clamp := { u_cfs + u_rt;                            otherwise
136                                                   137 
137   u := u_clamp + u_irq + u_dl;          [appro    138   u := u_clamp + u_irq + u_dl;          [approx. see source for more detail]
138                                                   139 
139   f_des := min( f_max, 1.25 u * f_max )           140   f_des := min( f_max, 1.25 u * f_max )
140                                                   141 
141 XXX IO-wait: when the update is due to a task     142 XXX IO-wait: when the update is due to a task wakeup from IO-completion we
142 boost 'u' above.                                  143 boost 'u' above.
143                                                   144 
144 This frequency is then used to select a P-stat    145 This frequency is then used to select a P-state/OPP or directly munged into a
145 CPPC style request to the hardware.               146 CPPC style request to the hardware.
146                                                   147 
147 XXX: deadline tasks (Sporadic Task Model) allo    148 XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min
148 required to satisfy the workload.                 149 required to satisfy the workload.
149                                                   150 
150 Because these callbacks are directly from the     151 Because these callbacks are directly from the scheduler, the DVFS hardware
151 interaction should be 'fast' and non-blocking.    152 interaction should be 'fast' and non-blocking. Schedutil supports
152 rate-limiting DVFS requests for when hardware     153 rate-limiting DVFS requests for when hardware interaction is slow and
153 expensive, this reduces effectiveness.            154 expensive, this reduces effectiveness.
154                                                   155 
155 For more information see: kernel/sched/cpufreq    156 For more information see: kernel/sched/cpufreq_schedutil.c
156                                                   157 
157                                                   158 
158 NOTES                                             159 NOTES
159 =====                                             160 =====
160                                                   161 
161  - On low-load scenarios, where DVFS is most r    162  - On low-load scenarios, where DVFS is most relevant, the 'running' numbers
162    will closely reflect utilization.              163    will closely reflect utilization.
163                                                   164 
164  - In saturated scenarios task movement will c    165  - In saturated scenarios task movement will cause some transient dips,
165    suppose we have a CPU saturated with 4 task    166    suppose we have a CPU saturated with 4 tasks, then when we migrate a task
166    to an idle CPU, the old CPU will have a 'ru    167    to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
167    new CPU will gain 0.25. This is inevitable     168    new CPU will gain 0.25. This is inevitable and time progression will
168    correct this. XXX do we still guarantee f_m    169    correct this. XXX do we still guarantee f_max due to no idle-time?
169                                                   170 
170  - Much of the above is about avoiding DVFS di    171  - Much of the above is about avoiding DVFS dips, and independent DVFS domains
171    having to re-learn / ramp-up when load shif    172    having to re-learn / ramp-up when load shifts.
172                                                   173 
                                                      

~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~

kernel.org | git.kernel.org | LWN.net | Project Home | SVN repository | Mail admin

Linux® is a registered trademark of Linus Torvalds in the United States and other countries.
TOMOYO® is a registered trademark of NTT DATA CORPORATION.

sflogo.php