TOMOYO Linux Cross Reference
Linux/Documentation/scheduler/sched-capacity.rst

Differences between /Documentation/scheduler/sched-capacity.rst (Version linux-6.12-rc7) and /Documentation/scheduler/sched-capacity.rst (Version linux-6.8.12)


=========================
Capacity Aware Scheduling
=========================

1. CPU Capacity
===============

1.1 Introduction
----------------

Conventional, homogeneous SMP platforms are composed of purely identical
CPUs. Heterogeneous platforms on the other hand are composed of CPUs with
different performance characteristics - on such platforms, not all CPUs can be
considered equal.

CPU capacity is a measure of the performance a CPU can reach, normalized against
the most performant CPU in the system. Heterogeneous systems are also called
asymmetric CPU capacity systems, as they contain CPUs of different capacities.

Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems
from two factors:

- not all CPUs may have the same microarchitecture (µarch).
- with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be
  physically able to attain the higher Operating Performance Points (OPP).

Arm big.LITTLE systems are an example of both. The big CPUs are more
performance-oriented than the LITTLE ones (more pipeline stages, bigger caches,
smarter predictors, etc.), and can usually reach higher OPPs than the LITTLE ones
can.

CPU performance is usually expressed in Millions of Instructions Per Second
(MIPS), which can also be expressed as a given number of instructions attainable
per Hz, leading to::

  capacity(cpu) = work_per_hz(cpu) * max_freq(cpu)

1.2 Scheduler terms
-------------------

Two different capacity values are used within the scheduler. A CPU's
``original capacity`` is its maximum attainable capacity, i.e. its maximum
attainable performance level. This original capacity is returned by
the function arch_scale_cpu_capacity(). A CPU's ``capacity`` is its ``original
capacity`` from which some loss of available performance (e.g. time spent
handling IRQs) is subtracted.

Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
while ``original capacity`` is class-agnostic. The rest of this document will use
the term ``capacity`` interchangeably with ``original capacity`` for the sake of
brevity.

1.3 Platform examples
---------------------

1.3.1 Identical OPPs
~~~~~~~~~~~~~~~~~~~~

Consider a hypothetical dual-core asymmetric CPU capacity system where

- work_per_hz(CPU0) = W
- work_per_hz(CPU1) = W/2
- all CPUs are running at the same fixed frequency

By the above definition of capacity:

- capacity(CPU0) = C
- capacity(CPU1) = C/2

To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would
be a LITTLE.

With a workload that periodically does a fixed amount of work, you will get an
execution trace like so::

 CPU0 work ^
           |     ____                ____                ____
           |    |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

 CPU1 work ^
           |     _________           _________           ____
           |    |         |         |         |         |
           +----+----+----+----+----+----+----+----+----+----+-> time

CPU0 has the highest capacity in the system (C), and completes a fixed amount of
work W in T units of time. On the other hand, CPU1 has half the capacity of
CPU0, and thus only completes W/2 in T.

1.3.2 Different max OPPs
~~~~~~~~~~~~~~~~~~~~~~~~

Usually, CPUs of different capacity values also have different maximum
OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with:

- max_freq(CPU0) = F
- max_freq(CPU1) = 2/3 * F

This yields:

- capacity(CPU0) = C
- capacity(CPU1) = C/3

Executing the same workload as described in 1.3.1, with each CPU running at its
maximum frequency, results in::

 CPU0 work ^
           |     ____                ____                ____
           |    |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

                            workload on CPU1
 CPU1 work ^
           |     ______________      ______________      ____
           |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

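As a rough illustration only, the capacity values of 1.3.1 and 1.3.2 can be
reproduced from the definition in 1.1. The helper name ``relative_capacities``
and the unit values chosen for W and F are assumptions of this sketch; only the
ratios matter.

```python
from fractions import Fraction


def relative_capacities(work_per_hz, max_freq):
    """Return each CPU's capacity normalized against the most performant CPU.

    Uses capacity(cpu) = work_per_hz(cpu) * max_freq(cpu) from section 1.1.
    """
    raw = {cpu: work_per_hz[cpu] * max_freq[cpu] for cpu in work_per_hz}
    highest = max(raw.values())
    return {cpu: value / highest for cpu, value in raw.items()}


# W and F are arbitrary units (assumption); only their ratios matter.
W, F = Fraction(1), Fraction(1)
work_per_hz = {"CPU0": W, "CPU1": W / 2}

# 1.3.1: both CPUs run at the same fixed frequency.
same_opps = relative_capacities(work_per_hz, {"CPU0": F, "CPU1": F})

# 1.3.2: CPU1 tops out at 2/3 of CPU0's maximum frequency.
diff_opps = relative_capacities(
    work_per_hz, {"CPU0": F, "CPU1": F * Fraction(2, 3)}
)

print(same_opps["CPU1"])  # 1/2, i.e. C/2
print(diff_opps["CPU1"])  # 1/3, i.e. C/3
```

Exact fractions are used here so that the C/2 and C/3 results are not blurred
by floating point rounding.
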
1.4 Representation caveat
-------------------------

It should be noted that having a *single* value to represent differences in CPU
performance is somewhat of a contentious point. The relative performance
difference between two different µarchs could be X% on integer operations, Y% on
floating point operations, Z% on branches, and so on. Still, results using this
simple approach have been satisfactory for now.

2. Task utilization
===================

2.1 Introduction
----------------

Capacity aware scheduling requires an expression of a task's requirements with
regard to CPU capacity. Each scheduler class can express this differently, and
while task utilization is specific to CFS, it is convenient to describe it here
in order to introduce more generic concepts.

Task utilization is a percentage meant to represent the throughput requirements
of a task. A simple approximation of it is the task's duty cycle, i.e.::

  task_util(p) = duty_cycle(p)

On an SMP system with fixed frequencies, 100% utilization suggests the task is a
busy loop. Conversely, 10% utilization hints that it is a small periodic task that
spends more time sleeping than executing. Variable CPU frequencies and
asymmetric CPU capacities complicate this somewhat; the following sections will
expand on these.

2.2 Frequency invariance
------------------------

One issue that needs to be taken into account is that a workload's duty cycle is
directly impacted by the current OPP the CPU is running at. Consider running a
periodic workload at a given frequency F::

  CPU work ^
           |     ____                ____                ____
           |    |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

This yields duty_cycle(p) == 25%.

Now, consider running the *same* workload at frequency F/2::

  CPU work ^
           |     _________           _________           ____
           |    |         |         |         |         |
           +----+----+----+----+----+----+----+----+----+----+-> time

This yields duty_cycle(p) == 50%, despite the task having the exact same
behaviour (i.e. executing the same amount of work) in both executions.

The task utilization signal can be made frequency invariant using the following
formula::

  task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu))

Applying this formula to the two examples above yields a frequency invariant
task utilization of 25%.

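The arithmetic behind that 25% figure can be checked with a small sketch. The
function simply mirrors the formula of this section; the concrete value picked
for F is an arbitrary assumption.

```python
def task_util_freq_inv(duty_cycle, curr_frequency, max_frequency):
    """Scale a measured duty cycle by the ratio of current to maximum frequency."""
    return duty_cycle * curr_frequency / max_frequency


F = 1000  # arbitrary maximum frequency (assumption), in any consistent unit

# Same workload, observed once at F and once at F/2.
util_at_f = task_util_freq_inv(0.25, F, F)
util_at_half_f = task_util_freq_inv(0.50, F / 2, F)

print(util_at_f, util_at_half_f)  # both 0.25
```

Both observations collapse to the same value, which is the point of the
frequency invariant signal.
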
2.3 CPU invariance
------------------

CPU capacity has a similar effect on task utilization in that running an
identical workload on CPUs of different capacity values will yield different
duty cycles.

Consider the system described in 1.3.2, i.e.:

- capacity(CPU0) = C
- capacity(CPU1) = C/3

Executing a given periodic workload on each CPU at their maximum frequency would
result in::

 CPU0 work ^
           |     ____                ____                ____
           |    |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

 CPU1 work ^
           |     ______________      ______________      ____
           |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

IOW,

- duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency
- duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency

The task utilization signal can be made CPU invariant using the following
formula::

  task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity)

with ``max_capacity`` being the highest CPU capacity value in the
system. Applying this formula to the example above yields a CPU
invariant task utilization of 25%.

2.4 Invariant task utilization
------------------------------

Both frequency and CPU invariance need to be applied to task utilization in
order to obtain a truly invariant signal. The pseudo-formula for a task
utilization that is both CPU and frequency invariant is thus, for a given
task p::

                                     curr_frequency(cpu)   capacity(cpu)
  task_util_inv(p) = duty_cycle(p) * ------------------- * -------------
                                     max_frequency(cpu)    max_capacity

In other words, invariant task utilization describes the behaviour of a task as
if it were running on the highest-capacity CPU in the system, running at its
maximum frequency.

Any mention of task utilization in the following sections will imply its
invariant form.

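A minimal sketch of the combined pseudo-formula, applied to the duty cycles
quoted in 2.2 and 2.3. The numeric frequencies and capacities are arbitrary
stand-ins (assumptions) for F, 2/3 * F, C and C/3.

```python
from fractions import Fraction


def task_util_inv(duty_cycle, curr_freq, max_freq, capacity, max_capacity):
    """Apply both frequency and CPU invariance to a measured duty cycle."""
    freq_ratio = Fraction(curr_freq, max_freq)
    cpu_ratio = Fraction(capacity, max_capacity)
    return duty_cycle * freq_ratio * cpu_ratio


# Arbitrary stand-ins (assumption): F = 1200, 2/3 * F = 800, C = 3, C/3 = 1.
F_BIG, F_LITTLE = 1200, 800
C_BIG, C_LITTLE = 3, 1

observations = {
    # CPU0 at its maximum frequency: 25% duty cycle.
    "CPU0 @ F": task_util_inv(Fraction(1, 4), F_BIG, F_BIG, C_BIG, C_BIG),
    # CPU0 at half its maximum frequency: 50% duty cycle.
    "CPU0 @ F/2": task_util_inv(Fraction(1, 2), F_BIG // 2, F_BIG, C_BIG, C_BIG),
    # CPU1 at its own maximum frequency: 75% duty cycle.
    "CPU1 @ max": task_util_inv(Fraction(3, 4), F_LITTLE, F_LITTLE, C_LITTLE, C_BIG),
}

for where, util in observations.items():
    print(where, util)  # each line ends in 1/4
```

Whatever CPU and frequency the task was observed on, the result matches what
it would show on the highest-capacity CPU at maximum frequency.
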
2.5 Utilization estimation
--------------------------

Without a crystal ball, task behaviour (and thus task utilization) cannot
accurately be predicted the moment a task first becomes runnable. The CFS class
maintains a handful of CPU and task signals based on the Per-Entity Load
Tracking (PELT) mechanism, one of those yielding an *average* utilization (as
opposed to instantaneous).

This means that while the capacity aware scheduling criteria will be written
considering a "true" task utilization (using a crystal ball), the implementation
will only ever be able to use an estimator thereof.

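To give a feel for what an *average* utilization means, here is a deliberately
simplified sketch of a geometrically decaying average. It is not the kernel's
PELT code: the fixed-length periods, the 32-period half-life and the function
name are assumptions made for this illustration.

```python
# Assumption of this sketch: a period's contribution halves every 32 periods.
HALF_LIFE_PERIODS = 32
DECAY = 0.5 ** (1 / HALF_LIFE_PERIODS)


def average_utilization(running_periods):
    """Fold per-period running flags (oldest first) into a decaying average.

    Each flag is 1 if the task ran for the whole period and 0 if it slept.
    """
    util = 0.0
    for running in running_periods:
        # Older history is decayed; the newest period is blended in.
        util = util * DECAY + (1 - DECAY) * running
    return util


# A busy loop only approaches 100% gradually.
print(average_utilization([1] * 32))    # about 0.5 after one half-life
print(average_utilization([1] * 1000))  # close to 1.0

# A 25% duty cycle settles near 25%, with some ripple around it.
print(average_utilization([1, 0, 0, 0] * 250))
```

The ramp-up of the busy loop shows why the estimator lags behind the "true"
utilization right after a task becomes runnable.
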
3. Capacity aware scheduling requirements
=========================================

3.1 CPU capacity
----------------

Linux cannot currently figure out CPU capacity on its own; this information thus
needs to be handed to it. Architectures must define arch_scale_cpu_capacity()
for that purpose.

The arm, arm64, and RISC-V architectures directly map this to the arch_topology driver
CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see
Documentation/devicetree/bindings/cpu/cpu-capacity.txt.

3.2 Frequency invariance
------------------------

As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task
utilization. Architectures must define arch_scale_freq_capacity(cpu) for that
purpose.

Implementing this function requires figuring out which frequency each CPU
has been running at. One way to implement this is to leverage hardware counters
whose increment rate scales with a CPU's current frequency (APERF/MPERF on x86,
AMU on arm64). Another is to directly hook into cpufreq frequency transitions,
when the kernel is aware of the switched-to frequency (also employed by
arm/arm64).

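The counter-based approach can be sketched as follows. This is only a model of
the idea, not any architecture's implementation: the function name, its
parameters, and the pairing of a frequency-scaled counter with a constant-rate
reference counter are assumptions of this example.

```python
def freq_scale_from_counters(delta_core, delta_const, const_freq, max_freq):
    """Estimate curr_frequency / max_frequency over one sampling window.

    delta_core:  increments of a counter ticking at the CPU's current frequency
    delta_const: increments of a counter ticking at a fixed reference frequency
    const_freq:  rate of the reference counter, in Hz
    max_freq:    maximum frequency of the CPU, in Hz
    """
    if delta_const == 0:
        raise ValueError("empty sampling window")
    # Average frequency the CPU actually ran at during the window.
    avg_freq = const_freq * delta_core / delta_const
    # Clamping to 1.0 is a choice made by this sketch.
    return min(avg_freq / max_freq, 1.0)


# Hypothetical 10 ms window: 100 MHz reference counter, CPU at 1 GHz of 2 GHz.
scale = freq_scale_from_counters(
    delta_core=10_000_000,
    delta_const=1_000_000,
    const_freq=100_000_000,
    max_freq=2_000_000_000,
)
print(scale)  # 0.5
```

The resulting ratio plays the role of curr_frequency(cpu) / max_frequency(cpu)
in the formula of 2.2.
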
280 4. Scheduler topology                             280 4. Scheduler topology
281 =====================                             281 =====================
282                                                   282 
283 During the construction of the sched domains,     283 During the construction of the sched domains, the scheduler will figure out
284 whether the system exhibits asymmetric CPU cap    284 whether the system exhibits asymmetric CPU capacities. Should that be the
285 case:                                             285 case:
286                                                   286 
287 - The sched_asym_cpucapacity static key will b    287 - The sched_asym_cpucapacity static key will be enabled.
288 - The SD_ASYM_CPUCAPACITY_FULL flag will be se    288 - The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain
289   level that spans all unique CPU capacity val    289   level that spans all unique CPU capacity values.
290 - The SD_ASYM_CPUCAPACITY flag will be set for    290 - The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans
291   CPUs with any range of asymmetry.               291   CPUs with any range of asymmetry.
292                                                   292 
293 The sched_asym_cpucapacity static key is inten    293 The sched_asym_cpucapacity static key is intended to guard sections of code that
294 cater to asymmetric CPU capacity systems. Do n    294 cater to asymmetric CPU capacity systems. Do note however that said key is
295 *system-wide*. Imagine the following setup usi    295 *system-wide*. Imagine the following setup using cpusets::
296                                                   296 
297   capacity    C/2          C                      297   capacity    C/2          C
298             ________    ________                  298             ________    ________
299            /        \  /        \                 299            /        \  /        \
300   CPUs     0  1  2  3  4  5  6  7                 300   CPUs     0  1  2  3  4  5  6  7
301            \__/  \______________/                 301            \__/  \______________/
302   cpusets   cs0         cs1                       302   cpusets   cs0         cs1
303                                                   303 
304 Which could be created via:                       304 Which could be created via:
305                                                   305 
306 .. code-block:: sh                                306 .. code-block:: sh
307                                                   307 
308   mkdir /sys/fs/cgroup/cpuset/cs0                 308   mkdir /sys/fs/cgroup/cpuset/cs0
309   echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.    309   echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus
310   echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.me    310   echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems
311                                                   311 
312   mkdir /sys/fs/cgroup/cpuset/cs1                 312   mkdir /sys/fs/cgroup/cpuset/cs1
313   echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.    313   echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus
314   echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.me    314   echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems
315                                                   315 
316   echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_    316   echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
317                                                   317 
318 Since there *is* CPU capacity asymmetry in the    318 Since there *is* CPU capacity asymmetry in the system, the
319 sched_asym_cpucapacity static key will be enab    319 sched_asym_cpucapacity static key will be enabled. However, the sched_domain
320 hierarchy of CPUs 0-1 spans a single capacity     320 hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't
321 set in that hierarchy, it describes an SMP isl    321 set in that hierarchy, it describes an SMP island and should be treated as such.
322                                                   322 
323 Therefore, the 'canonical' pattern for protect    323 Therefore, the 'canonical' pattern for protecting codepaths that cater to
324 asymmetric CPU capacities is to:                  324 asymmetric CPU capacities is to:
325                                                   325 
326 - Check the sched_asym_cpucapacity static key     326 - Check the sched_asym_cpucapacity static key
327 - If it is enabled, then also check for the pr    327 - If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in
328   the sched_domain hierarchy (if relevant, i.e    328   the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific
329   CPU or group thereof)                           329   CPU or group thereof)
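
The two-step check above can be sketched as a small userspace model. Every name here (``model_sched_domain``, ``MODEL_SD_ASYM_CPUCAPACITY``, ``model_asym_key_enabled``) is a hypothetical stand-in chosen for illustration, not the kernel's own definition; the sketch only assumes that a flag set at any level of a CPU's sched_domain hierarchy marks that hierarchy as asymmetric.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are NOT the kernel's definitions. */
#define MODEL_SD_ASYM_CPUCAPACITY 0x1u

struct model_sched_domain {
	unsigned int flags;
	const struct model_sched_domain *parent;	/* NULL at the top level */
};

/* Stand-in for the system-wide sched_asym_cpucapacity static key. */
static bool model_asym_key_enabled;

/* Walk up the hierarchy looking for a level that spans several capacities. */
static bool model_hierarchy_has_asym(const struct model_sched_domain *sd)
{
	for (; sd; sd = sd->parent)
		if (sd->flags & MODEL_SD_ASYM_CPUCAPACITY)
			return true;
	return false;
}

/* Canonical pattern: cheap system-wide check first, then the hierarchy. */
static bool model_use_asym_path(const struct model_sched_domain *sd)
{
	if (!model_asym_key_enabled)
		return false;
	return model_hierarchy_has_asym(sd);
}
```

In this model, a hierarchy such as the one for CPUs 0-1 in the cpuset example carries no flagged level, so ``model_use_asym_path()`` returns false for it even though the system-wide key is enabled.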
330                                                   330 
331 5. Capacity aware scheduling implementation       331 5. Capacity aware scheduling implementation
332 ===========================================       332 ===========================================
333                                                   333 
334 5.1 CFS                                           334 5.1 CFS
335 -------                                           335 -------
336                                                   336 
337 5.1.1 Capacity fitness                            337 5.1.1 Capacity fitness
338 ~~~~~~~~~~~~~~~~~~~~~~                            338 ~~~~~~~~~~~~~~~~~~~~~~
339                                                   339 
340 The main capacity scheduling criterion of CFS     340 The main capacity scheduling criterion of CFS is::
341                                                   341 
342   task_util(p) < capacity(task_cpu(p))            342   task_util(p) < capacity(task_cpu(p))
343                                                   343 
344 This is commonly called the capacity fitness c    344 This is commonly called the capacity fitness criterion, i.e. CFS must ensure a
345 task "fits" on its CPU. If it is violated, the    345 task "fits" on its CPU. If it is violated, the task will need to achieve more
346 work than what its CPU can provide: it will be    346 work than what its CPU can provide: it will be CPU-bound.
347                                                   347 
348 Furthermore, uclamp lets userspace specify a m    348 Furthermore, uclamp lets userspace specify a minimum and a maximum utilization
349 value for a task, either via sched_setattr() o    349 value for a task, either via sched_setattr() or via the cgroup interface (see
350 Documentation/admin-guide/cgroup-v2.rst). As i    350 Documentation/admin-guide/cgroup-v2.rst). As its name implies, this can be used to
351 clamp task_util() in the previous criterion.      351 clamp task_util() in the previous criterion.
352                                                   352 
353 5.1.2 Wakeup CPU selection                        353 5.1.2 Wakeup CPU selection
354 ~~~~~~~~~~~~~~~~~~~~~~~~~~                        354 ~~~~~~~~~~~~~~~~~~~~~~~~~~
355                                                   355 
356 CFS task wakeup CPU selection follows the capa    356 CFS task wakeup CPU selection follows the capacity fitness criterion described
357 above. On top of that, uclamp is used to clamp    357 above. On top of that, uclamp is used to clamp the task utilization values,
358 which lets userspace have more leverage over t    358 which lets userspace have more leverage over the CPU selection of CFS
359 tasks. IOW, CFS wakeup CPU selection searches     359 tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies::
360                                                   360 
361   clamp(task_util(p), task_uclamp_min(p), task    361   clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)
362                                                   362 
363 By using uclamp, userspace can e.g. allow a bu    363 By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run
364 on any CPU by giving it a low uclamp.max value    364 on any CPU by giving it a low uclamp.max value. Conversely, it can force a small
365 periodic task (e.g. 10% utilization) to run on    365 periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by
366 giving it a high uclamp.min value.                366 giving it a high uclamp.min value.
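
As a worked illustration of the clamped criterion, the following sketch evaluates it literally. The helper names and the capacity values used in the usage note (446 for a small CPU, 1024 for a big one) are assumptions made for the example; the scheduler's actual fitness check may apply additional margins that are not modelled here.

```c
#include <stdbool.h>

/* Clamp v into [lo, hi]; hypothetical helper, not the kernel's clamp(). */
static unsigned long model_clamp(unsigned long v, unsigned long lo,
				 unsigned long hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/*
 * Literal reading of:
 *   clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)
 * All values are on the same 0..1024 scale.
 */
static bool model_task_fits(unsigned long util, unsigned long uclamp_min,
			    unsigned long uclamp_max, unsigned long cpu_cap)
{
	return model_clamp(util, uclamp_min, uclamp_max) < cpu_cap;
}
```

With these assumed values, a busy loop (utilization 1024) with uclamp.max set to 200 fits on the small CPU, since ``model_task_fits(1024, 0, 200, 446)`` is true, whereas a 10% task (utilization 102) with uclamp.min set to 800 does not fit on the small CPU but does fit on the big one.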
367                                                   367 
368 .. note::                                         368 .. note::
369                                                   369 
370   Wakeup CPU selection in CFS can be eclipsed     370   Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling
371   (EAS), which is described in Documentation/s    371   (EAS), which is described in Documentation/scheduler/sched-energy.rst.
372                                                   372 
373 5.1.3 Load balancing                              373 5.1.3 Load balancing
374 ~~~~~~~~~~~~~~~~~~~~                              374 ~~~~~~~~~~~~~~~~~~~~
375                                                   375 
376 A pathological case in the wakeup CPU selectio    376 A pathological case in the wakeup CPU selection occurs when a task rarely
377 sleeps, if at all - it thus rarely wakes up, i    377 sleeps, if at all - it thus rarely wakes up, if at all. Consider::
378                                                   378 
379   w == wakeup event                               379   w == wakeup event
380                                                   380 
381   capacity(CPU0) = C                              381   capacity(CPU0) = C
382   capacity(CPU1) = C / 3                          382   capacity(CPU1) = C / 3
383                                                   383 
384                            workload on CPU0       384                            workload on CPU0
385   CPU work ^                                      385   CPU work ^
386            |     _________           _________    386            |     _________           _________           ____
387            |    |         |         |             387            |    |         |         |         |         |
388            +----+----+----+----+----+----+----    388            +----+----+----+----+----+----+----+----+----+----+-> time
389                 w                   w             389                 w                   w                   w
390                                                   390 
391                            workload on CPU1       391                            workload on CPU1
392   CPU work ^                                      392   CPU work ^
393            |     _____________________________    393            |     ____________________________________________
394            |    |                                 394            |    |
395            +----+----+----+----+----+----+----    395            +----+----+----+----+----+----+----+----+----+----+->
396                 w                                 396                 w
397                                                   397 
398 This workload should run on CPU0, but if the t    398 This workload should run on CPU0, but if the task either:
399                                                   399 
400 - was improperly scheduled from the start (ina    400 - was improperly scheduled from the start (inaccurate initial
401   utilization estimation)                         401   utilization estimation)
402 - was properly scheduled from the start, but s    402 - was properly scheduled from the start, but suddenly needs more
403   processing power                                403   processing power
404                                                   404 
405 then it might become CPU-bound, IOW ``task_uti    405 then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``;
406 the CPU capacity scheduling criterion is viola    406 the CPU capacity scheduling criterion is violated, and there may not be any more
407 wakeup events to fix this up via wakeup CPU se    407 wakeup events to fix this up via wakeup CPU selection.
408                                                   408 
409 Tasks that are in this situation are dubbed "m    409 Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism
410 put in place to handle this shares the same na    410 put in place to handle this shares the same name. Misfit task migration
411 leverages the CFS load balancer, more specific    411 leverages the CFS load balancer, more specifically the active load balance part
412 (which caters to migrating currently running t    412 (which caters to migrating currently running tasks). When load balance happens,
413 a misfit active load balance will be triggered    413 a misfit active load balance will be triggered if a misfit task can be migrated
414 to a CPU with more capacity than its current o    414 to a CPU with more capacity than its current one.
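
The misfit condition and the search for a bigger CPU can be sketched as follows. This is a simplified userspace model under stated assumptions: the function name and the flat capacity array are hypothetical, and the real load balancer weighs many more factors (affinity, load, sched_domain boundaries) than this example does.

```c
#include <stddef.h>

/*
 * Hypothetical model of misfit handling: if the task's utilization exceeds
 * the capacity of its current CPU, return the index of the highest-capacity
 * CPU that is strictly bigger than the current one, or -1 if the task is
 * not misfit or no bigger CPU exists.
 */
static int model_misfit_target(const unsigned long *cap, size_t nr_cpus,
			       size_t cur, unsigned long util)
{
	int best = -1;
	size_t i;

	if (util <= cap[cur])
		return -1;		/* the task fits: nothing to do */

	for (i = 0; i < nr_cpus; i++) {
		if (cap[i] <= cap[cur])
			continue;	/* not an upgrade */
		if (best < 0 || cap[i] > cap[(size_t)best])
			best = (int)i;
	}
	return best;
}
```

Taking C = 1024 in the example above, so that the capacities are {1024, 341}, a task with utilization 600 running on CPU1 is misfit and the sketch selects CPU0 as the migration target.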
415                                                   415 
416 5.2 RT                                            416 5.2 RT
417 ------                                            417 ------
418                                                   418 
419 5.2.1 Wakeup CPU selection                        419 5.2.1 Wakeup CPU selection
420 ~~~~~~~~~~~~~~~~~~~~~~~~~~                        420 ~~~~~~~~~~~~~~~~~~~~~~~~~~
421                                                   421 
422 RT task wakeup CPU selection searches for a CP    422 RT task wakeup CPU selection searches for a CPU that satisfies::
423                                                   423 
424   task_uclamp_min(p) <= capacity(cpu)             424   task_uclamp_min(p) <= capacity(cpu)
425                                                   425 
426 while still following the usual priority const    426 while still following the usual priority constraints. If none of the candidate
427 CPUs can satisfy this capacity criterion, then    427 CPUs can satisfy this capacity criterion, then strict priority based scheduling
428 is followed and CPU capacities are ignored.       428 is followed and CPU capacities are ignored.
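
A minimal sketch of this selection with its fallback is shown below. It assumes the candidate list has already been filtered and ordered by the usual priority constraints; the function name and array-based interface are hypothetical and do not mirror the kernel's data structures.

```c
#include <stddef.h>

/*
 * Hypothetical model of RT wakeup CPU selection.
 * cand[] holds CPU ids that already satisfy the priority constraints,
 * cap[] is indexed by CPU id. Returns the first candidate whose capacity
 * is at least uclamp_min; if none qualifies, capacities are ignored and
 * the first priority-based candidate is returned. Returns -1 if there
 * are no candidates at all.
 */
static int model_rt_pick(const int *cand, size_t nr_cand,
			 const unsigned long *cap, unsigned long uclamp_min)
{
	size_t i;

	if (nr_cand == 0)
		return -1;

	for (i = 0; i < nr_cand; i++)
		if (uclamp_min <= cap[cand[i]])
			return cand[i];

	return cand[0];	/* fall back to strict priority-based choice */
}
```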
429                                                   429 
430 5.3 DL                                            430 5.3 DL
431 ------                                            431 ------
432                                                   432 
433 5.3.1 Wakeup CPU selection                        433 5.3.1 Wakeup CPU selection
434 ~~~~~~~~~~~~~~~~~~~~~~~~~~                        434 ~~~~~~~~~~~~~~~~~~~~~~~~~~
435                                                   435 
436 DL task wakeup CPU selection searches for a CP    436 DL task wakeup CPU selection searches for a CPU that satisfies::
437                                                   437 
438   task_bandwidth(p) < capacity(cpu)               438   task_bandwidth(p) < capacity(cpu)
439                                                   439 
440 while still respecting the usual bandwidth and    440 while still respecting the usual bandwidth and deadline constraints. If
441 none of the candidate CPUs can satisfy this ca    441 none of the candidate CPUs can satisfy this capacity criterion, then the
442 task will remain on its current CPU.              442 task will remain on its current CPU.
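
The DL criterion can be sketched in the same spirit. This example assumes that task bandwidth is the runtime/period ratio expressed on the same 0..1024 scale as CPU capacity; that scaling, along with the function names, is an assumption made for illustration and the kernel's own fitness test may be formulated differently.

```c
#include <stddef.h>

/* Assumed scaling: bandwidth = runtime / period, on a 0..1024 scale. */
static unsigned long model_dl_bw(unsigned long long runtime_ns,
				 unsigned long long period_ns)
{
	return (unsigned long)(runtime_ns * 1024ULL / period_ns);
}

/*
 * Hypothetical model of DL wakeup CPU selection.
 * cand[] holds CPU ids that already satisfy the bandwidth and deadline
 * constraints, cap[] is indexed by CPU id. Returns the first candidate
 * with bw < capacity; otherwise the task stays on cur_cpu.
 */
static int model_dl_pick(int cur_cpu, const int *cand, size_t nr_cand,
			 const unsigned long *cap, unsigned long bw)
{
	size_t i;

	for (i = 0; i < nr_cand; i++)
		if (bw < cap[cand[i]])
			return cand[i];

	return cur_cpu;
}
```

For instance, a task with 5 ms of runtime every 10 ms has a bandwidth of 512 under this scaling; with assumed capacities {341, 1024} it would be placed on the second CPU.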
                                                      
