TOMOYO Linux Cross Reference
Linux/Documentation/scheduler/sched-capacity.rst

Diff markup

Differences between /Documentation/scheduler/sched-capacity.rst (Version linux-6.12-rc7) and /Documentation/scheduler/sched-capacity.rst (Version linux-6.1.116)


  1 =========================                           1 =========================
  2 Capacity Aware Scheduling                           2 Capacity Aware Scheduling
  3 =========================                           3 =========================
  4                                                     4 
  5 1. CPU Capacity                                     5 1. CPU Capacity
  6 ===============                                     6 ===============
  7                                                     7 
  8 1.1 Introduction                                    8 1.1 Introduction
  9 ----------------                                    9 ----------------
 10                                                    10 
 11 Conventional, homogeneous SMP platforms are co     11 Conventional, homogeneous SMP platforms are composed of purely identical
 12 CPUs. Heterogeneous platforms on the other han     12 CPUs. Heterogeneous platforms on the other hand are composed of CPUs with
 13 different performance characteristics - on suc     13 different performance characteristics - on such platforms, not all CPUs can be
 14 considered equal.                                  14 considered equal.
 15                                                    15 
 16 CPU capacity is a measure of the performance a     16 CPU capacity is a measure of the performance a CPU can reach, normalized against
 17 the most performant CPU in the system. Heterog     17 the most performant CPU in the system. Heterogeneous systems are also called
 18 asymmetric CPU capacity systems, as they conta     18 asymmetric CPU capacity systems, as they contain CPUs of different capacities.
 19                                                    19 
 20 Disparity in maximum attainable performance (I     20 Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems
 21 from two factors:                                  21 from two factors:
 22                                                    22 
 23 - not all CPUs may have the same microarchitec     23 - not all CPUs may have the same microarchitecture (µarch).
 24 - with Dynamic Voltage and Frequency Scaling (     24 - with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be
 25   physically able to attain the higher Operati     25   physically able to attain the higher Operating Performance Points (OPP).
 26                                                    26 
 27 Arm big.LITTLE systems are an example of both.     27 Arm big.LITTLE systems are an example of both. The big CPUs are more
 28 performance-oriented than the LITTLE ones (mor     28 performance-oriented than the LITTLE ones (more pipeline stages, bigger caches,
 29 smarter predictors, etc), and can usually reac     29 smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones
 30 can.                                               30 can.
 31                                                    31 
 32 CPU performance is usually expressed in Millio     32 CPU performance is usually expressed in Millions of Instructions Per Second
 33 (MIPS), which can also be expressed as a given     33 (MIPS), which can also be expressed as a given amount of instructions attainable
 34 per Hz, leading to::                               34 per Hz, leading to::
 35                                                    35 
 36   capacity(cpu) = work_per_hz(cpu) * max_freq(     36   capacity(cpu) = work_per_hz(cpu) * max_freq(cpu)
 37                                                    37 
 38 1.2 Scheduler terms                                38 1.2 Scheduler terms
 39 -------------------                                39 -------------------
 40                                                    40 
 41 Two different capacity values are used within      41 Two different capacity values are used within the scheduler. A CPU's
 42 ``original capacity`` is its maximum attainabl !!  42 ``capacity_orig`` is its maximum attainable capacity, i.e. its maximum
 43 attainable performance level. This original ca !!  43 attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` to
 44 the function arch_scale_cpu_capacity(). A CPU' !!  44 which some loss of available performance (e.g. time spent handling IRQs) is
 45 capacity`` to which some loss of available per !!  45 subtracted.
 46 handling IRQs) is subtracted.                  << 
 47                                                    46 
 48 Note that a CPU's ``capacity`` is solely inten     47 Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
 49 while ``original capacity`` is class-agnostic. !!  48 while ``capacity_orig`` is class-agnostic. The rest of this document will use
 50 the term ``capacity`` interchangeably with ``o !!  49 the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of
 51 brevity.                                           50 brevity.
 52                                                    51 
 53 1.3 Platform examples                              52 1.3 Platform examples
 54 ---------------------                              53 ---------------------
 55                                                    54 
 56 1.3.1 Identical OPPs                               55 1.3.1 Identical OPPs
 57 ~~~~~~~~~~~~~~~~~~~~                               56 ~~~~~~~~~~~~~~~~~~~~
 58                                                    57 
 59 Consider a hypothetical dual-core asymmetric C     58 Consider a hypothetical dual-core asymmetric CPU capacity system where
 60                                                    59 
 61 - work_per_hz(CPU0) = W                            60 - work_per_hz(CPU0) = W
 62 - work_per_hz(CPU1) = W/2                          61 - work_per_hz(CPU1) = W/2
 63 - all CPUs are running at the same fixed frequ     62 - all CPUs are running at the same fixed frequency
 64                                                    63 
 65 By the above definition of capacity:               64 By the above definition of capacity:
 66                                                    65 
 67 - capacity(CPU0) = C                               66 - capacity(CPU0) = C
 68 - capacity(CPU1) = C/2                             67 - capacity(CPU1) = C/2
 69                                                    68 
 70 To draw the parallel with Arm big.LITTLE, CPU0     69 To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would
 71 be a LITTLE.                                       70 be a LITTLE.
 72                                                    71 
 73 With a workload that periodically does a fixed     72 With a workload that periodically does a fixed amount of work, you will get an
 74 execution trace like so::                          73 execution trace like so::
 75                                                    74 
 76  CPU0 work ^                                       75  CPU0 work ^
 77            |     ____                ____          76            |     ____                ____                ____
 78            |    |    |              |    |         77            |    |    |              |    |              |    |
 79            +----+----+----+----+----+----+----     78            +----+----+----+----+----+----+----+----+----+----+-> time
 80                                                    79 
 81  CPU1 work ^                                       80  CPU1 work ^
 82            |     _________           _________     81            |     _________           _________           ____
 83            |    |         |         |              82            |    |         |         |         |         |
 84            +----+----+----+----+----+----+----     83            +----+----+----+----+----+----+----+----+----+----+-> time
 85                                                    84 
 86 CPU0 has the highest capacity in the system (C     85 CPU0 has the highest capacity in the system (C), and completes a fixed amount of
 87 work W in T units of time. On the other hand,      86 work W in T units of time. On the other hand, CPU1 has half the capacity of
 88 CPU0, and thus only completes W/2 in T.            87 CPU0, and thus only completes W/2 in T.
 89                                                    88 
 90 1.3.2 Different max OPPs                           89 1.3.2 Different max OPPs
 91 ~~~~~~~~~~~~~~~~~~~~~~~~                           90 ~~~~~~~~~~~~~~~~~~~~~~~~
 92                                                    91 
 93 Usually, CPUs of different capacity values als     92 Usually, CPUs of different capacity values also have different maximum
 94 OPPs. Consider the same CPUs as above (i.e. sa     93 OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with:
 95                                                    94 
 96 - max_freq(CPU0) = F                               95 - max_freq(CPU0) = F
 97 - max_freq(CPU1) = 2/3 * F                         96 - max_freq(CPU1) = 2/3 * F
 98                                                    97 
 99 This yields:                                       98 This yields:
100                                                    99 
101 - capacity(CPU0) = C                              100 - capacity(CPU0) = C
102 - capacity(CPU1) = C/3                            101 - capacity(CPU1) = C/3
103                                                   102 
104 Executing the same workload as described in 1.    103 Executing the same workload as described in 1.3.1, with each CPU running at its
105 maximum frequency results in::                    104 maximum frequency results in::
106                                                   105 
107  CPU0 work ^                                      106  CPU0 work ^
108            |     ____                ____         107            |     ____                ____                ____
109            |    |    |              |    |        108            |    |    |              |    |              |    |
110            +----+----+----+----+----+----+----    109            +----+----+----+----+----+----+----+----+----+----+-> time
111                                                   110 
112                             workload on CPU1      111                             workload on CPU1
113  CPU1 work ^                                      112  CPU1 work ^
114            |     ______________      _________    113            |     ______________      ______________      ____
115            |    |              |    |             114            |    |              |    |              |    |
116            +----+----+----+----+----+----+----    115            +----+----+----+----+----+----+----+----+----+----+-> time
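As a rough illustration of the capacity definition from 1.1 applied to the platforms of 1.3.1 and 1.3.2, the following Python sketch computes normalized capacities and relative busy times. The helper names are hypothetical, and scaling the largest capacity to 1024 is an assumption made for this example.

```python
from fractions import Fraction

# Assumption for this sketch: the most performant CPU is normalized to 1024.
CAPACITY_SCALE = 1024


def raw_capacity(work_per_hz, max_freq):
    """capacity(cpu) = work_per_hz(cpu) * max_freq(cpu)"""
    return work_per_hz * max_freq


def normalize(raw):
    """Scale raw capacities so the biggest CPU gets CAPACITY_SCALE (truncating)."""
    biggest = max(raw.values())
    return {cpu: int(value * CAPACITY_SCALE / biggest) for cpu, value in raw.items()}


def relative_busy_time(raw):
    """Time needed for a fixed amount of work, relative to the biggest CPU."""
    biggest = max(raw.values())
    return {cpu: biggest / value for cpu, value in raw.items()}


# Arbitrary units: W is the work per Hz of CPU0, F its maximum frequency.
W, F = Fraction(1), Fraction(1)

# 1.3.1: same frequency, CPU1 does half the work per Hz.
identical_opps = {"CPU0": raw_capacity(W, F), "CPU1": raw_capacity(W / 2, F)}
# 1.3.2: CPU1 additionally tops out at 2/3 of CPU0's maximum frequency.
different_opps = {
    "CPU0": raw_capacity(W, F),
    "CPU1": raw_capacity(W / 2, Fraction(2, 3) * F),
}

caps_131 = normalize(identical_opps)
caps_132 = normalize(different_opps)
busy_132 = relative_busy_time(different_opps)

for cpu in ("CPU0", "CPU1"):
    print(cpu, caps_131[cpu], caps_132[cpu], busy_132[cpu])
```

In the 1.3.2 case CPU1 ends up with a third of CPU0's capacity, so the same amount of work keeps it busy three times as long, which matches the trace above.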
117                                                   116 
118 1.4 Representation caveat                         117 1.4 Representation caveat
119 -------------------------                         118 -------------------------
120                                                   119 
121 It should be noted that having a *single* valu    120 It should be noted that having a *single* value to represent differences in CPU
122 performance is somewhat of a contentious point    121 performance is somewhat of a contentious point. The relative performance
123 difference between two different µarchs could    122 difference between two different µarchs could be X% on integer operations, Y% on
124 floating point operations, Z% on branches, and    123 floating point operations, Z% on branches, and so on. Still, results using this
125 simple approach have been satisfactory for now    124 simple approach have been satisfactory for now.
126                                                   125 
127 2. Task utilization                               126 2. Task utilization
128 ===================                               127 ===================
129                                                   128 
130 2.1 Introduction                                  129 2.1 Introduction
131 ----------------                                  130 ----------------
132                                                   131 
133 Capacity aware scheduling requires an expressi    132 Capacity aware scheduling requires an expression of a task's requirements with
134 regards to CPU capacity. Each scheduler class     133 regards to CPU capacity. Each scheduler class can express this differently, and
135 while task utilization is specific to CFS, it     134 while task utilization is specific to CFS, it is convenient to describe it here
136 in order to introduce more generic concepts.      135 in order to introduce more generic concepts.
137                                                   136 
138 Task utilization is a percentage meant to repr    137 Task utilization is a percentage meant to represent the throughput requirements
139 of a task. A simple approximation of it is the    138 of a task. A simple approximation of it is the task's duty cycle, i.e.::
140                                                   139 
141   task_util(p) = duty_cycle(p)                    140   task_util(p) = duty_cycle(p)
142                                                   141 
143 On an SMP system with fixed frequencies, 100%     142 On an SMP system with fixed frequencies, 100% utilization suggests the task is a
144 busy loop. Conversely, 10% utilization hints i    143 busy loop. Conversely, 10% utilization hints it is a small periodic task that
145 spends more time sleeping than executing. Vari    144 spends more time sleeping than executing. Variable CPU frequencies and
146 asymmetric CPU capacities complicate this some    145 asymmetric CPU capacities complicate this somewhat; the following sections will
147 expand on these.                                  146 expand on these.
148                                                   147 
149 2.2 Frequency invariance                          148 2.2 Frequency invariance
150 ------------------------                          149 ------------------------
151                                                   150 
152 One issue that needs to be taken into account     151 One issue that needs to be taken into account is that a workload's duty cycle is
153 directly impacted by the current OPP the CPU i    152 directly impacted by the current OPP the CPU is running at. Consider running a
154 periodic workload at a given frequency F::        153 periodic workload at a given frequency F::
155                                                   154 
156   CPU work ^                                      155   CPU work ^
157            |     ____                ____         156            |     ____                ____                ____
158            |    |    |              |    |        157            |    |    |              |    |              |    |
159            +----+----+----+----+----+----+----    158            +----+----+----+----+----+----+----+----+----+----+-> time
160                                                   159 
161 This yields duty_cycle(p) == 25%.                 160 This yields duty_cycle(p) == 25%.
162                                                   161 
163 Now, consider running the *same* workload at f    162 Now, consider running the *same* workload at frequency F/2::
164                                                   163 
165   CPU work ^                                      164   CPU work ^
166            |     _________           _________    165            |     _________           _________           ____
167            |    |         |         |             166            |    |         |         |         |         |
168            +----+----+----+----+----+----+----    167            +----+----+----+----+----+----+----+----+----+----+-> time
169                                                   168 
170 This yields duty_cycle(p) == 50%, despite the     169 This yields duty_cycle(p) == 50%, despite the task having the exact same
171 behaviour (i.e. executing the same amount of w    170 behaviour (i.e. executing the same amount of work) in both executions.
172                                                   171 
173 The task utilization signal can be made freque    172 The task utilization signal can be made frequency invariant using the following
174 formula::                                         173 formula::
175                                                   174 
176   task_util_freq_inv(p) = duty_cycle(p) * (cur    175   task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu))
177                                                   176 
178 Applying this formula to the two examples abov    177 Applying this formula to the two examples above yields a frequency invariant
179 task utilization of 25%.                          178 task utilization of 25%.
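The two executions above can be checked numerically. This is a minimal Python sketch of the formula, not kernel code; the function name mirrors the pseudo-formula and the 2000 MHz maximum frequency is a hypothetical value.

```python
def task_util_freq_inv(duty_cycle, curr_freq, max_freq):
    """duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu))"""
    return duty_cycle * curr_freq / max_freq


MAX_FREQ_MHZ = 2000  # hypothetical maximum frequency F

# Same workload: 25% duty cycle at F, 50% duty cycle at F/2.
util_at_f = task_util_freq_inv(0.25, MAX_FREQ_MHZ, MAX_FREQ_MHZ)
util_at_half_f = task_util_freq_inv(0.50, MAX_FREQ_MHZ // 2, MAX_FREQ_MHZ)

print(util_at_f, util_at_half_f)
```

Both calls return the same value, since halving the frequency doubles the observed duty cycle and the scaling factor cancels that out.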
180                                                   179 
181 2.3 CPU invariance                                180 2.3 CPU invariance
182 ------------------                                181 ------------------
183                                                   182 
184 CPU capacity has a similar effect on task util    183 CPU capacity has a similar effect on task utilization in that running an
185 identical workload on CPUs of different capaci    184 identical workload on CPUs of different capacity values will yield different
186 duty cycles.                                      185 duty cycles.
187                                                   186 
188 Consider the system described in 1.3.2., i.e.:    187 Consider the system described in 1.3.2., i.e.::
189                                                   188 
190 - capacity(CPU0) = C                              189 - capacity(CPU0) = C
191 - capacity(CPU1) = C/3                            190 - capacity(CPU1) = C/3
192                                                   191 
193 Executing a given periodic workload on each CP    192 Executing a given periodic workload on each CPU at their maximum frequency would
194 result in::                                       193 result in::
195                                                   194 
196  CPU0 work ^                                      195  CPU0 work ^
197            |     ____                ____         196            |     ____                ____                ____
198            |    |    |              |    |        197            |    |    |              |    |              |    |
199            +----+----+----+----+----+----+----    198            +----+----+----+----+----+----+----+----+----+----+-> time
200                                                   199 
201  CPU1 work ^                                      200  CPU1 work ^
202            |     ______________      _________    201            |     ______________      ______________      ____
203            |    |              |    |             202            |    |              |    |              |    |
204            +----+----+----+----+----+----+----    203            +----+----+----+----+----+----+----+----+----+----+-> time
205                                                   204 
206 IOW,                                              205 IOW,
207                                                   206 
208 - duty_cycle(p) == 25% if p runs on CPU0 at it    207 - duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency
209 - duty_cycle(p) == 75% if p runs on CPU1 at it    208 - duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency
210                                                   209 
211 The task utilization signal can be made CPU in    210 The task utilization signal can be made CPU invariant using the following
212 formula::                                         211 formula::
213                                                   212 
214   task_util_cpu_inv(p) = duty_cycle(p) * (capa    213   task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity)
215                                                   214 
216 with ``max_capacity`` being the highest CPU ca    215 with ``max_capacity`` being the highest CPU capacity value in the
217 system. Applying this formula to the above exa    216 system. Applying this formula to the above example yields a CPU
218 invariant task utilization of 25%.                217 invariant task utilization of 25%.
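The same check can be done for CPU invariance. The sketch below is illustrative only; it uses exact fractions to avoid rounding, and the value chosen for C is an assumption.

```python
from fractions import Fraction


def task_util_cpu_inv(duty_cycle, capacity, max_capacity):
    """duty_cycle(p) * (capacity(cpu) / max_capacity)"""
    return duty_cycle * capacity / max_capacity


C = Fraction(1024)  # assumed value for the highest capacity in the system

# 25% duty cycle on CPU0 (capacity C), 75% on CPU1 (capacity C/3).
util_on_cpu0 = task_util_cpu_inv(Fraction(25, 100), C, C)
util_on_cpu1 = task_util_cpu_inv(Fraction(75, 100), C / 3, C)

print(util_on_cpu0, util_on_cpu1)
```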
219                                                   218 
220 2.4 Invariant task utilization                    219 2.4 Invariant task utilization
221 ------------------------------                    220 ------------------------------
222                                                   221 
223 Both frequency and CPU invariance need to be a    222 Both frequency and CPU invariance need to be applied to task utilization in
224 order to obtain a truly invariant signal. The     223 order to obtain a truly invariant signal. The pseudo-formula for a task
225 utilization that is both CPU and frequency inv    224 utilization that is both CPU and frequency invariant is thus, for a given
226 task p::                                          225 task p::
227                                                   226 
228                                      curr_freq    227                                      curr_frequency(cpu)   capacity(cpu)
229   task_util_inv(p) = duty_cycle(p) * ---------    228   task_util_inv(p) = duty_cycle(p) * ------------------- * -------------
230                                      max_frequ    229                                      max_frequency(cpu)    max_capacity
231                                                   230 
232 In other words, invariant task utilization des    231 In other words, invariant task utilization describes the behaviour of a task as
233 if it were running on the highest-capacity CPU    232 if it were running on the highest-capacity CPU in the system, running at its
234 maximum frequency.                                233 maximum frequency.
235                                                   234 
236 Any mention of task utilization in the followi    235 Any mention of task utilization in the following sections will imply its
237 invariant form.                                   236 invariant form.
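To see both corrections working together, consider a hypothetical task observed on CPU1 of the system from 1.3.2 (capacity C/3) while that CPU runs at half its maximum frequency. The numbers below are made up for the example, and the function is only a sketch of the pseudo-formula above.

```python
from fractions import Fraction


def task_util_inv(duty_cycle, curr_freq, max_freq, capacity, max_capacity):
    """duty_cycle * (curr_freq / max_freq) * (capacity / max_capacity)"""
    freq_ratio = Fraction(curr_freq, max_freq)
    cpu_ratio = Fraction(capacity, max_capacity)
    return duty_cycle * freq_ratio * cpu_ratio


# Hypothetical observation: the task is busy 60% of the time on CPU1,
# which runs at 1000 MHz out of 2000 MHz and has capacity 1 out of 3.
observed = task_util_inv(
    duty_cycle=Fraction(60, 100),
    curr_freq=1000,
    max_freq=2000,
    capacity=1,
    max_capacity=3,
)

print(observed)
```

Under these assumptions the task would only occupy a tenth of the highest-capacity CPU running at its maximum frequency, even though it looks fairly busy where it currently runs.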
238                                                   237 
239 2.5 Utilization estimation                        238 2.5 Utilization estimation
240 --------------------------                        239 --------------------------
241                                                   240 
242 Without a crystal ball, task behaviour (and th    241 Without a crystal ball, task behaviour (and thus task utilization) cannot
243 accurately be predicted the moment a task firs    242 accurately be predicted the moment a task first becomes runnable. The CFS class
244 maintains a handful of CPU and task signals ba    243 maintains a handful of CPU and task signals based on the Per-Entity Load
245 Tracking (PELT) mechanism, one of those yieldi    244 Tracking (PELT) mechanism, one of those yielding an *average* utilization (as
246 opposed to instantaneous).                        245 opposed to instantaneous).
247                                                   246 
248 This means that while the capacity aware sched    247 This means that while the capacity aware scheduling criteria will be written
249 considering a "true" task utilization (using a    248 considering a "true" task utilization (using a crystal ball), the implementation
250 will only ever be able to use an estimator the    249 will only ever be able to use an estimator thereof.
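The gap between a "true" utilization and an averaged estimator can be illustrated with a generic exponentially decaying average. This is not the PELT implementation; the function name, the per-period sampling, and the half-life of 32 periods are all assumptions chosen for the sketch.

```python
def decayed_average(samples, half_life=32):
    """Exponentially decaying average of per-period samples in [0, 1].

    A sample's weight is halved every `half_life` periods (assumed value).
    """
    decay = 0.5 ** (1.0 / half_life)
    average = 0.0
    for running in samples:
        average = average * decay + (1.0 - decay) * running
    return average


# A task that starts from nothing and then behaves like a busy loop
# (true utilization: 100%).
after_32_periods = decayed_average([1.0] * 32)
after_320_periods = decayed_average([1.0] * 320)

print(after_32_periods, after_320_periods)
```

With these assumptions the estimate only reaches about half of the true value after one half-life, and converges towards it afterwards. An estimator of this kind therefore lags behind sudden changes in task behaviour.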
251                                                   250 
252 3. Capacity aware scheduling requirements         251 3. Capacity aware scheduling requirements
253 =========================================         252 =========================================
254                                                   253 
255 3.1 CPU capacity                                  254 3.1 CPU capacity
256 ----------------                                  255 ----------------
257                                                   256 
258 Linux cannot currently figure out CPU capacity    257 Linux cannot currently figure out CPU capacity on its own; this information thus
259 needs to be handed to it. Architectures must d    258 needs to be handed to it. Architectures must define arch_scale_cpu_capacity()
260 for that purpose.                                 259 for that purpose.
261                                                   260 
262 The arm, arm64, and RISC-V architectures direc !! 261 The arm and arm64 architectures directly map this to the arch_topology driver
263 CPU scaling data, which is derived from the ca    262 CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see
264 Documentation/devicetree/bindings/cpu/cpu-capa !! 263 Documentation/devicetree/bindings/arm/cpu-capacity.txt.
265                                                   264 
266 3.2 Frequency invariance                          265 3.2 Frequency invariance
267 ------------------------                          266 ------------------------
268                                                   267 
269 As stated in 2.2, capacity-aware scheduling re    268 As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task
270 utilization. Architectures must define arch_sc    269 utilization. Architectures must define arch_scale_freq_capacity(cpu) for that
271 purpose.                                          270 purpose.
272                                                   271 
273 Implementing this function requires figuring o    272 Implementing this function requires figuring out at which frequency each CPU
274 has been running. One way to implement this is    273 has been running. One way to implement this is to leverage hardware counters
275 whose increment rate scales with a CPU's curre    274 whose increment rate scales with a CPU's current frequency (APERF/MPERF on x86,
276 AMU on arm64). Another is to directly hook int    275 AMU on arm64). Another is to directly hook into cpufreq frequency transitions,
277 when the kernel is aware of the switched-to fr    276 when the kernel is aware of the switched-to frequency (also employed by
278 arm/arm64).                                       277 arm/arm64).
279                                                   278 
280 4. Scheduler topology                             279 4. Scheduler topology
281 =====================                             280 =====================
282                                                   281 
283 During the construction of the sched domains,     282 During the construction of the sched domains, the scheduler will figure out
284 whether the system exhibits asymmetric CPU cap    283 whether the system exhibits asymmetric CPU capacities. Should that be the
285 case:                                             284 case:
286                                                   285 
287 - The sched_asym_cpucapacity static key will b    286 - The sched_asym_cpucapacity static key will be enabled.
288 - The SD_ASYM_CPUCAPACITY_FULL flag will be se    287 - The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain
289   level that spans all unique CPU capacity val    288   level that spans all unique CPU capacity values.
290 - The SD_ASYM_CPUCAPACITY flag will be set for    289 - The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans
291   CPUs with any range of asymmetry.               290   CPUs with any range of asymmetry.
292                                                   291 
293 The sched_asym_cpucapacity static key is inten    292 The sched_asym_cpucapacity static key is intended to guard sections of code that
294 cater to asymmetric CPU capacity systems. Do n    293 cater to asymmetric CPU capacity systems. Do note however that said key is
295 *system-wide*. Imagine the following setup usi    294 *system-wide*. Imagine the following setup using cpusets::
296                                                   295 
297   capacity    C/2          C                      296   capacity    C/2          C
298             ________    ________                  297             ________    ________
299            /        \  /        \                 298            /        \  /        \
300   CPUs     0  1  2  3  4  5  6  7                 299   CPUs     0  1  2  3  4  5  6  7
301            \__/  \______________/                 300            \__/  \______________/
302   cpusets   cs0         cs1                       301   cpusets   cs0         cs1
303                                                   302 
304 Which could be created via:                       303 Which could be created via:
305                                                   304 
306 .. code-block:: sh                                305 .. code-block:: sh
307                                                   306 
308   mkdir /sys/fs/cgroup/cpuset/cs0                 307   mkdir /sys/fs/cgroup/cpuset/cs0
309   echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.    308   echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus
310   echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.me    309   echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems
311                                                   310 
312   mkdir /sys/fs/cgroup/cpuset/cs1                 311   mkdir /sys/fs/cgroup/cpuset/cs1
313   echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.    312   echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus
314   echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.me    313   echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems
315                                                   314 
316   echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_    315   echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
317                                                   316 
318 Since there *is* CPU capacity asymmetry in the    317 Since there *is* CPU capacity asymmetry in the system, the
319 sched_asym_cpucapacity static key will be enab    318 sched_asym_cpucapacity static key will be enabled. However, the sched_domain
320 hierarchy of CPUs 0-1 spans a single capacity     319 hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't
321 set in that hierarchy; it describes an SMP isl    320 set in that hierarchy; it describes an SMP island and should be treated as such.
322                                                   321 
323 Therefore, the 'canonical' pattern for protect    322 Therefore, the 'canonical' pattern for protecting codepaths that cater to
324 asymmetric CPU capacities is to:                  323 asymmetric CPU capacities is to:
325                                                   324 
326 - Check the sched_asym_cpucapacity static key     325 - Check the sched_asym_cpucapacity static key
327 - If it is enabled, then also check for the pr    326 - If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in
328   the sched_domain hierarchy (if relevant, i.e    327   the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific
329   CPU or group thereof)                           328   CPU or group thereof)
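
The two-step check above can be modelled in userspace as follows. This is only an illustrative sketch, not kernel code: ``SchedDomain``, ``needs_asym_handling`` and the level names are hypothetical, and the boolean argument stands in for the sched_asym_cpucapacity static key. The cs1 hierarchy is assumed to span more than one capacity value.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SchedDomain:
    """Hypothetical stand-in for one level of a sched_domain hierarchy."""
    name: str
    asym_cpucapacity: bool  # models the SD_ASYM_CPUCAPACITY flag


def needs_asym_handling(static_key_enabled: bool,
                        hierarchy: List[SchedDomain]) -> bool:
    # Step 1: system-wide static key; false on fully symmetric systems.
    if not static_key_enabled:
        return False
    # Step 2: the flag must be set somewhere in this CPU's own hierarchy.
    return any(sd.asym_cpucapacity for sd in hierarchy)


# cs0 (CPUs 0-1) spans a single capacity value: no level carries the flag.
cs0 = [SchedDomain("level0", False)]
# cs1 (CPUs 2-7) is assumed to span several capacity values.
cs1 = [SchedDomain("level0", False), SchedDomain("level1", True)]
```

With the static key enabled, the check is false for a CPU in cs0 and true for a CPU in cs1, which is why the static key alone is not sufficient.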
330                                                   329 
331 5. Capacity aware scheduling implementation       330 5. Capacity aware scheduling implementation
332 ===========================================       331 ===========================================
333                                                   332 
334 5.1 CFS                                           333 5.1 CFS
335 -------                                           334 -------
336                                                   335 
337 5.1.1 Capacity fitness                            336 5.1.1 Capacity fitness
338 ~~~~~~~~~~~~~~~~~~~~~~                            337 ~~~~~~~~~~~~~~~~~~~~~~
339                                                   338 
340 The main capacity scheduling criterion of CFS     339 The main capacity scheduling criterion of CFS is::
341                                                   340 
342   task_util(p) < capacity(task_cpu(p))            341   task_util(p) < capacity(task_cpu(p))
343                                                   342 
344 This is commonly called the capacity fitness c    343 This is commonly called the capacity fitness criterion, i.e. CFS must ensure a
345 task "fits" on its CPU. If it is violated, the    344 task "fits" on its CPU. If it is violated, the task will need to achieve more
346 work than what its CPU can provide: it will be    345 work than what its CPU can provide: it will be CPU-bound.
347                                                   346 
348 Furthermore, uclamp lets userspace specify a m    347 Furthermore, uclamp lets userspace specify a minimum and a maximum utilization
349 value for a task, either via sched_setattr() o    348 value for a task, either via sched_setattr() or via the cgroup interface (see
350 Documentation/admin-guide/cgroup-v2.rst). As i    349 Documentation/admin-guide/cgroup-v2.rst). As its name implies, this can be used to
351 clamp task_util() in the previous criterion.      350 clamp task_util() in the previous criterion.
352                                                   351 
353 5.1.2 Wakeup CPU selection                        352 5.1.2 Wakeup CPU selection
354 ~~~~~~~~~~~~~~~~~~~~~~~~~~                        353 ~~~~~~~~~~~~~~~~~~~~~~~~~~
355                                                   354 
356 CFS task wakeup CPU selection follows the capa    355 CFS task wakeup CPU selection follows the capacity fitness criterion described
357 above. On top of that, uclamp is used to clamp    356 above. On top of that, uclamp is used to clamp the task utilization values,
358 which lets userspace have more leverage over t    357 which lets userspace have more leverage over the CPU selection of CFS
359 tasks. IOW, CFS wakeup CPU selection searches     358 tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies::
360                                                   359 
361   clamp(task_util(p), task_uclamp_min(p), task    360   clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)
362                                                   361 
363 By using uclamp, userspace can e.g. allow a bu    362 By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run
364 on any CPU by giving it a low uclamp.max value    363 on any CPU by giving it a low uclamp.max value. Conversely, it can force a small
365 periodic task (e.g. 10% utilization) to run on    364 periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by
366 giving it a high uclamp.min value.                365 giving it a high uclamp.min value.
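
The clamped criterion and the two uclamp examples above can be checked with a small model. This is a literal transcription of the formula, not the kernel's implementation, which may apply additional margins. The 0..1024 scale and the ``LITTLE``/``BIG`` capacities are assumptions chosen for illustration.

```python
def clamp(value: int, lo: int, hi: int) -> int:
    """Restrict value to the inclusive range [lo, hi]."""
    return max(lo, min(value, hi))


def fits_capacity(task_util: int, uclamp_min: int, uclamp_max: int,
                  cpu_capacity: int) -> bool:
    # clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)
    return clamp(task_util, uclamp_min, uclamp_max) < cpu_capacity


# Hypothetical capacities on an assumed 0..1024 scale.
LITTLE, BIG = 512, 1024

# Busy loop (100% utilization) with a low uclamp.max: fits even on LITTLE.
busy_loop_fits_little = fits_capacity(1024, 0, 256, LITTLE)

# Small task (about 10% utilization) with a high uclamp.min: only BIG fits.
small_task_fits_little = fits_capacity(102, 768, 1024, LITTLE)
small_task_fits_big = fits_capacity(102, 768, 1024, BIG)
```

Here ``busy_loop_fits_little`` and ``small_task_fits_big`` are true, while ``small_task_fits_little`` is false. Without any clamping, the same busy loop would fit on neither CPU.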
367                                                   366 
368 .. note::                                         367 .. note::
369                                                   368 
370   Wakeup CPU selection in CFS can be eclipsed     369   Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling
371   (EAS), which is described in Documentation/s    370   (EAS), which is described in Documentation/scheduler/sched-energy.rst.
372                                                   371 
373 5.1.3 Load balancing                              372 5.1.3 Load balancing
374 ~~~~~~~~~~~~~~~~~~~~                              373 ~~~~~~~~~~~~~~~~~~~~
375                                                   374 
376 A pathological case in the wakeup CPU selectio    375 A pathological case in the wakeup CPU selection occurs when a task rarely
377 sleeps, if at all - it thus rarely wakes up, i    376 sleeps, if at all - it thus rarely wakes up, if at all. Consider::
378                                                   377 
379   w == wakeup event                               378   w == wakeup event
380                                                   379 
381   capacity(CPU0) = C                              380   capacity(CPU0) = C
382   capacity(CPU1) = C / 3                          381   capacity(CPU1) = C / 3
383                                                   382 
384                            workload on CPU0       383                            workload on CPU0
385   CPU work ^                                      384   CPU work ^
386            |     _________           _________    385            |     _________           _________           ____
387            |    |         |         |             386            |    |         |         |         |         |
388            +----+----+----+----+----+----+----    387            +----+----+----+----+----+----+----+----+----+----+-> time
389                 w                   w             388                 w                   w                   w
390                                                   389 
391                            workload on CPU1       390                            workload on CPU1
392   CPU work ^                                      391   CPU work ^
393            |     _____________________________    392            |     ____________________________________________
394            |    |                                 393            |    |
395            +----+----+----+----+----+----+----    394            +----+----+----+----+----+----+----+----+----+----+->
396                 w                                 395                 w
397                                                   396 
398 This workload should run on CPU0, but if the t    397 This workload should run on CPU0, but if the task either:
399                                                   398 
400 - was improperly scheduled from the start (ina    399 - was improperly scheduled from the start (inaccurate initial
401   utilization estimation)                         400   utilization estimation)
402 - was properly scheduled from the start, but s    401 - was properly scheduled from the start, but suddenly needs more
403   processing power                                402   processing power
404                                                   403 
405 then it might become CPU-bound, IOW ``task_uti    404 then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``;
406 the CPU capacity scheduling criterion is viola    405 the CPU capacity scheduling criterion is violated, and there may not be any more
407 wakeup events to fix this up via wakeup CPU se    406 wakeup events to fix this up via wakeup CPU selection.
408                                                   407 
409 Tasks that are in this situation are dubbed "m    408 Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism
410 put in place to handle this shares the same na    409 put in place to handle this shares the same name. Misfit task migration
411 leverages the CFS load balancer, more specific    410 leverages the CFS load balancer, more specifically the active load balance part
412 (which caters to migrating currently running t    411 (which caters to migrating currently running tasks). When load balance happens,
413 a misfit active load balance will be triggered    412 a misfit active load balance will be triggered if a misfit task can be migrated
414 to a CPU with more capacity than its current o    413 to a CPU with more capacity than its current one.
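
A minimal model of the misfit condition and of the choice of a migration target is sketched below. The function names are hypothetical and picking the highest-capacity CPU is an assumption made for simplicity. The text only requires the destination to have more capacity than the current CPU. Capacities follow the example above with C assumed to be 1024.

```python
from typing import Dict, Optional


def is_misfit(task_util: int, cpu_capacity: int) -> bool:
    # task_util(p) > capacity(task_cpu(p)): the fitness criterion is violated.
    return task_util > cpu_capacity


def misfit_destination(task_util: int, current_cpu: int,
                       capacities: Dict[int, int]) -> Optional[int]:
    """Return a CPU with more capacity than the current one, or None."""
    if not is_misfit(task_util, capacities[current_cpu]):
        return None
    bigger = [cpu for cpu, cap in capacities.items()
              if cap > capacities[current_cpu]]
    if not bigger:
        return None
    # Assumption: prefer the highest-capacity candidate.
    return max(bigger, key=lambda cpu: capacities[cpu])


# capacity(CPU0) = C, capacity(CPU1) = C / 3, with C assumed to be 1024.
capacities = {0: 1024, 1: 341}
```

In this model a task with utilization 600 running on CPU1 is a misfit and gets CPU0 as its destination. The same task already on CPU0, or a task with utilization 200 on CPU1, yields no migration.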
415                                                   414 
416 5.2 RT                                            415 5.2 RT
417 ------                                            416 ------
418                                                   417 
419 5.2.1 Wakeup CPU selection                        418 5.2.1 Wakeup CPU selection
420 ~~~~~~~~~~~~~~~~~~~~~~~~~~                        419 ~~~~~~~~~~~~~~~~~~~~~~~~~~
421                                                   420 
422 RT task wakeup CPU selection searches for a CP    421 RT task wakeup CPU selection searches for a CPU that satisfies::
423                                                   422 
424   task_uclamp_min(p) <= capacity(cpu)             423   task_uclamp_min(p) <= capacity(cpu)
425                                                   424 
426 while still following the usual priority const    425 while still following the usual priority constraints. If none of the candidate
427 CPUs can satisfy this capacity criterion, then    426 CPUs can satisfy this capacity criterion, then strict priority based scheduling
428 is followed and CPU capacities are ignored.       427 is followed and CPU capacities are ignored.
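
The RT selection with its fallback can be sketched as below. This is an illustrative model only: ``rt_pick_cpu`` is a hypothetical name, and the candidate list is assumed to already be filtered and ordered by the usual priority constraints.

```python
from typing import Dict, List, Optional


def rt_pick_cpu(uclamp_min: int, candidates: List[int],
                capacities: Dict[int, int]) -> Optional[int]:
    """candidates: CPUs allowed by priority rules, in preference order."""
    # First pass: honour the capacity criterion.
    for cpu in candidates:
        if uclamp_min <= capacities[cpu]:
            return cpu
    # Fallback: no candidate fits, so capacities are ignored and the
    # priority-based preference order alone decides.
    return candidates[0] if candidates else None


# Hypothetical capacities on an assumed 0..1024 scale.
rt_capacities = {0: 512, 1: 512, 2: 1024}
```

With a uclamp.min of 800, the candidates ``[0, 1, 2]`` yield CPU 2. If only ``[0, 1]`` are candidates, none satisfies the criterion and the first candidate, CPU 0, is returned.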
429                                                   428 
430 5.3 DL                                            429 5.3 DL
431 ------                                            430 ------
432                                                   431 
433 5.3.1 Wakeup CPU selection                        432 5.3.1 Wakeup CPU selection
434 ~~~~~~~~~~~~~~~~~~~~~~~~~~                        433 ~~~~~~~~~~~~~~~~~~~~~~~~~~
435                                                   434 
436 DL task wakeup CPU selection searches for a CP    435 DL task wakeup CPU selection searches for a CPU that satisfies::
437                                                   436 
438   task_bandwidth(p) < capacity(cpu)               437   task_bandwidth(p) < capacity(cpu)
439                                                   438 
440 while still respecting the usual bandwidth and    439 while still respecting the usual bandwidth and deadline constraints. If
441 none of the candidate CPUs can satisfy this ca    440 none of the candidate CPUs can satisfy this capacity criterion, then the
442 task will remain on its current CPU.              441 task will remain on its current CPU.
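
The DL case differs from RT only in its fallback, as the following sketch shows. ``dl_pick_cpu`` is a hypothetical name. The task bandwidth is assumed to be expressed on the same 0..1024 scale as CPU capacity, and the candidates are assumed to already satisfy the bandwidth and deadline constraints.

```python
from typing import Dict, List


def dl_pick_cpu(bandwidth: int, current_cpu: int, candidates: List[int],
                capacities: Dict[int, int]) -> int:
    # Look for a candidate CPU whose capacity exceeds the task's bandwidth.
    for cpu in candidates:
        if bandwidth < capacities[cpu]:
            return cpu
    # No candidate satisfies the criterion: stay on the current CPU.
    return current_cpu


# Hypothetical capacities on an assumed 0..1024 scale.
dl_capacities = {0: 512, 1: 1024}
```

A task with bandwidth 600 currently on CPU 0 is placed on CPU 1 when both CPUs are candidates, and stays on CPU 0 when CPU 0 is the only candidate.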
                                                      
