.. SPDX-License-Identifier: GPL-2.0

====================
Utilization Clamping
====================

1. Introduction
===============

Utilization clamping, also known as util clamp or uclamp, is a scheduler
feature that allows user space to help in managing the performance requirement
of tasks. It was introduced in the v5.3 release. The CGroup support was merged
in v5.4.

Uclamp is a hinting mechanism that allows the scheduler to understand the
performance requirements and restrictions of the tasks, thus it helps the
scheduler to make a better decision. And when the schedutil cpufreq governor is
used, util clamp will influence the CPU frequency selection as well.

Since the scheduler and schedutil are both driven by PELT (util_avg) signals,
util clamp acts on that to achieve its goal by clamping the signal to a certain
point; hence the name. That is, by clamping utilization we are making the
system run at a certain performance point.

The right way to view util clamp is as a mechanism to make a request or hint on
performance constraints. It consists of two tunables:

        * UCLAMP_MIN, which sets the lower bound.
        * UCLAMP_MAX, which sets the upper bound.

These two bounds will ensure a task will operate within this performance range
of the system. UCLAMP_MIN implies boosting a task, while UCLAMP_MAX implies
capping a task.

One can tell the system (scheduler) that some tasks require a minimum
performance point to operate at to deliver the desired user experience. Or one
can tell the system that some tasks should be restricted from consuming too
many resources and should not go above a specific performance point. Viewing
the uclamp values as performance points rather than utilization is a better
abstraction from the user space point of view.

As an example, a game can use util clamp to form a feedback loop with its
perceived Frames Per Second (FPS). It can dynamically increase the minimum
performance point required by its display pipeline to ensure no frame is
dropped. It can also dynamically 'prime' up these tasks if it knows in the
coming few hundred milliseconds a computationally intensive scene is about to
happen.

On mobile hardware where the capability of the devices varies a lot, this
dynamic feedback loop offers great flexibility to ensure the best user
experience given the capabilities of any system.

Of course a static configuration is possible too. The exact usage will depend
on the system, application and the desired outcome.

Another example is in Android where tasks are classified as background,
foreground, top-app, etc. Util clamp can be used to constrain how much
resources background tasks are consuming by capping the performance point they
can run at. This constraint helps reserve resources for important tasks, like
the ones belonging to the currently active app (top-app group). Besides, this
helps in limiting how much power they consume. This can be more obvious in
heterogeneous systems (e.g. Arm big.LITTLE); the constraint will help bias the
background tasks to stay on the little cores which will ensure that:

        1. The big cores are free to run top-app tasks immediately. top-app
           tasks are the tasks the user is currently interacting with, hence
           the most important tasks in the system.
        2. They don't run on a power hungry core and drain battery even if they
           are CPU intensive tasks.

.. note::
  **little cores**:
    CPUs with capacity < 1024

  **big cores**:
    CPUs with capacity = 1024

By making these uclamp performance requests, or rather hints, user space can
ensure system resources are used optimally to deliver the best possible user
experience.

Another use case is to help with **overcoming the ramp up latency inherent in
how scheduler utilization signal is calculated**.

A busy task, for instance, that needs to run at the maximum performance point
will suffer a delay of ~200ms (PELT HALFLIFE = 32ms) for the scheduler to
realize that. This is known to affect workloads like gaming on mobile devices
where frames will drop due to slow response time to select the higher frequency
required for the tasks to finish their work in time. Setting UCLAMP_MIN=1024
will ensure such tasks will always see the highest performance level when they
start running.
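
The ramp up delay quoted above can be approximated with a rough model. The
sketch below is only an illustration: ``pelt_ramp_up()`` is a hypothetical
helper, not a kernel function, and it assumes that one PELT period is about
1ms and that the signal of a continuously running task follows a geometric
series whose weight halves every 32 periods.

```c
#include <assert.h>

/*
 * Hypothetical helper, not a kernel function: approximate the utilization
 * of a task that has been running continuously for 'ms' milliseconds,
 * starting from zero. Assumes one PELT period is roughly 1ms and that the
 * contribution of past periods halves every 32 periods (y^32 = 0.5).
 */
double pelt_ramp_up(unsigned int ms)
{
    const double y = 0.97857206;    /* approximately 0.5^(1/32) */
    const double max_util = 1024.0;
    double util = 0.0;

    assert(ms < 100000);    /* keep the loop bounded in this sketch */

    /* Each period decays the old signal and adds a fully busy period. */
    for (unsigned int i = 0; i < ms; i++)
        util = util * y + (1.0 - y) * max_util;

    return util;
}
```

Under these assumptions the signal reaches about half of the range after 32ms
and only gets within a few percent of 1024 after roughly 200ms, which is the
latency a UCLAMP_MIN request avoids.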

The overall visible effect goes beyond better perceived user
experience/performance and stretches to help achieve a better overall
performance/watt if used effectively.

User space can form a feedback loop with the thermal subsystem too to ensure
the device doesn't heat up to the point where it will throttle.

Both SCHED_NORMAL/OTHER and SCHED_FIFO/RR honour uclamp requests/hints.

In the SCHED_FIFO/RR case, uclamp gives the option to run RT tasks at any
performance point rather than being tied to MAX frequency all the time, which
can be useful on general purpose systems that run on battery powered devices.

Note that by design RT tasks don't have a per-task PELT signal and must always
run at a constant frequency to combat non-deterministic DVFS ramp up delays.

Note that using schedutil always implies a single delay to modify the frequency
when an RT task wakes up. This cost is unchanged by using uclamp. Uclamp only
helps picking what frequency to request instead of schedutil always requesting
MAX for all RT tasks.

See :ref:`section 3.4 <uclamp-default-values>` for default values and
:ref:`3.4.1 <sched-util-clamp-min-rt-default>` on how to change RT tasks
default value.

2. Design
=========

Util clamp is a property of every task in the system. It sets the boundaries of
its utilization signal; acting as a bias mechanism that influences certain
decisions within the scheduler.

The actual utilization signal of a task is never clamped in reality. If you
inspect PELT signals at any point of time you should continue to see them
intact. Clamping happens only when needed, e.g: when a task wakes up and the
scheduler needs to select a suitable CPU for it to run on.

Since the goal of util clamp is to allow requesting a minimum and maximum
performance point for a task to run on, it must be able to influence the
frequency selection as well as task placement to be most effective. Both of
which have implications on the utilization value at CPU runqueue (rq for short)
level, which brings us to the main design challenge.

When a task wakes up on an rq, the utilization signal of the rq will be
affected by the uclamp settings of all the tasks enqueued on it. For example if
a task requests to run at UCLAMP_MIN = 512, then the util signal of the rq
needs to respect this request as well as all other requests from all of the
enqueued tasks.

To be able to aggregate the util clamp value of all the tasks attached to the
rq, uclamp must do some housekeeping at every enqueue/dequeue, which is the
scheduler hot path. Hence care must be taken since any slow down will have
significant impact on a lot of use cases and could hinder its usability in
practice.

The way this is handled is by dividing the utilization range into buckets
(struct uclamp_bucket) which allows us to reduce the search space from every
task on the rq to only a subset of tasks on the top-most bucket.

When a task is enqueued, the counter in the matching bucket is incremented,
and on dequeue it is decremented. This makes keeping track of the effective
uclamp value at rq level a lot easier.

As tasks are enqueued and dequeued, we keep track of the current effective
uclamp value of the rq. See :ref:`section 2.1 <uclamp-buckets>` for details on
how this works.

Later, any path that wants to identify the effective uclamp value of the rq
simply needs to read this effective uclamp value of the rq at the exact moment
it needs to take a decision.

For the task placement case, only Energy Aware and Capacity Aware Scheduling
(EAS/CAS) make use of uclamp for now, which implies that it is applied on
heterogeneous systems only.
When a task wakes up, the scheduler will look at the current effective uclamp
value of every rq and compare it with the potential new value if the task were
to be enqueued there, favoring the rq that will end up with the most energy
efficient combination.

Similarly in schedutil, when it needs to make a frequency update it will look
at the current effective uclamp value of the rq which is influenced by the set
of tasks currently enqueued there and select the appropriate frequency that
will satisfy constraints from requests.
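
How a decision path consumes the rq level values can be sketched as below. This
is a simplified illustration and not the kernel implementation:
``clamp_util()`` is a hypothetical name, and letting the minimum win when the
two bounds are inverted is an assumption of this sketch.

```c
#include <assert.h>

/*
 * Hypothetical helper: apply the effective rq clamps to a utilization
 * value at the moment a decision is taken. The PELT signal itself is
 * left untouched; only the value used for the decision is clamped.
 */
unsigned int clamp_util(unsigned int util,
                        unsigned int rq_min, unsigned int rq_max)
{
    assert(rq_min <= 1024 && rq_max <= 1024);

    /* Assumption of this sketch: an inverted range resolves to rq_min. */
    if (rq_min >= rq_max)
        return rq_min;
    if (util < rq_min)
        return rq_min;    /* boosted */
    if (util > rq_max)
        return rq_max;    /* capped */
    return util;
}
```

A task placement or frequency selection path would then work with the returned
value instead of the raw utilization.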

Other paths like setting overutilization state (which effectively disables EAS)
make use of uclamp as well. Such cases are considered necessary housekeeping to
allow the 2 main use cases above and will not be covered in detail here as they
could change with implementation details.

.. _uclamp-buckets:

2.1. Buckets
------------

::

                           [struct rq]

  (bottom)                                                    (top)

    0                                                          1024
    |                                                           |
    +-----------+-----------+-----------+----   ----+-----------+
    |  Bucket 0 |  Bucket 1 |  Bucket 2 |    ...    |  Bucket N |
    +-----------+-----------+-----------+----   ----+-----------+
       :           :                                   :
       +- p0       +- p3                               +- p4
       :                                               :
       +- p1                                           +- p5
       :
       +- p2


.. note::
  The diagram above is an illustration rather than a true depiction of the
  internal data structure.

To reduce the search space when trying to decide the effective uclamp value of
an rq as tasks are enqueued/dequeued, the whole utilization range is divided
into N buckets where N is configured at compile time by setting
CONFIG_UCLAMP_BUCKETS_COUNT. By default it is set to 5.

The rq has a bucket for each uclamp_id tunable: [UCLAMP_MIN, UCLAMP_MAX].

The range of each bucket is 1024/N. For example, for the default value of
5 there will be 5 buckets, each of which will cover the following range:

::

        DELTA = round_closest(1024/5) = round_closest(204.8) = 205

        Bucket 0: [0:204]
        Bucket 1: [205:409]
        Bucket 2: [410:614]
        Bucket 3: [615:819]
        Bucket 4: [820:1024]

When a task p with the following tunable parameters

::

        p->uclamp[UCLAMP_MIN] = 300
        p->uclamp[UCLAMP_MAX] = 1024

is enqueued into the rq, bucket 1 will be incremented for UCLAMP_MIN and bucket
4 will be incremented for UCLAMP_MAX to reflect the fact the rq has a task in
this range.

The rq then keeps track of its current effective uclamp value for each
uclamp_id.
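
The mapping from a clamp value to a bucket described above can be sketched as
follows. The helper names ``bucket_delta()`` and ``bucket_id()`` are
illustrative and the code is a minimal sketch, not a copy of the scheduler
implementation. It assumes the last bucket absorbs any remainder, as the
[820:1024] range above suggests.

```c
#include <assert.h>

#define SCHED_CAPACITY 1024U    /* top of the utilization range */

/* Width of one bucket: 1024 / nr_buckets, rounded to the closest integer. */
unsigned int bucket_delta(unsigned int nr_buckets)
{
    assert(nr_buckets > 0 && nr_buckets <= SCHED_CAPACITY);
    return (SCHED_CAPACITY + nr_buckets / 2) / nr_buckets;
}

/* Index of the bucket that accounts for a given clamp value. */
unsigned int bucket_id(unsigned int clamp_value, unsigned int nr_buckets)
{
    unsigned int id = clamp_value / bucket_delta(nr_buckets);

    /* The last bucket also covers whatever is left up to 1024. */
    return id < nr_buckets ? id : nr_buckets - 1;
}
```

With 5 buckets this reproduces the example above: a UCLAMP_MIN of 300 lands in
bucket 1 and a UCLAMP_MAX of 1024 lands in bucket 4.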

When a task p is enqueued, the rq value changes to:

::

        // update bucket logic goes here
        rq->uclamp[UCLAMP_MIN] = max(rq->uclamp[UCLAMP_MIN], p->uclamp[UCLAMP_MIN])
        // repeat for UCLAMP_MAX

Similarly, when p is dequeued the rq value changes to:

::

        // update bucket logic goes here
        rq->uclamp[UCLAMP_MIN] = search_top_bucket_for_highest_value()
        // repeat for UCLAMP_MAX

When all buckets are empty, the rq uclamp values are reset to system defaults.
See :ref:`section 3.4 <uclamp-default-values>` for details on default values.


2.2. Max aggregation
--------------------

Util clamp is tuned to honour the request for the task that requires the
highest performance point.

When multiple tasks are attached to the same rq, then util clamp must make sure
the task that needs the highest performance point gets it even if there's
another task that doesn't need it or is disallowed from reaching this point.
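
The enqueue/dequeue bookkeeping and the max aggregation rule can be put
together in a small user space model. Everything below is a hedged sketch with
hypothetical names (``struct rq_clamp``, ``rq_clamp_enqueue()`` and so on); it
tracks a single uclamp_id, uses the default of 5 buckets, and for simplicity
rescans the buckets on enqueue as well as on dequeue.

```c
#include <assert.h>

#define NR_BUCKETS   5
#define CAPACITY     1024U
#define BUCKET_DELTA ((CAPACITY + NR_BUCKETS / 2) / NR_BUCKETS)

struct bucket {
    unsigned int tasks;    /* number of enqueued tasks in this bucket */
    unsigned int value;    /* highest clamp value seen in this bucket */
};

/* Hypothetical per-rq state for one uclamp_id (UCLAMP_MIN or UCLAMP_MAX). */
struct rq_clamp {
    struct bucket bucket[NR_BUCKETS];
    unsigned int value;    /* current effective rq value */
    unsigned int def;      /* value used when no task is enqueued */
};

unsigned int bucket_of(unsigned int v)
{
    unsigned int id = v / BUCKET_DELTA;

    return id < NR_BUCKETS ? id : NR_BUCKETS - 1;
}

/* The effective value comes from the top-most non-empty bucket. */
void rq_clamp_refresh(struct rq_clamp *rq)
{
    for (int i = NR_BUCKETS - 1; i >= 0; i--) {
        if (rq->bucket[i].tasks) {
            rq->value = rq->bucket[i].value;
            return;
        }
    }
    rq->value = rq->def;    /* all buckets empty: back to the default */
}

void rq_clamp_init(struct rq_clamp *rq, unsigned int def)
{
    for (int i = 0; i < NR_BUCKETS; i++) {
        rq->bucket[i].tasks = 0;
        rq->bucket[i].value = 0;
    }
    rq->def = def;
    rq->value = def;
}

void rq_clamp_enqueue(struct rq_clamp *rq, unsigned int v)
{
    struct bucket *b = &rq->bucket[bucket_of(v)];

    assert(v <= CAPACITY);
    if (b->tasks == 0 || v > b->value)
        b->value = v;
    b->tasks++;
    rq_clamp_refresh(rq);
}

void rq_clamp_dequeue(struct rq_clamp *rq, unsigned int v)
{
    struct bucket *b = &rq->bucket[bucket_of(v)];

    assert(b->tasks > 0);
    b->tasks--;
    rq_clamp_refresh(rq);
}
```

Note that a bucket only remembers the highest value it has seen, so tasks that
share a bucket are treated as if they all requested that value for as long as
the bucket is non-empty. This loss of precision is the price of keeping the
hot path cheap in this model.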

For example, if there are multiple tasks attached to an rq with the following
values:

::

        p0->uclamp[UCLAMP_MIN] = 300
        p0->uclamp[UCLAMP_MAX] = 900

        p1->uclamp[UCLAMP_MIN] = 500
        p1->uclamp[UCLAMP_MAX] = 500

then assuming both p0 and p1 are enqueued to the same rq, both UCLAMP_MIN
and UCLAMP_MAX become:

::

        rq->uclamp[UCLAMP_MIN] = max(300, 500) = 500
        rq->uclamp[UCLAMP_MAX] = max(900, 500) = 900

As we shall see in :ref:`section 5.1 <uclamp-capping-fail>`, this max
aggregation is the cause of one of the limitations when using util clamp, in
particular for the UCLAMP_MAX hint when user space would like to save power.

2.3. Hierarchical aggregation
-----------------------------

As stated earlier, util clamp is a property of every task in the system. But
the actual applied (effective) value can be influenced by more than just the
request made by the task or another actor on its behalf (middleware library).

The effective util clamp value of any task is restricted as follows:

  1. By the uclamp settings defined by the cgroup CPU controller it is attached
     to, if any.
  2. The restricted value in (1) is then further restricted by the system wide
     uclamp settings.

:ref:`Section 3 <uclamp-interfaces>` discusses the interfaces and will expand
further on that.

For now suffice to say that if a task makes a request, its actual effective
value will have to adhere to some restrictions imposed by cgroup and system
wide settings.

The system will still accept the request even if it is effectively beyond the
constraints, but as soon as the task moves to a different cgroup or a sysadmin
modifies the system settings, the request will be satisfied only if it is
within the new constraints.

In other words, this aggregation will not cause an error when a task changes
its uclamp values, but rather the system may not be able to satisfy requests
based on those factors.

2.4. Range
----------

Uclamp performance request has the range of 0 to 1024 inclusive.

For the cgroup interface a percentage is used (that is 0 to 100 inclusive).
Just like other cgroup interfaces, you can use 'max' instead of 100.

.. _uclamp-interfaces:

3. Interfaces
=============

3.1. Per task interface
-----------------------

sched_setattr() syscall was extended to accept two new fields:

* sched_util_min: requests the minimum performance point the system should run
  at when this task is running. Or lower performance bound.
* sched_util_max: requests the maximum performance point the system should run
  at when this task is running. Or upper performance bound.

For example, the following scenario has 40% to 80% utilization constraints:

::

        attr->sched_util_min = 40% * 1024;
        attr->sched_util_max = 80% * 1024;

When task @p is running, **the scheduler should try its best to ensure it
starts at 40% performance level**. If the task runs for a long enough time so
that its actual utilization goes above 80%, the utilization, or performance
level, will be capped.

The special value -1 is used to reset the uclamp settings to the system
default.
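
A user space request for the 40% to 80% scenario above could look like the
sketch below. It is an illustration under stated assumptions: the structure is
a local mirror of the kernel's ``struct sched_attr`` under a different name,
``uclamp_fill_attr()`` and ``uclamp_apply()`` are hypothetical helpers, the
flag values are copied from the kernel UAPI header ``<linux/sched.h>``, and the
raw syscall is used rather than relying on a libc wrapper being available.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#ifdef __linux__
#include <sys/syscall.h>
#include <unistd.h>
#endif

/* Values of the corresponding SCHED_FLAG_* macros in <linux/sched.h>. */
#define FLAG_KEEP_ALL        (0x08 | 0x10)    /* KEEP_POLICY | KEEP_PARAMS */
#define FLAG_UTIL_CLAMP_MIN  0x20
#define FLAG_UTIL_CLAMP_MAX  0x40

/* Local mirror of struct sched_attr, renamed to avoid header clashes. */
struct uclamp_sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
    uint32_t sched_util_min;
    uint32_t sched_util_max;
};

/* Hypothetical helper: describe a clamp request given in percent. */
void uclamp_fill_attr(struct uclamp_sched_attr *attr,
                      unsigned int min_pct, unsigned int max_pct)
{
    assert(min_pct <= max_pct && max_pct <= 100);

    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    /* Keep policy and priority, change only the two clamp values. */
    attr->sched_flags = FLAG_KEEP_ALL | FLAG_UTIL_CLAMP_MIN |
                        FLAG_UTIL_CLAMP_MAX;
    attr->sched_util_min = min_pct * 1024 / 100;
    attr->sched_util_max = max_pct * 1024 / 100;
}

/* Hypothetical helper: returns 0 on success, -1 on failure. */
long uclamp_apply(int pid, unsigned int min_pct, unsigned int max_pct)
{
    struct uclamp_sched_attr attr;

    uclamp_fill_attr(&attr, min_pct, max_pct);
#if defined(__linux__) && defined(SYS_sched_setattr)
    return syscall(SYS_sched_setattr, pid, &attr, 0);
#else
    (void)pid;
    return -1;    /* not supported on this platform */
#endif
}
```

For instance ``uclamp_apply(0, 40, 80)`` would request the constraints above
for the calling thread. The call is expected to fail on kernels built without
uclamp support.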

Note that resetting the uclamp value to system default using -1 is not the same
as manually setting uclamp value to system default. This distinction is
important because as we shall see in system interfaces, the default value for
RT could be changed. SCHED_NORMAL/OTHER might gain similar knobs too in the
future.

3.2. cgroup interface
---------------------

There are two uclamp related values in the CPU cgroup controller:

* cpu.uclamp.min
* cpu.uclamp.max

When a task is attached to a CPU controller, its uclamp values will be impacted
as follows:

* cpu.uclamp.min is a protection as described in :ref:`section 3-3 of cgroup
  v2 documentation <cgroupv2-protections-distributor>`.

  If a task uclamp_min value is lower than cpu.uclamp.min, then the task will
  inherit the cgroup cpu.uclamp.min value.

  In a cgroup hierarchy, effective cpu.uclamp.min is the max of (child,
  parent).

* cpu.uclamp.max is a limit as described in :ref:`section 3-2 of cgroup v2
  documentation <cgroupv2-limits-distributor>`.

  If a task uclamp_max value is higher than cpu.uclamp.max, then the task will
  inherit the cgroup cpu.uclamp.max value.

  In a cgroup hierarchy, effective cpu.uclamp.max is the min of (child,
  parent).

For example, given the following parameters:

::

        p0->uclamp[UCLAMP_MIN] = // system default;
        p0->uclamp[UCLAMP_MAX] = // system default;

        p1->uclamp[UCLAMP_MIN] = 40% * 1024;
        p1->uclamp[UCLAMP_MAX] = 50% * 1024;

        cgroup0->cpu.uclamp.min = 20% * 1024;
        cgroup0->cpu.uclamp.max = 60% * 1024;

        cgroup1->cpu.uclamp.min = 60% * 1024;
        cgroup1->cpu.uclamp.max = 100% * 1024;

when p0 and p1 are attached to cgroup0, the values become:

::

        p0->uclamp[UCLAMP_MIN] = cgroup0->cpu.uclamp.min = 20% * 1024;
        p0->uclamp[UCLAMP_MAX] = cgroup0->cpu.uclamp.max = 60% * 1024;

        p1->uclamp[UCLAMP_MIN] = 40% * 1024; // intact
        p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact

when p0 and p1 are attached to cgroup1, these instead become:

::

        p0->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024;
        p0->uclamp[UCLAMP_MAX] = cgroup1->cpu.uclamp.max = 100% * 1024;

        p1->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024;
        p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact

Note that cgroup interfaces allow the cpu.uclamp.max value to be lower than
cpu.uclamp.min. Other interfaces don't allow that.

3.3. System interface
---------------------

3.3.1 sched_util_clamp_min
--------------------------

System wide limit of allowed UCLAMP_MIN range. By default it is set to 1024,
which means that permitted effective UCLAMP_MIN range for tasks is [0:1024].
By changing it to 512 for example the range reduces to [0:512]. This is useful
to restrict how much boosting tasks are allowed to acquire.

Requests from tasks to go above this knob value will still succeed, but
they won't be satisfied until the knob value is more than
p->uclamp[UCLAMP_MIN].

The value must be smaller than or equal to sched_util_clamp_max.

3.3.2 sched_util_clamp_max
--------------------------
System wide limit of allowed UCLAMP_MAX range. By default it is set to 1024,
which means that the permitted effective UCLAMP_MAX range for tasks is
[0:1024].

By changing it to 512, for example, the effective allowed range reduces to
[0:512]. This means that no task can run above 512, which implies that all
rqs are restricted too. IOW, the whole system is capped to half its performance
capacity.

This is useful to restrict the overall maximum performance point of the system.
For example, it can be handy to limit performance when running low on battery
or when the system wants to limit access to more energy hungry performance
levels when it's in idle state or the screen is off.

Requests from tasks to go above this knob value will still succeed, but they
won't be satisfied until the knob value is more than p->uclamp[UCLAMP_MAX].

The value must be greater than or equal to sched_util_clamp_min.

.. _uclamp-default-values:

3.4. Default values
-------------------

By default all SCHED_NORMAL/SCHED_OTHER tasks are initialized to:

::

        p_fair->uclamp[UCLAMP_MIN] = 0
        p_fair->uclamp[UCLAMP_MAX] = 1024

That is, by default they're neither boosted nor capped. These defaults cannot
be changed at boot or runtime. No argument has been made yet as to why such a
knob should be provided, but it can be added in the future.

For SCHED_FIFO/SCHED_RR tasks:

::

        p_rt->uclamp[UCLAMP_MIN] = 1024
        p_rt->uclamp[UCLAMP_MAX] = 1024

That is, by default they're boosted to run at the maximum performance point of
the system, which retains the historical behavior of the RT tasks.

The default uclamp_min value of RT tasks can be modified at boot or runtime
via sysctl. See the section below.

.. _sched-util-clamp-min-rt-default:

3.4.1 sched_util_clamp_min_rt_default
-------------------------------------

Running RT tasks at maximum performance point is expensive on battery powered
devices and not necessary.
To allow system developers to offer good performance guarantees for these
tasks without pushing them all the way to maximum performance point, this
sysctl knob allows tuning the best boost value to address the system
requirement without burning power running at maximum performance point all
the time.

Application developers are encouraged to use the per task util clamp interface
to ensure they are performance and power aware. Ideally this knob should be set
to 0 by system designers, leaving the task of managing performance
requirements to the apps.

4. How to use util clamp
========================

Util clamp promotes the concept of user space assisted power and performance
management. At the scheduler level, the information required to make the best
decision is not available. However, with util clamp user space can hint to the
scheduler to make better decisions about task placement and frequency
selection.

Best results are achieved by not making any assumptions about the system the
application is running on and by using util clamp in conjunction with a
feedback loop to dynamically monitor and adjust.
Ultimately this will allow for a better user experience at a better perf/watt.

For some systems and use cases, static setup will help to achieve good results.
Portability will be a problem in this case. How much work one can do at 100,
200 or 1024 is different for each system. Unless there's a specific target
system, static setup should be avoided.

There are enough possibilities to create a whole framework based on util clamp
or a self-contained app that makes use of it directly.

4.1. Boost important and DVFS-latency-sensitive tasks
-----------------------------------------------------

A GUI task might not be busy enough to warrant driving the frequency high when
it wakes up. However, it needs to finish its work within a specific time window
to deliver the desired user experience. The right frequency it requires at
wakeup will be system dependent. On some underpowered systems it will be high,
on other overpowered ones it will be low or 0.

This task can increase its UCLAMP_MIN value every time it misses the deadline
to ensure that on the next wake up it runs at a higher performance point.
It should try to approach the lowest UCLAMP_MIN value that allows it to meet
its deadline on any particular system to achieve the best possible perf/watt
for that system.

On heterogeneous systems, it might be important for this task to run on
a faster CPU.

**Generally it is advised to perceive the input as performance level or point
which will imply both task placement and frequency selection**.

4.2. Cap background tasks
-------------------------

As explained for the Android case in the introduction, any app can lower
UCLAMP_MAX for some background tasks that don't care about performance but
could end up being busy and consuming unnecessary system resources.

4.3. Powersave mode
-------------------

The sched_util_clamp_max system wide interface can be used to limit all tasks
from operating at the higher performance points which are usually energy
inefficient.

This is not unique to uclamp as one can achieve the same by reducing max
frequency of the cpufreq governor. It can be considered a more convenient
alternative interface.

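As a minimal sketch of such a powersave toggle from user space, the C helpers
below write clamp values to the sysctl files. The helper names
(``write_clamp``, ``enter_powersave``, ``leave_powersave``) and the choice of
cap are hypothetical and only for illustration; the sysctl paths assume a
kernel built with uclamp support and a caller with sufficient privileges.

```c
#include <stdio.h>

#define UCLAMP_SCALE    1024u
#define CLAMP_MIN_PATH  "/proc/sys/kernel/sched_util_clamp_min"
#define CLAMP_MAX_PATH  "/proc/sys/kernel/sched_util_clamp_max"

/* Hypothetical helper: write one clamp value in [0:1024] to a sysctl file. */
static int write_clamp(const char *path, unsigned int value)
{
        FILE *f;
        int ret = 0;

        if (value > UCLAMP_SCALE)
                return -1;

        f = fopen(path, "w");
        if (!f)
                return -1;

        if (fprintf(f, "%u\n", value) < 0)
                ret = -1;
        /* A value rejected by the kernel shows up when the write is flushed. */
        if (fclose(f) != 0)
                ret = -1;

        return ret;
}

/*
 * Hypothetical policy: cap the whole system at @cap while in powersave.
 * sched_util_clamp_min must stay <= sched_util_clamp_max, so min is
 * lowered first when entering and max is raised first when leaving.
 */
static int enter_powersave(const char *min_path, const char *max_path,
                           unsigned int cap)
{
        if (write_clamp(min_path, cap))
                return -1;
        return write_clamp(max_path, cap);
}

static int leave_powersave(const char *min_path, const char *max_path)
{
        if (write_clamp(max_path, UCLAMP_SCALE))
                return -1;
        return write_clamp(min_path, UCLAMP_SCALE);
}
```

A caller could use ``enter_powersave(CLAMP_MIN_PATH, CLAMP_MAX_PATH, 512)`` to
cap the system to half its capacity. The write order matters because
sched_util_clamp_min defaults to 1024 and sched_util_clamp_max must be greater
than or equal to it.
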
4.4. Per-app performance restriction
------------------------------------

Middleware/Utility can provide the user an option to set UCLAMP_MIN/MAX for an
app every time it is executed to guarantee a minimum performance point and/or
limit it from draining system power at the cost of reduced performance for
these apps.

If you want to prevent your laptop from heating up while on the go from
compiling the kernel and are happy to sacrifice performance to save power, but
still would like to keep your browser performance intact, uclamp makes it
possible.

5. Limitations
==============

.. _uclamp-capping-fail:

5.1. Capping frequency with uclamp_max fails under certain conditions
---------------------------------------------------------------------

If task p0 is capped to run at 512:

::

        p0->uclamp[UCLAMP_MAX] = 512

and it shares the rq with p1 which is free to run at any performance point:

::

        p1->uclamp[UCLAMP_MAX] = 1024

then due to max aggregation the rq will be allowed to reach max performance
point:

::

        rq->uclamp[UCLAMP_MAX] = max(512, 1024) = 1024

Assuming both p0 and p1 have UCLAMP_MIN = 0, then the frequency selection for
the rq will depend on the actual utilization value of the tasks.

If p1 is a small task but p0 is a CPU intensive task, then due to the fact that
both are running at the same rq, p1 will cause the frequency capping to be
lifted from the rq although p1, which is allowed to run at any performance
point, doesn't actually need to run at that frequency.

5.2. UCLAMP_MAX can break PELT (util_avg) signal
------------------------------------------------

PELT assumes that frequency will always increase as the signals grow to ensure
there's always some idle time on the CPU. But with UCLAMP_MAX, this frequency
increase will be prevented which can lead to no idle time in some
circumstances. When there's no idle time, a task will be stuck in a busy loop,
which would result in util_avg being 1024.

Combined with the max aggregation issue described above, this can lead to
unwanted frequency spikes when severely capped tasks share the rq with a small
non capped task.

As an example, if task p0, which has:

::

        p0->util_avg = 300
        p0->uclamp[UCLAMP_MAX] = 0

wakes up on an idle CPU, then it will run at the min frequency (Fmin) this
CPU is capable of. The max CPU frequency (Fmax) matters here as well,
since it designates the shortest computational time to finish the task's
work on this CPU.

::

        rq->uclamp[UCLAMP_MAX] = 0

If the ratio of Fmax/Fmin is 3, then the maximum value will be:

::

        300 * (Fmax/Fmin) = 900

which indicates the CPU will still see idle time since 900 is < 1024. The
_actual_ util_avg will not be 900 though, but somewhere between 300 and 900. As
long as there's idle time, p->util_avg updates will be off by some margin,
but not proportional to Fmax/Fmin.

::

        p0->util_avg = 300 + small_error

Now if the ratio of Fmax/Fmin is 4, the maximum value becomes:

::

        300 * (Fmax/Fmin) = 1200

which is higher than 1024 and indicates that the CPU has no idle time. When
this happens, then the _actual_ util_avg will become:

::

        p0->util_avg = 1024

If task p1 wakes up on this CPU, which has:

::

        p1->util_avg = 200
        p1->uclamp[UCLAMP_MAX] = 1024

then the effective UCLAMP_MAX for the CPU will be 1024 according to the max
aggregation rule.
But since the capped p0 task was running and throttled severely, the
rq->util_avg will be:

::

        p0->util_avg = 1024
        p1->util_avg = 200

        rq->util_avg = 1024
        rq->uclamp[UCLAMP_MAX] = 1024

This leads to a frequency spike, since if p0 wasn't throttled we should get:

::

        p0->util_avg = 300
        p1->util_avg = 200

        rq->util_avg = 500

and run somewhere near the mid performance point of that CPU, not the Fmax we
get.

5.3. Schedutil response time issues
-----------------------------------

schedutil has three limitations:

        1. Hardware takes non-zero time to respond to any frequency change
           request. On some platforms this can be on the order of a few ms.
        2. Non fast-switch systems require a worker deadline thread to wake up
           and perform the frequency change, which adds measurable overhead.
        3. schedutil rate_limit_us drops any requests during this rate_limit_us
           window.

If a relatively small task is doing a critical job and requires a certain
performance point when it wakes up and starts running, then all these
limitations will prevent it from getting what it wants in the time scale it
expects.

This limitation is not specific to uclamp, but it will be more prevalent with
it since we no longer gradually ramp up or down. We could easily be
jumping between frequencies depending on the order tasks wake up, and their
respective uclamp values.

We regard that as a limitation of the capabilities of the underlying system
itself.

There is room to improve the behavior of schedutil rate_limit_us, but not much
to be done for 1 or 2. They are considered hard limitations of the system.

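The effect of the rate_limit_us window on a boosted task can be illustrated
with a deliberately simplified model. The C sketch below is not schedutil's
code. The ``freq_limiter`` structure and ``request_perf()`` function are
hypothetical names, and the model only assumes that a request arriving less
than rate_limit_us after the last accepted change is dropped.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of a rate-limited frequency governor. */
struct freq_limiter {
        uint64_t rate_limit_us;   /* minimum spacing between two changes */
        uint64_t last_update_us;  /* time of the last accepted request */
        unsigned int cur_perf;    /* performance point currently applied */
};

/*
 * Returns true if the request is applied, false if it is dropped because
 * it arrived inside the rate limit window.
 */
static bool request_perf(struct freq_limiter *fl, uint64_t now_us,
                         unsigned int perf)
{
        if (now_us - fl->last_update_us < fl->rate_limit_us)
                return false;

        fl->last_update_us = now_us;
        fl->cur_perf = perf;
        return true;
}
```

In this model, a task with a high UCLAMP_MIN that wakes up shortly after
another frequency change keeps running at the old performance point until the
window has elapsed and a new request is made.
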