.. SPDX-License-Identifier: GPL-2.0

====================
Utilization Clamping
====================

1. Introduction
===============

Utilization clamping, also known as util clamp
feature that allows user space to help in mana
of tasks. It was introduced in v5.3 release. T
v5.4.

Uclamp is a hinting mechanism that allows the
performance requirements and restrictions of t
scheduler to make a better decision. And when
used, util clamp will influence the CPU freque

Since the scheduler and schedutil are both dri
util clamp acts on that to achieve its goal by
point; hence the name. That is, by clamping ut
system run at a certain performance point.

The right way to view util clamp is as a mecha
performance constraints. It consists of two tu

        * UCLAMP_MIN, which sets the lower bou
        * UCLAMP_MAX, which sets the upper bou

These two bounds will ensure a task will opera
of the system. UCLAMP_MIN implies boosting a t
capping a task.

One can tell the system (scheduler) that some
performance point to operate at to deliver the
can tell the system that some tasks should be
much resources and should not go above a speci
the uclamp values as performance points rather
abstraction from user space point of view.

As an example, a game can use util clamp to fo
perceived Frames Per Second (FPS). It can dyna
performance point required by its display pipe
dropped. It can also dynamically 'prime' up th
coming few hundred milliseconds a computationa
happen.

On mobile hardware where the capability of the
dynamic feedback loop offers a great flexibili
given the capabilities of any system.

Of course a static configuration is possible t
on the system, application and the desired out

Another example is in Android where tasks are
foreground, top-app, etc. Util clamp can be us
resources background tasks are consuming by ca
can run at.
This constraint helps reserve reso
the ones belonging to the currently active app
helps in limiting how much power they consume.
heterogeneous systems (e.g. Arm big.LITTLE); t
background tasks to stay on the little cores w

        1. The big cores are free to run top-a
           tasks are the tasks the user is cur
           the most important tasks in the sys
        2. They don't run on a power hungry co
           are CPU intensive tasks.

.. note::
  **little cores**:
    CPUs with capacity < 1024

  **big cores**:
    CPUs with capacity = 1024

By making these uclamp performance requests, o
ensure system resources are used optimally to
experience.

Another use case is to help with **overcoming
how scheduler utilization signal is calculated

On the other hand, a busy task for instance th
performance point will suffer a delay of ~200m
scheduler to realize that. This is known to af
mobile devices where frames will drop due to s
higher frequency required for the tasks to fin
UCLAMP_MIN=1024 will ensure such tasks will al
level when they start running.

The overall visible effect goes beyond better
experience/performance and stretches to help a
performance/watt if used effectively.

User space can form a feedback loop with the t
the device doesn't heat up to the point where

Both SCHED_NORMAL/OTHER and SCHED_FIFO/RR hono

In the SCHED_FIFO/RR case, uclamp gives the op
performance point rather than being tied to MA
can be useful on general purpose systems that

Note that by design RT tasks don't have per-ta
run at a constant frequency to combat undeterm

Note that using schedutil always implies a sin
when an RT task wakes up. This cost is unchang
helps picking what frequency to request instea
MAX for all RT tasks.

See :ref:`section 3.4 <uclamp-default-values>`
:ref:`3.4.1 <sched-util-clamp-min-rt-default>`
default value.

2. Design
=========

Util clamp is a property of every task in the
its utilization signal; acting as a bias mecha
decisions within the scheduler.

The actual utilization signal of a task is nev
inspect PELT signals at any point of time you
they are intact. Clamping happens only when ne
and the scheduler needs to select a suitable C

Since the goal of util clamp is to allow reque
performance point for a task to run on, it mus
frequency selection as well as task placement
which have implications on the utilization val
level, which brings us to the main design chal

When a task wakes up on an rq, the utilization
affected by the uclamp settings of all the tas
a task requests to run at UTIL_MIN = 512, then
to respect to this request as well as all othe
enqueued tasks.

To be able to aggregate the util clamp value o
rq, uclamp must do some housekeeping at every
scheduler hot path. Hence care must be taken s
significant impact on a lot of use cases and c
practice.

The way this is handled is by dividing the uti
(struct uclamp_bucket) which allows us to redu
task on the rq to only a subset of tasks on th

When a task is enqueued, the counter in the ma
and on dequeue it is decremented. This makes k
uclamp value at rq level a lot easier.

As tasks are enqueued and dequeued, we keep tr
uclamp value of the rq. See :ref:`section 2.1
how this works.

Later at any path that wants to identify the e
it will simply need to read this effective ucl
moment of time it needs to take a decision.

For task placement case, only Energy Aware and
(EAS/CAS) make use of uclamp for now, which im
heterogeneous systems only.
When a task wakes up, the scheduler will look
value of every rq and compare it with the pote
to be enqueued there. Favoring the rq that wil
efficient combination.

Similarly in schedutil, when it needs to make
at the current effective uclamp value of the r
of tasks currently enqueued there and select t
will satisfy constraints from requests.

Other paths like setting overutilization state
make use of uclamp as well. Such cases are con
allow the 2 main use cases above and will not
could change with implementation details.

.. _uclamp-buckets:

2.1. Buckets
------------

::

                           [struct rq]

  (bottom)

    0
    |
    +-----------+-----------+-----------+----
    |  Bucket 0 |  Bucket 1 |  Bucket 2 |    .
    +-----------+-----------+-----------+----
       :           :
       +- p0       +- p3
       :
       +- p1
       :
       +- p2


.. note::
  The diagram above is an illustration rather
  internal data structure.

To reduce the search space when trying to deci
an rq as tasks are enqueued/dequeued, the whol
into N buckets where N is configured at compil
CONFIG_UCLAMP_BUCKETS_COUNT. By default it is

The rq has a bucket for each uclamp_id tunable

The range of each bucket is 1024/N. For exampl
5 there will be 5 buckets, each of which will

::

        DELTA = round_closest(1024/5) = round_closest(204.8) = 205

        Bucket 0: [0:204]
        Bucket 1: [205:409]
        Bucket 2: [410:614]
        Bucket 3: [615:819]
        Bucket 4: [820:1024]

When a task p with following tunable parameter

::

        p->uclamp[UCLAMP_MIN] = 300
        p->uclamp[UCLAMP_MAX] = 1024

is enqueued into the rq, bucket 1 will be incr
4 will be incremented for UCLAMP_MAX to reflec
this range.

The rq then keeps track of its current effecti
uclamp_id.
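
The mapping from a clamp value to its bucket can be sketched in a few lines of
C. This is a minimal illustration rather than a copy of the scheduler code: the
macro and function names are chosen for this example, N is fixed to 5 to match
the ranges listed above, and the integer rounding is an assumption about how
``round_closest`` behaves.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024u
#define UCLAMP_BUCKETS 5u /* example value of N, as in the text above */

/* Width of one bucket: 1024/N rounded to the closest integer (205 for N = 5). */
#define UCLAMP_BUCKET_DELTA \
	((SCHED_CAPACITY_SCALE + UCLAMP_BUCKETS / 2) / UCLAMP_BUCKETS)

/* Map a clamp value in [0, 1024] to the index of the bucket that tracks it. */
static unsigned int uclamp_bucket_id(unsigned int clamp_value)
{
	unsigned int id;

	assert(clamp_value <= SCHED_CAPACITY_SCALE);
	id = clamp_value / UCLAMP_BUCKET_DELTA;

	/* Cap at the last bucket in case rounding makes 1024 / delta equal N. */
	return id < UCLAMP_BUCKETS - 1 ? id : UCLAMP_BUCKETS - 1;
}
```

With these definitions a request of 300 lands in bucket 1 and a request of 1024
lands in bucket 4, which agrees with the example given for task p.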

When a task p is enqueued, the rq value change

::

        // update bucket logic goes here
        rq->uclamp[UCLAMP_MIN] = max(rq->uclam
        // repeat for UCLAMP_MAX

Similarly, when p is dequeued the rq value cha

::

        // update bucket logic goes here
        rq->uclamp[UCLAMP_MIN] = search_top_bu
        // repeat for UCLAMP_MAX

When all buckets are empty, the rq uclamp valu
See :ref:`section 3.4 <uclamp-default-values>`


2.2. Max aggregation
--------------------

Util clamp is tuned to honour the request for
highest performance point.

When multiple tasks are attached to the same r
the task that needs the highest performance po
another task that doesn't need it or is disall

For example, if there are multiple tasks attac
values:

::

        p0->uclamp[UCLAMP_MIN] = 300
        p0->uclamp[UCLAMP_MAX] = 900

        p1->uclamp[UCLAMP_MIN] = 500
        p1->uclamp[UCLAMP_MAX] = 500

then assuming both p0 and p1 are enqueued to t
and UCLAMP_MAX become:

::

        rq->uclamp[UCLAMP_MIN] = max(300, 500)
        rq->uclamp[UCLAMP_MAX] = max(900, 500)

As we shall see in :ref:`section 5.1 <uclamp-c
aggregation is the cause of one of the limitations
particular for UCLAMP_MAX hint when user space

2.3. Hierarchical aggregation
-----------------------------

As stated earlier, util clamp is a property of
the actual applied (effective) value can be in
request made by the task or another actor on i

The effective util clamp value of any task is

  1. By the uclamp settings defined by the cgr
     to, if any.
  2. The restricted value in (1) is then furth
     uclamp settings.

:ref:`Section 3 <uclamp-interfaces>` discusses
further on that.

For now suffice to say that if a task makes a
value will have to adhere to some restrictions
wide settings.
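
The two restriction steps listed above can be pictured with a small C sketch.
It is only an illustration under stated assumptions: the ``uclamp_bounds``
struct, the helper names and the idea that each step is a plain bound check are
all hypothetical, and the sketch assumes the cgroup minimum does not exceed the
cgroup maximum. The request itself is passed by value and never modified; only
the returned effective value is restricted.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024u

/* Hypothetical pair of bounds set on the cgroup the task belongs to. */
struct uclamp_bounds {
	unsigned int min; /* protection: lower bound applied to the request */
	unsigned int max; /* limit: upper bound applied to the request */
};

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

static unsigned int max_uint(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/*
 * Step 1: restrict the task's request to the cgroup bounds.
 * Step 2: cap the result by the system wide limit for this clamp.
 */
static unsigned int uclamp_effective(unsigned int request,
				     struct uclamp_bounds cgroup,
				     unsigned int system_limit)
{
	unsigned int value;

	assert(request <= SCHED_CAPACITY_SCALE);

	value = min_uint(max_uint(request, cgroup.min), cgroup.max);
	return min_uint(value, system_limit);
}
```

For instance, a request of 410 inside a cgroup bounded to [614, 1024] with a
system limit of 1024 yields an effective value of 614, while a request of 1024
in an unrestricted cgroup with a system limit of 512 yields 512.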

The system will still accept the request even
constraints, but as soon as the task moves to
modifies the system settings, the request will
within new constraints.

In other words, this aggregation will not caus
its uclamp values, but rather the system may n
based on those factors.

2.4. Range
----------

Uclamp performance request has the range of 0

For cgroup interface percentage is used (that
Just like other cgroup interfaces, you can use

.. _uclamp-interfaces:

3. Interfaces
=============

3.1. Per task interface
-----------------------

sched_setattr() syscall was extended to accept

* sched_util_min: requests the minimum perform
  at when this task is running. Or lower perfo
* sched_util_max: requests the maximum perform
  at when this task is running. Or upper perfo

For example, the following scenario has 40% t

::

        attr->sched_util_min = 40% * 1024;
        attr->sched_util_max = 80% * 1024;

When task @p is running, **the scheduler shoul
starts at 40% performance level**. If the task
that its actual utilization goes above 80%, th
level, will be capped.

The special value -1 is used to reset the ucla
default.

Note that resetting the uclamp value to system
as manually setting uclamp value to system def
important because as we shall see in system in
RT could be changed. SCHED_NORMAL/OTHER might
future.

3.2. cgroup interface
---------------------

There are two uclamp related values in the CPU

* cpu.uclamp.min
* cpu.uclamp.max

When a task is attached to a CPU controller, i
as follows:

* cpu.uclamp.min is a protection as described
  v2 documentation <cgroupv2-protections-distr

  If a task uclamp_min value is lower than cpu
  inherit the cgroup cpu.uclamp.min value.

  In a cgroup hierarchy, effective cpu.uclamp.
  parent).

* cpu.uclamp.max is a limit as described in :r
  documentation <cgroupv2-limits-distributor>`

  If a task uclamp_max value is higher than cp
  inherit the cgroup cpu.uclamp.max value.

  In a cgroup hierarchy, effective cpu.uclamp.
  parent).

For example, given following parameters:

::

        p0->uclamp[UCLAMP_MIN] = // system def
        p0->uclamp[UCLAMP_MAX] = // system def

        p1->uclamp[UCLAMP_MIN] = 40% * 1024;
        p1->uclamp[UCLAMP_MAX] = 50% * 1024;

        cgroup0->cpu.uclamp.min = 20% * 1024;
        cgroup0->cpu.uclamp.max = 60% * 1024;

        cgroup1->cpu.uclamp.min = 60% * 1024;
        cgroup1->cpu.uclamp.max = 100% * 1024;

when p0 and p1 are attached to cgroup0, the va

::

        p0->uclamp[UCLAMP_MIN] = cgroup0->cpu.
        p0->uclamp[UCLAMP_MAX] = cgroup0->cpu.

        p1->uclamp[UCLAMP_MIN] = 40% * 1024; /
        p1->uclamp[UCLAMP_MAX] = 50% * 1024; /

when p0 and p1 are attached to cgroup1, these

::

        p0->uclamp[UCLAMP_MIN] = cgroup1->cpu.
        p0->uclamp[UCLAMP_MAX] = cgroup1->cpu.

        p1->uclamp[UCLAMP_MIN] = cgroup1->cpu.
        p1->uclamp[UCLAMP_MAX] = 50% * 1024; /

Note that cgroup interfaces allow cpu.uclamp.
cpu.uclamp.min. Other interfaces don't allow t

3.3. System interface
---------------------

3.3.1 sched_util_clamp_min
--------------------------

System wide limit of allowed UCLAMP_MIN range.
which means that permitted effective UCLAMP_MI
By changing it to 512 for example the range re
to restrict how much boosting tasks are allowe

Requests from tasks to go above this knob valu
they won't be satisfied until it is more than

The value must be smaller than or equal to sch

3.3.2 sched_util_clamp_max
--------------------------

System wide limit of allowed UCLAMP_MAX range.
which means that permitted effective UCLAMP_MA

By changing it to 512 for example the effectiv
[0:512]. This means that no task can run ab
rqs are restricted too. IOW, the whole system
capacity.

This is useful to restrict the overall maximum
For example, it can be handy to limit performa
or when the system wants to limit access to mo
levels when it's in idle state or screen is of

Requests from tasks to go above this knob valu
won't be satisfied until it is more than p->uc

The value must be greater than or equal to sch

.. _uclamp-default-values:

3.4. Default values
-------------------

By default all SCHED_NORMAL/SCHED_OTHER tasks

::

        p_fair->uclamp[UCLAMP_MIN] = 0
        p_fair->uclamp[UCLAMP_MAX] = 1024

That is, by default they're boosted to run at
changed at boot or runtime. No argument was ma
provide this, but can be added in the future.

For SCHED_FIFO/SCHED_RR tasks:

::

        p_rt->uclamp[UCLAMP_MIN] = 1024
        p_rt->uclamp[UCLAMP_MAX] = 1024

That is by default they're boosted to run at t
the system which retains the historical behavi

RT tasks default uclamp_min value can be modif
sysctl. See below section.

.. _sched-util-clamp-min-rt-default:

3.4.1 sched_util_clamp_min_rt_default
-------------------------------------

Running RT tasks at maximum performance point
devices and not necessary. To allow system dev
guarantees for these tasks without pushing it
performance point, this sysctl knob allows tun
address the system requirement without burning
performance point all the time.

Application developers are encouraged to use th
to ensure they are performance and power aware
to 0 by system designers and leave the task of
requirements to the apps.

4. How to use util clamp
========================

Util clamp promotes the concept of user space
management. At the scheduler level there is no
decision. However, with util clamp user space
better decision about task placement and frequ

Best results are achieved by not making any as
application is running on and to use it in con
dynamically monitor and adjust. Ultimately thi
experience at a better perf/watt.

For some systems and use cases, static setup w
Portability will be a problem in this case. Ho
200 or 1024 is different for each system. Unle
system, static setup should be avoided.

There are enough possibilities to create a who
or self contained app that makes use of it dir

4.1. Boost important and DVFS-latency-sensitiv
----------------------------------------------

A GUI task might not be busy to warrant drivin
wakes up. However, it requires to finish its w
to deliver the desired user experience. The ri
wakeup will be system dependent. On some under
on other overpowered ones it will be low or 0.

This task can increase its UCLAMP_MIN value ev
to ensure on next wake up it runs at a higher
to approach the lowest UCLAMP_MIN value that a
particular system to achieve the best possible

On heterogeneous systems, it might be importan
a faster CPU.

**Generally it is advised to perceive the inpu
which will imply both task placement and frequ

4.2. Cap background tasks
-------------------------

Like explained for Android case in the introdu
UCLAMP_MAX for some background tasks that don'
could end up being busy and consume unnecessar

4.3. Powersave mode
-------------------

sched_util_clamp_max system wide interface can
operating at the higher performance points whi
inefficient.

This is not unique to uclamp as one can achiev
frequency of the cpufreq governor.
It can be c
alternative interface.

4.4. Per-app performance restriction
------------------------------------

Middleware/Utility can provide the user an opt
app every time it is executed to guarantee a m
limit it from draining system power at the cos
these apps.

If you want to prevent your laptop from heatin
compiling the kernel and happy to sacrifice pe
still would like to keep your browser performa
possible.

5. Limitations
==============

.. _uclamp-capping-fail:

5.1. Capping frequency with uclamp_max fails u
----------------------------------------------

If task p0 is capped to run at 512:

::

        p0->uclamp[UCLAMP_MAX] = 512

and it shares the rq with p1 which is free to

::

        p1->uclamp[UCLAMP_MAX] = 1024

then due to max aggregation the rq will be all
point:

::

        rq->uclamp[UCLAMP_MAX] = max(512, 1024

Assuming both p0 and p1 have UCLAMP_MIN = 0, t
the rq will depend on the actual utilization v

If p1 is a small task but p0 is a CPU intensiv
both are running at the same rq, p1 will cause
from the rq although p1, which is allowed to r
doesn't actually need to run at that frequency

5.2. UCLAMP_MAX can break PELT (util_avg) sign
----------------------------------------------

PELT assumes that frequency will always increa
there's always some idle time on the CPU. But
increase will be prevented which can lead to n
circumstances. When there's no idle time, a ta
which would result in util_avg being 1024.

Combining with the issue described below, this can l
when severely capped tasks share the rq with a

As an example, if task p0, which has:

::

        p0->util_avg = 300
        p0->uclamp[UCLAMP_MAX] = 0

wakes up on an idle CPU, then it will run at m
CPU is capable of.
The max CPU frequency (Fmax
since it designates the shortest computational
work on this CPU.

::

        rq->uclamp[UCLAMP_MAX] = 0

If the ratio of Fmax/Fmin is 3, then maximum v

::

        300 * (Fmax/Fmin) = 900

which indicates the CPU will still see idle ti
_actual_ util_avg will not be 900 though, but
long as there's idle time, p->util_avg updates
but not proportional to Fmax/Fmin.

::

        p0->util_avg = 300 + small_error

Now if the ratio of Fmax/Fmin is 4, the maximu

::

        300 * (Fmax/Fmin) = 1200

which is higher than 1024 and indicates that t
this happens, then the _actual_ util_avg will

::

        p0->util_avg = 1024

If task p1 wakes up on this CPU, which has:

::

        p1->util_avg = 200
        p1->uclamp[UCLAMP_MAX] = 1024

then the effective UCLAMP_MAX for the CPU will
aggregation rule. But since the capped p0 task
severely, then the rq->util_avg will be:

::

        p0->util_avg = 1024
        p1->util_avg = 200

        rq->util_avg = 1024
        rq->uclamp[UCLAMP_MAX] = 1024

Hence lead to a frequency spike since if p0 wa

::

        p0->util_avg = 300
        p1->util_avg = 200

        rq->util_avg = 500

and run somewhere near mid performance point o

5.3. Schedutil response time issues
-----------------------------------

schedutil has three limitations:

        1. Hardware takes non-zero time to res
           request. On some platforms can be i
        2. Non fast-switch systems require a w
           and perform the frequency change, w
        3. schedutil rate_limit_us drops any r
           window.

If a relatively small task is doing critical j
performance point when it wakes up and starts
limitations will prevent it from getting what
expects.

This limitation is not only impactful when usi
prevalent as we no longer gradually ramp up or
jumping between frequencies depending on the o
respective uclamp values.

We regard that as a limitation of the capabili
itself.

There is room to improve the behavior of sched
to be done for 1 or 2. They are considered har