.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

.. |intel_pstate| replace:: :doc:`intel_pstate`

=======================
CPU Performance Scaling
=======================

:Copyright: |copy| 2017 Intel Corporation

:Author: Rafael J. Wysocki <rafael.j.wysocki@in


The Concept of CPU Performance Scaling
======================================

The majority of modern processors are capable
different clock frequency and voltage configur
Operating Performance Points or P-states (in A
the higher the clock frequency and the higher
can be retired by the CPU over a unit of time,
frequency and the higher the voltage, the more
time (or the more power is drawn) by the CPU i
there is a natural tradeoff between the CPU ca
that can be executed over a unit of time) and

In some situations it is desirable or even nec
as possible and then there is no reason to use
highest one (i.e. the highest-performance freq
available). In some other cases, however, it
instructions so quickly and maintaining the hi
relatively long time without utilizing it enti
It also may not be physically possible to main
long for thermal or power supply capacity reas
cases, there are hardware interfaces allowing
different frequency/voltage configurations or
put into different P-states.

Typically, they are used along with algorithms
capacity, so as to decide which P-states to pu
the utilization of the system generally change
repeatedly on a regular basis. The activity b
to as CPU performance scaling or CPU frequency
adjusting the CPU clock frequency).


CPU Performance Scaling in Linux
================================

The Linux kernel supports CPU performance scal
(CPU Frequency scaling) subsystem that consist
core, scaling governors and scaling drivers.

The ``CPUFreq`` core provides the common code
interfaces for all platforms that support CPU
the basic framework in which the other compone

Scaling governors implement algorithms to esti
As a rule, each governor implements one, possi
algorithm.

Scaling drivers talk to the hardware. They pr
information on the available P-states (or P-st
access platform-specific hardware interfaces t
by scaling governors.

In principle, all available scaling governors
driver. That design is based on the observati
performance scaling algorithms for P-state sel
platform-independent form in the majority of c
to use the same performance scaling algorithm
way regardless of which scaling driver is used
scaling governors should be suitable for every

However, that observation may not hold for per
based on information provided by the hardware
feedback registers, as that information is typ
interface it comes from and may not be easily
platform-independent way. For this reason, ``
to bypass the governor layer and implement the
algorithms. That is done by the |intel_pstate|


``CPUFreq`` Policy Objects
==========================

In some cases the hardware interface for P-sta
CPUs. That is, for example, the same register
control the P-state of multiple CPUs at the sa
all of those CPUs simultaneously.

Sets of CPUs sharing hardware P-state control
``CPUFreq`` as struct cpufreq_policy objects.
struct cpufreq_policy is also used when there
set.

The ``CPUFreq`` core maintains a pointer to a
every CPU in the system, including CPUs that a
CPUs share the same hardware P-state control i
corresponding to them point to the same struct

``CPUFreq`` uses struct cpufreq_policy as its
of its user space interface is based on the po


CPU Initialization
==================

First of all, a scaling driver has to be regis
It is only possible to register one scaling dr
driver is expected to be able to handle all CP

The scaling driver may be registered before or
CPUs are registered earlier, the driver core i
take a note of all of the already registered C
scaling driver. In turn, if any CPUs are regi
the scaling driver, the ``CPUFreq`` core will
at their registration time.

In any case, the ``CPUFreq`` core is invoked t
has not seen so far as soon as it is ready to
logical CPU may be a physical single-core proc
multicore processor, or a hardware thread in a
core. In what follows "CPU" always means "log
otherwise and the word "processor" is used to
possibly including multiple logical CPUs.]

Once invoked, the ``CPUFreq`` core checks if t
for the given CPU and if so, it skips the poli
a new policy object is created and initialized
a new policy directory in ``sysfs``, and the p
the given CPU is set to the new policy object'

Next, the scaling driver's ``->init()`` callba
pointer of the new CPU passed to it as the arg
to initialize the performance scaling hardware
more precisely, for the set of CPUs sharing th
to, represented by its policy object) and, if
called for is new, to set parameters of the po
frequencies supported by the hardware, the tab
the set of supported P-states is not a continu
that belong to the same policy (including both
mask is then used by the core to populate the
CPUs in it.
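
The policy pointer bookkeeping described above can be illustrated with a toy model. This is only a sketch in Python, not kernel code: ``Policy``, ``ToyCore``, ``add_cpu`` and the ``pairs`` topology function are hypothetical names introduced for illustration, and the callback passed in merely stands in for the mask that the driver's ``->init()`` callback is described as providing.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Toy stand-in for a policy object (not the kernel's struct)."""
    related_cpus: frozenset = field(default_factory=frozenset)


class ToyCore:
    """Hypothetical model of the per-CPU policy pointer bookkeeping."""

    def __init__(self, driver_init):
        # driver_init(cpu) stands in for the driver's ->init() callback:
        # it returns the set of CPUs sharing the interface with `cpu`.
        self.driver_init = driver_init
        self.policy_of = {}  # CPU number -> Policy (the "policy pointer")

    def add_cpu(self, cpu):
        # If the policy pointer is already set, skip policy creation.
        if cpu in self.policy_of:
            return self.policy_of[cpu]
        policy = Policy(related_cpus=frozenset(self.driver_init(cpu)))
        # Populate the policy pointers for all of the CPUs in the mask.
        for related in policy.related_cpus:
            self.policy_of[related] = policy
        return policy


def pairs(cpu):
    # Assumed topology: CPUs 0-1 share one interface, CPUs 2-3 another.
    base = cpu - cpu % 2
    return {base, base + 1}


core = ToyCore(pairs)
first = core.add_cpu(0)
second = core.add_cpu(1)   # pointer already set, no new policy
third = core.add_cpu(2)    # different interface, new policy
```

In this model CPUs 0 and 1 end up pointing to the same object, which mirrors the statement that CPUs sharing a hardware P-state control interface share one policy object.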

The next major initialization step for a new p
scaling governor to it (to begin with, that is
determined by the kernel command line or confi
later via ``sysfs``). First, a pointer to the
the governor's ``->init()`` callback which is
data structures necessary to handle the given
a governor ``sysfs`` interface to it. Next, t
invoking its ``->start()`` callback.

That callback is expected to register per-CPU
all of the online CPUs belonging to the given
The utilization update callbacks will be invok
important events, like task enqueue and dequeu
scheduler tick or generally whenever the CPU u
scheduler's perspective). They are expected t
to determine the P-state to use for the given
invoke the scaling driver to make changes to t
the P-state selection. The scaling driver may
scheduler context or asynchronously, via a ker
on the configuration and capabilities of the s

Similar steps are taken for policy objects tha
previously, meaning that all of the CPUs belon
only practical difference in that case is that
to use the scaling governor previously used wi
"inactive" (and is re-initialized now) instead

In turn, if a previously offline CPU is being
other CPUs sharing the policy object with it a
need to re-initialize the policy object at all
necessary to restart the scaling governor so t
into account. That is achieved by invoking th
``->start()`` callbacks, in this order, for th

As mentioned before, the |intel_pstate| scalin
governor layer of ``CPUFreq`` and provides its
Consequently, if |intel_pstate| is used, scali
new policy objects. Instead, the driver's ``-
to register per-CPU utilization update callbac
callbacks are invoked by the CPU scheduler in
governors, but in the |intel_pstate| case they
use and change the hardware configuration acco
context.

The policy objects created during CPU initiali
associated with them are torn down when the sc
(which happens when the kernel module containi
when the last CPU belonging to the given polic


Policy Interface in ``sysfs``
=============================

During the initialization of the kernel, the `
``sysfs`` directory (kobject) called ``cpufreq``
:file:`/sys/devices/system/cpu/`.

That directory contains a ``policyX`` subdirec
integer number) for every policy object mainta
Each ``policyX`` directory is pointed to by ``
under :file:`/sys/devices/system/cpu/cpuY/` (w
that may be different from the one represented
associated with (or belonging to) the given po
in :file:`/sys/devices/system/cpu/cpufreq` eac
attributes (files) to control ``CPUFreq`` beha
objects (that is, for all of the CPUs associat

Some of those attributes are generic. They ar
and their behavior generally does not depend o
and what scaling governor is attached to the g
also add driver-specific attributes to the pol
control policy-specific aspects of driver beha

The generic attributes under :file:`/sys/devic
are the following:

``affected_cpus``
        List of online CPUs belonging to this
        performance scaling interface represen
        object).

``bios_limit``
        If the platform firmware (BIOS) tells
        CPU frequencies, that limit will be re
        present).

        The existence of the limit may be a re
        BIOS settings, restrictions coming fro
        BIOS/HW-based mechanisms.

        This does not cover ACPI thermal limit
        through a generic thermal driver.

        This attribute is not present if the s
        support it.

``cpuinfo_cur_freq``
        Current frequency of the CPUs belongin
        the hardware (in kHz).

        This is expected to be the frequency t
        If that frequency cannot be determined
        be present.

``cpuinfo_max_freq``
        Maximum possible operating frequency t
        can run at (in kHz).

``cpuinfo_min_freq``
        Minimum possible operating frequency t
        can run at (in kHz).

``cpuinfo_transition_latency``
        The time it takes to switch the CPUs b
        P-state to another, in nanoseconds.

        If unknown or if known to be so high t
        work with the `ondemand`_ governor, -1
        will be returned by reads from this at

``related_cpus``
        List of all (online and offline) CPUs

``scaling_available_frequencies``
        List of available frequencies of the C
        (in kHz).

``scaling_available_governors``
        List of ``CPUFreq`` scaling governors
        be attached to this policy or (if the
        in use) list of scaling algorithms pro
        applied to this policy.

        [Note that some governors are modular
        kernel module for the governor held by
        listed by this attribute.]

``scaling_cur_freq``
        Current frequency of all of the CPUs b

        In the majority of cases, this is the
        requested by the scaling driver from t
        interface provided by it, which may or
        the CPU is actually running at (due to
        limitations).

        Some architectures (e.g. ``x86``) may
        more precisely reflecting the current
        attribute, but that still may not be t
        seen by the hardware at the moment.

``scaling_driver``
        The scaling driver currently in use.

``scaling_governor``
        The scaling governor currently attache
        |intel_pstate| scaling driver is in us
        provided by the driver that is current

        This attribute is read-write and writi
        governor to be attached to this policy
        provided by the scaling driver to be a
        |intel_pstate| case), as indicated by
        attribute (which must be one of the na
        ``scaling_available_governors`` attrib

``scaling_max_freq``
        Maximum frequency the CPUs belonging t
        running at (in kHz).

        This attribute is read-write and writi
        integer to it will cause a new limit t
        than the value of the ``scaling_min_fr

``scaling_min_freq``
        Minimum frequency the CPUs belonging t
        running at (in kHz).

        This attribute is read-write and writi
        non-negative integer to it will cause
        be higher than the value of the ``scal

``scaling_setspeed``
        This attribute is functional only if t
        is attached to the given policy.

        It returns the last frequency requeste
        be written to in order to set a new fr


Generic Scaling Governors
=========================

``CPUFreq`` provides generic scaling governors
scaling drivers. As stated before, each of th
parametrized, performance scaling algorithm.

Scaling governors are attached to policy objec
can be handled by different scaling governors
may lead to suboptimal results in some cases).

The scaling governor for a given policy object
the help of the ``scaling_governor`` policy at

Some governors expose ``sysfs`` attributes to
algorithms implemented by them. Those attribu
tunables, can be either global (system-wide) o
scaling driver in use. If the driver requires
per-policy, they are located in a subdirectory
Otherwise, they are located in a subdirectory
:file:`/sys/devices/system/cpu/cpufreq/`. In
subdirectory containing the governor tunables
providing them.

``performance``
---------------

When attached to a policy object, this governo
within the ``scaling_max_freq`` policy limit,

The request is made once at that time the gove
``performance`` and whenever the ``scaling_max
policy limits change after that.
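
Attaching a governor such as ``performance`` to a policy goes through the ``scaling_governor`` attribute, and the name written must be one of those listed by ``scaling_available_governors``. A minimal sketch of that check in Python follows. The helper name ``set_governor`` is hypothetical, and the policy directory is passed in by the caller rather than hard-coded.

```python
from pathlib import Path


def set_governor(policy_dir, governor):
    """Write `governor` to scaling_governor after checking it is listed.

    `policy_dir` is a policyX directory; this helper is illustrative only.
    """
    policy_dir = Path(policy_dir)
    # The attribute holds a whitespace-separated list of governor names.
    listed = (policy_dir / "scaling_available_governors").read_text().split()
    if governor not in listed:
        raise ValueError(f"{governor!r} is not one of {listed}")
    (policy_dir / "scaling_governor").write_text(governor + "\n")
    # Read the attribute back so the caller sees what is now in effect.
    return (policy_dir / "scaling_governor").read_text().strip()
```

On a running system this would be called with a directory such as ``/sys/devices/system/cpu/cpufreq/policy0`` (``policy0`` being an example value of ``policyX``) and would normally need root privileges.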

``powersave``
-------------

When attached to a policy object, this governo
within the ``scaling_min_freq`` policy limit,

The request is made once at that time the gove
``powersave`` and whenever the ``scaling_max_f
policy limits change after that.

``userspace``
-------------

This governor does not do anything by itself.
to set the CPU frequency for the policy it is
``scaling_setspeed`` attribute of that policy.

``schedutil``
-------------

This governor uses CPU utilization data availa
generally is regarded as a part of the CPU sch
scheduler's internal data structures directly.

It runs entirely in scheduler context, althoug
invoke the scaling driver asynchronously when
should be changed for a given policy (that dep
is capable of changing the CPU frequency from

The actions of this governor for a particular
invoking its utilization update callback for t
RT or deadline scheduling classes, the governo
the allowed maximum (that is, the ``scaling_ma
if it is invoked by the CFS scheduling class,
Per-Entity Load Tracking (PELT) metric for the
given CPU as the CPU utilization estimate (see
LWN.net article [1]_ for a description of the
CPU frequency to apply is computed in accordan

        f = 1.25 * ``f_0`` * ``util`` / ``max``

where ``util`` is the PELT number, ``max`` is
``util``, and ``f_0`` is either the maximum po
policy (if the PELT number is frequency-invari
(otherwise).

This governor also employs a mechanism allowin
CPU frequency for tasks that have been waiting
"IO-wait boosting". That happens when the :c:
is passed by the scheduler to the governor cal
to go up to the allowed maximum immediately an
returned by the above formula over time.
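
The frequency formula given above for this governor can be worked through numerically. The sketch below is plain Python arithmetic, not the kernel implementation. The function name ``schedutil_target`` and the sample values (a 2 GHz ``f_0`` and a ``max`` of 1000) are assumptions chosen for illustration, and policy limits are deliberately not applied.

```python
def schedutil_target(f0_khz, util, max_util):
    """Evaluate f = 1.25 * f_0 * util / max, with frequencies in kHz."""
    return 1.25 * f0_khz * util / max_util


# Assumed sample values: f_0 = 2,000,000 kHz, max = 1000.
half_loaded = schedutil_target(2_000_000, 500, 1000)
at_80_percent = schedutil_target(2_000_000, 800, 1000)
```

With these inputs the result reaches ``f_0`` itself once ``util`` is 80% of ``max``, which is the effect of the 1.25 factor in the formula.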

This governor exposes only one tunable:

``rate_limit_us``
        Minimum time (in microseconds) that ha
        runs of governor computations (default
        transition latency or the maximum 2ms)

        The purpose of this tunable is to redu
        of the governor which might be excessi

This governor generally is regarded as a repla
and `conservative`_ governors (described below
tightly integrated with the CPU scheduler, its
switches and similar is less significant, and
utilization metric, so in principle its decisi
decisions made by the other parts of the sched

``ondemand``
------------

This governor uses CPU load as a CPU frequency

In order to estimate the current CPU load, it
consecutive invocations of its worker routine
time in which the given CPU was not idle. The
time to the total CPU time is taken as an esti

If this governor is attached to a policy share
estimated for all of them and the greatest res
for the entire policy.

The worker routine of this governor has to run
invoked asynchronously (via a workqueue) and C
there if necessary. As a result, the schedule
governor is minimum, but it causes additional
relatively often and the CPU P-state updates t
irregular. Also, it affects its own CPU load
reduces the CPU idle time (even though the CPU
slightly by it).

It generally selects CPU frequencies proportio
the value of the ``cpuinfo_max_freq`` policy a
1 (or 100%), and the value of the ``cpuinfo_mi
corresponds to the load of 0, unless when the
speedup threshold, in which case it will go st
it is allowed to use (the ``scaling_max_freq``

This governor exposes the following tunables:

``sampling_rate``
        This is how often the governor's worke
        microseconds.

        Typically, it is set to values of the
        default value is to add a 50% breathin
        to ``cpuinfo_transition_latency`` on e
        attached to. The minimum is typically
        ticks.

        If this tunable is per-policy, the fol
        represented by it to be 1.5 times as h
        (the default)::

                # echo `$(($(cat cpuinfo_transition_la

``up_threshold``
        If the estimated CPU load is above thi
        will set the frequency to the maximum
        Otherwise, the selected frequency will
        CPU load.

``ignore_nice_load``
        If set to 1 (default 0), it will cause
        treat the CPU time spent on executing
        than 0 as CPU idle time.

        This may be useful if there are tasks
        taken into account when deciding what
        Then, to make that happen it is suffic
        of those tasks above 0 and set this at

``sampling_down_factor``
        Temporary multiplier, between 1 (defau
        the ``sampling_rate`` value if the CPU

        This causes the next execution of the
        setting the frequency to the allowed m
        frequency stays at the maximum level f

        Frequency fluctuations in some bursty
        at the cost of additional energy spent
        capacity.

``powersave_bias``
        Reduction factor to apply to the origi
        governor (including the maximum value
        value is exceeded by the estimated CPU
        for the AMD frequency sensitivity powe
        (:file:`drivers/cpufreq/amd_freq_sensi
        inclusive.

        If the AMD frequency sensitivity power
        the effective frequency to apply is gi

                f * (1 - ``powersave_bias`` /

        where f is the governor's original fre
        of this attribute is 0 in that case.

        If the AMD frequency sensitivity power
        value of this attribute is 400 by defa
        way.

        On Family 16h (and later) AMD processo
        measured workload sensitivity, between
        hardware.
        That value can be used to e
        workload running on a CPU will change

        The performance of a workload with the
        IO-bound) is not expected to increase
        the CPU frequency, whereas workloads w
        (CPU-bound) are expected to perform mu
        increased.

        If the workload sensitivity is less th
        the ``powersave_bias`` value, the sens
        will cause the governor to select a fr
        target, so as to avoid over-provisioni
        from running at higher CPU frequencies

``conservative``
----------------

This governor uses CPU load as a CPU frequency

It estimates the CPU load in the same way as t
above, but the CPU frequency selection algorit

Namely, it avoids changing the frequency signi
which may not be suitable for systems with lim
battery-powered). To achieve that, it changes
small steps, one step at a time, up or down -
(configurable) threshold has been exceeded by

This governor exposes the following tunables:

``freq_step``
        Frequency step in percent of the maxim
        allowed to set (the ``scaling_max_freq``
        100 (5 by default).

        This is how much the frequency is allo
        it to 0 will cause the default frequen
        and setting it to 100 effectively caus
        switch the frequency between the ``sca
        ``scaling_max_freq`` policy limits.

``down_threshold``
        Threshold value (in percent, 20 by def
        frequency change direction.

        If the estimated CPU load is greater t
        go up (by ``freq_step``). If the load
        ``sampling_down_factor`` mechanism is
        go down. Otherwise, the frequency wil

``sampling_down_factor``
        Frequency decrease deferral factor, be
        inclusive.

        It effectively causes the frequency to
        times slower than it ramps up.
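
The stepwise behavior of this governor can be sketched as a single decision function. This is a simplified Python model, not the kernel code. The defaults for ``freq_step`` (5) and ``down_threshold`` (20) come from the text above, while the function name ``conservative_step``, the ``up_threshold`` parameter and its value of 80 are assumptions made for illustration. The ``sampling_down_factor`` deferral and the special meaning of ``freq_step`` set to 0 are not modeled.

```python
def conservative_step(cur_khz, load_pct, min_khz, max_khz,
                      freq_step=5, down_threshold=20, up_threshold=80):
    """Return the next frequency (kHz) after one governor evaluation."""
    # One step is freq_step percent of the maximum allowed frequency.
    step = max_khz * freq_step // 100
    if load_pct > up_threshold:       # assumed upper threshold
        return min(cur_khz + step, max_khz)
    if load_pct < down_threshold:
        return max(cur_khz - step, min_khz)
    return cur_khz                    # between thresholds: no change


# Assumed policy limits: 800 MHz to 2 GHz.
up = conservative_step(1_000_000, 90, 800_000, 2_000_000)
down = conservative_step(1_000_000, 10, 800_000, 2_000_000)
hold = conservative_step(1_000_000, 50, 800_000, 2_000_000)
```

Each evaluation moves the frequency by at most one step, so a sustained high load raises the frequency gradually instead of jumping straight to the maximum.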


Frequency Boost Support
=======================

Background
----------

Some processors support a mechanism to raise t
cores in a multicore package temporarily (and
threshold for the whole package) under certain
whole chip is not fully utilized and below its

Different names are used by different vendors
For Intel processors it is referred to as "Tur
"Turbo-Core" or (in technical documentation) "
As a rule, it also is implemented differently
term "frequency boost" is used here for brevit
implementations.

The frequency boost mechanism may be either ha
If it is hardware-based (e.g. on x86), the dec
made by the hardware (although in general it r
into a special state in which it can control t
limits). If it is software-based (e.g. on ARM
whether or not to trigger boosting and when to

The ``boost`` File in ``sysfs``
-------------------------------

This file is located under :file:`/sys/devices
the "boost" setting for the whole system. It
scaling driver does not support the frequency
but provides a driver-specific interface for c
|intel_pstate|).

If the value in this file is 1, the frequency
means that either the hardware can be put into
trigger boosting (in the hardware-based case),
trigger boosting (in the software-based case).
is actually in use at the moment on any CPUs i
permission to use the frequency boost mechanis
for other reasons).

If the value in this file is 0, the frequency
cannot be used at all.

The only values that can be written to this fi

Rationale for Boost Control Knob
--------------------------------

The frequency boost mechanism is generally int
CPU performance on time scales below software
scheduler tick interval) and it is demonstrabl
it may lead to problems in certain situations.

For this reason, many systems make it possible
mechanism in the platform firmware (BIOS) setu
be restarted for the setting to be adjusted as
practical at least in some cases. For example

  1. Boosting means overclocking the processor
     conditions. Generally, the processor's e
     as a result of increasing its frequency a
     That may not be desirable on systems that
     limited capacity, such as batteries, so t
     mechanism while the system is running may
     the workload too).

  2. In some situations deterministic behavior
     performance or energy consumption (or bot
     boosting while the system is running may

  3. To examine the impact of the frequency bo
     to be able to run tests with and without
     restarting the system in the meantime.

  4. Reproducible results are important when r
     the boosting functionality depends on the
     single-thread performance may vary becaus
     unreproducible results sometimes. That c
     frequency boost mechanism before running
     issue.

Legacy AMD ``cpb`` Knob
-----------------------

The AMD powernow-k8 scaling driver supports a
the global ``boost`` one. It is used for disa
Performance Boost" feature of some AMD process

If present, that knob is located in every ``CP
``sysfs`` (:file:`/sys/devices/system/cpu/cpuf
``cpb``, which indicates a more fine grained c
implementation, however, works on the system-w
for one policy causes the same value of it to
policies at the same time.

That knob is still supported on AMD processors
hardware feature, but it may be configured out
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configu
``boost`` knob is present regardless. Thus it
``boost`` knob instead of the ``cpb`` one whic
is more consistent with what all of the other
may not be supported any more in the future).

The ``cpb`` knob is never present for any proc
hardware feature (e.g.
all Intel ones), even i
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configu


References
==========

.. [1] Jonathan Corbet, *Per-entity load tracking*,
       https://lwn.net/Articles/531853/