.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

==============================================
``intel_idle`` CPU Idle Time Management Driver
==============================================

:Copyright: |copy| 2020 Intel Corporation

:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


General Information
===================

``intel_idle`` is a part of the
:doc:`CPU idle time management subsystem <cpuidle>` in the Linux kernel
(``CPUIdle``).  It is the default CPU idle time management driver for the
Nehalem and later generations of Intel processors, but the level of support for
a particular processor model in it depends on whether or not it recognizes that
processor model and may also depend on information coming from the platform
firmware.  [To understand ``intel_idle`` it is necessary to know how ``CPUIdle``
works in general, so this is the time to get familiar with
Documentation/admin-guide/pm/cpuidle.rst if you have not done that yet.]

``intel_idle`` uses the ``MWAIT`` instruction to inform the processor that the
logical CPU executing it is idle and so it may be possible to put some of the
processor's functional blocks into low-power states.  That instruction takes two
arguments (passed in the ``EAX`` and ``ECX`` registers of the target CPU), the
first of which, referred to as a *hint*, can be used by the processor to
determine what can be done (for details refer to Intel Software Developer’s
Manual [1]_).  Accordingly, ``intel_idle`` refuses to work with processors in
which the support for the ``MWAIT`` instruction has been disabled (for example,
via the platform firmware configuration menu) or which do not support that
instruction at all.

``intel_idle`` is not modular, so it cannot be unloaded, which means that the
only way to pass early-configuration-time parameters to it is via the kernel
command line.


.. _intel-idle-enumeration-of-states:

Enumeration of Idle States
==========================

Each ``MWAIT`` hint value is interpreted by the processor as a license to
reconfigure itself in a certain way in order to save energy.  The processor
configurations (with reduced power draw) resulting from that are referred to
as C-states (in the ACPI terminology) or idle states.  The list of meaningful
``MWAIT`` hint values and idle states (i.e. low-power configurations of the
processor) corresponding to them depends on the processor model and it may also
depend on the configuration of the platform.

In order to create a list of available idle states required by the ``CPUIdle``
subsystem (see :ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst),
``intel_idle`` can use two sources of information: static tables of idle states
for different processor models included in the driver itself and the ACPI tables
of the system.  The former are always used if the processor model at hand is
recognized by ``intel_idle`` and the latter are used if that is required for
the given processor model (which is the case for all server processor models
recognized by ``intel_idle``) or if the processor model is not recognized.
[There is a module parameter that can be used to make the driver use the ACPI
tables with any processor model recognized by it; see
`below <intel-idle-parameters_>`_.]

If the ACPI tables are going to be used for building the list of available idle
states, ``intel_idle`` first looks for a ``_CST`` object under one of the ACPI
objects corresponding to the CPUs in the system (refer to the ACPI specification
[2]_ for the description of ``_CST`` and its output package).
Because the
``CPUIdle`` subsystem expects that the list of idle states supplied by the
driver will be suitable for all of the CPUs handled by it and ``intel_idle`` is
registered as the ``CPUIdle`` driver for all of the CPUs in the system, the
driver looks for the first ``_CST`` object returning at least one valid idle
state description and such that all of the idle states included in its return
package are of the FFH (Functional Fixed Hardware) type, which means that the
``MWAIT`` instruction is expected to be used to tell the processor that it can
enter one of them.  The return package of that ``_CST`` is then assumed to be
applicable to all of the other CPUs in the system and the idle state
descriptions extracted from it are stored in a preliminary list of idle states
coming from the ACPI tables.  [This step is skipped if ``intel_idle`` is
configured to ignore the ACPI tables; see `below <intel-idle-parameters_>`_.]

Next, the first (index 0) entry in the list of available idle states is
initialized to represent a "polling idle state" (a pseudo-idle state in which
the target CPU continuously fetches and executes instructions), and the
subsequent (real) idle state entries are populated as follows.

If the processor model at hand is recognized by ``intel_idle``, there is a
(static) table of idle state descriptions for it in the driver.  In that case,
the "internal" table is the primary source of information on idle states and the
information from it is copied to the final list of available idle states.  If
using the ACPI tables for the enumeration of idle states is not required
(depending on the processor model), all of the listed idle states are enabled by
default (so all of them will be taken into consideration by ``CPUIdle``
governors during CPU idle state selection).
Otherwise, some of the listed idle
states may not be enabled by default if there are no matching entries in the
preliminary list of idle states coming from the ACPI tables.  In that case, user
space can still enable them later (on a per-CPU basis) with the help of
the ``disable`` idle state attribute in ``sysfs`` (see
:ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst).  This basically means that
the idle states "known" to the driver may not be enabled by default if they have
not been exposed by the platform firmware (through the ACPI tables).

If the given processor model is not recognized by ``intel_idle``, but it
supports ``MWAIT``, the preliminary list of idle states coming from the ACPI
tables is used for building the final list that will be supplied to the
``CPUIdle`` core during driver registration.  For each idle state in that list,
the description, ``MWAIT`` hint and exit latency are copied to the corresponding
entry in the final list of idle states.  The name of the idle state represented
by it (to be returned by the ``name`` idle state attribute in ``sysfs``) is
"CX_ACPI", where X is the index of that idle state in the final list (note that
the minimum value of X is 1, because 0 is reserved for the "polling" state), and
its target residency is based on the exit latency value.  Specifically, for
C1-type idle states the exit latency value is also used as the target residency
(for compatibility with the majority of the "internal" tables of idle states for
various processor models recognized by ``intel_idle``) and for the other idle
state types (C2 and C3) the target residency value is 3 times the exit latency
(again, that is because it reflects the target residency to exit latency ratio
in the majority of cases for the processor models recognized by ``intel_idle``).
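
The naming and target residency rules described above can be modeled with a
short sketch.  This is an illustration only, not the driver's code (which is
written in C): the ``IdleState`` class, the ``build_final_list()`` helper, the
"polling" placeholder entry and the sample hint and latency values are all
hypothetical and chosen just to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class IdleState:
    name: str
    desc: str
    mwait_hint: int
    exit_latency_us: int
    target_residency_us: int


def target_residency(cstate_type: int, exit_latency_us: int) -> int:
    """C1-type states reuse the exit latency; C2 and C3 use 3 times it."""
    if cstate_type == 1:
        return exit_latency_us
    return 3 * exit_latency_us


def build_final_list(acpi_states):
    """Build the final list from (type, desc, mwait_hint, exit_latency_us) tuples."""
    # Index 0 is reserved for the polling pseudo-state (placeholder values).
    final = [IdleState("polling", "polling idle state", 0, 0, 0)]
    for cstate_type, desc, hint, latency in acpi_states:
        index = len(final)  # the first real idle state gets index 1
        final.append(
            IdleState(
                name=f"C{index}_ACPI",
                desc=desc,
                mwait_hint=hint,
                exit_latency_us=latency,
                target_residency_us=target_residency(cstate_type, latency),
            )
        )
    return final


# Hypothetical preliminary list, as if extracted from a _CST return package.
states = build_final_list([
    (1, "ACPI FFH C1", 0x00, 2),
    (2, "ACPI FFH C2", 0x10, 50),
    (3, "ACPI FFH C3", 0x20, 100),
])
for state in states:
    print(state.name, state.exit_latency_us, state.target_residency_us)
```

In this model a C1-type entry with an exit latency of 2 gets a target residency
of 2, whereas a C2-type entry with an exit latency of 50 gets 150.
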
All of the idle states in the final list are enabled by default in this case.


.. _intel-idle-initialization:

Initialization
==============

The initialization of ``intel_idle`` starts with checking if the kernel command
line options forbid the use of the ``MWAIT`` instruction.  If that is the case,
an error code is returned right away.

The next step is to check whether or not the processor model is known to the
driver, which determines the idle states enumeration method (see
`above <intel-idle-enumeration-of-states_>`_), and whether or not the processor
supports ``MWAIT`` (the initialization fails if that is not the case).  Then,
the ``MWAIT`` support in the processor is enumerated through ``CPUID`` and the
driver initialization fails if the level of support is not as expected (for
example, if the total number of ``MWAIT`` substates returned is 0).

Next, if the driver is not configured to ignore the ACPI tables (see
`below <intel-idle-parameters_>`_), the idle states information provided by the
platform firmware is extracted from them.

Then, ``CPUIdle`` device objects are allocated for all CPUs and the list of
available idle states is created as explained
`above <intel-idle-enumeration-of-states_>`_.

Finally, ``intel_idle`` is registered with the help of cpuidle_register_driver()
as the ``CPUIdle`` driver for all CPUs in the system and a CPU online callback
for configuring individual CPUs is registered via cpuhp_setup_state(), which
(among other things) causes the callback routine to be invoked for all of the
CPUs present in the system at that time (each CPU executes its own instance of
the callback routine).
That routine registers a ``CPUIdle`` device for the CPU
running it (which enables the ``CPUIdle`` subsystem to operate that CPU) and
optionally performs some CPU-specific initialization actions that may be
required for the given processor model.


.. _intel-idle-parameters:

Kernel Command Line Options and Module Parameters
=================================================

The *x86* architecture support code recognizes three kernel command line
options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
and ``idle=nomwait``.  If any of them is present in the kernel command line, the
``MWAIT`` instruction is not allowed to be used, so the initialization of
``intel_idle`` will fail.

Apart from that, there are five module parameters recognized by ``intel_idle``
itself that can be set via the kernel command line (they cannot be updated via
sysfs, so that is the only way to change their values).

The ``max_cstate`` parameter value is the maximum idle state index in the list
of idle states supplied to the ``CPUIdle`` core during the registration of the
driver.  It is also the maximum number of regular (non-polling) idle states that
can be used by ``intel_idle``, so the enumeration of idle states is terminated
after finding that number of usable idle states (the other idle states that
potentially might have been used if ``max_cstate`` had been greater are not
taken into consideration at all).  Setting ``max_cstate`` can prevent
``intel_idle`` from exposing idle states that are regarded as "too deep" for
some reason to the ``CPUIdle`` core, but it does so by making them effectively
invisible until the system is shut down and started again, which may not always
be desirable.
In practice, it is only really necessary to do that if the idle
states in question cannot be enabled during system startup, because in the
working state of the system the CPU power management quality of service (PM
QoS) feature can be used to prevent ``CPUIdle`` from touching those idle states
even if they have been enumerated (see :ref:`cpu-pm-qos` in
Documentation/admin-guide/pm/cpuidle.rst).
Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail.

The ``no_acpi`` and ``use_acpi`` module parameters (recognized by ``intel_idle``
if the kernel has been configured with ACPI support) can be set to make the
driver ignore the system's ACPI tables entirely or use them for all of the
recognized processor models, respectively (both are unset by default and
``use_acpi`` has no effect if ``no_acpi`` is set).

The value of the ``states_off`` module parameter (0 by default) represents a
list of idle states to be disabled by default in the form of a bitmask.

Namely, the positions of the bits that are set in the ``states_off`` value are
the indices of idle states to be disabled by default (as reflected by the names
of the corresponding idle state directories in ``sysfs``, :file:`state0`,
:file:`state1` ... :file:`state<i>` ..., where ``<i>`` is the index of the given
idle state; see :ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst).

For example, if ``states_off`` is equal to 3, the driver will disable idle
states 0 and 1 by default, and if it is equal to 8, idle state 3 will be
disabled by default and so on (bit positions beyond the maximum idle state index
are ignored).

The idle states disabled this way can be enabled (on a per-CPU basis) from user
space via ``sysfs``.

The ``ibrs_off`` module parameter is a boolean flag (defaults to
false).
If set, it controls whether IBRS (Indirect Branch Restricted
Speculation) should be turned off when the CPU enters an idle state.
This flag does not affect CPUs that use Enhanced IBRS, which can remain
on with little performance impact.

For some CPUs, IBRS will be selected as mitigation for Spectre v2 and Retbleed
security vulnerabilities by default.  Leaving the IBRS mode on while idling may
have a performance impact on its sibling CPU.  The IBRS mode will be turned off
by default when the CPU enters a deep idle state, but not in some
shallower ones.  Setting the ``ibrs_off`` module parameter will force the IBRS
mode to off when the CPU is in any one of the available idle states.  This may
help performance of a sibling CPU at the expense of a slightly higher wakeup
latency for the idle CPU.


.. _intel-idle-core-and-package-idle-states:

Core and Package Levels of Idle States
======================================

Typically, in a processor supporting the ``MWAIT`` instruction there are (at
least) two levels of idle states (or C-states).  One level, referred to as
"core C-states", covers individual cores in the processor, whereas the other
level, referred to as "package C-states", covers the entire processor package
and it may also involve other components of the system (GPUs, memory
controllers, I/O hubs etc.).

Some of the ``MWAIT`` hint values allow the processor to use core C-states only
(most importantly, that is the case for the ``MWAIT`` hint value corresponding
to the ``C1`` idle state), but the majority of them give it a license to put
the target core (i.e. the core containing the logical CPU executing ``MWAIT``
with the given hint value) into a specific core C-state and then (if possible)
to enter a specific package C-state at the deeper level.
For example, the
``MWAIT`` hint value representing the ``C3`` idle state allows the processor to
put the target core into the low-power state referred to as "core ``C3``" (or
``CC3``), which happens if all of the logical CPUs (SMT siblings) in that core
have executed ``MWAIT`` with the ``C3`` hint value (or with a hint value
representing a deeper idle state), and in addition to that (in the majority of
cases) it gives the processor a license to put the entire package (possibly
including some non-CPU components such as a GPU or a memory controller) into the
low-power state referred to as "package ``C3``" (or ``PC3``), which happens if
all of the cores have gone into the ``CC3`` state and (possibly) some additional
conditions are satisfied (for instance, if the GPU is covered by ``PC3``, it may
be required to be in a certain GPU-specific low-power state for ``PC3`` to be
reachable).

As a rule, there is no simple way to make the processor use core C-states only
if the conditions for entering the corresponding package C-states are met, so
the logical CPU executing ``MWAIT`` with a hint value that is not core-level
only (like for ``C1``) must always assume that this may cause the processor to
enter a package C-state.  [That is why the exit latency and target residency
values corresponding to the majority of ``MWAIT`` hint values in the "internal"
tables of idle states in ``intel_idle`` reflect the properties of package
C-states.]  If using package C-states is not desirable at all, either
:ref:`PM QoS <cpu-pm-qos>` or the ``max_cstate`` module parameter of
``intel_idle`` described `above <intel-idle-parameters_>`_ must be used to
restrict the range of permissible idle states to the ones with core-level only
``MWAIT`` hint values (like ``C1``).


References
==========

.. [1] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2B*,
       https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2b-manual.html

.. [2] *Advanced Configuration and Power Interface (ACPI) Specification*,
       https://uefi.org/specifications