Linux/Documentation/admin-guide/thermal/intel_powerclamp.rst


=======================
Intel Powerclamp Driver
=======================

By:
  - Arjan van de Ven <arjan@linux.intel.com>
  - Jacob Pan <jacob.jun.pan@linux.intel.com>

.. Contents:

        (*) Introduction
            - Goals and Objectives

        (*) Theory of Operation
            - Idle Injection
            - Calibration

        (*) Performance Analysis
            - Effectiveness and Limitations
            - Power vs Performance
            - Scalability
            - Calibration
            - Comparison with Alternative Techniques

        (*) Usage and Interfaces
            - Generic Thermal Layer (sysfs)
            - Kernel APIs (TBD)

        (*) Module Parameters

INTRODUCTION
============

Consider the situation where a system’s power consumption must be
reduced at runtime, due to power budget, thermal constraint, or noise
level, and where active cooling is not preferred. Software-managed
passive power reduction must be performed to prevent the hardware
actions that are designed for catastrophic scenarios.

Currently, P-states, T-states (clock modulation), and CPU offlining
are used for CPU throttling.

On Intel CPUs, C-states provide effective power reduction, but so far
they’re only used opportunistically, based on workload. The
intel_powerclamp driver introduces a method of synchronizing idle
injection across all online CPU threads. The goal is to achieve
forced and controllable C-state residency.

Testing and analysis have been carried out in the areas of power,
performance, scalability, and user experience. In many cases, a clear
advantage is shown over taking the CPU offline or modulating the CPU
clock.


THEORY OF OPERATION
===================

Idle Injection
--------------

On modern Intel processors (Nehalem or later), package level C-state
residency is available in MSRs, and thus also available to the kernel.

These MSRs are::

      #define MSR_PKG_C2_RESIDENCY      0x60D
      #define MSR_PKG_C3_RESIDENCY      0x3F8
      #define MSR_PKG_C6_RESIDENCY      0x3F9
      #define MSR_PKG_C7_RESIDENCY      0x3FA

If the kernel can also inject idle time into the system, then a
closed-loop control system can be established that manages package
level C-states. The intel_powerclamp driver is conceived as such a
control system, where the target set point is a user-selected idle
ratio (based on power reduction), and the error is the difference
between the actual package level C-state residency ratio and the
target idle ratio.

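The error term of the control loop described above can be sketched in a
few lines of Python. This is only an illustration, not the driver's code:
the function names and sample values are hypothetical, and the sketch
assumes that the residency counters and the TSC advance at the same rate,
so that their deltas can be divided directly.

```python
def residency_ratio(prev_counter, curr_counter, prev_tsc, curr_tsc):
    """Percentage of the elapsed TSC ticks spent in a package C-state.

    Assumption: the residency counter ticks at the TSC rate.
    """
    elapsed = curr_tsc - prev_tsc
    if elapsed <= 0:
        return 0.0
    return 100.0 * (curr_counter - prev_counter) / elapsed


def control_error(actual_ratio, target_ratio):
    """Error as defined in the text: actual residency minus target."""
    return actual_ratio - target_ratio


# Hypothetical samples: 300 of 1000 elapsed ticks were package idle,
# while the user asked for a 25% idle ratio.
actual = residency_ratio(1_000, 1_300, 50_000, 51_000)
error = control_error(actual, target_ratio=25)
print(actual, error)
```

A positive error means the package was idle for longer than requested; a
negative error means the injected idle did not fully turn into package
C-state residency.
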
Injection is controlled by high priority kernel threads, spawned for
each online CPU.

These kernel threads, with SCHED_FIFO class, are created to perform
clamping actions of controlled duty ratio and duration. Each per-CPU
thread synchronizes its idle time and duration, based on the rounding
of jiffies, so that accumulated errors are prevented and a jittery
effect is avoided. Threads are also bound to their CPU such that they
cannot be migrated, unless the CPU is taken offline. In this case,
threads belonging to the offlined CPUs will be terminated immediately.

Running as SCHED_FIFO at a relatively high priority also allows this
scheme to work for both preemptible and non-preemptible kernels.
Alignment of idle time around jiffies ensures scalability for HZ
values. This effect can be better visualized using a Perf timechart.
The diagram below shows the behavior of kernel thread
kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
for a given "duration", then relinquishes the CPU to other tasks,
until the next time interval.

The NOHZ schedule tick is disabled during idle time, but interrupts
are not masked. Tests show that the extra wakeups from scheduler tick
have a dramatic impact on the effectiveness of the powerclamp driver
on large scale systems (Westmere system with 80 processors).

::

  CPU0
                    ____________          ____________
  kidle_inject/0   |   sleep    |  mwait |  sleep     |
          _________|            |________|            |_______
                                 duration
  CPU1
                    ____________          ____________
  kidle_inject/1   |   sleep    |  mwait |  sleep     |
          _________|            |________|            |_______
                                ^
                                |
                                |
                                roundup(jiffies, interval)

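The ``roundup(jiffies, interval)`` alignment in the diagram can be
illustrated with a small Python sketch. The jiffies values and the
interval of 6 are made-up numbers; the point is only that threads which
read slightly different timestamps still compute the same wakeup
boundary.

```python
def roundup(jiffies, interval):
    """Smallest multiple of interval that is greater than or equal to jiffies."""
    return ((jiffies + interval - 1) // interval) * interval


# Hypothetical per-CPU timestamps taken a few jiffies apart.
observed = {0: 1003, 1: 1004, 2: 1006}
wakeups = {cpu: roundup(now, 6) for cpu, now in observed.items()}
print(wakeups)
```

Because every thread rounds up to the same multiple of the interval, the
per-CPU idle periods start together without any explicit cross-CPU
handshake.
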
Only one CPU is allowed to collect statistics and update global
control parameters. This CPU is referred to as the controlling CPU in
this document. The controlling CPU is elected at runtime, with a
policy that favors the BSP (bootstrap processor), taking into account
the possibility of a CPU hot-plug.

In terms of dynamics of the idle control system, package level idle
time is considered largely as a non-causal system where its behavior
cannot be based on the past or current input. Therefore, the
intel_powerclamp driver attempts to enforce the desired idle time
instantly as given input (target idle ratio). After injection,
powerclamp monitors the actual idle for a given time window and
adjusts the next injection accordingly to avoid over/under correction.

When used in a causal control system, such as a temperature control,
it is up to the user of this driver to implement algorithms where
past samples and outputs are included in the feedback. For example, a
PID-based thermal controller can use the powerclamp driver to
maintain a desired target temperature, based on integral and
derivative gains of the past samples.


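As an illustration of such a user-side controller, the following Python
sketch maps a temperature reading to an idle percentage with a textbook
PID loop. The class name, the gains and the temperatures are all
hypothetical; a real controller would need tuning and, most likely,
integral anti-windup, which is left out here for brevity.

```python
class PIDIdleController:
    """Textbook PID loop that turns a temperature error into an idle percentage."""

    def __init__(self, target_temp, kp, ki, kd, max_state=50):
        self.target_temp = target_temp
        self.kp, self.ki, self.kd = kp, ki, kd
        self.max_state = max_state  # upper bound accepted by the cooling device
        self.integral = 0.0
        self.prev_error = None

    def update(self, temp, dt):
        """Return the idle percentage to request for the latest sample."""
        error = temp - self.target_temp  # positive when running too hot
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        output = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Clamp to the range the cooling device accepts.
        return max(0, min(self.max_state, round(output)))


ctrl = PIDIdleController(target_temp=70.0, kp=2.0, ki=0.5, kd=0.0)
hot = ctrl.update(80.0, dt=1.0)
cool = ctrl.update(60.0, dt=1.0)
print(hot, cool)
```

The value returned by ``update()`` is what such a controller would write
to the cooling device on each sampling period.
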
Calibration
-----------
During scalability testing, it is observed that synchronized actions
among CPUs become challenging as the number of cores grows. This is
also true for the ability of a system to enter package level C-states.

To make sure the intel_powerclamp driver scales well, online
calibration is implemented. The goals for doing such a calibration
are:

a) determine the effective range of idle injection ratio
b) determine the amount of compensation needed at each target ratio

Compensation to each target ratio consists of two parts:

        a) steady state error compensation

           This is to offset the error occurring when the system can
           enter idle without extra wakeups (such as external interrupts).

        b) dynamic error compensation

           When an excessive amount of wakeups occurs during idle, an
           additional idle ratio can be added to quiet interrupts, by
           slowing down CPU activities.

A debugfs file is provided for the user to examine compensation
progress and results, such as on a Westmere system::

  [jacob@nex01 ~]$ cat /sys/kernel/debug/intel_powerclamp/powerclamp_calib
  controlling cpu: 0
  pct confidence steady dynamic (compensation)
  0       0       0       0
  1       1       0       0
  2       1       1       0
  3       3       1       0
  4       3       1       0
  5       3       1       0
  6       3       1       0
  7       3       1       0
  8       3       1       0
  ...
  30      3       2       0
  31      3       2       0
  32      3       1       0
  33      3       2       0
  34      3       1       0
  35      3       2       0
  36      3       1       0
  37      3       2       0
  38      3       1       0
  39      3       2       0
  40      3       3       0
  41      3       1       0
  42      3       2       0
  43      3       1       0
  44      3       1       0
  45      3       2       0
  46      3       3       0
  47      3       0       0
  48      3       2       0
  49      3       3       0

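A table in the format shown above is easy to post-process. The Python
sketch below parses it into a dictionary keyed by target percentage. It
assumes the layout matches the sample output exactly (a
``controlling cpu:`` line, a header line, then four integer columns);
the function name is hypothetical and the embedded sample is an excerpt
of the output above.

```python
def parse_calib(text):
    """Parse powerclamp_calib text into (controlling_cpu, rows keyed by pct)."""
    controlling_cpu = None
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("controlling cpu:"):
            controlling_cpu = int(line.split(":", 1)[1])
            continue
        fields = line.split()
        # Data rows have exactly four integer columns; the header line
        # and anything else are skipped.
        if len(fields) == 4 and all(f.isdigit() for f in fields):
            pct, confidence, steady, dynamic = map(int, fields)
            rows[pct] = {
                "confidence": confidence,
                "steady": steady,
                "dynamic": dynamic,
            }
    return controlling_cpu, rows


sample = """\
controlling cpu: 0
pct confidence steady dynamic (compensation)
0       0       0       0
1       1       0       0
2       1       1       0
40      3       3       0
"""

cpu, table = parse_calib(sample)
print(cpu, table[40])
```

On a live system the text would come from the debugfs file instead of
the embedded sample, which normally requires root privileges.
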
Calibration occurs during runtime. No offline method is available.
Steady state compensation is used only when confidence levels of all
adjacent ratios have reached a satisfactory level. A confidence level
is accumulated based on clean data collected at runtime. Data
collected during a period without extra interrupts is considered
clean.

To compensate for an excessive number of wakeups during idle,
additional idle time is injected when such a condition is detected.
Currently, we have a simple algorithm to double the injection ratio. A
possible enhancement might be to throttle the offending IRQ, such as
delaying EOI for level triggered interrupts. But it is a challenge to
be non-intrusive to the scheduler or the IRQ core code.


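The doubling rule can be pictured with the following Python sketch. It
is not taken from the driver: the function name is hypothetical, how the
wakeup condition is detected is left to the caller, and capping the
result at a maximum ratio of 50 is an assumption made for this example.

```python
def injection_ratio(target_ratio, excessive_wakeups, max_ratio=50):
    """Double the injected ratio while wakeups are excessive (illustrative only)."""
    ratio = target_ratio * 2 if excessive_wakeups else target_ratio
    # Assumption: never exceed the maximum allowed idle ratio.
    return min(ratio, max_ratio)


quiet = injection_ratio(20, excessive_wakeups=False)
noisy = injection_ratio(20, excessive_wakeups=True)
capped = injection_ratio(40, excessive_wakeups=True)
print(quiet, noisy, capped)
```
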
CPU Online/Offline
------------------
Per-CPU kernel threads are started/stopped upon receiving
notifications of CPU hotplug activities. The intel_powerclamp driver
keeps track of clamping kernel threads, even after they are migrated
to other CPUs, after a CPU offline event.


Performance Analysis
====================
This section describes the general performance data collected on
multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).

Effectiveness and Limitations
-----------------------------
By default, the idle injection ratio is capped at 50 percent. As
mentioned earlier, since interrupts are allowed during forced idle
time, excessive interrupts could result in less effectiveness. The
extreme case would be doing a ping -f to generate flooded network
interrupts without much CPU acknowledgement. In this case, little can
be done from the idle injection threads. In most normal cases, such as
using scp to copy a large file, applications can be throttled by the
powerclamp driver, since slowing down the CPU also slows down network
protocol processing, which in turn reduces interrupts.

When control parameters are changed at runtime by the controlling CPU,
it may take an additional period for the rest of the CPUs to catch up
with the changes. During this time, idle injection is out of sync, and
thus not able to enter package C-states at the expected ratio. But
this effect is minor, since in most cases the target ratio is updated
much less often than idle is injected.

Scalability
-----------
Tests also show a minor, but measurable, difference between the 4P/8P
Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
More compensation is needed on Westmere for the same target idle
ratio. The compensation also increases as the idle ratio gets larger.
This is the reason the calibration code is needed.

On the IVB 8P system, compared to an offline CPU, powerclamp can
achieve up to 40% better performance per watt (measured by a spin
counter summed over per CPU counting threads spawned for all running
CPUs).

Usage and Interfaces
====================
The powerclamp driver is registered to the generic thermal layer as a
cooling device. Currently, it’s not bound to any thermal zones::

  jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
  cur_state:0
  max_state:50
  type:intel_powerclamp

cur_state allows the user to set the desired idle percentage. Writing
0 to cur_state will stop idle injection. Writing a value between 1 and
max_state will start the idle injection. Reading cur_state returns the
actual and current idle percentage. This may not be the same value
set by the user, in that the current idle percentage depends on
workload and includes natural idle. When idle injection is disabled,
reading cur_state returns -1 instead of 0, to avoid confusing the 100%
busy state with the disabled state.

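The cur_state semantics above can be wrapped in a few helper functions.
The Python sketch below is one possible way to do it; the function names
are hypothetical. So that it runs without root or the driver loaded, the
demonstration at the end works on a throwaway directory that merely
mimics the sysfs layout, so reading back cur_state there returns the
written value rather than a measured idle percentage.

```python
import os
import tempfile


def find_powerclamp(base="/sys/class/thermal"):
    """Return the cooling device directory whose type is intel_powerclamp."""
    for name in sorted(os.listdir(base)):
        if not name.startswith("cooling_device"):
            continue
        try:
            with open(os.path.join(base, name, "type")) as f:
                if f.read().strip() == "intel_powerclamp":
                    return os.path.join(base, name)
        except OSError:
            continue  # entry without a readable "type" file
    return None


def set_idle(dev, percent):
    """Write the requested idle percentage; 0 stops idle injection."""
    with open(os.path.join(dev, "max_state")) as f:
        max_state = int(f.read())
    if not 0 <= percent <= max_state:
        raise ValueError(f"percent must be within 0..{max_state}")
    with open(os.path.join(dev, "cur_state"), "w") as f:
        f.write(str(percent))


def get_idle(dev):
    """Read cur_state; None stands for the disabled state (-1)."""
    with open(os.path.join(dev, "cur_state")) as f:
        value = int(f.read())
    return None if value < 0 else value


# Fake sysfs tree for demonstration purposes only.
base = tempfile.mkdtemp()
fake = os.path.join(base, "cooling_device14")
os.mkdir(fake)
for name, content in (("type", "intel_powerclamp\n"),
                      ("max_state", "50\n"),
                      ("cur_state", "-1\n")):
    with open(os.path.join(fake, name), "w") as f:
        f.write(content)

dev = find_powerclamp(base)
before = get_idle(dev)
set_idle(dev, 25)
after = get_idle(dev)
print(dev, before, after)
```

Against the real interface, ``find_powerclamp()`` would be called with
its default base path, and writing cur_state requires root privileges.
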
Example usage:

- To inject 25% idle time::

        $ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state"

If the system is not busy and has more than 25% idle time already,
then the powerclamp driver will not start idle injection. Using top
will not show idle injection kernel threads.

If the system is busy (spin test below) and has less than 25% natural
idle time, powerclamp kernel threads will do idle injection. Forced
idle time is accounted as normal idle, in that the common code path is
taken as the idle task.

In this example, 24.1% idle is shown. This helps the system admin or
user determine the cause of slowdown when the powerclamp driver is in
action::


  Tasks: 197 total,   1 running, 196 sleeping,   0 stopped,   0 zombie
  Cpu(s): 71.2%us,  4.7%sy,  0.0%ni, 24.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
  Mem:   3943228k total,  1689632k used,  2253596k free,    74960k buffers
  Swap:  4087804k total,        0k used,  4087804k free,   945336k cached

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
   3352 jacob     20   0  262m  644  428 S  286  0.0   0:17.16 spin
   3341 root     -51   0     0    0    0 D   25  0.0   0:01.62 kidle_inject/0
   3344 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/3
   3342 root     -51   0     0    0    0 D   25  0.0   0:01.61 kidle_inject/1
   3343 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/2
   2935 jacob     20   0  696m 125m  35m S    5  3.3   0:31.11 firefox
   1546 root      20   0  158m  20m 6640 S    3  0.5   0:26.97 Xorg
   2100 jacob     20   0 1223m  88m  30m S    3  2.3   0:23.68 compiz

Tests have shown that by using the powerclamp driver as a cooling
device, a PID-based userspace thermal controller can manage to
control CPU temperature effectively, when no other thermal influence
is added. For example, an UltraBook user can compile the kernel under
a certain temperature (below most active trip points).

Module Parameters
=================

``cpumask`` (RW)
        A bit mask of CPUs to inject idle. The format of the bitmask is
        the same as used in other subsystems, such as
        /proc/irq/\*/smp_affinity. The mask is made of comma-separated
        32-bit groups. Each CPU is one bit. For example, for a 256 CPU
        system the full mask is:
        ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff

        The rightmost mask is for CPU 0-31.

``max_idle`` (RW)
        Maximum ratio of injected idle time to total CPU time, in
        percent, in the range from 1 to 100. Even though the cooling
        device max_state is always 100 (100%), this parameter allows
        adding a max idle percent limit. The default is 50, to match
        the current implementation of the powerclamp driver. Values
        greater than 75 are not allowed if the cpumask includes every
        CPU present in the system.
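
The cpumask format described above can be generated programmatically.
The following Python sketch builds the comma-separated string from a
list of CPU numbers; the function name and the choice to zero-pad every
group to eight hex digits are assumptions made for this example.

```python
def cpumask(cpus, nr_cpus):
    """Build a comma-separated mask of 32-bit hex groups, CPU 0 rightmost."""
    bits = 0
    for cpu in cpus:
        if not 0 <= cpu < nr_cpus:
            raise ValueError(f"CPU {cpu} is outside 0..{nr_cpus - 1}")
        bits |= 1 << cpu
    groups = (nr_cpus + 31) // 32
    words = [(bits >> (32 * i)) & 0xFFFFFFFF for i in range(groups)]
    # The most significant group is printed first, so CPUs 0-31 end up
    # in the rightmost group.
    return ",".join(f"{word:08x}" for word in reversed(words))


full = cpumask(range(256), 256)
sparse = cpumask([0, 33], 64)
print(full)
print(sparse)
```

Here ``sparse`` selects CPU 0 (bit 0 of the rightmost group) and CPU 33
(bit 1 of the next group to the left).
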
