=======================
Intel Powerclamp Driver
=======================

By:
  - Arjan van de Ven <arjan@linux.intel.com>
  - Jacob Pan <jacob.jun.pan@linux.intel.com>

.. Contents:

    (*) Introduction
        - Goals and Objectives

    (*) Theory of Operation
        - Idle Injection
        - Calibration

    (*) Performance Analysis
        - Effectiveness and Limitations
        - Power vs Performance
        - Scalability
        - Calibration
        - Comparison with Alternative Tech

    (*) Usage and Interfaces
        - Generic Thermal Layer (sysfs)
        - Kernel APIs (TBD)

    (*) Module Parameters

INTRODUCTION
============

Consider the situation where a system’s powe
reduced at runtime, due to power budget, therm
level, and where active cooling is not preferr
passive power reduction must be performed to p
actions that are designed for catastrophic sce

Currently, P-states, T-states (clock modulatio
are used for CPU throttling.

On Intel CPUs, C-states provide effective powe
they’re only used opportunistically, based o
development of intel_powerclamp driver, the me
idle injection across all online CPU threads w
is to achieve forced and controllable C-state

Test/Analysis has been made in the areas of po
scalability, and user experience. In many case
shown over taking the CPU offline or modulatin


THEORY OF OPERATION
===================

Idle Injection
--------------

On modern Intel processors (Nehalem or later),
residency is available in MSRs, thus also avai

These MSRs are::

      #define MSR_PKG_C2_RESIDENCY      0x60D
      #define MSR_PKG_C3_RESIDENCY      0x3F8
      #define MSR_PKG_C6_RESIDENCY      0x3F9
      #define MSR_PKG_C7_RESIDENCY      0x3FA

If the kernel can also inject idle time to the
closed-loop control system can be established
level C-state.
The intel_powerclamp driver is
control system, where the target set point is
ratio (based on power reduction), and the erro
between the actual package level C-state resid
ratio.

Injection is controlled by high priority kerne
each online CPU.

These kernel threads, with SCHED_FIFO class, a
clamping actions of controlled duty ratio and
thread synchronizes its idle time and duration
of jiffies, so accumulated errors can be preve
effect. Threads are also bound to the CPU such
migrated, unless the CPU is taken offline. In
belong to the offlined CPUs will be terminated

Running as SCHED_FIFO and relatively high prio
scheme to work for both preemptible and non-pr
Alignment of idle time around jiffies ensures
values. This effect can be better visualized u
The following diagram shows the behavior of ke
kidle_inject/cpu. During idle injection, it ru
for a given "duration", then relinquishes the
until the next time interval.

The NOHZ schedule tick is disabled during idle
are not masked. Tests show that the extra wake
have a dramatic impact on the effectiveness of
on large scale systems (Westmere system with 8

::

  CPU0
                    ____________          ____
  kidle_inject/0   |   sleep    |  mwait |  sl
          _________|            |________|
                                 duration
  CPU1
                    ____________          ____
  kidle_inject/1   |   sleep    |  mwait |  sl
          _________|            |________|
                                ^
                                |
                                |
                                roundup(jiffie

Only one CPU is allowed to collect statistics
control parameters. This CPU is referred to as
this document. The controlling CPU is elected
policy that favors BSP, taking into account th
hot-plug.

In terms of dynamics of the idle control syste
time is considered largely as a non-causal sys
cannot be based on the past or current input.
intel_powerclamp driver attempts to enforce th
instantly as given input (target idle ratio).
powerclamp monitors the actual idle for a give
the next injection accordingly to avoid over/u

When used in a causal control system, such as
it is up to the user of this driver to impleme
past samples and outputs are included in the f
PID-based thermal controller can use the power
maintain a desired target temperature, based o
derivative gains of the past samples.



Calibration
-----------
During scalability testing, it is observed tha
among CPUs become challenging as the number of
also true for the ability of a system to enter

To make sure the intel_powerclamp driver scale
calibration is implemented. The goals for doin
are:

a) determine the effective range of idle injec
b) determine the amount of compensation needed

Compensation to each target ratio consists of

    a) steady state error compensation

       This is to offset the error occurri
       enter idle without extra wakeups (s

    b) dynamic error compensation

       When an excessive amount of wakeups
       additional idle ratio can be added
       slowing down CPU activities.

A debugfs file is provided for the user to exa
progress and results, such as on a Westmere sy

::

  [jacob@nex01 ~]$ cat
  /sys/kernel/debug/intel_powerclamp/powerclam
  controlling cpu: 0
  pct confidence steady dynamic (compensation)
  0       0       0       0
  1       1       0       0
  2       1       1       0
  3       3       1       0
  4       3       1       0
  5       3       1       0
  6       3       1       0
  7       3       1       0
  8       3       1       0
  ...
  30      3       2       0
  31      3       2       0
  32      3       1       0
  33      3       2       0
  34      3       1       0
  35      3       2       0
  36      3       1       0
  37      3       2       0
  38      3       1       0
  39      3       2       0
  40      3       3       0
  41      3       1       0
  42      3       2       0
  43      3       1       0
  44      3       1       0
  45      3       2       0
  46      3       3       0
  47      3       0       0
  48      3       2       0
  49      3       3       0

Calibration occurs during runtime.
No offline
Steady state compensation is used only when co
adjacent ratios have reached satisfactory leve
is accumulated based on clean data collected a
collected during a period without extra interr
clean.

To compensate for excessive amounts of wakeup
idle time is injected when such a condition is
we have a simple algorithm to double the injec
enhancement might be to throttle the offending
EOI for level triggered interrupts. But it is
non-intrusive to the scheduler or the IRQ core


CPU Online/Offline
------------------
Per-CPU kernel threads are started/stopped upo
notifications of CPU hotplug activities. The i
keeps track of clamping kernel threads, even a
to other CPUs, after a CPU offline event.


Performance Analysis
====================
This section describes the general performance
multiple systems, including Westmere (80P) and

Effectiveness and Limitations
-----------------------------
The maximum range that idle injection is allow
percent. As mentioned earlier, since interrupt
forced idle time, excessive interrupts could r
effectiveness. The extreme case would be doing
flooded network interrupts without much CPU ac
case, little can be done from the idle injecti
normal cases, such as scp a large file, applic
by the powerclamp driver, since slowing down t
network protocol processing, which in turn red

When control parameters change at runtime by t
may take an additional period for the rest of
with the changes. During this time, idle injec
thus not able to enter package C-states at th
this effect is minor, in that in most cases ch
ratio is updated much less frequently than the
frequency.

Scalability
-----------
Tests also show a minor, but measurable, diffe
Ivy Bridge system and the 80P Westmere server
More compensation is needed on Westmere for th
target idle ratio. The compensation also incre
gets larger. The above reason constitutes the
calibration code.

On the IVB 8P system, compared to an offline C
achieve up to 40% better performance per watt.
counter summed over per CPU counting threads s
CPUs).

Usage and Interfaces
====================
The powerclamp driver is registered to the gen
cooling device. Currently, it’s not bound to

::

  jacob@chromoly:/sys/class/thermal/cooling_de
  cur_state:0
  max_state:50
  type:intel_powerclamp

cur_state allows user to set the desired idle
cur_state will stop idle injection. Writing a
max_state will start the idle injection. Readi
actual and current idle percentage. This may n
set by the user in that current idle percentag
and includes natural idle. When idle injection
cur_state returns value -1 instead of 0 which
100% busy state with the disabled state.

Example usage:

- To inject 25% idle time::

    $ sudo sh -c "echo 25 > /sys/class/the

If the system is not busy and has more than 25
then the powerclamp driver will not start idle
will not show idle injection kernel threads.

If the system is busy (spin test below) and ha
idle time, powerclamp kernel threads will do i
idle time is accounted as normal idle in that
taken as the idle task.

In this example, 24.1% idle is shown.
This hel
user determine the cause of slowdown, when a p

::

  Tasks: 197 total,   1 running, 196 sleeping,
  Cpu(s): 71.2%us,  4.7%sy,  0.0%ni, 24.1%id,
  Mem:   3943228k total,  1689632k used,  2253
  Swap:  4087804k total,        0k used,  4087

    PID USER      PR  NI  VIRT  RES  SHR S %CP
   3352 jacob     20   0  262m  644  428 S  28
   3341 root     -51   0     0    0    0 D   2
   3344 root     -51   0     0    0    0 D   2
   3342 root     -51   0     0    0    0 D   2
   3343 root     -51   0     0    0    0 D   2
   2935 jacob     20   0  696m 125m  35m S
   1546 root      20   0  158m  20m 6640 S
   2100 jacob     20   0 1223m  88m  30m S

Tests have shown that by using the powerclamp
device, a PID based userspace thermal controll
control CPU temperature effectively, when no o
is added. For example, an UltraBook user can co
certain temperature (below most active trip po

Module Parameters
=================

``cpumask`` (RW)
    A bit mask of CPUs to inject idle. The
    used in other subsystems like in /proc
    comma separated 32 bit groups. Each CP
    CPU system the full mask is:
    ffffffff,ffffffff,ffffffff,ffffffff,ff

    The rightmost mask is for CPU 0-31.

``max_idle`` (RW)
    Maximum injected idle time to the tota
    from 1 to 100. Even if the cooling dev
    this parameter allows to add a max idl
    to match the current implementation of
    allow value more than 75, if the cpuma
    the system.
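
The ``cpumask`` format described above (comma separated 32 bit hexadecimal
groups, with the rightmost group covering CPUs 0-31) can be sketched in a few
lines of userspace Python. This is only an illustration: the helper name
``cpumask_string`` and its arguments are hypothetical and are not part of the
driver, and zero-padding every group to eight digits is an assumption made
here for readability.

```python
def cpumask_string(cpus, nr_cpus):
    """Format CPU numbers as comma separated 32 bit hex groups.

    Hypothetical helper for illustration only. The rightmost group
    covers CPUs 0-31, the next group to the left covers 32-63, etc.
    """
    if nr_cpus <= 0:
        raise ValueError("nr_cpus must be positive")

    # Build one big integer with bit N set for every selected CPU N.
    bits = 0
    for cpu in cpus:
        if not 0 <= cpu < nr_cpus:
            raise ValueError(f"CPU {cpu} is outside 0..{nr_cpus - 1}")
        bits |= 1 << cpu

    # Split the integer into 32 bit groups, lowest CPUs first.
    ngroups = (nr_cpus + 31) // 32
    groups = [
        f"{(bits >> (32 * i)) & 0xFFFFFFFF:08x}"  # assumption: pad to 8 digits
        for i in range(ngroups)
    ]

    # The lowest group must end up rightmost in the string.
    return ",".join(reversed(groups))


if __name__ == "__main__":
    # Select CPUs 0-3 and CPU 32 on a hypothetical 64 CPU system.
    print(cpumask_string([0, 1, 2, 3, 32], 64))
```

The resulting string could then be written to the ``cpumask`` module
parameter. Module parameters are normally exposed under
``/sys/module/<module>/parameters/``, but the exact path for this driver is
not given in this document and should be checked on the target system.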