=======================
Intel Powerclamp Driver
=======================

By:
  - Arjan van de Ven <arjan@linux.intel.com>
  - Jacob Pan <jacob.jun.pan@linux.intel.com>

.. Contents:

    (*) Introduction
        - Goals and Objectives

    (*) Theory of Operation
        - Idle Injection
        - Calibration

    (*) Performance Analysis
        - Effectiveness and Limitations
        - Power vs Performance
        - Scalability
        - Calibration
        - Comparison with Alternative Techniques

    (*) Usage and Interfaces
        - Generic Thermal Layer (sysfs)
        - Kernel APIs (TBD)

    (*) Module Parameters

INTRODUCTION
============

Consider the situation where a system’s power consumption must be
reduced at runtime, due to power budget, thermal constraint, or noise
level, and where active cooling is not preferred. Software managed
passive power reduction must be performed to prevent the hardware
actions that are designed for catastrophic scenarios.

Currently, P-states, T-states (clock modulation), and CPU offlining
are used for CPU throttling.

On Intel CPUs, C-states provide effective power reduction, but so far
they’re only used opportunistically, based on workload. With the
development of the intel_powerclamp driver, a method of synchronizing
idle injection across all online CPU threads was introduced. The goal
is to achieve forced and controllable C-state residency.

Tests and analysis have been carried out in the areas of power,
performance, scalability, and user experience. In many cases, a clear
advantage is shown over taking the CPU offline or modulating the CPU
clock.


THEORY OF OPERATION
===================

Idle Injection
--------------

On modern Intel processors (Nehalem or later), package level C-state
residency is available in MSRs, thus also available to the kernel.

These MSRs are::

  #define MSR_PKG_C2_RESIDENCY      0x60D
  #define MSR_PKG_C3_RESIDENCY      0x3F8
  #define MSR_PKG_C6_RESIDENCY      0x3F9
  #define MSR_PKG_C7_RESIDENCY      0x3FA

If the kernel can also inject idle time into the system, then a
closed-loop control system can be established that manages package
level C-state. The intel_powerclamp driver is conceived as such a
control system, where the target set point is a user-selected idle
ratio (based on power reduction), and the error is the difference
between the actual package level C-state residency ratio and the target idle
ratio.

Injection is controlled by high priority kernel threads, spawned for
each online CPU.

These kernel threads, with SCHED_FIFO class, are created to perform
clamping actions of controlled duty ratio and duration. Each per-CPU
thread synchronizes its idle time and duration, based on the rounding
of jiffies, so accumulated errors can be prevented to avoid a jittery
effect. Threads are also bound to the CPU such that they cannot be
migrated, unless the CPU is taken offline.
In this case, threads
belonging to the offlined CPUs will be terminated immediately.

Running as SCHED_FIFO at a relatively high priority also allows this
scheme to work for both preemptible and non-preemptible kernels.
Alignment of idle time around jiffies ensures scalability for HZ
values. This effect can be better visualized using a Perf timechart.
The following diagram shows the behavior of kernel thread
kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
for a given "duration", then relinquishes the CPU to other tasks,
until the next time interval.

The NOHZ schedule tick is disabled during idle time, but interrupts
are not masked. Tests show that the extra wakeups from scheduler tick
have a dramatic impact on the effectiveness of the powerclamp driver
on large scale systems (Westmere system with 80 processors).

::

                       CPU0
                      ____________          ____________
    kidle_inject/0   |   sleep    |  mwait |  sleep     |
            _________|            |________|            |_______
                                   duration
                       CPU1
                      ____________          ____________
    kidle_inject/1   |   sleep    |  mwait |  sleep     |
            _________|            |________|            |_______
                                  ^
                                  |
                                  |
                      roundup(jiffies, interval)

Only one CPU is allowed to collect statistics and update global
control parameters. This CPU is referred to as the controlling CPU in
this document. The controlling CPU is elected at runtime, with a
policy that favors BSP, taking into account the possibility of a CPU
hot-plug.

In terms of dynamics of the idle control system, package level idle
time is considered largely as a non-causal system where its behavior
cannot be based on the past or current input. Therefore, the
intel_powerclamp driver attempts to enforce the desired idle time
instantly as given input (target idle ratio). After injection,
powerclamp monitors the actual idle for a given time window and adjusts
the next injection accordingly to avoid over/under correction.
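
The jiffies alignment and the measure-then-adjust step described above
can be illustrated with a minimal Python sketch. It is not the driver's
code: the function names are hypothetical, the correction rule (add the
shortfall to the target, then clamp) is an assumed stand-in for the
driver's actual adjustment, and the residency counters are assumed to
advance at the TSC rate.

```python
def roundup(value, step):
    """Round value up to the next multiple of step (integer arithmetic)."""
    return ((value + step - 1) // step) * step


def residency_percent(msr_before, msr_after, tsc_before, tsc_after):
    """Percent of the sample window spent in package C-states.

    Assumption: the package residency counters tick at the TSC rate, so
    the summed counter deltas can be divided by the TSC delta directly.
    """
    tsc_delta = tsc_after - tsc_before
    if tsc_delta <= 0:
        return 0
    msr_delta = sum(after - before
                    for before, after in zip(msr_before, msr_after))
    return min(100, 100 * msr_delta // tsc_delta)


def next_injection(target, measured, max_ratio=50):
    """Hypothetical correction: add the shortfall to the target, then clamp."""
    error = target - measured
    return max(0, min(max_ratio, target + error))
```

With an interval of 6 jiffies, ``roundup(999, 6)`` and ``roundup(1001, 6)``
both give 1002, so two threads that wake at slightly different times still
start their next idle period on the same jiffy.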

When used in a causal control system, such as a temperature control,
it is up to the user of this driver to implement algorithms where
past samples and outputs are included in the feedback. For example, a
PID-based thermal controller can use the powerclamp driver to
maintain a desired target temperature, based on integral and
derivative gains of the past samples.



Calibration
-----------
During scalability testing, it is observed that synchronized actions
among CPUs become challenging as the number of cores grows. This is
also true for the ability of a system to enter package level C-states.

To make sure the intel_powerclamp driver scales well, online
calibration is implemented.
The goals for doing such a calibration
are:

a) determine the effective range of idle injection ratio
b) determine the amount of compensation needed at each target ratio

Compensation to each target ratio consists of two parts:

    a) steady state error compensation

       This is to offset the error occurring when the system can
       enter idle without extra wakeups (such as external interrupts).

    b) dynamic error compensation

       When an excessive amount of wakeups occurs during idle, an
       additional idle ratio can be added to quiet interrupts, by
       slowing down CPU activities.

A debugfs file is provided for the user to examine compensation
progress and results, such as on a Westmere system::

  [jacob@nex01 ~]$ cat /sys/kernel/debug/intel_powerclamp/powerclamp_calib
  controlling cpu: 0
  pct confidence steady dynamic (compensation)
  0       0       0       0
  1       1       0       0
  2       1       1       0
  3       3       1       0
  4       3       1       0
  5       3       1       0
  6       3       1       0
  7       3       1       0
  8       3       1       0
  ...
  30      3       2       0
  31      3       2       0
  32      3       1       0
  33      3       2       0
  34      3       1       0
  35      3       2       0
  36      3       1       0
  37      3       2       0
  38      3       1       0
  39      3       2       0
  40      3       3       0
  41      3       1       0
  42      3       2       0
  43      3       1       0
  44      3       1       0
  45      3       2       0
  46      3       3       0
  47      3       0       0
  48      3       2       0
  49      3       3       0

Calibration occurs during runtime. No offline method is available.
Steady state compensation is used only when confidence levels of all
adjacent ratios have reached a satisfactory level. A confidence level
is accumulated based on clean data collected at runtime. Data
collected during a period without extra interrupts is considered
clean.

To compensate for excessive amounts of wakeups during idle, additional
idle time is injected when such a condition is detected. Currently,
we have a simple algorithm to double the injection ratio. A possible
enhancement might be to throttle the offending IRQ, such as delaying
EOI for level triggered interrupts. But it is a challenge to be
non-intrusive to the scheduler or the IRQ core code.
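
To make the two compensation parts concrete, here is a minimal Python
sketch that parses text in the format of the debugfs output shown above
and applies both rules. Several details are assumptions made for
illustration only: the confidence threshold of 3, treating "adjacent
ratios" as the target and its two immediate neighbours, averaging their
steady state values, and the 50 percent cap. The function names are
hypothetical.

```python
CONFIDENCE_OK = 3  # assumed threshold; the text only says "satisfactory"


def parse_calib(text):
    """Map pct -> (confidence, steady, dynamic) from powerclamp_calib text."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows are exactly four integers; headers and "..." are skipped.
        if len(fields) == 4 and all(f.isdigit() for f in fields):
            pct, confidence, steady, dynamic = map(int, fields)
            table[pct] = (confidence, steady, dynamic)
    return table


def steady_compensation(table, ratio):
    """Average steady compensation of ratio-1, ratio, ratio+1, or 0.

    Returns 0 unless all three rows exist and are confident enough.
    """
    rows = [table.get(r) for r in (ratio - 1, ratio, ratio + 1)]
    if any(row is None or row[0] < CONFIDENCE_OK for row in rows):
        return 0
    return sum(row[1] for row in rows) // 3


def compensated_ratio(target, table, excessive_wakeups, max_ratio=50):
    """Apply steady compensation, double on excessive wakeups, then clamp."""
    ratio = target + steady_compensation(table, target)
    if excessive_wakeups:
        ratio *= 2  # the "simple algorithm" mentioned in the text
    return min(ratio, max_ratio)
```

For the rows listed above, a target of 40 has confident neighbours at 39
and 41, so the sketch would add their averaged steady state value, while
a target of 2 gets no steady state compensation because row 1 has only
reached confidence 1.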


CPU Online/Offline
------------------
Per-CPU kernel threads are started/stopped upon receiving
notifications of CPU hotplug activities. The intel_powerclamp driver
keeps track of clamping kernel threads, even after they are migrated
to other CPUs, after a CPU offline event.


Performance Analysis
====================
This section describes the general performance data collected on
multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).

Effectiveness and Limitations
-----------------------------
The maximum idle injection ratio is capped at 50 percent. As
mentioned earlier, since interrupts are allowed during forced idle
time, excessive interrupts could result in less effectiveness. The
extreme case would be doing a ping -f to generate flooded network
interrupts without much CPU acknowledgement. In this case, little can
be done from the idle injection threads.
In most
normal cases, such as copying a large file with scp, applications can
be throttled by the powerclamp driver, since slowing down the CPU also
slows down network protocol processing, which in turn reduces
interrupts.

When the controlling CPU changes control parameters at runtime, it
may take an additional period for the rest of the CPUs to catch up
with the changes. During this time, idle injection is out of sync,
and thus not able to enter package C-states at the expected ratio. But
this effect is minor, because in most cases the target ratio is
updated much less frequently than the idle injection frequency.

Scalability
-----------
Tests also show a minor, but measurable, difference between the 4P/8P
Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
More compensation is needed on Westmere for the same amount of
target idle ratio. The compensation also increases as the idle ratio
gets larger. This is why the calibration code is needed.

On the IVB 8P system, compared to taking CPUs offline, powerclamp can
achieve up to 40% better performance per watt (measured by a spin
counter summed over per CPU counting threads spawned for all running
CPUs).

Usage and Interfaces
====================
The powerclamp driver is registered to the generic thermal layer as a
cooling device. Currently, it’s not bound to any thermal zones::

    jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
    cur_state:0
    max_state:50
    type:intel_powerclamp

cur_state allows the user to set the desired idle percentage. Writing 0
to cur_state will stop idle injection. Writing a value between 1 and
max_state will start the idle injection. Reading cur_state returns the
actual and current idle percentage. This may not be the same value
set by the user, in that the current idle percentage depends on
workload and includes natural idle. When idle injection is disabled,
reading cur_state returns -1 instead of 0, to avoid confusing the
100% busy state with the disabled state.
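
The cur_state semantics above can be wrapped in a small userspace
helper. The following Python sketch is an illustration only: the
function names are hypothetical, and it relies solely on the ``type``,
``max_state``, and ``cur_state`` files shown in the listing above. The
base directory is a parameter so the logic can be exercised without
real hardware; writing to the real sysfs files requires root.

```python
from pathlib import Path


def find_powerclamp(base="/sys/class/thermal"):
    """Return the cooling device directory whose type is intel_powerclamp."""
    for dev in sorted(Path(base).glob("cooling_device*")):
        type_file = dev / "type"
        if type_file.is_file() and \
                type_file.read_text().strip() == "intel_powerclamp":
            return dev
    return None


def set_idle_percent(dev, percent):
    """Write the target idle percentage; 0 stops idle injection."""
    max_state = int((dev / "max_state").read_text())
    if not 0 <= percent <= max_state:
        raise ValueError(f"percent must be between 0 and {max_state}")
    (dev / "cur_state").write_text(f"{percent}\n")


def read_idle_percent(dev):
    """Return the current idle percentage, or None when injection is off.

    The driver reports -1 for the disabled state, which is mapped to
    None here so callers cannot mistake it for a 100% busy system.
    """
    value = int((dev / "cur_state").read_text())
    return None if value == -1 else value
```

Looking the device up by ``type`` avoids hard-coding the cooling device
index, which differs between systems.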

Example usage:

- To inject 25% idle time::

    $ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state"

If the system is not busy and has more than 25% idle time already,
then the powerclamp driver will not start idle injection. In that
case, top will not show the idle injection kernel threads.

If the system is busy (spin test below) and has less than 25% natural
idle time, powerclamp kernel threads will do idle injection. Forced
idle time is accounted as normal idle, in that the same code path is
taken as for the idle task.

In this example, 24.1% idle is shown.
This helps the system admin or
user determine the cause of slowdown, when the powerclamp driver is in action::


    Tasks: 197 total,   1 running, 196 sleeping,   0 stopped,   0 zombie
    Cpu(s): 71.2%us,  4.7%sy,  0.0%ni, 24.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:   3943228k total,  1689632k used,  2253596k free,    74960k buffers
    Swap:  4087804k total,        0k used,  4087804k free,   945336k cached

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
     3352 jacob     20   0  262m  644  428 S  286  0.0   0:17.16 spin
     3341 root     -51   0     0    0    0 D   25  0.0   0:01.62 kidle_inject/0
     3344 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/3
     3342 root     -51   0     0    0    0 D   25  0.0   0:01.61 kidle_inject/1
     3343 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/2
     2935 jacob     20   0  696m 125m  35m S    5  3.3   0:31.11 firefox
     1546 root      20   0  158m  20m 6640 S    3  0.5   0:26.97 Xorg
     2100 jacob     20   0 1223m  88m  30m S    3  2.3   0:23.68 compiz

Tests have shown that by using the powerclamp driver as a cooling
device, a PID based userspace thermal controller can manage to
control CPU temperature effectively, when no other thermal influence
is added.
For example, an Ultrabook user can compile the kernel under a
certain temperature (below most active trip points).

Module Parameters
=================

``cpumask`` (RW)
    A bit mask of CPUs to inject idle. The format of the bitmask is the same
    as used in other subsystems, like in /proc/irq/\*/smp_affinity. The mask
    is comma separated 32 bit groups. Each CPU is one bit. For example, for a
    256 CPU system the full mask is:
    ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff

    The rightmost mask is for CPUs 0-31.

``max_idle`` (RW)
    Maximum injected idle time to the total CPU time ratio in percent, in the
    range from 1 to 100. Even if the cooling device max_state is always 100
    (100%), this parameter allows adding a max idle percent limit. The default
    is 50, to match the current implementation of the powerclamp driver.
    Values above 75 are not allowed if the cpumask includes every CPU present
    in the system.
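
The cpumask format and the max_idle limits can be sketched in Python as
follows. The helper names are hypothetical, and zero-padding every group
to eight hex digits is a formatting choice made here for readability,
not something the parameter description requires.

```python
def cpus_to_mask(cpus, nr_cpus):
    """Format CPU numbers as comma separated 32-bit hex groups.

    The rightmost group covers CPUs 0-31, the next one CPUs 32-63, and
    so on, as described for the cpumask parameter.
    """
    groups = (nr_cpus + 31) // 32
    bits = 0
    for cpu in cpus:
        if not 0 <= cpu < nr_cpus:
            raise ValueError(f"CPU {cpu} is out of range")
        bits |= 1 << cpu
    words = [(bits >> (32 * i)) & 0xFFFFFFFF for i in range(groups)]
    # Highest-numbered group first, so the group for CPU 0 ends up rightmost.
    return ",".join(f"{word:08x}" for word in reversed(words))


def max_idle_ok(max_idle, mask_covers_all_cpus):
    """Check max_idle against the limits stated for the parameter."""
    if not 1 <= max_idle <= 100:
        return False
    # Values above 75 are rejected when every present CPU is in the mask.
    return not (mask_covers_all_cpus and max_idle > 75)
```

For a 64 CPU system, ``cpus_to_mask([0, 1], 64)`` yields
``00000000,00000003``, and selecting only CPU 32 yields
``00000001,00000000``.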