Linux/Documentation/admin-guide/thermal/intel_powerclamp.rst

Differences between /Documentation/admin-guide/thermal/intel_powerclamp.rst (Version linux-6.12-rc7) and /Documentation/admin-guide/thermal/intel_powerclamp.rst (Version linux-5.15.171)


  1 =======================                           
  2 Intel Powerclamp Driver                           
  3 =======================                           
  4                                                   
  5 By:                                               
  6   - Arjan van de Ven <arjan@linux.intel.com>       
  7   - Jacob Pan <jacob.jun.pan@linux.intel.com>      
  8                                                   
  9 .. Contents:                                      
 10                                                   
 11         (*) Introduction                          
 12             - Goals and Objectives                
 13                                                   
 14         (*) Theory of Operation                   
 15             - Idle Injection                      
 16             - Calibration                         
 17                                                   
 18         (*) Performance Analysis                  
 19             - Effectiveness and Limitations       
 20             - Power vs Performance                
 21             - Scalability                         
 22             - Calibration                         
 23             - Comparison with Alternative Tech    
 24                                                   
 25         (*) Usage and Interfaces                  
 26             - Generic Thermal Layer (sysfs)       
 27             - Kernel APIs (TBD)                   
 28                                                   
 29         (*) Module Parameters                     
 30                                                   
 31 INTRODUCTION                                      
 32 ============                                      
 33                                                   
 34 Consider the situation where a system's powe
 35 reduced at runtime, due to power budget, therm    
 36 level, and where active cooling is not preferr    
 37 passive power reduction must be performed to p    
 38 actions that are designed for catastrophic sce    
 39                                                   
 40 Currently, P-states, T-states (clock modulatio    
 41 are used for CPU throttling.                      
 42                                                   
 43 On Intel CPUs, C-states provide effective powe    
 44 they're only used opportunistically, based o
 45 development of intel_powerclamp driver, the me    
 46 idle injection across all online CPU threads w    
 47 is to achieve forced and controllable C-state     
 48                                                   
 49 Tests and analysis have been done in the areas of po
 50 scalability, and user experience. In many case    
 51 shown over taking the CPU offline or modulatin    
 52                                                   
 53                                                   
 54 THEORY OF OPERATION                               
 55 ===================                               
 56                                                   
 57 Idle Injection                                    
 58 --------------                                    
 59                                                   
 60 On modern Intel processors (Nehalem or later),    
 61 residency is available in MSRs, thus also avai    
 62                                                   
 63 These MSRs are::                                  
 64                                                   
 65       #define MSR_PKG_C2_RESIDENCY      0x60D     
 66       #define MSR_PKG_C3_RESIDENCY      0x3F8     
 67       #define MSR_PKG_C6_RESIDENCY      0x3F9     
 68       #define MSR_PKG_C7_RESIDENCY      0x3FA     
 69                                                   
 70 If the kernel can also inject idle time to the    
 71 closed-loop control system can be established     
 72 level C-state. The intel_powerclamp driver is     
 73 control system, where the target set point is     
 74 ratio (based on power reduction), and the erro    
 75 between the actual package level C-state resid    
 76 ratio.                                            
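
The feedback term described above can be sketched in a few lines. This is an illustration only, not the driver's code: the helper names are hypothetical, and it assumes that the package residency counters tick at the same rate as the TSC, so that a counter delta divided by a TSC delta gives a residency ratio. Only the four MSR addresses are taken from the text.

```python
# Illustrative sketch; helper names are hypothetical, not kernel symbols.
# MSR addresses as listed in the document.
MSR_PKG_C2_RESIDENCY = 0x60D
MSR_PKG_C3_RESIDENCY = 0x3F8
MSR_PKG_C6_RESIDENCY = 0x3F9
MSR_PKG_C7_RESIDENCY = 0x3FA


def pkg_cstate_ratio(msr_before, msr_after, tsc_before, tsc_after):
    """Return the percentage (0-100) of elapsed TSC ticks spent in
    package C-states, given two snapshots of the residency counters.

    Assumption: the residency counters and the TSC tick at the same rate.
    Counter wraparound is ignored to keep the sketch short.
    """
    tsc_delta = tsc_after - tsc_before
    if tsc_delta <= 0:
        return 0
    msr_delta = sum(msr_after) - sum(msr_before)
    return min(100, 100 * msr_delta // tsc_delta)


def control_error(target_ratio, actual_ratio):
    """Error signal of the closed loop: set point minus measurement."""
    return target_ratio - actual_ratio
```

A positive error means the package spent less time in C-states than requested, so a controller built on this signal would inject more idle time in the next window.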
 77                                                   
 78 Injection is controlled by high priority kerne    
 79 each online CPU.                                  
 80                                                   
 81 These kernel threads, with SCHED_FIFO class, a    
 82 clamping actions of controlled duty ratio and     
 83 thread synchronizes its idle time and duration    
 84 of jiffies, so accumulated errors can be preve    
 85 effect. Threads are also bound to the CPU such    
 86 migrated, unless the CPU is taken offline. In     
 87 belong to the offlined CPUs will be terminated    
 88                                                   
 89 Running as SCHED_FIFO and relatively high prio    
 90 scheme to work for both preemptible and non-pr    
 91 Alignment of idle time around jiffies ensures     
 92 values. This effect can be better visualized u    
 93 The following diagram shows the behavior of ke    
 94 kidle_inject/cpu. During idle injection, it ru    
 95 for a given "duration", then relinquishes the     
 96 until the next time interval.                     
 97                                                   
 98 The NOHZ schedule tick is disabled during idle    
 99 are not masked. Tests show that the extra wake    
100 have a dramatic impact on the effectiveness of    
101 on large scale systems (Westmere system with 8    
102                                                   
103 ::                                                
104                                                   
105   CPU0                                            
106                     ____________          ____    
107   kidle_inject/0   |   sleep    |  mwait |  sl    
108           _________|            |________|        
109                                  duration         
110   CPU1                                            
111                     ____________          ____    
112   kidle_inject/1   |   sleep    |  mwait |  sl    
113           _________|            |________|        
114                                 ^                 
115                                 |                 
116                                 |                 
117                                 roundup(jiffie    
118                                                   
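
The alignment shown in the diagram can be sketched as follows. This is a simplified model, not the driver's implementation: the function names are hypothetical, and the interval formula (one idle period of `duration` per interval, so that duration / interval equals the target ratio) is an assumption made for illustration. The point is that every per-CPU thread rounds the same jiffies clock up to the same boundary, so the threads start idling together without exchanging messages.

```python
# Illustrative sketch; names and the interval formula are assumptions.

def roundup(x, multiple):
    """Round x up to the next multiple of `multiple` (integers)."""
    return ((x + multiple - 1) // multiple) * multiple


def injection_interval(duration_jiffies, target_ratio):
    """Interval between idle periods so that duration / interval
    is approximately target_ratio percent."""
    if not 0 < target_ratio <= 100:
        raise ValueError("target_ratio must be in 1..100")
    return max(duration_jiffies, duration_jiffies * 100 // target_ratio)


def sleep_until_aligned(now_jiffies, interval):
    """Jiffies a thread sleeps so that it wakes on an interval boundary."""
    return roundup(now_jiffies, interval) - now_jiffies
```

With a duration of 6 jiffies and a 25% target, the interval is 24 jiffies. A thread that runs at jiffy 1001 and another that runs at jiffy 1007 both compute the same wakeup point, 1008, which is what keeps their idle periods overlapping.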
119 Only one CPU is allowed to collect statistics     
120 control parameters. This CPU is referred to as    
121 this document. The controlling CPU is elected     
122 policy that favors BSP, taking into account th    
123 hot-plug.                                         
124                                                   
125 In terms of dynamics of the idle control syste    
126 time is considered largely as a non-causal sys    
127 cannot be based on the past or current input.     
128 intel_powerclamp driver attempts to enforce th    
129 instantly as given input (target idle ratio).     
130 powerclamp monitors the actual idle for a give    
131 the next injection accordingly to avoid over/u    
132                                                   
133 When used in a causal control system, such as     
134 it is up to the user of this driver to impleme    
135 past samples and outputs are included in the f    
136 PID-based thermal controller can use the power    
137 maintain a desired target temperature, based o    
138 derivative gains of the past samples.             
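
A userspace controller of the kind mentioned above might look like the following minimal sketch. It is not part of the driver: the class name, gains, and the default limit of 50 are hypothetical choices for illustration, and a production controller would also need anti-windup and filtering of the derivative term. The output is an idle percentage that a caller could write to the cooling device.

```python
# Illustrative sketch of a PID loop that outputs an idle percentage.
# Class name, gains, and limits are assumptions, not driver values.
from dataclasses import dataclass


@dataclass
class PidController:
    kp: float
    ki: float
    kd: float
    max_state: int = 50        # upper bound for the idle percentage
    integral: float = 0.0      # accumulated error (integral term state)
    prev_error: float = 0.0    # last error (derivative term state)

    def step(self, temp_c, target_c, dt):
        """Return the idle percentage to request for the next period."""
        error = temp_c - target_c          # positive when too hot
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        out = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Clamp to the range the cooling device accepts.
        return max(0, min(self.max_state, round(out)))
```

Because the controller keeps the integral and the previous error between calls, the requested idle ratio depends on past samples, which is the memory that the idle injection itself does not provide.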
139                                                   
140                                                   
141                                                   
142 Calibration                                       
143 -----------                                       
144 During scalability testing, it is observed tha    
145 among CPUs become challenging as the number of    
146 also true for the ability of a system to enter    
147                                                   
148 To make sure the intel_powerclamp driver scale    
149 calibration is implemented. The goals for doin    
150 are:                                              
151                                                   
152 a) determine the effective range of idle injec    
153 b) determine the amount of compensation needed    
154                                                   
155 Compensation to each target ratio consists of     
156                                                   
157         a) steady state error compensation        
158                                                   
159            This is to offset the error occurri    
160            enter idle without extra wakeups (s    
161                                                   
162         b) dynamic error compensation             
163                                                   
164            When an excessive amount of wakeups    
165            additional idle ratio can be added     
166            slowing down CPU activities.           
167                                                   
168 A debugfs file is provided for the user to exa    
169 progress and results, such as on a Westmere sy    
170                                                   
171   [jacob@nex01 ~]$ cat                            
172   /sys/kernel/debug/intel_powerclamp/powerclam    
173   controlling cpu: 0                              
174   pct confidence steady dynamic (compensation)    
175   0       0       0       0                       
176   1       1       0       0                       
177   2       1       1       0                       
178   3       3       1       0                       
179   4       3       1       0                       
180   5       3       1       0                       
181   6       3       1       0                       
182   7       3       1       0                       
183   8       3       1       0                       
184   ...                                             
185   30      3       2       0                       
186   31      3       2       0                       
187   32      3       1       0                       
188   33      3       2       0                       
189   34      3       1       0                       
190   35      3       2       0                       
191   36      3       1       0                       
192   37      3       2       0                       
193   38      3       1       0                       
194   39      3       2       0                       
195   40      3       3       0                       
196   41      3       1       0                       
197   42      3       2       0                       
198   43      3       1       0                       
199   44      3       1       0                       
200   45      3       2       0                       
201   46      3       3       0                       
202   47      3       0       0                       
203   48      3       2       0                       
204   49      3       3       0                       
205                                                   
206 Calibration occurs during runtime. No offline     
207 Steady state compensation is used only when co    
208 adjacent ratios have reached satisfactory leve    
209 is accumulated based on clean data collected a    
210 collected during a period without extra interr    
211 clean.                                            
212                                                   
213 To compensate for excessive amounts of wakeup     
214 idle time is injected when such a condition is    
215 we have a simple algorithm to double the injec    
216 enhancement might be to throttle the offending    
217 EOI for level triggered interrupts. But it is     
218 non-intrusive to the scheduler or the IRQ core    
219                                                   
220                                                   
221 CPU Online/Offline                                
222 ------------------                                
223 Per-CPU kernel threads are started/stopped upo    
224 notifications of CPU hotplug activities. The i    
225 keeps track of clamping kernel threads, even a    
226 to other CPUs, after a CPU offline event.         
227                                                   
228                                                   
229 Performance Analysis                              
230 ====================                              
231 This section describes the general performance    
232 multiple systems, including Westmere (80P) and    
233                                                   
234 Effectiveness and Limitations                     
235 -----------------------------                     
236 The maximum range that idle injection is allow    
237 percent. As mentioned earlier, since interrupt    
238 forced idle time, excessive interrupts could r    
239 effectiveness. The extreme case would be doing    
240 flooded network interrupts without much CPU ac    
241 case, little can be done from the idle injecti    
242 normal cases, such as copying a large file with scp, applic
243 by the powerclamp driver, since slowing down t    
244 network protocol processing, which in turn red    
245                                                   
246 When control parameters change at runtime by t    
247 may take an additional period for the rest of     
248 with the changes. During this time, idle injec    
249 thus not able to enter package C-states at th
250 this effect is minor, in that in most cases ch    
251 ratio is updated much less frequently than the    
252 frequency.                                        
253                                                   
254 Scalability                                       
255 -----------                                       
256 Tests also show a minor, but measurable, diffe    
257 Ivy Bridge system and the 80P Westmere server     
258 More compensation is needed on Westmere for th    
259 target idle ratio. The compensation also incre    
260 gets larger. The above reason constitutes the     
261 calibration code.                                 
262                                                   
263 On the IVB 8P system, compared to an offline C    
264 achieve up to 40% better performance per watt.    
265 counter summed over per CPU counting threads s    
266 CPUs).                                            
267                                                   
268 Usage and Interfaces                              
269 ====================                              
270 The powerclamp driver is registered to the gen    
271 cooling device. Currently, it's not bound to
272                                                   
273   jacob@chromoly:/sys/class/thermal/cooling_de    
274   cur_state:0                                     
275   max_state:50                                    
276   type:intel_powerclamp                           
277                                                   
278 cur_state allows the user to set the desired idle
279 cur_state will stop idle injection. Writing a     
280 max_state will start the idle injection. Readi    
281 actual and current idle percentage. This may n    
282 set by the user in that current idle percentag    
283 and includes natural idle. When idle injection    
284 cur_state returns -1 instead of 0, which
285 100% busy state with the disabled state.          
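
The sysfs interaction described above can be scripted. The following is a minimal sketch, assuming the generic thermal sysfs layout in which each cooling device directory contains `type`, `cur_state`, and `max_state` files; the function names are hypothetical, and the `root` argument exists only so the helpers can be pointed at a test directory. Writing to `cur_state` on a real system requires root privileges.

```python
# Illustrative sketch; function names are hypothetical.
import glob
import os


def find_powerclamp(root="/sys/class/thermal"):
    """Return the cooling device directory whose type is
    intel_powerclamp, or None if the driver is not loaded."""
    pattern = os.path.join(root, "cooling_device*", "type")
    for type_path in sorted(glob.glob(pattern)):
        with open(type_path) as f:
            if f.read().strip() == "intel_powerclamp":
                return os.path.dirname(type_path)
    return None


def read_int(dev, name):
    """Read an integer attribute such as cur_state or max_state."""
    with open(os.path.join(dev, name)) as f:
        return int(f.read().strip())


def set_idle_percent(dev, percent):
    """Request an idle percentage; 0 stops idle injection."""
    max_state = read_int(dev, "max_state")
    if not 0 <= percent <= max_state:
        raise ValueError(f"percent must be in 0..{max_state}")
    with open(os.path.join(dev, "cur_state"), "w") as f:
        f.write(str(percent))
```

A caller would typically look the device up once with `find_powerclamp()`, then call `set_idle_percent(dev, 25)` and later read `cur_state` back, keeping in mind that the value read is the measured idle percentage rather than the value written.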
286                                                   
287 Example usage:                                    
288                                                   
289 - To inject 25% idle time::                       
290                                                   
291         $ sudo sh -c "echo 25 > /sys/class/the    
292                                                   
293 If the system is not busy and has more than 25    
294 then the powerclamp driver will not start idle    
295 will not show idle injection kernel threads.      
296                                                   
297 If the system is busy (spin test below) and ha    
298 idle time, powerclamp kernel threads will do i    
299 idle time is accounted as normal idle in that     
300 taken as the idle task.                           
301                                                   
302 In this example, 24.1% idle is shown. This hel    
303 user determine the cause of slowdown, when a p    
304                                                   
305                                                   
306   Tasks: 197 total,   1 running, 196 sleeping,    
307   Cpu(s): 71.2%us,  4.7%sy,  0.0%ni, 24.1%id,     
308   Mem:   3943228k total,  1689632k used,  2253    
309   Swap:  4087804k total,        0k used,  4087    
310                                                   
311     PID USER      PR  NI  VIRT  RES  SHR S %CP    
312    3352 jacob     20   0  262m  644  428 S  28    
313    3341 root     -51   0     0    0    0 D   2    
314    3344 root     -51   0     0    0    0 D   2    
315    3342 root     -51   0     0    0    0 D   2    
316    3343 root     -51   0     0    0    0 D   2    
317    2935 jacob     20   0  696m 125m  35m S        
318    1546 root      20   0  158m  20m 6640 S        
319    2100 jacob     20   0 1223m  88m  30m S        
320                                                   
321 Tests have shown that by using the powerclamp     
322 device, a PID based userspace thermal controll    
323 control CPU temperature effectively, when no o    
324 is added. For example, an Ultrabook user can co
325 certain temperature (below most active trip po    
326                                                   
327 Module Parameters                                 
328 =================                                 
329                                                   
330 ``cpumask`` (RW)                                  
331         A bit mask of CPUs on which to inject idle. The
332         used in other subsystems like in /proc    
333         comma separated 32 bit groups. Each CP    
334         CPU system the full mask is:              
335         ffffffff,ffffffff,ffffffff,ffffffff,ff    
336                                                   
337         The rightmost mask is for CPUs 0-31.
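
The mask format can be illustrated with a small conversion helper. This is a sketch with hypothetical function names, assuming the layout described above: comma-separated groups of 32 bits written in hexadecimal, with the rightmost group covering CPUs 0 to 31 and each further group to the left covering the next 32 CPUs.

```python
# Illustrative sketch; function names are hypothetical.

def cpus_to_mask(cpus):
    """Build a comma-separated 32-bit hex mask from CPU numbers."""
    bits = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        bits |= 1 << cpu
    groups = []
    while True:
        groups.append(f"{bits & 0xFFFFFFFF:08x}")  # lowest 32 CPUs first
        bits >>= 32
        if bits == 0:
            break
    # The lowest group is printed last, so reverse before joining.
    return ",".join(reversed(groups))


def mask_to_cpus(mask):
    """Parse a comma-separated hex mask back into a sorted CPU list."""
    bits = 0
    for group in mask.strip().split(","):
        bits = (bits << 32) | int(group, 16)
    return [cpu for cpu in range(bits.bit_length()) if (bits >> cpu) & 1]
```

For example, selecting CPUs 0, 1, and 32 yields `00000001,00000003`: the right group has bits 0 and 1 set, and the left group has its lowest bit set for CPU 32.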
338                                                   
339 ``max_idle`` (RW)                                 
340         Maximum injected idle time to the tota    
341         from 1 to 100. Even if the cooling dev    
342         this parameter allows adding a max idl
343         to match the current implementation of    
344         allow value more than 75, if the cpuma    
345         the system.                               
                                                      
