.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================
User Interface for Resource Control feature
===========================================

:Copyright: |copy| 2016 Intel Corporation
:Authors: - Fenghua Yu <fenghua.yu@intel.com>
          - Tony Luck <tony.luck@intel.com>
          - Vikas Shivappa <vikas.shivappa@intel.com>


Intel refers to this feature as Intel Resource Director Technology (Intel(R) RDT).
AMD refers to this feature as AMD Platform Quality of Service (AMD QoS).

This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86 /proc/cpuinfo
flag bits:

===============================================  ================================
RDT (Resource Director Technology) Allocation    "rdt_a"
CAT (Cache Allocation Technology)                "cat_l3", "cat_l2"
CDP (Code and Data Prioritization)               "cdp_l3", "cdp_l2"
CQM (Cache QoS Monitoring)                       "cqm_llc", "cqm_occup_llc"
MBM (Memory Bandwidth Monitoring)                "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation)                "mba"
SMBA (Slow Memory Bandwidth Allocation)          ""
BMEC (Bandwidth Monitoring Event Configuration)  ""
===============================================  ================================

Historically, new features were made visible by default in /proc/cpuinfo. This
resulted in the feature flags becoming hard to parse by humans. Adding a new
flag to /proc/cpuinfo should be avoided if user space can obtain information
about the feature from resctrl's info directory.

To use the feature mount the file system::

 # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps][,debug]] /sys/fs/resctrl

mount options are:

"cdp":
        Enable code/data prioritization in L3 cache allocations.
"cdpl2":
        Enable code/data prioritization in L2 cache allocations.
"mba_MBps":
        Enable the MBA Software Controller (mba_sc) to specify MBA
        bandwidth in MiBps.
"debug":
        Make debug files accessible. Available debug files are annotated with
        "Available only with debug option".
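
For instance, to enable L3 code/data prioritization at mount time (an
illustrative sketch; options may be combined as shown above)::

 # mount -t resctrl -o cdp resctrl /sys/fs/resctrl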

L2 and L3 CDP are controlled separately.

RDT features are orthogonal. A particular system may support only
monitoring, only control, or both monitoring and control.  Cache
pseudo-locking is a unique way of using cache control to "pin" or
"lock" data in the cache. Details can be found in
"Cache Pseudo-Locking".


The mount succeeds if either of allocation or monitoring is present, but
only those files and directories supported by the system will be created.
For more details on the behavior of the interface during monitoring
and allocation, see the "Resource alloc and monitor groups" section.

Info directory
==============

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.

Each subdirectory contains the following files with respect to
allocation:

Cache resource (L3/L2) subdirectory contains the following files
related to allocation:

"num_closids":
                The number of CLOSIDs which are valid for this
                resource. The kernel uses the smallest number of
                CLOSIDs of all enabled resources as limit.
"cbm_mask":
                The bitmask which is valid for this resource.
                This mask is equivalent to 100%.
"min_cbm_bits":
                The minimum number of consecutive bits which
                must be set when writing a mask.

"shareable_bits":
                Bitmask of shareable resource with other executing
                entities (e.g. I/O). User can use this when
                setting up exclusive cache partitions. Note that
                some platforms support devices that have their
                own settings for cache use which can over-ride
                these bits.
"bit_usage":
                Annotated capacity bitmasks showing how all
                instances of the resource are used. The legend is:

                        "0":
                              Corresponding region is unused. When the system's
                              resources have been allocated and a "0" is found
                              in "bit_usage" it is a sign that resources are
                              wasted.

                        "H":
                              Corresponding region is used by hardware only
                              but available for software use. If a resource
                              has bits set in "shareable_bits" but none of
                              these bits appear in the resource groups'
                              schematas then the bits appearing in
                              "shareable_bits" will be marked as "H".
                        "X":
                              Corresponding region is available for sharing and
                              used by hardware and software. These are the bits
                              that appear in "shareable_bits" as well as a
                              resource group's allocation.
                        "S":
                              Corresponding region is used by software
                              and available for sharing.
                        "E":
                              Corresponding region is used exclusively by
                              one resource group. No sharing allowed.
                        "P":
                              Corresponding region is pseudo-locked. No
                              sharing allowed.
"sparse_masks":
                Indicates if non-contiguous 1s value in CBM is supported.

                        "0":
                              Only contiguous 1s value in CBM is supported.
                        "1":
                              Non-contiguous 1s value in CBM is supported.
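
As an illustration, these parameters can be read straight from the info
directory; the values shown here are purely illustrative and vary by CPU
model::

  # cat /sys/fs/resctrl/info/L3/cbm_mask
  fffff
  # cat /sys/fs/resctrl/info/L3/min_cbm_bits
  1
  # cat /sys/fs/resctrl/info/L3/sparse_masks
  0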

Memory bandwidth(MB) subdirectory contains the following files
with respect to allocation:

"min_bandwidth":
                The minimum memory bandwidth percentage which
                user can request.

"bandwidth_gran":
                The granularity in which the memory bandwidth
                percentage is allocated. The allocated
                b/w percentage is rounded off to the next
                control step available on the hardware. The
                available bandwidth control steps are:
                min_bandwidth + N * bandwidth_gran.

"delay_linear":
                Indicates if the delay scale is linear or
                non-linear. This field is purely informational.

"thread_throttle_mode":
                Indicator on Intel systems of how tasks running on threads
                of a physical core are throttled in cases where they
                request different memory bandwidth percentages:

                "max":
                        the smallest percentage is applied
                        to all threads
                "per-thread":
                        bandwidth percentages are directly applied to
                        the threads running on the core

If RDT monitoring is available there will be an "L3_MON" directory
with the following files:

"num_rmids":
                The number of RMIDs available. This is the
                upper bound for how many "CTRL_MON" + "MON"
                groups can be created.

"mon_features":
                Lists the monitoring events if
                monitoring is enabled for the resource.
                Example::

                        # cat /sys/fs/resctrl/info/L3_MON/mon_features
                        llc_occupancy
                        mbm_total_bytes
                        mbm_local_bytes

                If the system supports Bandwidth Monitoring Event
                Configuration (BMEC), then the bandwidth events will
                be configurable. The output will be::

                        # cat /sys/fs/resctrl/info/L3_MON/mon_features
                        llc_occupancy
                        mbm_total_bytes
                        mbm_total_bytes_config
                        mbm_local_bytes
                        mbm_local_bytes_config

"mbm_total_bytes_config", "mbm_local_bytes_config":
        Read/write files containing the configuration for the mbm_total_bytes
        and mbm_local_bytes events, respectively, when the Bandwidth
        Monitoring Event Configuration (BMEC) feature is supported.
        The event configuration settings are domain specific and affect
        all the CPUs in the domain. When either event configuration is
        changed, the bandwidth counters for all RMIDs of both events
        (mbm_total_bytes as well as mbm_local_bytes) are cleared for that
        domain. The next read for every RMID will report "Unavailable"
        and subsequent reads will report the valid value.

        Following are the types of events supported:

        ====    ========================================================
        Bits    Description
        ====    ========================================================
        6       Dirty Victims from the QOS domain to all types of memory
        5       Reads to slow memory in the non-local NUMA domain
        4       Reads to slow memory in the local NUMA domain
        3       Non-temporal writes to non-local NUMA domain
        2       Non-temporal writes to local NUMA domain
        1       Reads to memory in the non-local NUMA domain
        0       Reads to memory in the local NUMA domain
        ====    ========================================================

        By default, the mbm_total_bytes configuration is set to 0x7f to count
        all the event types and the mbm_local_bytes configuration is set to
        0x15 to count all the local memory events.

        Examples:

        * To view the current configuration::

            # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
            0=0x7f;1=0x7f;2=0x7f;3=0x7f

            # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
            0=0x15;1=0x15;3=0x15;4=0x15

        * To change the mbm_total_bytes to count only reads on domain 0,
          bits 0, 1, 4 and 5 need to be set, which is 110011b in binary
          (in hexadecimal 0x33)::

            # echo "0=0x33" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config

            # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
            0=0x33;1=0x7f;2=0x7f;3=0x7f

        * To change the mbm_local_bytes to count all the slow memory reads on
          domains 0 and 1, bits 4 and 5 need to be set, which is 110000b
          in binary (in hexadecimal 0x30)::

            # echo "0=0x30;1=0x30" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config

            # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
            0=0x30;1=0x30;3=0x15;4=0x15

"max_threshold_occupancy":
                Read/write file provides the largest value (in
                bytes) at which a previously used LLC_occupancy
                counter can be considered for re-use.

Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
control files). If the command was successful, it will read as "ok".
If the command failed, it will provide more information than can be
conveyed in the error returns from file operations. E.g.
::

        # echo L3:0=f7 > schemata
        bash: echo: write error: Invalid argument
        # cat info/last_cmd_status
        mask f7 has non-consecutive 1-bits

Resource alloc and monitor groups
=================================

Resource groups are represented as directories in the resctrl file
system.  The default group is the root directory which, immediately
after mounting, owns all the tasks and cpus in the system and can make
full use of all resources.

On a system with RDT control features additional directories can be
created in the root directory that specify different amounts of each
resource (see "schemata" below). The root and these additional top level
directories are referred to as "CTRL_MON" groups below.

On a system with RDT monitoring the root directory and other top level
directories contain a directory named "mon_groups" in which additional
directories can be created to monitor subsets of tasks in the CTRL_MON
group that is their ancestor. These are called "MON" groups in the rest
of this document.

Removing a directory will move all tasks and cpus owned by the group it
represents to the parent. Removing one of the created CTRL_MON groups
will automatically remove all MON groups below it.

Moving MON group directories to a new parent CTRL_MON group is supported
for the purpose of changing the resource allocations of a MON group
without impacting its monitoring data or assigned tasks. This operation
is not allowed for MON groups which monitor CPUs. No other move
operation is currently allowed other than simply renaming a CTRL_MON or
MON group.

All groups contain the following files:

"tasks":
        Reading this file shows the list of all tasks that belong to
        this group. Writing a task id to the file will add a task to the
        group. Multiple tasks can be added by separating the task ids
        with commas. Tasks will be assigned sequentially. Multiple
        failures are not supported. A single failure encountered while
        attempting to assign a task will cause the operation to abort and
        already added tasks before the failure will remain in the group.
        Failures will be logged to /sys/fs/resctrl/info/last_cmd_status.

        If the group is a CTRL_MON group the task is removed from
        whichever previous CTRL_MON group owned the task and also from
        any MON group that owned the task. If the group is a MON group,
        then the task must already belong to the CTRL_MON parent of this
        group. The task is removed from any previous MON group.


"cpus":
        Reading this file shows a bitmask of the logical CPUs owned by
        this group. Writing a mask to this file will add and remove
        CPUs to/from this group. As with the tasks file a hierarchy is
        maintained where MON groups may only include CPUs owned by the
        parent CTRL_MON group.
        When the resource group is in pseudo-locked mode this file will
        only be readable, reflecting the CPUs associated with the
        pseudo-locked region.


"cpus_list":
        Just like "cpus", only using ranges of CPUs instead of bitmasks.
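
        E.g. for a hypothetical group owning CPUs 0-3 the two files show
        the same ownership in the two formats::

                # cat cpus
                f
                # cat cpus_list
                0-3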


When control is enabled all CTRL_MON groups will also contain:

"schemata":
        A list of all the resources available to this group.
        Each resource has its own line and format - see below for details.

"size":
        Mirrors the display of the "schemata" file to display the size in
        bytes of each allocation instead of the bits representing the
        allocation.

"mode":
        The "mode" of the resource group dictates the sharing of its
        allocations. A "shareable" resource group allows sharing of its
        allocations while an "exclusive" resource group does not. A
        cache pseudo-locked region is created by first writing
        "pseudo-locksetup" to the "mode" file before writing the cache
        pseudo-locked region's schemata to the resource group's "schemata"
        file. On successful pseudo-locked region creation the mode will
        automatically change to "pseudo-locked".

"ctrl_hw_id":
        Available only with debug option. The identifier used by hardware
        for the control group. On x86 this is the CLOSID.

When monitoring is enabled all MON groups will also contain:

"mon_data":
        This contains a set of files organized by L3 domain and by
        RDT event. E.g. on a system with two L3 domains there will
        be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
        directories has one file per event (e.g. "llc_occupancy",
        "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
        files provide a read out of the current value of the event for
        all tasks in the group. In CTRL_MON groups these files provide
        the sum for all tasks in the CTRL_MON group and all tasks in
        MON groups. Please see example section for more details on usage.
        On systems with Sub-NUMA Cluster (SNC) enabled there are extra
        directories for each node (located within the "mon_L3_XX" directory
        for the L3 cache they occupy). These are named "mon_sub_L3_YY"
        where "YY" is the node number.

"mon_hw_id":
        Available only with debug option. The identifier used by hardware
        for the monitor group. On x86 this is the RMID.
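
To illustrate the layout, the per-domain event files of a group can be
listed and read directly (domain count and values are illustrative)::

  # ls mon_data/
  mon_L3_00  mon_L3_01
  # cat mon_data/mon_L3_00/llc_occupancy
  11234304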

Resource allocation rules
-------------------------

When a task is running the following rules define which resources are
available to it:

1) If the task is a member of a non-default group, then the schemata
   for that group is used.

2) Else if the task belongs to the default group, but is running on a
   CPU that is assigned to some specific group, then the schemata for the
   CPU's group is used.

3) Otherwise the schemata for the default group is used.

Resource monitoring rules
-------------------------
1) If a task is a member of a MON group, or non-default CTRL_MON group
   then RDT events for the task will be reported in that group.

2) If a task is a member of the default CTRL_MON group, but is running
   on a CPU that is assigned to some specific group, then the RDT events
   for the task will be reported in that group.

3) Otherwise RDT events for the task will be reported in the root level
   "mon_data" group.


Notes on cache occupancy monitoring and control
===============================================
When moving a task from one group to another you should remember that
this only affects *new* cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
it to a new group and immediately check the occupancy of the old and new
groups you will likely see that the old group is still showing 3 MB and
the new group zero. When the task accesses locations still in cache from
before the move, the h/w does not update any counters. On a busy system
you will likely see the occupancy in the old group go down as cache lines
are evicted and re-used while the occupancy in the new group rises as
the task accesses memory and loads into the cache are counted based on
membership in the new group.

The same applies to cache allocation control. Moving a task to a group
with a smaller cache partition will not evict any cache lines. The
process may continue to use them from the old partition.

Hardware uses a CLOSid (Class of service ID) and an RMID (Resource
monitoring ID) to identify a control group and a monitoring group
respectively. Each of the resource groups is mapped to these IDs based on
the kind of group. The number of CLOSids and RMIDs are limited by the
hardware and hence the creation of a "CTRL_MON" directory may fail if we
run out of either CLOSID or RMID and creation of a "MON" group may fail
if we run out of RMIDs.
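
The limits that bound group creation can be inspected up front, e.g. (the
values shown are illustrative)::

  # cat /sys/fs/resctrl/info/L3/num_closids
  16
  # cat /sys/fs/resctrl/info/L3_MON/num_rmids
  256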

max_threshold_occupancy - generic concepts
------------------------------------------

Note that an RMID once freed may not be immediately available for use as
the RMID is still tagged to the cache lines of the previous user of the
RMID. Hence such RMIDs are placed on a limbo list and checked back only
when the cache line occupancy has gone down. If there is a time when the
system has a lot of limbo RMIDs which are not yet ready to be used, the
user may see an -EBUSY during mkdir.

max_threshold_occupancy is a user configurable value to determine the
occupancy at which an RMID can be freed.

The mon_llc_occupancy_limbo tracepoint gives the precise occupancy in bytes
for a subset of RMID that are not immediately available for allocation.
This can't be relied on to produce output every second, it may be necessary
to attempt to create an empty monitor group to force an update. Output may
only be produced if creation of a control or monitor group fails.
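
A sketch of capturing this, assuming the standard tracefs mount point (the
mkdir may fail with -EBUSY; the attempt itself is what forces an update)::

  # echo 1 > /sys/kernel/tracing/events/resctrl/mon_llc_occupancy_limbo/enable
  # mkdir /sys/fs/resctrl/mon_groups/probe
  # cat /sys/kernel/tracing/trace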

Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
the name of the resource, followed by specific values to be applied
in each of the instances of that resource on the system.

Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical cpus sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps).  To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
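
For example, to see which L3 instance each CPU belongs to (index3 is
typically the L3 cache; the output is illustrative)::

  # grep . /sys/devices/system/cpu/cpu*/cache/index3/id
  /sys/devices/system/cpu/cpu0/cache/index3/id:0
  /sys/devices/system/cpu/cpu1/cache/index3/id:0
  /sys/devices/system/cpu/cpu2/cache/index3/id:1
  /sys/devices/system/cpu/cpu3/cache/index3/id:1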

Cache Bit Masks (CBM)
---------------------
For cache resources we describe the portion of the cache that is available
for allocation using a bitmask. The maximum value of the mask is defined
by each cpu model (and may be different for different cache levels). It
is found using CPUID, but is also provided in the "info" directory of
the resctrl file system in "info/{resource}/cbm_mask". Some Intel hardware
requires that these masks have all the '1' bits in a contiguous block. So
0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
and 0xA are not. Check /sys/fs/resctrl/info/{resource}/sparse_masks
if non-contiguous 1s value is supported. On a system with a 20-bit mask
each bit represents 5% of the capacity of the cache. You could partition
the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.

Notes on Sub-NUMA Cluster mode
==============================
When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
nodes much more readily than between regular NUMA nodes since the CPUs
on Sub-NUMA nodes share the same L3 cache and the system may report
the NUMA distance between Sub-NUMA nodes with a lower value than used
for regular NUMA nodes.

The top-level monitoring files in each "mon_L3_XX" directory provide
the sum of data across all SNC nodes sharing an L3 cache instance.
Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
"mon_sub_L3_YY" directories to get node local data.

Memory bandwidth allocation is still performed at the L3 cache
level. I.e. throttling controls are applied to all SNC nodes.

L3 cache allocation bitmaps also apply to all SNC nodes. But note that
the amount of L3 cache represented by each bit is divided by the number
of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit
allocation masks each bit normally represents 10MB. With SNC mode enabled
with two SNC nodes per L3 cache, each bit only represents 5MB.

Memory bandwidth Allocation and monitoring
==========================================

For the Memory bandwidth resource, by default the user controls the resource
by indicating the percentage of total memory bandwidth.

The minimum bandwidth percentage value for each cpu model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the cpu model and can
be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are rounded
to the next control step available on the hardware.

The bandwidth throttling is a core specific mechanism on some of Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core may result in both threads being throttled to use the
low bandwidth (see "thread_throttle_mode").

The fact that Memory bandwidth allocation (MBA) may be a core
specific mechanism whereas memory bandwidth monitoring (MBM) is done at
the package level may lead to confusion when users try to apply control
via the MBA and then monitor the bandwidth to see if the controls are
effective. Below are such scenarios:

1. User may *not* see increase in actual bandwidth when percentage
   values are increased:

This can occur when aggregate L2 external bandwidth is more than L3
external bandwidth. Consider an SKL SKU with 24 cores on a package and
where L2 external bandwidth is 10GBps (hence aggregate L2 external
bandwidth is 240GBps) and L3 external bandwidth is 100GBps. Now a
workload with '20 threads, having 50% bandwidth, each consuming 5GBps'
consumes the max L3 bandwidth of 100GBps although the percentage value
specified is only 50% << 100%. Hence increasing the bandwidth percentage
will not yield any more bandwidth. This is because although the L2
external bandwidth still has capacity, the L3 external bandwidth is
fully used. Also note that this would be dependent on the number of
cores the benchmark is run on.

2. Same bandwidth percentage may mean different actual bandwidth
   depending on # of threads:

For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
threads, with 10% bandwidth' can consume up to 10GBps and 40GBps although
they have the same percentage bandwidth of 10%. This is simply because as
threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although the user specified bandwidth percentage is the
same.

In order to mitigate this and make the interface more user friendly,
resctrl added support for specifying the bandwidth in MiBps as well. The
kernel underneath would use a software feedback mechanism or a "Software
Controller (mba_sc)" which reads the actual bandwidth using MBM counters
and adjusts the memory bandwidth percentages to ensure::

        "actual bandwidth < user specified bandwidth"

By default, the schemata would take the bandwidth percentage values
whereas the user can switch to the "MBA software controller" mode using
a mount option 'mba_MBps'. The schemata format is specified in the below
sections.
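
E.g. switching to the software controller simply requires remounting with
the option (a sketch; other desired options can be combined)::

  # umount /sys/fs/resctrl
  # mount -t resctrl -o mba_MBps resctrl /sys/fs/resctrl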

L3 schemata file details (code and data prioritization disabled)
----------------------------------------------------------------
With CDP disabled the L3 schemata format is::

        L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L3 schemata file details (CDP enabled via mount option to resctrl)
-------------------------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this::

        L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
        L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L2 schemata file details
------------------------
CDP is supported at L2 using the 'cdpl2' mount option. The schemata
format is either::

        L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

or::

        L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
        L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...


Memory bandwidth Allocation (default mode)
------------------------------------------

Memory b/w domain is L3 cache.
::

        MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Memory bandwidth Allocation specified in MiBps
----------------------------------------------

Memory bandwidth domain is L3 cache.
::

        MB:<cache_id0>=bw_MiBps0;<cache_id1>=bw_MiBps1;...

Slow Memory Bandwidth Allocation (SMBA)
---------------------------------------
AMD hardware supports Slow Memory Bandwidth Allocation (SMBA).
CXL.memory is the only supported "slow" memory device. With the
support of SMBA, the hardware enables bandwidth allocation on
the slow memory devices. If there are multiple such devices in
the system, the throttling logic groups all the slow sources
together and applies the limit on them as a whole.

The presence of the SMBA feature (with CXL.memory) is independent of
whether slow memory devices are actually present. If there are no such
devices on the system, then configuring SMBA will have no impact on the
performance of the system.

The bandwidth domain for slow memory is L3 cache. Its schemata file
is formatted as::

        SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
on all domains. When writing you only need to specify those values
which you wish to change.  E.g.
::

  # cat schemata
  L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
  # echo "L3DATA:2=3c0;" > schemata
  # cat schemata
  L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff

Reading/writing the schemata file (on AMD systems)
--------------------------------------------------
Reading the schemata file will show the current bandwidth limit on all
domains. The allocated resources are in multiples of one eighth GB/s.
When writing to the file, you need to specify the cache id for which you
wish to configure the bandwidth limit.

For example, to allocate a 2GB/s limit on cache id 1, write the value 16
(16 * 1/8 GB/s = 2GB/s):
::

  # cat schemata
    MB:0=2048;1=2048;2=2048;3=2048
    L3:0=ffff;1=ffff;2=ffff;3=ffff

  # echo "MB:1=16" > schemata
  # cat schemata
    MB:0=2048;1=  16;2=2048;3=2048
    L3:0=ffff;1=ffff;2=ffff;3=ffff

Reading/writing the schemata file (on AMD systems) with SMBA feature
--------------------------------------------------------------------
Reading and writing the schemata file is the same as without the SMBA
feature in the above section.

For example, to allocate an 8GB/s limit on cache id 1, write the value 64
(64 * 1/8 GB/s = 8GB/s):
::

  # cat schemata
    SMBA:0=2048;1=2048;2=2048;3=2048
      MB:0=2048;1=2048;2=2048;3=2048
      L3:0=ffff;1=ffff;2=ffff;3=ffff

  # echo "SMBA:1=64" > schemata
  # cat schemata
    SMBA:0=2048;1=  64;2=2048;3=2048
      MB:0=2048;1=2048;2=2048;3=2048
      L3:0=ffff;1=ffff;2=ffff;3=ffff

Cache Pseudo-Locking
====================
CAT enables a user to specify the amount of cache space that an
application can fill. Cache pseudo-locking builds on the fact that a
CPU can still read and write data pre-allocated outside its current
allocated area on a cache hit. With cache pseudo-locking, data can be
preloaded into a reserved portion of cache that no application can
fill, and from that point on will only serve cache hits. The cache
pseudo-locked memory is made accessible to user space where an
application can map it into its virtual address space and thus have
a region of memory with reduced average read latency.

The creation of a cache pseudo-locked region is triggered by a request
from the user to do so that is accompanied by a schemata of the region
to be pseudo-locked. The cache pseudo-locked region is created as follows:

- Create a CAT allocation CLOSNEW with a CBM matching the schemata
  from the user of the cache region that will contain the pseudo-locked
  memory. This region must not overlap with any current CAT allocation/CLOS
  on the system and no future overlap with this CAT allocation is allowed
  while the pseudo-locked region exists.
- Create a contiguous region of memory of the same size as the cache
  region.
- Flush the cache, disable hardware prefetchers, disable preemption.
- Make CLOSNEW the active CLOS and touch the allocated memory to load
  it into the cache.
- Set the previous CLOS as active.
- At this point the closid CLOSNEW can be released - the cache
  pseudo-locked region is protected as long as its CBM does not appear in
  any CAT allocation. Even though the cache pseudo-locked region will from
  this point on not appear in any CBM of any CLOS an application running on
  any CLOS will be able to access the memory in its pseudo-locked region as
  the region continues to serve cache hits.
- The contiguous region of memory loaded into the cache is exposed to
  user-space as a character device.

Cache pseudo-locking increases the probability that data will remain
in the cache via carefully configuring the CAT feature and controlling
application behavior. There is no guarantee that data is placed in
cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
"locked" data from cache. Power management C-states may shrink or
power off cache. Deeper C-states will automatically be restricted on
pseudo-locked region creation.

It is required that an application using a pseudo-locked region runs
with affinity to the cores (or a subset of the cores) associated
with the cache on which the pseudo-locked region resides. A sanity check
within the code will not allow an application to map pseudo-locked memory
unless it runs with affinity to cores associated with the cache on which
the pseudo-locked region resides. The sanity check is only done during the
initial mmap() handling; there is no enforcement afterwards and the
application itself needs to ensure it remains affine to the correct cores.

Pseudo-locking is accomplished in two stages:

1) During the first stage the system administrator allocates a portion
   of cache that should be dedicated to pseudo-locking. At this time an
   equivalent portion of memory is allocated, loaded into the allocated
   cache portion, and exposed as a character device.
2) During the second stage a user-space application maps (mmap()) the
   pseudo-locked memory into its address space.

Cache Pseudo-Locking Interface
------------------------------
A pseudo-locked region is created using the resctrl interface as follows:

1) Create a new resource group by creating a new directory in
   /sys/fs/resctrl.
2) Change the new resource group's mode to "pseudo-locksetup" by writing
   "pseudo-locksetup" to the "mode" file.
3) Write the schemata of the pseudo-locked region to the "schemata" file. All
   bits within the schemata should be "unused" according to the "bit_usage"
   file.

On successful pseudo-locked region creation the "mode" file will contain
"pseudo-locked" and a new character device with the same name as the resource
group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
by user space in order to obtain access to the pseudo-locked memory region.

An example of cache pseudo-locked region creation and usage can be found below.
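
Before that full example, here is a minimal shell sketch of the three steps
above (the group name "newlock" and the schemata are illustrative and assume
an otherwise unused region of an L2 cache instance)::

  # mkdir /sys/fs/resctrl/newlock
  # echo pseudo-locksetup > /sys/fs/resctrl/newlock/mode
  # echo 'L2:1=0x3' > /sys/fs/resctrl/newlock/schemata
  # cat /sys/fs/resctrl/newlock/mode
  pseudo-locked
  # ls /dev/pseudo_lock
  newlock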

Cache Pseudo-Locking Debugging Interface
----------------------------------------
The pseudo-locking debugging interface is enabled by default (if
CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.

There is no explicit way for the kernel to test if a provided memory
location is present in the cache. The pseudo-locking debugging interface uses
the tracing infrastructure to provide two ways to measure cache residency of
the pseudo-locked region:

1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
   from these measurements are best visualized using a hist trigger (see
   example below). In this test the pseudo-locked region is traversed at
   a stride of 32 bytes while hardware prefetchers and preemption
   are disabled. This also provides a substitute visualization of cache
   hits and misses.
2) Cache hit and miss measurements using model specific precision counters if
   available. Depending on the levels of cache on the system the pseudo_lock_l2
   and pseudo_lock_l3 tracepoints are available.

When a pseudo-locked region is created a new debugfs directory is created for
it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
write-only file, pseudo_lock_measure, is present in this directory. The
measurement of the pseudo-locked region depends on the number written to this
debugfs file:

1:
     writing "1" to the pseudo_lock_measure file will trigger the latency
     measurement captured in the pseudo_lock_mem_latency tracepoint. See
     example below.
2:
     writing "2" to the pseudo_lock_measure file will trigger the L2 cache
     residency (cache hits and misses) measurement captured in the
     pseudo_lock_l2 tracepoint. See example below.
3:
     writing "3" to the pseudo_lock_measure file will trigger the L3 cache
     residency (cache hits and misses) measurement captured in the
     pseudo_lock_l3 tracepoint.

All measurements are recorded with the tracing infrastructure. This requires
the relevant tracepoints to be enabled before the measurement is triggered.

Example of latency debugging interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example a pseudo-locked region named "newlock" was created. Here is
how we can measure the latency in cycles of reading from this region and
visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
is set::

  # :> /sys/kernel/tracing/trace
  # echo 'hist:keys=latency' > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
  # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
  # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
  # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
  # cat /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/hist

  # event histogram
  #
  # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
  #

  { latency:        456 } hitcount:          1
  { latency:         50 } hitcount:         83
  { latency:         36 } hitcount:         96
  { latency:         44 } hitcount:        174
  { latency:         48 } hitcount:        195
  { latency:         46 } hitcount:        262
  { latency:         42 } hitcount:        693
  { latency:         40 } hitcount:       3204
  { latency:         38 } hitcount:       3484

  Totals:
      Hits: 8192
      Entries: 9
    Dropped: 0

Example of cache hits/misses debugging
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example a pseudo-locked region named "newlock" was created on the L2
cache of a platform. Here is how we can obtain details of the cache hits
and misses using the platform's precision counters.
::

  # :> /sys/kernel/tracing/trace
  # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
  # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
  # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
  # cat /sys/kernel/tracing/trace

  # tracer: nop
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
  pseudo_lock_mea-1672  [002] ....  3132.860500: pseudo_lock_l2: hits=4097 miss=0


Examples for RDT allocation usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1) Example 1

On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks, minimum b/w of 10% with a memory bandwidth
granularity of 10%.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1
  # echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
  # echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Similarly, tasks that are under the control of group "p0" may use a
maximum memory b/w of 50% on socket 0 and 50% on socket 1.
Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocation specifies the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.

If resctrl is using the software controller (mba_sc) then the user can
enter the max b/w in MB rather than the percentage values.
::

  # echo -e "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
  # echo -e "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata

In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w
of 1024MB whereas on socket 1 they would use 500MB.
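
If monitoring is supported, the bandwidth actually consumed by each group can
then be read back per domain, e.g. (the value shown is illustrative)::

  # cat p0/mon_data/mon_L3_00/mbm_local_bytes
  2818341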

2) Example 2

Again two sockets, but this time with a more realistic 20-bit mask.

Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
neighbors, each of the two real-time tasks exclusively occupies one quarter
of L3 cache on socket 0.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
ordinary tasks::

  # echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata

Next we make a resource group for our first real time task and give
it access to the "top" 25% of the cache on socket 0.
::

  # mkdir p0
  # echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0. Most uses of resource groups will also constrain which
processors tasks run on.
::

  # echo 1234 > p0/tasks
  # taskset -cp 1 1234

Ditto for the second real time task (with the remaining 25% of cache)::

  # mkdir p1
  # echo "L3:0=7c00;1=fffff" > p1/schemata
  # echo 5678 > p1/tasks
  # taskset -cp 2 5678

For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like (assuming min_bandwidth is 10 and bandwidth_gran
is 10):

For our first real time task this would request 20% memory b/w on socket 0.
::

  # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

For our second real time task this would request another 20% memory b/w
on socket 0.
::

  # echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata

3) Example 3

A single socket system which has real-time tasks running on core 4-7 and
non real-time workload assigned to core 0-3. The real-time tasks share text
and data, so a per task association is not required and due to interaction
with the kernel it's desired that the kernel on these cores shares L3 with
the tasks.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
cannot be used by ordinary tasks::

  # echo -e "L3:0=3ff\nMB:0=50" > schemata

Next we make a resource group for our real time cores and give it access
to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
socket 0.
::

  # mkdir p0
  # echo -e "L3:0=ffc00\nMB:0=50" > p0/schemata

Finally we move core 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real time threads are scheduled on the cores 4-7.
::

  # echo F0 > p0/cpus
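
The assignment can be verified via the CPU list format::

  # cat p0/cpus_list
  4-7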
1003                                                  
4) Example 4

The resource groups in previous examples were all in the default "shareable"
mode allowing sharing of their cache allocations. If one resource group
configures a cache allocation then nothing prevents another resource group
from overlapping with that allocation.

In this example a new exclusive resource group will be created on a L2 CAT
system with two L2 cache instances that can be configured with an 8-bit
capacity bitmask. The new exclusive resource group will be configured to
use 25% of each cache instance.
::

  # mount -t resctrl resctrl /sys/fs/resctrl/
  # cd /sys/fs/resctrl

First, we observe that the default group is configured to allocate to all L2
cache::

  # cat schemata
  L2:0=ff;1=ff

We could attempt to create the new resource group at this point, but it will
fail because of the overlap with the schemata of the default group::

  # mkdir p0
  # echo 'L2:0=0x3;1=0x3' > p0/schemata
  # cat p0/mode
  shareable
  # echo exclusive > p0/mode
  -sh: echo: write error: Invalid argument
  # cat info/last_cmd_status
  schemata overlaps

To ensure that there is no overlap with another resource group the default
resource group's schemata has to change, making it possible for the new
resource group to become exclusive.
::

  # echo 'L2:0=0xfc;1=0xfc' > schemata
  # echo exclusive > p0/mode
  # grep . p0/*
  p0/cpus:0
  p0/mode:exclusive
  p0/schemata:L2:0=03;1=03
  p0/size:L2:0=262144;1=262144

The "size" file confirms the allocation: with an 8-bit mask each bit covers
1/8 of the cache instance, so 262144 bytes for two bits implies a 1 MiB
(1048576 byte) L2 instance in this example.

A new resource group will on creation not overlap with an exclusive resource
group::

  # mkdir p1
  # grep . p1/*
  p1/cpus:0
  p1/mode:shareable
  p1/schemata:L2:0=fc;1=fc
  p1/size:L2:0=786432;1=786432

The bit_usage will reflect how the cache is used::

  # cat info/L2/bit_usage
  0=SSSSSSEE;1=SSSSSSEE

A resource group cannot be forced to overlap with an exclusive resource
group::

  # echo 'L2:0=0x1;1=0x1' > p1/schemata
  -sh: echo: write error: Invalid argument
  # cat info/last_cmd_status
  overlaps with exclusive group

Example of Cache Pseudo-Locking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Lock a portion of L2 cache from cache id 1 using CBM 0x3. The pseudo-locked
region is exposed at /dev/pseudo_lock/newlock and can be provided to an
application as the argument to mmap().
::

  # mount -t resctrl resctrl /sys/fs/resctrl/
  # cd /sys/fs/resctrl

Ensure that there are bits available that can be pseudo-locked. Since only
unused bits can be pseudo-locked, the bits to be pseudo-locked need to be
removed from the default resource group's schemata::

  # cat info/L2/bit_usage
  0=SSSSSSSS;1=SSSSSSSS
  # echo 'L2:1=0xfc' > schemata
  # cat info/L2/bit_usage
  0=SSSSSSSS;1=SSSSSS00

Create a new resource group that will be associated with the pseudo-locked
region, indicate that it will be used for a pseudo-locked region, and
configure the requested pseudo-locked region capacity bitmask::

  # mkdir newlock
  # echo pseudo-locksetup > newlock/mode
  # echo 'L2:1=0x3' > newlock/schemata

On success the resource group's mode will change to pseudo-locked, the
bit_usage will reflect the pseudo-locked region, and the character device
exposing the pseudo-locked region will exist::

  # cat newlock/mode
  pseudo-locked
  # cat info/L2/bit_usage
  0=SSSSSSSS;1=SSSSSSPP
  # ls -l /dev/pseudo_lock/newlock
  crw------- 1 root root 243, 0 Apr  3 05:01 /dev/pseudo_lock/newlock

::

  /*
   * Example code to access one page of a pseudo-locked cache region
   * from user space.
   */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/mman.h>

  /*
   * The application must run with affinity to the cores associated
   * with the pseudo-locked region. The cpu is hardcoded for
   * convenience of the example.
   */
  static int cpuid = 2;

  int main(int argc, char *argv[])
  {
    cpu_set_t cpuset;
    long page_size;
    void *mapping;
    int dev_fd;
    int ret;

    page_size = sysconf(_SC_PAGESIZE);

    /* Pin this task to a core local to the pseudo-locked region */
    CPU_ZERO(&cpuset);
    CPU_SET(cpuid, &cpuset);
    ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
    if (ret < 0) {
      perror("sched_setaffinity");
      exit(EXIT_FAILURE);
    }

    dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
    if (dev_fd < 0) {
      perror("open");
      exit(EXIT_FAILURE);
    }

    /* Map one page of the pseudo-locked region into our address space */
    mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                   dev_fd, 0);
    if (mapping == MAP_FAILED) {
      perror("mmap");
      close(dev_fd);
      exit(EXIT_FAILURE);
    }

    /* Application interacts with pseudo-locked memory @mapping */

    ret = munmap(mapping, page_size);
    if (ret < 0) {
      perror("munmap");
      close(dev_fd);
      exit(EXIT_FAILURE);
    }

    close(dev_fd);
    exit(EXIT_SUCCESS);
  }

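The example can be compiled and run like any other user space program;
only opening the character device requires sufficient privileges
(lockmem.c is a hypothetical file name for the code above)::

  # gcc -o lockmem lockmem.c
  # ./lockmem
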
Locking between applications
----------------------------

Certain operations on the resctrl filesystem, composed of read/writes
to/from multiple files, must be atomic.

As an example, the allocation of an exclusive reservation of L3 cache
involves:

  1. Read the cbmmasks from each directory or the per-resource "bit_usage"
  2. Find a contiguous set of bits in the global CBM bitmask that is clear
     in any of the directory cbmmasks
  3. Create a new directory
  4. Set the bits found in step 2 to the new directory "schemata" file

If two applications attempt to allocate space concurrently then they can
end up allocating the same bits so the reservations are shared instead of
exclusive.

To coordinate atomic operations on the resctrlfs and to avoid the problem
above, the following locking procedure is recommended:

Locking is based on flock, which is available in libc and also as a shell
script command.

Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) Release the lock: flock(LOCK_UN)

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) If successful, read the directory structure.
 C) Release the lock: flock(LOCK_UN)

Example with bash::

  # Atomically read directory structure
  $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

  # Read directory contents and create new subdirectory

  $ cat create-dir.sh
  find /sys/fs/resctrl/ > output.txt
  mask = function-of(output.txt)
  mkdir /sys/fs/resctrl/newres/
  echo mask > /sys/fs/resctrl/newres/schemata

  $ flock /sys/fs/resctrl/ ./create-dir.sh

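The "mask = function-of(output.txt)" line above is a placeholder. One
possible implementation reads the per-resource "bit_usage" file instead
of every directory; the sketch below assumes an L3 resource, looks only
at the first domain listed, and treats bits shown as "0" as free. A real
allocator must still pick a contiguous run of the free bits where the
hardware requires contiguous masks::

  $ cat find-free-mask.sh
  #!/bin/bash
  # Leftmost character of bit_usage is the most significant bit;
  # "0" marks a bit that no resource group is using.
  usage=$(cut -d= -f2 /sys/fs/resctrl/info/L3/bit_usage | cut -d';' -f1)
  mask=0
  for (( i = 0; i < ${#usage}; i++ )); do
    bit=$(( ${#usage} - 1 - i ))
    [ "${usage:$i:1}" = "0" ] && mask=$(( mask | (1 << bit) ))
  done
  printf '%x\n' "$mask"
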
Example with C::

  /*
   * Example code to take advisory locks
   * before accessing the resctrl filesystem.
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/file.h>

  void resctrl_take_shared_lock(int fd)
  {
    int ret;

    /* take shared lock on resctrl filesystem */
    ret = flock(fd, LOCK_SH);
    if (ret) {
      perror("flock");
      exit(-1);
    }
  }

  void resctrl_take_exclusive_lock(int fd)
  {
    int ret;

    /* take exclusive lock on resctrl filesystem */
    ret = flock(fd, LOCK_EX);
    if (ret) {
      perror("flock");
      exit(-1);
    }
  }

  void resctrl_release_lock(int fd)
  {
    int ret;

    /* release lock on resctrl filesystem */
    ret = flock(fd, LOCK_UN);
    if (ret) {
      perror("flock");
      exit(-1);
    }
  }

  int main(void)
  {
    int fd;

    fd = open("/sys/fs/resctrl", O_RDONLY | O_DIRECTORY);
    if (fd == -1) {
      perror("open");
      exit(-1);
    }
    resctrl_take_shared_lock(fd);
    /* code to read directory contents */
    resctrl_release_lock(fd);

    resctrl_take_exclusive_lock(fd);
    /* code to read and write directory contents */
    resctrl_release_lock(fd);

    close(fd);
    return 0;
  }

Examples for RDT Monitoring along with allocation usage
=======================================================
Reading monitored data
----------------------
Reading an event file (for example mon_data/mon_L3_00/llc_occupancy) shows
the current snapshot of LLC occupancy of the corresponding MON group or
CTRL_MON group.

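The MBM event files (mbm_total_bytes, mbm_local_bytes) are free-running
byte counters, so a bandwidth figure can be derived by sampling the file
twice and dividing the difference by the interval. A sketch, assuming a
group named p1 and a one second interval::

  $ f=/sys/fs/resctrl/p1/mon_data/mon_L3_00/mbm_total_bytes
  $ a=$(cat $f); sleep 1; b=$(cat $f)
  $ echo "$(( b - a )) bytes/sec"
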

Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
------------------------------------------------------------------------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1
  # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
  # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
  # echo 5678 > p1/tasks
  # echo 5679 > p1/tasks

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Create monitor groups and assign a subset of tasks to each monitor group.
::

  # cd /sys/fs/resctrl/p1/mon_groups
  # mkdir m11 m12
  # echo 5678 > m11/tasks
  # echo 5679 > m12/tasks

Fetch data (data shown in bytes)
::

  # cat m11/mon_data/mon_L3_00/llc_occupancy
  16234000
  # cat m11/mon_data/mon_L3_01/llc_occupancy
  14789000
  # cat m12/mon_data/mon_L3_00/llc_occupancy
  16789000

The parent ctrl_mon group shows the aggregated data.
::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
  31234000

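The per-domain files for a group can also be dumped in one go with grep,
which prefixes each value with its file name (output illustrative)::

  # grep . /sys/fs/resctrl/p1/mon_data/mon_L3_*/llc_occupancy
  /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy:31234000
  /sys/fs/resctrl/p1/mon_data/mon_L3_01/llc_occupancy:29876000
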
Example 2 (Monitor a task from its creation)
--------------------------------------------
On a two socket machine (one L3 cache per socket)::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1

An RMID is allocated to the group once it is created and hence the <cmd>
below is monitored from its creation.
::

  # echo $$ > /sys/fs/resctrl/p1/tasks
  # <cmd>

Fetch the data::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
  31789000

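Since "echo $$ > tasks" also moves the interactive shell itself into the
group, a variant that leaves only the monitored command in the group is
to write the PID from a subshell and then exec the command (bash-specific,
using $BASHPID)::

  # (echo $BASHPID > /sys/fs/resctrl/p1/tasks; exec <cmd>)
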
Example 3 (Monitor without CAT support or before creating CAT groups)
---------------------------------------------------------------------

Assume a system like HSW has only CQM and no CAT support. In this case
resctrl will still mount but cannot create CTRL_MON directories. But the
user can create different MON groups within the root group and thereby
monitor all tasks including kernel threads.

This can also be used to profile jobs' cache size footprint before being
able to allocate them to different allocation groups.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir mon_groups/m01
  # mkdir mon_groups/m02

  # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
  # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks

Monitor the groups separately and also get per domain data. From the
output below it is apparent that the tasks are mostly doing work on
domain (socket) 0.
::

  # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_00/llc_occupancy
  31234000
  # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_01/llc_occupancy
  34555
  # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_00/llc_occupancy
  31234000
  # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_01/llc_occupancy
  32789


Example 4 (Monitor real time tasks)
-----------------------------------

A single socket system which has real time tasks running on cores 4-7
and non real time tasks on other cpus. We want to monitor the cache
occupancy of the real time threads on these cores.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p1

Move the cpus 4-7 over to p1::

  # echo f0 > p1/cpus

View the llc occupancy snapshot::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
  11234000

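The assignment can be cross-checked via the group's "cpus_list" file,
which reports the same mask as a human-readable range::

  # cat p1/cpus_list
  4-7
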
Intel RDT Errata
================

Intel MBM Counters May Report System Memory Bandwidth Incorrectly
------------------------------------------------------------------

Errata SKX99 for Skylake server and BDF102 for Broadwell server.

Problem: Intel Memory Bandwidth Monitoring (MBM) counters track metrics
according to the assigned Resource Monitor ID (RMID) for that logical
core. The IA32_QM_CTR register (MSR 0xC8E), used to report these
metrics, may report incorrect system bandwidth for certain RMID values.

Implication: Due to the errata, system memory bandwidth may not match
what is reported.

Workaround: MBM total and local readings are corrected according to the
following correction factor table:

+---------------+---------------+---------------+
|core count     |rmid count     |rmid threshold |
+---------------+---------------+---------------+
|1              |8              |0              |
+---------------+---------------+---------------+
|2              |16             |0              |
+---------------+---------------+---------------+
|3              |24             |15             |
+---------------+---------------+---------------+
|4              |32             |0              |
+---------------+---------------+---------------+
|6              |48             |31             |
+---------------+---------------+---------------+
|7              |56             |47             |
+---------------+---------------+---------------+
|8              |64             |0              |
+---------------+---------------+---------------+
|9              |72             |63             |
+---------------+---------------+---------------+
|10             |80             |63             |
+---------------+---------------+---------------+
|11             |88             |79             |
+---------------+---------------+---------------+
|12             |96             |0              |
+---------------+---------------+---------------+
|13             |104            |95             |
+---------------+---------------+---------------+
|14             |112            |95             |
+---------------+---------------+---------------+
|15             |120            |95             |
+---------------+---------------+---------------+
|16             |128            |0              |
+---------------+---------------+---------------+
|17             |136            |127            |
+---------------+---------------+---------------+
|18             |144            |127            |
+---------------+---------------+---------------+
|19             |152            |0              |
+---------------+---------------+---------------+
|20             |160            |127            |
+---------------+---------------+---------------+
|21             |168            |0              |
+---------------+---------------+---------------+
|22             |176            |159            |
+---------------+---------------+---------------+
|23             |184            |0              |
+---------------+---------------+---------------+
|24             |192            |127            |
+---------------+---------------+---------------+
|25             |200            |191            |
+---------------+---------------+---------------+
|26             |208            |191            |
+---------------+---------------+---------------+
|27             |216            |0              |
+---------------+---------------+---------------+
|28             |224            |191            |
+---------------+---------------+---------------+

If rmid > rmid threshold, MBM total and local values should be multiplied
by the correction factor.

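The multiplication is done in integer arithmetic: the factor is held in a
20-bit fixed point representation (factor * 1048576, rounded) and the
product is shifted back down, which mirrors how the kernel's resctrl code
applies this workaround. A worked example with a purely illustrative
factor of 1.25 (not a value from the table)::

  $ raw=31234000          # uncorrected MBM reading, in bytes
  $ cf=1310720            # 1.25 * 1048576, 20-bit fixed point
  $ echo $(( raw * cf >> 20 ))
  39042500
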
See:

1. Erratum SKX99 in Intel Xeon Processor Scalable Family Specification
Update:
http://web.archive.org/web/20200716124958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html

2. Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family
Specification Update:
http://web.archive.org/web/20191125200531/https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf

3. The errata in Intel Resource Director Technology (Intel RDT) on 2nd
Generation Intel Xeon Scalable Processors Reference Manual:
https://software.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-manual.html

for further information.
                                                      
