.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================
User Interface for Resource Control feature
===========================================

:Copyright: |copy| 2016 Intel Corporation
:Authors: - Fenghua Yu <fenghua.yu@intel.com>
          - Tony Luck <tony.luck@intel.com>
          - Vikas Shivappa <vikas.shivappa@intel.com>


Intel refers to this feature as Intel Resource Director Technology(Intel RDT).
AMD refers to this feature as AMD Platform Quality of Service(AMD QoS).

This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86
/proc/cpuinfo flag bits:

=============================================== ================================
RDT (Resource Director Technology) Allocation   "rdt_a"
CAT (Cache Allocation Technology)               "cat_l3", "cat_l2"
CDP (Code and Data Prioritization)              "cdp_l3", "cdp_l2"
CQM (Cache QoS Monitoring)                      "cqm_llc", "cqm_occup_llc"
MBM (Memory Bandwidth Monitoring)               "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation)               "mba"
SMBA (Slow Memory Bandwidth Allocation)         ""
BMEC (Bandwidth Monitoring Event Configuration) ""
=============================================== ================================

Historically, new features were made visible by default in /proc/cpuinfo. This
resulted in the feature flags becoming hard to parse by humans. Adding a new
flag to /proc/cpuinfo should be avoided if user space can obtain information
about the feature from resctrl's info directory.

To use the feature mount the file system::

 # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps][,debug]] /sys/fs/resctrl

mount options are:

"cdp":
        Enable code/data prioritization in L3 cache allocations.
"cdpl2":
        Enable code/data prioritization in L2 cache allocations.
"mba_MBps":
        Enable the MBA Software Controller(mba_sc) to specify MBA
        bandwidth in MiBps.
"debug":
        Make debug files accessible. Available debug files are
        annotated with "Available only with debug option".

L2 and L3 CDP are controlled separately.

RDT features are orthogonal. A particular system may support only
monitoring, only control, or both monitoring and control. Cache
pseudo-locking is a unique way of using cache control to "pin" or
"lock" data in the cache. Details can be found in
"Cache Pseudo-Locking".


The mount succeeds if either of allocation or monitoring is present, but
only those files and directories supported by the system will be created.
For more details on the behavior of the interface during monitoring
and allocation, see the "Resource alloc and monitor groups" section.

Info directory
==============

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.

Each subdirectory contains the following files with respect to
allocation:

Cache resource(L3/L2) subdirectory contains the following files
related to allocation:

"num_closids":
        The number of CLOSIDs which are valid for this
        resource. The kernel uses the smallest number of
        CLOSIDs of all enabled resources as limit.
"cbm_mask":
        The bitmask which is valid for this resource.
        This mask is equivalent to 100%.
"min_cbm_bits":
        The minimum number of consecutive bits which
        must be set when writing a mask.

"shareable_bits":
        Bitmask of shareable resource with other executing
        entities (e.g. I/O). User can use this when
        setting up exclusive cache partitions. Note that
        some platforms support devices that have their
        own settings for cache use which can over-ride
        these bits.
"bit_usage":
        Annotated capacity bitmasks showing how all
        instances of the resource are used. The legend is:

        "0":
              Corresponding region is unused. When the system's
              resources have been allocated and a "0" is found
              in "bit_usage" it is a sign that resources are
              wasted.

        "H":
              Corresponding region is used by hardware only
              but available for software use. If a resource
              has bits set in "shareable_bits" but not all
              of these bits appear in the resource groups'
              schematas then the bits appearing in
              "shareable_bits" but no resource group will
              be marked as "H".
        "X":
              Corresponding region is available for sharing and
              used by hardware and software. These are the
              bits that appear in "shareable_bits" as
              well as a resource group's allocation.
        "S":
              Corresponding region is used by software
              and available for sharing.
        "E":
              Corresponding region is used exclusively by
              one resource group. No sharing allowed.
        "P":
              Corresponding region is pseudo-locked. No
              sharing allowed.
"sparse_masks":
        Indicates if non-contiguous 1s value in CBM is supported.

        "0":
              Only contiguous 1s value in CBM is supported.
        "1":
              Non-contiguous 1s value in CBM is supported.
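All of these info files are small text files, so they are as easy to
consume from a program as from the shell. As a minimal sketch (not part
of the kernel interface itself, and assuming resctrl is mounted at the
default /sys/fs/resctrl), the following counts how many capacity bits
the L3 "cbm_mask" exposes::

  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
        unsigned long long cbm;
        int bits = 0;
        /* assumes resctrl is mounted at /sys/fs/resctrl with L3 CAT */
        FILE *f = fopen("/sys/fs/resctrl/info/L3/cbm_mask", "r");

        /* cbm_mask contains hex digits without a leading 0x */
        if (!f || fscanf(f, "%llx", &cbm) != 1) {
                perror("cbm_mask");
                exit(EXIT_FAILURE);
        }
        fclose(f);

        for (; cbm; cbm >>= 1)
                bits += cbm & 1;

        printf("L3 capacity bitmask has %d bits\n", bits);
        return 0;
  }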
130 "sparse_masks": 131 Indicates if non-contiguous 1s 132 133 "0": 134 Only contiguous 135 "1": 136 Non-contiguous 1 137 138 Memory bandwidth(MB) subdirectory contains the 139 with respect to allocation: 140 141 "min_bandwidth": 142 The minimum memory bandwidth p 143 user can request. 144 145 "bandwidth_gran": 146 The granularity in which the m 147 percentage is allocated. The a 148 b/w percentage is rounded off 149 control step available on the 150 available bandwidth control st 151 min_bandwidth + N * bandwidth_ 152 153 "delay_linear": 154 Indicates if the delay scale i 155 non-linear. This field is pure 156 only. 157 158 "thread_throttle_mode": 159 Indicator on Intel systems of 160 of a physical core are throttl 161 request different memory bandw 162 163 "max": 164 the smallest percentag 165 to all threads 166 "per-thread": 167 bandwidth percentages 168 the threads running on 169 170 If RDT monitoring is available there will be a 171 with the following files: 172 173 "num_rmids": 174 The number of RMIDs available. 175 upper bound for how many "CTRL 176 groups can be created. 177 178 "mon_features": 179 Lists the monitoring events if 180 monitoring is enabled for the 181 Example:: 182 183 # cat /sys/fs/resctrl/ 184 llc_occupancy 185 mbm_total_bytes 186 mbm_local_bytes 187 188 If the system supports Bandwid 189 Configuration (BMEC), then the 190 be configurable. The output wi 191 192 # cat /sys/fs/resctrl/ 193 llc_occupancy 194 mbm_total_bytes 195 mbm_total_bytes_config 196 mbm_local_bytes 197 mbm_local_bytes_config 198 199 "mbm_total_bytes_config", "mbm_local_bytes_con 200 Read/write files containing the config 201 and mbm_local_bytes events, respective 202 Monitoring Event Configuration (BMEC) 203 The event configuration settings are d 204 all the CPUs in the domain. When eithe 205 changed, the bandwidth counters for al 206 (mbm_total_bytes as well as mbm_local_ 207 domain. The next read for every RMID w 208 and subsequent reads will report the v 209 210 Following are the types of events supp 211 212 ==== ============================== 213 Bits Description 214 ==== ============================== 215 6 Dirty Victims from the QOS dom 216 5 Reads to slow memory in the no 217 4 Reads to slow memory in the lo 218 3 Non-temporal writes to non-loc 219 2 Non-temporal writes to local N 220 1 Reads to memory in the non-loc 221 0 Reads to memory in the local N 222 ==== ============================== 223 224 By default, the mbm_total_bytes config 225 all the event types and the mbm_local_ 226 0x15 to count all the local memory eve 227 228 Examples: 229 230 * To view the current configuration:: 231 :: 232 233 # cat /sys/fs/resctrl/info/L3_MON/ 234 0=0x7f;1=0x7f;2=0x7f;3=0x7f 235 236 # cat /sys/fs/resctrl/info/L3_MON/ 237 0=0x15;1=0x15;3=0x15;4=0x15 238 239 * To change the mbm_total_bytes to cou 240 the bits 0, 1, 4 and 5 needs to be s 241 (in hexadecimal 0x33): 242 :: 243 244 # echo "0=0x33" > /sys/fs/resctrl 245 246 # cat /sys/fs/resctrl/info/L3_MON/ 247 0=0x33;1=0x7f;2=0x7f;3=0x7f 248 249 * To change the mbm_local_bytes to cou 250 domain 0 and 1, the bits 4 and 5 nee 251 in binary (in hexadecimal 0x30): 252 :: 253 254 # echo "0=0x30;1=0x30" > /sys/fs/ 255 256 # cat /sys/fs/resctrl/info/L3_MON/ 257 0=0x30;1=0x30;3=0x15;4=0x15 258 259 "max_threshold_occupancy": 260 Read/write file provides the l 261 bytes) at which a previously u 262 counter can be considered for 263 264 Finally, in the top level of the "info" direct 265 named "last_cmd_status". 
Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
control files). If the command was successful, it will read as "ok".
If the command failed, it will provide more information that can be
conveyed in the error returns from file operations. E.g.
::

        # echo L3:0=f7 > schemata
        bash: echo: write error: Invalid argument
        # cat info/last_cmd_status
        mask f7 has non-consecutive 1-bits

Resource alloc and monitor groups
=================================

Resource groups are represented as directories in the resctrl file
system.  The default group is the root directory which, immediately
after mounting, owns all the tasks and cpus in the system and can make
full use of all resources.

On a system with RDT control features additional directories can be
created in the root directory that specify different amounts of each
resource (see "schemata" below). The root and these additional top level
directories are referred to as "CTRL_MON" groups below.

On a system with RDT monitoring the root directory and other top level
directories contain a directory named "mon_groups" in which additional
directories can be created to monitor subsets of tasks in the CTRL_MON
group that is their ancestor. These are called "MON" groups in the rest
of this document.

Removing a directory will move all tasks and cpus owned by the group it
represents to the parent. Removing one of the created CTRL_MON groups
will automatically remove all MON groups below it.

Moving MON group directories to a new parent CTRL_MON group is supported
for the purpose of changing the resource allocations of a MON group
without impacting its monitoring data or assigned tasks. This operation
is not allowed for MON groups which monitor CPUs. No other move
operation is currently allowed other than simply renaming a CTRL_MON or
MON group.

All groups contain the following files:

"tasks":
        Reading this file shows the list of all tasks that belong to
        this group. Writing a task id to the file will add a task to the
        group. Multiple tasks can be added by separating the task ids
        with commas. Tasks will be assigned sequentially. Multiple
        failures are not supported. A single failure encountered while
        attempting to assign a task will cause the operation to abort and
        already added tasks before the failure will remain in the group.
        Failures will be logged to /sys/fs/resctrl/info/last_cmd_status.

        If the group is a CTRL_MON group the task is removed from
        whichever previous CTRL_MON group owned the task and also from
        any MON group that owned the task. If the group is a MON group,
        then the task must already belong to the CTRL_MON parent of this
        group. The task is removed from any previous MON group.


"cpus":
        Reading this file shows a bitmask of the logical CPUs owned by
        this group. Writing a mask to this file will add and remove
        CPUs to/from this group. As with the tasks file a hierarchy is
        maintained where MON groups may only include CPUs owned by the
        parent CTRL_MON group.
        When the resource group is in pseudo-locked mode this file will
        only be readable, reflecting the CPUs associated with the
        pseudo-locked region.


"cpus_list":
        Just like "cpus", only using ranges of CPUs instead of bitmasks.


When control is enabled all CTRL_MON groups will also contain:

"schemata":
        A list of all the resources available to this group.
        Each resource has its own line and format - see below for details.

"size":
        Mirrors the display of the "schemata" file to display the size in
        bytes of each allocation instead of the bits representing the
        allocation.

"mode":
        The "mode" of the resource group dictates the sharing of its
        allocations. A "shareable" resource group allows sharing of its
        allocations while an "exclusive" resource group does not. A
        cache pseudo-locked region is created by first writing
        "pseudo-locksetup" to the "mode" file before writing the cache
        pseudo-locked region's schemata to the resource group's "schemata"
        file. On successful pseudo-locked region creation the mode will
        automatically change to "pseudo-locked".

"ctrl_hw_id":
        Available only with debug option. The identifier used by hardware
        for the control group. On x86 this is the CLOSID.
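Group manipulation is all done with ordinary directory and file
operations, so the shell sequences shown in the examples later can also
be scripted. A sketch that creates a hypothetical group "p0", moves a
made-up task id into it and dumps "info/last_cmd_status" if the write
fails::

  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(void)
  {
        char status[256];
        ssize_t n;
        int fd;

        /* create the control group; EEXIST means it already exists */
        if (mkdir("/sys/fs/resctrl/p0", 0755) && errno != EEXIST) {
                perror("mkdir");
                return 1;
        }

        fd = open("/sys/fs/resctrl/p0/tasks", O_WRONLY);
        if (fd < 0) {
                perror("open tasks");
                return 1;
        }

        /* move task 1234 (illustrative pid) into the group */
        if (write(fd, "1234", 4) < 0) {
                /* the kernel leaves details in info/last_cmd_status */
                int st = open("/sys/fs/resctrl/info/last_cmd_status",
                              O_RDONLY);

                if (st >= 0) {
                        n = read(st, status, sizeof(status) - 1);
                        if (n > 0) {
                                status[n] = '\0';
                                fprintf(stderr, "last_cmd_status: %s",
                                        status);
                        }
                        close(st);
                }
                close(fd);
                return 1;
        }
        close(fd);
        return 0;
  }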
When monitoring is enabled all MON groups will also contain:

"mon_data":
        This contains a set of files organized by L3 domain and by
        RDT event. E.g. on a system with two L3 domains there will
        be subdirectories "mon_L3_00" and "mon_L3_01".  Each of these
        directories have one file per event (e.g. "llc_occupancy",
        "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
        files provide a read out of the current value of the event for
        all tasks in the group. In CTRL_MON groups these files provide
        the sum for all tasks in the CTRL_MON group and all tasks in
        MON groups. Please see example section for more details on usage.
        On systems with Sub-NUMA Cluster (SNC) enabled there are extra
        directories for each node (located within the "mon_L3_XX" directory
        for the L3 cache they occupy). These are named "mon_sub_L3_YY"
        where "YY" is the node number.

"mon_hw_id":
        Available only with debug option. The identifier used by hardware
        for the monitor group. On x86 this is the RMID.

Resource allocation rules
-------------------------

When a task is running the following rules define which resources are
available to it:

1) If the task is a member of a non-default group, then the schemata
   for that group is used.

2) Else if the task belongs to the default group, but is running on a
   CPU that is assigned to some specific group, then the schemata for the
   CPU's group is used.

3) Otherwise the schemata for the default group is used.

Resource monitoring rules
-------------------------
1) If a task is a member of a MON group, or non-default CTRL_MON group
   then RDT events for the task will be reported in that group.

2) If a task is a member of the default CTRL_MON group, but is running
   on a CPU that is assigned to some specific group, then the RDT events
   for the task will be reported in that group.

3) Otherwise RDT events for the task will be reported in the root level
   "mon_data" group.


Notes on cache occupancy monitoring and control
===============================================
When moving a task from one group to another you should remember that
this only affects *new* cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
to a new group and immediately check the occupancy of the old and new
groups you will likely see that the old group is still showing 3 MB and
the new group zero. When the task accesses locations still in cache from
before the move, the h/w does not update any counters. On a busy system
you will likely see the occupancy in the old group go down as cache lines
are evicted and re-used while the occupancy in the new group rises as
the task accesses memory and loads into the cache are counted based on
membership in the new group.

The same applies to cache allocation control. Moving a task to a group
with a smaller cache partition will not evict any cache lines. The
process may continue to use them from the old partition.

Hardware uses CLOSid(Class of service ID) and an RMID(Resource monitoring ID)
to identify a control group and a monitoring group respectively. Each of
the resource groups are mapped to these IDs based on the kind of group. The
number of CLOSid and RMID are limited by the hardware and hence the creation of
a "CTRL_MON" directory may fail if we run out of either CLOSID or RMID
and creation of "MON" group may fail if we run out of RMIDs.

max_threshold_occupancy - generic concepts
------------------------------------------

Note that an RMID once freed may not be immediately available for use as
the RMID is still tagged the cache lines of the previous user of RMID.
Hence such RMIDs are placed on limbo list and checked back if the cache
occupancy has gone down. If there is a time when system has a lot of
limbo RMIDs but which are not ready to be used, user may see an -EBUSY
during mkdir.

max_threshold_occupancy is a user configurable value to determine the
occupancy at which an RMID can be freed.

The mon_llc_occupancy_limbo tracepoint gives the precise occupancy in bytes
for a subset of RMID that are not immediately available for allocation.
This can't be relied on to produce output every second, it may be necessary
to attempt to create an empty monitor group to force an update. Output may
only be produced if creation of a control or monitor group fails.
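If limbo RMIDs cause mkdir to fail with -EBUSY, the threshold can be
raised so that lightly occupied RMIDs are recycled sooner. A sketch of
such an adjustment; the value is an arbitrary example, and the file is
only present when llc_occupancy monitoring is supported::

  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
        FILE *f = fopen(
                "/sys/fs/resctrl/info/L3_MON/max_threshold_occupancy",
                "w");

        if (!f) {
                perror("open");
                exit(EXIT_FAILURE);
        }
        /* example value only: free RMIDs once occupancy drops
         * below 128 KiB */
        fprintf(f, "%d\n", 131072);
        if (fclose(f)) {
                perror("write");
                exit(EXIT_FAILURE);
        }
        return 0;
  }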
Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
the name of the resource, followed by specific values to be applied
in each of the instances of that resource on the system.

Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical cpus sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps).  To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id

Cache Bit Masks (CBM)
---------------------
For cache resources we describe the portion of the cache that is available
for allocation using a bitmask. The maximum value of the mask is defined
by each cpu model (and may be different for different cache levels). It
is found using CPUID, but is also provided in the "info" directory of
the resctrl file system in "info/{resource}/cbm_mask". Some Intel hardware
requires that these masks have all the '1' bits in a contiguous block. So
0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
and 0xA are not. Check /sys/fs/resctrl/info/{resource}/sparse_masks
if non-contiguous 1s value is supported. On a system with a 20-bit mask
each bit represents 5% of the capacity of the cache. You could partition
the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
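The arithmetic behind such a partitioning is simple enough to sketch.
The following computes the four masks from the example above for a given
CBM width; the values are illustrative, and any mask produced this way
must still respect min_cbm_bits and, on some hardware, contiguity::

  #include <stdio.h>

  int main(void)
  {
        /* example: a 20-bit capacity mask split into 4 equal parts */
        unsigned int cbm_bits = 20, parts = 4;
        unsigned int width = cbm_bits / parts;
        unsigned int i;

        for (i = 0; i < parts; i++) {
                /* contiguous block of 'width' bits, shifted per part */
                unsigned long mask = ((1UL << width) - 1) << (i * width);

                printf("part %u: 0x%lx\n", i, mask);
        }
        return 0;    /* prints 0x1f, 0x3e0, 0x7c00, 0xf8000 */
  }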
Notes on Sub-NUMA Cluster mode
==============================
When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
nodes much more readily than between regular NUMA nodes since the CPUs
on Sub-NUMA nodes share the same L3 cache and the system may report
the NUMA distance between Sub-NUMA nodes with a lower value than used
for regular NUMA nodes.

The top-level monitoring files in each "mon_L3_XX" directory provide
the sum of data across all SNC nodes sharing an L3 cache instance.
Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
"mon_sub_L3_YY" directories to get node local data.

Memory bandwidth allocation is still performed at the L3 cache
level. I.e. throttling controls are applied to all SNC nodes.

L3 cache allocation bitmaps also apply to all SNC nodes. But note that
the amount of L3 cache represented by each bit is divided by the number
of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit
allocation masks each bit normally represents 10MB. But with SNC mode
enabled with two SNC nodes per L3 cache, each bit only represents 5MB.

Memory bandwidth Allocation and monitoring
==========================================

For Memory bandwidth resource, by default the user controls the resource
by indicating the percentage of total memory bandwidth.

The minimum bandwidth percentage value for each cpu model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the cpu model and can
be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are rounded
to the next control step available on the hardware.

The bandwidth throttling is a core specific mechanism on some of Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core may result in both threads being throttled to use the
low bandwidth (see "thread_throttle_mode").

The fact that Memory bandwidth allocation(MBA) may be a core
specific mechanism whereas memory bandwidth monitoring(MBM) is done at
the package level may lead to confusion when users try to apply control
via the MBA and then monitor the bandwidth to see if the controls are
effective. Below are such scenarios:

1. User may *not* see increase in actual bandwidth when percentage
   values are increased:

This can occur when aggregate L2 external bandwidth is more than L3
external bandwidth. Consider an SKL SKU with 24 cores on a package and
where L2 external  is 10GBps (hence aggregate L2 external bandwidth is
240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20
threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3
bandwidth of 100GBps although the percentage value specified is only 50%
<< 100%. Hence increasing the bandwidth percentage will not yield any
more bandwidth. This is because although the L2 external bandwidth still
has capacity, the L3 external bandwidth is fully used. Also note that
this would be dependent on number of cores the benchmark is run on.

2. Same bandwidth percentage may mean different actual bandwidth
   depending on # of threads:

For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
threads, with 10% bandwidth' can consume up to 10GBps and 40GBps although
they have same percentage bandwidth of 10%. This is simply because as
threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although user specified bandwidth percentage is same.

In order to mitigate this and make the interface more user friendly,
resctrl added support for specifying the bandwidth in MiBps as well. The
kernel underneath would use a software feedback mechanism or a "Software
Controller(mba_sc)" which reads the actual bandwidth using MBM counters
and adjusts the memory bandwidth percentages to ensure::

        "actual bandwidth < user specified bandwidth"

By default, the schemata would take the bandwidth percentage values
whereas user can switch to the "MBA software controller" mode using
a mount option 'mba_MBps'. The schemata format is specified in the below
sections.

L3 schemata file details (code and data prioritization disabled)
----------------------------------------------------------------
With CDP disabled the L3 schemata format is::

        L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L3 schemata file details (CDP enabled via mount option to resctrl)
------------------------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this::

        L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
        L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L2 schemata file details
------------------------
CDP is supported at L2 using the 'cdpl2' mount option. The schemata
format is either::

        L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

or

        L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
        L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...


Memory bandwidth Allocation (default mode)
------------------------------------------

Memory b/w domain is L3 cache.
::

        MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Memory bandwidth Allocation specified in MiBps
----------------------------------------------

Memory bandwidth domain is L3 cache.
::

        MB:<cache_id0>=bw_MiBps0;<cache_id1>=bw_MiBps1;...

Slow Memory Bandwidth Allocation (SMBA)
---------------------------------------
AMD hardware supports Slow Memory Bandwidth Allocation (SMBA).
CXL.memory is the only supported "slow" memory device. With the
support of SMBA, the hardware enables bandwidth allocation on
the slow memory devices. If there are multiple such devices in
the system, the throttling logic groups all the slow sources
together and applies the limit on them as a whole.

The presence of SMBA (with CXL.memory) is independent of slow memory
devices presence. If there are no such devices on the system, then
configuring SMBA will have no impact on the performance of the system.

The bandwidth domain for slow memory is L3 cache. Its schemata file
is formatted as:
::

        SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
on all domains. When writing you only need to specify those values
which you wish to change.  E.g.
::

  # cat schemata
  L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
  # echo "L3DATA:2=3c0;" > schemata
  # cat schemata
  L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
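When writing "MB:" values in the default percentage mode, requests that
fall between control steps are rounded as described in "Memory bandwidth
Allocation and monitoring" above. A sketch of that stepping rule,
assuming min_bandwidth and bandwidth_gran have already been read from
the info directory; the rounding direction follows the documented "next
control step" wording and is illustrative rather than authoritative::

  #include <stdio.h>

  /* round a requested percentage up to the next valid control step,
   * i.e. a value of the form min_bw + N * bw_gran */
  static unsigned int mba_round(unsigned int req, unsigned int min_bw,
                                unsigned int bw_gran)
  {
        if (req < min_bw)
                return min_bw;
        return min_bw + (req - min_bw + bw_gran - 1) / bw_gran * bw_gran;
  }

  int main(void)
  {
        /* e.g. min_bandwidth 10, bandwidth_gran 10: 37% becomes 40% */
        printf("%u\n", mba_round(37, 10, 10));
        return 0;
  }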
Reading/writing the schemata file (on AMD systems)
--------------------------------------------------
Reading the schemata file will show the current bandwidth limit on all
domains. The allocated resources are in multiples of one eighth GB/s.
When writing to the file, you need to specify what cache id you wish to
configure the bandwidth limit.

For example, to allocate 2GB/s limit on the first cache id:

::

  # cat schemata
    MB:0=2048;1=2048;2=2048;3=2048
    L3:0=ffff;1=ffff;2=ffff;3=ffff

  # echo "MB:1=16" > schemata
  # cat schemata
    MB:0=2048;1=  16;2=2048;3=2048
    L3:0=ffff;1=ffff;2=ffff;3=ffff

Reading/writing the schemata file (on AMD systems) with SMBA feature
--------------------------------------------------------------------
Reading and writing the schemata file is the same as without SMBA in
above section.

For example, to allocate 8GB/s limit on the first cache id:

::

  # cat schemata
    SMBA:0=2048;1=2048;2=2048;3=2048
      MB:0=2048;1=2048;2=2048;3=2048
      L3:0=ffff;1=ffff;2=ffff;3=ffff

  # echo "SMBA:1=64" > schemata
  # cat schemata
    SMBA:0=2048;1=  64;2=2048;3=2048
      MB:0=2048;1=2048;2=2048;3=2048
      L3:0=ffff;1=ffff;2=ffff;3=ffff
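The unit conversion for the AMD "MB:" and "SMBA:" values is easy to get
wrong, so here is a one-line helper as a sketch: the value written is
the limit in GB/s multiplied by eight (2 GB/s -> 16, 8 GB/s -> 64,
matching the examples above)::

  #include <stdio.h>

  /* convert a GB/s limit to the value written to an AMD
   * "MB:" or "SMBA:" schemata entry (units of one eighth GB/s) */
  static unsigned int gbps_to_schemata_value(unsigned int gbps)
  {
        return gbps * 8;
  }

  int main(void)
  {
        printf("MB:1=%u\n", gbps_to_schemata_value(2));   /* MB:1=16 */
        return 0;
  }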
Cache Pseudo-Locking
====================
CAT enables a user to specify the amount of cache space that an
application can fill. Cache pseudo-locking builds on the fact that a
CPU can still read and write data pre-allocated outside its current
allocated area on a cache hit. With cache pseudo-locking, data can be
preloaded into a reserved portion of cache that no application can
fill, and from that point on will only serve cache hits. The cache
pseudo-locked memory is made accessible to user space where an
application can map it into its virtual address space and thus have
a region of memory with reduced average read latency.

The creation of a cache pseudo-locked region is triggered by a request
from the user to do so that is accompanied by a schemata of the region
to be pseudo-locked. The cache pseudo-locked region is created as follows:

- Create a CAT allocation CLOSNEW with a CBM matching the schemata
  from the user of the cache region that will contain the pseudo-locked
  memory. This region must not overlap with any current CAT allocation/CLOS
  on the system and no future overlap with this pseudo-locked region is
  allowed while the pseudo-locked region exists.
- Create a contiguous region of memory of the same size as the cache
  region.
- Flush the cache, disable hardware prefetchers, disable preemption.
- Make CLOSNEW the active CLOS and touch the allocated memory to load
  it into the cache.
- Set the previous CLOS as active.
- At this point the closid CLOSNEW can be released - the cache
  pseudo-locked region is protected as long as its CBM does not appear in
  any CAT allocation. Even though the cache pseudo-locked region will from
  this point on not appear in any CBM of any CLOS an application running on
  any CPU will be able to access the memory in the pseudo-locked region since
  the region continues to serve cache hits.
- The contiguous region of memory loaded into the cache is exposed to
  user-space as a character device.

Cache pseudo-locking increases the probability that data will remain
in the cache via carefully configuring the CAT feature and controlling
application behavior. There is no guarantee that data is placed in
cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
"locked" data from cache. Power management C-states may shrink or
power off cache. Deeper C-states will automatically be restricted on
pseudo-locked region creation.

It is required that an application using a pseudo-locked region runs
with affinity to the cores (or a subset of the cores) associated
with the cache on which the pseudo-locked region resides. A sanity check
within the code will not allow an application to map pseudo-locked memory
unless it runs with affinity to cores associated with the cache on which
the pseudo-locked region resides. The sanity check is only done during the
initial mmap() handling, there is no enforcement afterwards and the
application itself needs to ensure it remains affine to the correct cores.

Pseudo-locking is accomplished in two stages:

1) During the first stage the system administrator allocates a portion
   of cache that should be dedicated to pseudo-locking. At this time an
   equivalent portion of memory is allocated, loaded into allocated
   cache portion, and exposed as a character device.
2) During the second stage a user-space application maps (mmap()) the
   pseudo-locked memory into its address space.

Cache Pseudo-Locking Interface
------------------------------
A pseudo-locked region is created using the resctrl interface as follows:

1) Create a new resource group by creating a new directory in /sys/fs/resctrl.
2) Change the new resource group's mode to "pseudo-locksetup" by writing
   "pseudo-locksetup" to the "mode" file.
3) Write the schemata of the pseudo-locked region to the "schemata" file. All
   bits within the schemata should be "unused" according to the "bit_usage"
   file.

On successful pseudo-locked region creation the "mode" file will contain
"pseudo-locked" and a new character device with the same name as the resource
group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
by user space in order to obtain access to the pseudo-locked memory region.

An example of cache pseudo-locked region creation and usage can be found below.

Cache Pseudo-Locking Debugging Interface
----------------------------------------
The pseudo-locking debugging interface is enabled by default (if
CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.

There is no explicit way for the kernel to test if a provided memory
location is present in the cache. The pseudo-locking debugging interface uses
the tracing infrastructure to provide two ways to measure cache residency of
the pseudo-locked region:

1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
   from these measurements are best visualized using a hist trigger (see
   example below). In this test the pseudo-locked region is traversed at
   a stride of 32 bytes while hardware prefetchers and preemption
   are disabled. This also provides a substitute visualization of cache
   hits and misses.
2) Cache hit and miss measurements using model specific precision counters if
   available. Depending on the levels of cache on the system the pseudo_lock_l2
   and pseudo_lock_l3 tracepoints are available.

When a pseudo-locked region is created a new debugfs directory is created for
it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
write-only file, pseudo_lock_measure, is present in this directory. The
measurement of the pseudo-locked region depends on the number written to this
debugfs file:

1:
     writing "1" to the pseudo_lock_measure file will trigger the latency
     measurement captured in the pseudo_lock_mem_latency tracepoint. See
     example below.
2:
     writing "2" to the pseudo_lock_measure file will trigger the L2 cache
     residency (cache hits and misses) measurement captured in the
     pseudo_lock_l2 tracepoint. See example below.
3:
     writing "3" to the pseudo_lock_measure file will trigger the L3 cache
     residency (cache hits and misses) measurement captured in the
     pseudo_lock_l3 tracepoint.
All measurements are recorded with the tracing infrastructure. This requires
the relevant tracepoints to be enabled before the measurement is triggered.

Example of latency debugging interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example a pseudo-locked region named "newlock" was created. Here is
how we can measure the latency in cycles of reading from this region and
visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
is set::

  # :> /sys/kernel/tracing/trace
  # echo 'hist:keys=latency' > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
  # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
  # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
  # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
  # cat /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/hist

  # event histogram
  #
  # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
  #

  { latency:        456 } hitcount:          1
  { latency:         50 } hitcount:         83
  { latency:         36 } hitcount:         96
  { latency:         44 } hitcount:        174
  { latency:         48 } hitcount:        195
  { latency:         46 } hitcount:        262
  { latency:         42 } hitcount:        693
  { latency:         40 } hitcount:       3204
  { latency:         38 } hitcount:       3484

  Totals:
      Hits: 8192
      Entries: 9
      Dropped: 0

Example of cache hits/misses debugging
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example a pseudo-locked region named "newlock" was created on the L2
cache of a platform. Here is how we can obtain details of the cache hits
and misses using the platform's precision counters.
::

  # :> /sys/kernel/tracing/trace
  # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
  # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
  # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
  # cat /sys/kernel/tracing/trace

  # tracer: nop
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
  pseudo_lock_mea-1672  [002] ....  3132.860500: pseudo_lock_l2: hits=4097 miss=0


Examples for RDT allocation usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1) Example 1

On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks, minimum b/w of 10% with a memory bandwidth
granularity of 10%.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1
  # echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
  # echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Similarly, tasks that are under the control of group "p0" may use a
maximum memory b/w of 50% on socket0 and 50% on socket 1.
Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocations specifies the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.

If resctrl is using the software controller (mba_sc) then user can enter the
max b/w in MB rather than the percentage values.
::

  # echo -e "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
  # echo -e "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata

In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w
of 1024MB where as on socket 1 they would use 500MB.
2) Example 2

Again two sockets, but this time with a more realistic 20-bit mask.

Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
neighbors, each of the two real-time tasks exclusively occupies one quarter
of L3 cache on socket 0.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
ordinary tasks::

  # echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata

Next we make a resource group for our first real time task and give
it access to the "top" 25% of the cache on socket 0.
::

  # mkdir p0
  # echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0. Most uses of resource groups will also constrain which
processors tasks run on.
::

  # echo 1234 > p0/tasks
  # taskset -cp 1 1234

Ditto for the second real time task (with the remaining 25% of cache)::

  # mkdir p1
  # echo "L3:0=7c00;1=fffff" > p1/schemata
  # echo 5678 > p1/tasks
  # taskset -cp 2 5678

For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like (assume min_bandwidth 10 and bandwidth_gran 10):

For our first real time task this would request 20% memory b/w on socket 0.
::

  # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

For our second real time task this would request another 20% memory b/w
on socket 0.
::

  # echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata

3) Example 3

A single socket system which has real-time tasks running on core 4-7 and
non real-time workload assigned to core 0-3. The real-time tasks share text
and data, so a per task association is not required and due to interaction
with the kernel it's desired that the kernel on these cores shares L3 with
the tasks.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
cannot be used by ordinary tasks::

  # echo "L3:0=3ff\nMB:0=50" > schemata

Next we make a resource group for our real time cores and give it access
to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
socket 0.
::

  # mkdir p0
  # echo "L3:0=ffc00\nMB:0=50" > p0/schemata

Finally we move core 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real time threads are scheduled on the cores 4-7.
::

  # echo f0 > p0/cpus
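The "cpus" file takes the same hex bitmap format used elsewhere in
sysfs. A sketch of how the "f0" mask above is derived from a CPU list;
this simple version only handles masks that fit in 64 bits, wider
systems use the comma-separated chunk format::

  #include <stdio.h>

  int main(void)
  {
        unsigned long long mask = 0;
        int cpus[] = { 4, 5, 6, 7 };    /* CPUs from the example above */
        unsigned int i;

        for (i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++)
                mask |= 1ULL << cpus[i];

        printf("%llx\n", mask);         /* prints "f0" */
        return 0;
  }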
4) Example 4

The resource groups in previous examples were all in the default "shareable"
mode allowing sharing of their cache allocations. If one resource group
configures a cache allocation then nothing prevents another resource group
to overlap with that allocation.

In this example a new exclusive resource group will be created on a L2 CAT
system with two L2 cache instances that can be configured with an 8-bit
capacity bitmask. The new exclusive resource group will be configured to use
25% of each cache instance.
::

  # mount -t resctrl resctrl /sys/fs/resctrl/
  # cd /sys/fs/resctrl

First, we observe that the default group is configured to allocate to all L2
cache::

  # cat schemata
  L2:0=ff;1=ff

We could attempt to create the new resource group at this point, but it will
fail because of the overlap with the schemata of the default group::

  # mkdir p0
  # echo 'L2:0=0x3;1=0x3' > p0/schemata
  # cat p0/mode
  shareable
  # echo exclusive > p0/mode
  -sh: echo: write error: Invalid argument
  # cat info/last_cmd_status
  schemata overlaps

To ensure that there is no overlap with another resource group the default
resource group's schemata has to change, making it possible for the new
resource group to become exclusive.
::

  # echo 'L2:0=0xfc;1=0xfc' > schemata
  # echo exclusive > p0/mode
  # grep . p0/*
  p0/cpus:0
  p0/mode:exclusive
  p0/schemata:L2:0=03;1=03
  p0/size:L2:0=262144;1=262144

A new resource group will on creation not overlap with an exclusive resource
group::

  # mkdir p1
  # grep . p1/*
  p1/cpus:0
  p1/mode:shareable
  p1/schemata:L2:0=fc;1=fc
  p1/size:L2:0=786432;1=786432

The bit_usage will reflect how the cache is used::

  # cat info/L2/bit_usage
  0=SSSSSSEE;1=SSSSSSEE

A resource group cannot be forced to overlap with an exclusive resource group::

  # echo 'L2:0=0x1;1=0x1' > p1/schemata
  -sh: echo: write error: Invalid argument
  # cat info/last_cmd_status
  overlaps with exclusive group

Example of Cache Pseudo-Locking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Lock portion of L2 cache from cache id 1 using CBM 0x3. Pseudo-locked
region is exposed at /dev/pseudo_lock/newlock that can be provided to
application for argument to mmap().
::

  # mount -t resctrl resctrl /sys/fs/resctrl/
  # cd /sys/fs/resctrl

Ensure that there are bits available that can be pseudo-locked, since only
unused bits can be pseudo-locked the bits to be pseudo-locked needs to be
removed from the default resource group's schemata::

  # cat info/L2/bit_usage
  0=SSSSSSSS;1=SSSSSSSS
  # echo 'L2:1=0xfc' > schemata
  # cat info/L2/bit_usage
  0=SSSSSSSS;1=SSSSSS00

Create a new resource group that will be associated with the pseudo-locked
region, indicate that it will be used for a pseudo-locked region, and
configure the requested pseudo-locked region capacity bitmask::

  # mkdir newlock
  # echo pseudo-locksetup > newlock/mode
  # echo 'L2:1=0x3' > newlock/schemata

On success the resource group's mode will change to pseudo-locked, the
bit_usage will reflect the pseudo-locked region, and the character device
exposing the pseudo-locked region will exist::

  # cat newlock/mode
  pseudo-locked
  # cat info/L2/bit_usage
  0=SSSSSSSS;1=SSSSSSPP
  # ls -l /dev/pseudo_lock/newlock
  crw------- 1 root root 243, 0 Apr  3 05:01 /dev/pseudo_lock/newlock

::

  /*
   * Example code to access one page of pseudo-locked cache region
   * from user space.
   */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/mman.h>

  /*
   * It is required that the application runs with affinity to only
   * cores associated with the pseudo-locked region. Here the cpu
   * is hardcoded for convenience of example.
   */
  static int cpuid = 2;

  int main(int argc, char *argv[])
  {
        cpu_set_t cpuset;
        long page_size;
        void *mapping;
        int dev_fd;
        int ret;

        page_size = sysconf(_SC_PAGESIZE);

        CPU_ZERO(&cpuset);
        CPU_SET(cpuid, &cpuset);
        ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
        if (ret < 0) {
                perror("sched_setaffinity");
                exit(EXIT_FAILURE);
        }

        dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
        if (dev_fd < 0) {
                perror("open");
                exit(EXIT_FAILURE);
        }

        mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                       dev_fd, 0);
        if (mapping == MAP_FAILED) {
                perror("mmap");
                close(dev_fd);
                exit(EXIT_FAILURE);
        }

        /* Application interacts with pseudo-locked memory @mapping */

        ret = munmap(mapping, page_size);
        if (ret < 0) {
                perror("munmap");
                close(dev_fd);
                exit(EXIT_FAILURE);
        }

        close(dev_fd);
        exit(EXIT_SUCCESS);
  }
Locking between applications
----------------------------

Certain operations on the resctrl filesystem, composed of read/writes
to/from multiple files, must be atomic.

As an example, the allocation of an exclusive reservation of L3 cache
involves:

  1. Read the cbmmasks from each directory or the per-resource "bit_usage"
  2. Find a contiguous set of bits in the global CBM bitmask that is clear
     in any of the directory cbmmasks
  3. Create a new directory
  4. Set the bits found in step 2 to the new directory "schemata" file

If two applications attempt to allocate space concurrently then they can
end up allocating the same bits so the reservations are shared instead of
exclusive.

To coordinate atomic operations on the resctrlfs and to avoid the problem
above, the following locking procedure is recommended:

Locking is based on flock, which is available in libc and also as a shell
script command.

Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) funlock

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) If success read the directory structure.
 C) funlock

Example with bash::

  # Atomically read directory structure
  $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

  # Read directory contents and create new subdirectory

  $ cat create-dir.sh
  find /sys/fs/resctrl/ > output.txt
  mask = function-of(output.txt)
  mkdir /sys/fs/resctrl/newres/
  echo mask > /sys/fs/resctrl/newres/schemata

  $ flock /sys/fs/resctrl/ ./create-dir.sh

Example with C::

  /*
   * Example code to take advisory locks
   * before accessing resctrl filesystem
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/file.h>

  void resctrl_take_shared_lock(int fd)
  {
        int ret;

        /* take shared lock on resctrl filesystem */
        ret = flock(fd, LOCK_SH);
        if (ret) {
                perror("flock");
                exit(-1);
        }
  }

  void resctrl_take_exclusive_lock(int fd)
  {
        int ret;

        /* take exclusive lock on resctrl filesystem */
        ret = flock(fd, LOCK_EX);
        if (ret) {
                perror("flock");
                exit(-1);
        }
  }

  void resctrl_release_lock(int fd)
  {
        int ret;

        /* release lock on resctrl filesystem */
        ret = flock(fd, LOCK_UN);
        if (ret) {
                perror("flock");
                exit(-1);
        }
  }

  int main(void)
  {
        int fd;

        fd = open("/sys/fs/resctrl", O_DIRECTORY);
        if (fd == -1) {
                perror("open");
                exit(-1);
        }
        resctrl_take_shared_lock(fd);
        /* code to read directory contents */
        resctrl_release_lock(fd);

        resctrl_take_exclusive_lock(fd);
        /* code to read and write directory contents */
        resctrl_release_lock(fd);

        return 0;
  }

Examples for RDT Monitoring along with allocation usage
=======================================================
Reading monitored data
----------------------
Reading an event file (for ex: mon_data/mon_L3_00/llc_occupancy) would
show the current snapshot of LLC occupancy of the corresponding MON
group or CTRL_MON group.
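Because every domain is a separate file, collecting a per-domain view
means walking the "mon_data" directory. A sketch that prints
llc_occupancy for each L3 domain of a hypothetical group "p1"::

  #include <dirent.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
        /* group name is illustrative */
        const char *base = "/sys/fs/resctrl/p1/mon_data";
        char path[512];
        struct dirent *de;
        DIR *d = opendir(base);

        if (!d) {
                perror(base);
                return 1;
        }
        while ((de = readdir(d))) {
                unsigned long long v;
                FILE *f;

                /* one subdirectory per L3 cache instance: mon_L3_XX */
                if (strncmp(de->d_name, "mon_L3_", 7))
                        continue;
                snprintf(path, sizeof(path), "%s/%s/llc_occupancy",
                         base, de->d_name);
                f = fopen(path, "r");
                if (f && fscanf(f, "%llu", &v) == 1)
                        printf("%s: %llu bytes\n", de->d_name, v);
                if (f)
                        fclose(f);
        }
        closedir(d);
        return 0;
  }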
Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
------------------------------------------------------------------------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1
  # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
  # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
  # echo 5678 > p1/tasks
  # echo 5679 > p1/tasks

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Create monitor groups and assign a subset of tasks to each monitor group.
::

  # cd /sys/fs/resctrl/p1/mon_groups
  # mkdir m11 m12
  # echo 5678 > m11/tasks
  # echo 5679 > m12/tasks

fetch data (data shown in bytes)
::

  # cat m11/mon_data/mon_L3_00/llc_occupancy
  16234000
  # cat m11/mon_data/mon_L3_01/llc_occupancy
  14789000
  # cat m12/mon_data/mon_L3_00/llc_occupancy
  16789000

The parent ctrl_mon group shows the aggregated data.
::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
  31234000

Example 2 (Monitor a task from its creation)
--------------------------------------------
On a two socket machine (one L3 cache per socket)::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1

An RMID is allocated to the group once it is created and hence the <cmd>
below is monitored from its creation.
::

  # echo $$ > /sys/fs/resctrl/p1/tasks
  # <cmd>

Fetch the data::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
  31789000

Example 3 (Monitor without CAT support or before creating CAT groups)
---------------------------------------------------------------------

Assume a system like HSW (Haswell) has only CQM and no CAT support. In
this case the resctrl will still mount but cannot create CTRL_MON
directories. But the user can create different MON groups within the root
group and thereby monitor all tasks including kernel threads.

This can also be used to profile a job's cache size footprint before
being able to allocate them to different allocation groups.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir mon_groups/m01
  # mkdir mon_groups/m02

  # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
  # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks

Monitor the groups separately and also get per domain data. From the
below it is apparent that the tasks are mostly doing work on
domain(socket) 0.
::

  # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
  31234000
  # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
  34555
  # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
  31234000
  # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
  32789
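The mbm_* event files report cumulative byte counts, so a bandwidth
figure is obtained by sampling twice and dividing the delta by the
interval. A sketch using the (illustrative) monitor group path from
Example 3::

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  static unsigned long long read_counter(const char *path)
  {
        unsigned long long v;
        FILE *f = fopen(path, "r");

        if (!f || fscanf(f, "%llu", &v) != 1) {
                perror(path);
                exit(EXIT_FAILURE);
        }
        fclose(f);
        return v;
  }

  int main(void)
  {
        const char *path = "/sys/fs/resctrl/mon_groups/m01/mon_data"
                           "/mon_L3_00/mbm_total_bytes";
        unsigned long long a, b;

        a = read_counter(path);
        sleep(1);                       /* one second sampling interval */
        b = read_counter(path);

        printf("%.2f MB/s\n", (b - a) / 1e6);
        return 0;
  }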
We wan 1406 occupancy of the real time threads on these c 1407 :: 1408 1409 # mount -t resctrl resctrl /sys/fs/resctrl 1410 # cd /sys/fs/resctrl 1411 # mkdir p1 1412 1413 Move the cpus 4-7 over to p1:: 1414 1415 # echo f0 > p1/cpus 1416 1417 View the llc occupancy snapshot:: 1418 1419 # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00 1420 11234000 1421 1422 Intel RDT Errata 1423 ================ 1424 1425 Intel MBM Counters May Report System Memory B 1426 --------------------------------------------- 1427 1428 Errata SKX99 for Skylake server and BDF102 fo 1429 1430 Problem: Intel Memory Bandwidth Monitoring (M 1431 according to the assigned Resource Monitor ID 1432 core. The IA32_QM_CTR register (MSR 0xC8E), u 1433 metrics, may report incorrect system bandwidt 1434 1435 Implication: Due to the errata, system memory 1436 what is reported. 1437 1438 Workaround: MBM total and local readings are 1439 following correction factor table: 1440 1441 +---------------+---------------+------------ 1442 |core count |rmid count |rmid thresho 1443 +---------------+---------------+------------ 1444 |1 |8 |0 1445 +---------------+---------------+------------ 1446 |2 |16 |0 1447 +---------------+---------------+------------ 1448 |3 |24 |15 1449 +---------------+---------------+------------ 1450 |4 |32 |0 1451 +---------------+---------------+------------ 1452 |6 |48 |31 1453 +---------------+---------------+------------ 1454 |7 |56 |47 1455 +---------------+---------------+------------ 1456 |8 |64 |0 1457 +---------------+---------------+------------ 1458 |9 |72 |63 1459 +---------------+---------------+------------ 1460 |10 |80 |63 1461 +---------------+---------------+------------ 1462 |11 |88 |79 1463 +---------------+---------------+------------ 1464 |12 |96 |0 1465 +---------------+---------------+------------ 1466 |13 |104 |95 1467 +---------------+---------------+------------ 1468 |14 |112 |95 1469 +---------------+---------------+------------ 1470 |15 |120 |95 1471 +---------------+---------------+------------ 1472 |16 |128 |0 1473 +---------------+---------------+------------ 1474 |17 |136 |127 1475 +---------------+---------------+------------ 1476 |18 |144 |127 1477 +---------------+---------------+------------ 1478 |19 |152 |0 1479 +---------------+---------------+------------ 1480 |20 |160 |127 1481 +---------------+---------------+------------ 1482 |21 |168 |0 1483 +---------------+---------------+------------ 1484 |22 |176 |159 1485 +---------------+---------------+------------ 1486 |23 |184 |0 1487 +---------------+---------------+------------ 1488 |24 |192 |127 1489 +---------------+---------------+------------ 1490 |25 |200 |191 1491 +---------------+---------------+------------ 1492 |26 |208 |191 1493 +---------------+---------------+------------ 1494 |27 |216 |0 1495 +---------------+---------------+------------ 1496 |28 |224 |191 1497 +---------------+---------------+------------ 1498 1499 If rmid > rmid threshold, MBM total and local 1500 by the correction factor. 1501 1502 See: 1503 1504 1. Erratum SKX99 in Intel Xeon Processor Scal 1505 http://web.archive.org/web/20200716124958/htt 1506 1507 2. Erratum BDF102 in Intel Xeon E5-2600 v4 Pr 1508 http://web.archive.org/web/20191125200531/htt 1509 1510 3. The errata in Intel Resource Director Tech 1511 https://software.intel.com/content/www/us/en/ 1512 1513 for further information.