perf-amd-ibs(1)
===============

NAME
----
perf-amd-ibs - Support for AMD Instruction-Based Sampling (IBS) with perf tool

SYNOPSIS
--------
[verse]
'perf record' -e ibs_op//
'perf record' -e ibs_fetch//

DESCRIPTION
-----------

Instruction-Based Sampling (IBS) provides precise Instruction Pointer (IP)
profiling support on AMD platforms. IBS has two independent components: IBS
Op and IBS Fetch. IBS Op sampling provides information about instruction
execution (micro-op execution, to be precise) with details like d-cache
hit/miss, d-TLB hit/miss, cache miss latency, load/store data source, branch
behavior etc. IBS Fetch sampling provides information about instruction fetch
with details like i-cache hit/miss, i-TLB hit/miss, fetch latency etc. IBS is
per SMT thread, i.e. each SMT hardware thread has its own IBS units.

Both IBS Op and IBS Fetch are exposed as PMUs by Linux and can be used
through the Linux perf utility. The following directories are created at boot
time if IBS is supported by the hardware and kernel.

  /sys/bus/event_source/devices/ibs_op/
  /sys/bus/event_source/devices/ibs_fetch/

IBS Op PMU supports two events: cycles and micro ops. IBS Fetch PMU supports
one event: fetch ops.

IBS PMUs do not have user/kernel filtering capability and thus require
CAP_SYS_ADMIN or CAP_PERFMON privilege.

IBS VS. REGULAR CORE PMU
------------------------

IBS gives samples with precise IP, i.e. the IP recorded with an IBS sample has
no skid. By contrast, the IP recorded by a regular core PMU has some skid
(the sample was generated at IP X but perf records it at IP X+n). Hence, a
regular core PMU might not help when profiling with instruction-level
precision. Further, IBS provides additional information about the sample in
question.
On the other hand, a regular core PMU has its own advantages, such as a
plethora of events, counting mode (less interference), up to 6 parallel
counters, event grouping support, filtering capabilities etc.

Three regular core PMU events are internally forwarded to IBS Op PMU when
the precise_ip attribute is set:

	-e cpu-cycles:p becomes -e ibs_op//
	-e r076:p becomes -e ibs_op//
	-e r0C1:p becomes -e ibs_op/cnt_ctl=1/

EXAMPLES
--------

IBS Op PMU
~~~~~~~~~~

System-wide profile, cycles event, sampling period: 100000

	# perf record -e ibs_op// -c 100000 -a

Per-cpu profile (cpu10), cycles event, sampling period: 100000

	# perf record -e ibs_op// -c 100000 -C 10

Per-cpu profile (cpu10), cycles event, sampling freq: 1000

	# perf record -e ibs_op// -F 1000 -C 10

System-wide profile, uOps event, sampling period: 100000

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a

Same command, but also capture IBS register raw dump along with perf sample:

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a --raw-samples

System-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward)

	# perf record -e ibs_op/cnt_ctl=1,l3missonly=1/ -c 100000 -a

Per process (upstream v6.2 onward), attach to existing PID 1234, uOps event, sampling period: 100000

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -p 1234

Per process (upstream v6.2 onward), launch and profile a command (ls), uOps event, sampling period: 100000

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -- ls

To analyse the recorded profile in aggregate mode

	# perf report
	/* Select a line and press 'a' to drill down at instruction level.
	*/

To go over each sample

	# perf script

Raw dump of IBS registers when profiled with --raw-samples

	# perf report -D
	/* Look for PERF_RECORD_SAMPLE */

Example register raw dump:

	ibs_op_ctl:     000002c30006186a MaxCnt 100000 L3MissOnly 0 En 1
	        Val 1 CntCtl 0=cycles CurCnt 707
	IbsOpRip:       ffffffff8204aea7
	ibs_op_data:    0000010002550001 CompToRetCtr 1 TagToRetCtr 597
	        BrnRet 0 RipInvalid 0 BrnFuse 0 Microcode 1
	ibs_op_data2:   0000000000000013 RmtNode 1 DataSrc 3=DRAM
	ibs_op_data3:   0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
	        DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
	        DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
	        DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
	        DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
	        OpDcMissOpenMemReqs 12 DcMissLat 0 TlbRefillLat 0
	IbsDCLinAd:     ff110008a5398920
	IbsDCPhysAd:    00000008a5398920

IBS applied in a real-world use case

	A ~90% regression was observed in tbench with a specific scheduler
	hint, which was counterintuitive. IBS profiles of the good and bad
	runs, captured using perf, helped identify the exact cause of the
	problem:

	https://lore.kernel.org/r/20220921063638.2489-1-kprateek.nayak@amd.com

IBS Fetch PMU
~~~~~~~~~~~~~

Similar commands can be used with Fetch PMU as well.

System-wide profile, fetch ops event, sampling period: 100000

	# perf record -e ibs_fetch// -c 100000 -a

System-wide profile, fetch ops event, sampling period: 100000, Random enable

	# perf record -e ibs_fetch/rand_en=1/ -c 100000 -a

	Random enable adds a small degree of variability to the sample
	period. This helps in cases like long-running loops, where the PMU
	keeps tagging the same instruction over and over because of the
	fixed sample period.

etc.
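Before running any of the examples above, it can help to confirm that the
kernel actually registered the IBS PMUs. The following is only a minimal
sketch, assuming the sysfs directories listed in DESCRIPTION. The function
name `list_ibs_pmus` and its optional directory argument are hypothetical
and not part of perf; the argument exists only so the check can be pointed
at a different directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of perf): print the name of each IBS PMU
# that has a directory under the given event_source devices directory.
# With no argument it inspects the real sysfs location.
list_ibs_pmus() {
	dir="${1:-/sys/bus/event_source/devices}"
	for pmu in ibs_op ibs_fetch; do
		# A PMU counts as present when its sysfs directory exists.
		if [ -d "$dir/$pmu" ]; then
			echo "$pmu"
		fi
	done
}
```

Calling `list_ibs_pmus` with no argument prints `ibs_op` and/or `ibs_fetch`
when the corresponding directories exist. Empty output suggests that IBS is
not supported by the hardware or the kernel, in which case the 'perf record'
commands above are not expected to work.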

PERF MEM AND PERF C2C
---------------------

perf mem is a memory access profiler tool and perf c2c is a shared data
cacheline analyser tool. Both of them internally use IBS Op PMU on AMD.
Below is a simple example of the perf mem tool.

	# perf mem record -c 100000 -- make
	# perf mem report

A normal perf mem report output provides a detailed memory access profile.
However, it can also be aggregated based on output fields. For example:

	# perf mem report -F mem,sample,snoop
	Samples: 3M of event 'ibs_op//', Event count (approx.): 23524876
	Memory access                                 Samples  Snoop
	N/A                                           1903343  N/A
	L1 hit                                        1056754  N/A
	L2 hit                                          75231  N/A
	L3 hit                                           9496  HitM
	L3 hit                                           2270  N/A
	RAM hit                                          8710  N/A
	Remote node, same socket RAM hit                 3241  N/A
	Remote core, same node Any cache hit             1572  HitM
	Remote core, same node Any cache hit              514  N/A
	Remote node, same socket Any cache hit           1216  HitM
	Remote node, same socket Any cache hit            350  N/A
	Uncached hit                                       18  N/A

Please refer to their man pages for more detail.

SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-mem[1], linkperf:perf-c2c[1]