perf-amd-ibs(1)
===============

NAME
----
perf-amd-ibs - Support for AMD Instruction-Based Sampling (IBS) with perf tool

SYNOPSIS
--------
[verse]
'perf record' -e ibs_op//
'perf record' -e ibs_fetch//

DESCRIPTION
-----------

Instruction-Based Sampling (IBS) provides precise Instruction Pointer (IP)
profiling support on AMD platforms. IBS has two independent components: IBS
Op and IBS Fetch. IBS Op sampling provides information about instruction
execution (micro-op execution to be precise) with details like d-cache
hit/miss, d-TLB hit/miss, cache miss latency, load/store data source, branch
behavior etc. IBS Fetch sampling provides information about instruction fetch
with details like i-cache hit/miss, i-TLB hit/miss, fetch latency etc. IBS is
per SMT thread, i.e. each SMT hardware thread contains standalone IBS units.

Both IBS Op and IBS Fetch are exposed as PMUs by Linux and can be used
through the Linux perf utility. The following files will be created at boot
time if IBS is supported by the hardware and kernel:

        /sys/bus/event_source/devices/ibs_op/
        /sys/bus/event_source/devices/ibs_fetch/

IBS Op PMU supports two events: cycles and micro ops. IBS Fetch PMU supports
one event: fetch ops.

IBS PMUs do not have user/kernel filtering capability and thus require
CAP_SYS_ADMIN or CAP_PERFMON privilege.

IBS VS. REGULAR CORE PMU
------------------------

IBS gives samples with precise IP, i.e. the IP recorded with an IBS sample has
no skid, whereas the IP recorded by the regular core PMU will have some skid
(the sample was generated at IP X but perf would record it at IP X+n). Hence,
the regular core PMU might not help for profiling with instruction level
precision. Further, IBS provides additional information about the sample in
question. On the other hand, the regular core PMU has its own advantages, such
as a plethora of events, counting mode (less interference), up to 6 parallel
counters, event grouping support, filtering capabilities etc.
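
Before choosing IBS over the regular core PMU, it can help to confirm that the two sysfs entries named in DESCRIPTION are actually present. The following is a minimal bash sketch of such a check. The function name `ibs_pmus_present` and its optional base-directory argument are hypothetical and exist only for illustration; they are not part of perf.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of perf): report which IBS PMUs are
# visible in sysfs. The base directory is a parameter only so that the
# function can also be tried against a scratch directory.
ibs_pmus_present() {
    local base="${1:-/sys/bus/event_source/devices}"
    local pmu found=0
    for pmu in ibs_op ibs_fetch; do
        if [ -d "$base/$pmu" ]; then
            echo "$pmu: present"
            found=1
        else
            echo "$pmu: missing"
        fi
    done
    # Succeed if at least one IBS PMU directory exists.
    [ "$found" -eq 1 ]
}

ibs_pmus_present || echo "no IBS PMU found; only the regular core PMU is usable"
```

The presence of the directories only indicates that the kernel registered the PMUs; the privilege requirement described in DESCRIPTION still applies when recording.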

Three regular core PMU events are internally forwarded to IBS Op PMU when
precise_ip attribute is set:

        -e cpu-cycles:p becomes -e ibs_op//
        -e r076:p becomes -e ibs_op//
        -e r0C1:p becomes -e ibs_op/cnt_ctl=1/

EXAMPLES
--------

IBS Op PMU
~~~~~~~~~~

System-wide profile, cycles event, sampling period: 100000

        # perf record -e ibs_op// -c 100000 -a

Per-cpu profile (cpu10), cycles event, sampling period: 100000

        # perf record -e ibs_op// -c 100000 -C 10

Per-cpu profile (cpu10), cycles event, sampling freq: 1000

        # perf record -e ibs_op// -F 1000 -C 10

System-wide profile, uOps event, sampling period: 100000

        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a

Same command, but also capture IBS register raw dump along with perf sample:

        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a --raw-samples

System-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward)

        # perf record -e ibs_op/cnt_ctl=1,l3missonly=1/ -c 100000 -a

Per process (upstream v6.2 onward), uOps
event, sampling period: 100000

        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -p 1234

Per process (upstream v6.2 onward), uOps event, sampling period: 100000

        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -- ls

To analyse recorded profile in aggregate mode

        # perf report
        /* Select a line and press 'a' to drill down at instruction level. */

To go over each sample

        # perf script

Raw dump of IBS registers when profiled with --raw-samples

        # perf report -D
        /* Look for PERF_RECORD_SAMPLE */

Example register raw dump:

        ibs_op_ctl:     000002c30006186a MaxCnt 100000 L3MissOnly 0 En 1
                        Val 1 CntCtl 0=cycles CurCnt 707
        IbsOpRip:       ffffffff8204aea7
        ibs_op_data:    0000010002550001 CompToRetCtr 1 TagToRetCtr 597
                        BrnRet 0 RipInvalid 0 BrnFuse 0 Microcode 1
        ibs_op_data2:   0000000000000013 RmtNode 1 DataSrc 3=DRAM
        ibs_op_data3:   0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
                        DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
                        DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
                        DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
                        DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
                        OpDcMissOpenMemReqs 12 DcMissLat 0 TlbRefillLat 0
        IbsDCLinAd:     ff110008a5398920
        IbsDCPhysAd:    00000008a5398920

IBS applied in a real-world use case

        A ~90% regression was observed in tbench with a specific scheduler
        hint, which was counterintuitive. IBS profiles of the good and bad
        runs, captured using perf, helped identify the exact cause of the
        problem:

        https://lore.kernel.org/r/20220921063638.2489-1-kprateek.nayak@amd.com

IBS Fetch PMU
~~~~~~~~~~~~~

Similar commands can be used with Fetch PMU as well.

System-wide profile, fetch ops event, sampling period: 100000

        # perf record -e ibs_fetch// -c 100000 -a

System-wide profile, fetch ops event, sampling period: 100000, Random enable

        # perf record -e ibs_fetch/rand_en=1/ -c 100000 -a

        Random enable adds a small degree of variability to the sample
        period. This helps in cases like long running loops where the PMU
        keeps tagging the same instruction over and over because of a fixed
        sample period.

etc.

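
The `ibs_op_ctl` line of the example register raw dump in the IBS Op PMU examples can be decoded by hand, which is a useful cross-check when reading `perf report -D` output. The bash sketch below does this for the value shown there. The bit positions are an assumption on our part (they follow our reading of `union ibs_op_ctl` in the kernel header `arch/x86/include/asm/amd-ibs.h`, not anything stated in this page), and the function name `decode_ibs_op_ctl` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode an ibs_op_ctl raw value given as hex
# without a 0x prefix. Bit positions below are ASSUMED from the kernel's
# union ibs_op_ctl; verify against the header for your kernel version.
decode_ibs_op_ctl() {
    local v=$(( 0x$1 ))
    local maxcnt_lo=$((  v         & 0xffff ))     # bits 0-15
    local l3missonly=$(( (v >> 16) & 0x1 ))        # bit 16
    local en=$((         (v >> 17) & 0x1 ))        # bit 17
    local val=$((        (v >> 18) & 0x1 ))        # bit 18
    local cnt_ctl=$((    (v >> 19) & 0x1 ))        # bit 19
    local maxcnt_ext=$(( (v >> 20) & 0x7f ))       # bits 20-26
    local curcnt=$((     (v >> 32) & 0x7ffffff ))  # bits 32-58
    # The stored max count omits its low 4 bits, so shift it back.
    local maxcnt=$(( ((maxcnt_ext << 16) | maxcnt_lo) << 4 ))
    echo "MaxCnt $maxcnt L3MissOnly $l3missonly En $en Val $val CntCtl $cnt_ctl CurCnt $curcnt"
}

decode_ibs_op_ctl 000002c30006186a
# prints: MaxCnt 100000 L3MissOnly 0 En 1 Val 1 CntCtl 0 CurCnt 707
```

The decoded fields agree with the annotations perf prints next to the raw value in the example dump, where CntCtl 0 denotes the cycles event.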

PERF MEM AND PERF C2C
---------------------

perf mem is a memory access profiler tool and perf c2c is a shared data
cacheline analyser tool. Both internally use the IBS Op PMU on AMD.
Below is a simple example of the perf mem tool.

        # perf mem record -c 100000 -- make
        # perf mem report

A normal perf mem report output will provide a detailed memory access profile.
However, it can also be aggregated based on output fields. For example:

        # perf mem report -F mem,sample,snoop
        Samples: 3M of event 'ibs_op//', Event count (approx.): 23524876
        Memory access                           Samples  Snoop
        N/A                                     1903343  N/A
        L1 hit                                  1056754  N/A
        L2 hit                                    75231  N/A
        L3 hit                                     9496  HitM
        L3 hit                                     2270  N/A
        RAM hit                                    8710  N/A
        Remote node, same socket RAM hit           3241  N/A
        Remote core, same node Any cache hit       1572  HitM
        Remote core, same node Any cache hit        514  N/A
        Remote node, same socket Any cache hit     1216  HitM
        Remote node, same socket Any cache hit      350  N/A
        Uncached hit                                 18  N/A

Please refer to their man pages for more details.
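
The aggregated report above can be reduced further with ordinary text tools. As an illustration only, the bash sketch below sums the Samples column and reports how many of those samples carry a HitM snoop result. It assumes the text layout shown in the example (sample count as the second-to-last field, snoop result as the last field); real output may differ between perf versions. The function name `hitm_summary` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read 'perf mem report -F mem,sample,snoop' style
# text on stdin and summarise HitM samples. ASSUMES the sample count is
# the second-to-last field and the snoop result is the last field.
hitm_summary() {
    awk 'NF >= 2 && $(NF-1) ~ /^[0-9]+$/ {
             total += $(NF-1)                      # every data row
             if ($NF == "HitM") hitm += $(NF-1)    # rows snooped as HitM
         }
         END { printf "HitM %d of %d samples\n", hitm, total }'
}

# Three rows copied from the example report above.
hitm_summary <<'EOF'
L1 hit                                  1056754  N/A
L3 hit                                     9496  HitM
L3 hit                                     2270  N/A
EOF
# prints: HitM 9496 of 1068520 samples
```

Header lines are skipped because their second-to-last field is not a plain number. To try it on a real profile, save the report text to a file and redirect that file into the function.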

SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-mem[1], linkperf:perf-c2c[1]