perf-arm-spe(1)
================

NAME
----
perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools

SYNOPSIS
--------
[verse]
'perf record' -e arm_spe//

DESCRIPTION
-----------

The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and
events down to individual instructions. Rather than being interrupt-driven, it picks an
instruction to sample and then captures data for it during execution. Data includes execution time
in cycles. For loads and stores it also includes data address, cache miss events, and data origin.

The sampling has 5 stages:

  1. Choose an operation
  2. Collect data about the operation
  3. Optionally discard the record based on a filter
  4. Write the record to memory
  5. Interrupt when the buffer is full

Choose an operation
~~~~~~~~~~~~~~~~~~~

The operation is chosen from a sample population. For SPE this is an IMPLEMENTATION DEFINED choice
of all architectural instructions or all micro-ops. Sampling happens at a programmable interval.
The architecture provides a mechanism for the SPE driver to infer the minimum interval at which it
should sample. This minimum interval is used by the driver if no interval is specified. A
pseudo-random perturbation is also added to the sampling interval by default.

Collect data about the operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Program counter, PMU events, timings and data addresses related to the operation are recorded.
Sampling ensures there is only one sampled operation in flight.

Optionally discard the record based on a filter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Based on programmable criteria, choose whether to keep the record or discard it. If the record is
discarded then the flow stops here for this sample.

Write the record to memory
~~~~~~~~~~~~~~~~~~~~~~~~~~

The record is appended to a memory buffer.

Interrupt when the buffer is full
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records.
Perf saves the raw data in the perf.data file.
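
The five stages above can be illustrated with a toy software model. This is only a sketch for
intuition: the real work is done by hardware and the SPE driver, and every name here ('Record',
'simulate', 'buffer_size', the jitter range) is a hypothetical choice made for this example, not
part of perf or the architecture.

```python
import random
from dataclasses import dataclass


@dataclass
class Record:
    pc: int        # program counter of the sampled operation
    latency: int   # total latency in cycles (made-up values in this model)


def simulate(ops, interval, min_latency=0, buffer_size=4, jitter=False, seed=0):
    """Toy model of the five sampling stages.

    ops is an iterable of (pc, latency) pairs standing in for executed operations.
    Returns the list of buffers handed over on each simulated "interrupt".
    """
    rng = random.Random(seed)
    drained, buffer = [], []
    countdown = interval
    for pc, latency in ops:
        countdown -= 1
        if countdown > 0:
            continue
        # Stage 1: this operation is chosen; re-arm the interval counter.
        # The jitter range is an assumption, not the hardware's actual perturbation.
        countdown = interval + (rng.randint(0, interval // 4) if jitter else 0)
        # Stage 2: collect data about the operation.
        record = Record(pc, latency)
        # Stage 3: optionally discard the record based on a filter.
        if record.latency < min_latency:
            continue
        # Stage 4: write the record to the memory buffer.
        buffer.append(record)
        # Stage 5: "interrupt" when the buffer is full and hand it over.
        if len(buffer) == buffer_size:
            drained.append(buffer)
            buffer = []
    if buffer:
        drained.append(buffer)  # final flush when the run ends
    return drained


# 100 fake operations with latencies cycling through 0..9.
ops = [(0x1000 + 4 * i, i % 10) for i in range(100)]
print(len(simulate(ops, interval=5)))
print(len(simulate(ops, interval=5, min_latency=5)))
```

Raising 'interval' or 'min_latency' in this model reduces how often the buffer fills, which mirrors
why a larger sample period or a latency filter shrinks the recorded data.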

Opening the file
----------------

Up until this point no decoding of the SPE data was done by either the kernel or Perf. Only when the
recorded file is opened with 'perf report' or 'perf script' does the decoding happen. When decoding
the data, Perf generates "synthetic samples" as if these were generated at the time of the
recording. These samples are the same as if normal sampling was done by Perf without using SPE,
although they may have more attributes associated with them. For example a normal sample may have
just the instruction pointer, but an SPE sample can have data addresses and latency attributes.

Why Sampling?
-------------

 - Sampling, rather than tracing, cuts down the profiling problem to something more manageable for
   hardware. Only one sampled operation is in flight at a time.

 - Allows precise attribution data, including: Full PC of instruction, data virtual and physical
   addresses.

 - Allows correlation between an instruction and events, such as TLB and cache miss. (Data source
   indicates which particular cache was hit, but the meaning is implementation defined because
   different implementations can have different cache configurations.)

However, SPE does not provide any call-graph information, and relies on statistical methods.

Collisions
----------

When an operation is sampled while a previous sampled operation has not finished, a collision
occurs. The new sample is dropped. Collisions affect the integrity of the data, so the sample rate
should be set to avoid collisions.

The 'sample_collision' PMU event can be used to determine the number of lost samples. However, this
count is based on collisions _before_ filtering occurs, so it cannot be used as an exact number of
dropped samples that would have made it through the filter. It can still serve as a rough guide.

The effect of microarchitectural sampling
-----------------------------------------

If an implementation samples micro-operations instead of instructions, the results of sampling must
be weighted accordingly.

For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it
becomes twice as likely to appear in the sample population.
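
As a worked illustration of that weighting, the sketch below divides each instruction's raw sample
count by the number of micro-operations it expands to and then normalises the result. It is a
minimal example under stated assumptions: the function name 'weight_by_uops' and the per-instruction
micro-op counts are hypothetical inputs that you would have to supply yourself, since SPE records
do not report them and Perf does not apply this correction.

```python
def weight_by_uops(sample_counts, uops_per_instruction):
    """Return each instruction's share of samples after correcting for micro-op expansion.

    sample_counts: raw number of SPE samples attributed to each instruction.
    uops_per_instruction: assumed number of micro-ops each instruction expands to
    (instructions missing from this mapping are treated as a single micro-op).
    """
    weighted = {
        insn: count / uops_per_instruction.get(insn, 1)
        for insn, count in sample_counts.items()
    }
    total = sum(weighted.values())
    if total == 0:
        return weighted
    return {insn: value / total for insn, value in weighted.items()}


# A always splits into two micro-ops (A0 and A1), B is a single micro-op.
raw = {"A": 200, "B": 100}
shares = weight_by_uops(raw, {"A": 2, "B": 1})
print(shares)
```

In this made-up case A appears twice as often as B in the raw samples, but after weighting both
instructions account for an equal share of executed instructions.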

The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be
estimated from the 'sample_pop' and 'inst_retired' PMU events.

Kernel Requirements
-------------------

The ARM_SPE_PMU config must be set to build as either a module or statically.

Depending on CPU model, the kernel may need to be booted with page table isolation disabled
(kpti=off). If KPTI needs to be disabled, this will fail with a console message "profiling buffer
inaccessible. Try passing 'kpti=off' on the kernel command line".

Capturing SPE with perf command-line tools
------------------------------------------

You can record a session with SPE samples:

  perf record -e arm_spe// -- ./mybench

The sample period is set from the -c option, and because the minimum interval is used by default
it's recommended to set this to a higher value.
The value is written to PMSIRR.INTERVAL.

Config parameters
~~~~~~~~~~~~~~~~~

These are placed between the // in the event and comma separated. For example '-e
arm_spe/load_filter=1,min_latency=10/'

  branch_filter=1     - collect branches only (PMSFCR.B)
  event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
  jitter=1            - use jitter to avoid resonance when sampling (PMSIRR.RND)
  load_filter=1       - collect loads only (PMSFCR.LD)
  min_latency=<n>     - collect only samples with this latency or higher* (PMSLATFR)
  pa_enable=1         - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
  pct_enable=1        - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
  store_filter=1      - collect stores only (PMSFCR.ST)
  ts_enable=1         - enable timestamping with value of generic timer (PMSCR.TS)

+++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather
than only the execution latency.

Only some events can be filtered on; these include:

  bit 1     - instruction retired (i.e. omit speculative instructions)
  bit 3     - L1D refill
  bit 5     - TLB refill
  bit 7     - mispredict
  bit 11    - misaligned access

So to sample just retired instructions:

  perf record -e arm_spe/event_filter=2/ -- ./mybench

or just mispredicted branches:

  perf record -e arm_spe/event_filter=0x80/ -- ./mybench

Viewing the data
~~~~~~~~~~~~~~~~

By default perf report and perf script will assign samples to separate groups depending on the
attributes/events of the SPE record. Because instructions can have multiple events associated with
them, the samples in these groups are not necessarily unique. For example perf report shows these
groups:

  Available samples
  0 arm_spe//
  0 dummy:u
  21 l1d-miss
  897 l1d-access
  5 llc-miss
  7 llc-access
  2 tlb-miss
  1K tlb-access
  36 branch-miss
  0 remote-access
  900 memory

The arm_spe// and dummy:u events are implementation details and are expected to be empty.

To get a full list of unique samples that are not sorted into groups, set the itrace option to
generate 'instruction' samples.
The period option is also taken into account, so set it to 1
instruction unless you want to further downsample the already sampled SPE data:

  perf report --itrace=i1i

Memory access details are also stored on the samples and this can be viewed with:

  perf report --mem-mode

Common errors
~~~~~~~~~~~~~

 - "Cannot find PMU `arm_spe'. Missing kernel support?"

   Module not built or loaded, KPTI not disabled (see above), or running on a VM.

 - "Arm SPE CONTEXT packets not found in the traces."

   Root privilege is required to collect context packets. But these only increase the accuracy of
   assigning PIDs to kernel samples. For userspace sampling this can be ignored.

 - Excessively large perf.data file size

   Increase sampling interval (see above).


SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-inject[1]