Performance Counters for Linux
------------------------------

Performance counters are special hardware regi
CPUs. These registers count the number of cert
as instructions executed, cache misses suffered
without slowing down the kernel or application
trigger interrupts when a threshold number of
thus be used to profile the code that runs on

The Linux Performance Counter subsystem provid
hardware capabilities. It provides per task an
groups, and it provides event capabilities on
provides "virtual" 64-bit counters, regardless
underlying hardware counters.

Performance counters are accessed via special
There's one file descriptor per virtual counte

The special file descriptor is opened via the
system call:

   int sys_perf_event_open(struct perf_event_a
                             pid_t pid, int cp
                             unsigned long fla

The syscall returns the new fd. The fd can be
VFS system calls: read() can be used to read t
can be used to set the blocking mode, etc.

Multiple counters can be kept open at a time,
can be poll()ed.

When creating a new counter fd, 'perf_event_at

struct perf_event_attr {
        /*
         * The MSB of the config word signifie
         * specific (raw) counter configuratio
         * 7 bits are an event type and the re
         * identifier.
         */
        __u64                   config;

        __u64                   irq_period;
        __u32                   record_type;
        __u32                   read_format;

        __u64                   disabled
                                inherit
                                pinned
                                exclusive
                                exclude_user
                                exclude_kernel
                                exclude_hv
                                exclude_idle
                                mmap
                                munmap
                                comm

                                __reserved_1

        __u32                   extra_config_l
        __u32                   wakeup_events;

        __u64                   __reserved_2;
        __u64                   __reserved_3;
};

The 'config' field specifies what the counter
is divided into 3 bit-fields:

        raw_type: 1 bit   (most significant bit)
        type:     7 bits  (next most significant)
        event_id: 56 bits (least significant)

If 'raw_type' is 1, then the counter will coun
specified by the remaining 63 bits of event_co
machine-specific.

If 'raw_type' is 0, then the 'type' field says
this is, with the following encoding:

enum perf_type_id {
        PERF_TYPE_HARDWARE              = 0,
        PERF_TYPE_SOFTWARE              = 1,
        PERF_TYPE_TRACEPOINT            = 2,
};

A counter of PERF_TYPE_HARDWARE will count the
specified by 'event_id':

/*
 * Generalized performance counter event types
 * parameter of the sys_perf_event_open() sysc
 */
enum perf_hw_id {
        /*
         * Common hardware events, generalized
         */
        PERF_COUNT_HW_CPU_CYCLES
        PERF_COUNT_HW_INSTRUCTIONS
        PERF_COUNT_HW_CACHE_REFERENCES
        PERF_COUNT_HW_CACHE_MISSES
        PERF_COUNT_HW_BRANCH_INSTRUCTIONS
        PERF_COUNT_HW_BRANCH_MISSES
        PERF_COUNT_HW_BUS_CYCLES
        PERF_COUNT_HW_STALLED_CYCLES_FRONTEND
        PERF_COUNT_HW_STALLED_CYCLES_BACKEND
        PERF_COUNT_HW_REF_CPU_CYCLES
};

These are standardized types of events that wo
on all CPUs that implement Performance Counter
although there may be variations (e.g., differ
cache references and misses at different level
If a CPU is not able to count the selected eve
will return -EINVAL.

More hw_event_types are supported as well, but
and accessed as raw events.
For example, to c
cycles while bus lock signal asserted" events
in a 0x4064 event_id value and set hw_event.ra

A counter of type PERF_TYPE_SOFTWARE will coun
software events, selected by 'event_id':

/*
 * Special "software" counters provided by the
 * does not support performance counters. Thes
 * physical and sw events of the kernel (and a
 * well):
 */
enum perf_sw_ids {
        PERF_COUNT_SW_CPU_CLOCK         = 0,
        PERF_COUNT_SW_TASK_CLOCK        = 1,
        PERF_COUNT_SW_PAGE_FAULTS       = 2,
        PERF_COUNT_SW_CONTEXT_SWITCHES  = 3,
        PERF_COUNT_SW_CPU_MIGRATIONS    = 4,
        PERF_COUNT_SW_PAGE_FAULTS_MIN   = 5,
        PERF_COUNT_SW_PAGE_FAULTS_MAJ   = 6,
        PERF_COUNT_SW_ALIGNMENT_FAULTS  = 7,
        PERF_COUNT_SW_EMULATION_FAULTS  = 8,
};

Counters of the type PERF_TYPE_TRACEPOINT are
tracer is available, and event_id values can b

   /debug/tracing/events/*/*/id


Counters come in two flavours: counting counte
counters. A "counting" counter is one that is
number of events that occur, and is characteri
irq_period = 0.


A read() on a counter returns the current valu
additional values as specified by 'read_format
in size.

/*
 * Bits that can be set in hw_event.read_forma
 * reads on the counter should return the indi
 * in increasing order of bit value, after the
 */
enum perf_event_read_format {
        PERF_FORMAT_TOTAL_TIME_ENABLED  =  1,
        PERF_FORMAT_TOTAL_TIME_RUNNING  =  2,
};

Using these additional values one can establis
particular counter allowing one to take the ro
into account.


A "sampling" counter is one that is set up to
every N events, where N is given by 'irq_perio
has irq_period > 0. The record_type controls w
interrupt:

/*
 * Bits that can be set in hw_event.record_typ
 * in the overflow packets.
 */
enum perf_event_record_format {
        PERF_RECORD_IP          = 1U << 0,
        PERF_RECORD_TID         = 1U << 1,
        PERF_RECORD_TIME        = 1U << 2,
        PERF_RECORD_ADDR        = 1U << 3,
        PERF_RECORD_GROUP       = 1U << 4,
        PERF_RECORD_CALLCHAIN   = 1U << 5,
};

Such (and other) events will be recorded in a
available to user-space using mmap() (see belo

The 'disabled' bit specifies whether the count
or enabled. If it is initially disabled, it c
or prctl (see below).

The 'inherit' bit, if set, specifies that this
events on descendant tasks as well as the task
applies to new descendants, not to any existin
time the counter is created (nor to any new de
descendants).

The 'pinned' bit, if set, specifies that the c
on the CPU if at all possible. It only applie
and only to group leaders. If a pinned counte
CPU (e.g. because there are not enough hardwar
a conflict with some other event), then the co
'error' state, where reads return end-of-file
until the counter is subsequently enabled or d

The 'exclusive' bit, if set, specifies that wh
is on the CPU, it should be the only group usi
In future, this will allow sophisticated monit
extra configuration information via 'extra_con
advanced features of the CPU's Performance Mon
not otherwise accessible and that might disrup
counters.

The 'exclude_user', 'exclude_kernel' and 'excl
way to request that counting of events be rest
CPU is in user, kernel and/or hypervisor mode.

Furthermore the 'exclude_host' and 'exclude_gu
to request counting of events restricted to gu
using Linux as the hypervisor.
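
As a rough illustration of the attribute bits described above, the
following sketch fills in a perf_event_attr for a counter that starts
disabled and counts only user-mode instructions. It assumes the layout
found in current <linux/perf_event.h>, where the event class is a
separate 'type' field and 'size' must be set, rather than the packed
'config' word described earlier. The helper names user_only_attr() and
perf_event_open_raw() are made up for this example.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: attr for counting user-mode instructions only. */
static struct perf_event_attr user_only_attr(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_INSTRUCTIONS;
	attr.disabled = 1;        /* start stopped; enable later via ioctl */
	attr.exclude_kernel = 1;  /* do not count while in kernel mode */
	attr.exclude_hv = 1;      /* do not count while in the hypervisor */
	return attr;
}

/*
 * Hypothetical wrapper: the C library is assumed to provide no wrapper
 * for this system call, so it is issued through syscall(2).
 * Returns the new fd, or -1 with errno set.
 */
static long perf_event_open_raw(struct perf_event_attr *attr, pid_t pid,
				int cpu, int group_fd, unsigned long flags)
{
	assert(attr != NULL);
	return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}
```

A call such as perf_event_open_raw(&attr, 0, -1, -1, 0) would then
request a counter on the current task, on whichever CPU it runs.
Whether the call succeeds depends on the kernel configuration and on
the privileges of the caller.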

The 'mmap' and 'munmap' bits allow recording o
operations, these can be used to relate usersp
code, even after the mapping (or even the whol
these events are recorded in the ring-buffer (

The 'comm' bit allows tracking of process comm
This too is recorded in the ring-buffer (see b

The 'pid' parameter to the sys_perf_event_open
counter to be specific to a task:

 pid == 0: if the pid parameter is zero, the c
 current task.

 pid > 0: the counter is attached to a specifi
 has sufficient privilege to do so)

 pid < 0: all tasks are counted (per cpu count

The 'cpu' parameter allows a counter to be mad

 cpu >= 0: the counter is restricted to a spec
 cpu == -1: the counter counts on all CPUs

(Note: the combination of 'pid == -1' and 'cpu

A 'pid > 0' and 'cpu == -1' counter is a per t
events of that task and 'follows' that task to
gets scheduled to. Per task counters can be cre
their own tasks.

A 'pid == -1' and 'cpu == x' counter is a per
all events on CPU-x. Per CPU counters need CAP
privilege.

The 'flags' parameter is currently unused and

The 'group_fd' parameter allows counter "group
counter group has one counter which is the gro
is created first, with group_fd = -1 in the sy
that creates it. The rest of the group member
subsequently, with group_fd giving the fd of t
(A single counter on its own is created with g
considered to be a group with only 1 member.)

A counter group is scheduled onto the CPU as a
only be put onto the CPU if all of the counter
put onto the CPU. This means that the values
can be meaningfully compared, added, divided (
with each other, since they have counted event
executed instructions.


As stated, asynchronous events, like counter
tracking are logged into a ring-buffer. This r
accessed through mmap().
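
To make the mapping step concrete, here is a minimal sketch of how the
length passed to mmap() might be computed, assuming the layout this
document uses: one metadata page followed by 2^n ring-buffer pages.
The helper names perf_mmap_len() and perf_map_buffer() are
hypothetical, and a read-only mapping is only one possible choice.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Bytes needed for one metadata page plus 2^n ring-buffer pages. */
static size_t perf_mmap_len(unsigned int n, size_t page_size)
{
	assert(n < 32);  /* keep the shift small and well-defined */
	return (1 + ((size_t)1 << n)) * page_size;
}

/*
 * Hypothetical helper: map the buffer of an already-open counter fd.
 * Returns MAP_FAILED on error, exactly as mmap() does.
 */
static void *perf_map_buffer(int fd, unsigned int n)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);

	return mmap(NULL, perf_mmap_len(n, page), PROT_READ,
		    MAP_SHARED, fd, 0);
}
```

With 4096-byte pages and n = 3 the mapping spans 9 pages (36864
bytes): the first page holds the metadata and the remaining eight
form the ring-buffer.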

The mmap size should be 1+2^n pages, where the
(struct perf_event_mmap_page) that contains va
as where the ring-buffer head is.

/*
 * Structure of the page that can be mapped vi
 */
struct perf_event_mmap_page {
        __u32   version;                /* ver
        __u32   compat_version;         /* low

        /*
         * Bits needed to read the hw counters
         *
         *   u32 seq;
         *   s64 count;
         *
         *   do {
         *     seq = pc->lock;
         *
         *     barrier();
         *     if (pc->index) {
         *       count = pmc_read(pc->index -
         *       count += pc->offset;
         *     } else
         *       goto regular_read;
         *
         *     barrier();
         *   } while (pc->lock != seq);
         *
         * NOTE: for obvious reason this only
         *       processes.
         */
        __u32   lock;                   /* seq
        __u32   index;                  /* har
        __s64   offset;                 /* add

        /*
         * Control data for the mmap() data bu
         *
         * User-space reading this value shoul
         * platforms, after reading this value
         */
        __u32   data_head;              /* hea
};

NOTE: the hw-counter userspace bits are arch s
      implemented on powerpc.

The following 2^n pages are the ring-buffer wh

#define PERF_RECORD_MISC_KERNEL          (1 <<
#define PERF_RECORD_MISC_USER            (1 <<
#define PERF_RECORD_MISC_OVERFLOW        (1 <<

struct perf_event_header {
        __u32   type;
        __u16   misc;
        __u16   size;
};

enum perf_event_type {

        /*
         * The MMAP events record the PROT_EXE
         * correlate userspace IPs to code.
         * Th
         *
         * struct {
         *      struct perf_event_header
         *
         *      u32
         *      u64
         *      u64
         *      u64
         *      char
         * };
         */
        PERF_RECORD_MMAP                 = 1,
        PERF_RECORD_MUNMAP               = 2,

        /*
         * struct {
         *      struct perf_event_header
         *
         *      u32
         *      char
         * };
         */
        PERF_RECORD_COMM                 = 3,

        /*
         * When header.misc & PERF_RECORD_MISC
         * will be PERF_RECORD_*
         *
         * struct {
         *      struct perf_event_header
         *
         *      { u64                   ip;
         *      { u32                   pid, t
         *      { u64                   time;
         *      { u64                   addr;
         *
         *      { u64                   nr;
         *        { u64 event, val; }   cnt[nr
         *
         *      { u16                   nr,
         *                              hv,
         *                              kernel
         *                              user;
         *        u64                   ips[nr
         * };
         */
};

NOTE: PERF_RECORD_CALLCHAIN is arch specific a
      on x86.

Notification of new events is possible through
fcntl() managing signals.

Normally a notification is generated for every
additionally set perf_event_attr.wakeup_events
so many counter overflow events.

Future work will include a splice() interface


Counters can be enabled and disabled in two wa
prctl. When a counter is disabled, it doesn't
events but does continue to exist and maintain

An individual counter can be enabled with

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

or disabled with

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

For a counter group, pass PERF_IOC_FLAG_GROUP
Enabling or disabling the leader of a group en
whole group; that is, while the group leader i
counters in the group will count. Enabling or
group other than the leader only affects that
non-leader stops that counter from counting bu
other counter.
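
The two ioctls above are typically used to bracket a region of
interest. The sketch below shows one way this could look for a single
counting counter that was opened with 'disabled' set and with
read_format == 0, so that read() returns one 64-bit value. The helper
count_region() and the sample workload spin() are hypothetical names
introduced for this example.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Sample workload, used only for illustration. */
static void spin(void)
{
	volatile unsigned long i;

	for (i = 0; i < 100000; i++)
		;
}

/*
 * Hypothetical helper: enable the counter behind 'fd', run work(),
 * disable the counter again and read its value into *out.
 * Returns 0 on success, or -1 if any step fails.
 */
static int count_region(int fd, void (*work)(void), uint64_t *out)
{
	if (ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) == -1)
		return -1;

	work();

	if (ioctl(fd, PERF_EVENT_IOC_DISABLE, 0) == -1)
		return -1;

	/* The value is cumulative over every period the counter was enabled. */
	if (read(fd, out, sizeof(*out)) != (ssize_t)sizeof(*out))
		return -1;
	return 0;
}
```

Because the counter keeps its value while disabled, calling
count_region(fd, spin, &value) twice yields a second reading that
includes the first; subtracting the two readings gives the count for
the second region alone.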

Additionally, non-inherited overflow counters

        ioctl(fd, PERF_EVENT_IOC_REFRESH, nr);

to enable a counter for 'nr' events, after whi

A process can enable or disable all the counte
attached to it, using prctl:

        prctl(PR_TASK_PERF_EVENTS_ENABLE);

        prctl(PR_TASK_PERF_EVENTS_DISABLE);

This applies to all counters on the current pr
by this process or by another, and doesn't aff
this process has created on other processes.
disables the group leaders, not any other memb


Arch requirements
-----------------

If your architecture does not have hardware pe
still use the generic software counters based

So to start with, in order to add HAVE_PERF_EV
will need at least this:
        - asm/perf_event.h - a basic stub will
        - support for atomic64 types (and asso

If your architecture does have hardware capabi
weak stub hw_perf_event_init() to register har

Architectures that have d-cache aliasing issu
should select PERF_USE_VMALLOC in order to avo