perf-stat(1)
============

NAME
----
perf-stat - Run a command and gather performance counter statistics

SYNOPSIS
--------
[verse]
'perf stat' [-e <EVENT> | --event=EVENT] [-a] <command>
'perf stat' [-e <EVENT> | --event=EVENT] [-a] -- <command> [<options>]
'perf stat' [-e <EVENT> | --event=EVENT] [-a] record [-o file] -- <command> [<options>]
'perf stat' report [-i file]

DESCRIPTION
-----------
This command runs a command and gathers performance counter statistics
from it.


OPTIONS
-------
<command>...::
	Any command you can specify in a shell.

record::
	See STAT RECORD.

report::
	See STAT REPORT.

-e::
--event=::
	Select the PMU event. Selection can be:

	- a symbolic event name (use 'perf list' to list all events)

	- a raw PMU event in the form of rN where N is a hexadecimal value
	  that represents the raw register encoding with the layout of the
	  event control registers as described by entries in
	  /sys/bus/event_source/devices/cpu/format/*.

	- a symbolic or raw PMU event followed by an optional colon
	  and a list of event modifiers, e.g., cpu-cycles:p. See the
	  linkperf:perf-list[1] man page for details on event modifiers.

	- a symbolically formed event like 'pmu/param1=0x3,param2/' where
	  param1 and param2 are defined as formats for the PMU in
	  /sys/bus/event_source/devices/<pmu>/format/*

	  'percore' is an event qualifier that sums up the event counts for both
	  hardware threads in a core. For example:
	  perf stat -A -a -e cpu/event,percore=1/,otherevent ...

	- a symbolically formed event like 'pmu/config=M,config1=N,config2=K/'
	  where M, N, K are numbers (in decimal, hex, octal format).
	  Acceptable values for each of 'config', 'config1' and 'config2'
	  parameters are defined by corresponding entries in
	  /sys/bus/event_source/devices/<pmu>/format/*

	Note that the last two syntaxes support prefix and glob matching in
	the PMU name to simplify creation of events across multiple instances
	of the same type of PMU in large systems (e.g. memory controller PMUs).
	Multiple PMU instances are typical for uncore PMUs, so the prefix
	'uncore_' is also ignored when performing this match.
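	As an illustration, the following counts one symbolic, one raw and
	one modified event in a single run (the raw encoding r01c2 is only
	an example; valid encodings depend on the CPU model):

	  $ perf stat -e cycles -e r01c2 -e instructions:u -- sleep 1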
-i::
--no-inherit::
	child tasks do not inherit counters
-p::
--pid=<pid>::
	stat events on existing process id (comma separated list)

-t::
--tid=<tid>::
	stat events on existing thread id (comma separated list)

-b::
--bpf-prog::
	stat events on existing bpf program id (comma separated list),
	requiring root rights. bpftool-prog could be used to find the
	program ids of all bpf programs in the system. For example:

  # bpftool prog | head -n 1
  17247: tracepoint  name sys_enter  tag 192d548b9d754067  gpl

  # perf stat -e cycles,instructions --bpf-prog 17247 --timeout 1000

   Performance counter stats for 'BPF program(s) 17247':

             85,967      cycles
             28,982      instructions

        1.102235068 seconds time elapsed

--bpf-counters::
	Use BPF programs to aggregate readings from perf_events. This
	allows multiple perf-stat sessions that are counting the same metric (cycles,
	instructions, etc.) to share hardware counters.
	To use BPF programs on common events by default, use
	"perf config stat.bpf-counter-events=<list_of_events>".

--bpf-attr-map::
	With option "--bpf-counters", different perf-stat sessions share
	information about shared BPF programs and maps via a pinned hashmap.
	Use "--bpf-attr-map" to specify the path of this pinned hashmap.
	The default path is /sys/fs/bpf/perf_attr_map.
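	For example, two concurrent sessions counting the same event can
	share hardware counters instead of multiplexing them (the pid 1234
	is illustrative):

	  # perf stat --bpf-counters -e cycles -p 1234 -- sleep 10 &
	  # perf stat --bpf-counters -e cycles -a -- sleep 10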
ifdef::HAVE_LIBPFM[]
--pfm-events events::
Select a PMU event using libpfm4 syntax (see http://perfmon2.sf.net)
including support for event filters. For example '--pfm-events
inst_retired:any_p:u:c=1:i'. More than one event can be passed to the
option using the comma separator. Hardware events and generic hardware
events cannot be mixed together. The latter must be used with the -e
option. The -e option and this one can be mixed and matched. Events
can be grouped using the {} notation.
endif::HAVE_LIBPFM[]

-a::
--all-cpus::
	system-wide collection from all CPUs (default if no target is specified)

--no-scale::
	Don't scale/normalize counter values

-d::
--detailed::
	print more detailed statistics, can be specified up to 3 times

	   -d:          detailed events, L1 and LLC data cache
	   -d -d:       more detailed events, dTLB and iTLB events
	   -d -d -d:    very detailed events, adding prefetch events

-r::
--repeat=<n>::
	repeat command and print average + stddev (max: 100). 0 means forever.

-B::
--big-num::
	print large numbers with thousands' separators according to locale.
	Enabled by default. Use "--no-big-num" to disable.
	Default setting can be changed with "perf config stat.big-num=false".

-C::
--cpu=::
Count only on the list of CPUs provided. Multiple CPUs can be provided as a
comma-separated list with no space: 0,1. Ranges of CPUs are specified with -: 0-2.
In per-thread mode, this option is ignored. The -a option is still necessary
to activate system-wide monitoring. Default is to count on all CPUs.

-A::
--no-aggr::
Do not aggregate counts across all monitored CPUs.

-n::
--null::
null run - Don't start any counters.

This can be useful to measure just elapsed wall-clock time - or to assess the
raw overhead of perf stat itself, without running any counters.

-v::
--verbose::
	be more verbose (show counter open errors, etc)

-x SEP::
--field-separator SEP::
print counts using a CSV-style output to make it easy to import directly into
spreadsheets. Columns are separated by the string specified in SEP.

--table:: Display time for each run (-r option), in a table format, e.g.:

  $ perf stat --null -r 5 --table perf bench sched pipe

   Performance counter stats for 'perf bench sched pipe' (5 runs):

             # Table of individual measurements:
             5.189 (-0.293) #
             5.189 (-0.294) #
             5.186 (-0.296) #
             5.663 (+0.181) ##
             6.186 (+0.703) ####

             # Final result:
             5.483 +- 0.198 seconds time elapsed  ( +-  3.62% )

-G name::
--cgroup name::
monitor only in the container (cgroup) called "name". This option is available only
in per-cpu mode. The cgroup filesystem must be mounted. All threads belonging to
container "name" are monitored when they run on the monitored CPUs. Multiple cgroups
can be provided. Each cgroup is applied to the corresponding event, i.e., first cgroup
to first event, second cgroup to second event and so on. It is possible to provide
an empty cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups must have
corresponding events, i.e., they always refer to events defined earlier on the command
line. If the user wants to track multiple events for a specific cgroup, the user can
use '-e e1 -e e2 -G foo,foo' or just use '-e e1 -e e2 -G foo'.

If wanting to monitor, say, 'cycles' for a cgroup and also for system wide, this
command line can be used: 'perf stat -e cycles -G cgroup_name -a -e cycles'.

--for-each-cgroup name::
Expand event list for each cgroup in "name" (allow multiple cgroups separated
by comma). It also supports regex patterns to match multiple groups. This has the
same effect as repeating the -e option and -G option for each event x name. This
option cannot be used with the -G/--cgroup option.
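For example, with illustrative cgroup names A and B, the following line
counts 'cycles' once per cgroup, as if '-e cycles -G A -e cycles -G B'
had been given:

  $ perf stat -e cycles --for-each-cgroup A,B -a -- sleep 1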
-o file::
--output file::
Print the output into the designated file.

--append::
Append to the output file designated with the -o option.

--log-fd::

Log output to fd, instead of stderr. Complementary to --output, and mutually exclusive
with it. --append may be used here. Examples:
     3>results  perf stat --log-fd 3          -- $cmd
     3>>results perf stat --log-fd 3 --append -- $cmd

--control=fifo:ctl-fifo[,ack-fifo]::
--control=fd:ctl-fd[,ack-fd]::
ctl-fifo / ack-fifo are opened and used as ctl-fd / ack-fd as follows.
Listen on ctl-fd descriptor for command to control measurement ('enable': enable events,
'disable': disable events). Measurements can be started with events disabled using
--delay=-1 option. Optionally send control command completion ('ack\n') to ack-fd descriptor
to synchronize with the controlling process. Example of bash shell script to enable and
disable events during measurements:

 #!/bin/bash

 ctl_dir=/tmp/

 ctl_fifo=${ctl_dir}perf_ctl.fifo
 test -p ${ctl_fifo} && unlink ${ctl_fifo}
 mkfifo ${ctl_fifo}
 exec {ctl_fd}<>${ctl_fifo}

 ctl_ack_fifo=${ctl_dir}perf_ctl_ack.fifo
 test -p ${ctl_ack_fifo} && unlink ${ctl_ack_fifo}
 mkfifo ${ctl_ack_fifo}
 exec {ctl_fd_ack}<>${ctl_ack_fifo}

 perf stat -D -1 -e cpu-cycles -a -I 1000       \
           --control fd:${ctl_fd},${ctl_fd_ack} \
           \-- sleep 30 &
 perf_pid=$!

 sleep 5  && echo 'enable' >&${ctl_fd} && read -u ${ctl_fd_ack} e1 && echo "enabled(${e1})"
 sleep 10 && echo 'disable' >&${ctl_fd} && read -u ${ctl_fd_ack} d1 && echo "disabled(${d1})"

 exec {ctl_fd_ack}>&-
 unlink ${ctl_ack_fifo}

 exec {ctl_fd}>&-
 unlink ${ctl_fifo}

 wait -n ${perf_pid}
 exit $?


--pre::
--post::
	Pre and post measurement hooks, e.g.:

perf stat --repeat 10 --null --sync --pre 'make -s O=defconfig-build/clean' \-- make -s -j64 O=defconfig-build/ bzImage

-I msecs::
--interval-print msecs::
Print count deltas every N milliseconds (minimum: 1ms)
The overhead percentage could be high in some cases, for example with small, sub 100ms intervals. Use with caution.
	example: 'perf stat -I 1000 -e cycles -a sleep 5'

If the metric exists, it is calculated by the counts generated in this interval and the metric is printed after #.

--interval-count times::
Print count deltas for fixed number of times.
This option should be used together with "-I" option.
	example: 'perf stat -I 1000 --interval-count 2 -e cycles -a'

--interval-clear::
Clear the screen before next interval.

--timeout msecs::
Stop the 'perf stat' session and print count deltas after N milliseconds (minimum: 10 ms).
This option is not supported with the "-I" option.
	example: 'perf stat --timeout 2000 -e cycles -a'

--metric-only::
Only print computed metrics. Print them in a single line.
Don't show any raw values. Not supported with --per-thread.

--per-socket::
Aggregate counts per processor socket for system-wide mode measurements. This
is a useful mode to detect imbalance between sockets. To enable this mode,
use --per-socket in addition to -a. (system-wide). The output includes the
socket number and the number of online processors on that socket. This is
useful to gauge the amount of aggregation.

--per-die::
Aggregate counts per processor die for system-wide mode measurements. This
is a useful mode to detect imbalance between dies. To enable this mode,
use --per-die in addition to -a. (system-wide). The output includes the
die number and the number of online processors on that die. This is
useful to gauge the amount of aggregation.

--per-cluster::
Aggregate counts per processor cluster for system-wide mode measurements. This
is a useful mode to detect imbalance between clusters. To enable this mode,
use --per-cluster in addition to -a. (system-wide). The output includes the
cluster number and the number of online processors on that cluster. This is
useful to gauge the amount of aggregation. The information of cluster ID and
related CPUs can be gotten from /sys/devices/system/cpu/cpuX/topology/cluster_{id,cpus}.

--per-cache::
Aggregate counts per cache instance for system-wide mode measurements. By
default, the aggregation happens for the cache level at the highest index
in the system. To specify a particular level, mention the cache level
alongside the option in the format [Ll][1-9][0-9]*. For example:
Using option "--per-cache=l3" or "--per-cache=L3" will aggregate the
information at the boundary of the level 3 cache in the system.

--per-core::
Aggregate counts per physical processor for system-wide mode measurements. This
is a useful mode to detect imbalance between physical cores. To enable this mode,
use --per-core in addition to -a. (system-wide). The output includes the
core number and the number of online logical processors on that physical processor.

--per-thread::
Aggregate counts per monitored threads, when monitoring threads (-t option)
or processes (-p option).

--per-node::
Aggregate counts per NUMA node for system-wide mode measurements. This
is a useful mode to detect imbalance between NUMA nodes. To enable this
mode, use --per-node in addition to -a. (system-wide).
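For example, a per-core breakdown of cycles on a system-wide run could
look as follows (the topology and counts are illustrative):

  $ perf stat -a --per-core -e cycles -- sleep 1

   Performance counter stats for 'system wide':

  S0-D0-C0           2         12,345,678      cycles
  S0-D0-C1           2         11,223,344      cycles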
-D msecs::
--delay msecs::
After starting the program, wait msecs before measuring (-1: start with events
disabled). This is useful to filter out the startup phase of the program,
which is often very different.

-T::
--transaction::

Print statistics of transactional execution if supported.

--metric-no-group::
By default, events to compute a metric are placed in weak groups. The
group tries to enforce scheduling all or none of the events. The
--metric-no-group option places events outside of groups and may
increase the chance of the event being scheduled - leading to more
accuracy. However, as events may not be scheduled together accuracy
for metrics like instructions per cycle can be lower - as both metrics
may no longer be being measured at the same time.

--metric-no-merge::
By default metric events in different weak groups can be shared if one
group contains all the events needed by another. In such cases one
group will be eliminated reducing event multiplexing and making it so
that certain groups of metrics sum to 100%. A downside to sharing a
group is that the group may require multiplexing and so accuracy for a
small group that need not have multiplexing is lowered. This option
forbids the event merging logic from sharing events between groups and
may be used to increase accuracy in this case.

--metric-no-threshold::
Metric thresholds may increase the number of events necessary to
compute whether a metric has exceeded its threshold expression. This
may not be desirable, for example, as the events can introduce
multiplexing. This option disables the adding of threshold expression
events for a metric. However, if there are sufficient events to
compute the threshold then the threshold is still computed and used to
color the metric's computed value.

--quiet::
Don't print output, warnings or messages. This is useful with perf stat
record below to only write data to the perf.data file.

STAT RECORD
-----------
Stores stat data into perf data file.

-o file::
--output file::
Output file name.
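For example, counts can be recorded and inspected later (a minimal
sketch; the data goes to perf.data by default, see STAT REPORT below):

  $ perf stat record -e cycles -a -- sleep 1
  $ perf stat report -i perf.data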
STAT REPORT
-----------
Reads and reports stat data from perf data file.

-i file::
--input file::
Input file name.

--per-socket::
Aggregate counts per processor socket for system-wide mode measurements.

--per-die::
Aggregate counts per processor die for system-wide mode measurements.

--per-cluster::
Aggregate counts per processor cluster for system-wide mode measurements.

--per-cache::
Aggregate counts per cache instance for system-wide mode measurements. By
default, the aggregation happens for the cache level at the highest index
in the system. To specify a particular level, mention the cache level
alongside the option in the format [Ll][1-9][0-9]*. Using
option "--per-cache=l3" or "--per-cache=L3" will aggregate the
information at the boundary of the level 3 cache in the system.

--per-core::
Aggregate counts per physical processor for system-wide mode measurements.

-M::
--metrics::
Print metrics or metricgroups specified in a comma separated list.
For a group all metrics from the group are added.
The events from the metrics are automatically measured.
See perf list output for the possible metrics and metricgroups.

	When threshold information is available for a metric, the
	color red is used to signify a metric has exceeded a threshold
	while green shows it hasn't. The default color means that
	no threshold information was available or the threshold
	couldn't be computed.

-A::
--no-aggr::
--no-merge::
Do not aggregate/merge counts across monitored CPUs or PMUs.

When multiple events are created from a single event specification,
stat will, by default, aggregate the event counts and show the result
in a single row. This option disables that behavior and shows the
individual events and counts.

Multiple events are created from a single event specification when:

1. PID monitoring isn't requested and the system has more than one
   CPU. For example, a system with 8 SMT threads will have one event
   opened on each thread and aggregation is performed across them.

2. Prefix or glob wildcard matching is used for the PMU name. For
   example, multiple memory controller PMUs may exist typically with a
   suffix of _0, _1, etc. By default the event counts will all be
   combined if the PMU is specified without the suffix such as
   uncore_imc rather than uncore_imc_0.

3. Aliases, which are listed immediately after the Kernel PMU events
   by perf list, are used.

--hybrid-merge::
Merge core event counts from all core PMUs. In hybrid or big.LITTLE
systems by default each core PMU will report its count
separately. This option forces core PMU counts to be combined to give
a behavior closer to having a single CPU type in the system.

--topdown::
Print top-down metrics supported by the CPU. This allows determining the
bottlenecks in the CPU pipeline for CPU-bound workloads, by breaking
the cycles consumed down into frontend bound, backend bound, bad
speculation and retiring.

Frontend bound means that the CPU cannot fetch and decode instructions fast
enough. Backend bound means that computation or memory access is the bottle
neck. Bad Speculation means that the CPU wasted cycles due to branch
mispredictions and similar issues. Retiring means that the CPU computed without
an apparent bottleneck. The bottleneck is only the real bottleneck
if the workload is actually bound by the CPU and not by something else.

For best results it is usually a good idea to use it with interval
mode like -I 1000, as the bottleneck of workloads can change often.

This enables --metric-only, unless overridden with --no-metric-only.

The following restrictions only apply to older Intel CPUs and Atom;
on newer CPUs (IceLake and later) TopDown can be collected for any thread:

The top down metrics are collected per core instead of per CPU thread.
Per core mode is automatically enabled
and -a (global monitoring) is needed, requiring root rights or
kernel.perf_event_paranoid=-1.

Topdown uses the full Performance Monitoring Unit, and needs
disabling of the NMI watchdog (as root):

	echo 0 > /proc/sys/kernel/nmi_watchdog

for best results. Otherwise the bottlenecks may be inconsistent
on workload with changing phases.

To interpret the results it is usually needed to know on which
CPUs the workload runs on. If needed the CPUs can be forced using
taskset.

--record-tpebs::
Enable automatic sampling on Intel TPEBS retire_latency events (event with :R
modifier). Without this option, perf would not capture dynamic retire_latency
at runtime. Currently, a zero value is assigned to the retire_latency event when
this option is not set. The TPEBS hardware feature starts from the Granite
Rapids microarchitecture. This option only exists in X86_64 and is meaningful on
Intel platforms with the TPEBS feature.

--td-level::
Print the top-down statistics that equal the input level. It allows
users to print the top-down metrics level of interest instead of the
default level 1 top-down metrics.

As the higher levels gather more metrics and use more counters they
will be less accurate. By convention a metric can be examined by
appending '_group' to it and this will increase accuracy compared to
gathering all metrics for a level. For example, level 1 analysis may
highlight 'tma_frontend_bound'. This metric may be drilled into with
'tma_frontend_bound_group' with
'perf stat -M tma_frontend_bound_group...'.

Error out if the input is higher than the supported max level.

--smi-cost::
Measure SMI cost if msr/aperf/ and msr/smi/ events are supported.

During the measurement, /sys/devices/cpu/freeze_on_smi will be set to
freeze core counters on SMI.
The aperf counter will not be affected by the setting.
The cost of SMI can be measured by (aperf - unhalted core cycles).

In practice, the percentage of SMI cycles is very useful for performance
oriented analysis. --metric-only will be applied by default.
The output is SMI cycles%, which equals (aperf - unhalted core cycles) / aperf.

Users who want to get the actual value can apply --no-metric-only.

--all-kernel::
Configure all used events to run in kernel space.

--all-user::
Configure all used events to run in user space.

--percore-show-thread::
The event modifier "percore" sums up the event counts for all hardware
threads in a core and shows the counts per core.

This option with event modifier "percore" enabled also sums up the event
counts for all hardware threads in a core but shows the sum counts per
hardware thread. This is essentially a replacement for "--per-core" and it is more
convenient for post processing.

--summary::
Print summary for interval mode (-I).

--no-csv-summary::
Don't print 'summary' at the first column for CSV summary lines. This option
must be used with -x and --summary.

This option can be enabled in perf config by setting the variable
'stat.no-csv-summary'.

$ perf config stat.no-csv-summary=true

--cputype::
Only enable events on CPUs of the given type on hybrid platforms
(e.g. core or atom).

EXAMPLES
--------

$ perf stat \-- make

   Performance counter stats for 'make':

        83723.452481      task-clock:u (msec)       #    1.004 CPUs utilized
                   0      context-switches:u        #    0.000 K/sec
                   0      cpu-migrations:u          #    0.000 K/sec
           3,228,188      page-faults:u             #    0.039 M/sec
     229,570,665,834      cycles:u                  #    2.742 GHz
     313,163,853,778      instructions:u            #    1.36  insn per cycle
      69,704,684,856      branches:u                #  832.559 M/sec
       2,078,861,393      branch-misses:u           #    2.98% of all branches

       83.409183620 seconds time elapsed

       74.684747000 seconds user
        8.739217000 seconds sys

TIMINGS
-------
As displayed in the example above we can display 3 types of timings.
We always display the time the counters were enabled/alive:

       83.409183620 seconds time elapsed

For workload sessions we also display time the workloads spent in
user/system lands:

       74.684747000 seconds user
        8.739217000 seconds sys

Those times are the very same as displayed by the 'time' tool.

CSV FORMAT
----------

With -x, perf stat is able to output a not-quite-CSV format output.
Commas in the output are not put into "". To make it easy to parse
it is recommended to use a different character like -x \;

The fields are in this order:

	- optional usec time stamp in fractions of second (with -I xxx)
	- optional CPU, core, or socket identifier
	- optional number of logical CPUs aggregated
	- counter value
	- unit of the counter value or empty
	- event name
	- run time of counter
	- percentage of measurement time the counter was running
	- optional variance if multiple values are collected with -r
	- optional metric value
	- optional unit of metric

Additional metrics may be printed with all earlier fields being empty.
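For example, with ',' as the separator the output looks as follows (the
counts and run times shown are illustrative):

  $ perf stat -x, -e cycles,instructions -- sleep 1

  1234567,,cycles,1001510,100.00,,
  2345678,,instructions,1001510,100.00,1.90,insn per cycle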
include::intel-hybrid.txt[]

JSON FORMAT
-----------

With -j, perf stat is able to print out a JSON format output
that can be used for parsing.

- timestamp : optional usec time stamp in fractions of second (with -I)
- optional aggregate options:
		- core : core identifier (with --per-core)
		- die : die identifier (with --per-die)
		- socket : socket identifier (with --per-socket)
		- node : node identifier (with --per-node)
		- thread : thread identifier (with --per-thread)
- counter-value : counter value
- unit : unit of the counter value or empty
- event : event name
- variance : optional variance if multiple values are collected (with -r)
- runtime : run time of counter
- metric-value : optional metric value
- metric-unit : optional unit of metric

SEE ALSO
--------
linkperf:perf-top[1], linkperf:perf-list[1]