.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY

==============================================
Discovering Linux kernel subsystems used by a
==============================================

:Authors: - Shuah Khan <skhan@linuxfoundation.o
          - Shefali Sharma <sshefali021@gmail.c
:maintained-by: Shuah Khan <skhan@linuxfoundati

Key Points
==========

* Understanding system resources necessary to
  is important.
* Linux tracing and strace can be used to dis
  in use by a workload. The completeness of t
  depends on the completeness of coverage of
* Performance and security of the operating s
  the help of tools such as:
  `perf <https://man7.org/linux/man-pages/man
  `stress-ng <https://www.mankier.com/1/stres
  `paxtest <https://github.com/opntr/paxtest-
* Once we discover and understand the workloa
  to avoid regressions and use it to evaluate

Methodology
===========

`strace <https://man7.org/linux/man-pages/man1
diagnostic, instructional, and debugging tool
the system resources in use by a workload. Onc
the workload needs, we can focus on them to av
to evaluate safety considerations. We use stra

This method of tracing using strace tells us t
the workload and doesn't include all the syste
by it. In addition, this tracing method tells
these system calls that are invoked. As an exa
file and reads from it successfully, then the
is traced. Any error paths in that system call
is a workload that provides full coverage of a
outlined here will trace and find all possible
of the system usage information depends on the
workload.

The goal is tracing a workload on a system run
requiring custom kernel installs.

How do we gather fine-grained system informati
==============================================

strace tool can be used to trace system calls
it receives. System calls are the fundamental
application and the operating system kernel.
T
request services from the kernel. For instance
Linux is used to provide access to a file in t
us to track all the system calls made by an ap
system calls made by a process and their resul

You can generate profiling data combining stra
record the events and information associated w
insight into the process. "perf annotate" tool
each instruction of the program. This document
to gather fine-grained information on a worklo

We used strace to trace the perf, stress-ng, p
our methodology to discover resources used by
be applied to trace other workloads.

Getting the system ready for tracing
====================================

Before we can get started we will show you how
We assume that you have a Linux distribution r
or a virtual machine. Most distributions will
install other tools that aren't usually incl
Please note that the following works on Debian
might have to find equivalent packages on othe

Install tools to build Linux kernel and tools
scripts/ver_linux is a good way to check if yo
the necessary tools::

  sudo apt-get install build-essential flex bison yac
  sudo apt install libelf-dev systemtap-sdt-de

cscope is a good tool to browse kernel sources

  sudo apt-get install cscope

Install stress-ng and paxtest::

  sudo apt-get install stress-ng
  sudo apt-get install paxtest

Workload overview
=================

As mentioned earlier, we used strace to trace
paxtest workloads to show how to analyze a wor
subsystems used by these workloads. Let's star
three workloads to get a better understanding
use them.

perf bench (all) workload
-------------------------

The perf bench command contains multiple multi
benchmarks for executing different subsystems
system calls.
This allows us to easily measure
which can help mitigate performance regression
benchmarking framework, enabling developers to
integrate transparently, and use performance-r

Stress-ng netdev stressor workload
----------------------------------

stress-ng is used for performing stress testin
you to exercise various physical subsystems of
interfaces of the OS kernel, using "stressor-s
CPU, CPU cache, devices, I/O, interrupts, file
operating system, pipelines, schedulers, and v
to the `stress-ng man-page <https://www.mankie
find the description of all the available stre
starts a specified number (N) of workers that ex
ioctl commands across all the available networ

paxtest kiddie workload
-----------------------

paxtest is a program that tests buffer overflo
kernel enforcements over memory usage. General
segments makes buffer overflows possible. It r
attempt to subvert memory usage. It is used as
PaX, but might be useful to test other memory
kernel. We used paxtest kiddie mode which look

What is strace and how do we use it?
====================================

As mentioned earlier, strace is a useful
and debugging tool and can be used to discover
by a workload. It can be used:

* To see how a process interacts with the ker
* To see why a process is failing or hanging.
* For reverse engineering a process.
* To find the files on which a program depend
* For analyzing the performance of an applica
* For troubleshooting various problems relate

In addition, strace can generate run-time stat
errors for each system call and report a summa
suppressing the regular output.
This attempts
spent running in the kernel) independent of wa
these features to get information on workload

The strace command supports basic, verbose, and st
run in verbose mode gives more detailed inform
invoked by a process.

Running strace -c generates a report of the pe
system call, the total time in seconds, the mi
number of calls, the count of each system call
and the type of system call made.

* Usage: strace <command we want to trace>
* Verbose mode usage: strace -v <command>
* Gather statistics: strace -c <command>

We used the "-c" option to gather fine-gra
by three workloads we have chosen for this anal

* perf
* stress-ng
* paxtest

What is cscope and how do we use it?
====================================

Now let's look at `cscope <https://cscope.so
line tool for browsing C, C++ or Java code-bas
all the references to a symbol, global definit
function, functions calling a function, text s
patterns, files including a file.

We can use cscope to find which system call be
This way we can find the kernel subsystems use
executed.

Let's check out the latest Linux repository a

  git clone git://git.kernel.org/pub/scm/linux
  cd linux
  cscope -R -p10 # builds cscope.out database
  cscope -d -p10 # starts browse session on c

Note: Run "cscope -R -p10" to build the databa
enter into the browsing session. cscope by def
To get out of this mode press ctrl+d. -p optio
number of file path components to display. -p1
kernel sources.

What is perf and how do we use it?
==================================

Perf is an analysis tool based on Linux 2.6+ s
CPU hardware difference in performance measure
a simple command line interface. Perf is based
exported by the kernel.
It is very useful for
finding performance bottlenecks in an applicat

If you haven't already checked out the Linux m
so and then build the kernel and perf tool::

  git clone git://git.kernel.org/pub/scm/linux
  cd linux
  make -j3 all
  cd tools/perf
  make

Note: The perf command can be built without bu
repository and can be run on older kernels. Ho
and perf revisions gives more accurate informa

We used "perf stat" and "perf bench" options.
the perf tool, run "perf -h".

perf stat
---------

The perf stat command generates a report of va
events. It does so with the help of hardware c
modern CPUs that keep the count of these activ
stats for cal command.

perf bench
----------

The perf bench command contains multiple multi
benchmarks for executing different subsystems
system calls. This allows us to easily measure
which can help mitigate performance regression
benchmarking framework, enabling developers to
integrate transparently, and use performance-r

The "perf bench all" command runs the following be

* sched/messaging
* sched/pipe
* syscall/basic
* mem/memcpy
* mem/memset

What is stress-ng and how do we use it?
=======================================

As mentioned earlier, stress-ng is used for pe
the kernel. It allows you to exercise various
computer, as well as interfaces of the OS kern
are available for CPU, CPU cache, devices, I/O
memory, network, operating system, pipelines,
machines.

The netdev stressor starts N workers that exer
commands across all the available network devi
exercised:

* SIOCGIFCONF, SIOCGIFINDEX, SIOCGIFNAME, SIO
* SIOCGIFADDR, SIOCGIFNETMASK, SIOCGIFMETRIC,
* SIOCGIFHWADDR, SIOCGIFMAP, SIOCGIFTXQLEN

The following command runs the stressor::

  stress-ng --netdev 1 -t 60 --metrics

We can use the perf record command to record t
associated with a process. This command record
perf.data file in the same directory.

Using the following commands you can record th
netdev stressor, view the generated report per
view the statistics of each instruction of the

  perf record stress-ng --netdev 1 -t 60 --met
  perf report
  perf annotate

What is paxtest and how do we use it?
=====================================

paxtest is a program that tests buffer overflo
kernel enforcements over memory usage. General
segments makes buffer overflows possible. It r
attempt to subvert memory usage. It is used as
PaX, and will be useful to test other memory p
kernel.

paxtest provides kiddie and blackhat modes. Th
in normal mode, whereas the blackhat mode trie
of the kernel testing for vulnerabilities. We
and combine "paxtest kiddie" run with "perf re
traces for the paxtest kiddie run to see which
functions in the performance profile.
Then the
Information) mode can be used to unwind the st

The following command can be used to view resu
format::

  perf record --call-graph dwarf paxtest kiddi
  perf report --stdio

Tracing workloads
=================

Now that we understand the workloads, let's st

Tracing perf bench all workload
-------------------------------

Run the following command to trace perf bench

  strace -c perf bench all

**System Calls made by the workload**

The below table shows the system calls invoked
times each system call is invoked, and the cor

+-------------------+-----------+-----------------+
| System Call       | # calls   | Linux Subsystem |
+===================+===========+=================+
| getppid           | 10000001  | Process Mgmt.   |
+-------------------+-----------+-----------------+
| clone             | 1077      | Process Mgmt.   |
+-------------------+-----------+-----------------+
| prctl             | 23        | Process Mgmt.   |
+-------------------+-----------+-----------------+
| prlimit64         | 7         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| getpid            | 10        | Process Mgmt.   |
+-------------------+-----------+-----------------+
| uname             | 3         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| sysinfo           | 1         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| getuid            | 1         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| getgid            | 1         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| geteuid           | 1         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| getegid           | 1         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| close             | 49951     | Filesystem      |
+-------------------+-----------+-----------------+
| pipe              | 604       | Filesystem      |
+-------------------+-----------+-----------------+
| openat            | 48560     | Filesystem      |
+-------------------+-----------+-----------------+
| fstat             | 8338      | Filesystem      |
+-------------------+-----------+-----------------+
| stat              | 1573      | Filesystem      |
+-------------------+-----------+-----------------+
| pread64           | 9646      | Filesystem      |
+-------------------+-----------+-----------------+
| getdents64        | 1873      | Filesystem      |
+-------------------+-----------+-----------------+
| access            | 3         | Filesystem      |
+-------------------+-----------+-----------------+
| lstat             | 1880      | Filesystem      |
+-------------------+-----------+-----------------+
| lseek             | 6         | Filesystem      |
+-------------------+-----------+-----------------+
| ioctl             | 3         | Filesystem      |
+-------------------+-----------+-----------------+
| dup2              | 1         | Filesystem      |
+-------------------+-----------+-----------------+
| execve            | 2         | Filesystem      |
+-------------------+-----------+-----------------+
| fcntl             | 8779      | Filesystem      |
+-------------------+-----------+-----------------+
| statfs            | 1         | Filesystem      |
+-------------------+-----------+-----------------+
| epoll_create      | 2         | Filesystem      |
+-------------------+-----------+-----------------+
| epoll_ctl         | 64        | Filesystem      |
+-------------------+-----------+-----------------+
| newfstatat        | 8318      | Filesystem      |
+-------------------+-----------+-----------------+
| eventfd2          | 192       | Filesystem      |
+-------------------+-----------+-----------------+
| mmap              | 243       | Memory Mgmt.    |
+-------------------+-----------+-----------------+
| mprotect          | 32        | Memory Mgmt.    |
+-------------------+-----------+-----------------+
| brk               | 21        | Memory Mgmt.    |
+-------------------+-----------+-----------------+
| munmap            | 128       | Memory Mgmt.    |
+-------------------+-----------+-----------------+
| set_mempolicy     | 156       | Memory Mgmt.    |
+-------------------+-----------+-----------------+
| set_tid_address   | 1         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| set_robust_list   | 1         | Futex           |
+-------------------+-----------+-----------------+
| futex             | 341       | Futex           |
+-------------------+-----------+-----------------+
| sched_getaffinity | 79        | Scheduler       |
+-------------------+-----------+-----------------+
| sched_setaffinity | 223       | Scheduler       |
+-------------------+-----------+-----------------+
| socketpair        | 202       | Network         |
+-------------------+-----------+-----------------+
| rt_sigprocmask    | 21        | Signal          |
+-------------------+-----------+-----------------+
| rt_sigaction      | 36        | Signal          |
+-------------------+-----------+-----------------+
| rt_sigreturn      | 2         | Signal          |
+-------------------+-----------+-----------------+
| wait4             | 889       | Time            |
+-------------------+-----------+-----------------+
| clock_nanosleep   | 37        | Time            |
+-------------------+-----------+-----------------+
| capget            | 4         | Capability      |
+-------------------+-----------+-----------------+

Tracing stress-ng netdev stressor workload
------------------------------------------

Run the following command to trace stress-ng n

  strace -c stress-ng --netdev 1 -t 60 --metr

**System Calls made by the workload**

The below table shows the system calls invoked
times each system call is invoked, and the cor

+-------------------+-----------+-----------------+
| System Call       | # calls   | Linux Subsystem |
+===================+===========+=================+
| openat            | 74        | Filesystem      |
+-------------------+-----------+-----------------+
| close             | 75        | Filesystem      |
+-------------------+-----------+-----------------+
| read              | 58        | Filesystem      |
+-------------------+-----------+-----------------+
| fstat             | 20        | Filesystem      |
+-------------------+-----------+-----------------+
| flock             | 10        | Filesystem      |
+-------------------+-----------+-----------------+
| write             | 7         | Filesystem      |
+-------------------+-----------+-----------------+
| getdents64        | 8         | Filesystem      |
+-------------------+-----------+-----------------+
| pread64           | 8         | Filesystem      |
+-------------------+-----------+-----------------+
| lseek             | 1         | Filesystem      |
+-------------------+-----------+-----------------+
| access            | 2         | Filesystem      |
+-------------------+-----------+-----------------+
| getcwd            | 1         | Filesystem      |
+-------------------+-----------+-----------------+
| execve            | 1         | Filesystem      |
+-------------------+-----------+-----------------+
| mmap              | 61        | Memory Mgmt.    |
+-------------------+-----------+-----------------+
| munmap            | 3         | Memory Mgmt.    |
+-------------------+-----------+-----------------+
| mprotect          | 20        | Memory Mgmt.    |
+-------------------+-----------+-----------------+
| mlock             | 2         | Memory Mgmt.    |
+-------------------+-----------+-----------------+
| brk               | 3         | Memory Mgmt.    |
+-------------------+-----------+-----------------+
| rt_sigaction      | 21        | Signal          |
+-------------------+-----------+-----------------+
| rt_sigprocmask    | 1         | Signal          |
+-------------------+-----------+-----------------+
| sigaltstack       | 1         | Signal          |
+-------------------+-----------+-----------------+
| rt_sigreturn      | 1         | Signal          |
+-------------------+-----------+-----------------+
| getpid            | 8         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| prlimit64         | 5         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| arch_prctl        | 2         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| sysinfo           | 2         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| getuid            | 2         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| uname             | 1         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| setpgid           | 1         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| getrusage         | 1         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| geteuid           | 1         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| getppid           | 1         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| sendto            | 3         | Network         |
+-------------------+-----------+-----------------+
| connect           | 1         | Network         |
+-------------------+-----------+-----------------+
| socket            | 1         | Network         |
+-------------------+-----------+-----------------+
| clone             | 1         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| set_tid_address   | 1         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| wait4             | 2         | Time            |
+-------------------+-----------+-----------------+
| alarm             | 1         | Time            |
+-------------------+-----------+-----------------+
| set_robust_list   | 1         | Futex           |
+-------------------+-----------+-----------------+

Tracing paxtest kiddie workload
-------------------------------

Run the following command to trace paxtest kid

  strace -c paxtest kiddie

**System Calls made by the workload**

The below table shows the system calls invoked
times each system call is invoked, and the cor

+-------------------+-----------+-----------------+
| System Call       | # calls   | Linux Subsystem |
+===================+===========+=================+
| read              | 3         | Filesystem      |
+-------------------+-----------+-----------------+
| write             | 11        | Filesystem      |
+-------------------+-----------+-----------------+
| close             | 41        | Filesystem      |
+-------------------+-----------+-----------------+
| stat              | 24        | Filesystem      |
+-------------------+-----------+-----------------+
| fstat             | 2         | Filesystem      |
+-------------------+-----------+-----------------+
| pread64           | 6         | Filesystem      |
+-------------------+-----------+-----------------+
| access            | 1         | Filesystem      |
+-------------------+-----------+-----------------+
| pipe              | 1         | Filesystem      |
+-------------------+-----------+-----------------+
| dup2              | 24        | Filesystem      |
+-------------------+-----------+-----------------+
| execve            | 1         | Filesystem      |
+-------------------+-----------+-----------------+
| fcntl             | 26        | Filesystem      |
+-------------------+-----------+-----------------+
| openat            | 14        | Filesystem      |
+-------------------+-----------+-----------------+
| rt_sigaction      | 7         | Signal          |
+-------------------+-----------+-----------------+
| rt_sigreturn      | 38        | Signal          |
+-------------------+-----------+-----------------+
| clone             | 38        | Process Mgmt.   |
+-------------------+-----------+-----------------+
| wait4             | 44        | Time            |
+-------------------+-----------+-----------------+
| mmap              | 7         | Memory Mgmt.    |
+-------------------+-----------+-----------------+
| mprotect          | 3         | Memory Mgmt.    |
+-------------------+-----------+-----------------+
| munmap            | 1         | Memory Mgmt.    |
+-------------------+-----------+-----------------+
| brk               | 3         | Memory Mgmt.    |
+-------------------+-----------+-----------------+
| getpid            | 1         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| getuid            | 1         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| getgid            | 1         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| geteuid           | 2         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| getegid           | 1         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| getppid           | 1         | Process Mgmt.   |
+-------------------+-----------+-----------------+
| arch_prctl        | 2         | Process Mgmt.   |
+-------------------+-----------+-----------------+

Conclusion
==========

This document is intended to be used as a guid
information on the resources in use by workloa

References
==========

* `Discovery Linux Kernel Subsystems used by
* `ELISA-White-Papers-Discovering Linux kerne
* `strace <https://man7.org/linux/man-pages/m
* `perf <https://man7.org/linux/man-pages/man
* `paxtest README <https://github.com/opntr/p
* `stress-ng <https://www.mankier.com/1/stres
* `Monitoring and managing system status and
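As a rough illustration of the analysis step behind the per-workload tables in the tracing sections, the sketch below parses a ``strace -c`` summary and adds up the call counts per subsystem. It is not part of the original methodology. The function names ``parse_strace_summary`` and ``calls_per_subsystem``, the partial ``SUBSYSTEM`` map, and the ``SAMPLE`` text are all hypothetical, and the sample numbers are invented for illustration rather than measured.

```python
import re
from collections import Counter

# Hypothetical, partial syscall-to-subsystem map using labels from the
# tables in this document; extend it for the workload being analyzed.
SUBSYSTEM = {
    "openat": "Filesystem",
    "close": "Filesystem",
    "read": "Filesystem",
    "mmap": "Memory Mgmt.",
    "brk": "Memory Mgmt.",
    "rt_sigaction": "Signal",
    "futex": "Futex",
    "socket": "Network",
    "wait4": "Time",
}

# Invented sample in the layout printed by "strace -c"; not real measurements.
SAMPLE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 60.00    0.000600          20        30           mmap
 30.00    0.000300          12        25         3 openat
 10.00    0.000100          10        10           futex
  0.00    0.000000           0         5           getrandom
------ ----------- ----------- --------- --------- ----------------
100.00    0.001000                    70         3 total
"""


def parse_strace_summary(text):
    """Return {syscall name: call count} from "strace -c" summary text."""
    calls = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a percentage and carry at least
        # % time, seconds, usecs/call, calls and the syscall name.
        if len(fields) < 5 or not re.fullmatch(r"\d+\.\d+", fields[0]):
            continue
        if fields[-1] == "total":
            continue
        # The "errors" column may be empty, so take the name from the end
        # of the row and the call count from the fourth column.
        calls[fields[-1]] = int(fields[3])
    return calls


def calls_per_subsystem(calls):
    """Sum call counts per subsystem; unknown syscalls go to "Unmapped"."""
    totals = Counter()
    for name, count in calls.items():
        totals[SUBSYSTEM.get(name, "Unmapped")] += count
    return dict(totals)


for subsystem, count in sorted(calls_per_subsystem(parse_strace_summary(SAMPLE)).items()):
    print(f"{subsystem:15} {count}")
```

To feed it real data, the summary can be written to a file with the strace ``-o`` option, for example ``strace -c -o summary.txt paxtest kiddie``, and the contents of that file passed to ``parse_strace_summary``. Anything reported as "Unmapped" still needs to be looked up in the kernel sources, for instance with cscope.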