==============
Control Groups
==============

Written by Paul Menage <menage@google.com> base
Documentation/admin-guide/cgroup-v1/cpusets.rs

Original copyright statements from cpusets.txt

Portions Copyright (C) 2004 BULL SA.

Portions Copyright (c) 2004-2006 Silicon Graph

Modified by Paul Jackson <pj@sgi.com>

Modified by Christoph Lameter <cl@linux.com>

.. CONTENTS:

    1. Control Groups
      1.1 What are cgroups ?
      1.2 Why are cgroups needed ?
      1.3 How are cgroups implemented ?
      1.4 What does notify_on_release do ?
      1.5 What does clone_children do ?
      1.6 How do I use cgroups ?
    2. Usage Examples and Syntax
      2.1 Basic Usage
      2.2 Attaching processes
      2.3 Mounting hierarchies by name
    3. Kernel API
      3.1 Overview
      3.2 Synchronization
      3.3 Subsystem API
    4. Extended attributes usage
    5. Questions

1. Control Groups
=================

1.1 What are cgroups ?
----------------------

Control Groups provide a mechanism for aggrega
tasks, and all their future children, into hie
specialized behaviour.

Definitions:

A *cgroup* associates a set of tasks with a se
or more subsystems.

A *subsystem* is a module that makes use of th
facilities provided by cgroups to treat groups
particular ways. A subsystem is typically a "r
schedules a resource or applies per-cgroup lim
anything that wants to act on a group of proce
virtualization subsystem.

A *hierarchy* is a set of cgroups arranged in
every task in the system is in exactly one of
hierarchy, and a set of subsystems; each subsy
state attached to each cgroup in the hierarchy
an instance of the cgroup virtual filesystem a

At any one time there may be multiple active h
cgroups.
Each hierarchy is a partition of all

User-level code may create and destroy cgroups
instance of the cgroup virtual file system, sp
which cgroup a task is assigned, and list the
a cgroup. Those creations and assignments only
associated with that instance of the cgroup fi

On their own, the only use for cgroups is for
tracking. The intention is that other subsyste
cgroup support to provide new attributes for c
accounting/limiting the resources which proces
access. For example, cpusets (see Documentatio
you to associate a set of CPUs and a set of me
tasks in each cgroup.

.. _cgroups-why-needed:

1.2 Why are cgroups needed ?
----------------------------

There are multiple efforts to provide process
Linux kernel, mainly for resource-tracking pur
include cpusets, CKRM/ResGroups, UserBeanCount
namespaces. These all require the basic notion
grouping/partitioning of processes, with newly
up in the same group (cgroup) as their parent

The kernel cgroup patch provides the minimum e
mechanisms required to efficiently implement s
minimal impact on the system fast paths, and p
specific subsystems such as cpusets to provide
desired.

Multiple hierarchy support is provided to allo
the division of tasks into cgroups is distinct
different subsystems - having parallel hierarc
hierarchy to be a natural division of tasks, w
complex combinations of tasks that would be pr
unrelated subsystems needed to be forced into
cgroups.

At one extreme, each resource controller or su
separate hierarchy; at the other extreme, all
would be attached to the same hierarchy.

As an example of a scenario (originally propos
that can benefit from multiple hierarchies, co
university server with various users - student
tasks etc.
The resource planning for this serv
following lines::

       CPU :          "Top cpuset"
                       /       \
               CPUSet1         CPUSet2
                  |               |
               (Professors)    (Students)

               In addition (system tasks) are
               that they can run anywhere) wit

       Memory : Professors (50%), Students (30

       Disk : Professors (50%), Students (30%)

       Network : WWW browsing (20%), Network F
                               / \
               Professors (15%)  students (5%)

Browsers like Firefox/Lynx go into the WWW net
into the NFS network class.

At the same time Firefox/Lynx will share an ap
depending on who launched it (prof/student).

With the ability to classify tasks differently
(by putting those resource subsystems in diffe
the admin can easily set up a script which rec
and depending on who is launching the browser

    # echo browser_pid > /sys/fs/cgroup/<resty

With only a single hierarchy, he now would pot
a separate cgroup for every browser launched a
appropriate network and other resource class.
proliferation of such cgroups.

Also let's say that the administrator would li
access temporarily to a student's browser (sin
wants to do online gaming :)) OR give one of
apps enhanced CPU power.

With ability to write PIDs directly to resourc
matter of::

       # echo pid > /sys/fs/cgroup/network/<ne
       (after some time)
       # echo pid > /sys/fs/cgroup/network/<or

Without this ability, the administrator would
multiple separate ones and then associate the
new resource classes.



1.3 How are cgroups implemented ?
---------------------------------

Control Groups extends the kernel as follows:

 - Each task in the system has a reference-cou
   css_set.

 - A css_set contains a set of reference-count
   cgroup_subsys_state objects, one for each c
   registered in the system.
   There is no direc
   the cgroup of which it's a member in each h
   can be determined by following pointers thr
   cgroup_subsys_state objects. This is becaus
   subsystem state is something that's expecte
   and in performance-critical code, whereas o
   task's actual cgroup assignments (in partic
   cgroups) are less common. A linked list run
   field of each task_struct using the css_set
   css_set->tasks.

 - A cgroup hierarchy filesystem can be mounte
   manipulation from user space.

 - You can list all the tasks (by PID) attache

The implementation of cgroups requires a few,
into the rest of the kernel, none in performan

 - in init/main.c, to initialize the root cgro
   css_set at system boot.

 - in fork and exit, to attach and detach a ta

In addition, a new file system of type "cgroup
enable browsing and modifying the cgroups pres
kernel. When mounting a cgroup hierarchy, you
comma-separated list of subsystems to mount as
options. By default, mounting the cgroup file
mount a hierarchy containing all registered su

If an active hierarchy with exactly the same s
exists, it will be reused for the new mount. I
matches, and any of the requested subsystems a
hierarchy, the mount will fail with -EBUSY. Ot
is activated, associated with the requested su

It's not currently possible to bind a new subs
cgroup hierarchy, or to unbind a subsystem fro
hierarchy. This may be possible in future, but
error-recovery issues.
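
The mount-time rules described above (reuse on an exact subsystem match,
-EBUSY when only some of the requested subsystems are already in use, a new
hierarchy otherwise) can be modelled in user space. The following is only
an illustrative sketch, not kernel code: the function names normalize_set
and mount_outcome are hypothetical, and the existing hierarchies are
passed in as plain comma-separated lists instead of being read from the
kernel.

```shell
# normalize_set LIST
# Print a comma-separated subsystem list in sorted order with duplicates
# removed, so that "memory,cpuset" and "cpuset,memory" compare equal.
# (Hypothetical helper, introduced for this sketch only.)
normalize_set() {
    printf '%s\n' "$1" | tr ',' '\n' | sort -u | paste -s -d ',' -
}

# mount_outcome REQUESTED [EXISTING...]
# REQUESTED is the subsystem list of the new mount; each EXISTING argument
# is the subsystem list of one already-active hierarchy.
# Prints "reuse", "EBUSY" or "new". (Hypothetical helper.)
mount_outcome() {
    requested=$(normalize_set "$1")
    shift

    # An active hierarchy with exactly the same subsystems is reused.
    for existing in "$@"; do
        if [ "$(normalize_set "$existing")" = "$requested" ]; then
            echo reuse
            return 0
        fi
    done

    # No exact match: any requested subsystem already in use means EBUSY.
    for existing in "$@"; do
        for subsys in $(printf '%s\n' "$requested" | tr ',' ' '); do
            case ",$existing," in
                *,"$subsys",*) echo EBUSY; return 0 ;;
            esac
        done
    done

    # Otherwise a new hierarchy would be activated.
    echo new
}
```

For instance, ``mount_outcome cpuset cpuset,memory`` models a request for
cpuset alone while cpuset is already bound together with memory elsewhere.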

When a cgroup filesystem is unmounted, if ther
child cgroups created below the top-level cgro
will remain active even though unmounted; if t
child cgroups then the hierarchy will be deact

No new system calls are added for cgroups - al
querying and modifying cgroups is via this cgr

Each task under /proc has an added file named
for each active hierarchy, the subsystem names
as the path relative to the root of the cgroup

Each cgroup is represented by a directory in t
containing the following files describing that

 - tasks: list of tasks (by PID) attached to t
   is not guaranteed to be sorted. Writing a
   moves the thread into this cgroup.
 - cgroup.procs: list of thread group IDs in t
   not guaranteed to be sorted or free of dupl
   should sort/uniquify the list if this prope
   Writing a thread group ID into this file mo
   group into this cgroup.
 - notify_on_release flag: run the release age
 - release_agent: the path to use for release
   exists in the top cgroup only)

Other subsystems such as cpusets may add addit
cgroup dir.

New cgroups are created using the mkdir system
command. The properties of a cgroup, such as
modified by writing to the appropriate file in
directory, as listed above.

The named hierarchical structure of nested cgr
a large system into nested, dynamically change

The attachment of each task, automatically inh
children of that task, to a cgroup allows orga
on a system into related sets of tasks. A tas
any other cgroup, if allowed by the permission
cgroup file system directories.

When a task is moved from one cgroup to anothe
css_set pointer - if there's an already existi
desired collection of cgroups then that group
css_set is allocated. The appropriate existing
looking into a hash table.
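
As noted in the file list above, cgroup.procs is not guaranteed to be
sorted or free of duplicates, so user space has to sort and uniquify the
list itself when it needs that property. A minimal sketch of doing so,
assuming one numeric ID per line; the function name list_unique_tgids is
hypothetical:

```shell
# list_unique_tgids FILE
# Print the thread group IDs found in FILE, one per line, in ascending
# numeric order with duplicates removed. FILE is assumed to hold one
# numeric ID per line, as a cgroup.procs file does.
# (Hypothetical helper, introduced for this sketch only.)
list_unique_tgids() {
    sort -n -u "$1"
}
```

It could be called as ``list_unique_tgids cgroup.procs`` from inside a
cgroup directory.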

To allow access from a cgroup to the css_sets
that comprise it, a set of cg_cgroup_link obje
each cg_cgroup_link is linked into a list of c
a single cgroup on its cgrp_link_list field, a
cg_cgroup_links for a single css_set on its cg

Thus the set of tasks in a cgroup can be liste
each css_set that references the cgroup, and s
each css_set's task set.

The use of a Linux virtual file system (vfs) t
cgroup hierarchy provides for a familiar permi
for cgroups, with a minimum of additional kern

1.4 What does notify_on_release do ?
------------------------------------

If the notify_on_release flag is enabled (1) i
whenever the last task in the cgroup leaves (e
some other cgroup) and the last child cgroup o
is removed, then the kernel runs the command s
of the "release_agent" file in that hierarchy'
supplying the pathname (relative to the mount
file system) of the abandoned cgroup. This en
removal of abandoned cgroups. The default val
notify_on_release in the root cgroup at system
(0). The default value of other cgroups at cr
value of their parents' notify_on_release sett
a cgroup hierarchy's release_agent path is emp

1.5 What does clone_children do ?
---------------------------------

This flag only affects the cpuset controller.
flag is enabled (1) in a cgroup, a new cpuset
configuration from the parent during initializ

1.6 How do I use cgroups ?
--------------------------

To start a new job that is to be contained wit
the "cpuset" cgroup subsystem, the steps are s

 1) mount -t tmpfs cgroup_root /sys/fs/cgroup
 2) mkdir /sys/fs/cgroup/cpuset
 3) mount -t cgroup -ocpuset cpuset /sys/fs/cg
 4) Create the new cgroup by doing mkdir's and
    the /sys/fs/cgroup/cpuset virtual file sys
 5) Start a task that will be the "founding fa
 6) Attach that task to the new cgroup by writ
    /sys/fs/cgroup/cpuset tasks file for that
 7) fork, exec or clone the job tasks from thi

For example, the following sequence of command
named "Charlie", containing just CPUs 2 and 3,
and then start a subshell 'sh' in that cgroup:

  mount -t tmpfs cgroup_root /sys/fs/cgroup
  mkdir /sys/fs/cgroup/cpuset
  mount -t cgroup cpuset -ocpuset /sys/fs/cgro
  cd /sys/fs/cgroup/cpuset
  mkdir Charlie
  cd Charlie
  /bin/echo 2-3 > cpuset.cpus
  /bin/echo 1 > cpuset.mems
  /bin/echo $$ > tasks
  sh
  # The subshell 'sh' is now running in cgroup
  # The next line should display '/Charlie'
  cat /proc/self/cgroup

2. Usage Examples and Syntax
============================

2.1 Basic Usage
---------------

Creating, modifying, using cgroups can be done
virtual filesystem.

To mount a cgroup hierarchy with all available

  # mount -t cgroup xxx /sys/fs/cgroup

The "xxx" is not interpreted by the cgroup cod
/proc/mounts so may be any useful identifying

Note: Some subsystems do not work without some
if cpusets are enabled the user will have to p
for each new cgroup created before that group

As explained in section `1.2 Why are cgroups n
different hierarchies of cgroups for each sing
resources you want to control.
Therefore, you
/sys/fs/cgroup and create directories for each
group::

  # mount -t tmpfs cgroup_root /sys/fs/cgroup
  # mkdir /sys/fs/cgroup/rg1

To mount a cgroup hierarchy with just the cpus
subsystems, type::

  # mount -t cgroup -o cpuset,memory hier1 /sy

While remounting cgroups is currently supporte
to use it. Remounting allows changing bound su
release_agent. Rebinding is hardly useful as i
hierarchy is empty and release_agent itself sh
conventional fsnotify. The support for remount
the future.

To specify a hierarchy's release_agent::

  # mount -t cgroup -o cpuset,release_agent="/
    xxx /sys/fs/cgroup/rg1

Note that specifying 'release_agent' more than

Note that changing the set of subsystems is cu
when the hierarchy consists of a single (root)
the ability to arbitrarily bind/unbind subsyst
cgroup hierarchy is intended to be implemented

Then under /sys/fs/cgroup/rg1 you can find a t
tree of the cgroups in the system. For instanc
is the cgroup that holds the whole system.

If you want to change the value of release_age

  # echo "/sbin/new_release_agent" > /sys/fs/c

It can also be changed via remount.

If you want to create a new cgroup under /sys/

  # cd /sys/fs/cgroup/rg1
  # mkdir my_cgroup

Now you want to do something with this cgroup:

  # cd my_cgroup

In this directory you can find several files::

  # ls
  cgroup.procs notify_on_release tasks
  (plus whatever files added by the attached s

Now attach your shell to this cgroup::

  # /bin/echo $$ > tasks

You can also create cgroups inside your cgroup
directory::

  # mkdir my_sub_cs

To remove a cgroup, just use rmdir::

  # rmdir my_sub_cs

This will fail if the cgroup is in use (has cg
has processes attached, or is held alive by ot
reference).
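
Two of the rmdir failure conditions above, child cgroups and attached
processes, can be checked from user space before calling rmdir. The
sketch below is an assumption-laden illustration: the function name
cgroup_is_removable is hypothetical, it treats every subdirectory as a
child cgroup, and it cannot detect subsystem-specific references, so
rmdir may still fail after the check passes.

```shell
# cgroup_is_removable DIR
# Exit status 0 when DIR has no subdirectories (child cgroups) and its
# "tasks" file lists no PIDs; non-zero otherwise.
# (Hypothetical helper, introduced for this sketch only.)
cgroup_is_removable() {
    dir=$1

    # Any subdirectory is treated as a child cgroup. When the glob
    # matches nothing it stays literal and the -d test is false.
    for child in "$dir"/*/; do
        if [ -d "$child" ]; then
            return 1
        fi
    done

    # Read the file rather than testing its size, since files in a
    # virtual filesystem need not report a meaningful size.
    if [ -n "$(cat "$dir/tasks" 2>/dev/null)" ]; then
        return 1
    fi

    return 0
}
```

A caller might write ``cgroup_is_removable my_sub_cs && rmdir my_sub_cs``
and still check the exit status of rmdir itself.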

2.2 Attaching processes
-----------------------

::

  # /bin/echo PID > tasks

Note that it is PID, not PIDs. You can only at
If you have several tasks to attach, you have

  # /bin/echo PID1 > tasks
  # /bin/echo PID2 > tasks
          ...
  # /bin/echo PIDn > tasks

You can attach the current shell task by echoi

  # echo 0 > tasks

You can use the cgroup.procs file instead of t
threads in a threadgroup at once. Echoing the
threadgroup to cgroup.procs causes all tasks i
attached to the cgroup. Writing 0 to cgroup.pr
in the writing task's threadgroup.

Note: Since every task is always a member of e
mounted hierarchy, to remove a task from its c
move it into a new cgroup (possibly the root c
new cgroup's tasks file.

Note: Due to some restrictions enforced by som
a process to another cgroup can fail.

2.3 Mounting hierarchies by name
--------------------------------

Passing the name=<x> option when mounting a cg
associates the given name with the hierarchy.
mounting a pre-existing hierarchy, in order to
rather than by its set of active subsystems.
nameless, or has a unique name.

The name should match [\w.-]+

When passing a name=<x> option for a new hiera
specify subsystems manually; the legacy behavi
subsystems when none are explicitly specified
you give a subsystem a name.

The name of the subsystem appears as part of t
in /proc/mounts and /proc/<pid>/cgroup.


3. Kernel API
=============

3.1 Overview
------------

Each kernel subsystem that wants to hook into
system needs to create a cgroup_subsys object.
various methods, which are callbacks from the
with a subsystem ID which will be assigned by

Other fields in the cgroup_subsys object inclu

- subsys_id: a unique array index for the subs
  entry in cgroup->subsys[] this subsystem sho

- name: should be initialized to a unique subs
  no longer than MAX_CGROUP_TYPE_NAMELEN.

- early_init: indicate if the subsystem needs
  at system boot.

Each cgroup object created by the system has a
indexed by subsystem ID; this pointer is entir
subsystem; the generic cgroup code will never

3.2 Synchronization
-------------------

There is a global mutex, cgroup_mutex, used by
system. This should be taken by anything that
cgroup. It may also be taken to prevent cgroup
modified, but more specific locks may be more
situation.

See kernel/cgroup.c for more details.

Subsystems can take/release the cgroup_mutex v
cgroup_lock()/cgroup_unlock().

Accessing a task's cgroup pointer may be done
- while holding cgroup_mutex
- while holding the task's alloc_lock (via tas
- inside an rcu_read_lock() section via rcu_de

3.3 Subsystem API
-----------------

Each subsystem should:

- add an entry in linux/cgroup_subsys.h
- define a cgroup_subsys object called <name>_

Each subsystem may export the following method
methods are css_alloc/free. Any others that ar
be successful no-ops.

``struct cgroup_subsys_state *css_alloc(struct
(cgroup_mutex held by caller)

Called to allocate a subsystem state object fo
subsystem should allocate its subsystem state
cgroup, returning a pointer to the new object
ERR_PTR() value. On success, the subsystem poi
a structure of type cgroup_subsys_state (typic
larger subsystem-specific object), which will
cgroup system.
Note that this will be called a
create the root subsystem state for this subsy
identified by the passed cgroup object having
it's the root of the hierarchy) and may be an
initialization code.

``int css_online(struct cgroup *cgrp)``
(cgroup_mutex held by caller)

Called after @cgrp successfully completed all
visible to cgroup_for_each_child/descendant_*(
subsystem may choose to fail creation by retur
callback can be used to implement reliable sta
propagation along the hierarchy. See the comme
cgroup_for_each_live_descendant_pre() for deta

``void css_offline(struct cgroup *cgrp);``
(cgroup_mutex held by caller)

This is the counterpart of css_online() and ca
has succeeded on @cgrp. This signifies the beg
@cgrp. @cgrp is being removed and the subsyste
all references it's holding on @cgrp. When all
cgroup removal will proceed to the next step -
callback, @cgrp should be considered dead to t

``void css_free(struct cgroup *cgrp)``
(cgroup_mutex held by caller)

The cgroup system is about to free @cgrp; the
its subsystem state object. By the time this m
is completely unused; @cgrp->parent is still v
be called for a newly-created cgroup if an err
subsystem's create() method has been called fo

``int can_attach(struct cgroup *cgrp, struct c
(cgroup_mutex held by caller)

Called prior to moving one or more tasks into
subsystem returns an error, this will abort th
@tset contains the tasks to be attached and is
least one task in it.

If there are multiple tasks in the taskset, th
  - it's guaranteed that all are from the same
  - @tset contains all tasks from the thread g
    they're switching cgroups
  - the first task is the leader

Each @tset entry also contains the task's old
aren't switching cgroup can be skipped easily
cgroup_taskset_for_each() iterator. Note that
fork.
If this method returns 0 (success) then
while the caller holds cgroup_mutex and it is
attach() or cancel_attach() will be called in

``void css_reset(struct cgroup_subsys_state *c
(cgroup_mutex held by caller)

An optional operation which should restore @cs
initial state. This is currently only used on
when a subsystem is disabled on a cgroup throu
"cgroup.subtree_control" but should remain ena
subsystems depend on it. cgroup core makes su
removing the associated interface files and in
that the hidden subsystem can return to the in
This prevents unexpected resource control from
ensures that the configuration is in the initi
visible again later.

``void cancel_attach(struct cgroup *cgrp, stru
(cgroup_mutex held by caller)

Called when a task attach operation has failed
A subsystem whose can_attach() has some side-e
function, so that the subsystem can implement
This will be called only about subsystems whos
succeeded. The parameters are identical to can

``void attach(struct cgroup *cgrp, struct cgro
(cgroup_mutex held by caller)

Called after the task has been attached to the
post-attachment activity that requires memory
The parameters are identical to can_attach().

``void fork(struct task_struct *task)``

Called when a task is forked into a cgroup.

``void exit(struct task_struct *task)``

Called during task exit.

``void free(struct task_struct *task)``

Called when the task_struct is freed.

``void bind(struct cgroup *root)``
(cgroup_mutex held by caller)

Called when a cgroup subsystem is rebound to a
and root cgroup. Currently this will only invo
the default hierarchy (which never has sub-cgr
that is being created/destroyed (and hence has

4. Extended attribute usage
===========================

cgroup filesystem supports certain types of ex
directories and files. The current supported

- Trusted (XATTR_TRUSTED)
- Security (XATTR_SECURITY)

Both require CAP_SYS_ADMIN capability to set.

Like in tmpfs, the extended attributes in cgro
using kernel memory and it's advised to keep t
is the reason why user defined extended attrib
any user can do it and there's no limit in the

The current known users for this feature are S
in containers and systemd for assorted meta da
(systemd creates a cgroup per service).

5. Questions
============

::

  Q: what's up with this '/bin/echo' ?
  A: bash's builtin 'echo' command does not ch
     errors. If you use it in the cgroup file
     able to tell whether a command succeeded

  Q: When I attach processes, only the first o
  A: We can only return one error code per cal
     put only ONE PID.
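
Both answers above point the same way: write one PID per write, and use
/bin/echo so that a failed write shows up in the exit status. A minimal
sketch of that pattern follows; the function name attach_pids is
hypothetical, and the path of the tasks file is supplied by the caller.

```shell
# attach_pids TASKS_FILE PID...
# Write each PID to TASKS_FILE with a separate write, stopping at the
# first failure so the caller can tell which PID was rejected.
# (Hypothetical helper, introduced for this sketch only.)
attach_pids() {
    tasks_file=$1
    shift
    for pid in "$@"; do
        # /bin/echo is used instead of the shell builtin so that a
        # failed write is reported through the exit status.
        if ! /bin/echo "$pid" > "$tasks_file"; then
            echo "failed to attach PID $pid" >&2
            return 1
        fi
    done
    return 0
}
```

Run from inside a cgroup directory, this could be invoked as
``attach_pids tasks PID1 PID2``, with real PIDs in place of the
placeholders.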