cpusets.rst


Differences between /Documentation/admin-guide/cgroup-v1/cpusets.rst (Version linux-6.11.5) and /Documentation/admin-guide/cgroup-v1/cpusets.rst (Version linux-4.19.322)

  1 .. _cpusets:                                      
  2                                                   
  3 =======                                           
  4 CPUSETS                                           
  5 =======                                           
  6                                                   
  7 Copyright (C) 2004 BULL SA.                       
  8                                                   
  9 Written by Simon.Derr@bull.net                    
 10                                                   
 11 - Portions Copyright (c) 2004-2006 Silicon Gra    
 12 - Modified by Paul Jackson <pj@sgi.com>            
 13 - Modified by Christoph Lameter <cl@linux.com>     
 14 - Modified by Paul Menage <menage@google.com>      
 15 - Modified by Hidetoshi Seto <seto.hidetoshi@jp    
 16                                                   
 17 .. CONTENTS:                                      
 18                                                   
 19    1. Cpusets                                     
 20      1.1 What are cpusets ?                       
 21      1.2 Why are cpusets needed ?                 
 22      1.3 How are cpusets implemented ?            
 23      1.4 What are exclusive cpusets ?             
 24      1.5 What is memory_pressure ?                
 25      1.6 What is memory spread ?                  
 26      1.7 What is sched_load_balance ?             
 27      1.8 What is sched_relax_domain_level ?       
 28      1.9 How do I use cpusets ?                   
 29    2. Usage Examples and Syntax                   
 30      2.1 Basic Usage                              
 31      2.2 Adding/removing cpus                     
 32      2.3 Setting flags                            
 33      2.4 Attaching processes                      
 34    3. Questions                                   
 35    4. Contact                                     
 36                                                   
 37 1. Cpusets                                        
 38 ==========                                        
 39                                                   
 40 1.1 What are cpusets ?                            
 41 ----------------------                            
 42                                                   
 43 Cpusets provide a mechanism for assigning a se    
 44 Nodes to a set of tasks.   In this document "M    
 45 an on-line node that contains memory.             
 46                                                   
 47 Cpusets constrain the CPU and Memory placement    
 48 the resources within a task's current cpuset.     
 49 hierarchy visible in a virtual file system.  T    
 50 hooks, beyond what is already present, require    
 51 job placement on large systems.                   
 52                                                   
 53 Cpusets use the generic cgroup subsystem descr    
 54 Documentation/admin-guide/cgroup-v1/cgroups.rs    
 55                                                   
 56 Requests by a task, using the sched_setaffinit    
 57 include CPUs in its CPU affinity mask, and usi    
 58 set_mempolicy(2) system calls to include Memor    
 59 policy, are both filtered through that task's     
 60 CPUs or Memory Nodes not in that cpuset.  The     
 61 schedule a task on a CPU that is not allowed i    
 62 vector, and the kernel page allocator will not    
 63 node that is not allowed in the requesting tas    
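
The filtering described above amounts to intersecting the requested CPUs with the CPUs of the task's cpuset. The following is a minimal sketch of that idea, not kernel code; the helper name ``filter_affinity`` is hypothetical, and rejecting an empty result with an error is an assumption made for illustration.

```python
def filter_affinity(requested, cpuset_cpus):
    """Return the CPUs a task may actually use: requested AND cpuset."""
    allowed = set(requested) & set(cpuset_cpus)
    if not allowed:
        # Assumption for this sketch: an empty intersection is an error.
        raise ValueError("no requested CPU is in the task's cpuset")
    return allowed


# A task in a cpuset holding CPUs 0-3 asks for CPUs 0, 1, 8 and 9.
print(sorted(filter_affinity({0, 1, 8, 9}, range(0, 4))))  # prints [0, 1]
```

The same intersection applies, by analogy, to the Memory Nodes requested through mbind(2) and set_mempolicy(2).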
 64                                                   
 65 User level code may create and destroy cpusets    
 66 virtual file system, manage the attributes and    
 67 cpusets and which CPUs and Memory Nodes are as    
 68 specify and query to which cpuset a task is as    
 69 task pids assigned to a cpuset.                   
 70                                                   
 71                                                   
 72 1.2 Why are cpusets needed ?                      
 73 ----------------------------                      
 74                                                   
 75 The management of large computer systems, with    
 76 complex memory cache hierarchies and multiple     
 77 non-uniform access times (NUMA) presents addit    
 78 the efficient scheduling and memory placement     
 79                                                   
 80 Frequently more modest sized systems can be op    
 81 efficiency just by letting the operating syste    
 82 the available CPU and Memory resources amongst    
 83                                                   
 84 But larger systems, which benefit more from ca    
 85 memory placement to reduce memory access times    
 86 and which typically represent a larger investm    
 87 can benefit from explicitly placing jobs on pr    
 88 the system.                                       
 89                                                   
 90 This can be especially valuable on:               
 91                                                   
 92     * Web Servers running multiple instances o    
 93     * Servers running different applications (    
 94       and a database), or                         
 95     * NUMA systems running large HPC applicati    
 96       performance characteristics.                
 97                                                   
 98 These subsets, or "soft partitions" must be ab    
 99 adjusted, as the job mix changes, without impa    
100 executing jobs. The location of the running jo    
101 when the memory locations are changed.            
102                                                   
103 The kernel cpuset patch provides the minimum e    
104 mechanisms required to efficiently implement s    
105 leverages existing CPU and Memory Placement fa    
106 kernel to avoid any additional impact on the c    
107 memory allocator code.                            
108                                                   
109                                                   
110 1.3 How are cpusets implemented ?                 
111 ---------------------------------                 
112                                                   
113 Cpusets provide a Linux kernel mechanism to co    
114 Memory Nodes are used by a process or set of p    
115                                                   
116 The Linux kernel already has a pair of mechani    
117 CPUs a task may be scheduled (sched_setaffinit    
118 Nodes it may obtain memory (mbind, set_mempoli    
119                                                   
120 Cpusets extend these two mechanisms as follow
121                                                   
122  - Cpusets are sets of allowed CPUs and Memory    
123    kernel.                                        
124  - Each task in the system is attached to a cp    
125    in the task structure to a reference counte    
126  - Calls to sched_setaffinity are filtered to     
127    allowed in that task's cpuset.                 
128  - Calls to mbind and set_mempolicy are filter    
129    those Memory Nodes allowed in that task's c    
130  - The root cpuset contains all the system's CP
131    Nodes.
132  - For any cpuset, one can define child cpuset
133    of the parent's CPU and Memory Node resource
134  - The hierarchy of cpusets can be mounted at     
135    browsing and manipulation from user space.     
136  - A cpuset may be marked exclusive, which ens    
137    cpuset (except direct ancestors and descend    
138    any overlapping CPUs or Memory Nodes.          
139  - You can list all the tasks (by pid) attache    
140                                                   
141 The implementation of cpusets requires a few,     
142 into the rest of the kernel, none in performan    
143                                                   
144  - in init/main.c, to initialize the root cpus    
145  - in fork and exit, to attach and detach a ta    
146  - in sched_setaffinity, to mask the requested    
147    allowed in that task's cpuset.                 
148  - in sched.c migrate_live_tasks(), to keep mi    
149    the CPUs allowed by their cpuset, if possib    
150  - in the mbind and set_mempolicy system calls    
151    Memory Nodes by what's allowed in that task    
152  - in page_alloc.c, to restrict memory to allo    
153  - in vmscan.c, to restrict page recovery to t    
154                                                   
155 You should mount the "cgroup" filesystem type     
156 browsing and modifying the cpusets presently k    
157 new system calls are added for cpusets - all s    
158 modifying cpusets is via this cpuset file syst    
159                                                   
160 The /proc/<pid>/status file for each task has     
161 displaying the task's cpus_allowed (on which C    
162 and mems_allowed (on which Memory Nodes it may    
163 in the two formats seen in the following examp    
164                                                   
165   Cpus_allowed:   ffffffff,ffffffff,ffffffff,f    
166   Cpus_allowed_list:      0-127                   
167   Mems_allowed:   ffffffff,ffffffff               
168   Mems_allowed_list:      0-63                    
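
The two formats shown above carry the same information. The sketch below converts either one to a set of CPU or node numbers. It assumes the mask is written as comma-separated 32-bit hexadecimal words, most significant word first; the function names are hypothetical.

```python
def parse_mask(mask):
    """Convert a mask such as 'ffffffff,ffffffff' to a set of bit numbers."""
    # Assumes every word after the first is zero-padded to 8 hex digits,
    # so the words can simply be concatenated.
    value = int(mask.strip().replace(",", ""), 16)
    return {bit for bit in range(value.bit_length()) if (value >> bit) & 1}


def parse_list(text):
    """Convert a list such as '0-3,8' to a set of numbers."""
    ids = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list yields an empty set
        lo, _, hi = part.partition("-")
        ids.update(range(int(lo), int(hi or lo) + 1))
    return ids


# Both forms of the Mems_allowed example describe nodes 0 through 63.
assert parse_mask("ffffffff,ffffffff") == parse_list("0-63")
```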
169                                                   
170 Each cpuset is represented by a directory in t    
171 containing (on top of the standard cgroup file    
172 files describing that cpuset:                     
173                                                   
174  - cpuset.cpus: list of CPUs in that cpuset       
175  - cpuset.mems: list of Memory Nodes in that c    
176  - cpuset.memory_migrate flag: if set, move pa    
177  - cpuset.cpu_exclusive flag: is cpu placement    
178  - cpuset.mem_exclusive flag: is memory placem    
179  - cpuset.mem_hardwall flag:  is memory alloca    
180  - cpuset.memory_pressure: measure of how much    
181  - cpuset.memory_spread_page flag: if set, spr    
182  - cpuset.memory_spread_slab flag: OBSOLETE. D    
183  - cpuset.sched_load_balance flag: if set, loa    
184  - cpuset.sched_relax_domain_level: the search    
185                                                   
186 In addition, only the root cpuset has the foll    
187                                                   
188  - cpuset.memory_pressure_enabled flag: comput    
189                                                   
190 New cpusets are created using the mkdir system    
191 command.  The properties of a cpuset, such as     
192 CPUs and Memory Nodes, and attached tasks, are    
193 to the appropriate file in that cpuset's direct
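
As an illustration of the paragraph above, here is a minimal sketch that creates a cpuset directory and writes its CPU and Memory Node lists. The helper name ``create_cpuset`` and the mount point in the usage comment are assumptions; on a real system this needs a mounted cpuset hierarchy and sufficient privileges.

```python
import os


def create_cpuset(parent_dir, name, cpus, mems):
    """Create a child cpuset directory and set its cpus and mems."""
    path = os.path.join(parent_dir, name)
    os.mkdir(path)  # on a cpuset mount, mkdir creates the new cpuset
    for filename, value in (("cpuset.cpus", cpus), ("cpuset.mems", mems)):
        with open(os.path.join(path, filename), "w") as f:
            f.write(value)
    return path


# Hypothetical usage, assuming the hierarchy is mounted at this path:
#   create_cpuset("/sys/fs/cgroup/cpuset", "Charlie", "2-3", "1")
```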
194                                                   
195 The named hierarchical structure of nested cpu    
196 a large system into nested, dynamically change    
197                                                   
198 The attachment of each task, automatically inh    
199 children of that task, to a cpuset allows orga    
200 on a system into related sets of tasks such th    
201 to using the CPUs and Memory Nodes of a partic    
202 may be re-attached to any other cpuset, if all    
203 on the necessary cpuset file system directorie    
204                                                   
205 Such management of a system "in the large" int    
206 the detailed placement done on individual task    
207 using the sched_setaffinity, mbind and set_mem    
208                                                   
209 The following rules apply to each cpuset:         
210                                                   
211  - Its CPUs and Memory Nodes must be a subset     
212  - It can't be marked exclusive unless its par    
213  - If its cpu or memory is exclusive, they may    
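
The rules above can be read as a validity check on a proposed cpuset. The sketch below models only CPUs (Memory Nodes would be handled the same way) and uses hypothetical names. It reads the third rule as also forbidding overlap with an exclusive sibling, which is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Cpuset:
    cpus: frozenset
    cpu_exclusive: bool = False


def violations(child, parent, siblings):
    """Return a list of rule violations for a proposed child cpuset."""
    problems = []
    if not child.cpus <= parent.cpus:
        problems.append("cpus not a subset of parent")
    if child.cpu_exclusive and not parent.cpu_exclusive:
        problems.append("exclusive child under non-exclusive parent")
    for sibling in siblings:
        either_exclusive = child.cpu_exclusive or sibling.cpu_exclusive
        if either_exclusive and child.cpus & sibling.cpus:
            problems.append("overlaps a sibling")
            break
    return problems


parent = Cpuset(frozenset(range(8)), cpu_exclusive=True)
first = Cpuset(frozenset({0, 1}), cpu_exclusive=True)
second = Cpuset(frozenset({1, 2}))
print(violations(second, parent, [first]))  # prints ['overlaps a sibling']
```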
214                                                   
215 These rules, and the natural hierarchy of cpus    
216 enforcement of the exclusive guarantee, withou    
217 cpusets every time any of them change to ensur    
218 exclusive cpuset.  Also, the use of a Linux vi    
219 to represent the cpuset hierarchy provides for    
220 and name space for cpusets, with a minimum of     
221                                                   
222 The cpus and mems files in the root (top_cpuse    
223 read-only.  The cpus file automatically tracks    
224 cpu_online_mask using a CPU hotplug notifier,     
225 automatically tracks the value of node_states[    
226 nodes with memory--using the cpuset_track_onli    
227                                                   
228 The cpuset.effective_cpus and cpuset.effective    
229 normally read-only copies of cpuset.cpus and c    
230 respectively.  If the cpuset cgroup filesystem    
231 special "cpuset_v2_mode" option, the behavior     
232 similar to the corresponding files in cpuset v    
233 events will not change cpuset.cpus and cpuset.    
234 only affect cpuset.effective_cpus and cpuset.e    
235 the actual cpus and memory nodes that are curr    
236 See Documentation/admin-guide/cgroup-v2.rst fo    
237 cpuset v2 behavior.                               
238                                                   
239                                                   
240 1.4 What are exclusive cpusets ?                  
241 --------------------------------                  
242                                                   
243 If a cpuset is cpu or mem exclusive, no other     
244 a direct ancestor or descendant, may share any    
245 Memory Nodes.                                     
246                                                   
247 A cpuset that is cpuset.mem_exclusive *or* cpu    
248 i.e. it restricts kernel allocations for page,    
249 commonly shared by the kernel across multiple     
250 whether hardwalled or not, restrict allocation    
251 space.  This enables configuring a system so t    
252 jobs can share common kernel data, such as fil    
253 isolating each job's user allocation in its ow    
254 construct a large mem_exclusive cpuset to hold    
255 construct child, non-mem_exclusive cpusets for    
256 Only a small amount of typical kernel memory,     
257 interrupt handlers, is allowed to be taken out    
258 mem_exclusive cpuset.                             
259                                                   
260                                                   
261 1.5 What is memory_pressure ?                     
262 -----------------------------                     
263 The memory_pressure of a cpuset provides a sim    
264 of the rate that the tasks in a cpuset are att    
265 use memory on the nodes of the cpuset to satis    
266 requests.                                         
267                                                   
268 This enables batch managers monitoring jobs ru    
269 cpusets to efficiently detect what level of me    
270 is causing.                                       
271                                                   
272 This is useful both on tightly managed systems    
273 submitted jobs, which may choose to terminate     
274 are trying to use more memory than allowed on     
275 and with tightly coupled, long running, massiv    
276 computing jobs that will dramatically fail to     
277 goals if they start to use more memory than al    
278                                                   
279 This mechanism provides a very economical way     
280 to monitor a cpuset for signs of memory pressu    
281 batch manager or other user code to decide wha    
282 take action.                                      
283                                                   
284 ==>                                               
285     Unless this feature is enabled by writing     
286     /dev/cpuset/memory_pressure_enabled, the h    
287     code of __alloc_pages() for this metric re    
288     that the cpuset_memory_pressure_enabled fl    
289     systems that enable this feature will comp    
290                                                   
291 Why a per-cpuset, running average:                
292                                                   
293     Because this meter is per-cpuset, rather t    
294     the system load imposed by a batch schedul    
295     metric is sharply reduced on large systems    
296     the tasklist can be avoided on each set of    
297                                                   
298     Because this meter is a running average, i    
299     counter, a batch scheduler can detect memo    
300     single read, instead of having to read and    
301     for a period of time.                         
302                                                   
303     Because this meter is per-cpuset rather th    
304     the batch scheduler can obtain the key inf    
305     pressure in a cpuset, with a single read,     
306     query and accumulate results over all the     
307     set of tasks in the cpuset.                   
308                                                   
309 A per-cpuset simple digital filter (requires a    
310 of data per-cpuset) is kept, and updated by an    
311 cpuset, if it enters the synchronous (direct)     
312                                                   
313 A per-cpuset file provides an integer number r    
314 (half-life of 10 seconds) rate of direct page     
315 the tasks in the cpuset, in units of reclaims     
316 times 1000.                                       
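
To make the units concrete, here is a floating-point sketch of a decaying rate meter with a 10 second half-life, reporting events per second times 1000. It is not the kernel's implementation; the class and method names are hypothetical, and time is passed in explicitly so the behaviour is reproducible.

```python
class PressureMeter:
    """Exponentially decaying event-rate meter (illustrative only)."""

    def __init__(self, half_life=10.0):
        self.decay = 0.5 ** (1.0 / half_life)  # per-second decay factor
        self.value = 0.0   # filtered rate, events per second times 1000
        self.pending = 0   # events seen since the last update
        self.last = 0      # time of the last update, in whole seconds

    def _update(self, now):
        ticks = int(now) - self.last
        if ticks <= 0:
            return
        # Age the old value, then fold in the events seen since then.
        self.value *= self.decay ** ticks
        self.value += (1.0 - self.decay) * self.pending * 1000
        self.pending = 0
        self.last = int(now)

    def mark_event(self, now):
        """Record one direct reclaim event at time 'now' (seconds)."""
        self._update(now)
        self.pending += 1

    def read(self, now):
        """Return the filtered rate as an integer, as the file would."""
        self._update(now)
        return int(self.value)
```

With a steady 5 events per second the reading settles near 5000, and after 10 further seconds without events it drops to roughly half of that.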
317                                                   
318                                                   
319 1.6 What is memory spread ?                       
320 ---------------------------                       
321 There are two boolean flag files per cpuset th    
322 kernel allocates pages for the file system buf    
323 kernel data structures.  They are called 'cpus    
324 'cpuset.memory_spread_slab'.                      
325                                                   
326 If the per-cpuset boolean flag file 'cpuset.me    
327 the kernel will spread the file system buffers    
328 over all the nodes that the faulting task is a    
329 of preferring to put those pages on the node w    
330                                                   
331 If the per-cpuset boolean flag file 'cpuset.me    
332 then the kernel will spread some file system r    
333 such as for inodes and dentries evenly over al    
334 faulting task is allowed to use, instead of pr    
335 pages on the node where the task is running.      
336                                                   
337 The setting of these flags does not affect ano    
338 stack segment pages of a task.                    
339                                                   
340 By default, both kinds of memory spreading are    
341 pages are allocated on the node local to where    
342 except perhaps as modified by the task's NUMA     
343 configuration, so long as sufficient free memo    
344                                                   
345 When new cpusets are created, they inherit the    
346 of their parent.                                  
347                                                   
348 Setting memory spreading causes allocations fo    
349 or slab caches to ignore the task's NUMA mempo    
350 instead.    Tasks using mbind() or set_mempoli    
351 mempolicies will not notice any change in thes    
352 their containing task's memory spread settings    
353 is turned off, then the currently specified NU    
354 applies to memory page allocations.               
355                                                   
356 Both 'cpuset.memory_spread_page' and 'cpuset.m    
357 files.  By default they contain "0", meaning t    
358 for that cpuset.  If a "1" is written to that     
359 the named feature on.                             
360                                                   
361 The implementation is simple.                     
362                                                   
363 Setting the flag 'cpuset.memory_spread_page' t    
364 PFA_SPREAD_PAGE for each task that is in that     
365 joins that cpuset.  The page allocation calls     
366 is modified to perform an inline check for thi    
367 flag, and if set, a call to a new routine cpus    
368 returns the node to prefer for the allocation.    
369                                                   
370 Similarly, setting 'cpuset.memory_spread_slab'    
371 PFA_SPREAD_SLAB, and appropriately marked slab    
372 pages from the node returned by cpuset_mem_spr    
373                                                   
374 The cpuset_mem_spread_node() routine is also s    
375 value of a per-task rotor cpuset_mem_spread_ro    
376 node in the current task's mems_allowed to pre    
377                                                   
378 This memory placement policy is also known (in    
379 round-robin or interleave.                        
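
The round-robin selection can be sketched as follows. This is an illustration of the rotor idea only, with hypothetical names; the starting position of the rotor is an assumption.

```python
class SpreadRotor:
    """Per-task rotor that cycles through the allowed memory nodes."""

    def __init__(self, mems_allowed):
        self.nodes = sorted(mems_allowed)
        self.rotor = -1  # last node handed out; -1 means none yet

    def next_node(self):
        """Return the next allowed node after the rotor, wrapping around."""
        for node in self.nodes:
            if node > self.rotor:
                self.rotor = node
                return node
        self.rotor = self.nodes[0]
        return self.rotor


rotor = SpreadRotor({0, 2, 3})
print([rotor.next_node() for _ in range(5)])  # prints [0, 2, 3, 0, 2]
```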
380                                                   
381 This policy can provide substantial improvemen    
382 to place thread local data on the correspondin    
383 to access large file system data sets that nee    
384 the several nodes in the job's cpuset in order
385 policy, especially for jobs that might have on    
386 data set, the memory allocation across the nod    
387 can become very uneven.                           
388                                                   
389 1.7 What is sched_load_balance ?                  
390 --------------------------------                  
391                                                   
392 The kernel scheduler (kernel/sched/core.c) aut    
393 tasks.  If one CPU is underutilized, kernel co    
394 CPU will look for tasks on other more overload    
395 tasks to itself, within the constraints of suc    
396 as cpusets and sched_setaffinity.                 
397                                                   
398 The algorithmic cost of load balancing and its    
399 kernel data structures such as the task list i    
400 linearly with the number of CPUs being balance    
401 has support to partition the system's CPUs into
402 domains such that it only load balances within    
403 Each sched domain covers some subset of the CP    
404 no two sched domains overlap; some CPUs might     
405 domain and hence won't be load balanced.          
406                                                   
407 Put simply, it costs less to balance between t    
408 than one big one, but doing so means that over    
409 two domains won't be load balanced to the othe    
410                                                   
411 By default, there is one sched domain covering    
412 marked isolated using the kernel boot time "is    
413 the isolated CPUs will not participate in load    
414 have tasks running on them unless explicitly a    
415                                                   
416 This default load balancing across all CPUs is    
417 the following two situations:                     
418                                                   
419  1) On large systems, load balancing across ma    
420     If the system is managed using cpusets to     
421     on separate sets of CPUs, full load balanc    
422  2) Systems supporting realtime on some CPUs n    
423     system overhead on those CPUs, including a    
424     balancing if that is not needed.              
425                                                   
426 When the per-cpuset flag "cpuset.sched_load_ba    
427 setting), it requests that all the CPUs in tha    
428 be contained in a single sched domain, ensurin    
429 can move a task (not otherwise pinned, as by
430 from any CPU in that cpuset to any other.         
431                                                   
432 When the per-cpuset flag "cpuset.sched_load_ba    
433 scheduler will avoid load balancing across the    
434 --except-- in so far as is necessary because s    
435 has "sched_load_balance" enabled.                 
436                                                   
437 So, for example, if the top cpuset has the fla    
438 enabled, then the scheduler will have one sche    
439 CPUs, and the setting of the "cpuset.sched_loa    
440 cpusets won't matter, as we're already fully l    
441                                                   
442 Therefore in the above two situations, the top    
443 "cpuset.sched_load_balance" should be disabled    
444 child cpusets have this flag enabled.             
445                                                   
446 When doing this, you don't usually want to lea    
447 the top cpuset that might use non-trivial amou    
448 may be artificially constrained to some subset    
449 the particulars of this flag setting in descen    
450 such a task could use spare CPU cycles in some    
451 scheduler might not consider the possibility o    
452 task to that underused CPU.                       
453                                                   
454 Of course, tasks pinned to a particular CPU ca    
455 that disables "cpuset.sched_load_balance" as t    
456 else anyway.                                      
457                                                   
458 There is an impedance mismatch here, between c    
459 Cpusets are hierarchical and nest.  Sched doma    
460 overlap and each CPU is in at most one sched d    
461                                                   
462 It is necessary for sched domains to be flat b    
463 across partially overlapping sets of CPUs woul    
464 that would be beyond our understanding.  So if    
465 overlapping cpusets enables the flag 'cpuset.s    
466 form a single sched domain that is a superset     
467 a task to a CPU outside its cpuset, but the sc    
468 code might waste some compute cycles consideri    
469                                                   
470 This mismatch is why there is not a simple one    
471 between which cpusets have the flag "cpuset.sc    
472 and the sched domain configuration.  If a cpus    
473 will get balancing across all its CPUs, but if    
474 it will only be assured of no load balancing i    
475 cpuset enables the flag.                          
476                                                   
477 If two cpusets have partially overlapping 'cpu    
478 one of them has this flag enabled, then the ot    
479 tasks only partially load balanced, just on th    
480 This is just the general case of the top_cpuse    
481 paragraphs above.  In the general case, as in     
482 don't leave tasks that might use non-trivial a    
483 such partially load balanced cpusets, as they     
484 constrained to some subset of the CPUs allowed    
485 load balancing to the other CPUs.                 
486                                                   
487 CPUs in "cpuset.isolcpus" were excluded from l    
488 isolcpus= kernel boot option, and will never b    
489 of the value of "cpuset.sched_load_balance" in    
490                                                   
491 1.7.1 sched_load_balance implementation detail    
492 ----------------------------------------------    
493                                                   
494 The per-cpuset flag 'cpuset.sched_load_balance    
495 to most cpuset flags.)  When enabled for a cpu    
496 ensure that it can load balance across all the    
497 (makes sure that all the CPUs in the cpus_allo    
498 in the same sched domain.)                        
499                                                   
500 If two overlapping cpusets both have 'cpuset.s    
501 then they will be (must be) both in the same s    
502                                                   
503 If, as is the default, the top cpuset has 'cpu    
504 then by the above that means there is a single    
505 the whole system, regardless of any other cpus    
506                                                   
507 The kernel commits to user space that it will     
508 where it can.  It will pick as fine a granular    
509 domains as it can while still providing load b    
510 of CPUs allowed to a cpuset having 'cpuset.sch    
511                                                   
512 The internal kernel cpuset to scheduler interf    
513 cpuset code to the scheduler code a partition     
514 CPUs in the system. This partition is a set of    
515 as an array of struct cpumask) of CPUs, pairwi    
516 all the CPUs that must be load balanced.          
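
The partition described above can be sketched by merging the CPU sets of all load-balanced cpusets until no two of them overlap. This is a simplified model with hypothetical names, not the kernel's algorithm; it takes as input only the CPU sets of cpusets that have the flag enabled.

```python
def build_partition(balanced_cpusets):
    """Merge overlapping CPU sets into pairwise disjoint sched domains."""
    partition = []
    for cpus in balanced_cpusets:
        merged = set(cpus)
        if not merged:
            continue  # a cpuset without CPUs contributes nothing
        remaining = []
        for domain in partition:
            if domain & merged:
                merged |= domain  # overlap: must share one sched domain
            else:
                remaining.append(domain)
        remaining.append(merged)
        partition = remaining
    return partition


# CPUs 0-1 and 1-2 overlap, so they end up in one domain; 4-5 stays apart.
print([sorted(d) for d in build_partition([{0, 1}, {1, 2}, {4, 5}])])
# prints [[0, 1, 2], [4, 5]]
```

CPUs that appear in no input set belong to no element of the partition and are therefore not load balanced in this model.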
517                                                   
518 The cpuset code builds a new such partition an    
519 scheduler sched domain setup code, to have the    
520 as necessary, whenever:                           
521                                                   
522  - the 'cpuset.sched_load_balance' flag of a c    
523  - or CPUs come or go from a cpuset with this     
524  - or 'cpuset.sched_relax_domain_level' value     
525    and with this flag enabled changes,            
526  - or a cpuset with non-empty CPUs and with th    
527  - or a cpu is offlined/onlined.                  
528                                                   
529 This partition exactly defines what sched doma    
530 setup - one sched domain for each element (str    
531 partition.                                        
532                                                   
533 The scheduler remembers the currently active s    
534 When the scheduler routine partition_sched_dom    
535 the cpuset code to update these sched domains,    
536 partition requested with the current, and upda    
537 removing the old and adding the new, for each     
538                                                   
539                                                   
540 1.8 What is sched_relax_domain_level ?            
541 --------------------------------------            
542                                                   
543 In sched domain, the scheduler migrates tasks     
544 balance on tick, and at time of some schedule     
545                                                   
546 When a task is woken up, the scheduler tries to move
547 For example, if a task A running on CPU X acti
548 on the same CPU X, and if CPU Y is X's sibling
549 then the scheduler migrates task B to CPU Y so that
550 CPU Y without waiting for task A on CPU X.
551
552 And if a CPU runs out of tasks in its runqueue,
553 extra tasks from other busy CPUs to help them     
554 be idle.                                          
555                                                   
556 Of course it takes some searching cost to find    
557 idle CPUs, the scheduler might not search all     
558 every time.  In fact, in some architectures, t    
559 events are limited in the same socket or node     
560 while the load balance on tick searches all.      
561                                                   
562 For example, assume CPU Z is relatively far fr    
563 is idle while CPU X and the siblings are busy,    
564 woken task B from X to Z since it is out of it    
565 As a result, task B on CPU X needs to wait ta
566 on the next tick.  For some applications in sp    
567 1 tick may be too long.                           
568                                                   
569 The 'cpuset.sched_relax_domain_level' file all    
570 this searching range as you like.  This file t    
571 indicates size of searching range in levels ap    
572 otherwise initial value -1 that indicates the     
573                                                   
574 ====== =======================================    
575   -1   no request. use system default or follo    
576    0   no search.                                 
577    1   search siblings (hyperthreads in a core    
578    2   search cores in a package.                 
579    3   search cpus in a node [= system wide on    
580    4   search nodes in a chunk of node [on NUM    
581    5   search system wide [on NUMA system]        
582 ====== =======================================    
583                                                   
584 Not all levels can be present and values can c    
585 system architecture and kernel configuration.     
586 /sys/kernel/debug/sched/domains/cpu*/domain*/     
587 details.                                          
588                                                   
589 The system default is architecture dependent.     
590 can be changed using the relax_domain_level= b    
591                                                   
592 This file is per-cpuset and affects the sched d
593 belongs to.  Therefore if the flag 'cpuset.sch
594 is disabled, then 'cpuset.sched_relax_domain_l
595 there is no sched domain belonging to the cpuset.
596                                                   
597 If multiple cpusets are overlapping and hence     
598 domain, the largest value among those is used.    
599 requests 0 and others are -1 then 0 is used.      
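
As a minimal sketch of the rule above, the value applied to a shared sched
domain can be modelled as the largest of the values requested by the
overlapping cpusets. The function name effective_relax_level is hypothetical
and only illustrates the arithmetic; the kernel performs this merge
internally.

```shell
# Hypothetical helper: print the level that would apply to a sched domain
# shared by several cpusets, given each cpuset's requested level.
effective_relax_level() {
    _max=-1                      # -1 means "no request"
    for _level in "$@"; do
        if [ "$_level" -gt "$_max" ]; then
            _max=$_level         # the largest request wins
        fi
    done
    echo "$_max"
}
```

For instance, "effective_relax_level -1 0 -1" prints 0, which is the case
described above where one cpuset requests 0 and the others are left at -1.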
600                                                   
601 Note that modifying this file will have both g    
602 and whether it is acceptable or not depends on    
603 Don't modify this file if you are not sure.       
604                                                   
605 If your situation is:                             
606                                                   
607  - The migration costs between each cpu can be    
608    small (for you) due to your special applicat
609    special hardware support for CPU cache etc.
610  - The searching cost doesn't have impact (for
611    the searching cost small enough by managing
612  - The latency is required even if it sacrifices
613    then increasing 'sched_relax_domain_level'     
614                                                   
615                                                   
616 1.9 How do I use cpusets ?                        
617 --------------------------                        
618                                                   
619 In order to minimize the impact of cpusets on     
620 code, such as the scheduler, and due to the fa    
621 does not support one task updating the memory     
622 task directly, the impact on a task of changin    
623 or Memory Node placement, or of changing to wh    
624 is attached, is subtle.                           
625                                                   
626 If a cpuset has its Memory Nodes modified, the    
627 to that cpuset, the next time that the kernel     
628 a page of memory for that task, the kernel wil    
629 in the task's cpuset, and update its per-task     
630 remain within the new cpuset's memory placement
631 mempolicy MPOL_BIND, and the nodes to which it    
632 its new cpuset, then the task will continue to    
633 of MPOL_BIND nodes are still allowed in the ne    
634 was using MPOL_BIND and now none of its MPOL_B    
635 in the new cpuset, then the task will be essen    
636 was MPOL_BIND bound to the new cpuset (even th    
637 as queried by get_mempolicy(), doesn't change)    
638 from one cpuset to another, then the kernel wi    
639 memory placement, as above, the next time that    
640 to allocate a page of memory for that task.       
641                                                   
642 If a cpuset has its 'cpuset.cpus' modified, th    
643 will have its allowed CPU placement changed im    
644 if a task's pid is written to another cpuset's    
645 allowed CPU placement is changed immediately.     
646 bound to some subset of its cpuset using the s    
647 the task will be allowed to run on any CPU all    
648 negating the effect of the prior sched_setaffi    
649                                                   
650 In summary, the memory placement of a task who    
651 updated by the kernel, on the next allocation     
652 and the processor placement is updated immedia    
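
One way to inspect what the kernel currently reports for a task is to read
the Cpus_allowed_list and Mems_allowed_list fields of its /proc status file.
The helper below is only a sketch: the name allowed_list is hypothetical, and
the optional second argument is an assumption added so that the function can
be pointed at another task's status file.

```shell
# Hypothetical helper: print the "Cpus" or "Mems" allowed list found in a
# /proc/<pid>/status style file.
#   $1: "Cpus" or "Mems"
#   $2: status file to read (defaults to /proc/self/status)
allowed_list() {
    sed -n "s/^${1}_allowed_list:[[:space:]]*//p" "${2:-/proc/self/status}"
}
```

For example, "allowed_list Cpus /proc/1234/status" prints the CPU list for
the task with pid 1234 (a placeholder pid).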
653                                                   
654 Normally, once a page is allocated (given a ph    
655 of main memory) then that page stays on whatev    
656 was allocated, so long as it remains allocated    
657 cpuset's memory placement policy 'cpuset.mems'
658 If the cpuset flag file 'cpuset.memory_migrate    
659 tasks are attached to that cpuset, any pages t    
660 allocated to it on nodes in its previous cpuse    
661 to the task's new cpuset. The relative placeme    
662 the cpuset is preserved during these migration    
663 For example if the page was on the second vali    
664 then the page will be placed on the second val    
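
The positional rule in the example above can be sketched as follows. This is
an illustration only, not the kernel's migration code: the function name
remap_node is hypothetical, node lists are passed as space-separated
numbers, and the case where the new list is shorter than the old one is
simply reported as a failure rather than modelled.

```shell
# Hypothetical helper: map a node to the node at the same position in the
# new list.
#   $1: node the page currently resides on
#   $2: nodes of the prior cpuset, space-separated, in ascending order
#   $3: nodes of the new cpuset, space-separated, in ascending order
remap_node() {
    _pos=0
    _idx=0
    for _n in $2; do
        _pos=$((_pos + 1))
        if [ "$_n" = "$1" ]; then
            _idx=$_pos           # position of the page's node in the old list
            break
        fi
    done
    # The page was not on a node of the prior cpuset: nothing to map.
    [ "$_idx" -gt 0 ] || return 1
    _pos=0
    for _n in $3; do
        _pos=$((_pos + 1))
        if [ "$_pos" -eq "$_idx" ]; then
            echo "$_n"           # node at the same position in the new list
            return 0
        fi
    done
    return 1                     # new list has no node at that position
}
```

With a prior list of "4 5 6" and a new list of "0 1 2", a page on node 5
(the second node) maps to node 1 (also the second node).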
665                                                   
666 Also if 'cpuset.memory_migrate' is set true, t    
667 'cpuset.mems' file is modified, pages allocate    
668 cpuset, that were on nodes in the previous set    
669 will be moved to nodes in the new setting of '    
670 Pages that were not in the task's prior cpuset    
671 prior 'cpuset.mems' setting, will not be moved    
672                                                   
673 There is an exception to the above.  If hotplu    
674 to remove all the CPUs that are currently assi    
675 then all the tasks in that cpuset will be move    
676 with non-empty cpus.  But the moving of some (    
677 cpuset is bound with another cgroup subsystem     
678 on task attaching.  In this failing case, thos    
679 in the original cpuset, and the kernel will au    
680 their cpus_allowed to allow all online CPUs.      
681 functionality for removing Memory Nodes is ava    
682 is expected to apply there as well.  In genera    
683 violate cpuset placement, over starving a task    
684 its allowed CPUs or Memory Nodes taken offline    
685                                                   
686 There is a second exception to the above.  GFP    
687 kernel internal allocations that must be satis    
688 The kernel may drop some request, in rare case    
689 GFP_ATOMIC alloc fails.  If the request cannot    
690 the current task's cpuset, then we relax the c    
691 memory anywhere we can find it.  It's better t    
692 than stress the kernel.                           
693                                                   
694 To start a new job that is to be contained wit    
695                                                   
696  1) mkdir /sys/fs/cgroup/cpuset                   
697  2) mount -t cgroup -ocpuset cpuset /sys/fs/cg    
698  3) Create the new cpuset by doing mkdir's and    
699     the /sys/fs/cgroup/cpuset virtual file sys    
700  4) Start a task that will be the "founding fa    
701  5) Attach that task to the new cpuset by writ    
702     /sys/fs/cgroup/cpuset tasks file for that     
703  6) fork, exec or clone the job tasks from thi    
704                                                   
705 For example, the following sequence of command    
706 named "Charlie", containing just CPUs 2 and 3,    
707 and then start a subshell 'sh' in that cpuset:    
708                                                   
709   mount -t cgroup -ocpuset cpuset /sys/fs/cgro    
710   cd /sys/fs/cgroup/cpuset                        
711   mkdir Charlie                                   
712   cd Charlie                                      
713   /bin/echo 2-3 > cpuset.cpus                     
714   /bin/echo 1 > cpuset.mems                       
715   /bin/echo $$ > tasks                            
716   sh                                              
717   # The subshell 'sh' is now running in cpuset    
718   # The next line should display '/Charlie'       
719   cat /proc/self/cpuset                           
720                                                   
721 There are ways to query or modify cpusets:        
722                                                   
723  - via the cpuset file system directly, using     
724    cat, rmdir commands from the shell, or thei    
725  - via the C library libcpuset.                   
726  - via the C library libcgroup.                   
727    (https://github.com/libcgroup/libcgroup/)      
728  - via the python application cset.               
729    (http://code.google.com/p/cpuset/)             
730                                                   
731 The sched_setaffinity calls can also be done a    
732 SGI's runon or Robert Love's taskset.  The mbi    
733 calls can be done at the shell prompt using th    
734 (part of Andi Kleen's numa package).              
735                                                   
736 2. Usage Examples and Syntax                      
737 ============================                      
738                                                   
739 2.1 Basic Usage                                   
740 ---------------                                   
741                                                   
742 Creating, modifying, using the cpusets can be     
743 virtual filesystem.                               
744                                                   
745 To mount it, type:                                
746 # mount -t cgroup -o cpuset cpuset /sys/fs/cgr    
747                                                   
748 Then under /sys/fs/cgroup/cpuset you can find     
749 tree of the cpusets in the system. For instanc    
750 is the cpuset that holds the whole system.        
751                                                   
752 If you want to create a new cpuset under /sys/    
753                                                   
754   # cd /sys/fs/cgroup/cpuset                      
755   # mkdir my_cpuset                               
756                                                   
757 Now you want to do something with this cpuset:    
758                                                   
759   # cd my_cpuset                                  
760                                                   
761 In this directory you can find several files::    
762                                                   
763   # ls                                            
764   cgroup.clone_children  cpuset.memory_pressur    
765   cgroup.event_control   cpuset.memory_spread_    
766   cgroup.procs           cpuset.memory_spread_    
767   cpuset.cpu_exclusive   cpuset.mems              
768   cpuset.cpus            cpuset.sched_load_bal    
769   cpuset.mem_exclusive   cpuset.sched_relax_do    
770   cpuset.mem_hardwall    notify_on_release        
771   cpuset.memory_migrate  tasks                    
772                                                   
773 Reading them will give you information about t    
774 the CPUs and Memory Nodes it can use, the proc    
775 it, its properties.  By writing to these files    
776 the cpuset.                                       
777                                                   
778 Set some flags::                                  
779                                                   
780   # /bin/echo 1 > cpuset.cpu_exclusive            
781                                                   
782 Add some cpus::                                   
783                                                   
784   # /bin/echo 0-7 > cpuset.cpus                   
785                                                   
786 Add some mems::                                   
787                                                   
788   # /bin/echo 0-7 > cpuset.mems                   
789                                                   
790 Now attach your shell to this cpuset::            
791                                                   
792   # /bin/echo $$ > tasks                          
793                                                   
794 You can also create cpusets inside your cpuset    
795 directory::                                       
796                                                   
797   # mkdir my_sub_cs                               
798                                                   
799 To remove a cpuset, just use rmdir::              
800                                                   
801   # rmdir my_sub_cs                               
802                                                   
803 This will fail if the cpuset is in use (has cp    
804 processes attached).                              
805                                                   
806 Note that for legacy reasons, the "cpuset" fil    
807 wrapper around the cgroup filesystem.             
808                                                   
809 The command::                                     
810                                                   
811   mount -t cpuset X /sys/fs/cgroup/cpuset         
812                                                   
813 is equivalent to::                                
814                                                   
815   mount -t cgroup -ocpuset,noprefix X /sys/fs/    
816   echo "/sbin/cpuset_release_agent" > /sys/fs/    
817                                                   
818 2.2 Adding/removing cpus                          
819 ------------------------                          
820                                                   
821 This is the syntax to use when writing in the     
822 in cpuset directories::                           
823                                                   
824   # /bin/echo 1-4 > cpuset.cpus         -> set    
825   # /bin/echo 1,2,3,4 > cpuset.cpus     -> set    
826                                                   
827 To add a CPU to a cpuset, write the new list o    
828 CPU to be added. To add 6 to the above cpuset:    
829                                                   
830   # /bin/echo 1-4,6 > cpuset.cpus       -> set    
831                                                   
832 Similarly to remove a CPU from a cpuset, write    
833 without the CPU to be removed.                    
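
Because a write always replaces the whole list, removing one CPU means
computing the new list first. The helpers below are a sketch of that step;
the names cpulist_expand and cpulist_remove are hypothetical, and the
functions only manipulate strings, so the result still has to be written to
cpuset.cpus with /bin/echo as shown above.

```shell
# Hypothetical helper: expand a list such as "1-4,6" into "1 2 3 4 6".
cpulist_expand() {
    _out=""
    for _part in $(echo "$1" | tr ',' ' '); do
        case $_part in
            *-*)
                # A range "A-B": emit every CPU from A to B inclusive.
                _i=${_part%-*}
                _last=${_part#*-}
                while [ "$_i" -le "$_last" ]; do
                    _out="$_out $_i"
                    _i=$((_i + 1))
                done
                ;;
            *)
                _out="$_out $_part"
                ;;
        esac
    done
    echo $_out                   # unquoted on purpose: trims the leading space
}

# Hypothetical helper: print list $1 without CPU $2, comma-separated.
cpulist_remove() {
    _result=""
    for _c in $(cpulist_expand "$1"); do
        if [ "$_c" != "$2" ]; then
            _result="${_result:+$_result,}$_c"
        fi
    done
    echo "$_result"
}
```

For example, "/bin/echo $(cpulist_remove 1-4,6 3) > cpuset.cpus" writes the
list 1,2,4,6.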
834                                                   
835 To remove all the CPUs::                          
836                                                   
837   # /bin/echo "" > cpuset.cpus          -> cle    
838                                                   
839 2.3 Setting flags                                 
840 -----------------                                 
841                                                   
842 The syntax is very simple::                       
843                                                   
844   # /bin/echo 1 > cpuset.cpu_exclusive  -> set    
845   # /bin/echo 0 > cpuset.cpu_exclusive  -> uns    
846                                                   
847 2.4 Attaching processes                           
848 -----------------------                           
849                                                   
850 ::                                                
851                                                   
852   # /bin/echo PID > tasks                         
853                                                   
854 Note that it is PID, not PIDs. You can only at    
855 If you have several tasks to attach, you have     
856                                                   
857   # /bin/echo PID1 > tasks                        
858   # /bin/echo PID2 > tasks                        
859         ...                                       
860   # /bin/echo PIDn > tasks                        
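
The one-PID-per-write rule can be wrapped in a loop that checks each write
separately. This is a sketch only: the function name attach_pids is
hypothetical, and the tasks file is passed as the first argument so that the
same code works for any cpuset directory.

```shell
# Hypothetical helper: attach each PID with its own write, reporting any
# failure individually.
#   $1: path to a cpuset's tasks file
#   remaining arguments: PIDs to attach
attach_pids() {
    _tasks_file=$1
    shift
    _failed=0
    for _pid in "$@"; do
        # /bin/echo is used, rather than a shell builtin, so that a failed
        # write is reported through the exit status.
        if ! /bin/echo "$_pid" > "$_tasks_file"; then
            echo "failed to attach $_pid" >&2
            _failed=1
        fi
    done
    return "$_failed"
}
```

For example, "attach_pids /sys/fs/cgroup/cpuset/my_cpuset/tasks PID1 PID2"
performs two separate writes and returns non-zero if either one fails.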
861                                                   
862                                                   
863 3. Questions                                      
864 ============                                      
865                                                   
866 Q:                                                
867    what's up with this '/bin/echo' ?              
868                                                   
869 A:                                                
870    bash's builtin 'echo' command does not chec    
871    errors. If you use it in the cpuset file sy    
872    able to tell whether a command succeeded or    
873                                                   
874 Q:                                                
875    When I attach processes, only the first of     
876                                                   
877 A:                                                
878    We can only return one error code per call     
879    put only ONE pid.                              
880                                                   
881 4. Contact                                        
882 ==========                                        
883                                                   
884 Web: http://www.bullopensource.org/cpuset