~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~

TOMOYO Linux Cross Reference
Linux/Documentation/scheduler/sched-design-CFS.rst

Version: ~ [ linux-6.12-rc7 ] ~ [ linux-6.11.7 ] ~ [ linux-6.10.14 ] ~ [ linux-6.9.12 ] ~ [ linux-6.8.12 ] ~ [ linux-6.7.12 ] ~ [ linux-6.6.60 ] ~ [ linux-6.5.13 ] ~ [ linux-6.4.16 ] ~ [ linux-6.3.13 ] ~ [ linux-6.2.16 ] ~ [ linux-6.1.116 ] ~ [ linux-6.0.19 ] ~ [ linux-5.19.17 ] ~ [ linux-5.18.19 ] ~ [ linux-5.17.15 ] ~ [ linux-5.16.20 ] ~ [ linux-5.15.171 ] ~ [ linux-5.14.21 ] ~ [ linux-5.13.19 ] ~ [ linux-5.12.19 ] ~ [ linux-5.11.22 ] ~ [ linux-5.10.229 ] ~ [ linux-5.9.16 ] ~ [ linux-5.8.18 ] ~ [ linux-5.7.19 ] ~ [ linux-5.6.19 ] ~ [ linux-5.5.19 ] ~ [ linux-5.4.285 ] ~ [ linux-5.3.18 ] ~ [ linux-5.2.21 ] ~ [ linux-5.1.21 ] ~ [ linux-5.0.21 ] ~ [ linux-4.20.17 ] ~ [ linux-4.19.323 ] ~ [ linux-4.18.20 ] ~ [ linux-4.17.19 ] ~ [ linux-4.16.18 ] ~ [ linux-4.15.18 ] ~ [ linux-4.14.336 ] ~ [ linux-4.13.16 ] ~ [ linux-4.12.14 ] ~ [ linux-4.11.12 ] ~ [ linux-4.10.17 ] ~ [ linux-4.9.337 ] ~ [ linux-4.4.302 ] ~ [ linux-3.10.108 ] ~ [ linux-2.6.32.71 ] ~ [ linux-2.6.0 ] ~ [ linux-2.4.37.11 ] ~ [ unix-v6-master ] ~ [ ccs-tools-1.8.12 ] ~ [ policy-sample ] ~
Architecture: ~ [ i386 ] ~ [ alpha ] ~ [ m68k ] ~ [ mips ] ~ [ ppc ] ~ [ sparc ] ~ [ sparc64 ] ~

Diff markup

Differences between /Documentation/scheduler/sched-design-CFS.rst (Version linux-6.12-rc7) and /Documentation/scheduler/sched-design-CFS.rst (Version linux-4.13.16)


  1 .. _sched_design_CFS:                             
  2                                                   
  3 =============                                     
  4 CFS Scheduler                                     
  5 =============                                     
  6                                                   
  7                                                   
  8 1.  OVERVIEW                                      
  9 ============                                      
 10                                                   
 11 CFS stands for "Completely Fair Scheduler," an    
 12 scheduler implemented by Ingo Molnar and merge    
 13 originally merged, it was the replacement for     
 14 scheduler's SCHED_OTHER interactivity code. No    
 15 for EEVDF, for which documentation can be foun    
 16 Documentation/scheduler/sched-eevdf.rst.          
 17                                                   
 18 80% of CFS's design can be summed up in a sing    
 19 an "ideal, precise multi-tasking CPU" on real     
 20                                                   
 21 "Ideal multi-tasking CPU" is a (non-existent      
 22 power and which can run each task at precise e    
 23 1/nr_running speed.  For example: if there are    
 24 each at 50% physical power --- i.e., actually     
 25                                                   
 26 On real hardware, we can run only a single tas    
 27 introduce the concept of "virtual runtime."  T    
 28 specifies when its next timeslice would start     
 29 multi-tasking CPU described above.  In practic    
 30 is its actual runtime normalized to the total     
 31                                                   
 32                                                   
 33                                                   
 34 2.  FEW IMPLEMENTATION DETAILS                    
 35 ==============================                    
 36                                                   
 37 In CFS the virtual runtime is expressed and tr    
 38 p->se.vruntime (nanosec-unit) value.  This way    
 39 timestamp and measure the "expected CPU time"     
 40                                                   
 41    Small detail: on "ideal" hardware, at any t    
 42    p->se.vruntime value --- i.e., tasks would     
 43    would ever get "out of balance" from the "i    
 44                                                   
 45 CFS's task picking logic is based on this p->s    
 46 very simple: it always tries to run the task w    
 47 value (i.e., the task which executed least so     
 48 up CPU time between runnable tasks as close to    
 49 possible.                                         
 50                                                   
 51 Most of the rest of CFS's design just falls ou    
 52 with a few add-on embellishments like nice lev    
 53 algorithm variants to recognize sleepers.         
 54                                                   
 55                                                   
 56                                                   
 57 3.  THE RBTREE                                    
 58 ==============                                    
 59                                                   
 60 CFS's design is quite radical: it does not use    
 61 runqueues, but it uses a time-ordered rbtree t    
 62 task execution, and thus has no "array switch"    
 63 previous vanilla scheduler and RSDL/SD are aff    
 64                                                   
 65 CFS also maintains the rq->cfs.min_vruntime va    
 66 increasing value tracking the smallest vruntim    
 67 runqueue.  The total amount of work done by th    
 68 min_vruntime; that value is used to place newl    
 69 side of the tree as much as possible.             
 70                                                   
 71 The total number of running tasks in the runqu    
 72 rq->cfs.load value, which is the sum of the we    
 73 runqueue.                                         
 74                                                   
 75 CFS maintains a time-ordered rbtree, where all    
 76 p->se.vruntime key. CFS picks the "leftmost" t    
 77 As the system progresses forwards, the execute    
 78 more and more to the right --- slowly but sure    
 79 to become the "leftmost task" and thus get on     
 80 amount of time.                                   
 81                                                   
 82 Summing up, CFS works like this: it runs a tas    
 83 schedules (or a scheduler tick happens) the ta    
 84 for": the (small) time it just spent using the    
 85 p->se.vruntime.  Once p->se.vruntime gets high    
 86 becomes the "leftmost task" of the time-ordere    
 87 small amount of "granularity" distance relativ    
 88 do not over-schedule tasks and trash the cache    
 89 picked and the current task is preempted.         
 90                                                   
 91                                                   
 92                                                   
 93 4.  SOME FEATURES OF CFS                          
 94 ========================                          
 95                                                   
 96 CFS uses nanosecond granularity accounting and    
 97 other HZ detail.  Thus the CFS scheduler has n    
 98 way the previous scheduler had, and has no heu    
 99 only one central tunable (you have to switch o    
100                                                   
101    /sys/kernel/debug/sched/base_slice_ns          
102                                                   
103 which can be used to tune the scheduler from "    
104 "server" (i.e., good batching) workloads.  It     
105 for desktop workloads.  SCHED_BATCH is handled    
106                                                   
107 In case CONFIG_HZ results in base_slice_ns < T    
108 base_slice_ns will have little to no impact on    
109                                                   
110 Due to its design, the CFS scheduler is not pr    
111 exist today against the heuristics of the stoc    
112 chew.c, ring-test.c, massive_intr.c all work f    
113 interactivity and produce the expected behavio    
114                                                   
115 The CFS scheduler has a much stronger handling    
116 than the previous vanilla scheduler: both type    
117 more aggressively.                                
118                                                   
119 SMP load-balancing has been reworked/sanitized    
120 assumptions are gone from the load-balancing c    
121 scheduling modules are used.  The balancing co    
122 result.                                           
123                                                   
124                                                   
125                                                   
126 5. Scheduling policies                            
127 ======================                            
128                                                   
129 CFS implements three scheduling policies:         
130                                                   
131   - SCHED_NORMAL (traditionally called SCHED_O    
132     policy that is used for regular tasks.        
133                                                   
134   - SCHED_BATCH: Does not preempt nearly as of    
135     would, thereby allowing tasks to run longe    
136     caches but at the cost of interactivity. T    
137     batch jobs.                                   
138                                                   
139   - SCHED_IDLE: This is even weaker than nice     
140     idle timer scheduler in order to avoid to     
141     inversion problems which would deadlock th    
142                                                   
143 SCHED_FIFO/_RR are implemented in sched/rt.c a    
144 POSIX.                                            
145                                                   
146 The command chrt from util-linux-ng 2.13.1.1 c    
147 SCHED_IDLE.                                       
148                                                   
149                                                   
150                                                   
151 6.  SCHEDULING CLASSES                            
152 ======================                            
153                                                   
154 The new CFS scheduler has been designed in suc    
155 Classes," an extensible hierarchy of scheduler    
156 encapsulate scheduling policy details and are     
157 without the core code assuming too much about     
158                                                   
159 sched/fair.c implements the CFS scheduler desc    
160                                                   
161 sched/rt.c implements SCHED_FIFO and SCHED_RR     
162 the previous vanilla scheduler did.  It uses 1    
163 priority levels, instead of 140 in the previou    
164 expired array.                                    
165                                                   
166 Scheduling classes are implemented through the    
167 contains hooks to functions that must be calle    
168 occurs.                                           
169                                                   
170 This is the (partial) list of the hooks:          
171                                                   
172  - enqueue_task(...)                              
173                                                   
174    Called when a task enters a runnable state.    
175    It puts the scheduling entity (task) into t    
176    increments the nr_running variable.            
177                                                   
178  - dequeue_task(...)                              
179                                                   
180    When a task is no longer runnable, this fun    
181    corresponding scheduling entity out of the     
182    the nr_running variable.                       
183                                                   
184  - yield_task(...)                                
185                                                   
186    This function is basically just a dequeue f    
187    compat_yield sysctl is turned on; in that c    
188    entity at the right-most end of the red-bla    
189                                                   
190  - wakeup_preempt(...)                            
191                                                   
192    This function checks if a task that entered    
193    preempt the currently running task.            
194                                                   
195  - pick_next_task(...)                            
196                                                   
197    This function chooses the most appropriate     
198                                                   
199  - set_next_task(...)                             
200                                                   
201    This function is called when a task changes    
202    its task group or is scheduled.                
203                                                   
204  - task_tick(...)                                 
205                                                   
206    This function is mostly called from time ti    
207    process switch.  This drives the running pr    
208                                                   
209                                                   
210                                                   
211                                                   
212 7.  GROUP SCHEDULER EXTENSIONS TO CFS             
213 =====================================             
214                                                   
215 Normally, the scheduler operates on individual    
216 fair CPU time to each task.  Sometimes, it may    
217 provide fair CPU time to each such task group.    
218 desirable to first provide fair CPU time to ea    
219 each task belonging to a user.                    
220                                                   
221 CONFIG_CGROUP_SCHED strives to achieve exactly    
222 grouped and divides CPU time fairly among such    
223                                                   
224 CONFIG_RT_GROUP_SCHED permits to group real-ti    
225 SCHED_RR) tasks.                                  
226                                                   
227 CONFIG_FAIR_GROUP_SCHED permits to group CFS (    
228 SCHED_BATCH) tasks.                               
229                                                   
230    These options need CONFIG_CGROUPS to be def    
231    create arbitrary groups of tasks, using the    
232    Documentation/admin-guide/cgroup-v1/cgroups    
233                                                   
234 When CONFIG_FAIR_GROUP_SCHED is defined, a "cp    
235 group created using the pseudo filesystem.  Se    
236 task groups and modify their CPU share using t    
237                                                   
238         # mount -t tmpfs cgroup_root /sys/fs/c    
239         # mkdir /sys/fs/cgroup/cpu                
240         # mount -t cgroup -ocpu none /sys/fs/c    
241         # cd /sys/fs/cgroup/cpu                   
242                                                   
243         # mkdir multimedia      # create "mult    
244         # mkdir browser         # create "brow    
245                                                   
246         # #Configure the multimedia group to r    
247         # #that of browser group                  
248                                                   
249         # echo 2048 > multimedia/cpu.shares       
250         # echo 1024 > browser/cpu.shares          
251                                                   
252         # firefox &     # Launch firefox and m    
253         # echo <firefox_pid> > browser/tasks      
254                                                   
255         # #Launch gmplayer (or your favourite     
256         # echo <movie_player_pid> > multimedia    
                                                      

~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~

kernel.org | git.kernel.org | LWN.net | Project Home | SVN repository | Mail admin

Linux® is a registered trademark of Linus Torvalds in the United States and other countries.
TOMOYO® is a registered trademark of NTT DATA CORPORATION.

sflogo.php