1 .. _sched_design_CFS: 2 3 ============= 4 CFS Scheduler 5 ============= 6 7 8 1. OVERVIEW 9 ============ 10 11 CFS stands for "Completely Fair Scheduler," an 12 scheduler implemented by Ingo Molnar and merge 13 originally merged, it was the replacement for 14 scheduler's SCHED_OTHER interactivity code. No 15 for EEVDF, for which documentation can be foun 16 Documentation/scheduler/sched-eevdf.rst. 17 18 80% of CFS's design can be summed up in a sing 19 an "ideal, precise multi-tasking CPU" on real 20 21 "Ideal multi-tasking CPU" is a (non-existent 22 power and which can run each task at precise e 23 1/nr_running speed. For example: if there are 24 each at 50% physical power --- i.e., actually 25 26 On real hardware, we can run only a single tas 27 introduce the concept of "virtual runtime." T 28 specifies when its next timeslice would start 29 multi-tasking CPU described above. In practic 30 is its actual runtime normalized to the total 31 32 33 34 2. FEW IMPLEMENTATION DETAILS 35 ============================== 36 37 In CFS the virtual runtime is expressed and tr 38 p->se.vruntime (nanosec-unit) value. This way 39 timestamp and measure the "expected CPU time" 40 41 Small detail: on "ideal" hardware, at any t 42 p->se.vruntime value --- i.e., tasks would 43 would ever get "out of balance" from the "i 44 45 CFS's task picking logic is based on this p->s 46 very simple: it always tries to run the task w 47 value (i.e., the task which executed least so 48 up CPU time between runnable tasks as close to 49 possible. 50 51 Most of the rest of CFS's design just falls ou 52 with a few add-on embellishments like nice lev 53 algorithm variants to recognize sleepers. 54 55 56 57 3. THE RBTREE 58 ============== 59 60 CFS's design is quite radical: it does not use 61 runqueues, but it uses a time-ordered rbtree t 62 task execution, and thus has no "array switch" 63 previous vanilla scheduler and RSDL/SD are aff 64 65 CFS also maintains the rq->cfs.min_vruntime va 66 increasing value tracking the smallest vruntim 67 runqueue. The total amount of work done by th 68 min_vruntime; that value is used to place newl 69 side of the tree as much as possible. 70 71 The total number of running tasks in the runqu 72 rq->cfs.load value, which is the sum of the we 73 runqueue. 74 75 CFS maintains a time-ordered rbtree, where all 76 p->se.vruntime key. CFS picks the "leftmost" t 77 As the system progresses forwards, the execute 78 more and more to the right --- slowly but sure 79 to become the "leftmost task" and thus get on 80 amount of time. 81 82 Summing up, CFS works like this: it runs a tas 83 schedules (or a scheduler tick happens) the ta 84 for": the (small) time it just spent using the 85 p->se.vruntime. Once p->se.vruntime gets high 86 becomes the "leftmost task" of the time-ordere 87 small amount of "granularity" distance relativ 88 do not over-schedule tasks and trash the cache 89 picked and the current task is preempted. 90 91 92 93 4. SOME FEATURES OF CFS 94 ======================== 95 96 CFS uses nanosecond granularity accounting and 97 other HZ detail. Thus the CFS scheduler has n 98 way the previous scheduler had, and has no heu 99 only one central tunable (you have to switch o 100 101 /sys/kernel/debug/sched/base_slice_ns 102 103 which can be used to tune the scheduler from " 104 "server" (i.e., good batching) workloads. It 105 for desktop workloads. SCHED_BATCH is handled 106 107 In case CONFIG_HZ results in base_slice_ns < T 108 base_slice_ns will have little to no impact on 109 110 Due to its design, the CFS scheduler is not pr 111 exist today against the heuristics of the stoc 112 chew.c, ring-test.c, massive_intr.c all work f 113 interactivity and produce the expected behavio 114 115 The CFS scheduler has a much stronger handling 116 than the previous vanilla scheduler: both type 117 more aggressively. 118 119 SMP load-balancing has been reworked/sanitized 120 assumptions are gone from the load-balancing c 121 scheduling modules are used. The balancing co 122 result. 123 124 125 126 5. Scheduling policies 127 ====================== 128 129 CFS implements three scheduling policies: 130 131 - SCHED_NORMAL (traditionally called SCHED_O 132 policy that is used for regular tasks. 133 134 - SCHED_BATCH: Does not preempt nearly as of 135 would, thereby allowing tasks to run longe 136 caches but at the cost of interactivity. T 137 batch jobs. 138 139 - SCHED_IDLE: This is even weaker than nice 140 idle timer scheduler in order to avoid to 141 inversion problems which would deadlock th 142 143 SCHED_FIFO/_RR are implemented in sched/rt.c a 144 POSIX. 145 146 The command chrt from util-linux-ng 2.13.1.1 c 147 SCHED_IDLE. 148 149 150 151 6. SCHEDULING CLASSES 152 ====================== 153 154 The new CFS scheduler has been designed in suc 155 Classes," an extensible hierarchy of scheduler 156 encapsulate scheduling policy details and are 157 without the core code assuming too much about 158 159 sched/fair.c implements the CFS scheduler desc 160 161 sched/rt.c implements SCHED_FIFO and SCHED_RR 162 the previous vanilla scheduler did. It uses 1 163 priority levels, instead of 140 in the previou 164 expired array. 165 166 Scheduling classes are implemented through the 167 contains hooks to functions that must be calle 168 occurs. 169 170 This is the (partial) list of the hooks: 171 172 - enqueue_task(...) 173 174 Called when a task enters a runnable state. 175 It puts the scheduling entity (task) into t 176 increments the nr_running variable. 177 178 - dequeue_task(...) 179 180 When a task is no longer runnable, this fun 181 corresponding scheduling entity out of the 182 the nr_running variable. 183 184 - yield_task(...) 185 186 This function is basically just a dequeue f 187 compat_yield sysctl is turned on; in that c 188 entity at the right-most end of the red-bla 189 190 - wakeup_preempt(...) 191 192 This function checks if a task that entered 193 preempt the currently running task. 194 195 - pick_next_task(...) 196 197 This function chooses the most appropriate 198 199 - set_next_task(...) 200 201 This function is called when a task changes 202 its task group or is scheduled. 203 204 - task_tick(...) 205 206 This function is mostly called from time ti 207 process switch. This drives the running pr 208 209 210 211 212 7. GROUP SCHEDULER EXTENSIONS TO CFS 213 ===================================== 214 215 Normally, the scheduler operates on individual 216 fair CPU time to each task. Sometimes, it may 217 provide fair CPU time to each such task group. 218 desirable to first provide fair CPU time to ea 219 each task belonging to a user. 220 221 CONFIG_CGROUP_SCHED strives to achieve exactly 222 grouped and divides CPU time fairly among such 223 224 CONFIG_RT_GROUP_SCHED permits to group real-ti 225 SCHED_RR) tasks. 226 227 CONFIG_FAIR_GROUP_SCHED permits to group CFS ( 228 SCHED_BATCH) tasks. 229 230 These options need CONFIG_CGROUPS to be def 231 create arbitrary groups of tasks, using the 232 Documentation/admin-guide/cgroup-v1/cgroups 233 234 When CONFIG_FAIR_GROUP_SCHED is defined, a "cp 235 group created using the pseudo filesystem. Se 236 task groups and modify their CPU share using t 237 238 # mount -t tmpfs cgroup_root /sys/fs/c 239 # mkdir /sys/fs/cgroup/cpu 240 # mount -t cgroup -ocpu none /sys/fs/c 241 # cd /sys/fs/cgroup/cpu 242 243 # mkdir multimedia # create "mult 244 # mkdir browser # create "brow 245 246 # #Configure the multimedia group to r 247 # #that of browser group 248 249 # echo 2048 > multimedia/cpu.shares 250 # echo 1024 > browser/cpu.shares 251 252 # firefox & # Launch firefox and m 253 # echo <firefox_pid> > browser/tasks 254 255 # #Launch gmplayer (or your favourite 256 # echo <movie_player_pid> > multimedia
Linux® is a registered trademark of Linus Torvalds in the United States and other countries.
TOMOYO® is a registered trademark of NTT DATA CORPORATION.