==========================
Extensible Scheduler Class
==========================

sched_ext is a scheduler class whose behavior
programs - the BPF scheduler.

* sched_ext exports a full scheduling interfac
  algorithm can be implemented on top.

* The BPF scheduler can group CPUs however it
  together, as tasks aren't tied to specific C

* The BPF scheduler can be turned on and off d

* The system integrity is maintained no matter
  The default scheduling behavior is restored
  a runnable task stalls, or on invoking the S
  :kbd:`SysRq-S`.

* When the BPF scheduler triggers an error, de
  aid debugging. The debug dump is passed to a
  scheduler binary. The debug dump can also be
  ``sched_ext_dump`` tracepoint. The SysRq key s
  triggers a debug dump. This doesn't terminat
  only be read through the tracepoint.

Switching to and from sched_ext
===============================

``CONFIG_SCHED_CLASS_EXT`` is the config optio
``tools/sched_ext`` contains the example sched
options should be enabled to use sched_ext:

.. code-block:: none

    CONFIG_BPF=y
    CONFIG_SCHED_CLASS_EXT=y
    CONFIG_BPF_SYSCALL=y
    CONFIG_BPF_JIT=y
    CONFIG_DEBUG_INFO_BTF=y
    CONFIG_BPF_JIT_ALWAYS_ON=y
    CONFIG_BPF_JIT_DEFAULT_ON=y
    CONFIG_PAHOLE_HAS_SPLIT_BTF=y
    CONFIG_PAHOLE_HAS_BTF_TAG=y

sched_ext is used only when the BPF scheduler

If a task explicitly sets its scheduling polic
treated as ``SCHED_NORMAL`` and scheduled by C
loaded.

When the BPF scheduler is loaded and ``SCX_OPS
in ``ops->flags``, all ``SCHED_NORMAL``, ``SCH
``SCHED_EXT`` tasks are scheduled by sched_ext

However, when the BPF scheduler is loaded and
set in ``ops->flags``, only tasks with the ``S
by sched_ext, while tasks with ``SCHED_NORMAL`
``SCHED_IDLE`` policies are scheduled by CFS.
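
The policy rules above can be summarized as a small decision function. The
following is only an illustrative userspace model, not kernel code. All
``model_`` names are hypothetical, and the ``partial_switch`` parameter stands
in for the ``ops->flags`` bit whose name is cut off in the text above. The
sketch assumes that the first of the two paragraphs describes the case where
that bit is not set.

.. code-block:: c

    #include <stdbool.h>

    /* Hypothetical model types; these are not kernel definitions. */
    enum model_policy { MODEL_NORMAL, MODEL_BATCH, MODEL_IDLE, MODEL_EXT };
    enum model_class { MODEL_CLASS_CFS, MODEL_CLASS_SCX };

    /*
     * Return the class that schedules a task with the given policy.
     * partial_switch models the ops->flags bit discussed above
     * (assumption: set means "only SCHED_EXT tasks are switched").
     */
    static enum model_class model_pick_class(enum model_policy policy,
                                             bool bpf_loaded,
                                             bool partial_switch)
    {
            /* Without a loaded BPF scheduler, everything stays on CFS. */
            if (!bpf_loaded)
                    return MODEL_CLASS_CFS;

            /* Partial mode: only SCHED_EXT tasks go to sched_ext. */
            if (partial_switch)
                    return policy == MODEL_EXT ? MODEL_CLASS_SCX
                                               : MODEL_CLASS_CFS;

            /* Otherwise all four policies are scheduled by sched_ext. */
            return MODEL_CLASS_SCX;
    }

For example, ``model_pick_class(MODEL_NORMAL, true, true)`` yields
``MODEL_CLASS_CFS``, while the same call with ``partial_switch`` false yields
``MODEL_CLASS_SCX``.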

Terminating the sched_ext scheduler program, t
detection of any internal error including stal
BPF scheduler and reverts all tasks back to CF

.. code-block:: none

    # make -j16 -C tools/sched_ext
    # tools/sched_ext/build/bin/scx_simple
    local=0 global=3
    local=5 global=24
    local=9 global=44
    local=13 global=56
    local=17 global=72
    ^CEXIT: BPF scheduler unregistered

The current status of the BPF scheduler can be

.. code-block:: none

    # cat /sys/kernel/sched_ext/state
    enabled
    # cat /sys/kernel/sched_ext/root/ops
    simple

You can check if any BPF scheduler has ever be
this monotonically incrementing counter (a val
scheduler has been loaded):

.. code-block:: none

    # cat /sys/kernel/sched_ext/enable_seq
    1

``tools/sched_ext/scx_show_state.py`` is a drg
detailed information:

.. code-block:: none

    # tools/sched_ext/scx_show_state.py
    ops           : simple
    enabled       : 1
    switching_all : 1
    switched_all  : 1
    enable_state  : enabled (2)
    bypass_depth  : 0
    nr_rejected   : 0
    enable_seq    : 1

If ``CONFIG_SCHED_DEBUG`` is set, whether a gi
be determined as follows:

.. code-block:: none

    # grep ext /proc/self/sched
    ext.enabled

The Basics
==========

Userspace can implement an arbitrary BPF sched
programs that implement ``struct sched_ext_ops
is ``ops.name`` which must be a valid BPF obje
optional. The following modified excerpt is fr
``tools/sched_ext/scx_simple.bpf.c`` showing a

.. code-block:: c

    /*
     * Decide which CPU a task should be migra
     * enqueued (either at wakeup, fork time,
     * idle core is found by the default ops.s
     * then dispatch the task directly to SCX_
     * ops.enqueue() callback.
     *
     * Note that this implementation has exact
     * default ops.select_cpu implementation.
     * would be exactly same if the implementa
     * simple_select_cpu() struct_ops prog.
     */
    s32 BPF_STRUCT_OPS(simple_select_cpu, stru
                       s32 prev_cpu, u64 wake_
    {
            s32 cpu;
            /* Need to initialize or the BPF v
            bool direct = false;

            cpu = scx_bpf_select_cpu_dfl(p, pr

            if (direct)
                    scx_bpf_dispatch(p, SCX_DS

            return cpu;
    }

    /*
     * Do a direct dispatch of a task to the g
     * callback will only be invoked if we fai
     * to in ops.select_cpu() above.
     *
     * Note that this implementation has exact
     * default ops.enqueue implementation, whi
     * to SCX_DSQ_GLOBAL. The behavior of the
     * if the implementation just didn't defin
     * prog.
     */
    void BPF_STRUCT_OPS(simple_enqueue, struct
    {
            scx_bpf_dispatch(p, SCX_DSQ_GLOBAL
    }

    s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
    {
            /*
             * By default, all SCHED_EXT, SCHE
             * SCHED_BATCH tasks should use sc
             */
            return 0;
    }

    void BPF_STRUCT_OPS(simple_exit, struct sc
    {
            exit_type = ei->type;
    }

    SEC(".struct_ops")
    struct sched_ext_ops simple_ops = {
            .select_cpu             = (void *)
            .enqueue                = (void *)
            .init                   = (void *)
            .exit                   = (void *)
            .name                   = "simple"
    };

Dispatch Queues
---------------

To match the impedance between the scheduler c
sched_ext uses DSQs (dispatch queues) which ca
priority queue. By default, there is one globa
and one local DSQ per CPU (``SCX_DSQ_LOCAL``).
an arbitrary number of DSQs using ``scx_bpf_c
``scx_bpf_destroy_dsq()``.

A CPU always executes a task from its local DS
DSQ. A non-local DSQ is "consumed" to transfer
local DSQ.

When a CPU is looking for the next task to run
empty, the first task is picked. Otherwise, th
global DSQ. If that doesn't yield a runnable t
is invoked.
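
The lookup order described above (local DSQ, then global DSQ, then the
dispatch callback) can be sketched as a plain userspace model. This is not
the kernel implementation: every ``model_`` name is hypothetical, DSQs are
reduced to small FIFOs of integer task IDs, and a function pointer stands in
for ``ops.dispatch()``.

.. code-block:: c

    #include <stddef.h>

    #define MODEL_DSQ_CAP 8

    /* Hypothetical FIFO standing in for a DSQ; holds integer task IDs. */
    struct model_dsq {
            int tasks[MODEL_DSQ_CAP];
            size_t head;
            size_t len;
    };

    /* Append a task; returns 0 on success, -1 if the queue is full. */
    static int model_dsq_push(struct model_dsq *q, int task)
    {
            if (q->len == MODEL_DSQ_CAP)
                    return -1;
            q->tasks[(q->head + q->len) % MODEL_DSQ_CAP] = task;
            q->len++;
            return 0;
    }

    /* Remove and return the first task, or -1 if the queue is empty. */
    static int model_dsq_pop(struct model_dsq *q)
    {
            int task;

            if (q->len == 0)
                    return -1;
            task = q->tasks[q->head];
            q->head = (q->head + 1) % MODEL_DSQ_CAP;
            q->len--;
            return task;
    }

    /* Stand-in for ops.dispatch(): may populate the local DSQ. */
    typedef void (*model_dispatch_fn)(struct model_dsq *local);

    /* Example callback that queues one made-up task ID (99). */
    static void model_dispatch_example(struct model_dsq *local)
    {
            model_dsq_push(local, 99);
    }

    /* Pick the next task for a CPU; -1 means nothing is runnable. */
    static int model_pick_next(struct model_dsq *local,
                               struct model_dsq *global,
                               model_dispatch_fn dispatch)
    {
            int task = model_dsq_pop(local);

            /* 1. A non-empty local DSQ always wins. */
            if (task >= 0)
                    return task;

            /* 2. Consume the global DSQ: move one task to local, run it. */
            task = model_dsq_pop(global);
            if (task >= 0) {
                    model_dsq_push(local, task);
                    return model_dsq_pop(local);
            }

            /* 3. Last resort: let the dispatch callback refill local. */
            if (dispatch)
                    dispatch(local);
            return model_dsq_pop(local);
    }

With task 1 on the local queue and task 2 on the global queue, successive
calls to ``model_pick_next()`` return 1, then 2, and only then fall through
to the callback.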

Scheduling Cycle
----------------

The following briefly shows how a waking task

1. When a task is waking up, ``ops.select_cpu(
   invoked. This serves two purposes. First, C
   hint. Second, waking up the selected CPU if

   The CPU selected by ``ops.select_cpu()`` is
   binding. The actual decision is made at the
   However, there is a small performance gain
   ``ops.select_cpu()`` returns matches the CP

   A side-effect of selecting a CPU is waking
   scheduler can wake up any CPU using the ``s
   using ``ops.select_cpu()`` judiciously can

   A task can be immediately dispatched to a D
   calling ``scx_bpf_dispatch()``. If the task
   ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``
   local DSQ of whichever CPU is returned from
   Additionally, dispatching directly from ``o
   ``ops.enqueue()`` callback to be skipped.

   Note that the scheduler core will ignore an
   example, if it's outside the allowed cpumas

2. Once the target CPU is selected, ``ops.enqu
   task was dispatched directly from ``ops.sel
   can make one of the following decisions:

   * Immediately dispatch the task to either t
     calling ``scx_bpf_dispatch()`` with ``SCX
     ``SCX_DSQ_LOCAL``, respectively.

   * Immediately dispatch the task to a custom
     ``scx_bpf_dispatch()`` with a DSQ ID whic

   * Queue the task on the BPF side.

3. When a CPU is ready to schedule, it first l
   empty, it then looks at the global DSQ. If
   run, ``ops.dispatch()`` is invoked which ca
   functions to populate the local DSQ.

   * ``scx_bpf_dispatch()`` dispatches a task
     be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LO
     ``SCX_DSQ_GLOBAL`` or a custom DSQ. While
     currently can't be called with BPF locks
     and will be supported. ``scx_bpf_dispatch
     rather than performing them immediately.
     ``ops.dispatch_max_batch`` pending tasks.

   * ``scx_bpf_consume()`` transfers a task fro
     to the dispatching DSQ. This function can
     locks held. ``scx_bpf_consume()`` flushes
     before trying to consume the specified DS

4. After ``ops.dispatch()`` returns, if there
   the CPU runs the first one. If empty, the f

   * Try to consume the global DSQ. If success

   * If ``ops.dispatch()`` has dispatched any

   * If the previous task is an SCX task and s
     it (see ``SCX_OPS_ENQ_LAST``).

   * Go idle.

Note that the BPF scheduler can always choose
in ``ops.enqueue()`` as illustrated in the abo
built-in DSQs are used, there is no need to im
a task is never queued on the BPF scheduler an
DSQs are consumed automatically.

``scx_bpf_dispatch()`` queues the task on the
``scx_bpf_dispatch_vtime()`` for the priority
``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do no
dispatching, and must be dispatched to with ``
function documentation and usage in ``tools/sc
more information.

Where to Look
=============

* ``include/linux/sched/ext.h`` defines the co
  and constants.

* ``kernel/sched/ext.c`` contains sched_ext co
  The functions prefixed with ``scx_bpf_`` can
  scheduler.

* ``tools/sched_ext/`` hosts example BPF sched

  * ``scx_simple[.bpf].c``: Minimal global FIF
    custom DSQ.

  * ``scx_qmap[.bpf].c``: A multi-level FIFO s
    levels of priority implemented with ``BPF_

ABI Instability
===============

The APIs provided by sched_ext to BPF schedule
guarantees. This includes the ops table callba
``include/linux/sched/ext.h``, as well as the
``kernel/sched/ext.c``.

While we will attempt to provide a relatively
possible, they are subject to change without w
versions.