1 ========================== 2 Extensible Scheduler Class 3 ========================== 4 5 sched_ext is a scheduler class whose behavior can be defined by a set of BPF 6 programs - the BPF scheduler. 7 8 * sched_ext exports a full scheduling interface so that any scheduling 9 algorithm can be implemented on top. 10 11 * The BPF scheduler can group CPUs however it sees fit and schedule them 12 together, as tasks aren't tied to specific CPUs at the time of wakeup. 13 14 * The BPF scheduler can be turned on and off dynamically anytime. 15 16 * The system integrity is maintained no matter what the BPF scheduler does. 17 The default scheduling behavior is restored anytime an error is detected, 18 a runnable task stalls, or on invoking the SysRq key sequence 19 :kbd:`SysRq-S`. 20 21 * When the BPF scheduler triggers an error, debug information is dumped to 22 aid debugging. The debug dump is passed to and printed out by the 23 scheduler binary. The debug dump can also be accessed through the 24 `sched_ext_dump` tracepoint. The SysRq key sequence :kbd:`SysRq-D` 25 triggers a debug dump. This doesn't terminate the BPF scheduler and can 26 only be read through the tracepoint. 27 28 Switching to and from sched_ext 29 =============================== 30 31 ``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and 32 ``tools/sched_ext`` contains the example schedulers. The following config 33 options should be enabled to use sched_ext: 34 35 .. code-block:: none 36 37 CONFIG_BPF=y 38 CONFIG_SCHED_CLASS_EXT=y 39 CONFIG_BPF_SYSCALL=y 40 CONFIG_BPF_JIT=y 41 CONFIG_DEBUG_INFO_BTF=y 42 CONFIG_BPF_JIT_ALWAYS_ON=y 43 CONFIG_BPF_JIT_DEFAULT_ON=y 44 CONFIG_PAHOLE_HAS_SPLIT_BTF=y 45 CONFIG_PAHOLE_HAS_BTF_TAG=y 46 47 sched_ext is used only when the BPF scheduler is loaded and running. 48 49 If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be 50 treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is 51 loaded. 52 53 When the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is not set 54 in ``ops->flags``, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE``, and 55 ``SCHED_EXT`` tasks are scheduled by sched_ext. 56 57 However, when the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is 58 set in ``ops->flags``, only tasks with the ``SCHED_EXT`` policy are scheduled 59 by sched_ext, while tasks with ``SCHED_NORMAL``, ``SCHED_BATCH`` and 60 ``SCHED_IDLE`` policies are scheduled by CFS. 61 62 Terminating the sched_ext scheduler program, triggering :kbd:`SysRq-S`, or 63 detection of any internal error including stalled runnable tasks aborts the 64 BPF scheduler and reverts all tasks back to CFS. 65 66 .. code-block:: none 67 68 # make -j16 -C tools/sched_ext 69 # tools/sched_ext/build/bin/scx_simple 70 local=0 global=3 71 local=5 global=24 72 local=9 global=44 73 local=13 global=56 74 local=17 global=72 75 ^CEXIT: BPF scheduler unregistered 76 77 The current status of the BPF scheduler can be determined as follows: 78 79 .. code-block:: none 80 81 # cat /sys/kernel/sched_ext/state 82 enabled 83 # cat /sys/kernel/sched_ext/root/ops 84 simple 85 86 You can check if any BPF scheduler has ever been loaded since boot by examining 87 this monotonically incrementing counter (a value of zero indicates that no BPF 88 scheduler has been loaded): 89 90 .. code-block:: none 91 92 # cat /sys/kernel/sched_ext/enable_seq 93 1 94 95 ``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more 96 detailed information: 97 98 .. code-block:: none 99 100 # tools/sched_ext/scx_show_state.py 101 ops : simple 102 enabled : 1 103 switching_all : 1 104 switched_all : 1 105 enable_state : enabled (2) 106 bypass_depth : 0 107 nr_rejected : 0 108 enable_seq : 1 109 110 If ``CONFIG_SCHED_DEBUG`` is set, whether a given task is on sched_ext can 111 be determined as follows: 112 113 .. code-block:: none 114 115 # grep ext /proc/self/sched 116 ext.enabled : 1 117 118 The Basics 119 ========== 120 121 Userspace can implement an arbitrary BPF scheduler by loading a set of BPF 122 programs that implement ``struct sched_ext_ops``. The only mandatory field 123 is ``ops.name`` which must be a valid BPF object name. All operations are 124 optional. The following modified excerpt is from 125 ``tools/sched_ext/scx_simple.bpf.c`` showing a minimal global FIFO scheduler. 126 127 .. code-block:: c 128 129 /* 130 * Decide which CPU a task should be migrated to before being 131 * enqueued (either at wakeup, fork time, or exec time). If an 132 * idle core is found by the default ops.select_cpu() implementation, 133 * then dispatch the task directly to SCX_DSQ_LOCAL and skip the 134 * ops.enqueue() callback. 135 * 136 * Note that this implementation has exactly the same behavior as the 137 * default ops.select_cpu implementation. The behavior of the scheduler 138 * would be exactly same if the implementation just didn't define the 139 * simple_select_cpu() struct_ops prog. 140 */ 141 s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, 142 s32 prev_cpu, u64 wake_flags) 143 { 144 s32 cpu; 145 /* Need to initialize or the BPF verifier will reject the program */ 146 bool direct = false; 147 148 cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct); 149 150 if (direct) 151 scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0); 152 153 return cpu; 154 } 155 156 /* 157 * Do a direct dispatch of a task to the global DSQ. This ops.enqueue() 158 * callback will only be invoked if we failed to find a core to dispatch 159 * to in ops.select_cpu() above. 160 * 161 * Note that this implementation has exactly the same behavior as the 162 * default ops.enqueue implementation, which just dispatches the task 163 * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly same 164 * if the implementation just didn't define the simple_enqueue struct_ops 165 * prog. 166 */ 167 void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags) 168 { 169 scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); 170 } 171 172 s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init) 173 { 174 /* 175 * By default, all SCHED_EXT, SCHED_OTHER, SCHED_IDLE, and 176 * SCHED_BATCH tasks should use sched_ext. 177 */ 178 return 0; 179 } 180 181 void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei) 182 { 183 exit_type = ei->type; 184 } 185 186 SEC(".struct_ops") 187 struct sched_ext_ops simple_ops = { 188 .select_cpu = (void *)simple_select_cpu, 189 .enqueue = (void *)simple_enqueue, 190 .init = (void *)simple_init, 191 .exit = (void *)simple_exit, 192 .name = "simple", 193 }; 194 195 Dispatch Queues 196 --------------- 197 198 To match the impedance between the scheduler core and the BPF scheduler, 199 sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a 200 priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``), 201 and one local dsq per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage 202 an arbitrary number of dsq's using ``scx_bpf_create_dsq()`` and 203 ``scx_bpf_destroy_dsq()``. 204 205 A CPU always executes a task from its local DSQ. A task is "dispatched" to a 206 DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's 207 local DSQ. 208 209 When a CPU is looking for the next task to run, if the local DSQ is not 210 empty, the first task is picked. Otherwise, the CPU tries to consume the 211 global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()`` 212 is invoked. 213 214 Scheduling Cycle 215 ---------------- 216 217 The following briefly shows how a waking task is scheduled and executed. 218 219 1. When a task is waking up, ``ops.select_cpu()`` is the first operation 220 invoked. This serves two purposes. First, CPU selection optimization 221 hint. Second, waking up the selected CPU if idle. 222 223 The CPU selected by ``ops.select_cpu()`` is an optimization hint and not 224 binding. The actual decision is made at the last step of scheduling. 225 However, there is a small performance gain if the CPU 226 ``ops.select_cpu()`` returns matches the CPU the task eventually runs on. 227 228 A side-effect of selecting a CPU is waking it up from idle. While a BPF 229 scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper, 230 using ``ops.select_cpu()`` judiciously can be simpler and more efficient. 231 232 A task can be immediately dispatched to a DSQ from ``ops.select_cpu()`` by 233 calling ``scx_bpf_dispatch()``. If the task is dispatched to 234 ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be dispatched to the 235 local DSQ of whichever CPU is returned from ``ops.select_cpu()``. 236 Additionally, dispatching directly from ``ops.select_cpu()`` will cause the 237 ``ops.enqueue()`` callback to be skipped. 238 239 Note that the scheduler core will ignore an invalid CPU selection, for 240 example, if it's outside the allowed cpumask of the task. 241 242 2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the 243 task was dispatched directly from ``ops.select_cpu()``). ``ops.enqueue()`` 244 can make one of the following decisions: 245 246 * Immediately dispatch the task to either the global or local DSQ by 247 calling ``scx_bpf_dispatch()`` with ``SCX_DSQ_GLOBAL`` or 248 ``SCX_DSQ_LOCAL``, respectively. 249 250 * Immediately dispatch the task to a custom DSQ by calling 251 ``scx_bpf_dispatch()`` with a DSQ ID which is smaller than 2^63. 252 253 * Queue the task on the BPF side. 254 255 3. When a CPU is ready to schedule, it first looks at its local DSQ. If 256 empty, it then looks at the global DSQ. If there still isn't a task to 257 run, ``ops.dispatch()`` is invoked which can use the following two 258 functions to populate the local DSQ. 259 260 * ``scx_bpf_dispatch()`` dispatches a task to a DSQ. Any target DSQ can 261 be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``, 262 ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dispatch()`` 263 currently can't be called with BPF locks held, this is being worked on 264 and will be supported. ``scx_bpf_dispatch()`` schedules dispatching 265 rather than performing them immediately. There can be up to 266 ``ops.dispatch_max_batch`` pending tasks. 267 268 * ``scx_bpf_consume()`` tranfers a task from the specified non-local DSQ 269 to the dispatching DSQ. This function cannot be called with any BPF 270 locks held. ``scx_bpf_consume()`` flushes the pending dispatched tasks 271 before trying to consume the specified DSQ. 272 273 4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ, 274 the CPU runs the first one. If empty, the following steps are taken: 275 276 * Try to consume the global DSQ. If successful, run the task. 277 278 * If ``ops.dispatch()`` has dispatched any tasks, retry #3. 279 280 * If the previous task is an SCX task and still runnable, keep executing 281 it (see ``SCX_OPS_ENQ_LAST``). 282 283 * Go idle. 284 285 Note that the BPF scheduler can always choose to dispatch tasks immediately 286 in ``ops.enqueue()`` as illustrated in the above simple example. If only the 287 built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as 288 a task is never queued on the BPF scheduler and both the local and global 289 DSQs are consumed automatically. 290 291 ``scx_bpf_dispatch()`` queues the task on the FIFO of the target DSQ. Use 292 ``scx_bpf_dispatch_vtime()`` for the priority queue. Internal DSQs such as 293 ``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue 294 dispatching, and must be dispatched to with ``scx_bpf_dispatch()``. See the 295 function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c`` for 296 more information. 297 298 Where to Look 299 ============= 300 301 * ``include/linux/sched/ext.h`` defines the core data structures, ops table 302 and constants. 303 304 * ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers. 305 The functions prefixed with ``scx_bpf_`` can be called from the BPF 306 scheduler. 307 308 * ``tools/sched_ext/`` hosts example BPF scheduler implementations. 309 310 * ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a 311 custom DSQ. 312 313 * ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five 314 levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``. 315 316 ABI Instability 317 =============== 318 319 The APIs provided by sched_ext to BPF schedulers programs have no stability 320 guarantees. This includes the ops table callbacks and constants defined in 321 ``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in 322 ``kernel/sched/ext.c``. 323 324 While we will attempt to provide a relatively stable API surface when 325 possible, they are subject to change without warning between kernel 326 versions.
Linux® is a registered trademark of Linus Torvalds in the United States and other countries.
TOMOYO® is a registered trademark of NTT DATA CORPORATION.