~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~

TOMOYO Linux Cross Reference
Linux/Documentation/scheduler/sched-ext.rst

Version: ~ [ linux-6.12-rc7 ] ~ [ linux-6.11.7 ] ~ [ linux-6.10.14 ] ~ [ linux-6.9.12 ] ~ [ linux-6.8.12 ] ~ [ linux-6.7.12 ] ~ [ linux-6.6.60 ] ~ [ linux-6.5.13 ] ~ [ linux-6.4.16 ] ~ [ linux-6.3.13 ] ~ [ linux-6.2.16 ] ~ [ linux-6.1.116 ] ~ [ linux-6.0.19 ] ~ [ linux-5.19.17 ] ~ [ linux-5.18.19 ] ~ [ linux-5.17.15 ] ~ [ linux-5.16.20 ] ~ [ linux-5.15.171 ] ~ [ linux-5.14.21 ] ~ [ linux-5.13.19 ] ~ [ linux-5.12.19 ] ~ [ linux-5.11.22 ] ~ [ linux-5.10.229 ] ~ [ linux-5.9.16 ] ~ [ linux-5.8.18 ] ~ [ linux-5.7.19 ] ~ [ linux-5.6.19 ] ~ [ linux-5.5.19 ] ~ [ linux-5.4.285 ] ~ [ linux-5.3.18 ] ~ [ linux-5.2.21 ] ~ [ linux-5.1.21 ] ~ [ linux-5.0.21 ] ~ [ linux-4.20.17 ] ~ [ linux-4.19.323 ] ~ [ linux-4.18.20 ] ~ [ linux-4.17.19 ] ~ [ linux-4.16.18 ] ~ [ linux-4.15.18 ] ~ [ linux-4.14.336 ] ~ [ linux-4.13.16 ] ~ [ linux-4.12.14 ] ~ [ linux-4.11.12 ] ~ [ linux-4.10.17 ] ~ [ linux-4.9.337 ] ~ [ linux-4.4.302 ] ~ [ linux-3.10.108 ] ~ [ linux-2.6.32.71 ] ~ [ linux-2.6.0 ] ~ [ linux-2.4.37.11 ] ~ [ unix-v6-master ] ~ [ ccs-tools-1.8.12 ] ~ [ policy-sample ] ~
Architecture: ~ [ i386 ] ~ [ alpha ] ~ [ m68k ] ~ [ mips ] ~ [ ppc ] ~ [ sparc ] ~ [ sparc64 ] ~

  1 ==========================
  2 Extensible Scheduler Class
  3 ==========================
  4 
  5 sched_ext is a scheduler class whose behavior can be defined by a set of BPF
  6 programs - the BPF scheduler.
  7 
  8 * sched_ext exports a full scheduling interface so that any scheduling
  9   algorithm can be implemented on top.
 10 
 11 * The BPF scheduler can group CPUs however it sees fit and schedule them
 12   together, as tasks aren't tied to specific CPUs at the time of wakeup.
 13 
 14 * The BPF scheduler can be turned on and off dynamically anytime.
 15 
 16 * The system integrity is maintained no matter what the BPF scheduler does.
 17   The default scheduling behavior is restored anytime an error is detected,
 18   a runnable task stalls, or on invoking the SysRq key sequence
 19   :kbd:`SysRq-S`.
 20 
 21 * When the BPF scheduler triggers an error, debug information is dumped to
 22   aid debugging. The debug dump is passed to and printed out by the
 23   scheduler binary. The debug dump can also be accessed through the
 24   `sched_ext_dump` tracepoint. The SysRq key sequence :kbd:`SysRq-D`
 25   triggers a debug dump. This doesn't terminate the BPF scheduler and can
 26   only be read through the tracepoint.
 27 
 28 Switching to and from sched_ext
 29 ===============================
 30 
 31 ``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and
 32 ``tools/sched_ext`` contains the example schedulers. The following config
 33 options should be enabled to use sched_ext:
 34 
 35 .. code-block:: none
 36 
 37     CONFIG_BPF=y
 38     CONFIG_SCHED_CLASS_EXT=y
 39     CONFIG_BPF_SYSCALL=y
 40     CONFIG_BPF_JIT=y
 41     CONFIG_DEBUG_INFO_BTF=y
 42     CONFIG_BPF_JIT_ALWAYS_ON=y
 43     CONFIG_BPF_JIT_DEFAULT_ON=y
 44     CONFIG_PAHOLE_HAS_SPLIT_BTF=y
 45     CONFIG_PAHOLE_HAS_BTF_TAG=y
 46 
 47 sched_ext is used only when the BPF scheduler is loaded and running.
 48 
 49 If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be
 50 treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is
 51 loaded.
 52 
 53 When the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is not set
 54 in ``ops->flags``, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE``, and
 55 ``SCHED_EXT`` tasks are scheduled by sched_ext.
 56 
 57 However, when the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is
 58 set in ``ops->flags``, only tasks with the ``SCHED_EXT`` policy are scheduled
 59 by sched_ext, while tasks with ``SCHED_NORMAL``, ``SCHED_BATCH`` and
 60 ``SCHED_IDLE`` policies are scheduled by CFS.
 61 
 62 Terminating the sched_ext scheduler program, triggering :kbd:`SysRq-S`, or
 63 detection of any internal error including stalled runnable tasks aborts the
 64 BPF scheduler and reverts all tasks back to CFS.
 65 
 66 .. code-block:: none
 67 
 68     # make -j16 -C tools/sched_ext
 69     # tools/sched_ext/build/bin/scx_simple
 70     local=0 global=3
 71     local=5 global=24
 72     local=9 global=44
 73     local=13 global=56
 74     local=17 global=72
 75     ^CEXIT: BPF scheduler unregistered
 76 
 77 The current status of the BPF scheduler can be determined as follows:
 78 
 79 .. code-block:: none
 80 
 81     # cat /sys/kernel/sched_ext/state
 82     enabled
 83     # cat /sys/kernel/sched_ext/root/ops
 84     simple
 85 
 86 You can check if any BPF scheduler has ever been loaded since boot by examining
 87 this monotonically incrementing counter (a value of zero indicates that no BPF
 88 scheduler has been loaded):
 89 
 90 .. code-block:: none
 91 
 92     # cat /sys/kernel/sched_ext/enable_seq
 93     1
 94 
 95 ``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more
 96 detailed information:
 97 
 98 .. code-block:: none
 99 
100     # tools/sched_ext/scx_show_state.py
101     ops           : simple
102     enabled       : 1
103     switching_all : 1
104     switched_all  : 1
105     enable_state  : enabled (2)
106     bypass_depth  : 0
107     nr_rejected   : 0
108     enable_seq    : 1
109 
110 If ``CONFIG_SCHED_DEBUG`` is set, whether a given task is on sched_ext can
111 be determined as follows:
112 
113 .. code-block:: none
114 
115     # grep ext /proc/self/sched
116     ext.enabled                                  :                    1
117 
118 The Basics
119 ==========
120 
121 Userspace can implement an arbitrary BPF scheduler by loading a set of BPF
122 programs that implement ``struct sched_ext_ops``. The only mandatory field
123 is ``ops.name`` which must be a valid BPF object name. All operations are
124 optional. The following modified excerpt is from
125 ``tools/sched_ext/scx_simple.bpf.c`` showing a minimal global FIFO scheduler.
126 
127 .. code-block:: c
128 
129     /*
130      * Decide which CPU a task should be migrated to before being
131      * enqueued (either at wakeup, fork time, or exec time). If an
132      * idle core is found by the default ops.select_cpu() implementation,
133      * then dispatch the task directly to SCX_DSQ_LOCAL and skip the
134      * ops.enqueue() callback.
135      *
136      * Note that this implementation has exactly the same behavior as the
137      * default ops.select_cpu implementation. The behavior of the scheduler
138      * would be exactly same if the implementation just didn't define the
139      * simple_select_cpu() struct_ops prog.
140      */
141     s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
142                        s32 prev_cpu, u64 wake_flags)
143     {
144             s32 cpu;
145             /* Need to initialize or the BPF verifier will reject the program */
146             bool direct = false;
147 
148             cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct);
149 
150             if (direct)
151                     scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
152 
153             return cpu;
154     }
155 
156     /*
157      * Do a direct dispatch of a task to the global DSQ. This ops.enqueue()
158      * callback will only be invoked if we failed to find a core to dispatch
159      * to in ops.select_cpu() above.
160      *
161      * Note that this implementation has exactly the same behavior as the
162      * default ops.enqueue implementation, which just dispatches the task
163      * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly same
164      * if the implementation just didn't define the simple_enqueue struct_ops
165      * prog.
166      */
167     void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
168     {
169             scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
170     }
171 
172     s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
173     {
174             /*
175              * By default, all SCHED_EXT, SCHED_OTHER, SCHED_IDLE, and
176              * SCHED_BATCH tasks should use sched_ext.
177              */
178             return 0;
179     }
180 
181     void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
182     {
183             exit_type = ei->type;
184     }
185 
186     SEC(".struct_ops")
187     struct sched_ext_ops simple_ops = {
188             .select_cpu             = (void *)simple_select_cpu,
189             .enqueue                = (void *)simple_enqueue,
190             .init                   = (void *)simple_init,
191             .exit                   = (void *)simple_exit,
192             .name                   = "simple",
193     };
194 
195 Dispatch Queues
196 ---------------
197 
198 To match the impedance between the scheduler core and the BPF scheduler,
199 sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a
200 priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``),
201 and one local dsq per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage
202 an arbitrary number of dsq's using ``scx_bpf_create_dsq()`` and
203 ``scx_bpf_destroy_dsq()``.
204 
205 A CPU always executes a task from its local DSQ. A task is "dispatched" to a
206 DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's
207 local DSQ.
208 
209 When a CPU is looking for the next task to run, if the local DSQ is not
210 empty, the first task is picked. Otherwise, the CPU tries to consume the
211 global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()``
212 is invoked.
213 
214 Scheduling Cycle
215 ----------------
216 
217 The following briefly shows how a waking task is scheduled and executed.
218 
219 1. When a task is waking up, ``ops.select_cpu()`` is the first operation
220    invoked. This serves two purposes. First, CPU selection optimization
221    hint. Second, waking up the selected CPU if idle.
222 
223    The CPU selected by ``ops.select_cpu()`` is an optimization hint and not
224    binding. The actual decision is made at the last step of scheduling.
225    However, there is a small performance gain if the CPU
226    ``ops.select_cpu()`` returns matches the CPU the task eventually runs on.
227 
228    A side-effect of selecting a CPU is waking it up from idle. While a BPF
229    scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper,
230    using ``ops.select_cpu()`` judiciously can be simpler and more efficient.
231 
232    A task can be immediately dispatched to a DSQ from ``ops.select_cpu()`` by
233    calling ``scx_bpf_dispatch()``. If the task is dispatched to
234    ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be dispatched to the
235    local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
236    Additionally, dispatching directly from ``ops.select_cpu()`` will cause the
237    ``ops.enqueue()`` callback to be skipped.
238 
239    Note that the scheduler core will ignore an invalid CPU selection, for
240    example, if it's outside the allowed cpumask of the task.
241 
242 2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
243    task was dispatched directly from ``ops.select_cpu()``). ``ops.enqueue()``
244    can make one of the following decisions:
245 
246    * Immediately dispatch the task to either the global or local DSQ by
247      calling ``scx_bpf_dispatch()`` with ``SCX_DSQ_GLOBAL`` or
248      ``SCX_DSQ_LOCAL``, respectively.
249 
250    * Immediately dispatch the task to a custom DSQ by calling
251      ``scx_bpf_dispatch()`` with a DSQ ID which is smaller than 2^63.
252 
253    * Queue the task on the BPF side.
254 
255 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
256    empty, it then looks at the global DSQ. If there still isn't a task to
257    run, ``ops.dispatch()`` is invoked which can use the following two
258    functions to populate the local DSQ.
259 
260    * ``scx_bpf_dispatch()`` dispatches a task to a DSQ. Any target DSQ can
261      be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
262      ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dispatch()``
263      currently can't be called with BPF locks held, this is being worked on
264      and will be supported. ``scx_bpf_dispatch()`` schedules dispatching
265      rather than performing them immediately. There can be up to
266      ``ops.dispatch_max_batch`` pending tasks.
267 
268    * ``scx_bpf_consume()`` tranfers a task from the specified non-local DSQ
269      to the dispatching DSQ. This function cannot be called with any BPF
270      locks held. ``scx_bpf_consume()`` flushes the pending dispatched tasks
271      before trying to consume the specified DSQ.
272 
273 4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
274    the CPU runs the first one. If empty, the following steps are taken:
275 
276    * Try to consume the global DSQ. If successful, run the task.
277 
278    * If ``ops.dispatch()`` has dispatched any tasks, retry #3.
279 
280    * If the previous task is an SCX task and still runnable, keep executing
281      it (see ``SCX_OPS_ENQ_LAST``).
282 
283    * Go idle.
284 
285 Note that the BPF scheduler can always choose to dispatch tasks immediately
286 in ``ops.enqueue()`` as illustrated in the above simple example. If only the
287 built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as
288 a task is never queued on the BPF scheduler and both the local and global
289 DSQs are consumed automatically.
290 
291 ``scx_bpf_dispatch()`` queues the task on the FIFO of the target DSQ. Use
292 ``scx_bpf_dispatch_vtime()`` for the priority queue. Internal DSQs such as
293 ``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue
294 dispatching, and must be dispatched to with ``scx_bpf_dispatch()``.  See the
295 function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c`` for
296 more information.
297 
298 Where to Look
299 =============
300 
301 * ``include/linux/sched/ext.h`` defines the core data structures, ops table
302   and constants.
303 
304 * ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers.
305   The functions prefixed with ``scx_bpf_`` can be called from the BPF
306   scheduler.
307 
308 * ``tools/sched_ext/`` hosts example BPF scheduler implementations.
309 
310   * ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a
311     custom DSQ.
312 
313   * ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five
314     levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``.
315 
316 ABI Instability
317 ===============
318 
319 The APIs provided by sched_ext to BPF schedulers programs have no stability
320 guarantees. This includes the ops table callbacks and constants defined in
321 ``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in
322 ``kernel/sched/ext.c``.
323 
324 While we will attempt to provide a relatively stable API surface when
325 possible, they are subject to change without warning between kernel
326 versions.

~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~

kernel.org | git.kernel.org | LWN.net | Project Home | SVN repository | Mail admin

Linux® is a registered trademark of Linus Torvalds in the United States and other countries.
TOMOYO® is a registered trademark of NTT DATA CORPORATION.

sflogo.php