TOMOYO Linux Cross Reference
Linux/Documentation/scheduler/sched-ext.rst

Differences between /Documentation/scheduler/sched-ext.rst (Version linux-6.12-rc7) and /Documentation/scheduler/sched-ext.rst (Version linux-5.17.15)


==========================
Extensible Scheduler Class
==========================

sched_ext is a scheduler class whose behavior
programs - the BPF scheduler.

* sched_ext exports a full scheduling interfac
  algorithm can be implemented on top.

* The BPF scheduler can group CPUs however it
  together, as tasks aren't tied to specific C

* The BPF scheduler can be turned on and off d

* The system integrity is maintained no matter
  The default scheduling behavior is restored
  a runnable task stalls, or on invoking the S
  :kbd:`SysRq-S`.

* When the BPF scheduler triggers an error, de
  aid debugging. The debug dump is passed to a
  scheduler binary. The debug dump can also be
  `sched_ext_dump` tracepoint. The SysRq key s
  triggers a debug dump. This doesn't terminat
  only be read through the tracepoint.

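The stall rule above can be pictured as a watchdog that checks how long each runnable task has been waiting. The following is a minimal user-space model under stated assumptions, not code from ``kernel/sched/ext.c``: ``struct model_task``, ``find_stalled_task()`` and the ``timeout_ns`` threshold are hypothetical names, and the timeout value is left to the caller.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical model of a task as seen by a stall watchdog. */
    struct model_task {
            int pid;
            bool runnable;           /* waiting for a CPU */
            uint64_t runnable_at_ns; /* when it last became runnable */
    };

    /*
     * Return the index of the first runnable task that has waited longer
     * than timeout_ns, or -1 if there is none. timeout_ns is an assumed,
     * caller-chosen threshold.
     */
    int find_stalled_task(const struct model_task *tasks, size_t n,
                          uint64_t now_ns, uint64_t timeout_ns)
    {
            assert(tasks != NULL || n == 0);

            for (size_t i = 0; i < n; i++) {
                    /* Skip tasks that are not waiting, or whose timestamp is in the future. */
                    if (!tasks[i].runnable || now_ns <= tasks[i].runnable_at_ns)
                            continue;
                    if (now_ns - tasks[i].runnable_at_ns > timeout_ns)
                            return (int)i;
            }
            return -1;
    }

A caller would poll this periodically; a non-negative result marks the point at which, per the rules above, the BPF scheduler would be aborted and the default scheduling behavior restored.
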
Switching to and from sched_ext
===============================

``CONFIG_SCHED_CLASS_EXT`` is the config optio
``tools/sched_ext`` contains the example sched
options should be enabled to use sched_ext:

.. code-block:: none

    CONFIG_BPF=y
    CONFIG_SCHED_CLASS_EXT=y
    CONFIG_BPF_SYSCALL=y
    CONFIG_BPF_JIT=y
    CONFIG_DEBUG_INFO_BTF=y
    CONFIG_BPF_JIT_ALWAYS_ON=y
    CONFIG_BPF_JIT_DEFAULT_ON=y
    CONFIG_PAHOLE_HAS_SPLIT_BTF=y
    CONFIG_PAHOLE_HAS_BTF_TAG=y

sched_ext is used only when the BPF scheduler

If a task explicitly sets its scheduling polic
treated as ``SCHED_NORMAL`` and scheduled by C
loaded.

When the BPF scheduler is loaded and ``SCX_OPS
in ``ops->flags``, all ``SCHED_NORMAL``, ``SCH
``SCHED_EXT`` tasks are scheduled by sched_ext

However, when the BPF scheduler is loaded and
set in ``ops->flags``, only tasks with the ``S
by sched_ext, while tasks with ``SCHED_NORMAL`
``SCHED_IDLE`` policies are scheduled by CFS.

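The routing rules above can be condensed into a single decision. The sketch below is hypothetical: the ``MODEL_SCHED_*`` values are local stand-ins rather than the kernel's policy constants, ``bpf_loaded`` means a BPF scheduler is loaded and running, and ``partial_switch`` stands for the ``ops->flags`` bit that restricts sched_ext to ``SCHED_EXT`` tasks.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>

    /* Local stand-ins for the policies named above (values are arbitrary). */
    enum model_policy {
            MODEL_SCHED_NORMAL,
            MODEL_SCHED_BATCH,
            MODEL_SCHED_IDLE,
            MODEL_SCHED_EXT,
    };

    enum model_class { MODEL_CLASS_CFS, MODEL_CLASS_EXT };

    /*
     * bpf_loaded:     a BPF scheduler is loaded and running.
     * partial_switch: the ops->flags bit limiting sched_ext to SCHED_EXT.
     */
    enum model_class route_task(enum model_policy policy, bool bpf_loaded,
                                bool partial_switch)
    {
            assert((unsigned)policy <= MODEL_SCHED_EXT);

            /* No BPF scheduler: SCHED_EXT is treated as SCHED_NORMAL. */
            if (!bpf_loaded)
                    return MODEL_CLASS_CFS;

            /* SCHED_EXT tasks go to sched_ext in both modes. */
            if (policy == MODEL_SCHED_EXT)
                    return MODEL_CLASS_EXT;

            /* NORMAL, BATCH and IDLE follow the partial-switch setting. */
            return partial_switch ? MODEL_CLASS_CFS : MODEL_CLASS_EXT;
    }

Unloading the BPF scheduler corresponds to calling this with ``bpf_loaded`` set to false, which sends every listed policy back to CFS.
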
Terminating the sched_ext scheduler program, t
detection of any internal error including stal
BPF scheduler and reverts all tasks back to CF

.. code-block:: none

    # make -j16 -C tools/sched_ext
    # tools/sched_ext/build/bin/scx_simple
    local=0 global=3
    local=5 global=24
    local=9 global=44
    local=13 global=56
    local=17 global=72
    ^CEXIT: BPF scheduler unregistered

The current status of the BPF scheduler can be

.. code-block:: none

    # cat /sys/kernel/sched_ext/state
    enabled
    # cat /sys/kernel/sched_ext/root/ops
    simple

You can check if any BPF scheduler has ever be
this monotonically incrementing counter (a val
scheduler has been loaded):

.. code-block:: none

    # cat /sys/kernel/sched_ext/enable_seq
    1

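A monitoring tool can read the ``enable_seq`` counter and treat zero as "never loaded". A minimal sketch, assuming the file contains a single decimal number followed by a newline as in the output above; ``read_enable_seq()`` and ``scx_ever_loaded()`` are hypothetical helpers, and the parser takes an open ``FILE *`` so it can be exercised without sysfs.

.. code-block:: c

    #include <assert.h>
    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Parse one unsigned decimal counter (e.g. "1\n") from f. Returns 0 on success. */
    int read_enable_seq(FILE *f, unsigned long long *out)
    {
            char buf[32];
            char *end;

            assert(f != NULL && out != NULL);
            if (!fgets(buf, sizeof(buf), f))
                    return -1;

            errno = 0;
            *out = strtoull(buf, &end, 10);
            if (errno != 0 || end == buf)
                    return -1;
            return 0;
    }

    /* True if the counter at path is non-zero, i.e. a BPF scheduler was loaded. */
    bool scx_ever_loaded(const char *path)
    {
            unsigned long long seq = 0;
            FILE *f = fopen(path, "r");

            if (!f)
                    return false; /* assumption: a missing file means "never loaded" */
            if (read_enable_seq(f, &seq) != 0)
                    seq = 0;
            fclose(f);
            return seq > 0;
    }

For example, ``scx_ever_loaded("/sys/kernel/sched_ext/enable_seq")`` returns false on a system where the file is missing or reads 0. Treating a missing file as "never loaded" is a design choice of this sketch.
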
``tools/sched_ext/scx_show_state.py`` is a drg
detailed information:

.. code-block:: none

    # tools/sched_ext/scx_show_state.py
    ops           : simple
    enabled       : 1
    switching_all : 1
    switched_all  : 1
    enable_state  : enabled (2)
    bypass_depth  : 0
    nr_rejected   : 0
    enable_seq    : 1

If ``CONFIG_SCHED_DEBUG`` is set, whether a gi
be determined as follows:

.. code-block:: none

    # grep ext /proc/self/sched
    ext.enabled

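A program can run the same check on itself by scanning ``/proc/self/sched`` for the ``ext.enabled`` field. This sketch assumes the usual ``name : value`` layout of that file with an integer value; ``task_on_sched_ext()`` is a hypothetical helper that returns the value, or -1 if the field is not present.

.. code-block:: c

    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    /*
     * Scan a /proc/<pid>/sched-style stream for the "ext.enabled" field.
     * Assumed line layout: "ext.enabled <spaces> : <spaces> <int>".
     * Returns the integer value, or -1 if the field is not present.
     */
    int task_on_sched_ext(FILE *f)
    {
            static const char key[] = "ext.enabled";
            char line[256];

            assert(f != NULL);
            while (fgets(line, sizeof(line), f)) {
                    const char *colon;
                    char next;
                    int value;

                    if (strncmp(line, key, sizeof(key) - 1) != 0)
                            continue;
                    /* Require a separator so e.g. "ext.enabled_x" is not matched. */
                    next = line[sizeof(key) - 1];
                    if (next != ' ' && next != '\t' && next != ':')
                            continue;
                    colon = strchr(line, ':');
                    if (colon && sscanf(colon + 1, "%d", &value) == 1)
                            return value;
            }
            return -1;
    }

Pass it the result of ``fopen("/proc/self/sched", "r")`` after checking for ``NULL``; under the layout assumption above, a return value of 1 would indicate that the calling task is on sched_ext.
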
The Basics
==========

Userspace can implement an arbitrary BPF sched
programs that implement ``struct sched_ext_ops
is ``ops.name`` which must be a valid BPF obje
optional. The following modified excerpt is fr
``tools/sched_ext/scx_simple.bpf.c`` showing a

.. code-block:: c

    /*
     * Decide which CPU a task should be migra
     * enqueued (either at wakeup, fork time,
     * idle core is found by the default ops.s
     * then dispatch the task directly to SCX_
     * ops.enqueue() callback.
     *
     * Note that this implementation has exact
     * default ops.select_cpu implementation.
     * would be exactly same if the implementa
     * simple_select_cpu() struct_ops prog.
     */
    s32 BPF_STRUCT_OPS(simple_select_cpu, stru
                       s32 prev_cpu, u64 wake_
    {
            s32 cpu;
            /* Need to initialize or the BPF v
            bool direct = false;

            cpu = scx_bpf_select_cpu_dfl(p, pr

            if (direct)
                    scx_bpf_dispatch(p, SCX_DS

            return cpu;
    }

    /*
     * Do a direct dispatch of a task to the g
     * callback will only be invoked if we fai
     * to in ops.select_cpu() above.
     *
     * Note that this implementation has exact
     * default ops.enqueue implementation, whi
     * to SCX_DSQ_GLOBAL. The behavior of the
     * if the implementation just didn't defin
     * prog.
     */
    void BPF_STRUCT_OPS(simple_enqueue, struct
    {
            scx_bpf_dispatch(p, SCX_DSQ_GLOBAL
    }

    s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
    {
            /*
             * By default, all SCHED_EXT, SCHE
             * SCHED_BATCH tasks should use sc
             */
            return 0;
    }

    void BPF_STRUCT_OPS(simple_exit, struct sc
    {
            exit_type = ei->type;
    }

    SEC(".struct_ops")
    struct sched_ext_ops simple_ops = {
            .select_cpu             = (void *)
            .enqueue                = (void *)
            .init                   = (void *)
            .exit                   = (void *)
            .name                   = "simple"
    };

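The comments in the excerpt note that leaving out ``simple_select_cpu()`` or ``simple_enqueue`` would not change the scheduler's behavior, because the default implementation is used for a callback that is not defined. The user-space sketch below models that fallback with a table of optional function pointers. ``struct model_ops``, ``model_enqueue()``, ``default_enqueue()`` and the ``MODEL_DSQ_*`` values are hypothetical; they mimic the shape of ``struct sched_ext_ops`` without its real fields or types.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical stand-ins for the built-in DSQ IDs. */
    enum model_dsq { MODEL_DSQ_GLOBAL, MODEL_DSQ_LOCAL };

    struct model_task { int pid; };

    /* A cut-down, hypothetical ops table: every callback is optional. */
    struct model_ops {
            enum model_dsq (*enqueue)(struct model_task *p);
            const char *name; /* the one mandatory field */
    };

    /* Default behavior: dispatch to the global DSQ, like the excerpt's enqueue. */
    static enum model_dsq default_enqueue(struct model_task *p)
    {
            (void)p;
            return MODEL_DSQ_GLOBAL;
    }

    /* Call the scheduler's callback if it set one, otherwise the default. */
    enum model_dsq model_enqueue(const struct model_ops *ops, struct model_task *p)
    {
            assert(ops != NULL && ops->name != NULL);
            return ops->enqueue ? ops->enqueue(p) : default_enqueue(p);
    }

Leaving ``enqueue`` as ``NULL`` and setting it to ``default_enqueue`` produce the same result, which mirrors the claim in the excerpt's comments.
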
Dispatch Queues
---------------

To match the impedance between the scheduler c
sched_ext uses DSQs (dispatch queues) which ca
priority queue. By default, there is one globa
and one local DSQ per CPU (``SCX_DSQ_LOCAL``).
an arbitrary number of DSQs using ``scx_bpf_c
``scx_bpf_destroy_dsq()``.

A CPU always executes a task from its local DS
DSQ. A non-local DSQ is "consumed" to transfer
local DSQ.

When a CPU is looking for the next task to run
empty, the first task is picked. Otherwise, th
global DSQ. If that doesn't yield a runnable t
is invoked.

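The lookup order above (local DSQ, then the global DSQ, then ``ops.dispatch()``) can be modeled with plain FIFO queues. This is a simplified, single-threaded sketch: ``struct model_dsq``, ``dsq_push()``, ``dsq_pop()``, ``dsq_consume()``, ``model_pick_next()`` and the ``dispatch`` callback are hypothetical, and locking, priority ordering and the retry steps of the full scheduling cycle are left out.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define MODEL_DSQ_CAP 16

    /* A fixed-capacity FIFO of task IDs standing in for a DSQ. */
    struct model_dsq {
            int tasks[MODEL_DSQ_CAP];
            size_t head, len;
    };

    bool dsq_push(struct model_dsq *q, int pid)
    {
            if (q->len == MODEL_DSQ_CAP)
                    return false;
            q->tasks[(q->head + q->len) % MODEL_DSQ_CAP] = pid;
            q->len++;
            return true;
    }

    bool dsq_pop(struct model_dsq *q, int *pid)
    {
            if (q->len == 0)
                    return false;
            *pid = q->tasks[q->head];
            q->head = (q->head + 1) % MODEL_DSQ_CAP;
            q->len--;
            return true;
    }

    /* "Consume": move the first task of a non-local DSQ to the local DSQ. */
    bool dsq_consume(struct model_dsq *local, struct model_dsq *src)
    {
            int pid;
            bool ok;

            if (!dsq_pop(src, &pid))
                    return false;
            ok = dsq_push(local, pid);
            assert(ok); /* a full local DSQ is out of scope for this model */
            return ok;
    }

    /*
     * Pick the next task for one CPU: local DSQ first, then the global DSQ,
     * then the dispatch callback, which may fill the local DSQ.
     * Returns the task ID, or -1 if the CPU should go idle.
     */
    int model_pick_next(struct model_dsq *local, struct model_dsq *global,
                        void (*dispatch)(struct model_dsq *local))
    {
            int pid;

            if (dsq_pop(local, &pid))
                    return pid;
            if (dsq_consume(local, global) && dsq_pop(local, &pid))
                    return pid;
            if (dispatch) {
                    dispatch(local);
                    if (dsq_pop(local, &pid))
                            return pid;
            }
            return -1;
    }

The ``dispatch`` callback here plays the role of ``ops.dispatch()``: it may push tasks onto the local DSQ, and the CPU then picks the first of them.
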
Scheduling Cycle
----------------

The following briefly shows how a waking task

1. When a task is waking up, ``ops.select_cpu(
   invoked. This serves two purposes. First, C
   hint. Second, waking up the selected CPU if

   The CPU selected by ``ops.select_cpu()`` is
   binding. The actual decision is made at the
   However, there is a small performance gain
   ``ops.select_cpu()`` returns matches the CP

   A side-effect of selecting a CPU is waking
   scheduler can wake up any CPU using the ``s
   using ``ops.select_cpu()`` judiciously can

   A task can be immediately dispatched to a D
   calling ``scx_bpf_dispatch()``. If the task
   ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``
   local DSQ of whichever CPU is returned from
   Additionally, dispatching directly from ``o
   ``ops.enqueue()`` callback to be skipped.

   Note that the scheduler core will ignore an
   example, if it's outside the allowed cpumas

2. Once the target CPU is selected, ``ops.enqu
   task was dispatched directly from ``ops.sel
   can make one of the following decisions:

   * Immediately dispatch the task to either t
     calling ``scx_bpf_dispatch()`` with ``SCX
     ``SCX_DSQ_LOCAL``, respectively.

   * Immediately dispatch the task to a custom
     ``scx_bpf_dispatch()`` with a DSQ ID whic

   * Queue the task on the BPF side.

3. When a CPU is ready to schedule, it first l
   empty, it then looks at the global DSQ. If
   run, ``ops.dispatch()`` is invoked which ca
   functions to populate the local DSQ.

   * ``scx_bpf_dispatch()`` dispatches a task
     be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LO
     ``SCX_DSQ_GLOBAL`` or a custom DSQ. While
     currently can't be called with BPF locks
     and will be supported. ``scx_bpf_dispatch
     rather than performing them immediately.
     ``ops.dispatch_max_batch`` pending tasks.

   * ``scx_bpf_consume()`` transfers a task fro
     to the dispatching DSQ. This function can
     locks held. ``scx_bpf_consume()`` flushes
     before trying to consume the specified DS

4. After ``ops.dispatch()`` returns, if there
   the CPU runs the first one. If empty, the f

   * Try to consume the global DSQ. If success

   * If ``ops.dispatch()`` has dispatched any

   * If the previous task is an SCX task and s
     it (see ``SCX_OPS_ENQ_LAST``).

   * Go idle.

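Step 3 says ``scx_bpf_dispatch()`` records dispatches instead of performing them immediately, with up to ``ops.dispatch_max_batch`` pending, and that ``scx_bpf_consume()`` flushes the pending dispatches before consuming. The sketch below models only that bookkeeping. All names are hypothetical, ``MODEL_MAX_BATCH`` stands in for ``ops.dispatch_max_batch``, and returning false on a full batch is a simplification rather than documented kernel behavior.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define MODEL_MAX_BATCH 4 /* stands in for ops.dispatch_max_batch */
    #define MODEL_Q_CAP 32

    /* A simple array queue of task IDs (hypothetical DSQ stand-in). */
    struct model_queue {
            int pids[MODEL_Q_CAP];
            size_t len;
    };

    /* Dispatches recorded during one ops.dispatch() call, not yet applied. */
    struct model_batch {
            int pids[MODEL_MAX_BATCH];
            struct model_queue *targets[MODEL_MAX_BATCH];
            size_t nr;
    };

    static void queue_append(struct model_queue *q, int pid)
    {
            assert(q->len < MODEL_Q_CAP);
            q->pids[q->len++] = pid;
    }

    /* Record a dispatch; refuses once MODEL_MAX_BATCH dispatches are pending. */
    bool batch_dispatch(struct model_batch *b, struct model_queue *target, int pid)
    {
            if (b->nr == MODEL_MAX_BATCH)
                    return false;
            b->pids[b->nr] = pid;
            b->targets[b->nr] = target;
            b->nr++;
            return true;
    }

    /* Apply all pending dispatches in the order they were recorded. */
    void batch_flush(struct model_batch *b)
    {
            for (size_t i = 0; i < b->nr; i++)
                    queue_append(b->targets[i], b->pids[i]);
            b->nr = 0;
    }

    /* Flush pending dispatches, then move the head of src to local. */
    bool batch_consume(struct model_batch *b, struct model_queue *local,
                       struct model_queue *src)
    {
            batch_flush(b);
            if (src->len == 0)
                    return false;
            queue_append(local, src->pids[0]);
            /* Shift the remaining tasks forward (O(n), fine for a model). */
            for (size_t i = 1; i < src->len; i++)
                    src->pids[i - 1] = src->pids[i];
            src->len--;
            return true;
    }

Because ``batch_consume()`` flushes first, a task dispatched earlier in the same ``ops.dispatch()`` call lands in the local DSQ ahead of the consumed one.
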
Note that the BPF scheduler can always choose
in ``ops.enqueue()`` as illustrated in the abo
built-in DSQs are used, there is no need to im
a task is never queued on the BPF scheduler an
DSQs are consumed automatically.

``scx_bpf_dispatch()`` queues the task on the
``scx_bpf_dispatch_vtime()`` for the priority
``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do no
dispatching, and must be dispatched to with ``
function documentation and usage in ``tools/sc
more information.

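The FIFO versus priority-queue distinction comes down to where a task is inserted. The sketch below models a custom DSQ kept sorted by virtual time, as ``scx_bpf_dispatch_vtime()`` implies, and rejects vtime insertion into built-in DSQs as the paragraph above requires. The names are hypothetical, keeping equal vtimes in arrival order is a choice of this model, and the real DSQ data structure is not described here.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MODEL_VQ_CAP 32

    enum model_dsq_kind { MODEL_DSQ_BUILTIN, MODEL_DSQ_CUSTOM };

    struct model_entry {
            int pid;
            uint64_t vtime;
    };

    /* A DSQ kept sorted by vtime (hypothetical model). */
    struct model_vdsq {
            enum model_dsq_kind kind;
            struct model_entry e[MODEL_VQ_CAP];
            size_t len;
    };

    /*
     * Insert pid ordered by vtime; equal vtimes keep arrival order.
     * Built-in DSQs only take FIFO dispatching, so they are rejected.
     */
    bool dsq_insert_vtime(struct model_vdsq *q, int pid, uint64_t vtime)
    {
            size_t i;

            if (q->kind == MODEL_DSQ_BUILTIN || q->len == MODEL_VQ_CAP)
                    return false;

            /* Find the first entry with a strictly larger vtime. */
            for (i = 0; i < q->len; i++)
                    if (q->e[i].vtime > vtime)
                            break;

            /* Shift the tail right by one slot to make room. */
            for (size_t j = q->len; j > i; j--)
                    q->e[j] = q->e[j - 1];

            q->e[i] = (struct model_entry){ .pid = pid, .vtime = vtime };
            q->len++;
            assert(i == 0 || q->e[i - 1].vtime <= vtime);
            return true;
    }

Tasks with a smaller vtime move ahead regardless of arrival order, which is the property a FIFO append cannot provide.
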
Where to Look
=============

* ``include/linux/sched/ext.h`` defines the co
  and constants.

* ``kernel/sched/ext.c`` contains sched_ext co
  The functions prefixed with ``scx_bpf_`` can
  scheduler.

* ``tools/sched_ext/`` hosts example BPF sched

  * ``scx_simple[.bpf].c``: Minimal global FIF
    custom DSQ.

  * ``scx_qmap[.bpf].c``: A multi-level FIFO s
    levels of priority implemented with ``BPF_

ABI Instability
===============

The APIs provided by sched_ext to BPF schedule
guarantees. This includes the ops table callba
``include/linux/sched/ext.h``, as well as the
``kernel/sched/ext.c``.

While we will attempt to provide a relatively
possible, they are subject to change without w
versions.
                                                      
