Entry/exit handling for exceptions, interrupts, syscalls and KVM
================================================================

All transitions between execution domains require state updates which are
subject to strict ordering constraints. State updates are required for the
following:

  * Lockdep
  * RCU / Context tracking
  * Preemption counter
  * Tracing
  * Time accounting

The update order depends on the transition type and is explained below in
the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular
exceptions`_, `NMI and NMI-like exceptions`_.

Non-instrumentable code - noinstr
---------------------------------

Most instrumentation facilities depend on RCU, so instrumentation is prohibited
for entry code before RCU starts watching and for exit code after RCU stops
watching. In addition, many architectures must save and restore register state,
which means that (for example) a breakpoint in the breakpoint entry code would
overwrite the debug registers of the initial breakpoint.

Such code must be marked with the 'noinstr' attribute, placing that code into a
special section inaccessible to instrumentation and debug facilities. Some
functions are partially instrumentable, which is handled by marking them
noinstr and using instrumentation_begin() and instrumentation_end() to flag the
instrumentable ranges of code:

.. code-block:: c

  noinstr void entry(void)
  {
        handle_entry();     // <-- must be 'noinstr' or '__always_inline'
        ...

        instrumentation_begin();
        handle_context();   // <-- instrumentable code
        instrumentation_end();

        ...
        handle_exit();      // <-- must be 'noinstr' or '__always_inline'
  }

This allows verification of the 'noinstr' restrictions via objtool on
supported architectures.
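
As a rough illustration, the following is the kind of violation objtool
reports: a call from noinstr code into an ordinary instrumentable function.
The function names here are made up for the example:

.. code-block:: c

  noinstr void bad_entry(void)
  {
        /*
         * do_traceable_work() is a normal, instrumentable function, so
         * objtool warns that this call leaves the noinstr section.
         */
        do_traceable_work();
  }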

Invoking non-instrumentable functions from instrumentable context has no
restrictions and is useful, for example, to protect state switching which
would malfunction if instrumented.
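
A minimal sketch of that pattern, with illustrative (made-up) function names:

.. code-block:: c

  noinstr void switch_fragile_state(void)
  {
        /*
         * Instrumenting this switch could recurse or observe
         * half-switched state, hence noinstr.
         */
        ...
  }

  void instrumentable_caller(void)
  {
        switch_fragile_state();   /* calling into noinstr code is fine */
  }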

All non-instrumentable entry/exit code sections before and after the RCU
state transitions must run with interrupts disabled.

Syscalls
--------

Syscall-entry code starts in assembly code and calls out into low-level C code
after establishing low-level architecture-specific state and stack frames. This
low-level C code must not be instrumented. A typical syscall handling function
invoked from low-level assembly code looks like this:

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
        arch_syscall_enter(regs);
        nr = syscall_enter_from_user_mode(regs, nr);

        instrumentation_begin();
        if (!invoke_syscall(regs, nr) && nr != -1)
                result_reg(regs) = __sys_ni_syscall(regs);
        instrumentation_end();

        syscall_exit_to_user_mode(regs);
  }

syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
establishes state in the following order:

  * Lockdep
  * RCU / Context tracking
  * Tracing

and then invokes the various entry work functions like ptrace, seccomp, audit,
syscall tracing, etc. After all that is done, the instrumentable invoke_syscall
function can be invoked. The instrumentable code section then ends, after which
syscall_exit_to_user_mode() is invoked.
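
Put together, syscall_enter_from_user_mode() has roughly the following
shape. This is a simplified sketch of the generic entry code, not a
verbatim copy:

.. code-block:: c

  noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
  {
        /* Lockdep, RCU / context tracking, tracing - in that order */
        enter_from_user_mode(regs);

        instrumentation_begin();
        local_irq_enable();
        /* ptrace, seccomp, audit, syscall tracing ... */
        syscall = syscall_trace_enter(regs, syscall);
        instrumentation_end();

        return syscall;
  }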

syscall_exit_to_user_mode() handles all work which needs to be done before
returning to user space like tracing, audit, signals, task work etc. After
that it invokes exit_to_user_mode() which again handles the state
transition in the reverse order (sketched below):

  * Tracing
  * RCU / Context tracking
  * Lockdep
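
A simplified sketch of that exit path, again not the exact kernel code:

.. code-block:: c

  noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
  {
        instrumentation_begin();
        /*
         * Tracing, audit, signals, task work ...; interrupts are
         * disabled again before the noinstr section is re-entered.
         */
        syscall_exit_to_user_mode_work(regs);
        instrumentation_end();

        /* Tracing, RCU / context tracking, lockdep - in that order */
        exit_to_user_mode();
  }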

syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
available as fine-grained subfunctions in cases where the architecture code
has to do extra work between the various steps. In such cases it has to
ensure that enter_from_user_mode() is called first on entry and
exit_to_user_mode() is called last on exit.
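
Under that constraint, architecture code which needs extra work between the
steps might look like the sketch below. arch_fixup_foo() is a hypothetical
placeholder, and the *_work() helpers are assumed to be the fine-grained
counterparts of the combined functions:

.. code-block:: c

  noinstr void arch_syscall(struct pt_regs *regs, long nr)
  {
        enter_from_user_mode(regs);             /* must be first */

        arch_fixup_foo(regs);                   /* hypothetical arch work */

        instrumentation_begin();
        local_irq_enable();
        nr = syscall_enter_from_user_mode_work(regs, nr);
        invoke_syscall(regs, nr);
        syscall_exit_to_user_mode_work(regs);
        local_irq_disable();
        instrumentation_end();

        exit_to_user_mode();                    /* must be last */
  }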

Do not nest syscalls. Nested syscalls will cause RCU and/or context tracking
to print a warning.

KVM
---

Entering or exiting guest mode is very similar to syscalls. From the host
kernel point of view the CPU goes off into user space when entering the
guest and returns to the kernel on exit.

kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode()
and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
The state operations have the same ordering.

Task work handling is done separately for guest mode at the boundary of the
vcpu_run() loop via xfer_to_guest_mode_handle_work(), which is a subset of
the work handled on return to user space.
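
A sketch of how this fits into a vcpu_run() loop. This is illustrative
only; run_guest() is a hypothetical stand-in for the hardware guest-entry
path:

.. code-block:: c

  int vcpu_run(struct kvm_vcpu *vcpu)
  {
        for (;;) {
                int r;

                /* Signals, resched etc. - a subset of return-to-user work */
                r = xfer_to_guest_mode_handle_work(vcpu);
                if (r)
                        return r;

                local_irq_disable();
                kvm_guest_enter_irqoff();   /* mirrors exit_to_user_mode() */
                run_guest(vcpu);            /* hypothetical hardware entry */
                kvm_guest_exit_irqoff();    /* mirrors enter_from_user_mode() */
                local_irq_enable();
        }
  }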

Do not nest KVM entry/exit transitions because doing so is nonsensical.

Interrupts and regular exceptions
---------------------------------

Interrupt entry and exit handling is slightly more complex than syscalls
and KVM transitions.

If an interrupt is raised while the CPU executes in user space, the entry
and exit handling is exactly the same as for syscalls.

If the interrupt is raised while the CPU executes in kernel space the entry and
exit handling is slightly different. RCU state is only updated when the
interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will
already be watching. Lockdep and tracing have to be updated unconditionally.

irqentry_enter() and irqentry_exit() provide the implementation for this.
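
A simplified sketch of that distinction. The real irqentry_enter() handles
more cases, and helper names vary across kernel versions:

.. code-block:: c

  irqentry_state_t irqentry_enter(struct pt_regs *regs)
  {
        irqentry_state_t ret = { .exit_rcu = false };

        if (user_mode(regs)) {
                /* Same transitions as syscall entry */
                enter_from_user_mode(regs);
                return ret;
        }

        if (is_idle_task(current)) {
                /* Idle task: RCU is not watching, so update RCU state */
                lockdep_hardirqs_off(CALLER_ADDR0);
                rcu_irq_enter();
                ...
                ret.exit_rcu = true;
                return ret;
        }

        /* Kernel mode, RCU already watching: lockdep and tracing only */
        lockdep_hardirqs_off(CALLER_ADDR0);
        ...
        return ret;
  }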

The architecture-specific part looks similar to syscall handling:

.. code-block:: c

  noinstr void interrupt(struct pt_regs *regs, int nr)
  {
        arch_interrupt_enter(regs);
        state = irqentry_enter(regs);

        instrumentation_begin();

        irq_enter_rcu();
        invoke_irq_handler(regs, nr);
        irq_exit_rcu();

        instrumentation_end();

        irqentry_exit(regs, state);
  }

Note that the invocation of the actual interrupt handler is within an
irq_enter_rcu() and irq_exit_rcu() pair.

irq_enter_rcu() updates the preemption count which makes in_hardirq()
return true, handles NOHZ tick state and interrupt time accounting. This
means that up to the point where irq_enter_rcu() is invoked, in_hardirq()
returns false.

irq_exit_rcu() handles interrupt time accounting, undoes the preemption
count update and eventually handles soft interrupts and NOHZ tick state.
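
In terms of the preemption count, the entry side does roughly the following
(simplified; the exact ordering of the accounting steps varies by version):

.. code-block:: c

  void irq_enter_rcu(void)
  {
        /* From here on in_hardirq() returns true */
        preempt_count_add(HARDIRQ_OFFSET);

        /* NOHZ tick state and interrupt time accounting */
        ...
  }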

In theory, the preemption count could be updated in irqentry_enter(). In
practice, deferring this update to irq_enter_rcu() allows the preemption-count
code to be traced, while also maintaining symmetry with irq_exit_rcu() and
irqentry_exit(), which are described in the next paragraph. The only downside
is that the early entry code up to irq_enter_rcu() must be aware that the
preemption count has not yet been updated with the HARDIRQ_OFFSET state.

Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count
before it handles soft interrupts, whose handlers must run in BH context rather
than irq-disabled context. In addition, irqentry_exit() might schedule, which
also requires that HARDIRQ_OFFSET has been removed from the preemption count.
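
A sketch of that ordering, simplified from the generic implementation:

.. code-block:: c

  void irq_exit_rcu(void)
  {
        /* Interrupt time accounting */
        account_hardirq_exit(current);

        /*
         * Drop HARDIRQ_OFFSET before handling softirqs: their handlers
         * must observe BH context, not hardirq context.
         */
        preempt_count_sub(HARDIRQ_OFFSET);

        if (!in_interrupt() && local_softirq_pending())
                invoke_softirq();

        /* NOHZ tick state */
        tick_irq_exit();
  }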

Even though interrupt handlers are expected to run with local interrupts
disabled, interrupt nesting is common from an entry/exit perspective. For
example, softirq handling happens within an irqentry_{enter,exit}() block with
local interrupts enabled. Also, although uncommon, nothing prevents an
interrupt handler from re-enabling interrupts.

Interrupt entry/exit code doesn't strictly need to handle reentrancy, since it
runs with local interrupts disabled. But NMIs can happen anytime, and a lot of
the entry code is shared between the two.

NMI and NMI-like exceptions
---------------------------

NMIs and NMI-like exceptions (machine checks, double faults, debug
interrupts, etc.) can hit any context and must be extra careful with
the state.

State changes for debug exceptions and machine-check exceptions depend on
whether these exceptions happened in user-space (breakpoints or watchpoints) or
in kernel mode (code patching). From user-space, they are treated like
interrupts, while from kernel mode they are treated like NMIs.

NMIs and other NMI-like exceptions handle state transitions without
distinguishing between user-mode and kernel-mode origin.

The state update on entry is handled in irqentry_nmi_enter() which updates
state in the following order:

  * Preemption counter
  * Lockdep
  * RCU / Context tracking
  * Tracing

The exit counterpart irqentry_nmi_exit() does the reverse operation in the
reverse order.

Note that the update of the preemption counter has to be the first
operation on enter and the last operation on exit. The reason is that both
lockdep and RCU rely on in_nmi() returning true in this case. The
preemption count modification in the NMI entry/exit case must not be
traced.
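
A simplified sketch of irqentry_nmi_enter() showing that ordering (helper
names vary across kernel versions):

.. code-block:: c

  irqentry_state_t irqentry_nmi_enter(struct pt_regs *regs)
  {
        irqentry_state_t state;

        state.lockdep = lockdep_hardirqs_enabled();

        /*
         * 1) Preemption counter: in_nmi() must be true before lockdep
         *    and RCU act on this context. This update is not traced.
         */
        __nmi_enter();

        /* 2) Lockdep */
        lockdep_hardirqs_off(CALLER_ADDR0);

        /* 3) RCU / context tracking */
        rcu_nmi_enter();

        /* 4) Tracing */
        instrumentation_begin();
        trace_hardirqs_off_finish();
        instrumentation_end();

        return state;
  }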

Architecture-specific code looks like this:

.. code-block:: c

  noinstr void nmi(struct pt_regs *regs)
  {
        arch_nmi_enter(regs);
        state = irqentry_nmi_enter(regs);

        instrumentation_begin();
        nmi_handler(regs);
        instrumentation_end();

        irqentry_nmi_exit(regs, state);
  }

and for a debug exception, for example, it can look like this:

.. code-block:: c

  noinstr void debug(struct pt_regs *regs)
  {
        arch_nmi_enter(regs);

        debug_regs = save_debug_regs();

        if (user_mode(regs)) {
                state = irqentry_enter(regs);

                instrumentation_begin();
                user_mode_debug_handler(regs, debug_regs);
                instrumentation_end();

                irqentry_exit(regs, state);
        } else {
                state = irqentry_nmi_enter(regs);

                instrumentation_begin();
                kernel_mode_debug_handler(regs, debug_regs);
                instrumentation_end();

                irqentry_nmi_exit(regs, state);
        }
  }

There is no combined irqentry_nmi_if_kernel() function available as the
above cannot be handled in an exception-agnostic way.

NMIs can happen in any context. For example, an NMI-like exception can be
triggered while handling an NMI. So NMI entry code has to be reentrant, and
state updates need to handle nesting.