Linux/Documentation/security/self-protection.rst


======================
Kernel Self-Protection
======================

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a _privileged_
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests. It
is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.


Attack Surface Reduction
========================

The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This ranges
from limiting the APIs exposed to userspace, to making in-kernel APIs
hard to use incorrectly, to minimizing the areas of writable kernel
memory.

Strict kernel memory permissions
--------------------------------

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets
the kernel needs to protect its memory with a tight set of permissions.

Executable code and read-only data must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc. If these must exist in a
kernel, they are implemented in a way where the memory is temporarily
made writable during the update, and then returned to the original
permissions.)

In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and
``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

Most architectures have these options on by default and not user selectable.
For some architectures like arm that wish to have these be selectable,
the architecture Kconfig can select ARCH_OPTIONAL_KERNEL_RWX to enable
a Kconfig prompt. ``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT`` determines
the default setting when ARCH_OPTIONAL_KERNEL_RWX is enabled.

Function pointers and sensitive variables must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc operation structures, etc). The number of these
variables must be reduced to an absolute minimum.

Many such variables can be made read-only by declaring them ``const``
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.

Variables that are initialized once at ``__init`` time can be marked
with the ``__ro_after_init`` attribute.
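
As a minimal userspace sketch of the ``const`` approach (the structure and
function names below are hypothetical and are not kernel APIs), an
operations table declared ``const`` is typically placed in read-only data
by the toolchain, so its function pointers are not writable at runtime:

```c
#include <stddef.h>

/* Hypothetical operations structure, standing in for tables such as
 * file or network operation structures. */
struct sketch_ops {
	int (*open)(int id);
	size_t (*length)(const char *s);
};

static int sketch_open(int id)
{
	return id >= 0 ? 0 : -1;
}

static size_t sketch_length(const char *s)
{
	size_t n = 0;

	while (s[n] != '\0')
		n++;
	return n;
}

/* "const" lets the toolchain place the table in .rodata, not .data. */
static const struct sketch_ops sketch_default_ops = {
	.open = sketch_open,
	.length = sketch_length,
};

/* Callers only ever get a pointer-to-const view of the table. */
const struct sketch_ops *sketch_get_ops(void)
{
	return &sketch_default_ops;
}
```

A table that has to be filled in during boot cannot be ``const``; that is
the case ``__ro_after_init`` is meant for.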

What remains are variables that are updated rarely (e.g. GDT). These
will need another infrastructure (similar to the temporary exceptions
made to kernel code mentioned above) that allows them to spend the rest
of their lifetime read-only. (For example, when being updated, only the
CPU thread performing the update would be given uninterruptible write
access to the memory.)

Segregation of kernel memory from userspace memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
rules can be enforced either by support of hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

Reduced access to syscalls
--------------------------

One trivial way to eliminate many syscalls for 64-bit systems is building
without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario.

The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
points available to a running process. This limits the breadth of kernel
code that can be reached, possibly reducing the availability of a given
bug to an attack.

An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set of entry points normally available
to unprivileged userspace.

Restricting access to kernel modules
------------------------------------

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface. (The on-demand loading
of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
considered "expected" here, though additional consideration should be
given even to these.) For example, loading a filesystem module via an
unprivileged socket API is nonsense: only the root or physically local
user should trigger filesystem module loading. (And even this can be up
for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g. monolithic kernel builds or
modules_disabled sysctl), or provide signed modules (e.g.
``CONFIG_MODULE_SIG_FORCE``, or dm-crypt with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.


Memory integrity
================

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is that of the stack buffer overflow in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.

Stack buffer overflow
---------------------

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (``CONFIG_STACKPROTECTOR``), which is verified just before
the function returns. Other defenses include things like shadow stacks.
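
The canary idea can be modeled in plain C. The sketch below is only an
illustration under stated assumptions: ``struct sketch_frame`` and
``sketch_copy_and_check()`` are hypothetical names, the frame is an
ordinary structure rather than a real stack frame, and the compiler
normally emits the equivalent check automatically.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_BUF_SIZE 16

/* Models a stack frame: a buffer followed by a canary guarding whatever
 * comes next (in a real frame, the saved return address). */
struct sketch_frame {
	char buf[SKETCH_BUF_SIZE];
	unsigned long canary;
};

/*
 * Copies len bytes into the frame without checking len against the
 * buffer size (the bug), then verifies the canary as a function epilogue
 * would. Returns 0 if the canary is intact, -1 if it was overwritten.
 */
int sketch_copy_and_check(const char *src, size_t len, unsigned long secret)
{
	struct sketch_frame frame;

	memset(&frame, 0, sizeof(frame));
	frame.canary = secret;

	/* Clamp to the whole frame so the sketch itself stays well defined. */
	if (len > sizeof(frame))
		len = sizeof(frame);
	memcpy(&frame, src, len);

	return frame.canary == secret ? 0 : -1;
}
```

A copy that stays within ``buf`` leaves the canary untouched, while one
that runs past it changes the canary before reaching anything stored
behind it, which is what makes the overflow detectable.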

Stack depth overflow
--------------------

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations. With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures. Two
important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.

Heap memory integrity
---------------------

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.
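
As a rough sketch of such a sanity check, assume the free list is a
circular doubly linked list (the ``sketch_`` names are hypothetical).
Before unlinking an entry, the neighbours are verified to still point
back at it, so a corrupted pointer pair is rejected rather than used for
the writes that unlinking performs:

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_node {
	struct sketch_node *next;
	struct sketch_node *prev;
};

/* Initializes an empty circular list. */
void sketch_list_init(struct sketch_node *head)
{
	head->next = head;
	head->prev = head;
}

/* Inserts entry right after head. */
void sketch_list_add(struct sketch_node *entry, struct sketch_node *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/*
 * Unlinks entry only if its neighbours still point back at it.
 * Returns false, leaving the list untouched, when the links look
 * corrupted.
 */
bool sketch_list_del_checked(struct sketch_node *entry)
{
	struct sketch_node *next = entry->next;
	struct sketch_node *prev = entry->prev;

	if (next == NULL || prev == NULL)
		return false;
	if (prev->next != entry || next->prev != entry)
		return false;

	next->prev = prev;
	prev->next = next;
	entry->next = NULL;
	entry->prev = NULL;
	return true;
}
```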

Counter integrity
-----------------

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under) this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.
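
One way to picture trapping a wrap is a counter that saturates instead.
The following is a simplified, single-threaded sketch with hypothetical
names; a real implementation would use atomic operations and would likely
also report the event.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical saturating reference counter (single-threaded sketch). */
#define SKETCH_REF_SATURATED UINT_MAX

struct sketch_ref {
	unsigned int count;
};

/*
 * Takes a reference. Refuses to resurrect a counter that already hit
 * zero, and pins the counter at the saturation value instead of letting
 * it wrap around.
 */
bool sketch_ref_get(struct sketch_ref *r)
{
	if (r->count == 0)
		return false;
	if (r->count >= SKETCH_REF_SATURATED - 1) {
		r->count = SKETCH_REF_SATURATED;
		return false;
	}
	r->count++;
	return true;
}

/*
 * Drops a reference. Returns true only when the caller should free the
 * object. Underflow is refused, and a saturated counter stays pinned, so
 * the object is leaked rather than freed while still in use.
 */
bool sketch_ref_put(struct sketch_ref *r)
{
	if (r->count == 0 || r->count == SKETCH_REF_SATURATED)
		return false;
	r->count--;
	return r->count == 0;
}
```

The design choice here is to prefer a memory leak over a use-after-free:
once the counter is pinned, the object is never released.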

Size calculation overflow detection
-----------------------------------

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.
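
A minimal sketch of runtime detection for a size calculation follows. It
assumes a compiler that provides ``__builtin_mul_overflow`` (GCC and Clang
do); the ``sketch_`` helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns n * size, or SIZE_MAX if the multiplication would overflow. */
size_t sketch_array_size(size_t n, size_t size)
{
	size_t bytes;

	if (__builtin_mul_overflow(n, size, &bytes))
		return SIZE_MAX;
	return bytes;
}

/* Allocates an array, failing cleanly when the size calculation
 * overflows instead of returning an undersized buffer. */
void *sketch_alloc_array(size_t n, size_t size)
{
	size_t bytes = sketch_array_size(n, size);

	if (bytes == SIZE_MAX)
		return NULL;
	return malloc(bytes);
}
```

Without the check, a wrapped product would yield a small allocation that
later code, still believing it holds ``n`` elements, would write past.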


Probabilistic defenses
======================

While many protections can be considered deterministic (e.g. read-only
memory cannot be written to), some protections provide only statistical
defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.

Canaries, blinding, and other secrets
-------------------------------------

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, needs a similar
secret value.
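
The idea behind blinding can be sketched as follows. This is an
illustration only, with hypothetical names: a JIT would emit the masked
value and recombine it with the secret using a separate instruction, so
the constant chosen by userspace never appears verbatim in executable
memory.

```c
#include <stdint.h>

/* Pair emitted in place of a user-supplied constant. */
struct sketch_blinded_imm {
	uint32_t masked; /* imm XOR key: the value that lands in the code */
	uint32_t key;    /* secret, recombined by a second instruction */
};

/* Masks a user-controlled immediate with a secret key. */
struct sketch_blinded_imm sketch_blind(uint32_t imm, uint32_t key)
{
	struct sketch_blinded_imm b = { .masked = imm ^ key, .key = key };

	return b;
}

/* What the emitted instruction pair computes at run time. */
uint32_t sketch_unblind(struct sketch_blinded_imm b)
{
	return b.masked ^ b.key;
}
```

The protection depends entirely on the key being unpredictable, which is
why the quality of the secret matters.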

It is critical that the secret values used must be separate (e.g.
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their success.

Kernel Address Space Layout Randomization (KASLR)
-------------------------------------------------

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)

Text and module base
~~~~~~~~~~~~~~~~~~~~

By relocating the physical and virtual base address of the kernel at
boot-time (``CONFIG_RANDOMIZE_BASE``), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.

Stack base
~~~~~~~~~~

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

Dynamic memory base
~~~~~~~~~~~~~~~~~~~

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

Structure layout
~~~~~~~~~~~~~~~~

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.


Preventing Information Exposures
================================

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

Kernel addresses
----------------

Printing kernel addresses to userspace leaks sensitive information about
the kernel memory layout. Care should be exercised when using any printk
specifier that prints the raw address, currently %px, %p[ad] (and %p[sSb]
in certain circumstances [*]). Any file written to using one of these
specifiers should be readable only by privileged processes.

Kernels 4.14 and older printed the raw address using %p. As of 4.15-rc1
addresses printed with the specifier %p are hashed before printing.

[*] If KALLSYMS is enabled and symbol lookup fails, the raw address is
printed. If KALLSYMS is not enabled the raw address is printed.

Unique identifiers
------------------

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.
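
The atomic counter option might look like the sketch below. It uses C11
atomics in userspace, and ``struct sketch_object`` and its helper are
hypothetical names; the point is only that the identifier handed out is
unrelated to where the object lives in memory.

```c
#include <stdatomic.h>
#include <stdint.h>

struct sketch_object {
	uint64_t id; /* safe to report to userspace */
	int payload;
};

/* Monotonic source of identifiers shared by all objects. */
static atomic_uint_fast64_t sketch_next_id = 1;

/* Assigns an opaque identifier that reveals nothing about the
 * object's address. */
void sketch_object_init(struct sketch_object *obj, int payload)
{
	obj->id = atomic_fetch_add(&sketch_next_id, 1);
	obj->payload = payload;
}
```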

Memory initialization
---------------------

Memory copied to userspace must always be fully initialized. If not
explicitly memset(), this will require changes to the compiler to make
sure structure holes are cleared.
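
The explicit memset() approach can be sketched as below. The structure
and function names are hypothetical, and the layout comment assumes a
common ABI where ``value`` is aligned more strictly than ``flag``.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* On common ABIs there are padding bytes between flag and value. */
struct sketch_reply {
	uint8_t flag;
	uint64_t value;
};

/*
 * Fills a reply that will later be copied out verbatim. Clearing the
 * whole object first is meant to keep stale memory contents out of the
 * structure holes as well as out of any field left unset.
 */
void sketch_fill_reply(struct sketch_reply *out, uint8_t flag, uint64_t value)
{
	memset(out, 0, sizeof(*out));
	out->flag = flag;
	out->value = value;
}
```

Assigning only the named fields, without the memset(), would leave the
bytes in the hole holding whatever was previously in that memory.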

Memory poisoning
----------------

When releasing memory, it is best to poison the contents, to avoid reuse
attacks that rely on the old contents of memory. E.g., clear stack on a
syscall return (``CONFIG_GCC_PLUGIN_STACKLEAK``), wipe heap memory on a
free. This frustrates many uninitialized variable attacks, stack content
exposures, heap content exposures, and use-after-free attacks.
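
Wiping memory on free can be illustrated with a deliberately tiny
allocator. Everything here is a hypothetical sketch: the one-slot
"allocator", the ``sketch_`` names, and the fill pattern are chosen only
for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_POISON_FREE 0x6b /* illustrative fill pattern */
#define SKETCH_SLOT_SIZE 32

/* A one-object "allocator": enough to show poisoning on release. */
struct sketch_slot {
	bool in_use;
	unsigned char data[SKETCH_SLOT_SIZE];
};

/* Hands out the slot's storage, or NULL if it is already taken. */
unsigned char *sketch_slot_alloc(struct sketch_slot *slot)
{
	if (slot->in_use)
		return NULL;
	slot->in_use = true;
	return slot->data;
}

/* Overwrites the old contents before the memory can be handed out
 * again. */
void sketch_slot_free(struct sketch_slot *slot)
{
	memset(slot->data, SKETCH_POISON_FREE, sizeof(slot->data));
	slot->in_use = false;
}
```

The next user of the slot, or a stale pointer into it, then sees only
the fill pattern rather than the previous owner's data.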

Destination tracking
--------------------

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
it should automatically censor sensitive values.
