Linux/Documentation/arch/x86/pti.rst

  1 .. SPDX-License-Identifier: GPL-2.0
  2 
  3 ==========================
  4 Page Table Isolation (PTI)
  5 ==========================
  6 
  7 Overview
  8 ========
  9 
 10 Page Table Isolation (PTI, previously known as KAISER [1]_) is a
 11 countermeasure against attacks on the shared user/kernel address
 12 space such as the "Meltdown" approach [2]_.
 13 
 14 To mitigate this class of attacks, we create an independent set of
 15 page tables for use only when running userspace applications.  When
 16 the kernel is entered via syscalls, interrupts or exceptions, the
 17 page tables are switched to the full "kernel" copy.  When the system
 18 switches back to user mode, the user copy is used again.
 19 
 20 The userspace page tables contain only a minimal amount of kernel
 21 data: only what is needed to enter and exit the kernel, such as the
 22 entry/exit functions themselves and the interrupt descriptor table
 23 (IDT).  A few things that are not strictly necessary are also mapped,
 24 such as the first C function called when entering an interrupt (see
 25 the comments in pti.c).
 26 
 27 This approach helps to ensure that side-channel attacks leveraging
 28 the paging structures do not function when PTI is enabled.  PTI is
 29 enabled by setting CONFIG_MITIGATION_PAGE_TABLE_ISOLATION=y at compile
 30 time.  Once enabled at compile time, it can be disabled at boot with
 31 the 'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
 32 
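As a minimal sketch of how one might check from userspace whether PTI
ended up active on a running system: kernels that report CPU
vulnerabilities through sysfs expose a ``meltdown`` file whose text names
the mitigation in use.  The helper below only classifies that text.  The
function names ``pti_status`` and ``read_pti_status`` are illustrative
assumptions, and the exact wording reported by a given kernel version may
differ from the strings matched here.

```python
from pathlib import Path

# Assumed sysfs location; present only on kernels that export
# CPU vulnerability status.
MELTDOWN_SYSFS = Path("/sys/devices/system/cpu/vulnerabilities/meltdown")


def pti_status(report: str) -> str:
    """Classify the text of the sysfs 'meltdown' file.

    Returns "pti" when the report names PTI as the mitigation,
    "not-affected" when the CPU is reported as not vulnerable,
    "vulnerable" when no mitigation is active, and "unknown" otherwise.
    """
    text = report.strip().lower()
    if text.startswith("mitigation") and "pti" in text:
        return "pti"
    if text.startswith("not affected"):
        return "not-affected"
    if text.startswith("vulnerable"):
        return "vulnerable"
    return "unknown"


def read_pti_status(path: Path = MELTDOWN_SYSFS) -> str:
    """Read the sysfs file if it exists; report "unknown" if it does not."""
    try:
        return pti_status(path.read_text())
    except OSError:
        return "unknown"
```

Calling ``read_pti_status()`` with no arguments reads the assumed sysfs
path; passing a different path makes the helper easy to exercise against
saved reports.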
 33 Page Table Management
 34 =====================
 35 
 36 When PTI is enabled, the kernel manages two sets of page tables.
 37 The first set is very similar to the single set which is present in
 38 kernels without PTI.  This includes a complete mapping of userspace
 39 that the kernel can use for things like copy_to_user().
 40 
 41 Although _complete_, the user portion of the kernel page tables is
 42 crippled by setting the NX bit in the top level.  This ensures
 43 that any missed kernel->user CR3 switch will immediately crash
 44 userspace upon executing its first instruction.
 45 
 46 The userspace page tables map only the kernel data needed to enter
 47 and exit the kernel.  This data is entirely contained in the 'struct
 48 cpu_entry_area' structure, which is placed in the fixmap so that
 49 each CPU's copy of the area has a compile-time-fixed virtual address.
 50 
 51 For new userspace mappings, the kernel makes the entries in its
 52 page tables like normal.  The only difference is when the kernel
 53 makes entries in the top (PGD) level.  In addition to setting the
 54 entry in the main kernel PGD, a copy of the entry is made in the
 55 userspace page tables' PGD.
 56 
 57 This sharing at the PGD level also inherently shares all the lower
 58 layers of the page tables.  This leaves a single, shared set of
 59 userspace page tables to manage: one PTE to lock, one set of
 60 accessed bits, dirty bits, etc.
 61 
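The bookkeeping described in this section can be illustrated with a toy
model.  This is a sketch, not kernel code: the class ``ToyMM``, its method
names, and the use of Python dicts as stand-ins for the lower-level page
tables are all assumptions made for illustration.  It also assumes a
512-entry PGD whose lower half maps userspace, and it leaves out the
cpu_entry_area mapping in the user copy for brevity.

```python
PTRS_PER_PGD = 512                     # assumed PGD size
USER_PGD_ENTRIES = PTRS_PER_PGD // 2   # assumed: lower half maps userspace


class PgdEntry:
    """A PGD entry: a reference to a lower-level table plus an NX flag."""

    def __init__(self, lower_table, nx):
        self.lower_table = lower_table
        self.nx = nx


class ToyMM:
    """Toy model of one process's pair of top-level page tables."""

    def __init__(self):
        self.kernel_pgd = [None] * PTRS_PER_PGD
        self.user_pgd = [None] * PTRS_PER_PGD

    def set_pgd(self, index, lower_table):
        """Install a PGD entry in the kernel copy, mirroring user entries."""
        if index < USER_PGD_ENTRIES:
            # Userspace mapping: the user copy gets a usable entry, while
            # the kernel copy is marked NX so that running userspace code
            # on the kernel page tables faults immediately.
            self.user_pgd[index] = PgdEntry(lower_table, nx=False)
            self.kernel_pgd[index] = PgdEntry(lower_table, nx=True)
        else:
            # Kernel mapping: present only in the kernel copy.
            self.kernel_pgd[index] = PgdEntry(lower_table, nx=False)


mm = ToyMM()
user_tables = {"pte": "not-dirty"}
mm.set_pgd(0, user_tables)                 # a userspace mapping
mm.set_pgd(300, {"pte": "kernel-data"})    # a kernel-only mapping

# Both copies point at the same lower-level tables, so an update made
# through one view is visible through the other.
mm.kernel_pgd[0].lower_table["pte"] = "dirty"
```

Because only the top-level entry is duplicated, the model never copies the
lower-level tables: both PGDs hold a reference to the same object.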
 62 Overhead
 63 ========
 64 
 65 Protection against side-channel attacks is important.  But,
 66 this protection comes at a cost:
 67 
 68 1. Increased Memory Use
 69 
 70   a. Each process now needs an order-1 PGD instead of order-0.
 71      (Consumes an additional 4k per process).
 72   b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
 73      aligned so that it can be mapped by setting a single PMD
 74      entry.  This consumes nearly 2MB of RAM once the kernel
 75      is decompressed, but no space in the kernel image itself.
 76 
 77 2. Runtime Cost
 78 
 79   a. CR3 manipulation to switch between the page table copies
 80      must be done at interrupt, syscall, and exception entry
 81      and exit (though it can be skipped when the kernel itself
 82      is interrupted).  Moves to CR3 cost on the order of a hundred
 83      cycles each, and are required at every such entry and exit.
 84   b. The percpu TSS is mapped into the user page tables to allow the
 85      SYSCALL64 path to work under PTI.  This doesn't have a direct runtime
 86      cost, but it can be argued it opens certain timing attack scenarios.
 87   c. Global pages are disabled for all kernel structures not
 88      mapped into both kernel and userspace page tables.  This
 89      feature of the MMU allows different processes to share TLB
 90      entries mapping the kernel.  Losing the feature means more
 91      TLB misses after a context switch.  The actual loss of
 92      performance is very small, however, never exceeding 1%.
 93   d. Process Context IDentifiers (PCID) is a CPU feature that
 94      allows us to skip flushing the entire TLB when switching page
 95      tables by setting a special bit in CR3 when the page tables
 96      are changed.  This makes switching the page tables (at context
 97      switch, or kernel entry/exit) cheaper.  But, on systems with
 98      PCID support, the context switch code must flush both the user
 99      and kernel entries out of the TLB.  The user PCID TLB flush is
100      deferred until the exit to userspace, minimizing the cost.
101      See intel.com/sdm for the gory PCID/INVPCID details.
102   e. The userspace page tables must be populated for each new
103      process.  Even without PTI, the shared kernel mappings
104      are created by copying top-level (PGD) entries into each
105      new process.  But, with PTI, there are now *two* kernel
106      mappings: one in the kernel page tables that maps everything
107      and one for the entry/exit structures.  At fork(), we need to
108      copy both.
109   f. In addition to the fork()-time copying, there must also
110      be an update to the userspace PGD any time a set_pgd() is done
111      on a PGD used to map userspace.  This ensures that the kernel
112      and userspace copies always map the same userspace
113      memory.
114   g. On systems without PCID support, each CR3 write flushes
115      the entire TLB.  That means that each syscall, interrupt
116      or exception flushes the TLB.
117   h. INVPCID is a TLB-flushing instruction which allows flushing
118      of TLB entries for non-current PCIDs.  Some systems support
119      PCIDs, but do not support INVPCID.  On these systems, addresses
120      can only be flushed from the TLB for the current PCID.  When
121      flushing a kernel address, we need to flush all PCIDs, so a
122      single kernel address flush will require a TLB-flushing CR3
123      write upon the next use of every PCID.
124 
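To make items 2.d and 2.g more concrete, here is a hedged sketch of how a
CR3 value could be composed.  The architectural part is that, with PCID
enabled, the low 12 bits of CR3 carry the PCID, and bit 63 of the value
written asks the CPU not to flush the TLB entries tagged with that PCID.
Everything else is an assumption for illustration: the helper names, the
convention that the user page tables sit 4k above the kernel ones
(consistent with the order-1 PGD in item 1.a), and the choice of one PCID
bit to tell user entries from kernel entries.

```python
PAGE_SIZE = 4096
PCID_MASK = 0xFFF            # CR3 bits 11:0 hold the PCID
CR3_NOFLUSH = 1 << 63        # set on write: keep TLB entries for this PCID
USER_PCID_BIT = 1 << 11      # assumed: marks the user half of a PCID pair


def build_cr3(pgd_phys, pcid, noflush):
    """Compose a CR3 value from a page-aligned PGD address and a PCID."""
    if pgd_phys % PAGE_SIZE:
        raise ValueError("PGD must be page aligned")
    if pcid & ~PCID_MASK:
        raise ValueError("PCID must fit in 12 bits")
    value = pgd_phys | pcid
    if noflush:
        value |= CR3_NOFLUSH
    return value


def user_cr3(kernel_pgd_phys, kernel_pcid, noflush):
    """Derive the user CR3 from the kernel one (assumed layout)."""
    return build_cr3(kernel_pgd_phys + PAGE_SIZE,
                     kernel_pcid | USER_PCID_BIT,
                     noflush)


kernel = build_cr3(0x1234000, pcid=1, noflush=True)
user = user_cr3(0x1234000, kernel_pcid=1, noflush=True)
```

Under these assumptions the two values differ only in the page-table
address bit and the user PCID bit.  On a system without PCID there is no
PCID field or no-flush bit to use, which is why every such CR3 write
flushes the whole TLB there.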
125 Possible Future Work
126 ====================
127 1. We can be more careful about not actually writing to CR3
128    unless its value is actually changed.
129 2. Allow PTI to be enabled/disabled at runtime in addition to the
130    boot-time switching.
131 
132 Testing
133 =======
134 
135 To test stability of PTI, the following test procedure is recommended,
136 ideally doing all of these in parallel:
137 
138 1. Set CONFIG_DEBUG_ENTRY=y
139 2. Run several copies of all of the tools/testing/selftests/x86/ tests
140    (excluding MPX and protection_keys) in a loop on multiple CPUs for
141    several minutes.  These tests frequently uncover corner cases in the
142    kernel entry code.  In general, old kernels might cause these tests
143    themselves to crash, but they should never crash the kernel.
144 3. Run the 'perf' tool in a mode (top or record) that generates many
145    frequent performance monitoring non-maskable interrupts (see "NMI"
146    in /proc/interrupts).  This exercises the NMI entry/exit code which
147    is known to trigger bugs in code paths that did not expect to be
148    interrupted, including nested NMIs.  Using "-c" boosts the rate of
149    NMIs, and using two -c with separate counters encourages nested NMIs
150    and less deterministic behavior.
151    ::
152 
153         while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done
154 
155 4. Launch a KVM virtual machine.
156 5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
157    This has been a lightly-tested code path and needs extra scrutiny.
158 
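When following step 3, it can help to confirm that NMIs are actually
arriving at the expected rate.  The sketch below parses the "NMI" row of
/proc/interrupts text and computes how many NMIs were delivered between
two snapshots.  The function names and the sample text are made up for
illustration, and the parser assumes the usual layout of a header row
naming the CPUs followed by one row per interrupt source.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts found in /proc/interrupts text."""
    lines = interrupts_text.splitlines()
    if not lines:
        return []
    ncpus = len(lines[0].split())          # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[0] == "NMI:":
            return [int(f) for f in fields[1:1 + ncpus]]
    return []


def nmi_delta(before, after):
    """Total number of NMIs delivered between two snapshots."""
    return sum(nmi_counts(after)) - sum(nmi_counts(before))


# Hypothetical snapshots, shaped like /proc/interrupts on a 2-CPU system.
SAMPLE_BEFORE = """\
           CPU0       CPU1
  0:         36          0   IO-APIC   2-edge      timer
NMI:         12         15   Non-maskable interrupts
"""
SAMPLE_AFTER = """\
           CPU0       CPU1
  0:         36          0   IO-APIC   2-edge      timer
NMI:        512        415   Non-maskable interrupts
"""
```

In practice the two snapshots would come from reading /proc/interrupts
before and after a perf run; a delta near zero suggests the test is not
exercising the NMI entry/exit paths.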
159 Debugging
160 =========
161 
162 Bugs in PTI cause a few different signatures of crashes
163 that are worth noting here.
164 
165  * Failures of the selftests/x86 code.  Usually caused by a bug in one
166    of the more obscure corners of entry_64.S.
167  * Crashes in early boot, especially around CPU bringup.  Bugs
168    in the mappings cause these.
169  * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
170    like screwing up a page table switch.  Also caused by
171    incorrectly mapping the IRQ handler entry code.
172  * Crashes at the first NMI.  The NMI code is separate from main
173    interrupt handlers and can have bugs that do not affect
174    normal interrupts.  Also caused by incorrectly mapping NMI
175    code.  NMIs that interrupt the entry code must be very
176    careful and can be the cause of crashes that show up when
177    running perf.
178  * Kernel crashes at the first exit to userspace.  Caused by entry_64.S
179    bugs, or by failing to map some of the exit code.
180  * Crashes at the first interrupt that interrupts userspace.  The paths
181    in entry_64.S that return to userspace are sometimes separate
182    from the ones that return to the kernel.
183  * Double faults: overflowing the kernel stack because of page
184    faults upon page faults.  Caused by touching non-pti-mapped
185    data in the entry code, or forgetting to switch to kernel
186    CR3 before calling into C functions which are not pti-mapped.
187  * Userspace segfaults early in boot, sometimes manifesting
188    as mount(8) failing to mount the rootfs.  These have
189    tended to be TLB invalidation issues.  Usually invalidating
190    the wrong PCID, or otherwise missing an invalidation.
191 
192 .. [1] https://gruss.cc/files/kaiser.pdf
193 .. [2] https://meltdownattack.com/meltdown.pdf
