.. SPDX-License-Identifier: GPL-2.0

==========================
Page Table Isolation (PTI)
==========================

Overview
========

Page Table Isolation (pti, previously known as
countermeasure against attacks on the shared u
space such as the "Meltdown" approach [2]_.

To mitigate this class of attacks, we create a
page tables for use only when running userspac
the kernel is entered via syscalls, interrupts
page tables are switched to the full "kernel"
switches back to user mode, the user copy is u

The userspace page tables contain only a minim
data: only what is needed to enter/exit the ke
entry/exit functions themselves and the interr
(IDT). There are a few strictly unnecessary t
such as the first C function when entering an
comments in pti.c).

This approach helps to ensure that side-channe
the paging structures do not function when PTI
enabled by setting CONFIG_MITIGATION_PAGE_TABL
time. Once enabled at compile-time, it can be
the 'nopti' or 'pti=' kernel parameters (see k

Page Table Management
=====================

When PTI is enabled, the kernel manages two se
The first set is very similar to the single se
kernels without PTI. This includes a complete
that the kernel can use for things like copy_t

Although _complete_, the user portion of the k
crippled by setting the NX bit in the top leve
that any missed kernel->user CR3 switch will i
userspace upon executing its first instruction

The userspace page tables map only the kernel
and exit the kernel. This data is entirely co
cpu_entry_area' structure which is placed in t
each CPU's copy of the area a compile-time-fix

For new userspace mappings, the kernel makes t
page tables like normal. The only difference
makes entries in the top (PGD) level. In addi
entry in the main kernel PGD, a copy of the en
userspace page tables' PGD.
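
The NX marking and the PGD-level copy described above can be illustrated
with a toy model. This is a minimal Python sketch, not kernel code: the
names ``ToyMm`` and ``set_pgd``, the 512-entry table, and the split of the
table into a lower (userspace) and upper (kernel) half are assumptions
chosen for illustration. Only the "present" and "no-execute" bit positions
are taken from the x86 page table entry format. The sketch also leaves out
the entry/exit mappings that the userspace tables need.

```python
# Toy model of PGD mirroring under PTI. All names and table sizes are
# illustrative assumptions, not kernel APIs.

PTRS_PER_PGD = 512        # entries in one top-level table (assumed)
USER_PGD_ENTRIES = 256    # assumption: the lower half maps userspace
PRESENT = 1 << 0          # x86 "present" bit
NX = 1 << 63              # x86 "no-execute" bit


class ToyMm:
    """Holds the two top-level tables a process has when PTI is on."""

    def __init__(self):
        self.kernel_pgd = [0] * PTRS_PER_PGD
        self.user_pgd = [0] * PTRS_PER_PGD

    def set_pgd(self, index, entry):
        """Install a top-level entry, mirroring userspace entries."""
        if index < USER_PGD_ENTRIES:
            # Userspace range: the user copy receives the entry unchanged.
            self.user_pgd[index] = entry
            # The kernel copy keeps the mapping, so the kernel can still
            # access user memory, but it is marked non-executable.
            if entry & PRESENT:
                entry |= NX
        # Entries outside the userspace range go only to the kernel copy.
        self.kernel_pgd[index] = entry


mm = ToyMm()
mm.set_pgd(3, 0x1000 | PRESENT)      # a userspace mapping
mm.set_pgd(300, 0x2000 | PRESENT)    # a kernel-only mapping
```

In this model, running userspace code on the kernel copy of the tables
would fault on the first instruction fetch, because every present
userspace entry in ``kernel_pgd`` carries the NX bit.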

This sharing at the PGD level also inherently
layers of the page tables. This leaves a sing
userspace page tables to manage. One PTE to l
accessed bits, dirty bits, etc...

Overhead
========

Protection against side-channel attacks is imp
this protection comes at a cost:

1. Increased Memory Use

  a. Each process now needs an order-1 PGD ins
     (Consumes an additional 4k per process).
  b. The 'cpu_entry_area' structure must be 2M
     aligned so that it can be mapped by setti
     entry. This consumes nearly 2MB of RAM o
     is decompressed, but no space in the kern

2. Runtime Cost

  a. CR3 manipulation to switch between the pa
     must be done at interrupt, syscall, and e
     and exit (it can be skipped when the kern
     though.) Moves to CR3 are on the order o
     cycles, and are required at every entry a
  b. Percpu TSS is mapped into the user page t
     to work under PTI. This doesn't have a di
     be argued it opens certain timing attack
  c. Global pages are disabled for all kernel
     mapped into both kernel and userspace pag
     feature of the MMU allows different proce
     entries mapping the kernel. Losing the f
     TLB misses after a context switch. The a
     performance is very small, however, never
  d. Process Context IDentifiers (PCID) is a C
     allows us to skip flushing the entire TLB
     tables by setting a special bit in CR3 wh
     are changed. This makes switching the pa
     switch, or kernel entry/exit) cheaper. B
     PCID support, the context switch code mus
     and kernel entries out of the TLB. The u
     deferred until the exit to userspace, min
     See intel.com/sdm for the gory PCID/INVPC
  e. The userspace page tables must be populat
     process. Even without PTI, the shared ke
     are created by copying top-level (PGD) en
     new process. But, with PTI, there are no
     mappings: one in the kernel page tables t
     and one for the entry/exit structures. A
     copy both.
  f. In addition to the fork()-time copying, t
     be an update to the userspace PGD any tim
     on a PGD used to map userspace. This ens
     and userspace copies always map the same
     memory.
  g. On systems without PCID support, each CR3
     the entire TLB. That means that each sys
     or exception flushes the TLB.
  h. INVPCID is a TLB-flushing instruction whi
     of TLB entries for non-current PCIDs. So
     PCIDs, but do not support INVPCID. On th
     can only be flushed from the TLB for the
     flushing a kernel address, we need to flu
     single kernel address flush will require
     write upon the next use of every PCID.

Possible Future Work
====================

1. We can be more careful about not actually w
   unless its value is actually changed.
2. Allow PTI to be enabled/disabled at runtime
   boot-time switching.

Testing
=======

To test stability of PTI, the following test p
ideally doing all of these in parallel:

1. Set CONFIG_DEBUG_ENTRY=y
2. Run several copies of all of the tools/test
   (excluding MPX and protection_keys) in a lo
   several minutes. These tests frequently un
   kernel entry code. In general, old kernels
   themselves to crash, but they should never
3. Run the 'perf' tool in a mode (top or recor
   frequent performance monitoring non-maskabl
   in /proc/interrupts). This exercises the N
   is known to trigger bugs in code paths that
   interrupted, including nested NMIs. Using
   NMIs, and using two -c with separate counte
   and less deterministic behavior.
   ::

      while true; do perf record -c 10000 -e

4. Launch a KVM virtual machine.
5. Run 32-bit binaries on systems supporting t
   This has been a lightly-tested code path an

Debugging
=========

Bugs in PTI cause a few different signatures o
that are worth noting here.

* Failures of the selftests/x86 code. Usuall
  more obscure corners of entry_64.S
* Crashes in early boot, especially around CP
  in the mappings cause these.
* Crashes at the first interrupt. Caused by
  like screwing up a page table switch. Also
  incorrectly mapping the IRQ handler entry c
* Crashes at the first NMI. The NMI code is
  interrupt handlers and can have bugs that d
  normal interrupts. Also caused by incorrec
  code. NMIs that interrupt the entry code m
  careful and can be the cause of crashes tha
  running perf.
* Kernel crashes at the first exit to userspa
  bugs, or failing to map some of the exit co
* Crashes at first interrupt that interrupts
  in entry_64.S that return to userspace are
  from the ones that return to the kernel.
* Double faults: overflowing the kernel stack
  faults upon page faults. Caused by touchin
  data in the entry code, or forgetting to sw
  CR3 before calling into C functions which a
* Userspace segfaults early in boot, sometime
  as mount(8) failing to mount the rootfs. T
  tended to be TLB invalidation issues. Usua
  the wrong PCID, or otherwise missing an inv

.. [1] https://gruss.cc/files/kaiser.pdf
.. [2] https://meltdownattack.com/meltdown.pdf