.. SPDX-License-Identifier: GPL-2.0

==========
Nested VMX
==========

Overview
--------

On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
to easily and efficiently run guest operating systems. Normally, these guests
*cannot* themselves be hypervisors running their own guests, because in VMX,
guests cannot use VMX instructions.

The "Nested VMX" feature adds this missing capability - of running guest
hypervisors (which use VMX) with their own nested guests. It does so by
allowing a guest to use VMX instructions, and correctly and efficiently
emulating them using the single level of VMX available in the hardware.

We describe in much greater detail the theory behind the nested VMX feature,
its implementation and its performance characteristics, in the OSDI 2010 paper
"The Turtles Project: Design and Implementation of Nested Virtualization",
available at:

	https://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf


Terminology
-----------

Single-level virtualization has two levels - the host (KVM) and the guests.
In nested virtualization, we have three levels: the host (KVM), which we call
L0, the guest hypervisor, which we call L1, and its nested guest, which we
call L2.


Running nested VMX
------------------

The nested VMX feature is enabled by default since Linux kernel v4.20. For
older Linux kernels, it can be enabled by giving the "nested=1" option to the
kvm-intel module.


No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
explicitly enabled, by giving qemu one of the following options:

- ``-cpu host`` (emulated CPU has all features of the real CPU)

- ``-cpu qemu64,+vmx`` (add just the vmx feature to a named CPU type)


ABIs
----

Nested VMX aims to present a standard and (eventually) fully-functional VMX
implementation for a guest hypervisor to use. As such, the official
specification of the ABI that it provides is Intel's VMX specification,
namely volume 3B of their "Intel 64 and IA-32 Architectures Software
Developer's Manual". Not all of VMX's features are currently fully supported,
but the goal is to eventually support them all, starting with the VMX features
which are used in practice by popular hypervisors (KVM and others).

As a VMX implementation, nested VMX presents a VMCS structure to L1.
As mandated by the spec, other than the two fields revision_id and abort,
this structure is *opaque* to its user, who is not supposed to know or care
about its internal structure. Rather, the structure is accessed through the
VMREAD and VMWRITE instructions.
Still, for debugging purposes, KVM developers might be interested to know the
internals of this structure; this is struct vmcs12 from arch/x86/kvm/vmx.c.

The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we
also have "vmcs01", the VMCS that L0 built for L1, and "vmcs02" is the VMCS
which L0 builds to actually run L2 - how this is done is explained in the
aforementioned paper.

For convenience, we repeat the content of struct vmcs12 here. If the internals
of this structure change, this can break live migration across KVM versions.
VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner
struct shadow_vmcs is ever changed.

::

	typedef u64 natural_width;
	struct __packed vmcs12 {
		/* According to the Intel spec, a VMCS region must start with
		 * these two user-visible fields */
		u32 revision_id;
		u32 abort;

		u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
		u32 padding[7]; /* room for future expansion */

		u64 io_bitmap_a;
		u64 io_bitmap_b;
		u64 msr_bitmap;
		u64 vm_exit_msr_store_addr;
		u64 vm_exit_msr_load_addr;
		u64 vm_entry_msr_load_addr;
		u64 tsc_offset;
		u64 virtual_apic_page_addr;
		u64 apic_access_addr;
		u64 ept_pointer;
		u64 guest_physical_address;
		u64 vmcs_link_pointer;
		u64 guest_ia32_debugctl;
		u64 guest_ia32_pat;
		u64 guest_ia32_efer;
		u64 guest_pdptr0;
		u64 guest_pdptr1;
		u64 guest_pdptr2;
		u64 guest_pdptr3;
		u64 host_ia32_pat;
		u64 host_ia32_efer;
		u64 padding64[8]; /* room for future expansion */
		natural_width cr0_guest_host_mask;
		natural_width cr4_guest_host_mask;
		natural_width cr0_read_shadow;
		natural_width cr4_read_shadow;
		natural_width dead_space[4]; /* Last remnants of cr3_target_value[0-3]. */
		natural_width exit_qualification;
		natural_width guest_linear_address;
		natural_width guest_cr0;
		natural_width guest_cr3;
		natural_width guest_cr4;
		natural_width guest_es_base;
		natural_width guest_cs_base;
		natural_width guest_ss_base;
		natural_width guest_ds_base;
		natural_width guest_fs_base;
		natural_width guest_gs_base;
		natural_width guest_ldtr_base;
		natural_width guest_tr_base;
		natural_width guest_gdtr_base;
		natural_width guest_idtr_base;
		natural_width guest_dr7;
		natural_width guest_rsp;
		natural_width guest_rip;
		natural_width guest_rflags;
		natural_width guest_pending_dbg_exceptions;
		natural_width guest_sysenter_esp;
		natural_width guest_sysenter_eip;
		natural_width host_cr0;
		natural_width host_cr3;
		natural_width host_cr4;
		natural_width host_fs_base;
		natural_width host_gs_base;
		natural_width host_tr_base;
		natural_width host_gdtr_base;
		natural_width host_idtr_base;
		natural_width host_ia32_sysenter_esp;
		natural_width host_ia32_sysenter_eip;
		natural_width host_rsp;
		natural_width host_rip;
		natural_width paddingl[8]; /* room for future expansion */
		u32 pin_based_vm_exec_control;
		u32 cpu_based_vm_exec_control;
		u32 exception_bitmap;
		u32 page_fault_error_code_mask;
		u32 page_fault_error_code_match;
		u32 cr3_target_count;
		u32 vm_exit_controls;
		u32 vm_exit_msr_store_count;
		u32 vm_exit_msr_load_count;
		u32 vm_entry_controls;
		u32 vm_entry_msr_load_count;
		u32 vm_entry_intr_info_field;
		u32 vm_entry_exception_error_code;
		u32 vm_entry_instruction_len;
		u32 tpr_threshold;
		u32 secondary_vm_exec_control;
		u32 vm_instruction_error;
		u32 vm_exit_reason;
		u32 vm_exit_intr_info;
		u32 vm_exit_intr_error_code;
		u32 idt_vectoring_info_field;
		u32 idt_vectoring_error_code;
		u32 vm_exit_instruction_len;
		u32 vmx_instruction_info;
		u32 guest_es_limit;
		u32 guest_cs_limit;
		u32 guest_ss_limit;
		u32 guest_ds_limit;
		u32 guest_fs_limit;
		u32 guest_gs_limit;
		u32 guest_ldtr_limit;
		u32 guest_tr_limit;
		u32 guest_gdtr_limit;
		u32 guest_idtr_limit;
		u32 guest_es_ar_bytes;
		u32 guest_cs_ar_bytes;
		u32 guest_ss_ar_bytes;
		u32 guest_ds_ar_bytes;
		u32 guest_fs_ar_bytes;
		u32 guest_gs_ar_bytes;
		u32 guest_ldtr_ar_bytes;
		u32 guest_tr_ar_bytes;
		u32 guest_interruptibility_info;
		u32 guest_activity_state;
		u32 guest_sysenter_cs;
		u32 host_ia32_sysenter_cs;
		u32 padding32[8]; /* room for future expansion */
		u16 virtual_processor_id;
		u16 guest_es_selector;
		u16 guest_cs_selector;
		u16 guest_ss_selector;
		u16 guest_ds_selector;
		u16 guest_fs_selector;
		u16 guest_gs_selector;
		u16 guest_ldtr_selector;
		u16 guest_tr_selector;
		u16 host_es_selector;
		u16 host_cs_selector;
		u16 host_ss_selector;
		u16 host_ds_selector;
		u16 host_fs_selector;
		u16 host_gs_selector;
		u16 host_tr_selector;
	};


Authors
-------

These patches were written by:
    - Abel Gordon, abelg <at> il.ibm.com
    - Nadav Har'El, nyh <at> il.ibm.com
    - Orit Wasserman, oritw <at> il.ibm.com
    - Ben-Ami Yassor, benami <at> il.ibm.com
    - Muli Ben-Yehuda, muli <at> il.ibm.com

With contributions by:
    - Anthony Liguori, aliguori <at> us.ibm.com
    - Mike Day, mdday <at> us.ibm.com
    - Michael Factor, factor <at> il.ibm.com
    - Zvi Dubitzky, dubi <at> il.ibm.com

And valuable reviews by:
    - Avi Kivity, avi <at> redhat.com
    - Gleb Natapov, gleb <at> redhat.com
    - Marcelo Tosatti, mtosatti <at> redhat.com
    - Kevin Tian, kevin.tian <at> intel.com
    - and others.