.. SPDX-License-Identifier: GPL-2.0

==============================
Running nested guests with KVM
==============================

A nested guest is a guest that runs inside another guest (the guest
hypervisor can be KVM-based or a different hypervisor). The
straightforward example is a KVM guest that in turn runs on a KVM guest
(the rest of this document is built on this example)::

              .----------------.  .----------------.
              |                |  |                |
              |      L2        |  |      L2        |
              | (Nested Guest) |  | (Nested Guest) |
              |                |  |                |
              |----------------'--'----------------|
              |                                     |
              |       L1 (Guest Hypervisor)         |
              |          KVM (/dev/kvm)             |
              |                                     |
      .------------------------------------------------------.
      |                  L0 (Host Hypervisor)                 |
      |                     KVM (/dev/kvm)                    |
      |------------------------------------------------------|
      |        Hardware (with virtualization extensions)     |
      '------------------------------------------------------'

Terminology:

- L0 – level-0; the bare metal host, running KVM

- L1 – level-1 guest; a VM running on L0; also called the "guest
  hypervisor", as it itself is capable of running KVM.

- L2 – level-2 guest; a VM running on L1, this is the "nested guest"

.. note:: The above diagram is modelled after the x86 architecture;
          s390x, ppc64 and other architectures are likely to have
          a different design for nesting.

          For example, s390x always has an LPAR (LogicalPARtition)
          hypervisor running on bare metal, adding another layer and
          resulting in at least four levels in a nested setup — L0 (bare
          metal, running the LPAR hypervisor), L1 (host hypervisor), L2
          (guest hypervisor), L3 (nested guest).

          This document will stick with the three-level terminology (L0,
          L1, and L2) for all architectures, and will largely focus on
          x86.


Use Cases
---------

There are several scenarios where nested KVM can be useful, to name a
few:

- As a developer, you want to test your software on different operating
  systems (OSes). Instead of renting multiple VMs from a Cloud
  Provider, using nested KVM lets you rent a large enough "guest
  hypervisor" (level-1 guest). This in turn allows you to create
  multiple nested guests (level-2 guests), running different OSes, on
  which you can develop and test your software.

- Live migration of "guest hypervisors" and their nested guests, for
  load balancing, disaster recovery, etc.

- VM image creation tools (e.g. ``virt-install``, etc) often run
  their own VM, and users expect these to work inside a VM.

- Some OSes use virtualization internally for security (e.g. to let
  applications run safely in isolation).


Enabling "nested" (x86)
-----------------------

From Linux kernel v4.20 onwards, the ``nested`` KVM parameter is enabled
by default for Intel and AMD. (Though your Linux distribution might
override this default.)

In case you are running a Linux kernel older than v4.20, to enable
nesting, set the ``nested`` KVM module parameter to ``Y`` or ``1``. To
persist this setting across reboots, you can add it in a config file, as
shown below:

1. On the bare metal host (L0), list the kernel modules and ensure that
   the KVM modules are loaded::

    $ lsmod | grep -i kvm
    kvm_intel             133627  0
    kvm                   435079  1 kvm_intel

2. Show information for the ``kvm_intel`` module::

    $ modinfo kvm_intel | grep -i nested
    parm:           nested:bool

3. For the nested KVM configuration to persist across reboots, place the
   below in ``/etc/modprobe.d/kvm_intel.conf`` (create the file if it
   doesn't exist)::

    $ cat /etc/modprobe.d/kvm_intel.conf
    options kvm-intel nested=y

4. Unload and reload the KVM Intel module::

    $ sudo rmmod kvm-intel
    $ sudo modprobe kvm-intel

5. Verify that the ``nested`` parameter for KVM is enabled::

    $ cat /sys/module/kvm_intel/parameters/nested
    Y

For AMD hosts, the process is the same as above, except that the module
name is ``kvm-amd``; an example follows below.
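
For instance, the persistence and verification steps on an AMD host
could look as follows (a sketch mirroring steps 3-5 above; the config
file name under ``/etc/modprobe.d/`` is arbitrary, the parameter
directory is spelled ``kvm_amd`` with an underscore, and the parameter
may read back as ``1`` rather than ``Y``)::

  $ cat /etc/modprobe.d/kvm_amd.conf
  options kvm-amd nested=1

  $ sudo rmmod kvm-amd
  $ sudo modprobe kvm-amd

  $ cat /sys/module/kvm_amd/parameters/nested
  1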


Additional nested-related kernel parameters (x86)
-------------------------------------------------

If your hardware is sufficiently advanced (an Intel Haswell processor or
later, which has newer hardware virt extensions), the following
additional features will also be enabled by default: "Shadow VMCS
(Virtual Machine Control Structure)" and APIC Virtualization on your
bare metal host (L0). Parameters for Intel hosts::

  $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
  Y

  $ cat /sys/module/kvm_intel/parameters/enable_apicv
  Y

  $ cat /sys/module/kvm_intel/parameters/ept
  Y

.. note:: If you suspect your L2 (i.e. nested guest) is running slower,
          ensure that the above are enabled (particularly
          ``enable_shadow_vmcs`` and ``ept``).


Starting a nested guest (x86)
-----------------------------

Once your bare metal host (L0) is configured for nesting, you should be
able to start an L1 guest with::

  $ qemu-kvm -cpu host [...]

The above passes the host CPU's capabilities as-is to the guest. For
better live-migration compatibility, use a named CPU model supported by
QEMU instead, e.g.::

  $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on

Either way, the guest hypervisor will subsequently be capable of
running a nested guest with accelerated KVM.
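
To confirm from inside the L1 guest that the virtualization extension
was indeed passed through, check for the ``vmx`` (Intel) or ``svm``
(AMD) CPU flag; a minimal check (the count shown is just an example;
any non-zero value, typically the number of vCPUs, means L1 can run
KVM-accelerated guests)::

  $ grep -cw -e vmx -e svm /proc/cpuinfo
  4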


Enabling "nested" (s390x)
-------------------------

1. On the host hypervisor (L0), enable the ``nested`` parameter on
   s390x::

    $ rmmod kvm
    $ modprobe kvm nested=1

.. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive
          with the ``nested`` parameter — i.e. to be able to enable
          ``nested``, the ``hpage`` parameter *must* be disabled.

2. The guest hypervisor (L1) must be provided with the ``sie`` CPU
   feature — with QEMU, this can be done by using "host passthrough"
   (via the command-line ``-cpu host``).

3. Now the KVM module can be loaded in L1 (the guest hypervisor)::

    $ modprobe kvm
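
As on x86, you can read the module parameter back to confirm that the
setting took effect (a sketch; on s390x the ``nested`` parameter belongs
to the core ``kvm`` module, and it may read back as ``1`` rather than
``Y``)::

  $ cat /sys/module/kvm/parameters/nested
  1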


Live migration with nested KVM
------------------------------

Migrating an L1 guest, with a *live* nested guest in it, to another
bare metal host works as of Linux kernel 5.3 and QEMU 4.2.0 for
Intel x86 systems, and even on older versions for s390x.

On AMD systems, once an L1 guest has started an L2 guest, the L1 guest
should no longer be migrated or saved (refer to QEMU documentation on
"savevm"/"loadvm") until the L2 guest shuts down. Attempting to migrate
or save-and-load an L1 guest while an L2 guest is running will result in
undefined behavior. You might see a ``kernel BUG!`` entry in ``dmesg``, a
kernel 'oops', or an outright kernel panic. Such a migrated or loaded L1
guest can no longer be considered stable or secure, and must be restarted.
Migrating an L1 guest merely configured to support nesting, while not
actually running L2 guests, is expected to function normally even on AMD
systems, but may fail once L2 guests are started.

Migrating an L2 guest is always expected to succeed, so all the following
scenarios should work even on AMD systems:

- Migrating a nested guest (L2) to another L1 guest on the *same* bare
  metal host.

- Migrating a nested guest (L2) to another L1 guest on a *different*
  bare metal host.

- Migrating a nested guest (L2) to a bare metal host.


Reporting bugs from nested setups
---------------------------------

Debugging "nested" problems can involve sifting through log files across
L0, L1 and L2; this can result in tedious back-and-forth between the bug
reporter and the bug fixer. When reporting such bugs, keep the following
in mind:

- Mention that you are in a "nested" setup. If you are running any kind
  of "nesting" at all, say so. Unfortunately, this needs to be called
  out because when reporting bugs, people tend to forget to even
  *mention* that they're using nested virtualization.

- Ensure you are actually running KVM on KVM. Sometimes people do not
  have KVM enabled for their guest hypervisor (L1), which means they are
  running with pure emulation, or what QEMU calls "TCG", while believing
  they run nested KVM; that is, they confuse "nested virt" (which could
  also mean QEMU on KVM) with "nested KVM" (KVM on KVM). One way to
  verify this is shown in the sketch below.
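
A quick verification from inside L1: the ``/dev/kvm`` device node must
exist, and a libvirt-managed nested guest must use the ``kvm`` domain
type rather than ``qemu`` (a sketch; ``l2-guest`` is a hypothetical
domain name)::

  $ ls /dev/kvm
  /dev/kvm

  $ virsh dumpxml l2-guest | grep '<domain type'
  <domain type='kvm' id='1'>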

Information to collect (generic)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following is not an exhaustive list, but a very good starting point:

- Kernel, libvirt, and QEMU version from L0

- Kernel, libvirt and QEMU version from L1

- QEMU command-line of L1 -- when using libvirt, you'll find it here:
  ``/var/log/libvirt/qemu/instance.log``

- QEMU command-line of L2 -- as above, when using libvirt, get the
  complete libvirt-generated QEMU command-line

- ``cat /proc/cpuinfo`` from L0

- ``cat /proc/cpuinfo`` from L1

- ``lscpu`` from L0

- ``lscpu`` from L1

- Full ``dmesg`` output from L0

- Full ``dmesg`` output from L1

x86-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both of the commands below, ``x86info`` and ``dmidecode``, should be
available on most Linux distributions with the same name:

- Output of: ``x86info -a`` from L0

- Output of: ``x86info -a`` from L1

- Output of: ``dmidecode`` from L0

- Output of: ``dmidecode`` from L1

s390x-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In addition to the generic details mentioned earlier, the following is
also recommended:

- ``/proc/sysinfo`` from L1; this will also include the info from L0
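
To make the back-and-forth quicker, several of the generic items above
can be captured in one pass on each level (a minimal sketch, to be run
once on L0 and once in L1; the output file name is arbitrary, and
``dmesg`` may require root depending on your distribution)::

  $ for cmd in "uname -r" "lscpu" "cat /proc/cpuinfo" "dmesg"; do
  >     echo "===== $cmd ====="
  >     $cmd
  > done > nested-debug-info.txt 2>&1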