.. SPDX-License-Identifier: GPL-2.0

PCI pass-thru devices
=========================
In a Hyper-V guest VM, PCI pass-thru devices (also called
virtual PCI devices, or vPCI devices) are physical PCI devices
that are mapped directly into the VM's physical address space.
Guest device drivers can interact directly with the hardware
without intermediation by the host hypervisor.  This approach
provides higher bandwidth access to the device with lower
latency, compared with devices that are virtualized by the
hypervisor.  The device should appear to the guest just as it
would when running on bare metal, so no changes are required
to the Linux device drivers for the device.

Hyper-V terminology for vPCI devices is "Discrete Device
Assignment" (DDA).  Public documentation for Hyper-V DDA is
available here: `DDA`_

.. _DDA: https://learn.microsoft.com/en-us/windows-server/virtualization/hyper-v/plan/plan-for-deploying-devices-using-discrete-device-assignment

DDA is typically used for storage controllers, such as NVMe,
and for GPUs.  A similar mechanism for NICs is called SR-IOV
and produces the same benefits by allowing a guest device
driver to interact directly with the hardware.  See Hyper-V
public documentation here: `SR-IOV`_

.. _SR-IOV: https://learn.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-single-root-i-o-virtualization--sr-iov-

This discussion of vPCI devices includes DDA and SR-IOV
devices.

Device Presentation
-------------------
Hyper-V provides full PCI functionality for a vPCI device when
it is operating, so the Linux device driver for the device can
be used unchanged, provided it uses the correct Linux kernel
APIs for accessing PCI config space and for other integration
with Linux.  But the initial detection of the PCI device and
its integration with the Linux PCI subsystem must use Hyper-V
specific mechanisms.  Consequently, vPCI devices on Hyper-V
have a dual identity.  They are initially presented to Linux
guests as VMBus devices via the standard VMBus "offer"
mechanism, so they have a VMBus identity and appear under
/sys/bus/vmbus/devices.  The VMBus vPCI driver in Linux at
drivers/pci/controller/pci-hyperv.c handles a newly introduced
vPCI device by fabricating a PCI bus topology and creating all
the normal PCI device data structures in Linux that would
exist if the PCI device were discovered via ACPI on a
bare-metal system.  Once those data structures are set up, the
device also has a normal PCI identity in Linux, and the normal
Linux device driver for the vPCI device can function as if it
were running in Linux on bare metal.  Because vPCI devices are
presented dynamically through the VMBus offer mechanism, they
do not appear in the Linux guest's ACPI tables.  vPCI devices
may be added to a VM or removed from a VM at any time during
the life of the VM, and not just during initial boot.

With this approach, the vPCI device is a VMBus device and a
PCI device at the same time.  In response to the VMBus offer
message, the hv_pci_probe() function runs and establishes a
VMBus connection to the vPCI VSP on the Hyper-V host.  That
connection has a single VMBus channel.  The channel is used to
exchange messages with the vPCI VSP for the purpose of setting
up and configuring the vPCI device in Linux.  Once the device
is fully configured in Linux as a PCI device, the VMBus
channel is used only if Linux changes the vCPU to be
interrupted in the guest, or if the vPCI device is removed
from the VM while the VM is running.  The ongoing operation of
the device happens directly between the Linux device driver
for the device and the hardware, with VMBus and the VMBus
channel playing no role.

PCI Device Setup
----------------
PCI device setup follows a sequence that Hyper-V originally
created for Windows guests, and that can be ill-suited for
Linux guests due to differences in the overall structure of
the Linux PCI subsystem compared with Windows.  Nonetheless,
with a bit of hackery in the Hyper-V virtual PCI driver for
Linux, the virtual PCI device is set up in Linux so that
generic Linux PCI subsystem code and the Linux driver for the
device "just work".

Each vPCI device is set up in Linux to be in its own PCI
domain with a host bridge.  The PCI domainID is derived from
bytes 4 and 5 of the instance GUID assigned to the VMBus vPCI
device.  The Hyper-V host does not guarantee that these bytes
are unique, so hv_pci_probe() has an algorithm to resolve
collisions.  The collision resolution is intended to be stable
across reboots of the same VM so that the PCI domainIDs don't
change, as the domainID appears in the user space
configuration of some devices.
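
A simplified sketch of how such a derivation and collision
search might be structured is shown below.  It is illustrative
only: example_get_dom_num() and its bitmap are hypothetical
names rather than the driver's actual code, and the candidate
value is assumed to come from bytes 4 and 5 of the VMBus
instance GUID.

.. code-block:: c

   #include <linux/bitmap.h>
   #include <linux/bitops.h>
   #include <linux/types.h>

   /* PCI domainIDs are 16 bits, so track 64K possible values. */
   #define EXAMPLE_DOM_MAP_SIZE    (1 << 16)
   static DECLARE_BITMAP(example_dom_map, EXAMPLE_DOM_MAP_SIZE);

   static u16 example_get_dom_num(u16 dom_req)
   {
           unsigned long bit;

           /* Preferred: the domainID derived from the instance GUID */
           if (test_and_set_bit(dom_req, example_dom_map) == 0)
                   return dom_req;

           /* Collision: fall back to the first unused domainID */
           for_each_clear_bit(bit, example_dom_map, EXAMPLE_DOM_MAP_SIZE) {
                   if (test_and_set_bit(bit, example_dom_map) == 0)
                           return bit;
           }

           return 0xFFFF;  /* no free domainID available */
   }

In the driver itself, the candidate value is formed from the
two GUID bytes (roughly ``b[5] << 8 | b[4]`` of the instance
GUID), and the fallback search is what keeps colliding devices
in distinct PCI domains.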

hv_pci_probe() allocates a guest MMIO range to be used as PCI
config space for the device.  This MMIO range is communicated
to the Hyper-V host over the VMBus channel as part of telling
the host that the device is ready to enter d0.  See
hv_pci_enter_d0().  When the guest subsequently accesses this
MMIO range, the Hyper-V host intercepts the accesses and maps
them to the physical device PCI config space.

hv_pci_probe() also gets BAR information for the device from
the Hyper-V host, and uses this information to allocate MMIO
space for the BARs.  That MMIO space is then set up to be
associated with the host bridge so that it works when generic
PCI subsystem code in Linux processes the BARs.

Finally, hv_pci_probe() creates the root PCI bus.  At this
point the Hyper-V virtual PCI driver hackery is done, and the
normal Linux PCI machinery for scanning the root bus works to
detect the device, to perform driver matching, and to
initialize the driver and device.

PCI Device Removal
------------------
A Hyper-V host may initiate removal of a vPCI device from a
guest VM at any time during the life of the VM.  The removal
is instigated by an admin action taken on the Hyper-V host and
is not under the control of the guest OS.

A guest VM is notified of the removal by an unsolicited
"Eject" message sent from the host to the guest over the VMBus
channel associated with the vPCI device.  Upon receipt of such
a message, the Hyper-V virtual PCI driver in Linux
asynchronously invokes Linux kernel PCI subsystem calls to
shutdown and remove the device.  When those calls are
complete, an "Ejection Complete" message is sent back to
Hyper-V over the VMBus channel indicating that the device has
been removed.  At this point, Hyper-V sends a VMBus rescind
message to the Linux guest, which the VMBus driver in Linux
processes by removing the VMBus identity for the device.  Once
that processing is complete, all vestiges of the device having
been present are gone from the Linux kernel.  The rescind
message also indicates to the guest that Hyper-V has stopped
providing support for the vPCI device in the guest.  If the
guest were to attempt to access that device's MMIO space, it
would be an invalid reference.  Hypercalls affecting the device
return errors, and any further messages sent in the VMBus
channel are ignored.

After sending the Eject message, Hyper-V allows the guest VM
60 seconds to cleanly shutdown the device and respond with
Ejection Complete before sending the VMBus rescind
message.  If for any reason the Eject steps don't complete
within the allowed 60 seconds, the Hyper-V host forcibly
performs the rescind steps, which will likely result in
cascading errors in the guest because the device is now no
longer present from the guest standpoint and accessing the
device MMIO space will fail.

Because ejection is asynchronous and can happen at any point
during the guest VM lifecycle, proper synchronization in the
Hyper-V virtual PCI driver is very tricky.  Ejection has been
observed even before a newly offered vPCI device has been
fully set up.  The Hyper-V virtual PCI driver has been updated
several times over the years to fix race conditions when
ejections happen at inopportune times.  Care must be taken when
modifying this code to prevent re-introducing such problems.
See comments in the code.

Interrupt Assignment
--------------------
The Hyper-V virtual PCI driver supports vPCI devices using
MSI, multi-MSI, or MSI-X.  Assigning the guest vCPU that will
receive the interrupt for a particular MSI or MSI-X message is
complex because of the way the Linux setup of IRQs maps onto
the Hyper-V interfaces.  For the single-MSI and MSI-X cases,
Linux calls hv_compose_msi_msg() twice, with the first call
containing a dummy vCPU and the second call containing the
real vCPU.  Furthermore, hv_irq_unmask() is finally called
(on x86) or the GICD registers are set (on arm64), specifying
the real vCPU again.  Each of these three calls interacts
with Hyper-V, which must decide which physical CPU should
receive the interrupt before it is forwarded to the guest VM.
Unfortunately, the Hyper-V decision-making process is a bit
limited, and can result in concentrating the physical
interrupts on a single CPU, causing a performance bottleneck.
See details about how this is resolved in the extensive
comment above the function hv_compose_msi_req_get_cpu().

The Hyper-V virtual PCI driver implements the
irq_chip.irq_compose_msi_msg function as hv_compose_msi_msg().
Unfortunately, on Hyper-V the implementation requires sending
a VMBus message to the Hyper-V host and awaiting an interrupt
indicating receipt of a reply message.  Since
irq_chip.irq_compose_msi_msg can be called with IRQ locks
held, it doesn't work to do the normal sleep until awakened by
the interrupt.  Instead hv_compose_msi_msg() must send the
VMBus message, and then poll for the completion message.  As
further complexity, the vPCI device could be ejected/rescinded
while the polling is in progress, so this scenario must be
detected as well.  See comments in the code regarding this
very tricky area.
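
The following sketch shows the general shape of that
send-then-poll pattern.  It is illustrative only:
send_and_poll() is not the driver's code, and the drain()
parameter stands in for the driver's channel callback
(hv_pci_onchannelcallback() in pci-hyperv.c), which is assumed
to fire the completion when the host's reply packet is seen.

.. code-block:: c

   #include <linux/completion.h>
   #include <linux/delay.h>
   #include <linux/hyperv.h>

   /*
    * Send a request on the vPCI device's VMBus channel, then
    * busy-poll for the reply rather than sleeping, because the
    * caller may hold IRQ locks.
    */
   static int send_and_poll(struct vmbus_channel *chan, void *msg,
                            u32 len, u64 req_id, struct completion *comp,
                            void (*drain)(struct vmbus_channel *chan))
   {
           int ret;

           ret = vmbus_sendpacket(chan, msg, len, req_id,
                                  VM_PKT_DATA_INBAND,
                                  VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
           if (ret)
                   return ret;

           while (!try_wait_for_completion(comp)) {
                   /*
                    * The real driver also checks here whether the vPCI
                    * device has been ejected/rescinded while polling,
                    * and gives up if so.
                    */
                   drain(chan);    /* process pending host packets */
                   udelay(100);    /* brief pause between poll attempts */
           }

           return 0;
   }

The actual implementation layers additional synchronization on
top of this pattern, such as keeping the channel's normal
callback from running concurrently with the poll loop; see the
comments in hv_compose_msi_msg().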

Most of the code in the Hyper-V virtual PCI driver
(pci-hyperv.c) applies to Hyper-V and Linux guests running on
x86 and on arm64 architectures.  But there are differences in
how interrupt assignments are managed.  On x86, the Hyper-V
virtual PCI driver in the guest must make a hypercall to tell
Hyper-V which guest vCPU should be interrupted by each
MSI/MSI-X interrupt, and the x86 interrupt vector number that
the x86_vector IRQ domain has picked for the interrupt.  This
hypercall is made by hv_arch_irq_unmask().  On arm64, the
Hyper-V virtual PCI driver manages the allocation of an SPI
for each MSI/MSI-X interrupt.  The Hyper-V virtual PCI driver
stores the allocated SPI in the architectural GICD registers,
which Hyper-V emulates, so no hypercall is necessary as with
x86.  Hyper-V does not support using LPIs for vPCI devices in
arm64 guest VMs because it does not emulate a GICv3 ITS.

The Hyper-V virtual PCI driver in Linux supports vPCI devices
whose drivers create managed or unmanaged Linux IRQs.  If the
smp_affinity for an unmanaged IRQ is updated via the /proc/irq
interface, the Hyper-V virtual PCI driver is called to tell
the Hyper-V host to change the interrupt targeting, and
everything works properly.  However, on x86 if the x86_vector
IRQ domain needs to reassign an interrupt vector due to
running out of vectors on a CPU, there's no path to inform the
Hyper-V host of the change, and things break.  Fortunately,
guest VMs operate in a constrained device environment where
using all the vectors on a CPU doesn't happen.  Since such a
problem is only a theoretical concern rather than a practical
concern, it has been left unaddressed.

DMA
---
By default, Hyper-V pins all guest VM memory in the host
when the VM is created, and programs the physical IOMMU to
allow the VM to have DMA access to all its memory.  Hence,
it is safe to assign PCI devices to the VM, and allow the
guest operating system to program the DMA transfers.  The
physical IOMMU prevents a malicious guest from initiating
DMA to memory belonging to the host or to other VMs on the
host.  From the Linux guest standpoint, such DMA transfers
are in "direct" mode since Hyper-V does not provide a virtual
IOMMU in the guest.

Hyper-V assumes that physical PCI devices always perform
cache-coherent DMA.  When running on x86, this behavior is
required by the architecture.  When running on arm64, the
architecture allows for both cache-coherent and
non-cache-coherent devices, with the behavior of each device
specified in the ACPI DSDT.  But when a PCI device is assigned
to a guest VM, that device does not appear in the guest's ACPI
tables.  The Hyper-V VMBus driver propagates cache-coherency
information from the VMBus node in the ACPI DSDT to all VMBus
devices, including vPCI devices (since they have a dual
identity as a VMBus device and as a PCI device).  See
vmbus_dma_configure().  Current Hyper-V versions always
indicate that the VMBus is cache coherent, so vPCI devices on
arm64 always get marked as cache coherent and the CPU does not
perform any sync operations as part of dma_map/unmap_*()
calls.

vPCI protocol versions
----------------------
As previously described, during vPCI device setup and
teardown, messages are passed over a VMBus channel between the
Hyper-V host and the Hyper-V vPCI driver in the Linux guest.
Some messages have been revised in newer versions of Hyper-V,
so the guest and host must agree on the vPCI protocol version
to be used.  The version is negotiated when communication over
the VMBus channel is first established.  See
hv_pci_protocol_negotiation().  Newer versions of the protocol
extend support to VMs with more than 64 vCPUs, and provide
additional information about the vPCI device, such as the
guest virtual NUMA node to which it is most closely affined in
the underlying hardware.

Guest NUMA node affinity
------------------------
When the vPCI protocol version provides it, the guest NUMA
node affinity of the vPCI device is stored as part of the
Linux device information for subsequent use by the Linux
driver.  See hv_pci_assign_numa_node().  If the negotiated
protocol version does not support the host providing NUMA
affinity information, the Linux guest defaults the device NUMA
node to 0.  But even when the negotiated protocol version
includes NUMA affinity information, the ability of the host to
provide such information depends on certain host configuration
options.  If the guest receives NUMA node value "0", it could
mean NUMA node 0, or it could mean "no information is
available".  Unfortunately it is not possible to distinguish
the two cases from the guest side.

PCI config space access in a CoCo VM
------------------------------------
Linux PCI device drivers access PCI config space using a
standard set of functions provided by the Linux PCI subsystem.
In Hyper-V guests these standard functions map to functions
hv_pcifront_read_config() and hv_pcifront_write_config()
in the Hyper-V virtual PCI driver.  In normal VMs, these
hv_pcifront_*() functions directly access the PCI config
space, and the accesses trap to Hyper-V to be handled.  But in
CoCo VMs, memory encryption prevents Hyper-V from reading the
guest instruction stream to emulate the access, so the
hv_pcifront_*() functions must make hypercalls with explicit
arguments describing the access to be made.
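
The difference amounts to a branch like the one sketched
below.  This is illustrative only and is not the driver's
code: coco_mmio_read() is a hypothetical stand-in for the
explicit-hypercall path, while the readl() path models the
normal-VM case where the access traps to Hyper-V for
emulation.

.. code-block:: c

   #include <linux/io.h>
   #include <linux/types.h>
   #include <asm/mshyperv.h>

   /* Hypothetical helper: read @size bytes via an explicit hypercall. */
   u32 coco_mmio_read(void __iomem *addr, int size);

   static u32 example_config_read32(void __iomem *cfg_win, int where)
   {
           /* CoCo (isolated) VM: the host cannot emulate the access. */
           if (hv_is_isolation_supported())
                   return coco_mmio_read(cfg_win + where, 4);

           /* Normal VM: the MMIO access traps to Hyper-V for emulation. */
           return readl(cfg_win + where);
   }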

Config Block back-channel
-------------------------
The Hyper-V host and Hyper-V virtual PCI driver in Linux
together implement a non-standard back-channel communication
path between the host and guest.  The back-channel path uses
messages sent over the VMBus channel associated with the vPCI
device.  The functions hyperv_read_cfg_blk() and
hyperv_write_cfg_blk() are the primary interfaces provided to
other parts of the Linux kernel.  As of this writing, these
interfaces are used only by the Mellanox mlx5 driver to pass
diagnostic data to a Hyper-V host running in the Azure public
cloud.  The functions hyperv_read_cfg_blk() and
hyperv_write_cfg_blk() are implemented in a separate module
(pci-hyperv-intf.c, under CONFIG_PCI_HYPERV_INTERFACE) that
effectively stubs them out when running in non-Hyper-V
environments.
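
A hedged sketch of how a kernel driver might call the
read-side interface is shown below.  The block ID and the
128-byte length are made-up example values; a real consumer
uses block IDs and sizes agreed with the Hyper-V host.

.. code-block:: c

   #include <linux/pci.h>
   #include <linux/pci-hyperv.h>

   /* Illustrative values only. */
   #define EXAMPLE_BLOCK_ID   1
   #define EXAMPLE_BLOCK_LEN  128

   static int example_read_diag_block(struct pci_dev *pdev, void *buf)
   {
           unsigned int bytes_returned;
           int ret;

           ret = hyperv_read_cfg_blk(pdev, buf, EXAMPLE_BLOCK_LEN,
                                     EXAMPLE_BLOCK_ID, &bytes_returned);
           if (ret)
                   return ret;   /* e.g., stubbed out when not on Hyper-V */

           return bytes_returned;
   }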