.. SPDX-License-Identifier: GPL-2.0

PCI pass-thru devices
=========================
In a Hyper-V guest VM, PCI pass-thru devices (also known as
virtual PCI devices, or vPCI devices) are physical PCI devices
that are mapped directly into the VM's physical address space.
Guest device drivers can interact directly with the hardware
without intermediation by the host hypervisor.  This approach
provides higher bandwidth access to the device with lower
latency, compared with devices that are virtualized by the
hypervisor.  The device should appear to the guest just as it
would when running on bare metal, so no changes are required
to the Linux device drivers for the device.

Hyper-V terminology for vPCI devices is "Discrete Device
Assignment" (DDA).  Public documentation for Hyper-V DDA is
available here: `DDA`_

.. _DDA: https://learn.microsoft.com/en-us/windows-server/virtualization/hyper-v/plan/plan-for-deploying-devices-using-discrete-device-assignment

DDA is typically used for storage controllers, such as NVMe,
and for GPUs.  A similar mechanism for NICs is called SR-IOV
and produces the same benefits by allowing a guest device
driver to interact directly with the hardware.  See Hyper-V
public documentation here: `SR-IOV`_

.. _SR-IOV: https://learn.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-single-root-i-o-virtualization--sr-iov-

This discussion of vPCI devices includes DDA and SR-IOV
devices.

Device Presentation
-------------------
Hyper-V provides full PCI functionality for a vPCI device when
it is operating, so the Linux device driver for the device can
be used unchanged, provided it uses the correct Linux kernel
APIs for accessing PCI config space and for other integration
with Linux.  But the initial detection of the PCI device and
its integration with the Linux PCI subsystem must use Hyper-V-
specific mechanisms.  Consequently, vPCI devices on Hyper-V
have a dual identity.  They are initially presented to Linux
guests as VMBus devices via the standard VMBus "offer"
mechanism, so they have a VMBus identity and appear under
/sys/bus/vmbus/devices.  The VMBus vPCI driver in Linux at
drivers/pci/controller/pci-hyperv.c handles a newly introduced
vPCI device by fabricating a PCI bus topology and creating all
the normal PCI device data structures in Linux that would
exist if the PCI device were discovered via ACPI on a bare-
metal system.  Once those data structures are set up, the
device also has a normal PCI identity in Linux, and the normal
Linux device driver for the vPCI device can function as if it
were running in Linux on bare metal.  Because vPCI devices are
presented dynamically through the VMBus offer mechanism, they
do not appear in the Linux guest's ACPI tables.  vPCI devices
may be added to a VM or removed from a VM at any time during
the life of the VM, and not just during initial boot.
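
The dual identity can be observed from user space.  The sketch
below is not part of the kernel tree; it only assumes the
standard sysfs layout of a Hyper-V guest and lists the VMBus
view and the PCI view of the devices that are present:

.. code-block:: c

   /*
    * Sketch only: list the two identities a vPCI device has in a
    * Hyper-V guest.  The GUID-named entries under the VMBus path
    * exist only when VMBus devices have been offered to the guest.
    */
   #include <dirent.h>
   #include <stdio.h>

   static void list_dir(const char *path, const char *label)
   {
           DIR *d = opendir(path);
           struct dirent *e;

           if (!d) {
                   printf("%s: %s not present (not a Hyper-V guest?)\n",
                          label, path);
                   return;
           }
           printf("%s (%s):\n", label, path);
           while ((e = readdir(d)) != NULL)
                   if (e->d_name[0] != '.')
                           printf("  %s\n", e->d_name);
           closedir(d);
   }

   int main(void)
   {
           /* VMBus identity: one GUID-named entry per offered device */
           list_dir("/sys/bus/vmbus/devices", "VMBus identity");
           /* PCI identity: vPCI devices appear here after hv_pci_probe() */
           list_dir("/sys/bus/pci/devices", "PCI identity");
           return 0;
   }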

With this approach, the vPCI device is a VMBus device and a
PCI device at the same time.  In response to the VMBus offer
message, the hv_pci_probe() function runs and establishes a
VMBus connection to the vPCI VSP on the Hyper-V host.  This
connection has a single VMBus channel.  The channel is used to
exchange messages with the vPCI VSP for the purpose of setting
up and configuring the vPCI device in Linux.  Once the device
is fully configured in Linux as a PCI device, the VMBus
channel is used only if Linux changes the vCPU to be
interrupted in the guest, or if the vPCI device is removed
from the VM while the VM is running.  The ongoing operation of
the device happens directly between the Linux device driver
for the device and the hardware, with VMBus and the VMBus
channel playing no role.

PCI Device Setup
----------------
PCI device setup follows a sequence that Hyper-V originally
created for Windows guests, and that can be ill-suited for
Linux guests due to differences in the overall structure of
the Linux PCI subsystem compared with Windows.  Nonetheless,
with a bit of hackery in the Hyper-V virtual PCI driver for
Linux, the virtual PCI device is set up in Linux so that
generic Linux PCI subsystem code and the Linux driver for the
device "just work".

Each vPCI device is set up in Linux to be in its own PCI
domain with a host bridge.  The PCI domainID is derived from
bytes 4 and 5 of the instance GUID assigned to the VMBus vPCI
device.  The Hyper-V host does not guarantee that these bytes
are unique, so hv_pci_probe() has an algorithm to resolve
collisions.  The collision resolution is intended to be stable
across reboots of the same VM so that the PCI domainIDs don't
change, as the domainID appears in the user space
configuration of some devices.
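
The following user-space sketch illustrates the idea of
deriving a candidate domain ID from bytes 4 and 5 of an
instance GUID.  It is only an illustration: the byte order
shown is an assumption, the GUID value is made up, and the
collision-resolution algorithm in hv_pci_probe() is omitted.

.. code-block:: c

   /*
    * Sketch only: form a candidate PCI domain ID from bytes 4 and 5
    * of a VMBus instance GUID, as described above.  The byte order
    * is illustrative; the real driver also resolves collisions.
    */
   #include <stdint.h>
   #include <stdio.h>

   static uint16_t domain_from_instance_guid(const uint8_t guid[16])
   {
           /* bytes 4 and 5 of the instance GUID form the candidate */
           return (uint16_t)(guid[5] << 8 | guid[4]);
   }

   int main(void)
   {
           /* arbitrary example GUID, not a real device instance */
           const uint8_t guid[16] = {
                   0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77,
                   0x88, 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff,
           };

           printf("candidate PCI domain: %04x\n",
                  domain_from_instance_guid(guid));
           return 0;
   }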

hv_pci_probe() allocates a guest MMIO range to be used as PCI
config space for the device.  This MMIO range is communicated
to the Hyper-V host over the VMBus channel as part of telling
the host that the device is ready to enter d0.  See
hv_pci_enter_d0().  When the guest subsequently accesses this
MMIO range, the Hyper-V host intercepts the accesses and maps
them to the physical device PCI config space.

hv_pci_probe() also gets BAR information for the device from
the Hyper-V host, and uses this information to allocate MMIO
space for the BARs.  That MMIO space is then set up to be
associated with the host bridge so that it works when generic
PCI subsystem code in Linux processes the BARs.
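
Once the BARs have been processed, the resulting assignments
are visible from user space.  The sketch below reads a
device's sysfs "resource" file; the domain:bus:device.function
in the path is a placeholder for an actual vPCI device.

.. code-block:: c

   /*
    * Sketch: print the BAR ranges the PCI core assigned to a device
    * by reading the sysfs "resource" file.  Substitute a real device
    * address for the placeholder below.
    */
   #include <stdio.h>

   int main(void)
   {
           const char *path = "/sys/bus/pci/devices/0001:00:00.0/resource";
           unsigned long long start, end, flags;
           FILE *f = fopen(path, "r");
           int bar = 0;

           if (!f) {
                   perror(path);
                   return 1;
           }
           while (fscanf(f, "%llx %llx %llx", &start, &end, &flags) == 3) {
                   if (start || end)
                           printf("BAR/window %d: 0x%llx-0x%llx flags 0x%llx\n",
                                  bar, start, end, flags);
                   bar++;
           }
           fclose(f);
           return 0;
   }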

Finally, hv_pci_probe() creates the root PCI bus.  At this
point the Hyper-V virtual PCI driver hackery is done, and the
normal Linux PCI machinery for scanning the root bus works to
detect the device, to perform driver matching, and to
initialize the driver and device.

PCI Device Removal
------------------
A Hyper-V host may initiate removal of a vPCI device from a
guest VM at any time during the life of the VM.  The removal
is instigated by an admin action taken on the Hyper-V host and
is not under the control of the guest OS.

A guest VM is notified of the removal by an unsolicited
"Eject" message sent from the host to the guest over the VMBus
channel associated with the vPCI device.  Upon receipt of such
a message, the Hyper-V virtual PCI driver in Linux
asynchronously invokes Linux kernel PCI subsystem calls to
shut down and remove the device.  When those calls are
complete, an "Ejection Complete" message is sent back to
Hyper-V over the VMBus channel indicating that the device has
been removed.  At this point, Hyper-V sends a VMBus rescind
message to the Linux guest, which the VMBus driver in Linux
processes by removing the VMBus identity for the device.  Once
that processing is complete, all vestiges of the device having
been present are gone from the Linux kernel.  The rescind
message also indicates to the guest that Hyper-V has stopped
providing support for the vPCI device in the guest.  If the
guest were to attempt to access that device's MMIO space, it
would be an invalid reference.  Hypercalls affecting the device
return errors, and any further messages sent in the VMBus
channel are ignored.

After sending the Eject message, Hyper-V allows the guest VM
60 seconds to cleanly shut down the device and respond with
Ejection Complete before sending the VMBus rescind
message.  If for any reason the Eject steps don't complete
within the allowed 60 seconds, the Hyper-V host forcibly
performs the rescind steps, which will likely result in
cascading errors in the guest because the device is now no
longer present from the guest standpoint and accessing the
device MMIO space will fail.

Because ejection is asynchronous and can happen at any point
during the guest VM lifecycle, proper synchronization in the
Hyper-V virtual PCI driver is very tricky.  Ejection has been
observed even before a newly offered vPCI device has been
fully set up.  The Hyper-V virtual PCI driver has been updated
several times over the years to fix race conditions when
ejections happen at inopportune times.  Care must be taken when
modifying this code to prevent re-introducing such problems.
See comments in the code.

Interrupt Assignment
--------------------
The Hyper-V virtual PCI driver supports vPCI devices using
MSI, multi-MSI, or MSI-X.  Assigning the guest vCPU that will
receive the interrupt for a particular MSI or MSI-X message is
complex because of the way the Linux setup of IRQs maps onto
the Hyper-V interfaces.  For the single-MSI and MSI-X cases,
Linux calls hv_compose_msi_msg() twice, with the first call
containing a dummy vCPU and the second call containing the
real vCPU.  Furthermore, hv_irq_unmask() is finally called
(on x86) or the GICD registers are set (on arm64), specifying
the real vCPU again.  Each of these three calls interacts
with Hyper-V, which must decide which physical CPU should
receive the interrupt before it is forwarded to the guest VM.
Unfortunately, the Hyper-V decision-making process is a bit
limited, and can result in concentrating the physical
interrupts on a single CPU, causing a performance bottleneck.
See details about how this is resolved in the extensive
comment above the function hv_compose_msi_req_get_cpu().

The Hyper-V virtual PCI driver implements the
irq_chip.irq_compose_msi_msg function as hv_compose_msi_msg().
Unfortunately, on Hyper-V the implementation requires sending
a VMBus message to the Hyper-V host and awaiting an interrupt
indicating receipt of a reply message.  Since
irq_chip.irq_compose_msi_msg can be called with IRQ locks
held, it doesn't work to do the normal sleep until awakened by
the interrupt.  Instead, hv_compose_msi_msg() must send the
VMBus message, and then poll for the completion message.  As
further complexity, the vPCI device could be ejected/rescinded
while the polling is in progress, so this scenario must be
detected as well.  See comments in the code regarding this
very tricky area.
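
The sketch below is only a user-space analogue of the "send a
request, then poll for the completion instead of sleeping"
pattern described above; it is not the driver's code.  A
helper thread stands in for the host's reply interrupt, and
the timeout value is arbitrary.

.. code-block:: c

   /*
    * Illustration only: busy-poll for a completion flag with a
    * bounded deadline, rather than sleeping until woken.
    */
   #include <pthread.h>
   #include <stdatomic.h>
   #include <stdio.h>
   #include <time.h>
   #include <unistd.h>

   static atomic_int reply_received;

   static void *fake_host_reply(void *arg)
   {
           (void)arg;
           usleep(10000);                     /* host "processes" the request */
           atomic_store(&reply_received, 1);  /* completion flag */
           return NULL;
   }

   int main(void)
   {
           struct timespec start, now;
           unsigned long polls = 0;
           pthread_t t;

           atomic_store(&reply_received, 0);
           pthread_create(&t, NULL, fake_host_reply, NULL); /* "send" request */

           clock_gettime(CLOCK_MONOTONIC, &start);
           while (!atomic_load(&reply_received)) {
                   polls++;
                   clock_gettime(CLOCK_MONOTONIC, &now);
                   if (now.tv_sec - start.tv_sec >= 1) {
                           fprintf(stderr, "timed out waiting for reply\n");
                           break;
                   }
           }
           pthread_join(t, NULL);
           printf("reply received after %lu polling iterations\n", polls);
           return 0;
   }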

Most of the code in the Hyper-V virtual PCI driver (pci-
hyperv.c) applies to Hyper-V and Linux guests running on x86
and on arm64 architectures.  But there are differences in how
interrupt assignments are managed.  On x86, the Hyper-V
virtual PCI driver in the guest must make a hypercall to tell
Hyper-V which guest vCPU should be interrupted by each
MSI/MSI-X interrupt, and the x86 interrupt vector number that
the x86_vector IRQ domain has picked for the interrupt.  This
hypercall is made by hv_arch_irq_unmask().  On arm64, the
Hyper-V virtual PCI driver manages the allocation of an SPI
for each MSI/MSI-X interrupt.  The Hyper-V virtual PCI driver
stores the allocated SPI in the architectural GICD registers,
which Hyper-V emulates, so no hypercall is necessary as with
x86.  Hyper-V does not support using LPIs for vPCI devices in
arm64 guest VMs because it does not emulate a GICv3 ITS.

The Hyper-V virtual PCI driver in Linux supports vPCI devices
whose drivers create managed or unmanaged Linux IRQs.  If the
smp_affinity for an unmanaged IRQ is updated via the /proc/irq
interface, the Hyper-V virtual PCI driver is called to tell
the Hyper-V host to change the interrupt targeting, and
everything works properly.  However, on x86, if the x86_vector
IRQ domain needs to reassign an interrupt vector due to
running out of vectors on a CPU, there's no path to inform the
Hyper-V host of the change, and things break.  Fortunately,
guest VMs operate in a constrained device environment where
using all the vectors on a CPU doesn't happen.  Since this
problem is only a theoretical concern rather than a practical
concern, it has been left unaddressed.
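
For example, an unmanaged IRQ can be retargeted from user
space by writing a new CPU mask to its /proc/irq entry, which
is the path that ends up calling into the Hyper-V virtual PCI
driver.  The IRQ number and mask in this sketch are
placeholders, and writing requires root:

.. code-block:: c

   /*
    * Sketch: read and then rewrite /proc/irq/<N>/smp_affinity for an
    * unmanaged IRQ.  Replace the placeholder IRQ number with a real
    * one assigned to a vPCI device; run as root.
    */
   #include <stdio.h>

   int main(void)
   {
           const char *path = "/proc/irq/24/smp_affinity"; /* placeholder IRQ */
           char mask[64];
           FILE *f;

           f = fopen(path, "r");
           if (!f) {
                   perror(path);
                   return 1;
           }
           if (fgets(mask, sizeof(mask), f))
                   printf("current affinity mask: %s", mask);
           fclose(f);

           f = fopen(path, "w");
           if (!f) {
                   perror(path);
                   return 1;
           }
           fputs("2\n", f);       /* move the interrupt to CPU 1 (mask 0x2) */
           fclose(f);
           return 0;
   }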

DMA
---
By default, Hyper-V pins all guest VM memory in the host
when the VM is created, and programs the physical IOMMU to
allow the VM to have DMA access to all its memory.  Hence,
it is safe to assign PCI devices to the VM, and to allow the
guest operating system to program the DMA transfers.  The
physical IOMMU prevents a malicious guest from initiating
DMA to memory belonging to the host or to other VMs on the
host.  From the Linux guest standpoint, such DMA transfers
are in "direct" mode since Hyper-V does not provide a virtual
IOMMU in the guest.
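
One observable consequence in the guest is that a vPCI device
is not placed in an IOMMU group.  The sketch below checks for
the sysfs iommu_group link; the device address is a
placeholder, and the interpretation (no link means direct-mode
DMA) is an assumption about how this surfaces in sysfs.

.. code-block:: c

   /*
    * Sketch: check whether a device sits behind an IOMMU by looking
    * for its iommu_group link in sysfs.  In a Hyper-V guest there is
    * no virtual IOMMU, so the link is expected to be absent.
    */
   #include <stdio.h>
   #include <unistd.h>

   int main(void)
   {
           const char *link = "/sys/bus/pci/devices/0001:00:00.0/iommu_group";

           if (access(link, F_OK) == 0)
                   printf("device is in an IOMMU group (an IOMMU is visible)\n");
           else
                   printf("no iommu_group link: DMA is handled in direct mode\n");
           return 0;
   }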

Hyper-V assumes that physical PCI devices always perform
cache-coherent DMA.  When running on x86, this behavior is
required by the architecture.  When running on arm64, the
architecture allows for both cache-coherent and
non-cache-coherent devices, with the behavior of each device
specified in the ACPI DSDT.  But when a PCI device is assigned
to a guest VM, that device does not appear in the DSDT, so the
Hyper-V VMBus driver propagates cache-coherency information
from the VMBus node in the ACPI DSDT to all VMBus devices,
including vPCI devices (since they have a dual identity as a
VMBus device and as a PCI device).  See vmbus_dma_configure().
Current Hyper-V versions always indicate that the VMBus is
cache coherent, so vPCI devices on arm64 always get marked as
cache coherent, and the CPU does not perform any sync
operations as part of dma_map/unmap_*() calls.

vPCI protocol versions
----------------------
As previously described, during vPCI device setup and teardown,
messages are passed over a VMBus channel between the Hyper-V
host and the Hyper-V vPCI driver in the Linux guest.  Some
messages have been revised in newer versions of Hyper-V, so
the guest and host must agree on the vPCI protocol version to
be used.  The version is negotiated when communication over
the VMBus channel is first established.  See
hv_pci_protocol_negotiation().  Newer versions of the protocol
extend support to VMs with more than 64 vCPUs, and provide
additional information about the vPCI device, such as the
guest virtual NUMA node to which it is most closely affined on
the underlying hardware.

Guest NUMA node affinity
------------------------
When the vPCI protocol version provides it, the guest NUMA
node affinity of the vPCI device is stored as part of the Linux
device information for subsequent use by the Linux driver.  See
hv_pci_assign_numa_node().  If the negotiated protocol version
does not support the host providing NUMA affinity information,
the Linux guest defaults the device NUMA node to 0.  But even
when the negotiated protocol version includes NUMA affinity
information, the ability of the host to provide such
information depends on certain host configuration options.  If
the guest receives NUMA node value "0", it could mean NUMA
node 0, or it could mean "no information is available".
Unfortunately, it is not possible to distinguish the two cases
from the guest side.
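
The NUMA node recorded for the device can be read back through
sysfs, subject to the ambiguity just described.  The device
address in this sketch is a placeholder:

.. code-block:: c

   /*
    * Sketch: read the NUMA node the guest kernel recorded for a vPCI
    * device.  A value of 0 may be a real affinity or simply the
    * default used when no information was provided by the host.
    */
   #include <stdio.h>

   int main(void)
   {
           const char *path = "/sys/bus/pci/devices/0001:00:00.0/numa_node";
           FILE *f = fopen(path, "r");
           int node;

           if (!f) {
                   perror(path);
                   return 1;
           }
           if (fscanf(f, "%d", &node) == 1)
                   printf("reported NUMA node: %d%s\n", node,
                          node == 0 ? " (may also mean \"unknown\")" : "");
           fclose(f);
           return 0;
   }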

PCI config space access in a CoCo VM
------------------------------------
Linux PCI device drivers access PCI config space using a
standard set of functions provided by the Linux PCI subsystem.
In Hyper-V guests these standard functions map to functions
hv_pcifront_read_config() and hv_pcifront_write_config()
in the Hyper-V virtual PCI driver.  In normal VMs,
these hv_pcifront_*() functions directly access the PCI config
space, and the accesses trap to Hyper-V to be handled.
But in CoCo VMs, memory encryption prevents Hyper-V
from reading the guest instruction stream to emulate the
access, so the hv_pcifront_*() functions must make
hypercalls with explicit arguments describing the access to be
made.
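
From user space, these same kernel accessors are exercised by
reading a device's sysfs "config" file, regardless of whether
the VM is a normal VM or a CoCo VM.  A minimal sketch follows;
the device address is a placeholder.

.. code-block:: c

   /*
    * Sketch: read the first bytes of a device's PCI config space
    * through the sysfs "config" file and print the vendor and device
    * IDs (stored little-endian in config space).
    */
   #include <stdint.h>
   #include <stdio.h>

   int main(void)
   {
           const char *path = "/sys/bus/pci/devices/0001:00:00.0/config";
           uint8_t cfg[4];
           FILE *f = fopen(path, "rb");

           if (!f) {
                   perror(path);
                   return 1;
           }
           if (fread(cfg, 1, sizeof(cfg), f) == sizeof(cfg))
                   printf("vendor 0x%02x%02x device 0x%02x%02x\n",
                          cfg[1], cfg[0], cfg[3], cfg[2]);
           fclose(f);
           return 0;
   }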

Config Block back-channel
-------------------------
The Hyper-V host and Hyper-V virtual PCI driver in Linux
together implement a non-standard back-channel communication
path between the host and guest.  The back-channel path uses
messages sent over the VMBus channel associated with the vPCI
device.  The functions hyperv_read_cfg_blk() and
hyperv_write_cfg_blk() are the primary interfaces provided to
other parts of the Linux kernel.  As of this writing, these
interfaces are used only by the Mellanox mlx5 driver to pass
diagnostic data to a Hyper-V host running in the Azure public
cloud.  The functions hyperv_read_cfg_blk() and
hyperv_write_cfg_blk() are implemented in a separate module
(pci-hyperv-intf.c, under CONFIG_PCI_HYPERV_INTERFACE) that
effectively stubs them out when running in non-Hyper-V
environments.
                                                      
