===================================================
PCI Express I/O Virtualization Resource on PowerNV
===================================================

Wei Yang <weiyang@linux.vnet.ibm.com>

Benjamin Herrenschmidt <benh@au1.ibm.com>

Bjorn Helgaas <bhelgaas@google.com>

26 Aug 2014

This document describes the hardware requirements for PCI MMIO resource
sizing and assignment on PowerKVM, and how the generic PCI code handles
these requirements.  The first two sections describe the concept of
Partitionable Endpoints and their implementation on P8 (IODA2).  The last
two sections discuss considerations for enabling SR-IOV on IODA2.

1. Introduction to Partitionable Endpoints
==========================================

A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
partitions (i.e., filtering of DMA, MSIs, etc.) and to provide a mechanism
to freeze a device that is causing errors, in order to limit the
possibility of propagation of bad data.

There is thus, in HW, a table of PE states that contains a pair of "frozen"
state bits (one for MMIO and one for DMA; they are set together but can be
cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all loads
return an all-1's value.  MSIs are also blocked.  There's a bit more state
that captures things like the details of the error that caused the freeze,
but that's not critical.

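As a purely illustrative model (the names and layout are hypothetical, not
the real IODA register format), the freeze semantics can be pictured in C
like this::

  #include <stdbool.h>
  #include <stdint.h>

  #define NUM_PES 256

  /* Hypothetical per-PE freeze state: the two bits are set together
   * on an error but can be cleared independently. */
  struct pe_state {
          bool mmio_frozen;
          bool dma_frozen;
  };

  static struct pe_state pe_table[NUM_PES];

  /* On an error, HW freezes both directions for the offending PE. */
  static void pe_freeze(unsigned int pe)
  {
          pe_table[pe].mmio_frozen = true;
          pe_table[pe].dma_frozen = true;
  }

  /* While frozen, stores are dropped and loads return all 1's. */
  static uint64_t pe_mmio_load(unsigned int pe, uint64_t real_value)
  {
          return pe_table[pe].mmio_frozen ? ~0ULL : real_value;
  }
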
The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
are matched to their corresponding PEs.

The following section provides a rough description of what we have on P8
(IODA2).  Keep in mind that this is all per PHB (PCI host bridge).  Each PHB
is a completely separate HW entity that replicates the entire logic, so it
has its own set of PEs, etc.

2. Implementation of Partitionable Endpoints on P8 (IODA2)
==========================================================

P8 supports up to 256 Partitionable Endpoints per PHB.

  * Inbound

    For DMA, MSIs and inbound PCIe error messages, we have a table (in
    memory but accessed in HW by the chip) that provides a direct
    correspondence between a PCIe RID (bus/dev/fn) and a PE number.
    We call this the RTT.

    - For DMA we then provide an entire address space for each PE that can
      contain two "windows", depending on the value of PCI address bit 59.
      Each window can be configured to be remapped via a "TCE table" (IOMMU
      translation table), which has various configurable characteristics
      not described here.

    - For MSIs, we have two windows in the address space (one at the top of
      the 32-bit space and one much higher) which, via a combination of the
      address and MSI value, will result in one of the 2048 interrupts per
      bridge being triggered.  There's a PE# in the interrupt controller
      descriptor table as well, which is compared with the PE# obtained from
      the RTT to "authorize" the device to emit that specific interrupt.

    - Error messages just use the RTT.

  * Outbound.  That's where the tricky part is.

    Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
    from the CPU address space to the PCI address space.  There is one M32
    window and sixteen M64 windows.  They have different characteristics.
    First, what they have in common: they forward a configurable portion of
    the CPU address space to the PCIe bus, and they must be a naturally
    aligned power of two in size.  The rest is different:

    - The M32 window:

      * Is limited to 4GB in size.

      * Drops the top bits of the address (above the size) and replaces
        them with a configurable value.  This is typically used to generate
        32-bit PCIe accesses.  We configure that window at boot from FW and
        don't touch it from Linux; it's usually set to forward a 2GB
        portion of address space from the CPU to PCIe
        0x8000_0000..0xffff_ffff.  (Note: the top 64KB are actually
        reserved for MSIs, but this is not a problem at this point; we just
        need to ensure Linux doesn't assign anything there.  The M32 logic
        ignores the reservation, however, and will forward accesses in that
        space if we try.)

      * Is divided into 256 segments of equal size.  A table in the chip
        maps each segment to a PE#.  That allows portions of the MMIO space
        to be assigned to PEs on a segment granularity.  For a 2GB window,
        the segment granularity is 2GB/256 = 8MB.  (A C sketch of this and
        the other PE# lookups follows after this list.)

    Now, this is the "main" window we use in Linux today (excluding
    SR-IOV).  We basically use the trick of forcing the bridge MMIO windows
    onto a segment alignment/granularity so that the space behind a bridge
    can be assigned to a PE.

    Ideally we would like to be able to have individual functions in PEs
    but that would mean using a completely different address allocation
    scheme where individual function BARs can be "grouped" to fit in one or
    more segments.

    - The M64 windows:

      * Must be at least 256MB in size.

      * Do not translate addresses (the address on PCIe is the same as the
        address on the PowerBus).  There is a way to also set the top 14
        bits which are not conveyed by PowerBus, but we don't use this.

      * Can be configured to be segmented.  When not segmented, we can
        specify the PE# for the entire window.  When segmented, a window
        has 256 segments; however, there is no table for mapping a segment
        to a PE#.  The segment number *is* the PE#.

      * Support overlaps.  If an address is covered by multiple windows,
        there's a defined ordering for which window applies.

    We have code (fairly new compared to the M32 stuff) that exploits that
    for large BARs in 64-bit space:

    We configure an M64 window to cover the entire region of address space
    that has been assigned by FW for the PHB (about 64GB; ignore the space
    for the M32, it comes out of a different "reserve").  We configure it
    as segmented.

    Then we do the same thing as with M32, using the bridge alignment
    trick, to match to those giant segments.

    Since we cannot remap, we have two additional constraints:

    - We do the PE# allocation *after* the 64-bit space has been assigned
      because the addresses we use directly determine the PE#.  We then
      update the M32 PE# for the devices that use both 32-bit and 64-bit
      spaces, or assign the remaining PE#s to 32-bit-only devices.

    - We cannot "group" segments in HW, so if a device ends up using more
      than one segment, we end up with more than one PE#.  There is a HW
      mechanism to make the freeze state cascade to "companion" PEs, but
      that only works for PCIe error messages (typically used so that if
      you freeze a switch, it freezes all its children).  So we do it in
      SW.  We lose a bit of the effectiveness of EEH in that case, but
      that's the best we found.  So when any of the PEs freezes, we freeze
      the other ones for that "domain".  We thus introduce the concept of
      a "master PE", which is the one used for DMA, MSIs, etc., and
      "secondary PEs", which are used for the remaining M64 segments.

    We would like to investigate using additional M64 windows in "single
    PE" mode to overlay specific BARs to work around some of that, for
    example for devices with very large BARs such as GPUs.  It would make
    sense, but we haven't done it yet.

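Putting the pieces above together, the PE# matching can be summarized in a
short, self-contained C sketch.  Everything here (the flat arrays, sizes,
and function names) is an illustrative assumption rather than the real
hardware or kernel interface; the point is only the three lookup rules:
RID through the RTT for inbound traffic, a per-segment table for the M32
window, and segment number == PE# for a segmented M64 window::

  #include <stdint.h>

  #define NUM_SEGMENTS 256

  /* Inbound: the RTT maps a 16-bit RID (bus/dev/fn) straight to a PE#. */
  static uint8_t rtt[65536];

  static uint8_t pe_for_rid(uint8_t bus, uint8_t dev, uint8_t fn)
  {
          uint16_t rid = (uint16_t)((bus << 8) | (dev << 3) | fn);

          return rtt[rid];
  }

  /* Outbound, M32: a table in the chip maps each segment to a PE#. */
  static uint8_t m32_seg_to_pe[NUM_SEGMENTS];

  static uint8_t pe_for_m32(uint64_t addr, uint64_t base, uint64_t size)
  {
          uint64_t segment = (addr - base) / (size / NUM_SEGMENTS);

          return m32_seg_to_pe[segment];
  }

  /* Outbound, segmented M64: no table; the segment number is the PE#. */
  static uint8_t pe_for_m64(uint64_t addr, uint64_t base, uint64_t size)
  {
          return (uint8_t)((addr - base) / (size / NUM_SEGMENTS));
  }
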
3. Considerations for SR-IOV on PowerKVM
========================================

  * SR-IOV Background

    The PCIe SR-IOV feature allows a single Physical Function (PF) to
    support several Virtual Functions (VFs).  Registers in the PF's SR-IOV
    Capability control the number of VFs and whether they are enabled.

    When VFs are enabled, they appear in Configuration Space like normal
    PCI devices, but the BARs in VF config space headers are unusual.  For
    a non-VF device, software uses BARs in the config space header to
    discover the BAR sizes and assign addresses for them.  For VF devices,
    software uses VF BAR registers in the *PF* SR-IOV Capability to
    discover sizes and assign addresses.  The BARs in the VF's config space
    header are read-only zeros.

    When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
    base address for all the corresponding VF(n) BARs.  For example, if the
    PF SR-IOV Capability is programmed to enable eight VFs, and it has a
    1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
    This region is divided into eight contiguous 1MB regions, each of which
    is a BAR0 for one of the VFs.  Note that even though the VF BAR
    describes an 8MB region, the alignment requirement is for a single VF,
    i.e., 1MB in this example.  (The address arithmetic is sketched below.)

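  To make the layout concrete, here is a minimal C sketch of the VF(n) BAR
  address arithmetic implied by the spec (the function name is a
  hypothetical helper, not a kernel API)::

     #include <stdint.h>

     /* VF(n)'s copy of a BAR sits n equal-sized slots past the base
      * programmed into the PF's VF BAR. */
     static uint64_t vf_bar_addr(uint64_t vf_bar_base,
                                 uint64_t vf_bar_size, unsigned int n)
     {
             return vf_bar_base + (uint64_t)n * vf_bar_size;
     }

  With the example above (eight VFs and a 1MB VF BAR0), vf_bar_addr() puts
  VF7's BAR0 at base + 7MB, the last 1MB slice of the 8MB region.
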
  There are several strategies for isolating VFs in PEs:

  - M32 window: There's one M32 window, and it is split into 256
    equally-sized segments.  The finest granularity possible is a 256MB
    window with 1MB segments.  VF BARs that are 1MB or larger could be
    mapped to separate PEs in this window.  Each segment can be
    individually mapped to a PE via the lookup table, so this is quite
    flexible, but it works best when all the VF BARs are the same size.  If
    they are different sizes, the entire window has to be small enough that
    the segment size matches the smallest VF BAR, which means larger VF
    BARs span several segments.

  - Non-segmented M64 window: A non-segmented M64 window is mapped entirely
    to a single PE, so it could only isolate one VF.

  - Single segmented M64 window: A segmented M64 window could be used just
    like the M32 window, but the segments can't be individually mapped to
    PEs (the segment number is the PE#), so there isn't as much
    flexibility.  A VF with multiple BARs would have to be in a "domain" of
    multiple PEs, which is not as well isolated as a single PE.

  - Multiple segmented M64 windows: As usual, each window is split into 256
    equally-sized segments, and the segment number is the PE#.  But if we
    use several M64 windows, they can be set to different base addresses
    and different segment sizes.  If we have VFs that each have a 1MB BAR
    and a 32MB BAR, we could use one M64 window to assign 1MB segments and
    another M64 window to assign 32MB segments.  (See the sketch after this
    list.)

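  As a sketch of that last strategy (names are illustrative), one segmented
  M64 window is set up per VF BAR size, and VF n's BAR is placed in segment
  n of its window, so every BAR of a VF lands in the same PE::

     #include <stdint.h>

     /* One segmented M64 window per VF BAR size; the segment number
      * is the PE#, so VF n's BAR must sit in segment n. */
     struct m64_window {
             uint64_t base;
             uint64_t seg_size;      /* the VF BAR size it serves */
     };

     static uint64_t vf_bar_in_window(const struct m64_window *w,
                                      unsigned int n)
     {
             return w->base + (uint64_t)n * w->seg_size;
     }

  For the 1MB + 32MB example, one window would use a 1MB seg_size and the
  other a 32MB seg_size; both place VF n in PE n.
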
  Finally, the plan is to use M64 windows for SR-IOV, which will be
  described in more detail in the next two sections.  For a given VF BAR,
  we need to effectively reserve the entire 256 segments (256 * VF BAR
  size) and position the VF BAR to start at the beginning of a free range
  of segments/PEs inside that M64 window.

  The goal is, of course, to be able to give a separate PE to each VF.

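  In code form, the plan's reservation and placement arithmetic looks like
  this (with illustrative names)::

     #include <stdint.h>

     #define NUM_SEGMENTS 256

     /* Reserve the whole window: 256 * VF BAR size. */
     static uint64_t m64_reservation(uint64_t vf_bar_size)
     {
             return (uint64_t)NUM_SEGMENTS * vf_bar_size;
     }

     /* Start the VF(n) BAR space at a chosen free segment/PE. */
     static uint64_t vf_space_start(uint64_t window_base,
                                    uint64_t vf_bar_size,
                                    unsigned int first_free_pe)
     {
             return window_base + (uint64_t)first_free_pe * vf_bar_size;
     }
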
  The IODA2 platform has 16 M64 windows, which are used to map MMIO
  ranges to PE#s.  Each M64 window defines one MMIO range, and this range
  is divided into 256 segments, with each segment corresponding to one PE.

  We chose to leverage these M64 windows to map VFs to individual PEs,
  since SR-IOV VF BARs are all the same size.

  But doing so introduces another problem: total_VFs is usually smaller
  than the number of M64 window segments, so if we map one VF BAR directly
  to one M64 window, some part of the M64 window will map to another
  device's MMIO range.

  IODA supports 256 PEs, so a segmented window contains 256 segments.  If
  total_VFs is less than 256, we have the situation in Figure 1.0, where
  segments [total_VFs, 255] of the M64 window may map to some MMIO range on
  other devices::

     0      1                     total_VFs - 1
     +------+------+-     -+------+------+
     |      |      |  ...  |      |      |
     +------+------+-     -+------+------+

                           VF(n) BAR space

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.0 Direct map VF(n) BAR space

  Our current solution is to allocate 256 segments even if the VF(n) BAR
  space doesn't need that much, as shown in Figure 1.1::

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           VF(n) BAR space + extra

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.1 Map VF(n) BAR space + extra

  Allocating the extra space ensures that the entire M64 window will be
  assigned to this one SR-IOV device and none of the space will be
  available for other devices.  Note that this only expands the space
  reserved in software; there are still only total_VFs VFs, and they only
  respond to segments [0, total_VFs - 1].  There's nothing in hardware that
  responds to segments [total_VFs, 255].

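  A sketch of that expansion (the struct below is a stand-in for the
  kernel's struct resource, and the helper name is made up)::

     #include <stdint.h>

     #define NUM_SEGMENTS 256

     struct mmio_res {
             uint64_t start;
             uint64_t end;           /* inclusive, like struct resource */
     };

     /* Grow a VF BAR resource from total_VFs segments to all 256 so
      * the whole M64 window belongs to this one SR-IOV device. */
     static void expand_vf_bar_space(struct mmio_res *res,
                                     uint64_t vf_bar_size)
     {
             res->end = res->start + NUM_SEGMENTS * vf_bar_size - 1;
     }
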
4. Implications for the Generic PCI Code
========================================

The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.

In IODA2, the MMIO address determines the PE#.  If the address is in an M32
window, we can set the PE# by updating the table that translates segments
to PE#s.  Similarly, if the address is in an unsegmented M64 window, we can
set the PE# for the window.  But if it's in a segmented M64 window, the
segment number is the PE#.

Therefore, the only way to control the PE# for a VF is to change the base
of the VF(n) BAR space in the VF BAR.  If the PCI core allocates the exact
amount of space required for the VF(n) BAR space, the VF BAR value is fixed
and cannot be changed.

On the other hand, if the PCI core allocates additional space, the VF BAR
value can be changed as long as the entire VF(n) BAR space remains inside
the space allocated by the core.

Ideally the segment size will be the same as an individual VF BAR size.
Then each VF will be in its own PE.  The VF BARs (and therefore the PE#s)
are contiguous.  If VF0 is in PE(x), then VF(n) is in PE(x+n).  If we
allocate 256 segments, there are (256 - numVFs) choices for the PE# of VF0.

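A sketch of that choice (the names are illustrative, not the kernel's
interface)::

  #include <stdint.h>

  #define NUM_SEGMENTS 256

  /* Place VF0 in PE pe_of_vf0; VF n then lands in PE (pe_of_vf0 + n).
   * The whole VF(n) BAR space must stay inside the 256 segments. */
  static uint64_t place_vf0(uint64_t window_base, uint64_t vf_bar_size,
                            unsigned int pe_of_vf0, unsigned int num_vfs)
  {
          if (pe_of_vf0 + num_vfs > NUM_SEGMENTS)
                  return 0;       /* no valid placement here */

          return window_base + (uint64_t)pe_of_vf0 * vf_bar_size;
  }
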
If the segment size is smaller than the VF BAR size, it will take several
segments to cover a VF BAR, and a VF will be in several PEs.  This is
possible, but the isolation isn't as good, and it reduces the number of PE#
choices because instead of consuming only numVFs segments, the VF(n) BAR
space will consume (numVFs * n) segments, where n is the number of segments
per VF BAR.  That means there aren't as many available segments for
adjusting the base of the VF(n) BAR space.

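As a worked example of that cost (the numbers are made up): with 1MB
segments and a 4MB VF BAR, each VF spans n = 4 segments, so 16 VFs consume
64 of the 256 segments rather than 16::

  #include <stdio.h>

  int main(void)
  {
          unsigned int seg_mb = 1, vf_bar_mb = 4, num_vfs = 16;
          unsigned int n = vf_bar_mb / seg_mb;    /* segments per VF BAR */

          printf("each VF spans %u PEs\n", n);
          printf("VF(n) BAR space consumes %u of 256 segments\n",
                 num_vfs * n);
          return 0;
  }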
