.. SPDX-License-Identifier: GPL-2.0

.. _devlink_port:

============
Devlink Port
============

``devlink-port`` is a port that exists on the device. It is a logically
separate ingress/egress point of the device. A devlink port can be any one
of many flavours. A devlink port flavour, along with port attributes,
describes what a port represents.

A device driver that intends to publish a devlink port sets the
devlink port attributes and registers the devlink port.

Devlink port flavours are described below.

.. list-table:: List of devlink port flavours
   :widths: 33 90

   * - Flavour
     - Description
   * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
     - Any kind of physical port. This can be an eswitch physical port or any
       other physical port on the device.
   * - ``DEVLINK_PORT_FLAVOUR_DSA``
     - This indicates a DSA interconnect port.
   * - ``DEVLINK_PORT_FLAVOUR_CPU``
     - This indicates a CPU port applicable only to DSA.
   * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
     - This indicates an eswitch port representing a port of PCI
       physical function (PF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
     - This indicates an eswitch port representing a port of PCI
       virtual function (VF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
     - This indicates an eswitch port representing a port of PCI
       subfunction (SF).
   * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
     - This indicates a virtual port for the PCI virtual function.

A devlink port can have a different type, based on its link layer, as
described below.

.. list-table:: List of devlink port types
   :widths: 23 90

   * - Type
     - Description
   * - ``DEVLINK_PORT_TYPE_ETH``
     - Driver should set this port type when the link layer of the port is
       Ethernet.
   * - ``DEVLINK_PORT_TYPE_IB``
     - Driver should set this port type when the link layer of the port is
       InfiniBand.
   * - ``DEVLINK_PORT_TYPE_AUTO``
     - This type is indicated by the user when the driver should detect the
       port type automatically.

PCI controllers
---------------
In most cases a PCI device has only one controller. A controller consists of
potentially multiple physical functions, virtual functions and subfunctions.
A function consists of one or more ports. Each such port is represented by a
devlink eswitch port.

A PCI device connected to multiple CPUs or multiple PCI root complexes or a
SmartNIC, however, may have multiple controllers. For a device with multiple
controllers, each controller is distinguished by a unique controller number.
The eswitch on such a PCI device supports the ports of multiple controllers.

An example view of a system with two controllers::

                 ---------------------------------------------------------
                 |                                                       |
                 |           --------- ---------         ------- ------- |
    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
    | connect |  | -------                       -------                 |
    -----------  |     | controller_num=1 (no eswitch)                   |
                 ------|--------------------------------------------------
                 (internal wire)
                       |
                 ---------------------------------------------------------
                 | devlink eswitch ports and reps                        |
                 | ----------------------------------------------------- |
                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 |                                                       |
                 |                                                       |
    -----------  |           --------- ---------         ------- ------- |
    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
    -----------  | -------                       -------                 |
                 |                                                       |
                 |  local controller_num=0 (eswitch)                     |
                 ---------------------------------------------------------

In the above example, the external controller (identified by controller
number = 1) doesn't have the eswitch. The local controller (identified by
controller number = 0) has the eswitch. The devlink instance on the local
controller has eswitch devlink ports for both controllers.

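As an illustration only, the port layout in the diagram can be modelled as a
flat list of eswitch ports, each tagged with its controller number. The
``EswitchPort`` class and ``ports_by_controller`` helper below are hypothetical
names invented for this sketch; they are not kernel or iproute2 interfaces.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EswitchPort:
    """One eswitch port as drawn in the diagram (illustrative model only)."""
    controller: int  # controller number; 0 is the local controller in the diagram
    pfnum: int       # PF number
    flavour: str     # "pcipf", "pcivf" or "pcisf"
    index: int = 0   # VF or SF number; unused for a PF port


def ports_by_controller(ports):
    """Group the ports of a single devlink instance by controller number."""
    grouped = defaultdict(list)
    for port in ports:
        grouped[port.controller].append(port)
    return dict(grouped)


# The devlink instance on the local controller holds ports of both controllers.
ports = [
    EswitchPort(controller, pfnum, flavour)
    for controller in (0, 1)
    for pfnum in (0, 1)
    for flavour in ("pcipf", "pcivf", "pcisf")
]
grouped = ports_by_controller(ports)
print({controller: len(members) for controller, members in grouped.items()})
```

The point of the model is that the controller number is part of a port's
identity: ``pf0vfN`` on controller 0 and ``pf0vfN`` on controller 1 are
distinct eswitch ports on the same devlink instance.
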
Function configuration
======================

Users can configure one or more function attributes before enumerating the PCI
function. Usually this means that the user should configure a function
attribute before a bus-specific device for the function is created. However,
when SRIOV is enabled, virtual function devices are created on the PCI bus.
Hence, a function attribute should be configured before binding the virtual
function device to the driver. For subfunctions, this means the user should
configure the port function attribute before activating the port function.

A user may set the hardware address of the function using the
``devlink port function set hw_addr`` command. For an Ethernet port function
this means a MAC address.

Users may also set the RoCE capability of the function using the
``devlink port function set roce`` command.

Users may also set the function as migratable using the
``devlink port function set migratable`` command.

Users may also set the IPsec crypto capability of the function using the
``devlink port function set ipsec_crypto`` command.

Users may also set the IPsec packet capability of the function using the
``devlink port function set ipsec_packet`` command.

Users may also set the maximum IO event queues of the function
using the ``devlink port function set max_io_eqs`` command.

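For scripting, the commands listed above can be assembled programmatically.
The following is a minimal sketch: ``function_set_argv`` is a hypothetical
helper, and combining several attributes in one invocation is an assumption
that a given driver may not accept. It only builds the argument list and does
not run ``devlink``.

```python
# Attribute names taken from the commands listed in this section.
KNOWN_ATTRS = (
    "hw_addr", "roce", "migratable",
    "ipsec_crypto", "ipsec_packet", "max_io_eqs",
)


def function_set_argv(port, **attrs):
    """Build the argv of one `devlink port function set` call (hypothetical helper)."""
    if not attrs:
        raise ValueError("at least one function attribute is required")
    argv = ["devlink", "port", "function", "set", port]
    for name, value in attrs.items():
        if name not in KNOWN_ATTRS:
            raise ValueError(f"unknown function attribute: {name}")
        argv += [name, str(value)]
    return argv


argv = function_set_argv(
    "pci/0000:06:00.0/2", hw_addr="00:11:22:33:44:55", roce="disable"
)
print(" ".join(argv))
```

The resulting list could be handed to ``subprocess.run``. Whichever way the
command is issued, it has to happen before the VF is bound to its driver or
the subfunction is activated, as described above.
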
Function attributes
===================

MAC address setup
-----------------
The configured MAC address of the PCI VF/SF will be used by the netdevice and
RDMA device created for the PCI VF/SF.

- Get the MAC address of the VF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the VF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:11:22:33:44:55

- Get the MAC address of the SF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the SF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:88:88

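When such output has to be checked from a script, it can be split into
key/value pairs. The sketch below is a simplified parser written for the
exact output shape shown above; ``parse_port_show`` is a hypothetical helper,
and it assumes every field is a ``key value`` pair, which may not hold for
all ports or tool versions.

```python
def parse_port_show(text):
    """Parse the `devlink port show` output of one port into a dict (sketch)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    # First line: "<handle>: key value key value ..."
    handle, _, rest = lines[0].partition(": ")
    tokens = rest.split()
    port = {"handle": handle}
    port.update(zip(tokens[0::2], tokens[1::2]))
    # Lines following "function:" carry the port function attributes.
    function = {}
    in_function = False
    for line in lines[1:]:
        if line == "function:":
            in_function = True
        elif in_function:
            tokens = line.split()
            function.update(zip(tokens[0::2], tokens[1::2]))
    port["function"] = function
    return port


SAMPLE = """\
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
  function:
    hw_addr 00:11:22:33:44:55
"""

port = parse_port_show(SAMPLE)
print(port["flavour"], port["function"]["hw_addr"])
```

For anything beyond a quick check, a structured output format of the tool,
where available, is likely a more robust choice than parsing the
human-readable text.
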
RoCE capability setup
---------------------
Not all PCI VFs/SFs require RoCE capability.

When RoCE capability is disabled, it saves system memory per PCI VF/SF.

When the user disables RoCE capability for a VF/SF, user applications cannot
send or receive any RoCE packets through this VF/SF, and the RoCE GID table
for this PCI function will be empty.

When RoCE capability is disabled in the device using the port function
attribute, the VF/SF driver cannot override it.

- Get RoCE capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 roce enable

- Set RoCE capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 roce disable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 roce disable

Migratable capability setup
---------------------------
Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.

Users who want PCI VFs to be able to perform live migration need to
explicitly enable the VF migratable capability.

When the user enables the migratable capability for a VF, and the hypervisor
(HV) binds the VF to a VFIO driver with migration support, the user can
migrate the VM with this VF from one HV to a different one.

However, when the migratable capability is enabled, the device will disable
features which cannot be migrated. Thus the migratable capability can impose
limitations on a VF, so the decision is left to the user.

Example of live migration (LM) with migratable function configuration:

- Get migratable capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 migratable disable

- Set migratable capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 migratable enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 migratable enable

- Bind VF to VFIO driver with migration support::

    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
    $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind

- Attach VF to the VM.
- Start the VM.
- Perform live migration.

IPsec crypto capability setup
-----------------------------
When the user enables IPsec crypto capability for a VF, user applications can
offload XFRM state crypto operations (Encrypt/Decrypt) to this VF.

When IPsec crypto capability is disabled (default) for a VF, the XFRM state is
processed in software by the kernel.

- Get IPsec crypto capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_crypto disabled

- Set IPsec crypto capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_crypto enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_crypto enabled

IPsec packet capability setup
-----------------------------
When the user enables IPsec packet capability for a VF, user applications can
offload XFRM state and policy crypto operations (Encrypt/Decrypt) to this VF,
as well as IPsec encapsulation.

When IPsec packet capability is disabled (default) for a VF, the XFRM state and
policy are processed in software by the kernel.

- Get IPsec packet capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet disabled

- Set IPsec packet capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_packet enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet enabled

Maximum IO event queues setup
-----------------------------
When the user sets the maximum number of IO event queues for an SF or a VF,
the driver of that function is limited to consuming at most the enforced
number of IO event queues.

IO event queues deliver events related to IO queues, including network
device transmit and receive queues (txq and rxq) and RDMA Queue Pairs (QPs).
For example, the number of netdevice channels and RDMA device completion
vectors are derived from the function's IO event queues. Usually, the number
of interrupt vectors consumed by the driver is limited by the number of IO
event queues per device, as each of the IO event queues is connected to an
interrupt vector.

- Get maximum IO event queues of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 10

- Set maximum IO event queues of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 max_io_eqs 32

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 32

Subfunction
===========

A subfunction is a lightweight function that has a parent PCI function on
which it is deployed. Subfunctions are created and deployed in units of 1.
Unlike SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
A subfunction communicates with the hardware through the parent PCI function.

To use a subfunction, a three-step setup sequence is followed:

1) create - create a subfunction;
2) configure - configure subfunction attributes;
3) deploy - deploy the subfunction;

Subfunction management is done using the devlink port user interface.
The user performs setup on the subfunction management device.

(1) Create
----------
A subfunction is created using the devlink port interface. A user adds the
subfunction by adding a devlink port of subfunction flavour. The devlink
kernel code calls down to the subfunction management driver (devlink ops) and
asks it to create a subfunction devlink port. The driver then instantiates the
subfunction port and any associated objects such as health reporters and
the representor netdevice.

(2) Configure
-------------
A subfunction devlink port is created but it is not active yet. That means the
entities are created on the devlink side and the e-switch port representor is
created, but the subfunction device itself is not created. A user might use the
e-switch port representor to apply settings, such as putting it into a bridge
or adding TC rules. A user might also configure the hardware address (such as
the MAC address) of the subfunction while the subfunction is inactive.

(3) Deploy
----------
Once a subfunction is configured, the user must activate it to use it. Upon
activation, the subfunction management driver asks the subfunction management
device to instantiate the subfunction device on a particular PCI function.
A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
At this point a matching subfunction driver binds to the subfunction's auxiliary device.

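The ordering constraint of the three steps can be illustrated with a toy state
model. The ``Subfunction`` class below is purely hypothetical and does not
correspond to any kernel structure; it only shows that configuration happens
while the port function is inactive and that the auxiliary device appears on
activation.

```python
class Subfunction:
    """Toy model of the create -> configure -> deploy sequence (not a kernel API)."""

    def __init__(self, pfnum, sfnum):
        # (1) Create: the devlink port and its representor exist, device does not.
        self.pfnum = pfnum
        self.sfnum = sfnum
        self.hw_addr = "00:00:00:00:00:00"
        self.active = False
        self.has_aux_device = False

    def configure(self, hw_addr):
        # (2) Configure: only meaningful while the function is still inactive.
        if self.active:
            raise RuntimeError("configure the port function before activating it")
        self.hw_addr = hw_addr

    def activate(self):
        # (3) Deploy: the subfunction device is instantiated on the auxiliary bus.
        self.active = True
        self.has_aux_device = True


sf = Subfunction(pfnum=0, sfnum=88)
sf.configure("00:00:00:00:88:88")
sf.activate()
print(sf.hw_addr, sf.active, sf.has_aux_device)
```

Raising an error on late configuration is a simplification chosen for this
sketch; how a real driver reacts to attribute changes on an active function
is driver specific.
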
Rate object management
======================

Devlink provides an API to manage the TX rates of a single devlink port or a
group of ports. This is done through rate objects, which can be one of two
types:

``leaf``
  Represents a single devlink port; created/destroyed by the driver. Since a
  leaf has a 1:1 mapping to its devlink port, in user space it is referred to
  as ``pci/<bus_addr>/<port_index>``;

``node``
  Represents a group of rate objects (leafs and/or nodes); created/deleted on
  request from user space; initially empty (no rate objects added). In
  user space it is referred to as ``pci/<bus_addr>/<node_name>``, where
  ``node_name`` can be any identifier except a decimal number, to avoid
  collisions with leafs.

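The naming rule above is enough to tell the two rate object types apart from
the handle alone. A minimal sketch, where ``rate_object_kind`` is a
hypothetical helper and the handle is assumed to always have the
``pci/<bus_addr>/<name>`` shape:

```python
def rate_object_kind(handle):
    """Classify a rate object handle as "leaf" or "node" by its last component."""
    bus, addr, name = handle.split("/", 2)
    if not name:
        raise ValueError(f"missing port index or node name in {handle!r}")
    # A purely decimal last component is a devlink port index, hence a leaf.
    return "leaf" if name.isascii() and name.isdigit() else "node"


print(rate_object_kind("pci/0000:06:00.0/1"))
print(rate_object_kind("pci/0000:06:00.0/group1"))
```

This is also why a node cannot be named with a decimal number: such a name
would be indistinguishable from a port index.
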
The API allows configuring the following rate object parameters:

``tx_share``
  Minimum TX rate value shared among all other rate objects or, if the rate
  object is part of a group, among the rate objects that are part of the same
  parent group.

``tx_max``
  Maximum TX rate value.

``tx_priority``
  Allows for usage of a strict priority arbiter among siblings. This
  arbitration scheme attempts to schedule nodes based on their priority
  as long as the nodes remain within their bandwidth limit. The higher the
  priority, the higher the probability that the node will get selected for
  scheduling.

``tx_weight``
  Allows for usage of a Weighted Fair Queuing (WFQ) arbitration scheme among
  siblings. This arbitration scheme can be used simultaneously with
  strict priority. A node configured with a higher weight gets more
  bandwidth (BW) relative to its siblings. Values are relative, like
  percentage points: they tell how much BW a node should take relative to
  its siblings.

``parent``
  Parent node name. The rate limits of a parent node act as additional limits
  on top of the limits of all its children. ``tx_max`` is an upper limit for
  children. ``tx_share`` is a total bandwidth distributed among children.

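One way to read the ``parent`` rule for ``tx_max`` is that a rate object can
never exceed the tightest limit found on the path to the root. The sketch
below illustrates that reading only; ``RateObject`` and ``effective_tx_max``
are hypothetical names, rates are plain integers in an arbitrary unit, and
treating an unset ``tx_max`` as "no limit" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RateObject:
    """Illustrative rate object: a leaf or a node with an optional parent."""
    name: str
    tx_max: Optional[int] = None           # None: no limit configured (assumption)
    parent: Optional["RateObject"] = None  # parent node, if any


def effective_tx_max(obj):
    """Return the tightest tx_max on the path from obj up to the root."""
    limit = None
    current = obj
    while current is not None:
        if current.tx_max is not None:
            limit = current.tx_max if limit is None else min(limit, current.tx_max)
        current = current.parent
    return limit


group = RateObject("group1", tx_max=1000)
leaf = RateObject("1", tx_max=4000, parent=group)
print(effective_tx_max(leaf))
```

Here the leaf asks for 4000 units but is bounded by the 1000 units configured
on its parent node.
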
``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.

Arbitration flow from the high level:

#. Choose the node, or group of nodes, with the highest priority that stay
   within their BW limit and are not blocked. Use ``tx_priority`` as a
   parameter for this arbitration.

#. If a group of nodes have the same priority, perform WFQ arbitration on
   that subgroup. Use ``tx_weight`` as a parameter for this arbitration.

#. Select the winner node, and continue the arbitration flow among its
   children, until a leaf node is reached and the winner is established.

#. If all the nodes from the highest priority subgroup are satisfied, or
   have overused their assigned BW, move to the lower priority nodes.

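The flow above can be sketched as a small recursive function. This is a
simplified software model, not the algorithm of any particular device or
driver: the ``Node`` class, the ``eligible`` flag (standing in for "within
its BW limit and not blocked") and the "least service relative to weight"
rule used to approximate WFQ are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Illustrative rate object taking part in arbitration."""
    name: str
    tx_priority: int = 0
    tx_weight: int = 1     # must be positive in this model
    eligible: bool = True  # stands in for "within BW limit and not blocked"
    served: int = 0        # scheduling rounds won so far (WFQ bookkeeping)
    children: list = field(default_factory=list)


def arbitrate(siblings):
    """Return the winning leaf among siblings, or None if nothing can be scheduled."""
    candidates = [n for n in siblings if n.eligible]
    # Steps 1 and 4: visit priority levels from the highest to the lowest.
    for prio in sorted({n.tx_priority for n in candidates}, reverse=True):
        group = [n for n in candidates if n.tx_priority == prio]
        # Step 2: inside one level, prefer the least service relative to weight.
        for node in sorted(group, key=lambda n: n.served / n.tx_weight):
            # Step 3: descend until a leaf is reached.
            leaf = arbitrate(node.children) if node.children else node
            if leaf is not None:
                node.served += 1
                return leaf
    return None


a = Node("a", tx_priority=1)
b = Node("b", tx_weight=3)
c = Node("c", tx_weight=1)
print(arbitrate([a, b, c]).name)  # "a": the only node at the highest priority

a.eligible = False                # e.g. "a" has used up its assigned BW
winners = [arbitrate([a, b, c]).name for _ in range(8)]
print(winners.count("b"), winners.count("c"))
```

With ``a`` out of the picture, ``b`` and ``c`` share the same priority and
are served in proportion to their weights over time, which is the behaviour
the combined use of ``tx_priority`` and ``tx_weight`` is meant to give.
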
Driver implementations are allowed to support both or either of the rate
object types, as well as the methods for setting their parameters.
Additionally, a driver implementation may export nodes/leafs and their
child-parent relationships.

Terms and Definitions
=====================

.. list-table:: Terms and Definitions
   :widths: 22 90

   * - Term
     - Definitions
   * - ``PCI device``
     - A physical PCI device that has one or more PCI buses and consists of
       one or more PCI controllers.
   * - ``PCI controller``
     - A controller consists of potentially multiple physical functions,
       virtual functions and subfunctions.
   * - ``Port function``
     - An object to manage the function of a port.
   * - ``Subfunction``
     - A lightweight function that has a parent PCI function on which it is
       deployed.
   * - ``Subfunction device``
     - A bus device of the subfunction, usually on an auxiliary bus.
   * - ``Subfunction driver``
     - A device driver for the subfunction auxiliary device.
   * - ``Subfunction management device``
     - A PCI physical function that supports subfunction management.
   * - ``Subfunction management driver``
     - A device driver for a PCI physical function that supports
       subfunction management using the devlink port interface.
   * - ``Subfunction host driver``
     - A device driver for a PCI physical function that hosts subfunction
       devices. In most cases it is the same as the subfunction management
       driver. When a subfunction is used on an external controller, the
       subfunction management and host drivers are different.
