.. SPDX-License-Identifier: GPL-2.0

.. _devlink_port:

============
Devlink Port
============

``devlink-port`` is a port that exists on the device. It has a logically
separate ingress/egress point of the device. A devlink port can be any one
of many flavours. A devlink port flavour along with port attributes
describes what a port represents.

A device driver that intends to publish a devlink port sets the
devlink port attributes and registers the devlink port.

Devlink port flavours are described below.

.. list-table:: List of devlink port flavours
   :widths: 33 90

   * - Flavour
     - Description
   * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
     - Any kind of physical port. This can be an eswitch physical port or any
       other physical port on the device.
   * - ``DEVLINK_PORT_FLAVOUR_DSA``
     - This indicates a DSA interconnect port.
   * - ``DEVLINK_PORT_FLAVOUR_CPU``
     - This indicates a CPU port applicable only to DSA.
   * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
     - This indicates an eswitch port representing a port of PCI
       physical function (PF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
     - This indicates an eswitch port representing a port of PCI
       virtual function (VF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
     - This indicates an eswitch port representing a port of PCI
       subfunction (SF).
   * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
     - This indicates a virtual port for the PCI virtual function.

A devlink port can have a different type based on the link layer, as
described below.

.. list-table:: List of devlink port types
   :widths: 23 90

   * - Type
     - Description
   * - ``DEVLINK_PORT_TYPE_ETH``
     - Driver should set this port type when a link layer of the port is
       Ethernet.
   * - ``DEVLINK_PORT_TYPE_IB``
     - Driver should set this port type when a link layer of the port is
       InfiniBand.
   * - ``DEVLINK_PORT_TYPE_AUTO``
     - This type is indicated by the user when driver should detect the port
       type automatically.

PCI controllers
---------------
In most cases a PCI device has only one controller. A controller consists of
potentially multiple physical functions, virtual functions and subfunctions.
A function consists of one or more ports. Each such port is represented by a
devlink eswitch port.

A PCI device connected to multiple CPUs or multiple PCI root complexes or a
SmartNIC, however, may have multiple controllers. For a device with multiple
controllers, each controller is distinguished by a unique controller number.
An eswitch is on the PCI device which supports ports of multiple controllers.

An example view of a system with two controllers::

                 ---------------------------------------------------------
                 |                                                       |
                 |           --------- ---------         ------- ------- |
    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
    | connect |  | -------                       -------                 |
    -----------  |     | controller_num=1 (no eswitch)                   |
                 ------|--------------------------------------------------
                 (internal wire)
                       |
                 ---------------------------------------------------------
                 | devlink eswitch ports and reps                        |
                 | ----------------------------------------------------- |
                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 |                                                       |
                 |                                                       |
    -----------  |           --------- ---------         ------- ------- |
    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
    -----------  | -------                       -------                 |
                 |                                                       |
                 |  local controller_num=0 (eswitch)                     |
                 ---------------------------------------------------------

In the above example, the external controller (identified by controller
number = 1) doesn't have the eswitch. The local controller (identified by
controller number = 0) has the eswitch. The devlink instance on the local
controller has eswitch devlink ports for both controllers.

Function configuration
======================

Users can configure one or more function attributes before enumerating the PCI
function. Usually this means the user should configure the function attribute
before a bus-specific device for the function is created. However, when
SRIOV is enabled, virtual function devices are created on the PCI bus.
Hence, the function attribute should be configured before binding the virtual
function device to the driver. For subfunctions, this means the user should
configure the port function attribute before activating the port function.

A user may set the hardware address of the function using the
``devlink port function set hw_addr`` command. For an Ethernet port function
this means a MAC address.

Users may also set the RoCE capability of the function using the
``devlink port function set roce`` command.

Users may also set the function as migratable using the
``devlink port function set migratable`` command.

Users may also set the IPsec crypto capability of the function using the
``devlink port function set ipsec_crypto`` command.

Users may also set the IPsec packet capability of the function using the
``devlink port function set ipsec_packet`` command.

Users may also set the maximum IO event queues of the function
using the ``devlink port function set max_io_eqs`` command.
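
The ``devlink port function set`` commands listed above all take a port
handle, an attribute name and a value. As a minimal sketch (not part of the
devlink tooling), the following Python helper assembles such a command line
without executing it. The helper name ``function_set_argv`` and its
validation are illustrative assumptions; only the attribute names and the
``enable``/``disable`` values come from this document.

.. code-block:: python

    # Attributes from this document that take ``enable`` or ``disable``.
    BOOL_ATTRS = ("roce", "migratable", "ipsec_crypto", "ipsec_packet")


    def function_set_argv(port, attr, value):
        """Return argv for one ``devlink port function set`` call.

        Hypothetical helper: it only builds the argument list and never
        runs the command.
        """
        argv = ["devlink", "port", "function", "set", port, attr]
        if attr == "hw_addr":
            # For an Ethernet port function this is a MAC address.
            argv.append(str(value))
        elif attr in BOOL_ATTRS:
            argv.append("enable" if value else "disable")
        elif attr == "max_io_eqs":
            argv.append(str(int(value)))
        else:
            raise ValueError(f"unknown function attribute: {attr}")
        return argv


    print(" ".join(function_set_argv("pci/0000:06:00.0/2", "roce", False)))

Run as is, the sketch prints
``devlink port function set pci/0000:06:00.0/2 roce disable``, which matches
the RoCE example later in this document.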

Function attributes
===================

MAC address setup
-----------------
The configured MAC address of the PCI VF/SF will be used by the netdevice and
RDMA device created for the PCI VF/SF.

- Get the MAC address of the VF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the VF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:11:22:33:44:55

- Get the MAC address of the SF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the SF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:88:88

RoCE capability setup
---------------------
Not all PCI VFs/SFs require RoCE capability.

When RoCE capability is disabled, it saves system memory per PCI VF/SF.

When the user disables RoCE capability for a VF/SF, user applications cannot
send or receive any RoCE packets through this VF/SF, and the RoCE GID table
for this PCI function will be empty.

When RoCE capability is disabled in the device using the port function
attribute, the VF/SF driver cannot override it.

- Get RoCE capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 roce enable

- Set RoCE capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 roce disable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 roce disable

Migratable capability setup
---------------------------
Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.

Users who want PCI VFs to be able to perform live migration need to
explicitly enable the VF migratable capability.

When the user enables the migratable capability for a VF, and the hypervisor
(HV) binds the VF to a VFIO driver with migration support, the user can
migrate the VM with this VF from one HV to a different one.

However, when the migratable capability is enabled, the device will disable
features which cannot be migrated. The migratable capability can thus impose
limitations on a VF, so the decision is left to the user.

Example of LM with migratable function configuration:

- Get migratable capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 migratable disable

- Set migratable capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 migratable enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 migratable enable

- Bind VF to VFIO driver with migration support::

    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
    $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind

Attach VF to the VM.
Start the VM.
Perform live migration.

IPsec crypto capability setup
-----------------------------
When user enables IPsec crypto capability for a VF, user application can offload
XFRM state crypto operation (Encrypt/Decrypt) to this VF.

When IPsec crypto capability is disabled (default) for a VF, the XFRM state is
processed in software by the kernel.

- Get IPsec crypto capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_crypto disabled

- Set IPsec crypto capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_crypto enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_crypto enabled

IPsec packet capability setup
-----------------------------
When the user enables IPsec packet capability for a VF, user applications can
offload XFRM state and policy crypto operations (Encrypt/Decrypt) to this VF,
as well as IPsec encapsulation.

When IPsec packet capability is disabled (default) for a VF, the XFRM state and
policy are processed in software by the kernel.

- Get IPsec packet capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_packet disabled

- Set IPsec packet capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_packet enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_packet enabled

Maximum IO event queues setup
-----------------------------
When the user sets the maximum number of IO event queues for an SF or
a VF, the function driver is limited to consuming only the enforced
number of IO event queues.

IO event queues deliver events related to IO queues, including network
device transmit and receive queues (txq and rxq) and RDMA Queue Pairs (QPs).
For example, the number of netdevice channels and RDMA device completion
vectors are derived from the function's IO event queues. Usually, the number
of interrupt vectors consumed by the driver is limited by the number of IO
event queues per device, as each of the IO event queues is connected to an
interrupt vector.

- Get maximum IO event queues of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 10

- Set maximum IO event queues of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 max_io_eqs 32

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 32

Subfunction
===========

A subfunction is a lightweight function that has a parent PCI function on which
it is deployed. A subfunction is created and deployed in units of 1. Unlike
SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
A subfunction communicates with the hardware through the parent PCI function.

To use a subfunction, a three-step setup sequence is followed:

1) create - create a subfunction;
2) configure - configure subfunction attributes;
3) deploy - deploy the subfunction;

Subfunction management is done using the devlink port user interface.
The user performs setup on the subfunction management device.

(1) Create
----------
A subfunction is created using a devlink port interface. A user adds the
subfunction by adding a devlink port of subfunction flavour. The devlink
kernel code calls down to the subfunction management driver (devlink ops) and
asks it to create a subfunction devlink port. The driver then instantiates the
subfunction port and any associated objects such as health reporters and
representor netdevice.

(2) Configure
-------------
A subfunction devlink port is created but it is not active yet. That means the
entities are created on the devlink side and the e-switch port representor is
created, but the subfunction device itself is not created. A user might use the
e-switch port representor to do settings, putting it into a bridge, adding TC
rules, etc. A user might as well configure the hardware address (such as a MAC
address) of the subfunction while the subfunction is inactive.

(3) Deploy
----------
Once a subfunction is configured, the user must activate it to use it. Upon
activation, the subfunction management driver asks the subfunction management
device to instantiate the subfunction device on a particular PCI function.
A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
At this point a matching subfunction driver binds to the subfunction's auxiliary device.

Rate object management
======================

Devlink provides an API to manage the TX rates of a single devlink port or a
group. This is done through rate objects, which can be one of two types:

``leaf``
  Represents a single devlink port; created/destroyed by the driver. Since a
  leaf has a 1:1 mapping to its devlink port, in user space it is referred to
  as ``pci/<bus_addr>/<port_index>``;

``node``
  Represents a group of rate objects (leafs and/or nodes); created/deleted by
  request from the userspace; initially empty (no rate objects added). In
  userspace it is referred to as ``pci/<bus_addr>/<node_name>``, where
  ``node_name`` can be any identifier, except a decimal number, to avoid
  collisions with leafs.
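
Because a leaf is named by a decimal port index and a node by any other
identifier, the type of a rate object can be told from its handle alone.
The following is a minimal sketch of that naming rule, assuming handles of
the ``pci/<bus_addr>/<name>`` form described above. The function name
``rate_object_kind`` is hypothetical, and treating "decimal number" as a
string of ASCII digits is an assumption of this sketch.

.. code-block:: python

    def rate_object_kind(handle):
        """Classify a rate object handle as ``leaf`` or ``node``."""
        prefix, sep, name = handle.rpartition("/")
        if not sep or not prefix or not name:
            raise ValueError(f"malformed rate object handle: {handle!r}")
        # A decimal last component is a port index, hence a leaf.
        if name.isascii() and name.isdigit():
            return "leaf"
        # Any other identifier names a node (a group of rate objects).
        return "node"

For example, ``rate_object_kind("pci/0000:06:00.0/2")`` returns ``"leaf"``,
while a handle ending in a name such as ``group1`` is classified as a node.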

The API allows configuring the following rate object parameters:

``tx_share``
  Minimum TX rate value shared among all other rate objects, or rate objects
  that are part of the parent group, if it is a part of the same group.

``tx_max``
  Maximum TX rate value.

``tx_priority``
  Allows for usage of a strict priority arbiter among siblings. This
  arbitration scheme attempts to schedule nodes based on their priority
  as long as the nodes remain within their bandwidth limit. The higher the
  priority, the higher the probability that the node will get selected for
  scheduling.

``tx_weight``
  Allows for usage of the Weighted Fair Queuing (WFQ) arbitration scheme among
  siblings. This arbitration scheme can be used simultaneously with the
  strict priority. As a node is configured with a higher rate it gets more
  BW relative to its siblings. Values are relative, like percentage
  points; they tell how much BW a node should take relative to
  its siblings.

``parent``
  Parent node name. Parent node rate limits are considered as additional limits
  to all node children limits. ``tx_max`` is an upper limit for children.
  ``tx_share`` is a total bandwidth distributed among children.

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.

Arbitration flow from the high level:

#. Choose a node, or group of nodes, with the highest priority that stays
   within the BW limit and is not blocked. Use ``tx_priority`` as a
   parameter for this arbitration.

#. If a group of nodes has the same priority, perform WFQ arbitration on
   that subgroup. Use ``tx_weight`` as a parameter for this arbitration.

#. Select the winner node, and continue the arbitration flow among its
   children, until a leaf node is reached and the winner is established.

#. If all the nodes from the highest priority sub-group are satisfied, or
   have overused their assigned BW, move to the lower priority nodes.

Driver implementations are allowed to support both or either rate object types
and setting methods of their parameters. Additionally, a driver implementation
may export nodes/leafs and their child-parent relationships.

Terms and Definitions
=====================

.. list-table:: Terms and Definitions
   :widths: 22 90

   * - Term
     - Definitions
   * - ``PCI device``
     - A physical PCI device having one or more PCI buses consists of one or
       more PCI controllers.
   * - ``PCI controller``
     - A controller consists of potentially multiple physical functions,
       virtual functions and subfunctions.
   * - ``Port function``
     - An object to manage the function of a port.
   * - ``Subfunction``
     - A lightweight function that has a parent PCI function on which it is
       deployed.
   * - ``Subfunction device``
     - A bus device of the subfunction, usually on an auxiliary bus.
   * - ``Subfunction driver``
     - A device driver for the subfunction auxiliary device.
   * - ``Subfunction management device``
     - A PCI physical function that supports subfunction management.
   * - ``Subfunction management driver``
     - A device driver for a PCI physical function that supports
       subfunction management using the devlink port interface.
   * - ``Subfunction host driver``
     - A device driver for a PCI physical function that hosts subfunction
       devices. In most cases it is the same as the subfunction management
       driver. When a subfunction is used on an external controller, the
       subfunction management and host drivers are different.