.. SPDX-License-Identifier: GPL-2.0

===================
ice devlink support
===================

This document describes the devlink features implemented by the ``ice``
device driver.

Parameters
==========

.. list-table:: Generic parameters implemented
   :widths: 5 5 90

   * - Name
     - Mode
     - Notes
   * - ``enable_roce``
     - runtime
     - mutually exclusive with ``enable_iwarp``
   * - ``enable_iwarp``
     - runtime
     - mutually exclusive with ``enable_roce``
   * - ``tx_scheduling_layers``
     - permanent
     - The ice hardware uses hierarchical scheduling for Tx with a fixed
       number of layers in the scheduling tree. Each layer is a decision
       point. The root node represents a port, while the leaves represent
       the queues. This way of configuring the Tx scheduler allows features
       like DCB or devlink-rate (documented below) to configure how much
       bandwidth is given to any given queue or group of queues, enabling
       fine-grained control because scheduling parameters can be configured
       at any given layer of the tree.

       The default 9-layer tree topology was deemed best for most workloads,
       as it gives an optimal ratio of performance to configurability. However,
       for some specific cases, this 9-layer topology might not be desired.
       One example would be sending traffic to a number of queues that is not
       a multiple of 8. Because the maximum radix is limited to 8 in the
       9-layer topology, the 9th queue has a different parent than the rest,
       and it is given more bandwidth credits. This causes a problem when the
       system is sending traffic to 9 queues:

       | tx_queue_0_packets: 24163396
       | tx_queue_1_packets: 24164623
       | tx_queue_2_packets: 24163188
       | tx_queue_3_packets: 24163701
       | tx_queue_4_packets: 24163683
       | tx_queue_5_packets: 24164668
       | tx_queue_6_packets: 23327200
       | tx_queue_7_packets: 24163853
       | tx_queue_8_packets: 91101417 < Too much traffic is sent from 9th

       To address this need, you can switch to a 5-layer topology, which
       changes the maximum topology radix to 512. With this change, all
       queues perform equally because they can be assigned to the same
       parent in the tree. The obvious drawback of this solution is a lower
       configuration depth of the tree.

       Use the ``tx_scheduling_layers`` parameter with the devlink command
       to change the transmit scheduler topology. To use the 5-layer
       topology, use a value of 5. For example:

       | $ devlink dev param set pci/0000:16:00.0 name tx_scheduling_layers value 5 cmode permanent

       Use a value of 9 to set it back to the default value.

       You must power cycle the PCI slot for the selected topology to take
       effect.

       To verify that the value has been set:

       | $ devlink dev param show pci/0000:16:00.0 name tx_scheduling_layers

.. list-table:: Driver specific parameters implemented
   :widths: 5 5 90

   * - Name
     - Mode
     - Description
   * - ``local_forwarding``
     - runtime
     - Controls loopback behavior by tuning scheduler bandwidth.
       It impacts all kinds of functions: physical, virtual and
       subfunctions.
       Supported values are:

       ``enabled`` - loopback traffic is allowed on this port

       ``disabled`` - loopback traffic is not allowed on this port

       ``prioritized`` - loopback traffic is prioritized on this port

       The default value of the ``local_forwarding`` parameter is ``enabled``.
       ``prioritized`` makes it possible to adjust the loopback traffic rate
       to increase the capacity of one port at the cost of another. The user
       needs to disable local forwarding on one of the ports in order to have
       increased capacity on the ``prioritized`` port.

Info versions
=============

The ``ice`` driver reports the following versions:

.. list-table:: devlink info versions implemented
   :widths: 5 5 5 90

   * - Name
     - Type
     - Example
     - Description
   * - ``board.id``
     - fixed
     - K65390-000
     - The Product Board Assembly (PBA) identifier of the board.
   * - ``cgu.id``
     - fixed
     - 36
     - The Clock Generation Unit (CGU) hardware revision identifier.
   * - ``fw.mgmt``
     - running
     - 2.1.7
     - 3-digit version number of the management firmware running on the
       Embedded Management Processor of the device. It controls the PHY,
       link, access to device resources, etc. Intel documentation refers to
       this as the EMP firmware.
   * - ``fw.mgmt.api``
     - running
     - 1.5.1
     - 3-digit version number (major.minor.patch) of the API exported over
       the AdminQ by the management firmware. Used by the driver to
       identify what commands are supported. Historical versions of the
       kernel only displayed a 2-digit version number (major.minor).
   * - ``fw.mgmt.build``
     - running
     - 0x305d955f
     - Unique identifier of the source for the management firmware.
   * - ``fw.undi``
     - running
     - 1.2581.0
     - Version of the Option ROM containing the UEFI driver. The version is
       reported in ``major.minor.patch`` format. The major version is
       incremented whenever a major breaking change occurs, or when the
       minor version would overflow. The minor version is incremented for
       non-breaking changes and reset to 1 when the major version is
       incremented. The patch version is normally 0 but is incremented when
       a fix is delivered as a patch against an older base Option ROM.
   * - ``fw.psid.api``
     - running
     - 0.80
     - Version defining the format of the flash contents.
   * - ``fw.bundle_id``
     - running
     - 0x80002ec0
     - Unique identifier of the firmware image file that was loaded onto
       the device. Also referred to as the EETRACK identifier of the NVM.
   * - ``fw.app.name``
     - running
     - ICE OS Default Package
     - The name of the DDP package that is active in the device. The DDP
       package is loaded by the driver during initialization. Each
       variation of the DDP package has a unique name.
   * - ``fw.app``
     - running
     - 1.3.1.0
     - The version of the DDP package that is active in the device. Note
       that both the name (as reported by ``fw.app.name``) and version are
       required to uniquely identify the package.
   * - ``fw.app.bundle_id``
     - running
     - 0xc0000001
     - Unique identifier for the DDP package loaded in the device. Also
       referred to as the DDP Track ID. Can be used to uniquely identify
       the specific DDP package.
   * - ``fw.netlist``
     - running
     - 1.1.2000-6.7.0
     - The version of the netlist module. This module defines the device's
       Ethernet capabilities and default settings, and is used by the
       management firmware as part of managing link and device
       connectivity.
   * - ``fw.netlist.build``
     - running
     - 0xee16ced7
     - The first 4 bytes of the hash of the netlist module contents.
   * - ``fw.cgu``
     - running
     - 8032.16973825.6021
     - The version of the Clock Generation Unit (CGU). Format:
       <CGU type>.<configuration version>.<firmware version>.

Flash Update
============

The ``ice`` driver implements support for flash update using the
``devlink-flash`` interface. It supports updating the device flash using a
combined flash image that contains the ``fw.mgmt``, ``fw.undi``, and
``fw.netlist`` components.

.. list-table:: List of supported overwrite modes
   :widths: 5 95

   * - Bits
     - Behavior
   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
     - Do not preserve settings stored in the flash components being
       updated. This includes overwriting the port configuration that
       determines the number of physical functions the device will
       initialize with.
   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS`` and ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
     - Do not preserve either settings or identifiers. Overwrite everything
       in the flash with the contents from the provided image, without
       performing any preservation. This includes overwriting device
       identifying fields such as the MAC address, VPD area, and device
       serial number. It is expected that this combination be used with an
       image customized for the specific device.

The ice hardware does not support overwriting only identifiers while
preserving settings, and thus ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS`` on its
own will be rejected. If no overwrite mask is provided, the firmware will be
instructed to preserve all settings and identifying fields when updating.

Reload
======

The ``ice`` driver supports activating new firmware after a flash update
using ``DEVLINK_CMD_RELOAD`` with the ``DEVLINK_RELOAD_ACTION_FW_ACTIVATE``
action.

.. code:: shell

    $ devlink dev reload pci/0000:01:00.0 action fw_activate

The new firmware is activated by issuing a device specific Embedded
Management Processor reset which requests the device to reset and reload the
EMP firmware image.

The driver does not currently support reloading the driver via
``DEVLINK_RELOAD_ACTION_DRIVER_REINIT``.

Port split
==========

The ``ice`` driver supports port splitting only for port 0, as the FW has
a predefined set of available port split options for the whole device.

A system reboot is required for port split to be applied.

The following command will select the port split option with 4 ports:

.. code:: shell

    $ devlink port split pci/0000:16:00.0/0 count 4

The list of all available port options will be printed to dynamic debug after
each ``split`` and ``unsplit`` command. The first option is the default.

.. code:: shell

    ice 0000:16:00.0: Available port split options and max port speeds (Gbps):
    ice 0000:16:00.0: Status  Split      Quad 0          Quad 1
    ice 0000:16:00.0:         count  L0  L1  L2  L3  L4  L5  L6  L7
    ice 0000:16:00.0: Active  2     100   -   -   - 100   -   -   -
    ice 0000:16:00.0:         2      50   -  50   -   -   -   -   -
    ice 0000:16:00.0: Pending 4      25  25  25  25   -   -   -   -
    ice 0000:16:00.0:         4      25  25   -   -  25  25   -   -
    ice 0000:16:00.0:         8      10  10  10  10  10  10  10  10
    ice 0000:16:00.0:         1     100   -   -   -   -   -   -   -

There could be multiple FW port options with the same port split count. When
the same port split count request is issued again, the next FW port option with
the same port split count will be selected.

``devlink port unsplit`` will select the option with a split count of 1. If
there is no FW option available with split count 1, you will receive an error.

Regions
=======

The ``ice`` driver implements the following regions for accessing internal
device data.

.. list-table:: regions implemented
   :widths: 15 85

   * - Name
     - Description
   * - ``nvm-flash``
     - The contents of the entire flash chip, sometimes referred to as
       the device's Non Volatile Memory.
   * - ``shadow-ram``
     - The contents of the Shadow RAM, which is loaded from the beginning
       of the flash. Although the contents are primarily from the flash,
       this area also contains data generated during device boot which is
       not stored in flash.
   * - ``device-caps``
     - The contents of the device firmware's capabilities buffer. Useful to
       determine the current state and configuration of the device.

Both the ``nvm-flash`` and ``shadow-ram`` regions can be accessed without a
snapshot. The ``device-caps`` region requires a snapshot as the contents are
sent by firmware and can't be split into separate reads.

Users can request an immediate capture of a snapshot for all three regions
via the ``DEVLINK_CMD_REGION_NEW`` command.

.. code:: shell

    $ devlink region show
    pci/0000:01:00.0/nvm-flash: size 10485760 snapshot [] max 1
    pci/0000:01:00.0/device-caps: size 4096 snapshot [] max 10

    $ devlink region new pci/0000:01:00.0/nvm-flash snapshot 1

    $ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1
    0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
    0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
    0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
    0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5

    $ devlink region read pci/0000:01:00.0/nvm-flash snapshot 1 address 0 length 16
    0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30

    $ devlink region delete pci/0000:01:00.0/nvm-flash snapshot 1

    $ devlink region new pci/0000:01:00.0/device-caps snapshot 1
    $ devlink region dump pci/0000:01:00.0/device-caps snapshot 1
    0000000000000000 01 00 01 00 00 00 00 00 01 00 00 00 00 00 00 00
    0000000000000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000020 02 00 02 01 32 03 00 00 0a 00 00 00 25 00 00 00
    0000000000000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000040 04 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000060 05 00 01 00 03 00 00 00 00 00 00 00 00 00 00 00
    0000000000000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000080 06 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000a0 08 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000c0 12 00 01 00 01 00 00 00 01 00 01 00 00 00 00 00
    00000000000000d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000000e0 13 00 01 00 00 01 00 00 00 00 00 00 00 00 00 00
    00000000000000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000100 14 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000120 15 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000140 16 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    0000000000000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000160 17 00 01 00 06 00 00 00 00 00 00 00 00 00 00 00
    0000000000000170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000180 18 00 01 00 01 00 00 00 01 00 00 00 08 00 00 00
    0000000000000190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001a0 22 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
    00000000000001b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001c0 40 00 01 00 00 08 00 00 08 00 00 00 00 00 00 00
    00000000000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000000001e0 41 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
    00000000000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0000000000000200 42 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
    0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

    $ devlink region delete pci/0000:01:00.0/device-caps snapshot 1

Devlink Rate
============

The ``ice`` driver implements the devlink-rate API. It allows for offload of
the Hierarchical QoS to the hardware.
It enables the user to group Virtual
Functions in a tree structure and assign supported parameters (``tx_share``,
``tx_max``, ``tx_priority`` and ``tx_weight``) to each node in the tree. This
effectively lets the user control how much bandwidth is allocated to each
VF group. The allocation is then enforced by the HW.

It is assumed that this feature is mutually exclusive with DCB performed
in FW and ADQ, or any driver feature that would trigger changes in QoS,
for example creation of a new traffic class. The driver will prevent DCB
or ADQ configuration if the user has started making any changes to the nodes
using the devlink-rate API. To configure those features, a driver reload is
necessary. Correspondingly, if ADQ or DCB is configured, the driver won't
export the hierarchy at all, or will remove the untouched hierarchy if those
features are enabled after the hierarchy is exported but before any
changes are made.

This feature is also dependent on switchdev being enabled in the system.
This is required because devlink-rate requires devlink-port objects to be
present, and those objects are only created in switchdev mode.

If the driver is set to switchdev mode, it will export the internal
hierarchy the moment VFs are created. The root of the tree is always
represented by ``node_0``. This node can't be deleted by the user. Leaf
nodes and nodes with children also can't be deleted.

.. list-table:: Attributes supported
   :widths: 15 85

   * - Name
     - Description
   * - ``tx_max``
     - maximum bandwidth to be consumed by the tree Node. Rate Limit is
       an absolute number specifying the maximum number of bytes a Node may
       consume during the course of one second. Rate limit guarantees
       that a link will not oversaturate the receiver on the remote end
       and also enforces an SLA between the subscriber and network
       provider.
   * - ``tx_share``
     - minimum bandwidth allocated to a tree node when it is not blocked.
       It specifies an absolute BW. While ``tx_max`` defines the maximum
       bandwidth the node may consume, ``tx_share`` marks the committed BW
       for the Node.
   * - ``tx_priority``
     - allows for usage of a strict priority arbiter among siblings. This
       arbitration scheme attempts to schedule nodes based on their
       priority as long as the nodes remain within their bandwidth limit.
       Range 0-7. Nodes with priority 7 have the highest priority and are
       selected first, while nodes with priority 0 have the lowest
       priority. Nodes that have the same priority are treated equally.
   * - ``tx_weight``
     - allows for usage of the Weighted Fair Queuing arbitration scheme
       among siblings. This arbitration scheme can be used simultaneously
       with the strict priority. Range 1-200. Only relative values matter
       for arbitration.

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.

.. code:: shell

    # enable switchdev
    $ devlink dev eswitch set pci/0000:4b:00.0 mode switchdev

    # at this point driver should export internal hierarchy
    $ echo 2 > /sys/class/net/ens785np0/device/sriov_numvfs

    $ devlink port function rate show
    pci/0000:4b:00.0/node_25: type node parent node_24
    pci/0000:4b:00.0/node_24: type node parent node_0
    pci/0000:4b:00.0/node_32: type node parent node_31
    pci/0000:4b:00.0/node_31: type node parent node_30
    pci/0000:4b:00.0/node_30: type node parent node_16
    pci/0000:4b:00.0/node_19: type node parent node_18
    pci/0000:4b:00.0/node_18: type node parent node_17
    pci/0000:4b:00.0/node_17: type node parent node_16
    pci/0000:4b:00.0/node_14: type node parent node_5
    pci/0000:4b:00.0/node_5: type node parent node_3
    pci/0000:4b:00.0/node_13: type node parent node_4
    pci/0000:4b:00.0/node_12: type node parent node_4
    pci/0000:4b:00.0/node_11: type node parent node_4
    pci/0000:4b:00.0/node_10: type node parent node_4
    pci/0000:4b:00.0/node_9: type node parent node_4
    pci/0000:4b:00.0/node_8: type node parent node_4
    pci/0000:4b:00.0/node_7: type node parent node_4
    pci/0000:4b:00.0/node_6: type node parent node_4
    pci/0000:4b:00.0/node_4: type node parent node_3
    pci/0000:4b:00.0/node_3: type node parent node_16
    pci/0000:4b:00.0/node_16: type node parent node_15
    pci/0000:4b:00.0/node_15: type node parent node_0
    pci/0000:4b:00.0/node_2: type node parent node_1
    pci/0000:4b:00.0/node_1: type node parent node_0
    pci/0000:4b:00.0/node_0: type node
    pci/0000:4b:00.0/1: type leaf parent node_25
    pci/0000:4b:00.0/2: type leaf parent node_25

    # let's create some custom node
    $ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0

    # second custom node
    $ devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent node_custom

    # reassign second VF to newly created branch
    $ devlink port function rate set pci/0000:4b:00.0/2 parent node_custom_1

    # assign tx_weight to the VF
    $ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 5

    # assign tx_share to the VF
    $ devlink port function rate set pci/0000:4b:00.0/2 tx_share 500Mbps
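
The exported hierarchy can be hard to follow in the flat listing printed by
``devlink port function rate show``. The following is a minimal Python sketch
for inspecting it, not part of the driver or of iproute2. It assumes the text
format shown in the sample output above, with attributes printed as
``key value`` pairs, and a single device. The helper names ``parse_rate_show``
and ``path_to_root`` are hypothetical.

```python
def parse_rate_show(text):
    """Map each rate object name to a (type, parent) tuple.

    Assumes lines shaped like the sample output in this document:
        pci/0000:4b:00.0/node_25: type node parent node_24
    and that every attribute is printed as a "key value" pair.
    """
    objects = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split the devlink handle from its attributes at the first ": ".
        handle, _, attrs = line.partition(": ")
        # The object name is the last path component of the handle.
        name = handle.rsplit("/", 1)[-1]
        words = attrs.split()
        fields = dict(zip(words[0::2], words[1::2]))
        objects[name] = (fields.get("type"), fields.get("parent"))
    return objects


def path_to_root(objects, name):
    """Return the chain of names from `name` up to the root node."""
    path = [name]
    while objects[path[-1]][1] is not None:
        path.append(objects[path[-1]][1])
    return path


# A shortened copy of the sample output above.
SAMPLE = """\
pci/0000:4b:00.0/node_25: type node parent node_24
pci/0000:4b:00.0/node_24: type node parent node_0
pci/0000:4b:00.0/node_0: type node
pci/0000:4b:00.0/1: type leaf parent node_25
pci/0000:4b:00.0/2: type leaf parent node_25
"""

objects = parse_rate_show(SAMPLE)
print(path_to_root(objects, "1"))  # ['1', 'node_25', 'node_24', 'node_0']
```

Walking from a leaf to ``node_0`` this way shows which intermediate nodes a
VF's traffic passes through, which can help when deciding where to attach a
custom node.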