.. SPDX-License-Identifier: GPL-2.0

=============
Devlink DPIPE
=============

Background
==========

While performing the hardware offloading process, much of the hardware
specifics cannot be presented. These details are useful for debugging, and
``devlink-dpipe`` provides a standardized way to provide visibility into the
offloading process.

For example, the routing longest prefix match (LPM) algorithm used by the
Linux kernel may differ from the hardware implementation. The pipeline debug
API (DPIPE) is aimed at providing the user visibility into the ASIC's
pipeline in a generic way.

The hardware offload process is expected to be done in a way that the user
should not be able to distinguish between the hardware vs. software
implementation. In this process, hardware specifics are neglected. In
reality those details can have lots of meaning and should be exposed in some
standard way.

This problem is made even more complex when one wishes to offload the
control path of the whole networking stack to a switch ASIC.
Due to
differences in the hardware and software models, some processes cannot be
represented correctly.

One example is the kernel's LPM algorithm, which in many cases differs
greatly from the hardware implementation. The configuration API is the same,
but one cannot rely on the Forwarding Information Base (FIB) to look like the
Level Path Compression trie (LPC-trie) in hardware.

In many situations, trying to analyze system failures solely based on the
kernel's dump may not be enough. By combining this data with complementary
information about the underlying hardware, this debugging can be made
easier; additionally, the information can be useful when debugging
performance issues.

Overview
========

The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is
modeled as a graph of match/action tables. Each table represents a specific
hardware block. This model is not new, first being used by the P4 language.

Traditionally it has been used as an alternative model for hardware
configuration, but the ``devlink-dpipe`` interface uses it for visibility
purposes as a standard complementary tool.
The system's view from
``devlink-dpipe`` should change according to the changes done by the
standard configuration tools.

For example, it’s quite common to implement Access Control Lists (ACL)
using Ternary Content Addressable Memory (TCAM). The TCAM memory can be
divided into TCAM regions. Complex TC filters can have multiple rules with
different priorities and different lookup keys. On the other hand, hardware
TCAM regions have a predefined lookup key. Offloading the TC filter rules
using a TCAM engine can result in multiple TCAM regions being interconnected
in a chain (which may affect the data path latency). In response to a new TC
filter, new tables should be created describing those regions.

Model
=====

The ``DPIPE`` model introduces several objects:

  * headers
  * tables
  * entries

A ``header`` describes packet formats and provides names for fields within
the packet. A ``table`` describes hardware blocks. An ``entry`` describes
the actual content of a specific table.

The hardware pipeline is not port specific, but rather describes the whole
ASIC.
Thus it is tied to the top of the ``devlink`` infrastructure.

Drivers can register and unregister tables at run time, in order to support
dynamic behavior. This dynamic behavior is mandatory for describing hardware
blocks like TCAM regions which can be allocated and freed dynamically.

``devlink-dpipe`` generally is not intended for configuration. The exception
is hardware counting for a specific table.

The following commands are used to obtain the ``dpipe`` objects from
userspace:

  * ``table_get``: Receive a table's description.
  * ``headers_get``: Receive a device's supported headers.
  * ``entries_get``: Receive a table's current entries.
  * ``counters_set``: Enable or disable counters on a table.

Table
-----

The driver should implement the following operations for each table:

  * ``matches_dump``: Dump the supported matches.
  * ``actions_dump``: Dump the supported actions.
  * ``entries_dump``: Dump the actual content of the table.
  * ``counters_set_update``: Synchronize hardware with counters enabled or
    disabled.
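
As an illustration only, the four per-table operations listed above can be
modeled in userspace Python. This is not the kernel's C interface: the
``DpipeTable`` class, its attributes, and the ``hw_counting`` flag are
hypothetical names chosen for this sketch; only the operation names and the
``local_host`` table layout come from this document.

.. code:: python

    from dataclasses import dataclass, field


    @dataclass
    class DpipeTable:
        """Hypothetical userspace model of one dpipe table (not kernel code)."""
        name: str
        size: int
        matches: list = field(default_factory=list)  # (header.field, match type)
        actions: list = field(default_factory=list)  # (header.field, action type)
        entries: list = field(default_factory=list)  # current table content
        counters_enabled: bool = False
        hw_counting: bool = False  # stands in for the hardware counter state

        def matches_dump(self):
            # Report the matches this table supports.
            return list(self.matches)

        def actions_dump(self):
            # Report the actions this table supports.
            return list(self.actions)

        def entries_dump(self):
            # Report the actual content of the table.
            return list(self.entries)

        def counters_set_update(self, enable):
            # Bring the modeled hardware in line with the requested state.
            self.counters_enabled = enable
            self.hw_counting = enable


    # The local host table from the abstraction example in this document.
    local_host = DpipeTable(
        name="local_host",
        size=4096,
        matches=[("meta.rif_port", "exact"), ("ipv4.dst_addr", "exact")],
        actions=[("ethernet.daddr", "set")],
    )
    local_host.counters_set_update(True)

In this model ``counters_set_update`` is the only operation that changes
state, which mirrors the statement that counting is the one exception to
``devlink-dpipe`` being a read-only view.
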

Header/Field
------------

In a similar way to P4, headers and fields are used to describe a table's
behavior. There is a slight difference between the standard protocol headers
and specific ASIC metadata. The protocol headers should be declared in the
``devlink`` core API. On the other hand, ASIC metadata is driver specific
and should be defined in the driver. Additionally, each driver-specific
devlink documentation file should document the driver-specific ``dpipe``
headers it implements. The headers and fields are identified by enumeration.

In order to provide further visibility some ASIC metadata fields could be
mapped to kernel objects. For example, internal router interface indexes can
be directly mapped to the net device ifindex. FIB table indexes used by
different Virtual Routing and Forwarding (VRF) tables can be mapped to
internal routing table indexes.

Match
-----

Matches are kept primitive and close to hardware operation. Match types like
LPM are not supported due to the fact that this is exactly a process we wish
to describe in full detail.
Examples of matches:

  * ``field_exact``: Exact match on a specific field.
  * ``field_exact_mask``: Exact match on a specific field after masking.
  * ``field_range``: Match on a specific range.

The IDs of the header and the field should be specified in order to
identify the specific field. Furthermore, the header index should be
specified in order to distinguish multiple headers of the same type in a
packet (tunneling).

Action
------

Similar to match, the actions are kept primitive and close to hardware
operation. For example:

  * ``field_modify``: Modify the field value.
  * ``field_inc``: Increment the field value.
  * ``push_header``: Add a header.
  * ``pop_header``: Remove a header.

Entry
-----

Entries of a specific table can be dumped on demand. Each entry is
identified with an index and its properties are described by a list of
match/action values and a specific counter. By dumping the tables' content
the interactions between tables can be resolved.
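
The primitive match types and the shape of an entry can be sketched in
userspace Python as follows. This is a minimal illustration, not the devlink
implementation: the function signatures, the ``FieldRef`` and ``Entry``
types, the inclusive range bounds, and the sample addresses are all
assumptions made for this example.

.. code:: python

    import ipaddress
    from collections import namedtuple
    from dataclasses import dataclass

    # A field is identified by header ID, field ID, and a header index that
    # separates repeated headers of the same type (for example in tunneling).
    FieldRef = namedtuple("FieldRef", ["header_id", "field_id", "header_index"])


    def field_exact(value, key):
        # Exact match on a specific field.
        return value == key


    def field_exact_mask(value, key, mask):
        # Exact match after applying the same mask to both sides.
        return (value & mask) == (key & mask)


    def field_range(value, low, high):
        # Match on a range; treating both bounds as inclusive is an assumption.
        return low <= value <= high


    @dataclass
    class Entry:
        """Hypothetical entry: an index, match/action values, and a counter."""
        index: int
        match_values: dict
        action_values: dict
        counter: int = 0


    outer_dst = FieldRef("ipv4", "dst_addr", 0)  # outermost IPv4 header
    inner_dst = FieldRef("ipv4", "dst_addr", 1)  # encapsulated IPv4 header

    dst = int(ipaddress.IPv4Address("192.0.2.77"))
    key = int(ipaddress.IPv4Address("192.0.2.0"))
    mask = 0xFFFFFF00  # compare only the upper 24 bits

    entry = Entry(
        index=0,
        match_values={outer_dst: dst},
        action_values={"ethernet.daddr": "00:00:5e:00:53:01"},
    )

Here ``field_exact_mask(dst, key, mask)`` matches while
``field_exact(dst, key)`` does not, which is the difference between the two
match types. The two ``FieldRef`` values differ only in ``header_index``,
which is how the same field in an outer and an inner header stays
distinguishable.
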

Abstraction Example
===================

The following is an example of the abstraction model of the L3 part of the
Mellanox Spectrum ASIC. The blocks are described in the order they appear in
the pipeline. The table sizes in the following examples are not real
hardware sizes and are provided for demonstration purposes.

LPM
---

The LPM algorithm can be implemented as a list of hash tables. Each hash
table contains routes with the same prefix length. The root of the list is
/32, and in case of a miss the hardware will continue to the next hash
table. The depth of the search will affect the data path latency.

In case of a hit the entry contains information about the next stage of the
pipeline which resolves the MAC address. The next stage can be either the
local host table for directly connected routes, or the adjacency table for
next-hops. The ``meta.lpm_prefix`` field is used to connect two LPM tables.

.. code::

    table lpm_prefix_16 {
      size: 4096,
      counters_enabled: true,
      match: { meta.vr_id: exact,
               ipv4.dst_addr: exact_mask,
               ipv6.dst_addr: exact_mask,
               meta.lpm_prefix: exact },
      action: { meta.adj_index: set,
                meta.adj_group_size: set,
                meta.rif_port: set,
                meta.lpm_prefix: set },
    }

Local Host
----------

In the case of local routes the LPM lookup already resolves the egress
router interface (RIF), yet the exact MAC address is not known. The local
host table is a hash table combining the output interface id with
destination IP address as a key. The result is the MAC address.

.. code::

    table local_host {
      size: 4096,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               ipv4.dst_addr: exact},
      action: { ethernet.daddr: set }
    }

Adjacency
---------

In case of remote routes this table does the ECMP. The LPM lookup results in
ECMP group size and index that serves as a global offset into this table.
Concurrently a hash of the packet is generated. Based on the ECMP group size
and the packet's hash a local offset is generated. Multiple LPM entries can
point to the same adjacency group.

.. code::

    table adjacency {
      size: 4096,
      counters_enabled: true,
      match: { meta.adj_index: exact,
               meta.adj_group_size: exact,
               meta.packet_hash_index: exact },
      action: { ethernet.daddr: set,
                meta.erif: set }
    }

ERIF
----

In case the egress RIF and destination MAC have been resolved by previous
tables this table does multiple operations like TTL decrease and MTU check.
Then the decision of forward/drop is taken and the port L3 statistics are
updated based on the packet's type (broadcast, unicast, multicast).

.. code::

    table erif {
      size: 800,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               meta.is_l3_unicast: exact,
               meta.is_l3_broadcast: exact,
               meta.is_l3_multicast: exact },
      action: { meta.l3_drop: set,
                meta.l3_forward: set }
    }
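
The LPM and Adjacency stages described above can be approximated in
userspace Python to show why search depth matters and how the two ECMP
offsets combine. This is a rough sketch under stated assumptions, not the
Spectrum implementation: the ordered list standing in for the
``meta.lpm_prefix`` chain, the function names, the sample routes, and the
use of a modulo to derive the local offset are all hypothetical.

.. code:: python

    import ipaddress


    def net(prefix):
        # Integer network address of an IPv4 prefix such as "198.51.100.0/24".
        return int(ipaddress.IPv4Network(prefix).network_address)


    def lpm_lookup(tables, vr_id, dst):
        """Probe per-prefix-length hash tables, longest prefix first.

        Returns (entry, depth), where depth is the number of tables probed.
        """
        addr = int(ipaddress.IPv4Address(dst))
        for depth, (prefix_len, table) in enumerate(tables, start=1):
            mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
            hit = table.get((vr_id, addr & mask))
            if hit is not None:
                return hit, depth
        return None, len(tables)


    def adjacency_index(adj_index, adj_group_size, packet_hash):
        # Global offset from the LPM result plus a local offset from the hash.
        # Deriving the local offset with a modulo is an assumption.
        return adj_index + packet_hash % adj_group_size


    # One hash table per prefix length; the list order models the chain.
    tables = [
        (32, {(0, net("198.51.100.7/32")): {"rif_port": 1}}),
        (24, {(0, net("198.51.100.0/24")): {"adj_index": 16, "adj_group_size": 4}}),
        (16, {(0, net("198.51.0.0/16")): {"adj_index": 32, "adj_group_size": 2}}),
    ]

    entry, depth = lpm_lookup(tables, 0, "198.51.100.9")
    slot = adjacency_index(entry["adj_index"], entry["adj_group_size"], 10)

The lookup for ``198.51.100.9`` misses the /32 table and hits the /24 table,
so two tables are probed before the adjacency stage is reached; a route that
only matches a shorter prefix would cost more probes, which is the latency
effect noted in the LPM section.
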