.. SPDX-License-Identifier: GPL-2.0

=============================================
Open vSwitch datapath developer documentation
=============================================

The Open vSwitch kernel module allows flexible userspace control over
flow-level packet processing on selected network devices.  It can be
used to implement a plain Ethernet switch, network device bonding,
VLAN processing, network access control, flow-based network control,
and so on.

The kernel module implements multiple "datapaths" (analogous to
bridges), each of which can have multiple "vports" (analogous to ports
within a bridge).  Each datapath also has associated with it a "flow
table" that userspace populates with "flows" that map from keys based
on packet headers and metadata to sets of actions.  The most common
action forwards the packet to another vport; other actions are also
implemented.

When a packet arrives on a vport, the kernel module processes it by
extracting its flow key and looking it up in the flow table.
If there is a matching flow, it executes the associated actions.  If there is
no match, it queues the packet to userspace for processing (as part of
its processing, userspace will likely set up a flow to handle further
packets of the same type entirely in-kernel).


Flow key compatibility
----------------------

Network protocols evolve over time.  New protocols become important
and existing protocols lose their prominence.  For the Open vSwitch
kernel module to remain relevant, it must be possible for newer
versions to parse additional protocols as part of the flow key.  It
might even be desirable, someday, to drop support for parsing
protocols that have become obsolete.  Therefore, the Netlink interface
to Open vSwitch is designed to allow carefully written userspace
applications to work with any version of the flow key, past or future.

To support this forward and backward compatibility, whenever the
kernel module passes a packet to userspace, it also passes along the
flow key that it parsed from the packet.
Userspace then extracts its
own notion of a flow key from the packet and compares it against the
kernel-provided version:

    - If userspace's notion of the flow key for the packet matches the
      kernel's, then nothing special is necessary.

    - If the kernel's flow key includes more fields than the userspace
      version of the flow key, for example if the kernel decoded IPv6
      headers but userspace stopped at the Ethernet type (because it
      does not understand IPv6), then again nothing special is
      necessary.  Userspace can still set up a flow in the usual way,
      as long as it uses the kernel-provided flow key to do it.

    - If the userspace flow key includes more fields than the
      kernel's, for example if userspace decoded an IPv6 header but
      the kernel stopped at the Ethernet type, then userspace can
      forward the packet manually, without setting up a flow in the
      kernel.  This case is bad for performance because every packet
      that the kernel considers part of the flow must go to userspace,
      but the forwarding behavior is correct.
      (If userspace can determine that the values of the extra fields
      would not affect forwarding behavior, then it could set up a
      flow anyway.)

How flow keys evolve over time is important to making this work, so
the following sections go into detail.


Flow key format
---------------

A flow key is passed over a Netlink socket as a sequence of Netlink
attributes.  Some attributes represent packet metadata, defined as any
information about a packet that cannot be extracted from the packet
itself, e.g. the vport on which the packet was received.  Most
attributes, however, are extracted from headers within the packet,
e.g. source and destination addresses from Ethernet, IP, or TCP
headers.

The <linux/openvswitch.h> header file defines the exact format of the
flow key attributes.  For informal explanatory purposes here, we write
them as comma-separated strings, with parentheses indicating arguments
and nesting.
For example, the following could represent a flow key
corresponding to a TCP packet that arrived on vport 1::

    in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
    eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0,
    frag=no), tcp(src=49163, dst=80)

Often we ellipsize arguments not important to the discussion, e.g.::

    in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)


Wildcarded flow key format
--------------------------

A wildcarded flow is described with two sequences of Netlink attributes
passed over the Netlink socket: a flow key, exactly as described above, and an
optional corresponding flow mask.

A wildcarded flow can represent a group of exact match flows.  Each '1' bit
in the mask specifies an exact match with the corresponding bit in the flow key.
A '0' bit specifies a don't care bit, which will match either a '1' or '0' bit
of an incoming packet.  Using wildcarded flows can improve the flow setup rate
by reducing the number of new flows that need to be processed by the user space
program.
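
The mask semantics above can be sketched in a few lines.  This is only an
illustration, not the kernel's implementation: it assumes a flow key reduced
to a single hypothetical 16-bit field (think of a TCP destination port), and
the names ``matches``, ``flow_key``, ``exact_mask`` and ``wild_mask`` are
invented for the example.

.. code-block:: python

    def matches(packet_key: int, flow_key: int, mask: int) -> bool:
        """True if packet_key agrees with flow_key on every '1' bit of mask."""
        return (packet_key & mask) == (flow_key & mask)

    # Hypothetical 16-bit key field standing in for a whole flow key.
    flow_key = 80
    exact_mask = 0xFFFF  # all bits are exact match: only the value 80 matches
    wild_mask = 0xFFF0   # low 4 bits are don't-care bits

    # Every 16-bit value that the single wildcarded flow covers.
    covered = [p for p in range(1 << 16) if matches(p, flow_key, wild_mask)]

With ``wild_mask``, one installed flow stands in for the 16 exact match flows
for the values 80 through 95, so packets that differ only in the don't-care
bits would not each need a separate flow setup.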

Support for the mask Netlink attribute is optional for both the kernel and user
space program.  The kernel can ignore the mask attribute, installing an exact
match flow, or reduce the number of don't care bits in the kernel to less than
what was specified by the user space program.  In this case, variations in bits
that the kernel does not implement will simply result in additional flow setups.
The kernel module will also work with user space programs that neither support
nor supply flow mask attributes.

Since the kernel may ignore or modify wildcard bits, it can be difficult for
the userspace program to know exactly what matches are installed.  There are
two possible approaches: reactively install flows as they miss the kernel
flow table (and therefore not attempt to determine wildcard changes at all)
or use the kernel's response messages to determine the installed wildcards.

When interacting with userspace, the kernel should maintain the match portion
of the key exactly as originally installed.  This provides a handle to
identify the flow for all future operations.
However, when reporting the
mask of an installed flow, the mask should include any restrictions imposed
by the kernel.

The behavior when using overlapping wildcarded flows is undefined.  It is the
responsibility of the user space program to ensure that any incoming packet
can match at most one flow, wildcarded or not.  The current implementation
performs best-effort detection of overlapping wildcarded flows and may reject
some but not all of them.  However, this behavior may change in future versions.


Unique flow identifiers
-----------------------

An alternative to using the original match portion of a key as the handle for
flow identification is a unique flow identifier, or "UFID".  UFIDs are optional
for both the kernel and user space program.

User space programs that support UFID are expected to provide it during flow
setup in addition to the flow, then refer to the flow using the UFID for all
future operations.  The kernel is not required to index flows by the original
flow key if a UFID is specified.
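
The choice of handle can be pictured with a toy table.  This is a sketch under
stated assumptions, not how the kernel module stores flows: the ``FlowTable``
class, its methods, and the representation of keys, actions and UFIDs as plain
Python values are all hypothetical.

.. code-block:: python

    class FlowTable:
        """Toy model: a flow is indexed by its UFID if one was supplied,
        otherwise by its original flow key."""

        def __init__(self):
            self._by_ufid = {}
            self._by_key = {}

        def add(self, key, actions, ufid=None):
            if ufid is not None:
                # With a UFID, the table need not index by the flow key.
                self._by_ufid[ufid] = (key, actions)
            else:
                self._by_key[key] = actions

        def get(self, key=None, ufid=None):
            """Return the actions for a handle, or None if it is unknown."""
            if ufid is not None:
                entry = self._by_ufid.get(ufid)
                return None if entry is None else entry[1]
            return self._by_key.get(key)

    table = FlowTable()
    table.add(key="flow-a", actions=["output:2"], ufid=b"\x01")
    table.add(key="flow-b", actions=["drop"])

In this model, ``flow-a`` can only be found through its UFID afterwards, which
is why a program that supplies a UFID at setup should keep using it for all
later operations on that flow.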


Basic rule for evolving flow keys
---------------------------------

Some care is needed to really maintain forward and backward
compatibility for applications that follow the rules listed under
"Flow key compatibility" above.

The basic rule is obvious::

    ==================================================================
    New network protocol support must only supplement existing flow
    key attributes.  It must not change the meaning of already defined
    flow key attributes.
    ==================================================================

This rule does have less-obvious consequences so it is worth working
through a few examples.  Suppose, for example, that the kernel module
did not already implement VLAN parsing.  Instead, it just interpreted
the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the
packet.
The flow key for any packet with an 802.1Q header would look
essentially like this, ignoring metadata::

    eth(...), eth_type(0x8100)

Naively, to add VLAN support, it makes sense to add a new "vlan" flow
key attribute to contain the VLAN tag, then continue to decode the
encapsulated headers beyond the VLAN tag using the existing field
definitions.  With this change, a TCP packet in VLAN 10 would have a
flow key much like this::

    eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)

But this change would negatively affect a userspace application that
has not been updated to understand the new "vlan" flow key attribute.
The application could, following the flow compatibility rules above,
ignore the "vlan" attribute that it does not understand and therefore
assume that the flow contained IP packets.  This is a bad assumption
(the flow only contains IP packets if one parses and skips over the
802.1Q header) and it could cause the application's behavior to change
across kernel versions even though it follows the compatibility rules.
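
The misinterpretation can be made concrete with a small model.  This is only
a sketch: it assumes a flow key represented as a list of ``(name, value)``
pairs, and the ``KNOWN`` set, ``old_app_view`` and ``looks_like_ipv4`` are
hypothetical stand-ins for an application written before the "vlan"
attribute existed.

.. code-block:: python

    # Attributes the hypothetical old application understands.
    KNOWN = {"eth", "eth_type", "ip", "tcp"}

    def old_app_view(flow_key):
        """Drop the top-level attributes the old application does not know."""
        return [(name, value) for name, value in flow_key if name in KNOWN]

    def looks_like_ipv4(flow_key):
        """What the old application concludes from the attributes it keeps."""
        return dict(old_app_view(flow_key)).get("eth_type") == 0x0800

    # Key for a VLAN 10 TCP packet from a kernel without VLAN parsing.
    old_kernel_key = [("eth", "..."), ("eth_type", 0x8100)]

    # Key for the same packet under the naive flat "vlan" extension.
    naive_new_key = [
        ("eth", "..."),
        ("vlan", {"vid": 10, "pcp": 0}),
        ("eth_type", 0x0800),
        ("ip", {"proto": 6}),
        ("tcp", "..."),
    ]

For the same packet, the application sees a non-IP flow with the older
kernel and an IPv4 flow with the naively extended one, which is exactly the
change in behavior across kernel versions that the basic rule is meant to
prevent.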

The solution is to use a set of nested attributes.  This is, for
example, why 802.1Q support uses nested attributes.  A TCP packet in
VLAN 10 is actually expressed as::

    eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
    ip(proto=6, ...), tcp(...)))

Notice how the "eth_type", "ip", and "tcp" flow key attributes are
nested inside the "encap" attribute.  Thus, an application that does
not understand the "vlan" key will not see any of those attributes
and therefore will not misinterpret them.  (Also, the outer eth_type
is still 0x8100, not changed to 0x0800.)

Handling malformed packets
--------------------------

Don't drop packets in the kernel for malformed protocol headers, bad
checksums, etc.  This would prevent userspace from implementing a
simple Ethernet switch that forwards every packet.

Instead, in such a case, include an attribute with "empty" content.
It doesn't matter if the empty content could be valid protocol values,
as long as those values are rarely seen in practice, because userspace
can always forward all packets with those values to userspace and
handle them individually.

For example, consider a packet that contains an IP header that
indicates protocol 6 for TCP, but which is truncated just after the IP
header, so that the TCP header is missing.  The flow key for this
packet would include a tcp attribute with all-zero src and dst, like
this::

    eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)

As another example, consider a packet with an Ethernet type of 0x8100,
indicating that a VLAN TCI should follow, but which is truncated just
after the Ethernet type.
The flow key for this packet would include
an all-zero-bits vlan and an empty encap attribute, like this::

    eth(...), eth_type(0x8100), vlan(0), encap()

Unlike a TCP packet with source and destination ports 0, an
all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka
VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan
attribute expressly to allow this situation to be distinguished.
Thus, the flow key in this second example unambiguously indicates a
missing or malformed VLAN TCI.

Other rules
-----------

The other rules for flow keys are much less subtle:

    - Duplicate attributes are not allowed at a given nesting level.

    - Ordering of attributes is not significant.

    - When the kernel sends a given flow key to userspace, it always
      composes it the same way.  This allows userspace to hash and
      compare entire flow keys that it may not be able to fully
      interpret.
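
Two of these rules lend themselves to a short sketch.  The code below is
illustrative only and makes several assumptions: a parsed flow key is modeled
as a list of ``(name, value)`` pairs in which a nested attribute carries a
list as its value, the raw key is an opaque ``bytes`` object with made-up
contents, and ``has_duplicates`` and ``flows_by_raw_key`` are hypothetical
names.

.. code-block:: python

    def has_duplicates(attrs):
        """True if any attribute name repeats within one nesting level."""
        seen = set()
        for name, value in attrs:
            if name in seen:
                return True
            seen.add(name)
            # A list value models a nested attribute such as "encap".
            if isinstance(value, list) and has_duplicates(value):
                return True
        return False

    # "eth_type" appears twice, but at different nesting levels.
    vlan_key = [
        ("eth", "..."),
        ("eth_type", 0x8100),
        ("vlan", {"vid": 10, "pcp": 0}),
        ("encap", [("eth_type", 0x0800), ("ip", {"proto": 6}), ("tcp", "...")]),
    ]

    # Because the kernel always composes a given key the same way, userspace
    # can index flows by the raw bytes without interpreting them.
    flows_by_raw_key = {}
    raw_key = b"\x00\x01\x02\x03"  # placeholder bytes, not a real Netlink key
    flows_by_raw_key[raw_key] = ["output:2"]

The nested ``vlan_key`` passes the duplicate check because each nesting level
is examined separately, and the dictionary lookup works on byte-for-byte
equality, which is all that the stable composition guarantee requires.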