.. SPDX-License-Identifier: GPL-2.0
.. _representors:

=============================
Network Function Representors
=============================

This document describes the semantics and usage of representor netdevices, as
used to control internal switching on SmartNICs.  For the closely-related port
representors on physical (multi-port) switches, see
:ref:`Documentation/networking/switchdev.rst <switchdev>`.

Motivation
----------

Since the mid-2010s, network cards have started offering more complex
virtualisation capabilities than the legacy SR-IOV approach (with its simple
MAC/VLAN-based switching model) can support.  This led to a desire to offload
software-defined networks (such as Open vSwitch) to these NICs to specify the
network connectivity of each function.  The resulting designs are variously
called SmartNICs or DPUs.

Network function representors bring the standard Linux networking stack to
virtual switches and IOV devices.  Just as each physical port of a
Linux-controlled switch has a separate netdev, so does each virtual port of a
virtual switch.
When the system boots, and before any offload is configured, all packets from
the virtual functions appear in the networking stack of the PF via the
representors.  The PF can thus always communicate freely with the virtual
functions.
The PF can configure standard Linux forwarding between representors, the uplink
or any other netdev (routing, bridging, TC classifiers).

Thus, a representor is both a control plane object (representing the function in
administrative commands) and a data plane object (one end of a virtual pipe).
As a virtual link endpoint, the representor can be configured like any other
netdevice; in some cases (e.g. link state) the representee will follow the
representor's configuration, while in others there are separate APIs to
configure the representee.

Definitions
-----------

This document uses the term "switchdev function" to refer to the PCIe function
which has administrative control over the virtual switch on the device.
Typically, this will be a PF, but conceivably a NIC could be configured to grant
these administrative privileges instead to a VF or SF (subfunction).
Depending on NIC design, a multi-port NIC might have a single switchdev function
for the whole device or might have a separate virtual switch, and hence
switchdev function, for each physical network port.
If the NIC supports nested switching, there might be separate switchdev
functions for each nested switch, in which case each switchdev function should
only create representors for the ports on the (sub-)switch it directly
administers.

A "representee" is the object that a representor represents.  So for example in
the case of a VF representor, the representee is the corresponding VF.

What does a representor do?
---------------------------

A representor has three main roles.

1. It is used to configure the network connection the representee sees, e.g.
   link up/down, MTU, etc.  For instance, bringing the representor
   administratively UP should cause the representee to see a link up / carrier
   on event.
2. It provides the slow path for traffic which does not hit any offloaded
   fast-path rules in the virtual switch.  Packets transmitted on the
   representor netdevice should be delivered to the representee; packets
   transmitted by the representee which fail to match any switching rule should
   be received on the representor netdevice.  (That is, there is a virtual pipe
   connecting the representor to the representee, similar in concept to a veth
   pair.)
   This allows software switch implementations (such as Open vSwitch or a Linux
   bridge) to forward packets between representees and the rest of the network.
3. It acts as a handle by which switching rules (such as TC filters) can refer
   to the representee, allowing these rules to be offloaded.

The combination of 2) and 3) means that the behaviour (apart from performance)
should be the same whether a TC filter is offloaded or not.  E.g. a TC rule
on a VF representor applies in software to packets received on that representor
netdevice, while in hardware offload it would apply to packets transmitted by
the representee VF.  Conversely, a mirred egress redirect to a VF representor
corresponds in hardware to delivery directly to the representee VF.

What functions should have a representor?
-----------------------------------------

Essentially, for each virtual port on the device's internal switch, there
should be a representor.
Some vendors have chosen to omit representors for the uplink and the physical
network port, which can simplify usage (the uplink netdev becomes in effect the
physical port's representor) but does not generalise to devices with multiple
ports or uplinks.

Thus, the following should all have representors:

 - VFs belonging to the switchdev function.
 - Other PFs on the local PCIe controller, and any VFs belonging to them.
 - PFs and VFs on external PCIe controllers on the device (e.g. for any embedded
   System-on-Chip within the SmartNIC).
 - PFs and VFs with other personalities, including network block devices (such
   as a vDPA virtio-blk PF backed by remote/distributed storage), if (and only
   if) their network access is implemented through a virtual switch port. [#]_
   Note that such functions can require a representor despite the representee
   not having a netdev.
 - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have
   their own port on the switch (as opposed to using their parent PF's port).
 - Any accelerators or plugins on the device whose interface to the network is
   through a virtual switch port, even if they do not have a corresponding PCIe
   PF or VF.

This allows the entire switching behaviour of the NIC to be controlled through
representor TC rules.

It is a common misunderstanding to conflate virtual ports with PCIe virtual
functions or their netdevs.  While in simple cases there will be a 1:1
correspondence between VF netdevices and VF representors, more advanced device
configurations may not follow this.
A PCIe function which does not have network access through the internal switch
(not even indirectly through the hardware implementation of whatever services
the function provides) should *not* have a representor (even if it has a
netdev).
Such a function has no switch virtual port for the representor to configure or
to be the other end of the virtual pipe.
The representor represents the virtual port, not the PCIe function nor the 'end
user' netdevice.

.. [#] The concept here is that a hardware IP stack in the device performs the
   translation between block DMA requests and network packets, so that only
   network packets pass through the virtual port onto the switch.  The network
   access that the IP stack "sees" would then be configurable through tc rules;
   e.g. its traffic might all be wrapped in a specific VLAN or VxLAN.  However,
   any needed configuration of the block device *qua* block device, not being a
   networking entity, would not be appropriate for the representor and would
   thus use some other channel such as devlink.
   Contrast this with the case of a virtio-blk implementation which forwards the
   DMA requests unchanged to another PF whose driver then initiates and
   terminates IP traffic in software; in that case the DMA traffic would *not*
   run over the virtual switch and the virtio-blk PF should thus *not* have a
   representor.

How are representors created?
-----------------------------

The driver instance attached to the switchdev function should, for each virtual
port on the switch, create a pure-software netdevice which has some form of
in-kernel reference to the switchdev function's own netdevice or driver private
data (``netdev_priv()``).
This may be by enumerating ports at probe time, reacting dynamically to the
creation and destruction of ports at run time, or a combination of the two.

The operations of the representor netdevice will generally involve acting
through the switchdev function.  For example, ``ndo_start_xmit()`` might send
the packet through a hardware TX queue attached to the switchdev function, with
either packet metadata or queue configuration marking it for delivery to the
representee.

How are representors identified?
--------------------------------

The representor netdevice should *not* directly refer to a PCIe device (e.g.
through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
representee or of the switchdev function.
Instead, the driver should use the ``SET_NETDEV_DEVLINK_PORT`` macro to
assign a devlink port instance to the netdevice before registering the
netdevice; the kernel uses the devlink port to provide the ``phys_switch_id``
and ``phys_port_name`` sysfs nodes.
(Some legacy drivers implement ``ndo_get_port_parent_id()`` and
``ndo_get_phys_port_name()`` directly, but this is deprecated.)  See
:ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>` for the
details of this API.
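
As a rough userspace illustration of these sysfs nodes, the shell sketch below
lists every netdevice that reports the same ``phys_switch_id`` as a given
device, along with its ``phys_port_name``.  The function name
``list_switch_ports`` and the ``SYSFS_NET`` override (which defaults to
``/sys/class/net``) are hypothetical conveniences introduced here, not part of
any kernel or iproute2 interface.

.. code-block:: shell

    # Hypothetical helper: print "<netdev> <phys_port_name>" for every
    # netdevice whose phys_switch_id matches that of the device named in $1.
    # SYSFS_NET may be overridden for testing; it defaults to /sys/class/net.
    list_switch_ports() {
        root="${SYSFS_NET:-/sys/class/net}"
        # The read fails on netdevices that do not report a switch ID, so
        # errors are silenced and such devices are skipped.
        want="$(cat "$root/$1/phys_switch_id" 2>/dev/null)" || return 1
        [ -n "$want" ] || return 1
        for dir in "$root"/*; do
            id="$(cat "$dir/phys_switch_id" 2>/dev/null)" || continue
            [ "$id" = "$want" ] || continue
            name="$(cat "$dir/phys_port_name" 2>/dev/null)" || name=""
            printf '%s %s\n' "${dir##*/}" "$name"
        done
    }

For example, ``list_switch_ports eth4`` would print one line for each netdevice
that shares the switch ID of ``eth4``.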

It is expected that userland will use this information (e.g. through udev rules)
to construct an appropriately informative name or alias for the netdevice.  For
instance if the switchdev function is ``eth4`` then a representor with a
``phys_port_name`` of ``p0pf1vf2`` might be renamed ``eth4pf1vf2rep``.
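
The renaming in this example could be sketched as the shell function below.
This is only one possible convention, chosen to reproduce the name given above;
the function ``rep_alias`` is hypothetical, and it assumes that a leading
``pN`` physical port prefix is dropped whenever a ``pf`` component follows it.

.. code-block:: shell

    # Hypothetical helper: build a representor name from the name of the
    # switchdev function's netdevice ($1) and a phys_port_name ($2).
    rep_alias() {
        parent="$1"
        port="$2"
        case "$port" in
        p[0-9]*pf*)
            # Remove everything up to and including the first "pf", then
            # put the "pf" back, which drops the physical port prefix.
            port="pf${port#*pf}"
            ;;
        esac
        printf '%s%srep\n' "$parent" "$port"
    }

Here ``rep_alias eth4 p0pf1vf2`` prints ``eth4pf1vf2rep``.  Since netdevice
names are limited to 15 characters, a real udev rule would also need a policy
for results that are too long.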

There are as yet no established conventions for naming representors which do not
correspond to PCIe functions (e.g. accelerators and plugins).

How do representors interact with TC rules?
-------------------------------------------

Any TC rule on a representor applies (in software TC) to packets received by
that representor netdevice.  Thus, if the delivery part of the rule corresponds
to another port on the virtual switch, the driver may choose to offload it to
hardware, applying it to packets transmitted by the representee.

Similarly, since a TC mirred egress action targeting the representor would (in
software) send the packet through the representor (and thus indirectly deliver
it to the representee), hardware offload should interpret this as delivery to
the representee.

As a simple example, if ``PORT_DEV`` is the physical port representor and
``REP_DEV`` is a VF representor, the following rules::

    tc filter add dev $REP_DEV parent ffff: protocol ipv4 flower \
        action mirred egress redirect dev $PORT_DEV
    tc filter add dev $PORT_DEV parent ffff: protocol ipv4 flower skip_sw \
        action mirred egress mirror dev $REP_DEV

would mean that all IPv4 packets from the VF are sent out the physical port, and
all IPv4 packets received on the physical port are delivered to the VF in
addition to ``PORT_DEV``.  (Note that without ``skip_sw`` on the second rule,
the VF would get two copies, as the packet reception on ``PORT_DEV`` would
trigger the TC rule again and mirror the packet to ``REP_DEV``.)

On devices without separate port and uplink representors, ``PORT_DEV`` would
instead be the switchdev function's own uplink netdevice.
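
The rules above attach to ``parent ffff:``, which is the handle of the ingress
qdisc, so that qdisc has to exist on each device before the filters can be
added.  A minimal preparatory sketch follows.  It assumes that the driver
exposes the ``hw-tc-offload`` ethtool feature on the device in question (this
varies between drivers); the ``prepare_tc_offload`` function and the ``RUN``
dry-run prefix are hypothetical conveniences.

.. code-block:: shell

    # Hypothetical helper: prepare the netdevice named in $1 for TC filters
    # on its ingress hook.  Set RUN=echo to print the commands instead of
    # executing them.
    prepare_tc_offload() {
        # Allow TC rules on this device to be offloaded to hardware.
        ${RUN:-} ethtool -K "$1" hw-tc-offload on || return 1
        # Create the ingress qdisc, whose handle is ffff:.
        ${RUN:-} tc qdisc add dev "$1" ingress || return 1
    }

This would typically be run once for each of ``$REP_DEV`` and ``$PORT_DEV``.
Note that ``tc qdisc add`` reports an error if the qdisc already exists.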

Of course the rules can (if supported by the NIC) include packet-modifying
actions (e.g. VLAN push/pop), which should be performed by the virtual switch.

Tunnel encapsulation and decapsulation are rather more complicated, as they
involve a third netdevice (a tunnel netdev operating in metadata mode, such as
a VxLAN device created with ``ip link add vxlan0 type vxlan external``) and
require an IP address to be bound to the underlay device (e.g. switchdev
function uplink netdev or port representor).  TC rules such as::

    tc filter add dev $REP_DEV parent ffff: flower \
        action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \
                              dst_port 4789 \
        action mirred egress redirect dev vxlan0
    tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \
        enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \
        action tunnel_key unset action mirred egress redirect dev $REP_DEV

where ``LOCAL_IP`` is an IP address bound to ``PORT_DEV``, and ``REMOTE_IP`` is
another IP address on the same subnet, mean that packets sent by the VF should
be VxLAN encapsulated and sent out the physical port (the driver has to deduce
this by a route lookup of ``LOCAL_IP`` leading to ``PORT_DEV``, and also
perform an ARP/neighbour table lookup to find the MAC addresses to use in the
outer Ethernet frame), while UDP packets received on the physical port with UDP
destination port 4789 should be parsed as VxLAN and, if their VNI matches
``$VNI``, decapsulated and forwarded to the VF.

If this all seems complicated, just remember the 'golden rule' of TC offload:
the hardware should ensure the same final results as if the packets were
processed through the slow path, traversed software TC (except ignoring any
``skip_hw`` rules and applying any ``skip_sw`` rules) and were transmitted or
received through the representor netdevices.

Configuring the representee's MAC
---------------------------------

The representee's link state is controlled through the representor.  Setting the
representor administratively UP or DOWN should cause carrier ON or OFF at the
representee.

Setting an MTU on the representor should cause that same MTU to be reported to
the representee.
(On hardware that allows configuring separate and distinct MTU and MRU values,
the representor MTU should correspond to the representee's MRU and vice-versa.)
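
In practice both settings are applied to the representor with ordinary
``ip link`` commands, as in the sketch below.  The ``rep_link_up`` function and
the ``RUN`` dry-run prefix are hypothetical; only the ``ip link set``
invocations themselves are standard iproute2 usage.

.. code-block:: shell

    # Hypothetical helper: set the MTU of the representor named in $1 to $2,
    # then bring it administratively UP.  Set RUN=echo for a dry run.
    rep_link_up() {
        ${RUN:-} ip link set dev "$1" mtu "$2" || return 1
        ${RUN:-} ip link set dev "$1" up || return 1
    }

After ``rep_link_up eth4pf1vf2rep 9000``, the representee would be expected to
see carrier on and the new MTU, on hardware and drivers that implement the
behaviour described above.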

Currently there is no way to use the representor to set the station permanent
MAC address of the representee; other methods available to do this include:

 - legacy SR-IOV (``ip link set DEVICE vf NUM mac LLADDR``)
 - devlink port function (see **devlink-port(8)** and
   :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`)
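
The devlink method might be used as in the following sketch.  The
``set_representee_mac`` wrapper and the ``RUN`` dry-run prefix are hypothetical,
and the port handle and address in the usage note are placeholders; whether
``hw_addr`` can be set depends on the driver.

.. code-block:: shell

    # Hypothetical helper: set the MAC address ($2) of the function behind
    # the devlink port $1 (given as BUS/DEVICE/PORT_INDEX).
    # Set RUN=echo for a dry run.
    set_representee_mac() {
        ${RUN:-} devlink port function set "$1" hw_addr "$2"
    }

For example, ``set_representee_mac pci/0000:06:00.0/2 00:11:22:33:44:55``.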
