.. SPDX-License-Identifier: GPL-2.0

=================
Device Memory TCP
=================


Intro
=====

Device memory TCP (devmem TCP) enables receiving data directly into device
memory (dmabuf). The feature is currently implemented for TCP sockets.


Opportunity
-----------

A large number of data transfers have device memory as the source and/or
destination. Accelerators have drastically increased the prevalence of such
transfers. Some examples include:

- Distributed training, where ML accelerators, such as GPUs on different hosts,
  exchange data.

- Distributed raw block storage applications transfer large amounts of data with
  remote SSDs. Much of this data does not require host processing.

Typically, Device-to-Device data transfers over the network are implemented as
the following low-level operations: Device-to-Host copy, Host-to-Host network
transfer, and Host-to-Device copy.

The flow involving host copies is suboptimal, especially for bulk data transfers,
and can put significant strain on system resources such as host memory
bandwidth and PCIe bandwidth.

Devmem TCP optimizes this use case by implementing socket APIs that enable
the user to receive incoming network packets directly into device memory.

Packet payloads go directly from the NIC to device memory.

Packet headers go to host memory and are processed by the TCP/IP stack
normally. The NIC must support header split to achieve this.

Advantages:

- Alleviate host memory bandwidth pressure, compared to existing
  network-transfer + device-copy semantics.

- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest
  level of the PCIe tree, compared to the traditional path which sends data
  through the root complex.


More Info
---------

  slides, video
    https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html

  patchset
    [PATCH net-next v24 00/13] Device Memory TCP
    https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/


Interface
=========


Example
-------

tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up
the RX path of this API.


NIC Setup
---------

Header split, flow steering, & RSS are required features for devmem TCP.

Header split is used to split incoming packets into a header buffer in host
memory, and a payload buffer in device memory.

Flow steering & RSS are used to ensure that only flows targeting devmem land on
an RX queue bound to devmem.

Enable header split & flow steering::

        # enable header split
        ethtool -G eth1 tcp-data-split on

        # enable flow steering
        ethtool -K eth1 ntuple on

Configure RSS to steer all traffic away from the target RX queue (queue 15 in
this example). ``equal 15`` spreads the RSS indirection table evenly over
queues 0-14, so queue 15 receives no RSS traffic::

        ethtool --set-rxfh-indir eth1 equal 15


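As an illustration of why ``equal 15`` keeps RSS traffic off queue 15, the
sketch below fills a table by cycling over the first N queues and checks which
queues appear in it. This is a simplified model, not driver or ethtool code:
the helper names and the table size of 128 used in the usage note are
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Fill an RSS indirection table by cycling over queues 0..n_queues-1.
 * Any queue with an index >= n_queues never appears in the table and
 * therefore receives no RSS traffic. n_queues must be non-zero. */
void rss_fill_equal(unsigned int *table, size_t table_size,
		    unsigned int n_queues)
{
	for (size_t i = 0; i < table_size; i++)
		table[i] = (unsigned int)(i % n_queues);
}

/* Return 1 if `queue` appears anywhere in the indirection table. */
int rss_table_uses_queue(const unsigned int *table, size_t table_size,
			 unsigned int queue)
{
	for (size_t i = 0; i < table_size; i++)
		if (table[i] == queue)
			return 1;
	return 0;
}
```

Calling ``rss_fill_equal(table, 128, 15)`` and then
``rss_table_uses_queue(table, 128, 15)`` returns 0, while the same check for
queue 14 returns 1. Traffic reaches queue 15 only through an explicit flow
steering rule.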
The user must bind a dmabuf to any number of RX queues on a given NIC using
the netlink API::

        /* Bind dmabuf to NIC RX queue 15 */
        struct netdev_queue *queues;
        queues = malloc(sizeof(*queues) * 1);

        /* Mark which attributes of the queue entry are set */
        queues[0]._present.type = 1;
        queues[0]._present.idx = 1;
        queues[0].type = NETDEV_RX_QUEUE_TYPE_RX;
        queues[0].idx = 15;

        *ys = ynl_sock_create(&ynl_netdev_family, &yerr);

        req = netdev_bind_rx_req_alloc();
        netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */);
        netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd);
        /* Last argument is the number of entries in queues */
        __netdev_bind_rx_req_set_queues(req, queues, 1);

        rsp = netdev_bind_rx(*ys, req);

        /* ID to match against dmabuf_cmsg->dmabuf_id on receive */
        dmabuf_id = rsp->dmabuf_id;


The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf
that has been bound.

The user can unbind the dmabuf from the netdevice by closing the netlink socket
that established the binding. We do this so that the binding is automatically
unbound even if the userspace process crashes.

Note that any reasonably well-behaved dmabuf from any exporter should work with
devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.


Socket Setup
------------

The socket must be flow steered to the dmabuf bound RX queue::

        ethtool -N eth1 flow-type tcp4 ... queue 15


Receiving data
--------------

The user application must signal to the kernel that it is capable of receiving
devmem data by passing the MSG_SOCK_DEVMEM flag to recvmsg::

        ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);

Applications that do not specify the MSG_SOCK_DEVMEM flag will receive an EFAULT
on devmem data.

Devmem data is received directly into the dmabuf bound to the NIC in 'NIC
Setup', and the kernel signals this to the user via the SCM_DEVMEM_* cmsgs::

        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                if (cm->cmsg_level != SOL_SOCKET ||
                        (cm->cmsg_type != SCM_DEVMEM_DMABUF &&
                         cm->cmsg_type != SCM_DEVMEM_LINEAR))
                        continue;

                dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);

                if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
                        /* Frag landed in dmabuf.
                         *
                         * dmabuf_cmsg->dmabuf_id is the dmabuf the
                         * frag landed on.
                         *
                         * dmabuf_cmsg->frag_offset is the offset into
                         * the dmabuf where the frag starts.
                         *
                         * dmabuf_cmsg->frag_size is the size of the
                         * frag.
                         *
                         * dmabuf_cmsg->frag_token is a token used to
                         * refer to this frag for later freeing.
                         */

                        struct dmabuf_token token;
                        token.token_start = dmabuf_cmsg->frag_token;
                        token.token_count = 1;
                        continue;
                }

                if (cm->cmsg_type == SCM_DEVMEM_LINEAR) {
                        /* Frag landed in linear buffer.
                         *
                         * dmabuf_cmsg->frag_size is the size of the
                         * frag.
                         */
                        continue;
                }
        }

Applications may receive two kinds of cmsgs:

- SCM_DEVMEM_DMABUF: this indicates the fragment landed in the dmabuf indicated
  by dmabuf_id.

- SCM_DEVMEM_LINEAR: this indicates the fragment landed in the linear buffer.
  This typically happens when the NIC is unable to split the packet at the
  header boundary, such that part (or all) of the payload landed in host
  memory.

Applications may receive no SCM_DEVMEM_* cmsgs. That indicates non-devmem,
regular TCP data that landed on an RX queue not bound to a dmabuf.


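The cmsg handling described above can be sketched as a small helper that
walks the control messages of one recvmsg() call and tallies how many payload
bytes landed in the dmabuf and how many in the linear buffer. This is an
illustration only: ``struct devmem_frag``, ``struct devmem_tally`` and
``devmem_tally_cmsgs()`` are names made up for the example, the struct is an
assumed local mirror of ``struct dmabuf_cmsg``, and the fallback
``SCM_DEVMEM_*`` values are assumptions that may not match every architecture.
Real code should use the definitions from the kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Assumed fallback values for older headers; prefer the kernel headers,
 * since these constants can differ between architectures. */
#ifndef SCM_DEVMEM_LINEAR
#define SCM_DEVMEM_LINEAR 78
#endif
#ifndef SCM_DEVMEM_DMABUF
#define SCM_DEVMEM_DMABUF 79
#endif

/* Assumed local mirror of the kernel's struct dmabuf_cmsg. */
struct devmem_frag {
	uint64_t frag_offset;	/* offset into the dmabuf */
	uint32_t frag_size;	/* size of the frag in bytes */
	uint32_t frag_token;	/* token used to free the frag later */
	uint32_t dmabuf_id;	/* dmabuf the frag landed on */
	uint32_t flags;
};

/* Hypothetical per-call summary. */
struct devmem_tally {
	size_t dmabuf_bytes;		/* bytes reported via SCM_DEVMEM_DMABUF */
	size_t linear_bytes;		/* bytes reported via SCM_DEVMEM_LINEAR */
	unsigned int dmabuf_frags;	/* frags that must be returned later */
};

void devmem_tally_cmsgs(struct msghdr *msg, struct devmem_tally *t)
{
	struct devmem_frag frag;
	struct cmsghdr *cm;

	memset(t, 0, sizeof(*t));
	for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
		if (cm->cmsg_level != SOL_SOCKET ||
		    (cm->cmsg_type != SCM_DEVMEM_DMABUF &&
		     cm->cmsg_type != SCM_DEVMEM_LINEAR))
			continue;
		/* Skip cmsgs too short to hold a frag description. */
		if (cm->cmsg_len < CMSG_LEN(sizeof(frag)))
			continue;

		/* Copy out instead of casting, to avoid alignment concerns. */
		memcpy(&frag, CMSG_DATA(cm), sizeof(frag));
		if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
			t->dmabuf_bytes += frag.frag_size;
			t->dmabuf_frags++;
		} else {
			t->linear_bytes += frag.frag_size;
		}
	}
}
```

If both counters stay at zero after a successful recvmsg() that returned
data, the data was regular TCP data from a queue not bound to a dmabuf.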
Freeing frags
-------------

Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user
processes the frag. The user must return the frag to the kernel via
SO_DEVMEM_DONTNEED::

        ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token,
                         sizeof(token));

The user must ensure the tokens are returned to the kernel in a timely manner.
Failure to do so will exhaust the limited dmabuf memory that is bound to the RX
queue and will lead to packet drops.


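Since a token carries both ``token_start`` and ``token_count``, an application
that collects many frag tokens may want to return them in fewer entries. The
sketch below coalesces runs of consecutive frag tokens into ranges. It assumes
that ``token_count`` describes a run of consecutive token values beginning at
``token_start``, as the field names suggest; ``struct devmem_token`` is an
assumed local mirror of ``struct dmabuf_token`` and
``devmem_coalesce_tokens()`` is a hypothetical helper, not a kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed local mirror of the kernel's struct dmabuf_token. */
struct devmem_token {
	uint32_t token_start;
	uint32_t token_count;
};

/* Coalesce frag tokens, in the order received, into (start, count)
 * ranges. `out` must have room for `n` entries (the worst case, where
 * no two tokens are consecutive). Returns the number of ranges written. */
size_t devmem_coalesce_tokens(const uint32_t *frag_tokens, size_t n,
			      struct devmem_token *out)
{
	size_t used = 0;

	for (size_t i = 0; i < n; i++) {
		if (used > 0) {
			struct devmem_token *last = &out[used - 1];

			/* Extend the current range if this token follows it. */
			if (frag_tokens[i] ==
			    last->token_start + last->token_count) {
				last->token_count++;
				continue;
			}
		}
		out[used].token_start = frag_tokens[i];
		out[used].token_count = 1;
		used++;
	}
	return used;
}
```

For the tokens 5, 6, 7, 10, 11, 3 this produces three ranges: (5, 3), (10, 2)
and (3, 1). Each range can then be returned with SO_DEVMEM_DONTNEED as shown
above. Whether several ranges may be passed in one setsockopt() call, and how
many, should be checked against the kernel version in use.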
Implementation & Caveats
========================

Unreadable skbs
---------------

Devmem payloads are inaccessible to the kernel processing the packets. This
results in a few quirks for payloads of devmem skbs:

- Loopback is not functional. Loopback relies on copying the payload, which is
  not possible with devmem skbs.

- Software checksum calculation fails.

- tcpdump and BPF can't access devmem packet payloads.


Testing
=======

More realistic example code can be found in the kernel source under
``tools/testing/selftests/net/ncdevmem.c``.

ncdevmem is a devmem TCP netcat. It works very similarly to netcat, but
receives data directly into a udmabuf.

To run ncdevmem, you need to run it as a server on the machine under test, and
you need to run netcat on a peer to provide the TX data.

ncdevmem also has a validation mode that expects a repeating pattern of
incoming data and validates it as such. For example, you can launch
ncdevmem on the server by::

        ncdevmem -s <server IP> -c <client IP> -f eth1 -d 3 -n 0000:06:00.0 -l \
                 -p 5201 -v 7

On the client side, use regular netcat to send TX data to the ncdevmem process
on the server::

        yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
                tr \\n \\0 | head -c 5G | nc <server IP> 5201 -p 5201
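
The client pipeline above produces the byte sequence 01 02 03 04 05 06 00,
repeated. A receiver can validate such a stream with a small stateful checker
like the sketch below. This is not the ncdevmem implementation:
``struct pattern_state`` and both helpers are hypothetical, and reading ``7``
as the pattern period is an inference from the command lines above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checker state, kept across receive buffers. */
struct pattern_state {
	uint8_t next;	/* next byte value expected in the stream */
	uint8_t period;	/* pattern length; values wrap to 0 at this point */
};

void pattern_init(struct pattern_state *st, uint8_t period)
{
	st->next = 1;		/* the stream above starts with 0x01 */
	st->period = period;	/* must be non-zero */
}

/* Check `len` bytes against the expected pattern, advancing the state.
 * Returns the index of the first mismatching byte, or `len` if every
 * byte matched. */
size_t pattern_check(struct pattern_state *st, const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (buf[i] != st->next)
			return i;
		st->next = (uint8_t)((st->next + 1) % st->period);
	}
	return len;
}
```

Because the state persists between calls, the checker can be fed one frag at
a time regardless of where frag boundaries fall within the pattern. Note that
payloads that landed in a real device dmabuf must first be made readable by
the CPU, which is outside the scope of this sketch.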
                                                      
