.. SPDX-License-Identifier: GPL-2.0

=================
Device Memory TCP
=================


Intro
=====

Device memory TCP (devmem TCP) enables receiving data directly into device
memory (dmabuf). The feature is currently implemented for TCP sockets.


Opportunity
-----------

A large number of data transfers have device memory as the source and/or
destination. Accelerators drastically increased the prevalence of such
transfers. Some examples include:

- Distributed training, where ML accelerators, such as GPUs on different hosts,
  exchange data.

- Distributed raw block storage applications transfer large amounts of data with
  remote SSDs. Much of this data does not require host processing.

Typically the Device-to-Device data transfers in the network are implemented as
the following low-level operations: Device-to-Host copy, Host-to-Host network
transfer, and Host-to-Device copy.

The flow involving host copies is suboptimal, especially for bulk data
transfers, and can put significant strain on system resources such as host
memory bandwidth and PCIe bandwidth.
Devmem TCP optimizes this use case by implementing socket APIs that enable
the user to receive incoming network packets directly into device memory.

Packet payloads go directly from the NIC to device memory.

Packet headers go to host memory and are processed by the TCP/IP stack
normally. The NIC must support header split to achieve this.

Advantages:

- Alleviate host memory bandwidth pressure, compared to existing
  network-transfer + device-copy semantics.

- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest
  level of the PCIe tree, compared to the traditional path which sends data
  through the root complex.


More Info
---------

slides, video
  https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html

patchset
  [PATCH net-next v24 00/13] Device Memory TCP
  https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/


Interface
=========


Example
-------

tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up
the RX path of this API.
NIC Setup
---------

Header split, flow steering, & RSS are required features for devmem TCP.

Header split is used to split incoming packets into a header buffer in host
memory, and a payload buffer in device memory.

Flow steering & RSS are used to ensure that only flows targeting devmem land on
an RX queue bound to devmem.

Enable header split & flow steering::

        # enable header split
        ethtool -G eth1 tcp-data-split on

        # enable flow steering
        ethtool -K eth1 ntuple on

Configure RSS to steer all traffic away from the target RX queue (queue 15 in
this example)::

        ethtool --set-rxfh-indir eth1 equal 15


The user must bind a dmabuf to any number of RX queues on a given NIC using
the netlink API::

        /* Bind dmabuf to NIC RX queue 15 */
        struct netdev_queue *queues;
        queues = malloc(sizeof(*queues) * 1);

        queues[0]._present.type = 1;
        queues[0]._present.idx = 1;
        queues[0].type = NETDEV_RX_QUEUE_TYPE_RX;
        queues[0].idx = 15;

        *ys = ynl_sock_create(&ynl_netdev_family, &yerr);

        req = netdev_bind_rx_req_alloc();
        netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */);
        netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd);
        __netdev_bind_rx_req_set_queues(req, queues, n_queue_index);

        rsp = netdev_bind_rx(*ys, req);

        dmabuf_id = rsp->dmabuf_id;


The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf
that has been bound.

The user can unbind the dmabuf from the netdevice by closing the netlink socket
that established the binding. We do this so that the binding is automatically
unbound even if the userspace process crashes.

Note that any reasonably well-behaved dmabuf from any exporter should work with
devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.


Socket Setup
------------

The socket must be flow steered to the dmabuf bound RX queue::

        ethtool -N eth1 flow-type tcp4 ... queue 15


Receiving data
--------------

The user application must signal to the kernel that it is capable of receiving
devmem data by passing the MSG_SOCK_DEVMEM flag to recvmsg::

        ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);

Applications that do not specify the MSG_SOCK_DEVMEM flag will receive an
EFAULT on devmem data.

Devmem data is received directly into the dmabuf bound to the NIC in 'NIC
Setup', and the kernel signals such to the user via the SCM_DEVMEM_* cmsgs::

        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                if (cm->cmsg_level != SOL_SOCKET ||
                    (cm->cmsg_type != SCM_DEVMEM_DMABUF &&
                     cm->cmsg_type != SCM_DEVMEM_LINEAR))
                        continue;

                dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);

                if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
                        /* Frag landed in dmabuf.
                         *
                         * dmabuf_cmsg->dmabuf_id is the dmabuf the
                         * frag landed on.
                         *
                         * dmabuf_cmsg->frag_offset is the offset into
                         * the dmabuf where the frag starts.
                         *
                         * dmabuf_cmsg->frag_size is the size of the
                         * frag.
                         *
                         * dmabuf_cmsg->frag_token is a token used to
                         * refer to this frag for later freeing.
                         */

                        struct dmabuf_token token;
                        token.token_start = dmabuf_cmsg->frag_token;
                        token.token_count = 1;
                        continue;
                }

                if (cm->cmsg_type == SCM_DEVMEM_LINEAR)
                        /* Frag landed in linear buffer.
                         *
                         * dmabuf_cmsg->frag_size is the size of the
                         * frag.
                         */
                        continue;

        }

Applications may receive 2 cmsgs:

- SCM_DEVMEM_DMABUF: this indicates the fragment landed in the dmabuf indicated
  by dmabuf_id.

- SCM_DEVMEM_LINEAR: this indicates the fragment landed in the linear buffer.
  This typically happens when the NIC is unable to split the packet at the
  header boundary, such that part (or all) of the payload landed in host
  memory.

Applications may receive no SCM_DEVMEM_* cmsgs. That indicates non-devmem,
regular TCP data that landed on an RX queue not bound to a dmabuf.


Freeing frags
-------------

Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user
processes the frag. The user must return the frag to the kernel via
SO_DEVMEM_DONTNEED::

        ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token,
                         sizeof(token));

The user must ensure the tokens are returned to the kernel in a timely manner.
Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
and will lead to packet drops.


Implementation & Caveats
========================

Unreadable skbs
---------------

Devmem payloads are inaccessible to the kernel processing the packets. This
results in a few quirks for payloads of devmem skbs:

- Loopback is not functional. Loopback relies on copying the payload, which is
  not possible with devmem skbs.

- Software checksum calculation fails.

- tcpdump and BPF can't access devmem packet payloads.


Testing
=======

More realistic example code can be found in the kernel source under
``tools/testing/selftests/net/ncdevmem.c``

ncdevmem is a devmem TCP netcat. It works very similarly to netcat, but
receives data directly into a udmabuf.
To run ncdevmem, you need to run it on a server on the machine under test, and
you need to run netcat on a peer to provide the TX data.

ncdevmem has a validation mode as well that expects a repeating pattern of
incoming data and validates it as such. For example, you can launch
ncdevmem on the server by::

        ncdevmem -s <server IP> -c <client IP> -f eth1 -d 3 -n 0000:06:00.0 -l \
                 -p 5201 -v 7

On client side, use regular netcat to send TX data to ncdevmem process
on the server::

        yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
                tr \\n \\0 | head -c 5G | nc <server IP> 5201 -p 5201