.. SPDX-License-Identifier: GPL-2.0

=================
Device Memory TCP
=================


Intro
=====

Device memory TCP (devmem TCP) enables receivi
memory (dmabuf). The feature is currently impl


Opportunity
-----------

A large number of data transfers have device m
destination. Accelerators drastically increase
transfers. Some examples include:

- Distributed training, where ML accelerators,
  exchange data.

- Distributed raw block storage applications t
  remote SSDs. Much of this data does not requ

Typically the Device-to-Device data transfers
the following low-level operations: Device-to-
transfer, and Host-to-Device copy.

The flow involving host copies is suboptimal,
and can put significant strains on system reso
bandwidth and PCIe bandwidth.

Devmem TCP optimizes this use case by implemen
the user to receive incoming network packets d

Packet payloads go directly from the NIC to de

Packet headers go to host memory and are proce
normally. The NIC must support header split to

Advantages:

- Alleviate host memory bandwidth pressure, co
  network-transfer + device-copy semantics.

- Alleviate PCIe bandwidth pressure, by limiti
  level of the PCIe tree, compared to the trad
  through the root complex.


More Info
---------

  slides, video
    https://netdevconf.org/0x17/sessions/talk/

  patchset
    [PATCH net-next v24 00/13] Device Memory T
    https://lore.kernel.org/netdev/20240831004


Interface
=========


Example
-------

tools/testing/selftests/net/ncdevmem.c:do_serv
the RX path of this API.


NIC Setup
---------

Header split, flow steering, & RSS are require

Header split is used to split incoming packets
memory, and a payload buffer in device memory.

Flow steering & RSS are used to ensure that on
an RX queue bound to devmem.

Enable header split & flow steering::

        # enable header split
        ethtool -G eth1 tcp-data-split on


        # enable flow steering
        ethtool -K eth1 ntuple on

Configure RSS to steer all traffic away from t
this example)::

        ethtool --set-rxfh-indir eth1 equal 15


The user must bind a dmabuf to any number of R
the netlink API::

        /* Bind dmabuf to NIC RX queue 15 */
        struct netdev_queue *queues;
        queues = malloc(sizeof(*queues) * 1);

        queues[0]._present.type = 1;
        queues[0]._present.idx = 1;
        queues[0].type = NETDEV_RX_QUEUE_TYPE_
        queues[0].idx = 15;

        *ys = ynl_sock_create(&ynl_netdev_fami

        req = netdev_bind_rx_req_alloc();
        netdev_bind_rx_req_set_ifindex(req, 1
        netdev_bind_rx_req_set_dmabuf_fd(req,
        __netdev_bind_rx_req_set_queues(req, q

        rsp = netdev_bind_rx(*ys, req);

        dmabuf_id = rsp->dmabuf_id;


The netlink API returns a dmabuf_id: a unique
that has been bound.

The user can unbind the dmabuf from the netdev
that established the binding. We do this so th
unbound even if the userspace process crashes.

Note that any reasonably well-behaved dmabuf f
devmem TCP, even if the dmabuf is not actually
this is udmabuf, which wraps user memory (non-


Socket Setup
------------

The socket must be flow steered to the dmabuf

        ethtool -N eth1 flow-type tcp4 ... que


Receiving data
--------------

The user application must signal to the kernel
devmem data by passing the MSG_SOCK_DEVMEM fla

        ret = recvmsg(fd, &msg, MSG_SOCK_DEVME

Applications that do not specify the MSG_SOCK_
on devmem data.
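
The recvmsg() call above needs a ``struct msghdr`` with room for control
messages. The following is a minimal sketch of one way to prepare it, not the
code used by ncdevmem. Every ``example_``/``EXAMPLE_`` name is hypothetical.
``struct example_dmabuf_cmsg`` is an assumed local mirror of the kernel's
``struct dmabuf_cmsg``, used here only to size the control buffer; real code
should take the definition from the uapi headers. The fallback value for
``MSG_SOCK_DEVMEM`` is likewise an assumption for toolchains whose headers do
not define the flag, and the per-call frag limit is arbitrary.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Assumption: fallback for headers that predate devmem TCP. Prefer the
 * value shipped with the kernel headers you build against. */
#ifndef MSG_SOCK_DEVMEM
#define MSG_SOCK_DEVMEM 0x2000000
#endif

/* Hypothetical local mirror of the kernel's struct dmabuf_cmsg, used only
 * to size the control buffer in this sketch. */
struct example_dmabuf_cmsg {
	uint64_t frag_offset;
	uint32_t frag_size;
	uint32_t frag_token;
	uint32_t dmabuf_id;
	uint32_t flags;
};

/* Hypothetical limit: how many frag cmsgs one recvmsg() call can carry. */
#define EXAMPLE_MAX_FRAGS 64
#define EXAMPLE_CTRL_LEN \
	(EXAMPLE_MAX_FRAGS * CMSG_SPACE(sizeof(struct example_dmabuf_cmsg)))

/* Point msg at one ordinary host buffer and one control buffer. */
static void example_prepare_msg(struct msghdr *msg, struct iovec *iov,
				void *buf, size_t buf_len,
				void *ctrl, size_t ctrl_len)
{
	assert(msg && iov);

	memset(msg, 0, sizeof(*msg));
	iov->iov_base = buf;
	iov->iov_len = buf_len;
	msg->msg_iov = iov;
	msg->msg_iovlen = 1;
	msg->msg_control = ctrl;
	msg->msg_controllen = ctrl_len;
}

/* Receive on fd, telling the kernel the caller can handle devmem data. */
static ssize_t example_devmem_recv(int fd, struct msghdr *msg)
{
	return recvmsg(fd, msg, MSG_SOCK_DEVMEM);
}
```

A caller would declare a ``char ctrl[EXAMPLE_CTRL_LEN]`` buffer, call
``example_prepare_msg()`` before every receive (the kernel rewrites
``msg_controllen`` on return), and then call ``example_devmem_recv()``.
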

Devmem data is received directly into the dmab
Setup', and the kernel signals such to the use

        for (cm = CMSG_FIRSTHDR(&msg);
                if (cm->cmsg_level !=
                                (cm->cmsg_type
                                 cm->cmsg_type
                        continue;

                dmabuf_cmsg = (struct

                if (cm->cmsg_type == S
                        /* Frag landed
                         *
                         * dmabuf_cmsg
                         * frag landed
                         *
                         * dmabuf_cmsg
                         * the dmabuf
                         *
                         * dmabuf_cmsg
                         * frag.
                         *
                         * dmabuf_cmsg
                         * refer to th
                         */

                        struct dmabuf_
                        token.token_st
                        token.token_co
                        continue;
                }

                if (cm->cmsg_type == S
                        /* Frag landed
                         *
                         * dmabuf_cmsg
                         * frag.
                         */
                        continue;

        }

Applications may receive 2 cmsgs:

- SCM_DEVMEM_DMABUF: this indicates the fragme
  by dmabuf_id.

- SCM_DEVMEM_LINEAR: this indicates the fragme
  This typically happens when the NIC is unabl
  header boundary, such that part (or all) of
  memory.

Applications may receive no SO_DEVMEM_* cmsgs.
regular TCP data that landed on an RX queue no


Freeing frags
-------------

Frags received via SCM_DEVMEM_DMABUF are pinne
processes the frag. The user must return the f
SO_DEVMEM_DONTNEED::

        ret = setsockopt(client_fd, SOL_SOCKET
                         sizeof(token));

The user must ensure the tokens are returned t
Failure to do so will exhaust the limited dmab
and will lead to packet drops.


Implementation & Caveats
========================

Unreadable skbs
---------------

Devmem payloads are inaccessible to the kernel
results in a few quirks for payloads of devmem

- Loopback is not functional. Loopback relies
  not possible with devmem skbs.

- Software checksum calculation fails.

- TCP Dump and bpf can't access devmem packet


Testing
=======

More realistic example code can be found in th
``tools/testing/selftests/net/ncdevmem.c``

ncdevmem is a devmem TCP netcat. It works very
receives data directly into a udmabuf.

To run ncdevmem, you need to run it on a serve
you need to run netcat on a peer to provide th

ncdevmem has a validation mode as well that ex
incoming data and validates it as such. For ex
ncdevmem on the server by::

        ncdevmem -s <server IP> -c <client IP>
                 -p 5201 -v 7

On client side, use regular netcat to send TX
on the server::

        yes $(echo -e \\x01\\x02\\x03\\x04\\x0
                tr \\n \\0 | head -c 5G | nc <