============
MSG_ZEROCOPY
============

Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
The feature is currently implemented for TCP, UDP and VSOCK (with
virtio transport) sockets.


Opportunity and Caveats
-----------------------

Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
such as sendfile and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.

Copy avoidance is not a free lunch. As implemented, with page pinning,
it replaces per-byte copy cost with page accounting and completion
notification overhead. As a result, MSG_ZEROCOPY is generally only
effective for writes larger than roughly 10 KB.

Page pinning also changes system call semantics. It temporarily shares
the buffer between process and network stack. Unlike with copying, the
process cannot immediately overwrite the buffer after system call
return without possibly modifying the data in flight. Kernel integrity
is not affected, but a buggy program can possibly corrupt its own data
stream.

The kernel returns a notification when it is safe to modify data.
Converting an existing application to MSG_ZEROCOPY is therefore not
always as trivial as just passing the flag.


More Info
---------

Much of this document was derived from a longer paper presented at
netdev 2.1. For more in-depth information, see that paper and talk or
the excellent reporting over at LWN.net, or read the original code.

  paper, slides, video
    https://netdevconf.org/2.1/session.html?debruijn

  LWN article
    https://lwn.net/Articles/726917/

  patchset
    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
    https://lore.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com


Interface
=========

Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
avoidance, but not the only one.

Socket Setup
------------

The kernel is permissive when applications pass undefined flags to the
send system call. By default it simply ignores these. To avoid enabling
copy avoidance mode for legacy processes that accidentally already pass
this flag, a process must first signal intent by setting a socket option:

::

        int one = 1;

        if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
                error(1, errno, "setsockopt zerocopy");

Transmission
------------

The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
Pass the new flag.

::

        ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

A zerocopy failure will return -1 with errno ENOBUFS. This happens if
the socket exceeds its optmem limit or the user exceeds their ulimit on
locked pages.


Mixing copy avoidance and copying
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many workloads have a mixture of large and small buffers. Because copy
avoidance is more expensive than copying for small packets, the
feature is implemented as a flag. It is safe to mix calls with the flag
with those without.
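
As a minimal sketch of such mixing, the helper below picks the flag per
call based on buffer size. The names ``ZC_MIN_LEN``, ``zc_send_flags``
and ``send_mixed`` are hypothetical, and the 10 KB cut-over is only the
rough figure quoted earlier in this document; a real application would
likely measure its own threshold.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000  /* fallback for older libc headers */
#endif

/* Hypothetical cut-over point, taken from the "around 10 KB" estimate. */
#define ZC_MIN_LEN (10 * 1024)

/* Return the flags to pass to send() for a buffer of the given length. */
int zc_send_flags(size_t len)
{
        return len >= ZC_MIN_LEN ? MSG_ZEROCOPY : 0;
}

/* Request copy avoidance only when the buffer is large enough. */
ssize_t send_mixed(int fd, const void *buf, size_t len)
{
        return send(fd, buf, len, zc_send_flags(len));
}
```

Only buffers sent with the flag need to stay untouched until their
completion notification arrives; buffers sent without it can be reused
as soon as the call returns.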


Notifications
-------------

The kernel has to notify the process when it is safe to reuse a
previously passed buffer. It queues completion notifications on the
socket error queue, akin to the transmit timestamping interface.

The notification itself is a simple scalar value. Each socket
maintains an internal unsigned 32-bit counter. Each send call with
MSG_ZEROCOPY that successfully sends data increments the counter. The
counter is not incremented on failure or if called with length zero.
The counter counts system call invocations, not bytes. It wraps after
UINT_MAX calls.
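
To match notifications to buffers, a process can mirror this counter in
user space. The sketch below is one possible way to do so; the names
``zc_mirror`` and ``zc_mirror_on_send`` are hypothetical, and it assumes
the kernel counter starts at zero on a new socket, which this document
does not state explicitly.

```c
#include <stdint.h>
#include <sys/types.h>

/* User-space mirror of the per-socket counter (assumed to start at 0). */
struct zc_mirror {
        uint32_t next_id;
};

/*
 * Call with the return value of each send that passed MSG_ZEROCOPY.
 * Returns 1 and stores the ID assigned to that call in *id, or returns
 * 0 if the call did not consume an ID (failure or zero length).
 */
int zc_mirror_on_send(struct zc_mirror *m, ssize_t ret, uint32_t *id)
{
        if (ret <= 0)
                return 0;

        *id = m->next_id++;     /* unsigned arithmetic wraps like the kernel's */
        return 1;
}
```

The caller would store the returned ID next to the buffer it just sent,
and release that buffer once a notification covering the ID arrives.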


Notification Reception
~~~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates the API. In the simplest case, each
send syscall is followed by a poll and recvmsg on the error queue.

Reading from the error queue is always a non-blocking operation. The
poll call is there to block until an error is outstanding. It will set
POLLERR in its output flags. That flag does not have to be set in the
events field. Errors are signaled unconditionally.

::

        pfd.fd = fd;
        pfd.events = 0;
        /* POLLERR is reported even though it is not requested in events */
        if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
                error(1, errno, "poll");

        ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
        if (ret == -1)
                error(1, errno, "recvmsg");

        read_notification(&msg);

The example is for demonstration purposes only. In practice, it is more
efficient to not wait for notifications, but read without blocking
every couple of send calls.
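
A non-blocking drain of the error queue could look roughly like the
sketch below. The function name ``zc_drain_errqueue`` and the 128-byte
control buffer are assumptions made for illustration; the loop relies
only on the fact stated above that reads from the error queue never
block.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Read every notification currently queued on the error queue of fd.
 * Returns the number of messages read, or -1 on an unexpected error
 * (errno is left as set by recvmsg).
 */
int zc_drain_errqueue(int fd)
{
        /* union forces cmsghdr alignment; size is an assumed upper bound */
        union {
                char buf[128];
                struct cmsghdr align;
        } control;
        struct msghdr msg;
        int n = 0;

        for (;;) {
                memset(&msg, 0, sizeof(msg));
                msg.msg_control = control.buf;
                msg.msg_controllen = sizeof(control.buf);

                if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1) {
                        if (errno == EAGAIN || errno == EWOULDBLOCK)
                                return n;       /* queue is empty */
                        return -1;
                }
                /* A real caller would parse msg here before looping. */
                n++;
        }
}
```

A sender might call this after every few send calls instead of polling
after each one.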

Notifications can be processed out of order with other operations on
the socket. A socket that has an error queued would normally block
other operations until the error is read. Zerocopy notifications have
a zero error code, however, to not block send and recv calls.


Notification Batching
~~~~~~~~~~~~~~~~~~~~~

Multiple outstanding packets can be read at once using the recvmmsg
call. This is often not needed. In each message the kernel returns not
a single value, but a range. It coalesces consecutive notifications
while one is outstanding for reception on the error queue.

When a new notification is about to be queued, the kernel checks
whether the new value extends the range of the notification at the
tail of the queue. If so, it drops the new notification packet and
instead increases the range upper value of the outstanding notification.
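
The coalescing rule can be illustrated in user space as follows. This
is a sketch of the check described above, not the kernel's code; the
names ``zc_range`` and ``zc_range_extend`` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inclusive range carried by one queued notification. */
struct zc_range {
        uint32_t lo;
        uint32_t hi;
};

/*
 * Try to fold a new completion covering 'len' calls starting at 'lo'
 * into the notification at the tail of the queue. Returns true if the
 * tail was extended, false if a separate notification is needed.
 */
bool zc_range_extend(struct zc_range *tail, uint32_t lo, uint32_t len)
{
        uint64_t sum_len = (uint64_t)(uint32_t)(tail->hi - tail->lo) + 1 + len;

        if (sum_len >= (1ULL << 32))            /* range would overflow 32 bits */
                return false;
        if (lo != (uint32_t)(tail->hi + 1))     /* not contiguous with the tail */
                return false;

        tail->hi += len;                        /* wraps together with the counter */
        return true;
}
```

For example, a tail of [3, 5] followed by a completion of two calls
starting at 6 becomes [3, 7], while a completion starting at 10 has to
be queued on its own.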

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.


Notification Parsing
~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates how to parse the control message: the
read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
specific: SOL_IP with IP_RECVERR for IPv4, or SOL_IPV6 with
IPV6_RECVERR for IPv6 (for TCP or UDP sockets). For a VSOCK socket,
cmsg_level will be SOL_VSOCK and cmsg_type will be VSOCK_RECVERR.

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
the socket.

The 32-bit notification range is encoded as [ee_info, ee_data]. This
range is inclusive. Other fields in the struct must be treated as
undefined, except for ee_code, as discussed below.

::

        struct sock_extended_err *serr;
        struct cmsghdr *cm;

        cm = CMSG_FIRSTHDR(msg);
        /* reject anything that is not an IPv4 error queue message */
        if (!cm || cm->cmsg_level != SOL_IP ||
            cm->cmsg_type != IP_RECVERR)
                error(1, 0, "cmsg");

        serr = (void *) CMSG_DATA(cm);
        if (serr->ee_errno != 0 ||
            serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
                error(1, 0, "serr");

        /* inclusive range of completed send calls */
        printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);


Deferred copies
~~~~~~~~~~~~~~~

Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
avoidance, and a contract that the kernel will queue a completion
notification. It is not a guarantee that the copy is elided.

Copy avoidance is not always feasible. Devices that do not support
scatter-gather I/O cannot send packets made up of kernel generated
protocol headers plus zerocopy user data. A packet may need to be
converted to a private copy of data deep in the stack, say to compute
a checksum.

In all these cases, the kernel returns a completion notification when
it releases its hold on the shared pages. That notification may arrive
before the (copied) data is fully transmitted. A zerocopy completion
notification is therefore not a transmit completion notification.

Deferred copies can be more expensive than a copy immediately in the
system call, if the data is no longer warm in the cache. The process
also incurs notification processing cost for no benefit. For this
reason, the kernel signals if data was completed with a copy, by
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
A process may use this signal to stop passing flag MSG_ZEROCOPY on
subsequent requests on the same socket.


Implementation
==============

Loopback
--------

For TCP and UDP:
Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbounded notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.

For VSOCK:
The data path to local sockets is the same as for non-local sockets.

Testing
=======

More realistic example code can be found in the kernel source under
tools/testing/selftests/net/msg_zerocopy.c.

Be cognizant of the loopback constraint. The test can be run between
a pair of hosts. But if run between a local pair of processes, for
instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.

For VSOCK sockets, an example can be found in
tools/testing/vsock/vsock_test_zerocopy.c.
