.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)

==================
Kernel TLS offload
==================

Kernel TLS operation
====================

The Linux kernel provides TLS connection offload infrastructure. Once a TCP
connection is in ``ESTABLISHED`` state user space can enable the TLS Upper
Layer Protocol (ULP) and install the cryptographic connection state.
For details regarding the user-facing interface refer to the TLS
documentation in :ref:`Documentation/networking/tls.rst <kernel_tls>`.

``ktls`` can operate in three modes:

* Software crypto mode (``TLS_SW``) - the CPU handles the cryptography.
  In most basic cases only crypto operations synchronous with the CPU
  can be used, but depending on calling context the CPU may utilize
  asynchronous crypto accelerators. The use of accelerators introduces extra
  latency on socket reads (decryption only starts when a read syscall
  is made) and additional I/O load on the system.
* Packet-based NIC offload mode (``TLS_HW``) - the NIC handles crypto
  on a packet by packet basis, provided the packets arrive in order.
  This mode integrates best with the kernel stack and is described in detail
  in the remaining part of this document
  (``ethtool`` flags ``tls-hw-tx-offload`` and ``tls-hw-rx-offload``).
* Full TCP NIC offload mode (``TLS_HW_RECORD``) - mode of operation where
  the NIC driver and firmware replace the kernel networking stack
  with their own TCP handling. It is not usable in production environments
  making use of the Linux networking stack, for example any firewalling
  abilities or QoS and packet scheduling (``ethtool`` flag ``tls-hw-record``).

The operation mode is selected automatically based on device configuration;
offload opt-in or opt-out on a per-connection basis is not currently supported.

TX
--

At a high level user write requests are turned into a scatter list, the TLS ULP
intercepts them, inserts record framing, performs encryption (in ``TLS_SW``
mode) and then hands the modified scatter list to the TCP layer. From this
point on the TCP stack proceeds as normal.

In ``TLS_HW`` mode the encryption is not performed in the TLS ULP.
Instead packets reach a device driver, the driver will mark the packets
for crypto offload based on the socket the packet is attached to,
and send them to the device for encryption and transmission.

RX
--

On the receive side if the device handled decryption and authentication
successfully, the driver will set the decrypted bit in the associated
:c:type:`struct sk_buff <sk_buff>`. The packets reach the TCP stack and
are handled normally. ``ktls`` is informed when data is queued to the socket
and the ``strparser`` mechanism is used to delineate the records. Upon read
request, records are retrieved from the socket and passed to decryption routine.
If device decrypted all the segments of the record the decryption is skipped,
otherwise software path handles decryption.

.. kernel-figure:: tls-offload-layers.svg
   :alt: TLS offload layers
   :align: center
   :figwidth: 28em

   Layers of Kernel TLS stack

Device configuration
====================

During driver initialization device sets the ``NETIF_F_HW_TLS_RX`` and
``NETIF_F_HW_TLS_TX`` features and installs its
:c:type:`struct tlsdev_ops <tlsdev_ops>`
pointer in the :c:member:`tlsdev_ops` member of the
:c:type:`struct net_device <net_device>`.

When TLS cryptographic connection state is installed on a ``ktls`` socket
(note that it is done twice, once for RX and once for TX direction,
and the two are completely independent), the kernel checks if the underlying
network device is offload-capable and attempts the offload.
In case offload
fails the connection is handled entirely in software using the same mechanism
as if the offload was never tried.

Offload request is performed via the :c:member:`tls_dev_add` callback of
:c:type:`struct tlsdev_ops <tlsdev_ops>`:

.. code-block:: c

  int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
                     enum tls_offload_ctx_dir direction,
                     struct tls_crypto_info *crypto_info,
                     u32 start_offload_tcp_sn);

``direction`` indicates whether the cryptographic information is for
the received or transmitted packets. Driver uses the ``sk`` parameter
to retrieve the connection 5-tuple and socket family (IPv4 vs IPv6).
Cryptographic information in ``crypto_info`` includes the key, iv, salt
as well as TLS record sequence number. ``start_offload_tcp_sn`` indicates
which TCP sequence number corresponds to the beginning of the record with
sequence number from ``crypto_info``. The driver can add its state
at the end of kernel structures (see :c:member:`driver_state` members
in ``include/net/tls.h``) to avoid additional allocations and pointer
dereferences.

TX
--

After TX state is installed, the stack guarantees that the first segment
of the stream will start exactly at the ``start_offload_tcp_sn`` sequence
number, simplifying TCP sequence number matching.

TX offload being fully initialized does not imply that all segments passing
through the driver and which belong to the offloaded socket will be after
the expected sequence number and will have kernel record information.
In particular, already encrypted data may have been queued to the socket
before installing the connection state in the kernel.

RX
--

In RX direction local networking stack has little control over the segmentation,
so the initial records' TCP sequence number may be anywhere inside the segment.
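
As an illustration of the RX case above, the following user-space sketch
locates the start of the first offloaded record inside a received segment.
The helper name ``first_record_offset`` and its parameters are hypothetical
and are not part of the kernel API; only ``start_offload_tcp_sn`` comes from
the callback described earlier, here passed as ``start_sn``.

.. code-block:: c

  #include <stdbool.h>
  #include <stdint.h>

  /*
   * Hypothetical helper, not a kernel API: report where the first
   * offloaded record starts inside a segment covering the sequence
   * range [seg_seq, seg_seq + seg_len).
   *
   * Returns true and stores the byte offset in *offset when start_sn
   * falls inside the segment, false otherwise.
   */
  static bool first_record_offset(uint32_t seg_seq, uint32_t seg_len,
                                  uint32_t start_sn, uint32_t *offset)
  {
      /* Unsigned subtraction stays correct across 32-bit wraparound. */
      uint32_t delta = start_sn - seg_seq;

      if (delta >= seg_len)
          return false;

      *offset = delta;
      return true;
  }

A segment that starts before ``start_sn`` and ends after it yields a non-zero
offset, which is the "anywhere inside the segment" situation a device has to
be prepared for on the receive side.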

Normal operation
================

At the minimum the device maintains the following state for each connection, in
each direction:

* crypto secrets (key, iv, salt)
* crypto processing state (partial blocks, partial authentication tag, etc.)
* record metadata (sequence number, processing offset and length)
* expected TCP sequence number

There are no guarantees on record length or record segmentation. In particular
segments may start at any point of a record and contain any number of records.
Assuming segments are received in order, the device should be able to perform
crypto operations and authentication regardless of segmentation. For this
to be possible the device has to keep a small amount of segment-to-segment
state. This includes at least:

* partial headers (if a segment carried only a part of the TLS header)
* partial data block
* partial authentication tag (all data had been seen but part of the
  authentication tag has to be written or read from the subsequent segment)

Record reassembly is not necessary for TLS offload. If the packets arrive
in order the device should be able to handle them separately and make
forward progress.

TX
--

The kernel stack performs record framing reserving space for the authentication
tag and populating all other TLS header and trailer fields.

Both the device and the driver maintain expected TCP sequence numbers
due to the possibility of retransmissions and the lack of software fallback
once the packet reaches the device.
For segments passed in order, the driver marks the packets with
a connection identifier (note that a 5-tuple lookup is insufficient to identify
packets requiring HW offload, see the :ref:`5tuple_problems` section)
and hands them to the device. The device identifies the packet as requiring
TLS handling and confirms the sequence number matches its expectation.
The device performs encryption and authentication of the record data.
It replaces the authentication tag and TCP checksum with correct values.

RX
--

Before a packet is DMAed to the host (but after NIC's embedded switching
and packet transformation functions) the device validates the Layer 4
checksum and performs a 5-tuple lookup to find any TLS connection the packet
may belong to (technically a 4-tuple
lookup is sufficient - IP addresses and TCP port numbers, as the protocol
is always TCP). If connection is matched device confirms if the TCP sequence
number is the expected one and proceeds to TLS handling (record delineation,
decryption, authentication for each record in the packet). The device leaves
the record framing unmodified, the stack takes care of record decapsulation.
Device indicates successful handling of TLS offload in the per-packet context
(descriptor) passed to the host.

Upon reception of a TLS offloaded packet, the driver sets
the :c:member:`decrypted` mark in :c:type:`struct sk_buff <sk_buff>`
corresponding to the segment. Networking stack makes sure decrypted
and non-decrypted segments do not get coalesced (e.g. by GRO or socket layer)
and takes care of partial decryption.

Resync handling
===============

In presence of packet drops or network packet reordering, the device may lose
synchronization with the TLS stream, and require a resync with the kernel's
TCP stack.

Note that resync is only attempted for connections which were successfully
added to the device table and are in ``TLS_HW`` mode. For example,
if the table was full when cryptographic state was installed in the kernel,
such connection will never get offloaded. Therefore the resync request
does not carry any cryptographic connection state.
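
Loss of synchronization is detected by comparing a segment's TCP sequence
number with the expected one. A minimal user-space sketch of such a check
follows; the names ``seg_order`` and ``classify_segment`` are made up for
illustration and do not exist in the kernel. The signed-difference comparison
is the usual way of ordering 32-bit TCP sequence numbers across wraparound.

.. code-block:: c

  #include <stdint.h>

  /* Hypothetical classification of a segment against the expected state. */
  enum seg_order {
      SEG_IN_ORDER,    /* starts exactly at the expected sequence number */
      SEG_RETRANSMIT,  /* starts before the expected sequence number */
      SEG_FUTURE,      /* starts after it, some data was skipped */
  };

  static enum seg_order classify_segment(uint32_t seq, uint32_t expected)
  {
      /* Signed view of the 32-bit difference handles wraparound. */
      int32_t diff = (int32_t)(seq - expected);

      if (diff == 0)
          return SEG_IN_ORDER;
      return diff < 0 ? SEG_RETRANSMIT : SEG_FUTURE;
  }

Only the ``SEG_IN_ORDER`` case can be processed with the state the device
already holds; the other two cases are the subject of the resync mechanisms.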

TX
--

Segments transmitted from an offloaded socket can get out of sync
in similar ways to the receive side: retransmissions and local drops
are possible, though network reorders are not. There are currently
two mechanisms for dealing with out of order segments.

Crypto state rebuilding
~~~~~~~~~~~~~~~~~~~~~~~

Whenever an out of order segment is transmitted the driver provides
the device with enough information to perform cryptographic operations.
This means most likely that the part of the record preceding the current
segment has to be passed to the device as part of the packet context,
together with its TCP sequence number and TLS record number. The device
can then initialize its crypto state, process and discard the preceding
data (to be able to insert the authentication tag) and move onto handling
the actual packet.

In this mode depending on the implementation the driver can either ask
for a continuation with the crypto state and the new sequence number
(next expected segment is the one after the out of order one), or continue
with the previous stream state - assuming that the out of order segment
was just a retransmission. The former is simpler, and does not require
retransmission detection therefore it is the recommended method until
such time it is proven inefficient.

Next record sync
~~~~~~~~~~~~~~~~

Whenever an out of order segment is detected the driver requests
that the ``ktls`` software fallback code encrypt it. If the segment's
sequence number is lower than expected the driver assumes retransmission
and doesn't change device state. If the segment is in the future, it
may imply a local drop, the driver asks the stack to sync the device
to the next record state and falls back to software.

Resync request is indicated with:

.. code-block:: c

  void tls_offload_tx_resync_request(struct sock *sk, u32 got_seq, u32 exp_seq)

Until resync is complete driver should not access its expected TCP
sequence number (as it will be updated from a different context).
Following helper should be used to test if resync is complete:

.. code-block:: c

  bool tls_offload_tx_resync_pending(struct sock *sk)

Next time ``ktls`` pushes a record it will first send its TCP sequence number
and TLS record number to the driver. Stack will also make sure that
the new record will start on a segment boundary (like it does when
the connection is initially added).

RX
--

A small amount of RX reorder events may not require a full resynchronization.
In particular the device should not lose synchronization
when record boundary can be recovered:

.. kernel-figure:: tls-offload-reorder-good.svg
   :alt: reorder of non-header segment
   :align: center

   Reorder of non-header segment

Green segments are successfully decrypted, blue ones are passed
as received on wire, red stripes mark start of new records.

In above case segment 1 is received and decrypted successfully.
Segment 2 was dropped so 3 arrives out of order. The device knows
the next record starts inside 3, based on record length in segment 1.
Segment 3 is passed untouched, because due to lack of data from segment 2
the remainder of the previous record inside segment 3 cannot be handled.
The device can, however, collect the authentication algorithm's state
and partial block from the new record in segment 3 and when 4 and 5
arrive continue decryption. Finally when 2 arrives it's completely outside
of expected window of the device so it's passed as is without special
handling. ``ktls`` software fallback handles the decryption of record
spanning segments 1, 2 and 3.
The device did not get out of sync,
even though two segments did not get decrypted.

Kernel synchronization may be necessary if the lost segment contained
a record header and arrived after the next record header has already passed:

.. kernel-figure:: tls-offload-reorder-bad.svg
   :alt: reorder of header segment
   :align: center

   Reorder of segment with a TLS header

In this example segment 2 gets dropped, and it contains a record header.
Device can only detect that segment 4 also contains a TLS header
if it knows the length of the previous record from segment 2. In this case
the device will lose synchronization with the stream.

Stream scan resynchronization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the device gets out of sync and the stream reaches TCP sequence
numbers more than a max size record past the expected TCP sequence number,
the device starts scanning for a known header pattern. For example
for TLS 1.2 and TLS 1.3 subsequent bytes of value ``0x03 0x03`` occur
in the SSL/TLS version field of the header. Once pattern is matched
the device continues attempting parsing headers at expected locations
(based on the length fields at guessed locations).
Whenever the expected location does not contain a valid header the scan
is restarted.

When the header is matched the device sends a confirmation request
to the kernel, asking if the guessed location is correct (if a TLS record
really starts there), and which record sequence number the given header had.
The kernel confirms the guessed location was correct and tells the device
the record sequence number. Meanwhile, the device had been parsing
and counting all records since the just-confirmed one, it adds the number
of records it had seen to the record number provided by the kernel.
At this point the device is in sync and can resume decryption at next
segment boundary.
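
The header scan described above can be modelled in user space roughly as
follows. This is only a sketch over a contiguous buffer, not how any
particular device implements it: the names ``tls_hdr_plausible`` and
``tls_scan``, the ``chain`` parameter, the content type range check and the
length bound are all assumptions made for the example. A real device works on
a stream of segments and must still wait for the kernel's confirmation.

.. code-block:: c

  #include <stddef.h>
  #include <stdint.h>

  #define TLS_HDR_LEN     5               /* type, version (2), length (2) */
  #define TLS_MAX_REC_LEN (16384 + 2048)  /* assumed TLS 1.2 upper bound */

  /* Return nonzero if p points at bytes that look like a record header. */
  static int tls_hdr_plausible(const uint8_t *p)
  {
      uint16_t len = (uint16_t)((p[3] << 8) | p[4]);

      return p[0] >= 20 && p[0] <= 23 &&      /* known content types */
             p[1] == 0x03 && p[2] == 0x03 &&  /* version field pattern */
             len > 0 && len <= TLS_MAX_REC_LEN;
  }

  /*
   * Hypothetical scan: return the first offset from which `chain`
   * consecutive plausible headers follow each other, each located
   * using the length field of the previous one. Returns -1 if no
   * such offset exists in the buffer.
   */
  static long tls_scan(const uint8_t *buf, size_t n, unsigned int chain)
  {
      for (size_t start = 0; start + TLS_HDR_LEN <= n; start++) {
          size_t pos = start;
          unsigned int seen = 0;

          while (seen < chain && pos + TLS_HDR_LEN <= n &&
                 tls_hdr_plausible(buf + pos)) {
              size_t len = ((size_t)buf[pos + 3] << 8) | buf[pos + 4];

              seen++;
              pos += TLS_HDR_LEN + len;
          }
          if (seen == chain)
              return (long)start;
          /* Expected location held no valid header: restart the scan. */
      }
      return -1;
  }

Requiring more than one header in a row lowers the chance of latching onto
payload bytes that merely resemble a header, at the cost of a longer scan.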

In a pathological case the device may latch onto a sequence of matching
headers and never hear back from the kernel (there is no negative
confirmation from the kernel). The implementation may choose to periodically
restart scan. Given how unlikely falsely-matching stream is, however,
periodic restart is not deemed necessary.

Special care has to be taken if the confirmation request is passed
asynchronously to the packet stream and record may get processed
by the kernel before the confirmation request.

Stack-driven resynchronization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The driver may also request the stack to perform resynchronization
whenever it sees the records are no longer getting decrypted.
If the connection is configured in this mode the stack automatically
schedules resynchronization after it has received two completely encrypted
records.

The stack waits for the socket to drain and informs the device about
the next expected record number and its TCP sequence number. If the
records continue to be received fully encrypted stack retries the
synchronization with an exponential back off (first after 2 encrypted
records, then after 4 records, after 8, after 16... up until every
128 records).

Error handling
==============

TX
--

Packets may be redirected or rerouted by the stack to a different
device than the selected TLS offload device. The stack will handle
such condition using the :c:func:`sk_validate_xmit_skb` helper
(TLS offload code installs :c:func:`tls_validate_xmit_skb` at this hook).
Offload maintains information about all records until the data is
fully acknowledged, so if skbs reach the wrong device they can be handled
by software fallback.

Any device TLS offload handling error on the transmission side must result
in the packet being dropped.
For example if a packet got out of order
due to a bug in the stack or the device, reached the device and can't
be encrypted such packet must be dropped.

RX
--

If the device encounters any problems with TLS offload on the receive
side it should pass the packet to the host's networking stack as it was
received on the wire.

For example authentication failure for any record in the segment should
result in passing the unmodified packet to the software fallback. This means
packets should not be modified "in place". Splitting segments to handle partial
decryption is not advised. In other words either all records in the packet
had been handled successfully and authenticated or the packet has to be passed
to the host's stack as it was on the wire (recovering original packet in the
driver if device provides precise error is sufficient).

The Linux networking stack does not provide a way of reporting per-packet
decryption and authentication errors, packets with errors must simply not
have the :c:member:`decrypted` mark set.

A packet should also not be handled by the TLS offload if it contains
incorrect checksums.

Performance metrics
===================

TLS offload can be characterized by the following basic metrics:

* max connection count
* connection installation rate
* connection installation latency
* total cryptographic performance

Note that each TCP connection requires a TLS session in both directions,
the performance may be reported treating each direction separately.

Max connection count
--------------------

The number of connections device can support can be exposed via
``devlink resource`` API.

Total cryptographic performance
-------------------------------

Offload performance may depend on segment and record size.

Overload of the cryptographic subsystem of the device should not have
significant performance impact on non-offloaded streams.

Statistics
==========

Following minimum set of TLS-related statistics should be reported
by the driver:

* ``rx_tls_decrypted_packets`` - number of successfully decrypted RX packets
  which were part of a TLS stream.
* ``rx_tls_decrypted_bytes`` - number of TLS payload bytes in RX packets
  which were successfully decrypted.
* ``rx_tls_ctx`` - number of TLS RX HW offload contexts added to device for
  decryption.
* ``rx_tls_del`` - number of TLS RX HW offload contexts deleted from device
  (connection has finished).
* ``rx_tls_resync_req_pkt`` - number of received TLS packets with a resync
  request.
* ``rx_tls_resync_req_start`` - number of times the TLS async resync request
  was started.
* ``rx_tls_resync_req_end`` - number of times the TLS async resync request
  properly ended with providing the HW tracked tcp-seq.
* ``rx_tls_resync_req_skip`` - number of times the TLS async resync request
  procedure was started but not properly ended.
* ``rx_tls_resync_res_ok`` - number of times the TLS resync response call to
  the driver was successfully handled.
* ``rx_tls_resync_res_skip`` - number of times the TLS resync response call to
  the driver was terminated unsuccessfully.
* ``rx_tls_err`` - number of RX packets which were part of a TLS stream
  but were not decrypted due to unexpected error in the state machine.
* ``tx_tls_encrypted_packets`` - number of TX packets passed to the device
  for encryption of their TLS payload.
* ``tx_tls_encrypted_bytes`` - number of TLS payload bytes in TX packets
  passed to the device for encryption.
* ``tx_tls_ctx`` - number of TLS TX HW offload contexts added to device for
  encryption.
* ``tx_tls_ooo`` - number of TX packets which were part of a TLS stream
  but did not arrive in the expected order.
* ``tx_tls_skip_no_sync_data`` - number of TX packets which were part of
  a TLS stream and arrived out-of-order, but skipped the HW offload routine
  and went to the regular transmit flow as they were retransmissions of the
  connection handshake.
* ``tx_tls_drop_no_sync_data`` - number of TX packets which were part of
  a TLS stream dropped, because they arrived out of order and associated
  record could not be found.
* ``tx_tls_drop_bypass_req`` - number of TX packets which were part of a TLS
  stream dropped, because they contain both data that has been encrypted by
  software and data that expects hardware crypto offload.

Notable corner cases, exceptions and additional requirements
============================================================

.. _5tuple_problems:

5-tuple matching limitations
----------------------------

The device can only recognize received packets based on the 5-tuple
of the socket. Current ``ktls`` implementation will not offload sockets
routed through software interfaces such as those used for tunneling
or virtual networking. However, many packet transformations performed
by the networking stack (most notably any BPF logic) do not require
any intermediate software device, therefore a 5-tuple match may
consistently miss at the device level. In such cases the device
should still be able to perform TX offload (encryption) and should
fall back cleanly to software decryption (RX).

Out of order
------------

Introducing extra processing in NICs should not cause packets to be
transmitted or received out of order, for example pure ACK packets
should not be reordered with respect to data segments.

Ingress reorder
---------------

A device is permitted to perform packet reordering for consecutive
TCP segments (i.e. placing packets in the correct order) but any form
of additional buffering is disallowed.

Coexistence with standard networking offload features
-----------------------------------------------------

Offloaded ``ktls`` sockets should support standard TCP stack features
transparently. Enabling device TLS offload should not cause any difference
in packets as seen on the wire.

Transport layer transparency
----------------------------

The device should not modify any packet headers for the purpose
of simplifying TLS offload.

The device should not depend on any packet headers beyond what is strictly
necessary for TLS offload.

Segment drops
-------------

Dropping packets is acceptable only in the event of catastrophic
system errors and should never be used as an error handling mechanism
in cases arising from normal operation. In other words, reliance
on TCP retransmissions to handle corner cases is not acceptable.

TLS device features
-------------------

Drivers should ignore the changes to the TLS device feature flags.
These flags will be acted upon accordingly by the core ``ktls`` code.
TLS device feature flags only control adding of new TLS connection
offloads, old connections will remain active after flags are cleared.

TLS encryption cannot be offloaded to devices without checksum calculation
offload. Hence, TLS TX device feature flag requires TX csum offload being set.
Disabling the latter implies clearing the former. Disabling TX checksum offload
should not affect old connections, and drivers should make sure checksum
calculation does not break for them.
Similarly, device-offloaded TLS decryption implies doing RXCSUM.
If the user
does not want to enable RX csum offload, TLS RX device feature is disabled
as well.