================================
Documentation for /proc/sys/net/
================================

Copyright

Copyright (c) 1999

    - Terrehon Bowden <terrehon@pacbell.net>
    - Bodo Bauer <bb@ricochet.net>

Copyright (c) 2000

    - Jorge Nerin <comandante@zaralinux.com>

Copyright (c) 2009

    - Shen Feng <shen@cn.fujitsu.com>

For general info and legal blurb, please look in index.rst.

------------------------------------------------------------------------------

This file contains the documentation for the sysctl files in
/proc/sys/net

The interface to the networking parts of the kernel is located in
/proc/sys/net. The following table shows all possible subdirectories. You may
see only some of them, depending on your kernel's configuration.


Table : Subdirectories in /proc/sys/net

========= =================== = ========== ===================
Directory Content               Directory  Content
========= =================== = ========== ===================
802       E802 protocol         mptcp      Multipath TCP
appletalk Appletalk protocol    netfilter  Network Filter
ax25      AX25                  netrom     NET/ROM
bridge    Bridging              rose       X.25 PLP layer
core      General parameter     tipc       TIPC
ethernet  Ethernet protocol     unix       Unix domain sockets
ipv4      IP version 4          x25        X.25 protocol
ipv6      IP version 6
========= =================== = ========== ===================

1. /proc/sys/net/core - Network core options
============================================

bpf_jit_enable
--------------

This enables the BPF Just in Time (JIT) compiler. BPF is a flexible
and efficient infrastructure that allows bytecode to be executed at various
hook points. It is used in a number of Linux kernel subsystems such
as networking (e.g. XDP, tc), tracing (e.g. kprobes, uprobes, tracepoints)
and security (e.g. seccomp). LLVM has a BPF back end that can compile
restricted C into a sequence of BPF instructions.
After program load through bpf(2) and passing a verifier in the kernel,
a JIT will then translate these BPF proglets into native CPU instructions.
There are two flavors of JITs, the newer eBPF JIT currently supported on:

  - x86_64
  - x86_32
  - arm64
  - arm32
  - ppc64
  - ppc32
  - sparc64
  - mips64
  - s390x
  - riscv64
  - riscv32
  - loongarch64
  - arc

And the older cBPF JIT supported on the following archs:

  - mips
  - sparc

eBPF JITs are a superset of cBPF JITs, meaning the kernel will
migrate cBPF instructions into eBPF instructions and then JIT
compile them transparently. Older cBPF JITs can only translate
tcpdump filters, seccomp rules, etc, but not the eBPF programs
mentioned above that are loaded through bpf(2).

Values:

  - 0 - disable the JIT (default value)
  - 1 - enable the JIT
  - 2 - enable the JIT and ask the compiler to emit traces on kernel log.

bpf_jit_harden
--------------

This enables hardening for the BPF JIT compiler. eBPF JIT backends are
supported. Enabling hardening trades off performance, but can
mitigate JIT spraying.

Values:

  - 0 - disable JIT hardening (default value)
  - 1 - enable JIT hardening for unprivileged users only
  - 2 - enable JIT hardening for all users

where "privileged user" in this context means a process having
CAP_BPF or CAP_SYS_ADMIN in the root user name space.

bpf_jit_kallsyms
----------------

When the BPF JIT compiler is enabled, the compiled images are at addresses
unknown to the kernel, meaning they neither show up in traces nor
in /proc/kallsyms. This enables export of these addresses, which can
be used for debugging/tracing. If bpf_jit_harden is enabled, this
feature is disabled.

Values:

  - 0 - disable JIT kallsyms export (default value)
  - 1 - enable JIT kallsyms export for privileged users only

bpf_jit_limit
-------------

This enforces a global limit for memory allocations to the BPF JIT
compiler in order to reject unprivileged JIT requests once it has
been surpassed. bpf_jit_limit contains the value of the global limit
in bytes.

dev_weight
----------

The maximum number of packets that the kernel can handle on a NAPI interrupt.
It is a per-CPU variable. For drivers that support LRO or GRO_HW, a hardware
aggregated packet is counted as one packet in this context.

Default: 64

dev_weight_rx_bias
------------------

RPS (e.g. RFS, aRFS) processing competes with the registered NAPI poll
function of the driver for the per softirq cycle netdev_budget. This parameter
influences the proportion of the configured netdev_budget that is spent on RPS
based packet processing during RX softirq cycles. It is further meant for
making the current dev_weight adaptable to asymmetric CPU needs on the RX/TX
side of the network stack (see dev_weight_tx_bias). It is effective on a per
CPU basis. The effective value is based on dev_weight and is calculated
multiplicatively (dev_weight * dev_weight_rx_bias).

Default: 1

dev_weight_tx_bias
------------------

Scales the maximum number of packets that can be processed during a TX softirq
cycle. Effective on a per CPU basis. Allows scaling of the current dev_weight
for asymmetric net stack processing needs. Be careful to avoid making TX
softirq processing a CPU hog.

Calculation is based on dev_weight (dev_weight * dev_weight_tx_bias).

Default: 1

default_qdisc
-------------

The default queuing discipline to use for network devices. This allows
overriding the default of pfifo_fast with an alternative.
Because the default queuing discipline is created without additional
parameters, this setting is best suited to queuing disciplines that work well
without configuration, like stochastic fair queue (sfq), CoDel (codel) or
fair queue CoDel (fq_codel). Don't use queuing disciplines like Hierarchical
Token Bucket or Deficit Round Robin which require setting up classes and
bandwidths. Note that physical multiqueue interfaces still use mq as root
qdisc, which in turn uses this default for its leaves. Virtual devices (e.g.
lo or veth) ignore this setting and instead default to noqueue.

Default: pfifo_fast

busy_read
---------

Low latency busy poll timeout for socket reads. (needs CONFIG_NET_RX_BUSY_POLL)
Approximate time in microseconds to busy loop waiting for packets on the
device queue. This sets the default value of the SO_BUSY_POLL socket option.
Can be set or overridden per socket by setting socket option SO_BUSY_POLL,
which is the preferred method of enabling. If you need to enable the feature
globally via sysctl, a value of 50 is recommended.

Will increase power usage.

Default: 0 (off)

busy_poll
---------

Low latency busy poll timeout for poll and select. (needs CONFIG_NET_RX_BUSY_POLL)
Approximate time in microseconds to busy loop waiting for events.
Recommended value depends on the number of sockets you poll on.
For several sockets 50, for several hundreds 100.
For more than that you probably want to use epoll.
Note that only sockets with SO_BUSY_POLL set will be busy polled,
so you want to either selectively set SO_BUSY_POLL on those sockets or set
the net.core.busy_read sysctl globally.

Will increase power usage.

Default: 0 (off)

mem_pcpu_rsv
------------

Per-cpu reserved forward alloc cache size in page units. Default 1MB per CPU.
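
As a rough illustration of the page-unit arithmetic for mem_pcpu_rsv, the
following Python sketch converts a value in pages into bytes. The helper names
(``pages_to_bytes``, ``read_mem_pcpu_rsv``) are hypothetical and not part of
any kernel interface, and the 256-page figure in the comment assumes 4 KiB
pages; other page sizes give a different page count for the same 1MB.

```python
import os

# Path of the sysctl file described above.
SYSCTL_PATH = "/proc/sys/net/core/mem_pcpu_rsv"


def pages_to_bytes(pages, page_size=None):
    """Convert a value expressed in page units to bytes.

    If page_size is not given, the page size of the running system is used.
    """
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")
    return pages * page_size


def read_mem_pcpu_rsv(path=SYSCTL_PATH):
    """Return the sysctl value in pages, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    # Assumption: with 4 KiB pages, 1MB corresponds to 256 pages.
    pages = read_mem_pcpu_rsv()
    if pages is None:
        print("mem_pcpu_rsv is not available on this system")
    else:
        print(f"{pages} pages = {pages_to_bytes(pages)} bytes per CPU")
```

The read helper returns None instead of raising, since the file only exists on
kernels that provide this sysctl.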

rmem_default
------------

The default setting of the socket receive buffer in bytes.

rmem_max
--------

The maximum receive socket buffer size in bytes.

rps_default_mask
----------------

The default RPS CPU mask used on newly created network devices. An empty
mask means RPS disabled by default.

tstamp_allow_data
-----------------

Allow processes to receive tx timestamps looped together with the original
packet contents. If disabled, transmit timestamp requests from unprivileged
processes are dropped unless socket option SOF_TIMESTAMPING_OPT_TSONLY is set.

Default: 1 (on)


wmem_default
------------

The default setting (in bytes) of the socket send buffer.

wmem_max
--------

The maximum send socket buffer size in bytes.

message_burst and message_cost
------------------------------

These parameters are used to limit the warning messages written to the kernel
log from the networking code. They enforce a rate limit so that flooding the
log cannot be used for a denial-of-service attack. A higher message_cost
factor results in fewer messages being written. message_burst controls when
messages will be dropped. The default settings limit warning messages to one
every five seconds.

warnings
--------

This sysctl is now unused.

This was used to control console messages from the networking stack that
occur because of problems on the network like duplicate address or bad
checksums.

These messages are now emitted at KERN_DEBUG and can generally be enabled
and controlled by the dynamic_debug facility.

netdev_budget
-------------

Maximum number of packets taken from all interfaces in one polling cycle (NAPI
poll). In one polling cycle interfaces which are registered to polling are
probed in a round-robin manner.
Also, a polling cycle may not exceed
netdev_budget_usecs microseconds, even if netdev_budget has not been
exhausted.

netdev_budget_usecs
-------------------

Maximum number of microseconds in one NAPI polling cycle. Polling
will exit when either netdev_budget_usecs have elapsed during the
poll cycle or the number of packets processed reaches netdev_budget.

netdev_max_backlog
------------------

Maximum number of packets queued on the INPUT side when the interface
receives packets faster than the kernel can process them.

netdev_rss_key
--------------

RSS (Receive Side Scaling) enabled drivers use a 40-byte host key that is
randomly generated.
Some user space might need to gather its content even if drivers do not
provide ethtool -x support yet.

::

  myhost:~# cat /proc/sys/net/core/netdev_rss_key
  84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8: ... (52 bytes total)

The file contains NUL bytes if no driver has ever called the
netdev_rss_key_fill() function.

Note:
  /proc/sys/net/core/netdev_rss_key contains 52 bytes of key,
  but most drivers only use 40 bytes of it.

::

  myhost:~# ethtool -x eth0
  RX flow hash indirection table for eth0 with 8 RX ring(s):
      0:      0     1     2     3     4     5     6     7
  RSS hash key:
  84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8:43:e3:c9:0c:fd:17:55:c2:3a:4d:69:ed:f1:42:89

netdev_tstamp_prequeue
----------------------

If set to 0, RX packet timestamps can be sampled after RPS processing, when
the target CPU processes packets. This may delay timestamps somewhat, but
allows the load to be distributed across several CPUs.

If set to 1 (default), timestamps are sampled as soon as possible, before
queueing.

netdev_unregister_timeout_secs
------------------------------

Unregister network device timeout in seconds.
This option controls the timeout (in seconds) used to issue a warning while
waiting for a network device refcount to drop to 0 during device
unregistration. A lower value may be useful during bisection to detect
a leaked reference faster. A larger value may be useful to prevent false
warnings on slow/loaded systems.
Default value is 10, minimum 1, maximum 3600.

skb_defer_max
-------------

Max size (in skbs) of the per-cpu list of skbs being freed
by the cpu which allocated them. Used by TCP stack so far.

Default: 64

optmem_max
----------

Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence
of struct cmsghdr structures with appended data. TCP tx zerocopy also uses
optmem_max as a limit for its internal structures.

Default: 128 KB

fb_tunnels_only_for_init_net
----------------------------

Controls if fallback tunnels (like tunl0, gre0, gretap0, erspan0,
sit0, ip6tnl0, ip6gre0) are automatically created. There are 3 possibilities:

(a) value = 0; respective fallback tunnels are created when a module is
    loaded in every net namespace (backward compatible behavior).
(b) value = 1; [kcmd value: initns] respective fallback tunnels are
    created only in the init net namespace, and every other net namespace
    will not have them.
(c) value = 2; [kcmd value: none] fallback tunnels are not created
    when a module is loaded in any net namespace. Setting the value to
    "2" is pointless after boot if these modules are built-in, so there is
    a kernel command-line option that can change this default. Please refer
    to Documentation/admin-guide/kernel-parameters.txt for additional details.

Not creating fallback tunnels gives control to userspace to create
only what is needed and to avoid creating redundant devices.

Default: 0 (for compatibility reasons)

devconf_inherit_init_net
------------------------

Controls if a new network namespace should inherit all current
settings under /proc/sys/net/{ipv4,ipv6}/conf/{all,default}/. By
default, we keep the current behavior: for IPv4 we inherit all current
settings from init_net and for IPv6 we reset all settings to default.

If set to 1, both IPv4 and IPv6 settings are forced to inherit from
current ones in init_net. If set to 2, both IPv4 and IPv6 settings are
forced to reset to their default values. If set to 3, both IPv4 and IPv6
settings are forced to inherit from current ones in the netns where this
new netns has been created.

Default: 0 (for compatibility reasons)

txrehash
--------

Controls default hash rethink behaviour on a socket when the SO_TXREHASH
option is set to SOCK_TXREHASH_DEFAULT (i.e. not overridden by setsockopt).

If set to 1 (default), hash rethink is performed on listening sockets.
If set to 0, hash rethink is not performed.

gro_normal_batch
----------------

Maximum number of segments to batch up on output of GRO. When a packet
exits GRO, either as a coalesced superframe or as an original packet which
GRO has decided not to coalesce, it is placed on a per-NAPI list. This
list is then passed to the stack when the number of segments reaches the
gro_normal_batch limit.

high_order_alloc_disable
------------------------

By default the allocator for page frags tries to use high order pages (order-3
on x86). While the default behavior gives good results in most cases, some users
might have hit contention in page allocation/freeing. This was especially
true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
lists. This setting allows opting in to order-0 allocation instead, but is now
mostly of historical importance.

Default: 0

2. /proc/sys/net/unix - Parameters for Unix domain sockets
==========================================================

There is only one file in this directory.
unix_dgram_qlen limits the max number of datagrams queued in a Unix domain
socket's buffer. It will not take effect unless the PF_UNIX flag is specified.


3. /proc/sys/net/ipv4 - IPv4 settings
=====================================

Please see: Documentation/networking/ip-sysctl.rst and
Documentation/admin-guide/sysctl/net.rst for descriptions of these entries.


4. Appletalk
============

The /proc/sys/net/appletalk directory holds the Appletalk configuration data
when Appletalk is loaded. The configurable parameters are:

aarp-expiry-time
----------------

The amount of time we keep an AARP entry before expiring it. Used to age out
old hosts.

aarp-resolve-time
-----------------

The amount of time we will spend trying to resolve an Appletalk address.

aarp-retransmit-limit
---------------------

The number of times we will retransmit a query before giving up.

aarp-tick-time
--------------

Controls the rate at which expires are checked.

The directory /proc/net/appletalk holds the list of active Appletalk sockets
on a machine.

The fields indicate the DDP type, the local address (in network:node format),
the remote address, the size of the transmit pending queue, the size of the
received queue (bytes waiting for applications to read), the state and the uid
owning the socket.

/proc/net/atalk_iface lists all the interfaces configured for Appletalk. It
shows the name of the interface, its Appletalk address, the network range on
that address (or network number for phase 1 networks), and the status of the
interface.

/proc/net/atalk_route lists each known network route.
It lists the target
(network) that the route leads to, the router (may be directly connected), the
route flags, and the device the route is using.

5. TIPC
=======

tipc_rmem
---------

The TIPC protocol now has a tunable for the receive memory, similar to
tcp_rmem, i.e. a vector of 3 INTEGERs: (min, default, max)

::

    # cat /proc/sys/net/tipc/tipc_rmem
    4252725 34021800 68043600
    #

The max value is set to CONN_OVERLOAD_LIMIT, and the default and min values
are scaled (shifted) versions of that same value. Note that the min value
is not at this point in time used in any meaningful way, but the triplet is
preserved in order to be consistent with things like tcp_rmem.

named_timeout
-------------

TIPC name table updates are distributed asynchronously in a cluster, without
any form of transaction handling. This means that different race scenarios are
possible. One such scenario is that a name withdrawal sent out by one node and
received by another node may arrive after a second, overlapping name
publication has already been accepted from a third node, although the
conflicting updates may originally have been issued in the correct sequential
order.
If named_timeout is nonzero, failed topology updates will be placed on a defer
queue until another event arrives that clears the error, or until the timeout
expires. Value is in milliseconds.
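
Returning to the tipc_rmem triplet described earlier in this section, the
"scaled (shifted)" relationship between its values can be checked with a short
Python sketch. The function name ``parse_tipc_rmem`` is hypothetical, and the
shift amounts (1 and 4) are inferred only from the sample output shown in this
document, not from the TIPC sources, so they may not hold on other kernels.

```python
def parse_tipc_rmem(text):
    """Parse a 'min default max' triplet as printed by tipc_rmem."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 integers, got {len(fields)}")
    rmem_min, rmem_default, rmem_max = (int(field) for field in fields)
    return rmem_min, rmem_default, rmem_max


# Sample output copied from the tipc_rmem example in this document.
SAMPLE = "4252725 34021800 68043600"

rmem_min, rmem_default, rmem_max = parse_tipc_rmem(SAMPLE)

# Assumption drawn from the sample only: default and min are
# right-shifted versions of max.
default_matches = rmem_default == rmem_max >> 1
min_matches = rmem_min == rmem_max >> 4
```

On a live system the same parser could be fed the contents of
/proc/sys/net/tipc/tipc_rmem instead of the sample string.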