==================
IP over InfiniBand
==================

The ib_ipoib driver is an implementation of the IP over InfiniBand
protocol as specified by RFC 4391 and 4392, issued by the IETF ipoib
working group.  It is a "native" implementation in the sense of
setting the interface type to ARPHRD_INFINIBAND and the hardware
address length to 20 (earlier proprietary implementations
masqueraded to the kernel as ethernet interfaces).

Partitions and P_Keys
=====================

When the IPoIB driver is loaded, it creates one interface for each
port using the P_Key at index 0.  To create an interface with a
different P_Key, write the desired P_Key into the main interface's
/sys/class/net/<intf name>/create_child file.  For example::

  echo 0x8001 > /sys/class/net/ib0/create_child

This will create an interface named ib0.8001 with P_Key 0x8001.
To remove a subinterface, use the "delete_child" file::

  echo 0x8001 > /sys/class/net/ib0/delete_child

The P_Key for any interface is given by the "pkey" file, and the
main interface for a subinterface is given by the "parent" file.

Child interfaces can also be created and deleted using IPoIB's
rtnl_link_ops; children created either way behave the same.

Datagram vs Connected modes
===========================

The IPoIB driver supports two modes of operation: datagram and
connected.  The mode is set and read through an interface's
/sys/class/net/<intf name>/mode file.

In datagram mode, the IB UD (Unreliable Datagram) transport is used
and so the interface MTU is equal to the IB L2 MTU minus the
IPoIB encapsulation header (4 bytes).  For example, in a typical IB
fabric with a 2K MTU, the IPoIB MTU will be 2048 - 4 = 2044 bytes.

In connected mode, the IB RC (Reliable Connected) transport is used.
Connected mode takes advantage of the connected nature of the IB
transport and allows an MTU up to the maximal IP packet size of 64K,
which reduces the number of IP packets needed for handling large UDP
datagrams, TCP segments, etc., and increases the performance for large
messages.

In connected mode, the interface's UD QP is still used for multicast
and communication with peers that don't support connected mode. In
this case, RX emulation of ICMP PMTU packets is used to cause the
networking stack to use the smaller UD MTU for these neighbours.

Stateless offloads
==================

If the IB HW supports IPoIB stateless offloads, IPoIB advertises
TCP/IP checksum and/or Large Send (LSO) offloading capability to the
network stack.

Large Receive (LRO) offloading is also implemented and may be turned
on/off using ethtool calls.  Currently LRO is supported only for
checksum offload capable devices.

Stateless offloads are supported only in datagram mode.
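
As an illustration of the ethtool control mentioned above, the
following shell sketch toggles LRO on an IPoIB interface.  The
interface name ib0, the helper name set_lro and the DRY_RUN variable
are assumptions made for this example and are not part of the driver.
The sketch relies on the generic ``ethtool -K <dev> lro on|off``
syntax and only has an effect if the device supports the offload.

```shell
#!/bin/sh
# Sketch only: set_lro and DRY_RUN are illustrative names, not driver API.
# With DRY_RUN=1 the ethtool command is printed instead of executed.
set_lro() {
    intf=$1
    state=$2

    # Accept only the two states ethtool understands for a feature flag.
    case "$state" in
        on|off) ;;
        *) echo "state must be 'on' or 'off'" >&2; return 1 ;;
    esac

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ethtool -K $intf lro $state"
    else
        ethtool -K "$intf" lro "$state"
    fi
}
```

For example, ``DRY_RUN=1; set_lro ib0 off`` prints the command that
would be run, which can be reviewed before executing it as root.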

Interrupt moderation
====================

If the underlying IB device supports CQ event moderation, one can
use ethtool to set interrupt mitigation parameters and thus reduce
the overhead incurred by handling interrupts.  The main code path of
IPoIB doesn't use events for TX completion signaling so only RX
moderation is supported.

Debugging Information
=====================

By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set
to 'y', tracing messages are compiled into the driver.  They are
turned on by setting the module parameters debug_level and
mcast_debug_level to 1.  These parameters can be controlled at
runtime through files in /sys/module/ib_ipoib/.

CONFIG_INFINIBAND_IPOIB_DEBUG also enables files in the debugfs
virtual filesystem.  By mounting this filesystem, for example with::

  mount -t debugfs none /sys/kernel/debug

it is possible to get statistics about multicast groups from the
files /sys/kernel/debug/ipoib/ib0_mcg and so on.
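
The runtime control of debug_level and mcast_debug_level described
above can be sketched in shell as follows.  This assumes the
parameters appear under the "parameters" subdirectory of
/sys/module/ib_ipoib/, as is usual for module parameters; the
variable PARAM_DIR and the helper name set_ipoib_debug are
illustrative only.

```shell
#!/bin/sh
# Sketch only: PARAM_DIR and set_ipoib_debug are illustrative names.
# PARAM_DIR defaults to the usual sysfs location of module parameters.
PARAM_DIR=${PARAM_DIR:-/sys/module/ib_ipoib/parameters}

set_ipoib_debug() {
    level=$1
    for p in debug_level mcast_debug_level; do
        # The files are expected only when the driver was built with
        # CONFIG_INFINIBAND_IPOIB_DEBUG=y; skip any that are not writable.
        if [ -w "$PARAM_DIR/$p" ]; then
            echo "$level" > "$PARAM_DIR/$p"
        else
            echo "skipping $p: $PARAM_DIR/$p is not writable" >&2
        fi
    done
}
```

Running ``set_ipoib_debug 1`` as root turns the tracing messages on,
and ``set_ipoib_debug 0`` turns them off again.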

The performance impact of CONFIG_INFINIBAND_IPOIB_DEBUG is negligible,
so it is safe to enable it with debug_level set to 0 for normal
operation.

CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output in
the data path when data_debug_level is set to 1.  However, even with
the output disabled, enabling this configuration option will affect
performance, because it adds tests to the fast path.

References
==========

  Transmission of IP over InfiniBand (IPoIB) (RFC 4391)
    http://ietf.org/rfc/rfc4391.txt

  IP over InfiniBand (IPoIB) Architecture (RFC 4392)
    http://ietf.org/rfc/rfc4392.txt

  IP over InfiniBand: Connected Mode (RFC 4755)
    http://ietf.org/rfc/rfc4755.txt