==================
IP over InfiniBand
==================

The ib_ipoib driver is an implementation of the IP over InfiniBand
protocol as specified by RFC 4391 and 4392, issued by the IETF ipoib
working group.  It is a "native" implementation in the sense of
setting the interface type to ARPHRD_INFINIBAND and the hardware
address length to 20 (earlier proprietary implementations
masqueraded to the kernel as ethernet interfaces).

Partitions and P_Keys
=====================

When the IPoIB driver is loaded, it creates one interface for each
port using the P_Key at index 0.  To create an interface with a
different P_Key, write the desired P_Key into the main interface's
/sys/class/net/<intf name>/create_child file.  For example::

  echo 0x8001 > /sys/class/net/ib0/create_child

This will create an interface named ib0.8001 with P_Key 0x8001.  To
remove a subinterface, use the "delete_child" file::

  echo 0x8001 > /sys/class/net/ib0/delete_child

The P_Key for any interface is given by the "pkey" file, and the
main interface for a subinterface is given by the "parent" file.
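
As a minimal sketch of how the "pkey" and "parent" files might be read
together, the helper below prints one line per interface that exposes a
"pkey" file.  The function name ``ipoib_list_pkeys`` and its optional
directory argument are illustrative assumptions, not something the driver
provides.

```shell
#!/bin/sh
# Hypothetical helper: print "<interface> <pkey> <parent>" for every
# interface under a sysfs net directory that has a readable "pkey" file.
# The directory defaults to /sys/class/net; passing another directory is
# only a convenience for trying the function out.
ipoib_list_pkeys() {
    net_dir="${1:-/sys/class/net}"
    for dev in "$net_dir"/*; do
        # Interfaces without a "pkey" file are not IPoIB interfaces.
        [ -r "$dev/pkey" ] || continue
        pkey=$(cat "$dev/pkey")
        # Only subinterfaces have a "parent" file; print "-" otherwise.
        if [ -r "$dev/parent" ]; then
            parent=$(cat "$dev/parent")
        else
            parent="-"
        fi
        printf '%s %s %s\n' "$(basename "$dev")" "$pkey" "$parent"
    done
}
```

Calling ``ipoib_list_pkeys`` with no argument walks /sys/class/net and
skips every interface that has no "pkey" file.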

Child interfaces can also be created and deleted using IPoIB's
rtnl_link_ops.  Children created either way behave the same.
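
The rtnl_link_ops path is normally driven from userspace with the
iproute2 ``ip link`` command.  The sketch below only prints the commands
rather than running them; it assumes an iproute2 build that knows the
``ipoib`` link type, and the function name and the choice of child name
(mirroring the sysfs naming) are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print (do not execute) the iproute2 commands that
# would create and later delete an IPoIB child interface over netlink.
ipoib_child_cmds() {
    parent="$1"                     # main interface, e.g. ib0
    pkey="$2"                       # P_Key in hex, e.g. 0x8001
    # Assumption: name the child <parent>.<pkey without 0x>, as the
    # create_child sysfs file does.
    child="${parent}.${pkey#0x}"
    printf 'ip link add link %s name %s type ipoib pkey %s\n' \
        "$parent" "$child" "$pkey"
    printf 'ip link delete %s\n' "$child"
}
```

Running the printed commands requires root privileges and an existing
parent interface.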

Datagram vs Connected modes
===========================

The IPoIB driver supports two modes of operation: datagram and
connected.  The mode is set and read through an interface's
/sys/class/net/<intf name>/mode file.
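
A minimal sketch of reading and setting the mode through that file
follows.  The function name ``ipoib_mode`` and its optional third
argument (an alternative sysfs directory) are assumptions made for
illustration; only the "mode" file itself comes from the driver.

```shell
#!/bin/sh
# Hypothetical helper: optionally set, then print, an interface's mode.
#   ipoib_mode <intf> [datagram|connected] [net_dir]
ipoib_mode() {
    intf="$1"
    new_mode="$2"
    net_dir="${3:-/sys/class/net}"
    mode_file="$net_dir/$intf/mode"
    if [ -n "$new_mode" ]; then
        case "$new_mode" in
            datagram|connected)
                # Writing the mode name switches the interface mode.
                echo "$new_mode" > "$mode_file" || return 1
                ;;
            *)
                echo "unknown mode: $new_mode" >&2
                return 1
                ;;
        esac
    fi
    # Reading the file reports the current mode.
    cat "$mode_file"
}
```

For example, ``ipoib_mode ib0`` reads the current mode and
``ipoib_mode ib0 connected`` requests connected mode before reading it
back.  Writing the file on a real system requires root privileges.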

In datagram mode, the IB UD (Unreliable Datagram) transport is used
and so the interface MTU is equal to the IB L2 MTU minus the
IPoIB encapsulation header (4 bytes).  For example, in a typical IB
fabric with a 2K MTU, the IPoIB MTU will be 2048 - 4 = 2044 bytes.
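
The same arithmetic can be sketched as a small shell function.  The name
``ipoib_ud_mtu`` is hypothetical; the 4-byte header size is the one
stated above.

```shell
#!/bin/sh
# Hypothetical helper: compute the datagram-mode IPoIB MTU from the
# IB L2 MTU by subtracting the 4-byte IPoIB encapsulation header.
ipoib_ud_mtu() {
    ib_mtu="$1"
    encap_hdr=4
    echo $((ib_mtu - encap_hdr))
}
```

``ipoib_ud_mtu 2048`` prints 2044, matching the example for a 2K fabric.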

In connected mode, the IB RC (Reliable Connected) transport is used.
Connected mode takes advantage of the connected nature of the IB
transport and allows an MTU up to the maximal IP packet size of 64K,
which reduces the number of IP packets needed for handling large UDP
datagrams, TCP segments, etc., and increases the performance for large
messages.

In connected mode, the interface's UD QP is still used for multicast
and communication with peers that don't support connected mode.  In
this case, RX emulation of ICMP PMTU packets is used to cause the
networking stack to use the smaller UD MTU for these neighbours.

Stateless offloads
==================

If the IB HW supports IPoIB stateless offloads, IPoIB advertises
TCP/IP checksum and/or Large Send (LSO) offloading capability to the
network stack.

Large Receive (LRO) offloading is also implemented and may be turned
on/off using ethtool calls.  Currently LRO is supported only for
checksum offload capable devices.

Stateless offloads are supported only in datagram mode.

Interrupt moderation
====================

If the underlying IB device supports CQ event moderation, one can
use ethtool to set interrupt mitigation parameters and thus reduce
the overhead incurred by handling interrupts.  The main code path of
IPoIB doesn't use events for TX completion signaling so only RX
moderation is supported.

Debugging Information
=====================

By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set
to 'y', tracing messages are compiled into the driver.  They are
turned on by setting the module parameters debug_level and
mcast_debug_level to 1.  These parameters can be controlled at
runtime through files in /sys/module/ib_ipoib/.
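
A minimal sketch of toggling both parameters at runtime is shown below.
It assumes the parameters appear under a ``parameters`` subdirectory of
/sys/module/ib_ipoib/, which is the usual layout for module parameters
but is not stated above; the function name and its optional directory
argument are likewise illustrative.

```shell
#!/bin/sh
# Hypothetical helper: write the same level to debug_level and
# mcast_debug_level.
#   ipoib_set_debug <level> [param_dir]
ipoib_set_debug() {
    level="$1"
    # Assumption: module parameters live in the "parameters" subdirectory.
    param_dir="${2:-/sys/module/ib_ipoib/parameters}"
    for p in debug_level mcast_debug_level; do
        if [ -w "$param_dir/$p" ]; then
            echo "$level" > "$param_dir/$p"
        else
            # The file is missing when the driver was built without
            # debugging support or is not loaded.
            echo "skipping $p: not present or not writable" >&2
        fi
    done
}
```

``ipoib_set_debug 1`` turns the tracing messages on and
``ipoib_set_debug 0`` turns them off again; both require root privileges
on a real system.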

CONFIG_INFINIBAND_IPOIB_DEBUG also enables files in the debugfs
virtual filesystem.  By mounting this filesystem, for example with::

  mount -t debugfs none /sys/kernel/debug

it is possible to get statistics about multicast groups from the
files /sys/kernel/debug/ipoib/ib0_mcg and so on.

The performance impact of this option is negligible, so it
is safe to enable this option with debug_level set to 0 for normal
operation.

CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output in
the data path when data_debug_level is set to 1.  However, even with
the output disabled, enabling this configuration option will affect
performance, because it adds tests to the fast path.

References
==========

Transmission of IP over InfiniBand (IPoIB) (RFC 4391)
  http://ietf.org/rfc/rfc4391.txt

IP over InfiniBand (IPoIB) Architecture (RFC 4392)
  http://ietf.org/rfc/rfc4392.txt

IP over InfiniBand: Connected Mode (RFC 4755)
  http://ietf.org/rfc/rfc4755.txt