| **************************** |
| RDMA Transport (RTRS) |
| **************************** |
| |
| RTRS (RDMA Transport) is a reliable high speed transport library |
| which provides support to establish optimal number of connections |
| between client and server machines using RDMA (InfiniBand, RoCE, iWarp) |
| transport. It is optimized to transfer (read/write) IO blocks. |
| |
| In its core interface it follows the BIO semantics of providing the |
| possibility to either write data from an sg list to the remote side |
| or to request ("read") data transfer from the remote side into a given |
| sg list. |
| |
| RTRS provides I/O fail-over and load-balancing capabilities by using |
| multipath I/O (see "add_path" and "mp_policy" configuration entries in |
| Documentation/ABI/testing/sysfs-class-rtrs-client). |
| |
| RTRS is used by the RNBD (RDMA Network Block Device) modules. |
| |
| ================== |
| Transport protocol |
| ================== |
| |
| Overview |
| -------- |
| An established connection between a client and a server is called rtrs |
| session. A session is associated with a set of memory chunks reserved on the |
| server side for a given client for rdma transfer. A session |
| consists of multiple paths, each representing a separate physical link |
| between client and server. Those are used for load balancing and failover. |
| Each path consists of as many connections (QPs) as there are cpus on |
| the client. |
| |
| When processing an incoming write or read request, rtrs client uses memory |
| chunks reserved for him on the server side. Their number, size and addresses |
| need to be exchanged between client and server during the connection |
| establishment phase. Apart from the memory related information client needs to |
| inform the server about the session name and identify each path and connection |
| individually. |
| |
| On an established session client sends to server write or read messages. |
| Server uses immediate field to tell the client which request is being |
| acknowledged and for errno. Client uses immediate field to tell the server |
| which of the memory chunks has been accessed and at which offset the message |
| can be found. |
| |
| Module parameter always_invalidate is introduced for the security problem |
| discussed in LPC RDMA MC 2019. When always_invalidate=Y, on the server side we |
| invalidate each rdma buffer before we hand it over to RNBD server and |
| then pass it to the block layer. A new rkey is generated and registered for the |
| buffer after it returns back from the block layer and RNBD server. |
| The new rkey is sent back to the client along with the IO result. |
| The procedure is the default behaviour of the driver. This invalidation and |
| registration on each IO causes performance drop of up to 20%. A user of the |
| driver may choose to load the modules with this mechanism switched off |
| (always_invalidate=N), if he understands and can take the risk of a malicious |
| client being able to corrupt memory of a server it is connected to. This might |
| be a reasonable option in a scenario where all the clients and all the servers |
| are located within a secure datacenter. |
| |
| |
| Connection establishment |
| ------------------------ |
| |
| 1. Client starts establishing connections belonging to a path of a session one |
| by one via attaching RTRS_MSG_CON_REQ messages to the rdma_connect requests. |
| Those include uuid of the session and uuid of the path to be |
| established. They are used by the server to find a persisting session/path or |
| to create a new one when necessary. The message also contains the protocol |
| version and magic for compatibility, total number of connections per session |
| (as many as cpus on the client), the id of the current connection and |
| the reconnect counter, which is used to resolve the situations where |
| client is trying to reconnect a path, while server is still destroying the old |
| one. |
| |
| 2. Server accepts the connection requests one by one and attaches |
| RTRS_MSG_CONN_RSP messages to the rdma_accept. Apart from magic and |
| protocol version, the messages include error code, queue depth supported by |
| the server (number of memory chunks which are going to be allocated for that |
| session) and the maximum size of one io, RTRS_MSG_NEW_RKEY_F flags is set |
| when always_invalidate=Y. |
| |
| 3. After all connections of a path are established client sends to server the |
| RTRS_MSG_INFO_REQ message, containing the name of the session. This message |
| requests the address information from the server. |
| |
| 4. Server replies to the session info request message with RTRS_MSG_INFO_RSP, |
| which contains the addresses and keys of the RDMA buffers allocated for that |
| session. |
| |
| 5. Session becomes connected after all paths to be established are connected |
| (i.e. steps 1-4 finished for all paths requested for a session) |
| |
| 6. Server and client exchange periodically heartbeat messages (empty rdma |
| messages with an immediate field) which are used to detect a crash on remote |
| side or network outage in an absence of IO. |
| |
| 7. On any RDMA related error or in the case of a heartbeat timeout, the |
| corresponding path is disconnected, all the inflight IO are failed over to a |
| healthy path, if any, and the reconnect mechanism is triggered. |
| |
| CLT SRV |
| *for each connection belonging to a path and for each path: |
| RTRS_MSG_CON_REQ -------------------> |
| <------------------- RTRS_MSG_CON_RSP |
| ... |
| *after all connections are established: |
| RTRS_MSG_INFO_REQ -------------------> |
| <------------------- RTRS_MSG_INFO_RSP |
| *heartbeat is started from both sides: |
| -------------------> [RTRS_HB_MSG_IMM] |
| [RTRS_HB_MSG_ACK] <------------------- |
| [RTRS_HB_MSG_IMM] <------------------- |
| -------------------> [RTRS_HB_MSG_ACK] |
| |
| IO path |
| ------- |
| |
| * Write (always_invalidate=N) * |
| |
| 1. When processing a write request client selects one of the memory chunks |
| on the server side and rdma writes there the user data, user header and the |
| RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only |
| contains size of the user header. The client tells the server which chunk has |
| been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by |
| using the IMM field. |
| |
| 2. When confirming a write request server sends an "empty" rdma message with |
| an immediate field. The 32 bit field is used to specify the outstanding |
| inflight IO and for the error code. |
| |
| CLT SRV |
| usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM] |
| [RTRS_IO_RSP_IMM] <----------------- (id + errno) |
| |
| * Write (always_invalidate=Y) * |
| |
| 1. When processing a write request client selects one of the memory chunks |
| on the server side and rdma writes there the user data, user header and the |
| RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only |
| contains size of the user header. The client tells the server which chunk has |
| been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by |
| using the IMM field, Server invalidate rkey associated to the memory chunks |
| first, when it finishes, pass the IO to RNBD server module. |
| |
| 2. When confirming a write request server sends an "empty" rdma message with |
| an immediate field. The 32 bit field is used to specify the outstanding |
| inflight IO and for the error code. The new rkey is sent back using |
| SEND_WITH_IMM WR, client When it recived new rkey message, it validates |
| the message and finished IO after update rkey for the rbuffer, then post |
| back the recv buffer for later use. |
| |
| CLT SRV |
| usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM] |
| [RTRS_MSG_RKEY_RSP] <----------------- (RTRS_MSG_RKEY_RSP) |
| [RTRS_IO_RSP_IMM] <----------------- (id + errno) |
| |
| |
| * Read (always_invalidate=N)* |
| |
| 1. When processing a read request client selects one of the memory chunks |
| on the server side and rdma writes there the user header and the |
| RTRS_MSG_RDMA_READ message. This message contains the type (read), size of |
| the user header, flags (specifying if memory invalidation is necessary) and the |
| list of addresses along with keys for the data to be read into. |
| |
| 2. When confirming a read request server transfers the requested data first, |
| attaches an invalidation message if requested and finally an "empty" rdma |
| message with an immediate field. The 32 bit field is used to specify the |
| outstanding inflight IO and the error code. |
| |
| CLT SRV |
| usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM] |
| [RTRS_IO_RSP_IMM] <-------------- usr_data + (id + errno) |
| or in case client requested invalidation: |
| [RTRS_IO_RSP_IMM_W_INV] <-------------- usr_data + (INV) + (id + errno) |
| |
| * Read (always_invalidate=Y)* |
| |
| 1. When processing a read request client selects one of the memory chunks |
| on the server side and rdma writes there the user header and the |
| RTRS_MSG_RDMA_READ message. This message contains the type (read), size of |
| the user header, flags (specifying if memory invalidation is necessary) and the |
| list of addresses along with keys for the data to be read into. |
| Server invalidate rkey associated to the memory chunks first, when it finishes, |
| passes the IO to RNBD server module. |
| |
| 2. When confirming a read request server transfers the requested data first, |
| attaches an invalidation message if requested and finally an "empty" rdma |
| message with an immediate field. The 32 bit field is used to specify the |
| outstanding inflight IO and the error code. The new rkey is sent back using |
| SEND_WITH_IMM WR, client When it recived new rkey message, it validates |
| the message and finished IO after update rkey for the rbuffer, then post |
| back the recv buffer for later use. |
| |
| CLT SRV |
| usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM] |
| [RTRS_IO_RSP_IMM] <-------------- usr_data + (id + errno) |
| [RTRS_MSG_RKEY_RSP] <----------------- (RTRS_MSG_RKEY_RSP) |
| or in case client requested invalidation: |
| [RTRS_IO_RSP_IMM_W_INV] <-------------- usr_data + (INV) + (id + errno) |
| ========================================= |
| Contributors List(in alphabetical order) |
| ========================================= |
| Danil Kipnis <danil.kipnis@profitbricks.com> |
| Fabian Holler <mail@fholler.de> |
| Guoqing Jiang <guoqing.jiang@cloud.ionos.com> |
| Jack Wang <jinpu.wang@profitbricks.com> |
| Kleber Souza <kleber.souza@profitbricks.com> |
| Lutz Pogrell <lutz.pogrell@cloud.ionos.com> |
| Milind Dumbare <Milind.dumbare@gmail.com> |
| Roman Penyaev <roman.penyaev@profitbricks.com> |