| .. _kernel_tls: |
| |
| ========== |
| Kernel TLS |
| ========== |
| |
| Overview |
| ======== |
| |
| Transport Layer Security (TLS) is an Upper Layer Protocol (ULP) that runs over |
| TCP. TLS provides end-to-end data integrity and confidentiality. |
| |
| User interface |
| ============== |
| |
| Creating a TLS connection |
| ------------------------- |
| |
| First create a new TCP socket and set the TLS ULP. |
| |
| .. code-block:: c |
| |
|   sock = socket(AF_INET, SOCK_STREAM, 0); |
|   setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")); |
| |
| Setting the TLS ULP allows us to set/get TLS socket options. Currently |
| only the symmetric encryption is handled in the kernel. After the TLS |
| handshake is complete, we have all the parameters required to move the |
| data-path to the kernel. There are separate socket options for moving |
| the transmit and the receive paths into the kernel. |
| |
| .. code-block:: c |
| |
|   /* From linux/tls.h */ |
|   struct tls_crypto_info { |
|           unsigned short version; |
|           unsigned short cipher_type; |
|   }; |
| |
|   struct tls12_crypto_info_aes_gcm_128 { |
|           struct tls_crypto_info info; |
|           unsigned char iv[TLS_CIPHER_AES_GCM_128_IV_SIZE]; |
|           unsigned char key[TLS_CIPHER_AES_GCM_128_KEY_SIZE]; |
|           unsigned char salt[TLS_CIPHER_AES_GCM_128_SALT_SIZE]; |
|           unsigned char rec_seq[TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE]; |
|   }; |
| |
|   struct tls12_crypto_info_aes_gcm_128 crypto_info; |
| |
|   crypto_info.info.version = TLS_1_2_VERSION; |
|   crypto_info.info.cipher_type = TLS_CIPHER_AES_GCM_128; |
|   memcpy(crypto_info.iv, iv_write, TLS_CIPHER_AES_GCM_128_IV_SIZE); |
|   memcpy(crypto_info.rec_seq, seq_number_write, |
|          TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE); |
|   memcpy(crypto_info.key, cipher_key_write, TLS_CIPHER_AES_GCM_128_KEY_SIZE); |
|   memcpy(crypto_info.salt, implicit_iv_write, TLS_CIPHER_AES_GCM_128_SALT_SIZE); |
| |
|   setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info)); |
| |
| Transmit and receive are set separately, but the setup is the same, using either |
| TLS_TX or TLS_RX. |
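
Because the two directions differ only in the key material and the option
name, the setup can be factored into one helper. This is a sketch under stated
assumptions: ``struct ktls_keys``, ``ktls_fill_aes_gcm_128()`` and
``ktls_install()`` are hypothetical names, and the key material is assumed to
come from the userspace handshake.

```c
#include <linux/tls.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_TLS
#define SOL_TLS 282 /* fallback for libc headers that lack it */
#endif

/* Hypothetical container for one direction's key material. */
struct ktls_keys {
    unsigned char iv[TLS_CIPHER_AES_GCM_128_IV_SIZE];
    unsigned char key[TLS_CIPHER_AES_GCM_128_KEY_SIZE];
    unsigned char salt[TLS_CIPHER_AES_GCM_128_SALT_SIZE];
    unsigned char rec_seq[TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE];
};

/* Fill the TLS 1.2 AES-GCM-128 structure from the key material. */
static void ktls_fill_aes_gcm_128(struct tls12_crypto_info_aes_gcm_128 *ci,
                                  const struct ktls_keys *k)
{
    memset(ci, 0, sizeof(*ci));
    ci->info.version = TLS_1_2_VERSION;
    ci->info.cipher_type = TLS_CIPHER_AES_GCM_128;
    memcpy(ci->iv, k->iv, sizeof(ci->iv));
    memcpy(ci->key, k->key, sizeof(ci->key));
    memcpy(ci->salt, k->salt, sizeof(ci->salt));
    memcpy(ci->rec_seq, k->rec_seq, sizeof(ci->rec_seq));
}

/* direction is TLS_TX or TLS_RX. Returns the setsockopt() result. */
static int ktls_install(int sock, int direction, const struct ktls_keys *k)
{
    struct tls12_crypto_info_aes_gcm_128 ci;

    ktls_fill_aes_gcm_128(&ci, k);
    return setsockopt(sock, SOL_TLS, direction, &ci, sizeof(ci));
}
```

A caller would invoke ``ktls_install(sock, TLS_TX, &write_keys)`` and
``ktls_install(sock, TLS_RX, &read_keys)`` with the respective write and read
keys.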
| |
| Sending TLS application data |
| ---------------------------- |
| |
| After setting the TLS_TX socket option all application data sent over this |
| socket is encrypted using TLS and the parameters provided in the socket option. |
| For example, we can send an encrypted hello world record as follows: |
| |
| .. code-block:: c |
| |
|   const char *msg = "hello world\n"; |
|   send(sock, msg, strlen(msg), 0 /* flags */); |
| |
| send() data is directly encrypted from the userspace buffer provided |
| to the encrypted kernel send buffer if possible. |
| |
| The sendfile system call will send the file's data in TLS records of maximum |
| length (2^14 bytes). |
| |
| .. code-block:: c |
| |
|   file = open(filename, O_RDONLY); |
|   fstat(file, &stat); |
|   sendfile(sock, file, &offset, stat.st_size); |
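
As a rough illustration of the record arithmetic, the number of records needed
for a file can be estimated as below. This is a sketch that assumes every
record except possibly the last carries the full 2^14 bytes of payload. The
names ``TLS_MAX_RECORD_PAYLOAD`` and ``tls_records_for()`` are hypothetical.

```c
#include <stddef.h>

#define TLS_MAX_RECORD_PAYLOAD 16384u /* 2^14 bytes */

/* Number of maximum-length records needed to carry len bytes of payload,
 * assuming only the final record may be shorter. */
static size_t tls_records_for(size_t len)
{
    return len / TLS_MAX_RECORD_PAYLOAD +
           (len % TLS_MAX_RECORD_PAYLOAD != 0);
}
```

For example, a 40000 byte file would need three records under this assumption:
two full records and one carrying the remaining 7232 bytes.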
| |
| TLS records are created and sent after each send() call, unless |
| MSG_MORE is passed. MSG_MORE will delay creation of a record until |
| MSG_MORE is not passed, or the maximum record size is reached. |
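
For instance, an application that builds a message from a header and a body
may want both in the same record. The sketch below shows one way to do that
with MSG_MORE. ``send_header_and_body()`` is a hypothetical helper, and for
brevity it does not retry short writes.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: send hdr followed by body. On a kTLS socket the
 * MSG_MORE flag on the first call delays record creation, so both parts
 * can share one record if they fit within the maximum record size.
 * Returns the total number of bytes accepted, or -1 on error. */
static ssize_t send_header_and_body(int sock,
                                    const void *hdr, size_t hdr_len,
                                    const void *body, size_t body_len)
{
    ssize_t n1, n2;

    n1 = send(sock, hdr, hdr_len, MSG_MORE);
    if (n1 < 0)
        return -1;

    /* No MSG_MORE here: this call closes the pending record. */
    n2 = send(sock, body, body_len, 0);
    if (n2 < 0)
        return -1;

    return n1 + n2;
}
```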
| |
| The kernel will need to allocate a buffer for the encrypted data. |
| This buffer is allocated at the time send() is called, such that |
| either the entire send() call will return -ENOMEM (or block waiting |
| for memory), or the encryption will always succeed. If send() returns |
| -ENOMEM and some data was left on the socket buffer from a previous |
| call using MSG_MORE, the MSG_MORE data is left on the socket buffer. |
| |
| Receiving TLS application data |
| ------------------------------ |
| |
| After setting the TLS_RX socket option, data returned by all recv family |
| socket calls is decrypted using the TLS parameters provided. A full TLS |
| record must be received before decryption can happen. |
| |
| .. code-block:: c |
| |
|   char buffer[16384]; |
|   recv(sock, buffer, sizeof(buffer), 0 /* flags */); |
| |
| Received data is decrypted directly into the user buffer if it is |
| large enough, and no additional allocations occur. If the userspace |
| buffer is too small, data is decrypted in the kernel and copied to |
| userspace. |
| |
| ``EINVAL`` is returned if the TLS version in the received message does not |
| match the version passed in setsockopt. |
| |
| ``EMSGSIZE`` is returned if the received message is too big. |
| |
| ``EBADMSG`` is returned if decryption failed for any other reason. |
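
A receiver might map these errno values to diagnostics along the following
lines. This is only a sketch: ``ktls_recv_error()`` is a hypothetical helper,
and it covers just the three error codes listed above.

```c
#include <errno.h>

/* Hypothetical helper: describe the TLS-specific errno values that a
 * recv family call may report on a socket with TLS_RX enabled. */
static const char *ktls_recv_error(int err)
{
    switch (err) {
    case EINVAL:
        return "record TLS version does not match the configured version";
    case EMSGSIZE:
        return "received record is too big";
    case EBADMSG:
        return "record decryption failed";
    default:
        return "other socket error";
    }
}
```

It would typically be called with the value of ``errno`` right after a failed
``recv()`` or ``recvmsg()``.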
| |
| Sending TLS control messages |
| ---------------------------- |
| |
| Other than application data, TLS has control messages such as alert |
| messages (record type 21) and handshake messages (record type 22). |
| These messages can be sent over the socket by providing the TLS record type |
| via a CMSG. For example, the following function sends @data of @length bytes |
| using a record of type @record_type. |
| |
| .. code-block:: c |
| |
|   /* send TLS control message using record_type */ |
|   static int ktls_send_ctrl_message(int sock, unsigned char record_type, |
|                                     void *data, size_t length) |
|   { |
|         struct msghdr msg = {0}; |
|         int cmsg_len = sizeof(record_type); |
|         struct cmsghdr *cmsg; |
|         char buf[CMSG_SPACE(cmsg_len)]; |
|         struct iovec msg_iov;   /* Vector of data to send/receive into.  */ |
| |
|         /* Describe the record type in a single SOL_TLS control message. */ |
|         msg.msg_control = buf; |
|         msg.msg_controllen = sizeof(buf); |
|         cmsg = CMSG_FIRSTHDR(&msg); |
|         cmsg->cmsg_level = SOL_TLS; |
|         cmsg->cmsg_type = TLS_SET_RECORD_TYPE; |
|         cmsg->cmsg_len = CMSG_LEN(cmsg_len); |
|         *CMSG_DATA(cmsg) = record_type; |
|         msg.msg_controllen = cmsg->cmsg_len; |
| |
|         /* The record payload itself is passed unencrypted. */ |
|         msg_iov.iov_base = data; |
|         msg_iov.iov_len = length; |
|         msg.msg_iov = &msg_iov; |
|         msg.msg_iovlen = 1; |
| |
|         return sendmsg(sock, &msg, 0); |
|   } |
| |
| Control message data should be provided unencrypted, and will be |
| encrypted by the kernel. |
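
As a concrete case, a ``close_notify`` alert is a two-byte payload (level 1,
description 0) sent with record type 21. The sketch below builds and sends
one. The helper names ``ktls_set_record_type()`` and
``ktls_send_close_notify()`` are hypothetical, and the ``SOL_TLS`` fallback
value is an assumption for libc headers that do not define it.

```c
#include <linux/tls.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef SOL_TLS
#define SOL_TLS 282 /* fallback for libc headers that lack it */
#endif

#define TLS_RECORD_TYPE_ALERT 21

/* Attach a record-type control message to msg. cbuf must hold at least
 * CMSG_SPACE(sizeof(unsigned char)) bytes and be suitably aligned. */
static void ktls_set_record_type(struct msghdr *msg, void *cbuf,
                                 size_t cbuf_len, unsigned char record_type)
{
    struct cmsghdr *cmsg;

    memset(cbuf, 0, cbuf_len);
    msg->msg_control = cbuf;
    msg->msg_controllen = cbuf_len;
    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = SOL_TLS;
    cmsg->cmsg_type = TLS_SET_RECORD_TYPE;
    cmsg->cmsg_len = CMSG_LEN(sizeof(record_type));
    *CMSG_DATA(cmsg) = record_type;
}

/* Send an unencrypted close_notify alert; the kernel encrypts it. */
static ssize_t ktls_send_close_notify(int sock)
{
    unsigned char alert[2] = { 1 /* warning */, 0 /* close_notify */ };
    union {
        struct cmsghdr align; /* forces cmsghdr alignment */
        char buf[CMSG_SPACE(sizeof(unsigned char))];
    } u;
    struct iovec iov = { .iov_base = alert, .iov_len = sizeof(alert) };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

    ktls_set_record_type(&msg, u.buf, sizeof(u.buf), TLS_RECORD_TYPE_ALERT);
    return sendmsg(sock, &msg, 0);
}
```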
| |
| Receiving TLS control messages |
| ------------------------------ |
| |
| TLS control messages are passed in the userspace buffer, with message |
| type passed via cmsg. If no cmsg buffer is provided, an error is |
| returned if a control message is received. Data messages may be |
| received without a cmsg buffer set. |
| |
| .. code-block:: c |
| |
|   char buffer[16384]; |
|   char cmsg_buf[CMSG_SPACE(sizeof(unsigned char))]; |
|   struct msghdr msg = {0}; |
|   msg.msg_control = cmsg_buf; |
|   msg.msg_controllen = sizeof(cmsg_buf); |
| |
|   struct iovec msg_iov; |
|   msg_iov.iov_base = buffer; |
|   msg_iov.iov_len = sizeof(buffer); |
| |
|   msg.msg_iov = &msg_iov; |
|   msg.msg_iovlen = 1; |
| |
|   int ret = recvmsg(sock, &msg, 0 /* flags */); |
| |
|   /* CMSG_FIRSTHDR() returns NULL if no control message was received. */ |
|   struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg); |
|   if (cmsg && cmsg->cmsg_level == SOL_TLS && |
|       cmsg->cmsg_type == TLS_GET_RECORD_TYPE) { |
|       int record_type = *((unsigned char *)CMSG_DATA(cmsg)); |
|       // Do something with record_type, and control message data in |
|       // buffer. |
|       // |
|       // Note that record_type may be == to application data (23). |
|   } else { |
|       // Buffer contains application data. |
|   } |
| |
| recv will never return data from mixed types of TLS records. |
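
The record type lookup can be wrapped in a small function that is called
after ``recvmsg()`` returns. This is a sketch: ``ktls_record_type()`` is a
hypothetical helper, and it treats a missing TLS cmsg as application data
(record type 23), as in the example above.

```c
#include <linux/tls.h>
#include <stddef.h>
#include <sys/socket.h>

#ifndef SOL_TLS
#define SOL_TLS 282 /* fallback for libc headers that lack it */
#endif

#define TLS_RECORD_TYPE_APPLICATION_DATA 23

/* Return the record type attached to a message filled in by recvmsg().
 * If no TLS_GET_RECORD_TYPE cmsg is present, assume application data. */
static int ktls_record_type(struct msghdr *msg)
{
    struct cmsghdr *cmsg;

    for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_TLS &&
            cmsg->cmsg_type == TLS_GET_RECORD_TYPE)
            return *(unsigned char *)CMSG_DATA(cmsg);
    }
    return TLS_RECORD_TYPE_APPLICATION_DATA;
}
```

A caller could then switch on the result, for example handing alert records
(21) and handshake records (22) back to the userspace TLS library.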
| |
| Integrating into a userspace TLS library |
| ---------------------------------------- |
| |
| At a high level, the kernel TLS ULP is a replacement for the record |
| layer of a userspace TLS library. |
| |
| A patchset to OpenSSL to use ktls as the record layer is |
| `here <https://github.com/Mellanox/openssl/commits/tls_rx2>`_. |
| |
| `An example <https://github.com/ktls/af_ktls-tool/commits/RX>`_ |
| of calling send directly after a handshake using gnutls is also available. |
| Since it doesn't implement a full record layer, control |
| messages are not supported. |
| |
| Optional optimizations |
| ---------------------- |
| |
| There are certain condition-specific optimizations the TLS ULP can make, |
| if requested. Those optimizations are either not universally beneficial |
| or may impact correctness, hence they require an opt-in. |
| All options are set per-socket using setsockopt(), and their |
| state can be checked using getsockopt() and via socket diag (``ss``). |
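
A minimal sketch of setting and reading such an option follows. The helper
names ``ktls_set_opt()`` and ``ktls_get_opt()`` are hypothetical. The code
assumes the option value is an int-sized 0/1 flag, and the fallback constant
values are assumptions for headers that predate these options.

```c
#include <linux/tls.h>
#include <sys/socket.h>

#ifndef SOL_TLS
#define SOL_TLS 282 /* fallback for libc headers that lack it */
#endif
#ifndef TLS_TX_ZEROCOPY_RO
#define TLS_TX_ZEROCOPY_RO 3 /* assumed value for older linux/tls.h */
#endif
#ifndef TLS_RX_EXPECT_NO_PAD
#define TLS_RX_EXPECT_NO_PAD 4 /* assumed value for older linux/tls.h */
#endif

/* Enable (1) or disable (0) an opt-in optimization on a kTLS socket.
 * Returns the setsockopt() result. */
static int ktls_set_opt(int sock, int optname, int enable)
{
    return setsockopt(sock, SOL_TLS, optname, &enable, sizeof(enable));
}

/* Returns 1 if the option is enabled, 0 if disabled, -1 on error. */
static int ktls_get_opt(int sock, int optname)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(sock, SOL_TLS, optname, &value, &len) < 0)
        return -1;
    return value != 0;
}
```

For example, ``ktls_set_opt(sock, TLS_RX_EXPECT_NO_PAD, 1)`` would request the
receive-side optimization on a socket that already has TLS_RX installed.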
| |
| TLS_TX_ZEROCOPY_RO |
| ~~~~~~~~~~~~~~~~~~ |
| |
| For device offload only. Allow sendfile() data to be transmitted directly |
| to the NIC without making an in-kernel copy. This allows true zero-copy |
| behavior when device offload is enabled. |
| |
| The application must make sure that the data is not modified between being |
| submitted and transmission completing. In other words this is mostly |
| applicable if the data sent on a socket via sendfile() is read-only. |
| |
| Modifying the data may result in different versions of the data being used |
| for the original TCP transmission and TCP retransmissions. To the receiver |
| this will look as if TLS records had been tampered with and will result |
| in record authentication failures. |
| |
| TLS_RX_EXPECT_NO_PAD |
| ~~~~~~~~~~~~~~~~~~~~ |
| |
| TLS 1.3 only. Expect the sender to not pad records. This allows the data |
| to be decrypted directly into user space buffers with TLS 1.3. |
| |
| This optimization is safe to enable only if the remote end is trusted, |
| otherwise it is an attack vector for doubling the TLS processing cost. |
| |
| If the decrypted record turns out to have been padded or is not a data |
| record it will be decrypted again into a kernel buffer without zero copy. |
| Such events are counted in the ``TlsDecryptRetry`` statistic. |
| |
| Statistics |
| ========== |
| |
| The TLS implementation exposes the following per-namespace statistics |
| (``/proc/net/tls_stat``): |
| |
| - ``TlsCurrTxSw``, ``TlsCurrRxSw`` - |
| number of TX and RX sessions currently installed where host handles |
| cryptography |
| |
| - ``TlsCurrTxDevice``, ``TlsCurrRxDevice`` - |
| number of TX and RX sessions currently installed where NIC handles |
| cryptography |
| |
| - ``TlsTxSw``, ``TlsRxSw`` - |
| number of TX and RX sessions opened with host cryptography |
| |
| - ``TlsTxDevice``, ``TlsRxDevice`` - |
| number of TX and RX sessions opened with NIC cryptography |
| |
| - ``TlsDecryptError`` - |
| record decryption failed (e.g. due to incorrect authentication tag) |
| |
| - ``TlsDeviceRxResync`` - |
| number of RX resyncs sent to NICs handling cryptography |
| |
| - ``TlsDecryptRetry`` - |
| number of RX records which had to be re-decrypted due to |
| ``TLS_RX_EXPECT_NO_PAD`` mis-prediction. Note that this counter will |
| also increment for non-data records. |
| |
| - ``TlsRxNoPadViolation`` - |
| number of data RX records which had to be re-decrypted due to |
| ``TLS_RX_EXPECT_NO_PAD`` mis-prediction. |