| .. SPDX-License-Identifier: GPL-2.0 |
| |
| =============================== |
| Software Guard eXtensions (SGX) |
| =============================== |
| |
| Overview |
| ======== |
| |
| Software Guard eXtensions (SGX) hardware enables for user space applications |
| to set aside private memory regions of code and data: |
| |
| * Privileged (ring-0) ENCLS functions orchestrate the construction of the |
| regions. |
| * Unprivileged (ring-3) ENCLU functions allow an application to enter and |
| execute inside the regions. |
| |
| These memory regions are called enclaves. An enclave can be only entered at a |
| fixed set of entry points. Each entry point can hold a single hardware thread |
| at a time. While the enclave is loaded from a regular binary file by using |
| ENCLS functions, only the threads inside the enclave can access its memory. The |
| region is denied from outside access by the CPU, and encrypted before it leaves |
| from LLC. |
| |
| The support can be determined by |
| |
| ``grep sgx /proc/cpuinfo`` |
| |
| SGX must both be supported in the processor and enabled by the BIOS. If SGX |
| appears to be unsupported on a system which has hardware support, ensure |
| support is enabled in the BIOS. If a BIOS presents a choice between "Enabled" |
| and "Software Enabled" modes for SGX, choose "Enabled". |
| |
| Enclave Page Cache |
| ================== |
| |
| SGX utilizes an *Enclave Page Cache (EPC)* to store pages that are associated |
| with an enclave. It is contained in a BIOS-reserved region of physical memory. |
| Unlike pages used for regular memory, pages can only be accessed from outside of |
| the enclave during enclave construction with special, limited SGX instructions. |
| |
| Only a CPU executing inside an enclave can directly access enclave memory. |
| However, a CPU executing inside an enclave may access normal memory outside the |
| enclave. |
| |
| The kernel manages enclave memory similar to how it treats device memory. |
| |
| Enclave Page Types |
| ------------------ |
| |
| **SGX Enclave Control Structure (SECS)** |
| Enclave's address range, attributes and other global data are defined |
| by this structure. |
| |
| **Regular (REG)** |
| Regular EPC pages contain the code and data of an enclave. |
| |
| **Thread Control Structure (TCS)** |
| Thread Control Structure pages define the entry points to an enclave and |
| track the execution state of an enclave thread. |
| |
| **Version Array (VA)** |
| Version Array pages contain 512 slots, each of which can contain a version |
| number for a page evicted from the EPC. |
| |
| Enclave Page Cache Map |
| ---------------------- |
| |
| The processor tracks EPC pages in a hardware metadata structure called the |
| *Enclave Page Cache Map (EPCM)*. The EPCM contains an entry for each EPC page |
| which describes the owning enclave, access rights and page type among the other |
| things. |
| |
| EPCM permissions are separate from the normal page tables. This prevents the |
| kernel from, for instance, allowing writes to data which an enclave wishes to |
| remain read-only. EPCM permissions may only impose additional restrictions on |
| top of normal x86 page permissions. |
| |
| For all intents and purposes, the SGX architecture allows the processor to |
| invalidate all EPCM entries at will. This requires that software be prepared to |
| handle an EPCM fault at any time. In practice, this can happen on events like |
| power transitions when the ephemeral key that encrypts enclave memory is lost. |
| |
| Application interface |
| ===================== |
| |
| Enclave build functions |
| ----------------------- |
| |
| In addition to the traditional compiler and linker build process, SGX has a |
| separate enclave “build” process. Enclaves must be built before they can be |
| executed (entered). The first step in building an enclave is opening the |
| **/dev/sgx_enclave** device. Since enclave memory is protected from direct |
| access, special privileged instructions are then used to copy data into enclave |
| pages and establish enclave page permissions. |
| |
| .. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c |
| :functions: sgx_ioc_enclave_create |
| sgx_ioc_enclave_add_pages |
| sgx_ioc_enclave_init |
| sgx_ioc_enclave_provision |
| |
| Enclave vDSO |
| ------------ |
| |
| Entering an enclave can only be done through SGX-specific EENTER and ERESUME |
| functions, and is a non-trivial process. Because of the complexity of |
| transitioning to and from an enclave, enclaves typically utilize a library to |
| handle the actual transitions. This is roughly analogous to how glibc |
| implementations are used by most applications to wrap system calls. |
| |
| Another crucial characteristic of enclaves is that they can generate exceptions |
| as part of their normal operation that need to be handled in the enclave or are |
| unique to SGX. |
| |
| Instead of the traditional signal mechanism to handle these exceptions, SGX |
| can leverage special exception fixup provided by the vDSO. The kernel-provided |
| vDSO function wraps low-level transitions to/from the enclave like EENTER and |
| ERESUME. The vDSO function intercepts exceptions that would otherwise generate |
| a signal and return the fault information directly to its caller. This avoids |
| the need to juggle signal handlers. |
| |
| .. kernel-doc:: arch/x86/include/uapi/asm/sgx.h |
| :functions: vdso_sgx_enter_enclave_t |
| |
| ksgxd |
| ===== |
| |
| SGX support includes a kernel thread called *ksgxd*. |
| |
| EPC sanitization |
| ---------------- |
| |
| ksgxd is started when SGX initializes. Enclave memory is typically ready |
| for use when the processor powers on or resets. However, if SGX has been in |
| use since the reset, enclave pages may be in an inconsistent state. This might |
| occur after a crash and kexec() cycle, for instance. At boot, ksgxd |
| reinitializes all enclave pages so that they can be allocated and re-used. |
| |
| The sanitization is done by going through EPC address space and applying the |
| EREMOVE function to each physical page. Some enclave pages like SECS pages have |
| hardware dependencies on other pages which prevents EREMOVE from functioning. |
| Executing two EREMOVE passes removes the dependencies. |
| |
| Page reclaimer |
| -------------- |
| |
| Similar to the core kswapd, ksgxd, is responsible for managing the |
| overcommitment of enclave memory. If the system runs out of enclave memory, |
| *ksgxd* “swaps” enclave memory to normal memory. |
| |
| Launch Control |
| ============== |
| |
| SGX provides a launch control mechanism. After all enclave pages have been |
| copied, kernel executes EINIT function, which initializes the enclave. Only after |
| this the CPU can execute inside the enclave. |
| |
| EINIT function takes an RSA-3072 signature of the enclave measurement. The function |
| checks that the measurement is correct and signature is signed with the key |
| hashed to the four **IA32_SGXLEPUBKEYHASH{0, 1, 2, 3}** MSRs representing the |
| SHA256 of a public key. |
| |
| Those MSRs can be configured by the BIOS to be either readable or writable. |
| Linux supports only writable configuration in order to give full control to the |
| kernel on launch control policy. Before calling EINIT function, the driver sets |
| the MSRs to match the enclave's signing key. |
| |
| Encryption engines |
| ================== |
| |
| In order to conceal the enclave data while it is out of the CPU package, the |
| memory controller has an encryption engine to transparently encrypt and decrypt |
| enclave memory. |
| |
| In CPUs prior to Ice Lake, the Memory Encryption Engine (MEE) is used to |
| encrypt pages leaving the CPU caches. MEE uses a n-ary Merkle tree with root in |
| SRAM to maintain integrity of the encrypted data. This provides integrity and |
| anti-replay protection but does not scale to large memory sizes because the time |
| required to update the Merkle tree grows logarithmically in relation to the |
| memory size. |
| |
| CPUs starting from Icelake use Total Memory Encryption (TME) in the place of |
| MEE. TME-based SGX implementations do not have an integrity Merkle tree, which |
| means integrity and replay-attacks are not mitigated. B, it includes |
| additional changes to prevent cipher text from being returned and SW memory |
| aliases from being created. |
| |
| DMA to enclave memory is blocked by range registers on both MEE and TME systems |
| (SDM section 41.10). |
| |
| Usage Models |
| ============ |
| |
| Shared Library |
| -------------- |
| |
| Sensitive data and the code that acts on it is partitioned from the application |
| into a separate library. The library is then linked as a DSO which can be loaded |
| into an enclave. The application can then make individual function calls into |
| the enclave through special SGX instructions. A run-time within the enclave is |
| configured to marshal function parameters into and out of the enclave and to |
| call the correct library function. |
| |
| Application Container |
| --------------------- |
| |
| An application may be loaded into a container enclave which is specially |
| configured with a library OS and run-time which permits the application to run. |
| The enclave run-time and library OS work together to execute the application |
| when a thread enters the enclave. |
| |
| Impact of Potential Kernel SGX Bugs |
| =================================== |
| |
| EPC leaks |
| --------- |
| |
| When EPC page leaks happen, a WARNING like this is shown in dmesg: |
| |
| "EREMOVE returned ... and an EPC page was leaked. SGX may become unusable..." |
| |
| This is effectively a kernel use-after-free of an EPC page, and due |
| to the way SGX works, the bug is detected at freeing. Rather than |
| adding the page back to the pool of available EPC pages, the kernel |
| intentionally leaks the page to avoid additional errors in the future. |
| |
| When this happens, the kernel will likely soon leak more EPC pages, and |
| SGX will likely become unusable because the memory available to SGX is |
| limited. However, while this may be fatal to SGX, the rest of the kernel |
| is unlikely to be impacted and should continue to work. |
| |
| As a result, when this happpens, user should stop running any new |
| SGX workloads, (or just any new workloads), and migrate all valuable |
| workloads. Although a machine reboot can recover all EPC memory, the bug |
| should be reported to Linux developers. |
| |
| |
| Virtual EPC |
| =========== |
| |
| The implementation has also a virtual EPC driver to support SGX enclaves |
| in guests. Unlike the SGX driver, an EPC page allocated by the virtual |
| EPC driver doesn't have a specific enclave associated with it. This is |
| because KVM doesn't track how a guest uses EPC pages. |
| |
| As a result, the SGX core page reclaimer doesn't support reclaiming EPC |
| pages allocated to KVM guests through the virtual EPC driver. If the |
| user wants to deploy SGX applications both on the host and in guests |
| on the same machine, the user should reserve enough EPC (by taking out |
| total virtual EPC size of all SGX VMs from the physical EPC size) for |
| host SGX applications so they can run with acceptable performance. |
| |
| Architectural behavior is to restore all EPC pages to an uninitialized |
| state also after a guest reboot. Because this state can be reached only |
| through the privileged ``ENCLS[EREMOVE]`` instruction, ``/dev/sgx_vepc`` |
| provides the ``SGX_IOC_VEPC_REMOVE_ALL`` ioctl to execute the instruction |
| on all pages in the virtual EPC. |
| |
| ``EREMOVE`` can fail for three reasons. Userspace must pay attention |
| to expected failures and handle them as follows: |
| |
| 1. Page removal will always fail when any thread is running in the |
| enclave to which the page belongs. In this case the ioctl will |
| return ``EBUSY`` independent of whether it has successfully removed |
| some pages; userspace can avoid these failures by preventing execution |
| of any vcpu which maps the virtual EPC. |
| |
| 2. Page removal will cause a general protection fault if two calls to |
| ``EREMOVE`` happen concurrently for pages that refer to the same |
| "SECS" metadata pages. This can happen if there are concurrent |
| invocations to ``SGX_IOC_VEPC_REMOVE_ALL``, or if a ``/dev/sgx_vepc`` |
| file descriptor in the guest is closed at the same time as |
| ``SGX_IOC_VEPC_REMOVE_ALL``; it will also be reported as ``EBUSY``. |
| This can be avoided in userspace by serializing calls to the ioctl() |
| and to close(), but in general it should not be a problem. |
| |
| 3. Finally, page removal will fail for SECS metadata pages which still |
| have child pages. Child pages can be removed by executing |
| ``SGX_IOC_VEPC_REMOVE_ALL`` on all ``/dev/sgx_vepc`` file descriptors |
| mapped into the guest. This means that the ioctl() must be called |
| twice: an initial set of calls to remove child pages and a subsequent |
| set of calls to remove SECS pages. The second set of calls is only |
| required for those mappings that returned a nonzero value from the |
| first call. It indicates a bug in the kernel or the userspace client |
| if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has |
| a return code other than 0. |