.. SPDX-License-Identifier: GPL-2.0

===============================
Software Guard eXtensions (SGX)
===============================

Overview
========

Software Guard eXtensions (SGX) hardware enables user space applications to
set aside private memory regions of code and data:

* Privileged (ring-0) ENCLS functions orchestrate the construction of the
  regions.
* Unprivileged (ring-3) ENCLU functions allow an application to enter and
  execute inside the regions.

These memory regions are called enclaves. An enclave can only be entered at a
fixed set of entry points. Each entry point can hold a single hardware thread
at a time. While the enclave is loaded from a regular binary file by using
ENCLS functions, only the threads inside the enclave can access its memory. The
CPU denies access to the region from outside the enclave, and the contents are
encrypted before they leave the last-level cache (LLC).

SGX support can be checked with:

    ``grep sgx /proc/cpuinfo``

SGX must be both supported by the processor and enabled by the BIOS. If SGX
appears to be unsupported on a system which has hardware support, ensure
support is enabled in the BIOS. If a BIOS presents a choice between "Enabled"
and "Software Enabled" modes for SGX, choose "Enabled".
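
The same check can be scripted. The following is a minimal sketch, not part of
the kernel sources: the helper names are hypothetical, and it assumes that
SGX-related CPU features appear in the ``flags`` line as ``sgx`` or as tokens
starting with ``sgx_`` (for example ``sgx_lc``). Unlike a plain ``grep``, it
matches whole flag tokens only.

.. code-block:: python

    def sgx_cpu_flags(cpuinfo_text):
        """Return the SGX-related flags found in /proc/cpuinfo-style text."""
        found = set()
        for line in cpuinfo_text.splitlines():
            key, _, value = line.partition(":")
            if key.strip() == "flags":
                # Match whole tokens so unrelated flags containing "sgx"
                # as a substring are not reported.
                found.update(
                    flag for flag in value.split()
                    if flag == "sgx" or flag.startswith("sgx_")
                )
        return found


    def read_sgx_cpu_flags(path="/proc/cpuinfo"):
        """Read the flags of the running system; empty set if unreadable."""
        try:
            with open(path) as cpuinfo:
                return sgx_cpu_flags(cpuinfo.read())
        except OSError:
            return set()

An empty result on hardware that is known to support SGX usually points to
the BIOS setting described above.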

Enclave Page Cache
==================

SGX utilizes an *Enclave Page Cache (EPC)* to store pages that are associated
with an enclave. It is contained in a BIOS-reserved region of physical memory.
Unlike pages used for regular memory, EPC pages can only be accessed from
outside of the enclave during enclave construction with special, limited SGX
instructions.

Only a CPU executing inside an enclave can directly access enclave memory.
However, a CPU executing inside an enclave may access normal memory outside the
enclave.

The kernel manages enclave memory similarly to how it treats device memory.

Enclave Page Types
------------------

**SGX Enclave Control Structure (SECS)**
   An enclave's address range, attributes and other global data are defined
   by this structure.

**Regular (REG)**
   Regular EPC pages contain the code and data of an enclave.

**Thread Control Structure (TCS)**
   Thread Control Structure pages define the entry points to an enclave and
   track the execution state of an enclave thread.

**Version Array (VA)**
   Version Array pages contain 512 slots, each of which can contain a version
   number for a page evicted from the EPC.
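
The slot count follows from the page size. The sketch below works through the
arithmetic under the assumption that an EPC page is 4096 bytes and that each
version slot is 8 bytes wide; the function name is hypothetical and exists only
for illustration.

.. code-block:: python

    PAGE_SIZE = 4096       # bytes in one EPC page (assumed)
    VA_SLOT_SIZE = 8       # bytes per version slot (assumed: 4096 / 512)
    VA_SLOTS_PER_PAGE = PAGE_SIZE // VA_SLOT_SIZE


    def va_pages_needed(evicted_pages):
        """Number of VA pages required to hold versions for evicted pages."""
        if evicted_pages < 0:
            raise ValueError("page count cannot be negative")
        # Ceiling division: a partly used VA page still occupies a full page.
        return -(-evicted_pages // VA_SLOTS_PER_PAGE)

Because VA pages live in the EPC themselves, evicting pages always costs a
small amount of EPC: one VA page for every 512 evicted pages.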

Enclave Page Cache Map
----------------------

The processor tracks EPC pages in a hardware metadata structure called the
*Enclave Page Cache Map (EPCM)*. The EPCM contains an entry for each EPC page
which describes the owning enclave, access rights and page type, among other
things.

EPCM permissions are separate from the normal page tables. This prevents the
kernel from, for instance, allowing writes to data which an enclave wishes to
remain read-only. EPCM permissions may only impose additional restrictions on
top of normal x86 page permissions.
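
One way to picture this rule is as an intersection of two permission sets. The
following is an illustrative model only: the ``Perm`` type and
``effective_access()`` are hypothetical names, and real hardware evaluates
these checks during address translation rather than in software.

.. code-block:: python

    import enum


    class Perm(enum.Flag):
        NONE = 0
        R = 1
        W = 2
        X = 4


    def effective_access(page_table_perm, epcm_perm):
        """An access is allowed only if both the page tables and EPCM allow it."""
        return page_table_perm & epcm_perm

For example, a page that the kernel maps read-write but that the enclave
declared read-only stays read-only, and a writable EPCM entry cannot widen a
read-only page table mapping.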

For all intents and purposes, the SGX architecture allows the processor to
invalidate all EPCM entries at will. This requires that software be prepared to
handle an EPCM fault at any time. In practice, this can happen on events like
power transitions when the ephemeral key that encrypts enclave memory is lost.

Application interface
=====================

Enclave build functions
-----------------------

In addition to the traditional compiler and linker build process, SGX has a
separate enclave “build” process. Enclaves must be built before they can be
executed (entered). The first step in building an enclave is opening the
**/dev/sgx_enclave** device. Since enclave memory is protected from direct
access, special privileged instructions are then used to copy data into enclave
pages and establish enclave page permissions.

.. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c
   :functions: sgx_ioc_enclave_create
               sgx_ioc_enclave_add_pages
               sgx_ioc_enclave_init
               sgx_ioc_enclave_provision
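
For readers working without the C headers, the sketch below derives the ioctl
request numbers behind these functions. It is an illustration under stated
assumptions, not a substitute for ``arch/x86/include/uapi/asm/sgx.h``: the
magic value ``0xA4``, the command numbers and the argument layouts are
assumptions about that header, and the ``_ioc()`` helper mirrors the generic
Linux ioctl encoding used on x86.

.. code-block:: python

    import struct

    # Generic Linux ioctl number encoding as used on x86.
    _IOC_NRSHIFT, _IOC_TYPESHIFT, _IOC_SIZESHIFT, _IOC_DIRSHIFT = 0, 8, 16, 30
    _IOC_WRITE, _IOC_READ = 1, 2

    SGX_MAGIC = 0xA4  # assumed value from the uapi header

    # Argument layouts (assumed): every field is a 64-bit unsigned integer.
    ENCLAVE_CREATE_FMT = "=Q"      # src
    ENCLAVE_ADD_PAGES_FMT = "=6Q"  # src, offset, length, secinfo, flags, count
    ENCLAVE_INIT_FMT = "=Q"        # sigstruct
    ENCLAVE_PROVISION_FMT = "=Q"   # fd


    def _ioc(direction, nr, fmt):
        """Encode an ioctl request number for an argument with layout fmt."""
        size = struct.calcsize(fmt)
        return ((direction << _IOC_DIRSHIFT) | (size << _IOC_SIZESHIFT)
                | (SGX_MAGIC << _IOC_TYPESHIFT) | (nr << _IOC_NRSHIFT))


    SGX_IOC_ENCLAVE_CREATE = _ioc(_IOC_WRITE, 0x00, ENCLAVE_CREATE_FMT)
    SGX_IOC_ENCLAVE_ADD_PAGES = _ioc(_IOC_WRITE | _IOC_READ, 0x01,
                                     ENCLAVE_ADD_PAGES_FMT)
    SGX_IOC_ENCLAVE_INIT = _ioc(_IOC_WRITE, 0x02, ENCLAVE_INIT_FMT)
    SGX_IOC_ENCLAVE_PROVISION = _ioc(_IOC_WRITE, 0x03, ENCLAVE_PROVISION_FMT)

A build sequence would open ``/dev/sgx_enclave`` and then issue the create,
add-pages and init requests in that order on the resulting file descriptor.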

Enclave vDSO
------------

Entering an enclave can only be done through SGX-specific EENTER and ERESUME
functions, and is a non-trivial process. Because of the complexity of
transitioning to and from an enclave, enclaves typically utilize a library to
handle the actual transitions. This is roughly analogous to how glibc
implementations are used by most applications to wrap system calls.

Another crucial characteristic of enclaves is that they can generate exceptions
as part of their normal operation that need to be handled in the enclave or are
unique to SGX.

Instead of the traditional signal mechanism to handle these exceptions, SGX
can leverage special exception fixup provided by the vDSO. The kernel-provided
vDSO function wraps low-level transitions to/from the enclave like EENTER and
ERESUME. The vDSO function intercepts exceptions that would otherwise generate
a signal and returns the fault information directly to its caller. This avoids
the need to juggle signal handlers.

.. kernel-doc:: arch/x86/include/uapi/asm/sgx.h
   :functions: vdso_sgx_enter_enclave_t

ksgxd
=====

SGX support includes a kernel thread called *ksgxd*.

EPC sanitization
----------------

ksgxd is started when SGX initializes. Enclave memory is typically ready
for use when the processor powers on or resets. However, if SGX has been in
use since the reset, enclave pages may be in an inconsistent state. This might
occur after a crash and kexec() cycle, for instance. At boot, ksgxd
reinitializes all enclave pages so that they can be allocated and re-used.

The sanitization is done by going through EPC address space and applying the
EREMOVE function to each physical page. Some enclave pages like SECS pages have
hardware dependencies on other pages which prevents EREMOVE from functioning.
Executing two EREMOVE passes removes the dependencies.
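
The reason two passes suffice can be shown with a toy model. The code below is
not the kernel implementation: ``try_eremove()`` and ``sanitize()`` are
hypothetical names, and the only hardware rule modeled is that a SECS page
cannot be removed while it still has valid child pages.

.. code-block:: python

    def try_eremove(page_id, epc):
        """Toy stand-in for EREMOVE: refuse a SECS page that has children."""
        page = epc[page_id]
        has_children = any(
            other["secs"] == page_id and other["valid"]
            for other in epc.values()
        )
        if page["type"] == "SECS" and has_children:
            return False
        page["valid"] = False
        return True


    def sanitize(epc, passes=2):
        """Apply EREMOVE to every page; return ids that are still valid."""
        dirty = [page_id for page_id, page in epc.items() if page["valid"]]
        for _ in range(passes):
            # Keep only the pages whose removal failed in this pass.
            dirty = [pid for pid in dirty if not try_eremove(pid, epc)]
        return dirty

In this model the first pass removes all child pages and leaves the SECS pages
behind; the second pass then finds the SECS pages without children and removes
them as well.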

Page reclaimer
--------------

Similar to the core kswapd, ksgxd is responsible for managing the
overcommitment of enclave memory. If the system runs out of enclave memory,
*ksgxd* “swaps” enclave memory to normal memory.
Jarkko Sakkinen | 3fa97bf | 2020-11-13 00:01:34 +0200 | [diff] [blame] | 151 | |
| 152 | Launch Control |
| 153 | ============== |
| 154 | |
| 155 | SGX provides a launch control mechanism. After all enclave pages have been |
| 156 | copied, kernel executes EINIT function, which initializes the enclave. Only after |
| 157 | this the CPU can execute inside the enclave. |
| 158 | |
Reinette Chatre | 379e4de | 2021-10-29 10:49:56 -0700 | [diff] [blame] | 159 | EINIT function takes an RSA-3072 signature of the enclave measurement. The function |
Jarkko Sakkinen | 3fa97bf | 2020-11-13 00:01:34 +0200 | [diff] [blame] | 160 | checks that the measurement is correct and signature is signed with the key |
| 161 | hashed to the four **IA32_SGXLEPUBKEYHASH{0, 1, 2, 3}** MSRs representing the |
| 162 | SHA256 of a public key. |
| 163 | |
| 164 | Those MSRs can be configured by the BIOS to be either readable or writable. |
| 165 | Linux supports only writable configuration in order to give full control to the |
| 166 | kernel on launch control policy. Before calling EINIT function, the driver sets |
| 167 | the MSRs to match the enclave's signing key. |
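
As a rough illustration of how a 256-bit hash maps onto four 64-bit MSRs, the
sketch below splits a SHA256 digest into four words. It rests on assumptions
that are not stated in this document: that the hash is taken over the 384-byte
RSA-3072 modulus as stored in the enclave signature structure, and that the
digest is loaded as little-endian 64-bit words. The function name is
hypothetical.

.. code-block:: python

    import hashlib
    import struct

    MODULUS_SIZE = 384  # bytes in an RSA-3072 modulus


    def lepubkeyhash_words(modulus):
        """Split SHA-256(modulus) into four 64-bit words, one per MSR."""
        if len(modulus) != MODULUS_SIZE:
            raise ValueError("expected a 384-byte RSA-3072 modulus")
        digest = hashlib.sha256(modulus).digest()
        # Assumed byte order: four little-endian unsigned 64-bit integers.
        return struct.unpack("<4Q", digest)

Word ``i`` of the result corresponds to ``IA32_SGXLEPUBKEYHASHi`` under these
assumptions.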

Encryption engines
==================

In order to conceal the enclave data while it is out of the CPU package, the
memory controller has an encryption engine to transparently encrypt and decrypt
enclave memory.

In CPUs prior to Ice Lake, the Memory Encryption Engine (MEE) is used to
encrypt pages leaving the CPU caches. MEE uses an n-ary Merkle tree with root in
SRAM to maintain integrity of the encrypted data. This provides integrity and
anti-replay protection but does not scale to large memory sizes because the time
required to update the Merkle tree grows logarithmically in relation to the
memory size.
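
The growth can be made concrete with a small calculation. The parameters
below (64-byte protected blocks and a tree arity of 8) are hypothetical values
chosen for illustration and are not claimed to be the MEE's actual
configuration; the function only counts how many tree levels sit above the
leaves.

.. code-block:: python

    def tree_levels(protected_bytes, block_bytes=64, arity=8):
        """Levels above the leaves in an n-ary tree covering the memory."""
        leaves = -(-protected_bytes // block_bytes)  # ceiling division
        levels = 0
        while leaves > 1:
            # Each level up groups `arity` nodes under one parent.
            leaves = -(-leaves // arity)
            levels += 1
        return levels

With these assumed parameters, 128 MiB of protected memory needs 7 levels and
512 GiB needs 11, so every write to a large protected region has to update a
longer chain of tree nodes.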

CPUs starting from Ice Lake use Total Memory Encryption (TME) in place of
MEE. TME-based SGX implementations do not have an integrity Merkle tree, which
means integrity and replay attacks are not mitigated. However, TME includes
additional changes to prevent cipher text from being returned and SW memory
aliases from being created.

DMA to enclave memory is blocked by range registers on both MEE and TME systems
(SDM section 41.10).

Usage Models
============

Shared Library
--------------

Sensitive data and the code that acts on it are partitioned from the
application into a separate library. The library is then linked as a DSO which
can be loaded into an enclave. The application can then make individual
function calls into the enclave through special SGX instructions. A run-time
within the enclave is configured to marshal function parameters into and out of
the enclave and to call the correct library function.

Application Container
---------------------

An application may be loaded into a container enclave which is specially
configured with a library OS and run-time which permits the application to run.
The enclave run-time and library OS work together to execute the application
when a thread enters the enclave.

Impact of Potential Kernel SGX Bugs
===================================

EPC leaks
---------

When EPC page leaks happen, a WARNING like this is shown in dmesg:

"EREMOVE returned ... and an EPC page was leaked. SGX may become unusable..."

This is effectively a kernel use-after-free of an EPC page, and due
to the way SGX works, the bug is detected at freeing. Rather than
adding the page back to the pool of available EPC pages, the kernel
intentionally leaks the page to avoid additional errors in the future.

When this happens, the kernel will likely soon leak more EPC pages, and
SGX will likely become unusable because the memory available to SGX is
limited. However, while this may be fatal to SGX, the rest of the kernel
is unlikely to be impacted and should continue to work.

As a result, when this happens, users should stop running any new
SGX workloads (or any new workloads at all) and migrate all valuable
workloads. Although a machine reboot can recover all EPC memory, the bug
should be reported to Linux developers.


Virtual EPC
===========

The implementation also has a virtual EPC driver to support SGX enclaves
in guests. Unlike the SGX driver, an EPC page allocated by the virtual
EPC driver doesn't have a specific enclave associated with it. This is
because KVM doesn't track how a guest uses EPC pages.

As a result, the SGX core page reclaimer doesn't support reclaiming EPC
pages allocated to KVM guests through the virtual EPC driver. If the
user wants to deploy SGX applications both on the host and in guests
on the same machine, the user should reserve enough EPC (by subtracting the
total virtual EPC size of all SGX VMs from the physical EPC size) for
host SGX applications so they can run with acceptable performance.
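
The reservation is a simple subtraction, sketched here for planning purposes.
The function name and the example sizes are hypothetical; the sketch only
restates the rule above and does not query any kernel interface.

.. code-block:: python

    MIB = 2**20


    def host_epc_bytes(physical_epc, guest_vepc_sizes):
        """EPC left for host enclaves once every guest uses its full share.

        A negative result means the guests together could claim more EPC
        than physically exists.
        """
        return physical_epc - sum(guest_vepc_sizes)

For instance, with 256 MiB of physical EPC and two guests configured with
64 MiB of virtual EPC each, 128 MiB remain for host enclaves.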

The architectural behavior is to restore all EPC pages to an uninitialized
state after a guest reboot as well. Because this state can be reached only
through the privileged ``ENCLS[EREMOVE]`` instruction, ``/dev/sgx_vepc``
provides the ``SGX_IOC_VEPC_REMOVE_ALL`` ioctl to execute the instruction
on all pages in the virtual EPC.

``EREMOVE`` can fail for three reasons. Userspace must pay attention
to expected failures and handle them as follows:

1. Page removal will always fail when any thread is running in the
   enclave to which the page belongs. In this case the ioctl will
   return ``EBUSY`` regardless of whether it has successfully removed
   some pages; userspace can avoid these failures by preventing execution
   of any vCPU which maps the virtual EPC.

2. Page removal will cause a general protection fault if two calls to
   ``EREMOVE`` happen concurrently for pages that refer to the same
   "SECS" metadata pages. This can happen if there are concurrent
   invocations to ``SGX_IOC_VEPC_REMOVE_ALL``, or if a ``/dev/sgx_vepc``
   file descriptor in the guest is closed at the same time as
   ``SGX_IOC_VEPC_REMOVE_ALL``; it will also be reported as ``EBUSY``.
   This can be avoided in userspace by serializing calls to the ioctl()
   and to close(), but in general it should not be a problem.

3. Finally, page removal will fail for SECS metadata pages which still
   have child pages. Child pages can be removed by executing
   ``SGX_IOC_VEPC_REMOVE_ALL`` on all ``/dev/sgx_vepc`` file descriptors
   mapped into the guest. This means that the ioctl() must be called
   twice: an initial set of calls to remove child pages and a subsequent
   set of calls to remove SECS pages. The second set of calls is only
   required for those mappings that returned a nonzero value from the
   first call. It indicates a bug in the kernel or the userspace client
   if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
   a return code other than 0.
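
The two-round procedure from item 3 can be sketched as follows. This is an
illustration, not a reference client: the function names are hypothetical, the
request value ``0xA404`` is an assumption about how
``SGX_IOC_VEPC_REMOVE_ALL`` is encoded, and the caller is expected to have
stopped all vCPUs and serialized the calls as described in items 1 and 2. An
``EBUSY`` failure surfaces as ``OSError`` and is left to the caller.

.. code-block:: python

    import fcntl

    # Assumed encoding of SGX_IOC_VEPC_REMOVE_ALL: _IO(0xA4, 0x04).
    SGX_IOC_VEPC_REMOVE_ALL = 0xA404


    def vepc_remove_all(fd):
        """Issue the ioctl on one /dev/sgx_vepc descriptor.

        Returns the ioctl's return value; nonzero means pages remain.
        """
        return fcntl.ioctl(fd, SGX_IOC_VEPC_REMOVE_ALL)


    def reset_guest_epc(fds, remove_all=vepc_remove_all):
        """Run both rounds; return the descriptors that needed a retry."""
        # Round 1: remove child pages on every descriptor before retrying.
        retry = [fd for fd in fds if remove_all(fd) != 0]
        # Round 2: only the SECS pages should be left, now without children.
        for fd in retry:
            if remove_all(fd) != 0:
                raise RuntimeError(
                    "second SGX_IOC_VEPC_REMOVE_ALL round failed for fd %d"
                    % fd
                )
        return retry

The ``remove_all`` parameter exists so that the control flow can be exercised
without SGX hardware by passing a stand-in function.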