SECure COMPuting with filters
=============================

Introduction
------------

A large number of system calls are exposed to every userland process,
with many of them going unused for the entire lifetime of the process.
As system calls change and mature, bugs are found and eradicated. A
certain subset of userland applications benefit from having a reduced
set of available system calls. The resulting set reduces the total kernel
surface exposed to the application. System call filtering is meant for
use with those applications.

Seccomp filtering provides a means for a process to specify a filter for
incoming system calls. The filter is expressed as a Berkeley Packet
Filter (BPF) program, as with socket filters, except that the data
operated on is related to the system call being made: the system call
number and the system call arguments. This allows for expressive
filtering of system calls using a filter program language with a long
history of being exposed to userland and a straightforward data set.
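
As a minimal sketch of what that data set looks like from a filter's
point of view, the C fragment below names the offsets of the system call
number, the architecture value, and the arguments within struct
seccomp_data, and builds one BPF instruction that loads the system call
number. The macro and variable names are illustrative only and are not
part of any kernel interface; the headers are the Linux UAPI headers.

```c
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Offsets into struct seccomp_data that a filter can load from. */
#define SYSCALL_NR_OFFSET  offsetof(struct seccomp_data, nr)
#define ARCH_OFFSET        offsetof(struct seccomp_data, arch)

/*
 * Arguments are 64 bits wide, but classic BPF loads 32-bit words.
 * This offset addresses the low word of argument n and assumes a
 * little-endian machine.
 */
#define ARG_LO_OFFSET(n) \
	(offsetof(struct seccomp_data, args) + (n) * sizeof(__u64))

/* One instruction: load the system call number into the accumulator. */
static const struct sock_filter load_syscall_nr =
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SYSCALL_NR_OFFSET);
```

Note that a filter only ever sees these raw values; a pointer argument
is just a number to it, never the memory it points to.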

Additionally, BPF makes it impossible for users of seccomp to fall prey
to time-of-check-time-of-use (TOCTOU) attacks that are common in system
call interposition frameworks. BPF programs may not dereference
pointers, which constrains all filters to evaluating the system call
arguments directly.

What it isn't
-------------

System call filtering isn't a sandbox. It provides a clearly defined
mechanism for minimizing the exposed kernel surface. It is meant to be
a tool for sandbox developers to use. Beyond that, policy for logical
behavior and information flow should be managed with a combination of
other system hardening techniques and, potentially, an LSM of your
choosing. Expressive, dynamic filters provide further options down this
path (avoiding pathological sizes or selecting which of the multiplexed
system calls in socketcall() is allowed, for instance), which could be
construed, incorrectly, as a more complete sandboxing solution.

Usage
-----

An additional seccomp mode is added and is enabled using the same
prctl(2) call as strict seccomp mode. If the architecture has
CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below:

PR_SET_SECCOMP:
        Now takes an additional argument which specifies a new filter
        using a BPF program.
        The BPF program will be executed over struct seccomp_data
        reflecting the system call number, arguments, and other
        metadata. The BPF program must then return one of the
        acceptable values to inform the kernel which action should be
        taken.

        Usage:
                prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);

        The 'prog' argument is a pointer to a struct sock_fprog which
        will contain the filter program. If the program is invalid, the
        call will return -1 and set errno to EINVAL.

        If fork/clone and execve are allowed by @prog, any child
        processes will be constrained to the same filters and system
        call ABI as the parent.

        Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
        run with CAP_SYS_ADMIN privileges in its namespace. If neither
        condition is met, the call fails with errno set to EACCES. This
        requirement ensures that filter programs cannot be applied to
        child processes with greater privileges than the task that
        installed them.

        Additionally, if prctl(2) is allowed by the attached filter,
        additional filters may be layered on, which will increase
        evaluation time but allow for further decreasing the attack
        surface during execution of a process.

        The above call returns 0 on success and non-zero on error.
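
The steps above can be sketched in C as follows. This is an
illustration under stated assumptions, not a reference implementation:
install_filter() and allow_all are hypothetical names, and the program
shown is the smallest useful one, allowing every system call.

```c
#include <sys/prctl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Smallest valid program: allow every system call. */
static struct sock_filter allow_all[] = {
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/*
 * Hypothetical helper: set no_new_privs, then attach the program.
 * Returns 0 on success, or -1 with errno set by prctl(2).
 */
static int install_filter(struct sock_filter *insns, unsigned short len)
{
	struct sock_fprog prog = {
		.len = len,
		.filter = insns,
	};

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

A caller would typically pass allow_all together with its element
count, sizeof(allow_all) / sizeof(allow_all[0]), and treat any non-zero
return as a reason not to continue unsandboxed.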

Return values
-------------

A seccomp filter may return any of the following values. If multiple
filters exist, the return value for the evaluation of a given system
call will always use the highest-precedence value. (For example,
SECCOMP_RET_KILL will always take precedence.)

In precedence order, they are:

SECCOMP_RET_KILL:
        Results in the task exiting immediately without executing the
        system call. The exit status of the task (status & 0x7f) will
        be SIGSYS, not SIGKILL.

SECCOMP_RET_TRAP:
        Results in the kernel sending a SIGSYS signal to the triggering
        task without executing the system call. siginfo->si_call_addr
        will show the address of the system call instruction, and
        siginfo->si_syscall and siginfo->si_arch will indicate which
        syscall was attempted. The program counter will be as though
        the syscall happened (i.e. it will not point to the syscall
        instruction). The return value register will contain an
        arch-dependent value -- if resuming execution, set it to
        something sensible. (The architecture dependency is because
        replacing it with -ENOSYS could overwrite some useful
        information.)

        The SECCOMP_RET_DATA portion of the return value will be passed
        as si_errno.

        SIGSYS triggered by seccomp will have a si_code of SYS_SECCOMP.

SECCOMP_RET_ERRNO:
        Results in the lower 16 bits of the return value being passed
        to userland as the errno without executing the system call.

SECCOMP_RET_TRACE:
        When returned, this value will cause the kernel to attempt to
        notify a ptrace()-based tracer prior to executing the system
        call. If there is no tracer present, -ENOSYS is returned to
        userland and the system call is not executed.

        A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
        using ptrace(PTRACE_SETOPTIONS). The tracer will be notified
        of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of
        the BPF program return value will be available to the tracer
        via PTRACE_GETEVENTMSG.

        The tracer can skip the system call by changing the syscall
        number to -1. Alternatively, the tracer can change the system
        call requested by changing the system call to a valid syscall
        number. If the tracer asks to skip the system call, then the
        system call will appear to return the value that the tracer
        puts in the return value register.

        The seccomp check will not be run again after the tracer is
        notified. (This means that seccomp-based sandboxes MUST NOT
        allow use of ptrace, even of other sandboxed processes, without
        extreme care; ptracers can use this mechanism to escape.)

SECCOMP_RET_ALLOW:
        Results in the system call being executed.

If multiple filters exist, the return value for the evaluation of a
given system call will always use the highest-precedence value.

Precedence is only determined using the SECCOMP_RET_ACTION mask. When
multiple filters return values of the same precedence, only the
SECCOMP_RET_DATA from the most recently installed filter will be
returned.
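
The combination rule can be modelled in userspace as a small C function.
This is a sketch for reasoning about filter stacks, not kernel code. It
assumes that, for the five actions listed above, a numerically smaller
action value in <linux/seccomp.h> means higher precedence; the function
name combine_results() is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/*
 * results[0] comes from the oldest filter and results[count - 1] from
 * the most recently installed one. With no filters, the call is allowed.
 */
static uint32_t combine_results(const uint32_t *results, size_t count)
{
	uint32_t ret = SECCOMP_RET_ALLOW;
	size_t i;

	/* Walk from newest to oldest so that ties keep the newest value. */
	for (i = count; i-- > 0; ) {
		if ((results[i] & SECCOMP_RET_ACTION) <
		    (ret & SECCOMP_RET_ACTION))
			ret = results[i];
	}
	return ret;
}
```

The action is then recovered with (ret & SECCOMP_RET_ACTION) and the
accompanying data with (ret & SECCOMP_RET_DATA).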

Pitfalls
--------

The biggest pitfall to avoid during use is filtering on system call
number without checking the architecture value. Why? On any
architecture that supports multiple system call invocation conventions,
the system call numbers may vary based on the specific invocation. If
the numbers in the different calling conventions overlap, then checks in
the filters may be abused. Always check the arch value!
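
A filter that follows this advice might begin as in the C sketch below.
It assumes a program built for and running on x86-64; the array name
and the policy (failing getpid() with EPERM) are purely illustrative.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

static struct sock_filter arch_checked_filter[] = {
	/* A = seccomp_data.arch */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, arch)),
	/* x86-64: skip the kill below; anything else falls through to it. */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),

	/* A = seccomp_data.nr; only now is the number unambiguous. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, nr)),
	/* getpid: fall through to the errno return; otherwise skip it. */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
	BPF_STMT(BPF_RET | BPF_K,
		 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
```

Placing the architecture check first means every later comparison on
the system call number is made against a known numbering.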

Example
-------

The samples/seccomp/ directory contains both an x86-specific example
and a more generic example of a higher-level macro interface for BPF
program generation.

Adding architecture support
---------------------------

See arch/Kconfig for the authoritative requirements. In general, if an
architecture supports both ptrace_event and seccomp, it will be able to
support seccomp filter with minor fixup: SIGSYS support and seccomp
return value checking. It then only needs to add
CONFIG_HAVE_ARCH_SECCOMP_FILTER to its arch-specific Kconfig.

Caveats
-------

The vDSO can cause some system calls to run entirely in userspace,
leading to surprises when you run programs on different machines that
fall back to real syscalls. To minimize these surprises on x86, make
sure you test with
/sys/devices/system/clocksource/clocksource0/current_clocksource set to
something like acpi_pm.

On x86-64, vsyscall emulation is enabled by default. (vsyscalls are
legacy variants on vDSO calls.) Currently, emulated vsyscalls will
honor seccomp, with a few oddities:

- A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to
  the vsyscall entry for the given call and not the address after the
  'syscall' instruction. Any code which wants to restart the call
  should be aware that (a) a ret instruction has been emulated and (b)
  trying to resume the syscall will again trigger the standard vsyscall
  emulation security checks, making resuming the syscall mostly
  pointless.

- A return value of SECCOMP_RET_TRACE will signal the tracer as usual,
  but the syscall may not be changed to another system call using the
  orig_rax register. It may only be changed to -1 in order to skip the
  currently emulated call. Any other change MAY terminate the process.
  The rip value seen by the tracer will be the syscall entry address;
  this is different from normal behavior. The tracer MUST NOT modify
  rip or rsp. (Do not rely on other changes terminating the process.
  They might work. For example, on some kernels, choosing a syscall
  that only exists in future kernels will be correctly emulated by
  returning -ENOSYS.)

To detect this quirky behavior, check for (addr & ~0x0C00) ==
0xFFFFFFFFFF600000. (For SECCOMP_RET_TRACE, use rip. For
SECCOMP_RET_TRAP, use siginfo->si_call_addr.) Do not check any other
condition: future kernels may improve vsyscall emulation and current
kernels in vsyscall=native mode will behave differently, but the
instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these
cases.
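
Written out in C, the check needs explicit parentheses because ==
binds more tightly than &. The helper below is a minimal sketch and
its name is hypothetical; the address passed in would be rip or
siginfo->si_call_addr, as described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if addr is one of the four emulated vsyscall entry points. */
static bool is_vsyscall_entry(uint64_t addr)
{
	return (addr & ~UINT64_C(0x0C00)) == UINT64_C(0xFFFFFFFFFF600000);
}
```

Masking out bits 10 and 11 folds the four entry addresses onto the
first one, so a single comparison covers all of them.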

Note that modern systems are unlikely to use vsyscalls at all -- they
are a legacy feature and they are considerably slower than standard
syscalls. New code will use the vDSO, and vDSO-issued system calls
are indistinguishable from normal system calls.