| SECure COMPuting with filters |
| ============================= |
| |
| Introduction |
| ------------ |
| |
| A large number of system calls are exposed to every userland process |
| with many of them going unused for the entire lifetime of the process. |
| As system calls change and mature, bugs are found and eradicated. A |
| certain subset of userland applications benefit by having a reduced set |
| of available system calls. The resulting set reduces the total kernel |
| surface exposed to the application. System call filtering is meant for |
| use with those applications. |
| |
| Seccomp filtering provides a means for a process to specify a filter for |
| incoming system calls. The filter is expressed as a Berkeley Packet |
| Filter (BPF) program, as with socket filters, except that the data |
| operated on is related to the system call being made: system call |
| number and the system call arguments. This allows for expressive |
| filtering of system calls using a filter program language with a long |
| history of being exposed to userland and a straightforward data set. |
| |
| Additionally, BPF makes it impossible for users of seccomp to fall prey |
| to time-of-check-time-of-use (TOCTOU) attacks that are common in system |
| call interposition frameworks. BPF programs may not dereference |
| pointers which constrains all filters to solely evaluating the system |
| call arguments directly. |
| |
| What it isn't |
| ------------- |
| |
| System call filtering isn't a sandbox. It provides a clearly defined |
| mechanism for minimizing the exposed kernel surface. It is meant to be |
| a tool for sandbox developers to use. Beyond that, policy for logical |
| behavior and information flow should be managed with a combination of |
| other system hardening techniques and, potentially, an LSM of your |
| choosing. Expressive, dynamic filters provide further options down this |
| path (avoiding pathological sizes or selecting which of the multiplexed |
| system calls in socketcall() is allowed, for instance) which could be |
| construed, incorrectly, as a more complete sandboxing solution. |
| |
| Usage |
| ----- |
| |
| An additional seccomp mode is added and is enabled using the same |
| prctl(2) call as the strict seccomp. If the architecture has |
| CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below: |
| |
| PR_SET_SECCOMP: |
| Now takes an additional argument which specifies a new filter |
| using a BPF program. |
| The BPF program will be executed over struct seccomp_data |
| reflecting the system call number, arguments, and other |
| metadata. The BPF program must then return one of the |
| acceptable values to inform the kernel which action should be |
| taken. |
| |
| Usage: |
| prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog); |
| |
| The 'prog' argument is a pointer to a struct sock_fprog which |
| will contain the filter program. If the program is invalid, the |
| call will return -1 and set errno to EINVAL. |
| |
| If fork/clone and execve are allowed by @prog, any child |
| processes will be constrained to the same filters and system |
| call ABI as the parent. |
| |
| Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or |
| run with CAP_SYS_ADMIN privileges in its namespace. If these are not |
| true, -EACCES will be returned. This requirement ensures that filter |
| programs cannot be applied to child processes with greater privileges |
| than the task that installed them. |
| |
| Additionally, if prctl(2) is allowed by the attached filter, |
| additional filters may be layered on which will increase evaluation |
| time, but allow for further decreasing the attack surface during |
| execution of a process. |
| |
| The above call returns 0 on success and non-zero on error. |
| |
| Return values |
| ------------- |
| A seccomp filter may return any of the following values. If multiple |
| filters exist, the return value for the evaluation of a given system |
| call will always use the highest precedent value. (For example, |
| SECCOMP_RET_KILL will always take precedence.) |
| |
| In precedence order, they are: |
| |
| SECCOMP_RET_KILL: |
| Results in the task exiting immediately without executing the |
| system call. The exit status of the task (status & 0x7f) will |
| be SIGSYS, not SIGKILL. |
| |
| SECCOMP_RET_TRAP: |
| Results in the kernel sending a SIGSYS signal to the triggering |
| task without executing the system call. The kernel will |
| rollback the register state to just before the system call |
| entry such that a signal handler in the task will be able to |
| inspect the ucontext_t->uc_mcontext registers and emulate |
| system call success or failure upon return from the signal |
| handler. |
| |
| The SECCOMP_RET_DATA portion of the return value will be passed |
| as si_errno. |
| |
| SIGSYS triggered by seccomp will have a si_code of SYS_SECCOMP. |
| |
| SECCOMP_RET_ERRNO: |
| Results in the lower 16-bits of the return value being passed |
| to userland as the errno without executing the system call. |
| |
| SECCOMP_RET_TRACE: |
| When returned, this value will cause the kernel to attempt to |
| notify a ptrace()-based tracer prior to executing the system |
| call. If there is no tracer present, -ENOSYS is returned to |
| userland and the system call is not executed. |
| |
| A tracer will be notified if it requests PTRACE_O_TRACESECCOMP |
| using ptrace(PTRACE_SETOPTIONS). The tracer will be notified |
| of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of |
| the BPF program return value will be available to the tracer |
| via PTRACE_GETEVENTMSG. |
| |
| SECCOMP_RET_ALLOW: |
| Results in the system call being executed. |
| |
| If multiple filters exist, the return value for the evaluation of a |
| given system call will always use the highest precedent value. |
| |
| Precedence is only determined using the SECCOMP_RET_ACTION mask. When |
| multiple filters return values of the same precedence, only the |
| SECCOMP_RET_DATA from the most recently installed filter will be |
| returned. |
| |
| Pitfalls |
| -------- |
| |
| The biggest pitfall to avoid during use is filtering on system call |
| number without checking the architecture value. Why? On any |
| architecture that supports multiple system call invocation conventions, |
| the system call numbers may vary based on the specific invocation. If |
| the numbers in the different calling conventions overlap, then checks in |
| the filters may be abused. Always check the arch value! |
| |
| Example |
| ------- |
| |
| The samples/seccomp/ directory contains both an x86-specific example |
| and a more generic example of a higher level macro interface for BPF |
| program generation. |
| |
| |
| |
| Adding architecture support |
| ----------------------- |
| |
| See arch/Kconfig for the authoritative requirements. In general, if an |
| architecture supports both ptrace_event and seccomp, it will be able to |
| support seccomp filter with minor fixup: SIGSYS support and seccomp return |
| value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER |
| to its arch-specific Kconfig. |