| Entry/exit handling for exceptions, interrupts, syscalls and KVM |
| ================================================================ |
| |
| All transitions between execution domains require state updates which are |
| subject to strict ordering constraints. State updates are required for the |
| following: |
| |
| * Lockdep |
| * RCU / Context tracking |
| * Preemption counter |
| * Tracing |
| * Time accounting |
| |
| The update order depends on the transition type and is explained below in |
| the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular |
| exceptions`_, `NMI and NMI-like exceptions`_. |
| |
| Non-instrumentable code - noinstr |
| --------------------------------- |
| |
| Most instrumentation facilities depend on RCU, so instrumentation is prohibited |
| for entry code before RCU starts watching and exit code after RCU stops |
| watching. In addition, many architectures must save and restore register state, |
| which means that (for example) a breakpoint in the breakpoint entry code would |
| overwrite the debug registers of the initial breakpoint. |
| |
| Such code must be marked with the 'noinstr' attribute, placing that code into a |
| special section inaccessible to instrumentation and debug facilities. Some |
| functions are partially instrumentable, which is handled by marking them |
| noinstr and using instrumentation_begin() and instrumentation_end() to flag the |
| instrumentable ranges of code: |
| |
| .. code-block:: c |
| |
|   noinstr void entry(void) |
|   { |
|           handle_entry();     // <-- must be 'noinstr' or '__always_inline' |
|           ... |
| |
|           instrumentation_begin(); |
|           handle_context();   // <-- instrumentable code |
|           instrumentation_end(); |
| |
|           ... |
|           handle_exit();      // <-- must be 'noinstr' or '__always_inline' |
|   } |
| |
| This allows verification of the 'noinstr' restrictions via objtool on |
| supported architectures. |
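| |
| The effect of the begin/end markers can be illustrated with a small userspace |
| model. This is only a sketch: objtool performs its check statically on the |
| object code, whereas the model below counts violations at run time. All |
| model_* names and counters are hypothetical and do not exist in the kernel. |
| |
```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
static int instr_depth;   /* > 0 while inside an instrumentable range */
static int hooks_allowed; /* hooks that fired inside such a range */
static int violations;    /* hooks that fired in a non-instrumentable range */

static void model_instrumentation_begin(void)
{
	instr_depth++;
}

static void model_instrumentation_end(void)
{
	assert(instr_depth > 0); /* every end needs a matching begin */
	instr_depth--;
}

/* Stand-in for a tracepoint or probe firing inside a called function. */
static void model_hook(void)
{
	if (instr_depth > 0)
		hooks_allowed++;
	else
		violations++;
}

/* Models a noinstr function; 'buggy' adds an instrumented call too early. */
static void model_entry(bool buggy)
{
	if (buggy)
		model_hook();          /* violation: outside begin/end */

	model_instrumentation_begin();
	model_hook();                  /* fine: inside begin/end */
	model_instrumentation_end();
}
```
| |
| Calling model_entry(false) records one allowed hook and no violation, while |
| model_entry(true) additionally records one violation. |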
| |
| Invoking non-instrumentable functions from instrumentable context has no |
| restrictions. It is useful for protecting code, such as state switching, that |
| would malfunction if instrumented. |
| |
| All non-instrumentable entry/exit code sections before and after the RCU |
| state transitions must run with interrupts disabled. |
| |
| Syscalls |
| -------- |
| |
| Syscall-entry code starts in assembly code and calls out into low-level C code |
| after establishing low-level architecture-specific state and stack frames. This |
| low-level C code must not be instrumented. A typical syscall handling function |
| invoked from low-level assembly code looks like this: |
| |
| .. code-block:: c |
| |
|   noinstr void syscall(struct pt_regs *regs, int nr) |
|   { |
|           arch_syscall_enter(regs); |
|           nr = syscall_enter_from_user_mode(regs, nr); |
| |
|           instrumentation_begin(); |
|           if (!invoke_syscall(regs, nr) && nr != -1) |
|                   result_reg(regs) = __sys_ni_syscall(regs); |
|           instrumentation_end(); |
| |
|           syscall_exit_to_user_mode(regs); |
|   } |
| |
| syscall_enter_from_user_mode() first invokes enter_from_user_mode() which |
| establishes state in the following order: |
| |
| * Lockdep |
| * RCU / Context tracking |
| * Tracing |
| |
| and then invokes the various entry work functions like ptrace, seccomp, audit, |
| syscall tracing, etc. After all that is done, the instrumentable invoke_syscall |
| function can be invoked. The instrumentable code section then ends, after which |
| syscall_exit_to_user_mode() is invoked. |
| |
| syscall_exit_to_user_mode() handles all work which needs to be done before |
| returning to user space, such as tracing, audit, signals and task work. After |
| that it invokes exit_to_user_mode(), which again handles the state |
| transition, this time in the reverse order: |
| |
| * Tracing |
| * RCU / Context tracking |
| * Lockdep |
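| |
| The mirrored ordering of the two lists above can be sketched as a userspace |
| model in which each state update is reduced to a log entry. The model_* |
| functions, the enum and the log are hypothetical; only the order of the steps |
| is taken from the text. |
| |
```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: the state updates are reduced to log entries. */
enum model_step { STEP_LOCKDEP, STEP_RCU, STEP_TRACING };

#define MODEL_LOG_MAX 16
static enum model_step model_log[MODEL_LOG_MAX];
static int model_log_len;

static void model_record(enum model_step step)
{
	assert(model_log_len < MODEL_LOG_MAX);
	model_log[model_log_len++] = step;
}

/* Entry order: lockdep, then RCU / context tracking, then tracing. */
static void model_enter_from_user_mode(void)
{
	model_record(STEP_LOCKDEP);
	model_record(STEP_RCU);
	model_record(STEP_TRACING);
}

/* Exit order: the same steps, reversed. */
static void model_exit_to_user_mode(void)
{
	model_record(STEP_TRACING);
	model_record(STEP_RCU);
	model_record(STEP_LOCKDEP);
}

/* True if the second half of the log mirrors the first half. */
static bool model_log_is_mirrored(void)
{
	for (int i = 0; i < model_log_len / 2; i++) {
		if (model_log[i] != model_log[model_log_len - 1 - i])
			return false;
	}
	return true;
}
```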
| |
| syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also |
| available as fine-grained subfunctions for cases where the architecture code |
| has to do extra work between the various steps. In such cases the architecture |
| code has to ensure that enter_from_user_mode() is called first on entry and |
| exit_to_user_mode() is called last on exit. |
| |
| Do not nest syscalls. Nested syscalls will cause RCU and/or context tracking |
| to print a warning. |
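| |
| Why nesting is detectable can be shown with a minimal userspace model of a |
| two-state context tracker. This is a sketch under the assumption that the |
| tracker only distinguishes user and kernel context; the names are |
| hypothetical and the real RCU and context tracking code is more involved. |
| |
```c
#include <assert.h>

/* Hypothetical two-state context tracker. */
enum model_ctx { MODEL_CTX_USER, MODEL_CTX_KERNEL };

static enum model_ctx model_cur_ctx = MODEL_CTX_USER;
static int model_warnings;

static void model_enter_from_user_mode(void)
{
	if (model_cur_ctx != MODEL_CTX_USER)
		model_warnings++;      /* entry while already in kernel context */
	model_cur_ctx = MODEL_CTX_KERNEL;
}

static void model_exit_to_user_mode(void)
{
	if (model_cur_ctx != MODEL_CTX_KERNEL)
		model_warnings++;      /* exit without a matching entry */
	model_cur_ctx = MODEL_CTX_USER;
}
```
| |
| A balanced enter/exit pair leaves model_warnings at zero, whereas a second |
| entry without an intervening exit increments it. |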
| |
| KVM |
| --- |
| |
| Entering or exiting guest mode is very similar to syscalls. From the host |
| kernel point of view the CPU goes off into user space when entering the |
| guest and returns to the kernel on exit. |
| |
| kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode() |
| and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode(). |
| The state operations have the same ordering. |
| |
| Task work handling is done separately for guests at the boundary of the |
| vcpu_run() loop via xfer_to_guest_mode_handle_work(), which handles a subset |
| of the work handled on return to user space. |
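| |
| The shape of such a run loop can be sketched in userspace as follows. This is |
| not the KVM implementation: struct model_vcpu, its counters and all model_* |
| functions are hypothetical, and the model only illustrates that pending work |
| is handled at the loop boundary before the guest is entered again. |
| |
```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the loop shape; not the KVM implementation. */
struct model_vcpu {
	int pending_work;       /* work items queued for the vCPU task */
	int work_handled;       /* items processed at the loop boundary */
	int guest_entries;      /* number of times the guest was entered */
	bool entered_with_work; /* set if the guest ran with work pending */
};

/* Stand-in for the work handling done at the loop boundary. */
static void model_handle_work(struct model_vcpu *v)
{
	v->work_handled += v->pending_work;
	v->pending_work = 0;
}

/* Stand-in for one guest entry followed by the guest exit. */
static void model_guest_enter_exit(struct model_vcpu *v)
{
	if (v->pending_work)
		v->entered_with_work = true;
	v->guest_entries++;
}

static void model_vcpu_run(struct model_vcpu *v, int iterations)
{
	for (int i = 0; i < iterations; i++) {
		model_handle_work(v);
		model_guest_enter_exit(v);
		v->pending_work++; /* pretend work arrived while in the guest */
	}
}
```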
| |
| Do not nest KVM entry/exit transitions because doing so is nonsensical. |
| |
| Interrupts and regular exceptions |
| --------------------------------- |
| |
| Interrupt entry and exit handling is slightly more complex than syscalls |
| and KVM transitions. |
| |
| If an interrupt is raised while the CPU executes in user space, the entry |
| and exit handling is exactly the same as for syscalls. |
| |
| If the interrupt is raised while the CPU executes in kernel space the entry and |
| exit handling is slightly different. RCU state is only updated when the |
| interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will |
| already be watching. Lockdep and tracing have to be updated unconditionally. |
| |
| irqentry_enter() and irqentry_exit() provide the implementation for this. |
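| |
| The kernel-mode case described above can be sketched as a userspace model. |
| The state type is loosely modeled on the value returned by irqentry_enter(); |
| struct model_cpu, the rcu_nesting counter and the update counters are |
| assumptions made for illustration, and the user-mode path is left out. |
| |
```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state handed from entry to exit. */
typedef struct {
	bool exit_rcu; /* true if entry made RCU watch and exit must undo it */
} model_irqentry_state_t;

struct model_cpu {
	bool idle_task;      /* interrupted context is the idle task */
	int rcu_nesting;     /* RCU is watching while this is > 0 */
	int lockdep_updates; /* counts unconditional lockdep updates */
	int trace_updates;   /* counts unconditional tracing updates */
};

static model_irqentry_state_t model_irqentry_enter(struct model_cpu *cpu)
{
	model_irqentry_state_t state = { .exit_rcu = false };

	cpu->lockdep_updates++;            /* unconditional */
	if (cpu->idle_task) {
		cpu->rcu_nesting++;        /* make RCU watch */
		state.exit_rcu = true;
	}
	cpu->trace_updates++;              /* unconditional */
	return state;
}

static void model_irqentry_exit(struct model_cpu *cpu,
				model_irqentry_state_t state)
{
	cpu->trace_updates++;              /* unconditional */
	if (state.exit_rcu)
		cpu->rcu_nesting--;        /* undo only what entry did */
	cpu->lockdep_updates++;            /* unconditional */
}
```
| |
| Passing the returned state to the exit function is what lets the exit path |
| undo exactly the RCU update that the entry path performed. |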
| |
| The architecture-specific part looks similar to syscall handling: |
| |
| .. code-block:: c |
| |
|   noinstr void interrupt(struct pt_regs *regs, int nr) |
|   { |
|           arch_interrupt_enter(regs); |
|           state = irqentry_enter(regs); |
| |
|           instrumentation_begin(); |
| |
|           irq_enter_rcu(); |
|           invoke_irq_handler(regs, nr); |
|           irq_exit_rcu(); |
| |
|           instrumentation_end(); |
| |
|           irqentry_exit(regs, state); |
|   } |
| |
| Note that the invocation of the actual interrupt handler is within an |
| irq_enter_rcu() and irq_exit_rcu() pair. |
| |
| irq_enter_rcu() updates the preemption count, which makes in_hardirq() |
| return true, and handles NOHZ tick state and interrupt time accounting. This |
| means that in_hardirq() returns false up to the point where irq_enter_rcu() |
| is invoked. |
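| |
| The visibility of the hardirq state can be sketched with a userspace model |
| of the preemption count. The bit layout (a 4-bit hardirq field starting at |
| bit 16) is an assumption for illustration, and the model_* functions only |
| model the preemption-count part of the real functions. |
| |
```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout: a 4-bit hardirq field starting at bit 16. */
#define MODEL_HARDIRQ_SHIFT  16
#define MODEL_HARDIRQ_OFFSET (1U << MODEL_HARDIRQ_SHIFT)
#define MODEL_HARDIRQ_MASK   (0xfU << MODEL_HARDIRQ_SHIFT)

static unsigned int model_preempt_count;
static bool seen_before, seen_handler, seen_after;

static bool model_in_hardirq(void)
{
	return (model_preempt_count & MODEL_HARDIRQ_MASK) != 0;
}

static void model_irq_enter_rcu(void)
{
	model_preempt_count += MODEL_HARDIRQ_OFFSET;
}

static void model_irq_exit_rcu(void)
{
	model_preempt_count -= MODEL_HARDIRQ_OFFSET;
}

/* Samples model_in_hardirq() at three points of an interrupt. */
static void model_interrupt(void)
{
	seen_before = model_in_hardirq();   /* early entry code */
	model_irq_enter_rcu();
	seen_handler = model_in_hardirq();  /* interrupt handler */
	model_irq_exit_rcu();
	seen_after = model_in_hardirq();    /* late exit code */
}
```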
| |
| irq_exit_rcu() handles interrupt time accounting, undoes the preemption |
| count update and eventually handles soft interrupts and NOHZ tick state. |
| |
| In theory, the preemption count could be updated in irqentry_enter(). In |
| practice, deferring this update to irq_enter_rcu() allows the preemption-count |
| code to be traced, while also maintaining symmetry with irq_exit_rcu() and |
| irqentry_exit(), which are described in the next paragraph. The only downside |
| is that the early entry code up to irq_enter_rcu() must be aware that the |
| preemption count has not yet been updated with the HARDIRQ_OFFSET state. |
| |
| Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count |
| before it handles soft interrupts, whose handlers must run in BH context rather |
| than irq-disabled context. In addition, irqentry_exit() might schedule, which |
| also requires that HARDIRQ_OFFSET has been removed from the preemption count. |
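| |
| The ordering requirement for soft interrupts can be illustrated with a |
| userspace model. The bit positions of the softirq and hardirq fields are |
| assumptions, and model_softirq_pending and the handler are hypothetical; the |
| sketch only shows that the hardirq count is dropped before the handler runs. |
| |
```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout: softirq field at bits 8-15, hardirq field at bits 16-19. */
#define MODEL_SOFTIRQ_OFFSET (1U << 8)
#define MODEL_SOFTIRQ_MASK   (0xffU << 8)
#define MODEL_HARDIRQ_OFFSET (1U << 16)
#define MODEL_HARDIRQ_MASK   (0xfU << 16)

static unsigned int model_preempt_count;
static bool model_softirq_pending;
static bool softirq_saw_hardirq, softirq_saw_bh;

/* Records which context the soft interrupt handler observes. */
static void model_softirq_handler(void)
{
	softirq_saw_hardirq = (model_preempt_count & MODEL_HARDIRQ_MASK) != 0;
	softirq_saw_bh = (model_preempt_count & MODEL_SOFTIRQ_MASK) != 0;
}

static void model_irq_exit_rcu(void)
{
	model_preempt_count -= MODEL_HARDIRQ_OFFSET;         /* leave hardirq first */
	if (model_softirq_pending) {
		model_softirq_pending = false;
		model_preempt_count += MODEL_SOFTIRQ_OFFSET; /* enter BH context */
		model_softirq_handler();
		model_preempt_count -= MODEL_SOFTIRQ_OFFSET; /* leave BH context */
	}
}
```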
| |
| Even though interrupt handlers are expected to run with local interrupts |
| disabled, interrupt nesting is common from an entry/exit perspective. For |
| example, softirq handling happens within an irqentry_{enter,exit}() block with |
| local interrupts enabled. Also, although uncommon, nothing prevents an |
| interrupt handler from re-enabling interrupts. |
| |
| Interrupt entry/exit code doesn't strictly need to handle reentrancy, since it |
| runs with local interrupts disabled. But NMIs can happen anytime, and a lot of |
| the entry code is shared between the two. |
| |
| NMI and NMI-like exceptions |
| --------------------------- |
| |
| NMIs and NMI-like exceptions (machine checks, double faults, debug |
| interrupts, etc.) can hit any context and must be extra careful with |
| the state. |
| |
| State changes for debug exceptions and machine-check exceptions depend on |
| whether these exceptions happened in user-space (breakpoints or watchpoints) or |
| in kernel mode (code patching). From user-space, they are treated like |
| interrupts, while from kernel mode they are treated like NMIs. |
| |
| NMIs and other NMI-like exceptions handle state transitions without |
| distinguishing between user-mode and kernel-mode origin. |
| |
| The state update on entry is handled in irqentry_nmi_enter() which updates |
| state in the following order: |
| |
| * Preemption counter |
| * Lockdep |
| * RCU / Context tracking |
| * Tracing |
| |
| The exit counterpart irqentry_nmi_exit() does the reverse operation in the |
| reverse order. |
| |
| Note that the update of the preemption counter has to be the first |
| operation on enter and the last operation on exit. The reason is that both |
| lockdep and RCU rely on in_nmi() returning true in this case. The |
| preemption count modification in the NMI entry/exit case must not be |
| traced. |
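| |
| The reason for this ordering can be sketched in a userspace model where |
| every state update checks in_nmi(). The NMI field position (4 bits starting |
| at bit 20) is an assumption, and model_state_hook() is a hypothetical |
| stand-in for the lockdep, RCU and tracing updates. |
| |
```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout: a 4-bit NMI field starting at bit 20. */
#define MODEL_NMI_OFFSET (1U << 20)
#define MODEL_NMI_MASK   (0xfU << 20)

static unsigned int model_preempt_count;
static int hooks_total, hooks_in_nmi;

static bool model_in_nmi(void)
{
	return (model_preempt_count & MODEL_NMI_MASK) != 0;
}

/* Stand-in for lockdep, RCU or tracing code that relies on in_nmi(). */
static void model_state_hook(void)
{
	hooks_total++;
	if (model_in_nmi())
		hooks_in_nmi++;
}

static void model_irqentry_nmi_enter(void)
{
	model_preempt_count += MODEL_NMI_OFFSET; /* first operation on entry */
	model_state_hook();                      /* lockdep */
	model_state_hook();                      /* RCU / context tracking */
	model_state_hook();                      /* tracing */
}

static void model_irqentry_nmi_exit(void)
{
	model_state_hook();                      /* tracing */
	model_state_hook();                      /* RCU / context tracking */
	model_state_hook();                      /* lockdep */
	model_preempt_count -= MODEL_NMI_OFFSET; /* last operation on exit */
}
```
| |
| With this ordering all six hook invocations observe model_in_nmi() as true. |
| Moving the counter update after the hooks on entry, or before them on exit, |
| would make some of them observe false. |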
| |
| Architecture-specific code looks like this: |
| |
| .. code-block:: c |
| |
|   noinstr void nmi(struct pt_regs *regs) |
|   { |
|           arch_nmi_enter(regs); |
|           state = irqentry_nmi_enter(regs); |
| |
|           instrumentation_begin(); |
|           nmi_handler(regs); |
|           instrumentation_end(); |
| |
|           irqentry_nmi_exit(regs, state); |
|   } |
| |
| and for e.g. a debug exception it can look like this: |
| |
| .. code-block:: c |
| |
|   noinstr void debug(struct pt_regs *regs) |
|   { |
|           arch_nmi_enter(regs); |
| |
|           debug_regs = save_debug_regs(); |
| |
|           if (user_mode(regs)) { |
|                   state = irqentry_enter(regs); |
| |
|                   instrumentation_begin(); |
|                   user_mode_debug_handler(regs, debug_regs); |
|                   instrumentation_end(); |
| |
|                   irqentry_exit(regs, state); |
|           } else { |
|                   state = irqentry_nmi_enter(regs); |
| |
|                   instrumentation_begin(); |
|                   kernel_mode_debug_handler(regs, debug_regs); |
|                   instrumentation_end(); |
| |
|                   irqentry_nmi_exit(regs, state); |
|           } |
|   } |
| |
| There is no combined irqentry_nmi_if_kernel() function available as the |
| above cannot be handled in an exception-agnostic way. |
| |
| NMIs can happen in any context. For example, an NMI-like exception can be |
| triggered while an NMI is being handled. NMI entry code therefore has to be |
| reentrant, and state updates need to handle nesting. |
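| |
| Why nesting calls for a counter rather than a flag can be shown with a small |
| userspace comparison. Both variants are hypothetical models: the counter |
| assumes a 4-bit NMI field starting at bit 20, and the flag variant exists |
| only to demonstrate the failure mode. |
| |
```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout: a 4-bit NMI field starting at bit 20. */
#define MODEL_NMI_SHIFT  20
#define MODEL_NMI_OFFSET (1U << MODEL_NMI_SHIFT)
#define MODEL_NMI_MASK   (0xfU << MODEL_NMI_SHIFT)

static unsigned int model_preempt_count; /* counter-based tracking */
static bool model_nmi_flag;              /* naive flag-based tracking */

static unsigned int model_nmi_depth(void)
{
	return (model_preempt_count & MODEL_NMI_MASK) >> MODEL_NMI_SHIFT;
}

static bool model_in_nmi(void)
{
	return model_nmi_depth() != 0;
}

static void model_nmi_enter(void)
{
	assert(model_nmi_depth() < 15);  /* the 4-bit field must not overflow */
	model_preempt_count += MODEL_NMI_OFFSET;
	model_nmi_flag = true;
}

static void model_nmi_exit(void)
{
	assert(model_nmi_depth() > 0);   /* exit needs a matching enter */
	model_preempt_count -= MODEL_NMI_OFFSET;
	model_nmi_flag = false;          /* loses track of the outer NMI */
}
```
| |
| After two nested entries and one exit, the counter still reports NMI context |
| for the outer handler, while the flag has already been cleared. |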