| .. SPDX-License-Identifier: GPL-2.0 | 
 |  | 
 | .. _kernel_hacking_locktypes: | 
 |  | 
 | ========================== | 
 | Lock types and their rules | 
 | ========================== | 
 |  | 
 | Introduction | 
 | ============ | 
 |  | 
 | The kernel provides a variety of locking primitives which can be divided | 
 | into three categories: | 
 |  | 
 |  - Sleeping locks | 
 |  - CPU local locks | 
 |  - Spinning locks | 
 |  | 
 | This document conceptually describes these lock types and provides rules | 
 | for their nesting, including the rules for use under PREEMPT_RT. | 
 |  | 
 |  | 
 | Lock categories | 
 | =============== | 
 |  | 
 | Sleeping locks | 
 | -------------- | 
 |  | 
 | Sleeping locks can only be acquired in preemptible task context. | 
 |  | 
 | Although implementations allow try_lock() from other contexts, it is | 
 | necessary to carefully evaluate the safety of unlock() as well as of | 
 | try_lock().  Furthermore, it is also necessary to evaluate the debugging | 
 | versions of these primitives.  In short, don't acquire sleeping locks from | 
 | other contexts unless there is no other option. | 
 |  | 
 | Sleeping lock types: | 
 |  | 
 |  - mutex | 
 |  - rt_mutex | 
 |  - semaphore | 
 |  - rw_semaphore | 
 |  - ww_mutex | 
 |  - percpu_rw_semaphore | 
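
As an illustration, a minimal mutex sketch in preemptible task context
(the struct and field names are hypothetical)::

  struct foo {
    struct mutex lock;
    int counter;
  };

  void foo_update(struct foo *foo)
  {
    /* May sleep while waiting for the lock, so task context only. */
    mutex_lock(&foo->lock);
    foo->counter++;
    mutex_unlock(&foo->lock);
  }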
 |  | 
 | On PREEMPT_RT kernels, these lock types are converted to sleeping locks: | 
 |  | 
 |  - local_lock | 
 |  - spinlock_t | 
 |  - rwlock_t | 
 |  | 
 |  | 
 | CPU local locks | 
 | --------------- | 
 |  | 
 |  - local_lock | 
 |  | 
 | On non-PREEMPT_RT kernels, local_lock functions are wrappers around | 
preemption and interrupt disabling primitives. Contrary to other locking
mechanisms, disabling preemption or interrupts is a purely CPU-local
concurrency control mechanism and is not suited for inter-CPU concurrency
control.
 |  | 
 |  | 
 | Spinning locks | 
 | -------------- | 
 |  | 
 |  - raw_spinlock_t | 
 |  - bit spinlocks | 
 |  | 
 | On non-PREEMPT_RT kernels, these lock types are also spinning locks: | 
 |  | 
 |  - spinlock_t | 
 |  - rwlock_t | 
 |  | 
 | Spinning locks implicitly disable preemption and the lock / unlock functions | 
 | can have suffixes which apply further protections: | 
 |  | 
 |  ===================  ==================================================== | 
 |  _bh()                Disable / enable bottom halves (soft interrupts) | 
 |  _irq()               Disable / enable interrupts | 
 |  _irqsave/restore()   Save and disable / restore interrupt disabled state | 
 |  ===================  ==================================================== | 
 |  | 
 |  | 
 | Owner semantics | 
 | =============== | 
 |  | 
 | The aforementioned lock types except semaphores have strict owner | 
 | semantics: | 
 |  | 
 |   The context (task) that acquired the lock must release it. | 
 |  | 
 | rw_semaphores have a special interface which allows non-owner release for | 
 | readers. | 
 |  | 
 |  | 
 | rtmutex | 
 | ======= | 
 |  | 
 | RT-mutexes are mutexes with support for priority inheritance (PI). | 
 |  | 
 | PI has limitations on non-PREEMPT_RT kernels due to preemption and | 
 | interrupt disabled sections. | 
 |  | 
 | PI clearly cannot preempt preemption-disabled or interrupt-disabled | 
 | regions of code, even on PREEMPT_RT kernels.  Instead, PREEMPT_RT kernels | 
 | execute most such regions of code in preemptible task context, especially | 
 | interrupt handlers and soft interrupts.  This conversion allows spinlock_t | 
 | and rwlock_t to be implemented via RT-mutexes. | 
 |  | 
 |  | 
 | semaphore | 
 | ========= | 
 |  | 
 | semaphore is a counting semaphore implementation. | 
 |  | 
 | Semaphores are often used for both serialization and waiting, but new use | 
 | cases should instead use separate serialization and wait mechanisms, such | 
 | as mutexes and completions. | 
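
As an illustration of the recommended wait mechanism, a minimal
completion sketch (names are hypothetical)::

  static DECLARE_COMPLETION(foo_done);

  /* Waiter, preemptible task context. */
  void foo_wait(void)
  {
    wait_for_completion(&foo_done);
  }

  /* Signaler, e.g. from an interrupt handler or another task. */
  void foo_signal(void)
  {
    complete(&foo_done);
  }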
 |  | 
 | semaphores and PREEMPT_RT | 
 | ---------------------------- | 
 |  | 
 | PREEMPT_RT does not change the semaphore implementation because counting | 
 | semaphores have no concept of owners, thus preventing PREEMPT_RT from | 
 | providing priority inheritance for semaphores.  After all, an unknown | 
 | owner cannot be boosted. As a consequence, blocking on semaphores can | 
 | result in priority inversion. | 
 |  | 
 |  | 
 | rw_semaphore | 
 | ============ | 
 |  | 
 | rw_semaphore is a multiple readers and single writer lock mechanism. | 
 |  | 
 | On non-PREEMPT_RT kernels the implementation is fair, thus preventing | 
 | writer starvation. | 
 |  | 
 | rw_semaphore complies by default with the strict owner semantics, but there | 
 | exist special-purpose interfaces that allow non-owner release for readers. | 
These interfaces work independently of the kernel configuration.
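
For reference, a minimal sketch of the ordinary owner-style reader/writer
usage (the lock and data names are illustrative); the non-owner reader
interfaces are down_read_non_owner() and up_read_non_owner()::

  static DECLARE_RWSEM(foo_rwsem);
  static int foo_config;

  int foo_read_config(void)
  {
    int val;

    down_read(&foo_rwsem);    /* shared: multiple readers may enter */
    val = foo_config;
    up_read(&foo_rwsem);      /* released by the acquiring task */
    return val;
  }

  void foo_write_config(int val)
  {
    down_write(&foo_rwsem);   /* exclusive */
    foo_config = val;
    up_write(&foo_rwsem);
  }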
 |  | 
 | rw_semaphore and PREEMPT_RT | 
 | --------------------------- | 
 |  | 
 | PREEMPT_RT kernels map rw_semaphore to a separate rt_mutex-based | 
 | implementation, thus changing the fairness: | 
 |  | 
 |  Because an rw_semaphore writer cannot grant its priority to multiple | 
 |  readers, a preempted low-priority reader will continue holding its lock, | 
 |  thus starving even high-priority writers.  In contrast, because readers | 
 |  can grant their priority to a writer, a preempted low-priority writer will | 
 |  have its priority boosted until it releases the lock, thus preventing that | 
 |  writer from starving readers. | 
 |  | 
 |  | 
 | local_lock | 
 | ========== | 
 |  | 
 | local_lock provides a named scope to critical sections which are protected | 
 | by disabling preemption or interrupts. | 
 |  | 
 | On non-PREEMPT_RT kernels local_lock operations map to the preemption and | 
 | interrupt disabling and enabling primitives: | 
 |  | 
 |  ===============================  ====================== | 
 |  local_lock(&llock)               preempt_disable() | 
 |  local_unlock(&llock)             preempt_enable() | 
 |  local_lock_irq(&llock)           local_irq_disable() | 
 |  local_unlock_irq(&llock)         local_irq_enable() | 
 |  local_lock_irqsave(&llock)       local_irq_save() | 
 |  local_unlock_irqrestore(&llock)  local_irq_restore() | 
 |  ===============================  ====================== | 
 |  | 
 | The named scope of local_lock has two advantages over the regular | 
 | primitives: | 
 |  | 
  - The lock name allows static analysis and also clearly documents the
    protection scope, while the regular primitives are scopeless and
    opaque.

  - If lockdep is enabled, the local_lock gains a lockmap which allows
    validating the correctness of the protection. This can detect cases
    where, e.g., a function using preempt_disable() as a protection
    mechanism is invoked from interrupt or soft-interrupt context. Aside
    from that, lockdep_assert_held(&llock) works as with any other locking
    primitive.
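
For example, a minimal sketch of declaring and taking a local_lock for
per-CPU data (all names are illustrative)::

  struct foo_pcpu {
    local_lock_t lock;
    unsigned int count;
  };

  static DEFINE_PER_CPU(struct foo_pcpu, foo_pcpu) = {
    .lock = INIT_LOCAL_LOCK(lock),
  };

  void foo_count(void)
  {
    /* Named protection scope for this CPU's foo_pcpu instance. */
    local_lock(&foo_pcpu.lock);
    this_cpu_inc(foo_pcpu.count);
    local_unlock(&foo_pcpu.lock);
  }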
 |  | 
 | local_lock and PREEMPT_RT | 
 | ------------------------- | 
 |  | 
 | PREEMPT_RT kernels map local_lock to a per-CPU spinlock_t, thus changing | 
 | semantics: | 
 |  | 
 |   - All spinlock_t changes also apply to local_lock. | 
 |  | 
 | local_lock usage | 
 | ---------------- | 
 |  | 
local_lock should be used in situations where disabling preemption or
interrupts is the appropriate form of concurrency control to protect
per-CPU data structures on a non-PREEMPT_RT kernel.
 |  | 
 | local_lock is not suitable to protect against preemption or interrupts on a | 
 | PREEMPT_RT kernel due to the PREEMPT_RT specific spinlock_t semantics. | 
 |  | 
 |  | 
 | raw_spinlock_t and spinlock_t | 
 | ============================= | 
 |  | 
 | raw_spinlock_t | 
 | -------------- | 
 |  | 
 | raw_spinlock_t is a strict spinning lock implementation in all kernels, | 
 | including PREEMPT_RT kernels.  Use raw_spinlock_t only in real critical | 
 | core code, low-level interrupt handling and places where disabling | 
 | preemption or interrupts is required, for example, to safely access | 
 | hardware state.  raw_spinlock_t can sometimes also be used when the | 
 | critical section is tiny, thus avoiding RT-mutex overhead. | 
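
A minimal sketch of a tiny, truly atomic critical section protected by a
raw_spinlock_t (the lock, register and value names are illustrative)::

  static DEFINE_RAW_SPINLOCK(hw_lock);

  void hw_update(void __iomem *reg, u32 val)
  {
    unsigned long flags;

    /* Truly atomic on all kernels, including PREEMPT_RT. */
    raw_spin_lock_irqsave(&hw_lock, flags);
    writel(val, reg);
    raw_spin_unlock_irqrestore(&hw_lock, flags);
  }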
 |  | 
 | spinlock_t | 
 | ---------- | 
 |  | 
 | The semantics of spinlock_t change with the state of PREEMPT_RT. | 
 |  | 
 | On a non-PREEMPT_RT kernel spinlock_t is mapped to raw_spinlock_t and has | 
 | exactly the same semantics. | 
 |  | 
 | spinlock_t and PREEMPT_RT | 
 | ------------------------- | 
 |  | 
 | On a PREEMPT_RT kernel spinlock_t is mapped to a separate implementation | 
 | based on rt_mutex which changes the semantics: | 
 |  | 
 |  - Preemption is not disabled. | 
 |  | 
 |  - The hard interrupt related suffixes for spin_lock / spin_unlock | 
 |    operations (_irq, _irqsave / _irqrestore) do not affect the CPU's | 
 |    interrupt disabled state. | 
 |  | 
 |  - The soft interrupt related suffix (_bh()) still disables softirq | 
 |    handlers. | 
 |  | 
 |    Non-PREEMPT_RT kernels disable preemption to get this effect. | 
 |  | 
   PREEMPT_RT kernels use a per-CPU lock for serialization which keeps
   preemption enabled. The lock disables softirq handlers and also
   prevents reentrancy due to task preemption.
 |  | 
 | PREEMPT_RT kernels preserve all other spinlock_t semantics: | 
 |  | 
 |  - Tasks holding a spinlock_t do not migrate.  Non-PREEMPT_RT kernels | 
 |    avoid migration by disabling preemption.  PREEMPT_RT kernels instead | 
 |    disable migration, which ensures that pointers to per-CPU variables | 
 |    remain valid even if the task is preempted. | 
 |  | 
 |  - Task state is preserved across spinlock acquisition, ensuring that the | 
 |    task-state rules apply to all kernel configurations.  Non-PREEMPT_RT | 
 |    kernels leave task state untouched.  However, PREEMPT_RT must change | 
 |    task state if the task blocks during acquisition.  Therefore, it saves | 
 |    the current task state before blocking and the corresponding lock wakeup | 
 |    restores it, as shown below:: | 
 |  | 
    task->state = TASK_INTERRUPTIBLE
     lock()
       block()
         task->saved_state = task->state
         task->state = TASK_UNINTERRUPTIBLE
         schedule()
                                        lock wakeup
                                          task->state = task->saved_state
 |  | 
 |    Other types of wakeups would normally unconditionally set the task state | 
 |    to RUNNING, but that does not work here because the task must remain | 
 |    blocked until the lock becomes available.  Therefore, when a non-lock | 
 |    wakeup attempts to awaken a task blocked waiting for a spinlock, it | 
 |    instead sets the saved state to RUNNING.  Then, when the lock | 
 |    acquisition completes, the lock wakeup sets the task state to the saved | 
 |    state, in this case setting it to RUNNING:: | 
 |  | 
    task->state = TASK_INTERRUPTIBLE
     lock()
       block()
         task->saved_state = task->state
         task->state = TASK_UNINTERRUPTIBLE
         schedule()
                                        non lock wakeup
                                          task->saved_state = TASK_RUNNING

                                        lock wakeup
                                          task->state = task->saved_state
 |  | 
 |    This ensures that the real wakeup cannot be lost. | 
 |  | 
 |  | 
 | rwlock_t | 
 | ======== | 
 |  | 
 | rwlock_t is a multiple readers and single writer lock mechanism. | 
 |  | 
 | Non-PREEMPT_RT kernels implement rwlock_t as a spinning lock and the | 
 | suffix rules of spinlock_t apply accordingly. The implementation is fair, | 
 | thus preventing writer starvation. | 
 |  | 
 | rwlock_t and PREEMPT_RT | 
 | ----------------------- | 
 |  | 
 | PREEMPT_RT kernels map rwlock_t to a separate rt_mutex-based | 
 | implementation, thus changing semantics: | 
 |  | 
 |  - All the spinlock_t changes also apply to rwlock_t. | 
 |  | 
 |  - Because an rwlock_t writer cannot grant its priority to multiple | 
 |    readers, a preempted low-priority reader will continue holding its lock, | 
 |    thus starving even high-priority writers.  In contrast, because readers | 
 |    can grant their priority to a writer, a preempted low-priority writer | 
 |    will have its priority boosted until it releases the lock, thus | 
 |    preventing that writer from starving readers. | 
 |  | 
 |  | 
 | PREEMPT_RT caveats | 
 | ================== | 
 |  | 
 | local_lock on RT | 
 | ---------------- | 
 |  | 
 | The mapping of local_lock to spinlock_t on PREEMPT_RT kernels has a few | 
 | implications. For example, on a non-PREEMPT_RT kernel the following code | 
 | sequence works as expected:: | 
 |  | 
 |   local_lock_irq(&local_lock); | 
 |   raw_spin_lock(&lock); | 
 |  | 
 | and is fully equivalent to:: | 
 |  | 
 |    raw_spin_lock_irq(&lock); | 
 |  | 
 | On a PREEMPT_RT kernel this code sequence breaks because local_lock_irq() | 
 | is mapped to a per-CPU spinlock_t which neither disables interrupts nor | 
preemption. The following code sequence works correctly on both
 | PREEMPT_RT and non-PREEMPT_RT kernels:: | 
 |  | 
 |   local_lock_irq(&local_lock); | 
 |   spin_lock(&lock); | 
 |  | 
 | Another caveat with local locks is that each local_lock has a specific | 
 | protection scope. So the following substitution is wrong:: | 
 |  | 
 |   func1() | 
 |   { | 
 |     local_irq_save(flags);    -> local_lock_irqsave(&local_lock_1, flags); | 
 |     func3(); | 
 |     local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_1, flags); | 
 |   } | 
 |  | 
 |   func2() | 
 |   { | 
 |     local_irq_save(flags);    -> local_lock_irqsave(&local_lock_2, flags); | 
 |     func3(); | 
 |     local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_2, flags); | 
 |   } | 
 |  | 
 |   func3() | 
 |   { | 
 |     lockdep_assert_irqs_disabled(); | 
 |     access_protected_data(); | 
 |   } | 
 |  | 
 | On a non-PREEMPT_RT kernel this works correctly, but on a PREEMPT_RT kernel | 
 | local_lock_1 and local_lock_2 are distinct and cannot serialize the callers | 
 | of func3(). Also the lockdep assert will trigger on a PREEMPT_RT kernel | 
 | because local_lock_irqsave() does not disable interrupts due to the | 
 | PREEMPT_RT-specific semantics of spinlock_t. The correct substitution is:: | 
 |  | 
 |   func1() | 
 |   { | 
 |     local_irq_save(flags);    -> local_lock_irqsave(&local_lock, flags); | 
 |     func3(); | 
 |     local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags); | 
 |   } | 
 |  | 
 |   func2() | 
 |   { | 
 |     local_irq_save(flags);    -> local_lock_irqsave(&local_lock, flags); | 
 |     func3(); | 
 |     local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags); | 
 |   } | 
 |  | 
 |   func3() | 
 |   { | 
 |     lockdep_assert_held(&local_lock); | 
 |     access_protected_data(); | 
 |   } | 
 |  | 
 |  | 
 | spinlock_t and rwlock_t | 
 | ----------------------- | 
 |  | 
 | The changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels | 
 | have a few implications.  For example, on a non-PREEMPT_RT kernel the | 
 | following code sequence works as expected:: | 
 |  | 
 |    local_irq_disable(); | 
 |    spin_lock(&lock); | 
 |  | 
 | and is fully equivalent to:: | 
 |  | 
 |    spin_lock_irq(&lock); | 
 |  | 
The same applies to rwlock_t and the _irqsave() suffix variants.
 |  | 
On a PREEMPT_RT kernel this code sequence breaks because an RT-mutex requires a
 | fully preemptible context.  Instead, use spin_lock_irq() or | 
 | spin_lock_irqsave() and their unlock counterparts.  In cases where the | 
 | interrupt disabling and locking must remain separate, PREEMPT_RT offers a | 
 | local_lock mechanism.  Acquiring the local_lock pins the task to a CPU, | 
 | allowing things like per-CPU interrupt disabled locks to be acquired. | 
 | However, this approach should be used only where absolutely necessary. | 
 |  | 
 | A typical scenario is protection of per-CPU variables in thread context:: | 
 |  | 
 |   struct foo *p = get_cpu_ptr(&var1); | 
 |  | 
 |   spin_lock(&p->lock); | 
 |   p->count += this_cpu_read(var2); | 
 |  | 
 | This is correct code on a non-PREEMPT_RT kernel, but on a PREEMPT_RT kernel | 
this breaks. The PREEMPT_RT-specific change of spinlock_t semantics does
not allow acquiring p->lock because get_cpu_ptr() implicitly disables
preemption. The following substitution works on both kernels::
 |  | 
 |   struct foo *p; | 
 |  | 
 |   migrate_disable(); | 
 |   p = this_cpu_ptr(&var1); | 
 |   spin_lock(&p->lock); | 
 |   p->count += this_cpu_read(var2); | 
 |  | 
On a non-PREEMPT_RT kernel migrate_disable() maps to preempt_disable(),
which makes the above code fully equivalent. On a PREEMPT_RT kernel
migrate_disable() ensures that the task is pinned to the current CPU,
which in turn guarantees that the per-CPU accesses to var1 and var2 stay
on the same CPU.
 |  | 
 | The migrate_disable() substitution is not valid for the following | 
 | scenario:: | 
 |  | 
 |   func() | 
 |   { | 
 |     struct foo *p; | 
 |  | 
 |     migrate_disable(); | 
 |     p = this_cpu_ptr(&var1); | 
    p->val = func2();
  }
 |  | 
 | While correct on a non-PREEMPT_RT kernel, this breaks on PREEMPT_RT because | 
 | here migrate_disable() does not protect against reentrancy from a | 
 | preempting task. A correct substitution for this case is:: | 
 |  | 
 |   func() | 
 |   { | 
 |     struct foo *p; | 
 |  | 
 |     local_lock(&foo_lock); | 
 |     p = this_cpu_ptr(&var1); | 
    p->val = func2();
  }
 |  | 
 | On a non-PREEMPT_RT kernel this protects against reentrancy by disabling | 
 | preemption. On a PREEMPT_RT kernel this is achieved by acquiring the | 
 | underlying per-CPU spinlock. | 
 |  | 
 |  | 
 | raw_spinlock_t on RT | 
 | -------------------- | 
 |  | 
 | Acquiring a raw_spinlock_t disables preemption and possibly also | 
 | interrupts, so the critical section must avoid acquiring a regular | 
spinlock_t or rwlock_t; for example, the critical section must avoid
 | allocating memory.  Thus, on a non-PREEMPT_RT kernel the following code | 
 | works perfectly:: | 
 |  | 
 |   raw_spin_lock(&lock); | 
 |   p = kmalloc(sizeof(*p), GFP_ATOMIC); | 
 |  | 
 | But this code fails on PREEMPT_RT kernels because the memory allocator is | 
 | fully preemptible and therefore cannot be invoked from truly atomic | 
 | contexts.  However, it is perfectly fine to invoke the memory allocator | 
 | while holding normal non-raw spinlocks because they do not disable | 
 | preemption on PREEMPT_RT kernels:: | 
 |  | 
 |   spin_lock(&lock); | 
 |   p = kmalloc(sizeof(*p), GFP_ATOMIC); | 
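
If an object really must be installed under a raw_spinlock_t, one
possible pattern is to allocate it in preemptible context and only link
it inside the critical section (the struct, list and lock names are
illustrative)::

  struct foo *p;

  p = kmalloc(sizeof(*p), GFP_KERNEL);   /* preemptible task context */
  if (!p)
    return -ENOMEM;

  raw_spin_lock(&lock);
  list_add(&p->list, &foo_list);         /* only the tiny link step is atomic */
  raw_spin_unlock(&lock);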
 |  | 
 |  | 
 | bit spinlocks | 
 | ------------- | 
 |  | 
 | PREEMPT_RT cannot substitute bit spinlocks because a single bit is too | 
 | small to accommodate an RT-mutex.  Therefore, the semantics of bit | 
 | spinlocks are preserved on PREEMPT_RT kernels, so that the raw_spinlock_t | 
 | caveats also apply to bit spinlocks. | 
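
For illustration, a minimal bit spinlock sketch that uses bit 0 of an
existing word as the lock bit (names are illustrative); the
raw_spinlock_t caveats apply inside the critical section::

  static unsigned long foo_state;

  void foo_update(void)
  {
    bit_spin_lock(0, &foo_state);
    /* Tiny critical section: no sleeping locks, no memory allocation. */
    bit_spin_unlock(0, &foo_state);
  }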
 |  | 
 | Some bit spinlocks are replaced with regular spinlock_t for PREEMPT_RT | 
 | using conditional (#ifdef'ed) code changes at the usage site.  In contrast, | 
 | usage-site changes are not needed for the spinlock_t substitution. | 
Instead, conditionals in header files and the core locking implementation
 | enable the compiler to do the substitution transparently. | 
 |  | 
 |  | 
 | Lock type nesting rules | 
 | ======================= | 
 |  | 
 | The most basic rules are: | 
 |  | 
 |   - Lock types of the same lock category (sleeping, CPU local, spinning) | 
 |     can nest arbitrarily as long as they respect the general lock ordering | 
 |     rules to prevent deadlocks. | 
 |  | 
 |   - Sleeping lock types cannot nest inside CPU local and spinning lock types. | 
 |  | 
 |   - CPU local and spinning lock types can nest inside sleeping lock types. | 
 |  | 
  - Spinning lock types can nest inside all lock types.
 |  | 
 | These constraints apply both in PREEMPT_RT and otherwise. | 
 |  | 
 | The fact that PREEMPT_RT changes the lock category of spinlock_t and | 
 | rwlock_t from spinning to sleeping and substitutes local_lock with a | 
 | per-CPU spinlock_t means that they cannot be acquired while holding a raw | 
 | spinlock.  This results in the following nesting ordering: | 
 |  | 
 |   1) Sleeping locks | 
 |   2) spinlock_t, rwlock_t, local_lock | 
 |   3) raw_spinlock_t and bit spinlocks | 
 |  | 
 | Lockdep will complain if these constraints are violated, both in | 
 | PREEMPT_RT and otherwise. |
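
As an illustration, the following sketch (lock names are hypothetical)
respects this nesting order on both PREEMPT_RT and non-PREEMPT_RT
kernels::

  mutex_lock(&big_mutex);       /* 1) sleeping lock */
  spin_lock(&obj_lock);         /* 2) spinlock_t nests inside sleeping locks */
  raw_spin_lock(&hw_lock);      /* 3) raw_spinlock_t nests inside everything */

  raw_spin_unlock(&hw_lock);
  spin_unlock(&obj_lock);
  mutex_unlock(&big_mutex);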