| .. SPDX-License-Identifier: GPL-2.0 | 
 |  | 
 | .. _kernel_hacking_locktypes: | 
 |  | 
 | ========================== | 
 | Lock types and their rules | 
 | ========================== | 
 |  | 
 | Introduction | 
 | ============ | 
 |  | 
 | The kernel provides a variety of locking primitives which can be divided | 
 | into three categories: | 
 |  | 
 |  - Sleeping locks | 
 |  - CPU local locks | 
 |  - Spinning locks | 
 |  | 
 | This document conceptually describes these lock types and provides rules | 
 | for their nesting, including the rules for use under PREEMPT_RT. | 
 |  | 
 |  | 
 | Lock categories | 
 | =============== | 
 |  | 
 | Sleeping locks | 
 | -------------- | 
 |  | 
 | Sleeping locks can only be acquired in preemptible task context. | 
 |  | 
 | Although implementations allow try_lock() from other contexts, it is | 
 | necessary to carefully evaluate the safety of unlock() as well as of | 
 | try_lock().  Furthermore, it is also necessary to evaluate the debugging | 
 | versions of these primitives.  In short, don't acquire sleeping locks from | 
 | other contexts unless there is no other option. | 
 |  | 
 | Sleeping lock types: | 
 |  | 
 |  - mutex | 
 |  - rt_mutex | 
 |  - semaphore | 
 |  - rw_semaphore | 
 |  - ww_mutex | 
 |  - percpu_rw_semaphore | 
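
As an illustration, a minimal mutex sketch in preemptible task context
(the struct and field names are hypothetical)::

  struct foo {
    struct mutex lock;
    int counter;
  };

  void foo_update(struct foo *foo)
  {
    /* May sleep while waiting for the lock, so task context only. */
    mutex_lock(&foo->lock);
    foo->counter++;
    mutex_unlock(&foo->lock);
  }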
 |  | 
 | On PREEMPT_RT kernels, these lock types are converted to sleeping locks: | 
 |  | 
 |  - local_lock | 
 |  - spinlock_t | 
 |  - rwlock_t | 
 |  | 
 |  | 
 | CPU local locks | 
 | --------------- | 
 |  | 
 |  - local_lock | 
 |  | 
 | On non-PREEMPT_RT kernels, local_lock functions are wrappers around | 
preemption and interrupt disabling primitives. Contrary to other locking
mechanisms, disabling preemption or interrupts is a purely CPU-local
concurrency control mechanism and is not suited for inter-CPU concurrency
control.
 |  | 
 |  | 
 | Spinning locks | 
 | -------------- | 
 |  | 
 |  - raw_spinlock_t | 
 |  - bit spinlocks | 
 |  | 
 | On non-PREEMPT_RT kernels, these lock types are also spinning locks: | 
 |  | 
 |  - spinlock_t | 
 |  - rwlock_t | 
 |  | 
 | Spinning locks implicitly disable preemption and the lock / unlock functions | 
 | can have suffixes which apply further protections: | 
 |  | 
 |  ===================  ==================================================== | 
 |  _bh()                Disable / enable bottom halves (soft interrupts) | 
 |  _irq()               Disable / enable interrupts | 
 |  _irqsave/restore()   Save and disable / restore interrupt disabled state | 
 |  ===================  ==================================================== | 
 |  | 
 |  | 
 | Owner semantics | 
 | =============== | 
 |  | 
 | The aforementioned lock types except semaphores have strict owner | 
 | semantics: | 
 |  | 
 |   The context (task) that acquired the lock must release it. | 
 |  | 
 | rw_semaphores have a special interface which allows non-owner release for | 
 | readers. | 
 |  | 
 |  | 
 | rtmutex | 
 | ======= | 
 |  | 
 | RT-mutexes are mutexes with support for priority inheritance (PI). | 
 |  | 
 | PI has limitations on non-PREEMPT_RT kernels due to preemption and | 
 | interrupt disabled sections. | 
 |  | 
 | PI clearly cannot preempt preemption-disabled or interrupt-disabled | 
 | regions of code, even on PREEMPT_RT kernels.  Instead, PREEMPT_RT kernels | 
 | execute most such regions of code in preemptible task context, especially | 
 | interrupt handlers and soft interrupts.  This conversion allows spinlock_t | 
 | and rwlock_t to be implemented via RT-mutexes. | 
 |  | 
 |  | 
 | semaphore | 
 | ========= | 
 |  | 
 | semaphore is a counting semaphore implementation. | 
 |  | 
 | Semaphores are often used for both serialization and waiting, but new use | 
 | cases should instead use separate serialization and wait mechanisms, such | 
 | as mutexes and completions. | 
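
As an illustration of the recommended wait mechanism, a minimal
completion sketch (names are hypothetical)::

  static DECLARE_COMPLETION(foo_done);

  /* Waiter, preemptible task context. */
  void foo_wait(void)
  {
    wait_for_completion(&foo_done);
  }

  /* Signaler, e.g. from an interrupt handler or another task. */
  void foo_signal(void)
  {
    complete(&foo_done);
  }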
 |  | 
 | semaphores and PREEMPT_RT | 
 | ---------------------------- | 
 |  | 
 | PREEMPT_RT does not change the semaphore implementation because counting | 
 | semaphores have no concept of owners, thus preventing PREEMPT_RT from | 
 | providing priority inheritance for semaphores.  After all, an unknown | 
 | owner cannot be boosted. As a consequence, blocking on semaphores can | 
 | result in priority inversion. | 
 |  | 
 |  | 
 | rw_semaphore | 
 | ============ | 
 |  | 
 | rw_semaphore is a multiple readers and single writer lock mechanism. | 
 |  | 
 | On non-PREEMPT_RT kernels the implementation is fair, thus preventing | 
 | writer starvation. | 
 |  | 
 | rw_semaphore complies by default with the strict owner semantics, but there | 
 | exist special-purpose interfaces that allow non-owner release for readers. | 
These interfaces work independently of the kernel configuration.
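
For reference, a minimal sketch of the ordinary owner-style reader/writer
usage (the lock and data names are illustrative); the non-owner reader
interfaces are down_read_non_owner() and up_read_non_owner()::

  static DECLARE_RWSEM(foo_rwsem);
  static int foo_config;

  int foo_read_config(void)
  {
    int val;

    down_read(&foo_rwsem);    /* shared: multiple readers may enter */
    val = foo_config;
    up_read(&foo_rwsem);      /* released by the acquiring task */
    return val;
  }

  void foo_write_config(int val)
  {
    down_write(&foo_rwsem);   /* exclusive */
    foo_config = val;
    up_write(&foo_rwsem);
  }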
 |  | 
 | rw_semaphore and PREEMPT_RT | 
 | --------------------------- | 
 |  | 
 | PREEMPT_RT kernels map rw_semaphore to a separate rt_mutex-based | 
 | implementation, thus changing the fairness: | 
 |  | 
 |  Because an rw_semaphore writer cannot grant its priority to multiple | 
 |  readers, a preempted low-priority reader will continue holding its lock, | 
 |  thus starving even high-priority writers.  In contrast, because readers | 
 |  can grant their priority to a writer, a preempted low-priority writer will | 
 |  have its priority boosted until it releases the lock, thus preventing that | 
 |  writer from starving readers. | 
 |  | 
 |  | 
 | local_lock | 
 | ========== | 
 |  | 
 | local_lock provides a named scope to critical sections which are protected | 
 | by disabling preemption or interrupts. | 
 |  | 
 | On non-PREEMPT_RT kernels local_lock operations map to the preemption and | 
 | interrupt disabling and enabling primitives: | 
 |  | 
 |  ===============================  ====================== | 
 |  local_lock(&llock)               preempt_disable() | 
 |  local_unlock(&llock)             preempt_enable() | 
 |  local_lock_irq(&llock)           local_irq_disable() | 
 |  local_unlock_irq(&llock)         local_irq_enable() | 
 |  local_lock_irqsave(&llock)       local_irq_save() | 
 |  local_unlock_irqrestore(&llock)  local_irq_restore() | 
 |  ===============================  ====================== | 
 |  | 
 | The named scope of local_lock has two advantages over the regular | 
 | primitives: | 
 |  | 
  - The lock name allows static analysis and also clearly documents the
    protection scope, while the regular primitives are scopeless and
    opaque.

  - If lockdep is enabled, the local_lock gains a lockmap which allows
    validating the correctness of the protection. This can detect cases
    where, e.g., a function using preempt_disable() as a protection
    mechanism is invoked from interrupt or soft-interrupt context. Aside
    from that, lockdep_assert_held(&llock) works as with any other locking
    primitive.
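
For example, a minimal sketch of declaring and taking a local_lock for
per-CPU data (all names are illustrative)::

  struct foo_pcpu {
    local_lock_t lock;
    unsigned int count;
  };

  static DEFINE_PER_CPU(struct foo_pcpu, foo_pcpu) = {
    .lock = INIT_LOCAL_LOCK(lock),
  };

  void foo_count(void)
  {
    /* Named protection scope for this CPU's foo_pcpu instance. */
    local_lock(&foo_pcpu.lock);
    this_cpu_inc(foo_pcpu.count);
    local_unlock(&foo_pcpu.lock);
  }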
 |  | 
 | local_lock and PREEMPT_RT | 
 | ------------------------- | 
 |  | 
 | PREEMPT_RT kernels map local_lock to a per-CPU spinlock_t, thus changing | 
 | semantics: | 
 |  | 
 |   - All spinlock_t changes also apply to local_lock. | 
 |  | 
 | local_lock usage | 
 | ---------------- | 
 |  | 
local_lock should be used in situations where disabling preemption or
interrupts is the appropriate form of concurrency control to protect
per-CPU data structures on a non-PREEMPT_RT kernel.
 |  | 
 | local_lock is not suitable to protect against preemption or interrupts on a | 
 | PREEMPT_RT kernel due to the PREEMPT_RT specific spinlock_t semantics. | 
 |  | 
 |  | 
 | raw_spinlock_t and spinlock_t | 
 | ============================= | 
 |  | 
 | raw_spinlock_t | 
 | -------------- | 
 |  | 
 | raw_spinlock_t is a strict spinning lock implementation in all kernels, | 
 | including PREEMPT_RT kernels.  Use raw_spinlock_t only in real critical | 
 | core code, low-level interrupt handling and places where disabling | 
 | preemption or interrupts is required, for example, to safely access | 
 | hardware state.  raw_spinlock_t can sometimes also be used when the | 
 | critical section is tiny, thus avoiding RT-mutex overhead. | 
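
A minimal sketch of a tiny, truly atomic critical section protected by a
raw_spinlock_t (the lock, register and value names are illustrative)::

  static DEFINE_RAW_SPINLOCK(hw_lock);

  void hw_update(void __iomem *reg, u32 val)
  {
    unsigned long flags;

    /* Truly atomic on all kernels, including PREEMPT_RT. */
    raw_spin_lock_irqsave(&hw_lock, flags);
    writel(val, reg);
    raw_spin_unlock_irqrestore(&hw_lock, flags);
  }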
 |  | 
 | spinlock_t | 
 | ---------- | 
 |  | 
 | The semantics of spinlock_t change with the state of PREEMPT_RT. | 
 |  | 
 | On a non-PREEMPT_RT kernel spinlock_t is mapped to raw_spinlock_t and has | 
 | exactly the same semantics. | 
 |  | 
 | spinlock_t and PREEMPT_RT | 
 | ------------------------- | 
 |  | 
 | On a PREEMPT_RT kernel spinlock_t is mapped to a separate implementation | 
 | based on rt_mutex which changes the semantics: | 
 |  | 
 |  - Preemption is not disabled. | 
 |  | 
 |  - The hard interrupt related suffixes for spin_lock / spin_unlock | 
 |    operations (_irq, _irqsave / _irqrestore) do not affect the CPU's | 
 |    interrupt disabled state. | 
 |  | 
 |  - The soft interrupt related suffix (_bh()) still disables softirq | 
 |    handlers. | 
 |  | 
 |    Non-PREEMPT_RT kernels disable preemption to get this effect. | 
 |  | 
   PREEMPT_RT kernels use a per-CPU lock for serialization which keeps
   preemption enabled. The lock disables softirq handlers and also
   prevents reentrancy due to task preemption.
 |  | 
 | PREEMPT_RT kernels preserve all other spinlock_t semantics: | 
 |  | 
 |  - Tasks holding a spinlock_t do not migrate.  Non-PREEMPT_RT kernels | 
 |    avoid migration by disabling preemption.  PREEMPT_RT kernels instead | 
 |    disable migration, which ensures that pointers to per-CPU variables | 
 |    remain valid even if the task is preempted. | 
 |  | 
 |  - Task state is preserved across spinlock acquisition, ensuring that the | 
 |    task-state rules apply to all kernel configurations.  Non-PREEMPT_RT | 
 |    kernels leave task state untouched.  However, PREEMPT_RT must change | 
 |    task state if the task blocks during acquisition.  Therefore, it saves | 
 |    the current task state before blocking and the corresponding lock wakeup | 
 |    restores it, as shown below:: | 
 |  | 
    task->state = TASK_INTERRUPTIBLE
     lock()
       block()
         task->saved_state = task->state
         task->state = TASK_UNINTERRUPTIBLE
         schedule()
                                        lock wakeup
                                          task->state = task->saved_state
 |  | 
 |    Other types of wakeups would normally unconditionally set the task state | 
 |    to RUNNING, but that does not work here because the task must remain | 
 |    blocked until the lock becomes available.  Therefore, when a non-lock | 
 |    wakeup attempts to awaken a task blocked waiting for a spinlock, it | 
 |    instead sets the saved state to RUNNING.  Then, when the lock | 
 |    acquisition completes, the lock wakeup sets the task state to the saved | 
 |    state, in this case setting it to RUNNING:: | 
 |  | 
    task->state = TASK_INTERRUPTIBLE
     lock()
       block()
         task->saved_state = task->state
         task->state = TASK_UNINTERRUPTIBLE
         schedule()
                                        non lock wakeup
                                          task->saved_state = TASK_RUNNING

                                        lock wakeup
                                          task->state = task->saved_state
 |  | 
 |    This ensures that the real wakeup cannot be lost. | 
 |  | 
 |  | 
 | rwlock_t | 
 | ======== | 
 |  | 
 | rwlock_t is a multiple readers and single writer lock mechanism. | 
 |  | 
 | Non-PREEMPT_RT kernels implement rwlock_t as a spinning lock and the | 
 | suffix rules of spinlock_t apply accordingly. The implementation is fair, | 
 | thus preventing writer starvation. | 
 |  | 
 | rwlock_t and PREEMPT_RT | 
 | ----------------------- | 
 |  | 
 | PREEMPT_RT kernels map rwlock_t to a separate rt_mutex-based | 
 | implementation, thus changing semantics: | 
 |  | 
 |  - All the spinlock_t changes also apply to rwlock_t. | 
 |  | 
 |  - Because an rwlock_t writer cannot grant its priority to multiple | 
 |    readers, a preempted low-priority reader will continue holding its lock, | 
 |    thus starving even high-priority writers.  In contrast, because readers | 
 |    can grant their priority to a writer, a preempted low-priority writer | 
 |    will have its priority boosted until it releases the lock, thus | 
 |    preventing that writer from starving readers. | 
 |  | 
 |  | 
 | PREEMPT_RT caveats | 
 | ================== | 
 |  | 
 | local_lock on RT | 
 | ---------------- | 
 |  | 
 | The mapping of local_lock to spinlock_t on PREEMPT_RT kernels has a few | 
 | implications. For example, on a non-PREEMPT_RT kernel the following code | 
 | sequence works as expected:: | 
 |  | 
 |   local_lock_irq(&local_lock); | 
 |   raw_spin_lock(&lock); | 
 |  | 
 | and is fully equivalent to:: | 
 |  | 
 |    raw_spin_lock_irq(&lock); | 
 |  | 
 | On a PREEMPT_RT kernel this code sequence breaks because local_lock_irq() | 
 | is mapped to a per-CPU spinlock_t which neither disables interrupts nor | 
preemption. The following code sequence works correctly on both
 | PREEMPT_RT and non-PREEMPT_RT kernels:: | 
 |  | 
 |   local_lock_irq(&local_lock); | 
 |   spin_lock(&lock); | 
 |  | 
 | Another caveat with local locks is that each local_lock has a specific | 
 | protection scope. So the following substitution is wrong:: | 
 |  | 
 |   func1() | 
 |   { | 
 |     local_irq_save(flags);    -> local_lock_irqsave(&local_lock_1, flags); | 
 |     func3(); | 
 |     local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_1, flags); | 
 |   } | 
 |  | 
 |   func2() | 
 |   { | 
 |     local_irq_save(flags);    -> local_lock_irqsave(&local_lock_2, flags); | 
 |     func3(); | 
 |     local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_2, flags); | 
 |   } | 
 |  | 
 |   func3() | 
 |   { | 
 |     lockdep_assert_irqs_disabled(); | 
 |     access_protected_data(); | 
 |   } | 
 |  | 
 | On a non-PREEMPT_RT kernel this works correctly, but on a PREEMPT_RT kernel | 
 | local_lock_1 and local_lock_2 are distinct and cannot serialize the callers | 
 | of func3(). Also the lockdep assert will trigger on a PREEMPT_RT kernel | 
 | because local_lock_irqsave() does not disable interrupts due to the | 
 | PREEMPT_RT-specific semantics of spinlock_t. The correct substitution is:: | 
 |  | 
 |   func1() | 
 |   { | 
 |     local_irq_save(flags);    -> local_lock_irqsave(&local_lock, flags); | 
 |     func3(); | 
 |     local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags); | 
 |   } | 
 |  | 
 |   func2() | 
 |   { | 
 |     local_irq_save(flags);    -> local_lock_irqsave(&local_lock, flags); | 
 |     func3(); | 
 |     local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags); | 
 |   } | 
 |  | 
 |   func3() | 
 |   { | 
 |     lockdep_assert_held(&local_lock); | 
 |     access_protected_data(); | 
 |   } | 
 |  | 
 |  | 
 | spinlock_t and rwlock_t | 
 | ----------------------- | 
 |  | 
 | The changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels | 
 | have a few implications.  For example, on a non-PREEMPT_RT kernel the | 
 | following code sequence works as expected:: | 
 |  | 
 |    local_irq_disable(); | 
 |    spin_lock(&lock); | 
 |  | 
 | and is fully equivalent to:: | 
 |  | 
 |    spin_lock_irq(&lock); | 
 |  | 
The same applies to rwlock_t and the _irqsave() suffix variants.
 |  | 
On a PREEMPT_RT kernel this code sequence breaks because an RT-mutex requires a
 | fully preemptible context.  Instead, use spin_lock_irq() or | 
 | spin_lock_irqsave() and their unlock counterparts.  In cases where the | 
 | interrupt disabling and locking must remain separate, PREEMPT_RT offers a | 
 | local_lock mechanism.  Acquiring the local_lock pins the task to a CPU, | 
 | allowing things like per-CPU interrupt disabled locks to be acquired. | 
 | However, this approach should be used only where absolutely necessary. | 
 |  | 
 | A typical scenario is protection of per-CPU variables in thread context:: | 
 |  | 
 |   struct foo *p = get_cpu_ptr(&var1); | 
 |  | 
 |   spin_lock(&p->lock); | 
 |   p->count += this_cpu_read(var2); | 
 |  | 
 | This is correct code on a non-PREEMPT_RT kernel, but on a PREEMPT_RT kernel | 
this breaks. The PREEMPT_RT-specific change of spinlock_t semantics does
not allow acquiring p->lock because get_cpu_ptr() implicitly disables
preemption. The following substitution works on both kernels::
 |  | 
 |   struct foo *p; | 
 |  | 
 |   migrate_disable(); | 
 |   p = this_cpu_ptr(&var1); | 
 |   spin_lock(&p->lock); | 
 |   p->count += this_cpu_read(var2); | 
 |  | 
On a non-PREEMPT_RT kernel migrate_disable() maps to preempt_disable(),
which makes the above code fully equivalent. On a PREEMPT_RT kernel
migrate_disable() ensures that the task is pinned to the current CPU,
which in turn guarantees that the per-CPU accesses to var1 and var2 stay
on the same CPU.
 |  | 
 | The migrate_disable() substitution is not valid for the following | 
 | scenario:: | 
 |  | 
 |   func() | 
 |   { | 
 |     struct foo *p; | 
 |  | 
 |     migrate_disable(); | 
 |     p = this_cpu_ptr(&var1); | 
    p->val = func2();
  }
 |  | 
 | While correct on a non-PREEMPT_RT kernel, this breaks on PREEMPT_RT because | 
 | here migrate_disable() does not protect against reentrancy from a | 
 | preempting task. A correct substitution for this case is:: | 
 |  | 
 |   func() | 
 |   { | 
 |     struct foo *p; | 
 |  | 
 |     local_lock(&foo_lock); | 
 |     p = this_cpu_ptr(&var1); | 
    p->val = func2();
  }
 |  | 
 | On a non-PREEMPT_RT kernel this protects against reentrancy by disabling | 
 | preemption. On a PREEMPT_RT kernel this is achieved by acquiring the | 
 | underlying per-CPU spinlock. | 
 |  | 
 |  | 
 | raw_spinlock_t on RT | 
 | -------------------- | 
 |  | 
 | Acquiring a raw_spinlock_t disables preemption and possibly also | 
 | interrupts, so the critical section must avoid acquiring a regular | 
spinlock_t or rwlock_t; for example, the critical section must avoid
 | allocating memory.  Thus, on a non-PREEMPT_RT kernel the following code | 
 | works perfectly:: | 
 |  | 
 |   raw_spin_lock(&lock); | 
 |   p = kmalloc(sizeof(*p), GFP_ATOMIC); | 
 |  | 
 | But this code fails on PREEMPT_RT kernels because the memory allocator is | 
 | fully preemptible and therefore cannot be invoked from truly atomic | 
 | contexts.  However, it is perfectly fine to invoke the memory allocator | 
 | while holding normal non-raw spinlocks because they do not disable | 
 | preemption on PREEMPT_RT kernels:: | 
 |  | 
 |   spin_lock(&lock); | 
 |   p = kmalloc(sizeof(*p), GFP_ATOMIC); | 
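
If an object really must be installed under a raw_spinlock_t, one
possible pattern is to allocate it in preemptible context and only link
it inside the critical section (the struct, list and lock names are
illustrative)::

  struct foo *p;

  p = kmalloc(sizeof(*p), GFP_KERNEL);   /* preemptible task context */
  if (!p)
    return -ENOMEM;

  raw_spin_lock(&lock);
  list_add(&p->list, &foo_list);         /* only the tiny link step is atomic */
  raw_spin_unlock(&lock);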
 |  | 
 |  | 
 | bit spinlocks | 
 | ------------- | 
 |  | 
 | PREEMPT_RT cannot substitute bit spinlocks because a single bit is too | 
 | small to accommodate an RT-mutex.  Therefore, the semantics of bit | 
 | spinlocks are preserved on PREEMPT_RT kernels, so that the raw_spinlock_t | 
 | caveats also apply to bit spinlocks. | 
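
For illustration, a minimal bit spinlock sketch that uses bit 0 of an
existing word as the lock bit (names are illustrative); the
raw_spinlock_t caveats apply inside the critical section::

  static unsigned long foo_state;

  void foo_update(void)
  {
    bit_spin_lock(0, &foo_state);
    /* Tiny critical section: no sleeping locks, no memory allocation. */
    bit_spin_unlock(0, &foo_state);
  }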
 |  | 
 | Some bit spinlocks are replaced with regular spinlock_t for PREEMPT_RT | 
 | using conditional (#ifdef'ed) code changes at the usage site.  In contrast, | 
 | usage-site changes are not needed for the spinlock_t substitution. | 
Instead, conditionals in header files and the core locking implementation
 | enable the compiler to do the substitution transparently. | 
 |  | 
 |  | 
 | Lock type nesting rules | 
 | ======================= | 
 |  | 
 | The most basic rules are: | 
 |  | 
 |   - Lock types of the same lock category (sleeping, CPU local, spinning) | 
 |     can nest arbitrarily as long as they respect the general lock ordering | 
 |     rules to prevent deadlocks. | 
 |  | 
 |   - Sleeping lock types cannot nest inside CPU local and spinning lock types. | 
 |  | 
 |   - CPU local and spinning lock types can nest inside sleeping lock types. | 
 |  | 
  - Spinning lock types can nest inside all lock types.
 |  | 
 | These constraints apply both in PREEMPT_RT and otherwise. | 
 |  | 
 | The fact that PREEMPT_RT changes the lock category of spinlock_t and | 
 | rwlock_t from spinning to sleeping and substitutes local_lock with a | 
 | per-CPU spinlock_t means that they cannot be acquired while holding a raw | 
 | spinlock.  This results in the following nesting ordering: | 
 |  | 
 |   1) Sleeping locks | 
 |   2) spinlock_t, rwlock_t, local_lock | 
 |   3) raw_spinlock_t and bit spinlocks | 
 |  | 
 | Lockdep will complain if these constraints are violated, both in | 
 | PREEMPT_RT and otherwise. |
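
As an illustration, the following sketch (lock names are hypothetical)
respects this nesting order on both PREEMPT_RT and non-PREEMPT_RT
kernels::

  mutex_lock(&big_mutex);       /* 1) sleeping lock */
  spin_lock(&obj_lock);         /* 2) spinlock_t nests inside sleeping locks */
  raw_spin_lock(&hw_lock);      /* 3) raw_spinlock_t nests inside everything */

  raw_spin_unlock(&hw_lock);
  spin_unlock(&obj_lock);
  mutex_unlock(&big_mutex);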