| .. SPDX-License-Identifier: GPL-2.0 |
| |
| ================================ |
| Review Checklist for RCU Patches |
| ================================ |
| |
| |
| This document contains a checklist for producing and reviewing patches |
| that make use of RCU. Violating any of the rules listed below will |
| result in the same sorts of problems that leaving out a locking primitive |
| would cause. This list is based on experiences reviewing such patches |
| over a rather long period of time, but improvements are always welcome! |
| |
| 0. Is RCU being applied to a read-mostly situation? If the data |
| structure is updated more than about 10% of the time, then you |
| should strongly consider some other approach, unless detailed |
| performance measurements show that RCU is nonetheless the right |
| tool for the job. Yes, RCU does reduce read-side overhead by |
| increasing write-side overhead, which is exactly why normal uses |
| of RCU will do much more reading than updating. |
| |
| Another exception is where performance is not an issue, and RCU |
| provides a simpler implementation. An example of this situation |
| is the dynamic NMI code in the Linux 2.6 kernel, at least on |
| architectures where NMIs are rare. |
| |
| Yet another exception is where the low real-time latency of RCU's |
| read-side primitives is critically important. |
| |
| One final exception is where RCU readers are used to prevent |
| the ABA problem (https://en.wikipedia.org/wiki/ABA_problem) |
| for lockless updates. This does result in the mildly |
| counter-intuitive situation where rcu_read_lock() and |
| rcu_read_unlock() are used to protect updates, however, this |
| approach can provide the same simplifications to certain types |
| of lockless algorithms that garbage collectors do. |
| |
| 1. Does the update code have proper mutual exclusion? |
| |
| RCU does allow *readers* to run (almost) naked, but *writers* must |
| still use some sort of mutual exclusion, such as: |
| |
| a. locking, |
| b. atomic operations, or |
| c. restricting updates to a single task. |
| |
| If you choose #b, be prepared to describe how you have handled |
| memory barriers on weakly ordered machines (pretty much all of |
| them -- even x86 allows later loads to be reordered to precede |
| earlier stores), and be prepared to explain why this added |
| complexity is worthwhile. If you choose #c, be prepared to |
| explain how this single task does not become a major bottleneck |
| on large systems (for example, if the task is updating information |
| relating to itself that other tasks can read, there by definition |
| can be no bottleneck). Note that the definition of "large" has |
| changed significantly: Eight CPUs was "large" in the year 2000, |
| but a hundred CPUs was unremarkable in 2017. |
| |
| 2. Do the RCU read-side critical sections make proper use of |
| rcu_read_lock() and friends? These primitives are needed |
| to prevent grace periods from ending prematurely, which |
| could result in data being unceremoniously freed out from |
| under your read-side code, which can greatly increase the |
| actuarial risk of your kernel. |
| |
| As a rough rule of thumb, any dereference of an RCU-protected |
| pointer must be covered by rcu_read_lock(), rcu_read_lock_bh(), |
| rcu_read_lock_sched(), or by the appropriate update-side lock. |
| Explicit disabling of preemption (preempt_disable(), for example) |
| can serve as rcu_read_lock_sched(), but is less readable and |
| prevents lockdep from detecting locking issues. |
| |
| Please not that you *cannot* rely on code known to be built |
| only in non-preemptible kernels. Such code can and will break, |
| especially in kernels built with CONFIG_PREEMPT_COUNT=y. |
| |
| Letting RCU-protected pointers "leak" out of an RCU read-side |
| critical section is every bit as bad as letting them leak out |
| from under a lock. Unless, of course, you have arranged some |
| other means of protection, such as a lock or a reference count |
| *before* letting them out of the RCU read-side critical section. |
| |
| 3. Does the update code tolerate concurrent accesses? |
| |
| The whole point of RCU is to permit readers to run without |
| any locks or atomic operations. This means that readers will |
| be running while updates are in progress. There are a number |
| of ways to handle this concurrency, depending on the situation: |
| |
| a. Use the RCU variants of the list and hlist update |
| primitives to add, remove, and replace elements on |
| an RCU-protected list. Alternatively, use the other |
| RCU-protected data structures that have been added to |
| the Linux kernel. |
| |
| This is almost always the best approach. |
| |
| b. Proceed as in (a) above, but also maintain per-element |
| locks (that are acquired by both readers and writers) |
| that guard per-element state. Fields that the readers |
| refrain from accessing can be guarded by some other lock |
| acquired only by updaters, if desired. |
| |
| This also works quite well. |
| |
| c. Make updates appear atomic to readers. For example, |
| pointer updates to properly aligned fields will |
| appear atomic, as will individual atomic primitives. |
| Sequences of operations performed under a lock will *not* |
| appear to be atomic to RCU readers, nor will sequences |
| of multiple atomic primitives. One alternative is to |
| move multiple individual fields to a separate structure, |
| thus solving the multiple-field problem by imposing an |
| additional level of indirection. |
| |
| This can work, but is starting to get a bit tricky. |
| |
| d. Carefully order the updates and the reads so that readers |
| see valid data at all phases of the update. This is often |
| more difficult than it sounds, especially given modern |
| CPUs' tendency to reorder memory references. One must |
| usually liberally sprinkle memory-ordering operations |
| through the code, making it difficult to understand and |
| to test. Where it works, it is better to use things |
| like smp_store_release() and smp_load_acquire(), but in |
| some cases the smp_mb() full memory barrier is required. |
| |
| As noted earlier, it is usually better to group the |
| changing data into a separate structure, so that the |
| change may be made to appear atomic by updating a pointer |
| to reference a new structure containing updated values. |
| |
| 4. Weakly ordered CPUs pose special challenges. Almost all CPUs |
| are weakly ordered -- even x86 CPUs allow later loads to be |
| reordered to precede earlier stores. RCU code must take all of |
| the following measures to prevent memory-corruption problems: |
| |
| a. Readers must maintain proper ordering of their memory |
| accesses. The rcu_dereference() primitive ensures that |
| the CPU picks up the pointer before it picks up the data |
| that the pointer points to. This really is necessary |
| on Alpha CPUs. |
| |
| The rcu_dereference() primitive is also an excellent |
| documentation aid, letting the person reading the |
| code know exactly which pointers are protected by RCU. |
| Please note that compilers can also reorder code, and |
| they are becoming increasingly aggressive about doing |
| just that. The rcu_dereference() primitive therefore also |
| prevents destructive compiler optimizations. However, |
| with a bit of devious creativity, it is possible to |
| mishandle the return value from rcu_dereference(). |
| Please see rcu_dereference.rst for more information. |
| |
| The rcu_dereference() primitive is used by the |
| various "_rcu()" list-traversal primitives, such |
| as the list_for_each_entry_rcu(). Note that it is |
| perfectly legal (if redundant) for update-side code to |
| use rcu_dereference() and the "_rcu()" list-traversal |
| primitives. This is particularly useful in code that |
| is common to readers and updaters. However, lockdep |
| will complain if you access rcu_dereference() outside |
| of an RCU read-side critical section. See lockdep.rst |
| to learn what to do about this. |
| |
| Of course, neither rcu_dereference() nor the "_rcu()" |
| list-traversal primitives can substitute for a good |
| concurrency design coordinating among multiple updaters. |
| |
| b. If the list macros are being used, the list_add_tail_rcu() |
| and list_add_rcu() primitives must be used in order |
| to prevent weakly ordered machines from misordering |
| structure initialization and pointer planting. |
| Similarly, if the hlist macros are being used, the |
| hlist_add_head_rcu() primitive is required. |
| |
| c. If the list macros are being used, the list_del_rcu() |
| primitive must be used to keep list_del()'s pointer |
| poisoning from inflicting toxic effects on concurrent |
| readers. Similarly, if the hlist macros are being used, |
| the hlist_del_rcu() primitive is required. |
| |
| The list_replace_rcu() and hlist_replace_rcu() primitives |
| may be used to replace an old structure with a new one |
| in their respective types of RCU-protected lists. |
| |
| d. Rules similar to (4b) and (4c) apply to the "hlist_nulls" |
| type of RCU-protected linked lists. |
| |
| e. Updates must ensure that initialization of a given |
| structure happens before pointers to that structure are |
| publicized. Use the rcu_assign_pointer() primitive |
| when publicizing a pointer to a structure that can |
| be traversed by an RCU read-side critical section. |
| |
| 5. If any of call_rcu(), call_srcu(), call_rcu_tasks(), |
| call_rcu_tasks_rude(), or call_rcu_tasks_trace() is used, |
| the callback function may be invoked from softirq context, |
| and in any case with bottom halves disabled. In particular, |
| this callback function cannot block. If you need the callback |
| to block, run that code in a workqueue handler scheduled from |
| the callback. The queue_rcu_work() function does this for you |
| in the case of call_rcu(). |
| |
| 6. Since synchronize_rcu() can block, it cannot be called |
| from any sort of irq context. The same rule applies |
| for synchronize_srcu(), synchronize_rcu_expedited(), |
| synchronize_srcu_expedited(), synchronize_rcu_tasks(), |
| synchronize_rcu_tasks_rude(), and synchronize_rcu_tasks_trace(). |
| |
| The expedited forms of these primitives have the same semantics |
| as the non-expedited forms, but expediting is more CPU intensive. |
| Use of the expedited primitives should be restricted to rare |
| configuration-change operations that would not normally be |
| undertaken while a real-time workload is running. Note that |
| IPI-sensitive real-time workloads can use the rcupdate.rcu_normal |
| kernel boot parameter to completely disable expedited grace |
| periods, though this might have performance implications. |
| |
| In particular, if you find yourself invoking one of the expedited |
| primitives repeatedly in a loop, please do everyone a favor: |
| Restructure your code so that it batches the updates, allowing |
| a single non-expedited primitive to cover the entire batch. |
| This will very likely be faster than the loop containing the |
| expedited primitive, and will be much much easier on the rest |
| of the system, especially to real-time workloads running on the |
| rest of the system. Alternatively, instead use asynchronous |
| primitives such as call_rcu(). |
| |
| 7. As of v4.20, a given kernel implements only one RCU flavor, which |
| is RCU-sched for PREEMPTION=n and RCU-preempt for PREEMPTION=y. |
| If the updater uses call_rcu() or synchronize_rcu(), then |
| the corresponding readers may use: (1) rcu_read_lock() and |
| rcu_read_unlock(), (2) any pair of primitives that disables |
| and re-enables softirq, for example, rcu_read_lock_bh() and |
| rcu_read_unlock_bh(), or (3) any pair of primitives that disables |
| and re-enables preemption, for example, rcu_read_lock_sched() and |
| rcu_read_unlock_sched(). If the updater uses synchronize_srcu() |
| or call_srcu(), then the corresponding readers must use |
| srcu_read_lock() and srcu_read_unlock(), and with the same |
| srcu_struct. The rules for the expedited RCU grace-period-wait |
| primitives are the same as for their non-expedited counterparts. |
| |
| If the updater uses call_rcu_tasks() or synchronize_rcu_tasks(), |
| then the readers must refrain from executing voluntary |
| context switches, that is, from blocking. If the updater uses |
| call_rcu_tasks_trace() or synchronize_rcu_tasks_trace(), then |
| the corresponding readers must use rcu_read_lock_trace() and |
| rcu_read_unlock_trace(). If an updater uses call_rcu_tasks_rude() |
| or synchronize_rcu_tasks_rude(), then the corresponding readers |
| must use anything that disables preemption, for example, |
| preempt_disable() and preempt_enable(). |
| |
| Mixing things up will result in confusion and broken kernels, and |
| has even resulted in an exploitable security issue. Therefore, |
| when using non-obvious pairs of primitives, commenting is |
| of course a must. One example of non-obvious pairing is |
| the XDP feature in networking, which calls BPF programs from |
| network-driver NAPI (softirq) context. BPF relies heavily on RCU |
| protection for its data structures, but because the BPF program |
| invocation happens entirely within a single local_bh_disable() |
| section in a NAPI poll cycle, this usage is safe. The reason |
| that this usage is safe is that readers can use anything that |
| disables BH when updaters use call_rcu() or synchronize_rcu(). |
| |
| 8. Although synchronize_rcu() is slower than is call_rcu(), |
| it usually results in simpler code. So, unless update |
| performance is critically important, the updaters cannot block, |
| or the latency of synchronize_rcu() is visible from userspace, |
| synchronize_rcu() should be used in preference to call_rcu(). |
| Furthermore, kfree_rcu() and kvfree_rcu() usually result |
| in even simpler code than does synchronize_rcu() without |
| synchronize_rcu()'s multi-millisecond latency. So please take |
| advantage of kfree_rcu()'s and kvfree_rcu()'s "fire and forget" |
| memory-freeing capabilities where it applies. |
| |
| An especially important property of the synchronize_rcu() |
| primitive is that it automatically self-limits: if grace periods |
| are delayed for whatever reason, then the synchronize_rcu() |
| primitive will correspondingly delay updates. In contrast, |
| code using call_rcu() should explicitly limit update rate in |
| cases where grace periods are delayed, as failing to do so can |
| result in excessive realtime latencies or even OOM conditions. |
| |
| Ways of gaining this self-limiting property when using call_rcu(), |
| kfree_rcu(), or kvfree_rcu() include: |
| |
| a. Keeping a count of the number of data-structure elements |
| used by the RCU-protected data structure, including |
| those waiting for a grace period to elapse. Enforce a |
| limit on this number, stalling updates as needed to allow |
| previously deferred frees to complete. Alternatively, |
| limit only the number awaiting deferred free rather than |
| the total number of elements. |
| |
| One way to stall the updates is to acquire the update-side |
| mutex. (Don't try this with a spinlock -- other CPUs |
| spinning on the lock could prevent the grace period |
| from ever ending.) Another way to stall the updates |
| is for the updates to use a wrapper function around |
| the memory allocator, so that this wrapper function |
| simulates OOM when there is too much memory awaiting an |
| RCU grace period. There are of course many other |
| variations on this theme. |
| |
| b. Limiting update rate. For example, if updates occur only |
| once per hour, then no explicit rate limiting is |
| required, unless your system is already badly broken. |
| Older versions of the dcache subsystem take this approach, |
| guarding updates with a global lock, limiting their rate. |
| |
| c. Trusted update -- if updates can only be done manually by |
| superuser or some other trusted user, then it might not |
| be necessary to automatically limit them. The theory |
| here is that superuser already has lots of ways to crash |
| the machine. |
| |
| d. Periodically invoke rcu_barrier(), permitting a limited |
| number of updates per grace period. |
| |
| The same cautions apply to call_srcu(), call_rcu_tasks(), |
| call_rcu_tasks_rude(), and call_rcu_tasks_trace(). This is |
| why there is an srcu_barrier(), rcu_barrier_tasks(), |
| rcu_barrier_tasks_rude(), and rcu_barrier_tasks_rude(), |
| respectively. |
| |
| Note that although these primitives do take action to avoid |
| memory exhaustion when any given CPU has too many callbacks, |
| a determined user or administrator can still exhaust memory. |
| This is especially the case if a system with a large number of |
| CPUs has been configured to offload all of its RCU callbacks onto |
| a single CPU, or if the system has relatively little free memory. |
| |
| 9. All RCU list-traversal primitives, which include |
| rcu_dereference(), list_for_each_entry_rcu(), and |
| list_for_each_safe_rcu(), must be either within an RCU read-side |
| critical section or must be protected by appropriate update-side |
| locks. RCU read-side critical sections are delimited by |
| rcu_read_lock() and rcu_read_unlock(), or by similar primitives |
| such as rcu_read_lock_bh() and rcu_read_unlock_bh(), in which |
| case the matching rcu_dereference() primitive must be used in |
| order to keep lockdep happy, in this case, rcu_dereference_bh(). |
| |
| The reason that it is permissible to use RCU list-traversal |
| primitives when the update-side lock is held is that doing so |
| can be quite helpful in reducing code bloat when common code is |
| shared between readers and updaters. Additional primitives |
| are provided for this case, as discussed in lockdep.rst. |
| |
| One exception to this rule is when data is only ever added to |
| the linked data structure, and is never removed during any |
| time that readers might be accessing that structure. In such |
| cases, READ_ONCE() may be used in place of rcu_dereference() |
| and the read-side markers (rcu_read_lock() and rcu_read_unlock(), |
| for example) may be omitted. |
| |
| 10. Conversely, if you are in an RCU read-side critical section, |
| and you don't hold the appropriate update-side lock, you *must* |
| use the "_rcu()" variants of the list macros. Failing to do so |
| will break Alpha, cause aggressive compilers to generate bad code, |
| and confuse people trying to understand your code. |
| |
| 11. Any lock acquired by an RCU callback must be acquired elsewhere |
| with softirq disabled, e.g., via spin_lock_bh(). Failing to |
| disable softirq on a given acquisition of that lock will result |
| in deadlock as soon as the RCU softirq handler happens to run |
| your RCU callback while interrupting that acquisition's critical |
| section. |
| |
| 12. RCU callbacks can be and are executed in parallel. In many cases, |
| the callback code simply wrappers around kfree(), so that this |
| is not an issue (or, more accurately, to the extent that it is |
| an issue, the memory-allocator locking handles it). However, |
| if the callbacks do manipulate a shared data structure, they |
| must use whatever locking or other synchronization is required |
| to safely access and/or modify that data structure. |
| |
| Do not assume that RCU callbacks will be executed on the same |
| CPU that executed the corresponding call_rcu() or call_srcu(). |
| For example, if a given CPU goes offline while having an RCU |
| callback pending, then that RCU callback will execute on some |
| surviving CPU. (If this was not the case, a self-spawning RCU |
| callback would prevent the victim CPU from ever going offline.) |
| Furthermore, CPUs designated by rcu_nocbs= might well *always* |
| have their RCU callbacks executed on some other CPUs, in fact, |
| for some real-time workloads, this is the whole point of using |
| the rcu_nocbs= kernel boot parameter. |
| |
| In addition, do not assume that callbacks queued in a given order |
| will be invoked in that order, even if they all are queued on the |
| same CPU. Furthermore, do not assume that same-CPU callbacks will |
| be invoked serially. For example, in recent kernels, CPUs can be |
| switched between offloaded and de-offloaded callback invocation, |
| and while a given CPU is undergoing such a switch, its callbacks |
| might be concurrently invoked by that CPU's softirq handler and |
| that CPU's rcuo kthread. At such times, that CPU's callbacks |
| might be executed both concurrently and out of order. |
| |
| 13. Unlike most flavors of RCU, it *is* permissible to block in an |
| SRCU read-side critical section (demarked by srcu_read_lock() |
| and srcu_read_unlock()), hence the "SRCU": "sleepable RCU". |
| Please note that if you don't need to sleep in read-side critical |
| sections, you should be using RCU rather than SRCU, because RCU |
| is almost always faster and easier to use than is SRCU. |
| |
| Also unlike other forms of RCU, explicit initialization and |
| cleanup is required either at build time via DEFINE_SRCU() |
| or DEFINE_STATIC_SRCU() or at runtime via init_srcu_struct() |
| and cleanup_srcu_struct(). These last two are passed a |
| "struct srcu_struct" that defines the scope of a given |
| SRCU domain. Once initialized, the srcu_struct is passed |
| to srcu_read_lock(), srcu_read_unlock() synchronize_srcu(), |
| synchronize_srcu_expedited(), and call_srcu(). A given |
| synchronize_srcu() waits only for SRCU read-side critical |
| sections governed by srcu_read_lock() and srcu_read_unlock() |
| calls that have been passed the same srcu_struct. This property |
| is what makes sleeping read-side critical sections tolerable -- |
| a given subsystem delays only its own updates, not those of other |
| subsystems using SRCU. Therefore, SRCU is less prone to OOM the |
| system than RCU would be if RCU's read-side critical sections |
| were permitted to sleep. |
| |
| The ability to sleep in read-side critical sections does not |
| come for free. First, corresponding srcu_read_lock() and |
| srcu_read_unlock() calls must be passed the same srcu_struct. |
| Second, grace-period-detection overhead is amortized only |
| over those updates sharing a given srcu_struct, rather than |
| being globally amortized as they are for other forms of RCU. |
| Therefore, SRCU should be used in preference to rw_semaphore |
| only in extremely read-intensive situations, or in situations |
| requiring SRCU's read-side deadlock immunity or low read-side |
| realtime latency. You should also consider percpu_rw_semaphore |
| when you need lightweight readers. |
| |
| SRCU's expedited primitive (synchronize_srcu_expedited()) |
| never sends IPIs to other CPUs, so it is easier on |
| real-time workloads than is synchronize_rcu_expedited(). |
| |
| It is also permissible to sleep in RCU Tasks Trace read-side |
| critical, which are delimited by rcu_read_lock_trace() and |
| rcu_read_unlock_trace(). However, this is a specialized flavor |
| of RCU, and you should not use it without first checking with |
| its current users. In most cases, you should instead use SRCU. |
| |
| Note that rcu_assign_pointer() relates to SRCU just as it does to |
| other forms of RCU, but instead of rcu_dereference() you should |
| use srcu_dereference() in order to avoid lockdep splats. |
| |
| 14. The whole point of call_rcu(), synchronize_rcu(), and friends |
| is to wait until all pre-existing readers have finished before |
| carrying out some otherwise-destructive operation. It is |
| therefore critically important to *first* remove any path |
| that readers can follow that could be affected by the |
| destructive operation, and *only then* invoke call_rcu(), |
| synchronize_rcu(), or friends. |
| |
| Because these primitives only wait for pre-existing readers, it |
| is the caller's responsibility to guarantee that any subsequent |
| readers will execute safely. |
| |
| 15. The various RCU read-side primitives do *not* necessarily contain |
| memory barriers. You should therefore plan for the CPU |
| and the compiler to freely reorder code into and out of RCU |
| read-side critical sections. It is the responsibility of the |
| RCU update-side primitives to deal with this. |
| |
| For SRCU readers, you can use smp_mb__after_srcu_read_unlock() |
| immediately after an srcu_read_unlock() to get a full barrier. |
| |
| 16. Use CONFIG_PROVE_LOCKING, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and the |
| __rcu sparse checks to validate your RCU code. These can help |
| find problems as follows: |
| |
| CONFIG_PROVE_LOCKING: |
| check that accesses to RCU-protected data structures |
| are carried out under the proper RCU read-side critical |
| section, while holding the right combination of locks, |
| or whatever other conditions are appropriate. |
| |
| CONFIG_DEBUG_OBJECTS_RCU_HEAD: |
| check that you don't pass the same object to call_rcu() |
| (or friends) before an RCU grace period has elapsed |
| since the last time that you passed that same object to |
| call_rcu() (or friends). |
| |
| __rcu sparse checks: |
| tag the pointer to the RCU-protected data structure |
| with __rcu, and sparse will warn you if you access that |
| pointer without the services of one of the variants |
| of rcu_dereference(). |
| |
| These debugging aids can help you find problems that are |
| otherwise extremely difficult to spot. |
| |
| 17. If you pass a callback function defined within a module to one of |
| call_rcu(), call_srcu(), call_rcu_tasks(), call_rcu_tasks_rude(), |
| or call_rcu_tasks_trace(), then it is necessary to wait for all |
| pending callbacks to be invoked before unloading that module. |
| Note that it is absolutely *not* sufficient to wait for a grace |
| period! For example, synchronize_rcu() implementation is *not* |
| guaranteed to wait for callbacks registered on other CPUs via |
| call_rcu(). Or even on the current CPU if that CPU recently |
| went offline and came back online. |
| |
| You instead need to use one of the barrier functions: |
| |
| - call_rcu() -> rcu_barrier() |
| - call_srcu() -> srcu_barrier() |
| - call_rcu_tasks() -> rcu_barrier_tasks() |
| - call_rcu_tasks_rude() -> rcu_barrier_tasks_rude() |
| - call_rcu_tasks_trace() -> rcu_barrier_tasks_trace() |
| |
| However, these barrier functions are absolutely *not* guaranteed |
| to wait for a grace period. For example, if there are no |
| call_rcu() callbacks queued anywhere in the system, rcu_barrier() |
| can and will return immediately. |
| |
| So if you need to wait for both a grace period and for all |
| pre-existing callbacks, you will need to invoke both functions, |
| with the pair depending on the flavor of RCU: |
| |
| - Either synchronize_rcu() or synchronize_rcu_expedited(), |
| together with rcu_barrier() |
| - Either synchronize_srcu() or synchronize_srcu_expedited(), |
| together with and srcu_barrier() |
| - synchronize_rcu_tasks() and rcu_barrier_tasks() |
| - synchronize_tasks_rude() and rcu_barrier_tasks_rude() |
| - synchronize_tasks_trace() and rcu_barrier_tasks_trace() |
| |
| If necessary, you can use something like workqueues to execute |
| the requisite pair of functions concurrently. |
| |
| See rcubarrier.rst for more information. |