| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" |
| "http://www.w3.org/TR/html4/loose.dtd"> |
| <html> |
| <head><title>A Tour Through TREE_RCU's Grace-Period Memory Ordering</title> |
| <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> |
| |
| <p>August 8, 2017</p> |
| <p>This article was contributed by Paul E. McKenney</p> |
| |
| <h3>Introduction</h3> |
| |
| <p>This document gives a rough visual overview of how Tree RCU's |
| grace-period memory ordering guarantee is provided. |
| |
| <ol> |
| <li> <a href="#What Is Tree RCU's Grace Period Memory Ordering Guarantee?"> |
| What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a> |
| <li> <a href="#Tree RCU Grace Period Memory Ordering Building Blocks"> |
| Tree RCU Grace Period Memory Ordering Building Blocks</a> |
| <li> <a href="#Tree RCU Grace Period Memory Ordering Components"> |
| Tree RCU Grace Period Memory Ordering Components</a> |
| <li> <a href="#Putting It All Together">Putting It All Together</a> |
| </ol> |
| |
| <h3><a name="What Is Tree RCU's Grace Period Memory Ordering Guarantee?"> |
| What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a></h3> |
| |
| <p>RCU grace periods provide extremely strong memory-ordering guarantees |
| for non-idle non-offline code. |
| Any code that happens after the end of a given RCU grace period is guaranteed |
| to see the effects of all accesses prior to the beginning of that grace |
| period that are within RCU read-side critical sections. |
| Similarly, any code that happens before the beginning of a given RCU grace |
| period is guaranteed to see the effects of all accesses following the end |
| of that grace period that are within RCU read-side critical sections. |
| |
| <p>This guarantee is particularly pervasive for <tt>synchronize_sched()</tt>, |
| for which RCU-sched read-side critical sections include any region |
| of code for which preemption is disabled. |
| Given that each individual machine instruction can be thought of as |
| an extremely small region of preemption-disabled code, one can think of |
| <tt>synchronize_sched()</tt> as <tt>smp_mb()</tt> on steroids. |
| |
| <p>RCU updaters use this guarantee by splitting their updates into |
| two phases, one of which is executed before the grace period and |
| the other of which is executed after the grace period. |
| In the most common use case, phase one removes an element from |
| a linked RCU-protected data structure, and phase two frees that element. |
| For this to work, any readers that have witnessed state prior to the |
| phase-one update (in the common case, removal) must not witness state |
| following the phase-two update (in the common case, freeing). |
| |
| <p>The RCU implementation provides this guarantee using a network |
| of lock-based critical sections, memory barriers, and per-CPU |
| processing, as is described in the following sections. |
| |
| <h3><a name="Tree RCU Grace Period Memory Ordering Building Blocks"> |
| Tree RCU Grace Period Memory Ordering Building Blocks</a></h3> |
| |
| <p>The workhorse for RCU's grace-period memory ordering is the |
| critical section for the <tt>rcu_node</tt> structure's |
| <tt>->lock</tt>. |
| These critical sections use helper functions for lock acquisition, including |
| <tt>raw_spin_lock_rcu_node()</tt>, |
| <tt>raw_spin_lock_irq_rcu_node()</tt>, and |
| <tt>raw_spin_lock_irqsave_rcu_node()</tt>. |
| Their lock-release counterparts are |
| <tt>raw_spin_unlock_rcu_node()</tt>, |
| <tt>raw_spin_unlock_irq_rcu_node()</tt>, and |
| <tt>raw_spin_unlock_irqrestore_rcu_node()</tt>, |
| respectively. |
| For completeness, a |
| <tt>raw_spin_trylock_rcu_node()</tt> |
| is also provided. |
| The key point is that the lock-acquisition functions, including |
| <tt>raw_spin_trylock_rcu_node()</tt>, all invoke |
| <tt>smp_mb__after_unlock_lock()</tt> immediately after successful |
| acquisition of the lock. |
| |
| <p>Therefore, for any given <tt>rcu_node</tt> struction, any access |
| happening before one of the above lock-release functions will be seen |
| by all CPUs as happening before any access happening after a later |
| one of the above lock-acquisition functions. |
| Furthermore, any access happening before one of the |
| above lock-release function on any given CPU will be seen by all |
| CPUs as happening before any access happening after a later one |
| of the above lock-acquisition functions executing on that same CPU, |
| even if the lock-release and lock-acquisition functions are operating |
| on different <tt>rcu_node</tt> structures. |
| Tree RCU uses these two ordering guarantees to form an ordering |
| network among all CPUs that were in any way involved in the grace |
| period, including any CPUs that came online or went offline during |
| the grace period in question. |
| |
| <p>The following litmus test exhibits the ordering effects of these |
| lock-acquisition and lock-release functions: |
| |
| <pre> |
| 1 int x, y, z; |
| 2 |
| 3 void task0(void) |
| 4 { |
| 5 raw_spin_lock_rcu_node(rnp); |
| 6 WRITE_ONCE(x, 1); |
| 7 r1 = READ_ONCE(y); |
| 8 raw_spin_unlock_rcu_node(rnp); |
| 9 } |
| 10 |
| 11 void task1(void) |
| 12 { |
| 13 raw_spin_lock_rcu_node(rnp); |
| 14 WRITE_ONCE(y, 1); |
| 15 r2 = READ_ONCE(z); |
| 16 raw_spin_unlock_rcu_node(rnp); |
| 17 } |
| 18 |
| 19 void task2(void) |
| 20 { |
| 21 WRITE_ONCE(z, 1); |
| 22 smp_mb(); |
| 23 r3 = READ_ONCE(x); |
| 24 } |
| 25 |
| 26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0); |
| </pre> |
| |
| <p>The <tt>WARN_ON()</tt> is evaluated at “the end of time”, |
| after all changes have propagated throughout the system. |
| Without the <tt>smp_mb__after_unlock_lock()</tt> provided by the |
| acquisition functions, this <tt>WARN_ON()</tt> could trigger, for example |
| on PowerPC. |
| The <tt>smp_mb__after_unlock_lock()</tt> invocations prevent this |
| <tt>WARN_ON()</tt> from triggering. |
| |
| <p>This approach must be extended to include idle CPUs, which need |
| RCU's grace-period memory ordering guarantee to extend to any |
| RCU read-side critical sections preceding and following the current |
| idle sojourn. |
| This case is handled by calls to the strongly ordered |
| <tt>atomic_add_return()</tt> read-modify-write atomic operation that |
| is invoked within <tt>rcu_dynticks_eqs_enter()</tt> at idle-entry |
| time and within <tt>rcu_dynticks_eqs_exit()</tt> at idle-exit time. |
| The grace-period kthread invokes <tt>rcu_dynticks_snap()</tt> and |
| <tt>rcu_dynticks_in_eqs_since()</tt> (both of which invoke |
| an <tt>atomic_add_return()</tt> of zero) to detect idle CPUs. |
| |
| <table> |
| <tr><th> </th></tr> |
| <tr><th align="left">Quick Quiz:</th></tr> |
| <tr><td> |
| But what about CPUs that remain offline for the entire |
| grace period? |
| </td></tr> |
| <tr><th align="left">Answer:</th></tr> |
| <tr><td bgcolor="#ffffff"><font color="ffffff"> |
| Such CPUs will be offline at the beginning of the grace period, |
| so the grace period won't expect quiescent states from them. |
| Races between grace-period start and CPU-hotplug operations |
| are mediated by the CPU's leaf <tt>rcu_node</tt> structure's |
| <tt>->lock</tt> as described above. |
| </font></td></tr> |
| <tr><td> </td></tr> |
| </table> |
| |
| <p>The approach must be extended to handle one final case, that |
| of waking a task blocked in <tt>synchronize_rcu()</tt>. |
| This task might be affinitied to a CPU that is not yet aware that |
| the grace period has ended, and thus might not yet be subject to |
| the grace period's memory ordering. |
| Therefore, there is an <tt>smp_mb()</tt> after the return from |
| <tt>wait_for_completion()</tt> in the <tt>synchronize_rcu()</tt> |
| code path. |
| |
| <table> |
| <tr><th> </th></tr> |
| <tr><th align="left">Quick Quiz:</th></tr> |
| <tr><td> |
| What? Where??? |
| I don't see any <tt>smp_mb()</tt> after the return from |
| <tt>wait_for_completion()</tt>!!! |
| </td></tr> |
| <tr><th align="left">Answer:</th></tr> |
| <tr><td bgcolor="#ffffff"><font color="ffffff"> |
| That would be because I spotted the need for that |
| <tt>smp_mb()</tt> during the creation of this documentation, |
| and it is therefore unlikely to hit mainline before v4.14. |
| Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and |
| Jonathan Cameron for asking questions that sensitized me |
| to the rather elaborate sequence of events that demonstrate |
| the need for this memory barrier. |
| </font></td></tr> |
| <tr><td> </td></tr> |
| </table> |
| |
| <p>Tree RCU's grace--period memory-ordering guarantees rely most |
| heavily on the <tt>rcu_node</tt> structure's <tt>->lock</tt> |
| field, so much so that it is necessary to abbreviate this pattern |
| in the diagrams in the next section. |
| For example, consider the <tt>rcu_prepare_for_idle()</tt> function |
| shown below, which is one of several functions that enforce ordering |
| of newly arrived RCU callbacks against future grace periods: |
| |
| <pre> |
| 1 static void rcu_prepare_for_idle(void) |
| 2 { |
| 3 bool needwake; |
| 4 struct rcu_data *rdp; |
| 5 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks); |
| 6 struct rcu_node *rnp; |
| 7 struct rcu_state *rsp; |
| 8 int tne; |
| 9 |
| 10 if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) || |
| 11 rcu_is_nocb_cpu(smp_processor_id())) |
| 12 return; |
| 13 tne = READ_ONCE(tick_nohz_active); |
| 14 if (tne != rdtp->tick_nohz_enabled_snap) { |
| 15 if (rcu_cpu_has_callbacks(NULL)) |
| 16 invoke_rcu_core(); |
| 17 rdtp->tick_nohz_enabled_snap = tne; |
| 18 return; |
| 19 } |
| 20 if (!tne) |
| 21 return; |
| 22 if (rdtp->all_lazy && |
| 23 rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) { |
| 24 rdtp->all_lazy = false; |
| 25 rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted; |
| 26 invoke_rcu_core(); |
| 27 return; |
| 28 } |
| 29 if (rdtp->last_accelerate == jiffies) |
| 30 return; |
| 31 rdtp->last_accelerate = jiffies; |
| 32 for_each_rcu_flavor(rsp) { |
| 33 rdp = this_cpu_ptr(rsp->rda); |
| 34 if (rcu_segcblist_pend_cbs(&rdp->cblist)) |
| 35 continue; |
| 36 rnp = rdp->mynode; |
| 37 raw_spin_lock_rcu_node(rnp); |
| 38 needwake = rcu_accelerate_cbs(rsp, rnp, rdp); |
| 39 raw_spin_unlock_rcu_node(rnp); |
| 40 if (needwake) |
| 41 rcu_gp_kthread_wake(rsp); |
| 42 } |
| 43 } |
| </pre> |
| |
| <p>But the only part of <tt>rcu_prepare_for_idle()</tt> that really |
| matters for this discussion are lines 37–39. |
| We will therefore abbreviate this function as follows: |
| |
| </p><p><img src="rcu_node-lock.svg" alt="rcu_node-lock.svg"> |
| |
| <p>The box represents the <tt>rcu_node</tt> structure's <tt>->lock</tt> |
| critical section, with the double line on top representing the additional |
| <tt>smp_mb__after_unlock_lock()</tt>. |
| |
| <h3><a name="Tree RCU Grace Period Memory Ordering Components"> |
| Tree RCU Grace Period Memory Ordering Components</a></h3> |
| |
| <p>Tree RCU's grace-period memory-ordering guarantee is provided by |
| a number of RCU components: |
| |
| <ol> |
| <li> <a href="#Callback Registry">Callback Registry</a> |
| <li> <a href="#Grace-Period Initialization">Grace-Period Initialization</a> |
| <li> <a href="#Self-Reported Quiescent States"> |
| Self-Reported Quiescent States</a> |
| <li> <a href="#Dynamic Tick Interface">Dynamic Tick Interface</a> |
| <li> <a href="#CPU-Hotplug Interface">CPU-Hotplug Interface</a> |
| <li> <a href="Forcing Quiescent States">Forcing Quiescent States</a> |
| <li> <a href="Grace-Period Cleanup">Grace-Period Cleanup</a> |
| <li> <a href="Callback Invocation">Callback Invocation</a> |
| </ol> |
| |
| <p>Each of the following section looks at the corresponding component |
| in detail. |
| |
| <h4><a name="Callback Registry">Callback Registry</a></h4> |
| |
| <p>If RCU's grace-period guarantee is to mean anything at all, any |
| access that happens before a given invocation of <tt>call_rcu()</tt> |
| must also happen before the corresponding grace period. |
| The implementation of this portion of RCU's grace period guarantee |
| is shown in the following figure: |
| |
| </p><p><img src="TreeRCU-callback-registry.svg" alt="TreeRCU-callback-registry.svg"> |
| |
| <p>Because <tt>call_rcu()</tt> normally acts only on CPU-local state, |
| it provides no ordering guarantees, either for itself or for |
| phase one of the update (which again will usually be removal of |
| an element from an RCU-protected data structure). |
| It simply enqueues the <tt>rcu_head</tt> structure on a per-CPU list, |
| which cannot become associated with a grace period until a later |
| call to <tt>rcu_accelerate_cbs()</tt>, as shown in the diagram above. |
| |
| <p>One set of code paths shown on the left invokes |
| <tt>rcu_accelerate_cbs()</tt> via |
| <tt>note_gp_changes()</tt>, either directly from <tt>call_rcu()</tt> (if |
| the current CPU is inundated with queued <tt>rcu_head</tt> structures) |
| or more likely from an <tt>RCU_SOFTIRQ</tt> handler. |
| Another code path in the middle is taken only in kernels built with |
| <tt>CONFIG_RCU_FAST_NO_HZ=y</tt>, which invokes |
| <tt>rcu_accelerate_cbs()</tt> via <tt>rcu_prepare_for_idle()</tt>. |
| The final code path on the right is taken only in kernels built with |
| <tt>CONFIG_HOTPLUG_CPU=y</tt>, which invokes |
| <tt>rcu_accelerate_cbs()</tt> via |
| <tt>rcu_advance_cbs()</tt>, <tt>rcu_migrate_callbacks</tt>, |
| <tt>rcutree_migrate_callbacks()</tt>, and <tt>takedown_cpu()</tt>, |
| which in turn is invoked on a surviving CPU after the outgoing |
| CPU has been completely offlined. |
| |
| <p>There are a few other code paths within grace-period processing |
| that opportunistically invoke <tt>rcu_accelerate_cbs()</tt>. |
| However, either way, all of the CPU's recently queued <tt>rcu_head</tt> |
| structures are associated with a future grace-period number under |
| the protection of the CPU's lead <tt>rcu_node</tt> structure's |
| <tt>->lock</tt>. |
| In all cases, there is full ordering against any prior critical section |
| for that same <tt>rcu_node</tt> structure's <tt>->lock</tt>, and |
| also full ordering against any of the current task's or CPU's prior critical |
| sections for any <tt>rcu_node</tt> structure's <tt>->lock</tt>. |
| |
| <p>The next section will show how this ordering ensures that any |
| accesses prior to the <tt>call_rcu()</tt> (particularly including phase |
| one of the update) |
| happen before the start of the corresponding grace period. |
| |
| <table> |
| <tr><th> </th></tr> |
| <tr><th align="left">Quick Quiz:</th></tr> |
| <tr><td> |
| But what about <tt>synchronize_rcu()</tt>? |
| </td></tr> |
| <tr><th align="left">Answer:</th></tr> |
| <tr><td bgcolor="#ffffff"><font color="ffffff"> |
| The <tt>synchronize_rcu()</tt> passes <tt>call_rcu()</tt> |
| to <tt>wait_rcu_gp()</tt>, which invokes it. |
| So either way, it eventually comes down to <tt>call_rcu()</tt>. |
| </font></td></tr> |
| <tr><td> </td></tr> |
| </table> |
| |
| <h4><a name="Grace-Period Initialization">Grace-Period Initialization</a></h4> |
| |
| <p>Grace-period initialization is carried out by |
| the grace-period kernel thread, which makes several passes over the |
| <tt>rcu_node</tt> tree within the <tt>rcu_gp_init()</tt> function. |
| This means that showing the full flow of ordering through the |
| grace-period computation will require duplicating this tree. |
| If you find this confusing, please note that the state of the |
| <tt>rcu_node</tt> changes over time, just like Heraclitus's river. |
| However, to keep the <tt>rcu_node</tt> river tractable, the |
| grace-period kernel thread's traversals are presented in multiple |
| parts, starting in this section with the various phases of |
| grace-period initialization. |
| |
| <p>The first ordering-related grace-period initialization action is to |
| increment the <tt>rcu_state</tt> structure's <tt>->gpnum</tt> |
| grace-period-number counter, as shown below: |
| |
| </p><p><img src="TreeRCU-gp-init-1.svg" alt="TreeRCU-gp-init-1.svg" width="75%"> |
| |
| <p>The actual increment is carried out using <tt>smp_store_release()</tt>, |
| which helps reject false-positive RCU CPU stall detection. |
| Note that only the root <tt>rcu_node</tt> structure is touched. |
| |
| <p>The first pass through the <tt>rcu_node</tt> tree updates bitmasks |
| based on CPUs having come online or gone offline since the start of |
| the previous grace period. |
| In the common case where the number of online CPUs for this <tt>rcu_node</tt> |
| structure has not transitioned to or from zero, |
| this pass will scan only the leaf <tt>rcu_node</tt> structures. |
| However, if the number of online CPUs for a given leaf <tt>rcu_node</tt> |
| structure has transitioned from zero, |
| <tt>rcu_init_new_rnp()</tt> will be invoked for the first incoming CPU. |
| Similarly, if the number of online CPUs for a given leaf <tt>rcu_node</tt> |
| structure has transitioned to zero, |
| <tt>rcu_cleanup_dead_rnp()</tt> will be invoked for the last outgoing CPU. |
| The diagram below shows the path of ordering if the leftmost |
| <tt>rcu_node</tt> structure onlines its first CPU and if the next |
| <tt>rcu_node</tt> structure has no online CPUs |
| (or, alternatively if the leftmost <tt>rcu_node</tt> structure offlines |
| its last CPU and if the next <tt>rcu_node</tt> structure has no online CPUs). |
| |
| </p><p><img src="TreeRCU-gp-init-2.svg" alt="TreeRCU-gp-init-1.svg" width="75%"> |
| |
| <p>The final <tt>rcu_gp_init()</tt> pass through the <tt>rcu_node</tt> |
| tree traverses breadth-first, setting each <tt>rcu_node</tt> structure's |
| <tt>->gpnum</tt> field to the newly incremented value from the |
| <tt>rcu_state</tt> structure, as shown in the following diagram. |
| |
| </p><p><img src="TreeRCU-gp-init-3.svg" alt="TreeRCU-gp-init-1.svg" width="75%"> |
| |
| <p>This change will also cause each CPU's next call to |
| <tt>__note_gp_changes()</tt> |
| to notice that a new grace period has started, as described in the next |
| section. |
| But because the grace-period kthread started the grace period at the |
| root (with the increment of the <tt>rcu_state</tt> structure's |
| <tt>->gpnum</tt> field) before setting each leaf <tt>rcu_node</tt> |
| structure's <tt>->gpnum</tt> field, each CPU's observation of |
| the start of the grace period will happen after the actual start |
| of the grace period. |
| |
| <table> |
| <tr><th> </th></tr> |
| <tr><th align="left">Quick Quiz:</th></tr> |
| <tr><td> |
| But what about the CPU that started the grace period? |
| Why wouldn't it see the start of the grace period right when |
| it started that grace period? |
| </td></tr> |
| <tr><th align="left">Answer:</th></tr> |
| <tr><td bgcolor="#ffffff"><font color="ffffff"> |
| In some deep philosophical and overly anthromorphized |
| sense, yes, the CPU starting the grace period is immediately |
| aware of having done so. |
| However, if we instead assume that RCU is not self-aware, |
| then even the CPU starting the grace period does not really |
| become aware of the start of this grace period until its |
| first call to <tt>__note_gp_changes()</tt>. |
| On the other hand, this CPU potentially gets early notification |
| because it invokes <tt>__note_gp_changes()</tt> during its |
| last <tt>rcu_gp_init()</tt> pass through its leaf |
| <tt>rcu_node</tt> structure. |
| </font></td></tr> |
| <tr><td> </td></tr> |
| </table> |
| |
| <h4><a name="Self-Reported Quiescent States"> |
| Self-Reported Quiescent States</a></h4> |
| |
| <p>When all entities that might block the grace period have reported |
| quiescent states (or as described in a later section, had quiescent |
| states reported on their behalf), the grace period can end. |
| Online non-idle CPUs report their own quiescent states, as shown |
| in the following diagram: |
| |
| </p><p><img src="TreeRCU-qs.svg" alt="TreeRCU-qs.svg" width="75%"> |
| |
| <p>This is for the last CPU to report a quiescent state, which signals |
| the end of the grace period. |
| Earlier quiescent states would push up the <tt>rcu_node</tt> tree |
| only until they encountered an <tt>rcu_node</tt> structure that |
| is waiting for additional quiescent states. |
| However, ordering is nevertheless preserved because some later quiescent |
| state will acquire that <tt>rcu_node</tt> structure's <tt>->lock</tt>. |
| |
| <p>Any number of events can lead up to a CPU invoking |
| <tt>note_gp_changes</tt> (or alternatively, directly invoking |
| <tt>__note_gp_changes()</tt>), at which point that CPU will notice |
| the start of a new grace period while holding its leaf |
| <tt>rcu_node</tt> lock. |
| Therefore, all execution shown in this diagram happens after the |
| start of the grace period. |
| In addition, this CPU will consider any RCU read-side critical |
| section that started before the invocation of <tt>__note_gp_changes()</tt> |
| to have started before the grace period, and thus a critical |
| section that the grace period must wait on. |
| |
| <table> |
| <tr><th> </th></tr> |
| <tr><th align="left">Quick Quiz:</th></tr> |
| <tr><td> |
| But a RCU read-side critical section might have started |
| after the beginning of the grace period |
| (the <tt>->gpnum++</tt> from earlier), so why should |
| the grace period wait on such a critical section? |
| </td></tr> |
| <tr><th align="left">Answer:</th></tr> |
| <tr><td bgcolor="#ffffff"><font color="ffffff"> |
| It is indeed not necessary for the grace period to wait on such |
| a critical section. |
| However, it is permissible to wait on it. |
| And it is furthermore important to wait on it, as this |
| lazy approach is far more scalable than a “big bang” |
| all-at-once grace-period start could possibly be. |
| </font></td></tr> |
| <tr><td> </td></tr> |
| </table> |
| |
| <p>If the CPU does a context switch, a quiescent state will be |
| noted by <tt>rcu_node_context_switch()</tt> on the left. |
| On the other hand, if the CPU takes a scheduler-clock interrupt |
| while executing in usermode, a quiescent state will be noted by |
| <tt>rcu_check_callbacks()</tt> on the right. |
| Either way, the passage through a quiescent state will be noted |
| in a per-CPU variable. |
| |
| <p>The next time an <tt>RCU_SOFTIRQ</tt> handler executes on |
| this CPU (for example, after the next scheduler-clock |
| interrupt), <tt>__rcu_process_callbacks()</tt> will invoke |
| <tt>rcu_check_quiescent_state()</tt>, which will notice the |
| recorded quiescent state, and invoke |
| <tt>rcu_report_qs_rdp()</tt>. |
| If <tt>rcu_report_qs_rdp()</tt> verifies that the quiescent state |
| really does apply to the current grace period, it invokes |
| <tt>rcu_report_rnp()</tt> which traverses up the <tt>rcu_node</tt> |
| tree as shown at the bottom of the diagram, clearing bits from |
| each <tt>rcu_node</tt> structure's <tt>->qsmask</tt> field, |
| and propagating up the tree when the result is zero. |
| |
| <p>Note that traversal passes upwards out of a given <tt>rcu_node</tt> |
| structure only if the current CPU is reporting the last quiescent |
| state for the subtree headed by that <tt>rcu_node</tt> structure. |
| A key point is that if a CPU's traversal stops at a given <tt>rcu_node</tt> |
| structure, then there will be a later traversal by another CPU |
| (or perhaps the same one) that proceeds upwards |
| from that point, and the <tt>rcu_node</tt> <tt>->lock</tt> |
| guarantees that the first CPU's quiescent state happens before the |
| remainder of the second CPU's traversal. |
| Applying this line of thought repeatedly shows that all CPUs' |
| quiescent states happen before the last CPU traverses through |
| the root <tt>rcu_node</tt> structure, the “last CPU” |
| being the one that clears the last bit in the root <tt>rcu_node</tt> |
| structure's <tt>->qsmask</tt> field. |
| |
| <h4><a name="Dynamic Tick Interface">Dynamic Tick Interface</a></h4> |
| |
| <p>Due to energy-efficiency considerations, RCU is forbidden from |
| disturbing idle CPUs. |
| CPUs are therefore required to notify RCU when entering or leaving idle |
| state, which they do via fully ordered value-returning atomic operations |
| on a per-CPU variable. |
| The ordering effects are as shown below: |
| |
| </p><p><img src="TreeRCU-dyntick.svg" alt="TreeRCU-dyntick.svg" width="50%"> |
| |
| <p>The RCU grace-period kernel thread samples the per-CPU idleness |
| variable while holding the corresponding CPU's leaf <tt>rcu_node</tt> |
| structure's <tt>->lock</tt>. |
| This means that any RCU read-side critical sections that precede the |
| idle period (the oval near the top of the diagram above) will happen |
| before the end of the current grace period. |
| Similarly, the beginning of the current grace period will happen before |
| any RCU read-side critical sections that follow the |
| idle period (the oval near the bottom of the diagram above). |
| |
| <p>Plumbing this into the full grace-period execution is described |
| <a href="#Forcing Quiescent States">below</a>. |
| |
| <h4><a name="CPU-Hotplug Interface">CPU-Hotplug Interface</a></h4> |
| |
| <p>RCU is also forbidden from disturbing offline CPUs, which might well |
| be powered off and removed from the system completely. |
| CPUs are therefore required to notify RCU of their comings and goings |
| as part of the corresponding CPU hotplug operations. |
| The ordering effects are shown below: |
| |
| </p><p><img src="TreeRCU-hotplug.svg" alt="TreeRCU-hotplug.svg" width="50%"> |
| |
| <p>Because CPU hotplug operations are much less frequent than idle transitions, |
| they are heavier weight, and thus acquire the CPU's leaf <tt>rcu_node</tt> |
| structure's <tt>->lock</tt> and update this structure's |
| <tt>->qsmaskinitnext</tt>. |
| The RCU grace-period kernel thread samples this mask to detect CPUs |
| having gone offline since the beginning of this grace period. |
| |
| <p>Plumbing this into the full grace-period execution is described |
| <a href="#Forcing Quiescent States">below</a>. |
| |
| <h4><a name="Forcing Quiescent States">Forcing Quiescent States</a></h4> |
| |
| <p>As noted above, idle and offline CPUs cannot report their own |
| quiescent states, and therefore the grace-period kernel thread |
| must do the reporting on their behalf. |
| This process is called “forcing quiescent states”, it is |
| repeated every few jiffies, and its ordering effects are shown below: |
| |
| </p><p><img src="TreeRCU-gp-fqs.svg" alt="TreeRCU-gp-fqs.svg" width="100%"> |
| |
| <p>Each pass of quiescent state forcing is guaranteed to traverse the |
| leaf <tt>rcu_node</tt> structures, and if there are no new quiescent |
| states due to recently idled and/or offlined CPUs, then only the |
| leaves are traversed. |
| However, if there is a newly offlined CPU as illustrated on the left |
| or a newly idled CPU as illustrated on the right, the corresponding |
| quiescent state will be driven up towards the root. |
| As with self-reported quiescent states, the upwards driving stops |
| once it reaches an <tt>rcu_node</tt> structure that has quiescent |
| states outstanding from other CPUs. |
| |
| <table> |
| <tr><th> </th></tr> |
| <tr><th align="left">Quick Quiz:</th></tr> |
| <tr><td> |
| The leftmost drive to root stopped before it reached |
| the root <tt>rcu_node</tt> structure, which means that |
| there are still CPUs subordinate to that structure on |
| which the current grace period is waiting. |
| Given that, how is it possible that the rightmost drive |
| to root ended the grace period? |
| </td></tr> |
| <tr><th align="left">Answer:</th></tr> |
| <tr><td bgcolor="#ffffff"><font color="ffffff"> |
| Good analysis! |
| It is in fact impossible in the absence of bugs in RCU. |
| But this diagram is complex enough as it is, so simplicity |
| overrode accuracy. |
| You can think of it as poetic license, or you can think of |
| it as misdirection that is resolved in the |
| <a href="#Putting It All Together">stitched-together diagram</a>. |
| </font></td></tr> |
| <tr><td> </td></tr> |
| </table> |
| |
| <h4><a name="Grace-Period Cleanup">Grace-Period Cleanup</a></h4> |
| |
| <p>Grace-period cleanup first scans the <tt>rcu_node</tt> tree |
| breadth-first setting all the <tt>->completed</tt> fields equal |
| to the number of the newly completed grace period, then it sets |
| the <tt>rcu_state</tt> structure's <tt>->completed</tt> field, |
| again to the number of the newly completed grace period. |
| The ordering effects are shown below: |
| |
| </p><p><img src="TreeRCU-gp-cleanup.svg" alt="TreeRCU-gp-cleanup.svg" width="75%"> |
| |
| <p>As indicated by the oval at the bottom of the diagram, once |
| grace-period cleanup is complete, the next grace period can begin. |
| |
| <table> |
| <tr><th> </th></tr> |
| <tr><th align="left">Quick Quiz:</th></tr> |
| <tr><td> |
| But when precisely does the grace period end? |
| </td></tr> |
| <tr><th align="left">Answer:</th></tr> |
| <tr><td bgcolor="#ffffff"><font color="ffffff"> |
| There is no useful single point at which the grace period |
| can be said to end. |
| The earliest reasonable candidate is as soon as the last |
| CPU has reported its quiescent state, but it may be some |
| milliseconds before RCU becomes aware of this. |
| The latest reasonable candidate is once the <tt>rcu_state</tt> |
| structure's <tt>->completed</tt> field has been updated, |
| but it is quite possible that some CPUs have already completed |
| phase two of their updates by that time. |
| In short, if you are going to work with RCU, you need to |
| learn to embrace uncertainty. |
| </font></td></tr> |
| <tr><td> </td></tr> |
| </table> |
| |
| |
| <h4><a name="Callback Invocation">Callback Invocation</a></h4> |
| |
| <p>Once a given CPU's leaf <tt>rcu_node</tt> structure's |
| <tt>->completed</tt> field has been updated, that CPU can begin |
| invoking its RCU callbacks that were waiting for this grace period |
| to end. |
| These callbacks are identified by <tt>rcu_advance_cbs()</tt>, |
| which is usually invoked by <tt>__note_gp_changes()</tt>. |
| As shown in the diagram below, this invocation can be triggered by |
| the scheduling-clock interrupt (<tt>rcu_check_callbacks()</tt> on |
| the left) or by idle entry (<tt>rcu_cleanup_after_idle()</tt> on |
| the right, but only for kernels build with |
| <tt>CONFIG_RCU_FAST_NO_HZ=y</tt>). |
| Either way, <tt>RCU_SOFTIRQ</tt> is raised, which results in |
| <tt>rcu_do_batch()</tt> invoking the callbacks, which in turn |
| allows those callbacks to carry out (either directly or indirectly |
| via wakeup) the needed phase-two processing for each update. |
| |
| </p><p><img src="TreeRCU-callback-invocation.svg" alt="TreeRCU-callback-invocation.svg" width="60%"> |
| |
| <p>Please note that callback invocation can also be prompted by any |
| number of corner-case code paths, for example, when a CPU notes that |
| it has excessive numbers of callbacks queued. |
| In all cases, the CPU acquires its leaf <tt>rcu_node</tt> structure's |
| <tt>->lock</tt> before invoking callbacks, which preserves the |
| required ordering against the newly completed grace period. |
| |
| <p>However, if the callback function communicates to other CPUs, |
| for example, doing a wakeup, then it is that function's responsibility |
| to maintain ordering. |
| For example, if the callback function wakes up a task that runs on |
| some other CPU, proper ordering must in place in both the callback |
| function and the task being awakened. |
| To see why this is important, consider the top half of the |
| <a href="#Grace-Period Cleanup">grace-period cleanup</a> diagram. |
| The callback might be running on a CPU corresponding to the leftmost |
| leaf <tt>rcu_node</tt> structure, and awaken a task that is to run on |
| a CPU corresponding to the rightmost leaf <tt>rcu_node</tt> structure, |
| and the grace-period kernel thread might not yet have reached the |
| rightmost leaf. |
| In this case, the grace period's memory ordering might not yet have |
| reached that CPU, so again the callback function and the awakened |
| task must supply proper ordering. |
| |
| <h3><a name="Putting It All Together">Putting It All Together</a></h3> |
| |
| <p>A stitched-together diagram is |
| <a href="Tree-RCU-Diagram.html">here</a>. |
| |
| <h3><a name="Legal Statement"> |
| Legal Statement</a></h3> |
| |
| <p>This work represents the view of the author and does not necessarily |
| represent the view of IBM. |
| |
| </p><p>Linux is a registered trademark of Linus Torvalds. |
| |
| </p><p>Other company, product, and service names may be trademarks or |
| service marks of others. |
| |
| </body></html> |