| .. This file is dual-licensed: you can use it either under the terms |
| .. of the GPL 2.0 or the GFDL 1.2 license, at your option. Note that this |
| .. dual licensing only applies to this file, and not this project as a |
| .. whole. |
| .. |
| .. a) This file is free software; you can redistribute it and/or |
| .. modify it under the terms of the GNU General Public License as |
| .. published by the Free Software Foundation version 2 of |
| .. the License. |
| .. |
| .. This file is distributed in the hope that it will be useful, |
| .. but WITHOUT ANY WARRANTY; without even the implied warranty of |
| .. MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the |
| .. GNU General Public License for more details. |
| .. |
| .. Or, alternatively, |
| .. |
| .. b) Permission is granted to copy, distribute and/or modify this |
| .. document under the terms of the GNU Free Documentation License, |
| .. Version 1.2 version published by the Free Software |
| .. Foundation, with no Invariant Sections, no Front-Cover Texts |
| .. and no Back-Cover Texts. A copy of the license is included at |
| .. Documentation/userspace-api/media/fdl-appendix.rst. |
| .. |
| .. TODO: replace it to GPL-2.0 OR GFDL-1.2 WITH no-invariant-sections |
| |
| =========================== |
| Lockless Ring Buffer Design |
| =========================== |
| |
| Copyright 2009 Red Hat Inc. |
| |
| :Author: Steven Rostedt <srostedt@redhat.com> |
| :License: The GNU Free Documentation License, Version 1.2 |
| (dual licensed under the GPL v2) |
| :Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto, |
| and Frederic Weisbecker. |
| |
| |
| Written for: 2.6.31 |
| |
| Terminology used in this Document |
| --------------------------------- |
| |
| tail |
| - where new writes happen in the ring buffer. |
| |
| head |
| - where new reads happen in the ring buffer. |
| |
| producer |
| - the task that writes into the ring buffer (same as writer) |
| |
| writer |
| - same as producer |
| |
| consumer |
| - the task that reads from the buffer (same as reader) |
| |
| reader |
| - same as consumer. |
| |
| reader_page |
| - A page outside the ring buffer used solely (for the most part) |
| by the reader. |
| |
| head_page |
| - a pointer to the page that the reader will use next |
| |
| tail_page |
| - a pointer to the page that will be written to next |
| |
| commit_page |
| - a pointer to the page with the last finished non-nested write. |
| |
| cmpxchg |
| - hardware-assisted atomic transaction that performs the following:: |
| |
| A = B if previous A == C |
| |
| R = cmpxchg(A, C, B) is saying that we replace A with B if and only |
| if current A is equal to C, and we put the old (current) |
| A into R |
| |
| R gets the previous A regardless of whether A is updated with B or not. |
| |
| To see if the update was successful a compare of ``R == C`` |
| may be used. |
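| |
| As a rough illustration of these semantics, the sketch below expresses |
| cmpxchg with C11 atomics. The name ``cmpxchg_sketch`` and the use of |
| ``long`` are assumptions made for the example; it is not the kernel's |
| implementation. |
| |

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Hypothetical helper: replace *a with b only if *a == c, and always
 * return the value *a held before the call.
 */
static long cmpxchg_sketch(_Atomic long *a, long c, long b)
{
	long expected = c;

	/* On failure, C11 stores the value actually found into 'expected'. */
	atomic_compare_exchange_strong(a, &expected, b);
	return expected;
}
```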
| |
| The Generic Ring Buffer |
| ----------------------- |
| |
| The ring buffer can be used in either an overwrite mode or in |
| producer/consumer mode. |
| |
| In producer/consumer mode, if the producer fills the buffer before the |
| consumer frees any space, the producer stops writing to the buffer. This |
| loses the most recent events. |
| |
| In overwrite mode, if the producer fills the buffer before the consumer |
| frees any space, the producer overwrites the older data. This loses the |
| oldest events. |
| |
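| The difference between the two modes can be sketched on a toy buffer of |
| integer events. Everything here (``struct toy_ring``, ``toy_write``, |
| ``toy_read``, the four-slot size) is hypothetical and deliberately not |
| lockless; it only shows which event is lost in each mode. |
| |

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 4

/* Hypothetical toy buffer; single-threaded, no pages, no locking. */
struct toy_ring {
	int slot[TOY_SLOTS];
	size_t head;      /* index of the oldest unread event */
	size_t count;     /* number of unread events */
	bool overwrite;   /* true: overwrite mode, false: producer/consumer */
};

/* Returns false if the event was dropped (producer/consumer mode, full). */
static bool toy_write(struct toy_ring *rb, int event)
{
	if (rb->count == TOY_SLOTS) {
		if (!rb->overwrite)
			return false;                   /* newest event is lost */
		rb->head = (rb->head + 1) % TOY_SLOTS;  /* oldest event is lost */
		rb->count--;
	}
	rb->slot[(rb->head + rb->count) % TOY_SLOTS] = event;
	rb->count++;
	return true;
}

/* Returns false if there is nothing to read. */
static bool toy_read(struct toy_ring *rb, int *event)
{
	if (rb->count == 0)
		return false;
	*event = rb->slot[rb->head];
	rb->head = (rb->head + 1) % TOY_SLOTS;
	rb->count--;
	return true;
}
```
| |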
| No two writers can write at the same time (on the same per-cpu buffer), |
| but a writer may interrupt another writer. The interrupting writer must |
| finish writing before the interrupted writer may continue. This is very |
| important to the algorithm. The writers act like a "stack". The way |
| interrupts work enforces this behavior:: |
| |
| |
|   writer1 start |
|      <preempted> writer2 start |
|          <preempted> writer3 start |
|                      writer3 finishes |
|                  writer2 finishes |
|   writer1 finishes |
| |
| This is very much like a writer being preempted by an interrupt and |
| the interrupt doing a write as well. |
| |
| Readers can happen at any time, but no two readers may run at the same |
| time, nor can a reader preempt/interrupt another reader. A reader cannot |
| preempt/interrupt a writer, but it may read/consume from the buffer while |
| a writer is writing, provided the reader is on another processor. A reader |
| may also read on its own processor, where it can be preempted by a writer. |
| |
| In short: a writer can preempt a reader, but a reader cannot preempt a |
| writer. A reader can, however, read the buffer at the same time (on |
| another processor) as a writer. |
| |
| The ring buffer is made up of pages held together by a linked list. |
| |
| At initialization a reader page is allocated for the reader that is not |
| part of the ring buffer. |
| |
| The head_page, tail_page and commit_page are all initialized to point |
| to the same page. |
| |
| The reader page is initialized to have its next pointer pointing to |
| the head page, and its previous pointer pointing to a page before |
| the head page. |
| |
| The reader has its own page to use. At start up time, this page is |
| allocated but is not attached to the list. When the reader wants |
| to read from the buffer, if its page is empty (like it is on start-up), |
| it will swap its page with the head_page. The old reader page will |
| become part of the ring buffer and the head_page will be removed. |
| The page after the inserted page (old reader_page) will become the |
| new head page. |
| |
| Once the new page is given to the reader, the reader can do what |
| it wants with it, as long as a writer has left that page. |
| |
| The following shows how the reader page is swapped. Note that it does |
| not show the head page in the buffer; it only demonstrates a swap. |
| |
| :: |
| |
| +------+ |
| |reader| RING BUFFER |
| |page | |
| +------+ |
| +---+ +---+ +---+ |
| | |-->| |-->| | |
| | |<--| |<--| | |
| +---+ +---+ +---+ |
| ^ | ^ | |
| | +-------------+ | |
| +-----------------+ |
| |
| |
| +------+ |
| |reader| RING BUFFER |
| |page |-------------------+ |
| +------+ v |
| | +---+ +---+ +---+ |
| | | |-->| |-->| | |
| | | |<--| |<--| |<-+ |
| | +---+ +---+ +---+ | |
| | ^ | ^ | | |
| | | +-------------+ | | |
| | +-----------------+ | |
| +------------------------------------+ |
| |
| +------+ |
| |reader| RING BUFFER |
| |page |-------------------+ |
| +------+ <---------------+ v |
| | ^ +---+ +---+ +---+ |
| | | | |-->| |-->| | |
| | | | | | |<--| |<-+ |
| | | +---+ +---+ +---+ | |
| | | | ^ | | |
| | | +-------------+ | | |
| | +-----------------------------+ | |
| +------------------------------------+ |
| |
| +------+ |
| |buffer| RING BUFFER |
| |page |-------------------+ |
| +------+ <---------------+ v |
| | ^ +---+ +---+ +---+ |
| | | | | | |-->| | |
| | | New | | | |<--| |<-+ |
| | | Reader +---+ +---+ +---+ | |
| | | page ----^ | | |
| | | | | |
| | +-----------------------------+ | |
| +------------------------------------+ |
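| |
| The pointer updates in the swap above can be sketched as follows. This is |
| a single-threaded illustration under assumed names (``struct bpage``, |
| ``swap_reader_page``); it ignores writers entirely and assumes a ring of |
| at least two pages. |
| |

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer page: only the list linkage is modelled. */
struct bpage {
	struct bpage *next;
	struct bpage *prev;
};

/*
 * 'reader' takes the place of '*head' in the ring, and the page after it
 * becomes the new head. Returns the page handed to the reader (the old
 * head page), whose next pointer still points into the ring.
 */
static struct bpage *swap_reader_page(struct bpage *reader, struct bpage **head)
{
	struct bpage *old_head = *head;

	/* Point the reader page at the neighbours of the old head page. */
	reader->next = old_head->next;
	reader->prev = old_head->prev;

	/* Splice the reader page in; the old head page drops out. */
	old_head->prev->next = reader;
	old_head->next->prev = reader;

	*head = reader->next;
	return old_head;
}
```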
| |
| |
| |
| It is possible that the page swapped is the commit page and the tail page, |
| if what is in the ring buffer is less than what is held in a buffer page. |
| |
| :: |
| |
| reader page commit page tail page |
| | | | |
| v | | |
| +---+ | | |
| | |<----------+ | |
| | |<------------------------+ |
| | |------+ |
| +---+ | |
| | |
| v |
| +---+ +---+ +---+ +---+ |
| <---| |--->| |--->| |--->| |---> |
| --->| |<---| |<---| |<---| |<--- |
| +---+ +---+ +---+ +---+ |
| |
| This case is still valid for this algorithm. |
| When the writer leaves the page, it simply goes into the ring buffer |
| since the reader page still points to the next location in the ring |
| buffer. |
| |
| |
| The main pointers: |
| |
| reader page |
| - The page used solely by the reader and is not part |
| of the ring buffer (may be swapped in) |
| |
| head page |
| - the next page in the ring buffer that will be swapped |
| with the reader page. |
| |
| tail page |
| - the page where the next write will take place. |
| |
| commit page |
| - the page that last finished a write. |
| |
| The commit page is only updated by the outermost writer in the |
| writer stack. A writer that preempts another writer will not move the |
| commit page. |
| |
| When data is written into the ring buffer, a position is reserved |
| in the ring buffer and passed back to the writer. When the writer |
| is finished writing data into that position, it commits the write. |
| |
| Another write (or a read) may take place at any time during this |
| transaction. If another write happens, it must finish before the |
| previous write continues. |
| |
| |
| Write reserve:: |
| |
| Buffer page |
| +---------+ |
| |written | |
| +---------+ <--- given back to writer (current commit) |
| |reserved | |
| +---------+ <--- tail pointer |
| | empty | |
| +---------+ |
| |
| Write commit:: |
| |
| Buffer page |
| +---------+ |
| |written | |
| +---------+ |
| |written | |
| +---------+ <--- next position for write (current commit) |
| | empty | |
| +---------+ |
| |
| |
| If a write happens after the first reserve:: |
| |
| Buffer page |
| +---------+ |
| |written | |
| +---------+ <-- current commit |
| |reserved | |
| +---------+ <--- given back to second writer |
| |reserved | |
| +---------+ <--- tail pointer |
| |
| After second writer commits:: |
| |
| |
| Buffer page |
| +---------+ |
| |written | |
| +---------+ <--(last full commit) |
| |reserved | |
| +---------+ |
| |pending | |
| |commit | |
| +---------+ <--- tail pointer |
| |
| When the first writer commits:: |
| |
| Buffer page |
| +---------+ |
| |written | |
| +---------+ |
| |written | |
| +---------+ |
| |written | |
| +---------+ <--(last full commit and tail pointer) |
| |
| |
| The commit pointer points to the last write location that was |
| committed without preempting another write. When a write that |
| preempted another write is committed, it only becomes a pending commit |
| and will not be a full commit until all writes have been committed. |
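| |
| The reserve/commit behaviour described above can be modelled on a single |
| page as in the sketch below. The structure and function names are |
| hypothetical, the nesting counter is an assumption of the example, and |
| nothing here is atomic; it only shows that a nested commit stays pending |
| until the outermost writer commits. |
| |

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-page model; offsets are in bytes. */
struct toy_page {
	size_t tail;     /* end of the last reservation */
	size_t commit;   /* end of the last full commit */
	int nesting;     /* writers that have reserved but not yet committed */
};

/* Reserve 'len' bytes and return the offset where the writer may store data. */
static size_t toy_reserve(struct toy_page *pg, size_t len)
{
	size_t start = pg->tail;

	pg->tail += len;
	pg->nesting++;
	return start;
}

/* Only the outermost writer turns pending commits into a full commit. */
static void toy_commit(struct toy_page *pg)
{
	if (--pg->nesting == 0)
		pg->commit = pg->tail;
}
```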
| |
| The commit page points to the page that has the last full commit. |
| The tail page points to the page with the last write (before |
| committing). |
| |
| The tail page is always equal to or after the commit page. It may |
| be several pages ahead. If the tail page catches up to the commit |
| page then no more writes may take place (regardless of the mode |
| of the ring buffer: overwrite or producer/consumer). |
| |
| The order of pages is:: |
| |
| head page |
| commit page |
| tail page |
| |
| Possible scenario:: |
| |
|                                 tail page |
|      head page        commit page   | |
|          |                 |        | |
|          v                 v        v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |--->|   |--->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| There is a special case in which the head page is after the commit page |
| and possibly the tail page. This happens when the commit (and tail) page |
| has been swapped with the reader page, because the head page is always |
| part of the ring buffer but the reader page is not. Whenever less than a |
| full page has been committed inside the ring buffer and a reader swaps |
| out a page, it will be swapping out the commit page. |
| |
| :: |
| |
| reader page commit page tail page |
| | | | |
| v | | |
| +---+ | | |
| | |<----------+ | |
| | |<------------------------+ |
| | |------+ |
| +---+ | |
| | |
| v |
| +---+ +---+ +---+ +---+ |
| <---| |--->| |--->| |--->| |---> |
| --->| |<---| |<---| |<---| |<--- |
| +---+ +---+ +---+ +---+ |
| ^ |
| | |
| head page |
| |
| |
| In this case, the head page will not move when the tail and commit |
| move back into the ring buffer. |
| |
| The reader cannot swap a page into the ring buffer if the commit page |
| is still on that page. If the read meets the last commit (a real commit, |
| not a pending or reserved one), then there is nothing more to read. |
| The buffer is considered empty until another full commit finishes. |
| |
| When the tail meets the head page, if the buffer is in overwrite mode, |
| the head page will be pushed ahead one. If the buffer is in producer/consumer |
| mode, the write will fail. |
| |
| Overwrite mode:: |
| |
|              tail page |
|                  | |
|                  v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |--->|   |--->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
|                           ^ |
|                           | |
|                       head page |
| |
| |
|              tail page |
|                  | |
|                  v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |--->|   |--->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
|                                    ^ |
|                                    | |
|                                head page |
| |
| |
|                       tail page |
|                           | |
|                           v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |--->|   |--->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
|                                    ^ |
|                                    | |
|                                head page |
| |
| Note, the reader page will still point to the previous head page. |
| But when a swap takes place, it will use the most recent head page. |
| |
| |
| Making the Ring Buffer Lockless: |
| -------------------------------- |
| |
| The main idea behind the lockless algorithm is to combine the moving |
| of the head_page pointer with the swapping of pages with the reader. |
| State flags are placed inside the pointer to the page. To do this, |
| each page must be aligned in memory by 4 bytes. This will allow the 2 |
| least significant bits of the address to be used as flags, since |
| they will always be zero for the address. To get the address, |
| simply mask out the flags:: |
| |
| MASK = ~3 |
| |
| address & MASK |
| |
| Two flags will be kept by these two bits: |
| |
| HEADER |
| - the page being pointed to is a head page |
| |
| UPDATE |
| - the page being pointed to is being updated by a writer |
| and was or is about to be a head page. |
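| |
| Packing the flags into the low bits of a page pointer can be sketched as |
| below. The macro and function names, and the assignment of the values 1 |
| and 2 to HEADER and UPDATE, are assumptions made for illustration. |
| |

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encoding for this sketch; only the two low bits are used. */
#define FLAG_NORMAL ((uintptr_t)0)
#define FLAG_HEADER ((uintptr_t)1)
#define FLAG_UPDATE ((uintptr_t)2)
#define FLAG_MASK   ((uintptr_t)3)

/* Combine a 4-byte-aligned page address with one flag value. */
static uintptr_t ptr_pack(void *page, uintptr_t flag)
{
	assert(((uintptr_t)page & FLAG_MASK) == 0);
	return (uintptr_t)page | flag;
}

/* Recover the page address: address & ~3. */
static void *ptr_page(uintptr_t val)
{
	return (void *)(val & ~FLAG_MASK);
}

/* Recover the flag bits. */
static uintptr_t ptr_flag(uintptr_t val)
{
	return val & FLAG_MASK;
}
```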
| |
| :: |
| |
| reader page |
| | |
| v |
| +---+ |
| | |------+ |
| +---+ | |
| | |
| v |
| +---+ +---+ +---+ +---+ |
| <---| |--->| |-H->| |--->| |---> |
| --->| |<---| |<---| |<---| |<--- |
| +---+ +---+ +---+ +---+ |
| |
| |
| The above pointer "-H->" has the HEADER flag set, which means the page it |
| points to is the head page: the next page to be swapped out by the reader. |
| |
| When the tail page meets the head pointer, it will use cmpxchg to |
| change the pointer to the UPDATE state:: |
| |
| |
|              tail page |
|                  | |
|                  v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-H->|   |--->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
|              tail page |
|                  | |
|                  v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-U->|   |--->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| "-U->" represents a pointer in the UPDATE state. |
| |
| Any access to the reader will need to take some sort of lock to serialize |
| the readers. But the writers will never take a lock to write to the |
| ring buffer. This means we only need to worry about a single reader, |
| and writes only preempt in "stack" formation. |
| |
| When the reader tries to swap the page with the ring buffer, it |
| will also use cmpxchg. If the flag bit in the pointer to the |
| head page does not have the HEADER flag set, the compare will fail |
| and the reader will need to look for the new head page and try again. |
| Note, the flags UPDATE and HEADER are never set at the same time. |
| |
| The reader swaps the reader page as follows:: |
| |
| +------+ |
| |reader| RING BUFFER |
| |page | |
| +------+ |
| +---+ +---+ +---+ |
| | |--->| |--->| | |
| | |<---| |<---| | |
| +---+ +---+ +---+ |
| ^ | ^ | |
| | +---------------+ | |
| +-----H-------------+ |
| |
| The reader sets the reader page next pointer as HEADER to the page after |
| the head page:: |
| |
| |
| +------+ |
| |reader| RING BUFFER |
| |page |-------H-----------+ |
| +------+ v |
| | +---+ +---+ +---+ |
| | | |--->| |--->| | |
| | | |<---| |<---| |<-+ |
| | +---+ +---+ +---+ | |
| | ^ | ^ | | |
| | | +---------------+ | | |
| | +-----H-------------+ | |
| +--------------------------------------+ |
| |
| It does a cmpxchg with the pointer to the previous head page to make it |
| point to the reader page. Note that the new pointer does not have the HEADER |
| flag set. This action atomically moves the head page forward:: |
| |
| +------+ |
| |reader| RING BUFFER |
| |page |-------H-----------+ |
| +------+ v |
| | ^ +---+ +---+ +---+ |
| | | | |-->| |-->| | |
| | | | |<--| |<--| |<-+ |
| | | +---+ +---+ +---+ | |
| | | | ^ | | |
| | | +-------------+ | | |
| | +-----------------------------+ | |
| +------------------------------------+ |
| |
| After the new head page is set, the previous pointer of the head page is |
| updated to the reader page:: |
| |
| +------+ |
| |reader| RING BUFFER |
| |page |-------H-----------+ |
| +------+ <---------------+ v |
| | ^ +---+ +---+ +---+ |
| | | | |-->| |-->| | |
| | | | | | |<--| |<-+ |
| | | +---+ +---+ +---+ | |
| | | | ^ | | |
| | | +-------------+ | | |
| | +-----------------------------+ | |
| +------------------------------------+ |
| |
| +------+ |
| |buffer| RING BUFFER |
| |page |-------H-----------+ <--- New head page |
| +------+ <---------------+ v |
| | ^ +---+ +---+ +---+ |
| | | | | | |-->| | |
| | | New | | | |<--| |<-+ |
| | | Reader +---+ +---+ +---+ | |
| | | page ----^ | | |
| | | | | |
| | +-----------------------------+ | |
| +------------------------------------+ |
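| |
| The reader's side of the swap shown above might look roughly like the |
| following sketch. ``struct fpage``, ``reader_try_swap`` and the flag |
| values are hypothetical; the sketch assumes the single serialized reader |
| described earlier and leaves out the retry loop that looks for the new |
| head page after a failure. |
| |

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FLAG_HEADER ((uintptr_t)1)   /* assumed encoding */
#define FLAG_MASK   ((uintptr_t)3)

struct fpage {
	_Atomic uintptr_t next;   /* address of the next page plus flag bits */
	struct fpage *prev;
};

/*
 * Try to replace 'head' with 'reader'. 'before' is the page whose next
 * pointer designates the head page. Fails if that pointer no longer
 * carries the HEADER flag (for example, a writer set UPDATE).
 */
static bool reader_try_swap(struct fpage *before, struct fpage *head,
			    struct fpage *reader)
{
	uintptr_t expected = (uintptr_t)head | FLAG_HEADER;
	uintptr_t after = atomic_load(&head->next) & ~FLAG_MASK;

	/* The page after the old head becomes the new head. */
	atomic_store(&reader->next, after | FLAG_HEADER);
	reader->prev = before;

	/* The new pointer has no flags: the reader page is not the head. */
	if (!atomic_compare_exchange_strong(&before->next, &expected,
					    (uintptr_t)reader))
		return false;

	/* Finally fix the previous pointer of the new head page. */
	((struct fpage *)after)->prev = reader;
	return true;
}
```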
| |
| Another important point: The page that the reader page points back to |
| by its previous pointer (the one that now points to the new head page) |
| never points back to the reader page. That is because the reader page is |
| not part of the ring buffer. Traversing the ring buffer via the next pointers |
| will always stay in the ring buffer. Traversing the ring buffer via the |
| prev pointers may not. |
| |
| Note that a reader page can be identified simply by examining its previous |
| pointer. If the next pointer of the previous page does not point back to |
| the original page, then the original page is a reader page:: |
| |
| |
| +--------+ |
| | reader | next +----+ |
| | page |-------->| |<====== (buffer page) |
| +--------+ +----+ |
| | | ^ |
| | v | next |
| prev | +----+ |
| +------------->| | |
| +----+ |
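| |
| That test reduces to a one-line comparison. The sketch below uses plain |
| pointers and a hypothetical ``is_reader_page`` helper; it ignores the flag |
| bits, which would have to be masked off first. |
| |

```c
#include <assert.h>
#include <stdbool.h>

struct bpage {
	struct bpage *next;
	struct bpage *prev;
};

/* A page inside the ring is always pointed back to by its predecessor. */
static bool is_reader_page(const struct bpage *page)
{
	return page->prev->next != page;
}
```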
| |
| The way the head page moves forward: |
| |
| When the tail page meets the head page and the buffer is in overwrite mode |
| and more writes take place, the head page must be moved forward before the |
| writer may move the tail page. The writer does this by performing a |
| cmpxchg that changes the pointer to the head page from the HEADER state |
| to the UPDATE state. Once this is done, the reader will not be able to |
| swap the head page from the buffer, nor will it be able to move the head |
| page, until the writer is finished with the move. |
| |
| This eliminates any races between the reader and the writer. The reader |
| must spin, and this is why the reader cannot preempt the writer:: |
| |
|              tail page |
|                  | |
|                  v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-H->|   |--->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
|              tail page |
|                  | |
|                  v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-U->|   |--->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| The following page will be made into the new head page:: |
| |
|              tail page |
|                  | |
|                  v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-U->|   |-H->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| After the new head page has been set, we can set the old head page |
| pointer back to NORMAL:: |
| |
|              tail page |
|                  | |
|                  v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |--->|   |-H->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| After the head page has been moved, the tail page may now move forward:: |
| |
|                       tail page |
|                           | |
|                           v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |--->|   |-H->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
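| For the simple, non-nested case, the three pointer transitions (HEADER to |
| UPDATE, marking the new head, UPDATE back to NORMAL) could be sketched as |
| follows. The names and flag values are hypothetical, and the sketch |
| deliberately ignores nested writers and the commit page. |
| |

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FLAG_HEADER ((uintptr_t)1)   /* assumed encoding */
#define FLAG_UPDATE ((uintptr_t)2)
#define FLAG_MASK   ((uintptr_t)3)

struct fpage {
	_Atomic uintptr_t next;   /* address of the next page plus flag bits */
};

static struct fpage *next_page(struct fpage *p)
{
	return (struct fpage *)(atomic_load(&p->next) & ~FLAG_MASK);
}

/*
 * 'tail' is the tail page; its next pointer designates the head page.
 * Returns false if that pointer was not in the HEADER state.
 */
static bool writer_push_head(struct fpage *tail)
{
	struct fpage *head = next_page(tail);
	struct fpage *new_head = next_page(head);
	uintptr_t expected = (uintptr_t)head | FLAG_HEADER;

	/* Step 1: HEADER -> UPDATE; the reader can no longer swap this page. */
	if (!atomic_compare_exchange_strong(&tail->next, &expected,
					    (uintptr_t)head | FLAG_UPDATE))
		return false;

	/* Step 2: mark the pointer to the following page as HEADER. */
	expected = (uintptr_t)new_head;
	atomic_compare_exchange_strong(&head->next, &expected,
				       (uintptr_t)new_head | FLAG_HEADER);

	/* Step 3: UPDATE -> NORMAL on the old head pointer. */
	expected = (uintptr_t)head | FLAG_UPDATE;
	atomic_compare_exchange_strong(&tail->next, &expected, (uintptr_t)head);
	return true;
}
```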
| |
| The above are the trivial updates. Now for the more complex scenarios. |
| |
| |
| As stated before, if enough writes preempt the first write, the |
| tail page may make it all the way around the buffer and meet the commit |
| page. At this time, we must start dropping writes (usually with some kind |
| of warning to the user). But what happens if the commit was still on the |
| reader page? In that case the commit page is not part of the ring buffer, |
| and the tail page must account for this:: |
| |
| |
| reader page commit page |
| | | |
| v | |
| +---+ | |
| | |<----------+ |
| | | |
| | |------+ |
| +---+ | |
| | |
| v |
| +---+ +---+ +---+ +---+ |
| <---| |--->| |-H->| |--->| |---> |
| --->| |<---| |<---| |<---| |<--- |
| +---+ +---+ +---+ +---+ |
| ^ |
| | |
| tail page |
| |
| If the tail page were to simply push the head page forward, the commit when |
| leaving the reader page would not be pointing to the correct page. |
| |
| The solution to this is to test if the commit page is on the reader page |
| before pushing the head page. If it is, then it can be assumed that the |
| tail page wrapped the buffer, and we must drop new writes. |
| |
| This is not a race condition, because the commit page can only be moved |
| by the outermost writer (the writer that was preempted). |
| This means that the commit will not move while a writer is moving the |
| tail page. The reader cannot swap the reader page if it is also being |
| used as the commit page. The reader can simply check that the commit |
| is off the reader page. Once the commit page leaves the reader page |
| it will never go back on it unless a reader does another swap with the |
| buffer page that is also the commit page. |
| |
| |
| Nested writes |
| ------------- |
| |
| To push the tail page forward, we must first push the head page forward |
| if the head page is the next page. If the head page is not the next page, |
| the tail page is simply updated with a cmpxchg. |
| |
| Only writers move the tail page. This must be done atomically to protect |
| against nested writers:: |
| |
| temp_page = tail_page |
| next_page = temp_page->next |
| cmpxchg(tail_page, temp_page, next_page) |
| |
| The above will update the tail page if it is still pointing to the expected |
| page. If this fails, a nested write has already pushed it forward, so the |
| current write does not need to push it:: |
| |
| |
|              temp page |
|                  | |
|                  v |
|              tail page |
|                  | |
|                  v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |--->|   |--->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| Nested write comes in and moves the tail page forward:: |
| |
|                       tail page (moved by nested writer) |
|              temp page    | |
|                  |        | |
|                  v        v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |--->|   |--->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| The above would fail the cmpxchg, but since the tail page has already |
| been moved forward, the writer will just try again to reserve storage |
| on the new tail page. |
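| |
| With C11 atomics, the tail page update described above might be written |
| as in this sketch. ``advance_tail`` and ``struct bpage`` are hypothetical |
| names; the return value tells the caller whether it moved the tail page |
| itself or a nested writer had already done so. |
| |

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct bpage {
	struct bpage *next;
};

/*
 * 'temp_page' is the tail page this writer sampled earlier. The tail is
 * advanced only if it still equals that sample; otherwise a nested writer
 * already moved it and nothing is changed.
 */
static bool advance_tail(struct bpage *_Atomic *tail_page,
			 struct bpage *temp_page)
{
	struct bpage *next_page = temp_page->next;

	return atomic_compare_exchange_strong(tail_page, &temp_page, next_page);
}
```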
| |
| But the moving of the head page is a bit more complex:: |
| |
|              tail page |
|                  | |
|                  v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-H->|   |--->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| The write converts the head page pointer to UPDATE:: |
| |
|              tail page |
|                  | |
|                  v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-U->|   |--->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| If a nested writer preempts here, it will see that the next page is a |
| head page whose pointer is already being updated. This is how it detects |
| that it is nested: it sees the UPDATE flag instead of a HEADER or NORMAL |
| pointer. It saves that information. |
| |
| The nested writer will set the new head page pointer:: |
| |
|              tail page |
|                  | |
|                  v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-U->|   |-H->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| But it will not reset the update back to normal. Only the writer |
| that converted a pointer from HEADER to UPDATE will convert it back |
| to NORMAL:: |
| |
|                       tail page |
|                           | |
|                           v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-U->|   |-H->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| After the nested writer finishes, the outermost writer will convert |
| the UPDATE pointer to NORMAL:: |
| |
|                       tail page |
|                           | |
|                           v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |--->|   |-H->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| |
| It can be even more complex if several nested writes came in and moved |
| the tail page ahead several pages:: |
| |
| |
|   (first writer) |
| |
|              tail page |
|                  | |
|                  v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-H->|   |--->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| The write converts the head page pointer to UPDATE:: |
| |
|              tail page |
|                  | |
|                  v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-U->|   |--->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| Next writer comes in, and sees the update and sets up the new |
| head page:: |
| |
|   (second writer) |
| |
|              tail page |
|                  | |
|                  v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-U->|   |-H->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| The nested writer moves the tail page forward, but does not set the old |
| update page to NORMAL because it is not the outermost writer:: |
| |
|                       tail page |
|                           | |
|                           v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-U->|   |-H->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| Another writer preempts and sees that the page after the tail page is a |
| head page. It changes the pointer from HEADER to UPDATE:: |
| |
|   (third writer) |
| |
|                       tail page |
|                           | |
|                           v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-U->|   |-U->|   |---> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| The writer will move the head page forward:: |
| |
|   (third writer) |
| |
|                       tail page |
|                           | |
|                           v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-U->|   |-U->|   |-H-> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| Because the third writer changed the HEADER flag to UPDATE, it |
| will convert it to NORMAL:: |
| |
|   (third writer) |
| |
|                       tail page |
|                           | |
|                           v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-U->|   |--->|   |-H-> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| Then it will move the tail page, and return to the second writer:: |
| |
|   (second writer) |
| |
|                                tail page |
|                                    | |
|                                    v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-U->|   |--->|   |-H-> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| The second writer will fail to move the tail page because it was already |
| moved, so it will try again and add its data to the new tail page. |
| It will return to the first writer:: |
| |
|   (first writer) |
| |
|                                tail page |
|                                    | |
|                                    v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-U->|   |--->|   |-H-> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| The first writer cannot know atomically if the tail page moved |
| while it updates the head page. It will then update the head page to |
| what it thinks is the new head page:: |
| |
|   (first writer) |
| |
|                                tail page |
|                                    | |
|                                    v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-U->|   |-H->|   |-H-> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| Since the cmpxchg returns the old value of the pointer, the first writer |
| will see it succeeded in updating the pointer from NORMAL to HEADER. |
| But as we can see, this is not good enough. It must also check to see |
| if the tail page is either where it used to be or on the next page:: |
| |
|   (first writer) |
| |
|                  A        B    tail page |
|                  |        |        | |
|                  v        v        v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-U->|   |-H->|   |-H-> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
| If tail page != A and tail page != B, then it must reset the pointer |
| back to NORMAL. Because it only needs to worry about nested writers, |
| it only needs to check this after setting the head page:: |
| |
|   (first writer) |
| |
|                  A        B    tail page |
|                  |        |        | |
|                  v        v        v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |-U->|   |--->|   |-H-> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |
| |
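| The check against A and B could be sketched as below. The function name, |
| the flag values and the way the current tail page is passed in are all |
| assumptions of this example; it only illustrates undoing a HEADER flag |
| that was set on a stale page. |
| |

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FLAG_HEADER ((uintptr_t)1)   /* assumed encoding */
#define FLAG_MASK   ((uintptr_t)3)

struct fpage {
	_Atomic uintptr_t next;   /* address of the next page plus flag bits */
};

/*
 * 'a' is where the tail page was when the writer started, 'b' is the page
 * after it, and 'tail_now' is the tail page read after the pointer leaving
 * 'b' was marked HEADER. Returns true if the HEADER flag had to be undone.
 */
static bool undo_stale_head(struct fpage *a, struct fpage *b,
			    struct fpage *tail_now)
{
	uintptr_t target, expected;

	if (tail_now == a || tail_now == b)
		return false;   /* no nested writer moved the tail past B */

	target = atomic_load(&b->next) & ~FLAG_MASK;
	expected = target | FLAG_HEADER;
	/* HEADER -> NORMAL, only if the pointer is still in the HEADER state. */
	atomic_compare_exchange_strong(&b->next, &expected, target);
	return true;
}
```
| |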
| Now the writer can update the head page. This is also why the head page must |
| remain in UPDATE and only be reset by the outermost writer. This prevents |
| the reader from seeing the incorrect head page:: |
| |
|   (first writer) |
| |
|                  A        B    tail page |
|                  |        |        | |
|                  v        v        v |
|       +---+    +---+    +---+    +---+ |
|   <---|   |--->|   |--->|   |--->|   |-H-> |
|   --->|   |<---|   |<---|   |<---|   |<--- |
|       +---+    +---+    +---+    +---+ |