.. SPDX-License-Identifier: GPL-2.0

====================================================
pin_user_pages() and related calls
====================================================

.. contents:: :local:

Overview
========

This document describes the following functions::

 pin_user_pages()
 pin_user_pages_fast()
 pin_user_pages_remote()

Basic description of FOLL_PIN
=============================

FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
("gup") family of functions. FOLL_PIN has significant interactions and
interdependencies with FOLL_LONGTERM, so both are covered here.

FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
sites. This allows the associated wrapper functions (pin_user_pages*() and
others) to set the correct combination of these flags, and to check for problems
as well.

FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
This is in order to avoid creating a large number of wrapper functions to cover
all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
that's a natural dividing line, and a good point to make separate wrapper calls.
In other words, use pin_user_pages*() for DMA-pinned pages, and
get_user_pages*() for other cases. There are five cases described later on in
this document, to further clarify that concept.
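
As a rough sketch of that dividing line (the gup calls below are real APIs,
but the wrapper functions are hypothetical, and exact gup signatures vary a
little across kernel versions)::

    #include <linux/mm.h>

    /* DMA case: pin pages that a device will read or write. */
    static int sketch_pin_for_dma(unsigned long uaddr, int nr,
                                  struct page **pages)
    {
            /* FOLL_PIN is added internally by pin_user_pages_fast(). */
            return pin_user_pages_fast(uaddr, nr, FOLL_WRITE, pages);
    }

    /* Non-DMA case: a plain reference, released with put_page(). */
    static int sketch_get_for_inspection(unsigned long uaddr, int nr,
                                         struct page **pages)
    {
            return get_user_pages_fast(uaddr, nr, 0, pages);
    }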

FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
multiple threads and call sites are free to pin the same struct pages, via both
FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
other, not the struct page(s).

The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
uses a different reference counting technique.

FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.

Which flags are set by each wrapper
===================================

For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
flags the caller provides. The caller is required to pass in a non-null struct
pages* array, and the function then pins pages by incrementing each page's
refcount by a special value: GUP_PIN_COUNTING_BIAS.
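
Conceptually, for a small (non-huge) page, the difference between the two
reference styles looks like this. This is an illustration only, not the real
implementation, which lives in mm/gup.c (try_grab_page() and related
helpers)::

    #include <linux/mm.h>
    #include <linux/page_ref.h>

    static void concept_pin(struct page *page)
    {
            page_ref_add(page, GUP_PIN_COUNTING_BIAS);   /* FOLL_PIN */
    }

    static void concept_get(struct page *page)
    {
            page_ref_add(page, 1);                       /* FOLL_GET */
    }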

For large folios, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead,
the extra space available in the struct folio is used to store the
pincount directly.

This approach for large folios avoids the counting upper limit problems
that are discussed below. Those limitations would have been aggravated
severely by huge pages, because each tail page adds a refcount to the
head page. And in fact, testing revealed that, without a separate pincount
field, refcount overflows were seen in some huge page stress tests.
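
A simplified sketch of the resulting check, modeled on the
folio_maybe_dma_pinned() helper in recent kernels (the pincount field has
moved around between kernel versions, so treat the field name here as an
assumption)::

    #include <linux/mm.h>

    static bool sketch_folio_pinned(struct folio *folio)
    {
            if (folio_test_large(folio))
                    /* Large folios: a dedicated, exact pincount. */
                    return atomic_read(&folio->_pincount) > 0;

            /* Small folios: the bias scheme; fuzzy, false positives OK. */
            return folio_ref_count(folio) >= GUP_PIN_COUNTING_BIAS;
    }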

This also means that huge pages and large folios do not suffer
from the false positives problem that is mentioned below::

 Function
 --------
 pin_user_pages          FOLL_PIN is always set internally by this function.
 pin_user_pages_fast     FOLL_PIN is always set internally by this function.
 pin_user_pages_remote   FOLL_PIN is always set internally by this function.

For these get_user_pages*() functions, FOLL_GET might not even be specified.
Behavior is a little more complex than above. If FOLL_GET was *not* specified,
but the caller passed in a non-null struct pages* array, then the function
sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount
of each page by +1::

 Function
 --------
 get_user_pages           FOLL_GET is sometimes set internally by this function.
 get_user_pages_fast      FOLL_GET is sometimes set internally by this function.
 get_user_pages_remote    FOLL_GET is sometimes set internally by this function.

Tracking dma-pinned pages
=========================

Some of the key design constraints, and solutions, for tracking dma-pinned
pages:

* An actual reference count, per struct page, is required. This is because
  multiple processes may pin and unpin a page.

* False positives (reporting that a page is dma-pinned, when in fact it is not)
  are acceptable, but false negatives are not.

* struct page may not be increased in size for this, and all fields are already
  used.

* Given the above, we can overload the page->_refcount field by using, sort of,
  the upper bits in that field for a dma-pinned count. "Sort of", means that,
  rather than dividing page->_refcount into bit fields, we simply add a
  medium-large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10
  bits) to page->_refcount. This provides fuzzy behavior: if a page has
  get_page() called on it 1024 times, then it will appear to have a single
  dma-pinned count. And again, that's acceptable.

  This also leads to limitations: there are only 31-10==21 bits available for a
  counter that increments 10 bits at a time (see the worked arithmetic after
  this list).

* Because of that limitation, special handling is applied to the zero pages
  when using FOLL_PIN. We only pretend to pin a zero page - we don't alter its
  refcount or pincount at all (it is permanent, so there's no need). The
  unpinning functions also don't do anything to a zero page. This is
  transparent to the caller.

* Callers must specifically request "dma-pinned tracking of pages". In other
  words, just calling get_user_pages() will not suffice; a new set of functions,
  pin_user_page() and related, must be used.
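
To put a number on that counting limitation, here is the worked arithmetic,
assuming the initial GUP_PIN_COUNTING_BIAS value of 1024::

    page->_refcount is a 32-bit atomic, and the sign bit must stay clear,
    leaving 31 usable bits.  Each pin consumes 2^10 of that range, so at
    most roughly

        2^31 / 2^10 == 2^21  (about 2 million)

    pins, plus any ordinary get_page() references, fit in the counter
    before overflow becomes a concern.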

FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
==========================================================

Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
these categories:

CASE 1: Direct IO (DIO)
-----------------------
There are GUP references to pages that are serving
as DIO buffers. These buffers are needed for a relatively short time (so they
are not "long term"). No special synchronization with page_mkclean() or
munmap() is provided. Therefore, flags to set at the call site are: ::

 FOLL_PIN

...but rather than setting FOLL_PIN directly, call sites should use one of
the pin_user_pages*() routines that set FOLL_PIN.
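
A minimal sketch of this Case 1 pattern (the wrapper function is hypothetical,
and exact gup signatures vary a little across kernel versions)::

    #include <linux/mm.h>

    static int dio_rw_sketch(unsigned long uaddr, int nr, struct page **pages)
    {
            int pinned;

            /* FOLL_PIN is set internally; FOLL_WRITE because the device
             * will write into these pages.
             */
            pinned = pin_user_pages_fast(uaddr, nr, FOLL_WRITE, pages);
            if (pinned < 0)
                    return pinned;

            /* ... submit the IO, wait for completion ... */

            /* Mark the pages dirty and drop the pins, in one call. */
            unpin_user_pages_dirty_lock(pages, pinned, true);
            return 0;
    }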

CASE 2: RDMA
------------
There are GUP references to pages that are serving as DMA
buffers. These buffers are needed for a long time ("long term"). No special
synchronization with page_mkclean() or munmap() is provided. Therefore, flags
to set at the call site are: ::

 FOLL_PIN | FOLL_LONGTERM

NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
because DAX pages do not have a separate page cache, and so "pinning" implies
locking down file system blocks, which is not (yet) supported in that way.
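
As a sketch, an RDMA driver registering a long-lived memory region might do
something like this (hypothetical names; real drivers, such as
drivers/infiniband/core/umem.c, add locking and accounting around this)::

    #include <linux/mm.h>

    static int rdma_mr_pin_sketch(unsigned long uaddr, int nr,
                                  struct page **pages)
    {
            /* A long-lived DMA buffer: FOLL_LONGTERM on top of FOLL_PIN. */
            return pin_user_pages_fast(uaddr, nr,
                                       FOLL_WRITE | FOLL_LONGTERM, pages);
    }

    static void rdma_mr_unpin_sketch(struct page **pages, int nr)
    {
            /* At deregistration time, pair every pin with an unpin. */
            unpin_user_pages(pages, nr);
    }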

CASE 3: MMU notifier registration, with or without page faulting hardware
-------------------------------------------------------------------------
Device drivers can pin pages via get_user_pages*(), and register for mmu
notifier callbacks for the memory range. Then, upon receiving a notifier
"invalidate range" callback, stop the device from using the range, and unpin
the pages. There may be other possible schemes, such as for example explicitly
synchronizing against pending IO, that accomplish approximately the same thing.

Or, if the hardware supports replayable page faults, then the device driver can
avoid pinning entirely (this is ideal), as follows: register for mmu notifier
callbacks as above, but instead of stopping the device and unpinning in the
callback, simply remove the range from the device's page tables.

Either way, as long as the driver unpins the pages upon mmu notifier callback,
then there is proper synchronization with both filesystem and mm
(page_mkclean(), munmap(), etc). Therefore, neither flag needs to be set.
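
A sketch of the first scheme, using the interval notifier API (struct my_dev
and the device-specific steps are hypothetical)::

    #include <linux/mm.h>
    #include <linux/mmu_notifier.h>

    struct my_dev {
            struct mmu_interval_notifier notifier;
            struct page **pages;
            unsigned long npages;
    };

    static bool my_dev_invalidate(struct mmu_interval_notifier *mni,
                                  const struct mmu_notifier_range *range,
                                  unsigned long cur_seq)
    {
            struct my_dev *dev = container_of(mni, struct my_dev, notifier);
            unsigned long i;

            mmu_interval_set_seq(mni, cur_seq);

            /* Stop the device from using the range (device-specific),
             * then drop the plain get_user_pages*() references:
             */
            for (i = 0; i < dev->npages; i++)
                    put_page(dev->pages[i]);
            dev->npages = 0;
            return true;
    }

    static const struct mmu_interval_notifier_ops my_dev_mn_ops = {
            .invalidate = my_dev_invalidate,
    };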
John Hubbard | eddb1c2 | 2020-01-30 22:12:54 -0800 | [diff] [blame] | 172 | |
| 173 | CASE 4: Pinning for struct page manipulation only |
| 174 | ------------------------------------------------- |
John Hubbard | a8f80f5 | 2020-06-07 21:40:59 -0700 | [diff] [blame] | 175 | If only struct page data (as opposed to the actual memory contents that a page |
| 176 | is tracking) is affected, then normal GUP calls are sufficient, and neither flag |
| 177 | needs to be set. |
John Hubbard | eddb1c2 | 2020-01-30 22:12:54 -0800 | [diff] [blame] | 178 | |
John Hubbard | eaf4d22a9 | 2020-06-07 21:41:11 -0700 | [diff] [blame] | 179 | CASE 5: Pinning in order to write to the data within the page |
| 180 | ------------------------------------------------------------- |
| 181 | Even though neither DMA nor Direct IO is involved, just a simple case of "pin, |
| 182 | write to a page's data, unpin" can cause a problem. Case 5 may be considered a |
| 183 | superset of Case 1, plus Case 2, plus anything that invokes that pattern. In |
| 184 | other words, if the code is neither Case 1 nor Case 2, it may still require |
| 185 | FOLL_PIN, for patterns like this: |
| 186 | |
| 187 | Correct (uses FOLL_PIN calls): |
| 188 | pin_user_pages() |
| 189 | write to the data within the pages |
| 190 | unpin_user_pages() |
| 191 | |
| 192 | INCORRECT (uses FOLL_GET calls): |
| 193 | get_user_pages() |
| 194 | write to the data within the pages |
| 195 | put_page() |
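
Fleshed out slightly, the correct pattern might look like this (a sketch;
kmap_local_page() is just one of several ways to reach the page's data)::

    #include <linux/highmem.h>
    #include <linux/mm.h>
    #include <linux/string.h>

    static int write_to_user_page_sketch(unsigned long uaddr)
    {
            struct page *page;
            void *kaddr;
            int ret;

            ret = pin_user_pages_fast(uaddr, 1, FOLL_WRITE, &page);
            if (ret != 1)
                    return ret < 0 ? ret : -EFAULT;

            kaddr = kmap_local_page(page);
            memset(kaddr, 0, PAGE_SIZE);    /* write to the page's data */
            kunmap_local(kaddr);

            /* Pair with unpin_user_page*(), never put_page(): */
            unpin_user_pages_dirty_lock(&page, 1, true);
            return 0;
    }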

page_maybe_dma_pinned(): the whole point of pinning
===================================================

The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
(and file system writeback code in general) to make informed decisions about
what to do when a page cannot be unmapped due to such pins.

What to do in those cases is the subject of a years-long series of discussions
and debates (see the References at the end of this document). It's a TODO item
here: fill in the details once that's worked out. Meanwhile, it's safe to say
that having this available: ::

 static inline bool page_maybe_dma_pinned(struct page *page)

...is a prerequisite to solving the long-running gup+DMA problem.
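
For instance, a writeback path might consult it like this (purely a sketch;
as noted above, the actual policy for pinned pages is still being settled)::

    #include <linux/mm.h>

    static bool sketch_safe_to_write_back(struct page *page)
    {
            /* May see false positives, but never false negatives. */
            if (page_maybe_dma_pinned(page))
                    return false;   /* possibly pinned: handle specially */
            return true;
    }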

Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
===================================================================

Another way of thinking about these flags is as a progression of restrictions:
FOLL_GET is for struct page manipulation, without affecting the data that the
struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
will be pinned longterm, and whose data will be accessed.

Unit testing
============
This file::

 tools/testing/selftests/mm/gup_test.c

has the following new calls to exercise the new pin*() wrapper functions:

* PIN_FAST_BENCHMARK (./gup_test -a)
* PIN_BASIC_TEST (./gup_test -b)

You can monitor how many total dma-pinned pages have been acquired and released
since the system was booted, via two new entries in /proc/vmstat: ::

 nr_foll_pin_acquired
 nr_foll_pin_released

Under normal conditions, these two values will be equal unless there are any
long-term [R]DMA pins in place, or during pin/unpin transitions.

* nr_foll_pin_acquired: This is the number of logical pins that have been
  acquired since the system was powered on. For huge pages, the head page is
  pinned once for each page (head page and each tail page) within the huge page.
  This follows the same sort of behavior that get_user_pages() uses for huge
  pages: the head page is refcounted once for each tail or head page in the huge
  page, when get_user_pages() is applied to a huge page.

* nr_foll_pin_released: The number of logical pins that have been released since
  the system was powered on. Note that pages are released (unpinned) on a
  PAGE_SIZE granularity, even if the original pin was applied to a huge page.
  Because of the pin count behavior described above in "nr_foll_pin_acquired",
  the accounting balances out, so that after doing this::

   pin_user_pages(huge_page);
   for (each page in huge_page)
       unpin_user_page(page);

  ...the following is expected::

   nr_foll_pin_released == nr_foll_pin_acquired

  (...unless it was already out of balance due to a long-term RDMA pin being in
  place.)

Other diagnostics
=================

dump_page() has been enhanced slightly to handle these new counting
fields, and to better report on large folios in general. Specifically,
for large folios, the exact pincount is reported.

References
==========

* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_

John Hubbard, October, 2019