.. SPDX-License-Identifier: GPL-2.0

====================================================
pin_user_pages() and related calls
====================================================

.. contents:: :local:

Overview
========

This document describes the following functions::

 pin_user_pages()
 pin_user_pages_fast()
 pin_user_pages_remote()

Basic description of FOLL_PIN
=============================

FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
("gup") family of functions. FOLL_PIN has significant interactions and
interdependencies with FOLL_LONGTERM, so both are covered here.

FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
sites. This allows the associated wrapper functions (pin_user_pages*() and
others) to set the correct combination of these flags, and to check for problems
as well.

FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
This is in order to avoid creating a large number of wrapper functions to cover
all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
that's a natural dividing line, and a good point to make separate wrapper calls.
In other words, use pin_user_pages*() for DMA-pinned pages, and
get_user_pages*() for other cases. There are five cases described later on in
this document, to further clarify that concept.

FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
multiple threads and call sites are free to pin the same struct pages, via both
FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
other, not the struct page(s).

The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
uses a different reference counting technique.

FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.

Which flags are set by each wrapper
===================================

For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
flags the caller provides. The caller is required to pass in a non-null struct
pages* array, and the function then pins pages by incrementing the refcount of
each page by a special value: GUP_PIN_COUNTING_BIAS.

For large folios, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead,
the extra space available in the struct folio is used to store the
pincount directly.

This approach for large folios avoids the counting upper limit problems
that are discussed below. Those limitations would have been aggravated
severely by huge pages, because each tail page adds a refcount to the
head page. And in fact, testing revealed that, without a separate pincount
field, refcount overflows were seen in some huge page stress tests.

This also means that huge pages and large folios do not suffer
from the false positives problem that is mentioned below. ::

 Function
 --------
 pin_user_pages          FOLL_PIN is always set internally by this function.
 pin_user_pages_fast     FOLL_PIN is always set internally by this function.
 pin_user_pages_remote   FOLL_PIN is always set internally by this function.

For these get_user_pages*() functions, FOLL_GET might not even be specified.
Behavior is a little more complex than above. If FOLL_GET was *not* specified,
but the caller passed in a non-null struct pages* array, then the function
sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount
of each page by +1. ::

 Function
 --------
 get_user_pages          FOLL_GET is sometimes set internally by this function.
 get_user_pages_fast     FOLL_GET is sometimes set internally by this function.
 get_user_pages_remote   FOLL_GET is sometimes set internally by this function.
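
The flag handling described in this section can be sketched as a small
userspace model. This is only an illustration, not the kernel implementation:
the MODEL_* bit values and the model_*() helper names are hypothetical, and
rejecting a caller-supplied FOLL_PIN or FOLL_GET is an assumption drawn from
the rules stated above (FOLL_PIN is internal to gup, and is mutually exclusive
with FOLL_GET).

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    /* Placeholder bit values for illustration; the real flags are defined
     * in the kernel's mm headers and may differ. */
    #define MODEL_FOLL_GET      (1u << 0)
    #define MODEL_FOLL_PIN      (1u << 1)
    #define MODEL_FOLL_LONGTERM (1u << 2)

    /*
     * Model of the flag setup in a pin_user_pages*() wrapper.
     * Returns false if the caller's arguments are rejected.
     */
    static bool model_pin_flags(unsigned int caller_flags, const void *pages,
                                unsigned int *out_flags)
    {
        /* Assumption: call sites may pass neither FOLL_PIN nor FOLL_GET. */
        if (caller_flags & (MODEL_FOLL_GET | MODEL_FOLL_PIN))
            return false;
        /* A non-null pages array is required for pinning. */
        if (pages == NULL)
            return false;
        /* FOLL_PIN is OR'd in with whatever the caller provided. */
        *out_flags = caller_flags | MODEL_FOLL_PIN;
        return true;
    }

    /* Model of get_user_pages*(): FOLL_GET is set only if pages is non-null. */
    static unsigned int model_get_flags(unsigned int caller_flags,
                                        const void *pages)
    {
        if (pages != NULL)
            caller_flags |= MODEL_FOLL_GET;
        return caller_flags;
    }

In this model, a caller passing only MODEL_FOLL_LONGTERM to model_pin_flags()
gets back MODEL_FOLL_LONGTERM | MODEL_FOLL_PIN, which mirrors the "FOLL_PIN is
a prerequisite to FOLL_LONGTERM" rule.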

Tracking dma-pinned pages
=========================

Some of the key design constraints, and solutions, for tracking dma-pinned
pages:

* An actual reference count, per struct page, is required. This is because
  multiple processes may pin and unpin a page.

* False positives (reporting that a page is dma-pinned, when in fact it is not)
  are acceptable, but false negatives are not.

* struct page may not be increased in size for this, and all fields are already
  used.

* Given the above, we can overload the page->_refcount field by using, sort of,
  the upper bits in that field for a dma-pinned count. "Sort of", means that,
  rather than dividing page->_refcount into bit fields, we simply add a
  medium-large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024:
  10 bits) to page->_refcount. This provides fuzzy behavior: if a page has
  get_page() called on it 1024 times, then it will appear to have a single
  dma-pinned count. And again, that's acceptable.

This also leads to limitations: there are only 31-10==21 bits available for a
counter that increments 10 bits at a time, so roughly 2^21 (about two million)
pins fit on a single page before the refcount would overflow.

* Because of that limitation, special handling is applied to the zero pages
  when using FOLL_PIN. We only pretend to pin a zero page - we don't alter its
  refcount or pincount at all (it is permanent, so there's no need). The
  unpinning functions also don't do anything to a zero page. This is
  transparent to the caller.

* Callers must specifically request "dma-pinned tracking of pages". In other
  words, just calling get_user_pages() will not suffice; a new set of functions,
  pin_user_pages() and related, must be used.
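
The biased-refcount scheme and its fuzzy behavior can be illustrated with a
small userspace model. This is a sketch only: struct model_page and the
model_*() helpers are hypothetical stand-ins, not kernel code, and the bias
value of 1024 is the one quoted above.

.. code-block:: c

    #include <stdbool.h>

    #define MODEL_PIN_BIAS 1024    /* value quoted in the text above */

    /* Stand-in for struct page: only the overloaded refcount is modeled. */
    struct model_page {
        int refcount;
    };

    /* Ordinary references move the count by one. */
    static void model_get_page(struct model_page *p) { p->refcount += 1; }
    static void model_put_page(struct model_page *p) { p->refcount -= 1; }

    /* Pins move the count by the bias. */
    static void model_pin_page(struct model_page *p)
    {
        p->refcount += MODEL_PIN_BIAS;
    }

    static void model_unpin_page(struct model_page *p)
    {
        p->refcount -= MODEL_PIN_BIAS;
    }

    /*
     * "Maybe" pinned: true whenever at least one pin is held, but also
     * true when 1024 or more ordinary references are held (a false
     * positive). It is never false while a pin is held.
     */
    static bool model_maybe_dma_pinned(const struct model_page *p)
    {
        return p->refcount >= MODEL_PIN_BIAS;
    }

    /* Number of pins that fit in the 31 - 10 == 21 bits mentioned above. */
    static long model_pin_capacity(void)
    {
        return 1L << (31 - 10);
    }

In this model, a page that starts with one reference and then receives 1023
more model_get_page() calls reports as "maybe pinned" even though no pin was
ever taken, which is the acceptable false positive described above.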

FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
==========================================================

Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
these categories:

CASE 1: Direct IO (DIO)
-----------------------
There are GUP references to pages that are serving
as DIO buffers. These buffers are needed for a relatively short time (so they
are not "long term"). No special synchronization with folio_mkclean() or
munmap() is provided. Therefore, flags to set at the call site are: ::

    FOLL_PIN

...but rather than setting FOLL_PIN directly, call sites should use one of
the pin_user_pages*() routines that set FOLL_PIN.

CASE 2: RDMA
------------
There are GUP references to pages that are serving as DMA
buffers. These buffers are needed for a long time ("long term"). No special
synchronization with folio_mkclean() or munmap() is provided. Therefore, flags
to set at the call site are: ::

    FOLL_PIN | FOLL_LONGTERM

NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
because DAX pages do not have a separate page cache, and so "pinning" implies
locking down file system blocks, which is not (yet) supported in that way.

.. _mmu-notifier-registration-case:

CASE 3: MMU notifier registration, with or without page faulting hardware
-------------------------------------------------------------------------
Device drivers can pin pages via get_user_pages*(), and register for mmu
notifier callbacks for the memory range. Then, upon receiving a notifier
"invalidate range" callback, stop the device from using the range, and unpin
the pages. There may be other possible schemes, such as for example explicitly
synchronizing against pending IO, that accomplish approximately the same thing.

Or, if the hardware supports replayable page faults, then the device driver can
avoid pinning entirely (this is ideal), as follows: register for mmu notifier
callbacks as above, but instead of stopping the device and unpinning in the
callback, simply remove the range from the device's page tables.

Either way, as long as the driver unpins the pages upon mmu notifier callback,
then there is proper synchronization with both filesystem and mm
(folio_mkclean(), munmap(), etc). Therefore, neither flag needs to be set.

CASE 4: Pinning for struct page manipulation only
-------------------------------------------------
If only struct page data (as opposed to the actual memory contents that a page
is tracking) is affected, then normal GUP calls are sufficient, and neither flag
needs to be set.

CASE 5: Pinning in order to write to the data within the page
-------------------------------------------------------------
Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
write to a page's data, unpin" can cause a problem. Case 5 may be considered a
superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
other words, if the code is neither Case 1 nor Case 2, it may still require
FOLL_PIN, for patterns like this:

Correct (uses FOLL_PIN calls):
    pin_user_pages()
    write to the data within the pages
    unpin_user_pages()

INCORRECT (uses FOLL_GET calls):
    get_user_pages()
    write to the data within the pages
    put_page()
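
The five cases can be summarized as a lookup from use case to flags. The
following is an illustrative sketch only: the enum, the SKETCH_* bit values,
and sketch_flags_for_case() are hypothetical names invented for this example
and do not exist in the kernel. A result of 0 stands for "neither flag is
needed, plain get_user_pages*() is sufficient".

.. code-block:: c

    /* Placeholder bit values for illustration; not the kernel's values. */
    #define SKETCH_FOLL_PIN      (1u << 1)
    #define SKETCH_FOLL_LONGTERM (1u << 2)

    /* The five cases described in this section (hypothetical names). */
    enum sketch_gup_case {
        SKETCH_CASE_DIO = 1,            /* CASE 1: Direct IO */
        SKETCH_CASE_RDMA,               /* CASE 2: long-term DMA buffers */
        SKETCH_CASE_MMU_NOTIFIER,       /* CASE 3: notifier-synchronized */
        SKETCH_CASE_STRUCT_PAGE_ONLY,   /* CASE 4: struct page data only */
        SKETCH_CASE_WRITE_PAGE_DATA,    /* CASE 5: pin, write, unpin */
    };

    /* Return the pin-related flags that each case calls for. */
    static unsigned int sketch_flags_for_case(enum sketch_gup_case c)
    {
        switch (c) {
        case SKETCH_CASE_DIO:
        case SKETCH_CASE_WRITE_PAGE_DATA:
            /* Short-term pin; reached via the pin_user_pages*() wrappers. */
            return SKETCH_FOLL_PIN;
        case SKETCH_CASE_RDMA:
            return SKETCH_FOLL_PIN | SKETCH_FOLL_LONGTERM;
        case SKETCH_CASE_MMU_NOTIFIER:
        case SKETCH_CASE_STRUCT_PAGE_ONLY:
            return 0;
        }
        return 0;
    }

Real call sites do not compute flags this way; they simply call the matching
wrapper. The table form is only meant to make the mapping easy to scan.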

folio_maybe_dma_pinned(): the whole point of pinning
====================================================

The whole point of marking folios as "DMA-pinned" or "gup-pinned" is to be able
to query, "is this folio DMA-pinned?" That allows code such as folio_mkclean()
(and file system writeback code in general) to make informed decisions about
what to do when a folio cannot be unmapped due to such pins.

What to do in those cases is the subject of a years-long series of discussions
and debates (see the References at the end of this document). It's a TODO item
here: fill in the details once that's worked out. Meanwhile, it's safe to say
that having this available: ::

    static inline bool folio_maybe_dma_pinned(struct folio *folio)

...is a prerequisite to solving the long-running gup+DMA problem.
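
To make the query concrete, here is a userspace model of how such a check
could combine the two counting schemes described earlier in this document.
It is a sketch under stated assumptions, not the kernel's source: struct
toy_folio and the toy_*() helpers are hypothetical, and the detail that a pin
on a large folio also takes one ordinary reference is an assumption of this
model.

.. code-block:: c

    #include <stdbool.h>

    #define TOY_PIN_BIAS 1024    /* GUP_PIN_COUNTING_BIAS value quoted earlier */

    /* Stand-in for struct folio, reduced to the fields this model needs. */
    struct toy_folio {
        bool large;      /* true if the folio spans more than one page */
        int refcount;    /* ordinary reference count */
        int pincount;    /* exact pin count, used only when large */
    };

    static void toy_folio_pin(struct toy_folio *f)
    {
        if (f->large) {
            f->refcount += 1;    /* assumption: a pin also holds a reference */
            f->pincount += 1;    /* exact count kept in the extra space */
        } else {
            f->refcount += TOY_PIN_BIAS;
        }
    }

    static void toy_folio_unpin(struct toy_folio *f)
    {
        if (f->large) {
            f->refcount -= 1;
            f->pincount -= 1;
        } else {
            f->refcount -= TOY_PIN_BIAS;
        }
    }

    /* Exact for large folios; fuzzy (false positives possible) otherwise. */
    static bool toy_folio_maybe_dma_pinned(const struct toy_folio *f)
    {
        if (f->large)
            return f->pincount > 0;
        return f->refcount >= TOY_PIN_BIAS;
    }

In this model a large folio with thousands of ordinary references and no pins
reports "not pinned", while a single-page folio with the same refcount
reports "maybe pinned".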

Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
===================================================================

Another way of thinking about these flags is as a progression of restrictions:
FOLL_GET is for struct page manipulation, without affecting the data that the
struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
will be pinned longterm, and whose data will be accessed.

Unit testing
============
This file::

 tools/testing/selftests/mm/gup_test.c

has the following new calls to exercise the new pin*() wrapper functions:

* PIN_FAST_BENCHMARK (./gup_test -a)
* PIN_BASIC_TEST (./gup_test -b)

You can monitor how many total dma-pinned pages have been acquired and released
since the system was booted, via two new /proc/vmstat entries: ::

 /proc/vmstat/nr_foll_pin_acquired
 /proc/vmstat/nr_foll_pin_released

Under normal conditions, these two values will be equal unless there are any
long-term [R]DMA pins in place, or during pin/unpin transitions.

* nr_foll_pin_acquired: This is the number of logical pins that have been
  acquired since the system was powered on. For huge pages, the head page is
  pinned once for each page (head page and each tail page) within the huge page.
  This follows the same sort of behavior that get_user_pages() uses for huge
  pages: the head page is refcounted once for each tail or head page in the huge
  page, when get_user_pages() is applied to a huge page.

* nr_foll_pin_released: The number of logical pins that have been released since
  the system was powered on. Note that pages are released (unpinned) on a
  PAGE_SIZE granularity, even if the original pin was applied to a huge page.
  Because of the pin count behavior described above in "nr_foll_pin_acquired",
  the accounting balances out, so that after doing this::

    pin_user_pages(huge_page);
    for (each page in huge_page)
        unpin_user_page(page);

...the following is expected::

    nr_foll_pin_released == nr_foll_pin_acquired

(...unless it was already out of balance due to a long-term RDMA pin being in
place.)
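
One way to check this balance from userspace is to compare the two counters
directly. The sketch below assumes the counters appear as ordinary
"name value" lines in the text of /proc/vmstat; the vmstat_sketch_*() helper
names are hypothetical. The helpers work on a text buffer so that they can be
tried on sample input.

.. code-block:: c

    #include <stdio.h>
    #include <string.h>

    /*
     * Scan text made of "name value" lines and return the value stored
     * for the given name, or -1 if the name is not present.
     */
    static long long vmstat_sketch_value(const char *text, const char *name)
    {
        size_t name_len = strlen(name);
        const char *line = text;

        while (line && *line) {
            long long value;

            /* Match the whole name, followed by a separating space. */
            if (strncmp(line, name, name_len) == 0 &&
                line[name_len] == ' ' &&
                sscanf(line + name_len, "%lld", &value) == 1)
                return value;

            /* Advance to the start of the next line, if any. */
            line = strchr(line, '\n');
            if (line)
                line++;
        }
        return -1;
    }

    /* Pins acquired but not yet released, or -1 if a counter is missing. */
    static long long vmstat_sketch_outstanding(const char *text)
    {
        long long acquired = vmstat_sketch_value(text, "nr_foll_pin_acquired");
        long long released = vmstat_sketch_value(text, "nr_foll_pin_released");

        if (acquired < 0 || released < 0)
            return -1;
        return acquired - released;
    }

To use this on a live system, read the contents of /proc/vmstat into a
NUL-terminated buffer and pass it to vmstat_sketch_outstanding(). A result
that stays above zero while the system is idle suggests long-term pins are
in place.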

Other diagnostics
=================

dump_page() has been enhanced slightly to handle these new counting
fields, and to better report on large folios in general. Specifically,
for large folios, the exact pincount is reported.
John Hubbarddc8fb2f22020-04-01 21:05:52 -0700277
John Hubbardeddb1c22020-01-30 22:12:54 -0800278References
279==========
280
281* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
282* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
283* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
John Hubbard47e29d32020-04-01 21:05:33 -0700284* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_
John Hubbardeddb1c22020-01-30 22:12:54 -0800285
286John Hubbard, October, 2019