==============================
Unevictable LRU Infrastructure
==============================

.. contents:: :local:


Introduction
============

This document describes the Linux memory manager's "Unevictable LRU"
infrastructure and the use of this to manage several types of "unevictable"
folios.

The document attempts to provide the overall rationale behind this mechanism
and the rationale for some of the design decisions that drove the
implementation. The latter design rationale is discussed in the context of an
implementation description. Admittedly, one can obtain the implementation
details - the "what does it do?" - by reading the code. One hopes that the
descriptions below add value by providing the answer to "why does it do that?".



The Unevictable LRU
===================

The Unevictable LRU facility adds an additional LRU list to track unevictable
folios and to hide these folios from vmscan. This mechanism is based on a patch
by Larry Woodman of Red Hat to address several scalability problems with folio
reclaim in Linux. The problems have been observed at customer sites on large
memory x86_64 systems.

To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
main memory will have over 32 million 4k pages in a single node. When a large
fraction of these pages are not evictable for any reason [see below], vmscan
will spend a lot of time scanning the LRU lists looking for the small fraction
of pages that are evictable. This can result in a situation where all CPUs are
spending 100% of their time in vmscan for hours or days on end, with the system
completely unresponsive.

The unevictable list addresses the following classes of unevictable pages:

 * Those owned by ramfs.

 * Those owned by tmpfs with the noswap mount option.

 * Those mapped into SHM_LOCK'd shared memory regions.

 * Those mapped into VM_LOCKED [mlock()ed] VMAs.

The infrastructure may also be able to handle other conditions that make pages
unevictable, either by definition or by circumstance, in the future.


The Unevictable LRU Folio List
------------------------------

The Unevictable LRU folio list is a lie. It was never an LRU-ordered
list, but a companion to the LRU-ordered anonymous and file, active and
inactive folio lists; and now it is not even a folio list. But following
familiar convention, here in this document and in the source, we often
imagine it as a fifth LRU folio list.

The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
called the "unevictable" list and an associated folio flag, PG_unevictable, to
indicate that the folio is being managed on the unevictable list.

The PG_unevictable flag is analogous to, and mutually exclusive with, the
PG_active flag in that it indicates on which LRU list a folio resides when
PG_lru is set.
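
For orientation, the following self-contained sketch models how these flags
select a folio's list. It is illustrative only - the structure and helper are
simplified stand-ins, loosely modelled on folio_lru_list() and the lru_list
enum in the kernel headers, not the kernel's actual definitions::

  #include <stdbool.h>

  /* Illustrative model only - not the kernel's struct folio. */
  struct folio_flags {
          bool lru;           /* PG_lru: the folio is on some LRU list */
          bool active;        /* PG_active: on an active list */
          bool unevictable;   /* PG_unevictable: on the unevictable list */
          bool file;          /* file-backed (vs. anonymous, swap-backed) */
  };

  enum lru_list {
          LRU_INACTIVE_ANON,
          LRU_ACTIVE_ANON,
          LRU_INACTIVE_FILE,
          LRU_ACTIVE_FILE,
          LRU_UNEVICTABLE,    /* the "fifth" list described above */
          NR_LRU_LISTS
  };

  /*
   * Which list does a folio with PG_lru set belong to?  PG_unevictable
   * overrides the usual anon/file x inactive/active choice, and is
   * mutually exclusive with PG_active.
   */
  static enum lru_list which_lru(const struct folio_flags *f)
  {
          if (f->unevictable)
                  return LRU_UNEVICTABLE;
          if (f->file)
                  return f->active ? LRU_ACTIVE_FILE : LRU_INACTIVE_FILE;
          return f->active ? LRU_ACTIVE_ANON : LRU_INACTIVE_ANON;
  }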

The Unevictable LRU infrastructure maintains unevictable folios as if they were
on an additional LRU list for a few reasons:

 (1) We get to "treat unevictable folios just like we treat other folios in the
     system - which means we get to use the same code to manipulate them, the
     same code to isolate them (for migrate, etc.), the same code to keep track
     of the statistics, etc..." [Rik van Riel]

 (2) We want to be able to migrate unevictable folios between nodes for memory
     defragmentation, workload management and memory hotplug. The Linux kernel
     can only migrate folios that it can successfully isolate from the LRU
     lists (or "Movable" pages: outside of consideration here). If we were to
     maintain folios elsewhere than on an LRU-like list, where they can be
     detected by folio_isolate_lru(), we would prevent their migration.

     The unevictable list does not differentiate between file-backed and
     anonymous, swap-backed folios. This differentiation is only important
     while the folios are, in fact, evictable.

The unevictable list benefits from the "arrayification" of the per-node LRU
lists and statistics originally proposed and posted by Christoph Lameter.


Memory Control Group Interaction
--------------------------------

The unevictable LRU facility interacts with the memory control group [aka
memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by
extending the lru_list enum.
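
Concretely, "arrayification" means that each per-node (and, with memcg enabled,
per-memcg, per-node) LRU vector carries one list head and one set of statistics
per lru_list value, the unevictable list included. The sketch below shows the
rough shape only; field names and layout are simplified stand-ins for the real
structure in the kernel headers, reusing the lru_list enum sketched earlier::

  /*
   * Rough shape of a per-node / per-memcg LRU vector (most fields omitted).
   * LRU_UNEVICTABLE indexes its own list head and its own statistics, so
   * every memory cgroup gets a per-node unevictable list "for free".
   */
  struct lruvec_sketch {
          struct list_head lists[NR_LRU_LISTS];   /* one list per lru_list value */
          spinlock_t lru_lock;                    /* serializes list manipulation */
          /* ... per-list size and reclaim statistics omitted ... */
  };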

The memory controller data structure automatically gets a per-node unevictable
list as a result of the "arrayification" of the per-node LRU lists (one per
lru_list enum element). The memory controller tracks the movement of pages to
and from the unevictable list.

When a memory control group comes under memory pressure, the controller will
not attempt to reclaim pages on the unevictable list. This has a couple of
effects:

 (1) Because the pages are "hidden" from reclaim on the unevictable list, the
     reclaim process can be more efficient, dealing only with pages that have a
     chance of being reclaimed.

 (2) On the other hand, if too many of the pages charged to the control group
     are unevictable, the evictable portion of the working set of the tasks in
     the control group may not fit into the available memory. This can cause
     the control group to thrash or to OOM-kill tasks.


.. _mark_addr_space_unevict:

Marking Address Spaces Unevictable
----------------------------------

For facilities such as ramfs none of the pages attached to the address space
may be evicted. To prevent eviction of any such pages, the AS_UNEVICTABLE
address space flag is provided, and this can be manipulated by a filesystem
using a number of wrapper functions:

 * ``void mapping_set_unevictable(struct address_space *mapping);``

     Mark the address space as being completely unevictable.

 * ``void mapping_clear_unevictable(struct address_space *mapping);``

     Mark the address space as being evictable.

 * ``int mapping_unevictable(struct address_space *mapping);``

     Query the address space, and return true if it is completely
     unevictable.

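For illustration, a hedged sketch of how a ramfs-like filesystem would use the
first wrapper, loosely modelled on the inode creation path in fs/ramfs/inode.c
(error handling and unrelated initialisation omitted; not verbatim kernel
source)::

  struct inode *ramfs_like_get_inode(struct super_block *sb)
  {
          struct inode *inode = new_inode(sb);

          if (inode) {
                  inode->i_ino = get_next_ino();
                  /*
                   * Sets AS_UNEVICTABLE on inode->i_mapping: every folio
                   * added to this mapping is kept off the evictable LRU
                   * lists for the life of the inode.
                   */
                  mapping_set_unevictable(inode->i_mapping);
          }
          return inode;
  }
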
These wrapper functions are currently used in three places in the kernel:

 (1) By ramfs to mark the address spaces of its inodes when they are created,
     and this mark remains for the life of the inode.

 (2) By SYSV SHM to mark SHM_LOCK'd address spaces until SHM_UNLOCK is called.
     Note that SHM_LOCK is not required to page in the locked pages if they're
     swapped out; the application must touch the pages manually if it wants to
     ensure they're in memory (see the userspace sketch after this list).

 (3) By the i915 driver to mark pinned address space until it's unpinned. The
     amount of unevictable memory marked by the i915 driver is roughly the
     bounded object size in debugfs/dri/0/i915_gem_objects.

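As a userspace illustration of case (2), the following sketch (not part of the
kernel sources; it needs CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK to
succeed) locks a SysV SHM segment and then touches it, since SHM_LOCK alone
does not fault the pages in::

  #include <stdio.h>
  #include <string.h>
  #include <sys/ipc.h>
  #include <sys/shm.h>

  int main(void)
  {
          size_t size = 4 * 1024 * 1024;
          int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);

          if (id < 0) {
                  perror("shmget");
                  return 1;
          }
          if (shmctl(id, SHM_LOCK, NULL) != 0)    /* mark the segment unevictable */
                  perror("shmctl(SHM_LOCK)");

          char *p = shmat(id, NULL, 0);
          if (p == (void *)-1) {
                  perror("shmat");
                  return 1;
          }
          memset(p, 0, size);             /* touch the pages so they are resident */

          shmctl(id, SHM_UNLOCK, NULL);   /* pages are "rescued" on unlock */
          shmdt(p);
          shmctl(id, IPC_RMID, NULL);
          return 0;
  }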

Detecting Unevictable Pages
---------------------------

The function folio_evictable() in mm/internal.h determines whether a folio is
evictable or not using the query function outlined above [see section
:ref:`Marking address spaces unevictable <mark_addr_space_unevict>`]
to check the AS_UNEVICTABLE flag.

For address spaces that are so marked after being populated (as SHM regions
might be), the lock action (e.g. SHM_LOCK) can be lazy, and need not populate
the page tables for the region as does, for example, mlock(), nor need it make
any special effort to push any pages in the SHM_LOCK'd area to the unevictable
list. Instead, vmscan will do this if and when it encounters the folios during
a reclamation scan.

On an unlock action (such as SHM_UNLOCK), the unlocker (e.g. shmctl()) must scan
the pages in the region and "rescue" them from the unevictable list if no other
condition is keeping them unevictable. If an unevictable region is destroyed,
the pages are also "rescued" from the unevictable list in the process of
freeing them.

folio_evictable() also checks for mlocked folios by calling
folio_test_mlocked(), which tests the PG_mlocked flag; this flag is set when a
folio is faulted into a VM_LOCKED VMA, or found in a VMA being VM_LOCKED.
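
Taken together, the two checks amount to roughly the following simplified
sketch of folio_evictable() (the real function in mm/internal.h additionally
takes rcu_read_lock() to keep the mapping from being freed while it is
examined)::

  static bool folio_evictable(struct folio *folio)
  {
          /* Evictable only if neither the mapping nor the folio forbids it. */
          return !mapping_unevictable(folio_mapping(folio)) &&
                 !folio_test_mlocked(folio);
  }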


Vmscan's Handling of Unevictable Folios
---------------------------------------

If unevictable folios are culled in the fault path, or moved to the unevictable
list at mlock() or mmap() time, vmscan will not encounter the folios until they
have become evictable again (via munlock() for example) and have been "rescued"
from the unevictable list. However, there may be situations where we decide,
for the sake of expediency, to leave an unevictable folio on one of the regular
active/inactive LRU lists for vmscan to deal with. vmscan checks for such
folios in all of the shrink_{active|inactive|page}_list() functions and will
"cull" such folios that it encounters: that is, it diverts those folios to the
unevictable list for the memory cgroup and node being scanned.

There may be situations where a folio is mapped into a VM_LOCKED VMA,
but the folio does not have the mlocked flag set. Such folios will make
it all the way to shrink_active_list() or shrink_page_list() where they
will be detected when vmscan walks the reverse map in folio_referenced()
or try_to_unmap(). The folio is culled to the unevictable list when it
is released by the shrinker.

To "cull" an unevictable folio, vmscan simply puts the folio back on
the LRU list using folio_putback_lru() - the inverse operation to
folio_isolate_lru() - after dropping the folio lock. Because the
condition which makes the folio unevictable may change once the folio
is unlocked, __pagevec_lru_add_fn() will recheck the unevictable state
of a folio before placing it on the unevictable list.


MLOCKED Pages
=============

The unevictable folio list is also useful for mlock(), in addition to ramfs and
SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in
NOMMU situations, all mappings are effectively mlocked.


History
-------

The "Unevictable mlocked Pages" infrastructure is based on work originally
posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU".
Nick posted his patch as an alternative to a patch posted by Christoph Lameter
to achieve the same objective: hiding mlocked pages from vmscan.

In Nick's patch, he used one of the struct page LRU list link fields as a count
of VM_LOCKED VMAs that map the page (Rik van Riel had the same idea three years
earlier). But this use of the link field for a count prevented the management
of the pages on an LRU list, and thus mlocked pages were not migratable as
isolate_lru_page() could not detect them, and the LRU list link field was not
available to the migration subsystem.

Nick resolved this by putting mlocked pages back on the LRU list before
attempting to isolate them, thus abandoning the count of VM_LOCKED VMAs. When
Nick's patch was integrated with the Unevictable LRU work, the count was
replaced by walking the reverse map when munlocking, to determine whether any
other VM_LOCKED VMAs still mapped the page.

However, walking the reverse map for each page when munlocking was ugly and
inefficient, and could lead to catastrophic contention on a file's rmap lock,
when many processes which had it mlocked were trying to exit. In 5.18, the
idea of keeping mlock_count in the Unevictable LRU list link field was revived
and put to work, without preventing the migration of mlocked pages. This is why
the "Unevictable LRU list" cannot be a linked list of pages now; but there was
no use for that linked list anyway - though its size is maintained for meminfo.


Basic Management
----------------

mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable
pages. When such a page has been "noticed" by the memory management subsystem,
the page is marked with the PG_mlocked flag. This can be manipulated using the
PageMlocked() functions.

A PG_mlocked page will be placed on the unevictable list when it is added to
the LRU. Such pages can be "noticed" by memory management in several places:

 (1) in the mlock()/mlock2()/mlockall() system call handlers;

 (2) in the mmap() system call handler when mmapping a region with the
     MAP_LOCKED flag;

 (3) mmapping a region in a task that has called mlockall() with the MCL_FUTURE
     flag;

 (4) in the fault path and when a VM_LOCKED stack segment is expanded; or

 (5) as mentioned above, in vmscan:shrink_page_list() when attempting to
     reclaim a page in a VM_LOCKED VMA by folio_referenced() or try_to_unmap().

mlocked pages become unlocked and rescued from the unevictable list when:

 (1) mapped in a range unlocked via the munlock()/munlockall() system calls;

 (2) munmap()'d out of the last VM_LOCKED VMA that maps the page, including
     unmapping at task exit;

 (3) when the page is truncated from the last VM_LOCKED VMA of an mmapped file;
     or

 (4) before a page is COW'd in a VM_LOCKED VMA.


mlock()/mlock2()/mlockall() System Call Handling
------------------------------------------------

mlock(), mlock2() and mlockall() system call handlers proceed to mlock_fixup()
for each VMA in the range specified by the call. In the case of mlockall(),
this is the entire active address space of the task. Note that mlock_fixup()
is used for both mlocking and munlocking a range of memory. A call to mlock()
an already VM_LOCKED VMA, or to munlock() a VMA that is not VM_LOCKED, is
treated as a no-op and mlock_fixup() simply returns.

If the VMA passes some filtering as described in "Filtering Special VMAs"
below, mlock_fixup() will attempt to merge the VMA with its neighbors or split
off a subset of the VMA if the range does not cover the entire VMA. Any pages
already present in the VMA are then marked as mlocked by mlock_folio() via
mlock_pte_range() via walk_page_range() via mlock_vma_pages_range().

Before returning from the system call, do_mlock() or mlockall() will call
__mm_populate() to fault in the remaining pages via get_user_pages() and to
mark those pages as mlocked as they are faulted.

Note that the VMA being mlocked might be mapped with PROT_NONE. In this case,
get_user_pages() will be unable to fault in the pages. That's okay. If pages
do end up getting faulted into this VM_LOCKED VMA, they will be handled in the
fault path - which is also how mlock2()'s MLOCK_ONFAULT areas are handled.
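
For illustration, a minimal userspace sketch of the calls discussed above (not
part of the kernel sources; it is subject to RLIMIT_MEMLOCK, and mlock2() needs
a Linux 4.4+ kernel with a glibc that provides the wrapper)::

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t len = 1 << 20;
          char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (buf == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }

          if (mlock(buf, len) != 0)       /* fault in and lock the whole range now */
                  perror("mlock");

          if (munlock(buf, len) != 0)     /* rescue the pages back to evictable LRUs */
                  perror("munlock");

          if (mlock2(buf, len, MLOCK_ONFAULT) != 0)   /* lock lazily, in the fault path */
                  perror("mlock2(MLOCK_ONFAULT)");

          buf[0] = 1;     /* this write fault mlocks the first page */

          munmap(buf, len);   /* unmapping the VM_LOCKED VMA munlocks its pages */
          return 0;
  }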

For each PTE (or PMD) being faulted into a VMA, the page add rmap function
calls mlock_vma_folio(), which calls mlock_folio() when the VMA is VM_LOCKED
(unless it is a PTE mapping of a part of a transparent huge page). Or when
it is a newly allocated anonymous page, folio_add_lru_vma() calls
mlock_new_folio() instead: similar to mlock_folio(), but can make better
judgments, since this page is held exclusively and known not to be on LRU yet.

mlock_folio() sets PG_mlocked immediately, then places the page on the CPU's
mlock folio batch, to batch up the rest of the work to be done under lru_lock by
__mlock_folio(). __mlock_folio() sets PG_unevictable, initializes mlock_count
and moves the page to unevictable state ("the unevictable LRU", but with
mlock_count in place of LRU threading). Or if the page was already PG_lru
and PG_unevictable and PG_mlocked, it simply increments the mlock_count.

But in practice that may not work ideally: the page may not yet be on an LRU, or
it may have been temporarily isolated from LRU. In such cases the mlock_count
field cannot be touched, but will be set to 0 later when __munlock_folio()
returns the page to "LRU". Races prohibit mlock_count from being set to 1 then:
rather than risk stranding a page indefinitely as unevictable, always err with
mlock_count on the low side, so that when munlocked the page will be rescued to
an evictable LRU, then perhaps be mlocked again later if vmscan finds it in a
VM_LOCKED VMA.


Filtering Special VMAs
----------------------

mlock_fixup() filters several classes of "special" VMAs:

1) VMAs with VM_IO or VM_PFNMAP set are skipped entirely. The pages behind
   these mappings are inherently pinned, so we don't need to mark them as
   mlocked. In any case, most of these pages have no struct page in which to
   record such a mark. Because of this, get_user_pages() will fail for these
   VMAs, so there is no sense in attempting to visit them.

2) VMAs mapping hugetlbfs pages are already effectively pinned into memory. We
   neither need nor want to mlock() these pages. But __mm_populate() includes
   hugetlbfs ranges, allocating the huge pages and populating the PTEs.

3) VMAs with VM_DONTEXPAND are generally userspace mappings of kernel pages,
   such as the VDSO page, relay channel pages, etc. These pages are inherently
   unevictable and are not managed on the LRU lists. __mm_populate() includes
   these ranges, populating the PTEs if not already populated.

4) VMAs with VM_MIXEDMAP set are not marked VM_LOCKED, but __mm_populate()
   includes these ranges, populating the PTEs if not already populated.

Note that for all of these special VMAs, mlock_fixup() does not set the
VM_LOCKED flag. Therefore, we won't have to deal with them later during
munlock(), munmap() or task exit. Neither does mlock_fixup() account these
VMAs against the task's "locked_vm".


munlock()/munlockall() System Call Handling
-------------------------------------------

The munlock() and munlockall() system calls are handled by the same
mlock_fixup() function as mlock(), mlock2() and mlockall() system calls are.
If called to munlock an already munlocked VMA, mlock_fixup() simply returns.
Because of the VMA filtering discussed above, VM_LOCKED will not be set in
any "special" VMAs. So, those VMAs will be ignored for munlock.

If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the
specified range. All pages in the VMA are then munlocked by munlock_folio() via
mlock_pte_range() via walk_page_range() via mlock_vma_pages_range() - the same
function used when mlocking a VMA range, with new flags for the VMA indicating
that it is munlock() being performed.

munlock_folio() uses the mlock pagevec to batch up work to be done
under lru_lock by __munlock_folio(). __munlock_folio() decrements the
folio's mlock_count, and when that reaches 0 it clears the mlocked flag
and clears the unevictable flag, moving the folio from unevictable state
to the inactive LRU.

But in practice that may not work ideally: the folio may not yet have reached
"the unevictable LRU", or it may have been temporarily isolated from it. In
those cases its mlock_count field is unusable and must be assumed to be 0: so
that the folio will be rescued to an evictable LRU, then perhaps be mlocked
again later if vmscan finds it in a VM_LOCKED VMA.


Migrating MLOCKED Pages
-----------------------

A page that is being migrated has been isolated from the LRU lists and is held
locked across unmapping of the page, updating the page's address space entry
and copying the contents and state, until the page table entry has been
replaced with an entry that refers to the new page. Linux supports migration
of mlocked pages and other unevictable pages. PG_mlocked is cleared from the
old page when it is unmapped from the last VM_LOCKED VMA, and set when the
new page is mapped in place of the migration entry in a VM_LOCKED VMA. If the page
was unevictable because mlocked, PG_unevictable follows PG_mlocked; but if the
page was unevictable for other reasons, PG_unevictable is copied explicitly.

Note that page migration can race with mlocking or munlocking of the same page.
There is mostly no problem since page migration requires unmapping all PTEs of
the old page (including munlock where VM_LOCKED), then mapping in the new page
(including mlock where VM_LOCKED). The page table locks provide sufficient
synchronization.

However, since mlock_vma_pages_range() starts by setting VM_LOCKED on a VMA,
before mlocking any pages already present, if one of those pages were migrated
before mlock_pte_range() reached it, it would get counted twice in mlock_count.
To prevent that, mlock_vma_pages_range() temporarily marks the VMA as VM_IO,
so that mlock_vma_folio() will skip it.

To complete page migration, we place the old and new pages back onto the LRU
afterwards. The "unneeded" page - old page on success, new page on failure -
is freed when the reference count held by the migration process is released.


Compacting MLOCKED Pages
------------------------

The memory map can be scanned for compactable regions and the default behavior
is to let unevictable pages be moved. /proc/sys/vm/compact_unevictable_allowed
controls this behavior (see Documentation/admin-guide/sysctl/vm.rst). The work
of compaction is mostly handled by the page migration code and the same work
flow as described in Migrating MLOCKED Pages will apply.

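For illustration, a small userspace sketch (not part of the kernel sources)
that reads the knob; writing "0" or "1" to the same file toggles it, given
sufficient privilege::

  #include <stdio.h>

  int main(void)
  {
          FILE *f = fopen("/proc/sys/vm/compact_unevictable_allowed", "r");
          int allowed;

          if (!f || fscanf(f, "%d", &allowed) != 1) {
                  perror("compact_unevictable_allowed");
                  return 1;
          }
          fclose(f);
          printf("compaction may move unevictable pages: %s\n",
                 allowed ? "yes" : "no");
          return 0;
  }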

MLOCKING Transparent Huge Pages
-------------------------------

A transparent huge page is represented by a single entry on an LRU list.
Therefore, we can only make unevictable an entire compound page, not
individual subpages.

If a user tries to mlock() part of a huge page, and no user mlock()s the
whole of the huge page, we want the rest of the page to be reclaimable.

We cannot just split the page on partial mlock() as split_huge_page() can
fail and a new intermittent failure mode for the syscall is undesirable.

We handle this by keeping PTE-mlocked huge pages on evictable LRU lists:
the PMD on the border of a VM_LOCKED VMA will be split into a PTE table.

This way the huge page is accessible for vmscan. Under memory pressure the
page will be split, subpages which belong to VM_LOCKED VMAs will be moved
to the unevictable LRU and the rest can be reclaimed.

/proc/meminfo's Unevictable and Mlocked amounts do not include those parts
of a transparent huge page which are mapped only by PTEs in VM_LOCKED VMAs.


mmap(MAP_LOCKED) System Call Handling
-------------------------------------

In addition to the mlock(), mlock2() and mlockall() system calls, an application
can request that a region of memory be mlocked by supplying the MAP_LOCKED flag
to the mmap() call. There is one important and subtle difference here, though.
mmap() + mlock() will fail, returning ENOMEM, if the range cannot be faulted in
(e.g. because __mm_populate() fails), while mmap(MAP_LOCKED) will not fail.
The mmapped area will still have properties of the locked area - pages will not
get swapped out - but major page faults to fault memory in might still happen.
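
A minimal userspace sketch of that difference (not part of the kernel sources;
subject to RLIMIT_MEMLOCK / CAP_IPC_LOCK)::

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t len = 1 << 20;

          /* Locking is best-effort here: mmap() itself does not fail just
           * because the pages could not all be faulted in. */
          char *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
          if (a == MAP_FAILED)
                  perror("mmap(MAP_LOCKED)");

          /* Here the failure is visible: mlock() fails with ENOMEM (or
           * EPERM) if the range cannot be faulted in and locked. */
          char *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (b == MAP_FAILED)
                  return 1;
          if (mlock(b, len) != 0)
                  perror("mlock");

          if (a != MAP_FAILED)
                  munmap(a, len);
          munmap(b, len);
          return 0;
  }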

Furthermore, any mmap() call or brk() call that expands the heap by a task
that has previously called mlockall() with the MCL_FUTURE flag will result
in the newly mapped memory being mlocked. Before the unevictable/mlock
changes, the kernel simply called make_pages_present() to allocate pages
and populate the page table.

To mlock a range of memory under the unevictable/mlock infrastructure,
the mmap() handler and task address space expansion functions call
populate_vma_page_range() specifying the vma and the address range to mlock.


munmap()/exit()/exec() System Call Handling
-------------------------------------------

When unmapping an mlocked region of memory, whether by an explicit call to
munmap() or via an internal unmap from exit() or exec() processing, we must
munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages.
Before the unevictable/mlock changes, mlocking did not mark the pages in any
way, so unmapping them required no processing.

For each PTE (or PMD) being unmapped from a VMA, folio_remove_rmap_*() calls
munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED
(unless it was a PTE mapping of a part of a transparent huge page).

munlock_folio() uses the mlock pagevec to batch up work to be done
under lru_lock by __munlock_folio(). __munlock_folio() decrements the
folio's mlock_count, and when that reaches 0 it clears the mlocked flag
and clears the unevictable flag, moving the folio from unevictable state
to the inactive LRU.

But in practice that may not work ideally: the folio may not yet have reached
"the unevictable LRU", or it may have been temporarily isolated from it. In
those cases its mlock_count field is unusable and must be assumed to be 0: so
that the folio will be rescued to an evictable LRU, then perhaps be mlocked
again later if vmscan finds it in a VM_LOCKED VMA.


Truncating MLOCKED Pages
------------------------

File truncation or hole punching forcibly unmaps the deleted pages from
userspace; truncation even unmaps and deletes any private anonymous pages
which had been Copied-On-Write from the file pages now being truncated.

Mlocked pages can be munlocked and deleted in this way: like with munmap(),
for each PTE (or PMD) being unmapped from a VMA, folio_remove_rmap_*() calls
munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED
(unless it was a PTE mapping of a part of a transparent huge page).

However, if there is a racing munlock(), since mlock_vma_pages_range() starts
munlocking by clearing VM_LOCKED from a VMA, before munlocking all the pages
present, if one of those pages were unmapped by truncation or hole punch before
mlock_pte_range() reached it, it would not be recognized as mlocked by this VMA,
and would not be counted out of mlock_count. In this rare case, a page may
still appear as PG_mlocked after it has been fully unmapped: and it is left to
release_pages() (or __page_cache_release()) to clear it and update statistics
before freeing (this event is counted in /proc/vmstat unevictable_pgs_cleared,
which is usually 0).


Page Reclaim in shrink_*_list()
-------------------------------

vmscan's shrink_active_list() culls any obviously unevictable pages -
i.e. !page_evictable(page) pages - diverting those to the unevictable list.
However, shrink_active_list() only sees unevictable pages that made it onto the
active/inactive LRU lists. Note that these pages do not have PG_unevictable
set - otherwise they would be on the unevictable list and shrink_active_list()
would never see them.

Some examples of these unevictable pages on the LRU lists are:

 (1) ramfs pages that have been placed on the LRU lists when first allocated.

 (2) SHM_LOCK'd shared memory pages. shmctl(SHM_LOCK) does not attempt to
     allocate or fault in the pages in the shared memory region. This happens
     when an application accesses the page the first time after SHM_LOCK'ing
     the segment.

 (3) pages still mapped into VM_LOCKED VMAs, which should be marked mlocked,
     but events left mlock_count too low, so they were munlocked too early.

vmscan's shrink_inactive_list() and shrink_page_list() also divert obviously
unevictable pages found on the inactive lists to the appropriate memory cgroup
and node unevictable list.

rmap's folio_referenced_one(), called via vmscan's shrink_active_list() or
shrink_page_list(), and rmap's try_to_unmap_one() called via shrink_page_list(),
check for (3) pages still mapped into VM_LOCKED VMAs, and call mlock_vma_folio()
to correct them. Such pages are culled to the unevictable list when released
by the shrinker.