=====================================
Heterogeneous Memory Management (HMM)
=====================================

Provide infrastructure and helpers to integrate non-conventional memory (device
memory like GPU on board memory) into regular kernel paths, with the cornerstone
of this being specialized struct page for such memory (see sections 5 to 7 of
this document).

HMM also provides optional helpers for SVM (Shared Virtual Memory), i.e.,
allowing a device to transparently access program addresses coherently with
the CPU, meaning that any valid pointer on the CPU is also a valid pointer
for the device. This is becoming mandatory to simplify the use of advanced
heterogeneous computing where GPU, DSP, or FPGA are used to perform various
computations on behalf of a process.

This document is divided as follows: in the first section I expose the problems
related to using device specific memory allocators. In the second section, I
expose the hardware limitations that are inherent to many platforms. The third
section gives an overview of the HMM design. The fourth section explains how
CPU page-table mirroring works and the purpose of HMM in this context. The
fifth section deals with how device memory is represented inside the kernel.
Finally, the last section presents a new migration helper that allows
leveraging the device DMA engine.

.. contents:: :local:

Problems of using a device specific memory allocator
====================================================

Devices with a large amount of on board memory (several gigabytes) like GPUs
have historically managed their memory through dedicated driver specific APIs.
This creates a disconnect between memory allocated and managed by a device
driver and regular application memory (private anonymous, shared memory, or
regular file backed memory). From here on I will refer to this aspect as split
address space. I use shared address space to refer to the opposite situation:
i.e., one in which any application memory region can be used by a device
transparently.

Split address space happens because devices can only access memory allocated
through a device specific API. This implies that not all memory objects in a
program are equal from the device point of view, which complicates large
programs that rely on a wide set of libraries.

Concretely, this means that code that wants to leverage devices like GPUs needs
to copy objects between generically allocated memory (malloc, mmap private,
mmap shared) and memory allocated through the device driver API (this still
ends up with an mmap, but of the device file).

For flat data sets (array, grid, image, ...) this isn't too hard to achieve but
for complex data sets (list, tree, ...) it's hard to get right. Duplicating a
complex data set requires re-mapping all the pointer relations between each of
its elements. This is error prone and programs get harder to debug because of
the duplicated data set and addresses.

Split address space also means that libraries cannot transparently use data
they are getting from the core program or another library, and thus each
library might have to duplicate its input data set using the device specific
memory allocator. Large projects suffer from this and waste resources because
of the various memory copies.

Duplicating each library API to accept as input or output memory allocated by
each device specific allocator is not a viable option. It would lead to a
combinatorial explosion in the library entry points.

Finally, with the advance of high level language constructs (in C++ but in
other languages too) it is now possible for the compiler to leverage GPUs and
other devices without programmer knowledge. Some compiler identified patterns
are only doable with a shared address space. It is also more reasonable to use
a shared address space for all other patterns.


I/O bus, device memory characteristics
======================================

I/O buses cripple shared address spaces due to a few limitations. Most I/O
buses only allow basic memory access from device to main memory; even cache
coherency is often optional. Access to device memory from a CPU is even more
limited. More often than not, it is not cache coherent.

If we only consider the PCIE bus, then a device can access main memory (often
through an IOMMU) and be cache coherent with the CPUs. However, it only allows
a limited set of atomic operations from the device on main memory. This is
worse in the other direction: the CPU can only access a limited range of the
device memory and cannot perform atomic operations on it. Thus device memory
cannot be considered the same as regular memory from the kernel point of view.

Another crippling factor is the limited bandwidth (~32 GBytes/s with PCIE 4.0
and 16 lanes). This is roughly 30 times less than the fastest GPU memory
(1 TBytes/s). The final limitation is latency. Access to main memory from the
device has an order of magnitude higher latency than when the device accesses
its own memory.

Some platforms are developing new I/O buses or additions/modifications to PCIE
to address some of these limitations (OpenCAPI, CCIX). They mainly allow
two-way cache coherency between CPU and device and allow all atomic operations
the architecture supports. Sadly, not all platforms are following this trend
and some major architectures are left without hardware solutions to these
problems.

So for shared address space to make sense, not only must we allow devices to
access any memory but we must also permit any memory to be migrated to device
memory while the device is using it (blocking CPU access while it happens).


Shared address space and migration
==================================

HMM intends to provide two main features. The first one is to share the address
space by duplicating the CPU page table in the device page table so the same
address points to the same physical memory for any valid main memory address in
the process address space.

To achieve this, HMM offers a set of helpers to populate the device page table
while keeping track of CPU page table updates. Device page table updates are
not as easy as CPU page table updates. To update the device page table, you
must allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
specific commands in it to perform the update (unmap, cache invalidations,
flush, ...). This cannot be done through common code for all devices. Hence,
HMM provides helpers to factor out everything that can be factored out, while
leaving the hardware specific details to the device driver.

The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
allows allocating a struct page for each page of device memory. Those pages
are special because the CPU cannot map them. However, they allow migrating
main memory to device memory using existing migration mechanisms; from the CPU
point of view everything looks like a page that is swapped out to disk. Using
a struct page gives the easiest and cleanest integration with existing mm
mechanisms. Here again, HMM only provides helpers, first to hotplug new
ZONE_DEVICE memory for the device memory and second to perform migration.
Policy decisions of what and when to migrate are left to the device driver.

Note that any CPU access to a device page triggers a page fault and a migration
back to main memory. For example, when a page backing a given CPU address A is
migrated from a main memory page to a device page, then any CPU access to
address A triggers a page fault and initiates a migration back to main memory.

With these two features, HMM not only allows a device to mirror process address
space, keeping both CPU and device page tables synchronized, but also
leverages device memory by migrating the part of the data set that is actively
being used by the device.


Address space mirroring implementation and API
==============================================

Address space mirroring's main objective is to allow duplication of a range of
the CPU page table into a device page table; HMM helps keep both synchronized.
A device driver that wants to mirror a process address space must start with
the registration of a mmu_interval_notifier::

  int mmu_interval_notifier_insert(struct mmu_interval_notifier *interval_sub,
                                   struct mm_struct *mm, unsigned long start,
                                   unsigned long length,
                                   const struct mmu_interval_notifier_ops *ops);

During the ops->invalidate() callback the device driver must perform the
update action to the range (mark range read only, or fully unmap, etc.). The
device must complete the update before the driver callback returns.
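
The invalidate() callback is where the driver tears down or write-protects its
mappings and records the new notifier sequence number. As a sketch, under the
assumption of a hypothetical ``struct driver_data`` embedding the notifier and
a hypothetical ``driver_update_device_ptes()`` helper, with ``take_lock()``/
``release_lock()`` standing for the driver's update lock used in the example
below::

  static bool driver_invalidate(struct mmu_interval_notifier *interval_sub,
                                const struct mmu_notifier_range *range,
                                unsigned long cur_seq)
  {
      struct driver_data *driver =
          container_of(interval_sub, struct driver_data, notifier);

      /* Some callers cannot sleep; returning false asks to be called again. */
      if (!mmu_notifier_range_blockable(range))
          return false;

      take_lock(driver->update);
      /* Record the new sequence number under the driver lock so that
       * mmu_interval_read_retry() observes this invalidation. */
      mmu_interval_set_seq(interval_sub, cur_seq);
      /* Unmap or write-protect device page table entries covering
       * [range->start, range->end) -- hardware specific. */
      driver_update_device_ptes(driver, range->start, range->end);
      release_lock(driver->update);
      return true;
  }

  static const struct mmu_interval_notifier_ops driver_interval_ops = {
      .invalidate = driver_invalidate,
  };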

When the device driver wants to populate a range of virtual addresses, it can
use::

  int hmm_range_fault(struct hmm_range *range);

It will trigger a page fault on missing or read-only entries if write access is
requested (see below). Page faults use the generic mm page fault code path just
like a CPU page fault. The usage pattern is::

  int driver_populate_range(...)
  {
      struct hmm_range range;
      struct mm_struct *mm = interval_sub.mm;
      ...

      range.notifier = &interval_sub;
      range.start = ...;
      range.end = ...;
      range.hmm_pfns = ...;

      if (!mmget_not_zero(mm))
          return -EFAULT;

  again:
      range.notifier_seq = mmu_interval_read_begin(&interval_sub);
      mmap_read_lock(mm);
      ret = hmm_range_fault(&range);
      if (ret) {
          mmap_read_unlock(mm);
          if (ret == -EBUSY)
              goto again;
          return ret;
      }
      mmap_read_unlock(mm);

      take_lock(driver->update);
      if (mmu_interval_read_retry(&interval_sub, range.notifier_seq)) {
          release_lock(driver->update);
          goto again;
      }

      /* Use pfns array content to update device page table,
       * under the update lock */

      release_lock(driver->update);
      return 0;
  }

The driver->update lock is the same lock that the driver takes inside its
invalidate() callback. That lock must be held before calling
mmu_interval_read_retry() to avoid any race with a concurrent CPU page table
update.

Leverage default_flags and pfn_flags_mask
=========================================

The hmm_range struct has two fields, default_flags and pfn_flags_mask, that
specify fault or snapshot policy for the whole range instead of having to set
them for each entry in the pfns array.

For instance if the device driver wants pages for a range with at least read
permission, it sets::

    range->default_flags = HMM_PFN_REQ_FAULT;
    range->pfn_flags_mask = 0;

and calls hmm_range_fault() as described above. This will fault in all pages
in the range with at least read permission.

Now let's say the driver wants to do the same except for one page in the range
for which it wants to have write permission. The driver then sets::

    range->default_flags = HMM_PFN_REQ_FAULT;
    range->pfn_flags_mask = HMM_PFN_REQ_WRITE;
    range->hmm_pfns[index_of_write] = HMM_PFN_REQ_WRITE;

With this, HMM will fault in all pages with at least read (i.e., valid) and for
the address == range->start + (index_of_write << PAGE_SHIFT) it will fault with
write permission, i.e., if the CPU pte does not have write permission set then
HMM will call handle_mm_fault().

After hmm_range_fault completes the flag bits are set to the current state of
the page tables, i.e., HMM_PFN_VALID | HMM_PFN_WRITE will be set if the page is
writable.

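Once hmm_range_fault() has succeeded and mmu_interval_read_retry() has
confirmed the result is still current, the driver consumes the array. A sketch,
where ``driver_map_page()`` is a hypothetical helper that writes one device
page table entry for the given address, page, and writability::

    unsigned long npages = (range.end - range.start) >> PAGE_SHIFT;
    unsigned long i;

    for (i = 0; i < npages; ++i) {
        unsigned long entry = range.hmm_pfns[i];

        if (!(entry & HMM_PFN_VALID))
            continue;       /* no usable mapping at this address */

        driver_map_page(driver,
                        range.start + (i << PAGE_SHIFT),
                        hmm_pfn_to_page(entry),
                        entry & HMM_PFN_WRITE);
    }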

Represent and manage device memory from core kernel point of view
=================================================================

Several different designs were tried to support device memory. The first one
used a device specific data structure to keep information about migrated memory
and HMM hooked itself in various places of mm code to handle any access to
addresses that were backed by device memory. It turns out that this ended up
replicating most of the fields of struct page and also needed many kernel code
paths to be updated to understand this new kind of memory.

Most kernel code paths never try to access the memory behind a page
but only care about struct page contents. Because of this, HMM switched to
directly using struct page for device memory which left most kernel code paths
unaware of the difference. We only need to make sure that no one ever tries to
map those pages from the CPU side.

Migration to and from device memory
===================================

Because the CPU cannot access device memory directly, the device driver must
use hardware DMA or device specific load/store instructions to migrate data.
The migrate_vma_setup(), migrate_vma_pages(), and migrate_vma_finalize()
functions are designed to make drivers easier to write and to centralize common
code across drivers.

Before migrating pages to device private memory, special device private
``struct page`` entries need to be created. These will be used as special
"swap" page table entries so that a CPU process will fault if it tries to
access a page that has been migrated to device private memory.

These can be allocated and freed with::

    struct resource *res;
    struct dev_pagemap pagemap;

    res = request_free_mem_region(&iomem_resource, /* number of bytes */,
                                  "name of driver resource");
    pagemap.type = MEMORY_DEVICE_PRIVATE;
    pagemap.range.start = res->start;
    pagemap.range.end = res->end;
    pagemap.nr_range = 1;
    pagemap.ops = &device_devmem_ops;
    memremap_pages(&pagemap, numa_node_id());

    memunmap_pages(&pagemap);
    release_mem_region(pagemap.range.start, range_len(&pagemap.range));

There are also devm_request_free_mem_region(), devm_memremap_pages(),
devm_memunmap_pages(), and devm_release_mem_region() when the resources can
be tied to a ``struct device``.
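
For a driver bound to a ``struct device``, the managed variants tie the
lifetime of the region and its pages to the device. A sketch under the same
assumptions as the snippet above (``device`` and ``device_devmem_ops`` are the
driver's own)::

    void *addr;

    res = devm_request_free_mem_region(device, &iomem_resource,
                                       /* number of bytes */);
    if (IS_ERR(res))
        return PTR_ERR(res);
    pagemap.type = MEMORY_DEVICE_PRIVATE;
    pagemap.range.start = res->start;
    pagemap.range.end = res->end;
    pagemap.nr_range = 1;
    pagemap.ops = &device_devmem_ops;
    /* Pages and the region are released automatically on driver detach. */
    addr = devm_memremap_pages(device, &pagemap);
    if (IS_ERR(addr))
        return PTR_ERR(addr);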

The overall migration steps are similar to migrating NUMA pages within system
memory (see Documentation/mm/page_migration.rst) but the steps are split
between device driver specific code and shared common code:

1. ``mmap_read_lock()``

   The device driver has to pass a ``struct vm_area_struct`` to
   migrate_vma_setup() so the mmap_read_lock() or mmap_write_lock() needs to
   be held for the duration of the migration.

2. ``migrate_vma_setup(struct migrate_vma *args)``

   The device driver initializes the ``struct migrate_vma`` fields and passes
   the pointer to migrate_vma_setup(). The ``args->flags`` field is used to
   filter which source pages should be migrated. For example, setting
   ``MIGRATE_VMA_SELECT_SYSTEM`` will only migrate system memory and
   ``MIGRATE_VMA_SELECT_DEVICE_PRIVATE`` will only migrate pages residing in
   device private memory. If the latter flag is set, the ``args->pgmap_owner``
   field is used to identify device private pages owned by the driver. This
   avoids trying to migrate device private pages residing in other devices.
   Currently only anonymous private VMA ranges can be migrated to or from
   system memory and device private memory.

   One of the first steps migrate_vma_setup() does is to invalidate other
   devices' MMUs with the ``mmu_notifier_invalidate_range_start()`` and
   ``mmu_notifier_invalidate_range_end()`` calls around the page table
   walks to fill in the ``args->src`` array with PFNs to be migrated.
   The ``invalidate_range_start()`` callback is passed a
   ``struct mmu_notifier_range`` with the ``event`` field set to
   ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to
   the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This
   allows the device driver to skip the invalidation callback and only
   invalidate device private MMU mappings that are actually migrating.
   This is explained more in the next section.

   While walking the page tables, a ``pte_none()`` or ``is_zero_pfn()``
   entry results in a valid "zero" PFN stored in the ``args->src`` array.
   This lets the driver allocate device private memory and clear it instead
   of copying a page of zeros. Valid PTE entries to system memory or
   device private struct pages will be locked with ``lock_page()``, isolated
   from the LRU (if system memory since device private pages are not on
   the LRU), unmapped from the process, and a special migration PTE is
   inserted in place of the original PTE.
   migrate_vma_setup() also clears the ``args->dst`` array.
| 339 | |
| 340 | 3. The device driver allocates destination pages and copies source pages to |
| 341 | destination pages. |
| 342 | |
| 343 | The driver checks each ``src`` entry to see if the ``MIGRATE_PFN_MIGRATE`` |
| 344 | bit is set and skips entries that are not migrating. The device driver |
| 345 | can also choose to skip migrating a page by not filling in the ``dst`` |
| 346 | array for that page. |
| 347 | |
| 348 | The driver then allocates either a device private struct page or a |
| 349 | system memory page, locks the page with ``lock_page()``, and fills in the |
| 350 | ``dst`` array entry with:: |
| 351 | |
Alistair Popple | ab09243 | 2021-11-10 20:32:40 -0800 | [diff] [blame] | 352 | dst[i] = migrate_pfn(page_to_pfn(dpage)); |
Ralph Campbell | f7ebd9e | 2020-09-09 14:29:56 -0700 | [diff] [blame] | 353 | |
| 354 | Now that the driver knows that this page is being migrated, it can |
| 355 | invalidate device private MMU mappings and copy device private memory |
| 356 | to system memory or another device private page. The core Linux kernel |
| 357 | handles CPU page table invalidations so the device driver only has to |
| 358 | invalidate its own MMU mappings. |
| 359 | |
The driver can use ``migrate_pfn_to_page(src[i])`` to get the
``struct page`` of the source and either copy the source page to the
destination, or clear the destination device private memory if the pointer
is ``NULL``, meaning the source page was not populated in system memory.
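
   The allocate-and-copy loop in this step might look roughly like the
   following sketch (``drv_alloc_device_page()``, ``drv_copy_page()``, and
   ``drv_clear_page()`` stand in for driver specific helpers)::

      for (i = 0; i < npages; i++) {
              struct page *spage = migrate_pfn_to_page(args.src[i]);
              struct page *dpage;

              /* the core mm decided this page cannot migrate */
              if (!(args.src[i] & MIGRATE_PFN_MIGRATE))
                      continue;

              dpage = drv_alloc_device_page(drv);
              if (!dpage)
                      continue;  /* leaving dst[i] zero skips this page */

              lock_page(dpage);
              if (spage)
                      drv_copy_page(drv, spage, dpage);
              else
                      drv_clear_page(drv, dpage);  /* zero/pte_none() source */
              args.dst[i] = migrate_pfn(page_to_pfn(dpage));
      }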
| 364 | |
| 365 | 4. ``migrate_vma_pages()`` |
| 366 | |
| 367 | This step is where the migration is actually "committed". |
| 368 | |
| 369 | If the source page was a ``pte_none()`` or ``is_zero_pfn()`` page, this |
| 370 | is where the newly allocated page is inserted into the CPU's page table. |
| 371 | This can fail if a CPU thread faults on the same page. However, the page |
| 372 | table is locked and only one of the new pages will be inserted. |
| 373 | The device driver will see that the ``MIGRATE_PFN_MIGRATE`` bit is cleared |
| 374 | if it loses the race. |
| 375 | |
If the source page was locked, isolated, etc., the source ``struct page``
information is now copied to the destination ``struct page``, finalizing
the migration on the CPU side.
| 379 | |
| 380 | 5. Device driver updates device MMU page tables for pages still migrating, |
| 381 | rolling back pages not migrating. |
| 382 | |
| 383 | If the ``src`` entry still has ``MIGRATE_PFN_MIGRATE`` bit set, the device |
| 384 | driver can update the device MMU and set the write enable bit if the |
| 385 | ``MIGRATE_PFN_WRITE`` bit is set. |
| 386 | |
| 387 | 6. ``migrate_vma_finalize()`` |
| 388 | |
| 389 | This step replaces the special migration page table entry with the new |
| 390 | page's page table entry and releases the reference to the source and |
| 391 | destination ``struct page``. |
| 392 | |
| 393 | 7. ``mmap_read_unlock()`` |
| 394 | |
| 395 | The lock can now be released. |
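
The seven steps above collapse into a short call sequence. The following is
a sketch, not a complete driver; error handling and the copy loop are
elided::

   mmap_read_lock(mm);                     /* step 1 */
   ret = migrate_vma_setup(&args);         /* step 2 */
   if (!ret && args.cpages) {
           /* step 3: allocate dst pages, copy, fill args.dst[] */
           migrate_vma_pages(&args);       /* step 4 */
           /* step 5: update the device MMU, roll back failed pages */
           migrate_vma_finalize(&args);    /* step 6 */
   }
   mmap_read_unlock(mm);                   /* step 7 */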
Jérôme Glisse | bffc33e | 2017-09-08 16:11:19 -0700 | [diff] [blame] | 396 | |
Alistair Popple | b756a3b | 2021-06-30 18:54:25 -0700 | [diff] [blame] | 397 | Exclusive access memory |
| 398 | ======================= |
| 399 | |
Some devices have features such as atomic PTE bits that can be used to implement
atomic access to system memory. To support atomic operations on a shared virtual
memory page, such a device needs access to that page that is exclusive of any
userspace access from the CPU. The ``make_device_exclusive_range()`` function
can be used to make a memory range inaccessible from userspace.
| 405 | |
This replaces all mappings for pages in the given range with special swap
entries. Any attempt to access a swap entry results in a fault, which is
resolved by replacing the entry with the original mapping. The driver is
notified through MMU notifiers that the mapping has been changed, after which
point it will no longer have exclusive access to the page. Exclusive access is
guaranteed to last until the driver drops the page lock and page reference, at
which point any CPU faults on the page may proceed as described.
| 413 | |
Mike Rapoport | aa9f34e | 2018-03-21 21:22:22 +0200 | [diff] [blame] | 414 | Memory cgroup (memcg) and rss accounting |
| 415 | ======================================== |
Jérôme Glisse | bffc33e | 2017-09-08 16:11:19 -0700 | [diff] [blame] | 416 | |
Ralph Campbell | 2076e5c | 2019-05-06 16:29:38 -0700 | [diff] [blame] | 417 | For now, device memory is accounted as any regular page in rss counters (either |
Ralph Campbell | 76ea470 | 2018-04-10 16:28:11 -0700 | [diff] [blame] | 418 | anonymous if device page is used for anonymous, file if device page is used for |
Ralph Campbell | 2076e5c | 2019-05-06 16:29:38 -0700 | [diff] [blame] | 419 | file backed page, or shmem if device page is used for shared memory). This is a |
Ralph Campbell | 76ea470 | 2018-04-10 16:28:11 -0700 | [diff] [blame] | 420 | deliberate choice to keep existing applications, that might start using device |
| 421 | memory without knowing about it, running unimpacted. |
Jérôme Glisse | bffc33e | 2017-09-08 16:11:19 -0700 | [diff] [blame] | 422 | |
Jérôme Glisse | e8eddfd | 2018-04-10 16:29:16 -0700 | [diff] [blame] | 423 | A drawback is that the OOM killer might kill an application using a lot of |
Ralph Campbell | 76ea470 | 2018-04-10 16:28:11 -0700 | [diff] [blame] | 424 | device memory and not a lot of regular system memory and thus not freeing much |
| 425 | system memory. We want to gather more real world experience on how applications |
| 426 | and system react under memory pressure in the presence of device memory before |
Jérôme Glisse | bffc33e | 2017-09-08 16:11:19 -0700 | [diff] [blame] | 427 | deciding to account device memory differently. |
| 428 | |
| 429 | |
Ralph Campbell | 76ea470 | 2018-04-10 16:28:11 -0700 | [diff] [blame] | 430 | Same decision was made for memory cgroup. Device memory pages are accounted |
Jérôme Glisse | bffc33e | 2017-09-08 16:11:19 -0700 | [diff] [blame] | 431 | against same memory cgroup a regular page would be accounted to. This does |
| 432 | simplify migration to and from device memory. This also means that migration |
Jérôme Glisse | e8eddfd | 2018-04-10 16:29:16 -0700 | [diff] [blame] | 433 | back from device memory to regular memory cannot fail because it would |
Jérôme Glisse | bffc33e | 2017-09-08 16:11:19 -0700 | [diff] [blame] | 434 | go above memory cgroup limit. We might revisit this choice latter on once we |
Ralph Campbell | 76ea470 | 2018-04-10 16:28:11 -0700 | [diff] [blame] | 435 | get more experience in how device memory is used and its impact on memory |
Jérôme Glisse | bffc33e | 2017-09-08 16:11:19 -0700 | [diff] [blame] | 436 | resource control. |
| 437 | |
| 438 | |
Ralph Campbell | 2076e5c | 2019-05-06 16:29:38 -0700 | [diff] [blame] | 439 | Note that device memory can never be pinned by a device driver nor through GUP |
Jérôme Glisse | bffc33e | 2017-09-08 16:11:19 -0700 | [diff] [blame] | 440 | and thus such memory is always free upon process exit. Or when last reference |
Ralph Campbell | 76ea470 | 2018-04-10 16:28:11 -0700 | [diff] [blame] | 441 | is dropped in case of shared memory or file backed memory. |