| =========== |
| Userfaultfd |
| =========== |
| |
| Objective |
| ========= |
| |
| Userfaults allow the implementation of on-demand paging from userland |
| and more generally they allow userland to take control of various |
| memory page faults, something otherwise only the kernel code could do. |
| |
| For example, userfaults allow a cleaner and more efficient implementation |
| of the ``PROT_NONE+SIGSEGV`` trick. |
| |
| Design |
| ====== |
| |
| Userspace creates a new userfaultfd, initializes it, and registers one or more |
| regions of virtual memory with it. Then, any page faults which occur within the |
| region(s) result in a message being delivered to the userfaultfd, notifying |
| userspace of the fault. |
| |
| The ``userfaultfd`` (aside from registering and unregistering virtual |
| memory ranges) provides two primary functionalities: |
| |
| 1) a ``read/POLLIN`` protocol to notify a userland thread of the faults |
| happening |
| |
| 2) various ``UFFDIO_*`` ioctls that manage the virtual memory regions |
| registered in the ``userfaultfd``, allowing userland to efficiently |
| resolve the userfaults it receives via 1) or to manage the virtual |
| memory in the background |
| |
| The real advantage of userfaults over regular virtual memory |
| management with mremap/mprotect is that userfaults never involve |
| heavyweight structures like vmas in any of their operations (in fact the |
| ``userfaultfd`` runtime load never takes the mmap_lock for writing). |
| Vmas are not suitable for page- (or hugepage-) granular fault tracking |
| when dealing with virtual address spaces that could span |
| terabytes: too many vmas would be needed. |
| |
| The ``userfaultfd``, once created, can also be |
| passed over unix domain sockets to a manager process, so the same |
| manager process can handle the userfaults of a multitude of |
| different processes without them being aware of what is going on |
| (unless they later try to use the ``userfaultfd`` |
| themselves on the same region the manager is already tracking, a |
| corner case that currently returns ``-EBUSY``). |
| |
| API |
| === |
| |
| Creating a userfaultfd |
| ---------------------- |
| |
| There are two ways to create a new userfaultfd, each of which provides ways to |
| restrict access to this functionality (since historically userfaultfds which |
| handle kernel page faults have been a useful tool for exploiting the kernel). |
| |
| The first way, supported since userfaultfd was introduced, is the |
| userfaultfd(2) syscall. Access to this is controlled in several ways: |
| |
| - Any user can always create a userfaultfd which traps userspace page faults |
| only. Such a userfaultfd can be created using the userfaultfd(2) syscall |
| with the flag UFFD_USER_MODE_ONLY. |
| |
| - In order to also trap kernel page faults for the address space, either the |
| process needs the CAP_SYS_PTRACE capability, or the system must have |
| vm.unprivileged_userfaultfd set to 1. By default, vm.unprivileged_userfaultfd |
| is set to 0. |
| |
| The second way, added to the kernel more recently, is by opening |
| /dev/userfaultfd and issuing a USERFAULTFD_IOC_NEW ioctl to it. This method |
| yields equivalent userfaultfds to the userfaultfd(2) syscall. |
| |
| Unlike userfaultfd(2), access to /dev/userfaultfd is controlled via normal |
| filesystem permissions (user/group/mode), which gives fine-grained access to |
| userfaultfd specifically, without also granting other unrelated privileges at |
| the same time (as e.g. granting CAP_SYS_PTRACE would do). Users who have access |
| to /dev/userfaultfd can always create userfaultfds that trap kernel page faults; |
| vm.unprivileged_userfaultfd is not considered. |
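The two creation paths described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: ``uffd_create()`` is a hypothetical helper name, the fallback order (syscall first, then ``/dev/userfaultfd``) is an assumption made for the example, and the ``#ifdef`` guards only exist so the sketch still compiles against older UAPI headers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Older UAPI headers may lack this flag; 1 is the value used by the kernel. */
#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1
#endif

/*
 * Hypothetical helper: return a new userfaultfd, or -1 with errno set.
 * want_kernel_faults == 0 asks for a userspace-faults-only descriptor,
 * which any user may create.
 */
static int uffd_create(int want_kernel_faults)
{
	int flags = O_CLOEXEC | O_NONBLOCK;
	int fd;

	if (!want_kernel_faults)
		flags |= UFFD_USER_MODE_ONLY;

	/* First way: the userfaultfd(2) syscall. */
	fd = (int)syscall(SYS_userfaultfd, flags);
	if (fd >= 0 || !want_kernel_faults)
		return fd;

#ifdef USERFAULTFD_IOC_NEW
	/* Second way: access is governed by the device node's permissions. */
	int dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
	if (dev < 0)
		return -1;
	fd = ioctl(dev, USERFAULTFD_IOC_NEW, O_CLOEXEC | O_NONBLOCK);
	int saved = errno;	/* keep the ioctl's errno across close() */
	close(dev);
	errno = saved;
	return fd;
#else
	return -1;
#endif
}
```

A caller that only needs to trap userspace faults would pass ``0`` and so never depend on ``CAP_SYS_PTRACE``, ``vm.unprivileged_userfaultfd`` or the device node.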
| |
| Initializing a userfaultfd |
| -------------------------- |
| |
| When first opened, the ``userfaultfd`` must be enabled by invoking the |
| ``UFFDIO_API`` ioctl with ``uffdio_api.api`` set to ``UFFD_API`` (or |
| a later API version). This specifies the ``read/POLLIN`` protocol |
| userland intends to speak on the ``userfaultfd`` and the |
| ``uffdio_api.features`` userland requires. If successful (i.e. if the |
| requested ``uffdio_api.api`` is also spoken by the running kernel and the |
| requested features are going to be enabled), the ``UFFDIO_API`` ioctl |
| returns two 64-bit bitmasks in ``uffdio_api.features`` and |
| ``uffdio_api.ioctls``: respectively, all the available features of the |
| read(2) protocol and the generic ioctls available. |
| |
| The ``uffdio_api.features`` bitmask returned by the ``UFFDIO_API`` ioctl |
| defines what memory types are supported by the ``userfaultfd`` and what |
| events, other than page fault notifications, may be generated: |
| |
| - The ``UFFD_FEATURE_EVENT_*`` flags indicate that various events |
| other than page faults are supported. These events are described in more |
| detail below in the `Non-cooperative userfaultfd`_ section. |
| |
| - ``UFFD_FEATURE_MISSING_HUGETLBFS`` and ``UFFD_FEATURE_MISSING_SHMEM`` |
| indicate that the kernel supports ``UFFDIO_REGISTER_MODE_MISSING`` |
| registrations for hugetlbfs and shared memory (covering all shmem APIs, |
| i.e. tmpfs, ``IPCSHM``, ``/dev/zero``, ``MAP_SHARED``, ``memfd_create``, |
| etc) virtual memory areas, respectively. |
| |
| - ``UFFD_FEATURE_MINOR_HUGETLBFS`` indicates that the kernel supports |
| ``UFFDIO_REGISTER_MODE_MINOR`` registration for hugetlbfs virtual memory |
| areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating |
| support for shmem virtual memory areas. |
| |
| - ``UFFD_FEATURE_MOVE`` indicates that the kernel supports moving an |
| existing page's contents from userspace. |
| |
| The userland application should set the feature flags it intends to use |
| when invoking the ``UFFDIO_API`` ioctl, to request that those features be |
| enabled if supported. |
| |
| Once the ``userfaultfd`` API has been enabled the ``UFFDIO_REGISTER`` |
| ioctl should be invoked (if present in the returned ``uffdio_api.ioctls`` |
| bitmask) to register a memory range in the ``userfaultfd`` by setting the |
| uffdio_register structure accordingly. The ``uffdio_register.mode`` |
| bitmask will specify to the kernel which kind of faults to track for |
| the range. The ``UFFDIO_REGISTER`` ioctl will return the |
| ``uffdio_register.ioctls`` bitmask of ioctls that are suitable to resolve |
| userfaults on the range registered. Not all ioctls will necessarily be |
| supported for all memory types (e.g. anonymous memory vs. shmem vs. |
| hugetlbfs), or all types of intercepted faults. |
| |
| Userland can use the ``uffdio_register.ioctls`` to manage the virtual |
| address space in the background (to add or potentially also remove |
| memory from the ``userfaultfd`` registered range). This means a userfault |
| may trigger just before userland maps the faulting page in the |
| background. |
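As a rough sketch of the handshake and registration sequence described in this section: ``uffd_handshake()`` and ``uffd_register()`` are hypothetical helper names, and reporting a missing ``UFFDIO_REGISTER`` bit as ``ENOTSUP`` is a choice made for this example only.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper: enable the API on a fresh userfaultfd, requesting
 * `features`. On success, *granted (if non-NULL) receives the feature
 * bitmask reported back by the kernel.
 */
static int uffd_handshake(int uffd, uint64_t features, uint64_t *granted)
{
	struct uffdio_api api;

	memset(&api, 0, sizeof(api));
	api.api = UFFD_API;
	api.features = features;
	if (ioctl(uffd, UFFDIO_API, &api) == -1)
		return -1;

	/* Only rely on UFFDIO_REGISTER if the kernel advertises it. */
	if (!(api.ioctls & (1ULL << _UFFDIO_REGISTER))) {
		errno = ENOTSUP;
		return -1;
	}
	if (granted)
		*granted = api.features;
	return 0;
}

/*
 * Hypothetical helper: register [addr, addr + len) for the fault kinds in
 * `mode` (e.g. UFFDIO_REGISTER_MODE_MISSING). On success, *ioctls (if
 * non-NULL) receives the bitmask of ioctls usable on this range.
 */
static int uffd_register(int uffd, void *addr, uint64_t len, uint64_t mode,
			 uint64_t *ioctls)
{
	struct uffdio_register reg;

	memset(&reg, 0, sizeof(reg));
	reg.range.start = (uint64_t)(uintptr_t)addr;
	reg.range.len = len;
	reg.mode = mode;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1)
		return -1;
	if (ioctls)
		*ioctls = reg.ioctls;
	return 0;
}
```

Checking the returned ``ioctls`` mask (for example against ``1ULL << _UFFDIO_COPY``) before resolving faults avoids assuming that every memory type supports every ioctl.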
| |
| Resolving Userfaults |
| -------------------- |
| |
| There are three basic ways to resolve userfaults: |
| |
| - ``UFFDIO_COPY`` atomically copies some existing page contents from |
| userspace. |
| |
| - ``UFFDIO_ZEROPAGE`` atomically zeros the new page. |
| |
| - ``UFFDIO_CONTINUE`` maps an existing, previously-populated page. |
| |
| These operations are atomic in the sense that they guarantee nothing can |
| see a half-populated page, since readers will keep userfaulting until the |
| operation has finished. |
| |
| By default, these wake up userfaults blocked on the range in question. |
| They support a ``UFFDIO_*_MODE_DONTWAKE`` ``mode`` flag, which indicates |
| that waking will be done separately at some later time. |
| |
| Which ioctl to choose depends on the kind of page fault, and what we'd |
| like to do to resolve it: |
| |
| - For ``UFFDIO_REGISTER_MODE_MISSING`` faults, the fault needs to be |
| resolved by either providing a new page (``UFFDIO_COPY``), or mapping |
| the zero page (``UFFDIO_ZEROPAGE``). By default, the kernel would map |
| the zero page for a missing fault. With userfaultfd, userspace can |
| decide what content to provide before the faulting thread continues. |
| |
| - For ``UFFDIO_REGISTER_MODE_MINOR`` faults, there is an existing page (in |
| the page cache). Userspace has the option of modifying the page's |
| contents before resolving the fault. Once the contents are correct |
| (modified or not), userspace asks the kernel to map the page and let the |
| faulting thread continue with ``UFFDIO_CONTINUE``. |
| |
| Notes: |
| |
| - You can tell which kind of fault occurred by examining |
| ``pagefault.flags`` within the ``uffd_msg``, checking for the |
| ``UFFD_PAGEFAULT_FLAG_*`` flags. |
| |
| - None of the page-delivering ioctls default to the range that you |
| registered with. You must fill in all fields for the appropriate |
| ioctl struct including the range. |
| |
| - You get the address of the access that triggered the missing page |
| event from the struct uffd_msg that you read from the uffd in the |
| handling thread. You can supply as many pages as you want with these |
| ioctls. Keep in mind that unless you used DONTWAKE, the first of any of |
| those ioctls wakes up the faulting thread. |
| |
| - Be sure to test for all errors including |
| (``pollfd[0].revents & POLLERR``). This can happen, e.g. when ranges |
| supplied were incorrect. |
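The notes above can be illustrated with a small sketch that resolves one missing fault with ``UFFDIO_COPY``. The helper names ``page_base()`` and ``resolve_missing()`` are hypothetical, and the sketch assumes the caller already read a ``struct uffd_msg`` from the uffd and has a buffer of at least one page to copy from.

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Round an address down to its page start (page_size: a power of two). */
static uint64_t page_base(uint64_t addr, uint64_t page_size)
{
	return addr & ~(page_size - 1);
}

/*
 * Hypothetical helper: resolve one missing fault by atomically copying
 * page_size bytes from src into the faulting page. Every field of the
 * ioctl struct is filled in explicitly, since nothing defaults to the
 * registered range. Returns 0 on success, -1 on error.
 */
static int resolve_missing(int uffd, const struct uffd_msg *msg,
			   const void *src, uint64_t page_size)
{
	struct uffdio_copy copy;

	if (msg->event != UFFD_EVENT_PAGEFAULT)
		return -1;

	memset(&copy, 0, sizeof(copy));
	copy.dst = page_base(msg->arg.pagefault.address, page_size);
	copy.src = (uint64_t)(uintptr_t)src;
	copy.len = page_size;
	copy.mode = 0;	/* no DONTWAKE: the faulting thread is woken */
	return ioctl(uffd, UFFDIO_COPY, &copy);
}
```

Passing ``UFFDIO_COPY_MODE_DONTWAKE`` in ``copy.mode`` instead would leave the faulting thread blocked until a separate wake is issued.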
| |
| Write Protect Notifications |
| --------------------------- |
| |
| This is equivalent to (but faster than) using mprotect and a SIGSEGV |
| signal handler. |
| |
| First, register a range with ``UFFDIO_REGISTER_MODE_WP``. |
| Instead of using mprotect(2) you use |
| ``ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect)`` |
| with ``mode = UFFDIO_WRITEPROTECT_MODE_WP`` |
| in the struct passed in. The range does not default to and does not |
| have to be identical to the range you registered with. You can write |
| protect as many ranges as you like (inside the registered range). |
| Then, in the thread reading from uffd, the message will have |
| ``msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP`` set. Now you send |
| ``ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect)`` |
| again with ``UFFDIO_WRITEPROTECT_MODE_WP`` cleared in |
| ``uffdio_writeprotect.mode``. This wakes up the thread, which then |
| continues with write access. This allows you to do the bookkeeping |
| about the write in the uffd reading thread before the ioctl. |
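A minimal sketch of the protect and unprotect calls described above, assuming a range that was already registered with ``UFFDIO_REGISTER_MODE_WP``. The helper name ``uffd_set_wp()`` is hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper: write-protect (protect != 0) or un-protect
 * (protect == 0) the range [start, start + len). Un-protecting also
 * wakes any thread blocked on a write-protect fault in the range.
 */
static int uffd_set_wp(int uffd, uint64_t start, uint64_t len, int protect)
{
	struct uffdio_writeprotect wp;

	memset(&wp, 0, sizeof(wp));
	wp.range.start = start;
	wp.range.len = len;
	wp.mode = protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
	return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}
```

A handler thread would typically record the write (for example in its own dirty log) and only then call ``uffd_set_wp(uffd, page, page_size, 0)`` to let the writer proceed.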
| |
| If you registered with both ``UFFDIO_REGISTER_MODE_MISSING`` and |
| ``UFFDIO_REGISTER_MODE_WP`` then you need to think about the sequence in |
| which you supply a page and undo write protect. Note that there is a |
| difference between writes into a WP area and into a !WP area. The |
| former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter |
| ``UFFD_PAGEFAULT_FLAG_WRITE``. The latter did not fail on protection but |
| you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was |
| used. |
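When both modes are registered, the handler has to tell the two kinds of write apart before deciding what to do. A possible classification, with a hypothetical ``enum fault_kind`` and helper name, might look like this:

```c
#include <stdint.h>
#include <linux/userfaultfd.h>

/* Hypothetical classification of a pagefault message. */
enum fault_kind {
	FAULT_WP,		/* write hit a write-protected page */
	FAULT_MISSING_WRITE,	/* write to a missing page: supply a page */
	FAULT_MISSING_READ,	/* read of a missing page: supply a page */
};

/* `flags` is msg.arg.pagefault.flags from a UFFD_EVENT_PAGEFAULT message. */
static enum fault_kind classify_fault(uint64_t flags)
{
	/* Test WP first: the WRITE flag alone does not imply a WP fault. */
	if (flags & UFFD_PAGEFAULT_FLAG_WP)
		return FAULT_WP;
	if (flags & UFFD_PAGEFAULT_FLAG_WRITE)
		return FAULT_MISSING_WRITE;
	return FAULT_MISSING_READ;
}
```

Only ``FAULT_WP`` is resolved by undoing write protection; the other two still need ``UFFDIO_COPY`` or ``UFFDIO_ZEROPAGE``.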
| |
| Userfaultfd write-protect mode currently behaves differently on none ptes |
| (e.g. when a page is missing) across different types of memory. |
| |
| For anonymous memory, ``ioctl(UFFDIO_WRITEPROTECT)`` will ignore none ptes |
| (e.g. when pages are missing and not populated). For file-backed memories |
| like shmem and hugetlbfs, none ptes will be write protected just like a |
| present pte. In other words, there will be a userfaultfd write fault |
| message generated when writing to a missing page on file typed memories, |
| as long as the page range was write-protected before. Such a message will |
| not be generated on anonymous memories by default. |
| |
| If the application wants to be able to write protect none ptes on anonymous |
| memory, one can pre-populate the memory with e.g. MADV_POPULATE_READ. On |
| newer kernels, one can also detect the feature UFFD_FEATURE_WP_UNPOPULATED |
| and set the feature bit in advance to make sure none ptes will also be |
| write protected even upon anonymous memory. |
| |
| When using ``UFFDIO_REGISTER_MODE_WP`` in combination with either |
| ``UFFDIO_REGISTER_MODE_MISSING`` or ``UFFDIO_REGISTER_MODE_MINOR``, when |
| resolving missing / minor faults with ``UFFDIO_COPY`` or ``UFFDIO_CONTINUE`` |
| respectively, it may be desirable for the new page / mapping to be |
| write-protected (so future writes will also result in a WP fault). These ioctls |
| support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP`` |
| respectively) to configure the mapping this way. |
| |
| If the userfaultfd context has ``UFFD_FEATURE_WP_ASYNC`` feature bit set, |
| any vma registered with write-protection will work in async mode rather |
| than the default sync mode. |
| |
| In async mode, no message is generated when a write operation |
| happens; instead, the write-protection is resolved automatically by |
| the kernel. It can be seen as a more accurate version of soft-dirty |
| tracking, and it differs in a few ways: |
| |
| - The dirty result will not be affected by vma changes (e.g. vma |
| merging) because the dirty is only tracked by the pte. |
| |
| - It supports range operations by default, so one can enable tracking on |
| any range of memory as long as page aligned. |
| |
| - Dirty information will not get lost if the pte was zapped due to |
| various reasons (e.g. during split of a shmem transparent huge page). |
| |
| - Because its meaning is inverted relative to soft-dirty (page clean when |
| the uffd-wp bit is set; dirty when the uffd-wp bit is cleared), it has |
| different semantics on some memory operations. For example: |
| ``MADV_DONTNEED`` on anonymous memory (or ``MADV_REMOVE`` on a file |
| mapping) is treated as dirtying the memory, because the uffd-wp bit is |
| dropped during the procedure. |
| |
| The user app can collect the "written/dirty" status by looking up the |
| uffd-wp bit for the pages of interest in ``/proc/<pid>/pagemap``. |
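A sketch of such a lookup for the calling process is shown below. It assumes that bit 57 of a pagemap entry is the uffd-wp bit, as the pagemap documentation describes for recent kernels; the helper names are hypothetical, and the result is only meaningful for pages that were write-protected beforehand.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: bit 57 of a pagemap entry means "pte is uffd-wp protected". */
#define PM_UFFD_WP_BIT 57

static int pagemap_entry_uffd_wp(uint64_t entry)
{
	return (int)((entry >> PM_UFFD_WP_BIT) & 1);
}

/*
 * Hypothetical helper: returns 1 if the page containing vaddr has its
 * uffd-wp bit clear (treated as "written" in async mode), 0 if the bit is
 * still set, and -1 if the pagemap entry could not be read.
 */
static int page_written_since_wp(uintptr_t vaddr)
{
	long page_size = sysconf(_SC_PAGESIZE);
	uint64_t entry;
	ssize_t n;
	off_t off;
	int fd;

	if (page_size <= 0)
		return -1;
	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return -1;

	/* One 64-bit entry per virtual page, indexed by page number. */
	off = (off_t)(vaddr / (uintptr_t)page_size) * (off_t)sizeof(entry);
	n = pread(fd, &entry, sizeof(entry), off);
	close(fd);
	if (n != (ssize_t)sizeof(entry))
		return -1;
	return !pagemap_entry_uffd_wp(entry);
}
```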
| |
| A page is not tracked by uffd-wp async mode until it is |
| explicitly write-protected by ``ioctl(UFFDIO_WRITEPROTECT)`` with the mode |
| flag ``UFFDIO_WRITEPROTECT_MODE_WP`` set. Trying to resolve a page fault |
| that was tracked by async mode userfaultfd-wp is invalid. |
| |
| When userfaultfd-wp async mode is used alone, it can be applied to all |
| kinds of memory. |
| |
| Memory Poisoning Emulation |
| -------------------------- |
| |
| In response to a fault (either missing or minor), an action userspace can |
| take to "resolve" it is to issue a ``UFFDIO_POISON``. This will cause any |
| future faulters to either get a SIGBUS, or in KVM's case the guest will |
| receive an MCE as if there were hardware memory poisoning. |
| |
| This is used to emulate hardware memory poisoning. Imagine a VM running on a |
| machine which experiences a real hardware memory error. Later, we live migrate |
| the VM to another physical machine. Since we want the migration to be |
| transparent to the guest, we want that same address range to act as if it was |
| still poisoned, even though it's on a new physical host which ostensibly |
| doesn't have a memory error in the exact same spot. |
| |
| QEMU/KVM |
| ======== |
| |
| QEMU/KVM uses the ``userfaultfd`` syscall to implement postcopy live |
| migration. Postcopy live migration is one form of memory |
| externalization consisting of a virtual machine running with part or |
| all of its memory residing on a different node in the cloud. The |
| ``userfaultfd`` abstraction is generic enough that not a single line of |
| KVM kernel code had to be modified in order to add postcopy live |
| migration to QEMU. |
| |
| Guest async page faults, ``FOLL_NOWAIT`` and all other ``GUP*`` features work |
| just fine in combination with userfaults. Userfaults trigger async |
| page faults in the guest scheduler so those guest processes that |
| aren't waiting for userfaults (e.g. network-bound ones) can keep running |
| in the guest vcpus. |
| |
| It is generally beneficial to run one pass of precopy live migration |
| just before starting postcopy live migration, in order to avoid |
| generating userfaults for readonly guest regions. |
| |
| The implementation of postcopy live migration currently uses one |
| single bidirectional socket but in the future two different sockets |
| will be used (to reduce the latency of the userfaults to the minimum |
| possible without having to decrease ``/proc/sys/net/ipv4/tcp_wmem``). |
| |
| The QEMU in the source node writes all pages that it knows are missing |
| in the destination node, into the socket, and the migration thread of |
| the QEMU running in the destination node runs ``UFFDIO_COPY|ZEROPAGE`` |
| ioctls on the ``userfaultfd`` in order to map the received pages into the |
| guest (``UFFDIO_ZEROPAGE`` is used if the source page was a zero page). |
| |
| A different postcopy thread in the destination node listens with |
| poll() to the ``userfaultfd`` in parallel. When a ``POLLIN`` event is |
| generated after a userfault triggers, the postcopy thread calls read() on |
| the ``userfaultfd`` and receives the fault address (or ``-EAGAIN`` in case the |
| userfault was already resolved and woken by a ``UFFDIO_COPY|ZEROPAGE`` run |
| by the parallel QEMU migration thread). |
| |
| After the QEMU postcopy thread (running in the destination node) gets |
| the userfault address it writes the information about the missing page |
| into the socket. The QEMU source node receives the information and |
| roughly "seeks" to that page address and continues sending all |
| remaining missing pages from that new page offset. Soon after that |
| (just the time to flush the tcp_wmem queue through the network) the |
| migration thread in the QEMU running in the destination node will |
| receive the page that triggered the userfault and it'll map it as |
| usual with the ``UFFDIO_COPY|ZEROPAGE`` (without actually knowing if it |
| was spontaneously sent by the source or if it was an urgent page |
| requested through a userfault). |
| |
| By the time the userfaults start, the QEMU in the destination node |
| doesn't need to keep any per-page state bitmap relative to the live |
| migration around; a single per-page bitmap has to be maintained in |
| the QEMU running in the source node to know which pages are still |
| missing in the destination node. The bitmap in the source node is |
| checked to find which missing pages to send in round robin and we seek |
| over it when receiving incoming userfaults. After sending each page, |
| the bitmap is updated accordingly. The bitmap is also useful to avoid |
| sending the same page twice (in case the userfault is read by the |
| postcopy thread just before ``UFFDIO_COPY|ZEROPAGE`` runs in the migration |
| thread). |
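The source-side bookkeeping can be modelled with a toy sketch. This is not QEMU's code: the 64-page guest, ``struct src_state`` and both helper names are assumptions made purely to illustrate the round-robin scan, the seek on an incoming userfault, and the way a cleared bit prevents a page from being sent twice.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NPAGES 64	/* toy guest size, chosen so one word holds the bitmap */

struct src_state {
	uint64_t missing;	/* bit i set: page i not yet sent */
	size_t cursor;		/* next page index to consider */
};

/* An incoming userfault request "seeks" the scan to the requested page. */
static void src_seek(struct src_state *s, size_t page)
{
	s->cursor = page % NPAGES;
}

/*
 * Pick the next page that is still missing, scanning round robin from the
 * cursor and wrapping around. The bit is cleared as the page is handed out,
 * so a duplicate request for it later is simply skipped.
 * Returns false once every page has been sent.
 */
static bool src_next(struct src_state *s, size_t *page)
{
	for (size_t i = 0; i < NPAGES; i++) {
		size_t p = (s->cursor + i) % NPAGES;

		if (s->missing & (1ULL << p)) {
			s->missing &= ~(1ULL << p);
			s->cursor = (p + 1) % NPAGES;
			*page = p;
			return true;
		}
	}
	return false;
}
```

In this model a request for a page that was already sent just moves the cursor; the scan then continues with the next page that is still missing.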
| |
| Non-cooperative userfaultfd |
| =========================== |
| |
| When the ``userfaultfd`` is monitored by an external manager, the manager |
| must be able to track changes in the process virtual memory |
| layout. Userfaultfd can notify the manager about such changes using |
| the same read(2) protocol as for the page fault notifications. The |
| manager has to explicitly enable these events by setting appropriate |
| bits in ``uffdio_api.features`` passed to ``UFFDIO_API`` ioctl: |
| |
| ``UFFD_FEATURE_EVENT_FORK`` |
| enable ``userfaultfd`` hooks for fork(). When this feature is |
| enabled, the ``userfaultfd`` context of the parent process is |
| duplicated into the newly created process. The manager |
| receives ``UFFD_EVENT_FORK`` with file descriptor of the new |
| ``userfaultfd`` context in the ``uffd_msg.fork``. |
| |
| ``UFFD_FEATURE_EVENT_REMAP`` |
| enable notifications about mremap() calls. When the |
| non-cooperative process moves a virtual memory area to a |
| different location, the manager will receive |
| ``UFFD_EVENT_REMAP``. The ``uffd_msg.remap`` will contain the old and |
| new addresses of the area and its original length. |
| |
| ``UFFD_FEATURE_EVENT_REMOVE`` |
| enable notifications about madvise(MADV_REMOVE) and |
| madvise(MADV_DONTNEED) calls. The event ``UFFD_EVENT_REMOVE`` will |
| be generated upon these calls to madvise(). The ``uffd_msg.remove`` |
| will contain start and end addresses of the removed area. |
| |
| ``UFFD_FEATURE_EVENT_UNMAP`` |
| enable notifications about memory unmapping. The manager will |
| get ``UFFD_EVENT_UNMAP`` with ``uffd_msg.remove`` containing start and |
| end addresses of the unmapped area. |
| |
| Although ``UFFD_FEATURE_EVENT_REMOVE`` and ``UFFD_FEATURE_EVENT_UNMAP`` |
| are similar, they differ considerably in the action expected from the |
| ``userfaultfd`` manager. In the former case, the virtual memory is |
| removed but the area is not: the area remains monitored by the |
| ``userfaultfd``, and if a page fault occurs in that area it will be |
| delivered to the manager. The proper resolution for such a page fault is |
| to zeromap the faulting address. In the latter case, when an |
| area is unmapped, either explicitly (with the munmap() system call) or |
| implicitly (e.g. during mremap()), the area is removed and in turn the |
| ``userfaultfd`` context for that area disappears too, and the manager will |
| not get further userland page faults from the removed area. Still, the |
| notification is required in order to prevent the manager from using |
| ``UFFDIO_COPY`` on the unmapped area. |
| |
| Unlike userland page faults, which have to be synchronous and require |
| an explicit or implicit wakeup, all the events are delivered |
| asynchronously and the non-cooperative process resumes execution as |
| soon as the manager executes read(). The ``userfaultfd`` manager should |
| carefully synchronize calls to ``UFFDIO_COPY`` with event |
| processing. To aid the synchronization, the ``UFFDIO_COPY`` ioctl will |
| return ``-ENOSPC`` when the monitored process exits at the time of |
| ``UFFDIO_COPY``, and ``-ENOENT`` when the non-cooperative process has changed |
| its virtual memory layout simultaneously with an outstanding ``UFFDIO_COPY`` |
| operation. |
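A manager's dispatch logic for these messages might be organised along the following lines. The action names, ``on_uffd_msg()`` and ``judge_copy()`` are all hypothetical, and the errno values checked are the ones named in the text above; other kernel versions may report different codes.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <linux/userfaultfd.h>

/* Hypothetical actions a manager could take for each message type. */
enum mgr_action {
	ACT_RESOLVE_FAULT,	/* page fault: supply or map a page */
	ACT_ADOPT_CHILD_UFFD,	/* fork: start serving the new uffd */
	ACT_MOVE_RANGE,		/* mremap: update tracked addresses */
	ACT_ZEROMAP_ON_FAULT,	/* remove: area stays monitored */
	ACT_FORGET_RANGE,	/* unmap: stop UFFDIO_COPY into the area */
	ACT_IGNORE,
};

/* Decode one message; [*start, *end) is filled in where a range applies. */
static enum mgr_action on_uffd_msg(const struct uffd_msg *msg,
				   uint64_t *start, uint64_t *end)
{
	*start = *end = 0;
	switch (msg->event) {
	case UFFD_EVENT_PAGEFAULT:
		*start = msg->arg.pagefault.address;
		return ACT_RESOLVE_FAULT;
	case UFFD_EVENT_FORK:
		return ACT_ADOPT_CHILD_UFFD;	/* fd is in msg->arg.fork.ufd */
	case UFFD_EVENT_REMAP:
		*start = msg->arg.remap.from;
		*end = msg->arg.remap.from + msg->arg.remap.len;
		return ACT_MOVE_RANGE;		/* new base: msg->arg.remap.to */
	case UFFD_EVENT_REMOVE:
		*start = msg->arg.remove.start;
		*end = msg->arg.remove.end;
		return ACT_ZEROMAP_ON_FAULT;
	case UFFD_EVENT_UNMAP:
		*start = msg->arg.remove.start;
		*end = msg->arg.remove.end;
		return ACT_FORGET_RANGE;
	default:
		return ACT_IGNORE;
	}
}

/* Hypothetical verdicts for a finished UFFDIO_COPY attempt. */
enum copy_verdict { COPY_OK, COPY_RETRY_AFTER_EVENTS, COPY_TARGET_GONE, COPY_FAILED };

/* ret and err are the ioctl's return value and the errno it left behind. */
static enum copy_verdict judge_copy(int ret, int err)
{
	if (ret == 0)
		return COPY_OK;
	if (err == ENOENT)	/* layout changed: drain events, then retry */
		return COPY_RETRY_AFTER_EVENTS;
	if (err == ENOSPC)	/* monitored process exited */
		return COPY_TARGET_GONE;
	return COPY_FAILED;
}
```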
| |
| The current asynchronous model of the event delivery is optimal for |
| single threaded non-cooperative ``userfaultfd`` manager implementations. A |
| synchronous event delivery model can be added later as a new |
| ``userfaultfd`` feature to facilitate multithreading enhancements of the |
| non-cooperative manager, for example to allow ``UFFDIO_COPY`` ioctls to |
| run in parallel to the event reception. Single threaded |
| implementations should continue to use the current async event |
| delivery model instead. |