 .. SPDX-License-Identifier: GPL-2.0

 =============
 Multi-Gen LRU
 =============
 The multi-gen LRU is an alternative LRU implementation that optimizes
 page reclaim and improves performance under memory pressure. Page
 reclaim decides the kernel's caching policy and ability to overcommit
 memory. It directly impacts the kswapd CPU usage and RAM efficiency.

 Design overview
 ===============
 Objectives
 ----------
 The design objectives are:

 * Good representation of access recency
 * Exploitation of spatial locality
 * Fast paths to make obvious choices
 * Simple self-correcting heuristics

 The representation of access recency is at the core of all LRU
 implementations. In the multi-gen LRU, each generation represents a
 group of pages with similar access recency. Generations establish a
 (time-based) common frame of reference and therefore help make better
 choices, e.g., between different memcgs on a computer or different
 computers in a data center (for job scheduling).

 Exploiting spatial locality improves efficiency when gathering the
 accessed bit. An rmap walk targets a single page and does not try to
 profit from discovering a young PTE. A page table walk can sweep all
 the young PTEs in an address space, but the address space can be too
 sparse to make a profit. The key is to optimize both methods and use
 them in combination.

 Fast paths reduce code complexity and runtime overhead. Unmapped pages
 do not require TLB flushes; clean pages do not require writeback.
 These facts are only helpful when other conditions, e.g., access
 recency, are similar. With generations as a common frame of reference,
 additional factors stand out. But obvious choices might not be good
 choices; thus self-correction is necessary.

 The benefits of simple self-correcting heuristics are self-evident.
 Again, with generations as a common frame of reference, this becomes
 attainable. Specifically, pages in the same generation can be
 categorized based on additional factors, and a feedback loop can
 statistically compare the refault percentages across those categories
 and infer which of them are better choices.

 Assumptions
 -----------
 The protection of hot pages and the selection of cold pages are based
 on page access channels and patterns. There are two access channels:

 * Accesses through page tables
 * Accesses through file descriptors

 The protection of the former channel is by design stronger because:

 1. The uncertainty in determining the access patterns of the former
    channel is higher due to the approximation of the accessed bit.
 2. The cost of evicting the former channel is higher due to the TLB
    flushes required and the likelihood of encountering the dirty bit.
 3. The penalty of underprotecting the former channel is higher because
    applications usually do not prepare themselves for major page
    faults like they do for blocked I/O. E.g., GUI applications
    commonly use dedicated I/O threads to avoid blocking rendering
    threads.

 There are also two access patterns:

 * Accesses exhibiting temporal locality
 * Accesses not exhibiting temporal locality

 For the reasons listed above, the former channel is assumed to follow
 the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is
 present, and the latter channel is assumed to follow the latter
 pattern unless outlying refaults have been observed.

 Workflow overview
 =================
 Evictable pages are divided into multiple generations for each
 ``lruvec``. The youngest generation number is stored in
 ``lrugen->max_seq`` for both anon and file types as they are aged on
 an equal footing. The oldest generation numbers are stored in
 ``lrugen->min_seq[]`` separately for anon and file types as clean file
 pages can be evicted regardless of swap constraints. These three
 variables are monotonically increasing.

 Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)``
 bits in order to fit into the gen counter in ``folio->flags``. Each
 truncated generation number is an index to ``lrugen->folios[]``. The
 sliding window technique is used to track at least ``MIN_NR_GENS`` and
 at most ``MAX_NR_GENS`` generations. The gen counter stores a value
 within ``[1, MAX_NR_GENS]`` while a page is on one of
 ``lrugen->folios[]``; otherwise it stores zero.
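
 The mapping from a sequence number to a list index and a gen counter
 value can be sketched as follows. This is a userspace illustration,
 not the kernel code: the values chosen for ``MIN_NR_GENS`` and
 ``MAX_NR_GENS`` are assumptions, and ``gen_from_seq()`` and
 ``gen_counter_from_seq()`` are hypothetical helper names.

 .. code-block:: c

    /* Assumed values for illustration; the text only names the constants. */
    #define MIN_NR_GENS 2UL
    #define MAX_NR_GENS 4UL

    /* Smallest number of bits b such that (1 << b) >= n, for n >= 1. */
    static unsigned int order_base_2(unsigned long n)
    {
        unsigned int bits = 0;

        while ((1UL << bits) < n)
            bits++;
        return bits;
    }

    /* Truncate a sequence number into an index to folios[] (hypothetical helper). */
    static unsigned long gen_from_seq(unsigned long seq)
    {
        return seq % MAX_NR_GENS;
    }

    /*
     * Value kept in the gen counter while a page is on a list. The +1
     * offset reserves zero for "not on any list", so the result is
     * always within [1, MAX_NR_GENS].
     */
    static unsigned long gen_counter_from_seq(unsigned long seq)
    {
        return gen_from_seq(seq) + 1;
    }

 Under these assumed values, the gen counter has to represent
 ``MAX_NR_GENS+1`` distinct states, hence the
 ``order_base_2(MAX_NR_GENS+1)`` bits.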

 Each generation is divided into multiple tiers. A page accessed ``N``
 times through file descriptors is in tier ``order_base_2(N)``. Unlike
 generations, tiers do not have dedicated ``lrugen->folios[]``. In
 contrast to moving across generations, which requires the LRU lock,
 moving across tiers only involves atomic operations on
 ``folio->flags`` and therefore has a negligible cost. A feedback loop
 modeled after the PID controller monitors refaults over all the tiers
 from anon and file types and decides which tiers from which types to
 evict or protect. The desired effect is to balance refault percentages
 between anon and file types proportional to the swappiness level.
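
 As a rough illustration of the tier formula above, the following
 userspace sketch computes ``order_base_2(N)`` for a page accessed
 ``N`` times through file descriptors. ``tier_from_accesses()`` is a
 hypothetical name, and the sketch does not model any upper bound on
 the number of tiers.

 .. code-block:: c

    /*
     * Tier of a page accessed n times through file descriptors, n >= 1.
     * This is order_base_2(n), i.e., log2(n) rounded up.
     */
    static unsigned int tier_from_accesses(unsigned long n)
    {
        unsigned int tier = 0;

        while ((1UL << tier) < n)
            tier++;
        return tier;
    }

 A single access maps to tier 0, two accesses to tier 1, and three or
 four accesses to tier 2, so each higher tier requires roughly twice
 as many accesses as the one below it.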

 There are two conceptually independent procedures: the aging and the
 eviction. They form a closed-loop system, i.e., the page reclaim.

 Aging
 -----
 The aging produces young generations. Given an ``lruvec``, it
 increments ``max_seq`` when ``max_seq-min_seq+1`` approaches
 ``MIN_NR_GENS``. The aging promotes hot pages to the youngest
 generation when it finds them accessed through page tables; the
 demotion of cold pages happens consequently when it increments
 ``max_seq``. The aging uses page table walks and rmap walks to find
 young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list``
 and calls ``walk_page_range()`` with each ``mm_struct`` on this list
 to scan PTEs, and after each iteration, it increments ``max_seq``. For
 the latter, when the eviction walks the rmap and finds a young PTE,
 the aging scans the adjacent PTEs. For both, on finding a young PTE,
 the aging clears the accessed bit and updates the gen counter of the
 page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``.
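
 The aging steps above can be sketched in userspace C. Everything here
 is an assumption made for illustration: ``struct toy_pte`` stands in
 for a PTE and the page it maps, the constant values are assumed,
 "approaches" is read as "is less than or equal to", and a single
 ``min_seq`` is used instead of one per type.

 .. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    #define MIN_NR_GENS 2UL /* assumed value */
    #define MAX_NR_GENS 4UL /* assumed value */

    /* Hypothetical stand-in for a PTE and the page it maps. */
    struct toy_pte {
        bool young;        /* accessed bit */
        unsigned long gen; /* gen counter of the mapped page */
    };

    /* True when the window of generations has shrunk to MIN_NR_GENS. */
    static bool aging_needed(unsigned long max_seq, unsigned long min_seq)
    {
        return max_seq - min_seq + 1 <= MIN_NR_GENS;
    }

    /*
     * Clear the accessed bit of each young PTE and move the mapped page
     * to the youngest generation. Returns the number of pages promoted.
     */
    static size_t age_ptes(struct toy_pte *ptes, size_t nr,
                           unsigned long max_seq)
    {
        size_t i, promoted = 0;

        for (i = 0; i < nr; i++) {
            if (!ptes[i].young)
                continue;
            ptes[i].young = false;
            ptes[i].gen = (max_seq % MAX_NR_GENS) + 1;
            promoted++;
        }
        return promoted;
    }

 Pages whose PTEs are not young keep their old gen counter, which is
 how they fall behind as ``max_seq`` advances.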

 Eviction
 --------
 The eviction consumes old generations. Given an ``lruvec``, it
 increments ``min_seq`` when ``lrugen->folios[]`` indexed by
 ``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to
 evict from, it first compares ``min_seq[]`` to select the older type.
 If both types are equally old, it selects the one whose first tier has
 a lower refault percentage. The first tier contains single-use
 unmapped clean pages, which are the best bet. The eviction sorts a
 page according to its gen counter if the aging has found this page
 accessed through page tables and updated its gen counter. It also
 moves a page to the next generation, i.e., ``min_seq+1``, if this page
 was accessed multiple times through file descriptors and the feedback
 loop has detected outlying refaults from the tier this page is in. To
 this end, the feedback loop uses the first tier as the baseline, for
 the reason stated earlier.
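
 The type selection described above can be sketched as follows. This
 is a simplified illustration under stated assumptions: the refault
 percentage of the first tier is taken to be refaults divided by
 evictions, ties go to the file type, and the swappiness weighting is
 left out. ``struct type_state`` and ``select_type()`` are hypothetical
 names.

 .. code-block:: c

    enum lru_type { TYPE_ANON, TYPE_FILE };

    /* Hypothetical per-type state used only by this sketch. */
    struct type_state {
        unsigned long min_seq;   /* oldest generation number */
        unsigned long refaulted; /* refaults from the first tier */
        unsigned long evicted;   /* evictions from the first tier */
    };

    static enum lru_type select_type(const struct type_state s[2])
    {
        const struct type_state *anon = &s[TYPE_ANON];
        const struct type_state *file = &s[TYPE_FILE];
        unsigned long long lhs, rhs;

        /* The type with the smaller min_seq is the older one. */
        if (anon->min_seq != file->min_seq)
            return anon->min_seq < file->min_seq ? TYPE_ANON : TYPE_FILE;

        /*
         * Equally old: compare refaulted/evicted by cross-multiplying,
         * which avoids division, and pick the lower ratio.
         */
        lhs = (unsigned long long)anon->refaulted * file->evicted;
        rhs = (unsigned long long)file->refaulted * anon->evicted;
        return lhs < rhs ? TYPE_ANON : TYPE_FILE;
    }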

 Working set protection
 ----------------------
 Each generation is timestamped at birth. If ``lru_gen_min_ttl`` is
 set, an ``lruvec`` is protected from the eviction when its oldest
 generation was born within ``lru_gen_min_ttl`` milliseconds. In other
 words, it prevents the working set of ``lru_gen_min_ttl`` milliseconds
 from getting evicted. The OOM killer is triggered if this working set
 cannot be kept in memory.

 This time-based approach has the following advantages:

 1. It is easier to configure because it is agnostic to applications
    and memory sizes.
 2. It is more reliable because it is directly wired to the OOM killer.
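
 The protection check can be illustrated with a minimal sketch. The
 function and parameter names are hypothetical, times are plain
 millisecond counts, and a TTL of zero is taken to mean that the
 protection is not set.

 .. code-block:: c

    #include <stdbool.h>

    /*
     * An lruvec is protected when its oldest generation was born less
     * than min_ttl_ms milliseconds ago.
     */
    static bool lruvec_is_protected(unsigned long now_ms,
                                    unsigned long oldest_birth_ms,
                                    unsigned long min_ttl_ms)
    {
        if (!min_ttl_ms)
            return false;
        return now_ms - oldest_birth_ms < min_ttl_ms;
    }

 The check depends only on timestamps, not on the size of the working
 set, which is why the same setting carries over across applications
 and memory sizes.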

 ``mm_struct`` list
 ------------------
 An ``mm_struct`` list is maintained for each memcg, and an
 ``mm_struct`` follows its owner task to the new memcg when this task
 is migrated.

 A page table walker iterates ``lruvec_memcg()->mm_list`` and calls
 ``walk_page_range()`` with each ``mm_struct`` on this list to scan
 PTEs. When multiple page table walkers iterate the same list, each of
 them gets a unique ``mm_struct``, and therefore they can run in
 parallel.
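
 One way to hand each walker a unique entry is a shared cursor, as in
 the sketch below. This is not how the kernel synchronizes walkers,
 which this document does not describe; a C11 atomic counter is
 swapped in purely for illustration, and ``struct mm_list`` and
 ``claim_next_mm()`` are hypothetical.

 .. code-block:: c

    #include <stdatomic.h>
    #include <stddef.h>

    /* Hypothetical list header: entries are identified by index. */
    struct mm_list {
        size_t nr;          /* number of entries on the list */
        atomic_size_t next; /* next unclaimed entry */
    };

    /*
     * Claim the next entry for the calling walker. The atomic
     * fetch-and-add returns each index to exactly one caller. Returns
     * list->nr once the list is exhausted.
     */
    static size_t claim_next_mm(struct mm_list *list)
    {
        size_t i = atomic_fetch_add(&list->next, 1);

        return i < list->nr ? i : list->nr;
    }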

 Page table walkers ignore any misplaced pages, e.g., if an
 ``mm_struct`` was migrated, pages left in the previous memcg will be
 ignored when the current memcg is under reclaim. Similarly, page table
 walkers will ignore pages from nodes other than the one under reclaim.

 This infrastructure also tracks the usage of ``mm_struct`` between
 context switches so that page table walkers can skip processes that
 have been sleeping since the last iteration.

 Rmap/PT walk feedback
 ---------------------
 Searching the rmap for PTEs mapping each page on an LRU list (to test
 and clear the accessed bit) can be expensive because pages from
 different VMAs (PA space) are not cache friendly to the rmap (VA
 space). For workloads mostly using mapped pages, searching the rmap
 can incur the highest CPU cost in the reclaim path.

 ``lru_gen_look_around()`` exploits spatial locality to reduce the
 trips into the rmap. It scans the adjacent PTEs of a young PTE and
 promotes hot pages. If the scan was cacheline-efficient, it adds the
 PMD entry pointing to the PTE table to the Bloom filter. This forms a
 feedback loop between the eviction and the aging.

 Bloom filters
 -------------
 Bloom filters are a space-efficient data structure for set membership
 tests, i.e., testing whether an element is definitely not in the set
 or may be in the set.

 In the eviction path, specifically, in ``lru_gen_look_around()``, if a
 PMD has a sufficient number of hot pages, its address is placed in the
 filter. In the aging path, set membership means that the PTE range
 will be scanned for young pages.

 Note that Bloom filters are probabilistic on set membership. If a test
 yields a false positive, the cost is an additional scan of a range of
 PTEs, which may yield hot pages anyway. Parameters of the filter
 itself can control the false positive rate in the limit.
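
 A minimal Bloom filter can be sketched as follows. The filter size,
 the use of two bit positions per key, and the multiplicative hash are
 all assumptions chosen for brevity and are not taken from the kernel
 implementation. The key would be the address of a PMD entry.

 .. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    #define FILTER_SHIFT 15 /* assumed size: 2^15 bits */
    #define FILTER_BITS  (1UL << FILTER_SHIFT)

    struct bloom_filter {
        uint8_t bits[FILTER_BITS / 8];
    };

    /* Derive two bit positions from one 64-bit multiplicative hash. */
    static void bloom_keys(uint64_t key, uint32_t pos[2])
    {
        uint64_t h = key * 0x9E3779B97F4A7C15ULL;

        pos[0] = (uint32_t)(h >> (64 - FILTER_SHIFT));
        pos[1] = (uint32_t)((h >> (64 - 2 * FILTER_SHIFT)) &
                            (FILTER_BITS - 1));
    }

    /* Insert a key by setting both of its bits. */
    static void bloom_add(struct bloom_filter *f, uint64_t key)
    {
        uint32_t pos[2];
        int i;

        bloom_keys(key, pos);
        for (i = 0; i < 2; i++)
            f->bits[pos[i] / 8] |= (uint8_t)(1u << (pos[i] % 8));
    }

    /* False means definitely absent; true means possibly present. */
    static bool bloom_may_contain(const struct bloom_filter *f, uint64_t key)
    {
        uint32_t pos[2];
        int i;

        bloom_keys(key, pos);
        for (i = 0; i < 2; i++)
            if (!(f->bits[pos[i] / 8] & (1u << (pos[i] % 8))))
                return false;
        return true;
    }

 A larger filter or more bit positions per key lowers the false
 positive rate at the cost of more memory or more work per lookup.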

 PID controller
 --------------
 A feedback loop modeled after the Proportional-Integral-Derivative
 (PID) controller monitors refaults over anon and file types and
 decides which type to evict when both types are available from the
 same generation.

 The PID controller uses generations rather than the wall clock as the
 time domain because a CPU can scan pages at different rates under
 varying memory pressure. It calculates a moving average for each new
 generation to avoid being permanently locked in a suboptimal state.
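
 The moving average can be sketched as below. The weight of one half
 given to the newest sample and the names ``struct ctrl_stat`` and
 ``ctrl_new_generation()`` are assumptions made for this illustration.

 .. code-block:: c

    /* Hypothetical per-tier counters used only by this sketch. */
    struct ctrl_stat {
        unsigned long avg_refaulted; /* moving average of refaults */
        unsigned long avg_evicted;   /* moving average of evictions */
        unsigned long refaulted;     /* refaults in the current generation */
        unsigned long evicted;       /* evictions in the current generation */
    };

    /*
     * On each new generation, fold the current counters into the
     * averages with a weight of 1/2 and start counting afresh.
     */
    static void ctrl_new_generation(struct ctrl_stat *s)
    {
        s->avg_refaulted = (s->avg_refaulted + s->refaulted) / 2;
        s->avg_evicted = (s->avg_evicted + s->evicted) / 2;
        s->refaulted = 0;
        s->evicted = 0;
    }

 Because the contribution of a past generation is halved every time a
 new one is created, an outdated observation fades out instead of
 pinning the controller to a suboptimal choice.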

 Memcg LRU
 ---------
 A memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
 since each node and memcg combination has an LRU of folios (see
 ``mem_cgroup_lruvec()``). Its goal is to improve the scalability of
 global reclaim, which is critical to system-wide memory overcommit in
 data centers. Note that the memcg LRU only applies to global reclaim.

 The basic structure of a memcg LRU can be understood by an analogy to
 the active/inactive LRU (of folios):

 1. It has the young and the old (generations), i.e., the counterparts
    to the active and the inactive;
 2. The increment of ``max_seq`` triggers promotion, i.e., the
    counterpart to activation;
 3. Other events trigger similar operations, e.g., offlining a memcg
    triggers demotion, i.e., the counterpart to deactivation.

 In terms of global reclaim, it has two distinct features:

 1. Sharding, which allows each thread to start at a random memcg (in
    the old generation) and improves parallelism;
 2. Eventual fairness, which allows direct reclaim to bail out at will
    and reduces latency without affecting fairness over some time.

 In terms of traversing memcgs during global reclaim, it improves the
 best-case complexity from O(n) to O(1) and does not affect the
 worst-case complexity O(n). Therefore, on average, it has a sublinear
 complexity.

 Summary
 -------
 The multi-gen LRU (of folios) can be disassembled into the following
 parts:

 * Generations
 * Rmap walks
 * Page table walks via ``mm_struct`` list
 * Bloom filters for rmap/PT walk feedback
 * PID controller for refault feedback

 The aging and the eviction form a producer-consumer model;
 specifically, the latter drives the former by the sliding window over
 generations. Within the aging, rmap walks drive page table walks by
 inserting hot, densely populated page tables into the Bloom filters.
 Within the eviction, the PID controller uses refaults as the feedback
 to select types to evict and tiers to protect.