| .. SPDX-License-Identifier: GPL-2.0 |
| |
| ============= |
| Multi-Gen LRU |
| ============= |
| The multi-gen LRU is an alternative LRU implementation that optimizes |
| page reclaim and improves performance under memory pressure. Page |
| reclaim decides the kernel's caching policy and ability to overcommit |
| memory. It directly impacts the kswapd CPU usage and RAM efficiency. |
| |
| Quick start |
| =========== |
| Build the kernel with the following configurations. |
| |
| * ``CONFIG_LRU_GEN=y`` |
| * ``CONFIG_LRU_GEN_ENABLED=y`` |
| |
| All set! |
| |
| Runtime options |
| =============== |
| ``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the |
| following subsections. |
| |
| Kill switch |
| ----------- |
| ``enabled`` accepts different values to enable or disable the |
| following components. Its default value depends on |
| ``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled |
| unless some of them have unforeseen side effects. Writing to |
| ``enabled`` has no effect when a component is not supported by the |
| hardware, and valid values will be accepted even when the main switch |
| is off. |
| |
| ====== =============================================================== |
| Values Components |
| ====== =============================================================== |
| 0x0001 The main switch for the multi-gen LRU. |
| 0x0002 Clearing the accessed bit in leaf page table entries in large |
| batches, when MMU sets it (e.g., on x86). This behavior can |
| theoretically worsen lock contention (mmap_lock). If it is |
| disabled, the multi-gen LRU will suffer a minor performance |
| degradation for workloads that contiguously map hot pages, |
| whose accessed bits can be otherwise cleared by fewer larger |
| batches. |
| 0x0004 Clearing the accessed bit in non-leaf page table entries as |
| well, when MMU sets it (e.g., on x86). This behavior was not |
| verified on x86 varieties other than Intel and AMD. If it is |
| disabled, the multi-gen LRU will suffer a negligible |
| performance degradation. |
| [yYnN] Apply to all the components above. |
| ====== =============================================================== |
| |
| E.g., |
| :: |
| |
| echo y >/sys/kernel/mm/lru_gen/enabled |
| cat /sys/kernel/mm/lru_gen/enabled |
| 0x0007 |
| echo 5 >/sys/kernel/mm/lru_gen/enabled |
| cat /sys/kernel/mm/lru_gen/enabled |
| 0x0005 |
| |
| Thrashing prevention |
| -------------------- |
| Personal computers are more sensitive to thrashing because it can |
| cause janks (lags when rendering UI) and negatively impact user |
| experience. The multi-gen LRU offers thrashing prevention to the |
| majority of laptop and desktop users who do not have ``oomd``. |
| |
| Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of |
| ``N`` milliseconds from getting evicted. The OOM killer is triggered |
| if this working set cannot be kept in memory. In other words, this |
| option works as an adjustable pressure relief valve, and when open, it |
| terminates applications that are hopefully not being used. |
| |
| Based on the average human detectable lag (~100ms), ``N=1000`` usually |
| eliminates intolerable janks due to thrashing. Larger values like |
| ``N=3000`` make janks less noticeable at the risk of premature OOM |
| kills. |
| |
| The default value ``0`` means disabled. |
| |
| Experimental features |
| ===================== |
| ``/sys/kernel/debug/lru_gen`` accepts commands described in the |
| following subsections. Multiple command lines are supported, so does |
| concatenation with delimiters ``,`` and ``;``. |
| |
| ``/sys/kernel/debug/lru_gen_full`` provides additional stats for |
| debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from |
| evicted generations in this file. |
| |
| Working set estimation |
| ---------------------- |
| Working set estimation measures how much memory an application needs |
| in a given time interval, and it is usually done with little impact on |
| the performance of the application. E.g., data centers want to |
| optimize job scheduling (bin packing) to improve memory utilizations. |
| When a new job comes in, the job scheduler needs to find out whether |
| each server it manages can allocate a certain amount of memory for |
| this new job before it can pick a candidate. To do so, the job |
| scheduler needs to estimate the working sets of the existing jobs. |
| |
| When it is read, ``lru_gen`` returns a histogram of numbers of pages |
| accessed over different time intervals for each memcg and node. |
| ``MAX_NR_GENS`` decides the number of bins for each histogram. The |
| histograms are noncumulative. |
| :: |
| |
| memcg memcg_id memcg_path |
| node node_id |
| min_gen_nr age_in_ms nr_anon_pages nr_file_pages |
| ... |
| max_gen_nr age_in_ms nr_anon_pages nr_file_pages |
| |
| Each bin contains an estimated number of pages that have been accessed |
| within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages |
| and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of |
| the former is the largest and that of the latter is the smallest. |
| |
| Users can write the following command to ``lru_gen`` to create a new |
| generation ``max_gen_nr+1``: |
| |
| ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]`` |
| |
| ``can_swap`` defaults to the swap setting and, if it is set to ``1``, |
| it forces the scan of anon pages when swap is off, and vice versa. |
| ``force_scan`` defaults to ``1`` and, if it is set to ``0``, it |
| employs heuristics to reduce the overhead, which is likely to reduce |
| the coverage as well. |
| |
| A typical use case is that a job scheduler runs this command at a |
| certain time interval to create new generations, and it ranks the |
| servers it manages based on the sizes of their cold pages defined by |
| this time interval. |
| |
| Proactive reclaim |
| ----------------- |
| Proactive reclaim induces page reclaim when there is no memory |
| pressure. It usually targets cold pages only. E.g., when a new job |
| comes in, the job scheduler wants to proactively reclaim cold pages on |
| the server it selected, to improve the chance of successfully landing |
| this new job. |
| |
| Users can write the following command to ``lru_gen`` to evict |
| generations less than or equal to ``min_gen_nr``. |
| |
| ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]`` |
| |
| ``min_gen_nr`` should be less than ``max_gen_nr-1``, since |
| ``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to |
| the active list) and therefore cannot be evicted. ``swappiness`` |
| overrides the default value in ``/proc/sys/vm/swappiness``. |
| ``nr_to_reclaim`` limits the number of pages to evict. |
| |
| A typical use case is that a job scheduler runs this command before it |
| tries to land a new job on a server. If it fails to materialize enough |
| cold pages because of the overestimation, it retries on the next |
| server according to the ranking result obtained from the working set |
| estimation step. This less forceful approach limits the impacts on the |
| existing jobs. |