| .. SPDX-License-Identifier: GPL-2.0 |
| |
| ============= |
| Multi-Gen LRU |
| ============= |
| The multi-gen LRU is an alternative LRU implementation that optimizes |
| page reclaim and improves performance under memory pressure. Page |
| reclaim decides the kernel's caching policy and ability to overcommit |
| memory. It directly impacts the kswapd CPU usage and RAM efficiency. |
| |
| Quick start |
| =========== |
| Build the kernel with the following configurations. |
| |
| * ``CONFIG_LRU_GEN=y`` |
| * ``CONFIG_LRU_GEN_ENABLED=y`` |
| |
| All set! |
| |
| Runtime options |
| =============== |
| ``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the |
| following subsections. |
| |
| Kill switch |
| ----------- |
| ``enabled`` accepts different values to enable or disable the following |
| components. Its default value depends on ``CONFIG_LRU_GEN_ENABLED``. |
| All the components should be enabled unless some of them have |
| unforeseen side effects. Writing to ``enabled`` has no effect when a |
| component is not supported by the hardware, and valid values will be |
| accepted even when the main switch is off. |
| |
| ====== =============================================================== |
| Values Components |
| ====== =============================================================== |
| 0x0001 The main switch for the multi-gen LRU. |
| 0x0002 Clearing the accessed bit in leaf page table entries in large |
|        batches, when MMU sets it (e.g., on x86). This behavior can |
|        theoretically worsen lock contention (mmap_lock). If it is |
|        disabled, the multi-gen LRU will suffer a minor performance |
|        degradation. |
| 0x0004 Clearing the accessed bit in non-leaf page table entries as |
|        well, when MMU sets it (e.g., on x86). This behavior was not |
|        verified on x86 varieties other than Intel and AMD. If it is |
|        disabled, the multi-gen LRU will suffer a negligible |
|        performance degradation. |
| [yYnN] Apply to all the components above. |
| ====== =============================================================== |
| |
| E.g., |
| :: |
| |
|     echo y >/sys/kernel/mm/lru_gen/enabled |
|     cat /sys/kernel/mm/lru_gen/enabled |
|     0x0007 |
|     echo 5 >/sys/kernel/mm/lru_gen/enabled |
|     cat /sys/kernel/mm/lru_gen/enabled |
|     0x0005 |
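
The value read back from ``enabled`` is a bitmask of the components in the
table above. The following is a minimal sketch of decoding it;
``decode_lru_gen_enabled`` and the short component labels it prints are
hypothetical names chosen for illustration, not part of the kernel.

```shell
#!/bin/sh
# decode_lru_gen_enabled is a hypothetical helper (not part of the kernel).
# It maps a value read from "enabled" onto the three components in the table.
decode_lru_gen_enabled() {
    mask=$1
    for entry in "0x0001:main switch" \
                 "0x0002:leaf accessed-bit batching" \
                 "0x0004:non-leaf accessed-bit clearing"; do
        bit=${entry%%:*}     # text before the first ':' is the bit value
        name=${entry#*:}     # text after the first ':' is the label
        if [ $(( $mask & $bit )) -ne 0 ]; then
            echo "$name: on"
        else
            echo "$name: off"
        fi
    done
}

# Decode the second value from the example above.
decode_lru_gen_enabled 0x0005
```

On a running system, the argument would typically be
``"$(cat /sys/kernel/mm/lru_gen/enabled)"``.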
| |
| Thrashing prevention |
| -------------------- |
| Personal computers are more sensitive to thrashing because it can |
| cause janks (lags when rendering UI) and negatively impact user |
| experience. The multi-gen LRU offers thrashing prevention to the |
| majority of laptop and desktop users who do not have ``oomd``. |
| |
| Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of |
| ``N`` milliseconds from getting evicted. The OOM killer is triggered |
| if this working set cannot be kept in memory. In other words, this |
| option works as an adjustable pressure relief valve, and when open, it |
| terminates applications that are hopefully not being used. |
| |
| Based on the average human detectable lag (~100ms), ``N=1000`` usually |
| eliminates intolerable janks due to thrashing. Larger values like |
| ``N=3000`` make janks less noticeable at the risk of premature OOM |
| kills. |
| |
| The default value ``0`` means disabled. |
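
Setting the option amounts to a single write. The following is a sketch
only: ``set_min_ttl_ms`` is a hypothetical wrapper, and its optional second
argument exists solely so that the function can be tried against a scratch
file instead of sysfs.

```shell
#!/bin/sh
# set_min_ttl_ms N [FILE]: write N (milliseconds) to min_ttl_ms.
# Hypothetical helper; FILE defaults to the sysfs path described above.
set_min_ttl_ms() {
    ttl=$1
    file=${2:-/sys/kernel/mm/lru_gen/min_ttl_ms}
    case $ttl in
        ''|*[!0-9]*)
            # Reject anything that is not a non-negative decimal integer.
            echo "min_ttl_ms: expected a non-negative integer, got '$ttl'" >&2
            return 1
            ;;
    esac
    echo "$ttl" >"$file"
}

# Example (needs root): protect the working set of the last second.
#   set_min_ttl_ms 1000
# Example: restore the default, which disables thrashing prevention.
#   set_min_ttl_ms 0
```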
| |
| Experimental features |
| ===================== |
| ``/sys/kernel/debug/lru_gen`` accepts commands described in the |
| following subsections. Multiple command lines are supported, as is |
| concatenation with delimiters ``,`` and ``;``. |
| |
| ``/sys/kernel/debug/lru_gen_full`` provides additional stats for |
| debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from |
| evicted generations in this file. |
| |
| Working set estimation |
| ---------------------- |
| Working set estimation measures how much memory an application |
| requires in a given time interval, and it is usually done with little |
| impact on the performance of the application. E.g., data centers want |
| to optimize job scheduling (bin packing) to improve memory |
| utilization. When a new job comes in, the job scheduler needs to find |
| out whether each server it manages can allocate a certain amount of |
| memory for this new job before it can pick a candidate. To do so, this |
| job scheduler needs to estimate the working sets of the existing jobs. |
| |
| When it is read, ``lru_gen`` returns a histogram of numbers of pages |
| accessed over different time intervals for each memcg and node. |
| ``MAX_NR_GENS`` decides the number of bins for each histogram. |
| :: |
| |
|     memcg  memcg_id  memcg_path |
|        node  node_id |
|            min_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages |
|            ... |
|            max_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages |
| |
| Each generation contains an estimated number of pages that have been |
| accessed within ``age_in_ms`` non-cumulatively. E.g., ``min_gen_nr`` |
| contains the coldest pages and ``max_gen_nr`` contains the hottest |
| pages, since ``age_in_ms`` of the former is the largest and that of |
| the latter is the smallest. |
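
As an illustration of consuming this histogram, the sketch below sums the
anon and file pages of each generation. ``lru_gen_summarize`` is a
hypothetical helper, the sample numbers are made up, and the parsing assumes
the whitespace-separated layout shown above.

```shell
#!/bin/sh
# lru_gen_summarize is a hypothetical helper. It reads text in the layout
# shown above on stdin and prints one line per generation with the combined
# anon + file page count.
lru_gen_summarize() {
    awk '
        $1 == "memcg" { memcg = $2; next }   # memcg header: remember memcg_id
        $1 == "node"  { node = $2; next }    # node header: remember node_id
        NF == 4 {                            # gen_nr age_in_ms nr_anon nr_file
            printf "memcg=%s node=%s gen=%s age_ms=%s pages=%d\n",
                   memcg, node, $1, $2, $3 + $4
        }
    '
}

# Made-up sample input; real values come from /sys/kernel/debug/lru_gen.
lru_gen_summarize <<'EOF'
memcg 1 /
 node 0
 0 12000 100 2000
 1 8000 200 3000
 2 3000 300 1000
 3 500 400 500
EOF
```

On a running system, the input would be redirected from
``/sys/kernel/debug/lru_gen`` instead of the inline sample.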
| |
| Users can write ``+ memcg_id node_id max_gen_nr |
| [can_swap [full_scan]]`` to ``lru_gen`` to create a new generation |
| ``max_gen_nr+1``. ``can_swap`` defaults to the swap setting and, if it |
| is set to ``1``, it forces the scan of anon pages when swap is off. |
| ``full_scan`` defaults to ``1`` and, if it is set to ``0``, it reduces |
| the overhead as well as the coverage when scanning page tables. |
| |
| A typical use case is that a job scheduler writes to ``lru_gen`` at a |
| certain time interval to create new generations, and it ranks the |
| servers it manages based on the sizes of their cold memory defined by |
| this time interval. |
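
A job scheduler might assemble the aging command along the following lines.
This is a sketch under stated assumptions: ``lru_gen_age_cmd`` is a
hypothetical helper that only formats the command string and checks the
argument count, and the IDs in the example are placeholders.

```shell
#!/bin/sh
# lru_gen_age_cmd MEMCG_ID NODE_ID MAX_GEN_NR [CAN_SWAP [FULL_SCAN]]
# Hypothetical helper: prints the "+ ..." command described above.
lru_gen_age_cmd() {
    if [ $# -lt 3 ] || [ $# -gt 5 ]; then
        echo "usage: lru_gen_age_cmd memcg_id node_id max_gen_nr [can_swap [full_scan]]" >&2
        return 1
    fi
    # "$*" joins the arguments with single spaces (default IFS).
    echo "+ $*"
}

# Placeholder IDs: memcg 1, node 0, current max_gen_nr 3.
lru_gen_age_cmd 1 0 3
# With both optional fields: scan anon pages, reduced page table coverage.
lru_gen_age_cmd 1 0 3 1 0
```

To apply a command, its output would be redirected to
``/sys/kernel/debug/lru_gen``.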
| |
| Proactive reclaim |
| ----------------- |
| Proactive reclaim induces memory reclaim when there is no memory |
| pressure and usually targets cold memory only. E.g., when a new job |
| comes in, the job scheduler wants to proactively reclaim memory on the |
| server it has selected to improve the chance of successfully landing |
| this new job. |
| |
| Users can write ``- memcg_id node_id min_gen_nr [swappiness |
| [nr_to_reclaim]]`` to ``lru_gen`` to evict generations less than or |
| equal to ``min_gen_nr``. Note that ``min_gen_nr`` should be less than |
| ``max_gen_nr-1`` as ``max_gen_nr`` and ``max_gen_nr-1`` are not fully |
| aged and therefore cannot be evicted. ``swappiness`` overrides the |
| default value in ``/proc/sys/vm/swappiness``. ``nr_to_reclaim`` limits |
| the number of pages to evict. |
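
The constraint on ``min_gen_nr`` can be checked before the command is
written. The sketch below assumes the caller has already read
``max_gen_nr`` from ``lru_gen``; ``lru_gen_evict_cmd`` is a hypothetical
helper, and ``max_gen_nr`` is passed to it only for the check and is not
part of the emitted command.

```shell
#!/bin/sh
# lru_gen_evict_cmd MEMCG_ID NODE_ID MIN_GEN_NR MAX_GEN_NR \
#                   [SWAPPINESS [NR_TO_RECLAIM]]
# Hypothetical helper: prints the "- ..." command described above, or fails
# if MIN_GEN_NR is not less than MAX_GEN_NR-1.
lru_gen_evict_cmd() {
    if [ $# -lt 4 ] || [ $# -gt 6 ]; then
        echo "usage: lru_gen_evict_cmd memcg_id node_id min_gen_nr max_gen_nr [swappiness [nr_to_reclaim]]" >&2
        return 1
    fi
    memcg_id=$1 node_id=$2 min_gen_nr=$3 max_gen_nr=$4
    shift 4    # what remains is [swappiness [nr_to_reclaim]]
    if [ "$min_gen_nr" -ge $(( max_gen_nr - 1 )) ]; then
        echo "min_gen_nr must be less than max_gen_nr-1" >&2
        return 1
    fi
    cmd="- $memcg_id $node_id $min_gen_nr"
    if [ $# -gt 0 ]; then
        cmd="$cmd $*"
    fi
    echo "$cmd"
}

# Placeholder values: evict generations <= 1 when max_gen_nr is 3,
# with swappiness 0 and at most 1000 pages.
lru_gen_evict_cmd 1 0 1 3 0 1000
```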
| |
| A typical use case is that a job scheduler writes to ``lru_gen`` |
| before it tries to land a new job on a server, and if it fails to |
| free up the cold memory without impacting the existing jobs on |
| this server, it retries on the next server according to the ranking |
| result obtained from the working set estimation step described |
| earlier. |