 .. SPDX-License-Identifier: GPL-2.0

 =============
 Multi-Gen LRU
 =============
 The multi-gen LRU is an alternative LRU implementation that optimizes
 page reclaim and improves performance under memory pressure. Page
 reclaim decides the kernel's caching policy and ability to overcommit
 memory. It directly impacts the kswapd CPU usage and RAM efficiency.

 Quick start
 ===========
 Build the kernel with the following configuration options.

 * ``CONFIG_LRU_GEN=y``
 * ``CONFIG_LRU_GEN_ENABLED=y``

 All set!
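
 To confirm that a kernel was built with these options, its build
 configuration can be inspected. The following is a minimal sketch, not
 an official tool: ``check_lru_gen_config`` is a hypothetical helper,
 and the default path ``/boot/config-$(uname -r)`` is an assumption
 based on a common distribution convention.
 ::

     # Hypothetical helper: report whether a kernel config file enables
     # the two multi-gen LRU options. The default path is an assumption;
     # pass another file as the first argument if the config lives
     # elsewhere.
     check_lru_gen_config() {
         config=${1:-/boot/config-$(uname -r)}
         for opt in CONFIG_LRU_GEN CONFIG_LRU_GEN_ENABLED; do
             # Anchor the match so CONFIG_LRU_GEN does not also match
             # CONFIG_LRU_GEN_ENABLED.
             if grep -q "^${opt}=y\$" "$config"; then
                 echo "$opt=y"
             else
                 echo "$opt is not set"
             fi
         done
     }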

 Runtime options
 ===============
 ``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
 following subsections.

 Kill switch
 -----------
 ``enabled`` accepts different values to enable or disable the
 following components. Its default value depends on
 ``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
 unless some of them have unforeseen side effects. Writing to
 ``enabled`` has no effect when a component is not supported by the
 hardware, and valid values will be accepted even when the main switch
 is off.

 ====== ===============================================================
 Values Components
 ====== ===============================================================
 0x0001 The main switch for the multi-gen LRU.
 0x0002 Clearing the accessed bit in leaf page table entries in large
        batches, when MMU sets it (e.g., on x86). This behavior can
        theoretically worsen lock contention (mmap_lock). If it is
        disabled, the multi-gen LRU will suffer a minor performance
        degradation for workloads that contiguously map hot pages,
        whose accessed bits can be otherwise cleared by fewer larger
        batches.
 0x0004 Clearing the accessed bit in non-leaf page table entries as
        well, when MMU sets it (e.g., on x86). This behavior was not
        verified on x86 varieties other than Intel and AMD. If it is
        disabled, the multi-gen LRU will suffer a negligible
        performance degradation.
 [yYnN] Apply to all the components above.
 ====== ===============================================================

 E.g.,
 ::

     echo y >/sys/kernel/mm/lru_gen/enabled
     cat /sys/kernel/mm/lru_gen/enabled
     0x0007
     echo 5 >/sys/kernel/mm/lru_gen/enabled
     cat /sys/kernel/mm/lru_gen/enabled
     0x0005
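
 The bit values in the table above can be decoded with shell
 arithmetic. This is a minimal sketch; ``decode_lru_gen_enabled`` and
 the short component labels it prints are hypothetical names chosen for
 illustration.
 ::

     # Hypothetical helper: list the components selected by a value
     # read from, or about to be written to, the enabled file.
     decode_lru_gen_enabled() {
         mask=$(($1))
         [ $((mask & 0x0001)) -ne 0 ] && echo "main switch"
         [ $((mask & 0x0002)) -ne 0 ] && echo "leaf accessed-bit batching"
         [ $((mask & 0x0004)) -ne 0 ] && echo "non-leaf accessed-bit clearing"
         return 0
     }

     decode_lru_gen_enabled 0x0005

 For ``0x0005`` this prints the main switch and the non-leaf component,
 since bit ``0x0002`` is clear.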

 Thrashing prevention
 --------------------
 Personal computers are more sensitive to thrashing because it can
 cause janks (lags when rendering UI) and negatively impact user
 experience. The multi-gen LRU offers thrashing prevention to the
 majority of laptop and desktop users who do not have ``oomd``.

 Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
 ``N`` milliseconds from getting evicted. The OOM killer is triggered
 if this working set cannot be kept in memory. In other words, this
 option works as an adjustable pressure relief valve, and when open, it
 terminates applications that are hopefully not being used.

 Based on the average human-detectable lag (~100 ms), ``N=1000`` usually
 eliminates intolerable janks due to thrashing. Larger values like
 ``N=3000`` make janks less noticeable at the risk of premature OOM
 kills.

 The default value ``0`` means disabled.
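
 Setting the value can be wrapped in a small function that rejects
 malformed input before writing. This is a sketch under assumptions:
 ``set_lru_gen_min_ttl`` is a hypothetical helper, and its optional
 second argument exists only so the function can be tried against a
 scratch file. Writing the real file requires root.
 ::

     # Hypothetical helper: write a thrashing-prevention window, in
     # milliseconds, to min_ttl_ms (or to the file given as the second
     # argument).
     set_lru_gen_min_ttl() {
         ttl_ms=$1
         target=${2:-/sys/kernel/mm/lru_gen/min_ttl_ms}
         case $ttl_ms in
             ''|*[!0-9]*)
                 echo "min_ttl_ms must be a non-negative integer" >&2
                 return 1
                 ;;
         esac
         echo "$ttl_ms" >"$target"
     }

 E.g., ``set_lru_gen_min_ttl 1000`` requests a one-second window and
 ``set_lru_gen_min_ttl 0`` disables the feature again.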

 Experimental features
 =====================
 ``/sys/kernel/debug/lru_gen`` accepts commands described in the
 following subsections. Multiple command lines are supported, as is
 concatenation with the delimiters ``,`` and ``;``.

 ``/sys/kernel/debug/lru_gen_full`` provides additional stats for
 debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
 evicted generations in this file.

 Working set estimation
 ----------------------
 Working set estimation measures how much memory an application needs
 in a given time interval, and it is usually done with little impact on
 the performance of the application. E.g., data centers want to
 optimize job scheduling (bin packing) to improve memory utilization.
 When a new job comes in, the job scheduler needs to find out whether
 each server it manages can allocate a certain amount of memory for
 this new job before it can pick a candidate. To do so, the job
 scheduler needs to estimate the working sets of the existing jobs.

 When it is read, ``lru_gen`` returns a histogram of numbers of pages
 accessed over different time intervals for each memcg and node.
 ``MAX_NR_GENS`` decides the number of bins for each histogram. The
 histograms are noncumulative.
 ::

     memcg  memcg_id  memcg_path
        node  node_id
            min_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
            ...
            max_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages

 Each bin contains an estimated number of pages that have been accessed
 within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
 and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
 the former is the largest and that of the latter is the smallest.
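
 Because the histograms are noncumulative, the amount of cold memory
 for a given age threshold is the sum over the bins at or above that
 threshold. The following sketch does this with ``awk``. It assumes the
 four-column layout shown above; ``lru_gen_cold_pages`` is a
 hypothetical helper, and reading the real file requires root and a
 mounted debugfs.
 ::

     # Hypothetical helper: for each memcg and node, print
     # "memcg_id node_id pages", where pages is the sum of anon and
     # file pages in generations whose age_in_ms is at least the given
     # threshold.
     lru_gen_cold_pages() {
         min_age_ms=$1
         src=${2:-/sys/kernel/debug/lru_gen}
         awk -v min_age="$min_age_ms" '
             # Emit the total for the node parsed so far, then reset.
             function flush() {
                 if (have_node)
                     print memcg, node, cold + 0
                 have_node = 0
                 cold = 0
             }
             $1 == "memcg" { flush(); memcg = $2; next }
             $1 == "node"  { flush(); node = $2; have_node = 1; next }
             NF == 4 && $2 + 0 >= min_age { cold += $3 + $4 }
             END { flush() }
         ' "$src"
     }

 E.g., ``lru_gen_cold_pages 60000`` reports, per memcg and node, the
 pages in generations that are at least one minute old.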

 Users can write the following command to ``lru_gen`` to create a new
 generation ``max_gen_nr+1``:

     ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``

 ``can_swap`` defaults to the swap setting and, if it is set to ``1``,
 it forces the scan of anon pages when swap is off, and vice versa.
 ``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
 employs heuristics to reduce the overhead, which is likely to reduce
 the coverage as well.
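
 The command string can be assembled by a small function and then
 redirected to ``lru_gen``. This is a sketch only:
 ``lru_gen_age_cmd`` is a hypothetical helper, and the IDs in the usage
 comment are placeholders rather than values from a real system.
 ::

     # Hypothetical helper: print an aging command. It takes three to
     # five arguments in the order the kernel expects:
     #   memcg_id node_id max_gen_nr [can_swap [force_scan]]
     lru_gen_age_cmd() {
         if [ $# -lt 3 ] || [ $# -gt 5 ]; then
             echo "usage: lru_gen_age_cmd memcg_id node_id max_gen_nr" \
                  "[can_swap [force_scan]]" >&2
             return 1
         fi
         echo "+ $*"
     }

     # Example use (placeholder IDs; requires root and debugfs):
     #   lru_gen_age_cmd 1 0 7 >/sys/kernel/debug/lru_gen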

 A typical use case is that a job scheduler runs this command at a
 certain time interval to create new generations, and it ranks the
 servers it manages based on the sizes of their cold pages defined by
 this time interval.
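
 A scheduler that ages periodically needs the current ``max_gen_nr``
 of each memcg and node before it can issue the command. The following
 sketch extracts it from the histogram output. It assumes the layout
 shown earlier in this section; ``lru_gen_max_gen`` is a hypothetical
 helper.
 ::

     # Hypothetical helper: print the largest generation number listed
     # for one memcg and node, or return non-zero if the pair is not
     # found.
     lru_gen_max_gen() {
         want_memcg=$1
         want_node=$2
         src=${3:-/sys/kernel/debug/lru_gen}
         awk -v m="$want_memcg" -v n="$want_node" '
             $1 == "memcg" { memcg = $2; node = ""; next }
             $1 == "node"  { node = $2; next }
             NF == 4 && memcg == m && node == n {
                 if (!found || $1 + 0 > max_gen)
                     max_gen = $1 + 0
                 found = 1
             }
             END { if (found) print max_gen; else exit 1 }
         ' "$src"
     }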

 Proactive reclaim
 -----------------
 Proactive reclaim induces page reclaim when there is no memory
 pressure. It usually targets cold pages only. E.g., when a new job
 comes in, the job scheduler wants to proactively reclaim cold pages on
 the server it selected, to improve the chance of successfully landing
 this new job.

 Users can write the following command to ``lru_gen`` to evict
 generations less than or equal to ``min_gen_nr``.

     ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``

 ``min_gen_nr`` should be less than ``max_gen_nr-1``, since
 ``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
 the active list) and therefore cannot be evicted. ``swappiness``
 overrides the default value in ``/proc/sys/vm/swappiness``.
 ``nr_to_reclaim`` limits the number of pages to evict.
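
 The constraint on ``min_gen_nr`` can be checked before the command is
 written. This is a sketch under assumptions: ``lru_gen_evict_cmd`` is
 a hypothetical helper, its first argument (the current
 ``max_gen_nr``) is used only for the check, and the IDs in the usage
 comment are placeholders.
 ::

     # Hypothetical helper: print an eviction command after checking
     # that min_gen_nr is less than max_gen_nr-1. Arguments:
     #   max_gen_nr memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]
     lru_gen_evict_cmd() {
         if [ $# -lt 4 ] || [ $# -gt 6 ]; then
             echo "usage: lru_gen_evict_cmd max_gen_nr memcg_id node_id" \
                  "min_gen_nr [swappiness [nr_to_reclaim]]" >&2
             return 1
         fi
         max_gen_nr=$1
         shift
         # After the shift, $3 is min_gen_nr.
         if [ "$3" -ge $((max_gen_nr - 1)) ]; then
             echo "min_gen_nr must be less than max_gen_nr-1" >&2
             return 1
         fi
         echo "- $*"
     }

     # Example use (placeholder IDs; requires root and debugfs):
     #   lru_gen_evict_cmd 7 1 0 5 >/sys/kernel/debug/lru_gen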

 A typical use case is that a job scheduler runs this command before it
 tries to land a new job on a server. If it fails to materialize enough
 cold pages because of the overestimation, it retries on the next
 server according to the ranking result obtained from the working set
 estimation step. This less forceful approach limits the impact on the
 existing jobs.