.. SPDX-License-Identifier: GPL-2.0

=============
Multi-Gen LRU
=============
The multi-gen LRU is an alternative LRU implementation that optimizes
page reclaim and improves performance under memory pressure. Page
reclaim decides the kernel's caching policy and ability to overcommit
memory. It directly impacts the kswapd CPU usage and RAM efficiency.

Quick start
===========
Build the kernel with the following configurations.

* ``CONFIG_LRU_GEN=y``
* ``CONFIG_LRU_GEN_ENABLED=y``

All set!

Runtime options
===============
``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
following subsections.

Kill switch
-----------
``enabled`` accepts different values to enable or disable the
following components. Its default value depends on
``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
unless some of them have unforeseen side effects. Writing to
``enabled`` has no effect when a component is not supported by the
hardware, and valid values will be accepted even when the main switch
is off.

====== ===============================================================
Values Components
====== ===============================================================
0x0001 The main switch for the multi-gen LRU.
0x0002 Clearing the accessed bit in leaf page table entries in large
       batches, when MMU sets it (e.g., on x86). This behavior can
       theoretically worsen lock contention (mmap_lock). If it is
       disabled, the multi-gen LRU will suffer a minor performance
       degradation for workloads that contiguously map hot pages,
       whose accessed bits can be otherwise cleared by fewer larger
       batches.
0x0004 Clearing the accessed bit in non-leaf page table entries as
       well, when MMU sets it (e.g., on x86). This behavior was not
       verified on x86 varieties other than Intel and AMD. If it is
       disabled, the multi-gen LRU will suffer a negligible
       performance degradation.
[yYnN] Apply to all the components above.
====== ===============================================================

E.g.,
::

    echo y >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0007
    echo 5 >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0005

Thrashing prevention
--------------------
Personal computers are more sensitive to thrashing because it can
cause janks (lags when rendering UI) and negatively impact user
experience. The multi-gen LRU offers thrashing prevention to the
majority of laptop and desktop users who do not have ``oomd``.

Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
``N`` milliseconds from getting evicted. The OOM killer is triggered
if this working set cannot be kept in memory. In other words, this
option works as an adjustable pressure relief valve, and when open, it
terminates applications that are hopefully not being used.

Based on the average human detectable lag (~100ms), ``N=1000`` usually
eliminates intolerable janks due to thrashing. Larger values like
``N=3000`` make janks less noticeable at the risk of premature OOM
kills.

The default value ``0`` means disabled.

Experimental features
=====================
``/sys/kernel/debug/lru_gen`` accepts commands described in the
following subsections. Multiple command lines are supported, as is
concatenation with the delimiters ``,`` and ``;``.

``/sys/kernel/debug/lru_gen_full`` provides additional stats for
debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
evicted generations in this file.

Working set estimation
----------------------
Working set estimation measures how much memory an application needs
in a given time interval, and it is usually done with little impact on
the performance of the application. E.g., data centers want to
optimize job scheduling (bin packing) to improve memory utilization.
When a new job comes in, the job scheduler needs to find out whether
each server it manages can allocate a certain amount of memory for
this new job before it can pick a candidate. To do so, the job
scheduler needs to estimate the working sets of the existing jobs.

When it is read, ``lru_gen`` returns a histogram of numbers of pages
accessed over different time intervals for each memcg and node.
``MAX_NR_GENS`` decides the number of bins for each histogram. The
histograms are noncumulative.
::

    memcg  memcg_id  memcg_path
       node  node_id
           min_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
           ...
           max_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages

Each bin contains an estimated number of pages that have been accessed
within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
the former is the largest and that of the latter is the smallest.

Users can write the following command to ``lru_gen`` to create a new
generation ``max_gen_nr+1``:

    ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``

``can_swap`` defaults to the swap setting and, if it is set to ``1``,
it forces the scan of anon pages when swap is off, and vice versa.
``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
employs heuristics to reduce the overhead, which is likely to reduce
the coverage as well.

A typical use case is that a job scheduler runs this command at a
certain time interval to create new generations, and it ranks the
servers it manages based on the sizes of their cold pages defined by
this time interval.

Proactive reclaim
-----------------
Proactive reclaim induces page reclaim when there is no memory
pressure. It usually targets cold pages only. E.g., when a new job
comes in, the job scheduler wants to proactively reclaim cold pages on
the server it selected, to improve the chance of successfully landing
this new job.

Users can write the following command to ``lru_gen`` to evict
generations less than or equal to ``min_gen_nr``.

    ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``

``min_gen_nr`` should be less than ``max_gen_nr-1``, since
``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
the active list) and therefore cannot be evicted. ``swappiness``
overrides the default value in ``/proc/sys/vm/swappiness``.
``nr_to_reclaim`` limits the number of pages to evict.

A typical use case is that a job scheduler runs this command before it
tries to land a new job on a server. If it fails to materialize enough
cold pages because of overestimation, it retries on the next
server according to the ranking result obtained from the working set
estimation step. This less forceful approach limits the impact on the
existing jobs.
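
As a rough illustration of how a job scheduler might combine the two
experimental features, the sketch below parses text read from
``lru_gen`` and formats the ``+`` and ``-`` commands. It assumes the
file contains only the three line types shown in the histogram layout
above. The helper names (``parse_lru_gen``, ``evictable``,
``aging_command``, ``eviction_command``) and the numbers in ``SAMPLE``
are hypothetical and are not part of any kernel interface.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Generation:
    gen_nr: int
    age_in_ms: int
    nr_anon_pages: int
    nr_file_pages: int


# Hypothetical contents of lru_gen; the values are made up for illustration.
SAMPLE = """\
memcg     1 /
 node     0
          3      50000        100        200
          4      20000         10         20
          5       5000          1          2
          6        100          5          6
"""


def parse_lru_gen(text: str) -> Dict[Tuple[int, int], List[Generation]]:
    """Map (memcg_id, node_id) to the generations listed under it."""
    histograms: Dict[Tuple[int, int], List[Generation]] = {}
    memcg_id = None
    key = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "memcg":
            memcg_id = int(fields[1])
            key = None
        elif fields[0] == "node" and memcg_id is not None:
            key = (memcg_id, int(fields[1]))
            histograms[key] = []
        elif key is not None and len(fields) >= 4:
            try:
                values = [int(f) for f in fields[:4]]
            except ValueError:
                continue  # skip lines that are not plain generation rows
            histograms[key].append(Generation(*values))
    return histograms


def evictable(gens: List[Generation]) -> List[Generation]:
    """Generations below max_gen_nr-1, which the text says can be evicted."""
    if not gens:
        return []
    max_gen_nr = max(g.gen_nr for g in gens)
    return [g for g in gens if g.gen_nr < max_gen_nr - 1]


def _command(op: str, required: list, optional: list) -> str:
    """Join arguments; optional ones are nested, so stop at the first None."""
    args = list(required)
    for i, value in enumerate(optional):
        if value is None:
            if any(v is not None for v in optional[i:]):
                raise ValueError("a later optional argument needs the earlier one")
            break
        args.append(value)
    return op + " " + " ".join(str(a) for a in args)


def aging_command(memcg_id, node_id, max_gen_nr, can_swap=None, force_scan=None):
    return _command("+", [memcg_id, node_id, max_gen_nr], [can_swap, force_scan])


def eviction_command(memcg_id, node_id, min_gen_nr, swappiness=None,
                     nr_to_reclaim=None):
    return _command("-", [memcg_id, node_id, min_gen_nr],
                    [swappiness, nr_to_reclaim])
```

A caller would write the returned string to
``/sys/kernel/debug/lru_gen``, which normally requires root and a
mounted debugfs. The sketch deliberately stops at building the string
so that it can be tried without touching a live system.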