~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~

TOMOYO Linux Cross Reference
Linux/Documentation/mm/multigen_lru.rst

Version: ~ [ linux-6.12-rc7 ] ~ [ linux-6.11.7 ] ~ [ linux-6.10.14 ] ~ [ linux-6.9.12 ] ~ [ linux-6.8.12 ] ~ [ linux-6.7.12 ] ~ [ linux-6.6.60 ] ~ [ linux-6.5.13 ] ~ [ linux-6.4.16 ] ~ [ linux-6.3.13 ] ~ [ linux-6.2.16 ] ~ [ linux-6.1.116 ] ~ [ linux-6.0.19 ] ~ [ linux-5.19.17 ] ~ [ linux-5.18.19 ] ~ [ linux-5.17.15 ] ~ [ linux-5.16.20 ] ~ [ linux-5.15.171 ] ~ [ linux-5.14.21 ] ~ [ linux-5.13.19 ] ~ [ linux-5.12.19 ] ~ [ linux-5.11.22 ] ~ [ linux-5.10.229 ] ~ [ linux-5.9.16 ] ~ [ linux-5.8.18 ] ~ [ linux-5.7.19 ] ~ [ linux-5.6.19 ] ~ [ linux-5.5.19 ] ~ [ linux-5.4.285 ] ~ [ linux-5.3.18 ] ~ [ linux-5.2.21 ] ~ [ linux-5.1.21 ] ~ [ linux-5.0.21 ] ~ [ linux-4.20.17 ] ~ [ linux-4.19.323 ] ~ [ linux-4.18.20 ] ~ [ linux-4.17.19 ] ~ [ linux-4.16.18 ] ~ [ linux-4.15.18 ] ~ [ linux-4.14.336 ] ~ [ linux-4.13.16 ] ~ [ linux-4.12.14 ] ~ [ linux-4.11.12 ] ~ [ linux-4.10.17 ] ~ [ linux-4.9.337 ] ~ [ linux-4.4.302 ] ~ [ linux-3.10.108 ] ~ [ linux-2.6.32.71 ] ~ [ linux-2.6.0 ] ~ [ linux-2.4.37.11 ] ~ [ unix-v6-master ] ~ [ ccs-tools-1.8.12 ] ~ [ policy-sample ] ~
Architecture: ~ [ i386 ] ~ [ alpha ] ~ [ m68k ] ~ [ mips ] ~ [ ppc ] ~ [ sparc ] ~ [ sparc64 ] ~

  1 .. SPDX-License-Identifier: GPL-2.0
  2 
  3 =============
  4 Multi-Gen LRU
  5 =============
  6 The multi-gen LRU is an alternative LRU implementation that optimizes
  7 page reclaim and improves performance under memory pressure. Page
  8 reclaim decides the kernel's caching policy and ability to overcommit
  9 memory. It directly impacts the kswapd CPU usage and RAM efficiency.
 10 
 11 Design overview
 12 ===============
 13 Objectives
 14 ----------
 15 The design objectives are:
 16 
 17 * Good representation of access recency
 18 * Try to profit from spatial locality
 19 * Fast paths to make obvious choices
 20 * Simple self-correcting heuristics
 21 
 22 The representation of access recency is at the core of all LRU
 23 implementations. In the multi-gen LRU, each generation represents a
 24 group of pages with similar access recency. Generations establish a
 25 (time-based) common frame of reference and therefore help make better
 26 choices, e.g., between different memcgs on a computer or different
 27 computers in a data center (for job scheduling).
 28 
 29 Exploiting spatial locality improves efficiency when gathering the
 30 accessed bit. A rmap walk targets a single page and does not try to
 31 profit from discovering a young PTE. A page table walk can sweep all
 32 the young PTEs in an address space, but the address space can be too
 33 sparse to make a profit. The key is to optimize both methods and use
 34 them in combination.
 35 
 36 Fast paths reduce code complexity and runtime overhead. Unmapped pages
 37 do not require TLB flushes; clean pages do not require writeback.
 38 These facts are only helpful when other conditions, e.g., access
 39 recency, are similar. With generations as a common frame of reference,
 40 additional factors stand out. But obvious choices might not be good
 41 choices; thus self-correction is necessary.
 42 
 43 The benefits of simple self-correcting heuristics are self-evident.
 44 Again, with generations as a common frame of reference, this becomes
 45 attainable. Specifically, pages in the same generation can be
 46 categorized based on additional factors, and a feedback loop can
 47 statistically compare the refault percentages across those categories
 48 and infer which of them are better choices.
 49 
 50 Assumptions
 51 -----------
 52 The protection of hot pages and the selection of cold pages are based
 53 on page access channels and patterns. There are two access channels:
 54 
 55 * Accesses through page tables
 56 * Accesses through file descriptors
 57 
 58 The protection of the former channel is by design stronger because:
 59 
 60 1. The uncertainty in determining the access patterns of the former
 61    channel is higher due to the approximation of the accessed bit.
 62 2. The cost of evicting the former channel is higher due to the TLB
 63    flushes required and the likelihood of encountering the dirty bit.
 64 3. The penalty of underprotecting the former channel is higher because
 65    applications usually do not prepare themselves for major page
 66    faults like they do for blocked I/O. E.g., GUI applications
 67    commonly use dedicated I/O threads to avoid blocking rendering
 68    threads.
 69 
 70 There are also two access patterns:
 71 
 72 * Accesses exhibiting temporal locality
 73 * Accesses not exhibiting temporal locality
 74 
 75 For the reasons listed above, the former channel is assumed to follow
 76 the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is
 77 present, and the latter channel is assumed to follow the latter
 78 pattern unless outlying refaults have been observed.
 79 
 80 Workflow overview
 81 =================
 82 Evictable pages are divided into multiple generations for each
 83 ``lruvec``. The youngest generation number is stored in
 84 ``lrugen->max_seq`` for both anon and file types as they are aged on
 85 an equal footing. The oldest generation numbers are stored in
 86 ``lrugen->min_seq[]`` separately for anon and file types as clean file
 87 pages can be evicted regardless of swap constraints. These three
 88 variables are monotonically increasing.
 89 
 90 Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)``
 91 bits in order to fit into the gen counter in ``folio->flags``. Each
 92 truncated generation number is an index to ``lrugen->folios[]``. The
 93 sliding window technique is used to track at least ``MIN_NR_GENS`` and
 94 at most ``MAX_NR_GENS`` generations. The gen counter stores a value
 95 within ``[1, MAX_NR_GENS]`` while a page is on one of
 96 ``lrugen->folios[]``; otherwise it stores zero.
 97 
 98 Each generation is divided into multiple tiers. A page accessed ``N``
 99 times through file descriptors is in tier ``order_base_2(N)``. Unlike
100 generations, tiers do not have dedicated ``lrugen->folios[]``. In
101 contrast to moving across generations, which requires the LRU lock,
102 moving across tiers only involves atomic operations on
103 ``folio->flags`` and therefore has a negligible cost. A feedback loop
104 modeled after the PID controller monitors refaults over all the tiers
105 from anon and file types and decides which tiers from which types to
106 evict or protect. The desired effect is to balance refault percentages
107 between anon and file types proportional to the swappiness level.
108 
109 There are two conceptually independent procedures: the aging and the
110 eviction. They form a closed-loop system, i.e., the page reclaim.
111 
112 Aging
113 -----
114 The aging produces young generations. Given an ``lruvec``, it
115 increments ``max_seq`` when ``max_seq-min_seq+1`` approaches
116 ``MIN_NR_GENS``. The aging promotes hot pages to the youngest
117 generation when it finds them accessed through page tables; the
118 demotion of cold pages happens consequently when it increments
119 ``max_seq``. The aging uses page table walks and rmap walks to find
120 young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list``
121 and calls ``walk_page_range()`` with each ``mm_struct`` on this list
122 to scan PTEs, and after each iteration, it increments ``max_seq``. For
123 the latter, when the eviction walks the rmap and finds a young PTE,
124 the aging scans the adjacent PTEs. For both, on finding a young PTE,
125 the aging clears the accessed bit and updates the gen counter of the
126 page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``.
127 
128 Eviction
129 --------
130 The eviction consumes old generations. Given an ``lruvec``, it
131 increments ``min_seq`` when ``lrugen->folios[]`` indexed by
132 ``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to
133 evict from, it first compares ``min_seq[]`` to select the older type.
134 If both types are equally old, it selects the one whose first tier has
135 a lower refault percentage. The first tier contains single-use
136 unmapped clean pages, which are the best bet. The eviction sorts a
137 page according to its gen counter if the aging has found this page
138 accessed through page tables and updated its gen counter. It also
139 moves a page to the next generation, i.e., ``min_seq+1``, if this page
140 was accessed multiple times through file descriptors and the feedback
141 loop has detected outlying refaults from the tier this page is in. To
142 this end, the feedback loop uses the first tier as the baseline, for
143 the reason stated earlier.
144 
145 Working set protection
146 ----------------------
147 Each generation is timestamped at birth. If ``lru_gen_min_ttl`` is
148 set, an ``lruvec`` is protected from the eviction when its oldest
149 generation was born within ``lru_gen_min_ttl`` milliseconds. In other
150 words, it prevents the working set of ``lru_gen_min_ttl`` milliseconds
151 from getting evicted. The OOM killer is triggered if this working set
152 cannot be kept in memory.
153 
154 This time-based approach has the following advantages:
155 
156 1. It is easier to configure because it is agnostic to applications
157    and memory sizes.
158 2. It is more reliable because it is directly wired to the OOM killer.
159 
160 ``mm_struct`` list
161 ------------------
162 An ``mm_struct`` list is maintained for each memcg, and an
163 ``mm_struct`` follows its owner task to the new memcg when this task
164 is migrated.
165 
166 A page table walker iterates ``lruvec_memcg()->mm_list`` and calls
167 ``walk_page_range()`` with each ``mm_struct`` on this list to scan
168 PTEs. When multiple page table walkers iterate the same list, each of
169 them gets a unique ``mm_struct``, and therefore they can run in
170 parallel.
171 
172 Page table walkers ignore any misplaced pages, e.g., if an
173 ``mm_struct`` was migrated, pages left in the previous memcg will be
174 ignored when the current memcg is under reclaim. Similarly, page table
175 walkers will ignore pages from nodes other than the one under reclaim.
176 
177 This infrastructure also tracks the usage of ``mm_struct`` between
178 context switches so that page table walkers can skip processes that
179 have been sleeping since the last iteration.
180 
181 Rmap/PT walk feedback
182 ---------------------
183 Searching the rmap for PTEs mapping each page on an LRU list (to test
184 and clear the accessed bit) can be expensive because pages from
185 different VMAs (PA space) are not cache friendly to the rmap (VA
186 space). For workloads mostly using mapped pages, searching the rmap
187 can incur the highest CPU cost in the reclaim path.
188 
189 ``lru_gen_look_around()`` exploits spatial locality to reduce the
190 trips into the rmap. It scans the adjacent PTEs of a young PTE and
191 promotes hot pages. If the scan was done cacheline efficiently, it
192 adds the PMD entry pointing to the PTE table to the Bloom filter. This
193 forms a feedback loop between the eviction and the aging.
194 
195 Bloom filters
196 -------------
197 Bloom filters are a space and memory efficient data structure for set
198 membership test, i.e., test if an element is not in the set or may be
199 in the set.
200 
201 In the eviction path, specifically, in ``lru_gen_look_around()``, if a
202 PMD has a sufficient number of hot pages, its address is placed in the
203 filter. In the aging path, set membership means that the PTE range
204 will be scanned for young pages.
205 
206 Note that Bloom filters are probabilistic on set membership. If a test
207 is false positive, the cost is an additional scan of a range of PTEs,
208 which may yield hot pages anyway. Parameters of the filter itself can
209 control the false positive rate in the limit.
210 
211 PID controller
212 --------------
213 A feedback loop modeled after the Proportional-Integral-Derivative
214 (PID) controller monitors refaults over anon and file types and
215 decides which type to evict when both types are available from the
216 same generation.
217 
218 The PID controller uses generations rather than the wall clock as the
219 time domain because a CPU can scan pages at different rates under
220 varying memory pressure. It calculates a moving average for each new
221 generation to avoid being permanently locked in a suboptimal state.
222 
223 Memcg LRU
224 ---------
225 An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
226 since each node and memcg combination has an LRU of folios (see
227 ``mem_cgroup_lruvec()``). Its goal is to improve the scalability of
228 global reclaim, which is critical to system-wide memory overcommit in
229 data centers. Note that memcg LRU only applies to global reclaim.
230 
231 The basic structure of an memcg LRU can be understood by an analogy to
232 the active/inactive LRU (of folios):
233 
234 1. It has the young and the old (generations), i.e., the counterparts
235    to the active and the inactive;
236 2. The increment of ``max_seq`` triggers promotion, i.e., the
237    counterpart to activation;
238 3. Other events trigger similar operations, e.g., offlining an memcg
239    triggers demotion, i.e., the counterpart to deactivation.
240 
241 In terms of global reclaim, it has two distinct features:
242 
243 1. Sharding, which allows each thread to start at a random memcg (in
244    the old generation) and improves parallelism;
245 2. Eventual fairness, which allows direct reclaim to bail out at will
246    and reduces latency without affecting fairness over some time.
247 
248 In terms of traversing memcgs during global reclaim, it improves the
249 best-case complexity from O(n) to O(1) and does not affect the
250 worst-case complexity O(n). Therefore, on average, it has a sublinear
251 complexity.
252 
253 Summary
254 -------
255 The multi-gen LRU (of folios) can be disassembled into the following
256 parts:
257 
258 * Generations
259 * Rmap walks
260 * Page table walks via ``mm_struct`` list
261 * Bloom filters for rmap/PT walk feedback
262 * PID controller for refault feedback
263 
264 The aging and the eviction form a producer-consumer model;
265 specifically, the latter drives the former by the sliding window over
266 generations. Within the aging, rmap walks drive page table walks by
267 inserting hot densely populated page tables to the Bloom filters.
268 Within the eviction, the PID controller uses refaults as the feedback
269 to select types to evict and tiers to protect.

~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~

kernel.org | git.kernel.org | LWN.net | Project Home | SVN repository | Mail admin

Linux® is a registered trademark of Linus Torvalds in the United States and other countries.
TOMOYO® is a registered trademark of NTT DATA CORPORATION.

sflogo.php