TOMOYO Linux Cross Reference
Linux/Documentation/admin-guide/mm/concepts.rst

Diff markup

Differences between /Documentation/admin-guide/mm/concepts.rst (Version linux-6.12-rc7) and /Documentation/admin-guide/mm/concepts.rst (Version linux-5.5.19)


                                                   >>   1 .. _mm_concepts:
                                                   >>   2 
  1 =================                                   3 =================
  2 Concepts overview                                   4 Concepts overview
  3 =================                                   5 =================
  4                                                     6 
  5 The memory management in Linux is a complex sy      7 The memory management in Linux is a complex system that evolved over the
  6 years and included more and more functionality      8 years and included more and more functionality to support a variety of
  7 systems from MMU-less microcontrollers to supe      9 systems from MMU-less microcontrollers to supercomputers. The memory
  8 management for systems without an MMU is calle     10 management for systems without an MMU is called ``nommu`` and it
  9 definitely deserves a dedicated document, whic     11 definitely deserves a dedicated document, which hopefully will be
 10 eventually written. Yet, although some of the      12 eventually written. Yet, although some of the concepts are the same,
 11 here we assume that an MMU is available and a      13 here we assume that an MMU is available and a CPU can translate a virtual
 12 address to a physical address.                     14 address to a physical address.
 13                                                    15 
 14 .. contents:: :local:                              16 .. contents:: :local:
 15                                                    17 
 16 Virtual Memory Primer                              18 Virtual Memory Primer
 17 =====================                              19 =====================
 18                                                    20 
 19 The physical memory in a computer system is a      21 The physical memory in a computer system is a limited resource and
 20 even for systems that support memory hotplug t     22 even for systems that support memory hotplug there is a hard limit on
 21 the amount of memory that can be installed. Th     23 the amount of memory that can be installed. The physical memory is not
 22 necessarily contiguous; it might be accessible     24 necessarily contiguous; it might be accessible as a set of distinct
 23 address ranges. Besides, different CPU archite     25 address ranges. Besides, different CPU architectures, and even
 24 different implementations of the same architec     26 different implementations of the same architecture have different views
 25 of how these address ranges are defined.           27 of how these address ranges are defined.
 26                                                    28 
 27 All this makes dealing directly with physical      29 All this makes dealing directly with physical memory quite complex and
 28 to avoid this complexity a concept of virtual      30 to avoid this complexity a concept of virtual memory was developed.
 29                                                    31 
 30 The virtual memory abstracts the details of ph     32 The virtual memory abstracts the details of physical memory from the
 31 application software, allows to keep only need     33 application software, allows to keep only needed information in the
 32 physical memory (demand paging) and provides a     34 physical memory (demand paging) and provides a mechanism for the
 33 protection and controlled sharing of data betw     35 protection and controlled sharing of data between processes.
 34                                                    36 
 35 With virtual memory, each and every memory acc     37 With virtual memory, each and every memory access uses a virtual
 36 address. When the CPU decodes an instruction t !!  38 address. When the CPU decodes the an instruction that reads (or
 37 writes) from (or to) the system memory, it tra     39 writes) from (or to) the system memory, it translates the `virtual`
 38 address encoded in that instruction to a `phys     40 address encoded in that instruction to a `physical` address that the
 39 memory controller can understand.                  41 memory controller can understand.
 40                                                    42 
 41 The physical system memory is divided into pag     43 The physical system memory is divided into page frames, or pages. The
 42 size of each page is architecture specific. So     44 size of each page is architecture specific. Some architectures allow
 43 selection of the page size from several suppor     45 selection of the page size from several supported values; this
 44 selection is performed at the kernel build tim     46 selection is performed at the kernel build time by setting an
 45 appropriate kernel configuration option.           47 appropriate kernel configuration option.
 46                                                    48 
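As a small illustration of the paragraph above, the page size in effect
on a running system can be queried from user space. This is a minimal
sketch that assumes a POSIX system and uses only the Python standard
library; the value it reports depends on the architecture and on the
kernel configuration.

```python
import mmap
import os

# Page size as exposed by the Python runtime.
page_size = mmap.PAGESIZE

# The same value obtained through sysconf(3) (POSIX systems only).
sysconf_page_size = os.sysconf("SC_PAGE_SIZE")

print(f"page size: {page_size} bytes")
```
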
 47 Each physical memory page can be mapped as one     49 Each physical memory page can be mapped as one or more virtual
 48 pages. These mappings are described by page ta     50 pages. These mappings are described by page tables that allow
 49 translation from a virtual address used by pro     51 translation from a virtual address used by programs to the physical
 50 memory address. The page tables are organized      52 memory address. The page tables are organized hierarchically.
 51                                                    53 
 52 The tables at the lowest level of the hierarch     54 The tables at the lowest level of the hierarchy contain physical
 53 addresses of actual pages used by the software     55 addresses of actual pages used by the software. The tables at higher
 54 levels contain physical addresses of the pages     56 levels contain physical addresses of the pages belonging to the lower
 55 levels. The pointer to the top level page tabl     57 levels. The pointer to the top level page table resides in a
 56 register. When the CPU performs the address tr     58 register. When the CPU performs the address translation, it uses this
 57 register to access the top level page table. T     59 register to access the top level page table. The high bits of the
 58 virtual address are used to index an entry in      60 virtual address are used to index an entry in the top level page
 59 table. That entry is then used to access the n     61 table. That entry is then used to access the next level in the
 60 hierarchy with the next bits of the virtual ad     62 hierarchy with the next bits of the virtual address as the index to
 61 that level page table. The lowest bits in the      63 that level page table. The lowest bits in the virtual address define
 62 the offset inside the actual page.                 64 the offset inside the actual page.
 63                                                    65 
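The way a virtual address is cut into per-level indices and a page
offset can be sketched in a few lines. The layout below is an
assumption chosen for illustration (x86-64 style 4-level paging with
4 KiB pages and 9 index bits per level); other architectures and
configurations use different widths, and the function name is
hypothetical.

```python
# Assumed layout: 4 levels, 9 bits per level, 12-bit page offset.
PAGE_SHIFT = 12       # lowest bits: offset inside the page
BITS_PER_LEVEL = 9    # 512 entries per page table
LEVELS = 4


def split_virtual_address(vaddr):
    """Return (indices, offset); indices are ordered top level first."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    indices = []
    for level in range(LEVELS - 1, -1, -1):
        shift = PAGE_SHIFT + level * BITS_PER_LEVEL
        indices.append((vaddr >> shift) & ((1 << BITS_PER_LEVEL) - 1))
    return indices, offset


indices, offset = split_virtual_address(0x00007F1234567ABC)
print("table indices (top level first):", indices)
print("offset inside the page:", hex(offset))
```

The hardware walk uses the first index in the top level table, the
second index in the table that entry points to, and so on; the sketch
only shows the bit slicing, not the memory accesses.
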
 64 Huge Pages                                         66 Huge Pages
 65 ==========                                         67 ==========
 66                                                    68 
 67 The address translation requires several memor     69 The address translation requires several memory accesses and memory
 68 accesses are slow relatively to CPU speed. To      70 accesses are slow relatively to CPU speed. To avoid spending precious
 69 processor cycles on the address translation, C     71 processor cycles on the address translation, CPUs maintain a cache of
 70 such translations called Translation Lookaside     72 such translations called Translation Lookaside Buffer (or
 71 TLB). Usually TLB is pretty scarce resource an     73 TLB). Usually TLB is pretty scarce resource and applications with
 72 large memory working set will experience perfo     74 large memory working set will experience performance hit because of
 73 TLB misses.                                        75 TLB misses.
 74                                                    76 
 75 Many modern CPU architectures allow mapping of     77 Many modern CPU architectures allow mapping of the memory pages
 76 directly by the higher levels in the page tabl     78 directly by the higher levels in the page table. For instance, on x86,
 77 it is possible to map 2M and even 1G pages usi     79 it is possible to map 2M and even 1G pages using entries in the second
 78 and the third level page tables. In Linux such     80 and the third level page tables. In Linux such pages are called
 79 `huge`. Usage of huge pages significantly redu     81 `huge`. Usage of huge pages significantly reduces pressure on TLB,
 80 improves TLB hit-rate and thus improves overal     82 improves TLB hit-rate and thus improves overall system performance.
 81                                                    83 
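A rough way to see why huge pages reduce TLB pressure is to compute how
much memory a fixed number of TLB entries can cover. The entry count
below is a hypothetical figure, and the model is a simplification:
real CPUs often keep separate TLBs, with different capacities, for
different page sizes.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30

TLB_ENTRIES = 1024  # hypothetical capacity, for illustration only


def tlb_reach(entries, page_size):
    """Bytes of memory that can be mapped without a TLB miss."""
    return entries * page_size


for name, size in (("4K", 4 * KIB), ("2M", 2 * MIB), ("1G", GIB)):
    reach = tlb_reach(TLB_ENTRIES, size)
    print(f"{name} pages: {reach // MIB} MiB covered")
```
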
 82 There are two mechanisms in Linux that enable      84 There are two mechanisms in Linux that enable mapping of the physical
 83 memory with the huge pages. The first one is `     85 memory with the huge pages. The first one is `HugeTLB filesystem`, or
 84 hugetlbfs. It is a pseudo filesystem that uses     86 hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
 85 store. For the files created in this filesyste     87 store. For the files created in this filesystem the data resides in
 86 the memory and mapped using huge pages. The hu     88 the memory and mapped using huge pages. The hugetlbfs is described at
 87 Documentation/admin-guide/mm/hugetlbpage.rst.  !!  89 :ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.
 88                                                    90 
 89 Another, more recent, mechanism that enables u     91 Another, more recent, mechanism that enables use of the huge pages is
 90 called `Transparent HugePages`, or THP. Unlike     92 called `Transparent HugePages`, or THP. Unlike the hugetlbfs that
 91 requires users and/or system administrators to     93 requires users and/or system administrators to configure what parts of
 92 the system memory should and can be mapped by      94 the system memory should and can be mapped by the huge pages, THP
 93 manages such mappings transparently to the use     95 manages such mappings transparently to the user and hence the
 94 name. See Documentation/admin-guide/mm/transhu !!  96 name. See
 95 about THP.                                     !!  97 :ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
                                                   >>  98 for more details about THP.
 96                                                    99 
 97 Zones                                             100 Zones
 98 =====                                             101 =====
 99                                                   102 
100 Often hardware poses restrictions on how diffe    103 Often hardware poses restrictions on how different physical memory
101 ranges can be accessed. In some cases, devices    104 ranges can be accessed. In some cases, devices cannot perform DMA to
102 all the addressable memory. In other cases, th    105 all the addressable memory. In other cases, the size of the physical
103 memory exceeds the maximal addressable size of    106 memory exceeds the maximal addressable size of virtual memory and
104 special actions are required to access portion    107 special actions are required to access portions of the memory. Linux
105 groups memory pages into `zones` according to     108 groups memory pages into `zones` according to their possible
106 usage. For example, ZONE_DMA will contain memo    109 usage. For example, ZONE_DMA will contain memory that can be used by
107 devices for DMA, ZONE_HIGHMEM will contain mem    110 devices for DMA, ZONE_HIGHMEM will contain memory that is not
108 permanently mapped into kernel's address space    111 permanently mapped into kernel's address space and ZONE_NORMAL will
109 contain normally addressed pages.                 112 contain normally addressed pages.
110                                                   113 
111 The actual layout of the memory zones is hardw    114 The actual layout of the memory zones is hardware dependent as not all
112 architectures define all zones, and requiremen    115 architectures define all zones, and requirements for DMA are different
113 for different platforms.                          116 for different platforms.
114                                                   117 
115 Nodes                                             118 Nodes
116 =====                                             119 =====
117                                                   120 
118 Many multi-processor machines are NUMA - Non-U    121 Many multi-processor machines are NUMA - Non-Uniform Memory Access -
119 systems. In such systems the memory is arrange    122 systems. In such systems the memory is arranged into banks that have
120 different access latency depending on the "dis    123 different access latency depending on the "distance" from the
121 processor. Each bank is referred to as a `node    124 processor. Each bank is referred to as a `node` and for each node Linux
122 constructs an independent memory management su    125 constructs an independent memory management subsystem. A node has its
123 own set of zones, lists of free and used pages    126 own set of zones, lists of free and used pages and various statistics
124 counters. You can find more details about NUMA    127 counters. You can find more details about NUMA in
125 Documentation/mm/numa.rst` and in              !! 128 :ref:`Documentation/vm/numa.rst <numa>` and in
126 Documentation/admin-guide/mm/numa_memory_polic !! 129 :ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.
127                                                   130 
128 Page cache                                        131 Page cache
129 ==========                                        132 ==========
130                                                   133 
131 The physical memory is volatile and the common    134 The physical memory is volatile and the common case for getting data
132 into the memory is to read it from files. When    135 into the memory is to read it from files. Whenever a file is read, the
133 data is put into the `page cache` to avoid exp    136 data is put into the `page cache` to avoid expensive disk access on
134 the subsequent reads. Similarly, when one writ    137 the subsequent reads. Similarly, when one writes to a file, the data
135 is placed in the page cache and eventually get    138 is placed in the page cache and eventually gets into the backing
136 storage device. The written pages are marked a    139 storage device. The written pages are marked as `dirty` and when Linux
137 decides to reuse them for other purposes, it m    140 decides to reuse them for other purposes, it makes sure to synchronize
138 the file contents on the device with the updat    141 the file contents on the device with the updated data.
139                                                   142 
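From user space, the path described above looks roughly like the
sketch below: a write leaves the data in the page cache, and an
explicit fsync(2) asks the kernel to synchronize the file contents
with the device instead of waiting for writeback. The temporary file
is only a stand-in for any regular file, and whether a given read is
really served from the page cache cannot be observed from this code.

```python
import os
import tempfile

with tempfile.NamedTemporaryFile() as f:
    f.write(b"hello page cache\n")  # data sits in Python's own buffer
    f.flush()                       # now it is in the kernel page cache
    os.fsync(f.fileno())            # request write-out of the dirty pages
    f.seek(0)
    data = f.read()                 # usually satisfied from the page cache

print(data)
```
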
140 Anonymous Memory                                  143 Anonymous Memory
141 ================                                  144 ================
142                                                   145 
143 The `anonymous memory` or `anonymous mappings`    146 The `anonymous memory` or `anonymous mappings` represent memory that
144 is not backed by a filesystem. Such mappings a    147 is not backed by a filesystem. Such mappings are implicitly created
145 for program's stack and heap or by explicit ca    148 for program's stack and heap or by explicit calls to mmap(2) system
146 call. Usually, the anonymous mappings only def    149 call. Usually, the anonymous mappings only define virtual memory areas
147 that the program is allowed to access. The rea    150 that the program is allowed to access. The read accesses will result
148 in creation of a page table entry that referen    151 in creation of a page table entry that references a special physical
149 page filled with zeroes. When the program perf    152 page filled with zeroes. When the program performs a write, a regular
150 physical page will be allocated to hold the wr    153 physical page will be allocated to hold the written data. The page
151 will be marked dirty and if the kernel decides    154 will be marked dirty and if the kernel decides to repurpose it,
152 the dirty page will be swapped out.               155 the dirty page will be swapped out.
153                                                   156 
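An explicit anonymous mapping can be created from Python through the
standard ``mmap`` module, which wraps mmap(2). This is a minimal
sketch assuming a Unix-like system; it shows the visible behaviour
(reads return zeroes, writes stick) but not the page table changes
that happen underneath.

```python
import mmap

length = 4 * mmap.PAGESIZE

# fileno -1 together with MAP_ANONYMOUS: no file backs this mapping.
buf = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)

first_read = buf[:8]     # untouched anonymous memory reads as zeroes
buf[0:5] = b"hello"      # the write needs a real physical page
after_write = buf[:5]

buf.close()
print(first_read, after_write)
```
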
154 Reclaim                                           157 Reclaim
155 =======                                           158 =======
156                                                   159 
157 Throughout the system lifetime, a physical pag    160 Throughout the system lifetime, a physical page can be used for storing
158 different types of data. It can be kernel inte    161 different types of data. It can be kernel internal data structures,
159 DMA'able buffers for device drivers use, data     162 DMA'able buffers for device drivers use, data read from a filesystem,
160 memory allocated by user space processes etc.     163 memory allocated by user space processes etc.
161                                                   164 
162 Depending on the page usage it is treated diff    165 Depending on the page usage it is treated differently by the Linux
163 memory management. The pages that can be freed    166 memory management. The pages that can be freed at any time, either
164 because they cache the data available elsewher    167 because they cache the data available elsewhere, for instance, on a
165 hard disk, or because they can be swapped out,    168 hard disk, or because they can be swapped out, again, to the hard
166 disk, are called `reclaimable`. The most notab    169 disk, are called `reclaimable`. The most notable categories of the
167 reclaimable pages are page cache and anonymous    170 reclaimable pages are page cache and anonymous memory.
168                                                   171 
169 In most cases, the pages holding internal kern    172 In most cases, the pages holding internal kernel data and used as DMA
170 buffers cannot be repurposed, and they remain     173 buffers cannot be repurposed, and they remain pinned until freed by
171 their user. Such pages are called `unreclaimab    174 their user. Such pages are called `unreclaimable`. However, in certain
172 circumstances, even pages occupied with kernel    175 circumstances, even pages occupied with kernel data structures can be
173 reclaimed. For instance, in-memory caches of f    176 reclaimed. For instance, in-memory caches of filesystem metadata can
174 be re-read from the storage device and therefo    177 be re-read from the storage device and therefore it is possible to
175 discard them from the main memory when system     178 discard them from the main memory when system is under memory
176 pressure.                                         179 pressure.
177                                                   180 
178 The process of freeing the reclaimable physica    181 The process of freeing the reclaimable physical memory pages and
179 repurposing them is called (surprise!) `reclai    182 repurposing them is called (surprise!) `reclaim`. Linux can reclaim
180 pages either asynchronously or synchronously,     183 pages either asynchronously or synchronously, depending on the state
181 of the system. When the system is not loaded,     184 of the system. When the system is not loaded, most of the memory is free
182 and allocation requests will be satisfied imme    185 and allocation requests will be satisfied immediately from the free
183 pages supply. As the load increases, the amoun    186 pages supply. As the load increases, the amount of the free pages goes
184 down and when it reaches a certain threshold ( !! 187 down and when it reaches a certain threshold (high watermark), an
185 allocation request will awaken the ``kswapd``     188 allocation request will awaken the ``kswapd`` daemon. It will
186 asynchronously scan memory pages and either ju    189 asynchronously scan memory pages and either just free them if the data
187 they contain is available elsewhere, or evict     190 they contain is available elsewhere, or evict to the backing storage
188 device (remember those dirty pages?). As memor    191 device (remember those dirty pages?). As memory usage increases even
189 more and reaches another threshold - min water    192 more and reaches another threshold - min watermark - an allocation
190 will trigger `direct reclaim`. In this case al    193 will trigger `direct reclaim`. In this case allocation is stalled
191 until enough memory pages are reclaimed to sat    194 until enough memory pages are reclaimed to satisfy the request.
192                                                   195 
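The two thresholds described above can be pictured with a toy decision
function. This is only an illustrative model, not kernel code: the
function and parameter names are hypothetical, the numbers are made
up, and the real allocator makes this decision per zone with more
conditions than a pair of comparisons.

```python
def allocation_path(free_pages, kswapd_wmark, min_wmark):
    """Classify how an allocation would proceed in this toy model."""
    if free_pages <= min_wmark:
        # The allocating task stalls and reclaims pages itself.
        return "direct reclaim"
    if free_pages <= kswapd_wmark:
        # The allocation still succeeds; background reclaim is woken.
        return "wake kswapd"
    return "free pages available"


# Made-up thresholds, in pages.
for free in (10000, 3000, 500):
    print(free, "->", allocation_path(free, kswapd_wmark=4000, min_wmark=1000))
```
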
193 Compaction                                        196 Compaction
194 ==========                                        197 ==========
195                                                   198 
196 As the system runs, tasks allocate and free th    199 As the system runs, tasks allocate and free the memory and it becomes
197 fragmented. Although with virtual memory it is    200 fragmented. Although with virtual memory it is possible to present
198 scattered physical pages as virtually contiguo    201 scattered physical pages as virtually contiguous range, sometimes it is
199 necessary to allocate large physically contigu    202 necessary to allocate large physically contiguous memory areas. Such
200 need may arise, for instance, when a device dr    203 need may arise, for instance, when a device driver requires a large
201 buffer for DMA, or when THP allocates a huge p    204 buffer for DMA, or when THP allocates a huge page. Memory `compaction`
202 addresses the fragmentation issue. This mechan    205 addresses the fragmentation issue. This mechanism moves occupied pages
203 from the lower part of a memory zone to free p    206 from the lower part of a memory zone to free pages in the upper part
204 of the zone. When a compaction scan is finishe    207 of the zone. When a compaction scan is finished free pages are grouped
205 together at the beginning of the zone and allo    208 together at the beginning of the zone and allocations of large
206 physically contiguous areas become possible.      209 physically contiguous areas become possible.
207                                                   210 
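The movement of pages described above can be imitated with a toy model
in which a zone is a list, ``None`` marks a free page and any other
value marks an occupied one. One scanner walks up from the low end
looking for occupied pages, another walks down from the high end
looking for free pages, and occupied pages are moved into the free
slots until the scanners meet. The function name is hypothetical and
the model ignores unmovable pages, page orders and everything else the
kernel has to consider.

```python
def compact(zone):
    """Return a compacted copy of zone (None = free page)."""
    zone = list(zone)
    migrate = 0              # scans upward for occupied pages
    free = len(zone) - 1     # scans downward for free pages
    while True:
        while migrate < free and zone[migrate] is None:
            migrate += 1
        while migrate < free and zone[free] is not None:
            free -= 1
        if migrate >= free:
            break
        # Move the occupied page into the free slot higher in the zone.
        zone[free], zone[migrate] = zone[migrate], None
    return zone


before = ["A", None, "B", None, None, "C", None, "D"]
after = compact(before)
print(after)
```

After the scan, the free entries form one contiguous run at the start
of the list, which is the situation that makes a large contiguous
allocation possible.
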
208 Like reclaim, the compaction may happen asynch    211 Like reclaim, the compaction may happen asynchronously in the ``kcompactd``
209 daemon or synchronously as a result of a memor    212 daemon or synchronously as a result of a memory allocation request.
210                                                   213 
211 OOM killer                                        214 OOM killer
212 ==========                                        215 ==========
213                                                   216 
214 It is possible that on a loaded machine memory    217 It is possible that on a loaded machine memory will be exhausted and the
215 kernel will be unable to reclaim enough memory    218 kernel will be unable to reclaim enough memory to continue to operate. In
216 order to save the rest of the system, it invok    219 order to save the rest of the system, it invokes the `OOM killer`.
217                                                   220 
218 The `OOM killer` selects a task to sacrifice f    221 The `OOM killer` selects a task to sacrifice for the sake of the overall
219 system health. The selected task is killed in     222 system health. The selected task is killed in a hope that after it exits
220 enough memory will be freed to continue normal    223 enough memory will be freed to continue normal operation.
                                                      
