.. _mm_concepts:

=================
Concepts overview
=================

The memory management in Linux is a complex system that has evolved over
the years to include more and more functionality to support a variety of
systems from MMU-less microcontrollers to supercomputers. The memory
management for systems without an MMU is called ``nommu`` and it
definitely deserves a dedicated document, which will hopefully be
written eventually. Yet, although some of the concepts are the same,
here we assume that an MMU is available and a CPU can translate a virtual
address to a physical address.

.. contents:: :local:

Virtual Memory Primer
=====================

The physical memory in a computer system is a limited resource and
even for systems that support memory hotplug there is a hard limit on
the amount of memory that can be installed. The physical memory is not
necessarily contiguous; it might be accessible as a set of distinct
address ranges.
Besides, different CPU architectures, and even
different implementations of the same architecture, have different views
of how these address ranges are defined.

All this makes dealing directly with physical memory quite complex and
to avoid this complexity the concept of virtual memory was developed.

The virtual memory abstracts the details of physical memory from the
application software, allows keeping only the needed information in the
physical memory (demand paging) and provides a mechanism for the
protection and controlled sharing of data between processes.

With virtual memory, each and every memory access uses a virtual
address. When the CPU decodes an instruction that reads (or
writes) from (or to) the system memory, it translates the `virtual`
address encoded in that instruction to a `physical` address that the
memory controller can understand.

The physical system memory is divided into page frames, or pages. The
size of each page is architecture specific.
Some architectures allow
selection of the page size from several supported values; this
selection is performed at the kernel build time by setting an
appropriate kernel configuration option.

Each physical memory page can be mapped as one or more virtual
pages. These mappings are described by page tables that allow
translation from a virtual address used by programs to the physical
memory address. The page tables are organized hierarchically.

The tables at the lowest level of the hierarchy contain physical
addresses of actual pages used by the software. The tables at higher
levels contain physical addresses of the pages belonging to the lower
levels. The pointer to the top level page table resides in a
register. When the CPU performs the address translation, it uses this
register to access the top level page table. The high bits of the
virtual address are used to index an entry in the top level page
table. That entry is then used to access the next level in the
hierarchy with the next bits of the virtual address as the index to
that level page table.
The lowest bits in the virtual address define
the offset inside the actual page.

Huge Pages
==========

The address translation requires several memory accesses and memory
accesses are slow relative to CPU speed. To avoid spending precious
processor cycles on the address translation, CPUs maintain a cache of
such translations called the Translation Lookaside Buffer (or
TLB). The TLB is usually a pretty scarce resource and applications with
a large memory working set will experience a performance hit because of
TLB misses.

Many modern CPU architectures allow mapping of the memory pages
directly by the higher levels in the page table. For instance, on x86,
it is possible to map 2M and even 1G pages using entries in the second
and the third level page tables. In Linux such pages are called
`huge`. Usage of huge pages significantly reduces pressure on the TLB,
improves the TLB hit-rate and thus improves overall system performance.

There are two mechanisms in Linux that enable mapping of the physical
memory with the huge pages. The first one is `HugeTLB filesystem`, or
hugetlbfs.
It is a pseudo filesystem that uses RAM as its backing
store. For the files created in this filesystem the data resides in
the memory and is mapped using huge pages. The hugetlbfs is described in
:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.

Another, more recent, mechanism that enables use of the huge pages is
called `Transparent HugePages`, or THP. Unlike the hugetlbfs that
requires users and/or system administrators to configure what parts of
the system memory should and can be mapped by the huge pages, THP
manages such mappings transparently to the user and hence the
name. See
:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
for more details about THP.

Zones
=====

Often hardware poses restrictions on how different physical memory
ranges can be accessed. In some cases, devices cannot perform DMA to
all the addressable memory. In other cases, the size of the physical
memory exceeds the maximal addressable size of virtual memory and
special actions are required to access portions of the memory.
Linux
groups memory pages into `zones` according to their possible
usage. For example, ZONE_DMA will contain memory that can be used by
devices for DMA, ZONE_HIGHMEM will contain memory that is not
permanently mapped into the kernel's address space and ZONE_NORMAL will
contain normally addressed pages.

The actual layout of the memory zones is hardware dependent as not all
architectures define all zones, and requirements for DMA are different
for different platforms.

Nodes
=====

Many multi-processor machines are NUMA (Non-Uniform Memory Access)
systems. In such systems the memory is arranged into banks that have
different access latency depending on the "distance" from the
processor. Each bank is referred to as a `node` and for each node Linux
constructs an independent memory management subsystem. A node has its
own set of zones, lists of free and used pages and various statistics
counters. You can find more details about NUMA in
:ref:`Documentation/vm/numa.rst <numa>` and in
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.

Page cache
==========

The physical memory is volatile and the common case for getting data
into the memory is to read it from files. Whenever a file is read, the
data is put into the `page cache` to avoid expensive disk access on
the subsequent reads. Similarly, when one writes to a file, the data
is placed in the page cache and eventually gets into the backing
storage device. The written pages are marked as `dirty` and when Linux
decides to reuse them for other purposes, it makes sure to synchronize
the file contents on the device with the updated data.

Anonymous Memory
================

The `anonymous memory` or `anonymous mappings` represent memory that
is not backed by a filesystem. Such mappings are implicitly created
for a program's stack and heap or by explicit calls to the mmap(2)
system call. Usually, the anonymous mappings only define virtual memory
areas that the program is allowed to access.
Read accesses will result
in the creation of a page table entry that references a special physical
page filled with zeroes. When the program performs a write, a regular
physical page will be allocated to hold the written data. The page
will be marked dirty and if the kernel decides to repurpose it,
the dirty page will be swapped out.

Reclaim
=======

Throughout the system lifetime, a physical page can be used for storing
different types of data. It can be kernel internal data structures,
DMA'able buffers for use by device drivers, data read from a filesystem,
memory allocated by user space processes, etc.

Depending on its usage, a page is treated differently by the Linux
memory management. The pages that can be freed at any time, either
because they cache the data available elsewhere, for instance, on a
hard disk, or because they can be swapped out, again, to the hard
disk, are called `reclaimable`. The most notable categories of the
reclaimable pages are page cache and anonymous memory.
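
Anonymous memory, one of the two reclaimable categories named above, is
easy to observe from user space. The following is a minimal sketch, not
kernel code: it assumes Python's standard ``mmap`` module, where passing
``-1`` as the file descriptor requests a mapping that is not backed by
any file. The helper name ``anonymous_mapping_demo`` is hypothetical and
exists only for this illustration.

```python
import mmap


def anonymous_mapping_demo(num_pages=4):
    """Create an anonymous mapping, read it, then write to it.

    Hypothetical helper, for illustration only.
    """
    length = num_pages * mmap.PAGESIZE
    # fileno=-1 asks for a mapping that is not backed by a file.
    with mmap.mmap(-1, length) as region:
        # Untouched anonymous memory reads back as zeroes.
        before = bytes(region[:8])
        # Writing is the point at which the kernel has to provide
        # a regular physical page to hold the data.
        region[:5] = b"hello"
        after = bytes(region[:8])
    return length, before, after


if __name__ == "__main__":
    length, before, after = anonymous_mapping_demo()
    print(length, before, after)
```

The sketch only shows the behaviour visible to the program (zeroes
before the write, the written data afterwards); which physical pages
back the mapping at each step is decided by the kernel and cannot be
seen from this code.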

In most cases, the pages holding internal kernel data and used as DMA
buffers cannot be repurposed, and they remain pinned until freed by
their user. Such pages are called `unreclaimable`. However, in certain
circumstances, even pages occupied with kernel data structures can be
reclaimed. For instance, in-memory caches of filesystem metadata can
be re-read from the storage device and therefore it is possible to
discard them from the main memory when the system is under memory
pressure.

The process of freeing the reclaimable physical memory pages and
repurposing them is called (surprise!) `reclaim`. Linux can reclaim
pages either asynchronously or synchronously, depending on the state
of the system. When the system is not loaded, most of the memory is free
and allocation requests will be satisfied immediately from the free
pages supply. As the load increases, the amount of the free pages goes
down and when it reaches a certain threshold (low watermark), an
allocation request will awaken the ``kswapd`` daemon.
It will
asynchronously scan memory pages and either just free them if the data
they contain is available elsewhere, or evict them to the backing
storage device (remember those dirty pages?). As memory usage increases
even more and reaches another threshold (min watermark), an allocation
will trigger `direct reclaim`. In this case the allocation is stalled
until enough memory pages are reclaimed to satisfy the request.

Compaction
==========

As the system runs, tasks allocate and free the memory and it becomes
fragmented. Although with virtual memory it is possible to present
scattered physical pages as a virtually contiguous range, sometimes it is
necessary to allocate large physically contiguous memory areas. Such a
need may arise, for instance, when a device driver requires a large
buffer for DMA, or when THP allocates a huge page. Memory `compaction`
addresses the fragmentation issue. This mechanism moves occupied pages
from the lower part of a memory zone to free pages in the upper part
of the zone.
When a compaction scan is finished, free pages are grouped
together at the beginning of the zone and allocations of large
physically contiguous areas become possible.

Like reclaim, the compaction may happen asynchronously in the ``kcompactd``
daemon or synchronously as a result of a memory allocation request.

OOM killer
==========

It is possible that on a loaded machine memory will be exhausted and the
kernel will be unable to reclaim enough memory to continue to operate. In
order to save the rest of the system, it invokes the `OOM killer`.

The `OOM killer` selects a task to sacrifice for the sake of the overall
system health. The selected task is killed in the hope that after it exits
enough memory will be freed to continue normal operation.
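
The selection step can be pictured with a deliberately simplified
sketch. It is not the kernel's actual heuristic: it assumes that each
task is described only by a resident page count and a hypothetical
adjustment value, and it simply picks the task with the highest
resulting score. The names ``Task``, ``adjustment``, ``score`` and
``select_victim`` are invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    resident_pages: int  # pages currently held in physical memory
    adjustment: int = 0  # hypothetical bias; negative protects the task


def score(task):
    # Toy score: memory footprint plus bias, never below zero.
    return max(0, task.resident_pages + task.adjustment)


def select_victim(tasks):
    # Only tasks with a positive score are candidates; return None
    # when there is nothing worth killing.
    candidates = [t for t in tasks if score(t) > 0]
    if not candidates:
        return None
    return max(candidates, key=score)


if __name__ == "__main__":
    tasks = [
        Task("database", 900_000, adjustment=-1_000_000),
        Task("browser", 400_000),
        Task("shell", 2_000),
    ]
    print(select_victim(tasks).name)
```

In this toy model the largest task is protected by its adjustment, so
the next largest one is chosen; the point is only that the choice is
made by ranking tasks, not how the real ranking is computed.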