.. SPDX-License-Identifier: GPL-2.0

.. _physical_memory_model:

=====================
Physical Memory Model
=====================

Physical memory in a system may be addressed in different ways. The
simplest case is when the physical memory starts at address 0 and
spans a contiguous range up to the maximal address. It could be,
however, that this range contains small holes that are not accessible
to the CPU. Then there could be several contiguous ranges at
completely distinct addresses. And, don't forget about NUMA, where
different memory banks are attached to different CPUs.

Linux abstracts this diversity using one of the two memory models:
FLATMEM and SPARSEMEM. Each architecture defines what
memory models it supports, what the default memory model is and
whether it is possible to manually override that default.

All the memory models track the status of physical page frames using
struct page arranged in one or more arrays.

Regardless of the selected memory model, there exists a one-to-one
mapping between the physical page frame number (PFN) and the
corresponding `struct page`.

Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
helpers that allow the conversion from PFN to `struct page` and vice
versa.

FLATMEM
=======

The simplest memory model is FLATMEM. This model is suitable for
non-NUMA systems with contiguous, or mostly contiguous, physical
memory.

In the FLATMEM memory model, there is a global `mem_map` array that
maps the entire physical memory. For most architectures, the holes
have entries in the `mem_map` array. The `struct page` objects
corresponding to the holes are never fully initialized.

To allocate the `mem_map` array, architecture-specific setup code should
call the :c:func:`free_area_init` function. Yet, the mappings array is not
usable until the call to :c:func:`memblock_free_all` that hands all the
memory to the page allocator.

An architecture may free parts of the `mem_map` array that do not cover the
actual physical pages. In such a case, the architecture-specific
:c:func:`pfn_valid` implementation should take the holes in the
`mem_map` into account.

With FLATMEM, the conversion between a PFN and the `struct page` is
straightforward: `PFN - ARCH_PFN_OFFSET` is an index into the
`mem_map` array.

The `ARCH_PFN_OFFSET` defines the first page frame number for
systems with physical memory starting at an address different from 0.

SPARSEMEM
=========

SPARSEMEM is the most versatile memory model available in Linux and it
is the only memory model that supports several advanced features such
as hot-plug and hot-remove of the physical memory, alternative memory
maps for non-volatile memory devices and deferred initialization of
the memory map for larger systems.

The SPARSEMEM model presents the physical memory as a collection of
sections. A section is represented with struct mem_section
that contains `section_mem_map` that is, logically, a pointer to an
array of struct pages. However, it is stored with some other magic
that aids the sections management. The section size and maximal number
of sections are specified using `SECTION_SIZE_BITS` and
`MAX_PHYSMEM_BITS` constants defined by each architecture that
supports SPARSEMEM. While `MAX_PHYSMEM_BITS` is the actual width of a
physical address that an architecture supports, the
`SECTION_SIZE_BITS` is an arbitrary value.

The maximal number of sections is denoted `NR_MEM_SECTIONS` and
defined as

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}

The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`. The size and placement of this array depend
on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of
sections:

* When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections`
  array is static and has `NR_MEM_SECTIONS` rows. Each row holds a
  single `mem_section` object.
* When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections`
  array is dynamically allocated. Each row contains `PAGE_SIZE` worth of
  `mem_section` objects and the number of rows is calculated to fit
  all the memory sections.
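
The arithmetic behind `NR_MEM_SECTIONS` and the shape of the
`mem_sections` array can be tried out in user space. The following is a
minimal sketch, not kernel code: the function names are hypothetical, and
the values used in the usage note (46 physical address bits, 27 section
size bits, 4096-byte pages, a 16-byte `mem_section`) are assumptions
chosen for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* NR_MEM_SECTIONS = 2^(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS) */
uint64_t ex_nr_mem_sections(unsigned max_physmem_bits,
                            unsigned section_size_bits)
{
    assert(max_physmem_bits >= section_size_bits);
    assert(max_physmem_bits - section_size_bits < 64);
    return UINT64_C(1) << (max_physmem_bits - section_size_bits);
}

/*
 * Number of mem_section objects held by one row of the array:
 * a single object without SPARSEMEM_EXTREME, one page worth of
 * objects with it.
 */
uint64_t ex_sections_per_row(int extreme, uint64_t page_size,
                             uint64_t mem_section_size)
{
    assert(mem_section_size != 0 && mem_section_size <= page_size);
    return extreme ? page_size / mem_section_size : 1;
}

/* Number of rows needed to fit all sections (rounded up). */
uint64_t ex_nr_rows(uint64_t nr_sections, uint64_t per_row)
{
    assert(per_row != 0);
    return (nr_sections + per_row - 1) / per_row;
}
```

Under the assumed values, `ex_nr_mem_sections(46, 27)` yields 524288
sections. Without `CONFIG_SPARSEMEM_EXTREME` that means 524288 rows of one
object each; with it, each row holds 256 objects and 2048 rows are enough.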

The architecture setup code should call sparse_init() to
initialize the memory sections and the memory maps.

With SPARSEMEM there are two possible ways to convert a PFN to the
corresponding `struct page` - a "classic sparse" and "sparse
vmemmap". The selection is made at build time and it is determined by
the value of `CONFIG_SPARSEMEM_VMEMMAP`.

The classic sparse encodes the section number of a page in page->flags
and uses high bits of a PFN to access the section that maps that page
frame. Inside a section, the PFN is the index to the array of pages.

The sparse vmemmap uses a virtually mapped memory map to optimize
pfn_to_page and page_to_pfn operations. There is a global `struct
page *vmemmap` pointer that points to a virtually contiguous array of
`struct page` objects. A PFN is an index to that array and the
offset of the `struct page` from `vmemmap` is the PFN of that
page.
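
The three PFN conversions described in this document (FLATMEM, classic
sparse and sparse vmemmap) can be compared side by side in a user-space
model. This is a sketch under stated assumptions, not the kernel
implementation: `struct page` is a one-field stand-in, every `ex_`/`EX_`
name is hypothetical, the whole `flags` word holds the section number
(the kernel uses only some of its bits), and the section array is indexed
with the low PFN bits.

```c
#include <assert.h>

struct page { unsigned long flags; };  /* stand-in, not the kernel type */

/* FLATMEM: PFN - ARCH_PFN_OFFSET indexes one global array. */
#define EX_ARCH_PFN_OFFSET 0x100UL     /* assumed first valid PFN */
#define EX_NR_PAGES        64UL
static struct page ex_mem_map[EX_NR_PAGES];

struct page *ex_flat_pfn_to_page(unsigned long pfn)
{
    assert(pfn >= EX_ARCH_PFN_OFFSET);
    assert(pfn - EX_ARCH_PFN_OFFSET < EX_NR_PAGES);
    return &ex_mem_map[pfn - EX_ARCH_PFN_OFFSET];
}

unsigned long ex_flat_page_to_pfn(const struct page *page)
{
    return (unsigned long)(page - ex_mem_map) + EX_ARCH_PFN_OFFSET;
}

/* Sparse vmemmap: the PFN itself is the offset from the base pointer. */
static struct page ex_vmemmap_backing[EX_NR_PAGES];
static struct page *const ex_vmemmap = ex_vmemmap_backing;

struct page *ex_vmemmap_pfn_to_page(unsigned long pfn)
{
    assert(pfn < EX_NR_PAGES);
    return ex_vmemmap + pfn;
}

unsigned long ex_vmemmap_page_to_pfn(const struct page *page)
{
    return (unsigned long)(page - ex_vmemmap);
}

/* Classic sparse: high PFN bits select a section with its own array. */
#define EX_PFN_SECTION_SHIFT 4
#define EX_PAGES_PER_SECTION (1UL << EX_PFN_SECTION_SHIFT)
#define EX_NR_SECTIONS       4UL

struct ex_section { struct page *map; };
static struct page ex_section_pages[EX_NR_SECTIONS][EX_PAGES_PER_SECTION];
static struct ex_section ex_sections[EX_NR_SECTIONS];

/* Attach each section's array and record the section number per page. */
void ex_sparse_init(void)
{
    for (unsigned long s = 0; s < EX_NR_SECTIONS; s++) {
        ex_sections[s].map = ex_section_pages[s];
        for (unsigned long i = 0; i < EX_PAGES_PER_SECTION; i++)
            ex_section_pages[s][i].flags = s;
    }
}

struct page *ex_sparse_pfn_to_page(unsigned long pfn)
{
    unsigned long nr = pfn >> EX_PFN_SECTION_SHIFT;

    assert(nr < EX_NR_SECTIONS);
    return ex_sections[nr].map + (pfn & (EX_PAGES_PER_SECTION - 1));
}

unsigned long ex_sparse_page_to_pfn(const struct page *page)
{
    unsigned long nr = page->flags;  /* section number stored at init */

    return (nr << EX_PFN_SECTION_SHIFT) +
           (unsigned long)(page - ex_sections[nr].map);
}
```

In this model the vmemmap variant needs a single pointer addition, while
the classic variant has to look up the section first. That extra lookup
is the cost the virtually mapped memory map is meant to avoid.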

To use vmemmap, an architecture has to reserve a range of virtual
addresses that will map the physical pages containing the memory
map and make sure that `vmemmap` points to that range. In addition,
the architecture should implement the :c:func:`vmemmap_populate` method
that will allocate the physical memory and create page tables for the
virtual memory map. If an architecture does not have any special
requirements for the vmemmap mappings, it can use the default
:c:func:`vmemmap_populate_basepages` provided by the generic memory
management.

The virtually mapped memory map allows storing `struct page` objects
for persistent memory devices in pre-allocated storage on those
devices. This storage is represented with struct vmem_altmap
that is eventually passed to vmemmap_populate() through a long chain
of function calls. The vmemmap_populate() implementation may use the
`vmem_altmap` along with the :c:func:`vmemmap_alloc_block_buf` helper to
allocate the memory map on the persistent memory device.
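
Because a PFN indexes the `vmemmap` array directly, the reserved virtual
range has to be large enough for one `struct page` per possible page
frame. The following sketch estimates that size. The function name is
hypothetical, and the 64-byte `struct page` used in the usage note is an
assumption, since the real size depends on the kernel configuration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of virtual address space needed for a memory map that covers
 * 2^phys_bits bytes of physical address space.
 */
uint64_t ex_vmemmap_span_bytes(unsigned phys_bits, unsigned page_shift,
                               uint64_t struct_page_size)
{
    uint64_t nr_pfns;

    assert(phys_bits >= page_shift);
    assert(phys_bits - page_shift < 64);
    nr_pfns = UINT64_C(1) << (phys_bits - page_shift);

    /* Refuse inputs whose product does not fit in 64 bits. */
    assert(struct_page_size != 0);
    assert(nr_pfns <= UINT64_MAX / struct_page_size);
    return nr_pfns * struct_page_size;
}
```

For example, with 46 physical address bits, 4096-byte pages and the
assumed 64-byte `struct page`, the array spans 2^40 bytes (1 TiB) of
virtual address space. Only the parts backed by real memory need
physical pages behind them.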

ZONE_DEVICE
===========

The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
`struct page` `mem_map` services for device driver identified physical
address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
that the page objects for these address ranges are never marked online,
and that a reference must be taken against the device, not just the page
to keep the memory pinned for active use. `ZONE_DEVICE`, via
:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
:c:func:`get_user_pages` service for the given range of pfns. Since the
page reference count never drops below 1, the page is never tracked as
free memory and the page's `struct list_head lru` space is repurposed
for back referencing to the host device / driver that mapped the memory.

While `SPARSEMEM` presents memory as a collection of sections,
optionally collected into memory blocks, `ZONE_DEVICE` users have a need
for smaller granularity of populating the `mem_map`. Given that
`ZONE_DEVICE` memory is never marked online, it is subsequently never
subject to its memory ranges being exposed through the sysfs memory
hotplug API on memory block boundaries. The implementation relies on
this lack of user-API constraint to allow sub-section sized memory
ranges to be specified to :c:func:`arch_add_memory`, the top-half of
memory hotplug. Sub-section support allows for 2MB as the cross-arch
common alignment granularity for :c:func:`devm_memremap_pages`.

The users of `ZONE_DEVICE` are:

* pmem: Map platform persistent memory to be used as a direct-I/O target
  via DAX mappings.

* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
  event callbacks to allow a device-driver to coordinate memory management
  events related to device-memory, typically GPU memory. See
  Documentation/mm/hmm.rst.

* p2pdma: Create `struct page` objects to allow peer devices in a
  PCI/-E topology to coordinate direct-DMA operations between themselves,
  i.e. bypass host memory.
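
The 2MB alignment granularity mentioned above for
:c:func:`devm_memremap_pages` can be illustrated with plain address
arithmetic. This sketch only shows what "aligned to 2MB" means for a
physical range. It does not reproduce any check the kernel performs, and
all names in it are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EX_SUBSECTION_SIZE (UINT64_C(2) << 20)  /* 2MB, as stated above */

/* Round value up to the next multiple of a power-of-two alignment. */
uint64_t ex_align_up(uint64_t value, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    assert(value <= UINT64_MAX - (align - 1));
    return (value + align - 1) & ~(align - 1);
}

/* True if both the start and the size of a range are 2MB multiples. */
bool ex_range_is_subsection_aligned(uint64_t start, uint64_t size)
{
    return size != 0 &&
           ex_align_up(start, EX_SUBSECTION_SIZE) == start &&
           ex_align_up(size, EX_SUBSECTION_SIZE) == size;
}
```

A driver-side caller could use such a helper to validate or round a
device range before asking for it to be mapped. A 1MB range, for
instance, would have to be rounded up to 2MB.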