Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>

=============
What is NUMA?
=============

This question can be answered from a couple of perspectives: the
hardware view and the Linux software view.

From the hardware perspective, a NUMA system is a computer platform that
comprises multiple components or assemblies, each of which may contain 0
or more CPUs, local memory, and/or IO buses.  For brevity and to
disambiguate the hardware view of these physical components/assemblies
from the software abstraction thereof, we'll call the components/assemblies
'cells' in this document.

Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset
of the system--although some components necessary for a stand-alone SMP system
may not be populated on any given cell.  The cells of the NUMA system are
connected together with some sort of system interconnect--e.g., a crossbar or
point-to-point link are common types of NUMA system interconnects.  Both of
these types of interconnects can be aggregated to create NUMA platforms with
cells at multiple distances from other cells.

For Linux, the NUMA platforms of interest are primarily what is known as Cache
Coherent NUMA or ccNUMA systems.  With ccNUMA systems, all memory is visible
to and accessible from any CPU attached to any cell, and cache coherency
is handled in hardware by the processor caches and/or the system interconnect.

Memory access time and effective memory bandwidth vary depending on how far
away the cell containing the CPU or IO bus making the memory access is from the
cell containing the target memory.  For example, access to memory by CPUs
attached to the same cell will experience faster access times and higher
bandwidths than accesses to memory on other, remote cells.  NUMA platforms
can have cells at multiple remote distances from any given cell.

Platform vendors don't build NUMA systems just to make software developers'
lives interesting.  Rather, this architecture is a means to provide scalable
memory bandwidth.  However, to achieve scalable memory bandwidth, system and
application software must arrange for a large majority of the memory references
[cache misses] to be to "local" memory--memory on the same cell, if any--or
to the closest cell with memory.

This leads to the Linux software view of a NUMA system:

Linux divides the system's hardware resources into multiple software
abstractions called "nodes".  Linux maps the nodes onto the physical cells
of the hardware platform, abstracting away some of the details for some
architectures.  As with physical cells, software nodes may contain 0 or more
CPUs, memory and/or IO buses.  And, again, memory accesses to memory on
"closer" nodes--nodes that map to closer cells--will generally experience
faster access times and higher effective bandwidths than accesses to more
remote cells.

For some architectures, such as x86, Linux will "hide" any node representing a
physical cell that has no memory attached, and reassign any CPUs attached to
that cell to a node representing a cell that does have memory.  Thus, on
these architectures, one cannot assume that all CPUs that Linux associates with
a given node will see the same local memory access times and bandwidth.

In addition, for some architectures, again x86 is an example, Linux supports
the emulation of additional nodes.  For NUMA emulation, Linux will carve up
the existing nodes--or the system memory for non-NUMA platforms--into multiple
nodes.  Each emulated node will manage a fraction of the underlying cells'
physical memory.  NUMA emulation is useful for testing NUMA kernel and
application features on non-NUMA platforms, and as a sort of memory resource
management mechanism when used together with cpusets.
[see Documentation/admin-guide/cgroup-v1/cpusets.rst]

For each node with memory, Linux constructs an independent memory management
subsystem, complete with its own free page lists, in-use page lists, usage
statistics and locks to mediate access.  In addition, Linux constructs for
each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE]
an ordered "zonelist".  A zonelist specifies the zones/nodes to visit when a
selected zone/node cannot satisfy the allocation request.  This situation,
when a zone has no available memory to satisfy a request, is called
"overflow" or "fallback".

Because some nodes contain multiple zones containing different types of
memory, Linux must decide whether to order the zonelists such that allocations
fall back to the same zone type on a different node, or to a different zone
type on the same node.  This is an important consideration because some zones,
such as DMA or DMA32, represent relatively scarce resources.  Linux chooses
a default Node ordered zonelist.  This means it tries to fall back to other
zones from the same node before using remote nodes, which are ordered by NUMA
distance.
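The NUMA distances that order these remote nodes can be inspected from
userspace through libnuma.  The following minimal sketch is a hypothetical
example, not part of the kernel sources; it assumes the libnuma development
headers are installed and is built with ``cc -o distances distances.c -lnuma``.
It prints the node-to-node distance matrix via numa_distance(3)::

    /* distances.c - hypothetical example, not part of the kernel sources.
     * Prints the node-to-node distance matrix that governs how remote
     * nodes are ordered in the zonelists. */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA API not supported on this system\n");
            return 1;
        }

        int maxnode = numa_max_node();

        /* By convention, a node's distance to itself is 10; larger values
         * mean proportionally slower access.  numa_distance() returns 0
         * for node numbers that do not correspond to an existing node. */
        for (int from = 0; from <= maxnode; from++) {
            for (int to = 0; to <= maxnode; to++)
                printf("%4d", numa_distance(from, to));
            putchar('\n');
        }
        return 0;
    }

The same matrix is reported in the "node distances" section of the output of
``numactl --hardware``.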
By default, Linux will attempt to satisfy memory allocation requests from the
node to which the CPU that executes the request is assigned.  Specifically,
Linux will attempt to allocate from the first node in the appropriate zonelist
for the node where the request originates.  This is called "local allocation."
If the "local" node cannot satisfy the request, the kernel will examine other
nodes' zones in the selected zonelist, looking for the first zone in the list
that can satisfy the request.

Local allocation will tend to keep subsequent accesses to the allocated memory
"local" to the underlying physical resources and off the system interconnect--
as long as the task on whose behalf the kernel allocated some memory does not
later migrate away from that memory.  The Linux scheduler is aware of the
NUMA topology of the platform--embodied in the "scheduling domains" data
structures [see Documentation/scheduler/sched-domains.rst]--and the scheduler
attempts to minimize task migration to distant scheduling domains.  However,
the scheduler does not take a task's NUMA footprint into account directly.
Thus, under sufficient imbalance, tasks can migrate between nodes, remote
from their initial node and kernel data structures.

System administrators and application designers can restrict a task's migration
to improve NUMA locality using various CPU affinity command line interfaces,
such as taskset(1) and numactl(1), and program interfaces such as
sched_setaffinity(2).  Further, one can modify the kernel's default local
allocation behavior using Linux NUMA memory policy.
[see Documentation/admin-guide/mm/numa_memory_policy.rst]

System administrators can restrict the CPUs and nodes' memories that a
non-privileged user can specify in the scheduling or NUMA commands and
functions using control groups and CPUsets.
[see Documentation/admin-guide/cgroup-v1/cpusets.rst]

On architectures that do not hide memoryless nodes, Linux will include only
zones [nodes] with memory in the zonelists.  This means that for a memoryless
node the "local memory node"--the node of the first zone in the CPU's node's
zonelist--will not be the node itself.  Rather, it will be the node that the
kernel selected as the nearest node with memory when it built the zonelists.
So, by default, local allocations will succeed with the kernel supplying the
closest available memory.  This is a consequence of the same mechanism that
allows such allocations to fall back to other nearby nodes when a node that
does contain memory overflows.

Some kernel allocations do not want or cannot tolerate this allocation
fallback behavior.  Rather, they want to be sure they get memory from a
specified node or get notified that the node has no free memory.  This is
usually the case when a subsystem allocates per-CPU memory resources, for
example.

A typical model for making such an allocation is to obtain the node id of the
node to which the "current CPU" is attached using one of the kernel's
numa_node_id() or CPU_to_node() functions and then request memory from only
the node id returned.  When such an allocation fails, the requesting subsystem
may revert to its own fallback path.  The slab kernel memory allocator is an
example of this.  Or, the subsystem may choose to disable or not to enable
itself on allocation failure.  The kernel profiling subsystem is an example of
this.

If the architecture supports--does not hide--memoryless nodes, then CPUs
attached to memoryless nodes would always incur the fallback path overhead
or some subsystems would fail to initialize if they attempted to allocate
memory exclusively from a node without memory.  To support such
architectures transparently, kernel subsystems can use the numa_mem_id()
or cpu_to_mem() function to locate the "local memory node" for the calling or
specified CPU.  Again, this is the same node from which default, "local" page
allocations will be attempted.
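As an illustration of this allocation model, here is a minimal, hypothetical
module sketch, not from the kernel tree.  It looks up the calling CPU's local
memory node with cpu_to_mem(), requests memory from that node only, and
disables itself on failure, much as the profiling example above does::

    /* numa_alloc_demo.c - hypothetical example module, not in the kernel
     * tree.  Demonstrates the per-CPU node-specific allocation model. */
    #include <linux/module.h>
    #include <linux/smp.h>
    #include <linux/slab.h>
    #include <linux/topology.h>

    static void *local_buf;

    static int __init numa_alloc_demo_init(void)
    {
        int cpu = get_cpu();        /* pin to a CPU while querying topology */
        int nid = cpu_to_mem(cpu);  /* nearest node *with* memory, so this
                                     * also works on architectures that do
                                     * not hide memoryless nodes */
        put_cpu();

        /* __GFP_THISNODE suppresses the zonelist fallback, so the request
         * is satisfied from node 'nid' or not at all. */
        local_buf = kmalloc_node(1024, GFP_KERNEL | __GFP_THISNODE, nid);
        if (!local_buf) {
            pr_warn("node %d has no free memory; disabling\n", nid);
            return -ENOMEM;
        }
        pr_info("CPU %d allocated from local memory node %d\n", cpu, nid);
        return 0;
    }

    static void __exit numa_alloc_demo_exit(void)
    {
        kfree(local_buf);
    }

    module_init(numa_alloc_demo_init);
    module_exit(numa_alloc_demo_exit);
    MODULE_LICENSE("GPL");

A real subsystem, such as the slab allocator mentioned above, would typically
take its own fallback path on failure rather than refusing to load.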