===============================
LIBNVDIMM: Non-Volatile Devices
===============================

libnvdimm - kernel / libndctl - userspace helper library

nvdimm@lists.linux.dev

Version 13

.. contents:

        Glossary
        Overview
            Supporting Documents
            Git Trees
        LIBNVDIMM PMEM
            PMEM-REGIONs, Atomic Sectors, and DAX
        Example NVDIMM Platform
        LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
            LIBNDCTL: Context
                libndctl: instantiate a new library context example
            LIBNVDIMM/LIBNDCTL: Bus
                libnvdimm: control class device in /sys/class
                libnvdimm: bus
                libndctl: bus enumeration example
            LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
                libnvdimm: DIMM (NMEM)
                libndctl: DIMM enumeration example
            LIBNVDIMM/LIBNDCTL: Region
                libnvdimm: region
                libndctl: region enumeration example
                Why Not Encode the Region Type into the Region Name?
                How Do I Determine the Major Type of a Region?
            LIBNVDIMM/LIBNDCTL: Namespace
                libnvdimm: namespace
                libndctl: namespace enumeration example
                libndctl: namespace creation example
                Why the Term "namespace"?
            LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
                libnvdimm: btt layout
                libndctl: btt creation example
        Summary LIBNDCTL Diagram


Glossary
========

PMEM:
  A system-physical-address range where writes are persistent. A
  block device composed of PMEM is capable of DAX. A PMEM address range
  may span an interleave of several DIMMs.

DPA:
  DIMM Physical Address: a DIMM-relative offset. With one DIMM in
  the system there would be a 1:1 system-physical-address:DPA association.
  Once more DIMMs are added, a memory controller interleave must be
  decoded to determine the DPA associated with a given
  system-physical-address.

DAX:
  File system extensions to bypass the page cache and block layer to
  mmap persistent memory, from a PMEM block device, directly into a
  process address space.

DSM:
  Device Specific Method: ACPI method to control a specific
  device, in this case the firmware.

DCR:
  NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5.
  It defines a vendor-id, device-id, and interface format for a given DIMM.

BTT:
  Block Translation Table: Persistent memory is byte addressable.
  Existing software may have an expectation that the power-fail-atomicity
  of writes is at least one sector, 512 bytes. The BTT is an indirection
  table with atomic update semantics to front a PMEM block device
  driver and present arbitrary atomic sector sizes.

LABEL:
  Metadata stored on a DIMM device that partitions and identifies
  (persistently names) capacity allocated to different PMEM namespaces. It
  also indicates whether an address abstraction like a BTT is applied to
  the namespace. Note that traditional partition tables, GPT/MBR, are
  layered on top of a PMEM namespace, or an address abstraction like BTT
  if present, but partition support is deprecated going forward.

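
To make the BTT entry above concrete, the following is a minimal in-memory sketch of an indirection table that makes sector writes appear atomic. It is not the on-media BTT layout used by the kernel: the names ``toy_btt``, ``toy_btt_init``, ``toy_btt_write`` and ``toy_btt_read``, the sector count, and the single spare block are all assumptions chosen for illustration. A write lands in a spare physical block first and only then is the map entry switched, so a reader sees either the old sector or the new one, never a mix.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_SECTOR_SIZE 512   /* illustrative sector size */
#define TOY_NLBA        4     /* logical sectors exposed to the user */

/* Hypothetical toy model, not the kernel's on-media BTT format. */
struct toy_btt {
	uint8_t  media[TOY_NLBA + 1][TOY_SECTOR_SIZE]; /* one extra spare block */
	uint32_t map[TOY_NLBA];   /* logical sector -> physical block */
	uint32_t free_block;      /* physical block currently unused */
};

static void toy_btt_init(struct toy_btt *btt)
{
	memset(btt, 0, sizeof(*btt));
	for (uint32_t lba = 0; lba < TOY_NLBA; lba++)
		btt->map[lba] = lba;     /* identity mapping to start */
	btt->free_block = TOY_NLBA;      /* last block is the spare */
}

static void toy_btt_write(struct toy_btt *btt, uint32_t lba, const void *buf)
{
	assert(lba < TOY_NLBA);
	uint32_t new_block = btt->free_block;
	uint32_t old_block = btt->map[lba];

	/* 1. Fill the spare block; readers still see the old data. */
	memcpy(btt->media[new_block], buf, TOY_SECTOR_SIZE);
	/* 2. Switch the map entry; this single update is the commit point. */
	btt->map[lba] = new_block;
	/* 3. The previously mapped block becomes the new spare. */
	btt->free_block = old_block;
}

static void toy_btt_read(const struct toy_btt *btt, uint32_t lba, void *buf)
{
	assert(lba < TOY_NLBA);
	memcpy(buf, btt->media[btt->map[lba]], TOY_SECTOR_SIZE);
}
```

The real design, described in the BTT document referenced later in this file, also has to make the map update itself recoverable after power loss, which this sketch does not model.
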


Overview
========

The LIBNVDIMM subsystem provides support for PMEM described by platform
firmware or a device driver. On ACPI based systems the platform firmware
conveys persistent memory resources via the ACPI NFIT "NVDIMM Firmware
Interface Table" in ACPI 6. While the LIBNVDIMM subsystem implementation
is generic and supports pre-NFIT platforms, it was guided by the
superset of capabilities needed to support this ACPI 6 definition for
NVDIMM resources. The original implementation supported the
block-window-aperture capability described in the NFIT, but that support
has since been abandoned and never shipped in a product.

Supporting Documents
--------------------

ACPI 6:
        https://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
NVDIMM Namespace:
        https://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
DSM Interface Example:
        https://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
Driver Writer's Guide:
        https://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf

Git Trees
---------

LIBNVDIMM:
        https://git.kernel.org/cgit/linux/kernel/git/nvdimm/nvdimm.git
LIBNDCTL:
        https://github.com/pmem/ndctl.git


LIBNVDIMM PMEM
==============

Prior to the arrival of the NFIT, non-volatile memory was described to a
system in various ad-hoc ways. Usually only the bare minimum was
provided, namely, a single system-physical-address range where writes
are expected to be durable after a system power loss. Now, the NFIT
specification standardizes not only the description of PMEM, but also
platform message-passing entry points for control and configuration.

PMEM (nd_pmem.ko): Drives a system-physical-address range.
This range is contiguous in system memory and may be interleaved (hardware
memory controller striped) across multiple DIMMs. When interleaved the
platform may optionally provide details of which DIMMs are participating in
the interleave.

It is worth noting that when the labeling capability is detected (an EFI
namespace label index block is found), then no block device is created
by default as userspace needs to do at least one allocation of DPA to
the PMEM range. In contrast ND_NAMESPACE_IO ranges, once registered,
can be immediately attached to nd_pmem. This latter mode is called
label-less or "legacy".

PMEM-REGIONs, Atomic Sectors, and DAX
-------------------------------------

For the cases where an application or filesystem still needs atomic sector
update guarantees it can register a BTT on a PMEM device or partition.
See LIBNVDIMM/LIBNDCTL: Block Translation Table "btt".


Example NVDIMM Platform
=======================

For the remainder of this document the following diagram will be
referenced for any example sysfs layouts::


                               (a)               (b)           DIMM
            +-------------------+--------+--------+--------+
  +------+  |       pm0.0       |  free  | pm1.0  |  free  |    0
  | imc0 +--+- - - region0- - - +--------+        +--------+
  +--+---+  |       pm0.0       |  free  | pm1.0  |  free  |    1
     |      +-------------------+--------v        v--------+
  +--+---+                               |                 |
  | cpu0 |                                     region1
  +--+---+                               |                 |
     |      +----------------------------^        ^--------+
  +--+---+  |           free             | pm1.0  |  free  |    2
  | imc1 +--+----------------------------|        +--------+
  +------+  |           free             | pm1.0  |  free  |    3
            +----------------------------+--------+--------+

In this platform we have four DIMMs and two memory controllers in one
socket. Each PMEM interleave set is identified by a region device with
a dynamically assigned id.

    1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0.
       A single PMEM namespace is created in the REGION0-SPA-range that spans
       most of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of
       that interleaved system-physical-address range is left free for
       another PMEM namespace to be defined.

    2. In the last portion of DIMM0 and DIMM1 we have an interleaved
       system-physical-address range, REGION1, that spans those two DIMMs as
       well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace
       named "pm1.0".

    This bus is provided by the kernel under the device
    /sys/devices/platform/nfit_test.0 when the nfit_test.ko module from
    tools/testing/nvdimm is loaded. This module is a unit test for
    LIBNVDIMM and the acpi_nfit.ko driver.


LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
========================================================

What follows is a description of the LIBNVDIMM sysfs layout and a
corresponding object hierarchy diagram as viewed through the LIBNDCTL
API.
The example sysfs paths and diagrams are relative to the Example
NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
test.

LIBNDCTL: Context
-----------------

Every API call in the LIBNDCTL library requires a context that holds the
logging parameters and other library instance state. The library is
based on the libabc template:

        https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git

LIBNDCTL: instantiate a new library context example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

    struct ndctl_ctx *ctx;

    if (ndctl_new(&ctx) == 0)
        return ctx;
    else
        return NULL;

LIBNVDIMM/LIBNDCTL: Bus
-----------------------

A bus has a 1:1 relationship with an NFIT. The current expectation for
ACPI based systems is that there is only ever one platform-global NFIT.
That said, it is trivial to register multiple NFITs; the specification
does not preclude it. The infrastructure supports multiple buses and
we use this capability to test multiple NFIT configurations in the unit
test.

LIBNVDIMM: control class device in /sys/class
---------------------------------------------

This character device accepts DSM messages to be passed to a DIMM
identified by its NFIT handle::

    /sys/class/nd/ndctl0
    |-- dev
    |-- device -> ../../../ndbus0
    |-- subsystem -> ../../../../../../../class/nd



LIBNVDIMM: bus
--------------

::

    struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
           struct nvdimm_bus_descriptor *nfit_desc);

::

    /sys/devices/platform/nfit_test.0/ndbus0
    |-- commands
    |-- nd
    |-- nfit
    |-- nmem0
    |-- nmem1
    |-- nmem2
    |-- nmem3
    |-- power
    |-- provider
    |-- region0
    |-- region1
    |-- region2
    |-- region3
    |-- region4
    |-- region5
    |-- uevent
    `-- wait_probe

LIBNDCTL: bus enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Find the bus handle that describes the bus from Example NVDIMM Platform::

    #include <string.h>
    #include <ndctl/libndctl.h>

    static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
            const char *provider)
    {
        struct ndctl_bus *bus;

        /* walk every bus known to this context and match on provider */
        ndctl_bus_foreach(ctx, bus)
            if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0)
                return bus;

        return NULL;
    }

    bus = get_bus_by_provider(ctx, "nfit_test.0");


LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
-------------------------------

The DIMM device provides a character device for sending commands to
hardware, and it is a container for LABELs. If the DIMM is defined by
NFIT then an optional 'nfit' attribute sub-directory is available to add
NFIT-specifics.

Note that the kernel device name for "DIMMs" is "nmemX". The NFIT
describes these devices via "Memory Device to System Physical Address
Range Mapping Structure", and there is no requirement that they actually
be physical DIMMs, so we use a more generic name.

LIBNVDIMM: DIMM (NMEM)
^^^^^^^^^^^^^^^^^^^^^^

::

    struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
            const struct attribute_group **groups, unsigned long flags,
            unsigned long *dsm_mask);

::

    /sys/devices/platform/nfit_test.0/ndbus0
    |-- nmem0
    |   |-- available_slots
    |   |-- commands
    |   |-- dev
    |   |-- devtype
    |   |-- driver -> ../../../../../bus/nd/drivers/nvdimm
    |   |-- modalias
    |   |-- nfit
    |   |   |-- device
    |   |   |-- format
    |   |   |-- handle
    |   |   |-- phys_id
    |   |   |-- rev_id
    |   |   |-- serial
    |   |   `-- vendor
    |   |-- state
    |   |-- subsystem -> ../../../../../bus/nd
    |   `-- uevent
    |-- nmem1
    [..]

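
The 'handle' attribute under nmemX/nfit can be read like any other sysfs file and split into its fields. The sketch below assumes the attribute text is a single integer (hexadecimal or decimal) and uses the nfit_handle bit layout listed in the DIMM enumeration example that follows. ``struct nfit_handle_fields`` and ``parse_nfit_handle()`` are hypothetical helper names for illustration, not part of libndctl.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical container for the decoded fields of an nfit_handle. */
struct nfit_handle_fields {
	unsigned int node;       /* bits 27:16, node controller ID */
	unsigned int socket;     /* bits 15:12, socket ID */
	unsigned int imc;        /* bits 11:8, memory controller ID */
	unsigned int channel;    /* bits 7:4, memory channel number */
	unsigned int dimm;       /* bits 3:0, DIMM number within the channel */
};

/*
 * Parse the text of a "handle" attribute, e.g. "0x101\n", and decode it.
 * Returns 0 on success, -1 if the text is not a valid 32-bit number.
 */
static int parse_nfit_handle(const char *text, struct nfit_handle_fields *out)
{
	char *end;
	unsigned long value;

	assert(text && out);
	errno = 0;
	value = strtoul(text, &end, 0);   /* base 0 accepts "0x..." and decimal */
	if (errno || end == text || value > 0xffffffffUL)
		return -1;

	out->node    = (value >> 16) & 0xfff;
	out->socket  = (value >> 12) & 0xf;
	out->imc     = (value >> 8) & 0xf;
	out->channel = (value >> 4) & 0xf;
	out->dimm    = value & 0xf;
	return 0;
}
```

Passing the contents of nmem0/nfit/handle to ``parse_nfit_handle()`` yields the same fields that the ``DIMM_HANDLE()`` macro in the next section packs together.
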


LIBNDCTL: DIMM enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note, in this example we are assuming NFIT-defined DIMMs which are
identified by an "nfit_handle", a 32-bit value where:

   - Bit 3:0 DIMM number within the memory channel
   - Bit 7:4 memory channel number
   - Bit 11:8 memory controller ID
   - Bit 15:12 socket ID (within scope of a Node controller if node
     controller is present)
   - Bit 27:16 Node Controller ID
   - Bit 31:28 Reserved

::

    static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
            unsigned int handle)
    {
        struct ndctl_dimm *dimm;

        ndctl_dimm_foreach(bus, dimm)
            if (ndctl_dimm_get_handle(dimm) == handle)
                return dimm;

        return NULL;
    }

    /* arguments are parenthesized so expressions can be passed safely */
    #define DIMM_HANDLE(n, s, i, c, d) \
        ((((n) & 0xfff) << 16) | (((s) & 0xf) << 12) | (((i) & 0xf) << 8) \
         | (((c) & 0xf) << 4) | ((d) & 0xf))

    dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));

LIBNVDIMM/LIBNDCTL: Region
--------------------------

A generic REGION device is registered for each PMEM interleave-set /
range. Per the example there are 2 PMEM regions on the "nfit_test.0"
bus. The primary role of regions is to be a container of "mappings". A
mapping is a tuple of <DIMM, DPA-start-offset, length>.

LIBNVDIMM provides a built-in driver for REGION devices. This driver
is responsible for parsing all LABELs, if present, and then emitting NAMESPACE
devices for the nd_pmem driver to consume.

In addition to the generic attributes of "mapping"s, "interleave_ways"
and "size" the REGION device also exports some convenience attributes.
"nstype" indicates the integer type of namespace-device this region
emits, "devtype" duplicates the DEVTYPE variable stored by udev at the
'add' event, "modalias" duplicates the MODALIAS variable stored by udev
at the 'add' event, and finally, the optional "spa_index" is provided in
the case where the region is defined by a SPA.
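
To make the relationship between a region offset, its mappings, and DPA concrete, the sketch below decodes an offset into a <DIMM, DPA> pair. It assumes a plain round-robin interleave with a fixed granularity, which is a simplification: real memory controllers may use other decode schemes, and the actual layout comes from the platform. ``struct toy_mapping``, ``struct toy_dpa``, ``toy_region_decode()`` and the granularity parameter are illustrative assumptions, not kernel or libndctl interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the <DIMM, DPA-start-offset, length> tuple. */
struct toy_mapping {
	unsigned int dimm;     /* index of the DIMM (nmemX) backing this mapping */
	uint64_t dpa_start;    /* DPA where this region's capacity starts */
	uint64_t length;       /* bytes contributed by this DIMM */
};

struct toy_dpa {
	unsigned int dimm;
	uint64_t dpa;
};

/*
 * Decode a byte offset within a region into the DIMM and DPA that back it,
 * assuming round-robin striping of 'granularity' bytes across 'ways' mappings.
 * Returns 0 on success, -1 if the offset falls outside the region.
 */
static int toy_region_decode(const struct toy_mapping *maps, unsigned int ways,
			     uint64_t granularity, uint64_t offset,
			     struct toy_dpa *out)
{
	assert(maps && out && ways > 0 && granularity > 0);

	uint64_t stripe = offset / granularity;        /* which chunk overall */
	unsigned int way = stripe % ways;              /* which mapping gets it */
	uint64_t dimm_off = (stripe / ways) * granularity + offset % granularity;

	if (dimm_off >= maps[way].length)
		return -1;

	out->dimm = maps[way].dimm;
	out->dpa = maps[way].dpa_start + dimm_off;
	return 0;
}
```

With two mappings that both start at DPA 0x1000 and a 4096-byte granularity, region offset 4112 falls in the second stripe and therefore decodes to the second DIMM at DPA 0x1010.
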

LIBNVDIMM: region::

    struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
            struct nd_region_desc *ndr_desc);

::

    /sys/devices/platform/nfit_test.0/ndbus0
    |-- region0
    |   |-- available_size
    |   |-- btt0
    |   |-- btt_seed
    |   |-- devtype
    |   |-- driver -> ../../../../../bus/nd/drivers/nd_region
    |   |-- init_namespaces
    |   |-- mapping0
    |   |-- mapping1
    |   |-- mappings
    |   |-- modalias
    |   |-- namespace0.0
    |   |-- namespace_seed
    |   |-- numa_node
    |   |-- nfit
    |   |   `-- spa_index
    |   |-- nstype
    |   |-- set_cookie
    |   |-- size
    |   |-- subsystem -> ../../../../../bus/nd
    |   `-- uevent
    |-- region1
    [..]

LIBNDCTL: region enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sample region retrieval routines based on NFIT-unique data like
"spa_index" (interleave set id).

::

    static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
            unsigned int spa_index)
    {
        struct ndctl_region *region;

        ndctl_region_foreach(bus, region) {
            if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
                continue;
            if (ndctl_region_get_spa_index(region) == spa_index)
                return region;
        }
        return NULL;
    }


LIBNVDIMM/LIBNDCTL: Namespace
-----------------------------

A REGION, after resolving DPA aliasing and LABEL specified boundaries, surfaces
one or more "namespace" devices. The arrival of a "namespace" device currently
triggers the nd_pmem driver to load and register a disk/block device.

LIBNVDIMM: namespace
^^^^^^^^^^^^^^^^^^^^

Here is a sample layout from the 2 major types of NAMESPACE where namespace0.0
represents DIMM-info-backed PMEM (note that it has a 'uuid' attribute), and
namespace1.0 represents an anonymous PMEM namespace (note that it has no 'uuid'
attribute because it does not support a LABEL)

::

    /sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
    |-- alt_name
    |-- devtype
    |-- dpa_extents
    |-- force_raw
    |-- modalias
    |-- numa_node
    |-- resource
    |-- size
    |-- subsystem -> ../../../../../../bus/nd
    |-- type
    |-- uevent
    `-- uuid
    /sys/devices/platform/nfit_test.1/ndbus1/region1/namespace1.0
    |-- block
    |   `-- pmem0
    |-- devtype
    |-- driver -> ../../../../../../bus/nd/drivers/pmem
    |-- force_raw
    |-- modalias
    |-- numa_node
    |-- resource
    |-- size
    |-- subsystem -> ../../../../../../bus/nd
    |-- type
    `-- uevent

LIBNDCTL: namespace enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


Namespaces are indexed relative to their parent region, as in the example
below. These indexes are mostly static from boot to boot, but the subsystem
makes no guarantees in this regard. For a static namespace identifier use its
'uuid' attribute.

::

    static struct ndctl_namespace
    *get_namespace_by_id(struct ndctl_region *region, unsigned int id)
    {
        struct ndctl_namespace *ndns;

        ndctl_namespace_foreach(region, ndns)
            if (ndctl_namespace_get_id(ndns) == id)
                return ndns;

        return NULL;
    }

LIBNDCTL: namespace creation example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Idle namespaces are automatically created by the kernel if a given
region has enough available capacity to create a new namespace.
Namespace instantiation involves finding an idle namespace and
configuring it. For the most part the setting of namespace attributes
can occur in any order; the only constraint is that 'uuid' must be set
before 'size'.
This enables the kernel to track DPA allocations
internally with a static identifier::

    static int configure_namespace(struct ndctl_region *region,
            struct ndctl_namespace *ndns,
            struct namespace_parameters *parameters)
    {
        char devname[50];

        snprintf(devname, sizeof(devname), "namespace%d.%d",
                ndctl_region_get_id(region), parameters->id);

        ndctl_namespace_set_alt_name(ndns, devname);
        /* 'uuid' must be set prior to setting size! */
        ndctl_namespace_set_uuid(ndns, parameters->uuid);
        ndctl_namespace_set_size(ndns, parameters->size);
        /* unlike pmem namespaces, blk namespaces have a sector size */
        if (parameters->lbasize)
            ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
        /* report the result of enabling the namespace to the caller */
        return ndctl_namespace_enable(ndns);
    }


Why the Term "namespace"?
^^^^^^^^^^^^^^^^^^^^^^^^^

    1. Why not "volume" for instance? "volume" ran the risk of confusing
       ND (libnvdimm subsystem) with a volume manager like device-mapper.

    2. The term originated to describe the sub-devices that can be created
       within an NVME controller (see the nvme specification:
       https://www.nvmexpress.org/specifications/), and NFIT namespaces are
       meant to parallel the capabilities and configurability of
       NVME-namespaces.


LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
-------------------------------------------------

A BTT (design document: https://pmem.io/2014/09/23/btt.html) is a
personality driver for a namespace that fronts the entire namespace as an
'address abstraction'.

LIBNVDIMM: btt layout
^^^^^^^^^^^^^^^^^^^^^

Every region will start out with at least one BTT device which is the
seed device.
To activate it set the "namespace", "uuid", and 576 "sector_size" attributes and then bind the dev 576 "sector_size" attributes and then bind the device to the nd_pmem or 577 nd_blk driver depending on the region type:: 577 nd_blk driver depending on the region type:: 578 578 579 /sys/devices/platform/nfit_test.1/ndbu 579 /sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/ 580 |-- namespace 580 |-- namespace 581 |-- delete 581 |-- delete 582 |-- devtype 582 |-- devtype 583 |-- modalias 583 |-- modalias 584 |-- numa_node 584 |-- numa_node 585 |-- sector_size 585 |-- sector_size 586 |-- subsystem -> ../../../../../bus/nd 586 |-- subsystem -> ../../../../../bus/nd 587 |-- uevent 587 |-- uevent 588 `-- uuid 588 `-- uuid 589 589 590 LIBNDCTL: btt creation example 590 LIBNDCTL: btt creation example 591 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 591 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 592 592 593 Similar to namespaces an idle BTT device is au 593 Similar to namespaces an idle BTT device is automatically created per 594 region. Each time this "seed" btt device is c 594 region. Each time this "seed" btt device is configured and enabled a new 595 seed is created. Creating a BTT configuration 595 seed is created. Creating a BTT configuration involves two steps of 596 finding and idle BTT and assigning it to consu 596 finding and idle BTT and assigning it to consume a namespace. 
597 597 598 :: 598 :: 599 599 600 static struct ndctl_btt *get_idle_btt( 600 static struct ndctl_btt *get_idle_btt(struct ndctl_region *region) 601 { 601 { 602 struct ndctl_btt *btt; 602 struct ndctl_btt *btt; 603 603 604 ndctl_btt_foreach(region, btt) 604 ndctl_btt_foreach(region, btt) 605 if (!ndctl_btt_is_enab 605 if (!ndctl_btt_is_enabled(btt) 606 && !nd 606 && !ndctl_btt_is_configured(btt)) 607 return btt; 607 return btt; 608 608 609 return NULL; 609 return NULL; 610 } 610 } 611 611 612 static int configure_btt(struct ndctl_ 612 static int configure_btt(struct ndctl_region *region, 613 struct btt_parameters 613 struct btt_parameters *parameters) 614 { 614 { 615 btt = get_idle_btt(region); 615 btt = get_idle_btt(region); 616 616 617 ndctl_btt_set_uuid(btt, parame 617 ndctl_btt_set_uuid(btt, parameters->uuid); 618 ndctl_btt_set_sector_size(btt, 618 ndctl_btt_set_sector_size(btt, parameters->sector_size); 619 ndctl_btt_set_namespace(btt, p 619 ndctl_btt_set_namespace(btt, parameters->ndns); 620 /* turn off raw mode device */ 620 /* turn off raw mode device */ 621 ndctl_namespace_disable(parame 621 ndctl_namespace_disable(parameters->ndns); 622 /* turn on btt access */ 622 /* turn on btt access */ 623 ndctl_btt_enable(btt); 623 ndctl_btt_enable(btt); 624 } 624 } 625 625 626 Once instantiated a new inactive btt seed devi 626 Once instantiated a new inactive btt seed device will appear underneath 627 the region. 627 the region. 628 628 629 Once a "namespace" is removed from a BTT that 629 Once a "namespace" is removed from a BTT that instance of the BTT device 630 will be deleted or otherwise reset to default 630 will be deleted or otherwise reset to default values. This deletion is 631 only at the device model level. In order to d 631 only at the device model level. In order to destroy a BTT the "info 632 block" needs to be destroyed. Note, that to d 632 block" needs to be destroyed. Note, that to destroy a BTT the media 633 needs to be written in raw mode. 
By default, 633 needs to be written in raw mode. By default, the kernel will autodetect 634 the presence of a BTT and disable raw mode. T 634 the presence of a BTT and disable raw mode. This autodetect behavior 635 can be suppressed by enabling raw mode for the 635 can be suppressed by enabling raw mode for the namespace via the 636 ndctl_namespace_set_raw_mode() API. 636 ndctl_namespace_set_raw_mode() API. 637 637 638 638 639 Summary LIBNDCTL Diagram 639 Summary LIBNDCTL Diagram 640 ------------------------ 640 ------------------------ 641 641 642 For the given example above, here is the view 642 For the given example above, here is the view of the objects as seen by the 643 LIBNDCTL API:: 643 LIBNDCTL API:: 644 644 645 +---+ 645 +---+ 646 |CTX| 646 |CTX| 647 +-+-+ 647 +-+-+ 648 | 648 | 649 +-------+ | 649 +-------+ | 650 | DIMM0 <-+ | +---------+ +-------- 650 | DIMM0 <-+ | +---------+ +--------------+ +---------------+ 651 +-------+ | | +-> REGION0 +---> NAMESPA 651 +-------+ | | +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" | 652 | DIMM1 <-+ +-v--+ | +---------+ +-------- 652 | DIMM1 <-+ +-v--+ | +---------+ +--------------+ +---------------+ 653 +-------+ +-+BUS0+-| +---------+ +-------- 653 +-------+ +-+BUS0+-| +---------+ +--------------+ +----------------------+ 654 | DIMM2 <-+ +----+ +-> REGION1 +---> NAMESPA 654 | DIMM2 <-+ +----+ +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" | BTT1 | 655 +-------+ | | +---------+ +-------- 655 +-------+ | | +---------+ +--------------+ +---------------+------+ 656 | DIMM3 <-+ 656 | DIMM3 <-+ 657 +-------+ 657 +-------+
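
As a rough illustration of the hierarchy drawn above, the following toy
model walks bus -> region -> namespace and builds the region-relative
"namespace<region>.<id>" names used in the earlier examples. It is only a
sketch: every ``toy_*`` type and function is a hypothetical stand-in
invented for this illustration, not part of libndctl, and the real objects
are discovered through the ndctl_*_foreach() iterators rather than static
tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the diagram's objects; not libndctl types. */
struct toy_namespace {
	unsigned int id;	/* index relative to the parent region */
	const char *alt_name;	/* label shown in quotes in the diagram */
	int has_btt;		/* nonzero when a BTT fronts the namespace */
};

struct toy_region {
	unsigned int id;
	const struct toy_namespace *namespaces;
	size_t num_namespaces;
};

struct toy_bus {
	unsigned int id;
	size_t num_dimms;
	const struct toy_region *regions;
	size_t num_regions;
};

/* The layout drawn above: BUS0, four DIMMs, two regions, one BTT. */
static const struct toy_namespace toy_region0_ns[] = { { 0, "pm0.0", 0 } };
static const struct toy_namespace toy_region1_ns[] = { { 0, "pm1.0", 1 } };
static const struct toy_region toy_regions[] = {
	{ 0, toy_region0_ns, 1 },
	{ 1, toy_region1_ns, 1 },
};
static const struct toy_bus toy_bus0 = { 0, 4, toy_regions, 2 };

/* Format "namespace<region>.<id>"; returns 0, or -1 if buf is too small. */
static int toy_namespace_devname(char *buf, size_t len,
		const struct toy_region *region,
		const struct toy_namespace *ndns)
{
	int n;

	assert(buf && region && ndns);
	n = snprintf(buf, len, "namespace%u.%u", region->id, ndns->id);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Walk bus -> region -> namespace, counting namespaces fronted by a BTT. */
static size_t toy_count_btt_namespaces(const struct toy_bus *bus)
{
	size_t i, j, count = 0;

	for (i = 0; i < bus->num_regions; i++)
		for (j = 0; j < bus->regions[i].num_namespaces; j++)
			if (bus->regions[i].namespaces[j].has_btt)
				count++;
	return count;
}
```

For instance, passing the second region and its only namespace to
toy_namespace_devname() fills the buffer with "namespace1.0", matching
NAMESPACE1.0 in the diagram, and toy_count_btt_namespaces() reports the
single BTT attached under REGION1.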