===============================
LIBNVDIMM: Non-Volatile Devices
===============================

libnvdimm - kernel / libndctl - userspace help

nvdimm@lists.linux.dev

Version 13

.. contents:

        Glossary
        Overview
            Supporting Documents
            Git Trees
        LIBNVDIMM PMEM
            PMEM-REGIONs, Atomic Sectors, and
        Example NVDIMM Platform
        LIBNVDIMM Kernel Device Model and LIBN
            LIBNDCTL: Context
                libndctl: instantiate a new li
            LIBNVDIMM/LIBNDCTL: Bus
                libnvdimm: control class devic
                libnvdimm: bus
                libndctl: bus enumeration exam
            LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
                libnvdimm: DIMM (NMEM)
                libndctl: DIMM enumeration exa
            LIBNVDIMM/LIBNDCTL: Region
                libnvdimm: region
                libndctl: region enumeration e
                Why Not Encode the Region Type
                How Do I Determine the Major T
            LIBNVDIMM/LIBNDCTL: Namespace
                libnvdimm: namespace
                libndctl: namespace enumeratio
                libndctl: namespace creation e
                Why the Term "namespace"?
            LIBNVDIMM/LIBNDCTL: Block Translat
                libnvdimm: btt layout
                libndctl: btt creation example
            Summary LIBNDCTL Diagram


Glossary
========

PMEM:
  A system-physical-address range where writes
  block device composed of PMEM is capable of
  may span an interleave of several DIMMs.

DPA:
  DIMM Physical Address, is a DIMM-relative of
  the system there would be a 1:1 system-physi
  Once more DIMMs are added a memory controlle
  decoded to determine the DPA associated with
  system-physical-address.

DAX:
  File system extensions to bypass the page ca
  mmap persistent memory, from a PMEM block de
  process address space.

DSM:
  Device Specific Method: ACPI method to contr
  device - in this case the firmware.

DCR:
  NVDIMM Control Region Structure defined in A
  It defines a vendor-id, device-id, and inter

BTT:
  Block Translation Table: Persistent memory i
  Existing software may have an expectation th
  of writes is at least one sector, 512 bytes.
  table with atomic update semantics to front
  driver and present arbitrary atomic sector s

LABEL:
  Metadata stored on a DIMM device that partit
  (persistently names) capacity allocated to d
  also indicates whether an address abstractio
  the namespace. Note that traditional partit
  layered on top of a PMEM namespace, or an ad
  if present, but partition support is depreca


Overview
========

The LIBNVDIMM subsystem provides support for P
firmware or a device driver. On ACPI based sys
conveys persistent memory resource via the ACP
Interface Table" in ACPI 6. While the LIBNVDIM
is generic and supports pre-NFIT platforms, it
superset of capabilities needed to support this
NVDIMM resources. The original implementation
block-window-aperture capability described in
has since been abandoned and never shipped in

Supporting Documents
--------------------

ACPI 6:
        https://www.uefi.org/sites/default/fil
NVDIMM Namespace:
        https://pmem.io/documents/NVDIMM_Names
DSM Interface Example:
        https://pmem.io/documents/NVDIMM_DSM_I
Driver Writer's Guide:
        https://pmem.io/documents/NVDIMM_Drive

Git Trees
---------

LIBNVDIMM:
        https://git.kernel.org/cgit/linux/kern
LIBNDCTL:
        https://github.com/pmem/ndctl.git


LIBNVDIMM PMEM
==============

Prior to the arrival of the NFIT, non-volatile
system in various ad-hoc ways.
Usually only t
provided, namely, a single system-physical-add
are expected to be durable after a system powe
specification standardizes not only the descri
platform message-passing entry points for cont

PMEM (nd_pmem.ko): Drives a system-physical-ad
contiguous in system memory and may be interle
striped) across multiple DIMMs. When interlea
provide details of which DIMMs are participati

It is worth noting that when the labeling capa
namespace label index block is found), then no
by default as userspace needs to do at least o
the PMEM range. In contrast ND_NAMESPACE_IO r
can be immediately attached to nd_pmem. This l
label-less or "legacy".

PMEM-REGIONs, Atomic Sectors, and DAX
-------------------------------------

For the cases where an application or filesyst
update guarantees it can register a BTT on a P
LIBNVDIMM/NDCTL: Block Translation Table "btt"


Example NVDIMM Platform
=======================

For the remainder of this document the followi
referenced for any example sysfs layouts::


                               (a)
            +-------------------+--------+----
  +------+  |       pm0.0       |  free  | pm1
  | imc0 +--+- - - region0- - - +--------+
  +--+---+  |       pm0.0       |  free  | pm1
     |      +-------------------+--------v
  +--+---+                               |
  | cpu0 |
  +--+---+                               |
     |      +----------------------------^
  +--+---+  |           free             | pm1
  | imc1 +--+----------------------------|
  +------+  |           free             | pm1
            +----------------------------+----

In this platform we have four DIMMs and two me
socket. Each PMEM interleave set is identifie
a dynamically assigned id.

    1. The first portion of DIMM0 and DIMM1 ar
       single PMEM namespace is created in the
       of DIMM0 and DIMM1 with a user-specifie
       interleaved system-physical-address ran
       another PMEM namespace to be defined.

    2. In the last portion of DIMM0 and DIMM1
       system-physical-address range, REGION1,
       well as DIMM2 and DIMM3. Some of REGIO
       named "pm1.0".

    This bus is provided by the kernel under t
    /sys/devices/platform/nfit_test.0 when the
    tools/testing/nvdimm is loaded. This modul
    LIBNVDIMM and the acpi_nfit.ko driver.


LIBNVDIMM Kernel Device Model and LIBNDCTL Use
==============================================

What follows is a description of the LIBNVDIMM
corresponding object hierarchy diagram as view
API. The example sysfs paths and diagrams are
NVDIMM Platform which is also the LIBNVDIMM bu
test.

LIBNDCTL: Context
-----------------

Every API call in the LIBNDCTL library require
logging parameters and other library instance
based on the libabc template:

        https://git.kernel.org/cgit/linux/kern

LIBNDCTL: instantiate a new library context ex
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

        struct ndctl_ctx *ctx;

        if (ndctl_new(&ctx) == 0)
                return ctx;
        else
                return NULL;

LIBNVDIMM/LIBNDCTL: Bus
-----------------------

A bus has a 1:1 relationship with an NFIT. Th
ACPI based systems is that there is only ever
That said, it is trivial to register multiple
does not preclude it. The infrastructure supp
we use this capability to test multiple NFIT c
test.

LIBNVDIMM: control class device in /sys/class
---------------------------------------------

This character device accepts DSM messages to
identified by its NFIT handle::

        /sys/class/nd/ndctl0
        |-- dev
        |-- device -> ../../../ndbus0
        |-- subsystem -> ../../../../../../../



LIBNVDIMM: bus
--------------

::

        struct nvdimm_bus *nvdimm_bus_register
               struct nvdimm_bus_descriptor *n

::

        /sys/devices/platform/nfit_test.0/ndbu
        |-- commands
        |-- nd
        |-- nfit
        |-- nmem0
        |-- nmem1
        |-- nmem2
        |-- nmem3
        |-- power
        |-- provider
        |-- region0
        |-- region1
        |-- region2
        |-- region3
        |-- region4
        |-- region5
        |-- uevent
        `-- wait_probe

LIBNDCTL: bus enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Find the bus handle that describes the bus fro

        static struct ndctl_bus *get_bus_by_pr
                        const char *provider)
        {
                struct ndctl_bus *bus;

                ndctl_bus_foreach(ctx, bus)
                        if (strcmp(provider, n
                                return bus;

                return NULL;
        }

        bus = get_bus_by_provider(ctx, "nfit_t


LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
-------------------------------

The DIMM device provides a character device fo
hardware, and it is a container for LABELs. I
NFIT then an optional 'nfit' attribute sub-dir
NFIT-specifics.
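
For NFIT-defined DIMMs the 'nfit' sub-directory carries a "handle"
attribute. As a minimal sketch, assuming the 32-bit nfit_handle bit layout
listed in the DIMM enumeration example of this document (DIMM number in
bits 3:0, memory channel in 7:4, memory controller ID in 11:8, socket ID in
15:12, Node Controller ID in 27:16), the value can be packed and unpacked
as shown here. The struct and function names are hypothetical helpers made
up for illustration; they are not part of libndctl or the kernel.

```c
#include <stdint.h>

/* Hypothetical helper type: one member per nfit_handle bit field. */
struct nfit_handle_fields {
	unsigned int node;    /* bits 27:16, Node Controller ID */
	unsigned int socket;  /* bits 15:12, socket ID */
	unsigned int imc;     /* bits 11:8, memory controller ID */
	unsigned int channel; /* bits 7:4, memory channel number */
	unsigned int dimm;    /* bits 3:0, DIMM number within the channel */
};

/* Build a handle from its fields; bits 31:28 (reserved) stay zero. */
static uint32_t nfit_handle_pack(const struct nfit_handle_fields *f)
{
	return ((uint32_t)(f->node & 0xfff) << 16)
		| ((uint32_t)(f->socket & 0xf) << 12)
		| ((uint32_t)(f->imc & 0xf) << 8)
		| ((uint32_t)(f->channel & 0xf) << 4)
		| (uint32_t)(f->dimm & 0xf);
}

/* Split a handle back into its fields, ignoring the reserved bits. */
static struct nfit_handle_fields nfit_handle_unpack(uint32_t handle)
{
	struct nfit_handle_fields f = {
		.node = (handle >> 16) & 0xfff,
		.socket = (handle >> 12) & 0xf,
		.imc = (handle >> 8) & 0xf,
		.channel = (handle >> 4) & 0xf,
		.dimm = handle & 0xf,
	};

	return f;
}
```

Keeping the fields in a struct rather than five macro arguments makes the
call sites self-describing, at the cost of a few more lines than the
DIMM_HANDLE() macro used elsewhere in this document.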

Note that the kernel device name for "DIMMs" i
describes these devices via "Memory Device to
Range Mapping Structure", and there is no requ
be physical DIMMs, so we use a more generic na

LIBNVDIMM: DIMM (NMEM)
^^^^^^^^^^^^^^^^^^^^^^

::

        struct nvdimm *nvdimm_create(struct nv
                        const struct attribute
                        unsigned long *dsm_mas

::

        /sys/devices/platform/nfit_test.0/ndbu
        |-- nmem0
        |   |-- available_slots
        |   |-- commands
        |   |-- dev
        |   |-- devtype
        |   |-- driver -> ../../../../../bus/n
        |   |-- modalias
        |   |-- nfit
        |   |   |-- device
        |   |   |-- format
        |   |   |-- handle
        |   |   |-- phys_id
        |   |   |-- rev_id
        |   |   |-- serial
        |   |   `-- vendor
        |   |-- state
        |   |-- subsystem -> ../../../../../bu
        |   `-- uevent
        |-- nmem1
        [..]


LIBNDCTL: DIMM enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note, in this example we are assuming NFIT-def
identified by an "nfit_handle" a 32-bit value

   - Bit 3:0 DIMM number within the memory cha
   - Bit 7:4 memory channel number
   - Bit 11:8 memory controller ID
   - Bit 15:12 socket ID (within scope of a No
     controller is present)
   - Bit 27:16 Node Controller ID
   - Bit 31:28 Reserved

::

        static struct ndctl_dimm *get_dimm_by_
                        unsigned int handle)
        {
                struct ndctl_dimm *dimm;

                ndctl_dimm_foreach(bus, dimm)
                        if (ndctl_dimm_get_han
                                return dimm;

                return NULL;
        }

        #define DIMM_HANDLE(n, s, i, c, d) \
                (((n & 0xfff) << 16) | ((s & 0
                 | ((c & 0xf) << 4) | (d & 0xf

        dimm = get_dimm_by_handle(bus, DIMM_HA

LIBNVDIMM/LIBNDCTL: Region
--------------------------

A generic REGION device is registered for each
range. Per the example there are 2 PMEM region
bus.
The primary role of regions is to be a c
mapping is a tuple of <DIMM, DPA-start-offset,

LIBNVDIMM provides a built-in driver for REGIO
is responsible for parsing all LABELs, if pres
devices for the nd_pmem driver to consume.

In addition to the generic attributes of "mapp
and "size" the REGION device also exports some
"nstype" indicates the integer type of namespa
emits, "devtype" duplicates the DEVTYPE variab
'add' event, "modalias" duplicates the MODALIA
at the 'add' event, and finally, the optional
the case where the region is defined by a SPA.

LIBNVDIMM: region::

        struct nd_region *nvdimm_pmem_region_c
                        struct nd_region_desc

::

        /sys/devices/platform/nfit_test.0/ndbu
        |-- region0
        |   |-- available_size
        |   |-- btt0
        |   |-- btt_seed
        |   |-- devtype
        |   |-- driver -> ../../../../../bus/n
        |   |-- init_namespaces
        |   |-- mapping0
        |   |-- mapping1
        |   |-- mappings
        |   |-- modalias
        |   |-- namespace0.0
        |   |-- namespace_seed
        |   |-- numa_node
        |   |-- nfit
        |   |   `-- spa_index
        |   |-- nstype
        |   |-- set_cookie
        |   |-- size
        |   |-- subsystem -> ../../../../../bu
        |   `-- uevent
        |-- region1
        [..]

LIBNDCTL: region enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sample region retrieval routines based on NFIT
"spa_index" (interleave set id).

::

        static struct ndctl_region *get_pmem_r
                        unsigned int spa_index
        {
                struct ndctl_region *region;

                ndctl_region_foreach(bus, regi
                        if (ndctl_region_get_t
                                continue;
                        if (ndctl_region_get_s
                                return region;
                }
                return NULL;
        }


LIBNVDIMM/LIBNDCTL: Namespace
-----------------------------

A REGION, after resolving DPA aliasing and LAB
one or more "namespace" devices.
The arrival
triggers the nd_pmem driver to load and regist

LIBNVDIMM: namespace
^^^^^^^^^^^^^^^^^^^^

Here is a sample layout from the 2 major types
represents DIMM-info-backed PMEM (note that it
namespace1.0 represents an anonymous PMEM name
attribute due to not supporting a LABEL)

::

        /sys/devices/platform/nfit_test.0/ndbu
        |-- alt_name
        |-- devtype
        |-- dpa_extents
        |-- force_raw
        |-- modalias
        |-- numa_node
        |-- resource
        |-- size
        |-- subsystem -> ../../../../../../bus
        |-- type
        |-- uevent
        `-- uuid
        /sys/devices/platform/nfit_test.1/ndbu
        |-- block
        |   `-- pmem0
        |-- devtype
        |-- driver -> ../../../../../../bus/nd
        |-- force_raw
        |-- modalias
        |-- numa_node
        |-- resource
        |-- size
        |-- subsystem -> ../../../../../../bus
        |-- type
        `-- uevent

LIBNDCTL: namespace enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Namespaces are indexed relative to their paren
These indexes are mostly static from boot to b
no guarantees in this regard. For a static na
'uuid' attribute.

::

  static struct ndctl_namespace
  *get_namespace_by_id(struct ndctl_region *re
  {
          struct ndctl_namespace *ndns;

          ndctl_namespace_foreach(region, ndns
                  if (ndctl_namespace_get_id(n
                          return ndns;

          return NULL;
  }

LIBNDCTL: namespace creation example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Idle namespaces are automatically created by t
region has enough available capacity to create
Namespace instantiation involves finding an id
configuring it. For the most part the setting
can occur in any order, the only constraint is
before 'size'.
This enables the kernel to tra
internally with a static identifier::

  static int configure_namespace(struct ndctl_
                  struct ndctl_namespace *ndns
                  struct namespace_parameters
  {
          char devname[50];

          snprintf(devname, sizeof(devname), "
                          ndctl_region_get_id(

          ndctl_namespace_set_alt_name(ndns, d
          /* 'uuid' must be set prior to setti
          ndctl_namespace_set_uuid(ndns, param
          ndctl_namespace_set_size(ndns, param
          /* unlike pmem namespaces, blk names
          if (parameters->lbasize)
                  ndctl_namespace_set_sector_s
          ndctl_namespace_enable(ndns);
  }


Why the Term "namespace"?
^^^^^^^^^^^^^^^^^^^^^^^^^

    1. Why not "volume" for instance? "volume
       ND (libnvdimm subsystem) to a volume ma

    2. The term originated to describe the sub
       within a NVME controller (see the nvme
       https://www.nvmexpress.org/specificatio
       meant to parallel the capabilities and
       NVME-namespaces.


LIBNVDIMM/LIBNDCTL: Block Translation Table "b
----------------------------------------------

A BTT (design document: https://pmem.io/2014/0
personality driver for a namespace that fronts
'address abstraction'.

LIBNVDIMM: btt layout
^^^^^^^^^^^^^^^^^^^^^

Every region will start out with at least one
seed device. To activate it set the "namespac
"sector_size" attributes and then bind the dev
nd_blk driver depending on the region type::

        /sys/devices/platform/nfit_test.1/ndbu
        |-- namespace
        |-- delete
        |-- devtype
        |-- modalias
        |-- numa_node
        |-- sector_size
        |-- subsystem -> ../../../../../bus/nd
        |-- uevent
        `-- uuid

LIBNDCTL: btt creation example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Similar to namespaces an idle BTT device is au
region. Each time this "seed" btt device is c
seed is created.
Creating a BTT configuration
finding an idle BTT and assigning it to consu

::

        static struct ndctl_btt *get_idle_btt(
        {
                struct ndctl_btt *btt;

                ndctl_btt_foreach(region, btt)
                        if (!ndctl_btt_is_enab
                                        && !nd
                                return btt;

                return NULL;
        }

        static int configure_btt(struct ndctl_
                        struct btt_parameters
        {
                struct ndctl_btt *btt = get_idle_btt(region);

                ndctl_btt_set_uuid(btt, parame
                ndctl_btt_set_sector_size(btt,
                ndctl_btt_set_namespace(btt, p
                /* turn off raw mode device */
                ndctl_namespace_disable(parame
                /* turn on btt access */
                return ndctl_btt_enable(btt);
        }

Once instantiated a new inactive btt seed devi
the region.

Once a "namespace" is removed from a BTT that
will be deleted or otherwise reset to default
only at the device model level. In order to d
block" needs to be destroyed. Note, that to d
needs to be written in raw mode. By default,
the presence of a BTT and disable raw mode. T
can be suppressed by enabling raw mode for the
ndctl_namespace_set_raw_mode() API.


Summary LIBNDCTL Diagram
------------------------

For the given example above, here is the view
LIBNDCTL API::

              +---+
              |CTX|
              +-+-+
                |
  +-------+     |
  | DIMM0 <-+   |      +---------+   +--------
  +-------+ |   |    +-> REGION0 +---> NAMESPA
  | DIMM1 <-+ +-v--+ | +---------+   +--------
  +-------+ +-+BUS0+-| +---------+   +--------
  | DIMM2 <-+ +----+ +-> REGION1 +---> NAMESPA
  +-------+ |        | +---------+   +--------
  | DIMM3 <-+
  +-------+
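
The glossary describes the BTT as a table with atomic update semantics that
presents atomic sectors on top of byte-addressable PMEM. The toy model
below is only a sketch of that indirection idea, written for this document:
every name in it is hypothetical, it lives purely in ordinary memory, and
it ignores the on-media layout, persistence ordering, and concurrency that
the real driver has to handle. A write never modifies the block that a
sector currently maps to. It fills a spare block first and only then
switches the map entry, so a reader sees either the old sector or the new
one in full.

```c
#include <stdint.h>
#include <string.h>

#define TOY_SECTOR_SIZE 512
#define TOY_NLBA 4 /* sectors visible to the user of the toy device */

/* Hypothetical in-memory model, not the real BTT data structures. */
struct toy_btt {
	uint32_t map[TOY_NLBA];  /* logical sector -> backing block */
	uint32_t free_block;     /* the one spare backing block */
	uint8_t blocks[TOY_NLBA + 1][TOY_SECTOR_SIZE];
};

/* Start with an identity map and the last block as the spare. */
static void toy_btt_init(struct toy_btt *btt)
{
	memset(btt, 0, sizeof(*btt));
	for (uint32_t i = 0; i < TOY_NLBA; i++)
		btt->map[i] = i;
	btt->free_block = TOY_NLBA;
}

/* Write one full sector; returns 0 on success, -1 for a bad sector number. */
static int toy_btt_write(struct toy_btt *btt, uint32_t lba, const void *buf)
{
	uint32_t old, dst;

	if (lba >= TOY_NLBA)
		return -1;

	old = btt->map[lba];
	dst = btt->free_block;

	/* Step 1: fill the spare block; the mapped block is untouched. */
	memcpy(btt->blocks[dst], buf, TOY_SECTOR_SIZE);
	/* Step 2: a single map update makes the new data visible. */
	btt->map[lba] = dst;
	/* Step 3: the previously mapped block becomes the new spare. */
	btt->free_block = old;

	return 0;
}

/* Return a pointer to the current contents of a sector, or NULL. */
static const uint8_t *toy_btt_read(const struct toy_btt *btt, uint32_t lba)
{
	if (lba >= TOY_NLBA)
		return NULL;
	return btt->blocks[btt->map[lba]];
}
```

If the model were interrupted between step 1 and step 2, the map would
still point at the old block, which is the property an atomic-sector
consumer relies on.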