===============================
LIBNVDIMM: Non-Volatile Devices
===============================

libnvdimm - kernel / libndctl - userspace helper library

nvdimm@lists.linux.dev

Version 13

.. contents:

	Glossary
	Overview
	    Supporting Documents
	    Git Trees
	LIBNVDIMM PMEM
	    PMEM-REGIONs, Atomic Sectors, and DAX
	Example NVDIMM Platform
	LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
	    LIBNDCTL: Context
	        libndctl: instantiate a new library context example
	    LIBNVDIMM/LIBNDCTL: Bus
	        libnvdimm: control class device in /sys/class
	        libnvdimm: bus
	        libndctl: bus enumeration example
	    LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
	        libnvdimm: DIMM (NMEM)
	        libndctl: DIMM enumeration example
	    LIBNVDIMM/LIBNDCTL: Region
	        libnvdimm: region
	        libndctl: region enumeration example
	    LIBNVDIMM/LIBNDCTL: Namespace
	        libnvdimm: namespace
	        libndctl: namespace enumeration example
	        libndctl: namespace creation example
	        Why the Term "namespace"?
	    LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
	        libnvdimm: btt layout
	        libndctl: btt creation example
	Summary LIBNDCTL Diagram


Glossary
========

PMEM:
  A system-physical-address range where writes are persistent.  A
  block device composed of PMEM is capable of DAX.  A PMEM address range
  may span an interleave of several DIMMs.

DPA:
  DIMM Physical Address, is a DIMM-relative offset.  With one DIMM in
  the system there would be a 1:1 system-physical-address:DPA association.
  Once more DIMMs are added a memory controller interleave must be
  decoded to determine the DPA associated with a given
  system-physical-address.

DAX:
  File system extensions to bypass the page cache and block layer to
  mmap persistent memory, from a PMEM block device, directly into a
  process address space.

DSM:
  Device Specific Method: ACPI method to control a specific
  device - in this case the firmware.

DCR:
  NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5.
  It defines a vendor-id, device-id, and interface format for a given DIMM.

BTT:
  Block Translation Table: Persistent memory is byte addressable.
  Existing software may have an expectation that the power-fail-atomicity
  of writes is at least one sector, 512 bytes.  The BTT is an indirection
  table with atomic update semantics to front a PMEM block device
  driver and present arbitrary atomic sector sizes.

LABEL:
  Metadata stored on a DIMM device that partitions and identifies
  (persistently names) capacity allocated to different PMEM namespaces.  It
  also indicates whether an address abstraction, like a BTT, is applied to
  the namespace.  Note that traditional partition tables, GPT/MBR, are
  layered on top of a PMEM namespace, or an address abstraction like BTT
  if present, but partition support is deprecated going forward.


Overview
========

The LIBNVDIMM subsystem provides support for PMEM described by platform
firmware or a device driver.  On ACPI based systems the platform firmware
conveys persistent memory resources via the ACPI NFIT "NVDIMM Firmware
Interface Table" in ACPI 6.  While the LIBNVDIMM implementation is generic
and supports pre-NFIT platforms, it was guided by the superset of
capabilities needed to support this ACPI 6 definition for NVDIMM
resources.  The original implementation also supported the
block-window-aperture capability described in the NFIT, but that support
has since been abandoned and was never shipped in a product.

Supporting Documents
--------------------

ACPI 6:
	https://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
NVDIMM Namespace:
	https://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
DSM Interface Example:
	https://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
Driver Writer's Guide:
	https://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf

Git Trees
---------

LIBNVDIMM:
	https://git.kernel.org/cgit/linux/kernel/git/nvdimm/nvdimm.git
LIBNDCTL:
	https://github.com/pmem/ndctl.git


LIBNVDIMM PMEM
==============

Prior to the arrival of the NFIT, non-volatile memory was described to a
system in various ad-hoc ways.  Usually only the bare minimum was
provided, namely, a single system-physical-address range where writes
are expected to be durable after a system power loss.  Now, the NFIT
specification standardizes not only the description of PMEM, but also
platform message-passing entry points for control and configuration.

PMEM (nd_pmem.ko): Drives a system-physical-address range.  This range is
contiguous in system memory and may be interleaved (hardware memory
controller striped) across multiple DIMMs.  When interleaved the platform
may optionally provide details of which DIMMs are participating in the
interleave.

It is worth noting that when the labeling capability is detected (a
namespace label index block is found), then no block device is created
by default as userspace needs to do at least one allocation of DPA to
the PMEM range.  In contrast ND_NAMESPACE_IO ranges, once registered,
can be immediately attached to nd_pmem.  This latter mode is called
label-less or "legacy".
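
As an illustration of how userspace can tell these two cases apart, the
sketch below walks the namespaces of a region and reports which mode each
one is in.  It assumes the ndctl_namespace_get_type() and
ndctl_namespace_get_devname() accessors plus the ND_DEVICE_NAMESPACE_*
constants from <linux/ndctl.h> are available in the installed libndctl;
headers are omitted for brevity, as in the surrounding examples, and this
is an illustrative sketch rather than the canonical interface::

	/* report whether each namespace in a region is label-backed or legacy */
	static void report_label_mode(struct ndctl_region *region)
	{
		struct ndctl_namespace *ndns;

		ndctl_namespace_foreach(region, ndns) {
			int type = ndctl_namespace_get_type(ndns);

			printf("%s: %s\n", ndctl_namespace_get_devname(ndns),
				type == ND_DEVICE_NAMESPACE_IO
				? "label-less (legacy)" : "label-backed");
		}
	}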

PMEM-REGIONs, Atomic Sectors, and DAX
-------------------------------------

For the cases where an application or filesystem still needs atomic sector
update guarantees it can register a BTT on a PMEM device or partition.  See
LIBNVDIMM/NDCTL: Block Translation Table "btt"
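
For the DAX case named in this section's title, applications typically
reach PMEM by mmap()ing a file on a DAX-capable filesystem (for example
ext4 or xfs mounted with the "dax" option) that sits on a PMEM namespace.
A minimal, hedged sketch follows; the mount point and mapping length are
hypothetical, and MAP_SYNC requires _GNU_SOURCE plus <sys/mman.h> and
<fcntl.h>::

	/*
	 * Map a file backed by a PMEM namespace with MAP_SYNC so the
	 * resulting mappings are synchronous with respect to power failure.
	 * "/mnt/pmem/data" and the 2MB length are hypothetical values.
	 */
	size_t len = 2UL << 20;
	int fd = open("/mnt/pmem/data", O_RDWR);

	if (fd >= 0) {
		void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);

		if (addr != MAP_FAILED) {
			/* CPU stores to 'addr', once flushed, are persistent */
		}
	}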

Example NVDIMM Platform
=======================

For the remainder of this document the following diagram will be
referenced for any example sysfs layouts::


                                 (a)                        DIMM
              +-------------------+--------+--------+--------+
    +------+  |       pm0.0       |  free  | pm1.0  |  free  |    0
    | imc0 +--+- - - region0- - - +--------+        +--------+
    +--+---+  |       pm0.0       |  free  | pm1.0  |  free  |    1
       |      +-------------------+--------v        v--------+
    +--+---+                               |                 |
    | cpu0 |                                     region1
    +--+---+                               |                 |
       |      +----------------------------^        ^--------+
    +--+---+  |           free             | pm1.0  |  free  |    2
    | imc1 +--+----------------------------|        +--------+
    +------+  |           free             | pm1.0  |  free  |    3
              +----------------------------+--------+--------+

In this platform we have four DIMMs and two memory controllers in one
socket.  Each PMEM interleave set is identified by a region device with
a dynamically assigned id.

1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0.  A
   single PMEM namespace is created in the REGION0-SPA-range that spans most
   of DIMM0 and DIMM1 with a user-specified name of "pm0.0".  Some of that
   interleaved system-physical-address range is left free to allow another
   PMEM namespace to be defined.
2. In the last portion of DIMM0 and DIMM1 we have an interleaved
   system-physical-address range, REGION1, that spans those two DIMMs as
   well as DIMM2 and DIMM3.  Some of REGION1 is allocated to a PMEM
   namespace named "pm1.0".

   This bus is provided by the kernel under the device
   /sys/devices/platform/nfit_test.0 when the nfit_test.ko module from
   tools/testing/nvdimm is loaded.  This module serves as a unit test for
   LIBNVDIMM and the acpi_nfit.ko driver.


LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
========================================================

What follows is a description of the LIBNVDIMM sysfs layout and a
corresponding object hierarchy diagram as viewed through the LIBNDCTL
API.  The example sysfs paths and diagrams are relative to the Example
NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
test.

LIBNDCTL: Context
-----------------

Every API call in the LIBNDCTL library requires a context that holds the
logging parameters and other library instance state.  The library is
based on the libabc template:

	https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git

LIBNDCTL: instantiate a new library context example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

	struct ndctl_ctx *ctx;

	if (ndctl_new(&ctx) == 0)
		return ctx;
	else
		return NULL;
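
The context is reference counted.  A complete lifecycle, sketched here
under the assumption that ndctl_unref() is the matching teardown call
that drops the reference taken by ndctl_new(), looks like::

	struct ndctl_ctx *ctx;

	if (ndctl_new(&ctx) != 0)
		return;

	/* ... enumerate busses, DIMMs, regions, and namespaces ... */

	ndctl_unref(ctx);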

LIBNVDIMM/LIBNDCTL: Bus
-----------------------

A bus has a 1:1 relationship with an NFIT.  The current expectation for
ACPI based systems is that there is only ever one platform-global NFIT.
That said, it is trivial to register multiple NFITs; the specification
does not preclude it.  The infrastructure supports multiple busses and
we use this capability to test multiple NFIT configurations in the unit
test.

LIBNVDIMM: control class device in /sys/class
---------------------------------------------

This character device accepts DSM messages to be passed to a DIMM
identified by its NFIT handle::

	/sys/class/nd/ndctl0
	|-- dev
	|-- device -> ../../../ndbus0
	|-- subsystem -> ../../../../../../../class/nd


LIBNVDIMM: bus
--------------

::

	struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
	       struct nvdimm_bus_descriptor *nfit_desc);

::

	/sys/devices/platform/nfit_test.0/ndbus0
	|-- commands
	|-- nd
	|-- nfit
	|-- nmem0
	|-- nmem1
	|-- nmem2
	|-- nmem3
	|-- power
	|-- provider
	|-- region0
	|-- region1
	|-- region2
	|-- region3
	|-- region4
	|-- region5
	|-- uevent
	`-- wait_probe

LIBNDCTL: bus enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Find the bus handle that describes the bus from Example NVDIMM Platform::

	static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
			const char *provider)
	{
		struct ndctl_bus *bus;

		ndctl_bus_foreach(ctx, bus)
			if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0)
				return bus;

		return NULL;
	}

	bus = get_bus_by_provider(ctx, "nfit_test.0");
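
The "wait_probe" attribute in the bus listing above can be used to flush
asynchronous device registration before enumeration.  A hedged sketch,
assuming ndctl_bus_wait_probe() and ndctl_bus_get_devname() are available
in the installed libndctl::

	struct ndctl_bus *bus;

	ndctl_bus_foreach(ctx, bus) {
		/* let in-flight nd device probing settle before enumerating */
		ndctl_bus_wait_probe(bus);
		printf("%s: provider: %s\n", ndctl_bus_get_devname(bus),
				ndctl_bus_get_provider(bus));
	}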

LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
-------------------------------

The DIMM device provides a character device for sending commands to
hardware, and it is a container for LABELs.  If the DIMM is defined by
NFIT then an optional 'nfit' attribute sub-directory is available to add
NFIT-specifics.

Note that the kernel device name for "DIMMs" is "nmemX".  The NFIT
describes these devices via "Memory Device to System Physical Address
Range Mapping Structure", and there is no requirement that they actually
be physical DIMMs, so we use a more generic name.

LIBNVDIMM: DIMM (NMEM)
^^^^^^^^^^^^^^^^^^^^^^

::

	struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
			const struct attribute_group **groups, unsigned long flags,
			unsigned long *dsm_mask);

::

	/sys/devices/platform/nfit_test.0/ndbus0
	|-- nmem0
	|   |-- available_slots
	|   |-- commands
	|   |-- dev
	|   |-- devtype
	|   |-- driver -> ../../../../../bus/nd/drivers/nvdimm
	|   |-- modalias
	|   |-- nfit
	|   |   |-- device
	|   |   |-- format
	|   |   |-- handle
	|   |   |-- phys_id
	|   |   |-- rev_id
	|   |   |-- serial
	|   |   `-- vendor
	|   |-- state
	|   |-- subsystem -> ../../../../../bus/nd
	|   `-- uevent
	|-- nmem1
	[..]
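
As an example of the command path, the size of a DIMM's label storage
area can be queried through the ND_CMD_GET_CONFIG_SIZE passthrough.  The
sketch below assumes the libndctl command helpers
ndctl_dimm_cmd_new_cfg_size(), ndctl_cmd_submit(),
ndctl_cmd_cfg_size_get_size(), ndctl_cmd_unref(), and
ndctl_dimm_get_devname() are available; error handling is elided::

	struct ndctl_cmd *cmd = ndctl_dimm_cmd_new_cfg_size(dimm);

	if (cmd) {
		/* issue the command through the DIMM/bus character device */
		if (ndctl_cmd_submit(cmd) >= 0)
			printf("%s label area: %u bytes\n",
					ndctl_dimm_get_devname(dimm),
					ndctl_cmd_cfg_size_get_size(cmd));
		ndctl_cmd_unref(cmd);
	}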

LIBNDCTL: DIMM enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note, in this example we are assuming NFIT-defined DIMMs which are
identified by an "nfit_handle", a 32-bit value where:

	- Bit 3:0 DIMM number within the memory channel
	- Bit 7:4 memory channel number
	- Bit 11:8 memory controller ID
	- Bit 15:12 socket ID (within scope of a Node controller if node
	  controller is present)
	- Bit 27:16 Node Controller ID
	- Bit 31:28 Reserved

::

	static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
	       unsigned int handle)
	{
		struct ndctl_dimm *dimm;

		ndctl_dimm_foreach(bus, dimm)
			if (ndctl_dimm_get_handle(dimm) == handle)
				return dimm;

		return NULL;
	}

	#define DIMM_HANDLE(n, s, i, c, d) \
		(((n & 0xfff) << 16) | ((s & 0xf) << 12) | ((i & 0xf) << 8) \
		 | ((c & 0xf) << 4) | (d & 0xf))

	dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));
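
The inverse operation, splitting an nfit_handle back into its fields, is
plain mask-and-shift arithmetic following the bit layout above; for
example::

	static void decode_dimm_handle(unsigned int handle)
	{
		printf("node: %u socket: %u imc: %u channel: %u dimm: %u\n",
				(handle >> 16) & 0xfff, (handle >> 12) & 0xf,
				(handle >> 8) & 0xf, (handle >> 4) & 0xf,
				handle & 0xf);
	}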
394 "nstype" indicates the integer type of namespa 486 "nstype" indicates the integer type of namespace-device this region 395 emits, "devtype" duplicates the DEVTYPE variab 487 emits, "devtype" duplicates the DEVTYPE variable stored by udev at the 396 'add' event, "modalias" duplicates the MODALIA 488 'add' event, "modalias" duplicates the MODALIAS variable stored by udev 397 at the 'add' event, and finally, the optional 489 at the 'add' event, and finally, the optional "spa_index" is provided in 398 the case where the region is defined by a SPA. 490 the case where the region is defined by a SPA. 399 491 400 LIBNVDIMM: region:: 492 LIBNVDIMM: region:: 401 493 402 struct nd_region *nvdimm_pmem_region_c 494 struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus, 403 struct nd_region_desc 495 struct nd_region_desc *ndr_desc); >> 496 struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus, >> 497 struct nd_region_desc *ndr_desc); 404 498 405 :: 499 :: 406 500 407 /sys/devices/platform/nfit_test.0/ndbu 501 /sys/devices/platform/nfit_test.0/ndbus0 408 |-- region0 502 |-- region0 409 | |-- available_size 503 | |-- available_size 410 | |-- btt0 504 | |-- btt0 411 | |-- btt_seed 505 | |-- btt_seed 412 | |-- devtype 506 | |-- devtype 413 | |-- driver -> ../../../../../bus/n 507 | |-- driver -> ../../../../../bus/nd/drivers/nd_region 414 | |-- init_namespaces 508 | |-- init_namespaces 415 | |-- mapping0 509 | |-- mapping0 416 | |-- mapping1 510 | |-- mapping1 417 | |-- mappings 511 | |-- mappings 418 | |-- modalias 512 | |-- modalias 419 | |-- namespace0.0 513 | |-- namespace0.0 420 | |-- namespace_seed 514 | |-- namespace_seed 421 | |-- numa_node 515 | |-- numa_node 422 | |-- nfit 516 | |-- nfit 423 | | `-- spa_index 517 | | `-- spa_index 424 | |-- nstype 518 | |-- nstype 425 | |-- set_cookie 519 | |-- set_cookie 426 | |-- size 520 | |-- size 427 | |-- subsystem -> ../../../../../bu 521 | |-- subsystem -> ../../../../../bus/nd 428 | `-- uevent 522 | `-- uevent 429 |-- region1 523 |-- region1 430 [..] 524 [..] 431 525 432 LIBNDCTL: region enumeration example 526 LIBNDCTL: region enumeration example 433 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 527 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 434 528 435 Sample region retrieval routines based on NFIT 529 Sample region retrieval routines based on NFIT-unique data like 436 "spa_index" (interleave set id). !! 530 "spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for 437 !! 

LIBNVDIMM: region::

	struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
			struct nd_region_desc *ndr_desc);

::

	/sys/devices/platform/nfit_test.0/ndbus0
	|-- region0
	|   |-- available_size
	|   |-- btt0
	|   |-- btt_seed
	|   |-- devtype
	|   |-- driver -> ../../../../../bus/nd/drivers/nd_region
	|   |-- init_namespaces
	|   |-- mapping0
	|   |-- mapping1
	|   |-- mappings
	|   |-- modalias
	|   |-- namespace0.0
	|   |-- namespace_seed
	|   |-- numa_node
	|   |-- nfit
	|   |   `-- spa_index
	|   |-- nstype
	|   |-- set_cookie
	|   |-- size
	|   |-- subsystem -> ../../../../../bus/nd
	|   `-- uevent
	|-- region1
	[..]

LIBNDCTL: region enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sample region retrieval routine based on NFIT-unique data like
"spa_index" (interleave set id).

::

	static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
			unsigned int spa_index)
	{
		struct ndctl_region *region;

		ndctl_region_foreach(bus, region) {
			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
				continue;
			if (ndctl_region_get_spa_index(region) == spa_index)
				return region;
		}
		return NULL;
	}

LIBNVDIMM/LIBNDCTL: Namespace
-----------------------------

A REGION, after resolving DPA aliasing and LABEL specified boundaries,
surfaces one or more "namespace" devices.  The arrival of a "namespace"
device currently triggers the nd_pmem driver to load and register a
disk/block device.

LIBNVDIMM: namespace
^^^^^^^^^^^^^^^^^^^^

Here is a sample layout from the 2 major types of NAMESPACE where
namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid'
attribute) and namespace1.0 represents an anonymous PMEM namespace (note
that it has no 'uuid' attribute since it does not support a LABEL).

::

	/sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
	|-- alt_name
	|-- devtype
	|-- dpa_extents
	|-- force_raw
	|-- modalias
	|-- numa_node
	|-- resource
	|-- size
	|-- subsystem -> ../../../../../../bus/nd
	|-- type
	|-- uevent
	`-- uuid
	/sys/devices/platform/nfit_test.1/ndbus1/region1/namespace1.0
	|-- block
	|   `-- pmem0
	|-- devtype
	|-- driver -> ../../../../../../bus/nd/drivers/pmem
	|-- force_raw
	|-- modalias
	|-- numa_node
	|-- resource
	|-- size
	|-- subsystem -> ../../../../../../bus/nd
	|-- type
	`-- uevent

LIBNDCTL: namespace enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Namespaces are indexed relative to their parent region, example below.
These indexes are mostly static from boot to boot, but the subsystem makes
no guarantees in this regard.  For a static namespace identifier use its
'uuid' attribute.

::

	static struct ndctl_namespace
	*get_namespace_by_id(struct ndctl_region *region, unsigned int id)
	{
		struct ndctl_namespace *ndns;

		ndctl_namespace_foreach(region, ndns)
			if (ndctl_namespace_get_id(ndns) == id)
				return ndns;

		return NULL;
	}
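
Building on the same pattern, a lookup can also key off of the stable
'uuid' attribute.  A sketch, assuming ndctl_namespace_get_uuid() and the
uuid_t helpers from libuuid are available::

	static struct ndctl_namespace
	*get_namespace_by_uuid(struct ndctl_region *region, uuid_t uuid)
	{
		struct ndctl_namespace *ndns;
		uuid_t ndns_uuid;

		ndctl_namespace_foreach(region, ndns) {
			ndctl_namespace_get_uuid(ndns, ndns_uuid);
			if (uuid_compare(ndns_uuid, uuid) == 0)
				return ndns;
		}

		return NULL;
	}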

LIBNDCTL: namespace creation example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Idle namespaces are automatically created by the kernel if a given
region has enough available capacity to create a new namespace.
Namespace instantiation involves finding an idle namespace and
configuring it.  For the most part the setting of namespace attributes
can occur in any order, the only constraint is that 'uuid' must be set
before 'size'.  This enables the kernel to track DPA allocations
internally with a static identifier::

	static int configure_namespace(struct ndctl_region *region,
			struct ndctl_namespace *ndns,
			struct namespace_parameters *parameters)
	{
		char devname[50];

		snprintf(devname, sizeof(devname), "namespace%d.%d",
				ndctl_region_get_id(region), parameters->id);

		ndctl_namespace_set_alt_name(ndns, devname);
		/* 'uuid' must be set prior to setting size! */
		ndctl_namespace_set_uuid(ndns, parameters->uuid);
		ndctl_namespace_set_size(ndns, parameters->size);
		/* unlike pmem namespaces, blk namespaces have a sector size */
		if (parameters->lbasize)
			ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
		ndctl_namespace_enable(ndns);
	}


Why the Term "namespace"?
^^^^^^^^^^^^^^^^^^^^^^^^^

1. Why not "volume" for instance?  "volume" ran the risk of confusing ND
   (libnvdimm subsystem) with a volume manager like device-mapper.

2. The term originated to describe the sub-devices that can be created
   within a NVME controller (see the nvme specification:
   https://www.nvmexpress.org/specifications/), and NFIT namespaces are
   meant to parallel the capabilities and configurability of
   NVME-namespaces.


LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
-------------------------------------------------

A BTT (design document: https://pmem.io/2014/09/23/btt.html) is a
personality driver for a namespace that fronts entire namespaces as an
'address abstraction'.

LIBNVDIMM: btt layout
^^^^^^^^^^^^^^^^^^^^^

Every region will start out with at least one BTT device which is the
seed device.
To activate it set the "namespace", "uuid", and
"sector_size" attributes and then bind the device to the nd_pmem driver::

	/sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/
	|-- namespace
	|-- delete
	|-- devtype
	|-- modalias
	|-- numa_node
	|-- sector_size
	|-- subsystem -> ../../../../../bus/nd
	|-- uevent
	`-- uuid

LIBNDCTL: btt creation example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Similar to namespaces an idle BTT device is automatically created per
region.  Each time this "seed" btt device is configured and enabled a new
seed is created.  Creating a BTT configuration involves two steps: finding
an idle BTT and assigning it to consume a namespace.

::

	static struct ndctl_btt *get_idle_btt(struct ndctl_region *region)
	{
		struct ndctl_btt *btt;

		ndctl_btt_foreach(region, btt)
			if (!ndctl_btt_is_enabled(btt)
					&& !ndctl_btt_is_configured(btt))
				return btt;

		return NULL;
	}

	static int configure_btt(struct ndctl_region *region,
			struct btt_parameters *parameters)
	{
		struct ndctl_btt *btt = get_idle_btt(region);

		ndctl_btt_set_uuid(btt, parameters->uuid);
		ndctl_btt_set_sector_size(btt, parameters->sector_size);
		ndctl_btt_set_namespace(btt, parameters->ndns);
		/* turn off raw mode device */
		ndctl_namespace_disable(parameters->ndns);
		/* turn on btt access */
		ndctl_btt_enable(btt);
	}

Once instantiated a new inactive btt seed device will appear underneath
the region.

Once a "namespace" is removed from a BTT that instance of the BTT device
will be deleted or otherwise reset to default values.  This deletion is
only at the device model level.  In order to destroy a BTT the "info
block" needs to be destroyed.  Note, that to destroy a BTT the media
needs to be written in raw mode.  By default, the kernel will autodetect
the presence of a BTT and disable raw mode.  This autodetect behavior
can be suppressed by enabling raw mode for the namespace via the
ndctl_namespace_set_raw_mode() API.


Summary LIBNDCTL Diagram
------------------------

For the given example above, here is the view of the objects as seen by the
LIBNDCTL API::

              +---+
              |CTX|
              +-+-+
                |
  +-------+     |
  | DIMM0 <-+   |      +---------+   +--------------+   +---------------+
  +-------+ |   |    +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0"  |
  | DIMM1 <-+ +-v--+ | +---------+   +--------------+   +---------------+
  +-------+ +-+BUS0+-| +---------+   +--------------+   +---------------+
  | DIMM2 <-+ +----+ +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0"  |
  +-------+ |        | +---------+   +--------------+   +---------------+
  | DIMM3 <-+
  +-------+