| =============================== |
| LIBNVDIMM: Non-Volatile Devices |
| =============================== |
| |
| libnvdimm - kernel / libndctl - userspace helper library |
| |
| linux-nvdimm@lists.01.org |
| |
| Version 13 |
| |
| .. contents:: |
| |
| Glossary |
| Overview |
| Supporting Documents |
| Git Trees |
| LIBNVDIMM PMEM and BLK |
| Why BLK? |
| PMEM vs BLK |
| BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX |
| Example NVDIMM Platform |
| LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API |
| LIBNDCTL: Context |
| libndctl: instantiate a new library context example |
| LIBNVDIMM/LIBNDCTL: Bus |
| libnvdimm: control class device in /sys/class |
| libnvdimm: bus |
| libndctl: bus enumeration example |
| LIBNVDIMM/LIBNDCTL: DIMM (NMEM) |
| libnvdimm: DIMM (NMEM) |
| libndctl: DIMM enumeration example |
| LIBNVDIMM/LIBNDCTL: Region |
| libnvdimm: region |
| libndctl: region enumeration example |
| Why Not Encode the Region Type into the Region Name? |
| How Do I Determine the Major Type of a Region? |
| LIBNVDIMM/LIBNDCTL: Namespace |
| libnvdimm: namespace |
| libndctl: namespace enumeration example |
| libndctl: namespace creation example |
| Why the Term "namespace"? |
| LIBNVDIMM/LIBNDCTL: Block Translation Table "btt" |
| libnvdimm: btt layout |
| libndctl: btt creation example |
| Summary LIBNDCTL Diagram |
| |
| |
| Glossary |
| ======== |
| |
| PMEM: |
| A system-physical-address range where writes are persistent. A |
| block device composed of PMEM is capable of DAX. A PMEM address range |
| may span an interleave of several DIMMs. |
| |
| BLK: |
| A set of one or more programmable memory mapped apertures provided |
| by a DIMM to access its media. This indirection precludes the |
| performance benefit of interleaving, but enables DIMM-bounded failure |
| modes. |
| |
| DPA: |
| DIMM Physical Address: a DIMM-relative offset. With one DIMM in |
| the system there would be a 1:1 system-physical-address:DPA association. |
| Once more DIMMs are added a memory controller interleave must be |
| decoded to determine the DPA associated with a given |
| system-physical-address. BLK capacity always has a 1:1 relationship |
| with a single-DIMM's DPA range. |
| |
| DAX: |
| File system extensions to bypass the page cache and block layer to |
| mmap persistent memory, from a PMEM block device, directly into a |
| process address space. |
| |
| DSM: |
| Device Specific Method: ACPI method to control a specific |
| device - in this case the firmware. |
| |
| DCR: |
| NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5. |
| It defines a vendor-id, device-id, and interface format for a given DIMM. |
| |
| BTT: |
| Block Translation Table: Persistent memory is byte addressable. |
| Existing software may have an expectation that the power-fail-atomicity |
| of writes is at least one sector, 512 bytes. The BTT is an indirection |
| table with atomic update semantics to front a PMEM/BLK block device |
| driver and present arbitrary atomic sector sizes. |
| |
| LABEL: |
| Metadata stored on a DIMM device that partitions and identifies |
| (persistently names) storage between PMEM and BLK. It also partitions |
| BLK storage to host BTTs with different parameters per BLK-partition. |
| Note that traditional partition tables, GPT/MBR, are layered on top of a |
| BLK or PMEM device. |
| |
| |
| Overview |
| ======== |
| |
| The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely, |
| PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM |
| and BLK mode access. These three modes of operation are described by |
| the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6. While the LIBNVDIMM |
| implementation is generic and supports pre-NFIT platforms, it was guided |
| by the superset of capabilities needed to support this ACPI 6 definition |
| for NVDIMM resources. The bulk of the kernel implementation is in place |
| to handle the case where DPA accessible via PMEM is aliased with DPA |
| accessible via BLK. When that occurs a LABEL is needed to reserve DPA |
| for exclusive access via one mode at a time. |
| |
| Supporting Documents |
| -------------------- |
| |
| ACPI 6: |
| https://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf |
| NVDIMM Namespace: |
| https://pmem.io/documents/NVDIMM_Namespace_Spec.pdf |
| DSM Interface Example: |
| https://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf |
| Driver Writer's Guide: |
| https://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf |
| |
| Git Trees |
| --------- |
| |
| LIBNVDIMM: |
| https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git |
| LIBNDCTL: |
| https://github.com/pmem/ndctl.git |
| PMEM: |
| https://github.com/01org/prd |
| |
| |
| LIBNVDIMM PMEM and BLK |
| ====================== |
| |
| Prior to the arrival of the NFIT, non-volatile memory was described to a |
| system in various ad-hoc ways. Usually only the bare minimum was |
| provided, namely, a single system-physical-address range where writes |
| are expected to be durable after a system power loss. Now, the NFIT |
| specification standardizes not only the description of PMEM, but also |
| BLK and platform message-passing entry points for control and |
| configuration. |
| |
| For each NVDIMM access method (PMEM, BLK), LIBNVDIMM provides a block |
| device driver: |
| |
| 1. PMEM (nd_pmem.ko): Drives a system-physical-address range. This |
| range is contiguous in system memory and may be interleaved (hardware |
| memory controller striped) across multiple DIMMs. When interleaved the |
| platform may optionally provide details of which DIMMs are participating |
| in the interleave. |
| |
| Note that while LIBNVDIMM describes system-physical-address ranges that may |
| alias with BLK access as ND_NAMESPACE_PMEM ranges and those without |
| alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no |
| distinction. The different device-types are an implementation detail |
| that userspace can exploit to implement policies like "only interface |
| with address ranges from certain DIMMs". It is worth noting that when |
| aliasing is present and a DIMM lacks a label, then no block device can |
| be created by default as userspace needs to do at least one allocation |
| of DPA to the PMEM range. In contrast ND_NAMESPACE_IO ranges, once |
| registered, can be immediately attached to nd_pmem. |
| |
| 2. BLK (nd_blk.ko): This driver performs I/O using a set of platform |
| defined apertures. A set of apertures will access just one DIMM. |
| Multiple windows (apertures) allow multiple concurrent accesses, much like |
| tagged-command-queuing, and would likely be used by different threads or |
| different CPUs. |
| |
| The NFIT specification defines a standard format for a BLK-aperture, but |
| the spec also allows for vendor specific layouts, and non-NFIT BLK |
| implementations may have other designs for BLK I/O. For this reason |
| "nd_blk" calls back into platform-specific code to perform the I/O. |
| |
| One such implementation is defined in the "Driver Writer's Guide" and "DSM |
| Interface Example". |
| |
| |
| Why BLK? |
| ======== |
| |
| While PMEM provides direct byte-addressable CPU-load/store access to |
| NVDIMM storage, it does not provide the best system RAS (reliability, |
| availability, and serviceability) model. An access to a corrupted |
| system-physical-address causes a CPU exception while an access |
| to a corrupted address through a BLK-aperture causes that block window |
| to raise an error status in a register. The latter is more aligned with |
| the standard error model that host-bus-adapter attached disks present. |
| |
| Also, if an administrator ever wants to replace memory it is easier to |
| service a system at DIMM-module boundaries. Compare this to PMEM where |
| data could be interleaved in an opaque, hardware-specific manner across |
| several DIMMs. |
| |
| PMEM vs BLK |
| ----------- |
| |
| BLK-apertures solve these RAS problems, but their presence is also the |
| major contributing factor to the complexity of the ND subsystem. They |
| complicate the implementation because PMEM and BLK alias in DPA space. |
| Any given DIMM's DPA-range may contribute to one or more |
| system-physical-address sets of interleaved DIMMs, *and* may also be |
| accessed in its entirety through its BLK-aperture. Accessing a DPA |
| through a system-physical-address while simultaneously accessing the |
| same DPA through a BLK-aperture has undefined results. For this reason, |
| DIMMs with this dual interface configuration include a DSM function to |
| store/retrieve a LABEL. The LABEL effectively partitions the DPA-space |
| into exclusive system-physical-address and BLK-aperture accessible |
| regions. For simplicity a DIMM is allowed one PMEM "region" per |
| interleave set in which it is a member. The remaining DPA space can be |
| carved into an arbitrary number of BLK devices with discontiguous |
| extents. |
| |
| BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| One of the few |
| reasons to allow multiple BLK namespaces per REGION is so that each |
| BLK-namespace can be configured with a BTT with unique atomic sector |
| sizes. While a PMEM device can host a BTT, the LABEL specification does |
| not provide for a sector size to be specified for a PMEM namespace. |
| |
| This is due to the expectation that the primary usage model for PMEM is |
| via DAX, and the BTT is incompatible with DAX. However, for the cases |
| where an application or filesystem still needs atomic sector update |
| guarantees it can register a BTT on a PMEM device or partition. See |
| LIBNVDIMM/LIBNDCTL: Block Translation Table "btt". |
| |
| |
| Example NVDIMM Platform |
| ======================= |
| |
| For the remainder of this document the following diagram will be |
| referenced for any example sysfs layouts:: |
| |
| |
| (a) (b) DIMM BLK-REGION |
| +-------------------+--------+--------+--------+ |
| +------+ | pm0.0 | blk2.0 | pm1.0 | blk2.1 | 0 region2 |
| | imc0 +--+- - - region0- - - +--------+ +--------+ |
| +--+---+ | pm0.0 | blk3.0 | pm1.0 | blk3.1 | 1 region3 |
| | +-------------------+--------v v--------+ |
| +--+---+ | | |
| | cpu0 | region1 |
| +--+---+ | | |
| | +----------------------------^ ^--------+ |
| +--+---+ | blk4.0 | pm1.0 | blk4.0 | 2 region4 |
| | imc1 +--+----------------------------| +--------+ |
| +------+ | blk5.0 | pm1.0 | blk5.0 | 3 region5 |
| +----------------------------+--------+--------+ |
| |
| In this platform we have four DIMMs and two memory controllers in one |
| socket. Each unique interface (BLK or PMEM) to DPA space is identified |
| by a region device with a dynamically assigned id (REGION0 - REGION5). |
| |
| 1. The first portion of DIMM0 and DIMM1 is interleaved as REGION0. A |
| single PMEM namespace is created in the REGION0-SPA-range that spans most |
| of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that |
| interleaved system-physical-address range is reclaimed as BLK-aperture |
| accessed space starting at DPA-offset (a) into each DIMM. In that |
| reclaimed space we create two BLK-aperture "namespaces" from REGION2 and |
| REGION3 where "blk2.0" and "blk3.0" are just human readable names that |
| could be set to any user-desired name in the LABEL. |
| |
| 2. In the last portion of DIMM0 and DIMM1 we have an interleaved |
| system-physical-address range, REGION1, that spans those two DIMMs as |
| well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace |
| named "pm1.0", the rest is reclaimed in 4 BLK-aperture namespaces (for |
| each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and |
| "blk5.0". |
| |
| 3. The portions of DIMM2 and DIMM3 that do not participate in the REGION1 |
| interleaved system-physical-address range (i.e. the DPA addresses past |
| offset (b)) are also included in the "blk4.0" and "blk5.0" namespaces. |
| Note, that this example shows that BLK-aperture namespaces don't need to |
| be contiguous in DPA-space. |
| |
| This bus is provided by the kernel under the device |
| /sys/devices/platform/nfit_test.0 when the nfit_test.ko module from |
| tools/testing/nvdimm is loaded. This not only tests LIBNVDIMM but the |
| acpi_nfit.ko driver as well. |
| |
| |
| LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API |
| ======================================================== |
| |
| What follows is a description of the LIBNVDIMM sysfs layout and a |
| corresponding object hierarchy diagram as viewed through the LIBNDCTL |
| API. The example sysfs paths and diagrams are relative to the Example |
| NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit |
| test. |
| |
| LIBNDCTL: Context |
| ----------------- |
| |
| Every API call in the LIBNDCTL library requires a context that holds the |
| logging parameters and other library instance state. The library is |
| based on the libabc template: |
| |
| https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git |
| |
| LIBNDCTL: instantiate a new library context example |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| :: |
| |
| struct ndctl_ctx *ctx; |
| |
| if (ndctl_new(&ctx) == 0) |
| return ctx; |
| else |
| return NULL; |
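| |
| When the application is finished with the library the context is |
| dropped with a matching unref call (a minimal sketch, assuming the |
| usual libabc-style reference counting exported by libndctl):: |
| |
| ndctl_unref(ctx); |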
| |
| LIBNVDIMM/LIBNDCTL: Bus |
| ----------------------- |
| |
| A bus has a 1:1 relationship with an NFIT. The current expectation for |
| ACPI based systems is that there is only ever one platform-global NFIT. |
| That said, it is trivial to register multiple NFITs; the specification |
| does not preclude it. The infrastructure supports multiple busses and |
| we use this capability to test multiple NFIT configurations in the unit |
| test. |
| |
| LIBNVDIMM: control class device in /sys/class |
| --------------------------------------------- |
| |
| This character device accepts DSM messages to be passed to a DIMM |
| identified by its NFIT handle:: |
| |
| /sys/class/nd/ndctl0 |
| |-- dev |
| |-- device -> ../../../ndbus0 |
| |-- subsystem -> ../../../../../../../class/nd |
| |
| |
| |
| LIBNVDIMM: bus |
| -------------- |
| |
| :: |
| |
| struct nvdimm_bus *nvdimm_bus_register(struct device *parent, |
| struct nvdimm_bus_descriptor *nfit_desc); |
| |
| :: |
| |
| /sys/devices/platform/nfit_test.0/ndbus0 |
| |-- commands |
| |-- nd |
| |-- nfit |
| |-- nmem0 |
| |-- nmem1 |
| |-- nmem2 |
| |-- nmem3 |
| |-- power |
| |-- provider |
| |-- region0 |
| |-- region1 |
| |-- region2 |
| |-- region3 |
| |-- region4 |
| |-- region5 |
| |-- uevent |
| `-- wait_probe |
| |
| LIBNDCTL: bus enumeration example |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Find the bus handle that describes the bus from Example NVDIMM Platform:: |
| |
| static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx, |
| const char *provider) |
| { |
| struct ndctl_bus *bus; |
| |
| ndctl_bus_foreach(ctx, bus) |
| if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0) |
| return bus; |
| |
| return NULL; |
| } |
| |
| bus = get_bus_by_provider(ctx, "nfit_test.0"); |
| |
| |
| LIBNVDIMM/LIBNDCTL: DIMM (NMEM) |
| ------------------------------- |
| |
| The DIMM device provides a character device for sending commands to |
| hardware, and it is a container for LABELs. If the DIMM is defined by |
| NFIT then an optional 'nfit' attribute sub-directory is available to add |
| NFIT-specifics. |
| |
| Note that the kernel device name for "DIMMs" is "nmemX". The NFIT |
| describes these devices via "Memory Device to System Physical Address |
| Range Mapping Structure", and there is no requirement that they actually |
| be physical DIMMs, so we use a more generic name. |
| |
| LIBNVDIMM: DIMM (NMEM) |
| ^^^^^^^^^^^^^^^^^^^^^^ |
| |
| :: |
| |
| struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data, |
| const struct attribute_group **groups, unsigned long flags, |
| unsigned long *dsm_mask); |
| |
| :: |
| |
| /sys/devices/platform/nfit_test.0/ndbus0 |
| |-- nmem0 |
| | |-- available_slots |
| | |-- commands |
| | |-- dev |
| | |-- devtype |
| | |-- driver -> ../../../../../bus/nd/drivers/nvdimm |
| | |-- modalias |
| | |-- nfit |
| | | |-- device |
| | | |-- format |
| | | |-- handle |
| | | |-- phys_id |
| | | |-- rev_id |
| | | |-- serial |
| | | `-- vendor |
| | |-- state |
| | |-- subsystem -> ../../../../../bus/nd |
| | `-- uevent |
| |-- nmem1 |
| [..] |
| |
| |
| LIBNDCTL: DIMM enumeration example |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Note, in this example we are assuming NFIT-defined DIMMs which are |
| identified by an "nfit_handle", a 32-bit value where: |
| |
| - Bit 3:0 DIMM number within the memory channel |
| - Bit 7:4 memory channel number |
| - Bit 11:8 memory controller ID |
| - Bit 15:12 socket ID (within scope of a Node controller if node |
| controller is present) |
| - Bit 27:16 Node Controller ID |
| - Bit 31:28 Reserved |
| |
| :: |
| |
| static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus, |
| unsigned int handle) |
| { |
| struct ndctl_dimm *dimm; |
| |
| ndctl_dimm_foreach(bus, dimm) |
| if (ndctl_dimm_get_handle(dimm) == handle) |
| return dimm; |
| |
| return NULL; |
| } |
| |
| #define DIMM_HANDLE(n, s, i, c, d) \ |
| (((n & 0xfff) << 16) | ((s & 0xf) << 12) | ((i & 0xf) << 8) \ |
| | ((c & 0xf) << 4) | (d & 0xf)) |
| |
| dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0)); |
| |
| LIBNVDIMM/LIBNDCTL: Region |
| -------------------------- |
| |
| A generic REGION device is registered for each PMEM range or BLK-aperture |
| set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture |
| sets on the "nfit_test.0" bus. The primary role of a region is to be a |
| container of "mappings". A mapping is a tuple of <DIMM, |
| DPA-start-offset, length>. |
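| |
| The mappings of a region can be walked through LIBNDCTL; a minimal |
| sketch, assuming the ndctl_mapping_get_offset() and |
| ndctl_mapping_get_length() accessors in addition to the enumeration |
| helpers used elsewhere in this document:: |
| |
| static void print_mappings(struct ndctl_region *region) |
| { |
| struct ndctl_mapping *map; |
| |
| ndctl_mapping_foreach(region, map) { |
| struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map); |
| |
| printf("dimm: %#x dpa: %#llx length: %#llx\n", |
| ndctl_dimm_get_handle(dimm), |
| ndctl_mapping_get_offset(map), |
| ndctl_mapping_get_length(map)); |
| } |
| } |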
| |
| LIBNVDIMM provides a built-in driver for these REGION devices. This driver |
| is responsible for reconciling the aliased DPA mappings across all |
| regions, parsing the LABEL, if present, and then emitting NAMESPACE |
| devices with the resolved/exclusive DPA-boundaries for the nd_pmem or |
| nd_blk device driver to consume. |
| |
| In addition to the generic attributes of "mappings", "interleave_ways", |
| and "size", the REGION device also exports some convenience attributes. |
| "nstype" indicates the integer type of namespace-device this region |
| emits, "devtype" duplicates the DEVTYPE variable stored by udev at the |
| 'add' event, "modalias" duplicates the MODALIAS variable stored by udev |
| at the 'add' event, and finally, the optional "spa_index" is provided in |
| the case where the region is defined by a SPA. |
| |
| LIBNVDIMM: region |
| ^^^^^^^^^^^^^^^^^ |
| |
| :: |
| |
| struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus, |
| struct nd_region_desc *ndr_desc); |
| struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus, |
| struct nd_region_desc *ndr_desc); |
| |
| :: |
| |
| /sys/devices/platform/nfit_test.0/ndbus0 |
| |-- region0 |
| | |-- available_size |
| | |-- btt0 |
| | |-- btt_seed |
| | |-- devtype |
| | |-- driver -> ../../../../../bus/nd/drivers/nd_region |
| | |-- init_namespaces |
| | |-- mapping0 |
| | |-- mapping1 |
| | |-- mappings |
| | |-- modalias |
| | |-- namespace0.0 |
| | |-- namespace_seed |
| | |-- numa_node |
| | |-- nfit |
| | | `-- spa_index |
| | |-- nstype |
| | |-- set_cookie |
| | |-- size |
| | |-- subsystem -> ../../../../../bus/nd |
| | `-- uevent |
| |-- region1 |
| [..] |
| |
| LIBNDCTL: region enumeration example |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Sample region retrieval routines based on NFIT-unique data like |
| "spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for |
| BLK:: |
| |
| static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus, |
| unsigned int spa_index) |
| { |
| struct ndctl_region *region; |
| |
| ndctl_region_foreach(bus, region) { |
| if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM) |
| continue; |
| if (ndctl_region_get_spa_index(region) == spa_index) |
| return region; |
| } |
| return NULL; |
| } |
| |
| static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus, |
| unsigned int handle) |
| { |
| struct ndctl_region *region; |
| |
| ndctl_region_foreach(bus, region) { |
| struct ndctl_mapping *map; |
| |
| if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLK) |
| continue; |
| ndctl_mapping_foreach(region, map) { |
| struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map); |
| |
| if (ndctl_dimm_get_handle(dimm) == handle) |
| return region; |
| } |
| } |
| return NULL; |
| } |
| |
| |
| Why Not Encode the Region Type into the Region Name? |
| ---------------------------------------------------- |
| |
| At first glance it seems that, since the NFIT defines just PMEM and BLK |
| interface types, we should simply name REGION devices with something |
| derived from those type names. However, the ND subsystem explicitly keeps the |
| REGION name generic and expects userspace to always consider the |
| region-attributes for four reasons: |
| |
| 1. There are already more than two REGION and "namespace" types. For |
| PMEM there are two subtypes: as mentioned previously, PMEM where the |
| constituent DIMM devices are known, and anonymous PMEM. For BLK |
| regions the NFIT specification already anticipates vendor specific |
| implementations. The exact distinction of what a region contains is in |
| the region-attributes not the region-name or the region-devtype. |
| |
| 2. A region with zero child-namespaces is a possible configuration. For |
| example, the NFIT allows for a DCR to be published without a |
| corresponding BLK-aperture. This equates to a DIMM that can only accept |
| control/configuration messages, but no I/O through a descendant block |
| device. Again, this "type" is advertised in the attributes ('mappings' |
| == 0) and the name does not tell you much. |
| |
| 3. What if a third major interface type arises in the future? Outside |
| of vendor specific implementations, it's not difficult to envision a |
| third class of interface type beyond BLK and PMEM. With a generic name |
| for the REGION level of the device-hierarchy old userspace |
| implementations can still make sense of new kernel advertised |
| region-types. Userspace can always rely on the generic region |
| attributes like "mappings", "size", etc. and the expected child devices |
| named "namespace". This generic format of the device-model hierarchy |
| allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and |
| future-proof. |
| |
| 4. There are more robust mechanisms for determining the major type of a |
| region than a device name. See the next section, How Do I Determine the |
| Major Type of a Region? |
| |
| How Do I Determine the Major Type of a Region? |
| ---------------------------------------------- |
| |
| Outside of the blanket recommendation of "use libndctl", or simply |
| looking at the kernel header (/usr/include/linux/ndctl.h) to decode the |
| "nstype" integer attribute, here are some other options. |
| |
| 1. module alias lookup |
| ^^^^^^^^^^^^^^^^^^^^^^ |
| |
| The whole point of region/namespace device type differentiation is to |
| decide which block-device driver will attach to a given LIBNVDIMM namespace. |
| One can simply use the modalias to look up the resulting module. It's |
| important to note that this method is robust in the presence of a |
| vendor-specific driver down the road. If a vendor-specific |
| implementation wants to supplant the standard nd_blk driver it can with |
| minimal impact to the rest of LIBNVDIMM. |
| |
| In fact, a vendor may also want to have a vendor-specific region-driver |
| (outside of nd_region). For example, if a vendor defined its own LABEL |
| format it would need its own region driver to parse that LABEL and emit |
| the resulting namespaces. The output from module resolution is more |
| accurate than a region-name or region-devtype. |
| |
| 2. udev |
| ^^^^^^^ |
| |
| The kernel "devtype" is registered in the udev database:: |
| |
| # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0 |
| P: /devices/platform/nfit_test.0/ndbus0/region0 |
| E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0 |
| E: DEVTYPE=nd_pmem |
| E: MODALIAS=nd:t2 |
| E: SUBSYSTEM=nd |
| |
| # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4 |
| P: /devices/platform/nfit_test.0/ndbus0/region4 |
| E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4 |
| E: DEVTYPE=nd_blk |
| E: MODALIAS=nd:t3 |
| E: SUBSYSTEM=nd |
| |
| ...and is available as a region attribute, but keep in mind that the |
| "devtype" does not indicate sub-type variations and scripts should |
| really be examining the other attributes. |
| |
| 3. type specific attributes |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| As it currently stands a BLK-aperture region will never have a |
| "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region. A |
| BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM |
| that does not allow I/O. A PMEM region with a "mappings" value of zero |
| is a simple system-physical-address range. |
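| |
| The attribute-presence test can be done from C with nothing more than |
| access(2); a sketch (the region path is an illustrative input following |
| the example bus):: |
| |
| /* returns non-zero if the region publishes an nfit/spa_index attribute */ |
| static int region_has_spa_index(const char *region_path) |
| { |
| char path[256]; |
| |
| snprintf(path, sizeof(path), "%s/nfit/spa_index", region_path); |
| return access(path, F_OK) == 0; |
| } |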
| |
| |
| LIBNVDIMM/LIBNDCTL: Namespace |
| ----------------------------- |
| |
| A REGION, after resolving DPA aliasing and LABEL specified boundaries, |
| surfaces one or more "namespace" devices. The arrival of a "namespace" |
| device currently triggers either the nd_blk or nd_pmem driver to load |
| and register a disk/block device. |
| |
| LIBNVDIMM: namespace |
| ^^^^^^^^^^^^^^^^^^^^ |
| |
| Here is a sample layout from the three major types of NAMESPACE where |
| namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid' |
| attribute), namespace2.0 represents a BLK namespace (note it has a |
| 'sector_size' attribute), and namespace6.0 represents an anonymous |
| PMEM namespace (note that it has no 'uuid' attribute due to not |
| supporting a LABEL):: |
| |
| /sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0 |
| |-- alt_name |
| |-- devtype |
| |-- dpa_extents |
| |-- force_raw |
| |-- modalias |
| |-- numa_node |
| |-- resource |
| |-- size |
| |-- subsystem -> ../../../../../../bus/nd |
| |-- type |
| |-- uevent |
| `-- uuid |
| /sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0 |
| |-- alt_name |
| |-- devtype |
| |-- dpa_extents |
| |-- force_raw |
| |-- modalias |
| |-- numa_node |
| |-- sector_size |
| |-- size |
| |-- subsystem -> ../../../../../../bus/nd |
| |-- type |
| |-- uevent |
| `-- uuid |
| /sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0 |
| |-- block |
| | `-- pmem0 |
| |-- devtype |
| |-- driver -> ../../../../../../bus/nd/drivers/pmem |
| |-- force_raw |
| |-- modalias |
| |-- numa_node |
| |-- resource |
| |-- size |
| |-- subsystem -> ../../../../../../bus/nd |
| |-- type |
| `-- uevent |
| |
| LIBNDCTL: namespace enumeration example |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Namespaces are indexed relative to their parent region; see the example |
| below. These indexes are mostly static from boot to boot, but the |
| subsystem makes no guarantees in this regard. For a static namespace |
| identifier use its 'uuid' attribute. |
| |
| :: |
| |
| static struct ndctl_namespace |
| *get_namespace_by_id(struct ndctl_region *region, unsigned int id) |
| { |
| struct ndctl_namespace *ndns; |
| |
| ndctl_namespace_foreach(region, ndns) |
| if (ndctl_namespace_get_id(ndns) == id) |
| return ndns; |
| |
| return NULL; |
| } |
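| |
| When a boot-stable identifier is needed, the lookup can key off the |
| 'uuid' attribute instead (a sketch, assuming ndctl_namespace_get_uuid() |
| copies the uuid into a caller-provided buffer and that libuuid's |
| uuid_compare() is available):: |
| |
| static struct ndctl_namespace |
| *get_namespace_by_uuid(struct ndctl_region *region, uuid_t uuid) |
| { |
| struct ndctl_namespace *ndns; |
| uuid_t ndns_uuid; |
| |
| ndctl_namespace_foreach(region, ndns) { |
| ndctl_namespace_get_uuid(ndns, ndns_uuid); |
| if (uuid_compare(ndns_uuid, uuid) == 0) |
| return ndns; |
| } |
| |
| return NULL; |
| } |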
| |
| LIBNDCTL: namespace creation example |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Idle namespaces are automatically created by the kernel if a given |
| region has enough available capacity to create a new namespace. |
| Namespace instantiation involves finding an idle namespace and |
| configuring it. For the most part the setting of namespace attributes |
| can occur in any order, the only constraint is that 'uuid' must be set |
| before 'size'. This enables the kernel to track DPA allocations |
| internally with a static identifier:: |
| |
| static int configure_namespace(struct ndctl_region *region, |
| struct ndctl_namespace *ndns, |
| struct namespace_parameters *parameters) |
| { |
| char devname[50]; |
| |
| snprintf(devname, sizeof(devname), "namespace%d.%d", |
| ndctl_region_get_id(region), parameters->id); |
| |
| ndctl_namespace_set_alt_name(ndns, devname); |
| /* 'uuid' must be set prior to setting size! */ |
| ndctl_namespace_set_uuid(ndns, parameters->uuid); |
| ndctl_namespace_set_size(ndns, parameters->size); |
| /* unlike pmem namespaces, blk namespaces have a sector size */ |
| if (parameters->lbasize) |
| ndctl_namespace_set_sector_size(ndns, parameters->lbasize); |
| return ndctl_namespace_enable(ndns); |
| } |
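| |
| The first step, finding an idle namespace to hand to |
| configure_namespace(), can follow the same pattern as the |
| get_idle_btt() example later in this document (a sketch, assuming the |
| ndctl_namespace_is_enabled() and ndctl_namespace_is_configured() |
| helpers):: |
| |
| static struct ndctl_namespace |
| *get_idle_namespace(struct ndctl_region *region) |
| { |
| struct ndctl_namespace *ndns; |
| |
| ndctl_namespace_foreach(region, ndns) |
| if (!ndctl_namespace_is_enabled(ndns) |
| && !ndctl_namespace_is_configured(ndns)) |
| return ndns; |
| |
| return NULL; |
| } |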
| |
| |
| Why the Term "namespace"? |
| ^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| 1. Why not "volume" for instance? "volume" ran the risk of confusing |
| ND (libnvdimm subsystem) with a volume manager like device-mapper. |
| |
| 2. The term originated to describe the sub-devices that can be created |
| within an NVME controller (see the NVMe specification: |
| https://www.nvmexpress.org/specifications/), and NFIT namespaces are |
| meant to parallel the capabilities and configurability of |
| NVME-namespaces. |
| |
| |
| LIBNVDIMM/LIBNDCTL: Block Translation Table "btt" |
| ------------------------------------------------- |
| |
| A BTT (design document: https://pmem.io/2014/09/23/btt.html) is a stacked |
| block device driver that fronts either the whole block device or a |
| partition of a block device emitted by either a PMEM or BLK NAMESPACE. |
| |
| LIBNVDIMM: btt layout |
| ^^^^^^^^^^^^^^^^^^^^^ |
| |
| Every region will start out with at least one BTT device which is the |
| seed device. To activate it set the "namespace", "uuid", and |
| "sector_size" attributes and then bind the device to the nd_pmem or |
| nd_blk driver depending on the region type:: |
| |
| /sys/devices/platform/nfit_test.0/ndbus0/region0/btt0/ |
| |-- namespace |
| |-- delete |
| |-- devtype |
| |-- modalias |
| |-- numa_node |
| |-- sector_size |
| |-- subsystem -> ../../../../../bus/nd |
| |-- uevent |
| `-- uuid |
| |
| LIBNDCTL: btt creation example |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Similar to namespaces an idle BTT device is automatically created per |
| region. Each time this "seed" btt device is configured and enabled a new |
| seed is created. Creating a BTT configuration involves two steps: |
| finding an idle BTT and assigning it to consume a PMEM or BLK namespace:: |
| |
| static struct ndctl_btt *get_idle_btt(struct ndctl_region *region) |
| { |
| struct ndctl_btt *btt; |
| |
| ndctl_btt_foreach(region, btt) |
| if (!ndctl_btt_is_enabled(btt) |
| && !ndctl_btt_is_configured(btt)) |
| return btt; |
| |
| return NULL; |
| } |
| |
| static int configure_btt(struct ndctl_region *region, |
| struct btt_parameters *parameters) |
| { |
| struct ndctl_btt *btt = get_idle_btt(region); |
| |
| ndctl_btt_set_uuid(btt, parameters->uuid); |
| ndctl_btt_set_sector_size(btt, parameters->sector_size); |
| ndctl_btt_set_namespace(btt, parameters->ndns); |
| /* turn off raw mode device */ |
| ndctl_namespace_disable(parameters->ndns); |
| /* turn on btt access */ |
| return ndctl_btt_enable(btt); |
| } |
| |
| Once instantiated a new inactive btt seed device will appear underneath |
| the region. |
| |
| Once a "namespace" is removed from a BTT that instance of the BTT device |
| will be deleted or otherwise reset to default values. This deletion is |
| only at the device model level. In order to destroy a BTT the "info |
| block" needs to be destroyed. Note, that to destroy a BTT the media |
| needs to be written in raw mode. By default, the kernel will autodetect |
| the presence of a BTT and disable raw mode. This autodetect behavior |
| can be suppressed by enabling raw mode for the namespace via the |
| ndctl_namespace_set_raw_mode() API. |
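| |
| A sketch of that teardown sequence (wipe_info_block() is a hypothetical |
| helper that zeroes the BTT info block through the raw namespace block |
| device, and ndctl_namespace_set_raw_mode() is assumed to take a boolean |
| second argument):: |
| |
| static int destroy_btt(struct ndctl_namespace *ndns) |
| { |
| /* stop the namespace and force raw (BTT-bypassing) access */ |
| ndctl_namespace_disable(ndns); |
| ndctl_namespace_set_raw_mode(ndns, 1); |
| ndctl_namespace_enable(ndns); |
| |
| /* hypothetical helper: zero the BTT info block via the raw device */ |
| wipe_info_block(ndns); |
| |
| /* restore the default BTT-autodetect behavior */ |
| ndctl_namespace_disable(ndns); |
| return ndctl_namespace_set_raw_mode(ndns, 0); |
| } |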
| |
| |
| Summary LIBNDCTL Diagram |
| ------------------------ |
| |
| For the given example above, here is the view of the objects as seen by the |
| LIBNDCTL API:: |
| |
| +---+ |
| |CTX| +---------+ +--------------+ +---------------+ |
| +-+-+ +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" | |
| | | +---------+ +--------------+ +---------------+ |
| +-------+ | | +---------+ +--------------+ +---------------+ |
| | DIMM0 <-+ | +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" | |
| +-------+ | | | +---------+ +--------------+ +---------------+ |
| | DIMM1 <-+ +-v--+ | +---------+ +--------------+ +---------------+ |
| +-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6 "blk2.0" | |
| | DIMM2 <-+ +----+ | +---------+ | +--------------+ +----------------------+ |
| +-------+ | | +-> NAMESPACE2.1 +--> ND5 "blk2.1" | BTT2 | |
| | DIMM3 <-+ | +--------------+ +----------------------+ |
| +-------+ | +---------+ +--------------+ +---------------+ |
| +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4 "blk3.0" | |
| | +---------+ | +--------------+ +----------------------+ |
| | +-> NAMESPACE3.1 +--> ND3 "blk3.1" | BTT1 | |
| | +--------------+ +----------------------+ |
| | +---------+ +--------------+ +---------------+ |
| +-> REGION4 +---> NAMESPACE4.0 +--> ND2 "blk4.0" | |
| | +---------+ +--------------+ +---------------+ |
| | +---------+ +--------------+ +----------------------+ |
| +-> REGION5 +---> NAMESPACE5.0 +--> ND1 "blk5.0" | BTT0 | |
| +---------+ +--------------+ +---------------+------+ |