| .. SPDX-License-Identifier: GPL-2.0 |
| |
| ================================================ |
| ZoneFS - Zone filesystem for Zoned block devices |
| ================================================ |
| |
| Introduction |
| ============ |
| |
| zonefs is a very simple file system exposing each zone of a zoned block device |
| as a file. Unlike a regular POSIX-compliant file system with native zoned block |
| device support (e.g. f2fs), zonefs does not hide the sequential write |
| constraint of zoned block devices to the user. Files representing sequential |
| write zones of the device must be written sequentially starting from the end |
| of the file (append only writes). |
| |
| As such, zonefs is in essence closer to a raw block device access interface |
| than to a full-featured POSIX file system. The goal of zonefs is to simplify |
| the implementation of zoned block device support in applications by replacing |
| raw block device file accesses with a richer file API, avoiding relying on |
| direct block device file ioctls which may be more obscure to developers. One |
| example of this approach is the implementation of LSM (log-structured merge) |
| tree structures (such as used in RocksDB and LevelDB) on zoned block devices |
| by allowing SSTables to be stored in a zone file similarly to a regular file |
| system rather than as a range of sectors of the entire disk. The introduction |
| of the higher level construct "one file is one zone" can help reducing the |
| amount of changes needed in the application as well as introducing support for |
| different application programming languages. |
| |
| Zoned block devices |
| ------------------- |
| |
| Zoned storage devices belong to a class of storage devices with an address |
| space that is divided into zones. A zone is a group of consecutive LBAs and all |
| zones are contiguous (there are no LBA gaps). Zones may have different types. |
| |
| * Conventional zones: there are no access constraints to LBAs belonging to |
| conventional zones. Any read or write access can be executed, similarly to a |
| regular block device. |
| * Sequential zones: these zones accept random reads but must be written |
| sequentially. Each sequential zone has a write pointer maintained by the |
| device that keeps track of the mandatory start LBA position of the next write |
| to the device. As a result of this write constraint, LBAs in a sequential zone |
| cannot be overwritten. Sequential zones must first be erased using a special |
| command (zone reset) before rewriting. |
| |
| Zoned storage devices can be implemented using various recording and media |
| technologies. The most common form of zoned storage today uses the SCSI Zoned |
| Block Commands (ZBC) and Zoned ATA Commands (ZAC) interfaces on Shingled |
| Magnetic Recording (SMR) HDDs. |
| |
| Solid State Disks (SSD) storage devices can also implement a zoned interface |
| to, for instance, reduce internal write amplification due to garbage collection. |
| The NVMe Zoned NameSpace (ZNS) is a technical proposal of the NVMe standard |
| committee aiming at adding a zoned storage interface to the NVMe protocol. |
| |
| Zonefs Overview |
| =============== |
| |
| Zonefs exposes the zones of a zoned block device as files. The files |
| representing zones are grouped by zone type, which are themselves represented |
| by sub-directories. This file structure is built entirely using zone information |
| provided by the device and so does not require any complex on-disk metadata |
| structure. |
| |
| On-disk metadata |
| ---------------- |
| |
| zonefs on-disk metadata is reduced to an immutable super block which |
| persistently stores a magic number and optional feature flags and values. On |
| mount, zonefs uses blkdev_report_zones() to obtain the device zone configuration |
| and populates the mount point with a static file tree solely based on this |
| information. File sizes come from the device zone type and write pointer |
| position managed by the device itself. |
| |
| The super block is always written on disk at sector 0. The first zone of the |
| device storing the super block is never exposed as a zone file by zonefs. If |
| the zone containing the super block is a sequential zone, the mkzonefs format |
| tool always "finishes" the zone, that is, it transitions the zone to a full |
| state to make it read-only, preventing any data write. |
| |
| Zone type sub-directories |
| ------------------------- |
| |
| Files representing zones of the same type are grouped together under the same |
| sub-directory automatically created on mount. |
| |
| For conventional zones, the sub-directory "cnv" is used. This directory is |
| however created if and only if the device has usable conventional zones. If |
| the device only has a single conventional zone at sector 0, the zone will not |
| be exposed as a file as it will be used to store the zonefs super block. For |
| such devices, the "cnv" sub-directory will not be created. |
| |
| For sequential write zones, the sub-directory "seq" is used. |
| |
| These two directories are the only directories that exist in zonefs. Users |
| cannot create other directories and cannot rename nor delete the "cnv" and |
| "seq" sub-directories. |
| |
| The size of the directories indicated by the st_size field of struct stat, |
| obtained with the stat() or fstat() system calls, indicates the number of files |
| existing under the directory. |
| |
| Zone files |
| ---------- |
| |
| Zone files are named using the number of the zone they represent within the set |
| of zones of a particular type. That is, both the "cnv" and "seq" directories |
| contain files named "0", "1", "2", ... The file numbers also represent |
| increasing zone start sector on the device. |
| |
| All read and write operations to zone files are not allowed beyond the file |
| maximum size, that is, beyond the zone capacity. Any access exceeding the zone |
| capacity is failed with the -EFBIG error. |
| |
| Creating, deleting, renaming or modifying any attribute of files and |
| sub-directories is not allowed. |
| |
| The number of blocks of a file as reported by stat() and fstat() indicates the |
| capacity of the zone file, or in other words, the maximum file size. |
| |
| Conventional zone files |
| ----------------------- |
| |
| The size of conventional zone files is fixed to the size of the zone they |
| represent. Conventional zone files cannot be truncated. |
| |
| These files can be randomly read and written using any type of I/O operation: |
| buffered I/Os, direct I/Os, memory mapped I/Os (mmap), etc. There are no I/O |
| constraint for these files beyond the file size limit mentioned above. |
| |
| Sequential zone files |
| --------------------- |
| |
| The size of sequential zone files grouped in the "seq" sub-directory represents |
| the file's zone write pointer position relative to the zone start sector. |
| |
| Sequential zone files can only be written sequentially, starting from the file |
| end, that is, write operations can only be append writes. Zonefs makes no |
| attempt at accepting random writes and will fail any write request that has a |
| start offset not corresponding to the end of the file, or to the end of the last |
| write issued and still in-flight (for asynchronous I/O operations). |
| |
| Since dirty page writeback by the page cache does not guarantee a sequential |
| write pattern, zonefs prevents buffered writes and writeable shared mappings |
| on sequential files. Only direct I/O writes are accepted for these files. |
| zonefs relies on the sequential delivery of write I/O requests to the device |
| implemented by the block layer elevator. An elevator implementing the sequential |
| write feature for zoned block device (ELEVATOR_F_ZBD_SEQ_WRITE elevator feature) |
| must be used. This type of elevator (e.g. mq-deadline) is set by default |
| for zoned block devices on device initialization. |
| |
| There are no restrictions on the type of I/O used for read operations in |
| sequential zone files. Buffered I/Os, direct I/Os and shared read mappings are |
| all accepted. |
| |
| Truncating sequential zone files is allowed only down to 0, in which case, the |
| zone is reset to rewind the file zone write pointer position to the start of |
| the zone, or up to the zone capacity, in which case the file's zone is |
| transitioned to the FULL state (finish zone operation). |
| |
| Format options |
| -------------- |
| |
| Several optional features of zonefs can be enabled at format time. |
| |
| * Conventional zone aggregation: ranges of contiguous conventional zones can be |
| aggregated into a single larger file instead of the default one file per zone. |
| * File ownership: The owner UID and GID of zone files is by default 0 (root) |
| but can be changed to any valid UID/GID. |
| * File access permissions: the default 640 access permissions can be changed. |
| |
| IO error handling |
| ----------------- |
| |
| Zoned block devices may fail I/O requests for reasons similar to regular block |
| devices, e.g. due to bad sectors. However, in addition to such known I/O |
| failure pattern, the standards governing zoned block devices behavior define |
| additional conditions that result in I/O errors. |
| |
| * A zone may transition to the read-only condition (BLK_ZONE_COND_READONLY): |
| While the data already written in the zone is still readable, the zone can |
| no longer be written. No user action on the zone (zone management command or |
| read/write access) can change the zone condition back to a normal read/write |
| state. While the reasons for the device to transition a zone to read-only |
| state are not defined by the standards, a typical cause for such transition |
| would be a defective write head on an HDD (all zones under this head are |
| changed to read-only). |
| |
| * A zone may transition to the offline condition (BLK_ZONE_COND_OFFLINE): |
| An offline zone cannot be read nor written. No user action can transition an |
| offline zone back to an operational good state. Similarly to zone read-only |
| transitions, the reasons for a drive to transition a zone to the offline |
| condition are undefined. A typical cause would be a defective read-write head |
| on an HDD causing all zones on the platter under the broken head to be |
| inaccessible. |
| |
| * Unaligned write errors: These errors result from the host issuing write |
| requests with a start sector that does not correspond to a zone write pointer |
| position when the write request is executed by the device. Even though zonefs |
| enforces sequential file write for sequential zones, unaligned write errors |
| may still happen in the case of a partial failure of a very large direct I/O |
| operation split into multiple BIOs/requests or asynchronous I/O operations. |
| If one of the write request within the set of sequential write requests |
| issued to the device fails, all write requests queued after it will |
| become unaligned and fail. |
| |
| * Delayed write errors: similarly to regular block devices, if the device side |
| write cache is enabled, write errors may occur in ranges of previously |
| completed writes when the device write cache is flushed, e.g. on fsync(). |
| Similarly to the previous immediate unaligned write error case, delayed write |
| errors can propagate through a stream of cached sequential data for a zone |
| causing all data to be dropped after the sector that caused the error. |
| |
| All I/O errors detected by zonefs are notified to the user with an error code |
| return for the system call that triggered or detected the error. The recovery |
| actions taken by zonefs in response to I/O errors depend on the I/O type (read |
| vs write) and on the reason for the error (bad sector, unaligned writes or zone |
| condition change). |
| |
| * For read I/O errors, zonefs does not execute any particular recovery action, |
| but only if the file zone is still in a good condition and there is no |
| inconsistency between the file inode size and its zone write pointer position. |
| If a problem is detected, I/O error recovery is executed (see below table). |
| |
| * For write I/O errors, zonefs I/O error recovery is always executed. |
| |
| * A zone condition change to read-only or offline also always triggers zonefs |
| I/O error recovery. |
| |
| Zonefs minimal I/O error recovery may change a file size and file access |
| permissions. |
| |
| * File size changes: |
| Immediate or delayed write errors in a sequential zone file may cause the file |
| inode size to be inconsistent with the amount of data successfully written in |
| the file zone. For instance, the partial failure of a multi-BIO large write |
| operation will cause the zone write pointer to advance partially, even though |
| the entire write operation will be reported as failed to the user. In such |
| case, the file inode size must be advanced to reflect the zone write pointer |
| change and eventually allow the user to restart writing at the end of the |
| file. |
| A file size may also be reduced to reflect a delayed write error detected on |
| fsync(): in this case, the amount of data effectively written in the zone may |
| be less than originally indicated by the file inode size. After such I/O |
| error, zonefs always fixes the file inode size to reflect the amount of data |
| persistently stored in the file zone. |
| |
| * Access permission changes: |
| A zone condition change to read-only is indicated with a change in the file |
| access permissions to render the file read-only. This disables changes to the |
| file attributes and data modification. For offline zones, all permissions |
| (read and write) to the file are disabled. |
| |
| Further action taken by zonefs I/O error recovery can be controlled by the user |
| with the "errors=xxx" mount option. The table below summarizes the result of |
| zonefs I/O error processing depending on the mount option and on the zone |
| conditions:: |
| |
| +--------------+-----------+-----------------------------------------+ |
| | | | Post error state | |
| | "errors=xxx" | device | access permissions | |
| | mount | zone | file file device zone | |
| | option | condition | size read write read write | |
| +--------------+-----------+-----------------------------------------+ |
| | | good | fixed yes no yes yes | |
| | remount-ro | read-only | as is yes no yes no | |
| | (default) | offline | 0 no no no no | |
| +--------------+-----------+-----------------------------------------+ |
| | | good | fixed yes no yes yes | |
| | zone-ro | read-only | as is yes no yes no | |
| | | offline | 0 no no no no | |
| +--------------+-----------+-----------------------------------------+ |
| | | good | 0 no no yes yes | |
| | zone-offline | read-only | 0 no no yes no | |
| | | offline | 0 no no no no | |
| +--------------+-----------+-----------------------------------------+ |
| | | good | fixed yes yes yes yes | |
| | repair | read-only | as is yes no yes no | |
| | | offline | 0 no no no no | |
| +--------------+-----------+-----------------------------------------+ |
| |
| Further notes: |
| |
| * The "errors=remount-ro" mount option is the default behavior of zonefs I/O |
| error processing if no errors mount option is specified. |
| * With the "errors=remount-ro" mount option, the change of the file access |
| permissions to read-only applies to all files. The file system is remounted |
| read-only. |
| * Access permission and file size changes due to the device transitioning zones |
| to the offline condition are permanent. Remounting or reformatting the device |
| with mkfs.zonefs (mkzonefs) will not change back offline zone files to a good |
| state. |
| * File access permission changes to read-only due to the device transitioning |
| zones to the read-only condition are permanent. Remounting or reformatting |
| the device will not re-enable file write access. |
| * File access permission changes implied by the remount-ro, zone-ro and |
| zone-offline mount options are temporary for zones in a good condition. |
| Unmounting and remounting the file system will restore the previous default |
| (format time values) access rights to the files affected. |
| * The repair mount option triggers only the minimal set of I/O error recovery |
| actions, that is, file size fixes for zones in a good condition. Zones |
| indicated as being read-only or offline by the device still imply changes to |
| the zone file access permissions as noted in the table above. |
| |
| Mount options |
| ------------- |
| |
| zonefs defines several mount options: |
| * errors=<behavior> |
| * explicit-open |
| |
| "errors=<behavior>" option |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| The "errors=<behavior>" option mount option allows the user to specify zonefs |
| behavior in response to I/O errors, inode size inconsistencies or zone |
| condition changes. The defined behaviors are as follow: |
| |
| * remount-ro (default) |
| * zone-ro |
| * zone-offline |
| * repair |
| |
| The run-time I/O error actions defined for each behavior are detailed in the |
| previous section. Mount time I/O errors will cause the mount operation to fail. |
| The handling of read-only zones also differs between mount-time and run-time. |
| If a read-only zone is found at mount time, the zone is always treated in the |
| same manner as offline zones, that is, all accesses are disabled and the zone |
| file size set to 0. This is necessary as the write pointer of read-only zones |
| is defined as invalib by the ZBC and ZAC standards, making it impossible to |
| discover the amount of data that has been written to the zone. In the case of a |
| read-only zone discovered at run-time, as indicated in the previous section. |
| The size of the zone file is left unchanged from its last updated value. |
| |
| "explicit-open" option |
| ~~~~~~~~~~~~~~~~~~~~~~ |
| |
| A zoned block device (e.g. an NVMe Zoned Namespace device) may have limits on |
| the number of zones that can be active, that is, zones that are in the |
| implicit open, explicit open or closed conditions. This potential limitation |
| translates into a risk for applications to see write IO errors due to this |
| limit being exceeded if the zone of a file is not already active when a write |
| request is issued by the user. |
| |
| To avoid these potential errors, the "explicit-open" mount option forces zones |
| to be made active using an open zone command when a file is opened for writing |
| for the first time. If the zone open command succeeds, the application is then |
| guaranteed that write requests can be processed. Conversely, the |
| "explicit-open" mount option will result in a zone close command being issued |
| to the device on the last close() of a zone file if the zone is not full nor |
| empty. |
| |
| Runtime sysfs attributes |
| ------------------------ |
| |
| zonefs defines several sysfs attributes for mounted devices. All attributes |
| are user readable and can be found in the directory /sys/fs/zonefs/<dev>/, |
| where <dev> is the name of the mounted zoned block device. |
| |
| The attributes defined are as follows. |
| |
| * **max_wro_seq_files**: This attribute reports the maximum number of |
| sequential zone files that can be open for writing. This number corresponds |
| to the maximum number of explicitly or implicitly open zones that the device |
| supports. A value of 0 means that the device has no limit and that any zone |
| (any file) can be open for writing and written at any time, regardless of the |
| state of other zones. When the *explicit-open* mount option is used, zonefs |
| will fail any open() system call requesting to open a sequential zone file for |
| writing when the number of sequential zone files already open for writing has |
| reached the *max_wro_seq_files* limit. |
| * **nr_wro_seq_files**: This attribute reports the current number of sequential |
| zone files open for writing. When the "explicit-open" mount option is used, |
| this number can never exceed *max_wro_seq_files*. If the *explicit-open* |
| mount option is not used, the reported number can be greater than |
| *max_wro_seq_files*. In such case, it is the responsibility of the |
| application to not write simultaneously more than *max_wro_seq_files* |
| sequential zone files. Failure to do so can result in write errors. |
| * **max_active_seq_files**: This attribute reports the maximum number of |
| sequential zone files that are in an active state, that is, sequential zone |
| files that are partially written (not empty nor full) or that have a zone that |
| is explicitly open (which happens only if the *explicit-open* mount option is |
| used). This number is always equal to the maximum number of active zones that |
| the device supports. A value of 0 means that the mounted device has no limit |
| on the number of sequential zone files that can be active. |
| * **nr_active_seq_files**: This attributes reports the current number of |
| sequential zone files that are active. If *max_active_seq_files* is not 0, |
| then the value of *nr_active_seq_files* can never exceed the value of |
| *nr_active_seq_files*, regardless of the use of the *explicit-open* mount |
| option. |
| |
| Zonefs User Space Tools |
| ======================= |
| |
| The mkzonefs tool is used to format zoned block devices for use with zonefs. |
| This tool is available on Github at: |
| |
| https://github.com/damien-lemoal/zonefs-tools |
| |
| zonefs-tools also includes a test suite which can be run against any zoned |
| block device, including null_blk block device created with zoned mode. |
| |
| Examples |
| -------- |
| |
| The following formats a 15TB host-managed SMR HDD with 256 MB zones |
| with the conventional zones aggregation feature enabled:: |
| |
| # mkzonefs -o aggr_cnv /dev/sdX |
| # mount -t zonefs /dev/sdX /mnt |
| # ls -l /mnt/ |
| total 0 |
| dr-xr-xr-x 2 root root 1 Nov 25 13:23 cnv |
| dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq |
| |
| The size of the zone files sub-directories indicate the number of files |
| existing for each type of zones. In this example, there is only one |
| conventional zone file (all conventional zones are aggregated under a single |
| file):: |
| |
| # ls -l /mnt/cnv |
| total 137101312 |
| -rw-r----- 1 root root 140391743488 Nov 25 13:23 0 |
| |
| This aggregated conventional zone file can be used as a regular file:: |
| |
| # mkfs.ext4 /mnt/cnv/0 |
| # mount -o loop /mnt/cnv/0 /data |
| |
| The "seq" sub-directory grouping files for sequential write zones has in this |
| example 55356 zones:: |
| |
| # ls -lv /mnt/seq |
| total 14511243264 |
| -rw-r----- 1 root root 0 Nov 25 13:23 0 |
| -rw-r----- 1 root root 0 Nov 25 13:23 1 |
| -rw-r----- 1 root root 0 Nov 25 13:23 2 |
| ... |
| -rw-r----- 1 root root 0 Nov 25 13:23 55354 |
| -rw-r----- 1 root root 0 Nov 25 13:23 55355 |
| |
| For sequential write zone files, the file size changes as data is appended at |
| the end of the file, similarly to any regular file system:: |
| |
| # dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct |
| 1+0 records in |
| 1+0 records out |
| 4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00044121 s, 9.3 MB/s |
| |
| # ls -l /mnt/seq/0 |
| -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0 |
| |
| The written file can be truncated to the zone size, preventing any further |
| write operation:: |
| |
| # truncate -s 268435456 /mnt/seq/0 |
| # ls -l /mnt/seq/0 |
| -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0 |
| |
| Truncation to 0 size allows freeing the file zone storage space and restart |
| append-writes to the file:: |
| |
| # truncate -s 0 /mnt/seq/0 |
| # ls -l /mnt/seq/0 |
| -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0 |
| |
| Since files are statically mapped to zones on the disk, the number of blocks |
| of a file as reported by stat() and fstat() indicates the capacity of the file |
| zone:: |
| |
| # stat /mnt/seq/0 |
| File: /mnt/seq/0 |
| Size: 0 Blocks: 524288 IO Block: 4096 regular empty file |
| Device: 870h/2160d Inode: 50431 Links: 1 |
| Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root) |
| Access: 2019-11-25 13:23:57.048971997 +0900 |
| Modify: 2019-11-25 13:52:25.553805765 +0900 |
| Change: 2019-11-25 13:52:25.553805765 +0900 |
| Birth: - |
| |
| The number of blocks of the file ("Blocks") in units of 512B blocks gives the |
| maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone |
| capacity in this example. Of note is that the "IO block" field always |
| indicates the minimum I/O size for writes and corresponds to the device |
| physical sector size. |