| .. SPDX-License-Identifier: GPL-2.0 |
| |
| ====================================== |
| EROFS - Enhanced Read-Only File System |
| ====================================== |
| |
| Overview |
| ======== |
| |
| EROFS filesystem stands for Enhanced Read-Only File System. It aims to form a |
| generic read-only filesystem solution for various read-only use cases instead |
| of just focusing on storage space saving without considering any side effects |
| of runtime performance. |
| |
| It is designed to meet the needs of flexibility, feature extensibility and user |
| payload friendliness, etc. Apart from those, it is still kept as a simple |
| random-access friendly high-performance filesystem to get rid of unneeded I/O |
| amplification and memory-resident overhead compared to similar approaches. |
| |
| It is implemented to be a better choice for the following scenarios: |
| |
| - read-only storage media; |
| |
| - part of a fully trusted read-only solution, which needs to be immutable |
| and bit-for-bit identical to the official golden image of a release for |
| security or other reasons; |
| |
| - deployments that need to minimize extra storage space with guaranteed |
| end-to-end performance by using compact layout, transparent file compression |
| and direct access, especially embedded devices with limited memory and |
| high-density hosts with numerous containers. |
| |
| Here are the main features of EROFS: |
| |
| - Little endian on-disk design; |
| |
| - Block-based distribution and file-based distribution over fscache are |
| supported; |
| |
| - Support multiple devices to refer to external blobs, which can be used |
| for container images; |
| |
| - 32-bit block addresses for each device, therefore 16TiB address space at |
| most with 4KiB block size for now; |
| |
| - Two inode layouts for different requirements: |
| |
| =====================  ============  ====================================== |
|                        compact (v1)  extended (v2) |
| =====================  ============  ====================================== |
| Inode metadata size    32 bytes      64 bytes |
| Max file size          4 GiB         16 EiB (also limited by max. vol size) |
| Max uids/gids          65536         4294967296 |
| Per-inode timestamp    no            yes (64 + 32-bit timestamp) |
| Max hardlinks          65536         4294967296 |
| Metadata reserved      8 bytes       18 bytes |
| =====================  ============  ====================================== |
| |
| - Support extended attributes as an option; |
| |
| - Support a bloom filter that speeds up negative extended attribute lookups; |
| |
| - Support POSIX.1e ACLs by using extended attributes; |
| |
| - Support transparent data compression as an option: |
| LZ4, MicroLZMA and DEFLATE algorithms can be used on a per-file basis. In |
| addition, in-place decompression is also supported to avoid bouncing |
| compressed buffers and unnecessary page cache thrashing; |
| |
| - Support chunk-based data deduplication and rolling-hash compressed data |
| deduplication; |
| |
| - Support tail-packing inline data, as opposed to byte-addressed unaligned |
| metadata or smaller block size alternatives; |
| |
| - Support merging tail-end data into a special inode as fragments; |
| |
| - Support large folios to make use of THPs (Transparent Hugepages); |
| |
| - Support direct I/O on uncompressed files to avoid double caching for loop |
| devices; |
| |
| - Support FSDAX on uncompressed images for secure containers and ramdisks in |
| order to get rid of unnecessary page cache; |
| |
| - Support file-based on-demand loading with the Fscache infrastructure. |
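
The limits quoted in the feature list above follow from the stated field widths. The snippet below is only a plain arithmetic check, not EROFS code; the constant names are made up for illustration, and the 16-bit uid/gid width is inferred from the 65536 figure in the table.

```python
KIB, GIB, TIB = 2**10, 2**30, 2**40

block_size = 4 * KIB             # 4KiB blocks, as in the text
max_blocks = 2**32               # 32-bit block addresses per device

# Addressable space per device with 4KiB blocks.
address_space = max_blocks * block_size

# Compact (v1) inode limits from the table, expressed as powers of two.
compact_max_file_size = 2**32    # 4 GiB
compact_max_ids = 2**16          # 65536 uids/gids (inferred 16-bit fields)

print(address_space // TIB, "TiB per device")
print(compact_max_file_size // GIB, "GiB max compact file size")
```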
| |
| The following git tree provides the file system user-space tools under |
| development, such as a formatting tool (mkfs.erofs), an on-disk consistency & |
| compatibility checking tool (fsck.erofs), and a debugging tool (dump.erofs): |
| |
| - git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git |
| |
| For more information, please also refer to the documentation site: |
| |
| - https://erofs.docs.kernel.org |
| |
| Bugs and patches are welcome. Please send them to the following |
| linux-erofs mailing list: |
| |
| - linux-erofs mailing list <linux-erofs@lists.ozlabs.org> |
| |
| Mount options |
| ============= |
| |
| =================== ========================================================= |
| (no)user_xattr Set up Extended User Attributes. Note: xattr is enabled |
| by default if CONFIG_EROFS_FS_XATTR is selected. |
| (no)acl Set up POSIX Access Control List. Note: acl is enabled |
| by default if CONFIG_EROFS_FS_POSIX_ACL is selected. |
| cache_strategy=%s Select a strategy for cached decompression from now on: |
| |
| ========== ============================================= |
| disabled In-place I/O decompression only; |
| readahead Cache the last incomplete compressed physical |
| cluster for further reading. It still does |
| in-place I/O decompression for the remaining |
| compressed physical clusters; |
| readaround Cache both ends of incomplete compressed |
| physical clusters for further reading. |
| It still does in-place I/O decompression |
| for the remaining compressed physical clusters. |
| ========== ============================================= |
| dax={always,never} Use direct access (no page cache). See |
| Documentation/filesystems/dax.rst. |
| dax A legacy option which is an alias for ``dax=always``. |
| device=%s Specify a path to an extra device to be used together. |
| fsid=%s Specify a filesystem image ID for Fscache back-end. |
| domain_id=%s Specify a domain ID in fscache mode so that different images |
| with the same blobs under a given domain ID can share storage. |
| =================== ========================================================= |
| |
| Sysfs Entries |
| ============= |
| |
| Information about mounted erofs file systems can be found in /sys/fs/erofs. |
| Each mounted filesystem will have a directory in /sys/fs/erofs based on its |
| device name (e.g., /sys/fs/erofs/sda). |
| (see also Documentation/ABI/testing/sysfs-fs-erofs) |
| |
| On-disk details |
| =============== |
| |
| Summary |
| ------- |
| Different from other read-only file systems, an EROFS volume is designed |
| to be as simple as possible:: |
| |
| |-> aligned with the block size |
| ____________________________________________________________ |
| | |SB| | ... | Metadata | ... | Data | Metadata | ... | Data | |
| |_|__|_|_____|__________|_____|______|__________|_____|______| |
| 0 +1K |
| |
| All data areas should be aligned with the block size, but metadata areas |
| may not. All metadata can now be observed in two different spaces (views): |
| |
| 1. Inode metadata space |
| |
| Each valid inode should be aligned with an inode slot, which is a fixed |
| value (32 bytes) and designed to be kept in line with compact inode size. |
| |
| Each inode can be directly found with the following formula: |
| inode offset = meta_blkaddr * block_size + 32 * nid |
| |
| :: |
| |
| |-> aligned with 8B |
| |-> followed closely |
| + meta_blkaddr blocks |-> another slot |
| _____________________________________________________________________ |
| | ... | inode | xattrs | extents | data inline | ... | inode ... |
| |________|_______|(optional)|(optional)|__(optional)_|_____|__________ |
| |-> aligned with the inode slot size |
| . . |
| . . |
| . . |
| . . |
| . . |
| . . |
| .____________________________________________________|-> aligned with 4B |
| | xattr_ibody_header | shared xattrs | inline xattrs | |
| |____________________|_______________|_______________| |
| |-> 12 bytes <-|->x * 4 bytes<-| . |
| . . . |
| . . . |
| . . . |
| ._______________________________.______________________. |
| | id | id | id | id | ... | id | ent | ... | ent| ... | |
| |____|____|____|____|______|____|_____|_____|____|_____| |
| |-> aligned with 4B |
| |-> aligned with 4B |
| |
| An inode can be 32 or 64 bytes, which can be distinguished by a field |
| common to all inode versions -- i_format:: |
| |
| __________________ __________________ |
| | i_format | | i_format | |
| |__________________| |__________________| |
| | ... | | ... | |
| | | | | |
| |__________________| 32 bytes | | |
| | | |
| |__________________| 64 bytes |
| |
| Xattrs, extents and inline data are placed after the corresponding inode with |
| proper alignment, and they are optional depending on the data mapping. |
| Currently, 5 data layouts are supported in total: |
| |
| == ==================================================================== |
| 0 flat file data without data inline (no extent); |
| 1 fixed-sized output data compression (with non-compacted indexes); |
| 2 flat file data with tail packing data inline (no extent); |
| 3 fixed-sized output data compression (with compacted indexes, v5.3+); |
| 4 chunk-based file (v5.15+). |
| == ==================================================================== |
| |
| The size of the optional xattrs is indicated by i_xattr_count in inode |
| header. Large xattrs or xattrs shared by many different files can be |
| stored in shared xattrs metadata rather than inlined right after inode. |
| |
| 2. Shared xattrs metadata space |
| |
| The shared xattrs space is similar to the above inode space. It starts at |
| a specific block indicated by xattr_blkaddr, and entries are organized one |
| by one with proper alignment. |
| |
| Each shared xattr can also be directly found by the following formula: |
| xattr offset = xattr_blkaddr * block_size + 4 * xattr_id |
| |
| :: |
| |
| |-> aligned by 4 bytes |
| + xattr_blkaddr blocks |-> aligned with 4 bytes |
| _________________________________________________________________________ |
| | ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ... |
| |________|_____________|_____________|_____|______________|_______________ |
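
The two lookup formulas above can be sketched as follows. This is a minimal illustration rather than kernel code: the function names are hypothetical, and the bit layout assumed for i_format (bit 0 selecting the inode version, bits 1-3 holding the data layout) is an assumption taken from a reading of erofs_fs.h, not something stated in this document.

```python
EROFS_INODE_SLOT_SIZE = 32   # inode slot size in bytes, from the text
SHARED_XATTR_UNIT = 4        # shared xattrs are addressed in 4-byte units


def inode_offset(meta_blkaddr, block_size, nid):
    """Byte offset of inode `nid`, counted from the start of the image."""
    return meta_blkaddr * block_size + EROFS_INODE_SLOT_SIZE * nid


def shared_xattr_offset(xattr_blkaddr, block_size, xattr_id):
    """Byte offset of shared xattr `xattr_id`, counted from the image start."""
    return xattr_blkaddr * block_size + SHARED_XATTR_UNIT * xattr_id


def decode_i_format(i_format):
    """Return (inode size in bytes, data layout number).

    Assumed layout: bit 0 = version (0 compact, 1 extended),
    bits 1-3 = data layout (0..4 in the table above).
    """
    inode_size = 64 if i_format & 1 else 32
    datalayout = (i_format >> 1) & 0x7
    return inode_size, datalayout


print(inode_offset(meta_blkaddr=1, block_size=4096, nid=36))
print(decode_i_format(0b0101))
```

Because an extended inode occupies two 32-byte slots, consecutive nids are not necessarily consecutive inodes; the nid is simply a slot number.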
| |
| Directories |
| ----------- |
| All directories are now organized in a compact on-disk format. Note that |
| each directory block is divided into index and name areas in order to support |
| random file lookup, and all directory entries are _strictly_ recorded in |
| alphabetical order in order to support an improved prefix binary search |
| algorithm (refer to the related source code for details). |
| |
| :: |
| |
| ___________________________ |
| / | |
| / ______________|________________ |
| / / | nameoff1 | nameoffN-1 |
| ____________.______________._______________v________________v__________ |
| | dirent | dirent | ... | dirent | filename | filename | ... | filename | |
| |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____| |
| \ ^ |
| \ | * could have |
| \ | trailing '\0' |
| \________________________| nameoff0 |
| Directory block |
| |
| Note that apart from the offset of the first filename, nameoff0 also indicates |
| the total number of directory entries in this block, so there is no need to |
| introduce another on-disk field for that. |
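
The directory block layout described above can be parsed roughly as follows. This is a sketch under assumptions: the 12-byte dirent layout (8-byte nid, 2-byte nameoff, 1-byte file type, 1 reserved byte, little endian) comes from a reading of erofs_fs.h rather than from this document, the helper names are hypothetical, and the lookup uses a plain binary search instead of the prefix-optimized variant mentioned above.

```python
import bisect
import struct

DIRENT_FMT = "<QHBx"                       # assumed: nid, nameoff, file_type, pad
DIRENT_SIZE = struct.calcsize(DIRENT_FMT)  # 12 bytes


def parse_dir_block(block):
    """Return [(name, nid, file_type)] for one directory block."""
    # nameoff0 is where the name area starts, so it also gives the entry count.
    (nameoff0,) = struct.unpack_from("<H", block, 8)
    count = nameoff0 // DIRENT_SIZE
    dirents = [struct.unpack_from(DIRENT_FMT, block, i * DIRENT_SIZE)
               for i in range(count)]
    entries = []
    for i, (nid, nameoff, ftype) in enumerate(dirents):
        # A name ends where the next one starts; the last may end with '\0'.
        end = dirents[i + 1][1] if i + 1 < count else len(block)
        name = block[nameoff:end].split(b"\0", 1)[0]
        entries.append((name, nid, ftype))
    return entries


def lookup(entries, name):
    """Binary search over the sorted names of one block."""
    names = [e[0] for e in entries]
    i = bisect.bisect_left(names, name)
    return entries[i] if i < len(names) and names[i] == name else None


def build_dir_block(items, block_size=64):
    """Demo helper (not an EROFS tool): pack sorted entries into one block."""
    items = sorted(items)
    index, names = b"", b""
    name_area = len(items) * DIRENT_SIZE
    for name, nid, ftype in items:
        index += struct.pack(DIRENT_FMT, nid, name_area + len(names), ftype)
        names += name
    return (index + names).ljust(block_size, b"\0")


block = build_dir_block([(b"b.txt", 7, 1), (b"a", 5, 2), (b"c", 9, 1)])
entries = parse_dir_block(block)
print(entries)
print(lookup(entries, b"b.txt"))
```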
| |
| Chunk-based files |
| ----------------- |
| In order to support chunk-based data deduplication, a new inode data layout has |
| been supported since Linux v5.15: files are split into equal-sized data chunks, |
| with the ``extents`` area of the inode metadata indicating how to get the chunk |
| data. It can be either a simple 4-byte block address array or an array in the |
| 8-byte chunk index form (see struct erofs_inode_chunk_index in erofs_fs.h for |
| more details). |
| |
| By the way, chunk-based files are all uncompressed for now. |
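
For the simpler 4-byte block address array form, mapping a file offset to a position in the image can be sketched as below. The function name and the sample values are hypothetical, and the sketch assumes each array entry holds the starting block address of its chunk.

```python
def map_chunk_offset(file_offset, chunk_size, block_size, chunk_blkaddrs):
    """Map a file offset to a byte offset in the image.

    `chunk_blkaddrs` models the block address array stored in the
    ``extents`` area: one starting block address per chunk.
    """
    chunk_idx = file_offset // chunk_size
    blkaddr = chunk_blkaddrs[chunk_idx]
    return blkaddr * block_size + file_offset % chunk_size


# Chunks 0 and 2 have identical content, so both entries point at block 100:
# the data is stored once and referenced twice.
addrs = [100, 250, 100]
print(map_chunk_offset(8192 + 5, chunk_size=8192, block_size=4096,
                       chunk_blkaddrs=addrs))
```

Deduplication falls out of the indirection: identical chunks, whether in the same file or in different files, can share one on-disk copy.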
| |
| Long extended attribute name prefixes |
| ------------------------------------- |
| There are use cases where extended attributes with different values can have |
| only a few common prefixes (such as overlayfs xattrs). The predefined prefixes |
| work inefficiently in both image size and runtime performance in such cases. |
| |
| The long xattr name prefixes feature is introduced to address this issue. The |
| overall idea is that, apart from the existing predefined prefixes, the xattr |
| entry could also refer to user-specified long xattr name prefixes, e.g. |
| "trusted.overlay.". |
| |
| When referring to a long xattr name prefix, the highest bit (bit 7) of |
| erofs_xattr_entry.e_name_index is set, while the lower bits (bits 0-6) as a whole |
| represent the index of the referred long name prefix among all long name |
| prefixes. Therefore, only the trailing part of the name apart from the long |
| xattr name prefix is stored in erofs_xattr_entry.e_name, which could be empty if |
| the full xattr name exactly matches its long xattr name prefix. |
| |
| All long xattr prefixes are stored one by one in the packed inode as long as |
| the packed inode is valid, or in the meta inode otherwise. The |
| xattr_prefix_count (of the on-disk superblock) indicates the total number of |
| long xattr name prefixes, while (xattr_prefix_start * 4) indicates the start |
| offset of long name prefixes in the packed/meta inode. Note that long extended |
| attribute name prefixes are disabled if xattr_prefix_count is 0. |
| |
| Each long name prefix is stored in the format: ALIGN({__le16 len, data}, 4), |
| where len represents the total size of the data part. The data part is actually |
| represented by 'struct erofs_xattr_long_prefix', where base_index represents the |
| index of the predefined xattr name prefix, e.g. EROFS_XATTR_INDEX_TRUSTED for |
| "trusted.overlay." long name prefix, while the infix string keeps the string |
| after stripping the short prefix, e.g. "overlay." for the example above. |
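
The encoding described in this section can be sketched as follows. This is an illustration only: the helper names are hypothetical, the numeric values in PREDEFINED (1 for "user.", 4 for "trusted.", 6 for "security.") are assumptions taken from a reading of erofs_fs.h and cover only a subset, and `start` stands for (xattr_prefix_start * 4).

```python
import struct

# Assumed subset of predefined name indexes and their short prefixes.
PREDEFINED = {1: "user.", 4: "trusted.", 6: "security."}
LONG_PREFIX_FLAG = 0x80   # bit 7 of e_name_index
LONG_PREFIX_MASK = 0x7F   # bits 0-6: index into the long prefix table


def parse_long_prefixes(buf, start, count):
    """Parse `count` records of ALIGN({__le16 len, data}, 4) from `buf`."""
    prefixes, pos = [], start
    for _ in range(count):
        (length,) = struct.unpack_from("<H", buf, pos)
        data = buf[pos + 2:pos + 2 + length]
        # data = {base_index, infix}: a predefined prefix plus extra text.
        base_index, infix = data[0], data[1:].decode()
        prefixes.append(PREDEFINED[base_index] + infix)
        pos += (2 + length + 3) & ~3      # round the record up to 4 bytes
    return prefixes


def full_xattr_name(e_name_index, e_name, prefixes):
    """Rebuild the full xattr name from an entry's index and stored name."""
    if e_name_index & LONG_PREFIX_FLAG:
        return prefixes[e_name_index & LONG_PREFIX_MASK] + e_name.decode()
    return PREDEFINED[e_name_index] + e_name.decode()


def pack_long_prefix(base_index, infix):
    """Demo helper: build one on-disk style record."""
    data = bytes([base_index]) + infix.encode()
    rec = struct.pack("<H", len(data)) + data
    return rec.ljust((len(rec) + 3) & ~3, b"\0")


buf = pack_long_prefix(4, "overlay.") + pack_long_prefix(1, "custom.")
prefixes = parse_long_prefixes(buf, start=0, count=2)
print(prefixes)
print(full_xattr_name(0x80, b"opaque", prefixes))
```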
| |
| Data compression |
| ---------------- |
| EROFS implements fixed-sized output compression, which generates fixed-sized |
| compressed data blocks from variable-sized input, in contrast to other existing |
| fixed-sized input solutions. Relatively higher compression ratios can be |
| achieved with fixed-sized output compression, since today's popular data |
| compression algorithms are mostly LZ77-based and such a fixed-sized output |
| approach can benefit from the historical dictionary (aka. sliding window). |
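
The idea of fixed-sized output compression can be illustrated with a toy sketch: for each output block, consume as much input as still compresses into one block. This is not how mkfs.erofs is implemented; it uses Python's zlib as a stand-in compressor, searches for the input length by bisection (which finds a fitting length but not necessarily the largest one), and the function names are hypothetical.

```python
import zlib


def compress_fixed_output(data, block_size=4096):
    """Split `data` into (input_len, block) pairs, one block per extent."""
    extents, pos = [], 0
    while pos < len(data):
        lo, hi = 1, len(data) - pos      # a 1-byte input always fits
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if len(zlib.compress(data[pos:pos + mid])) <= block_size:
                lo = mid                 # mid bytes fit, try more input
            else:
                hi = mid - 1             # too much input for one block
        payload = zlib.compress(data[pos:pos + lo])
        # Every extent occupies exactly one block; pad the unused tail.
        extents.append((lo, payload.ljust(block_size, b"\0")))
        pos += lo
    return extents


def decompress_extents(extents):
    out = b""
    for _, block in extents:
        # decompressobj stops at the end of the stream and ignores padding.
        out += zlib.decompressobj().decompress(block)
    return out


data = b"".join(b"line %d: erofs\n" % (i * 7919 % 100003) for i in range(5000))
extents = compress_fixed_output(data)
print(len(extents), "blocks; input bytes per block:", [n for n, _ in extents])
```

The input lengths vary from extent to extent while every output block has the same size, which is the reverse of a fixed-sized input scheme.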
| |
| In detail, original (uncompressed) data is turned into several variable-sized |
| extents and, in the meantime, compressed into physical clusters (pclusters). |
| In order to record each variable-sized extent, logical clusters (lclusters) are |
| introduced as the basic unit of compression indexes to indicate whether a new |
| extent is generated within the range (HEAD) or not (NONHEAD). Lclusters are now |
| fixed in block size, as illustrated below:: |
| |
| |<- variable-sized extent ->|<- VLE ->| |
| clusterofs clusterofs clusterofs |
| | | | |
| _________v_________________________________v_______________________v________ |
| ... | . | | . | | . ... |
| ____|____._________|______________|________.___ _|______________|__.________ |
| |-> lcluster <-|-> lcluster <-|-> lcluster <-|-> lcluster <-| |
| (HEAD) (NONHEAD) (HEAD) (NONHEAD) . |
| . CBLKCNT . . |
| . . . |
| . . . |
| _______._____________________________.______________._________________ |
| ... | | | | ... |
| _______|______________|______________|______________|_________________ |
| |-> big pcluster <-|-> pcluster <-| |
| |
| A physical cluster can be seen as a container of physical compressed blocks |
| which contain compressed data. Previously, only lcluster-sized (4KB) pclusters |
| were supported. Since the big pcluster feature was introduced (available since |
| Linux v5.13), a pcluster can be a multiple of the lcluster size. |
| |
| For each HEAD lcluster, clusterofs is recorded to indicate where a new extent |
| starts and blkaddr is used to seek the compressed data. For each NONHEAD |
| lcluster, delta0 and delta1 are available instead of blkaddr to indicate the |
| distance to its HEAD lcluster and the next HEAD lcluster. A PLAIN lcluster is |
| also a HEAD lcluster except that its data is uncompressed. See the comments |
| around "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details. |
| |
| If big pcluster is enabled, the pcluster size in lclusters needs to be recorded |
| as well. The delta0 of the first NONHEAD lcluster is reused to store the |
| compressed block count together with a special flag; such an lcluster is called |
| a CBLKCNT NONHEAD lcluster. This is safe because the delta0 of the first NONHEAD |
| lcluster would otherwise always be 1, as illustrated below:: |
| |
| __________________________________________________________ |
| | HEAD | NONHEAD | NONHEAD | ... | NONHEAD | HEAD | HEAD | |
| |__:___|_(CBLKCNT)_|_________|_____|_________|__:___|____:_| |
| |<----- a big pcluster (with CBLKCNT) ------>|<-- -->| |
| a lcluster-sized pcluster (without CBLKCNT) ^ |
| |
| If another HEAD follows a HEAD lcluster, there is no room to record CBLKCNT, |
| but the size of such a pcluster is known to be 1 lcluster anyway. |
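
The CBLKCNT rule can be modelled as below. This is an in-memory toy model, not the on-disk index format: the LCluster class and its fields are hypothetical, and PLAIN lclusters are simply treated like HEAD ones.

```python
from dataclasses import dataclass


@dataclass
class LCluster:
    kind: str          # "HEAD", "PLAIN" or "NONHEAD"
    cblkcnt: int = 0   # nonzero only for a CBLKCNT NONHEAD lcluster


def pcluster_sizes(lclusters):
    """Return {index of HEAD lcluster: pcluster size in lclusters}."""
    sizes = {}
    for i, lc in enumerate(lclusters):
        if lc.kind == "NONHEAD":
            continue
        nxt = lclusters[i + 1] if i + 1 < len(lclusters) else None
        if nxt is not None and nxt.kind == "NONHEAD" and nxt.cblkcnt:
            # The first NONHEAD carries the compressed block count.
            sizes[i] = nxt.cblkcnt
        else:
            # HEAD followed by HEAD (or nothing): no room for CBLKCNT.
            sizes[i] = 1
    return sizes


index = [
    LCluster("HEAD"), LCluster("NONHEAD", cblkcnt=2),
    LCluster("NONHEAD"), LCluster("NONHEAD"),
    LCluster("HEAD"), LCluster("HEAD"),
]
print(pcluster_sizes(index))
```

In this example the first extent covers four lclusters of uncompressed data but is stored in a pcluster of two blocks, while the last two extents each use an lcluster-sized pcluster.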
| |
| Since Linux v6.1, each pcluster can be used for multiple variable-sized extents, |
| therefore it can be used for compressed data deduplication. |