| .. SPDX-License-Identifier: GPL-2.0 |
| |
| Layout |
| ------ |
| |
| The layout of a standard block group is approximately as follows (each |
| of these fields is discussed in a separate section below): |
| |
| .. list-table:: |
| :widths: 1 1 1 1 1 1 1 1 |
| :header-rows: 1 |
| |
| * - Group 0 Padding |
| - ext4 Super Block |
| - Group Descriptors |
| - Reserved GDT Blocks |
| - Data Block Bitmap |
| - inode Bitmap |
| - inode Table |
| - Data Blocks |
| * - 1024 bytes |
| - 1 block |
| - many blocks |
| - many blocks |
| - 1 block |
| - 1 block |
| - many blocks |
| - many more blocks |
| |
| For the special case of block group 0, the first 1024 bytes are unused, |
| to allow for the installation of x86 boot sectors and other oddities. |
| The superblock will start at offset 1024 bytes, whichever block that |
| happens to be (usually 0). However, if for some reason the block size = |
| 1024, then block 0 is marked in use and the superblock goes in block 1. |
| For all other block groups, there is no padding. |
| |
| The ext4 driver primarily works with the superblock and the group |
| descriptors that are found in block group 0. Redundant copies of the |
| superblock and group descriptors are written to some of the block groups |
| across the disk in case the beginning of the disk gets trashed, though |
| not all block groups necessarily host a redundant copy (see following |
| paragraph for more details). If the group does not have a redundant |
| copy, the block group begins with the data block bitmap. Note also that |
| when the filesystem is freshly formatted, mkfs will allocate “reserve |
| GDT block” space after the block group descriptors and before the start |
| of the block bitmaps to allow for future expansion of the filesystem. By |
| default, a filesystem is allowed to increase in size by a factor of |
| 1024x over the original filesystem size. |
| |
| The location of the inode table is given by ``grp.bg_inode_table_*``. It |
| is continuous range of blocks large enough to contain |
| ``sb.s_inodes_per_group * sb.s_inode_size`` bytes. |
| |
| As for the ordering of items in a block group, it is generally |
| established that the super block and the group descriptor table, if |
| present, will be at the beginning of the block group. The bitmaps and |
| the inode table can be anywhere, and it is quite possible for the |
| bitmaps to come after the inode table, or for both to be in different |
| groups (flex\_bg). Leftover space is used for file data blocks, indirect |
| block maps, extent tree blocks, and extended attributes. |
| |
| Flexible Block Groups |
| --------------------- |
| |
| Starting in ext4, there is a new feature called flexible block groups |
| (flex\_bg). In a flex\_bg, several block groups are tied together as one |
| logical block group; the bitmap spaces and the inode table space in the |
| first block group of the flex\_bg are expanded to include the bitmaps |
| and inode tables of all other block groups in the flex\_bg. For example, |
| if the flex\_bg size is 4, then group 0 will contain (in order) the |
| superblock, group descriptors, data block bitmaps for groups 0-3, inode |
| bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining |
| space in group 0 is for file data. The effect of this is to group the |
| block group metadata close together for faster loading, and to enable |
| large files to be continuous on disk. Backup copies of the superblock |
| and group descriptors are always at the beginning of block groups, even |
| if flex\_bg is enabled. The number of block groups that make up a |
| flex\_bg is given by 2 ^ ``sb.s_log_groups_per_flex``. |
| |
| Meta Block Groups |
| ----------------- |
| |
| Without the option META\_BG, for safety concerns, all block group |
| descriptors copies are kept in the first block group. Given the default |
| 128MiB(2^27 bytes) block group size and 64-byte group descriptors, ext4 |
| can have at most 2^27/64 = 2^21 block groups. This limits the entire |
| filesystem size to 2^21 * 2^27 = 2^48bytes or 256TiB. |
| |
| The solution to this problem is to use the metablock group feature |
| (META\_BG), which is already in ext3 for all 2.6 releases. With the |
| META\_BG feature, ext4 filesystems are partitioned into many metablock |
| groups. Each metablock group is a cluster of block groups whose group |
| descriptor structures can be stored in a single disk block. For ext4 |
| filesystems with 4 KB block size, a single metablock group partition |
| includes 64 block groups, or 8 GiB of disk space. The metablock group |
| feature moves the location of the group descriptors from the congested |
| first block group of the whole filesystem into the first group of each |
| metablock group itself. The backups are in the second and last group of |
| each metablock group. This increases the 2^21 maximum block groups limit |
| to the hard limit 2^32, allowing support for a 512PiB filesystem. |
| |
| The change in the filesystem format replaces the current scheme where |
| the superblock is followed by a variable-length set of block group |
| descriptors. Instead, the superblock and a single block group descriptor |
| block is placed at the beginning of the first, second, and last block |
| groups in a meta-block group. A meta-block group is a collection of |
| block groups which can be described by a single block group descriptor |
| block. Since the size of the block group descriptor structure is 32 |
| bytes, a meta-block group contains 32 block groups for filesystems with |
| a 1KB block size, and 128 block groups for filesystems with a 4KB |
| blocksize. Filesystems can either be created using this new block group |
| descriptor layout, or existing filesystems can be resized on-line, and |
| the field s\_first\_meta\_bg in the superblock will indicate the first |
| block group using this new layout. |
| |
| Please see an important note about ``BLOCK_UNINIT`` in the section about |
| block and inode bitmaps. |
| |
| Lazy Block Group Initialization |
| ------------------------------- |
| |
| A new feature for ext4 are three block group descriptor flags that |
| enable mkfs to skip initializing other parts of the block group |
| metadata. Specifically, the INODE\_UNINIT and BLOCK\_UNINIT flags mean |
| that the inode and block bitmaps for that group can be calculated and |
| therefore the on-disk bitmap blocks are not initialized. This is |
| generally the case for an empty block group or a block group containing |
| only fixed-location block group metadata. The INODE\_ZEROED flag means |
| that the inode table has been initialized; mkfs will unset this flag and |
| rely on the kernel to initialize the inode tables in the background. |
| |
| By not writing zeroes to the bitmaps and inode table, mkfs time is |
| reduced considerably. Note the feature flag is RO\_COMPAT\_GDT\_CSUM, |
| but the dumpe2fs output prints this as “uninit\_bg”. They are the same |
| thing. |