| .. SPDX-License-Identifier: GPL-2.0 |
| |
| Journal (jbd2) |
| -------------- |
| |
| Introduced in ext3, the ext4 filesystem employs a journal to protect the |
| filesystem against corruption in the case of a system crash. A small |
| contiguous region of disk (default 128MiB) is reserved inside the |
| filesystem as a place to land “important” data writes on-disk as quickly |
| as possible. Once the important data transaction is fully written to the |
| disk and flushed from the disk write cache, a record of the data being |
| committed is also written to the journal. At some later point in time, |
| the journal code writes the transactions to their final locations on |
| disk (this could involve a lot of seeking or a lot of small |
| read-write-erases) before erasing the commit record. Should the system |
| crash during the second slow write, the journal can be replayed all the |
| way to the latest commit record, guaranteeing the atomicity of whatever |
| gets written through the journal to the disk. The effect of this is to |
| guarantee that the filesystem does not become stuck midway through a |
| metadata update. |
| |
| For performance reasons, ext4 by default only writes filesystem metadata |
| through the journal. This means that file data blocks are *not* |
| guaranteed to be in any consistent state after a crash. If this default |
| guarantee level (``data=ordered``) is not satisfactory, there is a mount |
| option to control journal behavior. If ``data=journal``, all data and |
| metadata are written to disk through the journal. This is slower but |
| safest. If ``data=writeback``, dirty data blocks are not flushed to the |
| disk before the metadata are written to disk through the journal. |
| |
| In ``data=ordered`` mode, ext4 also supports fast commits, which |
| help reduce commit latency significantly. The default ``data=ordered`` |
| mode works by logging metadata blocks to the journal. In fast commit |
| mode, ext4 only stores the minimal delta needed to recreate the |
| affected metadata in fast commit space that is shared with JBD2. |
| Once the fast commit area fills up, or if a fast commit is not possible, |
| or if the JBD2 commit timer goes off, ext4 performs a traditional full |
| commit. A full commit invalidates all the fast commits that happened |
| before it and thus empties the fast commit area for further fast |
| commits. This feature needs to be enabled at mkfs time. |
| |
| The journal inode is typically inode 8. The first 68 bytes of the |
| journal inode are replicated in the ext4 superblock. The journal itself |
| is a normal (but hidden) file within the filesystem. The file usually |
| consumes an entire block group, though mke2fs tries to put it in the |
| middle of the disk. |
| |
| All fields in jbd2 are written to disk in big-endian order. This is the |
| opposite of ext4. |
| |
| NOTE: Both ext4 and ocfs2 use jbd2. |
| |
| The maximum size of a journal embedded in an ext4 filesystem is 2^32 |
| blocks. jbd2 itself does not seem to care. |
| |
| Layout |
| ~~~~~~ |
| |
| Generally speaking, the journal has this format: |
| |
| .. list-table:: |
| :widths: 16 48 16 |
| :header-rows: 1 |
| |
| * - Superblock |
| - descriptor\_block (data\_blocks or revocation\_block) [more data or |
| revocations] commit\_block |
| - [more transactions...] |
| * - |
| - One transaction |
| - |
| |
| Notice that a transaction begins with either a descriptor and some data, |
| or a block revocation list. A finished transaction always ends with a |
| commit. If there is no commit record (or the checksums don't match), the |
| transaction will be discarded during replay. |
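The replay rule above can be illustrated with a minimal Python sketch. It assumes the journal has already been reduced to a list of ``(h_blocktype, h_sequence)`` pairs in log order; the function name and this input form are hypothetical, and checksum verification is left out.

```python
DESCRIPTOR, COMMIT, REVOKE = 1, 2, 5  # block types from the jbd2 block header

def committed_transactions(headers):
    """Return the IDs of transactions that end with a commit block.

    `headers` is an iterable of (blocktype, sequence) pairs in log order.
    Scanning stops at the first transaction that lacks a commit record.
    """
    committed = []
    current = None          # sequence of the transaction being scanned
    for blocktype, sequence in headers:
        if current is None:
            current = sequence
        if sequence != current:
            break           # a new ID appeared before the commit: stop here
        if blocktype == COMMIT:
            committed.append(sequence)
            current = None  # the next block starts a new transaction
    return committed
```

With the log ``[(1, 7), (2, 7), (5, 8), (1, 8), (2, 8), (1, 9)]``, transactions 7 and 8 are kept and the unfinished transaction 9 is dropped.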
| |
| External Journal |
| ~~~~~~~~~~~~~~~~ |
| |
| Optionally, an ext4 filesystem can be created with an external journal |
| device (as opposed to an internal journal, which uses a reserved inode). |
| In this case, on the filesystem device, ``s_journal_inum`` should be |
| zero and ``s_journal_uuid`` should be set. On the journal device there |
| will be an ext4 super block in the usual place, with a matching UUID. |
| The journal superblock will be in the next full block after the |
| superblock. |
| |
| .. list-table:: |
| :widths: 12 12 12 32 12 |
| :header-rows: 1 |
| |
| * - 1024 bytes of padding |
| - ext4 Superblock |
| - Journal Superblock |
| - descriptor\_block (data\_blocks or revocation\_block) [more data or |
| revocations] commit\_block |
| - [more transactions...] |
| * - |
| - |
| - |
| - One transaction |
| - |
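As a rough illustration of the layout above, the byte offset of the journal superblock on an external journal device can be computed from the block size. This is a sketch only: it assumes the ext4 superblock occupies the 1024 bytes that follow the padding, and the function name is hypothetical.

```python
EXT4_SB_OFFSET = 1024  # the ext4 superblock follows 1024 bytes of padding

def journal_sb_offset(block_size):
    """Byte offset of the journal superblock on an external journal device."""
    sb_block = EXT4_SB_OFFSET // block_size   # block holding the ext4 superblock
    return (sb_block + 1) * block_size        # the next full block after it
```

For 1024-byte blocks the ext4 superblock sits in block 1, so the journal superblock lands at byte 2048; for larger blocks it shares block 0 with the padding, so the journal superblock starts at block 1.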
| |
| Block Header |
| ~~~~~~~~~~~~ |
| |
| Every block in the journal starts with a common 12-byte header |
| ``struct journal_header_s``: |
| |
| .. list-table:: |
| :widths: 8 8 24 40 |
| :header-rows: 1 |
| |
| * - Offset |
| - Type |
| - Name |
| - Description |
| * - 0x0 |
| - \_\_be32 |
| - h\_magic |
| - jbd2 magic number, 0xC03B3998. |
| * - 0x4 |
| - \_\_be32 |
| - h\_blocktype |
| - Description of what this block contains. See the jbd2_blocktype_ table |
| below. |
| * - 0x8 |
| - \_\_be32 |
| - h\_sequence |
| - The transaction ID that goes with this block. |
| |
| .. _jbd2_blocktype: |
| |
| The journal block type can be any one of: |
| |
| .. list-table:: |
| :widths: 16 64 |
| :header-rows: 1 |
| |
| * - Value |
| - Description |
| * - 1 |
| - Descriptor. This block precedes a series of data blocks that were |
| written through the journal during a transaction. |
| * - 2 |
| - Block commit record. This block signifies the completion of a |
| transaction. |
| * - 3 |
| - Journal superblock, v1. |
| * - 4 |
| - Journal superblock, v2. |
| * - 5 |
| - Block revocation records. This speeds up recovery by enabling the |
| journal to skip writing blocks that were subsequently rewritten. |
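The common header and the block type table can be decoded together. The following Python sketch is illustrative only; the function name and the human-readable type labels are hypothetical, while the field layout and values come from the tables above.

```python
import struct

JBD2_MAGIC = 0xC03B3998
BLOCK_TYPES = {
    1: "descriptor",
    2: "commit",
    3: "superblock v1",
    4: "superblock v2",
    5: "revoke",
}

def parse_block_header(block):
    """Decode the 12-byte header; return None if the magic does not match."""
    # Three big-endian 32-bit fields: h_magic, h_blocktype, h_sequence.
    magic, blocktype, sequence = struct.unpack_from(">III", block, 0)
    if magic != JBD2_MAGIC:
        return None  # not a journal metadata block (for example, a data block)
    return BLOCK_TYPES.get(blocktype, "unknown"), sequence
```

A block whose first four bytes are not the magic number is treated here as journalled data rather than as an error.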
| |
| Super Block |
| ~~~~~~~~~~~ |
| |
| The super block for the journal is much simpler than ext4's. |
| The key data kept within are the size of the journal and where to find |
| the start of the log of transactions. |
| |
| The journal superblock is recorded as ``struct journal_superblock_s``, |
| which is 1024 bytes long: |
| |
| .. list-table:: |
| :widths: 8 8 24 40 |
| :header-rows: 1 |
| |
| * - Offset |
| - Type |
| - Name |
| - Description |
| * - |
| - |
| - |
| - Static information describing the journal. |
| * - 0x0 |
| - journal\_header\_t (12 bytes) |
| - s\_header |
| - Common header identifying this as a superblock. |
| * - 0xC |
| - \_\_be32 |
| - s\_blocksize |
| - Journal device block size. |
| * - 0x10 |
| - \_\_be32 |
| - s\_maxlen |
| - Total number of blocks in this journal. |
| * - 0x14 |
| - \_\_be32 |
| - s\_first |
| - First block of log information. |
| * - |
| - |
| - |
| - Dynamic information describing the current state of the log. |
| * - 0x18 |
| - \_\_be32 |
| - s\_sequence |
| - First commit ID expected in log. |
| * - 0x1C |
| - \_\_be32 |
| - s\_start |
| - Block number of the start of log. Contrary to the comments, this field |
| being zero does not imply that the journal is clean! |
| * - 0x20 |
| - \_\_be32 |
| - s\_errno |
| - Error value, as set by jbd2\_journal\_abort(). |
| * - |
| - |
| - |
| - The remaining fields are only valid in a v2 superblock. |
| * - 0x24 |
| - \_\_be32 |
| - s\_feature\_compat |
| - Compatible feature set. See the table jbd2_compat_ below. |
| * - 0x28 |
| - \_\_be32 |
| - s\_feature\_incompat |
| - Incompatible feature set. See the table jbd2_incompat_ below. |
| * - 0x2C |
| - \_\_be32 |
| - s\_feature\_ro\_compat |
| - Read-only compatible feature set. There aren't any of these currently. |
| * - 0x30 |
| - \_\_u8 |
| - s\_uuid[16] |
| - 128-bit uuid for journal. This is compared against the copy in the ext4 |
| super block at mount time. |
| * - 0x40 |
| - \_\_be32 |
| - s\_nr\_users |
| - Number of file systems sharing this journal. |
| * - 0x44 |
| - \_\_be32 |
| - s\_dynsuper |
| - Location of dynamic super block copy. (Not used?) |
| * - 0x48 |
| - \_\_be32 |
| - s\_max\_transaction |
| - Limit of journal blocks per transaction. (Not used?) |
| * - 0x4C |
| - \_\_be32 |
| - s\_max\_trans\_data |
| - Limit of data blocks per transaction. (Not used?) |
| * - 0x50 |
| - \_\_u8 |
| - s\_checksum\_type |
| - Checksum algorithm used for the journal. See jbd2_checksum_type_ for |
| more info. |
| * - 0x51 |
| - \_\_u8[3] |
| - s\_padding2 |
| - |
| * - 0x54 |
| - \_\_u32 |
| - s\_padding[42] |
| - |
| * - 0xFC |
| - \_\_be32 |
| - s\_checksum |
| - Checksum of the entire superblock, with this field set to zero. |
| * - 0x100 |
| - \_\_u8 |
| - s\_users[16\*48] |
| - ids of all file systems sharing the log. e2fsprogs/Linux don't allow |
| shared external journals, but I imagine Lustre (or ocfs2?), which use |
| the jbd2 code, might. |
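As a sketch of how the table above maps onto bytes, the fields from offset 0x0 through 0x43 can be unpacked in one call. The function name and the dictionary form are hypothetical; the handling of v1 superblocks simply follows the note that later fields are only valid in v2.

```python
import struct
import uuid

# Fields from offset 0x0 through 0x43, all big-endian.
SB_FORMAT = ">12I16sI"
SB_FIELDS = (
    "h_magic", "h_blocktype", "h_sequence",
    "s_blocksize", "s_maxlen", "s_first",
    "s_sequence", "s_start", "s_errno",
    "s_feature_compat", "s_feature_incompat", "s_feature_ro_compat",
    "s_uuid", "s_nr_users",
)

def parse_journal_superblock(buf):
    """Return the leading journal superblock fields as a dict."""
    sb = dict(zip(SB_FIELDS, struct.unpack_from(SB_FORMAT, buf, 0)))
    sb["s_uuid"] = uuid.UUID(bytes=sb["s_uuid"])
    if sb["h_blocktype"] != 4:
        # Not a v2 superblock: the fields after s_errno are not valid.
        for name in SB_FIELDS[9:]:
            sb.pop(name)
    return sb
```

The remaining fields (``s_checksum_type``, ``s_checksum``, ``s_users``) can be read the same way with ``struct.unpack_from`` at their listed offsets.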
| |
| .. _jbd2_compat: |
| |
| The journal compat features are any combination of the following: |
| |
| .. list-table:: |
| :widths: 16 64 |
| :header-rows: 1 |
| |
| * - Value |
| - Description |
| * - 0x1 |
| - Journal maintains checksums on the data blocks. |
| (JBD2\_FEATURE\_COMPAT\_CHECKSUM) |
| |
| .. _jbd2_incompat: |
| |
| The journal incompat features are any combination of the following: |
| |
| .. list-table:: |
| :widths: 16 64 |
| :header-rows: 1 |
| |
| * - Value |
| - Description |
| * - 0x1 |
| - Journal has block revocation records. (JBD2\_FEATURE\_INCOMPAT\_REVOKE) |
| * - 0x2 |
| - Journal can deal with 64-bit block numbers. |
| (JBD2\_FEATURE\_INCOMPAT\_64BIT) |
| * - 0x4 |
| - Journal commits asynchronously. (JBD2\_FEATURE\_INCOMPAT\_ASYNC\_COMMIT) |
| * - 0x8 |
| - This journal uses v2 of the checksum on-disk format. Each journal |
| metadata block gets its own checksum, and the block tags in the |
| descriptor table contain checksums for each of the data blocks in the |
| journal. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2) |
| * - 0x10 |
| - This journal uses v3 of the checksum on-disk format. This is the same as |
| v2, but the journal block tag size is fixed regardless of the size of |
| block numbers. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3) |
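The incompat bitmask can be split into flag names as sketched below. The short names and the function are hypothetical; the bit values come from the table above. Returning the unknown bits separately reflects the usual meaning of an incompatible feature, namely that a reader which does not understand a set bit should not touch the journal.

```python
INCOMPAT_FLAGS = {
    0x1: "REVOKE",
    0x2: "64BIT",
    0x4: "ASYNC_COMMIT",
    0x8: "CSUM_V2",
    0x10: "CSUM_V3",
}

def decode_incompat(value):
    """Split s_feature_incompat into known flag names and unknown bits."""
    names = [name for bit, name in INCOMPAT_FLAGS.items() if value & bit]
    unknown = value & ~sum(INCOMPAT_FLAGS)  # the keys are disjoint bits
    return names, unknown
```

For example, ``decode_incompat(0x12)`` reports ``64BIT`` and ``CSUM_V3`` with no unknown bits left over.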
| |
| .. _jbd2_checksum_type: |
| |
| Journal checksum type codes are one of the following. crc32 or crc32c are the |
| most likely choices. |
| |
| .. list-table:: |
| :widths: 16 64 |
| :header-rows: 1 |
| |
| * - Value |
| - Description |
| * - 1 |
| - CRC32 |
| * - 2 |
| - MD5 |
| * - 3 |
| - SHA1 |
| * - 4 |
| - CRC32C |
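A small sketch of looking up the checksum algorithm from a raw journal superblock follows. It assumes the 1024-byte superblock is already in memory; the function name is hypothetical, and the offset 0x50 is the one given for ``s_checksum_type`` in the superblock table.

```python
CHECKSUM_TYPES = {1: "crc32", 2: "md5", 3: "sha1", 4: "crc32c"}

def checksum_type_name(superblock):
    """Name the algorithm recorded in s_checksum_type (one byte at 0x50)."""
    return CHECKSUM_TYPES.get(superblock[0x50], "unknown")
```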
| |
| Descriptor Block |
| ~~~~~~~~~~~~~~~~ |
| |
| The descriptor block contains an array of journal block tags that |
| describe the final locations of the data blocks that follow in the |
| journal. Descriptor blocks are open-coded instead of being completely |
| described by a data structure, but here is the block structure anyway. |
| Descriptor blocks consume at least 36 bytes, but use a full block: |
| |
| .. list-table:: |
| :widths: 8 8 24 40 |
| :header-rows: 1 |
| |
| * - Offset |
| - Type |
| - Name |
| - Description |
| * - 0x0 |
| - journal\_header\_t |
| - (open coded) |
| - Common block header. |
| * - 0xC |
| - struct journal\_block\_tag\_s |
| - open coded array[] |
| - Enough tags either to fill up the block or to describe all the data |
| blocks that follow this descriptor block. |
| |
| Journal block tags have any of the following formats, depending on which |
| journal feature and block tag flags are set. |
| |
| If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the journal block tag is |
| defined as ``struct journal_block_tag3_s``, which looks like the |
| following. The size is 16 or 32 bytes. |
| |
| .. list-table:: |
| :widths: 8 8 24 40 |
| :header-rows: 1 |
| |
| * - Offset |
| - Type |
| - Name |
| - Description |
| * - 0x0 |
| - \_\_be32 |
| - t\_blocknr |
| - Lower 32-bits of the location of where the corresponding data block |
| should end up on disk. |
| * - 0x4 |
| - \_\_be32 |
| - t\_flags |
| - Flags that go with the descriptor. See the table jbd2_tag_flags_ for |
| more info. |
| * - 0x8 |
| - \_\_be32 |
| - t\_blocknr\_high |
| - Upper 32-bits of the location of where the corresponding data block |
| should end up on disk. This is zero if JBD2\_FEATURE\_INCOMPAT\_64BIT is |
| not enabled. |
| * - 0xC |
| - \_\_be32 |
| - t\_checksum |
| - Checksum of the journal UUID, the sequence number, and the data block. |
| * - |
| - |
| - |
| - This field appears to be open coded. It always comes at the end of the |
| tag, after t_checksum. This field is not present if the "same UUID" flag |
| is set. |
| * - 0x10 |
| - char |
| - uuid[16] |
| - A UUID to go with this tag. This field appears to be copied from the |
| ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that |
| field. |
| |
| .. _jbd2_tag_flags: |
| |
| The journal tag flags are any combination of the following: |
| |
| .. list-table:: |
| :widths: 16 64 |
| :header-rows: 1 |
| |
| * - Value |
| - Description |
| * - 0x1 |
| - On-disk block is escaped. The first four bytes of the data block just |
| happened to match the jbd2 magic number. |
| * - 0x2 |
| - This block has the same UUID as previous, therefore the UUID field is |
| omitted. |
| * - 0x4 |
| - The data block was deleted by the transaction. (Not used?) |
| * - 0x8 |
| - This is the last tag in this descriptor block. |
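Walking the tags of a descriptor block in the ``JBD2_FEATURE_INCOMPAT_CSUM_V3`` format might look like the sketch below. It assumes the 16-byte ``journal_block_tag3_s`` layout described above, with an optional 16-byte UUID after each tag; the function and constant names are hypothetical, and tag checksums are returned without being verified.

```python
import struct

FLAG_ESCAPE, FLAG_SAME_UUID, FLAG_DELETED, FLAG_LAST_TAG = 0x1, 0x2, 0x4, 0x8

def walk_tags_v3(block):
    """Yield (blocknr, flags, checksum) for each tag in a descriptor block."""
    offset = 12                      # skip the common block header
    while offset + 16 <= len(block):
        lo, flags, hi, csum = struct.unpack_from(">IIII", block, offset)
        yield (hi << 32) | lo, flags, csum
        offset += 16
        if not flags & FLAG_SAME_UUID:
            offset += 16             # a 16-byte UUID follows this tag
        if flags & FLAG_LAST_TAG:
            break
```

The n-th tag yielded describes the n-th data block that follows the descriptor block in the journal.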
| |
| If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is NOT set, the journal block tag |
| is defined as ``struct journal_block_tag_s``, which looks like the |
| following. The size is 8, 12, 24, or 28 bytes: |
| |
| .. list-table:: |
| :widths: 8 8 24 40 |
| :header-rows: 1 |
| |
| * - Offset |
| - Type |
| - Name |
| - Description |
| * - 0x0 |
| - \_\_be32 |
| - t\_blocknr |
| - Lower 32-bits of the location of where the corresponding data block |
| should end up on disk. |
| * - 0x4 |
| - \_\_be16 |
| - t\_checksum |
| - Checksum of the journal UUID, the sequence number, and the data block. |
| Note that only the lower 16 bits are stored. |
| * - 0x6 |
| - \_\_be16 |
| - t\_flags |
| - Flags that go with the descriptor. See the table jbd2_tag_flags_ for |
| more info. |
| * - |
| - |
| - |
| - This next field is only present if the super block indicates support for |
| 64-bit block numbers. |
| * - 0x8 |
| - \_\_be32 |
| - t\_blocknr\_high |
| - Upper 32-bits of the location of where the corresponding data block |
| should end up on disk. |
| * - |
| - |
| - |
| - This field appears to be open coded. It always comes at the end of the |
| tag, after t_flags or t_blocknr_high. This field is not present if the |
| "same UUID" flag is set. |
| * - 0x8 or 0xC |
| - char |
| - uuid[16] |
| - A UUID to go with this tag. This field appears to be copied from the |
| ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that |
| field. |
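The four tag sizes quoted for ``struct journal_block_tag_s`` follow from two independent choices, which the sketch below spells out. It reproduces only the sizes listed in the text (8, 12, 24, or 28 bytes) and has not been checked against the kernel's own size calculation; the function and constant names are hypothetical.

```python
FLAG_SAME_UUID = 0x2      # tag flag: UUID field is omitted
INCOMPAT_64BIT = 0x2      # superblock incompat feature: 64-bit block numbers

def tag_size(incompat, tag_flags):
    """Size in bytes of one journal_block_tag_s plus its optional UUID."""
    size = 8                              # t_blocknr, t_checksum, t_flags
    if incompat & INCOMPAT_64BIT:
        size += 4                         # t_blocknr_high
    if not tag_flags & FLAG_SAME_UUID:
        size += 16                        # trailing uuid[16]
    return size
```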
| |
| If either JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or |
| JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the end of the block is a |
| ``struct jbd2_journal_block_tail``, which looks like this: |
| |
| .. list-table:: |
| :widths: 8 8 24 40 |
| :header-rows: 1 |
| |
| * - Offset |
| - Type |
| - Name |
| - Description |
| * - 0x0 |
| - \_\_be32 |
| - t\_checksum |
| - Checksum of the journal UUID + the descriptor block, with this field set |
| to zero. |
| |
| Data Block |
| ~~~~~~~~~~ |
| |
| In general, the data blocks being written to disk through the journal |
| are written verbatim into the journal file after the descriptor block. |
| However, if the first four bytes of the block match the jbd2 magic |
| number then those four bytes are replaced with zeroes and the “escaped” |
| flag is set in the descriptor block tag. |
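The escaping rule is small enough to sketch in both directions. The function names are hypothetical; the magic number and the flag value 0x1 come from the block header and tag flag tables.

```python
JBD2_MAGIC_BYTES = bytes.fromhex("C03B3998")  # magic as stored on disk (big-endian)
FLAG_ESCAPE = 0x1

def escape(data):
    """Prepare a data block for the journal; return (journal_copy, tag_flags)."""
    if data[:4] == JBD2_MAGIC_BYTES:
        return bytes(4) + data[4:], FLAG_ESCAPE
    return data, 0

def unescape(journal_copy, tag_flags):
    """Restore the original block contents during replay."""
    if tag_flags & FLAG_ESCAPE:
        return JBD2_MAGIC_BYTES + journal_copy[4:]
    return journal_copy
```

Escaping keeps a data block from being mistaken for a journal metadata block when the log is scanned.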
| |
| Revocation Block |
| ~~~~~~~~~~~~~~~~ |
| |
| A revocation block is used to prevent replay of a block in an earlier |
| transaction. This is used to mark blocks that were journalled at one |
| time but are no longer journalled. Typically this happens if a metadata |
| block is freed and re-allocated as a file data block; in this case, a |
| journal replay after the file block was written to disk will cause |
| corruption. |
| |
| **NOTE**: This mechanism is NOT used to express “this journal block is |
| superseded by this other journal block”, as the author (djwong) |
| mistakenly thought. Any block being added to a transaction will cause |
| the removal of all existing revocation records for that block. |
| |
| Revocation blocks are described in |
| ``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in |
| length, but use a full block: |
| |
| .. list-table:: |
| :widths: 8 8 24 40 |
| :header-rows: 1 |
| |
| * - Offset |
| - Type |
| - Name |
| - Description |
| * - 0x0 |
| - journal\_header\_t |
| - r\_header |
| - Common block header. |
| * - 0xC |
| - \_\_be32 |
| - r\_count |
| - Number of bytes used in this block. |
| * - 0x10 |
| - \_\_be32 or \_\_be64 |
| - blocks[0] |
| - Blocks to revoke. |
| |
| After r\_count is a linear array of block numbers that are effectively |
| revoked by this transaction. The size of each block number is 8 bytes if |
| the superblock advertises 64-bit block number support, or 4 bytes |
| otherwise. |
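Reading the revoked block numbers out of one revocation block could be sketched as follows. It assumes ``r_count`` counts bytes from the start of the block, so that it includes the 16-byte header; the function name is hypothetical.

```python
import struct

INCOMPAT_64BIT = 0x2

def revoked_blocks(block, incompat):
    """Return the block numbers listed in one revocation block."""
    (r_count,) = struct.unpack_from(">I", block, 0xC)  # bytes used, header included
    fmt, width = (">Q", 8) if incompat & INCOMPAT_64BIT else (">I", 4)
    return [
        struct.unpack_from(fmt, block, offset)[0]
        for offset in range(0x10, r_count - width + 1, width)
    ]
```

An ``r_count`` of 16 therefore describes a revocation block with no entries.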
| |
| If either JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or |
| JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the end of the revocation |
| block is a ``struct jbd2_journal_revoke_tail``, which has this format: |
| |
| .. list-table:: |
| :widths: 8 8 24 40 |
| :header-rows: 1 |
| |
| * - Offset |
| - Type |
| - Name |
| - Description |
| * - 0x0 |
| - \_\_be32 |
| - r\_checksum |
| - Checksum of the journal UUID + revocation block |
| |
| Commit Block |
| ~~~~~~~~~~~~ |
| |
| The commit block is a sentinel that indicates that a transaction has been |
| completely written to the journal. Once this commit block reaches the |
| journal, the data stored with this transaction can be written to their |
| final locations on disk. |
| |
| The commit block is described by ``struct commit_header``, whose fields |
| span 60 bytes (offsets 0x0 through 0x3B), but it uses a full block: |
| |
| .. list-table:: |
| :widths: 8 8 24 40 |
| :header-rows: 1 |
| |
| * - Offset |
| - Type |
| - Name |
| - Description |
| * - 0x0 |
| - journal\_header\_s |
| - (open coded) |
| - Common block header. |
| * - 0xC |
| - unsigned char |
| - h\_chksum\_type |
| - The type of checksum to use to verify the integrity of the data blocks |
| in the transaction. See jbd2_checksum_type_ for more info. |
| * - 0xD |
| - unsigned char |
| - h\_chksum\_size |
| - The number of bytes used by the checksum. Most likely 4. |
| * - 0xE |
| - unsigned char |
| - h\_padding[2] |
| - |
| * - 0x10 |
| - \_\_be32 |
| - h\_chksum[JBD2\_CHECKSUM\_BYTES] |
| - 32 bytes of space to store checksums. If either |
| JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 |
| is set, the first ``__be32`` is the checksum of the journal UUID and |
| the entire commit block, with this field zeroed. If |
| JBD2\_FEATURE\_COMPAT\_CHECKSUM is set, the first ``__be32`` is the |
| crc32 of all the blocks already written to the transaction. |
| * - 0x30 |
| - \_\_be64 |
| - h\_commit\_sec |
| - The time that the transaction was committed, in seconds since the epoch. |
| * - 0x38 |
| - \_\_be32 |
| - h\_commit\_nsec |
| - Nanoseconds component of the above timestamp. |
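The commit block fields in the table above end at offset 0x3C and can be unpacked in a single call, as in this sketch. The function name and the returned dictionary are hypothetical, and no checksum is verified here.

```python
import struct
from datetime import datetime, timezone

# header (3 x be32), type, size, 2 padding bytes, h_chksum (8 x be32), sec, nsec
COMMIT_FORMAT = ">IIIBB2x8IQI"

def parse_commit_block(block):
    """Decode the fixed fields at the start of a commit block."""
    fields = struct.unpack_from(COMMIT_FORMAT, block, 0)
    magic, blocktype, sequence, chksum_type, chksum_size = fields[:5]
    h_chksum = fields[5:13]
    commit_sec, commit_nsec = fields[13:]
    return {
        "sequence": sequence,
        "chksum_type": chksum_type,
        "chksum_size": chksum_size,
        "checksum": h_chksum[0],     # only the first __be32 is described above
        "committed": datetime.fromtimestamp(commit_sec, tz=timezone.utc),
        "commit_nsec": commit_nsec,
    }
```

``struct.calcsize(COMMIT_FORMAT)`` is 60, which matches the offsets in the table.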
| |
| Fast commits |
| ~~~~~~~~~~~~ |
| |
| The fast commit area is organized as a log of tag-length-value (TLV) |
| records. Each TLV begins with a ``struct ext4_fc_tl``, which stores the tag |
| and the length of the entire field. It is followed by a variable-length, |
| tag-specific value. Here is the list of supported tags and their meanings: |
| |
| .. list-table:: |
| :widths: 8 20 20 32 |
| :header-rows: 1 |
| |
| * - Tag |
| - Meaning |
| - Value struct |
| - Description |
| * - EXT4_FC_TAG_HEAD |
| - Fast commit area header |
| - ``struct ext4_fc_head`` |
| - Stores the TID of the transaction after which these fast commits should |
| be applied. |
| * - EXT4_FC_TAG_ADD_RANGE |
| - Add extent to inode |
| - ``struct ext4_fc_add_range`` |
| - Stores the inode number and extent to be added in this inode |
| * - EXT4_FC_TAG_DEL_RANGE |
| - Remove logical offsets from inode |
| - ``struct ext4_fc_del_range`` |
| - Stores the inode number and the logical offset range that needs to be |
| removed |
| * - EXT4_FC_TAG_CREAT |
| - Create directory entry for a newly created file |
| - ``struct ext4_fc_dentry_info`` |
| - Stores the parent inode number, inode number and directory entry of the |
| newly created file |
| * - EXT4_FC_TAG_LINK |
| - Link a directory entry to an inode |
| - ``struct ext4_fc_dentry_info`` |
| - Stores the parent inode number, inode number and directory entry |
| * - EXT4_FC_TAG_UNLINK |
| - Unlink a directory entry of an inode |
| - ``struct ext4_fc_dentry_info`` |
| - Stores the parent inode number, inode number and directory entry |
| |
| * - EXT4_FC_TAG_PAD |
| - Padding (unused area) |
| - None |
| - Unused bytes in the fast commit area. |
| |
| * - EXT4_FC_TAG_TAIL |
| - Mark the end of a fast commit |
| - ``struct ext4_fc_tail`` |
| - Stores the TID of the commit and the CRC of the fast commit that this |
| tag ends |
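A generic walk over the TLV log might look like the sketch below. Note the assumptions, none of which are stated in this section: each ``struct ext4_fc_tl`` is taken to be a 16-bit tag followed by a 16-bit length, both little-endian (as ext4 structures are, unlike jbd2), and the length is taken to count only the value bytes. The numeric tag values are not listed here, so the sketch yields raw numbers; the function name is hypothetical.

```python
import struct

def walk_tlvs(area):
    """Yield (tag, value_bytes) pairs from a fast commit area.

    ASSUMPTION: each TLV starts with a 16-bit tag and a 16-bit length, both
    little-endian, and the length counts only the value bytes.
    """
    offset = 0
    while offset + 4 <= len(area):
        tag, length = struct.unpack_from("<HH", area, offset)
        value = area[offset + 4 : offset + 4 + length]
        if len(value) < length:
            break                    # truncated record: stop scanning
        yield tag, value
        offset += 4 + length
```

A real reader would also stop at the tail tag and check its CRC before applying anything.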
| |