 .. SPDX-License-Identifier: GPL-2.0

 Journal (jbd2)
 --------------

 Introduced in ext3, the ext4 filesystem employs a journal to protect the
 filesystem against corruption in the case of a system crash. A small
 contiguous region of disk (default 128MiB) is reserved inside the
 filesystem as a place to land “important” data writes on-disk as quickly
 as possible. Once the important data transaction is fully written to the
 disk and flushed from the disk write cache, a record of the data being
 committed is also written to the journal. At some later point in time,
 the journal code writes the transactions to their final locations on
 disk (this could involve a lot of seeking or a lot of small
 read-write-erases) before erasing the commit record. Should the system
 crash during the second slow write, the journal can be replayed all the
 way to the latest commit record, guaranteeing the atomicity of whatever
 gets written through the journal to the disk. The effect of this is to
 guarantee that the filesystem does not become stuck midway through a
 metadata update.

 For performance reasons, ext4 by default only writes filesystem metadata
 through the journal. This means that file data blocks are *not*
 guaranteed to be in any consistent state after a crash. If this default
 guarantee level (``data=ordered``) is not satisfactory, there is a mount
 option to control journal behavior. If ``data=journal``, all data and
 metadata are written to disk through the journal. This is slower but
 safest. If ``data=writeback``, dirty data blocks are not flushed to the
 disk before the metadata are written to disk through the journal.

 In ``data=ordered`` mode, ext4 also supports fast commits, which can
 reduce commit latency significantly. The default ``data=ordered``
 mode works by logging metadata blocks to the journal. In fast commit
 mode, ext4 stores only the minimal delta needed to recreate the
 affected metadata, in a fast commit area that is shared with JBD2.
 Once the fast commit area fills up, or if a fast commit is not possible,
 or if the JBD2 commit timer goes off, ext4 performs a traditional full commit.
 A full commit invalidates all the fast commits that happened before
 it, and thus empties the fast commit area for further fast
 commits. This feature needs to be enabled at mkfs time.

 The journal inode is typically inode 8. The first 68 bytes of the
 journal inode are replicated in the ext4 superblock. The journal itself
 is a normal (but hidden) file within the filesystem. The file usually
 consumes an entire block group, though mke2fs tries to put it in the
 middle of the disk.

 All fields in jbd2 are written to disk in big-endian order. This is the
 opposite of ext4.

 NOTE: Both ext4 and ocfs2 use jbd2.

 The maximum size of a journal embedded in an ext4 filesystem is 2^32
 blocks. jbd2 itself does not seem to care.

 Layout
 ~~~~~~

 Generally speaking, the journal has this format:

 .. list-table::
    :widths: 16 48 16
    :header-rows: 1

    * - Superblock
      - descriptor\_block (data\_blocks or revocation\_block) [more data or
        revocations] commit\_block
      - [more transactions...]
    * -
      - One transaction
      -

 Notice that a transaction begins with either a descriptor and some data,
 or a block revocation list. A finished transaction always ends with a
 commit. If there is no commit record (or the checksums don't match), the
 transaction will be discarded during replay.

 External Journal
 ~~~~~~~~~~~~~~~~

 Optionally, an ext4 filesystem can be created with an external journal
 device (as opposed to an internal journal, which uses a reserved inode).
 In this case, on the filesystem device, ``s_journal_inum`` should be
 zero and ``s_journal_uuid`` should be set. On the journal device there
 will be an ext4 super block in the usual place, with a matching UUID.
 The journal superblock will be in the next full block after the
 superblock.

 .. list-table::
    :widths: 12 12 12 32 12
    :header-rows: 1

    * - 1024 bytes of padding
      - ext4 Superblock
      - Journal Superblock
      - descriptor\_block (data\_blocks or revocation\_block) [more data or
        revocations] commit\_block
      - [more transactions...]
    * -
      -
      -
      - One transaction
      -

 Block Header
 ~~~~~~~~~~~~

 Every block in the journal starts with a common 12-byte header
 ``struct journal_header_s``:

 .. list-table::
    :widths: 8 8 24 40
    :header-rows: 1

    * - Offset
      - Type
      - Name
      - Description
    * - 0x0
      - \_\_be32
      - h\_magic
      - jbd2 magic number, 0xC03B3998.
    * - 0x4
      - \_\_be32
      - h\_blocktype
      - Description of what this block contains. See the jbd2_blocktype_ table
        below.
    * - 0x8
      - \_\_be32
      - h\_sequence
      - The transaction ID that goes with this block.

 .. _jbd2_blocktype:

 The journal block type can be any one of:

 .. list-table::
    :widths: 16 64
    :header-rows: 1

    * - Value
      - Description
    * - 1
      - Descriptor. This block precedes a series of data blocks that were
        written through the journal during a transaction.
    * - 2
      - Block commit record. This block signifies the completion of a
        transaction.
    * - 3
      - Journal superblock, v1.
    * - 4
      - Journal superblock, v2.
    * - 5
      - Block revocation records. This speeds up recovery by enabling the
        journal to skip writing blocks that were subsequently rewritten.
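
 As a rough illustration of the header and block type tables above, the
 following Python sketch decodes the 12-byte header of a raw journal block.
 The names ``JBD2_MAGIC``, ``BLOCK_TYPES`` and ``parse_block_header`` are
 made up for this example and are not taken from the kernel sources:

 .. code-block:: python

    import struct

    JBD2_MAGIC = 0xC03B3998

    # Block type values from the table above.
    BLOCK_TYPES = {
        1: "descriptor",
        2: "commit",
        3: "superblock v1",
        4: "superblock v2",
        5: "revocation",
    }

    def parse_block_header(block):
        """Return (type name, transaction ID), or None if the magic is wrong."""
        # All jbd2 fields are big-endian, hence the ">" prefix.
        magic, blocktype, sequence = struct.unpack_from(">III", block, 0)
        if magic != JBD2_MAGIC:
            return None
        return BLOCK_TYPES.get(blocktype, "unknown"), sequence

 A block whose first four bytes are not the magic number is not a journal
 metadata block; in this sketch it simply yields ``None``.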

 Super Block
 ~~~~~~~~~~~

 The super block for the journal is much simpler than ext4's.
 The key data kept within are the size of the journal and where to find the
 start of the log of transactions.

 The journal superblock is recorded as ``struct journal_superblock_s``,
 which is 1024 bytes long:

 .. list-table::
    :widths: 8 8 24 40
    :header-rows: 1

    * - Offset
      - Type
      - Name
      - Description
    * -
      -
      -
      - Static information describing the journal.
    * - 0x0
      - journal\_header\_t (12 bytes)
      - s\_header
      - Common header identifying this as a superblock.
    * - 0xC
      - \_\_be32
      - s\_blocksize
      - Journal device block size.
    * - 0x10
      - \_\_be32
      - s\_maxlen
      - Total number of blocks in this journal.
    * - 0x14
      - \_\_be32
      - s\_first
      - First block of log information.
    * -
      -
      -
      - Dynamic information describing the current state of the log.
    * - 0x18
      - \_\_be32
      - s\_sequence
      - First commit ID expected in log.
    * - 0x1C
      - \_\_be32
      - s\_start
      - Block number of the start of log. Contrary to the comments, this field
        being zero does not imply that the journal is clean!
    * - 0x20
      - \_\_be32
      - s\_errno
      - Error value, as set by jbd2\_journal\_abort().
    * -
      -
      -
      - The remaining fields are only valid in a v2 superblock.
    * - 0x24
      - \_\_be32
      - s\_feature\_compat
      - Compatible feature set. See the table jbd2_compat_ below.
    * - 0x28
      - \_\_be32
      - s\_feature\_incompat
      - Incompatible feature set. See the table jbd2_incompat_ below.
    * - 0x2C
      - \_\_be32
      - s\_feature\_ro\_compat
      - Read-only compatible feature set. There aren't any of these currently.
    * - 0x30
      - \_\_u8
      - s\_uuid[16]
      - 128-bit uuid for journal. This is compared against the copy in the ext4
        super block at mount time.
    * - 0x40
      - \_\_be32
      - s\_nr\_users
      - Number of file systems sharing this journal.
    * - 0x44
      - \_\_be32
      - s\_dynsuper
      - Location of dynamic super block copy. (Not used?)
    * - 0x48
      - \_\_be32
      - s\_max\_transaction
      - Limit of journal blocks per transaction. (Not used?)
    * - 0x4C
      - \_\_be32
      - s\_max\_trans\_data
      - Limit of data blocks per transaction. (Not used?)
    * - 0x50
      - \_\_u8
      - s\_checksum\_type
      - Checksum algorithm used for the journal.  See jbd2_checksum_type_ for
        more info.
    * - 0x51
      - \_\_u8[3]
      - s\_padding2
      -
    * - 0x54
      - \_\_be32
      - s\_num\_fc\_blocks
      - Number of fast commit blocks in the journal.
    * - 0x58
      - \_\_u32
      - s\_padding[41]
      -
    * - 0xFC
      - \_\_be32
      - s\_checksum
      - Checksum of the entire superblock, with this field set to zero.
    * - 0x100
      - \_\_u8
      - s\_users[16\*48]
      - ids of all file systems sharing the log. e2fsprogs/Linux don't allow
        shared external journals, but I imagine Lustre (or ocfs2?), which use
        the jbd2 code, might.
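
 The offsets in the table can be turned into a small decoder. This is only a
 sketch in Python, assuming the layout exactly as tabulated above;
 ``SB_FORMAT``, ``JournalSuperblock`` and ``parse_journal_superblock`` are
 names invented for this example, and the checksum is read but not verified:

 .. code-block:: python

    import struct
    from collections import namedtuple

    # First 0x58 bytes of the journal superblock, per the table offsets.
    # ">" selects big-endian with no implicit alignment padding.
    SB_FORMAT = struct.Struct(">3I 9I 16s 4I B 3x I")

    JournalSuperblock = namedtuple("JournalSuperblock", [
        "h_magic", "h_blocktype", "h_sequence",
        "s_blocksize", "s_maxlen", "s_first",
        "s_sequence", "s_start", "s_errno",
        "s_feature_compat", "s_feature_incompat", "s_feature_ro_compat",
        "s_uuid",
        "s_nr_users", "s_dynsuper", "s_max_transaction", "s_max_trans_data",
        "s_checksum_type",
        "s_num_fc_blocks",
    ])

    def parse_journal_superblock(raw):
        """Decode the fixed fields; return (superblock, stored checksum)."""
        if len(raw) < 1024:
            raise ValueError("journal superblock is 1024 bytes long")
        sb = JournalSuperblock._make(SB_FORMAT.unpack_from(raw, 0))
        # Block types 3 and 4 are the v1 and v2 journal superblocks.
        if sb.h_magic != 0xC03B3998 or sb.h_blocktype not in (3, 4):
            raise ValueError("not a journal superblock")
        (s_checksum,) = struct.unpack_from(">I", raw, 0xFC)
        return sb, s_checksum

 For a v1 superblock (block type 3), a caller should ignore everything from
 ``s_feature_compat`` onwards, since those fields are only valid in v2.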

 .. _jbd2_compat:

 The journal compat features are any combination of the following:

 .. list-table::
    :widths: 16 64
    :header-rows: 1

    * - Value
      - Description
    * - 0x1
      - Journal maintains checksums on the data blocks.
        (JBD2\_FEATURE\_COMPAT\_CHECKSUM)

 .. _jbd2_incompat:

 The journal incompat features are any combination of the following:

 .. list-table::
    :widths: 16 64
    :header-rows: 1

    * - Value
      - Description
    * - 0x1
      - Journal has block revocation records. (JBD2\_FEATURE\_INCOMPAT\_REVOKE)
    * - 0x2
      - Journal can deal with 64-bit block numbers.
        (JBD2\_FEATURE\_INCOMPAT\_64BIT)
    * - 0x4
      - Journal commits asynchronously. (JBD2\_FEATURE\_INCOMPAT\_ASYNC\_COMMIT)
    * - 0x8
      - This journal uses v2 of the checksum on-disk format. Each journal
        metadata block gets its own checksum, and the block tags in the
        descriptor table contain checksums for each of the data blocks in the
        journal. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2)
    * - 0x10
      - This journal uses v3 of the checksum on-disk format. This is the same as
        v2, but the journal block tag size is fixed regardless of the size of
        block numbers. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3)
    * - 0x20
      - Journal has fast commit blocks. (JBD2\_FEATURE\_INCOMPAT\_FAST\_COMMIT)
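
 To make the bit values concrete, here is a minimal Python sketch that splits
 an ``s_feature_incompat`` value into the flags listed above. The short names
 and the ``decode_incompat`` helper are illustrative only. Reporting unknown
 bits separately reflects the usual meaning of an "incompatible" feature set
 (a reader that does not understand a bit should not touch the journal),
 which is an assumption of this example rather than something the table
 states:

 .. code-block:: python

    # Bit values from the incompat feature table above.
    INCOMPAT_FLAGS = {
        0x01: "REVOKE",
        0x02: "64BIT",
        0x04: "ASYNC_COMMIT",
        0x08: "CSUM_V2",
        0x10: "CSUM_V3",
        0x20: "FAST_COMMIT",
    }

    KNOWN_INCOMPAT = 0
    for bit in INCOMPAT_FLAGS:
        KNOWN_INCOMPAT |= bit

    def decode_incompat(value):
        """Return (list of known flag names, mask of unrecognised bits)."""
        names = [name for bit, name in INCOMPAT_FLAGS.items() if value & bit]
        unknown = value & ~KNOWN_INCOMPAT
        return names, unknown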

 .. _jbd2_checksum_type:

 Journal checksum type codes are one of the following.  crc32 or crc32c are the
 most likely choices.

 .. list-table::
    :widths: 16 64
    :header-rows: 1

    * - Value
      - Description
    * - 1
      - CRC32
    * - 2
      - MD5
    * - 3
      - SHA1
    * - 4
      - CRC32C

 Descriptor Block
 ~~~~~~~~~~~~~~~~

 The descriptor block contains an array of journal block tags that
 describe the final locations of the data blocks that follow in the
 journal. Descriptor blocks are open-coded instead of being completely
 described by a data structure, but here is the block structure anyway.
 Descriptor blocks consume at least 36 bytes, but use a full block:

 .. list-table::
    :widths: 8 8 24 40
    :header-rows: 1

    * - Offset
      - Type
      - Name
      - Description
    * - 0x0
      - journal\_header\_t
      - (open coded)
      - Common block header.
    * - 0xC
      - struct journal\_block\_tag\_s
      - open coded array[]
      - Enough tags either to fill up the block or to describe all the data
        blocks that follow this descriptor block.

 Journal block tags have any of the following formats, depending on which
 journal feature and block tag flags are set.

 If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the journal block tag is
 defined as ``struct journal_block_tag3_s``, which looks like the
 following. The size is 16 or 32 bytes.

 .. list-table::
    :widths: 8 8 24 40
    :header-rows: 1

    * - Offset
      - Type
      - Name
      - Description
    * - 0x0
      - \_\_be32
      - t\_blocknr
      - Lower 32-bits of the location of where the corresponding data block
        should end up on disk.
    * - 0x4
      - \_\_be32
      - t\_flags
      - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
        more info.
    * - 0x8
      - \_\_be32
      - t\_blocknr\_high
      - Upper 32-bits of the location of where the corresponding data block
        should end up on disk. This is zero if JBD2\_FEATURE\_INCOMPAT\_64BIT is
        not enabled.
    * - 0xC
      - \_\_be32
      - t\_checksum
      - Checksum of the journal UUID, the sequence number, and the data block.
    * -
      -
      -
      - This field appears to be open coded. It always comes at the end of the
        tag, after t_checksum. This field is not present if the "same UUID" flag
        is set.
    * - 0x10
      - char
      - uuid[16]
      - A UUID to go with this tag. This field appears to be copied from the
        ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
        field.

 .. _jbd2_tag_flags:

 The journal tag flags are any combination of the following:

 .. list-table::
    :widths: 16 64
    :header-rows: 1

    * - Value
      - Description
    * - 0x1
      - On-disk block is escaped. The first four bytes of the data block just
        happened to match the jbd2 magic number.
    * - 0x2
      - This block has the same UUID as previous, therefore the UUID field is
        omitted.
    * - 0x4
      - The data block was deleted by the transaction. (Not used?)
    * - 0x8
      - This is the last tag in this descriptor block.

 If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is NOT set, the journal block tag
 is defined as ``struct journal_block_tag_s``, which looks like the
 following. The size is 8, 12, 24, or 28 bytes:

 .. list-table::
    :widths: 8 8 24 40
    :header-rows: 1

    * - Offset
      - Type
      - Name
      - Description
    * - 0x0
      - \_\_be32
      - t\_blocknr
      - Lower 32-bits of the location of where the corresponding data block
        should end up on disk.
    * - 0x4
      - \_\_be16
      - t\_checksum
      - Checksum of the journal UUID, the sequence number, and the data block.
        Note that only the lower 16 bits are stored.
    * - 0x6
      - \_\_be16
      - t\_flags
      - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
        more info.
    * -
      -
      -
      - This next field is only present if the super block indicates support for
        64-bit block numbers.
    * - 0x8
      - \_\_be32
      - t\_blocknr\_high
      - Upper 32-bits of the location of where the corresponding data block
        should end up on disk.
    * -
      -
      -
      - This field appears to be open coded. It always comes at the end of the
        tag, after t_flags or t_blocknr_high. This field is not present if the
        "same UUID" flag is set.
    * - 0x8 or 0xC
      - char
      - uuid[16]
      - A UUID to go with this tag. This field appears to be copied from the
        ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
        field.
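
 The tag sizes quoted above follow directly from which optional fields are
 present. The Python sketch below (``tag_size`` is a hypothetical helper, not
 a kernel function) spells out that arithmetic:

 .. code-block:: python

    UUID_LEN = 16

    def tag_size(csum_v3, is_64bit, same_uuid):
        """Size in bytes of one journal block tag, per the two tables above."""
        if csum_v3:
            # t_blocknr, t_flags, t_blocknr_high, t_checksum: fixed 16 bytes.
            size = 16
        else:
            # t_blocknr (4) + t_checksum (2) + t_flags (2) ...
            size = 8
            # ... plus t_blocknr_high only with 64-bit block numbers.
            if is_64bit:
                size += 4
        # The UUID is omitted when the "same UUID" flag (0x2) is set.
        if not same_uuid:
            size += UUID_LEN
        return size

 Enumerating every combination gives 16 or 32 bytes with CSUM\_V3, and 8, 12,
 24 or 28 bytes without it, matching the sizes given in the text.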

 If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
 JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the block is a
 ``struct jbd2_journal_block_tail``, which looks like this:

 .. list-table::
    :widths: 8 8 24 40
    :header-rows: 1

    * - Offset
      - Type
      - Name
      - Description
    * - 0x0
      - \_\_be32
      - t\_checksum
      - Checksum of the journal UUID + the descriptor block, with this field set
        to zero.

 Data Block
 ~~~~~~~~~~

 In general, the data blocks being written to disk through the journal
 are written verbatim into the journal file after the descriptor block.
 However, if the first four bytes of the block match the jbd2 magic
 number then those four bytes are replaced with zeroes and the “escaped”
 flag is set in the descriptor block tag.
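
 The escaping rule can be sketched in a few lines of Python. The function
 names are invented for this example; the ``escaped`` boolean stands in for
 the 0x1 tag flag from the tag flags table:

 .. code-block:: python

    JBD2_MAGIC_BYTES = (0xC03B3998).to_bytes(4, "big")

    def escape_block(data):
        """Return (bytes to store in the journal, escaped flag)."""
        if data[:4] == JBD2_MAGIC_BYTES:
            # Zero the first four bytes so the block cannot be mistaken
            # for a journal metadata block during replay.
            return bytes(4) + data[4:], True
        return data, False

    def unescape_block(stored, escaped):
        """Undo escape_block() when copying a block to its final location."""
        if escaped:
            return JBD2_MAGIC_BYTES + stored[4:]
        return stored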

 Revocation Block
 ~~~~~~~~~~~~~~~~

 A revocation block is used to prevent replay of a block in an earlier
 transaction. This is used to mark blocks that were journalled at one
 time but are no longer journalled. Typically this happens if a metadata
 block is freed and re-allocated as a file data block; in this case, a
 journal replay after the file block was written to disk will cause
 corruption.

 **NOTE**: This mechanism is NOT used to express “this journal block is
 superseded by this other journal block”, as the author (djwong)
 mistakenly thought. Any block being added to a transaction will cause
 the removal of all existing revocation records for that block.

 Revocation blocks are described in
 ``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
 length, but use a full block:

 .. list-table::
    :widths: 8 8 24 40
    :header-rows: 1

    * - Offset
      - Type
      - Name
      - Description
    * - 0x0
      - journal\_header\_t
      - r\_header
      - Common block header.
    * - 0xC
      - \_\_be32
      - r\_count
      - Number of bytes used in this block.
    * - 0x10
      - \_\_be32 or \_\_be64
      - blocks[0]
      - Blocks to revoke.

 After r\_count is a linear array of block numbers that are effectively
 revoked by this transaction. The size of each block number is 8 bytes if
 the superblock advertises 64-bit block number support, or 4 bytes
 otherwise.
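
 A minimal Python sketch of reading such a block follows. It assumes that
 ``r_count`` counts bytes from the start of the block, so that it includes
 the 16 bytes of header and ``r_count`` itself; ``parse_revoke_block`` is a
 name made up for the example:

 .. code-block:: python

    import struct

    REVOKE_HEADER_LEN = 16  # 12-byte common header + 4-byte r_count

    def parse_revoke_block(block, is_64bit):
        """Return (transaction ID, list of revoked block numbers)."""
        magic, blocktype, sequence, r_count = struct.unpack_from(">IIII", block, 0)
        if magic != 0xC03B3998 or blocktype != 5:
            raise ValueError("not a revocation block")
        if not REVOKE_HEADER_LEN <= r_count <= len(block):
            raise ValueError("r_count out of range")
        # 8-byte records with 64-bit block numbers, 4-byte records otherwise.
        fmt = ">Q" if is_64bit else ">I"
        size = struct.calcsize(fmt)
        offsets = range(REVOKE_HEADER_LEN, r_count - size + 1, size)
        return sequence, [struct.unpack_from(fmt, block, off)[0] for off in offsets]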

 If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
 JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the revocation
 block is a ``struct jbd2_journal_revoke_tail``, which has this format:

 .. list-table::
    :widths: 8 8 24 40
    :header-rows: 1

    * - Offset
      - Type
      - Name
      - Description
    * - 0x0
      - \_\_be32
      - r\_checksum
      - Checksum of the journal UUID + revocation block

 Commit Block
 ~~~~~~~~~~~~

 The commit block is a sentry that indicates that a transaction has been
 completely written to the journal. Once this commit block reaches the
 journal, the data stored with this transaction can be written to their
 final locations on disk.

 The commit block is described by ``struct commit_header``, which is 60
 bytes long (but uses a full block):

 .. list-table::
    :widths: 8 8 24 40
    :header-rows: 1

    * - Offset
      - Type
      - Name
      - Description
    * - 0x0
      - journal\_header\_s
      - (open coded)
      - Common block header.
    * - 0xC
      - unsigned char
      - h\_chksum\_type
      - The type of checksum to use to verify the integrity of the data blocks
        in the transaction. See jbd2_checksum_type_ for more info.
    * - 0xD
      - unsigned char
      - h\_chksum\_size
      - The number of bytes used by the checksum. Most likely 4.
    * - 0xE
      - unsigned char
      - h\_padding[2]
      -
    * - 0x10
      - \_\_be32
      - h\_chksum[JBD2\_CHECKSUM\_BYTES]
      - 32 bytes of space to store checksums. If
        JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3
        are set, the first ``__be32`` is the checksum of the journal UUID and
        the entire commit block, with this field zeroed. If
        JBD2\_FEATURE\_COMPAT\_CHECKSUM is set, the first ``__be32`` is the
        crc32 of all the blocks already written to the transaction.
    * - 0x30
      - \_\_be64
      - h\_commit\_sec
      - The time that the transaction was committed, in seconds since the epoch.
    * - 0x38
      - \_\_be32
      - h\_commit\_nsec
      - Nanoseconds component of the above timestamp.
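
 The following Python sketch decodes a commit block using the offsets in the
 table, which end at 0x3C. ``COMMIT_FORMAT`` and ``parse_commit_block`` are
 names chosen for this example, and no checksum verification is attempted:

 .. code-block:: python

    import struct

    # Common header, checksum type/size, 2 bytes of padding,
    # 8 x __be32 of checksum space, then the commit timestamp.
    COMMIT_FORMAT = struct.Struct(">3I B B 2x 8I Q I")

    def parse_commit_block(block):
        """Decode the fixed part of a commit block into a dict."""
        fields = COMMIT_FORMAT.unpack_from(block, 0)
        magic, blocktype, sequence, chksum_type, chksum_size = fields[:5]
        if magic != 0xC03B3998 or blocktype != 2:
            raise ValueError("not a commit block")
        return {
            "sequence": sequence,
            "chksum_type": chksum_type,
            "chksum_size": chksum_size,
            "chksum": fields[5:13],
            "commit_sec": fields[13],
            "commit_nsec": fields[14],
        }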

 Fast commits
 ~~~~~~~~~~~~

 The fast commit area is organized as a log of tag-length-value (TLV) entries.
 Each TLV begins with a ``struct ext4_fc_tl``, which stores the tag and the
 length of the entire field. It is followed by a variable-length, tag-specific
 value. Here is the list of supported tags and their meanings:

 .. list-table::
    :widths: 8 20 20 32
    :header-rows: 1

    * - Tag
      - Meaning
      - Value struct
      - Description
    * - EXT4_FC_TAG_HEAD
      - Fast commit area header
      - ``struct ext4_fc_head``
      - Stores the TID of the transaction after which these fast commits should
        be applied.
    * - EXT4_FC_TAG_ADD_RANGE
      - Add extent to inode
      - ``struct ext4_fc_add_range``
      - Stores the inode number and extent to be added in this inode
    * - EXT4_FC_TAG_DEL_RANGE
      - Remove logical offsets from inode
      - ``struct ext4_fc_del_range``
      - Stores the inode number and the logical offset range that needs to be
        removed
    * - EXT4_FC_TAG_CREAT
      - Create directory entry for a newly created file
      - ``struct ext4_fc_dentry_info``
      - Stores the parent inode number, inode number and directory entry of the
        newly created file
    * - EXT4_FC_TAG_LINK
      - Link a directory entry to an inode
      - ``struct ext4_fc_dentry_info``
      - Stores the parent inode number, inode number and directory entry
    * - EXT4_FC_TAG_UNLINK
      - Unlink a directory entry of an inode
      - ``struct ext4_fc_dentry_info``
      - Stores the parent inode number, inode number and directory entry

    * - EXT4_FC_TAG_PAD
      - Padding (unused area)
      - None
      - Unused bytes in the fast commit area.

    * - EXT4_FC_TAG_TAIL
      - Mark the end of a fast commit
      - ``struct ext4_fc_tail``
      - Stores the TID of the commit and the CRC of the fast commit that this
        tag ends.
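
 Walking such a log is a simple loop. The Python sketch below is heavily
 hedged: the text above does not give the layout of ``struct ext4_fc_tl``, so
 the example *assumes* a 4-byte header made of two little-endian 16-bit
 fields (tag, then length) and assumes the length counts only the value
 bytes. The tag numbers in the usage are placeholders, not the real
 ``EXT4_FC_TAG_*`` values:

 .. code-block:: python

    import struct

    TL_HEADER = struct.Struct("<HH")  # assumed layout: tag, length

    def iter_tlvs(area, tail_tag):
        """Yield (tag, value) pairs until the tail tag or the end of the area."""
        off = 0
        while off + TL_HEADER.size <= len(area):
            tag, length = TL_HEADER.unpack_from(area, off)
            start = off + TL_HEADER.size
            value = area[start:start + length]
            if len(value) < length:
                raise ValueError("truncated TLV")
            yield tag, value
            if tag == tail_tag:
                break  # a tail tag marks the end of one fast commit
            off = start + length

 For example, with placeholder tags 1 (some record) and 2 (tail),
 ``list(iter_tlvs(area, tail_tag=2))`` stops after the first tail entry and
 ignores anything that follows it.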

 Fast Commit Replay Idempotence
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 Fast commit tags are idempotent in nature provided the recovery code follows
 certain rules. The guiding principle that the commit path follows while
 committing is that it stores the result of a particular operation instead of
 storing the procedure.

 Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a'
 was associated with inode 10. During fast commit, instead of storing this
 operation as a procedure "rename a to b", we store the resulting file system
 state as a "series" of outcomes:

 - Link dirent b to inode 10
 - Unlink dirent a
 - Inode 10 with valid refcount

 Now when recovery code runs, it needs to "enforce" this state on the file
 system. This is what guarantees idempotence of fast commit replay.

 Let's take an example of a procedure that is not idempotent and see how fast
 commits make it idempotent. Consider the following sequence of operations:

 1) rm A
 2) mv B A
 3) read A

 If we store this sequence of operations as is then the replay is not idempotent.
 Let's say while in replay, we crash after (2). During the second replay,
 file A (which was actually created as a result of "mv B A" operation) would get
 deleted. Thus, file named A would be absent when we try to read A. So, this
 sequence of operations is not idempotent. However, as mentioned above, instead
 of storing the procedure fast commits store the outcome of each procedure. Thus
 the fast commit log for above procedure would be as follows:

 (Let's assume dirent A was linked to inode 10 and dirent B was linked to
 inode 11 before the replay)

 1) Unlink A
 2) Link A to inode 11
 3) Unlink B
 4) Inode 11

 If we crash after (3) we will have file A linked to inode 11. During the second
 replay, we will remove file A (inode 11). But we will create it back and make
 it point to inode 11. We won't find B, so we'll just skip that step. At this
 point, the refcount for inode 11 is not reliable, but that gets fixed by the
 replay of last inode 11 tag. Thus, by converting a non-idempotent procedure
 into a series of idempotent outcomes, fast commits ensured idempotence during
 the replay.
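
 The argument above can be checked with a toy model. The Python sketch below
 represents the directory as a dict of names to inode numbers and the link
 counts as a second dict; ``replay`` and the tuple-based log format are
 invented for this illustration and bear no relation to the on-disk encoding:

 .. code-block:: python

    def replay(dirents, nlink, log):
        """Enforce each logged outcome; every step is safe to repeat."""
        for entry in log:
            op = entry[0]
            if op == "unlink":
                dirents.pop(entry[1], None)   # a missing dirent is skipped
            elif op == "link":
                dirents[entry[1]] = entry[2]  # (re)point the dirent
            elif op == "inode":
                nlink[entry[1]] = entry[2]    # overwrite with the logged count
        return dirents, nlink

    # Fast commit log for "rm A; mv B A", as listed above.
    LOG = [("unlink", "A"), ("link", "A", 11), ("unlink", "B"), ("inode", 11, 1)]

    # Clean replay from the pre-crash state.
    clean = replay({"A": 10, "B": 11}, {10: 1, 11: 1}, LOG)

    # Replay interrupted after step 3, then restarted from the beginning.
    dirents, nlink = replay({"A": 10, "B": 11}, {10: 1, 11: 1}, LOG[:3])
    rerun = replay(dirents, nlink, LOG)

 Both ``clean`` and ``rerun`` end with ``A`` linked to inode 11 and no ``B``,
 because each step sets state rather than transforming it.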
	.. SPDX-License-Identifier: GPL-2.0

	Journal (jbd2)
	--------------

	Introduced in ext3, the ext4 filesystem employs a journal to protect the
	filesystem against corruption in the case of a system crash. A small
	continuous region of disk (default 128MiB) is reserved inside the
	filesystem as a place to land “important” data writes on-disk as quickly
	as possible. Once the important data transaction is fully written to the
	disk and flushed from the disk write cache, a record of the data being
	committed is also written to the journal. At some later point in time,
	the journal code writes the transactions to their final locations on
	disk (this could involve a lot of seeking or a lot of small
	read-write-erases) before erasing the commit record. Should the system
	crash during the second slow write, the journal can be replayed all the
	way to the latest commit record, guaranteeing the atomicity of whatever
	gets written through the journal to the disk. The effect of this is to
	guarantee that the filesystem does not become stuck midway through a
	metadata update.

	For performance reasons, ext4 by default only writes filesystem metadata
	through the journal. This means that file data blocks are /not/
	guaranteed to be in any consistent state after a crash. If this default
	guarantee level (``data=ordered``) is not satisfactory, there is a mount
	option to control journal behavior. If ``data=journal``, all data and
	metadata are written to disk through the journal. This is slower but
	safest. If ``data=writeback``, dirty data blocks are not flushed to the
	disk before the metadata are written to disk through the journal.

	In case of ``data=ordered`` mode, Ext4 also supports fast commits which
	help reduce commit latency significantly. The default ``data=ordered``
	mode works by logging metadata blocks to the journal. In fast commit
	mode, Ext4 only stores the minimal delta needed to recreate the
	affected metadata in fast commit space that is shared with JBD2.
	Once the fast commit area fills in or if fast commit is not possible
	or if JBD2 commit timer goes off, Ext4 performs a traditional full commit.
	A full commit invalidates all the fast commits that happened before
	it and thus it makes the fast commit area empty for further fast
	commits. This feature needs to be enabled at mkfs time.

	The journal inode is typically inode 8. The first 68 bytes of the
	journal inode are replicated in the ext4 superblock. The journal itself
	is normal (but hidden) file within the filesystem. The file usually
	consumes an entire block group, though mke2fs tries to put it in the
	middle of the disk.

	All fields in jbd2 are written to disk in big-endian order. This is the
	opposite of ext4.

	NOTE: Both ext4 and ocfs2 use jbd2.

	The maximum size of a journal embedded in an ext4 filesystem is 2^32
	blocks. jbd2 itself does not seem to care.

	Layout
	~~~~~~

	Generally speaking, the journal has this format:

	.. list-table::
	:widths: 16 48 16
	:header-rows: 1

	* - Superblock
	- descriptor\_block (data\_blocks or revocation\_block) [more data or
	revocations] commmit\_block
	- [more transactions...]
	* -
	- One transaction
	-

	Notice that a transaction begins with either a descriptor and some data,
	or a block revocation list. A finished transaction always ends with a
	commit. If there is no commit record (or the checksums don't match), the
	transaction will be discarded during replay.

	External Journal
	~~~~~~~~~~~~~~~~

	Optionally, an ext4 filesystem can be created with an external journal
	device (as opposed to an internal journal, which uses a reserved inode).
	In this case, on the filesystem device, ``s_journal_inum`` should be
	zero and ``s_journal_uuid`` should be set. On the journal device there
	will be an ext4 super block in the usual place, with a matching UUID.
	The journal superblock will be in the next full block after the
	superblock.

	.. list-table::
	:widths: 12 12 12 32 12
	:header-rows: 1

	* - 1024 bytes of padding
	- ext4 Superblock
	- Journal Superblock
	- descriptor\_block (data\_blocks or revocation\_block) [more data or
	revocations] commmit\_block
	- [more transactions...]
	* -
	-
	-
	- One transaction
	-

	Block Header
	~~~~~~~~~~~~

	Every block in the journal starts with a common 12-byte header
	``struct journal_header_s``:

	.. list-table::
	:widths: 8 8 24 40
	:header-rows: 1

	* - Offset
	- Type
	- Name
	- Description
	* - 0x0
	- \_\_be32
	- h\_magic
	- jbd2 magic number, 0xC03B3998.
	* - 0x4
	- \_\_be32
	- h\_blocktype
	- Description of what this block contains. See the jbd2_blocktype_ table
	below.
	* - 0x8
	- \_\_be32
	- h\_sequence
	- The transaction ID that goes with this block.

	.. _jbd2_blocktype:

	The journal block type can be any one of:

	.. list-table::
	:widths: 16 64
	:header-rows: 1

	* - Value
	- Description
	* - 1
	- Descriptor. This block precedes a series of data blocks that were
	written through the journal during a transaction.
	* - 2
	- Block commit record. This block signifies the completion of a
	transaction.
	* - 3
	- Journal superblock, v1.
	* - 4
	- Journal superblock, v2.
	* - 5
	- Block revocation records. This speeds up recovery by enabling the
	journal to skip writing blocks that were subsequently rewritten.

	Super Block
	~~~~~~~~~~~

	The super block for the journal is much simpler as compared to ext4's.
	The key data kept within are size of the journal, and where to find the
	start of the log of transactions.

	The journal superblock is recorded as ``struct journal_superblock_s``,
	which is 1024 bytes long:

	.. list-table::
	:widths: 8 8 24 40
	:header-rows: 1

	* - Offset
	- Type
	- Name
	- Description
	* -
	-
	-
	- Static information describing the journal.
	* - 0x0
	- journal\_header\_t (12 bytes)
	- s\_header
	- Common header identifying this as a superblock.
	* - 0xC
	- \_\_be32
	- s\_blocksize
	- Journal device block size.
	* - 0x10
	- \_\_be32
	- s\_maxlen
	- Total number of blocks in this journal.
	* - 0x14
	- \_\_be32
	- s\_first
	- First block of log information.
	* -
	-
	-
	- Dynamic information describing the current state of the log.
	* - 0x18
	- \_\_be32
	- s\_sequence
	- First commit ID expected in log.
	* - 0x1C
	- \_\_be32
	- s\_start
	- Block number of the start of log. Contrary to the comments, this field
	being zero does not imply that the journal is clean!
	* - 0x20
	- \_\_be32
	- s\_errno
	- Error value, as set by jbd2\_journal\_abort().
	* -
	-
	-
	- The remaining fields are only valid in a v2 superblock.
	* - 0x24
	- \_\_be32
	- s\_feature\_compat;
	- Compatible feature set. See the table jbd2_compat_ below.
	* - 0x28
	- \_\_be32
	- s\_feature\_incompat
	- Incompatible feature set. See the table jbd2_incompat_ below.
	* - 0x2C
	- \_\_be32
	- s\_feature\_ro\_compat
	- Read-only compatible feature set. There aren't any of these currently.
	* - 0x30
	- \_\_u8
	- s\_uuid[16]
	- 128-bit uuid for journal. This is compared against the copy in the ext4
	super block at mount time.
	* - 0x40
	- \_\_be32
	- s\_nr\_users
	- Number of file systems sharing this journal.
	* - 0x44
	- \_\_be32
	- s\_dynsuper
	- Location of dynamic super block copy. (Not used?)
	* - 0x48
	- \_\_be32
	- s\_max\_transaction
	- Limit of journal blocks per transaction. (Not used?)
	* - 0x4C
	- \_\_be32
	- s\_max\_trans\_data
	- Limit of data blocks per transaction. (Not used?)
	* - 0x50
	- \_\_u8
	- s\_checksum\_type
	- Checksum algorithm used for the journal. See jbd2_checksum_type_ for
	more info.
	* - 0x51
	- \_\_u8[3]
	- s\_padding2
	-
	* - 0x54
	- \_\_be32
	- s\_num\_fc\_blocks
	- Number of fast commit blocks in the journal.
	* - 0x58
	- \_\_u32
	- s\_padding[42]
	-
	* - 0xFC
	- \_\_be32
	- s\_checksum
	- Checksum of the entire superblock, with this field set to zero.
	* - 0x100
	- \_\_u8
	- s\_users[16\*48]
	- ids of all file systems sharing the log. e2fsprogs/Linux don't allow
	shared external journals, but I imagine Lustre (or ocfs2?), which use
	the jbd2 code, might.

	.. _jbd2_compat:

	The journal compat features are any combination of the following:

	.. list-table::
	:widths: 16 64
	:header-rows: 1

	* - Value
	- Description
	* - 0x1
	- Journal maintains checksums on the data blocks.
	(JBD2\_FEATURE\_COMPAT\_CHECKSUM)

	.. _jbd2_incompat:

	The journal incompat features are any combination of the following:

	.. list-table::
	:widths: 16 64
	:header-rows: 1

	* - Value
	- Description
	* - 0x1
	- Journal has block revocation records. (JBD2\_FEATURE\_INCOMPAT\_REVOKE)
	* - 0x2
	- Journal can deal with 64-bit block numbers.
	(JBD2\_FEATURE\_INCOMPAT\_64BIT)
	* - 0x4
	- Journal commits asynchronously. (JBD2\_FEATURE\_INCOMPAT\_ASYNC\_COMMIT)
	* - 0x8
	- This journal uses v2 of the checksum on-disk format. Each journal
	metadata block gets its own checksum, and the block tags in the
	descriptor table contain checksums for each of the data blocks in the
	journal. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2)
	* - 0x10
	- This journal uses v3 of the checksum on-disk format. This is the same as
	v2, but the journal block tag size is fixed regardless of the size of
	block numbers. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3)
	* - 0x20
	- Journal has fast commit blocks. (JBD2\_FEATURE\_INCOMPAT\_FAST\_COMMIT)
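
	Because the incompat field is a bitmask, a reader typically tests each bit
	in turn. The sketch below maps the values from the table above to their
	names; the helper ``decode_incompat`` is hypothetical and returns any
	unrecognized bits separately so that a caller could refuse to mount.

```python
INCOMPAT_FLAGS = {
    0x1: "JBD2_FEATURE_INCOMPAT_REVOKE",
    0x2: "JBD2_FEATURE_INCOMPAT_64BIT",
    0x4: "JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT",
    0x8: "JBD2_FEATURE_INCOMPAT_CSUM_V2",
    0x10: "JBD2_FEATURE_INCOMPAT_CSUM_V3",
    0x20: "JBD2_FEATURE_INCOMPAT_FAST_COMMIT",
}


def decode_incompat(value: int):
    """Return (list of known flag names, mask of unknown bits)."""
    names = [name for bit, name in INCOMPAT_FLAGS.items() if value & bit]
    known_mask = 0
    for bit in INCOMPAT_FLAGS:
        known_mask |= bit
    return names, value & ~known_mask
```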

	.. _jbd2_checksum_type:

	Journal checksum type codes are one of the following. crc32 or crc32c are the
	most likely choices.

	.. list-table::
	:widths: 16 64
	:header-rows: 1

	* - Value
	- Description
	* - 1
	- CRC32
	* - 2
	- MD5
	* - 3
	- SHA1
	* - 4
	- CRC32C

	Descriptor Block
	~~~~~~~~~~~~~~~~

	The descriptor block contains an array of journal block tags that
	describe the final locations of the data blocks that follow in the
	journal. Descriptor blocks are open-coded instead of being completely
	described by a data structure, but here is the block structure anyway.
	Descriptor blocks consume at least 36 bytes, but use a full block:

	.. list-table::
	:widths: 8 8 24 40
	:header-rows: 1

	* - Offset
	- Type
	- Name
	- Description
	* - 0x0
	- journal\_header\_t
	- (open coded)
	- Common block header.
	* - 0xC
	- struct journal\_block\_tag\_s
	- open coded array[]
	- Enough tags either to fill up the block or to describe all the data
	blocks that follow this descriptor block.

	Journal block tags have any of the following formats, depending on which
	journal feature and block tag flags are set.

	If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the journal block tag is
	defined as ``struct journal_block_tag3_s``, which looks like the
	following. The size is 16 or 32 bytes.

	.. list-table::
	:widths: 8 8 24 40
	:header-rows: 1

	* - Offset
	- Type
	- Name
	- Description
	* - 0x0
	- \_\_be32
	- t\_blocknr
	- Lower 32-bits of the location of where the corresponding data block
	should end up on disk.
	* - 0x4
	- \_\_be32
	- t\_flags
	- Flags that go with the descriptor. See the table jbd2_tag_flags_ for
	more info.
	* - 0x8
	- \_\_be32
	- t\_blocknr\_high
	- Upper 32-bits of the location of where the corresponding data block
	should end up on disk. This is zero if JBD2\_FEATURE\_INCOMPAT\_64BIT is
	not enabled.
	* - 0xC
	- \_\_be32
	- t\_checksum
	- Checksum of the journal UUID, the sequence number, and the data block.
	* -
	-
	-
	- This field appears to be open coded. It always comes at the end of the
	tag, after t_checksum. This field is not present if the "same UUID" flag
	is set.
	* - 0x10
	- char
	- uuid[16]
	- A UUID to go with this tag. This field appears to be copied from the
	``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
	field.

	.. _jbd2_tag_flags:

	The journal tag flags are any combination of the following:

	.. list-table::
	:widths: 16 64
	:header-rows: 1

	* - Value
	- Description
	* - 0x1
	- On-disk block is escaped. The first four bytes of the data block just
	happened to match the jbd2 magic number.
	* - 0x2
	- This block has the same UUID as previous, therefore the UUID field is
	omitted.
	* - 0x4
	- The data block was deleted by the transaction. (Not used?)
	* - 0x8
	- This is the last tag in this descriptor block.
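
	To make the CSUM\_V3 tag layout and the flags concrete, here is a rough
	Python sketch that walks the tags of a descriptor block. The names
	``parse_tag3``, ``iter_tags3`` and the ``FLAG_*`` constants are hypothetical.
	The sketch assumes the tags start right after the 12-byte common header,
	that the UUID (when present) immediately follows t\_checksum, and that the
	last tag carries the "last tag" flag.

```python
import struct

FLAG_ESCAPE = 0x1
FLAG_SAME_UUID = 0x2
FLAG_DELETED = 0x4
FLAG_LAST_TAG = 0x8


def parse_tag3(buf: bytes, offset: int = 0) -> dict:
    """Decode one v3 block tag (big-endian) starting at offset."""
    blocknr_lo, flags, blocknr_hi, checksum = struct.unpack_from(">4I", buf, offset)
    size, tag_uuid = 16, None
    if not flags & FLAG_SAME_UUID:
        # A 16-byte UUID trails the fixed part of the tag.
        tag_uuid = bytes(buf[offset + 16:offset + 32])
        size = 32
    return {
        "blocknr": (blocknr_hi << 32) | blocknr_lo,
        "flags": flags,
        "checksum": checksum,
        "uuid": tag_uuid,
        "size": size,
    }


def iter_tags3(block: bytes):
    """Yield tags from a descriptor block until the last-tag flag is seen."""
    offset = 0xC
    while offset + 16 <= len(block):
        tag = parse_tag3(block, offset)
        yield tag
        if tag["flags"] & FLAG_LAST_TAG:
            break
        offset += tag["size"]
```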

	If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is NOT set, the journal block tag
	is defined as ``struct journal_block_tag_s``, which looks like the
	following. The size is 8, 12, 24, or 28 bytes:

	.. list-table::
	:widths: 8 8 24 40
	:header-rows: 1

	* - Offset
	- Type
	- Name
	- Description
	* - 0x0
	- \_\_be32
	- t\_blocknr
	- Lower 32-bits of the location of where the corresponding data block
	should end up on disk.
	* - 0x4
	- \_\_be16
	- t\_checksum
	- Checksum of the journal UUID, the sequence number, and the data block.
	Note that only the lower 16 bits are stored.
	* - 0x6
	- \_\_be16
	- t\_flags
	- Flags that go with the descriptor. See the table jbd2_tag_flags_ for
	more info.
	* -
	-
	-
	- This next field is only present if the super block indicates support for
	64-bit block numbers.
	* - 0x8
	- \_\_be32
	- t\_blocknr\_high
	- Upper 32-bits of the location of where the corresponding data block
	should end up on disk.
	* -
	-
	-
	- This field appears to be open coded. It always comes at the end of the
	tag, after t_flags or t_blocknr_high. This field is not present if the
	"same UUID" flag is set.
	* - 0x8 or 0xC
	- char
	- uuid[16]
	- A UUID to go with this tag. This field appears to be copied from the
	``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
	field.
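
	The four possible sizes follow from which optional fields are present. The
	sketch below reproduces that arithmetic using only the layout in the table
	above; the function name ``legacy_tag_size`` is hypothetical, and the sketch
	does not attempt to model any other feature-dependent size adjustment an
	implementation might apply.

```python
def legacy_tag_size(has_64bit: bool, same_uuid: bool) -> int:
    """Size in bytes of a non-v3 journal block tag, per the table layout."""
    size = 8                 # t_blocknr (4) + t_checksum (2) + t_flags (2)
    if has_64bit:
        size += 4            # t_blocknr_high
    if not same_uuid:
        size += 16           # trailing uuid[16]
    return size
```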

	If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
	JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the end of the block is a
	``struct jbd2_journal_block_tail``, which looks like this:

	.. list-table::
	:widths: 8 8 24 40
	:header-rows: 1

	* - Offset
	- Type
	- Name
	- Description
	* - 0x0
	- \_\_be32
	- t\_checksum
	- Checksum of the journal UUID + the descriptor block, with this field set
	to zero.

	Data Block
	~~~~~~~~~~

	In general, the data blocks being written to disk through the journal
	are written verbatim into the journal file after the descriptor block.
	However, if the first four bytes of the block match the jbd2 magic
	number then those four bytes are replaced with zeroes and the “escaped”
	flag is set in the descriptor block tag.
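
	The escaping rule can be sketched in a few lines of Python. The helper
	names and ``FLAG_ESCAPE`` are hypothetical; the sketch assumes the jbd2
	magic number is 0xC03B3998 stored big-endian, and that replay restores the
	four bytes when the escaped flag (0x1) is set in the tag.

```python
import struct

JBD2_MAGIC = struct.pack(">I", 0xC03B3998)  # assumed on-disk magic bytes
FLAG_ESCAPE = 0x1


def escape_block(data: bytes):
    """Return (bytes to store in the journal, tag flags to set)."""
    if data[:4] == JBD2_MAGIC:
        # Zero the colliding bytes so the block cannot look like a header.
        return b"\x00\x00\x00\x00" + data[4:], FLAG_ESCAPE
    return data, 0


def unescape_block(journal_copy: bytes, flags: int) -> bytes:
    """Undo escape_block() during replay."""
    if flags & FLAG_ESCAPE:
        return JBD2_MAGIC + journal_copy[4:]
    return journal_copy
```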

	Revocation Block
	~~~~~~~~~~~~~~~~

	A revocation block is used to prevent replay of a block in an earlier
	transaction. This is used to mark blocks that were journalled at one
	time but are no longer journalled. Typically this happens if a metadata
	block is freed and re-allocated as a file data block; in this case, a
	journal replay after the file block was written to disk will cause
	corruption.

	NOTE: This mechanism is NOT used to express “this journal block is
	superseded by this other journal block”, as the author (djwong)
	mistakenly thought. Any block being added to a transaction will cause
	the removal of all existing revocation records for that block.

	Revocation blocks are described in
	``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
	length, but use a full block:

	.. list-table::
	:widths: 8 8 24 40
	:header-rows: 1

	* - Offset
	- Type
	- Name
	- Description
	* - 0x0
	- journal\_header\_t
	- r\_header
	- Common block header.
	* - 0xC
	- \_\_be32
	- r\_count
	- Number of bytes used in this block.
	* - 0x10
	- \_\_be32 or \_\_be64
	- blocks[0]
	- Blocks to revoke.

	After r\_count is a linear array of block numbers that are effectively
	revoked by this transaction. The size of each block number is 8 bytes if
	the superblock advertises 64-bit block number support, or 4 bytes
	otherwise.
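
	A minimal Python sketch of reading the revoked block numbers follows. The
	function name ``parse_revoke_block`` is hypothetical, and the sketch assumes
	that r\_count includes the 16 bytes of header, so that the array ends at
	byte offset r\_count.

```python
import struct


def parse_revoke_block(block: bytes, has_64bit: bool) -> list:
    """Return the list of revoked block numbers in a revocation block."""
    (r_count,) = struct.unpack_from(">I", block, 0xC)
    if not 0x10 <= r_count <= len(block):
        raise ValueError("r_count out of range")
    width, fmt = (8, ">Q") if has_64bit else (4, ">I")
    return [
        struct.unpack_from(fmt, block, off)[0]
        for off in range(0x10, r_count - width + 1, width)
    ]
```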

	If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
	JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the end of the revocation
	block is a ``struct jbd2_journal_revoke_tail``, which has this format:

	.. list-table::
	:widths: 8 8 24 40
	:header-rows: 1

	* - Offset
	- Type
	- Name
	- Description
	* - 0x0
	- \_\_be32
	- r\_checksum
	- Checksum of the journal UUID + revocation block

	Commit Block
	~~~~~~~~~~~~

	The commit block is a sentry that indicates that a transaction has been
	completely written to the journal. Once this commit block reaches the
	journal, the data stored with this transaction can be written to their
	final locations on disk.

	The commit block is described by ``struct commit_header``, which is 60
	bytes long (but uses a full block):

	.. list-table::
	:widths: 8 8 24 40
	:header-rows: 1

	* - Offset
	- Type
	- Name
	- Description
	* - 0x0
	- journal\_header\_s
	- (open coded)
	- Common block header.
	* - 0xC
	- unsigned char
	- h\_chksum\_type
	- The type of checksum to use to verify the integrity of the data blocks
	in the transaction. See jbd2_checksum_type_ for more info.
	* - 0xD
	- unsigned char
	- h\_chksum\_size
	- The number of bytes used by the checksum. Most likely 4.
	* - 0xE
	- unsigned char
	- h\_padding[2]
	-
	* - 0x10
	- \_\_be32
	- h\_chksum[JBD2\_CHECKSUM\_BYTES]
	- 32 bytes of space to store checksums. If
	JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3
	is set, the first ``__be32`` is the checksum of the journal UUID and
	the entire commit block, with this field zeroed. If
	JBD2\_FEATURE\_COMPAT\_CHECKSUM is set, the first ``__be32`` is the
	crc32 of all the blocks already written to the transaction.
	* - 0x30
	- \_\_be64
	- h\_commit\_sec
	- The time that the transaction was committed, in seconds since the epoch.
	* - 0x38
	- \_\_be32
	- h\_commit\_nsec
	- Nanoseconds component of the above timestamp.
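
	The offsets in the table can be expressed as a single ``struct`` format
	string, which is a convenient way to check the layout. In this Python
	sketch, ``COMMIT_HEADER`` and ``parse_commit_block`` are hypothetical names,
	and the common block header is assumed to be three big-endian 32-bit words
	(magic, block type, sequence number).

```python
import struct

# 3 x be32 header, 2 x u8, 2 pad bytes, 8 x be32 checksum space, be64, be32
COMMIT_HEADER = struct.Struct(">3I BB 2x 8I Q I")


def parse_commit_block(block: bytes) -> dict:
    """Decode the fixed part of a commit block."""
    fields = COMMIT_HEADER.unpack_from(block, 0)
    return {
        "h_magic": fields[0],
        "h_blocktype": fields[1],
        "h_sequence": fields[2],
        "h_chksum_type": fields[3],
        "h_chksum_size": fields[4],
        "h_chksum": list(fields[5:13]),
        "h_commit_sec": fields[13],
        "h_commit_nsec": fields[14],
    }
```

	``COMMIT_HEADER.size`` gives the number of bytes covered by the table, and
	``h_commit_sec`` lands at offset 0x30 as listed above.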

	Fast commits
	~~~~~~~~~~~~

	The fast commit area is organized as a log of tag-length-value (TLV) records.
	Each TLV begins with a ``struct ext4_fc_tl``, which stores the tag and the
	length of the entire field, followed by a variable-length, tag-specific value.
	Here is the list of supported tags and their meanings:

	.. list-table::
	:widths: 8 20 20 32
	:header-rows: 1

	* - Tag
	- Meaning
	- Value struct
	- Description
	* - EXT4_FC_TAG_HEAD
	- Fast commit area header
	- ``struct ext4_fc_head``
	- Stores the TID of the transaction after which these fast commits should
	be applied.
	* - EXT4_FC_TAG_ADD_RANGE
	- Add extent to inode
	- ``struct ext4_fc_add_range``
	- Stores the inode number and extent to be added in this inode
	* - EXT4_FC_TAG_DEL_RANGE
	- Remove logical offsets from inode
	- ``struct ext4_fc_del_range``
	- Stores the inode number and the logical offset range that needs to be
	removed
	* - EXT4_FC_TAG_CREAT
	- Create directory entry for a newly created file
	- ``struct ext4_fc_dentry_info``
	- Stores the parent inode number, inode number and directory entry of the
	newly created file
	* - EXT4_FC_TAG_LINK
	- Link a directory entry to an inode
	- ``struct ext4_fc_dentry_info``
	- Stores the parent inode number, inode number and directory entry
	* - EXT4_FC_TAG_UNLINK
	- Unlink a directory entry of an inode
	- ``struct ext4_fc_dentry_info``
	- Stores the parent inode number, inode number and directory entry

	* - EXT4_FC_TAG_PAD
	- Padding (unused area)
	- None
	- Unused bytes in the fast commit area.

	* - EXT4_FC_TAG_TAIL
	- Mark the end of a fast commit
	- ``struct ext4_fc_tail``
	- Stores the TID of the commit and the CRC of the fast commit that this
	tag terminates.
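
	Walking a TLV log is a simple loop over headers and values. The Python
	sketch below is illustrative only: ``TL_HEADER`` and ``iter_tlvs`` are
	hypothetical names, and the header layout (a 16-bit tag followed by a
	16-bit length, both little-endian, with the length covering only the value)
	is an assumption that should be checked against ``struct ext4_fc_tl`` in the
	kernel sources.

```python
import struct

# Assumed header layout: 16-bit tag, 16-bit length, little-endian.
TL_HEADER = struct.Struct("<HH")


def iter_tlvs(area: bytes):
    """Yield (tag, value bytes) pairs from a fast commit area."""
    offset = 0
    while offset + TL_HEADER.size <= len(area):
        tag, length = TL_HEADER.unpack_from(area, offset)
        start = offset + TL_HEADER.size
        end = start + length
        if end > len(area):
            raise ValueError("truncated TLV")
        yield tag, bytes(area[start:end])
        offset = end
```

	A real reader would stop at the tail tag and verify its CRC rather than
	running to the end of the buffer.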

	Fast Commit Replay Idempotence
	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

	Fast commit tags are idempotent in nature provided the recovery code follows
	certain rules. The guiding principle that the commit path follows while
	committing is that it stores the result of a particular operation instead of
	storing the procedure.

	Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a'
	was associated with inode 10. During fast commit, instead of storing this
	operation as a procedure "rename a to b", we store the resulting file system
	state as a "series" of outcomes:

	- Link dirent b to inode 10
	- Unlink dirent a
	- Inode 10 with valid refcount

	Now when recovery code runs, it needs to "enforce" this state on the file
	system. This is what guarantees idempotence of fast commit replay.

	Let's take an example of a procedure that is not idempotent and see how fast
	commits make it idempotent. Consider the following sequence of operations:

	1) rm A
	2) mv B A
	3) read A

	If we store this sequence of operations as is then the replay is not idempotent.
	Let's say while in replay, we crash after (2). During the second replay,
	file A (which was actually created as a result of "mv B A" operation) would get
	deleted. Thus, file named A would be absent when we try to read A. So, this
	sequence of operations is not idempotent. However, as mentioned above, instead
	of storing the procedure fast commits store the outcome of each procedure. Thus
	the fast commit log for above procedure would be as follows:

	(Let's assume dirent A was linked to inode 10 and dirent B was linked to
	inode 11 before the replay)

	1) Unlink A
	2) Link A to inode 11
	3) Unlink B
	4) Inode 11

	If we crash after (3) we will have file A linked to inode 11. During the second
	replay, we will remove file A (inode 11). But we will create it back and make
	it point to inode 11. We won't find B, so we'll just skip that step. At this
	point, the refcount for inode 11 is not reliable, but that gets fixed by the
	replay of last inode 11 tag. Thus, by converting a non-idempotent procedure
	into a series of idempotent outcomes, fast commits ensured idempotence during
	the replay.
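
	The argument above can be checked with a toy model. In this Python sketch
	the namespace is a plain dictionary mapping names to inode numbers, and
	``replay`` and ``LOG`` are hypothetical names. The sketch models only the
	link and unlink outcomes; the final inode tag, which repairs the refcount,
	is omitted.

```python
def replay(fs: dict, log: list, stop_after=None) -> dict:
    """Enforce outcome records on a toy namespace (name -> inode)."""
    for step, (op, name, inode) in enumerate(log, start=1):
        if op == "link":
            fs[name] = inode       # enforce: name points at inode
        elif op == "unlink":
            fs.pop(name, None)     # enforce: name is absent; no-op if gone
        if stop_after == step:
            break                  # simulate a crash mid-replay
    return fs


LOG = [
    ("unlink", "A", None),
    ("link", "A", 11),
    ("unlink", "B", None),
]

before = {"A": 10, "B": 11}
clean = replay(dict(before), LOG)
crashed = replay(dict(before), LOG, stop_after=3)
recovered = replay(crashed, LOG)
```

	Here ``recovered`` equals ``clean``: replaying the whole log after an
	interrupted replay reaches the same namespace as a single uninterrupted
	replay.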