| .. SPDX-License-Identifier: GPL-2.0 |
| .. _xfs_online_fsck_design: |
| |
| .. |
| Mapping of heading styles within this document: |
| Heading 1 uses "====" above and below |
| Heading 2 uses "====" |
| Heading 3 uses "----" |
| Heading 4 uses "````" |
| Heading 5 uses "^^^^" |
| Heading 6 uses "~~~~" |
| Heading 7 uses "...." |
| |
| Sections are manually numbered because apparently that's what everyone |
| does in the kernel. |
| |
| ====================== |
| XFS Online Fsck Design |
| ====================== |
| |
| This document captures the design of the online filesystem check feature for |
| XFS. |
| The purpose of this document is threefold: |
| |
| - To help kernel distributors understand exactly what the XFS online fsck |
| feature is, and issues about which they should be aware. |
| |
| - To help people reading the code to familiarize themselves with the relevant |
| concepts and design points before they start digging into the code. |
| |
| - To help developers maintaining the system by capturing the reasons |
| supporting higher level decision making. |
| |
| As the online fsck code is merged, the links in this document to topic branches |
| will be replaced with links to code. |
| |
| This document is licensed under the terms of the GNU General Public License, v2. |
| The primary author is Darrick J. Wong. |
| |
| This design document is split into seven parts. |
| Part 1 defines what fsck tools are and the motivations for writing a new one. |
| Parts 2 and 3 present a high level overview of how the online fsck process |
| works and how it is tested to ensure correct functionality. |
| Part 4 discusses the user interface and the intended usage modes of the new |
| program. |
| Parts 5 and 6 show off the high level components and how they fit together, and |
| then present case studies of how each repair function actually works. |
| Part 7 sums up what has been discussed so far and speculates about what else |
| might be built atop online fsck. |
| |
| .. contents:: Table of Contents |
| :local: |
| |
| 1. What is a Filesystem Check? |
| ============================== |
| |
| A Unix filesystem has four main responsibilities: |
| |
| - Provide a hierarchy of names through which application programs can associate |
| arbitrary blobs of data for any length of time, |
| |
| - Virtualize physical storage media across those names, |
| |
| - Retrieve the named data blobs at any time, and |
| |
| - Examine resource usage. |
| |
| Metadata directly supporting these functions (e.g. files, directories, space |
| mappings) are sometimes called primary metadata. |
| Secondary metadata (e.g. reverse mapping and directory parent pointers) support |
| operations internal to the filesystem, such as internal consistency checking |
| and reorganization. |
| Summary metadata, as the name implies, condense information contained in |
| primary metadata for performance reasons. |
| |
| The filesystem check (fsck) tool examines all the metadata in a filesystem |
| to look for errors. |
| In addition to looking for obvious metadata corruptions, fsck also |
| cross-references different types of metadata records with each other to look |
| for inconsistencies. |
| People do not like losing data, so most fsck tools also contain some ability |
| to correct any problems found. |
| As a word of caution -- the primary goal of most Linux fsck tools is to restore |
| the filesystem metadata to a consistent state, not to maximize the data |
| recovered. |
| That precedent will not be challenged here. |
| |
| Filesystems of the 20th century generally lacked any redundancy in the ondisk |
| format, which means that fsck can only respond to errors by erasing files until |
| errors are no longer detected. |
| More recent filesystem designs contain enough redundancy in their metadata that |
| it is now possible to regenerate data structures when non-catastrophic errors |
| occur; this capability aids both strategies: restoring metadata consistency and |
| maximizing the amount of data recovered. |
| |
| +--------------------------------------------------------------------------+ |
| | **Note**: | |
| +--------------------------------------------------------------------------+ |
| | System administrators avoid data loss by increasing the number of | |
| | separate storage systems through the creation of backups; and they avoid | |
| | downtime by increasing the redundancy of each storage system through the | |
| | creation of RAID arrays. | |
| | fsck tools address only the first problem. | |
| +--------------------------------------------------------------------------+ |
| |
| TLDR; Show Me the Code! |
| ----------------------- |
| |
| Code is posted to the kernel.org git trees as follows: |
| `kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_, |
| `userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and |
| `QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_. |
| Each kernel patchset adding an online repair function will use the same branch |
| name across the kernel, xfsprogs, and fstests git repos. |
| |
| Existing Tools |
| -------------- |
| |
| The online fsck tool described here will be the third tool in the history of |
| XFS (on Linux) to check and repair filesystems. |
| Two programs precede it: |
| |
| The first program, ``xfs_check``, was created as part of the XFS debugger |
| (``xfs_db``) and can only be used with unmounted filesystems. |
| It walks all metadata in the filesystem looking for inconsistencies, though it |
| lacks any ability to repair what it finds. |
| Due to its high memory requirements and inability to repair things, this |
| program is now deprecated and will not be discussed further. |
| |
| The second program, ``xfs_repair``, was created to be faster and more robust |
| than the first program. |
| Like its predecessor, it can only be used with unmounted filesystems. |
| It uses extent-based in-memory data structures to reduce memory consumption, |
| and tries to schedule readahead IO appropriately to reduce IO waiting time |
| while it scans the metadata of the entire filesystem. |
| The most important feature of this tool is its ability to respond to |
| inconsistencies in file metadata and the directory tree by erasing things as |
| needed to eliminate problems. |
| Space usage metadata are rebuilt from the observed file metadata. |
| |
| Problem Statement |
| ----------------- |
| |
| The current XFS tools leave several problems unsolved: |
| |
| 1. **User programs** suddenly **lose access** to the filesystem when unexpected |
| shutdowns occur as a result of silent corruptions in the metadata. |
| These occur **unpredictably** and often without warning. |
| |
| 2. **Users** experience a **total loss of service** during the recovery period |
| after an **unexpected shutdown** occurs. |
| |
| 3. **Users** experience a **total loss of service** if the filesystem is taken |
| offline to **look for problems** proactively. |
| |
| 4. **Data owners** cannot **check the integrity** of their stored data without |
| reading all of it. |
| This may expose them to substantial billing costs when a linear media scan |
| performed by the storage system administrator might suffice. |
| |
| 5. **System administrators** cannot **schedule** a maintenance window to deal |
| with corruptions if they **lack the means** to assess filesystem health |
| while the filesystem is online. |
| |
| 6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem |
| health when doing so requires **manual intervention** and downtime. |
| |
| 7. **Users** can be tricked into **doing things they do not desire** when |
| malicious actors **exploit quirks of Unicode** to place misleading names |
| in directories. |
| |
| Given this definition of the problems to be solved and the actors who would |
| benefit, the proposed solution is a third fsck tool that acts on a running |
| filesystem. |
| |
| This new third program has three components: an in-kernel facility to check |
| metadata, an in-kernel facility to repair metadata, and a userspace driver |
| program to drive fsck activity on a live filesystem. |
| ``xfs_scrub`` is the name of the driver program. |
| The rest of this document presents the goals and use cases of the new fsck |
| tool, describes its major design points in connection to those goals, and |
| discusses the similarities and differences with existing tools. |
| |
| +--------------------------------------------------------------------------+ |
| | **Note**: | |
| +--------------------------------------------------------------------------+ |
| | Throughout this document, the existing offline fsck tool can also be | |
| | referred to by its current name "``xfs_repair``". | |
| | The userspace driver program for the new online fsck tool can be | |
| | referred to as "``xfs_scrub``". | |
| | The kernel portion of online fsck that validates metadata is called | |
| | "online scrub", and portion of the kernel that fixes metadata is called | |
| | "online repair". | |
| +--------------------------------------------------------------------------+ |
| |
| The naming hierarchy is broken up into objects known as directories and files, |
| and the physical space is split into pieces known as allocation groups. |
| Sharding enables better performance on highly parallel systems and helps to |
| contain the damage when corruptions occur. |
| The division of the filesystem into principal objects (allocation groups and |
| inodes) means that there are ample opportunities to perform targeted checks and |
| repairs on a subset of the filesystem. |
| |
| While checks and repairs are underway on one part of the filesystem, the other |
| parts continue processing IO requests. |
| Even if a piece of filesystem metadata can only be regenerated by scanning the |
| entire system, the scan can still be done in the background while other file |
| operations continue. |
| |
| In summary, online fsck takes advantage of resource sharding and redundant |
| metadata to enable targeted checking and repair operations while the system |
| is running. |
| This capability will be coupled to automatic system management so that |
| autonomous self-healing of XFS maximizes service availability. |
| |
| 2. Theory of Operation |
| ====================== |
| |
| Because it is necessary for online fsck to lock and scan live metadata objects, |
| online fsck consists of three separate code components. |
| The first is the userspace driver program ``xfs_scrub``, which is responsible |
| for identifying individual metadata items, scheduling work items for them, |
| reacting to the outcomes appropriately, and reporting results to the system |
| administrator. |
| The second and third are in the kernel, which implements functions to check |
| and repair each type of online fsck work item. |
| |
| +------------------------------------------------------------------+ |
| | **Note**: | |
| +------------------------------------------------------------------+ |
| | For brevity, this document shortens the phrase "online fsck work | |
| | item" to "scrub item". | |
| +------------------------------------------------------------------+ |
| |
| Scrub item types are delineated in a manner consistent with the Unix design |
| philosophy, which is to say that each item should handle one aspect of a |
| metadata structure, and handle it well. |
| |
| Scope |
| ----- |
| |
| In principle, online fsck should be able to check and to repair everything that |
| the offline fsck program can handle. |
| However, online fsck cannot be running 100% of the time, which means that |
| latent errors may creep in after a scrub completes. |
| If these errors cause the next mount to fail, offline fsck is the only |
| solution. |
| This limitation means that maintenance of the offline fsck tool will continue. |
| A second limitation of online fsck is that it must follow the same resource |
| sharing and lock acquisition rules as the regular filesystem. |
| This means that scrub cannot take *any* shortcuts to save time, because doing |
| so could lead to concurrency problems. |
| In other words, online fsck is not a complete replacement for offline fsck, and |
| a complete run of online fsck may take longer than offline fsck. |
| However, both of these limitations are acceptable tradeoffs to satisfy the |
| different motivations of online fsck, which are to **minimize system downtime** |
| and to **increase predictability of operation**. |
| |
| .. _scrubphases: |
| |
| Phases of Work |
| -------------- |
| |
| The userspace driver program ``xfs_scrub`` splits the work of checking and |
| repairing an entire filesystem into seven phases. |
| Each phase concentrates on checking specific types of scrub items and depends |
| on the success of all previous phases. |
| The seven phases are as follows: |
| |
| 1. Collect geometry information about the mounted filesystem and computer, |
| discover the online fsck capabilities of the kernel, and open the |
| underlying storage devices. |
| |
| 2. Check allocation group metadata, all realtime volume metadata, and all quota |
| files. |
| Each metadata structure is scheduled as a separate scrub item. |
| If corruption is found in the inode header or inode btree and ``xfs_scrub`` |
| is permitted to perform repairs, then those scrub items are repaired to |
| prepare for phase 3. |
| Repairs are implemented by using the information in the scrub item to |
| resubmit the kernel scrub call with the repair flag enabled; this is |
| discussed in the next section. |
| Optimizations and all other repairs are deferred to phase 4. |
| |
| 3. Check all metadata of every file in the filesystem. |
| Each metadata structure is also scheduled as a separate scrub item. |
| If repairs are needed and ``xfs_scrub`` is permitted to perform repairs, |
| and there were no problems detected during phase 2, then those scrub items |
| are repaired immediately. |
| Optimizations, deferred repairs, and unsuccessful repairs are deferred to |
| phase 4. |
| |
| 4. All remaining repairs and scheduled optimizations are performed during this |
| phase, if the caller permits them. |
| Before starting repairs, the summary counters are checked and any necessary |
| repairs are performed so that subsequent repairs will not fail the resource |
| reservation step due to wildly incorrect summary counters. |
| Unsuccessful repairs are requeued as long as forward progress on repairs is |
| made somewhere in the filesystem. |
| Free space in the filesystem is trimmed at the end of phase 4 if the |
| filesystem is clean. |
| |
| 5. By the start of this phase, all primary and secondary filesystem metadata |
| must be correct. |
| Summary counters such as the free space counts and quota resource counts |
| are checked and corrected. |
| Directory entry names and extended attribute names are checked for |
| suspicious entries such as control characters or confusing Unicode sequences |
| appearing in names. |
| |
| 6. If the caller asks for a media scan, read all allocated and written data |
| file extents in the filesystem. |
| The ability to use hardware-assisted data file integrity checking is new |
|    to online fsck; neither of the previous tools has this capability. |
| If media errors occur, they will be mapped to the owning files and reported. |
| |
| 7. Re-check the summary counters and present the caller with a summary of |
| space usage and file counts. |
| |
| This allocation of responsibilities will be :ref:`revisited <scrubcheck>` |
| later in this document. |
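| |
| Because each phase depends on the success of all of its predecessors, the |
| userspace driver can be pictured as a simple loop over a table of phase |
| functions. |
| The sketch below is purely illustrative and is not the actual ``xfs_scrub`` |
| source code; the phase names and the ``phase_fn`` callback type are |
| hypothetical. |
| |
| .. code-block:: c |
| |
|    /* Illustrative phase driver; the real xfs_scrub logic is more elaborate. */ |
|    #include <stdbool.h> |
|    #include <stdio.h> |
| |
|    enum scrub_phase { |
|            PHASE_SETUP,         /* phase 1: geometry, capabilities, devices */ |
|            PHASE_AG_METADATA,   /* phase 2: AG, realtime, and quota metadata */ |
|            PHASE_FILE_METADATA, /* phase 3: all metadata of every file */ |
|            PHASE_REPAIR,        /* phase 4: deferred repairs, optimizations */ |
|            PHASE_SUMMARY,       /* phase 5: summary counters, name checks */ |
|            PHASE_MEDIA_SCAN,    /* phase 6: optional scan of all file data */ |
|            PHASE_REPORT,        /* phase 7: re-check counters, report usage */ |
|            PHASE_MAX |
|    }; |
| |
|    typedef bool (*phase_fn)(void);      /* hypothetical per-phase worker */ |
| |
|    static bool run_all_phases(phase_fn phases[PHASE_MAX]) |
|    { |
|            /* Each phase depends on all previous phases, so stop on failure. */ |
|            for (int p = PHASE_SETUP; p < PHASE_MAX; p++) { |
|                    if (!phases[p]()) { |
|                            fprintf(stderr, "phase %d failed\n", p + 1); |
|                            return false; |
|                    } |
|            } |
|            return true; |
|    } |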
| |
| Steps for Each Scrub Item |
| ------------------------- |
| |
| The kernel scrub code uses a three-step strategy for checking and repairing |
| the one aspect of a metadata object represented by a scrub item: |
| |
| 1. The scrub item of interest is checked for corruptions, opportunities for |
|    optimization, and values that are directly controlled by the system |
|    administrator but look suspicious. |
|    If the item is not corrupt or does not need optimization, resources are |
|    released and the positive scan results are returned to userspace. |
| If the item is corrupt or could be optimized but the caller does not permit |
| this, resources are released and the negative scan results are returned to |
| userspace. |
| Otherwise, the kernel moves on to the second step. |
| |
| 2. The repair function is called to rebuild the data structure. |
|    Repair functions generally choose to rebuild a structure from other metadata |
| rather than try to salvage the existing structure. |
| If the repair fails, the scan results from the first step are returned to |
| userspace. |
| Otherwise, the kernel moves on to the third step. |
| |
| 3. In the third step, the kernel runs the same checks over the new metadata |
| item to assess the efficacy of the repairs. |
| The results of the reassessment are returned to userspace. |
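| |
| From userspace, this cycle is driven through the kernel's metadata scrub |
| ioctl. |
| The following is a minimal sketch, assuming the ``XFS_IOC_SCRUB_METADATA`` |
| ioctl and the ``struct xfs_scrub_metadata`` definitions from the kernel's |
| ``xfs_fs.h`` UAPI header (installed by xfsprogs as ``xfs/xfs.h``); consult |
| that header for the authoritative structure layout, scrub type codes, and |
| flag names. |
| |
| .. code-block:: c |
| |
|    /* Sketch: check one scrub item, then resubmit with the repair flag set. */ |
|    #include <errno.h> |
|    #include <stdbool.h> |
|    #include <string.h> |
|    #include <sys/ioctl.h> |
|    #include <linux/types.h> |
|    #include <xfs/xfs.h>         /* XFS_IOC_SCRUB_METADATA and friends */ |
| |
|    /* Returns 0 if the item is clean or was repaired, negative errno otherwise. */ |
|    static int scrub_one_item(int fsfd, __u32 type, __u32 agno, bool allow_repair) |
|    { |
|            struct xfs_scrub_metadata sm; |
| |
|            memset(&sm, 0, sizeof(sm)); |
|            sm.sm_type = type;           /* e.g. XFS_SCRUB_TYPE_AGF */ |
|            sm.sm_agno = agno; |
| |
|            /* Step 1: check only. */ |
|            if (ioctl(fsfd, XFS_IOC_SCRUB_METADATA, &sm)) |
|                    return -errno; |
| |
|            if (!(sm.sm_flags & (XFS_SCRUB_OFLAG_CORRUPT | |
|                                 XFS_SCRUB_OFLAG_XCORRUPT | |
|                                 XFS_SCRUB_OFLAG_PREEN))) |
|                    return 0;            /* positive scan result */ |
| |
|            if (!allow_repair) |
|                    return -EUCLEAN;     /* report the negative scan result */ |
| |
|            /* Steps 2 and 3: the kernel repairs and then re-checks the item. */ |
|            sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR; |
|            if (ioctl(fsfd, XFS_IOC_SCRUB_METADATA, &sm)) |
|                    return -errno; |
| |
|            return (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT) ? -EUCLEAN : 0; |
|    } |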
| |
| Classification of Metadata |
| -------------------------- |
| |
| Each type of metadata object (and therefore each type of scrub item) is |
| classified as follows: |
| |
| Primary Metadata |
| ```````````````` |
| |
| Metadata structures in this category should be most familiar to filesystem |
| users either because they are directly created by the user or they index |
| objects created by the user. |
| Most filesystem objects fall into this class: |
| |
| - Free space and reference count information |
| |
| - Inode records and indexes |
| |
| - Storage mapping information for file data |
| |
| - Directories |
| |
| - Extended attributes |
| |
| - Symbolic links |
| |
| - Quota limits |
| |
| Scrub obeys the same rules as regular filesystem accesses for resource and lock |
| acquisition. |
| |
| Primary metadata objects are the simplest for scrub to process. |
| The principal filesystem object (either an allocation group or an inode) that |
| owns the item being scrubbed is locked to guard against concurrent updates. |
| The check function examines every record associated with the type for obvious |
| errors and cross-references healthy records against other metadata to look for |
| inconsistencies. |
| Repairs for this class of scrub item are simple, since the repair function |
| starts by holding all the resources acquired in the previous step. |
| The repair function scans available metadata as needed to record all the |
| observations needed to complete the structure. |
| Next, it stages the observations in a new ondisk structure and commits it |
| atomically to complete the repair. |
| Finally, the storage from the old data structure is carefully reaped. |
| |
| Because ``xfs_scrub`` locks a primary object for the duration of the repair, |
| this is effectively an offline repair operation performed on a subset of the |
| filesystem. |
| This minimizes the complexity of the repair code because it is not necessary to |
| handle concurrent updates from other threads, nor is it necessary to access |
| any other part of the filesystem. |
| As a result, indexed structures can be rebuilt very quickly, and programs |
| trying to access the damaged structure will be blocked until repairs complete. |
| The only infrastructure needed by the repair code are the staging area for |
| observations and a means to write new structures to disk. |
| Despite these limitations, the advantage that online repair holds is clear: |
| targeted work on individual shards of the filesystem avoids total loss of |
| service. |
| |
| This mechanism is described in section 2.1 ("Off-Line Algorithm") of |
| V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction |
| Algorithms" <https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf>`_, |
| *Extending Database Technology*, pp. 293-309, 1992. |
| |
| Most primary metadata repair functions stage their intermediate results in an |
| in-memory array prior to formatting the new ondisk structure, which is very |
| similar to the list-based algorithm discussed in section 2.3 ("List-Based |
| Algorithms") of Srinivasan. |
| However, any data structure builder that maintains a resource lock for the |
| duration of the repair is *always* an offline algorithm. |
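| |
| A toy model of this stage-then-commit pattern is shown below: observations |
| are collected into a private array, sorted into index order, and then |
| published with a single pointer swap standing in for the atomic ondisk |
| commit. |
| This is an analogy for illustration only, not a description of the kernel's |
| btree bulk-loading code. |
| |
| .. code-block:: c |
| |
|    /* Toy model: stage observations privately, then commit in one step. */ |
|    #include <stdlib.h> |
|    #include <string.h> |
| |
|    struct extent { |
|            unsigned long long start; |
|            unsigned long long len; |
|    }; |
| |
|    struct toy_index { |
|            struct extent *records;      /* stands in for the ondisk structure */ |
|            size_t nr; |
|    }; |
| |
|    static int cmp_extent(const void *a, const void *b) |
|    { |
|            const struct extent *x = a, *y = b; |
| |
|            return (x->start > y->start) - (x->start < y->start); |
|    } |
| |
|    /* |
|     * Rebuild @idx from @nr staged observations (@nr must be nonzero) while |
|     * the owning object stays locked for the duration of the rebuild. |
|     */ |
|    static int rebuild_index(struct toy_index *idx, const struct extent *staged, |
|                             size_t nr) |
|    { |
|            struct extent *copy; |
| |
|            /* Stage: copy and sort the observations into index order. */ |
|            copy = malloc(nr * sizeof(*copy)); |
|            if (!copy) |
|                    return -1; |
|            memcpy(copy, staged, nr * sizeof(*copy)); |
|            qsort(copy, nr, sizeof(*copy), cmp_extent); |
| |
|            /* Commit: swap in the new structure, then reap the old one. */ |
|            free(idx->records);          /* assumes the old array was malloc'd */ |
|            idx->records = copy; |
|            idx->nr = nr; |
|            return 0; |
|    } |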
| |
| .. _secondary_metadata: |
| |
| Secondary Metadata |
| `````````````````` |
| |
| Metadata structures in this category reflect records found in primary metadata, |
| but are only needed for online fsck or for reorganization of the filesystem. |
| |
| Secondary metadata include: |
| |
| - Reverse mapping information |
| |
| - Directory parent pointers |
| |
| This class of metadata is difficult for scrub to process because scrub attaches |
| to the secondary object but needs to check primary metadata, which runs counter |
| to the usual order of resource acquisition. |
| Frequently, this means that full filesystem scans are necessary to rebuild the |
| metadata. |
| Check functions can be limited in scope to reduce runtime. |
| Repairs, however, require a full scan of primary metadata, which can take a |
| long time to complete. |
| Under these conditions, ``xfs_scrub`` cannot lock resources for the entire |
| duration of the repair. |
| |
| Instead, repair functions set up an in-memory staging structure to store |
| observations. |
| Depending on the requirements of the specific repair function, the staging |
| index will either have the same format as the ondisk structure or a design |
| specific to that repair function. |
| The next step is to release all locks and start the filesystem scan. |
| When the repair scanner needs to record an observation, the staging data are |
| locked long enough to apply the update. |
| While the filesystem scan is in progress, the repair function hooks the |
| filesystem so that it can apply pending filesystem updates to the staging |
| information. |
| Once the scan is done, the owning object is re-locked, the live data is used to |
| write a new ondisk structure, and the repairs are committed atomically. |
| The hooks are disabled and the staging area is freed. |
| Finally, the storage from the old data structure is carefully reaped. |
| |
| Introducing concurrency helps online repair avoid various locking problems, but |
| comes at a high cost to code complexity. |
| Live filesystem code has to be hooked so that the repair function can observe |
| updates in progress. |
| The staging area has to become a fully functional parallel structure so that |
| updates can be merged from the hooks. |
| Finally, the hook, the filesystem scan, and the inode locking model must be |
| sufficiently well integrated that a hook event can decide if a given update |
| should be applied to the staging structure. |
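| |
| A toy model of the hook decision is sketched below, assuming a scan that |
| visits objects in ascending numeric order and a staging structure whose lock |
| is omitted for brevity. |
| A live update is folded into the staging data only if the scan has already |
| visited the affected object; otherwise the update is skipped, because the |
| scan itself will observe the new state when it gets there. |
| This is the same cursor comparison that appears in the side file method |
| discussed below. |
| |
| .. code-block:: c |
| |
|    /* Toy model of a repair hook deciding whether to apply a live update. */ |
|    #include <stdbool.h> |
|    #include <stdint.h> |
| |
|    struct staging { |
|            uint64_t scan_cursor;  /* highest object number already scanned */ |
|            int64_t  observed;     /* running total of whatever is counted */ |
|    }; |
| |
|    /* Called by the scanner after it finishes observing object @objno. */ |
|    static void scan_observe(struct staging *st, uint64_t objno, int64_t value) |
|    { |
|            st->observed += value; |
|            st->scan_cursor = objno; |
|    } |
| |
|    /* Called from the hook when a live update changes @objno by @delta. */ |
|    static bool hook_update(struct staging *st, uint64_t objno, int64_t delta) |
|    { |
|            if (objno > st->scan_cursor) |
|                    return false;  /* not yet scanned; elide the update */ |
| |
|            st->observed += delta; /* already scanned; fold in the change */ |
|            return true; |
|    } |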
| |
| In theory, the scrub implementation could apply these same techniques for |
| primary metadata, but doing so would make it massively more complex and less |
| performant. |
| Programs attempting to access the damaged structures are not blocked from |
| operation, which may cause application failure or an unplanned filesystem |
| shutdown. |
| |
| Inspiration for the secondary metadata repair strategy was drawn from section |
| 2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File") |
| and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for |
| Creating Indexes for Very Large Tables Without Quiescing Updates" |
| <https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992. |
| |
| The sidecar index mentioned above bears some resemblance to the side file |
| method mentioned in Srinivasan and Mohan. |
| Their method consists of an index builder that extracts relevant record data to |
| build the new structure as quickly as possible; and an auxiliary structure that |
| captures all updates that would be committed to the index by other threads were |
| the new index already online. |
| After the index building scan finishes, the updates recorded in the side file |
| are applied to the new index. |
| To avoid conflicts between the index builder and other writer threads, the |
| builder maintains a publicly visible cursor that tracks the progress of the |
| scan through the record space. |
| To avoid duplication of work between the side file and the index builder, side |
| file updates are elided when the record ID for the update is greater than the |
| cursor position within the record ID space. |
| |
| To minimize changes to the rest of the codebase, XFS online repair keeps the |
| replacement index hidden until it's completely ready to go. |
| In other words, there is no attempt to expose the keyspace of the new index |
| while repair is running. |
| The complexity of such an approach would be very high and perhaps more |
| appropriate to building *new* indices. |
| |
| **Future Work Question**: Can the full scan and live update code used to |
| facilitate a repair also be used to implement a comprehensive check? |
| |
| *Answer*: In theory, yes. Check would be much stronger if each scrub function |
| employed these live scans to build a shadow copy of the metadata and then |
| compared the shadow records to the ondisk records. |
| However, doing that is a fair amount more work than what the checking functions |
| do now, and it would increase the runtime of those scrub functions. |
| The live scans and hooks were developed much later than the checking code. |
| |
| Summary Information |
| ``````````````````` |
| |
| Metadata structures in this last category summarize the contents of primary |
| metadata records. |
| These are often used to speed up resource usage queries, and are many times |
| smaller than the primary metadata which they represent. |
| |
| Examples of summary information include: |
| |
| - Summary counts of free space and inodes |
| |
| - File link counts from directories |
| |
| - Quota resource usage counts |
| |
| Check and repair require full filesystem scans, but resource and lock |
| acquisition follow the same paths as regular filesystem accesses. |
| |
| The superblock summary counters have special requirements due to the underlying |
| implementation of the incore counters, and will be treated separately. |
| Check and repair of the other types of summary counters (quota resource counts |
| and file link counts) employ the same filesystem scanning and hooking |
| techniques as outlined above, but because the underlying data are sets of |
| integer counters, the staging data need not be a fully functional mirror of the |
| ondisk structure. |
| |
| Inspiration for quota and file link count repair strategies was drawn from |
| sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View |
| Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views |
| and Their Indexes" |
| <http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011. |
| |
| Since quotas are non-negative integer counts of resource usage, online |
| quotacheck can use the incremental view deltas described in section 2.14 to |
| track pending changes to the block and inode usage counts in each transaction, |
| and commit those changes to a dquot side file when the transaction commits. |
| Delta tracking is necessary for dquots because the index builder scans inodes, |
| whereas the data structure being rebuilt is an index of dquots. |
| Link count checking combines the view deltas and commit step into one because |
| it sets attributes of the objects being scanned instead of writing them to a |
| separate data structure. |
| Each online fsck function will be discussed as case studies later in this |
| document. |
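| |
| To make the delta-tracking idea concrete, here is a toy model that uses a |
| fixed-size shadow dquot table and plain integer ids; the real implementation |
| attaches these deltas to XFS transactions and indexes shadow dquots by the |
| filesystem's own quota ids. |
| |
| .. code-block:: c |
| |
|    /* Toy model of incremental view deltas for online quotacheck. */ |
|    #include <stdint.h> |
| |
|    #define MAX_DQUOTS      64   /* illustrative fixed-size shadow table */ |
| |
|    struct shadow_dquot { |
|            int64_t blocks;      /* observed block usage */ |
|            int64_t inodes;      /* observed inode usage */ |
|    }; |
| |
|    struct txn_delta { |
|            uint32_t id;         /* quota id touched by this transaction */ |
|            int64_t  d_blocks;   /* pending block count change */ |
|            int64_t  d_inodes;   /* pending inode count change */ |
|    }; |
| |
|    /* Record a change from inside a running transaction. */ |
|    static void txn_mod_quota(struct txn_delta *d, int64_t blocks, int64_t inodes) |
|    { |
|            d->d_blocks += blocks; |
|            d->d_inodes += inodes; |
|    } |
| |
|    /* Fold the pending deltas into the shadow table at commit time. */ |
|    static void txn_commit(struct shadow_dquot shadow[MAX_DQUOTS], |
|                           const struct txn_delta *d) |
|    { |
|            shadow[d->id % MAX_DQUOTS].blocks += d->d_blocks; |
|            shadow[d->id % MAX_DQUOTS].inodes += d->d_inodes; |
|    } |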
| |
| Risk Management |
| --------------- |
| |
| During the development of online fsck, several risk factors were identified |
| that may make the feature unsuitable for certain distributors and users. |
| Steps can be taken to mitigate or eliminate those risks, though at a cost to |
| functionality. |
| |
| - **Decreased performance**: Adding metadata indices to the filesystem |
| increases the time cost of persisting changes to disk, and the reverse space |
| mapping and directory parent pointers are no exception. |
| System administrators who require the maximum performance can disable the |
| reverse mapping features at format time, though this choice dramatically |
| reduces the ability of online fsck to find inconsistencies and repair them. |
| |
| - **Incorrect repairs**: As with all software, there might be defects in the |
| software that result in incorrect repairs being written to the filesystem. |
| Systematic fuzz testing (detailed in the next section) is employed by the |
| authors to find bugs early, but it might not catch everything. |
| The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB`` |
| and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to |
| accept this risk. |
| The xfsprogs build system has a configure option (``--enable-scrub=no``) that |
| disables building of the ``xfs_scrub`` binary, though this is not a risk |
| mitigation if the kernel functionality remains enabled. |
| |
| - **Inability to repair**: Sometimes, a filesystem is too badly damaged to be |
| repairable. |
| If the keyspaces of several metadata indices overlap in some manner but a |
| coherent narrative cannot be formed from records collected, then the repair |
| fails. |
| To reduce the chance that a repair will fail with a dirty transaction and |
| render the filesystem unusable, the online repair functions have been |
| designed to stage and validate all new records before committing the new |
| structure. |
| |
| - **Misbehavior**: Online fsck requires many privileges -- raw IO to block |
| devices, opening files by handle, ignoring Unix discretionary access control, |
| and the ability to perform administrative changes. |
| Running this automatically in the background scares people, so the systemd |
| background service is configured to run with only the privileges required. |
| Obviously, this cannot address certain problems like the kernel crashing or |
| deadlocking, but it should be sufficient to prevent the scrub process from |
| escaping and reconfiguring the system. |
| The cron job does not have this protection. |
| |
| - **Fuzz Kiddiez**: There are many people now who seem to think that running |
| automated fuzz testing of ondisk artifacts to find mischievous behavior and |
| spraying exploit code onto the public mailing list for instant zero-day |
| disclosure is somehow of some social benefit. |
| In the view of this author, the benefit is realized only when the fuzz |
| operators help to **fix** the flaws, but this opinion apparently is not |
| widely shared among security "researchers". |
| The XFS maintainers' continuing ability to manage these events presents an |
| ongoing risk to the stability of the development process. |
| Automated testing should front-load some of the risk while the feature is |
| considered EXPERIMENTAL. |
| |
| Many of these risks are inherent to software programming. |
| Despite this, it is hoped that this new functionality will prove useful in |
| reducing unexpected downtime. |
| |
| 3. Testing Plan |
| =============== |
| |
| As stated before, fsck tools have three main goals: |
| |
| 1. Detect inconsistencies in the metadata; |
| |
| 2. Eliminate those inconsistencies; and |
| |
| 3. Minimize further loss of data. |
| |
| Demonstrations of correct operation are necessary to build users' confidence |
| that the software behaves within expectations. |
| Unfortunately, it was not really feasible to perform regular exhaustive testing |
| of every aspect of a fsck tool until the introduction of low-cost virtual |
| machines with high-IOPS storage. |
| With ample hardware availability in mind, the testing strategy for the online |
| fsck project involves differential analysis against the existing fsck tools and |
| systematic testing of every attribute of every type of metadata object. |
| Testing can be split into four major categories, as discussed below. |
| |
| Integrated Testing with fstests |
| ------------------------------- |
| |
| The primary goal of any free software QA effort is to make testing as |
| inexpensive and widespread as possible to maximize the scaling advantages of |
| community. |
| In other words, testing should maximize the breadth of filesystem configuration |
| scenarios and hardware setups. |
| This improves code quality by enabling the authors of online fsck to find and |
| fix bugs early, and helps developers of new features to find integration |
| issues earlier in their development effort. |
| |
| The Linux filesystem community shares a common QA testing suite, |
| `fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for |
| functional and regression testing. |
| Even before development work began on online fsck, fstests (when run on XFS) |
| would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and |
| scratch filesystems between each test. |
| This provides a level of assurance that the kernel and the fsck tools stay in |
| alignment about what constitutes consistent metadata. |
| During development of the online checking code, fstests was modified to run |
| ``xfs_scrub -n`` between each test to ensure that the new checking code |
| produces the same results as the two existing fsck tools. |
| |
| To start development of online repair, fstests was modified to run |
| ``xfs_repair`` to rebuild the filesystem's metadata indices between tests. |
| This ensures that offline repair does not crash, leave a corrupt filesystem |
| after it exits, or trigger complaints from the online check. |
| This also established a baseline for what can and cannot be repaired offline. |
| To complete the first phase of development of online repair, fstests was |
| modified to be able to run ``xfs_scrub`` in a "force rebuild" mode. |
| This enables a comparison of the effectiveness of online repair as compared to |
| the existing offline repair tools. |
| |
| General Fuzz Testing of Metadata Blocks |
| --------------------------------------- |
| |
| XFS benefits greatly from having a very robust debugging tool, ``xfs_db``. |
| |
| Before development of online fsck even began, a set of fstests were created |
| to test the rather common fault that entire metadata blocks get corrupted. |
| This required the creation of fstests library code that can create a filesystem |
| containing every possible type of metadata object. |
| Next, individual test cases were created to create a test filesystem, identify |
| a single block of a specific type of metadata object, trash it with the |
| existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a |
| particular metadata validation strategy. |
| |
| This earlier test suite enabled XFS developers to test the ability of the |
| in-kernel validation functions and the ability of the offline fsck tool to |
| detect and eliminate the inconsistent metadata. |
| This part of the test suite was extended to cover online fsck in exactly the |
| same manner. |
| |
| In other words, for a given fstests filesystem configuration: |
| |
| * For each metadata object existing on the filesystem: |
| |
| * Write garbage to it |
| |
| * Test the reactions of: |
| |
| 1. The kernel verifiers to stop obviously bad metadata |
| 2. Offline repair (``xfs_repair``) to detect and fix |
| 3. Online repair (``xfs_scrub``) to detect and fix |
| |
| Targeted Fuzz Testing of Metadata Records |
| ----------------------------------------- |
| |
| The testing plan for online fsck includes extending the existing fs testing |
| infrastructure to provide a much more powerful facility: targeted fuzz testing |
| of every metadata field of every metadata object in the filesystem. |
| ``xfs_db`` can modify every field of every metadata structure in every |
| block in the filesystem to simulate the effects of memory corruption and |
| software bugs. |
| Given that fstests already contains the ability to create a filesystem |
| containing every metadata format known to the filesystem, ``xfs_db`` can be |
| used to perform exhaustive fuzz testing! |
| |
| For a given fstests filesystem configuration: |
| |
| * For each metadata object existing on the filesystem... |
| |
| * For each record inside that metadata object... |
| |
| * For each field inside that record... |
| |
| * For each conceivable type of transformation that can be applied to a bit field... |
| |
| 1. Clear all bits |
| 2. Set all bits |
| 3. Toggle the most significant bit |
| 4. Toggle the middle bit |
| 5. Toggle the least significant bit |
| 6. Add a small quantity |
| 7. Subtract a small quantity |
| 8. Randomize the contents |
| |
| * ...test the reactions of: |
| |
| 1. The kernel verifiers to stop obviously bad metadata |
| 2. Offline checking (``xfs_repair -n``) |
| 3. Offline repair (``xfs_repair``) |
| 4. Online checking (``xfs_scrub -n``) |
| 5. Online repair (``xfs_scrub``) |
| 6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed) |
| |
| This is quite the combinatoric explosion! |
| |
| Fortunately, having this much test coverage makes it easy for XFS developers to |
| check the responses of XFS' fsck tools. |
| Since the introduction of the fuzz testing framework, these tests have been |
| used to discover incorrect repair code and missing functionality for entire |
| classes of metadata objects in ``xfs_repair``. |
| The enhanced testing was used to finalize the deprecation of ``xfs_check`` by |
| confirming that ``xfs_repair`` could detect at least as many corruptions as |
| the older tool. |
| |
| These tests have been very valuable for ``xfs_scrub`` in the same ways -- they |
| allow the online fsck developers to compare online fsck against offline fsck, |
| and they enable XFS developers to find deficiencies in the code base. |
| |
| Proposed patchsets include |
| `general fuzzer improvements |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_, |
| `fuzzing baselines |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_, |
| and `improvements in fuzz testing comprehensiveness |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_. |
| |
| Stress Testing |
| -------------- |
| |
| A unique requirement of online fsck is the ability to operate on a filesystem |
| concurrently with regular workloads. |
| Although it is of course impossible to run ``xfs_scrub`` with *zero* observable |
| impact on the running system, the online repair code should never introduce |
| inconsistencies into the filesystem metadata, and regular workloads should |
| never notice resource starvation. |
| To verify that these conditions are being met, fstests has been enhanced in |
| the following ways: |
| |
| * For each scrub item type, create a test to exercise checking that item type |
| while running ``fsstress``. |
| * For each scrub item type, create a test to exercise repairing that item type |
| while running ``fsstress``. |
| * Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole |
| filesystem doesn't cause problems. |
| * Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that |
| force-repairing the whole filesystem doesn't cause problems. |
| * Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while |
| freezing and thawing the filesystem. |
| * Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while |
| remounting the filesystem read-only and read-write. |
| * The same, but running ``fsx`` instead of ``fsstress``. (Not done yet?) |
| |
| Success is defined by the ability to run all of these tests without observing |
| any unexpected filesystem shutdowns due to corrupted metadata, kernel hang |
| check warnings, or any other sort of mischief. |
| |
| Proposed patchsets include `general stress testing |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_ |
| and the `evolution of existing per-function stress testing |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_. |
| |
| 4. User Interface |
| ================= |
| |
| The primary user of online fsck is the system administrator, just like offline |
| repair. |
| Online fsck presents two modes of operation to administrators: |
| a foreground CLI process for online fsck on demand, and a background service |
| that performs autonomous checking and repair. |
| |
| Checking on Demand |
| ------------------ |
| |
| For administrators who want the absolute freshest information about the |
| metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on |
| a command line. |
| The program checks every piece of metadata in the filesystem while the |
| administrator waits for the results to be reported, just like the existing |
| ``xfs_repair`` tool. |
| Both tools share a ``-n`` option to perform a read-only scan, and a ``-v`` |
| option to increase the verbosity of the information reported. |
| |
| A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error |
| correction capabilities of the hardware to check data file contents. |
| The media scan is not enabled by default because it may dramatically increase |
| program runtime and consume a lot of bandwidth on older storage hardware. |
| |
| The output of a foreground invocation is captured in the system log. |
| |
| The ``xfs_scrub_all`` program walks the list of mounted filesystems and |
| initiates ``xfs_scrub`` for each of them in parallel. |
| It serializes scans for any filesystems that resolve to the same top level |
| kernel block device to prevent resource overconsumption. |
| |
| Background Service |
| ------------------ |
| |
| To reduce the workload of system administrators, the ``xfs_scrub`` package |
| provides a suite of `systemd <https://systemd.io/>`_ timers and services that |
| run online fsck automatically on weekends by default. |
| The background service configures scrub to run with as little privilege as |
| possible, the lowest CPU and IO priority, and in a CPU-constrained single |
| threaded mode. |
| This can be tuned by the systemd administrator at any time to suit the latency |
| and throughput requirements of customer workloads. |
| |
| The output of the background service is also captured in the system log. |
| If desired, reports of failures (either due to inconsistencies or mere runtime |
| errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment |
| variable in the following service files: |
| |
| * ``xfs_scrub_fail@.service`` |
| * ``xfs_scrub_media_fail@.service`` |
| * ``xfs_scrub_all_fail.service`` |
| |
| The decision to enable the background scan is left to the system administrator. |
| This can be done by enabling either of the following services: |
| |
| * ``xfs_scrub_all.timer`` on systemd systems |
| * ``xfs_scrub_all.cron`` on non-systemd systems |
| |
| This automatic weekly scan is configured out of the box to perform an |
| additional media scan of all file data once per month. |
| This is less foolproof than, say, storing file data block checksums, but much |
| more performant if application software provides its own integrity checking, |
| redundancy can be provided elsewhere above the filesystem, or the storage |
| device's integrity guarantees are deemed sufficient. |
| |
| The systemd unit file definitions have been subjected to a security audit |
| (as of systemd 249) to ensure that the xfs_scrub processes have as little |
| access to the rest of the system as possible. |
| This was performed via ``systemd-analyze security``, after which privileges |
| were restricted to the minimum required, sandboxing was set up to the maximal |
| extent possible with system call filtering, and access to the filesystem tree |
| was restricted to the minimum needed to start the program and access the |
| filesystem being scanned. |
| The service definition files restrict CPU usage to 80% of one CPU core, and |
| apply as nice a priority to IO and CPU scheduling as possible. |
| This measure was taken to minimize delays in the rest of the filesystem. |
| No such hardening has been performed for the cron job. |
| |
| Proposed patchset: |
| `Enabling the xfs_scrub background service |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_. |
| |
| Health Reporting |
| ---------------- |
| |
| XFS caches a summary of each filesystem's health status in memory. |
| The information is updated whenever ``xfs_scrub`` is run, or whenever |
| inconsistencies are detected in the filesystem metadata during regular |
| operations. |
| System administrators should use the ``health`` command of ``xfs_spaceman`` to |
| retrieve this information in a human-readable format. |
| If problems have been observed, the administrator can schedule a reduced |
| service window to run the online repair tool to correct the problem. |
| Failing that, the administrator can decide to schedule a maintenance window to |
| run the traditional offline repair tool to correct the problem. |
| |
| **Future Work Question**: Should the health reporting integrate with the new |
| inotify fs error notification system? |
| Would it be helpful for sysadmins to have a daemon to listen for corruption |
| notifications and initiate a repair? |
| |
| *Answer*: These questions remain unanswered, but should be a part of the |
| conversation with early adopters and potential downstream users of XFS. |
| |
| Proposed patchsets include |
| `wiring up health reports to correction returns |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_ |
| and |
| `preservation of sickness info during memory reclaim |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_. |
| |
| 5. Kernel Algorithms and Data Structures |
| ======================================== |
| |
| This section discusses the key algorithms and data structures of the kernel |
| code that provide the ability to check and repair metadata while the system |
| is running. |
| The first chapters in this section reveal the pieces that provide the |
| foundation for checking metadata. |
| The remainder of this section presents the mechanisms through which XFS |
| regenerates itself. |
| |
| Self Describing Metadata |
| ------------------------ |
| |
| Starting with XFS version 5 in 2012, XFS updated the format of nearly every |
| ondisk block header to record a magic number, a checksum, a universally |
| "unique" identifier (UUID), an owner code, the ondisk address of the block, |
| and a log sequence number. |
| When loading a block buffer from disk, the magic number, UUID, owner, and |
| ondisk address confirm that the retrieved block matches the specific owner of |
| the current filesystem, and that the information contained in the block is |
| supposed to be found at the ondisk address. |
| The first three components enable checking tools to disregard alleged metadata |
| that doesn't belong to the filesystem, and the fourth component enables the |
| filesystem to detect lost writes. |
| |
| Whenever a filesystem operation modifies a block, the change is submitted |
| to the log as part of a transaction. |
| The log then processes these transactions, marking them done once they are |
| safely persisted to storage. |
| The logging code maintains the checksum and the log sequence number of the last |
| transactional update. |
| Checksums are useful for detecting torn writes and other discrepancies that can |
| be introduced between the computer and its storage devices. |
| Sequence number tracking enables log recovery to avoid applying out of date |
| log updates to the filesystem. |
| |
| These two features improve overall runtime resiliency by providing a means for |
| the filesystem to detect obvious corruption when reading metadata blocks from |
| disk, but these buffer verifiers cannot provide any consistency checking |
| between metadata structures. |
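| |
| The sketch below illustrates the general shape of a self-describing block |
| header and the checks that a buffer verifier applies to it. |
| The field layout is a simplification for illustration only; the real ondisk |
| headers vary by structure and use CRC32c checksums. |
| |
| .. code-block:: c |
| |
|    /* Illustrative self-describing header; not the actual ondisk layout. */ |
|    #include <stdbool.h> |
|    #include <stdint.h> |
|    #include <string.h> |
| |
|    struct toy_block_header { |
|            uint32_t magic;      /* identifies the structure type */ |
|            uint8_t  uuid[16];   /* filesystem UUID */ |
|            uint64_t owner;      /* AG or inode that owns this block */ |
|            uint64_t blkno;      /* ondisk address of this block */ |
|            uint64_t lsn;        /* log sequence number of the last update */ |
|            uint32_t checksum;   /* detects torn writes and bit rot */ |
|    }; |
| |
|    /* @computed is the checksum calculated over the buffer just read in. */ |
|    static bool toy_verify_header(const struct toy_block_header *hdr, |
|                                  uint32_t want_magic, const uint8_t *fs_uuid, |
|                                  uint64_t want_owner, uint64_t daddr, |
|                                  uint32_t computed) |
|    { |
|            if (hdr->magic != want_magic) |
|                    return false;  /* not the structure we asked for */ |
|            if (memcmp(hdr->uuid, fs_uuid, sizeof(hdr->uuid))) |
|                    return false;  /* belongs to some other filesystem */ |
|            if (hdr->owner != want_owner) |
|                    return false;  /* belongs to some other owner */ |
|            if (hdr->blkno != daddr) |
|                    return false;  /* block landed at the wrong address */ |
|            if (hdr->checksum != computed) |
|                    return false;  /* torn write or media corruption */ |
|            return true; |
|    } |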
| |
| For more information, please see the documentation for |
| Documentation/filesystems/xfs/xfs-self-describing-metadata.rst |
| |
| Reverse Mapping |
| --------------- |
| |
| The original design of XFS (circa 1993) is an improvement upon 1980s Unix |
| filesystem design. |
| In those days, storage density was expensive, CPU time was scarce, and |
| excessive seek time could kill performance. |
| For performance reasons, filesystem authors were reluctant to add redundancy to |
| the filesystem, even at the cost of data integrity. |
| Filesystem designers in the early 21st century chose different strategies to |
| increase internal redundancy -- either storing nearly identical copies of |
| metadata, or adopting more space-efficient encoding techniques. |
| |
| For XFS, a different redundancy strategy was chosen to modernize the design: |
| a secondary space usage index that maps allocated disk extents back to their |
| owners. |
| By adding a new index, the filesystem retains most of its ability to scale |
| well to heavily threaded workloads involving large datasets, since the primary |
| file metadata (the directory tree, the file block map, and the allocation |
| groups) remain unchanged. |
| Like any system that improves redundancy, the reverse-mapping feature increases |
| overhead costs for space mapping activities. |
| However, it has two critical advantages: first, the reverse index is key to |
| enabling online fsck and other requested functionality such as free space |
| defragmentation, better media failure reporting, and filesystem shrinking. |
| Second, the different ondisk storage format of the reverse mapping btree |
| defeats device-level deduplication because the filesystem requires real |
| redundancy. |
| |
| +--------------------------------------------------------------------------+ |
| | **Sidebar**: | |
| +--------------------------------------------------------------------------+ |
| | A criticism of adding the secondary index is that it does nothing to | |
| | improve the robustness of user data storage itself. | |
| | This is a valid point, but adding a new index for file data block | |
| | checksums increases write amplification by turning data overwrites into | |
| | copy-writes, which age the filesystem prematurely. | |
| | In keeping with thirty years of precedent, users who want file data | |
| | integrity can supply as powerful a solution as they require. | |
| | As for metadata, the complexity of adding a new secondary index of space | |
| | usage is much less than adding volume management and storage device | |
| | mirroring to XFS itself. | |
| | Perfection of RAID and volume management are best left to existing | |
| | layers in the kernel. | |
| +--------------------------------------------------------------------------+ |
| |
| The information captured in a reverse space mapping record is as follows: |
| |
| .. code-block:: c |
| |
| struct xfs_rmap_irec { |
| xfs_agblock_t rm_startblock; /* extent start block */ |
| xfs_extlen_t rm_blockcount; /* extent length */ |
| uint64_t rm_owner; /* extent owner */ |
| uint64_t rm_offset; /* offset within the owner */ |
| unsigned int rm_flags; /* state flags */ |
| }; |
| |
| The first two fields capture the location and size of the physical space, |
| in units of filesystem blocks. |
| The owner field tells scrub which metadata structure or file inode has been |
| assigned this space. |
| For space allocated to files, the offset field tells scrub where the space was |
| mapped within the file fork. |
| Finally, the flags field provides extra information about the space usage -- |
| is this an attribute fork extent? A file mapping btree extent? Or an |
| unwritten data extent? |
| |
| Online filesystem checking judges the consistency of each primary metadata |
| record by comparing its information against all other space indices. |
| The reverse mapping index plays a key role in the consistency checking process |
| because it contains a centralized alternate copy of all space allocation |
| information. |
| Program runtime and ease of resource acquisition are the only real limits to |
| what online checking can consult. |
| For example, a file data extent mapping can be checked against: |
| |
| * The absence of an entry in the free space information. |
| * The absence of an entry in the inode index. |
| * The absence of an entry in the reference count data if the file is not |
| marked as having shared extents. |
| * The correspondence of an entry in the reverse mapping information. |
| |
| There are several observations to make about reverse mapping indices: |
| |
| 1. Reverse mappings can provide a positive affirmation of correctness if any of |
| the above primary metadata are in doubt. |
| The checking code for most primary metadata follows a path similar to the |
| one outlined above. |
| |
| 2. Proving the consistency of secondary metadata with the primary metadata is |
| difficult because that requires a full scan of all primary space metadata, |
| which is very time intensive. |
| For example, checking a reverse mapping record for a file extent mapping |
| btree block requires locking the file and searching the entire btree to |
| confirm the block. |
| Instead, scrub relies on rigorous cross-referencing during the primary space |
| mapping structure checks. |
| |
| 3. Consistency scans must use non-blocking lock acquisition primitives if the |
| required locking order is not the same order used by regular filesystem |
| operations. |
| For example, if the filesystem normally takes a file ILOCK before taking |
| the AGF buffer lock but scrub wants to take a file ILOCK while holding |
| an AGF buffer lock, scrub cannot block on that second acquisition. |
| This means that forward progress during this part of a scan of the reverse |
| mapping data cannot be guaranteed if system load is heavy. |
| |
| In summary, reverse mappings play a key role in reconstruction of primary |
| metadata. |
| The details of how these records are staged, written to disk, and committed |
| into the filesystem are covered in subsequent sections. |
| |
| Checking and Cross-Referencing |
| ------------------------------ |
| |
| The first step of checking a metadata structure is to examine every record |
| contained within the structure and its relationship with the rest of the |
| system. |
| XFS contains multiple layers of checking to try to prevent inconsistent |
| metadata from wreaking havoc on the system. |
| Each of these layers contributes information that helps the kernel to make |
| decisions about the health of a metadata structure: |
| |
| - Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ? |
| - Is this structure inconsistent with the rest of the system |
| (``XFS_SCRUB_OFLAG_XCORRUPT``) ? |
| - Is there so much damage around the filesystem that cross-referencing is not |
| possible (``XFS_SCRUB_OFLAG_XFAIL``) ? |
| - Can the structure be optimized to improve performance or reduce the size of |
| metadata (``XFS_SCRUB_OFLAG_PREEN``) ? |
| - Does the structure contain data that is not inconsistent but deserves review |
| by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ? |
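| |
| A caller receiving these flags back from the kernel might classify them as |
| shown in the following sketch, which assumes the ``XFS_SCRUB_OFLAG_*`` |
| definitions from the kernel's ``xfs_fs.h`` UAPI header; the policy shown |
| here is deliberately simpler than the one ``xfs_scrub`` actually implements. |
| |
| .. code-block:: c |
| |
|    /* Sketch: turn scrub outcome flags into a coarse verdict. */ |
|    #include <linux/types.h> |
|    #include <xfs/xfs.h> |
| |
|    enum scrub_verdict { |
|            SCRUB_CLEAN,           /* nothing to do */ |
|            SCRUB_NEEDS_REPAIR,    /* corrupt, or cross-referencing failed */ |
|            SCRUB_COULD_OPTIMIZE,  /* preening would help */ |
|            SCRUB_NEEDS_REVIEW,    /* warn the system administrator */ |
|    }; |
| |
|    static enum scrub_verdict classify(__u32 sm_flags) |
|    { |
|            if (sm_flags & (XFS_SCRUB_OFLAG_CORRUPT | |
|                            XFS_SCRUB_OFLAG_XCORRUPT | |
|                            XFS_SCRUB_OFLAG_XFAIL)) |
|                    return SCRUB_NEEDS_REPAIR; |
|            if (sm_flags & XFS_SCRUB_OFLAG_PREEN) |
|                    return SCRUB_COULD_OPTIMIZE; |
|            if (sm_flags & XFS_SCRUB_OFLAG_WARNING) |
|                    return SCRUB_NEEDS_REVIEW; |
|            return SCRUB_CLEAN; |
|    } |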
| |
| The following sections describe how the metadata scrubbing process works. |
| |
| Metadata Buffer Verification |
| ```````````````````````````` |
| |
| The lowest layer of metadata protection in XFS is the set of metadata |
| verifiers built into the buffer cache. |
| These functions perform inexpensive internal consistency checking of the block |
| itself, and answer these questions: |
| |
| - Does the block belong to this filesystem? |
| |
| - Does the block belong to the structure that asked for the read? |
| This assumes that metadata blocks only have one owner, which is always true |
| in XFS. |
| |
| - Is the type of data stored in the block within a reasonable range of what |
| scrub is expecting? |
| |
| - Does the physical location of the block match the location it was read from? |
| |
| - Does the block checksum match the data? |
| |
| The scope of the protections here is very limited -- verifiers can only |
| establish that the filesystem code is reasonably free of gross corruption bugs |
| and that the storage system is reasonably competent at retrieval. |
| Corruption problems observed at runtime cause the generation of health reports, |
| failed system calls, and in the extreme case, filesystem shutdowns if the |
| corrupt metadata force the cancellation of a dirty transaction. |
| |
| Every online fsck scrubbing function is expected to read every ondisk metadata |
| block of a structure in the course of checking the structure. |
| Corruption problems observed during a check are immediately reported to |
| userspace as corruption; during a cross-reference, they are reported as a |
| failure to cross-reference once the full examination is complete. |
| Reads satisfied by a buffer already in cache (and hence already verified) |
| bypass these checks. |
| |
| Internal Consistency Checks |
| ``````````````````````````` |
| |
| After the buffer cache, the next level of metadata protection is the internal |
| record verification code built into the filesystem. |
| These checks are split between the buffer verifiers, the in-filesystem users of |
| the buffer cache, and the scrub code itself, depending on the amount of higher |
| level context required. |
| The scope of checking is still internal to the block. |
| These higher level checking functions answer these questions: |
| |
| - Does the type of data stored in the block match what scrub is expecting? |
| |
| - Does the block belong to the owning structure that asked for the read? |
| |
| - If the block contains records, do the records fit within the block? |
| |
| - If the block tracks internal free space information, is it consistent with |
| the record areas? |
| |
| - Are the records contained inside the block free of obvious corruptions? |
| |
| Record checks in this category are more rigorous and more time-intensive. |
| For example, block pointers and inumbers are checked to ensure that they point |
| within the dynamically allocated parts of an allocation group and within |
| the filesystem. |
| Names are checked for invalid characters, and flags are checked for invalid |
| combinations. |
| Other record attributes are checked for sensible values. |
| Btree records spanning an interval of the btree keyspace are checked for |
| correct order and lack of mergeability (except for file fork mappings). |
| For performance reasons, regular code may skip some of these checks unless |
| debugging is enabled or a write is about to occur. |
| Scrub functions, of course, must check all possible problems. |
| |
| Validation of Userspace-Controlled Record Attributes |
| ```````````````````````````````````````````````````` |
| |
| Various pieces of filesystem metadata are directly controlled by userspace. |
| Because of this nature, validation work cannot be more precise than checking |
| that a value is within the possible range. |
| These fields include: |
| |
| - Superblock fields controlled by mount options |
| - Filesystem labels |
| - File timestamps |
| - File permissions |
| - File size |
| - File flags |
| - Names present in directory entries, extended attribute keys, and filesystem |
| labels |
| - Extended attribute key namespaces |
| - Extended attribute values |
| - File data block contents |
| - Quota limits |
| - Quota timer expiration (if resource usage exceeds the soft limit) |
| |
| Cross-Referencing Space Metadata |
| ```````````````````````````````` |
| |
| After internal block checks, the next higher level of checking is |
| cross-referencing records between metadata structures. |
| For regular runtime code, the cost of these checks is considered to be |
| prohibitively expensive, but as scrub is dedicated to rooting out |
| inconsistencies, it must pursue all avenues of inquiry. |
| The exact set of cross-referencing is highly dependent on the context of the |
| data structure being checked. |
| |
| The XFS btree code has keyspace scanning functions that online fsck uses to |
| cross reference one structure with another. |
| Specifically, scrub can scan the key space of an index to determine if that |
| keyspace is fully, sparsely, or not at all mapped to records. |
| For the reverse mapping btree, it is possible to mask parts of the key for the |
| purposes of performing a keyspace scan so that scrub can decide if the rmap |
| btree contains records mapping a certain extent of physical space without the |
| sparseness of the rest of the rmap keyspace getting in the way. |
| |
| Btree blocks undergo the following checks before cross-referencing: |
| |
| - Does the type of data stored in the block match what scrub is expecting? |
| |
| - Does the block belong to the owning structure that asked for the read? |
| |
| - Do the records fit within the block? |
| |
| - Are the records contained inside the block free of obvious corruptions? |
| |
| - Are the name hashes in the correct order? |
| |
| - Do node pointers within the btree point to valid block addresses for the type |
| of btree? |
| |
| - Do child pointers point towards the leaves? |
| |
| - Do sibling pointers point across the same level? |
| |
| - For each node block record, does the record key accurately reflect the contents |
| of the child block? |
| |
| Space allocation records are cross-referenced as follows: |
| |
| 1. Any space mentioned by any metadata structure is cross-referenced as |
| follows: |
| |
| - Does the reverse mapping index list only the appropriate owner as the |
| owner of each block? |
| |
| - Are none of the blocks claimed as free space? |
| |
| - If these aren't file data blocks, are none of the blocks claimed as space |
| shared by different owners? |
| |
| 2. Btree blocks are cross-referenced as follows: |
| |
| - Everything in class 1 above. |
| |
| - If there's a parent node block, do the keys listed for this block match the |
| keyspace of this block? |
| |
| - Do the sibling pointers point to valid blocks? Of the same level? |
| |
| - Do the child pointers point to valid blocks? Of the next level down? |
| |
| 3. Free space btree records are cross-referenced as follows: |
| |
| - Everything in class 1 and 2 above. |
| |
| - Does the reverse mapping index list no owners of this space? |
| |
| - Is this space not claimed by the inode index for inodes? |
| |
| - Is it not mentioned by the reference count index? |
| |
| - Is there a matching record in the other free space btree? |
| |
| 4. Inode btree records are cross-referenced as follows: |
| |
| - Everything in class 1 and 2 above. |
| |
| - Is there a matching record in the free inode btree? |
| |
| - Do cleared bits in the holemask correspond with inode clusters? |
| |
| - Do set bits in the freemask correspond with inode records with zero link |
| count? |
| |
| 5. Inode records are cross-referenced as follows: |
| |
| - Everything in class 1. |
| |
| - Do all the fields that summarize information about the file forks actually |
| match those forks? |
| |
| - Does each inode with zero link count correspond to a record in the free |
| inode btree? |
| |
| 6. File fork space mapping records are cross-referenced as follows: |
| |
| - Everything in class 1 and 2 above. |
| |
| - Is this space not mentioned by the inode btrees? |
| |
| - If this is a CoW fork mapping, does it correspond to a CoW entry in the |
| reference count btree? |
| |
| 7. Reference count records are cross-referenced as follows: |
| |
| - Everything in class 1 and 2 above. |
| |
| - Within the space subkeyspace of the rmap btree (that is to say, all |
| records mapped to a particular space extent and ignoring the owner info), |
| are there the same number of reverse mapping records for each block as the |
| reference count record claims? |
| |
| Proposed patchsets are the series to find gaps in |
| `refcount btree |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_, |
| `inode btree |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and |
| `rmap btree |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records; |
| to find |
| `mergeable records |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_; |
| and to |
| `improve cross referencing with rmap |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_ |
| before starting a repair. |
| |
| Checking Extended Attributes |
| ```````````````````````````` |
| |
| Extended attributes implement a key-value store that enables fragments of data |
| to be attached to any file. |
| Both the kernel and userspace can access the keys and values, subject to |
| namespace and privilege restrictions. |
| Most typically these fragments are metadata about the file -- origins, security |
| contexts, user-supplied labels, indexing information, etc. |
| |
| Names can be as long as 255 bytes and can exist in several different |
| namespaces. |
| Values can be as large as 64KB. |
| A file's extended attributes are stored in blocks mapped by the attr fork. |
| The mappings point to leaf blocks, remote value blocks, or dabtree blocks. |
| Block 0 in the attribute fork is always the top of the structure, but otherwise |
| each of the three types of blocks can be found at any offset in the attr fork. |
| Leaf blocks contain attribute key records that point to the name and the value. |
| Names are always stored elsewhere in the same leaf block. |
| Values that are less than 3/4 the size of a filesystem block are also stored |
| elsewhere in the same leaf block. |
| Remote value blocks contain values that are too large to fit inside a leaf. |
| If the leaf information exceeds a single filesystem block, a dabtree (also |
| rooted at block 0) is created to map hashes of the attribute names to leaf |
| blocks in the attr fork. |
| |
| Checking an extended attribute structure is not so straightforward due to the |
| lack of separation between attr blocks and index blocks. |
| Scrub must read each block mapped by the attr fork and ignore the non-leaf |
| blocks: |
| |
| 1. Walk the dabtree in the attr fork (if present) to ensure that there are no |
| irregularities in the blocks or dabtree mappings that do not point to |
| attr leaf blocks. |
| |
| 2. Walk the blocks of the attr fork looking for leaf blocks. |
| For each entry inside a leaf: |
| |
| a. Validate that the name does not contain invalid characters. |
| |
| b. Read the attr value. |
| This performs a named lookup of the attr name to ensure the correctness |
| of the dabtree. |
| If the value is stored in a remote block, this also validates the |
| integrity of the remote value block. |
| |
| Checking and Cross-Referencing Directories |
| `````````````````````````````````````````` |
| |
| The filesystem directory tree is a directed acyclic graph structure, with files |
| constituting the nodes, and directory entries (dirents) constituting the edges. |
| Directories are a special type of file containing a set of mappings from a |
| 255-byte sequence (name) to an inumber. |
| These are called directory entries, or dirents for short. |
| Each directory file must have exactly one parent directory pointing to it; |
| the root directory points to itself. |
| Directory entries point to files of any type. |
| Each non-directory file may have multiple directories pointing to it. |
| |
| In XFS, directories are implemented as a file containing up to three 32GB |
| partitions. |
| The first partition contains directory entry data blocks. |
| Each data block contains variable-sized records associating a user-provided |
| name with an inumber and, optionally, a file type. |
| If the directory entry data grows beyond one block, the second partition (which |
| exists as post-EOF extents) is populated with a block containing free space |
| information and an index that maps hashes of the dirent names to directory data |
| blocks in the first partition. |
| This makes directory name lookups very fast. |
| If this second partition grows beyond one block, the third partition is |
| populated with a linear array of free space information for faster |
| expansions. |
| If the free space has been separated and the second partition grows again |
| beyond one block, then a dabtree is used to map hashes of dirent names to |
| directory data blocks. |
| |
| Checking a directory is pretty straightforward: |
| |
| 1. Walk the dabtree in the second partition (if present) to ensure that there |
| are no irregularities in the blocks or dabtree mappings that do not point to |
| dirent blocks. |
| |
| 2. Walk the blocks of the first partition looking for directory entries. |
| Each dirent is checked as follows: |
| |
| a. Does the name contain no invalid characters? |
| |
| b. Does the inumber correspond to an actual, allocated inode? |
| |
| c. Does the child inode have a nonzero link count? |
| |
| d. If a file type is included in the dirent, does it match the type of the |
| inode? |
| |
| e. If the child is a subdirectory, does the child's dotdot pointer point |
| back to the parent? |
| |
| f. If the directory has a second partition, perform a named lookup of the |
| dirent name to ensure the correctness of the dabtree. |
| |
| 3. Walk the free space list in the third partition (if present) to ensure that |
| the free spaces it describes are really unused. |
| |
| Checking operations involving :ref:`parents <dirparent>` and |
| :ref:`file link counts <nlinks>` are discussed in more detail in later |
| sections. |
| |
| Checking Directory/Attribute Btrees |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| As stated in previous sections, the directory/attribute btree (dabtree) index |
| maps user-provided names to improve lookup times by avoiding linear scans. |
| Internally, it maps a 32-bit hash of the name to a block offset within the |
| appropriate file fork. |
| |
| The internal structure of a dabtree closely resembles the btrees that record |
| fixed-size metadata records -- each dabtree block contains a magic number, a |
| checksum, sibling pointers, a UUID, a tree level, and a log sequence number. |
| The format of leaf and node records is the same -- each entry points to the |
| next level down in the hierarchy, with dabtree node records pointing to dabtree |
| leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere |
| in the fork. |
| |
| Checking and cross-referencing the dabtree is very similar to what is done for |
| space btrees: |
| |
| - Does the type of data stored in the block match what scrub is expecting? |
| |
| - Does the block belong to the owning structure that asked for the read? |
| |
| - Do the records fit within the block? |
| |
| - Are the records contained inside the block free of obvious corruptions? |
| |
| - Are the name hashes in the correct order? |
| |
| - Do node pointers within the dabtree point to valid fork offsets for dabtree |
| blocks? |
| |
| - Do leaf pointers within the dabtree point to valid fork offsets for directory |
| or attr leaf blocks? |
| |
| - Do child pointers point towards the leaves? |
| |
| - Do sibling pointers point across the same level? |
| |
| - For each dabtree node record, does the record key accurately reflect the |
| contents of the child dabtree block? |
| |
| - For each dabtree leaf record, does the record key accurately reflect the |
| contents of the directory or attr block? |
| |
| Cross-Referencing Summary Counters |
| `````````````````````````````````` |
| |
| XFS maintains three classes of summary counters: available resources, quota |
| resource usage, and file link counts. |
| |
| In theory, the amount of available resources (data blocks, inodes, realtime |
| extents) can be found by walking the entire filesystem. |
| This would make for very slow reporting, so a transactional filesystem can |
| maintain summaries of this information in the superblock. |
| Cross-referencing these values against the filesystem metadata should be a |
| simple matter of walking the free space and inode metadata in each AG and the |
| realtime bitmap, but there are complications that will be discussed in |
| :ref:`more detail <fscounters>` later. |
| |
| :ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>` |
| checking are sufficiently complicated to warrant separate sections. |
| |
| Post-Repair Reverification |
| `````````````````````````` |
| |
| After performing a repair, the checking code is run a second time to validate |
| the new structure, and the results of the health assessment are recorded |
| internally and returned to the calling process. |
| This step is critical for enabling system administrators to monitor the status |
| of the filesystem and the progress of any repairs. |
| For developers, it is a useful means to judge the efficacy of error detection |
| and correction in the online and offline checking tools. |
| |
| Eventual Consistency vs. Online Fsck |
| ------------------------------------ |
| |
| Complex operations can make modifications to multiple per-AG data structures |
| with a chain of transactions. |
| These chains, once committed to the log, are restarted during log recovery if |
| the system crashes while processing the chain. |
| Because the AG header buffers are unlocked between transactions within a chain, |
| online checking must coordinate with chained operations that are in progress to |
| avoid incorrectly detecting inconsistencies due to pending chains. |
| Furthermore, online repair must not run when operations are pending because |
| the metadata are temporarily inconsistent with each other, and rebuilding is |
| not possible. |
| |
| Only online fsck has this requirement of total consistency of AG metadata, and |
| fsck runs should be relatively rare compared to filesystem change operations. |
| Online fsck coordinates with transaction chains as follows: |
| |
| * For each AG, maintain a count of intent items targeting that AG. |
| The count should be bumped whenever a new item is added to the chain. |
| The count should be dropped when the filesystem has locked the AG header |
| buffers and finished the work. |
| |
| * When online fsck wants to examine an AG, it should lock the AG header |
| buffers to quiesce all transaction chains that want to modify that AG. |
| If the count is zero, proceed with the checking operation. |
| If it is nonzero, cycle the buffer locks to allow the chain to make forward |
| progress. |
| |
| This may lead to online fsck taking a long time to complete, but regular |
| filesystem updates take precedence over background checking activity. |
| Details about the discovery of this situation are presented in the |
| :ref:`next section <chain_coordination>`, and details about the solution |
| are presented :ref:`after that<intent_drains>`. |
| |
| .. _chain_coordination: |
| |
| Discovery of the Problem |
| ```````````````````````` |
| |
| Midway through the development of online scrubbing, the fsstress tests |
| uncovered a misinteraction between online fsck and compound transaction chains |
| created by other writer threads that resulted in false reports of metadata |
| inconsistency. |
| The root cause of these reports is the eventual consistency model introduced by |
| the expansion of deferred work items and compound transaction chains when |
| reverse mapping and reflink were introduced. |
| |
| Originally, transaction chains were added to XFS to avoid deadlocks when |
| unmapping space from files. |
| Deadlock avoidance rules require that AGs only be locked in increasing order, |
| which makes it impossible (say) to use a single transaction to free a space |
| extent in AG 7 and then try to free a now superfluous block mapping btree block |
| in AG 3. |
| To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log |
| items to commit to freeing some space in one transaction while deferring the |
| actual metadata updates to a fresh transaction. |
| The transaction sequence looks like this: |
| |
| 1. The first transaction contains a physical update to the file's block mapping |
| structures to remove the mapping from the btree blocks. |
| It then attaches to the in-memory transaction an action item to schedule |
| deferred freeing of space. |
| Concretely, each transaction maintains a list of ``struct |
| xfs_defer_pending`` objects, each of which maintains a list of ``struct |
| xfs_extent_free_item`` objects. |
| Returning to the example above, the action item tracks the freeing of both |
| the unmapped space from AG 7 and the block mapping btree (BMBT) block from |
| AG 3. |
| Deferred frees recorded in this manner are committed in the log by creating |
| an EFI log item from the ``struct xfs_extent_free_item`` object and |
| attaching the log item to the transaction. |
| When the log is persisted to disk, the EFI item is written into the ondisk |
| transaction record. |
| EFIs can list up to 16 extents to free, all sorted in AG order. |
| |
| 2. The second transaction contains a physical update to the free space btrees |
| of AG 3 to release the former BMBT block and a second physical update to the |
| free space btrees of AG 7 to release the unmapped file space. |
| Observe that the physical updates are resequenced in the correct order |
| when possible. |
| Attached to the transaction is an extent free done (EFD) log item. |
| The EFD contains a pointer to the EFI logged in transaction #1 so that log |
| recovery can tell if the EFI needs to be replayed. |
| |
| If the system goes down after transaction #1 is written back to the filesystem |
| but before #2 is committed, a scan of the filesystem metadata would show |
| inconsistent filesystem metadata because there would not appear to be any owner |
| of the unmapped space. |
| Happily, log recovery corrects this inconsistency for us -- when recovery finds |
| an intent log item but does not find a corresponding intent done item, it will |
| reconstruct the incore state of the intent item and finish it. |
| In the example above, the log must replay both frees described in the recovered |
| EFI to complete the recovery phase. |
| |
| There are subtleties to XFS' transaction chaining strategy to consider: |
| |
| * Log items must be added to a transaction in the correct order to prevent |
| conflicts with principal objects that are not held by the transaction. |
| In other words, all per-AG metadata updates for an unmapped block must be |
| completed before the last update to free the extent, and extents should not |
| be reallocated until that last update commits to the log. |
| |
| * AG header buffers are released between each transaction in a chain. |
| This means that other threads can observe an AG in an intermediate state, |
| but as long as the first subtlety is handled, this should not affect the |
| correctness of filesystem operations. |
| |
| * Unmounting the filesystem flushes all pending work to disk, which means that |
| offline fsck never sees the temporary inconsistencies caused by deferred |
| work item processing. |
| |
| In this manner, XFS employs a form of eventual consistency to avoid deadlocks |
| and increase parallelism. |
| |
| During the design phase of the reverse mapping and reflink features, it was |
| decided that it was impractical to cram all the reverse mapping updates for a |
| single filesystem change into a single transaction because a single file |
| mapping operation can explode into many small updates: |
| |
| * The block mapping update itself |
| * A reverse mapping update for the block mapping update |
| * Fixing the freelist |
| * A reverse mapping update for the freelist fix |
| |
| * A shape change to the block mapping btree |
| * A reverse mapping update for the btree update |
| * Fixing the freelist (again) |
| * A reverse mapping update for the freelist fix |
| |
| * An update to the reference counting information |
| * A reverse mapping update for the refcount update |
| * Fixing the freelist (a third time) |
| * A reverse mapping update for the freelist fix |
| |
| * Freeing any space that was unmapped and not owned by any other file |
| * Fixing the freelist (a fourth time) |
| * A reverse mapping update for the freelist fix |
| |
| * Freeing the space used by the block mapping btree |
| * Fixing the freelist (a fifth time) |
| * A reverse mapping update for the freelist fix |
| |
| Free list fixups are not usually needed more than once per AG per transaction |
| chain, but it is theoretically possible if space is very tight. |
| For copy-on-write updates this is even worse, because this must be done once to |
| remove the space from a staging area and again to map it into the file! |
| |
| To deal with this explosion in a calm manner, XFS expands its use of deferred |
| work items to cover most reverse mapping updates and all refcount updates. |
| This reduces the worst case size of transaction reservations by breaking the |
| work into a long chain of small updates, which increases the degree of eventual |
| consistency in the system. |
| Again, this generally isn't a problem because XFS orders its deferred work |
| items carefully to avoid resource reuse conflicts between unsuspecting threads. |
| |
| However, online fsck changes the rules -- remember that although physical |
| updates to per-AG structures are coordinated by locking the buffers for AG |
| headers, buffer locks are dropped between transactions. |
| Once scrub acquires resources and takes locks for a data structure, it must do |
| all the validation work without releasing the lock. |
| If the main lock for a space btree is an AG header buffer lock, scrub may have |
| interrupted another thread that is midway through finishing a chain. |
| For example, if a thread performing a copy-on-write has completed a reverse |
| mapping update but not the corresponding refcount update, the two AG btrees |
| will appear inconsistent to scrub, and an incorrect observation of corruption |
| will be recorded. |
| If a repair is attempted in this state, the results will be catastrophic! |
| |
| Several other solutions to this problem were evaluated upon discovery of this |
| flaw and rejected: |
| |
| 1. Add a higher level lock to allocation groups and require writer threads to |
| acquire the higher level lock in AG order before making any changes. |
| This would be very difficult to implement in practice because it is |
| difficult to determine which locks need to be obtained, and in what order, |
| without simulating the entire operation. |
| Performing a dry run of a file operation to discover necessary locks would |
| make the filesystem very slow. |
| |
| 2. Make the deferred work coordinator code aware of consecutive intent items |
| targeting the same AG and have it hold the AG header buffers locked across |
| the transaction roll between updates. |
| This would introduce a lot of complexity into the coordinator since it is |
| only loosely coupled with the actual deferred work items. |
| It would also fail to solve the problem because deferred work items can |
| generate new deferred subtasks, but all subtasks must be complete before |
| work can start on a new sibling task. |
| |
| 3. Teach online fsck to walk all transactions waiting for whichever lock(s) |
| protect the data structure being scrubbed to look for pending operations. |
| The checking and repair operations must factor these pending operations into |
| the evaluations being performed. |
| This solution is a nonstarter because it is *extremely* invasive to the main |
| filesystem. |
| |
| .. _intent_drains: |
| |
| Intent Drains |
| ````````````` |
| |
| Online fsck uses an atomic intent item counter and lock cycling to coordinate |
| with transaction chains. |
| There are two key properties to the drain mechanism. |
| First, the counter is incremented when a deferred work item is *queued* to a |
| transaction, and it is decremented after the associated intent done log item is |
| *committed* to another transaction. |
| The second property is that deferred work can be added to a transaction without |
| holding an AG header lock, but per-AG work items cannot be marked done without |
| locking that AG header buffer to log the physical updates and the intent done |
| log item. |
| The first property enables scrub to yield to running transaction chains, which |
| is an explicit deprioritization of online fsck to benefit file operations. |
| The second property of the drain is key to the correct coordination of scrub, |
| since scrub will always be able to decide if a conflict is possible. |
| |
| For regular filesystem code, the drain works as follows: |
| |
| 1. Call the appropriate subsystem function to add a deferred work item to a |
| transaction. |
| |
| 2. The function calls ``xfs_defer_drain_bump`` to increase the counter. |
| |
| 3. When the deferred item manager wants to finish the deferred work item, it |
| calls ``->finish_item`` to complete it. |
| |
| 4. The ``->finish_item`` implementation logs some changes and calls |
| ``xfs_defer_drain_drop`` to decrease the sloppy counter and wake up any threads |
| waiting on the drain. |
| |
| 5. The subtransaction commits, which unlocks the resource associated with the |
| intent item. |
| |
| For scrub, the drain works as follows: |
| |
| 1. Lock the resource(s) associated with the metadata being scrubbed. |
| For example, a scan of the refcount btree would lock the AGI and AGF header |
| buffers. |
| |
| 2. If the counter is zero (``xfs_defer_drain_busy`` returns false), there are no |
| chains in progress and the operation may proceed. |
| |
| 3. Otherwise, release the resources grabbed in step 1. |
| |
| 4. Wait for the intent counter to reach zero (``xfs_defer_drain_intents``), then go |
| back to step 1 unless a signal has been caught. |
| |
| To avoid polling in step 4, the drain provides a waitqueue for scrub threads to |
| be woken up whenever the intent count drops to zero. |
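| |
| A rough sketch of this scrub-side loop follows. |
| The drain helpers are named as in the steps above, but their exact |
| prototypes, and the AG header locking helpers, are assumptions made for |
| illustration. |
| |
| .. code-block:: c |
| |
|     int |
|     xchk_example_lock_ag( |
|         struct xfs_scrub        *sc, |
|         struct xfs_defer_drain  *drain) |
|     { |
|         int                     error; |
| |
|         do { |
|             /* Step 1: grab the AG header buffers. */ |
|             error = xchk_example_grab_ag_headers(sc);    /* hypothetical */ |
|             if (error) |
|                 return error; |
| |
|             /* Step 2: nobody is mid-chain, so checking can proceed. */ |
|             if (!xfs_defer_drain_busy(drain)) |
|                 return 0; |
| |
|             /* Step 3: back off so the chain can make forward progress. */ |
|             xchk_example_release_ag_headers(sc);         /* hypothetical */ |
| |
|             /* Step 4: sleep until the intent count reaches zero. */ |
|             error = xfs_defer_drain_intents(drain); |
|         } while (!error); |
| |
|         return error; |
|     } |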
| |
| The proposed patchset is the |
| `scrub intent drain series |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_. |
| |
| .. _jump_labels: |
| |
| Static Keys (aka Jump Label Patching) |
| ````````````````````````````````````` |
| |
| Online fsck for XFS separates the regular filesystem from the checking and |
| repair code as much as possible. |
| However, there are a few parts of online fsck (such as the intent drains, and |
| later, live update hooks) where it is useful for the online fsck code to know |
| what's going on in the rest of the filesystem. |
| Since it is not expected that online fsck will be constantly running in the |
| background, it is very important to minimize the runtime overhead imposed by |
| these hooks when online fsck is compiled into the kernel but not actively |
| running on behalf of userspace. |
| Taking locks in the hot path of a writer thread to access a data structure only |
| to find that no further action is necessary is expensive -- on the author's |
| computer, this has an overhead of 40-50ns per access. |
| Fortunately, the kernel supports dynamic code patching, which enables XFS to |
| replace a static branch to hook code with ``nop`` sleds when online fsck isn't |
| running. |
| This sled has an overhead of however long it takes the instruction decoder to |
| skip past it, which seems to be less than a nanosecond, and it does not access |
| memory outside of instruction fetching. |
| |
| When online fsck enables the static key, the sled is replaced with an |
| unconditional branch to call the hook code. |
| The switchover is quite expensive (~22000ns) but is paid entirely by the |
| program that invoked online fsck, and can be amortized if multiple threads |
| enter online fsck at the same time, or if multiple filesystems are being |
| checked at the same time. |
| Changing the branch direction requires taking the CPU hotplug lock, and since |
| CPU initialization requires memory allocation, online fsck must be careful not |
| to change a static key while holding any locks or resources that could be |
| accessed in the memory reclaim paths. |
| To minimize contention on the CPU hotplug lock, care should be taken not to |
| enable or disable static keys unnecessarily. |
| |
| Because static keys are intended to minimize hook overhead for regular |
| filesystem operations when xfs_scrub is not running, the intended usage |
| patterns are as follows (a short sketch appears after the list): |
| |
| - The hooked part of XFS should declare a static-scoped static key that |
| defaults to false. |
| The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this. |
| The static key itself should be declared as a ``static`` variable. |
| |
| - When deciding to invoke code that's only used by scrub, the regular |
| filesystem should call the ``static_branch_unlikely`` predicate to avoid the |
| scrub-only hook code if the static key is not enabled. |
| |
| - The regular filesystem should export helper functions that call |
| ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the |
| static key. |
| Wrapper functions make it easy to compile out the relevant code if the kernel |
| distributor turns off online fsck at build time. |
| |
| - Scrub functions wanting to turn on scrub-only XFS functionality should call |
| ``xchk_fsgates_enable`` from the setup function to enable a specific |
| hook. |
| This must be done before obtaining any resources that are used by memory |
| reclaim. |
| Callers had better be sure they really need the functionality gated by the |
| static key; the ``TRY_HARDER`` flag is useful here. |
| |
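| A sketch of this pattern follows. |
| The key, hook, and helper names here are illustrative, not the actual XFS |
| symbols; only the jump label macros are the real kernel API. |
| |
| .. code-block:: c |
| |
|     #include <linux/jump_label.h> |
| |
|     struct xfs_example_update;                                   /* hypothetical */ |
|     void xfs_example_do_hook(struct xfs_example_update *up);     /* hypothetical */ |
| |
|     static DEFINE_STATIC_KEY_FALSE(xfs_example_hook_gate); |
| |
|     /* Hot path in the regular filesystem: a nop sled when scrub is idle. */ |
|     static inline void xfs_example_hook(struct xfs_example_update *up) |
|     { |
|         if (static_branch_unlikely(&xfs_example_hook_gate)) |
|             xfs_example_do_hook(up); |
|     } |
| |
|     /* Wrappers exported to the scrub code to flip the branch direction. */ |
|     void xfs_example_hooks_on(void) |
|     { |
|         static_branch_inc(&xfs_example_hook_gate); |
|     } |
| |
|     void xfs_example_hooks_off(void) |
|     { |
|         static_branch_dec(&xfs_example_hook_gate); |
|     } |
| |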
| Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to |
| handle locking AGI and AGF buffers for all scrubber functions. |
| If it detects a conflict between scrub and the running transactions, it will |
| try to wait for intents to complete. |
| If the caller of the helper has not enabled the static key, the helper will |
| return -EDEADLOCK, which should result in the scrub being restarted with the |
| ``TRY_HARDER`` flag set. |
| The scrub setup function should detect that flag, enable the static key, and |
| try the scrub again. |
| Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``. |
| |
| For more information, please see the kernel documentation of |
| Documentation/staging/static-keys.rst. |
| |
| .. _xfile: |
| |
| Pageable Kernel Memory |
| ---------------------- |
| |
| Some online checking functions work by scanning the filesystem to build a |
| shadow copy of an ondisk metadata structure in memory and comparing the two |
| copies. |
| For online repair to rebuild a metadata structure, it must compute the record |
| set that will be stored in the new structure before it can persist that new |
| structure to disk. |
| Ideally, repairs complete with a single atomic commit that introduces |
| a new data structure. |
| To meet these goals, the kernel needs to collect a large amount of information |
| in a place that doesn't require the correct operation of the filesystem. |
| |
| Kernel memory isn't suitable because: |
| |
| * Allocating a contiguous region of memory to create a C array is very |
| difficult, especially on 32-bit systems. |
| |
| * Linked lists of records introduce double pointer overhead which is very high |
| and eliminate the possibility of indexed lookups. |
| |
| * Kernel memory is pinned, which can drive the system into OOM conditions. |
| |
| * The system might not have sufficient memory to stage all the information. |
| |
| At any given time, online fsck does not need to keep the entire record set in |
| memory, which means that individual records can be paged out if necessary. |
| Continued development of online fsck demonstrated that the ability to perform |
| indexed data storage would also be very useful. |
| Fortunately, the Linux kernel already has a facility for byte-addressable and |
| pageable storage: tmpfs. |
| In-kernel graphics drivers (most notably i915) take advantage of tmpfs files |
| to store intermediate data that doesn't need to be in memory at all times, so |
| that usage precedent is already established. |
| Hence, the ``xfile`` was born! |
| |
| +--------------------------------------------------------------------------+ |
| | **Historical Sidebar**: | |
| +--------------------------------------------------------------------------+ |
| | The first edition of online repair inserted records into a new btree as | |
| | it found them, which failed because the filesystem could shut down with | |
| | a half-rebuilt data structure that would be live after recovery finished. | |
| | | |
| | The second edition solved the half-rebuilt structure problem by storing | |
| | everything in memory, but frequently ran the system out of memory. | |
| | | |
| | The third edition solved the OOM problem by using linked lists, but the | |
| | memory overhead of the list pointers was extreme. | |
| +--------------------------------------------------------------------------+ |
| |
| xfile Access Models |
| ``````````````````` |
| |
| A survey of the intended uses of xfiles suggested these use cases: |
| |
| 1. Arrays of fixed-sized records (space management btrees, directory and |
| extended attribute entries) |
| |
| 2. Sparse arrays of fixed-sized records (quotas and link counts) |
| |
| 3. Large binary objects (BLOBs) of variable sizes (directory and extended |
| attribute names and values) |
| |
| 4. Staging btrees in memory (reverse mapping btrees) |
| |
| 5. Arbitrary contents (realtime space management) |
| |
| To support the first four use cases, high level data structures wrap the xfile |
| to share functionality between online fsck functions. |
| The rest of this section discusses the interfaces that the xfile presents to |
| four of those five higher level data structures. |
| The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case |
| study. |
| |
| XFS is very record-based, which suggests that the ability to load and store |
| complete records is important. |
| To support these cases, a pair of ``xfile_load`` and ``xfile_store`` |
| functions are provided to read objects from and persist objects to an xfile; |
| both treat any error as an out of memory error. |
| For online repair, squashing error conditions in this manner is an acceptable |
| behavior because the only reaction is to abort the operation back to |
| userspace. |
| |
| However, no discussion of file access idioms is complete without answering the |
| question, "But what about mmap?" |
| It is convenient to access storage directly with pointers, just like userspace |
| code does with regular memory. |
| Online fsck must not drive the system into OOM conditions, which means that |
| xfiles must be responsive to memory reclamation. |
| tmpfs can only push a pagecache folio to the swap cache if the folio is neither |
| pinned nor locked, which means the xfile must not pin too many folios. |
| |
| Short term direct access to xfile contents is done by locking the pagecache |
| folio and mapping it into kernel address space; the object load and store |
| functions use this mechanism. |
| Folio locks are not supposed to be held for long periods of time, so |
| long term direct access to xfile contents is done by bumping the folio refcount, |
| mapping it into kernel address space, and dropping the folio lock. |
| These long term users *must* be responsive to memory reclaim by hooking into |
| the shrinker infrastructure to know when to release folios. |
| |
| The ``xfile_get_folio`` and ``xfile_put_folio`` functions are provided to |
| retrieve the (locked) folio that backs part of an xfile and to release it. |
| The only code to use these folio lease functions are the xfarray |
| :ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory |
| btrees<xfbtree>`. |
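| |
| The following sketch shows a folio lease wrapped around direct access to a |
| record stored in an xfile. |
| The prototypes of ``xfile_get_folio`` and ``xfile_put_folio``, the |
| ``XFILE_ALLOC`` flag, and the record layout are assumptions made for |
| illustration. |
| |
| .. code-block:: c |
| |
|     struct xfs_example_rec { |
|         __be64          value;        /* hypothetical record */ |
|     }; |
| |
|     static int |
|     xfile_example_zero_rec( |
|         struct xfile    *xf, |
|         loff_t          pos) |
|     { |
|         struct folio    *folio; |
|         void            *rec; |
| |
|         folio = xfile_get_folio(xf, pos, sizeof(struct xfs_example_rec), |
|                 XFILE_ALLOC); |
|         if (IS_ERR(folio)) |
|             return PTR_ERR(folio); |
| |
|         /* Operate directly on the record while holding the folio lease. */ |
|         rec = folio_address(folio) + offset_in_folio(folio, pos); |
|         memset(rec, 0, sizeof(struct xfs_example_rec)); |
| |
|         xfile_put_folio(xf, folio); |
|         return 0; |
|     } |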
| |
| xfile Access Coordination |
| ````````````````````````` |
| |
| For security reasons, xfiles must be owned privately by the kernel. |
| They are marked ``S_PRIVATE`` to prevent interference from the security system, |
| must never be mapped into process file descriptor tables, and their pages must |
| never be mapped into userspace processes. |
| |
| To avoid locking recursion issues with the VFS, all accesses to the shmfs file |
| are performed by manipulating the page cache directly. |
| xfile writers call the ``->write_begin`` and ``->write_end`` functions of the |
| xfile's address space to grab writable pages, copy the caller's buffer into the |
| page, and release the pages. |
| xfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly |
| before copying the contents into the caller's buffer. |
| In other words, xfiles ignore the VFS read and write code paths to avoid |
| having to create a dummy ``struct kiocb`` and to avoid taking inode and |
| freeze locks. |
| tmpfs cannot be frozen, and xfiles must not be exposed to userspace. |
| |
| If an xfile is shared between threads to stage repairs, the caller must provide |
| its own locks to coordinate access. |
| For example, if a scrub function stores scan results in an xfile and needs |
| other threads to provide updates to the scanned data, the scrub function must |
| provide a lock for all threads to share. |
| |
| .. _xfarray: |
| |
| Arrays of Fixed-Sized Records |
| ````````````````````````````` |
| |
| In XFS, each type of indexed space metadata (free space, inodes, reference |
| counts, file fork space, and reverse mappings) consists of a set of fixed-size |
| records indexed with a classic B+ tree. |
| Directories have a set of fixed-size dirent records that point to the names, |
| and extended attributes have a set of fixed-size attribute keys that point to |
| names and values. |
| Quota counters and file link counters index records with numbers. |
| During a repair, scrub needs to stage new records during the gathering step and |
| retrieve them during the btree building step. |
| |
| Although this requirement can be satisfied by calling the read and write |
| methods of the xfile directly, it is simpler for callers for there to be a |
| higher level abstraction to take care of computing array offsets, to provide |
| iterator functions, and to deal with sparse records and sorting. |
| The ``xfarray`` abstraction presents a linear array for fixed-size records atop |
| the byte-accessible xfile. |
| |
| .. _xfarray_access_patterns: |
| |
| Array Access Patterns |
| ^^^^^^^^^^^^^^^^^^^^^ |
| |
| Array access patterns in online fsck tend to fall into three categories. |
| Iteration of records is assumed to be necessary for all cases and will be |
| covered in the next section. |
| |
| The first type of caller handles records that are indexed by position. |
| Gaps may exist between records, and a record may be updated multiple times |
| during the collection step. |
| In other words, these callers want a sparse linearly addressed table file. |
| Typical use cases are quota records and file link count records. |
| Access to array elements is performed programmatically via ``xfarray_load`` and |
| ``xfarray_store`` functions, which wrap the similarly-named xfile functions to |
| provide loading and storing of array elements at arbitrary array indices. |
| Gaps are defined to be null records, and null records are defined to be a |
| sequence of all zero bytes. |
| Null records are detected by calling ``xfarray_element_is_null``. |
| They are created either by calling ``xfarray_unset`` to null out an existing |
| record or by never storing anything to an array index. |
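| |
| A sketch of this sparse table usage follows, as a link count scanner might |
| use it. |
| The record layout and helper name are illustrative, and the sketch assumes, |
| per the description above, that loading a never-written index yields a null |
| (all zeroes) record rather than an error. |
| |
| .. code-block:: c |
| |
|     struct xchk_example_nlink { |
|         xfs_nlink_t     nlink; |
|     }; |
| |
|     static int |
|     xchk_example_bump_nlink( |
|         struct xfarray  *table, |
|         xfs_ino_t       ino) |
|     { |
|         struct xchk_example_nlink       rec; |
|         int                             error; |
| |
|         error = xfarray_load(table, ino, &rec); |
|         if (error) |
|             return error; |
| |
|         rec.nlink++; |
|         return xfarray_store(table, ino, &rec); |
|     } |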
| |
| The second type of caller handles records that are not indexed by position |
| and do not require multiple updates to a record. |
| The typical use case here is rebuilding space btrees and key/value btrees. |
| These callers can add records to the array without caring about array indices |
| via the ``xfarray_append`` function, which stores a record at the end of the |
| array. |
| For callers that require records to be presentable in a specific order (e.g. |
| rebuilding btree data), the ``xfarray_sort`` function can arrange the sorted |
| records; this function will be covered later. |
| |
| The third type of caller is a bag, which is useful for counting records. |
| The typical use case here is constructing space extent reference counts from |
| reverse mapping information. |
| Records can be put in the bag in any order, they can be removed from the bag |
| at any time, and uniqueness of records is left to callers. |
| The ``xfarray_store_anywhere`` function is used to insert a record in any |
| null record slot in the bag; and the ``xfarray_unset`` function removes a |
| record from the bag. |
| |
| The proposed patchset is the |
| `big in-memory array |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_. |
| |
| Iterating Array Elements |
| ^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Most users of the xfarray require the ability to iterate the records stored in |
| the array. |
| Callers can probe every possible array index with the following: |
| |
| .. code-block:: c |
| |
| xfarray_idx_t i; |
| foreach_xfarray_idx(array, i) { |
| xfarray_load(array, i, &rec); |
| |
| /* do something with rec */ |
| } |
| |
| All users of this idiom must be prepared to handle null records or must already |
| know that there aren't any. |
| |
| For xfarray users that want to iterate a sparse array, the ``xfarray_iter`` |
| function ignores indices in the xfarray that have never been written to by |
| calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas |
| of the array that are not populated with memory pages. |
| Once it finds a page, it will skip the zeroed areas of the page. |
| |
| .. code-block:: c |
| |
| xfarray_idx_t i = XFARRAY_CURSOR_INIT; |
| while ((ret = xfarray_iter(array, &i, &rec)) == 1) { |
| /* do something with rec */ |
| } |
| |
| .. _xfarray_sort: |
| |
| Sorting Array Elements |
| ^^^^^^^^^^^^^^^^^^^^^^ |
| |
| During the fourth demonstration of online repair, a community reviewer remarked |
| that for performance reasons, online repair ought to load batches of records |
| into btree record blocks instead of inserting records into a new btree one at a |
| time. |
| The btree insertion code in XFS is responsible for maintaining correct ordering |
| of the records, so naturally the xfarray must also support sorting the record |
| set prior to bulk loading. |
| |
| Case Study: Sorting xfarrays |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| The sorting algorithm used in the xfarray is actually a combination of adaptive |
| quicksort and a heapsort subalgorithm in the spirit of |
| `Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and |
| `pdqsort <https://github.com/orlp/pdqsort>`_, with customizations for the Linux |
| kernel. |
| To sort records in a reasonably short amount of time, ``xfarray`` takes |
| advantage of the binary subpartitioning offered by quicksort, but it also uses |
| heapsort to hedge against performance collapse if the chosen quicksort pivots |
| are poor. |
| Both algorithms are (in general) O(n * lg(n)), but there is a wide performance |
| gulf between the two implementations. |
| |
| The Linux kernel already contains a reasonably fast implementation of heapsort. |
| It only operates on regular C arrays, which limits the scope of its usefulness. |
| There are two key places where the xfarray uses it: |
| |
| * Sorting any record subset backed by a single xfile page. |
| |
| * Loading a small number of xfarray records from potentially disparate parts |
| of the xfarray into a memory buffer, and sorting the buffer. |
| |
| In other words, ``xfarray`` uses heapsort to constrain the nested recursion of |
| quicksort, thereby mitigating quicksort's worst runtime behavior. |
| |
| Choosing a quicksort pivot is a tricky business. |
| A good pivot splits the set to sort in half, leading to the divide and conquer |
| behavior that is crucial to O(n * lg(n)) performance. |
| A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`) |
| runtime. |
| The xfarray sort routine tries to avoid picking a bad pivot by sampling nine |
| records into a memory buffer and using the kernel heapsort to identify the |
| median of the nine. |
| |
| Most modern quicksort implementations employ Tukey's "ninther" to select a |
| pivot from a classic C array. |
| Typical ninther implementations pick three unique triads of records, sort each |
| of the triads, and then sort the middle value of each triad to determine the |
| ninther value. |
| As stated previously, however, xfile accesses are not entirely cheap. |
| It turned out to be much more performant to read the nine elements into a |
| memory buffer, run the kernel's in-memory heapsort on the buffer, and choose |
| the 4th element of that buffer as the pivot. |
| Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for |
| low-effort robust (resistant) location in large samples`, in *Contributions to |
| Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press, |
| 1978), pp. 251–257. |
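| |
| The following sketch shows the pivot selection just described; the sample |
| record type and comparison function are illustrative, and ``sort()`` is the |
| kernel's heapsort from ``linux/sort.h``. |
| |
| .. code-block:: c |
| |
|     struct xfarray_example_rec { |
|         uint64_t        key;        /* hypothetical record */ |
|     }; |
| |
|     static int |
|     xfarray_example_cmp(const void *a, const void *b) |
|     { |
|         const struct xfarray_example_rec *ra = a; |
|         const struct xfarray_example_rec *rb = b; |
| |
|         if (ra->key > rb->key) |
|             return 1; |
|         if (ra->key < rb->key) |
|             return -1; |
|         return 0; |
|     } |
| |
|     /* |
|      * Sample nine records spread across [lo, hi], heapsort the samples, and |
|      * return the middle one as the quicksort pivot.  Assumes hi - lo >= 8. |
|      */ |
|     static int |
|     xfarray_example_pick_pivot( |
|         struct xfarray                  *array, |
|         xfarray_idx_t                   lo, |
|         xfarray_idx_t                   hi, |
|         struct xfarray_example_rec      *pivot) |
|     { |
|         struct xfarray_example_rec      samples[9]; |
|         uint64_t                        step = (hi - lo) / 8; |
|         int                             i, error; |
| |
|         for (i = 0; i < 9; i++) { |
|             error = xfarray_load(array, lo + i * step, &samples[i]); |
|             if (error) |
|                 return error; |
|         } |
| |
|         sort(samples, 9, sizeof(samples[0]), xfarray_example_cmp, NULL); |
|         *pivot = samples[4]; |
|         return 0; |
|     } |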
| |
| The partitioning of quicksort is fairly textbook -- rearrange the record |
| subset around the pivot, then set up the current and next stack frames to |
| sort the larger and the smaller halves of the subset, respectively. |
| This keeps the stack space requirements to log2(record count). |
| |
| As a final performance optimization, the hi and lo scanning phase of quicksort |
| keeps examined xfile pages mapped in the kernel for as long as possible to |
| reduce map/unmap cycles. |
| Surprisingly, this reduces overall sort runtime by nearly half again after |
| accounting for the application of heapsort directly onto xfile pages. |
| |
| .. _xfblob: |
| |
| Blob Storage |
| ```````````` |
| |
| Extended attributes and directories add an additional requirement for staging |
| records: arbitrary byte sequences of finite length. |
| Each directory entry record needs to store the entry name, |
| and each extended attribute needs to store both the attribute name and value. |
| The names, keys, and values can consume a large amount of memory, so the |
| ``xfblob`` abstraction was created to simplify management of these blobs |
| atop an xfile. |
| |
| Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve |
| and persist objects. |
| The store function returns a magic cookie for every object that it persists. |
| Later, callers provide this cookie to ``xfblob_load`` to recall the object. |
| The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate`` |
| function frees them all because compaction is not needed. |
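| |
| A sketch of staging and recalling a name with these functions follows; the |
| exact prototypes (argument order and cookie type) are assumptions made for |
| illustration. |
| |
| .. code-block:: c |
| |
|     /* Persist the name bytes and remember the cookie for later. */ |
|     static int |
|     xchk_example_stage_name( |
|         struct xfblob           *blobs, |
|         const struct xfs_name   *name, |
|         xfblob_cookie           *cookie) |
|     { |
|         return xfblob_store(blobs, cookie, name->name, name->len); |
|     } |
| |
|     /* Use the cookie returned by the store function to load the bytes back. */ |
|     static int |
|     xchk_example_recall_name( |
|         struct xfblob           *blobs, |
|         xfblob_cookie           cookie, |
|         unsigned char           *buf, |
|         uint32_t                len) |
|     { |
|         return xfblob_load(blobs, cookie, buf, len); |
|     } |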
| |
| The details of repairing directories and extended attributes will be discussed |
| in a subsequent section about atomic file content exchanges. |
| However, it should be noted that these repair functions only use blob storage |
| to cache a small number of entries before adding them to a temporary ondisk |
| file, which is why compaction is not required. |
| |
| The proposed patchset is at the start of the |
| `extended attribute repair |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series. |
| |
| .. _xfbtree: |
| |
| In-Memory B+Trees |
| ````````````````` |
| |
| The chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that |
| checking and repairing of secondary metadata commonly requires coordination |
| between a live metadata scan of the filesystem and writer threads that are |
| updating that metadata. |
| Keeping the scan data up to date requires the ability to propagate |
| metadata updates from the filesystem into the data being collected by the scan. |
| This *can* be done by appending concurrent updates into a separate log file and |
| applying them before writing the new metadata to disk, but this leads to |
| unbounded memory consumption if the rest of the system is very busy. |
| Another option is to skip the side-log and commit live updates from the |
| filesystem directly into the scan data, which trades more overhead for a lower |
| maximum memory requirement. |
| In both cases, the data structure holding the scan results must support indexed |
| access to perform well. |
| |
| Given that indexed lookups of scan data are required for both strategies, online |
| fsck employs the second strategy of committing live updates directly into |
| scan data. |
| Because xfarrays are not indexed and do not enforce record ordering, they |
| are not suitable for this task. |
| Conveniently, however, XFS has a library to create and maintain ordered reverse |
| mapping records: the existing rmap btree code! |
| If only there was a means to create one in memory. |
| |
| Recall that the :ref:`xfile <xfile>` abstraction represents memory pages as a |
| regular file, which means that the kernel can create byte or block addressable |
| virtual address spaces at will. |
| The XFS buffer cache specializes in abstracting IO to block-oriented address |
| spaces, which means that adaptation of the buffer cache to interface with |
| xfiles enables reuse of the entire btree library. |
| Btrees built atop an xfile are collectively known as ``xfbtrees``. |
| The next few sections describe how they actually work. |
| |
| The proposed patchset is the |
| `in-memory btree |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_ |
| series. |
| |
| Using xfiles as a Buffer Cache Target |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Two modifications are necessary to support xfiles as a buffer cache target. |
| The first is to make it possible for the ``struct xfs_buftarg`` structure to |
| host the ``struct xfs_buf`` rhashtable, because normally those are held by a |
| per-AG structure. |
| The second change is to modify the buffer ``ioapply`` function to "read" cached |
| pages from the xfile and "write" cached pages back to the xfile. |
| Multiple access to individual buffers is controlled by the ``xfs_buf`` lock, |
| since the xfile does not provide any locking on its own. |
| With this adaptation in place, users of the xfile-backed buffer cache use |
| exactly the same APIs as users of the disk-backed buffer cache. |
| The separation between xfile and buffer cache implies higher memory usage since |
| they do not share pages, but this property could some day enable transactional |
| updates to an in-memory btree. |
| Today, however, it simply eliminates the need for new code. |
| |
| Space Management with an xfbtree |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Space management for an xfile is very simple -- each btree block is one memory |
| page in size. |
| These blocks use the same header format as an on-disk btree, but the in-memory |
| block verifiers ignore the checksums, assuming that xfile memory is no more |
| corruption-prone than regular DRAM. |
| Reusing existing code here is more important than absolute memory efficiency. |
| |
| The very first block of an xfile backing an xfbtree contains a header block. |
| The header describes the owner, height, and the block number of the root |
| xfbtree block. |
| |
| To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file. |
| If there are no gaps, create one by extending the length of the xfile. |
| Preallocate space for the block with ``xfile_prealloc``, and hand back the |
| location. |
| To free an xfbtree block, use ``xfile_discard`` (which internally uses |
| ``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile. |
| |
| Populating an xfbtree |
| ^^^^^^^^^^^^^^^^^^^^^ |
| |
| An online fsck function that wants to create an xfbtree should proceed as |
| follows: |
| |
| 1. Call ``xfile_create`` to create an xfile. |
| |
| 2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure |
| pointing to the xfile. |
| |
| 3. Pass the buffer cache target, buffer ops, and other information to |
| ``xfbtree_init`` to initialize the passed in ``struct xfbtree`` and write an |
| initial root block to the xfile. |
| Each btree type should define a wrapper that passes necessary arguments to |
| the creation function. |
| For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of |
| all the necessary details for callers. |
| |
| 4. Pass the xfbtree object to the btree cursor creation function for the |
| btree type. |
| Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this |
| for callers. |
| |
| 5. Pass the btree cursor to the regular btree functions to make queries against |
| and to update the in-memory btree. |
| For example, a btree cursor for an rmap xfbtree can be passed to the |
| ``xfs_rmap_*`` functions just like any other btree cursor. |
| See the :ref:`next section<xfbtree_commit>` for information on dealing with |
| xfbtree updates that are logged to a transaction. |
| |
| 6. When finished, delete the btree cursor, destroy the xfbtree object, free the |
| buffer target, and destroy the xfile to release all resources. |
| |
| .. _xfbtree_commit: |
| |
| Committing Logged xfbtree Buffers |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Although it is a clever hack to reuse the rmap btree code to handle the staging |
| structure, the ephemeral nature of the in-memory btree block storage presents |
| some challenges of its own. |
| The XFS transaction manager must not commit buffer log items for buffers backed |
| by an xfile because the log format does not understand updates for devices |
| other than the data device. |
| An ephemeral xfbtree probably will not exist by the time the AIL checkpoints |
| log transactions back into the filesystem, and certainly won't exist during |
| log recovery. |
| For these reasons, any code updating an xfbtree in transaction context must |
| remove the buffer log items from the transaction and write the updates into the |
| backing xfile before committing or cancelling the transaction. |
| |
| The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement |
| this functionality as follows: |
| |
| 1. Find each buffer log item whose buffer targets the xfile. |
| |
| 2. Record the dirty/ordered status of the log item. |
| |
| 3. Detach the log item from the buffer. |
| |
| 4. Queue the buffer to a special delwri list. |
| |
| 5. Clear the transaction dirty flag if the only dirty log items were the ones |
| that were detached in step 3. |
| |
| 6. Submit the delwri list to commit the changes to the xfile, if the updates |
| are being committed. |
| |
| After removing xfile logged buffers from the transaction in this manner, the |
| transaction can be committed or cancelled. |
| |
| Bulk Loading of Ondisk B+Trees |
| ------------------------------ |
| |
| As mentioned previously, early iterations of online repair built new btree |
| structures by creating a new btree and adding observations individually. |
| Loading a btree one record at a time had a slight advantage of not requiring |
| the incore records to be sorted prior to commit, but was very slow and leaked |
| blocks if the system went down during a repair. |
| Loading records one at a time also meant that repair could not control the |
| loading factor of the blocks in the new btree. |
| |
| Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for |
| rebuilding a btree index from a collection of records -- bulk btree loading. |
| This was implemented rather inefficiently code-wise, since ``xfs_repair`` |
| had separate copy-pasted implementations for each btree type. |
| |
To prepare for online fsck, each of the four bulk loaders was studied, notes
were taken, and the four were refactored into a single generic btree bulk
loading mechanism.
| Those notes in turn have been refreshed and are presented below. |
| |
| Geometry Computation |
| ```````````````````` |
| |
| The zeroth step of bulk loading is to assemble the entire record set that will |
| be stored in the new btree, and sort the records. |
| Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the |
| btree from the record set, the type of btree, and any load factor preferences. |
| This information is required for resource reservation. |
| |
First, the geometry computation determines the minimum and maximum number of
records that will fit in a leaf block from the size of a btree block and the
size of the block header.
| Roughly speaking, the maximum number of records is:: |
| |
| maxrecs = (block_size - header_size) / record_size |
| |
| The XFS design specifies that btree blocks should be merged when possible, |
| which means the minimum number of records is half of maxrecs:: |
| |
| minrecs = maxrecs / 2 |
| |
| The next variable to determine is the desired loading factor. |
| This must be at least minrecs and no more than maxrecs. |
| Choosing minrecs is undesirable because it wastes half the block. |
| Choosing maxrecs is also undesirable because adding a single record to each |
| newly rebuilt leaf block will cause a tree split, which causes a noticeable |
| drop in performance immediately afterwards. |
| The default loading factor was chosen to be 75% of maxrecs, which provides a |
| reasonably compact structure without any immediate split penalties:: |
| |
| default_load_factor = (maxrecs + minrecs) / 2 |
| |
| If space is tight, the loading factor will be set to maxrecs to try to avoid |
| running out of space:: |
| |
| leaf_load_factor = enough space ? default_load_factor : maxrecs |
| |
| Load factor is computed for btree node blocks using the combined size of the |
| btree key and pointer as the record size:: |
| |
| maxrecs = (block_size - header_size) / (key_size + ptr_size) |
| minrecs = maxrecs / 2 |
| node_load_factor = enough space ? default_load_factor : maxrecs |
| |
| Once that's done, the number of leaf blocks required to store the record set |
| can be computed as:: |
| |
| leaf_blocks = ceil(record_count / leaf_load_factor) |
| |
| The number of node blocks needed to point to the next level down in the tree |
| is computed as:: |
| |
| n_blocks = (n == 0 ? leaf_blocks : node_blocks[n]) |
| node_blocks[n + 1] = ceil(n_blocks / node_load_factor) |
| |
| The entire computation is performed recursively until the current level only |
| needs one block. |
| The resulting geometry is as follows: |
| |
| - For AG-rooted btrees, this level is the root level, so the height of the new |
| tree is ``level + 1`` and the space needed is the summation of the number of |
| blocks on each level. |
| |
| - For inode-rooted btrees where the records in the top level do not fit in the |
| inode fork area, the height is ``level + 2``, the space needed is the |
| summation of the number of blocks on each level, and the inode fork points to |
| the root block. |
| |
| - For inode-rooted btrees where the records in the top level can be stored in |
| the inode fork area, then the root block can be stored in the inode, the |
| height is ``level + 1``, and the space needed is one less than the summation |
| of the number of blocks on each level. |
| This only becomes relevant when non-bmap btrees gain the ability to root in |
| an inode, which is a future patchset and only included here for completeness. |
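
To make these formulas concrete, the following standalone sketch computes the
geometry of a hypothetical AG-rooted btree, assuming that space is not tight so
the default load factors apply.
The block size, header size, record size, key size, pointer size, and record
count are made-up illustrative values, not the geometry of any real XFS btree::

  /* Sketch of the bulk loader's geometry arithmetic. */
  #include <stdio.h>

  int main(void)
  {
      /* Illustrative sizes only. */
      unsigned int block_size = 4096, header_size = 56;
      unsigned int record_size = 16, key_size = 8, ptr_size = 4;
      unsigned long long record_count = 1000000;

      unsigned int leaf_maxrecs = (block_size - header_size) / record_size;
      unsigned int leaf_minrecs = leaf_maxrecs / 2;
      unsigned int leaf_load = (leaf_maxrecs + leaf_minrecs) / 2;

      unsigned int node_maxrecs = (block_size - header_size) /
                                  (key_size + ptr_size);
      unsigned int node_minrecs = node_maxrecs / 2;
      unsigned int node_load = (node_maxrecs + node_minrecs) / 2;

      /* Level 0 holds the records themselves. */
      unsigned long long blocks = (record_count + leaf_load - 1) / leaf_load;
      unsigned long long total = blocks;
      unsigned int level = 0;

      printf("level 0: %llu leaf blocks\n", blocks);

      /* Each higher level points to the blocks of the level below it. */
      while (blocks > 1) {
          blocks = (blocks + node_load - 1) / node_load;
          total += blocks;
          printf("level %u: %llu node blocks\n", ++level, blocks);
      }

      /* AG-rooted btree: the height is level + 1. */
      printf("height %u, %llu blocks in total\n", level + 1, total);
      return 0;
  }

With these numbers the sketch reports 5292 leaf blocks, 21 node blocks above
them, a single root block, and a height of 3.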
| |
| .. _newbt: |
| |
| Reserving New B+Tree Blocks |
| ``````````````````````````` |
| |
| Once repair knows the number of blocks needed for the new btree, it allocates |
| those blocks using the free space information. |
| Each reserved extent is tracked separately by the btree builder state data. |
| To improve crash resilience, the reservation code also logs an Extent Freeing |
| Intent (EFI) item in the same transaction as each space allocation and attaches |
| its in-memory ``struct xfs_extent_free_item`` object to the space reservation. |
If the system goes down, log recovery will use the unfinished EFIs to free the
unused space, leaving the filesystem unchanged.
| |
| Each time the btree builder claims a block for the btree from a reserved |
| extent, it updates the in-memory reservation to reflect the claimed space. |
| Block reservation tries to allocate as much contiguous space as possible to |
| reduce the number of EFIs in play. |
| |
| While repair is writing these new btree blocks, the EFIs created for the space |
| reservations pin the tail of the ondisk log. |
| It's possible that other parts of the system will remain busy and push the head |
| of the log towards the pinned tail. |
| To avoid livelocking the filesystem, the EFIs must not pin the tail of the log |
| for too long. |
To alleviate this problem, the dynamic relogging capability of the deferred ops
mechanism is reused here to commit a transaction at the log head containing an
EFD for the old EFI and a new EFI.
This enables the log to release the old EFI and keep the log moving forwards.
| |
| EFIs have a role to play during the commit and reaping phases; please see the |
| next section and the section about :ref:`reaping<reaping>` for more details. |
| |
| Proposed patchsets are the |
| `bitmap rework |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_ |
| and the |
| `preparation for bulk loading btrees |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_. |
| |
| |
| Writing the New Tree |
| ```````````````````` |
| |
| This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims |
| a block from the reserved list, writes the new btree block header, fills the |
| rest of the block with records, and adds the new leaf block to a list of |
| written blocks:: |
| |
| ┌────┐ |
| │leaf│ |
| │RRR │ |
| └────┘ |
| |
| Sibling pointers are set every time a new block is added to the level:: |
| |
| ┌────┐ ┌────┐ ┌────┐ ┌────┐ |
| │leaf│→│leaf│→│leaf│→│leaf│ |
| │RRR │←│RRR │←│RRR │←│RRR │ |
| └────┘ └────┘ └────┘ └────┘ |
| |
When it finishes writing the record leaf blocks, it moves on to the node
blocks.
| To fill a node block, it walks each block in the next level down in the tree |
| to compute the relevant keys and write them into the parent node:: |
| |
| ┌────┐ ┌────┐ |
| │node│──────→│node│ |
| │PP │←──────│PP │ |
| └────┘ └────┘ |
| ↙ ↘ ↙ ↘ |
| ┌────┐ ┌────┐ ┌────┐ ┌────┐ |
| │leaf│→│leaf│→│leaf│→│leaf│ |
| │RRR │←│RRR │←│RRR │←│RRR │ |
| └────┘ └────┘ └────┘ └────┘ |
| |
| When it reaches the root level, it is ready to commit the new btree!:: |
| |
| ┌─────────┐ |
| │ root │ |
| │ PP │ |
| └─────────┘ |
| ↙ ↘ |
| ┌────┐ ┌────┐ |
| │node│──────→│node│ |
| │PP │←──────│PP │ |
| └────┘ └────┘ |
| ↙ ↘ ↙ ↘ |
| ┌────┐ ┌────┐ ┌────┐ ┌────┐ |
| │leaf│→│leaf│→│leaf│→│leaf│ |
| │RRR │←│RRR │←│RRR │←│RRR │ |
| └────┘ └────┘ └────┘ └────┘ |
| |
| The first step to commit the new btree is to persist the btree blocks to disk |
| synchronously. |
This is a little complicated because a new btree block could have been freed
in the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to
remove the (stale) buffer from the AIL before it can write the new blocks to
disk.
| Blocks are queued for IO using a delwri list and written in one large batch |
| with ``xfs_buf_delwri_submit``. |
| |
| Once the new blocks have been persisted to disk, control returns to the |
| individual repair function that called the bulk loader. |
| The repair function must log the location of the new root in a transaction, |
| clean up the space reservations that were made for the new btree, and reap the |
| old metadata blocks: |
| |
| 1. Commit the location of the new btree root. |
| |
| 2. For each incore reservation: |
| |
| a. Log Extent Freeing Done (EFD) items for all the space that was consumed |
| by the btree builder. The new EFDs must point to the EFIs attached to |
| the reservation to prevent log recovery from freeing the new blocks. |
| |
b. For unclaimed portions of incore reservations, create a regular deferred
extent free work item to free the unused space later in the transaction
chain.
| |
| c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the |
| reservation of the committing transaction. |
| If the btree loading code suspects this might be about to happen, it must |
| call ``xrep_defer_finish`` to clear out the deferred work and obtain a |
| fresh transaction. |
| |
| 3. Clear out the deferred work a second time to finish the commit and clean |
| the repair transaction. |
| |
The transaction rolling in steps 2c and 3 represents a weakness in the repair
algorithm, because a log flush followed by a crash before the end of the reap
step can result in space leaking.
Online repair functions minimize the chances of this occurring by using very
large transactions, each of which can accommodate many thousands of block
freeing instructions.
| Repair moves on to reaping the old blocks, which will be presented in a |
| subsequent :ref:`section<reaping>` after a few case studies of bulk loading. |
| |
| Case Study: Rebuilding the Inode Index |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| The high level process to rebuild the inode index btree is: |
| |
| 1. Walk the reverse mapping records to generate ``struct xfs_inobt_rec`` |
| records from the inode chunk information and a bitmap of the old inode btree |
| blocks. |
| |
| 2. Append the records to an xfarray in inode order. |
| |
| 3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number |
| of blocks needed for the inode btree. |
| If the free space inode btree is enabled, call it again to estimate the |
| geometry of the finobt. |
| |
| 4. Allocate the number of blocks computed in the previous step. |
| |
| 5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and |
| generate the internal node blocks. |
| If the free space inode btree is enabled, call it again to load the finobt. |
| |
| 6. Commit the location of the new btree root block(s) to the AGI. |
| |
| 7. Reap the old btree blocks using the bitmap created in step 1. |
| |
| Details are as follows. |
| |
| The inode btree maps inumbers to the ondisk location of the associated |
| inode records, which means that the inode btrees can be rebuilt from the |
| reverse mapping information. |
Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` mark the
location of the old inode btree blocks.
| Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the |
| location of at least one inode cluster buffer. |
| A cluster is the smallest number of ondisk inodes that can be allocated or |
| freed in a single transaction; it is never smaller than 1 fs block or 4 inodes. |
| |
| For the space represented by each inode cluster, ensure that there are no |
| records in the free space btrees nor any records in the reference count btree. |
| If there are, the space metadata inconsistencies are reason enough to abort the |
| operation. |
| Otherwise, read each cluster buffer to check that its contents appear to be |
| ondisk inodes and to decide if the file is allocated |
| (``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``). |
| Accumulate the results of successive inode cluster buffer reads until there is |
| enough information to fill a single inode chunk record, which is 64 consecutive |
| numbers in the inumber keyspace. |
| If the chunk is sparse, the chunk record may include holes. |
| |
| Once the repair function accumulates one chunk's worth of data, it calls |
| ``xfarray_append`` to add the inode btree record to the xfarray. |
| This xfarray is walked twice during the btree creation step -- once to populate |
| the inode btree with all inode chunk records, and a second time to populate the |
| free inode btree with records for chunks that have free non-sparse inodes. |
| The number of records for the inode btree is the number of xfarray records, |
| but the record count for the free inode btree has to be computed as inode chunk |
| records are stored in the xfarray. |
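
The sketch below illustrates how the accumulation might look for a single
64-inode chunk, assuming a hypothetical sparse chunk in which only the first 16
slots are backed by ondisk inodes.
The real incore and ondisk inobt records encode hole and free information
differently; this only demonstrates gathering the free/hole state described
above::

  /* Sketch: accumulate free/hole state for one 64-inode chunk. */
  #include <stdio.h>
  #include <stdint.h>

  #define INODES_PER_CHUNK    64
  #define INODES_PRESENT      16    /* hypothetical sparse chunk */

  /* i_mode values observed while reading the cluster buffers; zero
   * means the ondisk inode is free. */
  static const uint16_t modes[INODES_PRESENT] = {
      0100644, 0, 0040755, 0, 0100600, 0, 0, 0,
      0120777, 0, 0, 0, 0100644, 0, 0, 0,
  };

  int main(void)
  {
      uint64_t freemask = 0, holemask = 0;
      unsigned int i, freecount = 0;

      for (i = 0; i < INODES_PER_CHUNK; i++) {
          if (i >= INODES_PRESENT) {
              /* No ondisk inode backs this slot. */
              holemask |= 1ULL << i;
          } else if (modes[i] == 0) {
              freemask |= 1ULL << i;
              freecount++;
          }
      }

      printf("chunk: freecount %u free 0x%016llx holes 0x%016llx\n",
             freecount, (unsigned long long)freemask,
             (unsigned long long)holemask);
      return 0;
  }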
| |
| The proposed patchset is the |
| `AG btree repair |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_ |
| series. |
| |
| Case Study: Rebuilding the Space Reference Counts |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Reverse mapping records are used to rebuild the reference count information. |
| Reference counts are required for correct operation of copy on write for shared |
| file data. |
| Imagine the reverse mapping entries as rectangles representing extents of |
| physical blocks, and that the rectangles can be laid down to allow them to |
| overlap each other. |
| From the diagram below, it is apparent that a reference count record must start |
| or end wherever the height of the stack changes. |
| In other words, the record emission stimulus is level-triggered:: |
| |
| █ ███ |
| ██ █████ ████ ███ ██████ |
| ██ ████ ███████████ ████ █████████ |
| ████████████████████████████████ ███████████ |
| ^ ^ ^^ ^^ ^ ^^ ^^^ ^^^^ ^ ^^ ^ ^ ^ |
| 2 1 23 21 3 43 234 2123 1 01 2 3 0 |
| |
| The ondisk reference count btree does not store the refcount == 0 cases because |
| the free space btree already records which blocks are free. |
| Extents being used to stage copy-on-write operations should be the only records |
| with refcount == 1. |
| Single-owner file blocks aren't recorded in either the free space or the |
| reference count btrees. |
| |
| The high level process to rebuild the reference count btree is: |
| |
| 1. Walk the reverse mapping records to generate ``struct xfs_refcount_irec`` |
| records for any space having more than one reverse mapping and add them to |
| the xfarray. |
| Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray |
| because these are extents allocated to stage a copy on write operation and |
| are tracked in the refcount btree. |
| |
| Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old |
| refcount btree blocks. |
| |
| 2. Sort the records in physical extent order, putting the CoW staging extents |
| at the end of the xfarray. |
| This matches the sorting order of records in the refcount btree. |
| |
| 3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number |
| of blocks needed for the new tree. |
| |
| 4. Allocate the number of blocks computed in the previous step. |
| |
| 5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and |
| generate the internal node blocks. |
| |
| 6. Commit the location of new btree root block to the AGF. |
| |
| 7. Reap the old btree blocks using the bitmap created in step 1. |
| |
| Details are as follows; the same algorithm is used by ``xfs_repair`` to |
| generate refcount information from reverse mapping records. |
| |
| - Until the reverse mapping btree runs out of records: |
| |
| - Retrieve the next record from the btree and put it in a bag. |
| |
| - Collect all records with the same starting block from the btree and put |
| them in the bag. |
| |
| - While the bag isn't empty: |
| |
| - Among the mappings in the bag, compute the lowest block number where the |
| reference count changes. |
| This position will be either the starting block number of the next |
| unprocessed reverse mapping or the next block after the shortest mapping |
| in the bag. |
| |
| - Remove all mappings from the bag that end at this position. |
| |
| - Collect all reverse mappings that start at this position from the btree |
| and put them in the bag. |
| |
| - If the size of the bag changed and is greater than one, create a new |
| refcount record associating the block number range that we just walked to |
| the size of the bag. |
| |
| The bag-like structure in this case is a type 2 xfarray as discussed in the |
| :ref:`xfarray access patterns<xfarray_access_patterns>` section. |
| Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and |
| removed via ``xfarray_unset``. |
| Bag members are examined through ``xfarray_iter`` loops. |
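
To illustrate the level-triggered emission algorithm, here is a standalone
sketch that computes refcount records from a sorted array of reverse mappings.
A plain array stands in for the xfarray bag, the example mappings are made up,
and CoW staging extents are ignored::

  /* Sketch: generate refcount records from sorted reverse mappings. */
  #include <stdio.h>

  struct rmap { unsigned long long start, len; };

  /* Hypothetical reverse mappings, sorted by starting block. */
  static struct rmap rmaps[] = {
      { 10, 10 }, { 12, 4 }, { 14, 8 }, { 30, 5 },
  };
  #define NR_RMAPS (sizeof(rmaps) / sizeof(rmaps[0]))

  int main(void)
  {
      struct rmap bag[NR_RMAPS];    /* stands in for the xfarray bag */
      unsigned int nbag = 0, next = 0;

      while (next < NR_RMAPS) {
          unsigned long long pos = rmaps[next].start;
          unsigned long long rec_start = pos;

          /* Pull in every mapping sharing this starting block. */
          while (next < NR_RMAPS && rmaps[next].start == pos)
              bag[nbag++] = rmaps[next++];

          while (nbag > 0) {
              unsigned long long change;
              unsigned int i, old_size = nbag;

              /* Lowest block where the refcount changes: the next
               * unprocessed mapping or the end of the shortest
               * mapping in the bag. */
              change = bag[0].start + bag[0].len;
              for (i = 1; i < nbag; i++)
                  if (bag[i].start + bag[i].len < change)
                      change = bag[i].start + bag[i].len;
              if (next < NR_RMAPS && rmaps[next].start < change)
                  change = rmaps[next].start;

              /* Remove mappings that end at this position. */
              for (i = 0; i < nbag; )
                  if (bag[i].start + bag[i].len == change)
                      bag[i] = bag[--nbag];
                  else
                      i++;

              /* Add mappings that start at this position. */
              while (next < NR_RMAPS && rmaps[next].start == change)
                  bag[nbag++] = rmaps[next++];

              /* Emit a record when the stack height changes. */
              if (nbag != old_size) {
                  if (old_size > 1)
                      printf("refcount: start %llu len %llu count %u\n",
                             rec_start, change - rec_start, old_size);
                  rec_start = change;
              }
          }
      }
      return 0;
  }

Running the sketch against the mappings above emits records for blocks 12-13
(count 2), 14-15 (count 3), and 16-19 (count 2); the singly-owned ranges are
not recorded, just as in the ondisk refcount btree.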
| |
| The proposed patchset is the |
| `AG btree repair |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_ |
| series. |
| |
| Case Study: Rebuilding File Fork Mapping Indices |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| The high level process to rebuild a data/attr fork mapping btree is: |
| |
| 1. Walk the reverse mapping records to generate ``struct xfs_bmbt_rec`` |
| records from the reverse mapping records for that inode and fork. |
| Append these records to an xfarray. |
| Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK`` |
| records. |
| |
| 2. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number |
| of blocks needed for the new tree. |
| |
| 3. Sort the records in file offset order. |
| |
| 4. If the extent records would fit in the inode fork immediate area, commit the |
| records to that immediate area and skip to step 8. |
| |
5. Allocate the number of blocks computed in step 2.
| |
| 6. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and |
| generate the internal node blocks. |
| |
| 7. Commit the new btree root block to the inode fork immediate area. |
| |
| 8. Reap the old btree blocks using the bitmap created in step 1. |
| |
| There are some complications here: |
| First, it's possible to move the fork offset to adjust the sizes of the |
| immediate areas if the data and attr forks are not both in BMBT format. |
| Second, if there are sufficiently few fork mappings, it may be possible to use |
| EXTENTS format instead of BMBT, which may require a conversion. |
| Third, the incore extent map must be reloaded carefully to avoid disturbing |
| any delayed allocation extents. |
| |
| The proposed patchset is the |
| `file mapping repair |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_ |
| series. |
| |
| .. _reaping: |
| |
| Reaping Old Metadata Blocks |
| --------------------------- |
| |
| Whenever online fsck builds a new data structure to replace one that is |
| suspect, there is a question of how to find and dispose of the blocks that |
| belonged to the old structure. |
| The laziest method of course is not to deal with them at all, but this slowly |
| leads to service degradations as space leaks out of the filesystem. |
| Hopefully, someone will schedule a rebuild of the free space information to |
| plug all those leaks. |
| Offline repair rebuilds all space metadata after recording the usage of |
| the files and directories that it decides not to clear, hence it can build new |
| structures in the discovered free space and avoid the question of reaping. |
| |
| As part of a repair, online fsck relies heavily on the reverse mapping records |
| to find space that is owned by the corresponding rmap owner yet truly free. |
| Cross referencing rmap records with other rmap records is necessary because |
| there may be other data structures that also think they own some of those |
| blocks (e.g. crosslinked trees). |
| Permitting the block allocator to hand them out again will not push the system |
| towards consistency. |
| |
| For space metadata, the process of finding extents to dispose of generally |
| follows this format: |
| |
| 1. Create a bitmap of space used by data structures that must be preserved. |
| The space reservations used to create the new metadata can be used here if |
| the same rmap owner code is used to denote all of the objects being rebuilt. |
| |
| 2. Survey the reverse mapping data to create a bitmap of space owned by the |
| same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved. |
| |
| 3. Use the bitmap disunion operator to subtract (1) from (2). |
| The remaining set bits represent candidate extents that could be freed. |
| The process moves on to step 4 below. |
| |
| Repairs for file-based metadata such as extended attributes, directories, |
| symbolic links, quota files and realtime bitmaps are performed by building a |
| new structure attached to a temporary file and exchanging all mappings in the |
| file forks. |
| Afterward, the mappings in the old file fork are the candidate blocks for |
| disposal. |
| |
| The process for disposing of old extents is as follows: |
| |
| 4. For each candidate extent, count the number of reverse mapping records for |
| the first block in that extent that do not have the same rmap owner for the |
| data structure being repaired. |
| |
| - If zero, the block has a single owner and can be freed. |
| |
| - If not, the block is part of a crosslinked structure and must not be |
| freed. |
| |
| 5. Starting with the next block in the extent, figure out how many more blocks |
| have the same zero/nonzero other owner status as that first block. |
| |
| 6. If the region is crosslinked, delete the reverse mapping entry for the |
| structure being repaired and move on to the next region. |
| |
| 7. If the region is to be freed, mark any corresponding buffers in the buffer |
| cache as stale to prevent log writeback. |
| |
| 8. Free the region and move on. |
| |
| However, there is one complication to this procedure. |
| Transactions are of finite size, so the reaping process must be careful to roll |
| the transactions to avoid overruns. |
| Overruns come from two sources: |
| |
| a. EFIs logged on behalf of space that is no longer occupied |
| |
| b. Log items for buffer invalidations |
| |
| This is also a window in which a crash during the reaping process can leak |
| blocks. |
| As stated earlier, online repair functions use very large transactions to |
| minimize the chances of this occurring. |
| |
| The proposed patchset is the |
| `preparation for bulk loading btrees |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_ |
| series. |
| |
| Case Study: Reaping After a Regular Btree Repair |
| ```````````````````````````````````````````````` |
| |
| Old reference count and inode btrees are the easiest to reap because they have |
| rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the refcount |
| btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees. |
| Creating a list of extents to reap the old btree blocks is quite simple, |
| conceptually: |
| |
| 1. Lock the relevant AGI/AGF header buffers to prevent allocation and frees. |
| |
| 2. For each reverse mapping record with an rmap owner corresponding to the |
| metadata structure being rebuilt, set the corresponding range in a bitmap. |
| |
| 3. Walk the current data structures that have the same rmap owner. |
| For each block visited, clear that range in the above bitmap. |
| |
| 4. Each set bit in the bitmap represents a block that could be a block from the |
| old data structures and hence is a candidate for reaping. |
| In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)`` |
| are the blocks that might be freeable. |
| |
| If it is possible to maintain the AGF lock throughout the repair (which is the |
| common case), then step 2 can be performed at the same time as the reverse |
| mapping record walk that creates the records for the new btree. |
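
As a toy illustration of the disunion in step 4, pretend that the AG is only 64
blocks long so that a single integer can stand in for the extent bitmaps that
the real code uses; the block numbers are made up::

  /* Toy sketch of the reap candidate computation. */
  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
      /* Blocks with rmap records owned by the rebuilt structure
       * (step 2): blocks 1, 2, 4, 5, 6, and 7. */
      uint64_t rmap_records_owned_by = 0xF6;

      /* Blocks reachable by walking the new structure (step 3):
       * blocks 1, 2, 4, and 5. */
      uint64_t blocks_reachable_by_walk = 0x36;

      /* Disunion: blocks 6 and 7 must have belonged to the old
       * structure and are candidates for reaping. */
      uint64_t candidates = rmap_records_owned_by &
                            ~blocks_reachable_by_walk;

      printf("reap candidates: 0x%llx\n", (unsigned long long)candidates);
      return 0;
  }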
| |
| Case Study: Rebuilding the Free Space Indices |
| ````````````````````````````````````````````` |
| |
| The high level process to rebuild the free space indices is: |
| |
| 1. Walk the reverse mapping records to generate ``struct xfs_alloc_rec_incore`` |
| records from the gaps in the reverse mapping btree. |
| |
| 2. Append the records to an xfarray. |
| |
| 3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number |
| of blocks needed for each new tree. |
| |
| 4. Allocate the number of blocks computed in the previous step from the free |
| space information collected. |
| |
| 5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and |
| generate the internal node blocks for the free space by length index. |
| Call it again for the free space by block number index. |
| |
| 6. Commit the locations of the new btree root blocks to the AGF. |
| |
| 7. Reap the old btree blocks by looking for space that is not recorded by the |
| reverse mapping btree, the new free space btrees, or the AGFL. |
| |
| Repairing the free space btrees has three key complications over a regular |
| btree repair: |
| |
| First, free space is not explicitly tracked in the reverse mapping records. |
| Hence, the new free space records must be inferred from gaps in the physical |
| space component of the keyspace of the reverse mapping btree. |
| |
| Second, free space repairs cannot use the common btree reservation code because |
| new blocks are reserved out of the free space btrees. |
| This is impossible when repairing the free space btrees themselves. |
| However, repair holds the AGF buffer lock for the duration of the free space |
| index reconstruction, so it can use the collected free space information to |
| supply the blocks for the new free space btrees. |
| It is not necessary to back each reserved extent with an EFI because the new |
| free space btrees are constructed in what the ondisk filesystem thinks is |
| unowned space. |
| However, if reserving blocks for the new btrees from the collected free space |
| information changes the number of free space records, repair must re-estimate |
| the new free space btree geometry with the new record count until the |
| reservation is sufficient. |
| As part of committing the new btrees, repair must ensure that reverse mappings |
| are created for the reserved blocks and that unused reserved blocks are |
| inserted into the free space btrees. |
Deferred rmap and freeing operations are used to ensure that this transition
| is atomic, similar to the other btree repair functions. |
| |
| Third, finding the blocks to reap after the repair is not overly |
| straightforward. |
| Blocks for the free space btrees and the reverse mapping btrees are supplied by |
| the AGFL. |
| Blocks put onto the AGFL have reverse mapping records with the owner |
| ``XFS_RMAP_OWN_AG``. |
| This ownership is retained when blocks move from the AGFL into the free space |
| btrees or the reverse mapping btrees. |
| When repair walks reverse mapping records to synthesize free space records, it |
| creates a bitmap (``ag_owner_bitmap``) of all the space claimed by |
| ``XFS_RMAP_OWN_AG`` records. |
| The repair context maintains a second bitmap corresponding to the rmap btree |
| blocks and the AGFL blocks (``rmap_agfl_bitmap``). |
| When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap & |
| ~rmap_agfl_bitmap)`` computes the extents that are used by the old free space |
| btrees. |
| These blocks can then be reaped using the methods outlined above. |
| |
| The proposed patchset is the |
| `AG btree repair |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_ |
| series. |
| |
| .. _rmap_reap: |
| |
| Case Study: Reaping After Repairing Reverse Mapping Btrees |
| `````````````````````````````````````````````````````````` |
| |
| Old reverse mapping btrees are less difficult to reap after a repair. |
| As mentioned in the previous section, blocks on the AGFL, the two free space |
| btree blocks, and the reverse mapping btree blocks all have reverse mapping |
| records with ``XFS_RMAP_OWN_AG`` as the owner. |
| The full process of gathering reverse mapping records and building a new btree |
| are described in the case study of |
| :ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point from that |
| discussion is that the new rmap btree will not contain any records for the old |
| rmap btree, nor will the old btree blocks be tracked in the free space btrees. |
| The list of candidate reaping blocks is computed by setting the bits |
| corresponding to the gaps in the new rmap btree records, and then clearing the |
| bits corresponding to extents in the free space btrees and the current AGFL |
| blocks. |
| The result ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` are reaped using the |
| methods outlined above. |
| |
The rest of the process of rebuilding the reverse mapping btree is discussed
| in a separate :ref:`case study<rmap_repair>`. |
| |
| The proposed patchset is the |
| `AG btree repair |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_ |
| series. |
| |
| Case Study: Rebuilding the AGFL |
| ``````````````````````````````` |
| |
| The allocation group free block list (AGFL) is repaired as follows: |
| |
| 1. Create a bitmap for all the space that the reverse mapping data claims is |
| owned by ``XFS_RMAP_OWN_AG``. |
| |
| 2. Subtract the space used by the two free space btrees and the rmap btree. |
| |
| 3. Subtract any space that the reverse mapping data claims is owned by any |
| other owner, to avoid re-adding crosslinked blocks to the AGFL. |
| |
4. Once the AGFL is full, reap any leftover blocks.
| |
| 5. The next operation to fix the freelist will right-size the list. |
| |
| See `fs/xfs/scrub/agheader_repair.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.c>`_ for more details. |
| |
| Inode Record Repairs |
| -------------------- |
| |
| Inode records must be handled carefully, because they have both ondisk records |
| ("dinodes") and an in-memory ("cached") representation. |
| There is a very high potential for cache coherency issues if online fsck is not |
| careful to access the ondisk metadata *only* when the ondisk metadata is so |
| badly damaged that the filesystem cannot load the in-memory representation. |
| When online fsck wants to open a damaged file for scrubbing, it must use |
| specialized resource acquisition functions that return either the in-memory |
| representation *or* a lock on whichever object is necessary to prevent any |
| update to the ondisk location. |
| |
| The only repairs that should be made to the ondisk inode buffers are whatever |
| is necessary to get the in-core structure loaded. |
| This means fixing whatever is caught by the inode cluster buffer and inode fork |
| verifiers, and retrying the ``iget`` operation. |
| If the second ``iget`` fails, the repair has failed. |
| |
| Once the in-memory representation is loaded, repair can lock the inode and can |
| subject it to comprehensive checks, repairs, and optimizations. |
| Most inode attributes are easy to check and constrain, or are user-controlled |
| arbitrary bit patterns; these are both easy to fix. |
| Dealing with the data and attr fork extent counts and the file block counts is |
| more complicated, because computing the correct value requires traversing the |
| forks, or if that fails, leaving the fields invalid and waiting for the fork |
| fsck functions to run. |
| |
| The proposed patchset is the |
| `inode |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_ |
| repair series. |
| |
| Quota Record Repairs |
| -------------------- |
| |
| Similar to inodes, quota records ("dquots") also have both ondisk records and |
| an in-memory representation, and hence are subject to the same cache coherency |
| issues. |
| Somewhat confusingly, both are known as dquots in the XFS codebase. |
| |
| The only repairs that should be made to the ondisk quota record buffers are |
| whatever is necessary to get the in-core structure loaded. |
| Once the in-memory representation is loaded, the only attributes needing |
| checking are obviously bad limits and timer values. |
| |
| Quota usage counters are checked, repaired, and discussed separately in the |
| section about :ref:`live quotacheck <quotacheck>`. |
| |
| The proposed patchset is the |
| `quota |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_ |
| repair series. |
| |
| .. _fscounters: |
| |
| Freezing to Fix Summary Counters |
| -------------------------------- |
| |
| Filesystem summary counters track availability of filesystem resources such |
| as free blocks, free inodes, and allocated inodes. |
| This information could be compiled by walking the free space and inode indexes, |
| but this is a slow process, so XFS maintains a copy in the ondisk superblock |
| that should reflect the ondisk metadata, at least when the filesystem has been |
| unmounted cleanly. |
| For performance reasons, XFS also maintains incore copies of those counters, |
| which are key to enabling resource reservations for active transactions. |
| Writer threads reserve the worst-case quantities of resources from the |
| incore counter and give back whatever they don't use at commit time. |
| It is therefore only necessary to serialize on the superblock when the |
| superblock is being committed to disk. |
| |
| The lazy superblock counter feature introduced in XFS v5 took this even further |
| by training log recovery to recompute the summary counters from the AG headers, |
| which eliminated the need for most transactions even to touch the superblock. |
| The only time XFS commits the summary counters is at filesystem unmount. |
| To reduce contention even further, the incore counter is implemented as a |
| percpu counter, which means that each CPU is allocated a batch of blocks from a |
| global incore counter and can satisfy small allocations from the local batch. |
| |
| The high-performance nature of the summary counters makes it difficult for |
| online fsck to check them, since there is no way to quiesce a percpu counter |
| while the system is running. |
| Although online fsck can read the filesystem metadata to compute the correct |
| values of the summary counters, there's no way to hold the value of a percpu |
| counter stable, so it's quite possible that the counter will be out of date by |
| the time the walk is complete. |
| Earlier versions of online scrub would return to userspace with an incomplete |
| scan flag, but this is not a satisfying outcome for a system administrator. |
| For repairs, the in-memory counters must be stabilized while walking the |
| filesystem metadata to get an accurate reading and install it in the percpu |
| counter. |
| |
| To satisfy this requirement, online fsck must prevent other programs in the |
| system from initiating new writes to the filesystem, it must disable background |
| garbage collection threads, and it must wait for existing writer programs to |
| exit the kernel. |
| Once that has been established, scrub can walk the AG free space indexes, the |
| inode btrees, and the realtime bitmap to compute the correct value of all |
| four summary counters. |
| This is very similar to a filesystem freeze, though not all of the pieces are |
| necessary: |
| |
| - The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to |
| prevent other threads from thawing the filesystem, or other scrub threads |
| from initiating another fscounters freeze. |
| |
| - It does not quiesce the log. |
| |
| With this code in place, it is now possible to pause the filesystem for just |
| long enough to check and correct the summary counters. |
| |
| +--------------------------------------------------------------------------+ |
| | **Historical Sidebar**: | |
| +--------------------------------------------------------------------------+ |
| | The initial implementation used the actual VFS filesystem freeze | |
| | mechanism to quiesce filesystem activity. | |
| | With the filesystem frozen, it is possible to resolve the counter values | |
| | with exact precision, but there are many problems with calling the VFS | |
| | methods directly: | |
| | | |
| | - Other programs can unfreeze the filesystem without our knowledge. | |
| | This leads to incorrect scan results and incorrect repairs. | |
| | | |
| | - Adding an extra lock to prevent others from thawing the filesystem | |
| | required the addition of a ``->freeze_super`` function to wrap | |
| | ``freeze_fs()``. | |
| | This in turn caused other subtle problems because it turns out that | |
| | the VFS ``freeze_super`` and ``thaw_super`` functions can drop the | |
| | last reference to the VFS superblock, and any subsequent access | |
| | becomes a UAF bug! | |
| | This can happen if the filesystem is unmounted while the underlying | |
| | block device has frozen the filesystem. | |
| | This problem could be solved by grabbing extra references to the | |
| | superblock, but it felt suboptimal given the other inadequacies of | |
| | this approach. | |
| | | |
| | - The log need not be quiesced to check the summary counters, but a VFS | |
| | freeze initiates one anyway. | |
| | This adds unnecessary runtime to live fscounter fsck operations. | |
| | | |
| | - Quiescing the log means that XFS flushes the (possibly incorrect) | |
| | counters to disk as part of cleaning the log. | |
| | | |
| | - A bug in the VFS meant that freeze could complete even when | |
| | sync_filesystem fails to flush the filesystem and returns an error. | |
| | This bug was fixed in Linux 5.17. | |
| +--------------------------------------------------------------------------+ |
| |
| The proposed patchset is the |
| `summary counter cleanup |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_ |
| series. |
| |
| Full Filesystem Scans |
| --------------------- |
| |
| Certain types of metadata can only be checked by walking every file in the |
| entire filesystem to record observations and comparing the observations against |
| what's recorded on disk. |
| Like every other type of online repair, repairs are made by writing those |
| observations to disk in a replacement structure and committing it atomically. |
| However, it is not practical to shut down the entire filesystem to examine |
| hundreds of billions of files because the downtime would be excessive. |
| Therefore, online fsck must build the infrastructure to manage a live scan of |
| all the files in the filesystem. |
| There are two questions that need to be solved to perform a live walk: |
| |
| - How does scrub manage the scan while it is collecting data? |
| |
| - How does the scan keep abreast of changes being made to the system by other |
| threads? |
| |
| .. _iscan: |
| |
| Coordinated Inode Scans |
| ``````````````````````` |
| |
| In the original Unix filesystems of the 1970s, each directory entry contained |
an index number (*inumber*) which was used as an index into an ondisk array
| (*itable*) of fixed-size records (*inodes*) describing a file's attributes and |
| its data block mapping. |
| This system is described by J. Lions, `"inode (5659)" |
| <http://www.lemis.com/grog/Documentation/Lions/>`_ in *Lions' Commentary on |
| UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South |
| Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson, |
| `"Implementation of the File System" |
| <https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX |
| Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp. |
| 1913-4. |
| |
XFS retains most of this design, except now inumbers are search keys over all
the space in the filesystem's data section.
| They form a continuous keyspace that can be expressed as a 64-bit integer, |
| though the inodes themselves are sparsely distributed within the keyspace. |
| Scans proceed in a linear fashion across the inumber keyspace, starting from |
| ``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``. |
| Naturally, a scan through a keyspace requires a scan cursor object to track the |
| scan progress. |
| Because this keyspace is sparse, this cursor contains two parts. |
| The first part of this scan cursor object tracks the inode that will be |
| examined next; call this the examination cursor. |
| Somewhat less obviously, the scan cursor object must also track which parts of |
| the keyspace have already been visited, which is critical for deciding if a |
| concurrent filesystem update needs to be incorporated into the scan data. |
| Call this the visited inode cursor. |
| |
| Advancing the scan cursor is a multi-step process encapsulated in |
| ``xchk_iscan_iter``: |
| |
| 1. Lock the AGI buffer of the AG containing the inode pointed to by the visited |
| inode cursor. |
This guarantees that inodes in this AG cannot be allocated or freed while
| advancing the cursor. |
| |
| 2. Use the per-AG inode btree to look up the next inumber after the one that |
| was just visited, since it may not be keyspace adjacent. |
| |
| 3. If there are no more inodes left in this AG: |
| |
| a. Move the examination cursor to the point of the inumber keyspace that |
| corresponds to the start of the next AG. |
| |
| b. Adjust the visited inode cursor to indicate that it has "visited" the |
| last possible inode in the current AG's inode keyspace. |
| XFS inumbers are segmented, so the cursor needs to be marked as having |
| visited the entire keyspace up to just before the start of the next AG's |
| inode keyspace. |
| |
| c. Unlock the AGI and return to step 1 if there are unexamined AGs in the |
| filesystem. |
| |
| d. If there are no more AGs to examine, set both cursors to the end of the |
| inumber keyspace. |
| The scan is now complete. |
| |
| 4. Otherwise, there is at least one more inode to scan in this AG: |
| |
| a. Move the examination cursor ahead to the next inode marked as allocated |
| by the inode btree. |
| |
| b. Adjust the visited inode cursor to point to the inode just prior to where |
| the examination cursor is now. |
Because the scanner holds the AGI buffer lock, no inodes could have been
created in the part of the inode keyspace that the visited inode cursor
just advanced through.
| |
| 5. Get the incore inode for the inumber of the examination cursor. |
| By maintaining the AGI buffer lock until this point, the scanner knows that |
| it was safe to advance the examination cursor across the entire keyspace, |
| and that it has stabilized this next inode so that it cannot disappear from |
| the filesystem until the scan releases the incore inode. |
| |
| 6. Drop the AGI lock and return the incore inode to the caller. |
| |
| Online fsck functions scan all files in the filesystem as follows: |
| |
| 1. Start a scan by calling ``xchk_iscan_start``. |
| |
| 2. Advance the scan cursor (``xchk_iscan_iter``) to get the next inode. |
| If one is provided: |
| |
| a. Lock the inode to prevent updates during the scan. |
| |
| b. Scan the inode. |
| |
| c. While still holding the inode lock, adjust the visited inode cursor |
| (``xchk_iscan_mark_visited``) to point to this inode. |
| |
| d. Unlock and release the inode. |
| |
3. Call ``xchk_iscan_teardown`` to complete the scan.
| |
| There are subtleties with the inode cache that complicate grabbing the incore |
| inode for the caller. |
First, it is an absolute requirement that the inode metadata be consistent
enough to load it into the inode cache.
| Second, if the incore inode is stuck in some intermediate state, the scan |
| coordinator must release the AGI and push the main filesystem to get the inode |
| back into a loadable state. |
| |
| The proposed patches are the |
| `inode scanner |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_ |
| series. |
| The first user of the new functionality is the |
| `online quotacheck |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_ |
| series. |
| |
| Inode Management |
| ```````````````` |
| |
| In regular filesystem code, references to allocated XFS incore inodes are |
| always obtained (``xfs_iget``) outside of transaction context because the |
| creation of the incore context for an existing file does not require metadata |
| updates. |
| However, it is important to note that references to incore inodes obtained as |
| part of file creation must be performed in transaction context because the |
| filesystem must ensure the atomicity of the ondisk inode btree index updates |
| and the initialization of the actual ondisk inode. |
| |
| References to incore inodes are always released (``xfs_irele``) outside of |
| transaction context because there are a handful of activities that might |
| require ondisk updates: |
| |
| - The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode |
| release. |
| |
| - Speculative preallocations need to be unreserved. |
| |
| - An unlinked file may have lost its last reference, in which case the entire |
| file must be inactivated, which involves releasing all of its resources in |
| the ondisk metadata and freeing the inode. |
| |
| These activities are collectively called inode inactivation. |
| Inactivation has two parts -- the VFS part, which initiates writeback on all |
| dirty file pages, and the XFS part, which cleans up XFS-specific information |
| and frees the inode if it was unlinked. |
| If the inode is unlinked (or unconnected after a file handle operation), the |
| kernel drops the inode into the inactivation machinery immediately. |
| |
| During normal operation, resource acquisition for an update follows this order |
| to avoid deadlocks: |
| |
| 1. Inode reference (``iget``). |
| |
| 2. Filesystem freeze protection, if repairing (``mnt_want_write_file``). |
| |
| 3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO. |
| |
| 4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for operations that |
| can update page cache mappings. |
| |
| 5. Log feature enablement. |
| |
| 6. Transaction log space grant. |
| |
| 7. Space on the data and realtime devices for the transaction. |
| |
| 8. Incore dquot references, if a file is being repaired. |
| Note that they are not locked, merely acquired. |
| |
| 9. Inode ``ILOCK`` for file metadata updates. |
| |
| 10. AG header buffer locks / Realtime metadata inode ILOCK. |
| |
| 11. Realtime metadata buffer locks, if applicable. |
| |
| 12. Extent mapping btree blocks, if applicable. |
| |
| Resources are often released in the reverse order, though this is not required. |
| However, online fsck differs from regular XFS operations because it may examine |
| an object that normally is acquired in a later stage of the locking order, and |
| then decide to cross-reference the object with an object that is acquired |
| earlier in the order. |
| The next few sections detail the specific ways in which online fsck takes care |
| to avoid deadlocks. |
| |
| iget and irele During a Scrub |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| An inode scan performed on behalf of a scrub operation runs in transaction |
| context, and possibly with resources already locked and bound to it. |
| This isn't much of a problem for ``iget`` since it can operate in the context |
| of an existing transaction, as long as all of the bound resources are acquired |
| before the inode reference in the regular filesystem. |
| |
| When the VFS ``iput`` function is given a linked inode with no other |
| references, it normally puts the inode on an LRU list in the hope that it can |
| save time if another process re-opens the file before the system runs out |
| of memory and frees it. |
| Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE`` |
| flag on the inode to cause the kernel to try to drop the inode into the |
| inactivation machinery immediately. |
| |
| In the past, inactivation was always done from the process that dropped the |
| inode, which was a problem for scrub because scrub may already hold a |
| transaction, and XFS does not support nesting transactions. |
| On the other hand, if there is no scrub transaction, it is desirable to drop |
| otherwise unused inodes immediately to avoid polluting caches. |
| To capture these nuances, the online fsck code has a separate ``xchk_irele`` |
| function to set or clear the ``DONTCACHE`` flag to get the required release |
| behavior. |
| |
| Proposed patchsets include fixing |
| `scrub iget usage |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and |
| `dir iget usage |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_. |
| |
| .. _ilocking: |
| |
| Locking Inodes |
| ^^^^^^^^^^^^^^ |
| |
| In regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks |
| in a well-known order: parent → child when updating the directory tree, and |
| in numerical order of the addresses of their ``struct inode`` object otherwise. |
| For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page |
| faults. |
| If two MMAPLOCKs must be acquired, they are acquired in numerical order of |
| the addresses of their ``struct address_space`` objects. |
| Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be |
| acquired before transactions are allocated. |
| If two ILOCKs must be acquired, they are acquired in inumber order. |
| |
| Inode lock acquisition must be done carefully during a coordinated inode scan. |
| Online fsck cannot abide these conventions, because for a directory tree |
| scanner, the scrub process holds the IOLOCK of the file being scanned and it |
| needs to take the IOLOCK of the file at the other end of the directory link. |
| If the directory tree is corrupt because it contains a cycle, ``xfs_scrub`` |
| cannot use the regular inode locking functions and avoid becoming trapped in an |
| ABBA deadlock. |
| |
| Solving both of these problems is straightforward -- any time online fsck |
| needs to take a second lock of the same class, it uses trylock to avoid an ABBA |
| deadlock. |
If the trylock fails, scrub drops all inode locks and uses trylock loops to
(re)acquire all necessary resources.
Trylock loops enable scrub to check for pending fatal signals, which is how
scrub avoids deadlocking the filesystem or becoming an unresponsive process.
However, trylock loops mean that online fsck must be prepared to measure the
resource being scrubbed before and after the lock cycle to detect changes and
react accordingly.
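
The following standalone sketch shows the trylock-loop pattern with POSIX
mutexes; it is purely an illustrative analogue, and ``fatal_signal_stub``
merely stands in for the kernel's fatal signal check::

  /* Userspace analogue of scrub's trylock loops. */
  #include <pthread.h>
  #include <stdbool.h>
  #include <stdio.h>

  static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

  /* Stand-in for checking whether a fatal signal is pending. */
  static bool fatal_signal_stub(void)
  {
      return false;
  }

  /* Lock two objects of the same class without risking an ABBA
   * deadlock: take the first lock, trylock the second, and if that
   * fails drop everything and start over. */
  static bool lock_two(pthread_mutex_t *first, pthread_mutex_t *second)
  {
      for (;;) {
          pthread_mutex_lock(first);
          if (pthread_mutex_trylock(second) == 0)
              return true;
          pthread_mutex_unlock(first);
          if (fatal_signal_stub())
              return false;    /* give up instead of deadlocking */
      }
  }

  int main(void)
  {
      if (lock_two(&lock_a, &lock_b)) {
          printf("locked both objects\n");
          pthread_mutex_unlock(&lock_b);
          pthread_mutex_unlock(&lock_a);
      }
      return 0;
  }

Because both locks may have been dropped and retaken, the caller must recheck
the objects after ``lock_two`` returns, just as scrub re-measures the resource
being scrubbed after a lock cycle.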
| |
| .. _dirparent: |
| |
| Case Study: Finding a Directory Parent |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Consider the directory parent pointer repair code as an example. |
| Online fsck must verify that the dotdot dirent of a directory points up to a |
| parent directory, and that the parent directory contains exactly one dirent |
| pointing down to the child directory. |
| Fully validating this relationship (and repairing it if possible) requires a |
| walk of every directory on the filesystem while holding the child locked, and |
| while updates to the directory tree are being made. |
| The coordinated inode scan provides a way to walk the filesystem without the |
| possibility of missing an inode. |
| The child directory is kept locked to prevent updates to the dotdot dirent, but |
| if the scanner fails to lock a parent, it can drop and relock both the child |
| and the prospective parent. |
| If the dotdot entry changes while the directory is unlocked, then a move or |
| rename operation must have changed the child's parentage, and the scan can |
| exit early. |
| |
| The proposed patchset is the |
| `directory repair |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_ |
| series. |
| |
| .. _fshooks: |
| |
| Filesystem Hooks |
| ````````````````` |
| |
| The second piece of support that online fsck functions need during a full |
| filesystem scan is the ability to stay informed about updates being made by |
| other threads in the filesystem, since comparisons against the past are useless |
| in a dynamic environment. |
| Two pieces of Linux kernel infrastructure enable online fsck to monitor regular |
| filesystem operations: filesystem hooks and :ref:`static keys<jump_labels>`. |
| |
| Filesystem hooks convey information about an ongoing filesystem operation to |
| a downstream consumer. |
| In this case, the downstream consumer is always an online fsck function. |
| Because multiple fsck functions can run in parallel, online fsck uses the Linux |
| notifier call chain facility to dispatch updates to any number of interested |
| fsck processes. |
| Call chains are a dynamic list, which means that they can be configured at |
| run time. |
| Because these hooks are private to the XFS module, the information passed along |
| contains exactly what the checking function needs to update its observations. |
| |
| The current implementation of XFS hooks uses SRCU notifier chains to reduce the |
| impact to highly threaded workloads. |
Regular blocking notifier chains use an rwsem and seem to have a much lower
overhead for single-threaded applications.
However, it may turn out that the combination of blocking chains and static
keys is more performant; more study is needed here.
| |
| The following pieces are necessary to hook a certain point in the filesystem: |
| |
| - A ``struct xfs_hooks`` object must be embedded in a convenient place such as |
| a well-known incore filesystem object. |
| |
| - Each hook must define an action code and a structure containing more context |
| about the action. |
| |
| - Hook providers should provide appropriate wrapper functions and structs |
| around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of type |
| checking to ensure correct usage. |
| |
| - A callsite in the regular filesystem code must be chosen to call |
| ``xfs_hooks_call`` with the action code and data structure. |
| This place should be adjacent to (and not earlier than) the place where |
| the filesystem update is committed to the transaction. |
| In general, when the filesystem calls a hook chain, it should be able to |
| handle sleeping and should not be vulnerable to memory reclaim or locking |
| recursion. |
| However, the exact requirements are very dependent on the context of the hook |
| caller and the callee. |
| |
| - The online fsck function should define a structure to hold scan data, a lock |
| to coordinate access to the scan data, and a ``struct xfs_hook`` object. |
| The scanner function and the regular filesystem code must acquire resources |
| in the same order; see the next section for details. |
| |
| - The online fsck code must contain a C function to catch the hook action code |
| and data structure. |
| If the object being updated has already been visited by the scan, then the |
| hook information must be applied to the scan data. |
| |
| - Prior to unlocking inodes to start the scan, online fsck must call |
| ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and |
| ``xfs_hooks_add`` to enable the hook. |
| |
| - Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is |
| complete. |
| |
| The number of hooks should be kept to a minimum to reduce complexity. |
| Static keys are used to reduce the overhead of filesystem hooks to nearly |
| zero when online fsck is not running. |
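| |
| Putting these pieces together, hook registration and teardown might look |
| like the sketch below. |
| Only ``xfs_hooks_setup``, ``xfs_hooks_add``, and ``xfs_hooks_del`` come |
| from the list above; the payload structure, the hook chain embedded in |
| the mount, and the exact signatures are illustrative assumptions. |
| |
| .. code-block:: c |
| |
|     /* Illustrative hook payload; each real hook defines its own. */ |
|     struct xfs_example_update_params { |
|             struct xfs_inode        *ip; |
|             int64_t                 delta; |
|     }; |
| |
|     /* Scan-private state shared between the scanner and the hook. */ |
|     struct xchk_example { |
|             struct xfs_scrub        *sc; |
|             struct mutex            lock;   /* protects the shadow data */ |
|             struct xfs_hook         hook; |
|     }; |
| |
|     /* Runs via the notifier chain; must not modify filesystem state. */ |
|     static int |
|     xchk_example_hook(struct notifier_block *nb, unsigned long action, |
|                       void *data) |
|     { |
|             struct xchk_example              *xe; |
|             struct xfs_example_update_params *p = data; |
| |
|             xe = container_of(nb, struct xchk_example, hook.nb); |
| |
|             mutex_lock(&xe->lock); |
|             /* ...fold p->delta into the shadow observations here... */ |
|             mutex_unlock(&xe->lock); |
|             return NOTIFY_DONE; |
|     } |
| |
|     /* Enable the hook before unlocking inodes to start the scan... */ |
|     static int xchk_example_hook_add(struct xchk_example *xe) |
|     { |
|             xfs_hooks_setup(&xe->hook, xchk_example_hook); |
|             return xfs_hooks_add(&xe->sc->mp->m_example_hooks, &xe->hook); |
|     } |
| |
|     /* ...and disable it once the scan is complete. */ |
|     static void xchk_example_hook_del(struct xchk_example *xe) |
|     { |
|             xfs_hooks_del(&xe->sc->mp->m_example_hooks, &xe->hook); |
|     } |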
| |
| .. _liveupdate: |
| |
| Live Updates During a Scan |
| `````````````````````````` |
| |
| The code paths of the online fsck scanning code and the :ref:`hooked<fshooks>` |
| filesystem code look like this:: |
| |
| other program |
| ↓ |
| inode lock ←────────────────────┐ |
| ↓ │ |
| AG header lock │ |
| ↓ │ |
| filesystem function │ |
| ↓ │ |
| notifier call chain │ same |
| ↓ ├─── inode |
| scrub hook function │ lock |
| ↓ │ |
| scan data mutex ←──┐ same │ |
| ↓ ├─── scan │ |
| update scan data │ lock │ |
| ↑ │ │ |
| scan data mutex ←──┘ │ |
| ↑ │ |
| inode lock ←────────────────────┘ |
| ↑ |
| scrub function |
| ↑ |
| inode scanner |
| ↑ |
| xfs_scrub |
| |
| These rules must be followed to ensure correct interactions between the |
| checking code and the code making an update to the filesystem: |
| |
| - Prior to invoking the notifier call chain, the filesystem function being |
| hooked must acquire the same lock that the scrub scanning function acquires |
| to scan the inode. |
| |
| - The scanning function and the scrub hook function must coordinate access to |
| the scan data by acquiring a lock on the scan data. |
| |
| - Scrub hook functions must not add the live update information to the scan |
| observations unless the inode being updated has already been scanned. |
| The scan coordinator has a helper predicate (``xchk_iscan_want_live_update``) |
| for this. |
| |
| - Scrub hook functions must not change the caller's state, including the |
| transaction that it is running. |
| They must not acquire any resources that might conflict with the filesystem |
| function being hooked. |
| |
| - The hook function can abort the inode scan to avoid breaking the other rules. |
| |
| The inode scan APIs are pretty simple: |
| |
| - ``xchk_iscan_start`` starts a scan |
| |
| - ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or |
| returns zero if there is nothing left to scan |
| |
| - ``xchk_iscan_want_live_update`` to decide if an inode has already been |
| visited in the scan. |
| This is critical for hook functions to decide if they need to update the |
| in-memory scan information. |
| |
| - ``xchk_iscan_mark_visited`` to mark an inode as having been visited in the |
| scan |
| |
| - ``xchk_iscan_teardown`` to finish the scan |
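| |
| A scanning function might drive these calls roughly as shown in the |
| sketch below; the loop body is a placeholder, and the return value |
| conventions and argument lists are simplified assumptions. |
| |
| .. code-block:: c |
| |
|     /* Sketch of a coordinated inode scan using the APIs listed above. */ |
|     static int |
|     xchk_example_walk_inodes(struct xfs_scrub *sc, struct xchk_iscan *iscan) |
|     { |
|             struct xfs_inode        *ip; |
|             int                     ret; |
| |
|             xchk_iscan_start(sc, iscan); |
| |
|             /* Assume the iterator returns 1 per inode and 0 when done. */ |
|             while ((ret = xchk_iscan_iter(iscan, &ip)) == 1) { |
|                     /* ...lock ip and fold it into the shadow data... */ |
| |
|                     xchk_iscan_mark_visited(iscan, ip); |
|                     xfs_irele(ip); |
|             } |
| |
|             xchk_iscan_teardown(iscan); |
|             return ret < 0 ? ret : 0; |
|     } |
| |
|     /* In a live update hook, only visited inodes matter, as noted above. */ |
|     static void |
|     xchk_example_live_update(struct xchk_iscan *iscan, xfs_ino_t ino, |
|                              int64_t delta) |
|     { |
|             if (!xchk_iscan_want_live_update(iscan, ino)) |
|                     return; |
|             /* ...apply delta to the shadow observations... */ |
|     } |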
| |
| This functionality is also a part of the |
| `inode scanner |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_ |
| series. |
| |
| .. _quotacheck: |
| |
| Case Study: Quota Counter Checking |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| It is useful to compare the mount time quotacheck code to the online repair |
| quotacheck code. |
| Mount time quotacheck does not have to contend with concurrent operations, so |
| it does the following: |
| |
| 1. Make sure the ondisk dquots are in good enough shape that all the incore |
| dquots will actually load, and zero the resource usage counters in the |
| ondisk buffer. |
| |
| 2. Walk every inode in the filesystem. |
| Add each file's resource usage to the incore dquot. |
| |
| 3. Walk each incore dquot. |
| If the incore dquot is not being flushed, add the ondisk buffer backing the |
| incore dquot to a delayed write (delwri) list. |
| |
| 4. Write the buffer list to disk. |
| |
| Like most online fsck functions, online quotacheck can't write to regular |
| filesystem objects until the newly collected metadata reflect all filesystem |
| state. |
| Therefore, online quotacheck records file resource usage to a shadow dquot |
| index implemented with a sparse ``xfarray``, and only writes to the real dquots |
| once the scan is complete. |
| Handling transactional updates is tricky because quota resource usage updates |
| are handled in phases to minimize contention on dquots: |
| |
| 1. The inodes involved are joined and locked to a transaction. |
| |
| 2. For each dquot attached to the file: |
| |
| a. The dquot is locked. |
| |
| b. A quota reservation is added to the dquot's resource usage. |
| The reservation is recorded in the transaction. |
| |
| c. The dquot is unlocked. |
| |
| 3. Changes in actual quota usage are tracked in the transaction. |
| |
| 4. At transaction commit time, each dquot is examined again: |
| |
| a. The dquot is locked again. |
| |
| b. Quota usage changes are logged and unused reservation is given back to |
| the dquot. |
| |
| c. The dquot is unlocked. |
| |
| For online quotacheck, hooks are placed in steps 2 and 4. |
| The step 2 hook creates a shadow version of the transaction dquot context |
| (``dqtrx``) that operates in a similar manner to the regular code. |
| The step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots. |
| Notice that both hooks are called with the inode locked, which is how the |
| live update coordinates with the inode scanner. |
| |
| The quotacheck scan looks like this: |
| |
| 1. Set up a coordinated inode scan. |
| |
| 2. For each inode returned by the inode scan iterator: |
| |
| a. Grab and lock the inode. |
| |
| b. Determine that inode's resource usage (data blocks, inode counts, |
| realtime blocks) and add that to the shadow dquots for the user, group, |
| and project ids associated with the inode. |
| |
| c. Unlock and release the inode. |
| |
| 3. For each dquot in the system: |
| |
| a. Grab and lock the dquot. |
| |
| b. Check the dquot against the shadow dquots created by the scan and updated |
| by the live hooks. |
| |
| Live updates are key to being able to walk every quota record without |
| needing to hold any locks for a long duration. |
| If repairs are desired, the real and shadow dquots are locked and their |
| resource counts are set to the values in the shadow dquot. |
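| |
| Step 2b of the scan might be expressed as the sketch below. |
| The shadow record layout and the helper are hypothetical, and the |
| xfarray calls are simplified; the real code stores its shadow dquots in |
| the sparse xfarray described above. |
| |
| .. code-block:: c |
| |
|     /* Hypothetical shadow dquot record, one per quota id. */ |
|     struct xqc_shadow_dquot { |
|             uint64_t        icount;         /* inodes */ |
|             uint64_t        bcount;         /* data device blocks */ |
|             uint64_t        rtbcount;       /* realtime blocks */ |
|     }; |
| |
|     /* Fold one file's resource usage into the shadow record for an id. */ |
|     static int |
|     xqc_record_file_usage( |
|             struct xfarray  *shadow,        /* sparse, indexed by quota id */ |
|             uint32_t        id, |
|             uint64_t        blocks, |
|             uint64_t        rtblocks) |
|     { |
|             struct xqc_shadow_dquot dq = { }; |
|             int                     error; |
| |
|             /* Assume unpopulated sparse entries read back as zeroes. */ |
|             error = xfarray_load(shadow, id, &dq); |
|             if (error) |
|                     return error; |
| |
|             dq.icount++; |
|             dq.bcount += blocks; |
|             dq.rtbcount += rtblocks; |
| |
|             return xfarray_store(shadow, id, &dq); |
|     } |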
| |
| The proposed patchset is the |
| `online quotacheck |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_ |
| series. |
| |
| .. _nlinks: |
| |
| Case Study: File Link Count Checking |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| File link count checking also uses live update hooks. |
| The coordinated inode scanner is used to visit all directories on the |
| filesystem, and per-file link count records are stored in a sparse ``xfarray`` |
| indexed by inumber. |
| During the scanning phase, each entry in a directory generates observation |
| data as follows: |
| |
| 1. If the entry is a dotdot (``'..'``) entry of the root directory, the |
| directory's parent link count is bumped because the root directory's dotdot |
| entry is self referential. |
| |
| 2. If the entry is a dotdot entry of a subdirectory, the parent's backref |
| count is bumped. |
| |
| 3. If the entry is neither a dot nor a dotdot entry, the target file's parent |
| count is bumped. |
| |
| 4. If the target is a subdirectory, the parent's child link count is bumped. |
| |
| A crucial point to understand about how the link count inode scanner interacts |
| with the live update hooks is that the scan cursor tracks which *parent* |
| directories have been scanned. |
| In other words, the live updates ignore any update about ``A → B`` when A has |
| not been scanned, even if B has been scanned. |
| Furthermore, a subdirectory A with a dotdot entry pointing back to B is |
| accounted as a backref counter in the shadow data for A, since child dotdot |
| entries affect the parent's link count. |
| Live update hooks are carefully placed in all parts of the filesystem that |
| create, change, or remove directory entries, since those operations involve |
| bumplink and droplink. |
| |
| For any file, the correct link count is the number of parents plus the number |
| of child subdirectories. |
| Non-directories never have children of any kind. |
| The backref information is used to detect inconsistencies in the number of |
| links pointing to child subdirectories and the number of dotdot entries |
| pointing back. |
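| |
| Conceptually, the post-scan check of a single file reduces to the |
| comparison sketched below; the shadow record layout shown here is |
| hypothetical. |
| |
| .. code-block:: c |
| |
|     /* Hypothetical shadow link count record, one per inode. */ |
|     struct xchk_shadow_nlink { |
|             uint32_t        parents;        /* dirents pointing to this file */ |
|             uint32_t        backrefs;       /* dotdot entries pointing back */ |
|             uint32_t        children;       /* subdirectories underneath */ |
|     }; |
| |
|     /* Does the ondisk link count match the scan observations? */ |
|     static bool |
|     xchk_shadow_nlink_ok( |
|             const struct xchk_shadow_nlink  *obs, |
|             uint32_t                        ondisk_nlink, |
|             bool                            is_dir) |
|     { |
|             /* Non-directories never have children of any kind. */ |
|             if (!is_dir && obs->children != 0) |
|                     return false; |
| |
|             /* Each child subdirectory should have a matching dotdot. */ |
|             if (is_dir && obs->children != obs->backrefs) |
|                     return false; |
| |
|             /* The correct link count is parents plus child subdirectories. */ |
|             return ondisk_nlink == obs->parents + obs->children; |
|     } |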
| |
| After the scan completes, the link count of each file can be checked by locking |
| both the inode and the shadow data, and comparing the link counts. |
| A second coordinated inode scan cursor is used for comparisons. |
| Live updates are key to being able to walk every inode without needing to hold |
| any locks between inodes. |
| If repairs are desired, the inode's link count is set to the value in the |
| shadow information. |
| If no parents are found, the file must be :ref:`reparented <orphanage>` to the |
| orphanage to prevent the file from being lost forever. |
| |
| The proposed patchset is the |
| `file link count repair |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_ |
| series. |
| |
| .. _rmap_repair: |
| |
| Case Study: Rebuilding Reverse Mapping Records |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Most repair functions follow the same pattern: lock filesystem resources, |
| walk the surviving ondisk metadata looking for replacement metadata records, |
| and use an :ref:`in-memory array <xfarray>` to store the gathered observations. |
| The primary advantage of this approach is the simplicity and modularity of the |
| repair code -- code and data are entirely contained within the scrub module, |
| do not require hooks in the main filesystem, and are usually the most efficient |
| in memory use. |
| A secondary advantage of this repair approach is atomicity -- once the kernel |
| decides a structure is corrupt, no other threads can access the metadata until |
| the kernel finishes repairing and revalidating the metadata. |
| |
| For repairs going on within a shard of the filesystem, these advantages |
| outweigh the delays inherent in locking the shard while repairing parts of the |
| shard. |
| Unfortunately, repairs to the reverse mapping btree cannot use the "standard" |
| btree repair strategy because it must scan every space mapping of every fork of |
| every file in the filesystem, and the filesystem cannot stop. |
| Therefore, rmap repair foregoes atomicity between scrub and repair. |
| It combines a :ref:`coordinated inode scanner <iscan>`, :ref:`live update hooks |
| <liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the |
| scan for reverse mapping records. |
| |
| 1. Set up an xfbtree to stage rmap records. |
| |
| 2. While holding the locks on the AGI and AGF buffers acquired during the |
| scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW |
| staging extents, and the internal log. |
| |
| 3. Set up an inode scanner. |
| |
| 4. Hook into rmap updates for the AG being repaired so that the live scan data |
| can receive updates to the rmap btree from the rest of the filesystem during |
| the file scan. |
| |
| 5. For each space mapping found in either fork of each file scanned, |
| decide if the mapping matches the AG of interest. |
| If so: |
| |
| a. Create a btree cursor for the in-memory btree. |
| |
| b. Use the rmap code to add the record to the in-memory btree. |
| |
| c. Use the :ref:`special commit function <xfbtree_commit>` to write the |
| xfbtree changes to the xfile. |
| |
| 6. For each live update received via the hook, decide if the owner has already |
| been scanned. |
| If so, apply the live update into the scan data: |
| |
| a. Create a btree cursor for the in-memory btree. |
| |
| b. Replay the operation into the in-memory btree. |
| |
| c. Use the :ref:`special commit function <xfbtree_commit>` to write the |
| xfbtree changes to the xfile. |
| This is performed with an empty transaction to avoid changing the |
| caller's state. |
| |
| 7. When the inode scan finishes, create a new scrub transaction and relock the |
| two AG headers. |
| |
| 8. Compute the new btree geometry using the number of rmap records in the |
| shadow btree, like all other btree rebuilding functions. |
| |
| 9. Allocate the number of blocks computed in the previous step. |
| |
| 10. Perform the usual btree bulk loading and commit to install the new rmap |
| btree. |
| |
| 11. Reap the old rmap btree blocks as discussed in the case study about how |
| to :ref:`reap after rmap btree repair <rmap_reap>`. |
| |
| 12. Free the xfbtree now that it is not needed. |
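| |
| The filter in step 5 is simple arithmetic, as sketched below. |
| ``XFS_FSB_TO_AGNO`` and ``isnullstartblock`` are the usual libxfs |
| helpers; the function itself is illustrative, and the real code must |
| also skip realtime file data blocks, which do not live in any AG. |
| |
| .. code-block:: c |
| |
|     /* Does this file mapping land in the AG being repaired? */ |
|     static bool |
|     xrep_rmap_want_mapping( |
|             struct xfs_mount                *mp, |
|             const struct xfs_bmbt_irec      *irec, |
|             xfs_agnumber_t                  repair_agno) |
|     { |
|             /* Delayed allocations have no physical space mapped yet. */ |
|             if (isnullstartblock(irec->br_startblock)) |
|                     return false; |
| |
|             /* A single mapping never crosses an AG boundary. */ |
|             return XFS_FSB_TO_AGNO(mp, irec->br_startblock) == repair_agno; |
|     } |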
| |
| The proposed patchset is the |
| `rmap repair |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_ |
| series. |
| |
| Staging Repairs with Temporary Files on Disk |
| -------------------------------------------- |
| |
| XFS stores a substantial amount of metadata in file forks: directories, |
| extended attributes, symbolic link targets, free space bitmaps and summary |
| information for the realtime volume, and quota records. |
| File forks map 64-bit logical file fork space extents to physical storage space |
| extents, similar to how a memory management unit maps 64-bit virtual addresses |
| to physical memory addresses. |
| Therefore, file-based tree structures (such as directories and extended |
| attributes) use blocks mapped in the file fork offset address space that point |
| to other blocks mapped within that same address space, and file-based linear |
| structures (such as bitmaps and quota records) compute array element offsets in |
| the file fork offset address space. |
| |
| Because file forks can consume as much space as the entire filesystem, repairs |
| cannot be staged in memory, even when a paging scheme is available. |
| Therefore, online repair of file-based metadata creates a temporary file in |
| the XFS filesystem, writes a new structure at the correct offsets into the |
| temporary file, and atomically exchanges all file fork mappings (and hence the |
| fork contents) to commit the repair. |
| Once the repair is complete, the old fork can be reaped as necessary; if the |
| system goes down during the reap, the iunlink code will delete the blocks |
| during log recovery. |
| |
| **Note**: All space usage and inode indices in the filesystem *must* be |
| consistent to use a temporary file safely! |
| This dependency is the reason why online repair can only use pageable kernel |
| memory to stage ondisk space usage information. |
| |
| Exchanging metadata file mappings with a temporary file requires the owner |
| field of the block headers to match the file being repaired and not the |
| temporary file. |
| The directory, extended attribute, and symbolic link functions were all |
| modified to allow callers to specify owner numbers explicitly. |
| |
| There is a downside to the reaping process -- if the system crashes during the |
| reap phase and the fork extents are crosslinked, the iunlink processing will |
| fail because freeing space will find the extra reverse mappings and abort. |
| |
| Temporary files created for repair are similar to ``O_TMPFILE`` files created |
| by userspace. |
| They are not linked into a directory and the entire file will be reaped when |
| the last reference to the file is lost. |
| The key differences are that these files must have no access permission outside |
| the kernel at all, they must be specially marked to prevent them from being |
| opened by handle, and they must never be linked into the directory tree. |
| |
| +--------------------------------------------------------------------------+ |
| | **Historical Sidebar**: | |
| +--------------------------------------------------------------------------+ |
| | In the initial iteration of file metadata repair, the damaged metadata | |
| | blocks would be scanned for salvageable data; the extents in the file | |
| | fork would be reaped; and then a new structure would be built in its | |
| | place. | |
| | This strategy did not survive the introduction of the atomic repair | |
| | requirement expressed earlier in this document. | |
| | | |
| | The second iteration explored building a second structure at a high | |
| | offset in the fork from the salvage data, reaping the old extents, and | |
| | using a ``COLLAPSE_RANGE`` operation to slide the new extents into | |
| | place. | |
| | | |
| | This had many drawbacks: | |
| | | |
| | - Array structures are linearly addressed, and the regular filesystem | |
| | codebase does not have the concept of a linear offset that could be | |
| | applied to the record offset computation to build an alternate copy. | |
| | | |
| | - Extended attributes are allowed to use the entire attr fork offset | |
| | address space. | |
| | | |
| | - Even if repair could build an alternate copy of a data structure in a | |
| | different part of the fork address space, the atomic repair commit | |
| | requirement means that online repair would have to be able to perform | |
| | a log assisted ``COLLAPSE_RANGE`` operation to ensure that the old | |
| | structure was completely replaced. | |
| | | |
| | - A crash after construction of the secondary tree but before the range | |
| | collapse would leave unreachable blocks in the file fork. | |
| | This would likely confuse things further. | |
| | | |
| | - Reaping blocks after a repair is not a simple operation, and | |
| | initiating a reap operation from a restarted range collapse operation | |
| | during log recovery is daunting. | |
| | | |
| | - Directory entry blocks and quota records record the file fork offset | |
| | in the header area of each block. | |
| | An atomic range collapse operation would have to rewrite this part of | |
| | each block header. | |
| | Rewriting a single field in block headers is not a huge problem, but | |
| | it's something to be aware of. | |
| | | |
| | - Each block in a directory or extended attributes btree index contains | |
| | sibling and child block pointers. | |
| | Were the atomic commit to use a range collapse operation, each block | |
| | would have to be rewritten very carefully to preserve the graph | |
| | structure. | |
| | Doing this as part of a range collapse means rewriting a large number | |
| | of blocks repeatedly, which is not conducive to quick repairs. | |
| | | |
| | This led to the introduction of temporary file staging. | |
| +--------------------------------------------------------------------------+ |
| |
| Using a Temporary File |
| `````````````````````` |
| |
| Online repair code should use the ``xrep_tempfile_create`` function to create a |
| temporary file inside the filesystem. |
| This allocates an inode, marks the in-core inode private, and attaches it to |
| the scrub context. |
| These files are hidden from userspace, may not be added to the directory tree, |
| and must be kept private. |
| |
| Temporary files only use two inode locks: the IOLOCK and the ILOCK. |
| The MMAPLOCK is not needed here, because there must not be page faults from |
| userspace for data fork blocks. |
| The usage patterns of these two locks are the same as for any other XFS file -- |
| access to file data are controlled via the IOLOCK, and access to file metadata |
| are controlled via the ILOCK. |
| Locking helpers are provided so that the temporary file and its lock state can |
| be cleaned up by the scrub context. |
| To comply with the nested locking strategy laid out in the :ref:`inode |
| locking<ilocking>` section, it is recommended that scrub functions use the |
| ``xrep_tempfile_ilock*_nowait`` lock helpers. |
| |
| Data can be written to a temporary file by two means: |
| |
| 1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular |
| temporary file from an xfile. |
| |
| 2. The regular directory, symbolic link, and extended attribute functions can |
| be used to write to the temporary file. |
| |
| Once a good copy of a data file has been constructed in a temporary file, it |
| must be conveyed to the file being repaired, which is the topic of the next |
| section. |
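| |
| Putting the helpers together, staging new contents might look like the |
| sketch below. |
| The create, lock, and copyin helpers are the ones named in this section; |
| the unlock helper, the exact signatures, and the ``-EDEADLOCK`` retry |
| convention are assumptions. |
| |
| .. code-block:: c |
| |
|     /* Sketch: stage new metadata contents in a temporary file. */ |
|     static int |
|     xrep_example_stage(struct xfs_scrub *sc, struct xfile *new_contents, |
|                        uint64_t length) |
|     { |
|             int     error; |
| |
|             /* Create a private temporary file attached to the scrub context. */ |
|             error = xrep_tempfile_create(sc, S_IFREG); |
|             if (error) |
|                     return error; |
| |
|             /* Respect the nested locking rules with the nowait helper. */ |
|             if (!xrep_tempfile_ilock_nowait(sc)) |
|                     return -EDEADLOCK;      /* let the caller retry */ |
| |
|             /* Copy the staged contents from the xfile into the data fork. */ |
|             error = xrep_tempfile_copyin(sc, new_contents, 0, length); |
| |
|             xrep_tempfile_iunlock(sc); |
|             return error; |
|     } |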
| |
| The proposed patches are in the |
| `repair temporary files |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_ |
| series. |
| |
| Logged File Content Exchanges |
| ----------------------------- |
| |
| Once repair builds a temporary file with a new data structure written into |
| it, it must commit the new changes into the existing file. |
| It is not possible to swap the inumbers of two files, so instead the new |
| metadata must replace the old. |
| This suggests the need for the ability to swap extents, but the existing extent |
| swapping code used by the file defragmenting tool ``xfs_fsr`` is not sufficient |
| for online repair because: |
| |
| a. When the reverse-mapping btree is enabled, the swap code must keep the |
| reverse mapping information up to date with every exchange of mappings. |
| Therefore, it can only exchange one mapping per transaction, and each |
| transaction is independent. |
| |
| b. Reverse-mapping is critical for the operation of online fsck, so the old |
| defragmentation code (which swapped entire extent forks in a single |
| operation) is not useful here. |
| |
| c. Defragmentation is assumed to occur between two files with identical |
| contents. |
| For this use case, an incomplete exchange will not result in a user-visible |
| change in file contents, even if the operation is interrupted. |
| |
| d. Online repair needs to swap the contents of two files that are by definition |
| *not* identical. |
| For directory and xattr repairs, the user-visible contents might be the |
| same, but the contents of individual blocks may be very different. |
| |
| e. Old blocks in the file may be cross-linked with another structure and must |
| not reappear if the system goes down mid-repair. |
| |
| These problems are overcome by creating a new deferred operation and a new type |
| of log intent item to track the progress of an operation to exchange two file |
| ranges. |
| The new exchange operation type chains together the same transactions used by |
| the reverse-mapping extent swap code, but records intermediate progress in |
| the log so that operations can be restarted after a crash. |
| This new functionality is called the file contents exchange (xfs_exchrange) |
| code. |
| The underlying implementation exchanges file fork mappings (xfs_exchmaps). |
| The new log item records the progress of the exchange to ensure that once an |
| exchange begins, it will always run to completion, even if there are |
| interruptions. |
| The new ``XFS_SB_FEAT_INCOMPAT_EXCHRANGE`` incompatible feature flag |
| in the superblock protects these new log item records from being replayed on |
| old kernels. |
| |
| The proposed patchset is the |
| `file contents exchange |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_ |
| series. |
| |
| +--------------------------------------------------------------------------+ |
| | **Sidebar: Using Log-Incompatible Feature Flags** | |
| +--------------------------------------------------------------------------+ |
| | Starting with XFS v5, the superblock contains a | |
| | ``sb_features_log_incompat`` field to indicate that the log contains | |
| | records that might not be readable by all kernels that could mount this | |
| | filesystem. | |
| | In short, log incompat features protect the log contents against kernels | |
| | that will not understand the contents. | |
| | Unlike the other superblock feature bits, log incompat bits are | |
| | ephemeral because an empty (clean) log does not need protection. | |
| | The log cleans itself after its contents have been committed into the | |
| | filesystem, either as part of an unmount or because the system is | |
| | otherwise idle. | |
| | Because upper level code can be working on a transaction at the same | |
| | time that the log cleans itself, it is necessary for upper level code to | |
| | communicate to the log when it is going to use a log incompatible | |
| | feature. | |
| | | |
| | The log coordinates access to incompatible features through the use of | |
| | one ``struct rw_semaphore`` for each feature. | |
| | The log cleaning code tries to take this rwsem in exclusive mode to | |
| | clear the bit; if the lock attempt fails, the feature bit remains set. | |
| | The code supporting a log incompat feature should create wrapper | |
| | functions to obtain the log feature and call | |
| | ``xfs_add_incompat_log_feature`` to set the feature bits in the primary | |
| | superblock. | |
| | The superblock update is performed transactionally, so the wrapper to | |
| | obtain log assistance must be called just prior to the creation of the | |
| | transaction that uses the functionality. | |
| | For a file operation, this step must happen after taking the IOLOCK | |
| | and the MMAPLOCK, but before allocating the transaction. | |
| | When the transaction is complete, the ``xlog_drop_incompat_feat`` | |
| | function is called to release the feature. | |
| | The feature bit will not be cleared from the superblock until the log | |
| | becomes clean. | |
| | | |
| | Log-assisted extended attribute updates and file content exchanges both | |
| | use log incompat features and provide convenience wrappers around the | |
| | functionality. | |
| +--------------------------------------------------------------------------+ |
| |
| Mechanics of a Logged File Content Exchange |
| ``````````````````````````````````````````` |
| |
| Exchanging contents between file forks is a complex task. |
| The goal is to exchange all file fork mappings between two file fork offset |
| ranges. |
| There are likely to be many extent mappings in each fork, and the edges of |
| the mappings aren't necessarily aligned. |
| Furthermore, there may be other updates that need to happen after the exchange, |
| such as exchanging file sizes, inode flags, or conversion of fork data to local |
| format. |
| This is roughly the format of the new deferred exchange-mapping work item: |
| |
| .. code-block:: c |
| |
| struct xfs_exchmaps_intent { |
| /* Inodes participating in the operation. */ |
| struct xfs_inode *xmi_ip1; |
| struct xfs_inode *xmi_ip2; |
| |
| /* File offset range information. */ |
| xfs_fileoff_t xmi_startoff1; |
| xfs_fileoff_t xmi_startoff2; |
| xfs_filblks_t xmi_blockcount; |
| |
| /* Set these file sizes after the operation, unless negative. */ |
| xfs_fsize_t xmi_isize1; |
| xfs_fsize_t xmi_isize2; |
| |
| /* XFS_EXCHMAPS_* log operation flags */ |
| uint64_t xmi_flags; |
| }; |
| |
| The new log intent item contains enough information to track two logical fork |
| offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2, |
| blockcount)``. |
| Each step of an exchange operation exchanges the largest file range mapping |
| possible from one file to the other. |
| After each step in the exchange operation, the two startoff fields are |
| incremented and the blockcount field is decremented to reflect the progress |
| made. |
| The flags field captures behavioral parameters such as exchanging attr fork |
| mappings instead of the data fork and other work to be done after the exchange. |
| The two isize fields are used to exchange the file sizes at the end of the |
| operation if the file data fork is the target of the operation. |
| |
| When the exchange is initiated, the sequence of operations is as follows: |
| |
| 1. Create a deferred work item for the file mapping exchange. |
| At the start, it should contain the entirety of the file block ranges to be |
| exchanged. |
| |
| 2. Call ``xfs_defer_finish`` to process the exchange. |
| This is encapsulated in ``xrep_tempexch_contents`` for scrub operations. |
| This will log an extent swap intent item to the transaction for the deferred |
| mapping exchange work item. |
| |
| 3. Until ``xmi_blockcount`` of the deferred mapping exchange work item is zero, |
| |
| a. Read the block maps of both file ranges starting at ``xmi_startoff1`` and |
| ``xmi_startoff2``, respectively, and compute the longest extent that can |
| be exchanged in a single step. |
| This is the minimum of the two mappings' ``br_blockcount`` values. |
| Keep advancing through the file forks until at least one of the mappings |
| contains written blocks. |
| Mutual holes, unwritten extents, and extent mappings to the same physical |
| space are not exchanged. |
| |
| For the next few steps, this document will refer to the mapping that came |
| from file 1 as "map1", and the mapping that came from file 2 as "map2". |
| |
| b. Create a deferred block mapping update to unmap map1 from file 1. |
| |
| c. Create a deferred block mapping update to unmap map2 from file 2. |
| |
| d. Create a deferred block mapping update to map map1 into file 2. |
| |
| e. Create a deferred block mapping update to map map2 into file 1. |
| |
| f. Log the block, quota, and extent count updates for both files. |
| |
| g. Extend the ondisk size of either file if necessary. |
| |
| h. Log a mapping exchange done log item for the mapping exchange intent log |
| item that was read at the start of step 3. |
| |
| i. Compute the amount of file range that has just been covered. |
| This quantity is ``(map1.br_startoff + map1.br_blockcount - |
| xmi_startoff1)``, because step 3a could have skipped holes. |
| |
| j. Increase the starting offsets of ``xmi_startoff1`` and ``xmi_startoff2`` |
| by the number of blocks computed in the previous step, and decrease |
| ``xmi_blockcount`` by the same quantity. |
| This advances the cursor. |
| |
| k. Log a new mapping exchange intent log item reflecting the advanced state |
| of the work item. |
| |
| l. Return the proper error code (EAGAIN) to the deferred operation manager |
| to inform it that there is more work to be done. |
| The operation manager completes the deferred work in steps 3b-3e before |
| moving back to the start of step 3. |
| |
| 4. Perform any post-processing. |
| This will be discussed in more detail in subsequent sections. |
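| |
| Steps 3i and 3j amount to a little cursor arithmetic, sketched below |
| using the intent fields shown earlier; the function name is |
| illustrative. |
| |
| .. code-block:: c |
| |
|     /* |
|      * Advance the exchange cursor past the range that was just |
|      * exchanged, including any holes that step 3a skipped over. |
|      */ |
|     static void |
|     xfs_exchmaps_advance( |
|             struct xfs_exchmaps_intent      *xmi, |
|             const struct xfs_bmbt_irec      *map1) |
|     { |
|             xfs_filblks_t   advance; |
| |
|             advance = map1->br_startoff + map1->br_blockcount - |
|                       xmi->xmi_startoff1; |
| |
|             xmi->xmi_startoff1 += advance; |
|             xmi->xmi_startoff2 += advance; |
|             xmi->xmi_blockcount -= advance; |
|     } |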
| |
| If the filesystem goes down in the middle of an operation, log recovery will |
| find the most recent unfinished mapping exchange log intent item and restart |
| from there. |
| This is how atomic file mapping exchanges guarantee that an outside observer |
| will either see the old broken structure or the new one, and never a |
| mishmash of both. |
| |
| Preparation for File Content Exchanges |
| `````````````````````````````````````` |
| |
| There are a few things that need to be taken care of before initiating an |
| atomic file mapping exchange operation. |
| First, regular files require the page cache to be flushed to disk before the |
| operation begins, and directio writes to be quiesced. |
| Like any filesystem operation, file mapping exchanges must determine the |
| maximum amount of disk space and quota that can be consumed on behalf of both |
| files in the operation, and reserve that quantity of resources to avoid an |
| unrecoverable out of space failure once it starts dirtying metadata. |
| The preparation step scans the ranges of both files to estimate: |
| |
| - Data device blocks needed to handle the repeated updates to the fork |
| mappings. |
| - Change in data and realtime block counts for both files. |
| - Increase in quota usage for both files, if the two files do not share the |
| same set of quota ids. |
| - The number of extent mappings that will be added to each file. |
| - Whether or not there are partially written realtime extents. |
| User programs must never be able to access a realtime file extent that maps |
| to different extents on the realtime volume, which could happen if the |
| operation fails to run to completion. |
| |
| The need for precise estimation increases the run time of the exchange |
| operation, but it is very important to maintain correct accounting. |
| The filesystem must not run completely out of free space, nor can the mapping |
| exchange ever add more extent mappings to a fork than it can support. |
| Regular users are required to abide by the quota limits, though metadata |
| repairs may exceed quota to resolve inconsistent metadata elsewhere. |
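| |
| The estimates listed above might be summarized in a structure like the |
| following purely illustrative sketch; the real estimation code defines |
| its own layout. |
| |
| .. code-block:: c |
| |
|     /* Hypothetical summary of the exchange preparation estimates. */ |
|     struct xfs_exchmaps_estimate { |
|             xfs_filblks_t   resblks;        /* blocks for bmbt updates */ |
|             int64_t         ddelta1;        /* data block delta, file 1 */ |
|             int64_t         ddelta2;        /* data block delta, file 2 */ |
|             int64_t         rtdelta1;       /* rt block delta, file 1 */ |
|             int64_t         rtdelta2;       /* rt block delta, file 2 */ |
|             int64_t         qdelta1;        /* quota usage delta, file 1 */ |
|             int64_t         qdelta2;        /* quota usage delta, file 2 */ |
|             uint64_t        nr_mappings1;   /* mappings added to file 1 */ |
|             uint64_t        nr_mappings2;   /* mappings added to file 2 */ |
|             bool            partial_rtext;  /* partially written rt extents */ |
|     }; |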
| |
| Special Features for Exchanging Metadata File Contents |
| `````````````````````````````````````````````````````` |
| |
| Extended attributes, symbolic links, and directories can set the fork format to |
| "local" and treat the fork as a literal area for data storage. |
| Metadata repairs must take extra steps to support these cases: |
| |
| - If both forks are in local format and the fork areas are large enough, the |
| exchange is performed by copying the incore fork contents, logging both |
| forks, and committing. |
| The atomic file mapping exchange mechanism is not necessary, since this can |
| be done with a single transaction. |
| |
| - If both forks map blocks, then the regular atomic file mapping exchange is |
| used. |
| |
| - Otherwise, only one fork is in local format. |
| The contents of the local format fork are converted to a block to perform the |
| exchange. |
| The conversion to block format must be done in the same transaction that |
| logs the initial mapping exchange intent log item. |
| The regular atomic mapping exchange is used to exchange the metadata file |
| mappings. |
| Special flags are set on the exchange operation so that the transaction can |
| be rolled one more time to convert the second file's fork back to local |
| format so that the second file will be ready to go as soon as the ILOCK is |
| dropped. |
| |
| Extended attributes and directories stamp the owning inode into every block, |
| but the buffer verifiers do not actually check the inode number! |
| Although there is no verification, it is still important to maintain |
| referential integrity, so prior to performing the mapping exchange, online |
| repair builds every block in the new data structure with the owner field of the |
| file being repaired. |
| |
| After a successful exchange operation, the repair operation must reap the old |
| fork blocks by processing each fork mapping through the standard :ref:`file |
| extent reaping <reaping>` mechanism that is done post-repair. |
| If the filesystem should go down during the reap part of the repair, the |
| iunlink processing at the end of recovery will free both the temporary file and |
| whatever blocks were not reaped. |
| However, this iunlink processing omits the cross-link detection of online |
| repair, and is not completely foolproof. |
| |
| Exchanging Temporary File Contents |
| `````````````````````````````````` |
| |
| To repair a metadata file, online repair proceeds as follows: |
| |
| 1. Create a temporary repair file. |
| |
| 2. Use the staging data to write out new contents into the temporary repair |
| file. |
| The same fork must be written to as is being repaired. |
| |
| 3. Commit the scrub transaction, since the exchange resource estimation step |
| must be completed before transaction reservations are made. |
| |
| 4. Call ``xrep_tempexch_trans_alloc`` to allocate a new scrub transaction with |
| the appropriate resource reservations, locks, and fill out a ``struct |
| xfs_exchmaps_req`` with the details of the exchange operation. |
| |
| 5. Call ``xrep_tempexch_contents`` to exchange the contents. |
| |
| 6. Commit the transaction to complete the repair. |
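| |
| In code, steps 4 through 6 might look like the sketch below. |
| The helper names are the ones mentioned above, but the |
| ``struct xrep_tempexch`` context, the signatures, and the commit call |
| are simplified assumptions. |
| |
| .. code-block:: c |
| |
|     /* Sketch of committing staged contents into the file being repaired. */ |
|     static int |
|     xrep_example_commit(struct xfs_scrub *sc) |
|     { |
|             struct xrep_tempexch    tx; |
|             int                     error; |
| |
|             /* |
|              * Allocate a fresh scrub transaction with enough reservation |
|              * for the exchange, and fill out the exchange request. |
|              */ |
|             error = xrep_tempexch_trans_alloc(sc, XFS_DATA_FORK, &tx); |
|             if (error) |
|                     return error; |
| |
|             /* Log the exchange intents and let deferred ops run. */ |
|             error = xrep_tempexch_contents(sc, &tx); |
|             if (error) |
|                     return error; |
| |
|             /* Committing the transaction completes the repair. */ |
|             return xfs_trans_commit(sc->tp); |
|     } |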
| |
| .. _rtsummary: |
| |
| Case Study: Repairing the Realtime Summary File |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| In the "realtime" section of an XFS filesystem, free space is tracked via a |
| bitmap, similar to Unix FFS. |
| Each bit in the bitmap represents one realtime extent, which is a multiple of |
| the filesystem block size between 4KiB and 1GiB in size. |
| The realtime summary file indexes the number of free extents of a given size to |
| the offset of the block within the realtime free space bitmap where those free |
| extents begin. |
| In other words, the summary file helps the allocator find free extents by |
| length, similar to what the free space by count (cntbt) btree does for the data |
| section. |
| |
| The summary file itself is a flat file (with no block headers or checksums!) |
| partitioned into ``log2(total rt extents)`` sections containing enough 32-bit |
| counters to match the number of blocks in the rt bitmap. |
| Each counter records the number of free extents that start in that bitmap block |
| and can satisfy a power-of-two allocation request. |
| |
| To check the summary file against the bitmap: |
| |
| 1. Take the ILOCK of both the realtime bitmap and summary files. |
| |
| 2. For each free space extent recorded in the bitmap: |
| |
| a. Compute the position in the summary file that contains a counter that |
| represents this free extent. |
| |
| b. Read the counter from the xfile. |
| |
| c. Increment it, and write it back to the xfile. |
| |
| 3. Compare the contents of the xfile against the ondisk file. |
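| |
| Step 2a reduces to a two-dimensional array index, as the following |
| sketch shows. |
| The helper is illustrative; it assumes a free extent of ``len`` realtime |
| extents starting in realtime bitmap block ``bbno``, with one 32-bit |
| counter per (power of two, bitmap block) pair as described above. |
| |
| .. code-block:: c |
| |
|     /* Byte offset in the summary xfile of the counter for this extent. */ |
|     static loff_t |
|     xchk_rtsum_counter_offset( |
|             unsigned int    rbmblocks,      /* blocks in the rt bitmap */ |
|             unsigned int    bbno,           /* bitmap block of the extent */ |
|             uint64_t        len)            /* free length, in rt extents */ |
|     { |
|             unsigned int    log2len = ilog2(len); |
| |
|             return ((loff_t)log2len * rbmblocks + bbno) * sizeof(uint32_t); |
|     } |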
| |
| To repair the summary file, write the xfile contents into the temporary file |
| and use atomic mapping exchange to commit the new contents. |
| The temporary file is then reaped. |
| |
| The proposed patchset is the |
| `realtime summary repair |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_ |
| series. |
| |
| Case Study: Salvaging Extended Attributes |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| In XFS, extended attributes are implemented as a namespaced name-value store. |
| Values are limited in size to 64KiB, but there is no limit in the number of |
| names. |
| The attribute fork is unpartitioned, which means that the root of the attribute |
| structure is always in logical block zero, but attribute leaf blocks, dabtree |
| index blocks, and remote value blocks are intermixed. |
| Attribute leaf blocks contain variable-sized records that associate |
| user-provided names with the user-provided values. |
| Values larger than a block are allocated separate extents and written there. |
| If the leaf information expands beyond a single block, a directory/attribute |
| btree (``dabtree``) is created to map hashes of attribute names to entries |
| for fast lookup. |
| |
| Salvaging extended attributes is done as follows: |
| |
| 1. Walk the attr fork mappings of the file being repaired to find the attribute |
| leaf blocks. |
| When one is found, |
| |
| a. Walk the attr leaf block to find candidate keys. |
| When one is found, |
| |
| 1. Check the name for problems, and ignore the name if there are any. |
| |
| 2. Retrieve the value. |
| If that succeeds, add the name and value to the staging xfarray and |
| xfblob. |
| |
| 2. If the memory usage of the xfarray and xfblob exceed a certain amount of |
| memory or there are no more attr fork blocks to examine, unlock the file and |
| add the staged extended attributes to the temporary file. |
| |
| 3. Use atomic file mapping exchange to exchange the new and old extended |
| attribute structures. |
| The old attribute blocks are now attached to the temporary file. |
| |
| 4. Reap the temporary file. |
| |
| The proposed patchset is the |
| `extended attribute repair |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ |
| series. |
| |
| Fixing Directories |
| ------------------ |
| |
| Fixing directories is difficult with currently available filesystem features, |
| since directory entries are not redundant. |
| The offline repair tool scans all inodes to find files with nonzero link count, |
| and then it scans all directories to establish parentage of those linked files. |
| Damaged files and directories are zapped, and files with no parent are |
| moved to the ``/lost+found`` directory. |
| It does not try to salvage anything. |
| |
| The best that online repair can do at this time is to read directory data |
| blocks and salvage any dirents that look plausible, correct link counts, and |
| move orphans back into the directory tree. |
| The salvage process is discussed in the case study at the end of this section. |
| The :ref:`file link count fsck <nlinks>` code takes care of fixing link counts |
| and moving orphans to the ``/lost+found`` directory. |
| |
| Case Study: Salvaging Directories |
| ````````````````````````````````` |
| |
| Unlike extended attributes, directory blocks are all the same size, so |
| salvaging directories is straightforward: |
| |
| 1. Find the parent of the directory. |
| If the dotdot entry is readable, try to confirm that the alleged |
| parent has a child entry pointing back to the directory being repaired. |
| Otherwise, walk the filesystem to find it. |
| |
| 2. Walk the first partition of the data fork of the directory to find the |
| directory entry data blocks. |
| When one is found, |
| |
| a. Walk the directory data block to find candidate entries. |
| When an entry is found: |
| |
| i. Check the name for problems, and ignore the name if there are any. |
| |
| ii. Retrieve the inumber and grab the inode. |
| If that succeeds, add the name, inode number, and file type to the |
| staging xfarray and xfblob. |
| |
| 3. If the memory usage of the xfarray and xfblob exceed a certain amount of |
| memory or there are no more directory data blocks to examine, unlock the |
| directory and add the staged dirents into the temporary directory. |
| Truncate the staging files. |
| |
| 4. Use atomic file mapping exchange to exchange the new and old directory |
| structures. |
| The old directory blocks are now attached to the temporary file. |
| |
| 5. Reap the temporary file. |
| |
| **Future Work Question**: Should repair revalidate the dentry cache when |
| rebuilding a directory? |
| |
| *Answer*: Yes, it should. |
| |
| In theory it is necessary to scan all dentry cache entries for a directory to |
| ensure that one of the following applies: |
| |
| 1. The cached dentry reflects an ondisk dirent in the new directory. |
| |
| 2. The cached dentry no longer has a corresponding ondisk dirent in the new |
| directory and the dentry can be purged from the cache. |
| |
| 3. The cached dentry no longer has an ondisk dirent but the dentry cannot be |
| purged. |
| This is the problem case. |
| |
| Unfortunately, the current dentry cache design doesn't provide a means to walk |
| every child dentry of a specific directory, which makes this a hard problem. |
| There is no known solution. |
| |
| The proposed patchset is the |
| `directory repair |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_ |
| series. |
| |
| Parent Pointers |
| ``````````````` |
| |
| A parent pointer is a piece of file metadata that enables a user to locate the |
| file's parent directory without having to traverse the directory tree from the |
| root. |
| Without them, reconstruction of directory trees is hindered in much the same |
| way that the historic lack of reverse space mapping information once hindered |
| reconstruction of filesystem space metadata. |
| The parent pointer feature, however, makes total directory reconstruction |
| possible. |
| |
| XFS parent pointers contain the information needed to identify the |
| corresponding directory entry in the parent directory. |
| In other words, child files use extended attributes to store pointers to |
| parents in the form ``(dirent_name) → (parent_inum, parent_gen)``. |
| The directory checking process can be strengthened to ensure that the target of |
| each dirent also contains a parent pointer pointing back to the dirent. |
| Likewise, each parent pointer can be checked by ensuring that the target of |
| each parent pointer is a directory and that it contains a dirent matching |
| the parent pointer. |
| Both online and offline repair can use this strategy. |
| |
| +--------------------------------------------------------------------------+ |
| | **Historical Sidebar**: | |
| +--------------------------------------------------------------------------+ |
| | Directory parent pointers were first proposed as an XFS feature more | |
| | than a decade ago by SGI. | |
| | Each link from a parent directory to a child file is mirrored with an | |
| | extended attribute in the child that could be used to identify the | |
| | parent directory. | |
| | Unfortunately, this early implementation had major shortcomings and was | |
| | never merged into Linux XFS: | |
| | | |
| | 1. The XFS codebase of the late 2000s did not have the infrastructure to | |
| | enforce strong referential integrity in the directory tree. | |
| | It did not guarantee that a change in a forward link would always be | |
| | followed up with the corresponding change to the reverse links. | |
| | | |
| | 2. Referential integrity was not integrated into offline repair. | |
| | Checking and repairs were performed on mounted filesystems without | |
| | taking any kernel or inode locks to coordinate access. | |
| | It is not clear how this actually worked properly. | |
| | | |
| | 3. The extended attribute did not record the name of the directory entry | |
| | in the parent, so the SGI parent pointer implementation cannot be | |
| | used to reconnect the directory tree. | |
| | | |
| | 4. Extended attribute forks only support 65,536 extents, which means | |
| | that parent pointer attribute creation is likely to fail at some | |
| | point before the maximum file link count is achieved. | |
| | | |
| | The original parent pointer design was too unstable for something like | |
| | a file system repair to depend on. | |
| | Allison Henderson, Chandan Babu, and Catherine Hoang are working on a | |
| | second implementation that solves all shortcomings of the first. | |
| | During 2022, Allison introduced log intent items to track physical | |
| | manipulations of the extended attribute structures. | |
| | This solves the referential integrity problem by making it possible to | |
| | commit a dirent update and a parent pointer update in the same | |
| | transaction. | |
| | Chandan increased the maximum extent counts of both data and attribute | |
| | forks, thereby ensuring that the extended attribute structure can grow | |
| | to handle the maximum hardlink count of any file. | |
| | | |
| | For this second effort, the ondisk parent pointer format as originally | |
| | proposed was ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``. | |
| | The format was changed during development to eliminate the requirement | |
| | of repair tools needing to ensure that the ``dirent_pos`` field | |
| | always matched when reconstructing a directory. | |
| | | |
| | There were a few other ways to have solved that problem: | |
| | | |
| | 1. The field could be designated advisory, since the other three values | |
| | are sufficient to find the entry in the parent. | |
| | However, this makes indexed key lookup impossible while repairs are | |
| | ongoing. | |
| | | |
| | 2. We could allow creating directory entries at specified offsets, which | |
| | solves the referential integrity problem but runs the risk that | |
| | dirent creation will fail due to conflicts with the free space in the | |
| | directory. | |
| | | |
| | These conflicts could be resolved by appending the directory entry | |
| | and amending the xattr code to support updating an xattr key and | |
| | reindexing the dabtree, though this would have to be performed with | |
| | the parent directory still locked. | |
| | | |
| | 3. Same as above, but remove the old parent pointer entry and add a new | |
| | one atomically. | |
| | | |
| | 4. Change the ondisk xattr format to | |
| | ``(parent_inum, name) → (parent_gen)``, which would provide the attr | |
| | name uniqueness that we require, without forcing repair code to | |
| | update the dirent position. | |
| | Unfortunately, this requires changes to the xattr code to support | |
| | attr names as long as 263 bytes. | |
| | | |
| | 5. Change the ondisk xattr format to ``(parent_inum, hash(name)) → | |
| | (name, parent_gen)``. | |
| | If the hash is sufficiently resistant to collisions (e.g. sha256) | |
| | then this should provide the attr name uniqueness that we require. | |
| | Names shorter than 247 bytes could be stored directly. | |
| | | |
| | 6. Change the ondisk xattr format to ``(dirent_name) → (parent_inum, | |
| | parent_gen)``. This format doesn't require any of the complicated | |
| | nested name hashing of the previous suggestions. However, it was | |
| | discovered that multiple hardlinks to the same inode with the same | |
| | filename caused performance problems with hashed xattr lookups, so | |
| | the parent inumber is now xor'd into the hash index. | |
| | | |
| | In the end, it was decided that solution #6 was the most compact and the | |
| | most performant. A new hash function was designed for parent pointers. | |
| +--------------------------------------------------------------------------+ |
| |
| |
| Case Study: Repairing Directories with Parent Pointers |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and |
| a :ref:`directory entry live update hook <liveupdate>` as follows: |
| |
| 1. Set up a temporary directory for generating the new directory structure, |
| an xfblob for storing entry names, and an xfarray for stashing the fixed |
| size fields involved in a directory update: ``(child inumber, add vs. |
| remove, name cookie, ftype)``. |
| |
| 2. Set up an inode scanner and hook into the directory entry code to receive |
| updates on directory operations. |
| |
| 3. For each parent pointer found in each file scanned, decide if the parent |
| pointer references the directory of interest. |
| If so: |
| |
| a. Stash the parent pointer name and an addname entry for this dirent in the |
| xfblob and xfarray, respectively. |
| |
| b. When finished scanning that file or the kernel memory consumption exceeds |
| a threshold, flush the stashed updates to the temporary directory. |
| |
| 4. For each live directory update received via the hook, decide if the child |
| has already been scanned. |
| If so: |
| |
| a. Stash the parent pointer name and an addname or removename entry for this |
| dirent update in the xfblob and xfarray for later. |
| We cannot write directly to the temporary directory because hook |
| functions are not allowed to modify filesystem metadata. |
| Instead, we stash updates in the xfarray and rely on the scanner thread |
| to apply the stashed updates to the temporary directory. |
| |
| 5. When the scan is complete, replay any stashed entries in the xfarray. |
| |
| 6. When the scan is complete, atomically exchange the contents of the temporary |
| directory and the directory being repaired. |
| The temporary directory now contains the damaged directory structure. |
| |
| 7. Reap the temporary directory. |
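| |
| The fixed size record stashed in the xfarray for each directory update |
| might look like this hypothetical sketch; the field names mirror the |
| tuple in step 1 rather than the actual structure. |
| |
| .. code-block:: c |
| |
|     /* Hypothetical stashed directory update, one xfarray element each. */ |
|     struct xrep_dir_stash_entry { |
|             xfs_ino_t       child_ino;      /* child inumber */ |
|             xfblob_cookie   name_cookie;    /* dirent name in the xfblob */ |
|             uint8_t         ftype;          /* directory entry file type */ |
|             bool            add;            /* add vs. remove this dirent */ |
|     }; |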
| |
| The proposed patchset is the |
| `parent pointers directory repair |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_ |
| series. |
| |
| Case Study: Repairing Parent Pointers |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Online reconstruction of a file's parent pointer information works similarly to |
| directory reconstruction: |
| |
| 1. Set up a temporary file for generating a new extended attribute structure, |
| an xfblob for storing parent pointer names, and an xfarray for stashing the |
| fixed size fields involved in a parent pointer update: ``(parent inumber, |
| parent generation, add vs. remove, name cookie)``. |
| |
| 2. Set up an inode scanner and hook into the directory entry code to receive |
| updates on directory operations. |
| |
| 3. For each directory entry found in each directory scanned, decide if the |
| dirent references the file of interest. |
| If so: |
| |
| a. Stash the dirent name and an addpptr entry for this parent pointer in the |
| xfblob and xfarray, respectively. |
| |
| b. When finished scanning the directory or the kernel memory consumption |
| exceeds a threshold, flush the stashed updates to the temporary file. |
| |
| 4. For each live directory update received via the hook, decide if the parent |
| has already been scanned. |
| If so: |
| |
| a. Stash the dirent name and an addpptr or removepptr entry for this dirent |
| update in the xfblob and xfarray for later. |
| We cannot write parent pointers directly to the temporary file because |
| hook functions are not allowed to modify filesystem metadata. |
| Instead, we stash updates in the xfarray and rely on the scanner thread |
| to apply the stashed parent pointer updates to the temporary file. |
| |
| 5. When the scan is complete, replay any stashed entries in the xfarray. |
| |
| 6. Copy all non-parent pointer extended attributes to the temporary file. |
| |
| 7. When the scan is complete, atomically exchange the mappings of the attribute |
| forks of the temporary file and the file being repaired. |
| The temporary file now contains the damaged extended attribute structure. |
| |
| 8. Reap the temporary file. |
| |
| The proposed patchset is the |
| `parent pointers repair |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_ |
| series. |
| |
| Digression: Offline Checking of Parent Pointers |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Examining parent pointers in offline repair works differently because corrupt |
| files are erased long before directory tree connectivity checks are performed. |
| Parent pointer checks are therefore a second pass to be added to the existing |
| connectivity checks: |
| |
| 1. After the set of surviving files has been established (phase 6), |
| walk the surviving directories of each AG in the filesystem. |
| This is already performed as part of the connectivity checks. |
| |
| 2. For each directory entry found, |
| |
| a. If the name has already been stored in the xfblob, then use that cookie |
| and skip the next step. |
| |
| b. Otherwise, record the name in an xfblob, and remember the xfblob cookie. |
| Unique mappings are critical for |
| |
| 1. Deduplicating names to reduce memory usage, and |
| |
| 2. Creating a stable sort key for the parent pointer indexes so that the |
| parent pointer validation described below will work. |
| |
| c. Store ``(child_ag_inum, parent_inum, parent_gen, name_hash, name_len, |
| name_cookie)`` tuples in a per-AG in-memory slab. The ``name_hash`` |
| referenced in this section is the regular directory entry name hash, not |
| the specialized one used for parent pointer xattrs. |
| |
| 3. For each AG in the filesystem, |
| |
| a. Sort the per-AG tuple set in order of ``child_ag_inum``, ``parent_inum``, |
| ``name_hash``, and ``name_cookie``. |
| Having a single ``name_cookie`` for each ``name`` is critical for |
| handling the uncommon case of a directory containing multiple hardlinks |
| to the same file where all the names hash to the same value. |
| |
| b. For each inode in the AG, |
| |
| 1. Scan the inode for parent pointers. |
| For each parent pointer found, |
| |
| a. Validate the ondisk parent pointer. |
| If validation fails, move on to the next parent pointer in the |
| file. |
| |
| b. If the name has already been stored in the xfblob, then use that |
| cookie and skip the next step. |
| |
| c. Record the name in a per-file xfblob, and remember the xfblob |
| cookie. |
| |
| d. Store ``(parent_inum, parent_gen, name_hash, name_len, |
| name_cookie)`` tuples in a per-file slab. |
| |
| 2. Sort the per-file tuples in order of ``parent_inum``, ``name_hash``, |
| and ``name_cookie``. |
| |
| 3. Position one slab cursor at the start of the inode's records in the |
| per-AG tuple slab. |
| This should be trivial since the per-AG tuples are in child inumber |
| order. |
| |
| 4. Position a second slab cursor at the start of the per-file tuple slab. |
| |
5. Iterate the two cursors in lockstep, comparing the ``parent_inum``,
``name_hash``, and ``name_cookie`` fields of the records under each
cursor (a sketch of this merge follows the procedure):
| |
| a. If the per-AG cursor is at a lower point in the keyspace than the |
| per-file cursor, then the per-AG cursor points to a missing parent |
| pointer. |
| Add the parent pointer to the inode and advance the per-AG |
| cursor. |
| |
| b. If the per-file cursor is at a lower point in the keyspace than |
| the per-AG cursor, then the per-file cursor points to a dangling |
| parent pointer. |
| Remove the parent pointer from the inode and advance the per-file |
| cursor. |
| |
| c. Otherwise, both cursors point at the same parent pointer. |
Update the ``parent_gen`` component if necessary.
| Advance both cursors. |
| |
| 4. Move on to examining link counts, as we do today. |
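
The lockstep comparison in step 3b5 is an ordinary merge of two sorted lists.
The sketch below is illustrative only; the ``pptr_rec`` structure, the slab
cursor helpers, and the add/remove/update actions are hypothetical stand-ins
for the xfs_repair code, not the actual implementation:

.. code-block:: c

        #include <stdint.h>

        /*
         * Tuple format shared by the per-AG (dirent) and per-file (pptr)
         * slabs; the child inumber and name length fields are omitted.
         */
        struct pptr_rec {
                uint64_t        parent_inum;
                uint32_t        parent_gen;
                uint32_t        name_hash;
                uint64_t        name_cookie;
        };

        struct slab_cursor;

        /* Hypothetical helpers standing in for the slab and repair code. */
        const struct pptr_rec *slab_cursor_peek(struct slab_cursor *cur);
        void slab_cursor_advance(struct slab_cursor *cur);
        void add_missing_parent_pointer(const struct pptr_rec *rec);
        void remove_dangling_parent_pointer(const struct pptr_rec *rec);
        void update_parent_gen(const struct pptr_rec *want,
                               const struct pptr_rec *have);

        /* Both slabs are sorted by (parent_inum, name_hash, name_cookie). */
        static int
        pptr_rec_cmp(const struct pptr_rec *a, const struct pptr_rec *b)
        {
                if (a->parent_inum != b->parent_inum)
                        return a->parent_inum < b->parent_inum ? -1 : 1;
                if (a->name_hash != b->name_hash)
                        return a->name_hash < b->name_hash ? -1 : 1;
                if (a->name_cookie != b->name_cookie)
                        return a->name_cookie < b->name_cookie ? -1 : 1;
                return 0;
        }

        static void
        merge_parent_pointers(struct slab_cursor *ag_cur,
                              struct slab_cursor *file_cur)
        {
                for (;;) {
                        const struct pptr_rec *want = slab_cursor_peek(ag_cur);
                        const struct pptr_rec *have = slab_cursor_peek(file_cur);
                        int cmp;

                        if (!want && !have)
                                break;
                        if (!have)
                                cmp = -1;       /* only dirent records remain */
                        else if (!want)
                                cmp = 1;        /* only ondisk pptrs remain */
                        else
                                cmp = pptr_rec_cmp(want, have);

                        if (cmp < 0) {
                                /* dirent with no matching pptr: add one */
                                add_missing_parent_pointer(want);
                                slab_cursor_advance(ag_cur);
                        } else if (cmp > 0) {
                                /* pptr with no matching dirent: remove it */
                                remove_dangling_parent_pointer(have);
                                slab_cursor_advance(file_cur);
                        } else {
                                /* both agree; fix parent_gen if necessary */
                                update_parent_gen(want, have);
                                slab_cursor_advance(ag_cur);
                                slab_cursor_advance(file_cur);
                        }
                }
        }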
| |
| The proposed patchset is the |
| `offline parent pointers repair |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-fsck>`_ |
| series. |
| |
| Rebuilding directories from parent pointers in offline repair would be very |
| challenging because xfs_repair currently uses two single-pass scans of the |
| filesystem during phases 3 and 4 to decide which files are corrupt enough to be |
| zapped. |
| This scan would have to be converted into a multi-pass scan: |
| |
| 1. The first pass of the scan zaps corrupt inodes, forks, and attributes |
| much as it does now. |
| Corrupt directories are noted but not zapped. |
| |
| 2. The next pass records parent pointers pointing to the directories noted |
| as being corrupt in the first pass. |
| This second pass may have to happen after the phase 4 scan for duplicate |
| blocks, if phase 4 is also capable of zapping directories. |
| |
| 3. The third pass resets corrupt directories to an empty shortform directory. |
| Free space metadata has not been ensured yet, so repair cannot yet use the |
| directory building code in libxfs. |
| |
| 4. At the start of phase 6, space metadata have been rebuilt. |
| Use the parent pointer information recorded during step 2 to reconstruct |
| the dirents and add them to the now-empty directories. |
| |
| This code has not yet been constructed. |
| |
| .. _dirtree: |
| |
| Case Study: Directory Tree Structure |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
As mentioned earlier, the filesystem directory tree is supposed to be a
directed acyclic graph structure.
| However, each node in this graph is a separate ``xfs_inode`` object with its |
| own locks, which makes validating the tree qualities difficult. |
| Fortunately, non-directories are allowed to have multiple parents and cannot |
| have children, so only directories need to be scanned. |
| Directories typically constitute 5-10% of the files in a filesystem, which |
| reduces the amount of work dramatically. |
| |
| If the directory tree could be frozen, it would be easy to discover cycles and |
| disconnected regions by running a depth (or breadth) first search downwards |
| from the root directory and marking a bitmap for each directory found. |
| At any point in the walk, trying to set an already set bit means there is a |
| cycle. |
| After the scan completes, XORing the marked inode bitmap with the inode |
| allocation bitmap reveals disconnected inodes. |
| However, one of online repair's design goals is to avoid locking the entire |
| filesystem unless it's absolutely necessary. |
| Directory tree updates can move subtrees across the scanner wavefront on a live |
| filesystem, so the bitmap algorithm cannot be applied. |
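
For illustration only, the frozen-tree walk would look something like the
sketch below; the bitmap helpers and ``for_each_child_dir`` iterator are
hypothetical, and, as just noted, the approach only works if the directory
tree cannot change during the walk:

.. code-block:: c

        #include <errno.h>
        #include <stdbool.h>
        #include <stdint.h>

        struct bitmap;                          /* hypothetical inumber bitmap */
        bool bitmap_test_and_set(struct bitmap *bmap, uint64_t ino);

        /* Hypothetical: call @fn on every child subdirectory of @dir_ino. */
        int for_each_child_dir(uint64_t dir_ino,
                               int (*fn)(uint64_t child_ino, void *priv),
                               void *priv);

        static int
        mark_subdir(uint64_t child_ino, void *priv)
        {
                struct bitmap   *seen = priv;

                /* Trying to set an already-set bit means there is a cycle. */
                if (bitmap_test_and_set(seen, child_ino))
                        return -ELOOP;

                /* Walk downwards; depth-first and breadth-first both work. */
                return for_each_child_dir(child_ino, mark_subdir, seen);
        }

        /*
         * After this returns, XORing @seen against the bitmap of allocated
         * directory inodes reveals the disconnected directories.
         */
        static int
        check_frozen_tree(struct bitmap *seen, uint64_t root_ino)
        {
                if (bitmap_test_and_set(seen, root_ino))
                        return -ELOOP;
                return for_each_child_dir(root_ino, mark_subdir, seen);
        }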
| |
| Directory parent pointers enable an incremental approach to validation of the |
| tree structure. |
| Instead of using one thread to scan the entire filesystem, multiple threads can |
| walk from individual subdirectories upwards towards the root. |
| For this to work, all directory entries and parent pointers must be internally |
| consistent, each directory entry must have a parent pointer, and the link |
| counts of all directories must be correct. |
| Each scanner thread must be able to take the IOLOCK of an alleged parent |
| directory while holding the IOLOCK of the child directory to prevent either |
| directory from being moved within the tree. |
| This is not possible since the VFS does not take the IOLOCK of a child |
| subdirectory when moving that subdirectory, so instead the scanner stabilizes |
| the parent -> child relationship by taking the ILOCKs and installing a dirent |
| update hook to detect changes. |
| |
| The scanning process uses a dirent hook to detect changes to the directories |
| mentioned in the scan data. |
| The scan works as follows: |
| |
| 1. For each subdirectory in the filesystem, |
| |
| a. For each parent pointer of that subdirectory, |
| |
| 1. Create a path object for that parent pointer, and mark the |
| subdirectory inode number in the path object's bitmap. |
| |
| 2. Record the parent pointer name and inode number in a path structure. |
| |
| 3. If the alleged parent is the subdirectory being scrubbed, the path is |
| a cycle. |
| Mark the path for deletion and repeat step 1a with the next |
| subdirectory parent pointer. |
| |
| 4. Try to mark the alleged parent inode number in a bitmap in the path |
| object. |
| If the bit is already set, then there is a cycle in the directory |
| tree. |
| Mark the path as a cycle and repeat step 1a with the next subdirectory |
| parent pointer. |
| |
| 5. Load the alleged parent. |
| If the alleged parent is not a linked directory, abort the scan |
| because the parent pointer information is inconsistent. |
| |
| 6. For each parent pointer of this alleged ancestor directory, |
| |
| a. Record the parent pointer name and inode number in the path object |
| if no parent has been set for that level. |
| |
| b. If an ancestor has more than one parent, mark the path as corrupt. |
| Repeat step 1a with the next subdirectory parent pointer. |
| |
| c. Repeat steps 1a3-1a6 for the ancestor identified in step 1a6a. |
| This repeats until the directory tree root is reached or no parents |
| are found. |
| |
| 7. If the walk terminates at the root directory, mark the path as ok. |
| |
| 8. If the walk terminates without reaching the root, mark the path as |
| disconnected. |
| |
| 2. If the directory entry update hook triggers, check all paths already found |
| by the scan. |
| If the entry matches part of a path, mark that path and the scan stale. |
| When the scanner thread sees that the scan has been marked stale, it deletes |
| all scan data and starts over. |
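
A simplified version of the upward walk performed for each parent pointer in
step 1a might look like the following sketch.
The bitmap and parent pointer helpers are hypothetical stand-ins, and the name
recording, locking, and staleness handling are omitted:

.. code-block:: c

        #include <stdbool.h>
        #include <stdint.h>

        enum path_outcome {
                PATH_OK,                /* walk reached the root directory */
                PATH_CYCLE,             /* same directory seen twice */
                PATH_CORRUPT,           /* ancestor with more than one parent */
                PATH_DISCONNECTED,      /* ancestor with no parents at all */
        };

        struct bitmap;                  /* hypothetical per-path inumber bitmap */
        bool bitmap_test_and_set(struct bitmap *bmap, uint64_t ino);

        /* Hypothetical: return the parent pointer count; fill in the first. */
        unsigned int count_parent_pointers(uint64_t ino, uint64_t *parent_ino);

        static enum path_outcome
        walk_one_path(struct bitmap *path_seen, uint64_t child_ino,
                      uint64_t alleged_parent, uint64_t root_ino)
        {
                uint64_t        cur = alleged_parent;

                bitmap_test_and_set(path_seen, child_ino);
                for (;;) {
                        uint64_t        next;
                        unsigned int    nr_parents;

                        /* Seeing any directory twice on one path is a cycle. */
                        if (bitmap_test_and_set(path_seen, cur))
                                return PATH_CYCLE;
                        if (cur == root_ino)
                                return PATH_OK;

                        nr_parents = count_parent_pointers(cur, &next);
                        if (nr_parents == 0)
                                return PATH_DISCONNECTED;
                        if (nr_parents > 1)
                                return PATH_CORRUPT;
                        cur = next;
                }
        }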
| |
| Repairing the directory tree works as follows: |
| |
| 1. Walk each path of the target subdirectory. |
| |
| a. Corrupt paths and cycle paths are counted as suspect. |
| |
| b. Paths already marked for deletion are counted as bad. |
| |
| c. Paths that reached the root are counted as good. |
| |
| 2. If the subdirectory is either the root directory or has zero link count, |
| delete all incoming directory entries in the immediate parents. |
| Repairs are complete. |
| |
| 3. If the subdirectory has exactly one path, set the dotdot entry to the |
| parent and exit. |
| |
| 4. If the subdirectory has at least one good path, delete all the other |
| incoming directory entries in the immediate parents. |
| |
| 5. If the subdirectory has no good paths and more than one suspect path, delete |
| all the other incoming directory entries in the immediate parents. |
| |
| 6. If the subdirectory has zero paths, attach it to the lost and found. |
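
The decisions in steps 2-6 form a short ladder that is evaluated in order.
Everything in the sketch below -- the counters, the structure, and the helper
functions -- is a hypothetical stand-in for the repair code:

.. code-block:: c

        #include <stdbool.h>

        struct dirtree_repair {
                bool            is_root;
                unsigned int    link_count;
                unsigned int    nr_paths;       /* all paths found by the scan */
                unsigned int    nr_good;        /* paths that reached the root */
                unsigned int    nr_suspect;     /* cycle or corrupt paths */
        };

        /* Hypothetical repair actions. */
        void remove_all_incoming_dirents(struct dirtree_repair *dt);
        void set_dotdot_to_only_parent(struct dirtree_repair *dt);
        void remove_all_other_incoming_dirents(struct dirtree_repair *dt);
        void attach_to_lost_and_found(struct dirtree_repair *dt);

        static void
        repair_dirtree(struct dirtree_repair *dt)
        {
                if (dt->is_root || dt->link_count == 0) {       /* step 2 */
                        remove_all_incoming_dirents(dt);
                        return;
                }
                if (dt->nr_paths == 1) {                        /* step 3 */
                        set_dotdot_to_only_parent(dt);
                        return;
                }
                if (dt->nr_good > 0) {                          /* step 4 */
                        remove_all_other_incoming_dirents(dt);
                        return;
                }
                if (dt->nr_suspect > 1) {                       /* step 5 */
                        remove_all_other_incoming_dirents(dt);
                        return;
                }
                if (dt->nr_paths == 0)                          /* step 6 */
                        attach_to_lost_and_found(dt);
        }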
| |
| The proposed patches are in the |
| `directory tree repair |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-directory-tree>`_ |
| series. |
| |
| |
| .. _orphanage: |
| |
| The Orphanage |
| ------------- |
| |
| Filesystems present files as a directed, and hopefully acyclic, graph. |
| In other words, a tree. |
| The root of the filesystem is a directory, and each entry in a directory points |
| downwards either to more subdirectories or to non-directory files. |
Unfortunately, a disruption in the directory graph pointers results in a
disconnected graph, which makes files impossible to access via regular path
resolution.
| |
| Without parent pointers, the directory parent pointer online scrub code can |
| detect a dotdot entry pointing to a parent directory that doesn't have a link |
| back to the child directory and the file link count checker can detect a file |
| that isn't pointed to by any directory in the filesystem. |
| If such a file has a positive link count, the file is an orphan. |
| |
| With parent pointers, directories can be rebuilt by scanning parent pointers |
| and parent pointers can be rebuilt by scanning directories. |
| This should reduce the incidence of files ending up in ``/lost+found``. |
| |
| When orphans are found, they should be reconnected to the directory tree. |
| Offline fsck solves the problem by creating a directory ``/lost+found`` to |
| serve as an orphanage, and linking orphan files into the orphanage by using the |
| inumber as the name. |
| Reparenting a file to the orphanage does not reset any of its permissions or |
| ACLs. |
| |
| This process is more involved in the kernel than it is in userspace. |
| The directory and file link count repair setup functions must use the regular |
| VFS mechanisms to create the orphanage directory with all the necessary |
| security attributes and dentry cache entries, just like a regular directory |
| tree modification. |
| |
| Orphaned files are adopted by the orphanage as follows: |
| |
| 1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup function |
| to try to ensure that the lost and found directory actually exists. |
| This also attaches the orphanage directory to the scrub context. |
| |
| 2. If the decision is made to reconnect a file, take the IOLOCK of both the |
| orphanage and the file being reattached. |
| The ``xrep_orphanage_iolock_two`` function follows the inode locking |
| strategy discussed earlier. |
| |
| 3. Use ``xrep_adoption_trans_alloc`` to reserve resources to the repair |
| transaction. |
| |
| 4. Call ``xrep_orphanage_compute_name`` to compute the new name in the |
| orphanage. |
| |
| 5. If the adoption is going to happen, call ``xrep_adoption_reparent`` to |
| reparent the orphaned file into the lost and found and invalidate the dentry |
| cache. |
| |
| 6. Call ``xrep_adoption_finish`` to commit any filesystem updates, release the |
| orphanage ILOCK, and clean the scrub transaction. Call |
| ``xrep_adoption_commit`` to commit the updates and the scrub transaction. |
| |
| 7. If a runtime error happens, call ``xrep_adoption_cancel`` to release all |
| resources. |
| |
| The proposed patches are in the |
| `orphanage adoption |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_ |
| series. |
| |
| 6. Userspace Algorithms and Data Structures |
| =========================================== |
| |
| This section discusses the key algorithms and data structures of the userspace |
| program, ``xfs_scrub``, that provide the ability to drive metadata checks and |
| repairs in the kernel, verify file data, and look for other potential problems. |
| |
| .. _scrubcheck: |
| |
| Checking Metadata |
| ----------------- |
| |
| Recall the :ref:`phases of fsck work<scrubphases>` outlined earlier. |
| That structure follows naturally from the data dependencies designed into the |
| filesystem from its beginnings in 1993. |
| In XFS, there are several groups of metadata dependencies: |
| |
| a. Filesystem summary counts depend on consistency within the inode indices, |
| the allocation group space btrees, and the realtime volume space |
| information. |
| |
| b. Quota resource counts depend on consistency within the quota file data |
| forks, inode indices, inode records, and the forks of every file on the |
| system. |
| |
| c. The naming hierarchy depends on consistency within the directory and |
| extended attribute structures. |
| This includes file link counts. |
| |
| d. Directories, extended attributes, and file data depend on consistency within |
| the file forks that map directory and extended attribute data to physical |
| storage media. |
| |
e. The file forks depend on consistency within inode records and the space
| metadata indices of the allocation groups and the realtime volume. |
| This includes quota and realtime metadata files. |
| |
f. Inode records depend on consistency within the inode metadata indices.
| |
| g. Realtime space metadata depend on the inode records and data forks of the |
| realtime metadata inodes. |
| |
| h. The allocation group metadata indices (free space, inodes, reference count, |
| and reverse mapping btrees) depend on consistency within the AG headers and |
| between all the AG metadata btrees. |
| |
| i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support |
| for online fsck functionality. |
| |
| Therefore, a metadata dependency graph is a convenient way to schedule checking |
| operations in the ``xfs_scrub`` program: |
| |
- Phase 1 checks that the provided path maps to an XFS filesystem and detects
the kernel's scrubbing abilities, which validates group (i).
| |
| - Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue. |
| |
| - Phase 3 scans inodes in parallel. |
| For each inode, groups (f), (e), and (d) are checked, in that order. |
| |
| - Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6 |
| may run reliably. |
| |
| - Phase 5 starts by checking groups (b) and (c) in parallel before moving on |
| to checking names. |
| |
| - Phase 6 depends on groups (i) through (b) to find file data blocks to verify, |
| to read them, and to report which blocks of which files are affected. |
| |
| - Phase 7 checks group (a), having validated everything else. |
| |
| Notice that the data dependencies between groups are enforced by the structure |
| of the program flow. |
| |
| Parallel Inode Scans |
| -------------------- |
| |
| An XFS filesystem can easily contain hundreds of millions of inodes. |
| Given that XFS targets installations with large high-performance storage, |
| it is desirable to scrub inodes in parallel to minimize runtime, particularly |
| if the program has been invoked manually from a command line. |
| This requires careful scheduling to keep the threads as evenly loaded as |
| possible. |
| |
| Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single |
| workqueue and scheduled a single workqueue item per AG. |
| Each workqueue item walked the inode btree (with ``XFS_IOC_INUMBERS``) to find |
| inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to gather enough |
| information to construct file handles. |
| The file handle was then passed to a function to generate scrub items for each |
| metadata object of each inode. |
| This simple algorithm leads to thread balancing problems in phase 3 if the |
| filesystem contains one AG with a few large sparse files and the rest of the |
| AGs contain many smaller files. |
| The inode scan dispatch function was not sufficiently granular; it should have |
| been dispatching at the level of individual inodes, or, to constrain memory |
| consumption, inode btree records. |
| |
| Thanks to Dave Chinner, bounded workqueues in userspace enable ``xfs_scrub`` to |
| avoid this problem with ease by adding a second workqueue. |
| Just like before, the first workqueue is seeded with one workqueue item per AG, |
| and it uses INUMBERS to find inode btree chunks. |
| The second workqueue, however, is configured with an upper bound on the number |
| of items that can be waiting to be run. |
Each inode btree chunk found by the first workqueue's workers is queued to the
second workqueue, and it is this second workqueue that queries BULKSTAT,
| creates a file handle, and passes it to a function to generate scrub items for |
| each metadata object of each inode. |
| If the second workqueue is too full, the workqueue add function blocks the |
| first workqueue's workers until the backlog eases. |
| This doesn't completely solve the balancing problem, but reduces it enough to |
| move on to more pressing issues. |
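
The back-pressure mechanism is that of a classic bounded producer/consumer
queue.
The pthread sketch below conveys the idea only; it is not the xfsprogs
workqueue implementation, and the bound of 64 items is an arbitrary choice:

.. code-block:: c

        #include <pthread.h>

        #define MAX_BACKLOG     64      /* arbitrary bound for illustration */

        struct bounded_queue {
                pthread_mutex_t lock;
                pthread_cond_t  not_full;
                pthread_cond_t  not_empty;
                void            *items[MAX_BACKLOG];
                unsigned int    head, tail, count;
        };

        /* Called by the INUMBERS workers; blocks while the backlog is full. */
        void
        bq_add(struct bounded_queue *bq, void *item)
        {
                pthread_mutex_lock(&bq->lock);
                while (bq->count == MAX_BACKLOG)
                        pthread_cond_wait(&bq->not_full, &bq->lock);
                bq->items[bq->tail] = item;
                bq->tail = (bq->tail + 1) % MAX_BACKLOG;
                bq->count++;
                pthread_cond_signal(&bq->not_empty);
                pthread_mutex_unlock(&bq->lock);
        }

        /* Called by the BULKSTAT workers that generate scrub items. */
        void *
        bq_get(struct bounded_queue *bq)
        {
                void    *item;

                pthread_mutex_lock(&bq->lock);
                while (bq->count == 0)
                        pthread_cond_wait(&bq->not_empty, &bq->lock);
                item = bq->items[bq->head];
                bq->head = (bq->head + 1) % MAX_BACKLOG;
                bq->count--;
                pthread_cond_signal(&bq->not_full);
                pthread_mutex_unlock(&bq->lock);
                return item;
        }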
| |
| The proposed patchsets are the scrub |
| `performance tweaks |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaks>`_ |
| and the |
| `inode scan rebalance |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance>`_ |
| series. |
| |
| .. _scrubrepair: |
| |
| Scheduling Repairs |
| ------------------ |
| |
| During phase 2, corruptions and inconsistencies reported in any AGI header or |
| inode btree are repaired immediately, because phase 3 relies on proper |
| functioning of the inode indices to find inodes to scan. |
| Failed repairs are rescheduled to phase 4. |
| Problems reported in any other space metadata are deferred to phase 4. |
| Optimization opportunities are always deferred to phase 4, no matter their |
| origin. |
| |
| During phase 3, corruptions and inconsistencies reported in any part of a |
| file's metadata are repaired immediately if all space metadata were validated |
| during phase 2. |
| Repairs that fail or cannot be repaired immediately are scheduled for phase 4. |
| |
| In the original design of ``xfs_scrub``, it was thought that repairs would be |
| so infrequent that the ``struct xfs_scrub_metadata`` objects used to |
| communicate with the kernel could also be used as the primary object to |
| schedule repairs. |
| With recent increases in the number of optimizations possible for a given |
| filesystem object, it became much more memory-efficient to track all eligible |
| repairs for a given filesystem object with a single repair item. |
| Each repair item represents a single lockable object -- AGs, metadata files, |
| individual inodes, or a class of summary information. |
| |
Phase 4 is responsible for scheduling a lot of repair work as quickly as is
practical.
| The :ref:`data dependencies <scrubcheck>` outlined earlier still apply, which |
| means that ``xfs_scrub`` must try to complete the repair work scheduled by |
| phase 2 before trying repair work scheduled by phase 3. |
| The repair process is as follows: |
| |
| 1. Start a round of repair with a workqueue and enough workers to keep the CPUs |
| as busy as the user desires. |
| |
| a. For each repair item queued by phase 2, |
| |
| i. Ask the kernel to repair everything listed in the repair item for a |
| given filesystem object. |
| |
| ii. Make a note if the kernel made any progress in reducing the number |
| of repairs needed for this object. |
| |
| iii. If the object no longer requires repairs, revalidate all metadata |
| associated with this object. |
| If the revalidation succeeds, drop the repair item. |
| If not, requeue the item for more repairs. |
| |
| b. If any repairs were made, jump back to 1a to retry all the phase 2 items. |
| |
| c. For each repair item queued by phase 3, |
| |
| i. Ask the kernel to repair everything listed in the repair item for a |
| given filesystem object. |
| |
| ii. Make a note if the kernel made any progress in reducing the number |
| of repairs needed for this object. |
| |
| iii. If the object no longer requires repairs, revalidate all metadata |
| associated with this object. |
| If the revalidation succeeds, drop the repair item. |
| If not, requeue the item for more repairs. |
| |
| d. If any repairs were made, jump back to 1c to retry all the phase 3 items. |
| |
| 2. If step 1 made any repair progress of any kind, jump back to step 1 to start |
| another round of repair. |
| |
| 3. If there are items left to repair, run them all serially one more time. |
| Complain if the repairs were not successful, since this is the last chance |
| to repair anything. |
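
In outline, the phase 4 driver follows the shape of the sketch below; the
``scrub_ctx`` and ``repair_list`` types and the two helpers are hypothetical
stand-ins, and the workqueue parallelism is omitted:

.. code-block:: c

        #include <stdbool.h>

        struct repair_list;

        struct scrub_ctx {
                struct repair_list      *phase2_items;
                struct repair_list      *phase3_items;
        };

        /*
         * Hypothetical: ask the kernel to repair each item on @list once,
         * revalidate and drop items that come back clean, and return true
         * if any progress was made.
         */
        bool repair_list_once(struct scrub_ctx *ctx, struct repair_list *list);

        /* Hypothetical: final serial pass that complains about leftovers. */
        void repair_list_final(struct scrub_ctx *ctx);

        static void
        phase4_repair(struct scrub_ctx *ctx)
        {
                bool    any_progress;

                do {
                        any_progress = false;

                        /* Retry the phase 2 (space metadata) items first... */
                        while (repair_list_once(ctx, ctx->phase2_items))
                                any_progress = true;

                        /* ...then the per-file items queued by phase 3. */
                        while (repair_list_once(ctx, ctx->phase3_items))
                                any_progress = true;
                } while (any_progress);

                /* Last chance: run anything left serially and complain. */
                repair_list_final(ctx);
        }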
| |
| Corruptions and inconsistencies encountered during phases 5 and 7 are repaired |
| immediately. |
| Corrupt file data blocks reported by phase 6 cannot be recovered by the |
| filesystem. |
| |
| The proposed patchsets are the |
| `repair warning improvements |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings>`_, |
| refactoring of the |
| `repair data dependency |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps>`_ |
| and |
| `object tracking |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking>`_, |
| and the |
| `repair scheduling |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling>`_ |
| improvement series. |
| |
| Checking Names for Confusable Unicode Sequences |
| ----------------------------------------------- |
| |
| If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of |
| phase 4, it moves on to phase 5, which checks for suspicious looking names in |
| the filesystem. |
| These names consist of the filesystem label, names in directory entries, and |
| the names of extended attributes. |
| Like most Unix filesystems, XFS imposes the sparest of constraints on the |
| contents of a name: |
| |
| - Slashes and null bytes are not allowed in directory entries. |
| |
| - Null bytes are not allowed in userspace-visible extended attributes. |
| |
| - Null bytes are not allowed in the filesystem label. |
| |
| Directory entries and attribute keys store the length of the name explicitly |
| ondisk, which means that nulls are not name terminators. |
| For this section, the term "naming domain" refers to any place where names are |
| presented together -- all the names in a directory, or all the attributes of a |
| file. |
| |
| Although the Unix naming constraints are very permissive, the reality of most |
| modern-day Linux systems is that programs work with Unicode character code |
| points to support international languages. |
| These programs typically encode those code points in UTF-8 when interfacing |
| with the C library because the kernel expects null-terminated names. |
| In the common case, therefore, names found in an XFS filesystem are actually |
| UTF-8 encoded Unicode data. |
| |
To maximize its expressiveness, the Unicode standard defines separate code
points for various characters that render similarly or identically in writing
systems around the world.
| For example, the character "Cyrillic Small Letter A" U+0430 "а" often renders |
| identically to "Latin Small Letter A" U+0061 "a". |
| |
| The standard also permits characters to be constructed in multiple ways -- |
| either by using a defined code point, or by combining one code point with |
| various combining marks. |
For example, the character "Angstrom Sign" U+212B "Å" can also be expressed
| as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring Above" |
| U+030A "◌̊". |
| Both sequences render identically. |
| |
| Like the standards that preceded it, Unicode also defines various control |
| characters to alter the presentation of text. |
| For example, the character "Right-to-Left Override" U+202E can trick some |
| programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as "mootxt.png". |
| A second category of rendering problems involves whitespace characters. |
| If the character "Zero Width Space" U+200B is encountered in a file name, the |
| name will render identically to a name that does not have the zero width |
| space. |
| |
| If two names within a naming domain have different byte sequences but render |
| identically, a user may be confused by it. |
| The kernel, in its indifference to upper level encoding schemes, permits this. |
| Most filesystem drivers persist the byte sequence names that are given to them |
| by the VFS. |
| |
| Techniques for detecting confusable names are explained in great detail in |
| sections 4 and 5 of the |
| `Unicode Security Mechanisms <https://unicode.org/reports/tr39/>`_ |
| document. |
| When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the |
| Unicode normalization form NFD in conjunction with the confusable name |
| detection component of |
| `libicu <https://github.com/unicode-org/icu>`_ |
to identify names within a directory or within a file's extended attributes
that could be confused for each other.
| Names are also checked for control characters, non-rendering characters, and |
| mixing of bidirectional characters. |
| All of these potential issues are reported to the system administrator during |
| phase 5. |
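
A bare-bones use of the libicu spoof checker looks like the sketch below.
It only demonstrates the library calls involved; the real ``xfs_scrub`` phase
5 code also applies NFD normalization and the control and bidirectional
character checks described above:

.. code-block:: c

        /* link with -licuuc -licui18n */
        #include <stdio.h>
        #include <unicode/uspoof.h>

        /* Return nonzero if two UTF-8 names could be confused for each other. */
        static int
        names_confusable(const char *name1, const char *name2)
        {
                UErrorCode      err = U_ZERO_ERROR;
                USpoofChecker   *sc = uspoof_open(&err);
                int32_t         result;

                if (U_FAILURE(err))
                        return 0;
                uspoof_setChecks(sc, USPOOF_CONFUSABLE, &err);
                result = uspoof_areConfusableUTF8(sc, name1, -1, name2, -1,
                                &err);
                uspoof_close(sc);
                return U_SUCCESS(err) && result != 0;
        }

        int
        main(void)
        {
                /* "b\xd0\xb0nk" uses Cyrillic Small Letter A; "bank" is Latin. */
                printf("%d\n", names_confusable("b\xd0\xb0nk", "bank"));
                return 0;
        }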
| |
| Media Verification of File Data Extents |
| --------------------------------------- |
| |
| The system administrator can elect to initiate a media scan of all file data |
| blocks. |
This scan runs after validation of all filesystem metadata (except for the
summary counters) as phase 6.
| The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map |
| to find areas that are allocated to file data fork extents. |
| Gaps between data fork extents that are smaller than 64k are treated as if |
| they were data fork extents to reduce the command setup overhead. |
| When the space map scan accumulates a region larger than 32MB, a media |
| verification request is sent to the disk as a directio read of the raw block |
| device. |
| |
If the verification read fails, ``xfs_scrub`` retries with single-block reads
to narrow down the failure to the specific region of the media and records the
error.
| When it has finished issuing verification requests, it again uses the space |
| mapping ioctl to map the recorded media errors back to metadata structures |
| and report what has been lost. |
| For media errors in blocks owned by files, parent pointers can be used to |
| construct file paths from inode numbers for user-friendly reporting. |
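
A stripped-down sketch of the space map scan follows.
It omits error handling and the 64k gap coalescing and 32MB batching policy,
and ``verify_region`` is a hypothetical stand-in for the code that issues the
direct reads and records media errors:

.. code-block:: c

        #include <stdint.h>
        #include <stdlib.h>
        #include <sys/ioctl.h>
        #include <linux/fsmap.h>

        #define NR_RECS         128

        /* Hypothetical: directio-read this byte range and log any errors. */
        void verify_region(int disk_fd, uint64_t phys, uint64_t len);

        static void
        scan_media(int fs_fd, int disk_fd)
        {
                struct fsmap_head       *head;
                struct fsmap            *rec;
                unsigned int            i;

                head = calloc(1, sizeof(*head) + NR_RECS * sizeof(struct fsmap));
                if (!head)
                        return;
                head->fmh_count = NR_RECS;
                /* Query the entire keyspace of the filesystem. */
                head->fmh_keys[1].fmr_device = UINT32_MAX;
                head->fmh_keys[1].fmr_flags = UINT32_MAX;
                head->fmh_keys[1].fmr_physical = UINT64_MAX;
                head->fmh_keys[1].fmr_owner = UINT64_MAX;
                head->fmh_keys[1].fmr_offset = UINT64_MAX;

                while (ioctl(fs_fd, FS_IOC_GETFSMAP, head) == 0 &&
                       head->fmh_entries > 0) {
                        for (i = 0; i < head->fmh_entries; i++) {
                                rec = &head->fmh_recs[i];

                                /* Skip free space, fs metadata, attr blocks,
                                 * and block map btree blocks. */
                                if (rec->fmr_flags & (FMR_OF_SPECIAL_OWNER |
                                                      FMR_OF_EXTENT_MAP |
                                                      FMR_OF_ATTR_FORK))
                                        continue;
                                verify_region(disk_fd, rec->fmr_physical,
                                                rec->fmr_length);
                        }

                        rec = &head->fmh_recs[head->fmh_entries - 1];
                        if (rec->fmr_flags & FMR_OF_LAST)
                                break;
                        /* Continue the query after the last record returned. */
                        head->fmh_keys[0] = *rec;
                }
                free(head);
        }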
| |
| 7. Conclusion and Future Work |
| ============================= |
| |
| It is hoped that the reader of this document has followed the designs laid out |
| in this document and now has some familiarity with how XFS performs online |
| rebuilding of its metadata indices, and how filesystem users can interact with |
| that functionality. |
| Although the scope of this work is daunting, it is hoped that this guide will |
| make it easier for code readers to understand what has been built, for whom it |
| has been built, and why. |
| Please feel free to contact the XFS mailing list with questions. |
| |
| XFS_IOC_EXCHANGE_RANGE |
| ---------------------- |
| |
| As discussed earlier, a second frontend to the atomic file mapping exchange |
| mechanism is a new ioctl call that userspace programs can use to commit updates |
| to files atomically. |
| This frontend has been out for review for several years now, though the |
| necessary refinements to online repair and lack of customer demand mean that |
| the proposal has not been pushed very hard. |
| |
| File Content Exchanges with Regular User Files |
| `````````````````````````````````````````````` |
| |
| As mentioned earlier, XFS has long had the ability to swap extents between |
| files, which is used almost exclusively by ``xfs_fsr`` to defragment files. |
| The earliest form of this was the fork swap mechanism, where the entire |
| contents of data forks could be exchanged between two files by exchanging the |
| raw bytes in each inode fork's immediate area. |
| When XFS v5 came along with self-describing metadata, this old mechanism grew |
| some log support to continue rewriting the owner fields of BMBT blocks during |
| log recovery. |
| When the reverse mapping btree was later added to XFS, the only way to maintain |
| the consistency of the fork mappings with the reverse mapping index was to |
| develop an iterative mechanism that used deferred bmap and rmap operations to |
| swap mappings one at a time. |
| This mechanism is identical to steps 2-3 from the procedure above except for |
| the new tracking items, because the atomic file mapping exchange mechanism is |
| an iteration of an existing mechanism and not something totally novel. |
| For the narrow case of file defragmentation, the file contents must be |
| identical, so the recovery guarantees are not much of a gain. |
| |
Atomic file content exchanges are much more flexible than the existing swapext
implementations because they can guarantee that the caller never sees a mix of
old and new contents even after a crash, and they can operate on two arbitrary
file fork ranges.
| The extra flexibility enables several new use cases: |
| |
| - **Atomic commit of file writes**: A userspace process opens a file that it |
| wants to update. |
| Next, it opens a temporary file and calls the file clone operation to reflink |
| the first file's contents into the temporary file. |
| Writes to the original file should instead be written to the temporary file. |
| Finally, the process calls the atomic file mapping exchange system call |
| (``XFS_IOC_EXCHANGE_RANGE``) to exchange the file contents, thereby |
| committing all of the updates to the original file, or none of them. |
| |
| .. _exchrange_if_unchanged: |
| |
| - **Transactional file updates**: The same mechanism as above, but the caller |
| only wants the commit to occur if the original file's contents have not |
| changed. |
| To make this happen, the calling process snapshots the file modification and |
| change timestamps of the original file before reflinking its data to the |
| temporary file. |
| When the program is ready to commit the changes, it passes the timestamps |
| into the kernel as arguments to the atomic file mapping exchange system call. |
| The kernel only commits the changes if the provided timestamps match the |
| original file. |
| A new ioctl (``XFS_IOC_COMMIT_RANGE``) is provided to perform this. |
| |
| - **Emulation of atomic block device writes**: Export a block device with a |
| logical sector size matching the filesystem block size to force all writes |
| to be aligned to the filesystem block size. |
| Stage all writes to a temporary file, and when that is complete, call the |
| atomic file mapping exchange system call with a flag to indicate that holes |
| in the temporary file should be ignored. |
| This emulates an atomic device write in software, and can support arbitrary |
| scattered writes. |
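
A rough sketch of the first use case follows.
The ``struct xfs_exchange_range`` layout, the ``XFS_EXCHANGE_RANGE_TO_EOF``
flag, and the ``<xfs/xfs_fs.h>`` header path are assumptions to be checked
against the headers on the target system; error handling is omitted:

.. code-block:: c

        #define _GNU_SOURCE             /* O_TMPFILE */
        #include <fcntl.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <unistd.h>
        #include <linux/fs.h>           /* FICLONE */
        #include <xfs/xfs_fs.h>         /* XFS_IOC_EXCHANGE_RANGE (assumed) */

        /* Atomically replace the contents of @path with an updated copy. */
        static int
        atomic_update(const char *path, const char *mntpt)
        {
                struct xfs_exchange_range       xchg;
                int                             fd, tmpfd;

                fd = open(path, O_RDWR);
                /* The temporary file must live on the same filesystem. */
                tmpfd = open(mntpt, O_TMPFILE | O_RDWR, 0600);

                /* Share the original contents with the temporary file. */
                ioctl(tmpfd, FICLONE, fd);

                /* ...write the updates to tmpfd; COW leaves @path intact... */

                /* Exchange the two files' mappings in one atomic operation. */
                memset(&xchg, 0, sizeof(xchg));
                xchg.file1_fd = tmpfd;
                xchg.flags = XFS_EXCHANGE_RANGE_TO_EOF;
                ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &xchg);

                close(tmpfd);
                close(fd);
                return 0;
        }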
| |
| Vectorized Scrub |
| ---------------- |
| |
| As it turns out, the :ref:`refactoring <scrubrepair>` of repair items mentioned |
| earlier was a catalyst for enabling a vectorized scrub system call. |
| Since 2018, the cost of making a kernel call has increased considerably on some |
| systems to mitigate the effects of speculative execution attacks. |
| This incentivizes program authors to make as few system calls as possible to |
| reduce the number of times an execution path crosses a security boundary. |
| |
| With vectorized scrub, userspace pushes to the kernel the identity of a |
| filesystem object, a list of scrub types to run against that object, and a |
| simple representation of the data dependencies between the selected scrub |
| types. |
| The kernel executes as much of the caller's plan as it can until it hits a |
| dependency that cannot be satisfied due to a corruption, and tells userspace |
| how much was accomplished. |
| It is hoped that ``io_uring`` will pick up enough of this functionality that |
| online fsck can use that instead of adding a separate vectored scrub system |
| call to XFS. |
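
Conceptually, such a request bundles the object identity with an ordered list
of scrub types and barriers.
The layout below is purely illustrative and is not the actual ioctl ABI:

.. code-block:: c

        #include <linux/types.h>

        /* One scrub type to run, or a barrier separating dependency groups. */
        struct scrubv_item {
                __u32   sv_type;        /* XFS_SCRUB_TYPE_* or a barrier */
                __u32   sv_flags;       /* in: repair?  out: corruption state */
                __s32   sv_ret;         /* out: result for this item */
                __u32   sv_reserved;
        };

        /* Identity of the filesystem object plus the caller's plan. */
        struct scrubv_head {
                __u64                   svh_ino;
                __u32                   svh_gen;
                __u32                   svh_agno;
                __u16                   svh_nr;         /* number of items */
                __u16                   svh_pad[3];
                struct scrubv_item      svh_items[];
        };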
| |
| The relevant patchsets are the |
| `kernel vectorized scrub |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_ |
| and |
| `userspace vectorized scrub |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_ |
| series. |
| |
| Quality of Service Targets for Scrub |
| ------------------------------------ |
| |
| One serious shortcoming of the online fsck code is that the amount of time that |
| it can spend in the kernel holding resource locks is basically unbounded. |
| Userspace is allowed to send a fatal signal to the process which will cause |
| ``xfs_scrub`` to exit when it reaches a good stopping point, but there's no way |
| for userspace to provide a time budget to the kernel. |
| Given that the scrub codebase has helpers to detect fatal signals, it shouldn't |
| be too much work to allow userspace to specify a timeout for a scrub/repair |
| operation and abort the operation if it exceeds budget. |
| However, most repair functions have the property that once they begin to touch |
| ondisk metadata, the operation cannot be cancelled cleanly, after which a QoS |
| timeout is no longer useful. |
| |
| Defragmenting Free Space |
| ------------------------ |
| |
| Over the years, many XFS users have requested the creation of a program to |
| clear a portion of the physical storage underlying a filesystem so that it |
| becomes a contiguous chunk of free space. |
| Call this free space defragmenter ``clearspace`` for short. |
| |
| The first piece the ``clearspace`` program needs is the ability to read the |
| reverse mapping index from userspace. |
| This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl. |
| The second piece it needs is a new fallocate mode |
| (``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and |
| maps it to a file. |
| Call this file the "space collector" file. |
| The third piece is the ability to force an online repair. |
| |
| To clear all the metadata out of a portion of physical storage, clearspace |
| uses the new fallocate map-freespace call to map any free space in that region |
| to the space collector file. |
| Next, clearspace finds all metadata blocks in that region by way of |
| ``GETFSMAP`` and issues forced repair requests on the data structure. |
| This often results in the metadata being rebuilt somewhere that is not being |
| cleared. |
| After each relocation, clearspace calls the "map free space" function again to |
| collect any newly freed space in the region being cleared. |
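
Expressed in terms of the proposed interfaces, the metadata-clearing loop is
roughly the following; ``FALLOC_FL_MAP_FREE_SPACE`` is the proposed fallocate
mode described above (not in any released header), and
``force_repairs_in_region`` stands in for the GETFSMAP walk plus the forced
repair requests:

.. code-block:: c

        #define _GNU_SOURCE             /* fallocate */
        #include <fcntl.h>
        #include <stdbool.h>
        #include <stdint.h>

        /*
         * Hypothetical: walk GETFSMAP output for the region, issue forced
         * repair requests for each metadata structure found, and return
         * true if anything was relocated.
         */
        bool force_repairs_in_region(int fs_fd, uint64_t start, uint64_t len);

        static void
        clear_metadata(int fs_fd, int collector_fd, uint64_t start, uint64_t len)
        {
                bool    relocated;

                do {
                        /* Trap newly freed space in the space collector file.
                         * FALLOC_FL_MAP_FREE_SPACE comes from the proposed
                         * kernel patches. */
                        fallocate(collector_fd, FALLOC_FL_MAP_FREE_SPACE,
                                        start, len);

                        /* Rebuild metadata, which usually lands elsewhere. */
                        relocated = force_repairs_in_region(fs_fd, start, len);
                } while (relocated);
        }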
| |
| To clear all the file data out of a portion of the physical storage, clearspace |
| uses the FSMAP information to find relevant file data blocks. |
| Having identified a good target, it uses the ``FICLONERANGE`` call on that part |
| of the file to try to share the physical space with a dummy file. |
| Cloning the extent means that the original owners cannot overwrite the |
| contents; any changes will be written somewhere else via copy-on-write. |
Clearspace makes its own copy of the frozen extent in an area that is not being
cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic file content exchanges
<exchrange_if_unchanged>` feature) to change the target file's data extent
mapping away from the area being cleared.
| When all other mappings have been moved, clearspace reflinks the space into the |
| space collector file so that it becomes unavailable. |
| |
| There are further optimizations that could apply to the above algorithm. |
| To clear a piece of physical storage that has a high sharing factor, it is |
| strongly desirable to retain this sharing factor. |
| In fact, these extents should be moved first to maximize sharing factor after |
| the operation completes. |
| To make this work smoothly, clearspace needs a new ioctl |
| (``FS_IOC_GETREFCOUNTS``) to report reference count information to userspace. |
| With the refcount information exposed, clearspace can quickly find the longest, |
| most shared data extents in the filesystem, and target them first. |
| |
| **Future Work Question**: How might the filesystem move inode chunks? |
| |
| *Answer*: To move inode chunks, Dave Chinner constructed a prototype program |
| that creates a new file with the old contents and then locklessly runs around |
| the filesystem updating directory entries. |
| The operation cannot complete if the filesystem goes down. |
| That problem isn't totally insurmountable: create an inode remapping table |
| hidden behind a jump label, and a log item that tracks the kernel walking the |
| filesystem to update directory entries. |
| The trouble is, the kernel can't do anything about open files, since it cannot |
| revoke them. |
| |
| **Future Work Question**: Can static keys be used to minimize the cost of |
| supporting ``revoke()`` on XFS files? |
| |
| *Answer*: Yes. |
| Until the first revocation, the bailout code need not be in the call path at |
| all. |
| |
| The relevant patchsets are the |
| `kernel freespace defrag |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_ |
| and |
| `userspace freespace defrag |
| <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_ |
| series. |
| |
| Shrinking Filesystems |
| --------------------- |
| |
| Removing the end of the filesystem ought to be a simple matter of evacuating |
| the data and metadata at the end of the filesystem, and handing the freed space |
| to the shrink code. |
That requires an evacuation of the space at the end of the filesystem, which
is a use of free space defragmentation!