| ==================== |
| Changes since 2.5.0: |
| ==================== |
| |
| --- |
| |
| **recommended** |
| |
| New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(), |
| sb_set_blocksize() and sb_min_blocksize(). |
| |
| Use them. |
| |
| (sb_find_get_block() replaces 2.4's get_hash_table()) |
| |
| --- |
| |
| **recommended** |
| |
| New methods: ->alloc_inode() and ->destroy_inode(). |
| |
| Remove inode->u.foo_inode_i |
| |
| Declare:: |
| |
| struct foo_inode_info { |
| /* fs-private stuff */ |
| struct inode vfs_inode; |
| }; |
| static inline struct foo_inode_info *FOO_I(struct inode *inode) |
| { |
| return list_entry(inode, struct foo_inode_info, vfs_inode); |
| } |
| |
| Use FOO_I(inode) instead of &inode->u.foo_inode_i; |
| |
| Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate |
| foo_inode_info and return the address of ->vfs_inode, the latter should free |
| FOO_I(inode) (see in-tree filesystems for examples). |
| |
| Make them ->alloc_inode and ->destroy_inode in your super_operations. |
| |
| Keep in mind that now you need explicit initialization of private data |
| typically between calling iget_locked() and unlocking the inode. |
| |
| At some point that will become mandatory. |
| |
| **mandatory** |
| |
| The foo_inode_info should always be allocated through alloc_inode_sb() rather |
| than kmem_cache_alloc() or kmalloc() related to set up the inode reclaim context |
| correctly. |
| |
| --- |
| |
| **mandatory** |
| |
| Change of file_system_type method (->read_super to ->get_sb) |
| |
| ->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV. |
| |
| Turn your foo_read_super() into a function that would return 0 in case of |
| success and negative number in case of error (-EINVAL unless you have more |
| informative error value to report). Call it foo_fill_super(). Now declare:: |
| |
| int foo_get_sb(struct file_system_type *fs_type, |
| int flags, const char *dev_name, void *data, struct vfsmount *mnt) |
| { |
| return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super, |
| mnt); |
| } |
| |
| (or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of |
| filesystem). |
| |
| Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as |
| foo_get_sb. |
| |
| --- |
| |
| **mandatory** |
| |
| Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames. |
| Most likely there is no need to change anything, but if you relied on |
| global exclusion between renames for some internal purpose - you need to |
| change your internal locking. Otherwise exclusion warranties remain the |
| same (i.e. parents and victim are locked, etc.). |
| |
| --- |
| |
| **informational** |
| |
| Now we have the exclusion between ->lookup() and directory removal (by |
| ->rmdir() and ->rename()). If you used to need that exclusion and do |
| it by internal locking (most of filesystems couldn't care less) - you |
| can relax your locking. |
| |
| --- |
| |
| **mandatory** |
| |
| ->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(), |
| ->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename() |
| and ->readdir() are called without BKL now. Grab it on entry, drop upon return |
| - that will guarantee the same locking you used to have. If your method or its |
| parts do not need BKL - better yet, now you can shift lock_kernel() and |
| unlock_kernel() so that they would protect exactly what needs to be |
| protected. |
| |
| --- |
| |
| **mandatory** |
| |
| BKL is also moved from around sb operations. BKL should have been shifted into |
| individual fs sb_op functions. If you don't need it, remove it. |
| |
| --- |
| |
| **informational** |
| |
| check for ->link() target not being a directory is done by callers. Feel |
| free to drop it... |
| |
| --- |
| |
| **informational** |
| |
| ->link() callers hold ->i_mutex on the object we are linking to. Some of your |
| problems might be over... |
| |
| --- |
| |
| **mandatory** |
| |
| new file_system_type method - kill_sb(superblock). If you are converting |
| an existing filesystem, set it according to ->fs_flags:: |
| |
| FS_REQUIRES_DEV - kill_block_super |
| FS_LITTER - kill_litter_super |
| neither - kill_anon_super |
| |
| FS_LITTER is gone - just remove it from fs_flags. |
| |
| --- |
| |
| **mandatory** |
| |
| FS_SINGLE is gone (actually, that had happened back when ->get_sb() |
| went in - and hadn't been documented ;-/). Just remove it from fs_flags |
| (and see ->get_sb() entry for other actions). |
| |
| --- |
| |
| **mandatory** |
| |
| ->setattr() is called without BKL now. Caller _always_ holds ->i_mutex, so |
| watch for ->i_mutex-grabbing code that might be used by your ->setattr(). |
| Callers of notify_change() need ->i_mutex now. |
| |
| --- |
| |
| **recommended** |
| |
| New super_block field ``struct export_operations *s_export_op`` for |
| explicit support for exporting, e.g. via NFS. The structure is fully |
| documented at its declaration in include/linux/fs.h, and in |
| Documentation/filesystems/nfs/exporting.rst. |
| |
| Briefly it allows for the definition of decode_fh and encode_fh operations |
| to encode and decode filehandles, and allows the filesystem to use |
| a standard helper function for decode_fh, and provide file-system specific |
| support for this helper, particularly get_parent. |
| |
| It is planned that this will be required for exporting once the code |
| settles down a bit. |
| |
| **mandatory** |
| |
| s_export_op is now required for exporting a filesystem. |
| isofs, ext2, ext3, reiserfs, fat |
| can be used as examples of very different filesystems. |
| |
| --- |
| |
| **mandatory** |
| |
| iget4() and the read_inode2 callback have been superseded by iget5_locked() |
| which has the following prototype:: |
| |
| struct inode *iget5_locked(struct super_block *sb, unsigned long ino, |
| int (*test)(struct inode *, void *), |
| int (*set)(struct inode *, void *), |
| void *data); |
| |
| 'test' is an additional function that can be used when the inode |
| number is not sufficient to identify the actual file object. 'set' |
| should be a non-blocking function that initializes those parts of a |
| newly created inode to allow the test function to succeed. 'data' is |
| passed as an opaque value to both test and set functions. |
| |
| When the inode has been created by iget5_locked(), it will be returned with the |
| I_NEW flag set and will still be locked. The filesystem then needs to finalize |
| the initialization. Once the inode is initialized it must be unlocked by |
| calling unlock_new_inode(). |
| |
| The filesystem is responsible for setting (and possibly testing) i_ino |
| when appropriate. There is also a simpler iget_locked function that |
| just takes the superblock and inode number as arguments and does the |
| test and set for you. |
| |
| e.g.:: |
| |
| inode = iget_locked(sb, ino); |
| if (inode->i_state & I_NEW) { |
| err = read_inode_from_disk(inode); |
| if (err < 0) { |
| iget_failed(inode); |
| return err; |
| } |
| unlock_new_inode(inode); |
| } |
| |
| Note that if the process of setting up a new inode fails, then iget_failed() |
| should be called on the inode to render it dead, and an appropriate error |
| should be passed back to the caller. |
| |
| --- |
| |
| **recommended** |
| |
| ->getattr() finally getting used. See instances in nfs, minix, etc. |
| |
| --- |
| |
| **mandatory** |
| |
| ->revalidate() is gone. If your filesystem had it - provide ->getattr() |
| and let it call whatever you had as ->revlidate() + (for symlinks that |
| had ->revalidate()) add calls in ->follow_link()/->readlink(). |
| |
| --- |
| |
| **mandatory** |
| |
| ->d_parent changes are not protected by BKL anymore. Read access is safe |
| if at least one of the following is true: |
| |
| * filesystem has no cross-directory rename() |
| * we know that parent had been locked (e.g. we are looking at |
| ->d_parent of ->lookup() argument). |
| * we are called from ->rename(). |
| * the child's ->d_lock is held |
| |
| Audit your code and add locking if needed. Notice that any place that is |
| not protected by the conditions above is risky even in the old tree - you |
| had been relying on BKL and that's prone to screwups. Old tree had quite |
| a few holes of that kind - unprotected access to ->d_parent leading to |
| anything from oops to silent memory corruption. |
| |
| --- |
| |
| **mandatory** |
| |
| FS_NOMOUNT is gone. If you use it - just set SB_NOUSER in flags |
| (see rootfs for one kind of solution and bdev/socket/pipe for another). |
| |
| --- |
| |
| **recommended** |
| |
| Use bdev_read_only(bdev) instead of is_read_only(kdev). The latter |
| is still alive, but only because of the mess in drivers/s390/block/dasd.c. |
| As soon as it gets fixed is_read_only() will die. |
| |
| --- |
| |
| **mandatory** |
| |
| ->permission() is called without BKL now. Grab it on entry, drop upon |
| return - that will guarantee the same locking you used to have. If |
| your method or its parts do not need BKL - better yet, now you can |
| shift lock_kernel() and unlock_kernel() so that they would protect |
| exactly what needs to be protected. |
| |
| --- |
| |
| **mandatory** |
| |
| ->statfs() is now called without BKL held. BKL should have been |
| shifted into individual fs sb_op functions where it's not clear that |
| it's safe to remove it. If you don't need it, remove it. |
| |
| --- |
| |
| **mandatory** |
| |
| is_read_only() is gone; use bdev_read_only() instead. |
| |
| --- |
| |
| **mandatory** |
| |
| destroy_buffers() is gone; use invalidate_bdev(). |
| |
| --- |
| |
| **mandatory** |
| |
| fsync_dev() is gone; use fsync_bdev(). NOTE: lvm breakage is |
| deliberate; as soon as struct block_device * is propagated in a reasonable |
| way by that code fixing will become trivial; until then nothing can be |
| done. |
| |
| **mandatory** |
| |
| block truncatation on error exit from ->write_begin, and ->direct_IO |
| moved from generic methods (block_write_begin, cont_write_begin, |
| nobh_write_begin, blockdev_direct_IO*) to callers. Take a look at |
| ext2_write_failed and callers for an example. |
| |
| **mandatory** |
| |
| ->truncate is gone. The whole truncate sequence needs to be |
| implemented in ->setattr, which is now mandatory for filesystems |
| implementing on-disk size changes. Start with a copy of the old inode_setattr |
| and vmtruncate, and the reorder the vmtruncate + foofs_vmtruncate sequence to |
| be in order of zeroing blocks using block_truncate_page or similar helpers, |
| size update and on finally on-disk truncation which should not fail. |
| setattr_prepare (which used to be inode_change_ok) now includes the size checks |
| for ATTR_SIZE and must be called in the beginning of ->setattr unconditionally. |
| |
| **mandatory** |
| |
| ->clear_inode() and ->delete_inode() are gone; ->evict_inode() should |
| be used instead. It gets called whenever the inode is evicted, whether it has |
| remaining links or not. Caller does *not* evict the pagecache or inode-associated |
| metadata buffers; the method has to use truncate_inode_pages_final() to get rid |
| of those. Caller makes sure async writeback cannot be running for the inode while |
| (or after) ->evict_inode() is called. |
| |
| ->drop_inode() returns int now; it's called on final iput() with |
| inode->i_lock held and it returns true if filesystems wants the inode to be |
| dropped. As before, generic_drop_inode() is still the default and it's been |
| updated appropriately. generic_delete_inode() is also alive and it consists |
| simply of return 1. Note that all actual eviction work is done by caller after |
| ->drop_inode() returns. |
| |
| As before, clear_inode() must be called exactly once on each call of |
| ->evict_inode() (as it used to be for each call of ->delete_inode()). Unlike |
| before, if you are using inode-associated metadata buffers (i.e. |
| mark_buffer_dirty_inode()), it's your responsibility to call |
| invalidate_inode_buffers() before clear_inode(). |
| |
| NOTE: checking i_nlink in the beginning of ->write_inode() and bailing out |
| if it's zero is not *and* *never* *had* *been* enough. Final unlink() and iput() |
| may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly |
| free the on-disk inode, you may end up doing that while ->write_inode() is writing |
| to it. |
| |
| --- |
| |
| **mandatory** |
| |
| .d_delete() now only advises the dcache as to whether or not to cache |
| unreferenced dentries, and is now only called when the dentry refcount goes to |
| 0. Even on 0 refcount transition, it must be able to tolerate being called 0, |
| 1, or more times (eg. constant, idempotent). |
| |
| --- |
| |
| **mandatory** |
| |
| .d_compare() calling convention and locking rules are significantly |
| changed. Read updated documentation in Documentation/filesystems/vfs.rst (and |
| look at examples of other filesystems) for guidance. |
| |
| --- |
| |
| **mandatory** |
| |
| .d_hash() calling convention and locking rules are significantly |
| changed. Read updated documentation in Documentation/filesystems/vfs.rst (and |
| look at examples of other filesystems) for guidance. |
| |
| --- |
| |
| **mandatory** |
| |
| dcache_lock is gone, replaced by fine grained locks. See fs/dcache.c |
| for details of what locks to replace dcache_lock with in order to protect |
| particular things. Most of the time, a filesystem only needs ->d_lock, which |
| protects *all* the dcache state of a given dentry. |
| |
| --- |
| |
| **mandatory** |
| |
| Filesystems must RCU-free their inodes, if they can have been accessed |
| via rcu-walk path walk (basically, if the file can have had a path name in the |
| vfs namespace). |
| |
| Even though i_dentry and i_rcu share storage in a union, we will |
| initialize the former in inode_init_always(), so just leave it alone in |
| the callback. It used to be necessary to clean it there, but not anymore |
| (starting at 3.2). |
| |
| --- |
| |
| **recommended** |
| |
| vfs now tries to do path walking in "rcu-walk mode", which avoids |
| atomic operations and scalability hazards on dentries and inodes (see |
| Documentation/filesystems/path-lookup.txt). d_hash and d_compare changes |
| (above) are examples of the changes required to support this. For more complex |
| filesystem callbacks, the vfs drops out of rcu-walk mode before the fs call, so |
| no changes are required to the filesystem. However, this is costly and loses |
| the benefits of rcu-walk mode. We will begin to add filesystem callbacks that |
| are rcu-walk aware, shown below. Filesystems should take advantage of this |
| where possible. |
| |
| --- |
| |
| **mandatory** |
| |
| d_revalidate is a callback that is made on every path element (if |
| the filesystem provides it), which requires dropping out of rcu-walk mode. This |
| may now be called in rcu-walk mode (nd->flags & LOOKUP_RCU). -ECHILD should be |
| returned if the filesystem cannot handle rcu-walk. See |
| Documentation/filesystems/vfs.rst for more details. |
| |
| permission is an inode permission check that is called on many or all |
| directory inodes on the way down a path walk (to check for exec permission). It |
| must now be rcu-walk aware (mask & MAY_NOT_BLOCK). See |
| Documentation/filesystems/vfs.rst for more details. |
| |
| --- |
| |
| **mandatory** |
| |
| In ->fallocate() you must check the mode option passed in. If your |
| filesystem does not support hole punching (deallocating space in the middle of a |
| file) you must return -EOPNOTSUPP if FALLOC_FL_PUNCH_HOLE is set in mode. |
| Currently you can only have FALLOC_FL_PUNCH_HOLE with FALLOC_FL_KEEP_SIZE set, |
| so the i_size should not change when hole punching, even when puching the end of |
| a file off. |
| |
| --- |
| |
| **mandatory** |
| |
| ->get_sb() is gone. Switch to use of ->mount(). Typically it's just |
| a matter of switching from calling ``get_sb_``... to ``mount_``... and changing |
| the function type. If you were doing it manually, just switch from setting |
| ->mnt_root to some pointer to returning that pointer. On errors return |
| ERR_PTR(...). |
| |
| --- |
| |
| **mandatory** |
| |
| ->permission() and generic_permission()have lost flags |
| argument; instead of passing IPERM_FLAG_RCU we add MAY_NOT_BLOCK into mask. |
| |
| generic_permission() has also lost the check_acl argument; ACL checking |
| has been taken to VFS and filesystems need to provide a non-NULL |
| ->i_op->get_inode_acl to read an ACL from disk. |
| |
| --- |
| |
| **mandatory** |
| |
| If you implement your own ->llseek() you must handle SEEK_HOLE and |
| SEEK_DATA. You can handle this by returning -EINVAL, but it would be nicer to |
| support it in some way. The generic handler assumes that the entire file is |
| data and there is a virtual hole at the end of the file. So if the provided |
| offset is less than i_size and SEEK_DATA is specified, return the same offset. |
| If the above is true for the offset and you are given SEEK_HOLE, return the end |
| of the file. If the offset is i_size or greater return -ENXIO in either case. |
| |
| **mandatory** |
| |
| If you have your own ->fsync() you must make sure to call |
| filemap_write_and_wait_range() so that all dirty pages are synced out properly. |
| You must also keep in mind that ->fsync() is not called with i_mutex held |
| anymore, so if you require i_mutex locking you must make sure to take it and |
| release it yourself. |
| |
| --- |
| |
| **mandatory** |
| |
| d_alloc_root() is gone, along with a lot of bugs caused by code |
| misusing it. Replacement: d_make_root(inode). On success d_make_root(inode) |
| allocates and returns a new dentry instantiated with the passed in inode. |
| On failure NULL is returned and the passed in inode is dropped so the reference |
| to inode is consumed in all cases and failure handling need not do any cleanup |
| for the inode. If d_make_root(inode) is passed a NULL inode it returns NULL |
| and also requires no further error handling. Typical usage is:: |
| |
| inode = foofs_new_inode(....); |
| s->s_root = d_make_root(inode); |
| if (!s->s_root) |
| /* Nothing needed for the inode cleanup */ |
| return -ENOMEM; |
| ... |
| |
| --- |
| |
| **mandatory** |
| |
| The witch is dead! Well, 2/3 of it, anyway. ->d_revalidate() and |
| ->lookup() do *not* take struct nameidata anymore; just the flags. |
| |
| --- |
| |
| **mandatory** |
| |
| ->create() doesn't take ``struct nameidata *``; unlike the previous |
| two, it gets "is it an O_EXCL or equivalent?" boolean argument. Note that |
| local filesystems can ignore this argument - they are guaranteed that the |
| object doesn't exist. It's remote/distributed ones that might care... |
| |
| --- |
| |
| **mandatory** |
| |
| FS_REVAL_DOT is gone; if you used to have it, add ->d_weak_revalidate() |
| in your dentry operations instead. |
| |
| --- |
| |
| **mandatory** |
| |
| vfs_readdir() is gone; switch to iterate_dir() instead |
| |
| --- |
| |
| **mandatory** |
| |
| ->readdir() is gone now; switch to ->iterate_shared() |
| |
| **mandatory** |
| |
| vfs_follow_link has been removed. Filesystems must use nd_set_link |
| from ->follow_link for normal symlinks, or nd_jump_link for magic |
| /proc/<pid> style links. |
| |
| --- |
| |
| **mandatory** |
| |
| iget5_locked()/ilookup5()/ilookup5_nowait() test() callback used to be |
| called with both ->i_lock and inode_hash_lock held; the former is *not* |
| taken anymore, so verify that your callbacks do not rely on it (none |
| of the in-tree instances did). inode_hash_lock is still held, |
| of course, so they are still serialized wrt removal from inode hash, |
| as well as wrt set() callback of iget5_locked(). |
| |
| --- |
| |
| **mandatory** |
| |
| d_materialise_unique() is gone; d_splice_alias() does everything you |
| need now. Remember that they have opposite orders of arguments ;-/ |
| |
| --- |
| |
| **mandatory** |
| |
| f_dentry is gone; use f_path.dentry, or, better yet, see if you can avoid |
| it entirely. |
| |
| --- |
| |
| **mandatory** |
| |
| never call ->read() and ->write() directly; use __vfs_{read,write} or |
| wrappers; instead of checking for ->write or ->read being NULL, look for |
| FMODE_CAN_{WRITE,READ} in file->f_mode. |
| |
| --- |
| |
| **mandatory** |
| |
| do _not_ use new_sync_{read,write} for ->read/->write; leave it NULL |
| instead. |
| |
| --- |
| |
| **mandatory** |
| ->aio_read/->aio_write are gone. Use ->read_iter/->write_iter. |
| |
| --- |
| |
| **recommended** |
| |
| for embedded ("fast") symlinks just set inode->i_link to wherever the |
| symlink body is and use simple_follow_link() as ->follow_link(). |
| |
| --- |
| |
| **mandatory** |
| |
| calling conventions for ->follow_link() have changed. Instead of returning |
| cookie and using nd_set_link() to store the body to traverse, we return |
| the body to traverse and store the cookie using explicit void ** argument. |
| nameidata isn't passed at all - nd_jump_link() doesn't need it and |
| nd_[gs]et_link() is gone. |
| |
| --- |
| |
| **mandatory** |
| |
| calling conventions for ->put_link() have changed. It gets inode instead of |
| dentry, it does not get nameidata at all and it gets called only when cookie |
| is non-NULL. Note that link body isn't available anymore, so if you need it, |
| store it as cookie. |
| |
| --- |
| |
| **mandatory** |
| |
| any symlink that might use page_follow_link_light/page_put_link() must |
| have inode_nohighmem(inode) called before anything might start playing with |
| its pagecache. No highmem pages should end up in the pagecache of such |
| symlinks. That includes any preseeding that might be done during symlink |
| creation. page_symlink() will honour the mapping gfp flags, so once |
| you've done inode_nohighmem() it's safe to use, but if you allocate and |
| insert the page manually, make sure to use the right gfp flags. |
| |
| --- |
| |
| **mandatory** |
| |
| ->follow_link() is replaced with ->get_link(); same API, except that |
| |
| * ->get_link() gets inode as a separate argument |
| * ->get_link() may be called in RCU mode - in that case NULL |
| dentry is passed |
| |
| --- |
| |
| **mandatory** |
| |
| ->get_link() gets struct delayed_call ``*done`` now, and should do |
| set_delayed_call() where it used to set ``*cookie``. |
| |
| ->put_link() is gone - just give the destructor to set_delayed_call() |
| in ->get_link(). |
| |
| --- |
| |
| **mandatory** |
| |
| ->getxattr() and xattr_handler.get() get dentry and inode passed separately. |
| dentry might be yet to be attached to inode, so do _not_ use its ->d_inode |
| in the instances. Rationale: !@#!@# security_d_instantiate() needs to be |
| called before we attach dentry to inode. |
| |
| --- |
| |
| **mandatory** |
| |
| symlinks are no longer the only inodes that do *not* have i_bdev/i_cdev/ |
| i_pipe/i_link union zeroed out at inode eviction. As the result, you can't |
| assume that non-NULL value in ->i_nlink at ->destroy_inode() implies that |
| it's a symlink. Checking ->i_mode is really needed now. In-tree we had |
| to fix shmem_destroy_callback() that used to take that kind of shortcut; |
| watch out, since that shortcut is no longer valid. |
| |
| --- |
| |
| **mandatory** |
| |
| ->i_mutex is replaced with ->i_rwsem now. inode_lock() et.al. work as |
| they used to - they just take it exclusive. However, ->lookup() may be |
| called with parent locked shared. Its instances must not |
| |
| * use d_instantiate) and d_rehash() separately - use d_add() or |
| d_splice_alias() instead. |
| * use d_rehash() alone - call d_add(new_dentry, NULL) instead. |
| * in the unlikely case when (read-only) access to filesystem |
| data structures needs exclusion for some reason, arrange it |
| yourself. None of the in-tree filesystems needed that. |
| * rely on ->d_parent and ->d_name not changing after dentry has |
| been fed to d_add() or d_splice_alias(). Again, none of the |
| in-tree instances relied upon that. |
| |
| We are guaranteed that lookups of the same name in the same directory |
| will not happen in parallel ("same" in the sense of your ->d_compare()). |
| Lookups on different names in the same directory can and do happen in |
| parallel now. |
| |
| --- |
| |
| **mandatory** |
| |
| ->iterate_shared() is added. |
| Exclusion on struct file level is still provided (as well as that |
| between it and lseek on the same struct file), but if your directory |
| has been opened several times, you can get these called in parallel. |
| Exclusion between that method and all directory-modifying ones is |
| still provided, of course. |
| |
| If you have any per-inode or per-dentry in-core data structures modified |
| by ->iterate_shared(), you might need something to serialize the access |
| to them. If you do dcache pre-seeding, you'll need to switch to |
| d_alloc_parallel() for that; look for in-tree examples. |
| |
| --- |
| |
| **mandatory** |
| |
| ->atomic_open() calls without O_CREAT may happen in parallel. |
| |
| --- |
| |
| **mandatory** |
| |
| ->setxattr() and xattr_handler.set() get dentry and inode passed separately. |
| The xattr_handler.set() gets passed the user namespace of the mount the inode |
| is seen from so filesystems can idmap the i_uid and i_gid accordingly. |
| dentry might be yet to be attached to inode, so do _not_ use its ->d_inode |
| in the instances. Rationale: !@#!@# security_d_instantiate() needs to be |
| called before we attach dentry to inode and !@#!@##!@$!$#!@#$!@$!@$ smack |
| ->d_instantiate() uses not just ->getxattr() but ->setxattr() as well. |
| |
| --- |
| |
| **mandatory** |
| |
| ->d_compare() doesn't get parent as a separate argument anymore. If you |
| used it for finding the struct super_block involved, dentry->d_sb will |
| work just as well; if it's something more complicated, use dentry->d_parent. |
| Just be careful not to assume that fetching it more than once will yield |
| the same value - in RCU mode it could change under you. |
| |
| --- |
| |
| **mandatory** |
| |
| ->rename() has an added flags argument. Any flags not handled by the |
| filesystem should result in EINVAL being returned. |
| |
| --- |
| |
| |
| **recommended** |
| |
| ->readlink is optional for symlinks. Don't set, unless filesystem needs |
| to fake something for readlink(2). |
| |
| --- |
| |
| **mandatory** |
| |
| ->getattr() is now passed a struct path rather than a vfsmount and |
| dentry separately, and it now has request_mask and query_flags arguments |
| to specify the fields and sync type requested by statx. Filesystems not |
| supporting any statx-specific features may ignore the new arguments. |
| |
| --- |
| |
| **mandatory** |
| |
| ->atomic_open() calling conventions have changed. Gone is ``int *opened``, |
| along with FILE_OPENED/FILE_CREATED. In place of those we have |
| FMODE_OPENED/FMODE_CREATED, set in file->f_mode. Additionally, return |
| value for 'called finish_no_open(), open it yourself' case has become |
| 0, not 1. Since finish_no_open() itself is returning 0 now, that part |
| does not need any changes in ->atomic_open() instances. |
| |
| --- |
| |
| **mandatory** |
| |
| alloc_file() has become static now; two wrappers are to be used instead. |
| alloc_file_pseudo(inode, vfsmount, name, flags, ops) is for the cases |
| when dentry needs to be created; that's the majority of old alloc_file() |
| users. Calling conventions: on success a reference to new struct file |
| is returned and callers reference to inode is subsumed by that. On |
| failure, ERR_PTR() is returned and no caller's references are affected, |
| so the caller needs to drop the inode reference it held. |
| alloc_file_clone(file, flags, ops) does not affect any caller's references. |
| On success you get a new struct file sharing the mount/dentry with the |
| original, on failure - ERR_PTR(). |
| |
| --- |
| |
| **mandatory** |
| |
| ->clone_file_range() and ->dedupe_file_range have been replaced with |
| ->remap_file_range(). See Documentation/filesystems/vfs.rst for more |
| information. |
| |
| --- |
| |
| **recommended** |
| |
| ->lookup() instances doing an equivalent of:: |
| |
| if (IS_ERR(inode)) |
| return ERR_CAST(inode); |
| return d_splice_alias(inode, dentry); |
| |
| don't need to bother with the check - d_splice_alias() will do the |
| right thing when given ERR_PTR(...) as inode. Moreover, passing NULL |
| inode to d_splice_alias() will also do the right thing (equivalent of |
| d_add(dentry, NULL); return NULL;), so that kind of special cases |
| also doesn't need a separate treatment. |
| |
| --- |
| |
| **strongly recommended** |
| |
| take the RCU-delayed parts of ->destroy_inode() into a new method - |
| ->free_inode(). If ->destroy_inode() becomes empty - all the better, |
| just get rid of it. Synchronous work (e.g. the stuff that can't |
| be done from an RCU callback, or any WARN_ON() where we want the |
| stack trace) *might* be movable to ->evict_inode(); however, |
| that goes only for the things that are not needed to balance something |
| done by ->alloc_inode(). IOW, if it's cleaning up the stuff that |
| might have accumulated over the life of in-core inode, ->evict_inode() |
| might be a fit. |
| |
| Rules for inode destruction: |
| |
| * if ->destroy_inode() is non-NULL, it gets called |
| * if ->free_inode() is non-NULL, it gets scheduled by call_rcu() |
| * combination of NULL ->destroy_inode and NULL ->free_inode is |
| treated as NULL/free_inode_nonrcu, to preserve the compatibility. |
| |
| Note that the callback (be it via ->free_inode() or explicit call_rcu() |
| in ->destroy_inode()) is *NOT* ordered wrt superblock destruction; |
| as the matter of fact, the superblock and all associated structures |
| might be already gone. The filesystem driver is guaranteed to be still |
| there, but that's it. Freeing memory in the callback is fine; doing |
| more than that is possible, but requires a lot of care and is best |
| avoided. |
| |
| --- |
| |
| **mandatory** |
| |
| DCACHE_RCUACCESS is gone; having an RCU delay on dentry freeing is the |
| default. DCACHE_NORCU opts out, and only d_alloc_pseudo() has any |
| business doing so. |
| |
| --- |
| |
| **mandatory** |
| |
| d_alloc_pseudo() is internal-only; uses outside of alloc_file_pseudo() are |
| very suspect (and won't work in modules). Such uses are very likely to |
| be misspelled d_alloc_anon(). |
| |
| --- |
| |
| **mandatory** |
| |
| [should've been added in 2016] stale comment in finish_open() notwithstanding, |
| failure exits in ->atomic_open() instances should *NOT* fput() the file, |
| no matter what. Everything is handled by the caller. |
| |
| --- |
| |
| **mandatory** |
| |
| clone_private_mount() returns a longterm mount now, so the proper destructor of |
| its result is kern_unmount() or kern_unmount_array(). |
| |
| --- |
| |
| **mandatory** |
| |
| zero-length bvec segments are disallowed, they must be filtered out before |
| passed on to an iterator. |
| |
| --- |
| |
| **mandatory** |
| |
| For bvec based itererators bio_iov_iter_get_pages() now doesn't copy bvecs but |
| uses the one provided. Anyone issuing kiocb-I/O should ensure that the bvec and |
| page references stay until I/O has completed, i.e. until ->ki_complete() has |
| been called or returned with non -EIOCBQUEUED code. |
| |
| --- |
| |
| **mandatory** |
| |
| mnt_want_write_file() can now only be paired with mnt_drop_write_file(), |
| whereas previously it could be paired with mnt_drop_write() as well. |
| |
| --- |
| |
| **mandatory** |
| |
| iov_iter_copy_from_user_atomic() is gone; use copy_page_from_iter_atomic(). |
| The difference is copy_page_from_iter_atomic() advances the iterator and |
| you don't need iov_iter_advance() after it. However, if you decide to use |
| only a part of obtained data, you should do iov_iter_revert(). |
| |
| --- |
| |
| **mandatory** |
| |
| Calling conventions for file_open_root() changed; now it takes struct path * |
| instead of passing mount and dentry separately. For callers that used to |
| pass <mnt, mnt->mnt_root> pair (i.e. the root of given mount), a new helper |
| is provided - file_open_root_mnt(). In-tree users adjusted. |
| |
| --- |
| |
| **mandatory** |
| |
| no_llseek is gone; don't set .llseek to that - just leave it NULL instead. |
| Checks for "does that file have llseek(2), or should it fail with ESPIPE" |
| should be done by looking at FMODE_LSEEK in file->f_mode. |
| |
| --- |
| |
| *mandatory* |
| |
| filldir_t (readdir callbacks) calling conventions have changed. Instead of |
| returning 0 or -E... it returns bool now. false means "no more" (as -E... used |
| to) and true - "keep going" (as 0 in old calling conventions). Rationale: |
| callers never looked at specific -E... values anyway. -> iterate_shared() |
| instances require no changes at all, all filldir_t ones in the tree |
| converted. |
| |
| --- |
| |
| **mandatory** |
| |
| Calling conventions for ->tmpfile() have changed. It now takes a struct |
| file pointer instead of struct dentry pointer. d_tmpfile() is similarly |
| changed to simplify callers. The passed file is in a non-open state and on |
| success must be opened before returning (e.g. by calling |
| finish_open_simple()). |
| |
| --- |
| |
| **mandatory** |
| |
| Calling convention for ->huge_fault has changed. It now takes a page |
| order instead of an enum page_entry_size, and it may be called without the |
| mmap_lock held. All in-tree users have been audited and do not seem to |
| depend on the mmap_lock being held, but out of tree users should verify |
| for themselves. If they do need it, they can return VM_FAULT_RETRY to |
| be called with the mmap_lock held. |
| |
| --- |
| |
| **mandatory** |
| |
| The order of opening block devices and matching or creating superblocks has |
| changed. |
| |
| The old logic opened block devices first and then tried to find a |
| suitable superblock to reuse based on the block device pointer. |
| |
| The new logic tries to find a suitable superblock first based on the device |
| number, and opening the block device afterwards. |
| |
| Since opening block devices cannot happen under s_umount because of lock |
| ordering requirements s_umount is now dropped while opening block devices and |
| reacquired before calling fill_super(). |
| |
| In the old logic concurrent mounters would find the superblock on the list of |
| superblocks for the filesystem type. Since the first opener of the block device |
| would hold s_umount they would wait until the superblock became either born or |
| was discarded due to initialization failure. |
| |
| Since the new logic drops s_umount concurrent mounters could grab s_umount and |
| would spin. Instead they are now made to wait using an explicit wait-wake |
| mechanism without having to hold s_umount. |
| |
| --- |
| |
| **mandatory** |
| |
| The holder of a block device is now the superblock. |
| |
| The holder of a block device used to be the file_system_type which wasn't |
| particularly useful. It wasn't possible to go from block device to owning |
| superblock without matching on the device pointer stored in the superblock. |
| This mechanism would only work for a single device so the block layer couldn't |
| find the owning superblock of any additional devices. |
| |
| In the old mechanism reusing or creating a superblock for a racing mount(2) and |
| umount(2) relied on the file_system_type as the holder. This was severely |
| underdocumented however: |
| |
| (1) Any concurrent mounter that managed to grab an active reference on an |
| existing superblock was made to wait until the superblock either became |
| ready or until the superblock was removed from the list of superblocks of |
| the filesystem type. If the superblock is ready the caller would simple |
| reuse it. |
| |
| (2) If the mounter came after deactivate_locked_super() but before |
| the superblock had been removed from the list of superblocks of the |
| filesystem type the mounter would wait until the superblock was shutdown, |
| reuse the block device and allocate a new superblock. |
| |
| (3) If the mounter came after deactivate_locked_super() and after |
| the superblock had been removed from the list of superblocks of the |
| filesystem type the mounter would reuse the block device and allocate a new |
| superblock (the bd_holder point may still be set to the filesystem type). |
| |
| Because the holder of the block device was the file_system_type any concurrent |
| mounter could open the block devices of any superblock of the same |
| file_system_type without risking seeing EBUSY because the block device was |
| still in use by another superblock. |
| |
| Making the superblock the owner of the block device changes this as the holder |
| is now a unique superblock and thus block devices associated with it cannot be |
| reused by concurrent mounters. So a concurrent mounter in (2) could suddenly |
| see EBUSY when trying to open a block device whose holder was a different |
| superblock. |
| |
| The new logic thus waits until the superblock and the devices are shutdown in |
| ->kill_sb(). Removal of the superblock from the list of superblocks of the |
| filesystem type is now moved to a later point when the devices are closed: |
| |
| (1) Any concurrent mounter managing to grab an active reference on an existing |
| superblock is made to wait until the superblock is either ready or until |
| the superblock and all devices are shutdown in ->kill_sb(). If the |
| superblock is ready the caller will simply reuse it. |
| |
| (2) If the mounter comes after deactivate_locked_super() but before |
| the superblock has been removed from the list of superblocks of the |
| filesystem type the mounter is made to wait until the superblock and the |
| devices are shut down in ->kill_sb() and the superblock is removed from the |
| list of superblocks of the filesystem type. The mounter will allocate a new |
| superblock and grab ownership of the block device (the bd_holder pointer of |
| the block device will be set to the newly allocated superblock). |
| |
| (3) This case is now collapsed into (2) as the superblock is left on the list |
| of superblocks of the filesystem type until all devices are shutdown in |
| ->kill_sb(). In other words, if the superblock isn't on the list of |
| superblock of the filesystem type anymore then it has given up ownership of |
| all associated block devices (the bd_holder pointer is NULL). |
| |
| As this is a VFS level change it has no practical consequences for filesystems |
| other than that all of them must use one of the provided kill_litter_super(), |
| kill_anon_super(), or kill_block_super() helpers. |
| |
| --- |
| |
| **mandatory** |
| |
| Lock ordering has been changed so that s_umount ranks above open_mutex again. |
| All places where s_umount was taken under open_mutex have been fixed up. |
| |
| --- |
| |
| **mandatory** |
| |
| export_operations ->encode_fh() no longer has a default implementation to |
| encode FILEID_INO32_GEN* file handles. |
| Filesystems that used the default implementation may use the generic helper |
| generic_encode_ino32_fh() explicitly. |
| |
| --- |
| |
| **mandatory** |
| |
| If ->rename() update of .. on cross-directory move needs an exclusion with |
| directory modifications, do *not* lock the subdirectory in question in your |
| ->rename() - it's done by the caller now [that item should've been added in |
| 28eceeda130f "fs: Lock moved directories"]. |
| |
| --- |
| |
| **mandatory** |
| |
| On same-directory ->rename() the (tautological) update of .. is not protected |
| by any locks; just don't do it if the old parent is the same as the new one. |
| We really can't lock two subdirectories in same-directory rename - not without |
| deadlocks. |
| |
| --- |
| |
| **mandatory** |
| |
| lock_rename() and lock_rename_child() may fail in cross-directory case, if |
| their arguments do not have a common ancestor. In that case ERR_PTR(-EXDEV) |
| is returned, with no locks taken. In-tree users updated; out-of-tree ones |
| would need to do so. |
| |
| --- |
| |
| **mandatory** |
| |
| The list of children anchored in parent dentry got turned into hlist now. |
| Field names got changed (->d_children/->d_sib instead of ->d_subdirs/->d_child |
| for anchor/entries resp.), so any affected places will be immediately caught |
| by compiler. |
| |
| --- |
| |
| **mandatory** |
| |
| ->d_delete() instances are now called for dentries with ->d_lock held |
| and refcount equal to 0. They are not permitted to drop/regain ->d_lock. |
| None of in-tree instances did anything of that sort. Make sure yours do not... |
| |
| --- |
| |
| **mandatory** |
| |
| ->d_prune() instances are now called without ->d_lock held on the parent. |
| ->d_lock on dentry itself is still held; if you need per-parent exclusions (none |
| of the in-tree instances did), use your own spinlock. |
| |
| ->d_iput() and ->d_release() are called with victim dentry still in the |
| list of parent's children. It is still unhashed, marked killed, etc., just not |
| removed from parent's ->d_children yet. |
| |
| Anyone iterating through the list of children needs to be aware of the |
| half-killed dentries that might be seen there; taking ->d_lock on those will |
| see them negative, unhashed and with negative refcount, which means that most |
| of the in-kernel users would've done the right thing anyway without any adjustment. |
| |
| --- |
| |
| **recommended** |
| |
| Block device freezing and thawing have been moved to holder operations. |
| |
| Before this change, get_active_super() would only be able to find the |
| superblock of the main block device, i.e., the one stored in sb->s_bdev. Block |
| device freezing now works for any block device owned by a given superblock, not |
| just the main block device. The get_active_super() helper and bd_fsfreeze_sb |
| pointer are gone. |
| |
| --- |
| |
| **mandatory** |
| |
| set_blocksize() takes opened struct file instead of struct block_device now |
| and it *must* be opened exclusive. |