Linux-2.6.12-rc2
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
new file mode 100644
index 0000000..bcfbab8
--- /dev/null
+++ b/Documentation/filesystems/00-INDEX
@@ -0,0 +1,50 @@
+00-INDEX
+ - this file (info on some of the filesystems supported by linux).
+Locking
+ - info on locking rules as they pertain to Linux VFS.
+adfs.txt
+ - info and mount options for the Acorn Advanced Disc Filing System.
+affs.txt
+ - info and mount options for the Amiga Fast File System.
+bfs.txt
+ - info for the SCO UnixWare Boot Filesystem (BFS).
+cifs.txt
+ - description of the CIFS filesystem
+coda.txt
+ - description of the CODA filesystem.
+cramfs.txt
+ - info on the cram filesystem for small storage (ROMs etc)
+devfs/
+ - directory containing devfs documentation.
+ext2.txt
+ - info, mount options and specifications for the Ext2 filesystem.
+fat_cvf.txt
+ - info on the Compressed Volume Files extension to the FAT filesystem
+hpfs.txt
+ - info and mount options for the OS/2 HPFS.
+isofs.txt
+ - info and mount options for the ISO 9660 (CDROM) filesystem.
+jfs.txt
+ - info and mount options for the JFS filesystem.
+ncpfs.txt
+ - info on Novell Netware(tm) filesystem using NCP protocol.
+ntfs.txt
+ - info and mount options for the NTFS filesystem (Windows NT).
+proc.txt
+ - info on Linux's /proc filesystem.
+romfs.txt
+ - Description of the ROMFS filesystem.
+smbfs.txt
+ - info on using filesystems with the SMB protocol (Windows 3.11 and NT)
+sysv-fs.txt
+ - info on the SystemV/V7/Xenix/Coherent filesystem.
+udf.txt
+ - info and mount options for the UDF filesystem.
+ufs.txt
+ - info on the ufs filesystem.
+vfat.txt
+ - info on using the VFAT filesystem used in Windows NT and Windows 95
+vfs.txt
+ - Overview of the Virtual File System
+xfs.txt
+ - info and mount options for the XFS filesystem.
diff --git a/Documentation/filesystems/Exporting b/Documentation/filesystems/Exporting
new file mode 100644
index 0000000..31047e0
--- /dev/null
+++ b/Documentation/filesystems/Exporting
@@ -0,0 +1,176 @@
+
+Making Filesystems Exportable
+=============================
+
+Most filesystem operations require a dentry (or two) as a starting
+point. Local applications have a reference-counted hold on suitable
+dentrys via open file descriptors or cwd/root. However remote
+applications that access a filesystem via a remote filesystem protocol
+such as NFS may not be able to hold such a reference, and so need a
+different way to refer to a particular dentry. As the alternative
+form of reference needs to be stable across renames, truncates, and
+server-reboot (among other things, though these tend to be the most
+problematic), there is no simple answer like 'filename'.
+
+The mechanism discussed here allows each filesystem implementation to
+specify how to generate an opaque (outside of the filesystem) byte
+string for any dentry, and how to find an appropriate dentry for any
+given opaque byte string.
+This byte string will be called a "filehandle fragment" as it
+corresponds to part of an NFS filehandle.
+
+A filesystem which supports the mapping between filehandle fragments
+and dentrys will be termed "exportable".
+
+
+
+Dcache Issues
+-------------
+
+The dcache normally contains a proper prefix of any given filesystem
+tree. This means that if any filesystem object is in the dcache, then
+all of the ancestors of that filesystem object are also in the dcache.
+As normal access is by filename this prefix is created naturally and
+maintained easily (by each object maintaining a reference count on
+its parent).
+
+However when objects are included into the dcache by interpreting a
+filehandle fragment, there is no automatic creation of a path prefix
+for the object. This leads to two related but distinct features of
+the dcache that are not needed for normal filesystem access.
+
+1/ The dcache must sometimes contain objects that are not part of the
+ proper prefix, i.e. that are not connected to the root.
+2/ The dcache must be prepared for a newly found (via ->lookup) directory
+ to already have a (non-connected) dentry, and must be able to move
+ that dentry into place (based on the parent and name in the
+ ->lookup). This is particularly needed for directories as
+ it is a dcache invariant that directories only have one dentry.
+
+To implement these features, the dcache has:
+
+a/ A dentry flag DCACHE_DISCONNECTED which is set on
+ any dentry that might not be part of the proper prefix.
+ This is set when anonymous dentries are created, and cleared when a
+ dentry is noticed to be a child of a dentry which is in the proper
+ prefix.
+
+b/ A per-superblock list "s_anon" of dentries which are the roots of
+ subtrees that are not in the proper prefix. These dentries, as
+ well as the proper prefix, need to be released at unmount time. As
+ these dentries will not be hashed, they are linked together on the
+ d_hash list_head.
+
+c/ Helper routines to allocate anonymous dentries, and to help attach
+ loose directory dentries at lookup time. They are:
+ d_alloc_anon(inode) will return a dentry for the given inode.
+ If the inode already has a dentry, one of those is returned.
+ If it doesn't, a new anonymous (IS_ROOT and
+ DCACHE_DISCONNECTED) dentry is allocated and attached.
+ In the case of a directory, care is taken that only one dentry
+ can ever be attached.
+ d_splice_alias(inode, dentry) will make sure that there is a
+ dentry with the same name and parent as the given dentry, and
+ which refers to the given inode.
+ If the inode is a directory and already has a dentry, then that
+ dentry is d_moved over the given dentry.
+ If the passed dentry gets attached, care is taken that this is
+ mutually exclusive to a d_alloc_anon operation.
+ If the passed dentry is used, NULL is returned, else the used
+ dentry is returned. This corresponds to the calling pattern of
+ ->lookup.
+
+
+Filesystem Issues
+-----------------
+
+For a filesystem to be exportable it must:
+
+ 1/ provide the filehandle fragment routines described below.
+ 2/ make sure that d_splice_alias is used rather than d_add
+ when ->lookup finds an inode for a given parent and name.
+ Typically the ->lookup routine will end:
+ if (inode)
+ return d_splice_alias(inode, dentry);
+ d_add(dentry, inode);
+ return NULL;
+ }
+
+
+
+ A file system implementation declares that instances of the filesystem
+are exportable by setting the s_export_op field in the struct
+super_block. This field must point to a "struct export_operations"
+struct which could potentially be full of NULLs, though normally at
+least get_parent will be set.
+
+ The primary operations are decode_fh and encode_fh.
+decode_fh takes a filehandle fragment and tries to find or create a
+dentry for the object referred to by the filehandle.
+encode_fh takes a dentry and creates a filehandle fragment which can
+later be used to find/create a dentry for the same object.
+
+decode_fh will probably make use of "find_exported_dentry".
+This function lives in the "exportfs" module which a filesystem does
+not need unless it is being exported. So rather than calling
+find_exported_dentry directly, each filesystem should call it through
+the find_exported_dentry pointer in its export_operations table.
+This field is set correctly by the exporting agent (e.g. nfsd) when a
+filesystem is exported, and before any export operations are called.
+
+find_exported_dentry needs three support functions from the
+filesystem:
+ get_name. When given a parent dentry and a child dentry, this
+ should find a name in the directory identified by the parent
+ dentry, which leads to the object identified by the child dentry.
+ If no get_name function is supplied, a default implementation is
+ provided which uses vfs_readdir to find potential names, and
+ matches inode numbers to find the correct match.
+
+ get_parent. When given a dentry for a directory, this should return
+ a dentry for the parent. Quite possibly the parent dentry will
+ have been allocated by d_alloc_anon.
+ The default get_parent function just returns an error so any
+ filehandle lookup that requires finding a parent will fail.
+ ->lookup("..") is *not* used as a default as it can leave ".."
+ entries in the dcache which are too messy to work with.
+
+ get_dentry. When given an opaque datum, this should find the
+ implied object and create a dentry for it (possibly with
+ d_alloc_anon).
+ The opaque datum is whatever is passed down by the decode_fh
+ function, and is often simply a fragment of the filehandle
+ fragment.
+ decode_fh passes two datums through find_exported_dentry. One that
+ should be used to identify the target object, and one that can be
+ used to identify the object's parent, should that be necessary.
+ The default get_dentry function assumes that the datum contains an
+ inode number and a generation number, and it attempts to get the
+ inode using "iget" and check its validity by matching the
+ generation number. A filesystem should only depend on the default
+ if iget can safely be used this way.
+
+If decode_fh and/or encode_fh are left as NULL, then default
+implementations are used. These defaults are suitable for ext2 and
+extremely similar filesystems (like ext3).
+
+The default encode_fh creates a filehandle fragment from the inode
+number and generation number of the target together with the inode
+number and generation number of the parent (if the parent is
+required).
+
+The default decode_fh extracts the target and parent datums from the
+filehandle assuming the format used by the default encode_fh and
+passes them to find_exported_dentry.
+
+
+A filehandle fragment consists of an array of 1 or more 4-byte words,
+together with a one byte "type".
+The decode_fh routine should not depend on the stated size that is
+passed to it. This size may be larger than the original filehandle
+generated by encode_fh, in which case it will have been padded with
+nuls. Rather, the encode_fh routine should choose a "type" which
+indicates to decode_fh how much of the filehandle is valid, and how
+it should be interpreted.
+
+
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
new file mode 100644
index 0000000..a934bae
--- /dev/null
+++ b/Documentation/filesystems/Locking
@@ -0,0 +1,515 @@
+ The text below describes the locking rules for VFS-related methods.
+It is (believed to be) up-to-date. *Please*, if you change anything in
+prototypes or locking protocols - update this file. And update the relevant
+instances in the tree, don't leave that to maintainers of filesystems/devices/
+etc. At the very least, put the list of dubious cases at the end of this file.
+Don't turn it into a log - maintainers of out-of-the-tree code are supposed to
+be able to use diff(1).
+ Things currently missing here: socket operations. Alexey?
+
+--------------------------- dentry_operations --------------------------
+prototypes:
+ int (*d_revalidate)(struct dentry *, int);
+ int (*d_hash) (struct dentry *, struct qstr *);
+ int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
+ int (*d_delete)(struct dentry *);
+ void (*d_release)(struct dentry *);
+ void (*d_iput)(struct dentry *, struct inode *);
+
+locking rules:
+ none have BKL
+               dcache_lock  rename_lock  ->d_lock  may block
+d_revalidate:  no           no           no        yes
+d_hash:        no           no           no        yes
+d_compare:     no           yes          no        no
+d_delete:      yes          no           yes       no
+d_release:     no           no           no        yes
+d_iput:        no           no           no        yes
+
+--------------------------- inode_operations ---------------------------
+prototypes:
+ int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
+ struct dentry * (*lookup) (struct inode *,struct dentry *,
+ struct nameidata *);
+ int (*link) (struct dentry *,struct inode *,struct dentry *);
+ int (*unlink) (struct inode *,struct dentry *);
+ int (*symlink) (struct inode *,struct dentry *,const char *);
+ int (*mkdir) (struct inode *,struct dentry *,int);
+ int (*rmdir) (struct inode *,struct dentry *);
+ int (*mknod) (struct inode *,struct dentry *,int,dev_t);
+ int (*rename) (struct inode *, struct dentry *,
+ struct inode *, struct dentry *);
+ int (*readlink) (struct dentry *, char __user *,int);
+ int (*follow_link) (struct dentry *, struct nameidata *);
+ void (*truncate) (struct inode *);
+ int (*permission) (struct inode *, int, struct nameidata *);
+ int (*setattr) (struct dentry *, struct iattr *);
+ int (*getattr) (struct vfsmount *, struct dentry *, struct kstat *);
+ int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
+ ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
+ ssize_t (*listxattr) (struct dentry *, char *, size_t);
+ int (*removexattr) (struct dentry *, const char *);
+
+locking rules:
+ all may block, none have BKL
+ i_sem(inode)
+lookup: yes
+create: yes
+link: yes (both)
+mknod: yes
+symlink: yes
+mkdir: yes
+unlink: yes (both)
+rmdir: yes (both) (see below)
+rename: yes (all) (see below)
+readlink: no
+follow_link: no
+truncate: yes (see below)
+setattr: yes
+permission: no
+getattr: no
+setxattr: yes
+getxattr: no
+listxattr: no
+removexattr: yes
+ Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_sem on
+victim.
+ cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
+ ->truncate() is never called directly - it's a callback, not a
+method. It's called by vmtruncate() - library function normally used by
+->setattr(). Locking information above applies to that call (i.e. is
+inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been
+passed).
+
+See Documentation/filesystems/directory-locking for more detailed discussion
+of the locking scheme for directory operations.
+
+--------------------------- super_operations ---------------------------
+prototypes:
+ struct inode *(*alloc_inode)(struct super_block *sb);
+ void (*destroy_inode)(struct inode *);
+ void (*read_inode) (struct inode *);
+ void (*dirty_inode) (struct inode *);
+ int (*write_inode) (struct inode *, int);
+ void (*put_inode) (struct inode *);
+ void (*drop_inode) (struct inode *);
+ void (*delete_inode) (struct inode *);
+ void (*put_super) (struct super_block *);
+ void (*write_super) (struct super_block *);
+ int (*sync_fs)(struct super_block *sb, int wait);
+ void (*write_super_lockfs) (struct super_block *);
+ void (*unlockfs) (struct super_block *);
+ int (*statfs) (struct super_block *, struct kstatfs *);
+ int (*remount_fs) (struct super_block *, int *, char *);
+ void (*clear_inode) (struct inode *);
+ void (*umount_begin) (struct super_block *);
+ int (*show_options)(struct seq_file *, struct vfsmount *);
+ ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
+ ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
+
+locking rules:
+ All may block.
+ BKL s_lock s_umount
+alloc_inode: no no no
+destroy_inode: no
+read_inode: no (see below)
+dirty_inode: no (must not sleep)
+write_inode: no
+put_inode: no
+drop_inode: no !!!inode_lock!!!
+delete_inode: no
+put_super: yes yes no
+write_super: no yes read
+sync_fs: no no read
+write_super_lockfs: ?
+unlockfs: ?
+statfs: no no no
+remount_fs: no yes maybe (see below)
+clear_inode: no
+umount_begin: yes no no
+show_options: no (vfsmount->sem)
+quota_read: no no no (see below)
+quota_write: no no no (see below)
+
+->read_inode() is not a method - it's a callback used in iget().
+->remount_fs() will have the s_umount lock if it's already mounted.
+When called from get_sb_single, it does NOT have the s_umount lock.
+->quota_read() and ->quota_write() functions are both guaranteed to
+be the only ones operating on the quota file by the quota code (via
+dqio_sem) (unless an admin really wants to screw up something and
+writes to quota files with quotas on). For other details about locking
+see also dquot_operations section.
+
+--------------------------- file_system_type ---------------------------
+prototypes:
+ struct super_block *(*get_sb) (struct file_system_type *, int,
+ const char *, void *);
+ void (*kill_sb) (struct super_block *);
+locking rules:
+ may block BKL
+get_sb yes yes
+kill_sb yes yes
+
+->get_sb() returns error or a locked superblock (exclusive on ->s_umount).
+->kill_sb() takes a write-locked superblock, does all shutdown work on it,
+unlocks and drops the reference.
+
+--------------------------- address_space_operations --------------------------
+prototypes:
+ int (*writepage)(struct page *page, struct writeback_control *wbc);
+ int (*readpage)(struct file *, struct page *);
+ int (*sync_page)(struct page *);
+ int (*writepages)(struct address_space *, struct writeback_control *);
+ int (*set_page_dirty)(struct page *page);
+ int (*readpages)(struct file *filp, struct address_space *mapping,
+ struct list_head *pages, unsigned nr_pages);
+ int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
+ int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
+ sector_t (*bmap)(struct address_space *, sector_t);
+ int (*invalidatepage) (struct page *, unsigned long);
+ int (*releasepage) (struct page *, int);
+ int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
+ loff_t offset, unsigned long nr_segs);
+
+locking rules:
+ All except set_page_dirty may block
+
+                 BKL  PageLocked(page)
+writepage:       no   yes, unlocks (see below)
+readpage:        no   yes, unlocks
+sync_page:       no   maybe
+writepages:      no
+set_page_dirty:  no   no
+readpages:       no
+prepare_write:   no   yes
+commit_write:    no   yes
+bmap:            yes
+invalidatepage:  no   yes
+releasepage:     no   yes
+direct_IO:       no
+
+ ->prepare_write(), ->commit_write(), ->sync_page() and ->readpage()
+may be called from the request handler (/dev/loop).
+
+ ->readpage() unlocks the page, either synchronously or via I/O
+completion.
+
+ ->readpages() populates the pagecache with the passed pages and starts
+I/O against them. They come unlocked upon I/O completion.
+
+ ->writepage() is used for two purposes: for "memory cleansing" and for
+"sync". These are quite different operations and the behaviour may differ
+depending upon the mode.
+
+If writepage is called for sync (wbc->sync_mode != WBC_SYNC_NONE) then
+it *must* start I/O against the page, even if that would involve
+blocking on in-progress I/O.
+
+If writepage is called for memory cleansing (sync_mode ==
+WBC_SYNC_NONE) then its role is to get as much writeout underway as
+possible. So writepage should try to avoid blocking against
+currently-in-progress I/O.
+
+If the filesystem is not called for "sync" and it determines that it
+would need to block against in-progress I/O to be able to start new I/O
+against the page the filesystem should redirty the page with
+redirty_page_for_writepage(), then unlock the page and return zero.
+This may also be done to avoid internal deadlocks, but rarely.
+
+If the filesystem is called for sync then it must wait on any
+in-progress I/O and then start new I/O.
+
+The filesystem should unlock the page synchronously, before returning
+to the caller.
+
+Unless the filesystem is going to redirty_page_for_writepage(), unlock the page
+and return zero, writepage *must* run set_page_writeback() against the page,
+followed by unlocking it. Once set_page_writeback() has been run against the
+page, write I/O can be submitted and the write I/O completion handler must run
+end_page_writeback() once the I/O is complete. If no I/O is submitted, the
+filesystem must run end_page_writeback() against the page before returning from
+writepage.
+
+That is: after 2.5.12, pages which are under writeout are *not* locked. Note,
+if the filesystem needs the page to be locked during writeout, that is ok, too,
+the page is allowed to be unlocked at any point in time between the calls to
+set_page_writeback() and end_page_writeback().
+
+Note, failure to run either redirty_page_for_writepage() or the combination of
+set_page_writeback()/end_page_writeback() on a page submitted to writepage
+will leave the page itself marked clean but it will be tagged as dirty in the
+radix tree. This incoherency can lead to all sorts of hard-to-debug problems
+in the filesystem like having dirty inodes at umount and losing written data.
+
+ ->sync_page() locking rules are not well-defined - usually it is called
+with lock on page, but that is not guaranteed. Considering the currently
+existing instances of this method ->sync_page() itself doesn't look
+well-defined...
+
+ ->writepages() is used for periodic writeback and for syscall-initiated
+sync operations. The address_space should start I/O against at least
+*nr_to_write pages. *nr_to_write must be decremented for each page which is
+written. The address_space implementation may write more (or fewer) pages
+than *nr_to_write asks for, but it should try to be reasonably close. If
+nr_to_write is NULL, all dirty pages must be written.
+
+writepages should _only_ write pages which are present on
+mapping->io_pages.
+
+ ->set_page_dirty() is called from various places in the kernel
+when the target page is marked as needing writeback. It may be called
+under spinlock (it cannot block) and is sometimes called with the page
+not locked.
+
+ ->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some
+filesystems and by the swapper. The latter will eventually go away. None of
+the instances actually need the BKL. Please, keep it that way and don't
+breed new callers.
+
+ ->invalidatepage() is called when the filesystem must attempt to drop
+some or all of the buffers from the page when it is being truncated. It
+returns zero on success. If ->invalidatepage is zero, the kernel uses
+block_invalidatepage() instead.
+
+ ->releasepage() is called when the kernel is about to try to drop the
+buffers from the page in preparation for freeing it. It returns zero to
+indicate that the buffers are (or may be) freeable. If ->releasepage is zero,
+the kernel assumes that the fs has no private interest in the buffers.
+
+ Note: currently almost all instances of address_space methods are
+using BKL for internal serialization and that's one of the worst sources
+of contention. Normally they are calling library functions (in fs/buffer.c)
+and pass foo_get_block() as a callback (on local block-based filesystems,
+indeed). BKL is not needed for library stuff and is usually taken by
+foo_get_block(). It's an overkill, since block bitmaps can be protected by
+internal fs locking and real critical areas are much smaller than the areas
+filesystems protect now.
+
+----------------------- file_lock_operations ------------------------------
+prototypes:
+ void (*fl_insert)(struct file_lock *); /* lock insertion callback */
+ void (*fl_remove)(struct file_lock *); /* lock removal callback */
+ void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
+ void (*fl_release_private)(struct file_lock *);
+
+
+locking rules:
+                     BKL  may block
+fl_insert:           yes  no
+fl_remove:           yes  no
+fl_copy_lock:        yes  no
+fl_release_private:  yes  yes
+
+----------------------- lock_manager_operations ---------------------------
+prototypes:
+ int (*fl_compare_owner)(struct file_lock *, struct file_lock *);
+ void (*fl_notify)(struct file_lock *); /* unblock callback */
+ void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
+ void (*fl_release_private)(struct file_lock *);
+ void (*fl_break)(struct file_lock *); /* break_lease callback */
+
+locking rules:
+                     BKL  may block
+fl_compare_owner:    yes  no
+fl_notify:           yes  no
+fl_copy_lock:        yes  no
+fl_release_private:  yes  yes
+fl_break:            yes  no
+
+ Currently only NFSD and NLM provide instances of this class. None of
+them block. If you have out-of-tree instances - please, show up. Locking
+in that area will change.
+--------------------------- buffer_head -----------------------------------
+prototypes:
+ void (*b_end_io)(struct buffer_head *bh, int uptodate);
+
+locking rules:
+ called from interrupts. In other words, extreme care is needed here.
+bh is locked, but that's all the guarantees we have here. Currently only RAID1,
+highmem, fs/buffer.c, and fs/ntfs/aops.c are providing these. Block devices
+call this method upon the IO completion.
+
+--------------------------- block_device_operations -----------------------
+prototypes:
+ int (*open) (struct inode *, struct file *);
+ int (*release) (struct inode *, struct file *);
+ int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long);
+ int (*media_changed) (struct gendisk *);
+ int (*revalidate_disk) (struct gendisk *);
+
+locking rules:
+                  BKL  bd_sem
+open:             yes  yes
+release:          yes  yes
+ioctl:            yes  no
+media_changed:    no   no
+revalidate_disk:  no   no
+
+The last two are called only from check_disk_change().
+
+--------------------------- file_operations -------------------------------
+prototypes:
+ loff_t (*llseek) (struct file *, loff_t, int);
+ ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
+ ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
+ ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
+ ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t,
+ loff_t);
+ int (*readdir) (struct file *, void *, filldir_t);
+ unsigned int (*poll) (struct file *, struct poll_table_struct *);
+ int (*ioctl) (struct inode *, struct file *, unsigned int,
+ unsigned long);
+ long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
+ long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
+ int (*mmap) (struct file *, struct vm_area_struct *);
+ int (*open) (struct inode *, struct file *);
+ int (*flush) (struct file *);
+ int (*release) (struct inode *, struct file *);
+ int (*fsync) (struct file *, struct dentry *, int datasync);
+ int (*aio_fsync) (struct kiocb *, int datasync);
+ int (*fasync) (int, struct file *, int);
+ int (*lock) (struct file *, int, struct file_lock *);
+ ssize_t (*readv) (struct file *, const struct iovec *, unsigned long,
+ loff_t *);
+ ssize_t (*writev) (struct file *, const struct iovec *, unsigned long,
+ loff_t *);
+ ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t,
+ void __user *);
+ ssize_t (*sendpage) (struct file *, struct page *, int, size_t,
+ loff_t *, int);
+ unsigned long (*get_unmapped_area)(struct file *, unsigned long,
+ unsigned long, unsigned long, unsigned long);
+ int (*check_flags)(int);
+ int (*dir_notify)(struct file *, unsigned long);
+};
+
+locking rules:
+ All except ->poll() may block.
+ BKL
+llseek: no (see below)
+read: no
+aio_read: no
+write: no
+aio_write: no
+readdir: no
+poll: no
+ioctl: yes (see below)
+unlocked_ioctl: no (see below)
+compat_ioctl: no
+mmap: no
+open: maybe (see below)
+flush: no
+release: no
+fsync: no (see below)
+aio_fsync: no
+fasync: yes (see below)
+lock: yes
+readv: no
+writev: no
+sendfile: no
+sendpage: no
+get_unmapped_area: no
+check_flags: no
+dir_notify: no
+
+->llseek() locking has moved from llseek to the individual llseek
+implementations. If your fs is not using generic_file_llseek, you
+need to acquire and release the appropriate locks in your ->llseek().
+For many filesystems, it is probably safe to acquire the inode
+semaphore. Note some filesystems (i.e. remote ones) provide no
+protection for i_size so you will need to use the BKL.
+
+->open() locking is in-transit: big lock partially moved into the methods.
+The only exception is ->open() in the instances of file_operations that never
+end up in ->i_fop/->proc_fops, i.e. ones that belong to character devices
+(chrdev_open() takes the lock before replacing ->f_op and calling the secondary
+method). As soon as we fix the handling of module reference counters all
+instances of ->open() will be called without the BKL.
+
+Note: ext2_release() was *the* source of contention on fs-intensive
+loads and dropping BKL on ->release() helps to get rid of that (we still
+grab BKL for cases when we close a file that had been opened r/w, but that
+can and should be done using the internal locking with smaller critical areas).
+Current worst offender is ext2_get_block()...
+
+->fasync() is a mess. This area needs a big cleanup and that will probably
+affect locking.
+
+->readdir() and ->ioctl() on directories must be changed. Ideally we would
+move ->readdir() to inode_operations and use a separate method for directory
+->ioctl() or kill the latter completely. One of the problems is that for
+anything that resembles union-mount we won't have a struct file for all
+components. And there are other reasons why the current interface is a mess...
+
+->ioctl() on regular files is superseded by the ->unlocked_ioctl() that
+doesn't take the BKL.
+
+->read on directories probably must go away - we should just enforce -EISDIR
+in sys_read() and friends.
+
+->fsync() has i_sem on inode.
+
+--------------------------- dquot_operations -------------------------------
+prototypes:
+ int (*initialize) (struct inode *, int);
+ int (*drop) (struct inode *);
+ int (*alloc_space) (struct inode *, qsize_t, int);
+ int (*alloc_inode) (const struct inode *, unsigned long);
+ int (*free_space) (struct inode *, qsize_t);
+ int (*free_inode) (const struct inode *, unsigned long);
+ int (*transfer) (struct inode *, struct iattr *);
+ int (*write_dquot) (struct dquot *);
+ int (*acquire_dquot) (struct dquot *);
+ int (*release_dquot) (struct dquot *);
+ int (*mark_dirty) (struct dquot *);
+ int (*write_info) (struct super_block *, int);
+
+These operations are intended to be more or less wrapping functions that ensure
+a proper locking wrt the filesystem and call the generic quota operations.
+
+What filesystem should expect from the generic quota functions:
+
+                FS recursion    Held locks when called
+initialize:     yes             maybe dqonoff_sem
+drop:           yes             -
+alloc_space:    ->mark_dirty()  -
+alloc_inode:    ->mark_dirty()  -
+free_space:     ->mark_dirty()  -
+free_inode:     ->mark_dirty()  -
+transfer:       yes             -
+write_dquot:    yes             dqonoff_sem or dqptr_sem
+acquire_dquot:  yes             dqonoff_sem or dqptr_sem
+release_dquot:  yes             dqonoff_sem or dqptr_sem
+mark_dirty:     no              -
+write_info:     yes             dqonoff_sem
+
+FS recursion means calling ->quota_read() and ->quota_write() from superblock
+operations.
+
+->alloc_space(), ->alloc_inode(), ->free_space(), ->free_inode() are called
+only directly by the filesystem and do not call any fs functions other than
+the ->mark_dirty() operation.
+
+More details about quota locking can be found in fs/dquot.c.
+
+--------------------------- vm_operations_struct -----------------------------
+prototypes:
+ void (*open)(struct vm_area_struct*);
+ void (*close)(struct vm_area_struct*);
+ struct page *(*nopage)(struct vm_area_struct*, unsigned long, int *);
+
+locking rules:
+         BKL  mmap_sem
+open:    no   yes
+close:   no   yes
+nopage:  no   yes
+
+================================================================================
+ Dubious stuff
+
+(if you break something or notice that it is broken and do not fix it yourself
+- at least put it here)
+
+ipc/shm.c::shm_delete() - may need BKL.
+->read() and ->write() in many drivers are (probably) missing BKL.
+drivers/sgi/char/graphics.c::sgi_graphics_nopage() - may need BKL.
diff --git a/Documentation/filesystems/adfs.txt b/Documentation/filesystems/adfs.txt
new file mode 100644
index 0000000..060abb0
--- /dev/null
+++ b/Documentation/filesystems/adfs.txt
@@ -0,0 +1,57 @@
+Mount options for ADFS
+----------------------
+
+ uid=nnn All files in the partition will be owned by
+ user id nnn. Default 0 (root).
+ gid=nnn All files in the partition will be in group
+ nnn. Default 0 (root).
+ ownmask=nnn The permission mask for ADFS 'owner' permissions
+ will be nnn. Default 0700.
+ othmask=nnn The permission mask for ADFS 'other' permissions
+ will be nnn. Default 0077.
+
+Mapping of ADFS permissions to Linux permissions
+------------------------------------------------
+
+ ADFS permissions consist of the following:
+
+ Owner read
+ Owner write
+ Other read
+ Other write
+
+ (In older versions, an 'execute' permission did exist, but this
+ does not hold the same meaning as the Linux 'execute' permission
+ and is now obsolete).
+
+ The mapping is performed as follows:
+
+ Owner read -> -r--r--r--
+ Owner write -> --w--w--w-
+ Owner read and filetype UnixExec -> ---x--x--x
+ These are then masked by ownmask, eg 700 -> -rwx------
+ Possible owner mode permissions -> -rwx------
+
+ Other read -> -r--r--r--
+ Other write -> --w--w--w-
+ Other read and filetype UnixExec -> ---x--x--x
+ These are then masked by othmask, eg 077 -> ----rwxrwx
+ Possible other mode permissions -> ----rwxrwx
+
+ Hence, with the default masks, if a file is owner read/write, and
+ not a UnixExec filetype, then the permissions will be:
+
+ -rw-------
+
+ However, if the masks were ownmask=0770,othmask=0007, then this would
+ be modified to:
+ -rw-rw----
+
+ There is no restriction on what you can do with these masks. You may
+ wish that either read bit gives read access to the file for all, but
+ keep the default write protection (ownmask=0755,othmask=0577):
+
+ -rw-r--r--
+
+ You can therefore tailor the permission translation to whatever you
+ desire the permissions should be under Linux.
diff --git a/Documentation/filesystems/affs.txt b/Documentation/filesystems/affs.txt
new file mode 100644
index 0000000..30c9738
--- /dev/null
+++ b/Documentation/filesystems/affs.txt
@@ -0,0 +1,219 @@
+Overview of Amiga Filesystems
+=============================
+
+Not all varieties of the Amiga filesystems are supported for reading and
+writing. The Amiga currently knows six different filesystems:
+
+DOS\0 The old or original filesystem, not really suited for
+ hard disks and normally not used on them, either.
+ Supported read/write.
+
+DOS\1 The original Fast File System. Supported read/write.
+
+DOS\2 The old "international" filesystem. International means that
+ a bug has been fixed so that accented ("international") letters
+ in file names are case-insensitive, as they ought to be.
+ Supported read/write.
+
+DOS\3 The "international" Fast File System. Supported read/write.
+
+DOS\4 The original filesystem with directory cache. The directory
+ cache speeds up directory accesses on floppies considerably,
+ but slows down file creation/deletion. Doesn't make much
+ sense on hard disks. Supported read only.
+
+DOS\5 The Fast File System with directory cache. Supported read only.
+
+All of the above filesystems allow block sizes from 512 to 32K bytes.
+Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks
+speed up almost everything at the expense of wasted disk space. The speed
+gain above 4K seems not really worth the price, so you don't lose too
+much here, either.
+
+The muFS (multi user File System) equivalents of the above file systems
+are supported, too.
+
+Mount options for the AFFS
+==========================
+
+protect If this option is set, the protection bits cannot be altered.
+
+setuid[=uid] This sets the owner of all files and directories in the file
+ system to uid or the uid of the current user, respectively.
+
+setgid[=gid] Same as above, but for gid.
+
+mode=mode Sets the mode flags to the given (octal) value, regardless
+ of the original permissions. Directories will get an x
+ permission if the corresponding r bit is set.
+ This is useful since most of the plain AmigaOS files
+ will map to 600.
+
+reserved=num Sets the number of reserved blocks at the start of the
+ partition to num. You should never need this option.
+ Default is 2.
+
+root=block Sets the block number of the root block. This should never
+ be necessary.
+
+bs=blksize Sets the blocksize to blksize. Valid block sizes are 512,
+ 1024, 2048 and 4096. Like the root option, this should
+ never be necessary, as the affs can figure it out itself.
+
+quiet The file system will not return an error for disallowed
+ mode changes.
+
+verbose The volume name, file system type and block size will
+ be written to the syslog when the filesystem is mounted.
+
+mufs The filesystem is really a muFS, although it doesn't
+ identify itself as one. This option is necessary if
+ the filesystem wasn't formatted as muFS, but is used
+ as one.
+
+prefix=path Path will be prefixed to every absolute path name of
+ symbolic links on an AFFS partition. Default = "/".
+ (See below.)
+
+volume=name When symbolic links with an absolute path are created
+ on an AFFS partition, name will be prepended as the
+ volume name. Default = "" (empty string).
+ (See below.)
+
+Handling of the Users/Groups and protection flags
+=================================================
+
+Amiga -> Linux:
+
+The Amiga protection flags RWEDRWEDHSPARWED are handled as follows:
+
+ - R maps to r for user, group and others. On directories, R implies x.
+
+ - If both W and D are allowed, w will be set.
+
+ - E maps to x.
+
+ - H and P are always retained and ignored under Linux.
+
+ - A is always reset when a file is written to.
+
+User id and group id will be used unless set[gu]id are given as mount
+options. Since most of the Amiga file systems are single user systems
+they will be owned by root. The root directory (the mount point) of the
+Amiga filesystem will be owned by the user who actually mounts the
+filesystem (the root directory doesn't have uid/gid fields).
+
+Linux -> Amiga:
+
+The Linux rwxrwxrwx file mode is handled as follows:
+
+ - r permission will set R for user, group and others.
+
+ - w permission will set W and D for user, group and others.
+
+ - x permission of the user will set E for plain files.
+
+ - All other flags (suid, sgid, ...) are ignored and will
+ not be retained.
+
+Newly created files and directories will get the user and group ID
+of the current user and a mode according to the umask.
+
+Symbolic links
+==============
+
+Although the Amiga and Linux file systems resemble each other, there
+are some, not always subtle, differences. One of them becomes apparent
+with symbolic links. While Linux has a file system with exactly one
+root directory, the Amiga has a separate root directory for each
+file system (for example, partition, floppy disk, ...). With the Amiga,
+these entities are called "volumes". They have symbolic names which
+can be used to access them. Thus, symbolic links can point to a
+different volume. AFFS turns the volume name into a directory name
+and prepends the prefix path (see prefix option) to it.
+
+Example:
+You mount all your Amiga partitions under /amiga/<volume> (where
+<volume> is the name of the volume), and you give the option
+"prefix=/amiga/" when mounting all your AFFS partitions. (They
+might be "User", "WB" and "Graphics", the mount points /amiga/User,
+/amiga/WB and /amiga/Graphics). A symbolic link referring to
+"User:sc/include/dos/dos.h" will be followed to
+"/amiga/User/sc/include/dos/dos.h".
+
+Examples
+========
+
+Command line:
+ mount Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose
+ mount /dev/sda3 /Amiga -t affs
+
+/etc/fstab entry:
+ /dev/sdb5 /amiga/Workbench affs noauto,user,exec,verbose 0 0
+
+IMPORTANT NOTE
+==============
+
+If you boot Windows 95 (don't know about 3.x, 98 and NT) while you
+have an Amiga harddisk connected to your PC, it will overwrite
+the bytes 0x00dc..0x00df of block 0 with garbage, thus invalidating
+the Rigid Disk Block. Sheer luck has it that this is an unused
+area of the RDB, so only the checksum doesn't match anymore.
+Linux will ignore this garbage and recognize the RDB anyway, but
+before you connect that drive to your Amiga again, you must
+restore or repair your RDB. So please do make a backup copy of it
+before booting Windows!
+
+If the damage is already done, the following should fix the RDB
+(where <disk> is the device name).
+DO AT YOUR OWN RISK:
+
+ dd if=/dev/<disk> of=rdb.tmp count=1
+ cp rdb.tmp rdb.fixed
+ dd if=/dev/zero of=rdb.fixed bs=1 seek=220 count=4
+ dd if=rdb.fixed of=/dev/<disk>
+
+Bugs, Restrictions, Caveats
+===========================
+
+Quite a few things may not work as advertised. Not everything is
+tested, though several hundred MB have been read and written using
+this fs. For the most up-to-date list of bugs please consult
+fs/affs/Changes.
+
+Filenames are truncated to 30 characters without warning (this
+can be changed by setting the compile-time option AFFS_NO_TRUNCATE
+in include/linux/amigaffs.h).
+
+Case is ignored by the affs in filename matching, but Linux shells
+do care about the case. Example (with /wb being an affs mounted fs):
+ rm /wb/WRONGCASE
+will remove /wb/wrongcase, but
+ rm /wb/WR*
+will not since the names are matched by the shell.
+
+The block allocation is designed for hard disk partitions. If more
+than 1 process writes to a (small) diskette, the blocks are allocated
+in an ugly way (but the real AFFS doesn't do much better). This
+is also true when space gets tight.
+
+You cannot execute programs on an OFS (Old File System), since the
+program files cannot be memory mapped due to the 488 byte blocks.
+For the same reason you cannot mount an image on such a filesystem
+via the loopback device.
+
+The bitmap valid flag in the root block may not be accurate when the
+system crashes while an affs partition is mounted. There's currently
+no way to fix a garbled filesystem without an Amiga (disk validator)
+or manually (who would do this?). Maybe later.
+
+If you mount affs partitions on system startup, you may want to tell
+fsck that the fs should not be checked (place a '0' in the sixth field
+of /etc/fstab).
+
+It's not possible to read floppy disks with a normal PC or workstation
+due to an incompatibility with the Amiga floppy controller.
+
+If you are interested in an Amiga Emulator for Linux, look at
+
+http://www-users.informatik.rwth-aachen.de/~crux/uae.html
diff --git a/Documentation/filesystems/afs.txt b/Documentation/filesystems/afs.txt
new file mode 100644
index 0000000..2f4237d
--- /dev/null
+++ b/Documentation/filesystems/afs.txt
@@ -0,0 +1,155 @@
+ kAFS: AFS FILESYSTEM
+ ====================
+
+ABOUT
+=====
+
+This filesystem provides a fairly simple AFS filesystem driver. It is under
+development and only provides very basic facilities. It does not yet support
+the following AFS features:
+
+ (*) Write support.
+ (*) Communications security.
+ (*) Local caching.
+ (*) pioctl() system call.
+ (*) Automatic mounting of embedded mountpoints.
+
+
+USAGE
+=====
+
+When inserting the driver modules the root cell must be specified along with a
+list of volume location server IP addresses:
+
+ insmod rxrpc.o
+ insmod kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
+
+The first module is a driver for the RxRPC remote operation protocol, and the
+second is the actual filesystem driver for the AFS filesystem.
+
+Once the module has been loaded, more cells can be added by the following
+procedure:
+
+ echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells
+
+Where the parameters to the "add" command are the name of a cell and a list of
+volume location servers within that cell.
+
+Filesystems can be mounted anywhere by commands similar to the following:
+
+ mount -t afs "%cambridge.redhat.com:root.afs." /afs
+ mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge
+ mount -t afs "#root.afs." /afs
+ mount -t afs "#root.cell." /afs/cambridge
+
+ NB: When using this on Linux 2.4, the mount command has to be different,
+ since the filesystem doesn't have access to the device name argument:
+
+ mount -t afs none /afs -ovol="#root.afs."
+
+Where the initial character is either a hash or a percent symbol depending on
+whether you definitely want a R/W volume (hash) or whether you'd prefer a R/O
+volume, but are willing to use a R/W volume instead (percent).
+
+The name of the volume can be suffixed with ".backup" or ".readonly" to
+specify connection to only volumes of those types.
+
+The name of the cell is optional, and if not given during a mount, then the
+named volume will be looked up in the cell specified during insmod.
+
+Additional cells can be added through /proc (see later section).
+
+
+MOUNTPOINTS
+===========
+
+AFS has a concept of mountpoints. These are specially formatted symbolic links
+(of the same form as the "device name" passed to mount). kAFS presents these
+to the user as directories that have special properties:
+
+ (*) They cannot be listed. Running a program like "ls" on them will incur an
+ EREMOTE error (Object is remote).
+
+ (*) Other objects can't be looked up inside of them. This also incurs an
+ EREMOTE error.
+
+ (*) They can be queried with the readlink() system call, which will return
+ the name of the mountpoint to which they point. The "readlink" program
+ will also work.
+
+ (*) They can be mounted on (which symbolic links can't).
+
+
+PROC FILESYSTEM
+===============
+
+The rxrpc module creates a number of files in various places in the /proc
+filesystem:
+
+ (*) Firstly, some information files are made available in a directory called
+ "/proc/net/rxrpc/". These list the extant transport endpoint, peer,
+ connection and call records.
+
+ (*) Secondly, some control files are made available in a directory called
+ "/proc/sys/rxrpc/". Currently, all these files can be used for is to
+ turn on various levels of tracing.
+
+The AFS module creates a "/proc/fs/afs/" directory and populates it:
+
+ (*) A "cells" file that lists cells currently known to the afs module.
+
+ (*) A directory per cell that contains files that list volume location
+ servers, volumes, and active servers known within that cell.
+
+
+THE CELL DATABASE
+=================
+
+The filesystem maintains an internal database of all the cells it knows and
+the IP addresses of the volume location servers for those cells. The cell to
+which the computer belongs is added to the database when insmod is performed
+by the "rootcell=" argument.
+
+Further cells can be added by commands similar to the following:
+
+ echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells
+ echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells
+
+No other cell database operations are available at this time.
+
+
+EXAMPLES
+========
+
+Here's what I use to test this. Some of the names and IP addresses are local
+to my internal DNS. My "root.afs" partition has a mount point within it for
+some public volumes.
+
+insmod -S /tmp/rxrpc.o
+insmod -S /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
+
+mount -t afs \%root.afs. /afs
+mount -t afs \%cambridge.redhat.com:root.cell. /afs/cambridge.redhat.com/
+
+echo add grand.central.org 18.7.14.88:128.2.191.224 > /proc/fs/afs/cells
+mount -t afs "#grand.central.org:root.cell." /afs/grand.central.org/
+mount -t afs "#grand.central.org:root.archive." /afs/grand.central.org/archive
+mount -t afs "#grand.central.org:root.contrib." /afs/grand.central.org/contrib
+mount -t afs "#grand.central.org:root.doc." /afs/grand.central.org/doc
+mount -t afs "#grand.central.org:root.project." /afs/grand.central.org/project
+mount -t afs "#grand.central.org:root.service." /afs/grand.central.org/service
+mount -t afs "#grand.central.org:root.software." /afs/grand.central.org/software
+mount -t afs "#grand.central.org:root.user." /afs/grand.central.org/user
+
+umount /afs/grand.central.org/user
+umount /afs/grand.central.org/software
+umount /afs/grand.central.org/service
+umount /afs/grand.central.org/project
+umount /afs/grand.central.org/doc
+umount /afs/grand.central.org/contrib
+umount /afs/grand.central.org/archive
+umount /afs/grand.central.org
+umount /afs/cambridge.redhat.com
+umount /afs
+rmmod kafs
+rmmod rxrpc
diff --git a/Documentation/filesystems/automount-support.txt b/Documentation/filesystems/automount-support.txt
new file mode 100644
index 0000000..58c65a1
--- /dev/null
+++ b/Documentation/filesystems/automount-support.txt
@@ -0,0 +1,118 @@
+Support is available for filesystems that wish to support automounting (such
+as kAFS, which can be found in fs/afs/). This facility includes allowing
+in-kernel mounts to be performed and mountpoint degradation to be
+requested. The latter can also be requested by userspace.
+
+
+======================
+IN-KERNEL AUTOMOUNTING
+======================
+
+A filesystem can now mount another filesystem on one of its directories by the
+following procedure:
+
+ (1) Give the directory a follow_link() operation.
+
+ When the directory is accessed, the follow_link op will be called, and
+ it will be provided with the location of the mountpoint in the nameidata
+ structure (vfsmount and dentry).
+
+ (2) Have the follow_link() op do the following steps:
+
+ (a) Call do_kern_mount() to call the appropriate filesystem to set up a
+ superblock and gain a vfsmount structure representing it.
+
+ (b) Copy the nameidata provided as an argument and substitute the dentry
+ argument into the copy.
+
+ (c) Call do_add_mount() to install the new vfsmount into the namespace's
+ mountpoint tree, thus making it accessible to userspace. Use the
+ nameidata set up in (b) as the destination.
+
+ If the mountpoint will be automatically expired, then do_add_mount()
+ should also be given the location of an expiration list (see further
+ down).
+
+ (d) Release the path in the nameidata argument and substitute in the new
+ vfsmount and its root dentry. The ref counts on these will need
+ incrementing.
+
+Then from userspace, you can just do something like:
+
+ [root@andromeda root]# mount -t afs \#root.afs. /afs
+ [root@andromeda root]# ls /afs
+ asd cambridge cambridge.redhat.com grand.central.org
+ [root@andromeda root]# ls /afs/cambridge
+ afsdoc
+ [root@andromeda root]# ls /afs/cambridge/afsdoc/
+ ChangeLog html LICENSE pdf RELNOTES-1.2.2
+
+And then if you look in the mountpoint catalogue, you'll see something like:
+
+ [root@andromeda root]# cat /proc/mounts
+ ...
+ #root.afs. /afs afs rw 0 0
+ #root.cell. /afs/cambridge.redhat.com afs rw 0 0
+ #afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0
+
+
+===========================
+AUTOMATIC MOUNTPOINT EXPIRY
+===========================
+
+Automatic expiration of mountpoints is easy, provided you've mounted the
+mountpoint to be expired in the automounting procedure outlined above.
+
+To do expiration, you need to follow these steps:
+
+ (3) Create at least one list off which the vfsmounts to be expired can be
+ hung. Access to this list will be governed by the vfsmount_lock.
+
+ (4) In step (2c) above, the call to do_add_mount() should be provided with a
+ pointer to this list. It will hang the vfsmount off of it if it succeeds.
+
+ (5) When you want mountpoints to be expired, call mark_mounts_for_expiry()
+ with a pointer to this list. This will process the list, marking every
+ vfsmount thereon for potential expiry on the next call.
+
+ If a vfsmount was already flagged for expiry, and if its usage count is 1
+ (it's only referenced by its parent vfsmount), then it will be deleted
+ from the namespace and thrown away (effectively unmounted).
+
+ It may prove simplest to call this at regular intervals, using
+ some sort of timed event to drive it.
+
+The expiration flag is cleared by calls to mntput. This means that expiration
+will only happen on the second expiration request after the last time the
+mountpoint was accessed.
+
+If a mountpoint is moved, it gets removed from the expiration list. If a bind
+mount is made on an expirable mount, the new vfsmount will not be on the
+expiration list and will not expire.
+
+If a namespace is copied, all mountpoints contained therein will be copied,
+and the copies of those that are on an expiration list will be added to the
+same expiration list.
+
+
+=======================
+USERSPACE DRIVEN EXPIRY
+=======================
+
+As an alternative, it is possible for userspace to request expiry of any
+mountpoint (though some will be rejected - the current process's idea of the
+rootfs for example). It does this by passing the MNT_EXPIRE flag to
+umount(). This flag is considered incompatible with MNT_FORCE and MNT_DETACH.
+
+If the mountpoint in question is referenced by something other than
+umount() or its parent mountpoint, an EBUSY error will be returned and the
+mountpoint will not be marked for expiration or unmounted.
+
+If the mountpoint was not already marked for expiry at that time, an EAGAIN
+error will be given and it won't be unmounted.
+
+Otherwise if it was already marked and it wasn't referenced, unmounting will
+take place as usual.
+
+Again, the expiration flag is cleared every time anything other than umount()
+looks at a mountpoint.
diff --git a/Documentation/filesystems/befs.txt b/Documentation/filesystems/befs.txt
new file mode 100644
index 0000000..877a7b1
--- /dev/null
+++ b/Documentation/filesystems/befs.txt
@@ -0,0 +1,117 @@
+BeOS filesystem for Linux
+
+Document last updated: Dec 6, 2001
+
+WARNING
+=======
+Make sure you understand that this is alpha software. This means that the
+implementation is neither complete nor well-tested.
+
+I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE!
+
+LICENSE
+=======
+This software is covered by the GNU General Public License.
+See the file COPYING for the complete text of the license.
+Or the GNU website: <http://www.gnu.org/licenses/licenses.html>
+
+AUTHOR
+======
+The largest part of the code was written by Will Dyson <will_dyson@pobox.com>.
+He has been working on the code since Aug 13, 2001. See the changelog for
+details.
+
+Original Author: Makoto Kato <m_kato@ga2.so-net.ne.jp>
+His original code can still be found at:
+<http://hp.vector.co.jp/authors/VA008030/bfs/>
+Does anyone know of a more current email address for Makoto? He doesn't
+respond to the address given above...
+
+Current maintainer: Sergey S. Kostyliov <rathamahata@php4.ru>
+
+WHAT IS THIS DRIVER?
+====================
+This module implements the native filesystem of BeOS <http://www.be.com/>
+for the linux 2.4.1 and later kernels. Currently it is a read-only
+implementation.
+
+Which is it, BFS or BEFS?
+=========================
+Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS".
+But the UnixWare Boot Filesystem is called bfs, too, and it is already in
+the kernel. Because of this naming conflict, on Linux the BeOS
+filesystem is called befs.
+
+HOW TO INSTALL
+==============
+step 1. Install the BeFS patch into the source code tree of linux.
+
+Apply the patchfile to your kernel source tree.
+Assuming that your kernel source is in /foo/bar/linux and the patchfile
+is called patch-befs-xxx, you would do the following:
+
+ cd /foo/bar/linux
+ patch -p1 < /path/to/patch-befs-xxx
+
+If the patching step fails (i.e. there are rejected hunks), you can try to
+figure it out yourself (it shouldn't be hard), or mail the maintainer
+(Will Dyson <will_dyson@pobox.com>) for help.
+
+step 2. Configure and build the kernel
+
+The linux kernel has many compile-time options. Most of them are beyond the
+scope of this document. I suggest the Kernel-HOWTO document as a good general
+reference on this topic. <http://www.linux.com/howto/Kernel-HOWTO.html>
+
+However, to use the BeFS module, you must enable it at configure time.
+
+ cd /foo/bar/linux
+ make menuconfig (or xconfig)
+
+The BeFS module is not a standard part of the linux kernel, so you must first
+enable support for experimental code under the "Code maturity level" menu.
+
+Then, under the "Filesystems" menu will be an option called "BeFS
+filesystem (experimental)", or something like that. Enable that option
+(it is fine to make it a module).
+
+Save your kernel configuration and then build your kernel.
+
+step 3. Install
+
+See the kernel howto <http://www.linux.com/howto/Kernel-HOWTO.html> for
+instructions on this critical step.
+
+USING BFS
+=========
+To use the BeOS filesystem, use filesystem type 'befs'.
+
+ex)
+ mount -t befs /dev/fd0 /beos
+
+MOUNT OPTIONS
+=============
+uid=nnn All files in the partition will be owned by user id nnn.
+gid=nnn All files in the partition will be in group nnn.
+iocharset=xxx Use xxx as the name of the NLS translation table.
+debug The driver will output debugging information to the syslog.
+
+HOW TO GET THE LATEST VERSION
+=============================
+
+The latest version is currently available at:
+<http://befs-driver.sourceforge.net/>
+
+ANY KNOWN BUGS?
+===============
+As of Jan 20, 2002:
+
+ None
+
+SPECIAL THANKS
+==============
+Dominic Giampaolo ... Writing "Practical File System Design with the Be File System"
+Hiroyuki Yamada ... Testing LinuxPPC.
+
+
+
diff --git a/Documentation/filesystems/bfs.txt b/Documentation/filesystems/bfs.txt
new file mode 100644
index 0000000..d2841e0
--- /dev/null
+++ b/Documentation/filesystems/bfs.txt
@@ -0,0 +1,57 @@
+BFS FILESYSTEM FOR LINUX
+========================
+
+The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which
+usually contains the kernel image and a few other files required for the
+boot process.
+
+In order to access the /stand partition under Linux you need to
+know the partition number, and the kernel must support UnixWare disk slices
+(CONFIG_UNIXWARE_DISKLABEL config option). However, BFS support does not
+depend on having UnixWare disklabel support because one can also mount
+a BFS filesystem via loopback:
+
+# losetup /dev/loop0 stand.img
+# mount -t bfs /dev/loop0 /mnt/stand
+
+where stand.img is a file containing the image of a BFS filesystem.
+When you have finished using it and unmounted it, you also need to deallocate
+the /dev/loop0 device with:
+
+# losetup -d /dev/loop0
+
+You can simplify mounting by just typing:
+
+# mount -t bfs -o loop stand.img /mnt/stand
+
+This will allocate the first available loopback device (and load the loop.o
+kernel module if necessary) automatically. If the loopback driver is not
+loaded automatically, make sure that your kernel is compiled with kmod
+support (CONFIG_KMOD) enabled. Beware that umount will not
+deallocate the /dev/loopN device if the /etc/mtab file on your system is a
+symbolic link to /proc/mounts. You will need to do it manually using the
+"-d" switch of losetup(8). Read the losetup(8) manpage for more info.
+
+To create the BFS image under UnixWare you need to find out first which
+slice contains it. The command prtvtoc(1M) is your friend:
+
+# prtvtoc /dev/rdsk/c0b0t0d0s0
+
+(assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you
+look for the slice with tag "STAND", which is usually slice 10. With this
+information you can use dd(1) to create the BFS image:
+
+# umount /stand
+# dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512
+
+Just in case, you can verify that you have done the right thing by checking
+the magic number:
+
+# od -Ad -tx4 stand.img | more
+
+The first 4 bytes should be 0x1badface.
+
+If you have any patches, questions or suggestions regarding this BFS
+implementation please contact the author:
+
+Tigran A. Aivazian <tigran@veritas.com>
diff --git a/Documentation/filesystems/cifs.txt b/Documentation/filesystems/cifs.txt
new file mode 100644
index 0000000..49cc923
--- /dev/null
+++ b/Documentation/filesystems/cifs.txt
@@ -0,0 +1,51 @@
+ This is the client VFS module for the Common Internet File System
+ (CIFS) protocol which is the successor to the Server Message Block
+ (SMB) protocol, the native file sharing mechanism for most early
+ PC operating systems. CIFS is fully supported by current network
+ file servers such as Windows 2000, Windows 2003 (including
+ Windows XP) as well as by Samba (which provides excellent CIFS
+ server support for Linux and many other operating systems), so
+ this network filesystem client can mount to a wide variety of
+ servers. The smbfs module should be used instead of this cifs module
+ for mounting to older SMB servers such as OS/2. The smbfs and cifs
+ modules can coexist and do not conflict. The CIFS VFS filesystem
+ module is designed to work well with servers that implement the
+ newer versions (dialects) of the SMB/CIFS protocol such as Samba,
+ the program written by Andrew Tridgell that turns any Unix host
+ into a SMB/CIFS file server.
+
+ The intent of this module is to provide the most advanced network
+ file system function for CIFS compliant servers, including better
+ POSIX compliance, secure per-user session establishment, high
+ performance safe distributed caching (oplock), optional packet
+ signing, large files, Unicode support and other internationalization
+ improvements. Since both Samba server and this filesystem client support
+ the CIFS Unix extensions, the combination can provide a reasonable
+ alternative to NFSv4 for fileserving in some Linux to Linux environments,
+ not just in Linux to Windows environments.
+
+ This filesystem has an optional mount utility (mount.cifs) that can
+ be obtained from the project page and installed in the path in the same
+ directory with the other mount helpers (such as mount.smbfs).
+ Mounting using the cifs filesystem without installing the mount helper
+ requires specifying the server's ip address.
+
+ For Linux 2.4:
+ mount //anything/here /mnt_target -o
+ user=username,pass=password,unc=//ip_address_of_server/sharename
+
+ For Linux 2.5:
+ mount //ip_address_of_server/sharename /mnt_target -o user=username,pass=password
+
+
+ For more information on the module see the project page at
+
+ http://us1.samba.org/samba/Linux_CIFS_client.html
+
+ For more information on CIFS see:
+
+ http://www.snia.org/tech_activities/CIFS
+
+ or the Samba site:
+
+ http://www.samba.org
diff --git a/Documentation/filesystems/coda.txt b/Documentation/filesystems/coda.txt
new file mode 100644
index 0000000..6131135
--- /dev/null
+++ b/Documentation/filesystems/coda.txt
@@ -0,0 +1,1673 @@
+NOTE:
+This is one of the technical documents describing a component of
+Coda -- this document describes the client kernel-Venus interface.
+
+For more information:
+ http://www.coda.cs.cmu.edu
+For user level software needed to run Coda:
+ ftp://ftp.coda.cs.cmu.edu
+
+To run Coda you need to get a user level cache manager for the client,
+named Venus, as well as tools to manipulate ACLs, to log in, etc. The
+client needs to have the Coda filesystem selected in the kernel
+configuration.
+
+The server needs a user level server and at present does not depend on
+kernel support.
+
+
+
+
+
+
+
+ The Venus kernel interface
+ Peter J. Braam
+ v1.0, Nov 9, 1997
+
+ This document describes the communication between Venus and kernel
+ level filesystem code needed for the operation of the Coda file
+ system. This document version is meant to describe the current interface
+ (version 1.0) as well as improvements we envisage.
+ ______________________________________________________________________
+
+ Table of Contents
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 1. Introduction
+
+ 2. Servicing Coda filesystem calls
+
+ 3. The message layer
+
+ 3.1 Implementation details
+
+ 4. The interface at the call level
+
+ 4.1 Data structures shared by the kernel and Venus
+ 4.2 The pioctl interface
+ 4.3 root
+ 4.4 lookup
+ 4.5 getattr
+ 4.6 setattr
+ 4.7 access
+ 4.8 create
+ 4.9 mkdir
+ 4.10 link
+ 4.11 symlink
+ 4.12 remove
+ 4.13 rmdir
+ 4.14 readlink
+ 4.15 open
+ 4.16 close
+ 4.17 ioctl
+ 4.18 rename
+ 4.19 readdir
+ 4.20 vget
+ 4.21 fsync
+ 4.22 inactive
+ 4.23 rdwr
+ 4.24 odymount
+ 4.25 ody_lookup
+ 4.26 ody_expand
+ 4.27 prefetch
+ 4.28 signal
+
+ 5. The minicache and downcalls
+
+ 5.1 INVALIDATE
+ 5.2 FLUSH
+ 5.3 PURGEUSER
+ 5.4 ZAPFILE
+ 5.5 ZAPDIR
+ 5.6 ZAPVNODE
+ 5.7 PURGEFID
+ 5.8 REPLACE
+
+ 6. Initialization and cleanup
+
+ 6.1 Requirements
+
+
+ ______________________________________________________________________
+
+
+ 1. Introduction
+
+
+
+ A key component in the Coda Distributed File System is the cache
+ manager, Venus.
+
+
+ When processes on a Coda enabled system access files in the Coda
+ filesystem, requests are directed at the filesystem layer in the
+ operating system. The operating system will communicate with Venus to
+ service the request for the process. Venus manages a persistent
+ client cache and makes remote procedure calls to Coda file servers and
+ related servers (such as authentication servers) to service these
+ requests it receives from the operating system. When Venus has
+ serviced a request it replies to the operating system with appropriate
+ return codes, and other data related to the request. Optionally the
+ kernel support for Coda may maintain a minicache of recently processed
+ requests to limit the number of interactions with Venus. Venus
+ possesses the facility to inform the kernel when elements from its
+ minicache are no longer valid.
+
+ This document describes precisely this communication between the
+ kernel and Venus. The definitions of so called upcalls and downcalls
+ will be given with the format of the data they handle. We shall also
+ describe the semantic invariants resulting from the calls.
+
+ Historically Coda was implemented in a BSD file system in Mach 2.6.
+ The interface between the kernel and Venus is very similar to the BSD
+ VFS interface. Similar functionality is provided, and the format of
+ the parameters and returned data is very similar to the BSD VFS. This
+ leads to an almost natural environment for implementing a kernel-level
+ filesystem driver for Coda in a BSD system. However, other operating
+ systems such as Linux and Windows 95 and NT have virtual filesystems
+ with different interfaces.
+
+ To implement Coda on these systems some reverse engineering of the
+ Venus/Kernel protocol is necessary. Also it came to light that other
+ systems could profit significantly from certain small optimizations
+ and modifications to the protocol. To facilitate this work as well as
+ to make future ports easier, communication between Venus and the
+ kernel should be documented in great detail. This is the aim of this
+ document.
+
+ 2. Servicing Coda filesystem calls
+
+ Servicing a request for Coda file system service begins in a process
+ P which is accessing a Coda file. It makes a system call which
+ traps to the OS kernel. Examples of such calls trapping to the kernel
+ are read, write, open, close, create, mkdir, rmdir, chmod in a Unix
+ context. Similar calls exist in the Win32 environment, such as
+ CreateFile.
+
+ Generally the operating system handles the request in a virtual
+ filesystem (VFS) layer, which is named I/O Manager in NT and IFS
+ manager in Windows 95. The VFS is responsible for partial processing
+ of the request and for locating the specific filesystem(s) which will
+ service parts of the request. Usually the information in the path
+ assists in locating the correct FS drivers. Sometimes after extensive
+ pre-processing, the VFS starts invoking exported routines in the FS
+ driver. This is the point where the FS specific processing of the
+ request starts, and here the Coda specific kernel code comes into
+ play.
+
+ The FS layer for Coda must expose and implement several interfaces.
+ First and foremost the VFS must be able to make all necessary calls to
+ the Coda FS layer, so the Coda FS driver must expose the VFS interface
+ as applicable in the operating system. These differ very significantly
+ among operating systems, but share features such as facilities to
+ read/write and create and remove objects. The Coda FS layer services
+ such VFS requests by invoking one or more well defined services
+ offered by the cache manager Venus. When the replies from Venus have
+ come back to the FS driver, servicing of the VFS call continues and
+ finishes with a reply to the kernel's VFS. Finally the VFS layer
+ returns to the process.
+
+ As a result of this design a basic interface exposed by the FS driver
+ must allow Venus to manage message traffic. In particular Venus must
+ be able to retrieve and place messages and to be notified of the
+ arrival of a new message. The notification must be through a mechanism
+ which does not block Venus since Venus must attend to other tasks even
+ when no messages are waiting or being processed.
+
+ [Figure: Interfaces of the Coda FS Driver]
+
+ Furthermore the FS layer provides for a special path of communication
+ between a user process and Venus, called the pioctl interface. The
+ pioctl interface is used for Coda specific services, such as
+ requesting detailed information about the persistent cache managed by
+ Venus. Here the involvement of the kernel is minimal. It identifies
+ the calling process and passes the information on to Venus. When
+ Venus replies the response is passed back to the caller in unmodified
+ form.
+
+ Finally Venus allows the kernel FS driver to cache the results from
+ certain services. This is done to avoid excessive context switches
+ and results in an efficient system. However, Venus may acquire
+ information, for example from the network, which implies that cached
+ information must be flushed or replaced. Venus then makes a downcall
+ to the Coda FS layer to request flushes or updates in the cache. The
+ kernel FS driver handles such requests synchronously.
+
+ Among these interfaces the VFS interface and the facility to place,
+ receive and be notified of messages are platform specific. We will
+ not go into the calls exported to the VFS layer but we will state the
+ requirements of the message exchange mechanism.
+
+ 3. The message layer
+
+
+
+ At the lowest level the communication between Venus and the FS driver
+ proceeds through messages. The synchronization between processes
+ requesting Coda file service and Venus relies on blocking and waking
+ up processes. The Coda FS driver processes VFS- and pioctl-requests
+ on behalf of a process P, creates messages for Venus, awaits replies
+ and finally returns to the caller. The implementation of the exchange
+ of messages is platform specific, but the semantics have (so far)
+ appeared to be generally applicable. Data buffers are created by the
+ FS Driver in kernel memory on behalf of P and copied to user memory in
+ Venus.
+
+ The FS Driver while servicing P makes upcalls to Venus. Such an
+ upcall is dispatched to Venus by creating a message structure. The
+ structure contains the identification of P, the message sequence
+ number, the size of the request and a pointer to the data in kernel
+ memory for the request. Since the data buffer is re-used to hold the
+ reply from Venus, there is a field for the size of the reply. A flags
+ field is used in the message to precisely record the status of the
+ message. Additional platform dependent structures involve pointers to
+ determine the position of the message on queues and pointers to
+ synchronization objects. In the upcall routine the message structure
+ is filled in, flags are set to 0, and it is placed on the pending
+ queue. The routine calling upcall is responsible for allocating the
+ data buffer; its structure will be described in the next section.
+
+ A facility must exist to notify Venus that the message has been
+ created; it is implemented using available synchronization objects in
+ the OS. This notification is done in the upcall context of the process
+ P. When the message is on the pending queue, process P cannot proceed
+ in upcall. The (kernel mode) processing of P in the filesystem
+ request routine must be suspended until Venus has replied. Therefore
+ the calling thread in P is blocked in upcall. A pointer in the
+ message structure will locate the synchronization object on which P is
+ sleeping.
+
+ Venus detects the notification that a message has arrived, and the FS
+ driver allows Venus to retrieve the message with a getmsg_from_kernel
+ call. This action finishes in the kernel by putting the message on the
+ queue of processing messages and setting flags to READ. Venus is
+ passed the contents of the data buffer. The getmsg_from_kernel call
+ now returns and Venus processes the request.
+
+ At some later point the FS driver receives a message from Venus,
+ namely when Venus calls sendmsg_to_kernel. At this moment the Coda FS
+ driver looks at the contents of the message and decides if:
+
+
+ o The message is a reply for a suspended thread P. If so, it removes
+ the message from the processing queue and marks the message as
+ WRITTEN. Finally, the FS driver unblocks P (still in the kernel
+ mode context of Venus) and the sendmsg_to_kernel call returns to
+ Venus. The process P will be scheduled at some point and continues
+ processing its upcall with the data buffer replaced with the reply
+ from Venus.
+
+ o The message is a downcall. A downcall is a request from Venus to
+ the FS Driver. The FS driver processes the request immediately
+ (usually a cache eviction or replacement) and when it finishes
+ sendmsg_to_kernel returns.
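+
+ The life cycle of an upcall message described above (pending queue,
+ READ, processing queue, WRITTEN) can be sketched in C. This is only a
+ single-threaded illustration under stated assumptions: the names
+ upc_msg, msg_queue, upcall_enqueue, getmsg and reply, the numeric flag
+ values, and the matching of a reply to its request by the unique
+ field are choices of this sketch, not the driver's actual code. A
+ real driver adds locking and the sleep/wakeup of P.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values; the text only names the states READ and WRITTEN. */
enum { MSG_READ = 0x1, MSG_WRITTEN = 0x2 };

/* Hypothetical message structure mirroring the fields listed in the text. */
struct upc_msg {
    struct upc_msg *next;  /* position on the pending or processing queue */
    uint32_t unique;       /* message sequence number */
    uint16_t pid;          /* identification of the calling process P */
    unsigned flags;        /* 0, MSG_READ, or MSG_READ | MSG_WRITTEN */
    size_t req_size;       /* size of the request held in data */
    size_t rep_size;       /* size of the reply Venus wrote into data */
    void *data;            /* kernel buffer, re-used for the reply */
};

struct msg_queue { struct upc_msg *head, *tail; };

void queue_append(struct msg_queue *q, struct upc_msg *m)
{
    m->next = NULL;
    if (q->tail)
        q->tail->next = m;
    else
        q->head = m;
    q->tail = m;
}

struct upc_msg *queue_pop(struct msg_queue *q)
{
    struct upc_msg *m = q->head;
    if (!m)
        return NULL;
    q->head = m->next;
    if (!q->head)
        q->tail = NULL;
    m->next = NULL;
    return m;
}

struct upc_msg *queue_remove(struct msg_queue *q, uint32_t unique)
{
    struct upc_msg *prev = NULL, *m;
    for (m = q->head; m; prev = m, m = m->next) {
        if (m->unique != unique)
            continue;
        if (prev)
            prev->next = m->next;
        else
            q->head = m->next;
        if (q->tail == m)
            q->tail = prev;
        m->next = NULL;
        return m;
    }
    return NULL;
}

/* upcall: the request of P goes on the pending queue with flags cleared. */
void upcall_enqueue(struct msg_queue *pending, struct upc_msg *m)
{
    assert(m != NULL);
    m->flags = 0;
    queue_append(pending, m);
}

/* getmsg_from_kernel: oldest pending message moves to processing, marked READ. */
struct upc_msg *getmsg(struct msg_queue *pending, struct msg_queue *processing)
{
    struct upc_msg *m = queue_pop(pending);
    if (m) {
        m->flags |= MSG_READ;
        queue_append(processing, m);
    }
    return m;
}

/* sendmsg_to_kernel, reply case: unlink the waiting message and mark it
 * WRITTEN (READ stays set here); the caller would then wake up P. */
struct upc_msg *reply(struct msg_queue *processing, uint32_t unique, size_t rep_size)
{
    struct upc_msg *m = queue_remove(processing, unique);
    if (m) {
        m->flags |= MSG_WRITTEN;
        m->rep_size = rep_size;
    }
    return m;
}
```

+ A reply whose unique number matches nothing on the processing queue
+ returns NULL in this sketch; how a real driver treats such a message
+ is not specified in the text.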
+
+ Now P awakes and continues processing upcall. There are some
+ subtleties to take account of. First P will determine if it was woken
+ up in upcall by a signal from some other source (for example an
+ attempt to terminate P) or as is normally the case by Venus in its
+ sendmsg_to_kernel call. In the normal case, the upcall routine will
+ deallocate the message structure and return. The FS routine can proceed
+ with its processing.
+
+ [Figure: Sleeping and IPC arrangements]
+
+ In case P is woken up by a signal and not by Venus, it will first look
+ at the flags field. If the message is not yet READ, the process P can
+ handle its signal without notifying Venus. If Venus has already READ
+ the message and the request should not be processed, P can send Venus
+ a signal message to indicate that it should disregard the previous
+ message. Such signals are put in the queue at the head, and read first
+ by Venus. If the message is already marked as WRITTEN it is too late
+ to stop the processing. The VFS routine will now continue. (-- If a
+ VFS request involves more than one upcall, this can lead to
+ complicated state; an extra field "handle_signals" could be added in
+ the message structure to indicate that points of no return have been
+ passed.--)
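+
+ The three cases a signalled process has to distinguish can be written
+ as a small decision function. This is a sketch only: the flag values
+ and the names signal_action and on_signal are assumptions made for
+ illustration, and the real upcall routine also has to dequeue the
+ message and queue the signal message itself.

```c
#include <assert.h>

/* Hypothetical flag values; the text only names the states READ and WRITTEN. */
enum { MSG_READ = 0x1, MSG_WRITTEN = 0x2 };

enum signal_action {
    ACT_ABORT_QUIETLY,  /* not yet READ: Venus never saw the request */
    ACT_NOTIFY_VENUS,   /* READ: put a signal message at the head of the queue */
    ACT_TOO_LATE        /* WRITTEN: the reply is in, the VFS routine continues */
};

/* Decide what P does when it wakes up in upcall because of a signal. */
enum signal_action on_signal(unsigned flags)
{
    /* No other flag bits are defined in this sketch. */
    assert((flags & ~(unsigned)(MSG_READ | MSG_WRITTEN)) == 0);

    if (flags & MSG_WRITTEN)
        return ACT_TOO_LATE;
    if (flags & MSG_READ)
        return ACT_NOTIFY_VENUS;
    return ACT_ABORT_QUIETLY;
}
```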
+
+
+
+ 3.1. Implementation details
+
+ The Unix implementation of this mechanism uses a character device
+ associated with Coda. Venus retrieves messages by doing a read on the
+ device, replies are sent with a write and notification is through the
+ select system call on the file descriptor for the device. The process
+ P is kept waiting on an interruptible wait queue object.
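+
+ Seen from the Venus side, the Unix arrangement amounts to a
+ select/read/write loop on one file descriptor. The sketch below shows
+ the three primitives; the helper names wait_for_message,
+ fetch_request and send_reply are hypothetical, and the descriptor is
+ assumed to be already open on the Coda character device, whose path
+ the text does not give. Any readable descriptor (for example a pipe)
+ behaves the same way for experimentation.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Wait up to timeout_ms for a message to become readable on fd.
 * Returns 1 if readable, 0 on timeout, -1 on error (as select does). */
int wait_for_message(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    return select(fd + 1, &rfds, NULL, NULL, &tv);
}

/* Retrieve one request if one is waiting. Returns the number of bytes
 * read, 0 if nothing arrived within the timeout, -1 on error. Venus is
 * free to attend to other tasks whenever this returns 0. */
ssize_t fetch_request(int fd, void *buf, size_t len, int timeout_ms)
{
    int ready = wait_for_message(fd, timeout_ms);
    if (ready <= 0)
        return ready;
    return read(fd, buf, len);
}

/* Hand a reply (or a downcall) to the kernel with a plain write. */
ssize_t send_reply(int fd, const void *buf, size_t len)
{
    return write(fd, buf, len);
}
```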
+
+ In Windows NT and the DPMI Windows 95 implementation a DeviceIoControl
+ call is used. The DeviceIoControl call is designed to copy buffers
+ from user memory to kernel memory with OPCODES. The sendmsg_to_kernel
+ is issued as a synchronous call, while the getmsg_from_kernel call is
+ asynchronous. Windows EventObjects are used for notification of
+ message arrival. The process P is kept waiting on a KernelEvent
+ object in NT and a semaphore in Windows 95.
+
+ 4. The interface at the call level
+
+
+ This section describes the upcalls a Coda FS driver can make to Venus.
+ Each of these upcalls makes use of two structures: inputArgs and
+ outputArgs. In pseudo BNF form the structures take the following
+ form:
+
+
+ struct inputArgs {
+ u_long opcode;
+ u_long unique; /* Keep multiple outstanding msgs distinct */
+ u_short pid; /* Common to all */
+ u_short pgid; /* Common to all */
+ struct CodaCred cred; /* Common to all */
+
+ <union "in" of call dependent parts of inputArgs>
+ };
+
+ struct outputArgs {
+ u_long opcode;
+ u_long unique; /* Keep multiple outstanding msgs distinct */
+ u_long result;
+
+ <union "out" of call dependent parts of outputArgs>
+ };
+
+
+
+ Before going on let us elucidate the role of the various fields. The
+ inputArgs start with the opcode which defines the type of service
+ requested from Venus. There are approximately 30 upcalls at present
+ which we will discuss. The unique field labels the inputArgs with a
+ number that identifies the message uniquely. A process and
+ process group id are passed. Finally the credentials of the caller
+ are included.
+
+ Before delving into the specific calls we need to discuss a variety of
+ data structures shared by the kernel and Venus.
+
+
+
+
+ 4.1. Data structures shared by the kernel and Venus
+
+
+ The CodaCred structure defines a variety of user and group ids as
+ they are set for the calling process. The vuid_t and vgid_t are 32 bit
+ unsigned integers. It also defines group membership in an array. On
+ Unix the CodaCred has proven sufficient to implement good security
+ semantics for Coda but the structure may have to undergo modification
+ for the Windows environments when these mature.
+
+ struct CodaCred {
+ vuid_t cr_uid, cr_euid, cr_suid, cr_fsuid; /* Real, effective, saved set, fs uid */
+ vgid_t cr_gid, cr_egid, cr_sgid, cr_fsgid; /* same for groups */
+ vgid_t cr_groups[NGROUPS]; /* Group membership for caller */
+ };
+
+
+
+ NOTE It is questionable if we need CodaCreds in Venus. After all, Venus
+ doesn't know about groups, although it does create files with the
+ default uid/gid. Perhaps the list of group membership is superfluous.
+
+
+ The next item is the fundamental identifier used to identify Coda
+ files, the ViceFid. A fid of a file uniquely defines a file or
+ directory in the Coda filesystem within a cell. (-- A cell is a
+ group of Coda servers acting under the aegis of a single system
+ control machine or SCM. See the Coda Administration manual for a
+ detailed description of the role of the SCM.--)
+
+
+ typedef struct ViceFid {
+ VolumeId Volume;
+ VnodeId Vnode;
+ Unique_t Unique;
+ } ViceFid;
+
+
+
+ Each of the constituent fields, VolumeId, VnodeId and Unique_t, is an
+ unsigned 32 bit integer. We envisage that a further field will need
+ to be prefixed to identify the Coda cell; this will probably take the
+ form of an IPv6-sized IP address naming the Coda cell through DNS.
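+
+ Because a fid identifies an object only through the combination of
+ its three 32 bit fields, any comparison has to look at all of them.
+ A minimal sketch, assuming the three field types are plain uint32_t
+ typedefs; the helper name fid_equal is hypothetical and not taken
+ from the Coda sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed typedefs: the text states these are unsigned 32 bit integers. */
typedef uint32_t VolumeId;
typedef uint32_t VnodeId;
typedef uint32_t Unique_t;

typedef struct ViceFid {
    VolumeId Volume;
    VnodeId Vnode;
    Unique_t Unique;
} ViceFid;

/* Hypothetical helper: two fids name the same object within a cell
 * only if volume, vnode and uniquifier all agree. */
int fid_equal(const ViceFid *a, const ViceFid *b)
{
    assert(a != NULL && b != NULL);
    return a->Volume == b->Volume &&
           a->Vnode == b->Vnode &&
           a->Unique == b->Unique;
}
```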
+
+ The next important structure shared between Venus and the kernel is
+ the attributes of the file. The following structure is used to
+ exchange information. It has room for future extensions such as
+ support for device files (currently not present in Coda).
+
+ struct coda_vattr {
+ enum coda_vtype va_type; /* vnode type (for create) */
+ u_short va_mode; /* file's access mode and type */
+ short va_nlink; /* number of references to file */
+ vuid_t va_uid; /* owner user id */
+ vgid_t va_gid; /* owner group id */
+ long va_fsid; /* file system id (dev for now) */
+ long va_fileid; /* file id */
+ u_quad_t va_size; /* file size in bytes */
+ long va_blocksize; /* blocksize preferred for i/o */
+ struct timespec va_atime; /* time of last access */
+ struct timespec va_mtime; /* time of last modification */
+ struct timespec va_ctime; /* time file changed */
+ u_long va_gen; /* generation number of file */
+ u_long va_flags; /* flags defined for file */
+ dev_t va_rdev; /* device special file represents */
+ u_quad_t va_bytes; /* bytes of disk space held by file */
+ u_quad_t va_filerev; /* file modification number */
+ u_int va_vaflags; /* operations flags, see below */
+ long va_spare; /* remain quad aligned */
+ };
+
+
+
+
+ 4.2. The pioctl interface
+
+
+ Coda specific requests can be made by applications through the pioctl
+ interface. The pioctl is implemented as an ordinary ioctl on a
+ fictitious file /coda/.CONTROL. The pioctl call opens this file, gets
+ a file handle and makes the ioctl call. Finally it closes the file.
+
+ The kernel involvement in this is limited to providing the facility to
+ open and close and pass the ioctl message and to verify that a path in
+ the pioctl data buffers is a file in a Coda filesystem.
+
+ The kernel is handed a data packet of the form:
+
+ struct {
+ const char *path;
+ struct ViceIoctl vidata;
+ int follow;
+ } data;
+
+
+
+ where
+
+
+ struct ViceIoctl {
+ caddr_t in, out; /* Data to be transferred in, or out */
+ short in_size; /* Size of input buffer <= 2K */
+ short out_size; /* Maximum size of output buffer, <= 2K */
+ };
+
+
+
+ The path must be a Coda file, otherwise the ioctl upcall will not be
+ made.
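+
+ The open, ioctl, close sequence and the 2K buffer limits can be
+ sketched as follows. Everything not named in the text is an
+ assumption of this sketch: the struct tag pioctl_data, the macro
+ VIOC_MAXBUF, the helpers vice_ioctl_valid and pioctl_sketch, the use
+ of void * in place of caddr_t, and the cmd value, which stands in for
+ whatever request number the real pioctl library passes.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define VIOC_MAXBUF 2048        /* the "<= 2K" limit; macro name is hypothetical */

struct ViceIoctl {
    void *in, *out;             /* caddr_t in the text */
    short in_size;              /* size of input buffer, <= 2K */
    short out_size;             /* maximum size of output buffer, <= 2K */
};

struct pioctl_data {            /* hypothetical tag for the unnamed packet */
    const char *path;
    struct ViceIoctl vidata;
    int follow;
};

/* Check the buffer sizes against the documented limits. */
int vice_ioctl_valid(const struct ViceIoctl *v)
{
    assert(v != NULL);
    return v->in_size >= 0 && v->in_size <= VIOC_MAXBUF &&
           v->out_size >= 0 && v->out_size <= VIOC_MAXBUF;
}

/* Open the control file, issue one ioctl, close the file again.
 * control would be "/coda/.CONTROL" on a Coda client. Returns the
 * ioctl result, or -1 with errno set if validation or open fails. */
int pioctl_sketch(const char *control, unsigned long cmd, struct pioctl_data *d)
{
    int fd, rc, saved;

    if (!vice_ioctl_valid(&d->vidata)) {
        errno = EINVAL;
        return -1;
    }
    fd = open(control, O_RDONLY);
    if (fd < 0)
        return -1;
    rc = ioctl(fd, cmd, d);
    saved = errno;              /* close() must not clobber the ioctl errno */
    close(fd);
    errno = saved;
    return rc;
}
```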
+
+ NOTE The data structures and code are a mess. We need to clean this
+ up.
+
+ We now proceed to document the individual calls:
+
+ 4.3. root
+
+
+ Arguments
+
+ in empty
+
+ out
+
+ struct cfs_root_out {
+ ViceFid VFid;
+ } cfs_root;
+
+
+
+ Description This call is made to Venus during the initialization of
+ the Coda filesystem. If the result is zero, the cfs_root structure
+ contains the ViceFid of the root of the Coda filesystem. If a non-zero
+ result is generated, its value is a platform dependent error code
+ indicating the difficulty Venus encountered in locating the root of
+ the Coda filesystem.
+
+ 4.4. lookup
+
+
+ Summary Find the ViceFid and type of an object in a directory if it
+ exists.
+
+ Arguments
+
+ in
+
+ struct cfs_lookup_in {
+ ViceFid VFid;
+ char *name; /* Place holder for data. */
+ } cfs_lookup;
+
+
+
+ out
+
+ struct cfs_lookup_out {
+ ViceFid VFid;
+ int vtype;
+ } cfs_lookup;
+
+
+
+ Description This call is made to determine the ViceFid and filetype of
+ a directory entry. The requested directory entry has the name given in
+ name, and Venus will search the directory identified by
+ cfs_lookup_in.VFid. The result may indicate that the name does not
+ exist, or that difficulty was encountered in finding it (e.g. due to
+ disconnection). If the result is zero, the field cfs_lookup_out.VFid
+ contains the target's ViceFid and cfs_lookup_out.vtype the coda_vtype
+ giving the type of object the name designates.
+
+ The name of the object is an 8 bit character string of maximum length
+ CFS_MAXNAMLEN, currently set to 256 (including a 0 terminator.)
+
+ It is extremely important to realize that Venus bitwise ors the field
+ cfs_lookup.vtype with CFS_NOCACHE to indicate that the object should
+ not be put in the kernel name cache.
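+
+ A kernel driver therefore has to split the returned vtype into the
+ object type proper and the caching hint before using either. The
+ sketch below assumes that CFS_NOCACHE is the high bit of a 32 bit
+ word; the text names the flag but does not give its value, so check
+ the shared header before relying on it. The names lookup_result and
+ decode_vtype are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value: the text names CFS_NOCACHE but does not define it. */
#define CFS_NOCACHE 0x80000000u

struct lookup_result {
    uint32_t vtype;     /* object type with the flag bit stripped */
    int cacheable;      /* 0 if Venus asked to keep it out of the name cache */
};

/* Separate the type code from the CFS_NOCACHE hint that Venus ors in. */
struct lookup_result decode_vtype(uint32_t raw)
{
    struct lookup_result r;

    r.vtype = raw & ~CFS_NOCACHE;
    r.cacheable = (raw & CFS_NOCACHE) == 0;
    assert((r.vtype & CFS_NOCACHE) == 0);
    return r;
}
```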
+
+ NOTE The type of the vtype is currently wrong. It should be
+ coda_vtype. Linux does not take note of CFS_NOCACHE. It should.
+
+ 4.5. getattr
+
+
+ Summary Get the attributes of a file.
+
+ Arguments
+
+ in
+
+ struct cfs_getattr_in {
+ ViceFid VFid;
+ struct coda_vattr attr; /* XXXXX */
+ } cfs_getattr;
+
+
+
+ out
+
+ struct cfs_getattr_out {
+ struct coda_vattr attr;
+ } cfs_getattr;
+
+
+
+ Description This call returns the attributes of the file identified by
+ fid.
+
+ Errors Errors can occur if the object with fid does not exist, is
+ inaccessible or if the caller does not have permission to fetch
+ attributes.
+
+ Note Many kernel FS drivers (Linux, NT and Windows 95) need to acquire
+ the attributes as well as the Fid for the instantiation of an internal
+ "inode" or "FileHandle". A significant improvement in performance on
+ such systems could be made by combining the lookup and getattr calls
+ both at the Venus/kernel interaction level and at the RPC level.
+
+ The vattr structure included in the input arguments is superfluous and
+ should be removed.
+
+ 4.6. setattr
+
+
+ Summary Set the attributes of a file.
+
+ Arguments
+
+ in
+
+ struct cfs_setattr_in {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ } cfs_setattr;
+
+
+
+
+ out
+ empty
+
+ Description The structure attr is filled with attributes to be changed
+ in BSD style. Attributes not to be changed are set to -1, apart from
+ vtype which is set to VNON. Others are set to the value to be assigned.
+ The only attributes which the FS driver may request to change are the
+ mode, owner, groupid, atime, mtime and ctime. The return value
+ indicates success or failure.
+
+ Errors A variety of errors can occur. The object may not exist, may
+ be inaccessible, or permission may not be granted by Venus.
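+
+ The "-1 means leave unchanged" convention for setattr can be
+ illustrated with a cut-down attribute structure. This is a sketch
+ under assumptions: vattr_sketch keeps only the changeable fields and
+ stores times as plain seconds instead of struct timespec, VNON is
+ given the value 0, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { VNON = 0 };              /* assumed value of the "no type" constant */

/* Simplified stand-in for struct coda_vattr (changeable fields only). */
struct vattr_sketch {
    int va_type;
    unsigned short va_mode;
    uint32_t va_uid, va_gid;
    long va_atime_sec, va_mtime_sec, va_ctime_sec;
};

/* Mark every attribute as "do not change", BSD style. */
void vattr_set_nochange(struct vattr_sketch *va)
{
    assert(va != NULL);
    va->va_type = VNON;
    va->va_mode = (unsigned short)-1;
    va->va_uid = (uint32_t)-1;
    va->va_gid = (uint32_t)-1;
    va->va_atime_sec = -1;
    va->va_mtime_sec = -1;
    va->va_ctime_sec = -1;
}

/* Nonzero if the caller asked for a mode change. */
int vattr_mode_changes(const struct vattr_sketch *va)
{
    return va->va_mode != (unsigned short)-1;
}
```

+ A chmod-style request would call vattr_set_nochange first and then
+ assign only va_mode before making the upcall.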
+
+ 4.7. access
+
+
+ Summary
+
+ Arguments
+
+ in
+
+ struct cfs_access_in {
+ ViceFid VFid;
+ int flags;
+ } cfs_access;
+
+
+
+ out
+ empty
+
+ Description Verify if access to the object identified by VFid for
+ operations described by flags is permitted. The result indicates if
+ access will be granted. It is important to remember that Coda uses
+ ACLs to enforce protection and that ultimately the servers, not the
+ clients enforce the security of the system. The result of this call
+ will depend on whether a token is held by the user.
+
+ Errors The object may not exist, or the ACL describing the protection
+ may not be accessible.
+
+ 4.8. create
+
+
+ Summary Invoked to create a file.
+
+ Arguments
+
+ in
+
+ struct cfs_create_in {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ int excl;
+ int mode;
+ char *name; /* Place holder for data. */
+ } cfs_create;
+
+
+
+
+ out
+
+ struct cfs_create_out {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ } cfs_create;
+
+
+
+ Description This upcall is invoked to request creation of a file.
+ The file will be created in the directory identified by VFid, its name
+ will be name, and the mode will be mode. If excl is set an error will
+ be returned if the file already exists. If the size field in attr is
+ set to zero the file will be truncated. The uid and gid of the file
+ are set by converting the CodaCred to a uid using a macro CRTOUID
+ (this macro is platform dependent). Upon success the VFid and
+ attributes of the file are returned. The Coda FS Driver will normally
+ instantiate a vnode, inode or file handle at kernel level for the new
+ object.
+
+
+ Errors A variety of errors can occur. Permissions may be insufficient.
+ If the object exists and is not a file the error EISDIR is returned
+ under Unix.
+
+ NOTE The packing of parameters is very inefficient and appears to
+ indicate confusion between the system call creat and the VFS operation
+ create. The VFS operation create is only called to create new objects.
+ This create call differs from the Unix one in that it is not invoked
+ to return a file descriptor. The truncate and exclusive options,
+ together with the mode, could simply be part of the mode as it is
+ under Unix. There should be no flags argument; this is used in
+ open(2) to return a file descriptor for READ or WRITE mode.
+
+ The attributes of the directory should be returned too, since the size
+ and mtime changed.
+
+ 4.9. mkdir
+
+
+ Summary Create a new directory.
+
+ Arguments
+
+ in
+
+ struct cfs_mkdir_in {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ char *name; /* Place holder for data. */
+ } cfs_mkdir;
+
+
+
+ out
+
+ struct cfs_mkdir_out {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ } cfs_mkdir;
+
+
+
+
+ Description This call is similar to create but creates a directory.
+ Only the mode field in the input parameters is used for creation.
+ Upon successful creation, the attr returned contains the attributes of
+ the new directory.
+
+ Errors As for create.
+
+ NOTE The input parameter should be changed to mode instead of
+ attributes.
+
+ The attributes of the parent should be returned since the size and
+ mtime change.
+
+ 4.10. link
+
+
+ Summary Create a link to an existing file.
+
+ Arguments
+
+ in
+
+ struct cfs_link_in {
+ ViceFid sourceFid; /* cnode to link *to* */
+ ViceFid destFid; /* Directory in which to place link */
+ char *tname; /* Place holder for data. */
+ } cfs_link;
+
+
+
+ out
+ empty
+
+ Description This call creates a link to the sourceFid in the directory
+ identified by destFid with name tname. The source must reside in the
+ target's parent, i.e. the source must have parent destFid; Coda
+ does not support cross-directory hard links. Only the return value is
+ relevant. It indicates success or the type of failure.
+
+ Errors The usual errors can occur.
+
+ 4.11. symlink
+
+
+ Summary Create a symbolic link.
+
+ Arguments
+
+ in
+
+ struct cfs_symlink_in {
+ ViceFid VFid; /* Directory to put symlink in */
+ char *srcname;
+ struct coda_vattr attr;
+ char *tname;
+ } cfs_symlink;
+
+
+
+ out
+ none
+
+ Description Create a symbolic link. The link is to be placed in the
+ directory identified by VFid and named tname. It should point to the
+ pathname srcname. The attributes of the newly created object are to
+ be set to attr.
+
+ Errors
+
+ NOTE The attributes of the target directory should be returned since
+ its size changed.
+
+ 4.12. remove
+
+
+ Summary Remove a file.
+
+ Arguments
+
+ in
+
+ struct cfs_remove_in {
+ ViceFid VFid;
+ char *name; /* Place holder for data. */
+ } cfs_remove;
+
+
+
+ out
+ none
+
+ Description Remove file named cfs_remove_in.name in directory
+ identified by VFid.
+
+ Errors
+
+ NOTE The attributes of the directory should be returned since its
+ mtime and size may change.
+
+ 4.13. rmdir
+
+
+ Summary Remove a directory.
+
+ Arguments
+
+ in
+
+ struct cfs_rmdir_in {
+ ViceFid VFid;
+ char *name; /* Place holder for data. */
+ } cfs_rmdir;
+
+
+
+ out
+ none
+
+ Description Remove the directory with name name from the directory
+ identified by VFid.
+
+ Errors
+
+ NOTE The attributes of the parent directory should be returned since
+ its mtime and size may change.
+
+ 4.14. readlink
+
+
+ Summary Read the value of a symbolic link.
+
+ Arguments
+
+ in
+
+ struct cfs_readlink_in {
+ ViceFid VFid;
+ } cfs_readlink;
+
+
+
+ out
+
+ struct cfs_readlink_out {
+ int count;
+ caddr_t data; /* Place holder for data. */
+ } cfs_readlink;
+
+
+
+ Description This routine reads the contents of the symbolic link
+ identified by VFid into the buffer data. The buffer data must be able
+ to hold any name up to CFS_MAXNAMLEN (PATH or NAM??).
+
+ Errors No unusual errors.
+
+ 4.15. open
+
+
+ Summary Open a file.
+
+ Arguments
+
+ in
+
+ struct cfs_open_in {
+ ViceFid VFid;
+ int flags;
+ } cfs_open;
+
+
+
+ out
+
+ struct cfs_open_out {
+ dev_t dev;
+ ino_t inode;
+ } cfs_open;
+
+
+
+ Description This request asks Venus to place the file identified by
+ VFid in its cache and to note that the calling process wishes to open
+ it with flags as in open(2). The return value to the kernel differs
+ for Unix and Windows systems. For Unix systems the Coda FS Driver is
+ informed of the device and inode number of the container file in the
+ fields dev and inode. For Windows the path of the container file is
+ returned to the kernel.
+
+ Errors
+
+ NOTE Currently the cfs_open_out structure is not properly adapted to
+ deal with the Windows case. It might be best to implement two
+ upcalls, one to open aiming at a container file name, the other at a
+ container file inode.
+
+ 4.16. close
+
+
+ Summary Close a file, update it on the servers.
+
+ Arguments
+
+ in
+
+ struct cfs_close_in {
+ ViceFid VFid;
+ int flags;
+ } cfs_close;
+
+
+
+ out
+ none
+
+ Description Close the file identified by VFid.
+
+ Errors
+
+ NOTE The flags argument is bogus and not used. However, Venus' code
+ has room to deal with an execp input field; probably this field should
+ be used to inform Venus that the file was closed but is still memory
+ mapped for execution. There are comments about fetching versus not
+ fetching the data in Venus vproc_vfscalls. This seems silly. If a
+ file is being closed, the data in the container file is to be the new
+ data. Here again the execp flag might be in play to create confusion:
+ currently Venus might think a file can be flushed from the cache when
+ it is still memory mapped. This needs to be understood.
+
+ 4.17. ioctl
+
+
+ Summary Do an ioctl on a file. This includes the pioctl interface.
+
+ Arguments
+
+ in
+
+ struct cfs_ioctl_in {
+ ViceFid VFid;
+ int cmd;
+ int len;
+ int rwflag;
+ char *data; /* Place holder for data. */
+ } cfs_ioctl;
+
+
+
+ out
+
+
+ struct cfs_ioctl_out {
+ int len;
+ caddr_t data; /* Place holder for data. */
+ } cfs_ioctl;
+
+
+
+ Description Do an ioctl operation on a file. The command, len and
+ data arguments are filled as usual. flags is not used by Venus.
+
+ Errors
+
+ NOTE Another bogus parameter. flags is not used. What is the
+ business about PREFETCHING in the Venus code?
+
+ 4.18. rename
+
+
+ Summary Rename a fid.
+
+ Arguments
+
+ in
+
+ struct cfs_rename_in {
+ ViceFid sourceFid;
+ char *srcname;
+ ViceFid destFid;
+ char *destname;
+ } cfs_rename;
+
+
+
+ out
+ none
+
+ Description Rename the object with name srcname in directory
+ sourceFid to destname in destFid. It is important that the names
+ srcname and destname are 0 terminated strings. Strings in Unix
+ kernels are not always null terminated.
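+
+ Since a kernel may hand the driver a counted name without a
+ terminator, the driver has to add one when packing the name into the
+ upcall buffer. A minimal sketch, assuming the CFS_MAXNAMLEN limit of
+ 256 (terminator included) from the lookup section applies to each
+ name; the helper name pack_name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CFS_MAXNAMLEN 256   /* from the lookup section; includes the 0 terminator */

/* Copy a counted, possibly unterminated name into dst and terminate it.
 * Returns the number of bytes used including the terminator, or 0 if
 * the name does not fit. */
size_t pack_name(char dst[CFS_MAXNAMLEN], const char *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    if (len >= CFS_MAXNAMLEN)
        return 0;
    memcpy(dst, src, len);
    dst[len] = '\0';
    return len + 1;
}
```

+ For rename the helper would be called twice, once for srcname and
+ once for destname.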
+
+ Errors
+
+ 4.19. readdir
+
+
+ Summary Read directory entries.
+
+ Arguments
+
+ in
+
+ struct cfs_readdir_in {
+ ViceFid VFid;
+ int count;
+ int offset;
+ } cfs_readdir;
+
+
+
+
+ out
+
+ struct cfs_readdir_out {
+ int size;
+ caddr_t data; /* Place holder for data. */
+ } cfs_readdir;
+
+
+
+ Description Read at most count bytes of directory entries from VFid,
+ starting at offset. The data is returned in data and its size in
+ size.
+
+ Errors
+
+ NOTE This call is not used. Readdir operations exploit container
+ files. We will re-evaluate this during the directory revamp which is
+ about to take place.
+
+ 4.20. vget
+
+
+ Summary Instructs Venus to do an FSDB->Get.
+
+ Arguments
+
+ in
+
+ struct cfs_vget_in {
+ ViceFid VFid;
+ } cfs_vget;
+
+
+
+ out
+
+ struct cfs_vget_out {
+ ViceFid VFid;
+ int vtype;
+ } cfs_vget;
+
+
+
+ Description This upcall asks Venus to do a get operation on an fsobj
+ labelled by VFid.
+
+ Errors
+
+ NOTE This operation is not used. However, it is extremely useful
+ since it can be used to deal with read/write memory mapped files.
+ These can be "pinned" in the Venus cache using vget and released with
+ inactive.
+
+ 4.21. fsync
+
+
+ Summary Tell Venus to update the RVM attributes of a file.
+
+ Arguments
+
+ in
+
+ struct cfs_fsync_in {
+ ViceFid VFid;
+ } cfs_fsync;
+
+
+
+ out
+ none
+
+ Description Ask Venus to update RVM attributes of object VFid. This
+ should be called as part of kernel level fsync type calls. The
+ result indicates if the syncing was successful.
+
+ Errors
+
+ NOTE Linux does not implement this call. It should.
+
+ 4.22. inactive
+
+
+ Summary Tell Venus a vnode is no longer in use.
+
+ Arguments
+
+ in
+
+ struct cfs_inactive_in {
+ ViceFid VFid;
+ } cfs_inactive;
+
+
+
+ out
+ none
+
+ Description This operation returns EOPNOTSUPP.
+
+ Errors
+
+ NOTE This should perhaps be removed.
+
+ 4.23. rdwr
+
+
+ Summary Read or write from a file.
+
+ Arguments
+
+ in
+
+ struct cfs_rdwr_in {
+ ViceFid VFid;
+ int rwflag;
+ int count;
+ int offset;
+ int ioflag;
+ caddr_t data; /* Place holder for data. */
+ } cfs_rdwr;
+
+
+
+
+ out
+
+ struct cfs_rdwr_out {
+ int rwflag;
+ int count;
+ caddr_t data; /* Place holder for data. */
+ } cfs_rdwr;
+
+
+
+ Description This upcall asks Venus to read from or write to a file.
+
+ Errors
+
+ NOTE It should be removed since it violates the Coda philosophy that
+ read/write operations never reach Venus. I have been told the
+ operation does not work. It is not currently used.
+
+
+
+
+ 4.24. odymount
+
+
+ Summary Allows mounting multiple Coda "filesystems" on one Unix mount
+ point.
+
+ Arguments
+
+ in
+
+ struct ody_mount_in {
+ char *name; /* Place holder for data. */
+ } ody_mount;
+
+
+
+ out
+
+ struct ody_mount_out {
+ ViceFid VFid;
+ } ody_mount;
+
+
+
+ Description Asks Venus to return the rootfid of a Coda system named
+ name. The fid is returned in VFid.
+
+ Errors
+
+ NOTE This call was used by David for dynamic sets. It should be
+ removed since it causes a jungle of pointers in the VFS mounting area.
+ It is not used by Coda proper. The call is not implemented by Venus.
+
+
+
+ 4.25. ody_lookup
+
+
+ Summary Looks up something.
+
+ Arguments
+
+ in irrelevant
+
+
+ out
+ irrelevant
+
+ Description
+
+ Errors
+
+ NOTE Gut it. The call is not implemented by Venus.
+
+
+
+ 4.26. ody_expand
+
+
+ Summary Expands something in a dynamic set.
+
+ Arguments
+
+ in irrelevant
+
+ out
+ irrelevant
+
+ Description
+
+ Errors
+
+ NOTE Gut it. The call is not implemented by Venus.
+
+
+
+ 4.27. prefetch
+
+
+ Summary Prefetch a dynamic set.
+
+ Arguments
+
+ in Not documented.
+
+ out
+ Not documented.
+
+ Description Venus worker.cc has support for this call, although it is
+ noted that it doesn't work. Not surprising, since the kernel does not
+ have support for it. (ODY_PREFETCH is not a defined operation).
+
+ Errors
+
+ NOTE Gut it. It isn't working and isn't used by Coda.
+
+
+
+
+ 4.28. signal
+
+
+ Summary Send Venus a signal about an upcall.
+
+ Arguments
+
+ in none
+
+ out
+ not applicable.
+
+ Description This is an out-of-band upcall to Venus to inform Venus
+ that the calling process received a signal after Venus read the
+ message from the input queue. Venus is supposed to clean up the
+ operation.
+
+ Errors No reply is given.
+
+ NOTE We need to better understand what Venus needs to clean up and if
+ it is doing this correctly. We also need to handle situations with
+ multiple upcalls per system call correctly. It would be important to
+ know which state changes in Venus after an upcall require the kernel
+ to notify Venus to clean up (e.g. open definitely causes such a state
+ change, but many other upcalls may not).
+
+
+
+ 5. The minicache and downcalls
+
+
+ The Coda FS Driver can cache results of lookup and access upcalls, to
+ limit the frequency of upcalls. Upcalls carry a price since a process
+ context switch needs to take place. The flip side of caching this
+ information is that Venus must notify the FS Driver when cached
+ entries must be flushed or renamed.
+
+ The kernel code generally has to maintain a structure which links the
+ internal file handles (called vnodes in BSD, inodes in Linux and
+ FileHandles in Windows) with the ViceFid's which Venus maintains. The
+ reason is that frequent translations back and forth are needed in
+ order to make upcalls and use the results of upcalls. Such linking
+ objects are called cnodes.
+
+ The current minicache implementations have cache entries which record
+ the following:
+
+ 1. the name of the file
+
+ 2. the cnode of the directory containing the object
+
+ 3. a list of CodaCred's for which the lookup is permitted.
+
+ 4. the cnode of the object
+
+ The lookup call in the Coda FS Driver may request the cnode of the
+ desired object from the cache, by passing its name, directory and the
+ CodaCred's of the caller. The cache will return the cnode or indicate
+ that it cannot be found. The Coda FS Driver must be careful to
+ invalidate cache entries when it modifies or removes objects.
+
+ When Venus obtains information that indicates that cache entries are
+ no longer valid, it will make a downcall to the kernel. Downcalls are
+ intercepted by the Coda FS Driver and lead to cache invalidations of
+ the kind described below. The Coda FS Driver does not return an error
+ unless the downcall data could not be read into kernel memory.
+
+
+ 5.1. INVALIDATE
+
+
+ No information is available on this call.
+
+
+ 5.2. FLUSH
+
+
+
+ Arguments None
+
+ Summary Flush the name cache entirely.
+
+ Description Venus issues this call upon startup and when it dies. This
+ is to prevent stale cache information being held. Some operating
+ systems allow the kernel name cache to be switched off dynamically.
+ When this is done, this downcall is made.
+
+
+ 5.3. PURGEUSER
+
+
+ Arguments
+
+ struct cfs_purgeuser_out {/* CFS_PURGEUSER is a venus->kernel call */
+ struct CodaCred cred;
+ } cfs_purgeuser;
+
+
+
+ Description Remove all entries in the cache carrying the Cred. This
+ call is issued when tokens for a user expire or are flushed.
+
+
+ 5.4. ZAPFILE
+
+
+ Arguments
+
+ struct cfs_zapfile_out { /* CFS_ZAPFILE is a venus->kernel call */
+ ViceFid CodaFid;
+ } cfs_zapfile;
+
+
+
+ Description Remove all entries which have the (dir vnode, name) pair.
+ This is issued as a result of an invalidation of cached attributes of
+ a vnode.
+
+ NOTE Call is not named correctly in NetBSD and Mach. The minicache
+ zapfile routine takes different arguments. Linux does not implement
+ the invalidation of attributes correctly.
+
+
+
+ 5.5. ZAPDIR
+
+
+ Arguments
+
+ struct cfs_zapdir_out { /* CFS_ZAPDIR is a venus->kernel call */
+ ViceFid CodaFid;
+ } cfs_zapdir;
+
+
+
+ Description Remove all entries in the cache lying in a directory
+ CodaFid, and all children of this directory. This call is issued when
+ Venus receives a callback on the directory.
+
+
+ 5.6. ZAPVNODE
+
+
+
+ Arguments
+
+ struct cfs_zapvnode_out { /* CFS_ZAPVNODE is a venus->kernel call */
+ struct CodaCred cred;
+ ViceFid VFid;
+ } cfs_zapvnode;
+
+
+
+ Description Remove all entries in the cache carrying the cred and VFid
+ as in the arguments. This downcall is probably never issued.
+
+
+ 5.7. PURGEFID
+
+
+ Summary
+
+ Arguments
+
+ struct cfs_purgefid_out { /* CFS_PURGEFID is a venus->kernel call */
+ ViceFid CodaFid;
+ } cfs_purgefid;
+
+
+
+ Description Flush the attribute for the file. If it is a dir (odd
+ vnode), purge its children from the namecache and remove the file from the
+ namecache.
+
+
+
+ 5.8. REPLACE
+
+
+ Summary Replace the Fid's for a collection of names.
+
+ Arguments
+
+ struct cfs_replace_out { /* cfs_replace is a venus->kernel call */
+ ViceFid NewFid;
+ ViceFid OldFid;
+ } cfs_replace;
+
+
+
+ Description This routine replaces a ViceFid in the name cache with
+ another. It was added to allow Venus, during reintegration, to replace
+ temp fids allocated locally while disconnected with global fids, even
+ when the reference counts on those fids are not zero.
+
+
+
+ 6. Initialization and cleanup
+
+
+ This section gives brief hints as to desirable features for the Coda
+ FS Driver at startup and upon shutdown or Venus failures. Before
+ entering the discussion it is useful to repeat that the Coda FS Driver
+ maintains the following data:
+
+
+ 1. message queues
+
+ 2. cnodes
+
+ 3. name cache entries
+
+ The name cache entries are entirely private to the driver, so they
+ can easily be manipulated. The message queues will generally have
+ clear points of initialization and destruction. The cnodes are
+ much more delicate. User processes hold reference counts in Coda
+ filesystems and it can be difficult to clean up the cnodes.
+
+ The driver can expect requests through:
+
+ 1. the message subsystem
+
+ 2. the VFS layer
+
+ 3. pioctl interface
+
+ Currently the pioctl passes through the VFS for Coda so we can
+ treat these similarly.
+
+
+ 6.1. Requirements
+
+
+ The following requirements should be accommodated:
+
+ 1. The message queues should have open and close routines. On Unix,
+ opening and closing the character devices serve as such routines.
+
+ o Before opening, no messages can be placed.
+
+ o Opening will remove any old messages still pending.
+
+ o Close will notify any sleeping processes that their upcall cannot
+ be completed.
+
+ o Close will free all memory allocated by the message queues.
+
+
+ 2. At open the namecache shall be initialized to empty state.
+
+ 3. Before the message queues are open, all VFS operations will fail.
+ Fortunately this can be achieved by making sure that mounting the
+ Coda filesystem cannot succeed before opening.
+
+ 4. After closing of the queues, no VFS operations can succeed. Here
+ one needs to be careful, since a few operations (lookup,
+ read/write, readdir) can proceed without upcalls. These must be
+ explicitly blocked.
+
+ 5. Upon closing the namecache shall be flushed and disabled.
+
+ 6. All memory held by cnodes can be freed without relying on upcalls.
+
+ 7. Unmounting the file system can be done without relying on upcalls.
+
+ 8. Mounting the Coda filesystem should fail gracefully if Venus cannot
+ get the rootfid or the attributes of the rootfid. The latter is
+ best implemented by Venus fetching these objects before attempting
+ to mount.
+
+ NOTE NetBSD in particular, but also Linux, has not implemented the
+ above requirements fully. For smooth operation this needs to be
+ corrected.
+
+
+
diff --git a/Documentation/filesystems/cramfs.txt b/Documentation/filesystems/cramfs.txt
new file mode 100644
index 0000000..31f53f0
--- /dev/null
+++ b/Documentation/filesystems/cramfs.txt
@@ -0,0 +1,76 @@
+
+ Cramfs - cram a filesystem onto a small ROM
+
+cramfs is designed to be simple and small, and to compress things well.
+
+It uses the zlib routines to compress a file one page at a time, and
+allows random page access. The meta-data is not compressed, but is
+expressed in a very terse representation to make it use much less
+disk space than traditional filesystems.
+
+You can't write to a cramfs filesystem (making it compressible and
+compact also makes it _very_ hard to update on-the-fly), so you have to
+create the disk image with the "mkcramfs" utility.
+
+
+Usage Notes
+-----------
+
+File sizes are limited to less than 16MB.
+
+Maximum filesystem size is a little over 256MB. (The last file on the
+filesystem is allowed to extend past 256MB.)
+
+Only the low 8 bits of gid are stored. The current version of
+mkcramfs simply truncates to 8 bits, which is a potential security
+issue.
+
+Hard links are supported, but hard linked files
+will still have a link count of 1 in the cramfs image.
+
+Cramfs directories have no `.' or `..' entries. Directories (like
+every other file on cramfs) always have a link count of 1. (There's
+no need to use -noleaf in `find', btw.)
+
+No timestamps are stored in a cramfs, so these default to the epoch
+(1970 GMT). Recently-accessed files may have updated timestamps, but
+the update lasts only as long as the inode is cached in memory, after
+which the timestamp reverts to 1970, i.e. moves backwards in time.
+
+Currently, cramfs must be written and read with architectures of the
+same endianness, and can be read only by kernels with PAGE_CACHE_SIZE
+== 4096. At least the latter of these is a bug, but it hasn't been
+decided what the best fix is. For the moment if you have larger pages
+you can just change the #define in mkcramfs.c, so long as you don't
+mind the filesystem becoming unreadable to future kernels.
+
+
+For /usr/share/magic
+--------------------
+
+0 ulelong 0x28cd3d45 Linux cramfs offset 0
+>4 ulelong x size %d
+>8 ulelong x flags 0x%x
+>12 ulelong x future 0x%x
+>16 string >\0 signature "%.16s"
+>32 ulelong x fsid.crc 0x%x
+>36 ulelong x fsid.edition %d
+>40 ulelong x fsid.blocks %d
+>44 ulelong x fsid.files %d
+>48 string >\0 name "%.16s"
+512 ulelong 0x28cd3d45 Linux cramfs offset 512
+>516 ulelong x size %d
+>520 ulelong x flags 0x%x
+>524 ulelong x future 0x%x
+>528 string >\0 signature "%.16s"
+>544 ulelong x fsid.crc 0x%x
+>548 ulelong x fsid.edition %d
+>552 ulelong x fsid.blocks %d
+>556 ulelong x fsid.files %d
+>560 string >\0 name "%.16s"
+
+
+Hacker Notes
+------------
+
+See fs/cramfs/README for filesystem layout and implementation notes.
diff --git a/Documentation/filesystems/devfs/ChangeLog b/Documentation/filesystems/devfs/ChangeLog
new file mode 100644
index 0000000..e5aba52
--- /dev/null
+++ b/Documentation/filesystems/devfs/ChangeLog
@@ -0,0 +1,1977 @@
+/* -*- auto-fill -*- */
+===============================================================================
+Changes for patch v1
+
+- creation of devfs
+
+- modified miscellaneous character devices to support devfs
+===============================================================================
+Changes for patch v2
+
+- bug fix with manual inode creation
+===============================================================================
+Changes for patch v3
+
+- bugfixes
+
+- documentation improvements
+
+- created a couple of scripts (one to save&restore a devfs and the
+ other to set up compatibility symlinks)
+
+- devfs support for SCSI discs. New name format is: sd_hHcCiIlL
+===============================================================================
+Changes for patch v4
+
+- bugfix for the directory reading code
+
+- bugfix for compilation with kerneld
+
+- devfs support for generic hard discs
+
+- rationalisation of the various watchdog drivers
+===============================================================================
+Changes for patch v5
+
+- support for mounting directly from entries in the devfs (it doesn't
+ need to be mounted to do this), including the root filesystem.
+ Mounting of swap partitions also works. Hence, now if you set
+ CONFIG_DEVFS_ONLY to 'Y' then you won't be able to access your discs
+ via ordinary device nodes. Naturally, the default is 'N' so that you
+ can still use your old device nodes. If you want to mount from devfs
+ entries, make sure you use: append = "root=/dev/sd_..." in your
+ lilo.conf. It seems LILO looks for the device number (major&minor)
+ and writes that into the kernel image :-(
+
+- support for character memory devices (/dev/null, /dev/zero, /dev/full
+ and so on). Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
+===============================================================================
+Changes for patch v6
+
+- support for subdirectories
+
+- support for symbolic links (created by devfs_mk_symlink(), no
+ support yet for creation via symlink(2))
+
+- SCSI disc naming now cast in stone, with the format:
+ /dev/sd/c0b1t2u3 controller=0, bus=1, ID=2, LUN=3, whole disc
+ /dev/sd/c0b1t2u3p4 controller=0, bus=1, ID=2, LUN=3, 4th partition
+
+- loop devices now appear in devfs
+
+- tty devices, console, serial ports, etc. now appear in devfs
+ Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
+
+- bugs with mounting devfs-only devices now fixed
+===============================================================================
+Changes for patch v7
+
+- SCSI CD-ROMS, tapes and generic devices now appear in devfs
+===============================================================================
+Changes for patch v8
+
+- bugfix with no-rewind SCSI tapes
+
+- RAMDISCs now appear in devfs
+
+- better cleaning up of devfs entries created by various modules
+
+- interface change to <devfs_register>
+===============================================================================
+Changes for patch v9
+
+- the v8 patch was corrupted somehow, which would affect the patch for
+ linux/fs/filesystems.c
+ I've also fixed the v8 patch file on the WWW
+
+- MetaDevices (/dev/md*) should now appear in devfs
+===============================================================================
+Changes for patch v10
+
+- bugfix in meta device support for devfs
+
+- created this ChangeLog file
+
+- added devfs support to the floppy driver
+
+- added support for creating sockets in a devfs
+===============================================================================
+Changes for patch v11
+
+- added DEVFS_FL_HIDE_UNREG flag
+
+- incorporated better patch for ttyname() in libc 5.4.43 from H.J. Lu.
+
+- interface change to <devfs_mk_symlink>
+
+- support for creating symlinks with symlink(2)
+
+- parallel port printer (/dev/lp*) now appears in devfs
+===============================================================================
+Changes for patch v12
+
+- added inode check to <devfs_fill_file> function
+
+- improved devfs support when mounting from devfs
+
+- added call to <<release>> operation when removing swap areas on
+ devfs devices
+
+- increased NR_SUPER to 128 to support large numbers of devfs mounts
+ (for chroot(2) gaols)
+
+- fixed bug in SCSI disc support: was generating incorrect minors if
+ SCSI ID's did not start at 0 and increase by 1
+
+- support symlink traversal when mounting root
+===============================================================================
+Changes for patch v13
+
+- added devfs support to soundcard driver
+ Thanks to Eric Dumas <dumas@linux.eu.org> and
+ C. Scott Ananian <cananian@alumni.princeton.edu>
+
+- added devfs support to the joystick driver
+
+- loop driver now has its own subdirectory "/dev/loop/"
+
+- created <devfs_get_flags> and <devfs_set_flags> functions
+
+- fix problem with SCSI disc compatibility names (sd{a,b,c,d,e,f})
+ which assumes ID's start at 0 and increase by 1. Also only create
+ devfs entries for SCSI disc partitions which actually exist
+ Show new names in partition check
+ Thanks to Jakub Jelinek <jj@sunsite.ms.mff.cuni.cz>
+===============================================================================
+Changes for patch v14
+
+- bug fix in floppy driver: would not compile without
+ CONFIG_DEVFS_FS='Y'
+ Thanks to Jurgen Botz <jbotz@nova.botz.org>
+
+- bug fix in loop driver
+ Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
+
+- do not create devfs entries for printers not configured
+ Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
+
+- do not create devfs entries for serial ports not present
+ Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
+
+- ensure <tty_register_devfs> is exported from tty_io.c
+ Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
+
+- allow unregistering of devfs symlink entries
+
+- fixed bug in SCSI disc naming introduced in last patch version
+===============================================================================
+Changes for patch v15
+
+- ported to kernel 2.1.81
+===============================================================================
+Changes for patch v16
+
+- created <devfs_set_symlink_destination> function
+
+- moved DEVFS_SUPER_MAGIC into header file
+
+- added DEVFS_FL_HIDE flag
+
+- created <devfs_get_maj_min>
+
+- created <devfs_get_handle_from_inode>
+
+- fixed bugs in searching by major&minor
+
+- changed interface to <devfs_unregister>, <devfs_fill_file> and
+ <devfs_find_handle>
+
+- fixed inode times when symlink created with symlink(2)
+
+- change tty driver to do auto-creation of devfs entries
+ Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
+
+- fixed bug in genhd.c: whole disc (non-SCSI) was not registered to
+ devfs
+
+- updated libc 5.4.43 patch for ttyname()
+===============================================================================
+Changes for patch v17
+
+- added CONFIG_DEVFS_TTY_COMPAT
+ Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
+
+- bugfix in devfs support for drivers/char/lp.c
+ Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
+
+- clean up serial driver so that PCMCIA devices unregister correctly
+ Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
+
+- fixed bug in genhd.c: whole disc (non-SCSI) was not registered to
+ devfs [was missing in patch v16]
+
+- updated libc 5.4.43 patch for ttyname() [was missing in patch v16]
+
+- all SCSI devices now registered in /dev/sg
+
+- support removal of devfs entries via unlink(2)
+===============================================================================
+Changes for patch v18
+
+- added floppy/?u720 floppy entry
+
+- fixed kerneld support for entries in devfs subdirectories
+
+- incorporated latest patch for ttyname() in libc 5.4.43 from H.J. Lu.
+===============================================================================
+Changes for patch v19
+
+- bug fix when looking up unregistered entries: kerneld was not called
+
+- fixes for kernel 2.1.86 (now requires 2.1.86)
+===============================================================================
+Changes for patch v20
+
+- only create available floppy entries
+ Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl>
+
+- new IDE naming scheme following SCSI format (i.e. /dev/id/c0b0t0u0p1
+ instead of /dev/hda1)
+ Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl>
+
+- new XT disc naming scheme following SCSI format (i.e. /dev/xd/c0t0p1
+ instead of /dev/xda1)
+ Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl>
+
+- new non-standard CD-ROM names (i.e. /dev/sbp/c#t#)
+ Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl>
+
+- allow symlink traversal when mounting the root filesystem
+
+- Create entries for MD devices at MD init
+ Thanks to Christophe Leroy <christophe.leroy5@capway.com>
+===============================================================================
+Changes for patch v21
+
+- ported to kernel 2.1.91
+===============================================================================
+Changes for patch v22
+
+- SCSI host number patch ("scsihosts=" kernel option)
+ Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl>
+===============================================================================
+Changes for patch v23
+
+- Fixed persistence bug with device numbers for manually created
+ device files
+
+- Fixed problem with recreating symlinks with different content
+
+- Added CONFIG_DEVFS_MOUNT (mount devfs on /dev at boot time)
+===============================================================================
+Changes for patch v24
+
+- Switched from CONFIG_KERNELD to CONFIG_KMOD: module autoloading
+ should now work again
+
+- Hide entries which are manually unlinked
+
+- Always invalidate devfs dentry cache when registering entries
+
+- Support removal of devfs directories via rmdir(2)
+
+- Ensure directories created by <devfs_mk_dir> are visible
+
+- Default no access for "other" for floppy device
+===============================================================================
+Changes for patch v25
+
+- Updates to CREDITS file and minor IDE numbering change
+ Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl>
+
+- Invalidate devfs dentry cache when making directories
+
+- Invalidate devfs dentry cache when removing entries
+
+- More informative message if root FS mount fails when devfs
+ configured
+
+- Fixed persistence bug with fifos
+===============================================================================
+Changes for patch v26
+
+- ported to kernel 2.1.97
+
+- Changed serial directory from "/dev/serial" to "/dev/tts" and
+ "/dev/consoles" to "/dev/vc" to be more friendly to new procps
+===============================================================================
+Changes for patch v27
+
+- Added support for IDE4 and IDE5
+ Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl>
+
+- Documented "scsihosts=" boot parameter
+
+- Print process command when debugging kerneld/kmod
+
+- Added debugging for register/unregister/change operations
+
+- Added "devfs=" boot options
+
+- Hide unregistered entries by default
+===============================================================================
+Changes for patch v28
+
+- No longer lock/unlock superblock in <devfs_put_super> (cope with
+ recent VFS interface change)
+
+- Do not automatically change ownership/protection of /dev/tty
+
+- Drop negative dentries when they are released
+
+- Manage dcache more efficiently
+===============================================================================
+Changes for patch v29
+
+- Added DEVFS_FL_AUTO_DEVNUM flag
+===============================================================================
+Changes for patch v30
+
+- No longer set unnecessary methods
+
+- Ported to kernel 2.1.99-pre3
+===============================================================================
+Changes for patch v31
+
+- Added PID display to <call_kerneld> debugging message
+
+- Added "diread" and "diwrite" options
+
+- Ported to kernel 2.1.102
+
+- Fixed persistence problem with permissions
+===============================================================================
+Changes for patch v32
+
+- Fixed devfs support in drivers/block/md.c
+===============================================================================
+Changes for patch v33
+
+- Support legacy device nodes
+
+- Fixed bug where recreated inodes were hidden
+
+- New IDE naming scheme: everything is under /dev/ide
+===============================================================================
+Changes for patch v34
+
+- Improved debugging in <get_vfs_inode>
+
+- Prevent duplicate calls to <devfs_mk_dir> in SCSI layer
+
+- No longer free old dentries in <devfs_mk_dir>
+
+- Free all dentries for a given entry when deleting inodes
+===============================================================================
+Changes for patch v35
+
+- Ported to kernel 2.1.105 (sound driver changes)
+===============================================================================
+Changes for patch v36
+
+- Fixed sound driver port
+===============================================================================
+Changes for patch v37
+
+- Minor documentation tweaks
+===============================================================================
+Changes for patch v38
+
+- More documentation tweaks
+
+- Fix for sound driver port
+
+- Removed ttyname-patch (grab libc 5.4.44 instead)
+
+- Ported to kernel 2.1.107-pre2 (loop driver fix)
+===============================================================================
+Changes for patch v39
+
+- Ported to kernel 2.1.107 (hd.c hunk broke due to spelling "fixes"). Sigh
+
+- Removed many #ifdef's, replaced with trickery in include/devfs_fs.h
+===============================================================================
+Changes for patch v40
+
+- Fix for sound driver port
+
+- Limit auto-device numbering to majors 128 to 239
+===============================================================================
+Changes for patch v41
+
+- Fixed inode times persistence problem
+===============================================================================
+Changes for patch v42
+
+- Ported to kernel 2.1.108 (drivers/scsi/hosts.c hunk broke)
+===============================================================================
+Changes for patch v43
+
+- Fixed spelling in <devfs_readlink> debug
+
+- Fixed bug in <devfs_setup> parsing "dilookup"
+
+- More #ifdef's removed
+
+- Supported Sparc keyboard (/dev/kbd)
+
+- Supported DSP56001 digital signal processor (/dev/dsp56k)
+
+- Supported Apple Desktop Bus (/dev/adb)
+
+- Supported Coda network file system (/dev/cfs*)
+===============================================================================
+Changes for patch v44
+
+- Fixed devfs inode leak when manually recreating inodes
+
+- Fixed permission persistence problem when recreating inodes
+===============================================================================
+Changes for patch v45
+
+- Ported to kernel 2.1.110
+===============================================================================
+Changes for patch v46
+
+- Ported to kernel 2.1.112-pre1
+
+- Removed harmless "unused variable" compiler warning
+
+- Fixed modes for manually recreated device nodes
+===============================================================================
+Changes for patch v47
+
+- Added NULL devfs inode warning in <devfs_read_inode>
+
+- Force all inode nlink values to 1
+===============================================================================
+Changes for patch v48
+
+- Added "dimknod" option
+
+- Set inode nlink to 0 when freeing dentries
+
+- Added support for virtual console capture devices (/dev/vcs*)
+ Thanks to Dennis Hou <smilax@mindmeld.yi.org>
+
+- Fixed modes for manually recreated symlinks
+===============================================================================
+Changes for patch v49
+
+- Ported to kernel 2.1.113
+===============================================================================
+Changes for patch v50
+
+- Fixed bugs in recreated directories and symlinks
+===============================================================================
+Changes for patch v51
+
+- Improved robustness of rc.devfs script
+ Thanks to Roderich Schupp <rsch@experteam.de>
+
+- Fixed bugs in recreated device nodes
+
+- Fixed bug in currently unused <devfs_get_handle_from_inode>
+
+- Defined new <devfs_handle_t> type
+
+- Improved debugging when getting entries
+
+- Fixed bug where directories could be emptied
+
+- Ported to kernel 2.1.115
+===============================================================================
+Changes for patch v52
+
+- Replaced dummy .epoch inode with .devfsd character device
+
+- Modified rc.devfs to take account of above change
+
+- Removed spurious driver warning messages when CONFIG_DEVFS_FS=n
+
+- Implemented devfsd protocol revision 0
+===============================================================================
+Changes for patch v53
+
+- Ported to kernel 2.1.116 (kmod change broke hunk)
+
+- Updated Documentation/Configure.help
+
+- Test and tty pattern patch for rc.devfs script
+ Thanks to Roderich Schupp <rsch@experteam.de>
+
+- Added soothing message to warning in <devfs_d_iput>
+===============================================================================
+Changes for patch v54
+
+- Ported to kernel 2.1.117
+
+- Fixed default permissions in sound driver
+
+- Added support for frame buffer devices (/dev/fb*)
+===============================================================================
+Changes for patch v55
+
+- Ported to kernel 2.1.119
+
+- Use GCC extensions for structure initialisations
+
+- Implemented async open notification
+
+- Incremented devfsd protocol revision to 1
+===============================================================================
+Changes for patch v56
+
+- Ported to kernel 2.1.120-pre3
+
+- Moved async open notification to end of <devfs_open>
+===============================================================================
+Changes for patch v57
+
+- Ported to kernel 2.1.121
+
+- Prepended "/dev/" to module load request
+
+- Renamed <call_kerneld> to <call_kmod>
+
+- Created sample modules.conf file
+===============================================================================
+Changes for patch v58
+
+- Fixed typo "AYSNC" -> "ASYNC"
+===============================================================================
+Changes for patch v59
+
+- Added open flag for files
+===============================================================================
+Changes for patch v60
+
+- Ported to kernel 2.1.123-pre2
+===============================================================================
+Changes for patch v61
+
+- Set i_blocks=0 and i_blksize=1024 in <devfs_read_inode>
+===============================================================================
+Changes for patch v62
+
+- Ported to kernel 2.1.123
+===============================================================================
+Changes for patch v63
+
+- Ported to kernel 2.1.124-pre2
+===============================================================================
+Changes for patch v64
+
+- Fixed Unix98 pty support
+
+- Increased buffer size in <get_partition_list> to avoid crash and
+ burn
+===============================================================================
+Changes for patch v65
+
+- More Unix98 pty support fixes
+
+- Added test for empty <<name>> in <devfs_find_handle>
+
+- Renamed <generate_path> to <devfs_generate_path> and published
+
+- Created /dev/root symlink
+ Thanks to Roderich Schupp <rsch@ExperTeam.de>
+ with further modifications by me
+===============================================================================
+Changes for patch v66
+
+- Yet more Unix98 pty support fixes (now tested)
+
+- Created <devfs_get_fops>
+
+- Support media change checks when CONFIG_DEVFS_ONLY=y
+
+- Abolished Unix98-style PTY names for old PTY devices
+===============================================================================
+Changes for patch v67
+
+- Added inline declaration for dummy <devfs_generate_path>
+
+- Removed spurious "unable to register... in devfs" messages when
+ CONFIG_DEVFS_FS=n
+
+- Fixed misc. devices when CONFIG_DEVFS_FS=n
+
+- Limit auto-device numbering to majors 144 to 239
+===============================================================================
+Changes for patch v68
+
+- Hide unopened virtual consoles from directory listings
+
+- Added support for video capture devices
+
+- Ported to kernel 2.1.125
+===============================================================================
+Changes for patch v69
+
+- Fix for CONFIG_VT=n
+===============================================================================
+Changes for patch v70
+
+- Added support for non-OSS/Free sound cards
+===============================================================================
+Changes for patch v71
+
+- Ported to kernel 2.1.126-pre2
+===============================================================================
+Changes for patch v72
+
+- #ifdef's for CONFIG_DEVFS_DISABLE_OLD_NAMES removed
+===============================================================================
+Changes for patch v73
+
+- CONFIG_DEVFS_DISABLE_OLD_NAMES replaced with "nocompat" boot option
+
+- CONFIG_DEVFS_BOOT_OPTIONS removed: boot options always available
+===============================================================================
+Changes for patch v74
+
+- Removed CONFIG_DEVFS_MOUNT and "mount" boot option and replaced with
+ "nomount" boot option
+
+- Documentation updates
+
+- Updated sample modules.conf
+===============================================================================
+Changes for patch v75
+
+- Updated sample modules.conf
+
+- Remount devfs after initrd finishes
+
+- Ported to kernel 2.1.127
+
+- Added support for ISDN
+ Thanks to Christophe Leroy <christophe.leroy5@capway.com>
+===============================================================================
+Changes for patch v76
+
+- Updated an email address in ChangeLog
+
+- CONFIG_DEVFS_ONLY replaced with "only" boot option
+===============================================================================
+Changes for patch v77
+
+- Added DEVFS_FL_REMOVABLE flag
+
+- Check for disc change when listing directories with removable media
+ devices
+
+- Use DEVFS_FL_REMOVABLE in sd.c
+
+- Ported to kernel 2.1.128
+===============================================================================
+Changes for patch v78
+
+- Only call <scan_dir_for_removable> on first call to <devfs_readdir>
+
+- Ported to kernel 2.1.129-pre5
+
+- ISDN support improvements
+ Thanks to Christophe Leroy <christophe.leroy5@capway.com>
+===============================================================================
+Changes for patch v79
+
+- Ported to kernel 2.1.130
+
+- Renamed miscdevice "apm" to "apm_bios" to be consistent with
+ devices.txt
+===============================================================================
+Changes for patch v80
+
+- Ported to kernel 2.1.131
+
+- Updated <devfs_rmdir> for VFS change in 2.1.131
+===============================================================================
+Changes for patch v81
+
+- Fixed permissions on /dev/ptmx
+===============================================================================
+Changes for patch v82
+
+- Ported to kernel 2.1.132-pre4
+
+- Changed initial permissions on /dev/pts/*
+
+- Created <devfs_mk_compat>
+
+- Added "symlinks" boot option
+
+- Changed devfs_register_blkdev() back to register_blkdev() for IDE
+
+- Check for partitions on removable media in <devfs_lookup>
+===============================================================================
+Changes for patch v83
+
+- Fixed support for ramdisc when using string-based root FS name
+
+- Ported to kernel 2.2.0-pre1
+===============================================================================
+Changes for patch v84
+
+- Ported to kernel 2.2.0-pre7
+===============================================================================
+Changes for patch v85
+
+- Compile fixes for drivers/sound/sound_common.c (non-module) and
+ drivers/isdn/isdn_common.c
+ Thanks to Christophe Leroy <christophe.leroy5@capway.com>
+
+- Added support for registering regular files
+
+- Created <devfs_set_file_size>
+
+- Added /dev/cpu/mtrr as an alternative interface to /proc/mtrr
+
+- Update devfs inodes from entries if not changed through FS
+===============================================================================
+Changes for patch v86
+
+- Ported to kernel 2.2.0-pre9
+===============================================================================
+Changes for patch v87
+
+- Fixed bug when mounting non-devfs devices in a devfs
+===============================================================================
+Changes for patch v88
+
+- Fixed <devfs_fill_file> to only initialise temporary inodes
+
+- Trap for NULL fops in <devfs_register>
+
+- Return -ENODEV in <devfs_fill_file> for non-driver inodes
+
+- Fixed bug when unswapping non-devfs devices in a devfs
+===============================================================================
+Changes for patch v89
+
+- Switched to C data types in include/linux/devfs_fs.h
+
+- Switched from PATH_MAX to DEVFS_PATHLEN
+
+- Updated Documentation/filesystems/devfs/modules.conf to take account
+ of reverse scanning (!) by modprobe
+
+- Ported to kernel 2.2.0
+===============================================================================
+Changes for patch v90
+
+- CONFIG_DEVFS_DISABLE_OLD_TTY_NAMES replaced with "nottycompat" boot
+ option
+
+- CONFIG_DEVFS_TTY_COMPAT removed: existing "symlinks" boot option now
+ controls this. This means you must have libc 5.4.44 or later, or a
+ recent version of libc 6 if you use the "symlinks" option
+===============================================================================
+Changes for patch v91
+
+- Switch from <devfs_mk_symlink> to <devfs_mk_compat> in
+ drivers/char/vc_screen.c to fix problems with Midnight Commander
+===============================================================================
+Changes for patch v92
+
+- Ported to kernel 2.2.2-pre5
+===============================================================================
+Changes for patch v93
+
+- Modified <sd_name> in drivers/scsi/sd.c to cope with devices that
+ don't exist (which happens with new RAID autostart code printk()s)
+===============================================================================
+Changes for patch v94
+
+- Fixed bug in joystick driver: only first joystick was registered
+===============================================================================
+Changes for patch v95
+
+- Fixed another bug in joystick driver
+
+- Fixed <devfsd_read> to not overrun event buffer
+===============================================================================
+Changes for patch v96
+
+- Ported to kernel 2.2.5-2
+
+- Created <devfs_auto_unregister>
+
+- Fixed bugs: compatibility entries were not unregistered for:
+ loop driver
+ floppy driver
+ RAMDISC driver
+ IDE tape driver
+ SCSI CD-ROM driver
+ SCSI HDD driver
+===============================================================================
+Changes for patch v97
+
+- Fixed bugs: compatibility entries were not unregistered for:
+ ALSA sound driver
+ partitions in generic disc driver
+
+- Don't return unregistered entries in <devfs_find_handle>
+
+- Panic in <devfs_unregister> if entry unregistered
+
+- Don't panic in <devfs_auto_unregister> for duplicates
+===============================================================================
+Changes for patch v98
+
+- Don't unregister already unregistered entries in <unregister>
+
+- Register entry in <sd_detect>
+
+- Unregister entry in <sd_detach>
+
+- Changed to <devfs_*register_chrdev> in drivers/char/tty_io.c
+
+- Ported to kernel 2.2.7
+===============================================================================
+Changes for patch v99
+
+- Ported to kernel 2.2.8
+
+- Fixed bug in drivers/scsi/sd.c when >16 SCSI discs
+
+- Disable warning messages when unable to read partition table for
+ removable media
+===============================================================================
+Changes for patch v100
+
+- Ported to kernel 2.3.1-pre5
+
+- Added "oops-on-panic" boot option
+
+- Improved debugging in <devfs_register> and <devfs_unregister>
+
+- Register entry in <sr_detect>
+
+- Unregister entry in <sr_detach>
+
+- Register entry in <sg_detect>
+
+- Unregister entry in <sg_detach>
+
+- Added support for ALSA drivers
+===============================================================================
+Changes for patch v101
+
+- Ported to kernel 2.3.2
+===============================================================================
+Changes for patch v102
+
+- Update serial driver to register PCMCIA entries
+ Thanks to Roch-Alexandre Nomine-Beguin <roch@samarkand.infini.fr>
+
+- Updated an email address in ChangeLog
+
+- Hide virtual console capture entries from directory listings when
+ corresponding console device is not open
+===============================================================================
+Changes for patch v103
+
+- Ported to kernel 2.3.3
+===============================================================================
+Changes for patch v104
+
+- Added documentation for some functions
+
+- Added "doc" target to fs/devfs/Makefile
+
+- Added "v4l" directory for video4linux devices
+
+- Replaced call to <devfs_unregister> in <sd_detach> with call to
+ <devfs_register_partitions>
+
+- Moved registration for sr and sg drivers from detect() to attach()
+ methods
+
+- Register entries in <st_attach> and unregister in <st_detach>
+
+- Work around IDE driver treating CD-ROM as gendisk
+
+- Use <sed> instead of <tr> in rc.devfs
+
+- Updated ToDo list
+
+- Removed "oops-on-panic" boot option: now always Oops
+===============================================================================
+Changes for patch v105
+
+- Unregister SCSI host from <scsi_host_no_list> in <scsi_unregister>
+ Thanks to Zoltán Böszörményi <zboszor@mail.externet.hu>
+
+- Don't save /dev/log in rc.devfs
+
+- Ported to kernel 2.3.4-pre1
+===============================================================================
+Changes for patch v106
+
+- Fixed silly typo in drivers/scsi/st.c
+
+- Improved debugging in <devfs_register>
+===============================================================================
+Changes for patch v107
+
+- Added "diunlink" and "nokmod" boot options
+
+- Removed superfluous warning message in <devfs_d_iput>
+===============================================================================
+Changes for patch v108
+
+- Remove entries when unloading sound module
+===============================================================================
+Changes for patch v109
+
+- Ported to kernel 2.3.6-pre2
+===============================================================================
+Changes for patch v110
+
+- Took account of change to <d_alloc_root>
+===============================================================================
+Changes for patch v111
+
+- Created separate event queue for each mounted devfs
+
+- Removed <devfs_invalidate_dcache>
+
+- Created new ioctl()s for devfsd
+
+- Incremented devfsd protocol revision to 3
+
+- Fixed bug when re-creating directories: contents were lost
+
+- Block access to inodes until devfsd updates permissions
+===============================================================================
+Changes for patch v112
+
+- Modified patch so it applies against 2.3.5 and 2.3.6
+
+- Updated an email address in ChangeLog
+
+- Do not automatically change ownership/protection of /dev/tty<n>
+
+- Updated sample modules.conf
+
+- Switched to sending process uid/gid to devfsd
+
+- Renamed <call_kmod> to <try_modload>
+
+- Added DEVFSD_NOTIFY_LOOKUP event
+
+- Added DEVFSD_NOTIFY_CHANGE event
+
+- Added DEVFSD_NOTIFY_CREATE event
+
+- Incremented devfsd protocol revision to 4
+
+- Moved kernel-specific stuff to include/linux/devfs_fs_kernel.h
+===============================================================================
+Changes for patch v113
+
+- Ported to kernel 2.3.9
+
+- Restricted permissions on some block devices
+===============================================================================
+Changes for patch v114
+
+- Added support for /dev/netlink
+ Thanks to Dennis Hou <smilax@mindmeld.yi.org>
+
+- Return EISDIR rather than EINVAL for read(2) on directories
+
+- Ported to kernel 2.3.10
+===============================================================================
+Changes for patch v115
+
+- Added support for all remaining character devices
+ Thanks to Dennis Hou <smilax@mindmeld.yi.org>
+
+- Cleaned up netlink support
+===============================================================================
+Changes for patch v116
+
+- Added support for /dev/parport%d
+ Thanks to Tim Waugh <tim@cyberelk.demon.co.uk>
+
+- Fixed parallel port ATAPI tape driver
+
+- Fixed Atari SLM laser printer driver
+===============================================================================
+Changes for patch v117
+
+- Added support for COSA card
+ Thanks to Dennis Hou <smilax@mindmeld.yi.org>
+
+- Fixed drivers/char/ppdev.c: missing #include <linux/init.h>
+
+- Fixed drivers/char/ftape/zftape/zftape-init.c
+ Thanks to Vladimir Popov <mashgrad@usa.net>
+===============================================================================
+Changes for patch v118
+
+- Ported to kernel 2.3.15-pre3
+
+- Fixed bug in loop driver
+
+- Unregister /dev/lp%d entries in drivers/char/lp.c
+ Thanks to Maciej W. Rozycki <macro@ds2.pg.gda.pl>
+===============================================================================
+Changes for patch v119
+
+- Ported to kernel 2.3.16
+===============================================================================
+Changes for patch v120
+
+- Fixed bug in drivers/scsi/scsi.c
+
+- Added /dev/ppp
+ Thanks to Dennis Hou <smilax@mindmeld.yi.org>
+
+- Ported to kernel 2.3.17
+===============================================================================
+Changes for patch v121
+
+- Fixed bug in drivers/block/loop.c
+
+- Ported to kernel 2.3.18
+===============================================================================
+Changes for patch v122
+
+- Ported to kernel 2.3.19
+===============================================================================
+Changes for patch v123
+
+- Ported to kernel 2.3.20
+===============================================================================
+Changes for patch v124
+
+- Ported to kernel 2.3.21
+===============================================================================
+Changes for patch v125
+
+- Created <devfs_get_info>, <devfs_set_info>,
+ <devfs_get_first_child> and <devfs_get_next_sibling>
+ Added <<dir>> parameter to <devfs_register>, <devfs_mk_compat>,
+ <devfs_mk_dir> and <devfs_find_handle>
+ Work sponsored by SGI
+
+- Fixed apparent bug in COSA driver
+
+- Re-instated "scsihosts=" boot option
+===============================================================================
+Changes for patch v126
+
+- Always create /dev/pts if CONFIG_UNIX98_PTYS=y
+
+- Fixed call to <devfs_mk_dir> in drivers/block/ide-disk.c
+ Thanks to Dennis Hou <smilax@mindmeld.yi.org>
+
+- Allow multiple unregistrations
+
+- Created /dev/scsi hierarchy
+ Work sponsored by SGI
+===============================================================================
+Changes for patch v127
+
+Work sponsored by SGI
+
+- No longer disable devpts if devfs enabled (caveat emptor)
+
+- Added flags array to struct gendisk and removed code from
+ drivers/scsi/sd.c
+
+- Created /dev/discs hierarchy
+===============================================================================
+Changes for patch v128
+
+Work sponsored by SGI
+
+- Created /dev/cdroms hierarchy
+===============================================================================
+Changes for patch v129
+
+Work sponsored by SGI
+
+- Removed compatibility entries for sound devices
+
+- Removed compatibility entries for printer devices
+
+- Removed compatibility entries for video4linux devices
+
+- Removed compatibility entries for parallel port devices
+
+- Removed compatibility entries for frame buffer devices
+===============================================================================
+Changes for patch v130
+
+Work sponsored by SGI
+
+- Added major and minor number to devfsd protocol
+
+- Incremented devfsd protocol revision to 5
+
+- Removed compatibility entries for SoundBlaster CD-ROMs
+
+- Removed compatibility entries for netlink devices
+
+- Removed compatibility entries for SCSI generic devices
+
+- Removed compatibility entries for SCSI tape devices
+===============================================================================
+Changes for patch v131
+
+Work sponsored by SGI
+
+- Support info pointer for all devfs entry types
+
+- Added <<info>> parameter to <devfs_mk_dir> and <devfs_mk_symlink>
+
+- Removed /dev/st hierarchy
+
+- Removed /dev/sg hierarchy
+
+- Removed compatibility entries for loop devices
+
+- Removed compatibility entries for IDE tape devices
+
+- Removed compatibility entries for SCSI CD-ROMs
+
+- Removed /dev/sr hierarchy
+===============================================================================
+Changes for patch v132
+
+Work sponsored by SGI
+
+- Removed compatibility entries for floppy devices
+
+- Removed compatibility entries for RAMDISCs
+
+- Removed compatibility entries for meta-devices
+
+- Removed compatibility entries for SCSI discs
+
+- Created <devfs_make_root>
+
+- Removed /dev/sd hierarchy
+
+- Support "../" when searching devfs namespace
+
+- Created /dev/ide/host* hierarchy
+
+- Supported IDE hard discs in /dev/ide/host* hierarchy
+
+- Removed compatibility entries for IDE discs
+
+- Removed /dev/ide/hd hierarchy
+
+- Supported IDE CD-ROMs in /dev/ide/host* hierarchy
+
+- Removed compatibility entries for IDE CD-ROMs
+
+- Removed /dev/ide/cd hierarchy
+===============================================================================
+Changes for patch v133
+
+Work sponsored by SGI
+
+- Created <devfs_get_unregister_slave>
+
+- Fixed bug in fs/partitions/check.c when rescanning
+===============================================================================
+Changes for patch v134
+
+Work sponsored by SGI
+
+- Removed /dev/sd, /dev/sr, /dev/st and /dev/sg directories
+
+- Removed /dev/ide/hd directory
+
+- Exported <devfs_get_parent>
+
+- Created <devfs_register_tape> and /dev/tapes hierarchy
+
+- Removed /dev/ide/mt hierarchy
+
+- Removed /dev/ide/fd hierarchy
+
+- Ported to kernel 2.3.25
+===============================================================================
+Changes for patch v135
+
+Work sponsored by SGI
+
+- Removed compatibility entries for virtual console capture devices
+
+- Removed unused <devfs_set_symlink_destination>
+
+- Removed compatibility entries for serial devices
+
+- Removed compatibility entries for console devices
+
+- Do not hide entries from devfsd or children
+
+- Removed DEVFS_FL_TTY_COMPAT flag
+
+- Removed "nottycompat" boot option
+
+- Removed <devfs_mk_compat>
+===============================================================================
+Changes for patch v136
+
+Work sponsored by SGI
+
+- Moved BSD pty devices to /dev/pty
+
+- Added DEVFS_FL_WAIT flag
+===============================================================================
+Changes for patch v137
+
+Work sponsored by SGI
+
+- Really fixed bug in fs/partitions/check.c when rescanning
+
+- Support new "disc" naming scheme in <get_removable_partition>
+
+- Allow NULL fops in <devfs_register>
+
+- Removed redundant name functions in SCSI disc and IDE drivers
+===============================================================================
+Changes for patch v138
+
+Work sponsored by SGI
+
+- Fixed old bugs in drivers/block/paride/pt.c, drivers/char/tpqic02.c,
+ drivers/net/wan/cosa.c and drivers/scsi/scsi.c
+ Thanks to Sergey Kubushin <ksi@ksi-linux.com>
+
+- Fall back to major table if NULL fops given to <devfs_register>
+===============================================================================
+Changes for patch v139
+
+Work sponsored by SGI
+
+- Corrected and moved <get_blkfops> and <get_chrfops> declarations
+ from arch/alpha/kernel/osf_sys.c to include/linux/fs.h
+
+- Removed name function from struct gendisk
+
+- Updated devfs FAQ
+===============================================================================
+Changes for patch v140
+
+Work sponsored by SGI
+
+- Ported to kernel 2.3.27
+===============================================================================
+Changes for patch v141
+
+Work sponsored by SGI
+
+- Bug fix in arch/m68k/atari/joystick.c
+
+- Moved ISDN and capi devices to /dev/isdn
+===============================================================================
+Changes for patch v142
+
+Work sponsored by SGI
+
+- Bug fix in drivers/block/ide-probe.c (patch confusion)
+===============================================================================
+Changes for patch v143
+
+Work sponsored by SGI
+
+- Bug fix in drivers/block/blkpg.c:partition_name()
+===============================================================================
+Changes for patch v144
+
+Work sponsored by SGI
+
+- Ported to kernel 2.3.29
+
+- Removed calls to <devfs_register> from cdu31a, cm206, mcd and mcdx
+ CD-ROM drivers: generic driver handles this now
+
+- Moved joystick devices to /dev/joysticks
+===============================================================================
+Changes for patch v145
+
+Work sponsored by SGI
+
+- Ported to kernel 2.3.30-pre3
+
+- Register whole-disc entry even for invalid partition tables
+
+- Fixed bug in mounting root FS when initrd enabled
+
+- Fixed device entry leak with IDE CD-ROMs
+
+- Fixed compile problem with drivers/isdn/isdn_common.c
+
+- Moved COSA devices to /dev/cosa
+
+- Support fifos when unregistering
+
+- Created <devfs_register_series> and used in many drivers
+
+- Moved Coda devices to /dev/coda
+
+- Moved parallel port IDE tapes to /dev/pt
+
+- Moved parallel port IDE generic devices to /dev/pg
+===============================================================================
+Changes for patch v146
+
+Work sponsored by SGI
+
+- Removed obsolete DEVFS_FL_COMPAT and DEVFS_FL_TOLERANT flags
+
+- Fixed compile problem with fs/coda/psdev.c
+
+- Reinstate change to <devfs_register_blkdev> in
+ drivers/block/ide-probe.c now that fs/isofs/inode.c is fixed
+
+- Switched to <devfs_register_blkdev> in drivers/block/floppy.c,
+ drivers/scsi/sr.c and drivers/block/md.c
+
+- Moved DAC960 devices to /dev/dac960
+===============================================================================
+Changes for patch v147
+
+Work sponsored by SGI
+
+- Ported to kernel 2.3.32-pre4
+===============================================================================
+Changes for patch v148
+
+Work sponsored by SGI
+
+- Removed kmod support: use devfsd instead
+
+- Moved miscellaneous character devices to /dev/misc
+===============================================================================
+Changes for patch v149
+
+Work sponsored by SGI
+
+- Ensure include/linux/joystick.h is OK for user-space
+
+- Improved debugging in <get_vfs_inode>
+
+- Ensure dentries created by devfsd will be cleaned up
+===============================================================================
+Changes for patch v150
+
+Work sponsored by SGI
+
+- Ported to kernel 2.3.34
+===============================================================================
+Changes for patch v151
+
+Work sponsored by SGI
+
+- Ported to kernel 2.3.35-pre1
+
+- Created <devfs_get_name>
+===============================================================================
+Changes for patch v152
+
+Work sponsored by SGI
+
+- Updated sample modules.conf
+
+- Ported to kernel 2.3.36-pre1
+===============================================================================
+Changes for patch v153
+
+Work sponsored by SGI
+
+- Ported to kernel 2.3.42
+
+- Removed <devfs_fill_file>
+===============================================================================
+Changes for patch v154
+
+Work sponsored by SGI
+
+- Took account of device number changes for /dev/fb*
+===============================================================================
+Changes for patch v155
+
+Work sponsored by SGI
+
+- Ported to kernel 2.3.43-pre8
+
+- Moved /dev/tty0 to /dev/vc/0
+
+- Moved sequence number formatting from <_tty_make_name> to drivers
+===============================================================================
+Changes for patch v156
+
+Work sponsored by SGI
+
+- Fixed breakage in drivers/scsi/sd.c due to recent SCSI changes
+===============================================================================
+Changes for patch v157
+
+Work sponsored by SGI
+
+- Ported to kernel 2.3.45
+===============================================================================
+Changes for patch v158
+
+Work sponsored by SGI
+
+- Ported to kernel 2.3.46-pre2
+===============================================================================
+Changes for patch v159
+
+Work sponsored by SGI
+
+- Fixed drivers/block/md.c
+ Thanks to Mike Galbraith <mikeg@weiden.de>
+
+- Documentation fixes
+
+- Moved device registration from <lp_init> to <lp_register>
+ Thanks to Tim Waugh <twaugh@redhat.com>
+===============================================================================
+Changes for patch v160
+
+Work sponsored by SGI
+
+- Fixed drivers/char/joystick/joystick.c
+ Thanks to Vojtech Pavlik <vojtech@suse.cz>
+
+- Documentation updates
+
+- Fixed arch/i386/kernel/mtrr.c if procfs and devfs not enabled
+
+- Fixed drivers/char/stallion.c
+===============================================================================
+Changes for patch v161
+
+Work sponsored by SGI
+
+- Remove /dev/ide when ide-mod is unloaded
+
+- Fixed bug in drivers/block/ide-probe.c when secondary but no primary
+
+- Added DEVFS_FL_NO_PERSISTENCE flag
+
+- Used new DEVFS_FL_NO_PERSISTENCE flag for Unix98 pty slaves
+
+- Removed unnecessary call to <update_devfs_inode_from_entry> in
+ <devfs_readdir>
+
+- Only set auto-ownership for /dev/pty/s*
+===============================================================================
+Changes for patch v162
+
+Work sponsored by SGI
+
+- Set inode->i_size to correct size for symlinks
+ Thanks to Jeremy Fitzhardinge <jeremy@goop.org>
+
+- Only give lookup() method to directories to comply with new VFS
+ assumptions
+
+- Remove unnecessary tests in symlink methods
+
+- Don't kill existing block ops in <devfs_read_inode>
+
+- Restore auto-ownership for /dev/pty/m*
+===============================================================================
+Changes for patch v163
+
+Work sponsored by SGI
+
+- Don't create missing directories in <devfs_find_handle>
+
+- Removed Documentation/filesystems/devfs/mk-devlinks
+
+- Updated Documentation/filesystems/devfs/README
+===============================================================================
+Changes for patch v164
+
+Work sponsored by SGI
+
+- Fixed CONFIG_DEVFS_FS breakage in drivers/char/serial.c introduced in
+ linux-2.3.99-pre6-7
+===============================================================================
+Changes for patch v165
+
+Work sponsored by SGI
+
+- Ported to kernel 2.3.99-pre6
+===============================================================================
+Changes for patch v166
+
+Work sponsored by SGI
+
+- Added CONFIG_DEVFS_MOUNT
+===============================================================================
+Changes for patch v167
+
+Work sponsored by SGI
+
+- Updated Documentation/filesystems/devfs/README
+
+- Updated sample modules.conf
+===============================================================================
+Changes for patch v168
+
+Work sponsored by SGI
+
+- Disabled multi-mount capability (use VFS bindings instead)
+
+- Updated README from master HTML file
+===============================================================================
+Changes for patch v169
+
+Work sponsored by SGI
+
+- Removed multi-mount code
+
+- Removed compatibility macros: VFS has changed too much
+===============================================================================
+Changes for patch v170
+
+Work sponsored by SGI
+
+- Updated README from master HTML file
+
+- Merged devfs inode into devfs entry
+===============================================================================
+Changes for patch v171
+
+Work sponsored by SGI
+
+- Updated sample modules.conf
+
+- Removed dead code in <devfs_register> which used to call
+ <free_dentries>
+
+- Ported to kernel 2.4.0-test2-pre3
+===============================================================================
+Changes for patch v172
+
+Work sponsored by SGI
+
+- Changed interface to <devfs_register>
+
+- Changed interface to <devfs_register_series>
+===============================================================================
+Changes for patch v173
+
+Work sponsored by SGI
+
+- Simplified interface to <devfs_mk_symlink>
+
+- Simplified interface to <devfs_mk_dir>
+
+- Simplified interface to <devfs_find_handle>
+===============================================================================
+Changes for patch v174
+
+Work sponsored by SGI
+
+- Updated README from master HTML file
+===============================================================================
+Changes for patch v175
+
+Work sponsored by SGI
+
+- DocBook update for fs/devfs/base.c
+ Thanks to Tim Waugh <twaugh@redhat.com>
+
+- Removed stale fs/tunnel.c (was never used or completed)
+===============================================================================
+Changes for patch v176
+
+Work sponsored by SGI
+
+- Updated ToDo list
+
+- Removed sample modules.conf: now distributed with devfsd
+
+- Updated README from master HTML file
+
+- Ported to kernel 2.4.0-test3-pre4 (which had devfs-patch-v174)
+===============================================================================
+Changes for patch v177
+
+- Updated README from master HTML file
+
+- Documentation cleanups
+
+- Ensure <devfs_generate_path> terminates string for root entry
+ Thanks to Tim Jansen <tim@tjansen.de>
+
+- Exported <devfs_get_name> to modules
+
+- Make <devfs_mk_symlink> send events to devfsd
+
+- Cleaned up option processing in <devfs_setup>
+
+- Fixed bugs in handling symlinks: could leak or cause Oops
+
+- Cleaned up directory handling by separating fops
+ Thanks to Alexander Viro <viro@parcelfarce.linux.theplanet.co.uk>
+===============================================================================
+Changes for patch v178
+
+- Fixed handling of inverted options in <devfs_setup>
+===============================================================================
+Changes for patch v179
+
+- Adjusted <try_modload> to account for <devfs_generate_path> fix
+===============================================================================
+Changes for patch v180
+
+- Fixed !CONFIG_DEVFS_FS stub declaration of <devfs_get_info>
+===============================================================================
+Changes for patch v181
+
+- Answered question posed by Al Viro and removed his comments from <devfs_open>
+
+- Moved setting of registered flag after other fields are changed
+
+- Fixed race between <devfsd_close> and <devfsd_notify_one>
+
+- Global VFS changes added bogus BKL to devfsd_close(): removed
+
+- Widened locking in <devfs_readlink> and <devfs_follow_link>
+
+- Replaced <devfsd_read> stack usage with <devfsd_ioctl> kmalloc
+
+- Simplified locking in <devfsd_ioctl> and fixed memory leak
+===============================================================================
+Changes for patch v182
+
+- Created <devfs_*alloc_major> and <devfs_*alloc_devnum>
+
+- Removed broken devnum allocation and use <devfs_alloc_devnum>
+
+- Fixed old devnum leak by calling new <devfs_dealloc_devnum>
+
+- Created <devfs_*alloc_unique_number>
+
+- Fixed number leak for /dev/cdroms/cdrom%d
+
+- Fixed number leak for /dev/discs/disc%d
+===============================================================================
+Changes for patch v183
+
+- Fixed bug in <devfs_setup> which could hang boot process
+===============================================================================
+Changes for patch v184
+
+- Documentation typo fix for fs/devfs/util.c
+
+- Fixed drivers/char/stallion.c for devfs
+
+- Added DEVFSD_NOTIFY_DELETE event
+
+- Updated README from master HTML file
+
+- Removed #include <asm/segment.h> from fs/devfs/base.c
+===============================================================================
+Changes for patch v185
+
+- Made <block_semaphore> and <char_semaphore> in fs/devfs/util.c
+ private
+
+- Fixed inode table races by removing it and using inode->u.generic_ip
+ instead
+
+- Moved <devfs_read_inode> into <get_vfs_inode>
+
+- Moved <devfs_write_inode> into <devfs_notify_change>
+===============================================================================
+Changes for patch v186
+
+- Fixed race in <devfs_do_symlink> for uni-processor
+
+- Updated README from master HTML file
+===============================================================================
+Changes for patch v187
+
+- Fixed drivers/char/stallion.c for devfs
+
+- Fixed drivers/char/rocket.c for devfs
+
+- Fixed bug in <devfs_alloc_unique_number>: limited to 128 numbers
+===============================================================================
+Changes for patch v188
+
+- Updated major masks in fs/devfs/util.c up to Linus' "no new majors"
+ proclamation. Block: were 126 now 122 free, char: were 26 now 19 free
+
+- Updated README from master HTML file
+
+- Removed remnant of multi-mount support in <devfs_mknod>
+
+- Removed unused DEVFS_FL_SHOW_UNREG flag
+===============================================================================
+Changes for patch v189
+
+- Removed nlink field from struct devfs_inode
+
+- Removed auto-ownership for /dev/pty/* (BSD ptys) and used
+ DEVFS_FL_CURRENT_OWNER|DEVFS_FL_NO_PERSISTENCE for /dev/pty/s* (just
+ like Unix98 pty slaves) and made /dev/pty/m* rw-rw-rw- access
+===============================================================================
+Changes for patch v190
+
+- Updated README from master HTML file
+
+- Replaced BKL with global rwsem to protect symlink data (quick and
+ dirty hack)
+===============================================================================
+Changes for patch v191
+
+- Replaced global rwsem for symlink with per-link refcount
+===============================================================================
+Changes for patch v192
+
+- Removed unnecessary #ifdef CONFIG_DEVFS_FS from arch/i386/kernel/mtrr.c
+
+- Ported to kernel 2.4.10-pre11
+
+- Set inode->i_mapping->a_ops for block nodes in <get_vfs_inode>
+===============================================================================
+Changes for patch v193
+
+- Went back to global rwsem for symlinks (refcount scheme no good)
+===============================================================================
+Changes for patch v194
+
+- Fixed overrun in <devfs_link> by removing function (not needed)
+
+- Updated README from master HTML file
+===============================================================================
+Changes for patch v195
+
+- Fixed buffer underrun in <try_modload>
+
+- Moved down_read() from <search_for_entry_in_dir> to <find_entry>
+===============================================================================
+Changes for patch v196
+
+- Fixed race in <devfsd_ioctl> when setting event mask
+ Thanks to Kari Hurtta <hurtta@leija.mh.fmi.fi>
+
+- Avoid deadlock in <devfs_follow_link> by using temporary buffer
+===============================================================================
+Changes for patch v197
+
+- First release of new locking code for devfs core (v1.0)
+
+- Fixed bug in drivers/cdrom/cdrom.c
+===============================================================================
+Changes for patch v198
+
+- Discard temporary buffer, now use "%s" for dentry names
+
+- Don't generate path in <try_modload>: use fake entry instead
+
+- Use "existing" directory in <_devfs_make_parent_for_leaf>
+
+- Use slab cache rather than fixed buffer for devfsd events
+===============================================================================
+Changes for patch v199
+
+- Removed obsolete usage of DEVFS_FL_NO_PERSISTENCE
+
+- Send DEVFSD_NOTIFY_REGISTERED events in <devfs_mk_dir>
+
+- Fixed locking bug in <devfs_d_revalidate_wait> due to typo
+
+- Do not send CREATE, CHANGE, ASYNC_OPEN or DELETE events from devfsd
+ or children
+===============================================================================
+Changes for patch v200
+
+- Ported to kernel 2.5.1-pre2
+===============================================================================
+Changes for patch v201
+
+- Fixed bug in <devfsd_read>: was dereferencing freed pointer
+===============================================================================
+Changes for patch v202
+
+- Fixed bug in <devfsd_close>: was dereferencing freed pointer
+
+- Added process group check for devfsd privileges
+===============================================================================
+Changes for patch v203
+
+- Use SLAB_ATOMIC in <devfsd_notify_de> from <devfs_d_delete>
+===============================================================================
+Changes for patch v204
+
+- Removed long obsolete rc.devfs
+
+- Return old entry in <devfs_mk_dir> for 2.4.x kernels
+
+- Updated README from master HTML file
+
+- Increment refcount on module in <check_disc_changed>
+
+- Created <devfs_get_handle> and exported <devfs_put>
+
+- Increment refcount on module in <devfs_get_ops>
+
+- Created <devfs_put_ops> and used where needed to fix races
+
+- Added clarifying comments in response to preliminary EMC code review
+
+- Added poisoning to <devfs_put>
+
+- Improved debugging messages
+
+- Fixed unregister bugs in drivers/md/lvm-fs.c
+===============================================================================
+Changes for patch v205
+
+- Corrected (made useful) debugging message in <unregister>
+
+- Moved <kmem_cache_create> in <mount_devfs_fs> to <init_devfs_fs>
+
+- Fixed drivers/md/lvm-fs.c to create "lvm" entry
+
+- Added magic number to guard against scribbling drivers
+
+- Only return old entry in <devfs_mk_dir> if a directory
+
+- Defined macros for error and debug messages
+
+- Updated README from master HTML file
+===============================================================================
+Changes for patch v206
+
+- Added support for multiple Compaq cpqarray controllers
+
+- Fixed (rare, old) race in <devfs_lookup>
+===============================================================================
+Changes for patch v207
+
+- Fixed deadlock bug in <devfs_d_revalidate_wait>
+
+- Tag VFS deletable in <devfs_mk_symlink> if handle ignored
+
+- Updated README from master HTML file
+===============================================================================
+Changes for patch v208
+
+- Added KERN_* to remaining messages
+
+- Cleaned up declaration of <stat_read>
+
+- Updated README from master HTML file
+===============================================================================
+Changes for patch v209
+
+- Updated README from master HTML file
+
+- Removed silently introduced calls to lock_kernel() and
+ unlock_kernel() due to recent VFS locking changes. BKL isn't
+ required in devfs
+
+- Changed <devfs_rmdir> to allow later additions if not yet empty
+
+- Added calls to <devfs_register_partitions> in drivers/block/blkpg.c
+ <add_partition> and <del_partition>
+
+- Fixed bug in <devfs_alloc_unique_number>: was clearing beyond
+ bitfield
+
+- Fixed bitfield data type for <devfs_*alloc_devnum>
+
+- Made major bitfield type and initialiser 64 bit safe
+===============================================================================
+Changes for patch v210
+
+- Updated fs/devfs/util.c to fix shift warning on 64 bit machines
+ Thanks to Anton Blanchard <anton@samba.org>
+
+- Updated README from master HTML file
+===============================================================================
+Changes for patch v211
+
+- Do not put miscellaneous character devices in /dev/misc if they
+ specify their own directory (i.e. contain a '/' character)
+
+- Copied macro for error messages from fs/devfs/base.c to
+ fs/devfs/util.c and made use of this macro
+
+- Removed 2.4.x compatibility code from fs/devfs/base.c
+===============================================================================
+Changes for patch v212
+
+- Added BKL to <devfs_open> because drivers still need it
+===============================================================================
+Changes for patch v213
+
+- Protected <scan_dir_for_removable> and <get_removable_partition>
+ from changing directory contents
+===============================================================================
+Changes for patch v214
+
+- Switched to ISO C structure field initialisers
+
+- Switch to set_current_state() and move before add_wait_queue()
+
+- Updated README from master HTML file
+
+- Fixed devfs entry leak in <devfs_readdir> when *readdir fails
+===============================================================================
+Changes for patch v215
+
+- Created <devfs_find_and_unregister>
+
+- Switched many functions from <devfs_find_handle> to
+ <devfs_find_and_unregister>
+
+- Switched many functions from <devfs_find_handle> to <devfs_get_handle>
+===============================================================================
+Changes for patch v216
+
+- Switched arch/ia64/sn/io/hcl.c from <devfs_find_handle> to
+ <devfs_get_handle>
+
+- Removed deprecated <devfs_find_handle>
+===============================================================================
+Changes for patch v217
+
+- Exported <devfs_find_and_unregister> and <devfs_only> to modules
+
+- Updated README from master HTML file
+
+- Fixed module unload race in <devfs_open>
+===============================================================================
+Changes for patch v218
+
+- Removed DEVFS_FL_AUTO_OWNER flag
+
+- Switched lingering structure field initialiser to ISO C
+
+- Added locking when setting/clearing flags
+
+- Documentation fix in fs/devfs/util.c
diff --git a/Documentation/filesystems/devfs/README b/Documentation/filesystems/devfs/README
new file mode 100644
index 0000000..54366ec
--- /dev/null
+++ b/Documentation/filesystems/devfs/README
@@ -0,0 +1,1964 @@
+Devfs (Device File System) FAQ
+
+
+Linux Devfs (Device File System) FAQ
+Richard Gooch
+20-AUG-2002
+
+
+
+
+
+
+
+
+
+
+-----------------------------------------------------------------------------
+
+NOTE: the master copy of this document is available online at:
+
+http://www.atnf.csiro.au/~rgooch/linux/docs/devfs.html
+and looks much better than the text version distributed with the
+kernel sources. A mirror site is available at:
+
+http://www.ras.ucalgary.ca/~rgooch/linux/docs/devfs.html
+
+There is also an optional daemon that may be used with devfs. You can
+find out more about it at:
+
+http://www.atnf.csiro.au/~rgooch/linux/
+
+A mailing list is available which you may subscribe to. Send email to
+majordomo@oss.sgi.com with the following line in the body of the
+message:
+
+  subscribe devfs
+
+To unsubscribe, send "unsubscribe devfs" in the message body instead.
+The list is archived at
+
+http://oss.sgi.com/projects/devfs/archive/.
+
+-----------------------------------------------------------------------------
+
+Contents
+
+
+What is it?
+
+Why do it?
+
+Who else does it?
+
+How it works
+
+Operational issues (essential reading)
+
+  Instructions for the impatient
+  Permissions persistence across reboots
+  Dealing with drivers without devfs support
+  All the way with Devfs
+  Other Issues
+  Kernel Naming Scheme
+  Devfsd Naming Scheme
+  Old Compatibility Names
+  SCSI Host Probing Issues
+
+
+
+Device drivers currently ported
+
+Allocation of Device Numbers
+
+Questions and Answers
+
+  Making things work
+  Alternatives to devfs
+  What I don't like about devfs
+  How to report bugs
+  Strange kernel messages
+  Compilation problems with devfsd
+
+
+Other resources
+
+Translations of this document
+
+
+-----------------------------------------------------------------------------
+
+
+What is it?
+
+Devfs is an alternative to "real" character and block special devices
+on your root filesystem. Kernel device drivers can register devices by
+name rather than major and minor numbers. These devices will appear in
+devfs automatically, with whatever default ownership and
+protection the driver specified. A daemon (devfsd) can be used to
+override these defaults. Devfs has been in the kernel since 2.3.46.
+
+NOTE that devfs is entirely optional. If you prefer the old
+disc-based device nodes, then simply leave CONFIG_DEVFS_FS=n (the
+default). In this case, nothing will change. ALSO NOTE that if you do
+enable devfs, the defaults are such that full compatibility is
+maintained with the old device names.
+
+There are two aspects to devfs: one is the underlying device
+namespace, which is a namespace just like any mounted filesystem. The
+other aspect is the filesystem code which provides a view of the
+device namespace. The reason I make a distinction is because devfs
+can be mounted many times, with each mount showing the same device
+namespace. Changes made are global to all mounted devfs filesystems.
+Also, because the devfs namespace exists without any devfs mounts, you
+can easily mount the root filesystem by referring to an entry in the
+devfs namespace.
+
+
+The cost of devfs is a small increase in kernel code size and memory
+usage: about 7 pages of code (some of that in __init sections) and 72
+bytes for each entry in the namespace. A modest system has only a
+couple of hundred device entries, so this costs a few more
+pages. Compare this with the suggestion to put /dev on a
+ramdisc.
+
+On a typical machine, the cost is under 0.2 percent. On a modest
+system with 64 MBytes of RAM, the cost is under 0.1 percent. The
+accusations of "bloatware" levelled at devfs are not justified.
+
+-----------------------------------------------------------------------------
+
+
+Why do it?
+
+There are several problems that devfs addresses. Some of these
+problems are more serious than others (depending on your point of
+view), and some can be solved without devfs. However, the totality of
+these problems really calls out for devfs.
+
+The choice is between a patchwork of inefficient user-space solutions,
+which are complex and likely to be fragile, and a simple and efficient
+devfs which is robust.
+
+There have been many counter-proposals to devfs, all seeking to
+provide some of the benefits without actually implementing devfs. So
+far there has been an absence of code and no proposed alternative has
+been able to provide all the features that devfs does. Further,
+alternative proposals require far more complexity in user-space (and
+still deliver less functionality than devfs). Some people have the
+mantra of reducing "kernel bloat", but don't consider the effects on
+user-space.
+
+A good solution limits the total complexity of kernel-space and
+user-space.
+
+
+Major&minor allocation
+
+The existing scheme requires the allocation of major and minor device
+numbers for each and every device. This means that a central
+co-ordinating authority is required to issue these device numbers
+(unless you're developing a "private" device driver), in order to
+preserve uniqueness. Devfs shifts the burden to a namespace. This may
+not seem like a huge benefit, but actually it is. Since driver authors
+will naturally choose a device name which reflects the functionality
+of the device, there is far less potential for namespace conflict.
+Solving this requires a kernel change.
+
+/dev management
+
+Because you currently access devices through device nodes, these must
+be created by the system administrator. For standard devices you can
+usually find a MAKEDEV programme which creates all (hundreds!) of
+these nodes. This means that changes in the kernel must be reflected by
+changes in the MAKEDEV programme, or else the system administrator
+creates device nodes by hand.
+
+The basic problem is that there are two separate databases of
+major and minor numbers. One is in the kernel and one is in /dev (or
+in a MAKEDEV programme, if you want to look at it that way). This is
+duplication of information, which is not good practice.
+Solving this requires a kernel change.
+
+/dev growth
+
+A typical /dev has over 1200 nodes! Most of these devices simply don't
+exist because the hardware is not available. A huge /dev increases the
+time to access devices (I'm just referring to the dentry lookup times
+and the time taken to read inodes off disc: the next subsection shows
+some more horrors).
+
+An example of how big /dev can grow is if we consider SCSI devices:
+
+host       6 bits  (say up to 64 hosts on a really big machine)
+channel    4 bits  (say up to 16 SCSI buses per host)
+id         4 bits
+lun        3 bits
+partition  6 bits
+TOTAL     23 bits
+
+
+This requires 2^23 = 8 Mega (8*1024*1024) inodes if we want to store all
+possible device nodes. Even if we scrap everything but id and partition,
+and assume a single host adapter with a single SCSI bus and only one
+logical unit per SCSI target (id), that's still 10 bits or 1024
+inodes. Each VFS inode takes around 256 bytes (kernel 2.1.78), so
+that's 256 kBytes of inode storage on disc (assuming real inodes take
+a similar amount of space as VFS inodes). This is actually not so bad,
+because disc is cheap these days. Embedded systems would care about
+256 kBytes of /dev inodes, but you could argue that embedded systems
+would have hand-tuned /dev directories. I've had to do just that on my
+embedded systems, but I would rather just leave it to devfs.
+
+Another issue is the time taken to look up an inode when first
+referenced. This costs not only the time to scan through a list in
+memory, but also the seek time to read the inodes off disc.
+This could be solved in user-space using a clever programme which
+scanned the kernel logs and deleted /dev entries which are not
+available and created them when they were available. This programme
+would need to be run every time a new module was loaded, which would
+slow things down a lot.
+
+There is an existing programme called scsidev which will automatically
+create device nodes for SCSI devices. It can do this by scanning files
+in /proc/scsi. Unfortunately, to extend this idea to other device
+nodes would require significant modifications to existing drivers (so
+they too would provide information in /proc). This is a non-trivial
+change (I should know: devfs has had to do something similar). Once
+you go to this much effort, you may as well use devfs itself (which
+also provides this information). Furthermore, such a system would
+likely be implemented in an ad-hoc fashion, as different drivers will
+provide their information in different ways.
+
+Devfs is much cleaner, because it (naturally) has a uniform mechanism
+to provide this information: the device nodes themselves!
+
+
+Node to driver file_operations translation
+
+There is an important difference between the way disc-based character
+and block nodes and devfs entries make the connection between an entry
+in /dev and the actual device driver.
+
+With the current 8 bit major and minor numbers the connection between
+disc-based c&b nodes and per-major drivers is done through a
+fixed-length table of 128 entries. The various filesystem types set
+the inode operations for c&b nodes to {chr,blk}dev_inode_operations,
+so when a device is opened a few quick levels of indirection bring us
+to the driver file_operations.
+
+For miscellaneous character devices a second step is required: there
+is a scan for the driver entry with the same minor number as the file
+that was opened, and the appropriate minor open method is called. This
+scanning is done *every time* you open a device node. Potentially, you
+may be searching through dozens of misc. entries before you find your
+open method. While not an enormous performance overhead, this does
+seem pointless.
+
+Linux *must* move beyond the 8 bit major and minor barrier,
+somehow. If we simply increase each to 16 bits, then the indexing
+scheme used for major driver lookup becomes untenable, because the
+major tables (one each for character and block devices) would need to
+be 64 k entries long (512 kBytes on x86, 1 MByte for 64 bit
+systems). So we would have to use a scheme like that used for
+miscellaneous character devices, which means the search time goes up
+linearly with the average number of major device drivers on your
+system. Not all "devices" are hardware: some are higher-level drivers
+like KGI, so you can get more "devices" without adding hardware.
+You can improve this by creating an ordered (balanced:-)
+binary tree, in which case your search time becomes log(N).
+Alternatively, you can use hashing to speed up the search.
+But why do that search at all if you don't have to? Once again, it
+seems pointless.
+
+Note that devfs doesn't use the major&minor system. For devfs
+entries, the connection is done when you look up the /dev entry. When
+devfs_register() is called, the entry name and the file_operations are
+appended to an internal table. If the dentry cache doesn't
+have the /dev entry already, this internal table is scanned to get the
+file_operations, and an inode is created. If the dentry cache already
+has the entry, there is *no lookup time* (other than the dentry scan
+itself, but we can't avoid that anyway, and besides Linux dentries
+cream other OS's which don't have them:-). Furthermore, the number of
+node entries in a devfs is only the number of available device
+entries, not the number of *conceivable* entries. Even if you remove
+unnecessary entries in a disc-based /dev, the number of conceivable
+entries remains the same: you just limit yourself in order to save
+space.
+
+Devfs provides a fast connection between a VFS node and the device
+driver, in a scalable way.
+
+/dev as a system administration tool
+
+Right now /dev contains a list of conceivable devices, most of which I
+don't have. Devfs only shows those devices available on my
+system. This means that listing /dev is a handy way of checking what
+devices are available.
+
+Major&minor size
+
+Existing major and minor numbers are limited to 8 bits each. This is
+now a limiting factor for some drivers, particularly the SCSI disc
+driver, which consumes a single major number. Only 16 discs are
+supported, and each disc may have only 15 partitions. Maybe this isn't
+a problem for you, but some of us are building huge Linux systems with
+disc arrays. With devfs an arbitrary pointer can be associated with
+each device entry, which can be used to give an effective 32 bit
+device identifier (i.e. that's like having a 32 bit minor
+number). Since this is private to the kernel, there are no C library
+compatibility issues which you would have with increasing major and
+minor number sizes. See the section on "Allocation of Device Numbers"
+for details on maintaining compatibility with userspace.
+
+Solving this requires a kernel change.
+
+Since writing this, the kernel has been modified so that the SCSI disc
+driver has more major numbers allocated to it and now supports up to
+128 discs. Since these major numbers are non-contiguous (a result of
+unplanned expansion), the implementation is a little more cumbersome
+than originally.
+
+As with the changes made to IPv4 to fix impending limitations in the
+address space, people find ways around the limitations. In the long
+run, however, solutions like IPv6 or devfs can't be put off forever.
+
+Read-only root filesystem
+
+Having your device nodes on the root filesystem means that you can't
+operate properly with a read-only root filesystem. This is because you
+want to change ownerships and protections of tty devices. Existing
+practice prevents you using a CD-ROM as your root filesystem for a
+*real* system. Sure, you can boot off a CD-ROM, but you can't change
+tty ownerships, so it's only good for installing.
+
+Also, you can't use a shared NFS root filesystem for a cluster of
+discless Linux machines (having tty ownerships changed on a common
+/dev is not good). Nor can you embed your root filesystem in a
+ROM-FS.
+
+You can get around this by creating a RAMDISC at boot time, making
+an ext2 filesystem in it, mounting it somewhere and copying the
+contents of /dev into it, then unmounting it and mounting it over
+/dev.
+
+A devfs is a cleaner way of solving this.
+
+Non-Unix root filesystem
+
+Non-Unix filesystems (such as NTFS) can't be used for a root
+filesystem because they variously don't support character and block
+special files or symbolic links. You can't have a separate disc-based
+or RAMDISC-based filesystem mounted on /dev because you need device
+nodes before you can mount these. Devfs can be mounted without any
+device nodes. Devlinks won't work because symlinks aren't supported.
+An alternative solution is to use initrd to mount a RAMDISC initial
+root filesystem (which is populated with a minimal set of device
+nodes), and then construct a new /dev in another RAMDISC, and finally
+switch to your non-Unix root filesystem. This requires clever boot
+scripts and a fragile and conceptually complex boot procedure.
+
+Devfs solves this in a robust and conceptually simple way.
+
+PTY security
+
+Current pseudo-tty (pty) devices are owned by root and read-writable
+by everyone. The user of a pty-pair cannot change
+ownership/protections without being suid-root.
+
+This could be solved with a secure user-space daemon which runs as
+root and does the actual creation of pty-pairs. Such a daemon would
+require modification to *every* programme that wants to use this new
+mechanism. It also slows down creation of pty-pairs.
+
+An alternative is to create a new open_pty() syscall which does much
+the same thing as the user-space daemon. Once again, this requires
+modifications to pty-handling programmes.
+
+The devfs solution allows a device driver to "tag" certain device
+files so that when an unopened device is opened, the ownerships are
+changed to the current euid and egid of the opening process, and the
+protections are changed to the default registered by the driver. When
+the device is closed ownership is set back to root and protections are
+set back to read-write for everybody. No programme need be changed.
+The devpts filesystem provides this auto-ownership feature for Unix98
+ptys. It doesn't support old-style pty devices, nor does it have all
+the other features of devfs.
+
+Intelligent device management
+
+Devfs implements a simple yet powerful protocol for communication with
+a device management daemon (devfsd) which runs in user space. It is
+possible to send a message (either synchronously or asynchronously) to
+devfsd on any event, such as registration/unregistration of device
+entries, opening and closing devices, looking up inodes, scanning
+directories and more. This has many possibilities. Some of these are
+already implemented. See:
+
+
+http://www.atnf.csiro.au/~rgooch/linux/
+
+Device entry registration events can be used by devfsd to change
+permissions of newly-created device nodes. This is one mechanism to
+control device permissions.
+
+Device entry registration/unregistration events can be used to run
+programmes or scripts. This can be used to provide automatic mounting
+of filesystems when a new block device medium is inserted into the
+drive.
+
+Asynchronous device open and close events can be used to implement
+clever permissions management. For example, the default permissions on
+/dev/dsp do not allow everybody to read from the device. This is
+sensible, as you don't want some remote user recording what you say at
+your console. However, the console user is also prevented from
+recording. This behaviour is not desirable. With asynchronous device
+open and close events, you can have devfsd run a programme or script
+when console devices are opened to change the ownerships for *other*
+device nodes (such as /dev/dsp). On closure, you can run a different
+script to restore permissions. An advantage of this scheme over
+modifying the C library tty handling is that this works even if your
+programme crashes (how many times have you seen the utmp database with
+lingering entries for non-existent logins?).
+
+Synchronous device open events can be used to perform intelligent
+device access protections. Before the device driver open() method is
+called, the daemon must first validate the open attempt, by running an
+external programme or script. This is far more flexible than access
+control lists, as access can be determined on the basis of other
+system conditions instead of just the UID and GID.
+
+Inode lookup events can be used to authenticate module autoload
+requests. Instead of using kmod directly, the event is sent to
+devfsd which can implement an arbitrary authentication before loading
+the module itself.
+
+Inode lookup events can also be used to construct arbitrary
+namespaces, without having to resort to populating devfs with symlinks
+to devices that don't exist.
+
+Speculative Device Scanning
+
+Consider an application (like cdparanoia) that wants to find all
+CD-ROM devices on the system (SCSI, IDE and other types), whether or
+not their respective modules are loaded. The application must
+speculatively open certain device nodes (such as /dev/sr0 for the SCSI
+CD-ROMs) in order to make sure the module is loaded. This requires
+that all Linux distributions follow the standard device naming scheme
+(last time I looked RedHat did things differently). Devfs solves the
+naming problem.
+
+The same application also wants to see which devices are actually
+available on the system. With the existing system it needs to read the
+/dev directory and speculatively open each /dev/sr* device to
+determine if the device exists or not. With a large /dev this is an
+inefficient operation, especially if there are many /dev/sr* nodes. A
+solution like scsidev could reduce the number of /dev/sr* entries (but
+of course that also requires all that inefficient directory scanning).
+
+With devfs, the application can open the /dev/sr directory
+(which triggers the module autoloading if required), and proceed to
+read /dev/sr. Since only the available devices will have
+entries, there are no inefficiencies in directory scanning or device
+openings.
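+
+As a rough sketch of the devfs case (not taken from cdparanoia or any
+other real application), the loop below lists whatever entries
+currently exist under /dev/sr. It assumes devfs is mounted on /dev and
+that the /dev/sr directory described above is present.
+
+```sh
+# Sketch only: enumerate the CD-ROM entries that devfs exposes.
+# Accessing the directory is what may trigger module autoloading.
+for dev in /dev/sr/*; do
+    [ -e "$dev" ] || continue   # glob matched nothing: no devices found
+    echo "CD-ROM device: $dev"
+done
+```
+
+No device node has to be opened speculatively, because an entry only
+exists if the device does.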
+
+-----------------------------------------------------------------------------
+
+Who else does it?
+
+FreeBSD has a devfs implementation. Solaris and AIX each have a
+pseudo-devfs (something akin to scsidev but for all devices, with some
+unspecified kernel support). BeOS, Plan9 and QNX also have it. SGI's
+IRIX 6.4 and above also have a device filesystem.
+
+While we shouldn't just automatically do something because others do
+it, we should not ignore the work of others either. FreeBSD has a lot
+of competent people working on it, so their opinion should not be
+blithely ignored.
+
+-----------------------------------------------------------------------------
+
+
+How it works
+
+Registering device entries
+
+For every entry (device node) in a devfs-based /dev a driver must call
+devfs_register(). This adds the name of the device entry, the
+file_operations structure pointer and a few other things to an
+internal table. Device entries may be added and removed at any
+time. When a device entry is registered, it automagically appears in
+every mounted devfs.
+
+Inode lookup
+
+When a lookup operation is performed on an entry and there is no
+driver information for that entry, devfs will attempt to call
+devfsd. If driver information still cannot be found, a negative
+dentry is yielded and the next stage operation will be called by the
+VFS (such as the create() or mknod() inode methods). If driver
+information can be found, an inode is created (if one does not exist
+already) and all is well.
+
+Manually creating device nodes
+
+The mknod() method allows you to create an ordinary named pipe in the
+devfs, or you can create a character or block special inode if one
+does not already exist. You may wish to create a character or block
+special inode so that you can set permissions and ownership. Later, if
+a device driver registers an entry with the same name, the
+permissions, ownership and times are retained. This is how you can set
+the protections on a device even before the driver is loaded. Once you
+create an inode it appears in the directory listing.
+
+Unregistering device entries
+
+A device driver calls devfs_unregister() to unregister an entry.
+
+Chroot() gaols
+
+2.2.x kernels
+
+The semantics of inode creation are different when devfs is mounted
+with the "explicit" option. Now, when a device entry is registered, it
+will not appear until you use mknod() to create the device. It doesn't
+matter if you mknod() before or after the device is registered with
+devfs_register(). The purpose of this behaviour is to support
+chroot(2) gaols, where you want to mount a minimal devfs inside the
+gaol. Only the devices you specifically want to be available (through
+your mknod() setup) will be accessible.
+
+2.4.x kernels
+
+As of kernel 2.3.99, the VFS has had the ability to rebind parts of
+the global filesystem namespace into another part of the namespace.
+This now works even at the leaf-node level, which means that
+individual files and device nodes may be bound into other parts of the
+namespace. This is like making links, but better, because it works
+across filesystems (unlike hard links) and works through chroot()
+gaols (unlike symbolic links).
+
+Because of these improvements to the VFS, the multi-mount capability
+in devfs is no longer needed. The administrator may create a minimal
+device tree inside a chroot(2) gaol by using VFS bindings. As this
+provides most of the features of the devfs multi-mount capability, I
+removed the multi-mount support code (after issuing an RFC). This
+yielded code size reductions and simplifications.
+
+If you want to construct a minimal chroot() gaol, the following
+command should suffice:
+
+mount --bind /dev/null /gaol/dev/null
+
+
+Repeat for other device nodes you want to expose. Simple!
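+
+A minimal sketch of repeating the bind for several nodes follows. The
+node names are only examples, and /gaol is the same illustrative path
+used above. The bind target has to exist before mount(8) is called, so
+an empty file is created first.
+
+```sh
+# Sketch only: expose a few device nodes inside the gaol (run as root).
+mkdir -p /gaol/dev
+for node in null zero tty; do      # example names only
+    # A bind mount needs an existing target, so create an empty file.
+    [ -e "/gaol/dev/$node" ] || : > "/gaol/dev/$node"
+    mount --bind "/dev/$node" "/gaol/dev/$node"
+done
+```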
+
+-----------------------------------------------------------------------------
+
+
+Operational issues
+
+
+Instructions for the impatient
+
+Nobody likes reading documentation. People just want to get in there
+and play. So this section tells you quickly the steps you need to take
+to run with devfs mounted over /dev. Skip these steps and you will end
+up with a nearly unbootable system. Subsequent sections describe the
+issues in more detail, and discuss non-essential configuration
+options.
+
+Devfsd
+OK, if you're reading this, I assume you want to play with
+devfs. First you should ensure that /usr/src/linux contains a
+recent kernel source tree. Then you need to compile devfsd, the device
+management daemon, available at
+http://www.atnf.csiro.au/~rgooch/linux/. Because the kernel has a
+naming scheme which is quite different from the old naming scheme, you
+need to install devfsd so that software and configuration files that
+use the old naming scheme will not break.
+
+Compile and install devfsd. You will be provided with a default
+configuration file /etc/devfsd.conf which will provide
+compatibility symlinks for the old naming scheme. Don't change this
+config file unless you know what you're doing. Even if you think you
+do know what you're doing, don't change it until you've followed all
+the steps below and booted a devfs-enabled system and verified that it
+works.
+
+Now edit your main system boot script so that devfsd is started at the
+very beginning (before any filesystem checks). On systems with
+SysV-style boot scripts the main boot script is often
+/etc/rc.d/rc.sysinit; on systems with BSD-style boot scripts it is
+often /etc/rc. Also check /sbin/rc.
+
+NOTE that the line you put into the boot script should be exactly:
+
+/sbin/devfsd /dev
+
+DO NOT use some special daemon-launching programme, otherwise the boot
+script may not wait for devfsd to finish initialising.
+
+System Libraries
+There may still be some problems because of broken software making
+assumptions about device names. In particular, some software does not
+handle devices which are symbolic links. If you are running a libc 5
+based system, install libc 5.4.44 (if you have libc 5.4.46, go back to
+libc 5.4.44, which is actually correct). If you are running a glibc
+based system, make sure you have glibc 2.1.3 or later.
+
+/etc/securetty
+PAM (Pluggable Authentication Modules) is supposed to be a flexible
+mechanism for providing better user authentication and access to
+services. Unfortunately, it's also fragile, complex and undocumented
+(check out RedHat 6.1, and probably other distributions as well). PAM
+has problems with symbolic links. Append the following lines to your
+/etc/securetty file:
+
+vc/1
+vc/2
+vc/3
+vc/4
+vc/5
+vc/6
+vc/7
+vc/8
+
+This will not weaken security. If you have a version of util-linux
+earlier than 2.10.h, please upgrade to 2.10.h or later. If you
+absolutely cannot upgrade, then also append the following lines to
+your /etc/securetty file:
+
+1
+2
+3
+4
+5
+6
+7
+8
+
+This may potentially weaken security by allowing root logins over the
+network (a password is still required, though). However, since there
+are problems with dealing with symlinks, I'm suspicious of the level
+of security offered in any case.
+
+XFree86
+While not essential, it's probably a good idea to upgrade to XFree86
+4.0, as patches went in to make it more devfs-friendly. If you don't,
+you'll probably need to apply the following patch to
+/etc/security/console.perms so that ordinary users can run
+startx. Note that not all distributions have this file (e.g. Debian),
+so if it's not present, don't worry about it.
+
+--- /etc/security/console.perms.orig Sat Apr 17 16:26:47 1999
++++ /etc/security/console.perms Fri Feb 25 23:53:55 2000
+@@ -14,7 +14,7 @@
+ # man 5 console.perms
+
+ # file classes -- these are regular expressions
+-<console>=tty[0-9][0-9]* :[0-9]\.[0-9] :[0-9]
++<console>=tty[0-9][0-9]* vc/[0-9][0-9]* :[0-9]\.[0-9] :[0-9]
+
+ # device classes -- these are shell-style globs
+ <floppy>=/dev/fd[0-1]*
+
+If the patch does not apply, then replace the line:
+
+<console>=tty[0-9][0-9]* :[0-9]\.[0-9] :[0-9]
+
+with:
+
+<console>=tty[0-9][0-9]* vc/[0-9][0-9]* :[0-9]\.[0-9] :[0-9]
+
+
+Disable devpts
+I've had a report of devpts mounted on /dev/pts not working
+correctly. Since devfs will also manage /dev/pts, there is no
+need to mount devpts as well. You should either edit your
+/etc/fstab so devpts is not mounted, or disable devpts from
+your kernel configuration.
+
+Unsupported drivers
+Not all drivers have devfs support. If you depend on one of the
+unsupported drivers, you will need to create a script or tarfile that
+you can use at boot time to create device nodes as appropriate. The
+section "Dealing with drivers without devfs support" describes this,
+and the section "Device drivers currently ported" lists the drivers
+which have devfs support.
+
+/dev/mouse
+
+Many distributions configure /dev/mouse to be the mouse device
+for XFree86 and GPM. I actually think this is a bad idea, because it
+adds another level of indirection. When looking at a config file, if
+you see /dev/mouse you're left wondering which mouse
+is being referred to. Hence I recommend putting the actual mouse
+device (for example /dev/psaux) into your
+/etc/X11/XF86Config file (and similarly for the GPM
+configuration file).
+
+Alternatively, use the same technique used for unsupported drivers
+described above.
+
+The Kernel
+Finally, you need to make sure devfs is compiled into your kernel. Set
+CONFIG_EXPERIMENTAL=y, CONFIG_DEVFS_FS=y and CONFIG_DEVFS_MOUNT=y
+using your favourite configuration tool (e.g. make config or
+make xconfig), then run make clean and recompile your kernel and
+modules. At boot, devfs will be mounted onto /dev.
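+
+One way to double-check the three options before rebuilding is shown
+below. This is only a sketch: it assumes the source tree lives in
+/usr/src/linux, as above, and that your configuration tool has written
+its result to the usual .config file there.
+
+```sh
+cd /usr/src/linux
+# Should print one line per option, each ending in "=y".
+grep -E '^CONFIG_(EXPERIMENTAL|DEVFS_FS|DEVFS_MOUNT)=' .config
+```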
+
+If you encounter problems booting (for example if you forgot a
+configuration step), you can pass devfs=nomount at the kernel
+boot command line. This will prevent the kernel from mounting devfs at
+boot time onto /dev.
+
+In general, a kernel built with CONFIG_DEVFS_FS=y but without mounting
+devfs onto /dev is completely safe, and requires no
+configuration changes. One exception to take note of is when
+LABEL= directives are used in /etc/fstab. In this
+case you will be unable to boot properly. This is because the
+mount(8) programme uses /proc/partitions as part of
+the volume label search process, and the device names it finds are not
+available, because setting CONFIG_DEVFS_FS=y changes the names in
+/proc/partitions, irrespective of whether devfs is mounted.
+
+Now you've finished all the steps required. You're now ready to boot
+your shiny new kernel. Enjoy.
+
+Changing the configuration
+
+OK, you've now booted a devfs-enabled system, and everything works.
+Now you may feel like changing the configuration (common targets are
+/etc/fstab and /etc/devfsd.conf). Since you have a
+system that works, if you make any changes and it doesn't work, you
+now know that you only have to restore your configuration files to the
+default and it will work again.
+
+
+Permissions persistence across reboots
+
+If you don't use mknod(2) to create a device file, nor use chmod(2) or
+chown(2) to change the ownerships/permissions, the inode ctime will
+remain at 0 (the epoch, 12 am, 1-JAN-1970, GMT). Anything with a ctime
+later than this has had its ownership/permissions changed. Hence, a
+simple script or programme may be used to tar up all changed inodes,
+prior to shutdown. Although effective, many consider this approach a
+kludge.
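+
+As an illustration of the tar approach only, the sketch below collects
+every entry under /dev whose ctime is non-zero and archives it. It
+assumes GNU find, stat and tar; both file names are hypothetical
+placeholders.
+
+```sh
+# Sketch only: save device nodes whose ownership/permissions changed.
+find /dev -mindepth 1 | while IFS= read -r node; do
+    # A ctime of 0 (the epoch) marks an unchanged inode.
+    if [ "$(stat -c %Z "$node")" -gt 0 ]; then
+        printf '%s\n' "$node"
+    fi
+done > /tmp/changed-nodes
+# --no-recursion stores directories themselves, not their contents.
+tar -cf /var/tmp/dev-perms.tar --no-recursion -T /tmp/changed-nodes
+```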
+
+A much better approach is to use devfsd to save and restore
+permissions. It may be configured to record changes in permissions and
+will save them in a database (in fact a directory tree), and restore
+these upon boot. This is an efficient method and results in immediate
+saving of current permissions (unlike the tar approach, which saves
+permissions at some unspecified future time).
+
+The default configuration file supplied with devfsd has config entries
+which you may uncomment to enable persistence management.
+
+If you decide to use the tar approach anyway, be aware that tar will
+first unlink(2) an inode before creating a new device node. The
+unlink(2) has the effect of breaking the connection between a devfs
+entry and the device driver. If you use the "devfs=only" boot option,
+you lose access to the device driver, requiring you to reload the
+module. I consider this a bug in tar (there is no real need to
+unlink(2) the inode first).
+
+Alternatively, you can use devfsd to provide more sophisticated
+management of device permissions. You can use devfsd to store
+permissions for whole groups of devices with a single configuration
+entry, rather than the conventional single entry per device entry.
+
+Permissions database stored in mounted-over /dev
+
+If you wish to save and restore your device permissions into the
+disc-based /dev while still mounting devfs onto /dev
+you may do so. This requires a 2.4.x kernel (in fact, 2.3.99 or
+later), which has the VFS binding facility. You need to do the
+following to set this up:
+
+
+
+- make sure the kernel does not mount devfs at boot time
+
+- make sure you have a correct /dev/console entry in your root
+  file-system (where your disc-based /dev lives)
+
+- create the /dev-state directory
+
+- add the following lines near the very beginning of your boot
+  scripts:
+
+mount --bind /dev /dev-state
+mount -t devfs none /dev
+devfsd /dev
+
+- add the following lines to your /etc/devfsd.conf file:
+
+REGISTER ^pt[sy] IGNORE
+CREATE ^pt[sy] IGNORE
+CHANGE ^pt[sy] IGNORE
+DELETE ^pt[sy] IGNORE
+REGISTER .* COPY /dev-state/$devname $devpath
+CREATE .* COPY $devpath /dev-state/$devname
+CHANGE .* COPY $devpath /dev-state/$devname
+DELETE .* CFUNCTION GLOBAL unlink /dev-state/$devname
+RESTORE /dev-state
+
+  Note that the sample devfsd.conf file contains these lines, as well
+  as other sample configurations you may find useful. See the devfsd
+  distribution.
+
+- reboot.
+
+
+
+
+Permissions database stored in normal directory
+
+If you are using an older kernel which doesn't support VFS binding,
+then you won't be able to have the permissions database in a
+mounted-over /dev. However, you can still use a regular
+directory to store the database. The sample /etc/devfsd.conf
+file above may still be used. You will need to create the
+/dev-state directory prior to installing devfsd. If you have
+old permissions in /dev, then just copy (or move) the device
+nodes over to the new directory.
+
+Which method is better?
+
+The best method is to have the permissions database stored in the
+mounted-over /dev. This is because you will not need to copy
+device nodes over to /dev-state, and because it allows you to
+switch between devfs and non-devfs kernels, without requiring you to
+copy permissions between /dev-state (for devfs) and
+/dev (for non-devfs).
+
+
+Dealing with drivers without devfs support
+
+Currently, not all device drivers in the kernel have been modified to
+use devfs. Device drivers which do not yet have devfs support will not
+automagically appear in devfs. The simplest way to create device nodes
+for these drivers is to unpack a tarfile containing the required
+device nodes. You can do this in your boot scripts. All your drivers
+will now work as before.
+
+Hopefully for most people devfs will have enough support so that they
+can mount devfs directly over /dev without losing most functionality
+(i.e. losing access to various devices). As of 22-JAN-1998 (devfs
+patch version 10) I am now running this way. All the devices I have
+are available in devfs, so I don't lose anything.
+
+WARNING: if your configuration requires the old-style device names
+(e.g. /dev/hda1 or /dev/sda1), you must install devfsd and configure
+it to maintain compatibility entries. It is almost certain that you
+will require this. Note that the kernel creates a compatibility entry
+for the root device, so you don't need initrd.
+
+Note that you no longer need to mount devpts if you use Unix98 PTYs,
+as devfs can manage /dev/pts itself. This saves you some RAM, as you
+don't need to compile and install devpts. Note that some versions of
+glibc have a bug with Unix98 pty handling on devfs systems. Contact
+the glibc maintainers for a fix. Glibc 2.1.3 has the fix.
+
+Note also that apart from editing /etc/fstab, other things will need
+to be changed if you *don't* install devfsd. Some software (like the X
+server) hard-wire device names in their source. It really is much
+easier to install devfsd so that compatibility entries are created.
+You can then slowly migrate your system to using the new device names
+(for example, by starting with /etc/fstab), and then limiting the
+compatibility entries that devfsd creates.
+
+IF YOU CONFIGURE TO MOUNT DEVFS AT BOOT, MAKE SURE YOU INSTALL DEVFSD
+BEFORE YOU BOOT A DEVFS-ENABLED KERNEL!
+
+Now that devfs has gone into the 2.3.46 kernel, I'm getting a lot of
+reports back. Many of these are because people are trying to run
+without devfsd, and hence some things break. Please just run devfsd if
+things break. I want to concentrate on real bugs rather than
+misconfiguration problems at the moment. If people are willing to fix
+bugs/false assumptions in other code (e.g. glibc, X server) and submit
+that to the respective maintainers, that would be great.
+
+
+All the way with Devfs
+
+The devfs kernel patch creates a rationalised device tree. As stated
+above, if you want to keep using the old /dev naming scheme,
+you just need to configure devfsd appropriately (see the man
+page). People who prefer the old names can ignore this section. For
+those of us who like the rationalised names and an uncluttered
+/dev, read on.
+
+If you don't run devfsd, or don't enable compatibility entry
+management, then you will have to configure your system to use the new
+names. For example, you will then need to edit your
+/etc/fstab to use the new disc naming scheme. If you want to
+be able to boot non-devfs kernels, you will need compatibility
+symlinks in the underlying disc-based /dev pointing back to
+the old-style names for when you boot a kernel without devfs.
+
+You can selectively decide which devices you want compatibility
+entries for. For example, you may only want compatibility entries for
+BSD pseudo-terminal devices (otherwise you'll have to patch your C
+library or use Unix98 ptys instead). It's just a matter of putting
+the correct regular expression into /etc/devfsd.conf.
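+
+For instance, to keep old-style names only for the pseudo-terminals,
+something like the following could be appended to /etc/devfsd.conf.
+This is a sketch: it reuses the MKOLDCOMPAT and RMOLDCOMPAT actions
+described under "Old Compatibility Names", and the ^pty/ pattern is an
+assumption based on the /dev/pty names the kernel supplies.
+
+```sh
+# Sketch only: restrict compatibility entries to BSD pseudo-terminals.
+cat >> /etc/devfsd.conf << 'EOF'
+REGISTER        ^pty/.*         MKOLDCOMPAT
+UNREGISTER      ^pty/.*         RMOLDCOMPAT
+EOF
+```
+
+Any existing catch-all (.*) compatibility lines would need to be
+removed for the restriction to have an effect.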
+
+There are other choices of naming schemes that you may prefer. For
+example, I don't use the kernel-supplied
+names, because they are too verbose. A common misconception is
+that the kernel-supplied names are meant to be used directly in
+configuration files. This is not the case. They are designed to
+reflect the layout of the devices attached and to provide easy
+classification.
+
+If you like the kernel-supplied names, that's fine. If you don't then
+you should be using devfsd to construct a namespace more to your
+liking. Devfsd has built-in code to construct a
+namespace that is both logical and easy to
+manage. In essence, it creates a convenient abbreviation of the
+kernel-supplied namespace.
+
+You are of course free to build your own namespace. Devfsd has all the
+infrastructure required to make this easy for you. All you need do is
+write a script. You can even write some C code and devfsd can load the
+shared object as a callable extension.
+
+
+Other Issues
+
+The init programme
+Another thing to take note of is whether your init programme
+creates a Unix socket /dev/telinit. Some versions of init
+create /dev/telinit so that the telinit programme can
+communicate with the init process. If you have such a system you need
+to make sure that devfs is mounted over /dev *before* init
+starts. In other words, you can't leave the mounting of devfs to
+/etc/rc, since this is executed after init. Other
+versions of init require a named pipe /dev/initctl
+which must exist *before* init starts. Once again, you need to
+mount devfs and then create the named pipe *before* init
+starts.
+
+The default behaviour now is not to mount devfs onto /dev at
+boot time for 2.3.x and later kernels. You can correct this with the
+"devfs=mount" boot option. This solves any problems with init,
+and also prevents the dreaded:
+
+Cannot open initial console
+
+message. For 2.2.x kernels where you need to apply the devfs patch,
+the default is to mount.
+
+If you have automatic mounting of devfs onto /dev then you
+may need to create /dev/initctl in your boot scripts. The
+following lines should suffice:
+
+mknod /dev/initctl p
+kill -SIGUSR1 1 # tell init that /dev/initctl now exists
+
+Alternatively, if you don't want the kernel to mount devfs onto
+/dev, then the following procedure is a guideline for how to get
+around /dev/initctl problems:
+
+# cd /sbin
+# mv init init.real
+# cat > init
+#! /bin/sh
+mount -n -t devfs none /dev
+mknod /dev/initctl p
+exec /sbin/init.real "$@"
+[control-D]
+# chmod a+x init
+
+Note that newer versions of init create /dev/initctl
+automatically, so you don't have to worry about this.
+
+Module autoloading
+You will need to configure devfsd to enable module
+autoloading. The following lines should be placed in your
+/etc/devfsd.conf file:
+
+LOOKUP .* MODLOAD
+
+
+As of devfsd-v1.3.10, a generic /etc/modules.devfs
+configuration file is installed, which is used by the MODLOAD
+action. This should be sufficient for most configurations. If you
+require further configuration, edit your /etc/modules.conf
+file. Module autoloading with devfs works as follows:
+
+
+- a process attempts to look up a device node (e.g. /dev/fred)
+
+- if that device node does not exist, the full pathname is passed to
+  devfsd as a string
+
+- devfsd passes the string to the modprobe programme (provided the
+  configuration line shown above is present), and specifies that
+  /etc/modules.devfs is the configuration file
+
+- /etc/modules.devfs includes /etc/modules.conf to access local
+  configurations
+
+- modprobe searches its configuration files, looking for an alias
+  that translates the pathname into a module name
+
+- the translated pathname is then used to load the module.
+
+
+If you wanted a lookup of /dev/fred to load the
+mymod module, you would require the following configuration
+line in /etc/modules.conf:
+
+alias /dev/fred mymod
+
+The /etc/modules.devfs configuration file provides many such
+aliases for standard device names. If you look closely at this file,
+you will note that some modules require multiple alias configuration
+lines. This is required to support module autoloading for old and new
+device names.
+
+Mounting root off a devfs device
+If you wish to mount root off a devfs device when you pass the
+"devfs=only" boot option, then you need to pass in the
+"root=<device>" option to the kernel when booting. If you use
+LILO, then you must have this in lilo.conf:
+
+append = "root=<device>"
+
+Surprised? Yep, so was I. It turns out if you have (as most people
+do):
+
+root = <device>
+
+
+then LILO will determine the device number of <device> and will
+write that device number into a special place in the kernel image
+before starting the kernel, and the kernel will use that device number
+to mount the root filesystem. So, using the "append" variety ensures
+that LILO passes the root filesystem device as a string, which devfs
+can then use.
+
+Note that this isn't an issue if you don't pass "devfs=only".
+
+TTY issues
+The ttyname(3) function in some versions of the C library makes
+false assumptions about device entries which are symbolic links. The
+tty(1) programme is one that depends on this function. I've
+written a patch to libc 5.4.43 which fixes this. This has been
+included in libc 5.4.44 and a similar fix is in glibc 2.1.3.
+
+
+Kernel Naming Scheme
+
+The kernel provides a default naming scheme. This scheme is designed
+to make it easy to search for specific devices or device types, and to
+view the available devices. Some device types (such as hard discs),
+have a directory of entries, making it easy to see what devices of
+that class are available. Often, the entries are symbolic links into a
+directory tree that reflects the topology of available devices. The
+topological tree is useful for finding how your devices are arranged.
+
+Below is a list of the naming schemes for the most common drivers. A
+list of reserved device names is
+available for reference. Please send email to
+rgooch@atnf.csiro.au to obtain an allocation. Please be
+patient (the maintainer is busy). An alternative name may be allocated
+instead of the requested name, at the discretion of the maintainer.
+
+Disc Devices
+
+All discs, whether SCSI, IDE or whatever, are placed under the
+/dev/discs hierarchy:
+
+ /dev/discs/disc0 first disc
+ /dev/discs/disc1 second disc
+
+
+Each of these entries is a symbolic link to the directory for that
+device. The device directory contains:
+
+ disc for the whole disc
+ part* for individual partitions
+
+
+CD-ROM Devices
+
+All CD-ROMs, whether SCSI, IDE or whatever, are placed under the
+/dev/cdroms hierarchy:
+
+ /dev/cdroms/cdrom0 first CD-ROM
+ /dev/cdroms/cdrom1 second CD-ROM
+
+
+Each of these entries is a symbolic link to the real device entry for
+that device.
+
+Tape Devices
+
+All tapes, whether SCSI, IDE or whatever, are placed under the
+/dev/tapes hierarchy:
+
+ /dev/tapes/tape0 first tape
+ /dev/tapes/tape1 second tape
+
+
+Each of these entries is a symbolic link to the directory for that
+device. The device directory contains:
+
+ mt for mode 0
+ mtl for mode 1
+ mtm for mode 2
+ mta for mode 3
+ mtn for mode 0, no rewind
+ mtln for mode 1, no rewind
+ mtmn for mode 2, no rewind
+ mtan for mode 3, no rewind
+
+
+SCSI Devices
+
+To uniquely identify any SCSI device requires the following
+information:
+
+ controller (host adapter)
+ bus (SCSI channel)
+ target (SCSI ID)
+ unit (Logical Unit Number)
+
+
+All SCSI devices are placed under /dev/scsi (assuming devfs
+is mounted on /dev). Hence, a SCSI device with the following
+parameters: c=1,b=2,t=3,u=4 would appear as:
+
+ /dev/scsi/host1/bus2/target3/lun4 device directory
+
+
+Inside this directory, a number of device entries may be created,
+depending on which SCSI device-type drivers were installed.
+
+See the section on the disc naming scheme to see what entries the SCSI
+disc driver creates.
+
+See the section on the tape naming scheme to see what entries the SCSI
+tape driver creates.
+
+The SCSI CD-ROM driver creates:
+
+ cd
+
+
+The SCSI generic driver creates:
+
+ generic
+
+
+IDE Devices
+
+To uniquely identify any IDE device requires the following
+information:
+
+ controller
+ bus (aka. primary/secondary)
+ target (aka. master/slave)
+ unit
+
+
+All IDE devices are placed under /dev/ide, and use a similar
+naming scheme to the SCSI subsystem.
+
+XT Hard Discs
+
+All XT discs are placed under /dev/xd. The first XT disc has
+the directory /dev/xd/disc0.
+
+TTY devices
+
+The tty devices now appear as:
+
+  New name               Old name              Device type
+  --------               --------              -----------
+  /dev/tts/{0,1,...}     /dev/ttyS{0,1,...}    Serial ports
+  /dev/cua/{0,1,...}     /dev/cua{0,1,...}     Call out devices
+  /dev/vc/0              /dev/tty              Current virtual console
+  /dev/vc/{1,2,...}      /dev/tty{1...63}      Virtual consoles
+  /dev/vcc/{0,1,...}     /dev/vcs{1...63}      Virtual consoles
+  /dev/pty/m{0,1,...}    /dev/ptyp??           PTY masters
+  /dev/pty/s{0,1,...}    /dev/ttyp??           PTY slaves
+
+
+RAMDISCS
+
+The RAMDISCS are placed in their own directory, and are named thus:
+
+ /dev/rd/{0,1,2,...}
+
+
+Meta Devices
+
+The meta devices are placed in their own directory, and are named
+thus:
+
+ /dev/md/{0,1,2,...}
+
+
+Floppy discs
+
+Floppy discs are placed in the /dev/floppy directory.
+
+Loop devices
+
+Loop devices are placed in the /dev/loop directory.
+
+Sound devices
+
+Sound devices are placed in the /dev/sound directory
+(audio, sequencer, ...).
+
+
+Devfsd Naming Scheme
+
+Devfsd provides a naming scheme which is a convenient abbreviation of
+the kernel-supplied namespace. In some
+cases, the kernel-supplied naming scheme is quite convenient, so
+devfsd does not provide another naming scheme. The convenience names
+that devfsd creates are in fact the same names as the original devfs
+kernel patch created (before Linus mandated the Big Name
+Change). These are referred to as "new compatibility entries".
+
+In order to configure devfsd to create these convenience names, the
+following lines should be placed in your /etc/devfsd.conf:
+
+REGISTER .* MKNEWCOMPAT
+UNREGISTER .* RMNEWCOMPAT
+
+This will cause devfsd to create (and destroy) symbolic links which
+point to the kernel-supplied names.
+
+SCSI Hard Discs
+
+All SCSI discs are placed under /dev/sd (assuming devfs is
+mounted on /dev). Hence, a SCSI disc with the following
+parameters: c=1,b=2,t=3,u=4 would appear as:
+
+ /dev/sd/c1b2t3u4 for the whole disc
+ /dev/sd/c1b2t3u4p5 for the 5th partition
+ /dev/sd/c1b2t3u4p5s6 for the 6th slice in the 5th partition
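+
+To make the correspondence concrete, the sketch below prints the
+kernel-supplied directory and the devfsd convenience names for the
+same example parameters. The variable names are arbitrary.
+
+```sh
+c=1 b=2 t=3 u=4   # controller, bus, target, unit
+echo "/dev/scsi/host$c/bus$b/target$t/lun$u"   # kernel-supplied directory
+echo "/dev/sd/c${c}b${b}t${t}u${u}"            # devfsd name, whole disc
+echo "/dev/sd/c${c}b${b}t${t}u${u}p5"          # devfsd name, 5th partition
+```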
+
+
+SCSI Tapes
+
+All SCSI tapes are placed under /dev/st. A similar naming
+scheme is used as for SCSI discs. A SCSI tape with the
+parameters: c=1,b=2,t=3,u=4 would appear as:
+
+ /dev/st/c1b2t3u4m0 for mode 0
+ /dev/st/c1b2t3u4m1 for mode 1
+ /dev/st/c1b2t3u4m2 for mode 2
+ /dev/st/c1b2t3u4m3 for mode 3
+ /dev/st/c1b2t3u4m0n for mode 0, no rewind
+ /dev/st/c1b2t3u4m1n for mode 1, no rewind
+ /dev/st/c1b2t3u4m2n for mode 2, no rewind
+ /dev/st/c1b2t3u4m3n for mode 3, no rewind
+
+
+SCSI CD-ROMs
+
+All SCSI CD-ROMs are placed under /dev/sr. A similar naming
+scheme is used as for SCSI discs. A SCSI CD-ROM with the
+parameters: c=1,b=2,t=3,u=4 would appear as:
+
+ /dev/sr/c1b2t3u4
+
+
+SCSI Generic Devices
+
+The generic (aka. raw) interfaces for all SCSI devices are placed under
+/dev/sg. A similar naming scheme is used as for SCSI discs. A
+SCSI generic device with the parameters: c=1,b=2,t=3,u=4 would appear
+as:
+
+ /dev/sg/c1b2t3u4
+
+
+IDE Hard Discs
+
+All IDE discs are placed under /dev/ide/hd, using a similar
+convention to SCSI discs. The following mappings exist between the new
+and the old names:
+
+  Old name    New name
+  --------    --------
+  /dev/hda    /dev/ide/hd/c0b0t0u0
+  /dev/hdb    /dev/ide/hd/c0b0t1u0
+  /dev/hdc    /dev/ide/hd/c0b1t0u0
+  /dev/hdd    /dev/ide/hd/c0b1t1u0
+
+
+IDE Tapes
+
+A similar naming scheme is used as for IDE discs. The entries will
+appear in the /dev/ide/mt directory.
+
+IDE CD-ROM
+
+A similar naming scheme is used as for IDE discs. The entries will
+appear in the /dev/ide/cd directory.
+
+IDE Floppies
+
+A similar naming scheme is used as for IDE discs. The entries will
+appear in the /dev/ide/fd directory.
+
+XT Hard Discs
+
+All XT discs are placed under /dev/xd. The first XT disc
+would appear as /dev/xd/c0t0.
+
+
+Old Compatibility Names
+
+The old compatibility names are the legacy device names, such as
+/dev/hda, /dev/sda, /dev/rtc and so on.
+Devfsd can be configured to create compatibility symlinks so that you
+may continue to use the old names in your configuration files and so
+that old applications will continue to function correctly.
+
+In order to configure devfsd to create these legacy names, the
+following lines should be placed in your /etc/devfsd.conf:
+
+REGISTER .* MKOLDCOMPAT
+UNREGISTER .* RMOLDCOMPAT
+
+This will cause devfsd to create (and destroy) symbolic links which
+point to the kernel-supplied names.
+
+
+-----------------------------------------------------------------------------
+
+
+Device drivers currently ported
+
+- All miscellaneous character devices support devfs (this is done
+ transparently through misc_register())
+
+- SCSI discs and generic hard discs
+
+- Character memory devices (null, zero, full and so on)
+ Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
+
+- Loop devices (/dev/loop?)
+
+- TTY devices (console, serial ports, terminals and pseudo-terminals)
+ Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
+
+- SCSI tapes (/dev/scsi and /dev/tapes)
+
+- SCSI CD-ROMs (/dev/scsi and /dev/cdroms)
+
+- SCSI generic devices (/dev/scsi)
+
+- RAMDISCS (/dev/ram?)
+
+- Meta Devices (/dev/md*)
+
+- Floppy discs (/dev/floppy)
+
+- Parallel port printers (/dev/printers)
+
+- Sound devices (/dev/sound)
+ Thanks to Eric Dumas <dumas@linux.eu.org> and
+ C. Scott Ananian <cananian@alumni.princeton.edu>
+
+- Joysticks (/dev/joysticks)
+
+- Sparc keyboard (/dev/kbd)
+
+- DSP56001 digital signal processor (/dev/dsp56k)
+
+- Apple Desktop Bus (/dev/adb)
+
+- Coda network file system (/dev/cfs*)
+
+- Virtual console capture devices (/dev/vcc)
+ Thanks to Dennis Hou <smilax@mindmeld.yi.org>
+
+- Frame buffer devices (/dev/fb)
+
+- Video capture devices (/dev/v4l)
+
+
+-----------------------------------------------------------------------------
+
+
+Allocation of Device Numbers
+
+Devfs allows you to write a driver which doesn't need to allocate a
+device number (major&minor numbers) for the internal operation of the
+kernel. However, there are a number of userspace programmes that use
+the device number as a unique handle for a device. An example is the
+find programme, which uses device numbers to determine whether
+an inode is on a different filesystem than another inode. The device
+number used is the one for the block device which a filesystem is
+using. To preserve compatibility with userspace programmes, block
+devices using devfs need to have unique device numbers allocated to
+them. Furthermore, POSIX specifies device numbers, so some kind of
+device number needs to be presented to userspace.
+
+The simplest option (especially when porting drivers to devfs) is to
+keep using the old major and minor numbers. Devfs will take whatever
+values are given for major&minor and pass them onto userspace.
+
+This device number is a 16 bit number, so this leaves plenty of space
+for large numbers of discs and partitions. This scheme can also be
+used for character devices, in particular the tty devices, which are
+currently limited to 256 pseudo-ttys (this limits the total number of
+simultaneous xterms and remote logins). Note that the device number
+is limited to the range 36864-61439 (majors 144-239), in order to
+avoid any possible conflicts with existing official allocations.
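+
+As a worked example of the arithmetic above, here is a minimal sketch
+which packs and unpacks a 16 bit device number, assuming the
+traditional layout of an 8 bit major followed by an 8 bit minor. The
+helper names are purely illustrative and are not part of any kernel or
+devfs interface.
+
```python
# Sketch of the traditional 16 bit device number layout (8 bit major,
# 8 bit minor). The function names are hypothetical helpers.

MINOR_BITS = 8
MINOR_MASK = (1 << MINOR_BITS) - 1


def make_dev(major, minor):
    """Pack a major and a minor number into a 16 bit device number."""
    if not (0 <= major <= 255 and 0 <= minor <= 255):
        raise ValueError("major and minor must each fit in 8 bits")
    return (major << MINOR_BITS) | minor


def split_dev(dev):
    """Unpack a 16 bit device number into a (major, minor) tuple."""
    return dev >> MINOR_BITS, dev & MINOR_MASK


# The range quoted in the text: majors 144-239 with every minor.
lowest = make_dev(144, 0)
highest = make_dev(239, 255)
```
+
+With this layout, lowest and highest come out as 36864 and 61439,
+which matches the range given above.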
+
+Please note that using dynamically allocated block device numbers may
+break the NFS daemons (both user and kernel mode), which expect dev_t
+for a given device to be constant over the lifetime of remote mounts.
+
+A final note on this scheme: since it doesn't increase the size of
+device numbers, there are no compatibility issues with userspace.
+
+-----------------------------------------------------------------------------
+
+
+Questions and Answers
+
+
+Making things work
+Alternatives to devfs
+What I don't like about devfs
+How to report bugs
+Strange kernel messages
+Compilation problems with devfsd
+
+
+
+Making things work
+
+Here are some common questions and answers.
+
+
+
+Devfsd doesn't start
+
+- Make sure you have compiled and installed devfsd
+- Make sure devfsd is being started from your boot
+  scripts
+- Make sure you have configured your kernel to enable devfs (see
+  below)
+- Make sure devfs is mounted (see below)
+
+
+Devfsd is not managing all my permissions
+
+Make sure you are capturing the appropriate events. For example,
+device entries created by the kernel generate REGISTER events,
+but those created by devfsd generate CREATE events.
+
+
+Devfsd is not capturing all REGISTER events
+
+See the previous entry: you may need to capture CREATE events.
+
+
+X will not start
+
+Make sure you followed the steps
+outlined above.
+
+
+Why don't my network devices appear in devfs?
+
+This is not a bug. Network devices have their own, completely separate
+namespace. They are accessed via socket(2) and
+setsockopt(2) calls, and thus require no device nodes. I have
+raised the possibility of moving network devices into the device
+namespace, but have had no response.
+
+
+How can I test if I have devfs compiled into my kernel?
+
+All filesystems built-in or currently loaded are listed in
+/proc/filesystems. If you see a devfs entry, then
+you know that devfs was compiled into your kernel. If you have
+correctly configured and rebuilt your kernel, then devfs will be
+built-in. If you think you've configured it in, but
+/proc/filesystems doesn't show it, you've made a mistake.
+Common mistakes include:
+
+- Using a 2.2.x kernel without applying the devfs patch (if you
+  don't know how to patch your kernel, use 2.4.x instead, don't bother
+  asking me how to patch)
+- Forgetting to set CONFIG_EXPERIMENTAL=y
+- Forgetting to set CONFIG_DEVFS_FS=y
+- Forgetting to set CONFIG_DEVFS_MOUNT=y (if you want devfs
+  to be automatically mounted at boot)
+- Editing your .config manually, instead of using make
+  config or make xconfig
+- Forgetting to run make dep; make clean after changing the
+  configuration and before compiling
+- Forgetting to compile your kernel and modules
+- Forgetting to install your kernel
+- Forgetting to install your modules
+
+Please check twice that you've done all these steps before sending in
+a bug report.
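+
+The /proc/filesystems check described above is easy to script. The
+following is only a sketch: it assumes each line of the file holds an
+optional "nodev" flag followed by the filesystem name, and the
+function names are made up for this example.
+
```python
def filesystem_supported(name, listing):
    """Return True if `name` appears in /proc/filesystems-style text.

    Each line is assumed to hold an optional "nodev" flag followed by
    the filesystem name, separated by whitespace.
    """
    for line in listing.splitlines():
        fields = line.split()
        if fields and fields[-1] == name:
            return True
    return False


def devfs_in_kernel(path="/proc/filesystems"):
    """Check the running kernel; False if the file cannot be read."""
    try:
        with open(path) as f:
            return filesystem_supported("devfs", f.read())
    except OSError:
        return False
```
+
+A False result only tells you that devfs is not listed. It does not
+tell you which of the mistakes above is responsible.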
+
+
+
+How can I test if devfs is mounted on /dev?
+
+The device filesystem will always create an entry called
+".devfsd", which is used to communicate with the daemon. Even
+if the daemon is not running, this entry will exist. Testing for the
+existence of this entry is the approved method of determining if devfs
+is mounted or not. Note that the type of entry (i.e. regular file,
+character device, named pipe, etc.) may change without notice. Only
+the existence of the entry should be relied upon.
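+
+A minimal sketch of this test follows. It checks only that the entry
+exists and deliberately ignores its type. The function name and the
+dev_dir parameter are assumptions made for the example.
+
```python
import os


def devfs_is_mounted(dev_dir="/dev"):
    """Return True if a ".devfsd" entry exists in dev_dir.

    Only existence is tested, never the entry type, since the type
    may change without notice. lexists() is used so that the entry
    is counted even if it happens to be a dangling symbolic link.
    """
    return os.path.lexists(os.path.join(dev_dir, ".devfsd"))
```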
+
+
+When I start devfsd, I see the error:
+Error opening file: ".devfsd" No such file or directory?
+
+This means that devfs is not mounted. Make sure you have devfs mounted.
+
+
+How do I mount devfs?
+
+First make sure you have devfs compiled into your kernel (see
+above). Then you will either need to:
+
+- set CONFIG_DEVFS_MOUNT=y in your kernel config
+- pass devfs=mount to your boot loader
+- mount devfs manually in your boot scripts with:
+  mount -t devfs none /dev
+
+
+
+Mount by volume LABEL=<label> doesn't work with
+devfs
+
+Most probably you are not mounting devfs onto /dev. What
+happens is that if your kernel config has CONFIG_DEVFS_FS=y
+then the contents of /proc/partitions will have the devfs
+names (such as scsi/host0/bus0/target0/lun0/part1). The
+contents of /proc/partitions are used by mount(8) when
+mounting by volume label. If devfs is not mounted on /dev,
+then mount(8) will fail to find devices. The solution is to
+make sure that devfs is mounted on /dev. See above for how to
+do that.
+
+
+I have extra or incorrect entries in /dev
+
+You may have stale entries in your dev-state area. Check for a
+RESTORE configuration line in your devfsd configuration
+(typically /etc/devfsd.conf). If you have this line, check
+the contents of the specified directory for stale entries. Remove
+any entries which are incorrect, then reboot.
+
+
+I get "Unable to open initial console" messages at boot
+
+This usually happens when you don't have devfs automounted onto
+/dev at boot time, and there is no valid
+/dev/console entry on your root file-system. Create a valid
+/dev/console device node.
+
+
+
+
+
+Alternatives to devfs
+
+I've attempted to collate all the anti-devfs proposals and explain
+their limitations. Under construction.
+
+
+Why not just pass device create/remove events to a daemon?
+
+Here the suggestion is to develop an API in the kernel so that devices
+can register create and remove events, and a daemon listens for those
+events. The daemon would then populate/depopulate /dev (which
+resides on disc).
+
+This has several limitations:
+
+
+it only works for modules loaded and unloaded (or devices inserted
+and removed) after the kernel has finished booting. Without a database
+of events, there is no way the daemon could fully populate
+/dev
+
+
+if you add a database to this scheme, the question is then how to
+present that database to user-space. If you make it a list of strings
+with embedded event codes which are passed through a pipe to the
+daemon, then this is only of use to the daemon. I would argue that the
+natural way to present this data is via a filesystem (since many of
+the events will be of a hierarchical nature), such as devfs.
+Presenting the data as a filesystem makes it easy for the user to see
+what is available and also makes it easy to write scripts to scan the
+"database"
+
+
+the tight binding between device nodes and drivers is no longer
+possible (requiring the otherwise perfectly avoidable
+table lookups)
+
+
+you cannot catch inode lookup events on /dev which means
+that module autoloading requires device nodes to be created. This is a
+problem, particularly for drivers where only a few inodes are created
+from a potentially large set
+
+
+this technique can't be used when the root FS is mounted
+read-only
+
+
+
+
+Just implement a better scsidev
+
+This suggestion involves taking the scsidev programme and
+extending it to scan for all devices, not just SCSI devices. The
+scsidev programme works by scanning /proc/scsi
+
+Problems:
+
+
+the kernel does not currently provide a list of all devices
+available. Not all drivers register entries in /proc or
+generate kernel messages
+
+
+there is no uniform mechanism to register devices other than the
+devfs API
+
+
+implementing such an API is then the same as the
+proposal above
+
+
+
+
+Put /dev on a ramdisc
+
+This suggestion involves creating a ramdisc and populating it with
+device nodes and then mounting it over /dev.
+
+Problems:
+
+
+
+this doesn't help when mounting the root filesystem, since you
+still need a device node to do that
+
+
+if you want to use this technique for the root device node as
+well, you need to use initrd. This complicates the booting sequence
+and makes it significantly harder to administer and configure. The
+initrd is essentially opaque, robbing the system administrator of easy
+configuration
+
+
+insufficient information is available to correctly populate the
+ramdisc. So we come back to the
+proposal above to "solve" this
+
+
+a ramdisc-based solution would take more kernel memory, since the
+backing store would be (at best) normal VFS inodes and dentries, which
+take 284 bytes and 112 bytes, respectively, for each entry. Compare
+that to 72 bytes for devfs
+
+
+
+
+Do nothing: there's no problem
+
+Sometimes people can be heard to claim that the existing scheme is
+fine. This is what they're ignoring:
+
+
+device number size (8 bits each for major and minor) is a real
+limitation, and must be fixed somehow. Systems with large numbers of
+SCSI devices, for example, will continue to consume the remaining
+unallocated major numbers. USB will also need to push beyond the 8 bit
+minor limitation
+
+
+simply increasing the device number size is insufficient. Apart
+from causing a lot of pain, it doesn't solve the management issues
+of a /dev with thousands or more device nodes
+
+
+ignoring the problem of a huge /dev will not make it go
+away, and dismisses the legitimacy of a large number of people who
+want a dynamic /dev
+
+
+the standard response then becomes: "write a device management
+daemon", which brings us back to the
+proposal above
+
+
+
+
+What I don't like about devfs
+
+Here are some common complaints about devfs, and some suggestions and
+solutions that may make it more palatable for you. I can't please
+everybody, but I do try :-)
+
+I hate the naming scheme
+
+First, remember that no naming scheme will please everybody. You hate
+the scheme, others love it. Who's to say who's right and who's wrong?
+Ultimately, the person who writes the code gets to choose, and what
+exists now is a combination of the choices made by the
+devfs author and the
+kernel maintainer (Linus).
+
+However, not all is lost. If you want to create your own naming
+scheme, it is a simple matter to write a standalone script, hack
+devfsd, or write a script called by devfsd. You can create whatever
+naming scheme you like.
+
+Further, if you want to remove all traces of the devfs naming scheme
+from /dev, you can mount devfs elsewhere (say
+/devfs) and populate /dev with links into
+/devfs. This population can be automated using devfsd if you
+wish.
+
+You can even use the VFS binding facility to make the links, rather
+than using symbolic links. This way, you don't even have to see the
+"destination" of these symbolic links.
+
+Devfs puts policy into the kernel
+
+There's already policy in the kernel. Device numbers are in fact
+policy (why should the kernel dictate what device numbers I use?).
+Face it, some policy has to be in the kernel. The real difference
+between device names as policy and device numbers as policy is that
+no one will use device numbers directly, because device
+numbers are devoid of meaning to humans and are ugly. At least with
+the devfs device names, (even though you can add your own naming
+scheme) some people will use the devfs-supplied names directly. This
+offends some people :-)
+
+Devfs is bloatware
+
+This is not even remotely true. As shown above,
+both code and data size are quite modest.
+
+
+How to report bugs
+
+If you have (or think you have) a bug with devfs, please follow the
+steps below:
+
+
+
+make sure you have enabled debugging output when configuring your
+kernel. You will need to set (at least) the following config options:
+
+CONFIG_DEVFS_DEBUG=y
+CONFIG_DEBUG_KERNEL=y
+CONFIG_DEBUG_SLAB=y
+
+
+
+please make sure you have the latest devfs patches applied. The
+latest kernel version might not have the latest devfs patches applied
+yet (Linus is very busy)
+
+
+save a copy of your complete kernel logs (preferably by
+using the dmesg programme) for later inclusion in your bug
+report. You may need to use the -s switch to increase the
+internal buffer size so you can capture all the boot messages.
+Don't edit or trim the dmesg output
+
+
+
+
+try booting with devfs=dall passed to the kernel boot
+command line (read the documentation on your bootloader on how to do
+this), and save the result to a file. This may be quite verbose, and
+it may overflow the messages buffer, but try to get as much of it as
+you can
+
+
+if you get an Oops, run ksymoops to decode it so that the
+names of the offending functions are provided. A non-decoded Oops is
+pretty useless
+
+
+send a copy of your devfsd configuration file(s)
+
+send the bug report to me first.
+Don't expect that I will see it if you post it to the linux-kernel
+mailing list. Include all the information listed above, plus
+anything else that you think might be relevant. Put the string
+devfs somewhere in the subject line, so my mail filters mark
+it as urgent
+
+
+
+
+Here is a general guide on how to ask questions in a way that greatly
+improves your chances of getting a reply:
+
+http://www.tuxedo.org/~esr/faqs/smart-questions.html. If you have
+a bug to report, you should also read
+
+http://www.chiark.greenend.org.uk/~sgtatham/bugs.html.
+
+
+Strange kernel messages
+
+You may see devfs-related messages in your kernel logs. Below are some
+messages and what they mean (and what you should do about them, if
+anything).
+
+
+
+devfs_register(fred): could not append to parent, err: -17
+
+You need to check what the error code means, but usually 17 means
+EEXIST. This means that a driver attempted to create an entry
+fred in a directory, but there already was an entry with that
+name. This is often caused by flawed boot scripts which untar a bunch
+of inodes into /dev, as a way to restore permissions. This
+message is harmless, as the device nodes will still
+provide access to the driver (unless you use the devfs=only
+boot option, which is only for dedicated souls:-). If you want to get
+rid of these annoying messages, upgrade to devfsd-v1.3.20 and use the
+recommended RESTORE directive to restore permissions.
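+
+If you are unsure what an error code means, you can look it up rather
+than guess. The sketch below uses Python's standard errno table, and
+assumes it is run on the same kind of system that produced the
+message, since error numbers differ between platforms. The function
+name is hypothetical.
+
```python
import errno
import os


def describe_kernel_error(err):
    """Map a negative kernel-style error code to (name, message).

    The lookup uses the errno table of the platform running this
    script, so run it on the system that logged the message.
    """
    code = abs(err)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)
```
+
+On Linux, describe_kernel_error(-17) gives the name EEXIST.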
+
+
+devfs_mk_dir(bill): using old entry in dir: c1808724 ""
+
+This is similar to the message above, except that a driver attempted
+to create a directory named bill, and the parent directory
+has an entry with the same name. In this case, to ensure that drivers
+continue to work properly, the old entry is re-used and given to the
+driver. In 2.5 kernels, the driver is given a NULL entry, and thus,
+under rare circumstances, may not create the required device nodes.
+The solution is the same as above.
+
+
+
+
+
+Compilation problems with devfsd
+
+Usually, you can compile devfsd just by typing in
+make in the source directory, followed by a make
+install (as root). Sometimes, you may have problems, particularly
+on broken configurations.
+
+
+
+error messages relating to DEVFSD_NOTIFY_DELETE
+
+This happened because you have an ancient set of kernel headers
+installed in /usr/include/linux or /usr/src/linux.
+Install kernel 2.4.10 or later. You may need to pass the
+KERNEL_DIR variable to make (if you did not install
+the new kernel sources as /usr/src/linux), or you may copy
+the devfs_fs.h file in the kernel source tree into
+/usr/include/linux.
+
+
+
+
+-----------------------------------------------------------------------------
+
+
+Other resources
+
+
+
+Douglas Gilbert has written a useful document at
+
+http://www.torque.net/sg/devfs_scsi.html which
+explores the SCSI subsystem and how it interacts with devfs
+
+
+Douglas Gilbert has written another useful document at
+
+http://www.torque.net/scsi/SCSI-2.4-HOWTO/ which
+discusses the Linux SCSI subsystem in 2.4.
+
+
+Johannes Erdfelt has started a discussion paper on Linux and
+hot-swap devices, describing what the requirements are for a scalable
+solution and how and why he's used devfs+devfsd. Note that this is an
+early draft only, available in plain text form at:
+
+http://johannes.erdfelt.com/hotswap.txt.
+Johannes has promised an HTML version will follow.
+
+
+I presented an invited
+paper
+at the
+
+2nd Annual Storage Management Workshop held in Miami, Florida,
+U.S.A. in October 2000.
+
+
+
+
+-----------------------------------------------------------------------------
+
+
+Translations of this document
+
+This document has been translated into other languages.
+
+
+
+
+The document master (in English) by rgooch@atnf.csiro.au is
+available at
+
+http://www.atnf.csiro.au/~rgooch/linux/docs/devfs.html
+
+
+
+A Korean translation by viatoris@nownuri.net is available at
+
+http://your.destiny.pe.kr/devfs/devfs.html
+
+
+
+
+-----------------------------------------------------------------------------
+Most flags courtesy of ITA's
+Flags of All Countries
+used with permission.
diff --git a/Documentation/filesystems/devfs/ToDo b/Documentation/filesystems/devfs/ToDo
new file mode 100644
index 0000000..afd5a8f
--- /dev/null
+++ b/Documentation/filesystems/devfs/ToDo
@@ -0,0 +1,40 @@
+ Device File System (devfs) ToDo List
+
+ Richard Gooch <rgooch@atnf.csiro.au>
+
+ 3-JUL-2000
+
+This is a list of things to be done for better devfs support in the
+Linux kernel. If you'd like to contribute to the devfs, please have a
+look at this list for anything that is unallocated. Also, if there are
+items missing (surely), please contact me so I can add them to the
+list (preferably with your name attached to them:-).
+
+
+- >256 ptys
+ Thanks to C. Scott Ananian <cananian@alumni.princeton.edu>
+
+- Amiga floppy driver (drivers/block/amiflop.c)
+
+- Atari floppy driver (drivers/block/ataflop.c)
+
+- SWIM3 (Super Woz Integrated Machine 3) floppy driver (drivers/block/swim3.c)
+
+- Amiga ZorroII ramdisc driver (drivers/block/z2ram.c)
+
+- Parallel port ATAPI CD-ROM (drivers/block/paride/pcd.c)
+
+- Parallel port ATAPI floppy (drivers/block/paride/pf.c)
+
+- AP1000 block driver (drivers/ap1000/ap.c, drivers/ap1000/ddv.c)
+
+- Archimedes floppy (drivers/acorn/block/fd1772.c)
+
+- MFM hard drive (drivers/acorn/block/mfmhd.c)
+
+- I2O block device (drivers/message/i2o/i2o_block.c)
+
+- ST-RAM device (arch/m68k/atari/stram.c)
+
+- Raw devices
+
diff --git a/Documentation/filesystems/devfs/boot-options b/Documentation/filesystems/devfs/boot-options
new file mode 100644
index 0000000..df3d33b
--- /dev/null
+++ b/Documentation/filesystems/devfs/boot-options
@@ -0,0 +1,65 @@
+/* -*- auto-fill -*- */
+
+ Device File System (devfs) Boot Options
+
+ Richard Gooch <rgooch@atnf.csiro.au>
+
+ 18-AUG-2001
+
+
+When CONFIG_DEVFS_DEBUG is enabled, you can pass several boot options
+to the kernel to debug devfs. The boot options are prefixed by
+"devfs=", and are separated by commas. Spaces are not allowed. The
+syntax looks like this:
+
+devfs=<option1>,<option2>,<option3>
+
+and so on. For example, if you wanted to turn on debugging for module
+load requests and device registration, you would do:
+
+devfs=dmod,dreg
+
+You may prefix "no" to any option. This will invert the option.
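+
+To illustrate the syntax described above, here is a rough sketch of
+how such an option string could be split up. It is not the kernel's
+own parser: it does not validate option names, and the function name
+is invented for this example.
+
```python
def parse_devfs_options(arg):
    """Parse "devfs=opt1,opt2,..." into a dict of option -> bool.

    A leading "no" inverts an option, so "nomount" yields
    {"mount": False}. Option names are not checked for validity.
    """
    prefix = "devfs="
    if not arg.startswith(prefix):
        raise ValueError("expected an argument starting with 'devfs='")
    body = arg[len(prefix):]
    if " " in body:
        raise ValueError("spaces are not allowed")
    options = {}
    for item in body.split(","):
        if not item:
            continue
        if item.startswith("no") and len(item) > 2:
            options[item[2:]] = False
        else:
            options[item] = True
    return options
```
+
+For instance, "devfs=dmod,dreg" maps both dmod and dreg to True, while
+"devfs=mount,nodreg" maps mount to True and dreg to False.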
+
+
+Debugging Options
+=================
+
+These require CONFIG_DEVFS_DEBUG to be enabled.
+Note that all debugging options have 'd' as the first character. By
+default all options are off. All debugging output is sent to the
+kernel logs. The debugging options do not take effect until the devfs
+version message appears (just prior to the root filesystem being
+mounted).
+
+These are the options:
+
+dmod print module load requests to <request_module>
+
+dreg print device register requests to <devfs_register>
+
+dunreg print device unregister requests to <devfs_unregister>
+
+dchange print device change requests to <devfs_set_flags>
+
+dilookup print inode lookup requests
+
+diget print VFS inode allocations
+
+diunlink print inode unlinks
+
+dichange print inode changes
+
+dimknod print calls to mknod(2)
+
+dall some debugging turned on
+
+
+Other Options
+=============
+
+These control the default behaviour of devfs. The options are:
+
+mount mount devfs onto /dev at boot time
+
+only disable non-devfs device nodes for devfs-capable drivers
diff --git a/Documentation/filesystems/directory-locking b/Documentation/filesystems/directory-locking
new file mode 100644
index 0000000..34380d4
--- /dev/null
+++ b/Documentation/filesystems/directory-locking
@@ -0,0 +1,113 @@
+ Locking scheme used for directory operations is based on two
+kinds of locks - per-inode (->i_sem) and per-filesystem (->s_vfs_rename_sem).
+
+ For our purposes all operations fall in 6 classes:
+
+1) read access. Locking rules: caller locks directory we are accessing.
+
+2) object creation. Locking rules: same as above.
+
+3) object removal. Locking rules: caller locks parent, finds victim,
+locks victim and calls the method.
+
+4) rename() that is _not_ cross-directory. Locking rules: caller locks
+the parent, finds source and target, if target already exists - locks it
+and then calls the method.
+
+5) link creation. Locking rules:
+ * lock parent
+ * check that source is not a directory
+ * lock source
+ * call the method.
+
+6) cross-directory rename. The trickiest in the whole bunch. Locking
+rules:
+ * lock the filesystem
+ * lock parents in "ancestors first" order.
+ * find source and target.
+ * if old parent is equal to or is a descendent of target
+ fail with -ENOTEMPTY
+ * if new parent is equal to or is a descendent of source
+ fail with -ELOOP
+ * if target exists - lock it.
+ * call the method.
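+
+The two ancestry checks in the cross-directory rename rules can be
+sketched as follows. This is a toy model only: it assumes the
+directory graph is a tree described by a dictionary of parent
+pointers, it takes no locks at all, and every name in it is invented
+for the illustration rather than taken from the kernel.
+
```python
import errno


def is_same_or_descendant(node, ancestor, parent):
    """True if node is ancestor itself or lies below it in the tree.

    `parent` maps each object to its parent directory; the root maps
    to None.
    """
    while node is not None:
        if node == ancestor:
            return True
        node = parent.get(node)
    return False


def check_cross_dir_rename(old_parent, source, new_parent, target, parent):
    """Return 0 if the rename may proceed, else a negative errno.

    `target` is None when nothing exists under the new name yet.
    """
    if target is not None and is_same_or_descendant(old_parent, target, parent):
        return -errno.ENOTEMPTY
    if is_same_or_descendant(new_parent, source, parent):
        return -errno.ELOOP
    return 0
```
+
+In this model, moving a directory into one of its own subdirectories
+is refused with -ELOOP, which is the check that prevents loops from
+being created.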
+
+
+The rules above obviously guarantee that all directories that are going to be
+read, modified or removed by method will be locked by caller.
+
+
+If no directory is its own ancestor, the scheme above is deadlock-free.
+Proof:
+
+ First of all, at any moment we have a partial ordering of the
+objects - A < B iff A is an ancestor of B.
+
+ That ordering can change. However, the following is true:
+
+(1) if object removal or non-cross-directory rename holds lock on A and
+ attempts to acquire lock on B, A will remain the parent of B until we
+ acquire the lock on B. (Proof: only cross-directory rename can change
+ the parent of object and it would have to lock the parent).
+
+(2) if cross-directory rename holds the lock on filesystem, order will not
+ change until rename acquires all locks. (Proof: other cross-directory
+ renames will be blocked on filesystem lock and we don't start changing
+ the order until we had acquired all locks).
+
+(3) any operation holds at most one lock on non-directory object and
+ that lock is acquired after all other locks. (Proof: see descriptions
+ of operations).
+
+ Now consider the minimal deadlock. Each process is blocked on
+attempt to acquire some lock and already holds at least one lock. Let's
+consider the set of contended locks. First of all, filesystem lock is
+not contended, since any process blocked on it is not holding any locks.
+Thus all processes are blocked on ->i_sem.
+
+ Non-directory objects are not contended due to (3). Thus link
+creation can't be a part of deadlock - it can't be blocked on source
+and it means that it doesn't hold any locks.
+
+ Any contended object is either held by cross-directory rename or
+has a child that is also contended. Indeed, suppose that it is held by
+operation other than cross-directory rename. Then the lock this operation
+is blocked on belongs to child of that object due to (1).
+
+ It means that one of the operations is cross-directory rename.
+Otherwise the set of contended objects would be infinite - each of them
+would have a contended child and we had assumed that no object is its
+own descendent. Moreover, there is exactly one cross-directory rename
+(see above).
+
+ Consider the object blocking the cross-directory rename. One
+of its descendents is locked by cross-directory rename (otherwise we
+would again have an infinite set of contended objects). But that
+means that cross-directory rename is taking locks out of order. Due
+to (2) the order hadn't changed since we had acquired filesystem lock.
+But locking rules for cross-directory rename guarantee that we do not
+try to acquire lock on descendent before the lock on ancestor.
+Contradiction. I.e. deadlock is impossible. Q.E.D.
+
+
+ These operations are guaranteed to avoid loop creation. Indeed,
+the only operation that could introduce loops is cross-directory rename.
+Since the only new (parent, child) pair added by rename() is (new parent,
+source), such loop would have to contain these objects and the rest of it
+would have to exist before rename(). I.e. at the moment of loop creation
+rename() responsible for that would be holding filesystem lock and new parent
+would have to be equal to or a descendent of source. But that means that
+new parent had been equal to or a descendent of source since the moment when
+we had acquired filesystem lock and rename() would fail with -ELOOP in that
+case.
+
+ While this locking scheme works for arbitrary DAGs, it relies on
+ability to check that directory is a descendent of another object. Current
+implementation assumes that directory graph is a tree. This assumption is
+also preserved by all operations (cross-directory rename on a tree that would
+not introduce a cycle will leave it a tree and link() fails for directories).
+
+ Notice that "directory" in the above == "anything that might have
+children", so if we are going to introduce hybrid objects we will need
+either to make sure that link(2) doesn't work for them or to make changes
+in is_subdir() that would make it work even in presence of such beasts.
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
new file mode 100644
index 0000000..b5cb911
--- /dev/null
+++ b/Documentation/filesystems/ext2.txt
@@ -0,0 +1,383 @@
+
+The Second Extended Filesystem
+==============================
+
+ext2 was originally released in January 1993. Written by Rémy Card,
+Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the
+Extended Filesystem. It is currently still (April 2001) the predominant
+filesystem in use by Linux. There are also implementations available
+for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS.
+
+Options
+=======
+
+Most defaults are determined by the filesystem superblock, and can be
+set using tune2fs(8). Kernel-determined defaults are indicated by (*).
+
+bsddf (*) Makes `df' act like BSD.
+minixdf Makes `df' act like Minix.
+
+check Check block and inode bitmaps at mount time
+ (requires CONFIG_EXT2_CHECK).
+check=none, nocheck (*) Don't do extra checking of bitmaps on mount
+ (check=normal and check=strict options removed)
+
+debug Extra debugging information is sent to the
+ kernel syslog. Useful for developers.
+
+errors=continue Keep going on a filesystem error.
+errors=remount-ro Remount the filesystem read-only on an error.
+errors=panic Panic and halt the machine if an error occurs.
+
+grpid, bsdgroups Give objects the same group ID as their parent.
+nogrpid, sysvgroups New objects have the group ID of their creator.
+
+nouid32 Use 16-bit UIDs and GIDs.
+
+oldalloc Enable the old block allocator. Orlov should
+ have better performance, we'd like to get some
+ feedback if it's the contrary for you.
+orlov (*) Use the Orlov block allocator.
+ (See http://lwn.net/Articles/14633/ and
+ http://lwn.net/Articles/14446/.)
+
+resuid=n The user ID which may use the reserved blocks.
+resgid=n The group ID which may use the reserved blocks.
+
+sb=n Use alternate superblock at this location.
+
+user_xattr Enable "user." POSIX Extended Attributes
+ (requires CONFIG_EXT2_FS_XATTR).
+ See also http://acl.bestbits.at
+nouser_xattr Don't support "user." extended attributes.
+
+acl Enable POSIX Access Control Lists support
+ (requires CONFIG_EXT2_FS_POSIX_ACL).
+ See also http://acl.bestbits.at
+noacl Don't support POSIX ACLs.
+
+nobh Do not attach buffer_heads to file pagecache.
+
+grpquota,noquota,quota,usrquota Quota options are silently ignored by ext2.
+
+
+Specification
+=============
+
+ext2 shares many properties with traditional Unix filesystems. It has
+the concepts of blocks, inodes and directories. It has space in the
+specification for Access Control Lists (ACLs), fragments, undeletion and
+compression though these are not yet implemented (some are available as
+separate patches). There is also a versioning mechanism to allow new
+features (such as journalling) to be added in a maximally compatible
+manner.
+
+Blocks
+------
+
+The space in the device or file is split up into blocks. These are
+a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems),
+which is decided when the filesystem is created. Smaller blocks mean
+less wasted space per file, but require slightly more accounting overhead,
+and also impose other limits on the size of files and the filesystem.
+
+Block Groups
+------------
+
+Blocks are clustered into block groups in order to reduce fragmentation
+and minimise the amount of head seeking when reading a large amount
+of consecutive data. Information about each block group is kept in a
+descriptor table stored in the block(s) immediately after the superblock.
+Two blocks near the start of each group are reserved for the block usage
+bitmap and the inode usage bitmap which show which blocks and inodes
+are in use. Since each bitmap is limited to a single block, a block group
+can contain at most 8 times as many blocks as there are bytes in a block
+(for example, 8192 blocks when the block size is 1024 bytes).
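+
+The limit follows directly from the bitmap using one bit per block.
+The short sketch below works through the arithmetic; the function
+names are illustrative only.
+
```python
def max_blocks_per_group(block_size):
    """Blocks addressable by a one-block bitmap: one bit per block."""
    return 8 * block_size


def max_group_bytes(block_size):
    """Largest possible block group, in bytes, for a given block size."""
    return max_blocks_per_group(block_size) * block_size
```
+
+With 1024 byte blocks this gives 8192 blocks (8MB) per group, and with
+4096 byte blocks it gives 32768 blocks (128MB) per group.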
+
+The block(s) following the bitmaps in each block group are designated
+as the inode table for that block group and the remainder are the data
+blocks. The block allocation algorithm attempts to allocate data blocks
+in the same block group as the inode which contains them.
+
+The Superblock
+--------------
+
+The superblock contains all the information about the configuration of
+the filing system. The primary copy of the superblock is stored at an
+offset of 1024 bytes from the start of the device, and it is essential
+to mounting the filesystem. Since it is so important, backup copies of
+the superblock are stored in block groups throughout the filesystem.
+The first version of ext2 (revision 0) stores a copy at the start of
+every block group, along with backups of the group descriptor block(s).
+Because this can consume a considerable amount of space for large
+filesystems, later revisions can optionally reduce the number of backup
+copies by only putting backups in specific groups (this is the sparse
+superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7.
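+
+The group selection rule can be sketched as below. This is only an
+illustration of the rule as stated here; the helper names are hypothetical
+and the kernel's own implementation may be organised differently.
+
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if n == base^k for some k >= 0 (so 1 counts as a power). */
static bool is_power_of(uint32_t n, uint32_t base)
{
	assert(base > 1);
	if (n == 0)
		return false;
	while (n % base == 0)
		n /= base;
	return n == 1;
}

/* Sparse superblock rule: groups 0, 1 and powers of 3, 5 and 7. */
static bool group_has_sb_backup(uint32_t group)
{
	if (group <= 1)
		return true;
	return is_power_of(group, 3) ||
	       is_power_of(group, 5) ||
	       is_power_of(group, 7);
}
```
+
+Under this rule groups 0, 1, 3, 5, 7, 9, 25, 27 and 49 carry backups,
+while groups such as 2, 4, 15 or 21 do not.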
+
+The information in the superblock contains fields such as the total
+number of inodes and blocks in the filesystem and how many are free,
+how many inodes and blocks are in each block group, when the filesystem
+was mounted (and if it was cleanly unmounted), when it was modified,
+what version of the filesystem it is (see the Revisions section below)
+and which OS created it.
+
+If the filesystem is revision 1 or higher, then there are extra fields,
+such as a volume name, a unique identification number, the inode size,
+and space for optional filesystem features to store configuration info.
+
+All fields in the superblock (as in all other ext2 structures) are stored
+on the disc in little endian format, so a filesystem is portable between
+machines without having to know what machine it was created on.
+
+Inodes
+------
+
+The inode (index node) is a fundamental concept in the ext2 filesystem.
+Each object in the filesystem is represented by an inode. The inode
+structure contains pointers to the filesystem blocks which contain the
+data held in the object and all of the metadata about an object except
+its name. The metadata about an object includes the permissions, owner,
+group, flags, size, number of blocks used, access time, change time,
+modification time, deletion time, number of links, fragments, version
+(for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs).
+
+There are some reserved fields which are currently unused in the inode
+structure and several which are overloaded. One field is reserved for the
+directory ACL if the inode is a directory and alternately for the top 32
+bits of the file size if the inode is a regular file (allowing file sizes
+larger than 2GB). The translator field is unused under Linux, but is used
+by the HURD to reference the inode of a program which will be used to
+interpret this object. Most of the remaining reserved fields have been
+used up for both Linux and the HURD for larger owner and group fields.
+The HURD also has a larger mode field so it uses another of the remaining
+fields to store the extra mode bits.
+
+There are pointers to the first 12 blocks which contain the file's data
+in the inode. There is a pointer to an indirect block (which contains
+pointers to the next set of blocks), a pointer to a doubly-indirect
+block (which contains pointers to indirect blocks) and a pointer to a
+trebly-indirect block (which contains pointers to doubly-indirect blocks).
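+
+To illustrate how far this pointer scheme reaches, the sketch below counts
+the data blocks that one inode can map. It assumes 4-byte (32-bit) block
+pointers, which this section does not state explicitly, and the names are
+illustrative only.
+
```c
#include <assert.h>
#include <stdint.h>

#define DIRECT_BLOCKS 12u /* direct pointers held in the inode */
#define PTR_SIZE       4u /* assumption: 32-bit block numbers  */

/*
 * Blocks reachable through the direct pointers plus the indirect,
 * doubly-indirect and trebly-indirect blocks.
 */
static uint64_t max_mapped_blocks(uint32_t block_size)
{
	assert(block_size >= PTR_SIZE);
	uint64_t n = block_size / PTR_SIZE; /* pointers per block */

	return DIRECT_BLOCKS + n + n * n + n * n * n;
}
```
+
+With 1024-byte blocks this yields 16843020 blocks, a little over 16GB,
+which agrees with the file size limit given under Limitations. For larger
+block sizes the pointer tree reaches further than the listed limits, so
+other constraints apply there.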
+
+The flags field contains some ext2-specific flags which aren't catered
+for by the standard chmod flags. These flags can be listed with lsattr
+and changed with the chattr command, and allow specific filesystem
+behaviour on a per-file basis. There are flags for secure deletion,
+undeletable, compression, synchronous updates, immutability, append-only,
+dumpable, no-atime, indexed directories, and data-journaling. Not all
+of these are supported yet.
+
+Directories
+-----------
+
+A directory is a filesystem object and has an inode just like a file.
+It is a specially formatted file containing records which associate
+each name with an inode number. Later revisions of the filesystem also
+encode the type of the object (file, directory, symlink, device, fifo,
+socket) to avoid the need to check the inode itself for this information
+(support for taking advantage of this feature does not yet exist in
+Glibc 2.2).
+
+The inode allocation code tries to assign inodes which are in the same
+block group as the directory in which they are first created.
+
+The current implementation of ext2 uses a singly-linked list to store
+the filenames in the directory; a pending enhancement uses hashing of the
+filenames to allow lookup without the need to scan the entire directory.
+
+The current implementation never removes empty directory blocks once they
+have been allocated to hold more files.
+
+Special files
+-------------
+
+Symbolic links are also filesystem objects with inodes. They deserve
+special mention because the data for them is stored within the inode
+itself if the symlink is less than 60 bytes long. It uses the fields
+which would normally be used to store the pointers to data blocks.
+This is a worthwhile optimisation as it avoids allocating a full
+block for the symlink, and most symlinks are less than 60 characters long.
+
+Character and block special devices never have data blocks assigned to
+them. Instead, their device number is stored in the inode, again reusing
+the fields which would be used to point to the data blocks.
+
+Reserved Space
+--------------
+
+In ext2, there is a mechanism for reserving a certain number of blocks
+for a particular user (normally the super-user). This is intended to
+allow for the system to continue functioning even if non-privileged users
+fill up all the space available to them (this is independent of filesystem
+quotas). It also keeps the filesystem from filling up entirely which
+helps combat fragmentation.
+
+Filesystem check
+----------------
+
+At boot time, most systems run a consistency check (e2fsck) on their
+filesystems. The superblock of the ext2 filesystem contains several
+fields which indicate whether fsck should actually run (since checking
+the filesystem at boot can take a long time if it is large). fsck will
+run if the filesystem was not cleanly unmounted, if the maximum mount
+count has been exceeded or if the maximum time between checks has been
+exceeded.
+
+Feature Compatibility
+---------------------
+
+The compatibility feature mechanism used in ext2 is sophisticated.
+It safely allows features to be added to the filesystem, without
+unnecessarily sacrificing compatibility with older versions of the
+filesystem code. The feature compatibility mechanism is not supported by
+the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in
+revision 1. There are three 32-bit fields, one for compatible features
+(COMPAT), one for read-only compatible (RO_COMPAT) features and one for
+incompatible (INCOMPAT) features.
+
+These feature flags have specific meanings for the kernel as follows:
+
+A COMPAT flag indicates that a feature is present in the filesystem,
+but the on-disk format is 100% compatible with older on-disk formats, so
+a kernel which didn't know anything about this feature could read/write
+the filesystem without any chance of corrupting the filesystem (or even
+making it inconsistent). This is essentially just a flag which says
+"this filesystem has a (hidden) feature" that the kernel or e2fsck may
+want to be aware of (more on e2fsck and feature flags later). The ext3
+HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply
+a regular file with data blocks in it so the kernel does not need to
+take any special notice of it if it doesn't understand ext3 journaling.
+
+An RO_COMPAT flag indicates that the on-disk format is 100% compatible
+with older on-disk formats for reading (i.e. the feature does not change
+the visible on-disk format). However, an old kernel writing to such a
+filesystem would/could corrupt the filesystem, so this is prevented. The
+most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because
+sparse groups allow file data blocks where superblock/group descriptor
+backups used to live, and ext2_free_blocks() refuses to free these blocks,
+which would lead to inconsistent bitmaps. An old kernel would also
+get an error if it tried to free a series of blocks which crossed a group
+boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem.
+
+An INCOMPAT flag indicates the on-disk format has changed in some
+way that makes it unreadable by older kernels, or would otherwise
+cause a problem if an old kernel tried to mount it. FILETYPE is an
+INCOMPAT flag because older kernels would think a filename was longer
+than 256 characters, which would lead to corrupt directory listings.
+The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel
+doesn't understand compression, you would just get garbage back from
+read() instead of it automatically decompressing your data. The ext3
+RECOVER flag is needed to prevent a kernel which does not understand the
+ext3 journal from mounting the filesystem without replaying the journal.
+
+e2fsck needs to be more strict in its handling of these
+flags than the kernel. If it doesn't understand ANY of the COMPAT,
+RO_COMPAT, or INCOMPAT flags it will refuse to check the filesystem,
+because it has no way of verifying whether a given feature is valid
+or not. Allowing e2fsck to succeed on a filesystem with an unknown
+feature gives the user a false sense of security. Refusing to check
+a filesystem with unknown features is a good incentive for the user to
+update to the latest e2fsck. This also means that anyone adding feature
+flags to ext2 also needs to update e2fsck to verify these features.
+
+Metadata
+--------
+
+It is frequently claimed that the ext2 implementation of writing
+asynchronous metadata is faster than the ffs synchronous metadata
+scheme but less reliable. Both methods are equally resolvable by their
+respective fsck programs.
+
+If you're exceptionally paranoid, there are 3 ways of making metadata
+writes synchronous on ext2:
+
+per-file if you have the program source: use the O_SYNC flag to open()
+per-file if you don't have the source: use "chattr +S" on the file
+per-filesystem: add the "sync" option to mount (or in /etc/fstab)
+
+The first and last are not ext2 specific but do force the metadata to
+be written synchronously. See also Journaling below.
+
+Limitations
+-----------
+
+There are various limits imposed by the on-disk layout of ext2. Other
+limits are imposed by the current implementation of the kernel code.
+Many of the limits are determined at the time the filesystem is first
+created, and depend upon the block size chosen. The ratio of inodes to
+data blocks is fixed at filesystem creation time, so the only way to
+increase the number of inodes is to increase the size of the filesystem.
+No tools currently exist which can change the ratio of inodes to blocks.
+
+Most of these limits could be overcome with slight changes in the on-disk
+format and using a compatibility flag to signal the format change (at
+the expense of some compatibility).
+
+Filesystem block size:      1kB      2kB      4kB      8kB
+
+File size limit:           16GB    256GB   2048GB   2048GB
+Filesystem size limit:   2047GB   8192GB  16384GB  32768GB
+
+There is a 2.4 kernel limit of 2048GB for a single block device, so no
+filesystem larger than that can be created at this time. There is also
+an upper limit on the block size imposed by the page size of the kernel,
+so 8kB blocks are only allowed on Alpha systems (and other architectures
+which support larger pages).
+
+There is an upper limit of 32768 subdirectories in a single directory.
+
+There is a "soft" upper limit of about 10-15k files in a single directory
+with the current linear linked-list directory implementation. This limit
+stems from performance problems when creating and deleting (and also
+finding) files in such large directories. Using a hashed directory index
+(under development) allows 100k-1M+ files in a single directory without
+performance problems (although RAM size becomes an issue at this point).
+
+The (meaningless) absolute upper limit of files in a single directory
+(imposed by the file size, the realistic limit is obviously much less)
+is over 130 trillion files. It would be higher except there are not
+enough 4-character names to make up unique directory entries, so they
+have to be 8-character filenames; even then we are fairly close to
+running out of unique filenames.
+
+Journaling
+----------
+
+A journaling extension to the ext2 code has been developed by Stephen
+Tweedie. It avoids the risks of metadata corruption and the need to
+wait for e2fsck to complete after a crash, without requiring a change
+to the on-disk ext2 layout. In a nutshell, the journal is a regular
+file which stores whole metadata (and optionally data) blocks that have
+been modified, prior to writing them into the filesystem. This means
+it is possible to add a journal to an existing ext2 filesystem without
+the need for data conversion.
+
+When changes are made to the filesystem (e.g. a rename), they are stored in
+a transaction in the journal and can either be complete or incomplete at
+the time of a crash. If a transaction is complete at the time of a crash
+(or in the normal case where the system does not crash), then any blocks
+in that transaction are guaranteed to represent a valid filesystem state,
+and are copied into the filesystem. If a transaction is incomplete at
+the time of the crash, then there is no guarantee of consistency for
+the blocks in that transaction so they are discarded (which means any
+filesystem changes they represent are also lost).
+Check Documentation/filesystems/ext3.txt if you want to read more about
+ext3 and journaling.
+
+References
+==========
+
+The kernel source        file:/usr/src/linux/fs/ext2/
+e2fsprogs (e2fsck)       http://e2fsprogs.sourceforge.net/
+Design & Implementation  http://e2fsprogs.sourceforge.net/ext2intro.html
+Journaling (ext3)        ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/
+Hashed Directories       http://kernelnewbies.org/~phillips/htree/
+Filesystem Resizing      http://ext2resize.sourceforge.net/
+Compression (*)          http://www.netspace.net.au/~reiter/e2compr/
+
+Implementations for:
+Windows 95/98/NT/2000    http://uranus.it.swin.edu.au/~jn/linux/Explore2fs.htm
+Windows 95 (*)           http://www.yipton.demon.co.uk/content.html#FSDEXT2
+DOS client (*)           ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
+OS/2                     http://perso.wanadoo.fr/matthieu.willm/ext2-os2/
+RISC OS client           ftp://ftp.barnet.ac.uk/pub/acorn/armlinux/iscafs/
+
+(*) no longer actively developed/supported (as of Apr 2001)
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
new file mode 100644
index 0000000..9ab7f44
--- /dev/null
+++ b/Documentation/filesystems/ext3.txt
@@ -0,0 +1,183 @@
+
+Ext3 Filesystem
+===============
+
+ext3 was originally released in September 1999. It was written by Stephen
+Tweedie for the 2.2 branch, and ported to 2.4 kernels by Peter Braam,
+Andreas Dilger, Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie.
+
+ext3 is the ext2 filesystem enhanced with journalling capabilities.
+
+Options
+=======
+
+When mounting an ext3 filesystem, the following options are accepted:
+(*) == default
+
+journal=update Update the ext3 file system's journal to the
+ current format.
+
+journal=inum When a journal already exists, this option is
+ ignored. Otherwise, it specifies the number of
+ the inode which will represent the ext3 file
+ system's journal file.
+
+noload Don't load the journal on mounting.
+
+data=journal All data are committed into the journal prior
+ to being written into the main file system.
+
+data=ordered (*) All data are forced directly out to the main file
+ system prior to its metadata being committed to
+ the journal.
+
+data=writeback Data ordering is not preserved, data may be
+ written into the main file system after its
+ metadata has been committed to the journal.
+
+commit=nrsec (*) Ext3 can be told to sync all its data and metadata
+ every 'nrsec' seconds. The default value is 5 seconds.
+ This means that if you lose your power, you will lose
+ at most the latest 5 seconds of work (your filesystem
+ will not be damaged though, thanks to journaling). This
+ default value (or any low value) will hurt performance,
+ but it's good for data-safety. Setting it to 0 will
+ have the same effect as leaving the default 5 sec.
+ Setting it to very large values will improve
+ performance.
+
+barrier=1 This enables/disables barriers. barrier=0 disables them,
+ barrier=1 enables them.
+
+orlov (*) This enables the new Orlov block allocator. It's enabled
+ by default.
+
+oldalloc This disables the Orlov block allocator and enables the
+ old block allocator. Orlov should have better performance;
+ we'd like to get some feedback if it's the contrary for
+ you.
+
+user_xattr (*) Enables POSIX Extended Attributes. It's enabled by
+ default, however you need to configure its support
+ (CONFIG_EXT3_FS_XATTR). This is necessary if you want
+ to use POSIX Access Control Lists support. You can visit
+ http://acl.bestbits.at to know more about POSIX Extended
+ attributes.
+
+nouser_xattr Disables POSIX Extended Attributes.
+
+acl (*) Enables POSIX Access Control Lists support. This is
+ enabled by default, however you need to configure
+ its support (CONFIG_EXT3_FS_POSIX_ACL). If you want
+ to know more about ACLs visit http://acl.bestbits.at
+
+noacl This option disables POSIX Access Control List support.
+
+reservation
+
+noreservation
+
+resize=
+
+bsddf (*) Make 'df' act like BSD.
+minixdf Make 'df' act like Minix.
+
+check=none Don't do extra checking of bitmaps on mount.
+nocheck
+
+debug Extra debugging information is sent to syslog.
+
+errors=remount-ro(*) Remount the filesystem read-only on an error.
+errors=continue Keep going on a filesystem error.
+errors=panic Panic and halt the machine if an error occurs.
+
+grpid Give objects the same group ID as their parent directory.
+bsdgroups
+
+nogrpid (*) New objects have the group ID of their creator.
+sysvgroups
+
+resgid=n The group ID which may use the reserved blocks.
+
+resuid=n The user ID which may use the reserved blocks.
+
+sb=n Use alternate superblock at this location.
+
+quota Quota options are currently silently ignored.
+noquota (see fs/ext3/super.c, line 594)
+grpquota
+usrquota
+
+
+Specification
+=============
+ext3 shares its entire on-disk implementation with the ext2 filesystem,
+and adds transaction capabilities to ext2. Journaling is done by the
+Journaling Block Device layer.
+
+Journaling Block Device layer
+-----------------------------
+The Journaling Block Device layer (JBD) isn't ext3 specific. It was
+designed to add journaling capabilities to a block device. The ext3
+filesystem code informs the JBD of the modifications it is performing
+(called a transaction). The journal supports starting and stopping
+transactions, and in case of a crash, the journal can replay the
+transactions to quickly put the partition back into a consistent state.
+
+Handles represent a single atomic update to a filesystem. JBD can
+handle an external journal on a block device.
+
+Data Mode
+---------
+There are 3 different data modes:
+
+* writeback mode
+In data=writeback mode, ext3 does not journal data at all. This mode
+provides a similar level of journaling as XFS, JFS, and ReiserFS in its
+default mode - metadata journaling. A crash+recovery can cause
+incorrect data to appear in files which were written shortly before the
+crash. This mode will typically provide the best ext3 performance.
+
+* ordered mode
+In data=ordered mode, ext3 only officially journals metadata, but it
+logically groups metadata and data blocks into a single unit called a
+transaction. When it's time to write the new metadata out to disk, the
+associated data blocks are written first. In general, this mode
+performs slightly slower than writeback but significantly faster than
+journal mode.
+
+* journal mode
+data=journal mode provides full data and metadata journaling. All new
+data is written to the journal first, and then to its final location.
+In the event of a crash, the journal can be replayed, bringing both
+data and metadata into a consistent state. This mode is the slowest
+except when data needs to be read from and written to disk at the same
+time, where it outperforms all other modes.
+
+Compatibility
+-------------
+
+Ext2 partitions can easily be converted to ext3 with `tune2fs -j <dev>`.
+Ext3 is fully compatible with Ext2. Ext3 partitions can easily be
+mounted as Ext2.
+
+External Tools
+==============
+See the manual pages for more information.
+
+tune2fs: create an ext3 journal on an ext2 partition with the -j flag
+mke2fs: create an ext3 partition with the -j flag
+debugfs: ext2 and ext3 file system debugger
+
+References
+==========
+
+kernel source: file:/usr/src/linux/fs/ext3
+ file:/usr/src/linux/fs/jbd
+
+programs: http://e2fsprogs.sourceforge.net
+
+useful links:
+ http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html
+ http://www-106.ibm.com/developerworks/linux/library/l-fs7/
+ http://www-106.ibm.com/developerworks/linux/library/l-fs8/
diff --git a/Documentation/filesystems/hfs.txt b/Documentation/filesystems/hfs.txt
new file mode 100644
index 0000000..bd0fa770
--- /dev/null
+++ b/Documentation/filesystems/hfs.txt
@@ -0,0 +1,83 @@
+
+Macintosh HFS Filesystem for Linux
+==================================
+
+HFS stands for ``Hierarchical File System'' and is the filesystem used
+by the Mac Plus and all later Macintosh models. Earlier Macintosh
+models used MFS (``Macintosh File System''), which is not supported.
+MacOS 8.1 and newer support a filesystem called HFS+ that's similar to
+HFS but is extended in various areas. Use the hfsplus filesystem driver
+to access such filesystems from Linux.
+
+
+Mount options
+=============
+
+When mounting an HFS filesystem, the following options are accepted:
+
+ creator=cccc, type=cccc
+ Specifies the creator/type values as shown by the MacOS finder
+ used for creating new files. Default values: '????'.
+
+ uid=n, gid=n
+ Specifies the user/group that owns all files on the filesystems.
+ Default: user/group id of the mounting process.
+
+ dir_umask=n, file_umask=n, umask=n
+ Specifies the umask used for all directories, all files or all
+ files and directories. Defaults to the umask of the mounting process.
+
+ session=n
+ Select the CDROM session to mount as HFS filesystem. Defaults to
+ leaving that decision to the CDROM driver. This option will fail
+ with anything but a CDROM as the underlying device.
+
+ part=n
+ Select partition number n from the device. This only makes
+ sense for CDROMs because they can't be partitioned under Linux.
+ For disk devices the generic partition parsing code does this
+ for us. Defaults to not parsing the partition table at all.
+
+ quiet
+ Ignore invalid mount options instead of complaining.
+
+
+Writing to HFS Filesystems
+==========================
+
+HFS is not a UNIX filesystem, thus it does not have the usual features you'd
+expect:
+
+ o You can't modify the set-uid, set-gid, sticky or executable bits or the uid
+ and gid of files.
+ o You can't create hard- or symlinks, device files, sockets or FIFOs.
+
+HFS does, on the other hand, have the concept of multiple forks per file.
+These non-standard forks are represented as hidden additional files in the
+normal filesystem namespace, which is kind of a kludge and makes the
+semantics for them a little strange:
+
+ o You can't create, delete or rename resource forks of files or the
+ Finder's metadata.
+ o They are however created (with default values), deleted and renamed
+ along with the corresponding data fork or directory.
+ o Copying files to a different filesystem will lose those attributes
+ that are essential for MacOS to work.
+
+
+Creating HFS filesystems
+========================
+
+The hfsutils package from Robert Leslie contains a program called
+hformat that can be used to create HFS filesystems. See
+<http://www.mars.org/home/rob/proj/hfs/> for details.
+
+
+Credits
+=======
+
+The HFS driver was written by Paul H. Hargrove (hargrove@sccm.Stanford.EDU)
+and is now maintained by Roman Zippel (roman@ardistech.com) at Ardis
+Technologies.
+Roman rewrote large parts of the code and brought in btree routines derived
+from Brad Boyer's hfsplus driver (also maintained by Roman now).
diff --git a/Documentation/filesystems/hpfs.txt b/Documentation/filesystems/hpfs.txt
new file mode 100644
index 0000000..33dc360
--- /dev/null
+++ b/Documentation/filesystems/hpfs.txt
@@ -0,0 +1,296 @@
+Read/Write HPFS 2.09
+1998-2004, Mikulas Patocka
+
+email: mikulas@artax.karlin.mff.cuni.cz
+homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi
+
+CREDITS:
+Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file
+ is taken from it
+Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993)
+Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion
+
+Mount options
+
+uid=xxx,gid=xxx,umask=xxx (default uid=gid=0 umask=default_system_umask)
+ Set owner/group/mode for files that do not have it specified in extended
+ attributes. Mode is inverted umask - for example umask 027 gives owner
+ all permission, group read permission and anybody else no access. Note
+ that for files mode is anded with 0666. If you want files to have 'x'
+ rights, you must use extended attributes.
+case=lower,asis (default asis)
+ File name lowercasing in readdir.
+conv=binary,text,auto (default binary)
+ CR/LF -> LF conversion, if auto, decision is made according to extension
+ - there is a list of text extensions (I think it's better to not convert
+ text file than to damage binary file). If you want to change that list,
+ change it in the source. Original readonly HPFS contained some strange
+ heuristic algorithm that I removed. I think it's dangerous to let the
+ computer decide whether file is text or binary. For example, DJGPP
+ binaries contain small text message at the beginning and they could be
+ misidentified and damaged under some circumstances.
+check=none,normal,strict (default normal)
+ Check level. Selecting none will cause only little speedup and big
+ danger. I tried to write it so that it won't crash if check=normal on
+ corrupted filesystems. check=strict means many superfluous checks -
+ used for debugging (for example it checks if file is allocated in
+ bitmaps when accessing it).
+errors=continue,remount-ro,panic (default remount-ro)
+ Behaviour when filesystem errors are found.
+chkdsk=no,errors,always (default errors)
+ When to mark filesystem dirty so that OS/2 checks it.
+eas=no,ro,rw (default rw)
+ What to do with extended attributes. 'no' - ignore them and use always
+ values specified in uid/gid/mode options. 'ro' - read extended
+ attributes but do not create them. 'rw' - create extended attributes
+ when you use chmod/chown/chgrp/mknod/ln -s on the filesystem.
+timeshift=(-)nnn (default 0)
+ Shifts the time by nnn seconds. For example, if you see under linux
+ one hour more, than under os/2, use timeshift=-3600.
+
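+As a small illustration of the "inverted umask" rule described for the
+uid/gid/umask option, the sketch below derives the default mode. The
+function name is hypothetical and this is not code from the driver.
+
```c
#include <assert.h>
#include <stdbool.h>

/*
 * Default mode for an object without a "MODE" extended attribute:
 * the inverted umask, and for files additionally anded with 0666.
 */
static unsigned int hpfs_default_mode(unsigned int umask_value, bool is_dir)
{
	assert(umask_value <= 0777u);
	unsigned int mode = ~umask_value & 0777u;

	if (!is_dir)
		mode &= 0666u; /* files never get 'x' rights by default */
	return mode;
}
```
+
+With umask=027 this gives mode 0750 for directories and 0640 for files.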
+
+File names
+
+As in OS/2, filenames are case insensitive. However, shell thinks that names
+are case sensitive, so for example when you create a file FOO, you can use
+'cat FOO', 'cat Foo', 'cat foo' or 'cat F*' but not 'cat f*'. Note, that you
+also won't be able to compile linux kernel (and maybe other things) on HPFS
+because kernel creates different files with names like bootsect.S and
+bootsect.s. When searching for a file whose name has characters >= 128,
+codepages are used - see below.
+OS/2 ignores dots and spaces at the end of file name, so this driver does as
+well. If you create 'a. ...', the file 'a' will be created, but you can still
+access it under names 'a.', 'a..', 'a . . . ' etc.
+
+
+Extended attributes
+
+On HPFS partitions, OS/2 can attach to each file special information called
+extended attributes. Extended attributes are pairs of (key,value) where key is
+an ascii string identifying that attribute and value is any string of bytes of
+variable length. OS/2 stores window and icon positions and file types there. So
+why not use it for unix-specific info like file owner or access rights? This
+driver can do it. If you chown/chgrp/chmod on a hpfs partition, extended
+attributes with keys "UID", "GID" or "MODE" and 2-byte values are created. Only
+those extended attributes whose values differ from defaults specified in mount
+options are created. Once created, the extended attributes are never deleted,
+they're just changed. It means that when your default uid=0 and you type
+something like 'chown luser file; chown root file' the file will contain
+extended attribute UID=0. And when you umount the fs and mount it again with
+uid=luser_uid, the file will be still owned by root! If you chmod file to 444,
+extended attribute "MODE" will not be set, this special case is done by setting
+read-only flag. When you mknod a block or char device, besides "MODE", the
+special 4-byte extended attribute "DEV" will be created containing the device
+number. Currently this driver cannot resize extended attributes - it means
+that if somebody (I don't know who?) has set "UID", "GID", "MODE" or "DEV"
+attributes with different sizes, they won't be rewritten and changing these
+values doesn't work.
+
+
+Symlinks
+
+You can do symlinks on HPFS partition, symlinks are achieved by setting extended
+attribute named "SYMLINK" with symlink value. Like on ext2, you can chown and
+chgrp symlinks but I don't know what it is good for. chmoding symlink results
+in chmoding file where symlink points. These symlinks are just for Linux use and
+incompatible with OS/2. OS/2 PmShell symlinks are not supported because they are
+stored in very crazy way. They tried to do it so that link changes when file is
+moved ... sometimes it works. But the link is partly stored in directory
+extended attributes and partly in OS2SYS.INI. I don't want (and don't know how)
+to analyze or change OS2SYS.INI.
+
+
+Codepages
+
+HPFS can contain several uppercasing tables for several codepages and each
+file has a pointer to codepage its name is in. However OS/2 was created in
+America where people don't care much about codepages and so multiple codepages
+support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk.
+Once I booted English OS/2 working in cp 850 and I created a file on my 852
+partition. It marked file name codepage as 850 - good. But when I again booted
+Czech OS/2, the file was completely inaccessible under any name. It seems that
+OS/2 uppercases the search pattern with its system code page (852) and file
+name it's comparing to with its code page (850). These could never match. Is it
+really what IBM developers wanted? But problems continued. When I created in
+Czech OS/2 another file in that directory, that file was inaccessible too. OS/2
+probably uses different uppercasing method when searching where to place a file
+(note, that files in HPFS directory must be sorted) and when searching for
+a file. Finally when I opened this directory in PmShell, PmShell crashed (the
+funny thing was that, when rebooted, PmShell tried to reopen this directory
+again :-). chkdsk happily ignores these errors and only low-level disk
+modification saved me. Never mix different language versions of OS/2 on one
+system although HPFS was designed to allow that.
+OK, I could implement complex codepage support to this driver but I think it
+would cause more problems than benefit with such buggy implementation in OS/2.
+So this driver simply uses first codepage it finds for uppercasing and
+lowercasing no matter what's file codepage index. Usually all file names are in
+this codepage - if you don't try to do what I described above :-)
+
+
+Known bugs
+
+HPFS386 on OS/2 server is not supported. HPFS386 installed on normal OS/2 client
+should work. If you have OS/2 server, use only read-only mode. I don't know how
+to handle some HPFS386 structures like access control list or extended perm
+list, I don't know how to delete them when file is deleted and how to not
+overwrite them with extended attributes. Send me some info on these structures
+and I'll make it. However, this driver should detect presence of HPFS386
+structures, remount read-only and not destroy them (I hope).
+
+When there's not enough space for extended attributes, they will be truncated
+and no error is returned.
+
+OS/2 can't access files if the path is longer than about 256 chars but this
+driver allows you to do it. chkdsk ignores such errors.
+
+Sometimes you won't be able to delete some files on a very full filesystem
+(returning error ENOSPC). That's because a file in a non-leaf node of the
+directory tree (a large directory stores its dirents in a tree on HPFS) must be
+replaced with another node when deleted. That new file might have a longer name
+than the old one, so the new name doesn't fit in the directory node (dnode).
+That would result in the directory tree splitting, which takes disk space. The
+workaround is to delete other files that are in leaf nodes (the probability
+that a file is non-leaf is about 1/50) or to truncate the file first to make
+some space. You encounter this problem only if you have so many directories
+that the preallocated directory band is full, i.e.
+ number_of_directories / size_of_filesystem_in_mb > 4.
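+For example, on a 1000 MB filesystem this means more than 4000 directories
+(4000 / 1000 = 4).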
+
+You can't delete open directories.
+
+You can't rename over directories (what is it good for?).
+
+Renaming files so that only case changes doesn't work. This driver supports it
+but vfs doesn't. Something like 'mv file FILE' won't work.
+
+Atimes and directory mtimes are never updated, for performance reasons. If you
+really want them to be updated, let me know and I'll write it (but it will be
+slow).
+
+When the system is out of memory and swap, it may slightly corrupt the
+filesystem (lost files, unbalanced directories). (I guess all filesystems may
+do it).
+
+When compiled, you get the warning: function declaration isn't a prototype.
+Does anybody know what it means?
+
+
+What does "unbalanced tree" message mean?
+
+Old versions of this driver sometimes created unbalanced dnode trees. OS/2
+chkdsk doesn't scream if the tree is unbalanced (and sometimes creates
+unbalanced trees too :-) but both HPFS and HPFS386 contain a bug that rarely
+causes a crash when the tree is not balanced. This driver handles unbalanced
+trees correctly and writes a warning if it finds them. If you see this message,
+it is probably because of directories created with an old version of this
+driver. The workaround is to move all files from that directory to another and
+then back again. Do it in Linux, not OS/2! If you see this message in a
+directory that was wholly created by this driver, it is a BUG - let me know.
+
+
+Bugs in OS/2
+
+When you have two (or more) lost directories pointing to each other, chkdsk
+locks up when repairing the filesystem.
+
+Sometimes (I think it's random) when you create a file with one-char name under
+OS/2, OS/2 marks it as 'long'. chkdsk then removes this flag saying "Minor fs
+error corrected".
+
+File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and
+marks them as short (and writes "minor fs error corrected"). This bug is not in
+HPFS386.
+
+Codepage bugs described above.
+
+If you don't install fixpacks, there are many, many more...
+
+
+History
+
+0.90 First public release
+0.91 Fixed bug that caused shooting to memory when write_inode was called on
+ open inode (rarely happened)
+0.92 Fixed a little memory leak in freeing directory inodes
+0.93 Fixed bug that locked up the machine when there were too many filenames
+ with first 15 characters same
+ Fixed write_file to zero file when writing behind file end
+0.94 Fixed a little memory leak when trying to delete busy file or directory
+0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files
+1.90 First version for 2.1.1xx kernels
+1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk
+ Fixed a race-condition when write_inode is called while deleting file
+ Fixed a bug that could possibly happen (with very low probability) when
+ using 0xff in filenames
+ Rewritten locking to avoid race-conditions
+ Mount option 'eas' now works
+ Fsync no longer returns error
+ Files beginning with '.' are marked hidden
+ Remount support added
+ Alloc is not so slow when filesystem becomes full
+ Atimes are no longer updated because it slows down operation
+ Code cleanup (removed all commented debug prints)
+1.92 Corrected a bug when sync was called just before closing file
+1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it
+ works with previous versions
+ Fixed a possible problem with disks > 64G (but I don't have one, so I can't
+ test it)
+ Fixed a file overflow at 2G
+ Added new option 'timeshift'
+ Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in
+ read-only mode
+ Fixed a bug that slowed down alloc and prevented allocating 100% space
+ (this bug was not destructive)
+1.94 Added workaround for one bug in Linux
+ Fixed one buffer leak
+ Fixed some incompatibilities with large extended attributes (but it's still
+ not 100% ok, I have no info on it and OS/2 doesn't want to create them)
+ Rewritten allocation
+ Fixed a bug with i_blocks (du sometimes didn't display correct values)
+ Directories have no longer archive attribute set (some programs don't like
+ it)
+ Fixed a bug that it set badly one flag in large anode tree (it was not
+ destructive)
+1.95 Fixed one buffer leak, that could happen on corrupted filesystem
+ Fixed one bug in allocation in 1.94
+1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported
+ error sometimes when opening directories in PMSHELL)
+ Fixed a possible bitmap race
+ Fixed possible problem on large disks
+ You can now delete open files
+ Fixed a nondestructive race in rename
+1.97 Support for HPFS v3 (on large partitions)
+ Fixed a bug that it didn't allow creation of files > 128M (it should be 2G)
+1.97.1 Changed names of global symbols
+ Fixed a bug when chmoding or chowning root directory
+1.98 Fixed a deadlock when using old_readdir
+ Better directory handling; workaround for "unbalanced tree" bug in OS/2
+1.99 Corrected a possible problem when there's not enough space while deleting
+ file
+ Now it tries to truncate the file if there's not enough space when deleting
+ Removed a lot of redundant code
+2.00 Fixed a bug in rename (it was there since 1.96)
+ Better anti-fragmentation strategy
+2.01 Fixed problem with directory listing over NFS
+ Directory lseek now checks for proper parameters
+ Fixed race-condition in buffer code - it is in all filesystems in Linux;
+ when reading device (cat /dev/hda) while creating files on it, files
+ could be damaged
+2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond
+ end of partition
+2.03 Char, block devices and pipes are correctly created
+ Fixed non-crashing race in unlink (Alexander Viro)
+ Now it works with Japanese version of OS/2
+2.04 Fixed error when ftruncate used to extend file
+2.05 Fixed crash when got mount parameters without =
+ Fixed crash when allocation of anode failed due to full disk
+ Fixed some crashes when block io or inode allocation failed
+2.06 Fixed some crash on corrupted disk structures
+ Better allocation strategy
+ Reschedule points added so that it doesn't lock CPU long time
+ It should work in read-only mode on Warp Server
+2.07 More fixes for Warp Server. Now it really works
+2.08 Creating new files is not so slow on large disks
+ An attempt to sync deleted file does not generate filesystem error
+2.09 Fixed error on extremely fragmented files
+
+
+ vim: set textwidth=80:
diff --git a/Documentation/filesystems/isofs.txt b/Documentation/filesystems/isofs.txt
new file mode 100644
index 0000000..f64a105
--- /dev/null
+++ b/Documentation/filesystems/isofs.txt
@@ -0,0 +1,38 @@
+Mount options that are the same as for msdos and vfat partitions.
+
+  gid=nnn        All files in the partition will be in group nnn.
+  uid=nnn        All files in the partition will be owned by user id nnn.
+  umask=nnn      The permission mask (see umask(1)) for the partition.
+
+Mount options that are the same as for vfat partitions. These are only useful
+when using discs encoded using Microsoft's Joliet extensions.
+  iocharset=name Character set to use for converting from Unicode to
+                 ASCII. Joliet filenames are stored in Unicode format, but
+                 Unix for the most part doesn't know how to deal with Unicode.
+                 There is also an option of doing UTF8 translations with the
+                 utf8 option.
+  utf8           Encode Unicode names in UTF8 format. Default is no.
+
+Mount options unique to the isofs filesystem.
+  block=512      Set the block size for the disk to 512 bytes
+  block=1024     Set the block size for the disk to 1024 bytes
+  block=2048     Set the block size for the disk to 2048 bytes
+  check=relaxed  Matches filenames with different cases
+  check=strict   Matches only filenames with the exact same case
+  cruft          Try to handle badly formatted CDs.
+  map=off        Do not map non-Rock Ridge filenames to lower case
+  map=normal     Map non-Rock Ridge filenames to lower case
+  map=acorn      As map=normal but also apply Acorn extensions if present
+  mode=xxx       Sets the permissions on files to xxx
+  nojoliet       Ignore Joliet extensions if they are present.
+  norock         Ignore Rock Ridge extensions if they are present.
+  unhide         Show hidden files.
+  session=x      Select the session number on a multisession CD
+  sbsector=xxx   Session begins from sector xxx
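+
+As an illustrative example (the device /dev/cdrom and the mount point
+/mnt/cdrom are assumptions made for this example), a Joliet disc could be
+mounted with UTF8 names and with hidden files shown using:
+
+$ mount -t iso9660 -o ro,utf8,unhide /dev/cdrom /mnt/cdrom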
+
+Recommended documents about ISO 9660 standard are located at:
+http://www.y-adagio.com/public/standards/iso_cdromr/tocont.htm
+ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf
+Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically
+identical with ISO 9660.", so it is a valid and gratis substitute for the
+official ISO specification.
diff --git a/Documentation/filesystems/jfs.txt b/Documentation/filesystems/jfs.txt
new file mode 100644
index 0000000..3e992da
--- /dev/null
+++ b/Documentation/filesystems/jfs.txt
@@ -0,0 +1,35 @@
+IBM's Journaled File System (JFS) for Linux
+
+JFS Homepage: http://jfs.sourceforge.net/
+
+The following mount options are supported:
+
+iocharset=name     Character set to use for converting from Unicode to
+                   ASCII. The default is to do no conversion. Use
+                   iocharset=utf8 for UTF8 translations. This requires
+                   CONFIG_NLS_UTF8 to be set in the kernel .config file.
+                   iocharset=none specifies the default behavior explicitly.
+
+resize=value       Resize the volume to <value> blocks. JFS only supports
+                   growing a volume, not shrinking it. This option is only
+                   valid during a remount, when the volume is mounted
+                   read-write. The resize keyword with no value will grow
+                   the volume to the full size of the partition.
+
+nointegrity        Do not write to the journal. The primary use of this option
+                   is to allow for higher performance when restoring a volume
+                   from backup media. The integrity of the volume is not
+                   guaranteed if the system abends.
+
+integrity          Default. Commit metadata changes to the journal. Use this
+                   option to remount a volume where the nointegrity option was
+                   previously specified in order to restore normal behavior.
+
+errors=continue    Keep going on a filesystem error.
+errors=remount-ro  Default. Remount the filesystem read-only on an error.
+errors=panic       Panic and halt the machine if an error occurs.
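+
+For example, to grow a JFS volume that is mounted read-write to the full size
+of its partition (the mount point /mnt/jfs is an assumption made for this
+example), the resize keyword can be given without a value during a remount:
+
+$ mount -o remount,resize /mnt/jfs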
+
+Please send bugs, comments, cards and letters to shaggy@austin.ibm.com.
+
+The JFS mailing list can be subscribed to by using the link labeled
+"Mail list Subscribe" at our web page http://jfs.sourceforge.net/
diff --git a/Documentation/filesystems/ncpfs.txt b/Documentation/filesystems/ncpfs.txt
new file mode 100644
index 0000000..f12c30c
--- /dev/null
+++ b/Documentation/filesystems/ncpfs.txt
@@ -0,0 +1,12 @@
+The ncpfs filesystem understands the NCP protocol, designed by the
+Novell Corporation for their NetWare(tm) product. NCP is functionally
+similar to the NFS used in the TCP/IP community.
+To mount a NetWare filesystem, you need a special mount program, which
+can be found in the ncpfs package. The home site for ncpfs is
+ftp.gwdg.de/pub/linux/misc/ncpfs, but sunsite and its many mirrors
+will have it as well.
+
+Related products are linware and mars_nwe, which will give Linux partial
+NetWare server functionality. Linware's home site is
+klokan.sh.cvut.cz/pub/linux/linware; mars_nwe can be found on
+ftp.gwdg.de/pub/linux/misc/ncpfs.
diff --git a/Documentation/filesystems/ntfs.txt b/Documentation/filesystems/ntfs.txt
new file mode 100644
index 0000000..f89b440
--- /dev/null
+++ b/Documentation/filesystems/ntfs.txt
@@ -0,0 +1,630 @@
+The Linux NTFS filesystem driver
+================================
+
+
+Table of contents
+=================
+
+- Overview
+- Web site
+- Features
+- Supported mount options
+- Known bugs and (mis-)features
+- Using NTFS volume and stripe sets
+ - The Device-Mapper driver
+ - The Software RAID / MD driver
+ - Limitations when using the MD driver
+- ChangeLog
+
+
+Overview
+========
+
+Linux-NTFS comes with a number of user-space programs known as ntfsprogs.
+These include mkntfs, a full-featured ntfs file system format utility,
+ntfsundelete used for recovering files that were unintentionally deleted
+from an NTFS volume and ntfsresize which is used to resize an NTFS partition.
+See the web site for more information.
+
+To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file
+system type 'ntfs'. The driver currently supports read-only mode (with no
+fault-tolerance, encryption or journalling) and very limited, but safe, write
+support.
+
+For fault tolerance and raid support (i.e. volume and stripe sets), you can
+use the kernel's Device-Mapper or Software RAID / MD driver. See section
+"Using NTFS volume and stripe sets" for details.
+
+
+Web site
+========
+
+There is plenty of additional information on the linux-ntfs web site
+at http://linux-ntfs.sourceforge.net/
+
+The web site has a lot of additional information, such as a comprehensive
+FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS
+userspace utilities, etc.
+
+
+Features
+========
+
+- This is a complete rewrite of the NTFS driver that used to be in the kernel.
+ This new driver implements NTFS read support and is functionally equivalent
+ to the old ntfs driver.
+- The new driver has full support for sparse files on NTFS 3.x volumes which
+ the old driver isn't happy with.
+- The new driver supports execution of binaries due to mmap() now being
+ supported.
+- The new driver supports loopback mounting of files on NTFS which is used by
+ some Linux distributions to enable the user to run Linux from an NTFS
+ partition by creating a large file while in Windows and then loopback
+ mounting the file while in Linux and creating a Linux filesystem on it that
+ is used to install Linux on it.
+- A comparison of the two drivers using:
+ time find . -type f -exec md5sum "{}" \;
+ run three times in sequence with each driver (after a reboot) on a 1.4GiB
+ NTFS partition, showed the new driver to be 20% faster in total time elapsed
+ (from 9:43 minutes on average down to 7:53). The time spent in user space
+ was unchanged but the time spent in the kernel was decreased by a factor of
+ 2.5 (from 85 CPU seconds down to 33).
+- The driver does not support short file names in general. For backwards
+ compatibility, we implement access to files using their short file names if
+ they exist. The driver will not create short file names however, and a
+ rename will discard any existing short file name.
+- The new driver supports exporting of mounted NTFS volumes via NFS.
+- The new driver supports async io (aio).
+- The new driver supports fsync(2), fdatasync(2), and msync(2).
+- The new driver supports readv(2) and writev(2).
+- The new driver supports access time updates (including mtime and ctime).
+
+
+Supported mount options
+=======================
+
+In addition to the generic mount options described by the manual page for the
+mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the
+following mount options:
+
+iocharset=name         Deprecated option. Still supported but please use
+                       nls=name in the future. See description for nls=name.
+
+nls=name               Character set to use when returning file names.
+                       Unlike VFAT, NTFS suppresses names that contain
+                       unconvertible characters. Note that most character
+                       sets contain insufficient characters to represent all
+                       possible Unicode characters that can exist on NTFS.
+                       To be sure you are not missing any files, you are
+                       advised to use nls=utf8 which is capable of
+                       representing all Unicode characters.
+
+utf8=<bool>            Option no longer supported. Currently mapped to
+                       nls=utf8 but please use nls=utf8 in the future and
+                       make sure utf8 is compiled either as module or into
+                       the kernel. See description for nls=name.
+
+uid=
+gid=
+umask=                 Provide default owner, group, and access mode mask.
+                       These options work as documented in mount(8). By
+                       default, the files/directories are owned by root and
+                       he/she has read and write permissions, as well as
+                       browse permission for directories. No one else has any
+                       access permissions. I.e. the mode on all files is by
+                       default rw------- and for directories rwx------, a
+                       consequence of the default fmask=0177 and dmask=0077.
+                       Using a umask of zero will grant all permissions to
+                       everyone, i.e. all files and directories will have mode
+                       rwxrwxrwx.
+
+fmask=
+dmask=                 Instead of specifying umask which applies both to
+                       files and directories, fmask applies only to files and
+                       dmask only to directories.
+
+sloppy=<BOOL>          If sloppy is specified, ignore unknown mount options.
+                       Otherwise the default behaviour is to abort mount if
+                       any unknown options are found.
+
+show_sys_files=<BOOL>  If show_sys_files is specified, show the system files
+                       in directory listings. Otherwise the default behaviour
+                       is to hide the system files.
+                       Note that even when show_sys_files is specified, "$MFT"
+                       will not be visible due to bugs/mis-features in glibc.
+                       Further, note that irrespective of show_sys_files, all
+                       files are accessible by name, i.e. you can always do
+                       "ls -l \$UpCase" for example to specifically show the
+                       system file containing the Unicode upcase table.
+
+case_sensitive=<BOOL>  If case_sensitive is specified, treat all file names as
+                       case sensitive and create file names in the POSIX
+                       namespace. Otherwise the default behaviour is to treat
+                       file names as case insensitive and to create file names
+                       in the WIN32/LONG name space. Note, the Linux NTFS
+                       driver will never create short file names and will
+                       remove them on rename/delete of the corresponding long
+                       file name.
+                       Note that files remain accessible via their short file
+                       name, if it exists. If case_sensitive, you will need
+                       to provide the correct case of the short file name.
+
+errors=opt             What to do when critical file system errors are found.
+                       Following values can be used for "opt":
+                         continue: DEFAULT, try to clean-up as much as
+                                   possible, e.g. marking a corrupt inode as
+                                   bad so it is no longer accessed, and then
+                                   continue.
+                         recover:  At present the only supported recovery is
+                                   that of the boot sector from the backup
+                                   copy. If read-only mount, the recovery is
+                                   done in memory only and not written to disk.
+                       Note that the options are additive, i.e. specifying:
+                          errors=continue,errors=recover
+                       means the driver will attempt to recover and if that
+                       fails it will clean-up as much as possible and
+                       continue.
+
+mft_zone_multiplier=   Set the MFT zone multiplier for the volume (this
+                       setting is not persistent across mounts and can be
+                       changed from mount to mount but cannot be changed on
+                       remount). Values of 1 to 4 are allowed, 1 being the
+                       default. The MFT zone multiplier determines how much
+                       space is reserved for the MFT on the volume. If all
+                       other space is used up, then the MFT zone will be
+                       shrunk dynamically, so this has no impact on the
+                       amount of free space. However, it can have an impact
+                       on performance by affecting fragmentation of the MFT.
+                       In general use the default. If you have a lot of small
+                       files then use a higher value. The values have the
+                       following meaning:
+                             Value          MFT zone size (% of volume size)
+                               1               12.5%
+                               2               25%
+                               3               37.5%
+                               4               50%
+                       Note this option is irrelevant for read-only mounts.
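+
+As an illustrative sketch of how the mask options combine (the device
+/dev/hda2, the mount point /mnt/ntfs and the uid/gid values are assumptions
+made up for this example), a read-only mount that lets one user and one group
+read the volume could look like:
+
+$ mount -t ntfs -o ro,uid=1000,gid=100,fmask=0137,dmask=0027 /dev/hda2 /mnt/ntfs
+
+Each mask is cleared from the full permission set, in the same way that the
+defaults fmask=0177 and dmask=0077 yield rw------- and rwx------:
+
+        files:          0777 & ~0137 = 0640 (rw-r-----)
+        directories:    0777 & ~0027 = 0750 (rwxr-x---)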
+
+
+Known bugs and (mis-)features
+=============================
+
+- The link count on each directory inode entry is set to 1, due to Linux not
+ supporting directory hard links. This may well confuse some user space
+ applications, since the directory names will have the same inode numbers.
+ This also speeds up ntfs_read_inode() immensely. And we haven't found any
+ problems with this approach so far. If you find a problem with this, please
+ let us know.
+
+
+Please send bug reports/comments/feedback/abuse to the Linux-NTFS development
+list at sourceforge: linux-ntfs-dev@lists.sourceforge.net
+
+
+Using NTFS volume and stripe sets
+=================================
+
+For support of volume and stripe sets, you can either use the kernel's
+Device-Mapper driver or the kernel's Software RAID / MD driver. The former is
+the recommended one to use for linear raid. But the latter is required for
+raid level 5. For striping and mirroring, either driver should work fine.
+
+
+The Device-Mapper driver
+------------------------
+
+You will need to create a table of the components of the volume/stripe set and
+how they fit together and load this into the kernel using the dmsetup utility
+(see man 8 dmsetup).
+
+Linear volume sets, i.e. linear raid, have been tested and work fine. Even
+though untested, there is no reason why stripe sets, i.e. raid level 0, and
+mirrors, i.e. raid level 1 should not work, too. Stripes with parity, i.e.
+raid level 5, unfortunately cannot work yet because the current version of the
+Device-Mapper driver does not support raid level 5. You may be able to use the
+Software RAID / MD driver for raid level 5, see the next section for details.
+
+To create the table describing your volume you will need to know each of its
+components and their sizes in sectors, i.e. multiples of 512-byte blocks.
+
+For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for
+example if one of your partitions is /dev/hda2 you would do:
+
+$ fdisk -ul /dev/hda
+
+Disk /dev/hda: 81.9 GB, 81964302336 bytes
+255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors
+Units = sectors of 1 * 512 = 512 bytes
+
+   Device Boot      Start         End      Blocks   Id  System
+   /dev/hda1   *       63     4209029    2104483+   83  Linux
+   /dev/hda2      4209030    37768814   16779892+   86  NTFS
+   /dev/hda3     37768815    46170809    4200997+   83  Linux
+
+And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 =
+33559785 sectors.
+
+For Win2k and later dynamic disks, you can for example use the ldminfo utility
+which is part of the Linux LDM tools (the latest version at the time of
+writing is linux-ldm-0.0.8.tar.bz2). You can download it from:
+ http://linux-ntfs.sourceforge.net/downloads.html
+Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go
+into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You
+will find the precompiled (i386) ldminfo utility there. NOTE: You will not be
+able to compile this yourself easily so use the binary version!
+
+Then you would use ldminfo in dump mode to obtain the necessary information:
+
+$ ./ldminfo --dump /dev/hda
+
+This would dump the LDM database found on /dev/hda which describes all of your
+dynamic disks and all the volumes on them. At the bottom you will see the
+VOLUME DEFINITIONS section which is all you really need. You may need to look
+further above to determine which of the disks in the volume definitions is
+which device in Linux. Hint: Run ldminfo on each of your dynamic disks and
+look at the Disk Id close to the top of the output for each (the PRIVATE HEADER
+section). You can then find these Disk Ids in the VBLK DATABASE section in the
+<Disk> components where you will get the LDM Name for the disk that is found in
+the VOLUME DEFINITIONS section.
+
+Note you will also need to enable the LDM driver in the Linux kernel. If your
+distribution did not enable it, you will need to recompile the kernel with it
+enabled. This will create the LDM partitions on each device at boot time. You
+would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc)
+in the Device-Mapper table.
+
+You can also bypass using the LDM driver by using the main device (e.g.
+/dev/hda) and then using the offsets of the LDM partitions into this device as
+the "Start sector of device" when creating the table. Once again ldminfo would
+give you the correct information to do this.
+
+Assuming you know all your devices and their sizes things are easy.
+
+For a linear raid the table would look like this (note all values are in
+512-byte sectors):
+
+--- cut here ---
+# Offset into   Size of this    Raid type   Device      Start sector
+# volume        device                                  of device
+0               1028161         linear      /dev/hda1   0
+1028161         3903762         linear      /dev/hdb2   0
+4931923         2103211         linear      /dev/hdc1   0
+--- cut here ---
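+
+Each "Offset into volume" is simply the sum of the sizes of all preceding
+components, so the offsets in the linear table can be checked by hand:
+
+        /dev/hda1:      0
+        /dev/hdb2:      0 + 1028161 = 1028161
+        /dev/hdc1:      1028161 + 3903762 = 4931923
+
+The whole volume is then 4931923 + 2103211 = 7035134 sectors long.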
+
+For a striped volume, i.e. raid level 0, you will need to know the chunk size
+you used when creating the volume. Windows uses 64kiB as the default, so it
+will probably be this unless you changed the defaults when creating the array.
+
+For a raid level 0 the table would look like this (note all values are in
+512-byte sectors):
+
+--- cut here ---
+# Offset   Size      Raid      Number    Chunk   1st         Start    2nd         Start
+# into     of the    type      of        size    Device      in       Device      in
+# volume   volume              stripes                       device               device
+0          2056320   striped   2         128     /dev/hda1   0        /dev/hdb1   0
+--- cut here ---
+
+If there are more than two devices, just add each of them to the end of the
+line.
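+
+The chunk size of 128 used for the striped table is the Windows default of
+64kiB expressed in 512-byte sectors: 64 * 1024 / 512 = 128. As a sketch of the
+three-device case (the third device /dev/hdc1 and the volume size are
+hypothetical, assuming three components of 1028160 sectors each, i.e.
+3 * 1028160 = 3084480 sectors), note that the number of stripes changes too:
+
+--- cut here ---
+0 3084480 striped 3 128 /dev/hda1 0 /dev/hdb1 0 /dev/hdc1 0
+--- cut here ---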
+
+Finally, for a mirrored volume, i.e. raid level 1, the table would look like
+this (note all values are in 512-byte sectors):
+
+--- cut here ---
+# Ofs   Size      Raid     Log    Number    Region   Should   Number    Source      Start    Target      Start
+# in    of the    type     type   of log    size     sync?    of        Device      in       Device      in
+# vol   volume                    params                      mirrors               Device               Device
+0       2056320   mirror   core   2         16       nosync   2         /dev/hda1   0        /dev/hdb1   0
+--- cut here ---
+
+If you are mirroring to multiple devices you can specify further targets at the
+end of the line.
+
+Note the "Should sync?" parameter "nosync" means that the two mirrors are
+already in sync which will be the case on a clean shutdown of Windows. If the
+mirrors are not clean, you can specify the "sync" option instead of "nosync"
+and the Device-Mapper driver will then copy the entirety of the "Source Device"
+to the "Target Device" or, if you specified multiple target devices, to all of
+them.
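+
+For instance, a sketch of the table for mirrors that are not known to be clean
+and that have a second target device (/dev/hdc1 is a hypothetical device used
+only for illustration) would change the sync parameter and the number of
+mirrors, and append one more device/start pair at the end of the line:
+
+--- cut here ---
+0 2056320 mirror core 2 16 sync 3 /dev/hda1 0 /dev/hdb1 0 /dev/hdc1 0
+--- cut here ---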
+
+Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1),
+and hand it over to dmsetup to work with, like so:
+
+$ dmsetup create myvolume1 /etc/ntfsvolume1
+
+You can obviously replace "myvolume1" with whatever name you like.
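+
+To double-check what was loaded, or to tear the mapping down again before
+retrying with a corrected table, the dmsetup "table" and "remove" commands can
+be used (shown here with the example name from above):
+
+$ dmsetup table myvolume1
+$ dmsetup remove myvolume1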
+
+If it all worked, you will now have the device /dev/device-mapper/myvolume1
+which you can then just use as an argument to the mount command as usual to
+mount the ntfs volume. For example:
+
+$ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1
+
+(You need to create the directory /mnt/myvol1 first and of course you can use
+anything you like instead of /mnt/myvol1 as long as it is an existing
+directory.)
+
+It is advisable to do the mount read-only to see if the volume has been set up
+correctly to avoid the possibility of causing damage to the data on the ntfs
+volume.
+
+
+The Software RAID / MD driver
+-----------------------------
+
+An alternative to using the Device-Mapper driver is to use the kernel's
+Software RAID / MD driver. For this you need to set up your /etc/raidtab
+appropriately (see man 5 raidtab).
+
+Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level
+0, have been tested and work fine (though see section "Limitations when using
+the Software RAID / MD driver" especially if you want to use linear raid).
+Even though untested, there is no reason why mirrors, i.e. raid level 1, and
+stripes with parity, i.e. raid level 5, should not work, too.
+
+You have to use the "persistent-superblock 0" option for each raid-disk in the
+NTFS volume/stripe you are configuring in /etc/raidtab as the persistent
+superblock used by the MD driver would damage the NTFS volume.
+
+Windows by default uses a stripe chunk size of 64k, so you probably want the
+"chunk-size 64k" option for each raid-disk, too.
+
+For example, if you have a stripe set consisting of two partitions /dev/hda5
+and /dev/hdb1 your /etc/raidtab would look like this:
+
+raiddev /dev/md0
+        raid-level      0
+        nr-raid-disks   2
+        nr-spare-disks  0
+        persistent-superblock   0
+        chunk-size      64k
+        device          /dev/hda5
+        raid-disk       0
+        device          /dev/hdb1
+        raid-disk       1
+
+For linear raid, just change the raid-level above to "raid-level linear", for
+mirrors, change it to "raid-level 1", and for stripe sets with parity, change
+it to "raid-level 5".
+
+Note for stripe sets with parity you will also need to tell the MD driver
+which parity algorithm to use by specifying the option "parity-algorithm
+which", where you need to replace "which" with the name of the algorithm to
+use (see man 5 raidtab for available algorithms) and you will have to try the
+different available algorithms until you find one that works. Make sure you
+are working read-only when playing with this as you may damage your data
+otherwise. If you find which algorithm works please let us know (email the
+linux-ntfs developers list linux-ntfs-dev@lists.sourceforge.net or drop in on
+IRC in channel #ntfs on the irc.freenode.net network) so we can update this
+documentation.
+
+Once the raidtab is set up, run for example raid0run -a to start all devices or
+raid0run /dev/md0 to start a particular md device, in this case /dev/md0.
+
+Then just use the mount command as usual to mount the ntfs volume using for
+example: mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume
+
+It is advisable to do the mount read-only to see if the md volume has been
+set up correctly to avoid the possibility of causing damage to the data on the
+ntfs volume.
+
+
+Limitations when using the Software RAID / MD driver
+----------------------------------------------------
+
+Using the md driver will not work properly if any of your NTFS partitions have
+an odd number of sectors. This is especially important for linear raid as all
+data after the first partition with an odd number of sectors will be offset by
+one or more sectors so if you mount such a partition with write support you
+will cause massive damage to the data on the volume which will only become
+apparent when you try to use the volume again under Windows.
+
+So when using linear raid, make sure that all your partitions have an even
+number of sectors BEFORE attempting to use it. You have been warned!
+
+Even better is to simply use the Device-Mapper for linear raid and then you do
+not have this problem with odd numbers of sectors.
+
+
+ChangeLog
+=========
+
+Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog.
+
+2.1.22:
+ - Improve handling of ntfs volumes with errors.
+ - Fix various bugs and race conditions.
+2.1.21:
+ - Fix several race conditions and various other bugs.
+ - Many internal cleanups, code reorganization, optimizations, and mft
+ and index record writing code rewritten to fit in with the changes.
+ - Update Documentation/filesystems/ntfs.txt with instructions on how to
+ use the Device-Mapper driver with NTFS ftdisk/LDM raid.
+2.1.20:
+ - Fix two stupid bugs introduced in 2.1.18 release.
+2.1.19:
+ - Minor bugfix in handling of the default upcase table.
+ - Many internal cleanups and improvements. Many thanks to Linus
+ Torvalds and Al Viro for the help and advice with the sparse
+ annotations and cleanups.
+2.1.18:
+ - Fix scheduling latencies at mount time. (Ingo Molnar)
+ - Fix endianness bug in a little traversed portion of the attribute
+ lookup code.
+2.1.17:
+ - Fix bugs in mount time error code paths.
+2.1.16:
+ - Implement access time updates (including mtime and ctime).
+ - Implement fsync(2), fdatasync(2), and msync(2) system calls.
+ - Enable the readv(2) and writev(2) system calls.
+ - Enable access via the asynchronous io (aio) API by adding support for
+ the aio_read(3) and aio_write(3) functions.
+2.1.15:
+ - Invalidate quotas when (re)mounting read-write.
+ NOTE: This now only leaves user space journalling on the side. (See
+ note for version 2.1.13, below.)
+2.1.14:
+ - Fix an NFSd caused deadlock reported by several users.
+2.1.13:
+ - Implement writing of inodes (access time updates are not implemented
+ yet so mounting with -o noatime,nodiratime is enforced).
+ - Enable writing out of resident files so you can now overwrite any
+ uncompressed, unencrypted, nonsparse file as long as you do not
+ change the file size.
+ - Add housekeeping of ntfs system files so that ntfsfix no longer needs
+ to be run after writing to an NTFS volume.
+ NOTE: This still leaves quota tracking and user space journalling on
+ the side but they should not cause data corruption. In the worst
+ case the charged quotas will be out of date ($Quota) and some
+ userspace applications might get confused due to the out of date
+ userspace journal ($UsnJrnl).
+2.1.12:
+ - Fix the second fix to the decompression engine from the 2.1.9 release
+ and some further internals cleanups.
+2.1.11:
+ - Driver internal cleanups.
+2.1.10:
+ - Force read-only (re)mounting of volumes with unsupported volume
+ flags and various cleanups.
+2.1.9:
+ - Fix two bugs in handling of corner cases in the decompression engine.
+2.1.8:
+ - Read the $MFT mirror and compare it to the $MFT and if the two do not
+ match, force a read-only mount and do not allow read-write remounts.
+ - Read and parse the $LogFile journal and if it indicates that the
+ volume was not shutdown cleanly, force a read-only mount and do not
+ allow read-write remounts. If the $LogFile indicates a clean
+ shutdown and a read-write (re)mount is requested, empty $LogFile to
+ ensure that Windows cannot cause data corruption by replaying a stale
+ journal after Linux has written to the volume.
+ - Improve time handling so that the NTFS time is fully preserved when
+ converted to kernel time and only up to 99 nano-seconds are lost when
+ kernel time is converted to NTFS time.
+2.1.7:
+ - Enable NFS exporting of mounted NTFS volumes.
+2.1.6:
+ - Fix minor bug in handling of compressed directories that fixes the
+ erroneous "du" and "stat" output people reported.
+2.1.5:
+ - Minor bug fix in attribute list attribute handling that fixes the
+ I/O errors on "ls" of certain fragmented files found by at least two
+ people running Windows XP.
+2.1.4:
+ - Minor update allowing compilation with all gcc versions (well, the
+ ones the kernel can be compiled with anyway).
+2.1.3:
+ - Major bug fixes for reading files and volumes in corner cases which
+ were being hit by Windows 2k/XP users.
+2.1.2:
+ - Major bug fixes alleviating the hangs in statfs experienced by some
+ users.
+2.1.1:
+ - Update handling of compressed files so people no longer get the
+ frequently reported warning messages about initialized_size !=
+ data_size.
+2.1.0:
+ - Add configuration option for developmental write support.
+ - Initial implementation of file overwriting. (Writes to resident files
+ are not written out to disk yet, so avoid writing to files smaller
+ than about 1kiB.)
+ - Intercept/abort changes in file size as they are not implemented yet.
+2.0.25:
+ - Minor bugfixes in error code paths and small cleanups.
+2.0.24:
+ - Small internal cleanups.
+ - Support for sendfile system call. (Christoph Hellwig)
+2.0.23:
+ - Massive internal locking changes to mft record locking. Fixes
+ various race conditions and deadlocks.
+ - Fix ntfs over loopback for compressed files by adding an
+ optimization barrier. (gcc was screwing up otherwise ?)
+ Thanks go to Christoph Hellwig for pointing these two out:
+ - Remove now unused function fs/ntfs/malloc.h::vmalloc_nofs().
+ - Fix ntfs_free() for ia64 and parisc.
+2.0.22:
+ - Small internal cleanups.
+2.0.21:
+ These only affect 32-bit architectures:
+ - Check for, and refuse to mount too large volumes (maximum is 2TiB).
+ - Check for, and refuse to open too large files and directories
+ (maximum is 16TiB).
+2.0.20:
+ - Support non-resident directory index bitmaps. This means we now cope
+ with huge directories without problems.
+ - Fix a page leak that manifested itself in some cases when reading
+ directory contents.
+ - Internal cleanups.
+2.0.19:
+ - Fix race condition and improvements in block i/o interface.
+ - Optimization when reading compressed files.
+2.0.18:
+ - Fix race condition in reading of compressed files.
+2.0.17:
+ - Cleanups and optimizations.
+2.0.16:
+ - Fix stupid bug introduced in 2.0.15 in new attribute inode API.
+ - Big internal cleanup replacing the mftbmp access hacks by using the
+ new attribute inode API instead.
+2.0.15:
+ - Bug fix in parsing of remount options.
+ - Internal changes implementing attribute (fake) inodes allowing all
+ attribute i/o to go via the page cache and to use all the normal
+ vfs/mm functionality.
+2.0.14:
+ - Internal changes improving run list merging code and minor locking
+ change to not rely on BKL in ntfs_statfs().
+2.0.13:
+ - Internal changes towards using iget5_locked() in preparation for
+ fake inodes and small cleanups to ntfs_volume structure.
+2.0.12:
+ - Internal cleanups in address space operations made possible by the
+ changes introduced in the previous release.
+2.0.11:
+ - Internal updates and cleanups introducing the first step towards
+ fake inode based attribute i/o.
+2.0.10:
+ - Microsoft says that the maximum number of inodes is 2^32 - 1. Update
+ the driver accordingly to only use 32-bits to store inode numbers on
+ 32-bit architectures. This improves the speed of the driver a little.
+2.0.9:
+ - Change decompression engine to use a single buffer. This should not
+ affect performance except perhaps on the most heavy i/o on SMP
+ systems when accessing multiple compressed files from multiple
+ devices simultaneously.
+ - Minor updates and cleanups.
+2.0.8:
+ - Remove now obsolete show_inodes and posix mount option(s).
+ - Restore show_sys_files mount option.
+ - Add new mount option case_sensitive, to determine if the driver
+ treats file names as case sensitive or not.
+ - Mostly drop support for short file names (for backwards compatibility
+ we only support accessing files via their short file name if one
+ exists).
+ - Fix dcache aliasing issues wrt short/long file names.
+ - Cleanups and minor fixes.
+2.0.7:
+ - Just cleanups.
+2.0.6:
+ - Major bugfix to make compatible with other kernel changes. This fixes
+ the hangs/oopses on umount.
+ - Locking cleanup in directory operations (remove BKL usage).
+2.0.5:
+ - Major buffer overflow bug fix.
+ - Minor cleanups and updates for kernel 2.5.12.
+2.0.4:
+ - Cleanups and updates for kernel 2.5.11.
+2.0.3:
+ - Small bug fixes, cleanups, and performance improvements.
+2.0.2:
+ - Use default fmask of 0177 so that files are not executable by default.
+ If you want owner executable files, just use fmask=0077.
+ - Update for kernel 2.5.9 but preserve backwards compatibility with
+ kernel 2.5.7.
+ - Minor bug fixes, cleanups, and updates.
+2.0.1:
+ - Minor updates, primarily set the executable bit by default on files
+ so they can be executed.
+2.0.0:
+ - Started ChangeLog.
+
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
new file mode 100644
index 0000000..2f38846
--- /dev/null
+++ b/Documentation/filesystems/porting
@@ -0,0 +1,266 @@
+Changes since 2.5.0:
+
+---
+[recommended]
+
+New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(),
+ sb_set_blocksize() and sb_min_blocksize().
+
+Use them.
+
+(sb_find_get_block() replaces 2.4's get_hash_table())
+
+---
+[recommended]
+
+New methods: ->alloc_inode() and ->destroy_inode().
+
+Remove inode->u.foo_inode_i
+Declare
+ struct foo_inode_info {
+ /* fs-private stuff */
+ struct inode vfs_inode;
+ };
+ static inline struct foo_inode_info *FOO_I(struct inode *inode)
+ {
+ return list_entry(inode, struct foo_inode_info, vfs_inode);
+ }
+
+Use FOO_I(inode) instead of &inode->u.foo_inode_i;
+
+Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate
+foo_inode_info and return the address of ->vfs_inode, the latter should free
+FOO_I(inode) (see in-tree filesystems for examples).
+
+Make them ->alloc_inode and ->destroy_inode in your super_operations.
+
+Keep in mind that now you need explicit initialization of private data -
+typically in ->read_inode() and after getting an inode from new_inode().
+
+At some point that will become mandatory.
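+
+A minimal userspace sketch of the embedding pattern described in this entry.
+It is an illustration, not kernel code: struct inode is a stand-in, the
+foo_private field and the member_to_container() macro are hypothetical names,
+and the real ->alloc_inode() takes a struct super_block * argument that is
+left out here.

```c
#include <stddef.h>   /* offsetof */
#include <stdlib.h>   /* calloc, free */

/* Stand-in for the kernel's struct inode; only here so the sketch compiles. */
struct inode {
	unsigned long i_ino;
};

/* Per-filesystem inode: private fields plus the embedded VFS inode. */
struct foo_inode_info {
	int foo_private;		/* fs-private stuff (hypothetical field) */
	struct inode vfs_inode;
};

/* Userspace equivalent of the pointer arithmetic done by list_entry(). */
#define member_to_container(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct foo_inode_info *FOO_I(struct inode *inode)
{
	return member_to_container(inode, struct foo_inode_info, vfs_inode);
}

/* Allocate the whole container, hand back only the embedded inode. */
struct inode *foo_alloc_inode(void)
{
	struct foo_inode_info *fi = calloc(1, sizeof(*fi));

	if (!fi)
		return NULL;
	fi->foo_private = 0;	/* explicit initialization of private data */
	return &fi->vfs_inode;
}

/* Free the container that the inode is embedded in, not the inode itself. */
void foo_destroy_inode(struct inode *inode)
{
	free(FOO_I(inode));
}
```
+
+The point of the sketch is that the VFS only ever sees the address of
+->vfs_inode, while FOO_I() steps back to the start of the enclosing structure.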
+
+---
+[mandatory]
+
+Change of file_system_type method (->read_super to ->get_sb)
+
+->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV.
+
+Turn your foo_read_super() into a function that would return 0 in case of
+success and negative number in case of error (-EINVAL unless you have more
+informative error value to report). Call it foo_fill_super(). Now declare
+
+struct super_block *foo_get_sb(struct file_system_type *fs_type,
+ int flags, const char *dev_name, void *data)
+{
+ return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super);
+}
+
+(or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of
+filesystem).
+
+Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as
+foo_get_sb.
+
+---
+[mandatory]
+
+Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames.
+Most likely there is no need to change anything, but if you relied on
+global exclusion between renames for some internal purpose - you need to
+change your internal locking. Otherwise exclusion guarantees remain the
+same (i.e. parents and victim are locked, etc.).
+
+---
+[informational]
+
+Now we have the exclusion between ->lookup() and directory removal (by
+->rmdir() and ->rename()). If you used to need that exclusion and got
+it by internal locking (most filesystems couldn't care less) - you
+can relax your locking.
+
+---
+[mandatory]
+
+->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(),
+->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename()
+and ->readdir() are called without BKL now. Grab it on entry, drop upon return
+- that will guarantee the same locking you used to have. If your method or its
+parts do not need BKL - better yet, now you can shift lock_kernel() and
+unlock_kernel() so that they would protect exactly what needs to be
+protected.
+
+---
+[mandatory]
+
+BKL is also moved from around sb operations. ->write_super() is now called
+without BKL held. BKL should have been shifted into individual fs sb_op
+functions. If you don't need it, remove it.
+
+---
+[informational]
+
+check for ->link() target not being a directory is done by callers. Feel
+free to drop it...
+
+---
+[informational]
+
+->link() callers hold ->i_sem on the object we are linking to. Some of your
+problems might be over...
+
+---
+[mandatory]
+
+new file_system_type method - kill_sb(superblock). If you are converting
+an existing filesystem, set it according to ->fs_flags:
+ FS_REQUIRES_DEV - kill_block_super
+ FS_LITTER - kill_litter_super
+ neither - kill_anon_super
+FS_LITTER is gone - just remove it from fs_flags.
+
+---
+[mandatory]
+
+ FS_SINGLE is gone (actually, that had happened back when ->get_sb()
+went in - and hadn't been documented ;-/). Just remove it from fs_flags
+(and see ->get_sb() entry for other actions).
+
+---
+[mandatory]
+
+->setattr() is called without BKL now. Caller _always_ holds ->i_sem, so
+watch for ->i_sem-grabbing code that might be used by your ->setattr().
+Callers of notify_change() need ->i_sem now.
+
+---
+[recommended]
+
+New super_block field "struct export_operations *s_export_op" for
+explicit support for exporting, e.g. via NFS. The structure is fully
+documented at its declaration in include/linux/fs.h, and in
+Documentation/filesystems/Exporting.
+
+Briefly it allows for the definition of decode_fh and encode_fh operations
+to encode and decode filehandles, and allows the filesystem to use
+a standard helper function for decode_fh, and provide file-system specific
+support for this helper, particularly get_parent.
+
+It is planned that this will be required for exporting once the code
+settles down a bit.
+
+---
+[mandatory]
+
+s_export_op is now required for exporting a filesystem.
+isofs, ext2, ext3, reiserfs, fat
+can be used as examples of very different filesystems.
+
+---
+[mandatory]
+
+iget4() and the read_inode2 callback have been superseded by iget5_locked()
+which has the following prototype,
+
+ struct inode *iget5_locked(struct super_block *sb, unsigned long ino,
+ int (*test)(struct inode *, void *),
+ int (*set)(struct inode *, void *),
+ void *data);
+
+'test' is an additional function that can be used when the inode
+number is not sufficient to identify the actual file object. 'set'
+should be a non-blocking function that initializes those parts of a
+newly created inode to allow the test function to succeed. 'data' is
+passed as an opaque value to both test and set functions.
+
+When the inode has been created by iget5_locked(), it will be returned with
+the I_NEW flag set and will still be locked. read_inode has not been
+called so the file system still has to finalize the initialization. Once
+the inode is initialized it must be unlocked by calling unlock_new_inode().
+
+The filesystem is responsible for setting (and possibly testing) i_ino
+when appropriate. There is also a simpler iget_locked function that
+just takes the superblock and inode number as arguments and does the
+test and set for you.
+
+e.g.
+ inode = iget_locked(sb, ino);
+ if (inode->i_state & I_NEW) {
+ read_inode_from_disk(inode);
+ unlock_new_inode(inode);
+ }
+
+---
+[recommended]
+
+->getattr() finally getting used. See instances in nfs, minix, etc.
+
+---
+[mandatory]
+
+->revalidate() is gone. If your filesystem had it - provide ->getattr()
+and let it call whatever you had as ->revalidate() + (for symlinks that
+had ->revalidate()) add calls in ->follow_link()/->readlink().
+
+---
+[mandatory]
+
+->d_parent changes are not protected by BKL anymore. Read access is safe
+if at least one of the following is true:
+ * filesystem has no cross-directory rename()
+ * dcache_lock is held
+ * we know that parent had been locked (e.g. we are looking at
+->d_parent of ->lookup() argument).
+ * we are called from ->rename().
+ * the child's ->d_lock is held
+Audit your code and add locking if needed. Notice that any place that is
+not protected by the conditions above is risky even in the old tree - you
+had been relying on BKL and that's prone to screwups. Old tree had quite
+a few holes of that kind - unprotected access to ->d_parent leading to
+anything from oops to silent memory corruption.
+
+---
+[mandatory]
+
+ FS_NOMOUNT is gone. If you use it - just set MS_NOUSER in flags
+(see rootfs for one kind of solution and bdev/socket/pipe for another).
+
+---
+[recommended]
+
+ Use bdev_read_only(bdev) instead of is_read_only(kdev). The latter
+is still alive, but only because of the mess in drivers/s390/block/dasd.c.
+As soon as it gets fixed is_read_only() will die.
+
+---
+[mandatory]
+
+->permission() is called without BKL now. Grab it on entry, drop upon
+return - that will guarantee the same locking you used to have. If
+your method or its parts do not need BKL - better yet, now you can
+shift lock_kernel() and unlock_kernel() so that they would protect
+exactly what needs to be protected.
+
+---
+[mandatory]
+
+->statfs() is now called without BKL held. BKL should have been
+shifted into individual fs sb_op functions where it's not clear that
+it's safe to remove it. If you don't need it, remove it.
+
+---
+[mandatory]
+
+ is_read_only() is gone; use bdev_read_only() instead.
+
+---
+[mandatory]
+
+ destroy_buffers() is gone; use invalidate_bdev().
+
+---
+[mandatory]
+
+ fsync_dev() is gone; use fsync_bdev(). NOTE: lvm breakage is
+deliberate; as soon as struct block_device * is propagated in a reasonable
+way by that code fixing will become trivial; until then nothing can be
+done.
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
new file mode 100644
index 0000000..cbe85c1
--- /dev/null
+++ b/Documentation/filesystems/proc.txt
@@ -0,0 +1,1940 @@
+------------------------------------------------------------------------------
+ T H E /proc F I L E S Y S T E M
+------------------------------------------------------------------------------
+/proc/sys Terrehon Bowden <terrehon@pacbell.net> October 7 1999
+ Bodo Bauer <bb@ricochet.net>
+
+2.4.x update Jorge Nerin <comandante@zaralinux.com> November 14 2000
+------------------------------------------------------------------------------
+Version 1.3 Kernel version 2.2.12
+ Kernel version 2.4.0-test11-pre4
+------------------------------------------------------------------------------
+
+Table of Contents
+-----------------
+
+ 0 Preface
+ 0.1 Introduction/Credits
+ 0.2 Legal Stuff
+
+ 1 Collecting System Information
+ 1.1 Process-Specific Subdirectories
+ 1.2 Kernel data
+ 1.3 IDE devices in /proc/ide
+ 1.4 Networking info in /proc/net
+ 1.5 SCSI info
+ 1.6 Parallel port info in /proc/parport
+ 1.7 TTY info in /proc/tty
+ 1.8 Miscellaneous kernel statistics in /proc/stat
+
+ 2 Modifying System Parameters
+ 2.1 /proc/sys/fs - File system data
+ 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
+ 2.3 /proc/sys/kernel - general kernel parameters
+ 2.4 /proc/sys/vm - The virtual memory subsystem
+ 2.5 /proc/sys/dev - Device specific parameters
+ 2.6 /proc/sys/sunrpc - Remote procedure calls
+ 2.7 /proc/sys/net - Networking stuff
+ 2.8 /proc/sys/net/ipv4 - IPV4 settings
+ 2.9 Appletalk
+ 2.10 IPX
+ 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem
+
+------------------------------------------------------------------------------
+Preface
+------------------------------------------------------------------------------
+
+0.1 Introduction/Credits
+------------------------
+
+This documentation is part of a soon (or so we hope) to be released book on
+the SuSE Linux distribution. As there is no complete documentation for the
+/proc file system and we've used many freely available sources to write these
+chapters, it seems only fair to give the work back to the Linux community.
+This work is based on the 2.2.* kernel version and the upcoming 2.4.*. I'm
+afraid it's still far from complete, but we hope it will be useful. As far as
+we know, it is the first 'all-in-one' document about the /proc file system. It
+is focused on the Intel x86 hardware, so if you are looking for PPC, ARM,
+SPARC, AXP, etc., features, you probably won't find what you are looking for.
+It also only covers IPv4 networking, not IPv6 nor other protocols - sorry. But
+additions and patches are welcome and will be added to this document if you
+mail them to Bodo.
+
+We'd like to thank Alan Cox, Rik van Riel, and Alexey Kuznetsov and a lot of
+other people for help compiling this documentation. We'd also like to extend a
+special thank you to Andi Kleen for documentation, which we relied on heavily
+to create this document, as well as the additional information he provided.
+Thanks to everybody else who contributed source or docs to the Linux kernel
+and helped create a great piece of software... :)
+
+If you have any comments, corrections or additions, please don't hesitate to
+contact Bodo Bauer at bb@ricochet.net. We'll be happy to add them to this
+document.
+
+The latest version of this document is available online at
+http://skaro.nightcrawler.com/~bb/Docs/Proc as HTML version.
+
+If the above address does not work for you, you could try the kernel
+mailing list at linux-kernel@vger.kernel.org and/or try to reach me at
+comandante@zaralinux.com.
+
+0.2 Legal Stuff
+---------------
+
+We don't guarantee the correctness of this document, and if you come to us
+complaining about how you screwed up your system because of incorrect
+documentation, we won't feel responsible...
+
+------------------------------------------------------------------------------
+CHAPTER 1: COLLECTING SYSTEM INFORMATION
+------------------------------------------------------------------------------
+
+------------------------------------------------------------------------------
+In This Chapter
+------------------------------------------------------------------------------
+* Investigating the properties of the pseudo file system /proc and its
+ ability to provide information on the running Linux system
+* Examining /proc's structure
+* Uncovering various information about the kernel and the processes running
+ on the system
+------------------------------------------------------------------------------
+
+
+The proc file system acts as an interface to internal data structures in the
+kernel. It can be used to obtain information about the system and to change
+certain kernel parameters at runtime (sysctl).
+
+First, we'll take a look at the read-only parts of /proc. In Chapter 2, we
+show you how you can use /proc/sys to change settings.
+
+1.1 Process-Specific Subdirectories
+-----------------------------------
+
+The directory /proc contains (among other things) one subdirectory for each
+process running on the system, which is named after the process ID (PID).
+
+The link self points to the process reading the file system. Each process
+subdirectory has the entries listed in Table 1-1.
+
+
+Table 1-1: Process specific entries in /proc
+..............................................................................
+ File Content
+ cmdline Command line arguments
+ cpu Current and last cpu in which it was executed (2.4)(smp)
+ cwd Link to the current working directory
+ environ Values of environment variables
+ exe Link to the executable of this process
+ fd Directory, which contains all file descriptors
+ maps Memory maps to executables and library files (2.4)
+ mem Memory held by this process
+ root Link to the root directory of this process
+ stat Process status
+ statm Process memory status information
+ status Process status in human readable form
+ wchan If CONFIG_KALLSYMS is set, a pre-decoded wchan
+..............................................................................
+
+For example, to get the status information of a process, all you have to do is
+read the file /proc/PID/status:
+
+ >cat /proc/self/status
+ Name: cat
+ State: R (running)
+ Pid: 5452
+ PPid: 743
+ TracerPid: 0 (2.4)
+ Uid: 501 501 501 501
+ Gid: 100 100 100 100
+ Groups: 100 14 16
+ VmSize: 1112 kB
+ VmLck: 0 kB
+ VmRSS: 348 kB
+ VmData: 24 kB
+ VmStk: 12 kB
+ VmExe: 8 kB
+ VmLib: 1044 kB
+ SigPnd: 0000000000000000
+ SigBlk: 0000000000000000
+ SigIgn: 0000000000000000
+ SigCgt: 0000000000000000
+ CapInh: 00000000fffffeff
+ CapPrm: 0000000000000000
+ CapEff: 0000000000000000
+
+
+This shows you nearly the same information you would get if you viewed it with
+the ps command. In fact, ps uses the proc file system to obtain its
+information. The statm file contains more detailed information about the
+process memory usage. Its seven fields are explained in Table 1-2.
+
+
+Table 1-2: Contents of the statm files (as of 2.6.8-rc3)
+..............................................................................
+ Field Content
+ size total program size (pages) (same as VmSize in status)
+ resident size of memory portions (pages) (same as VmRSS in status)
+ shared number of pages that are shared (i.e. backed by a file)
+ trs number of pages that are 'code' (not including libs; broken,
+ includes data segment)
+ lrs number of pages of library (always 0 on 2.6)
+ drs number of pages of data/stack (including libs; broken,
+ includes library text)
+ dt number of dirty pages (always 0 on 2.6)
+..............................................................................
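+
+As a rough illustration of the seven fields in Table 1-2, the sketch below
+splits one statm line into a structure. The struct and function names are
+made up for this example; the field order is taken from the table above.

```c
#include <stdio.h>

/* One field per column of Table 1-2; all values are page counts. */
struct statm {
	unsigned long size, resident, shared, trs, lrs, drs, dt;
};

/* Returns 0 on success, -1 if the line does not hold seven numbers. */
int parse_statm(const char *line, struct statm *out)
{
	int n = sscanf(line, "%lu %lu %lu %lu %lu %lu %lu",
		       &out->size, &out->resident, &out->shared,
		       &out->trs, &out->lrs, &out->drs, &out->dt);

	return n == 7 ? 0 : -1;
}
```
+
+A caller would read /proc/PID/statm into a buffer and pass that buffer to
+parse_statm(); multiply the page counts by the page size to get bytes.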
+
+1.2 Kernel data
+---------------
+
+Similar to the process entries, the kernel data files give information about
+the running kernel. The files used to obtain this information are contained in
+/proc and are listed in Table 1-3. Not all of these will be present in your
+system. Which files are present and which are missing depends on the kernel
+configuration and the loaded modules.
+
+Table 1-3: Kernel info in /proc
+..............................................................................
+ File Content
+ apm Advanced power management info
+ buddyinfo Kernel memory allocator information (see text) (2.5)
+ bus Directory containing bus specific information
+ cmdline Kernel command line
+ cpuinfo Info about the CPU
+ devices Available devices (block and character)
+ dma Used DMA channels
+ filesystems Supported filesystems
+ driver Various drivers grouped here, currently rtc (2.4)
+ execdomains Execdomains, related to security (2.4)
+ fb Frame Buffer devices (2.4)
+ fs File system parameters, currently nfs/exports (2.4)
+ ide Directory containing info about the IDE subsystem
+ interrupts Interrupt usage
+ iomem Memory map (2.4)
+ ioports I/O port usage
+ irq Masks for irq to cpu affinity (2.4)(smp?)
+ isapnp ISA PnP (Plug&Play) Info (2.4)
+ kcore Kernel core image (can be ELF or A.OUT(deprecated in 2.4))
+ kmsg Kernel messages
+ ksyms Kernel symbol table
+ loadavg Load average of last 1, 5 & 15 minutes
+ locks Kernel locks
+ meminfo Memory info
+ misc Miscellaneous
+ modules List of loaded modules
+ mounts Mounted filesystems
+ net Networking info (see text)
+ partitions Table of partitions known to the system
+ pci Deprecated info of PCI bus (new way -> /proc/bus/pci/,
+ decoupled by lspci) (2.4)
+ rtc Real time clock
+ scsi SCSI info (see text)
+ slabinfo Slab pool info
+ stat Overall statistics
+ swaps Swap space utilization
+ sys See chapter 2
+ sysvipc Info of SysVIPC Resources (msg, sem, shm) (2.4)
+ tty Info of tty drivers
+ uptime System uptime
+ version Kernel version
+ video bttv info of video resources (2.4)
+..............................................................................
+
+You can, for example, check which interrupts are currently in use and what
+they are used for by looking in the file /proc/interrupts:
+
+ > cat /proc/interrupts
+ CPU0
+ 0: 8728810 XT-PIC timer
+ 1: 895 XT-PIC keyboard
+ 2: 0 XT-PIC cascade
+ 3: 531695 XT-PIC aha152x
+ 4: 2014133 XT-PIC serial
+ 5: 44401 XT-PIC pcnet_cs
+ 8: 2 XT-PIC rtc
+ 11: 8 XT-PIC i82365
+ 12: 182918 XT-PIC PS/2 Mouse
+ 13: 1 XT-PIC fpu
+ 14: 1232265 XT-PIC ide0
+ 15: 7 XT-PIC ide1
+ NMI: 0
+
+In 2.4.* a couple of lines were added to this file, LOC and ERR (this time the
+output is from an SMP machine):
+
+ > cat /proc/interrupts
+
+ CPU0 CPU1
+ 0: 1243498 1214548 IO-APIC-edge timer
+ 1: 8949 8958 IO-APIC-edge keyboard
+ 2: 0 0 XT-PIC cascade
+ 5: 11286 10161 IO-APIC-edge soundblaster
+ 8: 1 0 IO-APIC-edge rtc
+ 9: 27422 27407 IO-APIC-edge 3c503
+ 12: 113645 113873 IO-APIC-edge PS/2 Mouse
+ 13: 0 0 XT-PIC fpu
+ 14: 22491 24012 IO-APIC-edge ide0
+ 15: 2183 2415 IO-APIC-edge ide1
+ 17: 30564 30414 IO-APIC-level eth0
+ 18: 177 164 IO-APIC-level bttv
+ NMI: 2457961 2457959
+ LOC: 2457882 2457881
+ ERR: 2155
+
+NMI is incremented in this case because every timer interrupt generates an NMI
+(Non Maskable Interrupt) which is used by the NMI Watchdog to detect lockups.
+
+LOC is the local interrupt counter of the internal APIC of every CPU.
+
+ERR is incremented in the case of errors in the IO-APIC bus (the bus that
+connects the CPUs in an SMP system). This means that an error has been detected
+and the IO-APIC automatically retries the transmission, so it should not be a
+big problem, but you should read the SMP-FAQ.
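+
+To compare counters between the two listings above, it can help to add up the
+per-CPU columns of a line. The following is only a sketch that assumes the
+layout shown in the examples (a label ending in a colon, then one decimal
+counter per CPU, then optional text); irq_line_total() is a hypothetical
+helper name.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Sum the per-CPU counters on one /proc/interrupts line.
 * Returns the total, or -1 if the line has no "label:" prefix
 * (for example the "CPU0 CPU1" header line).
 */
long long irq_line_total(const char *line)
{
	const char *p = line;
	long long total = 0;

	/* Skip the IRQ number or name up to and including the colon. */
	while (*p && *p != ':')
		p++;
	if (*p != ':')
		return -1;
	p++;

	for (;;) {
		char *end;
		unsigned long long n;

		while (*p == ' ' || *p == '\t')
			p++;
		if (!isdigit((unsigned char)*p))
			break;		/* controller name or end of line */
		n = strtoull(p, &end, 10);
		/* A counter must be followed by whitespace or end of line. */
		if (*end != '\0' && !isspace((unsigned char)*end))
			break;
		total += (long long)n;
		p = end;
	}
	return total;
}
```
+
+The loop stops at the first token that is not a plain number, so the
+controller type (XT-PIC, IO-APIC-edge) and the device name are never counted.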
+
+In this context it could be interesting to note the new irq directory in 2.4.
+It can be used to set IRQ-to-CPU affinity, which means that you can "hook" an
+IRQ to only one CPU, or exclude a CPU from handling IRQs. The irq subdirectory
+contains one subdirectory for each IRQ and one file, prof_cpu_mask.
+
+For example
+ > ls /proc/irq/
+ 0 10 12 14 16 18 2 4 6 8 prof_cpu_mask
+ 1 11 13 15 17 19 3 5 7 9
+ > ls /proc/irq/0/
+ smp_affinity
+
+The contents of the prof_cpu_mask file and of the smp_affinity file for each
+IRQ are the same by default:
+
+ > cat /proc/irq/0/smp_affinity
+ ffffffff
+
+It's a bitmask in which you can specify which CPUs can handle the IRQ. You can
+set it by doing:
+
+ > echo 1 > /proc/irq/0/smp_affinity
+
+This means that only the first CPU will handle the IRQ, but you can also echo 5,
+which means that only the first and third CPU can handle the IRQ.
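+
+The mask is written in hexadecimal and bit N stands for CPU N, counting from
+zero. A small sketch of how such a value can be decoded (parse_affinity() and
+mask_allows_cpu() are hypothetical helper names, not kernel interfaces):

```c
#include <limits.h>
#include <stdlib.h>

/* Parse an smp_affinity string such as "ffffffff" or "5" (hexadecimal). */
unsigned long parse_affinity(const char *text)
{
	return strtoul(text, NULL, 16);
}

/* Non-zero if CPU number cpu (counting from 0) may handle the IRQ. */
int mask_allows_cpu(unsigned long mask, unsigned int cpu)
{
	if (cpu >= sizeof(mask) * CHAR_BIT)
		return 0;	/* bit does not exist in this mask */
	return (int)((mask >> cpu) & 1UL);
}
```
+
+For the value 5 (binary 101) this reports CPU 0 and CPU 2 as allowed, that is
+the first and the third CPU.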
+
+The way IRQs are routed is handled by the IO-APIC, and it's Round Robin
+between all the CPUs which are allowed to handle it. As usual the kernel has
+more info than you and does a better job than you, so the defaults are the
+best choice for almost everyone.
+
+There are three more important subdirectories in /proc: net, scsi, and sys.
+The general rule is that the contents, or even the existence of these
+directories, depend on your kernel configuration. If SCSI is not enabled, the
+directory scsi may not exist. The same is true of net, which is there
+only when networking support is present in the running kernel.
+
+The slabinfo file gives information about memory usage at the slab level.
+Linux uses slab pools for memory management above page level in version 2.2.
+Commonly used objects have their own slab pool (such as network buffers,
+directory cache, and so on).
+
+..............................................................................
+
+> cat /proc/buddyinfo
+
+Node 0, zone DMA 0 4 5 4 4 3 ...
+Node 0, zone Normal 1 0 0 1 101 8 ...
+Node 0, zone HighMem 2 0 0 1 1 0 ...
+
+Memory fragmentation is a problem under some workloads, and buddyinfo is a
+useful tool for helping diagnose these problems. Buddyinfo will give you a
+clue as to how big an area you can safely allocate, or why a previous
+allocation failed.
+
+Each column represents the number of pages of a certain order which are
+available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in
+ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
+available in ZONE_NORMAL, etc...
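+
+Since column i counts free blocks of 2^i pages, the free memory of a zone can
+be added up from one row. The sketch below does this for an array of counts;
+the function names are invented for illustration, and the example rows above
+are truncated ("..."), so only the columns actually shown can be summed.

```c
#include <stddef.h>

/* Total free pages in a zone: counts[i] blocks of 2^i pages each. */
unsigned long buddy_free_pages(const unsigned long *counts, size_t orders)
{
	unsigned long pages = 0;
	size_t order;

	for (order = 0; order < orders; order++)
		pages += counts[order] << order;
	return pages;
}

/* Largest order with at least one free block, or -1 if there is none. */
int buddy_largest_order(const unsigned long *counts, size_t orders)
{
	size_t order = orders;

	while (order-- > 0)
		if (counts[order] != 0)
			return (int)order;
	return -1;
}
```
+
+The largest order with a non-zero count bounds the biggest physically
+contiguous allocation that can currently succeed in that zone.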
+
+..............................................................................
+
+meminfo:
+
+Provides information about distribution and utilization of memory. This
+varies by architecture and compile options. The following is from a
+16GB PIII, which has highmem enabled. You may not have all of these fields.
+
+> cat /proc/meminfo
+
+
+MemTotal: 16344972 kB
+MemFree: 13634064 kB
+Buffers: 3656 kB
+Cached: 1195708 kB
+SwapCached: 0 kB
+Active: 891636 kB
+Inactive: 1077224 kB
+HighTotal: 15597528 kB
+HighFree: 13629632 kB
+LowTotal: 747444 kB
+LowFree: 4432 kB
+SwapTotal: 0 kB
+SwapFree: 0 kB
+Dirty: 968 kB
+Writeback: 0 kB
+Mapped: 280372 kB
+Slab: 684068 kB
+CommitLimit: 7669796 kB
+Committed_AS: 100056 kB
+PageTables: 24448 kB
+VmallocTotal: 112216 kB
+VmallocUsed: 428 kB
+VmallocChunk: 111088 kB
+
+ MemTotal: Total usable ram (i.e. physical ram minus a few reserved
+ bits and the kernel binary code)
+ MemFree: The sum of LowFree+HighFree
+ Buffers: Relatively temporary storage for raw disk blocks;
+ shouldn't get tremendously large (20MB or so)
+ Cached: in-memory cache for files read from the disk (the
+ pagecache). Doesn't include SwapCached
+ SwapCached: Memory that once was swapped out, is swapped back in but
+ still also is in the swapfile (if memory is needed it
+ doesn't need to be swapped out AGAIN because it is already
+ in the swapfile. This saves I/O)
+ Active: Memory that has been used more recently and usually not
+ reclaimed unless absolutely necessary.
+ Inactive: Memory which has been less recently used. It is more
+ eligible to be reclaimed for other purposes
+ HighTotal:
+ HighFree: Highmem is all memory above ~860MB of physical memory
+ Highmem areas are for use by userspace programs, or
+ for the pagecache. The kernel must use tricks to access
+ this memory, making it slower to access than lowmem.
+ LowTotal:
+ LowFree: Lowmem is memory which can be used for everything that
+ highmem can be used for, but it is also available for the
+ kernel's use for its own data structures. Among many
+ other things, it is where everything from the Slab is
+ allocated. Bad things happen when you're out of lowmem.
+ SwapTotal: total amount of swap space available
+ SwapFree: amount of swap space that is currently unused (memory
+ evicted from RAM occupies the rest, SwapTotal - SwapFree)
+ Dirty: Memory which is waiting to get written back to the disk
+ Writeback: Memory which is actively being written back to the disk
+ Mapped: files which have been mmaped, such as libraries
+ Slab: in-kernel data structures cache
+ CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'),
+ this is the total amount of memory currently available to
+ be allocated on the system. This limit is only adhered to
+ if strict overcommit accounting is enabled (mode 2 in
+ 'vm.overcommit_memory').
+ The CommitLimit is calculated with the following formula:
+ CommitLimit = ('vm.overcommit_ratio' * Physical RAM) + Swap
+ where 'vm.overcommit_ratio' is taken as a percentage.
+ For example, on a system with 1G of physical RAM and 7G
+ of swap with a 'vm.overcommit_ratio' of 30 it would
+ yield a CommitLimit of 7.3G.
+ For more details, see the memory overcommit documentation
+ in vm/overcommit-accounting.
+Committed_AS: The amount of memory presently allocated on the system.
+ The committed memory is a sum of all of the memory which
+ has been allocated by processes, even if it has not been
+ "used" by them as of yet. A process which malloc()'s 1G
+ of memory, but only touches 300M of it will only show up
+ as using 300M of memory even if it has the address space
+ allocated for the entire 1G. This 1G is memory which has
+ been "committed" to by the VM and can be used at any time
+ by the allocating application. With strict overcommit
+ enabled on the system (mode 2 in 'vm.overcommit_memory'),
+ allocations which would exceed the CommitLimit (detailed
+ above) will not be permitted. This is useful if one needs
+ to guarantee that processes will not fail due to lack of
+ memory once that memory has been successfully allocated.
+ PageTables: amount of memory dedicated to the lowest level of page
+ tables.
+VmallocTotal: total size of vmalloc memory area
+ VmallocUsed: amount of vmalloc area which is used
+VmallocChunk: largest contiguous block of vmalloc area which is free
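+
+The relationships described above can be checked with a few lines of Python.
+This is only a sketch: parse_meminfo() and commit_limit_kb() are illustrative
+names, not kernel or library interfaces, and the sample text is an excerpt of
+the output shown earlier rather than a live read of /proc/meminfo.
+
```python
# Sketch only: the helper names below are hypothetical.
SAMPLE = """\
MemTotal: 16344972 kB
MemFree: 13634064 kB
HighTotal: 15597528 kB
HighFree: 13629632 kB
LowTotal: 747444 kB
LowFree: 4432 kB
SwapTotal: 0 kB
"""


def parse_meminfo(text):
    """Return {field: value in kB} for every 'Name: <number> kB' line."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio):
    """CommitLimit = overcommit_ratio percent of physical RAM, plus swap."""
    return ram_kb * overcommit_ratio // 100 + swap_kb


info = parse_meminfo(SAMPLE)
# MemFree is documented as the sum of LowFree and HighFree.
print("MemFree:", info["MemFree"],
      "LowFree+HighFree:", info["LowFree"] + info["HighFree"])

GIB_KB = 1024 * 1024
limit = commit_limit_kb(1 * GIB_KB, 7 * GIB_KB, 30)
print("CommitLimit for 1G RAM, 7G swap, ratio 30: %.1fG" % (limit / GIB_KB))
```
+
+On a live system the same parser could be fed the contents of /proc/meminfo
+instead of the embedded sample.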
+
+
+1.3 IDE devices in /proc/ide
+----------------------------
+
+The subdirectory /proc/ide contains information about all IDE devices of which
+the kernel is aware. There is one subdirectory for each IDE controller, the
+file drivers and a link for each IDE device, pointing to the device directory
+in the controller specific subtree.
+
+The file drivers contains general information about the drivers used for the
+IDE devices:
+
+ > cat /proc/ide/drivers
+ ide-cdrom version 4.53
+ ide-disk version 1.08
+
+More detailed information can be found in the controller specific
+subdirectories. These are named ide0, ide1 and so on. Each of these
+directories contains the files shown in table 1-4.
+
+
+Table 1-4: IDE controller info in /proc/ide/ide?
+..............................................................................
+ File Content
+ channel IDE channel (0 or 1)
+ config Configuration (only for PCI/IDE bridge)
+ mate Mate name
+ model Type/Chipset of IDE controller
+..............................................................................
+
+Each device connected to a controller has a separate subdirectory in the
+controller's directory. The files listed in table 1-5 are contained in these
+directories.
+
+
+Table 1-5: IDE device information
+..............................................................................
+ File Content
+ cache The cache
+ capacity Capacity of the medium (in 512-byte blocks)
+ driver driver and version
+ geometry physical and logical geometry
+ identify device identify block
+ media media type
+ model device identifier
+ settings device setup
+ smart_thresholds IDE disk management thresholds
+ smart_values IDE disk management values
+..............................................................................
+
+The most interesting file is settings. This file contains a nice overview of
+the drive parameters:
+
+ # cat /proc/ide/ide0/hda/settings
+ name value min max mode
+ ---- ----- --- --- ----
+ bios_cyl 526 0 65535 rw
+ bios_head 255 0 255 rw
+ bios_sect 63 0 63 rw
+ breada_readahead 4 0 127 rw
+ bswap 0 0 1 r
+ file_readahead 72 0 2097151 rw
+ io_32bit 0 0 3 rw
+ keepsettings 0 0 1 rw
+ max_kb_per_request 122 1 127 rw
+ multcount 0 0 8 rw
+ nice1 1 0 1 rw
+ nowerr 0 0 1 rw
+ pio_mode write-only 0 255 w
+ slow 0 0 1 rw
+ unmaskirq 0 0 1 rw
+ using_dma 0 0 1 rw
+
+
+1.4 Networking info in /proc/net
+--------------------------------
+
+The subdirectory /proc/net follows the usual pattern. Table 1-6 shows the
+additional values you get for IP version 6 if you configure the kernel to
+support this. Table 1-7 lists the files and their meaning.
+
+
+Table 1-6: IPv6 info in /proc/net
+..............................................................................
+ File Content
+ udp6 UDP sockets (IPv6)
+ tcp6 TCP sockets (IPv6)
+ raw6 Raw device statistics (IPv6)
+ igmp6 IP multicast addresses, which this host joined (IPv6)
+ if_inet6 List of IPv6 interface addresses
+ ipv6_route Kernel routing table for IPv6
+ rt6_stats Global IPv6 routing tables statistics
+ sockstat6 Socket statistics (IPv6)
+ snmp6 SNMP data (IPv6)
+..............................................................................
+
+
+Table 1-7: Network info in /proc/net
+..............................................................................
+ File Content
+ arp Kernel ARP table
+ dev network devices with statistics
+ dev_mcast the Layer2 multicast groups a device is listening to
+ (interface index, label, number of references, number of bound
+ addresses).
+ dev_stat network device status
+ ip_fwchains Firewall chain linkage
+ ip_fwnames Firewall chain names
+ ip_masq Directory containing the masquerading tables
+ ip_masquerade Major masquerading table
+ netstat Network statistics
+ raw raw device statistics
+ route Kernel routing table
+ rpc Directory containing rpc info
+ rt_cache Routing cache
+ snmp SNMP data
+ sockstat Socket statistics
+ tcp TCP sockets
+ tr_rif Token ring RIF routing table
+ udp UDP sockets
+ unix UNIX domain sockets
+ wireless Wireless interface data (Wavelan etc)
+ igmp IP multicast addresses, which this host joined
+ psched Global packet scheduler parameters.
+ netlink List of PF_NETLINK sockets
+ ip_mr_vifs List of multicast virtual interfaces
+ ip_mr_cache List of multicast routing cache
+..............................................................................
+
+You can use this information to see which network devices are available in
+your system and how much traffic was routed over those devices:
+
+ > cat /proc/net/dev
+ Inter-|Receive |[...
+ face |bytes packets errs drop fifo frame compressed multicast|[...
+ lo: 908188 5596 0 0 0 0 0 0 [...
+ ppp0:15475140 20721 410 0 0 410 0 0 [...
+ eth0: 614530 7085 0 0 0 0 0 1 [...
+
+ ...] Transmit
+ ...] bytes packets errs drop fifo colls carrier compressed
+ ...] 908188 5596 0 0 0 0 0 0
+ ...] 1375103 17405 0 0 0 0 0 0
+ ...] 1703981 5535 0 0 0 3 0 0
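+
+Since the output above is wrapped, it may help to see the layout spelled out:
+each interface line carries eight receive counters followed by eight transmit
+counters. The following Python sketch parses that layout. parse_net_dev() is
+an illustrative name, and the sample is the output above with the wrapped
+halves joined back together.
+
```python
# Sketch only: parse_net_dev() is a hypothetical helper.
RX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "frame", "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "colls", "carrier", "compressed"]

SAMPLE = """\
Inter-|   Receive                          |  Transmit
 face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
    lo:  908188  5596   0 0 0   0 0 0   908188  5596 0 0 0 0 0 0
  ppp0:15475140 20721 410 0 0 410 0 0  1375103 17405 0 0 0 0 0 0
  eth0:  614530  7085   0 0 0   0 0 1  1703981  5535 0 0 0 3 0 0
"""


def parse_net_dev(text):
    """Return {interface: {"rx": {...}, "tx": {...}}} from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines()[2:]:      # skip the two header lines
        # Split on the colon first: the name and the first counter may
        # touch, as in "ppp0:15475140".
        name, sep, counters = line.partition(":")
        if not sep:
            continue
        values = [int(v) for v in counters.split()]
        stats[name.strip()] = {
            "rx": dict(zip(RX_FIELDS, values[:8])),
            "tx": dict(zip(TX_FIELDS, values[8:16])),
        }
    return stats


for iface, counters in parse_net_dev(SAMPLE).items():
    print(iface, "rx bytes:", counters["rx"]["bytes"],
          "tx bytes:", counters["tx"]["bytes"])
```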
+
+In addition, each Channel Bond interface has its own directory. For
+example, the bond0 device will have a directory called /proc/net/bond0/.
+It will contain information that is specific to that bond, such as the
+current slaves of the bond, the link status of the slaves, and how
+many times each slave's link has failed.
+
+1.5 SCSI info
+-------------
+
+If you have a SCSI host adapter in your system, you'll find a subdirectory
+named after the driver for this adapter in /proc/scsi. You'll also see a list
+of all recognized SCSI devices in /proc/scsi/scsi:
+
+ > cat /proc/scsi/scsi
+ Attached devices:
+ Host: scsi0 Channel: 00 Id: 00 Lun: 00
+ Vendor: IBM Model: DGHS09U Rev: 03E0
+ Type: Direct-Access ANSI SCSI revision: 03
+ Host: scsi0 Channel: 00 Id: 06 Lun: 00
+ Vendor: PIONEER Model: CD-ROM DR-U06S Rev: 1.04
+ Type: CD-ROM ANSI SCSI revision: 02
+
+
+The directory named after the driver has one file for each adapter found in
+the system. These files contain information about the controller, including
+the used IRQ and the IO address range. The amount of information shown is
+dependent on the adapter you use. The example shows the output for an Adaptec
+AHA-2940 SCSI adapter:
+
+ > cat /proc/scsi/aic7xxx/0
+
+ Adaptec AIC7xxx driver version: 5.1.19/3.2.4
+ Compile Options:
+ TCQ Enabled By Default : Disabled
+ AIC7XXX_PROC_STATS : Disabled
+ AIC7XXX_RESET_DELAY : 5
+ Adapter Configuration:
+ SCSI Adapter: Adaptec AHA-294X Ultra SCSI host adapter
+ Ultra Wide Controller
+ PCI MMAPed I/O Base: 0xeb001000
+ Adapter SEEPROM Config: SEEPROM found and used.
+ Adaptec SCSI BIOS: Enabled
+ IRQ: 10
+ SCBs: Active 0, Max Active 2,
+ Allocated 15, HW 16, Page 255
+ Interrupts: 160328
+ BIOS Control Word: 0x18b6
+ Adapter Control Word: 0x005b
+ Extended Translation: Enabled
+ Disconnect Enable Flags: 0xffff
+ Ultra Enable Flags: 0x0001
+ Tag Queue Enable Flags: 0x0000
+ Ordered Queue Tag Flags: 0x0000
+ Default Tag Queue Depth: 8
+ Tagged Queue By Device array for aic7xxx host instance 0:
+ {255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255}
+ Actual queue depth per device for aic7xxx host instance 0:
+ {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}
+ Statistics:
+ (scsi0:0:0:0)
+ Device using Wide/Sync transfers at 40.0 MByte/sec, offset 8
+ Transinfo settings: current(12/8/1/0), goal(12/8/1/0), user(12/15/1/0)
+ Total transfers 160151 (74577 reads and 85574 writes)
+ (scsi0:0:6:0)
+ Device using Narrow/Sync transfers at 5.0 MByte/sec, offset 15
+ Transinfo settings: current(50/15/0/0), goal(50/15/0/0), user(50/15/0/0)
+ Total transfers 0 (0 reads and 0 writes)
+
+
+1.6 Parallel port info in /proc/parport
+---------------------------------------
+
+The directory /proc/parport contains information about the parallel ports of
+your system. It has one subdirectory for each port, named after the port
+number (0,1,2,...).
+
+These directories contain the four files shown in Table 1-8.
+
+
+Table 1-8: Files in /proc/parport
+..............................................................................
+ File Content
+ autoprobe Any IEEE-1284 device ID information that has been acquired.
+ devices list of the device drivers using that port. A + will appear by the
+ name of the device currently using the port (it might not appear
+ against any).
+ hardware Parallel port's base address, IRQ line and DMA channel.
+ irq IRQ that parport is using for that port. This is in a separate
+ file to allow you to alter it by writing a new value in (IRQ
+ number or none).
+..............................................................................
+
+1.7 TTY info in /proc/tty
+-------------------------
+
+Information about the available and actually used tty's can be found in the
+directory /proc/tty. You'll find entries for drivers and line disciplines in
+this directory, as shown in Table 1-9.
+
+
+Table 1-9: Files in /proc/tty
+..............................................................................
+ File Content
+ drivers list of drivers and their usage
+ ldiscs registered line disciplines
+ driver/serial usage statistic and status of single tty lines
+..............................................................................
+
+To see which tty's are currently in use, you can simply look into the file
+/proc/tty/drivers:
+
+ > cat /proc/tty/drivers
+ pty_slave /dev/pts 136 0-255 pty:slave
+ pty_master /dev/ptm 128 0-255 pty:master
+ pty_slave /dev/ttyp 3 0-255 pty:slave
+ pty_master /dev/pty 2 0-255 pty:master
+ serial /dev/cua 5 64-67 serial:callout
+ serial /dev/ttyS 4 64-67 serial
+ /dev/tty0 /dev/tty0 4 0 system:vtmaster
+ /dev/ptmx /dev/ptmx 5 2 system
+ /dev/console /dev/console 5 1 system:console
+ /dev/tty /dev/tty 5 0 system:/dev/tty
+ unknown /dev/tty 4 1-63 console
+
+
+1.8 Miscellaneous kernel statistics in /proc/stat
+-------------------------------------------------
+
+Various pieces of information about kernel activity are available in the
+/proc/stat file. All of the numbers reported in this file are aggregates
+since the system first booted. For a quick look, simply cat the file:
+
+ > cat /proc/stat
+ cpu 2255 34 2290 22625563 6290 127 456
+ cpu0 1132 34 1441 11311718 3675 127 438
+ cpu1 1123 0 849 11313845 2614 0 18
+ intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...]
+ ctxt 1990473
+ btime 1062191376
+ processes 2915
+ procs_running 1
+ procs_blocked 0
+
+The very first "cpu" line aggregates the numbers in all of the other "cpuN"
+lines. These numbers identify the amount of time the CPU has spent performing
+different kinds of work. Time units are in USER_HZ (typically hundredths of a
+second). The meanings of the columns are as follows, from left to right:
+
+- user: normal processes executing in user mode
+- nice: niced processes executing in user mode
+- system: processes executing in kernel mode
+- idle: twiddling thumbs
+- iowait: waiting for I/O to complete
+- irq: servicing interrupts
+- softirq: servicing softirqs
+
+The "intr" line gives counts of interrupts serviced since boot time, for each
+of the possible system interrupts. The first column is the total of all
+interrupts serviced; each subsequent column is the total for that particular
+interrupt.
+
+The "ctxt" line gives the total number of context switches across all CPUs.
+
+The "btime" line gives the time at which the system booted, in seconds since
+the Unix epoch.
+
+The "processes" line gives the number of processes and threads created, which
+includes (but is not limited to) those created by calls to the fork() and
+clone() system calls.
+
+The "procs_running" line gives the number of processes currently running on
+CPUs.
+
+The "procs_blocked" line gives the number of processes currently blocked,
+waiting for I/O to complete.
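+
+As a rough illustration of how these lines can be consumed, the Python sketch
+below splits the "cpu" lines into the seven columns listed above and turns
+them into percentages of total time. parse_stat() and shares() are
+illustrative names, and the sample is the output shown earlier with the
+"intr" line left out.
+
```python
# Sketch only: parse_stat() and shares() are hypothetical helpers.
CPU_FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq"]

SAMPLE = """\
cpu 2255 34 2290 22625563 6290 127 456
cpu0 1132 34 1441 11311718 3675 127 438
cpu1 1123 0 849 11313845 2614 0 18
ctxt 1990473
btime 1062191376
processes 2915
procs_running 1
procs_blocked 0
"""


def parse_stat(text):
    """Return (per-cpu tick counts, single-valued counters)."""
    cpus, scalars = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, *values = line.split()
        if key.startswith("cpu"):
            cpus[key] = dict(zip(CPU_FIELDS, map(int, values)))
        elif len(values) == 1:
            scalars[key] = int(values[0])
    return cpus, scalars


def shares(ticks):
    """Express each column as a percentage of the sum of all columns."""
    total = sum(ticks.values())
    return {name: 100.0 * value / total for name, value in ticks.items()}


cpus, scalars = parse_stat(SAMPLE)
for name, percent in shares(cpus["cpu"]).items():
    print("%-8s %7.3f%%" % (name, percent))
print("context switches:", scalars["ctxt"])
```
+
+Because the counters are aggregates since boot, a monitoring tool would
+normally sample the file twice and work with the differences.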
+
+
+------------------------------------------------------------------------------
+Summary
+------------------------------------------------------------------------------
+The /proc file system serves information about the running system. It not only
+allows access to process data but also allows you to request the kernel status
+by reading files in the hierarchy.
+
+The directory structure of /proc reflects the types of information and makes
+it easy, if not always obvious, to find specific data.
+------------------------------------------------------------------------------
+
+------------------------------------------------------------------------------
+CHAPTER 2: MODIFYING SYSTEM PARAMETERS
+------------------------------------------------------------------------------
+
+------------------------------------------------------------------------------
+In This Chapter
+------------------------------------------------------------------------------
+* Modifying kernel parameters by writing into files found in /proc/sys
+* Exploring the files which modify certain parameters
+* Review of the /proc/sys file tree
+------------------------------------------------------------------------------
+
+
+A very interesting part of /proc is the directory /proc/sys. This is not only
+a source of information, it also allows you to change parameters within the
+kernel. Be very careful when attempting this. You can optimize your system,
+but you can also cause it to crash. Never alter kernel parameters on a
+production system. Set up a development machine and test to make sure that
+everything works the way you want it to. You may have no alternative but to
+reboot the machine once an error has been made.
+
+To change a value, simply echo the new value into the file. An example is
+given below in the section on the file system data. You need to be root to do
+this. You can create your own boot script to perform this every time your
+system boots.
+
+The files in /proc/sys can be used to fine tune and monitor miscellaneous and
+general things in the operation of the Linux kernel. Since some of the files
+can inadvertently disrupt your system, it is advisable to read both
+documentation and source before actually making adjustments. In any case, be
+very careful when writing to any of these files. The entries in /proc may
+change slightly between the 2.1.* and the 2.2 kernel, so if there is any doubt
+review the kernel documentation in the directory /usr/src/linux/Documentation.
+This chapter is heavily based on the documentation included in the pre 2.2
+kernels, and became part of it in version 2.2.1 of the Linux kernel.
+
+2.1 /proc/sys/fs - File system data
+-----------------------------------
+
+This subdirectory contains specific file system, file handle, inode, dentry
+and quota information.
+
+Currently, these files are in /proc/sys/fs:
+
+dentry-state
+------------
+
+Status of the directory cache. Since directory entries are dynamically
+allocated and deallocated, this file indicates the current status. It holds
+six values, of which the last two are not used and are always zero. The others
+are listed in table 2-1.
+
+
+Table 2-1: Status fields of the directory cache
+..............................................................................
+ Field Content
+ nr_dentry Number of allocated directory entries
+ nr_unused Number of unused cache entries
+ age_limit Age in seconds after which the entry may be reclaimed, when
+ memory is short
+ want_pages Used internally
+..............................................................................
+
+dquot-nr and dquot-max
+----------------------
+
+The file dquot-max shows the maximum number of cached disk quota entries.
+
+The file dquot-nr shows the number of allocated disk quota entries and the
+number of free disk quota entries.
+
+If the number of available cached disk quotas is very low and you have a large
+number of simultaneous system users, you might want to raise the limit.
+
+file-nr and file-max
+--------------------
+
+The kernel allocates file handles dynamically, but doesn't free them again at
+this time.
+
+The value in file-max denotes the maximum number of file handles that the
+Linux kernel will allocate. When you get a lot of error messages about running
+out of file handles, you might want to raise this limit. The default value is
+10% of RAM in kilobytes. To change it, just write the new number into the
+file:
+
+ # cat /proc/sys/fs/file-max
+ 4096
+ # echo 8192 > /proc/sys/fs/file-max
+ # cat /proc/sys/fs/file-max
+ 8192
+
+
+This method of revision is useful for all customizable parameters of the
+kernel - simply echo the new value to the corresponding file.
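+
+The same write-then-read-back pattern can be scripted. The Python sketch
+below is an assumption-laden illustration: read_param() and write_param() are
+made-up helper names, and the demonstration runs against a scratch file so
+that it does not need root. On a real system the path would be a file such as
+/proc/sys/fs/file-max.
+
```python
# Sketch only: read_param() and write_param() are hypothetical helpers.
import os
import tempfile


def read_param(path):
    """Return the current contents of a parameter file, without the newline."""
    with open(path) as f:
        return f.read().strip()


def write_param(path, value):
    """Write value the way 'echo value > path' would, then read it back."""
    with open(path, "w") as f:
        f.write("%s\n" % value)
    return read_param(path)


# Demonstration on a scratch file standing in for /proc/sys/fs/file-max.
with tempfile.TemporaryDirectory() as scratch:
    fake_file_max = os.path.join(scratch, "file-max")
    write_param(fake_file_max, 4096)
    print("before:", read_param(fake_file_max))
    print("after: ", write_param(fake_file_max, 8192))
```
+
+Reading the value back is worthwhile, since the kernel may reject or adjust a
+value that is written to a real parameter file.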
+
+Historically, the three values in file-nr denoted the number of allocated file
+handles, the number of allocated but unused file handles, and the maximum
+number of file handles. Linux 2.6 always reports 0 as the number of free file
+handles -- this is not an error, it just means that the number of allocated
+file handles exactly matches the number of used file handles.
+
+Attempts to allocate more file descriptors than file-max are reported with
+printk; look for "VFS: file-max limit <number> reached".
+
+inode-state and inode-nr
+------------------------
+
+The file inode-nr contains the first two items from inode-state, so we'll skip
+to that file...
+
+inode-state contains two actual numbers and five dummy values. The numbers
+are nr_inodes and nr_free_inodes (in order of appearance).
+
+nr_inodes
+~~~~~~~~~
+
+Denotes the number of inodes the system has allocated. This number will
+grow and shrink dynamically.
+
+nr_free_inodes
+~~~~~~~~~~~~~~
+
+Represents the number of free inodes, i.e. the number of in-use inodes is
+(nr_inodes - nr_free_inodes).
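+
+A small Python sketch of that subtraction follows. parse_inode_state() is an
+illustrative name and the sample string is invented; only the first two of
+the seven values are interpreted, as described above.
+
```python
# Sketch only: parse_inode_state() is a hypothetical helper and the
# sample values are made up for illustration.
def parse_inode_state(text):
    """Interpret the first two values of inode-state; the rest are dummies."""
    values = [int(v) for v in text.split()]
    nr_inodes, nr_free_inodes = values[0], values[1]
    return {
        "nr_inodes": nr_inodes,
        "nr_free_inodes": nr_free_inodes,
        "in_use": nr_inodes - nr_free_inodes,
    }


SAMPLE = "5120 1024 0 0 0 0 0"   # hypothetical contents of inode-state
state = parse_inode_state(SAMPLE)
print("inodes in use:", state["in_use"])
```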
+
+super-nr and super-max
+----------------------
+
+Again, super block structures are allocated by the kernel, but not freed. The
+file super-max contains the maximum number of super block handlers, while
+super-nr shows the number of currently allocated ones.
+
+Every mounted file system needs a super block, so if you plan to mount lots of
+file systems, you may want to increase these numbers.
+
+aio-nr and aio-max-nr
+---------------------
+
+aio-nr is the running total of the number of events specified on the
+io_setup system call for all currently active aio contexts. If aio-nr
+reaches aio-max-nr then io_setup will fail with EAGAIN. Note that
+raising aio-max-nr does not result in the pre-allocation or re-sizing
+of any kernel data structures.
+
+2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
+-----------------------------------------------------------
+
+Besides these files, there is the subdirectory /proc/sys/fs/binfmt_misc. This
+handles the kernel support for miscellaneous binary formats.
+
+Binfmt_misc provides the ability to register additional binary formats to the
+kernel without compiling an additional module/kernel. Therefore, binfmt_misc
+needs to know magic numbers at the beginning or the filename extension of the
+binary.
+
+It works by maintaining a linked list of structs that contain a description of
+a binary format, including a magic with size (or the filename extension),
+offset and mask, and the interpreter name. On request it invokes the given
+interpreter with the original program as argument, as binfmt_java and
+binfmt_em86 and binfmt_mz do. Since binfmt_misc does not define any default
+binary-formats, you have to register an additional binary-format.
+
+There are two general files in binfmt_misc and one file per registered format.
+The two general files are register and status.
+
+Registering a new binary format
+-------------------------------
+
+To register a new binary format you have to issue the command
+
+ echo :name:type:offset:magic:mask:interpreter: > /proc/sys/fs/binfmt_misc/register
+
+
+
+with appropriate name (the name for the /proc-dir entry), offset (defaults to
+0, if omitted), magic, mask (which can be omitted, defaults to all 0xff) and
+last but not least, the interpreter that is to be invoked (for example,
+/bin/echo for testing). Type can be M for usual magic matching or E for filename
+extension matching (give extension in place of magic).
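+
+To make the field order concrete, here is a Python sketch that assembles such
+a registration string. binfmt_entry() is an illustrative helper, not part of
+any kernel interface, and it only builds the string; writing it to the
+register file is left out.
+
```python
# Sketch only: binfmt_entry() is a hypothetical helper that builds the
# ':name:type:offset:magic:mask:interpreter:' string described above.
def binfmt_entry(name, type_, magic, interpreter, offset="", mask=""):
    """Return a registration string; empty offset and mask mean 'default'."""
    if type_ not in ("M", "E"):
        raise ValueError("type must be 'M' (magic) or 'E' (extension)")
    fields = [name, type_, str(offset), magic, mask, interpreter]
    for field in fields:
        if ":" in field:
            raise ValueError("fields must not contain the ':' separator")
    return ":" + ":".join(fields) + ":"


# Magic matching on the Java class-file signature.
print(binfmt_entry("Java", "M", r"\xca\xfe\xba\xbe",
                   "/usr/local/java/bin/javawrapper"))
# Extension matching: the extension takes the place of the magic.
print(binfmt_entry("HTML", "E", "html", "/usr/local/java/bin/appletviewer"))
```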
+
+Check or reset the status of the binary format handler
+------------------------------------------------------
+
+If you do a cat on the file /proc/sys/fs/binfmt_misc/status, you will get the
+current status (enabled/disabled) of binfmt_misc. Change the status by echoing
+0 (disables) or 1 (enables) or -1 (caution: this clears all previously
+registered binary formats) to status. For example, echo 0 > status disables
+binfmt_misc (temporarily).
+
+Status of a single handler
+--------------------------
+
+Each registered handler has an entry in /proc/sys/fs/binfmt_misc. These files
+perform the same function as status, but their scope is limited to the actual
+binary format. By running cat on such a file, you also receive all related information
+about the interpreter/magic of the binfmt.
+
+Example usage of binfmt_misc (emulate binfmt_java)
+--------------------------------------------------
+
+ cd /proc/sys/fs/binfmt_misc
+ echo ':Java:M::\xca\xfe\xba\xbe::/usr/local/java/bin/javawrapper:' > register
+ echo ':HTML:E::html::/usr/local/java/bin/appletviewer:' > register
+ echo ':Applet:M::<!--applet::/usr/local/java/bin/appletviewer:' > register
+ echo ':DEXE:M::\x0eDEX::/usr/bin/dosexec:' > register
+
+
+The first three lines add support for Java executables and Java applets (like
+binfmt_java, additionally recognizing the .html extension with no need to put
+<!--applet> into every applet file). The fourth line registers files that begin
+with the magic \x0eDEX to be run by /usr/bin/dosexec. For Java, you have to
+install the JDK and the shell-script /usr/local/java/bin/javawrapper too. It
+works around the brokenness of the Java filename handling. To add a Java
+binary, just create a link to the class-file somewhere in the path.
+
+2.3 /proc/sys/kernel - general kernel parameters
+------------------------------------------------
+
+This directory reflects general kernel behaviors. As I've said before, the
+contents depend on your configuration. Here you'll find the most important
+files, along with descriptions of what they mean and how to use them.
+
+acct
+----
+
+The file contains three values: highwater, lowwater, and frequency.
+
+It exists only when BSD-style process accounting is enabled. These values
+control its behavior. If the free space on the file system where the log lives
+goes below lowwater percentage, accounting suspends. If it goes above
+highwater percentage, accounting resumes. Frequency determines how often the
+amount of free space is checked (value is in seconds). Default settings are: 4,
+2, and 30. That is, suspend accounting if there is less than 2 percent free;
+resume it if we have a value of 4 or more percent; consider information about
+the amount of free space valid for 30 seconds.
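+
+The suspend/resume behaviour is a simple hysteresis, which the following
+Python sketch models. accounting_active() is an illustrative name, and the
+exact handling of the boundary values (strictly below lowwater, at or above
+highwater) is an assumption made for the example rather than a statement
+about the kernel code.
+
```python
# Sketch only: accounting_active() is a hypothetical model of the
# highwater/lowwater rule; boundary handling is an assumption.
def accounting_active(active, free_percent, highwater=4, lowwater=2):
    """Return the new accounting state after one free-space check."""
    if active and free_percent < lowwater:
        return False            # too little free space: suspend
    if not active and free_percent >= highwater:
        return True             # enough free space again: resume
    return active               # in between: keep the current state


state = True
for free in (5, 1, 3, 4):
    state = accounting_active(state, free)
    print("free %d%% -> accounting %s" % (free, "on" if state else "off"))
```
+
+Note how 3 percent free leaves accounting suspended: the gap between the two
+thresholds keeps it from flapping on and off.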
+
+ctrl-alt-del
+------------
+
+When the value in this file is 0, ctrl-alt-del is trapped and sent to the init
+program to handle a graceful restart. However, when the value is greater than
+zero, Linux's reaction to this key combination will be an immediate reboot,
+without syncing its dirty buffers.
+
+[NOTE]
+ When a program (like dosemu) has the keyboard in raw mode, the
+ ctrl-alt-del is intercepted by the program before it ever reaches the
+ kernel tty layer, and it is up to the program to decide what to do with
+ it.
+
+domainname and hostname
+-----------------------
+
+These files can be used to set the NIS domainname and hostname of your
+box. For the classic darkstar.frop.org a simple:
+
+ # echo "darkstar" > /proc/sys/kernel/hostname
+ # echo "frop.org" > /proc/sys/kernel/domainname
+
+
+would suffice to set your hostname and NIS domainname.
+
+osrelease, ostype and version
+-----------------------------
+
+The names make it pretty obvious what these fields contain:
+
+ > cat /proc/sys/kernel/osrelease
+ 2.2.12
+
+ > cat /proc/sys/kernel/ostype
+ Linux
+
+ > cat /proc/sys/kernel/version
+ #4 Fri Oct 1 12:41:14 PDT 1999
+
+
+The files osrelease and ostype should be clear enough. Version needs a little
+more clarification. The #4 means that this is the 4th kernel built from this
+source base and the date after it indicates the time the kernel was built. The
+only way to tune these values is to rebuild the kernel.
+
+panic
+-----
+
+The value in this file represents the number of seconds the kernel waits
+before rebooting on a panic. When you use the software watchdog, the
+recommended setting is 60. If set to 0, the auto reboot after a kernel panic
+is disabled, which is the default setting.
+
+printk
+------
+
+The four values in printk denote
+* console_loglevel,
+* default_message_loglevel,
+* minimum_console_loglevel and
+* default_console_loglevel
+respectively.
+
+These values influence printk() behavior when printing or logging error
+messages, which come from inside the kernel. See syslog(2) for more
+information on the different log levels.
+
+console_loglevel
+----------------
+
+Messages with a higher priority than this will be printed to the console.
+
+default_message_loglevel
+------------------------
+
+Messages without an explicit priority will be printed with this priority.
+
+minimum_console_loglevel
+------------------------
+
+Minimum (highest priority) value to which the console_loglevel can be set.
+
+default_console_loglevel
+------------------------
+
+Default value for console_loglevel.
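+
+The interplay of the four values can be sketched in Python as follows. The
+helper names are illustrative, the sample string "6 4 1 7" is a made-up
+example of the file's contents, and the comparison relies on the syslog(2)
+convention that a numerically lower level means a higher priority.
+
```python
# Sketch only: parse_printk() and reaches_console() are hypothetical
# helpers; "6 4 1 7" is an invented sample of /proc/sys/kernel/printk.
PRINTK_FIELDS = [
    "console_loglevel",
    "default_message_loglevel",
    "minimum_console_loglevel",
    "default_console_loglevel",
]


def parse_printk(text):
    """Map the four whitespace-separated values to their names."""
    return dict(zip(PRINTK_FIELDS, (int(v) for v in text.split())))


def reaches_console(levels, message_level=None):
    """True if a message of this level has higher priority than the
    console_loglevel (lower number means higher priority)."""
    if message_level is None:
        message_level = levels["default_message_loglevel"]
    return message_level < levels["console_loglevel"]


levels = parse_printk("6 4 1 7")
print("message without explicit level shown:", reaches_console(levels))
print("level 7 (debug) message shown:", reaches_console(levels, 7))
```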
+
+sg-big-buff
+-----------
+
+This file shows the size of the generic SCSI (sg) buffer. At this point, you
+can't tune it yet, but you can change it at compile time by editing
+include/scsi/sg.h and changing the value of SG_BIG_BUFF.
+
+If you use a scanner with SANE (Scanner Access Now Easy) you might want to set
+this to a higher value. Refer to the SANE documentation on this issue.
+
+modprobe
+--------
+
+The path to the modprobe binary. The kernel uses this
+program to load modules on demand.
+
+unknown_nmi_panic
+-----------------
+
+The value in this file affects how unknown NMIs are handled. When the value is
+non-zero, an unknown NMI is trapped and the kernel then panics. At that time,
+kernel debugging information is displayed on the console.
+
+For example, the NMI switch that most IA32 servers have raises an unknown NMI.
+If a system hangs, try pressing the NMI switch.
+
+[NOTE]
+ This function and oprofile share an NMI callback. Therefore this function
+ cannot be enabled when oprofile is activated.
+ Also, the NMI watchdog will be disabled when the value in this file is set
+ to non-zero.
+
+
+2.4 /proc/sys/vm - The virtual memory subsystem
+-----------------------------------------------
+
+The files in this directory can be used to tune the operation of the virtual
+memory (VM) subsystem of the Linux kernel.
+
+vfs_cache_pressure
+------------------
+
+Controls the tendency of the kernel to reclaim the memory which is used for
+caching of directory and inode objects.
+
+At the default value of vfs_cache_pressure=100 the kernel will attempt to
+reclaim dentries and inodes at a "fair" rate with respect to pagecache and
+swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer
+to retain dentry and inode caches. Increasing vfs_cache_pressure beyond 100
+causes the kernel to prefer to reclaim dentries and inodes.
+
+dirty_background_ratio
+----------------------
+
+Contains, as a percentage of total system memory, the number of pages at which
+the pdflush background writeback daemon will start writing out dirty data.
+
+dirty_ratio
+-----------
+
+Contains, as a percentage of total system memory, the number of pages at which
+a process which is generating disk writes will itself start writing out dirty
+data.
+
+dirty_writeback_centisecs
+-------------------------
+
+The pdflush writeback daemons will periodically wake up and write `old' data
+out to disk. This tunable expresses the interval between those wakeups, in
+hundredths of a second.
+
+Setting this to zero disables periodic writeback altogether.
+
+dirty_expire_centisecs
+----------------------
+
+This tunable is used to define when dirty data is old enough to be eligible
+for writeout by the pdflush daemons. It is expressed in hundredths of a second.
+Data which has been dirty in-memory for longer than this interval will be
+written out next time a pdflush daemon wakes up.
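+
+Since both tunables are given in hundredths of a second, a short Python
+sketch may help with the conversion. The helper names are illustrative, and
+the values 500 and 3000 are assumed sample settings used only for the
+example.
+
```python
# Sketch only: hypothetical helpers; 500 and 3000 are assumed sample
# settings for dirty_writeback_centisecs and dirty_expire_centisecs.
def centisecs_to_seconds(value):
    """Convert a *_centisecs tunable to seconds."""
    return value / 100.0


def eligible_for_writeout(dirty_age_seconds, dirty_expire_centisecs):
    """True if data has been dirty longer than the expire interval."""
    return dirty_age_seconds > centisecs_to_seconds(dirty_expire_centisecs)


print("wakeup interval: %.1f s" % centisecs_to_seconds(500))
print("expire interval: %.1f s" % centisecs_to_seconds(3000))
print("dirty for 31 s, eligible:", eligible_for_writeout(31, 3000))
print("dirty for 10 s, eligible:", eligible_for_writeout(10, 3000))
```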
+
+legacy_va_layout
+----------------
+
+If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel
+will use the legacy (2.4) layout for all processes.
+
+lower_zone_protection
+---------------------
+
+For some specialised workloads on highmem machines it is dangerous for
+the kernel to allow process memory to be allocated from the "lowmem"
+zone. This is because that memory could then be pinned via the mlock()
+system call, or by unavailability of swapspace.
+
+And on large highmem machines this lack of reclaimable lowmem memory
+can be fatal.
+
+So the Linux page allocator has a mechanism which prevents allocations
+which _could_ use highmem from using too much lowmem. This means that
+a certain amount of lowmem is defended from the possibility of being
+captured into pinned user memory.
+
+(The same argument applies to the old 16 megabyte ISA DMA region. This
+mechanism will also defend that region from allocations which could use
+highmem or lowmem).
+
+The `lower_zone_protection' tunable determines how aggressive the kernel is
+in defending these lower zones. The default value is zero - no
+protection at all.
+
+If you have a machine which uses highmem or ISA DMA and your
+applications are using mlock(), or if you are running with no swap then
+you probably should increase the lower_zone_protection setting.
+
+The units of this tunable are fairly vague. It is approximately equal
+to "megabytes". So setting lower_zone_protection=100 will protect around 100
+megabytes of the lowmem zone from user allocations. It will also make
+those 100 megabytes unavailable for use by applications and by
+pagecache, so there is a cost.
+
+The effects of this tunable may be observed by monitoring
+/proc/meminfo:LowFree. Write a single huge file and observe the point
+at which LowFree ceases to fall.
+
+A reasonable value for lower_zone_protection is 100.
+
+page-cluster
+------------
+
+page-cluster controls the number of pages which are written to swap in
+a single attempt, i.e. the swap I/O size.
+
+It is a logarithmic value - setting it to zero means "1 page", setting
+it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
+
+The default value is three (eight pages at a time). There may be some
+small benefits in tuning this to a different value if your workload is
+swap-intensive.
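+
+The logarithmic mapping above can be sketched in a few lines of Python. The
+helper name pages_per_swap_io is made up for this illustration and is not a
+kernel interface; it only restates the arithmetic described in this section.
+
+```python
+def pages_per_swap_io(page_cluster: int) -> int:
+    """Return how many pages one swap I/O covers for a page-cluster value."""
+    if page_cluster < 0:
+        raise ValueError("page-cluster must not be negative")
+    # The tunable is a base-2 logarithm: 0 -> 1 page, 1 -> 2 pages, ...
+    return 1 << page_cluster
+
+
+# Print the mapping for the values mentioned in the text (0 through 3).
+for value in range(4):
+    print(value, pages_per_swap_io(value))
+```
+
+With the default value of three this gives eight pages per attempt.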
+
+overcommit_memory
+-----------------
+
+This file contains one value. The following algorithm is used to decide if
+there's enough memory: if the value of overcommit_memory is positive, then
+there's always enough memory. This is a useful feature, since programs often
+malloc() huge amounts of memory 'just in case', while they only use a small
+part of it. Leaving this value at 0 will lead to the failure of such a huge
+malloc(), when in fact the system has enough memory for the program to run.
+
+On the other hand, enabling this feature can cause you to run out of memory
+and thrash the system to death, so large and/or important servers will want to
+set this value to 0.
+
+nr_hugepages and hugetlb_shm_group
+----------------------------------
+
+nr_hugepages configures the number of hugetlb pages reserved for the system.
+
+hugetlb_shm_group contains the group id that is allowed to create SysV shared
+memory segments using hugetlb pages.
+
+laptop_mode
+-----------
+
+laptop_mode is a knob that controls "laptop mode". All the things that are
+controlled by this knob are discussed in Documentation/laptop-mode.txt.
+
+block_dump
+----------
+
+block_dump enables block I/O debugging when set to a nonzero value. More
+information on block I/O debugging is in Documentation/laptop-mode.txt.
+
+swap_token_timeout
+------------------
+
+This file contains the valid hold time of the swap out protection token. The
+Linux VM has a token based thrashing control mechanism and uses the token to
+prevent unnecessary page faults in thrashing situations. The value is in
+seconds, and can be used to tune thrashing behavior.
+
+2.5 /proc/sys/dev - Device specific parameters
+----------------------------------------------
+
+Currently there is only support for CDROM drives, and for those, there is only
+one read-only file containing information about the CD-ROM drives attached to
+the system:
+
+ >cat /proc/sys/dev/cdrom/info
+ CD-ROM information, Id: cdrom.c 2.55 1999/04/25
+
+ drive name: sr0 hdb
+ drive speed: 32 40
+ drive # of slots: 1 0
+ Can close tray: 1 1
+ Can open tray: 1 1
+ Can lock tray: 1 1
+ Can change speed: 1 1
+ Can select disk: 0 1
+ Can read multisession: 1 1
+ Can read MCN: 1 1
+ Reports media changed: 1 1
+ Can play audio: 1 1
+
+
+You see two drives, sr0 and hdb, along with a list of their features.
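+
+A listing like the one above is easy to turn into a per-drive table. The
+following is only a sketch: parse_cdrom_info is a hypothetical helper, and it
+assumes the layout shown above (a "drive name:" row followed by feature rows
+with one value per drive).
+
+```python
+import os
+
+
+def parse_cdrom_info(text: str) -> dict:
+    """Map each drive name to a {feature: value} dictionary."""
+    drives = None
+    info = {}
+    for line in text.splitlines():
+        label, sep, values = line.partition(":")
+        if not sep:
+            continue  # blank line or a line without a label
+        label, fields = label.strip(), values.split()
+        if label == "drive name":
+            drives = fields
+            info = {name: {} for name in drives}
+        elif drives and len(fields) == len(drives):
+            # One column per drive, in the same order as the name row.
+            for name, value in zip(drives, fields):
+                info[name][label] = value
+    return info
+
+
+if __name__ == "__main__":
+    path = "/proc/sys/dev/cdrom/info"
+    if os.path.exists(path):
+        with open(path) as handle:
+            print(parse_cdrom_info(handle.read()))
+```
+
+Rows that appear before the "drive name:" row, such as the version banner,
+are skipped.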
+
+2.6 /proc/sys/sunrpc - Remote procedure calls
+---------------------------------------------
+
+This directory contains four files, which enable or disable debugging for the
+RPC functions NFS, NFS-daemon, RPC and NLM. The default value of each is 0.
+They can be set to one to turn debugging on.
+
+2.7 /proc/sys/net - Networking stuff
+------------------------------------
+
+The interface to the networking parts of the kernel is located in
+/proc/sys/net. Table 2-3 shows all possible subdirectories. You may see only
+some of them, depending on your kernel's configuration.
+
+
+Table 2-3: Subdirectories in /proc/sys/net
+..............................................................................
+ Directory Content             Directory  Content
+ core      General parameter   appletalk  Appletalk protocol
+ unix      Unix domain sockets netrom     NET/ROM
+ 802       E802 protocol       ax25       AX25
+ ethernet  Ethernet protocol   rose       X.25 PLP layer
+ ipv4      IP version 4        x25        X.25 protocol
+ ipx       IPX                 token-ring IBM token ring
+ bridge    Bridging            decnet     DEC net
+ ipv6      IP version 6
+..............................................................................
+
+We will concentrate on IP networking here. Since AX25, X.25, and DEC Net are
+only minor players in the Linux world, we'll skip them in this chapter. You'll
+find some short info on Appletalk and IPX further on in this chapter. Review
+the online documentation and the kernel source to get a detailed view of the
+parameters for those protocols. In this section we'll discuss the
+subdirectories printed in bold letters in the table above. As default values
+are suitable for most needs, there is no need to change these values.
+
+/proc/sys/net/core - Network core options
+-----------------------------------------
+
+rmem_default
+------------
+
+The default setting of the socket receive buffer in bytes.
+
+rmem_max
+--------
+
+The maximum receive socket buffer size in bytes.
+
+wmem_default
+------------
+
+The default setting (in bytes) of the socket send buffer.
+
+wmem_max
+--------
+
+The maximum send socket buffer size in bytes.
+
+message_burst and message_cost
+------------------------------
+
+These parameters are used to limit the warning messages written to the kernel
+log from the networking code. They enforce a rate limit to make a
+denial-of-service attack impossible. A higher message_cost factor results in
+fewer messages being written. Message_burst controls when messages will
+be dropped. The default settings limit warning messages to one every five
+seconds.
+
+netdev_max_backlog
+------------------
+
+Maximum number of packets queued on the INPUT side when the interface
+receives packets faster than the kernel can process them.
+
+optmem_max
+----------
+
+Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence
+of struct cmsghdr structures with appended data.
+
+/proc/sys/net/unix - Parameters for Unix domain sockets
+-------------------------------------------------------
+
+There are only two files in this subdirectory. They control the delays for
+deleting and destroying socket descriptors.
+
+2.8 /proc/sys/net/ipv4 - IPV4 settings
+--------------------------------------
+
+IP version 4 is still the most used protocol in Unix networking. It will be
+replaced by IP version 6 in the next couple of years, but for the moment it's
+the de facto standard for the internet and is used in most networking
+environments around the world. Because of the importance of this protocol,
+we'll have a deeper look into the subtree controlling the behavior of the IPv4
+subsystem of the Linux kernel.
+
+Let's start with the entries in /proc/sys/net/ipv4.
+
+ICMP settings
+-------------
+
+icmp_echo_ignore_all and icmp_echo_ignore_broadcasts
+----------------------------------------------------
+
+Controls whether the kernel ignores all ICMP ECHO requests, or just those sent
+to broadcast and multicast addresses. Set to 1 to ignore, 0 to answer.
+
+Please note that if you accept ICMP echo requests with a broadcast/multicast
+destination address your network may be used as an amplifier for denial of
+service packet flooding attacks against other hosts.
+
+icmp_destunreach_rate, icmp_echoreply_rate, icmp_paramprob_rate and icmp_timeexceed_rate
+----------------------------------------------------------------------------------------
+
+Sets limits for sending ICMP packets to specific targets. A value of zero
+disables all limiting. Any positive value sets the maximum packet rate in
+hundredths of a second (on Intel systems).
+
+IP settings
+-----------
+
+ip_autoconfig
+-------------
+
+This file contains the number one if the host received its IP configuration by
+RARP, BOOTP, DHCP or a similar mechanism. Otherwise it is zero.
+
+ip_default_ttl
+--------------
+
+TTL (Time To Live) for IPv4 interfaces. This is simply the maximum number of
+hops a packet may travel.
+
+ip_dynaddr
+----------
+
+Enable dynamic socket address rewriting on interface address change. This is
+useful for dial-up interfaces with changing IP addresses.
+
+ip_forward
+----------
+
+Enable or disable forwarding of IP packets between interfaces. Changing this
+value resets all other parameters to their default values. The defaults differ
+depending on whether the kernel is configured as a host or a router.
+
+ip_local_port_range
+-------------------
+
+Range of ports used by TCP and UDP to choose the local port. Contains two
+numbers, the first number is the lowest port, the second number the highest
+local port. Default is 1024-4999. Should be changed to 32768-61000 for
+high-usage systems.
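+
+To get a feeling for what the two ranges mean in practice, the following
+sketch parses the two-number format described above and counts the local
+ports in a range. Both function names are invented for this example.
+
+```python
+def parse_port_range(text: str) -> tuple:
+    """Parse 'LOW HIGH' as found in ip_local_port_range."""
+    low, high = (int(field) for field in text.split())
+    if not 0 < low <= high <= 65535:
+        raise ValueError("not a valid port range: %r" % text)
+    return low, high
+
+
+def port_count(low: int, high: int) -> int:
+    """Number of ports in the inclusive range low..high."""
+    return high - low + 1
+
+
+for sample in ("1024 4999", "32768 61000"):
+    low, high = parse_port_range(sample)
+    print(sample, "->", port_count(low, high), "ports")
+```
+
+The default range offers 3976 local ports, while 32768-61000 offers 28233.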
+
+ip_no_pmtu_disc
+---------------
+
+Global switch to turn path MTU discovery off. It can also be set on a per
+socket basis by the applications or on a per route basis.
+
+ip_masq_debug
+-------------
+
+Enable/disable debugging of IP masquerading.
+
+IP fragmentation settings
+-------------------------
+
+ipfrag_high_thresh and ipfrag_low_thresh
+----------------------------------------
+
+Maximum memory used to reassemble IP fragments. When ipfrag_high_thresh bytes
+of memory is allocated for this purpose, the fragment handler will toss
+packets until ipfrag_low_thresh is reached.
+
+ipfrag_time
+-----------
+
+Time in seconds to keep an IP fragment in memory.
+
+TCP settings
+------------
+
+tcp_ecn
+-------
+
+This file controls the use of the ECN bit in the IPv4 headers. ECN (Explicit
+Congestion Notification) is a new feature, but some routers and firewalls
+block traffic that has this bit set, so it may be necessary to echo 0 to
+/proc/sys/net/ipv4/tcp_ecn if you want to talk to these sites. For more info
+see RFC2481.
+
+tcp_retrans_collapse
+--------------------
+
+Bug-to-bug compatibility with some broken printers. On retransmit, try to send
+larger packets to work around bugs in certain TCP stacks. Can be turned off by
+setting it to zero.
+
+tcp_keepalive_probes
+--------------------
+
+Number of keep alive probes TCP sends out, until it decides that the
+connection is broken.
+
+tcp_keepalive_time
+------------------
+
+How often TCP sends out keep alive messages, when keep alive is enabled. The
+default is 2 hours.
+
+tcp_syn_retries
+---------------
+
+Number of times initial SYNs for a TCP connection attempt will be
+retransmitted. Should not be higher than 255. This applies only to outgoing
+connections; for incoming connections the number of retransmits is defined by
+tcp_retries1.
+
+tcp_sack
+--------
+
+Enable selective acknowledgments as defined in RFC2018.
+
+tcp_timestamps
+--------------
+
+Enable timestamps as defined in RFC1323.
+
+tcp_stdurg
+----------
+
+Enable the strict RFC793 interpretation of the TCP urgent pointer field. The
+default is to use the BSD compatible interpretation of the urgent pointer
+pointing to the first byte after the urgent data. The RFC793 interpretation is
+to have it point to the last byte of urgent data. Enabling this option may
+lead to interoperability problems. Disabled by default.
+
+tcp_syncookies
+--------------
+
+Only valid when the kernel was compiled with CONFIG_SYNCOOKIES. Send out
+syncookies when the syn backlog queue of a socket overflows. This is to ward
+off the common 'syn flood attack'. Disabled by default.
+
+Note that the concept of a socket backlog is abandoned. This means the peer
+may not receive reliable error messages from an overloaded server with
+syncookies enabled.
+
+tcp_window_scaling
+------------------
+
+Enable window scaling as defined in RFC1323.
+
+tcp_fin_timeout
+---------------
+
+The length of time in seconds to wait for a final FIN before the socket is
+closed unconditionally. This is strictly a violation of the TCP
+specification, but required to prevent denial-of-service attacks.
+
+tcp_max_ka_probes
+-----------------
+
+Indicates how many keep alive probes are sent per slow timer run. Should not
+be set too high to prevent bursts.
+
+tcp_max_syn_backlog
+-------------------
+
+Length of the per socket backlog queue. Since Linux 2.2 the backlog specified
+in listen(2) only specifies the length of the backlog queue of already
+established sockets. When more connection requests arrive Linux starts to drop
+packets. When syncookies are enabled the packets are still answered and the
+maximum queue is effectively ignored.
+
+tcp_retries1
+------------
+
+Defines how often an answer to a TCP connection request is retransmitted
+before giving up.
+
+tcp_retries2
+------------
+
+Defines how often a TCP packet is retransmitted before giving up.
+
+Interface specific settings
+---------------------------
+
+In the directory /proc/sys/net/ipv4/conf you'll find one subdirectory for each
+interface the system knows about and one directory called all. Changes in the
+all subdirectory affect all interfaces, whereas changes in the other
+subdirectories affect only one interface. All directories have the same
+entries:
+
+accept_redirects
+----------------
+
+This switch decides if the kernel accepts ICMP redirect messages or not. The
+default is 'yes' if the kernel is configured for a regular host and 'no' for a
+router configuration.
+
+accept_source_route
+-------------------
+
+Should source routed packets be accepted or declined? The default is
+dependent on the kernel configuration. It's 'yes' for routers and 'no' for
+hosts.
+
+bootp_relay
+-----------
+
+Accept packets with source address 0.b.c.d with destinations not to this host
+as local ones. It is supposed that a BOOTP relay daemon will catch and forward
+such packets.
+
+The default is 0, since this feature is not implemented yet (kernel version
+2.2.12).
+
+forwarding
+----------
+
+Enable or disable IP forwarding on this interface.
+
+log_martians
+------------
+
+Log packets with source addresses with no known route to kernel log.
+
+mc_forwarding
+-------------
+
+Do multicast routing. The kernel needs to be compiled with CONFIG_MROUTE and a
+multicast routing daemon is required.
+
+proxy_arp
+---------
+
+Does (1) or does not (0) perform proxy ARP.
+
+rp_filter
+---------
+
+Integer value that determines whether source validation is performed. 1 means
+yes, 0 means no. Disabled by default, but protection against local/broadcast
+address spoofing is always on.
+
+If you set this to 1 on a router that is the only connection for a network to
+the net, it will prevent spoofing attacks against your internal networks
+(external addresses can still be spoofed), without the need for additional
+firewall rules.
+
+secure_redirects
+----------------
+
+Accept ICMP redirect messages only for gateways listed in the default gateway
+list. Enabled by default.
+
+shared_media
+------------
+
+If it is not set the kernel does not assume that different subnets on this
+device can communicate directly. Default setting is 'yes'.
+
+send_redirects
+--------------
+
+Determines whether to send ICMP redirects to other hosts.
+
+Routing settings
+----------------
+
+The directory /proc/sys/net/ipv4/route contains several files to control
+routing issues.
+
+error_burst and error_cost
+--------------------------
+
+These parameters limit how many ICMP destination unreachable messages the
+host sends. ICMP destination unreachable messages are sent when we cannot
+reach the next hop while trying to transmit a packet.
+It will also print some error messages to kernel logs if someone is ignoring
+our ICMP redirects. The higher the error_cost factor is, the fewer
+destination unreachable and error messages will be let through. Error_burst
+controls when destination unreachable messages and error messages will be
+dropped. The default settings limit warning messages to five every second.
+
+flush
+-----
+
+Writing to this file results in a flush of the routing cache.
+
+gc_elasticity, gc_interval, gc_min_interval_ms, gc_timeout, gc_thresh
+---------------------------------------------------------------------
+
+Values to control the frequency and behavior of the garbage collection
+algorithm for the routing cache. gc_min_interval is deprecated and replaced
+by gc_min_interval_ms.
+
+
+max_size
+--------
+
+Maximum size of the routing cache. Old entries will be purged once the cache
+has reached this size.
+
+max_delay, min_delay
+--------------------
+
+Delays for flushing the routing cache.
+
+redirect_load, redirect_number
+------------------------------
+
+Factors which determine if more ICMP redirects should be sent to a specific
+host. No redirects will be sent once the load limit or the maximum number of
+redirects has been reached.
+
+redirect_silence
+----------------
+
+Timeout for redirects. After this period redirects will be sent again, even if
+this has been stopped, because the load or number limit has been reached.
+
+Network Neighbor handling
+-------------------------
+
+Settings about how to handle connections with direct neighbors (nodes attached
+to the same link) can be found in the directory /proc/sys/net/ipv4/neigh.
+
+As we saw in the conf directory, there is a default subdirectory which
+holds the default values, and one directory for each interface. The contents
+of the directories are identical, with the single exception that the default
+settings contain additional options to set garbage collection parameters.
+
+In the interface directories you'll find the following entries:
+
+base_reachable_time, base_reachable_time_ms
+-------------------------------------------
+
+A base value used for computing the random reachable time value as specified
+in RFC2461.
+
+Expression of base_reachable_time, which is deprecated, is in seconds.
+Expression of base_reachable_time_ms is in milliseconds.
+
+retrans_time, retrans_time_ms
+-----------------------------
+
+The time between retransmitted Neighbor Solicitation messages.
+Used for address resolution and to determine if a neighbor is
+unreachable.
+
+Expression of retrans_time, which is deprecated, is in 1/100 seconds (for
+IPv4) or in jiffies (for IPv6).
+Expression of retrans_time_ms is in milliseconds.
+
+unres_qlen
+----------
+
+Maximum queue length for a pending ARP request - the number of packets which
+are accepted from other layers while the ARP address is still being resolved.
+
+anycast_delay
+-------------
+
+Maximum for random delay of answers to neighbor solicitation messages in
+jiffies (1/100 sec). Not yet implemented (Linux does not have anycast support
+yet).
+
+ucast_solicit
+-------------
+
+Maximum number of retries for unicast solicitation.
+
+mcast_solicit
+-------------
+
+Maximum number of retries for multicast solicitation.
+
+delay_first_probe_time
+----------------------
+
+Delay for the first time probe if the neighbor is reachable. (see
+gc_stale_time)
+
+locktime
+--------
+
+An ARP/neighbor entry is only replaced with a new one if the old is at least
+locktime old. This prevents ARP cache thrashing.
+
+proxy_delay
+-----------
+
+Maximum time (real time is random [0..proxytime]) before answering to an ARP
+request for which we have a proxy ARP entry. In some cases, this is used to
+prevent network flooding.
+
+proxy_qlen
+----------
+
+Maximum queue length of the delayed proxy arp timer. (see proxy_delay).
+
+app_solicit
+-----------
+
+Determines the number of requests to send to the user level ARP daemon. Use 0
+to turn off.
+
+gc_stale_time
+-------------
+
+Determines how often to check for stale ARP entries. After an ARP entry is
+stale it will be resolved again (which is useful when an IP address migrates
+to another machine). When ucast_solicit is greater than 0 it first tries to
+send an ARP packet directly to the known host. When that fails and
+mcast_solicit is greater than 0, an ARP request is broadcast.
+
+2.9 Appletalk
+-------------
+
+The /proc/sys/net/appletalk directory holds the Appletalk configuration data
+when Appletalk is loaded. The configurable parameters are:
+
+aarp-expiry-time
+----------------
+
+The amount of time we keep an ARP entry before expiring it. Used to age out
+old hosts.
+
+aarp-resolve-time
+-----------------
+
+The amount of time we will spend trying to resolve an Appletalk address.
+
+aarp-retransmit-limit
+---------------------
+
+The number of times we will retransmit a query before giving up.
+
+aarp-tick-time
+--------------
+
+Controls the rate at which expiries are checked.
+
+The directory /proc/net/appletalk holds the list of active Appletalk sockets
+on a machine.
+
+The fields indicate the DDP type, the local address (in network:node format)
+the remote address, the size of the transmit pending queue, the size of the
+received queue (bytes waiting for applications to read) the state and the uid
+owning the socket.
+
+/proc/net/atalk_iface lists all the interfaces configured for appletalk. It
+shows the name of the interface, its Appletalk address, the network range on
+that address (or network number for phase 1 networks), and the status of the
+interface.
+
+/proc/net/atalk_route lists each known network route. It lists the target
+(network) that the route leads to, the router (may be directly connected), the
+route flags, and the device the route is using.
+
+2.10 IPX
+--------
+
+The IPX protocol has no tunable values in /proc/sys/net.
+
+The IPX protocol does, however, provide /proc/net/ipx. This lists each IPX
+socket giving the local and remote addresses in Novell format (that is
+network:node:port). In accordance with the strange Novell tradition,
+everything but the port is in hex. Not_Connected is displayed for sockets that
+are not tied to a specific remote address. The Tx and Rx queue sizes indicate
+the number of bytes pending for transmission and reception. The state
+indicates the state the socket is in and the uid is the owning uid of the
+socket.
+
+The /proc/net/ipx_interface file lists all IPX interfaces. For each interface
+it gives the network number, the node number, and indicates if the network is
+the primary network. It also indicates which device it is bound to (or
+Internal for internal networks) and the Frame Type if appropriate. Linux
+supports 802.3, 802.2, 802.2 SNAP and DIX (Blue Book) ethernet framing for
+IPX.
+
+The /proc/net/ipx_route table holds a list of IPX routes. For each route it
+gives the destination network, the router node (or Directly) and the network
+address of the router (or Connected) for internal networks.
+
+2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem
+----------------------------------------------------------
+
+The "mqueue" filesystem provides the necessary kernel features to enable the
+creation of a user space library that implements the POSIX message queues
+API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System
+Interfaces specification.)
+
+The "mqueue" filesystem contains values for determining/setting the amount of
+resources used by the file system.
+
+/proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the
+maximum number of message queues allowed on the system.
+
+/proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the
+maximum number of messages in a queue. In fact it is the limiting value
+for another (user) limit which is set in the mq_open invocation. This
+attribute of a queue must be less than or equal to msg_max.
+
+/proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the
+maximum message size value (it is every message queue's attribute set during
+its creation).
+
+
+------------------------------------------------------------------------------
+Summary
+------------------------------------------------------------------------------
+Certain aspects of kernel behavior can be modified at runtime, without the
+need to recompile the kernel, or even to reboot the system. The files in the
+/proc/sys tree can not only be read, but also modified. You can use the echo
+command to write values into these files, thereby changing the default settings
+of the kernel.
+------------------------------------------------------------------------------
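+
+The same files can be read and written from a script. The sketch below is an
+illustration only: read_sysctl and write_sysctl are hypothetical helper names,
+and the root argument exists so that the code can be tried against a scratch
+directory instead of the real /proc/sys tree. Writing to the real tree
+normally requires root privileges.
+
+```python
+import os
+
+
+def read_sysctl(relpath: str, root: str = "/proc/sys") -> str:
+    """Return the current contents of root/relpath without the newline."""
+    with open(os.path.join(root, relpath)) as handle:
+        return handle.read().strip()
+
+
+def write_sysctl(relpath: str, value, root: str = "/proc/sys") -> None:
+    """Write value to root/relpath, like 'echo value > root/relpath'."""
+    with open(os.path.join(root, relpath), "w") as handle:
+        handle.write("%s\n" % value)
+```
+
+For example, read_sysctl("vm/page-cluster") returns the swap I/O size
+exponent on a system that provides that file.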
diff --git a/Documentation/filesystems/romfs.txt b/Documentation/filesystems/romfs.txt
new file mode 100644
index 0000000..2d2a7b2
--- /dev/null
+++ b/Documentation/filesystems/romfs.txt
@@ -0,0 +1,187 @@
+ROMFS - ROM FILE SYSTEM
+
+This is a quite dumb, read only filesystem, mainly for initial RAM
+disks of installation disks. It has grown up by the need of having
+modules linked at boot time. Using this filesystem, you get a very
+similar feature, and even the possibility of a small kernel, with a
+file system which doesn't take up useful memory from the router
+functions in the basement of your office.
+
+For comparison, both the older minix and xiafs (the latter is now
+defunct) filesystems, compiled as module need more than 20000 bytes,
+while romfs is less than a page, about 4000 bytes (assuming i586
+code). Under the same conditions, the msdos filesystem would need
+about 30K (and does not support device nodes or symlinks), while the
+nfs module with nfsroot is about 57K. Furthermore, as a bit unfair
+comparison, an actual rescue disk used up 3202 blocks with ext2, while
+with romfs, it needed 3079 blocks.
+
+To create such a file system, you'll need a user program named
+genromfs. It is available via anonymous ftp on sunsite.unc.edu and
+its mirrors, in the /pub/Linux/system/recovery/ directory.
+
+As the name suggests, romfs could be also used (space-efficiently) on
+various read-only media, like (E)EPROM disks if someone will have the
+motivation.. :)
+
+However, the main purpose of romfs is to have a very small kernel,
+which has only this filesystem linked in, and then can load any module
+later, with the current module utilities. It can also be used to run
+some program to decide if you need SCSI devices, and even IDE or
+floppy drives can be loaded later if you use the "initrd"--initial
+RAM disk--feature of the kernel. This would not be really news
+flash, but with romfs, you can even spare off your ext2 or minix or
+maybe even affs filesystem until you really know that you need it.
+
+For example, a distribution boot disk can contain only the cd disk
+drivers (and possibly the SCSI drivers), and the ISO 9660 filesystem
+module. The kernel can be small enough, since it doesn't have other
+filesystems, like the quite large ext2fs module, which can then be
+loaded off the CD at a later stage of the installation. Another use
+would be for a recovery disk, when you are reinstalling a workstation
+from the network, and you will have all the tools/modules available
+from a nearby server, so you don't want to carry two disks for this
+purpose, just because it won't fit into ext2.
+
+romfs operates on block devices as you can expect, and the underlying
+structure is very simple. Every accessible structure begins on 16
+byte boundaries for fast access. The minimum space a file will take
+is 32 bytes (this is an empty file, with a less than 16 character
+name). The maximum overhead for any non-empty file is the header, and
+the 16 byte padding for the name and the contents, that is 16+14+15 = 45
+bytes. This is quite rare however, since most file names are longer
+than 3 bytes, and shorter than 15 bytes.
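+
+The space taken by a single file can be estimated as follows. This is a
+sketch based only on the rules in this document (a 16 byte header, then the
+zero terminated name and the contents, each padded to a 16 byte boundary);
+the function names are invented for the example.
+
+```python
+def pad16(length: int) -> int:
+    """Round length up to the next multiple of 16."""
+    return (length + 15) & ~15
+
+
+def romfs_file_footprint(name: str, size: int) -> int:
+    """Bytes used by one file: header + padded name + padded contents."""
+    header = 16  # next filehdr, spec.info, size and checksum longwords
+    name_bytes = pad16(len(name.encode()) + 1)  # +1 for the terminating zero
+    return header + name_bytes + pad16(size)
+
+
+print(romfs_file_footprint("a", 0))
+print(romfs_file_footprint("a" * 16, 1))
+```
+
+An empty file with a short name comes out at the 32 byte minimum mentioned
+above.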
+
+The layout of the filesystem is the following:
+
+offset      content
+
+        +---+---+---+---+
+  0     | - | r | o | m |  \
+        +---+---+---+---+    The ASCII representation of those bytes
+  4     | 1 | f | s | - |  /  (i.e. "-rom1fs-")
+        +---+---+---+---+
+  8     |   full size   |    The number of accessible bytes in this fs.
+        +---+---+---+---+
+ 12     |    checksum   |    The checksum of the FIRST 512 BYTES.
+        +---+---+---+---+
+ 16     | volume name   |    The zero terminated name of the volume,
+        :               :    padded to 16 byte boundary.
+        +---+---+---+---+
+ xx     |     file      |
+        :    headers    :
+
+Every multi byte value (32 bit words, I'll use the longwords term from
+now on) must be in big endian order.
+
+The first eight bytes identify the filesystem, even for the casual
+inspector. After that, in the 3rd longword, it contains the number of
+bytes accessible from the start of this filesystem. The 4th longword
+is the checksum of the first 512 bytes (or the number of bytes
+accessible, whichever is smaller). The applied algorithm is the same
+as in the AFFS filesystem, namely a simple sum of the longwords
+(assuming bigendian quantities again). For details, please consult
+the source. This algorithm was chosen because although it's not quite
+reliable, it does not require any tables, and it is very simple.
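+
+As an illustration of the layout and checksum described above, here is a
+short Python sketch that reads the first longwords of an image. The function
+names are made up for the example. It also assumes, as the kernel source
+appears to do, that an image is valid when the big endian longwords of the
+covered area (including the checksum field itself) add up to zero modulo
+2^32; check the source before relying on this.
+
+```python
+import struct
+
+MAGIC = b"-rom1fs-"
+
+
+def romfs_checksum(data: bytes) -> int:
+    """Sum the big endian longwords of data, modulo 2**32."""
+    usable = len(data) & ~3  # ignore a trailing partial longword
+    words = struct.unpack(">%dI" % (usable // 4), data[:usable])
+    return sum(words) & 0xFFFFFFFF
+
+
+def parse_superblock(image: bytes) -> dict:
+    """Decode the magic, full size, checksum and volume name fields."""
+    if image[:8] != MAGIC:
+        raise ValueError("not a romfs image")
+    full_size, checksum = struct.unpack(">II", image[8:16])
+    end = image.index(b"\0", 16)  # the volume name is zero terminated
+    covered = image[:min(full_size, 512)]
+    return {
+        "full_size": full_size,
+        "checksum": checksum,
+        "volume": image[16:end].decode("ascii", "replace"),
+        # Assumption: a valid image sums to zero over the covered bytes.
+        "checksum_ok": romfs_checksum(covered) == 0,
+    }
+```
+
+Given the bytes of an image, parse_superblock returns the decoded fields
+together with the result of the checksum test.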
+
+The following bytes are now part of the file system; each file header
+must begin on a 16 byte boundary.
+
+offset      content
+
+        +---+---+---+---+
+  0     | next filehdr|X|    The offset of the next file header
+        +---+---+---+---+      (zero if no more files)
+  4     |   spec.info   |    Info for directories/hard links/devices
+        +---+---+---+---+
+  8     |     size      |    The size of this file in bytes
+        +---+---+---+---+
+ 12     |   checksum    |    Covering the meta data, including the file
+        +---+---+---+---+      name, and padding
+ 16     | file name     |    The zero terminated name of the file,
+        :               :    padded to 16 byte boundary
+        +---+---+---+---+
+ xx     | file data     |
+        :               :
+
+Since the file headers begin always at a 16 byte boundary, the lowest
+4 bits would be always zero in the next filehdr pointer. These four
+bits are used for the mode information. Bits 0..2 specify the type of
+the file; while bit 3 shows if the file is executable or not. The
+permissions are assumed to be world readable, if this bit is not set,
+and world executable if it is; except the character and block devices,
+they are never accessible for other than owner. The owner of every
+file is user and group 0, this should never be a problem for the
+intended use. The mapping of the 8 possible values to file types is
+the following:
+
+          mapping            spec.info means
+ 0        hard link          link destination [file header]
+ 1        directory          first file's header
+ 2        regular file       unused, must be zero [MBZ]
+ 3        symbolic link      unused, MBZ (file data is the link content)
+ 4        block device       16/16 bits major/minor number
+ 5        char device                    - " -
+ 6        socket             unused, MBZ
+ 7        fifo               unused, MBZ
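+
+Splitting the first longword of a file header into its parts can be sketched
+like this. The helper name is hypothetical, and the code assumes that the
+executable flag is the one remaining low bit (value 8) next to the three type
+bits, which is the only place left for it among the four low bits.
+
+```python
+FILE_TYPES = (
+    "hard link", "directory", "regular file", "symbolic link",
+    "block device", "char device", "socket", "fifo",
+)
+
+
+def decode_next_filehdr(word: int) -> dict:
+    """Split the 'next filehdr' longword into offset, type and exec flag."""
+    return {
+        "next_offset": word & ~0xF,      # headers are 16 byte aligned
+        "type": FILE_TYPES[word & 0x7],  # low three bits select the type
+        "executable": bool(word & 0x8),  # assumed: remaining low bit
+    }
+
+
+print(decode_next_filehdr(0x4A))
+```
+
+A next_offset of zero means that there are no more files in the directory.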
+
+Note that hard links are specifically marked in this filesystem, but
+they will behave as you can expect (i.e. share the inode number).
+Note also that it is your responsibility to not create hard link
+loops, and creating all the . and .. links for directories. This is
+normally done correctly by the genromfs program. Please refrain from
+using the executable bits for special purposes on the socket and fifo
+special files, they may have other uses in the future. Additionally,
+please remember that only regular files, and symlinks are supposed to
+have a nonzero size field; they contain the number of bytes available
+directly after the (padded) file name.
+
+Another thing to note is that romfs works on file headers and data
+aligned to 16 byte boundaries, but most hardware devices and the block
+device drivers are unable to cope with smaller than block-sized data.
+To overcome this limitation, the whole size of the file system must be
+padded to a 1024 byte boundary.
+
+If you have any problems or suggestions concerning this file system,
+please contact me. However, think twice before wanting me to add
+features and code, because the primary and most important advantage of
+this file system is the small code. On the other hand, don't be
+alarmed, I'm not getting that much romfs related mail. Now I can
+understand why Avery wrote poems in the ARCnet docs to get some more
+feedback. :)
+
+romfs has also a mailing list, and to date, it hasn't received any
+traffic, so you are welcome to join it to discuss your ideas. :)
+
+It's run by ezmlm, so you can subscribe to it by sending a message
+to romfs-subscribe@shadow.banki.hu, the content is irrelevant.
+
+Pending issues:
+
+- Permissions and owner information are pretty essential features of a
+Un*x like system, but romfs does not provide the full possibilities.
+I have never found this limiting, but others might.
+
+- The file system is read only, so it can be very small, but in case
+one would want to write _anything_ to a file system, he still needs
+a writable file system, thus negating the size advantages. Possible
+solutions: implement write access as a compile-time option, or a new,
+similarly small writable filesystem for RAM disks.
+
+- Since the files are only required to have alignment on a 16 byte
+boundary, it is currently possibly suboptimal to read or execute files
+from the filesystem. It might be resolved by reordering file data to
+have most of it (i.e. except the start and the end) lying at "natural"
+boundaries, thus it would be possible to directly map a big portion of
+the file contents to the mm subsystem.
+
+- Compression might be a useful feature, but memory is quite a
+limiting factor in my eyes.
+
+- Where is it used?
+
+- Does it work on architectures other than Intel and Motorola?
+
+
+Have fun,
+Janos Farkas <chexum@shadow.banki.hu>
diff --git a/Documentation/filesystems/smbfs.txt b/Documentation/filesystems/smbfs.txt
new file mode 100644
index 0000000..f673ef0
--- /dev/null
+++ b/Documentation/filesystems/smbfs.txt
@@ -0,0 +1,8 @@
+Smbfs is a filesystem that implements the SMB protocol, which is the
+protocol used by Windows for Workgroups, Windows 95 and Windows NT.
+Smbfs was inspired by Samba, the program written by Andrew Tridgell
+that turns any Unix host into a file server for DOS or Windows clients.
+
+Smbfs is an SMB client, but uses parts of samba for its operation. For
+more info on samba, including documentation, please go to
+http://www.samba.org/ and then on to your nearest mirror.
diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.txt
new file mode 100644
index 0000000..e97d024
--- /dev/null
+++ b/Documentation/filesystems/sysfs-pci.txt
@@ -0,0 +1,88 @@
+Accessing PCI device resources through sysfs
+
+sysfs, usually mounted at /sys, provides access to PCI resources on platforms
+that support it. For example, a given bus might look like this:
+
+ /sys/devices/pci0000:17
+ |-- 0000:17:00.0
+ | |-- class
+ | |-- config
+ | |-- detach_state
+ | |-- device
+ | |-- irq
+ | |-- local_cpus
+ | |-- resource
+ | |-- resource0
+ | |-- resource1
+ | |-- resource2
+ | |-- rom
+ | |-- subsystem_device
+ | |-- subsystem_vendor
+ | `-- vendor
+ `-- detach_state
+
+The topmost element describes the PCI domain and bus number. In this case,
+the domain number is 0000 and the bus number is 17 (both values are in hex).
+This bus contains a single function device in slot 0. The domain and bus
+numbers are reproduced for convenience. Under the device directory are several
+files, each with their own function.
+
+ file function
+ ---- --------
+ class PCI class (ascii, ro)
+ config PCI config space (binary, rw)
+ detach_state connection status (bool, rw)
+ device PCI device (ascii, ro)
+ irq IRQ number (ascii, ro)
+ local_cpus nearby CPU mask (cpumask, ro)
+ resource PCI resource host addresses (ascii, ro)
+ resource0..N PCI resource N, if present (binary, mmap)
+ rom PCI ROM resource, if present (binary, ro)
+ subsystem_device PCI subsystem device (ascii, ro)
+ subsystem_vendor PCI subsystem vendor (ascii, ro)
+ vendor PCI vendor (ascii, ro)
+
+ ro - read only file
+ rw - file is readable and writable
+ mmap - file is mmapable
+ ascii - file contains ascii text
+ binary - file contains binary data
+ cpumask - file contains a cpumask type
+
+The read only files are informational; writes to them will be ignored.
+Writable files can be used to perform actions on the device (e.g. changing
+config space, detaching a device). mmapable files are available via an
+mmap of the file at offset 0 and can be used to do actual device programming
+from userspace. Note that some platforms don't support mmapping of certain
+resources, so be sure to check the return value from any attempted mmap.
+
+Accessing legacy resources through sysfs
+
+Legacy I/O port and ISA memory resources are also provided in sysfs if the
+underlying platform supports them. They're located in the PCI class hierarchy,
+e.g.
+
+ /sys/class/pci_bus/0000:17/
+ |-- bridge -> ../../../devices/pci0000:17
+ |-- cpuaffinity
+ |-- legacy_io
+ `-- legacy_mem
+
+The legacy_io file is a read/write file that can be used by applications to
+do legacy port I/O. The application should open the file, seek to the desired
+port (e.g. 0x3e8) and do a read or a write of 1, 2 or 4 bytes. The legacy_mem
+file should be mmapped with an offset corresponding to the memory offset
+desired, e.g. 0xa0000 for the VGA frame buffer. The application can then
+simply dereference the returned pointer (after checking for errors of course)
+to access legacy memory space.
+
+Supporting PCI access on new platforms
+
+In order to support PCI resource mapping as described above, Linux platform
+code must define HAVE_PCI_MMAP and provide a pci_mmap_page_range function.
+Platforms are free to only support subsets of the mmap functionality, but
+useful return codes should be provided.
+
+Legacy resources are protected by the HAVE_PCI_LEGACY define. Platforms
+wishing to support legacy functionality should define it and provide
+pci_legacy_read, pci_legacy_write and pci_mmap_legacy_page_range functions.
\ No newline at end of file
diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt
new file mode 100644
index 0000000..60f6c2c
--- /dev/null
+++ b/Documentation/filesystems/sysfs.txt
@@ -0,0 +1,341 @@
+
+sysfs - _The_ filesystem for exporting kernel objects.
+
+Patrick Mochel <mochel@osdl.org>
+
+10 January 2003
+
+
+What it is:
+~~~~~~~~~~~
+
+sysfs is a ram-based filesystem initially based on ramfs. It provides
+a means to export kernel data structures, their attributes, and the
+linkages between them to userspace.
+
+sysfs is tied inherently to the kobject infrastructure. Please read
+Documentation/kobject.txt for more information concerning the kobject
+interface.
+
+
+Using sysfs
+~~~~~~~~~~~
+
+sysfs is always compiled in. You can access it by doing:
+
+ mount -t sysfs sysfs /sys
+
+
+Directory Creation
+~~~~~~~~~~~~~~~~~~
+
+For every kobject that is registered with the system, a directory is
+created for it in sysfs. That directory is created as a subdirectory
+of the kobject's parent, expressing internal object hierarchies to
+userspace. Top-level directories in sysfs represent the common
+ancestors of object hierarchies; i.e. the subsystems the objects
+belong to.
+
+Sysfs internally stores the kobject that owns the directory in the
+->d_fsdata pointer of the directory's dentry. This allows sysfs to do
+reference counting directly on the kobject when the file is opened and
+closed.
+
+
+Attributes
+~~~~~~~~~~
+
+Attributes can be exported for kobjects in the form of regular files in
+the filesystem. Sysfs forwards file I/O operations to methods defined
+for the attributes, providing a means to read and write kernel
+attributes.
+
+Attributes should be ASCII text files, preferably with only one value
+per file. It is noted that it may not be efficient to contain only one
+value per file, so it is socially acceptable to express an array of
+values of the same type.
+
+Mixing types, expressing multiple lines of data, and doing fancy
+formatting of data is heavily frowned upon. Doing these things may get
+you publicly humiliated and your code rewritten without notice.
+
+
+An attribute definition is simply:
+
+struct attribute {
+ char * name;
+ mode_t mode;
+};
+
+
+int sysfs_create_file(struct kobject * kobj, struct attribute * attr);
+void sysfs_remove_file(struct kobject * kobj, struct attribute * attr);
+
+
+A bare attribute contains no means to read or write the value of the
+attribute. Subsystems are encouraged to define their own attribute
+structure and wrapper functions for adding and removing attributes for
+a specific object type.
+
+For example, the driver model defines struct device_attribute like:
+
+struct device_attribute {
+ struct attribute attr;
+ ssize_t (*show)(struct device * dev, char * buf);
+ ssize_t (*store)(struct device * dev, const char * buf);
+};
+
+int device_create_file(struct device *, struct device_attribute *);
+void device_remove_file(struct device *, struct device_attribute *);
+
+It also defines this helper for defining device attributes:
+
+#define DEVICE_ATTR(_name,_mode,_show,_store) \
+struct device_attribute dev_attr_##_name = { \
+ .attr = {.name = __stringify(_name) , .mode = _mode }, \
+ .show = _show, \
+ .store = _store, \
+};
+
+For example, declaring
+
+static DEVICE_ATTR(foo,0644,show_foo,store_foo);
+
+is equivalent to doing:
+
+static struct device_attribute dev_attr_foo = {
+ .attr = {
+ .name = "foo",
+ .mode = 0644,
+ },
+ .show = show_foo,
+ .store = store_foo,
+};
+
+
+Subsystem-Specific Callbacks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a subsystem defines a new attribute type, it must implement a
+set of sysfs operations for forwarding read and write calls to the
+show and store methods of the attribute owners.
+
+struct sysfs_ops {
+ ssize_t (*show)(struct kobject *, struct attribute *,char *);
+ ssize_t (*store)(struct kobject *,struct attribute *,const char *);
+};
+
+[ Subsystems should have already defined a struct kobj_type as a
+descriptor for this type, which is where the sysfs_ops pointer is
+stored. See the kobject documentation for more information. ]
+
+When a file is read or written, sysfs calls the appropriate method
+for the type. The method then translates the generic struct kobject
+and struct attribute pointers to the appropriate pointer types, and
+calls the associated methods.
+
+
+To illustrate:
+
+#define to_dev_attr(_attr) container_of(_attr,struct device_attribute,attr)
+#define to_dev(d) container_of(d, struct device, kobj)
+
+static ssize_t
+dev_attr_show(struct kobject * kobj, struct attribute * attr, char * buf)
+{
+ struct device_attribute * dev_attr = to_dev_attr(attr);
+ struct device * dev = to_dev(kobj);
+ ssize_t ret = 0;
+
+ if (dev_attr->show)
+ ret = dev_attr->show(dev,buf);
+ return ret;
+}
+
+
+
+Reading/Writing Attribute Data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To read or write attributes, show() or store() methods must be
+specified when declaring the attribute. The method types should be as
+simple as those defined for device attributes:
+
+ ssize_t (*show)(struct device * dev, char * buf);
+ ssize_t (*store)(struct device * dev, const char * buf);
+
+IOW, they should take only an object and a buffer as parameters.
+
+
+sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the
+method. Sysfs will call the method exactly once for each read or
+write. This forces the following behavior on the method
+implementations:
+
+- On read(2), the show() method should fill the entire buffer.
+ Recall that an attribute should only be exporting one value, or an
+ array of similar values, so this shouldn't be that expensive.
+
+ This allows userspace to do partial reads and seeks arbitrarily over
+ the entire file at will.
+
+- On write(2), sysfs expects the entire buffer to be passed during the
+ first write. Sysfs then passes the entire buffer to the store()
+ method.
+
+ When writing sysfs files, userspace processes should first read the
+ entire file, modify the values it wishes to change, then write the
+ entire buffer back.
+
+ Attribute method implementations should operate on an identical
+ buffer when reading and writing values.
+
+Other notes:
+
+- The buffer will always be PAGE_SIZE bytes in length. On i386, this
+ is 4096.
+
+- show() methods should return the number of bytes printed into the
+ buffer. This is the return value of snprintf().
+
+- show() should always use snprintf().
+
+- store() should return the number of bytes used from the buffer. This
+ can be done using strlen().
+
+- show() or store() can always return errors. If a bad value comes
+ through, be sure to return an error.
+
+- The object passed to the methods will be pinned in memory via sysfs
+ reference counting its embedded object. However, the physical
+ entity (e.g. device) the object represents may not be present. Be
+ sure to have a way to check this, if necessary.
+
+
+A very simple (and naive) implementation of a device attribute is:
+
+static ssize_t show_name(struct device * dev, char * buf)
+{
+ return snprintf(buf,PAGE_SIZE,"%s\n",dev->name);
+}
+
+static ssize_t store_name(struct device * dev, const char * buf)
+{
+ sscanf(buf,"%20s",dev->name);
+ return strlen(buf);
+}
+
+static DEVICE_ATTR(name,S_IRUGO,show_name,store_name);
+
+
+(Note that the real implementation doesn't allow userspace to set the
+name for a device.)
+
+
+Top Level Directory Layout
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The sysfs directory arrangement exposes the relationship of kernel
+data structures.
+
+The top level sysfs directory looks like:
+
+block/
+bus/
+class/
+devices/
+firmware/
+net/
+
+devices/ contains a filesystem representation of the device tree. It maps
+directly to the internal kernel device tree, which is a hierarchy of
+struct device.
+
+bus/ contains a flat directory layout of the various bus types in the
+kernel. Each bus's directory contains two subdirectories:
+
+ devices/
+ drivers/
+
+devices/ contains symlinks for each device discovered in the system
+that point to the device's directory under root/.
+
+drivers/ contains a directory for each device driver that is loaded
+for devices on that particular bus (this assumes that drivers do not
+span multiple bus types).
+
+
+More information on driver-model specific features can be found in
+Documentation/driver-model/.
+
+
+TODO: Finish this section.
+
+
+Current Interfaces
+~~~~~~~~~~~~~~~~~~
+
+The following interface layers currently exist in sysfs:
+
+
+- devices (include/linux/device.h)
+----------------------------------
+Structure:
+
+struct device_attribute {
+ struct attribute attr;
+ ssize_t (*show)(struct device * dev, char * buf);
+ ssize_t (*store)(struct device * dev, const char * buf);
+};
+
+Declaring:
+
+DEVICE_ATTR(_name,_mode,_show,_store);
+
+Creation/Removal:
+
+int device_create_file(struct device *device, struct device_attribute * attr);
+void device_remove_file(struct device * dev, struct device_attribute * attr);
+
+
+- bus drivers (include/linux/device.h)
+--------------------------------------
+Structure:
+
+struct bus_attribute {
+ struct attribute attr;
+ ssize_t (*show)(struct bus_type *, char * buf);
+ ssize_t (*store)(struct bus_type *, const char * buf);
+};
+
+Declaring:
+
+BUS_ATTR(_name,_mode,_show,_store)
+
+Creation/Removal:
+
+int bus_create_file(struct bus_type *, struct bus_attribute *);
+void bus_remove_file(struct bus_type *, struct bus_attribute *);
+
+
+- device drivers (include/linux/device.h)
+-----------------------------------------
+
+Structure:
+
+struct driver_attribute {
+ struct attribute attr;
+ ssize_t (*show)(struct device_driver *, char * buf);
+ ssize_t (*store)(struct device_driver *, const char * buf);
+};
+
+Declaring:
+
+DRIVER_ATTR(_name,_mode,_show,_store)
+
+Creation/Removal:
+
+int driver_create_file(struct device_driver *, struct driver_attribute *);
+void driver_remove_file(struct device_driver *, struct driver_attribute *);
+
+
diff --git a/Documentation/filesystems/sysv-fs.txt b/Documentation/filesystems/sysv-fs.txt
new file mode 100644
index 0000000..d817224
--- /dev/null
+++ b/Documentation/filesystems/sysv-fs.txt
@@ -0,0 +1,38 @@
+This is the implementation of the SystemV/Coherent filesystem for Linux.
+It implements all of
+ - Xenix FS,
+ - SystemV/386 FS,
+ - Coherent FS.
+
+This is version beta 4.
+
+To install:
+* Answer the 'System V and Coherent filesystem support' question with 'y'
+ when configuring the kernel.
+* To mount a disk or a partition, use
+ mount [-r] -t sysv device mountpoint
+ The file system type names
+ -t sysv
+ -t xenix
+ -t coherent
+ may be used interchangeably, but the last two will eventually disappear.
+
+Bugs in the present implementation:
+- Coherent FS:
+ - The "free list interleave" n:m is currently ignored.
+ - Only file systems with no filesystem name and no pack name are recognized.
+ (See Coherent "man mkfs" for a description of these features.)
+- SystemV Release 2 FS:
+ The superblock is only searched in the blocks 9, 15, 18, which
+ correspond to the beginning of track 1 on floppy disks. No support
+ for this FS on hard disk yet.
+
+
+Please report any bugs and suggestions to
+ Bruno Haible <haible@ma2s2.mathematik.uni-karlsruhe.de>
+ Pascal Haible <haible@izfm.uni-stuttgart.de>
+ Krzysztof G. Baranowski <kgb@manjak.knm.org.pl>
+
+Bruno Haible
+<haible@ma2s2.mathematik.uni-karlsruhe.de>
+
diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt
new file mode 100644
index 0000000..417e309
--- /dev/null
+++ b/Documentation/filesystems/tmpfs.txt
@@ -0,0 +1,100 @@
+Tmpfs is a file system which keeps all files in virtual memory.
+
+
+Everything in tmpfs is temporary in the sense that no files will be
+created on your hard drive. If you unmount a tmpfs instance,
+everything stored therein is lost.
+
+tmpfs puts everything into the kernel internal caches and grows and
+shrinks to accommodate the files it contains and is able to swap
+unneeded pages out to swap space. It has maximum size limits which can
+be adjusted on the fly via 'mount -o remount ...'
+
+If you compare it to ramfs (which was the template to create tmpfs)
+you gain swapping and limit checking. Another similar thing is the RAM
+disk (/dev/ram*), which simulates a fixed size hard disk in physical
+RAM, where you have to create an ordinary filesystem on top. Ramdisks
+cannot swap and you do not have the possibility to resize them.
+
+Since tmpfs lives completely in the page cache and on swap, all tmpfs
+pages currently in memory will show up as cached. It will not show up
+as shared or something like that. Further on you can check the actual
+RAM+swap use of a tmpfs instance with df(1) and du(1).
+
+
+tmpfs has the following uses:
+
+1) There is always a kernel internal mount which you will not see at
+ all. This is used for shared anonymous mappings and SYSV shared
+ memory.
+
+ This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not
+ set, the user visible part of tmpfs is not built. But the internal
+ mechanisms are always present.
+
+2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
+ POSIX shared memory (shm_open, shm_unlink). Adding the following
+ line to /etc/fstab should take care of this:
+
+ tmpfs /dev/shm tmpfs defaults 0 0
+
+ Remember to create the directory that you intend to mount tmpfs on
+ if necessary (/dev/shm is automagically created if you use devfs).
+
+ This mount is _not_ needed for SYSV shared memory. The internal
+ mount is used for that. (In the 2.3 kernel versions it was
+ necessary to mount the predecessor of tmpfs (shm fs) to use SYSV
+ shared memory)
+
+3) Some people (including me) find it very convenient to mount it
+ e.g. on /tmp and /var/tmp and have a big swap partition. And now
+ loop mounts of tmpfs files do work, so mkinitrd shipped by most
+ distributions should succeed with a tmpfs /tmp.
+
+4) And probably a lot more I do not know about :-)
+
+
+tmpfs has three mount options for sizing:
+
+size: The limit of allocated bytes for this tmpfs instance. The
+ default is half of your physical RAM without swap. If you
+ oversize your tmpfs instances the machine will deadlock
+ since the OOM handler will not be able to free that memory.
+nr_blocks: The same as size, but in blocks of PAGE_CACHE_SIZE.
+nr_inodes: The maximum number of inodes for this instance. The default
+ is half of the number of your physical RAM pages, or (on a
+ machine with highmem) the number of lowmem RAM pages,
+ whichever is the lower.
+
+These parameters accept a suffix k, m or g for kilo, mega and giga and
+can be changed on remount. The size parameter also accepts a suffix %
+to limit this tmpfs instance to that percentage of your physical RAM:
+the default, when neither size nor nr_blocks is specified, is size=50%
+
+If both nr_blocks (or size) and nr_inodes are set to 0, neither blocks
+nor inodes will be limited in that instance. It is generally unwise to
+mount with such options, since it allows any user with write access to
+use up all the memory on the machine; but enhances the scalability of
+that instance in a system with many cpus making intensive use of it.
+
+
+To specify the initial root directory you can use the following mount
+options:
+
+mode: The permissions as an octal number
+uid: The user id
+gid: The group id
+
+These options do not have any effect on remount. You can change these
+parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
+
+
+So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
+will give you a tmpfs instance on /mytmpfs which can allocate 10GB
+RAM/SWAP in 10240 inodes and it is only accessible by root.
+
+
+Author:
+ Christoph Rohland <cr@sap.com>, 1.12.01
+Updated:
+ Hugh Dickins <hugh@veritas.com>, 01 September 2004
diff --git a/Documentation/filesystems/udf.txt b/Documentation/filesystems/udf.txt
new file mode 100644
index 0000000..e5213bc
--- /dev/null
+++ b/Documentation/filesystems/udf.txt
@@ -0,0 +1,57 @@
+*
+* Documentation/filesystems/udf.txt
+*
+UDF Filesystem version 0.9.8.1
+
+If you encounter problems with reading UDF discs using this driver,
+please report them to linux_udf@hpesjro.fc.hp.com, which is the
+developer's list.
+
+Write support requires a block driver which supports writing. The current
+scsi and ide cdrom drivers do not support writing.
+
+-------------------------------------------------------------------------------
+The following mount options are supported:
+
+ gid= Set the default group.
+ umask= Set the default umask.
+ uid= Set the default user.
+ bs= Set the block size.
+ unhide Show otherwise hidden files.
+ undelete Show deleted files in lists.
+ adinicb Embed data in the inode (default)
+ noadinicb Don't embed data in the inode
+ shortad Use short ad's
+ longad Use long ad's (default)
+ nostrict Unset strict conformance
+ iocharset= Set the NLS character set
+
+The remaining are for debugging and disaster recovery:
+
+ novrs Skip volume sequence recognition
+
+The following expect an offset from 0.
+
+ session= Set the CDROM session (default= last session)
+ anchor= Override standard anchor location. (default= 256)
+ volume= Override the VolumeDesc location. (unused)
+ partition= Override the PartitionDesc location. (unused)
+ lastblock= Set the last block of the filesystem.
+
+The following expect an offset from the partition root.
+
+ fileset= Override the fileset block location. (unused)
+ rootdir= Override the root directory location. (unused)
+ WARNING: overriding the rootdir to a non-directory may
+ yield highly unpredictable results.
+-------------------------------------------------------------------------------
+
+
+For the latest version and toolset see:
+ http://linux-udf.sourceforge.net/
+
+Documentation on UDF and ECMA 167 is available FREE from:
+ http://www.osta.org/
+ http://www.ecma-international.org/
+
+Ben Fennema <bfennema@falcon.csc.calpoly.edu>
diff --git a/Documentation/filesystems/ufs.txt b/Documentation/filesystems/ufs.txt
new file mode 100644
index 0000000..2b5a56a
--- /dev/null
+++ b/Documentation/filesystems/ufs.txt
@@ -0,0 +1,61 @@
+USING UFS
+=========
+
+mount -t ufs -o ufstype=type_of_ufs device dir
+
+
+UFS OPTIONS
+===========
+
+ufstype=type_of_ufs
+ UFS is a file system widely used in different operating systems.
+ The problem is the differences among implementations. Features of
+ some implementations are undocumented, so it's hard to recognize the
+ type of ufs automatically. That's why the user must specify the type
+ of ufs manually with the mount option ufstype. Possible values are:
+
+ old old format of ufs
+ default value, supported as read-only
+
+ 44bsd used in FreeBSD, NetBSD, OpenBSD
+ supported as read-write
+
+ ufs2 used in FreeBSD 5.x
+ supported as read-only
+
+ 5xbsd synonym for ufs2
+
+ sun used in SunOS (Solaris)
+ supported as read-write
+
+ sunx86 used in SunOS for Intel (Solarisx86)
+ supported as read-write
+
+ hp used in HP-UX
+ supported as read-only
+
+ nextstep
+ used in NextStep
+ supported as read-only
+
+ nextstep-cd
+ used for NextStep CDROMs (block_size == 2048)
+ supported as read-only
+
+ openstep
+ used in OpenStep
+ supported as read-only
+
+
+POSSIBLE PROBLEMS
+=================
+
+There is still a bug in the reallocation of fragments, in file fs/ufs/balloc.c,
+line 364, but it seems to work with the current buffer cache configuration.
+
+
+BUG REPORTS
+===========
+
+Send any ufs bug reports to daniel.pirkl@email.cz (do not send
+partition table bug reports.)
diff --git a/Documentation/filesystems/vfat.txt b/Documentation/filesystems/vfat.txt
new file mode 100644
index 0000000..5ead20c
--- /dev/null
+++ b/Documentation/filesystems/vfat.txt
@@ -0,0 +1,231 @@
+USING VFAT
+----------------------------------------------------------------------
+To use the vfat filesystem, use the filesystem type 'vfat', e.g.
+ mount -t vfat /dev/fd0 /mnt
+
+No special partition formatter is required. mkdosfs will work fine
+if you want to format from within Linux.
+
+VFAT MOUNT OPTIONS
+----------------------------------------------------------------------
+umask=### -- The permission mask (for files and directories, see umask(1)).
+ The default is the umask of current process.
+
+dmask=### -- The permission mask for the directory.
+ The default is the umask of current process.
+
+fmask=### -- The permission mask for files.
+ The default is the umask of current process.
+
+codepage=### -- Sets the codepage number for converting to shortname
+ characters on FAT filesystem.
+ By default, FAT_DEFAULT_CODEPAGE setting is used.
+
+iocharset=name -- Character set to use for converting between the
+ encoding used for user visible filenames and 16 bit
+ Unicode characters. Long filenames are stored on disk
+ in Unicode format, but Unix for the most part doesn't
+ know how to deal with Unicode.
+ By default, FAT_DEFAULT_IOCHARSET setting is used.
+
+ There is also an option of doing UTF8 translations
+ with the utf8 option.
+
+ NOTE: "iocharset=utf8" is not recommended. If unsure,
+ you should consider the following option instead.
+
+utf8=<bool> -- UTF8 is the filesystem safe version of Unicode that
+ is used by the console. It can be enabled for the
+ filesystem with this option. If 'uni_xlate' gets set,
+ UTF8 gets disabled.
+
+uni_xlate=<bool> -- Translate unhandled Unicode characters to special
+ escaped sequences. This would let you backup and
+ restore filenames that are created with any Unicode
+ characters. Until Linux supports Unicode for real,
+ this gives you an alternative. Without this option,
+ a '?' is used when no translation is possible. The
+ escape character is ':' because it is otherwise
+ illegal on the vfat filesystem. The escape sequence
+ that gets used is ':' and the four digits of hexadecimal
+ unicode.
+
+nonumtail=<bool> -- When creating 8.3 aliases, normally the alias will
+ end in '~1' or tilde followed by some number. If this
+ option is set, then if the filename is
+ "longfilename.txt" and "longfile.txt" does not
+ currently exist in the directory, 'longfile.txt' will
+ be the short alias instead of 'longfi~1.txt'.
+
+quiet -- Stops printing certain warning messages.
+
+check=s|r|n -- Case sensitivity checking setting.
+ s: strict, case sensitive
+ r: relaxed, case insensitive
+ n: normal, default setting, currently case insensitive
+
+shortname=lower|win95|winnt|mixed
+ -- Shortname display/create setting.
+ lower: convert to lowercase for display,
+ emulate the Windows 95 rule for create.
+ win95: emulate the Windows 95 rule for display/create.
+ winnt: emulate the Windows NT rule for display/create.
+ mixed: emulate the Windows NT rule for display,
+ emulate the Windows 95 rule for create.
+ Default setting is `lower'.
+
+<bool>: 0,1,yes,no,true,false
+
+TODO
+----------------------------------------------------------------------
+* Need to get rid of the raw scanning stuff. Instead, always use
+ a get next directory entry approach. The only thing left that uses
+ raw scanning is the directory renaming code.
+
+
+POSSIBLE PROBLEMS
+----------------------------------------------------------------------
+* vfat_valid_longname does not properly check reserved names.
+* When a volume name is the same as a directory name in the root
+ directory of the filesystem, the directory name sometimes shows
+ up as an empty file.
+* autoconv option does not work correctly.
+
+BUG REPORTS
+----------------------------------------------------------------------
+If you have trouble with the VFAT filesystem, mail bug reports to
+chaffee@bmrc.cs.berkeley.edu. Please specify the filename
+and the operation that gave you trouble.
+
+TEST SUITE
+----------------------------------------------------------------------
+If you plan to make any modifications to the vfat filesystem, please
+get the test suite that comes with the vfat distribution at
+
+ http://bmrc.berkeley.edu/people/chaffee/vfat.html
+
+This tests quite a few parts of the vfat filesystem and additional
+tests for new features or untested features would be appreciated.
+
+NOTES ON THE STRUCTURE OF THE VFAT FILESYSTEM
+----------------------------------------------------------------------
+(This documentation was provided by Galen C. Hunt <gchunt@cs.rochester.edu>
+ and lightly annotated by Gordon Chaffee).
+
+This document presents a very rough, technical overview of my
+knowledge of the extended FAT file system used in Windows NT 3.5 and
+Windows 95. I don't guarantee that any of the following is correct,
+but it appears to be so.
+
+The extended FAT file system is almost identical to the FAT
+file system used in DOS versions up to and including 6.223410239847
+:-). The significant change has been the addition of long file names.
+These names support up to 255 characters including spaces and lower
+case characters as opposed to the traditional 8.3 short names.
+
+Here is the description of the traditional FAT entry in the current
+Windows 95 filesystem:
+
+ struct directory { // Short 8.3 names
+ unsigned char name[8]; // file name
+ unsigned char ext[3]; // file extension
+ unsigned char attr; // attribute byte
+ unsigned char lcase; // Case for base and extension
+ unsigned char ctime_ms; // Creation time, milliseconds
+ unsigned char ctime[2]; // Creation time
+ unsigned char cdate[2]; // Creation date
+ unsigned char adate[2]; // Last access date
+ unsigned char reserved[2]; // reserved values (ignored)
+ unsigned char time[2]; // time stamp
+ unsigned char date[2]; // date stamp
+ unsigned char start[2]; // starting cluster number
+ unsigned char size[4]; // size of the file
+ };
+
+The lcase field specifies if the base and/or the extension of an 8.3
+name should be capitalized. This field does not seem to be used by
+Windows 95 but it is used by Windows NT. The case of filenames is not
+completely compatible from Windows NT to Windows 95. It is completely
+compatible in the reverse direction, however. Filenames that fit in
+the 8.3 namespace and are written on Windows NT to be lowercase will
+show up as uppercase on Windows 95.
+
+Note that the "start" and "size" values are actually little
+endian integer values. The descriptions of the fields in this
+structure are public knowledge and can be found elsewhere.
+
+With the extended FAT system, Microsoft has inserted extra
+directory entries for any files with extended names. (Any name which
+legally fits within the old 8.3 encoding scheme does not have extra
+entries.) I call these extra entries slots. Basically, a slot is a
+specially formatted directory entry which holds up to 13 characters of
+a file's extended name. Think of slots as additional labeling for the
+directory entry of the file to which they correspond. Microsoft
+prefers to refer to the 8.3 entry for a file as its alias and the
+extended slot directory entries as the file name.
+
+The C structure for a slot directory entry follows:
+
+ struct slot { // Up to 13 characters of a long name
+ unsigned char id; // sequence number for slot
+ unsigned char name0_4[10]; // first 5 characters in name
+ unsigned char attr; // attribute byte
+ unsigned char reserved; // always 0
+ unsigned char alias_checksum; // checksum for 8.3 alias
+ unsigned char name5_10[12]; // 6 more characters in name
+ unsigned char start[2]; // starting cluster number
+ unsigned char name11_12[4]; // last 2 characters in name
+ };
+
+If the layout of the slots looks a little odd, it's only
+because of Microsoft's efforts to maintain compatibility with old
+software. The slots must be disguised to prevent old software from
+panicking. To this end, a number of measures are taken:
+
+ 1) The attribute byte for a slot directory entry is always set
+ to 0x0f. This corresponds to an old directory entry with
+ attributes of "hidden", "system", "read-only", and "volume
+ label". Most old software will ignore any directory
+ entries with the "volume label" bit set. Real volume label
+ entries don't have the other three bits set.
+
+ 2) The starting cluster is always set to 0, an impossible
+ value for a DOS file.
+
+Because the extended FAT system is backward compatible, it is
+possible for old software to modify directory entries. Measures must
+be taken to ensure the validity of slots. An extended FAT system can
+verify that a slot does in fact belong to an 8.3 directory entry by
+the following:
+
+ 1) Positioning. Slots for a file always immediately precede
+ their corresponding 8.3 directory entry. In addition, each
+ slot has an id which marks its order in the extended file
+ name. Here is a very abbreviated view of an 8.3 directory
+ entry and its corresponding long name slots for the file
+ "My Big File.Extension which is long":
+
+ <preceding files...>
+ <slot #3, id = 0x43, characters = "h is long">
+ <slot #2, id = 0x02, characters = "xtension whic">
+ <slot #1, id = 0x01, characters = "My Big File.E">
+ <directory entry, name = "MYBIGFIL.EXT">
+
+ Note that the slots are stored from last to first. Slots
+ are numbered from 1 to N. The Nth slot is or'ed with 0x40
+ to mark it as the last one.
+
+ 2) Checksum. Each slot has an "alias_checksum" value. The
+ checksum is calculated from the 8.3 name using the
+ following algorithm:
+
+ for (sum = i = 0; i < 11; i++) {
+ sum = (((sum&1)<<7)|((sum&0xfe)>>1)) + name[i];
+ }
+
+ 3) If there is free space in the final slot, a Unicode NULL (0x0000)
+ is stored after the final character. After that, all unused
+ characters in the final slot are set to Unicode 0xFFFF.
+
+Finally, note that the extended name is stored in Unicode. Each Unicode
+character takes two bytes.
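+
+As a rough illustration of the checksum and slot numbering rules above,
+here is a small userspace sketch. The helper names (vfat_alias_checksum,
+slot_is_last, slot_sequence) are made up for this example and are not
+taken from any kernel source. The only facts assumed are the ones stated
+in this document: an 11-byte 8.3 name, the rotate-and-add checksum, and
+the 0x40 marker on the last slot.
+

```c
#include <assert.h>
#include <stddef.h>

/* Rotate-right-by-one-and-add checksum over the 11-byte 8.3 name
 * (8 bytes of base name followed by 3 bytes of extension, no dot).
 * The sum is kept in an unsigned char so that it wraps modulo 256. */
static unsigned char vfat_alias_checksum(const unsigned char name[11])
{
	unsigned char sum = 0;
	size_t i;

	assert(name != NULL);
	for (i = 0; i < 11; i++)
		sum = (unsigned char)((((sum & 1) << 7) | ((sum & 0xfe) >> 1)) + name[i]);
	return sum;
}

/* Hypothetical helper: is the 0x40 "last slot" marker set in this id? */
static int slot_is_last(unsigned char id)
{
	return (id & 0x40) != 0;
}

/* Hypothetical helper: the slot's sequence number with the marker removed. */
static unsigned int slot_sequence(unsigned char id)
{
	return id & ~0x40u;
}
```

+
+With these helpers, the id 0x43 from the example above decodes as
+sequence number 3 with the last-slot marker set.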
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
new file mode 100644
index 0000000..3f318dd
--- /dev/null
+++ b/Documentation/filesystems/vfs.txt
@@ -0,0 +1,671 @@
+/* -*- auto-fill -*- */
+
+ Overview of the Virtual File System
+
+ Richard Gooch <rgooch@atnf.csiro.au>
+
+ 5-JUL-1999
+
+
+Conventions used in this document <section>
+=================================
+
+Each section in this document will have the string "<section>" at the
+right-hand side of the section title. Each subsection will have
+"<subsection>" at the right-hand side. These strings are meant to make
+it easier to search through the document.
+
+NOTE that the master copy of this document is available online at:
+http://www.atnf.csiro.au/~rgooch/linux/docs/vfs.txt
+
+
+What is it? <section>
+===========
+
+The Virtual File System (otherwise known as the Virtual Filesystem
+Switch) is the software layer in the kernel that provides the
+filesystem interface to userspace programs. It also provides an
+abstraction within the kernel which allows different filesystem
+implementations to co-exist.
+
+
+A Quick Look At How It Works <section>
+============================
+
+In this section I'll briefly describe how things work, before
+launching into the details. I'll start with describing what happens
+when user programs open and manipulate files, and then look from the
+other view which is how a filesystem is supported and subsequently
+mounted.
+
+Opening a File <subsection>
+--------------
+
+The VFS implements the open(2), stat(2), chmod(2) and similar system
+calls. The pathname argument is used by the VFS to search through the
+directory entry cache (dentry cache or "dcache"). This provides a very
+fast look-up mechanism to translate a pathname (filename) into a
+specific dentry.
+
+An individual dentry usually has a pointer to an inode. Inodes are the
+things that live on disc drives, and can be regular files (you know:
+those things that you write data into), directories, FIFOs and other
+beasts. Dentries live in RAM and are never saved to disc: they exist
+only for performance. Inodes live on disc and are copied into memory
+when required. Later any changes are written back to disc. The inode
+that lives in RAM is a VFS inode, and it is this which the dentry
+points to. A single inode can be pointed to by multiple dentries
+(think about hardlinks).
+
+The dcache is meant to be a view into your entire filespace. Unlike
+Linus, most of us losers can't fit enough dentries into RAM to cover
+all of our filespace, so the dcache has bits missing. In order to
+resolve your pathname into a dentry, the VFS may have to resort to
+creating dentries along the way, and then loading the inode. This is
+done by looking up the inode.
+
+To look up an inode (usually read from disc) requires that the VFS
+calls the lookup() method of the parent directory inode. This method
+is installed by the specific filesystem implementation that the inode
+lives in. There will be more on this later.
+
+Once the VFS has the required dentry (and hence the inode), we can do
+all those boring things like open(2) the file, or stat(2) it to peek
+at the inode data. The stat(2) operation is fairly simple: once the
+VFS has the dentry, it peeks at the inode data and passes some of it
+back to userspace.
+
+Opening a file requires another operation: allocation of a file
+structure (this is the kernel-side implementation of file
+descriptors). The freshly allocated file structure is initialised with
+a pointer to the dentry and a set of file operation member functions.
+These are taken from the inode data. The open() file method is then
+called so the specific filesystem implementation can do its work. You
+can see that this is another switch performed by the VFS.
+
+The file structure is placed into the file descriptor table for the
+process.
+
+Reading, writing and closing files (and other assorted VFS operations)
+are done by using the userspace file descriptor to grab the appropriate
+file structure, and then calling the required file structure method
+function to do whatever is required.
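+
+The descriptor-to-method dispatch described above can be sketched in
+userspace C. Everything here (toy_file, toy_file_ops, toy_fd_table,
+toy_mem_read, toy_sys_read) is a made-up stand-in and not kernel code.
+It only assumes what the text says: a descriptor indexes a per-process
+table, the table entry is a file structure, and the file structure
+carries the methods installed by the filesystem.
+

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_FDS 8

struct toy_file;

/* Per-filesystem method table, modelled loosely on file operations. */
struct toy_file_ops {
	long (*read)(struct toy_file *f, char *buf, size_t len);
};

/* Kernel-side object that a file descriptor refers to. */
struct toy_file {
	const struct toy_file_ops *f_op;
	const char *data;	/* backing data for this toy example */
	size_t pos;		/* current file position */
};

/* Per-process descriptor table: the index is the userspace descriptor. */
static struct toy_file *toy_fd_table[TOY_MAX_FDS];

/* One "filesystem" read method: copy from an in-memory string. */
static long toy_mem_read(struct toy_file *f, char *buf, size_t len)
{
	size_t left = strlen(f->data) - f->pos;

	if (len > left)
		len = left;
	memcpy(buf, f->data + f->pos, len);
	f->pos += len;
	return (long)len;
}

static const struct toy_file_ops toy_mem_ops = { .read = toy_mem_read };

/* The "switch": descriptor -> file structure -> filesystem method. */
static long toy_sys_read(int fd, char *buf, size_t len)
{
	struct toy_file *f;

	if (fd < 0 || fd >= TOY_MAX_FDS || toy_fd_table[fd] == NULL)
		return -1;	/* bad descriptor */
	f = toy_fd_table[fd];
	assert(f->f_op != NULL);
	if (f->f_op->read == NULL)
		return -1;	/* operation not supported */
	return f->f_op->read(f, buf, len);
}
```

+
+The caller of toy_sys_read() never names a filesystem. The choice was
+made once, when the file structure was set up with its method table.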
+
+For as long as the file is open, it keeps the dentry "open" (in use),
+which in turn means that the VFS inode is still in use.
+
+All VFS system calls (e.g. open(2), stat(2), read(2), write(2),
+chmod(2) and so on) are called from a process context. You should
+assume that these calls are made without any kernel locks being
+held. This means that the processes may be executing the same piece of
+filesystem or driver code at the same time, on different
+processors. You should ensure that access to shared resources is
+protected by appropriate locks.
+
+Registering and Mounting a Filesystem <subsection>
+-------------------------------------
+
+If you want to support a new kind of filesystem in the kernel, all you
+need to do is call register_filesystem(). You pass a structure
+describing the filesystem implementation (struct file_system_type)
+which is then added to an internal table of supported filesystems. You
+can do:
+
+% cat /proc/filesystems
+
+to see what filesystems are currently available on your system.
+
+When a request is made to mount a block device onto a directory in
+your filespace the VFS will call the appropriate method for the
+specific filesystem. The dentry for the mount point will then be
+updated to point to the root inode for the new filesystem.
+
+It's now time to look at things in more detail.
+
+
+struct file_system_type <section>
+=======================
+
+This describes the filesystem. As of kernel 2.1.99, the following
+members are defined:
+
+struct file_system_type {
+ const char *name;
+ int fs_flags;
+ struct super_block *(*read_super) (struct super_block *, void *, int);
+ struct file_system_type * next;
+};
+
+ name: the name of the filesystem type, such as "ext2", "iso9660",
+ "msdos" and so on
+
+ fs_flags: various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE)
+
+ read_super: the method to call when a new instance of this
+ filesystem should be mounted
+
+ next: for internal VFS use: you should initialise this to NULL
+
+The read_super() method has the following arguments:
+
+ struct super_block *sb: the superblock structure. This is partially
+ initialised by the VFS and the rest must be initialised by the
+ read_super() method
+
+ void *data: arbitrary mount options, usually comes as an ASCII
+ string
+
+ int silent: whether or not to be silent on error
+
+The read_super() method must determine if the block device specified
+in the superblock contains a filesystem of the type the method
+supports. On success the method returns the superblock pointer, on
+failure it returns NULL.
+
+The most interesting member of the superblock structure that the
+read_super() method fills in is the "s_op" field. This is a pointer to
+a "struct super_operations" which describes the next level of the
+filesystem implementation.
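+
+To make the registration and read_super() flow a little more concrete,
+here is a userspace mock-up. The toy_* names are invented for this
+sketch, and the duplicate-name check in toy_register_filesystem() is an
+assumption of the example, not something this document states about the
+real register_filesystem(). The parts taken from the text are the "next"
+linkage initialised to NULL, read_super() returning the superblock
+pointer on success and NULL on failure, and read_super() filling in
+"s_op".
+

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_super_block {
	const void *s_op;	/* stands in for the "s_op" field */
};

struct toy_fs_type {
	const char *name;
	struct toy_super_block *(*read_super)(struct toy_super_block *,
					      void *, int);
	struct toy_fs_type *next;	/* list linkage, initialise to NULL */
};

static struct toy_fs_type *toy_fs_list;

/* Append to the table; refuse duplicate names. Returns 0 on success. */
static int toy_register_filesystem(struct toy_fs_type *fs)
{
	struct toy_fs_type **p;

	assert(fs != NULL && fs->next == NULL);
	for (p = &toy_fs_list; *p != NULL; p = &(*p)->next)
		if (strcmp((*p)->name, fs->name) == 0)
			return -1;
	*p = fs;
	return 0;
}

/* Try to mount: find the type by name and call its read_super(). */
static struct toy_super_block *toy_mount(const char *type,
					  struct toy_super_block *sb,
					  void *data)
{
	struct toy_fs_type *fs;

	for (fs = toy_fs_list; fs != NULL; fs = fs->next)
		if (strcmp(fs->name, type) == 0)
			return fs->read_super(sb, data, 0);
	return NULL;	/* unknown filesystem type */
}

static const int toy_super_ops = 1;	/* placeholder for a method table */

/* A read_super() that always recognises the "device". */
static struct toy_super_block *toyfs_read_super(struct toy_super_block *sb,
						 void *data, int silent)
{
	(void)data;
	(void)silent;
	sb->s_op = &toy_super_ops;	/* fill in the next level */
	return sb;			/* success: return the superblock */
}
```
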
+
+
+struct super_operations <section>
+=======================
+
+This describes how the VFS can manipulate the superblock of your
+filesystem. As of kernel 2.1.99, the following members are defined:
+
+struct super_operations {
+ void (*read_inode) (struct inode *);
+ int (*write_inode) (struct inode *, int);
+ void (*put_inode) (struct inode *);
+ void (*drop_inode) (struct inode *);
+ void (*delete_inode) (struct inode *);
+ int (*notify_change) (struct dentry *, struct iattr *);
+ void (*put_super) (struct super_block *);
+ void (*write_super) (struct super_block *);
+ int (*statfs) (struct super_block *, struct statfs *, int);
+ int (*remount_fs) (struct super_block *, int *, char *);
+ void (*clear_inode) (struct inode *);
+};
+
+All methods are called without any locks being held, unless otherwise
+noted. This means that most methods can block safely. All methods are
+only called from a process context (i.e. not from an interrupt handler
+or bottom half).
+
+ read_inode: this method is called to read a specific inode from the
+ mounted filesystem. The "i_ino" member in the "struct inode"
+ will be initialised by the VFS to indicate which inode to
+ read. Other members are filled in by this method
+
+ write_inode: this method is called when the VFS needs to write an
+ inode to disc. The second parameter indicates whether the write
+ should be synchronous or not; not all filesystems check this flag.
+
+ put_inode: called when the VFS inode is removed from the inode
+ cache. This method is optional
+
+ drop_inode: called when the last access to the inode is dropped,
+ with the inode_lock spinlock held.
+
+ This method should be either NULL (normal unix filesystem
+ semantics) or "generic_delete_inode" (for filesystems that do not
+ want to cache inodes - causing "delete_inode" to always be
+ called regardless of the value of i_nlink)
+
+ The "generic_delete_inode()" behaviour is equivalent to the
+ old practice of using "force_delete" in the put_inode() case,
+ but does not have the races that the "force_delete()" approach
+ had.
+
+ delete_inode: called when the VFS wants to delete an inode
+
+ notify_change: called when VFS inode attributes are changed. If this
+ is NULL the VFS falls back to the write_inode() method. This
+ is called with the kernel lock held
+
+ put_super: called when the VFS wishes to free the superblock
+ (i.e. unmount). This is called with the superblock lock held
+
+ write_super: called when the VFS superblock needs to be written to
+ disc. This method is optional
+
+ statfs: called when the VFS needs to get filesystem statistics. This
+ is called with the kernel lock held
+
+ remount_fs: called when the filesystem is remounted. This is called
+ with the kernel lock held
+
+ clear_inode: called when the VFS clears the inode. Optional
+
+The read_inode() method is responsible for filling in the "i_op"
+field. This is a pointer to a "struct inode_operations" which
+describes the methods that can be performed on individual inodes.
+
+
+struct inode_operations <section>
+=======================
+
+This describes how the VFS can manipulate an inode in your
+filesystem. As of kernel 2.1.99, the following members are defined:
+
+struct inode_operations {
+ struct file_operations * default_file_ops;
+ int (*create) (struct inode *,struct dentry *,int);
+ int (*lookup) (struct inode *,struct dentry *);
+ int (*link) (struct dentry *,struct inode *,struct dentry *);
+ int (*unlink) (struct inode *,struct dentry *);
+ int (*symlink) (struct inode *,struct dentry *,const char *);
+ int (*mkdir) (struct inode *,struct dentry *,int);
+ int (*rmdir) (struct inode *,struct dentry *);
+ int (*mknod) (struct inode *,struct dentry *,int,dev_t);
+ int (*rename) (struct inode *, struct dentry *,
+ struct inode *, struct dentry *);
+ int (*readlink) (struct dentry *, char *,int);
+ struct dentry * (*follow_link) (struct dentry *, struct dentry *);
+ int (*readpage) (struct file *, struct page *);
+ int (*writepage) (struct page *page, struct writeback_control *wbc);
+ int (*bmap) (struct inode *,int);
+ void (*truncate) (struct inode *);
+ int (*permission) (struct inode *, int);
+ int (*smap) (struct inode *,int);
+ int (*updatepage) (struct file *, struct page *, const char *,
+ unsigned long, unsigned int, int);
+ int (*revalidate) (struct dentry *);
+};
+
+Again, all methods are called without any locks being held, unless
+otherwise noted.
+
+ default_file_ops: this is a pointer to a "struct file_operations"
+ which describes how to open and then manipulate open files
+
+ create: called by the open(2) and creat(2) system calls. Only
+ required if you want to support regular files. The dentry you
+ get should not have an inode (i.e. it should be a negative
+ dentry). Here you will probably call d_instantiate() with the
+ dentry and the newly created inode
+
+ lookup: called when the VFS needs to look up an inode in a parent
+ directory. The name to look for is found in the dentry. This
+ method must call d_add() to insert the found inode into the
+ dentry. The "i_count" field in the inode structure should be
+ incremented. If the named inode does not exist a NULL inode
+ should be inserted into the dentry (this is called a negative
+ dentry). Returning an error code from this routine must only
+ be done on a real error, otherwise creating inodes with system
+ calls like creat(2), mknod(2), mkdir(2) and so on will fail.
+ If you wish to overload the dentry methods then you should
+ initialise the "d_op" field in the dentry; this is a pointer
+ to a struct "dentry_operations".
+ This method is called with the directory inode semaphore held
+
+ link: called by the link(2) system call. Only required if you want
+ to support hard links. You will probably need to call
+ d_instantiate() just as you would in the create() method
+
+ unlink: called by the unlink(2) system call. Only required if you
+ want to support deleting inodes
+
+ symlink: called by the symlink(2) system call. Only required if you
+ want to support symlinks. You will probably need to call
+ d_instantiate() just as you would in the create() method
+
+ mkdir: called by the mkdir(2) system call. Only required if you want
+ to support creating subdirectories. You will probably need to
+ call d_instantiate() just as you would in the create() method
+
+ rmdir: called by the rmdir(2) system call. Only required if you want
+ to support deleting subdirectories
+
+ mknod: called by the mknod(2) system call to create a device (char,
+ block) inode or a named pipe (FIFO) or socket. Only required
+ if you want to support creating these types of inodes. You
+ will probably need to call d_instantiate() just as you would
+ in the create() method
+
+ readlink: called by the readlink(2) system call. Only required if
+ you want to support reading symbolic links
+
+ follow_link: called by the VFS to follow a symbolic link to the
+ inode it points to. Only required if you want to support
+ symbolic links
+
+
+struct file_operations <section>
+======================
+
+This describes how the VFS can manipulate an open file. As of kernel
+2.1.99, the following members are defined:
+
+struct file_operations {
+ loff_t (*llseek) (struct file *, loff_t, int);
+ ssize_t (*read) (struct file *, char *, size_t, loff_t *);
+ ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
+ int (*readdir) (struct file *, void *, filldir_t);
+ unsigned int (*poll) (struct file *, struct poll_table_struct *);
+ int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
+ int (*mmap) (struct file *, struct vm_area_struct *);
+ int (*open) (struct inode *, struct file *);
+ int (*release) (struct inode *, struct file *);
+ int (*fsync) (struct file *, struct dentry *);
+ int (*fasync) (struct file *, int);
+ int (*check_media_change) (kdev_t dev);
+ int (*revalidate) (kdev_t dev);
+ int (*lock) (struct file *, int, struct file_lock *);
+};
+
+Again, all methods are called without any locks being held, unless
+otherwise noted.
+
+ llseek: called when the VFS needs to move the file position index
+
+ read: called by read(2) and related system calls
+
+ write: called by write(2) and related system calls
+
+ readdir: called when the VFS needs to read the directory contents
+
+ poll: called by the VFS when a process wants to check if there is
+ activity on this file and (optionally) go to sleep until there
+ is activity. Called by the select(2) and poll(2) system calls
+
+ ioctl: called by the ioctl(2) system call
+
+ mmap: called by the mmap(2) system call
+
+ open: called by the VFS when an inode should be opened. When the VFS
+ opens a file, it creates a new "struct file" and initialises
+ the "f_op" file operations member with the "default_file_ops"
+ field in the inode structure. It then calls the open method
+ for the newly allocated file structure. You might think that
+ the open method really belongs in "struct inode_operations",
+ and you may be right. I think it's done the way it is because
+ it makes filesystems simpler to implement. The open() method
+ is a good place to initialise the "private_data" member in the
+ file structure if you want to point to a device structure
+
+ release: called when the last reference to an open file is closed
+
+ fsync: called by the fsync(2) system call
+
+ fasync: called by the fcntl(2) system call when asynchronous
+ (non-blocking) mode is enabled for a file
+
+Note that the file operations are implemented by the specific
+filesystem in which the inode resides. When opening a device node
+(character or block special) most filesystems will call special
+support routines in the VFS which will locate the required device
+driver information. These support routines replace the filesystem file
+operations with those for the device driver, and then proceed to call
+the new open() method for the file. This is how opening a device file
+in the filesystem eventually ends up calling the device driver open()
+method. Note the devfs (the Device FileSystem) has a more direct path
+from device node to device driver (this is an unofficial kernel
+patch).
+
+
+Directory Entry Cache (dcache) <section>
+==============================
+
+struct dentry_operations
+------------------------
+
+This describes how a filesystem can overload the standard dentry
+operations. Dentries and the dcache are the domain of the VFS and the
+individual filesystem implementations. Device drivers have no business
+here. These methods may be set to NULL, as they are either optional or
+the VFS uses a default. As of kernel 2.1.99, the following members are
+defined:
+
+struct dentry_operations {
+ int (*d_revalidate)(struct dentry *);
+ int (*d_hash) (struct dentry *, struct qstr *);
+ int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
+ void (*d_delete)(struct dentry *);
+ void (*d_release)(struct dentry *);
+ void (*d_iput)(struct dentry *, struct inode *);
+};
+
+ d_revalidate: called when the VFS needs to revalidate a dentry. This
+ is called whenever a name look-up finds a dentry in the
+ dcache. Most filesystems leave this as NULL, because all their
+ dentries in the dcache are valid
+
+ d_hash: called when the VFS adds a dentry to the hash table
+
+ d_compare: called when a dentry should be compared with another
+
+ d_delete: called when the last reference to a dentry is
+ deleted. This means no-one is using the dentry, however it is
+ still valid and in the dcache
+
+ d_release: called when a dentry is really deallocated
+
+ d_iput: called when a dentry loses its inode (just prior to its
+ being deallocated). The default when this is NULL is that the
+ VFS calls iput(). If you define this method, you must call
+ iput() yourself
+
+Each dentry has a pointer to its parent dentry, as well as a hash list
+of child dentries. Child dentries are basically like files in a
+directory.
+
+Directory Entry Cache APIs
+--------------------------
+
+There are a number of functions defined which permit a filesystem to
+manipulate dentries:
+
+ dget: open a new handle for an existing dentry (this just increments
+ the usage count)
+
+ dput: close a handle for a dentry (decrements the usage count). If
+ the usage count drops to 0, the "d_delete" method is called
+ and the dentry is placed on the unused list if the dentry is
+ still in its parent's hash list. Putting the dentry on the
+ unused list just means that if the system needs some RAM, it
+ goes through the unused list of dentries and deallocates them.
+ If the dentry has already been unhashed when the usage count
+ drops to 0, the dentry is deallocated after the "d_delete"
+ method is called
+
+ d_drop: this unhashes a dentry from its parent's hash list. A
+ subsequent call to dput() will deallocate the dentry if its
+ usage count drops to 0
+
+ d_delete: delete a dentry. If there are no other open references to
+ the dentry then the dentry is turned into a negative dentry
+ (the d_iput() method is called). If there are other
+ references, then d_drop() is called instead
+
+ d_add: adds a dentry to its parent's hash list and then calls
+ d_instantiate()
+
+ d_instantiate: adds a dentry to the alias hash list for the inode and
+ updates the "d_inode" member. The "i_count" member in the
+ inode structure should be set/incremented. If the inode
+ pointer is NULL, the dentry is called a "negative
+ dentry". This function is commonly called when an inode is
+ created for an existing negative dentry
+
+ d_lookup: look up a dentry given its parent and path name component.
+ It looks up the child of that given name from the dcache
+ hash table. If it is found, the reference count is incremented
+ and the dentry is returned. The caller must use dput()
+ to free the dentry when it finishes using it.
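+
+The usage-count rules for dget(), dput() and d_drop() described above
+can be modelled in a few lines of userspace C. This is only a sketch:
+the toy_* names and the "hashed", "on_unused" and "freed" flags are
+invented here to stand in for the hash list, the unused list and real
+deallocation, and no locking is shown.
+

```c
#include <assert.h>
#include <stddef.h>

struct toy_dentry {
	int d_count;	/* usage count */
	int hashed;	/* still in the parent's hash list? */
	int on_unused;	/* placed on the unused list */
	int freed;	/* stands in for real deallocation */
	void (*d_delete)(struct toy_dentry *);	/* optional method */
};

/* Open a new handle: just bump the usage count. */
static struct toy_dentry *toy_dget(struct toy_dentry *d)
{
	d->d_count++;
	return d;
}

/* Unhash the dentry from its parent's hash list. */
static void toy_d_drop(struct toy_dentry *d)
{
	d->hashed = 0;
}

/* Close a handle; act only when the last reference goes away. */
static void toy_dput(struct toy_dentry *d)
{
	assert(d->d_count > 0);
	if (--d->d_count > 0)
		return;			/* still in use elsewhere */
	if (d->d_delete != NULL)
		d->d_delete(d);		/* last reference went away */
	if (d->hashed)
		d->on_unused = 1;	/* keep cached until RAM is needed */
	else
		d->freed = 1;		/* unhashed: deallocate */
}
```

+
+In this model a hashed dentry whose count reaches 0 stays cached on the
+unused list, while one that went through toy_d_drop() first is freed.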
+
+
+RCU-based dcache locking model
+------------------------------
+
+On many workloads, the most common operation on dcache is
+to look up a dentry, given a parent dentry and the name
+of the child. Typically, for every open(), stat() etc.,
+the dentry corresponding to the pathname will be looked
+up by walking the tree starting with the first component
+of the pathname and using that dentry along with the next
+component to look up the next level and so on. Since it
+is a frequent operation for workloads like multiuser
+environments and webservers, it is important to optimize
+this path.
+
+Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus
+for every component during path look-up. From 2.5.10 onwards, the
+fastwalk algorithm changed this by holding the dcache_lock
+at the beginning and walking as many cached path component
+dentries as possible. This significantly decreases the number
+of acquisitions of dcache_lock. However, it also increases the
+lock hold time significantly and affects performance on large
+SMP machines. Since the 2.5.62 kernel, dcache has been using
+a new locking model that uses RCU to make dcache look-up
+lock-free.
+
+The current dcache locking model is not very different from the previous
+dcache locking model. Prior to the 2.5.62 kernel, dcache_lock
+protected the hash chain, d_child, d_alias, d_lru lists as well
+as d_inode and several other things like mount look-up. RCU-based
+changes affect only the way the hash chain is protected. For everything
+else the dcache_lock must be taken for both traversing as well as
+updating. Hash chain updates take the dcache_lock too.
+The significant change is the way d_lookup traverses the hash chain:
+it doesn't acquire the dcache_lock for this and relies on RCU to
+ensure that the dentry has not been *freed*.
+
+
+Dcache locking details
+----------------------
+For many multi-user workloads, open() and stat() on files are
+very frequently occurring operations. Both involve walking
+of path names to find the dentry corresponding to the
+concerned file. In the 2.4 kernel, dcache_lock was held
+during look-up of each path component. Contention and
+cacheline bouncing of this global lock caused significant
+scalability problems. With the introduction of RCU
+in the Linux kernel, this was worked around by making
+the look-up of path components during path walking lock-free.
+
+
+Safe lock-free look-up of dcache hash table
+===========================================
+
+Dcache is a complex data structure with the hash table entries
+also linked together in other lists. In the 2.4 kernel, dcache_lock
+protected all the lists. We applied RCU only on hash chain
+walking. The rest of the lists are still protected by dcache_lock.
+Some of the important changes are:
+
+1. Deletion from the hash chain is done using the hlist_del_rcu() macro,
+ which doesn't initialize the next pointer of the deleted dentry; this
+ allows us to walk safely lock-free while a deletion is happening.
+
+2. Insertion of a dentry into the hash table is done using
+ hlist_add_head_rcu() which takes care of ordering the writes -
+ the writes to the dentry must be visible before the dentry
+ is inserted. This works in conjunction with hlist_for_each_rcu()
+ while walking the hash chain. The only requirement is that
+ all initialization to the dentry must be done before hlist_add_head_rcu()
+ since we don't have dcache_lock protection while traversing
+ the hash chain. This isn't different from the existing code.
+
+3. The dentry looked up without holding dcache_lock cannot be
+ returned for walking if it is unhashed. It then may have a NULL
+ d_inode or other bogosity since RCU doesn't protect the other
+ fields in the dentry. We therefore use a flag DCACHE_UNHASHED to
+ indicate unhashed dentries and use this in conjunction with a
+ per-dentry lock (d_lock). Once looked up without the dcache_lock,
+ we acquire the per-dentry lock (d_lock) and check if the
+ dentry is unhashed. If so, the look-up is failed. If not, the
+ reference count of the dentry is increased and the dentry is returned.
+
+4. Once a dentry is looked up, it must be ensured during the path
+ walk for that component it doesn't go away. In pre-2.5.10 code,
+ this was done holding a reference to the dentry. dcache_rcu does
+ the same. In some sense, dcache_rcu path walking looks like
+ the pre-2.5.10 version.
+
+5. All dentry hash chain updates must take the dcache_lock as well as
+ the per-dentry lock in that order. dput() does this to ensure
+ that a dentry that has just been looked up in another CPU
+ doesn't get deleted before dget() can be done on it.
+
+6. There are several ways to do reference counting of RCU protected
+ objects. One such example is in ipv4 route cache where
+ deferred freeing (using call_rcu()) is done as soon as
+ the reference count goes to zero. This cannot be done in
+ the case of dentries because tearing down of dentries
+ requires blocking (dentry_iput()) which isn't supported from
+ RCU callbacks. Instead, tearing down of dentries happens
+ synchronously in dput(), but actual freeing happens later
+ when the RCU grace period is over. This allows safe lock-free
+ walking of the hash chains, but a matched dentry may have
+ been partially torn down. The checking of DCACHE_UNHASHED
+ flag with d_lock held detects such dentries and prevents
+ them from being returned from look-up.
+
+
+Maintaining POSIX rename semantics
+==================================
+
+Since look-up of dentries is lock-free, it can race against
+a concurrent rename operation. For example, during rename
+of file A to B, look-up of either A or B must succeed.
+So, if look-up of B happens after A has been removed from the
+hash chain but not added to the new hash chain, it may fail.
+Also, a comparison while the name is being written concurrently
+by a rename may result in false positive matches violating
+rename semantics. Issues related to race with rename are
+handled as described below:
+
+1. Look-up can be done in two ways - d_lookup() which is safe
+ from simultaneous renames and __d_lookup() which is not.
+ If __d_lookup() fails, it must be followed up by a d_lookup()
+ to correctly determine whether a dentry is in the hash table
+ or not. d_lookup() protects look-ups using a sequence
+ lock (rename_lock).
+
+2. The name associated with a dentry (d_name) may be changed if
+ a rename is allowed to happen simultaneously. To keep memcmp()
+ in __d_lookup() from going out of bounds due to a rename, and to
+ avoid false positive comparisons, the name comparison is done while
+ holding the per-dentry lock. This prevents concurrent renames
+ during this operation.
+
+3. Hash table walking during look-up may move to a different bucket as
+ the current dentry is moved to a different bucket due to rename.
+ But we use hlists in dcache hash table and they are null-terminated.
+ So, even if a dentry moves to a different bucket, hash chain
+ walk will terminate. [with a list_head list, it may not since
+ termination is when the list_head in the original bucket is reached].
+ Since we redo the d_parent check and compare name while holding
+ d_lock, lock-free look-up will not race against d_move().
+
+4. There can be a theoretical race when a dentry keeps coming back
+ to the original bucket due to double moves. Because of this, look-up
+ may consider that it has never moved and can end up in an infinite
+ loop. But this is not any worse than the theoretical livelocks we
+ already have in the kernel.
+
+
+Important guidelines for filesystem developers related to dcache_rcu
+====================================================================
+
+1. Existing dcache interfaces (pre-2.5.62) exported to filesystems
+ don't change. Only the dcache internal implementation changes. However,
+ filesystems *must not* delete from the dentry hash chains directly
+ using the list macros as was allowed earlier. They must use dcache
+ APIs like d_drop() or __d_drop() depending on the situation.
+
+2. d_flags is now protected by a per-dentry lock (d_lock). All
+ access to d_flags must be protected by it.
+
+3. For a hashed dentry, checking of d_count needs to be protected
+ by d_lock.
+
+
+Papers and other documentation on dcache locking
+================================================
+
+1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
+
+2. http://lse.sourceforge.net/locking/dcache/dcache.html
diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
new file mode 100644
index 0000000..c7d5d0c
--- /dev/null
+++ b/Documentation/filesystems/xfs.txt
@@ -0,0 +1,188 @@
+
+The SGI XFS Filesystem
+======================
+
+XFS is a high performance journaling filesystem which originated
+on the SGI IRIX platform. It is completely multi-threaded, can
+support large files and large filesystems, extended attributes,
+variable block sizes, is extent based, and makes extensive use of
+Btrees (directories, extents, free space) to aid both performance
+and scalability.
+
+Refer to the documentation at http://oss.sgi.com/projects/xfs/
+for further details. This implementation is on-disk compatible
+with the IRIX version of XFS.
+
+
+Mount Options
+=============
+
+When mounting an XFS filesystem, the following options are accepted.
+
+ biosize=size
+ Sets the preferred buffered I/O size (default size is 64K).
+ "size" must be expressed as the logarithm (base2) of the
+ desired I/O size.
+ Valid values for this option are 14 through 16, inclusive
+ (i.e. 16K, 32K, and 64K bytes). On machines with a 4K
+ pagesize, 13 (8K bytes) is also a valid size.
+ The preferred buffered I/O size can also be altered on an
+ individual file basis using the ioctl(2) system call.
+
+ ikeep/noikeep
+ When inode clusters are emptied of inodes, keep them around
+ on the disk (ikeep) - this is the traditional XFS behaviour
+ and is still the default for now. Using the noikeep option,
+ inode clusters are returned to the free space pool.
+
+ logbufs=value
+ Set the number of in-memory log buffers. Valid numbers range
+ from 2 to 8 inclusive.
+ The default value is 8 buffers for filesystems with a
+ blocksize of 64K, 4 buffers for filesystems with a blocksize
+ of 32K, 3 buffers for filesystems with a blocksize of 16K
+ and 2 buffers for all other configurations. Increasing the
+ number of buffers may increase performance on some workloads
+ at the cost of the memory used for the additional log buffers
+ and their associated control structures.
+
+ logbsize=value
+ Set the size of each in-memory log buffer.
+ Size may be specified in bytes, or in kilobytes with a "k" suffix.
+ Valid sizes for version 1 and version 2 logs are 16384 (16k) and
+ 32768 (32k). Valid sizes for version 2 logs also include
+ 65536 (64k), 131072 (128k) and 262144 (256k).
+ The default value for machines with more than 32MB of memory
+ is 32768; machines with less memory use 16384 by default.
+
+ logdev=device and rtdev=device
+ Use an external log (metadata journal) and/or real-time device.
+ An XFS filesystem has up to three parts: a data section, a log
+ section, and a real-time section. The real-time section is
+ optional, and the log section can be separate from the data
+ section or contained within it.
+
+ noalign
+ Data allocations will not be aligned at stripe unit boundaries.
+
+ noatime
+ Access timestamps are not updated when a file is read.
+
+ norecovery
+ The filesystem will be mounted without running log recovery.
+ If the filesystem was not cleanly unmounted, it is likely to
+ be inconsistent when mounted in "norecovery" mode.
+ Some files or directories may not be accessible because of this.
+ Filesystems mounted "norecovery" must be mounted read-only or
+ the mount will fail.
+
+ nouuid
+ Don't check for double-mounted file systems using the file system UUID.
+ This is useful to mount LVM snapshot volumes.
+
+ osyncisosync
+ Make O_SYNC writes implement true O_SYNC. WITHOUT this option,
+ Linux XFS behaves as if an "osyncisdsync" option is used,
+ which will make writes to files opened with the O_SYNC flag set
+ behave as if the O_DSYNC flag had been used instead.
+ This can result in better performance without compromising
+ data safety.
+ However if this option is not in effect, timestamp updates from
+ O_SYNC writes can be lost if the system crashes.
+ If timestamp updates are critical, use the osyncisosync option.
+
+ quota/usrquota/uqnoenforce
+ User disk quota accounting enabled, and limits (optionally)
+ enforced.
+
+ grpquota/gqnoenforce
+ Group disk quota accounting enabled and limits (optionally)
+ enforced.
+
+ sunit=value and swidth=value
+ Used to specify the stripe unit and width for a RAID device or
+ a stripe volume. "value" must be specified in 512-byte block
+ units.
+ If this option is not specified and the filesystem was made on
+ a stripe volume or the stripe width or unit were specified for
+ the RAID device at mkfs time, then the mount system call will
+ restore the value from the superblock. For filesystems that
+ are made directly on RAID devices, these options can be used
+ to override the information in the superblock if the underlying
+ disk layout changes after the filesystem has been created.
+ The "swidth" option is required if the "sunit" option has been
+ specified, and must be a multiple of the "sunit" value.
+
+sysctls
+=======
+
+The following sysctls are available for the XFS filesystem:
+
+ fs.xfs.stats_clear (Min: 0 Default: 0 Max: 1)
+ Setting this to "1" clears accumulated XFS statistics
+ in /proc/fs/xfs/stat. It then immediately resets to "0".
+
+ fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000)
+ The interval at which the xfssyncd thread flushes metadata
+ out to disk. This thread will flush log activity out, and
+ do some processing on unlinked inodes.
+
+ fs.xfs.xfsbufd_centisecs (Min: 50 Default: 100 Max: 3000)
+ The interval at which xfsbufd scans the dirty metadata buffers list.
+
+ fs.xfs.age_buffer_centisecs (Min: 100 Default: 1500 Max: 720000)
+ The age at which xfsbufd flushes dirty metadata buffers to disk.
+
+ fs.xfs.error_level (Min: 0 Default: 3 Max: 11)
+ A volume knob for error reporting when internal errors occur.
+ This will generate detailed messages & backtraces for filesystem
+ shutdowns, for example. Current threshold values are:
+
+ XFS_ERRLEVEL_OFF: 0
+ XFS_ERRLEVEL_LOW: 1
+ XFS_ERRLEVEL_HIGH: 5
+
+ fs.xfs.panic_mask (Min: 0 Default: 0 Max: 127)
+ Causes certain error conditions to call BUG(). Value is a bitmask;
+ OR together the tags which represent errors which should cause panics:
+
+ XFS_NO_PTAG 0
+ XFS_PTAG_IFLUSH 0x00000001
+ XFS_PTAG_LOGRES 0x00000002
+ XFS_PTAG_AILDELETE 0x00000004
+ XFS_PTAG_ERROR_REPORT 0x00000008
+ XFS_PTAG_SHUTDOWN_CORRUPT 0x00000010
+ XFS_PTAG_SHUTDOWN_IOERROR 0x00000020
+ XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040
+
+ This option is intended for debugging only.
+
+ fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1)
+ Controls whether symlinks are created with mode 0777 (default)
+ or whether their mode is affected by the umask (irix mode).
+
+ fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1)
+ Controls SGID bit handling for files created in SGID directories.
+ If the group ID of the new file does not match the effective group
+ ID or one of the supplementary group IDs of the parent dir, the
+ ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
+ is set.
+
+ fs.xfs.restrict_chown (Min: 0 Default: 1 Max: 1)
+ Controls whether unprivileged users can use chown to "give away"
+ a file to another user.
+
+ fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1)
+ Setting this to "1" will cause the "sync" flag set
+ by the chattr(1) command on a directory to be
+ inherited by files in that directory.
+
+ fs.xfs.inherit_nodump (Min: 0 Default: 1 Max: 1)
+ Setting this to "1" will cause the "nodump" flag set
+ by the chattr(1) command on a directory to be
+ inherited by files in that directory.
+
+ fs.xfs.inherit_noatime (Min: 0 Default: 1 Max: 1)
+ Setting this to "1" will cause the "noatime" flag set
+ by the chattr(1) command on a directory to be
+ inherited by files in that directory.