[PATCH] (2.5.4) death of ->i_zombie

Rediffed to 2.5.4, documentation added. This variant grabs ->s_vfs_rename_sem only for cross-directory renames.

[PATCH] (2.5.4) death of ->i_zombie
Rediffed to 2.5.4, documentation added. This variant grabs ->s_vfs_rename_sem only for cross-directory renames.
1b3d7c93 · Alexander Viro · Linus Torvalds · 9c73428c · 1b3d7c93 · 1b3d7c93
Commit 1b3d7c93 authored Feb 11, 2002 by Alexander Viro Committed by Linus Torvalds Feb 11, 2002
10 changed files
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -48,28 +48,30 @@ prototypes:

 locking rules:
 	all may block
-		BKL	i_sem(inode)	i_zombie(inode)
-lookup:		yes	yes		no
-create:		yes	yes		yes
-link:		yes	yes		yes
-mknod:		yes	yes		yes
-mkdir:		yes	yes		yes
-unlink:		yes	yes		yes
-rmdir:		yes	yes		yes		(see below)
-rename:		yes	yes (both)	yes (both)	(see below)
-readlink:	no	no		no
-follow_link:	no	no		no
-truncate:	yes	yes		no		(see below)
-setattr:	yes	if ATTR_SIZE	no
-permssion:	yes	no		no
-getattr:						(see below)
-revalidate:	no					(see below)
-	Additionally, ->rmdir() has i_zombie on victim and so does ->rename()
-in case when target exists and is a directory.
-	->rename() on directories has (per-superblock) ->s_vfs_rename_sem.
+		BKL	i_sem(inode)
+lookup:		yes	yes
+create:		yes	yes
+link:		yes	yes
+mknod:		yes	yes
+mkdir:		yes	yes
+unlink:		yes	yes (both)
+rmdir:		yes	yes (both)	(see below)
+rename:		yes	yes (all)	(see below)
+readlink:	no	no
+follow_link:	no	no
+truncate:	yes	yes		(see below)
+setattr:	yes	if ATTR_SIZE
+permssion:	yes	no
+getattr:				(see below)
+revalidate:	no			(see below)
+setxattr:	DOCUMENT_ME
+getxattr:	DOCUMENT_ME
+removexattr:	DOCUMENT_ME
+	Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_sem on
+victim.
+	cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
 	->revalidate(), it may be called both with and without the i_sem
-on dentry->d_inode. VFS never calls it with i_zombie on dentry->d_inode,
-but watch for other methods directly calling this one...
+on dentry->d_inode.
 	->truncate() is never called directly - it's a callback, not a
 method. It's called by vmtruncate() - library function normally used by
 ->setattr(). Locking information above applies to that call (i.e. is
@@ -77,6 +79,9 @@ inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been
 passed).
 	->getattr() is currently unused.

+See Documentation/filesystems/directory-locking for more detailed discussion
+of the locking scheme for directory operations.
+
 --------------------------- super_operations ---------------------------
 prototypes:
 	void (*read_inode) (struct inode *);

--- a/Documentation/filesystems/directory-locking
+++ b/Documentation/filesystems/directory-locking
+	Locking scheme used for directory operations is based on two
+kinds of locks - per-inode (->i_sem) and per-filesystem (->s_vfs_rename_sem).
+
+	For our purposes all operations fall in 5 classes:
+
+1) read access.  Locking rules: caller locks directory we are accessing.
+
+2) object creation.  Locking rules: same as above.
+
+3) object removal.  Locking rules: caller locks parent, finds victim,
+locks victim and calls the method.
+
+4) rename() that is _not_ cross-directory.  Locking rules: caller locks
+the parent, finds source and target, if target already exists - locks it
+and then calls the method.
+
+5) cross-directory rename.  The trickiest in the whole bunch.  Locking
+rules:
+	* lock the filesystem
+	* lock parents in "ancestors first" order.
+	* find source and target.
+	* if old parent is equal to or is a descendent of target
+		fail with -ENOTEMPTY
+	* if new parent is equal to or is a descendent of source
+		fail with -ELOOP
+	* if target exists - lock it.
+	* call the method.
+
+
+The rules above obviously guarantee that all directories that are going to be
+read, modified or removed by method will be locked by caller.
+
+
+If no directory is its own ancestor, the scheme above is deadlock-free.
+Proof:
+
+	First of all, at any moment we have a partial ordering of the
+objects - A < B iff A is an ancestor of B.
+
+	That ordering can change.  However, the following is true:
+
+(1) if operation different from cross-directory rename holds lock on A and
+    attempts to acquire lock on B, A will remain the parent of B until we
+    acquire the lock on B.  (Proof: only cross-directory rename can change
+    the parent of object and it would have to lock the parent).
+
+(2) if cross-directory rename holds the lock on filesystem, order will not
+    change until rename acquires all locks.  (Proof: other cross-directory
+    renames will be blocked on filesystem lock and we don't start changing
+    the order until we had acquired all locks).
+
+	Now consider the minimal deadlock.  Each process is blocked on
+attempt to acquire some lock and already holds at least one lock.  Let's
+consider the set of contended locks.  First of all, filesystem lock is
+not contended, since any process blocked on it is not holding any locks.
+Thus all processes are blocked on ->i_sem.
+
+	Any contended object is either held by cross-directory rename or
+has a child that is also contended.  Indeed, suppose that it is held by
+operation other than cross-directory rename.  Then the lock this operation
+is blocked on belongs to child of that object due to (1).
+
+	It means that one of the operations is cross-directory rename.
+Otherwise the set of contended objects would be infinite - each of them
+would have a contended child and we had assumed that no object is its
+own descendent.  Moreover, there is exactly one cross-directory rename
+(see above).
+
+	Consider the object blocking the cross-directory rename.  One of
+its descendents is locked by cross-directory rename (otherwise we would again
+have an infinite set of of contended objects).  But that means that means
+that cross-directory rename is taking locks out of order.  Due to (2) the
+order hadn't changed since we had acquired filesystem lock.  But locking
+rules for cross-directory rename guarantee that we do not try to acquire
+lock on descendent before the lock on ancestor.  Contradiction.  I.e.
+deadlock is impossible.  Q.E.D.
+
+
+	These operations are guaranteed to avoid loop creation.  Indeed,
+the only operation that could introduce loops is cross-directory rename.
+Since the only new (parent, child) pair added by rename() is (new parent,
+source), such loop would have to contain these objects and the rest of it
+would have to exist before rename().  I.e. at the moment of loop creation
+rename() responsible for that would be holding filesystem lock and new parent
+would have to be equal to or a descendent of source.  But that means that
+new parent had been equal to or a descendent of source since the moment when
+we had acquired filesystem lock and rename() would fail with -ELOOP in that
+case.
+
+	While this locking scheme works for arbitrary DAGs, it relies on
+ability to check that directory is a descendent of another object.  Current
+implementation assumes that directory graph is a tree.  This assumption is
+also preserved by all operations (cross-directory rename on a tree that would
+not introduce a cycle will leave it a tree and link() fails for directories).
+
+	Notice that "directory" in the above == "anything that might have
+children", so if we are going to introduce hybrid objects we will need
+either to make sure that link(2) doesn't work for them or to make changes
+in is_subdir() that would make it work even in presense of such beasts.
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -472,11 +472,9 @@ static ssize_t bm_entry_write(struct file *file, const char *buffer,
 			break;
 		case 3: root = dget(file->f_vfsmnt->mnt_sb->s_root);
 			down(&root->d_inode->i_sem);
-			down(&root->d_inode->i_zombie);

 			kill_node(e);

-			up(&root->d_inode->i_zombie);
 			up(&root->d_inode->i_sem);
 			dput(root);
 			break;
@@ -516,8 +514,6 @@ static ssize_t bm_register_write(struct file *file, const char *buffer,
 	if (IS_ERR(dentry))
 		goto out;

-	down(&root->d_inode->i_zombie);
-
 	err = -EEXIST;
 	if (dentry->d_inode)
 		goto out2;
@@ -556,7 +552,6 @@ static ssize_t bm_register_write(struct file *file, const char *buffer,
 	mntput(mnt);
 	err = 0;
 out2:
-	up(&root->d_inode->i_zombie);
 	dput(dentry);
 out:
 	up(&root->d_inode->i_sem);
@@ -605,12 +600,10 @@ static ssize_t bm_status_write(struct file * file, const char * buffer,
 		case 2: enabled = 1; break;
 		case 3: root = dget(file->f_vfsmnt->mnt_sb->s_root);
 			down(&root->d_inode->i_sem);
-			down(&root->d_inode->i_zombie);

 			while (!list_empty(&entries))
 				kill_node(list_entry(entries.next, Node, list));

-			up(&root->d_inode->i_zombie);
 			up(&root->d_inode->i_sem);
 			dput(root);
 		default: return res;

--- a/fs/inode.c
+++ b/fs/inode.c
@@ -143,7 +143,6 @@ void inode_init_once(struct inode *inode)
 	INIT_LIST_HEAD(&inode->i_dirty_data_buffers);
 	INIT_LIST_HEAD(&inode->i_devices);
 	sema_init(&inode->i_sem, 1);
-	sema_init(&inode->i_zombie, 1);
 	spin_lock_init(&inode->i_data.i_shared_lock);
 }


--- a/fs/namei.c
+++ b/fs/namei.c
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -466,7 +466,7 @@ static int graft_tree(struct vfsmount *mnt, struct nameidata *nd)
 		return -ENOTDIR;

 	err = -ENOENT;
-	down(&nd->dentry->d_inode->i_zombie);
+	down(&nd->dentry->d_inode->i_sem);
 	if (IS_DEADDIR(nd->dentry->d_inode))
 		goto out_unlock;

@@ -481,7 +481,7 @@ static int graft_tree(struct vfsmount *mnt, struct nameidata *nd)
 	}
 	spin_unlock(&dcache_lock);
 out_unlock:
-	up(&nd->dentry->d_inode->i_zombie);
+	up(&nd->dentry->d_inode->i_sem);
 	return err;
 }

@@ -577,7 +577,7 @@ static int do_move_mount(struct nameidata *nd, char *old_name)
 		goto out;

 	err = -ENOENT;
-	down(&nd->dentry->d_inode->i_zombie);
+	down(&nd->dentry->d_inode->i_sem);
 	if (IS_DEADDIR(nd->dentry->d_inode))
 		goto out1;

@@ -607,7 +607,7 @@ static int do_move_mount(struct nameidata *nd, char *old_name)
 out2:
 	spin_unlock(&dcache_lock);
 out1:
-	up(&nd->dentry->d_inode->i_zombie);
+	up(&nd->dentry->d_inode->i_sem);
 out:
 	up_write(&current->namespace->sem);
 	if (!err)
@@ -949,7 +949,7 @@ asmlinkage long sys_pivot_root(const char *new_root, const char *put_old)
 	user_nd.dentry = dget(current->fs->root);
 	read_unlock(&current->fs->lock);
 	down_write(&current->namespace->sem);
-	down(&old_nd.dentry->d_inode->i_zombie);
+	down(&old_nd.dentry->d_inode->i_sem);
 	error = -EINVAL;
 	if (!check_mnt(user_nd.mnt))
 		goto out2;
@@ -992,7 +992,7 @@ asmlinkage long sys_pivot_root(const char *new_root, const char *put_old)
 	path_release(&root_parent);
 	path_release(&parent_nd);
 out2:
-	up(&old_nd.dentry->d_inode->i_zombie);
+	up(&old_nd.dentry->d_inode->i_sem);
 	up_write(&current->namespace->sem);
 	path_release(&user_nd);
 	path_release(&old_nd);

--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1226,7 +1226,7 @@ int
 nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 			    struct svc_fh *tfhp, char *tname, int tlen)
 {
-	struct dentry	*fdentry, *tdentry, *odentry, *ndentry;
+	struct dentry	*fdentry, *tdentry, *odentry, *ndentry, *trap;
 	struct inode	*fdir, *tdir;
 	int		err;

@@ -1253,7 +1253,7 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,

 	/* cannot use fh_lock as we need deadlock protective ordering
 	 * so do it by hand */
-	double_down(&tdir->i_sem, &fdir->i_sem);
+	trap = lock_rename(tdentry, fdentry);
 	ffhp->fh_locked = tfhp->fh_locked = 1;
 	fill_pre_wcc(ffhp);
 	fill_pre_wcc(tfhp);
@@ -1266,12 +1266,17 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 	err = -ENOENT;
 	if (!odentry->d_inode)
 		goto out_dput_old;
+	err = -EINVAL;
+	if (odentry == trap)
+		goto out_dput_old;

 	ndentry = lookup_one_len(tname, tdentry, tlen);
 	err = PTR_ERR(ndentry);
 	if (IS_ERR(ndentry))
 		goto out_dput_old;
-
+	err = -ENOTEMPTY;
+	if (ndentry == trap)
+		goto out_dput_new;

 #ifdef MSNFS
 	if ((ffhp->fh_export->ex_flags & NFSEXP_MSNFS) &&
@@ -1287,6 +1292,8 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 	}
 	dput(ndentry);

+ out_dput_new:
+	dput(ndentry);
 out_dput_old:
 	dput(odentry);
 out_nfserr:
@@ -1299,9 +1306,9 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 	 */
 	fill_post_wcc(ffhp);
 	fill_post_wcc(tfhp);
-	double_up(&tdir->i_sem, &fdir->i_sem);
+	unlock_rename(tdentry, fdentry);
 	ffhp->fh_locked = tfhp->fh_locked = 0;
-	
+
 out:
 	return err;
 }

--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -21,14 +21,12 @@ int vfs_readdir(struct file *file, filldir_t filler, void *buf)
 	if (!file->f_op || !file->f_op->readdir)
 		goto out;
 	down(&inode->i_sem);
-	down(&inode->i_zombie);
 	res = -ENOENT;
 	if (!IS_DEADDIR(inode)) {
 		lock_kernel();
 		res = file->f_op->readdir(file, buf, filler);
 		unlock_kernel();
 	}
-	up(&inode->i_zombie);
 	up(&inode->i_sem);
 out:
 	return res;

--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -425,7 +425,6 @@ struct inode {
 	unsigned long		i_blocks;
 	unsigned long		i_version;
 	struct semaphore	i_sem;
-	struct semaphore	i_zombie;
 	struct inode_operations	*i_op;
 	struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
 	struct super_block	*i_sb;
@@ -759,6 +758,9 @@ extern int vfs_rmdir(struct inode *, struct dentry *);
 extern int vfs_unlink(struct inode *, struct dentry *);
 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);

+extern struct dentry *lock_rename(struct dentry *, struct dentry *);
+extern void unlock_rename(struct dentry *, struct dentry *);
+
 /*
 * File types
 */
@@ -1505,131 +1507,6 @@ extern int generic_osync_inode(struct inode *, int);
 extern int inode_change_ok(struct inode *, struct iattr *);
 extern int inode_setattr(struct inode *, struct iattr *);

-/*
- * Common dentry functions for inclusion in the VFS
- * or in other stackable file systems.  Some of these
- * functions were in linux/fs/ C (VFS) files.
- *
- */
-
-/*
- * Locking the parent is needed to:
- *  - serialize directory operations
- *  - make sure the parent doesn't change from
- *    under us in the middle of an operation.
- *
- * NOTE! Right now we'd rather use a "struct inode"
- * for this, but as I expect things to move toward
- * using dentries instead for most things it is
- * probably better to start with the conceptually
- * better interface of relying on a path of dentries.
- */
-static inline struct dentry *lock_parent(struct dentry *dentry)
-{
-	struct dentry *dir = dget(dentry->d_parent);
-
-	down(&dir->d_inode->i_sem);
-	return dir;
-}
-
-static inline struct dentry *get_parent(struct dentry *dentry)
-{
-	return dget(dentry->d_parent);
-}
-
-static inline void unlock_dir(struct dentry *dir)
-{
-	up(&dir->d_inode->i_sem);
-	dput(dir);
-}
-
-/*
- * Whee.. Deadlock country. Happily there are only two VFS
- * operations that does this..
- */
-static inline void double_down(struct semaphore *s1, struct semaphore *s2)
-{
-	if (s1 != s2) {
-		if ((unsigned long) s1 < (unsigned long) s2) {
-			struct semaphore *tmp = s2;
-			s2 = s1; s1 = tmp;
-		}
-		down(s1);
-	}
-	down(s2);
-}
-
-/*
- * Ewwwwwwww... _triple_ lock. We are guaranteed that the 3rd argument is
- * not equal to 1st and not equal to 2nd - the first case (target is parent of
- * source) would be already caught, the second is plain impossible (target is
- * its own parent and that case would be caught even earlier). Very messy.
- * I _think_ that it works, but no warranties - please, look it through.
- * Pox on bloody lusers who mandated overwriting rename() for directories...
- */
-
-static inline void triple_down(struct semaphore *s1,
-			       struct semaphore *s2,
-			       struct semaphore *s3)
-{
-	if (s1 != s2) {
-		if ((unsigned long) s1 < (unsigned long) s2) {
-			if ((unsigned long) s1 < (unsigned long) s3) {
-				struct semaphore *tmp = s3;
-				s3 = s1; s1 = tmp;
-			}
-			if ((unsigned long) s1 < (unsigned long) s2) {
-				struct semaphore *tmp = s2;
-				s2 = s1; s1 = tmp;
-			}
-		} else {
-			if ((unsigned long) s1 < (unsigned long) s3) {
-				struct semaphore *tmp = s3;
-				s3 = s1; s1 = tmp;
-			}
-			if ((unsigned long) s2 < (unsigned long) s3) {
-				struct semaphore *tmp = s3;
-				s3 = s2; s2 = tmp;
-			}
-		}
-		down(s1);
-	} else if ((unsigned long) s2 < (unsigned long) s3) {
-		struct semaphore *tmp = s3;
-		s3 = s2; s2 = tmp;
-	}
-	down(s2);
-	down(s3);
-}
-
-static inline void double_up(struct semaphore *s1, struct semaphore *s2)
-{
-	up(s1);
-	if (s1 != s2)
-		up(s2);
-}
-
-static inline void triple_up(struct semaphore *s1,
-			     struct semaphore *s2,
-			     struct semaphore *s3)
-{
-	up(s1);
-	if (s1 != s2)
-		up(s2);
-	up(s3);
-}
-
-static inline void double_lock(struct dentry *d1, struct dentry *d2)
-{
-	double_down(&d1->d_inode->i_sem, &d2->d_inode->i_sem);
-}
-
-static inline void double_unlock(struct dentry *d1, struct dentry *d2)
-{
-	double_up(&d1->d_inode->i_sem,&d2->d_inode->i_sem);
-	dput(d1);
-	dput(d2);
-}
-
 #endif /* __KERNEL__ */

 #endif /* _LINUX_FS_H */
--- a/kernel/ksyms.c
+++ b/kernel/ksyms.c
@@ -253,6 +253,8 @@ EXPORT_SYMBOL(vfs_statfs);
 EXPORT_SYMBOL(vfs_fstat);
 EXPORT_SYMBOL(vfs_stat);
 EXPORT_SYMBOL(vfs_lstat);
+EXPORT_SYMBOL(lock_rename);
+EXPORT_SYMBOL(unlock_rename);
 EXPORT_SYMBOL(generic_read_dir);
 EXPORT_SYMBOL(generic_file_llseek);
 EXPORT_SYMBOL(remote_llseek);