[PATCH] (2.5.4) death of ->i_zombie

Rediffed to 2.5.4, documentation added. This variant grabs ->s_vfs_rename_sem only for cross-directory renames.

Commit 1b3d7c93, authored Feb 11, 2002 by Alexander Viro; committed by Linus Torvalds Feb 11, 2002
10 changed files
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -48,28 +48,30 @@ prototypes:

 locking rules:
 	all may block
-		BKL	i_sem(inode)	i_zombie(inode)
-lookup:		yes	yes		no
-create:		yes	yes		yes
-link:		yes	yes		yes
-mknod:		yes	yes		yes
-mkdir:		yes	yes		yes
-unlink:		yes	yes		yes
-rmdir:		yes	yes		yes		(see below)
-rename:		yes	yes (both)	yes (both)	(see below)
-readlink:	no	no		no
-follow_link:	no	no		no
-truncate:	yes	yes		no		(see below)
-setattr:	yes	if ATTR_SIZE	no
-permssion:	yes	no		no
+		BKL	i_sem(inode)
+lookup:		yes	yes
+create:		yes	yes
+link:		yes	yes
+mknod:		yes	yes
+mkdir:		yes	yes
+unlink:		yes	yes (both)
+rmdir:		yes	yes (both)	(see below)
+rename:		yes	yes (all)	(see below)
+readlink:	no	no
+follow_link:	no	no
+truncate:	yes	yes		(see below)
+setattr:	yes	if ATTR_SIZE
+permission:	yes	no
 getattr:				(see below)
 revalidate:	no			(see below)
-	Additionally, ->rmdir() has i_zombie on victim and so does ->rename()
-in case when target exists and is a directory.
-	->rename() on directories has (per-superblock) ->s_vfs_rename_sem.
+setxattr:	DOCUMENT_ME
+getxattr:	DOCUMENT_ME
+removexattr:	DOCUMENT_ME
+	Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_sem on
+victim.
+	cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
 	->revalidate(), it may be called both with and without the i_sem
-on dentry->d_inode. VFS never calls it with i_zombie on dentry->d_inode,
-but watch for other methods directly calling this one...
+on dentry->d_inode.
 	->truncate() is never called directly - it's a callback, not a
 method. It's called by vmtruncate() - library function normally used by
 ->setattr(). Locking information above applies to that call (i.e. is
@@ -77,6 +79,9 @@ inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been
 passed).
 	->getattr() is currently unused.

+See Documentation/filesystems/directory-locking for more detailed discussion
+of the locking scheme for directory operations.
+
 --------------------------- super_operations ---------------------------
 prototypes:
 	void (*read_inode) (struct inode *);
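
The "(both)" entries for unlink and rmdir in the table above mean that the method runs with ->i_sem held on the parent and on the victim. A minimal userspace sketch of that nesting follows. The names toy_sem, toy_inode, toy_unlink_method() and toy_remove() are hypothetical and exist only for this illustration; the toy semaphore just tracks a "held" flag and does not block.

```c
#include <assert.h>
#include <stddef.h>

/* Toy semaphore: only records whether it is held, never sleeps. */
struct toy_sem {
	int held;
};

static void toy_down(struct toy_sem *s)
{
	assert(!s->held);
	s->held = 1;
}

static void toy_up(struct toy_sem *s)
{
	assert(s->held);
	s->held = 0;
}

struct toy_inode {
	struct toy_sem i_sem;
	int nlink;
};

/* Stand-in for a filesystem's ->unlink(): both locks must be held here. */
static int toy_unlink_method(struct toy_inode *dir, struct toy_inode *victim)
{
	assert(dir->i_sem.held && victim->i_sem.held);
	victim->nlink--;
	return 0;
}

/* Caller side: the parent is locked first, then the victim. */
static int toy_remove(struct toy_inode *dir, struct toy_inode *victim)
{
	int error;

	toy_down(&dir->i_sem);		/* parent lock, taken by the caller */
	toy_down(&victim->i_sem);	/* victim lock, as vfs_unlink() now does */
	error = toy_unlink_method(dir, victim);
	toy_up(&victim->i_sem);
	toy_up(&dir->i_sem);
	return error;
}
```

The parent-then-victim order matches the "lock parent, lookup, lock child" pattern that the deadlock argument in directory-locking relies on.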

--- a/Documentation/filesystems/directory-locking
+++ b/Documentation/filesystems/directory-locking
+	Locking scheme used for directory operations is based on two
+kinds of locks - per-inode (->i_sem) and per-filesystem (->s_vfs_rename_sem).
+
+	For our purposes all operations fall in 5 classes:
+
+1) read access.  Locking rules: caller locks directory we are accessing.
+
+2) object creation.  Locking rules: same as above.
+
+3) object removal.  Locking rules: caller locks parent, finds victim,
+locks victim and calls the method.
+
+4) rename() that is _not_ cross-directory.  Locking rules: caller locks
+the parent, finds source and target, if target already exists - locks it
+and then calls the method.
+
+5) cross-directory rename.  The trickiest in the whole bunch.  Locking
+rules:
+	* lock the filesystem
+	* lock parents in "ancestors first" order.
+	* find source and target.
+	* if old parent is equal to or is a descendent of target
+		fail with -ENOTEMPTY
+	* if new parent is equal to or is a descendent of source
+		fail with -ELOOP
+	* if target exists - lock it.
+	* call the method.
+
+
+The rules above obviously guarantee that all directories that are going to be
+read, modified or removed by method will be locked by caller.
+
+
+If no directory is its own ancestor, the scheme above is deadlock-free.
+Proof:
+
+	First of all, at any moment we have a partial ordering of the
+objects - A < B iff A is an ancestor of B.
+
+	That ordering can change.  However, the following is true:
+
+(1) if operation different from cross-directory rename holds lock on A and
+    attempts to acquire lock on B, A will remain the parent of B until we
+    acquire the lock on B.  (Proof: only cross-directory rename can change
+    the parent of object and it would have to lock the parent).
+
+(2) if cross-directory rename holds the lock on filesystem, order will not
+    change until rename acquires all locks.  (Proof: other cross-directory
+    renames will be blocked on filesystem lock and we don't start changing
+    the order until we had acquired all locks).
+
+	Now consider the minimal deadlock.  Each process is blocked on
+attempt to acquire some lock and already holds at least one lock.  Let's
+consider the set of contended locks.  First of all, filesystem lock is
+not contended, since any process blocked on it is not holding any locks.
+Thus all processes are blocked on ->i_sem.
+
+	Any contended object is either held by cross-directory rename or
+has a child that is also contended.  Indeed, suppose that it is held by
+operation other than cross-directory rename.  Then the lock this operation
+is blocked on belongs to child of that object due to (1).
+
+	It means that one of the operations is cross-directory rename.
+Otherwise the set of contended objects would be infinite - each of them
+would have a contended child and we had assumed that no object is its
+own descendent.  Moreover, there is exactly one cross-directory rename
+(see above).
+
+	Consider the object blocking the cross-directory rename.  One of
+its descendents is locked by cross-directory rename (otherwise we would again
+have an infinite set of contended objects).  But that means
+that cross-directory rename is taking locks out of order.  Due to (2) the
+order hadn't changed since we had acquired filesystem lock.  But locking
+rules for cross-directory rename guarantee that we do not try to acquire
+lock on descendent before the lock on ancestor.  Contradiction.  I.e.
+deadlock is impossible.  Q.E.D.
+
+
+	These operations are guaranteed to avoid loop creation.  Indeed,
+the only operation that could introduce loops is cross-directory rename.
+Since the only new (parent, child) pair added by rename() is (new parent,
+source), such loop would have to contain these objects and the rest of it
+would have to exist before rename().  I.e. at the moment of loop creation
+rename() responsible for that would be holding filesystem lock and new parent
+would have to be equal to or a descendent of source.  But that means that
+new parent had been equal to or a descendent of source since the moment when
+we had acquired filesystem lock and rename() would fail with -ELOOP in that
+case.
+
+	While this locking scheme works for arbitrary DAGs, it relies on
+ability to check that directory is a descendent of another object.  Current
+implementation assumes that directory graph is a tree.  This assumption is
+also preserved by all operations (cross-directory rename on a tree that would
+not introduce a cycle will leave it a tree and link() fails for directories).
+
+	Notice that "directory" in the above == "anything that might have
+children", so if we are going to introduce hybrid objects we will need
+either to make sure that link(2) doesn't work for them or to make changes
+in is_subdir() that would make it work even in presence of such beasts.
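
The "ancestors first" rule for cross-directory rename can be sketched in userspace roughly as below. This is an illustration only, under the tree assumption stated above: struct dir, is_ancestor() and lock_order() are hypothetical names, and the function records the order in an array where the kernel would take ->i_sem under ->s_vfs_rename_sem.

```c
#include <assert.h>
#include <stddef.h>

/* Toy directory: the root is its own parent, as with dentries. */
struct dir {
	struct dir *parent;
};

/* Returns 1 if a is a proper ancestor of d, 0 otherwise. */
static int is_ancestor(const struct dir *a, const struct dir *d)
{
	while (d->parent != d) {
		d = d->parent;
		if (d == a)
			return 1;
	}
	return 0;
}

/*
 * Fill order[] with the two parents in "ancestors first" order.
 * Unrelated directories keep the order they were passed in; that is
 * safe only because the (modelled) filesystem lock keeps every other
 * cross-directory rename out while the locks are being taken.
 */
static void lock_order(struct dir *p1, struct dir *p2, struct dir *order[2])
{
	assert(p1 != p2);	/* same-directory rename is class 4 */

	if (is_ancestor(p2, p1)) {
		order[0] = p2;
		order[1] = p1;
	} else {
		order[0] = p1;
		order[1] = p2;
	}
}
```

With a tree root/a/b and root/c, passing (b, root) yields root before b, while the unrelated pair (a, c) stays in argument order.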
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -472,11 +472,9 @@ static ssize_t bm_entry_write(struct file *file, const char *buffer,
 			break;
 		case 3: root = dget(file->f_vfsmnt->mnt_sb->s_root);
 			down(&root->d_inode->i_sem);
-			down(&root->d_inode->i_zombie);

 			kill_node(e);

-			up(&root->d_inode->i_zombie);
 			up(&root->d_inode->i_sem);
 			dput(root);
 			break;
@@ -516,8 +514,6 @@ static ssize_t bm_register_write(struct file *file, const char *buffer,
 	if (IS_ERR(dentry))
 		goto out;

-	down(&root->d_inode->i_zombie);
-
 	err = -EEXIST;
 	if (dentry->d_inode)
 		goto out2;
@@ -556,7 +552,6 @@ static ssize_t bm_register_write(struct file *file, const char *buffer,
 	mntput(mnt);
 	err = 0;
 out2:
-	up(&root->d_inode->i_zombie);
 	dput(dentry);
 out:
 	up(&root->d_inode->i_sem);
@@ -605,12 +600,10 @@ static ssize_t bm_status_write(struct file * file, const char * buffer,
 		case 2: enabled = 1; break;
 		case 3: root = dget(file->f_vfsmnt->mnt_sb->s_root);
 			down(&root->d_inode->i_sem);
-			down(&root->d_inode->i_zombie);

 			while (!list_empty(&entries))
 				kill_node(list_entry(entries.next, Node, list));

-			up(&root->d_inode->i_zombie);
 			up(&root->d_inode->i_sem);
 			dput(root);
 		default: return res;

--- a/fs/inode.c
+++ b/fs/inode.c
@@ -143,7 +143,6 @@ void inode_init_once(struct inode *inode)
 	INIT_LIST_HEAD(&inode->i_dirty_data_buffers);
 	INIT_LIST_HEAD(&inode->i_devices);
 	sema_init(&inode->i_sem, 1);
-	sema_init(&inode->i_zombie, 1);
 	spin_lock_init(&inode->i_data.i_shared_lock);
 }


--- a/fs/namei.c
+++ b/fs/namei.c
@@ -93,6 +93,11 @@
 * hopefully we will be able to get rid of that wart in 2.5. So far only
 * XEmacs seems to be relying on it...
 */
+/*
+ * [Sep 2001 AV] Single-semaphore locking scheme (kudos to David Holland)
+ * implemented.  Let's see if raised priority of ->s_vfs_rename_sem gives
+ * any extra contention...
+ */

 /* In order to reduce some races, while at the same time doing additional
 * checking and hopefully speeding things up, we copy filenames to the
@@ -931,28 +936,67 @@ static inline int lookup_flags(unsigned int f)
 	return retval;
 }

-int vfs_create(struct inode *dir, struct dentry *dentry, int mode)
+/*
+ * p1 and p2 should be directories on the same fs.
+ */
+struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
 {
-	int error;
+	struct dentry *p;

-	mode &= S_IALLUGO;
-	mode |= S_IFREG;
+	if (p1 == p2) {
+		down(&p1->d_inode->i_sem);
+		return NULL;
+	}
+
+	down(&p1->d_inode->i_sb->s_vfs_rename_sem);
+
+	for (p = p1; p->d_parent != p; p = p->d_parent) {
+		if (p->d_parent == p2) {
+			down(&p2->d_inode->i_sem);
+			down(&p1->d_inode->i_sem);
+			return p;
+		}
+	}
+
+	for (p = p2; p->d_parent != p; p = p->d_parent) {
+		if (p->d_parent == p1) {
+			down(&p1->d_inode->i_sem);
+			down(&p2->d_inode->i_sem);
+			return p;
+		}
+	}
+
+	down(&p1->d_inode->i_sem);
+	down(&p2->d_inode->i_sem);
+	return NULL;
+}
+
+void unlock_rename(struct dentry *p1, struct dentry *p2)
+{
+	up(&p1->d_inode->i_sem);
+	if (p1 != p2) {
+		up(&p2->d_inode->i_sem);
+		up(&p1->d_inode->i_sb->s_vfs_rename_sem);
+	}
+}
+
+int vfs_create(struct inode *dir, struct dentry *dentry, int mode)
+{
+	int error = may_create(dir, dentry);

-	down(&dir->i_zombie);
-	error = may_create(dir, dentry);
 	if (error)
-		goto exit_lock;
+		return error;

-	error = -EACCES;	/* shouldn't it be ENOSYS? */
 	if (!dir->i_op || !dir->i_op->create)
-		goto exit_lock;
+		return -EACCES;	/* shouldn't it be ENOSYS? */

 	DQUOT_INIT(dir);
+
+	mode &= S_IALLUGO;
+	mode |= S_IFREG;
 	lock_kernel();
 	error = dir->i_op->create(dir, dentry, mode);
 	unlock_kernel();
-exit_lock:
-	up(&dir->i_zombie);
 	if (!error)
 		inode_dir_notify(dir, DN_CREATE);
 	return error;
@@ -1212,26 +1256,21 @@ static struct dentry *lookup_create(struct nameidata *nd, int is_dir)

 int vfs_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev)
 {
-	int error = -EPERM;
+	int error = may_create(dir, dentry);

-	down(&dir->i_zombie);
-	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
-		goto exit_lock;
-
-	error = may_create(dir, dentry);
 	if (error)
-		goto exit_lock;
+		return error;
+
+	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
+		return -EPERM;

-	error = -EPERM;
 	if (!dir->i_op || !dir->i_op->mknod)
-		goto exit_lock;
+		return -EPERM;

 	DQUOT_INIT(dir);
 	lock_kernel();
 	error = dir->i_op->mknod(dir, dentry, mode, dev);
 	unlock_kernel();
-exit_lock:
-	up(&dir->i_zombie);
 	if (!error)
 		inode_dir_notify(dir, DN_CREATE);
 	return error;
@@ -1284,25 +1323,19 @@ asmlinkage long sys_mknod(const char * filename, int mode, dev_t dev)

 int vfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 {
-	int error;
+	int error = may_create(dir, dentry);

-	down(&dir->i_zombie);
-	error = may_create(dir, dentry);
 	if (error)
-		goto exit_lock;
+		return error;

-	error = -EPERM;
 	if (!dir->i_op || !dir->i_op->mkdir)
-		goto exit_lock;
+		return -EPERM;

 	DQUOT_INIT(dir);
 	mode &= (S_IRWXUGO|S_ISVTX);
 	lock_kernel();
 	error = dir->i_op->mkdir(dir, dentry, mode);
 	unlock_kernel();
-
-exit_lock:
-	up(&dir->i_zombie);
 	if (!error)
 		inode_dir_notify(dir, DN_CREATE);
 	return error;
@@ -1369,9 +1402,8 @@ static void d_unhash(struct dentry *dentry)

 int vfs_rmdir(struct inode *dir, struct dentry *dentry)
 {
-	int error;
+	int error = may_delete(dir, dentry, 1);

-	error = may_delete(dir, dentry, 1);
 	if (error)
 		return error;

@@ -1380,7 +1412,7 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)

 	DQUOT_INIT(dir);

-	double_down(&dir->i_zombie, &dentry->d_inode->i_zombie);
+	down(&dentry->d_inode->i_sem);
 	d_unhash(dentry);
 	if (IS_DEADDIR(dir))
 		error = -ENOENT;
@@ -1393,7 +1425,7 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
 		if (!error)
 			dentry->d_inode->i_flags |= S_DEAD;
 	}
-	double_up(&dir->i_zombie, &dentry->d_inode->i_zombie);
+	up(&dentry->d_inode->i_sem);
 	if (!error) {
 		inode_dir_notify(dir, DN_DELETE);
 		d_delete(dentry);
@@ -1447,14 +1479,18 @@ asmlinkage long sys_rmdir(const char * pathname)

 int vfs_unlink(struct inode *dir, struct dentry *dentry)
 {
-	int error;
+	int error = may_delete(dir, dentry, 0);
+
+	if (error)
+		return error;
+
+	if (!dir->i_op || !dir->i_op->unlink)
+		return -EPERM;

-	down(&dir->i_zombie);
-	error = may_delete(dir, dentry, 0);
-	if (!error) {
-		error = -EPERM;
-		if (dir->i_op && dir->i_op->unlink) {
 	DQUOT_INIT(dir);
+
+	dget(dentry);
+	down(&dentry->d_inode->i_sem);
 	if (d_mountpoint(dentry))
 		error = -EBUSY;
 	else {
@@ -1464,11 +1500,12 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry)
 		if (!error)
 			d_delete(dentry);
 	}
-		}
-	}
-	up(&dir->i_zombie);
+	up(&dentry->d_inode->i_sem);
+	dput(dentry);
+
 	if (!error)
 		inode_dir_notify(dir, DN_DELETE);
+
 	return error;
 }

@@ -1517,24 +1554,18 @@ asmlinkage long sys_unlink(const char * pathname)

 int vfs_symlink(struct inode *dir, struct dentry *dentry, const char *oldname)
 {
-	int error;
+	int error = may_create(dir, dentry);

-	down(&dir->i_zombie);
-	error = may_create(dir, dentry);
 	if (error)
-		goto exit_lock;
+		return error;

-	error = -EPERM;
 	if (!dir->i_op || !dir->i_op->symlink)
-		goto exit_lock;
+		return -EPERM;

 	DQUOT_INIT(dir);
 	lock_kernel();
 	error = dir->i_op->symlink(dir, dentry, oldname);
 	unlock_kernel();
-
-exit_lock:
-	up(&dir->i_zombie);
 	if (!error)
 		inode_dir_notify(dir, DN_CREATE);
 	return error;
@@ -1576,39 +1607,31 @@ asmlinkage long sys_symlink(const char * oldname, const char * newname)

 int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
 {
-	struct inode *inode;
+	struct inode *inode = old_dentry->d_inode;
 	int error;

-	down(&dir->i_zombie);
-	error = -ENOENT;
-	inode = old_dentry->d_inode;
 	if (!inode)
-		goto exit_lock;
+		return -ENOENT;

 	error = may_create(dir, new_dentry);
 	if (error)
-		goto exit_lock;
+		return error;

-	error = -EXDEV;
 	if (dir->i_sb != inode->i_sb)
-		goto exit_lock;
+		return -EXDEV;

 	/*
 	 * A link to an append-only or immutable file cannot be created.
 	 */
-	error = -EPERM;
 	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
-		goto exit_lock;
+		return -EPERM;
 	if (!dir->i_op || !dir->i_op->link)
-		goto exit_lock;
+		return -EPERM;

 	DQUOT_INIT(dir);
 	lock_kernel();
 	error = dir->i_op->link(old_dentry, dir, new_dentry);
 	unlock_kernel();
-
-exit_lock:
-	up(&dir->i_zombie);
 	if (!error)
 		inode_dir_notify(dir, DN_CREATE);
 	return error;
@@ -1680,17 +1703,23 @@ asmlinkage long sys_link(const char * oldname, const char * newname)
 *	   story.
 *	c) we have to lock _three_ objects - parents and victim (if it exists).
 *	   And that - after we got ->i_sem on parents (until then we don't know
- *	   whether the target exists at all, let alone whether it is a directory
- *	   or not). Solution: ->i_zombie. Taken only after ->i_sem. Always taken
- *	   on link creation/removal of any kind. And taken (without ->i_sem) on
- *	   directory that will be removed (both in rmdir() and here).
+ *	   whether the target exists).  Solution: try to be smart with locking
+ *	   order for inodes.  We rely on the fact that tree topology may change
+ *	   only under ->s_vfs_rename_sem _and_ that parent of the object we
+ *	   move will be locked.  Thus we can rank directories by the tree
+ *	   (ancestors first) and rank all non-directories after them.
+ *	   That works since everybody except rename does "lock parent, lookup,
+ *	   lock child" and rename is under ->s_vfs_rename_sem.
+ *	   HOWEVER, it relies on the assumption that any object with ->lookup()
+ *	   has no more than 1 dentry.  If "hybrid" objects will ever appear,
+ *	   we'd better make sure that there's no link(2) for them.
 *	d) some filesystems don't support opened-but-unlinked directories,
 *	   either because of layout or because they are not ready to deal with
 *	   all cases correctly. The latter will be fixed (taking this sort of
 *	   stuff into VFS), but the former is not going away. Solution: the same
 *	   trick as in rmdir().
 *	e) conversion from fhandle to dentry may come in the wrong moment - when
- *	   we are removing the target. Solution: we will have to grab ->i_zombie
+ *	   we are removing the target. Solution: we will have to grab ->i_sem
 *	   in the fhandle_to_dentry code. [FIXME - current nfsfh.c relies on
 *	   ->i_sem on parents, which works but leads to some truely excessive
 *	   locking].
@@ -1698,131 +1727,96 @@ asmlinkage long sys_link(const char * oldname, const char * newname)
 int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
 	       struct inode *new_dir, struct dentry *new_dentry)
 {
-	int error;
+	int error = 0;
 	struct inode *target;

-	if (old_dentry->d_inode == new_dentry->d_inode)
-		return 0;
-
-	error = may_delete(old_dir, old_dentry, 1);
-	if (error)
-		return error;
-
-	if (new_dir->i_sb != old_dir->i_sb)
-		return -EXDEV;
-
-	if (!new_dentry->d_inode)
-		error = may_create(new_dir, new_dentry);
-	else
-		error = may_delete(new_dir, new_dentry, 1);
-	if (error)
-		return error;
-
-	if (!old_dir->i_op || !old_dir->i_op->rename)
-		return -EPERM;
-
 	/*
 	 * If we are going to change the parent - check write permissions,
 	 * we'll need to flip '..'.
 	 */
-	if (new_dir != old_dir) {
+	if (new_dir != old_dir)
 		error = permission(old_dentry->d_inode, MAY_WRITE);
-	}
+
 	if (error)
 		return error;

-	DQUOT_INIT(old_dir);
-	DQUOT_INIT(new_dir);
-	down(&old_dir->i_sb->s_vfs_rename_sem);
-	error = -EINVAL;
-	if (is_subdir(new_dentry, old_dentry))
-		goto out_unlock;
-	/* Don't eat your daddy, dear... */
-	/* This also avoids locking issues */
-	if (old_dentry->d_parent == new_dentry)
-		goto out_unlock;
 	target = new_dentry->d_inode;
-	if (target) { /* Hastur! Hastur! Hastur! */
-		triple_down(&old_dir->i_zombie,
-			    &new_dir->i_zombie,
-			    &target->i_zombie);
+	if (target) {
+		down(&target->i_sem);
 		d_unhash(new_dentry);
-	} else
-		double_down(&old_dir->i_zombie,
-			    &new_dir->i_zombie);
-	if (IS_DEADDIR(old_dir)||IS_DEADDIR(new_dir))
-		error = -ENOENT;
-	else if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
+	}
+	if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
 		error = -EBUSY;
 	else 
 		error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
 	if (target) {
 		if (!error)
 			target->i_flags |= S_DEAD;
-		triple_up(&old_dir->i_zombie,
-			  &new_dir->i_zombie,
-			  &target->i_zombie);
+		up(&target->i_sem);
 		if (d_unhashed(new_dentry))
 			d_rehash(new_dentry);
 		dput(new_dentry);
-	} else
-		double_up(&old_dir->i_zombie,
-			  &new_dir->i_zombie);
-		
+	}
 	if (!error)
 		d_move(old_dentry,new_dentry);
-out_unlock:
-	up(&old_dir->i_sb->s_vfs_rename_sem);
 	return error;
 }

 int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
 	       struct inode *new_dir, struct dentry *new_dentry)
 {
+	struct inode *target;
 	int error;

+	dget(new_dentry);
+	target = new_dentry->d_inode;
+	if (target)
+		down(&target->i_sem);
+	if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
+		error = -EBUSY;
+	else
+		error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
+	if (!error) {
+		/* The following d_move() should become unconditional */
+		if (!(old_dir->i_sb->s_type->fs_flags & FS_ODD_RENAME)) {
+			d_move(old_dentry, new_dentry);
+		}
+	}
+	if (target)
+		up(&target->i_sem);
+	dput(new_dentry);
+	return error;
+}
+
+int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
+	       struct inode *new_dir, struct dentry *new_dentry)
+{
+	int error;
+	int is_dir = S_ISDIR(old_dentry->d_inode->i_mode);
+
 	if (old_dentry->d_inode == new_dentry->d_inode)
 		return 0;
 
-	error = may_delete(old_dir, old_dentry, 0);
+	error = may_delete(old_dir, old_dentry, is_dir);
 	if (error)
 		return error;

-	if (new_dir->i_sb != old_dir->i_sb)
-		return -EXDEV;
-
 	if (!new_dentry->d_inode)
 		error = may_create(new_dir, new_dentry);
 	else
-		error = may_delete(new_dir, new_dentry, 0);
+		error = may_delete(new_dir, new_dentry, is_dir);
 	if (error)
 		return error;

 	if (!old_dir->i_op || !old_dir->i_op->rename)
 		return -EPERM;

+	if (IS_DEADDIR(old_dir)||IS_DEADDIR(new_dir))
+		return -ENOENT;
 	DQUOT_INIT(old_dir);
 	DQUOT_INIT(new_dir);
-	double_down(&old_dir->i_zombie, &new_dir->i_zombie);
-	if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
-		error = -EBUSY;
-	else
-		error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
-	double_up(&old_dir->i_zombie, &new_dir->i_zombie);
-	if (error)
-		return error;
-	/* The following d_move() should become unconditional */
-	if (!(old_dir->i_sb->s_type->fs_flags & FS_ODD_RENAME)) {
-		d_move(old_dentry, new_dentry);
-	}
-	return 0;
-}

-int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
-	       struct inode *new_dir, struct dentry *new_dentry)
-{
-	int error;
-	if (S_ISDIR(old_dentry->d_inode->i_mode))
+	if (is_dir)
 		error = vfs_rename_dir(old_dir,old_dentry,new_dir,new_dentry);
 	else
 		error = vfs_rename_other(old_dir,old_dentry,new_dir,new_dentry);
@@ -1842,6 +1836,7 @@ static inline int do_rename(const char * oldname, const char * newname)
 	int error = 0;
 	struct dentry * old_dir, * new_dir;
 	struct dentry * old_dentry, *new_dentry;
+	struct dentry * trap;
 	struct nameidata oldnd, newnd;

 	if (path_init(oldname, LOOKUP_PARENT, &oldnd))
@@ -1868,7 +1863,7 @@ static inline int do_rename(const char * oldname, const char * newname)
 	if (newnd.last_type != LAST_NORM)
 		goto exit2;

-	double_lock(new_dir, old_dir);
+	trap = lock_rename(new_dir, old_dir);

 	old_dentry = lookup_hash(&oldnd.last, old_dir);
 	error = PTR_ERR(old_dentry);
@@ -1886,21 +1881,30 @@ static inline int do_rename(const char * oldname, const char * newname)
 		if (newnd.last.name[newnd.last.len])
 			goto exit4;
 	}
+	/* source should not be ancestor of target */
+	error = -EINVAL;
+	if (old_dentry == trap)
+		goto exit4;
 	new_dentry = lookup_hash(&newnd.last, new_dir);
 	error = PTR_ERR(new_dentry);
 	if (IS_ERR(new_dentry))
 		goto exit4;
+	/* target should not be an ancestor of source */
+	error = -ENOTEMPTY;
+	if (new_dentry == trap)
+		goto exit5;

 	lock_kernel();
 	error = vfs_rename(old_dir->d_inode, old_dentry,
 				   new_dir->d_inode, new_dentry);
 	unlock_kernel();

+exit5:
 	dput(new_dentry);
 exit4:
 	dput(old_dentry);
 exit3:
-	double_up(&new_dir->d_inode->i_sem, &old_dir->d_inode->i_sem);
+	unlock_rename(new_dir, old_dir);
 exit2:
 	path_release(&newnd);
 exit1:
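
The way do_rename() uses the dentry returned by lock_rename() can be modelled in userspace as below. This is a sketch with toy types, not kernel code: toy_dentry, toy_trap() and toy_rename_check() are hypothetical names, and no locks are taken. Note that the code in this patch returns -EINVAL when the source is an ancestor of the new parent, whereas the text in directory-locking names -ELOOP for that case; the sketch follows the code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_dentry {
	struct toy_dentry *d_parent;	/* the root points to itself */
};

/*
 * Mirror of the two walks in lock_rename(): if one parent is an
 * ancestor of the other, return the child of the ancestor that lies
 * on the path to the descendant; otherwise return NULL.
 */
static struct toy_dentry *toy_trap(struct toy_dentry *p1,
				   struct toy_dentry *p2)
{
	struct toy_dentry *p;

	if (p1 == p2)
		return NULL;
	for (p = p1; p->d_parent != p; p = p->d_parent)
		if (p->d_parent == p2)
			return p;
	for (p = p2; p->d_parent != p; p = p->d_parent)
		if (p->d_parent == p1)
			return p;
	return NULL;
}

/* Same checks, in the same order, as the ones added to do_rename(). */
static int toy_rename_check(struct toy_dentry *trap,
			    struct toy_dentry *source,
			    struct toy_dentry *target)
{
	assert(source && target);

	if (trap && source == trap)
		return -EINVAL;		/* source is an ancestor of new parent */
	if (trap && target == trap)
		return -ENOTEMPTY;	/* target is an ancestor of old parent */
	return 0;
}
```

Because the trap is computed while the parents are being locked, the caller needs only two pointer comparisons after lookup and no separate is_subdir() walk.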

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -466,7 +466,7 @@ static int graft_tree(struct vfsmount *mnt, struct nameidata *nd)
 		return -ENOTDIR;

 	err = -ENOENT;
-	down(&nd->dentry->d_inode->i_zombie);
+	down(&nd->dentry->d_inode->i_sem);
 	if (IS_DEADDIR(nd->dentry->d_inode))
 		goto out_unlock;

@@ -481,7 +481,7 @@ static int graft_tree(struct vfsmount *mnt, struct nameidata *nd)
 	}
 	spin_unlock(&dcache_lock);
 out_unlock:
-	up(&nd->dentry->d_inode->i_zombie);
+	up(&nd->dentry->d_inode->i_sem);
 	return err;
 }

@@ -577,7 +577,7 @@ static int do_move_mount(struct nameidata *nd, char *old_name)
 		goto out;

 	err = -ENOENT;
-	down(&nd->dentry->d_inode->i_zombie);
+	down(&nd->dentry->d_inode->i_sem);
 	if (IS_DEADDIR(nd->dentry->d_inode))
 		goto out1;

@@ -607,7 +607,7 @@ static int do_move_mount(struct nameidata *nd, char *old_name)
 out2:
 	spin_unlock(&dcache_lock);
 out1:
-	up(&nd->dentry->d_inode->i_zombie);
+	up(&nd->dentry->d_inode->i_sem);
 out:
 	up_write(&current->namespace->sem);
 	if (!err)
@@ -949,7 +949,7 @@ asmlinkage long sys_pivot_root(const char *new_root, const char *put_old)
 	user_nd.dentry = dget(current->fs->root);
 	read_unlock(&current->fs->lock);
 	down_write(&current->namespace->sem);
-	down(&old_nd.dentry->d_inode->i_zombie);
+	down(&old_nd.dentry->d_inode->i_sem);
 	error = -EINVAL;
 	if (!check_mnt(user_nd.mnt))
 		goto out2;
@@ -992,7 +992,7 @@ asmlinkage long sys_pivot_root(const char *new_root, const char *put_old)
 	path_release(&root_parent);
 	path_release(&parent_nd);
 out2:
-	up(&old_nd.dentry->d_inode->i_zombie);
+	up(&old_nd.dentry->d_inode->i_sem);
 	up_write(&current->namespace->sem);
 	path_release(&user_nd);
 	path_release(&old_nd);

--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1226,7 +1226,7 @@ int
 nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 			    struct svc_fh *tfhp, char *tname, int tlen)
 {
-	struct dentry	*fdentry, *tdentry, *odentry, *ndentry;
+	struct dentry	*fdentry, *tdentry, *odentry, *ndentry, *trap;
 	struct inode	*fdir, *tdir;
 	int		err;

@@ -1253,7 +1253,7 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,

 	/* cannot use fh_lock as we need deadlock protective ordering
 	 * so do it by hand */
-	double_down(&tdir->i_sem, &fdir->i_sem);
+	trap = lock_rename(tdentry, fdentry);
 	ffhp->fh_locked = tfhp->fh_locked = 1;
 	fill_pre_wcc(ffhp);
 	fill_pre_wcc(tfhp);
@@ -1266,12 +1266,17 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 	err = -ENOENT;
 	if (!odentry->d_inode)
 		goto out_dput_old;
+	err = -EINVAL;
+	if (odentry == trap)
+		goto out_dput_old;

 	ndentry = lookup_one_len(tname, tdentry, tlen);
 	err = PTR_ERR(ndentry);
 	if (IS_ERR(ndentry))
 		goto out_dput_old;
-
+	err = -ENOTEMPTY;
+	if (ndentry == trap)
+		goto out_dput_new;

 #ifdef MSNFS
 	if ((ffhp->fh_export->ex_flags & NFSEXP_MSNFS) &&
@@ -1287,6 +1292,8 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 	}
 	dput(ndentry);

+ out_dput_new:
+	dput(ndentry);
 out_dput_old:
 	dput(odentry);
 out_nfserr:
@@ -1299,7 +1306,7 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 	 */
 	fill_post_wcc(ffhp);
 	fill_post_wcc(tfhp);
-	double_up(&tdir->i_sem, &fdir->i_sem);
+	unlock_rename(tdentry, fdentry);
 	ffhp->fh_locked = tfhp->fh_locked = 0;

 out:

--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -21,14 +21,12 @@ int vfs_readdir(struct file *file, filldir_t filler, void *buf)
 	if (!file->f_op || !file->f_op->readdir)
 		goto out;
 	down(&inode->i_sem);
-	down(&inode->i_zombie);
 	res = -ENOENT;
 	if (!IS_DEADDIR(inode)) {
 		lock_kernel();
 		res = file->f_op->readdir(file, buf, filler);
 		unlock_kernel();
 	}
-	up(&inode->i_zombie);
 	up(&inode->i_sem);
 out:
 	return res;

--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -425,7 +425,6 @@ struct inode {
 	unsigned long		i_blocks;
 	unsigned long		i_version;
 	struct semaphore	i_sem;
-	struct semaphore	i_zombie;
 	struct inode_operations	*i_op;
 	struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
 	struct super_block	*i_sb;
@@ -759,6 +758,9 @@ extern int vfs_rmdir(struct inode *, struct dentry *);
 extern int vfs_unlink(struct inode *, struct dentry *);
 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);

+extern struct dentry *lock_rename(struct dentry *, struct dentry *);
+extern void unlock_rename(struct dentry *, struct dentry *);
+
 /*
 * File types
 */
@@ -1505,131 +1507,6 @@ extern int generic_osync_inode(struct inode *, int);
 extern int inode_change_ok(struct inode *, struct iattr *);
 extern int inode_setattr(struct inode *, struct iattr *);

-/*
- * Common dentry functions for inclusion in the VFS
- * or in other stackable file systems.  Some of these
- * functions were in linux/fs/ C (VFS) files.
- *
- */
-
-/*
- * Locking the parent is needed to:
- *  - serialize directory operations
- *  - make sure the parent doesn't change from
- *    under us in the middle of an operation.
- *
- * NOTE! Right now we'd rather use a "struct inode"
- * for this, but as I expect things to move toward
- * using dentries instead for most things it is
- * probably better to start with the conceptually
- * better interface of relying on a path of dentries.
- */
-static inline struct dentry *lock_parent(struct dentry *dentry)
-{
-	struct dentry *dir = dget(dentry->d_parent);
-
-	down(&dir->d_inode->i_sem);
-	return dir;
-}
-
-static inline struct dentry *get_parent(struct dentry *dentry)
-{
-	return dget(dentry->d_parent);
-}
-
-static inline void unlock_dir(struct dentry *dir)
-{
-	up(&dir->d_inode->i_sem);
-	dput(dir);
-}
-
-/*
- * Whee.. Deadlock country. Happily there are only two VFS
- * operations that does this..
- */
-static inline void double_down(struct semaphore *s1, struct semaphore *s2)
-{
-	if (s1 != s2) {
-		if ((unsigned long) s1 < (unsigned long) s2) {
-			struct semaphore *tmp = s2;
-			s2 = s1; s1 = tmp;
-		}
-		down(s1);
-	}
-	down(s2);
-}
-
-/*
- * Ewwwwwwww... _triple_ lock. We are guaranteed that the 3rd argument is
- * not equal to 1st and not equal to 2nd - the first case (target is parent of
- * source) would be already caught, the second is plain impossible (target is
- * its own parent and that case would be caught even earlier). Very messy.
- * I _think_ that it works, but no warranties - please, look it through.
- * Pox on bloody lusers who mandated overwriting rename() for directories...
- */
-
-static inline void triple_down(struct semaphore *s1,
-			       struct semaphore *s2,
-			       struct semaphore *s3)
-{
-	if (s1 != s2) {
-		if ((unsigned long) s1 < (unsigned long) s2) {
-			if ((unsigned long) s1 < (unsigned long) s3) {
-				struct semaphore *tmp = s3;
-				s3 = s1; s1 = tmp;
-			}
-			if ((unsigned long) s1 < (unsigned long) s2) {
-				struct semaphore *tmp = s2;
-				s2 = s1; s1 = tmp;
-			}
-		} else {
-			if ((unsigned long) s1 < (unsigned long) s3) {
-				struct semaphore *tmp = s3;
-				s3 = s1; s1 = tmp;
-			}
-			if ((unsigned long) s2 < (unsigned long) s3) {
-				struct semaphore *tmp = s3;
-				s3 = s2; s2 = tmp;
-			}
-		}
-		down(s1);
-	} else if ((unsigned long) s2 < (unsigned long) s3) {
-		struct semaphore *tmp = s3;
-		s3 = s2; s2 = tmp;
-	}
-	down(s2);
-	down(s3);
-}
-
-static inline void double_up(struct semaphore *s1, struct semaphore *s2)
-{
-	up(s1);
-	if (s1 != s2)
-		up(s2);
-}
-
-static inline void triple_up(struct semaphore *s1,
-			     struct semaphore *s2,
-			     struct semaphore *s3)
-{
-	up(s1);
-	if (s1 != s2)
-		up(s2);
-	up(s3);
-}
-
-static inline void double_lock(struct dentry *d1, struct dentry *d2)
-{
-	double_down(&d1->d_inode->i_sem, &d2->d_inode->i_sem);
-}
-
-static inline void double_unlock(struct dentry *d1, struct dentry *d2)
-{
-	double_up(&d1->d_inode->i_sem,&d2->d_inode->i_sem);
-	dput(d1);
-	dput(d2);
-}
-
 #endif /* __KERNEL__ */

 #endif /* _LINUX_FS_H */
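
For contrast with lock_rename(), the ordering used by the removed double_down() depended only on the addresses of the two semaphores, not on where the directories sit in the tree. A small sketch of that rule follows; toy_lock and address_order() are hypothetical names, and the function records the order where the removed helper called down().

```c
#include <assert.h>
#include <stdint.h>

struct toy_lock {
	int held;
};

/* Put the higher-addressed lock first, as the removed double_down() did. */
static void address_order(struct toy_lock *s1, struct toy_lock *s2,
			  struct toy_lock *order[2])
{
	assert(s1 != s2);

	if ((uintptr_t)s1 < (uintptr_t)s2) {
		struct toy_lock *tmp = s2;

		s2 = s1;
		s1 = tmp;
	}
	order[0] = s1;
	order[1] = s2;
}
```

The result is the same whichever way the arguments are passed, which is what made the old scheme consistent between callers; it does not, however, relate the order to the parent/child relationship that the new scheme ranks by.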
--- a/kernel/ksyms.c
+++ b/kernel/ksyms.c
@@ -253,6 +253,8 @@ EXPORT_SYMBOL(vfs_statfs);
 EXPORT_SYMBOL(vfs_fstat);
 EXPORT_SYMBOL(vfs_stat);
 EXPORT_SYMBOL(vfs_lstat);
+EXPORT_SYMBOL(lock_rename);
+EXPORT_SYMBOL(unlock_rename);
 EXPORT_SYMBOL(generic_read_dir);
 EXPORT_SYMBOL(generic_file_llseek);
 EXPORT_SYMBOL(remote_llseek);