Commit 3be5a52b authored by Miklos Szeredi's avatar Miklos Szeredi Committed by Linus Torvalds

fuse: support writable mmap

Quoting Linus (3 years ago, FUSE inclusion discussions):

  "User-space filesystems are hard to get right. I'd claim that they
   are almost impossible, unless you limit them somehow (shared
   writable mappings are the nastiest part - if you don't have those,
   you can reasonably limit your problems by limiting the number of
   dirty pages you accept through normal "write()" calls)."

Instead of attempting the impossible, I've just waited for the dirty page
accounting infrastructure to materialize (thanks to Peter Zijlstra and
others).  This nicely solved the biggest problem: limiting the number of pages
used for write caching.

Some small details remained, however, which this largish patch attempts to
address.  It provides a page writeback implementation for fuse, which is
completely safe against VM related deadlocks.  Performance may not be very
good for certain usage patterns, but generally it should be acceptable.

It has been tested extensively with fsx-linux and bash-shared-mapping.

Fuse page writeback design
--------------------------

fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
It copies the contents of the original page, and queues a WRITE request to the
userspace filesystem using this temp page.

The writeback is finished instantly from the MM's point of view: the page is
removed from the radix trees, and the PageDirty and PageWriteback flags are
cleared.

For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
incremented.  The per-bdi writeback count is not decremented until the actual
write completes.

On dirtying the page, fuse waits for a previous write to finish before
proceeding.  This makes sure, there can only be one temporary page used at a
time for one cached page.

This approach is wasteful in both memory and CPU bandwidth, so why is this
complication needed?

The basic problem is that there can be no guarantee about the time in which
the userspace filesystem will complete a write.  It may be buggy or even
malicious, and fail to complete WRITE requests.  We don't want unrelated parts
of the system to grind to a halt in such cases.

Also a filesystem may need additional resources (particularly memory) to
complete a WRITE request.  There's a great danger of a deadlock if that
allocation may wait for the writepage to finish.

Currently there are several cases where the kernel can block on page
writeback:

  - allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
  - page migration
  - throttle_vm_writeout (through NR_WRITEBACK)
  - sync(2)

Of course in some cases (fsync, msync) we explicitly want to allow blocking.
So for these cases new code has to be added to fuse, since the VM is not
tracking writeback pages for us any more.

As an extra safetly measure, the maximum dirty ratio allocated to a single
fuse filesystem is set to 1% by default.  This way one (or several) buggy or
malicious fuse filesystems cannot slow down the rest of the system by hogging
dirty memory.

With appropriate privileges, this limit can be raised through
'/sys/class/bdi/<bdi>/max_ratio'.
Signed-off-by: default avatarMiklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
parent b88473f7
......@@ -47,6 +47,14 @@ struct fuse_req *fuse_request_alloc(void)
return req;
}
struct fuse_req *fuse_request_alloc_nofs(void)
{
struct fuse_req *req = kmem_cache_alloc(fuse_req_cachep, GFP_NOFS);
if (req)
fuse_request_init(req);
return req;
}
void fuse_request_free(struct fuse_req *req)
{
kmem_cache_free(fuse_req_cachep, req);
......@@ -429,6 +437,17 @@ void request_send_background(struct fuse_conn *fc, struct fuse_req *req)
request_send_nowait(fc, req);
}
/*
* Called under fc->lock
*
* fc->connected must have been checked previously
*/
void request_send_background_locked(struct fuse_conn *fc, struct fuse_req *req)
{
req->isreply = 1;
request_send_nowait_locked(fc, req);
}
/*
* Lock the request. Up to the next unlock_request() there mustn't be
* anything that could cause a page-fault. If the request was already
......
......@@ -1106,6 +1106,50 @@ static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg)
}
}
/*
* Prevent concurrent writepages on inode
*
* This is done by adding a negative bias to the inode write counter
* and waiting for all pending writes to finish.
*/
void fuse_set_nowrite(struct inode *inode)
{
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
BUG_ON(!mutex_is_locked(&inode->i_mutex));
spin_lock(&fc->lock);
BUG_ON(fi->writectr < 0);
fi->writectr += FUSE_NOWRITE;
spin_unlock(&fc->lock);
wait_event(fi->page_waitq, fi->writectr == FUSE_NOWRITE);
}
/*
* Allow writepages on inode
*
* Remove the bias from the writecounter and send any queued
* writepages.
*/
static void __fuse_release_nowrite(struct inode *inode)
{
struct fuse_inode *fi = get_fuse_inode(inode);
BUG_ON(fi->writectr != FUSE_NOWRITE);
fi->writectr = 0;
fuse_flush_writepages(inode);
}
void fuse_release_nowrite(struct inode *inode)
{
struct fuse_conn *fc = get_fuse_conn(inode);
spin_lock(&fc->lock);
__fuse_release_nowrite(inode);
spin_unlock(&fc->lock);
}
/*
* Set attributes, and at the same time refresh them.
*
......@@ -1122,6 +1166,8 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
struct fuse_req *req;
struct fuse_setattr_in inarg;
struct fuse_attr_out outarg;
bool is_truncate = false;
loff_t oldsize;
int err;
if (!fuse_allow_task(fc, current))
......@@ -1145,12 +1191,16 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
send_sig(SIGXFSZ, current, 0);
return -EFBIG;
}
is_truncate = true;
}
req = fuse_get_req(fc);
if (IS_ERR(req))
return PTR_ERR(req);
if (is_truncate)
fuse_set_nowrite(inode);
memset(&inarg, 0, sizeof(inarg));
memset(&outarg, 0, sizeof(outarg));
iattr_to_fattr(attr, &inarg);
......@@ -1181,16 +1231,44 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
if (err) {
if (err == -EINTR)
fuse_invalidate_attr(inode);
return err;
goto error;
}
if ((inode->i_mode ^ outarg.attr.mode) & S_IFMT) {
make_bad_inode(inode);
return -EIO;
err = -EIO;
goto error;
}
spin_lock(&fc->lock);
fuse_change_attributes_common(inode, &outarg.attr,
attr_timeout(&outarg));
oldsize = inode->i_size;
i_size_write(inode, outarg.attr.size);
if (is_truncate) {
/* NOTE: this may release/reacquire fc->lock */
__fuse_release_nowrite(inode);
}
spin_unlock(&fc->lock);
/*
* Only call invalidate_inode_pages2() after removing
* FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
*/
if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
if (outarg.attr.size < oldsize)
fuse_truncate(inode->i_mapping, outarg.attr.size);
invalidate_inode_pages2(inode->i_mapping);
}
fuse_change_attributes(inode, &outarg.attr, attr_timeout(&outarg), 0);
return 0;
error:
if (is_truncate)
fuse_release_nowrite(inode);
return err;
}
static int fuse_setattr(struct dentry *entry, struct iattr *attr)
......
This diff is collapsed.
......@@ -15,6 +15,7 @@
#include <linux/mm.h>
#include <linux/backing-dev.h>
#include <linux/mutex.h>
#include <linux/rwsem.h>
/** Max number of pages that can be used in a single read request */
#define FUSE_MAX_PAGES_PER_REQ 32
......@@ -25,6 +26,9 @@
/** Congestion starts at 75% of maximum */
#define FUSE_CONGESTION_THRESHOLD (FUSE_MAX_BACKGROUND * 75 / 100)
/** Bias for fi->writectr, meaning new writepages must not be sent */
#define FUSE_NOWRITE INT_MIN
/** It could be as large as PATH_MAX, but would that have any uses? */
#define FUSE_NAME_MAX 1024
......@@ -73,6 +77,19 @@ struct fuse_inode {
/** Files usable in writepage. Protected by fc->lock */
struct list_head write_files;
/** Writepages pending on truncate or fsync */
struct list_head queued_writes;
/** Number of sent writes, a negative bias (FUSE_NOWRITE)
* means more writes are blocked */
int writectr;
/** Waitq for writepage completion */
wait_queue_head_t page_waitq;
/** List of writepage requestst (pending or sent) */
struct list_head writepages;
};
/** FUSE specific file data */
......@@ -242,6 +259,12 @@ struct fuse_req {
/** File used in the request (or NULL) */
struct fuse_file *ff;
/** Inode used in the request or NULL */
struct inode *inode;
/** Link on fi->writepages */
struct list_head writepages_entry;
/** Request completion callback */
void (*end)(struct fuse_conn *, struct fuse_req *);
......@@ -504,6 +527,11 @@ void fuse_init_symlink(struct inode *inode);
void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
u64 attr_valid, u64 attr_version);
void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
u64 attr_valid);
void fuse_truncate(struct address_space *mapping, loff_t offset);
/**
* Initialize the client device
*/
......@@ -522,6 +550,8 @@ void fuse_ctl_cleanup(void);
*/
struct fuse_req *fuse_request_alloc(void);
struct fuse_req *fuse_request_alloc_nofs(void);
/**
* Free a request
*/
......@@ -558,6 +588,8 @@ void request_send_noreply(struct fuse_conn *fc, struct fuse_req *req);
*/
void request_send_background(struct fuse_conn *fc, struct fuse_req *req);
void request_send_background_locked(struct fuse_conn *fc, struct fuse_req *req);
/* Abort all requests */
void fuse_abort_conn(struct fuse_conn *fc);
......@@ -600,3 +632,8 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id);
int fuse_update_attributes(struct inode *inode, struct kstat *stat,
struct file *file, bool *refreshed);
void fuse_flush_writepages(struct inode *inode);
void fuse_set_nowrite(struct inode *inode);
void fuse_release_nowrite(struct inode *inode);
......@@ -59,7 +59,11 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
fi->nodeid = 0;
fi->nlookup = 0;
fi->attr_version = 0;
fi->writectr = 0;
INIT_LIST_HEAD(&fi->write_files);
INIT_LIST_HEAD(&fi->queued_writes);
INIT_LIST_HEAD(&fi->writepages);
init_waitqueue_head(&fi->page_waitq);
fi->forget_req = fuse_request_alloc();
if (!fi->forget_req) {
kmem_cache_free(fuse_inode_cachep, inode);
......@@ -73,6 +77,7 @@ static void fuse_destroy_inode(struct inode *inode)
{
struct fuse_inode *fi = get_fuse_inode(inode);
BUG_ON(!list_empty(&fi->write_files));
BUG_ON(!list_empty(&fi->queued_writes));
if (fi->forget_req)
fuse_request_free(fi->forget_req);
kmem_cache_free(fuse_inode_cachep, inode);
......@@ -109,7 +114,7 @@ static int fuse_remount_fs(struct super_block *sb, int *flags, char *data)
return 0;
}
static void fuse_truncate(struct address_space *mapping, loff_t offset)
void fuse_truncate(struct address_space *mapping, loff_t offset)
{
/* See vmtruncate() */
unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
......@@ -117,19 +122,12 @@ static void fuse_truncate(struct address_space *mapping, loff_t offset)
unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
}
void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
u64 attr_valid, u64 attr_version)
void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
u64 attr_valid)
{
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
loff_t oldsize;
spin_lock(&fc->lock);
if (attr_version != 0 && fi->attr_version > attr_version) {
spin_unlock(&fc->lock);
return;
}
fi->attr_version = ++fc->attr_version;
fi->i_time = attr_valid;
......@@ -159,6 +157,22 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
fi->orig_i_mode = inode->i_mode;
if (!(fc->flags & FUSE_DEFAULT_PERMISSIONS))
inode->i_mode &= ~S_ISVTX;
}
void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
u64 attr_valid, u64 attr_version)
{
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
loff_t oldsize;
spin_lock(&fc->lock);
if (attr_version != 0 && fi->attr_version > attr_version) {
spin_unlock(&fc->lock);
return;
}
fuse_change_attributes_common(inode, attr, attr_valid);
oldsize = inode->i_size;
i_size_write(inode, attr->size);
......@@ -468,6 +482,8 @@ static struct fuse_conn *new_conn(struct super_block *sb)
atomic_set(&fc->num_waiting, 0);
fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
fc->bdi.unplug_io_fn = default_unplug_io_fn;
/* fuse does it's own writeback accounting */
fc->bdi.capabilities = BDI_CAP_NO_ACCT_WB;
fc->dev = sb->s_dev;
err = bdi_init(&fc->bdi);
if (err)
......@@ -475,6 +491,19 @@ static struct fuse_conn *new_conn(struct super_block *sb)
err = bdi_register_dev(&fc->bdi, fc->dev);
if (err)
goto error_bdi_destroy;
/*
* For a single fuse filesystem use max 1% of dirty +
* writeback threshold.
*
* This gives about 1M of write buffer for memory maps on a
* machine with 1G and 10% dirty_ratio, which should be more
* than enough.
*
* Privileged users can raise it by writing to
*
* /sys/class/bdi/<bdi>/max_ratio
*/
bdi_set_max_ratio(&fc->bdi, 1);
fc->reqctr = 0;
fc->blocked = 1;
fc->attr_version = 1;
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment