    fuse: support writable mmap · 3be5a52b
    Miklos Szeredi authored
    Quoting Linus (3 years ago, FUSE inclusion discussions):
    
      "User-space filesystems are hard to get right. I'd claim that they
       are almost impossible, unless you limit them somehow (shared
       writable mappings are the nastiest part - if you don't have those,
       you can reasonably limit your problems by limiting the number of
       dirty pages you accept through normal "write()" calls)."
    
    Instead of attempting the impossible, I've just waited for the dirty page
    accounting infrastructure to materialize (thanks to Peter Zijlstra and
    others).  This nicely solved the biggest problem: limiting the number of pages
    used for write caching.
    
    Some small details remained, however, which this largish patch attempts to
    address.  It provides a page writeback implementation for fuse, which is
    completely safe against VM-related deadlocks.  Performance may not be very
    good for certain usage patterns, but generally it should be acceptable.
    
    It has been tested extensively with fsx-linux and bash-shared-mapping.
    
    Fuse page writeback design
    --------------------------
    
    fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
    It copies the contents of the original page, and queues a WRITE request to the
    userspace filesystem using this temp page.
    
    The writeback is finished instantly from the MM's point of view: the page is
    removed from the radix trees, and the PageDirty and PageWriteback flags are
    cleared.
    
    For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
    incremented.  The per-bdi writeback count is not decremented until the actual
    write completes.
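
    The copy-and-complete scheme described above can be modelled in plain
    userspace C.  This is only a sketch, not the fuse code itself: the
    model_* names are hypothetical, malloc() stands in for the
    GFP_NOFS|__GFP_HIGHMEM page allocation, and a single global stands in for
    the NR_WRITEBACK_TEMP counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096	/* assumed page size for this model */

/* Hypothetical stand-in for a page cache page (not the kernel's struct page). */
struct model_page {
	unsigned char data[MODEL_PAGE_SIZE];
	bool dirty;
	bool writeback;
};

/* Hypothetical stand-in for a WRITE request queued to the userspace filesystem. */
struct model_write_req {
	unsigned char *tmp;	/* private copy of the page contents */
	bool in_flight;
};

/* Models the NR_WRITEBACK_TEMP counter. */
static long model_nr_writeback_temp;

/*
 * Copy the page into a temporary buffer and "queue" the request.
 * The original page is marked clean immediately, so nothing that
 * waits on its flags depends on the userspace filesystem.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated
 * (the page is then left dirty).
 */
static int model_writepage(struct model_page *page, struct model_write_req *req)
{
	assert(page && req);

	req->tmp = malloc(MODEL_PAGE_SIZE);
	if (!req->tmp)
		return -1;

	memcpy(req->tmp, page->data, MODEL_PAGE_SIZE);
	req->in_flight = true;
	model_nr_writeback_temp++;

	/* Writeback is finished from the point of view of the original page. */
	page->dirty = false;
	page->writeback = false;
	return 0;
}

/* Called when the userspace filesystem has answered the WRITE request. */
static void model_write_end(struct model_write_req *req)
{
	assert(req && req->in_flight);

	free(req->tmp);
	req->tmp = NULL;
	req->in_flight = false;
	model_nr_writeback_temp--;
}
```

    In this model a later change to the original page leaves the in-flight
    copy untouched, which is the property that lets the original page be
    treated as clean right away.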
    
    Before a page is dirtied again, fuse waits for any previous write of that
    page to finish.  This ensures that at most one temporary page is in use at
    a time for each cached page.
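
    The one-temporary-page-per-cached-page rule can be illustrated with a
    small single-threaded model.  This is a sketch under assumptions: the
    model_* names and the fixed-size array are hypothetical, and where the
    kernel would put the dirtying task to sleep, the model merely reports
    that the page may not be dirtied yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_INFLIGHT 8	/* arbitrary limit for this model */

/* Hypothetical per-inode state: page indices with a temporary copy in flight. */
struct model_inode {
	unsigned long inflight[MODEL_MAX_INFLIGHT];
	size_t nr_inflight;
};

/* Is a write of the page at 'index' still outstanding? */
static bool model_page_is_writeback(const struct model_inode *mi,
				    unsigned long index)
{
	assert(mi);
	for (size_t i = 0; i < mi->nr_inflight; i++)
		if (mi->inflight[i] == index)
			return true;
	return false;
}

/*
 * Record a new in-flight write.  Refuses (returns false) if the page
 * already has one, which keeps at most one temporary copy per page.
 */
static bool model_write_start(struct model_inode *mi, unsigned long index)
{
	assert(mi);
	if (mi->nr_inflight == MODEL_MAX_INFLIGHT ||
	    model_page_is_writeback(mi, index))
		return false;
	mi->inflight[mi->nr_inflight++] = index;
	return true;
}

/* The userspace filesystem completed the write of page 'index'. */
static void model_write_end(struct model_inode *mi, unsigned long index)
{
	assert(mi);
	for (size_t i = 0; i < mi->nr_inflight; i++) {
		if (mi->inflight[i] == index) {
			/* Fill the hole with the last entry. */
			mi->inflight[i] = mi->inflight[--mi->nr_inflight];
			return;
		}
	}
}

/*
 * May the page be dirtied right now?  In the kernel the task would
 * sleep until this becomes true; the model only reports the state.
 */
static bool model_may_dirty(const struct model_inode *mi, unsigned long index)
{
	return !model_page_is_writeback(mi, index);
}
```

    Other pages of the same file are unaffected: only the page whose write is
    still outstanding has to wait.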
    
    This approach is wasteful in both memory and CPU bandwidth, so why is this
    complication needed?
    
    The basic problem is that there can be no guarantee about the time in which
    the userspace filesystem will complete a write.  It may be buggy or even
    malicious, and fail to complete WRITE requests.  We don't want unrelated parts
    of the system to grind to a halt in such cases.
    
    Also, a filesystem may need additional resources (particularly memory) to
    complete a WRITE request.  There is a great danger of deadlock if that
    allocation can end up waiting for the writepage to finish.
    
    Currently there are several cases where the kernel can block on page
    writeback:
    
      - allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
      - page migration
      - throttle_vm_writeout (through NR_WRITEBACK)
      - sync(2)
    
    Of course, in some cases (fsync, msync) we explicitly want to allow blocking.
    So for these cases new code has to be added to fuse, since the VM is not
    tracking writeback pages for us any more.
    
    As an extra safety measure, the maximum dirty ratio allocated to a single
    fuse filesystem is set to 1% by default.  This way one (or several) buggy or
    malicious fuse filesystems cannot slow down the rest of the system by hogging
    dirty memory.
    
    With appropriate privileges, this limit can be raised through
    '/sys/class/bdi/<bdi>/max_ratio'.
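
    As an illustration, a privileged helper could read and update that file
    with ordinary stdio calls.  This is a minimal sketch: the bdi_* function
    names are hypothetical, the caller supplies the full path (the <bdi>
    component is not filled in here), and the 0..100 range check assumes the
    value is a whole percentage.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read an integer percentage from 'path'.
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
static int bdi_read_ratio(const char *path)
{
	assert(path);

	FILE *f = fopen(path, "r");
	int ratio;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &ratio) != 1)
		ratio = -1;
	fclose(f);
	return ratio;
}

/*
 * Write 'ratio' to 'path' as a decimal number.
 * Returns 0 on success, -1 on a bad value or I/O error.
 * Assumption: valid values are whole percentages from 0 to 100.
 */
static int bdi_write_ratio(const char *path, int ratio)
{
	assert(path);

	if (ratio < 0 || ratio > 100)
		return -1;

	FILE *f = fopen(path, "w");

	if (!f)
		return -1;

	int ok = fprintf(f, "%d\n", ratio) > 0;

	/* Buffered write errors may only show up at close time. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

    A caller would pass the max_ratio path of the fuse filesystem's bdi and
    check the return value, since the write fails without sufficient
    privileges.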
    
    Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>