• Nick Piggin's avatar
    mm: close page_mkwrite races · b827e496
    Nick Piggin authored
    Change page_mkwrite to allow implementations to return with the page
    locked, and also change it's callers (in page fault paths) to hold the
    lock until the page is marked dirty.  This allows the filesystem to have
    full control of page dirtying events coming from the VM.
    
    Rather than simply hold the page locked over the page_mkwrite call, we
    call page_mkwrite with the page unlocked and allow callers to return with
    it locked, so filesystems can avoid LOR conditions with page lock.
    
    The problem with the current scheme is this: a filesystem that wants to
    associate some metadata with a page as long as the page is dirty, will
    perform this manipulation in its ->page_mkwrite.  It currently then must
    return with the page unlocked and may not hold any other locks (according
    to existing page_mkwrite convention).
    
    In this window, the VM could write out the page, clearing page-dirty.  The
    filesystem has no good way to detect that a dirty pte is about to be
    attached, so it will happily write out the page, at which point, the
    filesystem may manipulate the metadata to reflect that the page is no
    longer dirty.
    
    It is not always possible to perform the required metadata manipulation in
    ->set_page_dirty, because that function cannot block or fail.  The
    filesystem may need to allocate some data structure, for example.
    
    And the VM cannot mark the pte dirty before page_mkwrite, because
    page_mkwrite is allowed to fail, so we must not allow any window where the
    page could be written to if page_mkwrite does fail.
    
    This solution of holding the page locked over the 3 critical operations
    (page_mkwrite, setting the pte dirty, and finally setting the page dirty)
    closes out races nicely, preventing page cleaning for writeout being
    initiated in that window.  This provides the filesystem with a strong
    synchronisation against the VM here.
    
    - Sage needs this race closed for ceph filesystem.
    - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
    - I need it for fsblock.
    - I suspect other filesystems may need it too (eg. btrfs).
    - I have converted buffer.c to the new locking. Even simple block allocation
      under dirty pages might be susceptible to i_size changing under partial page
      at the end of file (we also have a buffer.c-side problem here, but it cannot
      be fixed properly without this patch).
    - Other filesystems (eg. NFS, maybe btrfs) will need to change their
      page_mkwrite functions themselves.
    
    [ This also moves page_mkwrite another step closer to fault, which should
      eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
      filesystem calldown and page lock/unlock cycle in __do_fault. ]
    
    [akpm@linux-foundation.org: fix derefs of NULL ->mapping]
    Cc: Sage Weil <sage@newdream.net>
    Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
    Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
    Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
    Cc: <stable@kernel.org>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    b827e496
buffer.c 88.7 KB