    nfs/localio: use dedicated workqueues for filesystem read and write · b9f5dd57
    Trond Myklebust authored
    For localio access, don't call filesystem read() and write() routines
    directly.  This solves two problems:
    
    1) localio writes need to use a normal (non-memreclaim) unbound
       workqueue.  This avoids imposing new requirements on how underlying
       filesystems process frontend IO, which would otherwise require a
       large amount of work to update all filesystems.  Without this
       change, when XFS starts getting low on space, XFS flushes work on
       a non-memreclaim workqueue while running from the WQ_MEM_RECLAIM
       writeback workqueue, which causes a priority inversion problem:
    
    00573 workqueue: WQ_MEM_RECLAIM writeback:wb_workfn is flushing !WQ_MEM_RECLAIM xfs-sync/vdc:xfs_flush_inodes_worker
    00573 WARNING: CPU: 6 PID: 8525 at kernel/workqueue.c:3706 check_flush_dependency+0x2a4/0x328
    00573 Modules linked in:
    00573 CPU: 6 PID: 8525 Comm: kworker/u71:5 Not tainted 6.10.0-rc3-ktest-00032-g2b0a133403ab #18502
    00573 Hardware name: linux,dummy-virt (DT)
    00573 Workqueue: writeback wb_workfn (flush-0:33)
    00573 pstate: 400010c5 (nZcv daIF -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
    00573 pc : check_flush_dependency+0x2a4/0x328
    00573 lr : check_flush_dependency+0x2a4/0x328
    00573 sp : ffff0000c5f06bb0
    00573 x29: ffff0000c5f06bb0 x28: ffff0000c998a908 x27: 1fffe00019331521
    00573 x26: ffff0000d0620900 x25: ffff0000c5f06ca0 x24: ffff8000828848c0
    00573 x23: 1fffe00018be0d8e x22: ffff0000c1210000 x21: ffff0000c75fde00
    00573 x20: ffff800080bfd258 x19: ffff0000cad63400 x18: ffff0000cd3a4810
    00573 x17: 0000000000000000 x16: 0000000000000000 x15: ffff800080508d98
    00573 x14: 0000000000000000 x13: 204d49414c434552 x12: 1fffe0001b6eeab2
    00573 x11: ffff60001b6eeab2 x10: dfff800000000000 x9 : ffff60001b6eeab3
    00573 x8 : 0000000000000001 x7 : 00009fffe491154e x6 : ffff0000db775593
    00573 x5 : ffff0000db775590 x4 : ffff0000db775590 x3 : 0000000000000000
    00573 x2 : 0000000000000027 x1 : ffff600018be0d62 x0 : dfff800000000000
    00573 Call trace:
    00573  check_flush_dependency+0x2a4/0x328
    00573  __flush_work+0x184/0x5c8
    00573  flush_work+0x18/0x28
    00573  xfs_flush_inodes+0x68/0x88
    00573  xfs_file_buffered_write+0x128/0x6f0
    00573  xfs_file_write_iter+0x358/0x448
    00573  nfs_local_doio+0x854/0x1568
    00573  nfs_initiate_pgio+0x214/0x418
    00573  nfs_generic_pg_pgios+0x304/0x480
    00573  nfs_pageio_doio+0xe8/0x240
    00573  nfs_pageio_complete+0x160/0x480
    00573  nfs_writepages+0x300/0x4f0
    00573  do_writepages+0x12c/0x4a0
    00573  __writeback_single_inode+0xd4/0xa68
    00573  writeback_sb_inodes+0x470/0xcb0
    00573  __writeback_inodes_wb+0xb0/0x1d0
    00573  wb_writeback+0x594/0x808
    00573  wb_workfn+0x5e8/0x9e0
    00573  process_scheduled_works+0x53c/0xd90
    00573  worker_thread+0x370/0x8c8
    00573  kthread+0x258/0x2e8
    00573  ret_from_fork+0x10/0x20
    
    2) Some filesystem writeback routines can end up taking up a lot of
       stack space (particularly XFS).  Instead of risking a stack
       overflow due to the extra depth added by the NFS stack, we should
       just call these routines from a workqueue job.  Since we need to
       do this to address 1) above, we avoid possibly blowing the stack
       "for free".
    
    Use of dedicated workqueues improves performance over using the
    system_unbound_wq.
    
    Also, the creds used to open the file are passed to override_creds()
    in both nfs_local_call_read() and nfs_local_call_write() -- otherwise
    the workqueue worker could run with elevated capabilities that the
    caller may not have.
    
    Lastly, care is taken to set PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO in
    nfs_do_local_write() to avoid writeback deadlocks.
    
    The PF_LOCAL_THROTTLE flag prevents deadlocks in balance_dirty_pages()
    by causing writes to only be throttled against other writes to the
    same bdi (it keeps the throttling local).  Normally all writes to
    bdi(s) are throttled equally (after throughput factors are allowed
    for).
    
    The PF_MEMALLOC_NOIO flag prevents the lower filesystem IO from
    causing memory reclaim to re-enter filesystems or IO devices and so
    prevents deadlocks from occurring where IO that cleans pages is
    waiting on IO to complete.
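
The flag handling described above follows a set-then-restore pattern: the two flags are set on the current task for the duration of the write, and afterwards only those two bits are put back to their previous state. The following is a minimal userspace sketch of that pattern, not the kernel code itself. `struct task`, `restore_flags()` and `do_local_write()` are hypothetical stand-ins for the kernel's task flags word and helpers, and the flag values are illustrative; the kernel defines the real ones in its scheduler headers.

```c
#include <assert.h>

/* Illustrative bit values only; the kernel's own headers are authoritative. */
#define PF_MEMALLOC_NOIO  0x00080000u
#define PF_LOCAL_THROTTLE 0x00100000u

/* Hypothetical stand-in for the per-task flags word. */
struct task {
	unsigned int flags;
};

/* Put the bits selected by mask back to the state they had in orig. */
void restore_flags(struct task *t, unsigned int orig, unsigned int mask)
{
	t->flags &= ~mask;
	t->flags |= orig & mask;
}

/*
 * Model of a write that runs with both flags set.  Returns the flags
 * word that the lower filesystem write would have observed.
 */
unsigned int do_local_write(struct task *t)
{
	const unsigned int mask = PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
	unsigned int old_flags = t->flags;
	unsigned int seen;

	t->flags |= mask;
	assert((t->flags & mask) == mask);

	seen = t->flags;	/* the filesystem write would happen here */

	/* Restore only the two bits touched above; leave the rest alone. */
	restore_flags(t, old_flags, mask);
	return seen;
}
```

Restoring by mask rather than overwriting the whole word matters when a flag was already set on entry: a task that came in with PF_MEMALLOC_NOIO set still has it set afterwards, while PF_LOCAL_THROTTLE is cleared again.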

    Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
    Co-developed-by: Mike Snitzer <snitzer@kernel.org>
    Signed-off-by: Mike Snitzer <snitzer@kernel.org>
    Co-developed-by: NeilBrown <neilb@suse.de>
    Signed-off-by: NeilBrown <neilb@suse.de> # eliminated wait_for_completion
    Reviewed-by: Jeff Layton <jlayton@kernel.org>
    Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>