1. 08 Apr, 2017 6 commits
    • Darrick J. Wong's avatar
      xfs: fix toctou race when locking an inode to access the data map · 8059f061
      Darrick J. Wong authored
      commit 4b5bd5bf upstream.
      
      We use di_format and if_flags to decide whether we're grabbing the ilock
      in btree mode (btree extents not loaded) or shared mode (anything else),
      but the state of those fields can be changed by other threads that are
      also trying to load the btree extents -- IFEXTENTS gets set before the
      _bmap_read_extents call and cleared if it fails.
      
      We don't actually need to have IFEXTENTS set until after the bmbt
      records are successfully loaded and validated, which will fix the race
      between multiple threads trying to read the same directory.  The next
      patch strengthens directory bmbt validation by refusing to open the
      directory if reading the bmbt to start directory readahead fails.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8059f061
    • Brian Foster's avatar
      xfs: fix eofblocks race with file extending async dio writes · 02577091
      Brian Foster authored
      commit e4229d6b upstream.
      
      It's possible for post-eof blocks to end up being used for direct I/O
      writes. dio write performs an upfront unwritten extent allocation, sends
      the dio and then updates the inode size (if necessary) on write
      completion. If a file release occurs while a file extending dio write is
      in flight, it is possible to mistake the post-eof blocks for speculative
      preallocation and incorrectly truncate them from the inode. This means
      that the resulting dio write completion can discover a hole and allocate
      new blocks rather than perform unwritten extent conversion.
      
      This requires a strange mix of I/O and is thus not likely to reproduce
      in real world workloads. It is intermittently reproduced by generic/299.
      The error manifests as an assert failure due to transaction overrun
      because the aforementioned write completion transaction has only
      reserved enough blocks for btree operations:
      
        XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, \
         file: fs/xfs//xfs_trans.c, line: 309
      
      The root cause is that xfs_free_eofblocks() uses i_size to truncate
      post-eof blocks from the inode, but async, file extending direct writes
      do not update i_size until write completion, long after inode locks are
      dropped. Therefore, xfs_free_eofblocks() effectively truncates the inode
      to the incorrect size.
      
      Update xfs_free_eofblocks() to serialize against dio similar to how
      extending writes are serialized against i_size updates before post-eof
      block zeroing. Specifically, wait on dio while under the iolock. This
      ensures that dio write completions have updated i_size before post-eof
      blocks are processed.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      02577091
    • Brian Foster's avatar
      xfs: sync eofblocks scans under iolock are livelock prone · 696bfc8e
      Brian Foster authored
      commit c3155097 upstream.
      
      The xfs_eofblocks.eof_scan_owner field is an internal field to
      facilitate invoking eofb scans from the kernel while under the iolock.
      This is necessary because the eofb scan acquires the iolock of each
      inode. Synchronous scans are invoked on certain buffered write failures
      while under iolock. In such cases, the scan owner indicates that the
      context for the scan already owns the particular iolock and prevents a
      double lock deadlock.
      
      eofblocks scans while under iolock are still livelock prone in the event
      of multiple parallel scans, however. If multiple buffered writes to
      different inodes fail and invoke eofblocks scans at the same time, each
      scan avoids a deadlock with its own inode by virtue of the
      eof_scan_owner field, but will never be able to acquire the iolock of
      the inode from the parallel scan. Because the low free space scans are
      invoked with SYNC_WAIT, the scan will not return until it has processed
      every tagged inode and thus both scans will spin indefinitely on the
      iolock being held across the opposite scan. This problem can be
      reproduced reliably by generic/224 on systems with higher cpu counts
      (x16).
      
      To avoid this problem, simplify the semantics of eofblocks scans to
      never invoke a scan while under iolock. This means that the buffered
      write context must drop the iolock before the scan. It must reacquire
      the lock before the write retry and also repeat the initial write
      checks, as the original state might no longer be valid once the iolock
      was dropped.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      696bfc8e
    • Brian Foster's avatar
      xfs: pull up iolock from xfs_free_eofblocks() · ff4ea420
      Brian Foster authored
      commit a36b9261 upstream.
      
      xfs_free_eofblocks() requires the IOLOCK_EXCL lock, but is called from
      different contexts where the lock may or may not be held. The
      need_iolock parameter exists for this reason, to indicate whether
      xfs_free_eofblocks() must acquire the iolock itself before it can
      proceed.
      
      This is ugly and confusing. Simplify the semantics of
      xfs_free_eofblocks() to require the caller to acquire the iolock
      appropriately and kill the need_iolock parameter. While here, the mp
      param can be removed as well as the xfs_mount is accessible from the
      xfs_inode structure. This patch does not change behavior.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ff4ea420
    • Ladi Prosek's avatar
      KVM: nVMX: fix nested EPT detection · 3eb24329
      Ladi Prosek authored
      commit 7ad658b6 upstream.
      
      The nested_ept_enabled flag introduced in commit 7ca29de2 was not
      computed correctly. We are interested only in L1's EPT state, not the
      the combined L0+L1 value.
      
      In particular, if L0 uses EPT but L1 does not, nested_ept_enabled must
      be false to make sure that PDPSTRs are loaded based on CR3 as usual,
      because the special case described in 26.3.2.4 Loading Page-Directory-
      Pointer-Table Entries does not apply.
      
      Fixes: 7ca29de2 ("KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT")
      Cc: qemu-stable@nongnu.org
      Reported-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarLadi Prosek <lprosek@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3eb24329
    • Ilya Dryomov's avatar
      libceph: force GFP_NOIO for socket allocations · 8a7eb087
      Ilya Dryomov authored
      commit 633ee407 upstream.
      
      sock_alloc_inode() allocates socket+inode and socket_wq with
      GFP_KERNEL, which is not allowed on the writeback path:
      
          Workqueue: ceph-msgr con_work [libceph]
          ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
          0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
          ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
          Call Trace:
          [<ffffffff816dd629>] schedule+0x29/0x70
          [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
          [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
          [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
          [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
          [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
          [<ffffffff81086335>] flush_work+0x165/0x250
          [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
          [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
          [<ffffffff816d6b42>] ? __slab_free+0xee/0x234
          [<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
          [<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30
          [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
          [<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs]
          [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
          [<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
          [<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40
          [<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs]
          [<ffffffffa039ac07>] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs]
          [<ffffffffa039bb13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
          [<ffffffffa03ab745>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
          [<ffffffff811c0c18>] super_cache_scan+0x178/0x180
          [<ffffffff8115912e>] shrink_slab_node+0x14e/0x340
          [<ffffffff811afc3b>] ? mem_cgroup_iter+0x16b/0x450
          [<ffffffff8115af70>] shrink_slab+0x100/0x140
          [<ffffffff8115e425>] do_try_to_free_pages+0x335/0x490
          [<ffffffff8115e7f9>] try_to_free_pages+0xb9/0x1f0
          [<ffffffff816d56e4>] ? __alloc_pages_direct_compact+0x69/0x1be
          [<ffffffff81150cba>] __alloc_pages_nodemask+0x69a/0xb40
          [<ffffffff8119743e>] alloc_pages_current+0x9e/0x110
          [<ffffffff811a0ac5>] new_slab+0x2c5/0x390
          [<ffffffff816d71c4>] __slab_alloc+0x33b/0x459
          [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
          [<ffffffff8164bda1>] ? inet_sendmsg+0x71/0xc0
          [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
          [<ffffffff811a21f2>] kmem_cache_alloc+0x1a2/0x1b0
          [<ffffffff815b906d>] sock_alloc_inode+0x2d/0xd0
          [<ffffffff811d8566>] alloc_inode+0x26/0xa0
          [<ffffffff811da04a>] new_inode_pseudo+0x1a/0x70
          [<ffffffff815b933e>] sock_alloc+0x1e/0x80
          [<ffffffff815ba855>] __sock_create+0x95/0x220
          [<ffffffff815baa04>] sock_create_kern+0x24/0x30
          [<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph]
          [<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd]
          [<ffffffff81084c19>] process_one_work+0x159/0x4f0
          [<ffffffff8108561b>] worker_thread+0x11b/0x530
          [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
          [<ffffffff8108b6f9>] kthread+0xc9/0xe0
          [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
          [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
          [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
      
      Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here.
      
      Link: http://tracker.ceph.com/issues/19309Reported-by: default avatarSergey Jerusalimov <wintchester@gmail.com>
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: default avatarJeff Layton <jlayton@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8a7eb087
  2. 31 Mar, 2017 18 commits
  3. 30 Mar, 2017 16 commits