1. 08 Apr, 2017 12 commits
    • xfs: mark speculative prealloc CoW fork extents unwritten · b0f88f0d
      Darrick J. Wong authored
      commit 5eda4300 upstream.
      
      Christoph Hellwig pointed out that there's a potentially nasty race when
      performing simultaneous nearby directio cow writes:
      
      "Thread 1 writes a range from B to c
      
      "                    B --------- C
                                 p
      
      "a little later thread 2 writes from A to B
      
      "        A --------- B
                     p
      
      [editor's note: the 'p' marks denote cowextsize boundaries, which I added
      to make this clearer]
      
      "but the code preallocates beyond B into the range where thread
      "1 has just written, but ->end_io hasn't been called yet.
      "But once ->end_io is called thread 2 has already allocated
      "up to the extent size hint into the write range of thread 1,
      "so the end_io handler will splice the unintialized blocks from
      "that preallocation back into the file right after B."
      
      We can avoid this race by ensuring that thread 1's end_io handler cannot
      accidentally remap the blocks that thread 2 allocated (as part of
      speculative preallocation) for its own write.  The way we make this
      happen is by taking advantage of the unwritten extent flag as an
      intermediate step.
      
      Recall that when we begin the process of writing data to shared blocks,
      we create a delayed allocation extent in the CoW fork:
      
      D: --RRRRRRSSSRRRRRRRR---
      C: ------DDDDDDD---------
      
      When a thread prepares to CoW some dirty data out to disk, it will now
      convert the delalloc reservation into an /unwritten/ allocated extent in
      the cow fork.  The da conversion code tries to opportunistically
      allocate as much of a (speculatively prealloc'd) extent as possible, so
      we may end up allocating a larger extent than we're actually writing
      out:
      
      D: --RRRRRRSSSRRRRRRRR---
      C: ------UUUUUUU---------
      
      Next, we convert only the part of the extent that we're actively
      planning to write to normal (i.e. not unwritten) status:
      
      D: --RRRRRRSSSRRRRRRRR---
      C: ------UURRUUU---------
      
      If the write succeeds, the end_cow function will now scan the relevant
      range of the CoW fork for real extents and remap only the real extents
      into the data fork:
      
      D: --RRRRRRRRSRRRRRRRR---
      C: ------UU--UUU---------
      
      This ensures that we never obliterate valid data fork extents with
      unwritten blocks from the CoW fork.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
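      [editor's note: below is a minimal, self-contained userspace sketch of the idea
      above — the end_cow step remaps only the real (written) CoW fork extents and
      leaves the still-unwritten speculative preallocation alone.  The type and
      function names are made up for illustration; this is not the actual XFS code.]

        #include <stdio.h>

        /* hypothetical model of a CoW-fork extent: either unwritten or real */
        enum cow_state { COW_UNWRITTEN, COW_REAL };

        struct cow_extent {
                long start;             /* first block covered           */
                long len;               /* number of blocks              */
                enum cow_state state;   /* only COW_REAL may be remapped */
        };

        /* model of the end_cow step: remap only real extents into the data fork */
        static void end_cow(const struct cow_extent *cow, int n)
        {
                for (int i = 0; i < n; i++) {
                        if (cow[i].state != COW_REAL) {
                                /* still-unwritten speculative preallocation: leave it */
                                printf("skip  blocks %ld-%ld (unwritten)\n",
                                       cow[i].start, cow[i].start + cow[i].len - 1);
                                continue;
                        }
                        printf("remap blocks %ld-%ld into the data fork\n",
                               cow[i].start, cow[i].start + cow[i].len - 1);
                }
        }

        int main(void)
        {
                /* mirrors the D/C diagrams above: only the middle range was written */
                struct cow_extent cow[] = {
                        {  6, 2, COW_UNWRITTEN },
                        {  8, 2, COW_REAL },
                        { 10, 3, COW_UNWRITTEN },
                };

                end_cow(cow, 3);
                return 0;
        }
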
    • xfs: allow unwritten extents in the CoW fork · a0c46fae
      Darrick J. Wong authored
      commit 05a630d7 upstream.
      
      In the data fork, we only allow extents to perform the following state
      transitions:
      
      delay -> real <-> unwritten
      
      There's no way to move directly from a delalloc reservation to an
      /unwritten/ allocated extent.  However, for the CoW fork we want to be
      able to do the following to each extent:
      
      delalloc -> unwritten -> written -> remapped to data fork
      
      This will help us to avoid a race in the speculative CoW preallocation
      code between a first thread that is allocating a CoW extent and a second
      thread that is remapping part of a file after a write.  In order to do
      this, however, we need two things: first, we have to be able to
      transition from delalloc to unwritten, and second, the function that
      converts between real and unwritten has to be made aware of the CoW
      fork.  Do both of those things.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
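      [editor's note: a small, hypothetical sketch of the transition rules described
      above — the data fork only allows delalloc -> real and real <-> unwritten, while
      the CoW fork additionally allows delalloc -> unwritten.  The enum and function
      names are illustrative, not the XFS implementation.]

        #include <stdbool.h>
        #include <stdio.h>

        enum ext_state { EXT_DELALLOC, EXT_UNWRITTEN, EXT_REAL };
        enum which_fork { DATA_FORK, COW_FORK };

        /* transition rules as described above (illustrative, not the XFS code) */
        static bool transition_ok(enum which_fork fork,
                                  enum ext_state from, enum ext_state to)
        {
                switch (from) {
                case EXT_DELALLOC:
                        /* data fork: delay -> real only; CoW fork: also -> unwritten */
                        return to == EXT_REAL ||
                               (fork == COW_FORK && to == EXT_UNWRITTEN);
                case EXT_UNWRITTEN:
                case EXT_REAL:
                        /* real <-> unwritten is allowed in both forks */
                        return to == EXT_REAL || to == EXT_UNWRITTEN;
                }
                return false;
        }

        int main(void)
        {
                printf("data fork delalloc -> unwritten: %s\n",
                       transition_ok(DATA_FORK, EXT_DELALLOC, EXT_UNWRITTEN)
                       ? "allowed" : "rejected");
                printf("CoW fork  delalloc -> unwritten: %s\n",
                       transition_ok(COW_FORK, EXT_DELALLOC, EXT_UNWRITTEN)
                       ? "allowed" : "rejected");
                return 0;
        }
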
    • xfs: verify free block header fields · 1dc0e72c
      Darrick J. Wong authored
      commit de14c5f5 upstream.
      
      Perform basic sanity checking of the directory free block header
      fields so that we avoid hanging the system on invalid data.
      
      (Granted, that just means that we now shut down on a directory write,
      but that seems better than hanging...)
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs: check for obviously bad level values in the bmbt root · 58565508
      Darrick J. Wong authored
      commit b3bf607d upstream.
      
      We can't handle a bmbt that's taller than BTREE_MAXLEVELS, and there's
      no such thing as a zero-level bmbt (for that we have extents format),
      so if we see this, send back an error code.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
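      [editor's note: a hypothetical sketch of the check described above — a bmbt root
      level of zero (that would be extents format) or above the btree depth limit is
      rejected as corruption.  BMBT_MAXLEVELS here is an illustrative stand-in for the
      kernel's real BTREE_MAXLEVELS limit.]

        #include <stdio.h>

        /* illustrative value; the real limit is the kernel's BTREE_MAXLEVELS */
        #define BMBT_MAXLEVELS 9

        /* return 0 if the on-disk bmbt root level is sane, -1 if it is corrupt */
        static int check_bmbt_root_level(unsigned int level)
        {
                /* level 0 would be extents format, not a btree; too-deep trees are bogus */
                if (level == 0 || level > BMBT_MAXLEVELS)
                        return -1;
                return 0;
        }

        int main(void)
        {
                unsigned int levels[] = { 0, 1, 5, 37 };

                for (int i = 0; i < 4; i++)
                        printf("level %2u -> %s\n", levels[i],
                               check_bmbt_root_level(levels[i]) ? "corrupt" : "ok");
                return 0;
        }
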
    • xfs: filter out obviously bad btree pointers · 2b9dcb94
      Darrick J. Wong authored
      commit d5a91bae upstream.
      
      Don't let anybody load an obviously bad btree pointer.  Since the values
      come from disk, we must return an error, not just ASSERT.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs: fail _dir_open when readahead fails · cb308466
      Darrick J. Wong authored
      commit 7a652bbe upstream.
      
      When we open a directory, we try to readahead block 0 of the directory
      on the assumption that we're going to need it soon.  If the bmbt is
      corrupt, the directory will never be usable and the readahead fails
      immediately, so we might as well prevent the directory from being opened
      at all.  This prevents a subsequent read or modify operation from
      hitting it and taking the fs offline.
      
      NOTE: We're only checking for early failures in the block mapping, not
      the readahead directory block itself.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs: fix toctou race when locking an inode to access the data map · 8059f061
      Darrick J. Wong authored
      commit 4b5bd5bf upstream.
      
      We use di_format and if_flags to decide whether we're grabbing the ilock
      in btree mode (btree extents not loaded) or shared mode (anything else),
      but the state of those fields can be changed by other threads that are
      also trying to load the btree extents -- IFEXTENTS gets set before the
      _bmap_read_extents call and cleared if it fails.
      
      We don't actually need to have IFEXTENTS set until after the bmbt
      records are successfully loaded and validated, which will fix the race
      between multiple threads trying to read the same directory.  The next
      patch strengthens directory bmbt validation by refusing to open the
      directory if reading the bmbt to start directory readahead fails.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
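      [editor's note: a toy userspace model of the ordering fix above — the
      "extents loaded" flag (IFEXTENTS in the commit) is published only after the
      records have been read and validated, so a racing reader can never see the flag
      set without the data behind it.  All names here are illustrative.]

        #include <stdbool.h>
        #include <stdio.h>

        struct fork_model {
                bool extents_loaded;   /* models the IFEXTENTS flag  */
                int  nextents;         /* models the in-core records */
        };

        /* pretend to read and validate the bmbt records from disk; may fail */
        static int read_extents_from_disk(struct fork_model *f, bool disk_ok)
        {
                if (!disk_ok)
                        return -1;      /* corrupt bmbt */
                f->nextents = 42;
                return 0;
        }

        /* fixed ordering: validate first, publish the flag last */
        static int load_extents(struct fork_model *f, bool disk_ok)
        {
                int error = read_extents_from_disk(f, disk_ok);

                if (error)
                        return error;           /* flag never set on failure     */
                f->extents_loaded = true;       /* only now may readers trust it */
                return 0;
        }

        int main(void)
        {
                struct fork_model f = { 0 };

                printf("corrupt disk: %d, loaded=%d\n", load_extents(&f, false), f.extents_loaded);
                printf("good disk:    %d, loaded=%d\n", load_extents(&f, true), f.extents_loaded);
                return 0;
        }
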
    • xfs: fix eofblocks race with file extending async dio writes · 02577091
      Brian Foster authored
      commit e4229d6b upstream.
      
      It's possible for post-eof blocks to end up being used for direct I/O
      writes. dio write performs an upfront unwritten extent allocation, sends
      the dio and then updates the inode size (if necessary) on write
      completion. If a file release occurs while a file extending dio write is
      in flight, it is possible to mistake the post-eof blocks for speculative
      preallocation and incorrectly truncate them from the inode. This means
      that the resulting dio write completion can discover a hole and allocate
      new blocks rather than perform unwritten extent conversion.
      
      This requires a strange mix of I/O and is thus not likely to reproduce
      in real world workloads. It is intermittently reproduced by generic/299.
      The error manifests as an assert failure due to transaction overrun
      because the aforementioned write completion transaction has only
      reserved enough blocks for btree operations:
      
        XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, \
         file: fs/xfs//xfs_trans.c, line: 309
      
      The root cause is that xfs_free_eofblocks() uses i_size to truncate
      post-eof blocks from the inode, but async, file extending direct writes
      do not update i_size until write completion, long after inode locks are
      dropped. Therefore, xfs_free_eofblocks() effectively truncates the inode
      to the incorrect size.
      
      Update xfs_free_eofblocks() to serialize against dio similar to how
      extending writes are serialized against i_size updates before post-eof
      block zeroing. Specifically, wait on dio while under the iolock. This
      ensures that dio write completions have updated i_size before post-eof
      blocks are processed.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
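      [editor's note: a toy model of the fix described above — before trusting i_size
      to trim post-EOF blocks, drain any in-flight direct I/O so that a file-extending
      async dio has updated i_size first.  The struct and function names are made up;
      this is not the kernel code.]

        #include <stdio.h>

        struct inode_model {
                long i_size;      /* only updated when the dio completes     */
                int  dio_count;   /* models the in-flight direct I/O counter */
        };

        /* models "wait on dio": drain in-flight direct I/O before proceeding */
        static void dio_wait(struct inode_model *ip)
        {
                while (ip->dio_count > 0) {
                        /* a real implementation sleeps; here we just let the dio finish */
                        ip->dio_count--;
                        ip->i_size = 1048576;   /* extending write finally updates i_size */
                }
        }

        /* fixed free_eofblocks-style logic: wait for dio, then trust i_size */
        static void free_eofblocks(struct inode_model *ip)
        {
                dio_wait(ip);
                printf("trimming post-EOF blocks beyond i_size=%ld\n", ip->i_size);
        }

        int main(void)
        {
                /* file-extending async dio still in flight: i_size not updated yet */
                struct inode_model ip = { .i_size = 4096, .dio_count = 1 };

                free_eofblocks(&ip);    /* with the wait, this sees the final 1 MiB size */
                return 0;
        }
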
    • xfs: sync eofblocks scans under iolock are livelock prone · 696bfc8e
      Brian Foster authored
      commit c3155097 upstream.
      
      The xfs_eofblocks.eof_scan_owner field is an internal field to
      facilitate invoking eofb scans from the kernel while under the iolock.
      This is necessary because the eofb scan acquires the iolock of each
      inode. Synchronous scans are invoked on certain buffered write failures
      while under iolock. In such cases, the scan owner indicates that the
      context for the scan already owns the particular iolock and prevents a
      double lock deadlock.
      
      eofblocks scans while under iolock are still livelock prone in the event
      of multiple parallel scans, however. If multiple buffered writes to
      different inodes fail and invoke eofblocks scans at the same time, each
      scan avoids a deadlock with its own inode by virtue of the
      eof_scan_owner field, but will never be able to acquire the iolock of
      the inode from the parallel scan. Because the low free space scans are
      invoked with SYNC_WAIT, the scan will not return until it has processed
      every tagged inode and thus both scans will spin indefinitely on the
      iolock being held across the opposite scan. This problem can be
      reproduced reliably by generic/224 on systems with higher cpu counts
      (x16).
      
      To avoid this problem, simplify the semantics of eofblocks scans to
      never invoke a scan while under iolock. This means that the buffered
      write context must drop the iolock before the scan. It must reacquire
      the lock before the write retry and also repeat the initial write
      checks, as the original state might no longer be valid once the iolock
      was dropped.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
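      [editor's note: a small, illustrative model of the new retry pattern described
      above: drop the iolock, run the eofblocks scan with no iolock held, then
      reacquire the lock and repeat the initial write checks before retrying.  The
      names and the boolean "lock" are stand-ins, not the XFS code.]

        #include <stdbool.h>
        #include <stdio.h>

        static bool iolock_held;

        /* the scan takes each inode's iolock itself, so callers must not hold one */
        static void eofblocks_scan(void)
        {
                printf("eofblocks scan runs, caller holds iolock: %s\n",
                       iolock_held ? "yes (livelock prone)" : "no (safe)");
        }

        /* returns false to model a buffered write failing for lack of space */
        static bool write_checks_and_write(bool fail)
        {
                return !fail;
        }

        int main(void)
        {
                iolock_held = true;                     /* buffered write path       */
                bool ok = write_checks_and_write(true); /* first attempt fails       */
                iolock_held = false;                    /* 1) drop the iolock        */

                if (!ok) {
                        eofblocks_scan();               /* 2) scan with no lock held */
                        iolock_held = true;             /* 3) reacquire the iolock   */
                        /* 4) repeat the initial write checks; state may have changed */
                        ok = write_checks_and_write(false);
                        iolock_held = false;
                }
                return ok ? 0 : 1;
        }
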
    • xfs: pull up iolock from xfs_free_eofblocks() · ff4ea420
      Brian Foster authored
      commit a36b9261 upstream.
      
      xfs_free_eofblocks() requires the IOLOCK_EXCL lock, but is called from
      different contexts where the lock may or may not be held. The
      need_iolock parameter exists for this reason, to indicate whether
      xfs_free_eofblocks() must acquire the iolock itself before it can
      proceed.
      
      This is ugly and confusing. Simplify the semantics of
      xfs_free_eofblocks() to require the caller to acquire the iolock
      appropriately and kill the need_iolock parameter. While here, the mp
      param can be removed as well, since the xfs_mount is accessible from the
      xfs_inode structure. This patch does not change behavior.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: nVMX: fix nested EPT detection · 3eb24329
      Ladi Prosek authored
      commit 7ad658b6 upstream.
      
      The nested_ept_enabled flag introduced in commit 7ca29de2 was not
      computed correctly. We are interested only in L1's EPT state, not the
      combined L0+L1 value.
      
      In particular, if L0 uses EPT but L1 does not, nested_ept_enabled must
      be false to make sure that PDPTRs are loaded based on CR3 as usual,
      because the special case described in 26.3.2.4 Loading Page-Directory-
      Pointer-Table Entries does not apply.
      
      Fixes: 7ca29de2 ("KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT")
      Cc: qemu-stable@nongnu.org
      Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Ladi Prosek <lprosek@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
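      [editor's note: a hypothetical sketch of the distinction above — the nested EPT
      flag must reflect only what L1 programmed for L2, not whether L0 itself uses EPT.
      The struct, the control-bit value, and the function names are all illustrative,
      not the KVM code.]

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* illustrative bit for "enable EPT" in the secondary execution controls */
        #define SEC_EXEC_ENABLE_EPT (1u << 1)

        struct vmcs12_model {
                uint32_t secondary_vm_exec_control;   /* the controls L1 set up for L2 */
        };

        /* buggy: derived from what L0 itself runs with (the combined L0+L1 state) */
        static bool nested_ept_enabled_buggy(bool l0_uses_ept, const struct vmcs12_model *v)
        {
                (void)v;
                return l0_uses_ept;
        }

        /* fixed: only L1's own setting for the nested guest matters */
        static bool nested_ept_enabled_fixed(const struct vmcs12_model *v)
        {
                return v->secondary_vm_exec_control & SEC_EXEC_ENABLE_EPT;
        }

        int main(void)
        {
                struct vmcs12_model v = { .secondary_vm_exec_control = 0 };  /* L1: no EPT */
                bool l0_uses_ept = true;

                printf("buggy: %d  fixed: %d\n",
                       nested_ept_enabled_buggy(l0_uses_ept, &v),
                       nested_ept_enabled_fixed(&v));
                return 0;
        }
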
    • libceph: force GFP_NOIO for socket allocations · 8a7eb087
      Ilya Dryomov authored
      commit 633ee407 upstream.
      
      sock_alloc_inode() allocates socket+inode and socket_wq with
      GFP_KERNEL, which is not allowed on the writeback path:
      
          Workqueue: ceph-msgr con_work [libceph]
          ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
          0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
          ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
          Call Trace:
          [<ffffffff816dd629>] schedule+0x29/0x70
          [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
          [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
          [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
          [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
          [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
          [<ffffffff81086335>] flush_work+0x165/0x250
          [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
          [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
          [<ffffffff816d6b42>] ? __slab_free+0xee/0x234
          [<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
          [<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30
          [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
          [<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs]
          [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
          [<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
          [<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40
          [<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs]
          [<ffffffffa039ac07>] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs]
          [<ffffffffa039bb13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
          [<ffffffffa03ab745>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
          [<ffffffff811c0c18>] super_cache_scan+0x178/0x180
          [<ffffffff8115912e>] shrink_slab_node+0x14e/0x340
          [<ffffffff811afc3b>] ? mem_cgroup_iter+0x16b/0x450
          [<ffffffff8115af70>] shrink_slab+0x100/0x140
          [<ffffffff8115e425>] do_try_to_free_pages+0x335/0x490
          [<ffffffff8115e7f9>] try_to_free_pages+0xb9/0x1f0
          [<ffffffff816d56e4>] ? __alloc_pages_direct_compact+0x69/0x1be
          [<ffffffff81150cba>] __alloc_pages_nodemask+0x69a/0xb40
          [<ffffffff8119743e>] alloc_pages_current+0x9e/0x110
          [<ffffffff811a0ac5>] new_slab+0x2c5/0x390
          [<ffffffff816d71c4>] __slab_alloc+0x33b/0x459
          [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
          [<ffffffff8164bda1>] ? inet_sendmsg+0x71/0xc0
          [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
          [<ffffffff811a21f2>] kmem_cache_alloc+0x1a2/0x1b0
          [<ffffffff815b906d>] sock_alloc_inode+0x2d/0xd0
          [<ffffffff811d8566>] alloc_inode+0x26/0xa0
          [<ffffffff811da04a>] new_inode_pseudo+0x1a/0x70
          [<ffffffff815b933e>] sock_alloc+0x1e/0x80
          [<ffffffff815ba855>] __sock_create+0x95/0x220
          [<ffffffff815baa04>] sock_create_kern+0x24/0x30
          [<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph]
          [<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd]
          [<ffffffff81084c19>] process_one_work+0x159/0x4f0
          [<ffffffff8108561b>] worker_thread+0x11b/0x530
          [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
          [<ffffffff8108b6f9>] kthread+0xc9/0xe0
          [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
          [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
          [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
      
      Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here.
      
      Link: http://tracker.ceph.com/issues/19309
      Reported-by: Sergey Jerusalimov <wintchester@gmail.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
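      [editor's note: a toy model of the memalloc_noio_{save,restore}() scope mentioned
      above; in the kernel these helpers save and restore a per-task flag word, while
      here a simple bool stands in for it and the allocation is only simulated.]

        #include <stdbool.h>
        #include <stdio.h>

        /* models the per-task NOIO state that memalloc_noio_save()/restore() toggle */
        static bool task_noio;

        static bool noio_save(void)        { bool old = task_noio; task_noio = true; return old; }
        static void noio_restore(bool old) { task_noio = old; }

        /* stand-in for the socket + inode allocations done while connecting */
        static void create_socket(void)
        {
                printf("allocating socket memory with %s\n",
                       task_noio ? "GFP_NOIO semantics (no writeback recursion)"
                                 : "GFP_KERNEL semantics (may recurse into writeback)");
        }

        int main(void)
        {
                /* con_work()-style path: force NOIO around the whole socket setup */
                bool old = noio_save();
                create_socket();
                noio_restore(old);
                return 0;
        }
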
  2. 31 Mar, 2017 18 commits
  3. 30 Mar, 2017 10 commits
    • Linux 4.10.7 · 55db23d3
      Greg Kroah-Hartman authored
    • crypto: algif_hash - avoid zero-sized array · 0dad3de8
      Jiri Slaby authored
      commit 62071194 upstream.
      
      With this reproducer:
        struct sockaddr_alg alg = {
                .salg_family = 0x26,
                .salg_type = "hash",
                .salg_feat = 0xf,
                .salg_mask = 0x5,
                .salg_name = "digest_null",
        };
        int sock, sock2;
      
        sock = socket(AF_ALG, SOCK_SEQPACKET, 0);
        bind(sock, (struct sockaddr *)&alg, sizeof(alg));
        sock2 = accept(sock, NULL, NULL);
        setsockopt(sock, SOL_ALG, ALG_SET_KEY, "\x9b\xca", 2);
        accept(sock2, NULL, NULL);
      
      ==== 8< ======== 8< ======== 8< ======== 8< ====
      
      one can immediately see a UBSAN warning:
      UBSAN: Undefined behaviour in crypto/algif_hash.c:187:7
      variable length array bound value 0 <= 0
      CPU: 0 PID: 15949 Comm: syz-executor Tainted: G            E      4.4.30-0-default #1
      ...
      Call Trace:
      ...
       [<ffffffff81d598fd>] ? __ubsan_handle_vla_bound_not_positive+0x13d/0x188
       [<ffffffff81d597c0>] ? __ubsan_handle_out_of_bounds+0x1bc/0x1bc
       [<ffffffffa0e2204d>] ? hash_accept+0x5bd/0x7d0 [algif_hash]
       [<ffffffffa0e2293f>] ? hash_accept_nokey+0x3f/0x51 [algif_hash]
       [<ffffffffa0e206b0>] ? hash_accept_parent_nokey+0x4a0/0x4a0 [algif_hash]
       [<ffffffff8235c42b>] ? SyS_accept+0x2b/0x40
      
      It is a correct warning, as hash state is propagated to accept as zero,
      but creating a zero-length variable array is not allowed in C.
      
      Fix this as proposed by Herbert -- do "?: 1" on that site. No sizeof or
      similar happens in the code there, so we just allocate one byte even
      though we do not use the array.
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: "David S. Miller" <davem@davemloft.net> (maintainer:CRYPTO API)
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
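      [editor's note: the "?: 1" idiom from the fix, shown in a self-contained userspace
      program; statesize() below is a stand-in for whatever state-size helper the real
      code calls, and it returns 0 to mimic the unkeyed digest_null case.  The "?:"
      operator is a GNU C extension, as used throughout the kernel.]

        #include <stdio.h>

        /* stand-in for the state-size helper; returns 0 for an unkeyed digest_null */
        static unsigned int statesize(void)
        {
                return 0;
        }

        int main(void)
        {
                unsigned int len = statesize();

                /* the "?: 1" idiom: never declare a zero-length VLA, even if len is 0 */
                char state[len ?: 1];

                /* the extra byte is simply unused when len == 0, as described above */
                printf("declared a %zu-byte state buffer for len=%u\n", sizeof(state), len);
                return 0;
        }
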
    • fbcon: Fix vc attr at deinit · f9955dca
      Takashi Iwai authored
      commit 8aac7f34 upstream.
      
      fbcon can deal with vc_hi_font_mask (the upper 256 chars) and adjust
      the vc attrs dynamically when vc_hi_font_mask is changed at
      fbcon_init().  When the vc_hi_font_mask is set, it remaps the attrs in
      the existing console buffer with one bit shift up (for 9 bits), while
      it remaps with one bit shift down (for 8 bits) when the value is
      cleared.  It works fine as long as the font gets updated after fbcon
      was initialized.
      
      However, we hit a bizarre problem when the console is switched to
      another fb driver (typically from vesafb or efifb to drmfb).  At
      switching to the new fb driver, we temporarily rebind the console to
      the dummy console, then rebind to the new driver.  During the
      switching, we leave the modified attrs as is.  Thus, the new fbcon
      takes over the old buffer as if it were to contain 8-bit chars
      (although the attrs are still shifted for 9 bits), and effectively
      this results in yellow text instead of the original white color,
      as found in the bugzilla entry below.
      
      An easy fix for this is to re-adjust the attrs before leaving the
      fbcon at con_deinit callback.  Since the code to adjust the attrs is
      already present in the current fbcon code, in this patch, we simply
      factor out the relevant code, and call it from fbcon_deinit().
      
      Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1000619
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • drm: reference count event->completion · 2a324104
      Daniel Vetter authored
      commit 24835e44 upstream.
      
      When writing the generic nonblocking commit code I assumed that
      through clever lifetime management I can assure that the completion
      (stored in drm_crtc_commit) only gets freed after it is completed. And
      that worked.
      
      I also wanted to make nonblocking helpers resilient against driver
      bugs, by having timeouts everywhere. And that worked too.
      
      Unfortunately taking both things together results in oopses :( Well,
      at least sometimes: What seems to happen is that the drm event hangs
      around forever stuck in limbo land. The nonblocking helpers eventually
      time out, move on and release it. Now the bug I tested all this
      against is drivers that just entirely fail to deliver the vblank
      events like they should, and in those cases the event is simply
      leaked. But what seems to happen, at least sometimes, on i915 is that
      the event is set up correctly, but somehow the vblank fails to fire in
      time. Which means the event isn't leaked, it's still there waiting for
      a vblank to eventually fire. That tends to happen when re-enabling the
      pipe, and then the trap springs and the kernel oopses.
      
      The correct fix here is simply to refcount the crtc commit to make
      sure that the event sticks around even for drivers which only
      sometimes fail to deliver vblanks for some arbitrary reasons. Since
      crtc commits are already refcounted that's easy to do.
      
      References: https://bugs.freedesktop.org/show_bug.cgi?id=96781
      Cc: Jim Rees <rees@umich.edu>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20161221102331.31033-1-daniel.vetter@ffwll.ch
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
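      [editor's note: a toy refcounting model of the fix described above — both the
      nonblocking helper (which may time out) and the pending event hold a reference,
      so the memory survives until the last reference is put.  struct commit_model is
      a made-up stand-in for drm_crtc_commit, not the actual DRM code.]

        #include <stdio.h>
        #include <stdlib.h>

        /* toy stand-in for drm_crtc_commit: freed only when the last reference drops */
        struct commit_model {
                int refcount;
                int completed;
        };

        static struct commit_model *commit_get(struct commit_model *c)
        {
                c->refcount++;
                return c;
        }

        static void commit_put(struct commit_model *c)
        {
                if (--c->refcount == 0) {
                        printf("last reference dropped, freeing the commit\n");
                        free(c);
                }
        }

        int main(void)
        {
                struct commit_model *c = calloc(1, sizeof(*c));

                if (!c)
                        return 1;
                c->refcount = 1;                                 /* helper's reference  */

                struct commit_model *event_ref = commit_get(c);  /* the event's own ref */

                commit_put(c);              /* helpers time out and drop their reference  */
                event_ref->completed = 1;   /* ...but a late vblank can still complete it */
                commit_put(event_ref);      /* now the memory really goes away            */
                return 0;
        }
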
    • xen: do not re-use pirq number cached in pci device msi msg data · 59758483
      Dan Streetman authored
      commit c74fd80f upstream.
      
      Revert the main part of commit:
      af42b8d1 ("xen: fix MSI setup and teardown for PV on HVM guests")
      
      That commit introduced reading the pci device's msi message data to see
      if a pirq was previously configured for the device's msi/msix, and re-use
      that pirq.  At the time, that was the correct behavior.  However, a
      later change to Qemu caused it to call into the Xen hypervisor to unmap
      all pirqs for a pci device, when the pci device disables its MSI/MSIX
      vectors; specifically the Qemu commit:
      c976437c7dba9c7444fb41df45468968aaa326ad
      ("qemu-xen: free all the pirqs for msi/msix when driver unload")
      
      Once Qemu added this pirq unmapping, it was no longer correct for the
      kernel to re-use the pirq number cached in the pci device msi message
      data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
      pirqs when the pci device disables its MSI/MSIX vectors.
      
      This bug is causing failures to initialize multiple NVMe controllers
      under Xen, because the NVMe driver sets up a single MSIX vector for
      each controller (concurrently), and then after using that to talk to
      the controller for some configuration data, it disables the single MSIX
      vector and re-configures all the MSIX vectors it needs.  So the MSIX
      setup code tries to re-use the cached pirq from the first vector
      for each controller, but the hypervisor has already given away that
      pirq to another controller, and its initialization fails.
      
      This is discussed in more detail at:
      https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
      
      Fixes: af42b8d1 ("xen: fix MSI setup and teardown for PV on HVM guests")
      Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
      Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
      Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cpuidle: Validate cpu_dev in cpuidle_add_sysfs() · 53569305
      Vaidyanathan Srinivasan authored
      commit ad0a45fd upstream.
      
      If a given cpu is not in cpu_present and cpu hotplug
      is disabled, the arch can skip setting up the cpu_dev.
      
      The arch cpuidle driver should pass the correct cpu mask
      for registration, but if the driver fails to do so, the
      error propagates and crashes like this:
      
      [   30.076045] Unable to handle kernel paging request for data at address 0x00000048
      [   30.076100] Faulting instruction address: 0xc0000000007b2f30
      cpu 0x4d: Vector: 300 (Data Access) at [c000003feb18b670]
          pc: c0000000007b2f30: kobject_get+0x20/0x70
          lr: c0000000007b3c94: kobject_add_internal+0x54/0x3f0
          sp: c000003feb18b8f0
         msr: 9000000000009033
         dar: 48
       dsisr: 40000000
        current = 0xc000003fd2ed8300
        paca    = 0xc00000000fbab500   softe: 0        irq_happened: 0x01
          pid   = 1, comm = swapper/0
      Linux version 4.11.0-rc2-svaidy+ (sv@sagarika) (gcc version 6.2.0
      20161005 (Ubuntu 6.2.0-5ubuntu12) ) #10 SMP Sun Mar 19 00:08:09 IST 2017
      enter ? for help
      [c000003feb18b960] c0000000007b3c94 kobject_add_internal+0x54/0x3f0
      [c000003feb18b9f0] c0000000007b43a4 kobject_init_and_add+0x64/0xa0
      [c000003feb18ba70] c000000000e284f4 cpuidle_add_sysfs+0xb4/0x130
      [c000003feb18baf0] c000000000e26038 cpuidle_register_device+0x118/0x1c0
      [c000003feb18bb30] c000000000e26c48 cpuidle_register+0x78/0x120
      [c000003feb18bbc0] c00000000168fd9c powernv_processor_idle_init+0x110/0x1c4
      [c000003feb18bc40] c00000000000cff8 do_one_initcall+0x68/0x1d0
      [c000003feb18bd00] c0000000016242f4 kernel_init_freeable+0x280/0x360
      [c000003feb18bdc0] c00000000000d864 kernel_init+0x24/0x160
      [c000003feb18be30] c00000000000b4e8 ret_from_kernel_thread+0x5c/0x74
      
      Validating cpu_dev fixes the crash and reports a correct error message like:
      
      [   30.163506] Failed to register cpuidle device for cpu136
      [   30.173329] Registration of powernv driver failed.
      Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      [ rjw: Comment massage ]
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • scsi: sd: Check for unaligned partial completion · a27142e6
      Damien Le Moal authored
      commit c46f0917 upstream.
      
      Commit f2e767bb ("mpt3sas: Force request partial completion
      alignment") was not considering the case of commands not operating on
      logical block size units (e.g. REQ_OP_ZONE_REPORT and its 64B aligned
      partial replies). In this case, forcing alignment of resid to the device
      logical block size can break the command result, e.g. in the case of
      REQ_OP_ZONE_REPORT, the exact number of zones reported by the device.
      
      Move the partial completion alignment check of mpt3sas to a generic
      implementation in sd_done(). The check is added within the default
      section of the initial req_op() switch case so that the report and reset
      zone commands are ignored. In addition, as sd_done() is not called for
      passthrough requests, resid corrections are not done as intended by the
      initial mpt3sas patch.
      
      Fixes: f2e767bb ("mpt3sas: Force request partial completion alignment")
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
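      [editor's note: one possible shape of the generic check described above, as a
      hypothetical sketch: for normal read/write commands an unaligned residual is
      rounded up to the next logical-block boundary and capped at the transfer length.
      This illustrates the idea; it is not necessarily the exact merged patch.]

        #include <stdio.h>

        /* round x up to the next multiple of a power-of-two block size */
        static unsigned int round_up_to(unsigned int x, unsigned int blk)
        {
                return (x + blk - 1) & ~(blk - 1);
        }

        /* illustrative correction, applied only to normal read/write commands */
        static unsigned int fix_resid(unsigned int resid, unsigned int bufflen,
                                      unsigned int logical_block_size)
        {
                if (resid & (logical_block_size - 1)) {
                        printf("unaligned partial completion: resid=%u\n", resid);
                        resid = round_up_to(resid, logical_block_size);
                        if (resid > bufflen)
                                resid = bufflen;
                }
                return resid;
        }

        int main(void)
        {
                /* e.g. a 512-byte-sector disk reporting a 700-byte residual */
                printf("corrected resid=%u\n", fix_resid(700, 4096, 512));
                return 0;
        }
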
    • device-dax: fix pmd/pte fault fallback handling · 66c08128
      Dave Jiang authored
      commit 0134ed4f upstream.
      
      Jeff Moyer reports:
      
          With a device dax alignment of 4KB or 2MB, I get sigbus when running
          the attached fio job file for the current kernel (4.11.0-rc1+).  If
          I specify an alignment of 1GB, it works.
      
          I turned on debug output, and saw that it was failing in the huge
          fault code.
      
           dax dax1.0: dax_open
           dax dax1.0: dax_mmap
           dax dax1.0: dax_dev_huge_fault: fio: write (0x7f08f0a00000 -
           dax dax1.0: __dax_dev_pud_fault: phys_to_pgoff(0xffffffffcf60
           dax dax1.0: dax_release
      
          fio config for reproduce:
          [global]
          ioengine=dev-dax
          direct=0
          filename=/dev/dax0.0
          bs=2m
      
          [write]
          rw=write
      
          [read]
          stonewall
          rw=read
      
      The driver fails to fallback when taking a fault that is larger than
      the device alignment, or handling a larger fault when a smaller
      mapping is already established. While we could support larger
      mappings for a device with a smaller alignment, that change is
      too large for the immediate fix. The simplest change is to force
      fallback until the fault size matches the alignment.
      
      Fixes: dee41079 ("/dev/dax, core: file operations and dax-mmap")
      Cc: <stable@vger.kernel.org>
      Reported-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
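      [editor's note: a minimal sketch of the fallback rule described above — a huge
      fault whose size does not match the device alignment returns "fallback" instead
      of being serviced (or raising SIGBUS).  dax_huge_fault_model() and the size
      macros are illustrative, not the device-dax driver code.]

        #include <stdio.h>

        #define SZ_4K (4096u)
        #define SZ_2M (2u * 1024 * 1024)

        enum fault_result { FAULT_OK, FAULT_FALLBACK };

        /* illustrative gate: service a huge fault only when its size matches the
         * device alignment exactly; otherwise fall back instead of raising SIGBUS */
        static enum fault_result dax_huge_fault_model(unsigned int fault_size,
                                                      unsigned int dev_align)
        {
                if (fault_size != dev_align)
                        return FAULT_FALLBACK;
                return FAULT_OK;
        }

        int main(void)
        {
                printf("2M fault on a 4K-aligned device: %s\n",
                       dax_huge_fault_model(SZ_2M, SZ_4K) == FAULT_FALLBACK
                       ? "fallback" : "handled");
                printf("2M fault on a 2M-aligned device: %s\n",
                       dax_huge_fault_model(SZ_2M, SZ_2M) == FAULT_FALLBACK
                       ? "fallback" : "handled");
                return 0;
        }
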
    • libceph: don't set weight to IN when OSD is destroyed · 96aa12df
      Ilya Dryomov authored
      commit b581a585 upstream.
      
      Since ceph.git commit 4e28f9e63644 ("osd/OSDMap: clear osd_info,
      osd_xinfo on osd deletion"), weight is set to IN when OSD is deleted.
      This changes the result of applying an incremental for clients, not
      just OSDs.  Because CRUSH computations are obviously affected,
      pre-4e28f9e63644 servers disagree with post-4e28f9e63644 clients on
      object placement, resulting in misdirected requests.
      
      Mirrors ceph.git commit a6009d1039a55e2c77f431662b3d6cc5a8e8e63f.
      
      Fixes: 930c5328 ("libceph: apply new_state before new_up_client on incrementals")
      Link: http://tracker.ceph.com/issues/19122
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Sage Weil <sage@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mmc: block: Fix is_waiting_last_req set incorrectly · 8b38e319
      Adrian Hunter authored
      commit 2602b740 upstream.
      
      Commit 15520111 ("mmc: core: Further fix thread wake-up") allowed a
      queue to release the host with is_waiting_last_req set to true. A queue
      waiting to claim the host will not reset it, which can result in the
      queue getting stuck in a loop.
      
      Fixes: 15520111 ("mmc: core: Further fix thread wake-up")
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>