1. 19 Jun, 2017 21 commits
    • Brian Foster's avatar
      xfs: remove bli from AIL before release on transaction abort · 3d4b4a3e
      Brian Foster authored
      When a buffer is modified, logged and committed, it ultimately ends
      up sitting on the AIL with a dirty bli waiting for metadata
      writeback. If another transaction locks and invalidates the buffer
      (freeing an inode chunk, for example) in the meantime, the bli is
      flagged as stale, the dirty state is cleared and the bli remains in
      the AIL.
      
      If a shutdown occurs before the transaction that has invalidated the
      buffer is committed, the transaction is ultimately aborted. The log
      items are flagged as such and ->iop_unlock() handles the aborted
      items. Because the bli is clean (due to the invalidation),
      ->iop_unlock() unconditionally releases it. The log item may still
      reside in the AIL, however, which means the I/O completion handler
      may still run and attempt to access it. This results in assert
      failure due to the release of the bli while still present in the AIL
      and a subsequent NULL dereference and panic in the buffer I/O
      completion handling. This can be reproduced by running generic/388
      in repetition.
      
      To avoid this problem, update xfs_buf_item_unlock() to first check
      whether the bli is aborted and if so, remove it from the AIL before
      it is released. This ensures that the bli is no longer accessed
      during the shutdown sequence after it has been freed.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      3d4b4a3e
    • Brian Foster's avatar
      xfs: release bli from transaction properly on fs shutdown · 79e641ce
      Brian Foster authored
      If a filesystem shutdown occurs with a buffer log item in the CIL
      and a log force occurs, the ->iop_unpin() handler is generally
      expected to tear down the bli properly. This entails freeing the bli
      memory and releasing the associated hold on the buffer so it can be
      released and the filesystem unmounted.
      
      If this sequence occurs while ->bli_refcount is elevated (i.e.,
      another transaction is open and attempting to modify the buffer),
      however, ->iop_unpin() may not be responsible for releasing the bli.
      Instead, the transaction may release the final ->bli_refcount
      reference and thus xfs_trans_brelse() is responsible for tearing
      down the bli.
      
      While xfs_trans_brelse() does drop the reference count, it only
      attempts to release the bli if it is clean (i.e., not in the
      CIL/AIL). If the filesystem is shutdown and the bli is sitting dirty
      in the CIL as noted above, this ends up skipping the last
      opportunity to release the bli. In turn, this leaves the hold on the
      buffer and causes an unmount hang. This can be reproduced by running
      generic/388 in repetition.
      
      Update xfs_trans_brelse() to handle this shutdown corner case
      correctly. If the final bli reference is dropped and the filesystem
      is shutdown, remove the bli from the AIL (if necessary) and release
      the bli to drop the buffer hold and ensure an unmount does not hang.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      79e641ce
    • Arnd Bergmann's avatar
      xfs: avoid harmless gcc-7 warnings · 0cbe48cc
      Arnd Bergmann authored
      gcc-7 flags the use of integer math inside of a condition
      as a potential bug:
      
      fs/xfs/xfs_bmap_util.c: In function 'xfs_swap_extents_check_format':
      fs/xfs/xfs_bmap_util.c:1619:8: error: '<<' in boolean context, did you mean '<' ? [-Werror=int-in-bool-context]
      fs/xfs/xfs_bmap_util.c:1629:8: error: '<<' in boolean context, did you mean '<' ? [-Werror=int-in-bool-context]
      
      There is already a helper function for testing the di_forkoff
      field for zero, so let's use that instead to shut up the warning.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      0cbe48cc
    • Shan Hai's avatar
      xfs: remove lsn relevant fields from xfs_trans structure and its users · f990fc5a
      Shan Hai authored
      The t_lsn is not used anymore and the t_commit_lsn is used as a tmp
      storage for the checkpoint sequence number only in the current code.
      
      And the start/commit lsn are tracked as a transaction group tag in
      the xfs_cil_ctx instead of a single transaction, so remove them from
      the xfs_trans structure and their users to match with the design.
      Signed-off-by: default avatarShan Hai <shan.hai@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      f990fc5a
    • Christoph Hellwig's avatar
      xfs: remove XFS_HSIZE · 3398a400
      Christoph Hellwig authored
      XFS_HSIZE is an extremly confusing way to calculate the size of handle_t.
      Given that handle_t always only had two sizes, and one of them isn't
      even covered by XFS_HSIZE to start with just remove the macro and use
      a constant sizeof expression.
      
      Note that XFS_HSIZE isn't used in xfsprogs, xfsdump or xfstests either.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarEric Sandeen <sandeen@sandeen.net>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      3398a400
    • Brian Foster's avatar
      xfs: dump transaction usage details on log reservation overrun · d4ca1d55
      Brian Foster authored
      If a transaction log reservation overrun occurs, the ticket data
      associated with the reservation is dumped in xfs_log_commit_cil().
      This occurs long after the transaction items and details have been
      removed from the transaction and effectively lost. This limited set
      of ticket data provides very little information to support debugging
      transaction overruns based on the typical report.
      
      To improve transaction log reservation overrun reporting, create a
      helper to dump transaction details such as log items, log vector
      data, etc., as well as the underlying ticket data for the
      transaction. Move the overrun detection from xfs_log_commit_cil() to
      xlog_cil_insert_items() so it occurs prior to migration of the
      logged items to the CIL. Call the new helper such that it is able to
      dump this transaction data before it is lost.
      
      Also, warn on overrun to provide callstack context for the offending
      transaction and include a few additional messages from
      xlog_cil_insert_items() to display the reservation consumed locally
      for overhead such as log vector headers, split region headers and
      the context ticket. This provides a complete general breakdown of
      the reservation consumption of a transaction when/if it happens to
      overrun the reservation.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      d4ca1d55
    • Brian Foster's avatar
      xfs: refactor xlog_cil_insert_items() to facilitate transaction dump · e2f23426
      Brian Foster authored
      Transaction reservation overrun detection currently occurs too late
      to print useful information about the offending transaction.
      Ideally, the transaction data is printed before the associated log
      items are moved from the transaction to the CIL, which occurs in
      xlog_cil_insert_items(), such that details of the items logged by
      the transaction are available for analysis.
      
      Refactor xlog_cil_insert_items() to facilitate moving tx overrun
      detection to this function. Update the function to track each bit of
      extra log reservation stolen from the transaction (i.e., such as for
      the CIL context ticket) and perform the log item migration as the
      last operation before the CIL lock is released. This creates a
      context where the transaction reservation consumption has been fully
      calculated when the log items are moved to the CIL. This patch makes
      no functional changes.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      e2f23426
    • Brian Foster's avatar
      xfs: separate shutdown from ticket reservation print helper · 7d2d5653
      Brian Foster authored
      xlog_print_tic_res() pre-dates delayed logging and the committed
      items list (CIL) and thus retains some factoring warts, such as hard
      coded function names in the output and the fact that it induces a
      shutdown.
      
      In preparation for more detailed logging of regular transaction
      overrun situations, refactor xlog_print_tic_res() to be slightly
      more generic. Reword some of the warning messages and pull the
      shutdown into the callers.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      7d2d5653
    • Brian Foster's avatar
      xfs: define fatal assert build time tunable · 1040960e
      Brian Foster authored
      While configurable at runtime, the DEBUG mode assert failure
      behavior is usually either desired or not for a particular
      situation. For example, developers using kernel modules may prefer
      for fatal asserts to remain disabled across module reloads while QE
      engineers doing broad regression testing may prefer to have fatal
      asserts enabled on boot to facilitate data collection for bug
      reports.
      
      To provide a compromise/convenience for developers, create a Kconfig
      option that sets the default value of the DEBUG mode 'bug_on_assert'
      sysfs tunable. The default behavior remains to trigger kernel BUGs
      on assert failures to preserve existing behavior across kernel
      configuration updates with DEBUG mode enabled.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      1040960e
    • Brian Foster's avatar
      xfs: define bug_on_assert debug mode sysfs tunable · ccdab3d6
      Brian Foster authored
      In DEBUG mode, assert failures unconditionally trigger a kernel BUG.
      This is useful in diagnostic situations to panic a system and
      collect detailed state information at the time of a failure.
      
      This can also cause problems in cases where DEBUG mode code is
      desired but it is preferable not trigger kernel BUGs on assert
      failure. For example, during development of new code or during
      certain xfstests tests that intentionally cause corruption and test
      the kernel for survival (but otherwise may expect to trigger assert
      failures).
      
      To provide additional flexibility, create the
      <sysfs>/fs/xfs/debug/bug_on_assert tunable to configure assert
      failure behavior at runtime. This tunable is only available in DEBUG
      mode and is enabled by default to preserve existing default
      behavior. When disabled, assert failures in DEBUG mode result in
      kernel warnings.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      ccdab3d6
    • Darrick J. Wong's avatar
      xfs: try to avoid blowing out the transaction reservation when bunmaping a shared extent · e1a4e37c
      Darrick J. Wong authored
      In a pathological scenario where we are trying to bunmapi a single
      extent in which every other block is shared, it's possible that trying
      to unmap the entire large extent in a single transaction can generate so
      many EFIs that we overflow the transaction reservation.
      
      Therefore, use a heuristic to guess at the number of blocks we can
      safely unmap from a reflink file's data fork in an single transaction.
      This should prevent problems such as the log head slamming into the tail
      and ASSERTs that trigger because we've exceeded the transaction
      reservation.
      
      Note that since bunmapi can fail to unmap the entire range, we must also
      teach the deferred unmap code to roll into a new transaction whenever we
      get low on reservation.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      [hch: random edits, all bugs are my fault]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      e1a4e37c
    • Darrick J. Wong's avatar
      xfs: refactor dir2 leaf readahead shadow buffer cleverness · d205a7d0
      Darrick J. Wong authored
      Currently, the dir2 leaf block getdents function uses a complex state
      tracking mechanism to create a shadow copy of the block mappings and
      then uses the shadow copy to schedule readahead.  Since the read and
      readahead functions are perfectly capable of reading the mappings
      themselves, we can tear all that out in favor of a simpler function that
      simply keeps pushing the readahead window further out.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      d205a7d0
    • Brian Foster's avatar
      xfs: push buffer of flush locked dquot to avoid quotacheck deadlock · 7912e7fe
      Brian Foster authored
      Reclaim during quotacheck can lead to deadlocks on the dquot flush
      lock:
      
       - Quotacheck populates a local delwri queue with the physical dquot
         buffers.
       - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and
         dirties all of the dquots.
       - Reclaim kicks in and attempts to flush a dquot whose buffer is
         already queud on the quotacheck queue. The flush succeeds but
         queueing to the reclaim delwri queue fails as the backing buffer is
         already queued. The flush unlock is now deferred to I/O completion
         of the buffer from the quotacheck queue.
       - The dqadjust bulkstat continues and dirties the recently flushed
         dquot once again.
       - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires
         the flush lock to update the backing buffers with the in-core
         recalculated values. It deadlocks on the redirtied dquot as the
         flush lock was already acquired by reclaim, but the buffer resides
         on the local delwri queue which isn't submitted until the end of
         quotacheck.
      
      This is reproduced by running quotacheck on a filesystem with a
      couple million inodes in low memory (512MB-1GB) situations. This is
      a regression as of commit 43ff2122 ("xfs: on-stack delayed write
      buffer lists"), which removed a trylock and buffer I/O submission
      from the quotacheck dquot flush sequence.
      
      Quotacheck first resets and collects the physical dquot buffers in a
      delwri queue. Then, it traverses the filesystem inodes via bulkstat,
      updates the in-core dquots, flushes the corrected dquots to the
      backing buffers and finally submits the delwri queue for I/O. Since
      the backing buffers are queued across the entire quotacheck
      operation, dquot reclaim cannot possibly complete a dquot flush
      before quotacheck completes.
      
      Therefore, quotacheck must submit the buffer for I/O in order to
      cycle the flush lock and flush the dirty in-core dquot to the
      buffer. Add a delwri queue buffer push mechanism to submit an
      individual buffer for I/O without losing the delwri queue status and
      use it from quotacheck to avoid the deadlock. This restores
      quotacheck behavior to as before the regression was introduced.
      Reported-by: default avatarMartin Svec <martin.svec@zoner.cz>
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      7912e7fe
    • Linus Torvalds's avatar
      Linux 4.12-rc6 · 41f1830f
      Linus Torvalds authored
      41f1830f
    • Hugh Dickins's avatar
      mm: larger stack guard gap, between vmas · 1be7107f
      Hugh Dickins authored
      Stack guard page is a useful feature to reduce a risk of stack smashing
      into a different mapping. We have been using a single page gap which
      is sufficient to prevent having stack adjacent to a different mapping.
      But this seems to be insufficient in the light of the stack usage in
      userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
      used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
      which is 256kB or stack strings with MAX_ARG_STRLEN.
      
      This will become especially dangerous for suid binaries and the default
      no limit for the stack size limit because those applications can be
      tricked to consume a large portion of the stack and a single glibc call
      could jump over the guard page. These attacks are not theoretical,
      unfortunatelly.
      
      Make those attacks less probable by increasing the stack guard gap
      to 1MB (on systems with 4k pages; but make it depend on the page size
      because systems with larger base pages might cap stack allocations in
      the PAGE_SIZE units) which should cover larger alloca() and VLA stack
      allocations. It is obviously not a full fix because the problem is
      somehow inherent, but it should reduce attack space a lot.
      
      One could argue that the gap size should be configurable from userspace,
      but that can be done later when somebody finds that the new 1MB is wrong
      for some special case applications.  For now, add a kernel command line
      option (stack_guard_gap) to specify the stack gap size (in page units).
      
      Implementation wise, first delete all the old code for stack guard page:
      because although we could get away with accounting one extra page in a
      stack vma, accounting a larger gap can break userspace - case in point,
      a program run with "ulimit -S -v 20000" failed when the 1MB gap was
      counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
      and strict non-overcommit mode.
      
      Instead of keeping gap inside the stack vma, maintain the stack guard
      gap as a gap between vmas: using vm_start_gap() in place of vm_start
      (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
      places which need to respect the gap - mainly arch_get_unmapped_area(),
      and and the vma tree's subtree_gap support for that.
      Original-patch-by: default avatarOleg Nesterov <oleg@redhat.com>
      Original-patch-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Tested-by: Helge Deller <deller@gmx.de> # parisc
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1be7107f
    • Linus Torvalds's avatar
      Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 1132d5e7
      Linus Torvalds authored
      Pull ARM SoC fixes from Olof Johansson:
       "Stream of fixes has slowed down, only a few this week:
      
         - Some DT fixes for Allwinner platforms, and addition of a clock to
           the R_CCU clock controller that had been missed.
      
         - A couple of small DT fixes for am335x-sl50"
      
      * tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
        arm64: allwinner: a64: Add PLL_PERIPH0 clock to the R_CCU
        ARM: sunxi: h3-h5: Add PLL_PERIPH0 clock to the R_CCU
        ARM: dts: am335x-sl50: Fix cannot claim requested pins for spi0
        ARM: dts: am335x-sl50: Fix card detect pin for mmc1
        arm64: allwinner: h5: Remove syslink to shared DTSI
        ARM: sunxi: h3/h5: fix the compatible of R_CCU
      1132d5e7
    • Olof Johansson's avatar
      Merge tag 'sunxi-fixes-for-4.12' of... · a1858df9
      Olof Johansson authored
      Merge tag 'sunxi-fixes-for-4.12' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux into fixes
      
      Allwinner fixes for 4.12
      
      A few fixes around the PRCM support that got in 4.12 with a wrong
      compatible, and a missing clock in the binding.
      
      * tag 'sunxi-fixes-for-4.12' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux:
        arm64: allwinner: a64: Add PLL_PERIPH0 clock to the R_CCU
        ARM: sunxi: h3-h5: Add PLL_PERIPH0 clock to the R_CCU
        arm64: allwinner: h5: Remove syslink to shared DTSI
        ARM: sunxi: h3/h5: fix the compatible of R_CCU
      Signed-off-by: default avatarOlof Johansson <olof@lixom.net>
      a1858df9
    • Olof Johansson's avatar
      Merge tag 'omap-for-v4.12/fixes-sl50' of... · 51b6e281
      Olof Johansson authored
      Merge tag 'omap-for-v4.12/fixes-sl50' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into fixes
      
      Two fixes for am335x-sl50 to fix a boot time error
      for claiming SPI pins, and to fix a SDIO card detect
      pin for production version of the device.
      
      * tag 'omap-for-v4.12/fixes-sl50' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
        ARM: dts: am335x-sl50: Fix cannot claim requested pins for spi0
        ARM: dts: am335x-sl50: Fix card detect pin for mmc1
      Signed-off-by: default avatarOlof Johansson <olof@lixom.net>
      51b6e281
    • Linus Torvalds's avatar
      Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost · 3696e4f0
      Linus Torvalds authored
      Pull virtio bugfix from Michael Tsirkin:
       "It turns out balloon does not handle IOMMUs correctly. We should fix
        that at some point, for now let's just disable this configuration"
      
      * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
        virtio_balloon: disable VIOMMU support
      3696e4f0
    • Linus Torvalds's avatar
      Merge branch 'i2c/for-current-fixed' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 7d62d947
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "Two driver bugfixes"
      
      * 'i2c/for-current-fixed' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: ismt: fix wrong device address when unmap the data buffer
        i2c: rcar: use correct length when unmapping DMA
      7d62d947
    • Linus Torvalds's avatar
      Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus · b3ee4edd
      Linus Torvalds authored
      Pull MIPS fixes from Ralf Baechle:
      
       - Three highmem fixes:
          + Fixed mapping initialization
          + Adjust the pkmap location
          + Ensure we use at most one page for PTEs
      
       - Fix makefile dependencies for .its targets to depend on vmlinux
      
       - Fix reversed condition in BNEZC and JIALC software branch emulation
      
       - Only flush initialized flush_insn_slot to avoid NULL pointer
         dereference
      
       - perf: Remove incorrect odd/even counter handling for I6400
      
       - ftrace: Fix init functions tracing
      
      * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
        MIPS: .its targets depend on vmlinux
        MIPS: Fix bnezc/jialc return address calculation
        MIPS: kprobes: flush_insn_slot should flush only if probe initialised
        MIPS: ftrace: fix init functions tracing
        MIPS: mm: adjust PKMAP location
        MIPS: highmem: ensure that we don't use more than one page for PTEs
        MIPS: mm: fixed mappings: correct initialisation
        MIPS: perf: Remove incorrect odd/even counter handling for I6400
      b3ee4edd
  2. 18 Jun, 2017 7 commits
  3. 17 Jun, 2017 7 commits
  4. 16 Jun, 2017 5 commits