1. 20 Sep, 2017 38 commits
    • Brian Foster's avatar
      xfs: remove unnecessary dirty bli format check for ordered bufs · 44a98221
      Brian Foster authored
      commit 6453c65d upstream.
      
      xfs_buf_item_unlock() historically checked the dirty state of the
      buffer by manually checking the buffer log formats for dirty
      segments. The introduction of ordered buffers invalidated this check
      because ordered buffers have dirty bli's but no dirty (logged)
      segments. The check was updated to accommodate ordered buffers by
      looking at the bli state first and considering the blf only if the
      bli is clean.
      
      This logic is safe but unnecessary. There is no valid case where the
      bli is clean yet the blf has dirty segments. The bli is set dirty
      whenever the blf is logged (via xfs_trans_log_buf()) and the blf is
      cleared in the only place BLI_DIRTY is cleared (xfs_trans_binval()).
      
      Remove the conditional blf dirty checks and replace with an assert
      that should catch any discrepencies between bli and blf dirty
      states. Refactor the old blf dirty check into a helper function to
      be used by the assert.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      44a98221
    • Brian Foster's avatar
      xfs: open-code xfs_buf_item_dirty() · f9abe7f1
      Brian Foster authored
      commit a4f6cf6b upstream.
      
      It checks a single flag and has one caller. It probably isn't worth
      its own function.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f9abe7f1
    • Omar Sandoval's avatar
      xfs: check for race with xfs_reclaim_inode() in xfs_ifree_cluster() · 5a64bffc
      Omar Sandoval authored
      commit f2e9ad21 upstream.
      
      After xfs_ifree_cluster() finds an inode in the radix tree and verifies
      that the inode number is what it expected, xfs_reclaim_inode() can swoop
      in and free it. xfs_ifree_cluster() will then happily continue working
      on the freed inode. Most importantly, it will mark the inode stale,
      which will probably be overwritten when the inode slab object is
      reallocated, but if it has already been reallocated then we can end up
      with an inode spuriously marked stale.
      
      In 8a17d7dd ("xfs: mark reclaimed inodes invalid earlier") we added
      a second check to xfs_iflush_cluster() to detect this race, but the
      similar RCU lookup in xfs_ifree_cluster() needs the same treatment.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5a64bffc
    • Darrick J. Wong's avatar
      xfs: evict all inodes involved with log redo item · 77c72521
      Darrick J. Wong authored
      commit 799ea9e9 upstream.
      
      When we introduced the bmap redo log items, we set MS_ACTIVE on the
      mountpoint and XFS_IRECOVERY on the inode to prevent unlinked inodes
      from being truncated prematurely during log recovery.  This also had the
      effect of putting linked inodes on the lru instead of evicting them.
      
      Unfortunately, we neglected to find all those unreferenced lru inodes
      and evict them after finishing log recovery, which means that we leak
      them if anything goes wrong in the rest of xfs_mountfs, because the lru
      is only cleaned out on unmount.
      
      Therefore, evict unreferenced inodes in the lru list immediately
      after clearing MS_ACTIVE.
      
      Fixes: 17c12bcd ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Cc: viro@ZenIV.linux.org.uk
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      77c72521
    • Carlos Maiolino's avatar
      xfs: stop searching for free slots in an inode chunk when there are none · d0904585
      Carlos Maiolino authored
      commit 2d32311c upstream.
      
      In a filesystem without finobt, the Space manager selects an AG to alloc a new
      inode, where xfs_dialloc_ag_inobt() will search the AG for the free slot chunk.
      
      When the new inode is in the same AG as its parent, the btree will be searched
      starting on the parent's record, and then retried from the top if no slot is
      available beyond the parent's record.
      
      To exit this loop though, xfs_dialloc_ag_inobt() relies on the fact that the
      btree must have a free slot available, once its callers relied on the
      agi->freecount when deciding how/where to allocate this new inode.
      
      In the case when the agi->freecount is corrupted, showing available inodes in an
      AG, when in fact there is none, this becomes an infinite loop.
      
      Add a way to stop the loop when a free slot is not found in the btree, making
      the function to fall into the whole AG scan which will then, be able to detect
      the corruption and shut the filesystem down.
      
      As pointed by Brian, this might impact performance, giving the fact we
      don't reset the search distance anymore when we reach the end of the
      tree, giving it fewer tries before falling back to the whole AG search, but
      it will only affect searches that start within 10 records to the end of the tree.
      Signed-off-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d0904585
    • Brian Foster's avatar
      xfs: handle -EFSCORRUPTED during head/tail verification · 5058c279
      Brian Foster authored
      commit a4c9b34d upstream.
      
      Torn write and tail overwrite detection both trigger only on
      -EFSBADCRC errors. While this is the most likely failure scenario
      for each condition, -EFSCORRUPTED is still possible in certain cases
      depending on what ends up on disk when a torn write or partial tail
      overwrite occurs. For example, an invalid log record h_len can lead
      to an -EFSCORRUPTED error when running the log recovery CRC pass.
      
      Therefore, update log head and tail verification to trigger the
      associated head/tail fixups in the event of -EFSCORRUPTED errors
      along with -EFSBADCRC. Also, -EFSCORRUPTED can currently be returned
      from xlog_do_recovery_pass() before rhead_blk is initialized if the
      first record encountered happens to be corrupted. This leads to an
      incorrect 'first_bad' return value. Initialize rhead_blk earlier in
      the function to address that problem as well.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5058c279
    • Brian Foster's avatar
      xfs: fix log recovery corruption error due to tail overwrite · de4b95ce
      Brian Foster authored
      commit 4a4f66ea upstream.
      
      If we consider the case where the tail (T) of the log is pinned long
      enough for the head (H) to push and block behind the tail, we can
      end up blocked in the following state without enough free space (f)
      in the log to satisfy a transaction reservation:
      
      	0	phys. log	N
      	[-------HffT---H'--T'---]
      
      The last good record in the log (before H) refers to T. The tail
      eventually pushes forward (T') leaving more free space in the log
      for writes to H. At this point, suppose space frees up in the log
      for the maximum of 8 in-core log buffers to start flushing out to
      the log. If this pushes the head from H to H', these next writes
      overwrite the previous tail T. This is safe because the items logged
      from T to T' have been written back and removed from the AIL.
      
      If the next log writes (H -> H') happen to fail and result in
      partial records in the log, the filesystem shuts down having
      overwritten T with invalid data. Log recovery correctly locates H on
      the subsequent mount, but H still refers to the now corrupted tail
      T. This results in log corruption errors and recovery failure.
      
      Since the tail overwrite results from otherwise correct runtime
      behavior, it is up to log recovery to try and deal with this
      situation. Update log recovery tail verification to run a CRC pass
      from the first record past the tail to the head. This facilitates
      error detection at T and moves the recovery tail to the first good
      record past H' (similar to truncating the head on torn write
      detection). If corruption is detected beyond the range possibly
      affected by the max number of iclogs, the log is legitimately
      corrupted and log recovery failure is expected.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      de4b95ce
    • Brian Foster's avatar
      xfs: always verify the log tail during recovery · 98cb20f8
      Brian Foster authored
      commit 5297ac1f upstream.
      
      Log tail verification currently only occurs when torn writes are
      detected at the head of the log. This was introduced because a
      change in the head block due to torn writes can lead to a change in
      the tail block (each log record header references the current tail)
      and the tail block should be verified before log recovery proceeds.
      
      Tail corruption is possible outside of torn write scenarios,
      however. For example, partial log writes can be detected and cleared
      during the initial head/tail block discovery process. If the partial
      write coincides with a tail overwrite, the log tail is corrupted and
      recovery fails.
      
      To facilitate correct handling of log tail overwites, update log
      recovery to always perform tail verification. This is necessary to
      detect potential tail overwrite conditions when torn writes may not
      have occurred. This changes normal (i.e., no torn writes) recovery
      behavior slightly to detect and return CRC related errors near the
      tail before actual recovery starts.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      98cb20f8
    • Brian Foster's avatar
      xfs: fix recovery failure when log record header wraps log end · 6e6acab2
      Brian Foster authored
      commit 284f1c2c upstream.
      
      The high-level log recovery algorithm consists of two loops that
      walk the physical log and process log records from the tail to the
      head. The first loop handles the case where the tail is beyond the
      head and processes records up to the end of the physical log. The
      subsequent loop processes records from the beginning of the physical
      log to the head.
      
      Because log records can wrap around the end of the physical log, the
      first loop mentioned above must handle this case appropriately.
      Records are processed from in-core buffers, which means that this
      algorithm must split the reads of such records into two partial
      I/Os: 1.) from the beginning of the record to the end of the log and
      2.) from the beginning of the log to the end of the record. This is
      further complicated by the fact that the log record header and log
      record data are read into independent buffers.
      
      The current handling of each buffer correctly splits the reads when
      either the header or data starts before the end of the log and wraps
      around the end. The data read does not correctly handle the case
      where the prior header read wrapped or ends on the physical log end
      boundary. blk_no is incremented to or beyond the log end after the
      header read to point to the record data, but the split data read
      logic triggers, attempts to read from an invalid log block and
      ultimately causes log recovery to fail. This can be reproduced
      fairly reliably via xfstests tests generic/047 and generic/388 with
      large iclog sizes (256k) and small (10M) logs.
      
      If the record header read has pushed beyond the end of the physical
      log, the subsequent data read is actually contiguous. Update the
      data read logic to detect the case where blk_no has wrapped, mod it
      against the log size to read from the correct address and issue one
      contiguous read for the log data buffer. The log record is processed
      as normal from the buffer(s), the loop exits after the current
      iteration and the subsequent loop picks up with the first new record
      after the start of the log.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6e6acab2
    • Carlos Maiolino's avatar
      xfs: Properly retry failed inode items in case of error during buffer writeback · 77ad1533
      Carlos Maiolino authored
      commit d3a304b6 upstream.
      
      When a buffer has been failed during writeback, the inode items into it
      are kept flush locked, and are never resubmitted due the flush lock, so,
      if any buffer fails to be written, the items in AIL are never written to
      disk and never unlocked.
      
      This causes unmount operation to hang due these items flush locked in AIL,
      but this also causes the items in AIL to never be written back, even when
      the IO device comes back to normal.
      
      I've been testing this patch with a DM-thin device, creating a
      filesystem larger than the real device.
      
      When writing enough data to fill the DM-thin device, XFS receives ENOSPC
      errors from the device, and keep spinning on xfsaild (when 'retry
      forever' configuration is set).
      
      At this point, the filesystem can not be unmounted because of the flush locked
      items in AIL, but worse, the items in AIL are never retried at all
      (once xfs_inode_item_push() will skip the items that are flush locked),
      even if the underlying DM-thin device is expanded to the proper size.
      
      This patch fixes both cases, retrying any item that has been failed
      previously, using the infra-structure provided by the previous patch.
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      77ad1533
    • Carlos Maiolino's avatar
      xfs: Add infrastructure needed for error propagation during buffer IO failure · 7d8cd535
      Carlos Maiolino authored
      commit 0b80ae6e upstream.
      
      With the current code, XFS never re-submit a failed buffer for IO,
      because the failed item in the buffer is kept in the flush locked state
      forever.
      
      To be able to resubmit an log item for IO, we need a way to mark an item
      as failed, if, for any reason the buffer which the item belonged to
      failed during writeback.
      
      Add a new log item callback to be used after an IO completion failure
      and make the needed clean ups.
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7d8cd535
    • Eric Sandeen's avatar
      xfs: toggle readonly state around xfs_log_mount_finish · 2e537a0b
      Eric Sandeen authored
      commit 6f4a1eef upstream.
      
      When we do log recovery on a readonly mount, unlinked inode
      processing does not happen due to the readonly checks in
      xfs_inactive(), which are trying to prevent any I/O on a
      readonly mount.
      
      This is misguided - we do I/O on readonly mounts all the time,
      for consistency; for example, log recovery.  So do the same
      RDONLY flag twiddling around xfs_log_mount_finish() as we
      do around xfs_log_mount(), for the same reason.
      
      This all cries out for a big rework but for now this is a
      simple fix to an obvious problem.
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2e537a0b
    • Eric Sandeen's avatar
      xfs: write unmount record for ro mounts · cc84db7b
      Eric Sandeen authored
      commit 757a69ef upstream.
      
      There are dueling comments in the xfs code about intent
      for log writes when unmounting a readonly filesystem.
      
      In xfs_mountfs, we see the intent:
      
      /*
       * Now the log is fully replayed, we can transition to full read-only
       * mode for read-only mounts. This will sync all the metadata and clean
       * the log so that the recovery we just performed does not have to be
       * replayed again on the next mount.
       */
      
      and it calls xfs_quiesce_attr(), but by the time we get to
      xfs_log_unmount_write(), it returns early for a RDONLY mount:
      
       * Don't write out unmount record on read-only mounts.
      
      Because of this, sequential ro mounts of a filesystem with
      a dirty log will replay the log each time, which seems odd.
      
      Fix this by writing an unmount record even for RO mounts, as long
      as norecovery wasn't specified (don't write a clean log record
      if a dirty log may still be there!) and the log device is
      writable.
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cc84db7b
    • Dan Williams's avatar
      libnvdimm: fix integer overflow static analysis warning · 77b393af
      Dan Williams authored
      commit 58738c49 upstream.
      
      Dan reports:
          The patch 62232e45: "libnvdimm: control (ioctl) messages for
          nvdimm_bus and nvdimm devices" from Jun 8, 2015, leads to the
          following static checker warning:
      
                  drivers/nvdimm/bus.c:1018 __nd_ioctl()
                  warn: integer overflows 'buf_len'
      
          From a casual review, this seems like it might be a real bug.  On
          the first iteration we load some data into in_env[].  On the second
          iteration we read a use controlled "in_size" from nd_cmd_in_size().
          It can go up to UINT_MAX - 1.  A high number means we will fill the
          whole in_env[] buffer.  But we potentially keep looping and adding
          more to in_len so now it can be any value.
      
          It simple enough to change, but it feels weird that we keep looping
          even though in_env is totally full.  Shouldn't we just return an
          error if we don't have space for desc->in_num.
      
      We keep looping because the size of the total input is allowed to be
      bigger than the 'envelope' which is a subset of the payload that tells
      us how much data to expect. For safety explicitly check that buf_len
      does not overflow which is what the checker flagged.
      
      Fixes: 62232e45: "libnvdimm: control (ioctl) messages for nvdimm_bus..."
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      77b393af
    • Christophe Jaillet's avatar
      libnvdimm, btt: check memory allocation failure · d167685f
      Christophe Jaillet authored
      commit ed36b4db upstream.
      
      Check memory allocation failures and return -ENOMEM in such cases, as
      already done few lines below for another memory allocation.
      
      This avoids NULL pointers dereference.
      
      Fixes: 14e49454 ("libnvdimm, btt: BTT updates for UEFI 2.7 format")
      Signed-off-by: default avatarChristophe JAILLET <christophe.jaillet@wanadoo.fr>
      Reviewed-by: default avatarVishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d167685f
    • Eric Biggers's avatar
      idr: remove WARN_ON_ONCE() when trying to replace negative ID · a6cd7f34
      Eric Biggers authored
      commit a47f68d6 upstream.
      
      IDR only supports non-negative IDs.  There used to be a 'WARN_ON_ONCE(id <
      0)' in idr_replace(), but it was intentionally removed by commit
      2e1c9b28 ("idr: remove WARN_ON_ONCE() on negative IDs").
      
      Then it was added back by commit 0a835c4f ("Reimplement IDR and IDA
      using the radix tree").  However it seems that adding it back was a
      mistake, given that some users such as drm_gem_handle_delete()
      (DRM_IOCTL_GEM_CLOSE) pass in a value from userspace to idr_replace(),
      allowing the WARN_ON_ONCE to be triggered.  drm_gem_handle_delete()
      actually just wants idr_replace() to return an error code if the ID is
      not allocated, including in the case where the ID is invalid (negative).
      
      So once again remove the bogus WARN_ON_ONCE().
      
      This bug was found by syzkaller, which encountered the following
      warning:
      
          WARNING: CPU: 3 PID: 3008 at lib/idr.c:157 idr_replace+0x1d8/0x240 lib/idr.c:157
          Kernel panic - not syncing: panic_on_warn set ...
      
          CPU: 3 PID: 3008 Comm: syzkaller218828 Not tainted 4.13.0-rc4-next-20170811 #2
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
          Call Trace:
           fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190
           do_trap_no_signal arch/x86/kernel/traps.c:224 [inline]
           do_trap+0x260/0x390 arch/x86/kernel/traps.c:273
           do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:310
           do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323
           invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:930
          RIP: 0010:idr_replace+0x1d8/0x240 lib/idr.c:157
          RSP: 0018:ffff8800394bf9f8 EFLAGS: 00010297
          RAX: ffff88003c6c60c0 RBX: 1ffff10007297f43 RCX: 0000000000000000
          RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800394bfa78
          RBP: ffff8800394bfae0 R08: ffffffff82856487 R09: 0000000000000000
          R10: ffff8800394bf9a8 R11: ffff88006c8bae28 R12: ffffffffffffffff
          R13: ffff8800394bfab8 R14: dffffc0000000000 R15: ffff8800394bfbc8
           drm_gem_handle_delete+0x33/0xa0 drivers/gpu/drm/drm_gem.c:297
           drm_gem_close_ioctl+0xa1/0xe0 drivers/gpu/drm/drm_gem.c:671
           drm_ioctl_kernel+0x1e7/0x2e0 drivers/gpu/drm/drm_ioctl.c:729
           drm_ioctl+0x72e/0xa50 drivers/gpu/drm/drm_ioctl.c:825
           vfs_ioctl fs/ioctl.c:45 [inline]
           do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
           SYSC_ioctl fs/ioctl.c:700 [inline]
           SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
           entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Here is a C reproducer:
      
          #include <fcntl.h>
          #include <stddef.h>
          #include <stdint.h>
          #include <sys/ioctl.h>
          #include <drm/drm.h>
      
          int main(void)
          {
                  int cardfd = open("/dev/dri/card0", O_RDONLY);
      
                  ioctl(cardfd, DRM_IOCTL_GEM_CLOSE,
                        &(struct drm_gem_close) { .handle = -1 } );
          }
      
      Link: http://lkml.kernel.org/r/20170906235306.20534-1-ebiggers3@gmail.com
      Fixes: 0a835c4f ("Reimplement IDR and IDA using the radix tree")
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a6cd7f34
    • Miklos Szeredi's avatar
      fuse: allow server to run in different pid_ns · 343b0d82
      Miklos Szeredi authored
      commit 5d6d3a30 upstream.
      
      Commit 0b6e9ea0 ("fuse: Add support for pid namespaces") broke
      Sandstorm.io development tools, which have been sending FUSE file
      descriptors across PID namespace boundaries since early 2014.
      
      The above patch added a check that prevented I/O on the fuse device file
      descriptor if the pid namespace of the reader/writer was different from the
      pid namespace of the mounter.  With this change passing the device file
      descriptor to a different pid namespace simply doesn't work.  The check was
      added because pids are transferred to/from the fuse userspace server in the
      namespace registered at mount time.
      
      To fix this regression, remove the checks and do the following:
      
      1) the pid in the request header (the pid of the task that initiated the
      filesystem operation) is translated to the reader's pid namespace.  If a
      mapping doesn't exist for this pid, then a zero pid is used.  Note: even if
      a mapping would exist between the initiator task's pid namespace and the
      reader's pid namespace the pid will be zero if either mapping from
      initator's to mounter's namespace or mapping from mounter's to reader's
      namespace doesn't exist.
      
      2) The lk.pid value in setlk/setlkw requests and getlk reply is left alone.
      Userspace should not interpret this value anyway.  Also allow the
      setlk/setlkw operations if the pid of the task cannot be represented in the
      mounter's namespace (pid being zero in that case).
      Reported-by: default avatarKenton Varda <kenton@sandstorm.io>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Fixes: 0b6e9ea0 ("fuse: Add support for pid namespaces")
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      343b0d82
    • Amir Goldstein's avatar
      ovl: fix false positive ESTALE on lookup · 9a023afa
      Amir Goldstein authored
      commit 939ae4ef upstream.
      
      Commit b9ac5c27 ("ovl: hash overlay non-dir inodes by copy up origin")
      verifies that the origin lower inode stored in the overlayfs inode matched
      the inode of a copy up origin dentry found by lookup.
      
      There is a false positive result in that check when lower fs does not
      support file handles and copy up origin cannot be followed by file handle
      at lookup time.
      
      The false negative happens when finding an overlay inode in cache on a
      copied up overlay dentry lookup. The overlay inode still 'remembers' the
      copy up origin inode, but the copy up origin dentry is not available for
      verification.
      
      Relax the check in case copy up origin dentry is not available.
      
      Fixes: b9ac5c27 ("ovl: hash overlay non-dir inodes by copy up...")
      Reported-by: default avatarJordi Pujol <jordipujolp@gmail.com>
      Signed-off-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9a023afa
    • Tony Luck's avatar
      x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages · 6506d1d7
      Tony Luck authored
      commit ce0fa3e5 upstream.
      
      Speculative processor accesses may reference any memory that has a
      valid page table entry.  While a speculative access won't generate
      a machine check, it will log the error in a machine check bank. That
      could cause escalation of a subsequent error since the overflow bit
      will be then set in the machine check bank status register.
      
      Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual
      address of the page we want to map out otherwise we may trigger the
      very problem we are trying to avoid.  We use a non-canonical address
      that passes through the usual Linux table walking code to get to the
      same "pte".
      
      Thanks to Dave Hansen for reviewing several iterations of this.
      
      Also see:
      
        http://marc.info/?l=linux-mm&m=149860136413338&w=2Signed-off-by: default avatarTony Luck <tony.luck@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Elliott, Robert (Persistent Memory) <elliott@hpe.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20170816171803.28342-1-tony.luck@intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6506d1d7
    • Andy Lutomirski's avatar
      x86/switch_to/64: Rewrite FS/GS switching yet again to fix AMD CPUs · 5b7b3fe5
      Andy Lutomirski authored
      commit e137a4d8 upstream.
      
      Switching FS and GS is a mess, and the current code is still subtly
      wrong: it assumes that "Loading a nonzero value into FS sets the
      index and base", which is false on AMD CPUs if the value being
      loaded is 1, 2, or 3.
      
      (The current code came from commit 3e2b68d7 ("x86/asm,
      sched/x86: Rewrite the FS and GS context switch code"), which made
      it better but didn't fully fix it.)
      
      Rewrite it to be much simpler and more obviously correct.  This
      should fix it fully on AMD CPUs and shouldn't adversely affect
      performance.
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Chang Seok <chang.seok.bae@intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5b7b3fe5
    • Andy Lutomirski's avatar
      x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumps · 9f79ec98
      Andy Lutomirski authored
      commit 9584d98b upstream.
      
      In ELF_COPY_CORE_REGS, we're copying from the current task, so
      accessing thread.fsbase and thread.gsbase makes no sense.  Just read
      the values from the CPU registers.
      
      In practice, the old code would have been correct most of the time
      simply because thread.fsbase and thread.gsbase usually matched the
      CPU registers.
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Chang Seok <chang.seok.bae@intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9f79ec98
    • Andy Lutomirski's avatar
      x86/fsgsbase/64: Fully initialize FS and GS state in start_thread_common · 3f13b64c
      Andy Lutomirski authored
      commit 767d035d upstream.
      
      execve used to leak FSBASE and GSBASE on AMD CPUs.  Fix it.
      
      The security impact of this bug is small but not quite zero -- it
      could weaken ASLR when a privileged task execs a less privileged
      program, but only if program changed bitness across the exec, or the
      child binary was highly unusual or actively malicious.  A child
      program that was compromised after the exec would not have access to
      the leaked base.
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Chang Seok <chang.seok.bae@intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3f13b64c
    • Bernat, Yehezkel's avatar
      thunderbolt: Allow clearing the key · c4e91eda
      Bernat, Yehezkel authored
      commit e545f0d8 upstream.
      
      If secure authentication of a devices fails, either because the device
      already has another key uploaded, or there is some other error sending
      challenge to the device, and the user only wants to approve the device
      just once (without a new key being uploaded to the device) the current
      implementation does not allow this because the key cannot be cleared
      once set even if we allow it to be changed.
      
      Make this scenario possible and allow clearing the key by writing
      empty string to the key sysfs file.
      Signed-off-by: default avatarYehezkel Bernat <yehezkel.bernat@intel.com>
      Acked-by: default avatarMika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c4e91eda
    • Bernat, Yehezkel's avatar
      thunderbolt: Make key root-only accessible · 24ed5fd6
      Bernat, Yehezkel authored
      commit 0956e411 upstream.
      
      Non-root user may read the key back after root wrote it there.
      This removes read access to everyone but root.
      Signed-off-by: default avatarYehezkel Bernat <yehezkel.bernat@intel.com>
      Acked-by: default avatarMika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      24ed5fd6
    • Bernat, Yehezkel's avatar
      thunderbolt: Remove superfluous check · b92e97e6
      Bernat, Yehezkel authored
      commit 8fdd6ab3 upstream.
      
      The key size is tested by hex2bin() already (as '\0' isn't an hex digit)
      Suggested-by: default avatarAndy Shevchenko <andriy.shevchenko@intel.com>
      Signed-off-by: default avatarYehezkel Bernat <yehezkel.bernat@intel.com>
      Acked-by: default avatarMika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b92e97e6
    • Jaegeuk Kim's avatar
      f2fs: check hot_data for roll-forward recovery · 214d7f6b
      Jaegeuk Kim authored
      commit 125c9fb1 upstream.
      
      We need to check HOT_DATA to truncate any previous data block when doing
      roll-forward recovery.
      Reviewed-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      214d7f6b
    • Jaegeuk Kim's avatar
      f2fs: let fill_super handle roll-forward errors · be009dd0
      Jaegeuk Kim authored
      commit afd2b4da upstream.
      
      If we set CP_ERROR_FLAG in roll-forward error, f2fs is no longer to proceed
      any IOs due to f2fs_cp_error(). But, for example, if some stale data is involved
      on roll-forward process, we're able to get -ENOENT, getting fs stuck.
      If we get any error, let fill_super set SBI_NEED_FSCK and try to recover back
      to stable point.
      Reviewed-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      be009dd0
    • Haishuang Yan's avatar
      ip_tunnel: fix setting ttl and tos value in collect_md mode · ffe6e0e9
      Haishuang Yan authored
      
      [ Upstream commit 0f693f19 ]
      
      ttl and tos variables are declared and assigned, but are not used in
      iptunnel_xmit() function.
      
      Fixes: cfc7381b ("ip_tunnel: add collect_md mode to IPIP tunnel")
      Cc: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: default avatarHaishuang Yan <yanhaishuang@cmss.chinamobile.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ffe6e0e9
    • Eric Dumazet's avatar
      tcp: fix a request socket leak · 35c2f174
      Eric Dumazet authored
      
      [ Upstream commit 1f3b359f ]
      
      While the cited commit fixed a possible deadlock, it added a leak
      of the request socket, since reqsk_put() must be called if the BPF
      filter decided the ACK packet must be dropped.
      
      Fixes: d624d276 ("tcp: fix possible deadlock in TCP stack vs BPF filter")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      35c2f174
    • Marcelo Ricardo Leitner's avatar
      sctp: fix missing wake ups in some situations · 3b97e138
      Marcelo Ricardo Leitner authored
      
      [ Upstream commit 7906b00f ]
      
      Commit fb586f25 ("sctp: delay calls to sk_data_ready() as much as
      possible") minimized the number of wake ups that are triggered in case
      the association receives a packet with multiple data chunks on it and/or
      when io_events are enabled and then commit 0970f5b3 ("sctp: signal
      sk_data_ready earlier on data chunks reception") moved the wake up to as
      soon as possible. It thus relies on the state machine running later to
      clean the flag that the event was already generated.
      
      The issue is that there are 2 call paths that calls
      sctp_ulpq_tail_event() outside of the state machine, causing the flag to
      linger and possibly omitting a needed wake up in the sequence.
      
      One of the call paths is when enabling SCTP_SENDER_DRY_EVENTS via
      setsockopt(SCTP_EVENTS), as noticed by Harald Welte. The other is when
      partial reliability triggers removal of chunks from the send queue when
      the application calls sendmsg().
      
      This commit fixes it by not setting the flag in case the socket is not
      owned by the user, as it won't be cleaned later. This works for
      user-initiated calls and also for rx path processing.
      
      Fixes: fb586f25 ("sctp: delay calls to sk_data_ready() as much as possible")
      Reported-by: default avatarHarald Welte <laforge@gnumonks.org>
      Signed-off-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3b97e138
    • Eric Dumazet's avatar
      ipv6: fix typo in fib6_net_exit() · c5b047b1
      Eric Dumazet authored
      
      [ Upstream commit 32a805ba ]
      
      IPv6 FIB should use FIB6_TABLE_HASHSZ, not FIB_TABLE_HASHSZ.
      
      Fixes: ba1cc08d ("ipv6: fix memory leak with multiple tables during netns destruction")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c5b047b1
    • Sabrina Dubroca's avatar
      ipv6: fix memory leak with multiple tables during netns destruction · c352fb6a
      Sabrina Dubroca authored
      
      [ Upstream commit ba1cc08d ]
      
      fib6_net_exit only frees the main and local tables. If another table was
      created with fib6_alloc_table, we leak it when the netns is destroyed.
      
      Fix this in the same way ip_fib_net_exit cleans up tables, by walking
      through the whole hashtable of fib6_table's. We can get rid of the
      special cases for local and main, since they're also part of the
      hashtable.
      
      Reproducer:
          ip netns add x
          ip -net x -6 rule add from 6003:1::/64 table 100
          ip netns del x
      Reported-by: default avatarJianlin Shi <jishi@redhat.com>
      Fixes: 58f09b78 ("[NETNS][IPV6] ip6_fib - make it per network namespace")
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c352fb6a
    • Paolo Abeni's avatar
      udp: drop head states only when all skb references are gone · f8581386
      Paolo Abeni authored
      
      [ Upstream commit ca2c1418 ]
      
      After commit 0ddf3fb2 ("udp: preserve skb->dst if required
      for IP options processing") we clear the skb head state as soon
      as the skb carrying them is first processed.
      
      Since the same skb can be processed several times when MSG_PEEK
      is used, we can end up lacking the required head states, and
      eventually oopsing.
      
      Fix this clearing the skb head state only when processing the
      last skb reference.
      Reported-by: default avatarEric Dumazet <edumazet@google.com>
      Fixes: 0ddf3fb2 ("udp: preserve skb->dst if required for IP options processing")
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f8581386
    • Xin Long's avatar
      ip6_gre: update mtu properly in ip6gre_err · c9a166be
      Xin Long authored
      
      [ Upstream commit 5c25f30c ]
      
      Now when probessing ICMPV6_PKT_TOOBIG, ip6gre_err only subtracts the
      offset of gre header from mtu info. The expected mtu of gre device
      should also subtract gre header. Otherwise, the next packets still
      can't be sent out.
      
      Jianlin found this issue when using the topo:
        client(ip6gre)<---->(nic1)route(nic2)<----->(ip6gre)server
      
      and reducing nic2's mtu, then both tcp and sctp's performance with
      big size data became 0.
      
      This patch is to fix it by also subtracting grehdr (tun->tun_hlen)
      from mtu info when updating gre device's mtu in ip6gre_err(). It
      also needs to subtract ETH_HLEN if gre dev'type is ARPHRD_ETHER.
      Reported-by: default avatarJianlin Shi <jishi@redhat.com>
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c9a166be
    • Jason Wang's avatar
      vhost_net: correctly check tx avail during rx busy polling · b076d251
      Jason Wang authored
      
      [ Upstream commit 8b949bef ]
      
      We check tx avail through vhost_enable_notify() in the past which is
      wrong since it only checks whether or not guest has filled more
      available buffer since last avail idx synchronization which was just
      done by vhost_vq_avail_empty() before. What we really want is checking
      pending buffers in the avail ring. Fix this by calling
      vhost_vq_avail_empty() instead.
      
      This issue could be noticed by doing netperf TCP_RR benchmark as
      client from guest (but not host). With this fix, TCP_RR from guest to
      localhost restores from 1375.91 trans per sec to 55235.28 trans per
      sec on my laptop (Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz).
      
      Fixes: 03088137 ("vhost_net: basic polling support")
      Signed-off-by: default avatarJason Wang <jasowang@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b076d251
    • Claudiu Manoil's avatar
      gianfar: Fix Tx flow control deactivation · 80b25b4b
      Claudiu Manoil authored
      
      [ Upstream commit 5d621672 ]
      
      The wrong register is checked for the Tx flow control bit,
      it should have been maccfg1 not maccfg2.
      This went unnoticed for so long probably because the impact is
      hardly visible, not to mention the tangled code from adjust_link().
      First, link flow control (i.e. handling of Rx/Tx link level pause frames)
      is disabled by default (needs to be enabled via 'ethtool -A').
      Secondly, maccfg2 always returns 0 for tx_flow_oldval (except for a few
      old boards), which results in Tx flow control remaining always on
      once activated.
      
      Fixes: 45b679c9 ("gianfar: Implement PAUSE frame generation support")
      Signed-off-by: default avatarClaudiu Manoil <claudiu.manoil@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      80b25b4b
    • Jesper Dangaard Brouer's avatar
      Revert "net: fix percpu memory leaks" · ecb26e81
      Jesper Dangaard Brouer authored
      
      [ Upstream commit 5a63643e ]
      
      This reverts commit 1d6119ba.
      
      After reverting commit 6d7b857d ("net: use lib/percpu_counter API
      for fragmentation mem accounting") then here is no need for this
      fix-up patch.  As percpu_counter is no longer used, it cannot
      memory leak it any-longer.
      
      Fixes: 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting")
      Fixes: 1d6119ba ("net: fix percpu memory leaks")
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ecb26e81
    • Jesper Dangaard Brouer's avatar
      Revert "net: use lib/percpu_counter API for fragmentation mem accounting" · 021c60ff
      Jesper Dangaard Brouer authored
      
      [ Upstream commit fb452a1a ]
      
      This reverts commit 6d7b857d.
      
      There is a bug in fragmentation codes use of the percpu_counter API,
      that can cause issues on systems with many CPUs.
      
      The frag_mem_limit() just reads the global counter (fbc->count),
      without considering other CPUs can have upto batch size (130K) that
      haven't been subtracted yet.  Due to the 3MBytes lower thresh limit,
      this become dangerous at >=24 CPUs (3*1024*1024/130000=24).
      
      The correct API usage would be to use __percpu_counter_compare() which
      does the right thing, and takes into account the number of (online)
      CPUs and batch size, to account for this and call __percpu_counter_sum()
      when needed.
      
      We choose to revert the use of the lib/percpu_counter API for frag
      memory accounting for several reasons:
      
      1) On systems with CPUs > 24, the heavier fully locked
         __percpu_counter_sum() is always invoked, which will be more
         expensive than the atomic_t that is reverted to.
      
      Given systems with more than 24 CPUs are becoming common this doesn't
      seem like a good option.  To mitigate this, the batch size could be
      decreased and thresh be increased.
      
      2) The add_frag_mem_limit+sub_frag_mem_limit pairs happen on the RX
         CPU, before SKBs are pushed into sockets on remote CPUs.  Given
         NICs can only hash on L2 part of the IP-header, the NIC-RXq's will
         likely be limited.  Thus, a fair chance that atomic add+dec happen
         on the same CPU.
      
      Revert note that commit 1d6119ba ("net: fix percpu memory leaks")
      removed init_frag_mem_limit() and instead use inet_frags_init_net().
      After this revert, inet_frags_uninit_net() becomes empty.
      
      Fixes: 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting")
      Fixes: 1d6119ba ("net: fix percpu memory leaks")
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      021c60ff
  2. 13 Sep, 2017 2 commits
    • Greg Kroah-Hartman's avatar
      Linux 4.13.2 · 07dd6cc1
      Greg Kroah-Hartman authored
      07dd6cc1
    • Richard Wareing's avatar
      xfs: XFS_IS_REALTIME_INODE() should be false if no rt device present · 24cb3325
      Richard Wareing authored
      commit b31ff3cd upstream.
      
      If using a kernel with CONFIG_XFS_RT=y and we set the RHINHERIT flag on
      a directory in a filesystem that does not have a realtime device and
      create a new file in that directory, it gets marked as a real time file.
      When data is written and a fsync is issued, the filesystem attempts to
      flush a non-existent rt device during the fsync process.
      
      This results in a crash dereferencing a null buftarg pointer in
      xfs_blkdev_issue_flush():
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: xfs_blkdev_issue_flush+0xd/0x20
        .....
        Call Trace:
          xfs_file_fsync+0x188/0x1c0
          vfs_fsync_range+0x3b/0xa0
          do_fsync+0x3d/0x70
          SyS_fsync+0x10/0x20
          do_syscall_64+0x4d/0xb0
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Setting RT inode flags does not require special privileges so any
      unprivileged user can cause this oops to occur.  To reproduce, confirm
      kernel is compiled with CONFIG_XFS_RT=y and run:
      
        # mkfs.xfs -f /dev/pmem0
        # mount /dev/pmem0 /mnt/test
        # mkdir /mnt/test/foo
        # xfs_io -c 'chattr +t' /mnt/test/foo
        # xfs_io -f -c 'pwrite 0 5m' -c fsync /mnt/test/foo/bar
      
      Or just run xfstests with MKFS_OPTIONS="-d rtinherit=1" and wait.
      
      Kernels built with CONFIG_XFS_RT=n are not exposed to this bug.
      
      Fixes: f538d4da ("[XFS] write barrier support")
      Signed-off-by: default avatarRichard Wareing <rwareing@fb.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      24cb3325