1. 27 Aug, 2019 8 commits
      raid5 improve too many read errors msg by adding limits · 0009fad0
      Nigel Croxon authored
      Limits can often be changed by the admin.  When discussing such
      things it helps if you can provide "self-sustained" facts.  Also,
      sometimes the admin thinks he changed a limit, but it did not
      take effect for some reason, or he changed the wrong thing.
      
      V3: Only pr_warn when Faulty is 0.
      V2: Add read_errors value to pr_warn.
      Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      md: don't report active array_state until after revalidate_disk() completes. · 9d4b45d6
      NeilBrown authored
      Until revalidate_disk() has completed, the size of a new md array will
      appear to be zero.
      So we shouldn't report, through array_state, that the array is active
      until that time.
      udev rules check array_state to see if the array is ready.  As soon
      as it appears to be ready, fsck can be run.  If fsck finds the size
      to be zero, it will fail.
      
      So add a new flag to provide an interlock between do_md_run() and
      array_state_show().  This flag is set while do_md_run() is active and
      it prevents array_state_show() from reporting that the array is
      active.
      
      Before do_md_run() is called, ->pers will be NULL so array is
      definitely not active.
      After do_md_run() is called, revalidate_disk() will have run and the
      array will be completely ready.
      
      We also move various sysfs_notify*() calls out of md_run() into
      do_md_run() after MD_NOT_READY is cleared.  This ensure the
      information is ready before the notification is sent.
      
      Prior to v4.12, array_state_show() was called with the
      mddev->reconfig_mutex held, which provided exclusion with do_md_run().
      
      Note that MD_NOT_READY is cleared twice.  This is deliberate to
      cover both success and error paths with minimal noise.
      
      Fixes: b7b17c9b ("md: remove mddev_lock() from md_attr_show()")
      Cc: stable@vger.kernel.org (v4.12++)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      md: only call set_in_sync() when it is expected to succeed. · 480523fe
      NeilBrown authored
      Since commit 4ad23a97 ("MD: use per-cpu counter for
      writes_pending"), set_in_sync() is substantially more expensive: it
      can wait for a full RCU grace period which can be 10s of milliseconds.
      
      So we should only call it when the cost is justified.
      
      md_check_recovery() currently calls set_in_sync() every time it finds
      anything to do (on non-external active arrays).  For an array
      performing resync or recovery, this will be quite often.
      Each call will introduce a delay to the md thread, which can
      noticeably affect IO submission latency.
      
      In md_check_recovery() we only need to call set_in_sync() if
      'safemode' was non-zero at entry, meaning that there has been no
      recent IO.  So we save this "safemode was non-zero" state, and only
      call set_in_sync() if it was non-zero.
      
      This measurably reduces mean and maximum IO submission latency during
      resync/recovery.
      Reported-and-tested-by: Jack Wang <jinpu.wang@cloud.ionos.com>
      Fixes: 4ad23a97 ("MD: use per-cpu counter for writes_pending")
      Cc: stable@vger.kernel.org (v4.12+)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      writeback, memcg: Implement foreign dirty flushing · 97b27821
      Tejun Heo authored
      There's an inherent mismatch between memcg and writeback.  The
      former tracks ownership per page while the latter tracks it per
      inode.  This was a deliberate design decision because honoring
      per-page ownership in the writeback path is complicated, may lead
      to higher CPU and IO overheads, and was deemed unnecessary given
      that write-sharing an inode across different cgroups isn't a
      common use case.
      
      Combined with inode majority-writer ownership switching, this works
      well enough in most cases but there are some pathological cases.  For
      example, let's say there are two cgroups A and B which keep writing to
      different but confined parts of the same inode.  B owns the inode and
      A's memory is limited far below B's.  A's dirty ratio can rise enough
      to trigger balance_dirty_pages() sleeps but B's can be low enough to
      avoid triggering background writeback.  A will be slowed down without
      a way to make writeback of the dirty pages happen.
      
      This patch implements foreign dirty recording and a foreign
      flushing mechanism so that when a memcg encounters a condition as
      above it can trigger flushes on the bdi_writebacks which can clean
      its pages.  Please see the comment on top of
      mem_cgroup_track_foreign_dirty_slowpath() for details.
      
      A reproducer follows.
      
      write-range.c::
      
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <fcntl.h>
        #include <sys/types.h>
      
        static const char *usage = "write-range FILE START SIZE\n";
      
        int main(int argc, char **argv)
        {
      	  int fd;
      	  unsigned long start, size, end, pos;
      	  char *endp;
      	  char buf[4096];
      
      	  if (argc < 4) {
      		  fprintf(stderr, usage);
      		  return 1;
      	  }
      
      	  fd = open(argv[1], O_WRONLY);
      	  if (fd < 0) {
      		  perror("open");
      		  return 1;
      	  }
      
      	  start = strtoul(argv[2], &endp, 0);
      	  if (*endp != '\0') {
      		  fprintf(stderr, usage);
      		  return 1;
      	  }
      
      	  size = strtoul(argv[3], &endp, 0);
      	  if (*endp != '\0') {
      		  fprintf(stderr, usage);
      		  return 1;
      	  }
      
      	  end = start + size;
      
      	  while (1) {
      		  for (pos = start; pos < end; ) {
      			  long bread, bwritten = 0;
      
      			  if (lseek(fd, pos, SEEK_SET) < 0) {
      				  perror("lseek");
      				  return 1;
      			  }
      
      			  bread = read(0, buf, sizeof(buf) < end - pos ?
      					       sizeof(buf) : end - pos);
      			  if (bread < 0) {
      				  perror("read");
      				  return 1;
      			  }
      			  if (bread == 0)
      				  return 0;
      
      			  while (bwritten < bread) {
      				  long this;
      
      				  this = write(fd, buf + bwritten,
      					       bread - bwritten);
      				  if (this < 0) {
      					  perror("write");
      					  return 1;
      				  }
      
      				  bwritten += this;
      				  pos += this;
      			  }
      		  }
      	  }
        }
      
      repro.sh::
      
        #!/bin/bash
      
        set -e
        set -x
      
        sysctl -w vm.dirty_expire_centisecs=300000
        sysctl -w vm.dirty_writeback_centisecs=300000
        sysctl -w vm.dirtytime_expire_seconds=300000
        echo 3 > /proc/sys/vm/drop_caches
      
        TEST=/sys/fs/cgroup/test
        A=$TEST/A
        B=$TEST/B
      
        mkdir -p $A $B
        echo "+memory +io" > $TEST/cgroup.subtree_control
        echo $((1<<30)) > $A/memory.high
        echo $((32<<30)) > $B/memory.high
      
        rm -f testfile
        touch testfile
        fallocate -l 4G testfile
      
        echo "Starting B"
      
        (echo $BASHPID > $B/cgroup.procs
         pv -q --rate-limit 70M < /dev/urandom | ./write-range testfile $((2<<30)) $((2<<30))) &
      
        echo "Waiting 10s to ensure B claims the testfile inode"
        sleep 5
        sync
        sleep 5
        sync
        echo "Starting A"
      
        (echo $BASHPID > $A/cgroup.procs
         pv < /dev/urandom | ./write-range testfile 0 $((2<<30)))
      
      v2: Added comments explaining why the specific intervals are being used.
      
      v3: Use 0 @nr when calling cgroup_writeback_by_id() to use best-effort
          flushing while avoiding possible livelocks.
      
      v4: Use get_jiffies_64() and time_before/after64() instead of raw
          jiffies_64 and arithmetic comparisons as suggested by Jan.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      writeback, memcg: Implement cgroup_writeback_by_id() · d62241c7
      Tejun Heo authored
      Implement cgroup_writeback_by_id() which initiates cgroup writeback
      from bdi and memcg IDs.  This will be used by memcg foreign inode
      flushing.
      
      v2: Use wb_get_lookup() instead of wb_get_create() to avoid creating
          spurious wbs.
      
      v3: Interpret 0 @nr as 1.25 * nr_dirty to implement best-effort
          flushing while avoiding possible livelocks.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      writeback: Separate out wb_get_lookup() from wb_get_create() · ed288dc0
      Tejun Heo authored
      Separate out wb_get_lookup(), which only looks up an existing wb
      and doesn't try to create one, from wb_get_create().  This will be
      used by later patches.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      bdi: Add bdi->id · 34f8fe50
      Tejun Heo authored
      There currently is no way to universally identify and lookup a bdi
      without holding a reference and pointer to it.  This patch adds a
      non-recycling bdi->id and implements bdi_get_by_id() which looks up
      bdis by their ids.  This will be used by memcg foreign inode flushing.
      
      I left bdi_list alone for simplicity, and because while rb_tree
      does support RCU assignment it doesn't seem to guarantee a lossless
      walk when the walk is racing against tree rebalance operations.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      writeback: Generalize and expose wb_completion · 5b9cce4c
      Tejun Heo authored
      wb_completion is used to track writeback completions.  We want to use
      it from memcg side for foreign inode flushes.  This patch updates it
      to remember the target waitq instead of assuming bdi->wb_waitq and
      expose it outside of fs-writeback.c.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 23 Aug, 2019 7 commits
  3. 22 Aug, 2019 3 commits
  4. 20 Aug, 2019 8 commits
  5. 19 Aug, 2019 1 commit
  6. 15 Aug, 2019 2 commits
      writeback, cgroup: inode_switch_wbs() shouldn't give up on wb_switch_rwsem trylock fail · 6444f47e
      Tejun Heo authored
      As inode wb switching may make sync(2) miss some inodes, they're
      synchronized using wb_switch_rwsem so that no wb switching happens
      while sync(2) is in progress.  In addition to synchronizing the actual
      switching, the rwsem is also used to prevent queueing new switch
      attempts while sync(2) is in progress.  This is to avoid queueing too
      many instances while the rwsem is held by sync(2).  Unfortunately,
      this is too aggressive and can block wb switching for a long time if
      sync(2) is frequent.
      
      The goal is avoiding exploding the number of scheduled switches, not
      avoiding scheduling anything.  Let's use wb_switch_rwsem only for
      synchronizing the actual switching and sync(2) and use
      isw_nr_in_flight instead for limiting the maximum number of scheduled
      switches.  The limit is set to 1024 which should be more than enough
      while still avoiding extreme situations.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      writeback, cgroup: Adjust WB_FRN_TIME_CUT_DIV to accelerate foreign inode switching · 55a694df
      Tejun Heo authored
      WB_FRN_TIME_CUT_DIV is used to tell the foreign inode detection logic
      to ignore short writeback rounds to prevent getting confused by a
      burst of short writebacks.  The parameter is currently 2 meaning that
      anything smaller than half of the running average writeback duration
      will be ignored.
      
      This is unnecessarily aggressive.  The detection logic uses 16 history
      slots and is already reasonably protected against some short bursts
      confusing it and the current parameter can lead to tens of seconds of
      missed detection depending on the writeback pattern.
      
      Let's change the parameter to 8, so that it only ignores writebacks
      that are smaller than 12.5% of the current running average.
      
      v2: Add comment explaining what's going on with the foreign detection
          parameters.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 14 Aug, 2019 1 commit
  8. 12 Aug, 2019 1 commit
  9. 09 Aug, 2019 3 commits
  10. 08 Aug, 2019 2 commits
  11. 07 Aug, 2019 4 commits