1. 27 Aug, 2019 7 commits
    • block: don't hold q->sysfs_lock in elevator_init_mq · c48dac13
      Ming Lei authored
      The original comment says:
      
      	q->sysfs_lock must be held to provide mutual exclusion between
      	elevator_switch() and here.
      
      This is simply wrong. elevator_init_mq() is only called from
      blk_mq_init_allocated_queue(), which is always called before the
      request queue is registered via blk_register_queue(), both for dm-rq
      and for normal request-based drivers. The queue's kobject is only
      exposed and added to sysfs in blk_register_queue(), so there is no
      such race between elevator_switch() and elevator_init_mq().

      So avoid holding q->sysfs_lock in elevator_init_mq().
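
      For context, elevator_switch() is driven from userspace through the
      queue's "scheduler" sysfs attribute, which only appears once
      blk_register_queue() has run; e.g. (device name is a placeholder):

        # the attribute only exists after blk_register_queue(); until then
        # there is nothing for elevator_switch() to race with
        cat /sys/block/sda/queue/scheduler
        echo mq-deadline > /sys/block/sda/queue/scheduler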
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c48dac13
    • block: Remove blk_mq_register_dev() · 9685b227
      Bart Van Assche authored
      This function has no callers. Hence remove it.
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9685b227
    • writeback, memcg: Implement foreign dirty flushing · 97b27821
      Tejun Heo authored
      There's an inherent mismatch between memcg and writeback.  The former
      tracks ownership per-page while the latter does so per-inode.  This
      was a deliberate design decision because honoring per-page ownership
      in the writeback path is complicated, may lead to higher CPU and IO
      overheads, and was deemed unnecessary given that write-sharing an
      inode across different cgroups isn't a common use-case.
      
      Combined with inode majority-writer ownership switching, this works
      well enough in most cases but there are some pathological cases.  For
      example, let's say there are two cgroups A and B which keep writing to
      different but confined parts of the same inode.  B owns the inode and
      A's memory is limited far below B's.  A's dirty ratio can rise enough
      to trigger balance_dirty_pages() sleeps but B's can be low enough to
      avoid triggering background writeback.  A will be slowed down without
      a way to make writeback of the dirty pages happen.
      
      This patch implements foreign dirty recording and a foreign flushing
      mechanism so that when a memcg encounters a condition like the above
      it can trigger flushes on the bdi_writebacks which can clean its
      pages.  Please see the comment on top of
      mem_cgroup_track_foreign_dirty_slowpath() for details.
      
      A reproducer follows.
      
      write-range.c::
      
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <fcntl.h>
        #include <sys/types.h>
      
        static const char *usage = "write-range FILE START SIZE\n";
      
        int main(int argc, char **argv)
        {
      	  int fd;
      	  unsigned long start, size, end, pos;
      	  char *endp;
      	  char buf[4096];
      
      	  if (argc < 4) {
      		  fprintf(stderr, usage);
      		  return 1;
      	  }
      
      	  fd = open(argv[1], O_WRONLY);
      	  if (fd < 0) {
      		  perror("open");
      		  return 1;
      	  }
      
      	  start = strtoul(argv[2], &endp, 0);
      	  if (*endp != '\0') {
      		  fprintf(stderr, usage);
      		  return 1;
      	  }
      
      	  size = strtoul(argv[3], &endp, 0);
      	  if (*endp != '\0') {
      		  fprintf(stderr, usage);
      		  return 1;
      	  }
      
      	  end = start + size;
      
      	  while (1) {
      		  for (pos = start; pos < end; ) {
      			  long bread, bwritten = 0;
      
      			  if (lseek(fd, pos, SEEK_SET) < 0) {
      				  perror("lseek");
      				  return 1;
      			  }
      
      			  bread = read(0, buf, sizeof(buf) < end - pos ?
      					       sizeof(buf) : end - pos);
      			  if (bread < 0) {
      				  perror("read");
      				  return 1;
      			  }
      			  if (bread == 0)
      				  return 0;
      
      			  while (bwritten < bread) {
      				  long this;
      
      				  this = write(fd, buf + bwritten,
      					       bread - bwritten);
      				  if (this < 0) {
      					  perror("write");
      					  return 1;
      				  }
      
      				  bwritten += this;
      				  pos += bwritten;
      			  }
      		  }
      	  }
        }
      
      repro.sh::
      
        #!/bin/bash
      
        set -e
        set -x
      
        sysctl -w vm.dirty_expire_centisecs=300000
        sysctl -w vm.dirty_writeback_centisecs=300000
        sysctl -w vm.dirtytime_expire_seconds=300000
        echo 3 > /proc/sys/vm/drop_caches
      
        TEST=/sys/fs/cgroup/test
        A=$TEST/A
        B=$TEST/B
      
        mkdir -p $A $B
        echo "+memory +io" > $TEST/cgroup.subtree_control
        echo $((1<<30)) > $A/memory.high
        echo $((32<<30)) > $B/memory.high
      
        rm -f testfile
        touch testfile
        fallocate -l 4G testfile
      
        echo "Starting B"
      
        (echo $BASHPID > $B/cgroup.procs
         pv -q --rate-limit 70M < /dev/urandom | ./write-range testfile $((2<<30)) $((2<<30))) &
      
        echo "Waiting 10s to ensure B claims the testfile inode"
        sleep 5
        sync
        sleep 5
        sync
        echo "Starting A"
      
        (echo $BASHPID > $A/cgroup.procs
         pv < /dev/urandom | ./write-range testfile 0 $((2<<30)))
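
      To try the reproducer (assuming a cgroup2 hierarchy mounted at
      /sys/fs/cgroup, pv installed, gcc available, and both files in the
      current directory), something like:

        gcc -O2 -o write-range write-range.c
        bash repro.sh    # run as root; it writes sysctls and cgroup files

      Without the fix, the pv feeding cgroup A's writer quickly throttles
      to a crawl because A has no way to get its own dirty pages written
      back; with the fix it should keep making progress.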
      
      v2: Added comments explaining why the specific intervals are being used.
      
      v3: Use 0 @nr when calling cgroup_writeback_by_id() to use best-effort
          flushing while avoiding possible livelocks.
      
      v4: Use get_jiffies_64() and time_before/after64() instead of raw
          jiffies_64 and arithmetic comparisons as suggested by Jan.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      97b27821
    • writeback, memcg: Implement cgroup_writeback_by_id() · d62241c7
      Tejun Heo authored
      Implement cgroup_writeback_by_id() which initiates cgroup writeback
      from bdi and memcg IDs.  This will be used by memcg foreign inode
      flushing.
      
      v2: Use wb_get_lookup() instead of wb_get_create() to avoid creating
          spurious wbs.
      
      v3: Interpret 0 @nr as 1.25 * nr_dirty to implement best-effort
          flushing while avoiding possible livelocks.
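
      As a rough worked example of the best-effort interpretation: if the
      target wb currently has, say, 8000 dirty pages, a 0 @nr request is
      treated as a request to write back about 8000 * 1.25 = 10000 pages,
      enough to cover what is dirty now plus some slack, without chasing
      pages that keep being dirtied concurrently.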
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d62241c7
    • writeback: Separate out wb_get_lookup() from wb_get_create() · ed288dc0
      Tejun Heo authored
      Separate out wb_get_lookup(), which doesn't try to create a wb if
      there isn't already one, from wb_get_create().  This will be used by
      later patches.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ed288dc0
    • bdi: Add bdi->id · 34f8fe50
      Tejun Heo authored
      There currently is no way to universally identify and look up a bdi
      without holding a reference and pointer to it.  This patch adds a
      non-recycling bdi->id and implements bdi_get_by_id(), which looks up
      bdis by their ids.  This will be used by memcg foreign inode flushing.

      I left bdi_list alone for simplicity and because, while rb_tree does
      support rcu assignment, it doesn't seem to guarantee a lossless walk
      when the walk races against tree rebalance operations.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      34f8fe50
    • writeback: Generalize and expose wb_completion · 5b9cce4c
      Tejun Heo authored
      wb_completion is used to track writeback completions.  We want to use
      it from the memcg side for foreign inode flushes.  This patch updates
      it to remember the target waitq instead of assuming bdi->wb_waitq and
      exposes it outside of fs-writeback.c.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5b9cce4c
  2. 23 Aug, 2019 7 commits
  3. 22 Aug, 2019 3 commits
  4. 20 Aug, 2019 8 commits
  5. 19 Aug, 2019 1 commit
  6. 15 Aug, 2019 2 commits
    • writeback, cgroup: inode_switch_wbs() shouldn't give up on wb_switch_rwsem trylock fail · 6444f47e
      Tejun Heo authored
      As inode wb switching may make sync(2) miss some inodes, the two are
      synchronized using wb_switch_rwsem so that no wb switching happens
      while sync(2) is in progress.  In addition to synchronizing the actual
      switching, the rwsem is also used to prevent queueing new switch
      attempts while sync(2) is in progress.  This is to avoid queueing too
      many instances while the rwsem is held by sync(2).  Unfortunately,
      this is too aggressive and can block wb switching for a long time if
      sync(2) is frequent.

      The goal is avoiding exploding the number of scheduled switches, not
      avoiding scheduling anything.  Let's use wb_switch_rwsem only for
      synchronizing the actual switching against sync(2) and use
      isw_nr_in_flight instead for limiting the maximum number of scheduled
      switches.  The limit is set to 1024, which should be more than enough
      while still avoiding extreme situations.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6444f47e
    • writeback, cgroup: Adjust WB_FRN_TIME_CUT_DIV to accelerate foreign inode switching · 55a694df
      Tejun Heo authored
      WB_FRN_TIME_CUT_DIV is used to tell the foreign inode detection logic
      to ignore short writeback rounds to prevent getting confused by a
      burst of short writebacks.  The parameter is currently 2, meaning that
      anything smaller than half of the running average writeback duration
      will be ignored.

      This is unnecessarily aggressive.  The detection logic uses 16 history
      slots and is already reasonably protected against short bursts
      confusing it, and the current parameter can lead to tens of seconds of
      missed detection depending on the writeback pattern.

      Let's change the parameter to 8, so that it only ignores writebacks
      which are smaller than 12.5% of the current running average.
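
      As a worked example with arbitrary numbers: if the running average
      writeback round is 1600 jiffies, rounds shorter than 800 jiffies used
      to be ignored by the detection logic; with the divisor at 8, only
      rounds shorter than 200 jiffies (12.5% of the average) are ignored.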
      
      v2: Add comment explaining what's going on with the foreign detection
          parameters.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      55a694df
  7. 14 Aug, 2019 1 commit
  8. 12 Aug, 2019 1 commit
  9. 09 Aug, 2019 3 commits
  10. 08 Aug, 2019 2 commits
  11. 07 Aug, 2019 5 commits
    • Merge branch 'md-next' of https://github.com/liu-song-6/linux into for-5.4/block · e8fc87f6
      Jens Axboe authored
      Pull MD changes from Song.
      
      * 'md-next' of https://github.com/liu-song-6/linux:
        raid1: factor out a common routine to handle the completion of sync write
        md: don't call spare_active in md_reap_sync_thread if all member devices can't work
        md: don't set In_sync if array is frozen
        md: allow last device to be forcibly removed from RAID1/RAID10.
        md: Convert to use int_pow()
        md/raid10: end bio when the device faulty
        md/raid1: end bio when the device faulty
        md/raid6: Set R5_ReadError when there is read failure on parity disk
        raid1: use an int as the return value of raise_barrier()
      e8fc87f6
    • raid1: factor out a common routine to handle the completion of sync write · 449808a2
      Hou Tao authored
      It's just code clean-up.
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      449808a2
    • md: don't call spare_active in md_reap_sync_thread if all member devices can't work · 0d8ed0e9
      Guoqing Jiang authored
      When a disk is added to an array, md_reap_sync_thread() is responsible
      for activating the spare and setting the In_sync flag for the new
      member in spare_active().

      But say raid1 has one member disk A, and disk B is added to the array.
      If we then offline A before all the data has been synchronized from A
      to B, B obviously doesn't have the latest data that A had, yet B is
      still marked with the In_sync flag.

      So let's not call spare_active() under that condition; otherwise B is
      still shown in the 'U' state, which is not correct.
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      0d8ed0e9
    • md: don't set In_sync if array is frozen · 062f5b2a
      Guoqing Jiang authored
      When a disk is added to an array, the following path is called in
      mdadm.

      Manage_subdevs -> sysfs_freeze_array
                     -> Manage_add
                     -> sysfs_set_str(&info, NULL, "sync_action","idle")

      Then, from the kernel side, Manage_add invokes the path (add_new_disk
      -> validate_super = super_1_validate) to set the In_sync flag.

      Since In_sync means "device is in_sync with rest of array", the newly
      added disk needs the resync thread to synchronize its data, and
      md_reap_sync_thread() will call spare_active() to set In_sync for the
      newly added disk eventually.  So don't set In_sync if the array is
      frozen.
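
      For reference, roughly the shell equivalent of the mdadm sequence
      above (device names are placeholders):

        echo frozen > /sys/block/md0/md/sync_action   # sysfs_freeze_array()
        mdadm --manage /dev/md0 --add /dev/sdc        # Manage_add -> add_new_disk
        echo idle > /sys/block/md0/md/sync_action     # let the resync thread run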
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      062f5b2a
    • md: allow last device to be forcibly removed from RAID1/RAID10. · 9a567843
      Guoqing Jiang authored
      When the 'last' device in a RAID1 or RAID10 array reports an error,
      we do not mark it as failed.  This would serve little purpose, as
      there is no risk of losing data beyond that which is obviously lost
      (as there is with RAID5), and there could be other sectors on the
      device which are readable, and only readable from this device.  In
      general this maximises access to data.

      However, the current implementation also stops an admin from removing
      the last device by direct action.  This is rarely useful, but in many
      cases it is not harmful and can make automation easier by removing
      special cases.

      Also, if an attempt to write metadata fails, the device must be marked
      as faulty, else an infinite loop will result, attempting to update
      the metadata on all non-faulty devices.

      So add a 'fail_last_dev' member to 'struct mddev'; then we can bypass
      the 'last disk' checks for RAID1 and RAID10 and control the behavior
      per array by changing a sysfs node.
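
      A hypothetical usage example of the new node (path assumed to follow
      the usual /sys/block/<md>/md/ layout, device names are placeholders):

        echo 1 > /sys/block/md0/md/fail_last_dev      # allow failing the last device
        mdadm --manage /dev/md0 --fail /dev/sdb --remove /dev/sdb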
      Signed-off-by: NeilBrown <neilb@suse.de>
      [add sysfs node for fail_last_dev by Guoqing]
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      9a567843