1. 14 Jun, 2016 5 commits
    • dm raid: enhance status interface and fixup takeover/raid0 · 3a1c1ef2
      Heinz Mauelshagen authored
      The target's status interface has to provide the new 'data_offset' value
      to allow userspace to retrieve the kernel's offset to the data on each
      raid device of a raid set.  This is the basis for out-of-place reshaping,
      which is required to avoid writing over any data during reshaping (e.g.
      change raid6_zr -> raid6_nc):
      
       - add rs_set_cur() to be able to start up an existing array in case of
         no takeover; use in the ctr on the takeover check
      
       - enhance raid_status()
      
       - add supporting functions to get resync/reshape progress and raid
         device status chars
      
       - fix a rebuild table line output race, which misses emitting
         'rebuild N' on fully synced/rebuilt devices because it relies on
         the transient 'In_sync' raid device flag
      
       - add new status line output for 'data_offset', which will later be
         used for out-of-place reshaping
      
       - fixup takeover not working for all levels
      
       - fixup raid0 message interface oops caused by missing checks
         for the md threads, which don't exist in case of raid0
      
       - remove ALL_FREEZE_FLAGS not needed for takeover
      
       - adjust comments
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      3a1c1ef2
    • dm raid: add raid level takeover support · ecbfb9f1
      Heinz Mauelshagen authored
      Add raid level takeover support allowing arbitrary takeovers between
      raid levels supported by md personalities (i.e. raid0, raid1/10 and
      raid4/5/6):
      
       - add rs_config_{backup|restore} functions to allow temporarily
         storing ctr-requested layout changes and restoring them for the
         takeover conversion decision after the superblocks have been loaded
         and analyzed
      
       - add members to store the layout in 'struct raid_set' (not mandatory
         for takeover but needed for reshape in a later patch)
      
       - add rebuild_disks bitfield to 'struct raid_set' and set bits in the
         ctr to use in setting up takeover (the basis for addressing a
         'rebuild' related raid_status() table line bug, and needed as well
         for reshape in a future patch)
      
       - add runtime flags and respective manipulation functions to be able
         to control e.g. writing of superblocks to the preresume function on
         takeover and (later) reshape
      
       - add functions to detect takeover, check it is valid (mandatory here
         to avoid failing on md_run()), set up for it and use them in the
         ctr; these will likely be moved out once reshaping gets added, to
         simplify the ctr
      
       - start the raid set read-only in the ctr and switch to read-write,
         optionally updating superblocks, in preresume, in order to allow
         suspend to quiesce any active table beforehand (which involves
         superblock updates); this ensures the proper sequence of writing
         the current and any new takeover(/reshape) metadata
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      ecbfb9f1
    • dm raid: enhance super_sync() to support new superblock members · 7b34df74
      Heinz Mauelshagen authored
      Teach the super_sync() function to transfer the newly introduced
      takeover/reshape related superblock members:
      
       - add/move supporting functions
      
       - add failed devices bitfield transfer functions to retrieve the
         bitfield from superblock format or update it in the superblock
      
       - add code to transfer all new members
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      7b34df74
    • dm raid: add new reshaping/raid10 format table line options to parameter parser · 4763e543
      Heinz Mauelshagen authored
      Support the following arguments in the ctr parameter parser:
      
       - add 'delta_disks', 'data_offset' taking int and sector respectively
      
       - add 'raid10_use_near_sets' bool argument to optionally select
         near sets with supporting raid10 mappings
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      4763e543
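      The optional-argument handling described in this commit can be sketched
      in miniature.  This is a hedged illustration only, not the dm-raid code:
      parse_raid_opts, its token format, and the Python setting are
      hypothetical stand-ins for the kernel's C ctr parser, kept just to show
      the shape of the three new arguments.

```python
def parse_raid_opts(tokens):
    # Hypothetical sketch of parsing the new optional ctr arguments:
    # 'delta_disks' takes a signed int, 'data_offset' a sector count,
    # and 'raid10_use_near_sets' is a bare boolean flag with no value.
    opts = {"raid10_use_near_sets": False}
    it = iter(tokens)
    for tok in it:
        if tok == "delta_disks":
            opts["delta_disks"] = int(next(it))   # signed int
        elif tok == "data_offset":
            sectors = int(next(it))               # sector count
            if sectors < 0:
                raise ValueError("data_offset must be non-negative")
            opts["data_offset"] = sectors
        elif tok == "raid10_use_near_sets":
            opts["raid10_use_near_sets"] = True   # bool flag
        else:
            raise ValueError("unknown argument: " + tok)
    return opts

print(parse_raid_opts("delta_disks -1 data_offset 2048".split()))
```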
    • dm raid: introduce extended superblock and new raid types to support takeover/reshaping · 33e53f06
      Heinz Mauelshagen authored
      Add new members to the dm-raid superblock and new raid types to support
      takeover/reshape.
      
      Add all necessary members needed to support takeover and reshape in one
      go -- aiming to limit the amount of changes to the superblock layout.
      
      This is a larger patch due to the new superblock members, their related
      flags, validation of both and involved API additions/changes:
      
       - add additional members to keep track of:
         - state about forward/backward reshaping
         - reshape position
         - new level, layout, stripe size and delta disks
         - data offset to current and new data for out-of-place reshapes
         - failed devices bitfield extensions to keep track of max raid devices
      
       - adjust super_validate() to cope with new superblock members
      
       - adjust super_init_validation() to cope with new superblock members
      
       - add definitions for ctr flags supporting delta disks etc.
      
       - add new raid types (raid6_n_6 etc.)
      
       - add new raid10 supporting function API (_is_raid10_*())
      
       - adjust to changed raid10 supporting function API
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      33e53f06
  2. 13 Jun, 2016 6 commits
  3. 10 Jun, 2016 5 commits
    • dm mpath: add optional "queue_mode" feature · e83068a5
      Mike Snitzer authored
      Allow a user to specify an optional feature 'queue_mode <mode>' where
      <mode> may be "bio", "rq" or "mq" -- which corresponds to bio-based,
      request_fn rq-based, and blk-mq rq-based respectively.
      
      If the queue_mode feature isn't specified, the default for the
      "multipath" target is still "rq", but if dm_mod.use_blk_mq is set to Y
      it'll default to mode "mq".
      
      This new queue_mode feature introduces the ability for each multipath
      device to have its own queue_mode (whereas before this feature all
      multipath devices effectively had to have the same queue_mode).
      
      This commit also goes a long way to eliminate the awkward (ab)use of
      DM_TYPE_*, the associated filter_md_type() and other relatively fragile
      and difficult to maintain code.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      e83068a5
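      The default-selection rule described in this commit condenses to a few
      lines.  A sketch only, under hypothetical names (pick_queue_mode is not
      a real function in the target, which implements this in C in the
      multipath ctr): an explicit 'queue_mode <mode>' feature wins, otherwise
      the default is "rq" unless dm_mod.use_blk_mq is set.

```python
def pick_queue_mode(requested=None, use_blk_mq=False):
    # Hypothetical sketch of the per-device queue_mode selection:
    # explicit feature argument overrides; otherwise fall back to "rq",
    # or to "mq" when dm_mod.use_blk_mq=Y.
    valid = {"bio", "rq", "mq"}
    if requested is not None:
        if requested not in valid:
            raise ValueError("invalid queue_mode: " + requested)
        return requested
    return "mq" if use_blk_mq else "rq"

print(pick_queue_mode(), pick_queue_mode(use_blk_mq=True))
```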
    • dm mpath: reinstate bio-based support · 76e33fe4
      Mike Snitzer authored
      Add "multipath-bio" target that offers a bio-based multipath target as
      an alternative to the request-based "multipath" target -- but in a
      following commit "multipath-bio" will immediately be replaced by a new
      "queue_mode" feature for the "multipath" target which will allow
      bio-based mode to be selected.
      
      When DM multipath was originally converted from bio-based to
      request-based the motivation for the change was better dynamic load
      balancing (by leveraging block core's request-based IO schedulers, for
      merging and sorting, _before_ DM multipath would make the decision on
      where to steer the IO -- based on path load and/or availability).
      
      More background is available in this "Request-based Device-mapper
      multipath and Dynamic load balancing" paper:
      https://www.kernel.org/doc/ols/2007/ols2007v2-pages-235-244.pdf
      
      But we've now come full circle where significantly faster storage
      devices no longer need IOs to be made larger to drive optimal IO
      performance.  And even if they do there have been changes to the block
      and filesystem layers that help ensure upper layers are constructing
      larger IOs.  In addition, SCSI's differentiated IO errors will propagate
      through to bio-based IO completion hooks -- so that eliminates another
      historic justification for request-based DM multipath.  Lastly, the block
      layer's immutable biovec changes have made bio cloning cheaper than it
      has ever been; whereas request cloning is still relatively expensive
      (both on a CPU usage and memory footprint level).
      
      As such, bio-based DM multipath offers the promise of a more efficient
      IO path for high IOPs devices that are, or will be, emerging.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      76e33fe4
    • dm: move request-based code out to dm-rq.[hc] · 4cc96131
      Mike Snitzer authored
      Add some separation between bio-based and request-based DM core code.
      
      'struct mapped_device' and other DM-core-only structures and functions
      have been moved to dm-core.h, and all relevant DM core .c files have
      been updated to include dm-core.h rather than dm.h.
      
      DM targets should _never_ include dm-core.h!
      
      [block core merge conflict resolution from Stephen Rothwell]
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      4cc96131
    • block: bio: kill BIO_MAX_SIZE · 1a89694f
      Ming Lei authored
      No one needs this macro now, so remove it.  Basically only the number
      of bvecs in a bio matters, not the number of bytes in the bio.
      
      The motivation is supporting multipage bvecs, in which we only know
      the max count of bvecs supported in the bio.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      1a89694f
  4. 09 Jun, 2016 10 commits
    • cfq-iosched: temporarily boost queue priority for idle classes · b8269db4
      Jens Axboe authored
      If we're queuing REQ_PRIO IO and the task is running at an idle IO
      class, then temporarily boost the priority. This prevents livelocks
      due to priority inversion, when a low priority task is holding file
      system resources while attempting to do IO.
      
      An example of that is shown below. An ioniced idle task is holding
      the directory mutex, while a normal priority task is trying to do
      a directory lookup.
      
      [478381.198925] ------------[ cut here ]------------
      [478381.200315] INFO: task ionice:1168369 blocked for more than 120 seconds.
      [478381.201324]       Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
      [478381.202278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [478381.203462] ionice          D ffff8803692736a8     0 1168369      1 0x00000080
      [478381.203466]  ffff8803692736a8 ffff880399c21300 ffff880276adcc00 ffff880369273698
      [478381.204589]  ffff880369273fd8 0000000000000000 7fffffffffffffff 0000000000000002
      [478381.205752]  ffffffff8177d5e0 ffff8803692736c8 ffffffff8177cea7 0000000000000000
      [478381.206874] Call Trace:
      [478381.207253]  [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80
      [478381.208175]  [<ffffffff8177cea7>] schedule+0x37/0x90
      [478381.208932]  [<ffffffff8177f5fc>] schedule_timeout+0x1dc/0x250
      [478381.209805]  [<ffffffff81421c17>] ? __blk_run_queue+0x37/0x50
      [478381.210706]  [<ffffffff810ca1c5>] ? ktime_get+0x45/0xb0
      [478381.211489]  [<ffffffff8177c407>] io_schedule_timeout+0xa7/0x110
      [478381.212402]  [<ffffffff810a8c2b>] ? prepare_to_wait+0x5b/0x90
      [478381.213280]  [<ffffffff8177d616>] bit_wait_io+0x36/0x50
      [478381.214063]  [<ffffffff8177d325>] __wait_on_bit+0x65/0x90
      [478381.214961]  [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80
      [478381.215872]  [<ffffffff8177d47c>] out_of_line_wait_on_bit+0x7c/0x90
      [478381.216806]  [<ffffffff810a89f0>] ? wake_atomic_t_function+0x40/0x40
      [478381.217773]  [<ffffffff811f03aa>] __wait_on_buffer+0x2a/0x30
      [478381.218641]  [<ffffffff8123c557>] ext4_bread+0x57/0x70
      [478381.219425]  [<ffffffff8124498c>] __ext4_read_dirblock+0x3c/0x380
      [478381.220467]  [<ffffffff8124665d>] ext4_dx_find_entry+0x7d/0x170
      [478381.221357]  [<ffffffff8114c49e>] ? find_get_entry+0x1e/0xa0
      [478381.222208]  [<ffffffff81246bd4>] ext4_find_entry+0x484/0x510
      [478381.223090]  [<ffffffff812471a2>] ext4_lookup+0x52/0x160
      [478381.223882]  [<ffffffff811c401d>] lookup_real+0x1d/0x60
      [478381.224675]  [<ffffffff811c4698>] __lookup_hash+0x38/0x50
      [478381.225697]  [<ffffffff817745bd>] lookup_slow+0x45/0xab
      [478381.226941]  [<ffffffff811c690e>] link_path_walk+0x7ae/0x820
      [478381.227880]  [<ffffffff811c6a42>] path_init+0xc2/0x430
      [478381.228677]  [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20
      [478381.229776]  [<ffffffff811c8c57>] path_openat+0x77/0x620
      [478381.230767]  [<ffffffff81185c6e>] ? page_add_file_rmap+0x2e/0x70
      [478381.232019]  [<ffffffff811cb253>] do_filp_open+0x43/0xa0
      [478381.233016]  [<ffffffff8108c4a9>] ? creds_are_invalid+0x29/0x70
      [478381.234072]  [<ffffffff811c0cb0>] do_open_execat+0x70/0x170
      [478381.235039]  [<ffffffff811c1bf8>] do_execveat_common.isra.36+0x1b8/0x6e0
      [478381.236051]  [<ffffffff811c214c>] do_execve+0x2c/0x30
      [478381.236809]  [<ffffffff811ca392>] ? getname+0x12/0x20
      [478381.237564]  [<ffffffff811c23be>] SyS_execve+0x2e/0x40
      [478381.238338]  [<ffffffff81780a1d>] stub_execve+0x6d/0xa0
      [478381.239126] ------------[ cut here ]------------
      [478381.239915] ------------[ cut here ]------------
      [478381.240606] INFO: task python2.7:1168375 blocked for more than 120 seconds.
      [478381.242673]       Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
      [478381.243653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [478381.244902] python2.7       D ffff88005cf8fb98     0 1168375 1168248 0x00000080
      [478381.244904]  ffff88005cf8fb98 ffff88016c1f0980 ffffffff81c134c0 ffff88016c1f11a0
      [478381.246023]  ffff88005cf8ffd8 ffff880466cd0cbc ffff88016c1f0980 00000000ffffffff
      [478381.247138]  ffff880466cd0cc0 ffff88005cf8fbb8 ffffffff8177cea7 ffff88005cf8fcc8
      [478381.248252] Call Trace:
      [478381.248630]  [<ffffffff8177cea7>] schedule+0x37/0x90
      [478381.249382]  [<ffffffff8177d08e>] schedule_preempt_disabled+0xe/0x10
      [478381.250465]  [<ffffffff8177e892>] __mutex_lock_slowpath+0x92/0x100
      [478381.251409]  [<ffffffff8177e91b>] mutex_lock+0x1b/0x2f
      [478381.252199]  [<ffffffff817745ae>] lookup_slow+0x36/0xab
      [478381.253023]  [<ffffffff811c690e>] link_path_walk+0x7ae/0x820
      [478381.253877]  [<ffffffff811aeb41>] ? try_charge+0xc1/0x700
      [478381.254690]  [<ffffffff811c6a42>] path_init+0xc2/0x430
      [478381.255525]  [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20
      [478381.256450]  [<ffffffff811c8c57>] path_openat+0x77/0x620
      [478381.257256]  [<ffffffff8115b2fb>] ? lru_cache_add_active_or_unevictable+0x2b/0xa0
      [478381.258390]  [<ffffffff8117b623>] ? handle_mm_fault+0x13f3/0x1720
      [478381.259309]  [<ffffffff811cb253>] do_filp_open+0x43/0xa0
      [478381.260139]  [<ffffffff811d7ae2>] ? __alloc_fd+0x42/0x120
      [478381.260962]  [<ffffffff811b95ac>] do_sys_open+0x13c/0x230
      [478381.261779]  [<ffffffff81011393>] ? syscall_trace_enter_phase1+0x113/0x170
      [478381.262851]  [<ffffffff811b96c2>] SyS_open+0x22/0x30
      [478381.263598]  [<ffffffff81780532>] system_call_fastpath+0x12/0x17
      [478381.264551] ------------[ cut here ]------------
      [478381.265377] ------------[ cut here ]------------
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      b8269db4
    • block: drbd: avoid to use BIO_MAX_SIZE · 8bf223c2
      Ming Lei authored
      Use BIO_MAX_PAGES instead and we will remove BIO_MAX_SIZE.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Tested-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      8bf223c2
    • block: bio: remove BIO_MAX_SECTORS · 30ac4607
      Ming Lei authored
      No one needs this macro, so remove it.  The motivation is supporting
      multipage bvecs, in which we only know the max count of bvecs supported
      in the bio, instead of a max size or max sectors.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Tested-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      30ac4607
    • fs: xfs: replace BIO_MAX_SECTORS with BIO_MAX_PAGES · c908e380
      Ming Lei authored
      BIO_MAX_PAGES is used as the maximum count of bvecs, so replace
      BIO_MAX_SECTORS with BIO_MAX_PAGES since BIO_MAX_SECTORS is to be
      removed.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Tested-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c908e380
    • iov_iter: use bvec iterator to implement iterate_bvec() · 1bdc76ae
      Ming Lei authored
      bvec has had a native, mature iterator for a long time, so there is no
      need to use a reinvented wheel for iterating bvecs in lib/iov_iter.c.
      
      Two ITER_BVEC test cases were run:
      	- xfstest (-g auto) on loop dio/aio: no regression found
      	- swap file works well under extreme stress (stress-ng --all 64
      	  -t 800 -v): lots of OOMs are triggered and the whole system
      	  still survives
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Tested-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      1bdc76ae
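      The kernel's bvec iterator keeps its state (segment index plus byte
      offset within the segment) separate from the segment array itself,
      which is what makes it reusable in lib/iov_iter.c.  A toy model of that
      idea, with hypothetical names and plain segment lengths standing in for
      real bio_vecs:

```python
def bvec_iter_advance(seg_lens, it, nbytes):
    # Toy model of the bvec iterator: 'it' is (segment index, byte offset
    # within that segment).  Advancing only updates the iterator state;
    # the segment list itself is never written to.
    idx, off = it
    while nbytes:
        step = min(seg_lens[idx] - off, nbytes)  # bytes left in this segment
        off += step
        nbytes -= step
        if off == seg_lens[idx]:                 # segment exhausted
            idx, off = idx + 1, 0
    return idx, off
```

      Advancing 700 bytes through segments of 512 and 1024 bytes lands 188
      bytes into the second segment; note that seg_lens is only read, which
      is the property the const-parameter commit below relies on.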
    • block: mark 1st parameter of bvec_iter_advance as const · 80f162ff
      Ming Lei authored
      bvec_iter_advance() only writes to the iterator parameter, so the bvec
      base address can safely be marked as const.
      
      Without this change, we would see a compiler warning in the following
      patch, which implements iterate_bvec() in lib/iov_iter.c with the bvec
      iterator.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Tested-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      80f162ff
    • block: move two bvec structure into bvec.h · 0781e79e
      Ming Lei authored
      This patch moves 'struct bio_vec' and 'struct bvec_iter' into
      'include/linux/bvec.h', then always includes this header in
      'include/linux/blk_types.h'.
      
      With this change, neither 'struct bvec_iter' nor the bvec iterator
      helpers depend on CONFIG_BLOCK any more, so we can use the bvec
      iterator to implement iterate_bvec() in lib/iov_iter.c.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Suggested-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Tested-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      0781e79e
    • block: move bvec iterator into include/linux/bvec.h · 8fc55455
      Ming Lei authored
      The bvec iterator helpers should also be used to implement
      iterate_bvec() in lib/iov_iter.c, so move them into one header in
      order to keep the bvec iterator out of CONFIG_BLOCK.  Then we can
      remove the reinvented wheel in iterate_bvec().
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Tested-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      8fc55455
    • blk-mq: actually hook up defer list when running requests · 52b9c330
      Omar Sandoval authored
      If ->queue_rq() returns BLK_MQ_RQ_QUEUE_OK, we use continue and skip
      over the rest of the loop body. However, dptr is assigned later in the
      loop body, and the BLK_MQ_RQ_QUEUE_OK case is exactly the case that we'd
      want it for.
      
      NVMe isn't actually using BLK_MQ_F_DEFER_ISSUE yet, nor is any other
      in-tree driver, but if the code's going to be there, it might as well
      work.
      
      Fixes: 74c45052 ("blk-mq: add a 'list' parameter to ->queue_rq()")
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      52b9c330
    • block: better packing for struct request · ca93e453
      Christoph Hellwig authored
      Keep the 32-bit CPU and cmd_type flags together to avoid holes on 64-bit
      architectures.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ca93e453
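      The padding rule behind this commit can be demonstrated with Python's
      ctypes.  The field names below are hypothetical, not struct request's
      actual layout; the point is only that interleaving 32-bit fields
      between pointers creates 4-byte holes on 64-bit architectures, while
      grouping the 32-bit fields together does not.

```python
import ctypes

class Interleaved(ctypes.Structure):
    # pointer, u32, pointer, u32: on 64-bit, each u32 is followed by a
    # 4-byte hole so the next pointer can be 8-byte aligned.
    _fields_ = [("a_ptr", ctypes.c_void_p),
                ("cpu", ctypes.c_uint32),
                ("b_ptr", ctypes.c_void_p),
                ("cmd_type", ctypes.c_uint32)]

class Grouped(ctypes.Structure):
    # keeping the two 32-bit fields adjacent lets them share one
    # 8-byte aligned slot, eliminating both holes.
    _fields_ = [("a_ptr", ctypes.c_void_p),
                ("cpu", ctypes.c_uint32),
                ("cmd_type", ctypes.c_uint32),
                ("b_ptr", ctypes.c_void_p)]

print(ctypes.sizeof(Interleaved), ctypes.sizeof(Grouped))
```

      On a 64-bit build this prints 32 versus 24 bytes, the same effect the
      commit gets by keeping the CPU and cmd_type flags together.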
  5. 08 Jun, 2016 4 commits
  6. 07 Jun, 2016 10 commits