1. 22 Jun, 2009 40 commits
    • dm: enable request based option · e6ee8c0b
      Kiyoshi Ueda authored
      This patch enables request-based dm.
      
      o Request-based dm and bio-based dm coexist, since some target
        drivers are better suited to bio-based dm.
        Also, there are other bio-based devices in the kernel
        (e.g. md, loop).
        Since a bio-based device can't receive a struct request,
        there are some limitations on device stacking between
        bio-based and request-based devices.
      
                           type of underlying device
                         bio-based      request-based
         ----------------------------------------------
          bio-based         OK                OK
          request-based     --                OK
      
        The device type is recognized by the queue flag in the kernel,
        so dm follows that.
      
      o The type of a dm device is decided at the first table binding time.
        Once the type of a dm device is decided, the type can't be changed.
      
      o Mempool allocations are deferred to table loading time, since
        mempools for request-based dm are different from those for bio-based
        dm and the needed mempool type is fixed by the type of table.
      
      o Currently, request-based dm supports only tables that have a single
        target.  To support multiple targets, we need to support request
        splitting or prevent bio/request from spanning multiple targets.
        The former needs lots of changes in the block layer, and the latter
        requires all target drivers to support the merge() function.
        Both will take time.
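
The stacking table and the type rules above can be sketched as a small userspace model. This is only an illustration in Python, not kernel code: `DevType`, `can_stack()` and the `MappedDevice` class are hypothetical names introduced here, and the single-target check mirrors the current limitation described in the last item.

```python
from enum import Enum


class DevType(Enum):
    BIO_BASED = "bio-based"
    REQUEST_BASED = "request-based"


def can_stack(upper: DevType, lower: DevType) -> bool:
    """Return True if a dm device of type `upper` may sit on top of `lower`."""
    # A bio-based underlying device cannot receive struct request, so a
    # request-based dm device cannot be stacked on it.
    return not (upper is DevType.REQUEST_BASED and lower is DevType.BIO_BASED)


class MappedDevice:
    """Hypothetical model: the type is decided by the first table bound."""

    def __init__(self):
        self.type = None

    def bind_table(self, table_type: DevType, num_targets: int = 1) -> None:
        if table_type is DevType.REQUEST_BASED and num_targets != 1:
            raise ValueError("request-based tables support a single target only")
        if self.type is None:
            self.type = table_type          # first binding decides the type
        elif self.type is not table_type:
            raise ValueError("device type cannot change after first binding")
```

Under this model, only the request-based-on-bio-based combination is rejected, matching the "--" cell of the table.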
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: prepare for request based option · cec47e3d
      Kiyoshi Ueda authored
      This patch adds core functions for request-based dm.
      
      When struct mapped_device (md) is initialized, md->queue has
      an I/O scheduler and the following functions are used for
      request-based dm as the queue functions:
          make_request_fn: dm_make_request()
          prep_rq_fn:      dm_prep_fn()
          request_fn:      dm_request_fn()
          softirq_done_fn: dm_softirq_done()
          lld_busy_fn:     dm_lld_busy()
      Actual initializations are done in another patch (PATCH 2).
      
      Below is a brief summary of how request-based dm behaves, including:
        - making request from bio
        - cloning, mapping and dispatching request
        - completing request and bio
        - suspending md
        - resuming md
      
        bio to request
        ==============
        md->queue->make_request_fn() (dm_make_request()) calls __make_request()
        for a bio submitted to the md.
        Then, the bio is kept in the queue as a new request or merged into
        another request in the queue if possible.
      
        Cloning and Mapping
        ===================
        Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
        when requests are dispatched after they are sorted by the I/O scheduler.
      
        dm_request_fn() checks the busy state of underlying devices using
        the target's busy() function and, if they are busy, stops dispatching
        requests so that they stay on the dm device's queue.
        This improves I/O merging, since no merging is done for a request
        once it is dispatched to underlying devices.
      
        Actual cloning and mapping are done in dm_prep_fn() and map_request()
        called from dm_request_fn().
        dm_prep_fn() clones not only request but also bios of the request
        so that dm can hold bio completion in error cases and prevent
        the bio submitter from noticing the error.
        (See the "Completion" section below for details.)
      
        After the cloning, the clone is mapped by the target's map_rq() function
        and inserted into the underlying device's queue using
        blk_insert_cloned_request().
      
        Completion
        ==========
        Request completion can be hooked by rq->end_io(), but by then all bios
        in the request will have been completed, even in error cases, and the
        bio submitter will have noticed the error.
        To prevent the bio completion in error cases, request-based dm clones
        both bio and request and hooks both bio->bi_end_io() and rq->end_io():
            bio->bi_end_io(): end_clone_bio()
            rq->end_io():     end_clone_request()
      
        Summary of the request completion flow is below:
        blk_end_request() for a clone request
          => blk_update_request()
             => bio->bi_end_io() == end_clone_bio() for each clone bio
                => Free the clone bio
                => Success: Complete the original bio (blk_update_request())
                   Error:   Don't complete the original bio
          => blk_finish_request()
             => rq->end_io() == end_clone_request()
                => blk_complete_request()
                   => dm_softirq_done()
                      => Free the clone request
                      => Success: Complete the original request (blk_end_request())
                         Error:   Requeue the original request
      
        end_clone_bio() completes the original request on the size of
        the original bio in successful cases.
        Even if all bios in the original request are completed by that
        completion, the original request must not be completed yet to keep
        the ordering of request completion for the stacking.
        So end_clone_bio() uses blk_update_request() instead of
        blk_end_request().
        In error cases, end_clone_bio() doesn't complete the original bio.
        It just frees the cloned bio and hands error handling over to
        end_clone_request().
      
        end_clone_request(), which is called with queue lock held, completes
        the clone request and the original request in a softirq context
        (dm_softirq_done()), which has no queue lock, to avoid a deadlock
        issue on submission of another request during the completion:
            - The submitted request may be mapped to the same device
            - Request submission requires the queue lock, but the completing
              context already holds it and the submitter doesn't know that
      
        The clone request has no clone bio when dm_softirq_done() is called.
        So target drivers can't resubmit it, even in error cases.
        Instead, they can ask dm core to requeue and remap
        the original request in those cases.
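
The completion flow described in this section can be sketched as a simplified userspace model. It is an assumption-laden illustration in Python, not the kernel implementation: the function names are borrowed from the text, but their signatures, the `OriginalRequest` class and the single error code for the whole clone are hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class OriginalRequest:
    bio_sizes: list            # sizes of the original bios, in bytes
    bytes_done: int = 0        # progress recorded by end_clone_bio()
    state: str = "in-flight"   # "in-flight", "completed" or "requeued"


def end_clone_bio(orig, size, error):
    # The kernel frees the clone bio here; the model only tracks progress.
    if error == 0:
        # Mirrors blk_update_request(): advance the original request,
        # but never finish it from this hook.
        orig.bytes_done += size
    # On error the original bio is left alone, so its submitter sees nothing.


def dm_softirq_done(orig, error):
    # Reached after end_clone_request() defers the work to softirq context.
    orig.state = "completed" if error == 0 else "requeued"


def complete_clone(orig, error):
    for size in orig.bio_sizes:
        end_clone_bio(orig, size, error)
    dm_softirq_done(orig, error)
    return orig.state
```

In the error case the original request keeps `bytes_done == 0`, which is what makes a later requeue and remap possible.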
      
        suspend
        =======
        Request-based dm suspends the md by stopping md->queue.
        For noflush suspend, it just stops md->queue.

        For flush suspend, it inserts a marker request at the tail of md->queue
        and dispatches all requests in md->queue until the marker comes to
        the front of md->queue.  Then it stops dispatching requests and waits
        for all dispatched requests to complete.
        After that, it completes the marker request, stops md->queue and
        wakes up the waiter on the suspend queue, md->wait.
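
The marker-based flush suspend can be pictured with a toy queue. This is a sketch in Python under the assumption that the queue is a simple FIFO; `MARKER` and `flush_suspend()` are hypothetical names, and waiting for completion is reduced to a comment.

```python
from collections import deque

MARKER = object()  # stands in for the marker request


def flush_suspend(queue: deque):
    """Dispatch everything queued ahead of the marker, then stop the queue."""
    queue.append(MARKER)               # the marker goes to the tail
    dispatched = []
    while queue[0] is not MARKER:      # dispatch until the marker is at the front
        dispatched.append(queue.popleft())
    # The kernel would now wait for all dispatched requests to complete.
    queue.popleft()                    # complete the marker request
    return dispatched                  # the queue is considered stopped from here
```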
      
        resume
        ======
        Starts md->queue.
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm raid1: add userspace log · f5db4af4
      Jonthan Brassow authored
      This patch contains a device-mapper mirror log module that forwards
      requests to userspace for processing.
      
      The structures used for communication between kernel and userspace are
      located in include/linux/dm-log-userspace.h.  Due to the frequency,
      diversity, and 2-way communication nature of the exchanges between
      kernel and userspace, 'connector' was chosen as the interface for
      communication.
      
      The first log implementations written in userspace - "clustered-disk"
      and "clustered-core" - support clustered shared storage.   A userspace
      daemon (in the LVM2 source code repository) uses openAIS/corosync to
      process requests in an ordered fashion with the rest of the nodes in the
      cluster so as to prevent log state corruption.  Other implementations,
      with no association to LVM or openAIS/corosync, are certainly possible.
      
      (Imagine if two machines are writing to the same region of a mirror.
      They would both mark the region dirty, but you need a cluster-aware
      entity that can handle properly marking the region clean when they are
      done.  Otherwise, you might clear the region when the first machine is
      done, not the second.)
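
The two-machine scenario above can be illustrated with a toy model. This is a Python sketch, not the actual userspace log: the class, its method signatures and the per-region set of writers are assumptions chosen to show why a cluster-aware entity is needed.

```python
from collections import defaultdict


class ClusterRegionLog:
    """Toy cluster-aware log: a region is clean only when no node has it dirty."""

    def __init__(self):
        self._writers = defaultdict(set)   # region -> ids of nodes writing to it

    def mark_region(self, node, region):
        self._writers[region].add(node)

    def clear_region(self, node, region):
        self._writers[region].discard(node)

    def is_clean(self, region):
        return not self._writers[region]
```

With two nodes writing to the same region, the region stays dirty after the first node clears it and becomes clean only after the second one does.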
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: calculate queue limits during resume not load · 754c5fc7
      Mike Snitzer authored
      Currently, device-mapper maintains a separate instance of 'struct
      queue_limits' for each table of each device.  When the configuration of
      a device is to be changed, first its table is loaded and this structure
      is populated, then the device is 'resumed' and the calculated
      queue_limits are applied.
      
      This places restrictions on how userspace may process related devices,
      where it is often advantageous to 'load' tables for several devices
      at once before 'resuming' them together.  As the new queue_limits
      only take effect after the 'resume', if they are changing and one
      device uses another, the latter must be 'resumed' before the former
      may be 'loaded'.
      
      This patch moves the calculation of these queue_limits out of
      the 'load' operation into 'resume'.  Since we are no longer
      pre-calculating this struct, we no longer need to maintain copies
      within our dm structs.
      
      dm_set_device_limits() now passes the 'start' of the device's
      data area (aka pe_start) as the 'offset' to blk_stack_limits().
      
      init_valid_queue_limits() is replaced by blk_set_default_limits().
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: martin.petersen@oracle.com
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm log: fix create_log_context to use logical_block_size of log device · 18d8594d
      Mike Snitzer authored
      create_log_context() must use the logical_block_size from the log disk,
      where the I/O happens, not the target's logical_block_size.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm target:s introduce iterate devices fn · af4874e0
      Mike Snitzer authored
      Add .iterate_devices to 'struct target_type' to allow a function to be
      called for all devices in a DM target.  Implemented it for all targets
      except those in dm-snap.c (origin and snapshot).
      
      (The raid1 version number jumps to 1.12 because we originally reserved
      1.1 to 1.11 for 'block_on_error' but ended up using 'handle_errors'
      instead.)
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Cc: martin.petersen@oracle.com
    • dm table: establish queue limits by copying table limits · 1197764e
      Mike Snitzer authored
      Copy the table's queue_limits to the DM device's request_queue.  This
      properly initializes the queue's topology limits and also avoids having
      to track the evolution of 'struct queue_limits' in
      dm_table_set_restrictions().
      
      Also fixes a bug that was introduced in dm_table_set_restrictions() via
      commit ae03bf63.  In addition to establishing 'bounce_pfn' in the
      queue's limits, blk_queue_bounce_limit() also performs an allocation
      to set up the ISA DMA pool.  This allocation
      resulted in "sleeping function called from invalid context" when called
      from dm_table_set_restrictions().
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm table: replace struct io_restrictions with struct queue_limits · 5ab97588
      Mike Snitzer authored
      Use blk_stack_limits() to stack block limits (including topology) rather
      than duplicate the equivalent within Device Mapper.
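
The general idea of stacking limits, taking the most restrictive value of each field, can be sketched as follows. This is a deliberately reduced Python model: `QueueLimits` here has only two fields and `stack_limits()` is a hypothetical helper, whereas the real blk_stack_limits() handles many more fields as well as offsets and alignment.

```python
from dataclasses import dataclass


@dataclass
class QueueLimits:
    max_sectors: int          # largest request the device accepts, in sectors
    logical_block_size: int   # smallest addressable unit, in bytes


def stack_limits(top: QueueLimits, bottom: QueueLimits) -> QueueLimits:
    """Combine limits so the result is acceptable to both devices."""
    return QueueLimits(
        # A request must fit the smaller of the two size limits.
        max_sectors=min(top.max_sectors, bottom.max_sectors),
        # I/O must respect the larger of the two block sizes.
        logical_block_size=max(top.logical_block_size, bottom.logical_block_size),
    )
```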
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm table: validate device logical_block_size · be6d4305
      Mike Snitzer authored
      Impose necessary and sufficient conditions on a device's table such
      that any incoming bio which respects its logical_block_size can be
      processed successfully.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm table: ensure targets are aligned to logical_block_size · 02acc3a4
      Mike Snitzer authored
      Ensure I/O is aligned to the logical block size of target devices.
      
      Rename check_device_area() to device_area_is_valid() for clarity and
      establish the device limits including the logical block size prior to
      calling it.
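
An alignment check of this kind might look like the sketch below. It is a Python illustration only: `target_is_aligned()` is a hypothetical helper, it assumes the 512-byte sectors used in device-mapper tables, and the real device_area_is_valid() performs further checks (for example against the device size) that are not modelled here.

```python
SECTOR_SIZE = 512  # bytes per sector, as used in device-mapper tables


def target_is_aligned(start_sector, len_sectors, logical_block_size):
    """Check that a target's start and length are whole logical blocks."""
    sectors_per_block = logical_block_size // SECTOR_SIZE
    return (start_sector % sectors_per_block == 0
            and len_sectors % sectors_per_block == 0)
```

For a device with 4096-byte logical blocks, a target starting at sector 63 fails the check, while the same start is acceptable on a 512-byte device.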
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm ioctl: support cookies for udev · 60935eb2
      Milan Broz authored
      Add support for passing a 32 bit "cookie" into the kernel with the
      DM_SUSPEND, DM_DEV_RENAME and DM_DEV_REMOVE ioctls.  The (unsigned)
      value of this cookie is returned to userspace alongside the uevents
      issued by these ioctls in the variable DM_COOKIE.
      
      This means the userspace process issuing these ioctls can be notified
      by udev after udev has completed any actions triggered.
      
      To minimise the interface extension, we pass the cookie into the
      kernel in the event_nr field which is otherwise unused when calling
      these ioctls.  Incrementing the version number allows userspace to
      determine in advance whether or not the kernel supports the cookie.
      If the kernel does support this but userspace does not, there should
      be no impact as the new variable will just get ignored.
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: sysfs add suspended attribute · 486d220f
      Peter Rajnoha authored
      Add a file named 'suspended' to each device-mapper device directory in
      sysfs.  It holds the value 1 while the device is suspended.  Otherwise
      it holds 0.
      Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm table: improve warning message when devices not freed before destruction · 1b6da754
      Jonthan Brassow authored
      Report any devices forgotten to be freed before a table is destroyed.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm mpath: add service time load balancer · f392ba88
      Kiyoshi Ueda authored
      This patch adds a service time oriented dynamic load balancer,
      dm-service-time, which selects the path with the shortest estimated
      service time for the incoming I/O.
      The service time is estimated by dividing the in-flight I/O size
      by a performance value of each path.
      
      The performance value can be given as a table argument at the table
      loading time.  If no performance value is given, all paths are
      considered equal.
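
The estimate described above can be written down directly. The following Python sketch is a model of the stated rule only (in-flight size divided by a performance value), not the dm-service-time code; `Path`, `estimated_service_time()` and `select_path()` are hypothetical names, and a default performance value of 1 stands in for "all paths are considered equal".

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    in_flight_bytes: int = 0
    perf: int = 1   # relative performance value; 1 everywhere if none is given


def estimated_service_time(path):
    # In-flight I/O size divided by the path's performance value.
    return path.in_flight_bytes / path.perf


def select_path(paths):
    # Pick the path with the shortest estimate; ties go to the first path.
    return min(paths, key=estimated_service_time)
```

A path with more in-flight data can still win if its performance value is high enough: 8192 bytes at perf 4 gives an estimate of 2048, which beats 4096 bytes at perf 1.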
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm mpath: add queue length load balancer · fd5e0339
      Kiyoshi Ueda authored
      This patch adds a dynamic load balancer, dm-queue-length, which
      balances the number of in-flight I/Os across the paths.
      
      The code is based on the patch posted by Stefan Bader:
      https://www.redhat.com/archives/dm-devel/2005-October/msg00050.html
      Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm mpath: add start_io and nr_bytes to path selectors · 02ab823f
      Kiyoshi Ueda authored
      This patch makes two additions to the dm path selector interface for
      dynamic load balancers:
        o a new hook, start_io()
        o a new parameter 'nr_bytes' to select_path()/start_io()/end_io()
          to pass the size of the I/O
      
      start_io() is called when a target driver actually submits I/O
      to the selected path.
      Path selectors can use it to start accounting of the I/O.
      (e.g. counting the number of in-flight I/Os.)
      The start_io hook is based on the patch posted by Stefan Bader:
      https://www.redhat.com/archives/dm-devel/2005-October/msg00050.html
      
      nr_bytes, the size of the I/O, is passed so that path selectors can
      take the size of the I/O into account when deciding which path to use.
      dm-service-time uses it to estimate service time, for example.
      (Added the nr_bytes member to dm_mpath_io instead of using existing
       details.bi_size, since request-based dm patch deletes it.)
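
How a selector might use these hooks for accounting can be sketched with a toy queue-length policy. This Python class is an illustration, not the kernel path selector interface: the hook names follow the text, but the signatures and the dictionary of counters are assumptions.

```python
class QueueLengthSelector:
    """Toy selector that balances the number of in-flight I/Os across paths."""

    def __init__(self, paths):
        self.in_flight = {p: 0 for p in paths}

    def select_path(self, nr_bytes):
        # nr_bytes is accepted for interface parity; this policy ignores it.
        return min(self.in_flight, key=self.in_flight.get)

    def start_io(self, path, nr_bytes):
        self.in_flight[path] += 1      # the I/O was actually submitted to the path

    def end_io(self, path, nr_bytes):
        self.in_flight[path] -= 1      # the I/O finished on the path
```

A size-aware policy would add `nr_bytes` in start_io() and subtract it in end_io() instead of counting requests.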
      Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm snapshot: use barrier when writing exception store · 2bd02345
      Mikulas Patocka authored
      Send barrier requests when updating the exception area.
      
      Exception area updates need to be ordered w.r.t. data writes, so that
      the writes are not reordered in hardware disk cache.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm io: retry after barrier error · 51aa3228
      Mikulas Patocka authored
      If -EOPNOTSUPP was returned and the request was a barrier request, retry it
      without barrier.
      
      Retry all regions for now. Barriers are submitted only for one-region requests,
      so it doesn't matter.  (In the future, retries can be limited to the actual
      regions that failed.)
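
The retry rule can be expressed as a small wrapper. This is a Python sketch of the control flow only: `submit_with_barrier_fallback()` and the `submit` callback are hypothetical, and errors are modelled as negative errno values in the kernel's style.

```python
import errno


def submit_with_barrier_fallback(submit, barrier=True):
    """Call submit(barrier); retry once without the barrier on -EOPNOTSUPP."""
    err = submit(barrier)
    if barrier and err == -errno.EOPNOTSUPP:
        # The device rejected the barrier, so resubmit as a plain request.
        err = submit(False)
    return err
```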
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm io: record eopnotsupp · 5af443a7
      Mikulas Patocka authored
      Add another field, eopnotsupp_bits. It is a subset of error_bits, representing
      regions that returned -EOPNOTSUPP.  (The bit is set in both error_bits and
      eopnotsupp_bits.)
      
      This value will be used in further patches.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm snapshot: support barriers · 494b3ee7
      Mikulas Patocka authored
      Flush support for dm-snapshot target.
      
      This patch just forwards the flush request to either the origin or the snapshot
      device.  (It doesn't flush exception store metadata.)
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm mpath: support barriers · 8627921f
      Mikulas Patocka authored
      Flush support for dm-multipath target.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm delay: support barriers · c927259e
      Mikulas Patocka authored
      Flush support for dm-delay target.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm crypt: support flush · 647c7db1
      Mikulas Patocka authored
      Flush support for dm-crypt target.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: stripe support flush · 374bf7e7
      Mikulas Patocka authored
      Flush support for the stripe target.
      
      This sets ti->num_flush_requests to the number of stripes and
      remaps individual flush requests to the appropriate stripe devices.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: linear support flush · 433bcac5
      Mikulas Patocka authored
      Flush support for the linear target.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: send empty barriers to targets in dm_flush · 52b1fd5a
      Mikulas Patocka authored
      Pass empty barrier flushes to the targets in dm_flush().
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: initialise tio in alloc_tio · 9015df24
      Alasdair G Kergon authored
      Move repeated dm_target_io initialisation inside alloc_tio().
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: introduce num_flush_requests · f9ab94ce
      Mikulas Patocka authored
      Introduce num_flush_requests for a target to set to say how many flush
      instructions (empty barriers) it wants to receive.  These are sent by
      __clone_and_map_empty_barrier with map_info->flush_request going from 0
      to (num_flush_requests - 1).
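
The numbering of flush requests can be illustrated with a generator. This Python sketch only models the iteration order; the function borrows the name from the text, but its signature and the (name, count) pairs used to describe targets are assumptions.

```python
def clone_and_map_empty_barrier(targets):
    """Yield (target_name, flush_request) for every flush a target asked for."""
    for name, num_flush_requests in targets:
        # A target that reports 0 (no flush support) receives nothing.
        for flush_request in range(num_flush_requests):
            yield name, flush_request
```

A target that sets num_flush_requests to 3 sees flush_request values 0, 1 and 2, which it can use to pick the device each flush goes to.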
      
      Old targets without flush support won't receive any flush requests.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: remove check that prevents mapping empty bios · 27eaa149
      Mikulas Patocka authored
      Remove the check that the size of the cloned bio is not zero because a
      subsequent patch needs to send zero-sized barriers down this path.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: remove EOPNOTSUPP for barriers · fdb9572b
      Mikulas Patocka authored
      If the underlying device doesn't support barriers and dm receives a
      barrier, it waits until all requests on that device drain so it no
      longer needs to report -EOPNOTSUPP to the caller.
      
      This patch deals with the confusing situation when moving a volume from
      one physical device to another triggers an EOPNOTSUPP on a volume that
      didn't report it before.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: store only first barrier error · 5aa2781d
      Mikulas Patocka authored
      With the following patches, more than one error can occur during
      processing.  Change md->barrier_error so that only the first one is
      recorded and returned to the caller.
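
The "first error wins" rule amounts to a guarded assignment. The Python class below is a hypothetical model of that rule; only the field name barrier_error comes from the text.

```python
class BarrierState:
    """Toy holder for the error of the barrier currently being processed."""

    def __init__(self):
        self.barrier_error = 0

    def record(self, error):
        # Keep only the first non-zero error; later ones are dropped.
        if error and not self.barrier_error:
            self.barrier_error = error
```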
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: process requeue in dm_wq_work · 2761e95f
      Mikulas Patocka authored
      If a barrier request was returned with DM_ENDIO_REQUEUE,
      requeue it in dm_wq_work instead of dec_pending.
      
      This allows us to correctly handle a situation when some targets
      are asking for a requeue and other targets signal an error.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: make dm_flush return void · 531fe963
      Mikulas Patocka authored
      Make dm_flush return void.
      
      The first error during flush is stored in md->barrier_error instead.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: always hold bdev reference · 32a926da
      Mikulas Patocka authored
      Fix a potential deadlock when creating multiple snapshots by holding a
      reference to struct block_device for the whole lifecycle of every dm
      device instead of obtaining it independently at each point it is needed.
      
      bdget_disk() was called while the device was being suspended, in
      dm_suspend().  However there could be other devices already suspended,
      for example when creating additional snapshots of a device. bdget_disk()
      can wait for IO and allocate memory resulting in waiting for the
      already-suspended device - deadlock.
      
      This patch changes the code so that it gets the reference to struct
      block_device when struct mapped_device is allocated and initialized in
      alloc_dev() where it is always OK to allocate memory or wait for I/O.
      It drops the reference when it is destroyed in free_dev().  Thus there
      is no call to bdget_disk() while any device is suspended.
      
      Previously unlock_fs() was called only if bdev was held.  Now it is
      called unconditionally, but the superfluous calls are harmless because
      it returns immediately if the filesystem was not previously frozen.
      
      This patch also now allows the device size to be changed in a
      noflush suspend because the bdev is held.  This has no adverse effect.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: rename suspended_bdev to bdev · db8fef4f
      Mikulas Patocka authored
      Rename suspended_bdev to bdev.
      
      This patch doesn't change any functionality, just renames the variable.
      In the next patch, the variable will be used even for non-suspended devices.
      
      (Pre-requisite for the per-target barrier support patches.)
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm exception store: fix exstore lookup to be case insensitive · f6bd4eb7
      Jonathan Brassow authored
      When snapshots are created using 'p' instead of 'P' as the
      exception store type, the device-mapper table loading fails.
      
      This patch makes the code case insensitive as intended and fixes some
      regressions reported with device-mapper snapshots.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: use i_size_read · 5657e8fa
      Mikulas Patocka authored
      Use i_size_read() instead of reading i_size.
      
      If someone changes the size of the device simultaneously, i_size_read
      is guaranteed to return a valid value (either the old one or the new one).
      
      i_size can return some intermediate invalid value (on 32-bit computers
      with 64-bit i_size, the reads to both halves of i_size can be interleaved
      with updates to i_size, resulting in garbage being returned).
      
      Cc: stable@kernel.org
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: avoid unsupported spanning of md stripe boundaries · 8cbeb67a
      Mikulas Patocka authored
      A bio that has two or more vector entries, size less than or equal to
      page size, that crosses a stripe boundary of an underlying md device is
      accepted by device mapper (it conforms to all its limits) but not by the
      underlying device.
      
      The fix is: If device mapper selects the one-page maximum request size,
      it also needs to set its own q->merge_bvec_fn to reject any bios with
      multiple vector entries that span more pages.
      
      The problem was discovered in the following scenario:
        * MD - RAID-0
        * LV on the top of it (raid1, snapshot or striped with chunk
          size/stripe larger than RAID-0 stripe)
        * one of the logical volumes is exported to xen domU
        * inside xen domU it is partitioned, the key point is that the partition
          must be unaligned on page boundary (fdisk normally aligns the partition to
          63 sectors which will trigger it)
        * install the system on the partitioned disk in domU
      This causes I/O failures in dom0.
      Reference: https://bugzilla.redhat.com/show_bug.cgi?id=223947
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm mpath: flush keventd queue in destructor · 53b351f9
      Mikulas Patocka authored
      The commit fe9cf30e moves dm table event
      submission from kmultipath queue to kernel kevent queue to avoid a
      deadlock.
      
      There is a possible race condition because the kevent queue is not flushed
      in the multipath destructor. The scenario is:
      - some event happens and is queued to keventd
      - keventd thread is delayed due to scheduling latency or some other work
      - multipath device is destroyed
      - keventd now attempts to process a work_struct that resides in already
        released memory.
      
      The patch flushes the keventd queue in the multipath destructor.
      I've already fixed a similar bug in dm-raid1.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
    • dm raid1: keep retrying alloc if mempool_alloc failed · a72986c5
      Mikulas Patocka authored
      If the code can't handle allocation failures, use __GFP_NOFAIL so that
      in case of memory pressure the allocator will retry indefinitely and
      won't return NULL which would cause a crash in the function.
      
      This is still not a correct fix: it may cause a classic deadlock when
      the memory manager waits for I/O to complete and the I/O waits for free memory.
      I/O code shouldn't allocate any memory. But in this case it probably
      doesn't matter much in practice, since people usually do not swap on RAID.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>