1. 30 Oct, 2017 3 commits
    • Elena Reshetova's avatar
      bcache: convert cached_dev.count from atomic_t to refcount_t · 3b304d24
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situation and be exploitable.
      
      The variable cached_dev.count is used as pure reference counter.
      Convert it to refcount_t and fix up the operations.
      Suggested-by: default avatarKees Cook <keescook@chromium.org>
      Reviewed-by: default avatarDavid Windsor <dwindsor@gmail.com>
      Reviewed-by: default avatarHans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: default avatarMichael Lyle <mlyle@lyle.org>
      Signed-off-by: default avatarElena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      3b304d24
    • Coly Li's avatar
      bcache: only permit to recovery read error when cache device is clean · d59b2379
      Coly Li authored
      When bcache does read I/Os, for example in writeback or writethrough mode,
      if a read request on cache device is failed, bcache will try to recovery
      the request by reading from cached device. If the data on cached device is
      not synced with cache device, then requester will get a stale data.
      
      For critical storage system like database, providing stale data from
      recovery may result an application level data corruption, which is
      unacceptible.
      
      With this patch, for a failed read request in writeback or writethrough
      mode, recovery a recoverable read request only happens when cache device
      is clean. That is to say, all data on cached device is up to update.
      
      For other cache modes in bcache, read request will never hit
      cached_dev_read_error(), they don't need this patch.
      
      Please note, because cache mode can be switched arbitrarily in run time, a
      writethrough mode might be switched from a writeback mode. Therefore
      checking dc->has_data in writethrough mode still makes sense.
      
      Changelog:
      V4: Fix parens error pointed by Michael Lyle.
      v3: By response from Kent Oversteet, he thinks recovering stale data is a
          bug to fix, and option to permit it is unnecessary. So this version
          the sysfs file is removed.
      v2: rename sysfs entry from allow_stale_data_on_failure  to
          allow_stale_data_on_failure, and fix the confusing commit log.
      v1: initial patch posted.
      
      [small change to patch comment spelling by mlyle]
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarMichael Lyle <mlyle@lyle.org>
      Reported-by: default avatarArne Wolf <awolf@lenovo.com>
      Reviewed-by: default avatarMichael Lyle <mlyle@lyle.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Nix <nix@esperi.org.uk>
      Cc: Kai Krakow <hurikhan77@gmail.com>
      Cc: Eric Wheeler <bcache@lists.ewheeler.net>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d59b2379
    • Bart Van Assche's avatar
      block: Fix a race between blk_cleanup_queue() and timeout handling · 4e9b6f20
      Bart Van Assche authored
      Make sure that if the timeout timer fires after a queue has been
      marked "dying" that the affected requests are finished.
      Reported-by: default avatarchenxiang (M) <chenxiang66@hisilicon.com>
      Fixes: commit 287922eb ("block: defer timeouts to a workqueue")
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Tested-by: default avatarchenxiang (M) <chenxiang66@hisilicon.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4e9b6f20
  2. 25 Oct, 2017 7 commits
  3. 24 Oct, 2017 1 commit
  4. 17 Oct, 2017 2 commits
    • Omar Sandoval's avatar
      kyber: fix hang on domain token wait queue · 8cf46660
      Omar Sandoval authored
      When we're getting a domain token, if we fail to get a token on our
      first attempt, we put the current hardware queue on a wait queue and
      then try again just in case a token was freed after our initial attempt
      but before we got on the wait queue. If this second attempt succeeds, we
      currently leave the hardware queue on the wait queue. Usually this is
      okay; we'll just run the hardware queue one extra time when another
      token is freed. However, if the hardware queue doesn't have any other
      requests waiting, then when it it gets the extra wakeup, it won't have
      anything to free and therefore won't wake up any other hardware queues.
      If tokens are limited, then we won't make forward progress and the
      device will hang.
      Reported-by: default avatarBin Zha <zhabin.zb@alibaba-inc.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8cf46660
    • Wei Yongjun's avatar
      nullb: fix error return code in null_init() · 30c516d7
      Wei Yongjun authored
      Fix to return error code -ENOMEM from the null_alloc_dev() error
      handling case instead of 0, as done elsewhere in this function.
      
      Fixes: 2984c868 ("nullb: factor disk parameters")
      Signed-off-by: default avatarWei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      30c516d7
  5. 16 Oct, 2017 17 commits
    • Randy Dunlap's avatar
      block: fix Sphinx kernel-doc warning · 519c8e9f
      Randy Dunlap authored
      Sphinx treats symbols that end with '_' as a kind of special
      documentation indicator, so fix that by adding an ending '*'
      to it.
      
      ../block/bio.c:404: ERROR: Unknown target name: "gfp".
      Signed-off-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      519c8e9f
    • Michael Lyle's avatar
      bcache: writeback rate clamping: make 32 bit safe · 9ce762e8
      Michael Lyle authored
      Sorry this got through to linux-block, was detected by the kbuilds test
      robot.  NSEC_PER_SEC is a long constant; 2.5 * 10^9 doesn't fit in a
      signed long constant.
      
      Fixes: e41166c5 ("bcache: writeback rate shouldn't artifically clamp")
      Reviewed-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarMichael Lyle <mlyle@lyle.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9ce762e8
    • Michael Lyle's avatar
      bcache: MAINTAINERS: set bcache to MAINTAINED · 52b69ff5
      Michael Lyle authored
      Also add URL for IRC channel.
      Signed-off-by: default avatarMichael Lyle <mlyle@lyle.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      52b69ff5
    • Kent Overstreet's avatar
      77c77a98
    • Liang Chen's avatar
      bcache: safeguard a dangerous addressing in closure_queue · 6446c684
      Liang Chen authored
      The use of the union reduces the size of closure struct by taking advantage
      of the current size of its members. The offset of func in work_struct
      equals the size of the first three members, so that work.work_func will
      just reference the forth member - fn.
      
      This is smart but dangerous. It can be broken if work_struct or the other
      structs get changed, and can be a bit difficult to debug.
      Signed-off-by: default avatarLiang Chen <liangchen.linux@gmail.com>
      Reviewed-by: default avatarMichael Lyle <mlyle@lyle.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      6446c684
    • Michael Lyle's avatar
      bcache: rearrange writeback main thread ratelimit · a8500fc8
      Michael Lyle authored
      The time spent searching for things to write back "counts" for the
      actual rate achieved, so don't flush the accumulated rate with each
      chunk.
      
      This will maintain better fidelity to user-commanded rates, but it
      may slightly increase the burstiness of writeback.  The writeback
      lock needs improvement to help mitigate this.
      Signed-off-by: default avatarMichael Lyle <mlyle@lyle.org>
      Reviewed-by: default avatarKent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a8500fc8
    • Michael Lyle's avatar
      bcache: writeback rate shouldn't artifically clamp · e41166c5
      Michael Lyle authored
      The previous code artificially limited writeback rate to 1000000
      blocks/second (NSEC_PER_MSEC), which is a rate that can be met on fast
      hardware.  The rate limiting code works fine (though with decreased
      precision) up to 3 orders of magnitude faster, so use NSEC_PER_SEC.
      
      Additionally, ensure that uint32_t is used as a type for rate throughout
      the rate management so that type checking/clamp_t can work properly.
      
      bch_next_delay should be rewritten for increased precision and better
      handling of high rates and long sleep periods, but this is adequate for
      now.
      Signed-off-by: default avatarMichael Lyle <mlyle@lyle.org>
      Reported-by: default avatarColy Li <colyli@suse.de>
      Reviewed-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e41166c5
    • Michael Lyle's avatar
      bcache: smooth writeback rate control · ae82ddbf
      Michael Lyle authored
      This works in conjunction with the new PI controller.  Currently, in
      real-world workloads, the rate controller attempts to write back 1
      sector per second.  In practice, these minimum-rate writebacks are
      between 4k and 60k in test scenarios, since bcache aggregates and
      attempts to do contiguous writes and because filesystems on top of
      bcachefs typically write 4k or more.
      
      Previously, bcache used to guarantee to write at least once per second.
      This means that the actual writeback rate would exceed the configured
      amount by a factor of 8-120 or more.
      
      This patch adjusts to be willing to sleep up to 2.5 seconds, and to
      target writing 4k/second.  On the smallest writes, it will sleep 1
      second like before, but many times it will sleep longer and load the
      backing device less.  This keeps the loading on the cache and backing
      device related to writeback more consistent when writing back at low
      rates.
      Signed-off-by: default avatarMichael Lyle <mlyle@lyle.org>
      Reviewed-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ae82ddbf
    • Michael Lyle's avatar
      bcache: implement PI controller for writeback rate · 1d316e65
      Michael Lyle authored
      bcache uses a control system to attempt to keep the amount of dirty data
      in cache at a user-configured level, while not responding excessively to
      transients and variations in write rate.  Previously, the system was a
      PD controller; but the output from it was integrated, turning the
      Proportional term into an Integral term, and turning the Derivative term
      into a crude Proportional term.  Performance of the controller has been
      uneven in production, and it has tended to respond slowly, oscillate,
      and overshoot.
      
      This patch set replaces the current control system with an explicit PI
      controller and tuning that should be correct for most hardware.  By
      default, it attempts to write at a rate that would retire 1/40th of the
      current excess blocks per second.  An integral term in turn works to
      remove steady state errors.
      
      IMO, this yields benefits in simplicity (removing weighted average
      filtering, etc) and system performance.
      
      Another small change is a tunable parameter is introduced to allow the
      user to specify a minimum rate at which dirty blocks are retired.
      
      There is a slight difference from earlier versions of the patch in
      integral handling to prevent excessive negative integral windup.
      Signed-off-by: default avatarMichael Lyle <mlyle@lyle.org>
      Reviewed-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1d316e65
    • Michael Lyle's avatar
      bcache: don't write back data if reading it failed · 5fa89fb9
      Michael Lyle authored
      If an IO operation fails, and we didn't successfully read data from the
      cache, don't writeback invalid/partial data to the backing disk.
      Signed-off-by: default avatarMichael Lyle <mlyle@lyle.org>
      Reviewed-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5fa89fb9
    • Yijing Wang's avatar
      bcache: remove unused parameter · 23850102
      Yijing Wang authored
      Parameter bio is no longer used, clean it.
      Signed-off-by: default avatarYijing Wang <wangyijing@huawei.com>
      Reviewed-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      23850102
    • Eric Wheeler's avatar
      bcache: update bio->bi_opf bypass/writeback REQ_ flag hints · b41c9b02
      Eric Wheeler authored
      Flag for bypass if the IO is for read-ahead or background, unless the
      read-ahead request is for metadata (eg, from gfs2).
              Bypass if:
                      bio->bi_opf & (REQ_RAHEAD|REQ_BACKGROUND) &&
      			!(bio->bi_opf & REQ_META))
      
              Writeback if:
                      op_is_sync(bio->bi_opf) ||
      			bio->bi_opf & (REQ_META|REQ_PRIO)
      Signed-off-by: default avatarEric Wheeler <bcache@linux.ewheeler.net>
      Reviewed-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b41c9b02
    • Yijing Wang's avatar
      bcache: Remove redundant set_capacity · e89d6759
      Yijing Wang authored
      set_capacity() has been called in bcache_device_init(),
      remove the redundant one.
      Signed-off-by: default avatarYijing Wang <wangyijing@huawei.com>
      Reviewed-by: default avatarEric Wheeler <bcache@linux.ewheeler.net>
      Acked-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e89d6759
    • Coly Li's avatar
      bcache: rewrite multiple partitions support · 1dbe32ad
      Coly Li authored
      Current partition support of bcache is confusing and buggy. It tries to
      trace non-continuous device minor numbers by an ida bit string, and
      mistakenly mixed bcache device index with minor numbers. This design
      generates several negative results,
      - Index of bcache device name is not consecutive under /dev/. If there are
        3 bcache devices, they name will be,
        /dev/bcache0, /dev/bcache16, /dev/bcache32
        Only bcache code indexes bcache device name is such an interesting way.
      - First minor number of each bcache device is traced by ida bit string.
        One bcache device will occupy 16 bits, this is not a good idea. Indeed
        only one bit is enough.
      - Because minor number and bcache device index are mixed, a device index
        is allocated by ida_simple_get(), but an first minor number is sent into
        ida_simple_remove() to release the device. It confused original author
        too.
      
      Root cause of the above errors is, bcache code should not handle device
      minor numbers at all! A standard process to support multiple partitions in
      Linux kernel is,
      - Device driver provides major device number, and indexes multiple device
        instances.
      - Device driver does not allocat nor trace device minor number, only
        provides a first minor number of a given device instance, and sets how
        many minor numbers (paritions) the device instance may have.
      All rested stuffs are handled by block layer code, most of the details can
      be found from block/{genhd, partition-generic}.c files.
      
      This patch re-writes multiple partitions support for bcache. It makes
      whole things to be more clear, and uses ida bit string in a more efficeint
      way.
      - Ida bit string only traces bcache device index, not minor number. For a
        bcache device with 128 partitions, only one bit in ida bit string is
        enough.
      - Device minor number and device index are separated in concept. Device
        index is used for /dev node naming, and ida bit string trace. Minor
        number is calculated from device index and only used to initialize
        first_minor of a bcache device.
      - It does not follow any standard for 16 partitions on a bcache device.
        This patch sets 128 partitions on single bcache device at max, this is
        the limitation from GPT (GUID Partition Table) and supported by fdisk.
      
      Considering a typical device minor number is 20 bits width, each bcache
      device may have 128 partitions (7 bits), there can be 8192 bcache devices
      existing on system. For most common deployment for a single server in
      now days, it should be enough.
      
      [minor spelling fixes in commit message by Michael Lyle]
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: Eric Wheeler <bcache@lists.ewheeler.net>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Reviewed-by: default avatarMichael Lyle <mlyle@lyle.org>
      Signed-off-by: default avatarMichael Lyle <mlyle@lyle.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1dbe32ad
    • Coly Li's avatar
      bcache: fix a comments typo in bch_alloc_sectors() · b1e8139e
      Coly Li authored
      Code comments in alloc.c:bch_alloc_sectors() mentions a function
      name find_data_bucket(), the correct function name should be
      pick_data_bucket() indeed. bch_alloc_sectors() is a quite important
      function in bcache allocation code, fixing the typo may help
      other people to have less confusion.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Reviewed-by: default avatarTang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b1e8139e
    • Coly Li's avatar
      bcache: check ca->alloc_thread initialized before wake up it · 91af8300
      Coly Li authored
      In bcache code, sysfs entries are created before all resources get
      allocated, e.g. allocation thread of a cache set.
      
      There is posibility for NULL pointer deference if a resource is accessed
      but which is not initialized yet. Indeed Jorg Bornschein catches one on
      cache set allocation thread and gets a kernel oops.
      
      The reason for this bug is, when bch_bucket_alloc() is called during
      cache set registration and attaching, ca->alloc_thread is not properly
      allocated and initialized yet, call wake_up_process() on ca->alloc_thread
      triggers NULL pointer deference failure. A simple and fast fix is, before
      waking up ca->alloc_thread, checking whether it is allocated, and only
      wake up ca->alloc_thread when it is not NULL.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Reported-by: default avatarJorg Bornschein <jb@capsec.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: default avatarMichael Lyle <mlyle@lyle.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      91af8300
    • Peter Foley's avatar
      bcache: Avoid nested function definition · 58f913dc
      Peter Foley authored
      Fixes below error with clang:
      ../drivers/md/bcache/sysfs.c:759:3: error: function definition is not allowed here
                      {       return *((uint16_t *) r) - *((uint16_t *) l); }
                      ^
      ../drivers/md/bcache/sysfs.c:789:32: error: use of undeclared identifier 'cmp'
                      sort(p, n, sizeof(uint16_t), cmp, NULL);
                                                   ^
      2 errors generated.
      
      v2:
      rename function to __bch_cache_cmp
      Signed-off-by: default avatarPeter Foley <pefoley2@pefoley.com>
      Reviewed-by: default avatarColy Li <colyli@suse.de>
      Reviewed-by: default avatarMichael Lyle <mlyle@lyle.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      58f913dc
  6. 14 Oct, 2017 1 commit
  7. 13 Oct, 2017 9 commits