1. 15 Oct, 2015 4 commits
    • NVMe: Fix memory leak on retried commands · 0dfc70c3
      Keith Busch authored
      Resources are reallocated for requeued commands, so unmap and release
      the iod for the failed command.
      
      It's a pretty bad memory leak and causes a kernel hang if you remove a
      drive because of a busy dma pool. You'll get messages spewing like this:
      
        nvme 0000:xx:xx.x: dma_pool_destroy prp list 256, ffff880420dec000 busy
      
      and lock up pci and the driver since removal never completes while
      holding a lock.
      
      Cc: stable@vger.kernel.org
      Cc: <stable@vger.kernel.org> # 4.0.x-
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: don't release bdi while request_queue has live references · b02176f3
      Tejun Heo authored
      bdi's are initialized in two steps, bdi_init() and bdi_register(), but
      destroyed in a single step by bdi_destroy() which, for a bdi embedded
      in a request_queue, is called during blk_cleanup_queue() which makes
      the queue invisible and starts the draining of remaining usages.
      
      A request_queue's user can access the congestion state of the embedded
      bdi as long as it holds a reference to the queue.  As such, it may
      access the congested state of a queue which finished
      blk_cleanup_queue() but hasn't reached blk_release_queue() yet.
      Because the congested state was embedded in backing_dev_info which in
      turn is embedded in request_queue, accessing the congested state after
      bdi_destroy() was called was fine.  The bdi was destroyed but the
      memory region for the congested state remained accessible till the
      queue got released.
      
      a13f35e8 ("writeback: don't embed root bdi_writeback_congested in
      bdi_writeback") changed the situation.  Now, the root congested state
      which is expected to be pinned while request_queue remains accessible
      is separately reference counted and the base ref is put during
      bdi_destroy().  This means that the root congested state may go away
      prematurely while the queue is between bdi_destroy() and
      blk_release_queue(), which was detected by Andrey's KASAN tests.
      
      The root cause of this problem is that bdi doesn't distinguish the two
      steps of destruction, unregistration and release, and now the root
      congested state actually requires a separate release step.  To fix the
      issue, this patch separates out bdi_unregister() and bdi_exit() from
      bdi_destroy().  bdi_unregister() is called from blk_cleanup_queue()
      and bdi_exit() from blk_release_queue().  bdi_destroy() is now just a
      simple wrapper calling the two steps back-to-back.
      
      While at it, the prototype of bdi_destroy() is moved right below
      bdi_setup_and_register() so that the counterpart operations are
      located together.
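
The split described above can be illustrated with a small userspace toy. This is only a sketch: `struct toy_bdi` and every `toy_bdi_*()` function are hypothetical stand-ins, not the kernel's types or functions, and the separately allocated `congested` field merely models state that must outlive unregistration.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-ins; none of these are the kernel's types or functions. */
struct toy_bdi {
    bool registered;
    int *congested;     /* separately allocated state, models the root congested state */
};

/* Setup step 1: allocate the state. Returns 0 on success, -1 on failure. */
static int toy_bdi_init(struct toy_bdi *bdi)
{
    bdi->registered = false;
    bdi->congested = calloc(1, sizeof(*bdi->congested));
    return bdi->congested ? 0 : -1;
}

/* Setup step 2: make the bdi visible. */
static void toy_bdi_register(struct toy_bdi *bdi)
{
    bdi->registered = true;
}

/* Teardown step 1: hide the bdi, but keep its state readable. */
static void toy_bdi_unregister(struct toy_bdi *bdi)
{
    bdi->registered = false;
}

/* Teardown step 2: release the state once no user can reach it. */
static void toy_bdi_exit(struct toy_bdi *bdi)
{
    free(bdi->congested);
    bdi->congested = NULL;
}

/* Wrapper for callers that need no window between the two steps. */
static void toy_bdi_destroy(struct toy_bdi *bdi)
{
    toy_bdi_unregister(bdi);
    toy_bdi_exit(bdi);
}
```

In this toy, a holder of the structure can still read `*bdi->congested` between `toy_bdi_unregister()` and `toy_bdi_exit()`, which is the window the patch is meant to keep safe.
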
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: a13f35e8 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
      Cc: stable@vger.kernel.org # v4.2+
      Reported-and-tested-by: Andrey Konovalov <andreyknvl@google.com>
      Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com
      Reviewed-by: Jan Kara <jack@suse.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • nvme: use an integer value to Linux errno values · 81c04b94
      Christoph Hellwig authored
      Use a separate integer variable to hold the signed Linux errno
      values we pass back to the block layer.  Note that for pass through
      commands those might still be NVMe values, but those fit into the
      int as well.
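
Why a signed variable matters can be shown with a short userspace sketch. The commit does not say which variable previously held the value, so the 16-bit unsigned field below is an assumption made purely for illustration, and both helper names are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical helper: a signed int keeps a negative errno intact. */
static int status_as_int(int err)
{
    int result = err;   /* sign preserved */
    return result;
}

/*
 * Hypothetical counter-example (assumed field width): storing the same
 * value in a 16-bit unsigned variable wraps it modulo 65536, so the
 * caller sees a positive number instead of a negative errno.
 */
static int status_via_u16(int err)
{
    uint16_t result = (uint16_t)err;
    return result;      /* promoted back to int as a positive value */
}
```

A check such as `if (ret < 0)` works on the first helper's result and silently fails on the second's.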
      
      Fixes: f4829a9b ("blk-mq: fix racy updates of rq->errors")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: fix use-after-free in blk_mq_free_tag_set() · f42d79ab
      Junichi Nomura authored
      tags is freed in blk_mq_free_rq_map() and should not be used after that.
      The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false because
      free_cpumask_var() is a no-op.
      
      tags->cpumask is allocated in blk_mq_init_tags() so it's natural to
      free cpumask in its counterpart, blk_mq_free_tags().
      
      Fixes: f26cdc85 ("blk-mq: Shared tag enhancements")
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  2. 12 Oct, 2015 6 commits
    • nvme: fix 32-bit build warning · 835da3f9
      Arnd Bergmann authored
      Compiling the nvme driver on 32-bit warns about a cast from a __u64
      variable to a pointer:
      
      drivers/block/nvme-core.c: In function 'nvme_submit_io':
      drivers/block/nvme-core.c:1847:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
          (void __user *)io.addr, length, NULL, 0);
      
      The cast here is intentional and safe, so we can shut up the
      gcc warning by adding an intermediate cast to 'uintptr_t'.
      
      I had previously submitted a patch to fix this problem in the
      nvme driver, but it was accepted on the same day that two new
      warnings got added.
      
      For clarification, I also change the third instance of this cast
      to use uintptr_t instead of unsigned long now.
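
The intermediate cast can be sketched in plain userspace C as follows. `u64_to_ptr()` is a hypothetical helper, not code from the driver, and the kernel-only `__user` annotation is left out so the snippet compiles outside the kernel.

```c
#include <stdint.h>

/* Hypothetical helper: turn a 64-bit address value into a pointer. */
static void *u64_to_ptr(uint64_t addr)
{
    /*
     * Casting a 64-bit integer straight to a pointer triggers
     * -Wint-to-pointer-cast where pointers are 32 bits wide.
     * Going through uintptr_t first makes the width change explicit.
     */
    return (void *)(uintptr_t)addr;
}
```

On a 64-bit build both casts are the same width, so the helper behaves identically there and only silences the warning on 32-bit builds.
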
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Fixes: d29ec824 ("nvme: submit internal commands through the block layer")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: fix incorrect calculation of available memory for memcg domains · c5edf9cd
      Tejun Heo authored
      For memcg domains, the amount of available memory was calculated as
      
       min(the amount currently in use + headroom according to memcg,
           total clean memory)
      
      This isn't quite correct as what should be capped by the amount of
      clean memory is the headroom, not the sum of memory in use and
      headroom.  For example, if a memcg domain has a significant amount of
      dirty memory, the above can lead to a value which is lower than the
      current amount in use which doesn't make much sense.  In most
      circumstances, the above leads to a number which is somewhat but not
      drastically lower.
      
      As the amount of memory which can be readily allocated to the memcg
      domain is capped by the amount of system-wide clean memory which is
      not already assigned to the memcg itself, the number we want is
      
       the amount currently in use +
       min(headroom according to memcg, clean memory elsewhere in the system)
      
      This patch updates mem_cgroup_wb_stats() to return the number of
      filepages and headroom instead of the calculated available pages.
      mdtc_cap_avail() is renamed to mdtc_calc_avail() and performs the
      above calculation from file, headroom, dirty and globally clean pages.
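
The difference between the two formulas can be checked with a few lines of userspace C. This is a sketch only: `avail_old()`, `avail_new()` and their parameter names are hypothetical and are not the kernel's mdtc_calc_avail(); all quantities are assumed to be page counts.

```c
/* Smaller of two unsigned long values. */
static unsigned long min_ul(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/* Old formula: min(in use + headroom, total clean memory). */
static unsigned long avail_old(unsigned long in_use, unsigned long headroom,
                               unsigned long total_clean)
{
    return min_ul(in_use + headroom, total_clean);
}

/* New formula: in use + min(headroom, clean memory elsewhere). */
static unsigned long avail_new(unsigned long in_use, unsigned long headroom,
                               unsigned long clean_elsewhere)
{
    return in_use + min_ul(headroom, clean_elsewhere);
}
```

With illustrative numbers, a domain using 1000 pages with 500 pages of headroom and 800 clean pages system-wide gets 800 from the old formula, which is less than it already uses. The new formula can never return less than the amount in use.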
      
      v2: Dummy mem_cgroup_wb_stats() implementation wasn't updated leading
          to build failure when !CGROUP_WRITEBACK.  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: c2aa723a ("writeback: implement memcg writeback domain based throttling")
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: memcg dirty_throttle_control should be initialized with wb->memcg_completions · d60d1bdd
      Tejun Heo authored
      MDTC_INIT() is used to initialize dirty_throttle_control for memcg
      domains.  It used DTC_INIT_COMMON() to initialize mdtc->wb and
      ->wb_completions which is incorrect as DTC_INIT_COMMON() sets the
      latter to wb->completions instead of wb->memcg_completions.  This can
      lead to wildly incorrect results when calculating the proportion of
      dirty memory the memcg domain should get.
      
      Remove DTC_INIT_COMMON() and update MDTC_INIT() to initialize
      mdtc->wb_completions to wb->memcg_completions.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: c2aa723a ("writeback: implement memcg writeback domain based throttling")
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: bdi_writeback iteration must not skip dying ones · b817525a
      Tejun Heo authored
      bdi_for_each_wb() is used in several places to wake up or issue
      writeback work items to all wb's (bdi_writeback's) on a given bdi.
      The iteration is performed by walking bdi->cgwb_tree; however, the
      tree only indexes wb's which are currently active.
      
      For example, when a memcg gets associated with a different blkcg, the
      old wb is removed from the tree so that the new one can be indexed.
      The old wb starts dying from then on but will linger till all its
      inodes are drained.  As these dying wb's may still host dirty inodes,
      writeback operations which affect all wb's must include them.
      bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
      failing to sync the inodes belonging to those wb's.
      
      This patch adds an RCU-protected @bdi->wb_list which lists all wb's
      belonging to that bdi.  wb's are added on creation and removed on
      release rather than on the start of destruction.  bdi_for_each_wb()
      usages are replaced with list_for_each[_continue]_rcu() iterations
      over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.
      
      v2: Updated as per Jan.  last_wb ref leak in bdi_split_work_to_wbs()
          fixed and unnecessary list head severing in cgwb_bdi_destroy()
          removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com>
      Fixes: ebe41ab0 ("writeback: implement bdi_for_each_wb()")
      Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback() · 6fdf860f
      Tejun Heo authored
      wakeup_dirtytime_writeback() walks and wakes up all wb's of all bdi's;
      unfortunately, it was always waking up bdi->wb instead of the wb being
      walked.  Fix it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 001fe6f6 ("writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's")
      Reviewed-by: Jan Kara <jack@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration · 9ad18ab9
      Tejun Heo authored
      laptop_mode_timer_fn() was using bdi_for_each_wb() without the
      required RCU locking leading to the following warning.
      
       WARNING: CPU: 0 PID: 0 at include/linux/backing-dev.h:415 laptop_mode_timer_fn+0x106/0x170()
       ...
       Call Trace:
        <IRQ>  [<ffffffff81480cdc>] dump_stack+0x4e/0x82
        [<ffffffff81051912>] warn_slowpath_common+0x82/0xc0
        [<ffffffff81051a0a>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8115f0e6>] laptop_mode_timer_fn+0x106/0x170
        [<ffffffff810ca8e3>] call_timer_fn+0xb3/0x2f0
        [<ffffffff810cad25>] run_timer_softirq+0x205/0x370
        [<ffffffff81056854>] __do_softirq+0xd4/0x460
        [<ffffffff81056d69>] irq_exit+0x89/0xa0
        [<ffffffff8185a892>] smp_apic_timer_interrupt+0x42/0x50
        [<ffffffff81858a44>] apic_timer_interrupt+0x84/0x90
       ...
      
      Fix it by adding rcu_read_lock() around the iteration.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: a06fd6b1 ("writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's")
      Reviewed-by: Jan Kara <jack@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  3. 08 Oct, 2015 1 commit
  4. 07 Oct, 2015 10 commits
  5. 06 Oct, 2015 10 commits
  6. 04 Oct, 2015 6 commits
    • Linux 4.3-rc4 · 049e6dde
      Linus Torvalds authored
    • Merge branch 'strscpy' of git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile · 30c44659
      Linus Torvalds authored
      Pull strscpy string copy function implementation from Chris Metcalf.
      
      Chris sent this during the merge window, but I waffled back and forth on
      the pull request, which is why it's going in only now.
      
      The new "strscpy()" function is definitely easier to use and more secure
      than either strncpy() or strlcpy(), both of which are horrible nasty
      interfaces that have serious and irredeemable problems.
      
      strncpy() has a useless return value, and doesn't NUL-terminate an
      overlong result.  To make matters worse, it pads a short result with
      zeroes, which is a performance disaster if you have big buffers.
      
      strlcpy(), by contrast, is a mis-designed "fix" for strncpy(), lacking
      the insane NUL padding, but having a differently broken return value
      which returns the original length of the source string.  Which means
      that it will read characters past the count from the source buffer, and
      you have to trust the source to be properly terminated.  It also makes
      error handling fragile, since the test for overflow is unnecessarily
      subtle.
      
      strscpy() avoids both these problems, guaranteeing the NUL termination
      (but not excessive padding) if the destination size wasn't zero, and
      making the overflow condition very obvious by returning -E2BIG.  It also
      doesn't read past the size of the source, and can thus be used for
      untrusted source data too.
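
The semantics described above can be sketched in userspace C. This is an illustration under stated assumptions, not the kernel's implementation: `sketch_strscpy()` is a hypothetical name, it copies byte by byte, and it returns a plain `long` rather than a kernel type.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Copy src into dest, which holds count bytes. Returns the number of
 * characters copied (excluding the NUL), or -E2BIG if the string was
 * truncated or count is zero.
 */
static long sketch_strscpy(char *dest, const char *src, size_t count)
{
    size_t i;

    if (count == 0)
        return -E2BIG;          /* no room even for the terminator */

    /* Reads at most count bytes from src, never beyond. */
    for (i = 0; i < count; i++) {
        dest[i] = src[i];
        if (src[i] == '\0')
            return (long)i;     /* whole string fit, no extra padding */
    }

    dest[count - 1] = '\0';     /* truncated, but still NUL-terminated */
    return -E2BIG;
}
```

A caller can test for truncation with a single `if (ret < 0)`, which is the "very obvious" overflow condition the message refers to.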
      
      So why did I waffle about this for so long?
      
      Every time we introduce a new-and-improved interface, people start doing
      these interminable series of trivial conversion patches.
      
      And every time that happens, somebody does some silly mistake, and the
      conversion patch to the improved interface actually makes things worse.
      Because the patch is mind-numbing and trivial, nobody has the attention
      span to look at it carefully, and it's usually done over large swaths
      of source code which means that not every conversion gets tested.
      
      So I'm pulling the strscpy() support because it *is* a better interface.
      But I will refuse to pull mindless conversion patches.  Use this in
      places where it makes sense, but don't do trivial patches to fix things
      that aren't actually known to be broken.
      
      * 'strscpy' of git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
        tile: use global strscpy() rather than private copy
        string: provide strscpy()
        Make asm/word-at-a-time.h available on all architectures
    • Merge tag 'md/4.3-fixes' of git://neil.brown.name/md · 15ecf9a9
      Linus Torvalds authored
      Pull md fixes from Neil Brown:
       "Assorted fixes for md in 4.3-rc.
      
        Two tagged for -stable, and one is really a cleanup to match and
        improve kmemcache interface."
      
      * tag 'md/4.3-fixes' of git://neil.brown.name/md:
        md/bitmap: don't pass -1 to bitmap_storage_alloc.
        md/raid1: Avoid raid1 resync getting stuck
        md: drop null test before destroy functions
        md: clear CHANGE_PENDING in readonly array
        md/raid0: apply base queue limits *before* disk_stack_limits
        md/raid5: don't index beyond end of array in need_this_block().
        raid5: update analysis state for failed stripe
        md: wait for pending superblock updates before switching to read-only
    • Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus · 0d877081
      Linus Torvalds authored
      Pull MIPS updates from Ralf Baechle:
       "This week's round of MIPS fixes:
         - Fix JZ4740 build
         - Fix fallback to GFP_DMA
         - FP seccomp in case of ENOSYS
         - Fix bootmem panic
         - A number of FP and CPS fixes
         - Wire up new syscalls
         - Make sure BPF assembler objects can properly be disassembled
         - Fix BPF assembler code for MIPS I"
      
      * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
        MIPS: scall: Always run the seccomp syscall filters
        MIPS: Octeon: Fix kernel panic on startup from memory corruption
        MIPS: Fix R2300 FP context switch handling
        MIPS: Fix octeon FP context switch handling
        MIPS: BPF: Fix load delay slots.
        MIPS: BPF: Do all exports of symbols with FEXPORT().
        MIPS: Fix the build on jz4740 after removing the custom gpio.h
        MIPS: CPS: #ifdef on CONFIG_MIPS_MT_SMP rather than CONFIG_MIPS_MT
        MIPS: CPS: Don't include MT code in non-MT kernels.
        MIPS: CPS: Stop dangling delay slot from has_mt.
        MIPS: dma-default: Fix 32-bit fall back to GFP_DMA
        MIPS: Wire up userfaultfd and membarrier syscalls.
    • Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 3e519dde
      Linus Torvalds authored
      Pull irq fixes from Thomas Gleixner:
       "This update contains:
      
         - Fix for a long standing race affecting /proc/irq/NNN
      
         - One line fix for ARM GICV3-ITS counting the wrong data
      
         - Warning silencing in ARM GICV3-ITS.  Another GCC trying to be
           overly clever issue"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip/gic-v3-its: Count additional LPIs for the aliased devices
        irqchip/gic-v3-its: Silence warning when its_lpi_alloc_chunks gets inlined
        genirq: Fix race in register_irq_proc()
    • MIPS: scall: Always run the seccomp syscall filters · d218af78
      Markos Chandras authored
      The MIPS syscall handler code used to return -ENOSYS on invalid
      syscalls. Whilst this is expected, it caused problems for seccomp
      filters because the said filters never had the chance to run since
      the code returned -ENOSYS before triggering them. This caused
      problems on the chromium testsuite for filters looking for invalid
      syscalls. This has now changed and the seccomp filters are always
      run even if the syscall is invalid. We return -ENOSYS once we
      return from the seccomp filters. Moreover, similar codepaths have
      been merged in the process which simplifies somewhat the overall
      syscall code.
      Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/11236/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
  7. 03 Oct, 2015 3 commits