1. 08 Jan, 2018 12 commits
  2. 06 Jan, 2018 26 commits
    • Ming Lei's avatar
      blk-mq: fix race between updating nr_hw_queues and switching io sched · fb350e0a
      Ming Lei authored
      In both elevator_switch_mq() and blk_mq_update_nr_hw_queues(), sched tags
      can be allocated, and q->nr_hw_queue is used, and race is inevitable, for
      example: blk_mq_init_sched() may trigger use-after-free on hctx, which is
      freed in blk_mq_realloc_hw_ctxs() when nr_hw_queues is decreased.
      
      This patch fixes the race be holding q->sysfs_lock.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reported-by: default avatarYi Zhang <yi.zhang@redhat.com>
      Tested-by: default avatarYi Zhang <yi.zhang@redhat.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      fb350e0a
    • Ming Lei's avatar
      blk-mq: avoid to map CPU into stale hw queue · 7d4901a9
      Ming Lei authored
      blk_mq_pci_map_queues() may not map one CPU into any hw queue, but its
      previous map isn't cleared yet, and may point to one stale hw queue
      index.
      
      This patch fixes the following issue by clearing the mapping table before
      setting it up in blk_mq_pci_map_queues().
      
      This patches fixes this following issue reported by Zhang Yi:
      
      [  101.202734] BUG: unable to handle kernel NULL pointer dereference at 0000000094d3013f
      [  101.211487] IP: blk_mq_map_swqueue+0xbc/0x200
      [  101.216346] PGD 0 P4D 0
      [  101.219171] Oops: 0000 [#1] SMP
      [  101.222674] Modules linked in: sunrpc ipmi_ssif vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore mxm_wmi intel_rapl_perf iTCO_wdt ipmi_si ipmi_devintf pcspkr iTCO_vendor_support sg dcdbas ipmi_msghandler wmi mei_me lpc_ich shpchp mei acpi_power_meter dm_multipath ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crc32c_intel libata tg3 nvme nvme_core megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
      [  101.284881] CPU: 0 PID: 504 Comm: kworker/u25:5 Not tainted 4.15.0-rc2 #1
      [  101.292455] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
      [  101.301001] Workqueue: nvme-wq nvme_reset_work [nvme]
      [  101.306636] task: 00000000f2c53190 task.stack: 000000002da874f9
      [  101.313241] RIP: 0010:blk_mq_map_swqueue+0xbc/0x200
      [  101.318681] RSP: 0018:ffffc9000234fd70 EFLAGS: 00010282
      [  101.324511] RAX: ffff88047ffc9480 RBX: ffff88047e130850 RCX: 0000000000000000
      [  101.332471] RDX: ffffe8ffffd40580 RSI: ffff88047e509b40 RDI: ffff88046f37a008
      [  101.340432] RBP: 000000000000000b R08: ffff88046f37a008 R09: 0000000011f94280
      [  101.348392] R10: ffff88047ffd4d00 R11: 0000000000000000 R12: ffff88046f37a008
      [  101.356353] R13: ffff88047e130f38 R14: 000000000000000b R15: ffff88046f37a558
      [  101.364314] FS:  0000000000000000(0000) GS:ffff880277c00000(0000) knlGS:0000000000000000
      [  101.373342] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  101.379753] CR2: 0000000000000098 CR3: 000000047f409004 CR4: 00000000001606f0
      [  101.387714] Call Trace:
      [  101.390445]  blk_mq_update_nr_hw_queues+0xbf/0x130
      [  101.395791]  nvme_reset_work+0x6f4/0xc06 [nvme]
      [  101.400848]  ? pick_next_task_fair+0x290/0x5f0
      [  101.405807]  ? __switch_to+0x1f5/0x430
      [  101.409988]  ? put_prev_entity+0x2f/0xd0
      [  101.414365]  process_one_work+0x141/0x340
      [  101.418836]  worker_thread+0x47/0x3e0
      [  101.422921]  kthread+0xf5/0x130
      [  101.426424]  ? rescuer_thread+0x380/0x380
      [  101.430896]  ? kthread_associate_blkcg+0x90/0x90
      [  101.436048]  ret_from_fork+0x1f/0x30
      [  101.440034] Code: 48 83 3c ca 00 0f 84 2b 01 00 00 48 63 cd 48 8b 93 10 01 00 00 8b 0c 88 48 8b 83 20 01 00 00 4a 03 14 f5 60 04 af 81 48 8b 0c c8 <48> 8b 81 98 00 00 00 f0 4c 0f ab 30 8b 81 f8 00 00 00 89 42 44
      [  101.461116] RIP: blk_mq_map_swqueue+0xbc/0x200 RSP: ffffc9000234fd70
      [  101.468205] CR2: 0000000000000098
      [  101.471907] ---[ end trace 5fe710f98228a3ca ]---
      [  101.482489] Kernel panic - not syncing: Fatal exception
      [  101.488505] Kernel Offset: disabled
      [  101.497752] ---[ end Kernel panic - not syncing: Fatal exception
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Suggested-by: default avatarChristoph Hellwig <hch@lst.de>
      Reported-by: default avatarYi Zhang <yi.zhang@redhat.com>
      Tested-by: default avatarYi Zhang <yi.zhang@redhat.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7d4901a9
    • Ming Lei's avatar
      blk-mq: quiesce queue during switching io sched and updating nr_requests · 24f5a90f
      Ming Lei authored
      Dispatch may still be in-progress after queue is frozen, so we have to
      quiesce queue before switching IO scheduler and updating nr_requests.
      
      Also when switching io schedulers, blk_mq_run_hw_queue() may still be
      called somewhere(such as from nvme_reset_work()), and io scheduler's
      per-hctx data may not be setup yet, so cause oops even inside
      blk_mq_hctx_has_pending(), such as it can be run just between:
      
              ret = e->ops.mq.init_sched(q, e);
      AND
              ret = e->ops.mq.init_hctx(hctx, i)
      
      inside blk_mq_init_sched().
      
      This reverts commit 7a148c2f(block: don't call blk_mq_quiesce_queue()
      after queue is frozen) basically, and makes sure blk_mq_hctx_has_pending
      won't be called if queue is quiesced.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Fixes: 7a148c2f(block: don't call blk_mq_quiesce_queue() after queue is frozen)
      Reported-by: default avatarYi Zhang <yi.zhang@redhat.com>
      Tested-by: default avatarYi Zhang <yi.zhang@redhat.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      24f5a90f
    • Ming Lei's avatar
      blk-mq: quiesce queue before freeing queue · c2856ae2
      Ming Lei authored
      After queue is frozen, dispatch still may happen, for example:
      
      1) requests are submitted from several contexts
      2) requests from all these contexts are inserted to queue, but may dispatch
      to LLD in one of these paths, but other paths sill need to move on even all
      these requests are completed(that means blk_mq_freeze_queue_wait() returns
      at that time)
      3) dispatch after queue freezing still moves on and causes use-after-free,
      because request queue is freed
      
      This patch quiesces queue after it is frozen, and makes sure all
      in-progress dispatch are completed.
      
      This patch fixes the following kernel crash when running heavy IOs vs.
      deleting device:
      
      [   36.719251] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [   36.720318] IP: kyber_has_work+0x14/0x40
      [   36.720847] PGD 254bf5067 P4D 254bf5067 PUD 255e6a067 PMD 0
      [   36.721584] Oops: 0000 [#1] PREEMPT SMP
      [   36.722105] Dumping ftrace buffer:
      [   36.722570]    (ftrace buffer empty)
      [   36.723057] Modules linked in: scsi_debug ebtable_filter ebtables ip6table_filter ip6_tables tcm_loop iscsi_target_mod target_core_file target_core_iblock target_core_pscsi target_core_mod xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c bridge stp llc fuse iptable_filter ip_tables sd_mod sg btrfs xor zstd_decompress zstd_compress xxhash raid6_pq mptsas mptscsih bcache crc32c_intel ahci mptbase libahci serio_raw scsi_transport_sas nvme libata shpchp lpc_ich virtio_scsi nvme_core binfmt_misc dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi null_blk configs
      [   36.733438] CPU: 2 PID: 2374 Comm: fio Not tainted 4.15.0-rc2.blk_mq_quiesce+ #714
      [   36.735143] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.9.3-1.fc25 04/01/2014
      [   36.736688] RIP: 0010:kyber_has_work+0x14/0x40
      [   36.737515] RSP: 0018:ffffc9000209bca0 EFLAGS: 00010202
      [   36.738431] RAX: 0000000000000008 RBX: ffff88025578bfc8 RCX: ffff880257bf4ed0
      [   36.739581] RDX: 0000000000000038 RSI: ffffffff81a98c6d RDI: ffff88025578bfc8
      [   36.740730] RBP: ffff880253cebfc8 R08: ffffc9000209bda0 R09: ffff8802554f3480
      [   36.741885] R10: ffffc9000209be60 R11: ffff880263f72538 R12: ffff88025573e9e8
      [   36.743036] R13: ffff88025578bfd0 R14: 0000000000000001 R15: 0000000000000000
      [   36.744189] FS:  00007f9b9bee67c0(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000
      [   36.746617] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   36.748483] CR2: 0000000000000008 CR3: 0000000254bf4001 CR4: 00000000003606e0
      [   36.750164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   36.751455] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   36.752796] Call Trace:
      [   36.753992]  blk_mq_do_dispatch_sched+0x7f/0xe0
      [   36.755110]  blk_mq_sched_dispatch_requests+0x119/0x190
      [   36.756179]  __blk_mq_run_hw_queue+0x83/0x90
      [   36.757144]  __blk_mq_delay_run_hw_queue+0xaf/0x110
      [   36.758046]  blk_mq_run_hw_queue+0x24/0x70
      [   36.758845]  blk_mq_flush_plug_list+0x1e7/0x270
      [   36.759676]  blk_flush_plug_list+0xd6/0x240
      [   36.760463]  blk_finish_plug+0x27/0x40
      [   36.761195]  do_io_submit+0x19b/0x780
      [   36.761921]  ? entry_SYSCALL_64_fastpath+0x1a/0x7d
      [   36.762788]  entry_SYSCALL_64_fastpath+0x1a/0x7d
      [   36.763639] RIP: 0033:0x7f9b9699f697
      [   36.764352] RSP: 002b:00007ffc10f991b8 EFLAGS: 00000206 ORIG_RAX: 00000000000000d1
      [   36.765773] RAX: ffffffffffffffda RBX: 00000000008f6f00 RCX: 00007f9b9699f697
      [   36.766965] RDX: 0000000000a5e6c0 RSI: 0000000000000001 RDI: 00007f9b8462a000
      [   36.768377] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000008f6420
      [   36.769649] R10: 00007f9b846e5000 R11: 0000000000000206 R12: 00007f9b795d6a70
      [   36.770807] R13: 00007f9b795e4140 R14: 00007f9b795e3fe0 R15: 0000000100000000
      [   36.771955] Code: 83 c7 10 e9 3f 68 d1 ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 97 b0 00 00 00 48 8d 42 08 48 83 c2 38 <48> 3b 00 74 06 b8 01 00 00 00 c3 48 3b 40 08 75 f4 48 83 c0 10
      [   36.775004] RIP: kyber_has_work+0x14/0x40 RSP: ffffc9000209bca0
      [   36.776012] CR2: 0000000000000008
      [   36.776690] ---[ end trace 4045cbce364ff2a4 ]---
      [   36.777527] Kernel panic - not syncing: Fatal exception
      [   36.778526] Dumping ftrace buffer:
      [   36.779313]    (ftrace buffer empty)
      [   36.780081] Kernel Offset: disabled
      [   36.780877] ---[ end Kernel panic - not syncing: Fatal exception
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org
      Tested-by: default avatarYi Zhang <yi.zhang@redhat.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c2856ae2
    • Jens Axboe's avatar
      mq-deadline: make it clear that __dd_dispatch_request() works on all hw queues · ca11f209
      Jens Axboe authored
      Don't pass in the hardware queue to __dd_dispatch_request(), since it
      leads the reader to believe that we are returning a request for that
      specific hardware queue. That's not how mq-deadline works, the state
      for determining which request to serve next is shared across all
      hardware queues for a device.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ca11f209
    • Bart Van Assche's avatar
      target: Use sgl_alloc_order() and sgl_free() · 14db4917
      Bart Van Assche authored
      Use the sgl_alloc_order() and sgl_free() functions instead of open
      coding these functions.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Acked-by: default avatarNicholas A. Bellinger <nab@linux-iscsi.org>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      14db4917
    • Bart Van Assche's avatar
      nvmet/rdma: Use sgl_alloc() and sgl_free() · 68c6e9cd
      Bart Van Assche authored
      Use the sgl_alloc() and sgl_free() functions instead of open coding
      these functions.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      68c6e9cd
    • Bart Van Assche's avatar
      nvmet/fc: Use sgl_alloc() and sgl_free() · 4442b56f
      Bart Van Assche authored
      Use the sgl_alloc() and sgl_free() functions instead of open coding
      these functions.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Reviewed-by: default avatarJames Smart <james.smart@broadcom.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4442b56f
    • Bart Van Assche's avatar
      crypto: scompress - use sgl_alloc() and sgl_free() · 8cd579d2
      Bart Van Assche authored
      Use the sgl_alloc() and sgl_free() functions instead of open coding
      these functions.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Acked-by: default avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8cd579d2
    • Bart Van Assche's avatar
      lib/scatterlist: Introduce sgl_alloc() and sgl_free() · e80a0af4
      Bart Van Assche authored
      Many kernel drivers contain code that allocates and frees both a
      scatterlist and the pages that populate that scatterlist.
      Introduce functions in lib/scatterlist.c that perform these tasks
      instead of duplicating this functionality in multiple drivers.
      Only include these functions in the build if CONFIG_SGL_ALLOC=y
      to avoid that the kernel size increases if this functionality is
      not used.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e80a0af4
    • Wang Long's avatar
      writeback: update comment in inode_io_list_move_locked · bbbc3c1c
      Wang Long authored
      The @head can be wb->b_dirty_time, so update the comment.
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarWang Long <wanglong19@meituan.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      bbbc3c1c
    • Arnd Bergmann's avatar
      DAC960: split up ioctl function to reduce stack size · 91f7b74a
      Arnd Bergmann authored
      When CONFIG_KASAN is set, all the local variables in this function are
      allocated on the stack together, leading to a warning about possible
      kernel stack overflow:
      
      drivers/block/DAC960.c: In function 'DAC960_gam_ioctl':
      drivers/block/DAC960.c:7061:1: error: the frame size of 2240 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]
      
      By splitting up the function into smaller chunks, we can avoid that and
      make the code slightly more readable at the same time. The coding style
      in this file is completely nonstandard, and I chose to not touch that
      at all, leaving the unconventional intendation unchanged to make it
      easier to review the diff.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      91f7b74a
    • Ming Lei's avatar
      block: blk-merge: remove unnecessary check · cf8c0c6a
      Ming Lei authored
      In this case, 'sectors' can't be zero at all, so remove the check
      and let the bio be split.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      cf8c0c6a
    • Ming Lei's avatar
      block: blk-merge: try to make front segments in full size · a2d37968
      Ming Lei authored
      When merging one bvec into segment, if the bvec is too big
      to merge, current policy is to move the whole bvec into another
      new segment.
      
      This patchset changes the policy into trying to maximize size of
      front segments, that means in above situation, part of bvec
      is merged into current segment, and the remainder is put
      into next segment.
      
      This patch prepares for support multipage bvec because
      it can be quite common to see this case and we should try
      to make front segments in full size.
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a2d37968
    • Ming Lei's avatar
      blk-merge: compute bio->bi_seg_front_size efficiently · 6a501bf0
      Ming Lei authored
      It is enough to check and compute bio->bi_seg_front_size just
      after the 1st segment is found, but current code checks that
      for each bvec, which is inefficient.
      
      This patch follows the way in  __blk_recalc_rq_segments()
      for computing bio->bi_seg_front_size, and it is more efficient
      and code becomes more readable too.
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      6a501bf0
    • Ming Lei's avatar
      dm-crypt: don't clear bvec->bv_page in crypt_free_buffer_pages() · 92681eca
      Ming Lei authored
      The bio is always freed after running crypt_free_buffer_pages(), so it
      isn't necessary to clear bv->bv_page.
      
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc:dm-devel@redhat.com
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      92681eca
    • Ming Lei's avatar
      btrfs: avoid accessing bvec table directly for a cloned bio · c16a8ac3
      Ming Lei authored
      Commit 17347cec(Btrfs: change how we iterate bios in endio)
      mentioned that for dio the submitted bio may be fast cloned, we
      can't access the bvec table directly for a cloned bio, so use
      bio_get_first_bvec() to retrieve the 1st bvec.
      
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: linux-btrfs@vger.kernel.org
      Cc: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Acked: David Sterba <dsterba@suse.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c16a8ac3
    • Ming Lei's avatar
      btrfs: avoid access to .bi_vcnt directly · a0b60d72
      Ming Lei authored
      BTRFS uses bio->bi_vcnt to figure out page numbers, this approach is no
      longer valid once we start enabling multipage bvecs.
      correct once we start to enable multipage bvec.
      
      Use bio_nr_pages() to do that instead.
      
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: linux-btrfs@vger.kernel.org
      Acked-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a0b60d72
    • Ming Lei's avatar
      block: move bio_alloc_pages() to bcache · 25d8be77
      Ming Lei authored
      bcache is the only user of bio_alloc_pages(), so move this function into
      bcache, and avoid it being misused in the future.
      
      Also rename it to bch_bio_allo_pages() since it is bcache only.
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      25d8be77
    • Ming Lei's avatar
      bcache: comment on direct access to bvec table · c2421edf
      Ming Lei authored
      All direct access to bvec table are safe even after multipage bvec is
      supported.
      
      Cc: linux-bcache@vger.kernel.org
      Acked-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c2421edf
    • Ming Lei's avatar
      dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE · 8f50e358
      Ming Lei authored
      For BIO based DM, some targets aren't ready for dealing with bigger
      incoming bio than 1Mbyte, such as crypt target.
      
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc:dm-devel@redhat.com
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8f50e358
    • Ming Lei's avatar
      block: bounce: don't access bio->bi_io_vec in copy_to_high_bio_irq · 3c892a09
      Ming Lei authored
      Firstly this patch introduces BVEC_ITER_ALL_INIT for iterating one bio
      from start to end.
      
      As we need to support multipage bvecs, don't access bio->bi_io_vec
      in copy_to_high_bio_irq(), and just use the standard iterator for that.
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      3c892a09
    • Ming Lei's avatar
      block: bounce: avoid direct access to bvec table · 7891f05c
      Ming Lei authored
      We will support multipage bvecs in the future, so change to iterator way
      for getting bv_page of bvec from original bio.
      
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7891f05c
    • Ming Lei's avatar
      fs: convert to bio_last_bvec_all() · c45a8f2d
      Ming Lei authored
      This patch converts 3 users to bio_last_bvec_all(), so that we can go
      ahead and convert to multipage bvec.
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c45a8f2d
    • Ming Lei's avatar
      block: convert to bio_first_bvec_all & bio_first_page_all · 263663cd
      Ming Lei authored
      This patch converts to bio_first_bvec_all() & bio_first_page_all() for
      retrieving the 1st bvec/page, and prepares for supporting multipage bvec.
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      263663cd
    • Ming Lei's avatar
      block: introduce bio helpers for converting to multipage bvec · 86292abc
      Ming Lei authored
      The following helpers are introduced for converting current users of
      direct access to bvec table, and prepares for supporting multipage bvec:
      
      	bio_pages_all()
      	bio_first_bvec_all()
      	bio_first_page_all()
      	bio_last_bvec_all()
      
      All are named as bio_*_all() to following bio_for_each_segment_all(),
      they can only be used on bio of !bio_flagged(bio, BIO_CLONED), that means
      the whole bvec table is covered.
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      86292abc
  3. 05 Jan, 2018 2 commits
    • Paolo Valente's avatar
      block, bfq: remove batches of confusing ifdefs · 9b25bd03
      Paolo Valente authored
      Commit a33801e8 ("block, bfq: move debug blkio stats behind
      CONFIG_DEBUG_BLK_CGROUP") introduced two batches of confusing ifdefs:
      one reported in [1], plus a similar one in another function. This
      commit removes both batches, in the way suggested in [1].
      
      [1] https://www.spinics.net/lists/linux-block/msg20043.html
      
      Fixes: a33801e8 ("block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP")
      Reported-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Tested-by: default avatarLuca Miccio <lucmiccio@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9b25bd03
    • Paolo Valente's avatar
      block, bfq: consider also past I/O in soft real-time detection · a34b0244
      Paolo Valente authored
      BFQ privileges the I/O of soft real-time applications, such as video
      players, to guarantee to these application a high bandwidth and a low
      latency. In this respect, it is not easy to correctly detect when an
      application is soft real-time. A particularly nasty false positive is
      that of an I/O-bound application that occasionally happens to meet all
      requirements to be deemed as soft real-time. After being detected as
      soft real-time, such an application monopolizes the device. Fortunately,
      BFQ will realize soon that the application is actually not soft
      real-time and suspend every privilege. Yet, the application may happen
      again to be wrongly detected as soft real-time, and so on.
      
      As highlighted by our tests, this problem causes BFQ to occasionally
      fail to guarantee a high responsiveness, in the presence of heavy
      background I/O workloads. The reason is that the background workload
      happens to be detected as soft real-time, more or less frequently,
      during the execution of the interactive task under test. To give an
      idea, because of this problem, Libreoffice Writer occasionally takes 8
      seconds, instead of 3, to start up, if there are sequential reads and
      writes in the background, on a Kingston SSDNow V300.
      
      This commit addresses this issue by leveraging the following facts.
      
      The reason why some applications are detected as soft real-time despite
      all BFQ checks to avoid false positives, is simply that, during high
      CPU or storage-device load, I/O-bound applications may happen to do
      I/O slowly enough to meet all soft real-time requirements, and pass
      all BFQ extra checks. Yet, this happens only for limited time periods:
      slow-speed time intervals are usually interspersed between other time
      intervals during which these applications do I/O at a very high speed.
      To exploit these facts, this commit introduces a little change, in the
      detection of soft real-time behavior, to systematically consider also
      the recent past: the higher the speed was in the recent past, the
      later next I/O should arrive for the application to be considered as
      soft real-time. At the beginning of a slow-speed interval, the minimum
      arrival time allowed for the next I/O usually happens to still be so
      high, to fall *after* the end of the slow-speed period itself. As a
      consequence, the application does not risk to be deemed as soft
      real-time during the slow-speed interval. Then, during the next
      high-speed interval, the application cannot, evidently, be deemed as
      soft real-time (exactly because of its speed), and so on.
      
      This extra filtering proved to be rather effective: in the above test,
      the frequency of false positives became so low that the start-up time
      was 3 seconds in all iterations (apart from occasional outliers,
      caused by page-cache-management issues, which are out of the scope of
      this commit, and cannot be solved by an I/O scheduler).
      Tested-by: default avatarLee Tibbert <lee.tibbert@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarAngelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a34b0244