1. 08 Jun, 2017 3 commits
    • block, bfq: access and cache blkg data only when safe · 8f9bebc3
      Paolo Valente authored
      In blk-cgroup, operations on blkg objects are protected with the
      request_queue lock. This is no longer the lock that protects
      I/O-scheduler operations in blk-mq. In fact, the latter are now
      protected with a finer-grained per-scheduler-instance lock. As a
      consequence, although blkg lookups are also rcu-protected, blk-mq I/O
      schedulers may see inconsistent data when they access blkg and
      blkg-related objects. BFQ does access these objects, and does incur
      this problem, in the following case.
      
      The blkg_lookup performed in bfq_get_queue, being protected (only)
      through rcu, may happen to return the address of a copy of the
      original blkg. If this is the case, then the blkg_get performed in
      bfq_get_queue, to pin down the blkg, is useless: it does not prevent
      blk-cgroup code from destroying both the original blkg and all objects
      directly or indirectly referred to by the copy of the blkg. BFQ accesses
      these objects, which typically causes a crash due to a NULL-pointer
      dereference or a memory-protection violation.
      
      Some additional protection mechanism should be added to blk-cgroup to
      address this issue. In the meantime, this commit provides a quick
      temporary fix for BFQ: cache (when safe) blkg data that might
      disappear right after a blkg_lookup.
      
      In particular, this commit exploits the following facts to achieve its
      goal without introducing further locks.  Destroy operations on a blkg
      invoke, as a first step, hooks of the scheduler associated with the
      blkg. And these hooks are executed with bfqd->lock held for BFQ. As a
      consequence, for any blkg associated with the request queue an
      instance of BFQ is attached to, we are guaranteed that such a blkg is
      not destroyed, and that all the pointers it contains are consistent,
      while that instance is holding its bfqd->lock. A blkg_lookup performed
      with bfqd->lock held then returns a fully consistent blkg, which
      remains consistent for as long as this lock is held. In more detail, this holds
      even if the returned blkg is a copy of the original one.
      
      Finally, the object describing a group inside BFQ also needs to be
      protected from destruction on the blkg_free of the original blkg
      (which invokes bfq_pd_free). This commit adds private refcounting for
      this object, to let it disappear only after no bfq_queue refers to it
      any longer.
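
      The private refcounting described above can be sketched in user space
      as follows. This is only an illustration, not the BFQ code: struct group,
      struct queue, group_get, group_put, queue_attach and queue_detach are
      hypothetical names, and the sketch sets a flag where a real
      implementation would free the object.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the object describing a group inside BFQ. */
struct group {
    int ref;    /* private reference count */
    int freed;  /* set instead of calling free(), so callers can observe it */
};

/* Hypothetical stand-in for a bfq_queue that refers to its group. */
struct queue {
    struct group *grp;
};

static void group_get(struct group *g)
{
    g->ref++;
}

static void group_put(struct group *g)
{
    assert(g->ref > 0);
    if (--g->ref == 0)
        g->freed = 1;   /* last reference gone: the object may disappear */
}

/* A queue takes its own reference for as long as it points at the group. */
static void queue_attach(struct queue *q, struct group *g)
{
    group_get(g);
    q->grp = g;
}

static void queue_detach(struct queue *q)
{
    group_put(q->grp);
    q->grp = NULL;
}
```

      In this model the free path only drops the initial reference, so the
      group survives until the last attached queue has called queue_detach().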
      
      This commit also removes or updates some stale comments on locking
      issues related to blk-cgroup operations.
      Reported-by: Tomas Konir <tomas.konir@gmail.com>
      Reported-by: Lee Tibbert <lee.tibbert@gmail.com>
      Reported-by: Marco Piazza <mpiazza@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Tested-by: Tomas Konir <tomas.konir@gmail.com>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Tested-by: Marco Piazza <mpiazza@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • Merge branch 'nvme-4.12' of git://git.infradead.org/nvme into for-linus · 85d0331a
      Jens Axboe authored
      Christoph writes:
      
      "A few NVMe fixes for 4.12-rc, PCIe reset fixes and APST fixes, a
       RDMA reconnect fix, two FC fixes and a general controller removal fix."
    • Fix loop device flush before configure v3 · 64604957
      James Wang authored
      While installing SLES-12 (based on v4.4), I found that the installer
      will stall for 60+ seconds during LVM disk scan.  The root cause was
      determined to be the removal of a bound device check in loop_flush()
      by commit b5dd2f60 ("block: loop: improve performance via blk-mq").
      
      Restoring this check, examining ->lo_state as set by loop_set_fd()
      eliminates the bad behavior.
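
      The restored check can be pictured with the following user-space
      sketch. It is a model under assumptions, not the driver code: the enum
      values, struct loop_dev, expensive_flush and loop_flush_sketch are
      hypothetical names chosen for illustration.

```c
#include <assert.h>

/* Hypothetical subset of the loop device states. */
enum lo_state { LO_UNBOUND, LO_BOUND };

struct loop_dev {
    enum lo_state state;
    int flushes;    /* counts how often the expensive path ran */
};

/* Stand-in for the slow flush work that stalled the installer. */
static int expensive_flush(struct loop_dev *lo)
{
    lo->flushes++;
    return 0;
}

static int loop_flush_sketch(struct loop_dev *lo)
{
    assert(lo);
    /* Device not bound to a backing file yet: nothing to flush. */
    if (lo->state != LO_BOUND)
        return 0;
    return expensive_flush(lo);
}
```

      With the check in place, an unconfigured device returns immediately
      and never reaches the expensive path.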
      
      Test method:
      modprobe loop max_loop=64
      dd if=/dev/zero of=disk bs=512 count=200K
      for((i=0;i<4;i++))do losetup -f disk; done
      mkfs.ext4 -F /dev/loop0
      for((i=0;i<4;i++))do mkdir t$i; mount /dev/loop$i t$i;done
      for f in `ls /dev/loop[0-9]*|sort`; do \
      	echo $f; dd if=$f of=/dev/null  bs=512 count=1; \
      	done
      
      Test output:  stock          patched
      /dev/loop0    18.1217e-05    8.3842e-05
      /dev/loop1     6.1114e-05    0.000147979
      /dev/loop10    0.414701      0.000116564
      /dev/loop11    0.7474        6.7942e-05
      /dev/loop12    0.747986      8.9082e-05
      /dev/loop13    0.746532      7.4799e-05
      /dev/loop14    0.480041      9.3926e-05
      /dev/loop15    1.26453       7.2522e-05
      
      Note that from loop10 onward, the device is not mounted, yet the
      stock kernel consumes several orders of magnitude more wall time
      than it does for a mounted device.
      (Thanks to Mike Galbraith <efault@gmx.de> for reviewing the changelog.)
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: James Wang <jnwang@suse.com>
      Fixes: b5dd2f60 ("block: loop: improve performance via blk-mq")
      Signed-off-by: Jens Axboe <axboe@fb.com>
  2. 07 Jun, 2017 9 commits
    • blk-throttle: set default latency baseline for harddisk · 6679a90c
      Shaohua Li authored
      Hard disk IO latency varies a lot depending on spindle movement. The
      latency can range from several microseconds to several milliseconds, so
      it is pretty hard to get the baseline latency used by io.low.
      
      We use a different strategy here. The idea is to use only IO that
      involves spindle movement to determine whether cgroup IO is in a good
      state. For HD, if the IO latency is small (< 1ms), we ignore the IO. Such
      IO is likely sequential, and is of no help in determining whether a
      cgroup's IO is impacted by other cgroups. With this, we only account IO
      with large latency. Then we can choose a hardcoded baseline latency for
      HD (4ms, which is a typical IO latency with a seek).  With all these
      settings, the io.low latency works for both HD and SSD.
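
      The filtering rule above can be sketched as follows. This is a
      user-space illustration only: the macro and function names are
      hypothetical, and the assumption that a non-rotational device keeps
      using a separately measured baseline is not stated in this message.

```c
#include <assert.h>   /* only needed by callers that check the results */
#include <stdbool.h>

#define HD_IGNORE_BELOW_US 1000UL   /* 1ms: faster IO is ignored on HD */
#define HD_BASELINE_US     4000UL   /* 4ms: hardcoded HD baseline */

/* Decide whether one completed IO contributes to latency accounting. */
static bool latency_is_accounted(bool rotational, unsigned long lat_us)
{
    if (!rotational)
        return true;                     /* SSD: every IO counts */
    return lat_us >= HD_IGNORE_BELOW_US; /* HD: skip likely-sequential IO */
}

/* Pick the baseline: fixed for HD, measured elsewhere for SSD (assumed). */
static unsigned long baseline_latency_us(bool rotational,
                                         unsigned long measured_us)
{
    return rotational ? HD_BASELINE_US : measured_us;
}
```

      For example, a 500us IO on a hard disk is skipped, while a 5000us IO
      is accounted and compared against the 4000us baseline.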
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-throttle: fix NULL pointer dereference in throtl_schedule_pending_timer · a41b816c
      Joseph Qi authored
      I have encountered a NULL pointer dereference in
      throtl_schedule_pending_timer:
        [  413.735396] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
        [  413.735535] IP: [<ffffffff812ebbbf>] throtl_schedule_pending_timer+0x3f/0x210
        [  413.735643] PGD 22c8cf067 PUD 22cb34067 PMD 0
        [  413.735713] Oops: 0000 [#1] SMP
        ......
      
      This is caused by the following case:
        blk_throtl_bio
          throtl_schedule_next_dispatch  <= sq is top level one without parent
            throtl_schedule_pending_timer
              sq_to_tg(sq)->td->throtl_slice  <= sq_to_tg(sq) returns NULL
      
      Fix it by using sq_to_td instead of sq_to_tg(sq)->td, which will always
      return a valid td.
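
      The difference between the two helpers can be modelled in user space
      as below. This is a simplified sketch: the structures are reduced to
      the fields needed here, struct service_queue is a hypothetical
      shortening of the real type, and the helper bodies are an assumption
      about how the lookup behaves, based on the call chain shown above.

```c
#include <assert.h>   /* only needed by callers that check the results */
#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct service_queue {
    struct service_queue *parent_sq;    /* NULL for the top-level queue */
};

struct throtl_data {
    struct service_queue service_queue; /* top-level queue lives here */
    unsigned int throtl_slice;
};

struct throtl_grp {
    struct service_queue service_queue; /* per-group queue */
    struct throtl_data *td;
};

/* Returns NULL for the top-level queue, which belongs to no group. */
static struct throtl_grp *sq_to_tg(struct service_queue *sq)
{
    if (sq && sq->parent_sq)
        return container_of(sq, struct throtl_grp, service_queue);
    return NULL;
}

/* Resolves a td for both per-group and top-level queues. */
static struct throtl_data *sq_to_td(struct service_queue *sq)
{
    struct throtl_grp *tg = sq_to_tg(sq);

    if (tg)
        return tg->td;
    return container_of(sq, struct throtl_data, service_queue);
}
```

      In this model sq_to_tg(sq)->td would dereference NULL for the
      top-level queue, whereas sq_to_td(sq) resolves the td in both cases.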
      
      Fixes: 297e3d85 ("blk-throttle: make throtl_slice tunable")
      Signed-off-by: Joseph Qi <qijiang.qj@alibaba-inc.com>
      Reviewed-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • nvme: relax APST default max latency to 100ms · 9947d6a0
      Kai-Heng Feng authored
      Christoph Hellwig suggests we should make APST work out of the box.
      Hence relax the default max latency so that devices are able to enter
      their deepest power state by default.
      
      Here are id-ctrl excerpts from two high latency NVMes:
      
      vid     : 0x14a4
      ssvid   : 0x1b4b
      mn      : CX2-GB1024-Q11 NVMe LITEON 1024GB
      ps    3 : mp:0.1000W non-operational enlat:5000 exlat:5000 rrt:3 rrl:3
                rwt:3 rwl:3 idle_power:- active_power:-
      ps    4 : mp:0.0100W non-operational enlat:50000 exlat:100000 rrt:4 rrl:4
                rwt:4 rwl:4 idle_power:- active_power:-
      
      vid     : 0x15b7
      ssvid   : 0x1b4b
      mn      : A400 NVMe SanDisk 512GB
      ps    3 : mp:0.0500W non-operational enlat:51000 exlat:10000 rrt:0 rrl:0
                rwt:0 rwl:0 idle_power:- active_power:-
      ps    4 : mp:0.0055W non-operational enlat:1000000 exlat:100000 rrt:0 rrl:0
                rwt:0 rwl:0 idle_power:- active_power:-
      Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: only consider exit latency when choosing useful non-op power states · da87591b
      Kai-Heng Feng authored
      When an NVMe device is in a non-operational state, the latency is exlat.
      The latency is enlat + exlat only when the device tries to transition
      back to an operational state right after it begins to transition to a
      non-operational state, which should be a rare case.
      
      Therefore, as Andy Lutomirski suggests, use only exlat when deciding which
      power states to transition to.
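
      The effect of the change can be sketched as below. This is an
      illustration under assumptions: struct power_state and both function
      names are hypothetical, latencies are taken to be in microseconds, and
      the "less than or equal" comparison against the limit is assumed
      rather than taken from the driver.

```c
#include <assert.h>   /* only needed by callers that check the results */
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of one non-operational power state. */
struct power_state {
    uint64_t enlat_us;  /* entry latency */
    uint64_t exlat_us;  /* exit latency */
};

/* Old rule: entry plus exit latency must fit within the limit. */
static bool usable_with_sum(const struct power_state *ps, uint64_t max_us)
{
    return ps->enlat_us + ps->exlat_us <= max_us;
}

/* New rule: only the exit latency is compared against the limit. */
static bool usable_with_exlat(const struct power_state *ps, uint64_t max_us)
{
    return ps->exlat_us <= max_us;
}
```

      For a state with enlat 50000 and exlat 100000 and a limit of 100000us,
      the old rule rejects the state while the new rule accepts it.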
      Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-fc: fix missing put reference on controller create failure · 24b7f059
      James Smart authored
      The failure path of a create controller request called
      nvme_uninit_ctrl() but didn't do a put to allow the nvme
      controller to be deleted.
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-fc: on lldd/transport io error, terminate association · f874d5d0
      James Smart authored
      Per FC-NVME, when the lldd or transport detects an i/o error, the
      connection must be terminated, which in turn requires the association
      to be terminated.  Currently the transport simply creates an nvme
      completion status of transport error and returns the io. The FC-NVME
      spec mandates this because initiator and host, depending on the error,
      can get out of sync on outstanding io counts (sqhd/sqtail).
      
      Implement the association teardown on lldd or transport detected
      errors.
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    • nvme-rdma: fast fail incoming requests while we reconnect · e818a5b4
      Sagi Grimberg authored
      When we encounter transport/controller errors, error recovery
      kicks in, which performs the following steps:
      1. stops io/admin queues
      2. moves transport queues out of LIVE state
      3. fast fails pending io
      4. schedules periodic reconnects.
      
      But we also need to fast fail incoming IO that enters after we have
      already scheduled. Given that our queue is not LIVE anymore, simply
      restart the request queues so that requests fail in .queue_rq
      Reported-by: Alex Turin <alex@vastdata.com>
      Reported-by: shahar.salzman <shahar.salzman@gmail.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org
    • nvme-pci: fix multiple ctrl removal scheduling · 82b057ca
      Rakesh Pandit authored
      Commit c5f6ce97 tries to address multiple resets but fails, as
      work_busy doesn't involve any synchronization and can fail.  This is
      easily reproducible, as can be seen from the WARNING below, which is
      triggered by the line:
      
      WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING)
      
      Allowing multiple resets can also result in multiple controller removals
      if different conditions inside nvme_reset_work fail, which
      might deadlock on device_release_driver.
      
      [  480.327007] WARNING: CPU: 3 PID: 150 at drivers/nvme/host/pci.c:1900 nvme_reset_work+0x36c/0xec0
      [  480.327008] Modules linked in: rfcomm fuse nf_conntrack_netbios_ns nf_conntrack_broadcast...
      [  480.327044]  btusb videobuf2_core ghash_clmulni_intel snd_hwdep cfg80211 acer_wmi hci_uart..
      [  480.327065] CPU: 3 PID: 150 Comm: kworker/u16:2 Not tainted 4.12.0-rc1+ #13
      [  480.327065] Hardware name: Acer Predator G9-591/Mustang_SLS, BIOS V1.10 03/03/2016
      [  480.327066] Workqueue: nvme nvme_reset_work
      [  480.327067] task: ffff880498ad8000 task.stack: ffffc90002218000
      [  480.327068] RIP: 0010:nvme_reset_work+0x36c/0xec0
      [  480.327069] RSP: 0018:ffffc9000221bdb8 EFLAGS: 00010246
      [  480.327070] RAX: 0000000000460000 RBX: ffff880498a98128 RCX: dead000000000200
      [  480.327070] RDX: 0000000000000001 RSI: ffff8804b1028020 RDI: ffff880498a98128
      [  480.327071] RBP: ffffc9000221be50 R08: 0000000000000000 R09: 0000000000000000
      [  480.327071] R10: ffffc90001963ce8 R11: 000000000000020d R12: ffff880498a98000
      [  480.327072] R13: ffff880498a53500 R14: ffff880498a98130 R15: ffff880498a98128
      [  480.327072] FS:  0000000000000000(0000) GS:ffff8804c1cc0000(0000) knlGS:0000000000000000
      [  480.327073] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  480.327074] CR2: 00007ffcf3c37f78 CR3: 0000000001e09000 CR4: 00000000003406e0
      [  480.327074] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  480.327075] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  480.327075] Call Trace:
      [  480.327079]  ? __switch_to+0x227/0x400
      [  480.327081]  process_one_work+0x18c/0x3a0
      [  480.327082]  worker_thread+0x4e/0x3b0
      [  480.327084]  kthread+0x109/0x140
      [  480.327085]  ? process_one_work+0x3a0/0x3a0
      [  480.327087]  ? kthread_park+0x60/0x60
      [  480.327102]  ret_from_fork+0x2c/0x40
      [  480.327103] Code: e8 5a dc ff ff 85 c0 41 89 c1 0f.....
      
      This patch addresses the problem by using the state of the controller
      to decide whether a reset should be queued or not, as state changes are
      synchronized using the controller spinlock.  Also, cancel_work_sync is
      used to make sure remove cancels the reset_work and waits for it to
      finish.  This patch also changes the return value from -ENODEV to the
      more appropriate -EBUSY if nvme_reset fails to change state.
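
      The state-based gating can be sketched in user space as follows. This
      is a single-threaded model under assumptions: the two-state enum,
      struct ctrl, change_state and ctrl_reset are hypothetical names, the
      spinlock that serializes the state change is omitted, and only the
      LIVE to RESETTING transition is modelled.

```c
#include <assert.h>   /* only needed by callers that check the results */
#include <errno.h>
#include <stdbool.h>

enum ctrl_state { CTRL_LIVE, CTRL_RESETTING };

struct ctrl {
    enum ctrl_state state;
    int resets_queued;  /* how many times reset work was queued */
};

/* In the kernel this transition would happen under the controller lock. */
static bool change_state(struct ctrl *c, enum ctrl_state new_state)
{
    if (new_state == CTRL_RESETTING && c->state != CTRL_LIVE)
        return false;   /* a reset is already in progress */
    c->state = new_state;
    return true;
}

/* Queue a reset only if the state change succeeded. */
static int ctrl_reset(struct ctrl *c)
{
    if (!change_state(c, CTRL_RESETTING))
        return -EBUSY;
    c->resets_queued++;
    return 0;
}
```

      A second reset request issued while the first is still in progress
      gets -EBUSY and queues nothing.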
      
      Fixes: c5f6ce97 ("nvme: don't schedule multiple resets")
      Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: fix hang in remove path · 82654b6b
      Ming Lei authored
      We need to start admin queues too in nvme_kill_queues()
      to avoid a hang in the remove path[1].
      
      This patch is very similar to 806f026f (nvme: use
      blk_mq_start_hw_queues() in nvme_kill_queues()).
      
      [1] hang stack trace
      [<ffffffff813c9716>] blk_execute_rq+0x56/0x80
      [<ffffffff815cb6e9>] __nvme_submit_sync_cmd+0x89/0xf0
      [<ffffffff815ce7be>] nvme_set_features+0x5e/0x90
      [<ffffffff815ce9f6>] nvme_configure_apst+0x166/0x200
      [<ffffffff815cef45>] nvme_set_latency_tolerance+0x35/0x50
      [<ffffffff8157bd11>] apply_constraint+0xb1/0xc0
      [<ffffffff8157cbb4>] dev_pm_qos_constraints_destroy+0xf4/0x1f0
      [<ffffffff8157b44a>] dpm_sysfs_remove+0x2a/0x60
      [<ffffffff8156d951>] device_del+0x101/0x320
      [<ffffffff8156db8a>] device_unregister+0x1a/0x60
      [<ffffffff8156dc4c>] device_destroy+0x3c/0x50
      [<ffffffff815cd295>] nvme_uninit_ctrl+0x45/0xa0
      [<ffffffff815d4858>] nvme_remove+0x78/0x110
      [<ffffffff81452b69>] pci_device_remove+0x39/0xb0
      [<ffffffff81572935>] device_release_driver_internal+0x155/0x210
      [<ffffffff81572a02>] device_release_driver+0x12/0x20
      [<ffffffff815d36fb>] nvme_remove_dead_ctrl_work+0x6b/0x70
      [<ffffffff810bf3bc>] process_one_work+0x18c/0x3a0
      [<ffffffff810bf61e>] worker_thread+0x4e/0x3b0
      [<ffffffff810c5ac9>] kthread+0x109/0x140
      [<ffffffff8185800c>] ret_from_fork+0x2c/0x40
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      Fixes: c5552fde("nvme: Enable autonomous power state transitions")
      Reported-by: Rakesh Pandit <rakesh@tuxera.com>
      Tested-by: Rakesh Pandit <rakesh@tuxera.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  3. 06 Jun, 2017 3 commits
    • elevator: fix truncation of icq_cache_name · 9bd2bbc0
      Eric Biggers authored
      gcc 7.1 reports the following warning:
      
          block/elevator.c: In function ‘elv_register’:
          block/elevator.c:898:5: warning: ‘snprintf’ output may be truncated before the last format character [-Wformat-truncation=]
               "%s_io_cq", e->elevator_name);
               ^~~~~~~~~~
          block/elevator.c:897:3: note: ‘snprintf’ output between 7 and 22 bytes into a destination of size 21
             snprintf(e->icq_cache_name, sizeof(e->icq_cache_name),
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
               "%s_io_cq", e->elevator_name);
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      The bug is that the name of the icq_cache is 6 characters longer than
      the elevator name, but only ELV_NAME_MAX + 5 characters were reserved
      for it --- so in the case of a maximum-length elevator name, the 'q'
      character in "_io_cq" would be truncated by snprintf().  Fix it by
      reserving ELV_NAME_MAX + 6 characters instead.
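
      The truncation can be reproduced in user space with the sketch below.
      It assumes ELV_NAME_MAX is 16, which is inferred from the destination
      size of 21 reported in the warning above; build_icq_name is a
      hypothetical helper, not a kernel function.

```c
#include <assert.h>   /* only needed by callers that check the results */
#include <stdio.h>
#include <string.h>

#define ELV_NAME_MAX 16 /* assumed: the warning reports ELV_NAME_MAX + 5 == 21 */

/*
 * Format "<name>_io_cq" into dst. Returns what snprintf() returns: the
 * length the full string needs, excluding the terminating NUL.
 */
static int build_icq_name(char *dst, size_t dst_size, const char *elv_name)
{
    return snprintf(dst, dst_size, "%s_io_cq", elv_name);
}
```

      With a 15-character elevator name the full string needs 21 characters
      plus the terminating NUL. A buffer of ELV_NAME_MAX + 5 bytes therefore
      loses the final 'q', while ELV_NAME_MAX + 6 bytes hold the whole name.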
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: fix direct issue · d964f04a
      Ming Lei authored
      If the queue is stopped, we shouldn't dispatch requests into the driver
      and hardware. Unfortunately, the check was removed in bd166ef1
      (blk-mq-sched: add framework for MQ capable IO schedulers).
      
      This patch fixes the issue by moving the check back into
      __blk_mq_try_issue_directly().
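
      The restored check can be modelled as below. This is a user-space
      sketch under assumptions: struct hw_queue and try_issue_directly are
      hypothetical names, and parking the request for a later queue run when
      the queue is stopped is an assumed fallback, not something this
      message spells out.

```c
#include <assert.h>   /* only needed by callers that check the results */
#include <stdbool.h>

/* Hypothetical model of a hardware queue. */
struct hw_queue {
    bool stopped;
    int issued;     /* requests handed to the driver */
    int inserted;   /* requests parked for a later queue run */
};

/* Returns true if the request went straight to the driver. */
static bool try_issue_directly(struct hw_queue *hctx)
{
    /* A stopped queue must not dispatch into the driver or hardware. */
    if (hctx->stopped) {
        hctx->inserted++;
        return false;
    }
    hctx->issued++;
    return true;
}
```

      In this model a stopped queue never reaches the driver, so the request
      stays parked until the queue is started again.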
      
      This patch fixes a request use-after-free[1][2] while canceling requests
      of NVMe in nvme_dev_disable(), which can be triggered easily during
      an NVMe reset & remove test.
      
      [1] oops kernel log when CONFIG_BLK_DEV_INTEGRITY is on
      [  103.412969] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a
      [  103.412980] IP: bio_integrity_advance+0x48/0xf0
      [  103.412981] PGD 275a88067
      [  103.412981] P4D 275a88067
      [  103.412982] PUD 276c43067
      [  103.412983] PMD 0
      [  103.412984]
      [  103.412986] Oops: 0000 [#1] SMP
      [  103.412989] Modules linked in: vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd cryptd ipmi_ssif iTCO_wdt iTCO_vendor_support mxm_wmi glue_helper dcdbas ipmi_si mei_me pcspkr mei sg ipmi_devintf lpc_ich ipmi_msghandler shpchp acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel nvme ahci nvme_core libahci libata tg3 i2c_core megaraid_sas ptp pps_core dm_mirror dm_region_hash dm_log dm_mod
      [  103.413035] CPU: 0 PID: 102 Comm: kworker/0:2 Not tainted 4.11.0+ #1
      [  103.413036] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
      [  103.413041] Workqueue: events nvme_remove_dead_ctrl_work [nvme]
      [  103.413043] task: ffff9cc8775c8000 task.stack: ffffc033c252c000
      [  103.413045] RIP: 0010:bio_integrity_advance+0x48/0xf0
      [  103.413046] RSP: 0018:ffffc033c252fc10 EFLAGS: 00010202
      [  103.413048] RAX: 0000000000000000 RBX: ffff9cc8720a8cc0 RCX: ffff9cca72958240
      [  103.413049] RDX: ffff9cca72958000 RSI: 0000000000000008 RDI: ffff9cc872537f00
      [  103.413049] RBP: ffffc033c252fc28 R08: 0000000000000000 R09: ffffffffb963a0d5
      [  103.413050] R10: 000000000000063e R11: 0000000000000000 R12: ffff9cc8720a8d18
      [  103.413051] R13: 0000000000001000 R14: ffff9cc872682e00 R15: 00000000fffffffb
      [  103.413053] FS:  0000000000000000(0000) GS:ffff9cc877c00000(0000) knlGS:0000000000000000
      [  103.413054] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  103.413055] CR2: 000000000000000a CR3: 0000000276c41000 CR4: 00000000001406f0
      [  103.413056] Call Trace:
      [  103.413063]  bio_advance+0x2a/0xe0
      [  103.413067]  blk_update_request+0x76/0x330
      [  103.413072]  blk_mq_end_request+0x1a/0x70
      [  103.413074]  blk_mq_dispatch_rq_list+0x370/0x410
      [  103.413076]  ? blk_mq_flush_busy_ctxs+0x94/0xe0
      [  103.413080]  blk_mq_sched_dispatch_requests+0x173/0x1a0
      [  103.413083]  __blk_mq_run_hw_queue+0x8e/0xa0
      [  103.413085]  __blk_mq_delay_run_hw_queue+0x9d/0xa0
      [  103.413088]  blk_mq_start_hw_queue+0x17/0x20
      [  103.413090]  blk_mq_start_hw_queues+0x32/0x50
      [  103.413095]  nvme_kill_queues+0x54/0x80 [nvme_core]
      [  103.413097]  nvme_remove_dead_ctrl_work+0x1f/0x40 [nvme]
      [  103.413103]  process_one_work+0x149/0x360
      [  103.413105]  worker_thread+0x4d/0x3c0
      [  103.413109]  kthread+0x109/0x140
      [  103.413111]  ? rescuer_thread+0x380/0x380
      [  103.413113]  ? kthread_park+0x60/0x60
      [  103.413120]  ret_from_fork+0x2c/0x40
      [  103.413121] Code: 08 4c 8b 63 50 48 8b 80 80 00 00 00 48 8b 90 d0 03 00 00 31 c0 48 83 ba 40 02 00 00 00 48 8d 8a 40 02 00 00 48 0f 45 c1 c1 ee 09 <0f> b6 48 0a 0f b6 40 09 41 89 f5 83 e9 09 41 d3 ed 44 0f af e8
      [  103.413145] RIP: bio_integrity_advance+0x48/0xf0 RSP: ffffc033c252fc10
      [  103.413146] CR2: 000000000000000a
      [  103.413157] ---[ end trace cd6875d16eb5a11e ]---
      [  103.455368] Kernel panic - not syncing: Fatal exception
      [  103.459826] Kernel Offset: 0x37600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      [  103.850916] ---[ end Kernel panic - not syncing: Fatal exception
      [  103.857637] sched: Unexpected reschedule of offline CPU#1!
      [  103.863762] ------------[ cut here ]------------
      
      [2] kernel hang in blk_mq_freeze_queue_wait() when CONFIG_BLK_DEV_INTEGRITY is off
      [  247.129825] INFO: task nvme-test:1772 blocked for more than 120 seconds.
      [  247.137311]       Not tainted 4.12.0-rc2.upstream+ #4
      [  247.142954] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  247.151704] Call Trace:
      [  247.154445]  __schedule+0x28a/0x880
      [  247.158341]  schedule+0x36/0x80
      [  247.161850]  blk_mq_freeze_queue_wait+0x4b/0xb0
      [  247.166913]  ? remove_wait_queue+0x60/0x60
      [  247.171485]  blk_freeze_queue+0x1a/0x20
      [  247.175770]  blk_cleanup_queue+0x7f/0x140
      [  247.180252]  nvme_ns_remove+0xa3/0xb0 [nvme_core]
      [  247.185503]  nvme_remove_namespaces+0x32/0x50 [nvme_core]
      [  247.191532]  nvme_uninit_ctrl+0x2d/0xa0 [nvme_core]
      [  247.196977]  nvme_remove+0x70/0x110 [nvme]
      [  247.201545]  pci_device_remove+0x39/0xc0
      [  247.205927]  device_release_driver_internal+0x141/0x200
      [  247.211761]  device_release_driver+0x12/0x20
      [  247.216531]  pci_stop_bus_device+0x8c/0xa0
      [  247.221104]  pci_stop_and_remove_bus_device_locked+0x1a/0x30
      [  247.227420]  remove_store+0x7c/0x90
      [  247.231320]  dev_attr_store+0x18/0x30
      [  247.235409]  sysfs_kf_write+0x3a/0x50
      [  247.239497]  kernfs_fop_write+0xff/0x180
      [  247.243867]  __vfs_write+0x37/0x160
      [  247.247757]  ? selinux_file_permission+0xe5/0x120
      [  247.253011]  ? security_file_permission+0x3b/0xc0
      [  247.258260]  vfs_write+0xb2/0x1b0
      [  247.261964]  ? syscall_trace_enter+0x1d0/0x2b0
      [  247.266924]  SyS_write+0x55/0xc0
      [  247.270540]  do_syscall_64+0x67/0x150
      [  247.274636]  entry_SYSCALL64_slow_path+0x25/0x25
      [  247.279794] RIP: 0033:0x7f5c96740840
      [  247.283785] RSP: 002b:00007ffd00e87ee8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [  247.292238] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5c96740840
      [  247.300194] RDX: 0000000000000002 RSI: 00007f5c97060000 RDI: 0000000000000001
      [  247.308159] RBP: 00007f5c97060000 R08: 000000000000000a R09: 00007f5c97059740
      [  247.316123] R10: 0000000000000001 R11: 0000000000000246 R12: 00007f5c96a14400
      [  247.324087] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
      [  370.016340] INFO: task nvme-test:1772 blocked for more than 120 seconds.
      
      Fixes: 12d70958(blk-mq: don't fail allocating driver tag for stopped hw queue)
      Cc: stable@vger.kernel.org
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: pass correct hctx to blk_mq_try_issue_directly · dad7a3be
      Ming Lei authored
      When direct issue is done on a request picked up from the plug list,
      the hctx needs to be updated with the actual hw queue; otherwise the
      wrong hctx is used, which may hurt performance, especially when the
      wrong SRCU read lock is acquired/released.
      Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  4. 03 Jun, 2017 1 commit
    • bio-integrity: Do not allocate integrity context for bio w/o data · 3116a23b
      Dmitry Monakhov authored
      If a bio has no data, such as ones from blkdev_issue_flush(),
      then we have nothing to protect.
      
      This patch prevents a BUG like the following:
      
      kfree_debugcheck: out of range ptr ac1fa1d106742a5ah
      kernel BUG at mm/slab.c:2773!
      invalid opcode: 0000 [#1] SMP
      Modules linked in: bcache
      CPU: 0 PID: 4428 Comm: xfs_io Tainted: G        W       4.11.0-rc4-ext4-00041-g2ef0043-dirty #43
      Hardware name: Virtuozzo KVM, BIOS seabios-1.7.5-11.vz7.4 04/01/2014
      task: ffff880137786440 task.stack: ffffc90000ba8000
      RIP: 0010:kfree_debugcheck+0x25/0x2a
      RSP: 0018:ffffc90000babde0 EFLAGS: 00010082
      RAX: 0000000000000034 RBX: ac1fa1d106742a5a RCX: 0000000000000007
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013f3ccb40
      RBP: ffffc90000babde8 R08: 0000000000000000 R09: 0000000000000000
      R10: 00000000fcb76420 R11: 00000000725172ed R12: 0000000000000282
      R13: ffffffff8150e766 R14: ffff88013a145e00 R15: 0000000000000001
      FS:  00007fb09384bf40(0000) GS:ffff88013f200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fd0172f9e40 CR3: 0000000137fa9000 CR4: 00000000000006f0
      Call Trace:
       kfree+0xc8/0x1b3
       bio_integrity_free+0xc3/0x16b
       bio_free+0x25/0x66
       bio_put+0x14/0x26
       blkdev_issue_flush+0x7a/0x85
       blkdev_fsync+0x35/0x42
       vfs_fsync_range+0x8e/0x9f
       vfs_fsync+0x1c/0x1e
       do_fsync+0x31/0x4a
       SyS_fsync+0x10/0x14
       entry_SYSCALL_64_fastpath+0x1f/0xc2
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  5. 02 Jun, 2017 17 commits
  6. 01 Jun, 2017 7 commits
    • Merge tag 'nfsd-4.12-1' of git://linux-nfs.org/~bfields/linux · 3b1e342b
      Linus Torvalds authored
      Pull nfsd fixes from Bruce Fields:
       "Revert patch accidentally included in the merge window pull request,
        and fix a crash that was likely a result of buggy client behavior"
      
      * tag 'nfsd-4.12-1' of git://linux-nfs.org/~bfields/linux:
        nfsd4: fix null dereference on replay
        nfsd: Revert "nfsd: check for oversized NFSv2/v3 arguments"
    • Merge tag 'gcc-plugins-v4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 2f48641c
      Linus Torvalds authored
      Pull gcc-plugin prepwork from Kees Cook:
       "Use designated initializers for mtk-vcodec, powerplay, amdgpu, and
        sgi-xp. Use ERR_CAST() to avoid cross-structure cast in ocfs2, ntfs,
        and NFS.
      
        Christoph Hellwig recommended that I send these fixes now, rather than
        waiting for the v4.13 merge window. These are all initializer and cast
        fixes needed for the future randstruct plugin that haven't been picked
        up by the respective maintainers"
      
      * tag 'gcc-plugins-v4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        mtk-vcodec: Use designated initializers
        drm/amd/powerplay: Use designated initializers
        drm/amdgpu: Use designated initializers
        sgi-xp: Use designated initializers
        ocfs2: Use ERR_CAST() to avoid cross-structure cast
        ntfs: Use ERR_CAST() to avoid cross-structure cast
        NFS: Use ERR_CAST() to avoid cross-structure cast
    • block: Avoid that blk_exit_rl() triggers a use-after-free · b425e504
      Bart Van Assche authored
      Since the introduction of .init_rq_fn() and .exit_rq_fn() it is
      essential that the memory allocated for struct request_queue
      stays around until all blk_exit_rl() calls have finished. Hence
      make blk_init_rl() take a reference on struct request_queue.
      
      This patch fixes the following crash:
      
      general protection fault: 0000 [#2] SMP
      CPU: 3 PID: 28 Comm: ksoftirqd/3 Tainted: G      D         4.12.0-rc2-dbg+ #2
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
      task: ffff88013a108040 task.stack: ffffc9000071c000
      RIP: 0010:free_request_size+0x1a/0x30
      RSP: 0018:ffffc9000071fd38 EFLAGS: 00010202
      RAX: 6b6b6b6b6b6b6b6b RBX: ffff880067362a88 RCX: 0000000000000003
      RDX: ffff880067464178 RSI: ffff880067362a88 RDI: ffff880135ea4418
      RBP: ffffc9000071fd40 R08: 0000000000000000 R09: 0000000100180009
      R10: ffffc9000071fd38 R11: ffffffff81110800 R12: ffff88006752d3d8
      R13: ffff88006752d3d8 R14: ffff88013a108040 R15: 000000000000000a
      FS:  0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fa8ec1edb00 CR3: 0000000138ee8000 CR4: 00000000001406e0
      Call Trace:
       mempool_destroy.part.10+0x21/0x40
       mempool_destroy+0xe/0x10
       blk_exit_rl+0x12/0x20
       blkg_free+0x4d/0xa0
       __blkg_release_rcu+0x59/0x170
       rcu_process_callbacks+0x260/0x4e0
       __do_softirq+0x116/0x250
       smpboot_thread_fn+0x123/0x1e0
       kthread+0x109/0x140
       ret_from_fork+0x31/0x40
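
      The lifetime rule behind the fix described above can be sketched in
      plain userspace C. This is only an illustrative model, not the kernel
      code: toy_queue, toy_rl and every toy_* helper are hypothetical names,
      and a plain int stands in for the kernel's reference counting. The
      point it shows is that the request list pins the queue at init time,
      so the exit path can still safely dereference it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct request_queue. */
struct toy_queue {
    int refs;         /* plain counter; the kernel uses its own refcounting */
    int exit_calls;   /* how often the exit path touched this queue */
    int *freed_flag;  /* set to 1 when the queue memory is released */
};

/* Hypothetical stand-in for struct request_list. */
struct toy_rl {
    struct toy_queue *q;
};

/* Allocate a queue holding one reference, owned by the caller. */
static struct toy_queue *toy_queue_alloc(int *freed_flag)
{
    struct toy_queue *q = calloc(1, sizeof(*q));
    if (!q)
        return NULL;
    q->refs = 1;
    q->freed_flag = freed_flag;
    return q;
}

static void toy_queue_get(struct toy_queue *q)
{
    assert(q->refs > 0);
    q->refs++;
}

/* Drop one reference; free the queue when the last one goes away. */
static void toy_queue_put(struct toy_queue *q)
{
    assert(q->refs > 0);
    if (--q->refs == 0) {
        *q->freed_flag = 1;
        free(q);
    }
}

/* Analogue of the fix: the request list takes its own reference. */
static void toy_init_rl(struct toy_rl *rl, struct toy_queue *q)
{
    rl->q = q;
    toy_queue_get(q);
}

/* The exit path dereferences the queue, then drops the reference. */
static void toy_exit_rl(struct toy_rl *rl)
{
    rl->q->exit_calls++;
    toy_queue_put(rl->q);
    rl->q = NULL;
}
```

      In this model, if the owner calls toy_queue_put() before toy_exit_rl()
      runs, the queue survives until the exit path has finished with it.
      Without the toy_queue_get() in toy_init_rl(), the same ordering would
      have the exit path touching freed memory.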
      
      Fixes: commit e9c787e6 ("scsi: allocate scsi_cmnd structures as part of struct request")
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org> # v4.11+
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b425e504
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 9ea15a59
      Linus Torvalds authored
      Pull KVM fixes from Paolo Bonzini:
       "Many small x86 bug fixes: SVM segment registers access rights, nested
        VMX, preempt notifiers, LAPIC virtual wire mode, NMI injection"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: x86: Fix nmi injection failure when vcpu got blocked
        KVM: SVM: do not zero out segment attributes if segment is unusable or not present
        KVM: SVM: ignore type when setting segment registers
        KVM: nVMX: fix nested_vmx_check_vmptr failure paths under debugging
        KVM: x86: Fix virtual wire mode
        KVM: nVMX: Fix handling of lmsw instruction
        KVM: X86: Fix preempt the preemption timer cancel
      9ea15a59
    • Linus Torvalds's avatar
      Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs · 0bb23039
      Linus Torvalds authored
      Pull Reiserfs and GFS2 fixes from Jan Kara:
       "Fixes to GFS2 & Reiserfs for the fallout of the recent WRITE_FUA
        cleanup from Christoph.
      
        Fixes for other filesystems were already merged by respective
        maintainers."
      
      * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
        reiserfs: Make flush bios explicitely sync
        gfs2: Make flush bios explicitely sync
      0bb23039
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · 393bcfae
      Linus Torvalds authored
      Pull SCSI target fixes from Nicholas Bellinger:
       "Here are the target-pending fixes for v4.12-rc4:
      
         - ibmvscsis ABORT_TASK handling fixes that missed the v4.12 merge
           window. (Bryant Ly and Michael Cyr)
      
         - Re-add a target-core check enforcing WRITE overflow reject that was
           relaxed in v4.3, to avoid unsupported iscsi-target immediate data
           overflow. (nab)
      
         - Fix a target-core-user OOPs during device removal. (MNC + Bryant
           Ly)
      
         - Fix a long standing iscsi-target potential issue where kthread exit
           did not wait for kthread_should_stop(). (Jiang Yi)
      
         - Fix an iscsi-target v3.12.y regression OOPs involving initial login
           PDU processing during asynchronous TCP connection close. (MNC +
           nab)
      
        This is a little larger than usual for an -rc4, primarily due to the
        iscsi-target v3.12.y regression OOPs bug-fix.
      
        However, it's an important patch as MNC + Hannes were both able to
        trigger it using a reduced iscsi initiator login timeout combined with
        a backend taking a long time to complete I/Os during iscsi login
        driven session reinstatement"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
        iscsi-target: Always wait for kthread_should_stop() before kthread exit
        iscsi-target: Fix initial login PDU asynchronous socket close OOPs
        tcmu: fix crash during device removal
        target: Re-add check to reject control WRITEs with overflow data
        ibmvscsis: Fix the incorrect req_lim_delta
        ibmvscsis: Clear left-over abort_cmd pointers
      393bcfae
    • Ingo Molnar's avatar
      Revert "x86/PAT: Fix Xorg regression on CPUs that don't support PAT" · c08d5174
      Ingo Molnar authored
      This reverts commit cbed27cd.
      
      As Andy Lutomirski observed:
      
       "I think this patch is bogus. pat_enabled() sure looks like it's
        supposed to return true if PAT is *enabled*, and these days PAT is
        'enabled' even if there's no HW PAT support."
      Reported-by: Bernhard Held <berny156@gmx.de>
      Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: stable@vger.kernel.org # v4.2+
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c08d5174