1. 03 Jan, 2022 1 commit
    • md/raid1: fix missing bitmap update w/o WriteMostly devices · 46669e86
      Song Liu authored
      Commit [1] causes missing bitmap updates when there are no WriteMostly
      devices.
      
      Detailed steps to reproduce by Norbert (which somehow didn't make it to lore):
      
         # setup md10 (raid1) with two drives (1 GByte sparse files)
         dd if=/dev/zero of=disk1 bs=1024k seek=1024 count=0
         dd if=/dev/zero of=disk2 bs=1024k seek=1024 count=0
      
         losetup /dev/loop11 disk1
         losetup /dev/loop12 disk2
      
         mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/loop11 /dev/loop12
      
         # add bitmap (aka write-intent log)
         mdadm /dev/md10 --grow --bitmap=internal
      
         echo check > /sys/block/md10/md/sync_action
      
         root:# cat /sys/block/md10/md/mismatch_cnt
         0
         root:#
      
         # remove member drive disk2 (loop12)
         mdadm /dev/md10 -f loop12 ; mdadm /dev/md10 -r loop12
      
         # modify degraded md device
         dd if=/dev/urandom of=/dev/md10 bs=512 count=1
      
         # no blocks recorded as out of sync on the remaining member disk1/loop11
         root:# mdadm -X /dev/loop11 | grep Bitmap
                   Bitmap : 16 bits (chunks), 0 dirty (0.0%)
         root:#
      
         # re-add disk2, nothing synced because of empty bitmap
         mdadm /dev/md10 --re-add /dev/loop12
      
         # check integrity again
         echo check > /sys/block/md10/md/sync_action
      
         # disk1 and disk2 are no longer in sync, reads return different data
         root:# cat /sys/block/md10/md/mismatch_cnt
         128
         root:#
      
         # clean up
         mdadm -S /dev/md10
         losetup -d /dev/loop11
         losetup -d /dev/loop12
         rm disk1 disk2
      
      Fix this by moving the WriteMostly check to the if condition for
      alloc_behind_master_bio().
      
      [1] commit fd3b6975 ("md/raid1: only allocate write behind bio for WriteMostly device")
      Fixes: fd3b6975 ("md/raid1: only allocate write behind bio for WriteMostly device")
      Cc: stable@vger.kernel.org # v5.12+
      Cc: Guoqing Jiang <guoqing.jiang@linux.dev>
      Cc: Jens Axboe <axboe@kernel.dk>
      Reported-by: Norbert Warmuth <nwarmuth@t-online.de>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Song Liu <song@kernel.org>
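
The control-flow change described in this commit can be illustrated with a toy model. This is a minimal sketch in Python, not the kernel code: `first_clone_step`, `dirty_chunks`, and the `fixed` flag are hypothetical names, and the set stands in for the write-intent bitmap. It assumes that before the fix the WriteMostly check gated both the write-behind allocation and the bitmap update, and that after the fix it gates only the allocation.

```python
def first_clone_step(write_mostly, dirty_chunks, chunk, fixed):
    """Toy model of the first-clone handling for one write.

    write_mostly: whether the member device has the WriteMostly flag.
    dirty_chunks: set standing in for the write-intent bitmap.
    fixed:        False models the regression, True models the fix.
    Returns True if a write-behind bio would be allocated.
    """
    behind_allocated = False
    if fixed:
        # WriteMostly only decides whether write-behind is set up.
        if write_mostly:
            behind_allocated = True
        # The bitmap is updated regardless of the flag.
        dirty_chunks.add(chunk)
    else:
        # Regression: the flag gates the bitmap update as well.
        if write_mostly:
            behind_allocated = True
            dirty_chunks.add(chunk)
    return behind_allocated


# Degraded array whose remaining member is not WriteMostly.
buggy_bitmap, fixed_bitmap = set(), set()
first_clone_step(False, buggy_bitmap, chunk=0, fixed=False)
first_clone_step(False, fixed_bitmap, chunk=0, fixed=True)
print(len(buggy_bitmap), len(fixed_bitmap))
```

In the buggy variant the set stays empty, which corresponds to the "0 dirty" bitmap in the reproduction steps. In the fixed variant the chunk is recorded, so a later re-add has something to resync.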
  2. 19 Dec, 2021 1 commit
  3. 15 Dec, 2021 2 commits
  4. 14 Dec, 2021 1 commit
    • iocost: Fix divide-by-zero on donation from low hweight cgroup · edaa2633
      Tejun Heo authored
      The donation calculation logic assumes that the donor has non-zero
      after-donation hweight, so the lowest active hweight a donating cgroup can
      have is 2 so that it can donate 1 while keeping the other 1 for itself.
      Earlier, we only donated from cgroups with sizable surpluses so this
      condition was always true. However, with the precise donation algorithm
      implemented, f1de2439 ("blk-iocost: revamp donation amount
      determination") made the donation amount calculation exact, enabling even low
      hweight cgroups to donate.
      
      This means that on rare occasions, a cgroup with active hweight of 1 can
      enter donation calculation triggering the following warning and then a
      divide-by-zero oops.
      
       WARNING: CPU: 4 PID: 0 at block/blk-iocost.c:1928 transfer_surpluses.cold+0x0/0x53 [884/94867]
       ...
       RIP: 0010:transfer_surpluses.cold+0x0/0x53
       Code: 92 ff 48 c7 c7 28 d1 ab b5 65 48 8b 34 25 00 ae 01 00 48 81 c6 90 06 00 00 e8 8b 3f fe ff 48 c7 c0 ea ff ff ff e9 95 ff 92 ff <0f> 0b 48 c7 c7 30 da ab b5 e8 71 3f fe ff 4c 89 e8 4d 85 ed 74 04
       ...
       Call Trace:
        <IRQ>
        ioc_timer_fn+0x1043/0x1390
        call_timer_fn+0xa1/0x2c0
        __run_timers.part.0+0x1ec/0x2e0
        run_timer_softirq+0x35/0x70
       ...
       iocg: invalid donation weights in /a/b: active=1 donating=1 after=0
      
      Fix it by excluding cgroups w/ active hweight < 2 from donating. Excluding
      these extreme low hweight donations shouldn't affect work conservation in
      any meaningful way.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: f1de2439 ("blk-iocost: revamp donation amount determination")
      Cc: stable@vger.kernel.org # v5.10+
      Link: https://lore.kernel.org/r/Ybfh86iSvpWKxhVM@slm.duckdns.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
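
The eligibility rule in this commit can be sketched with a few lines of arithmetic. This is an illustrative Python model, not the formula used in block/blk-iocost.c: `can_donate`, `after_donation`, and `scale_by_after` are hypothetical names, and the division only stands in for any calculation that places the after-donation hweight in a denominator.

```python
MIN_DONOR_HWEIGHT = 2  # per the commit message: donate 1, keep 1


def can_donate(active_hweight):
    """A donor must be able to give away 1 and still keep 1."""
    return active_hweight >= MIN_DONOR_HWEIGHT


def after_donation(active_hweight, donated):
    """Hweight the donor keeps; the donation logic assumes this is non-zero."""
    return active_hweight - donated


def scale_by_after(value, active_hweight, donated):
    """Illustrative stand-in for a calculation that divides by the
    after-donation hweight (an assumption, not the kernel formula)."""
    return value // after_donation(active_hweight, donated)


for hwa in (1, 2):
    if can_donate(hwa):
        print(hwa, "donates, keeps", after_donation(hwa, 1))
    else:
        print(hwa, "excluded from donation")
```

Calling `scale_by_after(100, 1, 1)` without the `can_donate` check raises `ZeroDivisionError`, which mirrors the oops described in the commit message. With the check in place, a donor with an active hweight of 1 never reaches the division.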
  5. 10 Dec, 2021 6 commits
  6. 08 Dec, 2021 1 commit
  7. 07 Dec, 2021 3 commits
  8. 06 Dec, 2021 4 commits
  9. 29 Nov, 2021 1 commit
  10. 26 Nov, 2021 2 commits
  11. 25 Nov, 2021 3 commits
  12. 23 Nov, 2021 10 commits
  13. 19 Nov, 2021 2 commits
  14. 17 Nov, 2021 2 commits
  15. 16 Nov, 2021 1 commit
    • blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release() · 2a19b28f
      Ming Lei authored
      To avoid slowing down queue destruction, we don't call
      blk_mq_quiesce_queue() in blk_cleanup_queue(); instead, canceling
      dispatch work is delayed to blk_release_queue().
      
      However, this approach has caused a kernel oops [1], reported by Changhui.
      The log shows that the scsi_device can be freed before blk_release_queue()
      runs, which is expected since the scsi_device is released once the scsi
      disk is closed and the scsi_device is removed.
      
      Fix the issue by canceling blk-mq dispatch work in both blk_cleanup_queue()
      and disk_release():
      
      1) When disk_release() runs, the disk has been closed and all sync
      dispatch activity has finished, so canceling dispatch work is enough to
      quiesce filesystem I/O dispatch activity.
      
      2) blk_cleanup_queue() only needs to handle passthrough requests, which
      are always explicitly allocated and freed by their caller. Once the queue
      is frozen, all sync dispatch activity for passthrough requests has
      finished, so canceling dispatch work is enough to prevent any further
      dispatch activity.
      
      [1] kernel panic log
      [12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300
      [12622.777186] #PF: supervisor read access in kernel mode
      [12622.782918] #PF: error_code(0x0000) - not-present page
      [12622.788649] PGD 0 P4D 0
      [12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI
      [12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1
      [12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
      [12622.813321] Workqueue: kblockd blk_mq_run_work_fn
      [12622.818572] RIP: 0010:sbitmap_get+0x75/0x190
      [12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 <48> 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58
      [12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202
      [12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004
      [12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030
      [12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334
      [12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
      [12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030
      [12622.889926] FS:  0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000
      [12622.898956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0
      [12622.913328] Call Trace:
      [12622.916055]  <TASK>
      [12622.918394]  scsi_mq_get_budget+0x1a/0x110
      [12622.922969]  __blk_mq_do_dispatch_sched+0x1d4/0x320
      [12622.928404]  ? pick_next_task_fair+0x39/0x390
      [12622.933268]  __blk_mq_sched_dispatch_requests+0xf4/0x140
      [12622.939194]  blk_mq_sched_dispatch_requests+0x30/0x60
      [12622.944829]  __blk_mq_run_hw_queue+0x30/0xa0
      [12622.949593]  process_one_work+0x1e8/0x3c0
      [12622.954059]  worker_thread+0x50/0x3b0
      [12622.958144]  ? rescuer_thread+0x370/0x370
      [12622.962616]  kthread+0x158/0x180
      [12622.966218]  ? set_kthread_struct+0x40/0x40
      [12622.970884]  ret_from_fork+0x22/0x30
      [12622.974875]  </TASK>
      [12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]
      Reported-by: ChanghuiZhong <czhong@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: linux-scsi@vger.kernel.org
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>