• Christoph Hellwig's avatar
    md: split mddev_find · 65aa97c4
    Christoph Hellwig authored
    Split mddev_find into a simple mddev_find that just finds an existing
    mddev by the unit number, and a more complicated mddev_find that deals
    with find or allocating a mddev.
    
    This turns out to fix this bug reported by Zhao Heming.
    
    ----------------------------- snip ------------------------------
    commit d3374825 ("md: make devices disappear when they are no longer
    needed.") introduced protection between mddev creating & removing. The
    md_open shouldn't create mddev when all_mddevs list doesn't contain
    mddev. With currently code logic, there will be very easy to trigger
    soft lockup in non-preempt env.
    
    *** env ***
    kvm-qemu VM 2C1G with 2 iscsi luns
    kernel should be non-preempt
    
    *** script ***
    
    about trigger 1 time with 10 tests
    
    `1  node1="15sp3-mdcluster1"
    2  node2="15sp3-mdcluster2"
    3
    4  mdadm -Ss
    5  ssh ${node2} "mdadm -Ss"
    6  wipefs -a /dev/sda /dev/sdb
    7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
       /dev/sdb --assume-clean
    8
    9  for i in {1..100}; do
    10    echo ==== $i ====;
    11
    12    echo "test  ...."
    13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
    14    sleep 1
    15
    16    echo "clean  ....."
    17    ssh ${node2} "mdadm -Ss"
    18 done
    `
    I use mdcluster env to trigger soft lockup, but it isn't mdcluster
    speical bug. To stop md array in mdcluster env will do more jobs than
    non-cluster array, which will leave enough time/gap to allow kernel to
    run md_open.
    
    *** stack ***
    
    `ID: 2831   TASK: ffff8dd7223b5040  CPU: 0   COMMAND: "mdadm"
     #0 [ffffa15d00a13b90] __schedule at ffffffffb8f1935f
     #1 [ffffa15d00a13ba8] exact_lock at ffffffffb8a4a66d
     #2 [ffffa15d00a13bb0] kobj_lookup at ffffffffb8c62fe3
     #3 [ffffa15d00a13c28] __blkdev_get at ffffffffb89273b9
     #4 [ffffa15d00a13c98] blkdev_get at ffffffffb8927964
     #5 [ffffa15d00a13cb0] do_dentry_open at ffffffffb88dc4b4
     #6 [ffffa15d00a13ce0] path_openat at ffffffffb88f0ccc
     #7 [ffffa15d00a13db8] do_filp_open at ffffffffb88f32bb
     #8 [ffffa15d00a13ee0] do_sys_open at ffffffffb88ddc7d
     #9 [ffffa15d00a13f38] do_syscall_64 at ffffffffb86053cb ffffffffb900008c
    
    or:
    [  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
    [  884.226515]  md_open+0x3c/0xe0 [md_mod]
    [  884.226518]  __blkdev_get+0x30d/0x710
    [  884.226520]  ? bd_acquire+0xd0/0xd0
    [  884.226522]  blkdev_get+0x14/0x30
    [  884.226524]  do_dentry_open+0x204/0x3a0
    [  884.226531]  path_openat+0x2fc/0x1520
    [  884.226534]  ? seq_printf+0x4e/0x70
    [  884.226536]  do_filp_open+0x9b/0x110
    [  884.226542]  ? md_release+0x20/0x20 [md_mod]
    [  884.226543]  ? seq_read+0x1d8/0x3e0
    [  884.226545]  ? kmem_cache_alloc+0x18a/0x270
    [  884.226547]  ? do_sys_open+0x1bd/0x260
    [  884.226548]  do_sys_open+0x1bd/0x260
    [  884.226551]  do_syscall_64+0x5b/0x1e0
    [  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
    `
    *** rootcause ***
    
    "mdadm -A" (or other array assemble commands) will start a daemon "mdadm
    --monitor" by default. When "mdadm -Ss" is running, the stop action will
    wakeup "mdadm --monitor". The "--monitor" daemon will immediately get
    info from /proc/mdstat. This time mddev in kernel still exist, so
    /proc/mdstat still show md device, which makes "mdadm --monitor" to open
    /dev/md0.
    
    The previously "mdadm -Ss" is removing action, the "mdadm --monitor"
    open action will trigger md_open which is creating action. Racing is
    happening.
    
    `<thread 1>: "mdadm -Ss"
    md_release
      mddev_put deletes mddev from all_mddevs
      queue_work for mddev_delayed_delete
      at this time, "/dev/md0" is still available for opening
    
    <thread 2>: "mdadm --monitor ..."
    md_open
     + mddev_find can't find mddev of /dev/md0, and create a new mddev and
     |    return.
     + trigger "if (mddev->gendisk != bdev->bd_disk)" and return
          -ERESTARTSYS.
    `
    In non-preempt kernel, <thread 2> is occupying on current CPU. and
    mddev_delayed_delete which was created in <thread 1> also can't be
    schedule.
    
    In preempt kernel, it can also trigger above racing. But kernel doesn't
    allow one thread running on a CPU all the time. after <thread 2> running
    some time, the later "mdadm -A" (refer above script line 13) will call
    md_alloc to alloc a new gendisk for mddev. it will break md_open
    statement "if (mddev->gendisk != bdev->bd_disk)" and return 0 to caller,
    the soft lockup is broken.
    ------------------------------ snip ------------------------------
    
    Cc: stable@vger.kernel.org
    Fixes: d3374825 ("md: make devices disappear when they are no longer needed.")
    Reported-by: default avatarHeming Zhao <heming.zhao@suse.com>
    Reviewed-by: default avatarHeming Zhao <heming.zhao@suse.com>
    Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
    Signed-off-by: default avatarSong Liu <song@kernel.org>
    65aa97c4
md.c 259 KB