1. 25 Jun, 2020 1 commit
    • Jens Axboe's avatar
      Merge branch 'nvme-5.8' of git://git.infradead.org/nvme into block-5.8 · 1b52671d
      Jens Axboe authored
      Pull NVMe fixes from Christoph.
      
      * 'nvme-5.8' of git://git.infradead.org/nvme:
        nvme-multipath: fix bogus request queue reference put
        nvme-multipath: fix deadlock due to head->lock
        nvme: don't protect ns mutation with ns->head->lock
        nvme-multipath: fix deadlock between ana_work and scan_work
        nvme: fix possible deadlock when I/O is blocked
        nvme-rdma: assign completion vector correctly
        nvme-loop: initialize tagset numa value to the value of the ctrl
        nvme-tcp: initialize tagset numa value to the value of the ctrl
        nvme-pci: initialize tagset numa value to the value of the ctrl
        nvme-pci: override the value of the controller's numa node
        nvme: set initial value for controller's numa node
      1b52671d
  2. 24 Jun, 2020 12 commits
    • Sagi Grimberg's avatar
      nvme-multipath: fix bogus request queue reference put · c3124466
      Sagi Grimberg authored
      The mpath disk node takes a reference on the request mpath
      request queue when adding live path to the mpath gendisk.
      However if we connected to an inaccessible path device_add_disk
      is not called, so if we disconnect and remove the mpath gendisk
      we endup putting an reference on the request queue that was
      never taken [1].
      
      Fix that to check if we ever added a live path (using
      NVME_NS_HEAD_HAS_DISK flag) and if not, clear the disk->queue
      reference.
      
      [1]:
      ------------[ cut here ]------------
      refcount_t: underflow; use-after-free.
      WARNING: CPU: 1 PID: 1372 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
      CPU: 1 PID: 1372 Comm: nvme Tainted: G           O      5.7.0-rc2+ #3
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
      RIP: 0010:refcount_warn_saturate+0xa6/0xf0
      RSP: 0018:ffffb29e8053bdc0 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: ffff8b7a2f4fc060 RCX: 0000000000000007
      RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8b7a3ec99980
      RBP: ffff8b7a2f4fc000 R08: 00000000000002e1 R09: 0000000000000004
      R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
      R13: fffffffffffffff2 R14: ffffb29e8053bf08 R15: ffff8b7a320e2da0
      FS:  00007f135d4ca800(0000) GS:ffff8b7a3ec80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00005651178c0c30 CR3: 000000003b650005 CR4: 0000000000360ee0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       disk_release+0xa2/0xc0
       device_release+0x28/0x80
       kobject_put+0xa5/0x1b0
       nvme_put_ns_head+0x26/0x70 [nvme_core]
       nvme_put_ns+0x30/0x60 [nvme_core]
       nvme_remove_namespaces+0x9b/0xe0 [nvme_core]
       nvme_do_delete_ctrl+0x43/0x5c [nvme_core]
       nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
       kernfs_fop_write+0xc1/0x1a0
       vfs_write+0xb6/0x1a0
       ksys_write+0x5f/0xe0
       do_syscall_64+0x52/0x1a0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Reported-by: default avatarAnton Eidelman <anton@lightbitslabs.com>
      Tested-by: default avatarAnton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      c3124466
    • Anton Eidelman's avatar
      nvme-multipath: fix deadlock due to head->lock · d8a22f85
      Anton Eidelman authored
      In the following scenario scan_work and ana_work will deadlock:
      
      When scan_work calls nvme_mpath_add_disk() this holds ana_lock
      and invokes nvme_parse_ana_log(), which may issue IO
      in device_add_disk() and hang waiting for an accessible path.
      
      While nvme_mpath_set_live() only called when nvme_state_is_live(),
      a transition may cause NVME_SC_ANA_TRANSITION and requeue the IO.
      
      Since nvme_mpath_set_live() holds ns->head->lock, an ana_work on
      ANY ctrl will not be able to complete nvme_mpath_set_live()
      on the same ns->head, which is required in order to update
      the new accessible path and remove NVME_NS_ANA_PENDING..
      Therefore IO never completes: deadlock [1].
      
      Fix:
      Move device_add_disk out of the head->lock and protect it with an
      atomic test_and_set for a new NVME_NS_HEAD_HAS_DISK bit.
      
      [1]:
      kernel: INFO: task kworker/u8:2:160 blocked for more than 120 seconds.
      kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
      kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      kernel: kworker/u8:2    D    0   160      2 0x80004000
      kernel: Workqueue: nvme-wq nvme_ana_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  schedule_preempt_disabled+0xe/0x10
      kernel:  __mutex_lock.isra.0+0x182/0x4f0
      kernel:  __mutex_lock_slowpath+0x13/0x20
      kernel:  mutex_lock+0x2e/0x40
      kernel:  nvme_update_ns_ana_state+0x22/0x60 [nvme_core]
      kernel:  nvme_update_ana_state+0xca/0xe0 [nvme_core]
      kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      kernel:  nvme_read_ana_log+0x76/0x100 [nvme_core]
      kernel:  nvme_ana_work+0x15/0x20 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x4d/0x400
      kernel:  kthread+0x104/0x140
      kernel:  ret_from_fork+0x35/0x40
      kernel: INFO: task kworker/u8:4:439 blocked for more than 120 seconds.
      kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
      kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      kernel: kworker/u8:4    D    0   439      2 0x80004000
      kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  io_schedule+0x16/0x40
      kernel:  do_read_cache_page+0x438/0x830
      kernel:  read_cache_page+0x12/0x20
      kernel:  read_dev_sector+0x27/0xc0
      kernel:  read_lba+0xc1/0x220
      kernel:  efi_partition+0x1e6/0x708
      kernel:  check_partition+0x154/0x244
      kernel:  rescan_partitions+0xae/0x280
      kernel:  __blkdev_get+0x40f/0x560
      kernel:  blkdev_get+0x3d/0x140
      kernel:  __device_add_disk+0x388/0x480
      kernel:  device_add_disk+0x13/0x20
      kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
      kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
      kernel:  nvme_mpath_add_disk+0xbe/0x100 [nvme_core]
      kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
      kernel:  nvme_scan_work+0x256/0x390 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x4d/0x400
      kernel:  kthread+0x104/0x140
      kernel:  ret_from_fork+0x35/0x40
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Signed-off-by: default avatarAnton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      d8a22f85
    • Sagi Grimberg's avatar
      nvme: don't protect ns mutation with ns->head->lock · e164471d
      Sagi Grimberg authored
      Right now ns->head->lock is protecting namespace mutation
      which is wrong and unneeded. Move it to only protect
      against head mutations. While we're at it, remove unnecessary
      ns->head reference as we already have head pointer.
      
      The problem with this is that the head->lock spans
      mpath disk node I/O that may block under some conditions (if
      for example the controller is disconnecting or the path
      became inaccessible), The locking scheme does not allow any
      other path to enable itself, preventing blocked I/O to complete
      and forward-progress from there.
      
      This is a preparation patch for the fix in a subsequent patch
      where the disk I/O will also be done outside the head->lock.
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Signed-off-by: default avatarAnton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      e164471d
    • Anton Eidelman's avatar
      nvme-multipath: fix deadlock between ana_work and scan_work · 489dd102
      Anton Eidelman authored
      When scan_work calls nvme_mpath_add_disk() this holds ana_lock
      and invokes nvme_parse_ana_log(), which may issue IO
      in device_add_disk() and hang waiting for an accessible path.
      While nvme_mpath_set_live() only called when nvme_state_is_live(),
      a transition may cause NVME_SC_ANA_TRANSITION and requeue the IO.
      
      In order to recover and complete the IO ana_work on the same ctrl
      should be able to update the path state and remove NVME_NS_ANA_PENDING.
      
      The deadlock occurs because scan_work keeps holding ana_lock,
      so ana_work hangs [1].
      
      Fix:
      Now nvme_mpath_add_disk() uses nvme_parse_ana_log() to obtain a copy
      of the ANA group desc, and then calls nvme_update_ns_ana_state() without
      holding ana_lock.
      
      [1]:
      kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  io_schedule+0x16/0x40
      kernel:  do_read_cache_page+0x438/0x830
      kernel:  read_cache_page+0x12/0x20
      kernel:  read_dev_sector+0x27/0xc0
      kernel:  read_lba+0xc1/0x220
      kernel:  efi_partition+0x1e6/0x708
      kernel:  check_partition+0x154/0x244
      kernel:  rescan_partitions+0xae/0x280
      kernel:  __blkdev_get+0x40f/0x560
      kernel:  blkdev_get+0x3d/0x140
      kernel:  __device_add_disk+0x388/0x480
      kernel:  device_add_disk+0x13/0x20
      kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
      kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
      kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
      kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
      kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
      kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x249/0x400
      kernel:  kthread+0x104/0x140
      
      kernel: Workqueue: nvme-wq nvme_ana_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  schedule_preempt_disabled+0xe/0x10
      kernel:  __mutex_lock.isra.0+0x182/0x4f0
      kernel:  ? __switch_to_asm+0x34/0x70
      kernel:  ? select_task_rq_fair+0x1aa/0x5c0
      kernel:  ? kvm_sched_clock_read+0x11/0x20
      kernel:  ? sched_clock+0x9/0x10
      kernel:  __mutex_lock_slowpath+0x13/0x20
      kernel:  mutex_lock+0x2e/0x40
      kernel:  nvme_read_ana_log+0x3a/0x100 [nvme_core]
      kernel:  nvme_ana_work+0x15/0x20 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x4d/0x400
      kernel:  kthread+0x104/0x140
      kernel:  ? process_one_work+0x380/0x380
      kernel:  ? kthread_park+0x80/0x80
      kernel:  ret_from_fork+0x35/0x40
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Signed-off-by: default avatarAnton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      489dd102
    • Sagi Grimberg's avatar
      nvme: fix possible deadlock when I/O is blocked · 3b4b1972
      Sagi Grimberg authored
      Revert fab7772b ("nvme-multipath: revalidate nvme_ns_head gendisk
      in nvme_validate_ns")
      
      When adding a new namespace to the head disk (via nvme_mpath_set_live)
      we will see partition scan which triggers I/O on the mpath device node.
      This process will usually be triggered from the scan_work which holds
      the scan_lock. If I/O blocks (if we got ana change currently have only
      available paths but none are accessible) this can deadlock on the head
      disk bd_mutex as both partition scan I/O takes it, and head disk revalidation
      takes it to check for resize (also triggered from scan_work on a different
      path). See trace [1].
      
      The mpath disk revalidation was originally added to detect online disk
      size change, but this is no longer needed since commit cb224c3a
      ("nvme: Convert to use set_capacity_revalidate_and_notify") which already
      updates resize info without unnecessarily revalidating the disk (the
      mpath disk doesn't even implement .revalidate_disk fop).
      
      [1]:
      --
      kernel: INFO: task kworker/u65:9:494 blocked for more than 241 seconds.
      kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
      kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      kernel: kworker/u65:9   D    0   494      2 0x80004000
      kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  schedule_preempt_disabled+0xe/0x10
      kernel:  __mutex_lock.isra.0+0x182/0x4f0
      kernel:  __mutex_lock_slowpath+0x13/0x20
      kernel:  mutex_lock+0x2e/0x40
      kernel:  revalidate_disk+0x63/0xa0
      kernel:  __nvme_revalidate_disk+0xfe/0x110 [nvme_core]
      kernel:  nvme_revalidate_disk+0xa4/0x160 [nvme_core]
      kernel:  ? evict+0x14c/0x1b0
      kernel:  revalidate_disk+0x2b/0xa0
      kernel:  nvme_validate_ns+0x49/0x940 [nvme_core]
      kernel:  ? blk_mq_free_request+0xd2/0x100
      kernel:  ? __nvme_submit_sync_cmd+0xbe/0x1e0 [nvme_core]
      kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x249/0x400
      kernel:  kthread+0x104/0x140
      kernel:  ? process_one_work+0x380/0x380
      kernel:  ? kthread_park+0x80/0x80
      kernel:  ret_from_fork+0x1f/0x40
      ...
      kernel: INFO: task kworker/u65:1:2630 blocked for more than 241 seconds.
      kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
      kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      kernel: kworker/u65:1   D    0  2630      2 0x80004000
      kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  io_schedule+0x16/0x40
      kernel:  do_read_cache_page+0x438/0x830
      kernel:  ? __switch_to_asm+0x34/0x70
      kernel:  ? file_fdatawait_range+0x30/0x30
      kernel:  read_cache_page+0x12/0x20
      kernel:  read_dev_sector+0x27/0xc0
      kernel:  read_lba+0xc1/0x220
      kernel:  ? kmem_cache_alloc_trace+0x19c/0x230
      kernel:  efi_partition+0x1e6/0x708
      kernel:  ? vsnprintf+0x39e/0x4e0
      kernel:  ? snprintf+0x49/0x60
      kernel:  check_partition+0x154/0x244
      kernel:  rescan_partitions+0xae/0x280
      kernel:  __blkdev_get+0x40f/0x560
      kernel:  blkdev_get+0x3d/0x140
      kernel:  __device_add_disk+0x388/0x480
      kernel:  device_add_disk+0x13/0x20
      kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
      kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
      kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
      kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      kernel:  ? nvme_update_ns_ana_state+0x60/0x60 [nvme_core]
      kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
      kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
      kernel:  ? blk_mq_free_request+0xd2/0x100
      kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x249/0x400
      kernel:  kthread+0x104/0x140
      kernel:  ? process_one_work+0x380/0x380
      kernel:  ? kthread_park+0x80/0x80
      kernel:  ret_from_fork+0x1f/0x40
      --
      
      Fixes: fab7772b ("nvme-multipath: revalidate nvme_ns_head gendisk
      in nvme_validate_ns")
      Signed-off-by: default avatarAnton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      3b4b1972
    • Max Gurtovoy's avatar
      nvme-rdma: assign completion vector correctly · 032a9966
      Max Gurtovoy authored
      The completion vector index that is given during CQ creation can't
      exceed the number of support vectors by the underlying RDMA device. This
      violation currently can accure, for example, in case one will try to
      connect with N regular read/write queues and M poll queues and the sum
      of N + M > num_supported_vectors. This will lead to failure in establish
      a connection to remote target. Instead, in that case, share a completion
      vector between queues.
      Signed-off-by: default avatarMax Gurtovoy <maxg@mellanox.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      032a9966
    • Max Gurtovoy's avatar
      nvme-loop: initialize tagset numa value to the value of the ctrl · 1b4ad7a5
      Max Gurtovoy authored
      Both admin's and drive's tagsets should be set according the numa
      node of the controller.
      Signed-off-by: default avatarMax Gurtovoy <maxg@mellanox.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      1b4ad7a5
    • Max Gurtovoy's avatar
      nvme-tcp: initialize tagset numa value to the value of the ctrl · 610c8235
      Max Gurtovoy authored
      Both admin's and drive's tagsets should be set according the numa
      node of the controller.
      Signed-off-by: default avatarMax Gurtovoy <maxg@mellanox.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      610c8235
    • Max Gurtovoy's avatar
      nvme-pci: initialize tagset numa value to the value of the ctrl · d4ec47f1
      Max Gurtovoy authored
      Both admin's and drive's tagsets should be set according the numa node
      of the controller.
      Signed-off-by: default avatarMax Gurtovoy <maxg@mellanox.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      d4ec47f1
    • Max Gurtovoy's avatar
      nvme-pci: override the value of the controller's numa node · 635333e4
      Max Gurtovoy authored
      Set the node value according to the PCI device numa node.
      Signed-off-by: default avatarMax Gurtovoy <maxg@mellanox.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      635333e4
    • Max Gurtovoy's avatar
      nvme: set initial value for controller's numa node · 4fea243e
      Max Gurtovoy authored
      Initialize the node to NUMA_NO_NODE value. Transports that are aware of
      numa node affinity can override it (e.g. RDMA transport set the affinity
      according to the RDMA HCA).
      Signed-off-by: default avatarMax Gurtovoy <maxg@mellanox.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      4fea243e
    • Chengguang Xu's avatar
      block: release bip in a right way in error path · 0b8eb629
      Chengguang Xu authored
      Release bip using kfree() in error path when that was allocated
      by kmalloc().
      Signed-off-by: default avatarChengguang Xu <cgxu519@mykernel.net>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Acked-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0b8eb629
  3. 18 Jun, 2020 4 commits
  4. 17 Jun, 2020 2 commits
    • Jan Kara's avatar
      blktrace: Avoid sparse warnings when assigning q->blk_trace · c3dbe541
      Jan Kara authored
      Mostly for historical reasons, q->blk_trace is assigned through xchg()
      and cmpxchg() atomic operations. Although this is correct, sparse
      complains about this because it violates rcu annotations since commit
      c780e86d ("blktrace: Protect q->blk_trace with RCU") which started
      to use rcu for accessing q->blk_trace. Furthermore there's no real need
      for atomic operations anymore since all changes to q->blk_trace happen
      under q->blk_trace_mutex and since it also makes more sense to check if
      q->blk_trace is set with the mutex held earlier.
      
      So let's just replace xchg() with rcu_replace_pointer() and cmpxchg()
      with explicit check and rcu_assign_pointer(). This makes the code more
      efficient and sparse happy.
      Reported-by: default avatarkbuild test robot <lkp@intel.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c3dbe541
    • Luis Chamberlain's avatar
      blktrace: break out of blktrace setup on concurrent calls · 1b0b2836
      Luis Chamberlain authored
      We use one blktrace per request_queue, that means one per the entire
      disk.  So we cannot run one blktrace on say /dev/vda and then /dev/vda1,
      or just two calls on /dev/vda.
      
      We check for concurrent setup only at the very end of the blktrace setup though.
      
      If we try to run two concurrent blktraces on the same block device the
      second one will fail, and the first one seems to go on. However when
      one tries to kill the first one one will see things like this:
      
      The kernel will show these:
      
      ```
      debugfs: File 'dropped' in directory 'nvme1n1' already present!
      debugfs: File 'msg' in directory 'nvme1n1' already present!
      debugfs: File 'trace0' in directory 'nvme1n1' already present!
      ``
      
      And userspace just sees this error message for the second call:
      
      ```
      blktrace /dev/nvme1n1
      BLKTRACESETUP(2) /dev/nvme1n1 failed: 5/Input/output error
      ```
      
      The first userspace process #1 will also claim that the files
      were taken underneath their nose as well. The files are taken
      away form the first process given that when the second blktrace
      fails, it will follow up with a BLKTRACESTOP and BLKTRACETEARDOWN.
      This means that even if go-happy process #1 is waiting for blktrace
      data, we *have* been asked to take teardown the blktrace.
      
      This can easily be reproduced with break-blktrace [0] run_0005.sh test.
      
      Just break out early if we know we're already going to fail, this will
      prevent trying to create the files all over again, which we know still
      exist.
      
      [0] https://github.com/mcgrof/break-blktraceSigned-off-by: default avatarLuis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarBart Van Assche <bvanassche@acm.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1b0b2836
  5. 16 Jun, 2020 1 commit
    • Jason Yan's avatar
      block: Fix use-after-free in blkdev_get() · 2d3a8e2d
      Jason Yan authored
      In blkdev_get() we call __blkdev_get() to do some internal jobs and if
      there is some errors in __blkdev_get(), the bdput() is called which
      means we have released the refcount of the bdev (actually the refcount of
      the bdev inode). This means we cannot access bdev after that point. But
      acctually bdev is still accessed in blkdev_get() after calling
      __blkdev_get(). This results in use-after-free if the refcount is the
      last one we released in __blkdev_get(). Let's take a look at the
      following scenerio:
      
        CPU0            CPU1                    CPU2
      blkdev_open     blkdev_open           Remove disk
                        bd_acquire
      		  blkdev_get
      		    __blkdev_get      del_gendisk
      					bdev_unhash_inode
        bd_acquire          bdev_get_gendisk
          bd_forget           failed because of unhashed
      	  bdput
      	              bdput (the last one)
      		        bdev_evict_inode
      
      	  	    access bdev => use after free
      
      [  459.350216] BUG: KASAN: use-after-free in __lock_acquire+0x24c1/0x31b0
      [  459.351190] Read of size 8 at addr ffff88806c815a80 by task syz-executor.0/20132
      [  459.352347]
      [  459.352594] CPU: 0 PID: 20132 Comm: syz-executor.0 Not tainted 4.19.90 #2
      [  459.353628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [  459.354947] Call Trace:
      [  459.355337]  dump_stack+0x111/0x19e
      [  459.355879]  ? __lock_acquire+0x24c1/0x31b0
      [  459.356523]  print_address_description+0x60/0x223
      [  459.357248]  ? __lock_acquire+0x24c1/0x31b0
      [  459.357887]  kasan_report.cold+0xae/0x2d8
      [  459.358503]  __lock_acquire+0x24c1/0x31b0
      [  459.359120]  ? _raw_spin_unlock_irq+0x24/0x40
      [  459.359784]  ? lockdep_hardirqs_on+0x37b/0x580
      [  459.360465]  ? _raw_spin_unlock_irq+0x24/0x40
      [  459.361123]  ? finish_task_switch+0x125/0x600
      [  459.361812]  ? finish_task_switch+0xee/0x600
      [  459.362471]  ? mark_held_locks+0xf0/0xf0
      [  459.363108]  ? __schedule+0x96f/0x21d0
      [  459.363716]  lock_acquire+0x111/0x320
      [  459.364285]  ? blkdev_get+0xce/0xbe0
      [  459.364846]  ? blkdev_get+0xce/0xbe0
      [  459.365390]  __mutex_lock+0xf9/0x12a0
      [  459.365948]  ? blkdev_get+0xce/0xbe0
      [  459.366493]  ? bdev_evict_inode+0x1f0/0x1f0
      [  459.367130]  ? blkdev_get+0xce/0xbe0
      [  459.367678]  ? destroy_inode+0xbc/0x110
      [  459.368261]  ? mutex_trylock+0x1a0/0x1a0
      [  459.368867]  ? __blkdev_get+0x3e6/0x1280
      [  459.369463]  ? bdev_disk_changed+0x1d0/0x1d0
      [  459.370114]  ? blkdev_get+0xce/0xbe0
      [  459.370656]  blkdev_get+0xce/0xbe0
      [  459.371178]  ? find_held_lock+0x2c/0x110
      [  459.371774]  ? __blkdev_get+0x1280/0x1280
      [  459.372383]  ? lock_downgrade+0x680/0x680
      [  459.373002]  ? lock_acquire+0x111/0x320
      [  459.373587]  ? bd_acquire+0x21/0x2c0
      [  459.374134]  ? do_raw_spin_unlock+0x4f/0x250
      [  459.374780]  blkdev_open+0x202/0x290
      [  459.375325]  do_dentry_open+0x49e/0x1050
      [  459.375924]  ? blkdev_get_by_dev+0x70/0x70
      [  459.376543]  ? __x64_sys_fchdir+0x1f0/0x1f0
      [  459.377192]  ? inode_permission+0xbe/0x3a0
      [  459.377818]  path_openat+0x148c/0x3f50
      [  459.378392]  ? kmem_cache_alloc+0xd5/0x280
      [  459.379016]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  459.379802]  ? path_lookupat.isra.0+0x900/0x900
      [  459.380489]  ? __lock_is_held+0xad/0x140
      [  459.381093]  do_filp_open+0x1a1/0x280
      [  459.381654]  ? may_open_dev+0xf0/0xf0
      [  459.382214]  ? find_held_lock+0x2c/0x110
      [  459.382816]  ? lock_downgrade+0x680/0x680
      [  459.383425]  ? __lock_is_held+0xad/0x140
      [  459.384024]  ? do_raw_spin_unlock+0x4f/0x250
      [  459.384668]  ? _raw_spin_unlock+0x1f/0x30
      [  459.385280]  ? __alloc_fd+0x448/0x560
      [  459.385841]  do_sys_open+0x3c3/0x500
      [  459.386386]  ? filp_open+0x70/0x70
      [  459.386911]  ? trace_hardirqs_on_thunk+0x1a/0x1c
      [  459.387610]  ? trace_hardirqs_off_caller+0x55/0x1c0
      [  459.388342]  ? do_syscall_64+0x1a/0x520
      [  459.388930]  do_syscall_64+0xc3/0x520
      [  459.389490]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  459.390248] RIP: 0033:0x416211
      [  459.390720] Code: 75 14 b8 02 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83
      04 19 00 00 c3 48 83 ec 08 e8 0a fa ff ff 48 89 04 24 b8 02 00 00 00 0f
         05 <48> 8b 3c 24 48 89 c2 e8 53 fa ff ff 48 89 d0 48 83 c4 08 48 3d
            01
      [  459.393483] RSP: 002b:00007fe45dfe9a60 EFLAGS: 00000293 ORIG_RAX: 0000000000000002
      [  459.394610] RAX: ffffffffffffffda RBX: 00007fe45dfea6d4 RCX: 0000000000416211
      [  459.395678] RDX: 00007fe45dfe9b0a RSI: 0000000000000002 RDI: 00007fe45dfe9b00
      [  459.396758] RBP: 000000000076bf20 R08: 0000000000000000 R09: 000000000000000a
      [  459.397930] R10: 0000000000000075 R11: 0000000000000293 R12: 00000000ffffffff
      [  459.399022] R13: 0000000000000bd9 R14: 00000000004cdb80 R15: 000000000076bf2c
      [  459.400168]
      [  459.400430] Allocated by task 20132:
      [  459.401038]  kasan_kmalloc+0xbf/0xe0
      [  459.401652]  kmem_cache_alloc+0xd5/0x280
      [  459.402330]  bdev_alloc_inode+0x18/0x40
      [  459.402970]  alloc_inode+0x5f/0x180
      [  459.403510]  iget5_locked+0x57/0xd0
      [  459.404095]  bdget+0x94/0x4e0
      [  459.404607]  bd_acquire+0xfa/0x2c0
      [  459.405113]  blkdev_open+0x110/0x290
      [  459.405702]  do_dentry_open+0x49e/0x1050
      [  459.406340]  path_openat+0x148c/0x3f50
      [  459.406926]  do_filp_open+0x1a1/0x280
      [  459.407471]  do_sys_open+0x3c3/0x500
      [  459.408010]  do_syscall_64+0xc3/0x520
      [  459.408572]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  459.409415]
      [  459.409679] Freed by task 1262:
      [  459.410212]  __kasan_slab_free+0x129/0x170
      [  459.410919]  kmem_cache_free+0xb2/0x2a0
      [  459.411564]  rcu_process_callbacks+0xbb2/0x2320
      [  459.412318]  __do_softirq+0x225/0x8ac
      
      Fix this by delaying bdput() to the end of blkdev_get() which means we
      have finished accessing bdev.
      
      Fixes: 77ea887e ("implement in-kernel gendisk events handling")
      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Signed-off-by: default avatarJason Yan <yanaijie@huawei.com>
      Tested-by: default avatarSedat Dilek <sedat.dilek@gmail.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      2d3a8e2d
  6. 15 Jun, 2020 2 commits
  7. 14 Jun, 2020 8 commits
    • Coly Li's avatar
      bcache: pr_info() format clean up in bcache_device_init() · 4b25bbf5
      Coly Li authored
      scripts/checkpatch.pl reports following warning for patch
      ("bcache: check and adjust logical block size for backing devices"),
          WARNING: quoted string split across lines
          #146: FILE: drivers/md/bcache/super.c:896:
          +  pr_info("%s: sb/logical block size (%u) greater than page size "
          +	       "(%lu) falling back to device logical block size (%u)",
      
      There are two things to fix up,
      - The kernel message print should be in a single line.
      - pr_info() won't automatically add new line since v5.8, a '\n' should
        be added.
      
      This patch just does the above cleanup in bcache_device_init().
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4b25bbf5
    • Coly Li's avatar
      bcache: use delayed kworker fo asynchronous devices registration · ee4a36f4
      Coly Li authored
      This patch changes the asynchronous registration kworker to a delayed
      kworker. There is probability queue_work() queues the async registration
      kworker to the same CPU (even though very little), then the process
      which writing sysfs interface to reigster bcache device may won't return
      immeidately. queue_delayed_work() in this patch will delay 10 jiffies
      before insert the kworker to run queue, which makes sure the registering
      process may always returns to user space in time.
      
      Fixes: 9e23ccf8 ("bcache: asynchronous devices registration")
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ee4a36f4
    • Mauricio Faria de Oliveira's avatar
      bcache: check and adjust logical block size for backing devices · dcacbc12
      Mauricio Faria de Oliveira authored
      It's possible for a block driver to set logical block size to
      a value greater than page size incorrectly; e.g. bcache takes
      the value from the superblock, set by the user w/ make-bcache.
      
      This causes a BUG/NULL pointer dereference in the path:
      
        __blkdev_get()
        -> set_init_blocksize() // set i_blkbits based on ...
           -> bdev_logical_block_size()
              -> queue_logical_block_size() // ... this value
        -> bdev_disk_changed()
           ...
           -> blkdev_readpage()
              -> block_read_full_page()
                 -> create_page_buffers() // size = 1 << i_blkbits
                    -> create_empty_buffers() // give size/take pointer
                       -> alloc_page_buffers() // return NULL
                       .. BUG!
      
      Because alloc_page_buffers() is called with size > PAGE_SIZE,
      thus it initializes head = NULL, skips the loop, return head;
      then create_empty_buffers() gets (and uses) the NULL pointer.
      
      This has been around longer than commit ad6bf88a ("block:
      fix an integer overflow in logical block size"); however, it
      increased the range of values that can trigger the issue.
      
      Previously only 8k/16k/32k (on x86/4k page size) would do it,
      as greater values overflow unsigned short to zero, and queue_
      logical_block_size() would then use the default of 512.
      
      Now the range with unsigned int is much larger, and users w/
      the 512k value, which happened to be zero'ed previously and
      work fine, started to hit this issue -- as the zero is gone,
      and queue_logical_block_size() does return 512k (>PAGE_SIZE.)
      
      Fix this by checking the bcache device's logical block size,
      and if it's greater than page size, fallback to the backing/
      cached device's logical page size.
      
      This doesn't affect cache devices as those are still checked
      for block/page size in read_super(); only the backing/cached
      devices are not.
      
      Apparently it's a regression from commit 2903381f ("bcache:
      Take data offset from the bdev superblock."), moving the check
      into BCACHE_SB_VERSION_CDEV only. Now that we have superblocks
      of backing devices out there with this larger value, we cannot
      refuse to load them (i.e., have a similar check in _BDEV.)
      
      Ideally perhaps bcache should use all values from the backing
      device (physical/logical/io_min block size)? But for now just
      fix the problematic case.
      
      Test-case:
      
          # IMG=/root/disk.img
          # dd if=/dev/zero of=$IMG bs=1 count=0 seek=1G
          # DEV=$(losetup --find --show $IMG)
          # make-bcache --bdev $DEV --block 8k
            < see dmesg >
      
      Before:
      
          # uname -r
          5.7.0-rc7
      
          [   55.944046] BUG: kernel NULL pointer dereference, address: 0000000000000000
          ...
          [   55.949742] CPU: 3 PID: 610 Comm: bcache-register Not tainted 5.7.0-rc7 #4
          ...
          [   55.952281] RIP: 0010:create_empty_buffers+0x1a/0x100
          ...
          [   55.966434] Call Trace:
          [   55.967021]  create_page_buffers+0x48/0x50
          [   55.967834]  block_read_full_page+0x49/0x380
          [   55.972181]  do_read_cache_page+0x494/0x610
          [   55.974780]  read_part_sector+0x2d/0xaa
          [   55.975558]  read_lba+0x10e/0x1e0
          [   55.977904]  efi_partition+0x120/0x5a6
          [   55.980227]  blk_add_partitions+0x161/0x390
          [   55.982177]  bdev_disk_changed+0x61/0xd0
          [   55.982961]  __blkdev_get+0x350/0x490
          [   55.983715]  __device_add_disk+0x318/0x480
          [   55.984539]  bch_cached_dev_run+0xc5/0x270
          [   55.986010]  register_bcache.cold+0x122/0x179
          [   55.987628]  kernfs_fop_write+0xbc/0x1a0
          [   55.988416]  vfs_write+0xb1/0x1a0
          [   55.989134]  ksys_write+0x5a/0xd0
          [   55.989825]  do_syscall_64+0x43/0x140
          [   55.990563]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
          [   55.991519] RIP: 0033:0x7f7d60ba3154
          ...
      
      After:
      
          # uname -r
          5.7.0.bcachelbspgsz
      
          [   31.672460] bcache: bcache_device_init() bcache0: sb/logical block size (8192) greater than page size (4096) falling back to device logical block size (512)
          [   31.675133] bcache: register_bdev() registered backing device loop0
      
          # grep ^ /sys/block/bcache0/queue/*_block_size
          /sys/block/bcache0/queue/logical_block_size:512
          /sys/block/bcache0/queue/physical_block_size:8192
      Reported-by: default avatarRyan Finnie <ryan@finnie.org>
      Reported-by: default avatarSebastian Marsching <sebastian@marsching.com>
      Signed-off-by: default avatarMauricio Faria de Oliveira <mfo@canonical.com>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      dcacbc12
    • Zhiqiang Liu's avatar
      bcache: fix potential deadlock problem in btree_gc_coalesce · be23e837
      Zhiqiang Liu authored
      coccicheck reports:
        drivers/md//bcache/btree.c:1538:1-7: preceding lock on line 1417
      
      In btree_gc_coalesce func, if the coalescing process fails, we will goto
      to out_nocoalesce tag directly without releasing new_nodes[i]->write_lock.
      Then, it will cause a deadlock when trying to acquire new_nodes[i]->
      write_lock for freeing new_nodes[i] before return.
      
      btree_gc_coalesce func details as follows:
      	if alloc new_nodes[i] fails:
      		goto out_nocoalesce;
      	// obtain new_nodes[i]->write_lock
      	mutex_lock(&new_nodes[i]->write_lock)
      	// main coalescing process
      	for (i = nodes - 1; i > 0; --i)
      		[snipped]
      		if coalescing process fails:
      			// Here, directly goto out_nocoalesce
      			 // tag will cause a deadlock
      			goto out_nocoalesce;
      		[snipped]
      	// release new_nodes[i]->write_lock
      	mutex_unlock(&new_nodes[i]->write_lock)
      	// coalesing succ, return
      	return;
      out_nocoalesce:
      	btree_node_free(new_nodes[i])	// free new_nodes[i]
      	// obtain new_nodes[i]->write_lock
      	mutex_lock(&new_nodes[i]->write_lock);
      	// set flag for reuse
      	clear_bit(BTREE_NODE_dirty, &ew_nodes[i]->flags);
      	// release new_nodes[i]->write_lock
      	mutex_unlock(&new_nodes[i]->write_lock);
      
      To fix the problem, we add a new tag 'out_unlock_nocoalesce' for
      releasing new_nodes[i]->write_lock before out_nocoalesce tag. If
      coalescing process fails, we will go to out_unlock_nocoalesce tag
      for releasing new_nodes[i]->write_lock before free new_nodes[i] in
      out_nocoalesce tag.
      
      (Coly Li helps to clean up commit log format.)
      
      Fixes: 2a285686 ("bcache: btree locking rework")
      Signed-off-by: default avatarZhiqiang Liu <liuzhiqiang26@huawei.com>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      be23e837
    • Linus Torvalds's avatar
      Linux 5.8-rc1 · b3a9e3b9
      Linus Torvalds authored
      b3a9e3b9
    • Linus Torvalds's avatar
      Merge tag 'LSM-add-setgid-hook-5.8-author-fix' of git://github.com/micah-morton/linux · 4a87b197
      Linus Torvalds authored
      Pull SafeSetID update from Micah Morton:
       "Add additional LSM hooks for SafeSetID
      
        SafeSetID is capable of making allow/deny decisions for set*uid calls
        on a system, and we want to add similar functionality for set*gid
        calls.
      
        The work to do that is not yet complete, so probably won't make it in
        for v5.8, but we are looking to get this simple patch in for v5.8
        since we have it ready.
      
        We are planning on the rest of the work for extending the SafeSetID
        LSM being merged during the v5.9 merge window"
      
      * tag 'LSM-add-setgid-hook-5.8-author-fix' of git://github.com/micah-morton/linux:
        security: Add LSM hooks to set*gid syscalls
      4a87b197
    • Thomas Cedeno's avatar
      security: Add LSM hooks to set*gid syscalls · 39030e13
      Thomas Cedeno authored
      The SafeSetID LSM uses the security_task_fix_setuid hook to filter
      set*uid() syscalls according to its configured security policy. In
      preparation for adding analagous support in the LSM for set*gid()
      syscalls, we add the requisite hook here. Tested by putting print
      statements in the security_task_fix_setgid hook and seeing them get hit
      during kernel boot.
      Signed-off-by: default avatarThomas Cedeno <thomascedeno@google.com>
      Signed-off-by: default avatarMicah Morton <mortonm@chromium.org>
      39030e13
    • Linus Torvalds's avatar
      Merge tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 9d645db8
      Linus Torvalds authored
      Pull btrfs updates from David Sterba:
       "This reverts the direct io port to iomap infrastructure of btrfs
        merged in the first pull request. We found problems in invalidate page
        that don't seem to be fixable as regressions or without changing iomap
        code that would not affect other filesystems.
      
        There are four reverts in total, but three of them are followup
        cleanups needed to revert a43a67a2 cleanly. The result is the
        buffer head based implementation of direct io.
      
        Reverts are not great, but under current circumstances I don't see
        better options"
      
      * tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        Revert "btrfs: switch to iomap_dio_rw() for dio"
        Revert "fs: remove dio_end_io()"
        Revert "btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK"
        Revert "btrfs: split btrfs_direct_IO to read and write part"
      9d645db8
  8. 13 Jun, 2020 10 commits
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 96144c58
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix cfg80211 deadlock, from Johannes Berg.
      
       2) RXRPC fails to send norigications, from David Howells.
      
       3) MPTCP RM_ADDR parsing has an off by one pointer error, fix from
          Geliang Tang.
      
       4) Fix crash when using MSG_PEEK with sockmap, from Anny Hu.
      
       5) The ucc_geth driver needs __netdev_watchdog_up exported, from
          Valentin Longchamp.
      
       6) Fix hashtable memory leak in dccp, from Wang Hai.
      
       7) Fix how nexthops are marked as FDB nexthops, from David Ahern.
      
       8) Fix mptcp races between shutdown and recvmsg, from Paolo Abeni.
      
       9) Fix crashes in tipc_disc_rcv(), from Tuong Lien.
      
      10) Fix link speed reporting in iavf driver, from Brett Creeley.
      
      11) When a channel is used for XSK and then reused again later for XSK,
          we forget to clear out the relevant data structures in mlx5 which
          causes all kinds of problems. Fix from Maxim Mikityanskiy.
      
      12) Fix memory leak in genetlink, from Cong Wang.
      
      13) Disallow sockmap attachments to UDP sockets, it simply won't work.
          From Lorenz Bauer.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
        net: ethernet: ti: ale: fix allmulti for nu type ale
        net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init
        net: atm: Remove the error message according to the atomic context
        bpf: Undo internal BPF_PROBE_MEM in BPF insns dump
        libbpf: Support pre-initializing .bss global variables
        tools/bpftool: Fix skeleton codegen
        bpf: Fix memlock accounting for sock_hash
        bpf: sockmap: Don't attach programs to UDP sockets
        bpf: tcp: Recv() should return 0 when the peer socket is closed
        ibmvnic: Flush existing work items before device removal
        genetlink: clean up family attributes allocations
        net: ipa: header pad field only valid for AP->modem endpoint
        net: ipa: program upper nibbles of sequencer type
        net: ipa: fix modem LAN RX endpoint id
        net: ipa: program metadata mask differently
        ionic: add pcie_print_link_status
        rxrpc: Fix race between incoming ACK parser and retransmitter
        net/mlx5: E-Switch, Fix some error pointer dereferences
        net/mlx5: Don't fail driver on failure to create debugfs
        net/mlx5e: CT: Fix ipv6 nat header rewrite actions
        ...
      96144c58
    • David Sterba's avatar
      Revert "btrfs: switch to iomap_dio_rw() for dio" · 55e20bd1
      David Sterba authored
      This reverts commit a43a67a2.
      
      This patch reverts the main part of switching direct io implementation
      to iomap infrastructure. There's a problem in invalidate page that
      couldn't be solved as regression in this development cycle.
      
      The problem occurs when buffered and direct io are mixed, and the ranges
      overlap. Although this is not recommended, filesystems implement
      measures or fallbacks to make it somehow work. In this case, fallback to
      buffered IO would be an option for btrfs (this already happens when
      direct io is done on compressed data), but the change would be needed in
      the iomap code, bringing new semantics to other filesystems.
      
      Another problem arises when again the buffered and direct ios are mixed,
      invalidation fails, then -EIO is set on the mapping and fsync will fail,
      though there's no real error.
      
      There have been discussions how to fix that, but revert seems to be the
      least intrusive option.
      
      Link: https://lore.kernel.org/linux-btrfs/20200528192103.xm45qoxqmkw7i5yl@fiona/Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      55e20bd1
    • Grygorii Strashko's avatar
      net: ethernet: ti: ale: fix allmulti for nu type ale · bc139119
      Grygorii Strashko authored
      On AM65xx MCU CPSW2G NUSS and 66AK2E/L NUSS allmulti setting does not allow
      unregistered mcast packets to pass.
      
      This happens, because ALE VLAN entries on these SoCs do not contain port
      masks for reg/unreg mcast packets, but instead store indexes of
      ALE_VLAN_MASK_MUXx_REG registers which intended for store port masks for
      reg/unreg mcast packets.
      This path was missed by commit 9d1f6447 ("net: ethernet: ti: ale: fix
      seeing unreg mcast packets with promisc and allmulti disabled").
      
      Hence, fix it by taking into account ALE type in cpsw_ale_set_allmulti().
      
      Fixes: 9d1f6447 ("net: ethernet: ti: ale: fix seeing unreg mcast packets with promisc and allmulti disabled")
      Signed-off-by: default avatarGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bc139119
    • Grygorii Strashko's avatar
      net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init · 2074f9ea
      Grygorii Strashko authored
      The ALE parameters structure is created on stack, so it has to be reset
      before passing to cpsw_ale_create() to avoid garbage values.
      
      Fixes: 93a76530 ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver")
      Signed-off-by: default avatarGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2074f9ea
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · fa7566a0
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf 2020-06-12
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 26 non-merge commits during the last 10 day(s) which contain
      a total of 27 files changed, 348 insertions(+), 93 deletions(-).
      
      The main changes are:
      
      1) sock_hash accounting fix, from Andrey.
      
      2) libbpf fix and probe_mem sanitizing, from Andrii.
      
      3) sock_hash fixes, from Jakub.
      
      4) devmap_val fix, from Jesper.
      
      5) load_bytes_relative fix, from YiFei.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fa7566a0
    • Liao Pingfang's avatar
      net: atm: Remove the error message according to the atomic context · bf97bac9
      Liao Pingfang authored
      Looking into the context (atomic!) and the error message should be dropped.
      Signed-off-by: default avatarLiao Pingfang <liao.pingfang@zte.com.cn>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bf97bac9
    • Linus Torvalds's avatar
      Merge tag '5.8-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6 · f82e7b57
      Linus Torvalds authored
      Pull more cifs updates from Steve French:
       "12 cifs/smb3 fixes, 2 for stable.
      
         - add support for idsfromsid on create and chgrp/chown allowing
           ability to save owner information more naturally for some workloads
      
         - improve query info (getattr) when SMB3.1.1 posix extensions are
           negotiated by using new query info level"
      
      * tag '5.8-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
        smb3: Add debug message for new file creation with idsfromsid mount option
        cifs: fix chown and chgrp when idsfromsid mount option enabled
        smb3: allow uid and gid owners to be set on create with idsfromsid mount option
        smb311: Add tracepoints for new compound posix query info
        smb311: add support for using info level for posix extensions query
        smb311: Add support for lookup with posix extensions query info
        smb311: Add support for SMB311 query info (non-compounded)
        SMB311: Add support for query info using posix extensions (level 100)
        smb3: add indatalen that can be a non-zero value to calculation of credit charge in smb2 ioctl
        smb3: fix typo in mount options displayed in /proc/mounts
        cifs: Add get_security_type_str function to return sec type.
        smb3: extend fscache mount volume coherency check
      f82e7b57
    • Linus Torvalds's avatar
      binderfs: add gitignore for generated sample program · 4f9b3a37
      Linus Torvalds authored
      Let's keep "git status" happy and quiet.
      
      Fixes: 9762dc14 ("samples: add binderfs sample program
      Fixes: fca5e949 ("samples: binderfs: really compile this sample and fix build issues")
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4f9b3a37
    • Linus Torvalds's avatar
      doc: don't use deprecated "---help---" markers in target docs · 3e1ad405
      Linus Torvalds authored
      I'm not convinced the script makes useful automaed help lines anyway,
      but since we're trying to deprecate the use of "---help---" in Kconfig
      files, let's fix the doc example code too.
      
      See commit a7f7f624 ("treewide: replace '---help---' in Kconfig
      files with 'help'")
      
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3e1ad405
    • Linus Torvalds's avatar
      Merge tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild · 6adc19fd
      Linus Torvalds authored
      Pull more Kbuild updates from Masahiro Yamada:
      
       - fix build rules in binderfs sample
      
       - fix build errors when Kbuild recurses to the top Makefile
      
       - covert '---help---' in Kconfig to 'help'
      
      * tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
        treewide: replace '---help---' in Kconfig files with 'help'
        kbuild: fix broken builds because of GZIP,BZIP2,LZOP variables
        samples: binderfs: really compile this sample and fix build issues
      6adc19fd