1. 11 May, 2017 1 commit
    • libnvdimm: add an atomic vs process context flag to rw_bytes · 3ae3d67b
      Vishal Verma authored
      nsio_rw_bytes can clear media errors, but this cannot be done while we
      are in an atomic context due to locking within ACPI. From the BTT,
      ->rw_bytes may be called either from atomic or process context depending
      on whether the calls happen during initialization or during IO.
      
      During init, we want to ensure error clearing happens, and the flag
      marking process context allows nsio_rw_bytes to do that. When called
      during IO, we're in atomic context, and error clearing can be skipped.
      
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
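The idea of passing a context flag down to the byte-level I/O routine can be sketched in user-space C. This is only an illustration under stated assumptions: `SKETCH_IO_ATOMIC`, `struct sketch_media`, `sketch_clear_error()` and `sketch_write_bytes()` are hypothetical names invented here, not the kernel's identifiers, and the "sleeping" clear path is modeled as a plain function call.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag: the caller is in atomic context and must not sleep. */
#define SKETCH_IO_ATOMIC 0x1UL

struct sketch_media {
    unsigned char data[64];
    bool poisoned;      /* models a latent media error */
    int clear_calls;    /* how many times the clear path ran */
};

/* Stand-in for the error-clearing path that may sleep in a real driver. */
static void sketch_clear_error(struct sketch_media *m)
{
    m->poisoned = false;
    m->clear_calls++;
}

/* Write 'len' bytes at 'off'; returns 0 on success, -1 on a bad range. */
static int sketch_write_bytes(struct sketch_media *m, size_t off,
                              const void *buf, size_t len,
                              unsigned long flags)
{
    if (off > sizeof(m->data) || len > sizeof(m->data) - off)
        return -1;

    /* Only attempt to clear errors when the caller is allowed to sleep. */
    if (m->poisoned && !(flags & SKETCH_IO_ATOMIC))
        sketch_clear_error(m);

    memcpy(m->data + off, buf, len);
    return 0;
}
```

In this model an initialization-time caller would pass `0` so that errors get cleared, while an I/O-path caller would pass `SKETCH_IO_ATOMIC` and the clear step is skipped.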
  2. 09 May, 2017 2 commits
  3. 08 May, 2017 2 commits
  4. 05 May, 2017 2 commits
    • Dan Williams's avatar
      73616367
    • libnvdimm, pfn: fix 'npfns' vs section alignment · d5483fed
      Dan Williams authored
      Fix failures to create namespaces due to the vmem_altmap not advertising
      enough free space to store the memmap.
      
       WARNING: CPU: 15 PID: 8022 at arch/x86/mm/init_64.c:656 arch_add_memory+0xde/0xf0
       [..]
       Call Trace:
        dump_stack+0x63/0x83
        __warn+0xcb/0xf0
        warn_slowpath_null+0x1d/0x20
        arch_add_memory+0xde/0xf0
        devm_memremap_pages+0x244/0x440
        pmem_attach_disk+0x37e/0x490 [nd_pmem]
        nd_pmem_probe+0x7e/0xa0 [nd_pmem]
        nvdimm_bus_probe+0x71/0x120 [libnvdimm]
        driver_probe_device+0x2bb/0x460
        bind_store+0x114/0x160
        drv_attr_store+0x25/0x30
      
      In commit 658922e5 "libnvdimm, pfn: fix memmap reservation sizing"
      we arranged for the capacity to be allocated, but failed to also update
      the 'npfns' parameter. This leads to cases where there is enough
      capacity reserved to hold all the allocated sections, but
      vmemmap_populate_hugepages() still encounters -ENOMEM from
      altmap_alloc_block_buf().
      
      This fix is a stop-gap until we can teach the core memory hotplug
      implementation to permit sub-section hotplug.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 658922e5 ("libnvdimm, pfn: fix memmap reservation sizing")
      Reported-by: Anisha Allada <anisha.allada@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
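To make the section-alignment issue concrete, the sketch below rounds a page count up to a whole number of memory sections before sizing the memmap. It is not the actual fix from this commit. The 4 KiB page size and 128 MiB section size are assumptions (typical for x86_64), and every `sketch_*` name is hypothetical.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT    12  /* assumed 4 KiB pages */
#define SKETCH_SECTION_SHIFT 27  /* assumed 128 MiB memory sections */
#define SKETCH_PAGES_PER_SECTION \
    (1ULL << (SKETCH_SECTION_SHIFT - SKETCH_PAGE_SHIFT))

/* Round a page-frame count up to the next section boundary. */
static uint64_t sketch_section_align_up(uint64_t npfns)
{
    return (npfns + SKETCH_PAGES_PER_SECTION - 1)
           & ~(SKETCH_PAGES_PER_SECTION - 1);
}

/*
 * Bytes of memmap to reserve when whole sections are populated:
 * the section-aligned page count times the per-page descriptor size.
 */
static uint64_t sketch_memmap_bytes(uint64_t npfns, uint64_t desc_size)
{
    return sketch_section_align_up(npfns) * desc_size;
}
```

The point of the model is that a reservation computed from the unaligned page count can fall short once hotplug populates full sections.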
  5. 04 May, 2017 2 commits
  6. 03 May, 2017 1 commit
  7. 01 May, 2017 4 commits
    • block, dax: use correct format string in bdev_dax_supported · 67fd3897
      Arnd Bergmann authored
      The new message has an incorrect format string, causing a warning in some
      configurations:
      
      fs/block_dev.c: In function 'bdev_dax_supported':
      fs/block_dev.c:779:5: error: format '%d' expects argument of type 'int', but argument 2 has type 'long int' [-Werror=format=]
           "error: dax access failed (%d)", len);
      
      This changes it to use the correct %ld instead of %d.
      
      Fixes: 2093f2e9 ("block, dax: convert bdev_dax_supported() to dax_direct_access()")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
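The format-string mismatch is easy to reproduce outside the kernel. The following sketch uses standard `snprintf()` rather than the kernel's logging helpers, and `sketch_format_dax_error()` is a hypothetical name used only for illustration.

```c
#include <stdio.h>

/* Format a failure message for a 'long' error value into 'buf'. */
static int sketch_format_dax_error(char *buf, size_t size, long len)
{
    /* %ld matches 'long'; using %d here would trigger -Wformat. */
    return snprintf(buf, size, "error: dax access failed (%ld)", len);
}
```

Compiling the same call with `%d` and `-Wformat` enabled should produce a diagnostic similar to the one quoted in the commit message.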
    • device-dax: fix sysfs attribute deadlock · 565851c9
      Dan Williams authored
      Usage of device_lock() for dax_region attributes is unnecessary and
      deadlock prone. It's unnecessary because the order of registration /
      un-registration guarantees that drvdata is always valid. It's deadlock
      prone because it sets up this situation:
      
       ndctl           D    0  2170   2082 0x00000000
       Call Trace:
        __schedule+0x31f/0x980
        schedule+0x3d/0x90
        schedule_preempt_disabled+0x15/0x20
        __mutex_lock+0x402/0x980
        ? __mutex_lock+0x158/0x980
        ? align_show+0x2b/0x80 [dax]
        ? kernfs_seq_start+0x2f/0x90
        mutex_lock_nested+0x1b/0x20
        align_show+0x2b/0x80 [dax]
        dev_attr_show+0x20/0x50
      
       ndctl           D    0  2186   2079 0x00000000
       Call Trace:
        __schedule+0x31f/0x980
        schedule+0x3d/0x90
        __kernfs_remove+0x1f6/0x340
        ? kernfs_remove_by_name_ns+0x45/0xa0
        ? remove_wait_queue+0x70/0x70
        kernfs_remove_by_name_ns+0x45/0xa0
        remove_files.isra.1+0x35/0x70
        sysfs_remove_group+0x44/0x90
        sysfs_remove_groups+0x2e/0x50
        dax_region_unregister+0x25/0x40 [dax]
        devm_action_release+0xf/0x20
        release_nodes+0x16d/0x2b0
        devres_release_all+0x3c/0x60
        device_release_driver_internal+0x17d/0x220
        device_release_driver+0x12/0x20
        unbind_store+0x112/0x160
      
      ndctl/2170 is trying to acquire the device_lock() to read an attribute,
      and ndctl/2186 is holding the device_lock() while trying to drain all
      active attribute readers.
      
      Thanks to Yi Zhang for the reproduction script.
      
      Fixes: d7fe1a67 ("dax: add region 'id', 'size', and 'align' attributes")
      Cc: <stable@vger.kernel.org>
      Reported-by: Yi Zhang <yizhan@redhat.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking" · a3e9af95
      Dan Williams authored
      This continues the 4.11 status quo of disabling error clearing from
      the BTT I/O path. Toshi found that even though we have eliminated all
      the libnvdimm sources of sleeping-while-atomic triggers, we still have
      sleeping operations that will occur in the path to send the ACPI DSM to
      the DIMM to clear the error:
      
       BUG: sleeping function called from invalid context at mm/slab.h:432
       in_atomic(): 1, irqs_disabled(): 0, pid: 13353, name: dd
       Call Trace:
        dump_stack+0x86/0xc3
        ___might_sleep+0x17d/0x250
        __might_sleep+0x4a/0x80
        __kmalloc+0x1c0/0x2e0
        acpi_os_allocate_zeroed+0x2d/0x2f
        acpi_evaluate_object+0x59/0x3b1
        acpi_evaluate_dsm+0xbd/0x10c
        acpi_nfit_ctl+0x1ef/0x7c0 [nfit]
        ? nsio_rw_bytes+0x152/0x280
        nvdimm_clear_poison+0x77/0x140
        nsio_rw_bytes+0x18f/0x280
        btt_write_pg+0x1d4/0x3d0 [nd_btt]
        btt_make_request+0x119/0x2d0 [nd_btt]
      
      A solution for tracking and handling media errors natively in the BTT is
      needed.
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Reported-by: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering · 452bae0a
      Dan Williams authored
      A debug patch to turn the standard device_lock() into something that
      lockdep can analyze yielded the following:
      
       ======================================================
       [ INFO: possible circular locking dependency detected ]
       4.11.0-rc4+ #106 Tainted: G           O
       -------------------------------------------------------
       lt-libndctl/1898 is trying to acquire lock:
        (&dev->nvdimm_mutex/3){+.+.+.}, at: [<ffffffffc023c948>] nd_attach_ndns+0x178/0x1b0 [libnvdimm]
      
       but task is already holding lock:
        (&nvdimm_bus->reconfig_mutex){+.+.+.}, at: [<ffffffffc022e0b1>] nvdimm_bus_lock+0x21/0x30 [libnvdimm]
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #1 (&nvdimm_bus->reconfig_mutex){+.+.+.}:
              lock_acquire+0xf6/0x1f0
              __mutex_lock+0x88/0x980
              mutex_lock_nested+0x1b/0x20
              nvdimm_bus_lock+0x21/0x30 [libnvdimm]
              nvdimm_namespace_capacity+0x1b/0x40 [libnvdimm]
              nvdimm_namespace_common_probe+0x230/0x510 [libnvdimm]
              nd_pmem_probe+0x14/0x180 [nd_pmem]
              nvdimm_bus_probe+0xa9/0x260 [libnvdimm]
      
       -> #0 (&dev->nvdimm_mutex/3){+.+.+.}:
              __lock_acquire+0x1107/0x1280
              lock_acquire+0xf6/0x1f0
              __mutex_lock+0x88/0x980
              mutex_lock_nested+0x1b/0x20
              nd_attach_ndns+0x178/0x1b0 [libnvdimm]
              nd_namespace_store+0x308/0x3c0 [libnvdimm]
              namespace_store+0x87/0x220 [libnvdimm]
      
      In this case '&dev->nvdimm_mutex/3' mirrors '&dev->mutex'.
      
      Fix this by replacing the use of device_lock() with nvdimm_bus_lock() to protect
      nd_{attach,detach}_ndns() operations.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 8c2f7e86 ("libnvdimm: infrastructure for btt devices")
      Reported-by: Yi Zhang <yizhan@redhat.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  8. 29 Apr, 2017 1 commit
    • libnvdimm: rework region badblocks clearing · 23f49844
      Dan Williams authored
      Toshi noticed that the new support for a region-level badblocks missed
      the case where errors are cleared due to BTT I/O.
      
      An initial attempt to fix this ran into a "sleeping while atomic"
      warning due to taking the nvdimm_bus_lock() in the BTT I/O path to
      satisfy the locking requirements of __nvdimm_bus_badblocks_clear().
      However, that lock is not needed since we are not acting on any data that
      is subject to change under that lock. The badblocks instance has its own
      internal lock to handle mutations of the error list.
      
      So, in order to make it clear that we are just acting on region devices,
      rename __nvdimm_bus_badblocks_clear() to nvdimm_clear_badblocks_regions().
      Eliminate the lock and consolidate all support routines for the new
      nvdimm_account_cleared_poison() in drivers/nvdimm/bus.c. Finally, take the
      opportunity to clean up some unnecessary casts, make the calling
      convention of nvdimm_clear_badblocks_regions() clearer by replacing struct
      resource with the minimal struct clear_badblocks_context, and use the
      DEVICE_ATTR macro.
      
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Reported-by: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  9. 28 Apr, 2017 4 commits
    • acpi, nfit: kill ACPI_NFIT_DEBUG · 7699a6a3
      Dan Williams authored
      Inevitably when one actually needs to debug a DSM issue it's on a
      distribution kernel that has CONFIG_ACPI_NFIT_DEBUG=n. The config symbol
      was only there to avoid the compile error due to the missing fallback for
      print_hex_dump_debug in the CONFIG_DYNAMIC_DEBUG=n case. That was fixed
      with commit cdf17449 "hexdump: do not print debug dumps for
      !CONFIG_DEBUG", so the config symbol can just be dropped.
      
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm: fix clear length of nvdimm_forget_poison() · 8d13c029
      Toshi Kani authored
      ND_CMD_CLEAR_ERROR command returns 'clear_err.cleared', the length
      of error actually cleared, which may be smaller than its requested
      'len'.
      
      Change nvdimm_clear_poison() to call nvdimm_forget_poison() with
      'clear_err.cleared' when this value is valid.
      
      Cc: <stable@vger.kernel.org>
      Fixes: e046114a ("libnvdimm: clear the internal poison_list when clearing badblocks")
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
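The distinction between the requested length and the length actually cleared can be modeled as below. This is a sketch, not the kernel code: `struct sketch_clear_result` and `sketch_bytes_to_forget()` are hypothetical, and treating a `cleared` value larger than the request as invalid is an assumption made here for illustration.

```c
#include <stdint.h>

/* Hypothetical result of a clear-error request. */
struct sketch_clear_result {
    int status;        /* 0 on success, negative on failure */
    uint64_t cleared;  /* bytes actually cleared; may be < requested */
};

/*
 * Return how many bytes the caller may drop from its own poison
 * tracking: only what the device reports as cleared, never the
 * requested length.
 */
static uint64_t sketch_bytes_to_forget(const struct sketch_clear_result *res,
                                       uint64_t requested)
{
    if (res->status != 0)
        return 0;
    if (res->cleared > requested)
        return 0;  /* assumption: an implausible value is treated as invalid */
    return res->cleared;
}
```

A caller that forgot `requested` bytes instead would lose track of any errors the device did not manage to clear.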
    • libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify · b2518c78
      Toshi Kani authored
      The following BUG was observed when nd_pmem_notify() was called
      for a BTT device.  The use of a pmem_device pointer is not valid
      with BTT.
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
       IP: nd_pmem_notify+0x30/0xf0 [nd_pmem]
       Call Trace:
        nd_device_notify+0x40/0x50
        child_notify+0x10/0x20
        device_for_each_child+0x50/0x90
        nd_region_notify+0x20/0x30
        nd_device_notify+0x40/0x50
        nvdimm_region_notify+0x27/0x30
        acpi_nfit_scrub+0x341/0x590 [nfit]
        process_one_work+0x197/0x450
        worker_thread+0x4e/0x4a0
        kthread+0x109/0x140
      
      Fix nd_pmem_notify() by setting nd_region and badblocks pointers
      properly for BTT.
      
      Cc: <stable@vger.kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Fixes: 71999466 ("libnvdimm: async notification support")
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, region: sysfs trigger for nvdimm_flush() · ab630891
      Dan Williams authored
      The nvdimm_flush() mechanism helps to reduce the impact of an ADR
      (asynchronous DRAM refresh) failure. The ADR mechanism handles flushing
      platform WPQ (write-pending-queue) buffers when power is removed. The
      nvdimm_flush() mechanism performs that same function on-demand.
      
      When a pmem namespace is associated with a block device, an
      nvdimm_flush() is triggered with every block-layer REQ_FUA or REQ_FLUSH
      request. These requests are typically associated with filesystem
      metadata updates. However, when a namespace is in device-dax mode,
      userspace (think database metadata) needs another path to perform the
      same flushing. In other words this is not required to make data
      persistent, but in the case of metadata it allows for a smaller failure
      domain in the unlikely event of an ADR failure.
      
      The new 'deep_flush' attribute is visible when the individual DIMMs
      backing a given interleave-set are described by platform firmware. In
      ACPI terms this is "NVDIMM Region Mapping Structures" and associated
      "Flush Hint Address Structures". Reads return "1" if the region supports
      triggering WPQ flushes on all DIMMs. Reads return "0" if the flush
      operation is a platform nop, and in that case the attribute is
      read-only.
      
      Why sysfs and not an ioctl? An ioctl requires establishing a new
      ioctl function number space for device-dax. Given that this would be
      called on a device-dax fd an application could be forgiven for
      accidentally calling this on a filesystem-dax fd. Placing this interface
      in libnvdimm sysfs removes that potential for collision with a
      filesystem ioctl, and it keeps ioctls out of the generic device-dax
      implementation.
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
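From userspace, using such an attribute might look like the sketch below. The helper names are hypothetical, the attribute path is passed in by the caller (something like `/sys/bus/nd/devices/regionN/deep_flush` is an assumption, not confirmed by the text above), and writing the string "1" as the trigger value is likewise assumed.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Read the first byte of the attribute.
 * Returns 1 if it reads '1', 0 for any other value, -1 on error.
 */
static int sketch_deep_flush_supported(const char *attr_path)
{
    char c = 0;
    int fd = open(attr_path, O_RDONLY);

    if (fd < 0)
        return -1;
    ssize_t n = read(fd, &c, 1);
    close(fd);
    if (n != 1)
        return -1;
    return c == '1';
}

/* Write "1" to request a flush; returns 0 on success, -1 on error. */
static int sketch_trigger_deep_flush(const char *attr_path)
{
    int fd = open(attr_path, O_WRONLY);

    if (fd < 0)
        return -1;
    ssize_t n = write(fd, "1", 1);
    int rc = close(fd);
    return (n == 1 && rc == 0) ? 0 : -1;
}
```

An application would typically check `sketch_deep_flush_supported()` once and only call the trigger when it returns 1, since the attribute is described as read-only when the flush is a platform nop.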
  10. 27 Apr, 2017 1 commit
  11. 25 Apr, 2017 7 commits
  12. 24 Apr, 2017 1 commit
    • libnvdimm, region: fix flush hint detection crash · bc042fdf
      Dan Williams authored
      In the case where a DIMM does not have any associated flush hints, the
      ndrd->flush_wpq array may be uninitialized, leading to crashes with the
      following signature:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
       IP: region_visible+0x10f/0x160 [libnvdimm]
      
       Call Trace:
        internal_create_group+0xbe/0x2f0
        sysfs_create_groups+0x40/0x80
        device_add+0x2d8/0x650
        nd_async_device_register+0x12/0x40 [libnvdimm]
        async_run_entry_fn+0x39/0x170
        process_one_work+0x212/0x6c0
        ? process_one_work+0x197/0x6c0
        worker_thread+0x4e/0x4a0
        kthread+0x10c/0x140
        ? process_one_work+0x6c0/0x6c0
        ? kthread_create_on_node+0x60/0x60
        ret_from_fork+0x31/0x40
      
      Cc: <stable@vger.kernel.org>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Fixes: f284a4f2 ("libnvdimm: introduce nvdimm_flush() and nvdimm_has_flush()")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  13. 20 Apr, 2017 3 commits
    • dm: add dax_device and dax_operations support · f26c5719
      Dan Williams authored
      Allocate a dax_device to represent the capacity of a device-mapper
      instance. Provide a ->direct_access() method via the new dax_operations
      indirection that mirrors the functionality of the current direct_access
      support via block_device_operations.  Once fs/dax.c has been converted
      to use dax_operations the old dm_blk_direct_access() will be removed.
      
      A new helper dm_dax_get_live_target() is introduced to separate some of
      the dm-specifics from the direct_access implementation.
      
      This enabling is only for the top-level dm representation to upper
      layers. Converting target direct_access implementations is deferred to a
      separate patch.
      
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Reviewed-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • dax: introduce dax_direct_access() · b0686260
      Dan Williams authored
      Replace bdev_direct_access() with dax_direct_access() that uses
      dax_device and dax_operations instead of a block_device and
      block_device_operations for dax. Once all consumers of the old api have
      been converted bdev_direct_access() will be deleted.
      
      Given that block device partitioning decisions can cause dax page
      alignment constraints to be violated, this also introduces the
      bdev_dax_pgoff() helper. It handles calculating a logical pgoff relative
      to the dax_device and also checks for page alignment.

      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
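The offset calculation and alignment check described for bdev_dax_pgoff() can be approximated in plain C as follows. This is a sketch under assumptions: 512-byte sectors and 4 KiB pages are assumed, `sketch_dax_pgoff()` and its parameters are hypothetical, and the real helper operates on kernel types rather than raw integers.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_SECTOR_SIZE 512ULL   /* assumed sector size */
#define SKETCH_PAGE_SIZE   4096ULL  /* assumed page size */

/*
 * Translate a partition-relative sector into a page offset relative
 * to the whole device. Returns 0 on success, or -EINVAL when the
 * resulting byte offset or the size is not page aligned.
 */
static int sketch_dax_pgoff(uint64_t part_start_sect, uint64_t sector,
                            uint64_t size, uint64_t *pgoff)
{
    uint64_t byte_off = (part_start_sect + sector) * SKETCH_SECTOR_SIZE;

    if (byte_off % SKETCH_PAGE_SIZE || size % SKETCH_PAGE_SIZE)
        return -EINVAL;
    if (pgoff)
        *pgoff = byte_off / SKETCH_PAGE_SIZE;
    return 0;
}
```

With these assumptions, a partition starting at sector 2048 keeps page alignment, while one starting at sector 63 does not, which is the kind of partitioning decision the commit message refers to.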
    • block: kill bdev_dax_capable() · d8f07aee
      Dan Williams authored
      This is leftover dead code that has since been replaced by
      bdev_dax_supported().

      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  14. 19 Apr, 2017 6 commits
  15. 18 Apr, 2017 2 commits
  16. 17 Apr, 2017 1 commit