1. 26 Jun, 2015 16 commits
    • arch, x86: pmem api for ensuring durability of persistent memory updates · 61031952
      Ross Zwisler authored
      Based on an original patch by Ross Zwisler [1].
      
      Writes to persistent memory have the potential to be posted to cpu
      cache, cpu write buffers, and platform write buffers (memory controller)
      before being committed to persistent media.  Provide APIs,
      memcpy_to_pmem(), wmb_pmem(), and memremap_pmem(), to write data to
      pmem and assert that it is durable in PMEM (a persistent linear address
      range).  A '__pmem' attribute is added so sparse can track proper usage
      of pointers to pmem.
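
      A minimal usage sketch of these calls (error handling trimmed;
      'offset', 'size', 'buf', and 'len' are hypothetical locals):

        void __pmem *addr = memremap_pmem(offset, size);

        if (!addr)
                return -ENXIO;
        memcpy_to_pmem(addr, buf, len); /* may still sit in write buffers */
        wmb_pmem();                     /* flush them; data is now durable */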
      
      This continues the status quo of pmem being x86-only for 4.2, but the
      reworks to ioremap and the wider implementation of memremap() will
      enable other archs in 4.3.
      
      [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-May/000932.html
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      [djbw: various reworks]
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm: Add sysfs numa_node to NVDIMM devices · 74ae66c3
      Toshi Kani authored
      Add support for a sysfs 'numa_node' attribute on the I/O-related NVDIMM
      devices under /sys/bus/nd/devices: regionN, namespaceN.0, and bttN.x.
      
      An example of numa_node values on a 2-socket system with a single
      NVDIMM range on each socket is shown below.
        /sys/bus/nd/devices
        |-- btt0.0/numa_node:0
        |-- btt1.0/numa_node:1
        |-- btt1.1/numa_node:1
        |-- namespace0.0/numa_node:0
        |-- namespace1.0/numa_node:1
        |-- region0/numa_node:0
        |-- region1/numa_node:1
      
      These numa_node files are then linked under the block class of
      their device names.
        /sys/class/block/pmem0/device/numa_node:0
        /sys/class/block/pmem1s/device/numa_node:1
      
      This enables numactl(8) to accept 'block:' and 'file:' paths of
      pmem and btt devices as shown in the examples below.
        numactl --preferred block:pmem0 --show
        numactl --preferred file:/dev/pmem1s --show
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm: Set numa_node to NVDIMM devices · 41d7a6d6
      Toshi Kani authored
      The ACPI NFIT table has System Physical Address Range Structure entries
      that describe the proximity ID of each range when
      ACPI_NFIT_PROXIMITY_VALID is set in the flags.
      
      Change acpi_nfit_register_region() to map a proximity ID to its node
      ID, and set it in a new numa_node field of nd_region_desc, which is
      then conveyed to the nd_region device.
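
      A sketch of the mapping, assuming the SPA range structure is at hand
      in 'spa' and the region descriptor in 'ndr_desc':

        if (spa->flags & ACPI_NFIT_PROXIMITY_VALID)
                ndr_desc->numa_node =
                        acpi_map_pxm_to_online_node(spa->proximity_domain);
        else
                ndr_desc->numa_node = NUMA_NO_NODE;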
      
      The device core arranges for btt and namespace devices to inherit their
      node from their parent region.
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      [djbw: move set_dev_node() from region.c to bus.c]
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • acpi: Add acpi_map_pxm_to_online_node() · 99759869
      Toshi Kani authored
      The kernel initializes the NUMA topology of CPUs and memory from the
      ACPI SRAT table.  Some other ACPI tables, such as NFIT and DMAR, also
      contain proximity IDs that describe their devices' NUMA topology.  This
      information can be used to improve the performance of these devices.
      
      This patch introduces acpi_map_pxm_to_online_node(), which is
      similar to acpi_map_pxm_to_node(), but always returns an online
      node.  When the mapped node from a given proximity ID is offline,
      it looks up the node distance table and returns the nearest
      online node.
      
      ACPI device drivers, which are called after the NUMA initialization
      has completed in the kernel, can call this interface to obtain their
      device NUMA topology from ACPI tables.  Such drivers do not have to
      deal with offline nodes.  A node may be offline when a device's
      proximity ID is unique, when its SRAT memory entry does not exist, or
      when NUMA is disabled, e.g. with "numa=off" on x86.
      
      This patch also moves the pxm range check from acpi_get_node() to
      acpi_map_pxm_to_node().
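
      A simplified sketch of the fallback logic (the in-tree version also
      validates the pxm range):

        int acpi_map_pxm_to_online_node(int pxm)
        {
                int n, dist, best = 0, min_dist = INT_MAX;
                int node = acpi_map_pxm_to_node(pxm);

                if (node == NUMA_NO_NODE)
                        node = 0;

                if (!node_online(node)) {
                        /* find the nearest online node by distance */
                        for_each_online_node(n) {
                                dist = node_distance(node, n);
                                if (dist < min_dist) {
                                        min_dist = dist;
                                        best = n;
                                }
                        }
                        node = best;
                }
                return node;
        }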
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only · 58138820
      Dan Williams authored
      Upon detection of an unarmed dimm in a region, arrange for descendant
      BTT, PMEM, or BLK instances to be read-only.  A dimm is primarily marked
      "unarmed" via flags passed by platform firmware (NFIT).
      
      The flags in the NFIT memory device sub-structure indicate the state of
      the data on the nvdimm relative to its energy source or last "flush to
      persistence".  For the most part there is nothing the driver can do but
      advertise the state of these flags in sysfs and emit a message if
      firmware indicates that the contents of the device may be corrupted.
      However, for the case of ACPI_NFIT_MEM_ARMED, the driver can arrange for
      the block devices incorporating that nvdimm to be marked read-only.
      This is a safe default as the data is still available and new writes are
      held off until the administrator either forces read-write mode, or the
      energy source becomes armed.
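
      The enforcement point can be as small as the following sketch, assuming
      the region driver has already folded the dimm flags into an 'ro' flag:

        /* in the block driver's probe path */
        set_disk_ro(disk, nd_region->ro);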
      
      A 'read_only' attribute is added to REGION devices to allow for
      overriding the default read-only policy of all descendant block devices.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • pmem: flag pmem block devices as non-rotational · 0f51c4fa
      Dan Williams authored
      ...since they are effectively SSDs as far as userspace is concerned.
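
      The change amounts to one queue flag at setup time, roughly:

        queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->pmem_queue);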
      Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm: enable iostat · f0dc089c
      Dan Williams authored
      This is disabled by default as the overhead is prohibitive, but if the
      user takes the action to turn it on we'll oblige.
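
      Bio-based drivers account I/O manually; a sketch of the helper this
      enables, built on the generic block accounting calls:

        static bool nd_iostat_start(struct bio *bio, unsigned long *start)
        {
                struct gendisk *disk = bio->bi_bdev->bd_disk;

                /* honor /sys/block/<dev>/queue/iostats */
                if (!blk_queue_io_stat(disk->queue))
                        return false;

                *start = jiffies;
                generic_start_io_acct(bio_data_dir(bio), bio_sectors(bio),
                                &disk->part0);
                return true;
        }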
      Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • pmem: make_request cleanups · edc870e5
      Dan Williams authored
      Various cleanups:
      
      1/ Kill the BUG_ON since we've already told the block layer we don't
         support DISCARD on all these drivers.
      
      2/ Kill the 'rw' variable, no need to cache it.
      
      3/ Kill the local 'sector' variable.  bio_for_each_segment() is already
         advancing the iterator's sector number by the bio_vec length.
      
      4/ Kill the check for accessing past the end of device
         generic_make_request_checks() already does that.
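
      After these cleanups the request loop reduces to roughly the following
      sketch (pmem_do_bvec() is the driver's per-segment copy helper):

        static void pmem_make_request(struct request_queue *q, struct bio *bio)
        {
                struct pmem_device *pmem = q->queuedata;
                struct bio_vec bvec;
                struct bvec_iter iter;

                bio_for_each_segment(bvec, bio, iter)
                        pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
                                        bvec.bv_offset, bio_data_dir(bio),
                                        iter.bi_sector);
                bio_endio(bio, 0);
        }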
      Suggested-by: Christoph Hellwig <hch@lst.de>
      [hch: kill access past end of the device check]
      Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, pmem: fix up max_hw_sectors · 43d3fa3a
      Dan Williams authored
      There is no hardware limit to enforce on the size of the i/o that can be passed
      to an nvdimm block device, so set it to UINT_MAX.
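
      The fix is a one-liner at queue setup, roughly:

        blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX);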
      Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, blk: add support for blk integrity · fcae6957
      Vishal Verma authored
      Support multiple block sizes (sector + metadata) for nd_blk in the
      same way as done for the BTT. Add the idea of an 'internal' lbasize,
      which is properly aligned and padded, and store metadata in this space.
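
      A sketch of the padding arithmetic (the 64-byte alignment and the
      'nsblk' naming are illustrative):

        /* pad the advertised lbasize (sector + metadata) for alignment */
        nsblk->internal_lbasize = roundup(nsblk->lbasize, 64);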
      Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, btt: add support for blk integrity · 41cd8b70
      Vishal Verma authored
      Support multiple block sizes (sector + metadata) using the blk integrity
      framework. This registers a new integrity template that defines the
      protection information tuple size based on the configured metadata size,
      and simply acts as a passthrough for protection information generated by
      another layer. The metadata is written to the storage as-is, and read back
      with each sector.
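
      A sketch of the registration, assuming a nop generate/verify callback
      and a meta_size derived from the configured lbasize:

        struct blk_integrity bi = {
                .name = "ND-PI-NOP",
                .generate_fn = nd_pi_nop_generate_verify, /* passthrough */
                .verify_fn = nd_pi_nop_generate_verify,
                .tuple_size = meta_size,
                .tag_size = meta_size,
        };

        blk_integrity_register(disk, &bi);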
      Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • fs/block_dev.c: skip rw_page if bdev has integrity · f68eb1e7
      Vishal Verma authored
      If a block device has bio integrity enabled, rw_page will bypass the
      integrity payload, which is undesirable. Skip rw_page if this is the
      case.
      
      Currently brd and zram provide rw_page, and the proposed 'nd' drivers
      will too.
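
      A sketch of the read-side check (the write side gets the same guard):

        int bdev_read_page(struct block_device *bdev, sector_t sector,
                        struct page *page)
        {
                const struct block_device_operations *ops = bdev->bd_disk->fops;

                if (!ops->rw_page || bdev_get_integrity(bdev))
                        return -EOPNOTSUPP;
                return ops->rw_page(bdev, sector + get_start_sect(bdev),
                                page, READ);
        }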
      
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Suggested-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm: Non-Volatile Devices · bc30196f
      Dan Williams authored
      Maintainer information and documentation for drivers/nvdimm
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • tools/testing/nvdimm: libnvdimm unit test infrastructure · 6bc75619
      Dan Williams authored
      'libnvdimm' is the first driver sub-system in the kernel to implement
      mocking for unit test coverage.  The nfit_test module gets built as an
      external module and arranges for external module replacements of nfit,
      libnvdimm, nd_pmem, and nd_blk.  These replacements use the linker
      --wrap option to redirect calls to ioremap() + request_mem_region() to
      custom defined unit test resources.  The end result is a fully
      functional nvdimm_bus, as far as userspace is concerned, but with the
      capability to perform otherwise destructive tests on emulated resources.
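
      The mechanism leans on the linker's --wrap option: calls to
      ioremap_cache() resolve to __wrap_ioremap_cache(), while
      __real_ioremap_cache() reaches the original.  A sketch, with the
      lookup helper being a hypothetical stand-in:

        void __iomem *__wrap_ioremap_cache(resource_size_t offset,
                        unsigned long size)
        {
                void *res = nfit_test_lookup(offset); /* hypothetical */

                if (res) /* emulated range backed by dma_alloc_coherent() */
                        return (void __iomem *) res;
                return __real_ioremap_cache(offset, size);
        }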
      
      Q: Why not use QEMU for this emulation?
      QEMU is not suitable for unit testing.  QEMU's role is to faithfully
      emulate the platform.  A unit test's role is to unfaithfully implement
      the platform with the goal of triggering bugs in the corners of the
      sub-system implementation.  As bugs are discovered in platforms, or the
      sub-system itself, the unit tests are extended to backstop a fix with a
      reproducer unit test.
      
      Another problem with QEMU is that it would require coordination of 3
      software projects instead of 2 (kernel + libndctl [1]) to maintain and
      execute the tests.  The chances of bit rot and the difficulty of
      getting the tests running go up non-linearly with the number of
      components involved.
      
      
      Q: Why submit this to the kernel tree instead of external modules in
         libndctl?
      Simple, to alleviate the same risk that out-of-tree external modules
      face.  Updates to drivers/nvdimm/ can be immediately evaluated to see if
      they have any impact on tools/testing/nvdimm/.
      
      
      Q: What are the negative implications of merging this?
      It is a unique maintenance burden because mocking an interface for a
      unit test purposefully short-circuits the semantics of a routine.  For
      example
      __wrap_ioremap_cache() fakes the pmem driver into "ioremap()'ing" a test
      resource buffer allocated by dma_alloc_coherent().  The future
      maintenance burden hits when someone changes the semantics of
      ioremap_cache() and wonders what the implications are for the unit test.
      
      [1]: https://github.com/pmem/ndctl
      
      Cc: <linux-acpi@vger.kernel.org>
      Cc: Lv Zheng <lv.zheng@intel.com>
      Cc: Robert Moore <robert.moore@intel.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory · 047fc8a1
      Ross Zwisler authored
      The libnvdimm implementation handles allocating dimm address space (DPA)
      between PMEM and BLK mode interfaces.  After DPA has been allocated from
      a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
      as a struct bio based block device. Unlike PMEM, BLK is required to
      handle platform specific details like mmio register formats and memory
      controller interleave.  For this reason the libnvdimm generic nd_blk
      driver calls back into the bus provider to carry out the I/O.
      
      This initial implementation handles the BLK interface defined by the
      ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from
      DCR (dimm control region), BDW (block data window), IDT (interleave
      descriptor) NFIT structures and the hardware register format.
      [1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
      [2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • nd_btt: atomic sector updates · 5212e11f
      Vishal Verma authored
      BTT stands for Block Translation Table, and is a way to provide power
      fail sector atomicity semantics for block devices that have the ability
      to perform byte granularity IO. It relies on the capability of libnvdimm
      namespace devices to do byte aligned IO.
      
      The BTT works as a stacked block device, and reserves a chunk of space
      from the backing device for its accounting metadata. It is a bio-based
      driver because all IO is done synchronously, and there is no queuing or
      asynchronous completions at either the device or the driver level.
      
      The BTT uses 'lanes' to index into various 'on-disk' data structures,
      and lanes also act as a synchronization mechanism in case there are more
      CPUs than available lanes. We did a comparison between two lane lock
      strategies - first where we kept an atomic counter around that tracked
      which was the last lane that was used, and 'our' lane was determined by
      atomically incrementing that. That way, for the nr_cpus > nr_lanes case,
      theoretically, no CPU would be blocked waiting for a lane. The other
      strategy was to take the CPU number we're scheduled on and hash it to
      a lane number. Theoretically, this could block an IO that could've
      otherwise run using a different, free lane. But some fio workloads
      showed that the direct cpu -> lane hash performed faster than tracking
      'last lane' - my reasoning is the cache thrash caused by moving the
      atomic variable made that approach slower than simply waiting out the
      in-progress IO. This supports the conclusion that the driver can be a
      very simple bio-based one that does synchronous IOs instead of queuing.
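
      A simplified sketch of the winning cpu -> lane strategy (field names
      are illustrative; the in-tree version also keeps a per-cpu nesting
      count):

        unsigned int nd_region_acquire_lane(struct nd_region *nd_region)
        {
                unsigned int cpu = get_cpu();
                unsigned int lane = cpu;

                if (nd_region->num_lanes < nr_cpu_ids) {
                        lane = cpu % nd_region->num_lanes;
                        /* more CPUs than lanes: wait out the current owner */
                        spin_lock(&nd_region->lane[lane].lock);
                }
                return lane;
        }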
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      [jmoyer: fix nmi watchdog timeout in btt_map_init]
      [jmoyer: move btt initialization to module load path]
      [jmoyer: fix memory leak in the btt initialization path]
      [jmoyer: Don't overwrite corrupted arenas]
      Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  2. 25 Jun, 2015 16 commits
    • libnvdimm: infrastructure for btt devices · 8c2f7e86
      Dan Williams authored
      NVDIMM namespaces, in addition to accepting "struct bio" based requests,
      also have the capability to perform byte-aligned accesses.  By default
      only the bio/block interface is used.  However, if another driver can
      make effective use of the byte-aligned capability it can claim the
      namespace and use the byte-aligned ->rw_bytes() interface.
      
      The BTT driver is the first consumer of this mechanism to allow
      adding atomic sector update semantics to a pmem or blk namespace.  This
      patch is the sysfs infrastructure to allow configuring a BTT instance
      for a namespace.  Enabling that BTT and performing i/o is in a
      subsequent patch.
      
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Neil Brown <neilb@suse.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm: write blk label set · 0ba1c634
      Dan Williams authored
      After 'uuid', 'size', 'sector_size', and optionally 'alt_name' have been
      set to valid values the labels on the dimm can be updated.  The
      difference with the pmem case is that blk namespaces are limited to one
      dimm and can cover discontiguous ranges in dpa space.
      
      Also, after allocating label slots, it is useful for userspace to know
      how many slots are left.  Export this information in sysfs.
      
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Neil Brown <neilb@suse.de>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm: write pmem label set · f524bf27
      Dan Williams authored
      After 'uuid', 'size', and optionally 'alt_name' have been set to valid
      values the labels on the dimms can be updated.
      
      Write procedure is:
      1/ Allocate and write new labels in the "next" index
      2/ Free the old labels in the working copy
      3/ Write the bitmap and the label space on the dimm
      4/ Write the index to make the update valid
      
      Label ranges directly mirror the dpa resource values for the given
      label_id of the namespace.
      
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Neil Brown <neilb@suse.de>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm: blk labels and namespace instantiation · 1b40e09a
      Dan Williams authored
      A blk label set describes a namespace comprised of one or more
      discontiguous dpa ranges on a single dimm.  They may alias with one or
      more pmem interleave sets that include the given dimm.
      
      This is the runtime/volatile configuration infrastructure for sysfs
      manipulation of 'alt_name', 'uuid', 'size', and 'sector_size'.  A later
      patch will make these settings persistent by writing back the label(s).
      
      Unlike pmem namespaces, multiple blk namespaces can be created per
      region.  Once a blk namespace has been created a new seed device
      (unconfigured child of a parent blk region) is instantiated.  As long as
      a region has 'available_size' != 0 new child namespaces may be created.
      
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Neil Brown <neilb@suse.de>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm: pmem label sets and namespace instantiation. · bf9bccc1
      Dan Williams authored
      A complete label set is a PMEM-label per-dimm per-interleave-set where
      all the UUIDs match and the interleave set cookie matches the hosting
      interleave set.
      
      Present sysfs attributes for manipulation of a PMEM-namespace's
      'alt_name', 'uuid', and 'size' attributes.  A later patch will make
      these settings persistent by writing back the label.
      
      Note that PMEM allocations grow forwards from the start of an interleave
      set (lowest dimm-physical-address (DPA)).  BLK-namespaces that alias
      with a PMEM interleave set will grow allocations backward from the
      highest DPA.
      
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Neil Brown <neilb@suse.de>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm: namespace indices: read and validate · 4a826c83
      Dan Williams authored
      This on-media label format [1] consists of two index blocks followed by
      an array of labels.  None of these structures are ever updated in place.
      A sequence number tracks the current active index and the next one to
      write, while labels are written to free slots.
      
          +------------+
          |            |
          |  nsindex0  |
          |            |
          +------------+
          |            |
          |  nsindex1  |
          |            |
          +------------+
          |   label0   |
          +------------+
          |   label1   |
          +------------+
          |            |
           ....nslot...
          |            |
          +------------+
          |   labelN   |
          +------------+
      
      After reading valid labels, store the dpa ranges they claim into
      per-dimm resource trees.
      
      [1]: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
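
      The index block layout maps to a structure along these lines (field
      list abridged from the spec [1]):

        struct nd_namespace_index {
                u8 sig[16];       /* "NAMESPACE_INDEX\0" */
                __le32 flags;
                __le32 seq;       /* sequence number for this index */
                __le64 myoff;     /* offset of this index block */
                __le64 mysize;
                __le64 otheroff;  /* offset of the paired index block */
                __le64 labeloff;  /* start of the label array */
                __le32 nslot;     /* number of label slots */
                __le16 major;
                __le16 minor;
                __le64 checksum;  /* fletcher64 over the index block */
                u8 free[0];       /* free-slot bitmap */
        };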
      
      Cc: Neil Brown <neilb@suse.de>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, nfit: add interleave-set state-tracking infrastructure · eaf96153
      Dan Williams authored
      On platforms that have firmware support for reading/writing per-dimm
      label space, a portion of the dimm may be accessible via an interleave
      set PMEM mapping in addition to the dimm's BLK (block-data-window
      aperture(s)) interface.  A label, stored in a "configuration data
      region" on the dimm, disambiguates which dimm addresses are accessed
      through which exclusive interface.
      
      Add infrastructure that allows the kernel to block modifications to a
      label in the set while any member dimm is active.  Note that this is
      meant only for enforcing "no modifications of active labels" via the
      coarse ioctl command.  Adding/deleting namespaces from an active
      interleave set is always possible via sysfs.
      
      Another aspect of tracking interleave sets is tracking their integrity
      when DIMMs in a set are physically re-ordered.  For this purpose we
      generate an "interleave-set cookie" that can be recorded in a label and
      validated against the current configuration.  It is the bus provider
      implementation's responsibility to calculate the interleave set cookie
      and attach it to a given region.
      
      Cc: Neil Brown <neilb@suse.de>
      Cc: <linux-acpi@vger.kernel.org>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Robert Moore <robert.moore@intel.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, pmem: add libnvdimm support to the pmem driver · 9f53f9fa
      Dan Williams authored
      nd_pmem attaches to persistent memory regions and namespaces emitted by
      the libnvdimm subsystem, and, same as the original pmem driver, presents
      the system-physical-address range as a block device.
      
      The existing e820-type-12 to pmem setup is converted to an nvdimm_bus
      that emits an nd_namespace_io device.
      
      Note that the X in 'pmemX' is now derived from the parent region.  This
      provides some stability to pmem device names from boot to boot.  The
      The minor numbers are also more predictable by passing 0 to
      alloc_disk().
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, pmem: move pmem to drivers/nvdimm/ · 18da2c9e
      Dan Williams authored
      Prepare the pmem driver to consume PMEM namespaces emitted by regions of
      an nvdimm_bus instance.  No functional change.
      Acked-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm: support for legacy (non-aliasing) nvdimms · 3d88002e
      Dan Williams authored
      The libnvdimm region driver is an intermediary driver that translates
      non-volatile "region"s into "namespace" sub-devices that are surfaced by
      persistent memory block-device drivers (PMEM and BLK).
      
      ACPI 6 introduces the concept that a given nvdimm may simultaneously
      offer multiple access modes to its media through direct PMEM load/store
      access, or windowed BLK mode.  Existing nvdimms mostly implement a PMEM
      interface; some offer a BLK-like mode, but never both as ACPI 6 defines.
      If an nvdimm is single interfaced, then there is no need for dimm
      metadata labels.  For these devices we can take the region boundaries
      directly to create a child namespace device (nd_namespace_io).
      Acked-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, nfit: regions (block-data-window, persistent memory, volatile memory) · 1f7df6f8
      Dan Williams authored
      A "region" device represents the maximum capacity of a BLK range (mmio
      block-data-window(s)), or a PMEM range (DAX-capable persistent memory or
      volatile memory), without regard for aliasing.  Aliasing, in the
      dimm-local address space (DPA), is resolved by metadata on a dimm to
      designate which exclusive interface will access the aliased DPA ranges.
      Support for the per-dimm metadata/label arrives in a subsequent patch.
      
      The name format of "region" devices is "regionN" where, like dimms, N is
      a global ida index assigned at discovery time.  This id is not reliable
      across reboots nor in the presence of hotplug.  Look to attributes of
      the region or static id-data of the sub-namespace to generate a
      persistent name.  However, if the platform configuration does not change
      it is reasonable to expect the same region id to be assigned at the next
      boot.
      
      "region"s have 2 generic attributes "size", and "mapping"s where:
      - size: the BLK accessible capacity or the span of the
        system physical address range in the case of PMEM.
      
      - mappingN: a tuple describing a dimm's contribution to the region's
        capacity in the format (<nmemX>,<dpa>,<size>).  For a PMEM-region
        there will be at least one mapping per dimm in the interleave set.  For
        a BLK-region there is only "mapping0" listing the starting DPA of the
        BLK-region and the available DPA capacity of that space (matches "size"
        above).
      
      The maximum number of mappings per "region" is hard-coded per the
      constraints of sysfs attribute groups.  That said, the number of
      mappings per region should never exceed the maximum number of possible
      dimms in the system.  If the current number turns out not to be enough,
      the "mappings" attribute clarifies how many there are supposed to
      be.  "32 should be enough for anybody...".
      
      Cc: Neil Brown <neilb@suse.de>
      Cc: <linux-acpi@vger.kernel.org>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Robert Moore <robert.moore@intel.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Tested-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver infrastructure · 4d88a97a
      Dan Williams authored
      * Implement the device-model infrastructure for loading modules and
        attaching drivers to nvdimm devices.  This is a simple association of a
        nd-device-type number with a driver that has a bitmask of supported
        device types.  To facilitate userspace bind/unbind operations,
        'modalias' and 'devtype', which also appear in the uevent, are added
        as generic sysfs attributes for all nvdimm devices.  The reason for
        the device-type
        number is to support sub-types within a given parent devtype, be it a
        vendor-specific sub-type or otherwise.
      
      * The first consumer of this infrastructure is the driver
        for dimm devices.  It simply uses control messages to retrieve and
        store the configuration-data image (label set) from each dimm.
      
      Note: nd_device_register() arranges for asynchronous registration of
            nvdimm bus devices by default.
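
      As a sketch of the association, a namespace driver declares the device
      types it serves along these lines (modeled on the pmem driver):

        static struct nd_device_driver nd_pmem_driver = {
                .probe = nd_pmem_probe,
                .remove = nd_pmem_remove,
                .drv = {
                        .name = "pmem",
                },
                .type = ND_DRIVER_NAMESPACE_IO | ND_DRIVER_NAMESPACE_PMEM,
        };

        /* the modalias then lets udev autoload the module */
        MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_IO);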
      
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Neil Brown <neilb@suse.de>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices · 62232e45
      Dan Williams authored
      Most discovery/configuration of the nvdimm-subsystem is done via sysfs
      attributes.  However, some nvdimm_bus instances, particularly the
      ACPI.NFIT bus, define a small set of messages that can be passed to the
      platform.  For convenience we derive the initial libnvdimm-ioctl command
      formats directly from the NFIT DSM Interface Example formats.
      
          ND_CMD_SMART: media health and diagnostics
          ND_CMD_GET_CONFIG_SIZE: size of the label space
          ND_CMD_GET_CONFIG_DATA: read label space
          ND_CMD_SET_CONFIG_DATA: write label space
          ND_CMD_VENDOR: vendor-specific command passthrough
          ND_CMD_ARS_CAP: report address-range-scrubbing capabilities
          ND_CMD_ARS_START: initiate scrubbing
          ND_CMD_ARS_STATUS: report on scrubbing state
          ND_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events
      
      If a platform later defines different commands than this set it is
      straightforward to extend support to those formats.
      
      Most of the commands target a specific dimm.  However, the
      address-range-scrubbing commands target the bus.  The 'commands'
      attribute in sysfs of an nvdimm_bus, or nvdimm, enumerates the
      supported commands for that object.
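
      From userspace, a hedged example of retrieving the label-space size
      through a dimm's character device (includes elided):

        struct nd_cmd_get_config_size cmd = { 0 };

        /* fd is an open /dev/nmemX; see include/uapi/linux/ndctl.h */
        if (ioctl(fd, ND_IOCTL_GET_CONFIG_SIZE, &cmd) == 0)
                printf("label space: %u bytes\n", cmd.config_size);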
      
      Cc: <linux-acpi@vger.kernel.org>
      Cc: Robert Moore <robert.moore@intel.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, nfit: dimm/memory-devices · e6dfb2de
      Dan Williams authored
      Enable nvdimm devices to be registered on an nvdimm_bus.  The
      kernel-assigned device id for nvdimm devices is dynamic.  If userspace
      needs a more static identifier, it should consult a provider-specific
      attribute.
      In the case where NFIT is the provider, the 'nmemX/nfit/handle' or
      'nmemX/nfit/serial' attributes may be used for this purpose.
      
      Cc: Neil Brown <neilb@suse.de>
      Cc: <linux-acpi@vger.kernel.org>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Robert Moore <robert.moore@intel.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Tested-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm: control character device and nvdimm_bus sysfs attributes · 45def22c
      Dan Williams authored
      The control device for a nvdimm_bus is registered as an "nd" class
      device.  The expectation is that there will usually only be one "nd" bus
      registered under /sys/class/nd.  However, we allow for the possibility
      of multiple buses, which will be listed in discovery order as
      ndctl0...ndctlN.  This character device hosts the ioctl for passing
      control messages.  The initial command set has a 1:1 correlation with
      the commands listed in the "NFIT DSM Example" document [1], but
      this scheme is extensible to future command sets.
      
      Note, nd_ioctl() and the backing ->ndctl() implementation are defined in
      a subsequent patch.  This is simply the initial registrations and sysfs
      attributes.
      
      [1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
      
      Cc: Neil Brown <neilb@suse.de>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: <linux-acpi@vger.kernel.org>
      Cc: Robert Moore <robert.moore@intel.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Tested-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, nfit: initial libnvdimm infrastructure and NFIT support · b94d5230
      Dan Williams authored
      A struct nvdimm_bus is the anchor device for registering nvdimm
      resources and interfaces, for example, a character control device,
      nvdimm devices, and I/O region devices.  The ACPI NFIT (NVDIMM Firmware
      Interface Table) is one possible platform description for such
      non-volatile memory resources in a system.  The nfit.ko driver attaches
      to the "ACPI0012" device that indicates the presence of the NFIT and
      parses the table to register a struct nvdimm_bus instance.
      
      Cc: <linux-acpi@vger.kernel.org>
      Cc: Lv Zheng <lv.zheng@intel.com>
      Cc: Robert Moore <robert.moore@intel.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Tested-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  3. 28 May, 2015 1 commit
    • e820, efi: add ACPI 6.0 persistent memory types · ad5fb870
      Dan Williams authored
      ACPI 6.0 formalizes e820-type-7 and efi-type-14 as persistent memory.
      Mark it "reserved" and allow it to be claimed by a persistent memory
      device driver.
      
      This definition is in addition to the Linux kernel's existing type-12
      definition that was recently added in support of shipping platforms with
      NVDIMM support that predate ACPI 6.0 (which now classifies type-12 as
      OEM reserved).
      
      Note, /proc/iomem can be consulted for differentiating legacy
      "Persistent Memory (legacy)" E820_PRAM vs standard "Persistent Memory"
      E820_PMEM.
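
      For reference, the type values involved, as defined in the x86 e820
      headers:

        #define E820_PRAM  12  /* legacy, pre-ACPI 6.0 NVDIMM ranges */
        #define E820_PMEM   7  /* ACPI 6.0 persistent memory */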
      
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Acked-by: Andy Lutomirski <luto@amacapital.net>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  4. 25 May, 2015 2 commits
  5. 22 May, 2015 5 commits