1. 29 Aug, 2015 3 commits
    • libnvdimm, pmem: direct map legacy pmem by default · 004f1afb
      Dan Williams authored
      The expectation is that the legacy / non-standard pmem discovery method
      (e820 type-12) will only ever be used to describe small quantities of
      persistent memory.  Larger capacities will be described via the ACPI
      NFIT.  When "allocate struct page from pmem" support is added, this default
      policy can be overridden by assigning a legacy pmem namespace to a pfn
      device, however this would only be necessary if a platform used the
      legacy mechanism to define a very large range.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, pmem: 'struct page' for pmem · 32ab0a3f
      Dan Williams authored
      Enable the pmem driver to handle PFN device instances.  Attaching a pmem
      namespace to a pfn device triggers the driver to allocate and initialize
      struct page entries for pmem.  Memory capacity for this allocation comes
      exclusively from RAM for now, which is suitable for low PMEM-to-RAM
      ratios.  This mechanism will be expanded later for setting an "allocate
      from PMEM" policy.
      
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, pfn: 'struct page' provider infrastructure · e1455744
      Dan Williams authored
      Implement the base infrastructure for libnvdimm PFN devices. Similar to
      BTT devices they take a namespace as a backing device and layer
      functionality on top. In this case the functionality is reserving space
      for an array of 'struct page' entries to be handed out through
      pfn_to_page(). For now this is just the basic libnvdimm-device-model for
      configuring the base PFN device.
      
      As the namespace claiming mechanism for PFN devices is mostly identical
      to BTT devices, drivers/nvdimm/claim.c is created to house the common
      bits.
      
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  2. 27 Aug, 2015 8 commits
    • x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB · 96601adb
      Dan Williams authored
      Given that a write-back (WB) mapping plus non-temporal stores is
      expected to be the most efficient way to access PMEM, update the
      definition of ARCH_HAS_PMEM_API to imply arch support for
      WB-mapped-PMEM.  This is needed as a pre-requisite for adding PMEM to
      the direct map and mapping it with struct page.
      
      The above clarification for X86_64 means that memcpy_to_pmem() is
      permitted to use the non-temporal arch_memcpy_to_pmem() rather than
      needlessly fall back to default_memcpy_to_pmem() when the pcommit
      instruction is not available.  When arch_memcpy_to_pmem() is not
      guaranteed to flush writes out of cache, i.e. on older X86_32
      implementations where non-temporal stores may just dirty cache,
      ARCH_HAS_PMEM_API is simply disabled.
      
      The default fallback for persistent memory handling remains.  Namely,
      map it with the WT (write-through) cache-type and hope for the best.
      
      arch_has_pmem_api() is updated to only indicate whether the arch
      provides the proper helpers to meet the minimum "writes are visible
      outside the cache hierarchy after memcpy_to_pmem() + wmb_pmem()".  Code
      that cares whether wmb_pmem() actually flushes writes to pmem must now
      call arch_has_wmb_pmem() directly.
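      
      The resulting calling convention can be sketched as follows; this is a
      hedged illustration (the wrapper function and its error codes are
      invented for this example), not code from the commit:
      
          /* Copy to pmem and make it durable, if the arch can promise that. */
          static int pmem_write_durable(void __pmem *dst, const void *src,
                          size_t n)
          {
                  if (!arch_has_pmem_api())
                          return -EOPNOTSUPP;     /* no WB-mapped-PMEM helpers */
      
                  memcpy_to_pmem(dst, src, n);    /* non-temporal on X86_64 */
      
                  if (!arch_has_wmb_pmem())
                          return -EIO;            /* flush not guaranteed */
                  wmb_pmem();     /* writes now visible outside the cache */
                  return 0;
          }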
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      [hch: set ARCH_HAS_PMEM_API=n on x86_32]
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      [toshi: x86_32 compile fixes]
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • add devm_memremap_pages · 41e94a85
      Christoph Hellwig authored
      This behaves like devm_memremap except that it ensures we have page
      structures available that can back the region.
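      
      A minimal usage sketch, assuming a driver that owns a persistent-memory
      resource (the function and variable names here are illustrative):
      
          static void *pmem_map_with_pages(struct device *dev,
                          struct resource *res)
          {
                  void *addr = devm_memremap_pages(dev, res);
      
                  if (IS_ERR(addr))       /* e.g. the range is ordinary RAM */
                          return addr;
                  /* pfn_to_page() now works for pfns inside 'res' */
                  return addr;
          }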
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      [djbw: catch attempts to remap RAM, drop flags]
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • mm: ZONE_DEVICE for "device memory" · 033fbae9
      Dan Williams authored
      While pmem is usable as a block device or via DAX mappings to userspace,
      there are several usage scenarios that cannot target pmem due to its
      lack of struct page coverage. In preparation for "hot plugging" pmem
      into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
      separately from the ones that are subject to standard page allocations.
      Importantly, "device memory" can be removed at will when userspace
      unbinds the device's driver.
      
      Having a separate zone prevents allocation and otherwise marks these
      pages as distinct from typical uniform memory.  Device memory has
      different lifetime and performance characteristics than RAM.  However,
      since we have run out of ZONES_SHIFT bits this functionality currently
      depends on sacrificing ZONE_DMA.
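      
      A hedged sketch of what the new zone lets code express; the helper name
      is invented for illustration, only ZONE_DEVICE itself comes from this
      commit:
      
          /* True if 'page' describes device memory rather than ordinary RAM. */
          static inline bool page_is_device_memory(struct page *page)
          {
                  return page_zonenum(page) == ZONE_DEVICE;
          }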
      
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Jerome Glisse <j.glisse@gmail.com>
      [hch: various simplifications in the arch interface]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • mm: move __phys_to_pfn and __pfn_to_phys to asm-generic/memory_model.h · 012dcef3
      Christoph Hellwig authored
      Three architectures already define these, and we'll need them
      generically soon.
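      
      For reference, the helpers being consolidated are plain PAGE_SHIFT
      conversions, roughly as below (exact casts may differ per architecture):
      
          #define __phys_to_pfn(paddr)    ((unsigned long)((paddr) >> PAGE_SHIFT))
          #define __pfn_to_phys(pfn)      ((phys_addr_t)(pfn) << PAGE_SHIFT)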
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • dax: drop size parameter to ->direct_access() · cb389b9c
      Dan Williams authored
      None of the implementations currently use it.  The common
      bdev_direct_access() entry point handles all the size checks before
      calling ->direct_access().
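      
      The resulting hook shape, sketched from the description above (consult
      include/linux/blkdev.h of this era for the authoritative prototype):
      
          /* in struct block_device_operations: the 'long size' arg is gone */
          long (*direct_access)(struct block_device *bdev, sector_t sector,
                          void **kaddr, unsigned long *pfn);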
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • 4a9bf88a
    • nd_blk: change aperture mapping from WC to WB · 67a3e8fe
      Ross Zwisler authored
      This should result in a pretty sizeable performance gain for reads.  For
      rough comparison I did some simple read testing using PMEM to compare
      reads of write combining (WC) mappings vs write-back (WB).  This was
      done on a random lab machine.
      
      PMEM reads from a write combining mapping:
      	# dd of=/dev/null if=/dev/pmem0 bs=4096 count=100000
      	100000+0 records in
      	100000+0 records out
      	409600000 bytes (410 MB) copied, 9.2855 s, 44.1 MB/s
      
      PMEM reads from a write-back mapping:
      	# dd of=/dev/null if=/dev/pmem0 bs=4096 count=1000000
      	1000000+0 records in
      	1000000+0 records out
      	4096000000 bytes (4.1 GB) copied, 3.44034 s, 1.2 GB/s
      
      To be able to safely support a write-back aperture I needed to add
      support for the "read flush" _DSM flag, as outlined in the DSM spec:
      
      http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
      
      This flag tells the ND BLK driver that it needs to flush the cache lines
      associated with the aperture after the aperture is moved but before any
      new data is read.  This ensures that any stale cache lines from the
      previous contents of the aperture will be discarded from the processor
      cache, and the new data will be read properly from the DIMM.  We know
      that the cache lines are clean and will be discarded without any
      writeback because either a) the previous aperture operation was a read,
      and we never modified the contents of the aperture, or b) the previous
      aperture operation was a write and we must have written back the dirtied
      contents of the aperture to the DIMM before the I/O was completed.
      
      In order to add support for the "read flush" flag I needed to add a
      generic routine to invalidate cache lines, mmio_flush_range().  This is
      protected by the ARCH_HAS_MMIO_FLUSH Kconfig variable, and is currently
      only supported on x86.
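      
      A hedged sketch of the read sequence this enables; the wrapper is
      illustrative, only mmio_flush_range() is the routine added here:
      
          static void nd_blk_read_sketch(void *aperture, void *dst, size_t len)
          {
                  /* the aperture was just moved: discard stale cache lines */
                  mmio_flush_range(aperture, len);
                  memcpy(dst, aperture, len);     /* fetch fresh data from the DIMM */
          }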
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  3. 20 Aug, 2015 6 commits
  4. 19 Aug, 2015 1 commit
    • libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option · 7a67832c
      Dan Williams authored
      We currently register a platform device for e820 type-12 memory and
      register a nvdimm bus beneath it.  Registering the platform device
      triggers the device-core machinery to probe for a driver, but that
      search currently comes up empty.  Building the nvdimm-bus registration
      into the e820_pmem platform device registration in this way forces
      libnvdimm to be built-in.  Instead, convert the built-in portion of
      CONFIG_X86_PMEM_LEGACY to simply register a platform device and move the
      rest of the logic to the driver for e820_pmem, for the following
      reasons:
      
      1/ Letting e820_pmem support be a module allows building and testing
         libnvdimm.ko changes without rebooting
      
      2/ All the normal policy around modules can be applied to e820_pmem
         (unbind to disable and/or blacklisting the module from loading by
         default)
      
      3/ Moving the driver to a generic location and converting it to scan
         "iomem_resource" rather than "e820.map" means any other architecture can
         take advantage of this simple nvdimm resource discovery mechanism by
         registering a resource named "Persistent Memory (legacy)"
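      
      A hedged sketch of what point 3/ enables on another architecture; the
      address range is a placeholder:
      
          static struct resource legacy_pmem_res = {
                  .name  = "Persistent Memory (legacy)",
                  .start = 0x100000000ULL,        /* hypothetical base */
                  .end   = 0x17fffffffULL,        /* hypothetical end */
                  .flags = IORESOURCE_MEM,
          };
      
          static int __init arch_register_legacy_pmem(void)
          {
                  return insert_resource(&iomem_resource, &legacy_pmem_res);
          }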
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  5. 14 Aug, 2015 8 commits
  6. 11 Aug, 2015 3 commits
    • cleanup IORESOURCE_CACHEABLE vs ioremap() · 92b19ff5
      Dan Williams authored
      Quoting Arnd:
          I was thinking the opposite approach and basically removing all uses
          of IORESOURCE_CACHEABLE from the kernel. There are only a handful of
          them, and we can probably replace them all with hardcoded
          ioremap_cached() calls in the cases they are actually useful.
      
      All existing usages of IORESOURCE_CACHEABLE call ioremap() instead of
      ioremap_nocache() if the resource is cacheable, however ioremap() is
      uncached by default. Clearly none of the existing usages care about the
      cacheability. In particular, devm_ioremap_resource() never worked as
      advertised, since it always fell back to plain ioremap().
      
      Clean this up as the new direction we want is to convert
      ioremap_<type>() usages to memremap(..., flags).
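      
      A hedged sketch of that conversion direction (error handling abbreviated;
      'res' stands for whatever resource the driver was handed):
      
          /* before: addr = ioremap(res->start, resource_size(res)); */
          void *addr = memremap(res->start, resource_size(res), MEMREMAP_WB);
          if (!addr)
                  return -ENOMEM;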
      Suggested-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • arch, drivers: don't include <asm/io.h> directly, use <linux/io.h> instead · 2584cf83
      Dan Williams authored
      Preparation for uniform definition of ioremap, ioremap_wc, ioremap_wt,
      and ioremap_cache, tree-wide.
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • mm: enhance region_is_ram() to region_intersects() · 124fe20d
      Dan Williams authored
      region_is_ram() is used to prevent the establishment of aliased mappings
      to physical "System RAM" with incompatible cache settings.  However, it
      uses "-1" to indicate both "unknown" memory ranges (ranges not described
      by platform firmware) and "mixed" ranges (where the parameters describe
      a range that partially overlaps "System RAM").
      
      Fix this up by explicitly tracking the "unknown" vs "mixed" resource
      cases and returning REGION_INTERSECTS, REGION_MIXED, or REGION_DISJOINT.
      This rewrite also adds support for detecting when the requested region
      completely eclipses all of a resource.  Note, the implementation treats
      overlaps between "unknown" and the requested memory type as
      REGION_INTERSECTS.
      
      Finally, other memory types can be passed in by name; for now the only
      usage is "System RAM".
      Suggested-by: Luis R. Rodriguez <mcgrof@suse.com>
      Reviewed-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  7. 31 Jul, 2015 1 commit
  8. 28 Jul, 2015 5 commits
  9. 26 Jul, 2015 5 commits
    • Linux 4.2-rc4 · cbfe8fa6
      Linus Torvalds authored
    • Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 2579d019
      Linus Torvalds authored
      Pull perf fix from Thomas Gleixner:
       "A single fix for the intel cqm perf facility to prevent IPIs from
        interrupt context"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/intel/cqm: Return cached counter value from IRQ context
    • Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 28003486
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
       "This update contains:
      
         - the manual revert of the SYSCALL32 changes which caused a
           regression
      
         - a fix for the MPX vma handling
      
         - three fixes for the ioremap 'is ram' checks
      
         - PAT warning fixes
      
         - a trivial fix for the size calculation of TLB tracepoints
      
         - handle old EFI structures gracefully
      
        This also contains a PAT fix from Jan plus a revert thereof.  Toshi
        explained why the code is correct"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/mm/pat: Revert 'Adjust default caching mode translation tables'
        x86/asm/entry/32: Revert 'Do not use R9 in SYSCALL32' commit
        x86/mm: Fix newly introduced printk format warnings
        mm: Fix bugs in region_is_ram()
        x86/mm: Remove region_is_ram() call from ioremap
        x86/mm: Move warning from __ioremap_check_ram() to the call site
        x86/mm/pat, drivers/media/ivtv: Move the PAT warning and replace WARN() with pr_warn()
        x86/mm/pat, drivers/infiniband/ipath: Replace WARN() with pr_warn()
        x86/mm/pat: Adjust default caching mode translation tables
        x86/fpu: Disable dependent CPU features on "noxsave"
        x86/mpx: Do not set ->vm_ops on MPX VMAs
        x86/mm: Add parenthesis for TLB tracepoint size calculation
        efi: Handle memory error structures produced based on old versions of standard
    • x86/mm/pat: Revert 'Adjust default caching mode translation tables' · 1a4e8795
      Thomas Gleixner authored
      Toshi explains:
      
      "No, the default values need to be set to the fallback types,
       i.e. minimal supported mode.  For WC and WT, UC is the fallback type.
      
       When PAT is disabled, pat_init() does update the tables below to
       enable WT per the default BIOS setup.  However, when PAT is enabled
       but the CPU has PAT errata, WT falls back to UC per the default values."
      
      Revert: ca1fec58 'x86/mm/pat: Adjust default caching mode translation tables'
      Requested-by: Toshi Kani <toshi.kani@hp.com>
      Cc: Jan Beulich <jbeulich@suse.de>
      Link: http://lkml.kernel.org/r/1437577776.3214.252.camel@hp.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • perf/x86/intel/cqm: Return cached counter value from IRQ context · 2c534c0d
      Matt Fleming authored
      Peter reported the following potential crash, which I was able to
      reproduce with his test program:
      
      [  148.765788] ------------[ cut here ]------------
      [  148.765796] WARNING: CPU: 34 PID: 2840 at kernel/smp.c:417 smp_call_function_many+0xb6/0x260()
      [  148.765797] Modules linked in:
      [  148.765800] CPU: 34 PID: 2840 Comm: perf Not tainted 4.2.0-rc1+ #4
      [  148.765803]  ffffffff81cdc398 ffff88085f105950 ffffffff818bdfd5 0000000000000007
      [  148.765805]  0000000000000000 ffff88085f105990 ffffffff810e413a 0000000000000000
      [  148.765807]  ffffffff82301080 0000000000000022 ffffffff8107f640 ffffffff8107f640
      [  148.765809] Call Trace:
      [  148.765810]  <NMI>  [<ffffffff818bdfd5>] dump_stack+0x45/0x57
      [  148.765818]  [<ffffffff810e413a>] warn_slowpath_common+0x8a/0xc0
      [  148.765822]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
      [  148.765824]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
      [  148.765825]  [<ffffffff810e422a>] warn_slowpath_null+0x1a/0x20
      [  148.765827]  [<ffffffff811613f6>] smp_call_function_many+0xb6/0x260
      [  148.765829]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
      [  148.765831]  [<ffffffff81161748>] on_each_cpu_mask+0x28/0x60
      [  148.765832]  [<ffffffff8107f6ef>] intel_cqm_event_count+0x7f/0xe0
      [  148.765836]  [<ffffffff811cdd35>] perf_output_read+0x2a5/0x400
      [  148.765839]  [<ffffffff811d2e5a>] perf_output_sample+0x31a/0x590
      [  148.765840]  [<ffffffff811d333d>] ? perf_prepare_sample+0x26d/0x380
      [  148.765841]  [<ffffffff811d3497>] perf_event_output+0x47/0x60
      [  148.765843]  [<ffffffff811d36c5>] __perf_event_overflow+0x215/0x240
      [  148.765844]  [<ffffffff811d4124>] perf_event_overflow+0x14/0x20
      [  148.765847]  [<ffffffff8107e7f4>] intel_pmu_handle_irq+0x1d4/0x440
      [  148.765849]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
      [  148.765853]  [<ffffffff81219bad>] ? vunmap_page_range+0x19d/0x2f0
      [  148.765854]  [<ffffffff81219d11>] ? unmap_kernel_range_noflush+0x11/0x20
      [  148.765859]  [<ffffffff814ce6fe>] ? ghes_copy_tofrom_phys+0x11e/0x2a0
      [  148.765863]  [<ffffffff8109e5db>] ? native_apic_msr_write+0x2b/0x30
      [  148.765865]  [<ffffffff8109e44d>] ? x2apic_send_IPI_self+0x1d/0x20
      [  148.765869]  [<ffffffff81065135>] ? arch_irq_work_raise+0x35/0x40
      [  148.765872]  [<ffffffff811c8d86>] ? irq_work_queue+0x66/0x80
      [  148.765875]  [<ffffffff81075306>] perf_event_nmi_handler+0x26/0x40
      [  148.765877]  [<ffffffff81063ed9>] nmi_handle+0x79/0x100
      [  148.765879]  [<ffffffff81064422>] default_do_nmi+0x42/0x100
      [  148.765880]  [<ffffffff81064563>] do_nmi+0x83/0xb0
      [  148.765884]  [<ffffffff818c7c0f>] end_repeat_nmi+0x1e/0x2e
      [  148.765886]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
      [  148.765888]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
      [  148.765890]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
      [  148.765891]  <<EOE>>  [<ffffffff8110ab66>] finish_task_switch+0x156/0x210
      [  148.765898]  [<ffffffff818c1671>] __schedule+0x341/0x920
      [  148.765899]  [<ffffffff818c1c87>] schedule+0x37/0x80
      [  148.765903]  [<ffffffff810ae1af>] ? do_page_fault+0x2f/0x80
      [  148.765905]  [<ffffffff818c1f4a>] schedule_user+0x1a/0x50
      [  148.765907]  [<ffffffff818c666c>] retint_careful+0x14/0x32
      [  148.765908] ---[ end trace e33ff2be78e14901 ]---
      
      The CQM task events are not safe to be called from within interrupt
      context because they require performing an IPI to read the counter value
      on all sockets. And performing IPIs from within IRQ context is a
      "no-no".
      
      Make do with the last-read counter value currently cached in
      event->count when we're invoked in this context.
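      
      A hedged sketch of the shape of the fix, simplified from
      intel_cqm_event_count(); cqm_read_rmid_all_sockets() is a hypothetical
      stand-in for the IPI-based slow path:
      
          static u64 cqm_event_count_sketch(struct perf_event *event)
          {
                  u64 val;
      
                  /* IPIs are forbidden in IRQ/NMI context: report the last
                   * value cached in event->count instead of a fresh read */
                  if (unlikely(in_interrupt()))
                          return local64_read(&event->count);
      
                  val = cqm_read_rmid_all_sockets(event);  /* hypothetical */
                  local64_set(&event->count, val);
                  return val;
          }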
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vikas Shivappa <vikas.shivappa@intel.com>
      Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
      Cc: Will Auld <will.auld@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1437490509-15373-1-git-send-email-matt@codeblueprint.co.uk
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>