1. 10 Sep, 2015 5 commits
    • memcg: add page_cgroup_ino helper · 2fc04524
      Vladimir Davydov authored
      This patchset introduces a new user API for tracking user memory pages
      that have not been used for a given period of time.  The purpose of this
      is to provide the userspace with the means of tracking a workload's
      working set, i.e.  the set of pages that are actively used by the
      workload.  Knowing the working set size can be useful for partitioning the
      system more efficiently, e.g.  by tuning memory cgroup limits
      appropriately, or for job placement within a compute cluster.
      
      ==== USE CASES ====
      
      The unified cgroup hierarchy has memory.low and memory.high knobs, which
      are defined as the low and high boundaries for the workload working set
      size.  However, the working set size of a workload may be unknown or
      change over time.  With this patch set, one can periodically estimate the
      amount of memory unused by each cgroup and tune their memory.low and
      memory.high parameters accordingly, therefore optimizing the overall
      memory utilization.
      
      Another use case is balancing workloads within a compute cluster.  Knowing
      how much memory is not really used by a workload unit may help make a
      better decision when considering whether to migrate the unit to another
      node within the cluster.
      
      Also, as noted by Minchan, this would be useful for per-process reclaim
      (https://lwn.net/Articles/545668/). With idle tracking, a smart userspace
      memory manager could reclaim only idle pages.
      
      ==== USER API ====
      
      The user API consists of two new files:
      
       * /sys/kernel/mm/page_idle/bitmap.  This file implements a bitmap where each
         bit corresponds to a page, indexed by PFN. When the bit is set, the
         corresponding page is idle. A page is considered idle if it has not been
         accessed since it was marked idle. To mark a page idle one should set the
         bit corresponding to the page by writing to the file. A value written to the
         file is OR-ed with the current bitmap value. Only user memory pages can be
         marked idle, for other page types input is silently ignored. Writing to this
         file beyond max PFN results in the ENXIO error. Only available when
         CONFIG_IDLE_PAGE_TRACKING is set.
      
         This file can be used to estimate the number of pages that are not
         used by a particular workload as follows:
      
         1. mark all pages of interest idle by setting corresponding bits in the
            /sys/kernel/mm/page_idle/bitmap
         2. wait until the workload accesses its working set
         3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits set
      
       * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
         memory cgroup each page is charged to, indexed by PFN. Only available when
         CONFIG_MEMCG is set.
      
         This file can be used to find all pages (including unmapped file pages)
         accounted to a particular cgroup. Using /sys/kernel/mm/page_idle/bitmap, one
         can then estimate the cgroup working set size.
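      The PFN indexing described for the two files above can be sketched as
      follows. This is a minimal illustration, not part of the patch set: the
      helper names are hypothetical, and the layout of the bitmap as native-endian
      64-bit words (PFN i held in bit i % 64 of word i / 64) is an assumption
      consistent with the 8-byte accesses used by the script in this message.

```python
import struct

WORD_BITS = 64   # assumption: the bitmap is read/written in 64-bit words
WORD_BYTES = 8
ENTRY_BYTES = 8  # /proc/kpagecgroup holds one 64-bit inode number per PFN


def bitmap_location(pfn):
    """Return (byte offset of the 64-bit word, bit index inside that word)."""
    return (pfn // WORD_BITS) * WORD_BYTES, pfn % WORD_BITS


def kpagecgroup_offset(pfn):
    """Return the byte offset of the entry for this PFN in /proc/kpagecgroup."""
    return pfn * ENTRY_BYTES


def idle_pfns(word_bytes, first_pfn):
    """Decode one 8-byte bitmap word into the list of PFNs whose bit is set."""
    (word,) = struct.unpack("Q", word_bytes)
    return [first_pfn + bit for bit in range(WORD_BITS) if (word >> bit) & 1]
```

      With these helpers, a reader would seek to bitmap_location(pfn)[0] in the
      bitmap file and to kpagecgroup_offset(pfn) in /proc/kpagecgroup to inspect
      a single page.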
      
      For an example of using these files to estimate the amount of unused
      memory per memory cgroup, please see the script attached below.
      
      ==== REASONING ====
      
      The reason to introduce the new user API instead of using
      /proc/PID/{clear_refs,smaps} is that the latter has two serious
      drawbacks:
      
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      
      The new API attempts to overcome them both. For more details on how it
      is achieved, please see the comment to patch 6.
      
      ==== PATCHSET STRUCTURE ====
      
      The patch set is organized as follows:
      
       - patch 1 adds page_cgroup_ino() helper for the sake of
         /proc/kpagecgroup and patches 2-3 do related cleanup
       - patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is
         charged to
       - patch 5 introduces a new mmu notifier callback, clear_young, which is
         a lightweight version of clear_flush_young; it is used in patch 6
       - patch 6 implements the idle page tracking feature, including the
         userspace API, /sys/kernel/mm/page_idle/bitmap
       - patch 7 exports idle flag via /proc/kpageflags
      
      ==== SIMILAR WORKS ====
      
      Originally, the patch for tracking idle memory was proposed back in 2011
      by Michel Lespinasse (see http://lwn.net/Articles/459269/).  The main
      difference between Michel's patch and this one is that Michel implemented
      a kernel space daemon for estimating idle memory size per cgroup while
      this patch only provides the userspace with the minimal API for doing the
      job, leaving the rest up to the userspace.  However, they both share the
      same idea of Idle/Young page flags to avoid affecting the reclaimer logic.
      
      ==== PERFORMANCE EVALUATION ====
      
      SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the
      performance impact introduced by this patch set.  Three runs were carried
      out:
      
       - base: kernel without the patch
       - patched: patched kernel, the feature is not used
       - patched-active: patched kernel, 1 minute-period daemon is used for
         tracking idle memory
      
      For tracking idle memory, idlememstat utility was used:
      https://github.com/locker/idlememstat
      
      testcase            base            patched        patched-active
      
      compiler       537.40 ( 0.00)%   532.26 (-0.96)%   538.31 ( 0.17)%
      compress       305.47 ( 0.00)%   301.08 (-1.44)%   300.71 (-1.56)%
      crypto         284.32 ( 0.00)%   282.21 (-0.74)%   284.87 ( 0.19)%
      derby          411.05 ( 0.00)%   413.44 ( 0.58)%   412.07 ( 0.25)%
      mpegaudio      189.96 ( 0.00)%   190.87 ( 0.48)%   189.42 (-0.28)%
      scimark.large   46.85 ( 0.00)%    46.41 (-0.94)%    47.83 ( 2.09)%
      scimark.small  412.91 ( 0.00)%   415.41 ( 0.61)%   421.17 ( 2.00)%
      serial         204.23 ( 0.00)%   213.46 ( 4.52)%   203.17 (-0.52)%
      startup         36.76 ( 0.00)%    35.49 (-3.45)%    35.64 (-3.05)%
      sunflow        115.34 ( 0.00)%   115.08 (-0.23)%   117.37 ( 1.76)%
      xml            620.55 ( 0.00)%   619.95 (-0.10)%   620.39 (-0.03)%
      
      composite      211.50 ( 0.00)%   211.15 (-0.17)%   211.67 ( 0.08)%
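      The parenthesized figures in the table appear to be the percentage change
      of each score relative to the base run. A short sketch of that arithmetic,
      assuming this interpretation (the function name is hypothetical and only
      three rows are copied from the table):

```python
def percent_change(base, score):
    """Relative change of score with respect to base, in percent."""
    return (score - base) / base * 100.0


# (base, patched, patched-active) scores copied from the table above
rows = {
    "compiler": (537.40, 532.26, 538.31),
    "serial": (204.23, 213.46, 203.17),
    "composite": (211.50, 211.15, 211.67),
}

for name, (base, patched, active) in rows.items():
    print("%-10s patched %+.2f%%  patched-active %+.2f%%"
          % (name, percent_change(base, patched), percent_change(base, active)))
```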
      
      time idlememstat:
      
      17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k
      448inputs+40outputs (1major+36052minor)pagefaults 0swaps
      
      ==== SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ====
      #! /usr/bin/python
      #
      
      import os
      import stat
      import errno
      import struct
      
      CGROUP_MOUNT = "/sys/fs/cgroup/memory"
      BUFSIZE = 8 * 1024  # must be multiple of 8
      
      def get_hugepage_size():
          with open("/proc/meminfo", "r") as f:
              for s in f:
                  k, v = s.split(":")
                  if k == "Hugepagesize":
                      return int(v.split()[0]) * 1024
      
      PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
      HUGEPAGE_SIZE = get_hugepage_size()
      
      def set_idle():
          f = open("/sys/kernel/mm/page_idle/bitmap", "wb", BUFSIZE)
          while True:
              try:
                  f.write(struct.pack("Q", pow(2, 64) - 1))
              except IOError as err:
                  if err.errno == errno.ENXIO:
                      break
                  raise
          f.close()
      
      def count_idle():
          f_flags = open("/proc/kpageflags", "rb", BUFSIZE)
          f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE)
      
          with open("/sys/kernel/mm/page_idle/bitmap", "rb", BUFSIZE) as f:
              while f.read(BUFSIZE): pass  # update idle flag
      
          idlememsz = {}
          while True:
              s1, s2 = f_flags.read(8), f_cgroup.read(8)
              if not s1 or not s2:
                  break
      
              flags, = struct.unpack('Q', s1)
              cgino, = struct.unpack('Q', s2)
      
              unevictable = (flags >> 18) & 1  # KPF_UNEVICTABLE
              huge = (flags >> 22) & 1         # KPF_THP
              idle = (flags >> 25) & 1         # KPF_IDLE
      
              if idle and not unevictable:
                  idlememsz[cgino] = idlememsz.get(cgino, 0) + \
                      (HUGEPAGE_SIZE if huge else PAGE_SIZE)
      
          f_flags.close()
          f_cgroup.close()
          return idlememsz
      
      if __name__ == "__main__":
          print "Setting the idle flag for each page..."
          set_idle()
      
          raw_input("Wait until the workload accesses its working set, "
                    "then press Enter")
      
          print "Counting idle pages..."
          idlememsz = count_idle()
      
          for dir, subdirs, files in os.walk(CGROUP_MOUNT):
              ino = os.stat(dir)[stat.ST_INO]
              print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB"
      ==== END SCRIPT ====
      
      This patch (of 8):
      
      Add page_cgroup_ino() helper to memcg.
      
      This function returns the inode number of the closest online ancestor of
      the memory cgroup a page is charged to.  It is required for exporting
      information about which page is charged to which cgroup to userspace,
      which will be introduced by a following patch.
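      The "closest online ancestor" lookup described above can be modelled in a
      few lines. This is a toy userspace model under stated assumptions, not the
      kernel code: the Cgroup class, its fields, and the return value of 0 for
      an uncharged page are illustrative choices.

```python
class Cgroup(object):
    """Toy stand-in for a memory cgroup: inode number, parent, online flag."""

    def __init__(self, ino, parent=None, online=True):
        self.ino = ino
        self.parent = parent
        self.online = online


def closest_online_ino(cgroup):
    """Walk up from cgroup until an online one is found; return its inode.

    Returns 0 when the page is not charged to any cgroup or no online
    ancestor exists (an assumption made for this sketch).
    """
    while cgroup is not None and not cgroup.online:
        cgroup = cgroup.parent
    return cgroup.ino if cgroup is not None else 0
```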
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zswap: update docs for runtime-changeable attributes · 9c4c5ef3
      Dan Streetman authored
      Change the Documentation/vm/zswap.txt doc to indicate that the "zpool" and
      "compressor" params are now changeable at runtime.
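      A minimal sketch of changing these parameters from userspace, assuming
      zswap exposes them under /sys/module/zswap/parameters as module
      parameters normally are. The helper names and the params_dir argument
      (added so the code can be exercised against an ordinary directory) are
      hypothetical.

```python
import os

ZSWAP_PARAMS = "/sys/module/zswap/parameters"


def set_zswap_param(name, value, params_dir=ZSWAP_PARAMS):
    """Write a new value to the "zpool" or "compressor" parameter file."""
    if name not in ("zpool", "compressor"):
        raise ValueError("unexpected parameter: %r" % (name,))
    with open(os.path.join(params_dir, name), "w") as f:
        f.write(value + "\n")


def get_zswap_param(name, params_dir=ZSWAP_PARAMS):
    """Read back the current value of a parameter file."""
    with open(os.path.join(params_dir, name)) as f:
        return f.read().strip()
```

      On a real system the write requires root, and it is expected to fail if
      the requested zpool type or compressor is not available.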
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zswap: change zpool/compressor at runtime · 90b0fc26
      Dan Streetman authored
      Update the zpool and compressor parameters to be changeable at runtime.
      When changed, a new pool is created with the requested zpool/compressor,
      and added as the current pool at the front of the pool list.  Previous
      pools remain in the list only so that existing compressed pages can be
      removed from them.  The old pool(s) are removed once they become empty.
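      The pool-list behaviour described above can be illustrated with a small
      model. This is a sketch of the idea only, not the zswap implementation:
      the Pool and PoolList classes, their method names, and the entry counter
      are all hypothetical.

```python
class Pool(object):
    """Toy pool: a zpool/compressor pair plus a count of stored pages."""

    def __init__(self, zpool, compressor):
        self.zpool = zpool
        self.compressor = compressor
        self.entries = 0


class PoolList(object):
    """Current pool lives at the front; older pools linger until empty."""

    def __init__(self, zpool, compressor):
        self.pools = [Pool(zpool, compressor)]

    @property
    def current(self):
        return self.pools[0]

    def change(self, zpool, compressor):
        """Create a new pool and make it current by inserting it at the front."""
        self.pools.insert(0, Pool(zpool, compressor))
        self._prune()

    def store(self):
        """New compressed pages always go to the current pool."""
        self.current.entries += 1
        return self.current

    def free(self, pool):
        """Remove one page from the pool that holds it."""
        pool.entries -= 1
        self._prune()

    def _prune(self):
        # Drop every non-current pool that has become empty.
        self.pools = [p for i, p in enumerate(self.pools)
                      if i == 0 or p.entries > 0]
```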
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Seth Jennings <sjennings@variantweb.net>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zswap: dynamic pool creation · f1c54846
      Dan Streetman authored
      Add dynamic creation of pools.  Move the static crypto compression per-cpu
      transforms into each pool.  Add a pointer to zswap_entry to the pool it's
      in.
      
      This is required by the following patch which enables changing the zswap
      zpool and compressor params at runtime.
      
      [akpm@linux-foundation.org: fix merge snafus]
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Seth Jennings <sjennings@variantweb.net>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zpool: add zpool_has_pool() · 3f0e1312
      Dan Streetman authored
      This series makes creation of the zpool and compressor dynamic, so that
      they can be changed at runtime.  This makes using/configuring zswap
      easier, since previously zswap had to be configured at boot time, using
      boot params.
      
      This uses a single list to track both the zpool and compressor together,
      although Seth had mentioned an alternative which is to track the zpools
      and compressors using separate lists.  In the most common case, only a
      single zpool and single compressor, using one list is slightly simpler
      than using two lists, and for the uncommon case of multiple zpools and/or
      compressors, using one list is slightly less simple (and uses slightly
      more memory, probably) than using two lists.
      
      This patch (of 4):
      
      Add zpool_has_pool() function, indicating if the specified type of zpool
      is available (i.e.  zsmalloc or zbud).  This allows checking if a pool is
      available, without actually trying to allocate it, similar to
      crypto_has_alg().
      
      This is used by a following patch to zswap that enables the dynamic
      runtime creation of zswap zpools.
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Seth Jennings <sjennings@variantweb.net>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 09 Sep, 2015 6 commits
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma · 26d2177e
      Linus Torvalds authored
      Pull infiniband/rdma updates from Doug Ledford:
       "This is a fairly sizeable set of changes.  I've put them through a
        decent amount of testing prior to sending the pull request due to
        that.
      
        There are still a few fixups that I know are coming, but I wanted to
        go ahead and get the big, sizable chunk into your hands sooner rather
        than waiting for those last few fixups.
      
        Of note is the fact that this creates what is intended to be a
        temporary area in the drivers/staging tree specifically for some
        cleanups and additions that are coming for the RDMA stack.  We
        deprecated two drivers (ipath and amso1100) and are waiting to hear
        back if we can deprecate another one (ehca).  We also put Intel's new
        hfi1 driver into this area because it needs to be refactored and a
        transfer library created out of the factored out code, and then it and
        the qib driver and the soft-roce driver should all be modified to use
        that library.
      
        I expect drivers/staging/rdma to be around for three or four kernel
        releases and then to go away as all of the work is completed and final
        deletions of deprecated drivers are done.
      
        Summary of changes for 4.3:
      
         - Create drivers/staging/rdma
         - Move amso1100 driver to staging/rdma and schedule for deletion
         - Move ipath driver to staging/rdma and schedule for deletion
         - Add hfi1 driver to staging/rdma and set TODO for move to regular
           tree
         - Initial support for namespaces to be used on RDMA devices
         - Add RoCE GID table handling to the RDMA core caching code
         - Infrastructure to support handling of devices with differing read
           and write scatter gather capabilities
         - Various iSER updates
         - Kill off unsafe usage of global mr registrations
         - Update SRP driver
         - Misc mlx4 driver updates
         - Support for the mr_alloc verb
         - Support for a netlink interface between kernel and user space cache
           daemon to speed path record queries and route resolution
         - Initial support for safe hot removal of verbs devices"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits)
        IB/ipoib: Suppress warning for send only join failures
        IB/ipoib: Clean up send-only multicast joins
        IB/srp: Fix possible protection fault
        IB/core: Move SM class defines from ib_mad.h to ib_smi.h
        IB/core: Remove unnecessary defines from ib_mad.h
        IB/hfi1: Add PSM2 user space header to header_install
        IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY
        mlx5: Fix incorrect wc pkey_index assignment for GSI messages
        IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow
        IB/uverbs: reject invalid or unknown opcodes
        IB/cxgb4: Fix if statement in pick_local_ip6adddrs
        IB/sa: Fix rdma netlink message flags
        IB/ucma: HW Device hot-removal support
        IB/mlx4_ib: Disassociate support
        IB/uverbs: Enable device removal when there are active user space applications
        IB/uverbs: Explicitly pass ib_dev to uverbs commands
        IB/uverbs: Fix race between ib_uverbs_open and remove_one
        IB/uverbs: Fix reference counting usage of event files
        IB/core: Make ib_dealloc_pd return void
        IB/srp: Create an insecure all physical rkey only if needed
        ...
    • Merge tag 'for-linus-4.3' of git://git.code.sf.net/p/openipmi/linux-ipmi · a794b4f3
      Linus Torvalds authored
      Pull IPMI updates from Corey Minyard:
       "Most of these have been sitting in linux-next for more than a release,
        particularly commit 0fbcf4af ("ipmi: Convert the IPMI SI ACPI
        handling to a platform device") which is probably the most complex
        patch.
      
        That is also the one that changes drivers/acpi/acpi_pnp.c.  The change
        in that file is only removing IPMI from a "special platform devices"
        list, since I convert it to the standard PNP interface.  I posted this
        one to the ACPI list twice and got no response, and it seems to work
        well in my testing, so I'm hoping it's good.
      
        Hidehiro Kawai posted a set of changes that improves the panic time
        handling in the IPMI driver.
      
        The rest of the changes are minor bug fixes or cleanups and some
        documentation"
      
      * tag 'for-linus-4.3' of git://git.code.sf.net/p/openipmi/linux-ipmi:
        ipmi:ssif: Add a module parm to specify that SMBus alerts don't work
        ipmi: add of_device_id in MODULE_DEVICE_TABLE
        ipmi: Compensate for BMCs that wont set the irq enable bit
        ipmi: Don't call receive handler in the panic context
        ipmi: Avoid touching possible corrupted lists in the panic context
        ipmi: Don't flush messages in sender() in run-to-completion mode
        ipmi: Factor out message flushing procedure
        ipmi: Remove unneeded set_run_to_completion call
        ipmi: Make some data const that was only read
        ipmi: constify SSIF ACPI device ids
        ipmi: Delete an unnecessary check before the function call "cleanup_one_si"
        char:ipmi - Change 1 to true for bool type variables during initialization.
        impi:Remove unneeded setting of module owner to THIS_MODULE in the platform structure, powernv_ipmi_driver
        ipmi: Add a comment in how messages are delivered from the lower layer
        ipmi/powernv: Fix potential invalid pointer dereference
        ipmi: Convert the IPMI SI ACPI handling to a platform device
        ipmi: Add device tree bindings information
    • Merge branch 'akpm' (patches from Andrew) · f6f7a636
      Linus Torvalds authored
      Merge second patch-bomb from Andrew Morton:
       "Almost all of the rest of MM.  There was an unusually large amount of
        MM material this time"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (141 commits)
        zpool: remove no-op module init/exit
        mm: zbud: constify the zbud_ops
        mm: zpool: constify the zpool_ops
        mm: swap: zswap: maybe_preload & refactoring
        zram: unify error reporting
        zsmalloc: remove null check from destroy_handle_cache()
        zsmalloc: do not take class lock in zs_shrinker_count()
        zsmalloc: use class->pages_per_zspage
        zsmalloc: consider ZS_ALMOST_FULL as migrate source
        zsmalloc: partial page ordering within a fullness_list
        zsmalloc: use shrinker to trigger auto-compaction
        zsmalloc: account the number of compacted pages
        zsmalloc/zram: introduce zs_pool_stats api
        zsmalloc: cosmetic compaction code adjustments
        zsmalloc: introduce zs_can_compact() function
        zsmalloc: always keep per-class stats
        zsmalloc: drop unused variable `nr_to_migrate'
        mm/memblock.c: fix comment in __next_mem_range()
        mm/page_alloc.c: fix type information of memoryless node
        memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
        ...
    • Merge branch 'parisc-4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux · 839fe915
      Linus Torvalds authored
      Pull parisc updates from Helge Deller:
       "The most important changes in this patchset are:
      
         - re-enable 64bit PCI bus addresses which were temporarily disabled
           for PA-RISC in kernel 4.2
      
         - fix the 64bit CAS operation in the LWS path which now enables us to
           enable the 64bit gcc atomic builtins even on 32bit userspace with
           64bit kernel
      
         - fix a long-standing bug which sometimes crashed kernel at bootup
           while serial interrupt wasn't registered yet"
      
      * 'parisc-4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
        parisc: Use platform_device_register_simple("rtc-generic")
        parisc: Drop CONFIG_SMP around update_cr16_clocksource()
        parisc: Use double word condition in 64bit CAS operation
        parisc: Filter out spurious interrupts in PA-RISC irq handler
        parisc: Additionally check for in_atomic() in page fault handler
        PCI,parisc: Enable 64-bit bus addresses on PA-RISC
        parisc: Define ioremap_uc and ioremap_wc
    • Merge tag 'linux-kselftest-4.3-rc1' of... · 54283aed
      Linus Torvalds authored
      Merge tag 'linux-kselftest-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull kselftest update from Shuah Khan:
       "This update adds new zram test and fixes to problems found during
        testing this new zram test.  In addition, there are a few bug fixes
        and kselftest improvement patches from Linaro developers.
      
        I will send another update later on this week to fix kselftest
        breakage due to commit 2bf9e0ab ("locking/static_keys: Provide a
        selftest") after the fix soaks in next for a couple of days"
      
      * tag 'linux-kselftest-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        selftests/zram: Makefile fix
        selftests/zram: must be run as root
        selftests: breakpoints: fix installing error on the architecture except x86
        selftests: check before install
        selftests/zram: Adding zram tests
    • Merge tag 'iommu-updates-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 9a9952bb
      Linus Torvalds authored
      Pull iommu updates from Joerg Roedel:
       "This time the IOMMU updates are mostly cleanups or fixes.  No big new
        features or drivers this time.  In particular the changes include:
      
         - Bigger cleanup of the Domain<->IOMMU data structures and the code
           that manages them in the Intel VT-d driver.  This makes the code
           easier to understand and maintain, and also easier to keep the data
           structures in sync.  It is also a preparation step to make use of
           default domains from the IOMMU core in the Intel VT-d driver.
      
         - Fixes for a couple of DMA-API misuses in ARM IOMMU drivers, namely
           in the ARM and Tegra SMMU drivers.
      
         - Fix for a potential buffer overflow in the OMAP iommu driver's
           debug code
      
         - A couple of smaller fixes and cleanups in various drivers
      
         - One small new feature: Report domain-id usage in the Intel VT-d
           driver to make it easier to detect bugs where these are leaked"
      
      * tag 'iommu-updates-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (83 commits)
        iommu/vt-d: Really use upper context table when necessary
        x86/vt-d: Fix documentation of DRHD
        iommu/fsl: Really fix init section(s) content
        iommu/io-pgtable-arm: Unmap and free table when overwriting with block
        iommu/io-pgtable-arm: Move init-fn declarations to io-pgtable.h
        iommu/msm: Use BUG_ON instead of if () BUG()
        iommu/vt-d: Access iomem correctly
        iommu/vt-d: Make two functions static
        iommu/vt-d: Use BUG_ON instead of if () BUG()
        iommu/vt-d: Return false instead of 0 in irq_remapping_cap()
        iommu/amd: Use BUG_ON instead of if () BUG()
        iommu/amd: Make a symbol static
        iommu/amd: Simplify allocation in irq_remapping_alloc()
        iommu/tegra-smmu: Parameterize number of TLB lines
        iommu/tegra-smmu: Factor out tegra_smmu_set_pde()
        iommu/tegra-smmu: Extract tegra_smmu_pte_get_use()
        iommu/tegra-smmu: Use __GFP_ZERO to allocate zeroed pages
        iommu/tegra-smmu: Remove PageReserved manipulation
        iommu/tegra-smmu: Convert to use DMA API
        iommu/tegra-smmu: smmu_flush_ptc() wants device addresses
        ...
  3. 08 Sep, 2015 29 commits
    • Merge tag 'regmap-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap · e81b594c
      Linus Torvalds authored
      Pull regmap updates from Mark Brown:
       "This has been a busy release for regmap.
      
        By far the biggest set of changes here are those from Markus Pargmann
        which implement support for block transfers in smbus devices.  This
        required quite a bit of refactoring but leaves us better able to
        handle odd restrictions that controllers may have and with better
        performance on smbus.
      
        Other new features include:
      
         - Fix interactions with lockdep for nested regmaps (eg, when a device
           using regmap is connected to a bus where the bus controller has a
           separate regmap).  Lockdep's default class identification is too
           crude to work without help.
      
         - Support for must-write bitfield operations, useful for operations
           which require writing a bit to trigger them, from Kuninori Morimoto.
      
         - Support for delaying during register patch application from Nariman
           Poushin.
      
         - Support for overriding cache state via the debugfs implementation
           from Richard Fitzgerald"
      
      * tag 'regmap-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: (25 commits)
        regmap: fix a NULL pointer dereference in __regmap_init
        regmap: Support bulk reads for devices without raw formatting
        regmap-i2c: Add smbus i2c block support
        regmap: Add raw_write/read checks for max_raw_write/read sizes
        regmap: regmap max_raw_read/write getter functions
        regmap: Introduce max_raw_read/write for regmap_bulk_read/write
        regmap: Add missing comments about struct regmap_bus
        regmap: No multi_write support if bus->write does not exist
        regmap: Split use_single_rw internally into use_single_read/write
        regmap: Fix regmap_bulk_write for bus writes
        regmap: regmap_raw_read return error on !bus->read
        regulator: core: Print at debug level on debugfs creation failure
        regmap: Fix regmap_can_raw_write check
        regmap: fix typos in regmap.c
        regmap: Fix integertypes for register address and value
        regmap: Move documentation to regmap.h
        regmap: Use different lockdep class for each regmap init call
        thermal: sti: Add parentheses around bridge->ops->regmap_init call
        mfd: vexpress: Add parentheses around bridge->ops->regmap_init call
        regmap: debugfs: Fix misuse of IS_ENABLED
        ...
    • Merge tag 'fbdev-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux · fa815580
      Linus Torvalds authored
      Pull fbdev updates from Tomi Valkeinen:
       "Minor fixes and cleanups"
      
      * tag 'fbdev-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux:
        video: fbdev: atmel_lcdfb: remove useless include
        video: fbdev: pxa168fb: Use devm_clk_get
        fbdev: ssd1307fb: fix error return code
        fbdev: fix snprintf() limit in show_bl_curve()
        video: fbdev: s3c-fb: Constify platform_device_id
        video: fbdev: atmel: fix warning for const return value
        video: fbdev: Drop owner assignment from platform_driver
        video: fbdev: Drop owner assignment from i2c_driver
        fbdev: remove unnecessary memset in vfb
        framebuffer: disable vgacon on microblaze arch
        fbdev: udlfb: remove unneeded initialization in few places
        fbdev: Allow compile test of GPIO consumers if !GPIOLIB
        fbdev: fix cea_modes array size
    • Merge tag 'mmc-v4.3' of git://git.linaro.org/people/ulf.hansson/mmc · 85579ad7
      Linus Torvalds authored
      Pull MMC updates from Ulf Hansson:
       "MMC core:
         - Fix a race condition in the request handling
         - Skip trim commands for some buggy kingston eMMCs
         - An optimization and a correction for erase groups
         - Set CMD23 quirk for some Sandisk cards
      
        MMC host:
         - sdhci: Give GPIO CD higher precedence and don't poll when it's used
         - sdhci: Fix DMA memory leakage
         - sdhci: Some updates for clock management
         - sdhci-of-at91: introduce driver for the Atmel SDMMC
         - sdhci-of-arasan: Add support for sdhci-5.1
         - sdhci-esdhc-imx: Add support for imx7d which also supports HS400
         - sdhci: A collection of fixes and improvements for various sdhci hosts
         - omap_hsmmc: Modernization of the regulator code
         - dw_mmc: A couple of fixes for DMA and PIO mode
         - usdhi6rol0: A few fixes and support probe deferral for regulators
         - pxamci: Convert to use dmaengine
         - sh_mmcif: Fix the suspend process in a short term solution
         - tmio: Adjust timeout for commands
         - sunxi: Fix timeout while gating/ungating clock"
      
      * tag 'mmc-v4.3' of git://git.linaro.org/people/ulf.hansson/mmc: (67 commits)
        mmc: android-goldfish: remove incorrect __iomem annotation
        mmc: core: fix race condition in mmc_wait_data_done
        mmc: host: omap_hsmmc: remove CONFIG_REGULATOR check
        mmc: host: omap_hsmmc: use ios->vdd for setting vmmc voltage
        mmc: host: omap_hsmmc: use regulator_is_enabled to find pbias status
        mmc: host: omap_hsmmc: enable/disable vmmc_aux regulator based on previous state
        mmc: host: omap_hsmmc: don't use ->set_power to set initial regulator state
        mmc: host: omap_hsmmc: avoid pbias regulator enable on power off
        mmc: host: omap_hsmmc: add separate function to set pbias
        mmc: host: omap_hsmmc: add separate functions for enable/disable supply
        mmc: host: omap_hsmmc: return error if any of the regulator APIs fail
        mmc: host: omap_hsmmc: remove unnecessary pbias set_voltage
        mmc: host: omap_hsmmc: use mmc_host's vmmc and vqmmc
        mmc: host: omap_hsmmc: use the ocrmask provided by the vmmc regulator
        mmc: host: omap_hsmmc: cleanup omap_hsmmc_reg_get()
        mmc: host: omap_hsmmc: return on fatal errors from omap_hsmmc_reg_get
        mmc: host: omap_hsmmc: use devm_regulator_get_optional() for vmmc
        mmc: sdhci-of-at91: fix platform_no_drv_owner.cocci warnings
        mmc: sh_mmcif: Fix suspend process
        mmc: usdhi6rol0: fix error return code
        ...
      85579ad7
    • Linus Torvalds's avatar
      Merge tag 'platform-drivers-x86-v4.3-1' of... · 3af6e98f
      Linus Torvalds authored
      Merge tag 'platform-drivers-x86-v4.3-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86
      
      Pull x86 platform driver updates from Darren Hart:
       "Significant work on toshiba_acpi, including new hardware support,
        refactoring, and cleanups.  Extend device support for asus, ideapad,
        and acer systems.  New surface pro 3 buttons driver.  Misc minor
        cleanups for thinkpad and hp-wireless.
      
        acer-wmi:
         - No rfkill on HP Omen 15 wifi
      
        thinkpad_acpi:
         - Remove side effects from vdbg_printk -> no_printk macro
      
        surface pro 3:
         - Add support driver for Surface Pro 3 buttons
      
        hp-wireless:
         - remove unneeded goto/label in hpwl_init
      
        ideapad-laptop:
         - add alternative representation for Yoga 2 to DMI table
         - Add Lenovo Yoga 3 14 to no_hw_rfkill dmi list
      
        asus-laptop:
         - Add key found on Asus F3M
      
        MAINTAINERS:
         - Remove Toshiba Linux mailing list address
      
        toshiba_acpi:
         - Bump driver version to 0.23
         - Remove unnecessary checks and returns in HCI/SCI functions
         - Refactor *{get, set} functions return value
         - Remove "*not supported" feature prints
         - Change *available functions return type
         - Add set_fan_status function
         - Change some variables to avoid warnings from ninja-check
         - Reorder toshiba_acpi_alt_keymap entries
         - Remove unused wireless defines
         - Transflective backlight updates
         - Avoid registering input device on WMI event laptops
         - Add /dev/toshiba_acpi device
         - Adapt /proc/acpi/toshiba/keys to TOS1900 devices"
      
      * tag 'platform-drivers-x86-v4.3-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86: (21 commits)
        acer-wmi: No rfkill on HP Omen 15 wifi
        thinkpad_acpi: Remove side effects from vdbg_printk -> no_printk macro
        surface pro 3: Add support driver for Surface Pro 3 buttons
        hp-wireless: remove unneeded goto/label in hpwl_init
        ideapad-laptop: add alternative representation for Yoga 2 to DMI table
        asus-laptop: Add key found on Asus F3M
        MAINTAINERS: Remove Toshiba Linux mailing list address
        ideapad-laptop: Add Lenovo Yoga 3 14 to no_hw_rfkill dmi list
        toshiba_acpi: Bump driver version to 0.23
        toshiba_acpi: Remove unnecessary checks and returns in HCI/SCI functions
        toshiba_acpi: Refactor *{get, set} functions return value
        toshiba_acpi: Remove "*not supported" feature prints
        toshiba_acpi: Change *available functions return type
        toshiba_acpi: Add set_fan_status function
        toshiba_acpi: Change some variables to avoid warnings from ninja-check
        toshiba_acpi: Reorder toshiba_acpi_alt_keymap entries
        toshiba_acpi: Remove unused wireless defines
        toshiba_acpi: Transflective backlight updates
        toshiba_acpi: Avoid registering input device on WMI event laptops
        toshiba_acpi: Add /dev/toshiba_acpi device
        ...
      3af6e98f
    • Linus Torvalds's avatar
      Merge branch 'i2c/for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · acceba59
      Linus Torvalds authored
      Pull i2c updates from Wolfram Sang:
       "Features:
      
         - new drivers: Renesas EMEV2, register based MUX, NXP LPC2xxx
         - core: scans DT and assigns wakeup interrupts.  no driver changes needed.
         - core: some refcounting issues fixed and better API for that
         - core: new helper function for best effort block read emulation
         - slave framework: proper DT bindings and userspace instantiation
         - some bigger work for xiic, pxa, omap drivers
      
        .. and quite a number of smaller driver fixes, cleanups, improvements"
      
      * 'i2c/for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (65 commits)
        i2c: mux: reg Change ioread endianness for readback
        i2c: mux: reg: fix compilation warnings
        i2c: mux: reg: simplify register size checking
        i2c: muxes: fix leaked i2c adapter device node references
        i2c: allow specifying separate wakeup interrupt in device tree
        of/irq: export of_get_irq_byname()
        i2c: xgene-slimpro: dma_mapping_error() doesn't return an error code
        i2c: Replace I2C_CROS_EC_TUNNEL dependency
        eeprom: at24: use i2c_smbus_read_i2c_block_data_or_emulated
        i2c: core: Add support for best effort block read emulation
        i2c: lpc2k: add driver
        i2c: mux: Add register-based mux i2c-mux-reg
        i2c: dt: describe generic bindings
        i2c: slave: print warning if slave flag not set
        i2c: support 10 bit and slave addresses in sysfs 'new_device'
        i2c: take address space into account when checking for used addresses
        i2c: apply DT flags when probing
        i2c: make address check indpendent from client struct
        i2c: rename address check functions
        i2c: apply address offset for slaves, too
        ...
      acceba59
    • Linus Torvalds's avatar
      Merge tag 'rtc-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux · c1917615
      Linus Torvalds authored
      Pull RTC updates from Alexandre Belloni:
       "Core:
         - use is_visible() to control sysfs attributes
         - switch wakealarm attribute to DEVICE_ATTR_RW
         - make rtc_does_wakealarm() return boolean
         - properly manage lifetime of dev and cdev in rtc device
         - remove unnecessary device_get() in rtc_device_unregister
         - fix double free in rtc_register_device() error path
      
        New drivers:
         - NXP LPC24xx
         - Xilinx Zynq MP
         - Dialog DA9062
      
        Subsystem wide cleanups:
         - fix drivers that consider 0 as a valid IRQ in client->irq
         - Drop (un)likely before IS_ERR(_OR_NULL)
         - drop the remaining owner assignment for i2c_driver and
           platform_driver
         - module autoload fixes
      
        Drivers:
         - 88pm80x: add device tree support
         - abx80x: fix RTC write bit
         - ab8500: Add a sentinel to ab85xx_rtc_ids[]
         - armada38x: Align RTC set time procedure with the official errata
         - as3722: correct month value
         - at91sam9: cleanups
         - at91rm9200: get and use slow clock and cleanups
         - bq32k: remove redundant check
         - cmos: century support, proper fix for the spurious wakeup
         - ds1307: cleanups and wakeup irq support
         - ds1374: Remove unused variable
         - ds1685: Use module_platform_driver
         - ds3232: fix WARNING trace in resume function
         - gemini: fix ptr_ret.cocci warnings
         - mt6397: implement suspend/resume
         - omap: support internal and external clock enabling
         - opal: Enable alarms only when opal supports tpo
         - pcf2127: use OFS flag to detect unreliable date and warn the user
         - pl031: fix typo for author email
         - rx8025: huge cleanup and fixes
         - sa1100/pxa: share common code
         - s5m: fix to update ctrl register
         - s3c: fix clocks and wakeup, cleanup
         - sirfsoc: use regmap
         - nvram_read()/nvram_write() functions for cmos, ds1305, ds1307,
           ds1343, ds1511, ds1553, ds1742, m48t59, rp5c01, stk17ta8, tx4939
         - use rtc_valid_tm() error code when reading date/time instead of 0
           for isl12022, pcf2123, pcf2127"
      
      * tag 'rtc-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (90 commits)
        rtc: abx80x: fix RTC write bit
        rtc: ab8500: Add a sentinel to ab85xx_rtc_ids[]
        rtc: ds1374: Remove unused variable
        rtc: Fix module autoload for OF platform drivers
        rtc: Fix module autoload for rtc-{ab8500,max8997,s5m} drivers
        rtc: omap: Add external clock enabling support
        rtc: omap: Add internal clock enabling support
        ARM: dts: AM437x: Add the internal and external clock nodes for rtc
        rtc: s5m: fix to update ctrl register
        rtc: add xilinx zynqmp rtc driver
        devicetree: bindings: rtc: add bindings for xilinx zynqmp rtc
        rtc: as3722: correct month value
        ARM: config: Switch PXA27x platforms to use PXA RTC driver
        ARM: mmp: remove unused RTC register definitions
        ARM: sa1100: remove unused RTC register definitions
        rtc: sa1100/pxa: convert to run-time register mapping
        ARM: pxa: add memory resource to SA1100 RTC device
        rtc: pxa: convert to use shared sa1100 functions
        rtc: sa1100: prepare to share sa1100_rtc_ops
        rtc: ds3232: fix WARNING trace in resume function
        ...
      c1917615
    • Dan Streetman's avatar
      zpool: remove no-op module init/exit · df69f52d
      Dan Streetman authored
      Remove zpool_init() and zpool_exit(); they do nothing other than print
      "loaded" and "unloaded".
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df69f52d
    • Krzysztof Kozlowski's avatar
      mm: zbud: constify the zbud_ops · c83db4f4
      Krzysztof Kozlowski authored
      The structure zbud_ops is not modified so make the pointer to it a
      pointer to const.
      Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
      Acked-by: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c83db4f4
    • Krzysztof Kozlowski's avatar
      mm: zpool: constify the zpool_ops · 78672779
      Krzysztof Kozlowski authored
      The structure zpool_ops is not modified so make the pointer to it a
      pointer to const.
      Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
      Acked-by: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      78672779
    • Dmitry Safonov's avatar
      mm: swap: zswap: maybe_preload & refactoring · 5b999aad
      Dmitry Safonov authored
      zswap_get_swap_cache_page and read_swap_cache_async have pretty much the
      same code; the only significant differences are the return value and the
      usage of swap_readpage.
      
      Add a helper __read_swap_cache_async() with the common code.  Behavior
      change: now zswap_get_swap_cache_page will use radix_tree_maybe_preload
      instead of radix_tree_preload.  It looks like this was not changed
      earlier only because of the code duplication.
      Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b999aad
    • Sergey Senozhatsky's avatar
      zram: unify error reporting · 70864969
      Sergey Senozhatsky authored
      Make zram syslog error reporting more consistent. We have random
      error levels in some places. For example, critical errors like
        "Error allocating memory for compressed page"
      and
        "Unable to allocate temp memory"
      are reported as KERN_INFO messages.
      
      a) Reassign error levels
      
      Error messages that directly affect zram
      functionality -- pr_err():
      
       Error allocating zram address table
       Error creating memory pool
       Decompression failed! err=%d, page=%u
       Unable to allocate temp memory
       Compression failed! err=%d
       Error allocating memory for compressed page: %u, size=%zu
       Cannot initialise %s compressing backend
       Error allocating disk queue for device %d
       Error allocating disk structure for device %d
       Error creating sysfs group for device %d
       Unable to register zram-control class
       Unable to get major number
      
      Messages that do not affect functionality, but user
      must be warned (because sysfs attrs will be removed in
      this particular case) -- pr_warn():
      
       %d (%s) Attribute %s (and others) will be removed. %s
      
      Messages that do not affect functionality and mostly are
      informative -- pr_info():
      
       Cannot change max compression streams
       Can't change algorithm for initialized device
       Cannot change disksize for initialized device
       Added device: %s
       Removed device: %s
      
      b) Update sysfs_create_group() error message
      
      First, it lacks a trailing newline; add it.  Second, every error message
      in zram_add() has a "for device %d" part, which makes errors more
      informative.  Add the missing part to the "Error creating sysfs group"
      message.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      70864969
    • Sergey Senozhatsky's avatar
      zsmalloc: remove null check from destroy_handle_cache() · cd10add0
      Sergey Senozhatsky authored
      We can pass a NULL cache pointer to kmem_cache_destroy(), because it
      NULL-checks its argument now.  Remove redundant test from
      destroy_handle_cache().
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd10add0
    • Sergey Senozhatsky's avatar
      zsmalloc: do not take class lock in zs_shrinker_count() · b3e237f1
      Sergey Senozhatsky authored
      We can avoid taking class ->lock around zs_can_compact() in
      zs_shrinker_count(), because the number that we return back is outdated
      in general case, by design.  We have different sources that are able to
      change class's state right after we return from zs_can_compact() --
      ongoing I/O operations, manually triggered compaction, or two of them
      happening simultaneously.
      
      We redo these calculations during compaction on a per-class basis
      anyway.
      
      zs_unregister_shrinker() will not return while we have an active
      shrinker, so classes won't unexpectedly disappear while
      zs_shrinker_count() iterates over them.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3e237f1
    • Minchan Kim's avatar
      zsmalloc: use class->pages_per_zspage · 6cbf16b3
      Minchan Kim authored
      There is no need to recalculate pages_per_zspage at runtime.  Just use
      class->pages_per_zspage to avoid unnecessary runtime overhead.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6cbf16b3
    • Minchan Kim's avatar
      zsmalloc: consider ZS_ALMOST_FULL as migrate source · ad9d5e17
      Minchan Kim authored
      There is no reason to prevent selecting ZS_ALMOST_FULL as a migration
      source if we cannot find a source in ZS_ALMOST_EMPTY.
      
      With this patch, zs_can_compact will return a more exact result.
      Signed-off-by: Minchan Kim <minchan.kim@lge.com>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad9d5e17
    • Sergey Senozhatsky's avatar
      zsmalloc: partial page ordering within a fullness_list · 58f17117
      Sergey Senozhatsky authored
      We want to see more ZS_FULL pages and fewer ZS_ALMOST_{FULL, EMPTY}
      pages.  Put a page with a higher ->inuse count first within its
      ->fullness_list, which will give us better chances to fill up this page
      with new objects (find_get_zspage() returns the ->fullness_list head for
      new object allocation), so some zspages will become
      ZS_ALMOST_FULL/ZS_FULL quicker.
      
      It performs a trivial and cheap ->inuse compare which does not slow down
      zsmalloc and in the worst case keeps the list pages in no particular
      order.
      
      A more expensive solution could sort fullness_list by ->inuse count.
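      
      The head-only ordering described above can be sketched in plain C.
      This is a minimal illustration, not the kernel code: `struct zspage',
      `insert_zspage()' and the singly linked list are hypothetical
      stand-ins, and the spot where a page with a lower ->inuse count ends
      up (directly behind the head) is an assumption made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified zspage: only the fields the sketch needs. */
struct zspage {
	unsigned int inuse;      /* number of objects in use */
	struct zspage *next;     /* next page in the same fullness list */
};

/*
 * Insert 'page' into the list starting at 'head' and return the new head.
 * Only one cheap compare against the current head is done: a page with an
 * equal or higher inuse count becomes the head, anything else is queued
 * right behind the head (assumption), so the rest of the list stays
 * unsorted.
 */
static struct zspage *insert_zspage(struct zspage *head, struct zspage *page)
{
	assert(page != NULL);

	if (head == NULL || page->inuse >= head->inuse) {
		page->next = head;
		return page;
	}

	page->next = head->next;
	head->next = page;
	return head;
}
```

      Because allocation always takes the list head, the fullest page seen
      so far is the one that receives new objects first.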
      
      [minchan@kernel.org: code adjustments]
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58f17117
    • Sergey Senozhatsky's avatar
      zsmalloc: use shrinker to trigger auto-compaction · ab9d306d
      Sergey Senozhatsky authored
      Perform automatic pool compaction by a shrinker when the system is
      getting tight on memory.
      
      User space has very little knowledge regarding zsmalloc fragmentation
      and basically has no mechanism to tell whether compaction will result in
      any memory gain.  Another issue is that user space is not always aware
      of the fact that the system is getting tight on memory.  This leads to
      very uncomfortable scenarios in which user space may start issuing
      compaction 'randomly' or from crontab (for example).  Fragmentation is
      not necessarily bad: allocated and unused objects, after all, may be
      filled with data later, without the need to allocate a new zspage.  On
      the other hand, we obviously don't want to waste memory when the system
      needs it.
      
      Compaction now has a relatively quick pool scan so we are able to
      estimate the number of pages that will be freed easily, which makes it
      possible to call this function from a shrinker->count_objects()
      callback.  We also abort compaction as soon as we detect that we can't
      free any pages any more, preventing wasteful objects migrations.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab9d306d
    • Sergey Senozhatsky's avatar
      zsmalloc: account the number of compacted pages · 860c707d
      Sergey Senozhatsky authored
      Compaction returns back to zram the number of migrated objects, which is
      quite uninformative -- we have objects of different sizes so user space
      cannot obtain any valuable data from that number.  Change compaction to
      operate in terms of pages and return back to compaction issuer the
      number of pages that were freed during compaction.  So from now on we
      will export more meaningful value in zram<id>/mm_stat -- the number of
      freed (compacted) pages.
      
      This requires:
       (a) a rename of `num_migrated' to `pages_compacted'
       (b) an internal API change -- return first_page's fullness_group from
           putback_zspage(), so we know when putback_zspage() did
           free_zspage().  It helps us to account compaction stats correctly.
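      
      The accounting rule in (b) can be sketched as follows.  This is a
      simplified illustration under stated assumptions: the enum below is a
      reduced stand-in for the fullness groups, and `struct pool_stats' and
      `account_putback()' are hypothetical names that do not appear in the
      patch.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for the zsmalloc fullness groups (assumption). */
enum fullness_group { ZS_ALMOST_FULL, ZS_ALMOST_EMPTY, ZS_EMPTY, ZS_FULL };

/* Hypothetical per-pool counter, in pages rather than objects. */
struct pool_stats {
	unsigned long pages_compacted;
};

/*
 * 'fg' is the fullness group reported when the source zspage was put
 * back.  Only ZS_EMPTY means the zspage was actually freed, so only then
 * are its pages credited to the compaction statistics.
 */
static void account_putback(struct pool_stats *stats, enum fullness_group fg,
			    unsigned long pages_per_zspage)
{
	assert(stats != NULL);

	if (fg == ZS_EMPTY)
		stats->pages_compacted += pages_per_zspage;
}
```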
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      860c707d
    • Sergey Senozhatsky's avatar
      zsmalloc/zram: introduce zs_pool_stats api · 7d3f3938
      Sergey Senozhatsky authored
      `zs_compact_control' accounts the number of migrated objects but it has
      a limited lifespan -- we lose it as soon as zs_compaction() returns back
      to zram.  It worked fine, because (a) zram had its own counter of
      migrated objects and (b) only zram could trigger compaction.  However,
      this does not work for automatic pool compaction (not issued by zram).
      To account objects migrated during auto-compaction (issued by the
      shrinker) we need to store this number in zs_pool.
      
      Define a new `struct zs_pool_stats' structure to keep zs_pool's stats
      there.  It provides only `num_migrated', as of this writing, but it
      surely can be extended.
      
      A new zsmalloc zs_pool_stats() symbol exports zs_pool's stats back to
      caller.
      
      Use zs_pool_stats() in zram and remove `num_migrated' from zram_stats.
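      
      A user-space sketch of the idea: the pool owns the counters and the
      caller receives a snapshot.  The struct layouts are reduced to the one
      field named above, and the exact signature of zs_pool_stats() is an
      assumption, since the message does not spell it out.

```c
#include <assert.h>
#include <string.h>

/* Stats kept inside the pool; only the field named in the message. */
struct zs_pool_stats {
	unsigned long num_migrated;
};

/* Reduced stand-in for the pool; all other fields are omitted. */
struct zs_pool {
	struct zs_pool_stats stats;
};

/*
 * Copy the pool's stats out to the caller (assumed signature).  The
 * caller gets a snapshot, so counters survive a single compaction run
 * and stay valid for compaction that was not issued by the caller.
 */
static void zs_pool_stats(const struct zs_pool *pool,
			  struct zs_pool_stats *stats)
{
	assert(pool != NULL && stats != NULL);
	memcpy(stats, &pool->stats, sizeof(*stats));
}
```

      Keeping the counter in the pool means both zram-issued and
      shrinker-issued compaction update the same number.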
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d3f3938
    • Sergey Senozhatsky's avatar
      zsmalloc: cosmetic compaction code adjustments · 0dc63d48
      Sergey Senozhatsky authored
      Change zs_object_copy() argument order to be (DST, SRC) rather than
      (SRC, DST).  Copy/move functions usually have a (to, from) argument
      order.
      
      Rename alloc_target_page() to isolate_target_page().  This function
      doesn't allocate anything, it isolates target page, pretty much like
      isolate_source_page().
      
      Tweak __zs_compact() comment.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0dc63d48
    • Sergey Senozhatsky's avatar
      zsmalloc: introduce zs_can_compact() function · 04f05909
      Sergey Senozhatsky authored
      This function checks if class compaction will free any pages.
      Rephrasing -- do we have enough unused objects to form at least one
      ZS_EMPTY page and free it.  It aborts compaction if class compaction
      will not result in any (further) savings.
      
      EXAMPLE (this debug output is not part of this patch set):
      
       - class size
       - number of allocated objects
       - number of used objects
       - max objects per zspage
       - pages per zspage
       - estimated number of pages that will be freed
      
      [..]
      class-512 objs:544 inuse:540 maxobj-per-zspage:8  pages-per-zspage:1 zspages-to-free:0
       ... class-512 compaction is useless. break
      class-496 objs:660 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:2
      class-496 objs:627 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:1
      class-496 objs:594 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:0
       ... class-496 compaction is useless. break
      class-448 objs:657 inuse:617 maxobj-per-zspage:9  pages-per-zspage:1 zspages-to-free:4
      class-448 objs:648 inuse:617 maxobj-per-zspage:9  pages-per-zspage:1 zspages-to-free:3
      class-448 objs:639 inuse:617 maxobj-per-zspage:9  pages-per-zspage:1 zspages-to-free:2
      class-448 objs:630 inuse:617 maxobj-per-zspage:9  pages-per-zspage:1 zspages-to-free:1
      class-448 objs:621 inuse:617 maxobj-per-zspage:9  pages-per-zspage:1 zspages-to-free:0
       ... class-448 compaction is useless. break
      class-432 objs:728 inuse:685 maxobj-per-zspage:28 pages-per-zspage:3 zspages-to-free:1
      class-432 objs:700 inuse:685 maxobj-per-zspage:28 pages-per-zspage:3 zspages-to-free:0
       ... class-432 compaction is useless. break
      class-416 objs:819 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:2
      class-416 objs:780 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:1
      class-416 objs:741 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:0
       ... class-416 compaction is useless. break
      class-400 objs:690 inuse:674 maxobj-per-zspage:10 pages-per-zspage:1 zspages-to-free:1
      class-400 objs:680 inuse:674 maxobj-per-zspage:10 pages-per-zspage:1 zspages-to-free:0
       ... class-400 compaction is useless. break
      class-384 objs:736 inuse:709 maxobj-per-zspage:32 pages-per-zspage:3 zspages-to-free:0
       ... class-384 compaction is useless. break
      [..]
      
      Every "compaction is useless" indicates that we saved CPU cycles.
      
      class-512 has
      	544	objects allocated
      	540	objects used
      	8	objects per-page
      
      Even if we have an ALMOST_EMPTY zspage, we still don't have enough room
      to migrate all of its objects and free this zspage; so compaction does
      not make much sense, and it's better to just leave the class as is.
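      
      The numbers in the debug output are consistent with a simple integer
      division: unused objects divided by the maximum number of objects per
      zspage.  The sketch below reproduces that arithmetic in user space.
      `struct class_stats', `zspages_to_free()' and `compact_class()' are
      hypothetical names, and the loop only models the counters; it does not
      migrate anything.

```c
#include <assert.h>

/* Hypothetical per-class counters, named after the debug output fields. */
struct class_stats {
	unsigned long obj_allocated;      /* objs */
	unsigned long obj_used;           /* inuse */
	unsigned long maxobj_per_zspage;  /* maxobj-per-zspage */
	unsigned long pages_per_zspage;   /* pages-per-zspage */
};

/* How many whole zspages the unused object slots add up to. */
static unsigned long zspages_to_free(const struct class_stats *c)
{
	assert(c->maxobj_per_zspage > 0);
	assert(c->obj_used <= c->obj_allocated);
	return (c->obj_allocated - c->obj_used) / c->maxobj_per_zspage;
}

/*
 * Model of the per-class loop: each round releases one zspage worth of
 * object slots, and the loop stops as soon as the estimate drops to zero
 * ("compaction is useless. break").  Returns the number of pages freed.
 */
static unsigned long compact_class(struct class_stats *c)
{
	unsigned long pages_freed = 0;

	while (zspages_to_free(c) > 0) {
		c->obj_allocated -= c->maxobj_per_zspage;
		pages_freed += c->pages_per_zspage;
	}
	return pages_freed;
}
```

      For class-496 (660 allocated, 570 used, 33 per zspage) the estimate
      is 90 / 33 = 2, and the modelled loop walks 660, 627, 594 just like
      the trace before stopping.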
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04f05909
    • Sergey Senozhatsky's avatar
      zsmalloc: always keep per-class stats · 57244594
      Sergey Senozhatsky authored
      Always account per-class `zs_size_stat' stats.  This data will help us
      make better decisions during compaction.  We are especially interested
      in OBJ_ALLOCATED and OBJ_USED, which can tell us if class compaction
      will result in any memory gain.
      
      For instance, we know the number of allocated objects in the class, the
      number of objects being used (so we also know how many objects are not
      used) and the number of objects per-page.  So we can check whether we
      have enough unused objects to form at least one ZS_EMPTY zspage during
      compaction.
      
      We calculate this value on a per-class basis, so we can calculate the
      total number of zspages that can be released, which is exactly what a
      shrinker wants to know.
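      
      A minimal sketch of that summation, assuming each class exposes its
      allocated and used object counts plus the number of objects that fit
      in one zspage.  `struct class_stats' and `total_zspages_to_free()' are
      hypothetical names, and the figures in the usage note are illustrative
      only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of the per-class stats that are always kept. */
struct class_stats {
	unsigned long obj_allocated;      /* OBJ_ALLOCATED */
	unsigned long obj_used;           /* OBJ_USED */
	unsigned long maxobj_per_zspage;  /* objects that fit in one zspage */
};

/*
 * Sum, over all classes, the number of whole zspages that the unused
 * objects could form.  A count callback could report this total.
 */
static unsigned long total_zspages_to_free(const struct class_stats *classes,
					   size_t nr_classes)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < nr_classes; i++) {
		const struct class_stats *c = &classes[i];

		assert(c->maxobj_per_zspage > 0);
		assert(c->obj_used <= c->obj_allocated);
		total += (c->obj_allocated - c->obj_used) /
			 c->maxobj_per_zspage;
	}
	return total;
}
```

      With three classes holding 4, 90 and 40 unused objects and 8, 33 and
      9 objects per zspage, the total is 0 + 2 + 4 = 6 zspages.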
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57244594
    • Sergey Senozhatsky's avatar
      zsmalloc: drop unused variable `nr_to_migrate' · b430d1fd
      Sergey Senozhatsky authored
      This patchset tweaks compaction and makes it possible to trigger pool
      compaction automatically when the system is getting low on memory.
      
      zsmalloc in some cases can suffer from notable fragmentation, and
      compaction can release a considerable amount of memory.  The problem
      here is that currently we fully rely on user space to perform compaction
      when needed.  However, performing zsmalloc compaction is not always an
      obvious thing to do.  For example, suppose we have an `idle' fragmented
      (compaction was never performed) zram device and the system is getting
      low on memory due to some 3rd party user processes (gcc LTO, or firefox,
      etc.).  It's quite unlikely that user space will issue zpool compaction
      in this case.  Besides, user space cannot tell for sure how badly the
      pool is fragmented; however, this info is known to zsmalloc and, hence,
      to a shrinker.
      
      This patch (of 7):
      
      __zs_compact() does not use `nr_to_migrate', drop it.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b430d1fd
    • Zhen Lei's avatar
      mm/page_alloc.c: fix type information of memoryless node · 4ada0c5a
      Zhen Lei authored
      For a memoryless node, the outputs of get_pfn_range_for_nid are all
      zero, so the node's memory range is displayed as 0 to -1.
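      
      The "-1" comes from subtracting one from an end address of zero.  A
      user-space sketch of a guarded calculation is shown below.  It is an
      illustration, not the patch itself: `node_mem_last_byte()' is a
      hypothetical helper and PAGE_SHIFT is fixed at 12 (4 KiB pages) as an
      assumption.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/*
 * Last byte address to display for a node spanning [start_pfn, end_pfn).
 * A memoryless node has start_pfn == end_pfn == 0; without the guard the
 * result would wrap around to an all-ones value, i.e. "-1".
 */
static uint64_t node_mem_last_byte(unsigned long start_pfn,
				   unsigned long end_pfn)
{
	assert(start_pfn <= end_pfn);
	return end_pfn ? ((uint64_t)end_pfn << PAGE_SHIFT) - 1 : 0;
}
```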
      Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4ada0c5a
    • Xishi Qiu's avatar
      memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node() · b5685e92
      Xishi Qiu authored
      When hot adding a node from add_memory(), we will add memblock first, so
      the node is not empty.  But when called from cpu_up(), the node should
      be empty.
      Signed-off-by: default avatarXishi Qiu <qiuxishi@huawei.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b5685e92
    • Yaowei Bai's avatar
      mm/page_alloc.c: change sysctl_lower_zone_reserve_ratio to sysctl_lowmem_reserve_ratio in comments · 34b10060
      Yaowei Bai authored
      We use sysctl_lowmem_reserve_ratio rather than
      sysctl_lower_zone_reserve_ratio to determine how aggressive the kernel
      is in defending lowmem from the possibility of being captured into
      pinned user memory.  To avoid misleading readers, correct the name in
      the affected comments.
      Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34b10060
    • Yaowei Bai's avatar
      mm/page_alloc.c: fix a misleading comment · 013110a7
      Yaowei Bai authored
      The comment says that the per-cpu batchsize and zone watermarks are
      determined by present_pages, which is definitely wrong: they are both
      calculated from managed_pages.  Fix it.
      Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      013110a7
    • Chen Gang's avatar
      mm/mmap.c:insert_vm_struct(): check for failure before setting values · c9d13f5f
      Chen Gang authored
      There's no point in initializing vma->vm_pgoff if the insertion attempt
      will be failing anyway.  Run the checks before performing the
      initialization.
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9d13f5f
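
The reordering described in the insert_vm_struct() commit above is an instance of the general "check first, then initialize" pattern. The following is a minimal user-space sketch of that pattern only, not the kernel function: struct toy_vma, toy_range_is_free() and toy_insert() are hypothetical names, and the single-neighbour overlap test and 4 KiB page shift are simplifying assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a VMA: [start, end) plus an offset. */
struct toy_vma {
    unsigned long start;
    unsigned long end;
    unsigned long pgoff;
    bool pgoff_set;
};

/* Hypothetical check: true if vma does not overlap the existing mapping. */
static bool toy_range_is_free(const struct toy_vma *existing,
                              const struct toy_vma *vma)
{
    return existing == NULL ||
           vma->end <= existing->start ||
           vma->start >= existing->end;
}

/* Run the failure checks before touching vma, so a rejected vma
 * is returned to the caller unmodified. */
static int toy_insert(const struct toy_vma *existing, struct toy_vma *vma)
{
    assert(vma != NULL);

    if (!toy_range_is_free(existing, vma))
        return -ENOMEM;

    /* Only now initialize derived fields (assumes 4 KiB pages). */
    vma->pgoff = vma->start >> 12;
    vma->pgoff_set = true;
    return 0;
}
```

With this ordering, a failed insertion leaves pgoff untouched, which mirrors the commit's point that initializing the field is wasted work when the insertion is going to fail.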