1. 10 Jul, 2017 30 commits
    • Sean Christopherson's avatar
      mm/memcontrol: exclude @root from checks in mem_cgroup_low · 34c81057
      Sean Christopherson authored
      Make @root exclusive in mem_cgroup_low; it is never considered low when
      looked at directly and is not checked when traversing the tree.  In
      effect, @root is handled identically to how root_mem_cgroup was
      previously handled by mem_cgroup_low.
      
      If @root is not excluded from the checks, a cgroup underneath @root will
      never be considered low during targeted reclaim of @root, e.g.  due to
      memory.current > memory.high, unless @root is misconfigured to have
      memory.low > memory.high.
      
      Excluding @root enables using memory.low to prioritize memory usage
      between cgroups within a subtree of the hierarchy that is limited by
      memory.high or memory.max, e.g.  when ROOT owns @root's controls but
      delegates the @root directory to a USER so that USER can create and
      administer children of @root.
      
      For example, given cgroup A with children B and C:
      
          A
         / \
        B   C
      
      and
      
        1. A/memory.current > A/memory.high
        2. A/B/memory.current < A/B/memory.low
        3. A/C/memory.current >= A/C/memory.low
      
      As 'A' is high, i.e.  triggers reclaim from 'A', and 'B' is low, we
      should reclaim from 'C' until 'A' is no longer high or until we can no
      longer reclaim from 'C'.  If 'A', i.e.  @root, isn't excluded by
      mem_cgroup_low when reclaiming from 'A', then 'B' won't be considered low
      and we will reclaim indiscriminately from both 'B' and 'C'.
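
The check described above can be modeled in user space. This is only a sketch under stated assumptions: `struct memcg` and `memcg_is_low()` are hypothetical stand-ins, not the kernel's `struct mem_cgroup` or `mem_cgroup_low()`, and the fields merely mirror memory.current and memory.low.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for a memory cgroup. */
struct memcg {
	struct memcg *parent;   /* NULL at the top of the hierarchy */
	unsigned long usage;    /* models memory.current */
	unsigned long low;      /* models memory.low */
};

/*
 * Returns true when @memcg counts as "low" during reclaim that targets
 * @root. @root itself is never low, and the upward walk stops before
 * @root's own usage/low pair is examined.
 */
static bool memcg_is_low(const struct memcg *root, const struct memcg *memcg)
{
	if (memcg == root)
		return false;

	for (; memcg != NULL && memcg != root; memcg = memcg->parent) {
		if (memcg->usage >= memcg->low)
			return false;
	}
	return true;
}
```

With 'A' as @root, 'B' (usage below its low) is reported low and 'C' is not. If the walk instead continues through 'A' up to the top of the hierarchy, 'A' fails the usage check and 'B' is no longer considered low, which is the behavior the patch removes.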
      
      Here is the test I used to confirm the bug and the patch.
      
      20:00:55@sjchrist-vm ? ~ $ cat ~/.bin/memcg_low_test
      #!/bin/bash
      
      x62mb=$((62<<20))
      x66mb=$((66<<20))
      x94mb=$((94<<20))
      x98mb=$((98<<20))
      
      setup() {
          set -e
      
          if [[ -n $DEBUG ]]; then
              set -x
          fi
      
          trap teardown EXIT HUP INT TERM
      
          if [[ ! -e /mnt/1gb.swap ]]; then
              sudo fallocate -l 1G /mnt/1gb.swap > /dev/null
              sudo mkswap /mnt/1gb.swap > /dev/null
          fi
          if ! swapon --show=NAME | grep -q "/mnt/1gb.swap"; then
              sudo swapon /mnt/1gb.swap
          fi
      
          if [[ ! -e /cgroup/cgroup.controllers ]]; then
              sudo mount -t cgroup2 none /cgroup
          fi
      
          grep -q memory /cgroup/cgroup.controllers
      
          sudo sh -c "echo '+memory' > /cgroup/cgroup.subtree_control"
      
          sudo mkdir /cgroup/A && sudo chown $USER:$USER /cgroup/A
          sudo sh -c "echo '+memory' > /cgroup/A/cgroup.subtree_control"
          sudo sh -c "echo '96m' > /cgroup/A/memory.high"
      
          mkdir /cgroup/A/0
          mkdir /cgroup/A/1
      
          echo 64m > /cgroup/A/0/memory.low
      }
      
      teardown() {
          set +e
      
          trap - EXIT HUP INT TERM
      
          if [[ -z $1 ]]; then
              printf "\n"
              printf "%0.s*" {1..35}
              printf "\nFAILED!\n\n"
              tail /cgroup/A/**/memory.current
              printf "%0.s*" {1..35}
              printf "\n\n"
          fi
      
          ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill %
      
          sleep 2
      
          if [[ -e /cgroup/A/0 ]]; then
              rmdir /cgroup/A/0
          fi
          if [[ -e /cgroup/A/1 ]]; then
              rmdir /cgroup/A/1
          fi
          if [[ -e /cgroup/A ]]; then
              sudo rmdir /cgroup/A
          fi
      }
      
      stress_test() {
          sudo sh -c "echo $$ > /cgroup/A/$1/cgroup.procs"
          stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null &
      
          sudo sh -c "echo $$ > /cgroup/A/$2/cgroup.procs"
          stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null &
      
          sudo sh -c "echo $$ > /cgroup/cgroup.procs"
      
          sleep 1
      
          # A/0 should be consuming more memory than A/1
          [[ $(cat /cgroup/A/0/memory.current) -ge $(cat /cgroup/A/1/memory.current) ]]
      
          # A/0 should be consuming ~64mb
          [[ $(cat /cgroup/A/0/memory.current) -ge $x62mb ]] && [[ $(cat /cgroup/A/0/memory.current) -le $x66mb ]]
      
          # A should cumulatively be consuming ~96mb
          [[ $(cat /cgroup/A/memory.current) -ge $x94mb ]] && [[ $(cat /cgroup/A/memory.current) -le $x98mb ]]
      
          # Stop the stressors
          ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill %
      }
      
      teardown 1
      setup
      
      for ((i=1;i<=$1;i++)); do
          printf "ITERATION $i of $1 - stress_test 0 1"
          stress_test 0 1
          printf "\x1b[2K\r"
      
          printf "ITERATION $i of $1 - stress_test 1 0"
          stress_test 1 0
          printf "\x1b[2K\r"
      
          printf "ITERATION $i of $1 - PASSED\n"
      done
      
      teardown 1
      
      echo PASSED!
      
      20:11:26@sjchrist-vm ? ~ $ memcg_low_test 10
      
      Link: http://lkml.kernel.org/r/1496434412-21005-1-git-send-email-sean.j.christopherson@intel.com
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34c81057
    • Michal Hocko's avatar
      mm: make PR_SET_THP_DISABLE immediately active · 18600332
      Michal Hocko authored
      PR_SET_THP_DISABLE has a rather subtle semantic.  It doesn't affect any
      existing mapping because it only updated mm->def_flags which is a
      template for new mappings.
      
      The mappings created after prctl(PR_SET_THP_DISABLE) have VM_NOHUGEPAGE
      flag set.  This can be quite surprising for all those applications which
      do not do prctl(); fork() & exec() and want to control their own THP
      behavior.
      
      Another usecase when the immediate semantic of the prctl might be useful
      is a combination of pre- and post-copy migration of containers with
      CRIU.  In this case CRIU populates a part of a memory region with data
      that was saved during the pre-copy stage.  Afterwards, the region is
      registered with userfaultfd and CRIU expects to get page faults for the
      parts of the region that were not yet populated.  However, khugepaged
      collapses the pages and the expected page faults do not occur.
      
      In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
      temporary mechanism for enabling/disabling THP process wide.
      
      Implementation wise, a new MMF_DISABLE_THP flag is added.  This flag is
      tested when the decision whether to use huge pages is made, either
      during page fault or at the time of THP collapse.
      
      Note that the new implementation makes PR_SET_THP_DISABLE a master
      override of any per-VMA setting, which was not the case previously.
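
As an illustration of the process-wide toggle, the following sketch wraps the two prctl(2) options. The helper names `thp_set_disabled()` and `thp_get_disabled()` are hypothetical; the numeric fallbacks are only used if the libc headers do not already define the options.

```c
#include <sys/prctl.h>

/* Fallback definitions in case the libc headers predate these options. */
#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif
#ifndef PR_GET_THP_DISABLE
#define PR_GET_THP_DISABLE 42
#endif

/* Set (nonzero) or clear (zero) the THP-disable flag of the calling process.
 * Returns 0 on success, -1 on error with errno set. */
static int thp_set_disabled(int disabled)
{
	return prctl(PR_SET_THP_DISABLE, disabled ? 1UL : 0UL, 0UL, 0UL, 0UL);
}

/* Returns 1 if THP is disabled for the calling process, 0 if not,
 * or -1 on error with errno set. */
static int thp_get_disabled(void)
{
	return prctl(PR_GET_THP_DISABLE, 0UL, 0UL, 0UL, 0UL);
}
```

With this patch applied, calling `thp_set_disabled(1)` is expected to take effect for existing mappings as well, not only for mappings created afterwards.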
      
      Fixes: a0715cc2 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
      Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      18600332
    • David Rientjes's avatar
      mm, vmpressure: pass-through notification support · b6bb9811
      David Rientjes authored
      By default, vmpressure events are not pass-through, i.e.  they propagate
      up through the memcg hierarchy until an event notifier is found for any
      threshold level.
      
      This presents a difficulty when a thread waiting on a read(2) for a
      vmpressure event cannot distinguish between local memory pressure and
      memory pressure in a descendant memcg, especially when that thread may
      not control the memcg hierarchy.
      
      Consider a user-controlled child memcg with a smaller limit than a
      top-level memcg controlled by the "Activity Manager" specified in
      Documentation/cgroup-v1/memory.txt.  It may register for memory pressure
      notification for descendant memcgs to make a policy decision: oom kill a
      low priority job, increase the limit, decrease other limits, etc.  If it
      registers for memory pressure notification on the top-level memcg, it
      currently cannot distinguish between memory pressure in its own memcg or
      a descendant memcg, which is user-controlled.
      
      Conversely, if a user registers for memory pressure notification on
      their own descendant memcg, the Activity Manager does not receive any
      pressure notification for that child memcg hierarchy.  Vmpressure events
      are not received for ancestor memcgs if the memcg experiencing pressure
      has notifiers registered, perhaps outside the knowledge of the thread
      waiting on read(2) at the top level.
      
      Both of these are consequences of vmpressure notification not being
      pass-through.
      
      This implements a pass-through behavior for vmpressure events.  When
      writing to cgroup.event_control, vmpressure event handlers may
      optionally specify a mode.  There are two new modes:
      
       - "hierarchy": always propagate memory pressure events up the hierarchy
         regardless if descendant memcgs have their own notifiers registered,
         and
      
       - "local": only receive notifications when the memcg for which the
         event is registered experiences memory pressure.
      
      Of course, processes may register for one notification of "low,local",
      for example, and another for "low".
      
      If no mode is specified, the current behavior is maintained for
      backwards compatibility.
      
      See the change to Documentation/cgroup-v1/memory.txt for full
      specification.
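
A registration string with an optional mode could be assembled as sketched below. This assumes the `"<event_fd> <fd of memory.pressure_level> <level[,mode]>"` layout described in Documentation/cgroup-v1/memory.txt; the helper name `format_pressure_registration()` is hypothetical, and the caller still has to open the files and write the string itself.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Builds "<event_fd> <pressure_fd> <level>" or, when a mode such as
 * "local" or "hierarchy" is given, "<event_fd> <pressure_fd> <level>,<mode>".
 * Returns the string length, or -1 if the buffer is too small.
 */
static int format_pressure_registration(char *buf, size_t len, int event_fd,
					int pressure_fd, const char *level,
					const char *mode)
{
	int n;

	if (mode != NULL && mode[0] != '\0')
		n = snprintf(buf, len, "%d %d %s,%s",
			     event_fd, pressure_fd, level, mode);
	else
		n = snprintf(buf, len, "%d %d %s",
			     event_fd, pressure_fd, level);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

Passing a NULL or empty mode produces the pre-existing format, which matches the backwards-compatible default described above.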
      
      [dan.carpenter@oracle.com: free the same pointer we allocated]
        Link: http://lkml.kernel.org/r/20170613191820.GA20003@elgon.mountain
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705311421320.8946@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6bb9811
    • Naoya Horiguchi's avatar
      mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error · 78bb9203
      Naoya Horiguchi authored
      Currently me_huge_page() relies on dequeue_hwpoisoned_huge_page() to
      keep the error hugepage away from the system, which is OK but not good
      enough because the hugepage still has a refcount and unpoison doesn't
      work on the error hugepage (PageHWPoison flags are cleared but pages are
      still leaked).  There's also the "wasting healthy subpages" issue.  This
      patch reworks me_huge_page() to solve these issues.
      
      For hugetlb file, recently we have truncating code so let's use it in
      hugetlbfs specific ->error_remove_page().
      
      For anonymous hugepage, it's helpful to dissolve the error page after
      freeing it into free hugepage list.  Migration entry and PageHWPoison in
      the head page prevent the access to it.
      
      TODO: dissolve_free_huge_page() can fail but we don't consider it yet.
      It's not critical (and at least no worse than now) because in such case
      the error hugepage just stays in free hugepage list without being
      dissolved.  By virtue of PageHWPoison in head page, it's never allocated
      to processes.
      
      [akpm@linux-foundation.org: fix unused var warnings]
      Fixes: 23a003bf ("mm/madvise: pass return code of memory_failure() to userspace")
      Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
      Link: http://lkml.kernel.org/r/1496305019-5493-8-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      78bb9203
    • Naoya Horiguchi's avatar
      mm: hwpoison: introduce memory_failure_hugetlb() · 761ad8d7
      Naoya Horiguchi authored
      memory_failure() is a big function and hard to maintain.  Handling
      hugetlb- and non-hugetlb- case in a single function is not good, so this
      patch separates PageHuge() branch into a new function, which saves many
      PageHuge() checks.
      
      Link: http://lkml.kernel.org/r/1496305019-5493-7-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      761ad8d7
    • Naoya Horiguchi's avatar
      mm: soft-offline: dissolve free hugepage if soft-offlined · d4a3a60b
      Naoya Horiguchi authored
      Now we have code to rescue most of healthy pages from a hwpoisoned
      hugepage.  So let's apply it to soft_offline_free_page too.
      
      Link: http://lkml.kernel.org/r/1496305019-5493-6-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d4a3a60b
    • Anshuman Khandual's avatar
      mm: hugetlb: soft-offline: dissolve source hugepage after successful migration · c3114a84
      Anshuman Khandual authored
      Currently hugepage migrated by soft-offline (i.e.  due to correctable
      memory errors) is contained as a hugepage, which means many non-error
      pages in it are unreusable, i.e.  wasted.
      
      This patch solves this issue by dissolving source hugepages into buddy.
      As done in previous patch, PageHWPoison is set only on a head page of
      the error hugepage.  Then in dissolving we move the PageHWPoison flag
      to the raw error page so that all healthy subpages return back to buddy.
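
The flag handling described above can be pictured with a small user-space model. Everything here is hypothetical and for illustration only: `struct hugepage_model`, `SUBPAGES`, and both helpers are invented names, and the real kernel operates on `struct page` flags rather than a bool array.

```c
#include <stdbool.h>
#include <stddef.h>

#define SUBPAGES 8  /* hypothetical; real hugepages have far more subpages */

struct hugepage_model {
	bool hwpoison[SUBPAGES]; /* index 0 models the head page */
	size_t raw_error_idx;    /* subpage that actually reported the error */
	bool dissolved;
};

/* Models the poisoned-page counter; incremented once per hugepage. */
static unsigned long poisoned_count;

/* On error, only the head page is flagged and the counter grows by 1. */
static void poison_hugepage(struct hugepage_model *hp, size_t raw_idx)
{
	hp->raw_error_idx = raw_idx;
	if (!hp->hwpoison[0]) {
		hp->hwpoison[0] = true;
		poisoned_count++;
	}
}

/*
 * On dissolve, the flag moves from the head page to the raw error page.
 * Returns the number of healthy subpages that can go back to the allocator.
 */
static size_t dissolve_hugepage(struct hugepage_model *hp)
{
	size_t i, healthy = 0;

	if (hp->hwpoison[0] && hp->raw_error_idx != 0) {
		hp->hwpoison[0] = false;
		hp->hwpoison[hp->raw_error_idx] = true;
	}
	hp->dissolved = true;

	for (i = 0; i < SUBPAGES; i++)
		if (!hp->hwpoison[i])
			healthy++;
	return healthy;
}
```

In this model, an error in subpage 5 of an 8-subpage hugepage leaves 7 healthy subpages after the dissolve, instead of wasting the whole hugepage.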
      
      [arnd@arndb.de: fix warnings: replace some macros with inline functions]
        Link: http://lkml.kernel.org/r/20170609102544.2947326-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/1496305019-5493-5-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3114a84
    • Naoya Horiguchi's avatar
      mm: hwpoison: change PageHWPoison behavior on hugetlb pages · b37ff71c
      Naoya Horiguchi authored
      We'd like to narrow down the error region in memory error on hugetlb
      pages.  However, currently we set PageHWPoison flags on all subpages in
      the error hugepage and add # of subpages to num_hwpoison_pages, which
      doesn't fit our purpose.
      
      So this patch changes the behavior and we only set PageHWPoison on the
      head page then increase num_hwpoison_pages only by 1.  This is a
      preparation for narrow-down part which comes in later patches.
      
      Link: http://lkml.kernel.org/r/1496305019-5493-4-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b37ff71c
    • Naoya Horiguchi's avatar
      mm: hugetlb: return immediately for hugetlb page in __delete_from_page_cache() · 09612fa6
      Naoya Horiguchi authored
      We avoid calling __mod_node_page_state(NR_FILE_PAGES) for hugetlb page
      now, but it's not enough because later code doesn't handle hugetlb
      properly.  Actually in our testing, WARN_ON_ONCE(PageDirty(page)) at the
      end of this function fires for hugetlb, which makes no sense.  So we
      should return immediately for hugetlb pages.
      
      Link: http://lkml.kernel.org/r/1496305019-5493-3-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09612fa6
    • Naoya Horiguchi's avatar
      mm: hugetlb: prevent reuse of hwpoisoned free hugepages · 243abd5b
      Naoya Horiguchi authored
      Patch series "mm: hwpoison: fixlet for hugetlb migration".
      
      This patchset updates the hwpoison/hugetlb code to address 2 reported
      issues.
      
      One is madvise(MADV_HWPOISON) failure reported by Intel's lkp robot (see
      http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop).  The first
      half was already fixed in mainline, and the other half, about hugetlb
      cases, is solved in this series.
      
      Another issue is "narrow-down error affected region into a single 4kB
      page instead of a whole hugetlb page" issue, which was tried by Anshuman
      (http://lkml.kernel.org/r/20170420110627.12307-1-khandual@linux.vnet.ibm.com)
      and I updated it to apply it more widely.
      
      This patch (of 9):
      
      We no longer use MIGRATE_ISOLATE to prevent reuse of hwpoison hugepages
      as we did before.  So current dequeue_huge_page_node() doesn't work as
      intended because it still uses is_migrate_isolate_page() for this check.
      This patch fixes it with PageHWPoison flag.
      
      Link: http://lkml.kernel.org/r/1496305019-5493-2-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      243abd5b
    • Eric Biggers's avatar
      fs/buffer.c: make bh_lru_install() more efficient · 241f01fb
      Eric Biggers authored
      To install a buffer_head into the cpu's LRU queue, bh_lru_install()
      would construct a new copy of the queue and then memcpy it over the real
      queue.  But it's easily possible to do the update in-place, which is
      faster and simpler.  Some work can also be skipped if the buffer_head
      was already in the queue.
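
The in-place update can be sketched in user space as a single pass that pushes the new entry in at the front and carries the displaced entry forward. This is a simplified model, not the kernel function: `lru_install()`, `LRU_SIZE`, and `LRU_EMPTY` are hypothetical names, entries are plain nonzero ints, and the per-cpu locking and reference counting are left out.

```c
#include <stddef.h>

#define LRU_SIZE 4      /* hypothetical; stands in for BH_LRU_SIZE */
#define LRU_EMPTY 0     /* hypothetical marker for an unused slot */

/*
 * Moves @item (nonzero) to the front of @lru, shifting older entries back
 * one slot at a time. Stops early if @item was already in the queue.
 * Returns the entry pushed off the end, or LRU_EMPTY if nothing was evicted.
 */
static int lru_install(int lru[LRU_SIZE], int item)
{
	int evictee = item;
	size_t i;

	for (i = 0; i < LRU_SIZE; i++) {
		int tmp = lru[i];

		lru[i] = evictee;
		evictee = tmp;
		if (evictee == item)
			return LRU_EMPTY; /* was already present: done */
	}
	return evictee;
}
```

Starting from `{1, 2, 3, 0}`, installing 3 yields `{3, 1, 2, 0}` without touching the last slot, which corresponds to the work that can be skipped when the entry is already queued.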
      
      As a microbenchmark I timed how long it takes to run sb_getblk()
      10,000,000 times alternating between BH_LRU_SIZE + 1 blocks.
      Effectively, this benchmarks looking up buffer_heads that are in the
      page cache but not in the LRU:
      
      	Before this patch: 1.758s
      	After this patch: 1.653s
      
      This patch also removes about 350 bytes of compiled code (on x86_64),
      partly due to removal of the memcpy() which was being inlined+unrolled.
      
      Link: http://lkml.kernel.org/r/20161229193445.1913-1-ebiggers3@gmail.com
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      241f01fb
    • Nick Desaulniers's avatar
      mm/zsmalloc.c: fix -Wunneeded-internal-declaration warning · 3457f414
      Nick Desaulniers authored
      is_first_page() is only called from the VM_BUG_ON_PAGE() macro, which is
      compiled in as a runtime check only when CONFIG_DEBUG_VM is set;
      otherwise the expression is checked at compile time and not actually
      compiled in.
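
The pattern behind the warning can be reproduced in a few lines. This is a sketch under assumptions: `CHECK_PAGE()`, `DEBUG_VM`, `struct page`, and `zspage_index()` are hypothetical stand-ins, and marking the helper as unused is one common way to silence this class of warning, not necessarily what the patch does.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct page. */
struct page {
	bool first;
};

#ifdef DEBUG_VM
/* Debug build: the condition is evaluated at run time. */
#define CHECK_PAGE(cond) do { if (!(cond)) abort(); } while (0)
#else
/* Non-debug build: sizeof() type-checks the expression but never
 * evaluates it, so no call to the helper is emitted. */
#define CHECK_PAGE(cond) ((void)sizeof((long)(cond)))
#endif

/* Only referenced through CHECK_PAGE(); the attribute tells the compiler
 * that an unemitted definition is intentional. */
static __attribute__((unused)) bool is_first_page(const struct page *page)
{
	return page->first;
}

/* Returns the index of @page relative to the first page of its group. */
static size_t zspage_index(const struct page *first_page,
			   const struct page *page)
{
	CHECK_PAGE(is_first_page(first_page));
	return (size_t)(page - first_page);
}
```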
      
      Fixes the following warning, found with Clang:
      
        mm/zsmalloc.c:472:12: warning: function 'is_first_page' is not needed and will not be emitted [-Wunneeded-internal-declaration]
        static int is_first_page(struct page *page)
                 ^
      
      Link: http://lkml.kernel.org/r/20170524053859.29059-1-nick.desaulniers@gmail.com
      Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3457f414
    • Gustavo A. R. Silva's avatar
      mm/memory_hotplug.c: add NULL check to avoid potential NULL pointer dereference · dbac61a3
      Gustavo A. R. Silva authored
      The NULL check at line 1226: if (!pgdat), implies that pointer pgdat
      might be NULL.
      
      rollback_node_hotadd() dereferences this pointer.  Add NULL check to
      avoid a potential NULL pointer dereference.
      
      Addresses-Coverity-ID: 1369133
      Link: http://lkml.kernel.org/r/20170530212436.GA6195@embeddedgus
      Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dbac61a3
    • David Rientjes's avatar
      mm, vmscan: avoid thrashing anon lru when free + file is low · 06226226
      David Rientjes authored
      The purpose of the code that commit 62376251 ("revert 'mm: vmscan:
      do not swap anon pages just because free+file is low'") reintroduces is
      to prefer swapping anonymous memory rather than thrashing the file lru.
      
      If the anonymous inactive lru for the set of eligible zones is
      considered low, however, or the length of the list for the given reclaim
      priority does not allow for effective anonymous-only reclaiming, then
      avoid forcing SCAN_ANON.  Forcing SCAN_ANON will end up thrashing the
      small list and leave unreclaimed memory on the file lrus.
      
      If the inactive list is insufficient, fall back to balanced reclaim so
      the file lru doesn't remain untouched.
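
The decision described above can be condensed into a small predicate. This is a simplified model with hypothetical names (`choose_scan_balance()`, `enum scan_balance_model`); the kernel derives these inputs from per-node and per-lruvec counters, which are passed in directly here.

```c
#include <stdbool.h>

enum scan_balance_model { SCAN_BALANCED, SCAN_ANON_ONLY };

/*
 * Anon-only scanning is forced when free + file pages are at or below the
 * high watermark, but only if the inactive anon list is not considered low
 * and is long enough to yield at least one page at this reclaim priority.
 * Otherwise the caller falls back to balanced reclaim.
 */
static enum scan_balance_model
choose_scan_balance(unsigned long free_pages, unsigned long file_pages,
		    unsigned long high_wmark, unsigned long inactive_anon,
		    bool inactive_anon_is_low, unsigned int priority)
{
	if (free_pages + file_pages <= high_wmark) {
		if (!inactive_anon_is_low && (inactive_anon >> priority) != 0)
			return SCAN_ANON_ONLY;
	}
	return SCAN_BALANCED;
}
```

For example, at priority 12 an inactive anon list of 1024 pages shifts down to zero, so the model falls back to balanced reclaim even though free + file is low.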
      
      [akpm@linux-foundation.org: fix build]
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705011432220.137835@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06226226
    • Yevgen Pronenko's avatar
      mm/memory.c: convert to DEFINE_DEBUGFS_ATTRIBUTE · 0a1345f8
      Yevgen Pronenko authored
      The preferred strategy to define debugfs attributes is to use the
      DEFINE_DEBUGFS_ATTRIBUTE() macro and to use debugfs_create_file_unsafe().
      
      Link: http://lkml.kernel.org/r/20170528145948.32127-1-y.pronenko@gmail.com
      Signed-off-by: Yevgen Pronenko <y.pronenko@gmail.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a1345f8
    • Vlastimil Babka's avatar
      mm, page_alloc: fallback to smallest page when not stealing whole pageblock · 7a8f58f3
      Vlastimil Babka authored
      Since commit 3bc48f96 ("mm, page_alloc: split smallest stolen page
      in fallback") we pick the smallest (but sufficient) page of all that
      have been stolen from a pageblock of different migratetype.  However,
      there are cases when we decide not to steal the whole pageblock.
      
      Practically in the current implementation it means that we are trying to
      fallback for a MIGRATE_MOVABLE allocation of order X, go through the
      freelists from MAX_ORDER-1 down to X, and find free page of order Y.  If
      Y is less than pageblock_order / 2, we decide not to steal all pages
      from the pageblock.  When Y > X, it means we are potentially splitting a
      larger page than we need, as there might be other pages of order Z,
      where X <= Z < Y.  Since Y is already too small to steal whole
      pageblock, picking smallest available Z will result in the same decision
      and we avoid splitting a higher-order page in a MIGRATE_UNMOVABLE or
      MIGRATE_RECLAIMABLE pageblock.
      
      This patch therefore changes the fallback algorithm so that in the
      situation described above, we switch the fallback search strategy to go
      from order X upwards to find the smallest suitable fallback.  In theory
      there shouldn't be a downside of this change wrt fragmentation.
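
The changed search order can be modeled on a plain array of free counts per order. This is a sketch with hypothetical names (`pick_fallback_order()`, `MAX_ORDER_MODEL`); it ignores migratetype lists, zones, and the actual stealing, and only shows which order gets picked.

```c
#include <stdbool.h>

#define MAX_ORDER_MODEL 11  /* hypothetical; orders 0..10 are modeled */

/*
 * @nr_free[o] is the number of free fallback pages of order o.
 * Searches from the largest order down to @order for a page of order Y.
 * If the request is movable, Y is too small to steal the whole pageblock
 * (Y < pageblock_order / 2) and Y > @order, the search restarts from
 * @order upwards so the smallest suitable page is picked instead.
 * Returns the chosen order, or -1 if nothing is free.
 */
static int pick_fallback_order(const unsigned int nr_free[MAX_ORDER_MODEL],
			       int order, int pageblock_order, bool movable)
{
	int cur;

	for (cur = MAX_ORDER_MODEL - 1; cur >= order; cur--) {
		if (nr_free[cur] == 0)
			continue;
		if (movable && cur < pageblock_order / 2 && cur > order)
			goto find_smallest;
		return cur;
	}
	return -1;

find_smallest:
	for (cur = order; cur < MAX_ORDER_MODEL; cur++)
		if (nr_free[cur] != 0)
			return cur;
	return -1;
}
```

With free pages at orders 2 and 3, a pageblock order of 9, and a movable order-1 request, the model picks order 2 rather than splitting the order-3 page.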
      
      This has been tested with mmtests' stress-highalloc performing
      GFP_KERNEL order-4 allocations, here is the relevant extfrag tracepoint
      statistics:
      
                                                            4.12.0-rc2  4.12.0-rc2
                                                             1-kernel4   2-kernel4
        Page alloc extfrag event                              25640976    69680977
        Extfrag fragmenting                                   25621086    69661364
        Extfrag fragmenting for unmovable                        74409       73204
        Extfrag fragmenting unmovable placed with movable        69003       67684
        Extfrag fragmenting unmovable placed with reclaim.        5406        5520
        Extfrag fragmenting for reclaimable                       6398        8467
        Extfrag fragmenting reclaimable placed with movable        869         884
        Extfrag fragmenting reclaimable placed with unmov.        5529        7583
        Extfrag fragmenting for movable                       25540279    69579693
      
      Since we force movable allocations to steal the smallest available page
      (which we then practically always split), we steal less per fallback, so
      the number of fallbacks increases and steals potentially happen from
      different pageblocks.  This is however not an issue for movable pages
      that can be compacted.
      
      Importantly, the "unmovable placed with movable" statistics is lower,
      which is the result of less fragmentation in the unmovable pageblocks.
      The effect on reclaimable allocation is a bit unclear.
      
      Link: http://lkml.kernel.org/r/20170529093947.22618-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a8f58f3
    • Shaohua Li's avatar
      swap: add block io poll in swapin path · 23955622
      Shaohua Li authored
      For fast flash disk, async IO could introduce overhead because of
      context switch.  block-mq now supports IO poll, which improves
      performance and latency a lot.  swapin is a good place to use this
      technique, because the task is waiting for the swapin page to continue
      execution.
      
      In my virtual machine, directly read 4k data from a NVMe with iopoll is
      about 60% better than that without poll.  With iopoll support in swapin
      patch, my microbenchmark (a task does random memory write) is about
      10%~25% faster.  CPU utilization increases a lot though, 2x and even 3x
      CPU utilization.  This will depend on disk speed.
      
      While iopoll in swapin isn't intended for all usage cases, it's a win
      for latency sensitive workloads with high speed swap disk.  The block
      layer has a knob to control polling at runtime.  If poll isn't enabled
      in the block layer, there should be no noticeable change in swapin.
      
      I got a chance to run the same test in a NVMe with DRAM as the media.
      In simple fio IO test, blkpoll boosts 50% performance in single thread
      test and ~20% in 8 threads test.  So this is the base line.  In above
      swap test, blkpoll boosts ~27% performance in single thread test.
      blkpoll uses 2x CPU time though.
      
      If we enable hybrid polling, the performance gain has very slight drop
      but CPU time is only 50% worse than that without blkpoll.  Also we can
      adjust parameter of hybrid poll, with it, the CPU time penalty is
      reduced further.  In 8 threads test, blkpoll doesn't help though.  The
      performance is similar to that without blkpoll, but cpu utilization is
      similar too.  There is lock contention in swap path.  The cpu time
      spending on blkpoll isn't high.  So overall, blkpoll swapin isn't worse
      than that without it.
      
      The swapin readahead might read in several pages at the same time and
      form a big IO request.  Since the IO will take longer, it doesn't make
      sense to poll, so the patch only does iopoll for single page swapin.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/070c3c3e40b711e7b1390002c991e86a-b5408f0@7511894063d3764ff01ea8111f5a004d7dd700ed078797c204a24e620ddb965c
      Signed-off-by: Shaohua Li <shli@fb.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23955622
    • Linus Torvalds's avatar
      Merge tag 'for-linus-4.13-v2' of git://github.com/cminyard/linux-ipmi · 9eb78880
      Linus Torvalds authored
      Pull IPMI updates from Corey Minyard:
       "Some small fixes for IPMI, and one medium sized change.
      
        The medium sized change is adding a platform device for IPMI entries
        in the DMI table. Otherwise there is no auto loading for IPMI devices
        if they are only in the DMI table"
      
      * tag 'for-linus-4.13-v2' of git://github.com/cminyard/linux-ipmi:
        ipmi:ssif: Add missing unlock in error branch
        char: ipmi: constify bmc_dev_attr_group and bmc_device_type
        ipmi:ssif: Check dev before setting drvdata
        ipmi: Convert DMI handling over to a platform device
        ipmi: Create a platform device for a DMI-specified IPMI interface
        ipmi: use rcu lock around call to intf->handlers->sender()
        ipmi:ssif: Use i2c_adapter_id instead of adapter->nr
        ipmi: Use the proper default value for register size in ACPI
        ipmi_ssif: remove redundant null check on array client->adapter->name
        ipmi/watchdog: fix watchdog timeout set on reboot
        ipmi_ssif: unlock on allocation failure
      9eb78880
    • Linus Torvalds's avatar
      Merge tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 642338ba
      Linus Torvalds authored
      Pull XFS updates from Darrick Wong:
       "Here are some changes for you for 4.13. For the most part it's fixes
        for bugs and deadlock problems, and preparation for online fsck in
        some future merge window.
      
         - Avoid quotacheck deadlocks
      
         - Fix transaction overflows when bunmapping fragmented files
      
         - Refactor directory readahead
      
         - Allow admin to configure if ASSERT is fatal
      
         - Improve transaction usage detail logging during overflows
      
         - Minor cleanups
      
         - Don't leak log items when the log shuts down
      
         - Remove double-underscore typedefs
      
         - Various preparation for online scrubbing
      
         - Introduce new error injection configuration sysfs knobs
      
         - Refactor dq_get_next to use extent map directly
      
         - Fix problems with iterating the page cache for unwritten data
      
         - Implement SEEK_{HOLE,DATA} via iomap
      
         - Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA
      
         - Don't use MAXPATHLEN to check on-disk symlink target lengths"
      
      * tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
        xfs: don't crash on unexpected holes in dir/attr btrees
        xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLEN
        xfs: fix contiguous dquot chunk iteration livelock
        xfs: Switch to iomap for SEEK_HOLE / SEEK_DATA
        vfs: Add iomap_seek_hole and iomap_seek_data helpers
        vfs: Add page_cache_seek_hole_data helper
        xfs: remove a whitespace-only line from xfs_fs_get_nextdqblk
        xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent
        xfs: Check for m_errortag initialization in xfs_errortag_test
        xfs: grab dquots without taking the ilock
        xfs: fix semicolon.cocci warnings
        xfs: Don't clear SGID when inheriting ACLs
        xfs: free cowblocks and retry on buffered write ENOSPC
        xfs: replace log_badcrc_factor knob with error injection tag
        xfs: convert drop_writes to use the errortag mechanism
        xfs: remove unneeded parameter from XFS_TEST_ERROR
        xfs: expose errortag knobs via sysfs
        xfs: make errortag a per-mountpoint structure
        xfs: free uncommitted transactions during log recovery
        xfs: don't allow bmap on rt files
        ...
      642338ba
    • Linus Torvalds's avatar
      Merge branch 'nowait-aio-btrfs-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 6618a24a
      Linus Torvalds authored
      Pull btrfs fix from David Sterba:
       "This fixes a user-visible bug introduced by the nowait-aio patches
        merged in this cycle"
      
      * 'nowait-aio-btrfs-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: nowait aio: Correct assignment of pos
      6618a24a
    • Linus Torvalds's avatar
      Merge branch 'fix-uio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 1d07b6cb
      Linus Torvalds authored
      Pull copy*_iter fix from Al Viro.
      
      [ Al used entirely the wrong return value. Oopsie. ]
      
      * 'fix-uio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        fix brown paperbag bug in inlined copy_..._iter()
      1d07b6cb
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid · a91ab911
      Linus Torvalds authored
      Pull HID updates from Jiri Kosina:
      
       - open/close tracking improvements from Dmitry Torokhov
      
       - battery support improvements in Wacom driver from Jason Gerecke
      
       - Win8 support fixes from Benjamin Tissoires and Hans de Goede
      
       - misc fixes to Intel-ISH driver from Arnd Bergmann
      
       - support for quite a few new devices and small assorted fixes here and
         there
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (35 commits)
        HID: intel-ish-hid: Enable Gemini Lake ish driver
        HID: intel-ish-hid: Enable Cannon Lake ish driver
        HID: wacom: fix mistake in printk
        HID: multitouch: optimize the sticky fingers timer
        HID: multitouch: fix rare Win 8 cases when the touch up event gets missing
        HID: multitouch: use BIT macro
        HID: Add driver for Retrode2 joypad adapter
        HID: multitouch: Add support for Google Rose Touchpad
        HID: multitouch: Support PTP Stick and Touchpad device
        HID: core: don't use negative operands when shift
        HID: apple: Use country code to detect ISO keyboards
        HID: remove no longer used hid->open field
        greybus: hid: remove custom locking from gb_hid_open/close
        HID: usbhid: remove custom locking from usbhid_open/close
        HID: i2c-hid: remove custom locking from i2c_hid_open/close
        HID: serialize hid_hw_open and hid_hw_close
        HID: usbhid: do not rely on hid->open when deciding to do IO
        HID: hiddev: use hid_hw_power instead of usbhid_get/put_power
        HID: hiddev: use hid_hw_open/close instead of usbhid_open/close
        HID: asus: Add support for Zen AiO MD-5110 keyboard
        ...
      a91ab911
    • Goldwyn Rodrigues's avatar
      btrfs: nowait aio: Correct assignment of pos · ff0fa732
      Goldwyn Rodrigues authored
      Assigning pos early breaks append mode, where pos is re-assigned in
      generic_write_checks(). Assign pos later to get the correct position
      to write from iocb->ki_pos.
      
      Since check_can_nocow also uses the value of pos, we shift
      generic_write_checks() before check_can_nocow(). Checks with IOCB_DIRECT
      are present in generic_write_checks(), so checking for IOCB_NOWAIT is
      enough.
      
      Also, put locking sequence in the fast path.
      
      This fixes a user visible bug, as reported:
      
      "apparently breaks several shell related features on my system.
      In zsh history stopped working, because no new entries are added
      anymore.
      I first noticed the issue when I tried to build mplayer. It uses a shell
      script to generate a help_mp.h file:
      [...]
      
      Here is a simple testcase:
      
       % echo "foo" >> test
       % echo "foo" >> test
       % cat test
       foo
       %
      "
      
      Fixes: edf064e7 ("btrfs: nowait aio support")
      CC: Jens Axboe <axboe@kernel.dk>
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Link: https://lkml.kernel.org/r/20170704042306.GA274@x4
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ff0fa732
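
      The test case reported for the btrfs append-mode bug above can also be
      written as a small userspace check.  This is only a sketch: it assumes
      that the shell's `>>` redirection opens the file with O_APPEND, and the
      helper names are hypothetical.  On a correctly behaving filesystem the
      file ends up with two lines; with the bug described in that commit, only
      one "foo" was visible.

```python
import os
import tempfile


def append_line(path, text):
    """Append text through an O_APPEND descriptor, like `echo text >> path`."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, text.encode())
    finally:
        os.close(fd)


def run_append_check(directory):
    """Append "foo" twice to a file named test and return its contents."""
    path = os.path.join(directory, "test")
    append_line(path, "foo\n")
    append_line(path, "foo\n")
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    # Run the check in a scratch directory; point it at the filesystem
    # under test by passing dir=... to TemporaryDirectory.
    with tempfile.TemporaryDirectory() as scratch:
        contents = run_append_check(scratch)
        print("ok" if contents == "foo\nfoo\n" else "append lost data")
```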
    • Al Viro's avatar
      fix brown paperbag bug in inlined copy_..._iter() · c43aeb19
      Al Viro authored
      "copied nothing" == "return 0", not "return full size".
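
      The contract stated above can be illustrated with a toy userspace
      model.  This is not the kernel code: `checked_copy` is a hypothetical
      helper that, like a copy routine returning the number of bytes copied,
      must report 0 when its size check rejects the request.

```python
def checked_copy(dst, src, nbytes):
    """Copy nbytes from src into dst and return how many bytes were copied.

    dst is a bytearray, src is any bytes-like object.  If the size check
    fails, nothing is copied and the function reports exactly that.
    """
    if nbytes > len(dst) or nbytes > len(src):
        # "copied nothing" == "return 0", not "return full size"
        return 0
    dst[:nbytes] = src[:nbytes]
    return nbytes
```

      Returning the requested size from the failure branch would tell the
      caller that the whole buffer had been transferred when in fact no data
      had moved.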
      
      Fixes: aa28de27 "iov_iter/hardening: move object size checks to inlined part"
      Spotted-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c43aeb19
    • Jiri Kosina's avatar
      Merge branches 'for-4.13/multitouch', 'for-4.13/retrode',... · 837c194a
      Jiri Kosina authored
      Merge branches 'for-4.13/multitouch', 'for-4.13/retrode', 'for-4.13/transport-open-close-consolidation', 'for-4.13/upstream' and 'for-4.13/wacom' into for-linus
      837c194a
    • Jiri Kosina's avatar
      Merge branches 'for-4.13/ish' and 'for-4.13/ite' into for-linus · 604250dd
      Jiri Kosina authored
      Conflicts:
      	drivers/hid/hid-core.c
      604250dd
    • Jiri Kosina's avatar
      Merge branches 'for-4.13/apple' and 'for-4.13/asus' into for-linus · 4f94ff4e
      Jiri Kosina authored
      Conflicts:
      	drivers/hid/hid-core.c
      4f94ff4e
    • Linus Torvalds's avatar
      Merge tag 'drm-for-v4.13' of git://people.freedesktop.org/~airlied/linux · af3c8d98
      Linus Torvalds authored
      Pull drm updates from Dave Airlie:
       "This is the main pull request for the drm, I think I've got one later
        driver pull for mediatek SoC driver, I'm undecided on if it needs to
        go to you yet.
      
        Otherwise summary below:
      
        Core drm:
         - Atomic add driver private objects
         - Deprecate preclose hook in modern drivers
         - MST bandwidth tracking
         - Use kvmalloc in more places
         - Add mode_valid hook for crtc/encoder/bridge
         - Reduce sync_file construction time
         - Documentation updates
         - New DRM synchronisation object support
      
        New drivers:
         - pl111 - pl111 CLCD display controller
      
        Panel:
         - Innolux P079ZCA panel driver
         - Add NL12880B20-05, NL192108AC18-02D, P320HVN03 panels
         - panel-samsung-s6e3ha2: Add s6e3hf2 panel support
      
        i915:
         - SKL+ watermark fixes
         - G4x/G33 reset improvements
         - DP AUX backlight improvements
         - Buffer based GuC/host communication
         - New getparam for (sub)slice information
         - Cannonlake and Coffeelake initial patches
         - Execbuf optimisations
      
        radeon/amdgpu:
         - Lots of Vega10 bug fixes
         - Preliminary raven support
         - KIQ support for compute rings
         - MEC queue management rework
         - DCE6 Audio support
         - SR-IOV improvements
         - Better radeon/amdgpu selection support
      
        nouveau:
         - HDMI stereoscopic support
         - Display code rework for >= GM20x GPUs
      
        msm:
         - GEM rework for fine-grained locking
         - Per-process pagetable work
         - HDMI fixes for Snapdragon 820.
      
        vc4:
         - Remove 256MB CMA limit from vc4
         - Add out-fence support
         - Add support for cygnus
         - Get/set tiling ioctls support
         - Add T-format tiling support for scanout
      
        zte:
         - add VGA support.
      
        etnaviv:
         - Thermal throttle support for newer GPUs
         - Restore userspace buffer cache performance
         - dma-buf sync fix
      
        stm:
         - add stm32f429 display support
      
        exynos:
         - Rework vblank handling
         - Fixup sw-trigger code
      
        sun4i:
         - V3s display engine support
         - HDMI support for older SoCs
         - Preliminary work on dual-pipeline SoCs.
      
        rcar-du:
         - VSP work
      
        imx-drm:
         - Remove counter load enable from PRE
         - Double read/write reduction flag support
      
        tegra:
         - Documentation for the host1x and drm driver.
         - Lots of staging ioctl fixes due to grate project work.
      
        omapdrm:
         - dma-buf fence support
         - TILER rotation fixes"
      
      * tag 'drm-for-v4.13' of git://people.freedesktop.org/~airlied/linux: (1270 commits)
        drm: Remove unused drm_file parameter to drm_syncobj_replace_fence()
        drm/amd/powerplay: fix bug fail to remove sysfs when rmmod amdgpu.
        amdgpu: Set cik/si_support to 1 by default if radeon isn't built
        drm/amdgpu/gfx9: fix driver reload with KIQ
        drm/amdgpu/gfx8: fix driver reload with KIQ
        drm/amdgpu: Don't call amd_powerplay_destroy() if we don't have powerplay
        drm/ttm: Fix use-after-free in ttm_bo_clean_mm
        drm/amd/amdgpu: move get memory type function from early init to sw init
        drm/amdgpu/cgs: always set reference clock in mode_info
        drm/amdgpu: fix vblank_time when displays are off
        drm/amd/powerplay: power value format change for Vega10
        drm/amdgpu/gfx9: support the amdgpu.disable_cu option
        drm/amd/powerplay: change PPSMC_MSG_GetCurrPkgPwr for Vega10
        drm/amdgpu: Make amdgpu_cs_parser_init static (v2)
        drm/amdgpu/cs: fix a typo in a comment
        drm/amdgpu: Fix the exported always on CU bitmap
        drm/amdgpu/gfx9: gfx_v9_0_enable_gfx_static_mg_power_gating() can be static
        drm/amdgpu/psp: upper_32_bits/lower_32_bits for address setup
        drm/amd/powerplay/cz: print message if smc message fails
        drm/amdgpu: fix typo in amdgpu_debugfs_test_ib_init
        ...
      af3c8d98
  2. 09 Jul, 2017 10 commits
    • David Howells's avatar
      afs: Add metadata xattrs · d3e3b7ea
      David Howells authored
      Add xattrs to allow the user to get/set metadata in lieu of having pioctl()
      available.  The following xattrs are now available:
      
       - "afs.cell"
      
         The name of the cell in which the vnode's volume resides.
      
       - "afs.fid"
      
         The volume ID, vnode ID and vnode uniquifier of the file as three hex
         numbers separated by colons.
      
       - "afs.volume"
      
         The name of the volume in which the vnode resides.
      
      For example:
      
      	# getfattr -d -m ".*" /mnt/scratch
      	getfattr: Removing leading '/' from absolute path names
      	# file: mnt/scratch
      	afs.cell="mycell.myorg.org"
      	afs.fid="10000b:1:1"
      	afs.volume="scratch"
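
      As an illustration of the "afs.fid" format described above, the sketch
      below splits such a value into its three hex fields.  The `AfsFid` type
      and `parse_afs_fid` function are hypothetical names introduced for this
      example only.

```python
from typing import NamedTuple


class AfsFid(NamedTuple):
    volume_id: int
    vnode_id: int
    uniquifier: int


def parse_afs_fid(value):
    """Parse an "afs.fid" string: three hex numbers separated by colons."""
    parts = value.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected three colon-separated fields: {value!r}")
    return AfsFid(*(int(part, 16) for part in parts))
```

      On Linux the raw value could be fetched with
      `os.getxattr("/mnt/scratch", "afs.fid").decode()` before parsing, using
      the mount point from the example above.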
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d3e3b7ea
    • Marc Dionne's avatar
      afs: Ignore AFS_ACE_READ and AFS_ACE_WRITE for directories · fd249821
      Marc Dionne authored
      The AFS_ACE_READ and AFS_ACE_WRITE permission bits should not
      be used to make access decisions for the directory itself.  They
      are meant to control access for the objects contained in that
      directory.
      
      Reading a directory is allowed if the AFS_ACE_LOOKUP bit is set.
      Checking AFS_ACE_READ would cause an incorrect access denied error
      for a directory with AFS_ACE_LOOKUP but not AFS_ACE_READ.
      
      The AFS_ACE_WRITE bit does not allow operations that modify the
      directory.  For a directory with AFS_ACE_WRITE but neither
      AFS_ACE_INSERT nor AFS_ACE_DELETE, honouring AFS_ACE_WRITE would
      result in trying operations that would ultimately be denied by the
      server.
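
      The directory rules described above can be sketched as two predicates.
      This is a simplified userspace model, not the kernel's permission code:
      the bit values are assumptions made for illustration (the kernel's AFS
      headers hold the real definitions) and the function names are
      hypothetical.

```python
# Assumed bit values for illustration only.
AFS_ACE_READ = 0x01
AFS_ACE_WRITE = 0x02
AFS_ACE_INSERT = 0x04
AFS_ACE_LOOKUP = 0x08
AFS_ACE_DELETE = 0x10


def dir_may_read(access):
    """Reading a directory depends on LOOKUP; READ is ignored."""
    return bool(access & AFS_ACE_LOOKUP)


def dir_may_modify(access):
    """Modifying a directory needs INSERT or DELETE; WRITE is ignored."""
    return bool(access & (AFS_ACE_INSERT | AFS_ACE_DELETE))
```

      With these predicates, a directory granting only AFS_ACE_LOOKUP is
      readable, and one granting only AFS_ACE_WRITE is not treated as
      modifiable.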
      Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fd249821
    • Cong Wang's avatar
      mqueue: fix a use-after-free in sys_mq_notify() · f991af3d
      Cong Wang authored
      The retry logic for netlink_attachskb() inside sys_mq_notify()
      is nasty and vulnerable:
      
      1) The sock refcnt is already released when retry is needed
      2) The fd is controllable by user-space because we already
         release the file refcnt
      
      so when we retry and the fd has just been closed by user-space
      during this small window, we end up calling netlink_detachskb()
      on the error path, which releases the sock again; later, when
      user-space closes this socket, a use-after-free could be
      triggered.
      
      Setting 'sock' to NULL here should be sufficient to fix it.
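
      The reference counting problem described above can be modelled in a
      few lines.  This is a toy simulation under simplifying assumptions, not
      the kernel code: `Sock` and `simulate` are hypothetical names, and the
      final `put()` stands in for the last legitimate owner releasing the
      socket.

```python
class Sock:
    """Minimal reference-counted object standing in for the socket."""

    def __init__(self):
        self.refs = 1  # reference held by the owner of the socket

    def get(self):
        self.refs += 1

    def put(self):
        if self.refs <= 0:
            raise RuntimeError("use-after-free: object already released")
        self.refs -= 1


def simulate(clear_sock_on_retry):
    """Model a retry followed by a failed lookup; return the final refcount."""
    owner_sock = Sock()
    owner_sock.get()   # the notify path looks up the fd and takes a reference
    sock = owner_sock

    sock.put()         # attach asks for a retry; the reference is already dropped
    if clear_sock_on_retry:
        sock = None    # the fix: forget the pointer whose reference is gone

    # Retry: the fd lookup now fails, so the error path runs.
    if sock is not None:
        sock.put()     # stale pointer: the same reference is released again

    owner_sock.put()   # the owner finally releases the socket
    return owner_sock.refs
```

      In this model `simulate(True)` ends with a reference count of 0, while
      `simulate(False)` raises because the owner's final release touches an
      object that the extra `put()` has already released.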
      Reported-by: GeneBlue <geneblue.mail@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f991af3d
    • Linus Torvalds's avatar
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 2b976203
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
       "The x86 updates contain:
      
         - A fix for a longstanding PAT bug, where PAT was reported on CPUs
           that do not support it, which leads to wrong caching attributes and
           missing MTRR updates
      
         - Prevent overwriting of the e820 firmware table, which causes kexec
           kernels to lose the fake mptable which is stored there.
      
         - Cleanup of the UV/BAU code, removing unused code and making local
           functions static"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/boot/e820: Introduce the bootloader provided e820_table_firmware[] table
        x86/boot/e820: Rename the e820_table_firmware to e820_table_kexec
        x86/boot/e820: Avoid overwriting e820_table_firmware
        x86/mm/pat: Don't report PAT on CPUs that don't support it
        x86/platform/uv/BAU: Minor cleanup, make some local functions static
      2b976203
    • Linus Torvalds's avatar
      Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 8d97a6c3
      Linus Torvalds authored
      Pull timers fixlet from Thomas Gleixner:
       "Add Frederic Weisbecker as NOHZ/dyntick maintainer"
      
      [ And an unmentioned and unrelated typo fix in the same commit? Hmm.. ]
      
      * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        MAINTAINERS: Add Frederic Weisbecker as nohz/dyntics maintainer
      8d97a6c3
    • Linus Torvalds's avatar
      Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 4d3c4a42
      Linus Torvalds authored
      Pull smp/hotplug fix from Thomas Gleixner:
       "A single fix for a brown paperbag bug:
      
        The unparking of the initial percpu threads of an upcoming CPU happens
        right now on the idle task, but that's wrong as the unpark function
        might sleep. Move it to the control CPU."
      
      * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        smp/hotplug: Move unparking of percpu threads to the control CPU
      4d3c4a42
    • Linus Torvalds's avatar
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 4fde846a
      Linus Torvalds authored
      Pull scheduler fixes from Thomas Gleixner:
       "This scheduler update provides:
      
         - The (hopefully) final fix for the vtime accounting issues which
           were around for quite some time
      
         - Use types known to user space in UAPI headers to unbreak user space
           builds
      
         - Make load balancing respect the current scheduling domain again
           instead of evaluating unrelated CPUs"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/headers/uapi: Fix linux/sched/types.h userspace compilation errors
        sched/fair: Fix load_balance() affinity redo path
        sched/cputime: Accumulate vtime on top of nsec clocksource
        sched/cputime: Move the vtime task fields to their own struct
        sched/cputime: Rename vtime fields
        sched/cputime: Always set tsk->vtime_snap_whence after accounting vtime
        vtime, sched/cputime: Remove vtime_account_user()
        Revert "sched/cputime: Refactor the cputime_adjust() code"
      4fde846a
    • Linus Torvalds's avatar
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c3931a87
      Linus Torvalds authored
      Pull perf fixes from Thomas Gleixner:
       "A couple of fixes for perf and kprobes:
      
         - Add the missing exclude_kernel attribute for the precise_ip level so
           !CAP_SYS_ADMIN users get the proper results.
      
         - Warn instead of failing completely when perf has no unwind support
           for a particular architecture built in.
      
         - Ensure that jprobes are at function entry and not at some random
           place"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        kprobes: Ensure that jprobe probepoints are at function entry
        kprobes: Simplify register_jprobes()
        kprobes: Rename [arch_]function_offset_within_entry() to [arch_]kprobe_on_func_entry()
        perf unwind: Do not fail due to missing unwind support
        perf evsel: Set attr.exclude_kernel when probing max attr.precise_ip
      c3931a87
    • Linus Torvalds's avatar
      Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c8b2ba83
      Linus Torvalds authored
      Pull locking fixes from Thomas Gleixner:
      
       - Fix the EINTR logic in rwsem-spinlock to avoid double locking by a
         writer and a reader
      
       - Add a missing include to qspinlocks
      
      * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        locking/qspinlock: Explicitly include asm/prefetch.h
        locking/rwsem-spinlock: Fix EINTR branch in __down_write_common()
      c8b2ba83
    • Linus Torvalds's avatar
      Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7cb328c3
      Linus Torvalds authored
      Pull irq fixes from Thomas Gleixner:
      
       - A few fixes mopping up the fallout of the big irq overhaul
      
       - Move the interrupt resource management logic out of the spin locked,
         irq disabled region to avoid unnecessary restrictions of the resource
         callbacks
      
       - Preparation for reworking the per cpu irq request function.
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqdomain: Allow ACPI device nodes to be used as irqdomain identifiers
        genirq/debugfs: Remove redundant NULL pointer check
        genirq: Allow to pass the IRQF_TIMER flag with percpu irq request
        genirq/timings: Move free timings out of spinlocked region
        genirq: Move irq resource handling out of spinlocked region
        genirq: Add mutex to irq desc to serialize request/free_irq()
        genirq: Move bus locking into __setup_irq()
        genirq: Force inlining of __irq_startup_managed to prevent build failure
        genirq/debugfs: Fix build for !CONFIG_IRQ_DOMAIN
      7cb328c3