1. 10 Jul, 2017 40 commits
    • mm, memcg: fix potential undefined behavior in mem_cgroup_event_ratelimit() · 6a1a8b80
      Michal Hocko authored
      Alice has reported the following UBSAN splat:
      
        UBSAN: Undefined behaviour in mm/memcontrol.c:661:17
        signed integer overflow:
        -2147483644 - 2147483525 cannot be represented in type 'long int'
        CPU: 1 PID: 11758 Comm: mybibtex2filena Tainted: P           O 4.9.25-gentoo #4
        Hardware name: XXXXXX, BIOS YYYYYY
        Call Trace:
          dump_stack+0x59/0x87
          ubsan_epilogue+0xe/0x40
          handle_overflow+0xbb/0xf0
          __ubsan_handle_sub_overflow+0x12/0x20
          memcg_check_events.isra.36+0x223/0x360
          mem_cgroup_commit_charge+0x55/0x140
          wp_page_copy+0x34e/0xb80
          do_wp_page+0x1e6/0x1300
          handle_mm_fault+0x88b/0x1990
          __do_page_fault+0x2de/0x8a0
          do_page_fault+0x1a/0x20
          error_code+0x67/0x6c
      
      The reason is that we subtract two signed types.  Let's fix this by
      truly mimicking time_after and casting the result of the subtraction.
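The idiom can be illustrated outside the kernel. This is a minimal sketch, assuming both counters are held as unsigned long; `event_is_due` is a hypothetical name, not the kernel function:

```c
#include <stdbool.h>

/*
 * Hypothetical helper in the style of time_after(): the subtraction is
 * done on unsigned values, where wraparound is well defined, and only
 * the result is reinterpreted as signed.
 */
static bool event_is_due(unsigned long val, unsigned long next)
{
	return (long)(next - val) < 0;
}
```

Converting an out-of-range unsigned value to long is implementation-defined rather than undefined; GCC and Clang wrap it modulo 2^N, which is the behaviour this comparison relies on.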
      
      Link: http://lkml.kernel.org/r/20170616150057.GQ30580@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Alice Ferrazzi <alicef@gentoo.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: schedule when potentially allocating many hugepages · 69ed779a
      David Rientjes authored
      A few hugetlb allocators loop while calling the page allocator and can
      potentially prevent rescheduling if the page allocator slowpath is not
      utilized.
      
      Conditionally schedule when large numbers of hugepages can be allocated.
      
      Anshuman:
       "Fixes a task which was getting hung while writing like 10000 hugepages
        (16MB on POWER8) into /proc/sys/vm/nr_hugepages."
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1706091535300.66176@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Tested-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: unify new_node_page and alloc_migrate_target · 8b913238
      Michal Hocko authored
      Commit 394e31d2 ("mem-hotplug: alloc new page from a nearest
      neighbor node when mem-offline") has duplicated a large part of
      alloc_migrate_target with some hotplug specific special casing.
      
      To be more precise it tried to enforce the allocation from a different
      node than the original page.  As a result the two functions diverged in
      their shared logic, e.g. the hugetlb allocation strategy.
      
      Let's unify the two and express different NUMA requirements by the given
      nodemask.  new_node_page will simply exclude the node it doesn't care
      about and alloc_migrate_target will use all the available nodes.
      alloc_migrate_target will then learn to migrate hugetlb pages more
      sanely and use preallocated pool when possible.
      
      Please note that alloc_migrate_target used to call alloc_page or
      alloc_pages_current and so applied the memory policy of the current
      context, which is quite strange when we consider that it is used in the
      context of alloc_contig_range which just tries to migrate pages which
      stand in the way.
      
      Link: http://lkml.kernel.org/r/20170608074553.22152-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb, memory_hotplug: prefer to use reserved pages for migration · 4db9b2ef
      Michal Hocko authored
      new_node_page will try to use the origin's next NUMA node as the
      migration destination for hugetlb pages.  If such a node doesn't have
      any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol
      to allocate a surplus page instead.  This is quite suboptimal for any
      configuration where hugetlb pages are not distributed to all NUMA nodes
      evenly.  Say we have a hotpluggable node 4 and spare hugetlb pages are on
      node 0
      
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
        /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0
      
      Now we consume the whole pool on node 4 and try to offline this node.
      All the allocated pages should be moved to node0 which has enough
      preallocated pages to hold them.  With the current implementation
      offlining very likely fails because hugetlb allocations during runtime
      are much less reliable.
      
      Fix this by reusing the nodemask which excludes the migration source:
      first try to find a node which has a page in the preallocated pool, and
      fall back to __alloc_buddy_huge_page_no_mpol only when the whole pool is
      consumed.
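The selection order described above can be sketched in plain C. This is only an illustration under stated assumptions: the per-node free counts are passed in as an array, the nodemask is a 64-bit bitmap, and `pick_pool_node` is a hypothetical name rather than a kernel function.

```c
#include <stdint.h>

/*
 * Return the first allowed node that still has a free page in its
 * preallocated pool, or -1 when every allowed pool is consumed and the
 * caller would have to fall back to a fresh (surplus) allocation.
 */
static int pick_pool_node(const unsigned long *free_huge_pages,
			  int nr_nodes, uint64_t allowed)
{
	for (int nid = 0; nid < nr_nodes && nid < 64; nid++) {
		if (!(allowed & (UINT64_C(1) << nid)))
			continue;	/* node excluded, e.g. the migration source */
		if (free_huge_pages[nid] > 0)
			return nid;
	}
	return -1;
}
```

With the layout from the example (spare pages only on node 0, node 4 excluded as the source), the sketch selects node 0.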
      
      [akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub]
      Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: simplify empty node mask handling in new_node_page · 7f252f27
      Michal Hocko authored
      new_node_page tries to allocate the target page on a different NUMA node
      than the source page.  This makes sense in most cases during the hotplug
      because we are likely to offline the whole NUMA node.  But there are
      cases where there are no other nodes to fall back to (e.g. when offlining
      parts of the only existing node) and we have to fall back to allocating
      from the source node.  The current code does that but it can be
      simplified by checking the nmask and updating it before we even try to
      allocate rather than special casing it.
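The "check the mask and update it up front" approach might look roughly like this. It is a simplified sketch, assuming a 64-bit bitmap stands in for the kernel nodemask and using the hypothetical name `target_nodes`:

```c
#include <stdint.h>

/*
 * Build the set of candidate target nodes: every online node except the
 * source.  If that leaves nothing (the source is the only node), put the
 * source node back before any allocation is attempted.
 */
static uint64_t target_nodes(uint64_t online, unsigned int src_nid)
{
	uint64_t mask = online;

	if (src_nid < 64)
		mask &= ~(UINT64_C(1) << src_nid);
	if (mask == 0)
		mask = online;	/* no other node to fall back to */
	return mask;
}
```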
      
      This patch shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170608074553.22152-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: support movable_node for hotpluggable nodes · 9f123ab5
      Michal Hocko authored
      The movable_node kernel parameter makes hotpluggable NUMA nodes put all
      their hotpluggable memory into the movable zone, which allows more or
      less reliable memory hotremove.  At least this is the case for the NUMA
      nodes present during the boot (see find_zone_movable_pfns_for_nodes).
      
      This is not the case for the memory hotplug, though.
      
      	echo online > /sys/devices/system/memory/memoryXYZ/state
      
      will default to a kernel zone (usually ZONE_NORMAL) unless the
      particular memblock is already in the movable zone range which is not
      the case normally when onlining the memory from the udev rule context
      for a freshly hotadded NUMA node.  The only option currently is to have
      a special udev rule to echo online_movable to all memblocks belonging to
      such a node which is rather clumsy.  Not to mention this is inconsistent
      as well because what ended up in the movable zone during the boot will
      end up in a kernel zone after hotremove & hotadd without special care.
      
      It would be nice to reuse memblock_is_hotpluggable but the runtime
      hotplug doesn't have that information available because the boot and
      hotplug paths are not shared and it would be really non-trivial to make
      them use the same code path because the runtime hotplug doesn't play
      with the memblock allocator at all.
      
      Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
      movable_node is enabled and the range doesn't overlap with the existing
      normal zone.  This should provide a reasonable default onlining
      strategy.
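The default described above can be sketched as a small decision function. This is a deliberately simplified illustration, not the kernel's move_pfn_range: the enum, the half-open PFN ranges and the name `default_online_zone` are all assumptions made for the example.

```c
#include <stdbool.h>

enum zone_sketch { ZONE_SKETCH_NORMAL, ZONE_SKETCH_MOVABLE };

/*
 * Pick the zone for a plain "online" request.  Ranges are half-open
 * [start, end); an empty normal zone is expressed as start == end.
 */
static enum zone_sketch default_online_zone(bool movable_node,
					    unsigned long start_pfn,
					    unsigned long end_pfn,
					    unsigned long normal_start,
					    unsigned long normal_end)
{
	bool normal_empty = normal_start == normal_end;
	bool overlaps = !normal_empty &&
			start_pfn < normal_end && normal_start < end_pfn;

	if (movable_node && !overlaps)
		return ZONE_SKETCH_MOVABLE;
	return ZONE_SKETCH_NORMAL;	/* kernel zone default */
}
```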
      
      Strictly speaking the semantics are not identical to the boot time
      initialization because find_zone_movable_pfns_for_nodes covers only the
      hotpluggable range as described by the BIOS/FW.  From my experience this
      is usually a full node though (except for Node0 which is special and
      never goes away completely).  If this turns out to be a problem in
      real life we can tweak the code to store the hotplug flag in memblocks
      but let's keep this simple for now.
      
      Link: http://lkml.kernel.org/r/20170612111227.GI7476@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: <slaoub@gmail.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: use __sysfs_match_string() helper · ed8a5553
      Andy Shevchenko authored
      Use __sysfs_match_string() helper instead of open coded variant.
      
      Link: http://lkml.kernel.org/r/20170609120835.22156-1-andriy.shevchenko@linux.intel.com
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate.c: stabilise page count when migrating transparent hugepages · f4e177d1
      Will Deacon authored
      When migrating a transparent hugepage, migrate_misplaced_transhuge_page
      guards itself against a concurrent fastgup of the page by checking that
      the page count is equal to 2 before and after installing the new pmd.
      
      If the page count changes, then the pmd is reverted back to the original
      entry, however there is a small window where the new (possibly writable)
      pmd is installed and the underlying page could be written by userspace.
      Restoring the old pmd could therefore result in loss of data.
      
      This patch fixes the problem by freezing the page count whilst updating
      the page tables, which protects against a concurrent fastgup without the
      need to restore the old pmd in the failure case (since the page count
      can no longer change under our feet).
      
      Link: http://lkml.kernel.org/r/1497349722-6731-4-git-send-email-will.deacon@arm.com
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/page_ref.h: ensure page_ref_unfreeze is ordered against prior accesses · 108a7ac4
      Will Deacon authored
      page_ref_freeze and page_ref_unfreeze are designed to be used as a pair,
      wrapping a critical section where struct pages can be modified without
      having to worry about consistency for a concurrent fast-GUP.
      
      Whilst page_ref_freeze has full barrier semantics due to its use of
      atomic_cmpxchg, page_ref_unfreeze is implemented using atomic_set, which
      doesn't provide any barrier semantics and allows the operation to be
      reordered with respect to page modifications in the critical section.
      
      This patch ensures that page_ref_unfreeze is ordered after any critical
      section updates, by invoking smp_mb() prior to the atomic_set.
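The freeze/unfreeze pairing can be approximated in userspace with C11 atomics. This is a hedged analogue, not the kernel code: `struct page_sketch`, `ref_freeze` and `ref_unfreeze` are hypothetical names, and a sequentially consistent fence stands in for smp_mb().

```c
#include <stdatomic.h>
#include <stdbool.h>

struct page_sketch {
	atomic_int refcount;
	int payload;		/* stands in for state changed while frozen */
};

/* Freeze: succeeds only if the count is exactly 'expected'; sets it to 0. */
static bool ref_freeze(struct page_sketch *p, int expected)
{
	return atomic_compare_exchange_strong(&p->refcount, &expected, 0);
}

/*
 * Unfreeze: the fence orders all earlier writes to the page before the
 * plain store that makes the count visible again.
 */
static void ref_unfreeze(struct page_sketch *p, int count)
{
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&p->refcount, count, memory_order_relaxed);
}
```

Without the fence, the relaxed store could become visible before the updates made inside the critical section, which is the reordering the commit message describes.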
      
      Link: http://lkml.kernel.org/r/1497349722-6731-3-git-send-email-will.deacon@arm.com
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Steve Capper <steve.capper@arm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: always enable thp for dax mappings · baabda26
      Dan Williams authored
      The madvise policy for transparent huge pages is meant to avoid unwanted
      allocations of transparent huge pages.  It allows a policy of disabling
      the extra memory pressure and effort to arrange for a huge page when it
      is not needed.
      
      DAX by definition never incurs this overhead since it is statically
      allocated.  The policy choice makes even less sense for device-dax which
      tries to guarantee a given tlb-fault size.  Specifically, the following
      setting:
      
      	echo never > /sys/kernel/mm/transparent_hugepage/enabled
      
      ...violates that guarantee and silently disables all device-dax
      instances with a 2M or 1G alignment.  So, let's avoid that non-obvious
      side effect by force enabling thp for dax mappings in all cases.
      
      It is worth noting that the reason this uses vma_is_dax(), and the
      resulting header include changes, is that previous attempts to add a
      VM_DAX flag were NAKd.
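The resulting policy can be sketched as a small predicate. This is a simplified illustration only: the struct fields, the enum and the exact ordering of the checks are assumptions for the example and do not reproduce the kernel's transparent_hugepage_enabled().

```c
#include <stdbool.h>

enum thp_policy { THP_NEVER, THP_MADVISE, THP_ALWAYS };

struct vma_sketch {
	bool nohugepage;	/* mapping explicitly opted out */
	bool madv_hugepage;	/* mapping opted in via madvise */
	bool is_dax;		/* what vma_is_dax() would report */
};

static bool thp_enabled_sketch(enum thp_policy policy,
			       const struct vma_sketch *vma)
{
	if (vma->nohugepage)
		return false;
	if (policy == THP_ALWAYS)
		return true;
	if (vma->is_dax)
		return true;	/* DAX ignores the "never"/"madvise" policy */
	if (policy == THP_MADVISE)
		return vma->madv_hugepage;
	return false;
}
```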
      
      Link: http://lkml.kernel.org/r/149739531127.20686.15813586620597484283.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: improve readability of transparent_hugepage_enabled() · 16981d76
      Dan Williams authored
      Turn the macro into a static inline and rewrite the condition checks for
      better readability in preparation for adding another condition.
      
      [ross.zwisler@linux.intel.com: fix logic to make conversion equivalent]
      [akpm@linux-foundation.org: resolve vs mm-make-pr_set_thp_disable-immediately-active.patch]
      [akpm@linux-foundation.org: include coredump.h for MMF_DISABLE_THP]
      Link: http://lkml.kernel.org/r/149739530612.20686.14760671150202647861.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • oom, trace: remove ENUM evaluation of COMPACTION_FEEDBACK · 7ab0e50a
      Steven Rostedt (VMware) authored
      After enabling CONFIG_TRACE_ENUM_MAP_FILE (which will soon be renamed to
      CONFIG_TRACE_EVAL_MAP_FILE), I am able to examine the enums that have
      been evaluated:
      
       # cat /sys/kernel/debug/tracing/enum_map
      
      (which will soon be renamed to eval_map)
      
      And it showed some interesting results:
      
        [..]
        ZONE_MOVABLE 3 (oom)
        ZONE_NORMAL 2 (oom)
        ZONE_DMA32 1 (oom)
        ZONE_DMA 0 (oom)
        3 3 (oom)
        2 2 (oom)
        1 1 (oom)
        COMPACT_PRIO_ASYNC 2 (oom)
        COMPACT_PRIO_SYNC_LIGHT 1 (oom)
        COMPACT_PRIO_SYNC_FULL 0 (oom)
        [..]
        ZONE_DMA 0 (vmscan)
        3 3 (vmscan)
        2 2 (vmscan)
        1 1 (vmscan)
        COMPACT_PRIO_ASYNC 2 (vmscan)
        [..]
        ZONE_DMA 0 (kmem)
        3 3 (kmem)
        2 2 (kmem)
        1 1 (kmem)
        COMPACT_PRIO_ASYNC 2 (kmem)
        [..]
        ZONE_DMA 0 (compaction)
        3 3 (compaction)
        2 2 (compaction)
        1 1 (compaction)
        COMPACT_PRIO_ASYNC 2 (compaction)
        [..]
      
      The names within the parentheses are the trace systems that the enum/eval
      maps are associated with. When there's a number evaluated to another
      number, that tells me that the TRACE_DEFINE_ENUM() was used on a #define
      and not an enum. As #defines get converted normally, they do not need
      to be evaluated.
      
      Each of the above trace systems with the number to number evaluation
      included the file include/trace/events/mmflags.h which has:
      
       /* High-level compaction status feedback */
       #define COMPACTION_FAILED       1
       #define COMPACTION_WITHDRAWN    2
       #define COMPACTION_PROGRESS     3
      
      [..]
      
       #define COMPACTION_FEEDBACK             \
              EM(COMPACTION_FAILED,           "failed")       \
              EM(COMPACTION_WITHDRAWN,        "withdrawn")    \
              EMe(COMPACTION_PROGRESS,        "progress")
      
      Which is still needed for the __print_symbolic() usage in the
      trace_event.  But it does not need to be evaluated.
      
      Removing the evaluation part removes the unnecessary evaluations of
      numbers to numbers.
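The EM()/EMe() list is an X-macro: the same list can be expanded into different things depending on how EM and EMe are defined at the point of use. A minimal standalone sketch of the value-to-name table expansion follows; `struct symbol_sketch` and `feedback_name` are hypothetical names for illustration, not tracing infrastructure.

```c
#include <stddef.h>

#define COMPACTION_FAILED       1
#define COMPACTION_WITHDRAWN    2
#define COMPACTION_PROGRESS     3

#define COMPACTION_FEEDBACK                             \
	EM(COMPACTION_FAILED,           "failed")       \
	EM(COMPACTION_WITHDRAWN,        "withdrawn")    \
	EMe(COMPACTION_PROGRESS,        "progress")

/* Expand each entry into a { value, name } initializer. */
#define EM(value, name)  { value, name },
#define EMe(value, name) { value, name }

struct symbol_sketch {
	int value;
	const char *name;
};

static const struct symbol_sketch feedback_symbols[] = { COMPACTION_FEEDBACK };

/* Look up the symbolic name for a value; NULL if it is not in the list. */
static const char *feedback_name(int value)
{
	size_t n = sizeof(feedback_symbols) / sizeof(feedback_symbols[0]);

	for (size_t i = 0; i < n; i++)
		if (feedback_symbols[i].value == value)
			return feedback_symbols[i].name;
	return NULL;
}
```

Because the values are #defines, the preprocessor has already replaced them with 1, 2 and 3 by the time the table is built, which is why no separate enum evaluation is required for them.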
      
      Link: http://lkml.kernel.org/r/20170615074944.7be9a647@gandalf.local.home
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb.c: warn the user when issues arise on boot due to hugepages · d715cf80
      Liam R. Howlett authored
      When the user specifies too many hugepages or an invalid
      default_hugepagesz the communication to the user is implicit in the
      allocation message.  This patch adds a warning when the desired page
      count is not allocated and prints an error when the default_hugepagesz
      is invalid on boot.
      
      During boot hugepages will allocate until there is a fraction of the
      hugepage size left.  That is, we allocate until either the request is
      satisfied or memory for the pages is exhausted.  When memory for the
      pages is exhausted, it will most likely lead to the system failing with
      the OOM manager not finding enough (or anything) to kill (unless you're
      using really big hugepages in the order of 100s of MB or in the GBs).
      The user will most likely see the OOM messages much later in the boot
      sequence than the implicitly stated message.  Worse yet, you may even
      get an OOM for each processor which causes many pages of OOMs on modern
      systems.  Although these messages will be printed earlier than the OOM
      messages, at least giving the user errors and warnings will highlight
      the configuration as an issue.  I'm trying to point the user in the
      right direction by providing a more robust statement of what is failing.
      
      During the sysctl or echo command, the user can check the results much
      more easily than if the system hangs during boot, and the scenario of
      having nothing to OOM for kernel memory is highly unlikely.
      
      Mike said:
       "Before sending out this patch, I asked Liam off list why he was doing
        it. Was it something he just thought would be useful? Or, was there
        some type of user situation/need. He said that he had been called in
        to assist on several occasions when a system OOMed during boot. In
        almost all of these situations, the user had grossly misconfigured
        huge pages.
      
        DB users want to pre-allocate just the right amount of huge pages, but
        sometimes they can be really off. In such situations, the huge page
        init code just allocates as many huge pages as it can and reports the
        number allocated. There is no indication that it quit allocating
        because it ran out of memory. Of course, a user could compare the
        number in the message to what they requested on the command line to
        determine if they got all the huge pages they requested. The thought
        was that it would be useful to at least flag this situation. That way,
        the user might be able to better relate the huge page allocation
        failure to the OOM.
      
        I'm not sure if the e-mail discussion made it obvious that this is
        something he has seen on several occasions.
      
        I see Michal's point that this will only flag the situation where
        someone configures huge pages very badly. And, a more extensive look
        at the situation of misconfiguring huge pages might be in order. But,
        this has happened on several occasions which led to the creation of
        this patch"
      
      [akpm@linux-foundation.org: reposition memfmt() to avoid forward declaration]
      Link: http://lkml.kernel.org/r/20170603005413.10380-1-Liam.Howlett@Oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: zhongjiang <zhongjiang@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/cma.c: warn if the CMA area could not be activated · e35ef639
      Anshuman Khandual authored
      While activating a CMA area we check to make sure that all the PFNs in
      the range are inside the same zone.  This is a requirement for
      alloc_contig_range() to work.  Any CMA area failing the check is
      disabled for good.  This happens silently right now, making the failure
      of all future cma_alloc() allocations inevitable.
      
      Here we add an error message stating that the CMA area could not be
      activated which makes it easier to explain any future cma_alloc()
      failures on it.  While in there, change the bail out goto label from
      'err' to 'not_in_zone' which makes more sense.
      
      Link: http://lkml.kernel.org/r/20170605023729.26303-1-khandual@linux.vnet.ibm.com
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmalloc: show lazy-purged vma info in vmallocinfo · 78c72746
      Yisheng Xie authored
      When ioremapping a 67112960-byte vm_area with the following vmallocinfo:
       [..]
       0xec79b000-0xec7fa000  389120 ftl_add_mtd+0x4d0/0x754 pages=94 vmalloc
       0xec800000-0xecbe1000 4067328 kbox_proc_mem_write+0x104/0x1c4 phys=8b520000 ioremap
      
      we get the result:
       0xf1000000-0xf5001000 67112960 devm_ioremap+0x38/0x7c phys=40000000 ioremap
      
      The alignment for ioremap can be at most '1 << IOREMAP_MAX_ORDER':
      
      	if (flags & VM_IOREMAP)
      		align = 1ul << clamp_t(int, get_count_order_long(size),
      			PAGE_SHIFT, IOREMAP_MAX_ORDER);
      
      So I was a little puzzled why the vm_area jumped from
      0xec800000-0xecbe1000 to 0xf1000000-0xf5001000, leaving
      0xed000000-0xf1000000 as a big hole.
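The alignment arithmetic can be checked with a small standalone sketch. The constants are assumptions for illustration (a 4 KiB page, i.e. PAGE_SHIFT of 12, and an IOREMAP_MAX_ORDER of 24); the real values depend on the architecture and configuration, and the helper names are hypothetical.

```c
#include <stdint.h>

#define PAGE_SHIFT_SKETCH        12	/* assumption: 4 KiB pages */
#define IOREMAP_MAX_ORDER_SKETCH 24	/* assumption: 16 MiB cap */

/* Smallest order such that (1 << order) >= size. */
static int count_order_sketch(uint64_t size)
{
	int order = 0;

	while (order < 63 && (UINT64_C(1) << order) < size)
		order++;
	return order;
}

/* Alignment used for an ioremap area of the given size. */
static uint64_t ioremap_align_sketch(uint64_t size)
{
	int order = count_order_sketch(size);

	if (order < PAGE_SHIFT_SKETCH)
		order = PAGE_SHIFT_SKETCH;
	if (order > IOREMAP_MAX_ORDER_SKETCH)
		order = IOREMAP_MAX_ORDER_SKETCH;
	return UINT64_C(1) << order;
}
```

Under these assumptions a 67112960-byte request rounds up to order 27 and is clamped to order 24, giving a 16 MiB alignment. Both 0xed000000 and 0xf1000000 satisfy that alignment, so alignment alone does not account for the hole.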
      
      This patch shows all vm_areas, including vmas which are being freed
      but are still in the vmap_area_list, to make it clearer why we
      get 0xf1000000-0xf5001000 in the above case.  We will get a
      vmallocinfo like:
      
       [..]
       0xec79b000-0xec7fa000  389120 ftl_add_mtd+0x4d0/0x754 pages=94 vmalloc
       0xec800000-0xecbe1000 4067328 kbox_proc_mem_write+0x104/0x1c4 phys=8b520000 ioremap
       [..]
       0xece7c000-0xece7e000    8192 unpurged vm_area
       0xece7e000-0xece83000   20480 vm_map_ram
       0xf0099000-0xf00aa000   69632 vm_map_ram
      
      after this patch.
      
      Link: http://lkml.kernel.org/r/1496649682-20710-1-git-send-email-xieyisheng1@huawei.com
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: zijun_hu <zijun_hu@htc.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol: exclude @root from checks in mem_cgroup_low · 34c81057
      Sean Christopherson authored
      Make @root exclusive in mem_cgroup_low; it is never considered low when
      looked at directly and is not checked when traversing the tree.  In
      effect, @root is handled identically to how root_mem_cgroup was
      previously handled by mem_cgroup_low.
      
      If @root is not excluded from the checks, a cgroup underneath @root will
      never be considered low during targeted reclaim of @root, e.g.  due to
      memory.current > memory.high, unless @root is misconfigured to have
      memory.low > memory.high.
      
      Excluding @root enables using memory.low to prioritize memory usage
      between cgroups within a subtree of the hierarchy that is limited by
      memory.high or memory.max, e.g.  when ROOT owns @root's controls but
      delegates the @root directory to a USER so that USER can create and
      administer children of @root.
      
      For example, given cgroup A with children B and C:
      
          A
         / \
        B   C
      
      and
      
        1. A/memory.current > A/memory.high
        2. A/B/memory.current < A/B/memory.low
        3. A/C/memory.current >= A/C/memory.low
      
      As 'A' is high, i.e.  triggers reclaim from 'A', and 'B' is low, we
      should reclaim from 'C' until 'A' is no longer high or until we can no
      longer reclaim from 'C'.  If 'A', i.e.  @root, isn't excluded by
      mem_cgroup_low when reclaiming from 'A', then 'B' won't be considered low
      and we will reclaim indiscriminately from both 'B' and 'C'.
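The A/B/C example can be modelled with a small tree walk. This is a sketch under stated assumptions: `struct cg_sketch` and `cg_is_low` are hypothetical stand-ins, each group carries only a usage and a low value, and passing NULL as the root means the walk runs all the way to the top.

```c
#include <stdbool.h>
#include <stddef.h>

struct cg_sketch {
	struct cg_sketch *parent;
	unsigned long usage;	/* stands in for memory.current */
	unsigned long low;	/* stands in for memory.low */
};

/*
 * A group is low only if it and every ancestor strictly below 'root'
 * are under their low setting.  'root' itself is never checked and is
 * never reported as low.
 */
static bool cg_is_low(const struct cg_sketch *root, const struct cg_sketch *cg)
{
	if (cg == root)
		return false;
	for (; cg != NULL && cg != root; cg = cg->parent)
		if (cg->usage >= cg->low)
			return false;
	return true;
}
```

With A as the reclaim root, B (usage below its low setting) is reported as low and C is not. If the walk is allowed to include A, A's usage exceeds its low setting of zero and B stops being low, which is the behaviour the patch removes.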
      
      Here is the test I used to confirm the bug and the patch.
      
      20:00:55@sjchrist-vm ? ~ $ cat ~/.bin/memcg_low_test
      #!/bin/bash
      
      x62mb=$((62<<20))
      x66mb=$((66<<20))
      x94mb=$((94<<20))
      x98mb=$((98<<20))
      
      setup() {
          set -e
      
          if [[ -n $DEBUG ]]; then
              set -x
          fi
      
          trap teardown EXIT HUP INT TERM
      
          if [[ ! -e /mnt/1gb.swap ]]; then
              sudo fallocate -l 1G /mnt/1gb.swap > /dev/null
              sudo mkswap /mnt/1gb.swap > /dev/null
          fi
          if ! swapon --show=NAME | grep -q "/mnt/1gb.swap"; then
              sudo swapon /mnt/1gb.swap
          fi
      
          if [[ ! -e /cgroup/cgroup.controllers ]]; then
              sudo mount -t cgroup2 none /cgroup
          fi
      
          grep -q memory /cgroup/cgroup.controllers
      
          sudo sh -c "echo '+memory' > /cgroup/cgroup.subtree_control"
      
          sudo mkdir /cgroup/A && sudo chown $USER:$USER /cgroup/A
          sudo sh -c "echo '+memory' > /cgroup/A/cgroup.subtree_control"
          sudo sh -c "echo '96m' > /cgroup/A/memory.high"
      
          mkdir /cgroup/A/0
          mkdir /cgroup/A/1
      
          echo 64m > /cgroup/A/0/memory.low
      }
      
      teardown() {
          set +e
      
          trap - EXIT HUP INT TERM
      
          if [[ -z $1 ]]; then
              printf "\n"
              printf "%0.s*" {1..35}
              printf "\nFAILED!\n\n"
              tail /cgroup/A/**/memory.current
              printf "%0.s*" {1..35}
              printf "\n\n"
          fi
      
          ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill %
      
          sleep 2
      
          if [[ -e /cgroup/A/0 ]]; then
              rmdir /cgroup/A/0
          fi
          if [[ -e /cgroup/A/1 ]]; then
              rmdir /cgroup/A/1
          fi
          if [[ -e /cgroup/A ]]; then
              sudo rmdir /cgroup/A
          fi
      }
      
      stress_test() {
          sudo sh -c "echo $$ > /cgroup/A/$1/cgroup.procs"
          stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null &
      
          sudo sh -c "echo $$ > /cgroup/A/$2/cgroup.procs"
          stress --vm 1 --vm-bytes 64M --vm-keep > /dev/null &
      
          sudo sh -c "echo $$ > /cgroup/cgroup.procs"
      
          sleep 1
      
          # A/0 should be consuming more memory than A/1
          [[ $(cat /cgroup/A/0/memory.current) -ge $(cat /cgroup/A/1/memory.current) ]]
      
          # A/0 should be consuming ~64mb
          [[ $(cat /cgroup/A/0/memory.current) -ge $x62mb ]] && [[ $(cat /cgroup/A/0/memory.current) -le $x66mb ]]
      
          # A should cumulatively be consuming ~96mb
          [[ $(cat /cgroup/A/memory.current) -ge $x94mb ]] && [[ $(cat /cgroup/A/memory.current) -le $x98mb ]]
      
          # Stop the stressors
          ps | grep stress | tr -s ' ' | cut -f 2 -d ' ' | xargs -I % kill %
      }
      
      teardown 1
      setup
      
      for ((i=1;i<=$1;i++)); do
          printf "ITERATION $i of $1 - stress_test 0 1"
          stress_test 0 1
          printf "\x1b[2K\r"
      
          printf "ITERATION $i of $1 - stress_test 1 0"
          stress_test 1 0
          printf "\x1b[2K\r"
      
          printf "ITERATION $i of $1 - PASSED\n"
      done
      
      teardown 1
      
      echo PASSED!
      
      20:11:26@sjchrist-vm ? ~ $ memcg_low_test 10
      
      Link: http://lkml.kernel.org/r/1496434412-21005-1-git-send-email-sean.j.christopherson@intel.com
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34c81057
    • Michal Hocko's avatar
      mm: make PR_SET_THP_DISABLE immediately active · 18600332
      Michal Hocko authored
      PR_SET_THP_DISABLE has a rather subtle semantic.  It doesn't affect any
      existing mapping because it only updated mm->def_flags which is a
      template for new mappings.
      
      The mappings created after prctl(PR_SET_THP_DISABLE) have VM_NOHUGEPAGE
      flag set.  This can be quite surprising for all those applications which
      do not do prctl(); fork() & exec() and want to control their own THP
      behavior.
      
      Another usecase when the immediate semantic of the prctl might be useful
      is a combination of pre- and post-copy migration of containers with
      CRIU.  In this case CRIU populates a part of a memory region with data
      that was saved during the pre-copy stage.  Afterwards, the region is
      registered with userfaultfd and CRIU expects to get page faults for the
      parts of the region that were not yet populated.  However, khugepaged
      collapses the pages and the expected page faults do not occur.
      
      In the more general case, prctl(PR_SET_THP_DISABLE) could be used as a
      temporary mechanism for enabling/disabling THP process wide.
      
      Implementation wise, a new MMF_DISABLE_THP flag is added.  This flag is
      tested when the decision whether to use huge pages is taken, either
      during a page fault or at the time of THP collapse.
      
      It should be noted that the new implementation makes PR_SET_THP_DISABLE
      a master override of any per-VMA setting, which was not the case
      previously.
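      
      From user space, the flag discussed above is toggled with prctl(2).
      The following is a minimal sketch, not part of the patch; the wrapper
      names thp_set_disabled() and thp_get_disabled() are made up for
      illustration, and the fallback #defines are only there for older
      headers.

```c
#include <sys/prctl.h>

/* Fallbacks for old headers; the values match linux/prctl.h. */
#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif
#ifndef PR_GET_THP_DISABLE
#define PR_GET_THP_DISABLE 42
#endif

/* Hypothetical wrapper: returns 0 on success, -1 on error (errno set). */
static int thp_set_disabled(int disabled)
{
    return prctl(PR_SET_THP_DISABLE, disabled ? 1UL : 0UL, 0UL, 0UL, 0UL);
}

/* Hypothetical wrapper: returns 1 if THP is disabled, 0 if not, -1 on error. */
static int thp_get_disabled(void)
{
    return prctl(PR_GET_THP_DISABLE, 0UL, 0UL, 0UL, 0UL);
}
```

      A process that wants THP off for a critical phase could call
      thp_set_disabled(1) before it and thp_set_disabled(0) afterwards; with
      this patch the setting is meant to apply to existing mappings as well.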
      
      Fixes: a0715cc2 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
      Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      18600332
    • David Rientjes's avatar
      mm, vmpressure: pass-through notification support · b6bb9811
      David Rientjes authored
      By default, vmpressure events are not pass-through, i.e.  they propagate
      up through the memcg hierarchy until an event notifier is found for any
      threshold level.
      
      This presents a difficulty when a thread waiting on a read(2) for a
      vmpressure event cannot distinguish between local memory pressure and
      memory pressure in a descendant memcg, especially when that thread may
      not control the memcg hierarchy.
      
      Consider a user-controlled child memcg with a smaller limit than a
      top-level memcg controlled by the "Activity Manager" specified in
      Documentation/cgroup-v1/memory.txt.  It may register for memory pressure
      notification for descendant memcgs to make a policy decision: oom kill a
      low priority job, increase the limit, decrease other limits, etc.  If it
      registers for memory pressure notification on the top-level memcg, it
      currently cannot distinguish between memory pressure in its own memcg or
      a descendant memcg, which is user-controlled.
      
      Conversely, if a user registers for memory pressure notification on
      their own descendant memcg, the Activity Manager does not receive any
      pressure notification for that child memcg hierarchy.  Vmpressure events
      are not received for ancestor memcgs if the memcg experiencing pressure
      has notifiers registered, perhaps outside the knowledge of the thread
      waiting on read(2) at the top level.
      
      Both of these are consequences of vmpressure notification not being
      pass-through.
      
      This implements a pass-through behavior for vmpressure events.  When
      writing to cgroup.event_control, vmpressure event handlers may
      optionally specify a mode.  There are two new modes:
      
       - "hierarchy": always propagate memory pressure events up the hierarchy,
         regardless of whether descendant memcgs have their own notifiers
         registered, and
      
       - "local": only receive notifications when the memcg for which the
         event is registered experiences memory pressure.
      
      Of course, processes may register for one notification of "low,local",
      for example, and another for "low".
      
      If no mode is specified, the current behavior is maintained for
      backwards compatibility.
      
      See the change to Documentation/cgroup-v1/memory.txt for full
      specification.
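      
      As an illustration of the registration described above, the line written
      to cgroup.event_control has the form "<event_fd> <fd of
      memory.pressure_level> <level[,mode]>" per the cgroup-v1 memory
      documentation.  The helper below is a hypothetical sketch that only
      builds that string; opening the files and creating the eventfd are left
      out.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a vmpressure registration line.
 * 'mode' may be NULL or "" to keep the default (backwards compatible)
 * behavior, or "hierarchy" / "local" for the new modes.
 * Returns the string length, or -1 if the line does not fit in 'buf'.
 */
static int vmpressure_format_registration(char *buf, size_t size,
                                          int event_fd, int pressure_fd,
                                          const char *level,
                                          const char *mode)
{
    int n;

    if (mode != NULL && mode[0] != '\0')
        n = snprintf(buf, size, "%d %d %s,%s",
                     event_fd, pressure_fd, level, mode);
    else
        n = snprintf(buf, size, "%d %d %s",
                     event_fd, pressure_fd, level);

    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

      A caller would write the resulting string to the memcg's
      cgroup.event_control file and then read(2) from the eventfd to wait for
      notifications.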
      
      [dan.carpenter@oracle.com: free the same pointer we allocated]
        Link: http://lkml.kernel.org/r/20170613191820.GA20003@elgon.mountain
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705311421320.8946@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6bb9811
    • Naoya Horiguchi's avatar
      mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error · 78bb9203
      Naoya Horiguchi authored
      Currently me_huge_page() relies on dequeue_hwpoisoned_huge_page() to
      keep the error hugepage away from the system, which is OK but not good
      enough because the hugepage still has a refcount and unpoison doesn't
      work on the error hugepage (PageHWPoison flags are cleared but pages are
      still leaked).  There is also the "wasting healthy subpages" issue.  This
      patch reworks me_huge_page() to solve these issues.
      
      For hugetlb files, we recently gained truncation code, so let's use it in
      the hugetlbfs-specific ->error_remove_page().
      
      For anonymous hugepages, it's helpful to dissolve the error page after
      freeing it into the free hugepage list.  The migration entry and
      PageHWPoison in the head page prevent access to it.
      
      TODO: dissolve_free_huge_page() can fail but we don't consider that yet.
      It's not critical (and at least no worse than now) because in such a case
      the error hugepage just stays in the free hugepage list without being
      dissolved.  By virtue of PageHWPoison in the head page, it's never
      allocated to processes.
      
      [akpm@linux-foundation.org: fix unused var warnings]
      Fixes: 23a003bf ("mm/madvise: pass return code of memory_failure() to userspace")
      Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
      Link: http://lkml.kernel.org/r/1496305019-5493-8-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      78bb9203
    • Naoya Horiguchi's avatar
      mm: hwpoison: introduce memory_failure_hugetlb() · 761ad8d7
      Naoya Horiguchi authored
      memory_failure() is a big function and hard to maintain.  Handling the
      hugetlb and non-hugetlb cases in a single function is not good, so this
      patch separates the PageHuge() branch into a new function, which saves
      many PageHuge() checks.
      
      Link: http://lkml.kernel.org/r/1496305019-5493-7-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      761ad8d7
    • Naoya Horiguchi's avatar
      mm: soft-offline: dissolve free hugepage if soft-offlined · d4a3a60b
      Naoya Horiguchi authored
      Now we have code to rescue most of the healthy pages from a hwpoisoned
      hugepage.  So let's apply it to soft_offline_free_page too.
      
      Link: http://lkml.kernel.org/r/1496305019-5493-6-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d4a3a60b
    • Anshuman Khandual's avatar
      mm: hugetlb: soft-offline: dissolve source hugepage after successful migration · c3114a84
      Anshuman Khandual authored
      Currently hugepage migrated by soft-offline (i.e.  due to correctable
      memory errors) is contained as a hugepage, which means many non-error
      pages in it are unreusable, i.e.  wasted.
      
      This patch solves this issue by dissolving source hugepages into buddy.
      As done in the previous patch, PageHWPoison is set only on the head page
      of the error hugepage.  Then in dissolving we move the PageHWPoison flag
      to the raw error page so that all healthy subpages return to buddy.
      
      [arnd@arndb.de: fix warnings: replace some macros with inline functions]
        Link: http://lkml.kernel.org/r/20170609102544.2947326-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/1496305019-5493-5-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3114a84
    • Naoya Horiguchi's avatar
      mm: hwpoison: change PageHWPoison behavior on hugetlb pages · b37ff71c
      Naoya Horiguchi authored
      We'd like to narrow down the error region in memory error on hugetlb
      pages.  However, currently we set PageHWPoison flags on all subpages in
      the error hugepage and add # of subpages to num_hwpoison_pages, which
      doesn't fit our purpose.
      
      So this patch changes the behavior and we only set PageHWPoison on the
      head page then increase num_hwpoison_pages only by 1.  This is a
      preparation for narrow-down part which comes in later patches.
      
      Link: http://lkml.kernel.org/r/1496305019-5493-4-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b37ff71c
    • Naoya Horiguchi's avatar
      mm: hugetlb: return immediately for hugetlb page in __delete_from_page_cache() · 09612fa6
      Naoya Horiguchi authored
      We avoid calling __mod_node_page_state(NR_FILE_PAGES) for hugetlb page
      now, but it's not enough because later code doesn't handle hugetlb
      properly.  Actually in our testing, WARN_ON_ONCE(PageDirty(page)) at the
      end of this function fires for hugetlb, which makes no sense.  So we
      should return immediately for hugetlb pages.
      
      Link: http://lkml.kernel.org/r/1496305019-5493-3-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09612fa6
    • Naoya Horiguchi's avatar
      mm: hugetlb: prevent reuse of hwpoisoned free hugepages · 243abd5b
      Naoya Horiguchi authored
      Patch series "mm: hwpoison: fixlet for hugetlb migration".
      
      This patchset updates the hwpoison/hugetlb code to address 2 reported
      issues.
      
      One is the madvise(MADV_HWPOISON) failure reported by Intel's lkp robot
      (see http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop).  The
      first half was already fixed in mainline, and the other half, concerning
      hugetlb cases, is solved in this series.
      
      The other issue is narrowing the error-affected region down to a single
      4kB page instead of a whole hugetlb page, which was attempted by Anshuman
      (http://lkml.kernel.org/r/20170420110627.12307-1-khandual@linux.vnet.ibm.com)
      and which I updated to apply more widely.
      
      This patch (of 9):
      
      We no longer use MIGRATE_ISOLATE to prevent reuse of hwpoison hugepages
      as we did before.  So current dequeue_huge_page_node() doesn't work as
      intended because it still uses is_migrate_isolate_page() for this check.
      This patch fixes it with PageHWPoison flag.
      
      Link: http://lkml.kernel.org/r/1496305019-5493-2-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      243abd5b
    • Eric Biggers's avatar
      fs/buffer.c: make bh_lru_install() more efficient · 241f01fb
      Eric Biggers authored
      To install a buffer_head into the cpu's LRU queue, bh_lru_install()
      would construct a new copy of the queue and then memcpy it over the real
      queue.  But it's easily possible to do the update in-place, which is
      faster and simpler.  Some work can also be skipped if the buffer_head
      was already in the queue.
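      
      The in-place update described above can be sketched in user space as
      follows.  This is an illustration under stated assumptions rather than
      the kernel function: integer ids stand in for buffer_head pointers, 0
      marks an empty slot, and LRU_SIZE and lru_install() are hypothetical
      names.  The new entry is swapped down the array one slot at a time, and
      the walk stops early if the entry is found to be already present.

```c
#define LRU_SIZE 4  /* hypothetical size, stands in for BH_LRU_SIZE */

/*
 * Install 'id' (non-zero) at the front of 'lru', shifting older entries
 * down in place.  Returns the id that fell off the end, or 0 if nothing
 * was evicted (an empty slot was consumed or 'id' was already present).
 */
static int lru_install(int lru[LRU_SIZE], int id)
{
    int evictee = id;

    for (int i = 0; i < LRU_SIZE; i++) {
        /* Swap the carried entry with the one in slot i. */
        int tmp = lru[i];
        lru[i] = evictee;
        evictee = tmp;

        /* 'id' was already in the queue: it has now moved to the front. */
        if (evictee == id)
            return 0;
    }
    return evictee;
}
```

      No temporary copy of the queue is built, and a lookup that hits an
      entry near the front touches only the slots ahead of it.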
      
      As a microbenchmark I timed how long it takes to run sb_getblk()
      10,000,000 times alternating between BH_LRU_SIZE + 1 blocks.
      Effectively, this benchmarks looking up buffer_heads that are in the
      page cache but not in the LRU:
      
      	Before this patch: 1.758s
      	After this patch: 1.653s
      
      This patch also removes about 350 bytes of compiled code (on x86_64),
      partly due to removal of the memcpy() which was being inlined+unrolled.
      
      Link: http://lkml.kernel.org/r/20161229193445.1913-1-ebiggers3@gmail.com
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      241f01fb
    • Nick Desaulniers's avatar
      mm/zsmalloc.c: fix -Wunneeded-internal-declaration warning · 3457f414
      Nick Desaulniers authored
      is_first_page() is only called from the macro VM_BUG_ON_PAGE() which is
      only compiled in as a runtime check when CONFIG_DEBUG_VM is set,
      otherwise is checked at compile time and not actually compiled in.
      
      Fixes the following warning, found with Clang:
      
        mm/zsmalloc.c:472:12: warning: function 'is_first_page' is not needed and will not be emitted [-Wunneeded-internal-declaration]
        static int is_first_page(struct page *page)
                 ^
      
      Link: http://lkml.kernel.org/r/20170524053859.29059-1-nick.desaulniers@gmail.com
      Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3457f414
    • Gustavo A. R. Silva's avatar
      mm/memory_hotplug.c: add NULL check to avoid potential NULL pointer dereference · dbac61a3
      Gustavo A. R. Silva authored
      The NULL check at line 1226: if (!pgdat), implies that pointer pgdat
      might be NULL.
      
      rollback_node_hotadd() dereferences this pointer.  Add NULL check to
      avoid a potential NULL pointer dereference.
      
      Addresses-Coverity-ID: 1369133
      Link: http://lkml.kernel.org/r/20170530212436.GA6195@embeddedgus
      Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dbac61a3
    • David Rientjes's avatar
      mm, vmscan: avoid thrashing anon lru when free + file is low · 06226226
      David Rientjes authored
      The purpose of the code that commit 62376251 ("revert 'mm: vmscan:
      do not swap anon pages just because free+file is low'") reintroduces is
      to prefer swapping anonymous memory rather than thrashing the file lru.
      
      If the anonymous inactive lru for the set of eligible zones is
      considered low, however, or the length of the list for the given reclaim
      priority does not allow for effective anonymous-only reclaiming, then
      avoid forcing SCAN_ANON.  Forcing SCAN_ANON will end up thrashing the
      small list and leave unreclaimed memory on the file lrus.
      
      If the inactive list is insufficient, fall back to balanced reclaim so
      the file lru doesn't remain untouched.
      
      [akpm@linux-foundation.org: fix build]
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705011432220.137835@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06226226
    • Yevgen Pronenko's avatar
      mm/memory.c: convert to DEFINE_DEBUGFS_ATTRIBUTE · 0a1345f8
      Yevgen Pronenko authored
      The preferred strategy to define debugfs attributes is to use the
      DEFINE_DEBUGFS_ATTRIBUTE() macro and to use debugfs_create_file_unsafe().
      
      Link: http://lkml.kernel.org/r/20170528145948.32127-1-y.pronenko@gmail.com
      Signed-off-by: Yevgen Pronenko <y.pronenko@gmail.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a1345f8
    • Vlastimil Babka's avatar
      mm, page_alloc: fallback to smallest page when not stealing whole pageblock · 7a8f58f3
      Vlastimil Babka authored
      Since commit 3bc48f96 ("mm, page_alloc: split smallest stolen page
      in fallback") we pick the smallest (but sufficient) page of all that
      have been stolen from a pageblock of different migratetype.  However,
      there are cases when we decide not to steal the whole pageblock.
      
      Practically in the current implementation it means that we are trying to
      fallback for a MIGRATE_MOVABLE allocation of order X, go through the
      freelists from MAX_ORDER-1 down to X, and find free page of order Y.  If
      Y is less than pageblock_order / 2, we decide not to steal all pages
      from the pageblock.  When Y > X, it means we are potentially splitting a
      larger page than we need, as there might be other pages of order Z,
      where X <= Z < Y.  Since Y is already too small to steal whole
      pageblock, picking smallest available Z will result in the same decision
      and we avoid splitting a higher-order page in a MIGRATE_UNMOVABLE or
      MIGRATE_RECLAIMABLE pageblock.
      
      This patch therefore changes the fallback algorithm so that in the
      situation described above, we switch the fallback search strategy to go
      from order X upwards to find the smallest suitable fallback.  In theory
      there shouldn't be a downside of this change wrt fragmentation.
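      
      The search strategy described above can be modelled with a short sketch.
      It is a simplification for a MIGRATE_MOVABLE request only, not the
      kernel's fallback code: NR_ORDERS, the nr_free[] array and
      pick_fallback_order() are hypothetical names, and a non-zero nr_free[o]
      simply means a free fallback page of order o exists.

```c
#define NR_ORDERS 11  /* hypothetical: models orders 0 .. MAX_ORDER-1 */

/*
 * Pick the order of the fallback page for a movable allocation of
 * 'order' (X).  Returns the chosen order, or -1 if nothing is free.
 */
static int pick_fallback_order(const unsigned int nr_free[NR_ORDERS],
                               int order, int pageblock_order)
{
    int y, z;

    /* First pass: largest free page, from the top order down to X. */
    for (y = NR_ORDERS - 1; y >= order; y--) {
        if (nr_free[y] > 0)
            break;
    }
    if (y < order)
        return -1;

    /* Y is large enough to steal the whole pageblock: take it. */
    if (y >= pageblock_order / 2)
        return y;

    /*
     * Not stealing the whole pageblock, so avoid splitting a larger
     * page than needed: search upwards from X for the smallest Z.
     */
    for (z = order; z < y; z++) {
        if (nr_free[z] > 0)
            return z;
    }
    return y;
}
```

      For example, with pageblock_order 9, free pages at orders 2 and 3, and
      a request of order 1, the first pass finds order 3, which is too small
      to steal the pageblock, so the second pass settles on order 2.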
      
      This has been tested with mmtests' stress-highalloc performing
      GFP_KERNEL order-4 allocations, here is the relevant extfrag tracepoint
      statistics:
      
                                                            4.12.0-rc2   4.12.0-rc2
                                                             1-kernel4    2-kernel4
        Page alloc extfrag event                              25640976     69680977
        Extfrag fragmenting                                   25621086     69661364
        Extfrag fragmenting for unmovable                        74409        73204
        Extfrag fragmenting unmovable placed with movable        69003        67684
        Extfrag fragmenting unmovable placed with reclaim.        5406         5520
        Extfrag fragmenting for reclaimable                       6398         8467
        Extfrag fragmenting reclaimable placed with movable        869          884
        Extfrag fragmenting reclaimable placed with unmov.        5529         7583
        Extfrag fragmenting for movable                       25540279     69579693
      
      Since we force movable allocations to steal the smallest available page
      (which we then practically always split), we steal less per fallback, so
      the number of fallbacks increases and steals potentially happen from
      different pageblocks.  This is however not an issue for movable pages
      that can be compacted.
      
      Importantly, the "unmovable placed with movable" statistic is lower,
      which is the result of less fragmentation in the unmovable pageblocks.
      The effect on reclaimable allocation is a bit unclear.
      
      Link: http://lkml.kernel.org/r/20170529093947.22618-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a8f58f3
    • Shaohua Li's avatar
      swap: add block io poll in swapin path · 23955622
      Shaohua Li authored
      For fast flash disk, async IO could introduce overhead because of
      context switch.  block-mq now supports IO poll, which improves
      performance and latency a lot.  swapin is a good place to use this
      technique, because the task is waiting for the swapin page to continue
      execution.
      
      In my virtual machine, directly reading 4k of data from an NVMe device
      with iopoll is about 60% better than without poll.  With iopoll support in swapin
      patch, my microbenchmark (a task does random memory write) is about
      10%~25% faster.  CPU utilization increases a lot though, 2x and even 3x
      CPU utilization.  This will depend on disk speed.
      
      While iopoll in swapin isn't intended for all use cases, it's a win
      for latency sensitive workloads with a high speed swap disk.  The block
      layer has a knob to control polling at runtime.  If poll isn't enabled in
      the block layer, there should be no noticeable change in swapin.
      
      I got a chance to run the same test in a NVMe with DRAM as the media.
      In simple fio IO test, blkpoll boosts 50% performance in single thread
      test and ~20% in 8 threads test.  So this is the base line.  In above
      swap test, blkpoll boosts ~27% performance in single thread test.
      blkpoll uses 2x CPU time though.
      
      If we enable hybrid polling, the performance gain drops very slightly
      but CPU time is only 50% worse than that without blkpoll.  We can also
      adjust the hybrid poll parameters to reduce the CPU time penalty
      further.  In the 8 threads test, blkpoll doesn't help though.  The
      performance is similar to that without blkpoll, but cpu utilization is
      similar too.  There is lock contention in swap path.  The cpu time
      spending on blkpoll isn't high.  So overall, blkpoll swapin isn't worse
      than that without it.
      
      The swapin readahead might read several pages in at the same time and
      form a big IO request.  Since the IO will take longer, it doesn't
      make sense to poll, so the patch only does iopoll for single page
      swapin.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/070c3c3e40b711e7b1390002c991e86a-b5408f0@7511894063d3764ff01ea8111f5a004d7dd700ed078797c204a24e620ddb965c
      Signed-off-by: Shaohua Li <shli@fb.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23955622
    • Linus Torvalds's avatar
      Merge tag 'for-linus-4.13-v2' of git://github.com/cminyard/linux-ipmi · 9eb78880
      Linus Torvalds authored
      Pull IPMI updates from Corey Minyard:
       "Some small fixes for IPMI, and one medium sized change.
      
        The medium sized change is adding a platform device for IPMI entries
        in the DMI table. Otherwise there is no auto loading for IPMI devices
        if they are only in the DMI table"
      
      * tag 'for-linus-4.13-v2' of git://github.com/cminyard/linux-ipmi:
        ipmi:ssif: Add missing unlock in error branch
        char: ipmi: constify bmc_dev_attr_group and bmc_device_type
        ipmi:ssif: Check dev before setting drvdata
        ipmi: Convert DMI handling over to a platform device
        ipmi: Create a platform device for a DMI-specified IPMI interface
        ipmi: use rcu lock around call to intf->handlers->sender()
        ipmi:ssif: Use i2c_adapter_id instead of adapter->nr
        ipmi: Use the proper default value for register size in ACPI
        ipmi_ssif: remove redundant null check on array client->adapter->name
        ipmi/watchdog: fix watchdog timeout set on reboot
        ipmi_ssif: unlock on allocation failure
      9eb78880
    • Linus Torvalds's avatar
      Merge tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 642338ba
      Linus Torvalds authored
      Pull XFS updates from Darrick Wong:
       "Here are some changes for you for 4.13. For the most part it's fixes
        for bugs and deadlock problems, and preparation for online fsck in
        some future merge window.
      
         - Avoid quotacheck deadlocks
      
         - Fix transaction overflows when bunmapping fragmented files
      
         - Refactor directory readahead
      
         - Allow admin to configure if ASSERT is fatal
      
         - Improve transaction usage detail logging during overflows
      
         - Minor cleanups
      
         - Don't leak log items when the log shuts down
      
         - Remove double-underscore typedefs
      
         - Various preparation for online scrubbing
      
         - Introduce new error injection configuration sysfs knobs
      
         - Refactor dq_get_next to use extent map directly
      
         - Fix problems with iterating the page cache for unwritten data
      
         - Implement SEEK_{HOLE,DATA} via iomap
      
         - Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA
      
         - Don't use MAXPATHLEN to check on-disk symlink target lengths"
      
      * tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
        xfs: don't crash on unexpected holes in dir/attr btrees
        xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLEN
        xfs: fix contiguous dquot chunk iteration livelock
        xfs: Switch to iomap for SEEK_HOLE / SEEK_DATA
        vfs: Add iomap_seek_hole and iomap_seek_data helpers
        vfs: Add page_cache_seek_hole_data helper
        xfs: remove a whitespace-only line from xfs_fs_get_nextdqblk
        xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent
        xfs: Check for m_errortag initialization in xfs_errortag_test
        xfs: grab dquots without taking the ilock
        xfs: fix semicolon.cocci warnings
        xfs: Don't clear SGID when inheriting ACLs
        xfs: free cowblocks and retry on buffered write ENOSPC
        xfs: replace log_badcrc_factor knob with error injection tag
        xfs: convert drop_writes to use the errortag mechanism
        xfs: remove unneeded parameter from XFS_TEST_ERROR
        xfs: expose errortag knobs via sysfs
        xfs: make errortag a per-mountpoint structure
        xfs: free uncommitted transactions during log recovery
        xfs: don't allow bmap on rt files
        ...
      642338ba
    • Linus Torvalds's avatar
      Merge branch 'nowait-aio-btrfs-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 6618a24a
      Linus Torvalds authored
      Pull btrfs fix from David Sterba:
       "This fixes a user-visible bug introduced by the nowait-aio patches
        merged in this cycle"
      
      * 'nowait-aio-btrfs-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: nowait aio: Correct assignment of pos
      6618a24a
    • Linus Torvalds's avatar
      Merge branch 'fix-uio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 1d07b6cb
      Linus Torvalds authored
      Pull copy*_iter fix from Al Viro.
      
      [ Al used entirely the wrong return value. Oopsie. ]
      
      * 'fix-uio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        fix brown paperbag bug in inlined copy_..._iter()
      1d07b6cb
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid · a91ab911
      Linus Torvalds authored
      Pull HID updates from Jiri Kosina:
      
       - open/close tracking improvements from Dmitry Torokhov
      
       - battery support improvements in Wacom driver from Jason Gerecke
      
       - Win8 support fixes from Benjamin Tissoires and Hans de Goede
      
       - misc fixes to Intel-ISH driver from Arnd Bergmann
      
       - support for quite a few new devices and small assorted fixes here and
         there
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (35 commits)
        HID: intel-ish-hid: Enable Gemini Lake ish driver
        HID: intel-ish-hid: Enable Cannon Lake ish driver
        HID: wacom: fix mistake in printk
        HID: multitouch: optimize the sticky fingers timer
        HID: multitouch: fix rare Win 8 cases when the touch up event gets missing
        HID: multitouch: use BIT macro
        HID: Add driver for Retrode2 joypad adapter
        HID: multitouch: Add support for Google Rose Touchpad
        HID: multitouch: Support PTP Stick and Touchpad device
        HID: core: don't use negative operands when shift
        HID: apple: Use country code to detect ISO keyboards
        HID: remove no longer used hid->open field
        greybus: hid: remove custom locking from gb_hid_open/close
        HID: usbhid: remove custom locking from usbhid_open/close
        HID: i2c-hid: remove custom locking from i2c_hid_open/close
        HID: serialize hid_hw_open and hid_hw_close
        HID: usbhid: do not rely on hid->open when deciding to do IO
        HID: hiddev: use hid_hw_power instead of usbhid_get/put_power
        HID: hiddev: use hid_hw_open/close instead of usbhid_open/close
        HID: asus: Add support for Zen AiO MD-5110 keyboard
        ...
      a91ab911
    • Goldwyn Rodrigues's avatar
      btrfs: nowait aio: Correct assignment of pos · ff0fa732
      Goldwyn Rodrigues authored
      Assigning pos too early breaks append mode, where the position is
      re-assigned in generic_write_checks(). Assign pos later so that the
      correct write position is taken from iocb->ki_pos.
      
      Since check_can_nocow also uses the value of pos, we shift
      generic_write_checks() before check_can_nocow(). Checks with IOCB_DIRECT
      are present in generic_write_checks(), so checking for IOCB_NOWAIT is
      enough.
      
      Also, put locking sequence in the fast path.
      
      This fixes a user-visible bug, as reported:
      
      "apparently breaks several shell related features on my system.
      In zsh history stopped working, because no new entries are added
      anymore.
      I first noticed the issue when I tried to build mplayer. It uses a shell
      script to generate a help_mp.h file:
      [...]
      
      Here is a simple testcase:
      
       % echo "foo" >> test
       % echo "foo" >> test
       % cat test
       foo
       %
      "
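
      The shell testcase above can also be approximated from user space
      with a small script. This is only a sketch, not part of the original
      report: the helper names append_twice and check_append and the file
      name are hypothetical, and it assumes that two separate O_APPEND
      writes of "foo\n" should leave both lines in the file. On an
      affected kernel with a btrfs mount, the second write would be
      expected to land at offset 0 and leave a single line.

```python
import os
import tempfile


def append_twice(path, payload=b"foo\n"):
    """Append payload twice using separate O_APPEND opens, like two
    'echo foo >> test' invocations, and return the file contents."""
    for _ in range(2):
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        try:
            os.write(fd, payload)
        finally:
            os.close(fd)
    with open(path, "rb") as f:
        return f.read()


def check_append(directory):
    """Return True if both appended lines survived (hypothetical helper)."""
    path = os.path.join(directory, "test")
    return append_twice(path) == b"foo\n" * 2


if __name__ == "__main__":
    # Run inside a directory on the filesystem under test.
    with tempfile.TemporaryDirectory(dir=".") as d:
        print("append ok" if check_append(d) else "append broken")
```

      To exercise a particular filesystem, run the script from a
      directory on that mount, since the temporary directory is created
      under the current working directory.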
      
      Fixes: edf064e7 ("btrfs: nowait aio support")
      CC: Jens Axboe <axboe@kernel.dk>
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Link: https://lkml.kernel.org/r/20170704042306.GA274@x4
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ff0fa732