1. 15 Dec, 2020 40 commits
    • mm, page_alloc: disable pcplists during memory offline · ec6e8c7e
      Vlastimil Babka authored
      Memory offlining relies on page isolation to guarantee forward progress,
      because pages cannot be reused while they are isolated.  But page
      isolation itself doesn't prevent races while freed pages are stored
      on pcp lists and thus can be reused.  This can be worked around by
      repeated draining of pcplists, as done by commit 96831826
      ("mm/memory_hotplug: drain per-cpu pages again during memory offline").
      
      David and Michal would prefer that this race was closed in a way that
      callers of page isolation who need stronger guarantees don't need to
      repeatedly drain.  David suggested disabling pcplists usage completely
      during page isolation, instead of repeatedly draining them.
      
      To achieve this without adding special cases in alloc/free fastpath, we
      can use the same approach as boot pagesets - when pcp->high is 0, any
      pcplist addition will be immediately flushed.
      
      The race can thus be closed by setting pcp->high to 0 and draining
      pcplists once, before calling start_isolate_page_range().  The draining
      will serialize after processes that already disabled interrupts and read
      the old value of pcp->high in free_unref_page_commit(), and processes that
      have not yet disabled interrupts will observe pcp->high == 0 when they
      are rescheduled, and skip pcplists.  This guarantees no stray pages on
      pcplists in zones where isolation happens.
      
      This patch thus adds zone_pcp_disable() and zone_pcp_enable() functions
      that page isolation users can call before start_isolate_page_range() and
      after unisolating (or offlining) the isolated pages.
      
      Also, drain_all_pages() is optimized to only execute on cpus where
      pcplists are not empty.  The check can however race with a free to pcplist
      that has not yet increased the pcp->count from 0 to 1.  Thus make the
      drain optionally skip the racy check and drain on all cpus, and use this
      option in zone_pcp_disable().
      
      As we have to avoid external updates to high and batch while pcplists are
      disabled, we take pcp_batch_high_lock in zone_pcp_disable() and release it
      in zone_pcp_enable().  This also synchronizes multiple users of
      zone_pcp_disable()/enable().
      
      Currently the only user of this functionality is offline_pages().
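
      A rough sketch of the intended call pattern (a simplified, hypothetical
      caller, not the real offline_pages()):

        /*
         * Sketch: bracket page isolation with pcplist disabling.
         * zone_pcp_disable() takes pcp_batch_high_lock, sets pcp->high to 0 so
         * any pcplist addition is flushed immediately, and drains all CPUs
         * without the racy pcp->count check; zone_pcp_enable() restores the
         * cached high/batch values and drops the lock.
         */
        static int isolate_range_sketch(struct zone *zone, unsigned long start_pfn,
                                        unsigned long end_pfn)
        {
                int ret;

                zone_pcp_disable(zone);         /* no stray pages on pcplists */

                ret = start_isolate_page_range(start_pfn, end_pfn,
                                               MIGRATE_MOVABLE, MEMORY_OFFLINE);
                if (!ret) {
                        /* ... migrate/offline the isolated range ... */
                        undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
                }

                zone_pcp_enable(zone);          /* restore pcp->high and pcp->batch */
                return ret;
        }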
      
      [vbabka@suse.cz: add comment, per David]
        Link: https://lkml.kernel.org/r/527480ef-ed72-e1c1-52a0-1c5b0113df45@suse.cz
      
      Link: https://lkml.kernel.org/r/20201111092812.11329-8-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec6e8c7e
    • mm, page_alloc: move draining pcplists to page isolation users · 7612921f
      Vlastimil Babka authored
      Currently, pcplists are drained during set_migratetype_isolate(), which
      means once per pageblock processed by start_isolate_page_range().  This is
      somewhat wasteful.  Moreover, the callers might need different guarantees,
      and the draining is currently prone to races and does not guarantee that
      no page from an isolated pageblock will end up on the pcplist after the
      drain.
      
      Better guarantees are added by later patches and require explicit actions
      by page isolation users that need them.  Thus it makes sense to move the
      current imperfect draining to the callers also as a preparation step.
      
      Link: https://lkml.kernel.org/r/20201111092812.11329-7-vbabka@suse.cz
      Suggested-by: David Hildenbrand <david@redhat.com>
      Suggested-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7612921f
    • mm, page_alloc: cache pageset high and batch in struct zone · 952eaf81
      Vlastimil Babka authored
      All per-cpu pagesets for a zone use the same high and batch values that
      are duplicated there just for performance (locality) reasons.  This patch
      adds the same variables also to struct zone as a shared copy.
      
      This will be useful later for making it possible to disable pcplists
      temporarily by setting the high value to 0, while remembering the values
      for restoring them later.  But we can also immediately benefit from not
      updating pagesets of all possible cpus in case the newly recalculated
      values (after sysctl change or memory online/offline) are actually
      unchanged from the previous ones.
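
      For illustration, a sketch of the cached copy in struct zone (an
      approximation of the patch, not a verbatim excerpt):

        struct zone {
                /* ... existing fields ... */
        #ifdef CONFIG_SMP
                /*
                 * Shared copies of the high and batch values applied to every
                 * per-cpu pageset of this zone.  They let us skip updating all
                 * CPUs when a recalculation yields unchanged values, and later
                 * give us a place to remember the real values while pcplists
                 * are disabled (high == 0).
                 */
                int pageset_high;
                int pageset_batch;
        #endif
                /* ... */
        };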
      
      Link: https://lkml.kernel.org/r/20201111092812.11329-6-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      952eaf81
    • mm, page_alloc: simplify pageset_update() · 5c3ad2eb
      Vlastimil Babka authored
      pageset_update() attempts to update pcplist's high and batch values in a
      way that readers don't observe batch > high.  It uses smp_wmb() to order
      the updates in a way to achieve this.  However, without proper pairing
      read barriers in readers this guarantee doesn't hold, and there are no
      such barriers in e.g.  free_unref_page_commit().
      
      Commit 88e8ac11 ("mm, page_alloc: fix core hung in
      free_pcppages_bulk()") already showed this is problematic, and solved this
      by ultimately only trusting pcp->count of the current cpu with interrupts
      disabled.
      
      The update dance with unpaired write barriers thus makes no sense.
      Replace them with plain WRITE_ONCE to prevent store tearing, and document
      that the values can change asynchronously and should not be trusted for
      correctness.
      
      All current readers appear to be OK after 88e8ac11.  Convert them to
      READ_ONCE to prevent unnecessary read tearing, but mainly to alert anybody
      making future changes to the code that special care is needed.
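
      A minimal sketch of the resulting pattern (illustrative only;
      pcp_over_high() is a made-up reader standing in for the check in
      free_unref_page_commit()):

        /* Writer: no ordering is claimed, only protection against store tearing. */
        static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
                                   unsigned long batch)
        {
                WRITE_ONCE(pcp->batch, batch);
                WRITE_ONCE(pcp->high, high);
        }

        /*
         * Reader: high and batch can change at any time and must not be relied
         * on for correctness; READ_ONCE() documents that and avoids read tearing.
         */
        static bool pcp_over_high(struct per_cpu_pages *pcp)
        {
                return pcp->count >= READ_ONCE(pcp->high);
        }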
      
      Link: https://lkml.kernel.org/r/20201111092812.11329-5-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c3ad2eb
    • mm, page_alloc: remove setup_pageset() · 69a8396a
      Vlastimil Babka authored
      We initialize boot-time pagesets with setup_pageset(), which sets high and
      batch values that effectively disable pcplists.
      
      We can remove this wrapper if we just set these values for all pagesets in
      pageset_init().  Non-boot pagesets then subsequently update them to the
      proper values.
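
      A sketch of the resulting initializer (approximate; the comments describe
      the intent from this changelog rather than quoting the source):

        static void pageset_init(struct per_cpu_pageset *p)
        {
                struct per_cpu_pages *pcp = &p->pcp;
                int migratetype;

                memset(p, 0, sizeof(*p));
                for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
                        INIT_LIST_HEAD(&pcp->lists[migratetype]);

                /*
                 * Values safe for a boot pageset: high == 0 flushes every freed
                 * page straight to the buddy lists, batch == 1 is a safe lower
                 * bound.  Real pagesets get proper values updated later.
                 */
                pcp->high = 0;
                pcp->batch = 1;
        }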
      
      No functional change.
      
      Link: https://lkml.kernel.org/r/20201111092812.11329-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69a8396a
    • mm, page_alloc: calculate pageset high and batch once per zone · 0a8b4f1d
      Vlastimil Babka authored
      We currently call pageset_set_high_and_batch() for each possible cpu,
      which repeats the same calculations of high and batch values.
      
      Instead call the function just once per zone, and make it apply the
      calculated values to all per-cpu pagesets of the zone.
      
      This also allows removing the zone_pageset_init() and __zone_pcp_update()
      wrappers.
      
      No functional change.
      
      Link: https://lkml.kernel.org/r/20201111092812.11329-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a8b4f1d
    • mm, page_alloc: clean up pageset high and batch update · 7115ac6e
      Vlastimil Babka authored
      Patch series "disable pcplists during memory offline", v3.
      
      As per the discussions [1] [2] this is an attempt to implement David's
      suggestion that page isolation should disable pcplists to avoid races with
      page freeing in progress.  This is done without extra checks in fast
      paths, as explained in Patch 9.  The repeated draining done by [2] is then
      no longer needed.  Previous version (RFC) is at [3].
      
      The RFC tried to hide pcplists disabling/enabling inside page isolation,
      but that wasn't completely possible, as memory offline does not do the
      unisolation itself.  Michal suggested an explicit API in [4], so that's
      the current implementation, and it seems indeed nicer.
      
      Once we accept that page isolation users need to do explicit actions
      around it depending on the needed guarantees, we can also IMHO accept that
      the current pcplist draining can also be done by the callers, which is
      more effective.  After all, there are only two users of page isolation.
      So patch 6 does effectively the same thing as Pavel proposed in [5], and
      patch 7 implements stronger guarantees only for memory offline.  If CMA
      decides to opt in to the stronger guarantee, it can be added later.
      
      Patches 1-5 are preparatory cleanups for pcplist disabling.
      
      The patchset was briefly tested in QEMU to check that memory online/offline works,
      but I haven't done a stress test that would prove the race fixed by [2] is
      eliminated.
      
      Note that patch 7 could be avoided if we instead adjusted page freeing as
      shown in [6], but I believe the current implementation of disabling
      pcplists is not too complex, so I would prefer this instead of adding
      new checks and a longer irq-disabled section into the page freeing hotpaths.
      
      [1] https://lore.kernel.org/linux-mm/20200901124615.137200-1-pasha.tatashin@soleen.com/
      [2] https://lore.kernel.org/linux-mm/20200903140032.380431-1-pasha.tatashin@soleen.com/
      [3] https://lore.kernel.org/linux-mm/20200907163628.26495-1-vbabka@suse.cz/
      [4] https://lore.kernel.org/linux-mm/20200909113647.GG7348@dhcp22.suse.cz/
      [5] https://lore.kernel.org/linux-mm/20200904151448.100489-3-pasha.tatashin@soleen.com/
      [6] https://lore.kernel.org/linux-mm/3d3b53db-aeaa-ff24-260b-36427fac9b1c@suse.cz/
      [7] https://lore.kernel.org/linux-mm/20200922143712.12048-1-vbabka@suse.cz/
      [8] https://lore.kernel.org/linux-mm/20201008114201.18824-1-vbabka@suse.cz/
      
      This patch (of 7):
      
      The updates to pcplists' high and batch values are handled by multiple
      functions that make the calculations hard to follow.  Consolidate
      everything to pageset_set_high_and_batch() and remove pageset_set_batch()
      and pageset_set_high() wrappers.
      
      The only special case using one of the removed wrappers was:
      build_all_zonelists_init()
      
        setup_pageset()
          pageset_set_batch()
      
      which was hardcoding batch as 0, so we can just open-code a call to
      pageset_update() with constant parameters instead.
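
      Sketch of the resulting open-coded call (assuming the pre-existing
      pageset_update(pcp, high, batch) signature):

        static void setup_pageset(struct per_cpu_pageset *p)
        {
                pageset_init(p);
                /* pageset_set_batch(p, 0) boiled down to high = 0, batch = 1 */
                pageset_update(&p->pcp, 0, 1);
        }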
      
      No functional change.
      
      Link: https://lkml.kernel.org/r/20201111092812.11329-1-vbabka@suse.cz
      Link: https://lkml.kernel.org/r/20201111092812.11329-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7115ac6e
    • arch, mm: make kernel_page_present() always available · 32a0de88
      Mike Rapoport authored
      For architectures that enable ARCH_HAS_SET_MEMORY, having the ability to
      verify that a page is mapped in the kernel direct map can be useful
      regardless of hibernation.
      
      Add a RISC-V implementation of kernel_page_present(), update its forward
      declarations and stubs to be a part of the set_memory API, and remove the
      ugly ifdefery in include/linux/mm.h around the current declarations of
      kernel_page_present().
      
      Link: https://lkml.kernel.org/r/20201109192128.960-5-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      32a0de88
    • arch, mm: restore dependency of __kernel_map_pages() on DEBUG_PAGEALLOC · 5d6ad668
      Mike Rapoport authored
      The design of DEBUG_PAGEALLOC presumes that __kernel_map_pages() must
      never fail.  With this assumption it wouldn't be safe to allow general
      usage of this function.
      
      Moreover, some architectures that implement __kernel_map_pages() have this
      function guarded by #ifdef DEBUG_PAGEALLOC and some refuse to map/unmap
      pages when page allocation debugging is disabled at runtime.
      
      As all the users of __kernel_map_pages() were converted to use
      debug_pagealloc_map_pages() it is safe to make it available only when
      DEBUG_PAGEALLOC is set.
      
      Link: https://lkml.kernel.org/r/20201109192128.960-4-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d6ad668
    • PM: hibernate: make direct map manipulations more explicit · 2abf962a
      Mike Rapoport authored
      When DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP is enabled, a page may not
      be present in the direct map and has to be explicitly mapped before it
      can be copied.
      
      Introduce hibernate_map_page() and hibernate_unmap_page() that will
      explicitly use set_direct_map_{default,invalid}_noflush() for
      ARCH_HAS_SET_DIRECT_MAP case and debug_pagealloc_{map,unmap}_pages() for
      DEBUG_PAGEALLOC case.
      
      The remapping of the pages in safe_copy_page() presumes that it only
      changes protection bits in an existing PTE and so it is safe to ignore
      return value of set_direct_map_{default,invalid}_noflush().
      
      Still, add a pr_warn() so that future changes in set_memory APIs will not
      silently break hibernation.
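
      A sketch of the mapping side (the unmap side mirrors it with
      set_direct_map_invalid_noflush() and debug_pagealloc_unmap_pages());
      details are simplified:

        void hibernate_map_page(struct page *page)
        {
                if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
                        int ret = set_direct_map_default_noflush(page);

                        /*
                         * Only protection bits of an existing PTE are expected
                         * to change here, so a failure is unexpected; warn
                         * rather than fail hibernation.
                         */
                        if (ret)
                                pr_warn_once("Failed to remap page\n");
                } else {
                        debug_pagealloc_map_pages(page, 1);
                }
        }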
      
      Link: https://lkml.kernel.org/r/20201109192128.960-3-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2abf962a
    • mm: introduce debug_pagealloc_{map,unmap}_pages() helpers · 77bc7fd6
      Mike Rapoport authored
      Patch series "arch, mm: improve robustness of direct map manipulation", v7.
      
      During recent discussion about KVM protected memory, David raised a
      concern about usage of __kernel_map_pages() outside of DEBUG_PAGEALLOC
      scope [1].
      
      Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP it is
      possible that __kernel_map_pages() would fail, but since this function is
      void, the failure will go unnoticed.
      
      Moreover, there's a lack of consistency in __kernel_map_pages() semantics
      across architectures as some guard this function with #ifdef
      DEBUG_PAGEALLOC, some refuse to update the direct map if page allocation
      debugging is disabled at run time and some allow modifying the direct map
      regardless of DEBUG_PAGEALLOC settings.
      
      This set straightens this out by restoring dependency of
      __kernel_map_pages() on DEBUG_PAGEALLOC and updating the call sites
      accordingly.
      
      Since currently the only user of __kernel_map_pages() outside
      DEBUG_PAGEALLOC is hibernation, it is updated to make direct map accesses
      there more explicit.
      
      [1] https://lore.kernel.org/lkml/2759b4bf-e1e3-d006-7d86-78a40348269d@redhat.com
      
      This patch (of 4):
      
      When CONFIG_DEBUG_PAGEALLOC is enabled, it unmaps pages from the kernel
      direct mapping after free_pages().  The pages then need to be mapped back
      before they can be used.  These mapping operations use
      __kernel_map_pages() guarded with debug_pagealloc_enabled().
      
      The only place that calls __kernel_map_pages() without checking whether
      DEBUG_PAGEALLOC is enabled is the hibernation code that presumes
      availability of this function when ARCH_HAS_SET_DIRECT_MAP is set.  Still,
      on arm64, __kernel_map_pages() will bail out when DEBUG_PAGEALLOC is not
      enabled but set_direct_map_invalid_noflush() may render some pages not
      present in the direct map and hibernation code won't be able to save such
      pages.
      
      To make page allocation debugging and hibernation interaction more robust,
      the dependency on DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP has to be
      made more explicit.
      
      Start with combining the guard condition and the call to
      __kernel_map_pages() into debug_pagealloc_map_pages() and
      debug_pagealloc_unmap_pages() functions to emphasize that
      __kernel_map_pages() should not be called without DEBUG_PAGEALLOC and use
      these new functions to map/unmap pages when page allocation debugging is
      enabled.
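
      A minimal sketch of the two helpers (an approximation, presumably living
      next to the existing debug_pagealloc_enabled*() helpers):

        #ifdef CONFIG_DEBUG_PAGEALLOC
        static inline void debug_pagealloc_map_pages(struct page *page, int numpages)
        {
                if (debug_pagealloc_enabled_static())
                        __kernel_map_pages(page, numpages, 1);
        }

        static inline void debug_pagealloc_unmap_pages(struct page *page, int numpages)
        {
                if (debug_pagealloc_enabled_static())
                        __kernel_map_pages(page, numpages, 0);
        }
        #else   /* !CONFIG_DEBUG_PAGEALLOC */
        static inline void debug_pagealloc_map_pages(struct page *page, int numpages) {}
        static inline void debug_pagealloc_unmap_pages(struct page *page, int numpages) {}
        #endif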
      
      Link: https://lkml.kernel.org/r/20201109192128.960-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20201109192128.960-2-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      77bc7fd6
    • m68k: deprecate DISCONTIGMEM · fcd353a3
      Mike Rapoport authored
      DISCONTIGMEM was intended to provide more efficient support than FLATMEM
      for systems with holes in their physical address space.
      
      Yet, its overhead in terms of memory consumption seems to
      outweigh the savings on the unused memory map.
      
      For an ARAnyM system with 16 MBytes of FastRAM configured, the memory
      usage reported after page allocator initialization is
      
        Memory: 23828K/30720K available (3206K kernel code, 535K rwdata, 936K rodata, 768K init, 193K bss, 6892K reserved, 0K cma-reserved)
      
      and with DISCONTIGMEM disabled and with a relatively large hole in the memory
      map it is:
      
        Memory: 23864K/30720K available (3197K kernel code, 516K rwdata, 936K rodata, 764K init, 179K bss, 6856K reserved, 0K cma-reserved)
      
      Moreover, since m68k already has custom pfn_valid() it is possible to
      define HAVE_ARCH_PFN_VALID to enable freeing of unused memory map.  The
      minimal size of a hole that can be freed should not be less than
      MAX_ORDER_NR_PAGES so to achieve more substantial memory savings let
      m68k also define custom FORCE_MAX_ZONEORDER.
      
      With FORCE_MAX_ZONEORDER set to 9 memory usage becomes:
      
        Memory: 23880K/30720K available (3197K kernel code, 516K rwdata, 936K rodata, 764K init, 179K bss, 6840K reserved, 0K cma-reserved)
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-14-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fcd353a3
    • m68k/mm: enable use of generic memory_model.h for !DISCONTIGMEM · 4bfc848e
      Mike Rapoport authored
      The pg_data_map and pg_data_table arrays as well as page_to_pfn() and
      pfn_to_page() are required only for DISCONTIGMEM. Other memory models can
      use the generic definitions in asm-generic/memory_model.h.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-13-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4bfc848e
    • m68k/mm: make node data and node setup depend on CONFIG_DISCONTIGMEM · 6b2ad8d7
      Mike Rapoport authored
      The pg_data_t node structures and their initialization currently depend on
      !CONFIG_SINGLE_MEMORY_CHUNK. Since they are required only for DISCONTIGMEM
      make this dependency explicit and replace usage of
      CONFIG_SINGLE_MEMORY_CHUNK with CONFIG_DISCONTIGMEM where appropriate.
      
      The CONFIG_SINGLE_MEMORY_CHUNK was implicitly disabled on the ColdFire MMU
      variant, although it always presumed a single memory bank. As there is no
      actual need for DISCONTIGMEM in this case, make sure that ColdFire MMU
      systems set CONFIG_SINGLE_MEMORY_CHUNK to 'y'.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-12-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b2ad8d7
    • arc: use FLATMEM with freeing of unused memory map instead of DISCONTIGMEM · 050b2da2
      Mike Rapoport authored
      Currently ARC uses DISCONTIGMEM to cope with sparse physical memory address
      space on systems with 2 memory banks. While DISCONTIGMEM avoids wasting
      memory on the unpopulated memory map, it adds both memory and CPU overhead
      relative to FLATMEM. Moreover, DISCONTIGMEM is generally considered
      deprecated.
      
      The obvious replacement for DISCONTIGMEM would be SPARSEMEM, but it is also
      less efficient than FLATMEM in pfn_to_page() and page_to_pfn() conversions.
      Besides, it requires tuning of SECTION_SIZE, which is not trivial for the
      possible ARC memory configurations.
      
      Since the memory map for both banks is always allocated from the "lowmem"
      bank, it is possible to use FLATMEM for a two-bank configuration and simply
      free the unused hole in the memory map. All that is required for that is to
      provide an ARC-specific pfn_valid() that takes into account the actual
      physical memory configuration, and to define HAVE_ARCH_PFN_VALID.
      
      The resulting kernel image configured with defconfig + HIGHMEM=y is
      smaller:
      
        $ size a/vmlinux b/vmlinux
           text    data     bss     dec     hex filename
        4673503 1245456  279756 6198715  5e95bb a/vmlinux
        4658706 1246864  279756 6185326  5e616e b/vmlinux
      
        $ ./scripts/bloat-o-meter a/vmlinux b/vmlinux
        add/remove: 28/30 grow/shrink: 42/399 up/down: 10986/-29025 (-18039)
        ...
        Total: Before=4709315, After = 4691276, chg -0.38%
      
      Booting nSIM with haps_ns.dts results in the following memory usage
      reports:
      
        a:
        Memory: 1559104K/1572864K available (3531K kernel code, 595K rwdata, 752K rodata, 136K init, 275K bss, 13760K reserved, 0K cma-reserved, 1048576K highmem)
      
        b:
        Memory: 1559112K/1572864K available (3519K kernel code, 594K rwdata, 752K rodata, 136K init, 280K bss, 13752K reserved, 0K cma-reserved, 1048576K highmem)
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-11-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      050b2da2
    • arm, arm64: move free_unused_memmap() to generic mm · 4f5b0c17
      Mike Rapoport authored
      ARM and ARM64 free unused parts of the memory map just before the
      initialization of the page allocator. To allow holes in the memory map both
      architectures overload pfn_valid() and define HAVE_ARCH_PFN_VALID.
      
      Allowing holes in the memory map for FLATMEM may be useful for small
      machines, such as ARC and m68k and will enable those architectures to cease
      using DISCONTIGMEM and still support more than one memory bank.
      
      Move the functions that free unused memory map to generic mm and enable
      them in case HAVE_ARCH_PFN_VALID=y.
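
      The HAVE_ARCH_PFN_VALID implementations behind this generally boil down
      to a memblock-backed check; a rough sketch of the pattern (not the exact
      arm/arm64 code):

        /*
         * A pfn is valid only if the corresponding physical range is known to
         * memblock, so struct pages freed from holes in the memory map are
         * never dereferenced.
         */
        int pfn_valid(unsigned long pfn)
        {
                phys_addr_t addr = __pfn_to_phys(pfn);

                /* reject pfns whose physical address would truncate or wrap */
                if (__phys_to_pfn(addr) != pfn)
                        return 0;

                return memblock_is_map_memory(addr);
        }
        EXPORT_SYMBOL(pfn_valid);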
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-10-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f5b0c17
    • arm: remove CONFIG_ARCH_HAS_HOLES_MEMORYMODEL · 5e545df3
      Mike Rapoport authored
      ARM is the only architecture that defines CONFIG_ARCH_HAS_HOLES_MEMORYMODEL,
      which in turn enables the memmap_valid_within() function that is intended
      to verify the existence of a struct page associated with a pfn when there
      are holes in the memory map.
      
      However, ARCH_HAS_HOLES_MEMORYMODEL also enables HAVE_ARCH_PFN_VALID and an
      arch-specific pfn_valid() implementation that also deals with the holes
      in the memory map.
      
      The only two users of memmap_valid_within() call this function after
      a call to pfn_valid() so the memmap_valid_within() check becomes redundant.
      
      Remove CONFIG_ARCH_HAS_HOLES_MEMORYMODEL and memmap_valid_within() and rely
      entirely on ARM's implementation of pfn_valid() that is now enabled
      unconditionally.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-9-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5e545df3
    • ia64: make SPARSEMEM default and disable DISCONTIGMEM · 214496cb
      Mike Rapoport authored
      The SPARSEMEM memory model is suitable for systems with large holes in
      their physical memory layout. With SPARSEMEM_VMEMMAP enabled it provides
      pfn_to_page() and page_to_pfn() as fast as FLATMEM.
      
      Make it the default memory model for IA-64 and disable DISCONTIGMEM, which
      has been considered obsolete for quite some time.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-8-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      214496cb
    • ia64: forbid using VIRTUAL_MEM_MAP with FLATMEM · ea34f78f
      Mike Rapoport authored
      The virtual memory map was intended to avoid wasting memory on the memory
      map on systems with large holes in the physical memory layout. It has long
      since been superseded, first by DISCONTIGMEM and then by SPARSEMEM.
      Moreover, SPARSEMEM_VMEMMAP provides the same functionality in a much more
      portable way.
      
      As the first step to removing VIRTUAL_MEM_MAP, forbid its usage with
      FLATMEM and panic on systems with large holes in the physical memory
      layout that try to run FLATMEM kernels.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-7-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea34f78f
    • ia64: split virtual map initialization out of paging_init() · 1f112129
      Mike Rapoport authored
      For both FLATMEM and DISCONTIGMEM/SPARSEMEM the virtual map initialization
      is spread over paging_init() for no good reason.
      
      Split out the bits related to virtual map initialization into helper
      functions, one for FLATMEM and another for !FLATMEM configurations.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-6-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f112129
    • ia64: discontig: paging_init(): remove local max_pfn calculation · b90b5547
      Mike Rapoport authored
      The maximal PFN in the system is calculated at find_memory() time and
      is then stored in max_low_pfn.
      
      Use this value in paging_init() and remove the redundant detection of
      max_pfn in that function.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-5-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b90b5547
    • ia64: remove 'ifdef CONFIG_ZONE_DMA32' statements · 5d37fc0b
      Mike Rapoport authored
      After the removal of the SN2 platform (commit cf07cb1f ("ia64: remove
      support for the SGI SN2 platform")), IA-64 always has ZONE_DMA32 and there
      is no point in guarding code with this configuration option.
      
      Remove the ifdefery associated with CONFIG_ZONE_DMA32.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-4-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d37fc0b
    • ia64: remove custom __early_pfn_to_nid() · 03e92a5e
      Mike Rapoport authored
      The ia64 implementation of __early_pfn_to_nid() essentially relies on the
      same data as the generic implementation.
      
      The correspondence between memory ranges and nodes is set in memblock
      during early memory initialization in register_active_ranges() function.
      
      The initialization of sparsemem that requires early_pfn_to_nid() happens
      later and it can use the memblock information like the other architectures.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-3-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03e92a5e
    • alpha: switch from DISCONTIGMEM to SPARSEMEM · 36d40290
      Mike Rapoport authored
      Patch series "arch, mm: deprecate DISCONTIGMEM", v2.
      
      DISCONTIGMEM has been generally considered deprecated for a while now,
      but it is still used by four architectures.  This set replaces
      DISCONTIGMEM with a different way to handle holes in the memory map and
      marks DISCONTIGMEM configuration as BROKEN in Kconfigs of these
      architectures with the intention to completely remove it in several
      releases.
      
      While for 64-bit alpha and ia64 the switch to SPARSEMEM is quite obvious
      and was a matter of moving some bits around, for smaller 32-bit arc and
      m68k SPARSEMEM is not necessarily the best thing to do.
      
      On 32-bit machines SPARSEMEM would require large sections to make section
      index fit in the page flags, but larger sections mean that more memory is
      wasted for unused memory map.
      
      Besides, pfn_to_page() and page_to_pfn() become less efficient, at least
      on arc.
      
      So I've decided to generalize arm's approach for freeing of unused parts
      of the memory map with FLATMEM and enable it for both arc and m68k.  The
      details are in the description of patches 10 (arc) and 13 (m68k).
      
      This patch (of 13):
      
      Enable SPARSEMEM support on alpha and deprecate DISCONTIGMEM.
      
      The required changes are mostly around moving duplicated definitions of
      page access and address conversion macros to a common place and making sure
      they are available for all memory models.
      
      The DISCONTIGMEM support is marked as BROKEN and will be removed in a
      couple of releases.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20201101170454.9567-2-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      36d40290
    • lkdtm: disable KASAN for rodata.o · 6d5a88cd
      Marco Elver authored
      Building lkdtm with KASAN and Clang 11 or later results in the following
      error when attempting to load the module:
      
        kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
        BUG: unable to handle page fault for address: ffffffffc019cd70
        #PF: supervisor instruction fetch in kernel mode
        #PF: error_code(0x0011) - permissions violation
        ...
        RIP: 0010:asan.module_ctor+0x0/0xffffffffffffa290 [lkdtm]
        ...
        Call Trace:
         do_init_module+0x17c/0x570
         load_module+0xadee/0xd0b0
         __x64_sys_finit_module+0x16c/0x1a0
         do_syscall_64+0x34/0x50
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The reason is that rodata.o generates a dummy function that lives in
      .rodata to validate that .rodata can't be executed; however, Clang 11 adds
      KASAN globals support by generating module constructors to initialize
      globals redzones.  When Clang 11 adds a module constructor to rodata.o, it
      is also added to .rodata: any attempt to call it on initialization results
      in the above error.
      
      Therefore, disable KASAN instrumentation for rodata.o.
      
      Link: https://lkml.kernel.org/r/20201214191413.3164796-1-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d5a88cd
    • kasan: update documentation for generic kasan · 4784be28
      Walter Wu authored
      Generic KASAN also supports recording the last two workqueue stacks and
      printing them in the KASAN report.  The documentation needs to be updated
      accordingly.
      
      Link: https://lkml.kernel.org/r/20201203023037.30792-1-walter-zh.wu@mediatek.com
      Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com>
      Suggested-by: Marco Elver <elver@google.com>
      Acked-by: Marco Elver <elver@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4784be28
    • lib/test_kasan.c: add workqueue test case · 214c783d
      Walter Wu authored
      Add a test to verify that the workqueue stack is recorded and printed in
      the KASAN report.
      
      The KASAN report was as follows (cleaned up slightly):
      
       BUG: KASAN: use-after-free in kasan_workqueue_uaf
      
       Freed by task 54:
        kasan_save_stack+0x24/0x50
        kasan_set_track+0x24/0x38
        kasan_set_free_info+0x20/0x40
        __kasan_slab_free+0x10c/0x170
        kasan_slab_free+0x10/0x18
        kfree+0x98/0x270
        kasan_workqueue_work+0xc/0x18
      
       Last potentially related work creation:
        kasan_save_stack+0x24/0x50
        kasan_record_wq_stack+0xa8/0xb8
        insert_work+0x48/0x288
        __queue_work+0x3e8/0xc40
        queue_work_on+0xf4/0x118
        kasan_workqueue_uaf+0xfc/0x190
      
      Link: https://lkml.kernel.org/r/20201203022748.30681-1-walter-zh.wu@mediatek.com
      Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com>
      Acked-by: Marco Elver <elver@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      214c783d
    • kasan: print workqueue stack · ef133461
      Walter Wu authored
      The aux_stack[2] array is reused to record both the call_rcu() call stack
      and the work-enqueuing call stacks, so change the auxiliary stack titles to
      a common title and print them in the KASAN report.
      
      Link: https://lkml.kernel.org/r/20201203022715.30635-1-walter-zh.wu@mediatek.com
      Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com>
      Suggested-by: Marco Elver <elver@google.com>
      Acked-by: Marco Elver <elver@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef133461
    • workqueue: kasan: record workqueue stack · e89a85d6
      Walter Wu authored
      Patch series "kasan: add workqueue stack for generic KASAN", v5.
      
      Syzbot reports many UAF issues for workqueue, see [1].
      
      In some of these the access/allocation happened in process_one_work(), and
      the free stack in the KASAN report is useless; it doesn't help programmers
      solve UAF issues in the workqueue.
      
      This patchset improves KASAN reports by making them include the workqueue
      queueing stack.  It is useful for programmers to solve use-after-free or
      double-free memory issue.
      
      Generic KASAN also records the last two workqueue stacks and prints them
      in KASAN report.  It is only suitable for generic KASAN.
      
      [1] https://groups.google.com/g/syzkaller-bugs/search?q=%22use-after-free%22+process_one_work
      [2] https://bugzilla.kernel.org/show_bug.cgi?id=198437
      
      This patch (of 4):
      
      When analyzing a use-after-free or double-free issue, recording the
      work-enqueuing call stacks helps preserve the usage history and can give
      a hint about the affected code.

      For workqueues it has turned out to be useful to record these enqueuing
      stacks: users can then determine from the KASAN report alone whether the
      queued work is the root cause, without having to enable debugobjects.
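
      As a rough sketch (assuming the existing generic-KASAN helper
      kasan_record_aux_stack(); the wrapper name and the abbreviated
      insert_work() body below are illustrative), the recording hooks into the
      queueing path roughly like this:

        /* mm/kasan/generic.c -- illustrative wrapper */
        void kasan_record_wq_stack(struct work_struct *work)
        {
                /* save the current call stack in the object's aux_stack[] */
                kasan_record_aux_stack(work);
        }

        /* kernel/workqueue.c -- abbreviated */
        static void insert_work(struct pool_workqueue *pwq, struct work_struct *work,
                                struct list_head *head, unsigned int extra_flags)
        {
                /* record who queued this work, so a later UAF report on it
                 * can print the queueing stack */
                kasan_record_wq_stack(work);

                set_work_pwq(work, pwq, extra_flags);
                list_add_tail(&work->entry, head);
                /* ... existing wake-up logic ... */
        }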
      
      Link: https://lkml.kernel.org/r/20201203022148.29754-1-walter-zh.wu@mediatek.com
      Link: https://lkml.kernel.org/r/20201203022442.30006-1-walter-zh.wu@mediatek.com
      Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com>
      Suggested-by: default avatarMarco Elver <elver@google.com>
      Acked-by: default avatarMarco Elver <elver@google.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e89a85d6
    • Vincenzo Frascino's avatar
      mm/vmalloc.c: fix kasan shadow poisoning size · c041098c
      Vincenzo Frascino authored
      The size of a vm area can be affected by the presence or absence of the
      guard page.  In particular, unless VM_NO_GUARD is set, the actual
      accessible size is the recorded size minus the guard page.

      Currently KASAN does not take this into account during the poison
      operation and in particular tries to poison the guard page as well.

      This is incorrect but does not cause a problem today, because the tags
      for the guard page are only written to shadow memory.  With the future
      introduction of Tag-Based KASAN, where the guard page is inaccessible by
      nature, writing a tag for this page would trigger a fault.

      Fix the KASAN shadow poisoning size by calling get_vm_area_size()
      instead of reading the size field of the data structure directly.
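
      A minimal sketch of the fix, assuming the existing get_vm_area_size()
      helper and the kasan_poison_vmalloc() call in the vfree path (the exact
      call site is abbreviated):

        /* get_vm_area_size() returns the accessible size, i.e. it subtracts
         * the guard page unless VM_NO_GUARD is set. */
        static inline size_t get_vm_area_size(const struct vm_struct *area)
        {
                if (!(area->flags & VM_NO_GUARD))
                        /* return actual size without guard page */
                        return area->size - PAGE_SIZE;
                return area->size;
        }

        /* in __vunmap(), poison only the accessible range: */
        kasan_poison_vmalloc(area->addr, get_vm_area_size(area));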
      
      Link: https://lkml.kernel.org/r/20201027160213.32904-1-vincenzo.frascino@arm.com
      Fixes: d98c9e83 ("kasan: fix crashes on access to memory mapped by vm_map_ram()")
      Signed-off-by: default avatarVincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c041098c
    • Alex Shi's avatar
      docs/vm: remove unused 3 items explanation for /proc/vmstat · 56db19fe
      Alex Shi authored
      Commit 5647bc29 ("mm: compaction: Move migration fail/success
      stats to migrate.c") removed 3 items from /proc/vmstat, but the docs
      still explain them.  Let's remove the explanations for:
      
      "compact_blocks_moved",
      "compact_pages_moved",
      "compact_pagemigrate_failed",
      
      Link: https://lkml.kernel.org/r/1605520282-51993-1-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: default avatarZi Yan <ziy@nvidia.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      56db19fe
    • Waiman Long's avatar
      mm/vmalloc: Fix unlock order in s_stop() · 0a7dd4e9
      Waiman Long authored
      When multiple locks are acquired, they should be released in reverse
      order. For s_start() and s_stop() in mm/vmalloc.c, that is not the
      case.
      
        s_start: mutex_lock(&vmap_purge_lock); spin_lock(&vmap_area_lock);
        s_stop : mutex_unlock(&vmap_purge_lock); spin_unlock(&vmap_area_lock);
      
      This unlock sequence, though allowed, is not optimal. If a waiter is
      present, mutex_unlock() will need to go through the slowpath of waking
      up the waiter with preemption disabled. Fix that by releasing the
      spinlock first before the mutex.
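
      A sketch of the reordered unlock path (bodies abbreviated to the locking
      calls):

        static void *s_start(struct seq_file *m, loff_t *pos)
        {
                mutex_lock(&vmap_purge_lock);
                spin_lock(&vmap_area_lock);
                /* ... */
        }

        static void s_stop(struct seq_file *m, void *p)
        {
                /* release in reverse order of acquisition: drop the spinlock
                 * first so mutex_unlock() can wake any waiter with
                 * preemption enabled */
                spin_unlock(&vmap_area_lock);
                mutex_unlock(&vmap_purge_lock);
        }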
      
      Link: https://lkml.kernel.org/r/20201213180843.16938-1-longman@redhat.com
      Fixes: e36176be ("mm/vmalloc: rework vmap_area_lock")
      Signed-off-by: default avatarWaiman Long <longman@redhat.com>
      Reviewed-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a7dd4e9
    • Alex Shi's avatar
      mm/vmalloc: add 'align' parameter explanation for pvm_determine_end_from_reverse · 799fa85d
      Alex Shi authored
      Kernel-doc markup has an issue with pvm_determine_end_from_reverse:
      
        mm/vmalloc.c:3145: warning: Function parameter or member 'align' not described in 'pvm_determine_end_from_reverse'
      
      Add an explanation for the 'align' parameter to remove the warning.
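
      The fix is a kernel-doc line for the parameter along these lines (the
      exact wording here is illustrative):

        /**
         * pvm_determine_end_from_reverse - find the highest aligned address
         * below the vmalloc end
         * @va: used to traverse the free block tree in reverse order
         * @align: alignment required for the returned end address
         */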
      
      Link: https://lkml.kernel.org/r/1605605088-30668-3-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      799fa85d
    • Uladzislau Rezki (Sony)'s avatar
      mm/vmalloc: rework the drain logic · 96e2db45
      Uladzislau Rezki (Sony) authored
      The current "lazy drain" model suffers from at least two issues.

      The first is related to the unsorted list of vmap areas: identifying the
      [min:max] range of areas to be drained requires a full list scan, which
      is time consuming if the list is long.

      The second, as a next step, is the merging of all fragments back into
      the free space, which is also time consuming because it has to iterate
      over the entire list holding the outstanding lazy areas.

      The "preemptirqsoff" trace below illustrates the resulting high latency
      of ~24676us.  Workloads like audio and video are affected by such long
      latencies:
      
      <snip>
        tracer: preemptirqsoff
      
        preemptirqsoff latency trace v1.1.5 on 4.9.186-perf+
        --------------------------------------------------------------------
        latency: 24676 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 P:8)
           -----------------
           | task: crtc_commit:112-261 (uid:0 nice:0 policy:1 rt_prio:16)
           -----------------
         => started at: __purge_vmap_area_lazy
         => ended at:   __purge_vmap_area_lazy
      
                         _------=> CPU#
                        / _-----=> irqs-off
                       | / _----=> need-resched
                       || / _---=> hardirq/softirq
                       ||| / _--=> preempt-depth
                       |||| /     delay
         cmd     pid   ||||| time  |   caller
            \   /      |||||  \    |   /
      crtc_com-261     1...1    1us*: _raw_spin_lock <-__purge_vmap_area_lazy
      [...]
      crtc_com-261     1...1 24675us : _raw_spin_unlock <-__purge_vmap_area_lazy
      crtc_com-261     1...1 24677us : trace_preempt_on <-__purge_vmap_area_lazy
      crtc_com-261     1...1 24683us : <stack trace>
       => free_vmap_area_noflush
       => remove_vm_area
       => __vunmap
       => vfree
       => drm_property_free_blob
       => drm_mode_object_unreference
       => drm_property_unreference_blob
       => __drm_atomic_helper_crtc_destroy_state
       => sde_crtc_destroy_state
       => drm_atomic_state_default_clear
       => drm_atomic_state_clear
       => drm_atomic_state_free
       => complete_commit
       => _msm_drm_commit_work_cb
       => kthread_worker_fn
       => kthread
       => ret_from_fork
      <snip>
      
      To address these two issues, redesign the purging of the outstanding
      lazy areas.  Instead of queuing vmap areas on a plain list, keep them in
      a separate rb-tree with a sorted list alongside, so that areas are held
      in ascending address order.  This gives the following advantages (see
      the sketch after this list):

      a) Outstanding vmap areas are merged into bigger coalesced blocks, so
         the lazy set becomes less fragmented.

      b) The flush range [min:max] can be calculated without scanning all
         elements, i.e. with O(1) access time;

      c) The final merge of areas into the rb-tree that represents the free
         space is faster because of (a).  As a result, lock contention is
         also reduced.
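
      A rough sketch of the idea, assuming the existing
      merge_or_add_vmap_area() helper (the purge tree/list names and the
      abbreviated bodies below are illustrative):

        static struct rb_root purge_vmap_area_root = RB_ROOT;
        static LIST_HEAD(purge_vmap_area_list);

        /* Lazily freed areas are merged into a sorted tree/list instead of
         * being appended to an unsorted list. */
        static void queue_lazy_area(struct vmap_area *va)
        {
                merge_or_add_vmap_area(va, &purge_vmap_area_root,
                                       &purge_vmap_area_list);
        }

        static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
        {
                /* O(1): the list is kept in ascending address order, so the
                 * flush range is just the first and the last entry. */
                start = min(start, list_first_entry(&purge_vmap_area_list,
                                        struct vmap_area, list)->va_start);
                end = max(end, list_last_entry(&purge_vmap_area_list,
                                        struct vmap_area, list)->va_end);

                flush_tlb_kernel_range(start, end);
                /* ... return each merged block to the free vmap space ... */
                return true;
        }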
      
      Link: https://lkml.kernel.org/r/20201116220033.1837-2-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      96e2db45
    • Uladzislau Rezki (Sony)'s avatar
      mm/vmalloc: use free_vm_area() if an allocation fails · 8945a723
      Uladzislau Rezki (Sony) authored
      There is a dedicated function that finds and removes a continuous
      kernel virtual area and, as a final step, also releases the "area", the
      descriptor of the corresponding vm_struct.

      Use free_vm_area() in __vmalloc_node_range() for that cleanup instead
      of open-coding the exact same steps.
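
      For reference, free_vm_area() already performs exactly the two steps the
      error path used to open-code (shown here as a sketch of the mm/vmalloc.c
      helper):

        void free_vm_area(struct vm_struct *area)
        {
                struct vm_struct *ret;

                /* unlink the area from the vmap structures ... */
                ret = remove_vm_area(area->addr);
                BUG_ON(ret != area);
                /* ... and release its descriptor */
                kfree(area);
        }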
      
      Link: https://lkml.kernel.org/r/20201116220033.1837-1-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8945a723
    • Andrew Morton's avatar
      mm/vmalloc.c:__vmalloc_area_node(): avoid 32-bit overflow · 34fe6537
      Andrew Morton authored
      On a machine with 3 TB (more than 2 TB of memory), using vmalloc to
      allocate more than 2 TB overflows array_size below.

      array_size is an unsigned int and can therefore only describe
      allocations of less than 2 TB.  If you pass 2*1024*1024*1024*1024 =
      2 * 2^40 bytes to vmalloc, array_size becomes 2 * 2^31 = 2^32, which
      cannot be stored in a 32-bit integer.
      
      The fix is to change the type of array_size to unsigned long.
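
      A sketch of the fix in __vmalloc_area_node(), assuming the surrounding
      variable names (nr_pages here stands for the page count derived from
      the area size):

        unsigned long array_size;   /* was unsigned int: overflows at 2 TB */
        unsigned int nr_pages;

        nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
        /* promote to unsigned long before multiplying so the product of
         * nr_pages and the pointer size cannot wrap around 2^32 */
        array_size = (unsigned long)nr_pages * sizeof(struct page *);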
      
      [akpm@linux-foundation.org: rework for current mainline]
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=210023
      Reported-by: <hsinhuiwu@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      34fe6537
    • Daniel Vetter's avatar
      locking/selftests: add testcases for fs_reclaim · d5037d1d
      Daniel Vetter authored
      Since I butchered this, I figured it is better to make sure we have
      testcases for it now.  Since we only have a locking context for
      __GFP_FS, that's the only thing we're testing right now.
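
      As a sketch (not the actual lib/locking-selftest.c code), the two
      interesting cases look roughly like this: a GFP_NOFS allocation nested
      inside fs_reclaim must not complain, while a __GFP_FS allocation in the
      same context should produce a lockdep splat:

        static void fs_reclaim_correct_nesting(void)
        {
                fs_reclaim_acquire(GFP_KERNEL);
                might_alloc(GFP_NOFS);          /* no __GFP_FS: no recursion */
                fs_reclaim_release(GFP_KERNEL);
        }

        static void fs_reclaim_wrong_nesting(void)
        {
                fs_reclaim_acquire(GFP_KERNEL);
                might_alloc(GFP_KERNEL);        /* __GFP_FS inside reclaim: splat */
                fs_reclaim_release(GFP_KERNEL);
        }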
      
      Link: https://lkml.kernel.org/r/20201125162532.1299794-4-daniel.vetter@ffwll.ch
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d5037d1d
    • Daniel Vetter's avatar
      mm: extract might_alloc() debug check · 95d6c701
      Daniel Vetter authored
      Extracted from slab.h, which seems to have the most complete version
      including the correct might_sleep() check.  Roll it out to slob.c.
      
      Motivated by a discussion with Paul about possibly changing call_rcu
      behaviour to allocate memory, but only roughly every 500th call.
      
      There are a lot fewer places in the kernel that care about whether
      allocating memory is allowed or not (due to deadlocks with reclaim code)
      than places that care whether sleeping is allowed.  But debugging these
      also tends to be a lot harder, so nice descriptive checks could come in
      handy.  I might have some use eventually for annotations in drivers/gpu.
      
      Note that unlike fs_reclaim_acquire/release gfpflags_allow_blocking does
      not consult the PF_MEMALLOC flags.  But there is no flag equivalent for
      GFP_NOWAIT, hence this check can't go wrong due to
      memalloc_no*_save/restore contexts.  Willy is working on a patch series
      which might change this:
      
      https://lore.kernel.org/linux-mm/20200625113122.7540-7-willy@infradead.org/
      
      I think best would be if that updates gfpflags_allow_blocking(), since
      there's a ton of callers all over the place for that already.
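
      A sketch of what such a helper looks like, consistent with the
      description above (treat the exact body as illustrative):

        static inline void might_alloc(gfp_t gfp_mask)
        {
                /* pull in the fs_reclaim lockdep context for this site */
                fs_reclaim_acquire(gfp_mask);
                fs_reclaim_release(gfp_mask);

                /* and the usual "may sleep" check for blocking allocations */
                might_sleep_if(gfpflags_allow_blocking(gfp_mask));
        }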
      
      Link: https://lkml.kernel.org/r/20201125162532.1299794-3-daniel.vetter@ffwll.ch
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarPaul E. McKenney <paulmck@kernel.org>
      Reviewed-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      95d6c701
    • Daniel Vetter's avatar
      mm: track mmu notifiers in fs_reclaim_acquire/release · f920e413
      Daniel Vetter authored
      fs_reclaim_acquire/release nicely catch recursion issues when allocating
      GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep
      the excessive caches in check).  For mmu notifier recursions we do have
      lockdep annotations since 23b68395 ("mm/mmu_notifiers: add a lockdep
      map for invalidate_range_start/end").
      
      But these only fire if a path actually results in some pte invalidation -
      for most small allocations that's very rarely the case.  The other trouble
      is that pte invalidation can happen any time when __GFP_RECLAIM is set.
      This means that really only GFP_ATOMIC is a safe choice; GFP_NOIO isn't
      good enough to avoid potential mmu notifier recursion.
      
      I was pondering whether we should just do the general annotation, but
      there's always the risk of false positives.  Plus I'm assuming that the
      core fs and io code is a lot better reviewed and tested than random mmu
      notifier code in drivers.  Hence I decided to only annotate that
      specific case.
      
      Furthermore, even if we created a lockdep map for direct reclaim, we'd
      still need to explicitly pull in the mmu notifier map - there are a lot
      more places that do pte invalidation than just direct reclaim; these
      two contexts aren't the same.
      
      Note that the mmu notifiers needing their own independent lockdep map
      is also the reason we can't hold them from fs_reclaim_acquire to
      fs_reclaim_release - it would nest with the acquisition in the pte
      invalidation code, causing a lockdep splat.  And we can't remove the
      annotations from pte invalidation and all the other places since
      they're called from many places other than page reclaim.  Hence we can
      only do the equivalent of might_lock, but on the raw lockdep map.
      
      With this we can also remove the lockdep priming added in 66204f1d
      ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly
      more powerful.
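
      A rough sketch of the "might_lock on the raw lockdep map" idea inside
      fs_reclaim_acquire() (abbreviated and illustrative, assuming the
      existing __mmu_notifier_invalidate_range_start_map lockdep map):

        void fs_reclaim_acquire(gfp_t gfp_mask)
        {
                gfp_mask = current_gfp_context(gfp_mask);

                if (__need_reclaim(gfp_mask)) {
                        if (gfp_mask & __GFP_FS)
                                __fs_reclaim_acquire();

        #ifdef CONFIG_MMU_NOTIFIER
                        /* equivalent of might_lock(): acquire and immediately
                         * release the mmu notifier map so any allocation site
                         * that could recurse into notifiers gets flagged */
                        lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
                        lock_map_release(&__mmu_notifier_invalidate_range_start_map);
        #endif
                }
        }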
      
      Link: https://lkml.kernel.org/r/20201125162532.1299794-2-daniel.vetter@ffwll.ch
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Reviewed-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f920e413