1. 16 Oct, 2020 40 commits
    • mm/page_alloc: convert "report" flag of __free_one_page() to a proper flag · f04a5d5d
      David Hildenbrand authored
      Patch series "mm: place pages to the freelist tail when onlining and undoing isolation", v2.
      
      When adding separate memory blocks via add_memory*() and onlining them
      immediately, the metadata (especially the memmap) of the next block will
      be placed onto one of the just added+onlined blocks.  This creates a chain
      of unmovable allocations: if the last memory block cannot get
      offlined+removed, neither can any of the dependent ones.  We directly have
      unmovable allocations all over the place.
      
      This can be observed quite easily using virtio-mem; however, it can also
      be observed when using DIMMs.  The freshly onlined pages will usually be
      placed at the head of the freelists, meaning they will be allocated next,
      usually turning the just-added memory un-removable immediately.  The fresh
      pages are cold, so preferring to allocate others (that might be hot) also
      feels like the natural thing to do.
      
      It also applies to the hyper-v balloon, xen-balloon, and ppc64 dlpar: when
      adding separate, successive memory blocks, each memory block will have
      unmovable allocations on it - for example, gigantic pages will fail to
      allocate.
      
      While the ZONE_NORMAL doesn't provide any guarantees that memory can get
      offlined+removed again (any kind of fragmentation with unmovable
      allocations is possible), there are many scenarios (hotplugging a lot of
      memory, running workload, hotunplug some memory/as much as possible) where
      we can offline+remove quite a lot with this patchset.
      
      a) To visualize the problem, a very simple example:
      
      Start a VM with 4GB and 8GB of virtio-mem memory:
      
       [root@localhost ~]# lsmem
       RANGE                                 SIZE  STATE REMOVABLE  BLOCK
       0x0000000000000000-0x00000000bfffffff   3G online       yes   0-23
       0x0000000100000000-0x000000033fffffff   9G online       yes 32-103
      
       Memory block size:       128M
       Total online memory:      12G
       Total offline memory:      0B
      
      Then try to unplug as much as possible using virtio-mem. Observe which
      memory blocks are still around. Without this patch set:
      
       [root@localhost ~]# lsmem
       RANGE                                  SIZE  STATE REMOVABLE   BLOCK
       0x0000000000000000-0x00000000bfffffff    3G online       yes    0-23
       0x0000000100000000-0x000000013fffffff    1G online       yes   32-39
       0x0000000148000000-0x000000014fffffff  128M online       yes      41
       0x0000000158000000-0x000000015fffffff  128M online       yes      43
       0x0000000168000000-0x000000016fffffff  128M online       yes      45
       0x0000000178000000-0x000000017fffffff  128M online       yes      47
       0x0000000188000000-0x0000000197ffffff  256M online       yes   49-50
       0x00000001a0000000-0x00000001a7ffffff  128M online       yes      52
       0x00000001b0000000-0x00000001b7ffffff  128M online       yes      54
       0x00000001c0000000-0x00000001c7ffffff  128M online       yes      56
       0x00000001d0000000-0x00000001d7ffffff  128M online       yes      58
       0x00000001e0000000-0x00000001e7ffffff  128M online       yes      60
       0x00000001f0000000-0x00000001f7ffffff  128M online       yes      62
       0x0000000200000000-0x0000000207ffffff  128M online       yes      64
       0x0000000210000000-0x0000000217ffffff  128M online       yes      66
       0x0000000220000000-0x0000000227ffffff  128M online       yes      68
       0x0000000230000000-0x0000000237ffffff  128M online       yes      70
       0x0000000240000000-0x0000000247ffffff  128M online       yes      72
       0x0000000250000000-0x0000000257ffffff  128M online       yes      74
       0x0000000260000000-0x0000000267ffffff  128M online       yes      76
       0x0000000270000000-0x0000000277ffffff  128M online       yes      78
       0x0000000280000000-0x0000000287ffffff  128M online       yes      80
       0x0000000290000000-0x0000000297ffffff  128M online       yes      82
       0x00000002a0000000-0x00000002a7ffffff  128M online       yes      84
       0x00000002b0000000-0x00000002b7ffffff  128M online       yes      86
       0x00000002c0000000-0x00000002c7ffffff  128M online       yes      88
       0x00000002d0000000-0x00000002d7ffffff  128M online       yes      90
       0x00000002e0000000-0x00000002e7ffffff  128M online       yes      92
       0x00000002f0000000-0x00000002f7ffffff  128M online       yes      94
       0x0000000300000000-0x0000000307ffffff  128M online       yes      96
       0x0000000310000000-0x0000000317ffffff  128M online       yes      98
       0x0000000320000000-0x0000000327ffffff  128M online       yes     100
       0x0000000330000000-0x000000033fffffff  256M online       yes 102-103
      
       Memory block size:       128M
       Total online memory:     8.1G
       Total offline memory:      0B
      
      With this patch set:
      
       [root@localhost ~]# lsmem
       RANGE                                 SIZE  STATE REMOVABLE BLOCK
       0x0000000000000000-0x00000000bfffffff   3G online       yes  0-23
       0x0000000100000000-0x000000013fffffff   1G online       yes 32-39
      
       Memory block size:       128M
       Total online memory:       4G
       Total offline memory:      0B
      
      All memory can get unplugged, all memory blocks can get removed.  Of
      course, no workload ran and the system was basically idle, but it
      highlights the issue - the fairly deterministic chain of unmovable
      allocations.  When a huge page for the 2MB memmap is needed, a
      just-onlined 4MB page will be split.  The remaining 2MB page will be used
      for the memmap of the next memory block.  So one memory block will hold
      the memmap of the two following memory blocks.  Finally the pages of the
      last-onlined memory block will get used for the next bigger allocations -
      if any allocation is unmovable, all dependent memory blocks cannot get
      unplugged and removed until that allocation is gone.
      
      Note that with bigger memory blocks (e.g., 256MB), *all* memory
      blocks are dependent and none can get unplugged again!
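
The chain described above can be illustrated with a toy model. This is only a sketch under strong simplifying assumptions, not kernel code: the free list is a plain deque of equally sized chunks, every memory block needs exactly one unmovable "memmap" chunk that is taken from the head of the list before the block is onlined, and all names (toy_freelist, toy_count_pinned_blocks) are hypothetical. It is coarser than the 4MB/2MB split described above, but it shows why head placement pins almost every hotplugged block while tail placement does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CHUNKS 64

/* Toy free list: a deque of owner ids (0 = boot memory, n = hotplugged block n). */
struct toy_freelist {
    int owner[2 * TOY_MAX_CHUNKS];
    size_t head;  /* index of the first free chunk */
    size_t tail;  /* one past the last free chunk */
};

/*
 * Hotplug nr_blocks blocks one after another and return how many of them end
 * up holding an unmovable memmap chunk of a later block. Returns -1 if the
 * parameters do not fit the toy model.
 */
static int toy_count_pinned_blocks(int nr_blocks, int boot_chunks,
                                   int chunks_per_block, bool to_tail)
{
    struct toy_freelist fl = { .head = TOY_MAX_CHUNKS, .tail = TOY_MAX_CHUNKS };
    bool pinned[TOY_MAX_CHUNKS + 1] = { false };
    int count = 0;

    if (nr_blocks < 0 || nr_blocks > TOY_MAX_CHUNKS ||
        boot_chunks < 1 || boot_chunks > TOY_MAX_CHUNKS ||
        chunks_per_block < 1 || chunks_per_block > TOY_MAX_CHUNKS ||
        boot_chunks + nr_blocks * chunks_per_block > TOY_MAX_CHUNKS)
        return -1;

    /* Memory that was present at boot. */
    for (int i = 0; i < boot_chunks; i++)
        fl.owner[fl.tail++] = 0;

    for (int b = 1; b <= nr_blocks; b++) {
        /* The memmap of block b is allocated from the head of the list. */
        int o = fl.owner[fl.head++];

        if (o > 0)
            pinned[o] = true;  /* a hotplugged block now holds unmovable data */

        /* Online block b: expose its chunks at the head or at the tail. */
        for (int c = 0; c < chunks_per_block; c++) {
            if (to_tail)
                fl.owner[fl.tail++] = b;
            else
                fl.owner[--fl.head] = b;
        }
    }

    for (int b = 1; b <= nr_blocks; b++)
        count += pinned[b];
    return count;
}
```

In this model, hotplugging 8 blocks with head placement pins 7 of them (each block holds the memmap of its successor), whereas tail placement pins none as long as boot memory can still serve the memmap allocations.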
      
      b) Experiment with memory intensive workload
      
      I performed an experiment with an older version of this patch set (before
      we used undo_isolate_page_range() in online_pages()): Hotplug 56GB to a VM
      with an initial 4GB, onlining all memory to ZONE_NORMAL right from the
      kernel when adding it.  I then ran various memory intensive workloads that
      consumed most system memory for a total of 45 minutes.  Once finished, I
      tried to unplug as much memory as possible.
      
      With this change, I am able to remove via virtio-mem (adding individual
      128MB memory blocks) 413 out of 448 added memory blocks.  Via individual
      (256MB) DIMMs 380 out of 448 added memory blocks.  (I don't have any
      numbers without this patchset, but looking at the above example, it's at
      most half of the 448 memory blocks for virtio-mem, and most probably none
      for DIMMs).
      
      Again, there are workloads that might behave very differently due to the
      nature of ZONE_NORMAL.
      
      This change also affects (besides memory onlining):
      - Other users of undo_isolate_page_range(): Pages are always placed to the
        tail.
      -- When memory offlining fails
      -- When memory isolation fails after having isolated some pageblocks
      -- When alloc_contig_range() either succeeds or fails
      - Other users of __putback_isolated_page(): Pages are always placed to the
        tail.
      -- Free page reporting
      - Other users of __free_pages_core()
      -- AFAIK, any memory that is getting exposed to the buddy during boot.
         IIUC we will now usually allocate memory from lower addresses within
         a zone first (especially during boot).
      - Other users of generic_online_page()
      -- Hyper-V balloon
      
      This patch (of 5):
      
      Let's prepare for additional flags and avoid long parameter lists of
      bools.  Follow-up patches will also make use of the flags in
      __free_pages_ok().
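
A minimal sketch of the idea, assuming a userspace stand-in rather than the real page allocator: a bool parameter is replaced by a flags word so that further behaviours can be added without growing the parameter list. All names here (toy_fpi_t, TOY_FPI_*, toy_free_one_page) and the flag values are hypothetical, and the second flag is only an example of what a follow-up patch might add.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag type and flags; names and values are illustrative only. */
typedef unsigned int toy_fpi_t;
#define TOY_FPI_NONE               0u
#define TOY_FPI_SKIP_REPORT_NOTIFY (1u << 0)
#define TOY_FPI_TO_TAIL            (1u << 1)  /* example of a later addition */

struct toy_free_result {
    bool notified_reporting;
    bool placed_at_tail;
};

/* Before: one bool per behaviour; each new behaviour needs another parameter. */
static struct toy_free_result toy_free_one_page_old(bool report)
{
    struct toy_free_result r = { .notified_reporting = report,
                                 .placed_at_tail = false };
    return r;
}

/* After: a single flags word; unset bits keep the default behaviour. */
static struct toy_free_result toy_free_one_page(toy_fpi_t flags)
{
    struct toy_free_result r;

    r.notified_reporting = !(flags & TOY_FPI_SKIP_REPORT_NOTIFY);
    r.placed_at_tail = (flags & TOY_FPI_TO_TAIL) != 0;
    return r;
}
```

Call sites then read as toy_free_one_page(TOY_FPI_SKIP_REPORT_NOTIFY) instead of an anonymous false, which is the readability gain the message refers to.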
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Link: https://lkml.kernel.org/r/20201005121534.15649-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20201005121534.15649-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: don't panic when links can't be created in sysfs · 90c7eaeb
      Laurent Dufour authored
      At boot time, or when doing memory hot-add operations, if the links in
      sysfs can't be created, the system is still able to run, so just report
      the error in the kernel log rather than calling BUG_ON() and potentially
      making the system unusable because the callpath can be called with locks
      held.
      
      Since the number of memory blocks managed could be high, the messages are
      rate limited.
      
      As a consequence, link_mem_sections() has no status to report anymore.
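
The pattern described above (log the failure, rate limit the messages, return nothing) could look roughly like the following userspace sketch. It is not the kernel's implementation: the window-based limiter, the struct toy_ratelimit type, and the toy_link_mem_section() helper are assumptions made for illustration, and time is passed in explicitly to keep the example deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical rate limiter: at most `burst` messages per `interval`. */
struct toy_ratelimit {
    long interval;  /* window length in arbitrary time units */
    int burst;      /* messages allowed per window */
    long begin;     /* start of the current window */
    int printed;    /* messages emitted in the current window */
    int missed;     /* messages suppressed so far */
};

/* Return true if a message may be emitted at time `now`. */
static bool toy_ratelimit_ok(struct toy_ratelimit *rs, long now)
{
    if (now - rs->begin >= rs->interval) {
        rs->begin = now;   /* start a new window */
        rs->printed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return true;
    }
    rs->missed++;
    return false;
}

/*
 * Returns void on purpose: a failed link is reported (rate limited) instead
 * of stopping the system, so there is no status left to propagate.
 */
static void toy_link_mem_section(struct toy_ratelimit *rs, long now,
                                 int block_id, int link_err)
{
    if (link_err && toy_ratelimit_ok(rs, now))
        fprintf(stderr, "toy: cannot create sysfs link for block %d (err %d)\n",
                block_id, link_err);
}
```

With many memory blocks failing in a row, only the first `burst` failures per window reach the log; the rest are counted in `missed`.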
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: "Rafael J . Wysocki" <rafael@kernel.org>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200915094143.79181-4-ldufour@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/resource: make iomem_resource implicit in release_mem_region_adjustable() · cb8e3c8b
      David Hildenbrand authored
      "mem" in the name already indicates the root, similar to
      release_mem_region() and devm_request_mem_region().  Make it implicit.
      The single caller always passes iomem_resource; other parents are not
      applicable.
      Suggested-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Link: https://lkml.kernel.org/r/20200916073041.10355-1-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hv_balloon: try to merge system ram resources · 2c76e7f6
      David Hildenbrand authored
      Let's try to merge system ram resources we add, to minimize the number of
      resources in /proc/iomem.  We don't care about the boundaries of
      individual chunks we added.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Liu <wei.liu@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Link: https://lkml.kernel.org/r/20200911103459.10306-9-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • xen/balloon: try to merge system ram resources · 1b989d5d
      David Hildenbrand authored
      Let's try to merge system ram resources we add, to minimize the number of
      resources in /proc/iomem.  We don't care about the boundaries of
      individual chunks we added.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: https://lkml.kernel.org/r/20200911103459.10306-8-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • virtio-mem: try to merge system ram resources · 9b24247a
      David Hildenbrand authored
      virtio-mem adds memory in memory block granularity, to be able to remove
      it in the same granularity again later, and to grow slowly on demand.
      This, however, results in quite a lot of resources when adding a lot of
      memory.  Resources are effectively stored in a list-based tree.  Having a
      lot of resources not only wastes memory, it also makes traversing that
      tree more expensive, and makes /proc/iomem explode in size (e.g.,
      requiring kexec-tools to manually merge resources later when e.g., trying
      to create a kdump header).
      
      Before this patch, we get (/proc/iomem) when hotplugging 2G via virtio-mem
      on x86-64:
              [...]
              100000000-13fffffff : System RAM
              140000000-33fffffff : virtio0
                140000000-147ffffff : System RAM (virtio_mem)
                148000000-14fffffff : System RAM (virtio_mem)
                150000000-157ffffff : System RAM (virtio_mem)
                158000000-15fffffff : System RAM (virtio_mem)
                160000000-167ffffff : System RAM (virtio_mem)
                168000000-16fffffff : System RAM (virtio_mem)
                170000000-177ffffff : System RAM (virtio_mem)
                178000000-17fffffff : System RAM (virtio_mem)
                180000000-187ffffff : System RAM (virtio_mem)
                188000000-18fffffff : System RAM (virtio_mem)
                190000000-197ffffff : System RAM (virtio_mem)
                198000000-19fffffff : System RAM (virtio_mem)
                1a0000000-1a7ffffff : System RAM (virtio_mem)
                1a8000000-1afffffff : System RAM (virtio_mem)
                1b0000000-1b7ffffff : System RAM (virtio_mem)
                1b8000000-1bfffffff : System RAM (virtio_mem)
              3280000000-32ffffffff : PCI Bus 0000:00
      
      With this patch, we get (/proc/iomem):
              [...]
              fffc0000-ffffffff : Reserved
              100000000-13fffffff : System RAM
              140000000-33fffffff : virtio0
                140000000-1bfffffff : System RAM (virtio_mem)
              3280000000-32ffffffff : PCI Bus 0000:00
      
      Of course, with more hotplugged memory, it gets worse.  When unplugging
      memory blocks again, try_remove_memory() (via offline_and_remove_memory())
      will properly split the resource up again.
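
The effect shown in the two /proc/iomem listings can be reproduced with a small sketch. It assumes a flat, sorted array of ranges instead of the kernel's resource tree, and the names (struct toy_res, toy_fill_blocks, toy_merge_adjacent) are hypothetical. Only ranges that are both marked mergeable and directly adjacent are combined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a System RAM resource. */
struct toy_res {
    unsigned long long start;
    unsigned long long end;   /* inclusive, as printed in /proc/iomem */
    bool mergeable;
};

/* Create n back-to-back, mergeable blocks of block_size bytes. */
static void toy_fill_blocks(struct toy_res *res, size_t n,
                            unsigned long long start,
                            unsigned long long block_size)
{
    for (size_t i = 0; i < n; i++) {
        res[i].start = start + i * block_size;
        res[i].end = res[i].start + block_size - 1;
        res[i].mergeable = true;
    }
}

/*
 * Merge adjacent mergeable ranges in place. The input must be sorted by
 * address and non-overlapping. Returns the number of ranges that remain.
 */
static size_t toy_merge_adjacent(struct toy_res *res, size_t n)
{
    size_t out = 0;

    if (n == 0)
        return 0;
    for (size_t i = 1; i < n; i++) {
        if (res[out].mergeable && res[i].mergeable &&
            res[out].end + 1 == res[i].start)
            res[out].end = res[i].end;   /* extend the previous range */
        else
            res[++out] = res[i];         /* keep as a separate range */
    }
    return out + 1;
}
```

Filling 16 blocks of 128M starting at 0x140000000 and merging them leaves a single range 140000000-1bfffffff, matching the second listing above.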
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: https://lkml.kernel.org/r/20200911103459.10306-7-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: MEMHP_MERGE_RESOURCE to specify merging of System RAM resources · 9ca6551e
      David Hildenbrand authored
      Some add_memory*() users add memory in small, contiguous memory blocks.
      Examples include virtio-mem, hyper-v balloon, and the XEN balloon.
      
      This can quickly result in a lot of memory resources, whereby the actual
      resource boundaries are not of interest (e.g., it might be relevant for
      DIMMs, exposed via /proc/iomem to user space).  We really want to merge
      added resources in this scenario where possible.
      
      Let's provide a flag (MEMHP_MERGE_RESOURCE) to specify that a resource
      either created within add_memory*() or passed via add_memory_resource()
      shall be marked mergeable and merged with applicable siblings.
      
      To implement that, we need a kernel/resource interface to mark selected
      System RAM resources mergeable (IORESOURCE_SYSRAM_MERGEABLE) and trigger
      merging.
      
      Note: We really want to merge after the whole operation succeeded, not
      directly when adding a resource to the resource tree (it would break
      add_memory_resource() and require splitting resources again when the
      operation failed - e.g., due to -ENOMEM).
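
The ordering in the note above (add the resource first, merge only once the whole operation has succeeded) can be sketched as follows. This is a simplified userspace model, not the kernel interface: resources live in a flat array and must be added in ascending address order, a `later_step_fails` parameter stands in for an error such as -ENOMEM, and all toy_* names and flag values are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_RES            32
#define TOY_MHP_NONE           0u
#define TOY_MHP_MERGE_RESOURCE (1u << 0)  /* hypothetical value */

struct toy_res {
    unsigned long long start;
    unsigned long long end;   /* inclusive */
    bool mergeable;
};

struct toy_tree {
    struct toy_res res[TOY_MAX_RES];
    size_t nr;
};

/* Returns 0 on success, -1 on failure; on failure the tree is unchanged. */
static int toy_add_memory(struct toy_tree *t, unsigned long long start,
                          unsigned long long size, unsigned int flags,
                          bool later_step_fails)
{
    size_t idx;

    if (t->nr == TOY_MAX_RES || size == 0)
        return -1;
    if (t->nr > 0 && start <= t->res[t->nr - 1].end)
        return -1;  /* toy model: ascending, non-overlapping only */

    /* Step 1: add the resource unmerged. */
    idx = t->nr++;
    t->res[idx].start = start;
    t->res[idx].end = start + size - 1;
    t->res[idx].mergeable = false;

    /* Step 2: the rest of the operation may fail; undo is a plain removal. */
    if (later_step_fails) {
        t->nr--;
        return -1;
    }

    /* Step 3: only after success, mark mergeable and merge with the sibling. */
    if (flags & TOY_MHP_MERGE_RESOURCE) {
        t->res[idx].mergeable = true;
        if (idx > 0 && t->res[idx - 1].mergeable &&
            t->res[idx - 1].end + 1 == start) {
            t->res[idx - 1].end = t->res[idx].end;
            t->nr--;
        }
    }
    return 0;
}
```

Because merging happens last, the failure path never has to split an already merged resource, which is the point the note makes.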
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Link: https://lkml.kernel.org/r/20200911103459.10306-6-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: prepare passing flags to add_memory() and friends · b6117199
      David Hildenbrand authored
      We soon want to pass flags, e.g., to mark added System RAM resources
      mergeable.  Prepare for that.
      
      This patch is based on a similar patch by Oscar Salvador:
      
      https://lkml.kernel.org/r/20190625075227.15193-3-osalvador@suse.de
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Juergen Gross <jgross@suse.com> # Xen related part
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Wei Liu <wei.liu@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Link: https://lkml.kernel.org/r/20200911103459.10306-5-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: guard more declarations by CONFIG_MEMORY_HOTPLUG · 3a0aaefe
      David Hildenbrand authored
      We soon want to pass flags via a new type to add_memory() and friends.
      That revealed that we currently don't guard some declarations by
      CONFIG_MEMORY_HOTPLUG.
      
      While some definitions could be moved to different places, let's keep it
      minimal for now and use CONFIG_MEMORY_HOTPLUG for all functions only
      compiled with CONFIG_MEMORY_HOTPLUG.
      
      Wrap sparse_decode_mem_map() into CONFIG_MEMORY_HOTPLUG; it's only called
      from CONFIG_MEMORY_HOTPLUG code.
      
      While at it, remove allow_online_pfn_range(), which is no longer around,
      and mhp_notimplemented(), which is unused.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: https://lkml.kernel.org/r/20200911103459.10306-4-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/resource: move and rename IORESOURCE_MEM_DRIVER_MANAGED · 7cf603d1
      David Hildenbrand authored
      IORESOURCE_MEM_DRIVER_MANAGED currently uses an unused PnP bit, which is
      always set to 0 by hardware.  This is far from beautiful (and confusing),
      and the bit only applies to SYSRAM.  So let's move it out of the
      bus-specific (PnP) defined bits.
      
      We'll add another SYSRAM specific bit soon.  If we ever need more bits for
      other purposes, we can steal some from "desc", or reshuffle/regroup what
      we have.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: https://lkml.kernel.org/r/20200911103459.10306-3-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cf603d1
    • David Hildenbrand's avatar
      kernel/resource: make release_mem_region_adjustable() never fail · ec62d04e
      David Hildenbrand authored
      Patch series "selective merging of system ram resources", v4.
      
      Some add_memory*() users add memory in small, contiguous memory blocks.
      Examples include virtio-mem, hyper-v balloon, and the XEN balloon.
      
      This can quickly result in a lot of memory resources, whereby the actual
      resource boundaries are not of interest (e.g., it might be relevant for
      DIMMs, exposed via /proc/iomem to user space).  We really want to merge
      added resources in this scenario where possible.
      
      Resources are effectively stored in a list-based tree.  Having a lot of
      resources not only wastes memory, it also makes traversing that tree more
      expensive, and makes /proc/iomem explode in size (e.g., requiring
      kexec-tools to manually merge resources when creating a kdump header.  The
      current kexec-tools resource count limit does not allow for more than
      ~100GB of memory with a memory block size of 128MB on x86-64).
      
      Let's allow to selectively merge system ram resources by specifying a new
      flag for add_memory*().  Patch #5 contains a /proc/iomem example.  Only
      tested with virtio-mem.
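
The merging idea described above can be illustrated with a small user-space sketch. This is not the kernel implementation: the kernel stores resources in a tree and only merges when the caller asks for it via the new flag. `struct range` and `merge_contiguous()` are hypothetical names introduced here, and the input is assumed to be sorted by start address and non-overlapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a System RAM resource; end is inclusive. */
struct range {
    unsigned long long start;
    unsigned long long end;
};

/*
 * Merge neighbouring ranges in place when one ends directly before the
 * next one starts. Returns the number of ranges left after merging.
 */
static size_t merge_contiguous(struct range *r, size_t n)
{
    size_t out = 0;

    if (n == 0)
        return 0;

    for (size_t i = 1; i < n; i++) {
        if (r[out].end + 1 == r[i].start)
            r[out].end = r[i].end;   /* extend the current range */
        else
            r[++out] = r[i];         /* gap: start a new range */
    }
    return out + 1;
}
```

With 128MB blocks, two adjacent blocks collapse into a single 256MB entry, while a block behind a gap stays separate. As a rough cross-check of the figure quoted above, 100GB split into 128MB blocks is 800 separate resources when nothing is merged.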
      
      This patch (of 8):
      
      Let's make sure splitting a resource on memory hotunplug will never fail.
      This will become more relevant once we merge selected System RAM resources
      - then, we'll trigger that case more often on memory hotunplug.
      
      In general, this function is already unlikely to fail.  When we remove
      memory, we free up quite a lot of metadata (memmap, page tables, memory
      block device, etc.).  The only reason it could really fail would be when
      injecting allocation errors.
      
      All other error cases inside release_mem_region_adjustable() seem to be
      sanity checks if the function would be abused in different context - let's
      add WARN_ON_ONCE() in these cases so we can catch them.
      
      [natechancellor@gmail.com: fix use of ternary condition in release_mem_region_adjustable]
        Link: https://lkml.kernel.org/r/20200922060748.2452056-1-natechancellor@gmail.com
        Link: https://github.com/ClangBuiltLinux/linux/issues/1159
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: https://lkml.kernel.org/r/20200911103459.10306-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec62d04e
    • David Hildenbrand's avatar
      mm/memory_hotplug: mark pageblocks MIGRATE_ISOLATE while onlining memory · b30c5927
      David Hildenbrand authored
      Currently, it can happen that pages are allocated (and freed) via the
      buddy before we finished basic memory onlining.
      
      For example, pages are exposed to the buddy and can be allocated before we
      actually mark the sections online.  Allocated pages could suddenly fail
      pfn_to_online_page() checks.  We had similar issues with pcp handling,
      when pages are allocated+freed before we reach zone_pcp_update() in
      online_pages() [1].
      
      Instead, mark all pageblocks MIGRATE_ISOLATE, such that allocations are
      impossible.  Once done with the heavy lifting, use
      undo_isolate_page_range() to move the pages to the MIGRATE_MOVABLE
      freelist, marking them ready for allocation.  Similar to offline_pages(),
      we have to manually adjust zone->nr_isolate_pageblock.
      
      [1] https://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-11-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b30c5927
    • David Hildenbrand's avatar
      mm: pass migratetype into memmap_init_zone() and move_pfn_range_to_zone() · d882c006
      David Hildenbrand authored
      On the memory onlining path, we want to start with MIGRATE_ISOLATE, to
      un-isolate the pages after memory onlining is complete.  Let's allow
      passing in the migratetype.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Link: https://lkml.kernel.org/r/20200819175957.28465-10-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d882c006
    • David Hildenbrand's avatar
      mm/page_alloc: drop stale pageblock comment in memmap_init_zone*() · 4eb29bd9
      David Hildenbrand authored
      Commit ac5d2539 ("mm: meminit: reduce number of times pageblocks are
      set during struct page init") moved the actual zone range check, leaving
      only the alignment check for pageblocks.
      
      Let's drop the stale comment and make the pageblock check easier to read.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-9-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4eb29bd9
    • David Hildenbrand's avatar
      mm/memory_hotplug: simplify page onlining · aac65321
      David Hildenbrand authored
      We don't allow offlining memory with holes, all boot memory is online,
      and hotplugged memory cannot have holes.
      
      We can now simplify onlining of pages.  As we only allow to online/offline
      full sections and sections always span full MAX_ORDER_NR_PAGES, we can
      just process MAX_ORDER - 1 pages without further special handling.
      
      The number of onlined pages simply corresponds to the number of pages we
      were requested to online.
      
      While at it, refine the comment regarding the callback not exposing all
      pages to the buddy.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-8-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aac65321
    • David Hildenbrand's avatar
      mm/page_isolation: simplify return value of start_isolate_page_range() · 3fa0c7c7
      David Hildenbrand authored
      Callers no longer need the number of isolated pageblocks.  Let's simplify.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-7-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3fa0c7c7
    • David Hildenbrand's avatar
      mm/memory_hotplug: drop nr_isolate_pageblock in offline_pages() · ea15153c
      David Hildenbrand authored
      We make sure that we cannot have any memory holes right at the beginning
      of offline_pages(), and we only support onlining/offlining full sections.
      Both sections and pageblocks are a power of two in size, and sections
      always span full pageblocks.
      
      We can directly calculate the number of isolated pageblocks from nr_pages.
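
The calculation this refers to can be sketched in plain C. This is an illustration rather than the patch itself: the two constants are assumptions matching common x86-64 defaults with 4 KiB pages (2 MiB pageblocks, 128 MiB sections), and `nr_isolated_pageblocks()` is a hypothetical helper name.

```c
#include <assert.h>

/* Assumed values for illustration; both depend on the architecture. */
#define PAGEBLOCK_NR_PAGES 512UL    /* 2 MiB pageblock with 4 KiB pages */
#define PAGES_PER_SECTION  32768UL  /* 128 MiB section with 4 KiB pages */

/*
 * With no holes and section-aligned ranges, every pageblock in the range
 * is isolated, so the count follows from nr_pages by a single division.
 */
static unsigned long nr_isolated_pageblocks(unsigned long nr_pages)
{
    /* Only full sections are onlined/offlined. */
    assert(nr_pages % PAGES_PER_SECTION == 0);
    return nr_pages / PAGEBLOCK_NR_PAGES;
}
```

Under these assumed sizes, offlining one section (32768 pages) accounts for 64 isolated pageblocks, so no separate counter has to be carried through the function.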
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-6-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea15153c
    • David Hildenbrand's avatar
      mm/page_alloc: simplify __offline_isolated_pages() · 257bea71
      David Hildenbrand authored
      offline_pages() is the only user.  __offline_isolated_pages() never gets
      called with ranges that contain memory holes and we no longer care about
      the return value.  Drop the return value handling and all pfn_valid()
      checks.
      
      Update the documentation.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-5-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      257bea71
    • David Hildenbrand's avatar
      mm/memory_hotplug: simplify page offlining · 0a1a9a00
      David Hildenbrand authored
      We make sure that we cannot have any memory holes right at the beginning
      of offline_pages().  We no longer need walk_system_ram_range() and can
      call test_pages_isolated() and __offline_isolated_pages() directly.
      
      offlined_pages always corresponds to nr_pages, so we can simplify that.
      
      [akpm@linux-foundation.org: patch conflict resolution]
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-4-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a1a9a00
    • David Hildenbrand's avatar
      mm/memory_hotplug: enforce section granularity when onlining/offlining · 4986fac1
      David Hildenbrand authored
      Already two people (including me) tried to offline subsections, because
      the function looks like it can deal with it.  But we really can only
      online/offline full sections that are properly aligned (e.g., we can only
      mark full sections online/offline via SECTION_IS_ONLINE).
      
      Add a simple safety net to document the restriction now.  Current users
      (core and powernv/memtrace) respect these restrictions.
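
A safety net of this kind can be sketched as a plain alignment check. The snippet below is a user-space illustration, not the code added by the patch: `PAGES_PER_SECTION` is an assumed value (128 MiB sections with 4 KiB pages, as on x86-64), and `range_is_section_aligned()` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed for illustration: 128 MiB sections with 4 KiB pages. */
#define PAGES_PER_SECTION 32768UL

/*
 * A range may only be onlined/offlined if it is non-empty and both its
 * start and its length are multiples of the section size. OR-ing the two
 * values checks both alignments with one mask, which works because the
 * section size is a power of two.
 */
static bool range_is_section_aligned(unsigned long start_pfn,
                                     unsigned long nr_pages)
{
    if (nr_pages == 0)
        return false;
    return ((start_pfn | nr_pages) & (PAGES_PER_SECTION - 1)) == 0;
}
```

A caller would reject the request early when the check fails, so a subsection-sized range never reaches the onlining/offlining logic.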
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-3-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4986fac1
    • David Hildenbrand's avatar
      mm/memory_hotplug: inline __offline_pages() into offline_pages() · 73a11c96
      David Hildenbrand authored
      Patch series "mm/memory_hotplug: online_pages()/offline_pages() cleanups", v2.
      
      These are a bunch of cleanups for online_pages()/offline_pages() and
      related code, mostly getting rid of memory hole handling that is no longer
      necessary.  There is only a single walk_system_ram_range() call left in
      offline_pages(), to make sure we don't have any memory holes.  I had some
      of these patches lying around for a longer time but didn't have time to
      polish them.
      
      In addition, the last patch marks all pageblocks of memory to get onlined
      MIGRATE_ISOLATE, so pages that have just been exposed to the buddy cannot
      get allocated before onlining is complete.  Once heavy lifting is done,
      the pageblocks are set to MIGRATE_MOVABLE, such that allocations are
      possible.
      
      I played with DIMMs and virtio-mem on x86-64 and didn't spot any
      surprises.  I verified that the number of isolated pageblocks is correctly
      handled when onlining/offlining.
      
      This patch (of 10):
      
      There is only a single user, offline_pages(). Let's inline, to make
      it look more similar to online_pages().
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Link: https://lkml.kernel.org/r/20200819175957.28465-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20200819175957.28465-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73a11c96
    • Jann Horn's avatar
      mm/mmu_notifier: fix mmget() assert in __mmu_interval_notifier_insert · c9682d10
      Jann Horn authored
      The comment talks about having to hold mmget() (which means mm_users), but
      the actual check is on mm_count (which would be mmgrab()).
      
      Given that MMU notifiers are torn down in mmput() -> __mmput() ->
      exit_mmap() -> mmu_notifier_release(), I believe that the comment is
      correct and the check should be on mm->mm_users.  Fix it up accordingly.
      
      Fixes: 99cb252f ("mm/mmu_notifier: add an interval tree notifier")
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christian König <christian.koenig@amd.com>
      Link: https://lkml.kernel.org/r/20200901000143.207585-1-jannh@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9682d10
    • Bartosz Golaszewski's avatar
      mm/util.c: update the kerneldoc for kstrdup_const() · 295a1730
      Bartosz Golaszewski authored
      Memory allocated with kstrdup_const() must not be passed to regular
      krealloc() as it is not aware of the possibility of the chunk residing in
      .rodata.  Since there are no potential users of krealloc_const() at the
      moment, let's just update the doc to make it explicit.
      Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200817173927.23389-1-brgl@bgdev.pl
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      295a1730
    • Miaohe Lin's avatar
      mm/vmstat.c: use helper macro abs() · 40610076
      Miaohe Lin authored
      Use helper macro abs() to simplify the "x > t || x < -t" cmp.
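
The equivalence behind this cleanup can be checked with a small user-space sketch. The function names are hypothetical, and `labs()` from the C library stands in for the kernel's `abs()` macro, which is not available outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Open-coded form quoted in the commit message: "x > t || x < -t". */
static bool exceeds_open_coded(long x, long t)
{
    return x > t || x < -t;
}

/* Same test written with an absolute-value helper. */
static bool exceeds_with_abs(long x, long t)
{
    return labs(x) > t;
}
```

Both forms agree for values on either side of the threshold and at the boundary itself, where neither reports an excess.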
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: https://lkml.kernel.org/r/20200905084008.15748-1-linmiaohe@huawei.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      40610076
    • Mateusz Nosek's avatar
      mm/page_poison.c: replace bool variable with static key · 11c9c7ed
      Mateusz Nosek authored
      The variable 'want_page_poisoning' is a switch deciding whether page
      poisoning should be enabled.  This patch changes it to a static key.
      Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Oscar Salvador <OSalvador@suse.com>
      Link: https://lkml.kernel.org/r/20200921152931.938-1-mateusznosek0@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11c9c7ed
    • Oscar Salvador's avatar
      mm,hwpoison: try to narrow window race for free pages · b94e0282
      Oscar Salvador authored
      Aristeu Rozanski reported that a customer test case started to report
      -EBUSY after the hwpoison rework patchset.
      
      There is a race window between spotting a free page and taking it off its
      buddy freelist, so it might be that by the time we try to take it off, the
      page has been already allocated.
      
      This patch handles that race window by retrying with the page's new
      type if the page was allocated under us.
      Reported-by: Aristeu Rozanski <aris@ruivo.org>
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Aristeu Rozanski <aris@ruivo.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-15-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b94e0282
    • Naoya Horiguchi's avatar
      mm,hwpoison: double-check page count in __get_any_page() · 1f2481dd
      Naoya Horiguchi authored
      Soft offlining could fail with EIO due to a race condition with hugepage
      migration.  This issue became visible due to the change in the previous
      patch, which makes the soft offline handler take the page refcount on its
      own.  We have no way to directly pin a page with a zero refcount, and a
      page considered to have a zero refcount could be allocated just after the
      first check.
      
      This patch adds a second check to detect the race, giving us a chance to
      handle it more reliably.
      Reported-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-14-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f2481dd
    • Naoya Horiguchi's avatar
      mm,hwpoison: introduce MF_MSG_UNSPLIT_THP · 5d1fd5dc
      Naoya Horiguchi authored
      memory_failure() is supposed to call action_result() when it handles a
      memory error event, but there's one missing case.  So let's add it.
      
      I find that include/ras/ras_event.h has some other MF_MSG_* undefined, so
      this patch also adds them.
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-13-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d1fd5dc
    • Oscar Salvador's avatar
      mm,hwpoison: return 0 if the page is already poisoned in soft-offline · 5a2ffca3
      Oscar Salvador authored
      Currently, there is an inconsistency when calling soft-offline from
      different paths on a page that is already poisoned.
      
      1) madvise:
      
              madvise_inject_error skips any poisoned page and continues
              the loop.
              If that was the only page to madvise, it returns 0.
      
      2) /sys/devices/system/memory/:
      
              When calling soft_offline_page_store()->soft_offline_page(),
              we return -EBUSY in case the page is already poisoned.
              This is inconsistent with a) the above example and b)
              memory_failure, where we return 0 if the page was poisoned.
      
      Fix this by dropping the PageHWPoison() check in madvise_inject_error, and
      let soft_offline_page return 0 if it finds the page already poisoned.
      
      Please note that this represents a user-API change: when an already
      poisoned page is passed to soft_offline_page_store()->soft_offline_page(),
      the call now returns 0 instead of -EBUSY.
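
      The return-value change can be illustrated with a small userspace model.
      This is only a sketch: struct page_model, enum entry_path and the
      *_before/*_after functions are hypothetical stand-ins and not kernel
      code; only the return values (0 and -EBUSY) come from the description
      above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct page; only the flag matters. */
struct page_model {
    bool hwpoison;
};

/* Hypothetical labels for the two entry points described above. */
enum entry_path { PATH_MADVISE, PATH_SYSFS };

/* Before: the result for an already poisoned page depended on the caller. */
static int soft_offline_before(enum entry_path path, const struct page_model *p)
{
    assert(p);
    if (p->hwpoison)
        return path == PATH_SYSFS ? -EBUSY : 0; /* madvise skipped the page */
    return 0; /* stand-in for a successful soft offline */
}

/* After: every path reports success for an already poisoned page. */
static int soft_offline_after(enum entry_path path, const struct page_model *p)
{
    assert(p);
    (void)path; /* the entry path no longer changes the result */
    if (p->hwpoison)
        return 0; /* already poisoned: nothing left to do */
    return 0; /* stand-in for a successful soft offline */
}
```

      In this model, only the sysfs path changes its answer for a poisoned
      page, which is the user-visible difference mentioned above.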
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-12-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a2ffca3
    • Oscar Salvador's avatar
      mm,hwpoison: refactor soft_offline_huge_page and __soft_offline_page · 6b9a217e
      Oscar Salvador authored
      Merging soft_offline_huge_page and __soft_offline_page lets us get rid of
      quite a bit of duplicated code and makes the code much easier to follow.
      
      Now, __soft_offline_page will handle both normal and hugetlb pages.
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-11-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b9a217e
    • Oscar Salvador's avatar
      mm,hwpoison: rework soft offline for in-use pages · 79f5f8fa
      Oscar Salvador authored
      This patch changes the way we set and handle in-use poisoned pages.  Until
      now, poisoned pages were released to the buddy allocator, trusting that
      the checks that take place at allocation time would act as a safety net and
      would skip that page.
      
      This has proved to be wrong, as there are pfn walkers out there, like
      compaction, that only care whether the page is in a buddy freelist.
      
      Although this might not be the only user, having poisoned pages in the
      buddy allocator seems a bad idea as we should only have free pages that
      are ready and meant to be used as such.
      
      Before explaining the taken approach, let us break down the kind of pages
      we can soft offline.
      
      - Anonymous THP (after the split, they end up being 4K pages)
      - Hugetlb
      - Order-0 pages (that can be either migrated or invalidated)
      
      * Normal pages (order-0 and anon-THP)
      
        - If they are clean and unmapped page cache pages, we invalidate
          them by means of invalidate_inode_page().
        - If they are mapped/dirty, we do the isolate-and-migrate dance.
      
      Either way, do not call put_page directly from those paths.  Instead, we
      keep the page and send it to page_handle_poison to perform the right
      handling.
      
      page_handle_poison sets the HWPoison flag and does the last put_page.
      
      Down the chain, we placed a check for HWPoison page in
      free_pages_prepare, that just skips any poisoned page, so those pages
      do not end up in any pcplist/freelist.
      
      After that, we set the refcount on the page to 1 and we increment
      the poisoned pages counter.
      
      If we see that the check in free_pages_prepare creates trouble, we can
      always do what we do for free pages:
      
        - wait until the page hits buddy's freelists
        - take it off, and flag it
      
      The downside of the above approach is that we could race with an
      allocation, so by the time we want to take the page off the buddy, the
      page has already been allocated and we cannot soft offline it.
      But the user could always retry.
      
      * Hugetlb pages
      
        - We isolate-and-migrate them
      
      After the migration has been successful, we call dissolve_free_huge_page,
      and we set HWPoison on the page if we succeed.
      Hugetlb has a slightly different handling though.
      
      While for non-hugetlb pages we cared about closing the race with an
      allocation, doing so for hugetlb pages requires quite some additional
      and intrusive code (we would need to hook in free_huge_page and some other
      places).
      So I decided to not make the code overly complicated and just fail
      normally if the page was allocated in the meantime.
      
      We can always build on top of this.
      
      As a bonus, because of the way we now handle in-use pages, we no longer
      need the put-as-isolation-migratetype dance that was guarding against
      poisoned pages ending up in pcplists.
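
      The flow for in-use normal pages described above (set HWPoison, do the
      last put_page, let the free path skip the page, then set the refcount to
      1 and bump the counter) can be sketched as a userspace model.  Everything
      here is a hypothetical stand-in: the *_model names, the fields of
      struct page_model and the counter are assumptions for illustration and
      do not mirror the kernel's actual data structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct page. */
struct page_model {
    bool hwpoison;
    int refcount;
    bool on_freelist;
};

/* Hypothetical stand-in for the poisoned pages counter. */
static unsigned long num_poisoned_pages_model;

/* Models the check in the free path: a poisoned page is not freed. */
static bool free_pages_prepare_model(const struct page_model *p)
{
    return !p->hwpoison;
}

/* Drops one reference; the last one sends the page through the free path. */
static void put_page_model(struct page_model *p)
{
    assert(p->refcount > 0);
    if (--p->refcount == 0 && free_pages_prepare_model(p))
        p->on_freelist = true;
}

/* Flags the page, drops the last reference, then pins and accounts it. */
static void page_handle_poison_model(struct page_model *p)
{
    p->hwpoison = true;
    put_page_model(p);      /* free path skips it: no pcplist/freelist */
    p->refcount = 1;        /* keep the page under hwpoison control */
    num_poisoned_pages_model++;
}
```

      In this model a healthy page whose last reference is dropped lands on
      the freelist, while a page that went through page_handle_poison_model()
      never does.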
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-10-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79f5f8fa
    • Oscar Salvador's avatar
      mm,hwpoison: rework soft offline for free pages · 06be6ff3
      Oscar Salvador authored
      When trying to soft-offline a free page, we need to first take it off the
      buddy allocator.  Once we know it is out of reach, we can safely flag it as
      poisoned.
      
      take_page_off_buddy will be used to take a page meant to be poisoned off
      the buddy allocator.  take_page_off_buddy calls break_down_buddy_pages,
      which splits a higher-order page in case our page belongs to one.
      
      Once the page is under our control, we call page_handle_poison to set it
      as poisoned and grab a refcount on it.
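
      The splitting step can be sketched in userspace as follows.  This is a
      minimal model of the general buddy-splitting idea, not the kernel's
      break_down_buddy_pages(): struct free_chunk and split_around_target()
      are hypothetical names, and pages are represented only by their pfn.
      A free block of a given order is halved repeatedly; the half that does
      not contain the target goes back to the free lists, until only the
      order-0 target page remains.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of a chunk handed back to the free lists. */
struct free_chunk {
    unsigned long pfn;   /* first pfn of the chunk */
    unsigned int order;  /* chunk covers 2^order pages */
};

/*
 * Split the free block [base, base + 2^order) around target.
 * Writes the chunks that stay free into out (needs room for `order`
 * entries) and returns how many were written.
 */
static size_t split_around_target(unsigned long base, unsigned int order,
                                  unsigned long target,
                                  struct free_chunk *out)
{
    size_t n = 0;

    assert(target >= base && target < base + (1UL << order));

    while (order > 0) {
        unsigned long half;

        order--;
        half = 1UL << order;
        if (target >= base + half) {
            /* Target is in the upper half: the lower half stays free. */
            out[n].pfn = base;
            out[n].order = order;
            base += half;
        } else {
            /* Target is in the lower half: the upper half stays free. */
            out[n].pfn = base + half;
            out[n].order = order;
        }
        n++;
    }
    return n;
}
```

      For an order-3 block starting at pfn 0 and target pfn 5, this model
      leaves chunks at pfn 0 (order 2), pfn 6 (order 1) and pfn 4 (order 0)
      free, and pfn 5 is the isolated page.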
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-9-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06be6ff3
    • Oscar Salvador's avatar
      mm,hwpoison: unify THP handling for hard and soft offline · 694bf0b0
      Oscar Salvador authored
      Place the THP's page handling in a helper and use it from both hard and
      soft-offline machinery, so we get rid of some duplicated code.
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-8-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      694bf0b0
    • Oscar Salvador's avatar
      mm,hwpoison: kill put_hwpoison_page · dd6e2402
      Oscar Salvador authored
      After commit 4e41a30c ("mm: hwpoison: adjust for new thp
      refcounting"), put_hwpoison_page got reduced to a put_page.  Let us just
      use put_page instead.
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-7-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd6e2402
    • Oscar Salvador's avatar
      mm,hwpoison: refactor madvise_inject_error · dc7560b4
      Oscar Salvador authored
      Make a proper if-else condition for {hard,soft}-offline.
      Signed-off-by: Oscar Salvador <osalvador@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Link: https://lkml.kernel.org/r/20200908075626.11976-3-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc7560b4
    • Oscar Salvador's avatar
      mm,hwpoison: unexport get_hwpoison_page and make it static · 7e27f22c
      Oscar Salvador authored
      Since get_hwpoison_page is only used in memory-failure code now, let us
      un-export it and make it private to that code.
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-5-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e27f22c
    • Naoya Horiguchi's avatar
      mm,hwpoison-inject: don't pin for hwpoison_filter · fd476720
      Naoya Horiguchi authored
      Another memory error injection interface, debugfs:hwpoison/corrupt-pfn,
      also takes a bogus refcount for hwpoison_filter().  Dropping it is
      justified because this only does a coarse filter, expecting that
      memory_failure() redoes the check for sure.
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-4-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fd476720
    • Naoya Horiguchi's avatar
      mm, hwpoison: remove recalculating hpage · 1b473bec
      Naoya Horiguchi authored
      hpage is never used after try_to_split_thp_page() in memory_failure(), so
      we don't have to update hpage.  So let's not recalculate/use hpage.
      Suggested-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-3-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b473bec
    • Naoya Horiguchi's avatar
      mm,hwpoison: cleanup unused PageHuge() check · 7d9d46ac
      Naoya Horiguchi authored
      Patch series "HWPOISON: soft offline rework", v7.
      
      This patchset fixes a couple of issues that the patchset Naoya sent [1]
      contained due to rebasing problems and a misunderstanding.
      
      Main focus of this series is to stabilize soft offline.  Historically soft
      offlined pages have suffered from racy conditions because PageHWPoison is
      used a little too aggressively, which (directly or indirectly) invades
      other mm code which cares little about hwpoison.  This results in
      unexpected behavior or kernel panic, which is very far from soft offline's
      "do not disturb userspace or other kernel component" policy.  An example
      of this can be found here [2].
      
      Along with several cleanups, this code refactors and changes the way soft
      offline works.  The main point of this change set is to contain the target
      page "via buddy allocator" or in the migrating path.  For the former we
      first free the target page as we do for normal pages, and once it has
      reached buddy and it has been taken off the freelists, we flag it as
      HWPoison.  For the latter we never get to release the page in
      unmap_and_move, so the page is under our control and we can handle it in
      hwpoison code.
      
      [1] https://patchwork.kernel.org/cover/11704083/
      [2] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
      
      This patch (of 14):
      
      Drop the PageHuge check, which is dead code since memory_failure() forks
      into memory_failure_hugetlb() for hugetlb pages.
      
      memory_failure() and memory_failure_hugetlb() share some functions like
      hwpoison_user_mappings() and identify_page_state(), so they should
      properly handle 4kB pages, THP, and hugetlb.
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Oscar Salvador <osalvador@suse.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-1-osalvador@suse.de
      Link: https://lkml.kernel.org/r/20200922135650.1634-2-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d9d46ac
    • David Howells's avatar
      mm/readahead: pass a file_ra_state into force_page_cache_ra · b1647dc0
      David Howells authored
      The file_ra_state being passed into page_cache_sync_readahead() was being
      ignored in favour of using the one embedded in the struct file.  The only
      caller for which this makes a difference is the fsverity code if the file
      has been marked as POSIX_FADV_RANDOM, but it's confusing and worth fixing.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric Biggers <ebiggers@google.com>
      Link: https://lkml.kernel.org/r/20200903140844.14194-10-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b1647dc0