1. 07 Apr, 2020 40 commits
    • mm: huge tmpfs: try to split_huge_page() when punching hole · 71725ed1
      Hugh Dickins authored
      Yang Shi writes:
      
      Currently, when truncating a shmem file, if the range is partly in a THP
      (start or end is in the middle of THP), the pages actually will just get
      cleared rather than being freed, unless the range covers the whole THP.
      Even though all the subpages are truncated (randomly or sequentially), the
      THP may still be kept in page cache.
      
      This might be fine for some usecases which prefer preserving THP, but
      balloon inflation is handled in base page size.  So when using shmem THP
      as memory backend, QEMU inflation actually doesn't work as expected since
      it doesn't free memory.  But the inflation usecase really needs to get the
      memory freed.  (Anonymous THP will also not get freed right away, but will
      be freed eventually when all subpages are unmapped: whereas shmem THP
      still stays in page cache.)
      
      Split THP right away when doing partial hole punch, and if split fails
      just clear the page so that read of the punched area will return zeroes.
      
      Hugh Dickins adds:
      
      Our earlier "team of pages" huge tmpfs implementation worked in the way
      that Yang Shi proposes; and we have been using this patch to continue to
      split the huge page when hole-punched or truncated, since converting over
      to the compound page implementation.  Although huge tmpfs gives out huge
      pages when available, if the user specifically asks to truncate or punch a
      hole (perhaps to free memory, perhaps to reduce the memcg charge), then
      the filesystem should do so as best it can, splitting the huge page.
      
      That is not always possible: any additional reference to the huge page
      prevents split_huge_page() from succeeding, so the result can be flaky.
      But in practice it works successfully enough that we've not seen any
      problem from that.
      
      Add shmem_punch_compound() to encapsulate the decision of when a split is
      needed, and doing the split if so.  Using this simplifies the flow in
      shmem_undo_range(); and the first (trylock) pass does not need to do any
      page clearing on failure, because the second pass will either succeed or
      do that clearing.  Following the example of zero_user_segment() when
      clearing a partial page, add flush_dcache_page() and set_page_dirty() when
      clearing a hole - though I'm not certain that either is needed.
      
      But: split_huge_page() would be sure to fail if shmem_undo_range()'s
      pagevec holds further references to the huge page.  The easiest way to fix
      that is for find_get_entries() to return early, as soon as it has put one
      compound head or tail into the pagevec.  At first this felt like a hack;
      but on examination, this convention better suits all its callers - or will
      do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping()
      and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec
      speedup by checking for compound pages there.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2002261959020.10801@eggly.anvils
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
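      The user-visible side of this change can be sketched from userspace.
      This is only an illustration, not code from the patch: it assumes a
      Linux system where memfd_create() returns a shmem-backed file, and the
      function name punch_hole_reads_zero() and the sizes are hypothetical.
      It punches a hole one base page long and checks that the range reads
      back as zeroes, which should hold whether or not the kernel managed to
      split a huge page underneath.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* fallocate(), FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <string.h>
#include <sys/mman.h>   /* memfd_create() */
#include <sys/types.h>
#include <unistd.h>

/*
 * Punch a base-page-sized hole into a shmem-backed file and verify that
 * the punched range reads as zeroes.  Returns 0 on success, -1 otherwise.
 */
int punch_hole_reads_zero(void)
{
	/* 4 MiB: assumed large enough to contain one 2 MiB huge page. */
	const off_t file_size = 4 * 1024 * 1024;
	const off_t page = (off_t)sysconf(_SC_PAGESIZE);
	unsigned char buf[64];
	int ret = -1;
	size_t i;
	int fd;

	assert(page > 0);

	fd = memfd_create("punch-demo", 0);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, file_size) < 0)
		goto out;

	/* Dirty a few bytes in the second base page of the file. */
	memset(buf, 0xaa, sizeof(buf));
	if (pwrite(fd, buf, sizeof(buf), page) != (ssize_t)sizeof(buf))
		goto out;

	/* Partial hole punch: one base page, starting at the second page. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      page, page) < 0)
		goto out;

	/* The punched range must read back as zeroes either way. */
	if (pread(fd, buf, sizeof(buf), page) != (ssize_t)sizeof(buf))
		goto out;

	ret = 0;
	for (i = 0; i < sizeof(buf); i++)
		if (buf[i] != 0)
			ret = -1;
out:
	close(fd);
	return ret;
}
```

      Whether a huge page is actually in use here depends on the system's
      shmem huge page settings, so the sketch only demonstrates the read-back
      guarantee, not the freeing of memory.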
    • mm/shmem.c: clean code by removing unnecessary assignment · 343c3d7f
      Mateusz Nosek authored
      Previously 0 was assigned to variable 'error' but the variable was never
      read before reassignment later.  So the assignment can be removed.
      Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200301152832.24595-1-mateusznosek0@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/shmem.c: distribute switch variables for initialization · 27d80fa2
      Kees Cook authored
      Variables declared in a switch statement before any case statements cannot
      be automatically initialized with compiler instrumentation (as they are
      not part of any execution flow).  With GCC's proposed automatic stack
      variable initialization feature, this triggers a warning (and they don't
      get initialized).  Clang's automatic stack variable initialization (via
      CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also doesn't
      initialize such variables[1].  Note that these warnings (or silent
      skipping) happen before the dead-store elimination optimization phase, so
      even when the automatic initializations are later elided in favor of
      direct initializations, the warnings remain.
      
      To avoid these problems, move such variables into the "case" where they're
      used or lift them up into the main function body.
      
      mm/shmem.c: In function `shmem_getpage_gfp':
      mm/shmem.c:1816:10: warning: statement will never be executed [-Wswitch-unreachable]
       1816 |   loff_t i_size;
            |          ^~~~~~
      
      [1] https://bugs.llvm.org/show_bug.cgi?id=44916
      
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Link: http://lkml.kernel.org/r/20200220062312.69165-1-keescook@chromium.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
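      The two remedies named in the message (lifting the variable into the
      function body, or moving it into the "case" that uses it) might look
      roughly like this.  The enum and function names are hypothetical and
      are not taken from mm/shmem.c; the problematic shape appears only in a
      comment.

```c
/* Hypothetical policy values, for illustration only. */
enum huge_policy { HUGE_NEVER, HUGE_WITHIN_SIZE, HUGE_ALWAYS };

/*
 * Problematic shape: the declaration sits between "switch (...) {" and
 * the first "case", so no execution path ever reaches it and compiler
 * instrumentation cannot initialize it:
 *
 *	switch (policy) {
 *		long limit;
 *	case HUGE_WITHIN_SIZE:
 *		limit = size_in_pages;
 *		...
 *	}
 */

/* Option 1: lift the variable up into the main function body. */
int want_huge_lifted(enum huge_policy policy, long index, long size_in_pages)
{
	long limit;

	switch (policy) {
	case HUGE_NEVER:
		return 0;
	case HUGE_WITHIN_SIZE:
		limit = size_in_pages;
		return index < limit;
	case HUGE_ALWAYS:
		return 1;
	}
	return 0;
}

/* Option 2: move the variable into the "case" where it is used. */
int want_huge_scoped(enum huge_policy policy, long index, long size_in_pages)
{
	switch (policy) {
	case HUGE_WITHIN_SIZE: {
		long limit = size_in_pages;

		return index < limit;
	}
	case HUGE_ALWAYS:
		return 1;
	case HUGE_NEVER:
		break;
	}
	return 0;
}
```

      Both variants behave identically; the choice is mostly about how widely
      the variable is used inside the function.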
    • chenqiwu's avatar
      10404901
    • mm/memory_hotplug: allow to specify a default online_type · 5f47adf7
      David Hildenbrand authored
      For now, distributions implement advanced udev rules that essentially do one of the following:
      - Don't online any hotplugged memory (s390x)
      - Online all memory to ZONE_NORMAL (e.g., most virt environments like
        hyperv)
      - Online all memory to ZONE_MOVABLE in case the zone imbalance is taken
        care of (e.g., bare metal, special virt environments)
      
      In summary: All memory is usually onlined the same way, however, the
      kernel always has to ask user space to come up with the same answer.
      E.g., Hyper-V always waits for a memory block to get onlined before
      continuing, otherwise it might end up adding memory faster than
      onlining it, which can result in strange OOM situations.  This waiting
      slows down adding of a bigger amount of memory.
      
      Let's allow to specify a default online_type, not just "online" and
      "offline".  This allows distributions to configure the default online_type
      when booting up and be done with it.
      
      We can now specify "offline", "online", "online_movable" and
      "online_kernel" via
      - "memhp_default_state=" on the kernel cmdline
      - /sys/devices/system/memory/auto_online_blocks
      just like we are able to specify for a single memory block via
      /sys/devices/system/memory/memoryX/state
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-9-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: convert memhp_auto_online to store an online_type · 862919e5
      David Hildenbrand authored
      ...  and rename it to memhp_default_online_type.  This is a preparation
      for more detailed default online behavior.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-8-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: unexport memhp_auto_online · 5a04af13
      David Hildenbrand authored
      All in-tree users except the mm-core are gone. Let's drop the export.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-7-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hv_balloon: don't check for memhp_auto_online manually · bc58ebd5
      David Hildenbrand authored
      We get the MEM_ONLINE notifier call if memory is added right from the
      kernel via add_memory() or later from user space.
      
      Let's get rid of the "ha_waiting" flag - the wait event has an inbuilt
      mechanism (->done) for that.  Initialize the wait event only once and
      reinitialize before adding memory.  Unconditionally call complete() and
      wait_for_completion_timeout().
      
      If there are no waiters, complete() will only increment ->done - which
      will be reset by reinit_completion().  If complete() has already been
      called, wait_for_completion_timeout() will not wait.
      
      There is still the chance for a small race between concurrent
      reinit_completion() and complete().  If complete() wins, we would not wait
      - which is tolerable (and the race exists in current code as well).
      
      Note: We only wait for "some" memory to get onlined, which seems to be
            good enough for now.
      
      [akpm@linux-foundation.org: register_memory_notifier() after init_completion(), per David]
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-6-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
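      The ->done bookkeeping this message relies on can be modelled with a
      plain counter.  The following is a single-threaded toy model, not the
      kernel's completion API: every toy_* name is hypothetical, and a real
      waiter would sleep with a timeout instead of reporting that it would
      block.

```c
#include <stdbool.h>

/* Toy stand-in for a completion: only the "done" counter is modelled. */
struct toy_completion {
	unsigned int done;
};

/* Initialize once, e.g. at driver setup. */
void toy_init_completion(struct toy_completion *c)
{
	c->done = 0;
}

/* Reset before adding memory; discards any stale complete() calls. */
void toy_reinit_completion(struct toy_completion *c)
{
	c->done = 0;
}

/* Called unconditionally from the notifier; with no waiter it only counts. */
void toy_complete(struct toy_completion *c)
{
	c->done++;
}

/*
 * Returns true if a waiter would have to sleep (nothing completed yet),
 * false if it can return immediately, consuming one completion.
 */
bool toy_wait_would_block(struct toy_completion *c)
{
	if (c->done == 0)
		return true;
	c->done--;
	return false;
}
```

      In this model a complete() that arrives before the reinit is dropped,
      and a complete() that arrives before the wait lets the wait return at
      once, which mirrors the two cases the message describes.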
    • powernv/memtrace: always online added memory blocks · ed7f9fec
      David Hildenbrand authored
      Let's always try to online the re-added memory blocks.  In case
      add_memory() already onlined the added memory blocks, the first
      device_online() call will fail and stop processing the remaining memory
      blocks.
      
      This avoids manually having to check memhp_auto_online.
      
      Note: PPC always onlines all hotplugged memory directly from the kernel as
      well - something that is handled by user space on other architectures.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-5-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/base/memory: store mapping between MMOP_* and string in an array · 4dc8207b
      David Hildenbrand authored
      Let's use a simple array which we can reuse soon.  While at it, move the
      string->mmop conversion out of the device hotplug lock.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-4-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
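      A userspace sketch of such an array-based mapping, under stated
      assumptions: the enum order beyond MMOP_OFFLINE == 0 and the helper
      name toy_online_type_from_str() are chosen for illustration, and plain
      strcmp() stands in for whatever sysfs-aware comparison the kernel
      uses.  The four strings are the ones this series exposes to user
      space.

```c
#include <stddef.h>
#include <string.h>

/* Assumed layout: MMOP_OFFLINE is 0 so the type can index an array. */
enum toy_online_type {
	MMOP_OFFLINE = 0,
	MMOP_ONLINE,
	MMOP_ONLINE_KERNEL,
	MMOP_ONLINE_MOVABLE,
};

/* One table serves both directions of the conversion. */
static const char *const online_type_to_str[] = {
	[MMOP_OFFLINE]        = "offline",
	[MMOP_ONLINE]         = "online",
	[MMOP_ONLINE_KERNEL]  = "online_kernel",
	[MMOP_ONLINE_MOVABLE] = "online_movable",
};

/* Returns the matching online type, or -1 if the string is unknown. */
int toy_online_type_from_str(const char *str)
{
	size_t i;

	for (i = 0; i < sizeof(online_type_to_str) /
			sizeof(online_type_to_str[0]); i++) {
		if (strcmp(str, online_type_to_str[i]) == 0)
			return (int)i;
	}
	return -1;
}
```

      Because the lookup touches only constant data, it can run before any
      lock is taken, which is the point of moving the string->mmop conversion
      out of the device hotplug lock.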
    • drivers/base/memory: map MMOP_OFFLINE to 0 · efc978ad
      David Hildenbrand authored
      Historically, we used the value -1.  Just treat 0 as the special case now.
      Clarify a comment (which was wrong: when we come via device_online() the
      first time, the online_type would have been 0 / MEM_ONLINE).  The default
      is now always MMOP_OFFLINE.  This removes the last user of the manual
      "-1", which didn't use the enum value.
      
      This is a preparation to use the online_type as an array index.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-3-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/base/memory: rename MMOP_ONLINE_KEEP to MMOP_ONLINE · 956f8b44
      David Hildenbrand authored
      Patch series "mm/memory_hotplug: allow to specify a default online_type", v3.
      
      Distributions nowadays use udev rules ([1] [2]) to specify if and how to
      online hotplugged memory.  The rules seem to get more complex with many
      special cases.  Due to the various special cases,
      CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used.  All memory hotplug
      is handled via udev rules.
      
      Every time we hotplug memory, the udev rule will come to the same
      conclusion.  Especially Hyper-V (but also soon virtio-mem) add a lot of
      memory in separate memory blocks and wait for memory to get onlined by
      user space before continuing to add more memory blocks (to not add memory
      faster than it is getting onlined).  This of course slows down the whole
      memory hotplug process.
      
      To make the job of distributions easier and to avoid udev rules that get
      more and more complicated, let's extend the mechanism provided by
      - /sys/devices/system/memory/auto_online_blocks
      - "memhp_default_state=" on the kernel cmdline
      to be able to specify also "online_movable" as well as "online_kernel"
      
      === Example /usr/libexec/config-memhotplug ===
      
      #!/bin/bash
      
      VIRT=`systemd-detect-virt --vm`
      ARCH=`uname -p`
      
      sense_virtio_mem() {
        if [ -d "/sys/bus/virtio/drivers/virtio_mem/" ]; then
          DEVICES=`find /sys/bus/virtio/drivers/virtio_mem/ -maxdepth 1 -type l | wc -l`
          if [ $DEVICES != "0" ]; then
              return 0
          fi
        fi
        return 1
      }
      
      if [ ! -e "/sys/devices/system/memory/auto_online_blocks" ]; then
        echo "Memory hotplug configuration support missing in the kernel"
        exit 1
      fi
      
      if grep "memhp_default_state=" /proc/cmdline > /dev/null; then
        echo "Memory hotplug configuration overridden in kernel cmdline (memhp_default_state=)"
        exit 1
      fi
      
      if [ $VIRT == "microsoft" ]; then
        echo "Detected Hyper-V on $ARCH"
        # Hyper-V wants all memory in ZONE_NORMAL
        ONLINE_TYPE="online_kernel"
      elif sense_virtio_mem; then
        echo "Detected virtio-mem on $ARCH"
        # virtio-mem wants all memory in ZONE_NORMAL
        ONLINE_TYPE="online_kernel"
      elif [ $ARCH == "s390x" ] || [ $ARCH == "s390" ]; then
        echo "Detected $ARCH"
        # standby memory should not be onlined automatically
        ONLINE_TYPE="offline"
      elif [ $ARCH == "ppc64" ] || [ $ARCH == "ppc64le" ]; then
        echo "Detected" $ARCH
        # PPC64 onlines all hotplugged memory right from the kernel
        ONLINE_TYPE="offline"
      elif [ $VIRT == "none" ]; then
        echo "Detected bare-metal on $ARCH"
        # Bare metal users expect hotplugged memory to be unpluggable. We assume
        # that ZONE imbalances on such enterprise servers cannot happen and is
        # properly documented
        ONLINE_TYPE="online_movable"
      else
        # TODO: Hypervisors that want to unplug DIMMs and can guarantee that ZONE
        # imbalances won't happen
        echo "Detected $VIRT on $ARCH"
        # Usually, ballooning is used in virtual environments, so memory should go to
        # ZONE_NORMAL. However, sometimes "movable_node" is relevant.
        ONLINE_TYPE="online"
      fi
      
      echo "Selected online_type:" $ONLINE_TYPE
      
      # Configure what to do with memory that will be hotplugged in the future
      echo $ONLINE_TYPE 2>/dev/null > /sys/devices/system/memory/auto_online_blocks
      if [ $? != "0" ]; then
        echo "Memory hotplug cannot be configured (e.g., old kernel or missing permissions)"
        # A backup udev rule should handle old kernels if necessary
        exit 1
      fi
      
      # Process all already plugged blocks (e.g., DIMMs, but also Hyper-V or virtio-mem)
      if [ $ONLINE_TYPE != "offline" ]; then
        for MEMORY in /sys/devices/system/memory/memory*; do
          STATE=`cat $MEMORY/state`
          if [ $STATE == "offline" ]; then
              echo $ONLINE_TYPE > $MEMORY/state
          fi
        done
      fi
      
      === Example /usr/lib/systemd/system/config-memhotplug.service ===
      
      [Unit]
      Description=Configure memory hotplug behavior
      DefaultDependencies=no
      Conflicts=shutdown.target
      Before=sysinit.target shutdown.target
      After=systemd-modules-load.service
      ConditionPathExists=|/sys/devices/system/memory/auto_online_blocks
      
      [Service]
      ExecStart=/usr/libexec/config-memhotplug
      Type=oneshot
      TimeoutSec=0
      RemainAfterExit=yes
      
      [Install]
      WantedBy=sysinit.target
      
      === Example modification to the 40-redhat.rules [2] ===
      
      : diff --git a/40-redhat.rules b/40-redhat.rules-new
      : index 2c690e5..168fd03 100644
      : --- a/40-redhat.rules
      : +++ b/40-redhat.rules-new
      : @@ -6,6 +6,9 @@ SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}
      :  # Memory hotadd request
      :  SUBSYSTEM!="memory", GOTO="memory_hotplug_end"
      :  ACTION!="add", GOTO="memory_hotplug_end"
      : +# memory hotplug behavior configured
      : +PROGRAM=="grep online /sys/devices/system/memory/auto_online_blocks", GOTO="memory_hotplug_end"
      : +
      :  PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
      :
      :  ENV{.state}="online"
      
      ===
      
      [1] https://github.com/lnykryn/systemd-rhel/pull/281
      [2] https://github.com/lnykryn/systemd-rhel/blob/staging/rules/40-redhat.rules
      
      This patch (of 8):
      
      The name is misleading and it's not really clear what is "kept".  Let's
      just name it like the online_type name we expose to user space ("online").
      
      Add some documentation to the types.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: http://lkml.kernel.org/r/20200319131221.14044-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20200317104942.11178-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/sparse.c: move subsection_map related functions together · 6ecb0fc6
      Baoquan He authored
      No functional change.
      
      [bhe@redhat.com: move functions into CONFIG_MEMORY_HOTPLUG ifdeffery scope]
        Link: http://lkml.kernel.org/r/20200316045804.GC3486@MiWiFi-R3L-srv
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200312124414.439-6-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/sparse.c: add note about only VMEMMAP supporting sub-section hotplug · 95a5a34d
      Baoquan He authored
      And note that check_pfn_span() gates the proper alignment and size of a
      hot-added memory region.
      
      And also move the code comments from inside section_deactivate() to being
      above it.  The code comments are reasonable for the whole function, and
      the moving makes code cleaner.
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Link: http://lkml.kernel.org/r/20200312124414.439-5-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/sparse.c: only use subsection map in VMEMMAP case · 0a9f9f62
      Baoquan He authored
      Currently, to support subsection aligned memory region adding for pmem,
      subsection map is added to track which subsection is present.
      
      However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP.  It means the
      subsection map only makes sense when SPARSEMEM_VMEMMAP is enabled.  For the
      classic sparse, it's meaningless.  Even worse, it may confuse people when
      checking code related to the classic sparse.
      
      About the classic sparse which doesn't support subsection hotplug, Dan
      said it's more because the effort and maintenance burden outweighs the
      benefit.  Besides, the current 64 bit ARCHes all enable
      SPARSEMEM_VMEMMAP_ENABLE by default.
      
      Combining the above reasons, no need to provide subsection map and the
      relevant handling for the classic sparse.  Let's remove them.
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Link: http://lkml.kernel.org/r/20200312124414.439-4-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/sparse.c: introduce a new function clear_subsection_map() · 37bc1502
      Baoquan He authored
      Factor out the code which clears the subsection map of one memory region
      from section_deactivate() into clear_subsection_map().
      
      And also add helper function is_subsection_map_empty() to check if the
      current subsection map is empty or not.
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Link: http://lkml.kernel.org/r/20200312124414.439-3-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/sparse.c: introduce new function fill_subsection_map() · 5d87255c
      Baoquan He authored
      Patch series "mm/hotplug: Only use subsection map for VMEMMAP", v4.
      
      Memory sub-section hotplug was added to fix the issue that nvdimm could be
      mapped at non-section aligned starting address.  A subsection map is added
      into struct mem_section_usage to implement it.
      
      However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP.  This means the
      subsection map only makes sense when SPARSEMEM_VMEMMAP is enabled.  For
      classic sparse, the subsection map is meaningless and confusing.
      
      As for classic sparse not supporting subsection hotplug, Dan said it is
      mostly because the effort and maintenance burden outweigh the benefit.
      Besides, the current 64-bit arches all enable SPARSEMEM_VMEMMAP_ENABLE by
      default.
      
      This patch (of 5):
      
      Factor out the code that fills the subsection map from section_activate()
      into fill_subsection_map().  This makes section_activate() cleaner and
      easier to follow.
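      A minimal userspace sketch of what filling such a map can look like.
      The 64-subsection word, the _sketch names and the error codes are
      assumptions chosen for illustration, not the kernel implementation:

```c
#include <errno.h>
#include <stdint.h>

/* Assumption: 64 subsections per section, so one 64-bit word models the map. */
#define SUBSECTIONS_PER_SECTION 64u

/* Build a mask with `count` bits set starting at bit `first` (0 if invalid). */
static uint64_t subsection_mask_sketch(unsigned int first, unsigned int count)
{
    uint64_t mask;

    if (count == 0 || first >= SUBSECTIONS_PER_SECTION ||
        count > SUBSECTIONS_PER_SECTION - first)
        return 0;
    mask = (count == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << count) - 1);
    return mask << first;
}

/*
 * Mark a range of subsections as populated.
 * Returns 0 on success, -EINVAL for an empty or out-of-range request,
 * and -EEXIST if any requested subsection is already populated.
 */
static int fill_subsection_map_sketch(uint64_t *map, unsigned int first,
                                      unsigned int count)
{
    uint64_t mask = subsection_mask_sketch(first, count);

    if (mask == 0)
        return -EINVAL;
    if (*map & mask)
        return -EEXIST;
    *map |= mask;
    return 0;
}
```

      Keeping the overlap check inside the helper lets the activating caller
      simply propagate the error.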
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200312124414.439-2-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d87255c
    • David Hildenbrand's avatar
      mm/memory_hotplug.c: cleanup __add_pages() · 6cdd0b30
      David Hildenbrand authored
      Let's drop the basically unused section stuff and simplify.  The logic now
      matches the logic in __remove_pages().
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: Segher Boessenkool <segher@kernel.crashing.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Link: http://lkml.kernel.org/r/20200228095819.10750-3-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6cdd0b30
    • David Hildenbrand's avatar
      mm/memory_hotplug.c: simplify calculation of number of pages in __remove_pages() · a11b9419
      David Hildenbrand authored
      In commit 52fb87c8 ("mm/memory_hotplug: cleanup __remove_pages()"), we
      cleaned up __remove_pages(), and introduced a shorter variant to calculate
      the number of pages to the next section boundary.
      
      Turns out we can make this calculation easier to read.  We always want to
      have the number of pages (> 0) to the next section boundary, starting from
      the current pfn.
      
      We'll clean up __remove_pages() in a follow-up patch and directly make use
      of this computation.
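      For illustration, one way to express "pages from the current pfn to the
      next section boundary, always > 0" in plain C.  The section size, the
      helper name and the clamp against an end pfn are assumptions made for
      this sketch:

```c
#include <assert.h>

/* Assumption: 128 MiB sections with 4 KiB pages, i.e. 32768 pages per section. */
#define PAGES_PER_SECTION 32768UL
#define SECTION_ALIGN_UP(pfn) \
    (((pfn) + PAGES_PER_SECTION - 1) & ~(PAGES_PER_SECTION - 1))

/*
 * Number of pages to process in this step: up to the next section
 * boundary, but never past end_pfn. Aligning (pfn + 1) rather than pfn
 * guarantees a result > 0 even when pfn already sits on a boundary.
 */
static unsigned long pages_to_section_end(unsigned long pfn,
                                          unsigned long end_pfn)
{
    unsigned long to_boundary = SECTION_ALIGN_UP(pfn + 1) - pfn;
    unsigned long remaining;

    assert(end_pfn > pfn);
    remaining = end_pfn - pfn;
    return remaining < to_boundary ? remaining : to_boundary;
}
```

      A removal loop can then advance by the returned count on each
      iteration until it reaches the end pfn.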
      Suggested-by: Segher Boessenkool <segher@kernel.crashing.org>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Link: http://lkml.kernel.org/r/20200228095819.10750-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a11b9419
    • Baoquan He's avatar
      mm/memory_hotplug.c: only respect mem= parameter during boot stage · f3cd4c86
      Baoquan He authored
      In commit 357b4da5 ("x86: respect memory size limiting via mem=
      parameter") a global variable max_mem_size was added to store the value
      parsed from 'mem=', which is then checked when a memory region is added.
      This does stop those DIMMs from being added to system memory at boot time.
      
      However, it also limits later memory hotplug.  A DIMM can no longer be
      hotplugged if its region lies beyond max_mem_size.  We get errors like:
      
      [  216.387164] acpi PNP0C80:02: add_memory failed
      [  216.389301] acpi PNP0C80:02: acpi_memory_enable_device() error
      [  216.392187] acpi PNP0C80:02: Enumeration failure
      
      This causes an issue in a known use case where 'mem=' is passed to the
      hypervisor.  The memory that lies beyond the 'mem=' boundary is assigned
      to KVM guests.  After commit 357b4da5 was merged, memory can't be extended
      dynamically if system memory on the hypervisor is not sufficient.
      
      So fix it by applying the restriction only while adding memory during
      boot.  Otherwise, skip the restriction.
      
      Also add this use case to the documentation of the 'mem=' kernel
      parameter.
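      The intended behaviour can be sketched as a small userspace model.  The
      function name, the explicit "booting" flag and the -E2BIG return value
      are assumptions for this example and are not taken from the patch:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the check: a region [start, start + size) is
 * rejected only if it exceeds the 'mem=' limit AND we are still booting.
 * After boot, hotplugged regions beyond the limit are accepted.
 */
static int check_mem_limit_sketch(uint64_t start, uint64_t size,
                                  uint64_t max_mem_size, bool booting)
{
    if (booting && start + size > max_mem_size)
        return -E2BIG;
    return 0;
}
```

      With such a check, a DIMM hotplugged after boot is no longer refused
      just because it lies above the 'mem=' boundary.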
      
      Fixes: 357b4da5 ("x86: respect memory size limiting via mem= parameter")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Link: http://lkml.kernel.org/r/20200204050643.20925-1-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3cd4c86
    • David Hildenbrand's avatar
      mm/page_ext.c: drop pfn_present() check when onlining · dccacf8d
      David Hildenbrand authored
      Since commit c5e79ef5 ("mm/memory_hotplug.c: don't allow to
      online/offline memory blocks with holes") we disallow offlining any
      memory with holes.  As all boot memory is online and hotplugged memory
      cannot contain holes, we never online memory with holes.
      
      The pfn_present() check can therefore be dropped.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Link: http://lkml.kernel.org/r/20200127110424.5757-4-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dccacf8d
    • David Hildenbrand's avatar
      drivers/base/memory.c: drop pages_correctly_probed() · fada9ae3
      David Hildenbrand authored
      pages_correctly_probed() is a leftover from ancient times.  It dates back
      to commit 3947be19 ("[PATCH] memory hotplug: sysfs and add/remove
      functions"), where Pg_reserved checks were added as a safety net:
      
      	/*
      	 * The probe routines leave the pages reserved, just
      	 * as the bootmem code does.  Make sure they're still
      	 * that way.
      	 */
      
      The checks were refactored quite a bit over the years, especially in
      commit b77eab70 ("mm/memory_hotplug: optimize probe routine"), where
      checks for present, valid, and online sections were added.
      
      Hotplugged memory is added via add_memory(), which will create the full
      memmap for the hotplugged memory, and mark all sections valid and present.
      
      Only full memory blocks are onlined/offlined, so we also cannot have an
      inconsistency in that regard (especially, memory blocks with some sections
      being online and some being offline).
      
      1. Boot memory always starts online.  Since commit c5e79ef5
         ("mm/memory_hotplug.c: don't allow to online/offline memory blocks with
         holes") we disallow offlining any memory with holes.  Therefore, we
         never online memory with holes.  Present and validity checks are
         superfluous.
      
      2. Only complete memory blocks are onlined/offlined (and especially,
         the state - online or offline - is stored for whole memory blocks).
         Besides the core, only arch/powerpc/platforms/powernv/memtrace.c
         manually calls offline_pages() and fiddles with memory block states.
         But it also only offlines complete memory blocks.
      
      3. To make any of these conditions trigger, something would have to be
         terribly messed up in the core.  (e.g., online/offline only some
         sections of a memory block).
      
      4. Memory unplug properly makes sure that all sysfs attributes were
         removed (and therefore, that all threads left the sysfs handlers).  We
         don't have to worry about zombie devices at this point.
      
      5. The valid_section_nr(section_nr) check is actually dead code, as it
         would never have been reached due to the WARN_ON_ONCE(!pfn_valid(pfn)).
      
      No wonder we haven't seen any of these errors in a long time (or even
      ever, according to my search).  Let's just get rid of them.  Now, all
      checks that could hinder onlining and offlining are completely
      contained in online_pages()/offline_pages().
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Link: http://lkml.kernel.org/r/20200127110424.5757-3-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fada9ae3
    • David Hildenbrand's avatar
      drivers/base/memory.c: drop section_count · 68c3a6ac
      David Hildenbrand authored
      Patch series "mm: drop superfluous section checks when onlining/offlining".
      
      Let's drop some superfluous section checks on the onlining/offlining path.
      
      This patch (of 3):
      
      Since commit c5e79ef5 ("mm/memory_hotplug.c: don't allow to
      online/offline memory blocks with holes") we have a generic check in
      offline_pages() that disallows offlining memory blocks with holes.
      
      Memory blocks with missing sections are just another variant of this type
      of block.  We can stop checking (and especially storing) present
      sections.  A proper error message explaining why offlining failed is now
      printed.
      
      section_count was initially introduced in commit 07681215 ("Driver
      core: Add section count to memory_block struct") in order to detect when
      it is okay to remove a memory block.  It was used in commit 26bbe7ef
      ("drivers/base/memory.c: prohibit offlining of memory blocks with missing
      sections") to disallow offlining memory blocks with missing sections.  As
      we refactored creation/removal of memory devices and have a proper check
      for holes in place, we can drop the section_count.
      
      This also removes a leftover comment regarding the mem_sysfs_mutex, which
      was removed in commit 848e19ad ("drivers/base/memory.c: drop the
      mem_sysfs_mutex").
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Link: http://lkml.kernel.org/r/20200127110424.5757-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68c3a6ac
    • Peter Xu's avatar
      userfaultfd: selftests: add write-protect test · 9b12488a
      Peter Xu authored
      Add uffd tests for write protection.
      
      Instead of introducing new tests for it, let's simply squash the uffd-wp
      tests into the existing uffd-missing test cases.  The changes are:
      
      (1) Bouncing tests
      
        We do the write-protection in two ways during the bouncing test:
      
        - By using UFFDIO_COPY_MODE_WP when resolving MISSING pages: this
          makes sure that, for each bounce process, every single page
          faults at least twice: once for MISSING, once for WP.
      
        - By directly calling UFFDIO_WRITEPROTECT on already-faulted memory:
          to further torture the explicit page protection procedures of
          uffd-wp, we split each bounce procedure into two halves (in the
          background thread).  The first half does MISSING+WP for each
          page as explained above.  After the first half, we write protect
          the faulted region again in the background thread, so that at
          least the first half of the pages exercises the new
          UFFDIO_WRITEPROTECT call.  Then we continue with the 2nd half,
          which contains both MISSING and WP faults for the 2nd half and
          WP-only faults from the 1st half.
      
      (2) Event/Signal test
      
        Mostly the previous tests, but doing MISSING+WP for each page.  For
        the sigbus-mode test we need to provide a standalone path to handle
        the write protection faults.
      
      For all tests, also collect statistics for uffd-wp pages.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-20-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b12488a
    • Peter Xu's avatar
      userfaultfd: selftests: refactor statistics · 5c8aed6c
      Peter Xu authored
      Introduce uffd_stats structure for statistics of the self test, at the
      same time refactor the code to always pass in the uffd_stats for either
      read() or poll() typed fault handling threads instead of using two
      different ways to return the statistic results.  No functional change.
      
      With the new structure, it's very easy to introduce new statistics.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-19-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c8aed6c
    • Peter Xu's avatar
      userfaultfd: wp: declare _UFFDIO_WRITEPROTECT conditionally · 14819305
      Peter Xu authored
      Only declare _UFFDIO_WRITEPROTECT if the user specified
      UFFDIO_REGISTER_MODE_WP and if all the checks passed.  Then when the user
      registers regions with shmem/hugetlbfs we won't expose the new ioctl to
      them.  Even for a purely anonymous memory range, we only expose the new
      WP ioctl bit if the register mode includes MODE_WP.
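      From userspace, the effect can be observed in the ioctls bitmask that
      UFFDIO_REGISTER fills in.  A small sketch, assuming kernel headers that
      already define _UFFDIO_WRITEPROTECT; the helper name is made up for
      this example:

```c
#include <stdbool.h>
#include <stdint.h>
#include <linux/userfaultfd.h>

/*
 * `ioctls` is the value returned in struct uffdio_register.ioctls after a
 * successful UFFDIO_REGISTER. Each supported per-range ioctl is reported
 * as one bit; test the write-protect bit before relying on it.
 */
static bool range_supports_writeprotect(uint64_t ioctls)
{
    return (ioctls & ((uint64_t)1 << _UFFDIO_WRITEPROTECT)) != 0;
}
```

      A caller would typically fall back to another tracking mechanism when
      this returns false.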
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-18-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      14819305
    • Martin Cracauer's avatar
      userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update · 57e5d4f2
      Martin Cracauer authored
      Add documentation about the write protection support.
      
      [peterx@redhat.com: rewrite in rst format; fixups here and there]
      Signed-off-by: Martin Cracauer <cracauer@cons.org>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-17-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57e5d4f2
    • Peter Xu's avatar
      userfaultfd: wp: don't wake up when doing write protect · 23080e27
      Peter Xu authored
      It does not make sense to try to wake up any waiting thread when we're
      write-protecting a memory region.  Only wake up when resolving a write
      protected page fault.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-16-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23080e27
    • Shaohua Li's avatar
      userfaultfd: wp: enabled write protection in userfaultfd API · e06f1e1d
      Shaohua Li authored
      Now it's safe to enable write protection in the userfaultfd API.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220163112.11409-15-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e06f1e1d
    • Andrea Arcangeli's avatar
      userfaultfd: wp: add the writeprotect API to userfaultfd ioctl · 63b2d417
      Andrea Arcangeli authored
      Introduce the new uffd-wp APIs for userspace.
      
      First, we allow UFFDIO_REGISTER with write protection tracking using the
      new UFFDIO_REGISTER_MODE_WP flag.  Note that this flag can co-exist with
      the existing UFFDIO_REGISTER_MODE_MISSING, in which case the userspace
      program can not only resolve missing page faults but also track page
      data changes along the way.
      
      Second, we introduce the new UFFDIO_WRITEPROTECT API to do page level
      write protection tracking.  Note that the memory region must be
      registered with UFFDIO_REGISTER_MODE_WP before that.
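      A minimal userspace sketch of the two steps, assuming kernel headers
      that ship these definitions.  The helper names are invented for this
      example and error handling is left to the caller:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Step 1: register a range for both missing and write-protect faults. */
static int uffd_register_missing_wp(int uffd, void *addr, size_t len)
{
    struct uffdio_register reg = {0};

    reg.range.start = (uint64_t)(uintptr_t)addr;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP;
    return ioctl(uffd, UFFDIO_REGISTER, &reg);
}

/* Build the UFFDIO_WRITEPROTECT argument; protect != 0 sets protection. */
static struct uffdio_writeprotect make_wp_request(void *addr, size_t len,
                                                  int protect)
{
    struct uffdio_writeprotect wp = {0};

    wp.range.start = (uint64_t)(uintptr_t)addr;
    wp.range.len = len;
    wp.mode = protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
    return wp;
}

/* Step 2: write-protect (or un-protect) pages in the registered range. */
static int uffd_set_writeprotect(int uffd, void *addr, size_t len, int protect)
{
    struct uffdio_writeprotect wp = make_wp_request(addr, len, protect);

    return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}
```

      Both ioctls return -1 with errno set on failure, for example when the
      range was not registered with UFFDIO_REGISTER_MODE_WP.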
      
      [peterx@redhat.com: write up the commit message]
      [peterx@redhat.com: remove useless block, write commit message, check against
       VM_MAYWRITE rather than VM_WRITE when register]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-14-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63b2d417
    • Shaohua Li's avatar
      userfaultfd: wp: support write protection for userfault vma range · ffd05793
      Shaohua Li authored
      Add an API to enable/disable write protection on a vma range.  Unlike
      mprotect, this doesn't split/merge vmas.
      
      [peterx@redhat.com:
       - use the helper to find VMA;
       - return -ENOENT if not found to match mcopy case;
       - use the new MM_CP_UFFD_WP* flags for change_protection
       - check against mmap_changing for failures
       - replace find_dst_vma with vma_find_uffd]
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220163112.11409-13-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ffd05793
    • Peter Xu's avatar
      khugepaged: skip collapse if uffd-wp detected · e1e267c7
      Peter Xu authored
      Don't collapse the huge PMD if any of the small PTEs is userfault write
      protected.  The problem is that write protection has small page
      granularity, and there is no way to keep this write protection
      information if the small pages are merged into a huge PMD.
      
      The same needs to be considered for swap entries and migration entries,
      so do the check for them as well, regardless of
      khugepaged_max_ptes_swap.
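      The rule can be pictured with a toy scan over an array of PTE values.
      The bit position and the function name are assumptions for this
      sketch; the real scan in khugepaged works on kernel page tables:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical position of the uffd-wp marker in a toy PTE value. */
#define TOY_PTE_UFFD_WP (UINT64_C(1) << 10)

/*
 * Returns false as soon as one small PTE carries the uffd-wp marker:
 * a single huge PMD could not record which small pages were protected.
 */
static bool can_collapse_sketch(const uint64_t *ptes, size_t nr)
{
    for (size_t i = 0; i < nr; i++) {
        if (ptes[i] & TOY_PTE_UFFD_WP)
            return false;
    }
    return true;
}
```

      One protected PTE is enough to abort, since the per-page information
      would otherwise be lost.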
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-12-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e1e267c7
    • Peter Xu's avatar
      userfaultfd: wp: support swap and page migration · f45ec5ff
      Peter Xu authored
      For both swap and page migration, we use bit 2 of the entry to identify
      whether the entry is uffd write-protected.  It plays a similar role to
      the existing soft dirty bit in swap entries, but only for keeping the
      uffd-wp tracking for a specific PTE/PMD.
      
      Something special here is that when we recover the uffd-wp bit from a
      swap/migration entry to the PTE bit, we also need to make sure the
      _PAGE_RW bit is cleared; otherwise, even with the _PAGE_UFFD_WP bit set,
      we can't trap the write at all.
      
      In change_pte_range() we do nothing for uffd if the PTE is a swap entry.
      That can lead to data mismatch if the page that we are going to write
      protect is swapped out when UFFDIO_WRITEPROTECT is sent.  This patch
      also applies/removes the uffd-wp bit for swap entries.
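      A toy model of the bit handling described above.  Only "bit 2 of the
      entry" comes from the text; the positions chosen for the RW and
      uffd-wp PTE bits and all helper names are assumptions for this sketch:

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_SWP_UFFD_WP (UINT64_C(1) << 2)   /* bit 2 of the swap entry */
#define TOY_PTE_RW      (UINT64_C(1) << 1)   /* assumed writable bit */
#define TOY_PTE_UFFD_WP (UINT64_C(1) << 10)  /* assumed uffd-wp PTE bit */

static uint64_t toy_swp_mkuffd_wp(uint64_t entry)
{
    return entry | TOY_SWP_UFFD_WP;
}

static uint64_t toy_swp_clear_uffd_wp(uint64_t entry)
{
    return entry & ~TOY_SWP_UFFD_WP;
}

static bool toy_swp_uffd_wp(uint64_t entry)
{
    return (entry & TOY_SWP_UFFD_WP) != 0;
}

/*
 * When a present PTE is rebuilt from a swap/migration entry, carry the
 * marker over and drop the writable bit so the next write still traps.
 */
static uint64_t toy_restore_pte(uint64_t swp_entry, uint64_t pte)
{
    if (toy_swp_uffd_wp(swp_entry)) {
        pte |= TOY_PTE_UFFD_WP;
        pte &= ~TOY_PTE_RW;
    }
    return pte;
}
```

      Leaving the writable bit set would let the write through without a
      fault, which is why both bits are handled together.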
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f45ec5ff
    • Peter Xu's avatar
      userfaultfd: wp: add pmd_swp_*uffd_wp() helpers · 2e3d5dc5
      Peter Xu authored
      Add the missing helpers for uffd-wp operations on pmd swap/migration
      entries.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-10-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork · b569a176
      Peter Xu authored
      UFFD_EVENT_FORK support for uffd-wp should already be there, except that
      we should clear the uffd-wp bit if the uffd fork event is not enabled.  Detect
      that to avoid _PAGE_UFFD_WP being set even if the VMA is not being tracked
      by VM_UFFD_WP.  Do this for both small PTEs and huge PMDs.
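      The rule above can be sketched with a toy page-table entry.  This is
      not kernel code: toy_copy_entry_on_fork() and the TOY_* constants are
      hypothetical names, the bit positions are arbitrary, and one integer
      stands in for both a small PTE and a huge PMD.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; bit positions are arbitrary, not the kernel's. */
#define TOY_VM_UFFD_WP    (1ul << 0)   /* child VMA is tracked for uffd-wp */
#define TOY_PAGE_PRESENT  (1ull << 0)  /* some unrelated entry bit */
#define TOY_PAGE_UFFD_WP  (1ull << 2)  /* entry is write protected by uffd */

/*
 * Copy one parent entry into the child at fork time.  If the child's VMA
 * is not tracked for uffd-wp (fork event not enabled), drop the uffd-wp
 * bit so it does not leak into an untracked mapping.
 */
static uint64_t toy_copy_entry_on_fork(uint64_t parent_entry,
                                       unsigned long child_vm_flags)
{
    uint64_t child_entry = parent_entry;

    if (!(child_vm_flags & TOY_VM_UFFD_WP))
        child_entry &= ~TOY_PAGE_UFFD_WP;

    /* Invariant: the bit survives only in a tracked VMA. */
    assert((child_vm_flags & TOY_VM_UFFD_WP) ||
           !(child_entry & TOY_PAGE_UFFD_WP));
    return child_entry;
}
```

      In this model an untracked child keeps every other bit of the entry
      and loses only the uffd-wp marker; a tracked child gets the entry
      unchanged.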
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-9-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: wp: apply _PAGE_UFFD_WP bit · 292924b2
      Peter Xu authored
      First, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
      change_protection() when used with uffd-wp, and make sure the two new flags
      are mutually exclusive.  Then,
      
        - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
          when a range of memory is write protected by uffd
      
        - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
          _PAGE_RW when write protection is resolved from userspace
      
      And use this new interface in mwriteprotect_range() to replace the old
      MM_CP_DIRTY_ACCT.
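
      The effect of the two flags on a single entry can be sketched as
      follows.  This is a toy model under stated assumptions, not the kernel
      implementation: toy_change_entry() and the TOY_* constants are
      hypothetical, the bit positions are arbitrary, and the COW case
      described later in this message (where the write bit stays cleared
      after resolve) is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for _PAGE_RW and _PAGE_UFFD_WP. */
#define TOY_PAGE_RW             (1ull << 1)
#define TOY_PAGE_UFFD_WP        (1ull << 2)

/* Hypothetical stand-ins for MM_CP_UFFD_WP and MM_CP_UFFD_WP_RESOLVE. */
#define TOY_CP_UFFD_WP          (1ul << 0)
#define TOY_CP_UFFD_WP_RESOLVE  (1ul << 1)

/* Apply one change_protection()-style request to a single toy entry. */
static uint64_t toy_change_entry(uint64_t entry, unsigned long cp_flags)
{
    int uffd_wp = !!(cp_flags & TOY_CP_UFFD_WP);
    int uffd_wp_resolve = !!(cp_flags & TOY_CP_UFFD_WP_RESOLVE);

    /* The two flags must never be passed together. */
    assert(!(uffd_wp && uffd_wp_resolve));

    if (uffd_wp) {
        /* Write protect: set the marker, drop write permission. */
        entry |= TOY_PAGE_UFFD_WP;
        entry &= ~TOY_PAGE_RW;
    } else if (uffd_wp_resolve) {
        /* Resolve: clear the marker, recover write permission. */
        entry &= ~TOY_PAGE_UFFD_WP;
        entry |= TOY_PAGE_RW;
    }
    return entry;
}
```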
      
      Do this change for both PTEs and huge PMDs.  Then we can start to identify
      which PTE/PMD is write protected for general reasons (e.g., COW or soft
      dirty tracking), and which is write protected for userfaultfd-wp.
      
      Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
      into _PAGE_CHG_MASK as well.  Meanwhile, since we have this new bit, we
      can be even more strict when detecting uffd-wp page faults in either
      do_wp_page() or wp_huge_pmd().
      
      With _PAGE_UFFD_WP in place, a special case is a page that is protected
      both by the general COW logic and by userfault-wp.  Here userfault-wp
      has higher priority and is handled first.  Only after the uffd-wp bit is
      cleared on the PTE/PMD do we continue to handle the general COW.  These
      are the steps such a page goes through:
      
        1. CPU accesses write protected shared page (so both protected by
           general COW and uffd-wp), blocked by uffd-wp first because in
           do_wp_page we'll handle uffd-wp first, so it has higher priority
           than general COW.
      
        2. Uffd service thread receives the request and does UFFDIO_WRITEPROTECT
           to remove the uffd-wp bit from the PTE/PMD.  However, here we
           still keep the write bit cleared.  Notify the blocked CPU.
      
        3. The blocked CPU resumes the page fault process with a fault
           retry.  During the retry it'll notice that the uffd-wp bit is no
           longer set but the page is still write protected by general COW, so
           it'll go through the COW path in the fault handler, copy the page,
           apply the write bit where necessary, and retry again.
      
        4. The CPU will be able to access this page with write bit set.
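
      The four steps above can be traced with a small state model.  It is
      only a sketch: struct toy_pte, toy_write_fault() and toy_uffd_resolve()
      are hypothetical names, and the real work in do_wp_page() and the uffd
      service thread (blocking, waking, copying the page) is reduced to flag
      updates.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy PTE holding only the state the walkthrough talks about. */
struct toy_pte {
    bool write;       /* stands in for _PAGE_RW */
    bool uffd_wp;     /* stands in for _PAGE_UFFD_WP */
    bool cow_shared;  /* page is shared and still needs copy-on-write */
};

enum toy_fault {
    TOY_FAULT_UFFD_WP,  /* blocked; reported to the uffd service thread */
    TOY_FAULT_COW,      /* page copied and write bit applied */
    TOY_FAULT_NONE      /* write is allowed, nothing to handle */
};

/* One pass of a write access: uffd-wp is checked before general COW. */
static enum toy_fault toy_write_fault(struct toy_pte *pte)
{
    if (pte->uffd_wp)
        return TOY_FAULT_UFFD_WP;       /* step 1 */
    if (!pte->write) {
        pte->cow_shared = false;        /* step 3: model the page copy */
        pte->write = true;
        return TOY_FAULT_COW;
    }
    return TOY_FAULT_NONE;              /* step 4 */
}

/* Step 2: the service thread resolves the write protection. */
static void toy_uffd_resolve(struct toy_pte *pte)
{
    assert(pte->uffd_wp);
    pte->uffd_wp = false;
    /* Keep the write bit cleared while COW is still pending. */
    if (!pte->cow_shared)
        pte->write = true;
}
```

      Starting from an entry with write cleared and both uffd_wp and
      cow_shared set, the calls toy_write_fault(), toy_uffd_resolve(),
      toy_write_fault(), toy_write_fault() walk through steps 1 to 4 in
      order.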
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: merge parameters for change_protection() · 58705444
      Peter Xu authored
      change_protection() was used by either the NUMA or the mprotect() code,
      and there's one parameter for each of the callers (dirty_accountable and
      prot_numa).  Further, these parameters are passed along the calls:
      
        - change_protection_range()
        - change_p4d_range()
        - change_pud_range()
        - change_pmd_range()
        - ...
      
      Now we introduce a flag for change_protection() and all these helpers to
      replace these parameters.  Then we can avoid passing multiple parameters
      multiple times along the way.
      
      More importantly, it'll greatly simplify the work if we want to introduce
      any new parameters to change_protection().  In the follow up patches, a
      new parameter for userfaultfd write protection will be introduced.
      
      No functional change at all.
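
      The shape of the refactoring can be sketched as below.  The names are
      hypothetical (toy_change_*(), TOY_CP_*, struct toy_leaf_seen) and the
      call chain is shortened to three levels; the point is only that one
      flags word travels down the chain where two separate parameters used
      to.

```c
#include <assert.h>

/* One bit per former parameter; values are arbitrary for this sketch. */
#define TOY_CP_DIRTY_ACCT  (1ul << 0)  /* was: int dirty_accountable */
#define TOY_CP_PROT_NUMA   (1ul << 1)  /* was: int prot_numa */

/* Records what the innermost helper decoded, so callers can inspect it. */
struct toy_leaf_seen {
    int dirty_accountable;
    int prot_numa;
};

/* Innermost level: the only place the flags are unpacked. */
static void toy_change_pte_range(unsigned long cp_flags,
                                 struct toy_leaf_seen *seen)
{
    seen->dirty_accountable = !!(cp_flags & TOY_CP_DIRTY_ACCT);
    seen->prot_numa = !!(cp_flags & TOY_CP_PROT_NUMA);
}

/* Intermediate level: forwards the single flags word untouched. */
static void toy_change_pmd_range(unsigned long cp_flags,
                                 struct toy_leaf_seen *seen)
{
    toy_change_pte_range(cp_flags, seen);
}

/* Entry point: a new behaviour needs a new bit, not a new parameter. */
static void toy_change_protection(unsigned long cp_flags,
                                  struct toy_leaf_seen *seen)
{
    assert(seen != 0);
    toy_change_pmd_range(cp_flags, seen);
}
```

      Adding a further option in this model means defining another TOY_CP_*
      bit and testing it in the innermost helper; the intermediate signatures
      stay as they are.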
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-7-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: wp: add UFFDIO_COPY_MODE_WP · 72981e0e
      Andrea Arcangeli authored
      This allows UFFDIO_COPY to map pages write-protected.
      
      [peterx@redhat.com: switch to VM_WARN_ON_ONCE in mfill_atomic_pte; add brackets
       around "dst_vma->vm_flags & VM_WRITE"; fix wordings in comments and
       commit messages]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-6-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers · 55adf4de
      Andrea Arcangeli authored
      Implement helper methods to invoke userfaultfd wp faults more
      selectively: not whenever a wp fault triggers on a vma with VM_UFFD_WP set
      in vma->vm_flags, but only if the _PAGE_UFFD_WP bit is set in the pagetable
      too.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-5-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: wp: add WP pagetable tracking to x86 · 5a281062
      Andrea Arcangeli authored
      Accurate userfaultfd WP tracking is possible by tracking exactly which
      virtual memory ranges were writeprotected by userland.  We can't rely
      only on the RW bit of the mapped pagetable because that information is
      destroyed by fork() or KSM or swap.  If we were to rely on that, we'd
      need to stay on the safe side and generate false positive wp faults for
      every swapped out page.
      
      [peterx@redhat.com: append _PAGE_UFFD_WP to _PAGE_CHG_MASK]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-4-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>