1. 19 Apr, 2019 16 commits
    • Andrea Arcangeli's avatar
      coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping · 04f5866e
      Andrea Arcangeli authored
      The core dumping code has always run without holding the mmap_sem for
      writing, despite that is the only way to ensure that the entire vma
      layout will not change from under it.  Only using some signal
      serialization on the processes belonging to the mm is not nearly enough.
      This was pointed out earlier.  For example in Hugh's post from Jul 2017:
      
        https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils
      
        "Not strictly relevant here, but a related note: I was very surprised
         to discover, only quite recently, how handle_mm_fault() may be called
         without down_read(mmap_sem) - when core dumping. That seems a
         misguided optimization to me, which would also be nice to correct"
      
      In particular because the growsdown and growsup can move the
      vm_start/vm_end the various loops the core dump does around the vma will
      not be consistent if page faults can happen concurrently.
      
      Pretty much all users calling mmget_not_zero()/get_task_mm() and then
      taking the mmap_sem had the potential to introduce unexpected side
      effects in the core dumping code.
      
      Adding mmap_sem for writing around the ->core_dump invocation is a
      viable long term fix, but it requires removing all copy user and page
      faults and to replace them with get_dump_page() for all binary formats
      which is not suitable as a short term fix.
      
      For the time being this solution manually covers the places that can
      confuse the core dump either by altering the vma layout or the vma flags
      while it runs.  Once ->core_dump runs under mmap_sem for writing the
      function mmget_still_valid() can be dropped.
      
      Allowing mmap_sem protected sections to run in parallel with the
      coredump provides some minor parallelism advantage to the swapoff code
      (which seems to be safe enough by never mangling any vma field and can
      keep doing swapins in parallel to the core dumping) and to some other
      corner case.
      
      In order to facilitate the backporting I added "Fixes: 86039bd3"
      however the side effect of this same race condition in /proc/pid/mem
      should be reproducible since before 2.6.12-rc2 so I couldn't add any
      other "Fixes:" because there's no hash beyond the git genesis commit.
      
      Because find_extend_vma() is the only location outside of the process
      context that could modify the "mm" structures under mmap_sem for
      reading, by adding the mmget_still_valid() check to it, all other cases
      that take the mmap_sem for reading don't need the new check after
      mmget_not_zero()/get_task_mm().  The expand_stack() in page fault
      context also doesn't need the new check, because all tasks under core
      dumping are frozen.
      
      Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
      Fixes: 86039bd3 ("userfaultfd: add new syscall to provide memory externalization")
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: default avatarJann Horn <jannh@google.com>
      Suggested-by: default avatarOleg Nesterov <oleg@redhat.com>
      Acked-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reviewed-by: default avatarJann Horn <jannh@google.com>
      Acked-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      04f5866e
    • Arnd Bergmann's avatar
      mm/kmemleak.c: fix unused-function warning · dce5b0bd
      Arnd Bergmann authored
      The only references outside of the #ifdef have been removed, so now we
      get a warning in non-SMP configurations:
      
        mm/kmemleak.c:1404:13: error: unused function 'scan_large_block' [-Werror,-Wunused-function]
      
      Add a new #ifdef around it.
      
      Link: http://lkml.kernel.org/r/20190416123148.3502045-1-arnd@arndb.de
      Fixes: 298a32b1 ("kmemleak: powerpc: skip scanning holes in the .bss section")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dce5b0bd
    • Dan Williams's avatar
      init: initialize jump labels before command line option parsing · 6041186a
      Dan Williams authored
      When a module option, or core kernel argument, toggles a static-key it
      requires jump labels to be initialized early.  While x86, PowerPC, and
      ARM64 arrange for jump_label_init() to be called before parse_args(),
      ARM does not.
      
        Kernel command line: rdinit=/sbin/init page_alloc.shuffle=1 panic=-1 console=ttyAMA0,115200 page_alloc.shuffle=1
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 0 at ./include/linux/jump_label.h:303
        page_alloc_shuffle+0x12c/0x1ac
        static_key_enable(): static key 'page_alloc_shuffle_key+0x0/0x4' used
        before call to jump_label_init()
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted
        5.1.0-rc4-next-20190410-00003-g3367c36ce744 #1
        Hardware name: ARM Integrator/CP (Device Tree)
        [<c0011c68>] (unwind_backtrace) from [<c000ec48>] (show_stack+0x10/0x18)
        [<c000ec48>] (show_stack) from [<c07e9710>] (dump_stack+0x18/0x24)
        [<c07e9710>] (dump_stack) from [<c001bb1c>] (__warn+0xe0/0x108)
        [<c001bb1c>] (__warn) from [<c001bb88>] (warn_slowpath_fmt+0x44/0x6c)
        [<c001bb88>] (warn_slowpath_fmt) from [<c0b0c4a8>]
        (page_alloc_shuffle+0x12c/0x1ac)
        [<c0b0c4a8>] (page_alloc_shuffle) from [<c0b0c550>] (shuffle_store+0x28/0x48)
        [<c0b0c550>] (shuffle_store) from [<c003e6a0>] (parse_args+0x1f4/0x350)
        [<c003e6a0>] (parse_args) from [<c0ac3c00>] (start_kernel+0x1c0/0x488)
      
      Move the fallback call to jump_label_init() to occur before
      parse_args().
      
      The redundant calls to jump_label_init() in other archs are left intact
      in case they have static key toggling use cases that are even earlier
      than option parsing.
      
      Link: http://lkml.kernel.org/r/155544804466.1032396.13418949511615676665.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reported-by: default avatarGuenter Roeck <groeck@google.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Russell King <rmk@armlinux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6041186a
    • Sergey Senozhatsky's avatar
      kernel/watchdog_hld.c: hard lockup message should end with a newline · 8f4a8c12
      Sergey Senozhatsky authored
      Separate print_modules() and hard lockup error message.
      
      Before the patch:
      
        NMI watchdog: Watchdog detected hard LOCKUP on cpu 1Modules linked in: nls_cp437
      
      Link: http://lkml.kernel.org/r/20190412062557.2700-1-sergey.senozhatsky@gmail.comSigned-off-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8f4a8c12
    • Mark Rutland's avatar
      kcov: improve CONFIG_ARCH_HAS_KCOV help text · 40453c4f
      Mark Rutland authored
      The help text for CONFIG_ARCH_HAS_KCOV is stale, and describes the
      feature as being enabled only for x86_64, when it is now enabled for
      several architectures, including arm, arm64, powerpc, and s390.
      
      Let's remove that stale help text, and update it along the lines of hat
      for ARCH_HAS_FORTIFY_SOURCE, better describing when an architecture
      should select CONFIG_ARCH_HAS_KCOV.
      
      Link: http://lkml.kernel.org/r/20190412102733.5154-1-mark.rutland@arm.comSigned-off-by: default avatarMark Rutland <mark.rutland@arm.com>
      Acked-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      40453c4f
    • Johannes Weiner's avatar
      mm: fix inactive list balancing between NUMA nodes and cgroups · 3b991208
      Johannes Weiner authored
      During !CONFIG_CGROUP reclaim, we expand the inactive list size if it's
      thrashing on the node that is about to be reclaimed.  But when cgroups
      are enabled, we suddenly ignore the node scope and use the cgroup scope
      only.  The result is that pressure bleeds between NUMA nodes depending
      on whether cgroups are merely compiled into Linux.  This behavioral
      difference is unexpected and undesirable.
      
      When the refault adaptivity of the inactive list was first introduced,
      there were no statistics at the lruvec level - the intersection of node
      and memcg - so it was better than nothing.
      
      But now that we have that infrastructure, use lruvec_page_state() to
      make the list balancing decision always NUMA aware.
      
      [hannes@cmpxchg.org: fix bisection hole]
        Link: http://lkml.kernel.org/r/20190417155241.GB23013@cmpxchg.org
      Link: http://lkml.kernel.org/r/20190412144438.2645-1-hannes@cmpxchg.org
      Fixes: 2a2e4885 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3b991208
    • Qian Cai's avatar
      mm/hotplug: treat CMA pages as unmovable · 1a9f2191
      Qian Cai authored
      has_unmovable_pages() is used by allocating CMA and gigantic pages as
      well as the memory hotplug.  The later doesn't know how to offline CMA
      pool properly now, but if an unused (free) CMA page is encountered, then
      has_unmovable_pages() happily considers it as a free memory and
      propagates this up the call chain.  Memory offlining code then frees the
      page without a proper CMA tear down which leads to an accounting issues.
      Moreover if the same memory range is onlined again then the memory never
      gets back to the CMA pool.
      
      State after memory offline:
      
       # grep cma /proc/vmstat
       nr_free_cma 205824
      
       # cat /sys/kernel/debug/cma/cma-kvm_cma/count
       209920
      
      Also, kmemleak still think those memory address are reserved below but
      have already been used by the buddy allocator after onlining.  This
      patch fixes the situation by treating CMA pageblocks as unmovable except
      when has_unmovable_pages() is called as part of CMA allocation.
      
        Offlined Pages 4096
        kmemleak: Cannot insert 0xc000201f7d040008 into the object search tree (overlaps existing)
        Call Trace:
          dump_stack+0xb0/0xf4 (unreliable)
          create_object+0x344/0x380
          __kmalloc_node+0x3ec/0x860
          kvmalloc_node+0x58/0x110
          seq_read+0x41c/0x620
          __vfs_read+0x3c/0x70
          vfs_read+0xbc/0x1a0
          ksys_read+0x7c/0x140
          system_call+0x5c/0x70
        kmemleak: Kernel memory leak detector disabled
        kmemleak: Object 0xc000201cc8000000 (size 13757317120):
        kmemleak:   comm "swapper/0", pid 0, jiffies 4294937297
        kmemleak:   min_count = -1
        kmemleak:   count = 0
        kmemleak:   flags = 0x5
        kmemleak:   checksum = 0
        kmemleak:   backtrace:
             cma_declare_contiguous+0x2a4/0x3b0
             kvm_cma_reserve+0x11c/0x134
             setup_arch+0x300/0x3f8
             start_kernel+0x9c/0x6e8
             start_here_common+0x1c/0x4b0
        kmemleak: Automatic memory scanning thread ended
      
      [cai@lca.pw: use is_migrate_cma_page() and update commit log]
        Link: http://lkml.kernel.org/r/20190416170510.20048-1-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190413002623.8967-1-cai@lca.pwSigned-off-by: default avatarQian Cai <cai@lca.pw>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1a9f2191
    • Alexey Dobriyan's avatar
      proc: fixup proc-pid-vm test · 68545aa1
      Alexey Dobriyan authored
      Silly sizeof(pointer) vs sizeof(uint8_t[]) bug.
      
      Link: http://lkml.kernel.org/r/20190414123009.GA12971@avx2
      Fixes: e483b020 ("proc: test /proc/*/maps, smaps, smaps_rollup, statm")
      Signed-off-by: default avatarAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      68545aa1
    • Alexey Dobriyan's avatar
      proc: fix map_files test on F29 · 8cd40d1d
      Alexey Dobriyan authored
      F29 bans mapping first 64KB even for root making test fail.  Iterate
      from address 0 until mmap() works.
      
      Gentoo (root):
      
      	openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3
      	mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0
      
      Gentoo (non-root):
      
      	openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3
      	mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EPERM (Operation not permitted)
      	mmap(0x1000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0x1000
      
      F29 (root):
      
      	openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3
      	mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
      	mmap(0x1000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
      	mmap(0x2000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
      	mmap(0x3000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
      	mmap(0x4000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
      	mmap(0x5000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
      	mmap(0x6000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
      	mmap(0x7000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
      	mmap(0x8000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
      	mmap(0x9000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
      	mmap(0xa000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
      	mmap(0xb000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
      	mmap(0xc000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
      	mmap(0xd000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
      	mmap(0xe000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
      	mmap(0xf000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
      	mmap(0x10000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0x10000
      
      Now all proc tests succeed on F29 if run as root, at last!
      
      Link: http://lkml.kernel.org/r/20190414123612.GB12971@avx2Signed-off-by: default avatarAlexey Dobriyan <adobriyan@gmail.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8cd40d1d
    • Konstantin Khlebnikov's avatar
      mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n · e8277b3b
      Konstantin Khlebnikov authored
      Commit 58bc4c34 ("mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly")
      depends on skipping vmstat entries with empty name introduced in
      7aaf7727 ("mm: don't show nr_indirectly_reclaimable in
      /proc/vmstat") but reverted in b29940c1 ("mm: rename and change
      semantics of nr_indirectly_reclaimable_bytes").
      
      So skipping no longer works and /proc/vmstat has misformatted lines " 0".
      
      This patch simply shows debug counters "nr_tlb_remote_*" for UP.
      
      Link: http://lkml.kernel.org/r/155481488468.467.4295519102880913454.stgit@buzz
      Fixes: 58bc4c34 ("mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly")
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e8277b3b
    • zhong jiang's avatar
      mm/memory_hotplug: do not unlock after failing to take the device_hotplug_lock · 37803841
      zhong jiang authored
      When adding memory by probing a memory block in the sysfs interface,
      there is an obvious issue where we will unlock the device_hotplug_lock
      when we failed to takes it.
      
      That issue was introduced in 8df1d0e4 ("mm/memory_hotplug: make
      add_memory() take the device_hotplug_lock").
      
      We should drop out in time when failing to take the device_hotplug_lock.
      
      Link: http://lkml.kernel.org/r/1554696437-9593-1-git-send-email-zhongjiang@huawei.com
      Fixes: 8df1d0e4 ("mm/memory_hotplug: make add_memory() take the device_hotplug_lock")
      Signed-off-by: default avatarzhong jiang <zhongjiang@huawei.com>
      Reported-by: default avatarYang yingliang <yangyingliang@huawei.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      37803841
    • Hugh Dickins's avatar
      mm: swapoff: shmem_unuse() stop eviction without igrab() · af53d3e9
      Hugh Dickins authored
      The igrab() in shmem_unuse() looks good, but we forgot that it gives no
      protection against concurrent unmounting: a point made by Konstantin
      Khlebnikov eight years ago, and then fixed in 2.6.39 by 778dd893
      ("tmpfs: fix race between umount and swapoff").  The current 5.1-rc
      swapoff is liable to hit "VFS: Busy inodes after unmount of tmpfs.
      Self-destruct in 5 seconds.  Have a nice day..." followed by GPF.
      
      Once again, give up on using igrab(); but don't go back to making such
      heavy-handed use of shmem_swaplist_mutex as last time: that would spoil
      the new design, and I expect could deadlock inside shmem_swapin_page().
      
      Instead, shmem_unuse() just raise a "stop_eviction" count in the shmem-
      specific inode, and shmem_evict_inode() wait for that to go down to 0.
      Call it "stop_eviction" rather than "swapoff_busy" because it can be put
      to use for others later (huge tmpfs patches expect to use it).
      
      That simplifies shmem_unuse(), protecting it from both unlink and
      unmount; and in practice lets it locate all the swap in its first try.
      But do not rely on that: there's still a theoretical case, when
      shmem_writepage() might have been preempted after its get_swap_page(),
      before making the swap entry visible to swapoff.
      
      [hughd@google.com: remove incorrect list_del()]
        Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904091133570.1898@eggly.anvils
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081259400.1523@eggly.anvils
      Fixes: b56a2d8a ("mm: rid swapoff of quadratic complexity")
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: "Alex Xu (Hello71)" <alex_y_xu@yahoo.ca>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Kelley Nielsen <kelleynnn@gmail.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vineeth Pillai <vpillai@digitalocean.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      af53d3e9
    • Hugh Dickins's avatar
      mm: swapoff: take notice of completion sooner · 64165b1a
      Hugh Dickins authored
      The old try_to_unuse() implementation was driven by find_next_to_unuse(),
      which terminated as soon as all the swap had been freed.
      
      Add inuse_pages checks now (alongside signal_pending()) to stop scanning
      mms and swap_map once finished.
      
      The same ought to be done in shmem_unuse() too, but never was before,
      and needs a different interface: so leave it as is for now.
      
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081258200.1523@eggly.anvils
      Fixes: b56a2d8a ("mm: rid swapoff of quadratic complexity")
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: "Alex Xu (Hello71)" <alex_y_xu@yahoo.ca>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Kelley Nielsen <kelleynnn@gmail.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vineeth Pillai <vpillai@digitalocean.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      64165b1a
    • Hugh Dickins's avatar
      mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES · dd862deb
      Hugh Dickins authored
      SWAP_UNUSE_MAX_TRIES 3 appeared to work well in earlier testing, but
      further testing has proved it to be a source of unnecessary swapoff
      EBUSY failures (which can then be followed by unmount EBUSY failures).
      
      When mmget_not_zero() or shmem's igrab() fails, there is an mm exiting
      or inode being evicted, freeing up swap independent of try_to_unuse().
      Those typically completed much sooner than the old quadratic swapoff,
      but now it's more common that swapoff may need to wait for them.
      
      It's possible to move those cases from init_mm.mmlist and shmem_swaplist
      to separate "exiting" swaplists, and try_to_unuse() then wait for those
      lists to be emptied; but we've not bothered with that in the past, and
      don't want to risk missing some other forgotten case.  So just revert to
      cycling around until the swap is gone, without any retries limit.
      
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081256170.1523@eggly.anvils
      Fixes: b56a2d8a ("mm: rid swapoff of quadratic complexity")
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: "Alex Xu (Hello71)" <alex_y_xu@yahoo.ca>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Kelley Nielsen <kelleynnn@gmail.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vineeth Pillai <vpillai@digitalocean.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dd862deb
    • Hugh Dickins's avatar
      mm: swapoff: shmem_find_swap_entries() filter out other types · 87039546
      Hugh Dickins authored
      Swapfile "type" was passed all the way down to shmem_unuse_inode(), but
      then forgotten from shmem_find_swap_entries(): with the result that
      removing one swapfile would try to free up all the swap from shmem - no
      problem when only one swapfile anyway, but counter-productive when more,
      causing swapoff to be unnecessarily OOM-killed when it should succeed.
      
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081254470.1523@eggly.anvils
      Fixes: b56a2d8a ("mm: rid swapoff of quadratic complexity")
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: "Alex Xu (Hello71)" <alex_y_xu@yahoo.ca>
      Cc: Vineeth Pillai <vpillai@digitalocean.com>
      Cc: Kelley Nielsen <kelleynnn@gmail.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      87039546
    • Qian Cai's avatar
      slab: store tagged freelist for off-slab slabmgmt · 1a62b18d
      Qian Cai authored
      Commit 51dedad0 ("kasan, slab: make freelist stored without tags")
      calls kasan_reset_tag() for off-slab slab management object leading to
      freelist being stored non-tagged.
      
      However, cache_grow_begin() calls alloc_slabmgmt() which calls
      kmem_cache_alloc_node() assigns a tag for the address and stores it in
      the shadow address.  As the result, it causes endless errors below
      during boot due to drain_freelist() -> slab_destroy() ->
      kasan_slab_free() which compares already untagged freelist against the
      stored tag in the shadow address.
      
      Since off-slab slab management object freelist is such a special case,
      just store it tagged.  Non-off-slab management object freelist is still
      stored untagged which has not been assigned a tag and should not cause
      any other troubles with this inconsistency.
      
        BUG: KASAN: double-free or invalid-free in slab_destroy+0x84/0x88
        Pointer tag: [ff], memory tag: [99]
      
        CPU: 0 PID: 1376 Comm: kworker/0:4 Tainted: G        W 5.1.0-rc3+ #8
        Hardware name: HPE Apollo 70             /C01_APACHE_MB         , BIOS L50_5.13_1.0.6 07/10/2018
        Workqueue: cgroup_destroy css_killed_work_fn
        Call trace:
         print_address_description+0x74/0x2a4
         kasan_report_invalid_free+0x80/0xc0
         __kasan_slab_free+0x204/0x208
         kasan_slab_free+0xc/0x18
         kmem_cache_free+0xe4/0x254
         slab_destroy+0x84/0x88
         drain_freelist+0xd0/0x104
         __kmem_cache_shrink+0x1ac/0x224
         __kmemcg_cache_deactivate+0x1c/0x28
         memcg_deactivate_kmem_caches+0xa0/0xe8
         memcg_offline_kmem+0x8c/0x3d4
         mem_cgroup_css_offline+0x24c/0x290
         css_killed_work_fn+0x154/0x618
         process_one_work+0x9cc/0x183c
         worker_thread+0x9b0/0xe38
         kthread+0x374/0x390
         ret_from_fork+0x10/0x18
      
        Allocated by task 1625:
         __kasan_kmalloc+0x168/0x240
         kasan_slab_alloc+0x18/0x20
         kmem_cache_alloc_node+0x1f8/0x3a0
         cache_grow_begin+0x4fc/0xa24
         cache_alloc_refill+0x2f8/0x3e8
         kmem_cache_alloc+0x1bc/0x3bc
         sock_alloc_inode+0x58/0x334
         alloc_inode+0xb8/0x164
         new_inode_pseudo+0x20/0xec
         sock_alloc+0x74/0x284
         __sock_create+0xb0/0x58c
         sock_create+0x98/0xb8
         __sys_socket+0x60/0x138
         __arm64_sys_socket+0xa4/0x110
         el0_svc_handler+0x2c0/0x47c
         el0_svc+0x8/0xc
      
        Freed by task 1625:
         __kasan_slab_free+0x114/0x208
         kasan_slab_free+0xc/0x18
         kfree+0x1a8/0x1e0
         single_release+0x7c/0x9c
         close_pdeo+0x13c/0x43c
         proc_reg_release+0xec/0x108
         __fput+0x2f8/0x784
         ____fput+0x1c/0x28
         task_work_run+0xc0/0x1b0
         do_notify_resume+0xb44/0x1278
         work_pending+0x8/0x10
      
        The buggy address belongs to the object at ffff809681b89e00
         which belongs to the cache kmalloc-128 of size 128
        The buggy address is located 0 bytes inside of
         128-byte region [ffff809681b89e00, ffff809681b89e80)
        The buggy address belongs to the page:
        page:ffff7fe025a06e00 count:1 mapcount:0 mapping:01ff80082000fb00
        index:0xffff809681b8fe04
        flags: 0x17ffffffc000200(slab)
        raw: 017ffffffc000200 ffff7fe025a06d08 ffff7fe022ef7b88 01ff80082000fb00
        raw: ffff809681b8fe04 ffff809681b80000 00000001000000e0 0000000000000000
        page dumped because: kasan: bad access detected
        page allocated via order 0, migratetype Unmovable, gfp_mask
        0x2420c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE)
         prep_new_page+0x4e0/0x5e0
         get_page_from_freelist+0x4ce8/0x50d4
         __alloc_pages_nodemask+0x738/0x38b8
         cache_grow_begin+0xd8/0xa24
         ____cache_alloc_node+0x14c/0x268
         __kmalloc+0x1c8/0x3fc
         ftrace_free_mem+0x408/0x1284
         ftrace_free_init_mem+0x20/0x28
         kernel_init+0x24/0x548
         ret_from_fork+0x10/0x18
      
        Memory state around the buggy address:
         ffff809681b89c00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
         ffff809681b89d00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
        >ffff809681b89e00: 99 99 99 99 99 99 99 99 fe fe fe fe fe fe fe fe
                           ^
         ffff809681b89f00: 43 43 43 43 43 fe fe fe fe fe fe fe fe fe fe fe
         ffff809681b8a000: 6d fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
      
      Link: http://lkml.kernel.org/r/20190403022858.97584-1-cai@lca.pw
      Fixes: 51dedad0 ("kasan, slab: make freelist stored without tags")
      Signed-off-by: default avatarQian Cai <cai@lca.pw>
      Reviewed-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1a62b18d
  2. 18 Apr, 2019 7 commits
    • Linus Torvalds's avatar
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 6d906f99
      Linus Torvalds authored
      Pull arm64 fix from Catalin Marinas:
       "Avoid compiler uninitialised warning introduced by recent arm64 futex
        fix"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: futex: Restore oldval initialization to work around buggy compilers
      6d906f99
    • Nathan Chancellor's avatar
      arm64: futex: Restore oldval initialization to work around buggy compilers · ff8acf92
      Nathan Chancellor authored
      Commit 045afc24 ("arm64: futex: Fix FUTEX_WAKE_OP atomic ops with
      non-zero result value") removed oldval's zero initialization in
      arch_futex_atomic_op_inuser because it is not necessary. Unfortunately,
      Android's arm64 GCC 4.9.4 [1] does not agree:
      
      ../kernel/futex.c: In function 'do_futex':
      ../kernel/futex.c:1658:17: warning: 'oldval' may be used uninitialized
      in this function [-Wmaybe-uninitialized]
         return oldval == cmparg;
                       ^
      In file included from ../kernel/futex.c:73:0:
      ../arch/arm64/include/asm/futex.h:53:6: note: 'oldval' was declared here
        int oldval, ret, tmp;
            ^
      
      GCC fails to follow that when ret is non-zero, futex_atomic_op_inuser
      returns right away, avoiding the uninitialized use that it claims.
      Restoring the zero initialization works around this issue.
      
      [1]: https://android.googlesource.com/platform/prebuilts/gcc/linux-x86/aarch64/aarch64-linux-android-4.9/
      
      Cc: stable@vger.kernel.org
      Fixes: 045afc24 ("arm64: futex: Fix FUTEX_WAKE_OP atomic ops with non-zero result value")
      Reviewed-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarNathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      ff8acf92
    • Christian Brauner's avatar
      signal: use fdget() since we don't allow O_PATH · 738a7832
      Christian Brauner authored
      As stated in the original commit for pidfd_send_signal() we don't allow
      to signal processes through O_PATH file descriptors since it is
      semantically equivalent to a write on the pidfd.
      
      We already correctly error out right now and return EBADF if an O_PATH
      fd is passed.  This is because we use file->f_op to detect whether a
      pidfd is passed and O_PATH fds have their file->f_op set to empty_fops
      in do_dentry_open() and thus fail the test.
      
      Thus, there is no regression.  It's just semantically correct to use
      fdget() and return an error right from there instead of taking a
      reference and returning an error later.
      Signed-off-by: default avatarChristian Brauner <christian@brauner.io>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jann Horn <jann@thejh.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      738a7832
    • Linus Torvalds's avatar
      Merge tag 's390-5.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · d22113a2
      Linus Torvalds authored
      Pull s390 bug fixes from Martin Schwidefsky:
      
       - Fix overwrite of the initial ramdisk due to misuse of IS_ENABLED
      
       - Fix integer overflow in the dasd driver resulting in incorrect number
         of blocks for large devices
      
       - Fix a lockdep false positive in the 3270 driver
      
       - Fix a deadlock in the zcrypt driver
      
       - Fix incorrect debug feature entries in the pkey api
      
       - Fix inline assembly constraints fallout with CONFIG_KASAN=y
      
      * tag 's390-5.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390: correct some inline assembly constraints
        s390/pkey: add one more argument space for debug feature entry
        s390/zcrypt: fix possible deadlock situation on ap queue remove
        s390/3270: fix lockdep false positive on view->lock
        s390/dasd: Fix capacity calculation for large volumes
        s390/mem_detect: Use IS_ENABLED(CONFIG_BLK_DEV_INITRD)
      d22113a2
    • Linus Torvalds's avatar
      Merge tag 'afs-fixes-20190413' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs · 2a852fd1
      Linus Torvalds authored
      Pull AFS fixes from David Howells:
      
       - Stop using the deprecated get_seconds().
      
       - Don't make tracepoint strings const as the section they go in isn't
         read-only.
      
       - Differentiate failure due to unmarshalling from other failure cases.
         We shouldn't abort with RXGEN_CC/SS_UNMARSHAL if it's not due to
         unmarshalling.
      
       - Add a missing unlock_page().
      
       - Fix the interaction between receiving a notification from a server
         that it has invalidated all outstanding callback promises and a
         client call that we're in the middle of making that will get a new
         promise.
      
      * tag 'afs-fixes-20190413' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
        afs: Fix in-progess ops to ignore server-level callback invalidation
        afs: Unlock pages for __pagevec_release()
        afs: Differentiate abort due to unmarshalling from other errors
        afs: Avoid section confusion in CM_NAME
        afs: avoid deprecated get_seconds()
      2a852fd1
    • Linus Torvalds's avatar
      Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · d3ce3b18
      Linus Torvalds authored
      Pull crypto fix from Herbert Xu:
       "Fix a bug in the implementation of the x86 accelerated version of
        poly1305"
      
      * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        crypto: x86/poly1305 - fix overflow during partial reduction
      d3ce3b18
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-2019-04-18' of git://anongit.freedesktop.org/drm/drm · 95ea5529
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Since Easter is looming for me, I'm just pushing whatever is in my
        tree, I'll see what else turns up and maybe I'll send another pull
        early next week if there is anything.
      
        tegra:
         - stream id programming fix
         - avoid divide by 0 for bad hdmi audio setup code
      
        ttm:
         - Hugepages fix
         - refcount imbalance in error path fix
      
        amdgpu:
         - GPU VM fixes for Vega/RV
         - DC AUX fix for active DP-DVI dongles
         - DC fix for multihead regression"
      
      * tag 'drm-fixes-2019-04-18' of git://anongit.freedesktop.org/drm/drm:
        drm/tegra: hdmi: Setup audio only if configured
        drm/amd/display: If one stream full updates, full update all planes
        drm/amdgpu/gmc9: fix VM_L2_CNTL3 programming
        drm/amdgpu: shadow in shadow_list without tbo.mem.start cause page fault in sriov TDR
        gpu: host1x: Program stream ID to bypass without SMMU
        drm/amd/display: extending AUX SW Timeout
        drm/ttm: fix dma_fence refcount imbalance on error path
        drm/ttm: fix incrementing the page pointer for huge pages
        drm/ttm: fix start page for huge page check in ttm_put_pages()
        drm/ttm: fix out-of-bounds read in ttm_put_pages() v2
      95ea5529
  3. 17 Apr, 2019 17 commits