1. 23 Mar, 2018 13 commits
    • mm, thp: do not cause memcg oom for thp · 9d3c3354
      David Rientjes authored
      Commit 25160354 ("mm, thp: remove __GFP_NORETRY from khugepaged and
      madvised allocations") changed the page allocator to no longer detect
      thp allocations based on __GFP_NORETRY.
      
      It did not, however, modify the mem cgroup try_charge() path to avoid
      oom kill for either khugepaged collapsing or thp faulting.  It is never
      expected to oom kill a process to allocate a hugepage for thp; reclaim
      is governed by the thp defrag mode and MADV_HUGEPAGE, but allocations
      (and charging) should fall back instead of oom killing processes.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803191409420.124411@chino.kir.corp.google.com
      Fixes: 25160354 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: wake up flushers for legacy cgroups too · 1c610d5f
      Andrey Ryabinin authored
      Commit 726d061f ("mm: vmscan: kick flushers when we encounter dirty
      pages on the LRU") added flusher invocation to shrink_inactive_list()
      when many dirty pages on the LRU are encountered.
      
      However, shrink_inactive_list() doesn't wake up flushers for legacy
      cgroup reclaim, so the next commit bbef9384 ("mm: vmscan: remove old
      flusher wakeup from direct reclaim path") removed the only source of
      flusher wakeups in the legacy mem cgroup reclaim path.
      
      This leads to premature OOM if there are too many dirty pages in a cgroup:
          # mkdir /sys/fs/cgroup/memory/test
          # echo $$ > /sys/fs/cgroup/memory/test/tasks
          # echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
          # dd if=/dev/zero of=tmp_file bs=1M count=100
          Killed
      
          dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
      
          Call Trace:
           dump_stack+0x46/0x65
           dump_header+0x6b/0x2ac
           oom_kill_process+0x21c/0x4a0
           out_of_memory+0x2a5/0x4b0
           mem_cgroup_out_of_memory+0x3b/0x60
           mem_cgroup_oom_synchronize+0x2ed/0x330
           pagefault_out_of_memory+0x24/0x54
           __do_page_fault+0x521/0x540
           page_fault+0x45/0x50
      
          Task in /test killed as a result of limit of /test
          memory: usage 51200kB, limit 51200kB, failcnt 73
          memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
          kmem: usage 296kB, limit 9007199254740988kB, failcnt 0
          Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB
                  mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB
                  active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB
          Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child
          Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB
          oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      Wake up flushers in legacy cgroup reclaim too.
      
      Link: http://lkml.kernel.org/r/20180315164553.17856-1-aryabinin@virtuozzo.com
      Fixes: bbef9384 ("mm: vmscan: remove old flusher wakeup from direct reclaim path")
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "mm: page_alloc: skip over regions of invalid pfns where possible" · f59f1caf
      Daniel Vacek authored
      This reverts commit b92df1de ("mm: page_alloc: skip over regions of
      invalid pfns where possible").  The commit was meant to speed up boot
      init by skipping the loop in memmap_init_zone() for invalid pfns.
      
      But given some specific memory mappings on x86_64 (or, more generally,
      in theory anywhere except arm with CONFIG_HAVE_ARCH_PFN_VALID) the
      implementation also skips valid pfns, which is plain wrong and causes
      'kernel BUG at mm/page_alloc.c:1389!'
      
        crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
        kernel BUG at mm/page_alloc.c:1389!
        invalid opcode: 0000 [#1] SMP
        --
        RIP: 0010: move_freepages+0x15e/0x160
        --
        Call Trace:
          move_freepages_block+0x73/0x80
          __rmqueue+0x263/0x460
          get_page_from_freelist+0x7e1/0x9e0
          __alloc_pages_nodemask+0x176/0x420
        --
      
        crash> page_init_bug -v | grep RAM
        <struct resource 0xffff88067fffd2f8>          1000 -        9bfff       System RAM (620.00 KiB)
        <struct resource 0xffff88067fffd3a0>        100000 -     430bffff       System RAM (  1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
        <struct resource 0xffff88067fffd410>      4b0c8000 -     4bf9cfff       System RAM ( 14.83 MiB = 15188.00 KiB)
        <struct resource 0xffff88067fffd480>      4bfac000 -     646b1fff       System RAM (391.02 MiB = 400408.00 KiB)
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff       System RAM (480.00 KiB)
        <struct resource 0xffff88067fffd640>     100000000 -    67fffffff       System RAM ( 22.00 GiB)
      
        crash> page_init_bug | head -6
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff       System RAM (480.00 KiB)
        <struct page 0xffffea0001ede200>   1fffff00000000  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        <struct page 0xffffea0001ede200>       505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0>
        <struct page 0xffffea0001ed8000>                0  0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA               1       4095
        <struct page 0xffffea0001edffc0>   1fffff00000400  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        BUG, zones differ!
      
        crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
              PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
        ffffea0001e00000  78000000                0        0  0 0
        ffffea0001ed7fc0  7b5ff000                0        0  0 0
        ffffea0001ed8000  7b600000                0        0  0 0       <<<<
        ffffea0001ede1c0  7b787000                0        0  0 0
        ffffea0001ede200  7b788000                0        0  1 1fffff00000000
      
      Link: http://lkml.kernel.org/r/20180316143855.29838-1-neelx@redhat.com
      Fixes: b92df1de ("mm: page_alloc: skip over regions of invalid pfns where possible")
      Signed-off-by: Daniel Vacek <neelx@redhat.com>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink() · b3cd54b2
      Kirill A. Shutemov authored
      shmem_unused_huge_shrink() gets called from reclaim path.  Waiting for
      page lock may lead to deadlock there.
      
      There was a bug report that may be attributed to this:
      
        http://lkml.kernel.org/r/alpine.LRH.2.11.1801242349220.30642@mail.ewheeler.net
      
      Replace lock_page() with trylock_page() and skip the page if we failed
      to lock it.  We will get to the page on the next scan.
      
      We can test for the PageTransHuge() outside the page lock as we only
      need protection against splitting the page under us.  Holding a pin on
      the page is enough for this.
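
      The trylock-and-skip pattern described above can be sketched in
      userspace C.  This is only an illustration, not the kernel change:
      struct sk_item, sk_trylock(), sk_unlock() and sk_scan() are
      hypothetical names, and a C11 atomic_flag stands in for the page lock.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page queued on a shrinker list. */
struct sk_item {
    atomic_flag locked;   /* set while some context holds the "page lock" */
    bool split_done;      /* set once this item has been processed */
};

/* Non-blocking lock attempt: returns true only if we took the lock. */
static bool sk_trylock(struct sk_item *it)
{
    return !atomic_flag_test_and_set(&it->locked);
}

static void sk_unlock(struct sk_item *it)
{
    atomic_flag_clear(&it->locked);
}

/*
 * Scan n items without ever waiting for a lock.  Contended items are
 * skipped and left for the next scan.  Returns how many were processed.
 */
static size_t sk_scan(struct sk_item *items, size_t n)
{
    size_t done = 0;

    for (size_t i = 0; i < n; i++) {
        if (!sk_trylock(&items[i]))
            continue;               /* busy: pick it up next time */
        items[i].split_done = true;
        sk_unlock(&items[i]);
        done++;
    }
    return done;
}
```

      Because the scan never sleeps on a lock, a holder that is itself
      waiting on reclaim cannot deadlock against it; the cost is that a
      contended item is only handled on a later pass.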
      
      Link: http://lkml.kernel.org/r/20180316210830.43738-1-kirill.shutemov@linux.intel.com
      Fixes: 779750d2 ("shmem: split huge pages beyond i_size under memory pressure")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Eric Wheeler <linux-mm@lists.ewheeler.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: do not wait for lock_page() in deferred_split_scan() · fa41b900
      Kirill A. Shutemov authored
      deferred_split_scan() gets called from reclaim path.  Waiting for page
      lock may lead to deadlock there.
      
      Replace lock_page() with trylock_page() and skip the page if we failed
      to lock it.  We will get to the page on the next scan.
      
      Link: http://lkml.kernel.org/r/20180315150747.31945-1-kirill.shutemov@linux.intel.com
      Fixes: 9a982250 ("thp: introduce deferred_split_huge_page()")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/khugepaged.c: convert VM_BUG_ON() to collapse fail · fece2029
      Kirill A. Shutemov authored
      khugepaged is not yet able to convert PTE-mapped huge pages back to PMD
      mapped.  We do not collapse such pages.  See the check in
      khugepaged_scan_pmd().
      
      But if between khugepaged_scan_pmd() and __collapse_huge_page_isolate()
      somebody managed to instantiate THP in the range and then split the PMD
      back to PTEs we would have a problem --
      VM_BUG_ON_PAGE(PageCompound(page)) will get triggered.
      
      It's possible since we drop mmap_sem during collapse to re-take for
      write.
      
      Replace the VM_BUG_ON() with graceful collapse fail.
      
      Link: http://lkml.kernel.org/r/20180315152353.27989-1-kirill.shutemov@linux.intel.com
      Fixes: b1caa957 ("khugepaged: ignore pmd tables with THP mapped with ptes")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86/mm: implement free pmd/pte page interfaces · 28ee90fe
      Toshi Kani authored
      Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
      clear a given pud/pmd entry and free up lower level page table(s).
      
      The address range associated with the pud/pmd entry must have been
      purged by INVLPG.
      
      Link: http://lkml.kernel.org/r/20180314180155.19492-3-toshi.kani@hpe.com
      Fixes: e61ce6ad ("mm: change ioremap to set up huge I/O mappings")
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Reported-by: Lei Li <lious.lilei@hisilicon.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmalloc: add interfaces to free unmapped page table · b6bdb751
      Toshi Kani authored
      On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may
      create pud/pmd mappings.  A kernel panic was observed on arm64 systems
      with Cortex-A75 in the following steps as described by Hanjun Guo.
      
       1. ioremap a 4K size; a valid page table will be built;
       2. iounmap it; pte0 will be set to 0;
       3. ioremap the same address with 2M size; pgd/pmd is unchanged,
          then a new value is set for pmd;
       4. pte0 is leaked;
       5. the CPU may hit an exception because the old pmd is still in the TLB,
          which will lead to a kernel panic.
      
      This panic is not reproducible on x86.  INVLPG, called from iounmap,
      purges all levels of entries associated with the purged address on x86.
      x86 still has the memory leak.
      
      The patch changes the ioremap path to free unmapped page table(s) since
      doing so in the unmap path has the following issues:
      
       - The iounmap() path is shared with vunmap(). Since vmap() only
         supports pte mappings, making vunmap() to free a pte page is an
         overhead for regular vmap users as they do not need a pte page freed
         up.
      
       - Checking if all entries in a pte page are cleared in the unmap path
         is racy, and serializing this check is expensive.
      
       - The unmap path calls free_vmap_area_noflush() to do lazy TLB purges.
         Clearing a pud/pmd entry before the lazy TLB purges needs extra TLB
         purge.
      
      Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which
      clear a given pud/pmd entry and free up a page for the lower level
      entries.
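
      A toy model of the idea, assuming a single pmd slot that either points
      to a lower-level pte table or holds a huge mapping.  struct sk_pmd,
      sk_pmd_free_pte_page() and sk_map_huge() are hypothetical names for
      this sketch and do not reproduce the kernel interfaces or their TLB
      handling.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical pmd slot: points to a pte table or holds a huge mapping. */
struct sk_pmd {
    void *pte_table;   /* lower-level table left over from a 4K mapping */
    bool huge;         /* true once a huge mapping is installed */
};

/*
 * Clear the slot and free the lower-level table it pointed to.
 * Returns true when the slot is clear and safe to reuse.
 */
static bool sk_pmd_free_pte_page(struct sk_pmd *pmd)
{
    free(pmd->pte_table);
    pmd->pte_table = NULL;
    return true;
}

/*
 * Map path: before installing a huge mapping, release any stale
 * pte table so it is neither leaked nor still reachable.
 */
static bool sk_map_huge(struct sk_pmd *pmd)
{
    if (pmd->pte_table && !sk_pmd_free_pte_page(pmd))
        return false;
    pmd->huge = true;
    return true;
}
```

      Doing the cleanup on the map path, as the message argues, keeps the
      shared unmap path free of extra work for regular vmap users.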
      
      This patch implements their stub functions on x86 and arm64, which work
      as a workaround.
      
      [akpm@linux-foundation.org: fix typo in pmd_free_pte_page() stub]
      Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com
      Fixes: e61ce6ad ("mm: change ioremap to set up huge I/O mappings")
      Reported-by: Lei Li <lious.lilei@hisilicon.com>
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Wang Xuefeng <wxf.wang@hisilicon.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • h8300: remove extraneous __BIG_ENDIAN definition · 1705f7c5
      Arnd Bergmann authored
      A bugfix I did earlier caused a build regression on h8300, which defines
      the __BIG_ENDIAN macro in a slightly different way than the generic
      code:
      
        arch/h8300/include/asm/byteorder.h:5:0: warning: "__BIG_ENDIAN" redefined
      
      We don't need to define it here, as the same macro is already provided
      by the linux/byteorder/big_endian.h, and that version does not conflict.
      
      While this is a v4.16 regression, my earlier patch also got backported
      to the 4.14 and 4.15 stable kernels, so we need the fixup there as well.
      
      Link: http://lkml.kernel.org/r/20180313120752.2645129-1-arnd@arndb.de
      Fixes: 101110f6 ("Kbuild: always define endianess in kconfig.h")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlbfs: check for pgoff value overflow · 63489f8e
      Mike Kravetz authored
      A vma with vm_pgoff large enough to overflow a loff_t type when
      converted to a byte offset can be passed via the remap_file_pages system
      call.  The hugetlbfs mmap routine uses the byte offset to calculate
      reservations and file size.
      
      A sequence such as:
      
        mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
        remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);
      
      will result in the following when the task exits or the file is closed:
      
        kernel BUG at mm/hugetlb.c:749!
        Call Trace:
          hugetlbfs_evict_inode+0x2f/0x40
          evict+0xcb/0x190
          __dentry_kill+0xcb/0x150
          __fput+0x164/0x1e0
          task_work_run+0x84/0xa0
          exit_to_usermode_loop+0x7d/0x80
          do_syscall_64+0x18b/0x190
          entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      The overflowed pgoff value causes hugetlbfs to try to set up a mapping
      with a negative range (end < start) that leaves invalid state which
      causes the BUG.
      
      The previous overflow fix to this code was incomplete and did not take
      the remap_file_pages system call into account.
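
      The kind of check involved can be sketched as follows.  This is a
      minimal illustration rather than the kernel's actual test: it assumes
      4 KiB base pages and a signed 64-bit byte offset, and
      SKETCH_PAGE_SHIFT and pgoff_range_ok() are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12   /* assumption: 4 KiB base pages */

/*
 * Returns true if the byte range starting at (pgoff << PAGE_SHIFT)
 * and spanning len bytes fits in a signed 64-bit file offset.
 */
static bool pgoff_range_ok(uint64_t pgoff, uint64_t len)
{
    const uint64_t max_off = (uint64_t)INT64_MAX;

    /* Reject before shifting so the shift itself cannot overflow. */
    if (pgoff > (max_off >> SKETCH_PAGE_SHIFT))
        return false;

    uint64_t start = pgoff << SKETCH_PAGE_SHIFT;

    /* start <= max_off here, so the subtraction cannot wrap. */
    return len <= max_off - start;
}
```

      With the values from the sequence above, a pgoff of 0x20000000000000
      is rejected by the first test, since shifting it left by 12 bits would
      exceed 64 bits.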
      
      [mike.kravetz@oracle.com: v3]
        Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com
      [akpm@linux-foundation.org: include mmdebug.h]
      [akpm@linux-foundation.org: fix -ve left shift count on sh]
      Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com
      Fixes: 045c7a3f ("hugetlbfs: fix offset overflow in hugetlbfs mmap")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Nic Losby <blurbdust@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lockdep: fix fs_reclaim warning · 2e517d68
      Tetsuo Handa authored
      Dave Jones reported fs_reclaim lockdep warnings.
      
        ============================================
        WARNING: possible recursive locking detected
        4.15.0-rc9-backup-debug+ #1 Not tainted
        --------------------------------------------
        sshd/24800 is trying to acquire lock:
         (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
      
        but task is already holding lock:
         (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(fs_reclaim);
          lock(fs_reclaim);
      
         *** DEADLOCK ***
      
         May be due to missing lock nesting notation
      
        2 locks held by sshd/24800:
         #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000001a069652>] tcp_sendmsg+0x19/0x40
         #1:  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
      
        stack backtrace:
        CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
        Call Trace:
         dump_stack+0xbc/0x13f
         __lock_acquire+0xa09/0x2040
         lock_acquire+0x12e/0x350
         fs_reclaim_acquire.part.102+0x29/0x30
         kmem_cache_alloc+0x3d/0x2c0
         alloc_extent_state+0xa7/0x410
         __clear_extent_bit+0x3ea/0x570
         try_release_extent_mapping+0x21a/0x260
         __btrfs_releasepage+0xb0/0x1c0
         btrfs_releasepage+0x161/0x170
         try_to_release_page+0x162/0x1c0
         shrink_page_list+0x1d5a/0x2fb0
         shrink_inactive_list+0x451/0x940
         shrink_node_memcg.constprop.88+0x4c9/0x5e0
         shrink_node+0x12d/0x260
         try_to_free_pages+0x418/0xaf0
         __alloc_pages_slowpath+0x976/0x1790
         __alloc_pages_nodemask+0x52c/0x5c0
         new_slab+0x374/0x3f0
         ___slab_alloc.constprop.81+0x47e/0x5a0
         __slab_alloc.constprop.80+0x32/0x60
         __kmalloc_track_caller+0x267/0x310
         __kmalloc_reserve.isra.40+0x29/0x80
         __alloc_skb+0xee/0x390
         sk_stream_alloc_skb+0xb8/0x340
         tcp_sendmsg_locked+0x8e6/0x1d30
         tcp_sendmsg+0x27/0x40
         inet_sendmsg+0xd0/0x310
         sock_write_iter+0x17a/0x240
         __vfs_write+0x2ab/0x380
         vfs_write+0xfb/0x260
         SyS_write+0xb6/0x140
         do_syscall_64+0x1e5/0xc05
         entry_SYSCALL64_slow_path+0x25/0x25
      
      This warning is caused by commit d92a8cfc ("locking/lockdep:
      Rework FS_RECLAIM annotation") which replaced the use of
      lockdep_{set,clear}_current_reclaim_state() in __perform_reclaim()
      and lockdep_trace_alloc() in slab_pre_alloc_hook() with
      fs_reclaim_acquire()/ fs_reclaim_release().
      
      Since __kmalloc_reserve() from __alloc_skb() adds __GFP_NOMEMALLOC |
      __GFP_NOWARN to gfp_mask, and all reclaim paths simply propagate
      __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook() is
      trying to grab the 'fake' lock again when __perform_reclaim() already
      grabbed the 'fake' lock.
      
      The
      
        /* this guy won't enter reclaim */
        if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
                return false;
      
      test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
      was added by commit cf40bd16 ("lockdep: annotate reclaim context
      (__GFP_NOFS)").  But that test is outdated because PF_MEMALLOC thread
      won't enter reclaim regardless of __GFP_NOMEMALLOC after commit
      341ce06f ("page allocator: calculate the alloc_flags for allocation
      only once") added the PF_MEMALLOC safeguard (
      
        /* Avoid recursion of direct reclaim */
        if (p->flags & PF_MEMALLOC)
                goto nopage;
      
      in __alloc_pages_slowpath()).
      
      Thus, let's fix the outdated test by removing the __GFP_NOMEMALLOC test
      and allow __need_fs_reclaim() to return false.
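
      The difference between the old and new test can be modelled in
      isolation.  The bit values and function names below are stand-ins
      chosen for this sketch, not the kernel's definitions, and the other
      conditions in __need_fs_reclaim() are left out.

```c
#include <stdbool.h>

/* Stand-in bit values for this sketch only. */
#define SK_PF_MEMALLOC     0x1u
#define SK_GFP_NOMEMALLOC  0x2u

/*
 * Old test: a PF_MEMALLOC task is treated as "won't enter reclaim"
 * only when the allocation does not carry __GFP_NOMEMALLOC.
 */
static bool sk_old_wont_reclaim(unsigned task_flags, unsigned gfp_mask)
{
    return (task_flags & SK_PF_MEMALLOC) && !(gfp_mask & SK_GFP_NOMEMALLOC);
}

/*
 * New test: a PF_MEMALLOC task never enters reclaim, whatever the
 * gfp mask says, so the annotation is always skipped for it.
 */
static bool sk_new_wont_reclaim(unsigned task_flags, unsigned gfp_mask)
{
    (void)gfp_mask;   /* no longer consulted */
    return (task_flags & SK_PF_MEMALLOC) != 0;
}
```

      The case from the report is a PF_MEMALLOC task allocating with
      __GFP_NOMEMALLOC: the old test returns false, so the 'fake' lock is
      taken a second time, while the new test returns true.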
      
      Link: http://lkml.kernel.org/r/201802280650.FJC73911.FOSOMLJVFFQtHO@I-love.SAKURA.ne.jp
      Fixes: d92a8cfc ("locking/lockdep: Rework FS_RECLAIM annotation")
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Tested-by: Dave Jones <davej@codemonkey.org.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • MAINTAINERS: update Mark Fasheh's e-mail · 296cefee
      Mark Fasheh authored
      I'd like to use my personal e-mail for Ocfs2 requests and review.
      
      Link: http://lkml.kernel.org/r/20180311231356.9385-1-mfasheh@versity.com
      Signed-off-by: Mark Fasheh <mark@fasheh.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mempolicy.c: avoid use uninitialized preferred_node · 8970a63e
      Yisheng Xie authored
      Alexander reported a use of uninitialized memory in __mpol_equal(),
      which is caused by incorrect use of preferred_node.
      
      When a mempolicy is in mode MPOL_PREFERRED with the MPOL_F_LOCAL flag,
      it uses numa_node_id() instead of preferred_node.  However,
      __mpol_equal() uses preferred_node without checking whether
      MPOL_F_LOCAL is set.
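
      A minimal sketch of a comparison that respects the flag, assuming a
      simplified policy struct.  struct sk_policy, SK_MPOL_F_LOCAL and
      sk_preferred_equal() are hypothetical names, and the flag value is a
      stand-in rather than the kernel's definition.

```c
#include <stdbool.h>

#define SK_MPOL_F_LOCAL 0x1u   /* stand-in value for this sketch */

/* Simplified stand-in for a MPOL_PREFERRED policy. */
struct sk_policy {
    unsigned flags;
    int preferred_node;   /* meaningful only without SK_MPOL_F_LOCAL */
};

/* Compare two preferred-mode policies without reading an unused field. */
static bool sk_preferred_equal(const struct sk_policy *a,
                               const struct sk_policy *b)
{
    if (a->flags != b->flags)
        return false;
    /* Local policies ignore preferred_node, so do not compare it. */
    if (a->flags & SK_MPOL_F_LOCAL)
        return true;
    return a->preferred_node == b->preferred_node;
}
```

      Two local policies compare equal here even when their preferred_node
      fields hold different leftover values.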
      
      [akpm@linux-foundation.org: slight comment tweak]
      Link: http://lkml.kernel.org/r/4ebee1c2-57f6-bcb8-0e2d-1833d1ee0bb7@huawei.com
      Fixes: fc36b8d3 ("mempolicy: use MPOL_F_LOCAL to Indicate Preferred Local Policy")
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Reported-by: Alexander Potapenko <glider@google.com>
      Tested-by: Alexander Potapenko <glider@google.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dmitriy Vyukov <dvyukov@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 21 Mar, 2018 2 commits
    • Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux · 3215b9d5
      Linus Torvalds authored
      Pull clk fixes from Stephen Boyd:
       "A late collection of fixes for regressions seen this release cycle.
        Normally I send this earlier than now but real life got in the way.
        Things are back to normal now.
      
        There's the normal set of SoC driver fixes: i.MX boot warning, TI
        display clks, allwinner clk ops being wrong (fun), driver probe
        badness on error paths, correctness fix for the new aspeed driver, and
        even a fix for a race condition in the bcm2835 clk driver.
      
        At the core framework level we also got some fixes for the clk phase
        API caching at the wrong time, better handling of the enabled state of
        orphan clks, and a fix for a newly introduced bug in how we handle
        rate calculations for pass-through clks"
      
      * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
        clk: bcm2835: Protect sections updating shared registers
        clk: bcm2835: Fix ana->maskX definitions
        clk: aspeed: Prevent reset if clock is enabled
        clk: aspeed: Fix is_enabled for certain clocks
        clk: qcom: msm8916: Fix return value check in qcom_apcs_msm8916_clk_probe()
        clk: hisilicon: hi3660:Fix potential NULL dereference in hi3660_stub_clk_probe()
        clk: fix determine rate error with pass-through clock
        clk: migrate the count of orphaned clocks at init
        clk: update cached phase to respect the fact when setting phase
        clk: ti: am43xx: add set-rate-parent support for display clkctrl clock
        clk: ti: am33xx: add set-rate-parent support for display clkctrl clock
        clk: ti: clkctrl: add support for CLK_SET_RATE_PARENT flag
        clk: imx51-imx53: Fix UART4/5 registration on i.MX50 and i.MX53
        clk: sunxi-ng: a31: Fix CLK_OUT_* clock ops
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 303851e1
      Linus Torvalds authored
      Pull rdma fixes from Jason Gunthorpe:
       "Not much exciting here, almost entirely syzkaller fixes.
      
        This is going to be an ongoing theme for some time, I think. Both
        Google and Mellanox are now running syzkaller on different parts of
        the user API.
      
        Summary:
      
         - Many bug fixes related to syzkaller from Leon Romanovsky. These are
           still for the mlx driver and ucma interface.
      
         - Fix a situation with port reuse for iWarp, discovered during
           scale-up testing
      
         - Bug fixes for the profile and restrack patches accepted during this
           merge window
      
         - Compile warning cleanups from Arnd, this is apparently the last
           warning to make 32 bit builds quiet"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
        RDMA/ucma: Ensure that CM_ID exists prior to access it
        RDMA/verbs: Remove restrack entry from XRCD structure
        RDMA/ucma: Fix use-after-free access in ucma_close
        RDMA/ucma: Check AF family prior resolving address
        infiniband: bnxt_re: use BIT_ULL() for 64-bit bit masks
        infiniband: qplib_fp: fix pointer cast
        IB/mlx5: Fix cleanup order on unload
        RDMA/ucma: Don't allow join attempts for unsupported AF family
        RDMA/ucma: Fix access to non-initialized CM_ID object
        RDMA/core: Do not use invalid destination in determining port reuse
        RDMA/mlx5: Fix crash while accessing garbage pointer and freed memory
        IB/mlx5: Fix integer overflows in mlx5_ib_create_srq
        IB/mlx5: Fix out-of-bounds read in create_raw_packet_qp_rq
  3. 20 Mar, 2018 4 commits
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 76c0b6a3
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
      
       - one driver patch (qla2xxx) which fixes a problem caused by an
         existing regression fix (FCP discovery is failing)
      
       - one generic fix to a longstanding bug in libsas that eventually
         causes I/O to the device to hang in the face of ATA error recovery.
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: qla2xxx: Remove FC_NO_LOOP_ID for FCP and FC-NVMe Discovery
        scsi: libsas: defer ata device eh commands to libata
    • Merge tag 'nfsd-4.16-1' of git://linux-nfs.org/~bfields/linux · 645102ea
      Linus Torvalds authored
      Pull nfsd fix from Bruce Fields:
       "Just one fix for an occasional panic from Jeff Layton"
      
      * tag 'nfsd-4.16-1' of git://linux-nfs.org/~bfields/linux:
        nfsd: remove blocked locks on client teardown
    • kvm/x86: fix icebp instruction handling · 32d43cd3
      Linus Torvalds authored
      The undocumented 'icebp' instruction (aka 'int1') works pretty much like
      'int3' in the absence of in-circuit probing equipment (except,
      obviously, that it raises #DB instead of raising #BP), and is used by
      some validation test-suites as such.
      
      But Andy Lutomirski noticed that his test suite acted differently in kvm
      than on bare hardware.
      
      The reason is that kvm used an inexact test for the icebp instruction:
      it just assumed that an all-zero VM exit qualification value meant that
      the VM exit was due to icebp.
      
      That is not unlike the guess that do_debug() does for the actual
      exception handling case, but it's purely a heuristic, not an absolute
      rule.  do_debug() does it because it wants to ascribe _some_ reasons to
      the #DB that happened, and an empty %dr6 value means that 'icebp' is the
      most likely cause and we have no better information.
      
      But kvm can just do it right, because unlike the do_debug() case, kvm
      actually sees the real reason for the #DB in the VM-exit interruption
      information field.
      
      So instead of relying on an inexact heuristic, just use the actual VM
      exit information that says "it was 'icebp'".
      
      Right now the 'icebp' instruction isn't technically documented by Intel,
      but that will hopefully change.  The special "privileged software
      exception" information _is_ actually mentioned in the Intel SDM, even
      though the cause of it isn't enumerated.
      Reported-by: Andy Lutomirski <luto@kernel.org>
      Tested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      32d43cd3
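The difference between the old heuristic and the exact check in the icebp commit above can be sketched in a few lines. This is a minimal illustration, not the kernel's code: the constant names mirror the kernel's VMX definitions and the bit layout follows the Intel SDM description of the VM-exit interruption-information field, but they are redefined locally, and the two helper functions are hypothetical.

```python
# Bit layout of the VM-exit interruption-information field (per the Intel SDM).
# Redefined here so the sketch is self-contained; names are illustrative.
INTR_INFO_INTR_TYPE_MASK = 0x700       # bits 10:8 hold the interruption type
INTR_INFO_VALID_MASK = 0x80000000      # bit 31 marks the field as valid

INTR_TYPE_HARD_EXCEPTION = 3 << 8      # ordinary hardware exception
INTR_TYPE_PRIV_SW_EXCEPTION = 5 << 8   # privileged software exception (icebp)
DB_VECTOR = 1                          # vector number of #DB


def is_icebp_heuristic(exit_qualification: int) -> bool:
    """Old, inexact test: an all-zero exit qualification is assumed to mean icebp."""
    return exit_qualification == 0


def is_icebp_exact(intr_info: int) -> bool:
    """Exact test: the field is valid and its type is 'privileged software exception'."""
    mask = INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK
    return (intr_info & mask) == (INTR_TYPE_PRIV_SW_EXCEPTION | INTR_INFO_VALID_MASK)


# A #DB delivered as an ordinary hardware exception:
hw_db = INTR_INFO_VALID_MASK | INTR_TYPE_HARD_EXCEPTION | DB_VECTOR
# A #DB delivered as a privileged software exception, i.e. a real icebp:
icebp_db = INTR_INFO_VALID_MASK | INTR_TYPE_PRIV_SW_EXCEPTION | DB_VECTOR
```

With an empty exit qualification, the heuristic reports icebp for both values, while the exact test accepts only `icebp_db`.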
    • Leon Romanovsky's avatar
      RDMA/ucma: Ensure that CM_ID exists prior to access it · e8980d67
      Leon Romanovsky authored
      Before UCMA commands are accessed, the context should be initialized
      and connected to a CM_ID with ucma_create_id(). If the user skips
      this step, they can provide an invalid ctx without a CM_ID and cause
      multiple NULL dereferences.
      
      There are also situations where create_id can race with other
      user accesses; ensure that the context is only shared with
      other threads once it is fully initialized, to avoid the races.
      
      [  109.088108] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
      [  109.090315] IP: ucma_connect+0x138/0x1d0
      [  109.092595] PGD 80000001dc02d067 P4D 80000001dc02d067 PUD 1da9ef067 PMD 0
      [  109.095384] Oops: 0000 [#1] SMP KASAN PTI
      [  109.097834] CPU: 0 PID: 663 Comm: uclose Tainted: G    B 4.16.0-rc1-00062-g2975d5de #45
      [  109.100816] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
      [  109.105943] RIP: 0010:ucma_connect+0x138/0x1d0
      [  109.108850] RSP: 0018:ffff8801c8567a80 EFLAGS: 00010246
      [  109.111484] RAX: 0000000000000000 RBX: 1ffff100390acf50 RCX: ffffffff9d7812e2
      [  109.114496] RDX: 1ffffffff3f507a5 RSI: 0000000000000297 RDI: 0000000000000297
      [  109.117490] RBP: ffff8801daa15600 R08: 0000000000000000 R09: ffffed00390aceeb
      [  109.120429] R10: 0000000000000001 R11: ffffed00390aceea R12: 0000000000000000
      [  109.123318] R13: 0000000000000120 R14: ffff8801de6459c0 R15: 0000000000000118
      [  109.126221] FS:  00007fabb68d6700(0000) GS:ffff8801e5c00000(0000) knlGS:0000000000000000
      [  109.129468] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  109.132523] CR2: 0000000000000020 CR3: 00000001d45d8003 CR4: 00000000003606b0
      [  109.135573] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  109.138716] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  109.142057] Call Trace:
      [  109.144160]  ? ucma_listen+0x110/0x110
      [  109.146386]  ? wake_up_q+0x59/0x90
      [  109.148853]  ? futex_wake+0x10b/0x2a0
      [  109.151297]  ? save_stack+0x89/0xb0
      [  109.153489]  ? _copy_from_user+0x5e/0x90
      [  109.155500]  ucma_write+0x174/0x1f0
      [  109.157933]  ? ucma_resolve_route+0xf0/0xf0
      [  109.160389]  ? __mod_node_page_state+0x1d/0x80
      [  109.162706]  __vfs_write+0xc4/0x350
      [  109.164911]  ? kernel_read+0xa0/0xa0
      [  109.167121]  ? path_openat+0x1b10/0x1b10
      [  109.169355]  ? fsnotify+0x899/0x8f0
      [  109.171567]  ? fsnotify_unmount_inodes+0x170/0x170
      [  109.174145]  ? __fget+0xa8/0xf0
      [  109.177110]  vfs_write+0xf7/0x280
      [  109.179532]  SyS_write+0xa1/0x120
      [  109.181885]  ? SyS_read+0x120/0x120
      [  109.184482]  ? compat_start_thread+0x60/0x60
      [  109.187124]  ? SyS_read+0x120/0x120
      [  109.189548]  do_syscall_64+0xeb/0x250
      [  109.192178]  entry_SYSCALL_64_after_hwframe+0x21/0x86
      [  109.194725] RIP: 0033:0x7fabb61ebe99
      [  109.197040] RSP: 002b:00007fabb68d5e98 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
      [  109.200294] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fabb61ebe99
      [  109.203399] RDX: 0000000000000120 RSI: 00000000200001c0 RDI: 0000000000000004
      [  109.206548] RBP: 00007fabb68d5ec0 R08: 0000000000000000 R09: 0000000000000000
      [  109.209902] R10: 0000000000000000 R11: 0000000000000202 R12: 00007fabb68d5fc0
      [  109.213327] R13: 0000000000000000 R14: 00007fff40ab2430 R15: 00007fabb68d69c0
      [  109.216613] Code: 88 44 24 2c 0f b6 84 24 6e 01 00 00 88 44 24 2d 0f
      b6 84 24 69 01 00 00 88 44 24 2e 8b 44 24 60 89 44 24 30 e8 da f6 06 ff
      31 c0 <66> 41 83 7c 24 20 1b 75 04 8b 44 24 64 48 8d 74 24 20 4c 89 e7
      [  109.223602] RIP: ucma_connect+0x138/0x1d0 RSP: ffff8801c8567a80
      [  109.226256] CR2: 0000000000000020
      
      Fixes: 75216638 ("RDMA/cma: Export rdma cm interface to userspace")
      Reported-by: <syzbot+36712f50b0552615bf59@syzkaller.appspotmail.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      e8980d67
  4. 19 Mar, 2018 16 commits
    • Linus Torvalds's avatar
      Merge branch 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 1b5f3ba4
      Linus Torvalds authored
      Pull cgroup fixes from Tejun Heo:
       "Two commits to fix the following subtle cgroup2 behavior bugs:
      
         - cpu.max was rejecting config when it shouldn't
      
         - thread mode enable was allowed when it shouldn't"
      
      * 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cgroup: fix rule checking for threaded mode switching
        sched, cgroup: Don't reject lower cpu.max on ancestors
      1b5f3ba4
    • Linus Torvalds's avatar
      Merge branch 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq · c6256ca9
      Linus Torvalds authored
      Pull workqueue fixes from Tejun Heo:
       "Two low-impact workqueue commits.
      
        One fixes workqueue creation error path and the other removes the
        unused cancel_work()"
      
      * 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
        workqueue: remove unused cancel_work()
        workqueue: use put_device() instead of kfree()
      c6256ca9
    • Linus Torvalds's avatar
      Merge branch 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu · 0d707a2f
      Linus Torvalds authored
      Pull percpu fixes from Tejun Heo:
       "Late percpu pull request for v4.16-rc6.
      
         - percpu allocator pool replenishing no longer triggers OOM or
           warning messages.
      
           Also, the alloc interface now understands __GFP_NORETRY and
           __GFP_NOWARN. This is to allow avoiding OOMs from userland
           triggered actions like bpf map creation.
      
           Also added cond_resched() in alloc loop.
      
         - percpu allocation now can be interrupted by kill sigs to avoid
           deadlocking OOM killer.
      
         - Added Dennis Zhou as a co-maintainer.
      
           He has rewritten the area map allocator, understands most of the
           code base and has been responsive for all bug reports"
      
      * 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
        percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods
        mm: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
        percpu: include linux/sched.h for cond_resched()
        percpu: add a schedule point in pcpu_balance_workfn()
        percpu: allow select gfp to be passed to underlying allocators
        percpu: add __GFP_NORETRY semantics to the percpu balancing path
        percpu: match chunk allocator declarations with definitions
        percpu: add Dennis Zhou as a percpu co-maintainer
      0d707a2f
    • Linus Torvalds's avatar
      Merge branch 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata · efac2483
      Linus Torvalds authored
      Pull libata fixes from Tejun Heo:
       "I sat on them too long and it's quite a few this late, but nothing has
        a wide blast area. The changes are...
      
         - Fix corner cases in SG command handling.
      
         - Recent introduction of default powersaving mode config option
           exposed several devices with broken powersaving behaviors. A number
           of patches to update the blacklist accordingly.
      
         - Fix a kernel panic on SAS hotplug.
      
         - Other misc and device specific updates"
      
      * 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
        libata: Modify quirks for MX100 to limit NCQ_TRIM quirk to MU01 version
        libata: Make Crucial BX100 500GB LPM quirk apply to all firmware versions
        libata: Apply NOLPM quirk to Crucial M500 480 and 960GB SSDs
        libata: Enable queued TRIM for Samsung SSD 860
        PCI: Add function 1 DMA alias quirk for Highpoint RocketRAID 644L
        ahci: Add PCI-id for the Highpoint Rocketraid 644L card
        ata: do not schedule hot plug if it is a sas host
        libata: disable LPM for Crucial BX100 SSD 500GB drive
        libata: Apply NOLPM quirk to Crucial MX100 512GB SSDs
        libata: update documentation for sysfs interfaces
        ata: sata_rcar: Remove unused variable in sata_rcar_init_controller()
        libata: transport: cleanup documentation of sysfs interface
        sata_rcar: Reset SATA PHY when Salvator-X board resumes
        libata: don't try to pass through NCQ commands to non-NCQ devices
        libata: remove WARN() for DMA or PIO command without data
        libata: fix length validation of ATAPI-relayed SCSI commands
        ata: libahci: fix comment indentation
        ahci: Add check for device presence (PCIe hot unplug) in ahci_stop_engine()
        libata: Fix compile warning with ATA_DEBUG enabled
      efac2483
    • Jeff Layton's avatar
      nfsd: remove blocked locks on client teardown · 68ef3bc3
      Jeff Layton authored
      We had some reports of panics in nfsd4_lm_notify, and that showed a
      nfs4_lockowner that had outlived its so_client.
      
      Ensure that we walk any leftover lockowners after tearing down all of
      the stateids, and remove any blocked locks that they hold.
      
      With this change, we also don't need to walk the nbl_lru on nfsd_net
      shutdown, as that will happen naturally when we tear down the clients.
      
      Fixes: 76d348fa (nfsd: have nfsd4_lock use blocking locks for v4.1+ locks)
      Reported-by: Frank Sorenson <fsorenso@redhat.com>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Cc: stable@vger.kernel.org # 4.9
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      68ef3bc3
    • Leon Romanovsky's avatar
      RDMA/verbs: Remove restrack entry from XRCD structure · 80cf79ae
      Leon Romanovsky authored
      The XRCD object is not implemented in the restrack, so let's remove it.
      
      Fixes: 02d8883f ("RDMA/restrack: Add general infrastructure to track RDMA resources")
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      80cf79ae
    • Leon Romanovsky's avatar
      RDMA/ucma: Fix use-after-free access in ucma_close · ed65a4dc
      Leon Romanovsky authored
      An error in ucma_create_id() left ctx in the list of contexts belonging
      to the ucma file descriptor. An attempt to close this file descriptor
      causes use-after-free accesses while iterating over that list.
      
      Fixes: 75216638 ("RDMA/cma: Export rdma cm interface to userspace")
      Reported-by: <syzbot+dcfd344365a56fbebd0f@syzkaller.appspotmail.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Reviewed-by: Sean Hefty <sean.hefty@intel.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      ed65a4dc
    • Tejun Heo's avatar
      percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods · b3a5d111
      Tejun Heo authored
      percpu_ref internally uses sched-RCU to implement the percpu -> atomic
      mode switching and the documentation suggested that this could be
      depended upon.  This doesn't seem like a good idea.
      
      * percpu_ref uses sched-RCU which has different grace periods from regular
        RCU.  Users may combine percpu_ref with regular RCU usage and
        incorrectly believe that regular RCU grace periods are performed by
        percpu_ref.  This can lead to, for example, use-after-free due to
        premature freeing.
      
      * percpu_ref has a grace period when switching from percpu to atomic
        mode.  It doesn't have one between the last put and release.  This
        distinction is subtle and can lead to surprising bugs.
      
      * percpu_ref allows starting in and switching to atomic mode manually
        for debugging and other purposes.  This means that there may not be
        any grace periods from kill to release.
      
      This patch makes it clear that the grace periods are percpu_ref's
      internal implementation detail and can't be depended upon by the
      users.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b3a5d111
    • Kirill Tkhai's avatar
      mm: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn() · f52ba1fe
      Kirill Tkhai authored
      In case of memory deficit and low percpu memory pages,
      pcpu_balance_workfn() takes pcpu_alloc_mutex for a long
      time (as it makes memory allocations itself and waits
      for memory reclaim). If tasks doing pcpu_alloc() are
      chosen by the OOM killer, they can't exit, because they
      are waiting for the mutex.
      
      The patch makes pcpu_alloc() honor a kill signal by
      using mutex_lock_killable() when the GFP flags allow
      it. This guarantees that a task does not miss SIGKILL
      from the OOM killer.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      f52ba1fe
    • Tejun Heo's avatar
      percpu: include linux/sched.h for cond_resched() · 71546d10
      Tejun Heo authored
      microblaze build broke due to missing declaration of the
      cond_resched() invocation added recently.  Let's include linux/sched.h
      explicitly.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      71546d10
    • Boris Brezillon's avatar
      clk: bcm2835: Protect sections updating shared registers · 7997f3b2
      Boris Brezillon authored
      CM_PLLx and A2W_XOSC_CTRL registers are accessed by different clock
      handlers and must be accessed with ->regs_lock held.
      Update the sections where this protection is missing.
      
      Fixes: 41691b88 ("clk: bcm2835: Add support for programming the audio domain clocks")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
      Reviewed-by: Eric Anholt <eric@anholt.net>
      Signed-off-by: Stephen Boyd <sboyd@kernel.org>
      7997f3b2
    • Boris Brezillon's avatar
      clk: bcm2835: Fix ana->maskX definitions · 49012d1b
      Boris Brezillon authored
      ana->maskX values are already '~'-ed in bcm2835_pll_set_rate(). Remove
      the '~' in the definition to fix ANA setup.
      
      Note that this commit fixes a long standing bug preventing one from
      using an HDMI display if it's plugged after the FW has booted Linux.
      This is because PLLH is used by the HDMI encoder to generate the pixel
      clock.
      
      Fixes: 41691b88 ("clk: bcm2835: Add support for programming the audio domain clocks")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
      Reviewed-by: Eric Anholt <eric@anholt.net>
      Signed-off-by: Stephen Boyd <sboyd@kernel.org>
      49012d1b
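The double inversion described in the ana->maskX commit above is easy to see with plain bit arithmetic. The sketch below is only an illustration under assumptions: the field position, register value and the `update_field` helper are hypothetical and do not come from the bcm2835 driver.

```python
MASK_32 = 0xFFFFFFFF       # registers are modelled as 32-bit values
FIELD_MASK = 0x00000F00    # hypothetical field occupying bits 11:8


def update_field(reg: int, mask: int, value: int) -> int:
    """Clear the bits selected by `mask`, then OR in `value`.

    The update inverts the mask itself, as the commit says the rate-setting
    code does, so callers must pass the mask un-inverted.
    """
    return ((reg & ~mask) | value) & MASK_32


reg = 0xABCD1234           # hypothetical current register contents
new_field = 0x00000500     # value to program into bits 11:8

# Correct definition: the mask selects the field and is inverted exactly once.
good = update_field(reg, FIELD_MASK, new_field)

# Buggy definition: the mask is stored pre-inverted, so it is inverted twice.
# Everything outside the field is cleared and the field keeps its stale bits.
bad = update_field(reg, ~FIELD_MASK & MASK_32, new_field)
```

Here `good` is `0xABCD1534` (only the field changed), while `bad` is `0x00000700` (the rest of the register is wiped and the old field bits are ORed with the new ones).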
    • Hans de Goede's avatar
      libata: Modify quirks for MX100 to limit NCQ_TRIM quirk to MU01 version · d418ff56
      Hans de Goede authored
      When commit 9c7be59f ("libata: Apply NOLPM quirk to Crucial MX100
      512GB SSDs") was added it inherited the ATA_HORKAGE_NO_NCQ_TRIM quirk
      from the existing "Crucial_CT*MX100*" entry, but that entry sets model_rev
      to "MU01", whereas the entry adding the NOLPM quirk sets it to NULL.
      
      This means that after this commit we now apply the NO_NCQ_TRIM quirk to
      all "Crucial_CT512MX100*" SSDs even if they have the fixed "MU02"
      firmware. This commit splits the "Crucial_CT512MX100*" quirk into 2
      quirks, one for the "MU01" firmware and one for all other firmware
      versions, so that we once again only apply the NO_NCQ_TRIM quirk to the
      "MU01" firmware version.
      
      Fixes: 9c7be59f ("libata: Apply NOLPM quirk to ... MX100 512GB SSDs")
      Cc: stable@vger.kernel.org
      Signed-off-by: Hans de Goede <hdegoede@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      d418ff56
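The effect of the MX100 quirk split above can be modelled with a small lookup table. This is a simplified sketch, assuming that the first matching entry wins and that a firmware revision of `None` matches any version; the flag names, table layout and `lookup_quirks` helper are hypothetical stand-ins, and other quirk bits the real entries may carry are omitted.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-ins for the ATA_HORKAGE_* quirk bits.
NO_NCQ_TRIM = "NO_NCQ_TRIM"
NOLPM = "NOLPM"

# Each entry: (model glob, firmware revision or None for "any", quirks).
# Before the fix: the 512GB entry matches every firmware version.
BROKEN_TABLE = [
    ("Crucial_CT512MX100*", None, {NO_NCQ_TRIM, NOLPM}),
    ("Crucial_CT*MX100*", "MU01", {NO_NCQ_TRIM}),
]

# After the fix: the 512GB entry is split by firmware version.
FIXED_TABLE = [
    ("Crucial_CT512MX100*", "MU01", {NO_NCQ_TRIM, NOLPM}),
    ("Crucial_CT512MX100*", None, {NOLPM}),
    ("Crucial_CT*MX100*", "MU01", {NO_NCQ_TRIM}),
]


def lookup_quirks(table, model, firmware):
    """Return the quirks of the first entry matching model and firmware."""
    for pattern, rev, quirks in table:
        if not fnmatchcase(model, pattern):
            continue
        if rev is None or fnmatchcase(firmware, rev):
            return quirks
    return set()
```

For a hypothetical model string `Crucial_CT512MX100SSD1` with "MU02" firmware, the broken table yields both quirks, while the fixed table yields only `NOLPM`; with "MU01" firmware both tables still apply `NO_NCQ_TRIM`.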
    • Hans de Goede's avatar
      libata: Make Crucial BX100 500GB LPM quirk apply to all firmware versions · 3bf7b5d6
      Hans de Goede authored
      Commit b17e5729 ("libata: disable LPM for Crucial BX100 SSD 500GB
      drive") introduced an ATA_HORKAGE_NOLPM quirk for Crucial BX100 500GB SSDs
      but limited this to the MU02 firmware version. According to:
      http://www.crucial.com/usa/en/support-ssd-firmware
      
      MU02 is the last version, so there are no newer possibly fixed versions
      and if the MU02 version has broken LPM then the MU01 almost certainly
      also has broken LPM, so this commit changes the quirk to apply to all
      firmware versions.
      
      Fixes: b17e5729 ("libata: disable LPM for Crucial BX100 SSD 500GB...")
      Cc: stable@vger.kernel.org
      Cc: Kai-Heng Feng <kai.heng.feng@canonical.com>
      Signed-off-by: Hans de Goede <hdegoede@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      3bf7b5d6
    • Hans de Goede's avatar
      libata: Apply NOLPM quirk to Crucial M500 480 and 960GB SSDs · 62ac3f73
      Hans de Goede authored
      There have been reports of the Crucial M500 480GB model not working
      with LPM set to min_power / med_power_with_dipm level.
      
      It has not been tested with medium_power, but that typically has no
      measurable power-savings.
      
      Note the reporter's Crucial_CT480M500SSD3 has a firmware version of MU03
      and there is a MU05 update available, but that update does not mention any
      LPM fixes in its changelog, so the quirk matches all firmware versions.
      
      In my experience the LPM problems with (older) Crucial SSDs seem to be
      limited to higher capacity versions of the SSDs (different firmware?),
      so this commit adds a NOLPM quirk for the 480 and 960GB versions of the
      M500, to avoid LPM causing issues with these SSDs.
      
      Cc: stable@vger.kernel.org
      Reported-and-tested-by: Martin Steigerwald <martin@lichtvoll.de>
      Signed-off-by: Hans de Goede <hdegoede@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      62ac3f73
    • Linus Torvalds's avatar
      Linux 4.16-rc6 · c698ca52
      Linus Torvalds authored
      c698ca52
  5. 18 Mar, 2018 5 commits
    • Linus Torvalds's avatar
      Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 9e1909b9
      Linus Torvalds authored
      Pull x86/pti updates from Thomas Gleixner:
       "Another set of melted spectrum updates:
      
         - Iron out the last late microcode loading issues by actually
           checking whether new microcode is present and preventing the CPU
           synchronization from running into a timeout-induced hang.
      
         - Remove Skylake C2 from the microcode blacklist according to the
           latest Intel documentation
      
         - Fix the VM86 POPF emulation which traps if VIP is set, but VIF is
           not. Enhance the selftests to catch that kind of issue
      
         - Annotate indirect calls/jumps for objtool on 32bit. This is not a
           functional issue, but for consistency's sake it's the right thing to
           do.
      
         - Fix a jump label build warning observed on SPARC64 which uses 32bit
           storage for the code location which is cast to a 64bit pointer w/o
           extending it to 64bit first.
      
         - Add two new cpufeature bits. Not really an urgent issue, but
           provides them for both x86 and x86/kvm work. No impact on the
           current kernel"
      
      * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/microcode: Fix CPU synchronization routine
        x86/microcode: Attempt late loading only when new microcode is present
        x86/speculation: Remove Skylake C2 from Speculation Control microcode blacklist
        jump_label: Fix sparc64 warning
        x86/speculation, objtool: Annotate indirect calls/jumps for objtool on 32-bit kernels
        x86/vm86/32: Fix POPF emulation
        selftests/x86/entry_from_vm86: Add test cases for POPF
        selftests/x86/entry_from_vm86: Exit with 1 if we fail
        x86/cpufeatures: Add Intel PCONFIG cpufeature
        x86/cpufeatures: Add Intel Total Memory Encryption cpufeature
      9e1909b9
    • Linus Torvalds's avatar
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · df4fe178
      Linus Torvalds authored
      Pull x86 fix from Thomas Gleixner:
       "A single fix for vmalloc_fault() which uses p*d_huge() unconditionally
        whether CONFIG_HUGETLBFS is set or not. In case of CONFIG_HUGETLBFS=n
        this results in a crash as p*d_huge() returns 0 in that case"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/mm: Fix vmalloc_fault to use pXd_large
      df4fe178
    • Linus Torvalds's avatar
      Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · d2149e13
      Linus Torvalds authored
      Pull irq fixes from Thomas Gleixner:
       "Three fixes for irq chip drivers:
      
         - Make sure the allocations in the GIC-V3 ITS driver are large enough
           to accommodate the interrupt space
      
         - Fix a misplaced __iomem annotation which causes a splat of 26
           sparse warnings
      
         - Remove an unused function in the IMX GPCV2 driver which causes
           build warnings"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip/irq-imx-gpcv2: Remove unused function
        irqchip/gic-v3-its: Ensure nr_ites >= nr_lpis
        irqchip/gic-v3-its: Fix misplaced __iomem annotations
      d2149e13
    • Linus Torvalds's avatar
      Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 23fe85ae
      Linus Torvalds authored
      Pull EFI fix from Thomas Gleixner:
       "A single fix to prevent partially initialized pointers in mixed mode
        (64bit kernel on 32bit UEFI)"
      
      * 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        efi/libstub/tpm: Initialize pointer variables to zero for mixed mode
      23fe85ae
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 3cd1d327
      Linus Torvalds authored
      Pull KVM fixes from Paolo Bonzini:
       "PPC:
         - fix bug leading to lost IPIs and smp_call_function_many() lockups
           on POWER9
      
        ARM:
         - locking fix
         - reset fix
         - GICv2 multi-source SGI injection fix
         - GICv2-on-v3 MMIO synchronization fix
         - make the console less verbose.
      
        x86:
         - fix device passthrough on AMD SME"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: x86: Fix device passthrough when SME is active
        kvm: arm/arm64: vgic-v3: Tighten synchronization for guests using v2 on v3
        KVM: arm/arm64: vgic: Don't populate multiple LRs with the same vintid
        KVM: arm/arm64: Reduce verbosity of KVM init log
        KVM: arm/arm64: Reset mapped IRQs on VM reset
        KVM: arm/arm64: Avoid vcpu_load for other vcpu ioctls than KVM_RUN
        KVM: arm/arm64: vgic: Add missing irq_lock to vgic_mmio_read_pending
        KVM: PPC: Book3S HV: Fix trap number return from __kvmppc_vcore_entry
      3cd1d327