  1. 14 Jun, 2019 13 commits
    • PCI/P2PDMA: fix the gen_pool_add_virt() failure path · e615a191
      Dan Williams authored
      The pci_p2pdma_add_resource() implementation immediately frees the pgmap
      if gen_pool_add_virt() fails.  However, that means that when @dev
      triggers a devres release devm_memremap_pages_release() will crash
      trying to access the freed @pgmap.
      
      Use the new devm_memunmap_pages() to manually free the mapping in the
      error path.
      
      Link: http://lkml.kernel.org/r/155727337603.292046.13101332703665246702.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Fixes: 52916982 ("PCI/P2PDMA: Support peer-to-peer memory")
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/devm_memremap_pages: introduce devm_memunmap_pages · 2e3f139e
      Dan Williams authored
      Use the new devm_release_action() facility to allow
      devm_memremap_pages_release() to be manually triggered.
      
      Link: http://lkml.kernel.org/r/155727337088.292046.5774214552136776763.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/base/devres: introduce devm_release_action() · 2374b682
      Dan Williams authored
      Patch series "mm/devm_memremap_pages: Fix page release race", v2.
      
      Logan audited the devm_memremap_pages() shutdown path and noticed that
      it was possible to proceed to arch_remove_memory() before all potential
      page references have been reaped.
      
      Introduce a new ->cleanup() callback to do the work of waiting for any
      straggling page references and then perform the percpu_ref_exit() in
      devm_memremap_pages_release() context.
      
      For p2pdma this involves some deeper reworks to reference count
      resources on a per-instance basis rather than a per pci-device basis.  A
      modified genalloc api is introduced to convey a driver-private pointer
      through gen_pool_{alloc,free}() interfaces.  Also, a
      devm_memunmap_pages() api is introduced since p2pdma does not
      auto-release resources on a setup failure.
      
      The dax and pmem changes pass the nvdimm unit tests, and the p2pdma
      changes should now pass testing with the pci_p2pdma_release() fix.
      Jérôme, how does this look for HMM?
      
      This patch (of 6):
      
      The devm_add_action() facility allows a resource allocation routine to
      add custom devm semantics.  One such user is devm_memremap_pages().
      
      There is now a need to manually trigger
      devm_memremap_pages_release().  Introduce devm_release_action() so the
      release action can be triggered via a new devm_memunmap_pages() api in a
      follow-on change.
      
      Link: http://lkml.kernel.org/r/155727336530.292046.2926860263201336366.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
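
      The idea of a release action that can either fire at device teardown
      or be triggered by hand can be sketched in userspace.  This is only a
      toy model: struct toy_dev and the toy_* helpers are hypothetical names,
      not the devres implementation.  It assumes that a manual release both
      unregisters the action and runs it, so teardown cannot run it twice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for one registered release action. */
struct action {
	void (*fn)(void *);
	void *data;
	struct action *next;
};

/* Hypothetical stand-in for a device that owns a list of actions. */
struct toy_dev {
	struct action *actions;	/* most recently added first */
};

/* Register fn(data) to run when the device is torn down. */
static int toy_add_action(struct toy_dev *dev, void (*fn)(void *), void *data)
{
	struct action *a = malloc(sizeof(*a));

	if (!a)
		return -1;
	a->fn = fn;
	a->data = data;
	a->next = dev->actions;
	dev->actions = a;
	return 0;
}

/* Unregister a matching action and run it right away; 0 if it was found. */
static int toy_release_action(struct toy_dev *dev, void (*fn)(void *), void *data)
{
	struct action **pp;

	for (pp = &dev->actions; *pp; pp = &(*pp)->next) {
		struct action *a = *pp;

		if (a->fn == fn && a->data == data) {
			*pp = a->next;
			a->fn(a->data);
			free(a);
			return 0;
		}
	}
	return -1;
}

/* Device teardown: run whatever is still registered, newest first. */
static void toy_release_all(struct toy_dev *dev)
{
	while (dev->actions) {
		struct action *a = dev->actions;

		dev->actions = a->next;
		a->fn(a->data);
		free(a);
	}
}

/* Example release callback: counts how often it ran. */
static void count_release(void *data)
{
	++*(int *)data;
}
```

      A caller would register count_release() with toy_add_action(), call
      toy_release_action() on its error path, and later let
      toy_release_all() run; the callback then fires exactly once.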
    • mm/vmscan.c: fix trying to reclaim unevictable LRU page · a58f2cef
      Minchan Kim authored
      There was the below bug report from Wu Fangsuo.
      
      On the CMA allocation path, isolate_migratepages_range() could isolate
      unevictable LRU pages and reclaim_clean_pages_from_list() can try to
      reclaim them if they are clean file-backed pages.
      
        page:ffffffbf02f33b40 count:86 mapcount:84 mapping:ffffffc08fa7a810 index:0x24
        flags: 0x19040c(referenced|uptodate|arch_1|mappedtodisk|unevictable|mlocked)
        raw: 000000000019040c ffffffc08fa7a810 0000000000000024 0000005600000053
        raw: ffffffc009b05b20 ffffffc009b05b20 0000000000000000 ffffffc09bf3ee80
        page dumped because: VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page))
        page->mem_cgroup:ffffffc09bf3ee80
        ------------[ cut here ]------------
        kernel BUG at /home/build/farmland/adroid9.0/kernel/linux/mm/vmscan.c:1350!
        Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 7125 Comm: syz-executor Tainted: G S              4.14.81 #3
        Hardware name: ASR AQUILAC EVB (DT)
        task: ffffffc00a54cd00 task.stack: ffffffc009b00000
        PC is at shrink_page_list+0x1998/0x3240
        LR is at shrink_page_list+0x1998/0x3240
        pc : [<ffffff90083a2158>] lr : [<ffffff90083a2158>] pstate: 60400045
        sp : ffffffc009b05940
        ..
           shrink_page_list+0x1998/0x3240
           reclaim_clean_pages_from_list+0x3c0/0x4f0
           alloc_contig_range+0x3bc/0x650
           cma_alloc+0x214/0x668
           ion_cma_allocate+0x98/0x1d8
           ion_alloc+0x200/0x7e0
           ion_ioctl+0x18c/0x378
           do_vfs_ioctl+0x17c/0x1780
           SyS_ioctl+0xac/0xc0
      
      Wu found it's due to commit ad6b6704 ("mm: remove SWAP_MLOCK in
      ttu").  Before that commit, unevictable pages went to cull_mlocked, so
      the VM_BUG_ON_PAGE line could not be reached.
      
      To fix the issue, this patch filters out unevictable LRU pages from the
      reclaim_clean_pages_from_list in CMA.
      
      Link: http://lkml.kernel.org/r/20190524071114.74202-1-minchan@kernel.org
      Fixes: ad6b6704 ("mm: remove SWAP_MLOCK in ttu")
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Wu Fangsuo <fangsuowu@asrmicro.com>
      Debugged-by: Wu Fangsuo <fangsuowu@asrmicro.com>
      Tested-by: Wu Fangsuo <fangsuowu@asrmicro.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Pankaj Suryawanshi <pankaj.suryawanshi@einfochips.com>
      Cc: <stable@vger.kernel.org>	[4.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
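
      The filter described above can be illustrated with a small userspace
      sketch.  struct toy_page and its three flags are hypothetical
      simplifications, not the kernel's struct page; the point is only that
      an unevictable page is excluded even when it is clean and file-backed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified page descriptor; not the kernel's struct page. */
struct toy_page {
	bool file_backed;
	bool dirty;
	bool unevictable;
};

/* Only clean, file-backed, evictable pages may be handed to reclaim. */
static bool toy_can_reclaim_clean(const struct toy_page *page)
{
	return page->file_backed && !page->dirty && !page->unevictable;
}

/* Count the isolated pages that pass the filter. */
static size_t toy_count_reclaimable(const struct toy_page *pages, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (toy_can_reclaim_clean(&pages[i]))
			count++;
	return count;
}
```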
    • coredump: fix race condition between collapse_huge_page() and core dumping · 59ea6d06
      Andrea Arcangeli authored
      When fixing the race conditions between the coredump and the mmap_sem
      holders outside the context of the process, we focused on
      mmget_not_zero()/get_task_mm() callers in 04f5866e ("coredump: fix
      race condition between mmget_not_zero()/get_task_mm() and core
      dumping"), but those aren't the only cases where the mmap_sem can be
      taken outside of the context of the process as Michal Hocko noticed
      while backporting that commit to older -stable kernels.
      
      If mmgrab() is called in the context of the process, but then the
      mm_count reference is transferred outside the context of the process,
      that can also be a problem if the mmap_sem has to be taken for writing
      through that mm_count reference.
      
      khugepaged registration calls mmgrab() in the context of the process,
      but the mmap_sem for writing is taken later in the context of the
      khugepaged kernel thread.
      
      collapse_huge_page() after taking the mmap_sem for writing doesn't
      modify any vma, so it's not obvious that it could cause a problem to the
      coredump, but it happens to modify the pmd in a way that breaks an
      invariant that pmd_trans_huge_lock() relies upon.  collapse_huge_page()
      needs the mmap_sem for writing just to block concurrent page faults that
      call pmd_trans_huge_lock().
      
      Specifically the invariant that "!pmd_trans_huge()" cannot become a
      "pmd_trans_huge()" doesn't hold while collapse_huge_page() runs.
      
      The coredump will call __get_user_pages() without mmap_sem for reading,
      which eventually can invoke a lockless page fault which will need a
      functional pmd_trans_huge_lock().
      
      So collapse_huge_page() needs to use mmget_still_valid() to check it's
      not running concurrently with the coredump...  as long as the coredump
      can invoke page faults without holding the mmap_sem for reading.
      
      This has "Fixes: khugepaged" to facilitate backporting, but in my view
      it's more a bug in the coredump code that will eventually have to be
      rewritten to stop invoking page faults without the mmap_sem for reading.
      So the long term plan is still to drop all mmget_still_valid().
      
      Link: http://lkml.kernel.org/r/20190607161558.32104-1-aarcange@redhat.com
      Fixes: ba76149f ("thp: khugepaged")
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
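
      The guard described above amounts to "check that no core dump is in
      progress before touching the page tables".  A minimal sketch of that
      shape, with hypothetical names (struct toy_mm, core_dumping) and no
      real locking, assuming the check happens right after the write lock
      would have been taken:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an address space. */
struct toy_mm {
	bool core_dumping;	/* set once a core dump has started */
	int collapsed;		/* number of collapses performed */
};

/* Models the validity check: false while a core dump is running. */
static bool toy_mm_still_valid(const struct toy_mm *mm)
{
	return !mm->core_dumping;
}

/* Returns 0 if the collapse ran, -1 if it was skipped. */
static int toy_collapse(struct toy_mm *mm)
{
	/* The write lock would be taken here in real code. */
	if (!toy_mm_still_valid(mm))
		return -1;	/* back off: the dumper may fault locklessly */
	mm->collapsed++;
	return 0;
}
```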
    • mm/mlock.c: change count_mm_mlocked_page_nr return type · 0874bb49
      swkhack authored
      On a 64-bit machine the value of "vma->vm_end - vma->vm_start" may
      overflow a 32-bit int and turn negative, in which case the result of
      "count >> PAGE_SHIFT" is wrong.  Change the local variable and the
      return value to unsigned long to fix the problem.
      
      Link: http://lkml.kernel.org/r/20190513023701.83056-1-swkhack@gmail.com
      Fixes: 0cf2f6f6 ("mm: mlock: check against vma for actual mlock() size")
      Signed-off-by: swkhack <swkhack@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
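
      The arithmetic problem can be reproduced in userspace.  The sketch
      below assumes 4 KiB pages (TOY_PAGE_SHIFT is a hypothetical constant)
      and compares a byte count squeezed into an int with one kept in a
      64-bit unsigned type.  Converting an out-of-range value to int is
      implementation-defined in C, so the exact wrong result may vary.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Buggy shape: the byte count is squeezed into an int before shifting. */
static long toy_pages_int(uint64_t start, uint64_t end)
{
	int count = (int)(end - start);	/* out of range for spans >= 2 GiB */

	return count >> TOY_PAGE_SHIFT;
}

/* Fixed shape: the byte count keeps its full width. */
static uint64_t toy_pages_ulong(uint64_t start, uint64_t end)
{
	uint64_t count = end - start;

	return count >> TOY_PAGE_SHIFT;
}
```

      For small spans both helpers agree; for a 3 GiB span only
      toy_pages_ulong() returns the real page count.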
    • mm: mmu_gather: remove __tlb_reset_range() for force flush · 7a30df49
      Yang Shi authored
      A few new fields were added to mmu_gather to make TLB flushing smarter
      for huge pages by recording which level of the page table has changed.
      
      __tlb_reset_range() resets all of this page table state to unchanged.
      It is called by the TLB flush when mapping changes for the same range run
      in parallel under a non-exclusive lock (i.e.  read mmap_sem).
      
      Before commit dd2283f2 ("mm: mmap: zap pages with read mmap_sem in
      munmap"), the syscalls (e.g.  MADV_DONTNEED, MADV_FREE) which may update
      PTEs in parallel didn't remove page tables.  But the aforementioned
      commit may do munmap() under read mmap_sem and free page tables.  This
      may result in a program hang on aarch64, as reported by Jan Stancek.  The
      problem can be reproduced with his test program, slightly modified,
      below.
      
      ---8<---
      
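      #include <pthread.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/mman.h>
      
      /* Assumed value: the original definition is not shown in this message. */
      #define DISTANT_MMAP_SIZE (64UL * 1024 * 1024)
      
      static int map_size = 4096;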
      static int num_iter = 500;
      static long threads_total;
      
      static void *distant_area;
      
      void *map_write_unmap(void *ptr)
      {
      	int *fd = ptr;
      	unsigned char *map_address;
      	int i, j = 0;
      
      	for (i = 0; i < num_iter; i++) {
      		map_address = mmap(distant_area, (size_t) map_size, PROT_WRITE | PROT_READ,
      			MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      		if (map_address == MAP_FAILED) {
      			perror("mmap");
      			exit(1);
      		}
      
      		for (j = 0; j < map_size; j++)
      			map_address[j] = 'b';
      
      		if (munmap(map_address, map_size) == -1) {
      			perror("munmap");
      			exit(1);
      		}
      	}
      
      	return NULL;
      }
      
      void *dummy(void *ptr)
      {
      	return NULL;
      }
      
      int main(void)
      {
      	pthread_t thid[2];
      
      	/* hint for mmap in map_write_unmap() */
      	distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ,
      			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
      	munmap(distant_area, (size_t)DISTANT_MMAP_SIZE);
      	distant_area += DISTANT_MMAP_SIZE / 2;
      
      	while (1) {
      		pthread_create(&thid[0], NULL, map_write_unmap, NULL);
      		pthread_create(&thid[1], NULL, dummy, NULL);
      
      		pthread_join(thid[0], NULL);
      		pthread_join(thid[1], NULL);
      	}
      }
      ---8<---
      
      The program may bring in parallel execution like below:
      
              t1                                        t2
      munmap(map_address)
        downgrade_write(&mm->mmap_sem);
        unmap_region()
        tlb_gather_mmu()
          inc_tlb_flush_pending(tlb->mm);
        free_pgtables()
          tlb->freed_tables = 1
          tlb->cleared_pmds = 1
      
                                              pthread_exit()
                                              madvise(thread_stack, 8M, MADV_DONTNEED)
                                                zap_page_range()
                                                  tlb_gather_mmu()
                                                    inc_tlb_flush_pending(tlb->mm);
      
        tlb_finish_mmu()
          if (mm_tlb_flush_nested(tlb->mm))
            __tlb_reset_range()
      
      __tlb_reset_range() would reset the freed_tables and cleared_* bits, but
      this causes inconsistency for munmap(), which does free page tables.  As
      a result, some architectures, e.g.  aarch64, may not flush the TLB
      completely and may be left with stale TLB entries.
      
      Use fullmm flush since it yields much better performance on aarch64 and
      non-fullmm doesn't yield a significant difference on x86.
      
      The original proposed fix came from Jan Stancek, who mainly debugged
      this issue; I just wrapped everything up together.
      
      Jan's testing results:
      
      v5.2-rc2-24-gbec7550c
      --------------------------
               mean     stddev
      real    37.382   2.780
      user     1.420   0.078
      sys     54.658   1.855
      
      v5.2-rc2-24-gbec7550c + "mm: mmu_gather: remove __tlb_reset_range() for force flush"
      ---------------------------------------------------------------------------------------
               mean     stddev
      real    37.119   2.105
      user     1.548   0.087
      sys     55.698   1.357
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1558322252-113575-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: dd2283f2 ("mm: mmap: zap pages with read mmap_sem in munmap")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Jan Stancek <jstancek@redhat.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Tested-by: Jan Stancek <jstancek@redhat.com>
      Suggested-by: Will Deacon <will.deacon@arm.com>
      Tested-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>	[4.20+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
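
      The change in behaviour can be sketched with a toy gather structure.
      struct toy_gather and the toy_* helpers are hypothetical, and the
      exact sequence in toy_finish() (force fullmm, reset the range, then
      mark freed_tables again) is an assumption about how a "force flush"
      could keep the freed-tables information; it is not a copy of the
      kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced model of the gather state discussed above. */
struct toy_gather {
	bool fullmm;		/* flush the whole address space */
	bool freed_tables;	/* page tables were freed */
	bool cleared_pmds;	/* a PMD-level entry was cleared */
	unsigned long start, end;
};

/* Forget the tracked range and the per-level change bits. */
static void toy_reset_range(struct toy_gather *tlb)
{
	tlb->start = ~0UL;
	tlb->end = 0;
	tlb->freed_tables = false;
	tlb->cleared_pmds = false;
}

/* nested: true when another gather on the same mm is still in flight. */
static void toy_finish(struct toy_gather *tlb, bool nested)
{
	if (nested) {
		/* Do not trust the range; flush everything instead. */
		tlb->fullmm = true;
		toy_reset_range(tlb);
		/* Keep telling the flush that tables may have been freed. */
		tlb->freed_tables = true;
	}
	/* The actual TLB flush would be issued here. */
}
```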
    • fs/ocfs2: fix race in ocfs2_dentry_attach_lock() · be99ca27
      Wengang Wang authored
      ocfs2_dentry_attach_lock() can be executed in parallel threads against the
      same dentry.  Make that race safe.  The race is like this:
      
                  thread A                               thread B
      
      (A1) enter ocfs2_dentry_attach_lock,
      seeing dentry->d_fsdata is NULL,
      and no alias found by
      ocfs2_find_local_alias, so kmalloc
      a new ocfs2_dentry_lock structure
      to local variable "dl", dl1
      
                     .....
      
                                          (B1) enter ocfs2_dentry_attach_lock,
                                          seeing dentry->d_fsdata is NULL,
                                          and no alias found by
                                          ocfs2_find_local_alias so kmalloc
                                          a new ocfs2_dentry_lock structure
                                          to local variable "dl", dl2.
      
                                                         ......
      
      (A2) set dentry->d_fsdata with dl1,
      call ocfs2_dentry_lock() and increase
      dl1->dl_lockres.l_ro_holders to 1 on
      success.
                    ......
      
                                          (B2) set dentry->d_fsdata with dl2
                                          call ocfs2_dentry_lock() and increase
                                          dl2->dl_lockres.l_ro_holders to 1 on
                                          success.
      
                                                        ......
      
      (A3) call ocfs2_dentry_unlock()
      and decrease
      dl2->dl_lockres.l_ro_holders to 0
      on success.
                   ....
      
                                          (B3) call ocfs2_dentry_unlock(),
                                          decreasing
                                          dl2->dl_lockres.l_ro_holders, but
                                          see it's zero now, panic
      
      Link: http://lkml.kernel.org/r/20190529174636.22364-1-wen.gang.wang@oracle.com
      Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
      Reported-by: Daniel Sobe <daniel.sobe@nxp.com>
      Tested-by: Daniel Sobe <daniel.sobe@nxp.com>
      Reviewed-by: Changwei Ge <gechangwei@live.cn>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
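
      One common way to make this kind of "allocate, then publish" sequence
      race-safe is to re-check the shared pointer under a lock and discard
      the local allocation if another thread published first.  The sketch
      below shows that pattern in userspace with a pthread mutex; the toy_*
      names are hypothetical and this is not claimed to be the exact ocfs2
      change.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the per-dentry lock and the dentry. */
struct toy_lock {
	int holders;
};

struct toy_dentry {
	struct toy_lock *fsdata;
};

static pthread_mutex_t toy_attach_mutex = PTHREAD_MUTEX_INITIALIZER;

/*
 * Return the lock attached to the dentry, attaching a new one if none
 * exists yet.  The allocation happens outside the mutex, which models
 * the window in which two threads both saw fsdata == NULL.
 */
static struct toy_lock *toy_attach_lock(struct toy_dentry *dentry)
{
	struct toy_lock *dl = calloc(1, sizeof(*dl));

	if (!dl)
		return NULL;

	pthread_mutex_lock(&toy_attach_mutex);
	if (dentry->fsdata) {
		/* Lost the race: use the winner's lock, drop ours. */
		struct toy_lock *winner = dentry->fsdata;

		pthread_mutex_unlock(&toy_attach_mutex);
		free(dl);
		return winner;
	}
	dentry->fsdata = dl;
	pthread_mutex_unlock(&toy_attach_mutex);
	return dl;
}
```

      With this shape both threads end up taking and dropping holders on
      the same structure, so the counts stay balanced.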
    • mm/vmscan.c: fix recent_rotated history · b17f18af
      Kirill Tkhai authored
      Johannes pointed out that after commit 886cf190 ("mm: move
      recent_rotated pages calculation to shrink_inactive_list()") we lost all
      zone_reclaim_stat::recent_rotated history.
      
      This fixes it.
      
      Link: http://lkml.kernel.org/r/155905972210.26456.11178359431724024112.stgit@localhost.localdomain
      Fixes: 886cf190 ("mm: move recent_rotated pages calculation to shrink_inactive_list()")
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mlock.c: mlockall error for flag MCL_ONFAULT · dedca635
      Potyra, Stefan authored
      If mlockall() is called with only MCL_ONFAULT as flag, it removes any
      previously applied lockings and does nothing else.
      
      This behavior is counter-intuitive and doesn't match the Linux man page.
      
        For mlockall():
      
        EINVAL Unknown flags were specified or MCL_ONFAULT was specified
        without either MCL_FUTURE or MCL_CURRENT.
      
      Consequently, return the error EINVAL, if only MCL_ONFAULT is passed.
      That way, applications will at least detect that they are calling
      mlockall() incorrectly.
      
      Link: http://lkml.kernel.org/r/20190527075333.GA6339@er01809n.ebgroup.elektrobit.com
      Fixes: b0f205c2 ("mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage")
      Signed-off-by: Stefan Potyra <Stefan.Potyra@elektrobit.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
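
      The flag rule quoted from the man page can be written as a small
      validation helper.  The TOY_MCL_* constants are hypothetical values
      chosen for this sketch (real values come from <sys/mman.h> and can
      differ between architectures), and toy_check_mlockall_flags() is an
      illustration rather than the kernel function.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values for illustration only. */
#define TOY_MCL_CURRENT 1
#define TOY_MCL_FUTURE  2
#define TOY_MCL_ONFAULT 4

/* Return 0 for an acceptable flag combination, -EINVAL otherwise. */
static int toy_check_mlockall_flags(int flags)
{
	const int known = TOY_MCL_CURRENT | TOY_MCL_FUTURE | TOY_MCL_ONFAULT;

	/* Reject no flags, unknown flags, and ONFAULT on its own. */
	if (!flags || (flags & ~known) || flags == TOY_MCL_ONFAULT)
		return -EINVAL;
	return 0;
}
```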
    • scripts/decode_stacktrace.sh: prefix addr2line with $CROSS_COMPILE · c04e32e9
      Manuel Traut authored
      At least for ARM64 kernels compiled with the cross toolchain from
      Debian/stretch or with the toolchain from kernel.org the line number is
      not decoded correctly by 'decode_stacktrace.sh':
      
        $ echo "[  136.513051]  f1+0x0/0xc [kcrash]" | \
          CROSS_COMPILE=/opt/gcc-8.1.0-nolibc/aarch64-linux/bin/aarch64-linux- \
         ./scripts/decode_stacktrace.sh /scratch/linux-arm64/vmlinux \
                                        /scratch/linux-arm64 \
                                        /nfs/debian/lib/modules/4.20.0-devel
        [  136.513051] f1 (/linux/drivers/staging/kcrash/kcrash.c:68) kcrash
      
      If addr2line from the toolchain is used the decoded line number is correct:
      
        [  136.513051] f1 (/linux/drivers/staging/kcrash/kcrash.c:57) kcrash
      
      Link: http://lkml.kernel.org/r/20190527083425.3763-1-manut@linutronix.de
      Signed-off-by: Manuel Traut <manut@linutronix.de>
      Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/list_lru.c: fix memory leak in __memcg_init_list_lru_node · 3510955b
      Shakeel Butt authored
      Syzbot reported the following memory leak:
      
      ffffffffda RBX: 0000000000000003 RCX: 0000000000441f79
      BUG: memory leak
      unreferenced object 0xffff888114f26040 (size 32):
        comm "syz-executor626", pid 7056, jiffies 4294948701 (age 39.410s)
        hex dump (first 32 bytes):
          40 60 f2 14 81 88 ff ff 40 60 f2 14 81 88 ff ff  @`......@`......
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
           slab_post_alloc_hook mm/slab.h:439 [inline]
           slab_alloc mm/slab.c:3326 [inline]
           kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
           kmalloc include/linux/slab.h:547 [inline]
           __memcg_init_list_lru_node+0x58/0xf0 mm/list_lru.c:352
           memcg_init_list_lru_node mm/list_lru.c:375 [inline]
           memcg_init_list_lru mm/list_lru.c:459 [inline]
           __list_lru_init+0x193/0x2a0 mm/list_lru.c:626
           alloc_super+0x2e0/0x310 fs/super.c:269
           sget_userns+0x94/0x2a0 fs/super.c:609
           sget+0x8d/0xb0 fs/super.c:660
           mount_nodev+0x31/0xb0 fs/super.c:1387
           fuse_mount+0x2d/0x40 fs/fuse/inode.c:1236
           legacy_get_tree+0x27/0x80 fs/fs_context.c:661
           vfs_get_tree+0x2e/0x120 fs/super.c:1476
           do_new_mount fs/namespace.c:2790 [inline]
           do_mount+0x932/0xc50 fs/namespace.c:3110
           ksys_mount+0xab/0x120 fs/namespace.c:3319
           __do_sys_mount fs/namespace.c:3333 [inline]
           __se_sys_mount fs/namespace.c:3330 [inline]
           __x64_sys_mount+0x26/0x30 fs/namespace.c:3330
           do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This is a simple off-by-one bug on the error path.
      
      Link: http://lkml.kernel.org/r/20190528043202.99980-1-shakeelb@google.com
      Fixes: 60d3fd32 ("list_lru: introduce per-memcg lists")
      Reported-by: syzbot+f90a420dfe2b1b03cb2c@syzkaller.appspotmail.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: <stable@vger.kernel.org>	[4.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
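
      An off-by-one on an unwind path typically comes from mixing up an
      inclusive and an exclusive end index.  The userspace sketch below
      assumes a destroy helper that takes a half-open range [begin, end);
      all toy_* names and the fail_at parameter are hypothetical and exist
      only to simulate an allocation failure.

```c
#include <assert.h>
#include <stdlib.h>

static int toy_live;	/* allocations currently outstanding */

/* Free the slots in the half-open range [begin, end). */
static void toy_destroy_range(void **slots, int begin, int end)
{
	int i;

	for (i = begin; i < end; i++) {
		free(slots[i]);
		slots[i] = NULL;
		toy_live--;
	}
}

/*
 * Fill [begin, end).  fail_at simulates an allocation failure at that
 * index; pass -1 for no simulated failure.
 */
static int toy_init_range(void **slots, int begin, int end, int fail_at)
{
	int i;

	for (i = begin; i < end; i++) {
		slots[i] = (i == fail_at) ? NULL : malloc(32);
		if (!slots[i])
			goto fail;
		toy_live++;
	}
	return 0;
fail:
	/*
	 * Slot i was never allocated, so [begin, i) is exactly what has to
	 * be undone.  Passing i - 1 as the exclusive end would leak slot
	 * i - 1.
	 */
	toy_destroy_range(slots, begin, i);
	return -1;
}
```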
    • mm: memcontrol: don't batch updates of local VM stats and events · 815744d7
      Johannes Weiner authored
      The kernel test robot noticed a 26% will-it-scale pagefault regression
      from commit 42a30035 ("mm: memcontrol: fix recursive statistics
      correctness & scalabilty").  This appears to be caused by bouncing the
      additional cachelines from the new hierarchical statistics counters.
      
      We can fix this by getting rid of the batched local counters instead.
      
      Originally, there were *only* group-local counters, and they were fully
      maintained per cpu.  A reader of a stats file high up in the cgroup tree
      would have to walk the entire subtree and collect each level's per-cpu
      counters to get the recursive view.  This was prohibitively expensive,
      and so we switched to per-cpu batched updates of the local counters
      during a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting"), reducing the complexity from nr_subgroups *
      nr_cpus to nr_subgroups.
      
      With growing machines and cgroup trees, the tree walk itself became too
      expensive for monitoring top-level groups, and this is when the culprit
      patch added hierarchy counters on each cgroup level.  When the per-cpu
      batch size would be reached, both the local and the hierarchy counters
      would get batch-updated from the per-cpu delta simultaneously.
      
      This makes local and hierarchical counter reads blazingly fast, but it
      unfortunately makes the write side too cacheline-intensive.
      
      Since local counter reads were never a problem - we only centralized
      them to accelerate the hierarchy walk - and use of the local counters
      is becoming rarer due to replacement with hierarchical views (ongoing
      rework in the page reclaim and workingset code), we can make those local
      counters unbatched per-cpu counters again.
      
      The scheme will then be as such:
      
         when a memcg statistic changes, the writer will:
         - update the local counter (per-cpu)
         - update the batch counter (per-cpu). If the batch is full:
           - spill the batch into the group's atomic_t
           - spill the batch into all ancestors' atomic_ts
           - empty out the batch counter (per-cpu)
      
         when a local memcg counter is read, the reader will:
         - collect the local counter from all cpus
      
         when a hierarchy memcg counter is read, the reader will:
         - read the atomic_t
      
      We might be able to simplify this further and make the recursive
      counters unbatched per-cpu counters as well (batch upward propagation,
      but leave per-cpu collection to the readers), but that will require a
      more in-depth analysis and testing of all the callsites.  Deal with the
      immediate regression for now.
      
      Link: http://lkml.kernel.org/r/20190521151647.GB2870@cmpxchg.org
      Fixes: 42a30035 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Tested-by: kernel test robot <rong.a.chen@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
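
      The write and read scheme listed in the message can be modelled in a
      single-threaded userspace sketch.  Everything here is hypothetical:
      TOY_NR_CPUS and TOY_BATCH are made-up sizes, plain long fields stand
      in for per-cpu variables and for atomic_t, and the cpu index is passed
      explicitly instead of being derived from the running CPU.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS 4
#define TOY_BATCH   32	/* assumed spill threshold */

/* Hypothetical, reduced model of one cgroup's counter for one statistic. */
struct toy_memcg {
	struct toy_memcg *parent;
	long local[TOY_NR_CPUS];	/* unbatched per-cpu local counter */
	long batch[TOY_NR_CPUS];	/* per-cpu delta not yet spilled */
	long hier;			/* stands in for the shared atomic */
};

/* Writer side: always update local, spill the batch when it is full. */
static void toy_mod_stat(struct toy_memcg *memcg, int cpu, long val)
{
	struct toy_memcg *mi;
	long x;

	memcg->local[cpu] += val;

	x = memcg->batch[cpu] + val;
	if (x > TOY_BATCH || x < -TOY_BATCH) {
		/* Spill into this group and all of its ancestors. */
		for (mi = memcg; mi; mi = mi->parent)
			mi->hier += x;
		x = 0;
	}
	memcg->batch[cpu] = x;
}

/* Local read: collect the local counter from all cpus. */
static long toy_read_local(const struct toy_memcg *memcg)
{
	long sum = 0;
	int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		sum += memcg->local[cpu];
	return sum;
}

/* Hierarchy read: a single load of the shared counter. */
static long toy_read_hier(const struct toy_memcg *memcg)
{
	return memcg->hier;
}
```

      The model makes the trade-off visible: a hierarchy read is one load
      but may lag by up to one batch per cpu, while a local read is exact
      but has to visit every cpu.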
  2. 13 Jun, 2019 2 commits
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid · c11fb13a
      Linus Torvalds authored
      Pull HID fixes from Jiri Kosina:
      
       - regression fixes (reverts) for module loading changes that turned out
         to be incompatible with some userspace, from Benjamin Tissoires
      
       - regression fix for special Logitech unifying receiver 0xc52f, from
         Hans de Goede
      
       - a few device ID additions to logitech driver, from Hans de Goede
      
       - fix for Bluetooth support on 2nd-gen Wacom Intuos Pro, from Jason
         Gerecke
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
        HID: logitech-dj: Fix 064d:c52f receiver support
        Revert "HID: core: Call request_module before doing device_add"
        Revert "HID: core: Do not call request_module() in async context"
        Revert "HID: Increase maximum report size allowed by hid_field_extract()"
        HID: a4tech: fix horizontal scrolling
        HID: hyperv: Add a module description line
        HID: logitech-hidpp: Add support for the S510 remote control
        HID: multitouch: handle faulty Elo touch device
        HID: wacom: Sync INTUOSP2_BT touch state after each frame if necessary
        HID: wacom: Correct button numbering 2nd-gen Intuos Pro over Bluetooth
        HID: wacom: Send BTN_TOUCH in response to INTUOSP2_BT eraser contact
        HID: wacom: Don't report anything prior to the tool entering range
        HID: wacom: Don't set tool type until we're in range
        HID: rmi: Use SET_REPORT request on control endpoint for Acer Switch 3 and 5
        HID: logitech-hidpp: add support for the MX5500 keyboard
        HID: logitech-dj: add support for the Logitech MX5500's Bluetooth Mini-Receiver
        HID: i2c-hid: add iBall Aer3 to descriptor override
    • Merge tag 'selinux-pr-20190612' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux · b076173a
      Linus Torvalds authored
      Pull selinux fixes from Paul Moore:
       "Three patches for v5.2.
      
        One fixes a problem where we weren't correctly logging raw SELinux
        labels, the other two fix problems where we weren't properly checking
        calls to kmemdup()"
      
      * tag 'selinux-pr-20190612' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
        selinux: fix a missing-check bug in selinux_sb_eat_lsm_opts()
        selinux: fix a missing-check bug in selinux_add_mnt_opt( )
        selinux: log raw contexts as untrusted strings
  3. 12 Jun, 2019 7 commits
  4. 11 Jun, 2019 2 commits
  5. 10 Jun, 2019 4 commits
  6. 09 Jun, 2019 1 commit
  7. 08 Jun, 2019 11 commits
    • Merge tag 'ceph-for-5.2-rc4' of git://github.com/ceph/ceph-client · 2759e05c
      Linus Torvalds authored
      Pull ceph fixes from Ilya Dryomov:
       "A change to call iput() asynchronously to avoid a possible deadlock
        when iput_final() needs to wait for in-flight I/O (e.g. readahead) and
        a fixup for a cleanup that went into -rc1"
      
      * tag 'ceph-for-5.2-rc4' of git://github.com/ceph/ceph-client:
        ceph: fix error handling in ceph_get_caps()
        ceph: avoid iput_final() while holding mutex or in dispatch thread
        ceph: single workqueue for inode related works
    • Merge tag 'for-linus-5.2b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 8e61f6f7
      Linus Torvalds authored
      Pull xen fix from Juergen Gross:
       "Just one fix for the Xen block frontend driver avoiding allocations
        with order > 0"
      
      * tag 'for-linus-5.2b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        xen-blkfront: switch kcalloc to kvcalloc for large array allocation
    • Merge tag 's390-5.2-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 3d4645bf
      Linus Torvalds authored
      Pull s390 fixes from Heiko Carstens:
      
       - fix stack unwinder: the stack unwinder rework has an off-by-one bug
         which prevents following stack backchains over more than one context
         (e.g. irq -> process).
      
       - fix address space detection in exception handler: if user space
         switches to access register mode, which is not supported anymore, the
         exception handler may resolve to the wrong address space.
      
      * tag 's390-5.2-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/unwind: correct stack switching during unwind
        s390/mm: fix address space detection in exception handling
    • Merge tag 'mips_fixes_5.2_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux · d0cc617a
      Linus Torvalds authored
      Pull MIPS fixes from Paul Burton:
      
       - Declare ginvt() __always_inline due to its use of an argument as an
         inline asm immediate.
      
       - A VDSO build fix following Kbuild changes made this cycle.
      
       - A fix for boot failures on txx9 systems following memory
         initialization changes made this cycle.
      
       - Bounds check virt_addr_valid() to prevent it spuriously indicating
         that bogus addresses are valid, in turn fixing hardened usercopy
         failures that have been present since v4.12.
      
       - Build uImage.gz for pistachio systems by default, since this is the
         image we need in order to actually boot on a board.
      
       - Remove an unused variable in our uprobes code.
      
      * tag 'mips_fixes_5.2_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
        MIPS: uprobes: remove set but not used variable 'epc'
        MIPS: pistachio: Build uImage.gz by default
        MIPS: Make virt_addr_valid() return bool
        MIPS: Bounds check virt_addr_valid
        MIPS: TXx9: Fix boot crash in free_initmem()
        MIPS: remove a space after -I to cope with header search paths for VDSO
        MIPS: mark ginvt() as __always_inline
    • Merge tag 'spdx-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core · 9331b674
      Linus Torvalds authored
      Pull yet more SPDX updates from Greg KH:
       "Another round of SPDX header file fixes for 5.2-rc4
      
        These are all more "GPL-2.0-or-later" or "GPL-2.0-only" tags being
        added, based on the text in the files. We are slowly chipping away at
        the 700+ different ways people tried to write the license text. All of
        these were reviewed on the spdx mailing list by a number of different
        people.
      
        We now have over 60% of the kernel files covered with SPDX tags:
      	$ ./scripts/spdxcheck.py -v 2>&1 | grep Files
      	Files checked:            64533
      	Files with SPDX:          40392
      	Files with errors:            0
      
        I think the majority of the "easy" fixups are now done, it's now the
        start of the longer-tail of crazy variants to wade through"
      
      * tag 'spdx-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (159 commits)
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 450
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 449
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 448
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 446
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 445
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 444
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 443
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 442
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 440
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 438
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 437
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 436
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 435
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 434
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 433
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 432
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 431
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 430
        treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 429
        ...
    • Merge tag 'char-misc-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 1ce2c851
      Linus Torvalds authored
      Pull char/misc driver fixes from Greg KH:
       "Here are some small char and misc driver fixes for 5.2-rc4 to resolve
        a number of reported issues.
      
        The most "notable" one here is the kernel headers in proc^Wsysfs
        fixes. Those changes move the header file info into sysfs and fix
        the build issues that you reported.
      
        Other than that, a bunch of small habanalabs driver fixes, some fpga
        driver fixes, and a few other tiny driver fixes.
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'char-misc-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        habanalabs: Read upper bits of trace buffer from RWPHI
        habanalabs: Fix virtual address access via debugfs for 2MB pages
        fpga: zynqmp-fpga: Correctly handle error pointer
        habanalabs: fix bug in checking huge page optimization
        habanalabs: Avoid using a non-initialized MMU cache mutex
        habanalabs: fix debugfs code
        uapi/habanalabs: add opcode for enable/disable device debug mode
        habanalabs: halt debug engines on user process close
        test_firmware: Use correct snprintf() limit
        genwqe: Prevent an integer overflow in the ioctl
        parport: Fix mem leak in parport_register_dev_model
        fpga: dfl: expand minor range when registering chrdev region
        fpga: dfl: Add lockdep classes for pdata->lock
        fpga: dfl: afu: Pass the correct device to dma_mapping_error()
        fpga: stratix10-soc: fix use-after-free on s10_init()
        w1: ds2408: Fix typo after 49695ac4 (reset on output_write retry with readback)
        kheaders: Do not regenerate archive if config is not changed
        kheaders: Move from proc to sysfs
        lkdtm/bugs: Adjust recursion test to avoid elision
        lkdtm/usercopy: Moves the KERNEL_DS test to non-canonical
    • Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 902b2edf
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "I2C has a driver bugfix and a MAINTAINERS fix"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        MAINTAINERS: Karthikeyan Ramasubramanian is MIA
        i2c: xiic: Add max_read_len quirk
    • Merge tag 'dmaengine-fix-5.2-rc4' of git://git.infradead.org/users/vkoul/slave-dma · 66b59f2b
      Linus Torvalds authored
      Pull dmaengine fixes from Vinod Koul:
      
       - jz4780 transfer fix for acking descriptors early
      
       - fsl-qdma: clean registers on error
      
       - dw-axi-dmac: null pointer dereference fix
      
       - mediatek-cqdma: fix sleeping in atomic context
      
       - tegra210-adma: fix a bunch of issues such as a crash in driver probe,
         channel FIFO configuration, etc.
      
       - sprd: Fixes for possible crash on descriptor status, block length
         overflow. For 2-stage transfer fix incorrect start, configuration and
         interrupt handling.
      
      * tag 'dmaengine-fix-5.2-rc4' of git://git.infradead.org/users/vkoul/slave-dma:
        dmaengine: sprd: Add interrupt support for 2-stage transfer
        dmaengine: sprd: Fix the right place to configure 2-stage transfer
        dmaengine: sprd: Fix block length overflow
        dmaengine: sprd: Fix the incorrect start for 2-stage destination channels
        dmaengine: sprd: Add validation of current descriptor in irq handler
        dmaengine: sprd: Fix the possible crash when getting descriptor status
        dmaengine: tegra210-adma: Fix spelling
        dmaengine: tegra210-adma: Fix channel FIFO configuration
        dmaengine: tegra210-adma: Fix crash during probe
        dmaengine: mediatek-cqdma: sleeping in atomic context
        dmaengine: dw-axi-dmac: fix null dereference when pointer first is null
        dmaengine: fsl-qdma: Add improvement
        dmaengine: jz4780: Fix transfers being ACKed too soon
    • Merge tag 'for-linus-20190608' of git://git.kernel.dk/linux-block · 8d72e5bd
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
      
       - Allow symlink from the bfq.weight cgroup parameter to the general
         weight (Angelo)
      
       - Damien is new skd maintainer (Bart)
      
       - NVMe pull request from Sagi, with a few small fixes.
      
       - Ensure we set DMA segment size properly, dma-debug is now tripping on
         these (Christoph)
      
       - Remove useless debugfs_create() return check (Greg)
      
       - Remove redundant unlikely() check on IS_ERR() (Kefeng)
      
       - Fixup request freeing on exit (Ming)
      
      * tag 'for-linus-20190608' of git://git.kernel.dk/linux-block:
        block, bfq: add weight symlink to the bfq.weight cgroup parameter
        cgroup: let a symlink too be created with a cftype file
        block: free sched's request pool in blk_cleanup_queue
        nvme-rdma: use dynamic dma mapping per command
        nvme: Fix u32 overflow in the number of namespace list calculation
        mmc: also set max_segment_size in the device
        mtip32xx: also set max_segment_size in the device
        rsxx: don't call dma_set_max_seg_size
        nvme-pci: don't limit DMA segement size
        block: Drop unlikely before IS_ERR(_OR_NULL)
        block: aoe: no need to check return value of debugfs_create functions
        nvmet: fix data_len to 0 for bdev-backed write_zeroes
        MAINTAINERS: Hand over skd maintainership
        nvme-tcp: fix queue mapping when queue count is limited
        nvme-rdma: fix queue mapping when queue count is limited
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 1b02caa3
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Two bug fixes, both for fairly serious problems; the UFS one looks
        like it could be used to exfiltrate data from the kernel, although
        probably only a privileged user has access to the command management
        interface and the missing unlock in smartpqi is long standing and
        probably a little used error path"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: smartpqi: unlock on error in pqi_submit_raid_request_synchronous()
        scsi: ufs: Check that space was properly alloced in copy_query_response
    • Merge tag 'linux-kselftest-5.2-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest · 0ad43e29
      Linus Torvalds authored
      Pull Kselftest fix from Shuah Khan:
       "This consists of a single fix for a vm test build failure regression
        when it is built by itself"
      
      * tag 'linux-kselftest-5.2-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        selftests: vm: Fix test build failure when built by itself