1. 18 Feb, 2019 2 commits
    • Eric Auger's avatar
      vfio_pci: Enable memory accesses before calling pci_map_rom · 0cfd027b
      Eric Auger authored
      pci_map_rom/pci_get_rom_size() performs memory access in the ROM.
      In case the Memory Space accesses were disabled, readw() is likely
      to trigger a synchronous external abort on some platforms.
      
      In case memory accesses were disabled, re-enable them before the
      call and disable them back again just after.
      
      Fixes: 89e1f7d4 ("vfio: Add PCI device driver")
      Signed-off-by: default avatarEric Auger <eric.auger@redhat.com>
      Suggested-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      0cfd027b
    • Alex Williamson's avatar
      vfio/pci: Restore device state on PM transition · 51ef3a00
      Alex Williamson authored
      PCI core handles save and restore of device state around reset, but
      when using pci_set_power_state() we can unintentionally trigger a soft
      reset of the device, where PCI core only restores the BAR state.  If
      we're using vfio-pci's idle D3 support to try to put devices into low
      power when unused, this might trigger a reset when the device is woken
      for use.  Also power state management by the user, or within a guest,
      can put the device into D3 power state with potentially limited
      ability to restore the device if it should undergo a reset.  The PCI
      spec does not define the extent of a soft reset and many devices
      reporting soft reset on D3->D0 transition do not undergo a PCI config
      space reset.  It's therefore assumed safe to unconditionally restore
      the remainder of the state if the device indicates soft reset
      support, even on a user initiated wakeup.
      
      Implement a wrapper in vfio-pci to tag devices reporting PM reset
      support, save their state on transitions into D3 and restore on
      transitions back to D0.
      Reported-by: default avatarAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      51ef3a00
  2. 13 Feb, 2019 1 commit
    • Alexey Kardashevskiy's avatar
      vfio/spapr_tce: Skip unsetting already unset table · a3906855
      Alexey Kardashevskiy authored
      VFIO TCE IOMMU v2 owns IOMMU tables. When we detach an IOMMU group from
      a container, we need to unset these tables from the group which we do by
      calling unset_window(). We also unset tables when removing a DMA window
      via the VFIO_IOMMU_SPAPR_TCE_REMOVE ioctl.
      
      The window removal checks if the table actually exists (hidden inside
      tce_iommu_find_table()) but the group detaching does not so the user
      may see duplicating messages:
      pci 0009:03     : [PE# fd] Removing DMA window #0
      pci 0009:03     : [PE# fd] Removing DMA window #1
      pci 0009:03     : [PE# fd] Removing DMA window #0
      pci 0009:03     : [PE# fd] Removing DMA window #1
      
      At the moment this is not a problem as the second invocation
      of unset_window() writes zeroes to the HW registers again and exits early
      as there is no table.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      a3906855
  3. 12 Feb, 2019 4 commits
  4. 05 Feb, 2019 2 commits
  5. 03 Feb, 2019 6 commits
    • Linus Torvalds's avatar
      Linux 5.0-rc5 · 8834f560
      Linus Torvalds authored
      8834f560
    • Linus Torvalds's avatar
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 24b888d8
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
       "A few updates for x86:
      
         - Fix an unintended sign extension issue in the fault handling code
      
         - Rename the new resource control config switch so it's less
           confusing
      
         - Avoid setting up EFI info in kexec when the EFI runtime is
           disabled.
      
         - Fix the microcode version check in the AMD microcode loader so it
           only loads higher version numbers and never downgrades
      
         - Set EFER.LME in the 32bit trampoline before returning to long mode
           to handle older AMD/KVM behaviour properly.
      
         - Add Darren and Andy as x86/platform reviewers"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/resctrl: Avoid confusion over the new X86_RESCTRL config
        x86/kexec: Don't setup EFI info if EFI runtime is not enabled
        x86/microcode/amd: Don't falsely trick the late loading mechanism
        MAINTAINERS: Add Andy and Darren as arch/x86/platform/ reviewers
        x86/fault: Fix sign-extend unintended sign extension
        x86/boot/compressed/64: Set EFER.LME=1 in 32-bit trampoline before returning to long mode
        x86/cpu: Add Atom Tremont (Jacobsville)
      24b888d8
    • Linus Torvalds's avatar
      Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · cc6810e3
      Linus Torvalds authored
      Pull cpu hotplug fixes from Thomas Gleixner:
       "Two fixes for the cpu hotplug machinery:
      
         - Replace the overly clever 'SMT disabled by BIOS' detection logic as
           it breaks KVM scenarios and prevents speculation control updates
           when the Hyperthreads are brought online late after boot.
      
         - Remove a redundant invocation of the speculation control update
           function"
      
      * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM
        x86/speculation: Remove redundant arch_smt_update() invocation
      cc6810e3
    • Linus Torvalds's avatar
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 58f6d428
      Linus Torvalds authored
      Pull perf fixes from Thomas Gleixner:
       "A pile of perf updates:
      
         - Fix broken sanity check in the /proc/sys/kernel/perf_cpu_time_max_percent
           write handler
      
         - Cure a perf script crash which caused by an unitinialized data
           structure
      
         - Highlight the hottest instruction in perf top and not a random one
      
         - Cure yet another clang issue when building perf python
      
         - Handle topology entries with no CPU correctly in the tools
      
         - Handle perf data which contains both tracepoints and performance
           counter entries correctly.
      
         - Add a missing NULL pointer check in perf ordered_events_free()"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf script: Fix crash when processing recorded stat data
        perf top: Fix wrong hottest instruction highlighted
        perf tools: Handle TOPOLOGY headers with no CPU
        perf python: Remove -fstack-clash-protection when building with some clang versions
        perf core: Fix perf_proc_update_handler() bug
        perf script: Fix crash with printing mixed trace point and other events
        perf ordered_events: Fix crash in ordered_events__free
      58f6d428
    • Linus Torvalds's avatar
      Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 89401be6
      Linus Torvalds authored
      Pull EFI fix from Thomas Gleixner:
       "The dump info for the efi page table debugging lacks a terminator
        which causes the kernel to crash when the debugfile is read"
      
      * 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        efi/arm64: Fix debugfs crash by adding a terminator for ptdump marker
      89401be6
    • Linus Torvalds's avatar
      Merge tag 'for-5.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 312b3a93
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
      
       - regression fix: transaction commit can run away due to delayed ref
         waiting heuristic, this is not necessary now because of the proper
         reservation mechanism introduced in 5.0
      
       - regression fix: potential crash due to use-before-check of an ERR_PTR
         return value
      
       - fix for transaction abort during transaction commit that needs to
         properly clean up pending block groups
      
       - fix deadlock during b-tree node/leaf splitting, when this happens on
         some of the fundamental trees, we must prevent new tree block
         allocation to re-enter indirectly via the block group flushing path
      
       - potential memory leak after errors during mount
      
      * tag 'for-5.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: On error always free subvol_name in btrfs_mount
        btrfs: clean up pending block groups when transaction commit aborts
        btrfs: fix potential oops in device_list_add
        btrfs: don't end the transaction for delayed refs in throttle
        Btrfs: fix deadlock when allocating tree block during leaf/node split
      312b3a93
  6. 02 Feb, 2019 11 commits
    • Linus Torvalds's avatar
      Merge tag 'devicetree-fixes-for-5.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux · 12491ed3
      Linus Torvalds authored
      Pull Devicetree fix from Rob Herring:
       "A single fix for building DT bindings in-tree"
      
      * tag 'devicetree-fixes-for-5.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
        dt-bindings: Fix dt_binding_check target for in tree builds
      12491ed3
    • Linus Torvalds's avatar
      Merge tag 'riscv-for-linus-5.0-rc5' of... · 74b13e7e
      Linus Torvalds authored
      Merge tag 'riscv-for-linus-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux
      
      Pull RISC-V fixes from Palmer Dabbelt:
       "This contains a handful of mostly-independent patches:
      
         - make our port respect TIF_NEED_RESCHED, which fixes
           CONFIG_PREEMPT=y kernels
      
         - fix double-put of OF nodes
      
         - fix a misspelling of target in our Kconfig
      
         - generic PCIe is enabled in our defconfig
      
         - fix our SBI early console to properly handle line
           endings
      
         - fix max_low_pfn being counted in PFNs
      
         - a change to TASK_UNMAPPED_BASE to match what other
           arches do
      
        This has passed my standard 'boot Fedora' flow"
      
      * tag 'riscv-for-linus-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux:
        riscv: Adjust mmap base address at a third of task size
        riscv: fixup max_low_pfn with PFN_DOWN.
        tty/serial: use uart_console_write in the RISC-V SBL early console
        RISC-V: defconfig: Add CRYPTO_DEV_VIRTIO=y
        RISC-V: defconfig: Enable Generic PCIE by default
        RISC-V: defconfig: Move CONFIG_PCI{,E_XILINX}
        RISC-V: Kconfig: fix spelling mistake "traget" -> "target"
        RISC-V: asm/page.h: fix spelling mistake "CONFIG_64BITS" -> "CONFIG_64BIT"
        RISC-V: fix bad use of of_node_put
        RISC-V: Add _TIF_NEED_RESCHED check for kernel thread when CONFIG_PREEMPT=y
      74b13e7e
    • Linus Torvalds's avatar
      Merge tag 'for-linus-20190202' of git://git.kernel.dk/linux-block · c8864cb7
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "A few fixes that should go into this release. This contains:
      
         - MD pull request from Song, fixing a recovery OOM issue (Alexei)
      
         - Fix for a sync related stall (Jianchao)
      
         - Dummy callback for timeouts (Tetsuo)
      
         - IDE atapi sense ordering fix (me)"
      
      * tag 'for-linus-20190202' of git://git.kernel.dk/linux-block:
        ide: ensure atapi sense request aren't preempted
        blk-mq: fix a hung issue when fsync
        block: pass no-op callback to INIT_WORK().
        md/raid5: fix 'out of memory' during raid cache recovery
      c8864cb7
    • Linus Torvalds's avatar
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 3cde55ee
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Five minor bug fixes.
      
        The libfc one is a tiny memory leak, the zfcp one is an incorrect user
        visible parameter and the rest are on error legs or obscure features"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: 53c700: pass correct "dev" to dma_alloc_attrs()
        scsi: bnx2fc: Fix error handling in probe()
        scsi: scsi_debug: fix write_same with virtual_gb problem
        scsi: libfc: free skb when receiving invalid flogi resp
        scsi: zfcp: fix sysfs block queue limit output for max_segment_size
      3cde55ee
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · b9de6efe
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "24 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (24 commits)
        autofs: fix error return in autofs_fill_super()
        autofs: drop dentry reference only when it is never used
        fs/drop_caches.c: avoid softlockups in drop_pagecache_sb()
        mm: migrate: don't rely on __PageMovable() of newpage after unlocking it
        psi: clarify the Kconfig text for the default-disable option
        mm, memory_hotplug: __offline_pages fix wrong locking
        mm: hwpoison: use do_send_sig_info() instead of force_sig()
        kasan: mark file common so ftrace doesn't trace it
        init/Kconfig: fix grammar by moving a closing parenthesis
        lib/test_kmod.c: potential double free in error handling
        mm, oom: fix use-after-free in oom_kill_process
        mm/hotplug: invalid PFNs from pfn_to_online_page()
        mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepages
        psi: fix aggregation idle shut-off
        mm, memory_hotplug: test_pages_in_a_zone do not pass the end of zone
        mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone
        oom, oom_reaper: do not enqueue same task twice
        mm: migrate: make buffer_migrate_page_norefs() actually succeed
        kernel/exit.c: release ptraced tasks before zap_pid_ns_processes
        x86_64: increase stack size for KASAN_EXTRA
        ...
      b9de6efe
    • Qian Cai's avatar
      efi/arm64: Fix debugfs crash by adding a terminator for ptdump marker · 74c953ca
      Qian Cai authored
      When reading 'efi_page_tables' debugfs triggers an out-of-bounds access here:
      
        arch/arm64/mm/dump.c: 282
        if (addr >= st->marker[1].start_address) {
      
      called from:
      
        arch/arm64/mm/dump.c: 331
        note_page(st, addr, 2, pud_val(pud));
      
      because st->marker++ is is called after "UEFI runtime end" which is the
      last element in addr_marker[]. Therefore, add a terminator like the one
      for kernel_page_tables, so it can be skipped to print out non-existent
      markers.
      
      Here's the KASAN bug report:
      
        # cat /sys/kernel/debug/efi_page_tables
        ---[ UEFI runtime start ]---
        0x0000000020000000-0x0000000020010000          64K PTE       RW NX SHD AF ...
        0x0000000020200000-0x0000000021340000       17664K PTE       RW NX SHD AF ...
        ...
        0x0000000021920000-0x0000000021950000         192K PTE       RW x  SHD AF ...
        0x0000000021950000-0x00000000219a0000         320K PTE       RW NX SHD AF ...
        ---[ UEFI runtime end ]---
        ---[ (null) ]---
        ---[ (null) ]---
      
         BUG: KASAN: global-out-of-bounds in note_page+0x1f0/0xac0
         Read of size 8 at addr ffff2000123f2ac0 by task read_all/42464
         Call trace:
          dump_backtrace+0x0/0x298
          show_stack+0x24/0x30
          dump_stack+0xb0/0xdc
          print_address_description+0x64/0x2b0
          kasan_report+0x150/0x1a4
          __asan_report_load8_noabort+0x30/0x3c
          note_page+0x1f0/0xac0
          walk_pgd+0xb4/0x244
          ptdump_walk_pgd+0xec/0x140
          ptdump_show+0x40/0x50
          seq_read+0x3f8/0xad0
          full_proxy_read+0x9c/0xc0
          __vfs_read+0xfc/0x4c8
          vfs_read+0xec/0x208
          ksys_read+0xd0/0x15c
          __arm64_sys_read+0x84/0x94
          el0_svc_handler+0x258/0x304
          el0_svc+0x8/0xc
      
        The buggy address belongs to the variable:
         __compound_literal.0+0x20/0x800
      
        Memory state around the buggy address:
         ffff2000123f2980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         ffff2000123f2a00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fa
        >ffff2000123f2a80: fa fa fa fa 00 00 00 00 fa fa fa fa 00 00 00 00
                                                  ^
         ffff2000123f2b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         ffff2000123f2b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
      
      [ ardb: fix up whitespace ]
      [ mingo: fix up some moar ]
      Signed-off-by: default avatarQian Cai <cai@lca.pw>
      Signed-off-by: default avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Fixes: 9d80448a ("efi/arm64: Add debugfs node to dump UEFI runtime page tables")
      Link: http://lkml.kernel.org/r/20190202095017.13799-2-ard.biesheuvel@linaro.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      74c953ca
    • Johannes Weiner's avatar
      x86/resctrl: Avoid confusion over the new X86_RESCTRL config · e6d42931
      Johannes Weiner authored
      "Resource Control" is a very broad term for this CPU feature, and a term
      that is also associated with containers, cgroups etc. This can easily
      cause confusion.
      
      Make the user prompt more specific. Match the config symbol name.
      
       [ bp: In the future, the corresponding ARM arch-specific code will be
         under ARM_CPU_RESCTRL and the arch-agnostic bits will be carved out
         under the CPU_RESCTRL umbrella symbol. ]
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarBorislav Petkov <bp@suse.de>
      Cc: Babu Moger <Babu.Moger@amd.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: linux-doc@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: Reinette Chatre <reinette.chatre@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190130195621.GA30653@cmpxchg.org
      e6d42931
    • Linus Torvalds's avatar
      Merge tag 'xtensa-20190201' of git://github.com/jcmvbkbc/linux-xtensa · cd984a5b
      Linus Torvalds authored
      Pull xtensa fixes from Max Filippov:
      
       - fix ccount_timer_shutdown for secondary CPUs
      
       - fix secondary CPU initialization
      
       - fix secondary CPU reset vector clash with double exception vector
      
       - fix present CPUs when booting with 'maxcpus' parameter
      
       - limit possible CPUs by configured NR_CPUS
      
       - issue a warning if xtensa PIC is asked to retrigger anything other
         than software IRQ
      
       - fix masking/unmasking of the first two IRQs on xtensa MX PIC
      
       - fix typo in Kconfig description for user space unaligned access
         feature
      
       - fix Kconfig warning for selecting BUILTIN_DTB
      
      * tag 'xtensa-20190201' of git://github.com/jcmvbkbc/linux-xtensa:
        xtensa: SMP: limit number of possible CPUs by NR_CPUS
        xtensa: rename BUILTIN_DTB to BUILTIN_DTB_SOURCE
        xtensa: Fix typo use space=>user space
        drivers/irqchip: xtensa-mx: fix mask and unmask
        drivers/irqchip: xtensa: add warning to irq_retrigger
        xtensa: SMP: mark each possible CPU as present
        xtensa: smp_lx200_defconfig: fix vectors clash
        xtensa: SMP: fix secondary CPU initialization
        xtensa: SMP: fix ccount_timer_shutdown
      cd984a5b
    • Linus Torvalds's avatar
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 8b050fe4
      Linus Torvalds authored
      Pull arm64 fixes from Will Deacon:
       "Although we're still debugging a few minor arm64-specific issues in
        mainline, I didn't want to hold this lot up in the meantime.
      
        We've got an additional KASLR fix after the previous one wasn't quite
        complete, a fix for a performance regression when mapping executable
        pages into userspace and some fixes for kprobe blacklisting. All
        candidates for stable.
      
        Summary:
      
         - Fix module loading when KASLR is configured but disabled at runtime
      
         - Fix accidental IPI when mapping user executable pages
      
         - Ensure hyp-stub and KVM world switch code cannot be kprobed"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: hibernate: Clean the __hyp_text to PoC after resume
        arm64: hyp-stub: Forbid kprobing of the hyp-stub
        arm64: kprobe: Always blacklist the KVM world-switch code
        arm64: kaslr: ensure randomized quantities are clean also when kaslr is off
        arm64: Do not issue IPIs for user executable ptes
      8b050fe4
    • Linus Torvalds's avatar
      Merge tag '5.0-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 · 33640d71
      Linus Torvalds authored
      Pull smb3 fixes from Steve French:
       "SMB3 fixes, some from this week's SMB3 test evemt, 5 for stable and a
        particularly important one for queryxattr (see xfstests 70 and 117)"
      
      * tag '5.0-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
        cifs: update internal module version number
        CIFS: fix use-after-free of the lease keys
        CIFS: Do not consider -ENODATA as stat failure for reads
        CIFS: Do not count -ENODATA as failure for query directory
        CIFS: Fix trace command logging for SMB2 reads and writes
        CIFS: Fix possible oops and memory leaks in async IO
        cifs: limit amount of data we request for xattrs to CIFSMaxBufSize
        cifs: fix computation for MAX_SMB2_HDR_SIZE
      33640d71
    • Linus Torvalds's avatar
      Merge tag 'apparmor-pr-2019-02-01' of... · b7bd29b5
      Linus Torvalds authored
      Merge tag 'apparmor-pr-2019-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor
      
      Pull apparmor bug fixes from John Johansen:
       "Two bug fixes for apparmor:
      
         - Fix aa_label_build() error handling for failed merges
      
         - Fix warning about unused function apparmor_ipv6_postroute"
      
      * tag 'apparmor-pr-2019-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor:
        apparmor: Fix aa_label_build() error handling for failed merges
        apparmor: Fix warning about unused function apparmor_ipv6_postroute
      b7bd29b5
  7. 01 Feb, 2019 14 commits
    • Ian Kent's avatar
      autofs: fix error return in autofs_fill_super() · f585b283
      Ian Kent authored
      In autofs_fill_super() on error of get inode/make root dentry the return
      should be ENOMEM as this is the only failure case of the called
      functions.
      
      Link: http://lkml.kernel.org/r/154725123240.11260.796773942606871359.stgit@pluto-themaw-netSigned-off-by: default avatarIan Kent <raven@themaw.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f585b283
    • Pan Bian's avatar
      autofs: drop dentry reference only when it is never used · 63ce5f55
      Pan Bian authored
      autofs_expire_run() calls dput(dentry) to drop the reference count of
      dentry.  However, dentry is read via autofs_dentry_ino(dentry) after
      that.  This may result in a use-free-bug.  The patch drops the reference
      count of dentry only when it is never used.
      
      Link: http://lkml.kernel.org/r/154725122396.11260.16053424107144453867.stgit@pluto-themaw-netSigned-off-by: default avatarPan Bian <bianpan2016@163.com>
      Signed-off-by: default avatarIan Kent <raven@themaw.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      63ce5f55
    • Jan Kara's avatar
      fs/drop_caches.c: avoid softlockups in drop_pagecache_sb() · c27d82f5
      Jan Kara authored
      When superblock has lots of inodes without any pagecache (like is the
      case for /proc), drop_pagecache_sb() will iterate through all of them
      without dropping sb->s_inode_list_lock which can lead to softlockups
      (one of our customers hit this).
      
      Fix the problem by going to the slow path and doing cond_resched() in
      case the process needs rescheduling.
      
      Link: http://lkml.kernel.org/r/20190114085343.15011-1-jack@suse.czSigned-off-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c27d82f5
    • David Hildenbrand's avatar
      mm: migrate: don't rely on __PageMovable() of newpage after unlocking it · e0a352fa
      David Hildenbrand authored
      We had a race in the old balloon compaction code before b1123ea6
      ("mm: balloon: use general non-lru movable page feature") refactored it
      that became visible after backporting 195a8c43 ("virtio-balloon:
      deflate via a page list") without the refactoring.
      
      The bug existed from commit d6d86c0a ("mm/balloon_compaction:
      redesign ballooned pages management") till b1123ea6 ("mm: balloon:
      use general non-lru movable page feature").  d6d86c0a
      ("mm/balloon_compaction: redesign ballooned pages management") was
      backported to 3.12, so the broken kernels are stable kernels [3.12 -
      4.7].
      
      There was a subtle race between dropping the page lock of the newpage in
      __unmap_and_move() and checking for __is_movable_balloon_page(newpage).
      
      Just after dropping this page lock, virtio-balloon could go ahead and
      deflate the newpage, effectively dequeueing it and clearing PageBalloon,
      in turn making __is_movable_balloon_page(newpage) fail.
      
      This resulted in dropping the reference of the newpage via
      putback_lru_page(newpage) instead of put_page(newpage), leading to
      page->lru getting modified and a !LRU page ending up in the LRU lists.
      With 195a8c43 ("virtio-balloon: deflate via a page list")
      backported, one would suddenly get corrupted lists in
      release_pages_balloon():
      
      - WARNING: CPU: 13 PID: 6586 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
      - list_del corruption. prev->next should be ffffe253961090a0, but was dead000000000100
      
      Nowadays this race is no longer possible, but it is hidden behind very
      ugly handling of __ClearPageMovable() and __PageMovable().
      
      __ClearPageMovable() will not make __PageMovable() fail, only
      PageMovable().  So the new check (__PageMovable(newpage)) will still
      hold even after newpage was dequeued by virtio-balloon.
      
      If anybody would ever change that special handling, the BUG would be
      introduced again.  So instead, make it explicit and use the information
      of the original isolated page before migration.
      
      This patch can be backported fairly easy to stable kernels (in contrast
      to the refactoring).
      
      Link: http://lkml.kernel.org/r/20190129233217.10747-1-david@redhat.com
      Fixes: d6d86c0a ("mm/balloon_compaction: redesign ballooned pages management")
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reported-by: default avatarVratislav Bendel <vbendel@redhat.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarRafael Aquini <aquini@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vratislav Bendel <vbendel@redhat.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>	[3.12 - 4.7]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e0a352fa
    • Johannes Weiner's avatar
      psi: clarify the Kconfig text for the default-disable option · 7b2489d3
      Johannes Weiner authored
      The current help text caused some confusion in online forums about
      whether or not to default-enable or default-disable psi in vendor
      kernels.  This is because it doesn't communicate the reason for why we
      made this setting configurable in the first place: that the overhead is
      non-zero in an artificial scheduler stress test.
      
      Since this isn't representative of real workloads, and the effect was
      not measurable in scheduler-heavy real world applications such as the
      webservers and memcache installations at Facebook, it's fair to point
      out that this is a pretty cautious option to select.
      
      Link: http://lkml.kernel.org/r/20190129233617.16767-1-hannes@cmpxchg.orgSigned-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7b2489d3
    • Michal Hocko's avatar
      mm, memory_hotplug: __offline_pages fix wrong locking · e3df4c6e
      Michal Hocko authored
      Jan has noticed that we do double unlock on some failure paths when
      offlining a page range.  This is indeed the case when
      test_pages_in_a_zone respp.  start_isolate_page_range fail.  This was an
      omission when forward porting the debugging patch from an older kernel.
      
      Fix the issue by dropping mem_hotplug_done from the failure condition
      and keeping the single unlock in the catch all failure path.
      
      Link: http://lkml.kernel.org/r/20190115120307.22768-1-mhocko@kernel.org
      Fixes: 79605093 ("mm, memory_hotplug: print reason for the offlining failure")
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Tested-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e3df4c6e
    • Naoya Horiguchi's avatar
      mm: hwpoison: use do_send_sig_info() instead of force_sig() · 6376360e
      Naoya Horiguchi authored
      Currently memory_failure() is racy against process's exiting, which
      results in kernel crash by null pointer dereference.
      
      The root cause is that memory_failure() uses force_sig() to forcibly
      kill asynchronous (meaning not in the current context) processes.  As
      discussed in thread https://lkml.org/lkml/2010/6/8/236 years ago for OOM
      fixes, this is not a right thing to do.  OOM solves this issue by using
      do_send_sig_info() as done in commit d2d39309 ("signal:
      oom_kill_task: use SEND_SIG_FORCED instead of force_sig()"), so this
      patch is suggesting to do the same for hwpoison.  do_send_sig_info()
      properly accesses to siglock with lock_task_sighand(), so is free from
      the reported race.
      
      I confirmed that the reported bug reproduces with inserting some delay
      in kill_procs(), and it never reproduces with this patch.
      
      Note that memory_failure() can send another type of signal using
      force_sig_mceerr(), and the reported race shouldn't happen on it because
      force_sig_mceerr() is called only for synchronous processes (i.e.
      BUS_MCEERR_AR happens only when some process accesses to the corrupted
      memory.)
      
      Link: http://lkml.kernel.org/r/20190116093046.GA29835@hori1.linux.bs1.fc.nec.co.jpSigned-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: default avatarJane Chu <jane.chu@oracle.com>
      Reviewed-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reviewed-by: default avatarWilliam Kucharski <william.kucharski@oracle.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6376360e
    • Anders Roxell's avatar
      kasan: mark file common so ftrace doesn't trace it · 0d0c8de8
      Anders Roxell authored
      When option CONFIG_KASAN is enabled toghether with ftrace, function
      ftrace_graph_caller() gets in to a recursion, via functions
      kasan_check_read() and kasan_check_write().
      
       Breakpoint 2, ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179
       179             mcount_get_pc             x0    //     function's pc
       (gdb) bt
       #0  ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179
       #1  0xffffff90101406c8 in ftrace_caller () at ../arch/arm64/kernel/entry-ftrace.S:151
       #2  0xffffff90106fd084 in kasan_check_write (p=0xffffffc06c170878, size=4) at ../mm/kasan/common.c:105
       #3  0xffffff90104a2464 in atomic_add_return (v=<optimized out>, i=<optimized out>) at ./include/generated/atomic-instrumented.h:71
       #4  atomic_inc_return (v=<optimized out>) at ./include/generated/atomic-fallback.h:284
       #5  trace_graph_entry (trace=0xffffffc03f5ff380) at ../kernel/trace/trace_functions_graph.c:441
       #6  0xffffff9010481774 in trace_graph_entry_watchdog (trace=<optimized out>) at ../kernel/trace/trace_selftest.c:741
       #7  0xffffff90104a185c in function_graph_enter (ret=<optimized out>, func=<optimized out>, frame_pointer=18446743799894897728, retp=<optimized out>) at ../kernel/trace/trace_functions_graph.c:196
       #8  0xffffff9010140628 in prepare_ftrace_return (self_addr=18446743592948977792, parent=0xffffffc03f5ff418, frame_pointer=18446743799894897728) at ../arch/arm64/kernel/ftrace.c:231
       #9  0xffffff90101406f4 in ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:182
       Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       (gdb)
      
      Rework so that the kasan implementation isn't traced.
      
      Link: http://lkml.kernel.org/r/20181212183447.15890-1-anders.roxell@linaro.orgSigned-off-by: default avatarAnders Roxell <anders.roxell@linaro.org>
      Acked-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Tested-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Acked-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0d0c8de8
    • Jonathan Neuschäfer's avatar
    • Dan Carpenter's avatar
      lib/test_kmod.c: potential double free in error handling · db7ddeab
      Dan Carpenter authored
      There is a copy and paste bug so we set "config->test_driver" to NULL
      twice instead of setting "config->test_fs".  Smatch complains that it
      leads to a double free:
      
        lib/test_kmod.c:840 __kmod_config_init() warn: 'config->test_fs' double freed
      
      Link: http://lkml.kernel.org/r/20190121140011.GA14283@kadam
      Fixes: d9c6a72d ("kmod: add test driver to stress test the module loader")
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Acked-by: default avatarLuis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      db7ddeab
    • Shakeel Butt's avatar
      mm, oom: fix use-after-free in oom_kill_process · cefc7ef3
      Shakeel Butt authored
      Syzbot instance running on upstream kernel found a use-after-free bug in
      oom_kill_process.  On further inspection it seems like the process
      selected to be oom-killed has exited even before reaching
      read_lock(&tasklist_lock) in oom_kill_process().  More specifically the
      tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task()
      and the put_task_struct within for_each_thread() frees the tsk and
      for_each_thread() tries to access the tsk.  The easiest fix is to do
      get/put across the for_each_thread() on the selected task.
      
      Now the next question is should we continue with the oom-kill as the
      previously selected task has exited? However before adding more
      complexity and heuristics, let's answer why we even look at the children
      of oom-kill selected task? The select_bad_process() has already selected
      the worst process in the system/memcg.  Due to race, the selected
      process might not be the worst at the kill time but does that matter?
      The userspace can use the oom_score_adj interface to prefer children to
      be killed before the parent.  I looked at the history but it seems like
      this is there before git history.
      
      Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com
      Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com
      Fixes: 6b0c81b3 ("mm, oom: reduce dependency on tasklist_lock")
      Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cefc7ef3
    • Qian Cai's avatar
      mm/hotplug: invalid PFNs from pfn_to_online_page() · b13bc351
      Qian Cai authored
      On an arm64 ThunderX2 server, the first kmemleak scan would crash [1]
      with CONFIG_DEBUG_VM_PGFLAGS=y due to page_to_nid() found a pfn that is
      not directly mapped (MEMBLOCK_NOMAP).  Hence, the page->flags is
      uninitialized.
      
      This is due to the commit 9f1eb38e ("mm, kmemleak: little
      optimization while scanning") starts to use pfn_to_online_page() instead
      of pfn_valid().  However, in the CONFIG_MEMORY_HOTPLUG=y case,
      pfn_to_online_page() does not call memblock_is_map_memory() while
      pfn_valid() does.
      
      Historically, the commit 68709f45 ("arm64: only consider memblocks
      with NOMAP cleared for linear mapping") causes pages marked as nomap
      being no long reassigned to the new zone in memmap_init_zone() by
      calling __init_single_page().
      
      Since the commit 2d070eab ("mm: consider zone which is not fully
      populated to have holes") introduced pfn_to_online_page() and was
      designed to return a valid pfn only, but it is clearly broken on arm64.
      
      Therefore, let pfn_to_online_page() call pfn_valid_within(), so it can
      handle nomap thanks to the commit f52bb98f ("arm64: mm: always
      enable CONFIG_HOLES_IN_ZONE"), while it will be optimized away on
      architectures where have no HOLES_IN_ZONE.
      
      [1]
        Unable to handle kernel NULL pointer dereference at virtual address 0000000000000006
        Mem abort info:
          ESR = 0x96000005
          Exception class = DABT (current EL), IL = 32 bits
          SET = 0, FnV = 0
          EA = 0, S1PTW = 0
        Data abort info:
          ISV = 0, ISS = 0x00000005
          CM = 0, WnR = 0
        Internal error: Oops: 96000005 [#1] SMP
        CPU: 60 PID: 1408 Comm: kmemleak Not tainted 5.0.0-rc2+ #8
        pstate: 60400009 (nZCv daif +PAN -UAO)
        pc : page_mapping+0x24/0x144
        lr : __dump_page+0x34/0x3dc
        sp : ffff00003a5cfd10
        x29: ffff00003a5cfd10 x28: 000000000000802f
        x27: 0000000000000000 x26: 0000000000277d00
        x25: ffff000010791f56 x24: ffff7fe000000000
        x23: ffff000010772f8b x22: ffff00001125f670
        x21: ffff000011311000 x20: ffff000010772f8b
        x19: fffffffffffffffe x18: 0000000000000000
        x17: 0000000000000000 x16: 0000000000000000
        x15: 0000000000000000 x14: ffff802698b19600
        x13: ffff802698b1a200 x12: ffff802698b16f00
        x11: ffff802698b1a400 x10: 0000000000001400
        x9 : 0000000000000001 x8 : ffff00001121a000
        x7 : 0000000000000000 x6 : ffff0000102c53b8
        x5 : 0000000000000000 x4 : 0000000000000003
        x3 : 0000000000000100 x2 : 0000000000000000
        x1 : ffff000010772f8b x0 : ffffffffffffffff
        Process kmemleak (pid: 1408, stack limit = 0x(____ptrval____))
        Call trace:
         page_mapping+0x24/0x144
         __dump_page+0x34/0x3dc
         dump_page+0x28/0x4c
         kmemleak_scan+0x4ac/0x680
         kmemleak_scan_thread+0xb4/0xdc
         kthread+0x12c/0x13c
         ret_from_fork+0x10/0x18
        Code: d503201f f9400660 36000040 d1000413 (f9400661)
        ---[ end trace 4d4bd7f573490c8e ]---
        Kernel panic - not syncing: Fatal exception
        SMP: stopping secondary CPUs
        Kernel Offset: disabled
        CPU features: 0x002,20000c38
        Memory Limit: none
        ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      Link: http://lkml.kernel.org/r/20190122132916.28360-1-cai@lca.pw
      Fixes: 9f1eb38e ("mm, kmemleak: little optimization while scanning")
      Signed-off-by: default avatarQian Cai <cai@lca.pw>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b13bc351
    • Oscar Salvador's avatar
      mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepages · eeb0efd0
      Oscar Salvador authored
      This is the same sort of error we saw in commit 17e2e7d7 ("mm,
      page_alloc: fix has_unmovable_pages for HugePages").
      
      Gigantic hugepages cross several memblocks, so it can be that the page
      we get in scan_movable_pages() is a page-tail belonging to a
      1G-hugepage.  If that happens, page_hstate()->size_to_hstate() will
      return NULL, and we will blow up in hugepage_migration_supported().
      
      The splat is as follows:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        #PF error: [normal kernel read fault]
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP PTI
        CPU: 1 PID: 1350 Comm: bash Tainted: G            E     5.0.0-rc1-mm1-1-default+ #27
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
        RIP: 0010:__offline_pages+0x6ae/0x900
        Call Trace:
         memory_subsys_offline+0x42/0x60
         device_offline+0x80/0xa0
         state_store+0xab/0xc0
         kernfs_fop_write+0x102/0x180
         __vfs_write+0x26/0x190
         vfs_write+0xad/0x1b0
         ksys_write+0x42/0x90
         do_syscall_64+0x5b/0x180
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        Modules linked in: af_packet(E) xt_tcpudp(E) ipt_REJECT(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv4(E) ip_set(E) nfnetlink(E) ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) ebtable_filter(E) ebtables(E) iptable_filter(E) ip_tables(E) x_tables(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) bochs_drm(E) ttm(E) aesni_intel(E) drm_kms_helper(E) aes_x86_64(E) crypto_simd(E) cryptd(E) glue_helper(E) drm(E) virtio_net(E) syscopyarea(E) sysfillrect(E) net_failover(E) sysimgblt(E) pcspkr(E) failover(E) i2c_piix4(E) fb_sys_fops(E) parport_pc(E) parport(E) button(E) btrfs(E) libcrc32c(E) xor(E) zstd_decompress(E) zstd_compress(E) xxhash(E) raid6_pq(E) sd_mod(E) ata_generic(E) ata_piix(E) ahci(E) libahci(E) libata(E) crc32c_intel(E) serio_raw(E) virtio_pci(E) virtio_ring(E) virtio(E) sg(E) scsi_mod(E) autofs4(E)
      
      [akpm@linux-foundation.org: fix brace layout, per David.  Reduce indentation]
      Link: http://lkml.kernel.org/r/20190122154407.18417-1-osalvador@suse.deSigned-off-by: default avatarOscar Salvador <osalvador@suse.de>
      Reviewed-by: default avatarAnthony Yznaga <anthony.yznaga@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      eeb0efd0
    • Johannes Weiner's avatar
      psi: fix aggregation idle shut-off · 1b69ac6b
      Johannes Weiner authored
      psi has provisions to shut off the periodic aggregation worker when
      there is a period of no task activity - and thus no data that needs
      aggregating.  However, while developing psi monitoring, Suren noticed
      that the aggregation clock currently won't stay shut off for good.
      
      Debugging this revealed a flaw in the idle design: an aggregation run
      will see no task activity and decide to go to sleep; shortly thereafter,
      the kworker thread that executed the aggregation will go idle and cause
      a scheduling change, during which the psi callback will kick the
      !pending worker again.  This will ping-pong forever, and is equivalent
      to having no shut-off logic at all (but with more code!)
      
      Fix this by exempting aggregation workers from psi's clock waking logic
      when the state change is them going to sleep.  To do this, tag workers
      with the last work function they executed, and if in psi we see a worker
      going to sleep after aggregating psi data, we will not reschedule the
      aggregation work item.
      
      What if the worker is also executing other items before or after?
      
      Any psi state times that were incurred by work items preceding the
      aggregation work will have been collected from the per-cpu buckets
      during the aggregation itself.  If there are work items following the
      aggregation work, the worker's last_func tag will be overwritten and the
      aggregator will be kept alive to process this genuine new activity.
      
      If the aggregation work is the last thing the worker does, and we decide
      to go idle, the brief period of non-idle time incurred between the
      aggregation run and the kworker's dequeue will be stranded in the
      per-cpu buckets until the clock is woken by later activity.  But that
      should not be a problem.  The buckets can hold 4s worth of time, and
      future activity will wake the clock with a 2s delay, giving us 2s worth
      of data we can leave behind when disabling aggregation.  If it takes a
      worker more than two seconds to go idle after it finishes its last work
      item, we likely have bigger problems in the system, and won't notice one
      sample that was averaged with a bogus per-CPU weight.
      
      Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
      Fixes: eb414681 ("psi: pressure stall information for CPU, memory, and IO")
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarSuren Baghdasaryan <surenb@google.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1b69ac6b