1. 27 Mar, 2018 1 commit
    • Stephane Eranian's avatar
      perf/x86/intel: Fix linear IP of PEBS real_ip on Haswell and later CPUs · 71eb9ee9
      Stephane Eranian authored
      this patch fix a bug in how the pebs->real_ip is handled in the PEBS
      handler. real_ip only exists in Haswell and later processor. It is
      actually the eventing IP, i.e., where the event occurred. As opposed
      to the pebs->ip which is the PEBS interrupt IP which is always off
      by one.
      
      The problem is that the real_ip just like the IP needs to be fixed up
      because PEBS does not record all the machine state registers, and
      in particular the code segement (cs). This is why we have the set_linear_ip()
      function. The problem was that set_linear_ip() was only used on the pebs->ip
      and not the pebs->real_ip.
      
      We have profiles which ran into invalid callstacks because of this.
      Here is an example:
      
       .....  0: ffffffffffffff80 recent entry, marker kernel v
       .....  1: 000000000040044d <= user address in kernel space!
       .....  2: fffffffffffffe00 marker enter user v
       .....  3: 000000000040044d
       .....  4: 00000000004004b6 oldest entry
      
      Debugging output in get_perf_callchain():
      
       [  857.769909] CALLCHAIN: CPU8 ip=40044d regs->cs=10 user_mode(regs)=0
      
      The problem is that the kernel entry in 1: points to a user level
      address. How can that be?
      
      The reason is that with PEBS sampling the instruction that caused the event
      to occur and the instruction where the CPU was when the interrupt was posted
      may be far apart. And sometime during that time window, the privilege level may
      change. This happens, for instance, when the PEBS sample is taken close to a
      kernel entry point. Here PEBS, eventing IP (real_ip) captured a user level
      instruction. But by the time the PMU interrupt fired, the processor had already
      entered kernel space. This is why the debug output shows a user address with
      user_mode() false.
      
      The problem comes from PEBS not recording the code segment (cs) register.
      The register is used in x86_64 to determine if executing in kernel vs user
      space. This is okay because the kernel has a software workaround called
      set_linear_ip(). But the issue in setup_pebs_sample_data() is that
      set_linear_ip() is never called on the real_ip value when it is available
      (Haswell and later) and precise_ip > 1.
      
      This patch fixes this problem and eliminates the callchain discrepancy.
      
      The patch restructures the code around set_linear_ip() to minimize the number
      of times the IP has to be set.
      Signed-off-by: default avatarStephane Eranian <eranian@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Link: http://lkml.kernel.org/r/1521788507-10231-1-git-send-email-eranian@google.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      71eb9ee9
  2. 25 Mar, 2018 10 commits
  3. 24 Mar, 2018 1 commit
  4. 23 Mar, 2018 28 commits
    • Linus Torvalds's avatar
      Merge tag 'trace-v4.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · 99fec39e
      Linus Torvalds authored
      Pull kprobe fixes from Steven Rostedt:
       "The documentation for kprobe events says that symbol offets can take
        both a + and - sign to get to befor and after the symbol address.
      
        But in actuality, the code does not support the minus. This fixes that
        issue, and adds a few more selftests to kprobe events"
      
      * tag 'trace-v4.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        selftests: ftrace: Add a testcase for probepoint
        selftests: ftrace: Add a testcase for string type with kprobe_event
        selftests: ftrace: Add probe event argument syntax testcase
        tracing: probeevent: Fix to support minus offset from symbol
      99fec39e
    • Andy Lutomirski's avatar
      x86/entry/64: Don't use IST entry for #BP stack · d8ba61ba
      Andy Lutomirski authored
      There's nothing IST-worthy about #BP/int3.  We don't allow kprobes
      in the small handful of places in the kernel that run at CPL0 with
      an invalid stack, and 32-bit kernels have used normal interrupt
      gates for #BP forever.
      
      Furthermore, we don't allow kprobes in places that have usergs while
      in kernel mode, so "paranoid" is also unnecessary.
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      d8ba61ba
    • Waiman Long's avatar
      x86/efi: Free efi_pgd with free_pages() · 06ace26f
      Waiman Long authored
      The efi_pgd is allocated as PGD_ALLOCATION_ORDER pages and therefore must
      also be freed as PGD_ALLOCATION_ORDER pages with free_pages().
      
      Fixes: d9e9a641 ("x86/mm/pti: Allocate a separate user PGD")
      Signed-off-by: default avatarWaiman Long <longman@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1521746333-19593-1-git-send-email-longman@redhat.com
      06ace26f
    • Linus Torvalds's avatar
      Merge tag 'mips_fixes_4.16_5' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips · 86d043d4
      Linus Torvalds authored
      Pull MIPS fixes from James Hogan:
       "Another miscellaneous pile of MIPS fixes for 4.16:
      
         - lantiq: fixes for clocks and Amazon SE (4.14)
      
         - ralink: fix booting on MT7621 (4.5)
      
         - ralink: fix halt (3.9)"
      
      * tag 'mips_fixes_4.16_5' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips:
        MIPS: ralink: Fix booting on MT7621
        MIPS: ralink: Remove ralink_halt()
        MIPS: lantiq: ase: Enable MFD_SYSCON
        MIPS: lantiq: Enable AHB Bus for USB
        MIPS: lantiq: Fix Danube USB clock
      86d043d4
    • Linus Torvalds's avatar
      Merge tag 'vfio-v4.16-rc7' of git://github.com/awilliam/linux-vfio · 095fe49f
      Linus Torvalds authored
      Pull VFIO fix from Alex Williamson:
       "Revert masking INTx where it cannot be enabled - it plays poorly with
        SR-IOV VFs and presumes DisINTx support"
      
      * tag 'vfio-v4.16-rc7' of git://github.com/awilliam/linux-vfio:
        Revert: "vfio-pci: Mask INTx if a device is not capabable of enabling it"
      095fe49f
    • Linus Torvalds's avatar
      Merge tag 'mtd/fixes-for-4.16-rc7' of git://git.infradead.org/linux-mtd · a580657a
      Linus Torvalds authored
      Pull MTD fixes from Boris Brezillon:
      
       - Fix several problems in the fsl_ifc NAND controller driver
      
       - Fix misuse of mtd_ooblayout_ecc() in mtdchar.c
      
      * tag 'mtd/fixes-for-4.16-rc7' of git://git.infradead.org/linux-mtd:
        mtd: nand: fsl_ifc: Read ECCSTAT0 and ECCSTAT1 registers for IFC 2.0
        mtd: nand: fsl_ifc: Fix eccstat array overflow for IFC ver >= 2.0.0
        mtd: nand: fsl_ifc: Fix nand waitfunc return value
        mtdchar: fix usage of mtd_ooblayout_ecc()
      a580657a
    • Linus Torvalds's avatar
      Merge tag 'staging-4.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 935c200a
      Linus Torvalds authored
      Pull staging/IIO fixes from Greg KH:
       "Here are a few small staging and IIO fixes for various reported
        issues.
      
        All of them are tiny, the majority being iio driver fixes for small
        issues, and one staging driver fix for a memory corruption issue.
      
        All have been in linux-next with no reported issues"
      
      * tag 'staging-4.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        staging: ncpfs: memory corruption in ncp_read_kernel()
        iio: st_pressure: st_accel: pass correct platform data to init
        Revert "iio: accel: st_accel: remove redundant pointer pdata"
        iio: adc: meson-saradc: unlock on error in meson_sar_adc_lock()
        dt-bindings: iio: adc: sd-modulator: fix io-channel-cells
        iio: adc: stm32-dfsdm: fix multiple channel initialization
        iio: adc: stm32-dfsdm: fix clock source selection
        iio: adc: stm32-dfsdm: fix call to stop channel
        iio: adc: stm32-dfsdm: fix compatible data use
        iio: chemical: ccs811: Corrected firmware boot/application mode transition
      935c200a
    • Linus Torvalds's avatar
      Merge tag 'char-misc-4.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 97235e74
      Linus Torvalds authored
      Pull hyperv fix from Greg KH:
       "This is a single hyperv bugfix for 4.16-rc7.
      
        It resolves an issue with the ring-buffer signaling to resolve
        reported problems.
      
        It's been in linux-next for a while now with no reported issues"
      
      * tag 'char-misc-4.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        Drivers: hv: vmbus: Fix ring buffer signaling
      97235e74
    • Linus Torvalds's avatar
      Merge tag 'media/v4.16-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · cde00d21
      Linus Torvalds authored
      Pull media fixes from Mauro Carvalho Chehab:
       "Three fixes:
      
         - dvb: fix a Kconfig typo on a help text
      
         - tegra-cec: reset rx_buf_cnt when start bit detected
      
         - rc: lirc does not use LIRC_CAN_SEND_SCANCODE feature"
      
      * tag 'media/v4.16-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
        media: dvb: fix a Kconfig typo
        media: tegra-cec: reset rx_buf_cnt when start bit detected
        media: rc: lirc does not use LIRC_CAN_SEND_SCANCODE feature
      cde00d21
    • Linus Torvalds's avatar
      Merge tag 'sound-4.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 8ce72017
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "Things look calming down, but people were still busy to plaster over
        small holes:
      
         - Two fixes to harden against races in aloop driver
      
         - A correction of a long-standing bug in USB-audio UAC2 processing
           unit parser
      
         - As usual suspects, HD-audio: a workaround for Coffee Lake
           controller and a few other device-specific fixes
      
        All small and for stable"
      
      * tag 'sound-4.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: aloop: Fix access to not-yet-ready substream via cable
        ALSA: aloop: Sync stale timer before release
        ALSA: hda/realtek - Fix speaker no sound after system resume
        ALSA: hda/realtek - Fix Dell headset Mic can't record
        ALSA: hda - Force polling mode on CFL for fixing codec communication
        ALSA: usb-audio: Fix parsing descriptor of UAC2 processing unit
        ALSA: hda/realtek - Always immediately update mute LED with pin VREF
      8ce72017
    • Masami Hiramatsu's avatar
      selftests: ftrace: Add a testcase for probepoint · dfa453bc
      Masami Hiramatsu authored
      Add a testcase for probe point definition. This tests
      symbol, address and symbol+offset syntax. The offset
      must be positive and smaller than UINT_MAX.
      
      Link: http://lkml.kernel.org/r/152129043097.31874.14273580606301767394.stgit@devboxSigned-off-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      dfa453bc
    • Masami Hiramatsu's avatar
      selftests: ftrace: Add a testcase for string type with kprobe_event · 5fbdbed7
      Masami Hiramatsu authored
      Add a testcase for string type with kprobe event.
      This tests good/bad syntax combinations and also
      the traced data is correct in several way.
      
      Link: http://lkml.kernel.org/r/152129038381.31874.9201387794548737554.stgit@devboxSigned-off-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      5fbdbed7
    • Masami Hiramatsu's avatar
      selftests: ftrace: Add probe event argument syntax testcase · 871bef20
      Masami Hiramatsu authored
      Add a testcase for probe event argument syntax which
      ensures the kprobe_events interface correctly parses
      given event arguments.
      
      Link: http://lkml.kernel.org/r/152129033679.31874.12705519603869152799.stgit@devboxSigned-off-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      871bef20
    • Masami Hiramatsu's avatar
      tracing: probeevent: Fix to support minus offset from symbol · c5d343b6
      Masami Hiramatsu authored
      In Documentation/trace/kprobetrace.txt, it says
      
       @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
      
      However, the parser doesn't parse minus offset correctly, since
      commit 2fba0c88 ("tracing/kprobes: Fix probe offset to be
      unsigned") drops minus ("-") offset support for kprobe probe
      address usage.
      
      This fixes the traceprobe_split_symbol_offset() to parse minus
      offset again with checking the offset range, and add a minus
      offset check in kprobe probe address usage.
      
      Link: http://lkml.kernel.org/r/152129028983.31874.13419301530285775521.stgit@devbox
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Fixes: 2fba0c88 ("tracing/kprobes: Fix probe offset to be unsigned")
      Acked-by: default avatarNamhyung Kim <namhyung@kernel.org>
      Signed-off-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      c5d343b6
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · f36b7534
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "13 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm, thp: do not cause memcg oom for thp
        mm/vmscan: wake up flushers for legacy cgroups too
        Revert "mm: page_alloc: skip over regions of invalid pfns where possible"
        mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink()
        mm/thp: do not wait for lock_page() in deferred_split_scan()
        mm/khugepaged.c: convert VM_BUG_ON() to collapse fail
        x86/mm: implement free pmd/pte page interfaces
        mm/vmalloc: add interfaces to free unmapped page table
        h8300: remove extraneous __BIG_ENDIAN definition
        hugetlbfs: check for pgoff value overflow
        lockdep: fix fs_reclaim warning
        MAINTAINERS: update Mark Fasheh's e-mail
        mm/mempolicy.c: avoid use uninitialized preferred_node
      f36b7534
    • Linus Torvalds's avatar
      Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 8401c72c
      Linus Torvalds authored
      Pull libnvdimm fixes from Dan Williams:
       "Two regression fixes, two bug fixes for older issues, two fixes for
        new functionality added this cycle that have userspace ABI concerns,
        and a small cleanup. These have appeared in a linux-next release and
        have a build success report from the 0day robot.
      
         * The 4.16 rework of altmap handling led to some configurations
           leaking page table allocations due to freeing from the altmap
           reservation rather than the page allocator.
      
           The impact without the fix is leaked memory and a WARN() message
           when tearing down libnvdimm namespaces. The rework also missed a
           place where error handling code needed to be removed that can lead
           to a crash if devm_memremap_pages() fails.
      
         * acpi_map_pxm_to_node() had a latent bug whereby it could
           misidentify the closest online node to a given proximity domain.
      
         * Block integrity handling was reworked several kernels back to allow
           calling add_disk() after setting up the integrity profile.
      
           The nd_btt and nd_blk drivers are just now catching up to fix
           automatic partition detection at driver load time.
      
         * The new peristence_domain attribute, a platform indicator of
           whether cpu caches are powerfail protected for example, is meant to
           be a single value enum and not a set of flags.
      
           This oversight was caught while reviewing new userspace code in
           libndctl to communicate the attribute.
      
           Fix this new enabling up so that we are not stuck with an unwanted
           userspace ABI"
      
      * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
        libnvdimm, nfit: fix persistence domain reporting
        libnvdimm, region: hide persistence_domain when unknown
        acpi, numa: fix pxm to online numa node associations
        x86, memremap: fix altmap accounting at free
        libnvdimm: remove redundant assignment to pointer 'dev'
        libnvdimm, {btt, blk}: do integrity setup before add_disk()
        kernel/memremap: Remove stale devres_free() call
      8401c72c
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-for-v4.16-rc7' of git://people.freedesktop.org/~airlied/linux · 9ec7ccc8
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "A bunch of fixes all over the place (core, i915, amdgpu, imx, sun4i,
        ast, tegra, vmwgfx), nothing too serious or worrying at this stage.
      
         - one uapi fix to stop multi-planar images with getfb
      
         - Sun4i error path and clock fixes
      
         - udl driver mmap offset fix
      
         - i915 DP MST and GPU reset fixes
      
         - vmwgfx mutex and black screen fixes
      
         - imx array underflow fix and vblank fix
      
         - amdgpu: display fixes
      
         - exynos devicetree fix
      
         - ast mode fix"
      
      * tag 'drm-fixes-for-v4.16-rc7' of git://people.freedesktop.org/~airlied/linux: (29 commits)
        drm/ast: Fixed 1280x800 Display Issue
        drm: udl: Properly check framebuffer mmap offsets
        drm/i915: Specify which engines to reset following semaphore/event lockups
        drm/vmwgfx: Fix a destoy-while-held mutex problem.
        drm/vmwgfx: Fix black screen and device errors when running without fbdev
        drm: Reject getfb for multi-plane framebuffers
        drm/amd/display: Add one to EDID's audio channel count when passing to DC
        drm/amd/display: We shouldn't set format_default on plane as atomic driver
        drm/amd/display: Fix FMT truncation programming
        drm/amd/display: Allow truncation to 10 bits
        drm/sun4i: hdmi: Fix another error handling path in 'sun4i_hdmi_bind()'
        drm/sun4i: hdmi: Fix an error handling path in 'sun4i_hdmi_bind()'
        drm/i915/dp: Write to SET_POWER dpcd to enable MST hub.
        drm/amd/display: fix dereferencing possible ERR_PTR()
        drm/amd/display: Refine disable VGA
        drm/tegra: Shutdown on driver unbind
        drm/tegra: dsi: Don't disable regulator on ->exit()
        drm/tegra: dc: Detach IOMMU group from domain only once
        dt-bindings: exynos: Document #sound-dai-cells property of the HDMI node
        drm/imx: move arming of the vblank event to atomic_flush
        ...
      9ec7ccc8
    • David Rientjes's avatar
      mm, thp: do not cause memcg oom for thp · 9d3c3354
      David Rientjes authored
      Commit 25160354 ("mm, thp: remove __GFP_NORETRY from khugepaged and
      madvised allocations") changed the page allocator to no longer detect
      thp allocations based on __GFP_NORETRY.
      
      It did not, however, modify the mem cgroup try_charge() path to avoid
      oom kill for either khugepaged collapsing or thp faulting.  It is never
      expected to oom kill a process to allocate a hugepage for thp; reclaim
      is governed by the thp defrag mode and MADV_HUGEPAGE, but allocations
      (and charging) should fallback instead of oom killing processes.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803191409420.124411@chino.kir.corp.google.com
      Fixes: 25160354 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9d3c3354
    • Andrey Ryabinin's avatar
      mm/vmscan: wake up flushers for legacy cgroups too · 1c610d5f
      Andrey Ryabinin authored
      Commit 726d061f ("mm: vmscan: kick flushers when we encounter dirty
      pages on the LRU") added flusher invocation to shrink_inactive_list()
      when many dirty pages on the LRU are encountered.
      
      However, shrink_inactive_list() doesn't wake up flushers for legacy
      cgroup reclaim, so the next commit bbef9384 ("mm: vmscan: remove old
      flusher wakeup from direct reclaim path") removed the only source of
      flusher's wake up in legacy mem cgroup reclaim path.
      
      This leads to premature OOM if there is too many dirty pages in cgroup:
          # mkdir /sys/fs/cgroup/memory/test
          # echo $$ > /sys/fs/cgroup/memory/test/tasks
          # echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
          # dd if=/dev/zero of=tmp_file bs=1M count=100
          Killed
      
          dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
      
          Call Trace:
           dump_stack+0x46/0x65
           dump_header+0x6b/0x2ac
           oom_kill_process+0x21c/0x4a0
           out_of_memory+0x2a5/0x4b0
           mem_cgroup_out_of_memory+0x3b/0x60
           mem_cgroup_oom_synchronize+0x2ed/0x330
           pagefault_out_of_memory+0x24/0x54
           __do_page_fault+0x521/0x540
           page_fault+0x45/0x50
      
          Task in /test killed as a result of limit of /test
          memory: usage 51200kB, limit 51200kB, failcnt 73
          memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
          kmem: usage 296kB, limit 9007199254740988kB, failcnt 0
          Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB
                  mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB
      	    active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB
          Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child
          Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB
          oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      Wake up flushers in legacy cgroup reclaim too.
      
      Link: http://lkml.kernel.org/r/20180315164553.17856-1-aryabinin@virtuozzo.com
      Fixes: bbef9384 ("mm: vmscan: remove old flusher wakeup from direct reclaim path")
      Signed-off-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Tested-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1c610d5f
    • Daniel Vacek's avatar
      Revert "mm: page_alloc: skip over regions of invalid pfns where possible" · f59f1caf
      Daniel Vacek authored
      This reverts commit b92df1de ("mm: page_alloc: skip over regions of
      invalid pfns where possible").  The commit is meant to be a boot init
      speed up skipping the loop in memmap_init_zone() for invalid pfns.
      
      But given some specific memory mapping on x86_64 (or more generally
      theoretically anywhere but on arm with CONFIG_HAVE_ARCH_PFN_VALID) the
      implementation also skips valid pfns which is plain wrong and causes
      'kernel BUG at mm/page_alloc.c:1389!'
      
        crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
        kernel BUG at mm/page_alloc.c:1389!
        invalid opcode: 0000 [#1] SMP
        --
        RIP: 0010: move_freepages+0x15e/0x160
        --
        Call Trace:
          move_freepages_block+0x73/0x80
          __rmqueue+0x263/0x460
          get_page_from_freelist+0x7e1/0x9e0
          __alloc_pages_nodemask+0x176/0x420
        --
      
        crash> page_init_bug -v | grep RAM
        <struct resource 0xffff88067fffd2f8>          1000 -        9bfff       System RAM (620.00 KiB)
        <struct resource 0xffff88067fffd3a0>        100000 -     430bffff       System RAM (  1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
        <struct resource 0xffff88067fffd410>      4b0c8000 -     4bf9cfff       System RAM ( 14.83 MiB = 15188.00 KiB)
        <struct resource 0xffff88067fffd480>      4bfac000 -     646b1fff       System RAM (391.02 MiB = 400408.00 KiB)
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff       System RAM (480.00 KiB)
        <struct resource 0xffff88067fffd640>     100000000 -    67fffffff       System RAM ( 22.00 GiB)
      
        crash> page_init_bug | head -6
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff       System RAM (480.00 KiB)
        <struct page 0xffffea0001ede200>   1fffff00000000  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        <struct page 0xffffea0001ede200>       505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0>
        <struct page 0xffffea0001ed8000>                0  0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA               1       4095
        <struct page 0xffffea0001edffc0>   1fffff00000400  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        BUG, zones differ!
      
        crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
              PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
        ffffea0001e00000  78000000                0        0  0 0
        ffffea0001ed7fc0  7b5ff000                0        0  0 0
        ffffea0001ed8000  7b600000                0        0  0 0       <<<<
        ffffea0001ede1c0  7b787000                0        0  0 0
        ffffea0001ede200  7b788000                0        0  1 1fffff00000000
      
      Link: http://lkml.kernel.org/r/20180316143855.29838-1-neelx@redhat.com
      Fixes: b92df1de ("mm: page_alloc: skip over regions of invalid pfns where possible")
      Signed-off-by: default avatarDaniel Vacek <neelx@redhat.com>
      Acked-by: default avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f59f1caf
    • Kirill A. Shutemov's avatar
      mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink() · b3cd54b2
      Kirill A. Shutemov authored
      shmem_unused_huge_shrink() gets called from reclaim path.  Waiting for
      page lock may lead to deadlock there.
      
      There was a bug report that may be attributed to this:
      
        http://lkml.kernel.org/r/alpine.LRH.2.11.1801242349220.30642@mail.ewheeler.net
      
      Replace lock_page() with trylock_page() and skip the page if we failed
      to lock it.  We will get to the page on the next scan.
      
      We can test for the PageTransHuge() outside the page lock as we only
      need protection against splitting the page under us.  Holding pin oni
      the page is enough for this.
      
      Link: http://lkml.kernel.org/r/20180316210830.43738-1-kirill.shutemov@linux.intel.com
      Fixes: 779750d2 ("shmem: split huge pages beyond i_size under memory pressure")
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: default avatarEric Wheeler <linux-mm@lists.ewheeler.net>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b3cd54b2
    • Kirill A. Shutemov's avatar
      mm/thp: do not wait for lock_page() in deferred_split_scan() · fa41b900
      Kirill A. Shutemov authored
      deferred_split_scan() gets called from reclaim path.  Waiting for page
      lock may lead to deadlock there.
      
      Replace lock_page() with trylock_page() and skip the page if we failed
      to lock it.  We will get to the page on the next scan.
      
      Link: http://lkml.kernel.org/r/20180315150747.31945-1-kirill.shutemov@linux.intel.com
      Fixes: 9a982250 ("thp: introduce deferred_split_huge_page()")
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fa41b900
    • Kirill A. Shutemov's avatar
      mm/khugepaged.c: convert VM_BUG_ON() to collapse fail · fece2029
      Kirill A. Shutemov authored
      khugepaged is not yet able to convert PTE-mapped huge pages back to PMD
      mapped.  We do not collapse such pages.  See check
      khugepaged_scan_pmd().
      
      But if between khugepaged_scan_pmd() and __collapse_huge_page_isolate()
      somebody managed to instantiate THP in the range and then split the PMD
      back to PTEs we would have a problem --
      VM_BUG_ON_PAGE(PageCompound(page)) will get triggered.
      
      It's possible since we drop mmap_sem during collapse to re-take for
      write.
      
      Replace the VM_BUG_ON() with graceful collapse fail.
      
      Link: http://lkml.kernel.org/r/20180315152353.27989-1-kirill.shutemov@linux.intel.com
      Fixes: b1caa957 ("khugepaged: ignore pmd tables with THP mapped with ptes")
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fece2029
    • Toshi Kani's avatar
      x86/mm: implement free pmd/pte page interfaces · 28ee90fe
      Toshi Kani authored
      Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
      clear a given pud/pmd entry and free up lower level page table(s).
      
      The address range associated with the pud/pmd entry must have been
      purged by INVLPG.
      
      Link: http://lkml.kernel.org/r/20180314180155.19492-3-toshi.kani@hpe.com
      Fixes: e61ce6ad ("mm: change ioremap to set up huge I/O mappings")
      Signed-off-by: default avatarToshi Kani <toshi.kani@hpe.com>
      Reported-by: default avatarLei Li <lious.lilei@hisilicon.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      28ee90fe
    • Toshi Kani's avatar
      mm/vmalloc: add interfaces to free unmapped page table · b6bdb751
      Toshi Kani authored
      On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may
      create pud/pmd mappings.  A kernel panic was observed on arm64 systems
      with Cortex-A75 in the following steps as described by Hanjun Guo.
      
       1. ioremap a 4K size, valid page table will build,
       2. iounmap it, pte0 will set to 0;
       3. ioremap the same address with 2M size, pgd/pmd is unchanged,
          then set the a new value for pmd;
       4. pte0 is leaked;
       5. CPU may meet exception because the old pmd is still in TLB,
          which will lead to kernel panic.
      
      This panic is not reproducible on x86.  INVLPG, called from iounmap,
      purges all levels of entries associated with purged address on x86.  x86
      still has memory leak.
      
      The patch changes the ioremap path to free unmapped page table(s) since
      doing so in the unmap path has the following issues:
      
       - The iounmap() path is shared with vunmap(). Since vmap() only
         supports pte mappings, making vunmap() to free a pte page is an
         overhead for regular vmap users as they do not need a pte page freed
         up.
      
       - Checking if all entries in a pte page are cleared in the unmap path
         is racy, and serializing this check is expensive.
      
       - The unmap path calls free_vmap_area_noflush() to do lazy TLB purges.
         Clearing a pud/pmd entry before the lazy TLB purges needs extra TLB
         purge.
      
      Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which
      clear a given pud/pmd entry and free up a page for the lower level
      entries.
      
      This patch implements their stub functions on x86 and arm64, which work
      as workaround.
      
      [akpm@linux-foundation.org: fix typo in pmd_free_pte_page() stub]
      Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com
      Fixes: e61ce6ad ("mm: change ioremap to set up huge I/O mappings")
      Reported-by: default avatarLei Li <lious.lilei@hisilicon.com>
      Signed-off-by: default avatarToshi Kani <toshi.kani@hpe.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Wang Xuefeng <wxf.wang@hisilicon.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b6bdb751
    • Arnd Bergmann's avatar
      h8300: remove extraneous __BIG_ENDIAN definition · 1705f7c5
      Arnd Bergmann authored
      A bugfix I did earlier caused a build regression on h8300, which defines
      the __BIG_ENDIAN macro in a slightly different way than the generic
      code:
      
        arch/h8300/include/asm/byteorder.h:5:0: warning: "__BIG_ENDIAN" redefined
      
      We don't need to define it here, as the same macro is already provided
      by the linux/byteorder/big_endian.h, and that version does not conflict.
      
      While this is a v4.16 regression, my earlier patch also got backported
      to the 4.14 and 4.15 stable kernels, so we need the fixup there as well.
      
      Link: http://lkml.kernel.org/r/20180313120752.2645129-1-arnd@arndb.de
      Fixes: 101110f6 ("Kbuild: always define endianess in kconfig.h")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1705f7c5
    • Mike Kravetz's avatar
      hugetlbfs: check for pgoff value overflow · 63489f8e
      Mike Kravetz authored
      A vma with vm_pgoff large enough to overflow a loff_t type when
      converted to a byte offset can be passed via the remap_file_pages system
      call.  The hugetlbfs mmap routine uses the byte offset to calculate
      reservations and file size.
      
      A sequence such as:
      
        mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
        remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);
      
      will result in the following when task exits/file closed,
      
        kernel BUG at mm/hugetlb.c:749!
        Call Trace:
          hugetlbfs_evict_inode+0x2f/0x40
          evict+0xcb/0x190
          __dentry_kill+0xcb/0x150
          __fput+0x164/0x1e0
          task_work_run+0x84/0xa0
          exit_to_usermode_loop+0x7d/0x80
          do_syscall_64+0x18b/0x190
          entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      The overflowed pgoff value causes hugetlbfs to try to set up a mapping
      with a negative range (end < start) that leaves invalid state which
      causes the BUG.
      
      The previous overflow fix to this code was incomplete and did not take
      the remap_file_pages system call into account.
      
      [mike.kravetz@oracle.com: v3]
        Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com
      [akpm@linux-foundation.org: include mmdebug.h]
      [akpm@linux-foundation.org: fix -ve left shift count on sh]
      Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com
      Fixes: 045c7a3f ("hugetlbfs: fix offset overflow in hugetlbfs mmap")
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: default avatarNic Losby <blurbdust@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      63489f8e
    • Tetsuo Handa's avatar
      lockdep: fix fs_reclaim warning · 2e517d68
      Tetsuo Handa authored
      Dave Jones reported fs_reclaim lockdep warnings.
      
        ============================================
        WARNING: possible recursive locking detected
        4.15.0-rc9-backup-debug+ #1 Not tainted
        --------------------------------------------
        sshd/24800 is trying to acquire lock:
         (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
      
        but task is already holding lock:
         (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(fs_reclaim);
          lock(fs_reclaim);
      
         *** DEADLOCK ***
      
         May be due to missing lock nesting notation
      
        2 locks held by sshd/24800:
         #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000001a069652>] tcp_sendmsg+0x19/0x40
         #1:  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
      
        stack backtrace:
        CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
        Call Trace:
         dump_stack+0xbc/0x13f
         __lock_acquire+0xa09/0x2040
         lock_acquire+0x12e/0x350
         fs_reclaim_acquire.part.102+0x29/0x30
         kmem_cache_alloc+0x3d/0x2c0
         alloc_extent_state+0xa7/0x410
         __clear_extent_bit+0x3ea/0x570
         try_release_extent_mapping+0x21a/0x260
         __btrfs_releasepage+0xb0/0x1c0
         btrfs_releasepage+0x161/0x170
         try_to_release_page+0x162/0x1c0
         shrink_page_list+0x1d5a/0x2fb0
         shrink_inactive_list+0x451/0x940
         shrink_node_memcg.constprop.88+0x4c9/0x5e0
         shrink_node+0x12d/0x260
         try_to_free_pages+0x418/0xaf0
         __alloc_pages_slowpath+0x976/0x1790
         __alloc_pages_nodemask+0x52c/0x5c0
         new_slab+0x374/0x3f0
         ___slab_alloc.constprop.81+0x47e/0x5a0
         __slab_alloc.constprop.80+0x32/0x60
         __kmalloc_track_caller+0x267/0x310
         __kmalloc_reserve.isra.40+0x29/0x80
         __alloc_skb+0xee/0x390
         sk_stream_alloc_skb+0xb8/0x340
         tcp_sendmsg_locked+0x8e6/0x1d30
         tcp_sendmsg+0x27/0x40
         inet_sendmsg+0xd0/0x310
         sock_write_iter+0x17a/0x240
         __vfs_write+0x2ab/0x380
         vfs_write+0xfb/0x260
         SyS_write+0xb6/0x140
         do_syscall_64+0x1e5/0xc05
         entry_SYSCALL64_slow_path+0x25/0x25
      
      This warning is caused by commit d92a8cfc ("locking/lockdep:
      Rework FS_RECLAIM annotation") which replaced the use of
      lockdep_{set,clear}_current_reclaim_state() in __perform_reclaim()
      and lockdep_trace_alloc() in slab_pre_alloc_hook() with
      fs_reclaim_acquire()/ fs_reclaim_release().
      
      Since __kmalloc_reserve() from __alloc_skb() adds __GFP_NOMEMALLOC |
      __GFP_NOWARN to gfp_mask, and all reclaim path simply propagates
      __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook() is
      trying to grab the 'fake' lock again when __perform_reclaim() already
      grabbed the 'fake' lock.
      
      The
      
        /* this guy won't enter reclaim */
        if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
                return false;
      
      test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
      was added by commit cf40bd16 ("lockdep: annotate reclaim context
      (__GFP_NOFS)").  But that test is outdated because PF_MEMALLOC thread
      won't enter reclaim regardless of __GFP_NOMEMALLOC after commit
      341ce06f ("page allocator: calculate the alloc_flags for allocation
      only once") added the PF_MEMALLOC safeguard (
      
        /* Avoid recursion of direct reclaim */
        if (p->flags & PF_MEMALLOC)
                goto nopage;
      
      in __alloc_pages_slowpath()).
      
      Thus, let's fix outdated test by removing __GFP_NOMEMALLOC test and
      allow __need_fs_reclaim() to return false.
      
      Link: http://lkml.kernel.org/r/201802280650.FJC73911.FOSOMLJVFFQtHO@I-love.SAKURA.ne.jp
      Fixes: d92a8cfc ("locking/lockdep: Rework FS_RECLAIM annotation")
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: default avatarDave Jones <davej@codemonkey.org.uk>
      Tested-by: default avatarDave Jones <davej@codemonkey.org.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2e517d68