1. 10 May, 2016 2 commits
    • zsmalloc: fix zs_can_compact() integer overflow · 44f43e99
      Sergey Senozhatsky authored
      zs_can_compact() has two race conditions in its core calculation:
      
      unsigned long obj_wasted = zs_stat_get(class, OBJ_ALLOCATED) -
      				zs_stat_get(class, OBJ_USED);
      
      1) classes are not locked, so the numbers of allocated and used
         objects can change by the concurrent ops happening on other CPUs
      2) shrinker invokes it from preemptible context
      
      Depending on the circumstances, OBJ_ALLOCATED can therefore become
      less than OBJ_USED, which can result in either a very high or a
      negative `total_scan' value calculated later in do_shrink_slab().
      
      do_shrink_slab() has some logic to prevent those cases:
      
       vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
       vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
       vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-64
       vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
       vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
       vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
      
      However, due to the way `total_scan' is calculated, not every
      shrinker->count_objects() overflow can be spotted and handled.
      To demonstrate the latter, I added some debugging code to do_shrink_slab()
      (x86_64) and the results were:
      
       vmscan: OVERFLOW: shrinker->count_objects() == -1 [18446744073709551615]
       vmscan: but total_scan > 0: 92679974445502
       vmscan: resulting total_scan: 92679974445502
      [..]
       vmscan: OVERFLOW: shrinker->count_objects() == -1 [18446744073709551615]
       vmscan: but total_scan > 0: 22634041808232578
       vmscan: resulting total_scan: 22634041808232578
      
      Even though shrinker->count_objects() has returned an overflowed value,
      the resulting `total_scan' is positive, and, what is more worrisome, it
      is insanely huge. This value is getting used later on in
      shrinker->scan_objects() loop:
      
              while (total_scan >= batch_size ||
                     total_scan >= freeable) {
                      unsigned long ret;
                      unsigned long nr_to_scan = min(batch_size, total_scan);
      
                      shrinkctl->nr_to_scan = nr_to_scan;
                      ret = shrinker->scan_objects(shrinker, shrinkctl);
                      if (ret == SHRINK_STOP)
                              break;
                      freed += ret;
      
                      count_vm_events(SLABS_SCANNED, nr_to_scan);
                      total_scan -= nr_to_scan;
      
                      cond_resched();
              }
      
      `total_scan >= batch_size' is true for a very-very long time and
      'total_scan >= freeable' is also true for quite some time, because
      `freeable < 0' and `total_scan' is large enough, for example,
      22634041808232578. The only break condition, in the given scheme of
      things, is shrinker->scan_objects() == SHRINK_STOP test, which is a
      bit too weak to rely on, especially in heavy zsmalloc-usage scenarios.
      
      To fix the issue, take a pool stat snapshot and use it instead of
      racy zs_stat_get() calls.
      
      Link: http://lkml.kernel.org/r/20160509140052.3389-1-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>        [4.3+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
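      The wrap-around described in the zsmalloc commit above can be
      reproduced in user space. This is a minimal sketch, not the kernel
      patch: wasted_racy(), wasted_snapshot() and demo_wasted() are
      hypothetical names, and the caller is assumed to have read both
      counters once into local variables (the "snapshot") before calling.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the core calculation: unsigned subtraction
 * wraps around to a huge value when allocated < used.
 */
unsigned long wasted_racy(unsigned long allocated, unsigned long used)
{
	return allocated - used;
}

/*
 * Guarded variant operating on a snapshot of both counters
 * (assumption: the caller read each counter exactly once).
 */
unsigned long wasted_snapshot(unsigned long allocated, unsigned long used)
{
	if (allocated <= used)
		return 0;	/* nothing to compact, and no wrap-around */
	return allocated - used;
}

/* Usage: a racy read where "used" briefly exceeds "allocated". */
void demo_wasted(void)
{
	assert(wasted_racy(10, 11) == (unsigned long)-1);
	assert(wasted_snapshot(10, 11) == 0);
	assert(wasted_snapshot(12, 10) == 2);
}
```

      The guard makes the helper return 0 in the racy case, so the value
      never reaches the shrinker as an "overflowed" object count.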
    • Revert "proc/base: make prompt shell start from new line after executing "cat /proc/$pid/wchan"" · 1e92a61c
      Robin Humble authored
      This reverts the 4.6-rc1 commit 7e2bc81d ("proc/base: make prompt
      shell start from new line after executing "cat /proc/$pid/wchan"")
      because it breaks /proc/$PID/wchan formatting in ps and top.
      
      Revert also because the patch is inconsistent - it adds a newline at the
      end of only the '0' wchan, and does not add a newline when
      /proc/$PID/wchan contains a symbol name.
      
      eg.
      $ ps -eo pid,stat,wchan,comm
      PID STAT WCHAN  COMMAND
      ...
      1189 S    -      dbus-launch
      1190 Ssl  0
      dbus-daemon
      1198 Sl   0
      lightdm
      1299 Ss   ep_pol systemd
      1301 S    -      (sd-pam)
      1304 Ss   wait   sh
      Signed-off-by: Robin Humble <plaguedbypenguins@gmail.com>
      Cc: Minfei Huang <mnfhuang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 08 May, 2016 1 commit
  3. 07 May, 2016 7 commits
  4. 06 May, 2016 30 commits
    • Merge branch 'for-linus' of git://git.kernel.dk/linux-block · 07837831
      Linus Torvalds authored
      Pull writeback fix from Jens Axboe:
       "Just a single fix for domain aware writeback, fixing a regression that
        can cause balance_dirty_pages() to keep looping while not getting any
        work done"
      
      * 'for-linus' of git://git.kernel.dk/linux-block:
        writeback: Fix performance regression in wb_over_bg_thresh()
    • Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 3f86ba5d
      Linus Torvalds authored
      Pull x86 fixes from Ingo Molnar:
       "This contains two fixes: a boot fix for older SGI/UV systems, and an
        APIC calibration fix"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/tsc: Read all ratio bits from MSR_PLATFORM_INFO
        x86/platform/UV: Bring back the call to map_low_mmrs in uv_system_init
    • Merge tag 'pm+acpi-4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 01ec7167
      Linus Torvalds authored
      Pull power management and ACPI fixes from Rafael Wysocki:
       "Fixes for problems introduced or discovered recently (intel_pstate,
        sti-cpufreq, ARM64 cpuidle, Operating Performance Points framework,
        generic device properties framework) and one fix for a hotplug-related
        deadlock in ACPICA that's been there forever, but is nasty enough.
      
        Specifics:
      
         - Fix for a recent regression in the intel_pstate driver causing it
           to fail to restore the HWP (HW-managed P-states) configuration of
           the boot CPU after suspend-to-RAM (Rafael Wysocki).
      
         - Fix for two recent regressions in the intel_pstate driver, one that
           can trigger a divide by zero if the driver is accessed via sysfs
           before it manages to take the first sample and one causing it to
           fail to update a structure field used in a trace point, so the
           information coming from it is less useful (Rafael Wysocki).
      
         - Fix for a problem in the sti-cpufreq driver introduced during the
           4.5 cycle that causes it to break CPU PM in multi-platform kernels
           by registering cpufreq-dt (which subsequently doesn't work)
           unconditionally and preventing the driver that would actually work
           from registering (Sudeep Holla).
      
         - Stable-candidate fix for an ARM64 cpuidle issue causing idle state
           usage counters to be incorrectly updated for idle states that were
           not entered due to errors (James Morse).
      
         - Fix for a recently introduced issue in the OPP (Operating
           Performance Points) framework causing it to print bogus error
           messages for missing optional regulators (Viresh Kumar).
      
         - Fix for a recently introduced issue in the generic device
           properties framework that may cause it to attempt to dereference
           an invalid pointer in some cases (Heikki Krogerus).
      
         - Fix for a deadlock in the ACPICA core that may be triggered by
           device (eg Thunderbolt) hotplug (Prarit Bhargava)"
      
      * tag 'pm+acpi-4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        PM / OPP: Remove useless check
        ACPICA: Dispatcher: Update thread ID for recursive method calls
        intel_pstate: Fix intel_pstate_get()
        cpufreq: intel_pstate: Fix HWP on boot CPU after system resume
        cpufreq: st: enable selective initialization based on the platform
        ARM: cpuidle: Pass on arm_cpuidle_suspend()'s return value
        device property: Avoid potential dereferences of invalid pointers
    • Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 17d25a33
      Linus Torvalds authored
      Pull scheduler fix from Ingo Molnar:
       "This contains a single fix that fixes a nohz tick stopping bug when
        mixed-policy SCHED_FIFO and SCHED_RR tasks are present on a runqueue"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        nohz/full, sched/rt: Fix missed tick-reenabling bug in sched_can_stop_tick()
    • Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 18fb92c3
      Linus Torvalds authored
      Pull perf fixes from Ingo Molnar:
       "This tree contains two fixes: new Intel CPU model numbers and an
        AMD/iommu uncore PMU driver fix"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/amd/iommu: Do not register a task ctx for uncore like PMUs
        perf/x86: Add model numbers for Kabylake CPUs
    • Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · cade8184
      Linus Torvalds authored
      Pull EFI fixes from Ingo Molnar:
       "This tree contains three fixes: a console spam fix, a file pattern fix
        and a sysfb_efi fix for a bug that triggered on older ThinkPads"
      
      * 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/sysfb_efi: Fix valid BAR address range check
        x86/efi-bgrt: Switch all pr_err() to pr_notice() for invalid BGRT
        MAINTAINERS: Remove asterisk from EFI directory names
    • Merge branch 'parisc-4.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux · 83a395d3
      Linus Torvalds authored
      Pull parisc fix from Helge Deller:
       "Patch from Dmitry V Levin to fix a kernel crash when a straced process
        calls the (invalid) syscall which is equal to value of __NR_Linux_syscalls"
      
      * 'parisc-4.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
        parisc: fix a bug when syscall number of tracee is __NR_Linux_syscalls
    • Merge tag 'arc-4.6-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc · dd287690
      Linus Torvalds authored
      Pull ARC fixes from Vineet Gupta:
       "Late in the cycle, but this has fixes for a couple of issues: a PAE40
        boot crash and Arnd spotting lack of barriers in BE io-accessors.
      
        The 3rd patch for enabling highmem in low physical mem ;-) honestly is
        more than a "fix" but it's been in the works for some time, seems to
        be stable in testing and enables 2 of our customers to go forward with
        the 4.6 kernel.
      
         - Fix for PTE truncation in PAE40 builds
         - Fix for big endian IO accessors lacking IO barrier
         - Allow HIGHMEM to work with low physical addresses"
      
      * tag 'arc-4.6-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
        ARC: support HIGHMEM even without PAE40
        ARC: Fix PAE40 boot failures due to PTE truncation
        ARC: Add missing io barriers to io{read,write}{16,32}be()
    • Merge tag 'powerpc-4.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 4883d11e
      Linus Torvalds authored
      Pull powerpc fix from Michael Ellerman:
       "Fix bad inline asm constraint in create_zero_mask() from Anton
        Blanchard"
      
      * tag 'powerpc-4.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc: Fix bad inline asm constraint in create_zero_mask()
    • Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux · 659a1823
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Fixes for i915, amdgpu/radeon and imx.
      
        The IMX fix is for an autoloading regression found in Fedora.  The
        radeon fixes are the same fix applied to both amdgpu and radeon to
        avoid a hardware lockup in some circumstances with a bad mode, plus a
        fix for a double free bug I took a few hours chasing down the other
        morning.
      
        The i915 fixes are across the board, all stable material, fixing
        some hangs and suspend/resume issues, along with a live status
        regression"
      
      * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
        gpu: ipu-v3: Fix imx-ipuv3-crtc module autoloading
        drm/amdgpu: make sure vertical front porch is at least 1
        drm/radeon: make sure vertical front porch is at least 1
        drm/amdgpu: set metadata pointer to NULL after freeing.
        drm/i915: Make RPS EI/thresholds multiple of 25 on SNB-BDW
        drm/i915: Fake HDMI live status
        drm/i915: Fix eDP low vswing for Broadwell
        drm/i915/ddi: Fix eDP VDD handling during booting and suspend/resume
        drm/i915: Fix system resume if PCI device remained enabled
        drm/i915: Avoid stalling on pending flips for legacy cursor updates
    • parisc: fix a bug when syscall number of tracee is __NR_Linux_syscalls · f0b22d1b
      Dmitry V. Levin authored
      Do not load one entry beyond the end of the syscall table when the
      syscall number of a traced process equals __NR_Linux_syscalls.
      A similar bug with regular processes was fixed by commit 3bb457af
      ("[PARISC] Fix bug when syscall nr is __NR_Linux_syscalls").
      
      This bug was found by the strace test suite.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
      Acked-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Helge Deller <deller@gmx.de>
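      The off-by-one in the parisc commit above comes down to the
      boundary of a table lookup. The following is a generic sketch in C,
      not the parisc assembly that the commit touches: NR_SYSCALLS,
      syscall_table and lookup_entry() are hypothetical names, with
      NR_SYSCALLS standing in for __NR_Linux_syscalls.

```c
#include <assert.h>

#define NR_SYSCALLS 4	/* hypothetical table size */

/* Hypothetical table: valid indices are 0 .. NR_SYSCALLS - 1. */
static const int syscall_table[NR_SYSCALLS] = { 10, 11, 12, 13 };

/*
 * Store the table entry for nr in *out and return 1, or return 0 if
 * nr is out of range. The comparison must be ">=": a check using ">"
 * would accept nr == NR_SYSCALLS and read one entry past the end.
 */
int lookup_entry(unsigned long nr, int *out)
{
	if (nr >= NR_SYSCALLS)
		return 0;
	*out = syscall_table[nr];
	return 1;
}

/* Usage: the last valid index works, the table size itself is rejected. */
void demo_lookup(void)
{
	int v = 0;

	assert(lookup_entry(NR_SYSCALLS - 1, &v) == 1 && v == 13);
	assert(lookup_entry(NR_SYSCALLS, &v) == 0);
}
```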
    • Merge branches 'pm-opp-fixes', 'pm-cpufreq-fixes' and 'pm-cpuidle-fixes' · 5f2f88e3
      Rafael J. Wysocki authored
      * pm-opp-fixes:
        PM / OPP: Remove useless check
      
      * pm-cpufreq-fixes:
        intel_pstate: Fix intel_pstate_get()
        cpufreq: intel_pstate: Fix HWP on boot CPU after system resume
        cpufreq: st: enable selective initialization based on the platform
      
      * pm-cpuidle-fixes:
        ARM: cpuidle: Pass on arm_cpuidle_suspend()'s return value
    • Merge branches 'acpica-fixes' and 'device-properties-fixes' · 7c21b38c
      Rafael J. Wysocki authored
      * acpica-fixes:
        ACPICA: Dispatcher: Update thread ID for recursive method calls
      
      * device-properties-fixes:
        device property: Avoid potential dereferences of invalid pointers
    • x86/tsc: Read all ratio bits from MSR_PLATFORM_INFO · 886123fb
      Chen Yu authored
      Currently we read the tsc ratio: ratio = (MSR_PLATFORM_INFO >> 8) & 0x1f;
      
      Thus we get bits 8-12 of MSR_PLATFORM_INFO; however, according to the
      SDM (35.5), the ratio bits are bits 8-15.
      
      Ignoring the upper bits can result in an incorrect tsc ratio, which causes the
      TSC calibration and the Local APIC timer frequency to be incorrect.
      
      Fix this problem by masking 0xff instead.
      
      [ tglx: Massaged changelog ]
      
      Fixes: 7da7c156 "x86, tsc: Add static (MSR) TSC calibration on Intel Atom SoCs"
      Signed-off-by: Chen Yu <yu.c.chen@intel.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: stable@vger.kernel.org
      Cc: Bin Gao <bin.gao@intel.com>
      Cc: Len Brown <lenb@kernel.org>
      Link: http://lkml.kernel.org/r/1462505619-5516-1-git-send-email-yu.c.chen@intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
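      The effect of the two masks in the x86/tsc commit above is easy to
      check by hand. A minimal sketch, assuming the MSR value is already
      available as a plain 64-bit integer; ratio_5bit(), ratio_8bit() and
      the sample value are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Old extraction: keeps only bits 8-12 of the MSR value. */
uint32_t ratio_5bit(uint64_t msr)
{
	return (uint32_t)((msr >> 8) & 0x1f);
}

/* Fixed extraction: keeps bits 8-15, the full ratio field. */
uint32_t ratio_8bit(uint64_t msr)
{
	return (uint32_t)((msr >> 8) & 0xff);
}

/* Usage: a ratio of 36 (0x24) does not fit into five bits. */
void demo_ratio(void)
{
	uint64_t msr = (uint64_t)0x24 << 8;	/* illustrative value */

	assert(ratio_5bit(msr) == 0x04);	/* upper bits lost */
	assert(ratio_8bit(msr) == 0x24);
}
```

      Any ratio of 32 or more is truncated by the 0x1f mask, which is how
      the calibration ends up with a wrong frequency.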
    • Merge branch 'akpm' (patches from Andrew) · 9caa7e78
      Linus Torvalds authored
      Merge fixes from Andrew Morton:
       "14 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        byteswap: try to avoid __builtin_constant_p gcc bug
        lib/stackdepot: avoid to return 0 handle
        mm: fix kcompactd hang during memory offlining
        modpost: fix module autoloading for OF devices with generic compatible property
        proc: prevent accessing /proc/<PID>/environ until it's ready
        mm/zswap: provide unique zpool name
        mm: thp: kvm: fix memory corruption in KVM with THP enabled
        MAINTAINERS: fix Rajendra Nayak's address
        mm, cma: prevent nr_isolated_* counters from going negative
        mm: update min_free_kbytes from khugepaged after core initialization
        huge pagecache: mmap_sem is unlocked when truncation splits pmd
        rapidio/mport_cdev: fix uapi type definitions
        mm: memcontrol: let v2 cgroups follow changes in system swappiness
        mm: thp: correct split_huge_pages file permission
    • mailmap: add John Paul Adrian Glaubitz · 43a3e837
      Linus Torvalds authored
      Apparently patchwork ended up truncating the full name.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 7270a3f7
      Linus Torvalds authored
      Pull libnvdimm fixes from Dan Williams:
      
       - a fix for the persistent memory 'struct page' driver.  The
         implementation overlooked the fact that pages are allocated in 2MB
         units leading to -ENOMEM when establishing some configurations.
      
         It's tagged for -stable as the problem was introduced with the
         initial implementation in 4.5.
      
       - The new "error status translation" routine, introduced with the 4.6
         updates to the nfit driver, missed a necessary path in
         acpi_nfit_ctl().
      
         The end result is that we are falsely assuming commands complete
         successfully when the embedded status says otherwise.
      
      * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
        nfit: fix translation of command status results
        libnvdimm, pfn: fix memmap reservation sizing
    • byteswap: try to avoid __builtin_constant_p gcc bug · 7322dd75
      Arnd Bergmann authored
      This is another attempt to avoid a regression in wwn_to_u64() after it
      started using get_unaligned_be64(), which in turn ran into a bug on
      gcc-4.9 through 6.1.
      
      The regression got introduced due to the combination of two separate
      workarounds (commits e3bde956: "include/linux/unaligned: force
      inlining of byteswap operations" and ef3fb242: "scsi: fc: use
      get/put_unaligned64 for wwn access") that each try to sidestep distinct
      problems with gcc behavior (code growth and increased stack usage).
      
      Unfortunately after both have been applied, a more serious gcc bug has
      been uncovered, leading to incorrect object code that discards part of a
      function and causes undefined behavior.
      
      As part of this problem is how __builtin_constant_p gets evaluated on an
      argument passed by reference into an inline function, this avoids the
      use of __builtin_constant_p() for all architectures that set
      CONFIG_ARCH_USE_BUILTIN_BSWAP.  Most architectures do not set
      ARCH_SUPPORTS_OPTIMIZED_INLINING, which means they probably do not
      suffer from the problem in the qla2xxx driver, but they might still run
      into it elsewhere.
      
      Both of the original workarounds were only merged in the 4.6 kernel, and
      the bug that is fixed by this patch should only appear if both are
      there, so we probably don't need to backport the fix.  On the other
      hand, it works by simplifying the code path and should not have any
      negative effects.
      
      [arnd@arndb.de: fix older gcc warnings]
        (http://lkml.kernel.org/r/12243652.bxSxEgjgfk@wuerfel)
      Link: https://lkml.org/lkml/headers/2016/4/12/1103
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70232
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646
      Fixes: e3bde956 ("include/linux/unaligned: force inlining of byteswap operations")
      Fixes: ef3fb242 ("scsi: fc: use get/put_unaligned64 for wwn access")
      Link: http://lkml.kernel.org/r/1780465.XdtPJpi8Tt@wuerfel
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Tested-by: Josh Poimboeuf <jpoimboe@redhat.com> # on gcc-5.3
      Tested-by: Quinn Tran <quinn.tran@qlogic.com>
      Cc: Martin Jambor <mjambor@suse.cz>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Himanshu Madhani <himanshu.madhani@qlogic.com>
      Cc: Jan Hubicka <hubicka@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/stackdepot: avoid to return 0 handle · 7c31190b
      Joonsoo Kim authored
      Recently, we started allowing stack traces whose hash value is 0 to be
      saved.  As a result, stackdepot can return a 0 handle even on success,
      and users of stackdepot cannot tell success from failure.  To solve
      this, this patch adds one bit to the handle and sets it in every valid
      handle.  After that, a valid handle is never 0, and a 0 handle
      correctly represents failure.
      
      Fixes: 33334e25 ("lib/stackdepot.c: allow the stack trace hash to be zero")
      Link: http://lkml.kernel.org/r/1462252403-1106-1-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
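      The "valid bit" idea from the stackdepot commit above can be
      sketched with plain shifts and masks. The field widths and the
      names make_handle(), handle_is_valid() and HANDLE_VALID_BIT are
      assumptions for illustration; the real handle layout in
      lib/stackdepot.c may differ.

```c
#include <assert.h>
#include <stdint.h>

#define HANDLE_VALID_BIT (1u << 31)	/* hypothetical: top bit marks validity */

/*
 * Pack a handle from two fields (hypothetical widths: 21 bits of slab
 * index, 10 bits of offset) and always set the valid bit, so that a
 * successfully created handle can never be 0.
 */
uint32_t make_handle(uint32_t slab_index, uint32_t offset)
{
	uint32_t h = ((slab_index & 0x1fffffu) << 10) | (offset & 0x3ffu);

	return h | HANDLE_VALID_BIT;
}

/* With the valid bit in place, 0 unambiguously means failure. */
int handle_is_valid(uint32_t handle)
{
	return handle != 0;
}

/* Usage: even an all-zero index/offset pair yields a non-zero handle. */
void demo_handle(void)
{
	assert(make_handle(0, 0) != 0);
	assert(handle_is_valid(make_handle(0, 0)));
	assert(!handle_is_valid(0));
}
```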
    • mm: fix kcompactd hang during memory offlining · 172400c6
      Vlastimil Babka authored
      Assume memory47 is the last online block left in node1.  This will hang:
      
        # echo offline > /sys/devices/system/node/node1/memory47/state
      
      After a couple of minutes, the following pops up in dmesg:
      
        INFO: task bash:957 blocked for more than 120 seconds.
               Not tainted 4.6.0-rc6+ #6
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        bash            D ffff8800b7adbaf8     0   957    951 0x00000000
        Call Trace:
          schedule+0x35/0x80
          schedule_timeout+0x1ac/0x270
          wait_for_completion+0xe1/0x120
          kthread_stop+0x4f/0x110
          kcompactd_stop+0x26/0x40
          __offline_pages.constprop.28+0x7e6/0x840
          offline_pages+0x11/0x20
          memory_block_action+0x73/0x1d0
          memory_subsys_offline+0x47/0x60
          device_offline+0x86/0xb0
          store_mem_state+0xda/0xf0
          dev_attr_store+0x18/0x30
          sysfs_kf_write+0x37/0x40
          kernfs_fop_write+0x11d/0x170
          __vfs_write+0x37/0x120
          vfs_write+0xa9/0x1a0
          SyS_write+0x55/0xc0
          entry_SYSCALL_64_fastpath+0x1a/0xa4
      
      kcompactd is waiting for kcompactd_max_order > 0 when it's woken up to
      actually exit.  Check kthread_should_stop() to break out of the wait.
      
      Fixes: 698b1b30 ("mm, compaction: introduce kcompactd").
      Reported-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
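      The kcompactd hang above is a wait condition that a stop request
      cannot satisfy. A simplified user-space sketch of the two
      predicates follows; struct node_state and both function names are
      hypothetical, and should_stop stands in for what
      kthread_should_stop() would report in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced view of the per-node state the thread waits on. */
struct node_state {
	int kcompactd_max_order;	/* > 0 means compaction work is pending */
	bool should_stop;		/* stands in for kthread_should_stop() */
};

/* Old wait condition: only pending work ends the wait. */
bool wait_done_before(const struct node_state *n)
{
	return n->kcompactd_max_order > 0;
}

/* Fixed wait condition: a stop request also ends the wait. */
bool wait_done_after(const struct node_state *n)
{
	return n->kcompactd_max_order > 0 || n->should_stop;
}

/* Usage: stop requested while no work is pending. */
void demo_wait(void)
{
	struct node_state n = { 0, true };

	assert(!wait_done_before(&n));	/* thread keeps sleeping: the hang */
	assert(wait_done_after(&n));	/* thread wakes up and can exit */
}
```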
    • modpost: fix module autoloading for OF devices with generic compatible property · acbef7b7
      Philipp Zabel authored
      Since the wildcard at the end of OF module aliases is gone, autoloading
      of modules that don't match a device's last (most generic) compatible
      value fails.
      
      For example the CODA960 VPU on i.MX6Q has the SoC specific compatible
      "fsl,imx6q-vpu" and the generic compatible "cnm,coda960".  Since the
      driver currently only works with knowledge about the SoC specific
      integration, it doesn't list "cnm,coda960" in the module device table.
      
      This results in the device compatible
      "of:NvpuT<NULL>Cfsl,imx6q-vpuCcnm,coda960" not matching the module alias
      "of:N*T*Cfsl,imx6q-vpu" anymore, whereas before commit 2f632369
      ("modpost: don't add a trailing wildcard for OF module aliases") it
      matched the module alias "of:N*T*Cfsl,imx6q-vpu*".
      
      This patch adds two module aliases for each compatible, one without the
      wildcard and one with "C*" appended.
      
        $ modinfo coda | grep imx6q
        alias:          of:N*T*Cfsl,imx6q-vpuC*
        alias:          of:N*T*Cfsl,imx6q-vpu
      
      Fixes: 2f632369 ("modpost: don't add a trailing wildcard for OF module aliases")
      Link: http://lkml.kernel.org/r/1462203339-15340-1-git-send-email-p.zabel@pengutronix.de
      Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
      Cc: Javier Martinez Canillas <javier@osg.samsung.com>
      Cc: Brian Norris <computersforpeace@gmail.com>
      Cc: Sjoerd Simons <sjoerd.simons@collabora.co.uk>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>	[4.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
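      The alias matching described in the modpost commit above can be
      tried out with POSIX fnmatch(). This sketch assumes that module
      aliases are matched as shell-style glob patterns against the device
      modalias string; alias_matches() is a hypothetical helper, and
      "fsl,imx6q-vpu2" is a made-up compatible used only to show what a
      bare trailing "*" would also match.

```c
#include <assert.h>
#include <fnmatch.h>

/* Return 1 if the modalias string matches the alias glob pattern. */
int alias_matches(const char *pattern, const char *modalias)
{
	return fnmatch(pattern, modalias, 0) == 0;
}

/* Usage: the three alias forms discussed in the commit message. */
void demo_alias(void)
{
	const char *dev = "of:NvpuT<NULL>Cfsl,imx6q-vpuCcnm,coda960";

	/* No wildcard: fails once a second compatible follows. */
	assert(!alias_matches("of:N*T*Cfsl,imx6q-vpu", dev));
	/* "C*" appended: matches further compatibles. */
	assert(alias_matches("of:N*T*Cfsl,imx6q-vpuC*", dev));
	/* No wildcard still needed when it is the last compatible. */
	assert(alias_matches("of:N*T*Cfsl,imx6q-vpu",
			     "of:NvpuT<NULL>Cfsl,imx6q-vpu"));
	/* "C*" does not match a longer, different compatible (made up). */
	assert(!alias_matches("of:N*T*Cfsl,imx6q-vpuC*",
			      "of:NvpuT<NULL>Cfsl,imx6q-vpu2"));
}
```

      This is why two aliases per compatible are emitted: one covers the
      compatible in last position, the other covers it when further
      compatibles follow.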
    • proc: prevent accessing /proc/<PID>/environ until it's ready · 8148a73c
      Mathias Krause authored
      If /proc/<PID>/environ gets read before the envp[] array is fully set up
      in create_{aout,elf,elf_fdpic,flat}_tables(), we might end up trying to
      read more bytes than are actually written, as env_start will already be
      set but env_end will still be zero, making the range calculation
      underflow, allowing to read beyond the end of what has been written.
      
      Fix this as it is done for /proc/<PID>/cmdline by testing env_end for
      zero.  It is, apparently, intentionally set last in create_*_tables().
      
      This bug was found by the PaX size_overflow plugin that detected the
      arithmetic underflow of 'this_len = env_end - (env_start + src)' when
      env_end is still zero.
      
      The expected consequence is that userland trying to access
      /proc/<PID>/environ of a not yet fully set up process may get
      inconsistent data as we're in the middle of copying in the environment
      variables.
      
      Fixes: https://forums.grsecurity.net/viewtopic.php?f=3&t=4363
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=116461
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Cc: Emese Revfy <re.emese@gmail.com>
      Cc: Pax Team <pageexec@freemail.hu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
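      The underflow of 'this_len = env_end - (env_start + src)' from the
      /proc/<PID>/environ commit above is plain unsigned arithmetic. A
      minimal sketch with hypothetical helper names and made-up
      addresses; the extra range check on src in the checked variant is
      an assumption added here to keep the sketch safe, not something
      the commit message states.

```c
#include <assert.h>

/* Unchecked length: wraps to a huge value while env_end is still 0. */
unsigned long environ_len_unchecked(unsigned long env_start,
				    unsigned long env_end,
				    unsigned long src)
{
	return env_end - (env_start + src);
}

/* Checked length: report nothing to read until env_end has been set. */
unsigned long environ_len_checked(unsigned long env_start,
				  unsigned long env_end,
				  unsigned long src)
{
	if (!env_end)
		return 0;	/* environment not fully set up yet */
	if (src >= env_end - env_start)
		return 0;	/* assumption: offset past the end reads nothing */
	return env_end - (env_start + src);
}

/* Usage: env_start already set, env_end still zero. */
void demo_environ(void)
{
	assert(environ_len_unchecked(0x7000UL, 0, 0) == 0UL - 0x7000UL);
	assert(environ_len_checked(0x7000UL, 0, 0) == 0);
	assert(environ_len_checked(0x7000UL, 0x7100UL, 0x10UL) == 0xf0UL);
}
```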
    • mm/zswap: provide unique zpool name · 32a4e169
      Dan Streetman authored
      Instead of using "zswap" as the name for all zpools created, add an
      atomic counter and use "zswap%x" with the counter number for each zpool
      created, to provide a unique name for each new zpool.
      
      As zsmalloc, one of the zpool implementations, requires/expects a unique
      name for each pool created, zswap should provide a unique name.  The
      zsmalloc pool creation does not fail if a new pool with a conflicting
      name is created, unless CONFIG_ZSMALLOC_STAT is enabled; in that case,
      zsmalloc pool creation fails with -ENOMEM.  Then zswap will be unable to
      change its compressor parameter if its zpool is zsmalloc; it also will
      be unable to change its zpool parameter back to zsmalloc, if it has any
      existing old zpool using zsmalloc with page(s) in it.  Attempts to
      change the parameters will result in failure to create the zpool.  This
      changes zswap to provide a unique name for each zpool creation.
      
      Fixes: f1c54846 ("zswap: dynamic pool creation")
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Dan Streetman <dan.streetman@canonical.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
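      The naming scheme from the zswap commit above ("zswap%x" plus an
      atomic counter) can be sketched in user space with C11 atomics.
      pool_count and next_pool_name() are hypothetical names, and
      starting the numbering at 1 is an assumption of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical global counter; zero-initialized at program start. */
static atomic_uint pool_count;

/*
 * Write a fresh pool name into buf. Each call atomically increments
 * the counter, so concurrent callers never receive the same name.
 */
void next_pool_name(char *buf, size_t len)
{
	unsigned int id = atomic_fetch_add(&pool_count, 1) + 1;

	snprintf(buf, len, "zswap%x", id);
}

/* Usage: two consecutive pools get two different names. */
void demo_pool_name(void)
{
	char a[32], b[32];

	next_pool_name(a, sizeof(a));
	next_pool_name(b, sizeof(b));
	assert(a[5] != b[5] || a[6] != b[6]);
}
```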
    • mm: thp: kvm: fix memory corruption in KVM with THP enabled · 127393fb
      Andrea Arcangeli authored
      After the THP refcounting change, obtaining a compound page from
      get_user_pages() no longer allows us to assume the entire compound page
      is immediately mappable from a secondary MMU.
      
      A secondary MMU doesn't want to call get_user_pages() more than once for
      each compound page, in order to know if it can map the whole compound
      page.  So a secondary MMU needs to know from a single get_user_pages()
      invocation when it can map immediately the entire compound page to avoid
      a flood of unnecessary secondary MMU faults and spurious
      atomic_inc()/atomic_dec() (pages don't have to be pinned by MMU notifier
      users).
      
      Ideally instead of the page->_mapcount < 1 check, get_user_pages()
      should return the granularity of the "page" mapping in the "mm" passed
      to get_user_pages().  However, it's a non-trivial change to pass the "pmd"
      status belonging to the "mm" walked by get_user_pages up the stack (up
      to the caller of get_user_pages).  So the fix just checks if there is
      not a single pte mapping on the page returned by get_user_pages, and in
      turn if the caller can assume that the whole compound page is mapped in
      the current "mm" (in a pmd_trans_huge()).  In such case the entire
      compound page is safe to map into the secondary MMU without additional
      get_user_pages() calls on the surrounding tail/head pages.  In addition
      to being faster, not having to run other get_user_pages() calls also
      reduces the memory footprint of the secondary MMU fault in case the pmd
      split happened as a result of memory pressure.
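
      The decision described in this paragraph can be modelled with a toy
      structure.  This is a sketch under loud assumptions: `struct toy_page',
      its fields and can_map_whole_compound() are invented for illustration
      and do not reflect the kernel's struct page or its _mapcount encoding.

```c
#include <stdbool.h>

/* Toy model of a page; not the kernel's struct page. */
struct toy_page {
	bool compound;      /* page belongs to a transparent huge page */
	int  pte_mappings;  /* number of pte-level mappings of this page */
};

/*
 * The whole compound page may be mapped into the secondary MMU at once
 * only if the page is compound and no pte-level mapping of it exists,
 * i.e. the current "mm" still maps it through a huge pmd.
 */
static bool can_map_whole_compound(const struct toy_page *page)
{
	return page->compound && page->pte_mappings == 0;
}
```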
      
      Without this fix, after a MADV_DONTNEED (like that invoked by QEMU during
      postcopy live migration or ballooning) or after generic swapping (with a
      failure in split_huge_page() that would only result in pmd splitting and
      not a physical page split), KVM would map the whole compound page into
      the shadow pagetables, even though regular faults or userfaults (like
      UFFDIO_COPY) may map regular pages into the primary MMU as a result of the
      pte faults, leading to the guest mode and userland mode going out of
      sync and not working on the same memory at all times.
      
      Any other secondary MMU notifier manager (KVM is just one of the many
      MMU notifier users) will need the same information if it doesn't want to
      run a flood of get_user_pages_fast and it can support multiple
      granularity in the secondary MMU mappings, so I think it is justified to
      be exposed not just to KVM.
      
      The other option would be to move transparent_hugepage_adjust to
      mm/huge_memory.c but that currently has all kinds of KVM data structures
      in it, so it's definitely not a cut-and-paste work, and I couldn't do a
      fix as clean as this one for 4.6.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: "Li, Liang Z" <liang.z.li@intel.com>
      Cc: Amit Shah <amit.shah@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • MAINTAINERS: fix Rajendra Nayak's address · ff2de822
      Eric Engestrom authored
      Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>
      Cc: Rajendra Nayak <rnayak@codeaurora.org>
      Cc: Afzal Mohammed <afzal.mohd.ma@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, cma: prevent nr_isolated_* counters from going negative · 14af4a5e
      Hugh Dickins authored
      /proc/sys/vm/stat_refresh warns that nr_isolated_anon and nr_isolated_file go
      increasingly negative under compaction, which would add delay when there
      should be none, or no delay when there should be one.  The bug in compaction
      was due to a recent mmotm patch, but a much older instance of the bug was
      also noticed in isolate_migratepages_range(), which is used for CMA and
      gigantic hugepage allocations.
      
      The bug is caused by putback_movable_pages() in an error path
      decrementing the isolated counters without them being previously
      incremented by acct_isolated().  Fix isolate_migratepages_range() by
      removing the error-path putback, thus reaching acct_isolated() with
      migratepages still isolated, and leaving putback to the caller like most
      other places do.
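
      The accounting imbalance can be shown with a toy counter.  This is a
      simplified model, not the kernel code: run_buggy() and run_fixed() are
      hypothetical names, and a single local `nr_isolated' stands in for the
      per-zone nr_isolated_* counters.

```c
/*
 * Old flow: on error, the pages are put back (counter decremented)
 * before they were ever accounted (counter incremented).
 */
static long run_buggy(long pages, int error)
{
	long nr_isolated = 0;

	if (error) {
		nr_isolated -= pages;	/* putback with no matching accounting */
		return nr_isolated;
	}
	nr_isolated += pages;		/* accounting, as acct_isolated() would do */
	nr_isolated -= pages;		/* putback by the caller */
	return nr_isolated;
}

/*
 * Fixed flow: the error path no longer puts pages back itself, so the
 * accounting always runs first and the caller's putback balances it.
 */
static long run_fixed(long pages, int error)
{
	long nr_isolated = 0;

	(void)error;			/* an error no longer skips accounting */
	nr_isolated += pages;		/* accounting */
	nr_isolated -= pages;		/* putback by the caller */
	return nr_isolated;
}
```

      In this model the buggy error path leaves the counter at minus the
      number of isolated pages, while the fixed path always returns to zero.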
      
      Fixes: edc2ca61 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()")
      [vbabka@suse.cz: expanded the changelog]
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: update min_free_kbytes from khugepaged after core initialization · bc22af74
      Jason Baron authored
      Khugepaged attempts to raise min_free_kbytes if it's set too low.
      However, on boot khugepaged sets min_free_kbytes first from
      subsys_initcall(), and then the mm 'core' overrides min_free_kbytes
      afterwards from init_per_zone_wmark_min(), via a module_init() call.
      
      Khugepaged used to use a late_initcall() to set min_free_kbytes (such
      that it occurred after the core initialization), however this was
      removed when the initialization of min_free_kbytes was integrated into
      the starting of the khugepaged thread.
      
      The fix here is simply to invoke the core initialization using a
      core_initcall() instead of module_init(), such that the previous
      initialization ordering is restored.  I didn't restore the
      late_initcall() since start_stop_khugepaged() already sets
      min_free_kbytes via set_recommended_min_free_kbytes().
      
      This was noticed when we had a number of page allocation failures when
      moving a workload to a kernel with this new initialization ordering.  On
      an 8GB system this restores min_free_kbytes back to 67584 from 11365
      when CONFIG_TRANSPARENT_HUGEPAGE=y is set and either
      CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y or
      CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y.
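
      The ordering problem can be reproduced with a toy model of initcall
      levels.  This is a sketch, assuming the usual level order in which
      core_initcall() runs before subsys_initcall(), and module_init() in
      built-in code runs at the later device level.  The functions below are
      hypothetical stand-ins, and the two values are the ones quoted above
      for the 8GB system.

```c
/* Initcall levels used here; a lower level runs earlier during boot. */
enum level { CORE = 1, SUBSYS = 4, DEVICE = 6 };

static unsigned long min_free_kbytes;

/* Toy core setup: writes its computed value unconditionally. */
static void core_init(void)
{
	min_free_kbytes = 11365;
}

/* Toy khugepaged setup: only raises the value if it is too low. */
static void khugepaged_init(void)
{
	if (min_free_kbytes < 67584)
		min_free_kbytes = 67584;
}

/*
 * Run both initcalls in level order.  khugepaged always runs at the
 * SUBSYS level; core_level selects where the core setup runs.
 */
static unsigned long boot(enum level core_level)
{
	min_free_kbytes = 0;
	if (core_level < SUBSYS) {
		core_init();
		khugepaged_init();
	} else {
		khugepaged_init();
		core_init();
	}
	return min_free_kbytes;
}
```

      In this model boot(DEVICE) ends at 11365 because the core setup runs
      last and overwrites the raised value, while boot(CORE) ends at 67584.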
      
      Fixes: 79553da2 ("thp: cleanup khugepaged startup")
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • huge pagecache: mmap_sem is unlocked when truncation splits pmd · 68428398
      Hugh Dickins authored
      zap_pmd_range()'s CONFIG_DEBUG_VM !rwsem_is_locked(&mmap_sem) BUG() will
      be invalid with huge pagecache, in whatever way it is implemented:
      truncation of a hugely-mapped file to an unhugely-aligned size would
      easily hit it.
      
      (Although anon THP could in principle apply khugepaged to private file
      mappings, which are not excluded by the MADV_HUGEPAGE restrictions, in
      practice there's a vm_ops check which excludes them, so it never hits
      this BUG() - there's no interface to "truncate" an anonymous mapping.)
      
      We could complicate the test, to check i_mmap_rwsem also when there's a
      vm_file; but my inclination was to make zap_pmd_range() more readable by
      simply deleting this check.  A search has shown no report of the issue
      in the years since commit e0897d75 ("mm, thp: print useful
      information when mmap_sem is unlocked in zap_pmd_range") expanded it
      from VM_BUG_ON() - though I cannot point to what commit I would say then
      fixed the issue.
      
      But there are a couple of other patches now floating around, neither yet
      in the tree: let's agree to retain the check as a VM_BUG_ON_VMA(), as
      Matthew Wilcox has done; but subject to a vma_is_anonymous() check, as
      Kirill Shutemov has done.  And let's get this in, without waiting for
      any particular huge pagecache implementation to reach the tree.
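
      The agreed condition can be written as a small predicate.  This is a
      toy model only: `struct toy_vma' and check_fires() are hypothetical, and
      the lock state is passed in as a plain flag rather than queried from a
      real rwsem.

```c
#include <stdbool.h>

/* Toy VMA: only records whether the mapping is anonymous. */
struct toy_vma {
	bool anonymous;
};

/*
 * Returns true when the debug check would trigger: the mapping is
 * anonymous and mmap_sem is not held.  File-backed mappings are exempt,
 * since truncation may split a huge pmd without holding mmap_sem.
 */
static bool check_fires(const struct toy_vma *vma, bool mmap_sem_locked)
{
	return vma->anonymous && !mmap_sem_locked;
}
```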
      
      Matthew said "We can reproduce this BUG() in the current Linus tree with
      DAX PMDs".
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Tested-by: Matthew Wilcox <willy@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rapidio/mport_cdev: fix uapi type definitions · 4e1016da
      Alexandre Bounine authored
      Fix problems in uapi definitions reported by Gabriel Laskar: (see
      https://lkml.org/lkml/2016/4/5/205 for details)
      
       - move public header file rio_mport_cdev.h to include/uapi/linux directory
       - change types in data structures passed as IOCTL parameters
       - improve parameter checking in some IOCTL service routines
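
      The commit message does not list the types involved.  As a general
      illustration of the last two points, the sketch below assumes the
      common uapi practice of fixed-width fields (the kernel spells them
      __u32/__u64; <stdint.h> equivalents are used here so the example builds
      in userspace).  `struct example_xfer' and check_xfer() are hypothetical
      and are not taken from rio_mport_cdev.h.

```c
#include <stdint.h>

/* Hypothetical ioctl argument with the same layout on 32- and 64-bit. */
struct example_xfer {
	uint64_t buf_addr;	/* user address carried as a 64-bit integer */
	uint32_t length;	/* transfer length in bytes */
	uint32_t flags;		/* no flags defined in this sketch */
};

/*
 * Validate the argument before using it: reject empty or oversized
 * transfers and any flag bits the handler does not know about.
 * Returns 0 when the argument is acceptable, -1 otherwise.
 */
static int check_xfer(const struct example_xfer *x, uint32_t max_len)
{
	if (x->length == 0 || x->length > max_len)
		return -1;
	if (x->flags != 0)
		return -1;
	return 0;
}
```

      Because every field has a fixed width and the members are naturally
      aligned, the structure has the same size for 32-bit and 64-bit callers.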
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Reported-by: Gabriel Laskar <gabriel@lse.epita.fr>
      Tested-by: Barry Wood <barry.wood@idt.com>
      Cc: Gabriel Laskar <gabriel@lse.epita.fr>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com>
      Cc: Barry Wood <barry.wood@idt.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: let v2 cgroups follow changes in system swappiness · 4550c4e1
      Johannes Weiner authored
      Cgroup2 currently doesn't have a per-cgroup swappiness setting.  We
      might want to add one later - that's a different discussion - but until
      we do, the cgroups should always follow the system setting.  Otherwise
      it will be unchangeably set to whatever the ancestor inherited from the
      system setting at the time of cgroup creation.
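
      The intended behaviour can be sketched with a toy cgroup.  This is an
      illustration only: `struct toy_cgroup' and cgroup_swappiness() are
      hypothetical, and the global below merely stands in for the system-wide
      swappiness setting.

```c
#include <stdbool.h>

/* Stand-in for the system-wide setting; 60 is the usual default. */
static int vm_swappiness = 60;

/* Toy cgroup: hierarchy version plus a value copied at creation time. */
struct toy_cgroup {
	bool on_v2;		/* true for a cgroup2 hierarchy */
	int  swappiness;	/* snapshot taken when the group was created */
};

/*
 * cgroup2 has no per-cgroup knob, so it always reads the live system
 * value; other groups keep using their own stored setting.
 */
static int cgroup_swappiness(const struct toy_cgroup *cg)
{
	if (cg->on_v2)
		return vm_swappiness;
	return cg->swappiness;
}
```

      With this rule, changing the system setting after a v2 group has been
      created is immediately visible to that group instead of being masked
      by the stale snapshot.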
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: <stable@vger.kernel.org>	[4.5]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>