1. 18 Aug, 2017 1 commit
    • Thomas Gleixner's avatar
      kernel/watchdog: Prevent false positives with turbo modes · 7edaeb68
      Thomas Gleixner authored
      The hardlockup detector on x86 uses a performance counter based on unhalted
      CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the
      performance counter period, so the hrtimer should fire 2-3 times before the
      performance counter NMI fires. The NMI code checks whether the hrtimer
      fired since the last invocation. If not, it assumess a hard lockup.
      
      The calculation of those periods is based on the nominal CPU
      frequency. Turbo modes increase the CPU clock frequency and therefore
      shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x
      nominal frequency) the perf/NMI period is shorter than the hrtimer period
      which leads to false positives.
      
      A simple fix would be to shorten the hrtimer period, but that comes with
      the side effect of more frequent hrtimer and softlockup thread wakeups,
      which is not desired.
      
      Implement a low pass filter, which checks the perf/NMI period against
      kernel time. If the perf/NMI fires before 4/5 of the watchdog period has
      elapsed then the event is ignored and postponed to the next perf/NMI.
      
      That solves the problem and avoids the overhead of shorter hrtimer periods
      and more frequent softlockup thread wakeups.
      
      Fixes: 58687acb ("lockup_detector: Combine nmi_watchdog and softlockup detector")
      Reported-and-tested-by: default avatarKan Liang <Kan.liang@intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: dzickus@redhat.com
      Cc: prarit@redhat.com
      Cc: ak@linux.intel.com
      Cc: babu.moger@oracle.com
      Cc: peterz@infradead.org
      Cc: eranian@google.com
      Cc: acme@redhat.com
      Cc: stable@vger.kernel.org
      Cc: atomlin@redhat.com
      Cc: akpm@linux-foundation.org
      Cc: torvalds@linux-foundation.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
      7edaeb68
  2. 13 Aug, 2017 7 commits
    • Linus Torvalds's avatar
      Linux 4.13-rc5 · ef954844
      Linus Torvalds authored
      ef954844
    • Linus Torvalds's avatar
      Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus · b2298fc9
      Linus Torvalds authored
      Pull MIPS fixes from Ralf Baechle:
       "Another round of MIPS fixes:
      
         - compressed boot: Ignore a generated .c file
      
         - VDSO: Fix a register clobber list
      
         - DECstation: Fix an int-handler.S CPU_DADDI_WORKAROUNDS regression
      
         - Octeon: Fix recent cleanups that cleaned away a bit too much thus
           breaking the arch side of the EDAC and USB drivers.
      
         - uasm: Fix duplicate const in "const struct foo const bar[]" which
           GCC 7.1 no longer accepts.
      
         - Fix race on setting and getting cpu_online_mask
      
         - Fix preemption issue. To do so cleanly introduce macro to get the
           size of L3 cache line.
      
         - Revert include cleanup that sometimes results in build error
      
         - MicroMIPS uses bit 0 of the PC to indicate microMIPS mode. Make
           sure this bit is set for kernel entry as well.
      
         - Prevent configuring the kernel for both microMIPS and MT. There are
           no such CPUs currently and thus the combination is unsupported and
           results in build errors.
      
        This has been sitting in linux-next for a few days and has survived
        automated testing by Imagination's test farm. No known regressions
        pending except a number of issues that crept up due to lots of people
        switching to GCC 7.1"
      
      * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
        MIPS: Set ISA bit in entry-y for microMIPS kernels
        MIPS: Prevent building MT support for microMIPS kernels
        MIPS: PCI: Fix smp_processor_id() in preemptible
        MIPS: Introduce cpu_tcache_line_size
        MIPS: DEC: Fix an int-handler.S CPU_DADDI_WORKAROUNDS regression
        MIPS: VDSO: Fix clobber lists in fallback code paths
        Revert "MIPS: Don't unnecessarily include kmalloc.h into <asm/cache.h>."
        MIPS: OCTEON: Fix USB platform code breakage.
        MIPS: Octeon: Fix broken EDAC driver.
        MIPS: gitignore: ignore generated .c files
        MIPS: Fix race on setting and getting cpu_online_mask
        MIPS: mm: remove duplicate "const" qualifier on insn_table
      b2298fc9
    • Linus Torvalds's avatar
      Merge tag 'driver-core-4.13-rc5' of... · c9dc281d
      Linus Torvalds authored
      Merge tag 'driver-core-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
      
      Pull driver core fixes from Greg KH:
       "Here are three firmware core fixes for 4.13-rc5.
      
        All three of these fix reported issues and have been floating around
        for a few weeks. They have been in linux-next with no reported
        problems"
      
      * tag 'driver-core-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
        firmware: avoid invalid fallback aborts by using killable wait
        firmware: fix batched requests - send wake up on failure on direct lookups
        firmware: fix batched requests - wake all waiters
      c9dc281d
    • Linus Torvalds's avatar
      Merge tag 'char-misc-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · ce7ba95c
      Linus Torvalds authored
      Pull char/misc fixes from Greg KH:
       "Here are two patches for 4.13-rc5.
      
        One is a fix for a reported thunderbolt issue, and the other a fix for
        an MEI driver issue. Both have been in linux-next with no reported
        issues"
      
      * tag 'char-misc-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        thunderbolt: Do not enumerate more ports from DROM than the controller has
        mei: exclude device from suspend direct complete optimization
      ce7ba95c
    • Linus Torvalds's avatar
      Merge tag 'tty-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · 438630ef
      Linus Torvalds authored
      Pull tty/serial fixes from Greg KH:
       "Here are two tty serial driver fixes for 4.13-rc5. One is a revert of
        a -rc1 patch that turned out to not be a good idea, and the other is a
        fix for the pl011 serial driver.
      
        Both have been in linux-next with no reported issues"
      
      * tag 'tty-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        Revert "serial: Delete dead code for CIR serial ports"
        tty: pl011: fix initialization order of QDF2400 E44
      438630ef
    • Linus Torvalds's avatar
      Merge tag 'staging-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · dd95f186
      Linus Torvalds authored
      Pull staging/iio fixes from Greg KH:
       "Here are some Staging and IIO driver fixes for 4.13-rc5.
      
        Nothing major, just a number of small fixes for reported issues. All
        of these have been in linux-next for a while now with no reported
        issues. Full details are in the shortlog"
      
      * tag 'staging-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        staging: comedi: comedi_fops: do not call blocking ops when !TASK_RUNNING
        iio: aspeed-adc: wait for initial sequence.
        iio: accel: bmc150: Always restore device to normal mode after suspend-resume
        staging:iio:resolver:ad2s1210 fix negative IIO_ANGL_VEL read
        iio: adc: axp288: Fix the GPADC pin reading often wrongly returning 0
        iio: adc: vf610_adc: Fix VALT selection value for REFSEL bits
        iio: accel: st_accel: add SPI-3wire support
        iio: adc: Revert "axp288: Drop bogus AXP288_ADC_TS_PIN_CTRL register modifications"
        iio: adc: sun4i-gpadc-iio: fix unbalanced irq enable/disable
        iio: pressure: st_pressure_core: disable multiread by default for LPS22HB
        iio: light: tsl2563: use correct event code
      dd95f186
    • Linus Torvalds's avatar
      Merge tag 'usb-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 10cec917
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are a number of small USB driver fixes and new device ids for
        4.13-rc5. There is the usual gadget driver fixes, some new quirks for
        "messy" hardware, and some new device ids.
      
        All have been in linux-next with no reported issues"
      
      * tag 'usb-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        USB: serial: pl2303: add new ATEN device id
        usb: quirks: Add no-lpm quirk for Moshi USB to Ethernet Adapter
        USB: Check for dropped connection before switching to full speed
        usb:xhci:Add quirk for Certain failing HP keyboard on reset after resume
        usb: renesas_usbhs: gadget: fix unused-but-set-variable warning
        usb: renesas_usbhs: Fix UGCTRL2 value for R-Car Gen3
        usb: phy: phy-msm-usb: Fix usage of devm_regulator_bulk_get()
        usb: gadget: udc: renesas_usb3: Fix usb_gadget_giveback_request() calling
        usb: dwc3: gadget: Correct ISOC DATA PIDs for short packets
        USB: serial: option: add D-Link DWM-222 device ID
        usb: musb: fix tx fifo flush handling again
        usb: core: unlink urbs from the tail of the endpoint's urb_list
        usb-storage: fix deadlock involving host lock and scsi_done
        uas: Add US_FL_IGNORE_RESIDUE for Initio Corporation INIC-3069
        USB: hcd: Mark secondary HCD as dead if the primary one died
        USB: serial: cp210x: add support for Qivicon USB ZigBee dongle
      10cec917
  3. 12 Aug, 2017 4 commits
    • Linus Torvalds's avatar
      Merge tag 'for-linus-20170812' of git://git.infradead.org/linux-mtd · 89a55278
      Linus Torvalds authored
      Pull another MTD fix from Brian Norris:
       "An mtdblock regression occurred in -rc1 (all writes were broken!), in
        the process of some block subsystem refactoring. Noticed and fixed
        last week, but I'm a little slow on the uptake"
      
      * tag 'for-linus-20170812' of git://git.infradead.org/linux-mtd:
        mtd: blkdevs: Fix mtd block write failure
      89a55278
    • Abhishek Sahu's avatar
      mtd: blkdevs: Fix mtd block write failure · 9a515447
      Abhishek Sahu authored
      All the MTD block write requests are failing with
      following error messages
      
          mkfs.ext4  /dev/mtdblock0
      
          print_req_error: I/O error, dev mtdblock0, sector 0
          Buffer I/O error on dev mtdblock0, logical block 0,
          lost async page write
      
      The control is going to default case after block write request
      because of missing return.
      
      Fixes: commit 2a842aca ("block: introduce new block status code type")
      Signed-off-by: default avatarAbhishek Sahu <absahu@codeaurora.org>
      Signed-off-by: default avatarBrian Norris <computersforpeace@gmail.com>
      9a515447
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · a99bcdce
      Linus Torvalds authored
      Pull SCSI target fixes from Nicholas Bellinger:
       "The highlights include:
      
         - Fix iscsi-target payload memory leak during
           ISCSI_FLAG_TEXT_CONTINUE (Varun Prakash)
      
         - Fix tcm_qla2xxx incorrect use of tcm_qla2xxx_free_cmd during ABORT
           (Pascal de Bruijn + Himanshu Madhani + nab)
      
         - Fix iscsi-target long-standing issue with parallel delete of a
           single network portal across multiple target instances (Gary Guo +
           nab)
      
         - Fix target dynamic se_node GPF during uncached shutdown regression
           (Justin Maggard + nab)"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
        target: Fix node_acl demo-mode + uncached dynamic shutdown regression
        iscsi-target: Fix iscsi_np reset hung task during parallel delete
        qla2xxx: Fix incorrect tcm_qla2xxx_free_cmd use during TMR ABORT (v2)
        cxgbit: fix sg_nents calculation
        iscsi-target: fix invalid flags in text response
        iscsi-target: fix memory leak in iscsit_setup_text_cmd()
        cxgbit: add missing __kfree_skb()
        tcmu: free old string on reconfig
        tcmu: Fix possible to/from address overflow when doing the memcpy
      a99bcdce
    • Linus Torvalds's avatar
      Merge tag 'for-linus-4.13b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 043cd07c
      Linus Torvalds authored
      Pull xen fixes from Juergen Gross:
       "Some fixes for Xen:
      
         - a fix for a regression introduced in 4.13 for a Xen HVM-guest
           configured with KASLR
      
         - a fix for a possible deadlock in the xenbus driver when booting the
           system
      
         - a fix for lost interrupts in Xen guests"
      
      * tag 'for-linus-4.13b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        xen/events: Fix interrupt lost during irq_disable and irq_enable
        xen: avoid deadlock in xenbus
        xen: fix hvm guest with kaslr enabled
        xen: split up xen_hvm_init_shared_info()
        x86: provide an init_mem_mapping hypervisor hook
      043cd07c
  4. 11 Aug, 2017 17 commits
  5. 10 Aug, 2017 11 commits
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · 27df704d
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "21 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (21 commits)
        userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage
        zram: rework copy of compressor name in comp_algorithm_store()
        rmap: do not call mmu_notifier_invalidate_page() under ptl
        mm: fix list corruptions on shmem shrinklist
        mm/balloon_compaction.c: don't zero ballooned pages
        MAINTAINERS: copy virtio on balloon_compaction.c
        mm: fix KSM data corruption
        mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
        mm: make tlb_flush_pending global
        mm: refactor TLB gathering API
        Revert "mm: numa: defer TLB flush for THP migration as long as possible"
        mm: migrate: fix barriers around tlb_flush_pending
        mm: migrate: prevent racy access to tlb_flush_pending
        fault-inject: fix wrong should_fail() decision in task context
        test_kmod: fix small memory leak on filesystem tests
        test_kmod: fix the lock in register_test_dev_kmod()
        test_kmod: fix bug which allows negative values on two config options
        test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY"
        userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case
        mm: ratelimit PFNs busy info message
        ...
      27df704d
    • Mike Rapoport's avatar
      userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage · e86b298b
      Mike Rapoport authored
      When the process exit races with outstanding mcopy_atomic, it would be
      better to return ESRCH error.  When such race occurs the process and
      it's mm are going away and returning "no such process" to the uffd
      monitor seems better fit than ENOSPC.
      
      Link: http://lkml.kernel.org/r/1502111545-32305-1-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Suggested-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e86b298b
    • Matthias Kaehlcke's avatar
      zram: rework copy of compressor name in comp_algorithm_store() · f357e345
      Matthias Kaehlcke authored
      comp_algorithm_store() passes the size of the source buffer to strlcpy()
      instead of the destination buffer size.  Make it explicit that the two
      buffers have the same size and use strcpy() instead of strlcpy().  The
      latter can be done safely since the function ensures that the string in
      the source buffer is terminated.
      
      Link: http://lkml.kernel.org/r/20170803163350.45245-1-mka@chromium.orgSigned-off-by: default avatarMatthias Kaehlcke <mka@chromium.org>
      Reviewed-by: default avatarDouglas Anderson <dianders@chromium.org>
      Reviewed-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f357e345
    • Kirill A. Shutemov's avatar
      rmap: do not call mmu_notifier_invalidate_page() under ptl · aac2fea9
      Kirill A. Shutemov authored
      MMU notifiers can sleep, but in page_mkclean_one() we call
      mmu_notifier_invalidate_page() under page table lock.
      
      Let's instead use mmu_notifier_invalidate_range() outside
      page_vma_mapped_walk() loop.
      
      [jglisse@redhat.com: try_to_unmap_one() do not call mmu_notifier under ptl]
        Link: http://lkml.kernel.org/r/20170809204333.27485-1-jglisse@redhat.com
      Link: http://lkml.kernel.org/r/20170804134928.l4klfcnqatni7vsc@black.fi.intel.com
      Fixes: c7ab0d2f ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()")
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Reported-by: default avataraxie <axie@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: "Writer, Tim" <Tim.Writer@amd.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aac2fea9
    • Cong Wang's avatar
      mm: fix list corruptions on shmem shrinklist · d041353d
      Cong Wang authored
      We saw many list corruption warnings on shmem shrinklist:
      
        WARNING: CPU: 18 PID: 177 at lib/list_debug.c:59 __list_del_entry+0x9e/0xc0
        list_del corruption. prev->next should be ffff9ae5694b82d8, but was ffff9ae5699ba960
        Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
        CPU: 18 PID: 177 Comm: kswapd1 Not tainted 4.9.34-t3.el7.twitter.x86_64 #1
        Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
        Call Trace:
          dump_stack+0x4d/0x66
          __warn+0xcb/0xf0
          warn_slowpath_fmt+0x4f/0x60
          __list_del_entry+0x9e/0xc0
          shmem_unused_huge_shrink+0xfa/0x2e0
          shmem_unused_huge_scan+0x20/0x30
          super_cache_scan+0x193/0x1a0
          shrink_slab.part.41+0x1e3/0x3f0
          shrink_slab+0x29/0x30
          shrink_node+0xf9/0x2f0
          kswapd+0x2d8/0x6c0
          kthread+0xd7/0xf0
          ret_from_fork+0x22/0x30
      
        WARNING: CPU: 23 PID: 639 at lib/list_debug.c:33 __list_add+0x89/0xb0
        list_add corruption. prev->next should be next (ffff9ae5699ba960), but was ffff9ae5694b82d8. (prev=ffff9ae5694b82d8).
        Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
        CPU: 23 PID: 639 Comm: systemd-udevd Tainted: G        W       4.9.34-t3.el7.twitter.x86_64 #1
        Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
        Call Trace:
          dump_stack+0x4d/0x66
          __warn+0xcb/0xf0
          warn_slowpath_fmt+0x4f/0x60
          __list_add+0x89/0xb0
          shmem_setattr+0x204/0x230
          notify_change+0x2ef/0x440
          do_truncate+0x5d/0x90
          path_openat+0x331/0x1190
          do_filp_open+0x7e/0xe0
          do_sys_open+0x123/0x200
          SyS_open+0x1e/0x20
          do_syscall_64+0x61/0x170
          entry_SYSCALL64_slow_path+0x25/0x25
      
      The problem is that shmem_unused_huge_shrink() moves entries from the
      global sbinfo->shrinklist to its local lists and then releases the
      spinlock.  However, a parallel shmem_setattr() could access one of these
      entries directly and add it back to the global shrinklist if it is
      removed, with the spinlock held.
      
      The logic itself looks solid since an entry could be either in a local
      list or the global list, otherwise it is removed from one of them by
      list_del_init().  So probably the race condition is that, one CPU is in
      the middle of INIT_LIST_HEAD() but the other CPU calls list_empty()
      which returns true too early then the following list_add_tail() sees a
      corrupted entry.
      
      list_empty_careful() is designed to fix this situation.
      
      [akpm@linux-foundation.org: add comments]
      Link: http://lkml.kernel.org/r/20170803054630.18775-1-xiyou.wangcong@gmail.com
      Fixes: 779750d2 ("shmem: split huge pages beyond i_size under memory pressure")
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d041353d
    • Wei Wang's avatar
      mm/balloon_compaction.c: don't zero ballooned pages · af54aed9
      Wei Wang authored
      Revert commit bb01b64c ("mm/balloon_compaction.c: enqueue zero page
      to balloon device")'
      
      Zeroing ballon pages is rather time consuming, especially when a lot of
      pages are in flight. E.g. 7GB worth of ballooned memory takes 2.8s with
      __GFP_ZERO while it takes ~491ms without it.
      
      The original commit argued that zeroing will help ksmd to merge these
      pages on the host but this argument is assuming that the host actually
      marks balloon pages for ksm which is not universally true.  So we pay
      performance penalty for something that even might not be used in the end
      which is wrong.  The host can zero out pages on its own when there is a
      need.
      
      [mhocko@kernel.org: new changelog text]
      Link: http://lkml.kernel.org/r/1501761557-9758-1-git-send-email-wei.w.wang@intel.com
      Fixes: bb01b64c ("mm/balloon_compaction.c: enqueue zero page to balloon device")
      Signed-off-by: default avatarWei Wang <wei.w.wang@intel.com>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: zhenwei.pi <zhenwei.pi@youruncloud.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      af54aed9
    • Michael S. Tsirkin's avatar
      MAINTAINERS: copy virtio on balloon_compaction.c · c0a6a5ae
      Michael S. Tsirkin authored
      Changes to mm/balloon_compaction.c can easily break virtio, and virtio
      is the only user of that interface.  Add a line to MAINTAINERS so
      whoever changes that file remembers to copy us.
      
      Link: http://lkml.kernel.org/r/1501764010-24456-1-git-send-email-mst@redhat.comSigned-off-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Acked-by: default avatarRafael Aquini <aquini@redhat.com>
      Acked-by: default avatarWei Wang <wei.w.wang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c0a6a5ae
    • Minchan Kim's avatar
      mm: fix KSM data corruption · b3a81d08
      Minchan Kim authored
      Nadav reported KSM can corrupt the user data by the TLB batching
      race[1].  That means data user written can be lost.
      
      Quote from Nadav Amit:
       "For this race we need 4 CPUs:
      
        CPU0: Caches a writable and dirty PTE entry, and uses the stale value
        for write later.
      
        CPU1: Runs madvise_free on the range that includes the PTE. It would
        clear the dirty-bit. It batches TLB flushes.
      
        CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty.
        We care about the fact that it clears the PTE write-bit, and of
        course, batches TLB flushes.
      
        CPU3: Runs KSM. Our purpose is to pass the following test in
        write_protect_page():
      
      	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
      	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))
      
        Since it will avoid TLB flush. And we want to do it while the PTE is
        stale. Later, and before replacing the page, we would be able to
        change the page.
      
        Note that all the operations the CPU1-3 perform canhappen in parallel
        since they only acquire mmap_sem for read.
      
        We start with two identical pages. Everything below regards the same
        page/PTE.
      
        CPU0        CPU1        CPU2        CPU3
        ----        ----        ----        ----
        Write the same
        value on page
      
        [cache PTE as
         dirty in TLB]
      
                    MADV_FREE
                    pte_mkclean()
      
                                4 > clear_refs
                                pte_wrprotect()
      
                                            write_protect_page()
                                            [ success, no flush ]
      
                                            pages_indentical()
                                            [ ok ]
      
        Write to page
        different value
      
        [Ok, using stale
         PTE]
      
                                            replace_page()
      
        Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late.
        CPU0 already wrote on the page, but KSM ignored this write, and it got
        lost"
      
      In above scenario, MADV_FREE is fixed by changing TLB batching API
      including [set|clear]_tlb_flush_pending.  Remained thing is soft-dirty
      part.
      
      This patch changes soft-dirty uses TLB batching API instead of
      flush_tlb_mm and KSM checks pending TLB flush by using
      mm_tlb_flush_pending so that it will flush TLB to avoid data lost if
      there are other parallel threads pending TLB flush.
      
      [1] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-8-namit@vmware.comSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarNadav Amit <namit@vmware.com>
      Reported-by: default avatarNadav Amit <namit@vmware.com>
      Tested-by: default avatarNadav Amit <namit@vmware.com>
      Reviewed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b3a81d08
    • Minchan Kim's avatar
      mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem · 99baac21
      Minchan Kim authored
      Nadav reported parallel MADV_DONTNEED on same range has a stale TLB
      problem and Mel fixed it[1] and found same problem on MADV_FREE[2].
      
      Quote from Mel Gorman:
       "The race in question is CPU 0 running madv_free and updating some PTEs
        while CPU 1 is also running madv_free and looking at the same PTEs.
        CPU 1 may have writable TLB entries for a page but fail the pte_dirty
        check (because CPU 0 has updated it already) and potentially fail to
        flush.
      
        Hence, when madv_free on CPU 1 returns, there are still potentially
        writable TLB entries and the underlying PTE is still present so that a
        subsequent write does not necessarily propagate the dirty bit to the
        underlying PTE any more. Reclaim at some unknown time at the future
        may then see that the PTE is still clean and discard the page even
        though a write has happened in the meantime. I think this is possible
        but I could have missed some protection in madv_free that prevents it
        happening."
      
      This patch aims for solving both problems all at once and is ready for
      other problem with KSM, MADV_FREE and soft-dirty story[3].
      
      TLB batch API(tlb_[gather|finish]_mmu] uses [inc|dec]_tlb_flush_pending
      and mmu_tlb_flush_pending so that when tlb_finish_mmu is called, we can
      catch there are parallel threads going on.  In that case, forcefully,
      flush TLB to prevent for user to access memory via stale TLB entry
      although it fail to gather page table entry.
      
      I confirmed this patch works with [4] test program Nadav gave so this
      patch supersedes "mm: Always flush VMA ranges affected by zap_page_range
      v2" in current mmotm.
      
      NOTE:
      
      This patch modifies arch-specific TLB gathering interface(x86, ia64,
      s390, sh, um).  It seems most of architecture are straightforward but
      s390 need to be careful because tlb_flush_mmu works only if
      mm->context.flush_mm is set to non-zero which happens only a pte entry
      really is cleared by ptep_get_and_clear and friends.  However, this
      problem never changes the pte entries but need to flush to prevent
      memory access from stale tlb.
      
      [1] http://lkml.kernel.org/r/20170725101230.5v7gvnjmcnkzzql3@techsingularity.net
      [2] http://lkml.kernel.org/r/20170725100722.2dxnmgypmwnrfawp@suse.de
      [3] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com
      [4] https://patchwork.kernel.org/patch/9861621/
      
      [minchan@kernel.org: decrease tlb flush pending count in tlb_finish_mmu]
        Link: http://lkml.kernel.org/r/20170808080821.GA31730@bbox
      Link: http://lkml.kernel.org/r/20170802000818.4760-7-namit@vmware.comSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarNadav Amit <namit@vmware.com>
      Reported-by: default avatarNadav Amit <namit@vmware.com>
      Reported-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      99baac21
    • Minchan Kim's avatar
      mm: make tlb_flush_pending global · 0a2dd266
      Minchan Kim authored
      Currently, tlb_flush_pending is used only for CONFIG_[NUMA_BALANCING|
      COMPACTION] but upcoming patches to solve subtle TLB flush batching
      problem will use it regardless of compaction/NUMA so this patch doesn't
      remove the dependency.
      
      [akpm@linux-foundation.org: remove more ifdefs from world's ugliest printk statement]
      Link: http://lkml.kernel.org/r/20170802000818.4760-6-namit@vmware.comSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarNadav Amit <namit@vmware.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a2dd266
    • Minchan Kim's avatar
      mm: refactor TLB gathering API · 56236a59
      Minchan Kim authored
      This patch is a preparatory patch for solving race problems caused by
      TLB batch.  For that, we will increase/decrease TLB flush pending count
      of mm_struct whenever tlb_[gather|finish]_mmu is called.
      
      Before making it simple, this patch separates architecture specific part
      and rename it to arch_tlb_[gather|finish]_mmu and generic part just
      calls it.
      
      It shouldn't change any behavior.
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-5-namit@vmware.comSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarNadav Amit <namit@vmware.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      56236a59