1. 12 Feb, 2015 40 commits
    • s390/cacheinfo: coding style changes · f4dce5c9
      Heiko Carstens authored
      Just some minor coding style changes, while I had to look at the code.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/cacheinfo: fix shared cpu masks · 4fd4f1c7
      Heiko Carstens authored
      When testing Sudeep Holla's cache info rework I didn't realize that the
      shared cpu masks are broken (all have the same cpu set).
      Let's fix this.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/smp: reduce size of struct pcpu · 2f859d0d
      Heiko Carstens authored
      Reduce the size of struct pcpu, since the pcpu_devices array consists
      of NR_CPUS elements of type struct pcpu. For most machines this is just
      a waste of memory.
      So let's try to make it a bit smaller.
      This saves 16k with performance_defconfig.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/topology: convert cpu_topology array to per cpu variable · da0c636e
      Heiko Carstens authored
      Convert the per cpu topology cpu masks to a per cpu variable.
      At least for machines which have fewer possible cpus than NR_CPUS this can
      save a bit of memory (z/VM: max 64 vs 512 for performance_defconfig).
      
      This reduces the kernel image size by 100k.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/topology: delay initialization of topology cpu masks · d05d15da
      Heiko Carstens authored
      There is no reason to initialize the topology cpu masks already while
      setup_arch() is being called. It is sufficient to initialize the masks
      before the scheduler becomes SMP aware.
      Therefore a pre-SMP initcall aka early_initcall is sufficient.
      
      This also allows converting the cpu_topology array into a per cpu
      variable with a later patch. Without this patch this wouldn't be
      possible since the per cpu memory areas are not allocated while setup_arch
      is executed.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • s390/vdso: fix clock_gettime for CLOCK_THREAD_CPUTIME_ID, -2 and -3 · 49253925
      Martin Schwidefsky authored
      Git commit 8d8f2e18a6dbd3d09dd918788422e6ac8c878e96
      "s390/vdso: ectg gettime support for CLOCK_THREAD_CPUTIME_ID"
      broke clock_gettime for CLOCK_THREAD_CPUTIME_ID.
      
      Git commit c742b31c
      "fast vdso implementation for CLOCK_THREAD_CPUTIME_ID"
      introduced the ECTG for clock id -2. Correct would have been
      clock id -3.
      
      Fix the whole mess: CLOCK_THREAD_CPUTIME_ID is based on
      CPUCLOCK_SCHED and cannot be sped up by the vdso. A speedup
      is only available for clock id -3 which is CPUCLOCK_VIRT for
      the task currently running on the CPU.
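
The clock ids -2 and -3 discussed above follow from how per-thread CPU clock ids are encoded. The following is a minimal userspace sketch, assuming the encoding used by the kernel's MAKE_THREAD_CPUCLOCK() macro (clock type in the low two bits, a per-thread bit of value 4, and the inverted tid in the upper bits). The helper names are hypothetical and are not taken from the patch.

```c
/* Constants mirroring the kernel's CPU clock encoding (assumed, see lead-in). */
#define CPUCLOCK_PERTHREAD_MASK 4
#define CPUCLOCK_CLOCK_MASK     3
#define CPUCLOCK_PROF           0
#define CPUCLOCK_VIRT           1
#define CPUCLOCK_SCHED          2

/*
 * Hypothetical helper: build a per-thread CPU clock id.
 * (~tid) * 8 has the same value as (~tid) << 3 but avoids
 * left-shifting a negative number.
 */
static int make_thread_cpuclock(int tid, int which)
{
    return ((~tid) * 8) | which | CPUCLOCK_PERTHREAD_MASK;
}

/* Hypothetical helper: extract the clock type from a clock id. */
static int cpuclock_which(int clock)
{
    return clock & CPUCLOCK_CLOCK_MASK;
}
```

With tid 0 (the calling task), make_thread_cpuclock(0, CPUCLOCK_SCHED) evaluates to -2 and make_thread_cpuclock(0, CPUCLOCK_VIRT) evaluates to -3, which lines up with the commit text: -2 is the CPUCLOCK_SCHED based clock, -3 is the CPUCLOCK_VIRT based one.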
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security · 8cc748aa
      Linus Torvalds authored
      Pull security layer updates from James Morris:
       "Highlights:
      
         - Smack adds secmark support for Netfilter
         - /proc/keys is now mandatory if CONFIG_KEYS=y
         - TPM gets its own device class
         - Added TPM 2.0 support
         - Smack file hook rework (all Smack users should review this!)"
      
      * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (64 commits)
        cipso: don't use IPCB() to locate the CIPSO IP option
        SELinux: fix error code in policydb_init()
        selinux: add security in-core xattr support for pstore and debugfs
        selinux: quiet the filesystem labeling behavior message
        selinux: Remove unused function avc_sidcmp()
        ima: /proc/keys is now mandatory
        Smack: Repair netfilter dependency
        X.509: silence asn1 compiler debug output
        X.509: shut up about included cert for silent build
        KEYS: Make /proc/keys unconditional if CONFIG_KEYS=y
        MAINTAINERS: email update
        tpm/tpm_tis: Add missing ifdef CONFIG_ACPI for pnp_acpi_device
        smack: fix possible use after frees in task_security() callers
        smack: Add missing logging in bidirectional UDS connect check
        Smack: secmark support for netfilter
        Smack: Rework file hooks
        tpm: fix format string error in tpm-chip.c
        char/tpm/tpm_crb: fix build error
        smack: Fix a bidirectional UDS connect check typo
        smack: introduce a special case for tmpfs in smack_d_instantiate()
        ...
    • Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit · 7184487f
      Linus Torvalds authored
      Pull audit fix from Paul Moore:
       "Just one patch from the audit tree for v3.20, and a very minor one at
        that.
      
        The patch simply removes an old, unused field from the audit_krule
        structure, a private audit-only struct.  In audit related news, we did
        a proper overhaul of the audit pathname code and removed the nasty
        getname()/putname() hacks for audit, you should see those patches in
        Al's vfs tree if you haven't already.
      
        That's it for audit this time, let's hope for a quiet -rcX series"
      
      * 'upstream' of git://git.infradead.org/users/pcmoore/audit:
        audit: remove vestiges of vers_ops
    • Merge branch 'akpm' (patches from Andrew) · 59d53737
      Linus Torvalds authored
      Merge second set of updates from Andrew Morton:
       "More of MM"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits)
        mm/nommu.c: fix arithmetic overflow in __vm_enough_memory()
        mm/mmap.c: fix arithmetic overflow in __vm_enough_memory()
        vmstat: Reduce time interval to stat update on idle cpu
        mm/page_owner.c: remove unnecessary stack_trace field
        Documentation/filesystems/proc.txt: describe /proc/<pid>/map_files
        mm: incorporate read-only pages into transparent huge pages
        vmstat: do not use deferrable delayed work for vmstat_update
        mm: more aggressive page stealing for UNMOVABLE allocations
        mm: always steal split buddies in fallback allocations
        mm: when stealing freepages, also take pages created by splitting buddy page
        mincore: apply page table walker on do_mincore()
        mm: /proc/pid/clear_refs: avoid split_huge_page()
        mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)
        mempolicy: apply page table walker on queue_pages_range()
        arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma()
        memcg: cleanup preparation for page table walk
        numa_maps: remove numa_maps->vma
        numa_maps: fix typo in gather_hugetbl_stats
        pagemap: use walk->vma instead of calling find_vma()
        clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk()
        ...
    • Merge tag 'powerpc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux · d3f180ea
      Linus Torvalds authored
      Pull powerpc updates from Michael Ellerman:
      
       - Update of all defconfigs
      
       - Addition of a bunch of config options to modernise our defconfigs
      
       - Some PS3 updates from Geoff
      
       - Optimised memcmp for 64 bit from Anton
      
       - Fix for kprobes that allows 'perf probe' to work from Naveen
      
       - Several cxl updates from Ian & Ryan
      
       - Expanded support for the '24x7' PMU from Cody & Sukadev
      
       - Freescale updates from Scott:
          "Highlights include 8xx optimizations, some more work on datapath
           device tree content, e300 machine check support, t1040 corenet
           error reporting, and various cleanups and fixes"
      
      * tag 'powerpc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (102 commits)
        cxl: Add missing return statement after handling AFU errror
        cxl: Fail AFU initialisation if an invalid configuration record is found
        cxl: Export optional AFU configuration record in sysfs
        powerpc/mm: Warn on flushing tlb page in kernel context
        powerpc/powernv: Add OPAL soft-poweroff routine
        powerpc/perf/hv-24x7: Document sysfs event description entries
        powerpc/perf/hv-gpci: add the remaining gpci requests
        powerpc/perf/{hv-gpci, hv-common}: generate requests with counters annotated
        powerpc/perf/hv-24x7: parse catalog and populate sysfs with events
        perf: define EVENT_DEFINE_RANGE_FORMAT_LITE helper
        perf: add PMU_EVENT_ATTR_STRING() helper
        perf: provide sysfs_show for struct perf_pmu_events_attr
        powerpc/kernel: Avoid initializing device-tree pointer twice
        powerpc: Remove old compile time disabled syscall tracing code
        powerpc/kernel: Make syscall_exit a local label
        cxl: Fix device_node reference counting
        powerpc/mm: bail out early when flushing TLB page
        powerpc: defconfigs: add MTD_SPI_NOR (new dependency for M25P80)
        perf/powerpc: reset event hw state when adding it to the PMU
        powerpc/qe: Use strlcpy()
        ...
    • Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 6b00f7ef
      Linus Torvalds authored
      Pull arm64 updates from Catalin Marinas:
       "arm64 updates for 3.20:
      
         - reimplementation of the virtual remapping of UEFI Runtime Services
           in a way that is stable across kexec
         - emulation of the "setend" instruction for 32-bit tasks (user
           endianness switching trapped in the kernel, SCTLR_EL1.E0E bit set
           accordingly)
         - compat_sys_call_table implemented in C (from asm) and made it a
           constant array together with sys_call_table
         - export CPU cache information via /sys (like other architectures)
         - DMA API implementation clean-up in preparation for IOMMU support
         - macros clean-up for KVM
         - dropped some unnecessary cache+tlb maintenance
         - CONFIG_ARM64_CPU_SUSPEND clean-up
         - defconfig update (CPU_IDLE)
      
        The EFI changes going via the arm64 tree have been acked by Matt
        Fleming.  There is also a patch adding sys_*stat64 prototypes to
        include/linux/syscalls.h, acked by Andrew Morton"
      
      * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (47 commits)
        arm64: compat: Remove incorrect comment in compat_siginfo
        arm64: Fix section mismatch on alloc_init_p[mu]d()
        arm64: Avoid breakage caused by .altmacro in fpsimd save/restore macros
        arm64: mm: use *_sect to check for section maps
        arm64: drop unnecessary cache+tlb maintenance
        arm64:mm: free the useless initial page table
        arm64: Enable CPU_IDLE in defconfig
        arm64: kernel: remove ARM64_CPU_SUSPEND config option
        arm64: make sys_call_table const
        arm64: Remove asm/syscalls.h
        arm64: Implement the compat_sys_call_table in C
        syscalls: Declare sys_*stat64 prototypes if __ARCH_WANT_(COMPAT_)STAT64
        compat: Declare compat_sys_sigpending and compat_sys_sigprocmask prototypes
        arm64: uapi: expose our struct ucontext to the uapi headers
        smp, ARM64: Kill SMP single function call interrupt
        arm64: Emulate SETEND for AArch32 tasks
        arm64: Consolidate hotplug notifier for instruction emulation
        arm64: Track system support for mixed endian EL0
        arm64: implement generic IOMMU configuration
        arm64: Combine coherent and non-coherent swiotlb dma_ops
        ...
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · b3d6524f
      Linus Torvalds authored
      Pull s390 updates from Martin Schwidefsky:
      
       - The remaining patches for the z13 machine support: kernel build
         option for z13, the cache synonym avoidance, SMT support,
         compare-and-delay for spinloops and the CEX5S crypto adapter.
      
       - The ftrace support for function tracing with the gcc hotpatch option.
         This touches common code Makefiles, Steven is ok with the changes.
      
       - The hypfs file system gets an extension to access diagnose 0x0c data
         in user space for performance analysis for Linux running under z/VM.
      
       - The iucv hvc console gets wildcard support for the user id filtering.
      
       - The cacheinfo code is converted to use the generic infrastructure.
      
       - Cleanup and bug fixes.
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits)
        s390/process: free vx save area when releasing tasks
        s390/hypfs: Eliminate hypfs interval
        s390/hypfs: Add diagnose 0c support
        s390/cacheinfo: don't use smp_processor_id() in preemptible context
        s390/zcrypt: fixed domain scanning problem (again)
        s390/smp: increase maximum value of NR_CPUS to 512
        s390/jump label: use different nop instruction
        s390/jump label: add sanity checks
        s390/mm: correct missing space when reporting user process faults
        s390/dasd: cleanup profiling
        s390/dasd: add locking for global_profile access
        s390/ftrace: hotpatch support for function tracing
        ftrace: let notrace function attribute disable hotpatching if necessary
        ftrace: allow architectures to specify ftrace compile options
        s390: reintroduce diag 44 calls for cpu_relax()
        s390/zcrypt: Add support for new crypto express (CEX5S) adapter.
        s390/zcrypt: Number of supported ap domains is not retrievable.
        s390/spinlock: add compare-and-delay to lock wait loops
        s390/tape: remove redundant if statement
        s390/hvc_iucv: add simple wildcard matches to the iucv allow filter
        ...
    • Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux · 07f80d41
      Linus Torvalds authored
      Pull pstore update from Tony Luck:
       "Miscellaneous fs/pstore fixes"
      
      * tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
        pstore: Fix sprintf format specifier in pstore_dump()
        pstore: Add pmsg - user-space accessible pstore object
        pstore: Handle zero-sized prz in series
        pstore: Remove superfluous memory size check
        pstore: Use scnprintf() in pstore_mkfile()
    • Merge tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 6f83e5bd
      Linus Torvalds authored
      Pull NFS client updates from Trond Myklebust:
       "Highlights include:
      
        Features:
         - Removing the forced serialisation of open()/close() calls in
           NFSv4.x (x>0) makes for a significant performance improvement in
           metadata intensive workloads.
         - Full support for the pNFS "flexible files" layout type
         - Further RPC/RDMA client improvements from Chuck
      
        Bugfixes:
         - Stable fix: NFSv4.1 backchannel calls blocking operations with !TASK_RUNNING
         - Stable fix: pnfs_generic_pg_init_read/write can be called with lseg == NULL
         - Stable fix: Fix an Oopsable condition when nsm_mon_unmon is called
           as part of the namespace cleanup,
         - Stable fix: Ensure we reference the inode for return-on-close in
           delegreturn
         - Use SO_REUSEPORT to ensure that NFSv3 TCP connections can rebind to
           the same source address/port combination during a disconnect/
           reconnect event.  This is a requirement imposed by most NFSv3
           server duplicate reply cache implementations.
      
        Optimisations:
         - Ask for no NFSv4.1 delegations on OPEN if using O_DIRECT
      
        Other:
         - Add Anna Schumaker as co-maintainer for the NFS client"
      
      * tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (119 commits)
        SUNRPC: Cleanup to remove xs_tcp_close()
        pnfs: delete an unintended goto
        pnfs/flexfiles: Do not dprintk after the free
        SUNRPC: Fix stupid typo in xs_sock_set_reuseport
        SUNRPC: Define xs_tcp_fin_timeout only if CONFIG_SUNRPC_DEBUG
        SUNRPC: Handle connection reset more efficiently.
        SUNRPC: Remove the redundant XPRT_CONNECTION_CLOSE flag
        SUNRPC: Make xs_tcp_close() do a socket shutdown rather than a sock_release
        SUNRPC: Ensure xs_tcp_shutdown() requests a full close of the connection
        SUNRPC: Cleanup to remove remaining uses of XPRT_CONNECTION_ABORT
        SUNRPC: Remove TCP socket linger code
        SUNRPC: Remove TCP client connection reset hack
        SUNRPC: TCP/UDP always close the old socket before reconnecting
        SUNRPC: Add helpers to prevent socket create from racing
        SUNRPC: Ensure xs_reset_transport() resets the close connection flags
        SUNRPC: Do not clear the source port in xs_reset_transport
        SUNRPC: Handle EADDRINUSE on connect
        SUNRPC: Set SO_REUSEPORT socket option for TCP connections
        NFSv4.1: Fix pnfs_put_lseg races
        NFSv4.1: pnfs_send_layoutreturn should use GFP_NOFS
        ...
    • mm/nommu.c: fix arithmetic overflow in __vm_enough_memory() · 8138a67a
      Roman Gushchin authored
      I noticed that "allowed" can easily overflow by falling below 0, because
      (total_vm / 32) can be larger than "allowed".  The problem occurs in
      OVERCOMMIT_NONE mode.
      
      In this case, a huge allocation can succeed and overcommit the system
      (despite OVERCOMMIT_NONE mode).  All subsequent allocations will fail
      (system-wide), so the system becomes unusable.
      
      The problem was masked out by commit c9b1d098
      ("mm: limit growth of 3% hardcoded other user reserve"),
      but it's easy to reproduce it on older kernels:
      1) set overcommit_memory sysctl to 2
      2) mmap() large file multiple times (with VM_SHARED flag)
      3) try to malloc() large amount of memory
      
      It can also be reproduced on newer kernels, but a misconfigured
      sysctl_user_reserve_kbytes is required.
      
      Fix this issue by switching to signed arithmetic here.
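
The wraparound described above can be sketched in userspace as follows. This is a simplified model, not the kernel function: the names are hypothetical, the reserve clamp added by commit c9b1d098 is left out (so it corresponds to the older kernels mentioned above), and all quantities are plain page counts.

```c
#include <stdbool.h>

/*
 * Unsigned variant (models the bug): if total_vm / 32 is larger than the
 * commit limit, "allowed" wraps around to a huge positive value and the
 * check passes for almost any amount of committed memory.
 */
static bool enough_memory_unsigned(unsigned long commit_limit,
                                   unsigned long total_vm,
                                   unsigned long committed)
{
    unsigned long allowed = commit_limit;

    allowed -= total_vm / 32;
    return committed < allowed;
}

/*
 * Signed variant (models the fix): "allowed" may legitimately become
 * negative, in which case the check fails as intended.
 */
static bool enough_memory_signed(long commit_limit, long total_vm,
                                 long committed)
{
    long allowed = commit_limit;

    allowed -= total_vm / 32;
    return committed < allowed;
}
```

For example, with a commit limit of 1000 pages, a total_vm of 64000 pages and 5000 pages already committed, the unsigned variant accepts the allocation while the signed variant rejects it.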
      Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Andrew Shewmaker <agshew@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmap.c: fix arithmetic overflow in __vm_enough_memory() · 5703b087
      Roman Gushchin authored
      I noticed that "allowed" can easily overflow by falling below 0,
      because (total_vm / 32) can be larger than "allowed".  The problem
      occurs in OVERCOMMIT_NONE mode.
      
      In this case, a huge allocation can succeed and overcommit the system
      (despite OVERCOMMIT_NONE mode).  All subsequent allocations will fail
      (system-wide), so the system becomes unusable.
      
      The problem was masked out by commit c9b1d098
      ("mm: limit growth of 3% hardcoded other user reserve"),
      but it's easy to reproduce it on older kernels:
      1) set overcommit_memory sysctl to 2
      2) mmap() large file multiple times (with VM_SHARED flag)
      3) try to malloc() large amount of memory
      
      It can also be reproduced on newer kernels, but a misconfigured
      sysctl_user_reserve_kbytes is required.
      
      Fix this issue by switching to signed arithmetic here.
      
      [akpm@linux-foundation.org: use min_t]
      Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Andrew Shewmaker <agshew@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmstat: Reduce time interval to stat update on idle cpu · 57c2e36b
      Christoph Lameter authored
      It was noted that the vm stat shepherd runs every 2 seconds and that the
      vmstat update is then scheduled 2 seconds in the future.
      
      This yields an interval of double the time interval which is not desired.
      
      Change the shepherd so that it does not delay the vmstat update on the
      other cpu.  We still have to use schedule_delayed_work since we are using a
      delayed_work_struct but we can set the delay to 0.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_owner.c: remove unnecessary stack_trace field · 94f759d6
      Sergei Rogachev authored
      Page owner uses the page_ext structure to keep meta-information for every
      page in the system.  The structure also contains a field of type 'struct
      stack_trace', page owner uses this field during invocation of the function
      save_stack_trace.  It is easy to notice that keeping a copy of this
      structure for every page in the system is very inefficient in terms of
      memory.
      
      The patch removes this unnecessary field of page_ext and forces page owner
      to use a stack_trace structure allocated on the stack.
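
The idea can be illustrated with a small userspace model. The struct layouts below are simplified assumptions made for illustration (they are not copied from the kernel headers), and record_owner() with its copy loop is a hypothetical stand-in for the call to save_stack_trace().

```c
#define TRACE_DEPTH 8

/* Assumed, simplified mirror of a stack trace descriptor. */
struct stack_trace {
    unsigned int nr_entries, max_entries;
    unsigned long *entries;
    int skip;
};

/* Before: every per-page record embeds a full descriptor. */
struct page_ext_before {
    unsigned long flags;
    unsigned int order;
    struct stack_trace trace;
    unsigned long trace_entries[TRACE_DEPTH];
};

/* After: only the entry count is kept per page. */
struct page_ext_after {
    unsigned long flags;
    unsigned int order;
    unsigned int nr_entries;
    unsigned long trace_entries[TRACE_DEPTH];
};

/*
 * The descriptor lives on the stack for the duration of the call and
 * points at the per-page entry array; only nr_entries is stored back.
 */
static void record_owner(struct page_ext_after *ext,
                         const unsigned long *frames, unsigned int n)
{
    struct stack_trace trace = {
        .nr_entries = 0,
        .max_entries = TRACE_DEPTH,
        .entries = ext->trace_entries,
        .skip = 0,
    };
    unsigned int i;

    /* Stand-in for save_stack_trace(&trace): copy up to max_entries frames. */
    for (i = 0; i < n && trace.nr_entries < trace.max_entries; i++)
        trace.entries[trace.nr_entries++] = frames[i];

    ext->nr_entries = trace.nr_entries;
}
```

In this model the per-page saving is sizeof(struct page_ext_before) - sizeof(struct page_ext_after), multiplied by the number of pages in the system.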
      
      [akpm@linux-foundation.org: use struct initializers]
      Signed-off-by: Sergei Rogachev <rogachevsergei@gmail.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Documentation/filesystems/proc.txt: describe /proc/<pid>/map_files · 740a5ddb
      Cyrill Gorcunov authored
      [akpm@linux-foundation.org: tweaks]
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Calvin Owens <calvinowens@fb.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: incorporate read-only pages into transparent huge pages · 10359213
      Ebru Akagunduz authored
      This patch aims to improve THP collapse rates, by allowing THP collapse in
      the presence of read-only ptes, like those left in place by do_swap_page
      after a read fault.
      
      Currently THP can collapse 4kB pages into a THP when there are up to
      khugepaged_max_ptes_none pte_none ptes in a 2MB range.  This patch applies
      the same limit for read-only ptes.
      
      The patch was tested with a test program that allocates 800MB of memory,
      writes to it, and then sleeps.  I force the system to swap out all but
      190MB of the program by touching other memory.  Afterwards, the test
      program does a mix of reads and writes to its memory, and the memory gets
      swapped back in.
      
      Without the patch, only the memory that did not get swapped out remained
      in THPs, which corresponds to 24% of the memory of the program.  The
      percentage did not increase over time.
      
      With this patch, after 5 minutes of waiting khugepaged had collapsed 50%
      of the program's memory back into THPs.
      
      Test results:
      
      With the patch:
      After swapped out:
      cat /proc/pid/smaps:
      Anonymous:      100464 kB
      AnonHugePages:  100352 kB
      Swap:           699540 kB
      Fraction:       99,88
      
      cat /proc/meminfo:
      AnonPages:      1754448 kB
      AnonHugePages:  1716224 kB
      Fraction:       97,82
      
      After swapped in:
      In a few seconds:
      cat /proc/pid/smaps:
      Anonymous:      800004 kB
      AnonHugePages:  145408 kB
      Swap:           0 kB
      Fraction:       18,17
      
      cat /proc/meminfo:
      AnonPages:      2455016 kB
      AnonHugePages:  1761280 kB
      Fraction:       71,74
      
      In 5 minutes:
      cat /proc/pid/smaps
      Anonymous:      800004 kB
      AnonHugePages:  407552 kB
      Swap:           0 kB
      Fraction:       50,94
      
      cat /proc/meminfo:
      AnonPages:      2456872 kB
      AnonHugePages:  2023424 kB
      Fraction:       82,35
      
      Without the patch:
      After swapped out:
      cat /proc/pid/smaps:
      Anonymous:      190660 kB
      AnonHugePages:  190464 kB
      Swap:           609344 kB
      Fraction:       99,89
      
      cat /proc/meminfo:
      AnonPages:      1740456 kB
      AnonHugePages:  1667072 kB
      Fraction:       95,78
      
      After swapped in:
      cat /proc/pid/smaps:
      Anonymous:      800004 kB
      AnonHugePages:  190464 kB
      Swap:           0 kB
      Fraction:       23,80
      
      cat /proc/meminfo:
      AnonPages:      2350032 kB
      AnonHugePages:  1667072 kB
      Fraction:       70,93
      
      I waited 10 minutes; without the patch the fractions did not change.
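
The "Fraction" rows in the test results are the ratio of AnonHugePages to Anonymous (smaps) or to AnonPages (meminfo), in percent. The helper below is a hypothetical sketch for recomputing them; it truncates to two decimals, which is an assumption that happens to be consistent with the listed values.

```c
/*
 * Hypothetical helper: THP fraction in hundredths of a percent,
 * truncated (e.g. 5094 means 50.94%). Both inputs are in kB.
 * 64-bit arithmetic is used because anon_huge_kb * 10000 does not
 * fit into 32 bits for the values in the results.
 */
static unsigned long long thp_fraction_x100(unsigned long long anon_huge_kb,
                                            unsigned long long anon_kb)
{
    if (anon_kb == 0)
        return 0;
    return anon_huge_kb * 10000ULL / anon_kb;
}
```

With the patch, 5 minutes after swap-in, thp_fraction_x100(407552, 800004) gives 5094 (50,94 in the results); without the patch, thp_fraction_x100(190464, 800004) gives 2380 (23,80).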
      Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmstat: do not use deferrable delayed work for vmstat_update · ba4877b9
      Michal Hocko authored
      Vinayak Menon has reported that an excessive number of tasks was throttled
      in the direct reclaim inside too_many_isolated() because NR_ISOLATED_FILE
      was relatively high compared to NR_INACTIVE_FILE.  However it turned out
      that the real number of NR_ISOLATED_FILE was 0 and the per-cpu
      vm_stat_diff wasn't transferred into the global counter.
      
      vmstat_work which is responsible for the sync is defined as deferrable
      delayed work which means that the defined timeout doesn't wake up an idle
      CPU.  A CPU might stay in an idle state for a long time and general effort
      is to keep such a CPU in this state as long as possible which might lead
      to all sorts of troubles for vmstat consumers as can be seen with the
      excessive direct reclaim throttling.
      
      This patch basically reverts 39bf6270 ("VM statistics: Make timer
      deferrable") but it shouldn't cause any problems for idle CPUs because
      only CPUs with an active per-cpu drift are woken up since 7cc36bbd
      ("vmstat: on-demand vmstat workers v8") and CPUs which are idle for a
      longer time shouldn't have per-cpu drift.
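
The per-cpu drift problem can be modelled in userspace as below. This is a toy sketch under stated assumptions: the threshold value, the array sizes and all function names are made up for illustration, and vmstat_fold() only stands in for what the vmstat_update work item does when it runs on a CPU.

```c
#define SIM_NR_CPUS    4
#define STAT_THRESHOLD 32

static long global_counter;            /* what readers of the stat see */
static int cpu_diff[SIM_NR_CPUS];      /* per-cpu deltas not yet folded */

/* Update the counter from one CPU; fold only when the delta gets large. */
static void mod_stat(int cpu, int delta)
{
    int x = cpu_diff[cpu] + delta;

    if (x > STAT_THRESHOLD || x < -STAT_THRESHOLD) {
        global_counter += x;
        x = 0;
    }
    cpu_diff[cpu] = x;
}

/* Stand-in for the periodic sync: fold this CPU's delta into the global. */
static void vmstat_fold(int cpu)
{
    global_counter += cpu_diff[cpu];
    cpu_diff[cpu] = 0;
}

/* The exact value: global counter plus all pending per-cpu deltas. */
static long exact_value(void)
{
    long sum = global_counter;
    int cpu;

    for (cpu = 0; cpu < SIM_NR_CPUS; cpu++)
        sum += cpu_diff[cpu];
    return sum;
}
```

If CPU 0 adds 40 (folded at once because it exceeds the threshold) and CPUs 1 and 2 each subtract 20 and then go idle, the global counter keeps reading 40 although the exact value is 0. It only drops to 0 once the fold runs on CPUs 1 and 2, which is the step a deferrable timer may postpone for a long time on an idle CPU.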
      
      Fixes: 39bf6270 (VM statistics: Make timer deferrable)
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reported-by: Vinayak Menon <vinmenon@codeaurora.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: more aggressive page stealing for UNMOVABLE allocations · 9c0415eb
      Vlastimil Babka authored
      When allocation falls back to stealing free pages of another migratetype,
      it can decide to steal extra pages, or even the whole pageblock in order
      to reduce fragmentation, which could happen if further allocation
      fallbacks pick a different pageblock.  In try_to_steal_freepages(), one of
      the situations where extra pages are stolen happens when we are trying to
      allocate a MIGRATE_RECLAIMABLE page.
      
      However, MIGRATE_UNMOVABLE allocations are not treated the same way,
      although spreading such allocation over multiple fallback pageblocks is
      arguably even worse than it is for RECLAIMABLE allocations.  To minimize
      fragmentation, we should minimize the number of such fallbacks, and thus
      steal as much as is possible from each fallback pageblock.
      
      Note that in theory this might put more pressure on movable pageblocks and
      cause movable allocations to steal back from unmovable pageblocks.
      However, movable allocations are not as aggressive with stealing, and do
      not cause permanent fragmentation, so the tradeoff is reasonable, and
      evaluation seems to support the change.
      
      This patch thus adds a check for MIGRATE_UNMOVABLE to the decision to
      steal extra free pages.  When evaluating with stress-highalloc from
      mmtests, this has reduced the number of MIGRATE_UNMOVABLE fallbacks to
      roughly 1/6.  The number of these fallbacks stealing from MIGRATE_MOVABLE
      block is reduced to 1/3.  There was no observation of growing number of
      unmovable pageblocks over time, and also not of increased movable
      allocation fallbacks.
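
The decision to steal extra free pages can be sketched as a predicate. This is a simplified userspace model, not the code of try_to_steal_freepages(): the function names are hypothetical, PAGEBLOCK_ORDER is an assumed value, and the "large order" condition is an assumption about the other situations in which extra pages are stolen. Only the RECLAIMABLE and UNMOVABLE checks come from the description above.

```c
#include <stdbool.h>

enum migratetype {
    MIGRATE_UNMOVABLE,
    MIGRATE_RECLAIMABLE,
    MIGRATE_MOVABLE,
};

/* Assumption: 2MB pageblocks made of 4kB pages. */
#define PAGEBLOCK_ORDER 9

/* Before the patch: UNMOVABLE requests get no special treatment. */
static bool steal_extra_before(unsigned int order, enum migratetype start_type)
{
    return order >= PAGEBLOCK_ORDER / 2 ||
           start_type == MIGRATE_RECLAIMABLE;
}

/* After the patch: UNMOVABLE requests steal extra pages as well. */
static bool steal_extra_after(unsigned int order, enum migratetype start_type)
{
    return order >= PAGEBLOCK_ORDER / 2 ||
           start_type == MIGRATE_RECLAIMABLE ||
           start_type == MIGRATE_UNMOVABLE;
}
```

In this model an order-0 UNMOVABLE fallback does not steal extra pages before the change and does afterwards, while an order-0 MOVABLE fallback behaves the same in both versions.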
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c0415eb
    • Vlastimil Babka's avatar
      mm: always steal split buddies in fallback allocations · 3a1086fb
      Vlastimil Babka authored
      When allocation falls back to another migratetype, it will steal a page
      with highest available order, and (depending on this order and desired
      migratetype), it might also steal the rest of free pages from the same
      pageblock.
      
      Given the preference of highest available order, it is likely that it will
      be higher than the desired order, and result in the stolen buddy page
      being split.  The remaining pages after split are currently stolen only
      when the rest of the free pages are stolen.  This can however lead to
      situations where for MOVABLE allocations we split e.g. an order-4 fallback
      UNMOVABLE page, but steal only an order-0 page.  Then on the next MOVABLE
      allocation (which may be batched to fill the pcplists) we split another
      order-3 or higher page, etc.  By stealing all pages that we have split, we
      can avoid further stealing.
      
      This patch therefore adjusts the page stealing so that buddy pages created
      by split are always stolen.  This has an effect only on MOVABLE allocations,
      as RECLAIMABLE and UNMOVABLE allocations already always do that in
      addition to stealing the rest of free pages from the pageblock.  The
      change also allows simplifying try_to_steal_freepages() and factoring out
      CMA handling.
      
      According to Mel, it has been intended since the beginning that buddy
      pages after split would be stolen always, but it doesn't seem like it was
      ever the case until commit 47118af0 ("mm: mmzone: MIGRATE_CMA
      migration type added").  The commit has unintentionally introduced this
      behavior, but was reverted by commit 0cbef29a ("mm:
      __rmqueue_fallback() should respect pageblock type").  Neither included
      evaluation.
      
      My evaluation with stress-highalloc from mmtests shows about a 2.5x
      reduction in page stealing events for MOVABLE allocations, without
      affecting the page stealing events for other allocation migratetypes.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3a1086fb
    • Vlastimil Babka's avatar
      mm: when stealing freepages, also take pages created by splitting buddy page · 99592d59
      Vlastimil Babka authored
      When studying page stealing, I noticed some weird-looking decisions in
      try_to_steal_freepages().  The first I assume is a bug (Patch 1); the
      following two patches were driven by evaluation.
      
      Testing was done with stress-highalloc of mmtests, using the
      mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how
      often page stealing occurs for individual migratetypes, and what
      migratetypes are used for fallbacks.  Arguably, the worst case of page
      stealing is when an UNMOVABLE allocation steals from a MOVABLE pageblock.
      A RECLAIMABLE allocation stealing from a MOVABLE pageblock is also not
      ideal, so the goal is to minimize these two cases.
      
      The evaluation of v2 wasn't always a clear win and Joonsoo questioned the
      results.  Here I used a different baseline which includes the RFC compaction
      improvements from [1].  I found that the compaction improvements reduce
      the variability of stress-highalloc, so there's less noise in the data.
      
      First, let's look at stress-highalloc configured to do sync compaction,
      and how these patches reduce page stealing events during the test.  The
      first column is after a fresh reboot, the other two are repeated runs of
      the test without a reboot.  The counts were accumulated over 5 iterations
      (so the benchmark was run 5x3 times with 5 fresh restarts).
      
      Baseline:
      
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        5-nothp-1       5-nothp-2       5-nothp-3
      Page alloc extfrag event                               10264225     8702233    10244125
      Extfrag fragmenting                                    10263271     8701552    10243473
      Extfrag fragmenting for unmovable                         13595       17616       15960
      Extfrag fragmenting unmovable placed with movable          7989       12193        8447
      Extfrag fragmenting for reclaimable                         658        1840        1817
      Extfrag fragmenting reclaimable placed with movable         558        1677        1679
      Extfrag fragmenting for movable                        10249018     8682096    10225696
      
      With Patch 1:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        6-nothp-1       6-nothp-2       6-nothp-3
      Page alloc extfrag event                               11834954     9877523     9774860
      Extfrag fragmenting                                    11833993     9876880     9774245
      Extfrag fragmenting for unmovable                          7342       16129       11712
      Extfrag fragmenting unmovable placed with movable          4191       10547        6270
      Extfrag fragmenting for reclaimable                         373        1130         923
      Extfrag fragmenting reclaimable placed with movable         302         906         738
      Extfrag fragmenting for movable                        11826278     9859621     9761610
      
      With Patch 2:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        7-nothp-1       7-nothp-2       7-nothp-3
      Page alloc extfrag event                                4725990     3668793     3807436
      Extfrag fragmenting                                     4725104     3668252     3806898
      Extfrag fragmenting for unmovable                          6678        7974        7281
      Extfrag fragmenting unmovable placed with movable          2051        3829        4017
      Extfrag fragmenting for reclaimable                         429        1208        1278
      Extfrag fragmenting reclaimable placed with movable         369         976        1034
      Extfrag fragmenting for movable                         4717997     3659070     3798339
      
      With Patch 3:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        8-nothp-1       8-nothp-2       8-nothp-3
      Page alloc extfrag event                                5016183     4700142     3850633
      Extfrag fragmenting                                     5015325     4699613     3850072
      Extfrag fragmenting for unmovable                          1312        3154        3088
      Extfrag fragmenting unmovable placed with movable          1115        2777        2714
      Extfrag fragmenting for reclaimable                         437        1193        1097
      Extfrag fragmenting reclaimable placed with movable         330         969         879
      Extfrag fragmenting for movable                         5013576     4695266     3845887
      
      In v2 we saw an apparent regression with Patch 1 for unmovable events;
      this is now gone, suggesting it was indeed noise.  Here, each patch
      improves the situation for unmovable events.  Reclaimable is improved by
      Patch 1 and then either the same modulo noise, or perhaps slightly worse -
      a small price for unmovable improvements, IMHO.  The number of movable
      allocations falling back to other migratetypes is the noisiest, but it is
      nevertheless reduced to half with Patch 2.  These are the least critical,
      as compaction can move them around.
      
      If we look at success rates, the patches don't affect them.
      
      Baseline:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  5-nothp-1             5-nothp-2             5-nothp-3
      Success 1 Min         49.00 (  0.00%)       42.00 ( 14.29%)       41.00 ( 16.33%)
      Success 1 Mean        51.00 (  0.00%)       45.00 ( 11.76%)       42.60 ( 16.47%)
      Success 1 Max         55.00 (  0.00%)       51.00 (  7.27%)       46.00 ( 16.36%)
      Success 2 Min         53.00 (  0.00%)       47.00 ( 11.32%)       44.00 ( 16.98%)
      Success 2 Mean        59.60 (  0.00%)       50.80 ( 14.77%)       48.20 ( 19.13%)
      Success 2 Max         64.00 (  0.00%)       56.00 ( 12.50%)       52.00 ( 18.75%)
      Success 3 Min         84.00 (  0.00%)       82.00 (  2.38%)       78.00 (  7.14%)
      Success 3 Mean        85.60 (  0.00%)       82.80 (  3.27%)       79.40 (  7.24%)
      Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)
      
      Patch 1:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  6-nothp-1             6-nothp-2             6-nothp-3
      Success 1 Min         49.00 (  0.00%)       44.00 ( 10.20%)       44.00 ( 10.20%)
      Success 1 Mean        51.80 (  0.00%)       46.00 ( 11.20%)       45.80 ( 11.58%)
      Success 1 Max         54.00 (  0.00%)       49.00 (  9.26%)       49.00 (  9.26%)
      Success 2 Min         58.00 (  0.00%)       49.00 ( 15.52%)       48.00 ( 17.24%)
      Success 2 Mean        60.40 (  0.00%)       51.80 ( 14.24%)       50.80 ( 15.89%)
      Success 2 Max         63.00 (  0.00%)       54.00 ( 14.29%)       55.00 ( 12.70%)
      Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
      Success 3 Mean        85.00 (  0.00%)       81.60 (  4.00%)       79.80 (  6.12%)
      Success 3 Max         86.00 (  0.00%)       82.00 (  4.65%)       82.00 (  4.65%)
      
      Patch 2:
      
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  7-nothp-1             7-nothp-2             7-nothp-3
      Success 1 Min         50.00 (  0.00%)       44.00 ( 12.00%)       39.00 ( 22.00%)
      Success 1 Mean        52.80 (  0.00%)       45.60 ( 13.64%)       42.40 ( 19.70%)
      Success 1 Max         55.00 (  0.00%)       46.00 ( 16.36%)       47.00 ( 14.55%)
      Success 2 Min         52.00 (  0.00%)       48.00 (  7.69%)       45.00 ( 13.46%)
      Success 2 Mean        53.40 (  0.00%)       49.80 (  6.74%)       48.80 (  8.61%)
      Success 2 Max         57.00 (  0.00%)       52.00 (  8.77%)       52.00 (  8.77%)
      Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
      Success 3 Mean        85.00 (  0.00%)       82.40 (  3.06%)       79.60 (  6.35%)
      Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)
      
      Patch 3:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  8-nothp-1             8-nothp-2             8-nothp-3
      Success 1 Min         46.00 (  0.00%)       44.00 (  4.35%)       42.00 (  8.70%)
      Success 1 Mean        50.20 (  0.00%)       45.60 (  9.16%)       44.00 ( 12.35%)
      Success 1 Max         52.00 (  0.00%)       47.00 (  9.62%)       47.00 (  9.62%)
      Success 2 Min         53.00 (  0.00%)       49.00 (  7.55%)       48.00 (  9.43%)
      Success 2 Mean        55.80 (  0.00%)       50.60 (  9.32%)       49.00 ( 12.19%)
      Success 2 Max         59.00 (  0.00%)       52.00 ( 11.86%)       51.00 ( 13.56%)
      Success 3 Min         84.00 (  0.00%)       80.00 (  4.76%)       79.00 (  5.95%)
      Success 3 Mean        85.40 (  0.00%)       81.60 (  4.45%)       80.40 (  5.85%)
      Success 3 Max         87.00 (  0.00%)       83.00 (  4.60%)       82.00 (  5.75%)
      
      While there's no improvement here, I consider reduced fragmentation events
      to be worthwhile on their own.  Patch 2 also seems to reduce scanning for
      free pages, and migrations in compaction, suggesting it has somewhat less
      work to do:
      
      Patch 1:
      
      Compaction stalls                 4153        3959        3978
      Compaction success                1523        1441        1446
      Compaction failures               2630        2517        2531
      Page migrate success           4600827     4943120     5104348
      Page migrate failure             19763       16656       17806
      Compaction pages isolated      9597640    10305617    10653541
      Compaction migrate scanned    77828948    86533283    87137064
      Compaction free scanned      517758295   521312840   521462251
      Compaction cost                   5503        5932        6110
      
      Patch 2:
      
      Compaction stalls                 3800        3450        3518
      Compaction success                1421        1316        1317
      Compaction failures               2379        2134        2201
      Page migrate success           4160421     4502708     4752148
      Page migrate failure             19705       14340       14911
      Compaction pages isolated      8731983     9382374     9910043
      Compaction migrate scanned    98362797    96349194    98609686
      Compaction free scanned      496512560   469502017   480442545
      Compaction cost                   5173        5526        5811
      
      As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers
      of unmovable and reclaimable pageblocks.
      
      Configuring the benchmark to allocate like a THP page fault (i.e.  no sync
      compaction) gives much noisier results for iterations 2 and 3 after
      reboot.  This is not so surprising given how [1] offers smaller improvements
      in this scenario due to fewer restarts after deferred compaction, which
      would change the compaction pivot.
      
      Baseline:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          5-thp-1         5-thp-2         5-thp-3
      Page alloc extfrag event                                8148965     6227815     6646741
      Extfrag fragmenting                                     8147872     6227130     6646117
      Extfrag fragmenting for unmovable                         10324       12942       15975
      Extfrag fragmenting unmovable placed with movable          5972        8495       10907
      Extfrag fragmenting for reclaimable                         601        1707        2210
      Extfrag fragmenting reclaimable placed with movable         520        1570        2000
      Extfrag fragmenting for movable                         8136947     6212481     6627932
      
      Patch 1:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          6-thp-1         6-thp-2         6-thp-3
      Page alloc extfrag event                                8345457     7574471     7020419
      Extfrag fragmenting                                     8343546     7573777     7019718
      Extfrag fragmenting for unmovable                         10256       18535       30716
      Extfrag fragmenting unmovable placed with movable          6893       11726       22181
      Extfrag fragmenting for reclaimable                         465        1208        1023
      Extfrag fragmenting reclaimable placed with movable         353         996         843
      Extfrag fragmenting for movable                         8332825     7554034     6987979
      
      Patch 2:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          7-thp-1         7-thp-2         7-thp-3
      Page alloc extfrag event                                3512847     3020756     2891625
      Extfrag fragmenting                                     3511940     3020185     2891059
      Extfrag fragmenting for unmovable                          9017        6892        6191
      Extfrag fragmenting unmovable placed with movable          1524        3053        2435
      Extfrag fragmenting for reclaimable                         445        1081        1160
      Extfrag fragmenting reclaimable placed with movable         375         918         986
      Extfrag fragmenting for movable                         3502478     3012212     2883708
      
      Patch 3:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          8-thp-1         8-thp-2         8-thp-3
      Page alloc extfrag event                                3181699     3082881     2674164
      Extfrag fragmenting                                     3180812     3082303     2673611
      Extfrag fragmenting for unmovable                          1201        4031        4040
      Extfrag fragmenting unmovable placed with movable           974        3611        3645
      Extfrag fragmenting for reclaimable                         478        1165        1294
      Extfrag fragmenting reclaimable placed with movable         387         985        1030
      Extfrag fragmenting for movable                         3179133     3077107     2668277
      
      The improvements for the first iteration are clear; the rest is much noisier
      and can look like a regression for Patch 1.  Anyway, Patch 2 rectifies it.
      
      Allocation success rates are again unaffected so there's no point in
      making this e-mail any longer.
      
      [1] http://marc.info/?l=linux-mm&m=142166196321125&w=2
      
      This patch (of 3):
      
      When __rmqueue_fallback() is called to allocate a page of order X, it will
      find a page of order Y >= X of a fallback migratetype, which is different
      from the desired migratetype.  With the help of try_to_steal_freepages(),
      it may change the migratetype (to the desired one) also of:
      
      1) all currently free pages in the pageblock containing the fallback page
      2) the fallback pageblock itself
      3) buddy pages created by splitting the fallback page (when Y > X)
      
      These decisions take the order Y into account, as well as the desired
      migratetype, with the goal of preventing multiple fallback allocations
      that could e.g.  distribute UNMOVABLE allocations among multiple
      pageblocks.
      
      Originally, the decision for 1) implied the decision for 3).  Commit
      47118af0 ("mm: mmzone: MIGRATE_CMA migration type added") changed that
      (probably unintentionally) so that the buddy pages in case 3) are always
      changed to the desired migratetype, except for CMA pageblocks.
      
      Commit fef903ef ("mm/page_allo.c: restructure free-page stealing code
      and fix a bug") did some refactoring and added a comment that the case of
      3) is intended.  Commit 0cbef29a ("mm: __rmqueue_fallback() should
      respect pageblock type") removed the comment and tried to restore the
      original behavior where 1) implies 3), but due to the previous
      refactoring, the result is instead that only 2) implies 3) - and the
      conditions for 2) are less frequently met than conditions for 1).  This
      may increase fragmentation in situations where the code decides to steal
      all free pages from the pageblock (case 1)), but then gives back the buddy
      pages produced by splitting.
      
      This patch restores the original intended logic where 1) implies 3).
      During testing with stress-highalloc from mmtests, this has been shown to
      decrease the number of events where UNMOVABLE and RECLAIMABLE allocations
      steal from MOVABLE pageblocks, which can lead to permanent fragmentation.
      In some cases it has increased the number of events when MOVABLE
      allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are
      fixable by sync compaction and thus less harmful.
      
      Note that evaluation has shown that the behavior introduced by
      47118af0 for buddy pages in case 3) is actually even better than the
      original logic, so the following patch will introduce it properly once
      again.  For stable backports of this patch it thus makes sense to only fix
      versions containing 0cbef29a.
      
      [iamjoonsoo.kim@lge.com: tracepoint fix]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>	[3.13+ containing 0cbef29a]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99592d59
    • Naoya Horiguchi's avatar
      mincore: apply page table walker on do_mincore() · 1e25a271
      Naoya Horiguchi authored
      This patch makes do_mincore() use walk_page_vma(), which removes many
      lines of code by reusing the common page table walk code.
      
      [daeseok.youn@gmail.com: remove unneeded variable 'err']
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1e25a271
    • Kirill A. Shutemov's avatar
      mm: /proc/pid/clear_refs: avoid split_huge_page() · 7d5b3bfa
      Kirill A. Shutemov authored
      Currently the pagewalker splits all THP pages on any clear_refs request.
      This is not necessary: we can handle it at the PMD level.

      One side effect is that soft dirty will potentially see more dirty memory,
      since we will mark the whole THP page dirty at once.

      Sanity checked with the CRIU test suite. More testing is required.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d5b3bfa
    • Naoya Horiguchi's avatar
      mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP) · 48684a65
      Naoya Horiguchi authored
      walk_page_range() silently skips any vma that has VM_PFNMAP set, which
      leads to undesirable behaviour in the caller of walk_page_range().  For
      example, when no callbacks are called against a VM_PFNMAP vma,
      pagemap_read() may prepare pagemap data for the next virtual address
      range at the wrong index.  That could confuse and/or break userspace
      applications.
      
      This patch avoids the misbehavior caused by vma(VM_PFNMAP) as follows:
      - for pagemap_read(), which has its own ->pte_hole(), call the ->pte_hole()
        over vma(VM_PFNMAP),
      - for clear_refs and queue_pages, which have their own ->test_walk,
        just return 1 and skip vma(VM_PFNMAP). This is no problem because
        these are not interested in hole regions,
      - for other callers, just skip the vma(VM_PFNMAP) as the default behavior.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Shiraz Hashim <shashim@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48684a65
    • Naoya Horiguchi's avatar
      mempolicy: apply page table walker on queue_pages_range() · 6f4576e3
      Naoya Horiguchi authored
      queue_pages_range() currently does page table walking in its own way, with
      some code duplication.  This patch applies the page table walker to reduce
      the lines of code.
      
      queue_pages_range() has to do some prechecks to determine whether we really
      walk over the vma or just skip it.  Now we have the test_walk() callback in
      mm_walk for this purpose, so we can do this replacement cleanly.
      queue_pages_test_walk() depends not only on the current vma but also on the
      previous one, so queue_pages->prev is introduced to remember it.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f4576e3
    • Naoya Horiguchi's avatar
      arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma() · 1757bbd9
      Naoya Horiguchi authored
      We don't have to use mm_walk->private to pass the vma to the callback
      function, because mm_walk->vma is available.  And walk_page_vma() is useful
      if we walk over a single vma.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1757bbd9
    • Naoya Horiguchi's avatar
      memcg: cleanup preparation for page table walk · 26bcd64a
      Naoya Horiguchi authored
      pagewalk.c can handle the vma itself, so we don't have to pass the vma via
      walk->private.  Both mem_cgroup_count_precharge() and
      mem_cgroup_move_charge() loop over each vma themselves, but that is now
      done in pagewalk.c, so let's clean them up.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26bcd64a
    • Naoya Horiguchi's avatar
      numa_maps: remove numa_maps->vma · d85f4d6d
      Naoya Horiguchi authored
      pagewalk.c can handle the vma itself, so we don't have to pass the vma via
      walk->private.  show_numa_map() walks pages on a per-vma basis, so using
      walk_page_vma() is preferable.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d85f4d6d
    • Naoya Horiguchi's avatar
      numa_maps: fix typo in gather_hugetbl_stats · 632fd60f
      Naoya Horiguchi authored
      Just doing s/gather_hugetbl_stats/gather_hugetlb_stats/g, which makes the
      code grep-friendly.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      632fd60f
    • Naoya Horiguchi's avatar
      pagemap: use walk->vma instead of calling find_vma() · f995ece2
      Naoya Horiguchi authored
      The page table walker has the information about the current vma in mm_walk,
      so we don't have to call find_vma() in each pagemap_(pte|hugetlb)_range()
      call any longer.  Currently pagemap_pte_range() does the vma loop itself, so
      this patch removes many lines of code.
      
      NULL-vma check is omitted because we assume that we never run these
      callbacks on any address outside vma.  And even if it were broken, NULL
      pointer dereference would be detected, so we can get enough information
      for debugging.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f995ece2
    • Naoya Horiguchi's avatar
      clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk() · 5c64f52a
      Naoya Horiguchi authored
      clear_refs_write() has some prechecks to determine if we really walk over
      a given vma.  Now we have a test_walk() callback to filter vmas, so let's
      utilize it.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c64f52a
    • Naoya Horiguchi's avatar
      smaps: remove mem_size_stats->vma and use walk_page_vma() · 14eb6fdd
      Naoya Horiguchi authored
      pagewalk.c can handle vma in itself, so we don't have to pass vma via
      walk->private.  And show_smap() walks pages on vma basis, so using
      walk_page_vma() is preferable.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      14eb6fdd
    • Naoya Horiguchi's avatar
      pagewalk: add walk_page_vma() · 900fc5f1
      Naoya Horiguchi authored
      Introduce walk_page_vma(), which is useful for the callers which want to
      walk over a given vma.  It's used by later patches.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      900fc5f1
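
A minimal userspace sketch of the idea behind walk_page_vma(): the walker, not the caller, records which vma is being walked, so callbacks no longer need the vma smuggled in through the private pointer. This is a simplified model, not the kernel implementation. Every name here (vma_model, walk_model, walk_page_vma_model, count_vma_pages, MODEL_PAGE_SIZE) is hypothetical, and the real page table levels are collapsed into a flat loop over page-sized steps.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumed page size for this model */

/* Hypothetical stand-in for a vma: a half-open range [start, end). */
struct vma_model {
	unsigned long start;
	unsigned long end;
};

/* Hypothetical stand-in for the walk descriptor. */
struct walk_model {
	int (*pte_entry)(unsigned long addr, struct walk_model *walk);
	const struct vma_model *vma;	/* filled in by the walker */
	void *private;			/* caller data only, no vma needed */
};

/* Walk every page of one vma and call pte_entry() for each page. */
int walk_page_vma_model(const struct vma_model *vma, struct walk_model *walk)
{
	assert(vma->start <= vma->end);
	walk->vma = vma;
	for (unsigned long addr = vma->start; addr < vma->end;
	     addr += MODEL_PAGE_SIZE) {
		int err = walk->pte_entry(addr, walk);

		if (err)
			return err;	/* a non-zero return stops the walk */
	}
	return 0;
}

/* Example callback: count the pages that were visited. */
static int count_page(unsigned long addr, struct walk_model *walk)
{
	(void)addr;
	++*(unsigned long *)walk->private;
	return 0;
}

/* Example caller: private carries only the counter. */
unsigned long count_vma_pages(const struct vma_model *vma)
{
	unsigned long nr_pages = 0;
	struct walk_model walk = {
		.pte_entry = count_page,
		.private = &nr_pages,
	};

	walk_page_vma_model(vma, &walk);
	return nr_pages;
}
```

In this model, a vma covering 0x1000 to 0x4000 is visited in three page-sized steps.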
    • Naoya Horiguchi's avatar
      pagewalk: improve vma handling · fafaa426
      Naoya Horiguchi authored
      Current implementation of page table walker has a fundamental problem in
      vma handling, which started when we tried to handle vma(VM_HUGETLB).
      Because it's done in pgd loop, considering vma boundary makes code
      complicated and bug-prone.
      
      From the users' viewpoint, some users check a vma-related condition to
      determine whether they really walk pages over the vma.
      
      In order to solve these, this patch moves the vma check outside the pgd loop
      and introduces a new callback ->test_walk().
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fafaa426
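
The effect of a per-vma filter callback can be sketched in userspace C as follows. This is a simplified model under stated assumptions, not the kernel code: all types and names (region_model, filter_walk_model, walk_regions_model, REGION_SKIP_MODEL) are hypothetical, the page table levels are flattened into a loop over page-sized steps, and the return convention for test_walk (0 means walk the vma, a positive value means skip it, a negative value means abort) is assumed here to mirror the kernel's.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE   4096UL	/* assumed page size for this model */
#define REGION_SKIP_MODEL 0x1UL		/* hypothetical "do not walk" flag */

/* Hypothetical stand-in for a vma with flags. */
struct region_model {
	unsigned long start;
	unsigned long end;
	unsigned long flags;
};

struct filter_walk_model {
	/* Assumed convention: 0 = walk, > 0 = skip, < 0 = abort. */
	int (*test_walk)(const struct region_model *vma,
			 struct filter_walk_model *walk);
	int (*pte_entry)(unsigned long addr, struct filter_walk_model *walk);
	const struct region_model *vma;	/* set by the walker per region */
	void *private;
};

/* The vma check happens once per region, outside the per-page loop. */
int walk_regions_model(const struct region_model *vmas, size_t nr,
		       struct filter_walk_model *walk)
{
	assert(walk->pte_entry != NULL);
	for (size_t i = 0; i < nr; i++) {
		int ret = walk->test_walk ? walk->test_walk(&vmas[i], walk) : 0;

		if (ret < 0)
			return ret;	/* abort the whole walk */
		if (ret > 0)
			continue;	/* skip this region */

		walk->vma = &vmas[i];
		for (unsigned long addr = vmas[i].start; addr < vmas[i].end;
		     addr += MODEL_PAGE_SIZE) {
			ret = walk->pte_entry(addr, walk);
			if (ret)
				return ret;
		}
	}
	return 0;
}

/* Example filter: skip regions that carry the hypothetical flag. */
static int skip_flagged(const struct region_model *vma,
			struct filter_walk_model *walk)
{
	(void)walk;
	return (vma->flags & REGION_SKIP_MODEL) ? 1 : 0;
}

static int count_walked_page(unsigned long addr, struct filter_walk_model *walk)
{
	(void)addr;
	++*(unsigned long *)walk->private;
	return 0;
}

unsigned long count_walked_pages(const struct region_model *vmas, size_t nr)
{
	unsigned long nr_pages = 0;
	struct filter_walk_model walk = {
		.test_walk = skip_flagged,
		.pte_entry = count_walked_page,
		.private = &nr_pages,
	};

	walk_regions_model(vmas, nr, &walk);
	return nr_pages;
}
```

Keeping the filter in a callback lets each caller state its vma condition once, instead of repeating the precheck inside every per-entry callback.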
    • Naoya Horiguchi's avatar
      mm/pagewalk: remove pgd_entry() and pud_entry() · 0b1fbfe5
      Naoya Horiguchi authored
      Currently no user of the page table walker sets ->pgd_entry() or
      ->pud_entry(), so checking their existence in each loop just wastes
      CPU cycles.  So let's remove them to reduce overhead.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b1fbfe5
    • Konstantin Khlebnikov's avatar
      proc/pagemap: walk page tables under pte lock · 05fbf357
      Konstantin Khlebnikov authored
      Lockless access to pte in pagemap_pte_range() might race with page
      migration and trigger BUG_ON(!PageLocked()) in migration_entry_to_page():
      
      CPU A (pagemap)                           CPU B (migration)
                                                lock_page()
                                                try_to_unmap(page, TTU_MIGRATION...)
                                                     make_migration_entry()
                                                     set_pte_at()
      <read *pte>
      pte_to_pagemap_entry()
                                                remove_migration_ptes()
                                                unlock_page()
          if(is_migration_entry())
              migration_entry_to_page()
                  BUG_ON(!PageLocked(page))
      
      Also lockless read might be non-atomic if pte is larger than wordsize.
      Other pte walkers (smaps, numa_maps, clear_refs) already lock ptes.
      
      Fixes: 052fb0d6 ("proc: report file/anon bit in /proc/pid/pagemap")
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>	[3.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      05fbf357
    • Andrea Arcangeli's avatar
      mm: gup: kvm use get_user_pages_unlocked · 0664e57f
      Andrea Arcangeli authored
      Use the more generic get_user_pages_unlocked which has the additional
      benefit of passing FAULT_FLAG_ALLOW_RETRY at the very first page fault
      (which allows the first page fault in an unmapped area to be always able
      to block indefinitely by being allowed to release the mmap_sem).
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0664e57f