1. 16 Feb, 2015 5 commits
    • dm crypt: offload writes to thread · dc267621
      Mikulas Patocka authored
      Submitting write bios directly from the encryption thread caused serious
      performance degradation.  On a multiprocessor machine, encryption requests
      finish in a different order than they were submitted, so the resulting
      write requests reached the device out of order.
      
      Move the submission of write requests to a separate thread so that the
      requests can be sorted before submitting.  Even without request sorting
      in dm-crypt itself, this commit improves dm-crypt performance (in
      particular, it enables IO schedulers like CFQ to sort more effectively).
      
      Note: it is required that a previous commit ("dm crypt: don't allocate
      pages for a partial request") be applied before applying this patch.
      Otherwise, this commit could introduce a crash.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
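The commit above only moves write submission into its own thread so that sorting becomes possible. As a rough illustration of what sorting a collected batch by sector could look like, here is a minimal sketch. It is not the dm-crypt implementation: `struct pending_write`, `cmp_by_sector()` and `sort_pending_writes()` are hypothetical names, and a plain array with `qsort()` stands in for whatever structure the kernel code uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an encrypted write waiting to be submitted. */
struct pending_write {
    uint64_t sector;   /* starting sector on the underlying device */
    unsigned id;       /* order in which encryption happened to finish */
};

/* Comparison callback for qsort(): ascending by starting sector. */
static int cmp_by_sector(const void *a, const void *b)
{
    const struct pending_write *x = a;
    const struct pending_write *y = b;

    if (x->sector < y->sector)
        return -1;
    if (x->sector > y->sector)
        return 1;
    return 0;
}

/*
 * Sort a batch gathered by the (assumed) writer thread so that the
 * writes are handed to the block layer in ascending sector order,
 * regardless of the order in which encryption completed.
 */
static void sort_pending_writes(struct pending_write *batch, size_t n)
{
    qsort(batch, n, sizeof(*batch), cmp_by_sector);
}
```

A caller would fill `batch` as encryption requests complete, call `sort_pending_writes(batch, n)`, and then submit the entries in array order.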
    • dm crypt: remove unused io_pool and _crypt_io_pool · 94f5e024
      Mikulas Patocka authored
      The previous commit ("dm crypt: don't allocate pages for a partial
      request") stopped using the io_pool slab mempool and backing
      _crypt_io_pool kmem cache.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm crypt: avoid deadlock in mempools · 7145c241
      Mikulas Patocka authored
      Fix a theoretical deadlock introduced in the previous commit ("dm crypt:
      don't allocate pages for a partial request").
      
      The function crypt_alloc_buffer may be called concurrently.  If we allocate
      from the mempool concurrently, there is a possibility of deadlock.  For
      example, if we have a mempool of 256 pages and two processes, each wanting
      256 pages, allocate from the mempool concurrently, it may deadlock in a
      situation where both processes have allocated 128 pages and the mempool
      is exhausted.
      
      To avoid such a scenario we allocate the pages under a mutex.  In order
      to not degrade performance with excessive locking, we try non-blocking
      allocations without a mutex first and if that fails, we fall back to
      blocking allocations with a mutex.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
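The two-phase allocation described above (non-blocking attempt without a mutex, then blocking allocation under a mutex) can be sketched in user space. This is only a model under stated assumptions: `struct page_pool` and all function names are hypothetical, a counter stands in for real pages, POSIX threads replace kernel primitives, and a request is assumed never to ask for more pages than the pool holds.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fixed-size page pool; a counter stands in for real pages. */
struct page_pool {
    pthread_mutex_t lock;        /* protects 'available' */
    pthread_cond_t  freed;       /* signalled when pages are returned */
    unsigned        available;   /* pages currently in the pool */
    pthread_mutex_t alloc_lock;  /* serializes blocking multi-page allocations */
};

static void pool_init(struct page_pool *p, unsigned pages)
{
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->freed, NULL);
    pthread_mutex_init(&p->alloc_lock, NULL);
    p->available = pages;
}

/* Return n pages to the pool and wake any waiters. */
static void pool_put(struct page_pool *p, unsigned n)
{
    pthread_mutex_lock(&p->lock);
    p->available += n;
    pthread_cond_broadcast(&p->freed);
    pthread_mutex_unlock(&p->lock);
}

/* Take one page; if 'wait' is false, fail instead of sleeping. */
static bool pool_take_one(struct page_pool *p, bool wait)
{
    bool ok = true;

    pthread_mutex_lock(&p->lock);
    while (p->available == 0) {
        if (!wait) {
            ok = false;
            break;
        }
        pthread_cond_wait(&p->freed, &p->lock);
    }
    if (ok)
        p->available--;
    pthread_mutex_unlock(&p->lock);
    return ok;
}

/* Non-blocking attempt to take n pages; undoes a partial take on failure. */
static bool pool_try_take_n(struct page_pool *p, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        if (!pool_take_one(p, false)) {
            pool_put(p, i);      /* give back the i pages already taken */
            return false;
        }
    }
    return true;
}

/* Allocate all n pages for one request (assumes n <= pool size). */
static void alloc_request_pages(struct page_pool *p, unsigned n)
{
    if (pool_try_take_n(p, n))
        return;                  /* fast path: no extra locking */

    /* Slow path: only one thread at a time may sleep holding partial pages. */
    pthread_mutex_lock(&p->alloc_lock);
    for (unsigned i = 0; i < n; i++)
        pool_take_one(p, true);
    pthread_mutex_unlock(&p->alloc_lock);
}
```

Because the fast path releases everything it took before giving up, the only thread that can sleep while holding a partial allocation is the one that owns `alloc_lock`, which rules out the 128 + 128 stalemate from the example.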
    • dm crypt: don't allocate pages for a partial request · cf2f1abf
      Mikulas Patocka authored
      Change crypt_alloc_buffer so that it only ever allocates pages for a
      full request.  This is a prerequisite for the commit "dm crypt: offload
      writes to thread".
      
      This change simplifies the dm-crypt code at the expense of reduced
      throughput in low memory conditions (where allocation for a partial
      request is most useful).
      
      Note: the next commit ("dm crypt: avoid deadlock in mempools") is needed
      to fix a theoretical deadlock.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm crypt: use unbound workqueue for request processing · f3396c58
      Mikulas Patocka authored
      Use unbound workqueue by default so that work is automatically balanced
      between available CPUs.  The original behavior of encrypting on the
      same CPU that submitted the IO can still be enabled by setting the
      optional 'same_cpu_crypt' table argument.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  2. 14 Feb, 2015 3 commits
    • dm io: reject unsupported DISCARD requests with EOPNOTSUPP · 37527b86
      Darrick J. Wong authored
      I created a dm-raid1 device backed by a device that supports DISCARD
      and another device that does NOT support DISCARD with the following
      dm configuration:
      
       # echo '0 2048 mirror core 1 512 2 /dev/sda 0 /dev/sdb 0' | dmsetup create moo
       # lsblk -D
       NAME         DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
       sda                 0        4K       1G         0
       `-moo (dm-0)        0        4K       1G         0
       sdb                 0        0B       0B         0
       `-moo (dm-0)        0        4K       1G         0
      
      Notice that the mirror device /dev/mapper/moo advertises DISCARD
      support even though one of the mirror halves doesn't.
      
      If I issue a DISCARD request (via fstrim, mount -o discard, or ioctl
      BLKDISCARD) through the mirror, kmirrord gets stuck in an infinite
      loop in do_region() when it tries to issue a DISCARD request to sdb.
      The problem is that when we call do_region() against sdb, num_sectors
      is set to zero because q->limits.max_discard_sectors is zero.
      Therefore, "remaining" never decreases and the loop never terminates.
      
      To fix this: before entering the loop, check for the combination of
      REQ_DISCARD and a device without discard support, and return
      -EOPNOTSUPP to avoid hanging up the mirror device.
      
      This bug was found by the unfortunate coincidence of pvmove and a
      discard operation in the RHEL 6.5 kernel; upstream is also affected.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Acked-by: "Martin K. Petersen" <martin.petersen@oracle.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
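The hang described above comes from a chunking loop whose step size is zero. The following sketch models that loop and the guard the commit adds in front of it. It is an illustration only: `issue_discard()` and its parameters are hypothetical and merely mirror the names used in the commit message (`remaining`, `num_sectors`, `max_discard_sectors`); the real `do_region()` works on bios and request queues.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical model of splitting one discard into chunks of at most
 * max_discard_sectors.  Returns 0 on success or -EOPNOTSUPP if the
 * device cannot discard at all; *chunks_issued counts loop iterations.
 */
static int issue_discard(uint64_t remaining, uint64_t max_discard_sectors,
                         unsigned *chunks_issued)
{
    *chunks_issued = 0;

    /*
     * Guard before the loop: with a limit of zero, num_sectors below
     * would always be zero, 'remaining' would never shrink, and the
     * loop would spin forever.
     */
    if (max_discard_sectors == 0)
        return -EOPNOTSUPP;

    while (remaining) {
        uint64_t num_sectors = remaining < max_discard_sectors
                                   ? remaining
                                   : max_discard_sectors;

        /* A real implementation would build and submit a bio here. */
        remaining -= num_sectors;
        (*chunks_issued)++;
    }
    return 0;
}
```

With the guard in place, a 2048-sector discard against a device whose limit is 0 fails immediately instead of looping, while the same discard against a device with a 512-sector limit is issued in four chunks.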
    • dm mirror: do not degrade the mirror on discard error · f2ed51ac
      Mikulas Patocka authored
      A device may claim discard support yet reject discards with
      -EOPNOTSUPP.  This happens when using loopback on an ext2/ext3
      filesystem driven by the ext4 driver.  It may also happen if the
      underlying devices are moved from one disk to another.
      
      If a discard error happens, we reject the bio with -EOPNOTSUPP, but we do
      not degrade the array.
      
      This patch fixes failed test shell/lvconvert-repair-transient.sh in the
      lvm2 testsuite if the testsuite is extracted on an ext2 or ext3
      filesystem and it is being driven by the ext4 driver.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
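The policy in this commit (a failed discard is reported to the caller but does not mark the mirror leg as failed) can be sketched as a small decision function. The types and names below (`struct mirror_leg`, `handle_write_error()`) are hypothetical and only model the behavior stated in the commit message; they are not the dm-raid1 code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of one half of a mirror. */
struct mirror_leg {
    bool failed;   /* true once the leg has been kicked out of the array */
};

/*
 * Decide what to do with the completion status of a write-type request.
 * Returns the error code to hand back with the bio (0 on success).
 */
static int handle_write_error(struct mirror_leg *leg, bool is_discard, int error)
{
    if (!error)
        return 0;

    /* A failed discard is rejected, but the leg stays healthy. */
    if (is_discard)
        return -EOPNOTSUPP;

    /* Any other write error degrades the array. */
    leg->failed = true;
    return error;
}
```

In this model a device that advertises discard support but rejects the request leaves `leg->failed` untouched, whereas an ordinary write error such as `-EIO` still marks the leg as failed.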
    • dm space map disk: fix sm_disk_count_is_more_than_one() · 145b9006
      Mike Snitzer authored
      dm_tm_shadow_block() is the only caller of
      dm_sm_count_is_more_than_one() which only ever operates on a metadata
      space-map.  So in practice, sm_disk_count_is_more_than_one() isn't
      actually used (which explains why this bug never amounted to anything).
      
      But fix sm_disk_count_is_more_than_one() to properly set *result and
      return 0.
      Reported-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
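The convention this fix restores is that the return value carries only an error code while the boolean answer travels through the `*result` out-parameter. A minimal sketch of that convention follows, assuming a plain array of reference counts in place of an on-disk space map; `get_count()` and `count_is_more_than_one()` are hypothetical helpers, not the kernel functions.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical lookup: fetch the reference count of block b. */
static int get_count(const uint32_t *counts, unsigned nr_blocks,
                     unsigned b, uint32_t *count)
{
    if (b >= nr_blocks)
        return -EINVAL;
    *count = counts[b];
    return 0;
}

/*
 * Report through *result whether block b is referenced more than once.
 * Returns 0 on success or a negative errno; on error *result is left
 * untouched, so callers must check the return value first.
 */
static int count_is_more_than_one(const uint32_t *counts, unsigned nr_blocks,
                                  unsigned b, int *result)
{
    uint32_t count;
    int r = get_count(counts, nr_blocks, b, &count);

    if (r)
        return r;

    *result = count > 1;   /* the answer goes in the out-parameter */
    return 0;              /* the return value only signals success */
}
```

Keeping the two channels separate means a caller can never mistake a "true" answer for a non-zero error code.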
  3. 13 Feb, 2015 1 commit
    • Merge tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm · 802ea9d8
      Linus Torvalds authored
      Pull device mapper changes from Mike Snitzer:
      
       - The most significant change this cycle is that request-based DM now
         supports stacking on top of blk-mq devices.  This blk-mq support
         changes the model request-based DM uses for cloning a request: it
         now relies on calling blk_get_request() directly from the underlying
         blk-mq device.
      
         An early consumer of this code is Intel's emerging NVMe hardware;
         thanks to Keith Busch for working on, and pushing for, these changes.
      
       - A few other small fixes and cleanups across other DM targets.
      
      * tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
        dm: inherit QUEUE_FLAG_SG_GAPS flags from underlying queues
        dm snapshot: remove unnecessary NULL checks before vfree() calls
        dm mpath: simplify failure path of dm_multipath_init()
        dm thin metadata: remove unused dm_pool_get_data_block_size()
        dm ioctl: fix stale comment above dm_get_inactive_table()
        dm crypt: update url in CONFIG_DM_CRYPT help text
        dm bufio: fix time comparison to use time_after_eq()
        dm: use time_in_range() and time_after()
        dm raid: fix a couple integer overflows
        dm table: train hybrid target type detection to select blk-mq if appropriate
        dm: allocate requests in target when stacking on blk-mq devices
        dm: prepare for allocating blk-mq clone requests in target
        dm: submit stacked requests in irq enabled context
        dm: split request structure out from dm_rq_target_io structure
        dm: remove exports for request-based interfaces without external callers
  4. 12 Feb, 2015 31 commits
    • Merge branch 'for-3.20/drivers' of git://git.kernel.dk/linux-block · 8494bcf5
      Linus Torvalds authored
      Pull block driver changes from Jens Axboe:
       "This contains:
      
         - The 4k/partition fixes for brd from Boaz/Matthew.
      
         - A few xen front/back block fixes from David Vrabel and Roger Pau
           Monne.
      
         - Floppy changes from Takashi, cleaning the device file creation.
      
         - Switching libata to use the new blk-mq tagging policy, removing
           code (and a suboptimal implementation) from libata.  This will
           throw you a merge conflict, since a bug in the original libata
           tagging code was fixed since this code was branched.  Trivial.
           From Shaohua.
      
         - Conversion of loop to blk-mq, from Ming Lei.
      
         - Cleanup of the io_schedule() handling in bsg from Peter Zijlstra.
           He claims it improves on unreadable code, which will cost him a
           beer.
      
         - Maintainer update for NBD, now handled by Markus Pargmann.
      
         - NVMe:
              - Optimization from me that avoids a kmalloc/kfree per IO for
                smaller (<= 8KB) IO. This cuts about 1% of high IOPS CPU
                overhead.
              - Removal of (now) dead RCU code, a relic from before NVMe was
                converted to blk-mq"
      
      * 'for-3.20/drivers' of git://git.kernel.dk/linux-block:
        xen-blkback: default to X86_32 ABI on x86
        xen-blkfront: fix accounting of reqs when migrating
        xen-blkback,xen-blkfront: add myself as maintainer
        block: Simplify bsg complete all
        floppy: Avoid manual call of device_create_file()
        NVMe: avoid kmalloc/kfree for smaller IO
        MAINTAINERS: Update NBD maintainer
        libata: make sata_sil24 use fifo tag allocator
        libata: move sas ata tag allocation to libata-scsi.c
        libata: use blk taging
        NVMe: within nvme_free_queues(), delete RCU sychro/deferred free
        null_blk: suppress invalid partition info
        brd: Request from fdisk 4k alignment
        brd: Fix all partitions BUGs
        axonram: Fix bug in direct_access
        loop: add blk-mq.h include
        block: loop: don't handle REQ_FUA explicitly
        block: loop: introduce lo_discard() and lo_req_flush()
        block: loop: say goodby to bio
        block: loop: improve performance via blk-mq
    • Merge branch 'for-3.20/core' of git://git.kernel.dk/linux-block · 3e12cefb
      Linus Torvalds authored
      Pull core block IO changes from Jens Axboe:
       "This contains:
      
         - A series from Christoph that cleans up and refactors various parts
           of the REQ_BLOCK_PC handling.  Contributions in that series from
           Dongsu Park and Kent Overstreet as well.
      
         - CFQ:
              - A bug fix for cfq for realtime IO scheduling from Jeff Moyer.
              - A stable patch fixing a potential crash in CFQ in OOM
                situations.  From Konstantin Khlebnikov.
      
         - blk-mq:
              - Add support for tag allocation policies, from Shaohua. This is
                a prep patch enabling libata (and other SCSI parts) to use the
                blk-mq tagging, instead of rolling their own.
              - Various little tweaks from Keith and Mike, in preparation for
                DM blk-mq support.
              - Minor little fixes or tweaks from me.
              - A double free error fix from Tony Battersby.
      
         - The partition 4k issue fixes from Matthew and Boaz.
      
         - Add support for zero+unprovision for blkdev_issue_zeroout() from
           Martin"
      
      * 'for-3.20/core' of git://git.kernel.dk/linux-block: (27 commits)
        block: remove unused function blk_bio_map_sg
        block: handle the null_mapped flag correctly in blk_rq_map_user_iov
        blk-mq: fix double-free in error path
        block: prevent request-to-request merging with gaps if not allowed
        blk-mq: make blk_mq_run_queues() static
        dm: fix multipath regression due to initializing wrong request
        cfq-iosched: handle failure of cfq group allocation
        block: Quiesce zeroout wrapper
        block: rewrite and split __bio_copy_iov()
        block: merge __bio_map_user_iov into bio_map_user_iov
        block: merge __bio_map_kern into bio_map_kern
        block: pass iov_iter to the BLOCK_PC mapping functions
        block: add a helper to free bio bounce buffer pages
        block: use blk_rq_map_user_iov to implement blk_rq_map_user
        block: simplify bio_map_kern
        block: mark blk-mq devices as stackable
        block: keep established cmd_flags when cloning into a blk-mq request
        block: add blk-mq support to blk_insert_cloned_request()
        block: require blk_rq_prep_clone() be given an initialized clone request
        blk-mq: add tag allocation policy
        ...
    • Merge branch 'for-3.20/bdi' of git://git.kernel.dk/linux-block · 6bec0035
      Linus Torvalds authored
      Pull backing device changes from Jens Axboe:
       "This contains a cleanup of how the backing device is handled, in
        preparation for a rework of the life time rules.  In this part, the
        most important change is to split the unrelated nommu mmap flags from
        it, but also removing a backing_dev_info pointer from the
        address_space (and inode), and a cleanup of other various minor bits.
      
        Christoph did all the work here, I just fixed an oops with pages that
        have a swap backing.  Arnd fixed a missing export, and Oleg killed the
        lustre backing_dev_info from staging.  Last patch was from Al,
        unexporting parts that are now no longer needed outside"
      
      * 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
        Make super_blocks and sb_lock static
        mtd: export new mtd_mmap_capabilities
        fs: make inode_to_bdi() handle NULL inode
        staging/lustre/llite: get rid of backing_dev_info
        fs: remove default_backing_dev_info
        fs: don't reassign dirty inodes to default_backing_dev_info
        nfs: don't call bdi_unregister
        ceph: remove call to bdi_unregister
        fs: remove mapping->backing_dev_info
        fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
        nilfs2: set up s_bdi like the generic mount_bdev code
        block_dev: get bdev inode bdi directly from the block device
        block_dev: only write bdev inode on close
        fs: introduce f_op->mmap_capabilities for nommu mmap support
        fs: kill BDI_CAP_SWAP_BACKED
        fs: deduplicate noop_backing_dev_info
    • Merge tag 'md/3.20' of git://neil.brown.name/md · 5d8e7fb6
      Linus Torvalds authored
      Pull md updates from Neil Brown:
      
       - assorted locking changes so that access to /proc/mdstat
         and much of /sys/block/mdXX/md/* is protected by a spinlock
         rather than a mutex and will never block indefinitely.
      
       - Make an 'if' condition in RAID5 - which has been implicated
         in recent bugs - more readable.
      
       - misc minor fixes
      
      * tag 'md/3.20' of git://neil.brown.name/md: (28 commits)
        md/raid10: fix conversion from RAID0 to RAID10
        md: wakeup thread upon rdev_dec_pending()
        md: make reconfig_mutex optional for writes to md sysfs files.
        md: move mddev_lock and related to md.h
        md: use mddev->lock to protect updates to resync_{min,max}.
        md: minor cleanup in safe_delay_store.
        md: move GET_BITMAP_FILE ioctl out from mddev_lock.
        md: tidy up set_bitmap_file
        md: remove unnecessary 'buf' from get_bitmap_file.
        md: remove mddev_lock from rdev_attr_show()
        md: remove mddev_lock() from md_attr_show()
        md/raid5: use ->lock to protect accessing raid5 sysfs attributes.
        md: remove need for mddev_lock() in md_seq_show()
        md/bitmap: protect clearing of ->bitmap by mddev->lock
        md: protect ->pers changes with mddev->lock
        md: level_store: group all important changes into one place.
        md: rename ->stop to ->free
        md: split detach operation out from ->stop.
        md/linear: remove rcu protections in favour of suspend/resume
        md: make merge_bvec_fn more robust in face of personality changes.
        ...
    • Merge tag 'jfs-3.20' of git://github.com/kleikamp/linux-shaggy · 87c9172f
      Linus Torvalds authored
      Pull jfs updates from David Kleikamp:
       "A couple cleanups for jfs"
      
      * tag 'jfs-3.20' of git://github.com/kleikamp/linux-shaggy:
        jfs: Deletion of an unnecessary check before the function call "unload_nls"
        jfs: get rid of homegrown endianness helpers
    • Merge branch 'for-3.20' of git://linux-nfs.org/~bfields/linux · 61845143
      Linus Torvalds authored
      Pull nfsd updates from Bruce Fields:
       "The main change is the pNFS block server support from Christoph, which
        allows an NFS client connected to shared disk to do block IO to the
        shared disk in place of NFS reads and writes.  This also requires xfs
        patches, which should arrive soon through the xfs tree, barring
        unexpected problems.  Support for other filesystems is also possible
        if there's interest.
      
        Thanks also to Chuck Lever for continuing work to get NFS/RDMA into
        shape"
      
      * 'for-3.20' of git://linux-nfs.org/~bfields/linux: (32 commits)
        nfsd: default NFSv4.2 to on
        nfsd: pNFS block layout driver
        exportfs: add methods for block layout exports
        nfsd: add trace events
        nfsd: update documentation for pNFS support
        nfsd: implement pNFS layout recalls
        nfsd: implement pNFS operations
        nfsd: make find_any_file available outside nfs4state.c
        nfsd: make find/get/put file available outside nfs4state.c
        nfsd: make lookup/alloc/unhash_stid available outside nfs4state.c
        nfsd: add fh_fsid_match helper
        nfsd: move nfsd_fh_match to nfsfh.h
        fs: add FL_LAYOUT lease type
        fs: track fl_owner for leases
        nfs: add LAYOUT_TYPE_MAX enum value
        nfsd: factor out a helper to decode nfstime4 values
        sunrpc/lockd: fix references to the BKL
        nfsd: fix year-2038 nfs4 state problem
        svcrdma: Handle additional inline content
        svcrdma: Move read list XDR round-up logic
        ...
    • Merge tag 'iommu-updates-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · a26be149
      Linus Torvalds authored
      Pull IOMMU updates from Joerg Roedel:
       "This time with:
      
         - Generic page-table framework for ARM IOMMUs using the LPAE
           page-table format, ARM-SMMU and Renesas IPMMU make use of it
           already.
      
         - Break out the IO virtual address allocator from the Intel IOMMU so
           that it can be used by other DMA-API implementations too.  The
           first user will be the ARM64 common DMA-API implementation for
           IOMMUs
      
         - Device tree support for Renesas IPMMU
      
         - Various fixes and cleanups all over the place"
      
      * tag 'iommu-updates-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (36 commits)
        iommu/amd: Convert non-returned local variable to boolean when relevant
        iommu: Update my email address
        iommu/amd: Use wait_event in put_pasid_state_wait
        iommu/amd: Fix amd_iommu_free_device()
        iommu/arm-smmu: Avoid build warning
        iommu/fsl: Various cleanups
        iommu/fsl: Use %pa to print phys_addr_t
        iommu/omap: Print phys_addr_t using %pa
        iommu: Make more drivers depend on COMPILE_TEST
        iommu/ipmmu-vmsa: Fix IOMMU lookup when multiple IOMMUs are registered
        iommu: Disable on !MMU builds
        iommu/fsl: Remove unused fsl_of_pamu_ids[]
        iommu/fsl: Fix section mismatch
        iommu/ipmmu-vmsa: Use the ARM LPAE page table allocator
        iommu: Fix trace_map() to report original iova and original size
        iommu/arm-smmu: add support for iova_to_phys through ATS1PR
        iopoll: Introduce memory-mapped IO polling macros
        iommu/arm-smmu: don't touch the secure STLBIALL register
        iommu/arm-smmu: make use of generic LPAE allocator
        iommu: io-pgtable-arm: add non-secure quirk
        ...
    • Merge tag 'devicetree-for-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux · cdd30545
      Linus Torvalds authored
      Pull DeviceTree changes from Rob Herring:
      
       - DT unittests for I2C probing and overlays from Pantelis Antoniou
      
       - Remove DT unittest dependency on OF_DYNAMIC from Gaurav Minocha
      
       - Add Tegra compatible strings missing for newer parts from Paul
         Walmsley
      
       - Various vendor prefix additions
      
      * tag 'devicetree-for-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
        of: Add vendor prefix for OmniVision Technologies
        of: Use ovti for Omnivision
        of: Add vendor prefix for Truly Semiconductors Limited
        of: Add vendor prefix for Himax Technologies Inc.
        of/fdt: fix sparse warning
        of: unitest: Add I2C overlay unit tests.
        Documentation: DT: document compatible string existence requirement
        Documentation: DT bindings: add nvidia, tegra132-denver compatible string
        Documentation: DT bindings: add more Tegra chip compatible strings
        of: EXPORT_SYMBOL_GPL of_property_read_u64_array
        of: Fix brace position for struct of_device_id definition
        of/unittest: Remove obsolete code
        dt-bindings: use isil prefix for Intersil in vendor-prefixes.txt
        Add AD Holdings Plc. to vendor-prefixes.
        dt-bindings: Add Silicon Mitus vendor prefix
        Removes OF_UNITTEST dependency on OF_DYNAMIC config symbol
        pinctrl: fix up device tree bindings
        DT: Vendors: Add Everspin
        doc: add bindings document for altera fpga manager
        drivers: of: Export of_reserved_mem_device_{init,release}
    • Merge branch 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm · 42cf0f20
      Linus Torvalds authored
      Pull ARM updates from Russell King:
      
       - clang assembly fixes from Ard
      
       - optimisations and cleanups for Aurora L2 cache support
      
       - efficient L2 cache support for secure monitor API on Exynos SoCs
      
       - debug menu cleanup from Daniel Thompson to allow better behaviour for
         multiplatform kernels
      
       - StrongARM SA11x0 conversion to irq domains, and pxa_timer
      
       - kprobes updates for older ARM CPUs
      
       - move probes support out of arch/arm/kernel to arch/arm/probes
      
       - add inline asm support for the rbit (reverse bits) instruction
      
       - provide an ARM mode secondary CPU entry point (for Qualcomm CPUs)
      
       - remove the unused ARMv3 user access code
      
       - add driver_override support to AMBA Primecell bus
      
      * 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (55 commits)
        ARM: 8256/1: driver coamba: add device binding path 'driver_override'
        ARM: 8301/1: qcom: Use secondary_startup_arm()
        ARM: 8302/1: Add a secondary_startup that assumes ARM mode
        ARM: 8300/1: teach __asmeq that r11 == fp and r12 == ip
        ARM: kprobes: Fix compilation error caused by superfluous '*'
        ARM: 8297/1: cache-l2x0: optimize aurora range operations
        ARM: 8296/1: cache-l2x0: clean up aurora cache handling
        ARM: 8284/1: sa1100: clear RCSR_SMR on resume
        ARM: 8283/1: sa1100: collie: clear PWER register on machine init
        ARM: 8282/1: sa1100: use handle_domain_irq
        ARM: 8281/1: sa1100: move GPIO-related IRQ code to gpio driver
        ARM: 8280/1: sa1100: switch to irq_domain_add_simple()
        ARM: 8279/1: sa1100: merge both GPIO irqdomains
        ARM: 8278/1: sa1100: split irq handling for low GPIOs
        ARM: 8291/1: replace magic number with PAGE_SHIFT macro in fixup_pv code
        ARM: 8290/1: decompressor: fix a wrong comment
        ARM: 8286/1: mm: Fix dma_contiguous_reserve comment
        ARM: 8248/1: pm: remove outdated comment
        ARM: 8274/1: Fix DEBUG_LL for multi-platform kernels (without PL01X)
        ARM: 8273/1: Seperate DEBUG_UART_PHYS from DEBUG_LL on EP93XX
        ...
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32 · a2f0bb03
      Linus Torvalds authored
      Pull AVR32 update from Hans-Christian Egtvedt.
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32:
        avr32: update all default configurations
        avr32: remove fake at91 cpu identification
        avr32: wire up missing syscalls
    • Merge tag 'trace-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · 41cbc01f
      Linus Torvalds authored
      Pull tracing updates from Steven Rostedt:
       "The updates included in this pull request for ftrace are:
      
         o Several clean ups to the code
      
           One such clean up was to convert to 64 bit time keeping, in the
           ring buffer benchmark code.
      
         o Adding of __print_array() helper macro for TRACE_EVENT()
      
         o Updating the sample/trace_events/ to add samples of different ways
           to make trace events.  Lots of features have been added since the
           sample code was made, and these features are mostly unknown.
           Developers have been making their own hacks to do things that are
           already available.
      
         o Performance improvements.  Most notably, I found a performance bug
           where a waiter that is waiting for a full page from the ring buffer
           will see that a full page is not available, and go to sleep.  The
           sched event caused by it going to sleep would cause it to wake up
           again.  It would see that there was still not a full page, and go
           back to sleep again, and that would wake it up again, until finally
           it would see a full page.  This change has been marked for stable.
      
        Other improvements include removing global locks from fast paths"
      
      * tag 'trace-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        ring-buffer: Do not wake up a splice waiter when page is not full
        tracing: Fix unmapping loop in tracing_mark_write
        tracing: Add samples of DECLARE_EVENT_CLASS() and DEFINE_EVENT()
        tracing: Add TRACE_EVENT_FN example
        tracing: Add TRACE_EVENT_CONDITION sample
        tracing: Update the TRACE_EVENT fields available in the sample code
        tracing: Separate out initializing top level dir from instances
        tracing: Make tracing_init_dentry_tr() static
        trace: Use 64-bit timekeeping
        tracing: Add array printing helper
        tracing: Remove newline from trace_printk warning banner
        tracing: Use IS_ERR() check for return value of tracing_init_dentry()
        tracing: Remove unneeded includes of debugfs.h and fs.h
        tracing: Remove taking of trace_types_lock in pipe files
        tracing: Add ref count to tracer for when they are being read by pipe
    • Merge tag 'ktest-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest · 12df4289
      Linus Torvalds authored
      Pull ktest updates from Steven Rostedt:
       "The following ktest updates were done:
      
         o Added timings to various parts of the test (build, install, boot,
           tests) and report them so that the users can keep track of changes.
      
         o Josh Poimboeuf fixed the console output to work better with virtual
           machine targets.
      
         o Various clean ups and fixes"
      
      * tag 'ktest-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest:
        ktest: Place quotes around item variable
        ktest: Cleanup terminal on dodie() failure
        ktest: Print build,install,boot,test times at success and failure
        ktest: Enable user input to the console
        ktest: Give console process a dedicated tty
        ktest: Rename start_monitor_and_boot to start_monitor_and_install
        ktest: Show times for build, install, boot and test
        ktest: Restore tty settings after closing console
        ktest: Add timings for commands
    • Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security · 8cc748aa
      Linus Torvalds authored
      Pull security layer updates from James Morris:
       "Highlights:
      
         - Smack adds secmark support for Netfilter
         - /proc/keys is now mandatory if CONFIG_KEYS=y
         - TPM gets its own device class
         - Added TPM 2.0 support
         - Smack file hook rework (all Smack users should review this!)"
      
      * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (64 commits)
        cipso: don't use IPCB() to locate the CIPSO IP option
        SELinux: fix error code in policydb_init()
        selinux: add security in-core xattr support for pstore and debugfs
        selinux: quiet the filesystem labeling behavior message
        selinux: Remove unused function avc_sidcmp()
        ima: /proc/keys is now mandatory
        Smack: Repair netfilter dependency
        X.509: silence asn1 compiler debug output
        X.509: shut up about included cert for silent build
        KEYS: Make /proc/keys unconditional if CONFIG_KEYS=y
        MAINTAINERS: email update
        tpm/tpm_tis: Add missing ifdef CONFIG_ACPI for pnp_acpi_device
        smack: fix possible use after frees in task_security() callers
        smack: Add missing logging in bidirectional UDS connect check
        Smack: secmark support for netfilter
        Smack: Rework file hooks
        tpm: fix format string error in tpm-chip.c
        char/tpm/tpm_crb: fix build error
        smack: Fix a bidirectional UDS connect check typo
        smack: introduce a special case for tmpfs in smack_d_instantiate()
        ...
      8cc748aa
    • Linus Torvalds's avatar
      Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit · 7184487f
      Linus Torvalds authored
      Pull audit fix from Paul Moore:
       "Just one patch from the audit tree for v3.20, and a very minor one at
        that.
      
        The patch simply removes an old, unused field from the audit_krule
        structure, a private audit-only struct.  In audit related news, we did
        a proper overhaul of the audit pathname code and removed the nasty
        getname()/putname() hacks for audit, you should see those patches in
        Al's vfs tree if you haven't already.
      
        That's it for audit this time, let's hope for a quiet -rcX series"
      
      * 'upstream' of git://git.infradead.org/users/pcmoore/audit:
        audit: remove vestiges of vers_ops
      7184487f
    • NeilBrown's avatar
      md/raid10: fix conversion from RAID0 to RAID10 · 53a6ab4d
      NeilBrown authored
      A RAID0 array (like a LINEAR array) does not have a concept
      of 'size' being the amount of each device that is in use.
      Rather, as much of each device as is available is used.
      So the 'size' is set to 0 and ignored.
      
      RAID10 does have this concept and needs it to be set correctly.
      So when we convert RAID0 to RAID10 we must determine the
      'size' (that being the size of the first 'strip_zone' in the
      RAID0), and set it correctly.
      Reported-and-tested-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      53a6ab4d
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · 59d53737
      Linus Torvalds authored
      Merge second set of updates from Andrew Morton:
       "More of MM"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits)
        mm/nommu.c: fix arithmetic overflow in __vm_enough_memory()
        mm/mmap.c: fix arithmetic overflow in __vm_enough_memory()
        vmstat: Reduce time interval to stat update on idle cpu
        mm/page_owner.c: remove unnecessary stack_trace field
        Documentation/filesystems/proc.txt: describe /proc/<pid>/map_files
        mm: incorporate read-only pages into transparent huge pages
        vmstat: do not use deferrable delayed work for vmstat_update
        mm: more aggressive page stealing for UNMOVABLE allocations
        mm: always steal split buddies in fallback allocations
        mm: when stealing freepages, also take pages created by splitting buddy page
        mincore: apply page table walker on do_mincore()
        mm: /proc/pid/clear_refs: avoid split_huge_page()
        mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)
        mempolicy: apply page table walker on queue_pages_range()
        arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma()
        memcg: cleanup preparation for page table walk
        numa_maps: remove numa_maps->vma
        numa_maps: fix typo in gather_hugetbl_stats
        pagemap: use walk->vma instead of calling find_vma()
        clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk()
        ...
      59d53737
    • Linus Torvalds's avatar
      Merge tag 'powerpc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux · d3f180ea
      Linus Torvalds authored
      Pull powerpc updates from Michael Ellerman:
      
       - Update of all defconfigs
      
       - Addition of a bunch of config options to modernise our defconfigs
      
       - Some PS3 updates from Geoff
      
       - Optimised memcmp for 64 bit from Anton
      
       - Fix for kprobes that allows 'perf probe' to work from Naveen
      
       - Several cxl updates from Ian & Ryan
      
       - Expanded support for the '24x7' PMU from Cody & Sukadev
      
       - Freescale updates from Scott:
          "Highlights include 8xx optimizations, some more work on datapath
           device tree content, e300 machine check support, t1040 corenet
           error reporting, and various cleanups and fixes"
      
      * tag 'powerpc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (102 commits)
        cxl: Add missing return statement after handling AFU errror
        cxl: Fail AFU initialisation if an invalid configuration record is found
        cxl: Export optional AFU configuration record in sysfs
        powerpc/mm: Warn on flushing tlb page in kernel context
        powerpc/powernv: Add OPAL soft-poweroff routine
        powerpc/perf/hv-24x7: Document sysfs event description entries
        powerpc/perf/hv-gpci: add the remaining gpci requests
        powerpc/perf/{hv-gpci, hv-common}: generate requests with counters annotated
        powerpc/perf/hv-24x7: parse catalog and populate sysfs with events
        perf: define EVENT_DEFINE_RANGE_FORMAT_LITE helper
        perf: add PMU_EVENT_ATTR_STRING() helper
        perf: provide sysfs_show for struct perf_pmu_events_attr
        powerpc/kernel: Avoid initializing device-tree pointer twice
        powerpc: Remove old compile time disabled syscall tracing code
        powerpc/kernel: Make syscall_exit a local label
        cxl: Fix device_node reference counting
        powerpc/mm: bail out early when flushing TLB page
        powerpc: defconfigs: add MTD_SPI_NOR (new dependency for M25P80)
        perf/powerpc: reset event hw state when adding it to the PMU
        powerpc/qe: Use strlcpy()
        ...
      d3f180ea
    • Linus Torvalds's avatar
      Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 6b00f7ef
      Linus Torvalds authored
      Pull arm64 updates from Catalin Marinas:
       "arm64 updates for 3.20:
      
         - reimplementation of the virtual remapping of UEFI Runtime Services
           in a way that is stable across kexec
         - emulation of the "setend" instruction for 32-bit tasks (user
           endianness switching trapped in the kernel, SCTLR_EL1.E0E bit set
           accordingly)
         - compat_sys_call_table implemented in C (from asm) and made it a
           constant array together with sys_call_table
         - export CPU cache information via /sys (like other architectures)
         - DMA API implementation clean-up in preparation for IOMMU support
         - macros clean-up for KVM
         - dropped some unnecessary cache+tlb maintenance
         - CONFIG_ARM64_CPU_SUSPEND clean-up
         - defconfig update (CPU_IDLE)
      
        The EFI changes going via the arm64 tree have been acked by Matt
        Fleming.  There is also a patch adding sys_*stat64 prototypes to
        include/linux/syscalls.h, acked by Andrew Morton"
      
      * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (47 commits)
        arm64: compat: Remove incorrect comment in compat_siginfo
        arm64: Fix section mismatch on alloc_init_p[mu]d()
        arm64: Avoid breakage caused by .altmacro in fpsimd save/restore macros
        arm64: mm: use *_sect to check for section maps
        arm64: drop unnecessary cache+tlb maintenance
        arm64:mm: free the useless initial page table
        arm64: Enable CPU_IDLE in defconfig
        arm64: kernel: remove ARM64_CPU_SUSPEND config option
        arm64: make sys_call_table const
        arm64: Remove asm/syscalls.h
        arm64: Implement the compat_sys_call_table in C
        syscalls: Declare sys_*stat64 prototypes if __ARCH_WANT_(COMPAT_)STAT64
        compat: Declare compat_sys_sigpending and compat_sys_sigprocmask prototypes
        arm64: uapi: expose our struct ucontext to the uapi headers
        smp, ARM64: Kill SMP single function call interrupt
        arm64: Emulate SETEND for AArch32 tasks
        arm64: Consolidate hotplug notifier for instruction emulation
        arm64: Track system support for mixed endian EL0
        arm64: implement generic IOMMU configuration
        arm64: Combine coherent and non-coherent swiotlb dma_ops
        ...
      6b00f7ef
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · b3d6524f
      Linus Torvalds authored
      Pull s390 updates from Martin Schwidefsky:
      
       - The remaining patches for the z13 machine support: kernel build
         option for z13, the cache synonym avoidance, SMT support,
         compare-and-delay for spinloops and the CEX5S crypto adapter.
      
       - The ftrace support for function tracing with the gcc hotpatch option.
         This touches common code Makefiles, Steven is ok with the changes.
      
       - The hypfs file system gets an extension to access diagnose 0x0c data
         in user space for performance analysis for Linux running under z/VM.
      
       - The iucv hvc console gets wildcard support for the user id filtering.
      
       - The cacheinfo code is converted to use the generic infrastructure.
      
       - Cleanup and bug fixes.
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits)
        s390/process: free vx save area when releasing tasks
        s390/hypfs: Eliminate hypfs interval
        s390/hypfs: Add diagnose 0c support
        s390/cacheinfo: don't use smp_processor_id() in preemptible context
        s390/zcrypt: fixed domain scanning problem (again)
        s390/smp: increase maximum value of NR_CPUS to 512
        s390/jump label: use different nop instruction
        s390/jump label: add sanity checks
        s390/mm: correct missing space when reporting user process faults
        s390/dasd: cleanup profiling
        s390/dasd: add locking for global_profile access
        s390/ftrace: hotpatch support for function tracing
        ftrace: let notrace function attribute disable hotpatching if necessary
        ftrace: allow architectures to specify ftrace compile options
        s390: reintroduce diag 44 calls for cpu_relax()
        s390/zcrypt: Add support for new crypto express (CEX5S) adapter.
        s390/zcrypt: Number of supported ap domains is not retrievable.
        s390/spinlock: add compare-and-delay to lock wait loops
        s390/tape: remove redundant if statement
        s390/hvc_iucv: add simple wildcard matches to the iucv allow filter
        ...
      b3d6524f
    • Linus Torvalds's avatar
      Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux · 07f80d41
      Linus Torvalds authored
      Pull pstore update from Tony Luck:
       "Miscellaneous fs/pstore fixes"
      
      * tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
        pstore: Fix sprintf format specifier in pstore_dump()
        pstore: Add pmsg - user-space accessible pstore object
        pstore: Handle zero-sized prz in series
        pstore: Remove superfluous memory size check
        pstore: Use scnprintf() in pstore_mkfile()
      07f80d41
    • Linus Torvalds's avatar
      Merge tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 6f83e5bd
      Linus Torvalds authored
      Pull NFS client updates from Trond Myklebust:
       "Highlights include:
      
        Features:
         - Removing the forced serialisation of open()/close() calls in
           NFSv4.x (x>0) makes for a significant performance improvement in
           metadata intensive workloads.
         - Full support for the pNFS "flexible files" layout type
         - Further RPC/RDMA client improvements from Chuck
      
        Bugfixes:
         - Stable fix: NFSv4.1 backchannel calls blocking operations with !TASK_RUNNING
         - Stable fix: pnfs_generic_pg_init_read/write can be called with lseg == NULL
         - Stable fix: Fix an Oopsable condition when nsm_mon_unmon is called
           as part of the namespace cleanup,
         - Stable fix: Ensure we reference the inode for return-on-close in
           delegreturn
         - Use SO_REUSEPORT to ensure that NFSv3 TCP connections can rebind to
           the same source address/port combination during a disconnect/
           reconnect event.  This is a requirement imposed by most NFSv3
           server duplicate reply cache implementations.
      
        Optimisations:
         - Ask for no NFSv4.1 delegations on OPEN if using O_DIRECT
      
        Other:
         - Add Anna Schumaker as co-maintainer for the NFS client"
      
      * tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (119 commits)
        SUNRPC: Cleanup to remove xs_tcp_close()
        pnfs: delete an unintended goto
        pnfs/flexfiles: Do not dprintk after the free
        SUNRPC: Fix stupid typo in xs_sock_set_reuseport
        SUNRPC: Define xs_tcp_fin_timeout only if CONFIG_SUNRPC_DEBUG
        SUNRPC: Handle connection reset more efficiently.
        SUNRPC: Remove the redundant XPRT_CONNECTION_CLOSE flag
        SUNRPC: Make xs_tcp_close() do a socket shutdown rather than a sock_release
        SUNRPC: Ensure xs_tcp_shutdown() requests a full close of the connection
        SUNRPC: Cleanup to remove remaining uses of XPRT_CONNECTION_ABORT
        SUNRPC: Remove TCP socket linger code
        SUNRPC: Remove TCP client connection reset hack
        SUNRPC: TCP/UDP always close the old socket before reconnecting
        SUNRPC: Add helpers to prevent socket create from racing
        SUNRPC: Ensure xs_reset_transport() resets the close connection flags
        SUNRPC: Do not clear the source port in xs_reset_transport
        SUNRPC: Handle EADDRINUSE on connect
        SUNRPC: Set SO_REUSEPORT socket option for TCP connections
        NFSv4.1: Fix pnfs_put_lseg races
        NFSv4.1: pnfs_send_layoutreturn should use GFP_NOFS
        ...
      6f83e5bd
    • Roman Gushchin's avatar
      mm/nommu.c: fix arithmetic overflow in __vm_enough_memory() · 8138a67a
      Roman Gushchin authored
      I noticed that "allowed" can easily overflow by falling below 0, because
      (total_vm / 32) can be larger than "allowed".  The problem occurs in
      OVERCOMMIT_NONE mode.
      
      In this case, a huge allocation can succeed and overcommit the system
      (despite OVERCOMMIT_NONE mode).  All subsequent allocations will fail
      (system-wide), so the system becomes unusable.
      
      The problem was masked out by commit c9b1d098
      ("mm: limit growth of 3% hardcoded other user reserve"),
      but it's easy to reproduce it on older kernels:
      1) set overcommit_memory sysctl to 2
      2) mmap() large file multiple times (with VM_SHARED flag)
      3) try to malloc() large amount of memory
      
      It can also be reproduced on newer kernels, but only with a misconfigured
      sysctl_user_reserve_kbytes.
      
      Fix this issue by switching to signed arithmetic here.
      Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Andrew Shewmaker <agshew@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8138a67a
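The wraparound described in the commit above can be illustrated with a small model. This is only a sketch, not the kernel's code: the function names are hypothetical, a 64-bit `unsigned long` is assumed, and the limit check is reduced to "committed pages must stay below `allowed`".

```python
# Simplified model of the overflow; names and numbers are illustrative only.
ULONG_BITS = 64                      # assumption: 64-bit unsigned long
ULONG_MASK = (1 << ULONG_BITS) - 1


def allowed_unsigned(allowed, total_vm):
    """Subtract total_vm / 32 with unsigned wraparound (behaviour before the fix)."""
    return (allowed - total_vm // 32) & ULONG_MASK


def allowed_signed(allowed, total_vm):
    """Subtract total_vm / 32 with signed arithmetic (behaviour after the fix)."""
    return allowed - total_vm // 32


def admits(committed, allowed):
    """An allocation is admitted while committed memory stays below the limit."""
    return committed < allowed


# total_vm / 32 is 200 pages, which is larger than the 100-page limit.
wrapped = allowed_unsigned(100, 6400)
signed = allowed_signed(100, 6400)
print(wrapped, admits(10**9, wrapped))   # huge limit, allocation admitted
print(signed, admits(10**9, signed))     # negative limit, allocation refused
```

With unsigned arithmetic the limit wraps to a value close to 2^64, so even a very large request passes the check; with signed arithmetic the limit goes negative and the request is refused.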
    • Roman Gushchin's avatar
      mm/mmap.c: fix arithmetic overflow in __vm_enough_memory() · 5703b087
      Roman Gushchin authored
      I noticed that "allowed" can easily overflow by falling below 0,
      because (total_vm / 32) can be larger than "allowed".  The problem
      occurs in OVERCOMMIT_NONE mode.
      
      In this case, a huge allocation can succeed and overcommit the system
      (despite OVERCOMMIT_NONE mode).  All subsequent allocations will fail
      (system-wide), so the system becomes unusable.
      
      The problem was masked out by commit c9b1d098
      ("mm: limit growth of 3% hardcoded other user reserve"),
      but it's easy to reproduce it on older kernels:
      1) set overcommit_memory sysctl to 2
      2) mmap() large file multiple times (with VM_SHARED flag)
      3) try to malloc() large amount of memory
      
      It can also be reproduced on newer kernels, but only with a misconfigured
      sysctl_user_reserve_kbytes.
      
      Fix this issue by switching to signed arithmetic here.
      
      [akpm@linux-foundation.org: use min_t]
      Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Andrew Shewmaker <agshew@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5703b087
    • Christoph Lameter's avatar
      vmstat: Reduce time interval to stat update on idle cpu · 57c2e36b
      Christoph Lameter authored
      It was noted that the vm stat shepherd runs every 2 seconds and that the
      vmstat update is then scheduled 2 seconds in the future.
      
      This yields an effective interval of twice the intended one, which is not desired.
      
      Change the shepherd so that it does not delay the vmstat update on the
      other cpu.  We still have to use schedule_delayed_work since we are using a
      delayed_work_struct but we can set the delay to 0.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57c2e36b
    • Sergei Rogachev's avatar
      mm/page_owner.c: remove unnecessary stack_trace field · 94f759d6
      Sergei Rogachev authored
      Page owner uses the page_ext structure to keep meta-information for every
      page in the system.  The structure also contains a field of type 'struct
      stack_trace', page owner uses this field during invocation of the function
      save_stack_trace.  It is easy to notice that keeping a copy of this
      structure for every page in the system is very inefficient in terms of
      memory.
      
      The patch removes this unnecessary field of page_ext and forces page owner
      to use a stack_trace structure allocated on the stack.
      
      [akpm@linux-foundation.org: use struct initializers]
      Signed-off-by: Sergei Rogachev <rogachevsergei@gmail.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      94f759d6
    • Cyrill Gorcunov's avatar
      Documentation/filesystems/proc.txt: describe /proc/<pid>/map_files · 740a5ddb
      Cyrill Gorcunov authored
      [akpm@linux-foundation.org: tweaks]
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Calvin Owens <calvinowens@fb.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      740a5ddb
    • Ebru Akagunduz's avatar
      mm: incorporate read-only pages into transparent huge pages · 10359213
      Ebru Akagunduz authored
      This patch aims to improve THP collapse rates, by allowing THP collapse in
      the presence of read-only ptes, like those left in place by do_swap_page
      after a read fault.
      
      Currently THP can collapse 4kB pages into a THP when there are up to
      khugepaged_max_ptes_none pte_none ptes in a 2MB range.  This patch applies
      the same limit for read-only ptes.
      
      The patch was tested with a test program that allocates 800MB of memory,
      writes to it, and then sleeps.  I force the system to swap out all but
      190MB of the program by touching other memory.  Afterwards, the test
      program does a mix of reads and writes to its memory, and the memory gets
      swapped back in.
      
      Without the patch, only the memory that did not get swapped out remained
      in THPs, which corresponds to 24% of the memory of the program.  The
      percentage did not increase over time.
      
      With this patch, after 5 minutes of waiting khugepaged had collapsed 50%
      of the program's memory back into THPs.
      
      Test results:
      
      With the patch:
      After swapped out:
      cat /proc/pid/smaps:
      Anonymous:      100464 kB
      AnonHugePages:  100352 kB
      Swap:           699540 kB
      Fraction:       99,88
      
      cat /proc/meminfo:
      AnonPages:      1754448 kB
      AnonHugePages:  1716224 kB
      Fraction:       97,82
      
      After swapped in:
      In a few seconds:
      cat /proc/pid/smaps:
      Anonymous:      800004 kB
      AnonHugePages:  145408 kB
      Swap:           0 kB
      Fraction:       18,17
      
      cat /proc/meminfo:
      AnonPages:      2455016 kB
      AnonHugePages:  1761280 kB
      Fraction:       71,74
      
      In 5 minutes:
      cat /proc/pid/smaps
      Anonymous:      800004 kB
      AnonHugePages:  407552 kB
      Swap:           0 kB
      Fraction:       50,94
      
      cat /proc/meminfo:
      AnonPages:      2456872 kB
      AnonHugePages:  2023424 kB
      Fraction:       82,35
      
      Without the patch:
      After swapped out:
      cat /proc/pid/smaps:
      Anonymous:      190660 kB
      AnonHugePages:  190464 kB
      Swap:           609344 kB
      Fraction:       99,89
      
      cat /proc/meminfo:
      AnonPages:      1740456 kB
      AnonHugePages:  1667072 kB
      Fraction:       95,78
      
      After swapped in:
      cat /proc/pid/smaps:
      Anonymous:      800004 kB
      AnonHugePages:  190464 kB
      Swap:           0 kB
      Fraction:       23,80
      
      cat /proc/meminfo:
      AnonPages:      2350032 kB
      AnonHugePages:  1667072 kB
      Fraction:       70,93
      
      I waited 10 minutes; without the patch, the fractions did not change.
      Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      10359213
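The "Fraction" figures in the test results above are AnonHugePages expressed as a percentage of the anonymous memory reported alongside it. The sketch below recomputes a few of them from the quoted kB values. The helper names are hypothetical, and the figures appear to be truncated to two decimals rather than rounded, which is what this sketch assumes.

```python
def fraction_hundredths(anon_huge_kb, anon_kb):
    """AnonHugePages as a share of anonymous memory, in hundredths of a percent.

    Integer division truncates, which matches the figures quoted in the commit.
    """
    return anon_huge_kb * 10000 // anon_kb


def format_fraction(hundredths):
    """Format with a decimal comma, as in the commit's test output."""
    return f"{hundredths // 100},{hundredths % 100:02d}"


# (AnonHugePages kB, Anonymous kB) pairs copied from the smaps output above.
samples = {
    "with patch, swapped out": (100352, 100464),
    "with patch, a few seconds after swap-in": (145408, 800004),
    "with patch, 5 minutes after swap-in": (407552, 800004),
    "without patch, after swap-in": (190464, 800004),
}

for label, (huge, anon) in samples.items():
    print(f"{label}: {format_fraction(fraction_hundredths(huge, anon))}")
```

The per-process share rises from about 18% to about 51% with the patch and stays at about 24% without it, which is the improvement the commit message summarises.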
    • Michal Hocko's avatar
      vmstat: do not use deferrable delayed work for vmstat_update · ba4877b9
      Michal Hocko authored
      Vinayak Menon has reported that an excessive number of tasks was throttled
      in the direct reclaim inside too_many_isolated() because NR_ISOLATED_FILE
      was relatively high compared to NR_INACTIVE_FILE.  However it turned out
      that the real number of NR_ISOLATED_FILE was 0 and the per-cpu
      vm_stat_diff wasn't transferred into the global counter.
      
      vmstat_work which is responsible for the sync is defined as deferrable
      delayed work which means that the defined timeout doesn't wake up an idle
      CPU.  A CPU might stay in an idle state for a long time and general effort
      is to keep such a CPU in this state as long as possible which might lead
      to all sorts of troubles for vmstat consumers as can be seen with the
      excessive direct reclaim throttling.
      
      This patch basically reverts 39bf6270 ("VM statistics: Make timer
      deferrable") but it shouldn't cause any problems for idle CPUs because
      only CPUs with an active per-cpu drift are woken up since 7cc36bbd
      ("vmstat: on-demand vmstat workers v8") and CPUs which are idle for a
      longer time shouldn't have per-cpu drift.
      
      Fixes: 39bf6270 (VM statistics: Make timer deferrable)
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reported-by: Vinayak Menon <vinmenon@codeaurora.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba4877b9
    • Vlastimil Babka's avatar
      mm: more aggressive page stealing for UNMOVABLE allocations · 9c0415eb
      Vlastimil Babka authored
      When allocation falls back to stealing free pages of another migratetype,
      it can decide to steal extra pages, or even the whole pageblock in order
      to reduce fragmentation, which could happen if further allocation
      fallbacks pick a different pageblock.  In try_to_steal_freepages(), one of
      the situations where extra pages are stolen happens when we are trying to
      allocate a MIGRATE_RECLAIMABLE page.
      
      However, MIGRATE_UNMOVABLE allocations are not treated the same way,
      although spreading such allocation over multiple fallback pageblocks is
      arguably even worse than it is for RECLAIMABLE allocations.  To minimize
      fragmentation, we should minimize the number of such fallbacks, and thus
      steal as much as is possible from each fallback pageblock.
      
      Note that in theory this might put more pressure on movable pageblocks and
      cause movable allocations to steal back from unmovable pageblocks.
      However, movable allocations are not as aggressive with stealing, and do
      not cause permanent fragmentation, so the tradeoff is reasonable, and
      evaluation seems to support the change.
      
      This patch thus adds a check for MIGRATE_UNMOVABLE to the decision to
      steal extra free pages.  When evaluating with stress-highalloc from
      mmtests, this has reduced the number of MIGRATE_UNMOVABLE fallbacks to
      roughly 1/6.  The number of these fallbacks stealing from MIGRATE_MOVABLE
      block is reduced to 1/3.  There was no observation of growing number of
      unmovable pageblocks over time, and also not of increased movable
      allocation fallbacks.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c0415eb
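The decision changed by the commit above can be pictured as a predicate over the requesting migratetype. The following is a hypothetical user-space model, not the kernel's try_to_steal_freepages(): the constant values, the function name and the `patched` flag are invented for illustration, and the large-order condition is an assumption that is not stated in the commit message.

```python
# Illustrative constants; the real kernel enum values are not taken from the source.
MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE = range(3)
PAGEBLOCK_ORDER = 9  # assumption: 2 MB pageblocks made of 4 kB pages


def steals_extra_pages(start_type, current_order, patched=True):
    """Return True if a fallback allocation should steal more than it needs."""
    # Assumption: sufficiently large fallback pages always trigger extra stealing.
    if current_order >= PAGEBLOCK_ORDER // 2:
        return True
    # Stated in the commit: RECLAIMABLE allocations already steal extra pages.
    if start_type == MIGRATE_RECLAIMABLE:
        return True
    # The change: UNMOVABLE allocations are now treated the same way.
    if patched and start_type == MIGRATE_UNMOVABLE:
        return True
    return False


print(steals_extra_pages(MIGRATE_UNMOVABLE, 0, patched=False))
print(steals_extra_pages(MIGRATE_UNMOVABLE, 0, patched=True))
```

In this model only the small-order UNMOVABLE case changes between the unpatched and patched variants; MOVABLE fallbacks keep stealing extra pages only for large orders.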
    • Vlastimil Babka's avatar
      mm: always steal split buddies in fallback allocations · 3a1086fb
      Vlastimil Babka authored
      When allocation falls back to another migratetype, it will steal a page
      with highest available order, and (depending on this order and desired
      migratetype), it might also steal the rest of free pages from the same
      pageblock.
      
      Given the preference of highest available order, it is likely that it will
      be higher than the desired order, and result in the stolen buddy page
      being split.  The remaining pages after split are currently stolen only
      when the rest of the free pages are stolen.  This can however lead to
      situations where for MOVABLE allocations we split e.g.  order-4 fallback
      UNMOVABLE page, but steal only order-0 page.  Then on the next MOVABLE
      allocation (which may be batched to fill the pcplists) we split another
      order-3 or higher page, etc.  By stealing all pages that we have split, we
      can avoid further stealing.
      
      This patch therefore adjusts the page stealing so that buddy pages created
      by split are always stolen.  This has effect only on MOVABLE allocations,
      as RECLAIMABLE and UNMOVABLE allocations already always do that in
      addition to stealing the rest of free pages from the pageblock.  The
      change also allows simplifying try_to_steal_freepages() and factoring out
      CMA handling.
      
      According to Mel, it has been intended since the beginning that buddy
      pages after split would be stolen always, but it doesn't seem like it was
      ever the case until commit 47118af0 ("mm: mmzone: MIGRATE_CMA
      migration type added").  The commit has unintentionally introduced this
      behavior, but was reverted by commit 0cbef29a ("mm:
      __rmqueue_fallback() should respect pageblock type").  Neither included
      evaluation.
      
      My evaluation with stress-highalloc from mmtests shows about 2.5x
      reduction of page stealing events for MOVABLE allocations, without
      affecting the page stealing events for other allocation migratetypes.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3a1086fb
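The order-4 example in the commit above can be worked through with a small sketch of buddy splitting. The helper names are hypothetical, and the model only counts pages; it does not represent the kernel's free lists.

```python
def split_remainders(high_order, low_order):
    """Orders of the free buddies left after carving one low_order page
    out of a single high_order page by repeated halving."""
    return list(range(high_order - 1, low_order - 1, -1))


def page_count(orders):
    """Total number of base pages covered by blocks of the given orders."""
    return sum(1 << order for order in orders)


# Splitting an order-4 page (16 base pages) to satisfy an order-0 request.
leftover = split_remainders(4, 0)
print(leftover, page_count(leftover))
```

Serving one order-0 request from an order-4 page leaves buddies of orders 3, 2, 1 and 0, or 15 base pages. If only the order-0 page is stolen, those 15 pages stay with the original migratetype and the next MOVABLE allocation has to split another fallback page; stealing the split buddies as well avoids that.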