1. 29 Jul, 2016 32 commits
    • Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs · e7b4f2d8
      Linus Torvalds authored
      Pull overlayfs update from Miklos Szeredi:
       "First of all, this fixes a regression in overlayfs introduced by the
        dentry hash salting.  I've moved the patch fixing this to the front of
        the queue, so if (god forbid) something needs to be bisected in
        overlayfs this regression won't interfere with that.
      
        The biggest part is preparation for selinux support, done by Vivek
        Goyal.  Essentially this makes all operations on underlying
        filesystems be done with credentials of mounter.  This makes
        everything nicely consistent.
      
        There are also fixes for a number of known and recently discovered
        non-standard behaviors (thanks to Eryu Guan for testing and improving
        the test suites)"
      
      * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (23 commits)
        ovl: simplify empty checking
        qstr: constify instances in overlayfs
        ovl: clear nlink on rmdir
        ovl: disallow overlayfs as upperdir
        ovl: fix warning
        ovl: remove duplicated include from super.c
        ovl: append MAY_READ when diluting write checks
        ovl: dilute permission checks on lower only if not special file
        ovl: fix POSIX ACL setting
        ovl: share inode for hard link
        ovl: store real inode pointer in ->i_private
        ovl: permission: return ECHILD instead of ENOENT
        ovl: update atime on upper
        ovl: fix sgid on directory
        ovl: simplify permission checking
        ovl: do not require mounter to have MAY_WRITE on lower
        ovl: do operations on underlying file system in mounter's context
        ovl: modify ovl_permission() to do checks on two inodes
        ovl: define ->get_acl() for overlay inodes
        ovl: move some common code in a function
        ...
      e7b4f2d8
    • Merge tag 'freevxfs-for-4.8' of git://git.infradead.org/users/hch/freevxfs · 0a7736d0
      Linus Torvalds authored
      Pull freevxfs updates from Christoph Hellwig:
       "Support for foreign endianness and HP-UX superblocks from
        Krzysztof Błaszkowski"
      
      * tag 'freevxfs-for-4.8' of git://git.infradead.org/users/hch/freevxfs:
        freevxfs: update Kconfig information
        freevxfs: refactor readdir and lookup code
        freevxfs: fix lack of inode initialization
        freevxfs: fix memory leak in vxfs_read_fshead()
        freevxfs: update documentation and cresdits for HP-UX support
        freevxfs: implement ->alloc_inode and ->destroy_inode
        freevxfs: avoid the need for forward declaring the super operations
        freevxfs: move VFS inode allocation into vxfs_blkiget and vxfs_stiget
        freevxfs: remove vxfs_put_fake_inode
        freevxfs: handle big endian HP-UX file systems
      0a7736d0
    • Merge tag 'configfs-for-4.8' of git://git.infradead.org/users/hch/configfs · a54809f1
      Linus Torvalds authored
      Pull configfs update from Christoph Hellwig:
       "A simple error handling fix from Tal Shorer"
      
      * tag 'configfs-for-4.8' of git://git.infradead.org/users/hch/configfs:
        configfs: don't set buffer_needs_fill to zero if show() returns error
      a54809f1
    • Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6 · b0c4e2ac
      Linus Torvalds authored
      Pull CIFS/SMB3 fixes from Steve French:
       "Various CIFS/SMB3 fixes, most for stable"
      
      * 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
        CIFS: Fix a possible invalid memory access in smb2_query_symlink()
        fs/cifs: make share unaccessible at root level mountable
        cifs: fix crash due to race in hmac(md5) handling
        cifs: unbreak TCP session reuse
        cifs: Check for existing directory when opening file with O_CREAT
        Add MF-Symlinks support for SMB 2.0
      b0c4e2ac
    • ovl: simplify empty checking · 30c17ebf
      Miklos Szeredi authored
      The empty checking logic is duplicated in ovl_check_empty_and_clear() and
      ovl_remove_and_whiteout(), except the condition for clearing whiteouts is
      different:
      
      ovl_check_empty_and_clear() checked for being upper
      
      ovl_remove_and_whiteout() checked for merge OR lower
      
      Move the intersection of those checks (upper AND merge) into
      ovl_check_empty_and_clear() and simplify ovl_remove_and_whiteout().
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      30c17ebf
    • qstr: constify instances in overlayfs · 29c42e80
      Al Viro authored
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      29c42e80
    • ovl: clear nlink on rmdir · dbc816d0
      Miklos Szeredi authored
      To make delete notification work on fa/inotify.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      dbc816d0
    • ovl: disallow overlayfs as upperdir · 76bc8e28
      Miklos Szeredi authored
      This does not work and does not make sense.  So instead of fixing it
      (probably not hard) just disallow.
      Reported-by: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Cc: <stable@vger.kernel.org>
      76bc8e28
    • ovl: fix warning · 656189d2
      Miklos Szeredi authored
      There's a superfluous newline in the warning message in ovl_d_real().
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      656189d2
    • ovl: remove duplicated include from super.c · 5f215013
      Wei Yongjun authored
      Remove duplicated include.
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      5f215013
    • ovl: append MAY_READ when diluting write checks · 500cac3c
      Vivek Goyal authored
      Right now we remove MAY_WRITE/MAY_APPEND bits from mask if realfile is on
      lower/. This is done as files on lower will never be written and will be
      copied up. But to copy up a file, mounter should have MAY_READ permission
      otherwise copy up will fail. So set MAY_READ in mask when MAY_WRITE is
      reset.
      
      Dan Walsh noticed this when he did access(lowerfile, W_OK) and it returned
      True (context mounts) but when he tried to actually write to file, it
      failed as mounter did not have permission on lower file.
      
      [SzM] don't set MAY_READ if only MAY_APPEND is set without MAY_WRITE; this
      won't trigger a copy-up.
      Reported-by: Dan Walsh <dwalsh@redhat.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      500cac3c
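The mask adjustment described in 500cac3c can be illustrated with a small userspace sketch. This is not the kernel code: the helper name `ovl_sketch_real_mask` and the `on_lower` parameter are hypothetical, and the `MAY_*` values are defined locally for illustration (they are assumed to match the kernel's permission flags).

```c
#include <stdbool.h>

/* Permission mask bits, defined locally for this sketch (assumed to match
 * the kernel's MAY_* flags). */
#define MAY_EXEC   0x1
#define MAY_WRITE  0x2
#define MAY_READ   0x4
#define MAY_APPEND 0x8

/*
 * Hypothetical helper: compute the mask used when checking the mounter's
 * access to the real inode. `on_lower` says whether the real file lives on
 * the lower layer.
 */
static inline int ovl_sketch_real_mask(int mask, bool on_lower)
{
	if (on_lower && (mask & MAY_WRITE)) {
		/* Lower files are never written in place; they get copied up. */
		mask &= ~(MAY_WRITE | MAY_APPEND);
		/* Copy-up reads the lower file, so the mounter must be able to read it. */
		mask |= MAY_READ;
	}
	/* MAY_APPEND without MAY_WRITE does not trigger a copy-up: mask unchanged. */
	return mask;
}
```

With this rule, a write request against a lower file is checked as a read request for the mounter, while a request carrying only MAY_APPEND passes through unmodified.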
    • ovl: dilute permission checks on lower only if not special file · e29841a0
      Vivek Goyal authored
      Right now if file is on lower/, we remove MAY_WRITE/MAY_APPEND bits from
      mask as lower/ will never be written and file will be copied up. But this
      is not true for special files. These files are not copied up and are opened
      in place. So don't dilute the checks for these types of files.
      Reported-by: Dan Walsh <dwalsh@redhat.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      e29841a0
    • ovl: fix POSIX ACL setting · d837a49b
      Miklos Szeredi authored
      Setting POSIX ACL needs special handling:
      
      1) Some permission checks are done by ->setxattr() which now uses mounter's
      creds ("ovl: do operations on underlying file system in mounter's
      context").  These permission checks need to be done with current cred as
      well.
      
      2) Setting ACL can fail for various reasons.  We do not need to copy up in
      these cases.
      
      In the meantime switch to using generic_setxattr.
      
      [Arnd Bergmann] Fix link error without POSIX ACL. posix_acl_from_xattr()
      doesn't have a 'static inline' implementation when CONFIG_FS_POSIX_ACL is
      disabled, and I could not come up with an obvious way to do it.
      
      This instead avoids the link error by defining two sets of ACL operations
      and letting the compiler drop one of the two at compile time depending
      on CONFIG_FS_POSIX_ACL. This avoids all references to the ACL code,
      also leading to smaller code.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      d837a49b
    • ovl: share inode for hard link · 51f7e52d
      Miklos Szeredi authored
      Inode attributes are copied up to overlay inode (uid, gid, mode, atime,
      mtime, ctime) so generic code using these fields works correctly.  If a hard
      link is created in overlayfs separate inodes are allocated for each link.
      If chmod/chown/etc. is performed on one of the links then the inode
      belonging to the other ones won't be updated.
      
      This patch attempts to fix this by sharing inodes for hard links.
      
      Use inode hash (with real inode pointer as a key) to make sure overlay
      inodes are shared for hard links on upper.  Hard links on lower are still
      split (which is not user observable until the copy-up happens, see
      Documentation/filesystems/overlayfs.txt under "Non-standard behavior").
      
      The inode is only inserted in the hash if it is non-directory and upper.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      51f7e52d
    • ovl: store real inode pointer in ->i_private · 39b681f8
      Miklos Szeredi authored
      To get from overlay inode to real inode we currently use 'struct
      ovl_entry', which has lifetime connected to overlay dentry.  This is okay,
      since each overlay dentry had a new overlay inode allocated.
      
      Following patch will break that assumption, so need to leave out ovl_entry.
      This patch stores the real inode directly in i_private, with the lowest bit
      used to indicate whether the inode is upper or lower.
      
      Lifetime rules remain, using ovl_inode_real() must only be done while
      caller holds ref on overlay dentry (and hence on real dentry), or within
      RCU protected regions.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      39b681f8
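The "lowest bit as upper/lower flag" idea from 39b681f8 is a form of pointer tagging. A minimal userspace sketch follows, assuming the pointed-to object is aligned to at least two bytes so bit 0 of its address is always zero. The struct and all `ovl_sketch_*` names are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the underlying (real) inode; int alignment keeps bit 0 free. */
struct real_inode {
	int ino;
};

/* Bit 0 of the stored value marks the real inode as "upper". */
#define OVL_SKETCH_UPPER_BIT ((uintptr_t)1)

/* Pack the real inode pointer and the upper flag into one pointer-sized value. */
static inline void *ovl_sketch_pack(struct real_inode *real, bool is_upper)
{
	return (void *)((uintptr_t)real | (is_upper ? OVL_SKETCH_UPPER_BIT : 0));
}

/* Recover the real inode pointer by masking off the flag bit. */
static inline struct real_inode *ovl_sketch_real(void *i_private)
{
	return (struct real_inode *)((uintptr_t)i_private & ~OVL_SKETCH_UPPER_BIT);
}

/* Test the flag bit. */
static inline bool ovl_sketch_is_upper(void *i_private)
{
	return ((uintptr_t)i_private & OVL_SKETCH_UPPER_BIT) != 0;
}
```

The sketch only shows the encoding. The lifetime rules stated in the commit message (hold a reference on the overlay dentry, or stay within an RCU-protected region) are not modeled here.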
    • ovl: permission: return ECHILD instead of ENOENT · a999d7e1
      Miklos Szeredi authored
      The error is due to RCU and is temporary.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      a999d7e1
    • ovl: update atime on upper · d719e8f2
      Miklos Szeredi authored
      Fix atime update logic in overlayfs.
      
      This patch adds an i_op->update_time() handler to overlayfs inodes.  This
      forwards atime updates to the upper layer only.  No atime updates are done
      on lower layers.
      
      Remove implicit atime updates to underlying files and directories with
      O_NOATIME.  Remove explicit atime update in ovl_readlink().
      
      Clear atime related mnt flags from cloned upper mount.  This means atime
      updates are controlled purely by overlayfs mount options.
      
      Reported-by: Konstantin Khlebnikov <koct9i@gmail.com> 
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      d719e8f2
    • ovl: fix sgid on directory · bb0d2b8a
      Miklos Szeredi authored
      When creating directory in workdir, the group/sgid inheritance from the
      parent dir was omitted completely.  Fix this by calling inode_init_owner()
      on overlay inode and using the resulting uid/gid/mode to create the file.
      
      Unfortunately the sgid bit can be stripped off due to umask, so need to
      reset the mode in this case in workdir before moving the directory in
      place.
      Reported-by: Eryu Guan <eguan@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      bb0d2b8a
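The group/sgid inheritance that bb0d2b8a restores follows the usual POSIX-style rule: a new entry created in a setgid directory takes the directory's group, and a new subdirectory also inherits the setgid bit. The following userspace sketch models that rule only. It is not the kernel's inode_init_owner(); the struct, the `SK_*` constants and the function name are assumptions made for illustration.

```c
/* Mode bits defined locally for the sketch (conventional octal values). */
#define SK_IFMT   0170000u
#define SK_IFDIR  0040000u
#define SK_IFREG  0100000u
#define SK_ISGID  0002000u

/* Hypothetical, simplified inode: owner, group and mode only. */
struct sketch_inode {
	unsigned int uid;
	unsigned int gid;
	unsigned int mode;
};

/*
 * Initialize ownership of a new inode created in `dir` by a task whose
 * filesystem uid/gid are `fsuid`/`fsgid`.
 */
static inline void sketch_init_owner(struct sketch_inode *inode,
				     const struct sketch_inode *dir,
				     unsigned int mode,
				     unsigned int fsuid, unsigned int fsgid)
{
	inode->uid = fsuid;
	if (dir && (dir->mode & SK_ISGID)) {
		/* setgid parent: inherit its group ... */
		inode->gid = dir->gid;
		/* ... and propagate the setgid bit to new subdirectories. */
		if ((mode & SK_IFMT) == SK_IFDIR)
			mode |= SK_ISGID;
	} else {
		inode->gid = fsgid;
	}
	inode->mode = mode;
}
```

The second paragraph of the commit message concerns a separate step that this sketch does not cover: the sgid bit can be lost to the umask during creation, so the mode has to be reset in workdir before the directory is moved into place.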
    • ovl: simplify permission checking · 9c630ebe
      Miklos Szeredi authored
      The fact that we always do permission checking on the overlay inode and
      clear MAY_WRITE for checking access to the lower inode allows cruft to be
      removed from ovl_permission().
      
      1) "default_permissions" option effectively did generic_permission() on the
      overlay inode with i_mode, i_uid and i_gid updated from underlying
      filesystem.  This is what we do by default now.  It did the update using
      vfs_getattr() but that's only needed if the underlying filesystem can
      change (which is not allowed).  We may later introduce a "paranoia_mode"
      that verifies that mode/uid/gid are not changed.
      
      2) splitting out the IS_RDONLY() check from inode_permission() also becomes
      unnecessary once we remove the MAY_WRITE from the lower inode check.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      9c630ebe
    • ovl: do not require mounter to have MAY_WRITE on lower · 754f8cb7
      Vivek Goyal authored
      Now we have two levels of checks in ovl_permission(). overlay inode
      is checked with the creds of task while underlying inode is checked
      with the creds of mounter.
      
      Looks like mounter does not have to have WRITE access to files on lower/.
      So remove the MAY_WRITE from access mask for checks on underlying
      lower inode.
      
      This means task should still have the MAY_WRITE permission on lower
      inode and mounter is not required to have MAY_WRITE.
      
      It also solves the problem of read only NFS mounts being used as lower.
      If __inode_permission(lower_inode, MAY_WRITE) is called on read only
      NFS, it fails. By resetting MAY_WRITE, check succeeds and case of
      read only NFS should work with overlay without having to specify any
      special mount options (default permission).
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      754f8cb7
    • ovl: do operations on underlying file system in mounter's context · 1175b6b8
      Vivek Goyal authored
      Given we are now doing checks both on overlay inode as well as underlying
      inode, we should be able to do checks and operations on underlying file
      system using mounter's context.
      
      So modify all operations to do checks/operations on underlying dentry/inode
      in the context of mounter.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      1175b6b8
    • ovl: modify ovl_permission() to do checks on two inodes · c0ca3d70
      Vivek Goyal authored
      Right now ovl_permission() calls __inode_permission(realinode), to do
      permission checks on real inode and no checks are done on overlay inode.
      
      Modify it to do checks both on overlay inode as well as underlying inode.
      Checks on overlay inode will be done with the creds of calling task while
      checks on underlying inode will be done with the creds of mounter.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      c0ca3d70
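The two-level check introduced by c0ca3d70 can be pictured as two calls to one permission routine with different credentials: first the overlay inode against the calling task, then the underlying inode against the mounter. The sketch below is a simplified userspace model. The credential and inode structs, the owner/other permission model and every `sketch_*` name are hypothetical; the real ovl_permission() works on kernel inodes and creds.

```c
#include <errno.h>

/* Permission mask bits, defined locally for this sketch. */
#define MAY_EXEC   0x1
#define MAY_WRITE  0x2
#define MAY_READ   0x4

/* Hypothetical credential: just a user id. */
struct sketch_cred {
	unsigned int uid;
};

/* Hypothetical inode: an owner plus owner/other permission masks. */
struct sketch_inode {
	unsigned int owner;
	int owner_perm;
	int other_perm;
};

/* Check `mask` on one inode with one set of credentials. */
static inline int sketch_check(const struct sketch_inode *inode,
			       const struct sketch_cred *cred, int mask)
{
	int allowed = (cred->uid == inode->owner) ? inode->owner_perm
						  : inode->other_perm;

	return (mask & ~allowed) ? -EACCES : 0;
}

/* Both checks must pass: task on the overlay inode, mounter on the real inode. */
static inline int sketch_ovl_permission(const struct sketch_inode *overlay,
					const struct sketch_inode *real,
					const struct sketch_cred *task,
					const struct sketch_cred *mounter,
					int mask)
{
	int err = sketch_check(overlay, task, mask);

	if (err)
		return err;
	return sketch_check(real, mounter, mask);
}
```

Access is granted only when both levels agree, so a task that passes the overlay check can still be refused if the mounter lacks access to the underlying inode.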
    • ovl: define ->get_acl() for overlay inodes · 39a25b2b
      Vivek Goyal authored
      Now we are planning to do DAC permission checks on overlay inode
      itself. And to make it work, we will need to make sure we can get acls from
      underlying inode. So define ->get_acl() for overlay inodes and this in turn
      calls into underlying filesystem to get acls, if any.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      39a25b2b
    • ovl: move some common code in a function · 72e48481
      Vivek Goyal authored
      ovl_create_upper() and ovl_create_over_whiteout() seem to be sharing some
      common code which can be moved into a separate function.  No functionality
      change.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      72e48481
    • ovl: store ovl_entry in inode->i_private for all inodes · 58ed4e70
      Andreas Gruenbacher authored
      Previously this was only done for directory inodes.  Doing so for all
      inodes makes for a nice cleanup in ovl_permission at zero cost.
      
      Inodes are not shared for hard links on the overlay, so this works fine.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      58ed4e70
    • ovl: use generic_delete_inode · eead4f2d
      Miklos Szeredi authored
      No point in keeping overlay inodes around since they will never be reused.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      eead4f2d
    • ovl: check mounter creds on underlying lookup · c1b2cc1a
      Miklos Szeredi authored
      The hash salting changes meant that we can no longer reuse the hash in the
      overlay dentry to look up the underlying dentry.
      
      Instead of lookup_hash(), use lookup_one_len_unlocked() and switch to
      mounter's creds (like we do for all other operations later in the series).
      
      Now the lookup_hash() export introduced in 4.6 by 3c9fe8cd ("vfs: add
      lookup_hash() helper") is unused and can possibly be removed; its
      usefulness negated by the hash salting and the idea that mounter's creds
      should be used on operations on underlying filesystems.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Fixes: 8387ff25 ("vfs: make the string hashes salt the hash")
      c1b2cc1a
    • Merge tag 'trace-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · c624c866
      Linus Torvalds authored
      Pull tracing updates from Steven Rostedt:
       "This is mostly clean ups and small fixes.  Some of the more visible
        changes are:
      
         - The function pid code uses the event pid filtering logic
         - [ku]probe events have access to current->comm
         - trace_printk now has sample code
         - PCI devices now trace physical addresses
         - stack tracing has fewer unnecessary functions traced"
      
      * tag 'trace-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        printk, tracing: Avoiding unneeded blank lines
        tracing: Use __get_str() when manipulating strings
        tracing, RAS: Cleanup on __get_str() usage
        tracing: Use outer () on __get_str() definition
        ftrace: Reduce size of function graph entries
        tracing: Have HIST_TRIGGERS select TRACING
        tracing: Using for_each_set_bit() to simplify trace_pid_write()
        ftrace: Move toplevel init out of ftrace_init_tracefs()
        tracing/function_graph: Fix filters for function_graph threshold
        tracing: Skip more functions when doing stack tracing of events
        tracing: Expose CPU physical addresses (resource values) for PCI devices
        tracing: Show the preempt count of when the event was called
        tracing: Add trace_printk sample code
        tracing: Choose static tp_printk buffer by explicit nesting count
        tracing: expose current->comm to [ku]probe events
        ftrace: Have set_ftrace_pid use the bitmap like events do
        tracing: Move pid_list write processing into its own function
        tracing: Move the pid_list seq_file functions to be global
        tracing: Move filtered_pid helper functions into trace.c
        tracing: Make the pid filtering helper functions global
      c624c866
    • Merge tag 'vfio-v4.8-rc1' of git://github.com/awilliam/linux-vfio · e55884d2
      Linus Torvalds authored
      Pull VFIO updates from Alex Williamson:
       - Enable no-iommu mode for platform devices (Peng Fan)
       - Sub-page mmap for exclusive pages (Yongji Xie)
       - Use-after-free fix (Ilya Lesokhin)
       - Support for ACPI-based platform devices (Sinan Kaya)
      
      * tag 'vfio-v4.8-rc1' of git://github.com/awilliam/linux-vfio:
        vfio: platform: check reset call return code during release
        vfio: platform: check reset call return code during open
        vfio, platform: make reset driver a requirement by default
        vfio: platform: call _RST method when using ACPI
        vfio: platform: add extra debug info argument to call reset
        vfio: platform: add support for ACPI probe
        vfio: platform: determine reset capability
        vfio: platform: move reset call to a common function
        vfio: platform: rename reset function
        vfio: fix possible use after free of vfio group
        vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive
        vfio: platform: support No-IOMMU mode
      e55884d2
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md · 867900b5
      Linus Torvalds authored
      Pull MD updates from Shaohua Li:
       - A bunch of patches from Neil Brown to fix RCU usage
       - Two performance improvement patches from Tomasz Majchrzak
       - Alexey Obitotskiy fixes module refcount issue
       - Arnd Bergmann fixes time granularity
       - Cong Wang fixes a list corruption issue
       - Guoqing Jiang fixes a deadlock in md-cluster
       - A null pointer dereference fix from me
       - Song Liu fixes misuse of raid6 rmw
       - Other trivial/cleanup fixes from Guoqing Jiang and Xiao Ni
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (28 commits)
        MD: fix null pointer deference
        raid10: improve random reads performance
        md: add missing sysfs_notify on array_state update
        Fix kernel module refcount handling
        md: use seconds granularity for error logging
        md: reduce the number of synchronize_rcu() calls when multiple devices fail.
        md: be extra careful not to take a reference to a Faulty device.
        md/multipath: add rcu protection to rdev access in multipath_status.
        md/raid5: add rcu protection to rdev accesses in raid5_status.
        md/raid5: add rcu protection to rdev accesses in want_replace
        md/raid5: add rcu protection to rdev accesses in handle_failed_sync.
        md/raid1: add rcu protection to rdev in fix_read_error
        md/raid1: small code cleanup in end_sync_write
        md/raid1: small cleanup in raid1_end_read/write_request
        md/raid10: simplify print_conf a little.
        md/raid10: minor code improvement in fix_read_error()
        md/raid10: add rcu protection to rdev access during reshape.
        md/raid10: add rcu protection to rdev access in raid10_sync_request.
        md/raid10: add rcu protection in raid10_status.
        md/raid10: fix refounct imbalance when resyncing an array with a replacement device.
        ...
      867900b5
    • Merge tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · f0c98ebc
      Linus Torvalds authored
      Pull libnvdimm updates from Dan Williams:
      
       - Replace pcommit with ADR / directed-flushing.
      
         The pcommit instruction, which has not shipped on any product, is
         deprecated.  Instead, the requirement is that platforms implement
         either ADR, or provide one or more flush addresses per nvdimm.
      
         ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
         to the memory controller on a power-fail event.
      
         Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
         Interface Table (NFIT) sub-structure: "Flush Hint Address Structure".
         A flush hint is an mmio address that when written and fenced assures
         that all previous posted writes targeting a given dimm have been
         flushed to media.
      
       - On-demand ARS (address range scrub).
      
         Linux uses the results of the ACPI ARS commands to track bad blocks
         in pmem devices.  When latent errors are detected we re-scrub the
         media to refresh the bad block list, userspace can also request a
         re-scrub at any time.
      
       - Support for the Microsoft DSM (device specific method) command
         format.
      
       - Support for EDK2/OVMF virtual disk device memory ranges.
      
       - Various fixes and cleanups across the subsystem.
      
      * tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
        libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
        nfit: do an ARS scrub on hitting a latent media error
        nfit: move to nfit/ sub-directory
        nfit, libnvdimm: allow an ARS scrub to be triggered on demand
        libnvdimm: register nvdimm_bus devices with an nd_bus driver
        pmem: clarify a debug print in pmem_clear_poison
        x86/insn: remove pcommit
        Revert "KVM: x86: add pcommit support"
        nfit, tools/testing/nvdimm/: unify shutdown paths
        libnvdimm: move ->module to struct nvdimm_bus_descriptor
        nfit: cleanup acpi_nfit_init calling convention
        nfit: fix _FIT evaluation memory leak + use after free
        tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
        tools/testing/nvdimm: add virtual ramdisk range
        acpi, nfit: treat virtual ramdisk SPA as pmem region
        pmem: kill __pmem address space
        pmem: kill wmb_pmem()
        libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
        fs/dax: remove wmb_pmem()
        libnvdimm, pmem: flush posted-write queues on shutdown
        ...
      f0c98ebc
    • Merge tag 'pinctrl-v4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · d94ba9e7
      Linus Torvalds authored
      Pull pin control updates from Linus Walleij:
       "This is the bulk of pin control changes for the v4.8 kernel cycle.
      
        Nothing stands out as especially exciting: new drivers, new subdrivers,
        lots of cleanups and incremental features.
      
        Business as usual.
      
        New drivers:
      
         - New driver for Oxnas pin control and GPIO.  This ARM-based chipset
           is used in a few storage (NAS) type devices.
      
         - New driver for the MAX77620/MAX20024 pin controller portions.
      
         - New driver for the Intel Merrifield pin controller.
      
        New subdrivers:
      
         - New subdriver for the Qualcomm MDM9615
      
         - New subdriver for the STM32F746 MCU
      
         - New subdriver for the Broadcom NSP SoC.
      
        Cleanups:
      
         - Demodularization of bool compiled-in drivers.
      
        Apart from this there are just regular incremental improvements to a
        lot of drivers, especially Uniphier and PFC"
      
      * tag 'pinctrl-v4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (131 commits)
        pinctrl: fix pincontrol definition for marvell
        pinctrl: xway: fix typo
        Revert "pinctrl: amd: make it explicitly non-modular"
        pinctrl: iproc: Add NSP and Stingray GPIO support
        pinctrl: Update iProc GPIO DT bindings
        pinctrl: bcm: add OF dependencies
        pinctrl: ns2: remove redundant dev_err call in ns2_pinmux_probe()
        pinctrl: Add STM32F746 MCU support
        pinctrl: intel: Protect set wake flow by spin lock
        pinctrl: nsp: remove redundant dev_err call in nsp_pinmux_probe()
        pinctrl: uniphier: add Ethernet pin-mux settings
        sh-pfc: Use PTR_ERR_OR_ZERO() to simplify the code
        pinctrl: ns2: fix return value check in ns2_pinmux_probe()
        pinctrl: qcom: update DT bindings with ebi2 groups
        pinctrl: qcom: establish proper EBI2 pin groups
        pinctrl: imx21: Remove the MODULE_DEVICE_TABLE() macro
        Documentation: dt: Add new compatible to STM32 pinctrl driver bindings
        includes: dt-bindings: Add STM32F746 pinctrl DT bindings
        pinctrl: sunxi: fix nand0 function name for sun8i
        pinctrl: uniphier: remove pointless pin-mux settings for PH1-LD11
        ...
      d94ba9e7
  2. 28 Jul, 2016 8 commits
    • Merge branch 'akpm' (patches from Andrew) · 1c88e19b
      Linus Torvalds authored
      Merge more updates from Andrew Morton:
       "The rest of MM"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (101 commits)
        mm, compaction: simplify contended compaction handling
        mm, compaction: introduce direct compaction priority
        mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
        mm, page_alloc: make THP-specific decisions more generic
        mm, page_alloc: restructure direct compaction handling in slowpath
        mm, page_alloc: don't retry initial attempt in slowpath
        mm, page_alloc: set alloc_flags only once in slowpath
        lib/stackdepot.c: use __GFP_NOWARN for stack allocations
        mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB
        mm, kasan: account for object redzone in SLUB's nearest_obj()
        mm: fix use-after-free if memory allocation failed in vma_adjust()
        zsmalloc: Delete an unnecessary check before the function call "iput"
        mm/memblock.c: fix index adjustment error in __next_mem_range_rev()
        mem-hotplug: alloc new page from a nearest neighbor node when mem-offline
        mm: optimize copy_page_to/from_iter_iovec
        mm: add cond_resched() to generic_swapfile_activate()
        Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"
        mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode
        mm: hwpoison: remove incorrect comments
        make __section_nr() more efficient
        ...
      1c88e19b
    • mm, compaction: simplify contended compaction handling · c3486f53
      Vlastimil Babka authored
      Async compaction detects contention either due to failing trylock on
      zone->lock or lru_lock, or by need_resched().  Since 1f9efdef ("mm,
      compaction: khugepaged should not give up due to need_resched()") the
      code got quite complicated to distinguish these two up to the
      __alloc_pages_slowpath() level, so different decisions could be taken
      for khugepaged allocations.
      
      After the recent changes, khugepaged allocations don't check for
      contended compaction anymore, so we again don't need to distinguish lock
      and sched contention, and simplify the current convoluted code a lot.
      
      However, I believe it's also possible to simplify even more and
      completely remove the check for contended compaction after the initial
      async compaction for costly orders, which was originally aimed at THP
      page fault allocations.  There are several reasons why this can be done
      now:
      
      - with the new defaults, THP page faults no longer do reclaim/compaction at
        all, unless the system admin has overridden the default, or application has
        indicated via madvise that it can benefit from THP's. In both cases, it
        means that the potential extra latency is expected and worth the benefits.
      - even if reclaim/compaction proceeds after this patch where it previously
        wouldn't, the second compaction attempt is still async and will detect the
        contention and back off, if the contention persists
      - there are still heuristics like deferred compaction and pageblock skip bits
        in place that prevent excessive THP page fault latencies
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3486f53
    • mm, compaction: introduce direct compaction priority · a5508cd8
      Vlastimil Babka authored
      In the context of direct compaction, for some types of allocations we
      would like the compaction to either succeed or definitely fail while
      trying as hard as possible.  Current async/sync_light migration mode is
      insufficient, as there are heuristics such as caching scanner positions,
      marking pageblocks as unsuitable or deferring compaction for a zone.  At
      least the final compaction attempt should be able to override these
      heuristics.
      
      To communicate how hard compaction should try, we replace migration mode
      with a new enum compact_priority and change the relevant function
      signatures.  In compact_zone_order() where struct compact_control is
      constructed, the priority is mapped to suitable control flags.  This
      patch itself has no functional change, as the current priority levels
      are mapped back to the same migration modes as before.  Expanding them
      will be done next.
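
      The priority-to-mode mapping described above can be sketched as a
      small userspace model.  This is an illustration only: the struct
      compact_control_model type and make_compact_control() helper are
      hypothetical, and the enum names merely follow the commit text.

```c
/* Simplified stand-ins for the migration modes named in the commit. */
enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT };

/* The two priority levels that map back to the existing modes. */
enum compact_priority {
	COMPACT_PRIO_SYNC_LIGHT,
	COMPACT_PRIO_ASYNC,
};

/* Hypothetical, trimmed-down control structure for illustration. */
struct compact_control_model {
	int order;
	enum migrate_mode mode;
};

/*
 * Build the control structure from a priority.  Each priority maps to
 * exactly one migration mode, so behaviour is unchanged compared with
 * passing the mode directly.
 */
static struct compact_control_model
make_compact_control(int order, enum compact_priority prio)
{
	struct compact_control_model cc = {
		.order = order,
		.mode = (prio == COMPACT_PRIO_ASYNC) ?
				MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,
	};
	return cc;
}
```

      Keeping the translation in one place means callers only state how
      hard to try, and later priority levels can toggle additional
      control flags without touching the function signatures again.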
      
      Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is
      removed, as the only caller exists under CONFIG_COMPACTION.
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-8-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5508cd8
    • mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations · 25160354
      Vlastimil Babka authored
      After the previous patch, we can distinguish costly allocations that
      should be really lightweight, such as THP page faults, with
      __GFP_NORETRY.  This means we don't need to recognize khugepaged
      allocations via PF_KTHREAD anymore.  We can also change THP page faults
      in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
      khugepaged, as the process has indicated that it benefits from THP's and
      is willing to pay some initial latency costs.
      
      We can also make the flags handling less cryptic by distinguishing
      GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
      GFP_TRANSHUGE (only direct reclaim, khugepaged default).  Adding
      __GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.
      
      The patch effectively changes the current GFP_TRANSHUGE users as
      follows:
      
      * get_huge_zero_page() - the zero page lifetime should be relatively
        long and it's shared by multiple users, so it's worth spending some
        effort on it.  We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
        This also restores direct reclaim to this allocation, which was
        unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
        by default to madvise and add a stall-free defrag option")
      
      * alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
        is not an issue.  So if khugepaged "defrag" is enabled (the default), do
        reclaim via GFP_TRANSHUGE without __GFP_NORETRY.  We can remove the
        PF_KTHREAD check from page alloc.
      
        As a side-effect, khugepaged will now no longer check if the initial
        compaction was deferred or contended.  This is OK, as khugepaged sleep
        times between collapse attempts are long enough to prevent noticeable
        disruption, so we should allow it to spend some effort.
      
      * migrate_misplaced_transhuge_page() - already was masking out
        __GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
        equivalent.
      
      * alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
        are now allocating without __GFP_NORETRY.  Other vma's keep using
        __GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
        it's allowed only for madvised vma's).  The rest is conversion to
        GFP_TRANSHUGE(_LIGHT).
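
      The page fault case in the last item can be modelled roughly as
      follows.  This is a sketch under assumptions: the bit values are
      made up, the MODEL_* and F_* names are hypothetical, the several
      defrag settings are collapsed into one boolean, and
      __GFP_KSWAPD_RECLAIM handling is left out.

```c
#include <stdbool.h>

/* Hypothetical bit values, chosen only for this model. */
#define F_DIRECT_RECLAIM 0x1u
#define F_NORETRY        0x2u
#define F_THP_BASE       0x4u	/* stands in for the remaining THP flags */

/* LIGHT does no reclaim at all; the full variant adds direct reclaim. */
#define MODEL_TRANSHUGE_LIGHT F_THP_BASE
#define MODEL_TRANSHUGE       (MODEL_TRANSHUGE_LIGHT | F_DIRECT_RECLAIM)

/*
 * Pick the mask for a THP page fault:
 *  - direct reclaim/compaction not allowed: lightweight mask
 *  - allowed and the vma was madvised:      try hard, no NORETRY
 *  - allowed for a non-madvised vma:        reclaim, but give up early
 */
static unsigned int model_direct_gfpmask(bool vma_madvised,
					 bool direct_reclaim_allowed)
{
	if (!direct_reclaim_allowed)
		return MODEL_TRANSHUGE_LIGHT;
	if (vma_madvised)
		return MODEL_TRANSHUGE;
	return MODEL_TRANSHUGE | F_NORETRY;
}
```

      With the defaults described above, only madvised vma's would reach
      the second branch; all other faults would take the first one.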
      
      [mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
      Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25160354
    • mm, page_alloc: make THP-specific decisions more generic · 3eb2771b
      Vlastimil Babka authored
      Since THP allocations during page faults can be costly, extra decisions
      are employed for them to avoid excessive reclaim and compaction, if the
      initial compaction doesn't look promising.  The detection has never been
      perfect as there is no gfp flag specific to THP allocations.  At this
      moment it checks the whole combination of flags that makes up
      GFP_TRANSHUGE, and hopes that no other users of such combination exist,
      or would mind being treated the same way.  Extra care is also taken to
      separate allocations from khugepaged, where latency doesn't matter that
      much.
      
      It is however possible to distinguish these allocations in a simpler and
      more reliable way.  The key observation is that after the initial
      compaction followed by the first iteration of "standard"
      reclaim/compaction, both __GFP_NORETRY allocations and costly
      allocations without __GFP_REPEAT are declared as failures:
      
              /* Do not loop if specifically requested */
              if (gfp_mask & __GFP_NORETRY)
                      goto nopage;
      
              /*
               * Do not retry costly high order allocations unless they are
               * __GFP_REPEAT
               */
              if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
                      goto nopage;
      
      This means we can further distinguish allocations that are costly order
      *and* additionally include the __GFP_NORETRY flag.  As it happens,
      GFP_TRANSHUGE allocations do already fall into this category.  This will
      also allow other costly allocations with similar high-order benefit vs
      latency considerations to use this semantic.  Furthermore, we can
      distinguish THP allocations that should try a bit harder (such as from
      khugepaged) by removing __GFP_NORETRY, as will be done in the next
      patch.
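
      The resulting test can be written as a single predicate.  The
      following is a minimal userspace sketch, not the kernel code: the
      MODEL_GFP_NORETRY bit value and the helper name are hypothetical,
      while PAGE_ALLOC_COSTLY_ORDER uses the kernel's value of 3.

```c
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3	/* value used by the kernel */
#define MODEL_GFP_NORETRY 0x1u		/* hypothetical bit for this model */

/*
 * True for allocations that are both costly (order above the costly
 * threshold) and marked as not worth retrying.  Such requests should
 * stay lightweight and bail out if the initial compaction does not
 * look promising.
 */
static bool is_lightweight_costly(unsigned int gfp_mask, unsigned int order)
{
	return order > PAGE_ALLOC_COSTLY_ORDER &&
	       (gfp_mask & MODEL_GFP_NORETRY) != 0;
}
```

      Dropping the flag from a request is then enough to opt it out of
      the lightweight treatment, with no need to identify the caller.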
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-6-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3eb2771b
    • mm, page_alloc: restructure direct compaction handling in slowpath · a8161d1e
      Vlastimil Babka authored
      The retry loop in __alloc_pages_slowpath is supposed to keep trying
      reclaim and compaction (and OOM), until either the allocation succeeds,
      or returns with failure.  Success here is more probable when reclaim
      precedes compaction, as certain watermarks have to be met for compaction
      to even try, and more free pages increase the probability of compaction
      success.  On the other hand, starting with light async compaction (if
      the watermarks allow it), can be more efficient, especially for smaller
      orders, if there's enough free memory which is just fragmented.
      
      Thus, the current code starts with compaction before reclaim, and to
      make sure that the last reclaim is always followed by a final
      compaction, there's another direct compaction call at the end of the
      loop.  This makes the code hard to follow and adds some duplicated
      handling of migration_mode decisions.  It's also somewhat inefficient
      that even if reclaim or compaction decides not to retry, the final
      compaction is still attempted.  Some gfp flag combinations also shortcut
      these retry decisions by "goto noretry;", making it even harder to
      follow.
      
      This patch attempts to restructure the code with only minimal functional
      changes.  The call to the first compaction and THP-specific checks are
      now placed above the retry loop, and the "noretry" direct compaction is
      removed.
      
      The initial compaction is additionally restricted only to costly orders,
      as we can expect smaller orders to be held back by watermarks, and only
      larger orders to suffer primarily from fragmentation.  This better
      matches the checks in reclaim's shrink_zones().
      
      There are two other smaller functional changes.  One is that the upgrade
      from async migration to light sync migration will always occur after the
      initial compaction.  This is how it has been until the recent patch "mm,
      oom: protect !costly allocations some more", which introduced upgrading
      the mode based on COMPACT_COMPLETE result, but kept the final compaction
      always upgraded, which made it even more special.  It's better to return
      to the simpler handling for now, as migration modes will be further
      modified later in the series.
      
      The second change is that once both reclaim and compaction declare it's
      not worth retrying the reclaim/compact loop, there is no final
      compaction attempt.  As argued above, this is intentional.  If that
      final compaction were to succeed, it would be due to a wrong retry
      decision, or simply a race with somebody else freeing memory for us.
      
      The main outcome of this patch should be simpler code.  Logically, the
      initial compaction without reclaim is the exceptional case to the
      reclaim/compaction scheme, but prior to the patch, it was the last loop
      iteration that was exceptional.  Now the code matches the logic better.
      The change also enables the following patches.
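
      The ordering of steps after the restructuring can be illustrated
      with a toy model that only records which step runs when.  It
      assumes an allocation that never succeeds and gives up after a
      fixed number of loop iterations; model_slowpath() and trace() are
      hypothetical helpers, not kernel functions.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_ALLOC_COSTLY_ORDER 3	/* value used by the kernel */

/* Append one step name to the trace buffer, separated by spaces. */
static void trace(char *buf, size_t len, const char *step)
{
	if (buf[0] != '\0')
		strncat(buf, " ", len - strlen(buf) - 1);
	strncat(buf, step, len - strlen(buf) - 1);
}

/*
 * Record the step order for a failing allocation.  The caller must
 * pass a buffer with len > 0.
 */
static void model_slowpath(unsigned int order, int max_loops,
			   char *buf, size_t len)
{
	buf[0] = '\0';

	/* Exceptional case: compaction without reclaim, costly orders only. */
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		trace(buf, len, "compact(async)");

	/* Regular scheme: reclaim first, then compaction in the upgraded mode. */
	for (int i = 0; i < max_loops; i++) {
		trace(buf, len, "reclaim");
		trace(buf, len, "compact(sync_light)");
	}

	/* No extra compaction attempt once the loop gives up. */
}
```

      In this model the special case sits before the loop rather than
      after it, which is the structural point the message argues for.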
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-5-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8161d1e
    • mm, page_alloc: don't retry initial attempt in slowpath · 23771235
      Vlastimil Babka authored
      After __alloc_pages_slowpath() sets up new alloc_flags and wakes up
      kswapd, it first tries get_page_from_freelist() with the new
      alloc_flags, as it may succeed e.g. due to using min watermark instead
      of low watermark.  It makes sense to do this attempt before adjusting
      zonelist based on alloc_flags/gfp_mask, as it's still relatively a fast
      path if we just wake up kswapd and successfully allocate.
      
      This patch therefore moves the initial attempt above the retry label and
      slightly reorganizes the part below the retry label.  We still have to
      attempt get_page_from_freelist() on each retry, as some allocations
      cannot do that as part of direct reclaim or compaction, and yet are not
      allowed to fail (even though they do a WARN_ON_ONCE() and thus should
      not exist).  We can reuse the call meant for ALLOC_NO_WATERMARKS attempt
      and just set alloc_flags to ALLOC_NO_WATERMARKS if the context allows
      it.  As a side-effect, the attempts from direct reclaim/compaction will
      also no longer obey watermarks once this is set, but there's little harm
      in that.
      
      Kswapd wakeups are also done on each retry to be safe from potential
      races resulting in kswapd going to sleep while a process (that may not
      be able to reclaim by itself) is still looping.
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23771235
    • mm, page_alloc: set alloc_flags only once in slowpath · 31a6c190
      Vlastimil Babka authored
      In __alloc_pages_slowpath(), alloc_flags doesn't change after it's
      initialized, so move the initialization above the retry: label.  Also
      make the comment above the initialization more descriptive.
      
      The only exception to alloc_flags being constant is
      ALLOC_NO_WATERMARKS, which may change due to TIF_MEMDIE being set on the
      allocating thread.  We can fix this, and make the code simpler and a bit
      more effective at the same time, by moving the part that determines
      ALLOC_NO_WATERMARKS from gfp_to_alloc_flags() to gfp_pfmemalloc_allowed().
      
      This means we don't have to mask out ALLOC_NO_WATERMARKS in numerous
      places in __alloc_pages_slowpath() anymore.  The only two tests for the
      flag can instead call gfp_pfmemalloc_allowed().
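
      The idea of evaluating the no-watermarks decision at the point of
      use, instead of caching it in alloc_flags, can be sketched in
      userspace as below.  This is a simplified model: struct task_model,
      the MODEL_GFP_NOMEMALLOC bit and model_pfmemalloc_allowed() are
      hypothetical, and the real function takes more context into
      account than the single TIF_MEMDIE-like flag shown here.

```c
#include <stdbool.h>

#define MODEL_GFP_NOMEMALLOC 0x1u	/* hypothetical bit for this model */

/* Stands in for the allocating thread; memdie mirrors TIF_MEMDIE. */
struct task_model {
	bool memdie;
};

/*
 * Decide whether watermarks may be ignored right now.  Because the
 * task state is read on every call, a thread that gets marked between
 * two retries is noticed without recomputing any cached flags.
 */
static bool model_pfmemalloc_allowed(unsigned int gfp_mask,
				     const struct task_model *task)
{
	if (gfp_mask & MODEL_GFP_NOMEMALLOC)
		return false;	/* assumed veto: caller opted out of reserves */
	return task->memdie;
}
```

      A retry loop built on this would compute its other flags once and
      call the helper only at the two places that need the answer.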
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31a6c190