1. 27 May, 2016 14 commits
    • direct-io: fix direct write stale data exposure from concurrent buffered read · 9ecd10b7
      Eryu Guan authored
      Currently direct writes inside i_size on a DIO_SKIP_HOLES filesystem are
      not allowed to allocate blocks (get_more_blocks() sets 'create' to 0
      before calling the get_block() callback).  If it's a sparse file, direct
      writes fall back to buffered writes to avoid stale data exposure from
      a concurrent buffered read.  But two cases that can result in stale
      data exposure are not correctly detected.
      
      1. The detection for "writing inside i_size" is not sufficient, so
         writes can wrongly be treated as "extending writes".  For example,
         consider a direct write of 1FSB (file system block) to a 1FSB
         sparse file on ext2/3/4, starting from offset 0.  This write is
         inside i_size, but 'create' is non-zero, because 'block_in_file'
         and 'i_size_read(inode) >> blkbits' are both zero.

      2. Direct writes starting from or beyond i_size (not inside i_size)
         could also trigger block allocation and expose stale data.  For
         example, consider a sparse file with i_size of 2k, and a write to
         offset 2k or 3k into the file, with a filesystem block size of 4k.
         (Thanks to Jeff Moyer for pointing this case out in his review.)
      
      The first problem can be demonstrated by running ltp-aiodio test ADSP045
      many times.  When testing on extN filesystems, I see occasional test
      failures in which a buffered read returns non-zero (stale) data.
      
      ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 1
      
      dio_sparse    0  TINFO  :  Dirtying free blocks
      dio_sparse    0  TINFO  :  Starting I/O tests
      non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa
      non-zero read at offset 0
      dio_sparse    0  TINFO  :  Killing childrens(s)
      dio_sparse    1  TFAIL  :  dio_sparse.c:191: 1 children(s) exited abnormally
      
      The second problem can also be reproduced easily by a hacked dio_sparse
      program, which accepts an option to specify the write offset.
      
      What we should really do is to disable block allocation for writes that
      could result in filling holes inside i_size.
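
The two checks can be compared in a small model. This is a Python sketch, not the kernel code: `create_old` mirrors the comparison described in case 1 above, while `create_fixed` is one assumed way to express "disable allocation for writes that could fill holes inside i_size". Both function names are hypothetical, and the model assumes `block_in_file` is counted in filesystem blocks.

```python
def create_old(block_in_file, i_size, blkbits, is_write=True):
    """Check as described above: only blocks strictly below
    i_size >> blkbits are treated as 'inside i_size'."""
    create = 1 if is_write else 0
    if block_in_file < (i_size >> blkbits):
        create = 0
    return create


def create_fixed(block_in_file, i_size, blkbits, is_write=True):
    """Assumed corrected check: refuse allocation whenever the target
    block is at or below the block holding the last byte of the file."""
    create = 1 if is_write else 0
    if i_size > 0 and block_in_file <= ((i_size - 1) >> blkbits):
        create = 0
    return create


BLKBITS = 12  # 4k filesystem block

# Sparse file with i_size = 2k; a write at offset 0, 2k or 3k hits block 0.
print(create_old(0, 2048, BLKBITS), create_fixed(0, 2048, BLKBITS))
# A write at offset 4k hits block 1, which lies wholly beyond i_size.
print(create_old(1, 2048, BLKBITS), create_fixed(1, 2048, BLKBITS))
```

In this model the old check leaves 'create' set for block 0 of the 2k file, while the assumed fixed check clears it. Both checks still allow allocation for block 1, which is a true extending write.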
      
      Link: http://lkml.kernel.org/r/1463156728-13357-1-git-send-email-guaneryu@gmail.com
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Eryu Guan <guaneryu@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: bump up o2cb network protocol version · 38b52efd
      Junxiao Bi authored
      Two new messages are added to support negotiating the hb timeout.  Stop
      nodes that speak an old protocol version from mounting, as they would
      cause the negotiation to fail.
      
      Link: http://lkml.kernel.org/r/1464231615-27939-1-git-send-email-junxiao.bi@oracle.com
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: o2hb: fix hb hung time · 6633ca57
      Junxiao Bi authored
      hr_last_timeout_start should be set to the last time at which hb was
      still OK.  When the hb write times out, the hung time is (jiffies -
      hr_last_timeout_start).
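
As a rough illustration of the hung-time arithmetic above, the sketch below converts the jiffies difference to milliseconds. The function name and the HZ value of 250 are assumptions made for the example, and jiffies wraparound is ignored.

```python
HZ = 250  # assumed tick rate; the real value depends on the kernel config


def hung_time_ms(jiffies_now, hr_last_timeout_start, hz=HZ):
    """Return how long the heartbeat has been hung, in milliseconds."""
    return (jiffies_now - hr_last_timeout_start) * 1000 // hz


# Last good heartbeat at tick 2500, write timeout noticed at tick 10000.
print(hung_time_ms(10_000, 2_500))
```

With these numbers the hang lasted 7500 ticks, which is 30000 ms at 250 Hz.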
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Gang He <ghe@suse.com>
      Cc: rwxybh <rwxybh@126.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: o2hb: don't negotiate if last hb fail · 88dbe98d
      Junxiao Bi authored
      Sometimes an io error is returned when storage is down for a while.
      For example, with an iscsi device, storage is taken offline when the
      session times out, and this makes all io return -EIO.  In this case,
      nodes shouldn't negotiate the timeout but should fence themselves.  So
      let nodes fence themselves when o2hb_do_disk_heartbeat returns an
      error; this is the same behavior as o2hb without the negotiate timer.
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Gang He <ghe@suse.com>
      Cc: rwxybh <rwxybh@126.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: o2hb: add some user/debug log · 1bd12902
      Junxiao Bi authored
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Gang He <ghe@suse.com>
      Cc: rwxybh <rwxybh@126.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: o2hb: add NEGOTIATE_APPROVE message · e76f8237
      Junxiao Bi authored
      This message is used to re-queue the write timeout timer and the
      negotiate timer when all nodes suffer a hung write to storage.  This
      keeps nodes from fencing themselves if storage is down.
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Gang He <ghe@suse.com>
      Cc: rwxybh <rwxybh@126.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: o2hb: add NEGO_TIMEOUT message · 34069b88
      Junxiao Bi authored
      This message is sent to the master node when a non-master node's
      negotiate timer expires.  The master node records these nodes in a
      bitmap, which is used to decide whether to re-queue the write timeout
      timer.
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Gang He <ghe@suse.com>
      Cc: rwxybh <rwxybh@126.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: o2hb: add negotiate timer · e0cbb798
      Junxiao Bi authored
      This series of patches fixes the issue that when storage is down, all
      nodes fence themselves due to write timeout.
      
      With this patch set, all nodes keep going until storage is back
      online, unless one of the following happens, in which case all nodes
      fence themselves as before:
      
      1. an io error is returned
      2. the network between nodes is down
      3. nodes panic
      
      This patch (of 6):
      
      When storage is down, all nodes fence themselves due to write timeout.
      The negotiate timer is designed to avoid this; with it, a node waits
      until storage is up again.
      
      The negotiate timer works in the following way:
      
      1. The timer expires before the write timeout timer; its timeout is
         currently half of the write timeout.  It is re-queued along with the
         write timeout timer.  If it expires, it sends a NEGO_TIMEOUT message
         to the master node (the node with the lowest node number).  This
         message does nothing but mark a bit in a bitmap on the master node
         recording which nodes are negotiating timeout.
      
      2. If storage is down, nodes send this message to the master node.
         When the master node finds that its bitmap includes all online
         nodes, it sends a NEGO_APPROVE message to all nodes one by one; this
         message re-queues the write timeout timer and the negotiate timer.
         Any node that doesn't receive this message, or hits an issue while
         handling it, will be fenced.  If storage comes up at any time,
         o2hb_thread will run and re-queue all the timers, and nothing will
         be affected by these two steps.
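
The master-side bookkeeping in the two steps above can be sketched as a small model. This is an illustrative Python sketch under stated assumptions, not the o2hb implementation: the class and function names are hypothetical, node numbers are used directly as bit positions, and message delivery, timers and fencing are left out.

```python
def negotiate_timeout_ms(write_timeout_ms):
    """The negotiate timer fires at half of the write timeout."""
    return write_timeout_ms // 2


def master_node(online_nodes):
    """The master is the online node with the lowest node number."""
    return min(online_nodes)


class MasterState:
    """Hypothetical bookkeeping kept on the master node."""

    def __init__(self, online_nodes):
        self.online_mask = sum(1 << n for n in set(online_nodes))
        self.nego_bitmap = 0

    def on_nego_timeout(self, node):
        """Record a NEGO_TIMEOUT from `node`.

        Returns True once every online node has reported, which is the
        point at which the approval message would be sent to all nodes.
        """
        self.nego_bitmap |= 1 << node
        return (self.nego_bitmap & self.online_mask) == self.online_mask


online = [1, 3, 5]
state = MasterState(online)
print(master_node(online), negotiate_timeout_ms(60000))
for node in online:
    print(node, state.on_nego_timeout(node))
```

In this run the approval condition becomes true only after the last of the three online nodes has reported its timeout.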
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Gang He <ghe@suse.com>
      Cc: rwxybh <rwxybh@126.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild · dc03c0f9
      Linus Torvalds authored
      Pull misc kbuild updates from Michal Marek:
       "This is the non-critical part of kbuild:
      
         - Coccinelle fixes, one semantic patch less in this round [Vaishali
           Thakkar, Wolfram Sang, Kees Cook]
      
         - rpm-pkg support for (open)SUSE's update-bootloader [Jiří Kosina]
      
         - rpm-pkg restored support for $RPMOPTS [Srinivas Pandruvada]
      
         - deb-pkg fixes for the linux-headers package [Bjørn Mork, Azriel
           Samson]"
      
      * 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
        coccicheck: Fix missing 0 index in kill loop
        scripts/package/Makefile: rpmbuild add support of RPMOPTS
        builddeb: fix missing headers in linux-headers package
        builddeb: include objtool binary in headers package
        kbuild/mkspec: support 'update-bootloader'-based systems
        scripts: coccinelle: remove check to move constants to right
        Coccinelle: setup_timer: Add space in front of parentheses
    • Merge branch 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild · f429d355
      Linus Torvalds authored
      Pull kconfig update from Michal Marek:
      
       - fix for behavior of tristate choice items and fix for documentation
         of existing kconfig behavior [Dirk Gouders]
      
       - more helpful "unexpected data" kconfig warning [Paul Bolle]
      
      * 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
        kconfig/symbol.c: handle choice_values that depend on 'm' symbols
        kconfig-language: elaborate on the type of a choice
        kconfig-language: fix comment on dependency-generated menu structures.
        kconfig: add unexpected data itself to warning
    • Merge branch 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild · 5b26fc88
      Linus Torvalds authored
      Pull kbuild updates from Michal Marek:
      
       - new option CONFIG_TRIM_UNUSED_KSYMS which does a two-pass build and
         unexports symbols which are not used in the current config [Nicolas
         Pitre]
      
       - several kbuild rule cleanups [Masahiro Yamada]
      
       - warning option adjustments for gcov etc [Arnd Bergmann]
      
       - a few more small fixes
      
      * 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: (31 commits)
        kbuild: move -Wunused-const-variable to W=1 warning level
        kbuild: fix if_change and friends to consider argument order
        kbuild: fix adjust_autoksyms.sh for modules that need only one symbol
        kbuild: fix ksym_dep_filter when multiple EXPORT_SYMBOL() on the same line
        gcov: disable -Wmaybe-uninitialized warning
        gcov: disable tree-loop-im to reduce stack usage
        gcov: disable for COMPILE_TEST
        Kbuild: disable 'maybe-uninitialized' warning for CONFIG_PROFILE_ALL_BRANCHES
        Kbuild: change CC_OPTIMIZE_FOR_SIZE definition
        kbuild: forbid kernel directory to contain spaces and colons
        kbuild: adjust ksym_dep_filter for some cmd_* renames
        kbuild: Fix dependencies for final vmlinux link
        kbuild: better abstract vmlinux sequential prerequisites
        kbuild: fix call to adjust_autoksyms.sh when output directory specified
        kbuild: Get rid of KBUILD_STR
        kbuild: rename cmd_as_s_S to cmd_cpp_s_S
        kbuild: rename cmd_cc_i_c to cmd_cpp_i_c
        kbuild: drop redundant "PHONY += FORCE"
        kbuild: delete unnecessary "@:"
        kbuild: mark help target as PHONY
        ...
    • Merge branch 'akpm' (patches from Andrew) · e12fab28
      Linus Torvalds authored
      Merge fixes from Andrew Morton:
       "10 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        drivers/pinctrl/intel/pinctrl-baytrail.c: fix build with gcc-4.4
        update "mm/zsmalloc: don't fail if can't create debugfs info"
        dma-debug: avoid spinlock recursion when disabling dma-debug
        mm: oom_reaper: remove some bloat
        memcg: fix mem_cgroup_out_of_memory() return value.
        ocfs2: fix improper handling of return errno
        mm: slub: remove unused virt_to_obj()
        mm: kasan: remove unused 'reserved' field from struct kasan_alloc_meta
        mm: make CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on !FLATMEM explicitly
        seqlock: fix raw_read_seqcount_latch()
    • Merge tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 478a1469
      Linus Torvalds authored
      Pull DAX locking updates from Ross Zwisler:
       "Filesystem DAX locking for 4.7
      
         - We use a bit in an exceptional radix tree entry as a lock bit and
           use it similarly to how page lock is used for normal faults.  This
           fixes races between hole instantiation and read faults of the same
           index.
      
         - Filesystem DAX PMD faults are disabled, and will be re-enabled when
           PMD locking is implemented"
      
      * tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
        dax: Remove i_mmap_lock protection
        dax: Use radix tree entry lock to protect cow faults
        dax: New fault locking
        dax: Allow DAX code to replace exceptional entries
        dax: Define DAX lock bit for radix tree exceptional entry
        dax: Make huge page handling depend of CONFIG_BROKEN
        dax: Fix condition for filling of PMD holes
    • Merge tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 315227f6
      Linus Torvalds authored
      Pull misc DAX updates from Vishal Verma:
       "DAX error handling for 4.7
      
         - Until now, dax has been disabled if media errors were found on any
           device.  This enables the use of DAX in the presence of these
           errors by making all sector-aligned zeroing go through the driver.
      
         - The driver (already) has the ability to clear errors on writes that
           are sent through the block layer using 'DSMs' defined in ACPI 6.1.
      
        Other misc changes:
      
         - When mounting DAX filesystems, check to make sure the partition is
           page aligned.  This is a requirement for DAX, and previously, we
           allowed such unaligned mounts to succeed, but subsequent
           reads/writes would fail.
      
         - Misc/cleanup fixes from Jan that remove unused code from DAX
           related to zeroing, writeback, and some size checks"
      
      * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
        dax: fix a comment in dax_zero_page_range and dax_truncate_page
        dax: for truncate/hole-punch, do zeroing through the driver if possible
        dax: export a low-level __dax_zero_page_range helper
        dax: use sb_issue_zerout instead of calling dax_clear_sectors
        dax: enable dax in the presence of known media errors (badblocks)
        dax: fallback from pmd to pte on error
        block: Update blkdev_dax_capable() for consistency
        xfs: Add alignment check for DAX mount
        ext2: Add alignment check for DAX mount
        ext4: Add alignment check for DAX mount
        block: Add bdev_dax_supported() for dax mount checks
        block: Add vfs_msg() interface
        dax: Remove redundant inode size checks
        dax: Remove pointless writeback from dax_do_io()
        dax: Remove zeroing from dax_io()
        dax: Remove dead zeroing code from fault handlers
        ext2: Avoid DAX zeroing to corrupt data
        ext2: Fix block zeroing in ext2_get_blocks() for DAX
        dax: Remove complete_unwritten argument
        DAX: move RADIX_DAX_ definitions to dax.c
  2. 26 May, 2016 24 commits
  3. 25 May, 2016 2 commits
    • Yama: fix double-spinlock and user access in atomic context · dca6b414
      Jann Horn authored
      Commit 8a56038c ("Yama: consolidate error reporting") causes lockups
      when someone hits a Yama denial. Call chain:
      
      process_vm_readv -> process_vm_rw -> process_vm_rw_core -> mm_access
      -> ptrace_may_access
      task_lock(...) is taken
      __ptrace_may_access -> security_ptrace_access_check
      -> yama_ptrace_access_check -> report_access -> kstrdup_quotable_cmdline
      -> get_cmdline -> access_process_vm -> get_task_mm
      task_lock(...) is taken again
      
      task_lock(p) just calls spin_lock(&p->alloc_lock), so at this point,
      spin_lock() is called on a lock that is already held by the current
      process.
      
      Also: Since the alloc_lock is a spinlock, sleeping inside
      security_ptrace_access_check hooks is probably not allowed at all? So it's
      not even possible to print the cmdline from in there because that might
      involve paging in userspace memory.
      
      It would be tempting to rewrite ptrace_may_access() to drop the alloc_lock
      before calling the LSM, but even then, ptrace_may_access() itself might be
      called from various contexts in which you're not allowed to sleep; for
      example, as far as I understand, to be able to hold a reference to another
      task, usually an RCU read lock will be taken (see e.g. kcmp() and
      get_robust_list()), so that also prohibits sleeping. (And using e.g. FUSE,
      a user can cause pagefault handling to take arbitrary amounts of time -
      see https://bugs.chromium.org/p/project-zero/issues/detail?id=808.)
      
      Therefore, AFAIK, in order to print the name of a process below
      security_ptrace_access_check(), you'd have to either grab a reference to
      the mm_struct and defer the access violation reporting or just use the
      "comm" value that's stored in kernelspace and accessible without big
      complications. (Or you could try to use some kind of atomic remote VM
      access that fails if the memory isn't paged in, similar to
      copy_from_user_inatomic(), and if necessary fall back to comm, but
      that'd be kind of ugly because the comm/cmdline choice would look
      pretty random to the user.)
      
      Fix it by deferring reporting of the access violation until current
      exits kernelspace the next time.
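
The double-lock problem and the deferral idea can be illustrated with a user-space model. This is a hedged Python sketch, not the Yama code: `threading.Lock` stands in for the non-recursive alloc_lock, the function names and the `pending_reports` queue are hypothetical, and a non-blocking acquire is used so the model returns `None` where a real spinlock would spin forever.

```python
import threading

alloc_lock = threading.Lock()  # stands in for p->alloc_lock; not recursive
pending_reports = []           # hypothetical queue of deferred reports


def get_cmdline():
    """Models get_cmdline() -> get_task_mm(), which takes alloc_lock."""
    if not alloc_lock.acquire(blocking=False):
        # A kernel spinlock would spin here forever instead of returning.
        return None
    try:
        return "cmdline"
    finally:
        alloc_lock.release()


def access_check_immediate():
    """Reports while the lock is held: the nested acquire cannot succeed."""
    with alloc_lock:
        return get_cmdline()


def access_check_deferred():
    """Queues the report under the lock, then runs it after the lock is dropped."""
    with alloc_lock:
        pending_reports.append(get_cmdline)
    results = [report() for report in pending_reports]
    pending_reports.clear()
    return results


print(access_check_immediate())
print(access_check_deferred())
```

Here `access_check_immediate()` yields `None`, the model's stand-in for the lockup, while the deferred variant can read the command line because the lock has already been released when the report runs.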
      
      v2: Don't oops on PTRACE_TRACEME, call report_access under
      task_lock(current). Also fix nonsensical comment. And don't use
      GFP_ATOMIC for memory allocation with no locks held.
      This patch is tested both for ptrace attach and ptrace traceme.
      
      Fixes: 8a56038c ("Yama: consolidate error reporting")
      Signed-off-by: Jann Horn <jann@thejh.net>
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: James Morris <james.l.morris@oracle.com>
    • Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c4a34600
      Linus Torvalds authored
      Pull objtool build fix from Ingo Molnar:
       "An objtool fix for older libelf versions"
      
      * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        objtool: Allow building with older libelf