1. 20 May, 2017 40 commits
    • Jan Kara's avatar
      ext4: return to starting transaction in ext4_dax_huge_fault() · 5a3651b4
      Jan Kara authored
      commit fb26a1cb upstream.
      
      DAX will return to locking exceptional entry before mapping blocks for a
      page fault to fix possible races with concurrent writes.  To avoid lock
      inversion between exceptional entry lock and transaction start, start
      the transaction already in ext4_dax_huge_fault().
      
      Fixes: 9f141d6e
      Link: http://lkml.kernel.org/r/20170510085419.27601-4-jack@suse.czSigned-off-by: default avatarJan Kara <jack@suse.cz>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5a3651b4
    • Jan Kara's avatar
      mm: fix data corruption due to stale mmap reads · 85f353db
      Jan Kara authored
      commit cd656375 upstream.
      
      Currently, we didn't invalidate page tables during invalidate_inode_pages2()
      for DAX.  That could result in e.g. 2MiB zero page being mapped into
      page tables while there were already underlying blocks allocated and
      thus data seen through mmap were different from data seen by read(2).
      The following sequence reproduces the problem:
      
       - open an mmap over a 2MiB hole
      
       - read from a 2MiB hole, faulting in a 2MiB zero page
      
       - write to the hole with write(3p). The write succeeds but we
         incorrectly leave the 2MiB zero page mapping intact.
      
       - via the mmap, read the data that was just written. Since the zero
         page mapping is still intact we read back zeroes instead of the new
         data.
      
      Fix the problem by unconditionally calling invalidate_inode_pages2_range()
      in dax_iomap_actor() for new block allocations and by properly
      invalidating page tables in invalidate_inode_pages2_range() for DAX
      mappings.
      
      Fixes: c6dcf52c
      Link: http://lkml.kernel.org/r/20170510085419.27601-3-jack@suse.czSigned-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      85f353db
    • Ross Zwisler's avatar
      dax: prevent invalidation of mapped DAX entries · 06305f51
      Ross Zwisler authored
      commit 4636e70b upstream.
      
      Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
      v4.
      
      This series fixes data corruption that can happen for DAX mounts when
      page faults race with write(2) and as a result page tables get out of
      sync with block mappings in the filesystem and thus data seen through
      mmap is different from data seen through read(2).
      
      The series passes testing with t_mmap_stale test program from Ross and
      also other mmap related tests on DAX filesystem.
      
      This patch (of 4):
      
      dax_invalidate_mapping_entry() currently removes DAX exceptional entries
      only if they are clean and unlocked.  This is done via:
      
        invalidate_mapping_pages()
          invalidate_exceptional_entry()
            dax_invalidate_mapping_entry()
      
      However, for page cache pages removed in invalidate_mapping_pages()
      there is an additional criteria which is that the page must not be
      mapped.  This is noted in the comments above invalidate_mapping_pages()
      and is checked in invalidate_inode_page().
      
      For DAX entries this means that we can can end up in a situation where a
      DAX exceptional entry, either a huge zero page or a regular DAX entry,
      could end up mapped but without an associated radix tree entry.  This is
      inconsistent with the rest of the DAX code and with what happens in the
      page cache case.
      
      We aren't able to unmap the DAX exceptional entry because according to
      its comments invalidate_mapping_pages() isn't allowed to block, and
      unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.
      
      Since we essentially never have unmapped DAX entries to evict from the
      radix tree, just remove dax_invalidate_mapping_entry().
      
      Fixes: c6dcf52c ("mm: Invalidate DAX radix tree entries only if appropriate")
      Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.czSigned-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reported-by: default avatarJan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      06305f51
    • Dan Williams's avatar
      device-dax: fix sysfs attribute deadlock · 55dbfd7d
      Dan Williams authored
      commit 565851c9 upstream.
      
      Usage of device_lock() for dax_region attributes is unnecessary and
      deadlock prone. It's unnecessary because the order of registration /
      un-registration guarantees that drvdata is always valid. It's deadlock
      prone because it sets up this situation:
      
       ndctl           D    0  2170   2082 0x00000000
       Call Trace:
        __schedule+0x31f/0x980
        schedule+0x3d/0x90
        schedule_preempt_disabled+0x15/0x20
        __mutex_lock+0x402/0x980
        ? __mutex_lock+0x158/0x980
        ? align_show+0x2b/0x80 [dax]
        ? kernfs_seq_start+0x2f/0x90
        mutex_lock_nested+0x1b/0x20
        align_show+0x2b/0x80 [dax]
        dev_attr_show+0x20/0x50
      
       ndctl           D    0  2186   2079 0x00000000
       Call Trace:
        __schedule+0x31f/0x980
        schedule+0x3d/0x90
        __kernfs_remove+0x1f6/0x340
        ? kernfs_remove_by_name_ns+0x45/0xa0
        ? remove_wait_queue+0x70/0x70
        kernfs_remove_by_name_ns+0x45/0xa0
        remove_files.isra.1+0x35/0x70
        sysfs_remove_group+0x44/0x90
        sysfs_remove_groups+0x2e/0x50
        dax_region_unregister+0x25/0x40 [dax]
        devm_action_release+0xf/0x20
        release_nodes+0x16d/0x2b0
        devres_release_all+0x3c/0x60
        device_release_driver_internal+0x17d/0x220
        device_release_driver+0x12/0x20
        unbind_store+0x112/0x160
      
      ndctl/2170 is trying to acquire the device_lock() to read an attribute,
      and ndctl/2186 is holding the device_lock() while trying to drain all
      active attribute readers.
      
      Thanks to Yi Zhang for the reproduction script.
      
      Fixes: d7fe1a67 ("dax: add region 'id', 'size', and 'align' attributes")
      Reported-by: default avatarYi Zhang <yizhan@redhat.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      55dbfd7d
    • Dan Williams's avatar
      device-dax: fix cdev leak · 89314777
      Dan Williams authored
      commit ed01e50a upstream.
      
      If device_add() fails, cleanup the cdev. Otherwise, we leak a kobj_map()
      with a stale device number.
      
      As Jason points out, there is a small possibility that userspace has
      opened and mapped the device in the time between cdev_add() and the
      device_add() failure. We need a new kill_dax_dev() helper to invalidate
      any established mappings.
      
      Fixes: ba09c01d ("dax: convert to the cdev api")
      Reported-by: default avatarJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      89314777
    • NeilBrown's avatar
      md/raid1: avoid reusing a resync bio after error handling. · 9bcb9cc9
      NeilBrown authored
      commit 0c9d5b12 upstream.
      
      fix_sync_read_error() modifies a bio on a newly faulty
      device by setting bi_end_io to end_sync_write.
      This ensure that put_buf() will still call rdev_dec_pending()
      as required, but makes sure that subsequent code in
      fix_sync_read_error() doesn't try to read from the device.
      
      Unfortunately this interacts badly with sync_request_write()
      which assumes that any bio with bi_end_io set to non-NULL
      other than end_sync_read is safe to write to.
      
      As the device is now faulty it doesn't make sense to write.
      As the bio was recently used for a read, it is "dirty"
      and not suitable for immediate submission.
      In particular, ->bi_next might be non-NULL, which will cause
      generic_make_request() to complain.
      
      Break this interaction by refusing to write to devices
      which are marked as Faulty.
      Reported-and-tested-by: default avatarMichael Wang <yun.wang@profitbricks.com>
      Fixes: 2e52d449 ("md/raid1: add failfast handling for reads.")
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9bcb9cc9
    • Jason A. Donenfeld's avatar
      padata: free correct variable · 7d708085
      Jason A. Donenfeld authored
      commit 07a77929 upstream.
      
      The author meant to free the variable that was just allocated, instead
      of the one that failed to be allocated, but made a simple typo. This
      patch rectifies that.
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7d708085
    • Amir Goldstein's avatar
      ovl: do not set overlay.opaque on non-dir create · f33f17b2
      Amir Goldstein authored
      commit 4a99f3c8 upstream.
      
      The optimization for opaque dir create was wrongly being applied
      also to non-dir create.
      
      Fixes: 97c684cc ("ovl: create directories inside merged parent opaque")
      Signed-off-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f33f17b2
    • Björn Jacke's avatar
      CIFS: add misssing SFM mapping for doublequote · 6a1baa3a
      Björn Jacke authored
      commit 85435d7a upstream.
      
      SFM is mapping doublequote to 0xF020
      
      Without this patch creating files with doublequote fails to Windows/Mac
      Signed-off-by: default avatarBjoern Jacke <bjacke@samba.org>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6a1baa3a
    • David Disseldorp's avatar
      cifs: fix CIFS_IOC_GET_MNT_INFO oops · c4f35cc8
      David Disseldorp authored
      commit d8a6e505 upstream.
      
      An open directory may have a NULL private_data pointer prior to readdir.
      
      Fixes: 0de1f4c6 ("Add way to query server fs info for smb3")
      Signed-off-by: default avatarDavid Disseldorp <ddiss@suse.de>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c4f35cc8
    • Rabin Vincent's avatar
      CIFS: fix oplock break deadlocks · 33fa0116
      Rabin Vincent authored
      commit 3998e6b8 upstream.
      
      When the final cifsFileInfo_put() is called from cifsiod and an oplock
      break work is queued, lockdep complains loudly:
      
       =============================================
       [ INFO: possible recursive locking detected ]
       4.11.0+ #21 Not tainted
       ---------------------------------------------
       kworker/0:2/78 is trying to acquire lock:
        ("cifsiod"){++++.+}, at: flush_work+0x215/0x350
      
       but task is already holding lock:
        ("cifsiod"){++++.+}, at: process_one_work+0x255/0x8e0
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock("cifsiod");
         lock("cifsiod");
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       2 locks held by kworker/0:2/78:
        #0:  ("cifsiod"){++++.+}, at: process_one_work+0x255/0x8e0
        #1:  ((&wdata->work)){+.+...}, at: process_one_work+0x255/0x8e0
      
       stack backtrace:
       CPU: 0 PID: 78 Comm: kworker/0:2 Not tainted 4.11.0+ #21
       Workqueue: cifsiod cifs_writev_complete
       Call Trace:
        dump_stack+0x85/0xc2
        __lock_acquire+0x17dd/0x2260
        ? match_held_lock+0x20/0x2b0
        ? trace_hardirqs_off_caller+0x86/0x130
        ? mark_lock+0xa6/0x920
        lock_acquire+0xcc/0x260
        ? lock_acquire+0xcc/0x260
        ? flush_work+0x215/0x350
        flush_work+0x236/0x350
        ? flush_work+0x215/0x350
        ? destroy_worker+0x170/0x170
        __cancel_work_timer+0x17d/0x210
        ? ___preempt_schedule+0x16/0x18
        cancel_work_sync+0x10/0x20
        cifsFileInfo_put+0x338/0x7f0
        cifs_writedata_release+0x2a/0x40
        ? cifs_writedata_release+0x2a/0x40
        cifs_writev_complete+0x29d/0x850
        ? preempt_count_sub+0x18/0xd0
        process_one_work+0x304/0x8e0
        worker_thread+0x9b/0x6a0
        kthread+0x1b2/0x200
        ? process_one_work+0x8e0/0x8e0
        ? kthread_create_on_node+0x40/0x40
        ret_from_fork+0x31/0x40
      
      This is a real warning.  Since the oplock is queued on the same
      workqueue this can deadlock if there is only one worker thread active
      for the workqueue (which will be the case during memory pressure when
      the rescuer thread is handling it).
      
      Furthermore, there is at least one other kind of hang possible due to
      the oplock break handling if there is only worker.  (This can be
      reproduced without introducing memory pressure by having passing 1 for
      the max_active parameter of cifsiod.) cifs_oplock_break() can wait
      indefintely in the filemap_fdatawait() while the cifs_writev_complete()
      work is blocked:
      
       sysrq: SysRq : Show Blocked State
         task                        PC stack   pid father
       kworker/0:1     D    0    16      2 0x00000000
       Workqueue: cifsiod cifs_oplock_break
       Call Trace:
        __schedule+0x562/0xf40
        ? mark_held_locks+0x4a/0xb0
        schedule+0x57/0xe0
        io_schedule+0x21/0x50
        wait_on_page_bit+0x143/0x190
        ? add_to_page_cache_lru+0x150/0x150
        __filemap_fdatawait_range+0x134/0x190
        ? do_writepages+0x51/0x70
        filemap_fdatawait_range+0x14/0x30
        filemap_fdatawait+0x3b/0x40
        cifs_oplock_break+0x651/0x710
        ? preempt_count_sub+0x18/0xd0
        process_one_work+0x304/0x8e0
        worker_thread+0x9b/0x6a0
        kthread+0x1b2/0x200
        ? process_one_work+0x8e0/0x8e0
        ? kthread_create_on_node+0x40/0x40
        ret_from_fork+0x31/0x40
       dd              D    0   683    171 0x00000000
       Call Trace:
        __schedule+0x562/0xf40
        ? mark_held_locks+0x29/0xb0
        schedule+0x57/0xe0
        io_schedule+0x21/0x50
        wait_on_page_bit+0x143/0x190
        ? add_to_page_cache_lru+0x150/0x150
        __filemap_fdatawait_range+0x134/0x190
        ? do_writepages+0x51/0x70
        filemap_fdatawait_range+0x14/0x30
        filemap_fdatawait+0x3b/0x40
        filemap_write_and_wait+0x4e/0x70
        cifs_flush+0x6a/0xb0
        filp_close+0x52/0xa0
        __close_fd+0xdc/0x150
        SyS_close+0x33/0x60
        entry_SYSCALL_64_fastpath+0x1f/0xbe
      
       Showing all locks held in the system:
       2 locks held by kworker/0:1/16:
        #0:  ("cifsiod"){.+.+.+}, at: process_one_work+0x255/0x8e0
        #1:  ((&cfile->oplock_break)){+.+.+.}, at: process_one_work+0x255/0x8e0
      
       Showing busy workqueues and worker pools:
       workqueue cifsiod: flags=0xc
         pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
           in-flight: 16:cifs_oplock_break
           delayed: cifs_writev_complete, cifs_echo_request
       pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 750 3
      
      Fix these problems by creating a a new workqueue (with a rescuer) for
      the oplock break work.
      Signed-off-by: default avatarRabin Vincent <rabinv@axis.com>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      33fa0116
    • David Disseldorp's avatar
      cifs: fix CIFS_ENUMERATE_SNAPSHOTS oops · 076c404c
      David Disseldorp authored
      commit 6026685d upstream.
      
      As with 61876395, an open directory may have a NULL private_data
      pointer prior to readdir. CIFS_ENUMERATE_SNAPSHOTS must check for this
      before dereference.
      
      Fixes: 834170c8 ("Enable previous version support")
      Signed-off-by: default avatarDavid Disseldorp <ddiss@suse.de>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      076c404c
    • David Disseldorp's avatar
      cifs: fix leak in FSCTL_ENUM_SNAPS response handling · 63d86cd9
      David Disseldorp authored
      commit 0e5c7955 upstream.
      
      The server may respond with success, and an output buffer less than
      sizeof(struct smb_snapshot_array) in length. Do not leak the output
      buffer in this case.
      
      Fixes: 834170c8 ("Enable previous version support")
      Signed-off-by: default avatarDavid Disseldorp <ddiss@suse.de>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      63d86cd9
    • Björn Jacke's avatar
      CIFS: fix mapping of SFM_SPACE and SFM_PERIOD · 160239df
      Björn Jacke authored
      commit b704e70b upstream.
      
      - trailing space maps to 0xF028
      - trailing period maps to 0xF029
      
      This fix corrects the mapping of file names which have a trailing character
      that would otherwise be illegal (period or space) but is allowed by POSIX.
      Signed-off-by: default avatarBjoern Jacke <bjacke@samba.org>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      160239df
    • Steve French's avatar
      SMB3: Work around mount failure when using SMB3 dialect to Macs · 6f95f192
      Steve French authored
      commit 7db0a6ef upstream.
      
      Macs send the maximum buffer size in response on ioctl to validate
      negotiate security information, which causes us to fail the mount
      as the response buffer is larger than the expected response.
      
      Changed ioctl response processing to allow for padding of validate
      negotiate ioctl response and limit the maximum response size to
      maximum buffer size.
      Signed-off-by: default avatarSteve French <steve.french@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6f95f192
    • Steve French's avatar
      Set unicode flag on cifs echo request to avoid Mac error · 49f70d15
      Steve French authored
      commit 26c9cb66 upstream.
      
      Mac requires the unicode flag to be set for cifs, even for the smb
      echo request (which doesn't have strings).
      
      Without this Mac rejects the periodic echo requests (when mounting
      with cifs) that we use to check if server is down
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      49f70d15
    • Sachin Prabhu's avatar
      Do not return number of bytes written for ioctl CIFS_IOC_COPYCHUNK_FILE · fdf150ee
      Sachin Prabhu authored
      commit 7d0c234f upstream.
      
      commit 620d8745 ("Introduce cifs_copy_file_range()") changes the
      behaviour of the cifs ioctl call CIFS_IOC_COPYCHUNK_FILE. In case of
      successful writes, it now returns the number of bytes written. This
      return value is treated as an error by the xfstest cifs/001. Depending
      on the errno set at that time, this may or may not result in the test
      failing.
      
      The patch fixes this by setting the return value to 0 in case of
      successful writes.
      
      Fixes: commit 620d8745 ("Introduce cifs_copy_file_range()")
      Reported-by: default avatarEryu Guan <eguan@redhat.com>
      Signed-off-by: default avatarSachin Prabhu <sprabhu@redhat.com>
      Acked-by: default avatarPavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fdf150ee
    • Sachin Prabhu's avatar
      Fix match_prepath() · ecf9bb0f
      Sachin Prabhu authored
      commit cd8c4296 upstream.
      
      Incorrect return value for shares not using the prefix path means that
      we will never match superblocks for these shares.
      
      Fixes: commit c1d8b24d ("Compare prepaths when comparing superblocks")
      Signed-off-by: default avatarSachin Prabhu <sprabhu@redhat.com>
      Reviewed-by: default avatarPavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ecf9bb0f
    • Vlastimil Babka's avatar
      mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC · d322fd67
      Vlastimil Babka authored
      commit 62be1511 upstream.
      
      Patch series "more robust PF_MEMALLOC handling"
      
      This series aims to unify the setting and clearing of PF_MEMALLOC, which
      prevents recursive reclaim.  There are some places that clear the flag
      unconditionally from current->flags, which may result in clearing a
      pre-existing flag.  This already resulted in a bug report that Patch 1
      fixes (without the new helpers, to make backporting easier).  Patch 2
      introduces the new helpers, modelled after existing memalloc_noio_* and
      memalloc_nofs_* helpers, and converts mm core to use them.  Patches 3
      and 4 convert non-mm code.
      
      This patch (of 4):
      
      __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent deadlock
      during page migration by lock_page() (see the comment in
      __unmap_and_move()).  Then it unconditionally clears the flag, which can
      clear a pre-existing PF_MEMALLOC flag and result in recursive reclaim.
      This was not a problem until commit a8161d1e ("mm, page_alloc:
      restructure direct compaction handling in slowpath"), because direct
      compation was called only after direct reclaim, which was skipped when
      PF_MEMALLOC flag was set.
      
      Even now it's only a theoretical issue, as the new callsite of
      __alloc_pages_direct_compact() is reached only for costly orders and
      when gfp_pfmemalloc_allowed() is true, which means either
      __GFP_NOMEMALLOC is in gfp_flags or in_interrupt() is true.  There is no
      such known context, but let's play it safe and make
      __alloc_pages_direct_compact() robust for cases where PF_MEMALLOC is
      already set.
      
      Fixes: a8161d1e ("mm, page_alloc: restructure direct compaction handling in slowpath")
      Link: http://lkml.kernel.org/r/20170405074700.29871-2-vbabka@suse.czSigned-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reported-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
      Cc: Chris Leech <cleech@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Lee Duncan <lduncan@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d322fd67
    • Johannes Weiner's avatar
      mm: vmscan: fix IO/refault regression in cache workingset transition · 41a68ab8
      Johannes Weiner authored
      commit 2a2e4885 upstream.
      
      Since commit 59dc76b0 ("mm: vmscan: reduce size of inactive file
      list") we noticed bigger IO spikes during changes in cache access
      patterns.
      
      The patch in question shrunk the inactive list size to leave more room
      for the current workingset in the presence of streaming IO.  However,
      workingset transitions that previously happened on the inactive list are
      now pushed out of memory and incur more refaults to complete.
      
      This patch disables active list protection when refaults are being
      observed.  This accelerates workingset transitions, and allows more of
      the new set to establish itself from memory, without eating into the
      ability to protect the established workingset during stable periods.
      
      The workloads that were measurably affected for us were hit pretty bad
      by it, with refault/majfault rates doubling and tripling during cache
      transitions, and the machines sustaining half-hour periods of 100% IO
      utilization, where they'd previously have sub-minute peaks at 60-90%.
      
      Stateful services that handle user data tend to be more conservative
      with kernel upgrades.  As a result we hit most page cache issues with
      some delay, as was the case here.
      
      The severity seemed to warrant a stable tag.
      
      Fixes: 59dc76b0 ("mm: vmscan: reduce size of inactive file list")
      Link: http://lkml.kernel.org/r/20170404220052.27593-1-hannes@cmpxchg.orgSigned-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      41a68ab8
    • Andrey Ryabinin's avatar
      fs/block_dev: always invalidate cleancache in invalidate_bdev() · f7bccf3f
      Andrey Ryabinin authored
      commit a5f6a6a9 upstream.
      
      invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0
      which doen't make any sense.
      
      Make sure that invalidate_bdev() always calls cleancache_invalidate_inode()
      regardless of mapping->nrpages value.
      
      Fixes: c515e1fd ("mm/fs: add hooks to support cleancache")
      Link: http://lkml.kernel.org/r/20170424164135.22350-3-aryabinin@virtuozzo.comSigned-off-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f7bccf3f
    • Andrey Ryabinin's avatar
      fs: fix data invalidation in the cleancache during direct IO · 60146286
      Andrey Ryabinin authored
      commit 55635ba7 upstream.
      
      Patch series "Properly invalidate data in the cleancache", v2.
      
      We've noticed that after direct IO write, buffered read sometimes gets
      stale data which is coming from the cleancache.  The reason for this is
      that some direct write hooks call call invalidate_inode_pages2[_range]()
      conditionally iff mapping->nrpages is not zero, so we may not invalidate
      data in the cleancache.
      
      Another odd thing is that we check only for ->nrpages and don't check
      for ->nrexceptional, but invalidate_inode_pages2[_range] also
      invalidates exceptional entries as well.  So we invalidate exceptional
      entries only if ->nrpages != 0? This doesn't feel right.
      
       - Patch 1 fixes direct IO writes by removing ->nrpages check.
       - Patch 2 fixes similar case in invalidate_bdev().
           Note: I only fixed conditional cleancache_invalidate_inode() here.
             Do we also need to add ->nrexceptional check in into invalidate_bdev()?
      
       - Patches 3-4: some optimizations.
      
      This patch (of 4):
      
      Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
      conditionally iff mapping->nrpages is not zero.  This can't be right,
      because invalidate_inode_pages2[_range]() also invalidate data in the
      cleancache via cleancache_invalidate_inode() call.  So if page cache is
      empty but there is some data in the cleancache, buffered read after
      direct IO write would get stale data from the cleancache.
      
      Also it doesn't feel right to check only for ->nrpages because
      invalidate_inode_pages2[_range] invalidates exceptional entries as well.
      
      Fix this by calling invalidate_inode_pages2[_range]() regardless of
      nrpages state.
      
      Note: nfs,cifs,9p doesn't need similar fix because the never call
      cleancache_get_page() (nor directly, nor via mpage_readpage[s]()), so
      they are not affected by this bug.
      
      Fixes: c515e1fd ("mm/fs: add hooks to support cleancache")
      Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.comSigned-off-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      60146286
    • Luis Henriques's avatar
      ceph: fix memory leak in __ceph_setxattr() · a4c34c20
      Luis Henriques authored
      commit eeca958d upstream.
      
      The ceph_inode_xattr needs to be released when removing an xattr.  Easily
      reproducible running the 'generic/020' test from xfstests or simply by
      doing:
      
        attr -s attr0 -V 0 /mnt/test && attr -r attr0 /mnt/test
      
      While there, also fix the error path.
      
      Here's the kmemleak splat:
      
      unreferenced object 0xffff88001f86fbc0 (size 64):
        comm "attr", pid 244, jiffies 4294904246 (age 98.464s)
        hex dump (first 32 bytes):
          40 fa 86 1f 00 88 ff ff 80 32 38 1f 00 88 ff ff  @........28.....
          00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  ................
        backtrace:
          [<ffffffff81560199>] kmemleak_alloc+0x49/0xa0
          [<ffffffff810f3e5b>] kmem_cache_alloc+0x9b/0xf0
          [<ffffffff812b157e>] __ceph_setxattr+0x17e/0x820
          [<ffffffff812b1c57>] ceph_set_xattr_handler+0x37/0x40
          [<ffffffff8111fb4b>] __vfs_removexattr+0x4b/0x60
          [<ffffffff8111fd37>] vfs_removexattr+0x77/0xd0
          [<ffffffff8111fdd1>] removexattr+0x41/0x60
          [<ffffffff8111fe65>] path_removexattr+0x75/0xa0
          [<ffffffff81120aeb>] SyS_lremovexattr+0xb/0x10
          [<ffffffff81564b20>] entry_SYSCALL_64_fastpath+0x13/0x94
          [<ffffffffffffffff>] 0xffffffffffffffff
      Signed-off-by: default avatarLuis Henriques <lhenriques@suse.com>
      Reviewed-by: default avatar"Yan, Zheng" <zyan@redhat.com>
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a4c34c20
    • Michal Hocko's avatar
      fs/xattr.c: zero out memory copied to userspace in getxattr · 2cfc7440
      Michal Hocko authored
      commit 81be3dee upstream.
      
      getxattr uses vmalloc to allocate memory if kzalloc fails.  This is
      filled by vfs_getxattr and then copied to the userspace.  vmalloc,
      however, doesn't zero out the memory so if the specific implementation
      of the xattr handler is sloppy we can theoretically expose a kernel
      memory.  There is no real sign this is really the case but let's make
      sure this will not happen and use vzalloc instead.
      
      Fixes: 779302e6 ("fs/xattr.c:getxattr(): improve handling of allocation failures")
      Link: http://lkml.kernel.org/r/20170306103327.2766-1-mhocko@kernel.orgAcked-by: default avatarKees Cook <keescook@chromium.org>
      Reported-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2cfc7440
    • Martin Brandenburg's avatar
      orangefs: do not check possibly stale size on truncate · 8bd83223
      Martin Brandenburg authored
      commit 53950ef5 upstream.
      
      Let the server figure this out because our size might be out of date or
      not present.
      
      The bug was that
      
      	xfs_io -f -t -c "pread -v 0 100" /mnt/foo
      	echo "Test" > /mnt/foo
      	xfs_io -f -t -c "pread -v 0 100" /mnt/foo
      
      fails because the second truncate did not happen if nothing had
      requested the size after the write in echo.  Thus i_size was zero (not
      present) and the orangefs_setattr though i_size was zero and there was
      nothing to do.
      Signed-off-by: default avatarMartin Brandenburg <martin@omnibond.com>
      Signed-off-by: default avatarMike Marshall <hubcap@omnibond.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8bd83223
    • Martin Brandenburg's avatar
      orangefs: do not set getattr_time on orangefs_lookup · 66f1885b
      Martin Brandenburg authored
      commit 17930b25 upstream.
      
      Since orangefs_lookup calls orangefs_iget which calls
      orangefs_inode_getattr, getattr_time will get set.
      Signed-off-by: default avatarMartin Brandenburg <martin@omnibond.com>
      Signed-off-by: default avatarMike Marshall <hubcap@omnibond.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      66f1885b
    • Martin Brandenburg's avatar
      orangefs: clean up oversize xattr validation · a6d3c332
      Martin Brandenburg authored
      commit e675c5ec upstream.
      
      Also don't check flags as this has been validated by the VFS already.
      
      Fix an off-by-one error in the max size checking.
      
      Stop logging just because userspace wants to write attributes which do
      not fit.
      
      This and the previous commit fix xfstests generic/020.
      Signed-off-by: default avatarMartin Brandenburg <martin@omnibond.com>
      Signed-off-by: default avatarMike Marshall <hubcap@omnibond.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a6d3c332
    • Martin Brandenburg's avatar
      7adb1503
    • Eric Biggers's avatar
      ext4: evict inline data when writing to memory map · 1567c170
      Eric Biggers authored
      commit 7b4cc978 upstream.
      
      Currently the case of writing via mmap to a file with inline data is not
      handled.  This is maybe a rare case since it requires a writable memory
      map of a very small file, but it is trivial to trigger with on
      inline_data filesystem, and it causes the
      'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in
      ext4_writepages() to be hit:
      
          mkfs.ext4 -O inline_data /dev/vdb
          mount /dev/vdb /mnt
          xfs_io -f /mnt/file \
      	-c 'pwrite 0 1' \
      	-c 'mmap -w 0 1m' \
      	-c 'mwrite 0 1' \
      	-c 'fsync'
      
      	kernel BUG at fs/ext4/inode.c:2723!
      	invalid opcode: 0000 [#1] SMP
      	CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
      	task: ffff88003d3a8040 task.stack: ffffc90000300000
      	RIP: 0010:ext4_writepages+0xc89/0xf8a
      	RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283
      	RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc
      	RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246
      	RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5
      	R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698
      	R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff
      	FS:  00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      	CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0
      	Call Trace:
      	 ? _raw_spin_unlock+0x27/0x2a
      	 ? kvm_clock_read+0x1e/0x20
      	 do_writepages+0x23/0x2c
      	 ? do_writepages+0x23/0x2c
      	 __filemap_fdatawrite_range+0x80/0x87
      	 filemap_write_and_wait_range+0x67/0x8c
      	 ext4_sync_file+0x20e/0x472
      	 vfs_fsync_range+0x8e/0x9f
      	 ? syscall_trace_enter+0x25b/0x2d0
      	 vfs_fsync+0x1c/0x1e
      	 do_fsync+0x31/0x4a
      	 SyS_fsync+0x10/0x14
      	 do_syscall_64+0x69/0x131
      	 entry_SYSCALL64_slow_path+0x25/0x25
      
      We could try to be smart and keep the inline data in this case, or at
      least support delayed allocation when allocating the block, but these
      solutions would be more complicated and don't seem worthwhile given how
      rare this case seems to be.  So just fix the bug by calling
      ext4_convert_inline_data() when we're asked to make a page writable, so
      that any inline data gets evicted, with the block allocated immediately.
      Reported-by: default avatarNick Alcock <nick.alcock@oracle.com>
      Reviewed-by: default avatarAndreas Dilger <adilger@dilger.ca>
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1567c170
    • Jan Kara's avatar
      jbd2: fix dbench4 performance regression for 'nobarrier' mounts · 7621cb79
      Jan Kara authored
      commit 5052b069 upstream.
      
      Commit b685d3d6 "block: treat REQ_FUA and REQ_PREFLUSH as
      synchronous" removed REQ_SYNC flag from WRITE_FUA implementation. Since
      JBD2 strips REQ_FUA and REQ_FLUSH flags from submitted IO when the
      filesystem is mounted with nobarrier mount option, journal superblock
      writes ended up being async writes after this patch and that caused
      heavy performance regression for dbench4 benchmark with high number of
      processes. In my test setup with HP RAID array with non-volatile write
      cache and 32 GB ram, dbench4 runs with 8 processes regressed by ~25%.
      
      Fix the problem by making sure journal superblock writes are always
      treated as synchronous since they generally block progress of the
      journalling machinery and thus the whole filesystem.
      
      Fixes: b685d3d6Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7621cb79
    • Christian Borntraeger's avatar
      perf annotate s390: Implement jump types for perf annotate · 7a17cbb4
      Christian Borntraeger authored
      commit d9f8dfa9 upstream.
      
      Implement simple detection for all kind of jumps and branches.
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Cc: Andreas Krebbel <krebbel@linux.vnet.ibm.com>
      Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-s390 <linux-s390@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1491465112-45819-3-git-send-email-borntraeger@de.ibm.comSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7a17cbb4
    • Christian Borntraeger's avatar
      perf annotate s390: Fix perf annotate error -95 (4.10 regression) · b09eace7
      Christian Borntraeger authored
      commit e77852b3 upstream.
      
      since 4.10 perf annotate exits on s390 with an "unknown error -95".
      Turns out that commit 786c1b51 ("perf annotate: Start supporting
      cross arch annotation") added a hard requirement for architecture
      support when objdump is used but only provided x86 and arm support.
      Meanwhile power was added so lets add s390 as well.
      
      While at it make sure to implement the branch and jump types.
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Cc: Andreas Krebbel <krebbel@linux.vnet.ibm.com>
      Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-s390 <linux-s390@vger.kernel.org>
      Fixes: 786c1b51 "perf annotate: Start supporting cross arch annotation"
      Link: http://lkml.kernel.org/r/1491465112-45819-2-git-send-email-borntraeger@de.ibm.comSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b09eace7
    • Adrian Hunter's avatar
      perf auxtrace: Fix no_size logic in addr_filter__resolve_kernel_syms() · 08b0b5bd
      Adrian Hunter authored
      commit c3a0bbc7 upstream.
      
      Address filtering with kernel symbols incorrectly resulted in the error
      "Cannot determine size of symbol" because the no_size logic was the wrong
      way around.
      Signed-off-by: default avatarAdrian Hunter <adrian.hunter@intel.com>
      Tested-by: default avatarAndi Kleen <ak@linux.intel.com>
      Link: http://lkml.kernel.org/r/1490357752-27942-1-git-send-email-adrian.hunter@intel.comSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      08b0b5bd
    • Mike Marciniszyn's avatar
      IB/hfi1: Prevent kernel QP post send hard lockups · 2d584522
      Mike Marciniszyn authored
      commit b6eac931 upstream.
      
      The driver progress routines can call cond_resched() when
      a timeslice is exhausted and irqs are enabled.
      
      If the ULP had been holding a spin lock without disabling irqs and
      the post send directly called the progress routine, the cond_resched()
      could yield allowing another thread from the same ULP to deadlock
      on that same lock.
      
      Correct by replacing the current hfi1_do_send() calldown with a unique
      one for post send and adding an argument to hfi1_do_send() to indicate
      that the send engine is running in a thread.   If the routine is not
      running in a thread, avoid calling cond_resched().
      
      Fixes: Commit 831464ce ("IB/hfi1: Don't call cond_resched in atomic mode when sending packets")
      Reviewed-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: default avatarMike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2d584522
    • Jack Morgenstein's avatar
      IB/mlx4: Reduce SRIOV multicast cleanup warning message to debug level · 7a48c89a
      Jack Morgenstein authored
      commit fb7a9174 upstream.
      
      A warning message during SRIOV multicast cleanup should have actually been
      a debug level message. The condition generating the warning does no harm
      and can fill the message log.
      
      In some cases, during testing, some tests were so intense as to swamp the
      message log with these warning messages, causing a stall in the console
      message log output task. This stall caused an NMI to be sent to all CPUs
      (so that they all dumped their stacks into the message log).
      Aside from the message flood causing an NMI, the tests all passed.
      
      Once the message flood which caused the NMI is removed (by reducing the
      warning message to debug level), the NMI no longer occurs.
      
      Sample message log (console log) output illustrating the flood and
      resultant NMI (snippets with comments and modified with ... instead
      of hex digits, to satisfy checkpatch.pl):
      
       <mlx4_ib> _mlx4_ib_mcg_port_cleanup: ... WARNING: group refcount 1!!!...
       *** About 4000 almost identical lines in less than one second ***
       <mlx4_ib> _mlx4_ib_mcg_port_cleanup: ... WARNING: group refcount 1!!!...
       INFO: rcu_sched detected stalls on CPUs/tasks: { 17} (...)
       *** { 17} above indicates that CPU 17 was the one that stalled ***
       sending NMI to all CPUs:
       ...
       NMI backtrace for cpu 17
       CPU: 17 PID: 45909 Comm: kworker/17:2
       Hardware name: HP ProLiant DL360p Gen8, BIOS P71 09/08/2013
       Workqueue: events fb_flashcursor
       task: ffff880478...... ti: ffff88064e...... task.ti: ffff88064e......
       RIP: 0010:[ffffffff81......]  [ffffffff81......] io_serial_in+0x15/0x20
       RSP: 0018:ffff88064e257cb0  EFLAGS: 00000002
       RAX: 0000000000...... RBX: ffffffff81...... RCX: 0000000000......
       RDX: 0000000000...... RSI: 0000000000...... RDI: ffffffff81......
       RBP: ffff88064e...... R08: ffffffff81...... R09: 0000000000......
       R10: 0000000000...... R11: ffff88064e...... R12: 0000000000......
       R13: 0000000000...... R14: ffffffff81...... R15: 0000000000......
       FS:  0000000000......(0000) GS:ffff8804af......(0000) knlGS:000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080......
       CR2: 00007f2a2f...... CR3: 0000000001...... CR4: 0000000000......
       DR0: 0000000000...... DR1: 0000000000...... DR2: 0000000000......
       DR3: 0000000000...... DR6: 00000000ff...... DR7: 0000000000......
       Stack:
       ffff88064e...... ffffffff81...... ffffffff81...... 0000000000......
       ffffffff81...... ffff88064e...... ffffffff81...... ffffffff81......
       ffffffff81...... ffff88064e...... ffffffff81...... 0000000000......
       Call Trace:
      [<ffffffff813d099b>] wait_for_xmitr+0x3b/0xa0
      [<ffffffff813d0b5c>] serial8250_console_putchar+0x1c/0x30
      [<ffffffff813d0b40>] ? serial8250_console_write+0x140/0x140
      [<ffffffff813cb5fa>] uart_console_write+0x3a/0x80
      [<ffffffff813d0aae>] serial8250_console_write+0xae/0x140
      [<ffffffff8107c4d1>] call_console_drivers.constprop.15+0x91/0xf0
      [<ffffffff8107d6cf>] console_unlock+0x3bf/0x400
      [<ffffffff813503cd>] fb_flashcursor+0x5d/0x140
      [<ffffffff81355c30>] ? bit_clear+0x120/0x120
      [<ffffffff8109d5fb>] process_one_work+0x17b/0x470
      [<ffffffff8109e3cb>] worker_thread+0x11b/0x400
      [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
      [<ffffffff810a5aef>] kthread+0xcf/0xe0
      [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
      [<ffffffff81645858>] ret_from_fork+0x58/0x90
      [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
      Code: 48 89 e5 d3 e6 48 63 f6 48 03 77 10 8b 06 5d c3 66 0f 1f 44 00 00 66 66 66 6
      
      As indicated in the stack trace above, the console output task got swamped.
      
      Fixes: b9c5d6a6 ("IB/mlx4: Add multicast group (MCG) paravirtualization for SR-IOV")
      Signed-off-by: default avatarJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: default avatarLeon Romanovsky <leon@kernel.org>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7a48c89a
    • Jack Morgenstein's avatar
      IB/mlx4: Fix ib device initialization error flow · 295a7aab
      Jack Morgenstein authored
      commit 99e68909 upstream.
      
      In mlx4_ib_add, procedure mlx4_ib_alloc_eqs is called to allocate EQs.
      
      However, in the mlx4_ib_add error flow, procedure mlx4_ib_free_eqs is not
      called to free the allocated EQs.
      
      Fixes: e605b743 ("IB/mlx4: Increase the number of vectors (EQs) available for ULPs")
      Signed-off-by: default avatarJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: default avatarLeon Romanovsky <leon@kernel.org>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      295a7aab
    • Shamir Rabinovitch's avatar
      IB/IPoIB: ibX: failed to create mcg debug file · f98af463
      Shamir Rabinovitch authored
      commit 771a5258 upstream.
      
      When udev renames the netdev devices, ipoib debugfs entries does not
      get renamed. As a result, if subsequent probe of ipoib device reuse the
      name then creating a debugfs entry for the new device would fail.
      
      Also, moved ipoib_create_debug_files and ipoib_delete_debug_files as part
      of ipoib event handling in order to avoid any race condition between these.
      
      Fixes: 1732b0ef ([IPoIB] add path record information in debugfs)
      Signed-off-by: default avatarVijay Kumar <vijay.ac.kumar@oracle.com>
      Signed-off-by: default avatarShamir Rabinovitch <shamir.rabinovitch@oracle.com>
      Reviewed-by: default avatarMark Bloch <markb@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f98af463
    • Michael J. Ruhl's avatar
      IB/core: For multicast functions, verify that LIDs are multicast LIDs · 88ded1f1
      Michael J. Ruhl authored
      commit 8561eae6 upstream.
      
      The Infiniband spec defines "A multicast address is defined by a
      MGID and a MLID" (section 10.5).  Currently the MLID value is not
      validated.
      
      Add check to verify that the MLID value is in the correct address
      range.
      
      Fixes: 0c33aeed ("[IB] Add checks to multicast attach and detach")
      Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarDasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
      Signed-off-by: default avatarMichael J. Ruhl <michael.j.ruhl@intel.com>
      Signed-off-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
      Reviewed-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      88ded1f1
    • Parav Pandit's avatar
      IB/core: Fix kernel crash during fail to initialize device · 82f0fecc
      Parav Pandit authored
      commit 4be3a4fa upstream.
      
      This patch fixes the kernel crash that occurs during ib_dealloc_device()
      called due to provider driver fails with an error after
      ib_alloc_device() and before it can register using ib_register_device().
      
      This crashed seen in tha lab as below which can occur with any IB device
      which fails to perform its device initialization before invoking
      ib_register_device().
      
      This patch avoids touching cache and port immutable structures if device
      is not yet initialized.
      It also releases related memory when cache and port immutable data
      structure initialization fails during register_device() state.
      
      [81416.561946] BUG: unable to handle kernel NULL pointer dereference at (null)
      [81416.570340] IP: ib_cache_release_one+0x29/0x80 [ib_core]
      [81416.576222] PGD 78da66067
      [81416.576223] PUD 7f2d7c067
      [81416.579484] PMD 0
      [81416.582720]
      [81416.587242] Oops: 0000 [#1] SMP
      [81416.722395] task: ffff8807887515c0 task.stack: ffffc900062c0000
      [81416.729148] RIP: 0010:ib_cache_release_one+0x29/0x80 [ib_core]
      [81416.735793] RSP: 0018:ffffc900062c3a90 EFLAGS: 00010202
      [81416.741823] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
      [81416.749785] RDX: 0000000000000000 RSI: 0000000000000282 RDI: ffff880859fec000
      [81416.757757] RBP: ffffc900062c3aa0 R08: ffff8808536e5ac0 R09: ffff880859fec5b0
      [81416.765708] R10: 00000000536e5c01 R11: ffff8808536e5ac0 R12: ffff880859fec000
      [81416.773672] R13: 0000000000000000 R14: ffff8808536e5ac0 R15: ffff88084ebc0060
      [81416.781621] FS:  00007fd879fab740(0000) GS:ffff88085fac0000(0000) knlGS:0000000000000000
      [81416.790522] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [81416.797094] CR2: 0000000000000000 CR3: 00000007eb215000 CR4: 00000000003406e0
      [81416.805051] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [81416.812997] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [81416.820950] Call Trace:
      [81416.824226]  ib_device_release+0x1e/0x40 [ib_core]
      [81416.829858]  device_release+0x32/0xa0
      [81416.834370]  kobject_cleanup+0x63/0x170
      [81416.839058]  kobject_put+0x25/0x50
      [81416.843319]  ib_dealloc_device+0x25/0x40 [ib_core]
      [81416.848986]  mlx5_ib_add+0x163/0x1990 [mlx5_ib]
      [81416.854414]  mlx5_add_device+0x5a/0x160 [mlx5_core]
      [81416.860191]  mlx5_register_interface+0x8d/0xc0 [mlx5_core]
      [81416.866587]  ? 0xffffffffa09e9000
      [81416.870816]  mlx5_ib_init+0x15/0x17 [mlx5_ib]
      [81416.876094]  do_one_initcall+0x51/0x1b0
      [81416.880861]  ? __vunmap+0x85/0xd0
      [81416.885113]  ? kmem_cache_alloc_trace+0x14b/0x1b0
      [81416.890768]  ? vfree+0x2e/0x70
      [81416.894762]  do_init_module+0x60/0x1fa
      [81416.899441]  load_module+0x15f6/0x1af0
      [81416.904114]  ? __symbol_put+0x60/0x60
      [81416.908709]  ? ima_post_read_file+0x3d/0x80
      [81416.913828]  ? security_kernel_post_read_file+0x6b/0x80
      [81416.920006]  SYSC_finit_module+0xa6/0xf0
      [81416.924888]  SyS_finit_module+0xe/0x10
      [81416.929568]  entry_SYSCALL_64_fastpath+0x1a/0xa9
      [81416.935089] RIP: 0033:0x7fd879494949
      [81416.939543] RSP: 002b:00007ffdbc1b4e58 EFLAGS: 00000202 ORIG_RAX: 0000000000000139
      [81416.947982] RAX: ffffffffffffffda RBX: 0000000001b66f00 RCX: 00007fd879494949
      [81416.955965] RDX: 0000000000000000 RSI: 000000000041a13c RDI: 0000000000000003
      [81416.963926] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000001b652a0
      [81416.971861] R10: 0000000000000003 R11: 0000000000000202 R12: 00007ffdbc1b3e70
      [81416.979763] R13: 00007ffdbc1b3e50 R14: 0000000000000005 R15: 0000000000000000
      [81417.008005] RIP: ib_cache_release_one+0x29/0x80 [ib_core] RSP: ffffc900062c3a90
      [81417.016045] CR2: 0000000000000000
      
      Fixes: 55aeed06 ("IB/core: Make ib_alloc_device init the kobject")
      Fixes: 7738613e ("IB/core: Add per port immutable struct to ib_device")
      Reviewed-by: default avatarDaniel Jurgens <danielj@mellanox.com>
      Signed-off-by: default avatarParav Pandit <parav@mellanox.com>
      Signed-off-by: default avatarLeon Romanovsky <leon@kernel.org>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      82f0fecc
    • Jack Morgenstein's avatar
      IB/core: Fix sysfs registration error flow · 5c7f1dfa
      Jack Morgenstein authored
      commit b312be3d upstream.
      
      The kernel commit cited below restructured ib device management
      so that the device kobject is initialized in ib_alloc_device.
      
      As part of the restructuring, the kobject is now initialized in
      procedure ib_alloc_device, and is later added to the device hierarchy
      in the ib_register_device call stack, in procedure
      ib_device_register_sysfs (which calls device_add).
      
      However, in the ib_device_register_sysfs error flow, if an error
      occurs following the call to device_add, the cleanup procedure
      device_unregister is called. This call results in the device object
      being deleted -- which results in various use-after-free crashes.
      
      The correct cleanup call is device_del -- which undoes device_add
      without deleting the device object.
      
      The device object will then (correctly) be deleted in the
      ib_register_device caller's error cleanup flow, when the caller invokes
      ib_dealloc_device.
      
      Fixes: 55aeed06 ("IB/core: Make ib_alloc_device init the kobject")
      Signed-off-by: default avatarJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: default avatarLeon Romanovsky <leon@kernel.org>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5c7f1dfa