1. 15 Sep, 2016 4 commits
    • Fabian Frederick's avatar
      ext4: create EXT4_MAX_BLOCKS() macro · 518eaa63
      Fabian Frederick authored
      Create a macro to calculate length + offset -> maximum blocks
      This adds more readability.
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      518eaa63
    • Fabian Frederick's avatar
      ext4: remove unneeded test in ext4_alloc_file_blocks() · c3fe493c
      Fabian Frederick authored
      ext4_alloc_file_blocks() is called from ext4_zero_range() and
      ext4_fallocate() both already testing EXT4_INODE_EXTENTS
      We can call ext_depth(inode) unconditionnally.
      
      [ Added BUG_ON check to make sure ext4_alloc_file_blocks() won't get
        called for a indirect-mapped inode in the future.  -- tytso ]
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      c3fe493c
    • Fabian Frederick's avatar
      ext4: fix memory leak in ext4_insert_range() · edf15aa1
      Fabian Frederick authored
      Running xfstests generic/013 with kmemleak gives the following:
      
      unreferenced object 0xffff8801d3d27de0 (size 96):
        comm "fsstress", pid 4941, jiffies 4294860168 (age 53.485s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff818eaaf3>] kmemleak_alloc+0x23/0x40
          [<ffffffff81179805>] __kmalloc+0xf5/0x1d0
          [<ffffffff8122ef5c>] ext4_find_extent+0x1ec/0x2f0
          [<ffffffff8123530c>] ext4_insert_range+0x34c/0x4a0
          [<ffffffff81235942>] ext4_fallocate+0x4e2/0x8b0
          [<ffffffff81181334>] vfs_fallocate+0x134/0x210
          [<ffffffff8118203f>] SyS_fallocate+0x3f/0x60
          [<ffffffff818efa9b>] entry_SYSCALL_64_fastpath+0x13/0x8f
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      Problem seems mitigated by dropping refs and freeing path
      when there's no path[depth].p_ext
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      edf15aa1
    • wangguang's avatar
      ext4: bugfix for mmaped pages in mpage_release_unused_pages() · 4e800c03
      wangguang authored
      Pages clear buffers after ext4 delayed block allocation failed,
      However, it does not clean its pte_dirty flag.
      if the pages unmap ,in cording to the pte_dirty ,
      unmap_page_range may try to call __set_page_dirty,
      
      which may lead to the bugon at 
      mpage_prepare_extent_to_map:head = page_buffers(page);.
      
      This patch just call clear_page_dirty_for_io to clean pte_dirty 
      at mpage_release_unused_pages for pages mmaped. 
      
      Steps to reproduce the bug:
      
      (1) mmap a file in ext4
      	addr = (char *)mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED,
      	       	            fd, 0);
      	memset(addr, 'i', 4096);
      
      (2) return EIO at 
      
      	ext4_writepages->mpage_map_and_submit_extent->mpage_map_one_extent 
      
      which causes this log message to be print:
      
                      ext4_msg(sb, KERN_CRIT,
                              "Delayed block allocation failed for "
                              "inode %lu at logical offset %llu with"
                              " max blocks %u with error %d",
                              inode->i_ino,
                              (unsigned long long)map->m_lblk,
                              (unsigned)map->m_len, -err);
      
      (3)Unmap the addr cause warning at
      
      	__set_page_dirty:WARN_ON_ONCE(warn && !PageUptodate(page));
      
      (4) wait for a minute,then bugon happen.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarwangguang <wangguang03@zte.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      4e800c03
  2. 06 Sep, 2016 5 commits
    • Dmitry Monakhov's avatar
      ext4: improve ext4lazyinit scalability · e22834f0
      Dmitry Monakhov authored
      ext4lazyinit is a global thread. This thread performs itable
      initalization under li_list_mtx mutex.
      
      It basically does the following:
      ext4_lazyinit_thread
        ->mutex_lock(&eli->li_list_mtx);
        ->ext4_run_li_request(elr)
          ->ext4_init_inode_table-> Do a lot of IO if the list is large
      
      And when new mount/umount arrive they have to block on ->li_list_mtx
      because  lazy_thread holds it during full walk procedure.
      ext4_fill_super
       ->ext4_register_li_request
         ->mutex_lock(&ext4_li_info->li_list_mtx);
         ->list_add(&elr->lr_request, &ext4_li_info >li_request_list);
      In my case mount takes 40minutes on server with 36 * 4Tb HDD.
      Common user may face this in case of very slow dev ( /dev/mmcblkXXX)
      Even more. If one of filesystems was frozen lazyinit_thread will simply
      block on sb_start_write() so other mount/umount will be stuck forever.
      
      This patch changes logic like follows:
      - grab ->s_umount read sem before processing new li_request.
        After that it is safe to drop li_list_mtx because all callers of
        li_remove_request are holding ->s_umount for write.
      - li_thread skips frozen SB's
      
      Locking order:
      Mh KOrder is asserted by umount path like follows: s_umount ->li_list_mtx so
      the only way to to grab ->s_mount inside li_thread is via down_read_trylock
      
      xfstests:ext4/023
      #PSBM-49658
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      e22834f0
    • Jan Kara's avatar
      ext4: cleanup ext4_sync_parent() · 6ae4c5a6
      Jan Kara authored
      A condition !hlist_empty(&inode->i_dentry) is always true for open file.
      Just remove it. Also ext4_sync_parent() could use some explanation why
      races with rmdir() are not an issue - add a comment explaining that.
      Reported-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      6ae4c5a6
    • Kaho Ng's avatar
      ext4: remove old feature helpers · 0b7b7779
      Kaho Ng authored
      Use the ext4_{has,set,clear}_feature_* helpers to replace the old
      feature helpers.
      Signed-off-by: default avatarKaho Ng <ngkaho1234@gmail.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      0b7b7779
    • Jan Kara's avatar
      ext4: enable quota enforcement based on mount options · 49da9392
      Jan Kara authored
      When quota information is stored in quota files, we enable only quota
      accounting on mount and enforcement is enabled only in response to
      Q_QUOTAON quotactl. To make ext4 behavior consistent with XFS, we add a
      possibility to enable quota enforcement on mount by specifying
      corresponding quota mount option (usrquota, grpquota, prjquota).
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      49da9392
    • Daeho Jeong's avatar
      ext4: reinforce check of i_dtime when clearing high fields of uid and gid · 93e3b4e6
      Daeho Jeong authored
      Now, ext4_do_update_inode() clears high 16-bit fields of uid/gid
      of deleted and evicted inode to fix up interoperability with old
      kernels. However, it checks only i_dtime of an inode to determine
      whether the inode was deleted and evicted, and this is very risky,
      because i_dtime can be used for the pointer maintaining orphan inode
      list, too. We need to further check whether the i_dtime is being
      used for the orphan inode list even if the i_dtime is not NULL.
      
      We found that high 16-bit fields of uid/gid of inode are unintentionally
      and permanently cleared when the inode truncation is just triggered,
      but not finished, and the inode metadata, whose high uid/gid bits are
      cleared, is written on disk, and the sudden power-off follows that
      in order.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDaeho Jeong <daeho.jeong@samsung.com>
      Signed-off-by: default avatarHobin Woo <hobin.woo@samsung.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      93e3b4e6
  3. 31 Aug, 2016 1 commit
  4. 29 Aug, 2016 30 commits
    • Eric Whitney's avatar
      ext4: enforce online defrag restriction for encrypted files · 14fbd4aa
      Eric Whitney authored
      Online defragging of encrypted files is not currently implemented.
      However, the move extent ioctl can still return successfully when
      called.  For example, this occurs when xfstest ext4/020 is run on an
      encrypted file system, resulting in a corrupted test file and a
      corresponding test failure.
      
      Until the proper functionality is implemented, fail the move extent
      ioctl if either the original or donor file is encrypted.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarEric Whitney <enwlinux@gmail.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      14fbd4aa
    • Jan Kara's avatar
      ext4: factor out loop for freeing inode xattr space · dfa2064b
      Jan Kara authored
      Move loop to make enough space in the inode from
      ext4_expand_extra_isize_ea() into a separate function to make that
      function smaller and better readable and also to avoid delaration of
      variables inside a loop block.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      dfa2064b
    • Jan Kara's avatar
      ext4: remove (almost) unused variables from ext4_expand_extra_isize_ea() · 6e0cd088
      Jan Kara authored
      'start' variable is completely unused in ext4_expand_extra_isize_ea().
      Variable 'first' is used only once in one place. So just remove them.
      Variables 'entry' and 'last' are only really used later in the function
      inside a loop. Move their declarations there.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      6e0cd088
    • Jan Kara's avatar
      ext4: factor out xattr moving · 3f2571c1
      Jan Kara authored
      Factor out function for moving xattrs from inode into external xattr
      block from ext4_expand_extra_isize_ea(). That function is already quite
      long and factoring out this rather standalone functionality helps
      readability.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      3f2571c1
    • Jan Kara's avatar
      ext4: replace bogus assertion in ext4_xattr_shift_entries() · 94405713
      Jan Kara authored
      We were checking whether computed offsets do not exceed end of block in
      ext4_xattr_shift_entries(). However this does not make sense since we
      always only decrease offsets. So replace that assertion with a check
      whether we really decrease xattrs value offsets.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      94405713
    • Jan Kara's avatar
      ext4: remove checks for e_value_block · 1cba4237
      Jan Kara authored
      Currently we don't support xattrs with e_value_block set. We don't allow
      them to pass initial xattr check so there's no point for checking for
      this later. Since these tests were untested, bugs were creeping in and
      not all places which should have checked were checking e_value_block
      anyway.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      1cba4237
    • Jan Kara's avatar
      ext4: Check that external xattr value block is zero · 2de58f11
      Jan Kara authored
      Currently we don't support xattrs with values stored out of line. Check
      for that in ext4_xattr_check_names() to make sure we never work with
      such xattrs since not all the code counts with that resulting is possible
      weird corruption issues.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      2de58f11
    • Jan Kara's avatar
      ext4: fixup free space calculations when expanding inodes · e3014d14
      Jan Kara authored
      Conditions checking whether there is enough free space in an xattr block
      and when xattr is large enough to make enough space in the inode forgot
      to account for the fact that inode need not be completely filled up with
      xattrs. Thus we could move unnecessarily many xattrs out of inode or
      even falsely claim there is not enough space to expand the inode. We
      also forgot to update the amount of free space in xattr block when moving
      more xattrs and thus could decide to move too big xattr resulting in
      unexpected failure.
      
      Fix these problems by properly updating free space in the inode and
      xattr block as we move xattrs. To simplify the math, avoid shifting
      xattrs after removing each one xattr and instead just shift xattrs only
      once there is enough free space in the inode.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      e3014d14
    • Linus Torvalds's avatar
      Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · b8927721
      Linus Torvalds authored
      Pull ext4 fixes from Ted Ts'o:
       "Fix bugs that could cause kernel deadlocks or file system corruption
        while moving xattrs to expand the extended inode.
      
        Also add some sanity checks to the block group descriptors to make
        sure we don't end up overwriting the superblock"
      
      * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        ext4: avoid deadlock when expanding inode size
        ext4: properly align shifted xattrs when expanding inodes
        ext4: fix xattr shifting when expanding inodes part 2
        ext4: fix xattr shifting when expanding inodes
        ext4: validate that metadata blocks do not overlap superblock
        ext4: reserve xattr index for the Hurd
      b8927721
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 1f6a563e
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Segregate namespaces properly in conntrack dumps, from Liping Zhang.
      
       2) tcp listener refcount fix in netfilter tproxy, from Eric Dumazet.
      
       3) Fix timeouts in qed driver due to xmit_more, from Yuval Mintz.
      
       4) Fix use-after-free in tcp_xmit_retransmit_queue().
      
       5) Userspace header fixups (use of __u32, missing includes, etc.) from
          Mikko Rapeli.
      
       6) Further refinements to fragmentation wrt gso and tunnels, from
          Shmulik Ladkani.
      
       7) Trigger poll correctly for zero length UDP packets, from Eric
          Dumazet.
      
       8) TCP window scaling fix, also from Eric Dumazet.
      
       9) SLAB_DESTROY_BY_RCU is not relevant any more for UDP sockets.
      
      10) Module refcount leak in qdisc_create_dflt(), from Eric Dumazet.
      
      11) Fix deadlock in cp_rx_poll() of 8139cp driver, from Gao Feng.
      
      12) Memory leak in rhashtable's alloc_bucket_locks(), from Eric Dumazet.
      
      13) Add new device ID to alx driver, from Owen Lin.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits)
        Add Killer E2500 device ID in alx driver.
        net: smc91x: fix SMC accesses
        Documentation: networking: dsa: Remove platform device TODO
        net/mlx5: Increase number of ethtool steering priorities
        net/mlx5: Add error prints when validate ETS failed
        net/mlx5e: Fix memory leak if refreshing TIRs fails
        net/mlx5e: Add ethtool counter for TX xmit_more
        net/mlx5e: Fix ethtool -g/G rx ring parameter report with striding RQ
        net/mlx5e: Don't wait for SQ completions on close
        net/mlx5e: Don't post fragmented MPWQE when RQ is disabled
        net/mlx5e: Don't wait for RQ completions on close
        net/mlx5e: Limit UMR length to the device's limitation
        rhashtable: fix a memory leak in alloc_bucket_locks()
        sfc: fix potential stack corruption from running past stat bitmask
        team: loadbalance: push lacpdus to exact delivery
        net: hns: dereference ppe_cb->ppe_common_cb if it is non-null
        8139cp: Fix one possible deadloop in cp_rx_poll
        i40e: Change some init flow for the client
        Revert "phy: IRQ cannot be shared"
        net: dsa: bcm_sf2: Fix race condition while unmasking interrupts
        ...
      1f6a563e
    • Linus Torvalds's avatar
      Merge tag 'platform-drivers-x86-v4.8-4' of... · cf4d3779
      Linus Torvalds authored
      Merge tag 'platform-drivers-x86-v4.8-4' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86
      
      Pull x86 platform driver fixes from Darren Hart:
       "Remove module related code from two drivers that are only configurable
        as built-in: intel_pmic_gpio and platform/olpc"
      
      * tag 'platform-drivers-x86-v4.8-4' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86:
        intel_pmic_gpio: Make explicitly non-modular
        platform/olpc: Make ec explicitly non-modular
      cf4d3779
    • Linus Torvalds's avatar
      Merge tag 'powerpc-4.8-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 2a90309e
      Linus Torvalds authored
      Pull powerpc fixes from Ben Herrenschmidt:
       "This was meant to be sent early last week, but I has a change pending
        on one of the fixes and other things made me forget all about.  Ugh.
      
        We have some misc fixes for powerpc 4.8.  Some trivial bits and some
        regressions, and a trivial cleanup or two that I saw no point in
        letting rot in patchwork"
      
      * tag 'powerpc-4.8-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc: signals: Discard transaction state from signal frames
        powerpc/powernv : Drop reference added by kset_find_obj()
        powerpc/tm: do not use r13 for tabort_syscall
        powerpc: move hmi.c to arch/powerpc/kvm/
        powerpc: sysdev: cpm: fix gpio save_regs functions
        powerpc/pseries: PACA save area fix for MCE vs MCE
        powerpc/pseries: PACA save area fix for general exception vs MCE
        powerpc/prom: Fix sub-processor option passed to ibm, client-architecture-support
        powerpc, hotplug: Avoid to touch non-existent cpumasks.
        powerpc: migrate exception table users off module.h and onto extable.h
        powerpc/powernv/pci: fix iterator signedness
        powerpc/pseries: use pci_host_bridge.release_fn() to kfree(phb)
        cxl: use pcibios_free_controller_deferred() when removing vPHBs
        powerpc: mpc8349emitx: Delete unnecessary assignment for the field "owner"
        powerpc/512x: Delete unnecessary assignment for the field "owner"
        drivers/macintosh: Delete owner assignment
        powerpc: cputhreads: Add missing include file
      2a90309e
    • Paul Gortmaker's avatar
      intel_pmic_gpio: Make explicitly non-modular · da43bf0c
      Paul Gortmaker authored
      The Kconfig entry controlling compilation of this code is:
      
      drivers/platform/x86/Kconfig:config GPIO_INTEL_PMIC
      drivers/platform/x86/Kconfig:   bool "Intel PMIC GPIO support"
      
      ...meaning that it currently is not being built as a module by anyone.
      
      Lets remove the couple traces of modular infrastructure use, so that
      when reading the driver there is no doubt it is builtin-only.
      
      We delete the MODULE_LICENSE tag etc. since all that information
      was (or is now) contained at the top of the file in the comments.
      
      We don't replace module.h with init.h since the file already has that.
      
      Cc: Alek Du <alek.du@intel.com>
      Cc: platform-driver-x86@vger.kernel.org
      Signed-off-by: default avatarPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: default avatarDarren Hart <dvhart@linux.intel.com>
      da43bf0c
    • Paul Gortmaker's avatar
      platform/olpc: Make ec explicitly non-modular · f48d1496
      Paul Gortmaker authored
      The Kconfig entry controlling compilation of this code is:
      
      arch/x86/Kconfig:config OLPC
      arch/x86/Kconfig:       bool "One Laptop Per Child support"
      
      ...meaning that it currently is not being built as a module by anyone.
      
      Lets remove the couple traces of modular infrastructure use, so that
      when reading the driver there is no doubt it is builtin-only.
      
      We delete the MODULE_LICENSE tag etc. since all that information
      was (or is now) contained at the top of the file in the comments.
      
      Cc: platform-driver-x86@vger.kernel.org
      Signed-off-by: default avatarPaul Gortmaker <paul.gortmaker@windriver.com>
      Acked-by: default avatarAndres Salomon <dilinger@queued.net>
      Signed-off-by: default avatarDarren Hart <dvhart@linux.intel.com>
      f48d1496
    • Owen Lin's avatar
      Add Killer E2500 device ID in alx driver. · b99b43bb
      Owen Lin authored
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b99b43bb
    • Russell King's avatar
      net: smc91x: fix SMC accesses · 2fb04fdf
      Russell King authored
      Commit b70661c7 ("net: smc91x: use run-time configuration on all ARM
      machines") broke some ARM platforms through several mistakes.  Firstly,
      the access size must correspond to the following rule:
      
      (a) at least one of 16-bit or 8-bit access size must be supported
      (b) 32-bit accesses are optional, and may be enabled in addition to
          the above.
      
      Secondly, it provides no emulation of 16-bit accesses, instead blindly
      making 16-bit accesses even when the platform specifies that only 8-bit
      is supported.
      
      Reorganise smc91x.h so we can make use of the existing 16-bit access
      emulation already provided - if 16-bit accesses are supported, use
      16-bit accesses directly, otherwise if 8-bit accesses are supported,
      use the provided 16-bit access emulation.  If neither, BUG().  This
      exactly reflects the driver behaviour prior to the commit being fixed.
      
      Since the conversion incorrectly cut down the available access sizes on
      several platforms, we also need to go through every platform and fix up
      the overly-restrictive access size: Arnd assumed that if a platform can
      perform 32-bit, 16-bit and 8-bit accesses, then only a 32-bit access
      size needed to be specified - not so, all available access sizes must
      be specified.
      
      This likely fixes some performance regressions in doing this: if a
      platform does not support 8-bit accesses, 8-bit accesses have been
      emulated by performing a 16-bit read-modify-write access.
      
      Tested on the Intel Assabet/Neponset platform, which supports only 8-bit
      accesses, which was broken by the original commit.
      
      Fixes: b70661c7 ("net: smc91x: use run-time configuration on all ARM machines")
      Signed-off-by: default avatarRussell King <rmk+kernel@armlinux.org.uk>
      Tested-by: default avatarRobert Jarzmik <robert.jarzmik@free.fr>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2fb04fdf
    • Florian Fainelli's avatar
      Documentation: networking: dsa: Remove platform device TODO · 7d13eca0
      Florian Fainelli authored
      Since commit 83c0afae ("net: dsa: Add new binding implementation"),
      the shortcomings of the dsa platform device have been addressed, remove
      that TODO item.
      Signed-off-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Acked-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7d13eca0
    • David S. Miller's avatar
      Merge branch 'mlx5-series' · e4d986a8
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      Mellanox 100G mlx5 fixes 2016-08-29
      
      This series contains some bug fixes for the mlx5 core and mlx5
      ethernet driver.
      
      From Saeed, Fix UMR to consider hardware translation table field
      size limitation when calculating the maximum number of MTTs required
      by the driver.  Three patches to speed-up netdevice close time by
      serializing channel (SQs & RQs) destruction rather than issuing and
      waiting for hardware interrupts to free them.
      
      From Eran, Fix ethtool ring parameter reporting for striding RQ layout.
      Add error prints on ETS validation failure.
      
      From Kamal, Fix memory leak on error flow.
      
      From Maor, Fix ethtool steering priorities number.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e4d986a8
    • Maor Gottlieb's avatar
      net/mlx5: Increase number of ethtool steering priorities · e5835f28
      Maor Gottlieb authored
      Ethtool has 11 flow tables, each flow table has its own priority.
      Increase the number of priorities to be aligned with the number of flow
      tables.
      
      Fixes: 1174fce8 ('net/mlx5e: Support l3/l4 flow type specs in ethtool flow steering')
      Signed-off-by: default avatarMaor Gottlieb <maorg@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e5835f28
    • Eran Ben Elisha's avatar
      net/mlx5: Add error prints when validate ETS failed · 1722b969
      Eran Ben Elisha authored
      Upon set ETS failure due to user invalid input, add error prints to
      specify the exact error to the user.
      
      Fixes: cdcf1121 ('net/mlx5e: Validate BW weight values of ETS')
      Signed-off-by: default avatarEran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1722b969
    • Kamal Heib's avatar
      net/mlx5e: Fix memory leak if refreshing TIRs fails · bf50082c
      Kamal Heib authored
      Free 'in' command object also when mlx5_core_modify_tir fails.
      
      Fixes: 724b2aa1 ("net/mlx5e: TIRs management refactoring")
      Signed-off-by: default avatarKamal Heib <kamalh@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bf50082c
    • Tariq Toukan's avatar
      net/mlx5e: Add ethtool counter for TX xmit_more · c8cf78fe
      Tariq Toukan authored
      Add a counter in ethtool for the number of times that
      TX xmit_more was used.
      Signed-off-by: default avatarTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c8cf78fe
    • Eran Ben Elisha's avatar
      net/mlx5e: Fix ethtool -g/G rx ring parameter report with striding RQ · cc8e9ebf
      Eran Ben Elisha authored
      The driver RQ has two possible configurations: striding RQ and
      non-striding RQ.  Until this patch, the driver always reported the
      number of hardware WQEs (ring descriptors). For non striding RQ
      configuration, this was OK since we have one WQE per pending packet
      For striding RQ, multiple packets can fit into one WQE. For better
      user experience we normalize the rx_pending parameter (size of wqe/mtu)
      as the average ring size in case of striding RQ.
      
      Fixes: 461017cb ('net/mlx5e: Support RX multi-packet WQE ...')
      Signed-off-by: default avatarEran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cc8e9ebf
    • Saeed Mahameed's avatar
      net/mlx5e: Don't wait for SQ completions on close · 6e8dd6d6
      Saeed Mahameed authored
      Instead of asking the firmware to flush the SQ (Send Queue) via
      asynchronous completions when moved to error, we handle SQ flush
      manually (mlx5e_free_tx_descs) same as we did when SQ flush got
      timed out or on tx_timeout.
      
      This will reduce SQs flush time and speedup interface down procedure.
      
      Moved mlx5e_free_tx_descs to the end of en_tx.c for tx
      critical code locality.
      
      Fixes: 29429f33 ('net/mlx5e: Timeout if SQ doesn't flush during close')
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6e8dd6d6
    • Saeed Mahameed's avatar
      net/mlx5e: Don't post fragmented MPWQE when RQ is disabled · 8484f9ed
      Saeed Mahameed authored
      ICO (Internal control operations) SQ (Send Queue) is closed/disabled
      after RQ (Receive Queue).  After RQ is closed an ICO SQ completion
      might post a fragmented MPWQE (Multi Packet Work Queue Element) into
      that RQ.
      
      As on regular RQ post, check if we are allowed to post to that
      RQ (RQ is enabled). Cleanup in-progress UMR MPWQE on mlx5e_free_rx_descs
      if needed.
      
      Fixes: bc77b240 ('net/mlx5e: Add fragmented memory support for RX multi packet WQE')
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8484f9ed
    • Saeed Mahameed's avatar
      net/mlx5e: Don't wait for RQ completions on close · f2fde18c
      Saeed Mahameed authored
      This will significantly reduce receive queue flush time on interface
      down.
      
      Instead of asking the firmware to flush the RQ (Receive Queue) via
      asynchronous completions when moved to error, we handle RQ flush
      manually (mlx5e_free_rx_descs) same as we did when RQ flush got timed
      out.
      
      This will reduce RQs flush time and speedup interface down procedure
      (ifconfig down) from 6 sec to 0.3 sec on a 48 cores system.
      
      Moved mlx5e_free_rx_descs en_main.c where it is needed, to keep en_rx.c
      free form non critical data path code for better code locality.
      
      Fixes: 6cd392a0 ('net/mlx5e: Handle RQ flush in error cases')
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f2fde18c
    • Saeed Mahameed's avatar
      net/mlx5e: Limit UMR length to the device's limitation · fe4c988b
      Saeed Mahameed authored
      ConnectX-4 UMR (User Memory Region) MTT translation table offset in WQE
      is limited to U16_MAX, before this patch we ignored that limitation and
      requested the maximum possible UMR translation length that the netdev
      might need (MAX channels * MAX pages per channel).
      In case of a system with #cores > 32 and when linear WQE allocation fails,
      falling back to using UMR WQEs will cause the RQ (Receive Queue) to get
      stuck.
      
      Here we limit UMR length to min(U16_MAX, max required pages) (while
      considering the required alignments) on driver load, by default U16_MAX is
      sufficient since the default RX rings value guarantees that we are in
      range, dynamically (on set_ringparam/set_channels) we will check if the
      new required UMR length (num mtts) is still in range, if not, fail the
      request.
      
      Fixes: bc77b240 ('net/mlx5e: Add fragmented memory support for RX multi packet WQE')
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fe4c988b
    • Cyril Bur's avatar
      powerpc: signals: Discard transaction state from signal frames · 78a3e888
      Cyril Bur authored
      Userspace can begin and suspend a transaction within the signal
      handler which means they might enter sys_rt_sigreturn() with the
      processor in suspended state.
      
      sys_rt_sigreturn() wants to restore process context (which may have
      been in a transaction before signal delivery). To do this it must
      restore TM SPRS. To achieve this, any transaction initiated within the
      signal frame must be discarded in order to be able to restore TM SPRs
      as TM SPRs can only be manipulated non-transactionally..
      >From the PowerPC ISA:
        TM Bad Thing Exception [Category: Transactional Memory]
         An attempt is made to execute a mtspr targeting a TM register in
         other than Non-transactional state.
      
      Not doing so results in a TM Bad Thing:
      [12045.221359] Kernel BUG at c000000000050a40 [verbose debug info unavailable]
      [12045.221470] Unexpected TM Bad Thing exception at c000000000050a40 (msr 0x201033)
      [12045.221540] Oops: Unrecoverable exception, sig: 6 [#1]
      [12045.221586] SMP NR_CPUS=2048 NUMA PowerNV
      [12045.221634] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE
       nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
       xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter
       ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables kvm_hv kvm
       uio_pdrv_genirq ipmi_powernv uio powernv_rng ipmi_msghandler autofs4 ses enclosure
       scsi_transport_sas bnx2x ipr mdio libcrc32c
      [12045.222167] CPU: 68 PID: 6178 Comm: sigreturnpanic Not tainted 4.7.0 #34
      [12045.222224] task: c0000000fce38600 ti: c0000000fceb4000 task.ti: c0000000fceb4000
      [12045.222293] NIP: c000000000050a40 LR: c0000000000163bc CTR: 0000000000000000
      [12045.222361] REGS: c0000000fceb7ac0 TRAP: 0700   Not tainted (4.7.0)
      [12045.222418] MSR: 9000000300201033 <SF,HV,ME,IR,DR,RI,LE,TM[SE]> CR: 28444280  XER: 20000000
      [12045.222625] CFAR: c0000000000163b8 SOFTE: 0 PACATMSCRATCH: 900000014280f033
      GPR00: 01100000b8000001 c0000000fceb7d40 c00000000139c100 c0000000fce390d0
      GPR04: 900000034280f033 0000000000000000 0000000000000000 0000000000000000
      GPR08: 0000000000000000 b000000000001033 0000000000000001 0000000000000000
      GPR12: 0000000000000000 c000000002926400 0000000000000000 0000000000000000
      GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      GPR24: 0000000000000000 00003ffff98cadd0 00003ffff98cb470 0000000000000000
      GPR28: 900000034280f033 c0000000fceb7ea0 0000000000000001 c0000000fce390d0
      [12045.223535] NIP [c000000000050a40] tm_restore_sprs+0xc/0x1c
      [12045.223584] LR [c0000000000163bc] tm_recheckpoint+0x5c/0xa0
      [12045.223630] Call Trace:
      [12045.223655] [c0000000fceb7d80] [c000000000026e74] sys_rt_sigreturn+0x494/0x6c0
      [12045.223738] [c0000000fceb7e30] [c0000000000092e0] system_call+0x38/0x108
      [12045.223806] Instruction dump:
      [12045.223841] 7c800164 4e800020 7c0022a6 f80304a8 7c0222a6 f80304b0 7c0122a6 f80304b8
      [12045.223955] 4e800020 e80304a8 7c0023a6 e80304b0 <7c0223a6> e80304b8 7c0123a6 4e800020
      [12045.224074] ---[ end trace cb8002ee240bae76 ]---
      
      It isn't clear exactly if there is really a use case for userspace
      returning with a suspended transaction, however, doing so doesn't (on
      its own) constitute a bad frame. As such, this patch simply discards
      the transactional state of the context calling the sigreturn and
      continues.
      Reported-by: default avatarLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Signed-off-by: default avatarCyril Bur <cyrilbur@gmail.com>
      Tested-by: default avatarLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Reviewed-by: default avatarLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Acked-by: default avatarSimon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      78a3e888
    • Mukesh Ojha's avatar
      powerpc/powernv : Drop reference added by kset_find_obj() · a9cbf0b2
      Mukesh Ojha authored
      In a situation, where Linux kernel gets notified about duplicate error log
      from OPAL, it is been observed that kernel fails to remove sysfs entries
      (/sys/firmware/opal/elog/0xXXXXXXXX) of such error logs. This is because,
      we currently search the error log/dump kobject in the kset list via
      'kset_find_obj()' routine. Which eventually increment the reference count
      by one, once it founds the kobject.
      
      So, unless we decrement the reference count by one after it found the kobject,
      we would not be able to release the kobject properly later.
      
      This patch adds the 'kobject_put()' which was missing earlier.
      Signed-off-by: default avatarMukesh Ojha <mukesh02@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: default avatarVasant Hegde <hegdevasant@linux.vnet.ibm.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      a9cbf0b2
    • Nicholas Piggin's avatar
      powerpc/tm: do not use r13 for tabort_syscall · cc7786d3
      Nicholas Piggin authored
      tabort_syscall runs with RI=1, so a nested recoverable machine
      check will load the paca into r13 and overwrite what we loaded
      it with, because exceptions returning to privileged mode do not
      restore r13.
      
      Fixes: b4b56f9e (powerpc/tm: Abort syscalls in active transactions)
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarNick Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      cc7786d3