1. 10 Sep, 2015 37 commits
    • checkpatch: always check block comment styles · 86406b1c
      Joe Perches authored
      Some of the block comment tests that are used only for networking are
      appropriate for all patches.
      
      For example, these styles are not encouraged:
      
      	/*
      	 block comment without introductory *
      	*/
      and
      	/*
      	 * block comment with line terminating */
      
      Remove the networking specific test and add comments.
      
      There are some infrequent false positives where code is lazily
      commented out using /* and */ rather than using #if 0/#endif blocks
      like:
      	/* case foo:
      	case bar: */
      	case baz:
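
As a minimal sketch of the style this test now expects everywhere (the function names `classify` and `classify_selfcheck` are made up for illustration), each continuation line of a block comment starts with ` * `, the terminator sits on its own line, and code is disabled with #if 0/#endif rather than with a comment:

```c
#include <assert.h>

/*
 * Preferred block comment: every continuation line starts with " * "
 * and the terminator sits on a line of its own.
 */
static int classify(int v)
{
	switch (v) {
#if 0	/* disabled cases are compiled out instead of commented out */
	case 1:
	case 2:
#endif
	case 3:
		return 1;
	default:
		return 0;
	}
}

/* Quick self-check of the sketch. */
static void classify_selfcheck(void)
{
	assert(classify(3) == 1);
	assert(classify(1) == 0);	/* the disabled cases fall to default */
}
```

Using #if 0 keeps the disabled cases out of the build without producing a block comment that the style test would flag.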
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • checkpatch: report the right line # when using --emacs and --file · 7d3a9f67
      Joe Perches authored
      commit 34d8815f ("checkpatch: add --showfile to allow input via pipe
      to show filenames") broke the --emacs with --file option.
      
      Fix it.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • checkpatch: add some <foo>_destroy functions to NEEDLESS_IF tests · 100425de
      Joe Perches authored
      Sergey Senozhatsky has modified several destroy functions that can
      now be called with NULL values.
      
       - kmem_cache_destroy()
       - mempool_destroy()
       - dma_pool_destroy()
      
      Update checkpatch to warn when those functions are preceded by an if.
      
      Also update checkpatch so that --fix removes the needless test, but
      only when the code is indented with leading tabs.
      
      from:
      	if (foo)
      		<func>(foo);
      to:
      	<func>(foo);
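
The reason the guard is needless can be sketched in userspace. This is only an illustration under assumptions: `struct pool`, `pool_destroy`, and the `destroy_calls` counter are hypothetical names, not kernel APIs. A destroy function that checks for NULL itself makes the caller's test redundant:

```c
#include <assert.h>
#include <stdlib.h>

struct pool {
	int *slots;
};

static int destroy_calls;	/* bookkeeping for the self-check only */

/* Hypothetical destroy function that tolerates NULL, like free(). */
static void pool_destroy(struct pool *p)
{
	destroy_calls++;
	if (!p)
		return;
	free(p->slots);
	free(p);
}

static void pool_selfcheck(void)
{
	struct pool *p = calloc(1, sizeof(*p));

	assert(p);
	pool_destroy(p);	/* no "if (p)" guard needed at the call site */
	pool_destroy(NULL);	/* also fine: the callee handles NULL */
	assert(destroy_calls == 2);
}
```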
      Signed-off-by: Joe Perches <joe@perches.com>
      Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Julia Lawall <julia.lawall@lip6.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • checkpatch: Allow longer declaration macros · 3e838b6c
      Joe Perches authored
      Some really long declaration macros exist.
      
      For instance:
      	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
      and
      	DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(name, description)
      
      Increase the limit from 2 words to 6 after DECLARE/DEFINE uses.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • checkpatch: improve SUSPECT_CODE_INDENT test · 9f5af480
      Joe Perches authored
      Many lines exist like
      
      	if (foo)
      			bar;
      
      where the tabbed indentation of the branch is not one more than the "if"
      line above it.
      
      checkpatch should emit a warning on those lines.
      
      Miscellanea:
      
      o Remove comments from branch blocks
      o Skip blank lines in block
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • checkpatch: add warning on BUG/BUG_ON use · 9d3e3c70
      Joe Perches authored
      Using BUG/BUG_ON crashes the kernel and is just unfriendly.
      
      Enable code that emits a warning on BUG/BUG_ON use.
      
      Make the code emit the message at WARNING level when scanning a patch and
      at CHECK level when scanning files so that script users don't feel an
      obligation to fix code that might be above their pay grade.
      Signed-off-by: Joe Perches <joe@perches.com>
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • checkpatch: warn on bare SHA-1 commit IDs in commit logs · fe043ea1
      Joe Perches authored
      Commit IDs should have commit descriptions too.  Warn when a 12 to 40 byte
      SHA-1 is used in commit logs.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/test_kasan.c: make kmalloc_oob_krealloc_less more correctly · 6b4a35fc
      Wang Long authored
      In kmalloc_oob_krealloc_less, I think it is better to test the size2
      boundary.

      Without the krealloc() call, an access at position size1 is still out
      of bounds, while an access at position size2 is not.  After the
      krealloc() call, an access at position size2 is out of bounds.  So
      testing position size2 is more correct.
      Signed-off-by: Wang Long <long.wanglong@huawei.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/test_kasan.c: fix a typo · 9789d8e0
      Wang Long authored
      Signed-off-by: Wang Long <long.wanglong@huawei.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/string_helpers: rename "esc" arg to "only" · b40bdb7f
      Kees Cook authored
      To further clarify the purpose of the "esc" argument, rename it to "only"
      to reflect that it is a limit, not a list of additional characters to
      escape.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/string_helpers: clarify esc arg in string_escape_mem · d89a3f73
      Kees Cook authored
      The esc argument is used to reduce which characters will be escaped.  For
      example, using " " with ESCAPE_SPACE will not produce any escaped spaces.
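
The "limit, not addition" semantics can be sketched in userspace. This is not the kernel's string_escape_mem(); `escape_ws` is a hypothetical helper that handles only '\n' and '\t' as a stand-in for the ESCAPE_SPACE class, and its argument order is made up for the example:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: escape '\n' and '\t' in src. If "only" is
 * non-NULL, a character is escaped only when it also appears in "only",
 * so "only" can shrink the set of escaped characters but never grow it.
 */
static size_t escape_ws(const char *src, const char *only,
			char *dst, size_t size)
{
	size_t n = 0;

	if (size == 0)
		return 0;
	for (; *src; src++) {
		char c = *src;
		char e = 0;

		if (c == '\n')
			e = 'n';
		else if (c == '\t')
			e = 't';
		if (e && only && !strchr(only, c))
			e = 0;		/* not in the limit set: copy as-is */
		if (e) {
			if (n + 2 >= size)
				break;
			dst[n++] = '\\';
			dst[n++] = e;
		} else {
			if (n + 1 >= size)
				break;
			dst[n++] = c;
		}
	}
	dst[n] = '\0';
	return n;
}

static void escape_ws_selfcheck(void)
{
	char buf[32];

	escape_ws("a b\tc\n", NULL, buf, sizeof(buf));
	assert(strcmp(buf, "a b\\tc\\n") == 0);	/* no limit: both escaped */
	escape_ws("a b\tc\n", "\t", buf, sizeof(buf));
	assert(strcmp(buf, "a b\\tc\n") == 0);	/* only the tab is eligible */
	escape_ws("a b\tc\n", " ", buf, sizeof(buf));
	assert(strcmp(buf, "a b\tc\n") == 0);	/* ' ' is not in the class */
}
```

The last case mirrors the example in the message: passing " " restricts escaping to a character the class never escapes, so nothing is escaped.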
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hexdump: do not print debug dumps for !CONFIG_DEBUG · cdf17449
      Linus Walleij authored
      print_hex_dump_debug() is likely supposed to be analogous to pr_debug() or
      dev_dbg() & friends.  Currently it will adhere to dynamic debug, but will
      not stub out prints if CONFIG_DEBUG is not set.  Let's make it do the
      right thing, because I am tired of having my dmesg buffer full of hex
      dumps on production systems.
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/bitmap.c: bitmap_parselist can accept string with whitespaces on head or tail · 9bf98f16
      Pan Xinhui authored
      __bitmap_parselist() can accept leading or trailing whitespace around
      each item it parses.  If the input contains valid ranges, there is no
      reason to reject it.

      For example, bitmap_parselist(" 1-3, 5, ", &mask, nmaskbits).  After
      separating the string, we get " 1-3", " 5", and " ".  It is possible and
      reasonable to accept such a string as long as the parsing result is correct.
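
The accepted input can be illustrated with a small userspace parser. This is a sketch under assumptions, not the kernel implementation: `parse_list` is a hypothetical stand-in limited to a 64-bit mask, and its error codes are chosen for the example. It tolerates whitespace around items and rejects a range with no end such as "1,0-" (the case addressed by d9282cb6):

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for bitmap_parselist(), limited to 64 bits. */
static int parse_list(const char *s, uint64_t *mask)
{
	uint64_t m = 0;

	while (*s) {
		unsigned long a, b;
		char *end;

		/* Skip separators and whitespace in front of an item. */
		while (isspace((unsigned char)*s) || *s == ',')
			s++;
		if (!*s)
			break;
		if (!isdigit((unsigned char)*s))
			return -EINVAL;
		a = b = strtoul(s, &end, 10);
		s = end;
		if (*s == '-') {
			s++;
			if (!isdigit((unsigned char)*s))
				return -EINVAL;	/* "0-" has no range end */
			b = strtoul(s, &end, 10);
			s = end;
		}
		/* Allow whitespace after an item. */
		while (isspace((unsigned char)*s))
			s++;
		if (*s && *s != ',')
			return -EINVAL;
		if (a > b)
			return -EINVAL;
		if (b >= 64)
			return -ERANGE;
		for (; a <= b; a++)
			m |= UINT64_C(1) << a;
	}
	*mask = m;
	return 0;
}

static void parse_list_selfcheck(void)
{
	uint64_t m = 0;

	assert(parse_list(" 1-3, 5, ", &m) == 0);
	assert(m == 0x2e);			/* bits 1, 2, 3 and 5 */
	assert(parse_list("1,0-", &m) == -EINVAL);
}
```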
      Signed-off-by: Pan Xinhui <xinhuix.pan@intel.com>
      Cc: Yury Norov <yury.norov@gmail.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/bitmap.c: fix a special string handling bug in __bitmap_parselist · d9282cb6
      Pan Xinhui authored
      If the string ends with '-', for example bitmap_parselist("1,0-", &mask,
      nmaskbits), it is not a valid pattern.  Add a check after the loop and
      return -EINVAL in that case.
      Signed-off-by: Pan Xinhui <xinhuix.pan@intel.com>
      Cc: Yury Norov <yury.norov@gmail.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/bitmap.c: correct a code style and do some, optimization · d21c3d4d
      Pan Xinhui authored
      We can avoid in-loop incrementation of ndigits.  Save current totaldigits
      to ndigits before loop, and check ndigits against totaldigits after the
      loop.
      Signed-off-by: Pan Xinhui <xinhuix.pan@intel.com>
      Cc: Yury Norov <yury.norov@gmail.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: convert to kstrto*()/kstrto*_from_user() · 774636e1
      Alexey Dobriyan authored
      Convert from manual allocation/copy_from_user()/...  to the kstrto*()
      family, which was designed for exactly that.

      One case cannot be converted to kstrto*_from_user() to make the code
      even simpler because of whitespace stripping, oh well...
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kstrto*: accept "-0" for signed conversion · 2d2e4715
      Alexey Dobriyan authored
      strtol(3) et al accept "-0", so should we.
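
The userspace behaviour being matched can be checked directly. The sketch below builds a strict parser on top of strtol(3); `parse_int_strict` is a hypothetical name, and the strictness rules (no leading whitespace, one optional trailing newline) are an approximation of the kstrto*() conventions rather than the kernel code:

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical strict parser in the spirit of kstrtoint(), base 10. */
static int parse_int_strict(const char *s, int *res)
{
	char *end;
	long v;

	if (isspace((unsigned char)s[0]))
		return -EINVAL;		/* strtol() would skip this */
	errno = 0;
	v = strtol(s, &end, 10);
	if (end == s)
		return -EINVAL;		/* no digits at all */
	if (*end == '\n')
		end++;			/* tolerate one trailing newline */
	if (*end)
		return -EINVAL;		/* trailing garbage */
	if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
		return -ERANGE;
	*res = (int)v;
	return 0;
}

static void parse_int_selfcheck(void)
{
	int v = 1;

	assert(parse_int_strict("-0", &v) == 0 && v == 0);
	assert(parse_int_strict("-0\n", &v) == 0 && v == 0);
	assert(parse_int_strict("12x", &v) == -EINVAL);
}
```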
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • MAINTAINERS/CREDITS: mark MaxRAID as Orphan, move Anil Ravindranath to CREDITS · 3cdea4d7
      Joe Perches authored
      Anil's email address bounces and he hasn't had a signoff
      in over 5 years.
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/printk.h: include pr_fmt in pr_debug_ratelimited · 515a9adc
      Jason A. Donenfeld authored
      The other two implementations of pr_debug_ratelimited include pr_fmt,
      along with every other pr_* function.  But pr_debug_ratelimited forgot to
      add it with the CONFIG_DYNAMIC_DEBUG implementation.
      
      This patch unifies the behavior.
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/cred.c: remove unnecessary kdebug atomic reads · 52aa8536
      Joe Perches authored
      Commit e0e81739 ("CRED: Add some configurable debugging [try #6]")
      added the kdebug mechanism to this file back in 2009.
      
      The kdebug macro calls no_printk which always evaluates arguments.
      
      Most of the kdebug uses have an unnecessary call of
      	atomic_read(&cred->usage)
      
      Make the kdebug macro do nothing by defining it with
      	do { if (0) no_printk(...); } while (0)
      when not enabled.
      
      $ size kernel/cred.o* (defconfig x86-64)
         text	   data	    bss	    dec	    hex	filename
         2748	    336	      8	   3092	    c14	kernel/cred.o.new
         2788	    336	      8	   3132	    c3c	kernel/cred.o.old
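
The difference between the two macro forms can be sketched in userspace. The names here are assumptions for illustration: `no_printk_sketch` stands in for the kernel's no_printk(), and `read_usage` stands in for atomic_read(&cred->usage) with a counter so the evaluation is observable:

```c
#include <assert.h>

static int reads;	/* how often the "expensive" argument was evaluated */

static int read_usage(void)
{
	return ++reads;
}

/* Stand-in for no_printk(): does nothing, but its arguments are evaluated. */
static int no_printk_sketch(const char *fmt, ...)
{
	(void)fmt;
	return 0;
}

/* Old form: the call remains, so every argument is still computed. */
#define kdebug_old(...)	no_printk_sketch(__VA_ARGS__)

/* New form: type-checked by the compiler, but never executed. */
#define kdebug_new(...) \
	do { if (0) no_printk_sketch(__VA_ARGS__); } while (0)

static void kdebug_selfcheck(void)
{
	reads = 0;
	kdebug_old("usage=%d\n", read_usage());
	assert(reads == 1);	/* argument was evaluated */
	kdebug_new("usage=%d\n", read_usage());
	assert(reads == 1);	/* argument was not evaluated */
}
```

Because the call sits behind `if (0)`, the compiler still checks the arguments but can discard the dead code, which is consistent with the text size reduction shown above.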
      
      Miscellanea:
      o Neaten the #define kdebug macros while there
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/extable.c: remove duplicated include · 2307e1a3
      Wei Yongjun authored
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/poison.h: remove not-used poison pointer macros · 8b839635
      Vasily Kulikov authored
      Signed-off-by: Vasily Kulikov <segoon@openwall.com>
      Cc: Solar Designer <solar@openwall.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/poison.h: fix LIST_POISON{1,2} offset · 8a5e5e02
      Vasily Kulikov authored
      Poison pointer values should be small enough to fit in
      non-mmap'able/hardly-mmap'able space.  E.g.  on x86 "poison pointer space"
      starts at 0x0.  Given that unprivileged users cannot mmap
      anything below mmap_min_addr, it should be safe to use poison pointers
      lower than mmap_min_addr.

      The current poison pointer values of LIST_POISON{1,2} might be too big for
      mmap_min_addr values equal to or less than 1 MB (the common case, e.g.  Ubuntu
      uses only 0x10000).  There is little point in using such a big value given
      that the "poison pointer space" below 1 MB is not yet exhausted.  Changing it
      to a smaller value solves the problem for small mmap_min_addr setups.
      
      The values are suggested by Solar Designer:
      http://www.openwall.com/lists/oss-security/2015/05/02/6
      Signed-off-by: Vasily Kulikov <segoon@openwall.com>
      Cc: Solar Designer <solar@openwall.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: change proc_subdir_lock to a rwlock · ecf1a3df
      Waiman Long authored
      The proc_subdir_lock spinlock is used to allow only one task to make
      changes to the proc directory structure as well as to look up information
      in it.  However, the lookup part can safely be entered by more than one
      task, because the pde_get() and pde_put() reference count updates in the
      critical sections are atomic increments and decrements respectively and
      so are safe with concurrent updates.

      The x86 architecture already uses qrwlock, which is fair, and other
      architectures like ARM are in the process of switching to qrwlock.  So
      unfairness shouldn't be a concern in that conversion.
      
      This patch changed the proc_subdir_lock to a rwlock in order to enable
      concurrent lookup. The following functions were modified to take a
      write lock:
       - proc_register()
       - remove_proc_entry()
       - remove_proc_subtree()
      
      The following functions were modified to take a read lock:
       - xlate_proc_name()
       - proc_lookup_de()
       - proc_readdir_de()
      
      A parallel /proc filesystem search with the "find" command (1000 threads)
      was run on a 4-socket Haswell-EX box (144 threads).  Before the patch, the
      parallel search took about 39s.  After the patch, the parallel find took
      only 25s, a saving of about 14s.
      
      The micro-benchmark that I used was artificial, but it was used to
      reproduce an exit hanging problem that I saw in a real application.  In
      fact, allowing only one task to do a lookup seems too limiting to me.
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Douglas Hatch <doug.hatch@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • procfs: always expose /proc/<pid>/map_files/ and make it readable · bdb4d100
      Calvin Owens authored
      Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and is
      only exposed if CONFIG_CHECKPOINT_RESTORE is set.
      
      Each mapped file region gets a symlink in /proc/<pid>/map_files/
      corresponding to the virtual address range at which it is mapped.  The
      symlinks work like the symlinks in /proc/<pid>/fd/, so you can follow them
      to the backing file even if that backing file has been unlinked.
      
      Currently, files which are mapped, unlinked, and closed are impossible to
      stat() from userspace.  Exposing /proc/<pid>/map_files/ closes this
      functionality "hole".
      
      Not being able to stat() such files makes noticing and explicitly
      accounting for the space they use on the filesystem impossible.  You can
      work around this by summing up the space used by every file in the
      filesystem and subtracting that total from what statfs() tells you, but
      that obviously isn't great, and it becomes unworkable once your filesystem
      becomes large enough.
      
      This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and
      adjusts the permissions enforced on it as follows:
      
      * proc_map_files_lookup()
      * proc_map_files_readdir()
      * map_files_d_revalidate()
      
      	Remove the CAP_SYS_ADMIN restriction, leaving only the current
      	restriction requiring PTRACE_MODE_READ. The information made
      	available to userspace by these three functions is already
      	available in /proc/PID/maps with MODE_READ, so I don't see any
      	reason to limit them any further (see below for more detail).
      
      * proc_map_files_follow_link()
      
      	This stub has been added, and requires that the user have
      	CAP_SYS_ADMIN in order to follow the links in map_files/,
      	since there was concern on LKML both about the potential for
      	bypassing permissions on ancestor directories in the path to
      	files pointed to, and about what happens with more exotic
      	memory mappings created by some drivers (ie dma-buf).
      
      In older versions of this patch, I changed every permission check in
      the four functions above to enforce MODE_ATTACH instead of MODE_READ.
      This was an oversight on my part, and after revisiting the discussion
      it seems that nobody was concerned about anything outside of what is
      made possible by ->follow_link(). So in this version, I've left the
      checks for PTRACE_MODE_READ as-is.
      
      [akpm@linux-foundation.org: catch up with concurrent proc_pid_follow_link() changes]
      Signed-off-by: Calvin Owens <calvinowens@fb.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: add cond_resched to /proc/kpage* read/write loop · d3691d2c
      Vladimir Davydov authored
      Reading/writing a /proc/kpage* file may take a long time on machines
      with a lot of RAM installed.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Suggested-by: Andres Lagar-Cavilla <andreslc@google.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: export idle flag via kpageflags · f074a8f4
      Vladimir Davydov authored
      As noted by Minchan, a benefit of reading the idle flag from
      /proc/kpageflags is that one can easily filter dirty and/or unevictable
      pages while estimating the size of unused memory.

      Note that the idle flag read from /proc/kpageflags may be stale in case
      the page was accessed via a PTE, because it would be too costly to iterate
      over all page mappings on each /proc/kpageflags read to provide an
      up-to-date value.  To make sure the flag is up-to-date, one has to read
      /sys/kernel/mm/page_idle/bitmap first.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce idle page tracking · 33c3fc71
      Vladimir Davydov authored
      Knowing the portion of memory that is not used by a certain application or
      memory cgroup (idle memory) can be useful for partitioning the system
      efficiently, e.g.  by setting memory cgroup limits appropriately.
      Currently, the only means to estimate the amount of idle memory provided
      by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
      access bit for all pages mapped to a particular process by writing 1 to
      clear_refs, wait for some time, and then count smaps:Referenced.  However,
      this method has two serious shortcomings:
      
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      
      To overcome these drawbacks, this patch introduces two new page flags,
      Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
      A page's Idle flag can only be set from userspace by setting a bit in
      /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
      and it is cleared whenever the page is accessed either through page tables
      (it is cleared in page_referenced() in this case) or using the read(2)
      system call (mark_page_accessed()). Thus by setting the Idle flag for
      pages of a particular workload, which can be found e.g.  by reading
      /proc/PID/pagemap, waiting for some time to let the workload access its
      working set, and then reading the bitmap file, one can estimate the amount
      of pages that are not used by the workload.
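
Locating a page's bit in the bitmap file can be sketched as follows. The layout assumed here (an array of 8-byte words in native byte order, with PFN i at bit i % 64 of word i / 64) follows the kernel's idle page tracking documentation rather than anything stated in this message, and `struct idle_bit` and `idle_bit_for_pfn` are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

struct idle_bit {
	uint64_t offset;	/* byte offset of the 8-byte word in the file */
	uint64_t mask;		/* bit within that word */
};

/* Map a page frame number to its position in page_idle/bitmap. */
static struct idle_bit idle_bit_for_pfn(uint64_t pfn)
{
	struct idle_bit b;

	b.offset = (pfn / 64) * sizeof(uint64_t);
	b.mask = UINT64_C(1) << (pfn % 64);
	return b;
}

static void idle_bit_selfcheck(void)
{
	struct idle_bit b = idle_bit_for_pfn(130);

	assert(b.offset == 16);	/* third 8-byte word */
	assert(b.mask == 4);	/* bit 2 of that word */
}
```

A tool would write the mask as an 8-byte word at that offset to mark the page idle, and later read the same word back to see whether the bit is still set. Doing so needs a kernel built with the feature and sufficient privileges.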
      
      The Young page flag is used to avoid interference with the memory
      reclaimer.  A page's Young flag is set whenever the Access bit of a page
      table entry pointing to the page is cleared by writing to the bitmap file.
      If page_referenced() is called on a Young page, it will add 1 to its
      return value, therefore concealing the fact that the Access bit was
      cleared.
      
      Note, since there is no room for extra page flags on 32 bit, this feature
      uses extended page flags when compiled on 32 bit.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: kpageidle requires an MMU]
      [akpm@linux-foundation.org: decouple from page-flags rework]
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mmu-notifier: add clear_young callback · 1d7715c6
      Vladimir Davydov authored
      In the scope of the idle memory tracking feature, which is introduced by
      the following patch, we need to clear the referenced/accessed bit not only
      in primary, but also in secondary ptes.  The latter is required in order
      to estimate wss of KVM VMs.  At the same time we want to avoid flushing
      tlb, because it is quite expensive and it won't really affect the final
      result.
      
      Currently, there is no function for clearing pte young bit that would meet
      our requirements, so this patch introduces one.  To achieve that we have
      to add a new mmu-notifier callback, clear_young, since there is no method
      for testing-and-clearing a secondary pte w/o flushing tlb.  The new method
      is not mandatory and currently only implemented by KVM.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: add kpagecgroup file · 80ae2fdc
      Vladimir Davydov authored
      /proc/kpagecgroup contains a 64-bit inode number of the memory cgroup each
      page is charged to, indexed by PFN.  Having this information is useful for
      estimating a cgroup working set size.
      
      The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.
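
Reading one entry can be sketched like this, assuming only what the message states: one 64-bit value per page, indexed by PFN. The helper names `kpagecgroup_offset` and `read_page_cgroup_ino` are hypothetical, and opening the file is expected to need root on a kernel with the options above enabled:

```c
#define _XOPEN_SOURCE 700	/* for pread() under strict C modes */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Byte offset of the entry for a given page frame number. */
static uint64_t kpagecgroup_offset(uint64_t pfn)
{
	return pfn * sizeof(uint64_t);
}

/* Returns 0 and fills *ino on success, -1 if the file cannot be read. */
static int read_page_cgroup_ino(uint64_t pfn, uint64_t *ino)
{
	int fd = open("/proc/kpagecgroup", O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = pread(fd, ino, sizeof(*ino), (off_t)kpagecgroup_offset(pfn));
	close(fd);
	return n == (ssize_t)sizeof(*ino) ? 0 : -1;
}

static void kpagecgroup_selfcheck(void)
{
	assert(kpagecgroup_offset(0) == 0);
	assert(kpagecgroup_offset(3) == 24);
}
```

Summing pages per inode number over a range of PFNs gives the per-cgroup page counts that a working set estimate would start from.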
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: zap try_get_mem_cgroup_from_page · e993d905
      Vladimir Davydov authored
      It is only used in mem_cgroup_try_charge, so fold it in and zap it.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e993d905
    • hwpoison: use page_cgroup_ino for filtering by memcg · 94a59fb3
      Vladimir Davydov authored
      Hwpoison allows filtering pages by memory cgroup ino.  Currently, it
      calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and
      then its ino using cgroup_ino, but now we have a helper method for
      that, page_cgroup_ino, so use it instead.
      
      This patch also loosens the hwpoison memcg filter dependency rules - it
      makes it depend on CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP, because
      the hwpoison memcg filter does not require anything from the
      CONFIG_MEMCG_SWAP side (nor did it ever).
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      94a59fb3
    • memcg: add page_cgroup_ino helper · 2fc04524
      Vladimir Davydov authored
      This patchset introduces a new user API for tracking user memory pages
      that have not been used for a given period of time.  The purpose of this
      is to provide the userspace with the means of tracking a workload's
      working set, i.e.  the set of pages that are actively used by the
      workload.  Knowing the working set size can be useful for partitioning the
      system more efficiently, e.g.  by tuning memory cgroup limits
      appropriately, or for job placement within a compute cluster.
      
      ==== USE CASES ====
      
      The unified cgroup hierarchy has memory.low and memory.high knobs, which
      are defined as the low and high boundaries for the workload working set
      size.  However, the working set size of a workload may be unknown or
      change over time.  With this patch set, one can periodically estimate the
      amount of memory unused by each cgroup and tune their memory.low and
      memory.high parameters accordingly, thereby optimizing the overall
      memory utilization.
      
      Another use case is balancing workloads within a compute cluster.  Knowing
      how much memory is not really used by a workload unit may help make a
      better decision when considering migrating the unit to another node
      within the cluster.
      
      Also, as noted by Minchan, this would be useful for per-process reclaim
      (https://lwn.net/Articles/545668/).  With idle tracking, a smart userspace
      memory manager could reclaim only idle pages.
      
      ==== USER API ====
      
      The user API consists of two new files:
      
       * /sys/kernel/mm/page_idle/bitmap.  This file implements a bitmap where each
         bit corresponds to a page, indexed by PFN. When the bit is set, the
         corresponding page is idle. A page is considered idle if it has not been
         accessed since it was marked idle. To mark a page idle one should set the
         bit corresponding to the page by writing to the file. A value written to the
         file is OR-ed with the current bitmap value. Only user memory pages can be
         marked idle; for other page types, input is silently ignored. Writing to
         this file beyond max PFN results in the ENXIO error. Only available when
         CONFIG_IDLE_PAGE_TRACKING is set.
      
         This file can be used to estimate the number of pages that are not
         used by a particular workload as follows:
      
         1. mark all pages of interest idle by setting corresponding bits in the
            /sys/kernel/mm/page_idle/bitmap
         2. wait until the workload accesses its working set
         3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits set
      
       * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
         memory cgroup each page is charged to, indexed by PFN. Only available when
         CONFIG_MEMCG is set.
      
         This file can be used to find all pages (including unmapped file pages)
         accounted to a particular cgroup. Using /sys/kernel/mm/page_idle/bitmap, one
         can then estimate the cgroup working set size.
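
      As a rough illustration of the "indexed by PFN" layout of the two files
      described above, the sketch below computes where a given PFN lives in each
      file.  It assumes the bitmap is accessed as native 64-bit words with PFN i
      held in bit i % 64 of word i / 64 (consistent with the "Q" packing used by
      the script in this message, but not stated explicitly here); the function
      names are made up for the example.

```python
import struct

WORD_BYTES = 8      # assumption: both files are read/written as 64-bit words
BITS_PER_WORD = 64  # assumption: one bitmap word covers 64 consecutive PFNs


def bitmap_location(pfn):
    """Return (byte offset, bit index) of a PFN in page_idle/bitmap."""
    word, bit = divmod(pfn, BITS_PER_WORD)
    return word * WORD_BYTES, bit


def kpagecgroup_offset(pfn):
    """Return the byte offset of a PFN's 64-bit entry in /proc/kpagecgroup."""
    return pfn * WORD_BYTES


def bit_is_set(word_bytes, bit):
    """Test one bit of an 8-byte bitmap word read from the file."""
    value, = struct.unpack("Q", word_bytes)
    return bool((value >> bit) & 1)
```

      With these helpers, a reader would seek to the returned offset, read 8
      bytes, and then either test the bit (bitmap) or unpack the cgroup inode
      number (kpagecgroup).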
      
      For an example of using these files to estimate the amount of unused
      memory per memory cgroup, please see the script attached
      below.
      
      ==== REASONING ====
      
      The reason to introduce the new user API instead of using
      /proc/PID/{clear_refs,smaps} is that the latter has two serious
      drawbacks:
      
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      
      The new API attempts to overcome them both. For more details on how it
      is achieved, please see the comment to patch 6.
      
      ==== PATCHSET STRUCTURE ====
      
      The patch set is organized as follows:
      
       - patch 1 adds page_cgroup_ino() helper for the sake of
         /proc/kpagecgroup and patches 2-3 do related cleanup
       - patch 4 adds /proc/kpagecgroup, which reports the inode number of the
         cgroup each page is charged to
       - patch 5 introduces a new mmu notifier callback, clear_young, which is
         a lightweight version of clear_flush_young; it is used in patch 6
       - patch 6 implements the idle page tracking feature, including the
         userspace API, /sys/kernel/mm/page_idle/bitmap
       - patch 7 exports idle flag via /proc/kpageflags
      
      ==== SIMILAR WORKS ====
      
      Originally, the patch for tracking idle memory was proposed back in 2011
      by Michel Lespinasse (see http://lwn.net/Articles/459269/).  The main
      difference between Michel's patch and this one is that Michel implemented
      a kernel space daemon for estimating idle memory size per cgroup while
      this patch only provides the userspace with the minimal API for doing the
      job, leaving the rest up to the userspace.  However, they both share the
      idea of using Idle/Young page flags to avoid affecting the reclaimer logic.
      
      ==== PERFORMANCE EVALUATION ====
      
      SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the
      performance impact introduced by this patch set.  Three runs were carried
      out:
      
       - base: kernel without the patch
       - patched: patched kernel, the feature is not used
       - patched-active: patched kernel, 1 minute-period daemon is used for
         tracking idle memory
      
      For tracking idle memory, idlememstat utility was used:
      https://github.com/locker/idlememstat
      
      testcase            base            patched        patched-active
      
      compiler       537.40 ( 0.00)%   532.26 (-0.96)%   538.31 ( 0.17)%
      compress       305.47 ( 0.00)%   301.08 (-1.44)%   300.71 (-1.56)%
      crypto         284.32 ( 0.00)%   282.21 (-0.74)%   284.87 ( 0.19)%
      derby          411.05 ( 0.00)%   413.44 ( 0.58)%   412.07 ( 0.25)%
      mpegaudio      189.96 ( 0.00)%   190.87 ( 0.48)%   189.42 (-0.28)%
      scimark.large   46.85 ( 0.00)%    46.41 (-0.94)%    47.83 ( 2.09)%
      scimark.small  412.91 ( 0.00)%   415.41 ( 0.61)%   421.17 ( 2.00)%
      serial         204.23 ( 0.00)%   213.46 ( 4.52)%   203.17 (-0.52)%
      startup         36.76 ( 0.00)%    35.49 (-3.45)%    35.64 (-3.05)%
      sunflow        115.34 ( 0.00)%   115.08 (-0.23)%   117.37 ( 1.76)%
      xml            620.55 ( 0.00)%   619.95 (-0.10)%   620.39 (-0.03)%
      
      composite      211.50 ( 0.00)%   211.15 (-0.17)%   211.67 ( 0.08)%
      
      time idlememstat:
      
      17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k
      448inputs+40outputs (1major+36052minor)pagefaults 0swaps
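
      The bracketed percentages in the table and the "1%CPU" figure can be
      reproduced with simple arithmetic.  The sketch below is only a check of
      the numbers quoted above; the helper names are hypothetical and the
      percentage is assumed to be the change relative to the base column.

```python
def percent_change(base, value):
    """Relative change of value against base, in percent, to 2 decimals."""
    return round((value - base) / base * 100.0, 2)


def cpu_share(user_s, system_s, elapsed_s):
    """CPU time (user + system) as a percentage of elapsed wall time."""
    return (user_s + system_s) / elapsed_s * 100.0


# "compiler" row, patched vs. base: 532.26 against 537.40
compiler_patched = percent_change(537.40, 532.26)   # -0.96

# idlememstat: 17.20 s user + 65.16 s system over 2:15:23 elapsed
elapsed_seconds = 2 * 3600 + 15 * 60 + 23            # 8123
idlememstat_cpu = cpu_share(17.20, 65.16, elapsed_seconds)  # about 1%
```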
      
      ==== SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ====
      #! /usr/bin/python
      #
      
      import os
      import stat
      import errno
      import struct
      
      CGROUP_MOUNT = "/sys/fs/cgroup/memory"
      BUFSIZE = 8 * 1024  # must be multiple of 8
      
      def get_hugepage_size():
          with open("/proc/meminfo", "r") as f:
              for s in f:
                  k, v = s.split(":")
                  if k == "Hugepagesize":
                      return int(v.split()[0]) * 1024
      
      PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
      HUGEPAGE_SIZE = get_hugepage_size()
      
      def set_idle():
          f = open("/sys/kernel/mm/page_idle/bitmap", "wb", BUFSIZE)
          while True:
              try:
                  f.write(struct.pack("Q", pow(2, 64) - 1))
              except IOError as err:
                  if err.errno == errno.ENXIO:
                      break
                  raise
          f.close()
      
      def count_idle():
          f_flags = open("/proc/kpageflags", "rb", BUFSIZE)
          f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE)
      
          with open("/sys/kernel/mm/page_idle/bitmap", "rb", BUFSIZE) as f:
              while f.read(BUFSIZE): pass  # update idle flag
      
          idlememsz = {}
          while True:
              s1, s2 = f_flags.read(8), f_cgroup.read(8)
              if not s1 or not s2:
                  break
      
              flags, = struct.unpack('Q', s1)
              cgino, = struct.unpack('Q', s2)
      
              unevictable = (flags >> 18) & 1
              huge = (flags >> 22) & 1
              idle = (flags >> 25) & 1
      
              if idle and not unevictable:
                  idlememsz[cgino] = idlememsz.get(cgino, 0) + \
                      (HUGEPAGE_SIZE if huge else PAGE_SIZE)
      
          f_flags.close()
          f_cgroup.close()
          return idlememsz
      
      if __name__ == "__main__":
          print "Setting the idle flag for each page..."
          set_idle()
      
          raw_input("Wait until the workload accesses its working set, "
                    "then press Enter")
      
          print "Counting idle pages..."
          idlememsz = count_idle()
      
          for dir, subdirs, files in os.walk(CGROUP_MOUNT):
              ino = os.stat(dir)[stat.ST_INO]
              print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB"
      ==== END SCRIPT ====
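
      The flag tests inside count_idle() can be read in isolation as follows.
      This is a small sketch that only restates the bit positions used by the
      script above (18, 22 and 25); the constant and function names are
      illustrative and not taken from the kernel headers.

```python
# Bit positions as used by the script above; names are illustrative only.
UNEVICTABLE_BIT = 18
HUGE_BIT = 22
IDLE_BIT = 25


def decode_flags(flags):
    """Split a 64-bit /proc/kpageflags entry into the bits the script uses."""
    return {
        "unevictable": bool((flags >> UNEVICTABLE_BIT) & 1),
        "huge": bool((flags >> HUGE_BIT) & 1),
        "idle": bool((flags >> IDLE_BIT) & 1),
    }


def counts_as_idle(flags):
    """Mirror the script's rule: idle pages count unless unevictable."""
    decoded = decode_flags(flags)
    return decoded["idle"] and not decoded["unevictable"]
```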
      
      This patch (of 8):
      
      Add page_cgroup_ino() helper to memcg.
      
      This function returns the inode number of the closest online ancestor of
      the memory cgroup a page is charged to.  It is required for exporting
      information about which page is charged to which cgroup to userspace,
      which will be introduced by a following patch.
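
      The "closest online ancestor" rule can be pictured with a toy userspace
      model.  This is not the kernel implementation: the Cgroup class, its
      fields, and the choice to return 0 when no online ancestor exists are all
      assumptions made for the sketch.

```python
class Cgroup(object):
    """Toy stand-in for a memory cgroup (hypothetical, for illustration)."""

    def __init__(self, ino, parent=None, online=True):
        self.ino = ino
        self.parent = parent
        self.online = online


def closest_online_ino(cgroup):
    """Walk towards the root until an online cgroup is found.

    Returns that cgroup's inode number, or 0 if there is none
    (the 0 fallback is an assumption of this sketch).
    """
    while cgroup is not None and not cgroup.online:
        cgroup = cgroup.parent
    return cgroup.ino if cgroup is not None else 0
```

      For example, a page charged to an offline child of an online cgroup with
      inode number 10 would be reported under inode number 10.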
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2fc04524
    • zswap: update docs for runtime-changeable attributes · 9c4c5ef3
      Dan Streetman authored
      Change the Documentation/vm/zswap.txt doc to indicate that the "zpool" and
      "compressor" params are now changeable at runtime.
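
      As a hedged sketch of what changing these params at runtime might look
      like from userspace, the helpers below read and write module parameter
      files.  The default directory is the usual sysfs location for zswap
      module parameters; the helper names are invented for the example, and
      the values a given kernel accepts depend on its configuration.

```python
import os

# Usual sysfs location of zswap's module parameters.
ZSWAP_PARAMS = "/sys/module/zswap/parameters"


def read_param(name, base=ZSWAP_PARAMS):
    """Return the current value of a parameter file, stripped of newlines."""
    with open(os.path.join(base, name)) as f:
        return f.read().strip()


def write_param(name, value, base=ZSWAP_PARAMS):
    """Write a new value to a parameter file (needs root on a real system)."""
    with open(os.path.join(base, name), "w") as f:
        f.write(value)


# Possible usage on a real system (values are examples only):
#   write_param("compressor", "lzo")
#   write_param("zpool", "zbud")
```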
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c4c5ef3
    • zswap: change zpool/compressor at runtime · 90b0fc26
      Dan Streetman authored
      Update the zpool and compressor parameters to be changeable at runtime.
      When changed, a new pool is created with the requested zpool/compressor,
      and added as the current pool at the front of the pool list.  Previous
      pools remain in the list only so that their existing compressed pages can
      be removed.  The old pool(s) are removed once they become empty.
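
      The pool-list behaviour described above can be modelled in a few lines.
      This is a toy userspace sketch, not the zswap code: the Pool and PoolList
      classes and their method names are hypothetical, and locking, reference
      counting and reuse of an existing matching pool are deliberately left
      out.

```python
class Pool(object):
    """Toy pool: a zpool/compressor pair plus the entries stored in it."""

    def __init__(self, zpool, compressor):
        self.zpool = zpool
        self.compressor = compressor
        self.entries = set()


class PoolList(object):
    """Keeps the current pool at the front; drops old pools once empty."""

    def __init__(self, zpool, compressor):
        self.pools = [Pool(zpool, compressor)]

    @property
    def current(self):
        return self.pools[0]

    def change(self, zpool, compressor):
        # A parameter change creates a new pool at the front of the list.
        self.pools.insert(0, Pool(zpool, compressor))
        self._prune()

    def store(self, key):
        # New entries always go to the current pool.
        self.current.entries.add(key)
        return self.current

    def remove(self, pool, key):
        # Old pools are only ever drained, never filled.
        pool.entries.discard(key)
        self._prune()

    def _prune(self):
        # Keep the current pool and any old pool that still holds entries.
        self.pools = [p for i, p in enumerate(self.pools)
                      if i == 0 or p.entries]
```
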
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Seth Jennings <sjennings@variantweb.net>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90b0fc26
    • zswap: dynamic pool creation · f1c54846
      Dan Streetman authored
      Add dynamic creation of pools.  Move the static crypto compression per-cpu
      transforms into each pool.  Add a pointer to zswap_entry to the pool it's
      in.
      
      This is required by the following patch which enables changing the zswap
      zpool and compressor params at runtime.
      
      [akpm@linux-foundation.org: fix merge snafus]
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Seth Jennings <sjennings@variantweb.net>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f1c54846
    • zpool: add zpool_has_pool() · 3f0e1312
      Dan Streetman authored
      This series makes creation of the zpool and compressor dynamic, so that
      they can be changed at runtime.  This makes using/configuring zswap
      easier, since previously zswap had to be configured at boot time, using
      boot params.
      
      This uses a single list to track both the zpool and compressor together,
      although Seth had mentioned an alternative which is to track the zpools
      and compressors using separate lists.  In the most common case, only a
      single zpool and single compressor, using one list is slightly simpler
      than using two lists, and for the uncommon case of multiple zpools and/or
      compressors, using one list is slightly less simple (and uses slightly
      more memory, probably) than using two lists.
      
      This patch (of 4):
      
      Add zpool_has_pool() function, indicating if the specified type of zpool
      (e.g.  zsmalloc or zbud) is available.  This allows checking if a pool is
      available, without actually trying to allocate it, similar to
      crypto_has_alg().
      
      This is used by a following patch to zswap that enables the dynamic
      runtime creation of zswap zpools.
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Seth Jennings <sjennings@variantweb.net>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f0e1312
  2. 09 Sep, 2015 3 commits
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma · 26d2177e
      Linus Torvalds authored
      Pull InfiniBand/RDMA updates from Doug Ledford:
       "This is a fairly sizeable set of changes.  I've put them through a
        decent amount of testing prior to sending the pull request due to
        that.
      
        There are still a few fixups that I know are coming, but I wanted to
        go ahead and get the big, sizable chunk into your hands sooner rather
        than waiting for those last few fixups.
      
        Of note is the fact that this creates what is intended to be a
        temporary area in the drivers/staging tree specifically for some
        cleanups and additions that are coming for the RDMA stack.  We
        deprecated two drivers (ipath and amso1100) and are waiting to hear
        back if we can deprecate another one (ehca).  We also put Intel's new
        hfi1 driver into this area because it needs to be refactored and a
        transfer library created out of the factored out code, and then it and
        the qib driver and the soft-roce driver should all be modified to use
        that library.
      
        I expect drivers/staging/rdma to be around for three or four kernel
        releases and then to go away as all of the work is completed and final
        deletions of deprecated drivers are done.
      
        Summary of changes for 4.3:
      
         - Create drivers/staging/rdma
         - Move amso1100 driver to staging/rdma and schedule for deletion
         - Move ipath driver to staging/rdma and schedule for deletion
         - Add hfi1 driver to staging/rdma and set TODO for move to regular
           tree
         - Initial support for namespaces to be used on RDMA devices
         - Add RoCE GID table handling to the RDMA core caching code
         - Infrastructure to support handling of devices with differing read
           and write scatter gather capabilities
         - Various iSER updates
         - Kill off unsafe usage of global mr registrations
         - Update SRP driver
         - Misc mlx4 driver updates
         - Support for the mr_alloc verb
         - Support for a netlink interface between kernel and user space cache
           daemon to speed path record queries and route resolution
         - Initial support for safe hot removal of verbs devices"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits)
        IB/ipoib: Suppress warning for send only join failures
        IB/ipoib: Clean up send-only multicast joins
        IB/srp: Fix possible protection fault
        IB/core: Move SM class defines from ib_mad.h to ib_smi.h
        IB/core: Remove unnecessary defines from ib_mad.h
        IB/hfi1: Add PSM2 user space header to header_install
        IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY
        mlx5: Fix incorrect wc pkey_index assignment for GSI messages
        IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow
        IB/uverbs: reject invalid or unknown opcodes
        IB/cxgb4: Fix if statement in pick_local_ip6adddrs
        IB/sa: Fix rdma netlink message flags
        IB/ucma: HW Device hot-removal support
        IB/mlx4_ib: Disassociate support
        IB/uverbs: Enable device removal when there are active user space applications
        IB/uverbs: Explicitly pass ib_dev to uverbs commands
        IB/uverbs: Fix race between ib_uverbs_open and remove_one
        IB/uverbs: Fix reference counting usage of event files
        IB/core: Make ib_dealloc_pd return void
        IB/srp: Create an insecure all physical rkey only if needed
        ...
      26d2177e
    • Merge tag 'for-linus-4.3' of git://git.code.sf.net/p/openipmi/linux-ipmi · a794b4f3
      Linus Torvalds authored
      Pull IPMI updates from Corey Minyard:
       "Most of these have been sitting in linux-next for more than a release,
        particularly commit 0fbcf4af ("ipmi: Convert the IPMI SI ACPI
        handling to a platform device") which is probably the most complex
        patch.
      
        That is also the one that changes drivers/acpi/acpi_pnp.c.  The change
        in that file is only removing IPMI from a "special platform devices"
        list, since I convert it to the standard PNP interface.  I posted this
        one to the ACPI list twice and got no response, and it seems to work
        well in my testing, so I'm hoping it's good.
      
        Hidehiro Kawai posted a set of changes that improves the panic time
        handling in the IPMI driver.
      
        The rest of the changes are minor bug fixes or cleanups and some
        documentation"
      
      * tag 'for-linus-4.3' of git://git.code.sf.net/p/openipmi/linux-ipmi:
        ipmi:ssif: Add a module parm to specify that SMBus alerts don't work
        ipmi: add of_device_id in MODULE_DEVICE_TABLE
        ipmi: Compensate for BMCs that wont set the irq enable bit
        ipmi: Don't call receive handler in the panic context
        ipmi: Avoid touching possible corrupted lists in the panic context
        ipmi: Don't flush messages in sender() in run-to-completion mode
        ipmi: Factor out message flushing procedure
        ipmi: Remove unneeded set_run_to_completion call
        ipmi: Make some data const that was only read
        ipmi: constify SSIF ACPI device ids
        ipmi: Delete an unnecessary check before the function call "cleanup_one_si"
        char:ipmi - Change 1 to true for bool type variables during initialization.
        impi:Remove unneeded setting of module owner to THIS_MODULE in the platform structure, powernv_ipmi_driver
        ipmi: Add a comment in how messages are delivered from the lower layer
        ipmi/powernv: Fix potential invalid pointer dereference
        ipmi: Convert the IPMI SI ACPI handling to a platform device
        ipmi: Add device tree bindings information
      a794b4f3
    • Merge branch 'akpm' (patches from Andrew) · f6f7a636
      Linus Torvalds authored
      Merge second patch-bomb from Andrew Morton:
       "Almost all of the rest of MM.  There was an unusually large amount of
        MM material this time"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (141 commits)
        zpool: remove no-op module init/exit
        mm: zbud: constify the zbud_ops
        mm: zpool: constify the zpool_ops
        mm: swap: zswap: maybe_preload & refactoring
        zram: unify error reporting
        zsmalloc: remove null check from destroy_handle_cache()
        zsmalloc: do not take class lock in zs_shrinker_count()
        zsmalloc: use class->pages_per_zspage
        zsmalloc: consider ZS_ALMOST_FULL as migrate source
        zsmalloc: partial page ordering within a fullness_list
        zsmalloc: use shrinker to trigger auto-compaction
        zsmalloc: account the number of compacted pages
        zsmalloc/zram: introduce zs_pool_stats api
        zsmalloc: cosmetic compaction code adjustments
        zsmalloc: introduce zs_can_compact() function
        zsmalloc: always keep per-class stats
        zsmalloc: drop unused variable `nr_to_migrate'
        mm/memblock.c: fix comment in __next_mem_range()
        mm/page_alloc.c: fix type information of memoryless node
        memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
        ...
      f6f7a636