1. 26 Jun, 2015 40 commits
    • security_syslog() should be called once only · d194e5d6
      Vasily Averin authored
      The final version of commit 637241a9 ("kmsg: honor dmesg_restrict
      sysctl on /dev/kmsg") lost a few hooks; as a result, security_syslog()
      is processed incorrectly:
      
      - open of /dev/kmsg checks syslog access permissions by using
        check_syslog_permissions() where security_syslog() is not called if
        dmesg_restrict is set.
      
      - the syslog syscall and /proc/kmsg call do_syslog(), where
        security_syslog() can be executed twice (inside
        check_syslog_permissions() and then directly in do_syslog()).
      
      With this patch security_syslog() is called once only in all
      syslog-related operations regardless of dmesg_restrict value.
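
      The intended call structure can be modeled in a few lines of userspace C. This is a hypothetical sketch, not the kernel code: security_hook() stands in for security_syslog(), the two booleans stand in for dmesg_restrict and the caller's CAP_SYSLOG, and hook_calls only counts invocations. The permission check ends with exactly one call to the hook, and the do_syslog() stand-in relies on it instead of calling the hook a second time.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; not real kernel interfaces. */
static bool dmesg_restrict;          /* models the sysctl */
static bool caller_has_cap_syslog;   /* models capable(CAP_SYSLOG) */
static int hook_calls;               /* how many times the hook ran */

/* Stand-in for security_syslog(): counts invocations and allows access. */
static int security_hook(void)
{
	hook_calls++;
	return 0;
}

/* Stand-in for check_syslog_permissions(): the hook is its last step. */
static int check_permissions(void)
{
	if (dmesg_restrict && !caller_has_cap_syslog)
		return -EPERM;
	return security_hook();
}

/* Stand-in for do_syslog(): no second, direct call to the hook. */
static int do_syslog_model(void)
{
	int err = check_permissions();

	if (err)
		return err;
	/* ... perform the requested syslog action here ... */
	return 0;
}
```

      With dmesg_restrict either on or off, a permitted call increments hook_calls by exactly one; a call refused by the capability check returns -EPERM before reaching the hook.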
      
      Fixes: 637241a9 ("kmsg: honor dmesg_restrict sysctl on /dev/kmsg")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: Eric Paris <eparis@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • netconsole: implement extended console support · e2f15f9a
      Tejun Heo authored
      printk logbuf keeps various metadata and optional key=value dictionary for
      structured messages, both of which are stripped when messages are handed
      to regular console drivers.
      
      It can be useful to have this metadata and dictionary available to
      netconsole consumers.  This obviously makes logging via netconsole more
      complete and the sequence number in particular is useful in environments
      where messages may be lost or reordered in transit - e.g.  when netconsole
      is used to collect messages in a large cluster where packets may have to
      travel congested hops to reach the aggregator.  The lost and reordered
      messages can easily be identified and handled accordingly using the
      sequence numbers.
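
      As a rough illustration of the receiver side, the helper below compares the sequence number of each arriving message with the last one seen. It is a hypothetical sketch that assumes the sequence number has already been parsed out of the message header; missing_between() is not part of netconsole or of any existing receiver.

```c
#include <stdint.h>

/*
 * Hypothetical receiver-side check.  Returns the number of messages
 * skipped between last_seq and seq (0 when they are consecutive), or -1
 * when seq is not newer than last_seq, i.e. a duplicate or a message that
 * arrived out of order.
 */
static long long missing_between(uint64_t last_seq, uint64_t seq)
{
	if (seq <= last_seq)
		return -1;
	return (long long)(seq - last_seq - 1);
}
```

      A receiver could count a positive return as lost messages and buffer or discard messages that return -1; that policy is left open here.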
      
      printk recently added extended console support which can be selected by
      setting CON_EXTENDED flag.  From console driver side, not much changes.
      The only difference is that the text passed to the write callback is
      formatted the same way as /dev/kmsg.
      
      This patch implements extended console support for netconsole which can be
      enabled by either prepending "+" to a netconsole boot param entry or
      echoing 1 to the "extended" file in configfs.  When enabled, netconsole
      transmits extended log messages with headers identical to /dev/kmsg
      output.
      
      There's one complication due to message fragments.  netconsole limits the
      maximum message size to 1k and messages longer than that are split into
      multiple fragments.  As all extended console messages should carry
      matching headers and be uniquely identifiable, each extended message
      fragment carries a full copy of the metadata and an extra header field to
      identify the specific fragment.  The optional header is of the form
      "ncfrag=OFF/LEN" where OFF is the byte offset into the message body and
      LEN is the total length.
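
      The fragmenting rule can be sketched as follows. This is a hypothetical userspace function, not the netconsole implementation: it assumes the message is laid out as "<header>;<body>" like /dev/kmsg output, that the ncfrag field is appended to the header just before the ';', and that max_len plays the role of the 1k limit. Each call builds the fragment whose body part starts at offset off.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch, not the netconsole code.  msg is "<header>;<body>".
 * Builds into out (at least max_len bytes, not NUL-terminated) the fragment
 * whose body part starts at byte offset off, and returns its length.
 * *consumed receives the number of body bytes carried.  Returns 0 if msg
 * has no ';', off is past the body, or the header alone fills max_len.
 */
static size_t build_fragment(const char *msg, size_t off, size_t max_len,
			     char *out, size_t *consumed)
{
	const char *sep = strchr(msg, ';');
	const char *body;
	size_t body_len, chunk;
	int n;

	if (!sep)
		return 0;
	body = sep + 1;
	body_len = strlen(body);
	if (off > body_len)
		return 0;

	/* Repeat the full header, then add "ncfrag=OFF/LEN" before ';'. */
	n = snprintf(out, max_len, "%.*s,ncfrag=%zu/%zu;",
		     (int)(sep - msg), msg, off, body_len);
	if (n < 0 || (size_t)n >= max_len)
		return 0;

	/* Fill the rest of the fragment with as much body as fits. */
	chunk = max_len - (size_t)n;
	if (chunk > body_len - off)
		chunk = body_len - off;
	memcpy(out + n, body + off, chunk);
	*consumed = chunk;
	return (size_t)n + chunk;
}
```

      A caller would start at off = 0 and advance off by *consumed until it reaches the body length. In the scheme assumed here, a message that already fits within max_len would be sent as-is, without the ncfrag field.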
      
      To avoid unnecessarily making printk format extended messages, extended
      netconsole is registered with printk only when the first extended
      netconsole is configured.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • netconsole: make all dynamic netconsoles share a mutex · 369e5a88
      Tejun Heo authored
      Currently, each dynamic netconsole_target uses its own separate mutex to
      synchronize the configuration operations.
      
      This patch replaces the per-netconsole_target mutexes with a single
      mutex, dynamic_netconsole_mutex.  The reduced granularity doesn't hurt
      anything, the code is minutely simpler, and this would allow adding
      operations that should be synchronized across all dynamic netconsoles.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • netconsole: make netconsole_target->enabled a bool · 698cf1c6
      Tejun Heo authored
      netconsole uses both bool and int for boolean values.  Let's convert
      nt->enabled to bool for consistency.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • netconsole: remove unnecessary netconsole_target_get/put() from write_msg() · a6d403ac
      Tejun Heo authored
      write_msg() grabs target_list_lock and walks target_list invoking
      netpoll_send_udp() on each target.  Curiously, it protects each iteration
      with netconsole_target_get/put() even though it never releases
      target_list_lock, which protects all the members.
      
      While this doesn't harm anything, it doesn't serve any purpose either.
      The items on the list can't go away while target_list_lock is held.
      Remove the unnecessary get/put pair.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • printk: implement support for extended console drivers · 6fe29354
      Tejun Heo authored
      printk log_buf keeps various metadata for each message including its
      sequence number and timestamp.  The metadata is currently available only
      through /dev/kmsg and is stripped out before messages are passed on to
      console drivers.  We want this metadata to be available to console
      drivers too, so that console consumers can get full information
      including the metadata and dictionary, which among other things can be
      used to detect whether messages got lost in transit.
      
      This patch implements support for extended console drivers.  Consoles can
      indicate that they want extended messages by setting the new CON_EXTENDED
      flag and they'll be fed messages formatted the same way as /dev/kmsg.
      
       "<level>,<seqnum>,<timestamp>,<contflag>;<message text>\n"
      
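      A consumer of extended output might split that prefix off as follows. The sketch assumes exactly the four fields shown above followed by ';'; struct ext_hdr and parse_ext_hdr() are hypothetical names, and the timestamp is stored without interpreting its unit.

```c
#include <stdio.h>

/* Hypothetical holder for the parsed prefix fields. */
struct ext_hdr {
	unsigned int level;
	unsigned long long seq;
	unsigned long long timestamp;
	char contflag;
};

/*
 * Parse "<level>,<seqnum>,<timestamp>,<contflag>;<message text>".
 * Fills *h and returns a pointer to the message text, or NULL if the
 * prefix does not match that shape.
 */
static const char *parse_ext_hdr(const char *line, struct ext_hdr *h)
{
	int used = 0;

	if (sscanf(line, "%u,%llu,%llu,%c%n", &h->level, &h->seq,
		   &h->timestamp, &h->contflag, &used) != 4)
		return NULL;
	if (line[used] != ';')
		return NULL;
	return line + used + 1;
}
```
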
      If extended consoles exist, in-kernel fragment assembly is disabled.  This
      ensures that all messages emitted to consoles have full metadata including
      sequence number.  The contflag carries enough information to reassemble
      the fragments from the reader side trivially.  Note that this only affects
      /dev/kmsg.  Regular console and /proc/kmsg outputs are not affected by
      this change.
      
      * Extended message formatting for console drivers is enabled iff there
        are registered extended consoles.
      
      * Comment describing /dev/kmsg message format updated to add the missing
        contflag field and to help distinguish variable from verbatim terms.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Kay Sievers <kay@vrfy.org>
      Reviewed-by: Petr Mladek <pmladek@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • printk: factor out message formatting from devkmsg_read() · 0a295e67
      Tejun Heo authored
      The extended message formatting used for /dev/kmsg will be used to
      implement extended consoles.  Factor out msg_print_ext_header() and
      msg_print_ext_body() from devkmsg_read().
      
      This is pure restructuring.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Kay Sievers <kay@vrfy.org>
      Reviewed-by: Petr Mladek <pmladek@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • printk: guard the amount written per line by devkmsg_read() · d43ff430
      Tejun Heo authored
      This patchset updates netconsole so that it can emit messages with the
      same header as used in /dev/kmsg, which gives the netconsole receiver
      full log information and enables things like structured logging and
      detection of lost messages.
      
      This patch (of 7):
      
      devkmsg_read() uses an 8k buffer and assumes that the formatted output
      message won't overrun it, which seems safe given LOG_LINE_MAX, the
      current use of dict and the escaping method being used; however, we're
      planning to use devkmsg formatting more widely, and accounting for the
      buffer size properly isn't that complicated.
      
      This patch defines CONSOLE_EXT_LOG_MAX as 8192 and updates devkmsg_read()
      so that it limits output accordingly.
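
      The accounting pattern is to hand each formatting call only the space that is left and to advance by what was actually written. The kernel has scnprintf() for this; the userspace sketch below uses a hypothetical scnprintf_like() built on vsnprintf(), and format_record() is a made-up two-step formatter, not devkmsg_read().

```c
#include <stdarg.h>
#include <stdio.h>

#define CONSOLE_EXT_LOG_MAX 8192   /* value given in the text */

/*
 * Userspace stand-in for scnprintf(): returns the number of characters
 * actually stored (excluding the NUL), which is at most size - 1.
 */
static size_t scnprintf_like(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	if (size == 0)
		return 0;
	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);
	va_end(args);
	if (n < 0)
		return 0;
	return (size_t)n < size ? (size_t)n : size - 1;
}

/* Hypothetical formatter: header then text, never writing past size. */
static size_t format_record(char *buf, size_t size, unsigned int level,
			    unsigned long long seq, const char *text)
{
	size_t len = 0;

	len += scnprintf_like(buf + len, size - len, "%u,%llu,-;", level, seq);
	len += scnprintf_like(buf + len, size - len, "%s\n", text);
	return len;
}
```

      Because each call returns what it stored rather than what it wanted to store, len can never pass size - 1, so a too-small buffer truncates the record instead of overrunning.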
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Kay Sievers <kay@vrfy.org>
      Reviewed-by: Petr Mladek <pmladek@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/misc/altera-stapl/altera.c: remove extraneous KERN_INFO prefix · 4ae555a5
      Colin Ian King authored
      The KERN_INFO prefix is being prepended to KERN_DEBUG when using the
      dprintk macro.  Remove it, as it is extraneous since we are printing the
      message out as debug via dprintk().
      
      Fixes smatch warning:
      
      drivers/misc/altera-stapl/altera.c:2454 altera_init()
         warn: KERN_* level not at start of string
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Igor M. Liplianin <liplianin@netup.ru>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • clone: support passing tls argument via C rather than pt_regs magic · 3033f14a
      Josh Triplett authored
      clone has some of the quirkiest syscall handling in the kernel, with a
      pile of special cases, historical curiosities, and architecture-specific
      calling conventions.  In particular, clone with CLONE_SETTLS accepts a
      parameter "tls" that the C entry point completely ignores and some
      assembly entry points overwrite; instead, the low-level arch-specific
      code pulls the tls parameter out of the arch-specific register captured
      as part of pt_regs on entry to the kernel.  That's a massive hack, and
      it makes the arch-specific code only work when called via the specific
      existing syscall entry points; because of this hack, any new clone-like
      system call would have to accept an identical tls argument in exactly
      the same arch-specific position, rather than providing a unified system
      call entry point across architectures.
      
      The first patch allows architectures to handle the tls argument via
      normal C parameter passing, if they opt in by selecting
      HAVE_COPY_THREAD_TLS.  The second patch makes 32-bit and 64-bit x86 opt
      into this.
      
      These two patches came out of the clone4 series, which isn't ready for
      this merge window, but these first two cleanup patches were entirely
      uncontroversial and have acks.  I'd like to go ahead and submit these
      two so that other architectures can begin building on top of this and
      opting into HAVE_COPY_THREAD_TLS.  However, I'm also happy to wait and
      send these through the next merge window (along with v3 of clone4) if
      anyone would prefer that.
      
      This patch (of 2):
      
      clone with CLONE_SETTLS accepts an argument to set the thread-local
      storage area for the new thread.  sys_clone declares an int argument
      tls_val in the appropriate point in the argument list (based on the
      various CLONE_BACKWARDS variants), but doesn't actually use or pass along
      that argument.  Instead, sys_clone calls do_fork, which calls
      copy_process, which calls the arch-specific copy_thread, and copy_thread
      pulls the corresponding syscall argument out of the pt_regs captured at
      kernel entry (knowing what argument of clone that architecture passes tls
      in).
      
      Apart from being awful and inscrutable, that also only works because only
      one code path into copy_thread can pass the CLONE_SETTLS flag, and that
      code path comes from sys_clone with its architecture-specific
      argument-passing order.  This prevents introducing a new version of the
      clone system call without propagating the same architecture-specific
      position of the tls argument.
      
      However, there's no reason to pull the argument out of pt_regs when
      sys_clone could just pass it down via C function call arguments.
      
      Introduce a new CONFIG_HAVE_COPY_THREAD_TLS for architectures to opt into,
      and a new copy_thread_tls that accepts the tls parameter as an additional
      unsigned long (syscall-argument-sized) argument.  Change sys_clone's tls
      argument to an unsigned long (which does not change the ABI), and pass
      that down to copy_thread_tls.
      
      Architectures that don't opt into copy_thread_tls will continue to ignore
      the C argument to sys_clone in favor of the pt_regs captured at kernel
      entry, and thus will be unable to introduce new versions of the clone
      syscall.
      
      Patch co-authored by Josh Triplett and Thiago Macieira.
      Signed-off-by: Josh Triplett <josh@joshtriplett.org>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thiago Macieira <thiago.macieira@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • stddef.h: move offsetofend inside #ifndef/#endif guard, neaten · 8c7fbe57
      Joe Perches authored
      Commit 38764884 ("include/stddef.h: Move offsetofend() from vfio.h
      to a generic kernel header") added offsetofend outside the normal
      include #ifndef/#endif guard.  Move it inside.
      
      Miscellanea:
      
      o remove unnecessary blank line
      o standardize offsetof macros whitespace style
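
      For reference, the macro being moved computes the offset of the first byte after a structure member. The sketch below is a userspace rendition that places the definition inside an #ifndef/#endif pair, as the commit does; the guard name EXAMPLE_STDDEF_H is a hypothetical stand-in for the header's real one, and struct sample is made up for the example.

```c
#ifndef EXAMPLE_STDDEF_H	/* hypothetical include guard */
#define EXAMPLE_STDDEF_H

#include <stddef.h>

/* Offset of the first byte past MEMBER within TYPE. */
#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

#endif /* EXAMPLE_STDDEF_H */

/* Made-up structure used only to exercise the macro. */
struct sample {
	char tag;
	int value;
	char name[8];
};
```

      The sizeof operand is never evaluated, so the null-pointer member access is only used for its type.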
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mailmap: add rdunlap email auto-correction · 4d5b367c
      Kees Cook authored
      To avoid having xenotime bounce when tools like get_maintainer.pl give
      me addresses, add Randy's current address.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mohit Kumar has moved · 9c5dcdd0
      Pratyush Anand authored
      Mohit's email-id doesn't exist anymore as he has left the company.
      Replace ST's id with mohit.kumar.dhaka@gmail.com.
      Signed-off-by: Pratyush Anand <pratyush.anand@gmail.com>
      Cc: Mohit Kumar <mohit.kumar.dhaka@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Pratyush Anand has moved · e34cadde
      Pratyush Anand authored
      pratyush.anand@st.com email-id doesn't exist anymore as I have left the
      company.  Replace ST's id with pratyush.anand@gmail.com.
      Signed-off-by: Pratyush Anand <pratyush.anand@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • compiler-intel: fix wrong compiler barrier() macro · b86a50c3
      Daniel Borkmann authored
      Cleanup commit 73679e50 ("compiler-intel.h: Remove duplicate
      definition") removed the double definition of __memory_barrier()
      intrinsics.
      
      However, in doing so, it also removed the preceding #undef barrier by
      accident, meaning that the actual barrier() macro from compiler-gcc.h
      with inline asm is still in place, as __GNUC__ is provided.
      
      Subsequently, barrier() can never be defined as __memory_barrier() from
      compiler.h since it already has a definition in place and if we trust
      the comment in compiler-intel.h, ecc doesn't support gcc specific asm
      statements.
      
      I don't have an ecc at hand (unsure if that's still used in the field?)
      and only found this by accident during code review; a revert of that
      cleanup would be the simplest option.
      
      Fixes: 73679e50 ("compiler-intel.h: Remove duplicate definition")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: mancha security <mancha1@zoho.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • compiler-gcc: integrate the various compiler-gcc[345].h files · cb984d10
      Joe Perches authored
      As gcc major version numbers are going to advance rather rapidly in the
      future, there's no real value in separate files for each compiler
      version.
      
      Deduplicate some of the macros #defined in each file too.
      
      Neaten comments using normal kernel commenting style.
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Segher Boessenkool <segher@kernel.crashing.org>
      Cc: Sasha Levin <levinsasha928@gmail.com>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Alan Modra <amodra@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • compiler-gcc.h: neatening · f6d133f8
      Joe Perches authored
       - Move the inline and noinline blocks together
      
       - Comment neatening
      
       - Alignment of __attribute__ uses
      
       - Consistent naming of __must_be_array macro argument
      
       - Multiline macro neatening
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Segher Boessenkool <segher@kernel.crashing.org>
      Cc: Sasha Levin <levinsasha928@gmail.com>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Alan Modra <amodra@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs, proc: introduce CONFIG_PROC_CHILDREN · 2e13ba54
      Iago López Galeiras authored
      Commit 81841161 ("fs, proc: introduce /proc/<pid>/task/<tid>/children
      entry") introduced the children entry for checkpoint restore and the
      file is only available on kernels configured with CONFIG_EXPERT and
      CONFIG_CHECKPOINT_RESTORE.
      
      This is available in most distributions (Fedora, Debian, Ubuntu, CoreOS)
      because they usually enable CONFIG_EXPERT and CONFIG_CHECKPOINT_RESTORE.
      But Arch does not enable CONFIG_EXPERT or CONFIG_CHECKPOINT_RESTORE.
      
      However, the children proc file is useful outside of checkpoint restore.
      I would like to use it in rkt.  The rkt process exec()s another program
      it does not control, and that other program will fork()+exec() a child
      process.  I would like to find the pid of the child process from an
      external tool without iterating in /proc over all processes to find
      which one has a parent pid equal to rkt.
      
      This commit introduces CONFIG_PROC_CHILDREN and makes
      CONFIG_CHECKPOINT_RESTORE select it.  This allows enabling
      /proc/<pid>/task/<tid>/children without needing to enable
      CONFIG_CHECKPOINT_RESTORE and CONFIG_EXPERT.
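
      A minimal sketch of the external-tool side, assuming a kernel built with the option described above: read_children_file() and parse_children() are hypothetical helpers, not part of rkt. The file lists the children of one thread, so the sketch takes both a pid and a tid; for a single-threaded process, tid equals pid.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Read /proc/<pid>/task/<tid>/children into buf.  Returns 0 on success or
 * -1 if the file cannot be opened (for instance, on a kernel built without
 * the option).  An empty file means the thread has no children.
 */
static int read_children_file(long pid, long tid, char *buf, size_t size)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%ld/task/%ld/children", pid, tid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)size, f))
		buf[0] = '\0';
	fclose(f);
	return 0;
}

/* Parse whitespace-separated PIDs from s into pids[]; returns the count. */
static size_t parse_children(const char *s, long *pids, size_t max)
{
	size_t n = 0;
	char *end;

	while (n < max) {
		long v = strtol(s, &end, 10);

		if (end == s)
			break;	/* nothing left to convert */
		pids[n++] = v;
		s = end;
	}
	return n;
}
```
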
      
      Alban tested that /proc/<pid>/task/<tid>/children is present when the
      kernel is configured with CONFIG_PROC_CHILDREN=y but without
      CONFIG_CHECKPOINT_RESTORE.
      Signed-off-by: Iago López Galeiras <iago@endocode.com>
      Tested-by: Alban Crequy <alban@endocode.com>
      Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Djalal Harouni <djalal@endocode.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: fix PAGE_SIZE limit of /proc/$PID/cmdline · c2c0bb44
      Alexey Dobriyan authored
      /proc/$PID/cmdline truncates output at PAGE_SIZE. It is easy to see with
      
      	$ cat /proc/self/cmdline $(seq 1037) 2>/dev/null
      
      However, the command line size was never limited to PAGE_SIZE but to
      128 KB, and relatively recently the limitation was removed altogether.
      
      People noticed and asked questions:
      http://stackoverflow.com/questions/199130/how-do-i-increase-the-proc-pid-cmdline-4096-byte-limit
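
      From userspace, the way to avoid baking in the old limit is to keep reading until end of file rather than issuing a single PAGE_SIZE-sized read. A minimal sketch; read_all() and count_args() are hypothetical helpers, and the arguments are assumed to be NUL-separated, as in /proc/$PID/cmdline.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Read a whole file into a heap buffer, doubling it as needed instead of
 * assuming the data fits in one page.  Returns the buffer (caller frees)
 * and stores its length in *len, or returns NULL on failure.
 */
static char *read_all(const char *path, size_t *len)
{
	FILE *f = fopen(path, "rb");
	size_t cap = 4096, used = 0;
	char *buf;

	if (!f)
		return NULL;
	buf = malloc(cap);
	while (buf) {
		size_t n = fread(buf + used, 1, cap - used, f);

		used += n;
		if (n == 0)		/* end of file (or read error) */
			break;
		if (used == cap) {	/* full: grow before reading more */
			char *tmp = realloc(buf, cap * 2);

			if (!tmp) {
				free(buf);
				buf = NULL;
				break;
			}
			buf = tmp;
			cap *= 2;
		}
	}
	fclose(f);
	if (buf)
		*len = used;
	return buf;
}

/* Count NUL-separated arguments in a cmdline buffer of length len. */
static size_t count_args(const char *buf, size_t len)
{
	size_t i, n = 0;

	for (i = 0; i < len; i++)
		if (buf[i] == '\0')
			n++;
	/* A final argument without a terminating NUL still counts. */
	if (len > 0 && buf[len - 1] != '\0')
		n++;
	return n;
}
```
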
      
      The seq_file interface is not OK, because it kmallocs a buffer for the
      whole output, and open + read(, 1) + sleep will pin arbitrary amounts of
      kernel memory.  Avoiding that would require imposing a limit, which is
      incompatible with arbitrarily sized command lines.
      
      I apologize for the hairy code, but it is a direct consequence of the
      command line layout in memory and of hacks to support things like
      "init [3]".
      
      The loops are "unrolled"; otherwise it would be either macros which hide
      control flow or functions with 7-8 arguments and an equal line count.
      
      There should be a real setproctitle(2) or something.
      
      [akpm@linux-foundation.org: fix a billion min() warnings]
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Tested-by: Jarod Wilson <jarod@redhat.com>
      Acked-by: Jarod Wilson <jarod@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • prctl: more prctl(PR_SET_MM_*) checks · 4a00e9df
      Alexey Dobriyan authored
      Individual prctl(PR_SET_MM_*) calls do some checking to maintain a
      consistent view of mm->arg_start et al fields, but not enough.  In
      particular PR_SET_MM_ARG_START/PR_SET_MM_ARG_END/PR_SET_MM_ENV_START/
      PR_SET_MM_ENV_END only check that the address lies in an existing VMA,
      but don't check that the start address is lower than the end address _at
      all_.
      
      Consolidate all consistency checks, so there will be no difference in
      the future between PR_SET_MM_MAP and individual PR_SET_MM_* calls.
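
      The idea of one check shared by every path can be sketched as follows. This is a simplified, hypothetical model rather than the kernel's validation code: struct mm_layout holds only the four fields named above, a single [lo, hi] range stands in for the VMA lookups, and set_one_field() validates a copy of the whole layout before committing a single-field change, the same way a full-map update would be validated.

```c
#include <stdbool.h>

/* Hypothetical subset of the fields PR_SET_MM_* can change. */
struct mm_layout {
	unsigned long arg_start, arg_end;
	unsigned long env_start, env_end;
};

/* Hypothetical stand-in for "lies in an existing VMA": [lo, hi]. */
struct range {
	unsigned long lo, hi;
};

enum mm_field { ARG_START, ARG_END, ENV_START, ENV_END };

/* A span is valid if it is ordered and fully inside the range. */
static bool span_ok(unsigned long start, unsigned long end,
		    const struct range *vma)
{
	return vma->lo <= start && start <= end && end <= vma->hi;
}

/* The single consistency check used by every update path. */
static bool layout_valid(const struct mm_layout *m, const struct range *vma)
{
	return span_ok(m->arg_start, m->arg_end, vma) &&
	       span_ok(m->env_start, m->env_end, vma);
}

/* Apply one field change only if the resulting layout is consistent. */
static int set_one_field(struct mm_layout *cur, const struct range *vma,
			 enum mm_field f, unsigned long addr)
{
	struct mm_layout next = *cur;	/* validate a copy first */

	switch (f) {
	case ARG_START: next.arg_start = addr; break;
	case ARG_END:   next.arg_end = addr;   break;
	case ENV_START: next.env_start = addr; break;
	case ENV_END:   next.env_end = addr;   break;
	}
	if (!layout_valid(&next, vma))
		return -1;
	*cur = next;
	return 0;
}
```

      One side effect of validating the whole layout on every single-field update is that call order matters: moving an area upward may require setting its end before its start.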
      
      The program below makes both ARGV and ENVP areas be reversed.  It makes
      /proc/$PID/cmdline show garbage (it doesn't oops by luck).
      
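      #include <stdio.h>
      #include <sys/mman.h>
      #include <sys/prctl.h>
      #include <unistd.h>
      
      enum {PAGE_SIZE=4096};
      
      /* Fallback values in case the libc headers predate PR_SET_MM. */
      #ifndef PR_SET_MM
      #define PR_SET_MM               35
      #define PR_SET_MM_ARG_START     8
      #define PR_SET_MM_ARG_END       9
      #define PR_SET_MM_ENV_START     10
      #define PR_SET_MM_ENV_END       11
      #endif
      
      int main(void)
      {
      	void *p;
      
      	p = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
      	if (p == MAP_FAILED) {
      		perror("mmap");
      		return 1;
      	}
      
      	/*
      	 * Point each *_START at the last byte of the page and each *_END at
      	 * its first byte, so both areas are reversed.  PR_SET_MM requires
      	 * CAP_SYS_RESOURCE.
      	 */
      	prctl(PR_SET_MM, PR_SET_MM_ARG_START, (unsigned long)p + PAGE_SIZE - 1, 0, 0);
      	prctl(PR_SET_MM, PR_SET_MM_ARG_END,   (unsigned long)p, 0, 0);
      	prctl(PR_SET_MM, PR_SET_MM_ENV_START, (unsigned long)p + PAGE_SIZE - 1, 0, 0);
      	prctl(PR_SET_MM, PR_SET_MM_ENV_END,   (unsigned long)p, 0, 0);
      
      	/* Stay alive so /proc/$PID/cmdline can be inspected. */
      	pause();
      	return 0;
      }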
      
      [akpm@linux-foundation.org: tidy code, tweak comment]
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jarod Wilson <jarod@redhat.com>
      Cc: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • avr32: use for_each_sg() · 20342f1d
      Akinobu Mita authored
      This replaces the plain loop over the sglist array with the for_each_sg()
      macro, which consists of sg_next() function calls.  Since avr32 doesn't
      select ARCH_HAS_SG_CHAIN, it is not necessary to use for_each_sg() in
      order to loop over each sg element.  But this can help find problems
      with drivers that do not properly initialize their sg tables when
      CONFIG_DEBUG_SG is enabled.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Acked-by: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • frv: use for_each_sg() · 0989e1f9
      Akinobu Mita authored
      This replaces the plain loop over the sglist array with the for_each_sg()
      macro, which consists of sg_next() function calls.  Since frv doesn't
      select ARCH_HAS_SG_CHAIN, it is not necessary to use for_each_sg() in
      order to loop over each sg element.  But this can help find problems
      with drivers that do not properly initialize their sg tables when
      CONFIG_DEBUG_SG is enabled.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • frv: remove unused inline function is_in_rom() · 3fe111fc
      Tobias Klauser authored
      The function is not used anywhere in the tree (anymore) and this is the
      last remaining instance, so remove it.
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zpool: remove zpool_evict() · 479305fd
      Dan Streetman authored
      Remove zpool_evict() helper function.  As zbud is currently the only
      zpool implementation that supports eviction, add zpool and zpool_ops
      references to struct zbud_pool and directly call zpool_ops->evict(zpool,
      handle) on eviction.
      
      Currently zpool provides the zpool_evict helper which locks the zpool
      list lock and searches through all pools to find the specific one
      matching the caller, and calls the corresponding zpool_ops->evict
      function.  However, this is unnecessary, as the zbud pool can simply
      keep a reference to the zpool that created it, as well as the zpool_ops,
      and directly call the zpool_ops->evict function, when it needs to evict
      a page.  This avoids a spinlock and list search in zpool for each
      eviction.
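
      A stripped-down model of the direct-callback arrangement is shown below. The structures are hypothetical, much smaller mirrors of zpool_ops and struct zbud_pool, and demo_evict() exists only to exercise the path: the pool stores its owner and ops when it is created, so eviction is a single indirect call with no lock and no list search.

```c
#include <stddef.h>

struct zpool;	/* opaque owner, as seen from the pool */

/* Hypothetical, reduced mirror of the ops table. */
struct zpool_ops {
	int (*evict)(struct zpool *pool, unsigned long handle);
};

/* Hypothetical, reduced mirror of the zbud pool. */
struct zbud_pool {
	struct zpool *zpool;			/* owner, saved at creation */
	const struct zpool_ops *zpool_ops;	/* owner's callbacks */
};

/* Eviction path: call straight back into the owner. */
static int zbud_evict_one(struct zbud_pool *pool, unsigned long handle)
{
	if (!pool->zpool_ops || !pool->zpool_ops->evict)
		return -1;	/* no eviction callback registered */
	return pool->zpool_ops->evict(pool->zpool, handle);
}

/* Demo callback used only to show the call reaching the owner. */
static unsigned long last_evicted;

static int demo_evict(struct zpool *pool, unsigned long handle)
{
	(void)pool;
	last_evicted = handle;
	return 0;
}

static const struct zpool_ops demo_ops = { .evict = demo_evict };
```
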
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zpool: change pr_info to pr_debug · cf41f5f4
      Dan Streetman authored
      Change the pr_info() calls to pr_debug().  There's no need for the extra
      verbosity in the log.  Also change the msg formats to be consistent.
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zswap: runtime enable/disable · c00ed16a
      Dan Streetman authored
      Change the "enabled" parameter to be configurable at runtime.  Remove the
      enabled check from init(), and move it to the frontswap store() function;
      when enabled, pages will be stored, and when disabled, pages won't be
      stored.
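      A minimal sketch of moving the check from init() into store(). It assumes a plain global flag zswap_enabled in place of the module parameter (a runtime-writable parameter would be toggled through /sys/module/zswap/parameters/enabled) and a hypothetical zswap_store_model() in place of the real frontswap store hook; returning -ENODEV for the disabled case is also an assumption:

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Stand-in for the "enabled" module parameter. Hypothetical: in this model
 * it is a plain global that can flip at any time, like a runtime-writable
 * parameter.
 */
static bool zswap_enabled;
static unsigned long zswap_stored_pages;

/*
 * Model of the frontswap store hook: the enabled check happens on every
 * store, not once at init(). The -ENODEV return value is an assumption.
 */
static int zswap_store_model(unsigned long offset)
{
	(void)offset;
	if (!zswap_enabled)
		return -ENODEV; /* rejected: the page goes to the swap device */
	zswap_stored_pages++;
	return 0;
}
```

      Because the flag is consulted per store, disabling zswap only stops new pages from being stored; pages already held by zswap are not touched by this check.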
      
      This is almost identical to Seth's patch from 2 years ago:
      http://lkml.iu.edu/hypermail/linux/kernel/1307.2/04289.html
      
      [akpm@linux-foundation.org: tweak documentation]
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Suggested-by: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Sergey Senozhatsky
      zram: check comp algorithm availability earlier · d93435c3
      Sergey Senozhatsky authored
      Improvement idea by Marcin Jabrzyk.
      
      comp_algorithm_store() silently accepts any supplied algorithm name,
      because zram performs the algorithm availability check later, during
      the device configuration phase in disksize_store(), and emits the
      following error:
      
        "zram: Cannot initialise %s compressing backend"
      
      This error message is somewhat generic and can also indicate a failed
      attempt to allocate the compression backend's working buffers.
      
      Add an algorithm availability check to comp_algorithm_store():
      
        echo lzz > /sys/block/zram0/comp_algorithm
        -bash: echo: write error: Invalid argument
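      A userspace sketch of the early check. It assumes a hypothetical compiled-in backend list (`lzo', `lz4') and a hypothetical comp_algorithm_store_model() that returns 0 on success instead of the byte count a real sysfs store returns. For brevity it compares with strcmp(), so it assumes the trailing newline from `echo' has already been removed (the kernel compares with sysfs_streq(), which tolerates it):

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical list of available backends. */
static const char *const backends[] = { "lzo", "lz4", NULL };

/* Currently selected algorithm (hypothetical fixed-size buffer). */
static char compressor[16] = "lzo";

static int algorithm_available(const char *name)
{
	for (size_t i = 0; backends[i] != NULL; i++)
		if (strcmp(backends[i], name) == 0)
			return 1;
	return 0;
}

/*
 * Model of comp_algorithm_store(): reject an unknown name at write time, so
 * the user sees EINVAL ("Invalid argument") instead of a later, generic
 * disksize_store() failure.
 */
static int comp_algorithm_store_model(const char *name)
{
	if (!algorithm_available(name))
		return -EINVAL;
	strncpy(compressor, name, sizeof(compressor) - 1);
	compressor[sizeof(compressor) - 1] = '\0';
	return 0;
}
```

      A rejected write leaves the previously selected algorithm in place, so a typo such as `lzz' no longer surfaces only at disksize configuration time.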
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: Marcin Jabrzyk <m.jabrzyk@samsung.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Sergey Senozhatsky
      zram: cut trailing newline in algorithm name · 4bbacd51
      Sergey Senozhatsky authored
      Supplied sysfs values sometimes contain newline symbols (echo vs.
      echo -n), which we also copy as the compression algorithm name.  This
      works fine when we look up the compression algorithm, because we use
      sysfs_streq(), which takes care of newline symbols.  However, it doesn't
      look nice when we print the compression algorithm name if zcomp_create()
      failed:
      
       zram: Cannot initialise LXZ
                  compressing backend
      
      Cut the trailing newline, so the error string will look like
      
        zram: Cannot initialise LXZ compressing backend
      
      We can now also replace sysfs_streq() in zcomp_available_show() with
      strcmp().
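      A minimal sketch of the newline cut, using a hypothetical helper strip_trailing_newline() that removes at most one trailing '\n' from a NUL-terminated copy of the user's input:

```c
#include <string.h>

/* Cut one trailing '\n' (as left by plain "echo") from a copied name. */
static void strip_trailing_newline(char *name)
{
	size_t sz = strlen(name);

	if (sz > 0 && name[sz - 1] == '\n')
		name[sz - 1] = '\0';
}
```

      Once the stored name is guaranteed newline-free, a plain strcmp() is enough wherever it is compared against the list of known algorithm names.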
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Sergey Senozhatsky
      zram: cosmetic zram_bvec_write() cleanup · 17162f41
      Sergey Senozhatsky authored
      The `bool locked' local variable tells us whether we should call
      zcomp_strm_release() (we may have jumped to the `out' label before
      zcomp_strm_find() was called), which is equivalent to `zstrm' being NULL
      or not.  Remove `locked' and check `zstrm' instead.
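      A userspace sketch of the simplified cleanup path, using hypothetical strm_find()/strm_release() stand-ins for zcomp_strm_find()/zcomp_strm_release(). Because `zstrm' stays NULL until the stream is acquired, testing it at `out' answers exactly the question `locked' used to:

```c
#include <stddef.h>

/* Hypothetical stand-ins for the zcomp stream and its find/release calls. */
struct zcomp_strm {
	int in_use;
};

static struct zcomp_strm the_strm;
static int releases;

static struct zcomp_strm *strm_find(void)
{
	the_strm.in_use = 1;
	return &the_strm;
}

static void strm_release(struct zcomp_strm *s)
{
	s->in_use = 0;
	releases++;
}

/* Model of the write path: no separate `locked' flag is needed. */
static int bvec_write_model(int fail_early)
{
	struct zcomp_strm *zstrm = NULL;
	int ret = 0;

	if (fail_early) {
		ret = -1;
		goto out; /* jumped before the stream was taken */
	}
	zstrm = strm_find();
	/* ... compress and store the page here ... */
out:
	if (zstrm)
		strm_release(zstrm);
	return ret;
}
```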
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Sergey Senozhatsky
      zram: add dynamic device add/remove functionality · 6566d1a3
      Sergey Senozhatsky authored
      We currently don't support on-demand device creation.  The one and only
      way to have N zram devices is to specify the num_devices module parameter
      (default value: 1).  IOW, if for some reason, at some point, a user wants
      to have N + 1 devices, he/she must umount all the existing devices, unload
      the module, and load it again passing num_devices equal to N + 1.  And do
      this again, if needed.
      
      This patch introduces the zram-control sysfs class, which has two sysfs
      attrs:
      - hot_add      -- add a new zram device
      - hot_remove   -- remove a specific (device_id) zram device
      
      The hot_add sysfs attr is read-only and supports only automatic device
      id assignment (as requested by Minchan Kim).  A read operation performed
      on this attr creates a new zram device and returns its device_id or an
      error status.
      
      Usage example:
      	# add a new zram device (device_id is assigned automatically)
      	cat /sys/class/zram-control/hot_add
      	2
      
      	# remove a specific zram device
      	echo 4 > /sys/class/zram-control/hot_remove
      
      The zram_add() error code is returned to the user (-ENOMEM in this case):
      
      	cat /sys/class/zram-control/hot_add
      	cat: /sys/class/zram-control/hot_add: Cannot allocate memory
      
      NOTE: there might be users who already depend on the fact that at least
      the zram0 device always gets created by zram_init().  Preserve this behavior.
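      A userspace sketch of the hot_add read path. It assumes a hypothetical zram_add_model() that hands out sequential ids starting at 1 (zram0 already exists from init) and fails with -ENOMEM after an arbitrary limit; the real allocator reuses freed ids. hot_add_show_model() is likewise hypothetical:

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical allocator: zram0 exists from init, so ids start at 1. */
static int next_id = 1;
static const int max_devices = 4; /* arbitrary stand-in for running out of memory */

/* Model of zram_add(): new device id (>= 0) or a negative errno. */
static int zram_add_model(void)
{
	if (next_id >= max_devices)
		return -ENOMEM;
	return next_id++;
}

/*
 * Model of reading hot_add: on success print "<id>\n" into @buf and return
 * its length; on failure return the negative errno, which cat reports
 * (e.g. "Cannot allocate memory" for -ENOMEM).
 */
static long hot_add_show_model(char *buf, size_t len)
{
	int ret = zram_add_model();

	if (ret < 0)
		return ret;
	return snprintf(buf, len, "%d\n", ret);
}
```

      Encoding success as a non-negative id and failure as a negative errno lets a single return value carry both outcomes back through the sysfs read.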
      
      [minchan@kernel.org: use zram->claim to avoid lockdep splat]
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Sergey Senozhatsky
      zram: close race by open overriding · f405c445
      Sergey Senozhatsky authored
      [ Original patch from Minchan Kim <minchan@kernel.org> ]
      
      Commit ba6b17d6 ("zram: fix umount-reset_store-mount race
      condition") introduced bdev->bd_mutex to protect a race between mount
      and reset.  At that time, we didn't have the dynamic zram-add/remove
      feature, so it was okay.
      
      However, as we introduce the dynamic device feature, bd_mutex becomes
      trouble.
      
      	CPU 0
      
      echo 1 > /sys/block/zram<id>/reset
        -> kernfs->s_active(A)
          -> zram:reset_store->bd_mutex(B)
      
      	CPU 1
      
      echo <id> > /sys/class/zram/zram-remove
        -> zram:zram_remove->bd_mutex(B)
        -> sysfs_remove_group
          -> kernfs->s_active(A)
      
      IOW, AB -> BA deadlock
      
      The reason we hold bd_mutex in zram_remove is to prevent any incoming
      open of /dev/zram[0-9].  Otherwise, we could remove a zram device that
      others have already opened.  But it causes the deadlock problem above.
      
      To fix the problem, this patch overrides block_device_operations.open so
      that it returns -EBUSY while zram has claimed the device for reset.  Any
      incoming open then fails, so we don't need to hold bd_mutex for
      zram_remove any more.
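      A simplified sketch of the claim/open interplay, assuming a hypothetical struct zram_model with a `claim' flag and an open counter. Locking is deliberately omitted: the sketch assumes the check-and-set in zram_claim_model() runs under a short-lived lock (for example bd_mutex, held only around the check), so no lock is held across the sysfs teardown:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the per-device state involved. */
struct zram_model {
	bool claim;     /* set while reset/remove owns the device */
	int open_count; /* stand-in for the block device's openers */
};

/* Model of the overridden open: refuse while the device is claimed. */
static int zram_open_model(struct zram_model *zram)
{
	if (zram->claim)
		return -EBUSY;
	zram->open_count++;
	return 0;
}

/* Reset/remove side: claim the device only if nobody holds it open. */
static int zram_claim_model(struct zram_model *zram)
{
	if (zram->open_count > 0 || zram->claim)
		return -EBUSY;
	zram->claim = true;
	return 0;
}
```

      Once claimed, new opens fail fast with -EBUSY, which is what allows removal to proceed without holding bd_mutex across the s_active-taking sysfs removal.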
      
      This patch prepares for the zram-add/remove feature.
      
      [sergey.senozhatsky@gmail.com: simplify reset_store()]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Sergey Senozhatsky
      zram: return zram device_id from zram_add() · 92ff1528
      Sergey Senozhatsky authored
      This patch prepares zram to enable on-demand device creation.
      zram_add() performs automatic device_id assignment and returns
      new device id (>= 0) or error code (< 0).
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Sergey Senozhatsky
      zram: trivial: correct flag operations comment · b31177f2
      Sergey Senozhatsky authored
      We don't have meta->tb_lock anymore and use the meta table entry
      bit_spin_lock instead.  Update the corresponding comment.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Sergey Senozhatsky
      zram: report every added and removed device · d12b63c9
      Sergey Senozhatsky authored
      With dynamic device creation/removal (which will be introduced later in
      the series), printing num_devices in zram_init() will not make a lot of
      sense, and neither will printing the number of destroyed devices in
      destroy_devices().  Print the per-device action (added/removed) in
      zram_add() and zram_remove() instead.
      
      Example:
      
      [ 3645.259652] zram: Added device: zram5
      [ 3646.152074] zram: Added device: zram6
      [ 3650.585012] zram: Removed device: zram5
      [ 3655.845584] zram: Added device: zram8
      [ 3660.975223] zram: Removed device: zram6
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Sergey Senozhatsky
      zram: remove max_num_devices limitation · c3cdb40e
      Sergey Senozhatsky authored
      Limiting the number of zram devices to 32 (the default max_num_devices
      value) is confusing; let's drop it.  A user with 2TB or 4TB of RAM, for
      example, can request as many devices as he/she can handle.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Sergey Senozhatsky
      zram: reorganize code layout · 522698d7
      Sergey Senozhatsky authored
      This patch looks big, but basically it just moves code blocks.
      No functional changes.
      
      Our current code layout looks like a sandwich.
      
      For example,
      a) between read/write handlers, we have update_used_max() helper function:
      
      static int zram_decompress_page
      static int zram_bvec_read
      static inline void update_used_max
      static int zram_bvec_write
      static int zram_bvec_rw
      
      b) RW request handlers __zram_make_request/zram_bio_discard are divided by
      sysfs attr reset_store() function and corresponding zram_reset_device()
      handler:
      
      static void zram_bio_discard
      static void zram_reset_device
      static ssize_t disksize_store
      static ssize_t reset_store
      static void __zram_make_request
      
      c) we first have a bunch of sysfs read/store functions, then a number of
      one-liners, then helper functions, RW functions, sysfs functions, helper
      functions again, and so on.
      
      Reorganize layout to be more logically grouped (a brief description,
      `cat zram_drv.c | grep static` gives a bigger picture):
      
      -- one-liners: zram_test_flag/etc.
      
      -- helpers: is_partial_io/update_position/etc
      
      -- sysfs attr show/store functions + ZRAM_ATTR_RO() generated stats
      show() functions
      exception: the reset and disksize store functions are required to come
      after the meta() functions, because we do device create/destroy actions
      in these sysfs handlers.
      
      -- "mm" functions: meta get/put, meta alloc/free, page free
      static inline bool zram_meta_get
      static inline void zram_meta_put
      static void zram_meta_free
      static struct zram_meta *zram_meta_alloc
      static void zram_free_page
      
      -- a block of I/O functions
      static int zram_decompress_page
      static int zram_bvec_read
      static int zram_bvec_write
      static void zram_bio_discard
      static int zram_bvec_rw
      static void __zram_make_request
      static void zram_make_request
      static void zram_slot_free_notify
      static int zram_rw_page
      
      -- device control: add/remove/init/reset functions (+zram-control class
      will sit here)
      static int zram_reset_device
      static ssize_t reset_store
      static ssize_t disksize_store
      static int zram_add
      static void zram_remove
      static int __init zram_init
      static void __exit zram_exit
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Sergey Senozhatsky
      zram: use idr instead of `zram_devices' array · 85508ec6
      Sergey Senozhatsky authored
      This patch makes some preparations for on-demand device add/remove
      functionality.
      
      Remove the `zram_devices' array and switch to id-to-pointer translation
      (idr).  idr doesn't bloat the zram struct with additional members (e.g.
      a list_head), yet still provides the ability to map a device_id to the
      device pointer.
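      As a rough illustration of id-to-pointer translation, here is a toy map with the same allocate-lowest-free-id / find / remove shape as the kernel's idr_alloc()/idr_find()/idr_remove(). It is a flat growable array, not the real idr data structure; NULL marks a free slot (so stored pointers must be non-NULL), and the id_map_* names are hypothetical:

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy id-to-pointer map: slot i holds the pointer for id i, NULL = free. */
struct id_map {
	void **slots;
	int nr; /* number of allocated slots */
};

/* Store @ptr under the lowest free id; return that id or -ENOMEM. */
static int id_map_alloc(struct id_map *m, void *ptr)
{
	for (int i = 0; i < m->nr; i++) {
		if (m->slots[i] == NULL) {
			m->slots[i] = ptr;
			return i;
		}
	}

	/* No free slot: grow the table (doubling) and use the first new id. */
	int new_nr = m->nr ? m->nr * 2 : 4;
	void **s = realloc(m->slots, (size_t)new_nr * sizeof(*s));
	if (s == NULL)
		return -ENOMEM;
	for (int i = m->nr; i < new_nr; i++)
		s[i] = NULL;

	int id = m->nr;
	s[id] = ptr;
	m->slots = s;
	m->nr = new_nr;
	return id;
}

/* Look up the pointer for @id, or NULL if the id is unused or out of range. */
static void *id_map_find(const struct id_map *m, int id)
{
	return (id >= 0 && id < m->nr) ? m->slots[id] : NULL;
}

/* Release @id so a later id_map_alloc() may reuse it. */
static void id_map_remove(struct id_map *m, int id)
{
	if (id >= 0 && id < m->nr)
		m->slots[id] = NULL;
}
```

      In zram terms, zram_add() would obtain its device_id from such an allocation and zram_remove() would release it, while a lookup by device_id (as hot_remove needs) becomes a single find; nothing has to be embedded in the zram struct itself.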
      
      No user-space visible changes.
      
      [Julia.Lawall@lip6.fr: return -ENOMEM when `queue' alloc fails]
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: Julia Lawall <Julia.Lawall@lip6.fr>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Sergey Senozhatsky
      zram: add `compact` sysfs entry to documentation · 3d8ed88b
      Sergey Senozhatsky authored
      We currently don't support zram on-demand device creation.  The only way
      to have N zram devices is to specify the num_devices module parameter
      (default value 1).  That means that if, for some reason, at some point, a
      user wants to have N + 1 devices, he/she must umount all the existing
      devices, unload the module, and load it again passing num_devices equal
      to N + 1.
      
      This patchset introduces zram-control sysfs class, which has two sysfs
      attrs:
      
       - hot_add     -- add a new zram device
       - hot_remove  -- remove a specific (device_id) zram device
      
          Usage example:
              # add a new zram device (device_id is assigned automatically)
              cat /sys/class/zram-control/hot_add
              1
      
              # remove a specific zram device
              echo 4 > /sys/class/zram-control/hot_remove
      
      This patch (of 10):
      
      Briefly describe missing `compact` sysfs entry.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Marcin Jabrzyk
      zsmalloc: remove obsolete ZSMALLOC_DEBUG · 13a18a1c
      Marcin Jabrzyk authored
      The ZSMALLOC_DEBUG define in zsmalloc is useless; it is not used anywhere.
      Signed-off-by: Marcin Jabrzyk <m.jabrzyk@samsung.com>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>