1. 01 Jun, 2012 40 commits
    • H.J. Lu's avatar
      x86, x32, ptrace: Remove PTRACE_ARCH_PRCTL for x32 · bad1a753
      H.J. Lu authored
      When I added x32 ptrace to the 3.4 kernel, I also included PTRACE_ARCH_PRCTL
      support for x32 GDB.  For ARCH_GET_FS/GS, it takes a pointer to int64.  But
      at user level, ARCH_GET_FS/GS takes a pointer to int32.  So I have to add
      x32 ptrace to glibc to handle it with a temporary int64 passed to kernel and
      copy it back to GDB as int32.  Roland suggested that PTRACE_ARCH_PRCTL
      is obsolete and x32 GDB should use fs_base and gs_base fields of
      user_regs_struct instead.
      
      Accordingly, remove PTRACE_ARCH_PRCTL completely from the x32 code to
      avoid possible memory overrun when pointer to int32 is passed to
      kernel.
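The size mismatch described in this message can be modelled in plain user-space C. This is only a sketch: `model_kernel_store64()` is a hypothetical stand-in for a kernel path that always stores 64 bits, and `read_base_as_int32()` mirrors the "temporary int64, then copy back as int32" workaround mentioned above. Neither name comes from the kernel or glibc.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for the kernel side: it always stores a full
 * 64-bit value through the pointer it is given.
 */
static void model_kernel_store64(void *dst, uint64_t base)
{
    memcpy(dst, &base, sizeof base);    /* writes 8 bytes */
}

/*
 * The workaround described in the message: hand the "kernel" a 64-bit
 * temporary, then narrow the result for a caller that expects int32.
 */
static uint32_t read_base_as_int32(uint64_t kernel_value)
{
    uint64_t tmp;

    model_kernel_store64(&tmp, kernel_value);
    return (uint32_t)tmp;
}
```

If the 8-byte store were pointed at a 4-byte object instead, the extra 4 bytes would land in whatever happens to sit next to it, which is the memory overrun the commit avoids.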
      
      Link: http://lkml.kernel.org/r/CAMe9rOpDzHfS7NH7m1vmD9QRw8SSj4Sc%2BaNOgcWm_WJME2eRsQ@mail.gmail.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      Cc: <stable@vger.kernel.org> v3.4
      bad1a753
    • Matt Fleming's avatar
      x86, efi: Add EFI boot stub documentation · 0c759662
      Matt Fleming authored
      Since we can't expect every user to read the EFI boot stub code it
      seems prudent to have a couple of paragraphs explaining what it is and
      how it works.
      
      The "initrd=" option in particular is tricky because it only
      understands absolute EFI-style paths (backslashes as directory
      separators), and until now this hasn't been documented anywhere. This
      has tripped up a couple of users.
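As a rough illustration of the path format described above, the following sketch rewrites an absolute POSIX-style path into backslash-separated form for use with "initrd=". The helper name `to_efi_path()` is hypothetical and is not part of the boot stub.

```c
#include <stddef.h>
#include <string.h>

/*
 * Convert an absolute POSIX-style path ("/boot/initrd.img") into the
 * backslash-separated form ("\boot\initrd.img").
 * Returns 0 on success, -1 if the path is not absolute or dst is too small.
 */
static int to_efi_path(const char *src, char *dst, size_t dst_len)
{
    size_t len = strlen(src);

    if (src[0] != '/' || len + 1 > dst_len)
        return -1;

    for (size_t i = 0; i <= len; i++)   /* <= also copies the trailing NUL */
        dst[i] = (src[i] == '/') ? '\\' : src[i];

    return 0;
}
```

A relative path such as "initrd.img" is rejected by this helper, matching the statement that only absolute paths are understood.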
      
      Cc: Matthew Garrett <mjg@redhat.com>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
      Link: http://lkml.kernel.org/r/1331907517-3985-4-git-send-email-matt@console-pimps.org
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      0c759662
    • Matt Fleming's avatar
      x86, efi; Add EFI boot stub console support · 9fa7deda
      Matt Fleming authored
      We need a way of printing useful messages to the user, for example
      when we fail to open an initrd file, instead of just hanging the
      machine without giving the user any indication of what went wrong. So
      sprinkle some error messages throughout the EFI boot stub code to make
      it easier for users to diagnose/report problems.
      Reported-by: Keshav P R <the.ridikulus.rat@gmail.com>
      Cc: Matthew Garrett <mjg@redhat.com>
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
      Link: http://lkml.kernel.org/r/1331907517-3985-3-git-send-email-matt@console-pimps.org
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      9fa7deda
    • Matt Fleming's avatar
      x86, efi: Only close open files in error path · 30dc0d0f
      Matt Fleming authored
      The loop at the 'close_handles' label in handle_ramdisks() should be
      using 'i', which represents the number of initrd files that were
      successfully opened, not 'nr_initrds' which is the number of initrd=
      arguments passed on the command line.
      
      Currently, if we execute the loop to close all file handles and we
      failed to open any initrds we'll try to call the close function on a
      garbage pointer, causing the machine to hang.
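The error-path pattern described above can be sketched in user space. Everything here (`struct fake_file`, `fake_open()`, `fake_close()`, `open_all()`) is a hypothetical model rather than the boot stub's code. The point is only that the cleanup loop is bounded by `i`, the number of files actually opened, and not by the number requested.

```c
/* Hypothetical stand-in for an opened file handle. */
struct fake_file { int open; };

static int closed_count;    /* counts fake_close() calls, for inspection */

static int fake_open(struct fake_file *f, int should_fail)
{
    if (should_fail)
        return -1;
    f->open = 1;
    return 0;
}

static void fake_close(struct fake_file *f)
{
    f->open = 0;
    closed_count++;
}

/*
 * Try to open nr files; fail_at is the index whose open fails (-1 for none).
 * Returns 0 on success, -1 after closing whatever had been opened.
 */
static int open_all(struct fake_file *files, int nr, int fail_at)
{
    int i, k;

    for (i = 0; i < nr; i++) {
        if (fake_open(&files[i], i == fail_at) < 0)
            goto close_handles;
    }
    return 0;

close_handles:
    /* Only files[0..i-1] were opened; iterating up to nr would touch
     * handles that were never initialised. */
    for (k = 0; k < i; k++)
        fake_close(&files[k]);
    return -1;
}
```

When the very first open fails, `i` is 0 and the cleanup loop does nothing, which is the case the message says used to hang the machine.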
      
      Cc: Matthew Garrett <mjg@redhat.com>
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
      Link: http://lkml.kernel.org/r/1331907517-3985-2-git-send-email-matt@console-pimps.org
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      30dc0d0f
    • Steven Rostedt's avatar
      ftrace/x86: Do not change stacks in DEBUG when calling lockdep · 5963e317
      Steven Rostedt authored
      When both DYNAMIC_FTRACE and LOCKDEP are set, the TRACE_IRQS_ON/OFF
      will call into the lockdep code. The lockdep code can call lots of
      functions that may be traced by ftrace. When ftrace is updating its
      code and hits a breakpoint, the breakpoint handler will call into
      lockdep. If lockdep happens to call a function that also has a breakpoint
      attached, it will jump back into the breakpoint handler resetting
      the stack to the debug stack and corrupt the contents currently on
      that stack.
      
      The 'do_sym' call that calls do_int3() is protected by modifying the
      IST table to point to a different location if another breakpoint is
      hit. But the TRACE_IRQS_OFF/ON are outside that protection, and if
      a breakpoint is hit from those, the stack will get corrupted, and
      the kernel will crash:
      
      [ 1013.243754] BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
      [ 1013.272665] IP: [<ffff880145cc0000>] 0xffff880145cbffff
      [ 1013.285186] PGD 1401b2067 PUD 14324c067 PMD 0
      [ 1013.298832] Oops: 0010 [#1] PREEMPT SMP
      [ 1013.310600] CPU 2
      [ 1013.317904] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
      [ 1013.401848]
      [ 1013.407399] Pid: 112, comm: kworker/2:1 Not tainted 3.4.0+ #30
      [ 1013.437943] RIP: 8eb8:[<ffff88014630a000>]  [<ffff88014630a000>] 0xffff880146309fff
      [ 1013.459871] RSP: ffffffff8165e919:ffff88014780f408  EFLAGS: 00010046
      [ 1013.477909] RAX: 0000000000000001 RBX: ffffffff81104020 RCX: 0000000000000000
      [ 1013.499458] RDX: ffff880148008ea8 RSI: ffffffff8131ef40 RDI: ffffffff82203b20
      [ 1013.521612] RBP: ffffffff81005751 R08: 0000000000000000 R09: 0000000000000000
      [ 1013.543121] R10: ffffffff82cdc318 R11: 0000000000000000 R12: ffff880145cc0000
      [ 1013.564614] R13: ffff880148008eb8 R14: 0000000000000002 R15: ffff88014780cb40
      [ 1013.586108] FS:  0000000000000000(0000) GS:ffff880148000000(0000) knlGS:0000000000000000
      [ 1013.609458] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 1013.627420] CR2: 0000000000000002 CR3: 0000000141f10000 CR4: 00000000001407e0
      [ 1013.649051] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1013.670724] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [ 1013.692376] Process kworker/2:1 (pid: 112, threadinfo ffff88013fe0e000, task ffff88014020a6a0)
      [ 1013.717028] Stack:
      [ 1013.724131]  ffff88014780f570 ffff880145cc0000 0000400000004000 0000000000000000
      [ 1013.745918]  cccccccccccccccc ffff88014780cca8 ffffffff811072bb ffffffff81651627
      [ 1013.767870]  ffffffff8118f8a7 ffffffff811072bb ffffffff81f2b6c5 ffffffff81f11bdb
      [ 1013.790021] Call Trace:
      [ 1013.800701] Code: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a <e7> d7 64 81 ff ff ff ff 01 00 00 00 00 00 00 00 65 d9 64 81 ff
      [ 1013.861443] RIP  [<ffff88014630a000>] 0xffff880146309fff
      [ 1013.884466]  RSP <ffff88014780f408>
      [ 1013.901507] CR2: 0000000000000002
      
      The solution was to reuse the NMI functions that change the IDT table to make the debug
      stack keep its current stack (in kernel mode) when hitting a breakpoint:
      
        call debug_stack_set_zero
        TRACE_IRQS_ON
        call debug_stack_reset
      
      If the TRACE_IRQS_ON happens to hit a breakpoint then it will keep the current stack
      and not crash the box.
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      5963e317
    • Steven Rostedt's avatar
      x86: Allow nesting of the debug stack IDT setting · f8988175
      Steven Rostedt authored
      When the NMI handler runs, it checks if it preempted a debug handler
      and if that handler is using the debug stack. If it is, it changes the
      IDT table not to update the stack, otherwise it will reset the debug
      stack and corrupt the debug handler it preempted.
      
      Now that ftrace uses breakpoints to change functions from nops to
      callers, many more places may hit a breakpoint. Unfortunately this
      includes some of the calls that lockdep performs, which causes issues
      with the debug stack. It too needs to change the debug stack before
      tracing (if called from the debug handler).
      
      Allow the debug_stack_set_zero() and debug_stack_reset() to be nested
      so that the debug handlers can take advantage of them too.
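One way to picture the nesting is a use counter: each "set zero" increments it, and only the "reset" that brings it back to zero restores the normal table. The sketch below is a single-threaded user-space model with hypothetical names (`use_ctr`, `alt_idt_loaded`, `model_set_zero()`, `model_reset()`). It is not the kernel implementation, which works on per-CPU state.

```c
#include <stdbool.h>

static unsigned int use_ctr;    /* models a per-CPU nesting counter */
static bool alt_idt_loaded;     /* true while the "keep current stack" table is active */

/* Enter a region where breakpoints must not switch stacks. */
static void model_set_zero(void)
{
    use_ctr++;
    alt_idt_loaded = true;
}

/* Leave the region; only the outermost call restores the normal table. */
static void model_reset(void)
{
    if (use_ctr == 0)
        return;                 /* unbalanced call: nothing to undo */
    if (--use_ctr == 0)
        alt_idt_loaded = false;
}
```

With this shape, an NMI that nests inside a debug handler (or the reverse) cannot restore the normal table while the outer user still depends on the alternate one.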
      
      [ Used this_cpu_*() over __get_cpu_var() as suggested by H. Peter Anvin ]
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      f8988175
    • Steven Rostedt's avatar
      x86: Reset the debug_stack update counter · c0525a69
      Steven Rostedt authored
      When an NMI goes off and it sees that it preempted the debug stack,
      to keep the debug stack safe, it changes the IDT to point to one that
      does not modify the stack on breakpoint (to allow breakpoints in NMIs).
      
      But the variable that gets set to know to undo it on exit never gets
      cleared on exit. Thus every NMI will reset it on exit the first time
      it is done even if it does not need to be reset.
      
      [ Added H. Peter Anvin's suggestion to use this_cpu_read/write ]
      
      Cc: <stable@vger.kernel.org> # v3.3
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      c0525a69
    • Steven Rostedt's avatar
      ftrace: Use breakpoint method to update ftrace caller · 8a4d0a68
      Steven Rostedt authored
      On boot up and module load, it is fine to modify the code directly,
      without the use of breakpoints. This is because boot up modification
      is done before SMP is initialized, thus the modification is serial,
      and module load is done before the module executes.
      
      But after that we must use a SMP safe method to modify running code.
      Otherwise, if we are running the function tracer and update its
      function (by starting off the stack tracer, or perf tracing)
      the change of the function called by the ftrace trampoline is done
      directly. If this is being executed on another CPU, that CPU may
      take a GPF and crash the kernel.
      
      The breakpoint method is used to change the nops at all the functions, but
      the change of the ftrace callback handler itself was still using a
      direct modification. If tracing was enabled and the function callback
      was changed then another CPU could fault if it was currently calling
      the original callback. This modification must use the breakpoint method
      too.
      
      Note, the direct method is still used for boot up and module load.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      8a4d0a68
    • Steven Rostedt's avatar
      ftrace: Synchronize variable setting with breakpoints · a192cd04
      Steven Rostedt authored
      When the function tracer starts modifying the code via breakpoints
      it sets a variable (modifying_ftrace_code) to inform the breakpoint
      handler to call the ftrace int3 code.
      
      But there's no synchronization between setting this code and the
      handler, thus it is possible for the handler to be called on another
      CPU before it sees the variable. This will cause a kernel crash as
      the int3 handler will not know what to do with it.
      
      I originally added smp_mb()'s to force the visibility of the variable
      but H. Peter Anvin suggested that I just make it atomic.
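The idea of making the flag atomic can be sketched with C11 atomics. The kernel has its own atomic primitives; `<stdatomic.h>` is swapped in here purely for illustration, and the names `modifying_code`, `begin_code_update()`, `end_code_update()` and `breakpoint_is_ours()` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Models the "code is being modified" flag as an atomic counter. */
static atomic_int modifying_code;

/* Called before breakpoints are planted. */
static void begin_code_update(void)
{
    atomic_fetch_add(&modifying_code, 1);
}

/* Called after the last breakpoint has been removed. */
static void end_code_update(void)
{
    atomic_fetch_sub(&modifying_code, 1);
}

/* What a breakpoint handler would ask: does this trap belong to the patcher? */
static bool breakpoint_is_ours(void)
{
    return atomic_load(&modifying_code) != 0;
}
```

The default C11 atomic operations are sequentially consistent, so a handler on another CPU that traps on a planted breakpoint observes the increment that preceded it.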
      
      [ Added comments as suggested by Peter Zijlstra ]
      Suggested-by: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      a192cd04
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal · fb21affa
      Linus Torvalds authored
      Pull second pile of signal handling patches from Al Viro:
       "This one is just task_work_add() series + remaining prereqs for it.
      
        There probably will be another pull request from that tree this
        cycle - at least for helpers, to get them out of the way for per-arch
        fixes remaining in the tree."
      
      Fix trivial conflict in kernel/irq/manage.c: the merge of Andrew's pile
      had brought in commit 97fd75b7 ("kernel/irq/manage.c: use the
      pr_foo() infrastructure to prefix printks") which changed one of the
      pr_err() calls that this merge moves around.
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
        keys: kill task_struct->replacement_session_keyring
        keys: kill the dummy key_replace_session_keyring()
        keys: change keyctl_session_to_parent() to use task_work_add()
        genirq: reimplement exit_irq_thread() hook via task_work_add()
        task_work_add: generic process-context callbacks
        avr32: missed _TIF_NOTIFY_RESUME on one of do_notify_resume callers
        parisc: need to check NOTIFY_RESUME when exiting from syscall
        move key_repace_session_keyring() into tracehook_notify_resume()
        TIF_NOTIFY_RESUME is defined on all targets now
      fb21affa
    • Linus Torvalds's avatar
      Merge branch 'for-3.5-take-2' of git://linux-nfs.org/~bfields/linux · a00b6151
      Linus Torvalds authored
      Pull nfsd update from Bruce Fields.
      
      * 'for-3.5-take-2' of git://linux-nfs.org/~bfields/linux: (23 commits)
        nfsd: trivial: use SEEK_SET instead of 0 in vfs_llseek
        SUNRPC: split upcall function to extract reusable parts
        nfsd: allocate id-to-name and name-to-id caches in per-net operations.
        nfsd: make name-to-id cache allocated per network namespace context
        nfsd: make id-to-name cache allocated per network namespace context
        nfsd: pass network context to idmap init/exit functions
        nfsd: allocate export and expkey caches in per-net operations.
        nfsd: make expkey cache allocated per network namespace context
        nfsd: make export cache allocated per network namespace context
        nfsd: pass pointer to export cache down to stack wherever possible.
        nfsd: pass network context to export caches init/shutdown routines
        Lockd: pass network namespace to creation and destruction routines
        NFSd: remove hard-coded dereferences to name-to-id and id-to-name caches
        nfsd: pass pointer to expkey cache down to stack wherever possible.
        nfsd: use hash table from cache detail in nfsd export seq ops
        nfsd: pass svc_export_cache pointer as private data to "exports" seq file ops
        nfsd: use exp_put() for svc_export_cache put
        nfsd: use cache detail pointer from svc_export structure on cache put
        nfsd: add link to owner cache detail to svc_export structure
        nfsd: use passed cache_detail pointer expkey_parse()
        ...
      a00b6151
    • Linus Torvalds's avatar
      Merge branch 'akpm' (Andrew's patch-bomb) · 08615d7d
      Linus Torvalds authored
      Merge misc patches from Andrew Morton:
      
       - the "misc" tree - stuff from all over the map
      
       - checkpatch updates
      
       - fatfs
      
       - kmod changes
      
       - procfs
      
       - cpumask
      
       - UML
      
       - kexec
      
       - mqueue
      
       - rapidio
      
       - pidns
      
       - some checkpoint-restore feature work.  Reluctantly.  Most of it
         delayed a release.  I'm still rather worried that we don't have a
         clear roadmap to completion for this work.
      
      * emailed from Andrew Morton <akpm@linux-foundation.org>: (78 patches)
        kconfig: update compression algorithm info
        c/r: prctl: add ability to set new mm_struct::exe_file
        c/r: prctl: extend PR_SET_MM to set up more mm_struct entries
        c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat
        syscalls, x86: add __NR_kcmp syscall
        fs, proc: introduce /proc/<pid>/task/<tid>/children entry
        sysctl: make kernel.ns_last_pid control dependent on CHECKPOINT_RESTORE
        aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector()
        eventfd: change int to __u64 in eventfd_signal()
        fs/nls: add Apple NLS
        pidns: make killed children autoreap
        pidns: use task_active_pid_ns in do_notify_parent
        rapidio/tsi721: add DMA engine support
        rapidio: add DMA engine support for RIO data transfers
        ipc/mqueue: add rbtree node caching support
        tools/selftests: add mq_perf_tests
        ipc/mqueue: strengthen checks on mqueue creation
        ipc/mqueue: correct mq_attr_ok test
        ipc/mqueue: improve performance of send/recv
        selftests: add mq_open_tests
        ...
      08615d7d
    • Linus Torvalds's avatar
      Merge branch 'drm-prime-vmap' of git://people.freedesktop.org/~airlied/linux · 9fdadb2c
      Linus Torvalds authored
      Pull drm prime mmap/vmap code from Dave Airlie:
       "As mentioned previously these are the extra bits of drm that relied on
        the dma-buf pull to work, the first three just stub out the mmap
        interface, and the next set provide vmap export to i915/radeon/nouveau
        and vmap import to udl."
      
      * 'drm-prime-vmap' of git://people.freedesktop.org/~airlied/linux:
        radeon: add radeon prime vmap support.
        nouveau: add vmap support to nouveau prime support
        udl: support vmapping imported dma-bufs
        i915: add dma-buf vmap support for exporting vmapped buffer
        radeon: add stub dma-buf mmap functionality
        nouveau: add stub dma-buf mmap functionality.
        i915: add stub dma-buf mmap callback.
      9fdadb2c
    • Randy Dunlap's avatar
      kconfig: update compression algorithm info · 0a4dd35c
      Randy Dunlap authored
      There have been new compression algorithms added without updating nearby
      relevant descriptive text that refers to (a) the number of compression
      algorithms and (b) the most recent one.  Fix these inconsistencies.
      Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
      Reported-by: <qasdfgtyuiop@gmail.com>
      Cc: Lasse Collin <lasse.collin@tukaani.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Alain Knaff <alain@knaff.lu>
      Cc: Albin Tonnerre <albin.tonnerre@free-electrons.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a4dd35c
    • Cyrill Gorcunov's avatar
      c/r: prctl: add ability to set new mm_struct::exe_file · b32dfe37
      Cyrill Gorcunov authored
      When we do restore we would like to have a way to setup a former
      mm_struct::exe_file so that /proc/pid/exe would point to the original
      executable file a process had at checkpoint time.
      
      For this the PR_SET_MM_EXE_FILE code is introduced.  This option takes a
      file descriptor which will be set as a source for new /proc/$pid/exe
      symlink.
      
      Note that it allows changing /proc/$pid/exe if there are no VM_EXECUTABLE
      vmas present for the current process, simply because this feature is
      specific to C/R and mm::num_exe_file_vmas becomes meaningless after that.
      
      To minimize the number of transitions the /proc/pid/exe symlink might have,
      this feature is implemented in a one-shot manner.  Thus once changed the
      symlink can't be changed again.  This should help sysadmins to monitor the
      symlinks over all processes running in a system.
      
      In particular one could make a snapshot of processes and raise an alarm if
      there are unexpected changes of /proc/pid/exe's in a system.
      
      Note -- this feature is available only if CONFIG_CHECKPOINT_RESTORE is set
      and the caller has the CAP_SYS_RESOURCE capability granted, otherwise the
      request to change the symlink will be rejected.
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b32dfe37
    • Cyrill Gorcunov's avatar
      c/r: prctl: extend PR_SET_MM to set up more mm_struct entries · fe8c7f5c
      Cyrill Gorcunov authored
      During checkpoint we dump the whole process memory to a file, and the dump
      includes the process stack memory.  Besides the stack data itself, the stack
      carries additional parameters such as command line arguments, environment
      data and the auxiliary vector.
      
      So during the restore procedure, once we've restored the stack data itself,
      we need to set up mm_struct::arg_start/end and env_start/end so that the
      restored process is able to find the command line arguments and environment
      data it had at checkpoint time.  The same applies to the auxiliary vector.
      
      For this reason additional PR_SET_MM_(ARG_START | ARG_END | ENV_START |
      ENV_END | AUXV) codes are introduced.
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fe8c7f5c
    • Cyrill Gorcunov's avatar
      c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat · 5b172087
      Cyrill Gorcunov authored
      We would like to have an ability to restore command line arguments and
      program environment pointers but first we need to obtain them somehow.
      Thus we put these values into /proc/$pid/stat.  The exit_code is needed to
      restore zombie tasks.
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b172087
    • Cyrill Gorcunov's avatar
      syscalls, x86: add __NR_kcmp syscall · d97b46a6
      Cyrill Gorcunov authored
      While doing checkpoint-restore in user space one needs to determine
      whether various kernel objects (like mm_struct-s or file_struct-s) are
      shared between tasks, and then restore this state.
      
      The 2nd step can be solved by using appropriate CLONE_ flags and the
      unshare syscall, while there is currently no way of solving the 1st one.
      
      One way of checking whether two tasks share e.g. an mm_struct is to
      expose some mm_struct ID of a task in its proc file, but showing such
      info is considered undesirable for security reasons.
      
      Thus, after some debate, we concluded that a dedicated 'comparison'
      syscall might be the best candidate.  So here it is --
      __NR_kcmp.
      
      It takes up to 5 arguments: the pids of the two tasks (whose
      characteristics should be compared), the comparison type and (when
      comparing files) two file descriptors.
      
      Lookups for pids are done in the caller's PID namespace only.
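A minimal sketch of invoking the syscall from user space, assuming a libc whose headers define `SYS_kcmp` and a `<linux/kcmp.h>` that provides `KCMP_FILE`. The wrapper name `same_open_file()` is hypothetical. It compares two descriptors of the calling process, passing its own pid for both task arguments.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/kcmp.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Compare two file descriptors of the calling process.
 * Returns 0 if both refer to the same open file, a positive value if
 * they differ, and -1 with errno set if the call fails (for example
 * ENOSYS when the kernel was built without kcmp support).
 */
static int same_open_file(int fd1, int fd2)
{
    pid_t self = getpid();

    return (int)syscall(SYS_kcmp, self, self, KCMP_FILE,
                        (unsigned long)fd1, (unsigned long)fd2);
}
```

A descriptor and its `dup()` copy should compare equal, while the two ends of a pipe should not. Sandboxed environments may reject the call outright, so callers should be prepared for -1.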
      
      At the moment only x86 is supported and tested.
      
      [akpm@linux-foundation.org: fix up selftests, warnings]
      [akpm@linux-foundation.org: include errno.h]
      [akpm@linux-foundation.org: tweak comment text]
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Valdis.Kletnieks@vt.edu
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d97b46a6
    • Cyrill Gorcunov's avatar
      fs, proc: introduce /proc/<pid>/task/<tid>/children entry · 81841161
      Cyrill Gorcunov authored
      When we checkpoint a task we need to know the list of children the
      task has, but there is no easy and fast way to generate the reverse
      parent->children chain from an arbitrary <pid> (while the parent pid is
      provided in the "PPid" field of /proc/<pid>/status).
      
      So instead of walking over all pids in the system (creating one big
      process tree in memory, just to figure out which children a task has) --
      we add explicit /proc/<pid>/task/<tid>/children entry, because the kernel
      already has this kind of information but it is not yet exported.
      
      This lists first-level children only, not the whole process tree.
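A consumer of such an entry mostly needs to turn its text into a list of pids. The sketch below assumes the file contains decimal pids separated by whitespace, which is an assumption not stated in this message. `parse_children()` is a hypothetical helper that works on an in-memory buffer, so it does not depend on the entry being present on the running kernel.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Parse whitespace-separated decimal pids from text into out[].
 * Stores at most max entries and returns how many were stored.
 */
static size_t parse_children(const char *text, pid_t *out, size_t max)
{
    size_t n = 0;
    char *end;

    while (n < max) {
        long v = strtol(text, &end, 10);    /* skips leading whitespace */

        if (end == text)                    /* no further number found */
            break;
        out[n++] = (pid_t)v;
        text = end;
    }
    return n;
}
```

A checkpoint tool would read the entry for each thread of a task into a buffer and pass it to a helper like this, then recurse into the returned pids to build the tree.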
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81841161
    • Cyrill Gorcunov's avatar
      sysctl: make kernel.ns_last_pid control dependent on CHECKPOINT_RESTORE · 98ed57ee
      Cyrill Gorcunov authored
      For those who don't need C/R functionality there is no need to control
      the last pid, i.e. the pid for the next fork() call.
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      98ed57ee
    • Christopher Yeoh's avatar
      aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector() · ac34ebb3
      Christopher Yeoh authored
      A cleanup of rw_copy_check_uvector and compat_rw_copy_check_uvector after
      changes made to support CMA in an earlier patch.
      
      Rather than having an additional check_access parameter to these
      functions, the first parameter type is overloaded to allow the caller to
      specify CHECK_IOVEC_ONLY which means check that the contents of the iovec
      are valid, but do not check the memory that they point to.  This is used
      by process_vm_readv/writev where we need to validate that an iovec passed
      to the syscall is valid but do not want to check the memory that it points
      to at this point because it refers to an address space in another process.
      Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac34ebb3
    • Sha Zhengju's avatar
      eventfd: change int to __u64 in eventfd_signal() · ee62c6b2
      Sha Zhengju authored
      eventfd_ctx->count is an __u64 counter which is allowed to reach
      ULLONG_MAX.  eventfd_write() adds a __u64 value to "count", but the kernel
      side eventfd_signal() only adds an int value to it.  Make them consistent.
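The 64-bit width of the counter is visible from user space through the ordinary eventfd(2) interface. The sketch below does not touch the kernel-side eventfd_signal() that this commit changes. It only shows, with a hypothetical helper `eventfd_roundtrip()`, that a value larger than any int survives a write and read.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Add value to a fresh eventfd counter and read the counter back.
 * Returns 0 on success with the counter in *result, -1 on failure.
 */
static int eventfd_roundtrip(uint64_t value, uint64_t *result)
{
    int fd = eventfd(0, 0);
    int ok = -1;

    if (fd < 0)
        return -1;

    /* Both write() and read() on an eventfd transfer exactly 8 bytes. */
    if (write(fd, &value, sizeof value) == (ssize_t)sizeof value &&
        read(fd, result, sizeof *result) == (ssize_t)sizeof *result)
        ok = 0;

    close(fd);
    return ok;
}
```

Passing `0x100000000ULL`, which does not fit in a 32-bit int, returns the same value, which is the range an int-typed kernel-side helper could not add in one call.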
      
      [akpm@linux-foundation.org: update interface documentation]
      Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ee62c6b2
    • Vladimir Serbinenko's avatar
      fs/nls: add Apple NLS · 71ca97da
      Vladimir Serbinenko authored
      HFS has support for NLS.  However the relevant NLS tables are missing.
      Here they are automatically transformed from the tables at unicode.org.
      Codepages requiring special handling like CJK, RTL or Brahmic ones are not
      included in this patch.
      
      [akpm@linux-foundation.org: add unicode.org copyright and permission notices]
      Signed-off-by: Vladimir Serbinenko <phcoder@gmail.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Clemens Ladisch <clemens@ladisch.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71ca97da
    • Eric W. Biederman's avatar
      pidns: make killed children autoreap · 00c10bc1
      Eric W. Biederman authored
      Force SIGCHLD handling to SIG_IGN so that signals are not generated and so
      that the children autoreap.  This increases the parallelism and in general
      the speed of network namespace shutdown.
      
      Note that self-reaping children can exist past zap_pid_ns_processes but they
      will all be reaped before we allow the pid namespace init task with pid ==
      1 to be reaped.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Louis Rilling <louis.rilling@kerlabs.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      00c10bc1
    • Eric W. Biederman's avatar
      pidns: use task_active_pid_ns in do_notify_parent · 32084504
      Eric W. Biederman authored
      Using task_active_pid_ns is more robust because it works even after we
      have called exit_namespaces.  This change allows us to have parent
      processes that are zombies.  Normally a zombie parent process is crazy
      and the last thing you would want to have, but in the case of not letting
      the init process of a pid namespace be reaped until all of its children
      are dead and reaped, a zombie parent process is exactly what we want.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Louis Rilling <louis.rilling@kerlabs.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      32084504
    • Alexandre Bounine's avatar
      rapidio/tsi721: add DMA engine support · 9eaa3d9b
      Alexandre Bounine authored
      Adds support for DMA Engine API into Tsi721 mport driver.
      
      Includes following changes for Tsi721 driver:
      - Modifies BDMA register offset definitions to support per-channel handling
      - Separates BDMA channel reserved for RIO Maintenance requests
      - Adds DMA Engine callback routines
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vinod Koul <vinod.koul@intel.com>
      Cc: Li Yang <leoli@freescale.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9eaa3d9b
    • Alexandre Bounine's avatar
      rapidio: add DMA engine support for RIO data transfers · e42d98eb
      Alexandre Bounine authored
      Adds DMA Engine framework support into RapidIO subsystem.
      
      Uses DMA Engine DMA_SLAVE interface to generate data transfers to/from
      remote RapidIO target devices.
      
      Introduces RapidIO-specific wrapper for prep_slave_sg() interface with an
      extra parameter to pass target specific information.
      
      Uses a scatterlist to describe the local data buffer.  The data buffer on
      the remote side is addressed as a flat buffer.
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Vinod Koul <vinod.koul@linux.intel.com>
      Cc: Li Yang <leoli@freescale.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e42d98eb
    • Doug Ledford's avatar
      ipc/mqueue: add rbtree node caching support · ce2d52cc
      Doug Ledford authored
      When I wrote the first patch that added the rbtree support for message
      queue insertion, it sped up the case where the queue was very full
      drastically from the original code.  It, however, slowed down the case
      where the queue was empty (not drastically though).
      
      This patch caches the last freed rbtree node struct so we can quickly
      reuse it when we get a new message.  This is the common path for any queue
      that very frequently goes from 0 to 1 then back to 0 messages in queue.
      
      Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
      msg_insert, so this patch attempts to speculatively allocate a new node
      struct outside of the spin lock when we know we need it, but will still
      fall back to a GFP_ATOMIC allocation if it has to.
      
      Once I added the caching, the necessary various ret = ; spin_unlock
      gyrations in mq_timedsend were getting pretty ugly, so this also slightly
      refactors that function to streamline the flow of the code and the
      function exit.
      
      Finally, while working on getting performance back I made sure that all of
      the node structs were always fully initialized when they were first used,
      rendering the use of kzalloc unnecessary and a waste of CPU cycles.
      
      The net result of all of this is:
      
      1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
         on it when necessary.
      
      2) We will speculatively allocate a node struct using GFP_KERNEL if our
         cache is empty (and save the struct to our cache if it's still empty
         after we have obtained the spin lock).
      
      3) The performance of the common queue empty case has significantly
         improved and is now much more in line with the older performance for
         this case.
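
      The allocation strategy described above can be sketched in userspace C.
      This is an illustration under stated assumptions, not the kernel code:
      struct queue, struct tree_node, get_node() and put_node() are
      hypothetical names, an atomic_flag stands in for the spin lock, and
      malloc() stands in for both the GFP_KERNEL and GFP_ATOMIC allocations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-priority tree node. */
struct tree_node {
	int priority;
};

/* Hypothetical stand-in for the queue; the real structure differs. */
struct queue {
	atomic_flag lock;			/* plays the spin lock */
	struct tree_node *_Atomic node_cache;	/* last freed node or NULL */
};

void queue_init(struct queue *q)
{
	atomic_flag_clear(&q->lock);
	q->node_cache = NULL;
}

static void queue_lock(struct queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock,
						 memory_order_acquire))
		;	/* spin */
}

static void queue_unlock(struct queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Get a node for a new priority level, preferring the cached one. */
struct tree_node *get_node(struct queue *q, int priority)
{
	struct tree_node *spare = NULL;
	struct tree_node *node;

	/* Speculative allocation before taking the lock (GFP_KERNEL role). */
	if (!q->node_cache)
		spare = malloc(sizeof(*spare));

	queue_lock(q);
	if (!q->node_cache && spare) {
		/* Cache is still empty: park the spare node there. */
		q->node_cache = spare;
		spare = NULL;
	}
	node = q->node_cache;
	q->node_cache = NULL;
	if (!node)
		node = malloc(sizeof(*node));	/* fallback (GFP_ATOMIC role) */
	queue_unlock(q);

	free(spare);	/* cache was refilled meanwhile; spare not needed */
	if (node)
		node->priority = priority;	/* fully initialized, no zeroing */
	return node;
}

/* Release a node: keep it as the cached node if the cache is empty. */
void put_node(struct queue *q, struct tree_node *node)
{
	struct tree_node *to_free = NULL;

	queue_lock(q);
	if (!q->node_cache)
		q->node_cache = node;
	else
		to_free = node;
	queue_unlock(q);
	free(to_free);
}
```

      In the 0 to 1 to 0 pattern, put_node() parks the node and the next
      get_node() takes it back without touching the allocator at all.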
      
      The performance changes are:
      
                  Old mqueue      new mqueue      new mqueue + caching
      queue empty
      send/recv   305/288ns       349/318ns       310/322ns
      
      I don't think we'll ever be able to get the recv performance back, but
      that's because the old recv performance was a direct result and
      consequence of the old method's abysmal send performance.  The recv path
      simply must do more so that the send path does not incur such a penalty
      under higher queue depths.
      
      As it turns out, the new caching code also sped up the various queue full
      cases relative to my last patch.  That could be because of the difference
      between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
      change in code flow in the mq_timedsend routine.  Regardless, I'll take
      it.  It wasn't huge, and I *would* say it was within the margin for error,
      but after many repeated runs what I'm seeing is that the old numbers trend
      slightly higher (about 10 to 20ns depending on which test is the one
      running).
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ce2d52cc
    • Doug Ledford's avatar
      tools/selftests: add mq_perf_tests · 7820b071
      Doug Ledford authored
      Add the mq_perf_tests tool I used when creating my mq performance patch.
      Also add a local .gitignore to keep the binaries from showing up in git
      status output.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7820b071
    • Doug Ledford's avatar
      ipc/mqueue: strengthen checks on mqueue creation · 113289cc
      Doug Ledford authored
      We already check the mq attr struct if it's passed in, but now that the
      admin can set system wide defaults separate from maximums, it's actually
      possible to set the defaults to something that would overflow.  So, if
      there is no attr struct passed in to the open call, check the default
      values.
      
      While we are at it, simplify mq_attr_ok() by making it return 0 or an
      error condition, so that way if we add more tests to it later, we have the
      option of what error should be returned instead of the calling location
      having to pick a possibly inaccurate error code.
      
      [akpm@linux-foundation.org: s/ENOMEM/EOVERFLOW/]
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      113289cc
    • Doug Ledford's avatar
      ipc/mqueue: correct mq_attr_ok test · 2c12ea49
      Doug Ledford authored
      While working on the other parts of the mqueue stuff, I noticed that the
      calculation for overflow in mq_attr_ok didn't actually match reality (this
      is especially true since my last patch which changed how we account memory
      slightly).
      
      In particular, we used to test for overflow using:
        msgs * msgsize + msgs * sizeof(struct msg_msg *)
      
      That was never really correct because each message we allocate via
      load_msg() is actually a struct msg_msg followed by the data for the
      message (and if struct msg_msg + data exceeds PAGE_SIZE we end up
      allocating struct msg_msgseg structs too, but accounting for them would
      get really tedious, so let's ignore those...they're only a pointer in size
      anyway).  This patch updates the calculation to be more accurate in
      regards to maximum possible memory consumption by the mqueue.
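
      A minimal sketch of an overflow check along these lines follows.  It is
      not the code from the patch: queue_size_ok() is a hypothetical name,
      hdr_size stands in for sizeof(struct msg_msg) (a kernel-internal type),
      and the msg_msgseg overhead is ignored, as in the text above.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/*
 * Check that the worst-case footprint of a queue, counted as
 * maxmsg * (hdr_size + msgsize), fits in an unsigned long.
 * Returns 0 if it fits, -EINVAL for non-positive inputs and
 * -EOVERFLOW if the computation would wrap.
 */
int queue_size_ok(long maxmsg, long msgsize, unsigned long hdr_size)
{
	unsigned long msgs, per_msg;

	if (maxmsg <= 0 || msgsize <= 0)
		return -EINVAL;

	msgs = (unsigned long)maxmsg;

	/* Each message is a header followed by its payload. */
	if ((unsigned long)msgsize > ULONG_MAX - hdr_size)
		return -EOVERFLOW;
	per_msg = (unsigned long)msgsize + hdr_size;

	/* msgs * per_msg must not wrap; test by division, not by multiplying. */
	if (per_msg > ULONG_MAX / msgs)
		return -EOVERFLOW;

	return 0;
}
```

      Testing with a division before multiplying avoids relying on the
      wrapped result of the multiplication itself.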
      
      [akpm@linux-foundation.org: add a local to simplify overflow-checking expression]
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c12ea49
    • Doug Ledford's avatar
      ipc/mqueue: improve performance of send/recv · d6629859
      Doug Ledford authored
      The existing implementation of the POSIX message queue send and recv
      functions is, well, abysmal.  Even worse than abysmal.  I submitted a
      patch to increase the maximum POSIX message queue limit to 65536 due to
      customer needs, however, upon looking over the send/recv implementation, I
      realized that my customer needs help with that too even if they don't know
      it.  The basic problem is that, given the fairly typical use case scenario
      for a large queue of queueing lots of messages all at the same priority (I
      verified with my customer that this is indeed what their app does), the
      msg_insert routine is basically a frikkin' bubble sort.  I mean, whoa,
      that's *so* middle school.
      
      OK, OK, to not slam the original author too much, I'm sure they didn't
      envision a queue depth of 50,000+ messages.  No one would think that
      moving elements in an array, one at a time, and dereferencing each pointer
      in that array to check priority of the message being pointed to, again
      one at a time, for 50,000+ times would be good.  So let's assume that, as
      is typical, the users have found a way to break our code simply by using
      it in a way we didn't envision.  Fair enough.
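
      To make the cost concrete, here is a sketch of the kind of sorted-array
      insertion described above.  It is a reconstruction for illustration,
      not the original msg_insert code; struct msg, array_insert() and
      moves_for_constant_priority() are hypothetical names, and the ordering
      (ascending priority, newest first among equals) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued message. */
struct msg {
	unsigned int prio;
};

/*
 * Insert into an array kept sorted by ascending priority, with the
 * newest message placed first among equal priorities.  Returns how many
 * elements had to be moved.  The caller guarantees room for one more.
 */
size_t array_insert(struct msg **slots, size_t *count, struct msg *m)
{
	size_t k = *count;
	size_t moved = 0;

	/* Walk from the tail, dereferencing each pointer to compare. */
	while (k > 0 && slots[k - 1]->prio >= m->prio) {
		slots[k] = slots[k - 1];	/* shift up by one slot */
		k--;
		moved++;
	}
	slots[k] = m;
	(*count)++;
	return moved;
}

/* Total moves needed to queue n messages that all share one priority. */
size_t moves_for_constant_priority(struct msg *msgs, struct msg **slots,
				   size_t n)
{
	size_t count = 0, total = 0, i;

	for (i = 0; i < n; i++) {
		msgs[i].prio = 0;
		total += array_insert(slots, &count, &msgs[i]);
	}
	return total;
}
```

      With a single priority, every insert moves every message already
      queued, so filling a queue of n messages costs n * (n - 1) / 2 moves:
      4950 for n = 100, and roughly 2.1 billion for n = 65,528.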
      
      "So, just how broken is it?", you ask.  I wondered the same thing, so I
      wrote an app to let me know.  It's my next patch.  It gave me some
      interesting results.  Here's what it tested:
      
      Interference with other apps - In continuous mode, the app just sits there
      and hits a message queue forever, while you go do something productive on
      another terminal using other CPUs.  You then measure how long it takes you
      to do that something productive.  Then you restart the app in fake
      continuous mode, and it sits in a tight loop on a CPU while you repeat
      your tests.  The whole point of this is to keep one CPU tied up (so it
      can't be used in your other work) but in one case tied up hitting the
      mqueue code so we can see the effect of walking that 65,528 element array
      one pointer at a time on the global CPU cache.  If it's bad, then it will
      slow down your app on the other CPUs just by polluting cache mercilessly.
      In the fake case, it will be in a tight loop, but not polluting cache.
      
      Testing the mqueue subsystem directly - Here we just run a number of tests
      to see how the mqueue subsystem performs under different conditions.  A
      couple conditions are known to be worst case for the old system, and some
      routines, so this tests all of them.
      
      So, on to the results already:
      
      Subsystem/Test                  Old                         New
      
      Time to compile linux
      kernel (make -j12 on a
      6 core CPU)
        Running mqueue test     user 49m10.744s             user 45m26.294s
                                 sys  5m51.924s              sys  4m59.894s
                               total 55m02.668s            total 50m26.188s
      
        Running fake test       user 45m32.686s             user 45m18.552s
                                 sys  5m12.465s              sys  4m56.468s
                               total 50m45.151s            total 50m15.020s
      
        % slowdown from mqueue
          cache thrashing            ~8%                         ~.5%
      
      Avg time to send/recv (in nanoseconds per message)
        when queue empty            305/288                    349/318
        when queue full (65528 messages)
          constant priority      526589/823                    362/314
          increasing priority    403105/916                    495/445
          decreasing priority     73420/594                    482/409
          random priority        280147/920                    546/436
      
      Time to fill/drain queue (65528 messages, in seconds)
        constant priority         17.37/.12                    .13/.12
        increasing priority        4.14/.14                    .21/.18
        decreasing priority       12.93/.13                    .21/.18
        random priority            8.88/.16                    .22/.17
      
      So, I think the results speak for themselves.  It's possible this
      implementation could be improved by caching at least one priority level
      in the node tree (that would bring the queue empty performance more in
      line with the old implementation), but this works and is *so* much better
      than what we had, especially for the common case of a single priority in
      use, that further refinements can be in follow on patches.
      
      [akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
      [levinsasha928@gmail.com: use correct gfp flags in msg_insert]
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d6629859
    • Doug Ledford's avatar
      selftests: add mq_open_tests · 50069a58
      Doug Ledford authored
      Add a directory to house POSIX message queue subsystem specific tests.
      Add first test which checks the operation of mq_open() under various
      corner conditions.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50069a58
    • KOSAKI Motohiro's avatar
      mqueue: separate mqueue default value from maximum value · cef0184c
      KOSAKI Motohiro authored
      Commit b231cca4 ("message queues: increase range limits") changed the
      mqueue default value used when the attr parameter is NULL from a
      hard-coded value to the fs.mqueue.{msg,msgsize}_max sysctl value.
      
      This had a large side effect.  Suppose a user needs to run two mqueue
      applications: 1) one that passes a non-NULL attr parameter and requires
      a big message size, and 2) one that passes a NULL attr parameter and
      only needs a small message size.  App (1) requires raising
      fs.mqueue.msgsize_max, and app (2) then consumes a large amount of
      memory even though it doesn't need it.
      
      Doug Ledford proposed switching back to a static hard-coded value.
      However, that also has a compatibility problem: some applications might
      have started to depend on the default value being tunable.
      
      The solution is to separate the default value from the maximum value.
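
      The separation can be modelled as below.  This is a sketch, not the
      patch: struct mq_limits, struct mq_params and pick_params() are
      hypothetical names, and clamping the defaults to the maximums is an
      assumption about how the two sets of knobs interact.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the tunables; not the kernel structure. */
struct mq_limits {
	long msg_max, msgsize_max;		/* upper bounds */
	long msg_default, msgsize_default;	/* used when attr is NULL */
};

/* Hypothetical model of the requested or chosen queue parameters. */
struct mq_params {
	long maxmsg, msgsize;
};

static long min_long(long a, long b)
{
	return a < b ? a : b;
}

/*
 * Choose the creation parameters for a new queue.  Returns 0 on
 * success, -1 if an explicit request is invalid or exceeds a maximum.
 */
int pick_params(const struct mq_limits *lim, const struct mq_params *req,
		struct mq_params *out)
{
	if (req) {
		/* Explicit attr: only the maximums apply. */
		if (req->maxmsg <= 0 || req->msgsize <= 0 ||
		    req->maxmsg > lim->msg_max ||
		    req->msgsize > lim->msgsize_max)
			return -1;
		*out = *req;
		return 0;
	}

	/* NULL attr: use the defaults, never above the maximums (assumed). */
	out->maxmsg = min_long(lim->msg_default, lim->msg_max);
	out->msgsize = min_long(lim->msgsize_default, lim->msgsize_max);
	return 0;
}
```

      With msgsize_max raised to 1 MiB for app (1) and msgsize_default left
      at 8192, app (2) gets 8192-byte messages instead of 1 MiB ones.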
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Acked-by: Doug Ledford <dledford@redhat.com>
      Acked-by: Joe Korty <joe.korty@ccur.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cef0184c
    • KOSAKI Motohiro's avatar
      mqueue: don't use kmalloc with KMALLOC_MAX_SIZE · fd1f87d2
      KOSAKI Motohiro authored
      KMALLOC_MAX_SIZE is not a good threshold.  It is extremely high and
      problematic.  Unfortunately, some silly drivers depend on this and we
      can't change it.  But new code needn't use such extremely ugly
      high-order allocations.  They bring us awful fragmentation issues and
      system slowdown.
      Signed-off-by: KOSAKI Motohiro <mkosaki@jp.fujitsu.com>
      Acked-by: Doug Ledford <dledford@redhat.com>
      Acked-by: Joe Korty <joe.korty@ccur.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fd1f87d2
    • KOSAKI Motohiro's avatar
      mqueue: revert bump up DFLT_*MAX · e6315bb1
      KOSAKI Motohiro authored
      The mqueue limit is a somewhat naive parameter, like those of other IPC
      mechanisms, because an unprivileged user can consume kernel memory by
      using IPC.
      
      Thus, raising it too aggressively creates a security issue.  For example,
      the current setting allows a malicious unprivileged user to use 256GB
      (= 256 * 1024 * 1024 * 1024), which is large enough to make the system
      unresponsive.  Don't do that.
      
      Instead, every admin should adjust the knobs for their own systems.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Doug Ledford <dledford@redhat.com>
      Acked-by: Joe Korty <joe.korty@ccur.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e6315bb1
    • Doug Ledford's avatar
      ipc/mqueue: update maximums for the mqueue subsystem · 5b5c4d1a
      Doug Ledford authored
      Commit b231cca4 ("message queues: increase range limits") changed the
      maximum size of a message in a message queue from INT_MAX to 8192*128.
      Unfortunately, we had customers that relied on a size much larger than
      8192*128 on their production systems.  After reviewing POSIX, we found
      that it is silent on the maximum message size.  We did find a couple other
      areas in which it was not silent.  Fix up the mqueue maximums so that the
      customer's system can continue to work, and document both the POSIX and
      real world requirements in ipc_namespace.h so that we don't have this
      issue crop back up.
      
      Also, commit 9cf18e1d ("ipc: HARD_MSGMAX should be higher not lower
      on 64bit") fiddled with HARD_MSGMAX without realizing that the number was
      intentionally in place to limit the msg queue depth to one that was small
      enough to kmalloc an array of pointers (hence why we divided 128k by
      sizeof(long)).  If we wish to meet POSIX requirements, we have no choice
      but to change our allocation to a vmalloc instead (at least for the large
      queue size case).  With that, it's possible to increase our allowed
      maximum to the POSIX requirements (or more if we choose).
      
      [sfr@canb.auug.org.au: using vmalloc requires including vmalloc.h]
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b5c4d1a
    • Doug Ledford's avatar
      ipc/mqueue: enforce hard limits · 02967ea0
      Doug Ledford authored
      In two places we don't enforce the hard limits for CAP_SYS_RESOURCE apps.
      In preparation for making more reasonable hard limits, start enforcing
      them even on CAP_SYS_RESOURCE.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      02967ea0
    • Doug Ledford's avatar
      ipc/mqueue: switch back to using non-max values on create · 858ee378
      Doug Ledford authored
      Commit b231cca4 ("message queues: increase range limits") changed
      how we create a queue that does not include an attr struct passed to
      open so that it creates the queue with whatever the maximum values are.
      However, if the admin has set the maximums to allow flexibility in
      creating a queue (aka, both a large size and large queue are allowed,
      but combined they create a queue too large for the RLIMIT_MSGQUEUE of
      the user), then attempts to create a queue without an attr struct will
      fail.  Switch back to using acceptable defaults regardless of what the
      maximums are.
      
      Note: so far, we only know of a few applications that rely on this
      behavior (specifically, they set the maximums in /proc, then run the
      application, which calls mq_open() without passing in an attr struct, and
      the application expects the newly created message queue to have the
      maximum sizes that were set in /proc used on the mq_open() call), and all
      of those applications that we know of are actually part of regression
      test suites that were coded to do something like this:
      
      for size in 4096 65536 $((1024 * 1024)) $((16 * 1024 * 1024)); do
      	echo $size > /proc/sys/fs/mqueue/msgsize_max
      	mq_open || echo "Error opening mq with size $size"
      done
      
      These test suites that depend on any behavior like this are broken.  The
      concept that programs should rely upon the system wide maximum in order
      to get their desired results instead of simply using an attr struct to
      specify what they want is fundamentally unfriendly programming practice
      for any multi-tasking OS.
      
      Fixing this will break those few apps that we know of (and those app
      authors recognize the brokenness of their code and the need to fix it).
      However, the following patch "mqueue: separate mqueue default value"
      allows a workaround in the form of new knobs for the default msg queue
      creation parameters for any software out there that we don't already
      know about that might rely on this behavior at the moment.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      858ee378