Commits · 060028bac94bf60a65415d1d55a359c3a17d5c31 · Kirill Smelkov / linux

06 Jun, 2014 40 commits

ipc/shm.c: increase the defaults for SHMALL, SHMMAX · 060028ba

Manfred Spraul authored Jun 06, 2014

System V shared memory

a) can be abused to trigger out-of-memory conditions and the standard
   measures against out-of-memory do not work:

    - it is not possible to use setrlimit to limit the size of shm segments.

    - segments can exist without association with any processes, thus
      the oom-killer is unable to free that memory.

b) is typically used for shared information - today often multiple GB.
   (e.g. database shared buffers)

The current default is a maximum segment size of 32 MB and a maximum
total size of 8 GB.  This is often too much for a) and not enough for
b), which means that lots of users must change the defaults.

This patch increases the default limits (nearly) to the maximum, which
is perfect for case b).  The defaults are used after boot and as the
initial value for each new namespace.

Admins/distros that need a protection against a) should reduce the
limits and/or enable shm_rmid_forced.

Unix has historically required setting these limits for shared memory,
and Linux inherited such behavior.  The consequence of this is added
complexity for users and administrators.  One very common example are
Database setup/installation documents and scripts, where users must
manually calculate the values for these limits.  This also requires
(some) knowledge of how the underlying memory management works, thus
causing, in many occasions, the limits to just be flat out wrong.
Disabling these limits sooner could have saved companies a lot of time,
headaches and money for support.  But it's never too late, simplify
users life now.

Further notes:
- The patch only changes default, overrides behave as before:
        # sysctl kernel.shmall=33554432
  would recreate the previous limit for SHMMAX (for the current namespace).

- Disabling sysv shm allocation is possible with:
        # sysctl kernel.shmall=0
  (not a new feature, also per-namespace)

- The limits are intentionally set to a value slightly less than ULONG_MAX,
  to avoid triggering overflows in user space apps.
  [not unreasonable, see http://marc.info/?l=linux-mm&m=139638334330127]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reported-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

060028ba

ipc/shm.c: check for integer overflow during shmget. · 1376327c

Manfred Spraul authored Jun 06, 2014

SHMMAX is the upper limit for the size of a shared memory segment, counted
in bytes.  The actual allocation is that size, rounded up to the next full
page.

Add a check that prevents the creation of segments where the rounded up
size causes an integer overflow.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1376327c

ipc/shm.c: check for overflows of shm_tot · 09c6eb1f

Manfred Spraul authored Jun 06, 2014

shm_tot counts the total number of pages used by shm segments.

If SHMALL is ULONG_MAX (or nearly ULONG_MAX), then the number can
overflow.  Subsequent calls to shmctl(,SHM_INFO,) would return wrong
values for shm_tot.

The patch adds a detection for overflows.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

09c6eb1f

ipc/shm.c: check for ulong overflows in shmat · 247a8ce8

Manfred Spraul authored Jun 06, 2014

The increase of SHMMAX/SHMALL is a 4 patch series.

The change itself is trivial, the only problem are interger overflows.
The overflows are not new, but if we make huge values the default, then
the code should be free from overflows.

SHMMAX:

- shmmem_file_setup places a hard limit on the segment size:
  MAX_LFS_FILESIZE.

  On 32-bit, the limit is > 1 TB, i.e. 4 GB-1 byte segments are
  possible. Rounded up to full pages the actual allocated size
  is 0. --> must be fixed, patch 3

- shmat:
  - find_vma_intersection does not handle overflows properly.
    --> must be fixed, patch 1

  - the rest is fine, do_mmap_pgoff limits mappings to TASK_SIZE
    and checks for overflows (i.e.: map 2 GB, starting from
    addr=2.5GB fails).

SHMALL:
- after creating 8192 segments size (1L<<63)-1, shm_tot overflows and
  returns 0.  --> must be fixed, patch 2.

Userspace:
- Obviously, there could be overflows in userspace. There is nothing
  we can do, only use values smaller than ULONG_MAX.
  I ended with "ULONG_MAX - 1L<<24":

  - TASK_SIZE cannot be used because it is the size of the current
    task. Could be 4G if it's a 32-bit task on a 64-bit kernel.

  - The maximum size is not standardized across archs:
    I found TASK_MAX_SIZE, TASK_SIZE_MAX and TASK_SIZE_64.

  - Just in case some arch revives a 4G/4G split, nearly
    ULONG_MAX is a valid segment size.

  - Using "0" as a magic value for infinity is even worse, because
    right now 0 means 0, i.e. fail all allocations.

This patch (of 4):

find_vma_intersection() does not work as intended if addr+size overflows.
The patch adds a manual check before the call to find_vma_intersection.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

247a8ce8

ipc, kernel: clear whitespace · 46c0a8ca

Paul McQuade authored Jun 06, 2014

trailing whitespace
Signed-off-by: Paul McQuade <paulmcquad@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

46c0a8ca

ipc, kernel: use Linux headers · 7153e402

Paul McQuade authored Jun 06, 2014

Use #include <linux/uaccess.h> instead of <asm/uaccess.h>
Use #include <linux/types.h> instead of <asm/types.h>
Signed-off-by: Paul McQuade <paulmcquad@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7153e402

ipc: constify ipc_ops · eb66ec44

Mathias Krause authored Jun 06, 2014

There is no need to recreate the very same ipc_ops structure on every
kernel entry for msgget/semget/shmget.  Just declare it static and be
done with it.  While at it, constify it as we don't modify the structure
at runtime.

Found in the PaX patch, written by the PaX Team.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

eb66ec44

initramfs: remove "compression mode" choice · 3e4e0f0a

Paul Bolle authored Jun 06, 2014

Commit 9ba4bcb6 ("initramfs: read CONFIG_RD_ variables for initramfs
compression") removed the users of the various INITRAMFS_COMPRESSION_*
Kconfig symbols.  So since v3.13 the entire "Built-in initramfs
compression mode" choice is a set of knobs connected to nothing.  The
entire choice can safely be removed.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Cc: P J P <ppandit@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3e4e0f0a

fs/devpts/inode.c: convert printk to pr_foo() · 04541a2f

Fabian Frederick authored Jun 06, 2014

Also convert spaces to tabs (checkpatch warnings) if (!dentry) KERN_NOTICE
converted to pr_err (like if (!inode) error process)

[akpm@linux-foundation.org: use KBUILD_MODNAME, per Joe]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

04541a2f

fs/cachefiles: replace kerror by pr_err · 0227d6ab

Fabian Frederick authored Jun 06, 2014

Also add pr_fmt in internal.h
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0227d6ab

FS/CACHEFILES: convert printk to pr_foo() · 4e1eb883

Fabian Frederick authored Jun 06, 2014

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4e1eb883

fs/pstore: logging clean-up · ef748853

Fabian Frederick authored Jun 06, 2014

- Define pr_fmt in plateform.c and ram_core.c for global prefix.

- Coalesce format fragments.

- Separate format/arguments on lines > 80 characters.

Note: Some pr_foo() were initially declared without prefix and therefore
this could break existing log analyzer.

[akpm@linux-foundation.org: missed a couple of prefix removals]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ef748853

kernel/profile.c: use static const char instead of static char · f3da64d1

Fabian Frederick authored Jun 06, 2014

schedstr, sleepstr and kvmstr are only used in strcmp & strlen
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f3da64d1

kernel/profile.c: convert printk to pr_foo() · aba871f1

Fabian Frederick authored Jun 06, 2014

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

aba871f1

fs/affs: pr_debug cleanup · 9606d9aa

Fabian Frederick authored Jun 06, 2014

- Remove AFFS: prefix (defined in pr_fmt)

- Use __func__

- Separate format/arguments on lines > 80 characters.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9606d9aa

fs/affs: convert printk to pr_foo() · 0158de12

Fabian Frederick authored Jun 06, 2014

-All printk(KERN_foo converted to pr_foo()

-Default printk converted to pr_warn()

-Add pr_fmt to affs.h
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0158de12

fs/affs/file.c: remove unnecessary function parameters · 0c89d670

Fabian Frederick authored Jun 06, 2014

- affs_do_readpage_ofs is always called with from = 0 ie reading from
  page->index

- File parameter is never used
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0c89d670

include/asm-generic/ioctl.h: fix _IOC_TYPECHECK sparse error · d55875f5

Hans Verkuil authored Jun 06, 2014

When running sparse over drivers/media/v4l2-core/v4l2-ioctl.c I get these
errors:

  drivers/media/v4l2-core/v4l2-ioctl.c:2043:9: error: bad integer constant expression
  drivers/media/v4l2-core/v4l2-ioctl.c:2044:9: error: bad integer constant expression
  drivers/media/v4l2-core/v4l2-ioctl.c:2045:9: error: bad integer constant expression
  drivers/media/v4l2-core/v4l2-ioctl.c:2046:9: error: bad integer constant expression

etc.

The root cause of that turns out to be in include/asm-generic/ioctl.h:

#include <uapi/asm-generic/ioctl.h>

/* provoke compile error for invalid uses of size argument */
extern unsigned int __invalid_size_argument_for_IOC;
#define _IOC_TYPECHECK(t) \
        ((sizeof(t) == sizeof(t[1]) && \
          sizeof(t) < (1 << _IOC_SIZEBITS)) ? \
          sizeof(t) : __invalid_size_argument_for_IOC)

If it is defined as this (as is already done if __KERNEL__ is not defined):

  #define _IOC_TYPECHECK(t) (sizeof(t))

then all is well with the world.

This patch allows sparse to work correctly.
Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d55875f5

kernel/user_namespace.c: kernel-doc/checkpatch fixes · 68a9a435

Fabian Frederick authored Jun 06, 2014

-uid->gid
-split some function declarations
-if/then/else warning
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

68a9a435

tools/testing/selftests/sysctl: validate sysctl_writes_strict · 24fe831c

Kees Cook authored Jun 06, 2014

This adds several behavioral tests to sysctl string and number writing
to detect unexpected cases that behaved differently when the sysctl
kernel.sysctl_writes_strict != 1.

[ original ]
    root@localhost:~# make test_num
    == Testing sysctl behavior against /proc/sys/kernel/domainname ==
    Writing test file ... ok
    Checking sysctl is not set to test value ... ok
    Writing sysctl from shell ... ok
    Resetting sysctl to original value ... ok
    Writing entire sysctl in single write ... ok
    Writing middle of sysctl after synchronized seek ... FAIL
    Writing beyond end of sysctl ... FAIL
    Writing sysctl with multiple long writes ... FAIL
    Writing entire sysctl in short writes ... FAIL
    Writing middle of sysctl after unsynchronized seek ... ok
    Checking sysctl maxlen is at least 65 ... ok
    Checking sysctl keeps original string on overflow append ... FAIL
    Checking sysctl stays NULL terminated on write ... ok
    Checking sysctl stays NULL terminated on overwrite ... ok
    make: *** [test_num] Error 1
    root@localhost:~# make test_string
    == Testing sysctl behavior against /proc/sys/vm/swappiness ==
    Writing test file ... ok
    Checking sysctl is not set to test value ... ok
    Writing sysctl from shell ... ok
    Resetting sysctl to original value ... ok
    Writing entire sysctl in single write ... ok
    Writing middle of sysctl after synchronized seek ... FAIL
    Writing beyond end of sysctl ... FAIL
    Writing sysctl with multiple long writes ... ok
    make: *** [test_string] Error 1

[ with CONFIG_PROC_SYSCTL_STRICT_WRITES ]
    root@localhost:~# make run_tests
    == Testing sysctl behavior against /proc/sys/kernel/domainname ==
    Writing test file ... ok
    Checking sysctl is not set to test value ... ok
    Writing sysctl from shell ... ok
    Resetting sysctl to original value ... ok
    Writing entire sysctl in single write ... ok
    Writing middle of sysctl after synchronized seek ... ok
    Writing beyond end of sysctl ... ok
    Writing sysctl with multiple long writes ... ok
    Writing entire sysctl in short writes ... ok
    Writing middle of sysctl after unsynchronized seek ... ok
    Checking sysctl maxlen is at least 65 ... ok
    Checking sysctl keeps original string on overflow append ... ok
    Checking sysctl stays NULL terminated on write ... ok
    Checking sysctl stays NULL terminated on overwrite ... ok
    == Testing sysctl behavior against /proc/sys/vm/swappiness ==
    Writing test file ... ok
    Checking sysctl is not set to test value ... ok
    Writing sysctl from shell ... ok
    Resetting sysctl to original value ... ok
    Writing entire sysctl in single write ... ok
    Writing middle of sysctl after synchronized seek ... ok
    Writing beyond end of sysctl ... ok
    Writing sysctl with multiple long writes ... ok
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

24fe831c

sysctl: allow for strict write position handling · f4aacea2

Kees Cook authored Jun 06, 2014

When writing to a sysctl string, each write, regardless of VFS position,
begins writing the string from the start.  This means the contents of
the last write to the sysctl controls the string contents instead of the
first:

  open("/proc/sys/kernel/modprobe", O_WRONLY)   = 1
  write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
  write(1, "/bin/true", 9)                = 9
  close(1)                                = 0

  $ cat /proc/sys/kernel/modprobe
  /bin/true

Expected behaviour would be to have the sysctl be "AAAA..." capped at
maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
contents of the second write.  Similarly, multiple short writes would
not append to the sysctl.

The old behavior is unlike regular POSIX files enough that doing audits
of software that interact with sysctls can end up in unexpected or
dangerous situations.  For example, "as long as the input starts with a
trusted path" turns out to be an insufficient filter, as what must also
happen is for the input to be entirely contained in a single write
syscall -- not a common consideration, especially for high level tools.

This provides kernel.sysctl_writes_strict as a way to make this behavior
act in a less surprising manner for strings, and disallows non-zero file
position when writing numeric sysctls (similar to what is already done
when reading from non-zero file positions).  For now, the default (0) is
to warn about non-zero file position use, but retain the legacy
behavior.  Setting this to -1 disables the warning, and setting this to
1 enables the file position respecting behavior.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: move misplaced hunk, per Randy]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f4aacea2

sysctl: refactor sysctl string writing logic · 2ca9bb45

Kees Cook authored Jun 06, 2014

Consolidate buffer length checking with new-line/end-of-line checking.
Additionally, instead of reading user memory twice, just do the
assignment during the loop.

This change doesn't affect the potential races here.  It was already
possible to read a sysctl that was in the middle of a write.  In both
cases, the string will always be NULL terminated.  The pre-existing race
remains a problem to be solved.
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2ca9bb45

sysctl: clean up char buffer arguments · f8808300

Kees Cook authored Jun 06, 2014

When writing to a sysctl string, each write, regardless of VFS position,
began writing the string from the start.  This meant the contents of the
last write to the sysctl controlled the string contents instead of the
first.

This misbehavior was featured in an exploit against Chrome OS.  While
it's not in itself a vulnerability, it's a weirdness that isn't on the
mind of most auditors: "This filter looks correct, the first line
written would not be meaningful to sysctl" doesn't apply here, since the
size of the write and the contents of the final write are what matter
when writing to sysctls.

This adds the sysctl kernel.sysctl_writes_strict to control the write
behavior.  The default (0) reports when VFS position is non-0 on a
write, but retains legacy behavior, -1 disables the warning, and 1
enables the position-respecting behavior.

The long-term plan here is to wait for userspace to be fixed in response
to the new warning and to then switch the default kernel behavior to the
new position-respecting behavior.

This patch (of 4):

The char buffer arguments are needlessly cast in weird places.  Clean it
up so things are easier to read.
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f8808300

rapidio/tsi721: use pci_enable_msix_exact() instead of pci_enable_msix() · 1c92ab1e

Alexander Gordeev authored Jun 06, 2014

As result of deprecation of MSI-X/MSI enablement functions
pci_enable_msix() and pci_enable_msi_block() all drivers using these two
interfaces need to be updated to use the new pci_enable_msi_range() or
pci_enable_msi_exact() and pci_enable_msix_range() or
pci_enable_msix_exact() interfaces.

The patch has no runtime effect.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Acked-by: Alexandre Bounine <alexandre.bounine@idt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1c92ab1e

idr: reorder the fields · dcbff5d1

Lai Jiangshan authored Jun 06, 2014

idr_layer->layer is always accessed in read path, move it in the front.

idr_layer->bitmap is moved on the bottom.  And rcu_head shares with
bitmap due to they do not be accessed at the same time.

idr->id_free/id_free_cnt/lock are free list fields, and moved to the
bottom.  They will be removed in near future.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dcbff5d1

idr: reduce the unneeded check in free_layer() · 15f3ec3f

Lai Jiangshan authored Jun 06, 2014

If "idr->hint == p" is true, it also implies "idr->hint" is true(not NULL).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

15f3ec3f

idr: don't need to shink the free list when idr_remove() · aefb7682

Lai Jiangshan authored Jun 06, 2014

After idr subsystem is changed to RCU-awared, the free layer will not go
to the free list.  The free list will not be filled up when
idr_remove().  So we don't need to shink it too.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

aefb7682

idr: fix idr_replace()'s returned error code · b93804b2

Lai Jiangshan authored Jun 06, 2014

When the smaller id is not found, idr_replace() returns -ENOENT.  But
when the id is bigger enough, idr_replace() returns -EINVAL, actually
there is no difference between these two kinds of ids.

These are all unallocated id, the return values of the idr_replace() for
these ids should be the same: -ENOENT.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b93804b2

idr: fix NULL pointer dereference when ida_remove(unallocated_id) · aef0f62e

Lai Jiangshan authored Jun 06, 2014

If the ida has at least one existing id, and when an unallocated ID
which meets a certain condition is passed to the ida_remove(), the
system will crash because it hits NULL pointer dereference.

The condition is that the unallocated ID shares the same lowest idr
layer with the existing ID, but the idr slot would be different if the
unallocated ID were to be allocated.

In this case the matching idr slot for the unallocated_id is NULL,
causing @bitmap to be NULL which the function dereferences without
checking crashing the kernel.

See the test code:

  static void test3(void)
  {
        int id;
        DEFINE_IDA(test_ida);

        printk(KERN_INFO "Start test3\n");
        if (ida_pre_get(&test_ida, GFP_KERNEL) < 0) return;
        if (ida_get_new(&test_ida,  &id) < 0) return;
        ida_remove(&test_ida, 4000); /* bug: null deference here */
        printk(KERN_INFO "End of test3\n");
  }

It happens only when the caller tries to free an unallocated ID which is
the caller's fault.  It is not a bug.  But it is better to add the
proper check and complain rather than crashing the kernel.

[tj@kernel.org: updated patch description]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

aef0f62e

idr: fix unexpected ID-removal when idr_remove(unallocated_id) · 8f9f665a

Lai Jiangshan authored Jun 06, 2014

If unallocated_id = (ANY * idr_max(idp->layers) + existing_id) is passed
to idr_remove().  The existing_id will be removed unexpectedly.

The following test shows this unexpected id-removal:

  static void test4(void)
  {
        int id;
        DEFINE_IDR(test_idr);

        printk(KERN_INFO "Start test4\n");
        id = idr_alloc(&test_idr, (void *)1, 42, 43, GFP_KERNEL);
        BUG_ON(id != 42);
        idr_remove(&test_idr, 42 + IDR_SIZE);
        TEST_BUG_ON(idr_find(&test_idr, 42) != (void *)1);
        idr_destroy(&test_idr);
        printk(KERN_INFO "End of test4\n");
  }

ida_remove() shares the similar problem.

It happens only when the caller tries to free an unallocated ID which is
the caller's fault.  It is not a bug.  But it is better to add the
proper check and complain rather than removing an existing_id silently.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8f9f665a

idr: fix overflow bug during maximum ID calculation at maximum height · 3afb69cb

Lai Jiangshan authored Jun 06, 2014

idr_replace() open-codes the logic to calculate the maximum valid ID
given the height of the idr tree; unfortunately, the open-coded logic
doesn't account for the fact that the top layer may have unused slots
and over-shifts the limit to zero when the tree is at its maximum
height.

The following test code shows it fails to replace the value for
id=((1<<27)+42):

  static void test5(void)
  {
        int id;
        DEFINE_IDR(test_idr);
  #define TEST5_START ((1<<27)+42) /* use the highest layer */

        printk(KERN_INFO "Start test5\n");
        id = idr_alloc(&test_idr, (void *)1, TEST5_START, 0, GFP_KERNEL);
        BUG_ON(id != TEST5_START);
        TEST_BUG_ON(idr_replace(&test_idr, (void *)2, TEST5_START) != (void *)1);
        idr_destroy(&test_idr);
        printk(KERN_INFO "End of test5\n");
  }

Fix the bug by using idr_max() which correctly takes into account the
maximum allowed shift.

sub_alloc() shares the same problem and may incorrectly fail with
-EAGAIN; however, this bug doesn't affect correct operation because
idr_get_empty_slot(), which already uses idr_max(), retries with the
increased @id in such cases.

[tj@kernel.org: Updated patch description.]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3afb69cb

kernel/kexec.c: convert printk to pr_foo() · e1bebcf4

Fabian Frederick authored Jun 06, 2014

+ some pr_warning -> pr_warn and checkpatch warning fixes
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e1bebcf4

kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump after panic_notifers · f06e5153

Masami Hiramatsu authored Jun 06, 2014

Add a "crash_kexec_post_notifiers" boot option to run kdump after
running panic_notifiers and dump kmsg.  This can help rare situations
where kdump fails because of unstable crashed kernel or hardware failure
(memory corruption on critical data/code), or the 2nd kernel is already
broken by the 1st kernel (it's a broken behavior, but who can guarantee
that the "crashed" kernel works correctly?).

Usage: add "crash_kexec_post_notifiers" to kernel boot option.

Note that this actually increases risks of the failure of kdump.  This
option should be set only if you worry about the rare case of kdump
failure rather than increasing the chance of success.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Motohiro Kosaki <Motohiro.Kosaki@us.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Cc: Satoru MORIYA <satoru.moriya.br@hitachi.com>
Cc: Tomoki Sekiyama <tomoki.sekiyama@hds.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f06e5153

smp: print more useful debug info upon receiving IPI on an offline CPU · a219ccf4

Srivatsa S. Bhat authored Jun 06, 2014

There is a longstanding problem related to CPU hotplug which causes IPIs
to be delivered to offline CPUs, and the smp-call-function IPI handler
code prints out a warning whenever this is detected.  Every once in a
while this (usually harmless) warning gets reported on LKML, but so far
it has not been completely fixed.  Usually the solution involves finding
out the IPI sender and fixing it by adding appropriate synchronization
with CPU hotplug.

However, while going through one such internal bug reports, I found that
there is a significant bug in the receiver side itself (more
specifically, in stop-machine) that can lead to this problem even when
the sender code is perfectly fine.  This patchset fixes that
synchronization problem in the CPU hotplug stop-machine code.

Patch 1 adds some additional debug code to the smp-call-function
framework, to help debug such issues easily.

Patch 2 modifies the stop-machine code to ensure that any IPIs that were
sent while the target CPU was online, would be noticed and handled by
that CPU without fail before it goes offline.  Thus, this avoids
scenarios where IPIs are received on offline CPUs (as long as the sender
uses proper hotplug synchronization).

In fact, I debugged the problem by using Patch 1, and found that the
payload of the IPI was always the block layer's trigger_softirq()
function.  But I was not able to find anything wrong with the block
layer code.  That's when I started looking at the stop-machine code and
realized that there is a race-window which makes the IPI _receiver_ the
culprit, not the sender.  Patch 2 fixes that race and hence this should
put an end to most of the hard-to-debug IPI-to-offline-CPU issues.

This patch (of 2):

Today the smp-call-function code just prints a warning if we get an IPI
on an offline CPU.  This info is sufficient to let us know that
something went wrong, but often it is very hard to debug exactly who
sent the IPI and why, from this info alone.

In most cases, we get the warning about the IPI to an offline CPU,
immediately after the CPU going offline comes out of the stop-machine
phase and reenables interrupts.  Since all online CPUs participate in
stop-machine, the information regarding the sender of the IPI is already
lost by the time we exit the stop-machine loop.  So even if we dump the
stack on each CPU at this point, we won't find anything useful since all
of them will show the stack-trace of the stopper thread.  So we need a
better way to figure out who sent the IPI and why.

To achieve this, when we detect an IPI targeted to an offline CPU, loop
through the call-single-data linked list and print out the payload
(i.e., the name of the function which was supposed to be executed by the
target CPU).  This would give us an insight as to who might have sent
the IPI and help us debug this further.

[akpm@linux-foundation.org: correctly suppress warning output on second and later occurrences]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a219ccf4

fs/proc/vmcore.c: remove NULL assignment to static · a05e16ad

Fabian Frederick authored Jun 06, 2014

Static values are automatically initialized to NULL.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a05e16ad

fs/proc/task_mmu.c: replace seq_printf by seq_puts · 17c2b4ee

Fabian Frederick authored Jun 06, 2014

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

17c2b4ee

signals: change wait_for_helper() to use kernel_sigaction() · 76e0a6f4

Oleg Nesterov authored Jun 06, 2014

Now that we have kernel_sigaction() we can change wait_for_helper() to
use it and cleans up the code a bit.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

76e0a6f4

signals: introduce kernel_sigaction() · b4e74264

Oleg Nesterov authored Jun 06, 2014

Now that allow_signal() is really trivial we can unify it with
disallow_signal().  Add the new helper, kernel_sigaction(), and
reimplement allow_signal/disallow_signal as a trivial wrappers.

This saves one EXPORT_SYMBOL() and the new helper can have more users.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b4e74264

signals: disallow_signal() should flush the potentially pending signal · 580d34e4

Oleg Nesterov authored Jun 06, 2014

disallow_signal() simply sets SIG_IGN, this is not enough and
recalc_sigpending() is simply pointless because in can never change the
state of TIF_SIGPENDING.

If we ignore a signal, we also need to do flush_sigqueue_mask() for the
case when this signal is pending, this way recalc_sigpending() can
actually clear TIF_SIGPENDING and we do not "leak" the allocated
siginfo's.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

580d34e4

signals: kill the obsolete sigdelset() and recalc_sigpending() in allow_signal() · ec5955b8

Oleg Nesterov authored Jun 06, 2014

allow_signal() does sigdelset(current->blocked) due to historic reason,
previously it could be called by a daemonize()'ed kthread, and
daemonize() played with current->blocked.

Now that daemonize() has gone away we can remove sigdelset() and
recalc_sigpending().  If a user really wants to unblock a signal, it
must use sigprocmask() or set_current_block() explicitely.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ec5955b8