Commits · d6c355c7dabcd753a75bc77d150d36328a355267 · nexedi / linux

09 Sep, 2013 1 commit

aio: fix race in ring buffer page lookup introduced by page migration support · d6c355c7

Benjamin LaHaise authored Sep 09, 2013

Prior to the introduction of page migration support in "fs/aio: Add support
to aio ring pages migration" / 36bc08cc,
mapping of the ring buffer pages was done via get_user_pages() while
retaining mmap_sem held for write. This avoided possible races with userland
racing an munmap() or mremap(). The page migration patch, however, switched
to using mm_populate() to prime the page mapping. mm_populate() cannot be
called with mmap_sem held.

Instead of dropping the mmap_sem, revert to the old behaviour and simply
drop the use of mm_populate() since get_user_pages() will cause the pages to
get mapped anyways. Thanks to Al Viro for spotting this issue.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

d6c355c7

30 Aug, 2013 2 commits

aio: fix rcu sparse warnings introduced by ioctx table lookup patch · 77d30b14

Benjamin LaHaise authored Aug 30, 2013

Sseveral sparse warnings were caused by missing rcu_dereference() annotations
for dereferencing mm->ioctx_table. Thankfully, none of those were actual bugs
as the deref was protected by a spin lock in all instances.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>

77d30b14

aio: remove unnecessary debugging from aio_free_ring() · 79bd1bcf

Benjamin LaHaise authored Aug 30, 2013

The commit 36bc08cc ("fs/aio: Add support to aio ring pages migration")
added some debugging code that is not required and resulted in a build error
when 98474236 ("vfs: make the dentry cache use the lockref infrastructure")
was added to the tree. The code is not required, so just delete it.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

79bd1bcf

07 Aug, 2013 1 commit

aio: table lookup: verify ctx pointer · f30d704f

Benjamin LaHaise authored Aug 07, 2013

Another shortcoming of the table lookup patch was revealed where the pointer
was not being tested before being dereferenced.  Verify this to avoid the
NULL pointer dereference.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

f30d704f

06 Aug, 2013 1 commit

staging/lustre: kiocb->ki_left is removed · 0bdd5ca5

Peng Tao authored Aug 06, 2013

We also missed ki_nbytes...

Cc: Kent Overstreet <koverstreet@google.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

0bdd5ca5

05 Aug, 2013 1 commit

aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3" · da90382c

Benjamin LaHaise authored Aug 05, 2013

In the patch "aio: convert the ioctx list to table lookup v3", incorrect
handling in the ioctx_alloc() error path was introduced that lead to an
ioctx being added via ioctx_add_table() while freed when the ioctx_alloc()
call returned -EAGAIN due to hitting the aio_max_nr limit. Fix this by
only calling ioctx_add_table() as the last step in ioctx_alloc().

Also, several unnecessary rcu_dereference() calls were added that lead to
RCU warnings where the system was already protected by a spin lock for
accessing mm->ioctx_table.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

da90382c

31 Jul, 2013 1 commit

aio: be defensive to ensure request batching is non-zero instead of BUG_ON() · 6878ea72

Benjamin LaHaise authored Jul 31, 2013

In the event that an overflow/underflow occurs while calculating req_batch,
clamp the minimum at 1 request instead of doing a BUG_ON().
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

6878ea72

30 Jul, 2013 11 commits

aio: convert the ioctx list to table lookup v3 · db446a08

Benjamin LaHaise authored Jul 30, 2013

On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote:
> On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote:
> > When using a large number of threads performing AIO operations the
> > IOCTX list may get a significant number of entries which will cause
> > significant overhead. For example, when running this fio script:
> >
> > rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1
> > blocksize=1024; numjobs=512; thread; loops=100
> >
> > on an EXT2 filesystem mounted on top of a ramdisk we can observe up to
> > 30% CPU time spent by lookup_ioctx:
> >
> >  32.51%  [guest.kernel]  [g] lookup_ioctx
> >   9.19%  [guest.kernel]  [g] __lock_acquire.isra.28
> >   4.40%  [guest.kernel]  [g] lock_release
> >   4.19%  [guest.kernel]  [g] sched_clock_local
> >   3.86%  [guest.kernel]  [g] local_clock
> >   3.68%  [guest.kernel]  [g] native_sched_clock
> >   3.08%  [guest.kernel]  [g] sched_clock_cpu
> >   2.64%  [guest.kernel]  [g] lock_release_holdtime.part.11
> >   2.60%  [guest.kernel]  [g] memcpy
> >   2.33%  [guest.kernel]  [g] lock_acquired
> >   2.25%  [guest.kernel]  [g] lock_acquire
> >   1.84%  [guest.kernel]  [g] do_io_submit
> >
> > This patchs converts the ioctx list to a radix tree. For a performance
> > comparison the above FIO script was run on a 2 sockets 8 core
> > machine. This are the results (average and %rsd of 10 runs) for the
> > original list based implementation and for the radix tree based
> > implementation:
> >
> > cores         1         2         4         8         16        32
> > list       109376 ms  69119 ms  35682 ms  22671 ms  19724 ms  16408 ms
> > %rsd         0.69%      1.15%     1.17%     1.21%     1.71%     1.43%
> > radix       73651 ms  41748 ms  23028 ms  16766 ms  15232 ms   13787 ms
> > %rsd         1.19%      0.98%     0.69%     1.13%    0.72%      0.75%
> > % of radix
> > relative    66.12%     65.59%    66.63%    72.31%   77.26%     83.66%
> > to list
> >
> > To consider the impact of the patch on the typical case of having
> > only one ctx per process the following FIO script was run:
> >
> > rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1
> > blocksize=1024; numjobs=1; thread; loops=100
> >
> > on the same system and the results are the following:
> >
> > list        58892 ms
> > %rsd         0.91%
> > radix       59404 ms
> > %rsd         0.81%
> > % of radix
> > relative    100.87%
> > to list
>
> So, I was just doing some benchmarking/profiling to get ready to send
> out the aio patches I've got for 3.11 - and it looks like your patch is
> causing a ~1.5% throughput regression in my testing :/
... <snip>

I've got an alternate approach for fixing this wart in lookup_ioctx()...
Instead of using an rbtree, just use the reserved id in the ring buffer
header to index an array pointing the ioctx.  It's not finished yet, and
it needs to be tidied up, but is most of the way there.

		-ben
--
"Thought is the essence of where you are now."
--
kmo> And, a rework of Ben's code, but this was entirely his idea
kmo>		-Kent

bcrl> And fix the code to use the right mm_struct in kill_ioctx(), actually
free memory.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

db446a08

aio: double aio_max_nr in calculations · 4cd81c3d

Benjamin LaHaise authored Jul 30, 2013

With the changes to use percpu counters for aio event ring size calculation,
existing increases to aio_max_nr are now insufficient to allow for the
allocation of enough events. Double the value used for aio_max_nr to account
for the doubling introduced by the percpu slack.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

4cd81c3d

aio: Kill ki_dtor · d29c445b

Kent Overstreet authored Mar 18, 2013

sock_aio_dtor() is dead code - and stuff that does need to do cleanup
can simply do it before calling aio_complete().
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

d29c445b

aio: Kill ki_users · 57282d8f

Kent Overstreet authored May 13, 2013

The kiocb refcount is only needed for cancellation - to ensure a kiocb
isn't freed while a ki_cancel callback is running. But if we restrict
ki_cancel callbacks to not block (which they currently don't), we can
simply drop the refcount.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

57282d8f

aio: Kill unneeded kiocb members · 8bc92afc

Kent Overstreet authored Feb 25, 2013

The old aio retry infrastucture needed to save the various arguments to
to aio operations. But with the retry infrastructure gone, we can trim
struct kiocb quite a bit.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

8bc92afc

aio: Kill aio_rw_vect_retry() · 73a7075e

Kent Overstreet authored May 09, 2013

This code doesn't serve any purpose anymore, since the aio retry
infrastructure has been removed.

This change should be safe because aio_read/write are also used for
synchronous IO, and called from do_sync_read()/do_sync_write() - and
there's no looping done in the sync case (the read and write syscalls).
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

73a7075e

aio: Don't use ctx->tail unnecessarily · 5ffac122

Kent Overstreet authored May 09, 2013

aio_complete() (arguably) needs to keep its own trusted copy of the tail
pointer, but io_getevents() doesn't have to use it - it's already using
the head pointer from the ring buffer.

So convert it to use the tail from the ring buffer so it touches fewer
cachelines and doesn't contend with the cacheline aio_complete() needs.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

5ffac122

aio: io_cancel() no longer returns the io_event · bec68faa

Kent Overstreet authored May 13, 2013

Originally, io_event() was documented to return the io_event if
cancellation succeeded - the io_event wouldn't be delivered via the ring
buffer like it normally would.

But this isn't what the implementation was actually doing; the only
driver implementing cancellation, the usb gadget code, never returned an
io_event in its cancel function. And aio_complete() was recently changed
to no longer suppress event delivery if the kiocb had been cancelled.

This gets rid of the unused io_event argument to kiocb_cancel() and
kiocb->ki_cancel(), and changes io_cancel() to return -EINPROGRESS if
kiocb->ki_cancel() returned success.

Also tweak the refcounting in kiocb_cancel() to make more sense.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

bec68faa

aio: percpu ioctx refcount · 723be6e3

Kent Overstreet authored May 28, 2013

This just converts the ioctx refcount to the new generic dynamic percpu
refcount code.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

723be6e3

aio: percpu reqs_available · e1bdd5f2

Kent Overstreet authored Apr 26, 2013

See the previous patch ("aio: reqs_active -> reqs_available") for why we
want to do this - this basically implements a per cpu allocator for
reqs_available that doesn't actually allocate anything.

Note that we need to increase the size of the ringbuffer we allocate,
since a single thread won't necessarily be able to use all the
reqs_available slots - some (up to about half) might be on other per cpu
lists, unavailable for the current thread.

We size the ringbuffer based on the nr_events userspace passed to
io_setup(), so this is a slight behaviour change - but nr_events wasn't
being used as a hard limit before, it was being rounded up to the next
page before so this doesn't change the actual semantics.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

e1bdd5f2

aio: reqs_active -> reqs_available · 34e83fc6

Kent Overstreet authored Apr 26, 2013

The number of outstanding kiocbs is one of the few shared things left that
has to be touched for every kiocb - it'd be nice to make it percpu.

We can make it per cpu by treating it like an allocation problem: we have
a maximum number of kiocbs that can be outstanding (i.e.  slots) - then we
just allocate and free slots, and we know how to write per cpu allocators.

So as prep work for that, we convert reqs_active to reqs_available.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

34e83fc6

17 Jul, 2013 1 commit

aio: fix build when migration is disabled · 0c45355f

Benjamin LaHaise authored Jul 17, 2013

When "fs/aio: Add support to aio ring pages migration" was applied, it
broke the build when CONFIG_MIGRATION was disabled.  Wrap the migration
code with a test for CONFIG_MIGRATION to fix this and save a few bytes
when migration is disabled.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

0c45355f

16 Jul, 2013 2 commits

fs/aio: Add support to aio ring pages migration · 36bc08cc

Gu Zheng authored Jul 16, 2013

As the aio job will pin the ring pages, that will lead to mem migrated
failed. In order to fix this problem we use an anon inode to manage the aio ring
pages, and setup the migratepage callback in the anon inode's address space, so
that when mem migrating the aio ring pages will be moved to other mem node safely.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

36bc08cc

fs/anon_inode: Introduce a new lib function anon_inode_getfile_private() · 55708698

Gu Zheng authored Jul 16, 2013

Introduce a new lib function anon_inode_getfile_private(), it creates a new file
instance by hooking it up to an anonymous inode, and a dentry that describe the
"class" of the file, similar to anon_inode_getfile(), but each file holds a
single inode. Furthermore, anyone who wants to create a private anon file will
benefit from this change.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

55708698

15 Jul, 2013 1 commit

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · 47188d39

Linus Torvalds authored Jul 14, 2013

Pull ext4 bugfixes from Ted Ts'o:
 "Various regression and bug fixes for ext4"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: don't allow ext4_free_blocks() to fail due to ENOMEM
  ext4: fix spelling errors and a comment in extent_status tree
  ext4: rate limit printk in buffer_io_error()
  ext4: don't show usrquota/grpquota twice in /proc/mounts
  ext4: fix warning in ext4_evict_inode()
  ext4: fix ext4_get_group_number()
  ext4: silence warning in ext4_writepages()

47188d39

14 Jul, 2013 15 commits

Linux 3.11-rc1 · ad81f054
Linus Torvalds authored Jul 14, 2013

ad81f054

Merge branch 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux · 54be8200

Linus Torvalds authored Jul 14, 2013

Pull slab update from Pekka Enberg:
 "Highlights:

  - Fix for boot-time problems on some architectures due to
    init_lock_keys() not respecting kmalloc_caches boundaries
    (Christoph Lameter)

  - CONFIG_SLUB_CPU_PARTIAL requested by RT folks (Joonsoo Kim)

  - Fix for excessive slab freelist draining (Wanpeng Li)

  - SLUB and SLOB cleanups and fixes (various people)"

I ended up editing the branch, and this avoids two commits at the end
that were immediately reverted, and I instead just applied the oneliner
fix in between myself.

* 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux
  slub: Check for page NULL before doing the node_match check
  mm/slab: Give s_next and s_stop slab-specific names
  slob: Check for NULL pointer before calling ctor()
  slub: Make cpu partial slab support configurable
  slab: add kmalloc() to kernel API documentation
  slab: fix init_lock_keys
  slob: use DIV_ROUND_UP where possible
  slub: do not put a slab to cpu partial list when cpu_partial is 0
  mm/slub: Use node_nr_slabs and node_nr_objs in get_slabinfo
  mm/slub: Drop unnecessary nr_partials
  mm/slab: Fix /proc/slabinfo unwriteable for slab
  mm/slab: Sharing s_next and s_stop between slab and slub
  mm/slab: Fix drain freelist excessively
  slob: Rework #ifdeffery in slab.h
  mm, slab: moved kmem_cache_alloc_node comment to correct place

54be8200

slub: Check for page NULL before doing the node_match check · c25f195e

Steven Rostedt authored Jan 17, 2013

In the -rt kernel (mrg), we hit the following dump:

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
PGD a2d39067 PUD b1641067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: sunrpc cpufreq_ondemand ipv6 tg3 joydev sg serio_raw pcspkr k8temp amd64_edac_mod edac_core i2c_piix4 e100 mii shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom sata_svw ata_generic pata_acpi pata_serverworks radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
CPU 3
Pid: 20878, comm: hackbench Not tainted 3.6.11-rt25.14.el6rt.x86_64 #1 empty empty/Tyan Transport GT24-B3992
RIP: 0010:[<ffffffff811573f1>]  [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
RSP: 0018:ffff8800a9b17d70  EFLAGS: 00010213
RAX: 0000000000000000 RBX: 0000000001200011 RCX: ffff8800a06d8000
RDX: 0000000004d92a03 RSI: 00000000000000d0 RDI: ffff88013b805500
RBP: ffff8800a9b17dc0 R08: ffff88023fd14d10 R09: ffffffff81041cbd
R10: 00007f4e3f06e9d0 R11: 0000000000000246 R12: ffff88013b805500
R13: ffff8801ff46af40 R14: 0000000000000001 R15: 0000000000000000
FS:  00007f4e3f06e700(0000) GS:ffff88023fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 00000000a2d3a000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process hackbench (pid: 20878, threadinfo ffff8800a9b16000, task ffff8800a06d8000)
Stack:
 ffff8800a9b17da0 ffffffff81202e08 ffff8800a9b17de0 000000d001200011
 0000000001200011 0000000001200011 0000000000000000 0000000000000000
 00007f4e3f06e9d0 0000000000000000 ffff8800a9b17e60 ffffffff81041cbd
Call Trace:
 [<ffffffff81202e08>] ? current_has_perm+0x68/0x80
 [<ffffffff81041cbd>] copy_process+0xdd/0x15b0
 [<ffffffff810a2125>] ? rt_up_read+0x25/0x30
 [<ffffffff8104369a>] do_fork+0x5a/0x360
 [<ffffffff8107c66b>] ? migrate_enable+0xeb/0x220
 [<ffffffff8100b068>] sys_clone+0x28/0x30
 [<ffffffff81527423>] stub_clone+0x13/0x20
 [<ffffffff81527152>] ? system_call_fastpath+0x16/0x1b
Code: 89 fc 89 75 cc 41 89 d6 4d 8b 04 24 65 4c 03 04 25 48 ae 00 00 49 8b 50 08 4d 8b 28 49 8b 40 10 4d 85 ed 74 12 41 83 fe ff 74 27 <48> 8b 00 48 c1 e8 3a 41 39 c6 74 1b 8b 75 cc 4c 89 c9 44 89 f2
RIP  [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
 RSP <ffff8800a9b17d70>
CR2: 0000000000000000
---[ end trace 0000000000000002 ]---

Now, this uses SLUB pretty much unmodified, but as it is the -rt kernel
with CONFIG_PREEMPT_RT set, spinlocks are mutexes, although they do
disable migration. But the SLUB code is relatively lockless, and the
spin_locks there are raw_spin_locks (not converted to mutexes), thus I
believe this bug can happen in mainline without -rt features. The -rt
patch is just good at triggering mainline bugs ;-)

Anyway, looking at where this crashed, it seems that the page variable
can be NULL when passed to the node_match() function (which does not
check if it is NULL). When this happens we get the above panic.

As page is only used in slab_alloc() to check if the node matches, if
it's NULL I'm assuming that we can say it doesn't and call the
__slab_alloc() code. Is this a correct assumption?
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c25f195e

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 41d9884c

Linus Torvalds authored Jul 14, 2013

Pull more vfs stuff from Al Viro:
 "O_TMPFILE ABI changes, Oleg's fput() series, misc cleanups, including
  making simple_lookup() usable for filesystems with non-NULL s_d_op,
  which allows us to get rid of quite a bit of ugliness"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  sunrpc: now we can just set ->s_d_op
  cgroup: we can use simple_lookup() now
  efivarfs: we can use simple_lookup() now
  make simple_lookup() usable for filesystems that set ->s_d_op
  configfs: don't open-code d_alloc_name()
  __rpc_lookup_create_exclusive: pass string instead of qstr
  rpc_create_*_dir: don't bother with qstr
  llist: llist_add() can use llist_add_batch()
  llist: fix/simplify llist_add() and llist_add_batch()
  fput: turn "list_head delayed_fput_list" into llist_head
  fs/file_table.c:fput(): add comment
  Safer ABI for O_TMPFILE

41d9884c

sunrpc: now we can just set ->s_d_op · dae3794f
Al Viro authored Jul 14, 2013
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
dae3794f
cgroup: we can use simple_lookup() now · 786e1448
Al Viro authored Jul 14, 2013
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
786e1448
efivarfs: we can use simple_lookup() now · 6e8cd2cb
Al Viro authored Jul 14, 2013
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
6e8cd2cb
make simple_lookup() usable for filesystems that set ->s_d_op · 74931da7
Al Viro authored Jul 14, 2013
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
74931da7
configfs: don't open-code d_alloc_name() · ec193cf5
Al Viro authored Jul 14, 2013
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
ec193cf5

__rpc_lookup_create_exclusive: pass string instead of qstr · d3db90b0

Al Viro authored Jul 14, 2013

... and use d_hash_and_lookup() instead of open-coding it, for fsck sake...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

d3db90b0

rpc_create_*_dir: don't bother with qstr · a95e691f
Al Viro authored Jul 14, 2013
```
just pass the name
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
a95e691f

Merge branch 'for_linus' of git://cavan.codon.org.uk/platform-drivers-x86 · 63345b47

Linus Torvalds authored Jul 13, 2013

Pull x86 platform driver updates from Matthew Garrett:
 "Nothing overly exciting here - a couple of new drivers that don't do a
  great deal, along with some miscellaneous fixes and a couple of small
  feature enablement patches"

* 'for_linus' of git://cavan.codon.org.uk/platform-drivers-x86:
  x86 platform drivers: fix gpio leak
  toshiba_acpi: Add dependency on SERIO_I8042
  asus-nb-wmi: set wapf=4 for ASUSTeK COMPUTER INC. 1015E/U
  Add trivial driver to disable Intel Smart Connect
  Add support driver for Intel Rapid Start Technology
  hp-wmi: add supports for POST code error
  asus-wmi: control wlan-led only if wapf == 4
  drivers/platform/x86/intel_ips: Convert to module_pci_driver
  asus-nb-wmi: ignore ALS notification key code
  asus-wmi: append newline to messages
  x86: asus-laptop: fix invalid point access
  x86: msi-laptop: fix memleak
  amilo-rfkill: Add dependency on SERIO_I8042
  dell-laptop: fix error return code in dell_init()
  hp-wmi: Enable hotkeys on some systems

63345b47

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 18fb38e2

Linus Torvalds authored Jul 13, 2013

Pull second round of input updates from Dmitry Torokhov:
 "An update to Elantech driver to support hardware v7, fix to the new
  cyttsp4 driver to use proper addressing, ads7846 device tree support
  and nspire-keypad got a small cleanup."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: nspire-keypad - replace magic offset with define
  Input: elantech - fix for newer hardware versions (v7)
  Input: cyttsp4 - use 16bit address for I2C/SPI communication
  Input: ads7846 - add device tree bindings
  Input: ads7846 - make sure we do not change platform data

18fb38e2

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · be9c6d91

Linus Torvalds authored Jul 13, 2013

Pull networking fixes from David Miller:
 "Just a bunch of small fixes and tidy ups:

   1) Finish the "busy_poll" renames, from Eliezer Tamir.

   2) Fix RCU stalls in IFB driver, from Ding Tianhong.

   3) Linearize buffers properly in tun/macvtap zerocopy code.

   4) Don't crash on rmmod in vxlan, from Pravin B Shelar.

   5) Spinlock used before init in alx driver, from Maarten Lankhorst.

   6) A sparse warning fix in bnx2x broke TSO checksums, fix from Dmitry
      Kravkov.

   7) Dummy and ifb driver load failure paths can oops, fixes from Tan
      Xiaojun and Ding Tianhong.

   8) Correct MTU calculations in IP tunnels, from Alexander Duyck.

   9) Account all TCP retransmits in SNMP stats properly, from Yuchung
      Cheng.

  10) atl1e and via-rhine do not handle DMA mapping failures properly,
      from Neil Horman.

  11) Various equal-cost multipath route fixes in ipv6 from Hannes
      Frederic Sowa"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (36 commits)
  ipv6: only static routes qualify for equal cost multipathing
  via-rhine: fix dma mapping errors
  atl1e: fix dma mapping warnings
  tcp: account all retransmit failures
  usb/net/r815x: fix cast to restricted __le32
  usb/net/r8152: fix integer overflow in expression
  net: access page->private by using page_private
  net: strict_strtoul is obsolete, use kstrtoul instead
  drivers/net/ieee802154: don't use devm_pinctrl_get_select_default() in probe
  drivers/net/ethernet/cadence: don't use devm_pinctrl_get_select_default() in probe
  drivers/net/can/c_can: don't use devm_pinctrl_get_select_default() in probe
  net/usb: add relative mii functions for r815x
  net/tipc: use %*phC to dump small buffers in hex form
  qlcnic: Adding Maintainers.
  gre: Fix MTU sizing check for gretap tunnels
  pkt_sched: sch_qfq: remove forward declaration of qfq_update_agg_ts
  pkt_sched: sch_qfq: improve efficiency of make_eligible
  gso: Update tunnel segmentation to support Tx checksum offload
  inet: fix spacing in assignment
  ifb: fix oops when loading the ifb failed
  ...

be9c6d91

Merge tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 03ce3ca4

Linus Torvalds authored Jul 13, 2013

Pull final round of SCSI updates from James Bottomley:
 "This is the remaining set of SCSI patches for the merge window.  It's
  mostly driver updates (scsi_debug, qla2xxx, storvsc, mp3sas).  There
  are also several bug fixes in fcoe, libfc, and megaraid_sas.  We also
  have a couple of core changes to try to make device destruction more
  deterministic"

* tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (46 commits)
  [SCSI] scsi constants: command, sense key + additional sense strings
  fcoe: Reduce number of sparse warnings
  fcoe: Stop fc_rport_priv structure leak
  libfcoe: Fix meaningless log statement
  libfc: Differentiate echange timer cancellation debug statements
  libfc: Remove extra space in fc_exch_timer_cancel definition
  fcoe: fix the link error status block sparse warnings
  fcoe: Fix smatch warning in fcoe_fdmi_info function
  libfc: Reject PLOGI from nodes with incompatible role
  [SCSI] enable destruction of blocked devices which fail LUN scanning
  [SCSI] Fix race between starved list and device removal
  [SCSI] megaraid_sas: fix a bug for 64 bit arches
  [SCSI] scsi_debug: reduce duplication between prot_verify_read and prot_verify_write
  [SCSI] scsi_debug: simplify offset calculation for dif_storep
  [SCSI] scsi_debug: invalidate protection info for unmapped region
  [SCSI] scsi_debug: fix NULL pointer dereference with parameters dif=0 dix=1
  [SCSI] scsi_debug: fix incorrectly nested kmap_atomic()
  [SCSI] scsi_debug: fix invalid address passed to kunmap_atomic()
  [SCSI] mpt3sas: Bump driver version to v02.100.00.00
  [SCSI] mpt3sas: when async scanning is enabled then while scanning, devices are removed but their transport layer entries are not removed
  ...

03ce3ca4

13 Jul, 2013 3 commits

Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · f8acc450

Linus Torvalds authored Jul 13, 2013

Pull scheduler fix from Thomas Gleixner:
 "Fix a potential deadlock versus hrtimers"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix HRTICK

f8acc450

Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 505608d2

Linus Torvalds authored Jul 13, 2013

Pull irq updates from Thomas Gleixner:
 - core fix for missing round up in the generic irq chip implementation
 - new irq chip for MOXA SoCs
 - a few fixes and cleanups in the irqchip drivers

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip: Add support for MOXA ART SoCs
  genirq: generic chip: Use DIV_ROUND_UP to calculate numchips
  irqchip: nvic: Fix wrong num_ct argument for irq_alloc_domain_generic_chips()
  irqchip: sun4i: Staticize sun4i_irq_ack()
  irqchip: vt8500: Staticize local symbols

505608d2

Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 0da27366

Linus Torvalds authored Jul 13, 2013

Pull timer updates from Thomas Gleixner:
 - watchdog fixes for full dynticks
 - improved debug output for full dynticks
 - remove an obsolete full dynticks check
 - two ARM SoC clocksource drivers for sharing across SoCs
 - tick broadcast fix for CPU hotplug

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick: broadcast: Check broadcast mode on CPU hotplug
  clocksource: arm_global_timer: Add ARM global timer support
  clocksource: Add Marvell Orion SoC timer
  nohz: Remove obsolete check for full dynticks CPUs to be RCU nocbs
  watchdog: Boot-disable by default on full dynticks
  watchdog: Rename confusing state variable
  watchdog: Register / unregister watchdog kthreads on sysctl control
  nohz: Warn if the machine can not perform nohz_full

0da27366