Commits · f03683b8fb7e03862d2f1366a16c1b01732a5741 · Kirill Smelkov / linux

05 Aug, 2011 3 commits

Merge branch 'core-urgent-for-linus' of... · f03683b8

Linus Torvalds authored Aug 04, 2011

Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  slab, lockdep: Annotate the locks before using them
  lockdep: Clear whole lockdep_map on initialization
  slab, lockdep: Annotate slab -> rcu -> debug_object -> slab
  lockdep: Fix up warning
  lockdep: Fix trace_hardirqs_on_caller()
  futex: Fix regression with read only mappings

f03683b8

Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx · 7f3bf7cd

Linus Torvalds authored Aug 04, 2011

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx:
  dmaengine: use DEFINE_IDR for static initialization
  ioat: fix xor_idx_to_desc
  Avoid section type conflict in dma/ioat/dma_v3.c
  ioat: Adding PCI IDs for IOAT devices on SandyBridge platforms

7f3bf7cd

cpuidle: Consistent spelling of cpuidle_idle_call() · cbc158d6

David Brown authored Aug 04, 2011

Commit a0bfa137 mispells
cpuidle_idle_call() on ARM and SH code.  Fix this to be consistent.

Cc: Kevin Hilman <khilman@deeprootsystems.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: x86@kernel.org
Cc: Len Brown <len.brown@intel.com>
Signed-off-by: David Brown <davidb@codeaurora.org>
[ Also done by Mark Brown - th ebug has been around forever, and was
  noticed in -next, but the idle tree never picked it up. Bad bad bad ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cbc158d6

04 Aug, 2011 37 commits

Merge branch 'devicetree/merge' of git://git.secretlab.ca/git/linux-2.6 · 53d1e658

Linus Torvalds authored Aug 04, 2011

* 'devicetree/merge' of git://git.secretlab.ca/git/linux-2.6:
  Revert "dt: add of_alias_scan and of_alias_get_id"
  dt: remove of_alias_get_id() reference

53d1e658

Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/parisc-2.6 · 455ce9d8

Linus Torvalds authored Aug 04, 2011

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/parisc-2.6:
  [PARISC] wire up sendmmsg syscall
  [PARISC] fix return type of __atomic64_add_return
  [PARISC] Fix futex support

455ce9d8

Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6 · 447e1363

Linus Torvalds authored Aug 04, 2011

* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6:
  [S390] signal: use set_restore_sigmask() helper
  [S390] smp: remove pointless comments in startup_secondary()
  [S390] qdio: Use kstrtoul_from_user
  [S390] sclp_async: Use kstrtoul_from_user
  [S390] exec: remove redundant set_fs(USER_DS)
  [S390] cpu hotplug: on cpu start wait until being marked active
  [S390] signal: convert to use set_current_blocked()
  [S390] asm offsets: fix coding style
  [S390] Add support for IBM zEnterprise 114
  [S390] dasd: check if raw track access is supported
  [S390] Use diagnose 308 for system reset
  [S390] Export store_status() function
  [S390] dasd: use vmalloc for statistics input buffer
  [S390] Add PSW restart shutdown trigger
  [S390] missing return in page_table_alloc_pgste
  [S390] qdio: 2nd stage retry on SIGA-W busy conditions

447e1363

eisa/pci_eisa.c: fix BUG introduced by · 82de9a0c

Arnaud Lacombe authored Aug 04, 2011

While `pci_eisa_driver' still refer `pci_eisa_init', the .probe() function
should not be called after init memory release, as pointed out by commit
74b9a297. The structure is still referenced in the drivers subsystem, and can
be accesseed through sysfs, so the modpost warning is a false positive. Mark
it as such.

In the same time, the warning referenced in 005bdad7 did only mention
`pci_eisa_driver', not `pci_eisa_pci_tbl', so remove its marking.

Broken-by: Arnaud Lacombe <lacombar@gmail.com> (in 005bdad7)
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Arnaud Lacombe <lacombar@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

82de9a0c

Revert "dt: add of_alias_scan and of_alias_get_id" · fe55c184

Grant Likely authored Aug 04, 2011

This reverts commit 750f463a.

of_alias_* still needs work to be generalized for 'promtree' dt
platforms, and to no implicitly create entries for available ids.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>

fe55c184

dt: remove of_alias_get_id() reference · 9e191b22

Grant Likely authored Aug 04, 2011

of_alias_get_id() is broken and being reverted. Remove the reference
to it and replace with a single incrementing id number.

There is no risk of regression here on the imx driver since the imx
change to use of_alias_get_id() is commit 22698aa2, "serial/imx: add
device tree probe support" which is new for v3.1, and it won't get
used unless CONFIG_OF is enabled and the board is booted using a
device tree. A single incrementing integer is sufficient for now.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Shawn Guo <shawn.guo@linaro.org>

9e191b22

slab, lockdep: Annotate the locks before using them · 30765b92

Peter Zijlstra authored Jul 28, 2011

Fernando found we hit the regular OFF_SLAB 'recursion' before we
annotate the locks, cure this.

The relevant portion of the stack-trace:

> [    0.000000]  [<c085e24f>] rt_spin_lock+0x50/0x56
> [    0.000000]  [<c04fb406>] __cache_free+0x43/0xc3
> [    0.000000]  [<c04fb23f>] kmem_cache_free+0x6c/0xdc
> [    0.000000]  [<c04fb2fe>] slab_destroy+0x4f/0x53
> [    0.000000]  [<c04fb396>] free_block+0x94/0xc1
> [    0.000000]  [<c04fc551>] do_tune_cpucache+0x10b/0x2bb
> [    0.000000]  [<c04fc8dc>] enable_cpucache+0x7b/0xa7
> [    0.000000]  [<c0bd9d3c>] kmem_cache_init_late+0x1f/0x61
> [    0.000000]  [<c0bba687>] start_kernel+0x24c/0x363
> [    0.000000]  [<c0bba0ba>] i386_start_kernel+0xa9/0xaf
Reported-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
Acked-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptopSigned-off-by: Ingo Molnar <mingo@elte.hu>

30765b92

lockdep: Clear whole lockdep_map on initialization · f59de899

Tejun Heo authored Jul 14, 2011

lockdep_init_map() only initializes parts of lockdep_map and triggers
kmemcheck warning when it is copied as a whole. There isn't anything
to be gained by clearing selectively. memset() the whole structure
and remove loop for ->class_cache[] clearing.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=35532Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Christian Casteyde <casteyde.christian@free.fr>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=35532Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110714131909.GJ3455@htj.dyndns.orgSigned-off-by: Ingo Molnar <mingo@elte.hu>

f59de899

slab, lockdep: Annotate slab -> rcu -> debug_object -> slab · 83835b3d

Peter Zijlstra authored Jul 22, 2011

Lockdep thinks there's lock recursion through:

	kmem_cache_free()
	  cache_flusharray()
	    spin_lock(&l3->list_lock)  <----------------.
	    free_block()                                |
	      slab_destroy()                            |
		call_rcu()                              |
		  debug_object_activate()               |
		    debug_object_init()                 |
		      __debug_object_init()             |
			kmem_cache_alloc()              |
			  cache_alloc_refill()          |
			    spin_lock(&l3->list_lock) --'

Now debug objects doesn't use SLAB_DESTROY_BY_RCU and hence there is no
actual possibility of recursing. Luckily debug objects marks it slab
with SLAB_DEBUG_OBJECTS so we can identify the thing.

Mark all SLAB_DEBUG_OBJECTS (all one!) slab caches with a special
lockdep key so that lockdep sees its a different cachep.

Also add a WARN on trying to create a SLAB_DESTROY_BY_RCU |
SLAB_DEBUG_OBJECTS cache, to avoid possible future trouble.
Reported-and-tested-by: Sebastian Siewior <sebastian@breakpoint.cc>
[ fixes to the initial patch ]
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1311341165.27400.58.camel@twinsSigned-off-by: Ingo Molnar <mingo@elte.hu>

83835b3d

lockdep: Fix up warning · 70a0686a

Peter Zijlstra authored Jul 25, 2011

On Sun, 2011-07-24 at 21:06 -0400, Arnaud Lacombe wrote:

> /src/linux/linux/kernel/lockdep.c: In function 'mark_held_locks':
> /src/linux/linux/kernel/lockdep.c:2471:31: warning: comparison of
> distinct pointer types lacks a cast

The warning is harmless in this case, but the below makes it go away.
Reported-by: Arnaud Lacombe <lacombar@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1311588599.2617.56.camel@laptopSigned-off-by: Ingo Molnar <mingo@elte.hu>

70a0686a

lockdep: Fix trace_hardirqs_on_caller() · 7d36b26b

Peter Zijlstra authored Jul 26, 2011

Commit dd4e5d3a ("lockdep: Fix trace_[soft,hard]irqs_[on,off]()
recursion") made a bit of a mess of the various checks and error
conditions.

In particular it moved the check for !irqs_disabled() before the
spurious enable test, resulting in some warnings.
Reported-by: Arnaud Lacombe <lacombar@gmail.com>
Reported-by: Dave Jones <davej@redhat.com>
Reported-and-tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1311679697.24752.28.camel@twinsSigned-off-by: Ingo Molnar <mingo@elte.hu>

7d36b26b

Boot up with usermodehelper disabled · 288d5abe

Linus Torvalds authored Aug 03, 2011

The core device layer sends tons of uevent notifications for each device
it finds, and if the kernel has been built with a non-empty
CONFIG_UEVENT_HELPER_PATH that will make us try to execute the usermode
helper binary for all these events very early in the boot.

Not only won't the root filesystem even be mounted at that point, we
literally won't have necessarily even initialized all the process
handling data structures at that point, which causes no end of silly
problems even when the usermode helper doesn't actually succeed in
executing.

So just use our existing infrastructure to disable the usermodehelpers
to make the kernel start out with them disabled.  We enable them when
we've at least initialized stuff a bit.

Problems related to an uninitialized

	init_ipc_ns.ids[IPC_SHM_IDS].rw_mutex

reported by various people.
Reported-by: Manuel Lauss <manuel.lauss@googlemail.com>
Reported-by: Richard Weinberger <richard@nod.at>
Reported-by: Marc Zyngier <maz@misterjones.org>
Acked-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

288d5abe

x86: don't include xen/xen.h in <asm/io.h> unless XEN is enabled · 33f35f2a

Linus Torvalds authored Aug 03, 2011

Dmitry Kasatkin reports:
  "kernel-devel package with kernel headers have no <include/xen>
   directory if XEN is disabled.  Modules which inclide asm/io.h won't
   compile.

   XEN related content is behind the CONFIG_XEN flag in the io.h.  And
   <xen/xen.h> should be also behind CONFIG_XEN flag."

So move the include of <xen/xen.h> down into the section that is
conditional on CONFIG_XEN.
Reported-by: Dmitry Kasatkin <dmitry.kasatkin@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

33f35f2a

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 0ea64844

Linus Torvalds authored Aug 03, 2011

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: ad7879 - fix deficient device disable
  Input: gpio_keys - fix two typos in devicetree documentation
  Input: mma8450 - add device tree probe support
  Input: gpio_keys - return proper error code if memory allocation fails
  Input: lm8323 - add missing device_remove_file for dev_attr_time
  Input: tegra-kbc - fix computation of polling time
  Input: kxtj9 - explicitly include module.h
  Input: psmouse - hgpk.c needs module.h

0ea64844

Merge branch 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6 · 35e51fe8

Linus Torvalds authored Aug 03, 2011

* 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6:
  cpuidle: stop depending on pm_idle
  x86 idle: move mwait_idle_with_hints() to where it is used
  cpuidle: replace xen access to x86 pm_idle and default_idle
  cpuidle: create bootparam "cpuidle.off=1"
  mrst_pmu: driver for Intel Moorestown Power Management Unit

35e51fe8

Merge branch 'apei-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 · c0c770e6

Linus Torvalds authored Aug 03, 2011

* 'apei-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6:
  ACPI, APEI, EINJ Param support is disabled by default
  APEI GHES: 32-bit buildfix
  ACPI: APEI build fix
  ACPI, APEI, GHES: Add hardware memory error recovery support
  HWPoison: add memory_failure_queue()
  ACPI, APEI, GHES, Error records content based throttle
  ACPI, APEI, GHES, printk support for recoverable error via NMI
  lib, Make gen_pool memory allocator lockless
  lib, Add lock-less NULL terminated single list
  Add Kconfig option ARCH_HAVE_NMI_SAFE_CMPXCHG
  ACPI, APEI, Add WHEA _OSC support
  ACPI, APEI, Add APEI bit support in generic _OSC call
  ACPI, APEI, GHES, Support disable GHES at boot time
  ACPI, APEI, GHES, Prevent GHES to be built as module
  ACPI, APEI, Use apei_exec_run_optional in APEI EINJ and ERST
  ACPI, APEI, Add apei_exec_run_optional
  ACPI, APEI, GHES, Do not ratelimit fatal error printk before panic
  ACPI, APEI, ERST, Fix erst-dbg long record reading issue
  ACPI, APEI, ERST, Prevent erst_dbg from loading if ERST is disabled

c0c770e6

Merge branch 'linus' into core/urgent · d7619fe3
Ingo Molnar authored Aug 04, 2011

d7619fe3

dmaengine: use DEFINE_IDR for static initialization · 21ef4b8b

Axel Lin authored Jul 20, 2011

We could use DEFINE_IDR for statically allocated idr
that allow us to save a few lines of code.

And also remove unneeded mutex_init() for dma_list_mutex, as
dma_list_mutex is initialized automatically by DEFINE_MUTEX().
Signed-off-by: Axel Lin <axel.lin@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

21ef4b8b

ioat: fix xor_idx_to_desc · d0b0c8c7

Dan Williams authored Jul 22, 2011

For versions of the device that implement operation-types 0x87, 0x88
(IOAT_OP_XOR, IOAT_OP_XOR_VAL) this map determines whether a given
source is located in the base or extended descriptor.  Source addresses
6 through 8 require an extended descriptor, hence 0xe0, not 0xd0.  No
shipping hardware currently implements these operation types.
Reported-by: Evgueni Smogailov <evgueni.smogailov@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

d0b0c8c7

Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · a9e4e6e1

Linus Torvalds authored Aug 03, 2011

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
  tcm_fc: Handle DDP/SW fc_frame_payload_get failures in ft_recv_write_data
  target: Fix bug for transport_generic_wait_for_tasks with direct operation
  target: iscsi_target depends on NET
  target: Fix WRITE_SAME_16 lba assignment breakage
  MAINTAINERS: Add target-devel list for drivers/target/
  iscsi-target: Fix CONFIG_SMP=n and CONFIG_MODULES=n build failure
  iscsi-target: Fix snprintf usage with MAX_PORTAL_LEN
  iscsi-target: Fix uninitialized usage of cmd->pad_bytes
  iscsi-target: strlen() doesn't count the terminator
  iscsi-target: Fix NULL dereference on allocation failure

a9e4e6e1

Merge branch 'devicetree/next' of git://git.secretlab.ca/git/linux-2.6 · 27665ffa
Linus Torvalds authored Aug 03, 2011
```
* 'devicetree/next' of git://git.secretlab.ca/git/linux-2.6:
  dt: add of_alias_scan and of_alias_get_id
```
27665ffa
Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · 14595f70
Linus Torvalds authored Aug 03, 2011
```
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: use kzalloc in ext4_kzalloc()
```
14595f70

shm: optimize exit_shm() · 298507d4

Vasiliy Kulikov authored Aug 03, 2011

We may optimistically check .in_use == 0 without holding the rw_mutex:
it's the common case, and if it's zero, there certainly won't be any
segments associated with us.

After taking the lock, the idr_for_each() will do the right thing, so we
could now drop the re-check inside the lock without any real cost.  But
it won't hurt.
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

298507d4

shm: fix wrong tests · 33a30ed4

Vasiliy Kulikov authored Aug 03, 2011

Commit 4c677e2e ("shm: optimize locking and ipc_namespace getting")
introduced a copy-paste bug.  Due to the bug cycle optimizations were
disabled.
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

33a30ed4

tmpfs: expand "help" to explain value of TMPFS_POSIX_ACL · 206506cc

Robert P. J. Day authored Aug 03, 2011

Expand the fs/Kconfig "help" info to clarify why it's a bad idea to
deselect the TMPFS_POSIX_ACL config variable.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

206506cc

mm: clarify the radix_tree exceptional cases · 8079b1c8

Hugh Dickins authored Aug 03, 2011

Make the radix_tree exceptional cases, mostly in filemap.c, clearer.

It's hard to devise a suitable snappy name that illuminates the use by
shmem/tmpfs for swap, while keeping filemap/pagecache/radix_tree
generality.  And akpm points out that /* radix_tree_deref_retry(page) */
comments look like calls that have been commented out for unknown
reason.

Skirt the naming difficulty by rearranging these blocks to handle the
transient radix_tree_deref_retry(page) case first; then just explain the
remaining shmem/tmpfs swap case in a comment.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8079b1c8

tmpfs radix_tree: locate_item to speed up swapoff · e504f3fd

Hugh Dickins authored Aug 03, 2011

We have already acknowledged that swapoff of a tmpfs file is slower than
it was before conversion to the generic radix_tree: a little slower
there will be acceptable, if the hotter paths are faster.

But it was a shock to find swapoff of a 500MB file 20 times slower on my
laptop, taking 10 minutes; and at that rate it significantly slows down
my testing.

Now, most of that turned out to be overhead from PROVE_LOCKING and
PROVE_RCU: without those it was only 4 times slower than before; and
more realistic tests on other machines don't fare as badly.

I've tried a number of things to improve it, including tagging the swap
entries, then doing lookup by tag: I'd expected that to halve the time,
but in practice it's erratic, and often counter-productive.

The only change I've so far found to make a consistent improvement, is
to short-circuit the way we go back and forth, gang lookup packing
entries into the array supplied, then shmem scanning that array for the
target entry. Scanning in place doubles the speed, so it's now only
twice as slow as before (or three times slower when the PROVEs are on).

So, add radix_tree_locate_item() as an expedient, once-off,
single-caller hack to do the lookup directly in place. #ifdef it on
CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited
applicability as save space in other configurations. And, sadly,
#include sched.h for cond_resched().
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e504f3fd

mm: a few small updates for radix-swap · 31475dd6

Hugh Dickins authored Aug 03, 2011

Remove PageSwapBacked (!page_is_file_cache) cases from
add_to_page_cache_locked() and add_to_page_cache_lru(): those pages now
go through shmem_add_to_page_cache().

Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(),
and add a comment on swap entries to invalidate_mapping_pages().

And mincore_page() uses find_get_page() on what might be shmem or a
tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to
find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef).
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

31475dd6

tmpfs: use kmemdup for short symlinks · 69f07ec9

Hugh Dickins authored Aug 03, 2011

But we've not yet removed the old swp_entry_t i_direct[16] from
shmem_inode_info. That's because it was still being shared with the
inline symlink. Remove it now (saving 64 or 128 bytes from shmem inode
size), and use kmemdup() for short symlinks, say, those up to 128 bytes.

I wonder why mpol_free_shared_policy() is done in shmem_destroy_inode()
rather than shmem_evict_inode(), where we usually do such freeing? I
guess it doesn't matter, and I'm not into NUMA mpol testing right now.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

69f07ec9

tmpfs: convert shmem_writepage and enable swap · 6922c0c7

Hugh Dickins authored Aug 03, 2011

Convert shmem_writepage() to use shmem_delete_from_page_cache() to use
shmem_radix_tree_replace() to substitute swap entry for page pointer
atomically in the radix tree.

As with shmem_add_to_page_cache(), it's not entirely satisfactory to be
copying such code from delete_from_swap_cache, but again judged easier
to sell than making its other callers go through the extras.

Remove the toy implementation's shmem_put_swap() and shmem_get_swap(),
now unreferenced, and the hack to disable swap: it's now good to go.

The way things have worked out, info->lock no longer helps to guard the
shmem_swaplist: we increment swapped under shmem_swaplist_mutex only.
That global mutex exclusion between shmem_writepage() and shmem_unuse()
is not pretty, and we ought to find another way; but it's been forced on
us by recent race discoveries, not a consequence of this patchset.

And what has become of the WARN_ON_ONCE(1) free_swap_and_cache() if a
swap entry was found already present? That's no longer possible, the
(unknown) one inserting this page into filecache would hit the swap
entry occupying that slot.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6922c0c7

tmpfs: convert mem_cgroup shmem to radix-swap · aa3b1895

Hugh Dickins authored Aug 03, 2011

Remove mem_cgroup_shmem_charge_fallback(): it was only required when we
had to move swappage to filecache with GFP_NOWAIT.

Remove the GFP_NOWAIT special case from mem_cgroup_cache_charge(), by
moving its call out from shmem_add_to_page_cache() to two of thats three
callers.  But leave it doing mem_cgroup_uncharge_cache_page() on error:
although asymmetrical, it's easier for all 3 callers to handle.

These two changes would also be appropriate if anyone were to start
using shmem_read_mapping_page_gfp() with GFP_NOWAIT.

Remove mem_cgroup_get_shmem_target(): mc_handle_file_pte() can test
radix_tree_exceptional_entry() to get what it needs for itself.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

aa3b1895

tmpfs: convert shmem_getpage_gfp to radix-swap · 54af6042

Hugh Dickins authored Aug 03, 2011

Convert shmem_getpage_gfp(), the engine-room of shmem, to expect page or
swap entry returned from radix tree by find_lock_page().

Whereas the repetitive old method proceeded mainly under info->lock,
dropping and repeating whenever one of the conditions needed was not
met, now we can proceed without it, leaving shmem_add_to_page_cache() to
check for a race.

This way there is no need to preallocate a page, no need for an early
radix_tree_preload(), no need for mem_cgroup_shmem_charge_fallback().

Move the error unwinding down to the bottom instead of repeating it
throughout.  ENOSPC handling is a little different from before: there is
no longer any race between find_lock_page() and finding swap, but we can
arrive at ENOSPC before calling shmem_recalc_inode(), which might
occasionally discover freed space.

Be stricter to check i_size before returning.  info->lock is used for
little but alloced, swapped, i_blocks updates.  Move i_blocks updates
out from under the max_blocks check, so even an unlimited size=0 mount
can show accurate du.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

54af6042

tmpfs: convert shmem_unuse_inode to radix-swap · 46f65ec1

Hugh Dickins authored Aug 03, 2011

Convert shmem_unuse_inode() to use a lockless gang lookup of the radix
tree, searching for matching swap.

This is somewhat slower than the old method: because of repeated radix
tree descents, because of copying entries up, but probably most because
the old method noted and skipped once a vector page was cleared of swap.
Perhaps we can devise a use of radix tree tagging to achieve that later.

shmem_add_to_page_cache() uses shmem_radix_tree_replace() to compensate
for the lockless lookup by checking that the expected entry is in place,
under lock. It is not very satisfactory to be copying this much from
add_to_page_cache_locked(), but I think easier to sell than insisting
that every caller of add_to_page_cache*() go through the extras.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

46f65ec1

tmpfs: convert shmem_truncate_range to radix-swap · 7a5d0fbb

Hugh Dickins authored Aug 03, 2011

Disable the toy swapping implementation in shmem_writepage() - it's hard
to support two schemes at once - and convert shmem_truncate_range() to a
lockless gang lookup of swap entries along with pages, freeing both.

Since the second loop tightens its noose until all entries of either
kind have been squeezed out (and we shall make sure that there's not an
instant when neither is visible), there is no longer a need for yet
another pass below.

shmem_radix_tree_replace() compensates for the lockless lookup by
checking that the expected entry is in place, under lock, before
replacing it.  Here it just deletes, but will be used in later patches
to substitute swap entry for page or page for swap entry.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7a5d0fbb

tmpfs: copy truncate_inode_pages_range · bda97eab

Hugh Dickins authored Aug 03, 2011

Bring truncate.c's code for truncate_inode_pages_range() inline into
shmem_truncate_range(), replacing its first call (there's a followup
call below, but leave that one, it will disappear next).

Don't play with it yet, apart from leaving out the cleancache flush, and
(importantly) the nrpages == 0 skip, and moving shmem_setattr()'s
partial page preparation into its partial page handling.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

bda97eab

tmpfs: miscellaneous trivial cleanups · 41ffe5d5

Hugh Dickins authored Aug 03, 2011

While it's at its least, make a number of boring nitpicky cleanups to
shmem.c, mostly for consistency of variable naming.  Things like "swap"
instead of "entry", "pgoff_t index" instead of "unsigned long idx".

And since everything else here is prefixed "shmem_", better change
init_tmpfs() to shmem_init().
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

41ffe5d5

tmpfs: demolish old swap vector support · 285b2c4f

Hugh Dickins authored Aug 03, 2011

The maximum size of a shmem/tmpfs file has been limited by the maximum
size of its triple-indirect swap vector. With 4kB page size, maximum
filesize was just over 2TB on a 32-bit kernel, but sadly one eighth of
that on a 64-bit kernel. (With 8kB page size, maximum filesize was just
over 4TB on a 64-bit kernel, but 16TB on a 32-bit kernel,
MAX_LFS_FILESIZE being then more restrictive than swap vector layout.)

It's a shame that tmpfs should be more restrictive than ramfs, and this
limitation has now been noticed. Add another level to the swap vector?
No, it became obscure and hard to maintain, once I complicated it to
make use of highmem pages nine years ago: better choose another way.

Surely, if 2.4 had had the radix tree pagecache introduced in 2.5, then
tmpfs would never have invented its own peculiar radix tree: we would
have fitted swap entries into the common radix tree instead, in much the
same way as we fit swap entries into page tables.

And why should each file have a separate radix tree for its pages and
for its swap entries? The swap entries are required precisely where and
when the pages are not. We want to put them together in a single radix
tree: which can then avoid much of the locking which was needed to
prevent them from being exchanged underneath us.

This also avoids the waste of memory devoted to swap vectors, first in
the shmem_inode itself, then at least two more pages once a file grew
beyond 16 data pages (pages accounted by df and du, but not by memcg).
Allocated upfront, to avoid allocation when under swapping pressure, but
pure waste when CONFIG_SWAP is not set - I have never spattered around
the ifdefs to prevent that, preferring this move to sharing the common
radix tree instead.

There are three downsides to sharing the radix tree. One, that it binds
tmpfs more tightly to the rest of mm, either requiring knowledge of swap
entries in radix tree there, or duplication of its code here in shmem.c.
I believe that the simplications and memory savings (and probable higher
performance, not yet measured) justify that.

Two, that on HIGHMEM systems with SWAP enabled, it's the lowmem radix
nodes that cannot be freed under memory pressure - whereas before it was
the less precious highmem swap vector pages that could not be freed.
I'm hoping that 64-bit has now been accessible for long enough, that the
highmem argument has grown much less persuasive.

Three, that swapoff is slower than it used to be on tmpfs files, since
it's using a simple generic mechanism not tailored to it: I find this
noticeable, and shall want to improve, but maybe nobody else will
notice.

So... now remove most of the old swap vector code from shmem.c. But,
for the moment, keep the simple i_direct vector of 16 pages, with simple
accessors shmem_put_swap() and shmem_get_swap(), as a toy implementation
to help mark where swap needs to be handled in subsequent patches.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

285b2c4f