Commits · 34de35c869250ea330de6331b03fcf2d68d1db5c · Kirill Smelkov / linux

14 May, 2015 18 commits

userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key · 34de35c8

Andrea Arcangeli authored Jun 13, 2014

userfaultfd needs to wake all waitqueues (pass 0 as nr parameter),
instead of the current hardcoded 1 (that would wake just the first
waitqueue in the head list).
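
The effect of the nr parameter can be modeled in userspace. The sketch below is not the kernel code: `struct waiter` and `wake_up_model` are hypothetical names, and the loop only assumes that the countdown applies to exclusive waiters and stops the walk when it reaches exactly zero. Under that assumption, passing 1 stops after the first exclusive waiter, while passing 0 never hits zero and wakes every waiter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a wait-queue entry. */
struct waiter {
    bool exclusive; /* models an exclusive waiter */
    bool woken;
};

/*
 * Walk the queue and wake waiters. The countdown is only decremented
 * for exclusive waiters, and the walk stops when it reaches exactly
 * zero. Starting from 0 the counter goes negative and never stops the
 * walk, so every waiter is woken. Returns the number of waiters woken.
 */
int wake_up_model(struct waiter *q, size_t n, int nr_exclusive)
{
    int woken = 0;

    for (size_t i = 0; i < n; i++) {
        q[i].woken = true;
        woken++;
        if (q[i].exclusive && !--nr_exclusive)
            break;
    }
    return woken;
}
```

With three exclusive waiters queued, `wake_up_model(q, 3, 1)` wakes only the first one and `wake_up_model(q, 3, 0)` wakes all three.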

34de35c8

userfaultfd: linux/Documentation/vm/userfaultfd.txt · 87d7a171

Andrea Arcangeli authored Mar 04, 2015

Add documentation.

87d7a171

kvm: fix crash in kvm_vcpu_reload_apic_access_page · 978c6891

Andrea Arcangeli authored May 08, 2015

memslot->userfault_addr is set by the kernel with an mmap executed
from the kernel, but userland can still munmap it. That leaves
memslot->userfault_addr pointing to a host virtual address that has no
vma or mapping, which leads to the oops below.

[  327.538306] BUG: unable to handle kernel paging request at fffffffffffffffe
[  327.538407] IP: [<ffffffff811a7b55>] put_page+0x5/0x50
[  327.538474] PGD 1a01067 PUD 1a03067 PMD 0
[  327.538529] Oops: 0000 [#1] SMP
[  327.538574] Modules linked in: macvtap macvlan xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT iptable_filter ip_tables tun bridge stp llc rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ipmi_devintf iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp dcdbas intel_rapl kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core ipmi_si ipmi_msghandler acpi_pad wmi acpi_power_meter lpc_ich mfd_core mei_me
[  327.539488]  mei shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc mlx4_ib ib_sa ib_mad ib_core mlx4_en vxlan ib_addr ip_tunnel xfs libcrc32c sd_mod crc_t10dif crct10dif_common crc32c_intel mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm drm ahci i2c_core libahci mlx4_core libata tg3 ptp pps_core megaraid_sas ntb dm_mirror dm_region_hash dm_log dm_mod
[  327.539956] CPU: 3 PID: 3161 Comm: qemu-kvm Not tainted 3.10.0-240.el7.userfault19.4ca4011.x86_64.debug #1
[  327.540045] Hardware name: Dell Inc. PowerEdge R420/0CN7CM, BIOS 2.1.2 01/20/2014
[  327.540115] task: ffff8803280ccf00 ti: ffff880317c58000 task.ti: ffff880317c58000
[  327.540184] RIP: 0010:[<ffffffff811a7b55>]  [<ffffffff811a7b55>] put_page+0x5/0x50
[  327.540261] RSP: 0018:ffff880317c5bcf8  EFLAGS: 00010246
[  327.540313] RAX: 00057ffffffff000 RBX: ffff880616a20000 RCX: 0000000000000000
[  327.540379] RDX: 0000000000002014 RSI: 00057ffffffff000 RDI: fffffffffffffffe
[  327.540445] RBP: ffff880317c5bd10 R08: 0000000000000103 R09: 0000000000000000
[  327.540511] R10: 0000000000000000 R11: 0000000000000000 R12: fffffffffffffffe
[  327.540576] R13: 0000000000000000 R14: ffff880317c5bd70 R15: ffff880317c5bd50
[  327.540643] FS:  00007fd230b7f700(0000) GS:ffff880630800000(0000) knlGS:0000000000000000
[  327.540717] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  327.540771] CR2: fffffffffffffffe CR3: 000000062a2c3000 CR4: 00000000000427e0
[  327.540837] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  327.540904] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  327.540974] Stack:
[  327.541008]  ffffffffa05d6d0c ffff880616a20000 0000000000000000 ffff880317c5bdc0
[  327.541093]  ffffffffa05ddaa2 0000000000000000 00000000002191bf 00000042f3feab2d
[  327.541177]  00000042f3feab2d 0000000000000002 0000000000000001 0321000000000000
[  327.541261] Call Trace:
[  327.541321]  [<ffffffffa05d6d0c>] ? kvm_vcpu_reload_apic_access_page+0x6c/0x80 [kvm]
[  327.543615]  [<ffffffffa05ddaa2>] vcpu_enter_guest+0x3f2/0x10f0 [kvm]
[  327.545918]  [<ffffffffa05e2f10>] kvm_arch_vcpu_ioctl_run+0x2b0/0x5a0 [kvm]
[  327.548211]  [<ffffffffa05e2d02>] ? kvm_arch_vcpu_ioctl_run+0xa2/0x5a0 [kvm]
[  327.550500]  [<ffffffffa05ca845>] kvm_vcpu_ioctl+0x2b5/0x680 [kvm]
[  327.552768]  [<ffffffff810b8d12>] ? creds_are_invalid.part.1+0x12/0x50
[  327.555069]  [<ffffffff810b8d71>] ? creds_are_invalid+0x21/0x30
[  327.557373]  [<ffffffff812d6066>] ? inode_has_perm.isra.49.constprop.65+0x26/0x80
[  327.559663]  [<ffffffff8122d985>] do_vfs_ioctl+0x305/0x530
[  327.561917]  [<ffffffff8122dc51>] SyS_ioctl+0xa1/0xc0
[  327.564185]  [<ffffffff816de829>] system_call_fastpath+0x16/0x1b
[  327.566480] Code: 0b 31 f6 4c 89 e7 e8 4b 7f ff ff 0f 0b e8 24 fd ff ff e9 a9 fd ff ff 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 <48> f7 07 00 c0 00 00 55 48 89 e5 75 2a 8b 47 1c 85 c0 74 1e f0

978c6891

mm: gup: use get_user_pages_fast instead of get_user_pages_unlocked · bc26cb8c

Andrea Arcangeli authored Sep 29, 2014

Just an optimization, where possible use get_user_pages_fast.

bc26cb8c

mm: gup: make get_user_pages_fast and __get_user_pages_fast latency conscious · 051bd3c6

Andrea Arcangeli authored Oct 02, 2014

This teaches gup_fast and __gup_fast to re-enable irqs and
cond_resched() if possible every BATCH_PAGES.

This must be implemented by other archs as well and it's a requirement
before converting more get_user_pages() to get_user_pages_fast() as an
optimization (instead of using get_user_pages_unlocked which would be
slower).
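
The batching idea can be sketched in userspace as follows. This is only a model under stated assumptions: the value of `BATCH_PAGES`, the `struct batch_stats` counters, and the `pin_one_page` placeholder are all hypothetical, and the comments mark where the real code would disable and re-enable irqs and call cond_resched().

```c
#include <stddef.h>

#define BATCH_PAGES 512 /* hypothetical value, chosen for illustration */

/* Counters standing in for the real side effects. */
struct batch_stats {
    size_t pages_done;
    size_t resched_points; /* gaps between batches where rescheduling could happen */
};

/* Placeholder for pinning one page; always succeeds in this model. */
size_t pin_one_page(size_t index)
{
    (void)index;
    return 1;
}

/* Process nr_pages in chunks of at most BATCH_PAGES. */
void gup_batched_model(size_t nr_pages, struct batch_stats *st)
{
    size_t done = 0;

    st->pages_done = 0;
    st->resched_points = 0;

    while (done < nr_pages) {
        size_t batch = nr_pages - done;

        if (batch > BATCH_PAGES)
            batch = BATCH_PAGES;

        /* irqs would be disabled here */
        for (size_t i = 0; i < batch; i++)
            st->pages_done += pin_one_page(done + i);
        /* irqs would be re-enabled here */

        done += batch;
        if (done < nr_pages)
            st->resched_points++; /* cond_resched() would go here */
    }
}
```

In this model a request for 1300 pages is handled as batches of 512, 512, and 276 pages, with two points in between where the CPU could be yielded.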

051bd3c6

mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode · 123cb69c

Andrea Arcangeli authored May 27, 2013

This adds compaction to zone_reclaim so THP enabled won't decrease the
NUMA locality with /proc/sys/vm/zone_reclaim_mode > 0.

It is important to boot with numa_zonelist_order=n (n means nodes) to
get more accurate NUMA locality if there are multiple zones per node.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

123cb69c

mm: zone_reclaim: after a successful zone_reclaim check the min watermark · ea160218

Andrea Arcangeli authored Jul 15, 2013

If we're in the fast path and zone_reclaim() succeeded, it means we
freed enough memory and we can use the min watermark to have some
margin against concurrent allocations from other CPUs or interrupts.

ea160218

mm: zone_reclaim: compaction: export compact_zone_order() · b6c2abee

Andrea Arcangeli authored Jun 03, 2013

Needed by zone_reclaim_mode compaction-awareness.

b6c2abee

mm: zone_reclaim: compaction: increase the high order pages in the watermarks · b14cae31

Andrea Arcangeli authored May 27, 2013

Prevent the scaling down from reducing the watermarks too much.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

b14cae31

mm: compaction: don't require high order pages below min wmark · 7a271558

Andrea Arcangeli authored Jul 10, 2013

The min wmark should be satisfied with just 1 hugepage. And the other
wmarks should be adjusted accordingly. We need to succeed the low
wmark check if there's some significant amount of 0 order pages, but
we don't need plenty of high order pages because the PF_MEMALLOC paths
don't require those. Creating a ton of high order pages that cannot be
allocated by the high order allocation paths (no PF_MEMALLOC) is quite
wasteful because they can be split into lower order pages before
anybody has a chance to allocate them.

7a271558

mm: zone_reclaim: compaction: don't depend on kswapd to invoke reset_isolation_suitable · af24516e

Andrea Arcangeli authored May 27, 2013

If kswapd never needs to run (only __GFP_NO_KSWAPD allocations and
plenty of free memory), compaction is otherwise crippled and stops
running for a while after the free/isolation cursors meet. After that,
allocation can fail for a full cycle of compaction_deferred, until
compaction_restarting finally resets it again.

Stopping compaction for a full cycle after the cursors meet, even if
it never failed and it's not going to fail, doesn't make sense.

We already throttle compaction CPU utilization using
defer_compaction. We shouldn't prevent compaction from running after
each pass completes when the cursors meet, unless it failed.

This makes direct compaction functional again. The throttling of
direct compaction is still controlled by the defer_compaction
logic.

kswapd still won't risk resetting compaction, and it will wait for
direct compaction to do so. Not sure if this is ideal but it at least
decreases the risk of kswapd doing too much work. kswapd will only run
one pass of compaction until some allocation invokes compaction again.

This decreased reliability of compaction was introduced in commit
62997027.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>

af24516e

mm: zone_reclaim: compaction: scan all memory with /proc/sys/vm/compact_memory · 02eaa78b

Andrea Arcangeli authored May 27, 2013

Reset the stats so /proc/sys/vm/compact_memory will scan all memory.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>

02eaa78b

mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED · 43dc77b0

Andrea Arcangeli authored May 27, 2013

Zone reclaim locked breaks zone_reclaim_mode=1. If more than one
thread allocates memory at the same time, it forces a premature
allocation into remote NUMA nodes even when there's plenty of clean
cache to reclaim in the local nodes.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

43dc77b0

mm: fix the theoretical compound_lock() vs prep_new_page() race · 93585352

Oleg Nesterov authored Dec 19, 2013

get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
compound_lock(). In theory this page_head can be already freed and
reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
get_page_unless_zero() can succeed right after set_page_refcounted(),
and compound_lock() can race with the non-atomic __SetPageHead().

Perhaps we should rework the thp locking (under discussion), but
until then this patch moves set_page_refcounted() and adds wmb()
to ensure that page->_count != 0 comes as a last change.
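
The ordering requirement can be illustrated with C11 atomics. This is an analogy rather than the kernel code: `struct page_model`, `PG_HEAD_MODEL`, and both functions are hypothetical, and a release store paired with an acquire compare-and-swap stands in for the wmb() placed before set_page_refcounted(). The point is that the non-atomic setup is finished before the count becomes nonzero, so a reader that manages to take a reference also sees the completed setup.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define PG_HEAD_MODEL 0x1ul /* hypothetical flag bit for this model */

struct page_model {
    unsigned long flags; /* plain field, written non-atomically */
    atomic_int count;    /* models page->_count */
};

/* Finish all non-atomic setup first, then publish a nonzero count last. */
void prep_new_page_model(struct page_model *p, bool compound)
{
    if (compound)
        p->flags |= PG_HEAD_MODEL;
    /* The release store plays the role of wmb() plus set_page_refcounted(). */
    atomic_store_explicit(&p->count, 1, memory_order_release);
}

/* Take a reference only if the count is already nonzero. */
bool get_page_unless_zero_model(struct page_model *p)
{
    int old = atomic_load_explicit(&p->count, memory_order_relaxed);

    while (old != 0) {
        /* On failure, old is reloaded with the current count. */
        if (atomic_compare_exchange_weak_explicit(&p->count, &old, old + 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return true;
    }
    return false;
}
```

Before `prep_new_page_model()` runs, `get_page_unless_zero_model()` fails because the count is still zero. Once it succeeds, the flags written during setup are guaranteed to be visible to the caller.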

I am not sure about other callers of set_page_refcounted(), but at
first glance they look fine to me.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

93585352

oom: allow !__GFP_FS allocations access emergency reserves like __GFP_NOFAIL · fa175d10

Andrea Arcangeli authored Jun 23, 2014

With the previous two commits I cannot reproduce any ext4 related
livelocks anymore, however I hit ext4 memory corruption. ext4 thinks
it can handle alloc_pages failing and it doesn't use __GFP_NOFAIL in
some places but it actually cannot. No surprise, as those error paths
couldn't ever run so they're likely untested.

I logged all the stack traces of all ext4 failures that lead to the
final ext4 corruption; at least one of them should be the culprit (the
last ones are more probable). The actual bug in the error paths
should be found by code review (or the error paths should be deleted
and __GFP_NOFAIL should be added to the gfp_mask).

Until ext4 is fixed, it is safer to treat !__GFP_FS like __GFP_NOFAIL
if TIF_MEMDIE is not set (so we cannot exercise any new allocation
error path in kernel threads, because they're never picked as OOM
killer victims and TIF_MEMDIE never gets set on them).

I assume other filesystems may have become complacent of this
accommodating allocator behavior that cannot fail an allocation if
invoked by a kernel thread too, but the longer we keep the
__GFP_NOFAIL behavior in should_alloc_retry for small order
allocations, the less robust these error paths will become and the
harder it will be to remove this livelock prone assumption in
should_alloc_retry. In fact we should remove that assumption not just
for !__GFP_FS allocations.

In practice with this fix there's no regression and all livelocks are
still gone. The only risk in this approach is to extinguish the
emergency reserves earlier than before but only during OOM (during
normal runtime GFP_ATOMIC allocation or other __GFP_MEMALLOC
allocation reliability is not affected). Clearly this actually reduces
the livelock risk (verified in practice too) so it is a low risk net
improvement to the OOM handling with no risk of regression because
this way no new allocation error path is exercised.

fa175d10

oom: fix ext4 __GFP_NOFAIL livelock · 47fb3887

Andrea Arcangeli authored Jun 22, 2014

The previous commit fixed an ext4 livelock by not making !__GFP_FS
allocations behave similarly to __GFP_NOFAIL and I mentioned how
__GFP_NOFAIL is livelock prone.

After letting the trinity load run for a while I actually hit the
very __GFP_NOFAIL livelock too:

 #0  get_page_from_freelist (gfp_mask=0x20858, nodemask=0x0 <irq_stack_union>, order=0x0, zonelist=0xffff88007fffc100, high_zoneidx=0x2, alloc_flags=0xc0, preferred_zone=0xffff88007fffa840, classzone_idx=classzone_idx@entry=0x1, migratetype=migratetype@entry=0x2) at mm/page_alloc.c:1953
 #1  0xffffffff81178e88 in __alloc_pages_slowpath (migratetype=0x2, classzone_idx=0x1, preferred_zone=0xffff88007fffa840, nodemask=0x0 <irq_stack_union>, high_zoneidx=ZONE_NORMAL, zonelist=0xffff88007fffc100, order=0x0, gfp_mask=0x20858) at mm/page_alloc.c:2597
 #2  __alloc_pages_nodemask (gfp_mask=<optimized out>, order=0x0, zonelist=0xffff87fffffffffa, nodemask=0x0 <irq_stack_union>) at mm/page_alloc.c:2832
 #3  0xffffffff811becab in alloc_pages_current (gfp=0x20858, order=0x0) at mm/mempolicy.c:2100
 #4  0xffffffff8116e450 in alloc_pages (order=0x0, gfp_mask=0x20858) at include/linux/gfp.h:336
 #5  __page_cache_alloc (gfp=0x20858) at mm/filemap.c:663
 #6  0xffffffff8116f03c in pagecache_get_page (mapping=0xffff88007cc03908, offset=0xc920f, fgp_flags=0x7, cache_gfp_mask=0x20858, radix_gfp_mask=0x850) at mm/filemap.c:1096
 #7  0xffffffff812160f4 in find_or_create_page (mapping=<optimized out>, gfp_mask=<optimized out>, offset=0xc920f) at include/linux/pagemap.h:336
 #8  grow_dev_page (sizebits=0x0, size=0x1000, index=0xc920f, block=0xc920f, bdev=0xffff88007cc03580) at fs/buffer.c:1022
 #9  grow_buffers (size=<optimized out>, block=<optimized out>, bdev=<optimized out>) at fs/buffer.c:1095
 #10 __getblk_slow (size=0x1000, block=0xc920f, bdev=0xffff88007cc03580) at fs/buffer.c:1121
 #11 __getblk (bdev=0xffff88007cc03580, block=0xc920f, size=0x1000) at fs/buffer.c:1395
 #12 0xffffffff8125c8ed in sb_getblk (block=0xc920f, sb=<optimized out>) at include/linux/buffer_head.h:310
 #13 ext4_read_block_bitmap_nowait (sb=0xffff88007c579000, block_group=0x2f) at fs/ext4/balloc.c:407
 #14 0xffffffff8125ced4 in ext4_read_block_bitmap (sb=0xffff88007c579000, block_group=0x2f) at fs/ext4/balloc.c:489
 #15 0xffffffff8167963b in ext4_mb_discard_group_preallocations (sb=0xffff88007c579000, group=0x2f, needed=0x38) at fs/ext4/mballoc.c:3798
 #16 0xffffffff8129ddbd in ext4_mb_discard_preallocations (needed=0x38, sb=0xffff88007c579000) at fs/ext4/mballoc.c:4346
 #17 ext4_mb_new_blocks (handle=0xffff88003305ee98, ar=0xffff88001f50b890, errp=0xffff88001f50b880) at fs/ext4/mballoc.c:4479
 #18 0xffffffff81290fd3 in ext4_ext_map_blocks (handle=0xffff88003305ee98, inode=0xffff88007b85b178, map=0xffff88001f50ba50, flags=0x25) at fs/ext4/extents.c:4453
 #19 0xffffffff81265688 in ext4_map_blocks (handle=0xffff88003305ee98, inode=0xffff88007b85b178, map=0xffff88001f50ba50, flags=0x25) at fs/ext4/inode.c:648
 #20 0xffffffff8126af77 in mpage_map_one_extent (mpd=0xffff88001f50ba28, handle=0xffff88003305ee98) at fs/ext4/inode.c:2164
 #21 mpage_map_and_submit_extent (give_up_on_write=<synthetic pointer>, mpd=0xffff88001f50ba28, handle=0xffff88003305ee98) at fs/ext4/inode.c:2219
 #22 ext4_writepages (mapping=0xffff88007b85b350, wbc=0xffff88001f50bb60) at fs/ext4/inode.c:2557
 #23 0xffffffff8117ce81 in do_writepages (mapping=0xffff88007b85b350, wbc=0xffff88001f50bb60) at mm/page-writeback.c:2046
 #24 0xffffffff812096c0 in __writeback_single_inode (inode=0xffff88007b85b178, wbc=0xffff88001f50bb60) at fs/fs-writeback.c:460
 #25 0xffffffff8120b311 in writeback_sb_inodes (sb=0xffff88007c579000, wb=0xffff88007bceb060, work=0xffff8800130f9d80) at fs/fs-writeback.c:687
 #26 0xffffffff8120b68f in __writeback_inodes_wb (wb=0xffff88007bceb060, work=0xffff8800130f9d80) at fs/fs-writeback.c:732
 #27 0xffffffff8120b94b in wb_writeback (wb=0xffff88007bceb060, work=0xffff8800130f9d80) at fs/fs-writeback.c:863
 #28 0xffffffff8120befc in wb_do_writeback (wb=0xffff88007bceb060) at fs/fs-writeback.c:998
 #29 bdi_writeback_workfn (work=0xffff88007bceb078) at fs/fs-writeback.c:1043
 #30 0xffffffff81092cf5 in process_one_work (worker=0xffff88002c555e80, work=0xffff88007bceb078) at kernel/workqueue.c:2081
 #31 0xffffffff8109376b in worker_thread (__worker=0xffff88002c555e80) at kernel/workqueue.c:2212
 #32 0xffffffff8109ba54 in kthread (_create=0xffff88007bf2e2c0) at kernel/kthread.c:207
 #33 <signal handler called>
 #34 0x0000000000000000 in irq_stack_union ()
 #35 0x0000000000000000 in ?? ()

To solve this I manually set ALLOC_NO_WATERMARKS in alloc_flags
with gdb, and the livelock resolved itself.

The fix simply allows __GFP_NOFAIL allocations to get access to the
emergency reserves in the buddy allocator if __GFP_NOFAIL triggers a
reclaim failure signaling an out-of-memory condition. Worst case it'll
deadlock because we run out of emergency reserves, but not giving it
access to the emergency reserves after __GFP_NOFAIL hits an
out-of-memory condition may actually result in a livelock even though
there are still ~50 MB free! So this is safer. After applying this OOM
livelock fix I cannot reproduce the livelock anymore in __GFP_NOFAIL.

47fb3887

mm: ext4 livelock during OOM · 7636db0a

Andrea Arcangeli authored Jun 20, 2014

I can easily reproduce a livelock with some trinity load on a 2GB
guest running in parallel:

	./trinity -X -c remap_anon_pages -q
	./trinity -X -c userfaultfd -q

The last OOM killer invocation selects this task:

Out of memory: Kill process 6537 (trinity-c6) score 106 or sacrifice child
Killed process 6537 (trinity-c6) total-vm:414772kB, anon-rss:186744kB, file-rss:560kB

The victim task shortly later is detected in uninterruptible state for
too long by the hangcheck timer:

INFO: task trinity-c6:6537 blocked for more than 120 seconds.
      Not tainted 3.16.0-rc1+ #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
trinity-c6      D ffff88004ec37b50 11080  6537   5530 0x00100004
 ffff88004ec37aa8 0000000000000082 0000000000000000 ffff880039174910
 ffff88004ec37fd8 0000000000004000 ffff88007c8de3d0 ffff880039174910
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8167c759>] ? schedule+0x29/0x70
 [<ffffffff8112ae42>] ? __delayacct_blkio_start+0x22/0x30
 [<ffffffff8116e290>] ? __lock_page+0x70/0x70
 [<ffffffff8167c759>] schedule+0x29/0x70
 [<ffffffff8167c82f>] io_schedule+0x8f/0xd0
 [<ffffffff8116e29e>] sleep_on_page+0xe/0x20
 [<ffffffff8167cd33>] __wait_on_bit_lock+0x73/0xb0
 [<ffffffff8116e287>] __lock_page+0x67/0x70
 [<ffffffff810c34d0>] ? wake_atomic_t_function+0x40/0x40
 [<ffffffff8116f125>] pagecache_get_page+0x165/0x1f0
 [<ffffffff8116f3d4>] grab_cache_page_write_begin+0x34/0x50
 [<ffffffff81268f82>] ext4_da_write_begin+0x92/0x380
 [<ffffffff8116d717>] generic_perform_write+0xc7/0x1d0
 [<ffffffff8116fee3>] __generic_file_write_iter+0x173/0x350
 [<ffffffff8125e6ad>] ext4_file_write_iter+0x10d/0x3c0
 [<ffffffff811db72b>] ? vfs_write+0x1bb/0x1f0
 [<ffffffff811da581>] new_sync_write+0x81/0xb0
 [<ffffffff811db62f>] vfs_write+0xbf/0x1f0
 [<ffffffff811dbb72>] SyS_write+0x52/0xc0
 [<ffffffff816816d2>] system_call_fastpath+0x16/0x1b
3 locks held by trinity-c6/6537:
 #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811fc0ee>] __fdget_pos+0x3e/0x50
 #1:  (sb_writers#3){.+.+.+}, at: [<ffffffff811db72b>] vfs_write+0x1bb/0x1f0
 #2:  (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [<ffffffff8125e62d>] ext4_file_write_iter+0x8d/0x3c0

The task that holds the page lock is likely the below one that never
returns from __alloc_pages_slowpath.

ck_union>, high_zoneidx=ZONE_NORMAL, zonelist=0xffff88007fffc100, order=0x0, gfp_mask=0x50) at mm/page_alloc.c:2661
ion>) at mm/page_alloc.c:2821
0, radix_gfp_mask=0x50) at mm/filemap.c:1096
emap.h:336
 at fs/ext4/mballoc.c:4442
50, flags=0x25) at fs/ext4/extents.c:4453
flags=0x25) at fs/ext4/inode.c:648
64

(full stack trace of the !__GFP_FS allocation in the kernel thread
holding the page lock below)

gfp_mask in __alloc_pages_slowpath is gfp_mask=0x50, so
___GFP_IO|___GFP_WAIT. ext4_writepages run from a kworker kernel
thread is holding the page lock that the OOM victim task is waiting
on.
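
The masks quoted in this message can be decoded with a small table. The sketch below is a hypothetical helper, not kernel code. The values for __GFP_WAIT (0x10), __GFP_IO (0x40), and __GFP_NORETRY (0x1000) follow from the masks 0x50 and 0x1050 shown in this message; the values used for __GFP_FS (0x80) and __GFP_NOFAIL (0x800) are assumptions about the gfp.h of that era. Bits not listed in the table are ignored.

```c
#include <stddef.h>
#include <string.h>

struct gfp_bit {
    unsigned int bit;
    const char *name;
};

/* Only the flags discussed in the text; the FS and NOFAIL values are assumed. */
static const struct gfp_bit gfp_bits[] = {
    { 0x10u,   "__GFP_WAIT" },
    { 0x40u,   "__GFP_IO" },
    { 0x80u,   "__GFP_FS" },
    { 0x800u,  "__GFP_NOFAIL" },
    { 0x1000u, "__GFP_NORETRY" },
};

/* Append s to the NUL-terminated string in buf if it fits; return 1 on success. */
static int append(char *buf, size_t len, const char *s)
{
    size_t used = strlen(buf);
    size_t n = strlen(s);

    if (used + n + 1 > len)
        return 0;
    memcpy(buf + used, s, n + 1);
    return 1;
}

/* Write the names of the known flags set in mask into buf, joined by '|'. */
void gfp_describe(unsigned int mask, char *buf, size_t len)
{
    if (len == 0)
        return;
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(gfp_bits) / sizeof(gfp_bits[0]); i++) {
        if (!(mask & gfp_bits[i].bit))
            continue;
        if (buf[0] != '\0' && !append(buf, len, "|"))
            return;
        if (!append(buf, len, gfp_bits[i].name))
            return;
    }
}
```

Decoding 0x50 with this table yields the WAIT and IO flags and no FS flag, which is what marks the allocation as one that may be holding low-level filesystem locks.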

If alloc_pages returned NULL the whole livelock would resolve itself
(-ENOMEM would be returned all the way up, ext4 thinks it can handle
it, in reality it cannot but that's for a later patch in this series).
ext4_writepages would return and the kworker would try again later to
flush the dirty pages in the dirty inodes.

To verify, I set a breakpoint in should_alloc_retry and added
__GFP_NORETRY to the gfp_mask just before the __GFP_NORETRY check.

gdb> b mm/page_alloc.c:2185
Breakpoint 8 at 0xffffffff81179122: file mm/page_alloc.c, line 2185.
gdb> c
Continuing.
[Switching to Thread 1]
_______________________________________________________________________________
     eax:00000000 ebx:00000050  ecx:00000001  edx:00000001     eflags:00000206
     esi:0000196A edi:2963AA50  esp:4EDEF448  ebp:4EDEF568     eip:Error while running hook_stop:
Value can't be converted to integer.

Breakpoint 8, __alloc_pages_slowpath (migratetype=0x0, classzone_idx=0x1, preferred_zone=0xffff88007fffa840, nodemask=0x0 <irq_stack_union>, high_zoneidx=ZONE_NORMAL, zonelist=0xffff88007fffc100, order=0x0, gfp_mask=0x50) at mm/page_alloc.c:2713

I set the breakpoint at 2185 and it stopped at 2713, but 2713 is in the
middle of some comment. I assume that's an addr2line imperfection and
it's not relevant.

Then I simply added __GFP_NORETRY to the gfp_mask in the stack:

gdb> print gfp_mask
$1 = 0x50
gdb> set gfp_mask = 0x1050
gdb> p gfp_mask
$2 = 0x1050
gdb> c
Continuing.

After that the livelock resolved itself immediately, the OOM victim
quit and the workload continued without errors.

The problem was probably introduced in commit 11e33f6a.

	/*
	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
	 * means __GFP_NOFAIL, but that may not be true in other
	 * implementations.
	 */
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return 1;

Retrying forever and depending on the OOM killer to send a SIGKILL is
only ok if the victim task isn't sleeping in uninterruptible state
waiting, as in this case, for a kernel thread to release a page lock.

In this case the kernel thread holding the lock would never be picked
by the OOM killer in the first place, so this is an error path in ext4
that was probably never exercised.

The objective of an implicit __GFP_NOFAIL behavior (but that fails the
allocation if TIF_MEMDIE is set on the task, unlike a real
__GFP_NOFAIL that never fails) I assume is to avoid spurious
VM_FAULT_OOM to make the OOM killer more reliable, but I don't think
an almost implicit __GFP_NOFAIL is safe in general. __GFP_NOFAIL is
unsafe too in fact but at least it's very rarely used.

For now we can start by letting the allocations that hold lowlevel
filesystem locks (__GFP_FS clear) fail so they can release those
locks. Those locks tend to be uninterruptible too.
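
The change in retry policy can be sketched as a pair of simplified predicates. This is a model under stated assumptions, not the actual patch: both function names are hypothetical, the flag values for __GFP_FS and __GFP_NOFAIL are assumed, and other conditions in the real should_alloc_retry are left out.

```c
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3
#define MODEL_GFP_FS      0x80u   /* assumed value */
#define MODEL_GFP_NOFAIL  0x800u  /* assumed value */
#define MODEL_GFP_NORETRY 0x1000u /* matches the 0x1050 mask in the text */

/* Behavior quoted above: any small-order allocation retries forever. */
bool should_retry_before(unsigned int gfp_mask, unsigned int order)
{
    if (gfp_mask & MODEL_GFP_NORETRY)
        return false;
    if (gfp_mask & MODEL_GFP_NOFAIL)
        return true;
    return order <= PAGE_ALLOC_COSTLY_ORDER;
}

/*
 * Sketch of the proposed direction: small-order allocations keep the
 * implicit retry only when __GFP_FS is set, so allocations made while
 * holding low-level filesystem locks are allowed to fail.
 */
bool should_retry_after(unsigned int gfp_mask, unsigned int order)
{
    if (gfp_mask & MODEL_GFP_NORETRY)
        return false;
    if (gfp_mask & MODEL_GFP_NOFAIL)
        return true;
    return order <= PAGE_ALLOC_COSTLY_ORDER && (gfp_mask & MODEL_GFP_FS);
}
```

For the order-0 allocation with gfp_mask=0x50 discussed here, the first predicate keeps retrying while the second lets the allocation fail, so the caller can release its locks.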

This will reduce the scope of the problem, but I'd prefer to
drop that entire check quoted above in should_alloc_retry. If a
kernel thread would use_mm() and then take the mmap_sem for writing,
and then run a GFP_KERNEL allocation that invokes the OOM killer to
kill one process that is waiting in down_read in __do_page_fault the
same problem would emerge.

In short it's a tradeoff between the accuracy of the OOM killer (not
causing spurious allocation failures in addition to killing the task)
and the risk of livelock.

Furthermore, the longer we hold off on this change, the more likely it
is that the error paths of those ext4 allocations done by kernel
threads (never picked by the OOM killer, which would set TIF_MEMDIE
and let alloc_pages fail once in a while) will never be exercised.

After the fix, the identical trinity load that reliably reproduced the
problem completes and, in addition to the OOM killer info, I get the
allocation failure below in the kernel log, as expected (before the
fix I'd get the livelock instead).

kworker/u16:0: page allocation failure: order:0, mode:0x50
CPU: 2 PID: 6006 Comm: kworker/u16:0 Tainted: G        W     3.16.0-rc1+ #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: writeback bdi_writeback_workfn (flush-254:0)
 0000000000000000 ffff880006d3f408 ffffffff81679b51 ffff88007fc8f068
 0000000000000050 ffff880006d3f498 ffffffff811748c2 0000000000000010
 ffffffffffffffff ffff88007fffc128 ffffffff810d0bed ffff880006d3f468
Call Trace:
 [<ffffffff81679b51>] dump_stack+0x4e/0x68
 [<ffffffff811748c2>] warn_alloc_failed+0xe2/0x130
 [<ffffffff810d0bed>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff811791d0>] __alloc_pages_nodemask+0x910/0xb10
 [<ffffffff811becab>] alloc_pages_current+0x8b/0x120
 [<ffffffff8116e450>] __page_cache_alloc+0x10/0x20
 [<ffffffff8116f03c>] pagecache_get_page+0x7c/0x1f0
 [<ffffffff81299004>] ext4_mb_load_buddy+0x274/0x3b0
 [<ffffffff8129a3f2>] ext4_mb_regular_allocator+0x1e2/0x480
 [<ffffffff81296931>] ? ext4_mb_use_preallocated+0x31/0x600
 [<ffffffff8129de28>] ext4_mb_new_blocks+0x568/0x7f0
 [<ffffffff81290fd3>] ext4_ext_map_blocks+0x683/0x1970
 [<ffffffff81265688>] ext4_map_blocks+0x168/0x4d0
 [<ffffffff8126af77>] ext4_writepages+0x6e7/0x1030
 [<ffffffff8117ce81>] do_writepages+0x21/0x50
 [<ffffffff812096c0>] __writeback_single_inode+0x40/0x550
 [<ffffffff8120b311>] writeback_sb_inodes+0x281/0x560
 [<ffffffff8120b68f>] __writeback_inodes_wb+0x9f/0xd0
 [<ffffffff8120b94b>] wb_writeback+0x28b/0x510
 [<ffffffff8120befc>] bdi_writeback_workfn+0x11c/0x6a0
 [<ffffffff81092c8b>] ? process_one_work+0x15b/0x620
 [<ffffffff81092cf5>] process_one_work+0x1c5/0x620
 [<ffffffff81092c8b>] ? process_one_work+0x15b/0x620
 [<ffffffff8109376b>] worker_thread+0x11b/0x4f0
 [<ffffffff81093650>] ? init_pwq+0x190/0x190
 [<ffffffff8109ba54>] kthread+0xe4/0x100
 [<ffffffff8109b970>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff8168162c>] ret_from_fork+0x7c/0xb0

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

7636db0a

Revert "Merge tag 'usb-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb" · fb684033

Andrea Arcangeli authored Apr 24, 2015

This reverts commit 42e3a58b, reversing
changes made to 4fd48b45.

fb684033

13 May, 2015 6 commits

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 110bc767

Linus Torvalds authored May 12, 2015

Pull networking fixes from David Miller:

1) Handle max TX power properly wrt VIFs and the MAC in iwlwifi, from
Avri Altman.

2) Use the correct FW API for scan completions in iwlwifi, from Avraham
Stern.

3) FW monitor in iwlwifi accidentally uses unmapped memory, fix from Liad
Kaufman.

4) rhashtable conversion of mac80211 station table was buggy, the
virtual interface was not taken into account. Fix from Johannes
Berg.

5) Fix deadlock in rtlwifi by not using a zero timeout for
usb_control_msg(), from Larry Finger.

6) Update reordering state before calculating loss detection, from
Yuchung Cheng.

7) Fix off by one in bluetooth firmware parsing, from Dan Carpenter.

8) Fix extended frame handling in xilinx_can driver, from Jeppe
Ledet-Pedersen.

9) Fix CODEL packet scheduler behavior in the presence of TSO packets,
from Eric Dumazet.

10) Fix NAPI budget testing in fm10k driver, from Alexander Duyck.

11) macvlan needs to propagate promisc settings down to the lower
device, from Vlad Yasevich.

12) igb driver can oops when changing number of rings, from Toshiaki
Makita.

13) Source specific default routes not handled properly in ipv6, from
Markus Stenberg.

14) Use after free in tc_ctl_tfilter(), from WANG Cong.

15) Use softirq spinlocking in netxen driver, from Tony Camuso.

16) Two ARM bpf JIT fixes from Nicolas Schichan.

17) Handle MSG_DONTWAIT properly in ring based AF_PACKET sends, from
Mathias Kretschmer.

18) Fix x86 bpf JIT implementation of FROM_{BE16,LE16,LE32}, from Alexei
Starovoitov.

19) ll_temac driver DMA maps TX packet header with incorrect length, fix
from Michal Simek.

20) We removed pm_qos bits from netdevice.h, but some indirect
references remained. Kill them. From David Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
net: Remove remaining remnants of pm_qos from netdevice.h
e1000e: Add pm_qos header
net: phy: micrel: Fix regression in kszphy_probe
net: ll_temac: Fix DMA map size bug
x86: bpf_jit: fix FROM_BE16 and FROM_LE16/32 instructions
netns: return RTM_NEWNSID instead of RTM_GETNSID on a get
Update be2net maintainers' email addresses
net_sched: gred: use correct backlog value in WRED mode
pppoe: drop pppoe device in pppoe_unbind_sock_work
net: qca_spi: Fix possible race during probe
net: mdio-gpio: Allow for unspecified bus id
af_packet / TX_RING not fully non-blocking (w/ MSG_DONTWAIT).
bnx2x: limit fw delay in kdump to 5s after boot
ARM: net: delegate filter to kernel interpreter when imm_offset() return value can't fit into 12bits.
ARM: net fix emit_udiv() for BPF_ALU | BPF_DIV | BPF_K intruction.
mpls: Change reserved label names to be consistent with netbsd
usbnet: avoid integer overflow in start_xmit
netxen_nic: use spin_[un]lock_bh around tx_clean_lock (2)
net: xgene_enet: Set hardware dependency
net: amd-xgbe: Add hardware dependency
...

110bc767

net: Remove remaining remnants of pm_qos from netdevice.h · 01d460dd

David Ahern authored May 12, 2015

Commit e2c65448 removed pm_qos from struct net_device but left the
comment and header file. Remove those.

Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

01d460dd

e1000e: Add pm_qos header · 5684044f

David Ahern authored May 12, 2015

Commit e2c65448 moved pm_qos_req to e1000_adapter. Add the header file
that defines the struct.

Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5684044f

net: phy: micrel: Fix regression in kszphy_probe · bced8701

Niklas Cassel authored May 12, 2015

Don't do clock-mode-select if clk == NULL,
since when building without CONFIG_HAVE_CLK,
clk_get returns NULL and clk_get_rate returns 0.

Doing clock-mode-select in this case causes kszphy_probe to
return -EINVAL and thus prevents the device from being probed.

The original code (before regression) would return 0
when building without CONFIG_HAVE_CLK.
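
The regression and the fix can be modeled in userspace. Everything in this sketch is hypothetical: `struct clk_model`, the helper names, and the accepted rates of 25 MHz and 50 MHz are assumptions made for illustration. The only behavior taken from the text is that a NULL clock reports a rate of 0, and that clock-mode-select then fails with -EINVAL unless the NULL case is skipped.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a clock handle; not the kernel's struct clk. */
struct clk_model {
    unsigned long rate_hz;
};

/* As described in the text: a NULL clock reports a rate of 0. */
unsigned long clk_get_rate_model(const struct clk_model *clk)
{
    return clk ? clk->rate_hz : 0;
}

/* Clock-mode-select; the two accepted rates are assumed for this model. */
int clock_mode_select_model(unsigned long rate_hz)
{
    if (rate_hz == 25000000UL || rate_hz == 50000000UL)
        return 0;
    return -EINVAL;
}

/* Before the fix: a NULL clock falls through to the rate check and fails. */
int probe_before_fix(const struct clk_model *clk)
{
    return clock_mode_select_model(clk_get_rate_model(clk));
}

/* After the fix: skip clock-mode-select when no clock is available. */
int probe_after_fix(const struct clk_model *clk)
{
    if (clk == NULL)
        return 0;
    return clock_mode_select_model(clk_get_rate_model(clk));
}
```

With a NULL clock, `probe_before_fix()` returns -EINVAL while `probe_after_fix()` returns 0. A clock with an unsupported rate is still rejected by both.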

Cc: stable <stable@vger.kernel.org> # 3.18+
Fixes: 1fadee0c ("net/phy: micrel: Add clock support for KSZ8021/KSZ8031")
Reviewed-by: Fabio Estevam <fabio.estevam@freescale.com>
Reviewed-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Niklas Cassel <niklass@axis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bced8701

net: ll_temac: Fix DMA map size bug · 44d4f8d7

Michal Simek authored May 12, 2015

The DMA mapping uses skb->len instead of headlen,
which is the length actually used for DMA.

Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

44d4f8d7

x86: bpf_jit: fix FROM_BE16 and FROM_LE16/32 instructions · 343f845b

Alexei Starovoitov authored May 11, 2015

FROM_BE16:
'ror %reg, 8' doesn't clear upper bits of the register,
so use additional 'movzwl' insn to zero extend 16 bits into 64

FROM_LE16:
should zero extend lower 16 bits into 64 bit

FROM_LE32:
should zero extend lower 32 bits into 64 bit

Fixes: 89aa0758 ("net: sock: allow eBPF programs to be attached to sockets")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

343f845b
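
The intended semantics of the three conversions on a little-endian host can be modelled in portable C. This is a sketch of the expected register contents, not of the JIT itself, and all function names are made up for illustration. The first helper models a 16-bit rotate that leaves the upper 48 bits of the 64-bit register untouched. The second adds the zero extension that the extra 'movzwl' provides.

```c
#include <stdint.h>

/* Model of 'ror %reg, 8' on the low 16 bits: upper 48 bits are kept. */
static uint64_t hyp_be16_no_zext(uint64_t reg)
{
	uint16_t low = (uint16_t)reg;
	uint16_t swapped = (uint16_t)((low >> 8) | (low << 8));

	return (reg & ~0xffffULL) | swapped;
}

/* Same rotate followed by zero extension of 16 bits into 64. */
static uint64_t hyp_be16_zext(uint64_t reg)
{
	return hyp_be16_no_zext(reg) & 0xffffULL;
}

/* On a little-endian host, FROM_LE16/32 reduce to zero extension. */
static uint64_t hyp_le16(uint64_t reg)
{
	return reg & 0xffffULL;
}

static uint64_t hyp_le32(uint64_t reg)
{
	return reg & 0xffffffffULL;
}
```

Feeding a register whose upper bits hold stale data shows the difference: the rotate-only model returns those stale bits along with the swapped half-word, and the zero-extending model returns just the swapped 16 bits.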

12 May, 2015 11 commits

Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus · 6c9d370c

Linus Torvalds authored May 12, 2015

Pull MIPS fixes from Ralf Baechle:
 "One build fix for build breakage of all MIPS SMP kernels caused by
  Rusty's fix of obsolete use of cpu mask helpers, another to fix the FP
  ABI selection when loading an ELF binary"

* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
  MIPS: fix FP mode selection in lieu of .MIPS.abiflags data
  MIPS: SMP: Fix build error.

6c9d370c

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma · 03906ca3

Linus Torvalds authored May 12, 2015

Pull rdma fixes from Doug Ledford:
 - update MAINTAINERS git repo pointer
 - printk garbage fix
 - fix for qib and iw_cxgb4 bugs introduced in 4.1 window
 - fix for an older iWARP netlink bug
 - fix a memcpy issue in ehca driver

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
  infiniband: Remove duplicated KERN_<LEVEL> from pr_<level> uses
  IB/qib: fix test of unsigned variable
  RDMA/core: Fix for parsing netlink string attribute
  MAINTAINERS: update the official rdma git repo
  iw_cxgb4: use wildcard mapping for getting remote addr info
  IB/ehca: use correct destination for memcpy

03906ca3

netns: return RTM_NEWNSID instead of RTM_GETNSID on a get · e3d8ecb7

Nicolas Dichtel authored May 11, 2015

Usually, RTM_NEWxxx is returned on a get (same as a dump).

Fixes: 0c7aecd4 ("netns: add rtnl cmd to add and get peer netns ids")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e3d8ecb7

Merge tag 'for-v4.1-rc' of git://git.infradead.org/battery-2.6 · cc49e8c9

Linus Torvalds authored May 12, 2015

Pull power supply and reset fixes from Sebastian Reichel:
 "misc fixes"

* tag 'for-v4.1-rc' of git://git.infradead.org/battery-2.6:
  power: bq27x00_battery: Add missing MODULE_ALIAS
  power: reset: Add MFD_SYSCON depends for brcmstb
  power: reset: ltc2952: Remove bogus hrtimer_start() return value checks
  power_supply: fix oops in collie_battery driver
  power/reset: at91: fix return value check in at91_reset_platform_probe()
  MAINTAINERS: Add me as maintainer of Nokia N900 power supply drivers
  axp288_fuel_gauge: Add original author details

cc49e8c9

infiniband: Remove duplicated KERN_<LEVEL> from pr_<level> uses · f4f01b54

Joe Perches authored May 08, 2015

These KERN_<LEVEL> uses are unnecessary with pr_<level> and cause
bad logging output so remove them.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

f4f01b54
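
Why the duplicated level produces bad output can be shown with a simplified user-space model. The kernel encodes a log level as the byte `\001` followed by a digit, and pr_<level> already prepends it. Everything prefixed `HYP_`/`hyp_` below is a hypothetical stand-in, and the single-prefix stripping is an assumption made for the sake of the example, not a copy of printk's parsing.

```c
#define HYP_KERN_SOH "\001"
#define HYP_KERN_ERR HYP_KERN_SOH "3"

/* Models pr_err(): the level prefix is prepended to the format. */
#define HYP_PR_ERR_FMT(fmt) HYP_KERN_ERR fmt

/* Format as built by a correct pr_err("link down\n") call. */
static const char hyp_good[] = HYP_PR_ERR_FMT("link down\n");

/* Format as built by pr_err(KERN_ERR "link down\n"): two prefixes. */
static const char hyp_bad[] = HYP_PR_ERR_FMT(HYP_KERN_ERR "link down\n");

/* Assumed consumer behaviour: strip exactly one level prefix. */
static const char *hyp_strip_level(const char *msg)
{
	if (msg[0] == '\001' && msg[1] != '\0')
		return msg + 2;

	return msg;
}
```

Under that assumption the correctly built string comes out as the bare message, while the doubled one still begins with a `\001` control byte that ends up in the log text.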

IB/qib: fix test of unsigned variable · ec40f925

Mike Marciniszyn authored May 12, 2015

Commit d4988623 ("IB/qib: use arch_phys_wc_add()")
adjusted mtrr initialization to use the new interface.

Unfortunately, the new interface returns a signed
value and the patch tested the unsigned wc_cookie.

Fix the issue by changing the type of wc_cookie to int.  In
the success case, ret is left at zero to avoid
a warning from the caller.  On failure, wc_cookie
is used as ret.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

ec40f925
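
The pitfall can be reproduced outside the kernel. In this sketch `hyp_wc_add()` is a hypothetical stand-in for an interface that returns a non-negative cookie on success or a negative error code on failure. The helper names and the value -22 are assumptions for illustration only.

```c
#include <stdbool.h>

/* Hypothetical interface: cookie >= 0 on success, negative on error. */
static int hyp_wc_add(bool fail)
{
	return fail ? -22 : 3;
}

/* Storing the result in an unsigned variable hides the error. */
static unsigned int hyp_cookie_unsigned(bool fail)
{
	return (unsigned int)hyp_wc_add(fail);
}

/*
 * With a signed cookie the failure test works: ret stays 0 on
 * success, and the negative cookie is returned on failure.
 */
static int hyp_setup(bool fail, int *cookie_out)
{
	int ret = 0;
	int cookie = hyp_wc_add(fail);

	if (cookie < 0)
		ret = cookie;
	else
		*cookie_out = cookie;

	return ret;
}
```

Once converted to unsigned, the error value becomes a large positive number, so a "less than zero" test on it can never be true.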

RDMA/core: Fix for parsing netlink string attribute · ec04847c

Tatyana Nikolova authored May 08, 2015

The string iwpm_ulib_name is recorded in a nlmsg as a netlink attribute.
Without this fix, parsing of the nlmsg by the userspace port mapper service fails
because of unknown attribute length, causing the port mapper service not to
register the client that sent the nlmsg.
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Cc: <stable@vger.kernel.org> #v3.16
Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

ec04847c

MIPS: fix FP mode selection in lieu of .MIPS.abiflags data · 620b1550

Paul Burton authored May 06, 2015

Commit 46490b57 ("MIPS: kernel: elf: Improve the overall ABI and FPU
mode checks") reworked the ELF FP ABI mode selection logic, but when
CONFIG_MIPS_O32_FP64_SUPPORT is enabled it breaks the use of binaries
which have no PT_MIPS_ABIFLAGS program header & associated
.MIPS.abiflags section.

A default mode is selected based upon whether the ELF contains MIPS32 or
MIPS64 code, but that selection is made in arch_elf_pt_proc.
arch_elf_pt_proc only executes when a PT_MIPS_ABIFLAGS program header is
found. If one is not found then arch_elf_pt_proc is never called, and no
default overall_fp_mode value is selected. When arch_check_elf is
called, both abi0 & abi1 are MIPS_ABI_FP_UNKNOWN which leads to both
prog_req & interp_req being set to none_req. none_req matches none of
the conditions for mode selection at the end of arch_check_elf, so
overall_fp_mode is left untouched. Finally once mips_set_personality_fp
is called the BUG() in the default case is then hit & the kernel likely
panics.

Fix this by moving the selection of a default overall mode to the start
of arch_check_elf, which runs once per ELF executed regardless of
whether it has a PT_MIPS_ABIFLAGS program header.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: Markos Chandras <markos.chandras@imgtec.com>
Cc: Matthew Fortune <matthew.fortune@imgtec.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: stable@vger.kernel.org # v4.0+
Patchwork: http://patchwork.linux-mips.org/patch/9978/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

620b1550

Update be2net maintainers' email addresses · 6938f855

Sathya Perla authored May 12, 2015

Emulex developers' email addresses are now "@avagotech" instead of
"@emulex". I'm also replacing Subbu with Padmanabh and Sriharsha in the
maintainers list. The driver's heading was outdated and did not include
some of the chip types (BE3, Lancer and Skyhawk) that the driver has
been supporting for a long time. I've updated this too.
Signed-off-by: Sathya Perla <sathya.perla@avagotech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6938f855

MIPS: SMP: Fix build error. · cafb45b2

Ralf Baechle authored May 12, 2015

  CC      arch/mips/kernel/smp.o
arch/mips/kernel/smp.c: In function ‘start_secondary’:
arch/mips/kernel/smp.c:149:2: error: passing argument 2 of ‘cpumask_set_cpu’ discards ‘volatile’ qualifier from pointer target type [-Werror]
  cpumask_set_cpu(cpu, &cpu_callin_map);
  ^
In file included from ./arch/mips/include/asm/processor.h:14:0,
                 from ./arch/mips/include/asm/thread_info.h:15,
                 from include/linux/thread_info.h:54,
                 from include/asm-generic/preempt.h:4,
                 from arch/mips/include/generated/asm/preempt.h:1,
                 from include/linux/preempt.h:18,
                 from include/linux/interrupt.h:8,
                 from arch/mips/kernel/smp.c:24:
include/linux/cpumask.h:272:91: note: expected ‘struct cpumask *’ but argument is of type ‘volatile struct cpumask_t *’
 static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
                                                                                           ^
arch/mips/kernel/smp.c: In function ‘smp_prepare_boot_cpu’:
arch/mips/kernel/smp.c:211:2: error: passing argument 2 of ‘cpumask_set_cpu’ discards ‘volatile’ qualifier from pointer target type [-Werror]
  cpumask_set_cpu(0, &cpu_callin_map);
  ^
In file included from ./arch/mips/include/asm/processor.h:14:0,
                 from ./arch/mips/include/asm/thread_info.h:15,
                 from include/linux/thread_info.h:54,
                 from include/asm-generic/preempt.h:4,
                 from arch/mips/include/generated/asm/preempt.h:1,
                 from include/linux/preempt.h:18,
                 from include/linux/interrupt.h:8,
                 from arch/mips/kernel/smp.c:24:
include/linux/cpumask.h:272:91: note: expected ‘struct cpumask *’ but argument is of type ‘volatile struct cpumask_t *’
 static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
                                                                                           ^
arch/mips/kernel/smp.c: In function ‘__cpu_up’:
arch/mips/kernel/smp.c:221:10: error: passing argument 2 of ‘cpumask_test_cpu’ discards ‘volatile’ qualifier from pointer target type [-Werror]
  while (!cpumask_test_cpu(cpu, &cpu_callin_map))
          ^
In file included from ./arch/mips/include/asm/processor.h:14:0,
                 from ./arch/mips/include/asm/thread_info.h:15,
                 from include/linux/thread_info.h:54,
                 from include/asm-generic/preempt.h:4,
                 from arch/mips/include/generated/asm/preempt.h:1,
                 from include/linux/preempt.h:18,
                 from include/linux/interrupt.h:8,
                 from arch/mips/kernel/smp.c:24:
include/linux/cpumask.h:294:90: note: expected ‘const struct cpumask *’ but argument is of type ‘volatile struct cpumask_t *’
 static inline int cpumask_test_cpu(int cpu, const struct cpumask *cpumask)
                                                                                          ^
cc1: all warnings being treated as errors
make[2]: *** [arch/mips/kernel/smp.o] Error 1
make[1]: *** [arch/mips/kernel] Error 2
make: *** [arch/mips] Error 2
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

cafb45b2

MAINTAINERS: update the official rdma git repo · 2936ae04

Doug Ledford authored May 11, 2015

Linus prefers kernel.org repos to github repos for security.
Signed-off-by: Doug Ledford <dledford@redhat.com>

2936ae04

11 May, 2015 5 commits

Merge branch 'for-4.1' of git://linux-nfs.org/~bfields/linux · 4cfceaf0

Linus Torvalds authored May 11, 2015

Pull nfsd bugfixes from Bruce Fields:
 "Mainly pnfs fixes (and for problems with generic callback code made
  more obvious by pnfs)"

* 'for-4.1' of git://linux-nfs.org/~bfields/linux:
  nfsd: skip CB_NULL probes for 4.1 or later
  nfsd: fix callback restarts
  nfsd: split transport vs operation errors for callbacks
  svcrpc: fix potential GSSX_ACCEPT_SEC_CONTEXT decoding failures
  nfsd: fix pNFS return on close semantics
  nfsd: fix the check for confirmed openowner in nfs4_preprocess_stateid_op
  nfsd/blocklayout: pretend we can send deviceid notifications

4cfceaf0

iw_cxgb4: use wildcard mapping for getting remote addr info · 940fd304

Steve Wise authored May 07, 2015

For listening endpoints bound to the wildcard address, we need to pass
the wildcard address mapping to iwpm_get_remote_info() instead of the
mapped address of the new child connection.

Without this fix, and with iwarp port mapping enabled, each iw_cxgb4
connection that is spawned from a listening endpoint bound to the wildcard
address will generate an annoying dmesg entry about failing to find
the remote address mapping info, and the connection state displayed in
debugfs under /sys/kernel/debug/iw_cxgb4/<pci-slot-no>/eps will not have
the peer's address/port mapping info. The connection still works though.

Fixes: 5b6b8fe6 ("RDMA/cxgb4: Report the actual address of the remote connecting peer")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

940fd304

IB/ehca: use correct destination for memcpy · 94634e98

Nicholas Mc Guire authored May 11, 2015

Using an element of a struct as the address for the memcpy of the whole
struct may introduce a buffer overflow and does not help readability either.
Simply pass the real thing as the first argument to memcpy.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>

94634e98
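
The pattern can be illustrated with a hypothetical two-field struct. `struct hyp_attr` and `hyp_copy_attr()` are invented for this sketch and do not come from the ehca driver. The fragile form is shown only in a comment because it writes past the end of the destination.

```c
#include <string.h>

/* Hypothetical struct; not taken from the ehca driver. */
struct hyp_attr {
	int first;
	int second;
};

/*
 * Fragile form (do not use): the destination is a member, but the
 * size is that of the whole struct, so the copy can run past the end:
 *
 *	memcpy(&dst->second, src, sizeof(*src));
 *
 * Clear form: pass the struct itself as the destination.
 */
static void hyp_copy_attr(struct hyp_attr *dst, const struct hyp_attr *src)
{
	memcpy(dst, src, sizeof(*dst));
}
```

When the member happens to be the first one, its address equals the address of the struct and the copy is harmless, but the code still misleads the reader and breaks as soon as the layout changes.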

Merge tag 'spi-fix-v4.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi · ef208162

Linus Torvalds authored May 11, 2015

Pull spi fixes from Mark Brown:
 "A number of driver specific fixes (including several missing
  dependencies for randconfig type cases) plus two core fixes.

  One makes the setup_transfer() callback optional which unbreaks some
  drivers which had been merged with it omitted due to local versions of
  this patch and another ensures that we don't corrupt data by leaking
  internal dummy buffers to callers, causing the callers to think they
  allocated those buffers"

* tag 'spi-fix-v4.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: fsl-espi: fix behaviour for full-duplex xfers
  spi: fsl-spi: fix devm_ioremap_resource() error case
  spi: Kconfig: Add SOC_LS1021A to SPI_FSL_DSPI dependence
  spi/omap2-mcpsi: Always call spi_finalize_current_message()
  spi: fsl-spi: use devm_ioremap_resource() to map parameter ram on CPM1
  spi: bitbang: Make setup_transfer() callback optional
  spi: check tx_buf and rx_buf in spi_unmap_msg
  spi: bcm2835: change timeout of polling driver to 1s
  spi: bcm2835: Add GPIOLIB dependency

ef208162

Merge tag 'iommu-fixes-v4.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · a156e068

Linus Torvalds authored May 11, 2015

Pull iommu fixes from Joerg Roedel:
 "Three fixes have queued up:

   - reference count fix in the AMD IOMMUv2 driver

   - sign extension fix in the ARM-SMMU driver

   - build fix for rockchip driver with device tree"

* tag 'iommu-fixes-v4.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/arm-smmu: Fix sign-extension of upstream bus addresses at stage 1
  iommu/rockchip: Fix build without CONFIG_OF
  iommu/amd: Fix bug in put_pasid_state_wait

a156e068