Commits · d1fe19444d82e399e38c1594c71b850eca8e9de0 · nexedi / linux

27 Jul, 2015 1 commit

inet: frag: don't re-use chainlist for evictor · d1fe1944

Florian Westphal authored Jul 23, 2015

commit 65ba1f1e ("inet: frags: fix a race between inet_evict_bucket
and inet_frag_kill") describes the bug, but the fix doesn't work reliably.

Problem is that ->flags member can be set on other cpu without chainlock
being held by that task, i.e. the RMW-Cycle can clear INET_FRAG_EVICTED
bit after we put the element on the evictor private list.

We can crash when walking the 'private' evictor list since an element can
be deleted from list underneath the evictor.

Join work with Nikolay Alexandrov.

Fixes: b13d3cbf ("inet: frag: move eviction of queues to work queue")
Reported-by: Johan Schuijt <johan@transip.nl>
Tested-by: Frank Schreuder <fschreuder@transip.nl>
Signed-off-by: Nikolay Alexandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

d1fe1944

26 Jul, 2015 7 commits

net: sctp: stop spamming klog with rfc6458, 5.3.2. deprecation warnings · 81296fc6

Daniel Borkmann authored Jul 22, 2015

Back then when we added support for SCTP_SNDINFO/SCTP_RCVINFO from
RFC6458 5.3.4/5.3.5, we decided to add a deprecation warning for the
(as per RFC deprecated) SCTP_SNDRCV via commit bbbea41d ("net:
sctp: deprecate rfc6458, 5.3.2. SCTP_SNDRCV support"), see [1].

Imho, it was not a good idea, and we should just revert that message
for a couple of reasons:

  1) It's uapi and therefore set in stone forever.

  2) To be able to run on older and newer kernels, an SCTP application
     would need to probe for both, SCTP_SNDRCV, but also SCTP_SNDINFO/
     SCTP_RCVINFO support, so that on older kernels, it can make use
     of SCTP_SNDRCV, and on newer kernels SCTP_SNDINFO/SCTP_RCVINFO.
     In my (limited) experience, a lot of SCTP appliances are migrating
     to newer kernels only ve(ee)ry slowly.

  3) Some people don't have the chance to change their applications,
     f.e. due to proprietary legacy stuff. So, they'll hit this warning
     in fast path and are stuck with older kernels.

But i.e. due to point 1) I really fail to see the benefit of a warning.
So just revert that for now, the issue was reported up Jamal.

  [1] http://thread.gmane.org/gmane.linux.network/321960/Reported-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Michael Tuexen <tuexen@fh-muenster.de>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

81296fc6

Merge branch 'mlx4-fixes' · 10e59ee3

David S. Miller authored Jul 26, 2015

Or Gerlitz says:

====================
mlx4 driver fixes, July 22nd 2015

Just few mlx4 fixes.. the fix related to propagating port state
changes to VF should go to -stable of >= 3.11, all the rest just
to 4.2-rc
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

10e59ee3

net/mlx4_en: Remove BUG_ON assert when checking if ring is full · 62e4c9b4

Ido Shamay authored Jul 22, 2015

In mlx4_en_is_ring_empty we check if ring surpassed its size.
Since the prod and cons indicators are u32, there might be a state where
prod wrapped around and cons, making this assert false, although no
actual bug exists (other code segment can cope with this state).
Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

62e4c9b4

net/mlx4_core: Relieve cpu load average on the port sending flow · 9f5b0317

Jack Morgenstein authored Jul 22, 2015

When a port is not attached, the FW requires a longer than usual time to
execute the SENSE_PORT command. In the command flow, the
wait_for_completion_timeout call used in mlx4_cmd_wait puts the kernel
thread into the uninterruptible state during this time. This, in turn,
due to the computation method, causes the CPU load average to increase.

Fix this by using wait_for_completion_interruptible_timeout() for the
SENSE_PORT command, which puts the thread in the interruptible state.
In this state, the thread does not contribute to the CPU load average.

Treat the interrupted case as if the SENSE_PORT command returned
port_type = NONE.

Fix suggested by Gideon Naim <gideonn@mellanox.com> and
Bart Van Assche <bart.vanassche@sandisk.com>.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9f5b0317

net/mlx4_core: Fix wrong index in propagating port change event to VFs · 1c1bf349

Jack Morgenstein authored Jul 22, 2015

The port-change event processing in procedure mlx4_eq_int() uses "slave"
as the vf_oper array index. Since the value of "slave" is the PF function
index, the result is that the PF link state is used for deciding to
propagate the event for all the VFs. The VF link state should be used,
so the VF function index should be used here.

Fixes: 948e306d ('net/mlx4: Add VF link state support')
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1c1bf349

net/mlx4_core: Use sink counter for the VF default as fallback · 178d23e3

Or Gerlitz authored Jul 22, 2015

Some old PF drivers don't let VFs allocate counters, in that case, use
the sink counter so the VF can load and operate properly.

Fixes: 6de5f7f6 ('net/mlx4_core: Allocate default counter per port')
Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

178d23e3

bridge: netlink: fix slave_changelink/br_setport race conditions · 963ad948

Nikolay Aleksandrov authored Jul 22, 2015

Since slave_changelink support was added there have been a few race
conditions when using br_setport() since some of the port functions it
uses require the bridge lock. It is very easy to trigger a lockup due to
some internal spin_lock() usage without bh disabled, also it's possible to
get the bridge into an inconsistent state.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 3ac636b8 ("bridge: implement rtnl_link_ops->slave_changelink")
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

963ad948

25 Jul, 2015 7 commits

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · 48516438

David S. Miller authored Jul 25, 2015

Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains ten Netfilter/IPVS fixes, they are:

1) Address refcount leak when creating an expectation from the ctnetlink
   interface.

2) Fix bug splat in the IDLETIMER target related to sysfs, from Dmitry
   Torokhov.

3) Resolve panic for unreachable route in IPVS with locally generated
   traffic in the output path, from Alex Gartrell.

4) Fix wrong source address in rare cases for tunneled traffic in IPVS,
   from Julian Anastasov.

5) Fix crash if scheduler is changed via ipvsadm -E, again from Julian.

6) Make sure skb->sk is unset for forwarded traffic through IPVS, again from
   Alex Gartrell.

7) Fix crash with IPVS sync protocol v0 and FTP, from Julian.

8) Reset sender cpu for forwarded traffic in IPVS, also from Julian.

9) Allocate template conntracks through kmalloc() to resolve netns dependency
   problems with the conntrack kmem_cache.

10) Fix zones with expectations that clash using the same tuple, from Joe
    Stringer.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

48516438

cgroup: net_cls: fix false-positive "suspicious RCU usage" · cc9f4daa

Konstantin Khlebnikov authored Jul 22, 2015

In dev_queue_xmit() net_cls protected with rcu-bh.

[  270.730026] ===============================
[  270.730029] [ INFO: suspicious RCU usage. ]
[  270.730033] 4.2.0-rc3+ #2 Not tainted
[  270.730036] -------------------------------
[  270.730040] include/linux/cgroup.h:353 suspicious rcu_dereference_check() usage!
[  270.730041] other info that might help us debug this:
[  270.730043] rcu_scheduler_active = 1, debug_locks = 1
[  270.730045] 2 locks held by dhclient/748:
[  270.730046]  #0:  (rcu_read_lock_bh){......}, at: [<ffffffff81682b70>] __dev_queue_xmit+0x50/0x960
[  270.730085]  #1:  (&qdisc_tx_lock){+.....}, at: [<ffffffff81682d60>] __dev_queue_xmit+0x240/0x960
[  270.730090] stack backtrace:
[  270.730096] CPU: 0 PID: 748 Comm: dhclient Not tainted 4.2.0-rc3+ #2
[  270.730098] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[  270.730100]  0000000000000001 ffff8800bafeba58 ffffffff817ad487 0000000000000007
[  270.730103]  ffff880232a0a780 ffff8800bafeba88 ffffffff810ca4f2 ffff88022fb23e00
[  270.730105]  ffff880232a0a780 ffff8800bafebb68 ffff8800bafebb68 ffff8800bafebaa8
[  270.730108] Call Trace:
[  270.730121]  [<ffffffff817ad487>] dump_stack+0x4c/0x65
[  270.730148]  [<ffffffff810ca4f2>] lockdep_rcu_suspicious+0xe2/0x120
[  270.730153]  [<ffffffff816a62d2>] task_cls_state+0x92/0xa0
[  270.730158]  [<ffffffffa00b534f>] cls_cgroup_classify+0x4f/0x120 [cls_cgroup]
[  270.730164]  [<ffffffff816aac74>] tc_classify_compat+0x74/0xc0
[  270.730166]  [<ffffffff816ab573>] tc_classify+0x33/0x90
[  270.730170]  [<ffffffffa00bcb0a>] htb_enqueue+0xaa/0x4a0 [sch_htb]
[  270.730172]  [<ffffffff81682e26>] __dev_queue_xmit+0x306/0x960
[  270.730174]  [<ffffffff81682b70>] ? __dev_queue_xmit+0x50/0x960
[  270.730176]  [<ffffffff816834a3>] dev_queue_xmit_sk+0x13/0x20
[  270.730185]  [<ffffffff81787770>] dev_queue_xmit+0x10/0x20
[  270.730187]  [<ffffffff8178b91c>] packet_snd.isra.62+0x54c/0x760
[  270.730190]  [<ffffffff8178be25>] packet_sendmsg+0x2f5/0x3f0
[  270.730203]  [<ffffffff81665245>] ? sock_def_readable+0x5/0x190
[  270.730210]  [<ffffffff817b64bb>] ? _raw_spin_unlock+0x2b/0x40
[  270.730216]  [<ffffffff8173bcbc>] ? unix_dgram_sendmsg+0x5cc/0x640
[  270.730219]  [<ffffffff8165f367>] sock_sendmsg+0x47/0x50
[  270.730221]  [<ffffffff8165f42f>] sock_write_iter+0x7f/0xd0
[  270.730232]  [<ffffffff811fd4c7>] __vfs_write+0xa7/0xf0
[  270.730234]  [<ffffffff811fe5b8>] vfs_write+0xb8/0x190
[  270.730236]  [<ffffffff811fe8c2>] SyS_write+0x52/0xb0
[  270.730239]  [<ffffffff817b6bae>] entry_SYSCALL_64_fastpath+0x12/0x76
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>

cc9f4daa

sch_choke: drop all packets in queue during reset · 77e62da6

WANG Cong authored Jul 21, 2015

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

77e62da6

sch_plug: purge buffered packets during reset · fe6bea7f

WANG Cong authored Jul 21, 2015

Otherwise the skbuff related structures are not correctly
refcount'ed.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fe6bea7f

Merge branch 'fib_select_default-fixes' · c42a6e8b

David S. Miller authored Jul 24, 2015

Julian Anastasov says:

====================
ipv4: fib_select_default changes

This patchset contains 2 changes for the alternative routes,
one to add tb_id/fa_slen check needed after the recent
fib_trie optimizations for fib aliases and the second
change attempts to support alternative routes with TOS
requirement.

	Sorry that I don't have access to the original
report from Hagen Paul Pfeifer. I hope he will see this
change.

	The second change adds fa_default field to the
fib aliases (which can be many) and if the feature to
filter the alternative routes by TOS is not worth it,
this second patch can be scrapped.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

c42a6e8b

ipv4: consider TOS in fib_select_default · 2392debc

Julian Anastasov authored Jul 22, 2015

fib_select_default considers alternative routes only when
res->fi is for the first alias in res->fa_head. In the
common case this can happen only when the initial lookup
matches the first alias with highest TOS value. This
prevents the alternative routes to require specific TOS.

This patch solves the problem as follows:

- routes that require specific TOS should be returned by
fib_select_default only when TOS matches, as already done
in fib_table_lookup. This rule implies that depending on the
TOS we can have many different lists of alternative gateways
and we have to keep the last used gateway (fa_default) in first
alias for the TOS instead of using single tb_default value.

- as the aliases are ordered by many keys (TOS desc,
fib_priority asc), we restrict the possible results to
routes with matching TOS and lowest metric (fib_priority)
and routes that match any TOS, again with lowest metric.

For example, packet with TOS 8 can not use gw3 (not lowest
metric), gw4 (different TOS) and gw6 (not lowest metric),
all other gateways can be used:

tos 8 via gw1 metric 2 <--- res->fa_head and res->fi
tos 8 via gw2 metric 2
tos 8 via gw3 metric 3
tos 4 via gw4
tos 0 via gw5
tos 0 via gw6 metric 1
Reported-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>

2392debc

ipv4: fib_select_default should match the prefix · 18a912e9

Julian Anastasov authored Jul 22, 2015

fib_trie starting from 4.1 can link fib aliases from
different prefixes in same list. Make sure the alternative
gateways are in same table and for same prefix (0) by
checking tb_id and fa_slen.

Fixes: 79e5ad2c ("fib_trie: Remove leaf_info")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>

18a912e9

22 Jul, 2015 14 commits

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · c5dfd654

Linus Torvalds authored Jul 22, 2015

Pull networking fixes from David Miller:

 1) Don't use shared bluetooth antenna in iwlwifi driver for management
    frames, from Emmanuel Grumbach.

 2) Fix device ID check in ath9k driver, from Felix Fietkau.

 3) Off by one in xen-netback BUG checks, from Dan Carpenter.

 4) Fix IFLA_VF_PORT netlink attribute validation, from Daniel Borkmann.

 5) Fix races in setting peeked bit flag in SKBs during datagram
    receive.  If it's shared we have to clone it otherwise the value can
    easily be corrupted.  Fix from Herbert Xu.

 6) Revert fec clock handling change, causes regressions.  From Fabio
    Estevam.

 7) Fix use after free in fq_codel and sfq packet schedulers, from WANG
    Cong.

 8) ipvlan bug fixes (memory leaks, missing rcu_dereference_bh, etc.)
    from WANG Cong and Konstantin Khlebnikov.

 9) Memory leak in act_bpf packet action, from Alexei Starovoitov.

10) ARM bpf JIT bug fixes from Nicolas Schichan.

11) Fix backwards compat of ANY_LAYOUT in virtio_net driver, from
    Michael S Tsirkin.

12) Destruction of bond with different ARP header types not handled
    correctly, fix from Nikolay Aleksandrov.

13) Revert GRO receive support in ipv6 SIT tunnel driver, causes
    regressions because the GRO packets created cannot be processed
    properly on the GSO side if we forward the frame.  From Herbert Xu.

14) TCCR update race and other fixes to ravb driver from Sergei
    Shtylyov.

15) Fix SKB leaks in caif_queue_rcv_skb(), from Eric Dumazet.

16) Fix panics on packet scheduler filter replace, from Daniel Borkmann.

17) Make sure AF_PACKET sees properly IP headers in defragmented frames
    (via PACKET_FANOUT_FLAG_DEFRAG option), from Edward Hyunkoo Jee.

18) AF_NETLINK cannot hold mutex in RCU callback, fix from Florian
    Westphal.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (84 commits)
  ravb: fix ring memory allocation
  net: phy: dp83867: Fix warning check for setting the internal delay
  openvswitch: allocate nr_node_ids flow_stats instead of num_possible_nodes
  netlink: don't hold mutex in rcu callback when releasing mmapd ring
  ARM: net: fix vlan access instructions in ARM JIT.
  ARM: net: handle negative offsets in BPF JIT.
  ARM: net: fix condition for load_order > 0 when translating load instructions.
  tcp: suppress a division by zero warning
  drivers: net: cpsw: remove tx event processing in rx napi poll
  inet: frags: fix defragmented packet's IP header for af_packet
  net: mvneta: fix refilling for Rx DMA buffers
  stmmac: fix setting of driver data in stmmac_dvr_probe
  sched: cls_flow: fix panic on filter replace
  sched: cls_flower: fix panic on filter replace
  sched: cls_bpf: fix panic on filter replace
  net/mdio: fix mdio_bus_match for c45 PHY
  net: ratelimit warnings about dst entry refcount underflow or overflow
  caif: fix leaks and race in caif_queue_rcv_skb()
  qmi_wwan: add the second QMI/network interface for Sierra Wireless MC7305/MC7355
  ravb: fix race updating TCCR
  ...

c5dfd654

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 5a5ca73a

Linus Torvalds authored Jul 22, 2015

Pull ARM64 fixes from Catalin Marinas:

 - arm64 build fix following the move of the thread_struct to the end of
   task_struct and the asm offsets becoming too large for the AArch64
   ISA

 - preparatory patch for moving irq_data struct members (applied now to
   reduce dependency for the next merging window)

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  ARM64/irq: Use access helper irq_data_get_affinity_mask()
  arm64: switch_to: calculate cpu context pointer using separate register

5a5ca73a

netfilter: nf_conntrack: Support expectations in different zones · 4b31814d

Joe Stringer authored Jul 21, 2015

When zones were originally introduced, the expectation functions were
all extended to perform lookup using the zone. However, insertion was
not modified to check the zone. This means that two expectations which
are intended to apply for different connections that have the same tuple
but exist in different zones cannot both be tracked.

Fixes: 5d0aa2cc (netfilter: nf_conntrack: add support for "conntrack zones")
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

4b31814d

ARM64/irq: Use access helper irq_data_get_affinity_mask() · 3bc38fc1

Jiang Liu authored Jul 13, 2015

This is a preparatory patch for moving irq_data struct members.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org>
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

3bc38fc1

arm64: switch_to: calculate cpu context pointer using separate register · c0d3fce5

Will Deacon authored Jul 20, 2015

Commit 0c8c0f03 ("x86/fpu, sched: Dynamically allocate 'struct fpu'")
moved the thread_struct to the bottom of task_struct. As a result, the
offset is now too large to be used in an immediate add on arm64 with
some kernel configs:

arch/arm64/kernel/entry.S: Assembler messages:
arch/arm64/kernel/entry.S:588: Error: immediate out of range
arch/arm64/kernel/entry.S:597: Error: immediate out of range

This patch calculates the offset using an additional register instead of
an immediate offset.

Fixes: 0c8c0f03 ("x86/fpu, sched: Dynamically allocate 'struct fpu'")
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

c0d3fce5

ravb: fix ring memory allocation · d8b48911

Sergei Shtylyov authored Jul 22, 2015

The driver is written as if it can adapt to a low memory situation allocating
less RX skbs and TX aligned buffers than the respective RX/TX ring sizes. In
reality though the driver would malfunction in this case. Stop being overly
smart and just fail in such situation -- this is achieved by moving the memory
allocation from ravb_ring_format() to ravb_ring_init().

We leave dma_map_single() calls in place but make their failure non-fatal
by marking the corresponding RX descriptors with zero data size which should
prevent DMA to an invalid addresses.
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d8b48911

net: phy: dp83867: Fix warning check for setting the internal delay · a46fa260

Dan Murphy authored Jul 21, 2015

Fix warning: logical ‘or’ of collectively exhaustive tests is always true

Change the internal delay check from an 'or' condition to an 'and'
condition.
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Dan Murphy <dmurphy@ti.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a46fa260

openvswitch: allocate nr_node_ids flow_stats instead of num_possible_nodes · bac541e4

Chris J Arges authored Jul 21, 2015

Some architectures like POWER can have a NUMA node_possible_map that
contains sparse entries. This causes memory corruption with openvswitch
since it allocates flow_cache with a multiple of num_possible_nodes() and
assumes the node variable returned by for_each_node will index into
flow->stats[node].

Use nr_node_ids to allocate a maximal sparse array instead of
num_possible_nodes().

The crash was noticed after 3af229f2 was applied as it changed the
node_possible_map to match node_online_map on boot.
Fixes: 3af229f2Signed-off-by: Chris J Arges <chris.j.arges@canonical.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bac541e4

netlink: don't hold mutex in rcu callback when releasing mmapd ring · 0470eb99

Florian Westphal authored Jul 21, 2015

Kirill A. Shutemov says:

This simple test-case trigers few locking asserts in kernel:

int main(int argc, char **argv)
{
        unsigned int block_size = 16 * 4096;
        struct nl_mmap_req req = {
                .nm_block_size          = block_size,
                .nm_block_nr            = 64,
                .nm_frame_size          = 16384,
                .nm_frame_nr            = 64 * block_size / 16384,
        };
        unsigned int ring_size;
	int fd;

	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
        if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
                exit(1);
        if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
                exit(1);

	ring_size = req.nm_block_nr * req.nm_block_size;
	mmap(NULL, 2 * ring_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	return 0;
}

+++ exited with 0 +++
BUG: sleeping function called from invalid context at /home/kas/git/public/linux-mm/kernel/locking/mutex.c:616
in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init
3 locks held by init/1:
 #0:  (reboot_mutex){+.+...}, at: [<ffffffff81080959>] SyS_reboot+0xa9/0x220
 #1:  ((reboot_notifier_list).rwsem){.+.+..}, at: [<ffffffff8107f379>] __blocking_notifier_call_chain+0x39/0x70
 #2:  (rcu_callback){......}, at: [<ffffffff810d32e0>] rcu_do_batch.isra.49+0x160/0x10c0
Preemption disabled at:[<ffffffff8145365f>] __delay+0xf/0x20

CPU: 1 PID: 1 Comm: init Not tainted 4.1.0-00009-gbddf4c4818e0 #253
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Debian-1.8.2-1 04/01/2014
 ffff88017b3d8000 ffff88027bc03c38 ffffffff81929ceb 0000000000000102
 0000000000000000 ffff88027bc03c68 ffffffff81085a9d 0000000000000002
 ffffffff81ca2a20 0000000000000268 0000000000000000 ffff88027bc03c98
Call Trace:
 <IRQ>  [<ffffffff81929ceb>] dump_stack+0x4f/0x7b
 [<ffffffff81085a9d>] ___might_sleep+0x16d/0x270
 [<ffffffff81085bed>] __might_sleep+0x4d/0x90
 [<ffffffff8192e96f>] mutex_lock_nested+0x2f/0x430
 [<ffffffff81932fed>] ? _raw_spin_unlock_irqrestore+0x5d/0x80
 [<ffffffff81464143>] ? __this_cpu_preempt_check+0x13/0x20
 [<ffffffff8182fc3d>] netlink_set_ring+0x1ed/0x350
 [<ffffffff8182e000>] ? netlink_undo_bind+0x70/0x70
 [<ffffffff8182fe20>] netlink_sock_destruct+0x80/0x150
 [<ffffffff817e484d>] __sk_free+0x1d/0x160
 [<ffffffff817e49a9>] sk_free+0x19/0x20
[..]

Cong Wang says:

We can't hold mutex lock in a rcu callback, [..]

Thomas Graf says:

The socket should be dead at this point. It might be simpler to
add a netlink_release_ring() function which doesn't require
locking at all.
Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Diagnosed-by: Cong Wang <cwang@twopensource.com>
Suggested-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

0470eb99

Merge branch 'arm-bpf-fixes' · 7c8cbaca

David S. Miller authored Jul 21, 2015

Nicolas Schichan says:

====================
BPF JIT fixes for ARM

These patches are fixing bugs in the ARM JIT and should probably find
their way to a stable kernel. All 60 test_bpf tests in Linux 4.1 release
are now passing OK (was 54 out of 60 before).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

7c8cbaca

ARM: net: fix vlan access instructions in ARM JIT. · c18fe54b

Nicolas Schichan authored Jul 21, 2015

This makes BPF_ANC | SKF_AD_VLAN_TAG and BPF_ANC | SKF_AD_VLAN_TAG_PRESENT
have the same behaviour as the in kernel VM and makes the test_bpf LD_VLAN_TAG
and LD_VLAN_TAG_PRESENT tests pass.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

c18fe54b

ARM: net: handle negative offsets in BPF JIT. · 6d715e30

Nicolas Schichan authored Jul 21, 2015

Previously, the JIT would reject negative offsets known during code
generation and mishandle negative offsets provided at runtime.

Fix that by calling bpf_internal_load_pointer_neg_helper()
appropriately in the jit_get_skb_{b,h,w} slow path helpers and by forcing
the execution flow to the slow path helpers when the offset is
negative.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

6d715e30

ARM: net: fix condition for load_order > 0 when translating load instructions. · 7aed35cb

Nicolas Schichan authored Jul 21, 2015

To check whether the load should take the fast path or not, the code
would check that (r_skb_hlen - load_order) is greater than the offset
of the access using an "Unsigned higher or same" condition. For
halfword accesses and an skb length of 1 at offset 0, that test is
valid, as we end up comparing 0xffffffff(-1) and 0, so the fast path
is taken and the filter allows the load to wrongly succeed. A similar
issue exists for word loads at offset 0 and an skb length of less than
4.

Fix that by using the condition "Signed greater than or equal"
condition for the fast path code for load orders greater than 0.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

7aed35cb

tcp: suppress a division by zero warning · 89e478a2

Eric Dumazet authored Jul 22, 2015

Andrew Morton reported following warning on one ARM build
with gcc-4.4 :

net/ipv4/inet_hashtables.c: In function 'inet_ehash_locks_alloc':
net/ipv4/inet_hashtables.c:617: warning: division by zero

Even guarded with a test on sizeof(spinlock_t), compiler does not
like current construct on a !CONFIG_SMP build.

Remove the warning by using a temporary variable.

Fixes: 095dc8e0 ("tcp: fix/cleanup inet_ehash_locks_alloc()")
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

89e478a2

21 Jul, 2015 11 commits

Revert "fsnotify: fix oops in fsnotify_clear_marks_by_group_flags()" · d725e66c

Linus Torvalds authored Jul 21, 2015

This reverts commit a2673b6e.

Kinglong Mee reports a memory leak with that patch, and Jan Kara confirms:

 "Thanks for report! You are right that my patch introduces a race
  between fsnotify kthread and fsnotify_destroy_group() which can result
  in leaking inotify event on group destruction.

  I haven't yet decided whether the right fix is not to queue events for
  dying notification group (as that is pointless anyway) or whether we
  should just fix the original problem differently...  Whenever I look
  at fsnotify code mark handling I get lost in the maze of locks, lists,
  and subtle differences between how different notification systems
  handle notification marks :( I'll think about it over night"

and after thinking about it, Jan says:

 "OK, I have looked into the code some more and I found another
  relatively simple way of fixing the original oops.  It will be IMHO
  better than trying to fixup this issue which has more potential for
  breakage.  I'll ask Linus to revert the fsnotify fix he already merged
  and send a new fix"
Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Requested-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d725e66c

Merge tag 'wireless-drivers-for-davem-2015-07-20' of... · 0bccece5

David S. Miller authored Jul 21, 2015

Merge tag 'wireless-drivers-for-davem-2015-07-20' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers

Kalle Valo says:

====================
ath9k:

* fix device ID check for AR956x

iwlwifi:

* bug fixes specific for 8000 series
* fix a crash in time events
* fix a crash in PCIe transport
* fix BT Coex code that prevented association on certain
  devices (3160).
* revert the new RBD allocation model because it introduced
  a bug when running on weak VM setups.
* new device IDs
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

0bccece5

Merge tag 'pinctrl-v4.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 71ebd1af

Linus Torvalds authored Jul 21, 2015

Pull pin control fixes from Linus Walleij:
 "Here are some overly ripe pin control fixes for the v4.2 series.

  They got delayed because of various crap commits and having to clean
  and rinse the patch stack a few times.  Now they are however looking
  good.

   - some dead defines dropped from the Samsung driver, was targeted for
     -rc2 but got delayed
   - drop the strict mode from abx500, this was too strict
   - fix the R-Car sparse IRQs code to work as intended
   - fix the IRQ code for the pinctrl-single GPIO backend to not enforce
     threaded IRQs
   - clear the latched events/IRQs for the Broadcom BCM2835 driver
   - fix up debugfs for the Freescale imx1 driver
   - fix a typo bug in the Schmitt Trigger setup in the LPC18xx driver"

* tag 'pinctrl-v4.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: lpc18xx: fix schmitt trigger setup
  Subject: pinctrl: imx1-core: Fix debug output in .pin_config_set callback
  pinctrl: bcm2835: Clear the event latch register when disabling interrupts
  pinctrl: single: ensure pcs irq will not be forced threaded
  sh-pfc: fix sparse GPIOs for R-Car SoCs
  pinctrl: abx500: remove strict mode
  pinctrl: samsung: Remove old unused defines

71ebd1af

Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs · 8426fb30

Linus Torvalds authored Jul 21, 2015

Pull UDF fix from Jan Kara:
 "A fix for UDF corruption when certain disk-format feature is enabled"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Don't corrupt unalloc spacetable when writing it

8426fb30

Merge tag 'trace-v4.2-rc2-fix2' of... · 1ad474de

Linus Torvalds authored Jul 21, 2015

Merge tag 'trace-v4.2-rc2-fix2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing sample code fix from Steven Rostedt:
 "He Kuang noticed that the sample code using the trace_event helper
  function __get_dynamic_array_len() is broken.

  This only changes the sample code, and I'm pushing this now instead of
  later because I don't want others using the broken code as an example
  when using it for real"

* tag 'trace-v4.2-rc2-fix2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix sample output of dynamic arrays

1ad474de

drivers: net: cpsw: remove tx event processing in rx napi poll · 1e353cdd

Mugunthan V N authored Jul 21, 2015

With commit c03abd84 ("net: ethernet: cpsw: don't requests IRQs
we don't use") common isr and napi are separated into separate tx isr
and rx isr/napi, but still in rx napi tx events are handled. So removing
the tx event handling in rx napi.
Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1e353cdd

inet: frags: fix defragmented packet's IP header for af_packet · 0848f642

Edward Hyunkoo Jee authored Jul 21, 2015

When ip_frag_queue() computes positions, it assumes that the passed
sk_buff does not contain L2 headers.

However, when PACKET_FANOUT_FLAG_DEFRAG is used, IP reassembly
functions can be called on outgoing packets that contain L2 headers.

Also, IPv4 checksum is not corrected after reassembly.

Fixes: 7736d33f ("packet: Add pre-defragmentation support for ipv4 fanouts.")
Signed-off-by: Edward Hyunkoo Jee <edjee@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0848f642

net: mvneta: fix refilling for Rx DMA buffers · a84e3289

Simon Guinot authored Jul 19, 2015

With the actual code, if a memory allocation error happens while
refilling a Rx descriptor, then the original Rx buffer is both passed
to the networking stack (in a SKB) and let in the Rx ring. This leads
to various kernel oops and crashes.

As a fix, this patch moves Rx descriptor refilling ahead of building
SKB with the associated Rx buffer. In case of a memory allocation
failure, data is dropped and the original DMA buffer is put back into
the Rx ring.
Signed-off-by: Simon Guinot <simon.guinot@sequanux.org>
Fixes: c5aff182 ("net: mvneta: driver for Marvell Armada 370/XP network unit")
Cc: <stable@vger.kernel.org> # v3.8+
Tested-by: Yoann Sculo <yoann@sculo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

a84e3289

stmmac: fix setting of driver data in stmmac_dvr_probe · a7a62685

Joachim Eastwood authored Jul 17, 2015

Commit 803f8fc4 ("stmmac: move driver data setting into
stmmac_dvr_probe") mistakenly set priv and not priv->dev as
driver data. This meant that the remove, resume and suspend
callbacks that fetched and tried to use this data would most
likely explode. Fix the issue by using the correct variable.

Fixes: 803f8fc4 ("stmmac: move driver data setting into stmmac_dvr_probe")
Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a7a62685

Merge branch 'sch_panic' · 053c26f3

David S. Miller authored Jul 21, 2015

Daniel Borkmann says:

====================
Couple of classifier fixes

This fixes a couple of panics in the form of (analogous for
cls_flow{,er}):

[  912.759276] BUG: unable to handle kernel NULL pointer dereference at (null)
[  912.759373] IP: [<ffffffffa09d4d6d>] cls_bpf_change+0x23d/0x268 [cls_bpf]
[  912.759441] PGD 8783c067 PUD 5f684067 PMD 0
[  912.759491] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
[  912.759543] Modules linked in: cls_bpf(E) act_gact [...]
[  912.772734] CPU: 3 PID: 10489 Comm: tc Tainted: G        W   E   4.2.0-rc2+ #73
[  912.775004] Hardware name: Apple Inc. MacBookAir5,1/Mac-66F35F19FE2A0D05, BIOS MBA51.88Z.00EF.B02.1211271028 11/27/2012
[  912.777327] task: ffff88025eaa8000 ti: ffff88005f734000 task.ti: ffff88005f734000
[  912.779662] RIP: 0010:[<ffffffffa09d4d6d>]  [<ffffffffa09d4d6d>] cls_bpf_change+0x23d/0x268 [cls_bpf]
[  912.781991] RSP: 0018:ffff88005f7379c8  EFLAGS: 00010286
[  912.784183] RAX: ffff880201d64e48 RBX: 0000000000000000 RCX: ffff880201d64e40
[  912.786402] RDX: 0000000000000000 RSI: ffffffffa09d51c0 RDI: ffffffffa09d51a6
[  912.788625] RBP: ffff88005f737a68 R08: 0000000000000000 R09: 0000000000000000
[  912.790854] R10: 0000000000000001 R11: 0000000000000001 R12: ffff880078ab5a80
[  912.793082] R13: ffff880232b31570 R14: ffff88005f737ae0 R15: ffff8801e215d1d0
[  912.795181] FS:  00007f3c0c80d740(0000) GS:ffff880265400000(0000) knlGS:0000000000000000
[  912.797281] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  912.799402] CR2: 0000000000000000 CR3: 000000005460f000 CR4: 00000000001407e0
[  912.799403] Stack:
[  912.799407]  ffffffff00000000 ffff88023ea18000 000000005f737a08 0000000000000000
[  912.799415]  ffffffff81f06140 ffff880201d64e40 0000000000000000 ffff88023ea1804c
[  912.799418]  0000000000000000 ffff88023ea18044 ffff88023ea18030 ffff88023ea18038
[  912.799418] Call Trace:
[  912.799437]  [<ffffffff816d5685>] tc_ctl_tfilter+0x335/0x910
[  912.799443]  [<ffffffff813622a8>] ? security_capable+0x48/0x60
[  912.799448]  [<ffffffff816b90e5>] rtnetlink_rcv_msg+0x95/0x240
[  912.799454]  [<ffffffff810f612d>] ? trace_hardirqs_on+0xd/0x10
[  912.799456]  [<ffffffff816b902f>] ? rtnetlink_rcv+0x1f/0x40
[  912.799459]  [<ffffffff816b902f>] ? rtnetlink_rcv+0x1f/0x40
[  912.799461]  [<ffffffff816b9050>] ? rtnetlink_rcv+0x40/0x40
[  912.799464]  [<ffffffff816df38f>] netlink_rcv_skb+0xaf/0xc0
[  912.799467]  [<ffffffff816b903e>] rtnetlink_rcv+0x2e/0x40
[  912.799469]  [<ffffffff816deaef>] netlink_unicast+0xef/0x1b0
[  912.799471]  [<ffffffff816defa0>] netlink_sendmsg+0x3f0/0x620
[  912.799476]  [<ffffffff81687028>] sock_sendmsg+0x38/0x50
[  912.799479]  [<ffffffff81687938>] ___sys_sendmsg+0x288/0x290
[  912.799482]  [<ffffffff810f7852>] ? __lock_acquire+0x572/0x2050
[  912.799488]  [<ffffffff810265db>] ? native_sched_clock+0x2b/0x90
[  912.799493]  [<ffffffff8116135f>] ? __audit_syscall_entry+0xaf/0x100
[  912.799497]  [<ffffffff8116135f>] ? __audit_syscall_entry+0xaf/0x100
[  912.799501]  [<ffffffff8112aa19>] ? current_kernel_time+0x69/0xd0
[  912.799505]  [<ffffffff81266f16>] ? __fget_light+0x66/0x90
[  912.799508]  [<ffffffff81688812>] __sys_sendmsg+0x42/0x80
[  912.799510]  [<ffffffff81688862>] SyS_sendmsg+0x12/0x20
[  912.799515]  [<ffffffff817f9a6e>] entry_SYSCALL_64_fastpath+0x12/0x76
[  912.799540] Code: 4d 88 49 8b 57 08 48 89 51 08 49 8b 57 10 48 89 c8 48 83 c0 08 48
                     89 51 10 48 8b 51 10 48 c7 c6 c0 51 9d a0 48 c7 c7 a6 51 9d a0 <48>
                     89 02 48 8b 51 08 48 89 42 08 48 b8 00 02 20 00 00 00 ad de
[  912.799544] RIP  [<ffffffffa09d4d6d>] cls_bpf_change+0x23d/0x268 [cls_bpf]
[  912.799544]  RSP <ffff88005f7379c8>
[  912.799545] CR2: 0000000000000000
[  912.807380] ---[ end trace a6440067cfdc7c29 ]---

I've split them into 3 patches, so they can be backported easier
when needed.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

053c26f3

sched: cls_flow: fix panic on filter replace · 32b2f4b1

Daniel Borkmann authored Jul 17, 2015

The following test case causes a NULL pointer dereference in cls_flow:

tc filter add dev foo parent 1: handle 0x1 flow hash keys dst action ok
tc filter replace dev foo parent 1: pref 49152 handle 0x1 \
flow hash keys mark action drop

To be more precise, actually two different panics are fixed, the first
occurs because tcf_exts_init() is not called on the newly allocated
filter when we do a replace. And the second panic uncovered after that
happens since the arguments of list_replace_rcu() are swapped, the old
element needs to be the first argument and the new element the second.

Fixes: 70da9f0b ("net: sched: cls_flow use RCU")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

32b2f4b1