1. 03 Mar, 2016 27 commits
    • bpf: fix branch offset adjustment on backjumps after patching ctx expansion · a34f2f9f
      Daniel Borkmann authored
      [ Upstream commit a1b14d27 ]
      
      When ctx access is used, the kernel often needs to expand/rewrite
      instructions, so after that patching, branch offsets have to be
      adjusted for both forward and backward jumps in the new eBPF program,
      but for backward jumps it fails to account for the delta. Meaning, for
      example, if the expansion happens exactly on the insn that sits at
      the jump target, it doesn't fix up the back jump offset.
      
      Analysis on what the check in adjust_branches() is currently doing:
      
        /* adjust offset of jmps if necessary */
        if (i < pos && i + insn->off + 1 > pos)
          insn->off += delta;
        else if (i > pos && i + insn->off + 1 < pos)
          insn->off -= delta;
      
      First condition (forward jumps):
      
        Before:                         After:
      
        insns[0]                        insns[0]
        insns[1] <--- i/insn            insns[1] <--- i/insn
        insns[2] <--- pos               insns[P] <--- pos
        insns[3]                        insns[P]  `------| delta
        insns[4] <--- target_X          insns[P]   `-----|
        insns[5]                        insns[3]
                                        insns[4] <--- target_X
                                        insns[5]
      
      First case is if we cross pos-boundary and the jump instruction was
      before pos. This is handled correctly. I.e. if i == pos, then this
      would mean our jump that we currently check was the patchlet itself
      that we just injected. Since such patchlets are self-contained and
      have no awareness of any insns before or after the patched one, the
      delta is correctly not adjusted. Also, for the second condition, the
      case of i + insn->off + 1 == pos means we jump to that newly patched
      instruction, so no offset adjustment is needed. That part is correct.
      
      Second condition (backward jumps):
      
        Before:                         After:
      
        insns[0]                        insns[0]
        insns[1] <--- target_X          insns[1] <--- target_X
        insns[2] <--- pos <-- target_Y  insns[P] <--- pos <-- target_Y
        insns[3]                        insns[P]  `------| delta
        insns[4] <--- i/insn            insns[P]   `-----|
        insns[5]                        insns[3]
                                        insns[4] <--- i/insn
                                        insns[5]
      
      Second interesting case is where we cross pos-boundary and the jump
      instruction was after pos. Backward jump with i == pos would be
      impossible and pose a bug somewhere in the patchlet, so the first
      condition checking i > pos is okay only by itself. However, i +
      insn->off + 1 < pos does not always work as intended to trigger the
      adjustment. It works when jump targets would be far off where the
      delta wouldn't matter. But, for example, where the fixed insn->off
      before pointed to pos (target_Y), it now points to pos + delta, so
      that additional room needs to be taken into account for the check.
      This means that i) both tests here need to be adjusted into pos + delta,
      and ii) for the second condition, the test needs to be <= as pos
      itself can be a target in the backjump, too.
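
      The corrected check described in i) and ii) can be tried out in
      isolation. The following is a minimal user-space sketch, not the
      kernel code: struct toy_insn and the toy_* functions are
      hypothetical stand-ins, and only the two conditions are taken from
      the analysis above (with pos + delta and <= in the backward case).

```c
#include <assert.h>

/* Hypothetical stand-in for an eBPF instruction: only what the check needs. */
struct toy_insn {
	int is_jmp;	/* non-zero if this is a relative jump */
	int off;	/* jump target index = own index + off + 1 */
};

/*
 * Fix up jump offsets after the insn at 'pos' grew by 'delta' extra insns.
 * 'insns' is the already patched program of 'len' instructions.
 */
void toy_adjust_branches(struct toy_insn *insns, int len, int pos, int delta)
{
	for (int i = 0; i < len; i++) {
		struct toy_insn *insn = &insns[i];

		if (!insn->is_jmp)
			continue;
		/* forward jump that crosses the patched region */
		if (i < pos && i + insn->off + 1 > pos)
			insn->off += delta;
		/* backward jump over, or exactly onto, the patched insn */
		else if (i > pos + delta && i + insn->off + 1 <= pos + delta)
			insn->off -= delta;
	}
}

/* Replays both diagrams with pos = 2 and delta = 2 (8 insns after patching). */
void toy_adjust_branches_selftest(void)
{
	struct toy_insn prog[8] = { 0 };

	prog[1].is_jmp = 1; prog[1].off = 2;	/* forward: was 1 -> 4 (target_X) */
	prog[6].is_jmp = 1; prog[6].off = -3;	/* backward: was 4 -> 2 (target_Y) */
	toy_adjust_branches(prog, 8, 2, 2);
	assert(1 + prog[1].off + 1 == 6);	/* target_X now sits at index 6 */
	assert(6 + prog[6].off + 1 == 2);	/* target_Y is still pos */
}
```

      With the old test (i > pos && i + insn->off + 1 < pos) the backward
      jump in this sketch would be left untouched and would land on
      pos + delta instead of pos.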
      
      Fixes: 9bac3d6d ("bpf: allow extended BPF programs access skb fields")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • flow_dissector: Fix unaligned access in __skb_flow_dissector when used by eth_get_headlen · b083b36c
      Alexander Duyck authored
      [ Upstream commit 461547f3 ]
      
      This patch fixes an issue with unaligned accesses when using
      eth_get_headlen on a page that was DMA aligned instead of being IP aligned.
      The fact is when trying to check the length we don't need to be looking at
      the flow label so we can reorder the checks to first check if we are
      supposed to gather the flow label and then make the call to actually get
      it.
      
      v2:  Updated path so that either STOP_AT_FLOW_LABEL or KEY_FLOW_LABEL can
           cause us to check for the flow label.
      Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: Copy inner L3 and L4 headers as unaligned on GRE TEB · e3865b8b
      Alexander Duyck authored
      [ Upstream commit 78565208 ]
      
      This patch corrects the unaligned accesses seen on GRE TEB tunnels when
      generating hash keys.  Specifically what this patch does is make it so that
      we force the use of skb_copy_bits when the GRE inner headers will be
      unaligned due to NET_IP_ALIGN being a non-zero value.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sctp: translate network order to host order when users get a hmacid · 2038fb6f
      Xin Long authored
      [ Upstream commit 7a84bd46 ]
      
      Commit ed5a377d ("sctp: translate host order to network order when
      setting a hmacid") corrected the hmacid byte-order when setting a hmacid,
      but the same issue also exists on getting a hmacid.
      
      We fix it by changing hmacids to host order when users get them with
      getsockopt.
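
      As a rough user-space illustration of that conversion (a sketch
      only, not the kernel change): toy_hmacids_to_host() is a
      hypothetical helper that copies identifiers kept in network order
      into a host-order buffer with ntohs().

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy 'n' 16-bit identifiers stored in network order into host order. */
void toy_hmacids_to_host(const uint16_t *net_ids, uint16_t *host_ids, size_t n)
{
	for (size_t i = 0; i < n; i++)
		host_ids[i] = ntohs(net_ids[i]);
}

/* Round trip: what was stored with htons() comes back unchanged. */
void toy_hmacids_selftest(void)
{
	uint16_t stored[2] = { htons(1), htons(3) };
	uint16_t out[2] = { 0, 0 };

	toy_hmacids_to_host(stored, out, 2);
	assert(out[0] == 1 && out[1] == 3);
}
```

      On a big-endian host both calls are no-ops, which is why a missing
      conversion of this kind only shows up on little-endian machines.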
      
      Fixes: Commit ed5a377d ("sctp: translate host order to network order when setting a hmacid")
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • enic: increment devcmd2 result ring in case of timeout · ff914007
      Sandeep Pillai authored
      [ Upstream commit ca7f41a4 ]
      
      Firmware posts the devcmd result in result ring. In case of timeout, driver
      does not increment the current result pointer and firmware could post the
      result after timeout has occurred. During next devcmd, driver would be
      reading the result of previous devcmd.
      
      Fix this by incrementing result even in case of timeout.
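
      The idea can be sketched with a toy ring. Everything below is a
      hypothetical model (struct toy_result_ring, toy_devcmd_wait() and
      the 'completed' flag are not driver names); it only shows the slot
      being consumed on the timeout path as well.

```c
#include <assert.h>

/* Hypothetical result ring: 'next' is the slot the driver reads next. */
struct toy_result_ring {
	unsigned int next;
	unsigned int size;
};

/*
 * Finish one command. 'completed' is 0 when the wait timed out.
 * Returns 0 on success and -1 on timeout.
 */
int toy_devcmd_wait(struct toy_result_ring *ring, int completed)
{
	int err = completed ? 0 : -1;

	/*
	 * Advance in both cases: a result that arrives after the timeout
	 * then lands in a slot the next command will not read.
	 */
	ring->next = (ring->next + 1) % ring->size;
	return err;
}

void toy_devcmd_selftest(void)
{
	struct toy_result_ring ring = { 0, 4 };

	assert(toy_devcmd_wait(&ring, 0) == -1);	/* timeout */
	assert(ring.next == 1);				/* slot still consumed */
	assert(toy_devcmd_wait(&ring, 1) == 0);
	assert(ring.next == 2);
}
```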
      
      Fixes: 373fb087 ("enic: add devcmd2")
      Signed-off-by: Sandeep Pillai <sanpilla@cisco.com>
      Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tg3: Fix for tg3 transmit queue 0 timed out when too many gso_segs · 98673eb0
      Siva Reddy Kallam authored
      [ Upstream commit b7d98729 ]
      
      tg3_tso_bug() can hit a condition where the entire tx ring is not big
      enough to segment the GSO packet. For example, if MSS is very small,
      gso_segs can exceed the tx ring size. When we hit the condition, it
      will cause tx timeout.
      
      tg3_tso_bug() is called to handle TSO and DMA hardware bugs.
      For TSO bugs, if tg3_tso_bug() cannot succeed, we have to drop the packet.
      For DMA bugs, we can still fall back to linearize the SKB and let the
      hardware transmit the TSO packet.
      
      This patch adds a function tg3_tso_bug_gso_check() to check if there
      are enough tx descriptors for GSO before calling tg3_tso_bug().
      The caller will then handle the error appropriately - drop or
      linearize the SKB.
      
      v2: Corrected patch description to avoid confusion.
      Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
      Signed-off-by: Michael Chan <mchan@broadcom.com>
      Acked-by: Prashant Sreedharan <prashant@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net:Add sysctl_max_skb_frags · 1bec5f40
      Hans Westgaard Ry authored
      [ Upstream commit 5f74f82e ]
      
      Devices may have limits on the number of fragments in an skb they support.
      Current codebase uses a constant as maximum for number of fragments one
      skb can hold and use.
      When enabling scatter/gather and running traffic with many small messages
      the codebase uses the maximum number of fragments and may thereby violate
      the max for certain devices.
      The patch introduces a global variable as max number of fragments.
      Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
      Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tcp: do not drop syn_recv on all icmp reports · 2679161c
      Eric Dumazet authored
      [ Upstream commit 9cf74903 ]
      
      Petr Novopashenniy reported that ICMP redirects on SYN_RECV sockets
      were leading to RST.
      
      This is of course incorrect.
      
      A specific list of ICMP messages should be able to drop a SYN_RECV.
      
      For instance, a REDIRECT on SYN_RECV shall be ignored, as we do
      not hold a dst per SYN_RECV pseudo request.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111751
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Reported-by: Petr Novopashenniy <pety@rusnet.ru>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • unix: correctly track in-flight fds in sending process user_struct · 3ba9b9f2
      Hannes Frederic Sowa authored
      [ Upstream commit 415e3d3e ]
      
      The commit referenced in the Fixes tag incorrectly accounted the number
      of in-flight fds over a unix domain socket to the original opener
      of the file-descriptor. This allows another process to arbitrarily
      deplete the original file-opener's resource limit for the maximum of
      open files. Instead the sending process and its struct cred should
      be credited.
      
      To do so, we add a reference counted struct user_struct pointer to the
      scm_fp_list and use it to account for the number of inflight unix fds.
      
      Fixes: 712f4aad ("unix: properly account for FDs passed over unix sockets")
      Reported-by: David Herrmann <dh.herrmann@gmail.com>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ipv6: fix a lockdep splat · 4c890233
      Eric Dumazet authored
      [ Upstream commit 44c3d0c1 ]
      
      Silence lockdep false positive about rcu_dereference() being
      used in the wrong context.
      
      First one should use rcu_dereference_protected() as we own the spinlock.
      
      Second one should be a normal assignment, as no barrier is needed.
      
      Fixes: 18367681 ("ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.")
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ipv6: addrconf: Fix recursive spin lock call · cdbc6682
      subashab@codeaurora.org authored
      [ Upstream commit 16186a82 ]
      
      An RCU stall with the following backtrace was seen on a system with
      forwarding, optimistic_dad and use_optimistic set. To reproduce,
      set these flags and allow ipv6 autoconf.
      
      This occurs because the device write_lock is acquired while already
      holding the read_lock. Back trace below -
      
      INFO: rcu_preempt self-detected stall on CPU { 1}  (t=2100 jiffies
       g=3992 c=3991 q=4471)
      <6> Task dump for CPU 1:
      <2> kworker/1:0     R  running task    12168    15   2 0x00000002
      <2> Workqueue: ipv6_addrconf addrconf_dad_work
      <6> Call trace:
      <2> [<ffffffc000084da8>] el1_irq+0x68/0xdc
      <2> [<ffffffc000cc4e0c>] _raw_write_lock_bh+0x20/0x30
      <2> [<ffffffc000bc5dd8>] __ipv6_dev_ac_inc+0x64/0x1b4
      <2> [<ffffffc000bcbd2c>] addrconf_join_anycast+0x9c/0xc4
      <2> [<ffffffc000bcf9f0>] __ipv6_ifa_notify+0x160/0x29c
      <2> [<ffffffc000bcfb7c>] ipv6_ifa_notify+0x50/0x70
      <2> [<ffffffc000bd035c>] addrconf_dad_work+0x314/0x334
      <2> [<ffffffc0000b64c8>] process_one_work+0x244/0x3fc
      <2> [<ffffffc0000b7324>] worker_thread+0x2f8/0x418
      <2> [<ffffffc0000bb40c>] kthread+0xe0/0xec
      
      v2: do addrconf_dad_kick inside read lock and then acquire write
      lock for ipv6_ifa_notify as suggested by Eric
      
      Fixes: 7fd2561e ("net: ipv6: Add a sysctl to make optimistic
      addresses useful candidates")
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Erik Kline <ek@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ipv6/udp: use sticky pktinfo egress ifindex on connect() · e1c4e14b
      Paolo Abeni authored
      [ Upstream commit 1cdda918 ]
      
      Currently, the egress interface index specified via IPV6_PKTINFO
      is ignored by __ip6_datagram_connect(), so that RFC 3542 section 6.7
      can be subverted when the user space application calls connect()
      before sendmsg().
      Fix it by initializing properly flowi6_oif in connect() before
      performing the route lookup.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ipv6: enforce flowi6_oif usage in ip6_dst_lookup_tail() · e8e729cc
      Paolo Abeni authored
      [ Upstream commit 6f21c96a ]
      
      The current implementation of ip6_dst_lookup_tail() basically
      ignores the egress ifindex match: if the saddr is set,
      ip6_route_output() purposefully ignores flowi6_oif, due
      to the commit d46a9d67 ("net: ipv6: Dont add RT6_LOOKUP_F_IFACE
      flag if saddr set"); if the saddr is 'any', the first route lookup
      in ip6_dst_lookup_tail() fails, but upon failure a second lookup will
      be performed with saddr set, thus ignoring the ifindex constraint.
      
      This commit adds an output route lookup function variant, which
      allows the caller to specify lookup flags, and modifies
      ip6_dst_lookup_tail() to enforce the ifindex match on the second
      lookup via said helper.
      
      ip6_route_output() now becomes a static inline function built on
      top of ip6_route_output_flags(); as a side effect, out-of-tree
      modules now need a GPL license to access the output route lookup
      functionality.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tcp: beware of alignments in tcp_get_info() · 87e40d8d
      Eric Dumazet authored
      [ Upstream commit ff5d7497 ]
      
      With some combinations of user provided flags in netlink command,
      it is possible to call tcp_get_info() with a buffer that is not 8-byte
      aligned.
      
      It does matter on some arches, so we need to use put_unaligned() to
      store the u64 fields.
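
      Outside the kernel, the same effect is usually obtained with
      memcpy(). The sketch below is a user-space analogue, not the
      kernel's put_unaligned(); the toy_* helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 64-bit value at an address that may not be 8-byte aligned. */
void toy_put_unaligned_u64(uint64_t val, void *dst)
{
	memcpy(dst, &val, sizeof(val));
}

/* Load a 64-bit value from an address that may not be 8-byte aligned. */
uint64_t toy_get_unaligned_u64(const void *src)
{
	uint64_t val;

	memcpy(&val, src, sizeof(val));
	return val;
}

void toy_unaligned_selftest(void)
{
	unsigned char buf[16] = { 0 };

	/* buf + 1 and buf + 2 cannot both be 8-byte aligned */
	toy_put_unaligned_u64(0x1122334455667788ULL, buf + 1);
	assert(toy_get_unaligned_u64(buf + 1) == 0x1122334455667788ULL);
}
```

      A direct store through a u64 pointer to such an address can trap on
      architectures that do not handle unaligned accesses in hardware,
      while the byte-wise copy is always safe.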
      
      Current iproute2 package does not trigger this particular issue.
      
      Fixes: 0df48c26 ("tcp: add tcpi_bytes_acked to tcp_info")
      Fixes: 977cb0ec ("tcp: add pacing_rate information into tcp_info")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • switchdev: Require RTNL mutex to be held when sending FDB notifications · ba50e6d9
      Ido Schimmel authored
      [ Upstream commit 4f2c6ae5 ]
      
      When switchdev drivers process FDB notifications from the underlying
      device they resolve the netdev to which the entry points and notify
      the bridge using the switchdev notifier.
      
      However, since the RTNL mutex is not held there is nothing preventing
      the netdev from disappearing in the middle, which will cause
      br_switchdev_event() to dereference a non-existing netdev.
      
      Make switchdev drivers hold the lock at the beginning of the
      notification processing session and release it once it ends, after
      notifying the bridge.
      
      Also, remove switchdev_mutex and fdb_lock, as they are no longer needed
      when RTNL mutex is held.
      
      Fixes: 03bf0c28 ("switchdev: introduce switchdev notifier")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • inet: frag: Always orphan skbs inside ip_defrag() · 649dc6c3
      Joe Stringer authored
      [ Upstream commit 8282f274 ]
      
      Later parts of the stack (including fragmentation) expect that there is
      never a socket attached to frag in a frag_list, however this invariant
      was not enforced on all defrag paths. This could lead to the
      BUG_ON(skb->sk) during ip_do_fragment(), as per the call stack at the
      end of this commit message.
      
      While the call could be added to openvswitch to fix this particular
      error, the head and tail of the frags list are already orphaned
      indirectly inside ip_defrag(), so it seems like the remaining fragments
      should all be orphaned in all circumstances.
      
      kernel BUG at net/ipv4/ip_output.c:586!
      [...]
      Call Trace:
       <IRQ>
       [<ffffffffa0205270>] ? do_output.isra.29+0x1b0/0x1b0 [openvswitch]
       [<ffffffffa02167a7>] ovs_fragment+0xcc/0x214 [openvswitch]
       [<ffffffff81667830>] ? dst_discard_out+0x20/0x20
       [<ffffffff81667810>] ? dst_ifdown+0x80/0x80
       [<ffffffffa0212072>] ? find_bucket.isra.2+0x62/0x70 [openvswitch]
       [<ffffffff810e0ba5>] ? mod_timer_pending+0x65/0x210
       [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
       [<ffffffffa03205a2>] ? nf_conntrack_in+0x252/0x500 [nf_conntrack]
       [<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
       [<ffffffffa02051a3>] do_output.isra.29+0xe3/0x1b0 [openvswitch]
       [<ffffffffa0206411>] do_execute_actions+0xe11/0x11f0 [openvswitch]
       [<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
       [<ffffffffa0206822>] ovs_execute_actions+0x32/0xd0 [openvswitch]
       [<ffffffffa020b505>] ovs_dp_process_packet+0x85/0x140 [openvswitch]
       [<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
       [<ffffffffa02068a2>] ovs_execute_actions+0xb2/0xd0 [openvswitch]
       [<ffffffffa020b505>] ovs_dp_process_packet+0x85/0x140 [openvswitch]
       [<ffffffffa0215019>] ? ovs_ct_get_labels+0x49/0x80 [openvswitch]
       [<ffffffffa0213a1d>] ovs_vport_receive+0x5d/0xa0 [openvswitch]
       [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
       [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
       [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
       [<ffffffffa0214895>] ? internal_dev_xmit+0x5/0x140 [openvswitch]
       [<ffffffffa02148fc>] internal_dev_xmit+0x6c/0x140 [openvswitch]
       [<ffffffffa0214895>] ? internal_dev_xmit+0x5/0x140 [openvswitch]
       [<ffffffff81660299>] dev_hard_start_xmit+0x2b9/0x5e0
       [<ffffffff8165fc21>] ? netif_skb_features+0xd1/0x1f0
       [<ffffffff81660f20>] __dev_queue_xmit+0x800/0x930
       [<ffffffff81660770>] ? __dev_queue_xmit+0x50/0x930
       [<ffffffff810b53f1>] ? mark_held_locks+0x71/0x90
       [<ffffffff81669876>] ? neigh_resolve_output+0x106/0x220
       [<ffffffff81661060>] dev_queue_xmit+0x10/0x20
       [<ffffffff816698e8>] neigh_resolve_output+0x178/0x220
       [<ffffffff816a8e6f>] ? ip_finish_output2+0x1ff/0x590
       [<ffffffff816a8e6f>] ip_finish_output2+0x1ff/0x590
       [<ffffffff816a8cee>] ? ip_finish_output2+0x7e/0x590
       [<ffffffff816a9a31>] ip_do_fragment+0x831/0x8a0
       [<ffffffff816a8c70>] ? ip_copy_metadata+0x1b0/0x1b0
       [<ffffffff816a9ae3>] ip_fragment.constprop.49+0x43/0x80
       [<ffffffff816a9c9c>] ip_finish_output+0x17c/0x340
       [<ffffffff8169a6f4>] ? nf_hook_slow+0xe4/0x190
       [<ffffffff816ab4c0>] ip_output+0x70/0x110
       [<ffffffff816a9b20>] ? ip_fragment.constprop.49+0x80/0x80
       [<ffffffff816aa9f9>] ip_local_out+0x39/0x70
       [<ffffffff816abf89>] ip_send_skb+0x19/0x40
       [<ffffffff816abfe3>] ip_push_pending_frames+0x33/0x40
       [<ffffffff816df21a>] icmp_push_reply+0xea/0x120
       [<ffffffff816df93d>] icmp_reply.constprop.23+0x1ed/0x230
       [<ffffffff816df9ce>] icmp_echo.part.21+0x4e/0x50
       [<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
       [<ffffffff810d5f9e>] ? rcu_read_lock_held+0x5e/0x70
       [<ffffffff816dfa06>] icmp_echo+0x36/0x70
       [<ffffffff816e0d11>] icmp_rcv+0x271/0x450
       [<ffffffff816a4ca7>] ip_local_deliver_finish+0x127/0x3a0
       [<ffffffff816a4bc1>] ? ip_local_deliver_finish+0x41/0x3a0
       [<ffffffff816a5160>] ip_local_deliver+0x60/0xd0
       [<ffffffff816a4b80>] ? ip_rcv_finish+0x560/0x560
       [<ffffffff816a46fd>] ip_rcv_finish+0xdd/0x560
       [<ffffffff816a5453>] ip_rcv+0x283/0x3e0
       [<ffffffff810b6302>] ? match_held_lock+0x192/0x200
       [<ffffffff816a4620>] ? inet_del_offload+0x40/0x40
       [<ffffffff8165d062>] __netif_receive_skb_core+0x392/0xae0
       [<ffffffff8165e68e>] ? process_backlog+0x8e/0x230
       [<ffffffff810b53f1>] ? mark_held_locks+0x71/0x90
       [<ffffffff8165d7c8>] __netif_receive_skb+0x18/0x60
       [<ffffffff8165e678>] process_backlog+0x78/0x230
       [<ffffffff8165e6dd>] ? process_backlog+0xdd/0x230
       [<ffffffff8165e355>] net_rx_action+0x155/0x400
       [<ffffffff8106b48c>] __do_softirq+0xcc/0x420
       [<ffffffff816a8e87>] ? ip_finish_output2+0x217/0x590
       [<ffffffff8178e78c>] do_softirq_own_stack+0x1c/0x30
       <EOI>
       [<ffffffff8106b88e>] do_softirq+0x4e/0x60
       [<ffffffff8106b948>] __local_bh_enable_ip+0xa8/0xb0
       [<ffffffff816a8eb0>] ip_finish_output2+0x240/0x590
       [<ffffffff816a9a31>] ? ip_do_fragment+0x831/0x8a0
       [<ffffffff816a9a31>] ip_do_fragment+0x831/0x8a0
       [<ffffffff816a8c70>] ? ip_copy_metadata+0x1b0/0x1b0
       [<ffffffff816a9ae3>] ip_fragment.constprop.49+0x43/0x80
       [<ffffffff816a9c9c>] ip_finish_output+0x17c/0x340
       [<ffffffff8169a6f4>] ? nf_hook_slow+0xe4/0x190
       [<ffffffff816ab4c0>] ip_output+0x70/0x110
       [<ffffffff816a9b20>] ? ip_fragment.constprop.49+0x80/0x80
       [<ffffffff816aa9f9>] ip_local_out+0x39/0x70
       [<ffffffff816abf89>] ip_send_skb+0x19/0x40
       [<ffffffff816abfe3>] ip_push_pending_frames+0x33/0x40
       [<ffffffff816d55d3>] raw_sendmsg+0x7d3/0xc30
       [<ffffffff810b732b>] ? __lock_acquire+0x3db/0x1b90
       [<ffffffff816e7557>] ? inet_sendmsg+0xc7/0x1d0
       [<ffffffff810b63c4>] ? __lock_is_held+0x54/0x70
       [<ffffffff816e759a>] inet_sendmsg+0x10a/0x1d0
       [<ffffffff816e7495>] ? inet_sendmsg+0x5/0x1d0
       [<ffffffff8163e398>] sock_sendmsg+0x38/0x50
       [<ffffffff8163ec5f>] ___sys_sendmsg+0x25f/0x270
       [<ffffffff811aadad>] ? handle_mm_fault+0x8dd/0x1320
       [<ffffffff8178c147>] ? _raw_spin_unlock+0x27/0x40
       [<ffffffff810529b2>] ? __do_page_fault+0x1e2/0x460
       [<ffffffff81204886>] ? __fget_light+0x66/0x90
       [<ffffffff8163f8e2>] __sys_sendmsg+0x42/0x80
       [<ffffffff8163f932>] SyS_sendmsg+0x12/0x20
       [<ffffffff8178cb17>] entry_SYSCALL_64_fastpath+0x12/0x6f
      Code: 00 00 44 89 e0 e9 7c fb ff ff 4c 89 ff e8 e7 e7 ff ff 41 8b 9d 80 00 00 00 2b 5d d4 89 d8 c1 f8 03 0f b7 c0 e9 33 ff ff f
       66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48
      RIP  [<ffffffff816a9a92>] ip_do_fragment+0x892/0x8a0
       RSP <ffff88006d603170>
      
      Fixes: 7f8a436e ("openvswitch: Add conntrack action")
      Signed-off-by: Joe Stringer <joe@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tipc: fix connection abort during subscription cancel · c57e51ff
      Parthasarathy Bhuvaragan authored
      [ Upstream commit 4d5cfcba ]
      
      In 'commit 7fe8097c ("tipc: fix nullpointer bug when subscribing
      to events")', we terminate the connection if the subscription
      creation fails.
      In the same commit, the subscription creation result was based on
      the value of the subscription pointer (set in the function) instead
      of the return code.
      
      Unfortunately, the same function tipc_subscrp_create() handles
      subscription cancel request. For a subscription cancellation request,
      the subscription pointer cannot be set. Thus if a subscriber has
      several subscriptions and cancels any of them, the connection is
      terminated.
      
      In this commit, we terminate the connection based on the return value
      of tipc_subscrp_create().
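
      A toy model of the two decision rules, assuming a create function
      that returns 0 on success and leaves the output pointer NULL for a
      cancel request. All toy_* names and parameters are hypothetical and
      do not mirror the real tipc_subscrp_create() signature.

```c
#include <assert.h>
#include <stddef.h>

struct toy_sub { int id; };

/* Returns 0 on success, negative on failure; *sub stays NULL on cancel. */
int toy_subscrp_create(int is_cancel, int fail, struct toy_sub *storage,
		       struct toy_sub **sub)
{
	*sub = NULL;
	if (fail)
		return -1;
	if (is_cancel)
		return 0;	/* handled successfully, nothing created */
	storage->id = 1;
	*sub = storage;
	return 0;
}

/* Old rule: a NULL pointer is treated as failure. */
int toy_terminate_by_pointer(const struct toy_sub *sub)
{
	return sub == NULL;
}

/* New rule: only a failing return code terminates the connection. */
int toy_terminate_by_rc(int rc)
{
	return rc < 0;
}

void toy_subscrp_selftest(void)
{
	struct toy_sub storage, *sub;
	int rc = toy_subscrp_create(1, 0, &storage, &sub);	/* cancel */

	assert(toy_terminate_by_pointer(sub) == 1);	/* wrongly aborts */
	assert(toy_terminate_by_rc(rc) == 0);		/* connection kept */
}
```
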
      Fixes: commit 7fe8097c ("tipc: fix nullpointer bug when subscribing to events")
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: dsa: fix mv88e6xxx switches · 7f76933d
      Russell King authored
      [ Upstream commit db0e51af ]
      
      Since commit 76e398a6 ("net: dsa: use switchdev obj for VLAN add/del
      ops"), the Marvell 88E6xxx switch has been unable to pass traffic
      between ports - any received traffic is discarded by the switch.
      Taking a port out of bridge mode and configuring a vlan on it also
      allows the port to start passing traffic.
      
      With the debugfs files re-instated to allow debug of this issue by
      comparing the register settings between the working and non-working
      case, the reason becomes clear:
      
           GLOBAL GLOBAL2 SERDES   0    1    2    3    4    5    6
      - 7:  1111    707f    2001     2    2    2    2    2    0    2
      + 7:  1111    707f    2001     1    1    1    1    1    0    1
      
      Register 7 for the ports is the default vlan tag register, and in the
      non-working setup, it has been set to 2, despite vlan 2 not being
      configured.  This causes the switch to drop all packets coming in to
      these ports.  The working setup has the default vlan tag register set
      to 1, which is the default vlan when none is configured.
      
      Inspection of the code reveals why.  The code prior to this commit
      was:
      
      -		for (vid = vlan->vid_begin; vid <= vlan->vid_end; ++vid) {
      ...
      -			if (!err && vlan->flags & BRIDGE_VLAN_INFO_PVID)
      -				err = ds->drv->port_pvid_set(ds, p->port, vid);
      
      but the new code is:
      
      +	for (vid = vlan->vid_begin; vid <= vlan->vid_end; ++vid) {
      ...
      +	}
      ...
      +	if (pvid)
      +		err = _mv88e6xxx_port_pvid_set(ds, port, vid);
      
      This causes the new code to always set the default vlan to one higher
      than the old code.
      
      Fix this.
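
      The off-by-one can be reproduced in a few lines. This is a
      stand-alone sketch with hypothetical names (struct toy_vlan_range,
      toy_pvid_*); using vid_end in the corrected variant is an
      assumption about one possible fix, chosen because the old code
      last called port_pvid_set() with vid == vid_end.

```c
#include <assert.h>

/* Hypothetical stand-in for the VLAN range carried by the request. */
struct toy_vlan_range {
	unsigned short vid_begin;
	unsigned short vid_end;
};

/* New-code shape: the loop counter is used after the loop has ended. */
unsigned short toy_pvid_buggy(const struct toy_vlan_range *vlan)
{
	unsigned short vid;

	for (vid = vlan->vid_begin; vid <= vlan->vid_end; ++vid)
		;	/* per-VLAN work elided */
	return vid;	/* the loop exits with vid == vid_end + 1 */
}

/* Assumed correction: use the last VLAN of the range explicitly. */
unsigned short toy_pvid_fixed(const struct toy_vlan_range *vlan)
{
	return vlan->vid_end;
}

void toy_pvid_selftest(void)
{
	struct toy_vlan_range def = { 1, 1 };	/* default vlan only */

	assert(toy_pvid_buggy(&def) == 2);	/* matches the "2" column */
	assert(toy_pvid_fixed(&def) == 1);	/* matches the "1" column */
}
```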
      
      Fixes: 76e398a6 ("net: dsa: use switchdev obj for VLAN add/del ops")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sctp: allow setting SCTP_SACK_IMMEDIATELY by the application · 293c41f8
      Marcelo Ricardo Leitner authored
      [ Upstream commit 27f7ed2b ]
      
      This patch extends commit b93d6471 ("sctp: implement the sender side
      for SACK-IMMEDIATELY extension") as it didn't whitelist
      SCTP_SACK_IMMEDIATELY on sctp_msghdr_parse(), causing it to be
      understood as an invalid flag and returning -EINVAL to the application.
      
      Note that the actual handling of the flag is already there in
      sctp_datamsg_from_user().
      
      https://tools.ietf.org/html/rfc7053#section-7
      
      Fixes: b93d6471 ("sctp: implement the sender side for SACK-IMMEDIATELY extension")
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • pptp: fix illegal memory access caused by multiple bind()s · cccf9f37
      Hannes Frederic Sowa authored
      [ Upstream commit 9a368aff ]
      
      This has already been reported several times as kasan reports
      triggered by syzkaller and trinity. People always looked at RCU races,
      but it is much simpler. :)
      
      In case we bind a pptp socket multiple times, we simply add it to
      the callid_sock list but don't remove the old binding. Thus the old
      socket stays in the bucket with unused call_id indexes and doesn't get
      cleaned up. This causes various forms of kasan reports which were hard
      to pinpoint.
      
      Simply don't allow multiple binds and correct error handling in
      pptp_bind. Also keep sk_state bits in place in pptp_connect.
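
      A minimal sketch of the "refuse a second bind" idea follows. struct
      demo_sock, demo_bind() and the -EBUSY return value are assumptions
      chosen for illustration; the actual pptp_bind() state handling may
      differ.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical socket state: just enough to show the double-bind check. */
struct demo_sock {
	bool bound;
	unsigned int call_id;
};

/* Bind once; a second bind is refused instead of adding a second entry. */
static int demo_bind(struct demo_sock *sk, unsigned int call_id)
{
	if (sk->bound)
		return -EBUSY; /* error code is an assumption */

	sk->call_id = call_id;
	sk->bound = true;
	return 0;
}
```

      Because the second call fails early, the first binding is never left
      behind as a stale entry.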
      
      Fixes: 00959ade ("PPTP: PPP over IPv4 (Point-to-Point Tunneling Protocol)")
      Cc: Dmitry Kozlov <xeb@mail.ru>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Dave Jones <davej@codemonkey.org.uk>
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cccf9f37
    • Eric Dumazet's avatar
      af_unix: fix struct pid memory leak · 39770be4
      Eric Dumazet authored
      [ Upstream commit fa0dc04d ]
      
      Dmitry reported a struct pid leak detected by a syzkaller program.
      
      Bug happens in unix_stream_recvmsg() when we break the loop when a
      signal is pending, without properly releasing scm.
      
      Fixes: b3ca9b02 ("net: fix multithreaded signal handling in unix recv routines")
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Rainer Weikusat <rweikusat@mobileactivedefense.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      39770be4
    • Eric Dumazet's avatar
      tcp: fix NULL deref in tcp_v4_send_ack() · e5abc10d
      Eric Dumazet authored
      [ Upstream commit e62a123b ]
      
      Neal reported crashes with this stack trace :
      
       RIP: 0010:[<ffffffff8c57231b>] tcp_v4_send_ack+0x41/0x20f
      ...
       CR2: 0000000000000018 CR3: 000000044005c000 CR4: 00000000001427e0
      ...
        [<ffffffff8c57258e>] tcp_v4_reqsk_send_ack+0xa5/0xb4
        [<ffffffff8c1a7caa>] tcp_check_req+0x2ea/0x3e0
        [<ffffffff8c19e420>] tcp_rcv_state_process+0x850/0x2500
        [<ffffffff8c1a6d21>] tcp_v4_do_rcv+0x141/0x330
        [<ffffffff8c56cdb2>] sk_backlog_rcv+0x21/0x30
        [<ffffffff8c098bbd>] tcp_recvmsg+0x75d/0xf90
        [<ffffffff8c0a8700>] inet_recvmsg+0x80/0xa0
        [<ffffffff8c17623e>] sock_aio_read+0xee/0x110
        [<ffffffff8c066fcf>] do_sync_read+0x6f/0xa0
        [<ffffffff8c0673a1>] SyS_read+0x1e1/0x290
        [<ffffffff8c5ca262>] system_call_fastpath+0x16/0x1b
      
      The problem here is the skb we provide to tcp_v4_send_ack() had to
      be parked in the backlog of a new TCP fastopen child because this child
      was owned by the user at the time an out of window packet arrived.
      
      Before queuing a packet, TCP has to set skb->dev to NULL as the device
      could disappear before packet is removed from the queue.
      
      Fix this issue by using the net pointer provided by the socket (being a
      timewait or a request socket).
      
      IPv6 is immune to the bug : tcp_v6_send_response() already gets the net
      pointer from the socket if provided.
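
      The idea of preferring the socket's namespace over skb->dev can be
      sketched as below. All demo_* types and demo_pick_net() are
      hypothetical simplifications, not the kernel structures or the exact
      upstream change.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel objects. */
struct demo_net  { int id; };
struct demo_dev  { struct demo_net *net; };
struct demo_sock { struct demo_net *net; };
struct demo_skb  { struct demo_dev *dev; }; /* dev is NULL once the skb was queued */

/* Prefer the socket's namespace; only fall back to the device if there is one. */
static struct demo_net *demo_pick_net(const struct demo_sock *sk,
				      const struct demo_skb *skb)
{
	if (sk)
		return sk->net;

	return skb->dev ? skb->dev->net : NULL;
}
```

      A packet parked in the backlog has dev == NULL, so only the socket
      path can still provide a valid namespace for it.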
      
      Fixes: 168a8f58 ("tcp: TCP Fast Open Server - main code path")
      Reported-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e5abc10d
    • Paolo Abeni's avatar
      lwt: fix rx checksum setting for lwt devices tunneling over ipv6 · 176d8f37
      Paolo Abeni authored
      [ Upstream commit c868ee70 ]
      
      Commit 35e2d115 ("tunnels: Allow IPv6 UDP checksums to be
      correctly controlled.") changed the default xmit checksum setting
      for lwt vxlan/geneve ipv6 tunnels, so that the checksum is no longer
      set in the external UDP header.
      This commit changes the rx checksum setting for both lwt vxlan/geneve
      devices created by openvswitch accordingly, so that lwt over ipv6
      tunnel pairs are again able to communicate with default values.

      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Jiri Benc <jbenc@redhat.com>
      Acked-by: Jesse Gross <jesse@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      176d8f37
    • Jesse Gross's avatar
      tunnels: Allow IPv6 UDP checksums to be correctly controlled. · aa12fd6d
      Jesse Gross authored
      [ Upstream commit 35e2d115 ]
      
      When configuring checksums on UDP tunnels, the flags are different
      for IPv4 vs. IPv6 (and reversed). However, when lightweight tunnels
      are enabled the flags used are always the IPv4 versions, which are
      ignored in the IPv6 code paths. This uses the correct IPv6 flags, so
      checksums can be controlled appropriately.
      
      Fixes: a725e514 ("vxlan: metadata based tunneling for IPv6")
      Fixes: abe492b4 ("geneve: UDP checksum configuration via netlink")
      Signed-off-by: Jesse Gross <jesse@kernel.org>
      Acked-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      aa12fd6d
    • Manfred Rudigier's avatar
      net: dp83640: Fix tx timestamp overflow handling. · c95b9687
      Manfred Rudigier authored
      [ Upstream commit 81e8f2e9 ]
      
      PHY status frames are not reliable: the PHY may not be able to send them
      during heavy receive traffic. This overflow condition is signaled by the
      PHY in the next status frame, but the driver did not make use of it.
      Instead it always reported wrong tx timestamps to user space after an
      overflow happened because it assigned newly received tx timestamps to old
      packets in the queue.
      
      This commit fixes this issue by clearing the tx timestamp queue every time
      an overflow happens, so that no timestamps are delivered for overflow
      packets. This way time stamping will continue correctly after an overflow.
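
      The flush-on-overflow behaviour can be modelled with a small FIFO of
      packets waiting for a timestamp. The queue type, its size and the
      demo_* helpers are assumptions made for this sketch and do not mirror
      the driver's data structures.

```c
#include <stdbool.h>

#define DEMO_QLEN 8

/* Hypothetical FIFO of packet ids that still wait for a tx timestamp. */
struct demo_txq {
	int pkt[DEMO_QLEN];
	int head;
	int count;
};

static bool demo_txq_push(struct demo_txq *q, int pkt_id)
{
	if (q->count == DEMO_QLEN)
		return false;
	q->pkt[(q->head + q->count) % DEMO_QLEN] = pkt_id;
	q->count++;
	return true;
}

static int demo_txq_pop(struct demo_txq *q)
{
	int id;

	if (q->count == 0)
		return -1;
	id = q->pkt[q->head];
	q->head = (q->head + 1) % DEMO_QLEN;
	q->count--;
	return id;
}

/* Returns the packet id that receives the timestamp, or -1 if none does. */
static int demo_handle_tx_status(struct demo_txq *q, bool overflow)
{
	if (overflow) {
		/* Waiting packets can no longer be matched: drop them all. */
		q->head = 0;
		q->count = 0;
		return -1;
	}
	return demo_txq_pop(q);
}
```

      After a flush, the next status frame is matched against the next
      packet queued, not against a stale one.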

      Signed-off-by: Manfred Rudigier <manfred.rudigier@omicron.at>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c95b9687
    • Jesse Gross's avatar
      gro: Make GRO aware of lightweight tunnels. · 306d3165
      Jesse Gross authored
      [ Upstream commit ce87fc6c ]
      
      GRO is currently not aware of tunnel metadata generated by lightweight
      tunnels and stored in the dst. This leads to two possible problems:
       * Incorrectly merging two frames that have different metadata.
       * Leaking of allocated metadata from merged frames.
      
      This avoids those problems by comparing the tunnel information before
      merging, similar to how we handle other metadata (such as vlan tags),
      and releasing any state when we are done.

      Reported-by: John <john.phillips5@hpe.com>
      Fixes: 2e15ea39 ("ip_gre: Add support to collect tunnel metadata.")
      Signed-off-by: Jesse Gross <jesse@kernel.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      306d3165
    • Ursula Braun's avatar
  2. 25 Feb, 2016 13 commits
    • Greg Kroah-Hartman's avatar
      Linux 4.4.3 · 2134d97a
      Greg Kroah-Hartman authored
      2134d97a
    • Luis R. Rodriguez's avatar
      modules: fix modparam async_probe request · e2f712dc
      Luis R. Rodriguez authored
      commit 4355efbd upstream.
      
      Commit f2411da7 ("driver-core: add driver module
      asynchronous probe support") added async probe support,
      in two forms:
      
        * in-kernel driver specification annotation
        * generic async_probe module parameter (modprobe foo async_probe)
      
      To support the generic kernel parameter, parse_args() was
      extended via commit ecc86170 ("module: add extra
      argument for parse_params() callback"); however, commit
      f2411da7 failed to add the required argument.

      This then causes a crash whenever the generic async_probe
      module parameter is used. This was overlooked when the
      form in which in-kernel async probe support was reworked
      a bit... Fix this as originally intended.
      
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> [minimized]
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e2f712dc
    • Rusty Russell's avatar
      module: wrapper for symbol name. · a24d9a2f
      Rusty Russell authored
      commit 2e7bac53 upstream.
      
      This trivial wrapper adds clarity and makes the following patch
      smaller.

      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a24d9a2f
    • Thomas Gleixner's avatar
      itimers: Handle relative timers with CONFIG_TIME_LOW_RES proper · 82e730ba
      Thomas Gleixner authored
      commit 51cbb524 upstream.
      
      As Helge reported for timerfd we have the same issue in itimers. We return
      remaining time larger than the programmed relative time to user space in case
      of CONFIG_TIME_LOW_RES=y. Use the proper function to adjust the extra time
      added in hrtimer_start_range_ns().

      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: dhowells@redhat.com
      Link: http://lkml.kernel.org/r/20160114164159.528222587@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      82e730ba
    • Thomas Gleixner's avatar
      posix-timers: Handle relative timers with CONFIG_TIME_LOW_RES proper · 1c94da3e
      Thomas Gleixner authored
      commit 572c3917 upstream.
      
      As Helge reported for timerfd we have the same issue in posix timers. We
      return remaining time larger than the programmed relative time to user space
      in case of CONFIG_TIME_LOW_RES=y. Use the proper function to adjust the extra
      time added in hrtimer_start_range_ns().

      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: dhowells@redhat.com
      Link: http://lkml.kernel.org/r/20160114164159.450510905@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1c94da3e
    • Thomas Gleixner's avatar
      timerfd: Handle relative timers with CONFIG_TIME_LOW_RES proper · 565f2229
      Thomas Gleixner authored
      commit b62526ed upstream.
      
      Helge reported that a relative timer can return a remaining time larger than
      the programmed relative time on parisc and other architectures which have
      CONFIG_TIME_LOW_RES set. This happens because we add a jiffy to the resulting
      expiry time to prevent short timeouts.
      
      Use the new function hrtimer_expires_remaining_adjusted() to calculate the
      remaining time. It takes that extra added time into account for relative
      timers.
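
      The arithmetic behind the adjustment can be sketched in user space.
      struct demo_timer and the demo_* helpers are hypothetical; they only
      model the "pad by one tick on start, subtract it again when
      reporting" idea and are not the kernel implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a high-resolution timer. */
struct demo_timer {
	int64_t expires_ns; /* absolute expiry time */
	bool is_rel;        /* armed with a relative timeout? */
};

/* Arm a relative timer; a low-resolution clock pads the expiry by one tick. */
static void demo_start_relative(struct demo_timer *t, int64_t now_ns,
				int64_t delay_ns, bool low_res, int64_t tick_ns)
{
	t->expires_ns = now_ns + delay_ns + (low_res ? tick_ns : 0);
	t->is_rel = true;
}

/* Naive remaining time: still includes the padding. */
static int64_t demo_remaining(const struct demo_timer *t, int64_t now_ns)
{
	return t->expires_ns - now_ns;
}

/* Adjusted remaining time: removes the padding again for relative timers. */
static int64_t demo_remaining_adjusted(const struct demo_timer *t,
				       int64_t now_ns, bool low_res,
				       int64_t tick_ns)
{
	int64_t rem = t->expires_ns - now_ns;

	if (low_res && t->is_rel)
		rem -= tick_ns;
	return rem;
}
```

      With an assumed 10 ms tick, a 1 ms relative timer queried right after
      arming reports 11 ms through the naive helper and 1 ms through the
      adjusted one.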

      Reported-and-tested-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: dhowells@redhat.com
      Link: http://lkml.kernel.org/r/20160114164159.354500742@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      565f2229
    • Mateusz Guzik's avatar
      prctl: take mmap sem for writing to protect against others · e5e99792
      Mateusz Guzik authored
      commit ddf1d398 upstream.
      
      An unprivileged user can trigger an oops on a kernel with
      CONFIG_CHECKPOINT_RESTORE.
      
      proc_pid_cmdline_read takes mmap_sem for reading and obtains args + env
      start/end values. These get sanity checked as follows:
              BUG_ON(arg_start > arg_end);
              BUG_ON(env_start > env_end);
      
      These can be changed by prctl_set_mm. Turns out it also takes the semaphore
      for reading, effectively rendering it useless. This results in:
      
        kernel BUG at fs/proc/base.c:240!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: virtio_net
        CPU: 0 PID: 925 Comm: a.out Not tainted 4.4.0-rc8-next-20160105dupa+ #71
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        task: ffff880077a68000 ti: ffff8800784d0000 task.ti: ffff8800784d0000
        RIP: proc_pid_cmdline_read+0x520/0x530
        RSP: 0018:ffff8800784d3db8  EFLAGS: 00010206
        RAX: ffff880077c5b6b0 RBX: ffff8800784d3f18 RCX: 0000000000000000
        RDX: 0000000000000002 RSI: 00007f78e8857000 RDI: 0000000000000246
        RBP: ffff8800784d3e40 R08: 0000000000000008 R09: 0000000000000001
        R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000050
        R13: 00007f78e8857800 R14: ffff88006fcef000 R15: ffff880077c5b600
        FS:  00007f78e884a740(0000) GS:ffff88007b200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 00007f78e8361770 CR3: 00000000790a5000 CR4: 00000000000006f0
        Call Trace:
          __vfs_read+0x37/0x100
          vfs_read+0x82/0x130
          SyS_read+0x58/0xd0
          entry_SYSCALL_64_fastpath+0x12/0x76
        Code: 4c 8b 7d a8 eb e9 48 8b 9d 78 ff ff ff 4c 8b 7d 90 48 8b 03 48 39 45 a8 0f 87 f0 fe ff ff e9 d1 fe ff ff 4c 8b 7d 90 eb c6 0f 0b <0f> 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
        RIP   proc_pid_cmdline_read+0x520/0x530
        ---[ end trace 97882617ae9c6818 ]---
      
      Turns out there are instances where the code just reads the aforementioned
      values without locking whatsoever - namely environ_read and get_cmdline.
      
      Interestingly these functions look quite resilient against bogus values,
      but I don't believe this should be relied upon.
      
      The first patch gets rid of the oops bug by grabbing mmap_sem for
      writing.
      
      The second patch is optional and puts locking around the aforementioned
      consumers for safety.  Consumers of other fields don't seem to benefit
      from similar treatment and are left untouched.
      
      This patch (of 2):
      
      The code was taking the semaphore for reading, which does not protect
      against readers nor concurrent modifications.
      
      The problem could cause a sanity check to fail in procfs's cmdline
      reader, resulting in an OOPS.

      Note that some functions perform an unlocked read of various mm fields,
      but they seem to be fine despite possible modification.
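
      Why a read lock gives no protection to a writer of the fields can be
      seen in a toy model of a reader/writer semaphore. struct demo_rwsem
      and the trylock helpers are hypothetical and single-threaded; they
      only model who is allowed in at the same time.

```c
#include <stdbool.h>

/* Toy reader/writer semaphore: counts holders, performs no real blocking. */
struct demo_rwsem {
	int readers;
	bool writer;
};

/* Any number of readers may hold the lock as long as no writer does. */
static bool demo_down_read_trylock(struct demo_rwsem *s)
{
	if (s->writer)
		return false;
	s->readers++;
	return true;
}

/* A writer needs the lock exclusively: no readers and no other writer. */
static bool demo_down_write_trylock(struct demo_rwsem *s)
{
	if (s->writer || s->readers > 0)
		return false;
	s->writer = true;
	return true;
}

static void demo_up_read(struct demo_rwsem *s)
{
	s->readers--;
}

static void demo_up_write(struct demo_rwsem *s)
{
	s->writer = false;
}
```

      If the code updating the values only takes the lock for reading, the
      procfs reader gets in at the same time and can observe a half-updated
      pair. Taking it for writing keeps the reader out until the update is
      complete.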

      Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Jarod Wilson <jarod@redhat.com>
      Cc: Jan Stancek <jstancek@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Anshuman Khandual <anshuman.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e5e99792
    • Dave Chinner's avatar
      xfs: log mount failures don't wait for buffers to be released · f86701c4
      Dave Chinner authored
      commit 85bec546 upstream.
      
      Recently I've been seeing xfs/051 fail on 1k block size filesystems.
      Trying to trace the events during the test led to the problem going
      away, indicating that it was a race condition that led to this
      ASSERT failure:
      
      XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 156
      .....
      [<ffffffff814e1257>] xfs_free_perag+0x87/0xb0
      [<ffffffff814e21b9>] xfs_mountfs+0x4d9/0x900
      [<ffffffff814e5dff>] xfs_fs_fill_super+0x3bf/0x4d0
      [<ffffffff811d8800>] mount_bdev+0x180/0x1b0
      [<ffffffff814e3ff5>] xfs_fs_mount+0x15/0x20
      [<ffffffff811d90a8>] mount_fs+0x38/0x170
      [<ffffffff811f4347>] vfs_kern_mount+0x67/0x120
      [<ffffffff811f7018>] do_mount+0x218/0xd60
      [<ffffffff811f7e5b>] SyS_mount+0x8b/0xd0
      
      When I finally caught it with tracing enabled, I saw that AG 2 had
      an elevated reference count and a buffer was responsible for it. I
      tracked down the specific buffer, and found that it was missing the
      final reference count release that would put it back on the LRU and
      hence be found by xfs_wait_buftarg() calls in the log mount failure
      handling.
      
      The last four traces for the buffer before the assert were (trimmed
      for relevance)
      
      kworker/0:1-5259   xfs_buf_iodone:        hold 2  lock 0 flags ASYNC
      kworker/0:1-5259   xfs_buf_ioerror:       hold 2  lock 0 error -5
      mount-7163	   xfs_buf_lock_done:     hold 2  lock 0 flags ASYNC
      mount-7163	   xfs_buf_unlock:        hold 2  lock 1 flags ASYNC
      
      This is an async write that is completing, so there's nobody waiting
      for it directly.  Hence we call xfs_buf_relse() once all the
      processing is complete. That does:
      
      static inline void xfs_buf_relse(xfs_buf_t *bp)
      {
      	xfs_buf_unlock(bp);
      	xfs_buf_rele(bp);
      }
      
      Now, it's clear that mount is waiting on the buffer lock, and that
      it has been released by xfs_buf_relse() and gained by mount. This is
      expected, because at this point the mount process is in
      xfs_buf_delwri_submit() waiting for all the IO it submitted to
      complete.
      
      The mount process, however, is waiting on the lock for the buffer
      because it is in xfs_buf_delwri_submit(). This waits for IO
      completion, but it doesn't wait for the buffer reference owned by
      the IO to go away. The mount process collects all the completions,
      fails the log recovery, and the higher level code then calls
      xfs_wait_buftarg() to free all the remaining buffers in the
      filesystem.
      
      The issue is that on unlocking the buffer, the scheduler has decided
      that the mount process has higher priority than the kworker
      thread that is running the IO completion, and so immediately
      switched contexts to the mount process from the semaphore unlock
      code, hence preventing the kworker thread from finishing the IO
      completion and releasing the IO reference to the buffer.
      
      Hence by the time that xfs_wait_buftarg() is run, the buffer still
      has an active reference and so isn't on the LRU list that the
      function walks to free the remaining buffers. Hence we miss that
      buffer and continue onwards to tear down the mount structures,
      at which time we find a stray reference count on the perag
      structure. On a non-debug kernel, this will be ignored and the
      structure torn down and freed. Hence when the kworker thread is then
      rescheduled and the buffer released and freed, it will access a
      freed perag structure.
      
      The problem here is that when the log mount fails, we still need to
      quiesce the log to ensure that the IO workqueues have returned to
      idle before we run xfs_wait_buftarg(). By synchronising the
      workqueues, we ensure that all IO completions are fully processed,
      not just to the point where buffers have been unlocked. This ensures
      we don't end up in the situation above.

      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f86701c4
    • Dave Chinner's avatar
      Revert "xfs: clear PF_NOFREEZE for xfsaild kthread" · 16f14a28
      Dave Chinner authored
      commit 3e85286e upstream.
      
      This reverts commit 24ba16bb as it
      prevents machines from suspending. This regression occurs when the
      xfsaild is idle on entry to suspend, and so there is no activity to
      wake it from its idle sleep and hence see that it is supposed to
      freeze. Hence the freezer times out waiting for it and suspend is
      cancelled.
      
      There is no obvious fix for this short of freezing the filesystem
      properly, so revert this change for now.

      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      16f14a28
    • Dave Chinner's avatar
      xfs: inode recovery readahead can race with inode buffer creation · 7530e6fd
      Dave Chinner authored
      commit b79f4a1c upstream.
      
      When we do inode readahead in log recovery, we can do the
      readahead before we've replayed the icreate transaction that stamps
      the buffer with inode cores. The inode readahead verifier catches
      this and marks the buffer as !done to indicate that it doesn't yet
      contain valid inodes.

      In adding buffer error notification (i.e. setting b_error = -EIO at
      the same time as we clear the done flag) to such a readahead
      verifier failure, we can then get subsequent inode recovery failing
      with this error:
      
      XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_do..(read#2)") error 5 numblks 32
      
      This occurs when readahead completion races with icreate item replay
      such as:
      
      	inode readahead
      		find buffer
      		lock buffer
      		submit RA io
      	....
      	icreate recovery
      	    xfs_trans_get_buffer
      		find buffer
      		lock buffer
      		<blocks on RA completion>
      	.....
      	<ra completion>
      		fails verifier
      		clear XBF_DONE
      		set bp->b_error = -EIO
      		release and unlock buffer
      	<icreate gains lock>
      	icreate initialises buffer
      	marks buffer as done
      	adds buffer to delayed write queue
      	releases buffer
      
      At this point, we have an initialised inode buffer that is up to
      date but has an -EIO state registered against it. When we finally
      get to recovering an inode in that buffer:
      
      	inode item recovery
      	    xfs_trans_read_buffer
      		find buffer
      		lock buffer
      		sees XBF_DONE is set, returns buffer
      	    sees bp->b_error is set
      		fail log recovery!
      
      Essentially, we need xfs_trans_get_buf_map() to clear the error status of
      the buffer when doing a lookup. This function returns uninitialised
      buffers, so the buffer returned can not be in an error state and
      none of the code that uses this function expects b_error to be set
      on return. Indeed, there is an ASSERT(!bp->b_error); in the
      transaction case in xfs_trans_get_buf_map() that would have caught
      this if log recovery used transactions....
      
      This patch firstly changes the inode readahead failure to set -EIO
      on the buffer, and secondly changes xfs_buf_get_map() to never
      return a buffer with an error state set so this first change doesn't
      cause unexpected log recovery failures.
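
      The sequence above can be replayed with a two-field model of the
      buffer. struct demo_buf, the demo_* helpers and the clear_error
      switch are assumptions made for this sketch; they are not the XFS
      buffer cache API.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical two-field model of a cached buffer. */
struct demo_buf {
	bool done; /* contents are valid */
	int error; /* sticky error state */
};

/* Readahead completion whose verifier rejects the contents. */
static void demo_readahead_fail(struct demo_buf *bp)
{
	bp->done = false;
	bp->error = -EIO;
}

/* "get" lookup for a buffer the caller will initialise itself.
 * clear_error models the fix: never hand out a stale error state. */
static struct demo_buf *demo_buf_get(struct demo_buf *bp, bool clear_error)
{
	if (clear_error)
		bp->error = 0;
	return bp;
}

/* icreate replay: look the buffer up, initialise it, mark it done. */
static void demo_icreate_replay(struct demo_buf *bp, bool clear_error)
{
	struct demo_buf *b = demo_buf_get(bp, clear_error);

	b->done = true;
}

/* Inode item recovery: a done buffer is used as is, error included. */
static int demo_inode_recover(const struct demo_buf *bp)
{
	if (!bp->done)
		return 0; /* a real read from disk is not modelled here */
	return bp->error;
}
```

      Without the clearing step, the -EIO set by the failed readahead
      survives the icreate replay and fails the later inode recovery. With
      it, recovery sees a clean, initialised buffer.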

      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7530e6fd
    • Darrick J. Wong's avatar
      libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct · 888959f2
      Darrick J. Wong authored
      commit 96f859d5 upstream.
      
      Because struct xfs_agfl is 36 bytes long and has a 64-bit integer
      inside it, gcc will quietly round the structure size up to the nearest
      64 bits -- in this case, 40 bytes.  This results in the XFS_AGFL_SIZE
      macro returning incorrect results for v5 filesystems on 64-bit
      machines (118 items instead of 119).  As a result, a 32-bit xfs_repair
      will see garbage in AGFL item 119 and complain.
      
      Therefore, tell gcc not to pad the structure so that the AGFL size
      calculation is correct.
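
      The size arithmetic can be checked with a stand-alone layout. The
      demo_* structs, their field names and the 512-byte sector size are
      assumptions chosen so that the numbers line up with the 118 versus
      119 quoted above. The trailing block-number array of the real header
      is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* 36 bytes of fields with one 64-bit member, as described above.
 * On ABIs that align uint64_t to 8 bytes, sizeof() is rounded up to 40. */
struct demo_agfl_padded {
	uint32_t magicnum;
	uint32_t seqno;
	uint8_t  uuid[16];
	uint64_t lsn;
	uint32_t crc;
};

/* Same fields, but the compiler is told not to add tail padding. */
struct demo_agfl_packed {
	uint32_t magicnum;
	uint32_t seqno;
	uint8_t  uuid[16];
	uint64_t lsn;
	uint32_t crc;
} __attribute__((packed));

/* Number of 32-bit free-list slots that fit behind the header. */
static size_t demo_agfl_slots(size_t sector_size, size_t header_size)
{
	return (sector_size - header_size) / sizeof(uint32_t);
}
```

      With a 36-byte header, (512 - 36) / 4 gives 119 slots; with a 40-byte
      header, (512 - 40) / 4 gives 118.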

      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      888959f2
    • Miklos Szeredi's avatar
      ovl: setattr: check permissions before copy-up · 8373f659
      Miklos Szeredi authored
      commit cf9a6784 upstream.
      
      Without this, copy-up of a file can be forced, even without actually being
      allowed to do anything on the file.
      
      [Arnd Bergmann] include <linux/pagemap.h> for PAGE_CACHE_SIZE (used by
      MAX_LFS_FILESIZE definition).

      Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8373f659
    • Miklos Szeredi's avatar
      ovl: root: copy attr · 7193e802
      Miklos Szeredi authored
      commit ed06e069 upstream.
      
      We copy i_uid and i_gid of underlying inode into overlayfs inode.  Except
      for the root inode.
      
      Fix this omission.

      Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7193e802