1. 05 May, 2016 8 commits
    • netfilter: conntrack: make netns address part of hash · 1b8c8a9f
      Florian Westphal authored
      Once we place all conntracks into a global hash table, we want them to be
      spread across the entire hash table, even if namespaces have overlapping IP
      addresses.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
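A minimal sketch of the idea in the commit above, assuming a toy SHA-256 based hash rather than the kernel's own hash function: mixing a per-namespace value into the hash input lets identical tuples from different namespaces spread over different buckets. The names `bucket_for`, `netns_id` and `mix_netns` are hypothetical and exist only for this illustration.

```python
import hashlib

def bucket_for(tuple_key, netns_id, table_size, mix_netns=True):
    """Return the bucket index for a connection tuple.

    tuple_key is (src_ip, dst_ip, sport, dport, proto). netns_id is an
    integer standing in for the namespace address (an assumption of
    this sketch, not the kernel's representation).
    """
    data = repr(tuple_key).encode()
    if mix_netns:
        # Fold the namespace identity into the hash input.
        data += netns_id.to_bytes(8, "little")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:4], "little") % table_size
```

Without the namespace in the input, two namespaces that both use 10.0.0.1 -> 10.0.0.2 would always collide in one bucket of the shared table.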
    • netfilter: conntrack: check netns when comparing conntrack objects · e0c7d472
      Florian Westphal authored
      Once we place all conntracks in the same hash table we must also compare
      the netns pointer to skip conntracks that belong to a different namespace.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
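The comparison described above can be sketched as follows. This is an illustration only: `Conntrack`, `key_equal` and `lookup` are hypothetical names, and namespaces are modelled as plain Python objects compared by identity, standing in for the netns pointer comparison.

```python
from collections import namedtuple

# Simplified stand-in for a conntrack entry in a shared hash chain.
Conntrack = namedtuple("Conntrack", ["tuple_key", "zone", "netns"])

def key_equal(ct, tuple_key, zone, netns):
    """Match on tuple and zone, and require the same namespace object."""
    return ct.tuple_key == tuple_key and ct.zone == zone and ct.netns is netns

def lookup(chain, tuple_key, zone, netns):
    """Walk one hash chain and skip entries of other namespaces."""
    for ct in chain:
        if key_equal(ct, tuple_key, zone, netns):
            return ct
    return None
```

Keeping the namespace test inside one equality helper means every lookup path that uses the helper picks up the extra check at once.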
    • netfilter: conntrack: small refactoring of conntrack seq_printf · 245cfdca
      Florian Westphal authored
      The iteration process is lockless, so we test if the conntrack object is
      eligible for printing (e.g. is AF_INET) after obtaining the reference
      count.
      
      Once we put all conntracks into the same hash table we might see more
      entries that need to be skipped.
      
      So add a helper and first perform the test in a lockless fashion
      for fast skip.
      
      Once we obtain the reference count, just repeat the check.
      
      Note that this refactoring also adds a previously missing check for
      unconfirmed conntrack entries: due to slab rcu object re-usage we can
      encounter them here, and they need to be skipped since they are not part
      of the listing.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: conntrack: use nf_ct_key_equal() in more places · 86804348
      Florian Westphal authored
      This prepares for the upcoming change that places all conntracks into a
      single, global table.  For this to work we will also need to compare the
      net pointer during lookup.  To avoid open-coding such a check, use the
      nf_ct_key_equal helper and later extend it to also consider net_eq.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: conntrack: don't attempt to iterate over empty table · 88b68bc5
      Florian Westphal authored
      Once we place all conntracks into the same table, iteration becomes more
      costly because the table contains conntracks that we are not interested
      in (belonging to other netns).
      
      So don't bother scanning if the current namespace has no entries.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
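A small sketch of the shortcut described above, assuming a per-namespace entry counter is kept next to the shared table. The names `entries_for_netns`, `counts` and `stats` are hypothetical; `stats` only exists to make the saved work visible.

```python
def entries_for_netns(table, netns, counts, stats):
    """Collect the entries of one namespace from a shared bucket table.

    table is a list of buckets, each a list of (netns, entry) pairs.
    counts maps a namespace to its number of entries.
    """
    if counts.get(netns, 0) == 0:
        return []                      # nothing to find, skip the scan
    found = []
    for bucket in table:
        stats["buckets_scanned"] += 1
        for entry_netns, entry in bucket:
            if entry_netns == netns:   # ignore other namespaces' entries
                found.append(entry)
    return found
```

The early return costs one counter read, while a full scan costs a walk over every bucket regardless of who owns the entries.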
    • netfilter: conntrack: fix lookup race during hash resize · 5e3c61f9
      Florian Westphal authored
      When resizing the conntrack hash table at runtime via
      echo 42 > /sys/module/nf_conntrack/parameters/hashsize, we are racing with
      the conntrack lookup path -- reads can happen in parallel and nothing
      prevents readers from observing the newly allocated hash but the old
      size (or vice versa).
      
      So access to hash[bucket] can trigger OOB read access in case the table got
      expanded and we saw the new size but the old hash pointer (or it got shrunk
      and we got new hash ptr but the size of the old and larger table):
      
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.6.0-rc2+ #107
      [..]
      Call Trace:
      [<ffffffff822c3d6a>] ? nf_conntrack_tuple_taken+0x12a/0xe90
      [<ffffffff822c3ac1>] ? nf_ct_invert_tuplepr+0x221/0x3a0
      [<ffffffff8230e703>] get_unique_tuple+0xfb3/0x2760
      
      Use a generation counter to obtain the address/length of the same table.
      
      Also add a synchronize_net before freeing the old hash.
      AFAICS, without it we might access ct_hash[bucket] after ct_hash has been
      freed, provided that lockless reader got delayed by another event:
      
      CPU1			CPU2
      seq_begin
      seq_retry
      <delay>			resize occurs
      			free oldhash
      for_each(oldhash[size])
      
      Note that resize is only supported in init_netns. It took over 2 minutes
      of constant resizing+flooding to produce the warning, so this isn't a
      big problem in practice.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
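The generation-counter read pattern mentioned above can be illustrated with a single-threaded simulation. This is a sketch under stated assumptions, not the kernel code: `HashTable`, `snapshot` and the `between_reads` hook are hypothetical, the hook stands in for a resize racing with the reader, and the deferred freeing of the old table (the synchronize_net part) is not modelled.

```python
class HashTable:
    def __init__(self, size):
        self.seq = 0                       # generation counter, even when stable
        self.buckets = [[] for _ in range(size)]
        self.size = size

    def resize(self, new_size):
        self.seq += 1                      # odd: update in progress
        self.buckets = [[] for _ in range(new_size)]
        self.size = new_size
        self.seq += 1                      # even: update published

def snapshot(table, between_reads=None):
    """Return (buckets, size, attempts) that belong to one generation."""
    attempts = 0
    while True:
        attempts += 1
        start = table.seq
        if start & 1:
            continue                       # writer active, try again
        buckets = table.buckets
        if between_reads is not None:
            between_reads()                # hook: simulate a racing resize
            between_reads = None
        size = table.size
        if table.seq == start:             # unchanged: pointer and size match
            return buckets, size, attempts
```

If the counter changed between the two reads, the reader discards the possibly mismatched pointer/size pair and retries, so it never indexes the old array with the new size or the other way around.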
    • netfilter: conntrack: keep BH enabled during lookup · 2cf12348
      Florian Westphal authored
      No need to disable BH here anymore:
      
      stats are switched to the _ATOMIC variant (== this_cpu_inc()), which
      nowadays generates the same code as the non _ATOMIC NF_STAT, at least on x86.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nftables: add connlabel set support · 1ad8f48d
      Florian Westphal authored
      Conntrack labels are currently sized depending on the iptables
      ruleset, i.e. if we're asked to test or set bits 1, 2, and 65 then we
      would allocate enough room to store at least bit 65.
      
      However, with nft, the input is just a register with arbitrary runtime
      content.
      
      We therefore ask for the upper ceiling we currently have, which is
      enough room to store 128 bits.
      
      Alternatively, we could alter nf_connlabels_replace to increase
      net->ct.label_words at run time, but since 128 bits is not that
      big we'd only save sizeof(long) so it doesn't seem worth it for now.
      
      This follows an approach similar to the one the xtables 'connlabel'
      match uses, so when the user inputs
      
          ct label set bar
      
      then we will set the bit used by the 'bar' label and leave the rest alone.
      
      This is done by passing the sreg content to nf_connlabels_replace
      as both value and mask argument.
      Labels (bits) already set thus cannot be re-set to zero, but
      this is not supported by xtables connlabel match either.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
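Why passing the same register content as both value and mask can set bits but never clear them is easy to see with a generic masked replace. The sketch below assumes the replace operation has the usual "keep unmasked bits, take masked bits from value" semantics; `labels_replace` and `label_set` are hypothetical helpers, not the kernel functions.

```python
LABEL_BITS = 128                           # upper ceiling named in the commit

def labels_replace(current, value, mask):
    """Replace the bits selected by mask with the matching bits of value."""
    return (current & ~mask) | (value & mask)

def label_set(current, bit):
    """Set one label bit by using the same word as value and mask."""
    if not 0 <= bit < LABEL_BITS:
        raise ValueError("label bit out of range")
    sreg = 1 << bit
    return labels_replace(current, sreg, sreg)
```

With value equal to mask, every selected bit is written as 1 and every unselected bit is kept, so an already set label can only stay set.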
  2. 29 Apr, 2016 1 commit
  3. 25 Apr, 2016 11 commits
  4. 24 Apr, 2016 16 commits
    • tcp-tso: do not split TSO packets at retransmit time · 10d3be56
      Eric Dumazet authored
      Linux TCP stack painfully segments all TSO/GSO packets before retransmits.
      
      This was fine back in the days when TSO/GSO were emerging, with their
      bugs, but we believe the dark age is over.
      
      Keeping big packets in write queues, but also in stack traversal
      has a lot of benefits.
       - Less memory overhead, because write queues have fewer skbs
       - Less cpu overhead at ACK processing.
       - Better SACK processing, as a lot of studies mentioned how
         awful linux was at this ;)
       - Less cpu overhead to send the rtx packets
         (IP stack traversal, netfilter traversal, drivers...)
       - Better latencies in presence of losses.
       - Smaller spikes in fq like packet schedulers, as retransmits
         are not constrained by TCP Small Queues.
      
      1 % packet losses are common today, and at 100Gbit speeds, this
      translates to ~80,000 losses per second.
      Losses are often correlated, and we see many retransmit events
      leading to trains of 1-MSS packets, at a time when hosts are already
      under stress.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
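The "~80,000 losses per second" figure above can be reproduced with back-of-the-envelope arithmetic. The sketch assumes full-size packets of about 1500 bytes, which the commit message does not state; `losses_per_second` is a hypothetical helper.

```python
def losses_per_second(link_bps, packet_bytes, loss_rate):
    """Estimate lost packets per second on a saturated link."""
    packets_per_second = link_bps / (packet_bytes * 8)
    return packets_per_second * loss_rate

# 100 Gbit/s, ~1500-byte packets (assumed), 1 % loss
estimate = losses_per_second(100e9, 1500, 0.01)
print(round(estimate))
```

Under these assumptions the link carries roughly 8.3 million packets per second, so a 1 % loss rate gives about 83,000 lost packets per second, in line with the number quoted in the commit message.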
    • tipc: fix stale links after re-enabling bearer · 8cee83dd
      Parthasarathy Bhuvaragan authored
      Commit 42b18f60 ("tipc: refactor function tipc_link_timeout()"),
      introduced a bug which prevents sending of probe messages during
      link synchronization phase. This leads to hanging links, if the
      bearer is disabled/enabled after links are up.
      
      In this commit, we send the probe messages correctly.
      
      Fixes: 42b18f60 ("tipc: refactor function tipc_link_timeout()")
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tcp-tcstamp_ack-frag-coalesce' · 6a74c196
      David S. Miller authored
      Martin KaFai Lau says:
      
      ====================
      tcp: Handle txstamp_ack when fragmenting/coalescing skbs
      
      This patchset is to handle the txstamp-ack bit when
      fragmenting/coalescing skbs.
      
      The second patch depends on the recently posted series
      for the net branch:
      "tcp: Merge timestamp info when coalescing skbs"
      
      A BPF prog is used to kprobe to sock_queue_err_skb()
      and print out the value of serr->ee.ee_data.  The BPF
      prog (run-able from bcc) is attached here:
      
      BPF prog used for testing:
      ~~~~~
      
      from __future__ import print_function
      from bcc import BPF
      
      bpf_text = """
      
      int trace_err_skb(struct pt_regs *ctx)
      {
      	struct sk_buff *skb = (struct sk_buff *)ctx->si;
      	struct sock *sk = (struct sock *)ctx->di;
      	struct sock_exterr_skb *serr;
      	u32 ee_data = 0;
      
      	if (!sk || !skb)
      		return 0;
      
      	serr = SKB_EXT_ERR(skb);
      	bpf_probe_read(&ee_data, sizeof(ee_data), &serr->ee.ee_data);
      	bpf_trace_printk("ee_data:%u\\n", ee_data);
      
      	return 0;
      };
      """
      
      b = BPF(text=bpf_text)
      b.attach_kprobe(event="sock_queue_err_skb", fn_name="trace_err_skb")
      print("Attached to kprobe")
      b.trace_print()
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Merge txstamp_ack in tcp_skb_collapse_tstamp · 2de8023e
      Martin KaFai Lau authored
      When collapsing skbs, txstamp_ack also needs to be merged.
      
      Retrans Collapse Test:
      ~~~~~~
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 write(4, ..., 730) = 730
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      0.200 write(4, ..., 730) = 730
      +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
      0.200 write(4, ..., 11680) = 11680
      
      0.200 > P. 1:731(730) ack 1
      0.200 > P. 731:1461(730) ack 1
      0.200 > . 1461:8761(7300) ack 1
      0.200 > P. 8761:13141(4380) ack 1
      
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
      0.300 > P. 1:1461(1460) ack 1
      0.400 < . 1:1(0) ack 13141 win 257
      
      BPF Output Before:
      ~~~~~
      <No output due to missing SCM_TSTAMP_ACK timestamp>
      
      BPF Output After:
      ~~~~~
      <...>-2027  [007] d.s.    79.765921: : ee_data:1459
      
      Sacks Collapse Test:
      ~~~~~
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 write(4, ..., 1460) = 1460
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      0.200 write(4, ..., 13140) = 13140
      +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
      
      0.200 > P. 1:1461(1460) ack 1
      0.200 > . 1461:8761(7300) ack 1
      0.200 > P. 8761:14601(5840) ack 1
      
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop>
      0.300 > P. 1:1461(1460) ack 1
      0.400 < . 1:1(0) ack 14601 win 257
      
      BPF Output Before:
      ~~~~~
      <No output due to missing SCM_TSTAMP_ACK timestamp>
      
      BPF Output After:
      ~~~~~
      <...>-2049  [007] d.s.    89.185538: : ee_data:14599
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Carry txstamp_ack in tcp_fragment_tstamp · b51e13fa
      Martin KaFai Lau authored
      When a tcp skb is sliced into two smaller skbs (e.g. in
      tcp_fragment() and tso_fragment()),  it does not carry
      the txstamp_ack bit to the newly created skb if it is needed.
      The end result is a timestamping event (SCM_TSTAMP_ACK) will
      be missing from the sk->sk_error_queue.
      
      This patch carries this bit to the new skb2
      in tcp_fragment_tstamp().
      
      BPF Output Before:
      ~~~~~~
      <No output due to missing SCM_TSTAMP_ACK timestamp>
      
      BPF Output After:
      ~~~~~~
      <...>-2050  [000] d.s.   100.928763: : ee_data:14599
      
      Packetdrill Script:
      ~~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      0.200 write(4, ..., 14600) = 14600
      +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
      
      0.200 > . 1:7301(7300) ack 1
      0.200 > P. 7301:14601(7300) ack 1
      
      0.300 < . 1:1(0) ack 14601 win 257
      
      0.300 close(4) = 0
      0.300 > F. 14601:14601(0) ack 1
      0.400 < F. 1:1(0) ack 16062 win 257
      0.400 > . 14602:14602(0) ack 2
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
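The packetdrill scripts above pass raw numbers to `setsockopt(4, SOL_SOCKET, 37, ...)`. Assuming option 37 is `SO_TIMESTAMPING` (its value on x86 and in the generic socket headers) and that the flag bits match the `SOF_TIMESTAMPING_*` definitions in `linux/net_tstamp.h` as listed below from memory, the values 2688 and 2176 can be decoded with a short helper. `decode_tstamp_flags` is a hypothetical name, and the table is worth checking against the header of the kernel in use.

```python
# Assumed bit positions of the SOF_TIMESTAMPING_* flags (linux/net_tstamp.h).
SOF_TIMESTAMPING_FLAGS = [
    "TX_HARDWARE",   # 1 << 0
    "TX_SOFTWARE",   # 1 << 1
    "RX_HARDWARE",   # 1 << 2
    "RX_SOFTWARE",   # 1 << 3
    "SOFTWARE",      # 1 << 4
    "SYS_HARDWARE",  # 1 << 5
    "RAW_HARDWARE",  # 1 << 6
    "OPT_ID",        # 1 << 7
    "TX_SCHED",      # 1 << 8
    "TX_ACK",        # 1 << 9
    "OPT_CMSG",      # 1 << 10
    "OPT_TSONLY",    # 1 << 11
]

def decode_tstamp_flags(value):
    """Return the names of the flag bits set in value, lowest bit first."""
    return [name for bit, name in enumerate(SOF_TIMESTAMPING_FLAGS)
            if value & (1 << bit)]

print(decode_tstamp_flags(2688))
print(decode_tstamp_flags(2176))
```

Read this way, 2688 requests the ACK timestamp (`TX_ACK`) together with `OPT_ID` and `OPT_TSONLY`, and 2176 is the same value with `TX_ACK` removed, so only the data written between the two calls is tagged for an ACK timestamp.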
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 11afbff8
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter updates for net-next
      
      The following patchset contains Netfilter updates for your net-next
      tree, mostly from Florian Westphal to sort out the lack of sufficient
      validation in x_tables and connlabel preparation patches to add
      nf_tables support. They are:
      
      1) Ensure we don't go over the ruleset blob boundaries in
         mark_source_chains().
      
      2) Validate that target jumps land on an existing xt_entry. This extra
         sanitization comes with a performance penalty when loading the ruleset.
      
      3) Introduce xt_check_entry_offsets() and use it from {arp,ip,ip6}tables.
      
      4) Get rid of the smallish check_entry() functions in {arp,ip,ip6}tables.
      
      5) Assert the minimal possible target size in x_tables.
      
      6) Similar to #3, add xt_compat_check_entry_offsets() for compat code.
      
      7) Check that standard target size is valid.
      
      8) More sanitization to ensure that the target_offset field is correct.
      
      9) Add xt_check_entry_match() to validate that matches are well-formed.
      
      10-12) Three patches to reduce the number of parameters in
          translate_compat_table() for {arp,ip,ip6}tables by using a container
          structure.
      
      13) No need to return value from xt_compat_match_from_user(), so make
          it void.
      
      14) Consolidate translate_table() so it can be used by compat code too.
      
      15) Remove obsolete check for compat code, so we keep consistent with
          what was already removed in the native layout code (back in 2007).
      
      16) Get rid of target jump validation from mark_source_chains(),
          obsoleted by #2.
      
      17) Introduce xt_copy_counters_from_user() to consolidate counter
          copying, and use it from {arp,ip,ip6}tables.
      
      18,22) Get rid of unnecessary explicit inlining in ctnetlink for dump
          functions.
      
      19) Move nf_connlabel_match() to xt_connlabel.
      
      20) Skip event notification if connlabel did not change.
      
      21) Update of nf_connlabels_get() to make the upcoming nft connlabel
          support easier.
      
      23) Remove spinlock to read protocol state field in conntrack.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'nla_align-more' · 8d9ea160
      David S. Miller authored
      Nicolas Dichtel says:
      
      ====================
      netlink: align attributes when needed (patchset #1)
      
      This is the continuation of the work done to align netlink attributes
      when these attributes contain some 64-bit fields.
      
      David, if the third patch is too big (or maybe the series), I can split it.
      Just tell me what you prefer.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • taskstats: use the libnl API to align nlattr on 64-bit · 80df5542
      Nicolas Dichtel authored
      The goal of this patch is to use the new libnl API to align netlink
      attributes when needed.
      The layout of the netlink message will be a bit different after the patch,
      because the padattr (TASKSTATS_TYPE_STATS) will be inside the nested
      attribute instead of before it.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Nicolas Dichtel's avatar
    • libnl: add nla_put_u64_64bit() helper · 73520786
      Nicolas Dichtel authored
      With this function, nla_data() is aligned on a 64-bit area.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
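To illustrate what "nla_data() is aligned on a 64-bit area" means, here is a userspace sketch of the padding idea. It assumes the standard netlink attribute layout (a 4-byte header of 16-bit length and 16-bit type, followed by the payload) and a buffer whose start is 8-byte aligned. `put_u64_64bit` is a hypothetical Python stand-in, not the kernel helper itself.

```python
import struct

NLA_HDRLEN = 4  # 16-bit length + 16-bit type

def put_u64_64bit(buf, attrtype, value, padattr):
    """Append a u64 attribute so that its payload starts on an 8-byte boundary.

    buf is a bytearray whose length is a multiple of 4. If the 4-byte
    header would begin on an 8-byte boundary, the payload would land at
    offset 4 mod 8, so an empty attribute of type padattr is emitted first.
    """
    if len(buf) % 8 == 0:
        buf += struct.pack("=HH", NLA_HDRLEN, padattr)   # header-only pad
    buf += struct.pack("=HHQ", NLA_HDRLEN + 8, attrtype, value)
    return buf
```

The pad attribute carries no payload, so a receiver that does not know `padattr` can skip it like any other unknown attribute while the 64-bit value behind it can be read without unaligned access.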
    • libnl: nla_put_msecs(): align on a 64-bit area · 2175d87c
      Nicolas Dichtel authored
      nla_data() is now aligned on a 64-bit area.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • libnl: nla_put_s64(): align on a 64-bit area · 756a2f59
      Nicolas Dichtel authored
      nla_data() is now aligned on a 64-bit area.
      In fact, there is no user of this function.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • libnl: nla_put_net64(): align on a 64-bit area · e9bbe898
      Nicolas Dichtel authored
      nla_data() is now aligned on a 64-bit area.
      
      The temporary function nla_put_be64_32bit() is removed in this patch.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • libnl: nla_put_be64(): align on a 64-bit area · b46f6ded
      Nicolas Dichtel authored
      nla_data() is now aligned on a 64-bit area.
      
      A temporary version (nla_put_be64_32bit()) is added for nla_put_net64().
      This function is removed in the next patch.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • libnl: nla_put_le64(): align on a 64-bit area · e7479122
      Nicolas Dichtel authored
      nla_data() is now aligned on a 64-bit area.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • libnl: fix help of _64bit functions · 11a99573
      Nicolas Dichtel authored
      Fix typo and describe 'padattr'.
      
      Fixes: 089bf1a6 ("libnl: add more helpers to align attributes on 64-bit")
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 23 Apr, 2016 1 commit
  6. 21 Apr, 2016 3 commits