An error occurred fetching the project authors.
  1. 30 Apr, 2020 2 commits
  2. 09 Mar, 2020 1 commit
  3. 28 Feb, 2020 3 commits
    • Martin KaFai Lau's avatar
      bpf: inet_diag: Dump bpf_sk_storages in inet_diag_dump() · 085c20ca
      Martin KaFai Lau authored
      This patch will dump out the bpf_sk_storages of a sk
      if the request has the INET_DIAG_REQ_SK_BPF_STORAGES nlattr.
      
      An array of SK_DIAG_BPF_STORAGE_REQ_MAP_FD can be specified in
      INET_DIAG_REQ_SK_BPF_STORAGES to select which bpf_sk_storage to dump.
      If no map_fd is specified, all bpf_sk_storages of a sk will be dumped.
      
      bpf_sk_storages can be added to the system at runtime.  It is difficult
      to find a proper static value for cb->min_dump_alloc.
      
      This patch learns the nlattr size required to dump the bpf_sk_storages
      of a sk.  If it happens to be the very first nlmsg of a dump and it
      cannot fit the needed bpf_sk_storages,  it will try to expand the
      skb by "pskb_expand_head()".
      
      Instead of expanding it in inet_sk_diag_fill(), it is expanded at a
      sleepable context in __inet_diag_dump() so __GFP_DIRECT_RECLAIM can
      be used.  In __inet_diag_dump(), it will retry as long as the
      skb is empty and the cb->min_dump_alloc becomes larger than before.
      cb->min_dump_alloc is bounded by KMALLOC_MAX_SIZE.  The min_dump_alloc
      is also changed from 'u16' to 'u32' to accommodate a sk that may have
      a few large bpf_sk_storages.
      
      The updated cb->min_dump_alloc will also be used to allocate the skb in
      the next dump.  This logic already exists in netlink_dump().
      
      Here is the sample output of a locally modified 'ss' and it could be made
      more readable by using BTF later:
      [root@arch-fb-vm1 ~]# ss --bpf-map-id 14 --bpf-map-id 13 -t6an 'dst [::1]:8989'
      State Recv-Q Send-Q Local Address:Port  Peer Address:PortProcess
      ESTAB 0      0              [::1]:51072        [::1]:8989
      	 bpf_map_id:14 value:[ 3feb ]
      	 bpf_map_id:13 value:[ 3f ]
      ESTAB 0      0              [::1]:51070        [::1]:8989
      	 bpf_map_id:14 value:[ 3feb ]
      	 bpf_map_id:13 value:[ 3f ]
      
      [root@arch-fb-vm1 ~]# ~/devshare/github/iproute2/misc/ss --bpf-maps -t6an 'dst [::1]:8989'
      State         Recv-Q         Send-Q                   Local Address:Port                    Peer Address:Port         Process
      ESTAB         0              0                                [::1]:51072                          [::1]:8989
      	 bpf_map_id:14 value:[ 3feb ]
      	 bpf_map_id:13 value:[ 3f ]
      	 bpf_map_id:12 value:[ 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000... total:65407 ]
      ESTAB         0              0                                [::1]:51070                          [::1]:8989
      	 bpf_map_id:14 value:[ 3feb ]
      	 bpf_map_id:13 value:[ 3f ]
      	 bpf_map_id:12 value:[ 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000... total:65407 ]
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200225230427.1976129-1-kafai@fb.com
      085c20ca
    • Martin KaFai Lau's avatar
      inet_diag: Move the INET_DIAG_REQ_BYTECODE nlattr to cb->data · 0df6d328
      Martin KaFai Lau authored
      The INET_DIAG_REQ_BYTECODE nlattr is currently re-found every time when
      the "dump()" is re-started.
      
      In a latter patch, it will also need to parse the new
      INET_DIAG_REQ_SK_BPF_STORAGES nlattr to learn the map_fds. Thus, this
      patch takes this chance to store the parsed nlattr in cb->data
      during the "start" time of a dump.
      
      By doing this, the "bc" argument also becomes unnecessary
      and is removed.  Also, the two copies of the INET_DIAG_REQ_BYTECODE
      parsing-audit logic between compat/current version can be
      consolidated to one.
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200225230415.1975555-1-kafai@fb.com
      0df6d328
    • Martin KaFai Lau's avatar
      inet_diag: Refactor inet_sk_diag_fill(), dump(), and dump_one() · 5682d393
      Martin KaFai Lau authored
      In a latter patch, there is a need to update "cb->min_dump_alloc"
      in inet_sk_diag_fill() as it learns the diffierent bpf_sk_storages
      stored in a sk while dumping all sk(s) (e.g. tcp_hashinfo).
      
      The inet_sk_diag_fill() currently does not take the "cb" as an argument.
      One of the reason is inet_sk_diag_fill() is used by both dump_one()
      and dump() (which belong to the "struct inet_diag_handler".  The dump_one()
      interface does not pass the "cb" along.
      
      This patch is to make dump_one() pass a "cb".  The "cb" is created in
      inet_diag_cmd_exact().  The "nlh" and "in_skb" are stored in "cb" as
      the dump() interface does.  The total number of args in
      inet_sk_diag_fill() is also cut from 10 to 7 and
      that helps many callers to pass fewer args.
      
      In particular,
      "struct user_namespace *user_ns", "u32 pid", and "u32 seq"
      can be replaced by accessing "cb->nlh" and "cb->skb".
      
      A similar argument reduction is also made to
      inet_twsk_diag_fill() and inet_req_diag_fill().
      
      inet_csk_diag_dump() and inet_csk_diag_fill() are also removed.
      They are mostly equivalent to inet_sk_diag_fill().  Their repeated
      usages are very limited.  Thus, inet_sk_diag_fill() is directly used
      in those occasions.
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200225230409.1975173-1-kafai@fb.com
      5682d393
  4. 14 Dec, 2019 1 commit
  5. 07 Nov, 2019 1 commit
  6. 13 Oct, 2019 1 commit
  7. 30 May, 2019 1 commit
  8. 12 Feb, 2019 1 commit
    • Konstantin Khlebnikov's avatar
      inet_diag: fix reporting cgroup classid and fallback to priority · 1ec17dbd
      Konstantin Khlebnikov authored
      Field idiag_ext in struct inet_diag_req_v2 used as bitmap of requested
      extensions has only 8 bits. Thus extensions starting from DCTCPINFO
      cannot be requested directly. Some of them included into response
      unconditionally or hook into some of lower 8 bits.
      
      Extension INET_DIAG_CLASS_ID has not way to request from the beginning.
      
      This patch bundle it with INET_DIAG_TCLASS (ipv6 tos), fixes space
      reservation, and documents behavior for other extensions.
      
      Also this patch adds fallback to reporting socket priority. This filed
      is more widely used for traffic classification because ipv4 sockets
      automatically maps TOS to priority and default qdisc pfifo_fast knows
      about that. But priority could be changed via setsockopt SO_PRIORITY so
      INET_DIAG_TOS isn't enough for predicting class.
      
      Also cgroup2 obsoletes net_cls classid (it always zero), but we cannot
      reuse this field for reporting cgroup2 id because it is 64-bit (ino+gen).
      
      So, after this patch INET_DIAG_CLASS_ID will report socket priority
      for most common setup when net_cls isn't set and/or cgroup2 in use.
      
      Fixes: 0888e372 ("net: inet: diag: expose sockets cgroup classid")
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1ec17dbd
  9. 21 Dec, 2018 1 commit
    • Eric Dumazet's avatar
      tcp: fix a race in inet_diag_dump_icsk() · f0c928d8
      Eric Dumazet authored
      Alexei reported use after frees in inet_diag_dump_icsk() [1]
      
      Because we use refcount_set() when various sockets are setup and
      inserted into ehash, we also need to make sure inet_diag_dump_icsk()
      wont race with the refcount_set() operations.
      
      Jonathan Lemon sent a patch changing net_twsk_hashdance() but
      other spots would need risky changes.
      
      Instead, fix inet_diag_dump_icsk() as this bug came with
      linux-4.10 only.
      
      [1] Quoting Alexei :
      
      First something iterating over sockets finds already freed tw socket:
      
      refcount_t: increment on 0; use-after-free.
      WARNING: CPU: 2 PID: 2738 at lib/refcount.c:153 refcount_inc+0x26/0x30
      RIP: 0010:refcount_inc+0x26/0x30
      RSP: 0018:ffffc90004c8fbc0 EFLAGS: 00010282
      RAX: 000000000000002b RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff88085ee9d680 RSI: ffff88085ee954c8 RDI: ffff88085ee954c8
      RBP: ffff88010ecbd2c0 R08: 0000000000000000 R09: 000000000000174c
      R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: 0000000000000000
      R13: ffff8806ba9bf210 R14: ffffffff82304600 R15: ffff88010ecbd328
      FS:  00007f81f5a7d700(0000) GS:ffff88085ee80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f81e2a95000 CR3: 000000069b2eb006 CR4: 00000000003606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       inet_diag_dump_icsk+0x2b3/0x4e0 [inet_diag]  // sock_hold(sk); in net/ipv4/inet_diag.c:1002
       ? kmalloc_large_node+0x37/0x70
       ? __kmalloc_node_track_caller+0x1cb/0x260
       ? __alloc_skb+0x72/0x1b0
       ? __kmalloc_reserve.isra.40+0x2e/0x80
       __inet_diag_dump+0x3b/0x80 [inet_diag]
       netlink_dump+0x116/0x2a0
       netlink_recvmsg+0x205/0x3c0
       sock_read_iter+0x89/0xd0
       __vfs_read+0xf7/0x140
       vfs_read+0x8a/0x140
       SyS_read+0x3f/0xa0
       do_syscall_64+0x5a/0x100
      
      then a minute later twsk timer fires and hits two bad refcnts
      for this freed socket:
      
      refcount_t: decrement hit 0; leaking memory.
      WARNING: CPU: 31 PID: 0 at lib/refcount.c:228 refcount_dec+0x2e/0x40
      Modules linked in:
      RIP: 0010:refcount_dec+0x2e/0x40
      RSP: 0018:ffff88085f5c3ea8 EFLAGS: 00010296
      RAX: 000000000000002c RBX: ffff88010ecbd2c0 RCX: 000000000000083f
      RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
      RBP: ffffc90003c77280 R08: 0000000000000000 R09: 00000000000017d3
      R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffffffff82ad2d80
      R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <IRQ>
       inet_twsk_kill+0x9d/0xc0  // inet_twsk_bind_unhash(tw, hashinfo);
       call_timer_fn+0x29/0x110
       run_timer_softirq+0x36b/0x3a0
      
      refcount_t: underflow; use-after-free.
      WARNING: CPU: 31 PID: 0 at lib/refcount.c:187 refcount_sub_and_test+0x46/0x50
      RIP: 0010:refcount_sub_and_test+0x46/0x50
      RSP: 0018:ffff88085f5c3eb8 EFLAGS: 00010296
      RAX: 0000000000000026 RBX: ffff88010ecbd2c0 RCX: 000000000000083f
      RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
      RBP: ffff88010ecbd358 R08: 0000000000000000 R09: 000000000000185b
      R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffff88010ecbd358
      R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <IRQ>
       inet_twsk_put+0x12/0x20  // inet_twsk_put(tw);
       call_timer_fn+0x29/0x110
       run_timer_softirq+0x36b/0x3a0
      
      Fixes: 67db3e4b ("tcp: no longer hold ehash lock while calling tcp_get_info()")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: default avatarJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f0c928d8
  10. 12 Mar, 2018 1 commit
    • Xin Long's avatar
      sock_diag: request _diag module only when the family or proto has been registered · bf2ae2e4
      Xin Long authored
      Now when using 'ss' in iproute, kernel would try to load all _diag
      modules, which also causes corresponding family and proto modules
      to be loaded as well due to module dependencies.
      
      Like after running 'ss', sctp, dccp, af_packet (if it works as a module)
      would be loaded.
      
      For example:
      
        $ lsmod|grep sctp
        $ ss
        $ lsmod|grep sctp
        sctp_diag              16384  0
        sctp                  323584  5 sctp_diag
        inet_diag              24576  4 raw_diag,tcp_diag,sctp_diag,udp_diag
        libcrc32c              16384  3 nf_conntrack,nf_nat,sctp
      
      As these family and proto modules are loaded unintentionally, it
      could cause some problems, like:
      
      - Some debug tools use 'ss' to collect the socket info, which loads all
        those diag and family and protocol modules. It's noisy for identifying
        issues.
      
      - Users usually expect to drop sctp init packet silently when they
        have no sense of sctp protocol instead of sending abort back.
      
      - It wastes resources (especially with multiple netns), and SCTP module
        can't be unloaded once it's loaded.
      
      ...
      
      In short, it's really inappropriate to have these family and proto
      modules loaded unexpectedly when just doing debugging with inet_diag.
      
      This patch is to introduce sock_load_diag_module() where it loads
      the _diag module only when it's corresponding family or proto has
      been already registered.
      
      Note that we can't just load _diag module without the family or
      proto loaded, as some symbols used in _diag module are from the
      family or proto module.
      
      v1->v2:
        - move inet proto check to inet_diag to avoid a compiling err.
      v2->v3:
        - define sock_load_diag_module in sock.c and export one symbol
          only.
        - improve the changelog.
      Reported-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Acked-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: default avatarPhil Sutter <phil@nwl.cc>
      Acked-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bf2ae2e4
  11. 02 Jan, 2018 1 commit
    • Kristian Evensen's avatar
      inet_diag: Add equal-operator for ports · bbb6189d
      Kristian Evensen authored
      inet_diag currently provides less/greater than or equal operators for
      comparing ports when filtering sockets. An equal comparison can be
      performed by combining the two existing operators, or a user can for
      example request a port range and then do the final filtering in
      userspace. However, these approaches both have drawbacks. Implementing
      equal using LE/GE causes the size and complexity of a filter to grow
      quickly as the number of ports increase, while it on busy machines would
      be great if the kernel only returns information about relevant sockets.
      
      This patch introduces source and destination port equal operators.
      INET_DIAG_BC_S_EQ is used to match a source port, INET_DIAG_BC_D_EQ a
      destination port, and usage is the same as for the existing port
      operators.  I.e., the port to match is stored in the no-member of the
      next inet_diag_bc_op-struct in the filter.
      Signed-off-by: default avatarKristian Evensen <kristian.evensen@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bbb6189d
  12. 02 Sep, 2017 1 commit
  13. 18 Aug, 2017 1 commit
  14. 14 Jan, 2017 2 commits
    • Yuchung Cheng's avatar
      tcp: remove early retransmit · bec41a11
      Yuchung Cheng authored
      This patch removes the support of RFC5827 early retransmit (i.e.,
      fast recovery on small inflight with <3 dupacks) because it is
      subsumed by the new RACK loss detection. More specifically when
      RACK receives DUPACKs, it'll arm a reordering timer to start fast
      recovery after a quarter of (min)RTT, hence it covers the early
      retransmit except RACK does not limit itself to specific inflight
      or dupack numbers.
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bec41a11
    • Yuchung Cheng's avatar
      tcp: add reordering timer in RACK loss detection · 57dde7f7
      Yuchung Cheng authored
      This patch makes RACK install a reordering timer when it suspects
      some packets might be lost, but wants to delay the decision
      a little bit to accomodate reordering.
      
      It does not create a new timer but instead repurposes the existing
      RTO timer, because both are meant to retransmit packets.
      Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when
      the RACK timing check fails. The wait time is set to
      
        RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge
      
      This translates to expecting a packet (Packet) should take
      (RACK.RTT + RACK.reo_wnd + fudge) to deliver after it was sent.
      
      When there are multiple packets that need a timer, we use one timer
      with the maximum timeout. Therefore the timer conservatively uses
      the maximum window to expire N packets by one timeout, instead of
      N timeouts to expire N packets sent at different times.
      
      The fudge factor is 2 jiffies to ensure when the timer fires, all
      the suspected packets would exceed the deadline and be marked lost
      by tcp_rack_detect_loss(). It has to be at least 1 jiffy because the
      clock may tick between calling icsk_reset_xmit_timer(timeout) and
      actually hang the timer. The next jiffy is to lower-bound the timeout
      to 2 jiffies when reo_wnd is < 1ms.
      
      When the reordering timer fires (tcp_rack_reo_timeout): If we aren't
      in Recovery we'll enter fast recovery and force fast retransmit.
      This is very similar to the early retransmit (RFC5827) except RACK
      is not constrained to only enter recovery for small outstanding
      flights.
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      57dde7f7
  15. 09 Nov, 2016 1 commit
  16. 23 Oct, 2016 1 commit
    • Cyrill Gorcunov's avatar
      net: ip, diag -- Add diag interface for raw sockets · 432490f9
      Cyrill Gorcunov authored
      In criu we are actively using diag interface to collect sockets
      present in the system when dumping applications. And while for
      unix, tcp, udp[lite], packet, netlink it works as expected,
      the raw sockets do not have. Thus add it.
      
      v2:
       - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
       - implement @destroy for diag requests (by dsa@)
      
      v3:
       - add export of raw_abort for IPv6 (by dsa@)
       - pass net-admin flag into inet_sk_diag_fill due to
         changes in net-next branch (by dsa@)
      
      v4:
       - use @pad in struct inet_diag_req_v2 for raw socket
         protocol specification: raw module carries sockets
         which may have custom protocol passed from socket()
         syscall and sole @sdiag_protocol is not enough to
         match underlied ones
       - start reporting protocol specifed in socket() call
         when sockets are raw ones for the same reason: user
         space tools like ss may parse this attribute and use
         it for socket matching
      
      v5 (by eric.dumazet@):
       - use sock_hold in raw_sock_get instead of atomic_inc,
         we're holding (raw_v4_hashinfo|raw_v6_hashinfo)->lock
         when looking up so counter won't be zero here.
      
      v6:
       - use sdiag_raw_protocol() helper which will access @pad
         structure used for raw sockets protocol specification:
         we can't simply rename this member without breaking uapi
      
      v7:
       - sine sdiag_raw_protocol() helper is not suitable for
         uapi lets rather make an alias structure with proper
         names. __check_inet_diag_req_raw helper will catch
         if any of structure unintentionally changed.
      
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: David Ahern <dsa@cumulusnetworks.com>
      CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      CC: James Morris <jmorris@namei.org>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Andrey Vagin <avagin@openvz.org>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      432490f9
  17. 20 Oct, 2016 1 commit
    • Eric Dumazet's avatar
      tcp: relax listening_hash operations · 9652dc2e
      Eric Dumazet authored
      softirq handlers use RCU protection to lookup listeners,
      and write operations all happen from process context.
      We do not need to block BH for dump operations.
      
      Also SYN_RECV since request sockets are stored in the ehash table :
      
       1) inet_diag_dump_icsk() no longer need to clear
          cb->args[3] and cb->args[4] that were used as cursors while
          iterating the old per listener hash table.
      
       2) Also factorize a test : No need to scan listening_hash[]
          if r->id.idiag_dport is not zero.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9652dc2e
  18. 08 Sep, 2016 1 commit
  19. 25 Aug, 2016 2 commits
  20. 28 Jun, 2016 1 commit
  21. 26 Apr, 2016 1 commit
  22. 21 Apr, 2016 1 commit
  23. 15 Apr, 2016 1 commit
    • Xin Long's avatar
      sctp: export some functions for sctp_diag in inet_diag · cb2050a7
      Xin Long authored
      inet_diag_msg_common_fill is used to fill the diag msg common info,
      we need to use it in sctp_diag as well, so export it.
      
      inet_diag_msg_attrs_fill is used to fill some common attrs info between
      sctp diag and tcp diag.
      
      v2->v3:
      - do not need to define and export inet_diag_get_handler any more.
        cause all the functions in it are in sctp_diag.ko, we just call
        them in sctp_diag.ko.
      
      - add inet_diag_msg_attrs_fill to make codes clear.
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cb2050a7
  24. 05 Apr, 2016 2 commits
    • Eric Dumazet's avatar
      tcp/dccp: do not touch listener sk_refcnt under synflood · 3b24d854
      Eric Dumazet authored
      When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple
      cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes.
      
      By letting listeners use SOCK_RCU_FREE infrastructure,
      we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt
      
      Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets,
      only listeners are impacted by this change.
      
      Peak performance under SYNFLOOD is increased by ~33% :
      
      On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps
      
      Most consuming functions are now skb_set_owner_w() and sock_wfree()
      contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3b24d854
    • Eric Dumazet's avatar
      tcp/dccp: use rcu locking in inet_diag_find_one_icsk() · 2d331915
      Eric Dumazet authored
      RX packet processing holds rcu_read_lock(), so we can remove
      pairs of rcu_read_lock()/rcu_read_unlock() in lookup functions
      if inet_diag also holds rcu before calling them.
      
      This is needed anyway as __inet_lookup_listener() and
      inet6_lookup_listener() will soon no longer increment
      refcount on the found listener.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2d331915
  25. 14 Mar, 2016 1 commit
  26. 11 Feb, 2016 1 commit
  27. 21 Jan, 2016 1 commit
  28. 16 Dec, 2015 2 commits
  29. 03 Oct, 2015 2 commits
    • Eric Dumazet's avatar
      tcp/dccp: install syn_recv requests into ehash table · 079096f1
      Eric Dumazet authored
      In this patch, we insert request sockets into TCP/DCCP
      regular ehash table (where ESTABLISHED and TIMEWAIT sockets
      are) instead of using the per listener hash table.
      
      ACK packets find SYN_RECV pseudo sockets without having
      to find and lock the listener.
      
      In nominal conditions, this halves pressure on listener lock.
      
      Note that this will allow for SO_REUSEPORT refinements,
      so that we can select a listener using cpu/numa affinities instead
      of the prior 'consistent hash', since only SYN packets will
      apply this selection logic.
      
      We will shrink listen_sock in the following patch to ease
      code review.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Ying Cai <ycai@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      079096f1
    • Eric Dumazet's avatar
      tcp: move qlen/young out of struct listen_sock · aac065c5
      Eric Dumazet authored
      qlen_inc & young_inc were protected by listener lock,
      while qlen_dec & young_dec were atomic fields.
      
      Everything needs to be atomic for upcoming lockless listener.
      
      Also move qlen/young in request_sock_queue as we'll get rid
      of struct listen_sock eventually.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      aac065c5
  30. 11 Jul, 2015 1 commit
    • Phil Sutter's avatar
      net: inet_diag: always export IPV6_V6ONLY sockopt for listening sockets · 8220ea23
      Phil Sutter authored
      Reconsidering my commit 20462155 "net: inet_diag: export IPV6_V6ONLY
      sockopt", I am not happy with the limitations it causes for socket
      analysing code in userspace. Exporting the value only if it is set makes
      it hard for userspace to decide whether the option is not set or the
      kernel does not support exporting the option at all.
      
      >From an auditor's perspective, the interesting question for listening
      AF_INET6 sockets is: "Does it NOT have IPV6_V6ONLY set?" Because it is
      the unexpected case. This patch allows to answer this question reliably.
      Signed-off-by: default avatarPhil Sutter <phil@nwl.cc>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8220ea23
  31. 24 Jun, 2015 1 commit
    • Phil Sutter's avatar
      net: inet_diag: export IPV6_V6ONLY sockopt · 20462155
      Phil Sutter authored
      For AF_INET6 sockets, the value of struct ipv6_pinfo.ipv6only is
      exported to userspace. It indicates whether a socket bound to in6addr_any
      listens on IPv4 as well as IPv6. Since the socket is natively IPv6, it is not
      listed by e.g. 'ss -l -4'.
      
      This patch is accompanied by an appropriate one for iproute2 to enable
      the additional information in 'ss -e'.
      Signed-off-by: default avatarPhil Sutter <phil@nwl.cc>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      20462155
  32. 23 Jun, 2015 1 commit