1. 21 Mar, 2019 9 commits
    • David Ahern's avatar
      ipv4: Allow amount of dirty memory from fib resizing to be controllable · 9ab948a9
      David Ahern authored
      fib_trie implementation calls synchronize_rcu when a certain amount of
      pages are dirty from freed entries. The number of pages was determined
      experimentally in 2009 (commit c3059477).
      
      At the current setting, synchronize_rcu is called often -- 51 times in a
      second in one test with an average of an 8 msec delay adding a fib entry.
      The total impact is a lot of slow down modifying the fib. This is seen
      in the output of 'time' - the difference between real time and sys+user.
      For example, using 720,022 single path routes and 'ip -batch'[1]:
      
          $ time ./ip -batch ipv4/routes-1-hops
          real    0m14.214s
          user    0m2.513s
          sys     0m6.783s
      
      So roughly 35% of the actual time to install the routes is from the ip
      command getting scheduled out, most notably due to synchronize_rcu (this
      is observed using 'perf sched timehist').
      
      This patch makes the amount of dirty memory configurable between 64k where
      the synchronize_rcu is called often (small, low end systems that are memory
      sensitive) to 64M where synchronize_rcu is called rarely during a large
      FIB change (for high end systems with lots of memory). The default is 512kB
      which corresponds to the current setting of 128 pages with a 4kB page size.
      
      As an example, at 16MB the worst interval shows 4 calls to synchronize_rcu
      in a second blocking for up to 30 msec in a single instance, and a total
      of almost 100 msec across the 4 calls in the second. The trade off is
      allowing FIB entries to consume more memory in a given time window but
      but with much better fib insertion rates (~30% increase in prefixes/sec).
      With this patch and net.ipv4.fib_sync_mem set to 16MB, the same batch
      file runs in:
      
          $ time ./ip -batch ipv4/routes-1-hops
          real    0m9.692s
          user    0m2.491s
          sys     0m6.769s
      
      So the dead time is reduced to about 1/2 second or <5% of the real time.
      
      [1] 'ip' modified to not request ACK messages which improves route
          insertion times by about 20%
      Signed-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9ab948a9
    • Kirill Tkhai's avatar
    • Kirill Tkhai's avatar
      tun: Add ioctl() TUNGETDEVNETNS cmd to allow obtaining real net ns of tun device · 0c3e0e3b
      Kirill Tkhai authored
      In commit f2780d6d "tun: Add ioctl() SIOCGSKNS cmd to allow
      obtaining net ns of tun device" it was missed that tun may change
      its net ns, while net ns of socket remains the same as it was
      created initially. SIOCGSKNS returns net ns of socket, so it is
      not suitable for obtaining net ns of device.
      
      We may have two tun devices with the same names in two net ns,
      and in this case it's not possible to determ, which of them
      fd refers to (TUNGETIFF will return the same name).
      
      This patch adds new ioctl() cmd for obtaining net ns of a device.
      Reported-by: default avatarHarald Albrecht <harald.albrecht@gmx.net>
      Signed-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0c3e0e3b
    • David S. Miller's avatar
      Merge branch 'ipv6-Change-addrconf_f6i_alloc-to-use-ip6_route_info_create' · 28b18b39
      David S. Miller authored
      David Ahern says:
      
      ====================
      ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create
      
      addrconf_f6i_alloc is the last caller of fib6_info_alloc besides
      ip6_route_info_create. There really is no good reason for it do
      its own fib6_info initialization, so convert it to call
      ip6_route_info_create.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      28b18b39
    • David Ahern's avatar
      ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create · c7a1ce39
      David Ahern authored
      Change addrconf_f6i_alloc to generate a fib6_config and call
      ip6_route_info_create. addrconf_f6i_alloc is the last caller to
      fib6_info_alloc besides ip6_route_info_create, and there is no
      reason for it to do its own initialization on a fib6_info.
      
      Host routes need to be created even if the device is down, so add a
      new flag, fc_ignore_dev_down, to fib6_config and update fib6_nh_init
      to not error out if device is not up.
      
      Notes on the conversion:
      - ip_fib_metrics_init is the same as fib6_config has fc_mx set to NULL
        and fc_mx_len set to 0
      - dst_nocount is handled by the RTF_ADDRCONF flag
      - dst_host is handled by fc_dst_len = 128
      
      nh_gw does not get set after the conversion to ip6_route_info_create
      but it should not be set in addrconf_f6i_alloc since this is a host
      route not a gateway route.
      
      Everything else is a straight forward map between fib6_info and
      fib6_config.
      Signed-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c7a1ce39
    • David Ahern's avatar
      ipv6: Move setting default metric for routes · 67f69513
      David Ahern authored
      ip6_route_info_create is a low level function for ensuring fc_metric is
      set. Move the check and default setting to the 2 locations that do not
      already set fc_metric before calling ip6_route_info_create. This is
      required for the next patch which moves addrconf allocations to
      ip6_route_info_create and want the metric for host routes to be 0.
      Signed-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      67f69513
    • Vakul Garg's avatar
      net/tls: Replace kfree_skb() with consume_skb() · a88c26f6
      Vakul Garg authored
      To free the skb in normal course of processing, consume_skb() should be
      used. Only for failure paths, skb_free() is intended to be used.
      
      https://www.kernel.org/doc/htmldocs/networking/API-consume-skb.htmlSigned-off-by: default avatarVakul Garg <vakul.garg@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a88c26f6
    • Hoang Le's avatar
      tipc: fix a null pointer deref · 08e046c8
      Hoang Le authored
      In commit c55c8eda ("tipc: smooth change between replicast and
      broadcast") we introduced new method to eliminate the risk of message
      reordering that happen in between different nodes.
      Unfortunately, we forgot checking at receiving side to ignore intra node.
      
      We fix this by checking and returning if arrived message from intra node.
      
      syzbot report:
      
      ==================================================================
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 7820 Comm: syz-executor418 Not tainted 5.0.0+ #61
      Hardware name: Google Google Compute Engine/Google Compute Engine,
      BIOS Google 01/01/2011
      RIP: 0010:tipc_mcast_filter_msg+0x21b/0x13d0 net/tipc/bcast.c:782
      Code: 45 c0 0f 84 39 06 00 00 48 89 5d 98 e8 ce ab a5 fa 49 8d bc
       24 c8 00 00 00 48 b9 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03
       <80> 3c 08 00 0f 85 9a 0e 00 00 49 8b 9c 24 c8 00 00 00 48 be 00 00
      RSP: 0018:ffff8880959defc8 EFLAGS: 00010202
      RAX: 0000000000000019 RBX: ffff888081258a48 RCX: dffffc0000000000
      RDX: 0000000000000000 RSI: ffffffff86cab862 RDI: 00000000000000c8
      RBP: ffff8880959df030 R08: ffff8880813d0200 R09: ffffed1015d05bc8
      R10: ffffed1015d05bc7 R11: ffff8880ae82de3b R12: 0000000000000000
      R13: 000000000000002c R14: 0000000000000000 R15: ffff888081258a48
      FS:  000000000106a880(0000) GS:ffff8880ae800000(0000)
       knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020001cc0 CR3: 0000000094a20000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       tipc_sk_filter_rcv+0x182d/0x34f0 net/tipc/socket.c:2168
       tipc_sk_enqueue net/tipc/socket.c:2254 [inline]
       tipc_sk_rcv+0xc45/0x25a0 net/tipc/socket.c:2305
       tipc_sk_mcast_rcv+0x724/0x1020 net/tipc/socket.c:1209
       tipc_mcast_xmit+0x7fe/0x1200 net/tipc/bcast.c:410
       tipc_sendmcast+0xb36/0xfc0 net/tipc/socket.c:820
       __tipc_sendmsg+0x10df/0x18d0 net/tipc/socket.c:1358
       tipc_sendmsg+0x53/0x80 net/tipc/socket.c:1291
       sock_sendmsg_nosec net/socket.c:651 [inline]
       sock_sendmsg+0xdd/0x130 net/socket.c:661
       ___sys_sendmsg+0x806/0x930 net/socket.c:2260
       __sys_sendmsg+0x105/0x1d0 net/socket.c:2298
       __do_sys_sendmsg net/socket.c:2307 [inline]
       __se_sys_sendmsg net/socket.c:2305 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2305
       do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4401c9
      Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8
       48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05
       <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007ffd887fa9d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401c9
      RDX: 0000000000000000 RSI: 0000000020002140 RDI: 0000000000000003
      RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401a50
      R13: 0000000000401ae0 R14: 0000000000000000 R15: 0000000000000000
      Modules linked in:
      ---[ end trace ba79875754e1708f ]---
      
      Reported-by: syzbot+be4bdf2cc3e85e952c50@syzkaller.appspotmail.com
      Fixes: c55c8eda ("tipc: smooth change between replicast and broadcast")
      Acked-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarHoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      08e046c8
    • Hoang Le's avatar
      tipc: fix use-after-free in tipc_sk_filter_rcv · 77d5ad40
      Hoang Le authored
      skb free-ed in:
        1/ condition 1: tipc_sk_filter_rcv -> tipc_sk_proto_rcv
        2/ condition 2: tipc_sk_filter_rcv -> tipc_group_filter_msg
      This leads to a "use-after-free" access in the next condition.
      
      We fix this by intializing the variable at declaration, then it is safe
      to check this variable to continue processing if condition matches.
      
      syzbot report:
      
      ==================================================================
      BUG: KASAN: use-after-free in tipc_sk_filter_rcv+0x2166/0x34f0
       net/tipc/socket.c:2167
      Read of size 4 at addr ffff88808ea58534 by task kworker/u4:0/7
      
      CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.0.0+ #61
      Hardware name: Google Google Compute Engine/Google Compute Engine,
       BIOS Google 01/01/2011
      Workqueue: tipc_send tipc_conn_send_work
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
       kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:131
       tipc_sk_filter_rcv+0x2166/0x34f0 net/tipc/socket.c:2167
       tipc_sk_enqueue net/tipc/socket.c:2254 [inline]
       tipc_sk_rcv+0xc45/0x25a0 net/tipc/socket.c:2305
       tipc_topsrv_kern_evt+0x3b7/0x580 net/tipc/topsrv.c:610
       tipc_conn_send_to_sock+0x43e/0x5f0 net/tipc/topsrv.c:283
       tipc_conn_send_work+0x65/0x80 net/tipc/topsrv.c:303
       process_one_work+0x98e/0x1790 kernel/workqueue.c:2269
       worker_thread+0x98/0xe40 kernel/workqueue.c:2415
       kthread+0x357/0x430 kernel/kthread.c:253
       ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
      
      Reported-by: syzbot+e863893591cc7a622e40@syzkaller.appspotmail.com
      Fixes: c55c8eda ("tipc: smooth change between replicast and broadcast")
      Acked-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarHoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      77d5ad40
  2. 20 Mar, 2019 23 commits
  3. 19 Mar, 2019 8 commits