1. 14 Dec, 2019 8 commits
    • Merge branch 'tcp-take-care-of-empty-skbs-in-write-queue' · cd1263b6
      Jakub Kicinski authored
      Eric Dumazet says:
      ====================
      tcp: take care of empty skbs in write queue
      
      We understood recently that TCP sockets could have an empty
      skb at the tail of the write queue, leading to various problems.
      
      This patch series :
      
      1) Make sure we do not send an empty packet since this
         was unintended and causing crashes in old kernels.
      
      2) Change tcp_write_queue_empty() to not be fooled by
         the presence of an empty skb.
      
      3) Fix a bug that could trigger suboptimal epoll()
         application behavior under memory pressure.
      ====================
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
    • tcp: refine rule to allow EPOLLOUT generation under mem pressure · 216808c6
      Eric Dumazet authored
      At the time commit ce5ec440 ("tcp: ensure epoll edge trigger
      wakeup when write queue is empty") was added to the kernel,
      we still had a single write queue, combining rtx and write queues.
      
      Once we moved the rtx queue into a separate rb-tree, testing
      if sk_write_queue is empty has been suboptimal.
      
      Indeed, if we have packets in the rtx queue, we probably want
      to delay the EPOLLOUT generation at the time incoming packets
      will free them, making room, but more importantly avoiding
      flooding application with EPOLLOUT events.
      
      Solution is to use tcp_rtx_and_write_queues_empty() helper.
      
      Fixes: 75c119af ("tcp: implement rb-tree based retransmit queue")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
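The rule described in this commit can be illustrated with a small user-space model. This is only a sketch: `struct sock_model` and both function names are hypothetical stand-ins, not the kernel's data structures, and the real check lives in the TCP memory-pressure path rather than in a standalone predicate.

```c
#include <stdbool.h>

/* Hypothetical model of a socket with separate write and rtx queues. */
struct sock_model {
    unsigned int write_queue_len; /* skbs not yet sent */
    unsigned int rtx_queue_len;   /* skbs sent but not yet acknowledged */
};

/* Older rule: only the write queue is consulted. */
static bool epollout_allowed_old(const struct sock_model *sk)
{
    return sk->write_queue_len == 0;
}

/* Refined rule: both queues must be empty, so that EPOLLOUT is deferred
 * while incoming ACKs can still free packets from the rtx queue. */
static bool epollout_allowed_new(const struct sock_model *sk)
{
    return sk->write_queue_len == 0 && sk->rtx_queue_len == 0;
}
```

With packets still sitting in the rtx queue, the older rule signals EPOLLOUT immediately while the refined rule waits.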
    • tcp: refine tcp_write_queue_empty() implementation · ee2aabd3
      Eric Dumazet authored
      Due to how tcp_sendmsg() is implemented, we can have an empty
      skb at the tail of the write queue.
      
      Most [1] tcp_write_queue_empty() callers want to know if there is
      anything to send (payload and/or FIN)
      
      Instead of checking if the sk_write_queue is empty, we need
      to test if tp->write_seq == tp->snd_nxt
      
      [1] tcp_send_fin() was the only caller that expected to
       see if an skb was in the write queue, I have changed the code
       to reuse the tcp_write_queue_tail() result.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
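The difference between the two tests in this commit message can be sketched in user space. The struct below is a hypothetical model: only the field names `write_seq` and `snd_nxt` come from the commit message, while `queued_skbs` and the function names are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the state the commit talks about. */
struct tcp_model {
    uint32_t write_seq;       /* sequence just past the last queued byte */
    uint32_t snd_nxt;         /* next sequence number to be sent */
    unsigned int queued_skbs; /* skbs on the write queue, empty ones included */
};

/* Old-style test: fooled by an empty skb at the tail of the queue. */
static bool write_queue_empty_by_list(const struct tcp_model *tp)
{
    return tp->queued_skbs == 0;
}

/* Sequence-based test: true when there is no payload or FIN left to send. */
static bool write_queue_empty_by_seq(const struct tcp_model *tp)
{
    return tp->write_seq == tp->snd_nxt;
}
```

An empty skb advances neither counter, so the sequence-based test reports "nothing to send" even though the list is not empty.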
    • tcp: do not send empty skb from tcp_write_xmit() · 1f85e626
      Eric Dumazet authored
      Backport of commit fdfc5c85 ("tcp: remove empty skb from
      write queue in error cases") in linux-4.14 stable triggered
      various bugs. One of them has been fixed in commit ba2ddb43f270
      ("tcp: Don't dequeue SYN/FIN-segments from write-queue"), but
      we still have crashes in some occasions.
      
      Root-cause is that when tcp_sendmsg() has allocated a fresh
      skb and could not append a fragment before being blocked
      in sk_stream_wait_memory(), tcp_write_xmit() might be called
      and decide to send this fresh and empty skb.
      
      Sending an empty packet is not only silly, it might have caused
      many issues we had in the past with tp->packets_out being
      out of sync.
      
      Fixes: c65f7f00 ("[TCP]: Simplify SKB data portion allocation with NETIF_F_SG.")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Christoph Paasch <cpaasch@apple.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
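One way to picture the guard this commit describes is a transmit loop that stops as soon as it meets a segment carrying no sequence space. The sketch below is a user-space model under that assumption; `struct seg_model` and `xmit_until_empty()` are hypothetical names, and the commit message does not spell out the exact kernel check.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a queued segment: [seq, end_seq) in sequence space. */
struct seg_model {
    uint32_t seq;
    uint32_t end_seq;
};

/* Returns how many segments would be transmitted before the loop meets
 * an empty one (seq == end_seq) and stops without sending it. */
static size_t xmit_until_empty(const struct seg_model *q, size_t n)
{
    size_t sent = 0;

    for (size_t i = 0; i < n; i++) {
        if (q[i].end_seq == q[i].seq)
            break; /* fresh skb with no payload yet: do not send */
        sent++;
    }
    return sent;
}
```

A FIN-only segment occupies one sequence number, so in this model it is not mistaken for an empty one.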
    • 6pack,mkiss: fix possible deadlock · 5c9934b6
      Eric Dumazet authored
      We got another syzbot report [1] that tells us we must use
      write_lock_irq()/write_unlock_irq() to avoid possible deadlock.
      
      [1]
      
      WARNING: inconsistent lock state
      5.5.0-rc1-syzkaller #0 Not tainted
      --------------------------------
      inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-R} usage.
      syz-executor826/9605 [HC1[1]:SC0[0]:HE0:SE1] takes:
      ffffffff8a128718 (disc_data_lock){+-..}, at: sp_get.isra.0+0x1d/0xf0 drivers/net/ppp/ppp_synctty.c:138
      {HARDIRQ-ON-W} state was registered at:
        lock_acquire+0x190/0x410 kernel/locking/lockdep.c:4485
        __raw_write_lock_bh include/linux/rwlock_api_smp.h:203 [inline]
        _raw_write_lock_bh+0x33/0x50 kernel/locking/spinlock.c:319
        sixpack_close+0x1d/0x250 drivers/net/hamradio/6pack.c:657
        tty_ldisc_close.isra.0+0x119/0x1a0 drivers/tty/tty_ldisc.c:489
        tty_set_ldisc+0x230/0x6b0 drivers/tty/tty_ldisc.c:585
        tiocsetd drivers/tty/tty_io.c:2337 [inline]
        tty_ioctl+0xe8d/0x14f0 drivers/tty/tty_io.c:2597
        vfs_ioctl fs/ioctl.c:47 [inline]
        file_ioctl fs/ioctl.c:545 [inline]
        do_vfs_ioctl+0x977/0x14e0 fs/ioctl.c:732
        ksys_ioctl+0xab/0xd0 fs/ioctl.c:749
        __do_sys_ioctl fs/ioctl.c:756 [inline]
        __se_sys_ioctl fs/ioctl.c:754 [inline]
        __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:754
        do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      irq event stamp: 3946
      hardirqs last  enabled at (3945): [<ffffffff87c86e43>] __raw_spin_unlock_irq include/linux/spinlock_api_smp.h:168 [inline]
      hardirqs last  enabled at (3945): [<ffffffff87c86e43>] _raw_spin_unlock_irq+0x23/0x80 kernel/locking/spinlock.c:199
      hardirqs last disabled at (3946): [<ffffffff8100675f>] trace_hardirqs_off_thunk+0x1a/0x1c arch/x86/entry/thunk_64.S:42
      softirqs last  enabled at (2658): [<ffffffff86a8b4df>] spin_unlock_bh include/linux/spinlock.h:383 [inline]
      softirqs last  enabled at (2658): [<ffffffff86a8b4df>] clusterip_netdev_event+0x46f/0x670 net/ipv4/netfilter/ipt_CLUSTERIP.c:222
      softirqs last disabled at (2656): [<ffffffff86a8b22b>] spin_lock_bh include/linux/spinlock.h:343 [inline]
      softirqs last disabled at (2656): [<ffffffff86a8b22b>] clusterip_netdev_event+0x1bb/0x670 net/ipv4/netfilter/ipt_CLUSTERIP.c:196
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(disc_data_lock);
        <Interrupt>
          lock(disc_data_lock);
      
       *** DEADLOCK ***
      
      5 locks held by syz-executor826/9605:
       #0: ffff8880a905e198 (&tty->legacy_mutex){+.+.}, at: tty_lock+0xc7/0x130 drivers/tty/tty_mutex.c:19
       #1: ffffffff899a56c0 (rcu_read_lock){....}, at: mutex_spin_on_owner+0x0/0x330 kernel/locking/mutex.c:413
       #2: ffff8880a496a2b0 (&(&i->lock)->rlock){-.-.}, at: spin_lock include/linux/spinlock.h:338 [inline]
       #2: ffff8880a496a2b0 (&(&i->lock)->rlock){-.-.}, at: serial8250_interrupt+0x2d/0x1a0 drivers/tty/serial/8250/8250_core.c:116
       #3: ffffffff8c104048 (&port_lock_key){-.-.}, at: serial8250_handle_irq.part.0+0x24/0x330 drivers/tty/serial/8250/8250_port.c:1823
       #4: ffff8880a905e090 (&tty->ldisc_sem){++++}, at: tty_ldisc_ref+0x22/0x90 drivers/tty/tty_ldisc.c:288
      
      stack backtrace:
      CPU: 1 PID: 9605 Comm: syz-executor826 Not tainted 5.5.0-rc1-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x197/0x210 lib/dump_stack.c:118
       print_usage_bug.cold+0x327/0x378 kernel/locking/lockdep.c:3101
       valid_state kernel/locking/lockdep.c:3112 [inline]
       mark_lock_irq kernel/locking/lockdep.c:3309 [inline]
       mark_lock+0xbb4/0x1220 kernel/locking/lockdep.c:3666
       mark_usage kernel/locking/lockdep.c:3554 [inline]
       __lock_acquire+0x1e55/0x4a00 kernel/locking/lockdep.c:3909
       lock_acquire+0x190/0x410 kernel/locking/lockdep.c:4485
       __raw_read_lock include/linux/rwlock_api_smp.h:149 [inline]
       _raw_read_lock+0x32/0x50 kernel/locking/spinlock.c:223
       sp_get.isra.0+0x1d/0xf0 drivers/net/ppp/ppp_synctty.c:138
       sixpack_write_wakeup+0x25/0x340 drivers/net/hamradio/6pack.c:402
       tty_wakeup+0xe9/0x120 drivers/tty/tty_io.c:536
       tty_port_default_wakeup+0x2b/0x40 drivers/tty/tty_port.c:50
       tty_port_tty_wakeup+0x57/0x70 drivers/tty/tty_port.c:387
       uart_write_wakeup+0x46/0x70 drivers/tty/serial/serial_core.c:104
       serial8250_tx_chars+0x495/0xaf0 drivers/tty/serial/8250/8250_port.c:1761
       serial8250_handle_irq.part.0+0x2a2/0x330 drivers/tty/serial/8250/8250_port.c:1834
       serial8250_handle_irq drivers/tty/serial/8250/8250_port.c:1820 [inline]
       serial8250_default_handle_irq+0xc0/0x150 drivers/tty/serial/8250/8250_port.c:1850
       serial8250_interrupt+0xf1/0x1a0 drivers/tty/serial/8250/8250_core.c:126
       __handle_irq_event_percpu+0x15d/0x970 kernel/irq/handle.c:149
       handle_irq_event_percpu+0x74/0x160 kernel/irq/handle.c:189
       handle_irq_event+0xa7/0x134 kernel/irq/handle.c:206
       handle_edge_irq+0x25e/0x8d0 kernel/irq/chip.c:830
       generic_handle_irq_desc include/linux/irqdesc.h:156 [inline]
       do_IRQ+0xde/0x280 arch/x86/kernel/irq.c:250
       common_interrupt+0xf/0xf arch/x86/entry/entry_64.S:607
       </IRQ>
      RIP: 0010:cpu_relax arch/x86/include/asm/processor.h:685 [inline]
      RIP: 0010:mutex_spin_on_owner+0x247/0x330 kernel/locking/mutex.c:579
      Code: c3 be 08 00 00 00 4c 89 e7 e8 e5 06 59 00 4c 89 e0 48 c1 e8 03 42 80 3c 38 00 0f 85 e1 00 00 00 49 8b 04 24 a8 01 75 96 f3 90 <e9> 2f fe ff ff 0f 0b e8 0d 19 09 00 84 c0 0f 85 ff fd ff ff 48 c7
      RSP: 0018:ffffc90001eafa20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd7
      RAX: 0000000000000000 RBX: ffff88809fd9e0c0 RCX: 1ffffffff13266dd
      RDX: 0000000000000000 RSI: 0000000000000008 RDI: 0000000000000000
      RBP: ffffc90001eafa60 R08: 1ffff11013d22898 R09: ffffed1013d22899
      R10: ffffed1013d22898 R11: ffff88809e9144c7 R12: ffff8880a905e138
      R13: ffff88809e9144c0 R14: 0000000000000000 R15: dffffc0000000000
       mutex_optimistic_spin kernel/locking/mutex.c:673 [inline]
       __mutex_lock_common kernel/locking/mutex.c:962 [inline]
       __mutex_lock+0x32b/0x13c0 kernel/locking/mutex.c:1106
       mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1121
       tty_lock+0xc7/0x130 drivers/tty/tty_mutex.c:19
       tty_release+0xb5/0xe90 drivers/tty/tty_io.c:1665
       __fput+0x2ff/0x890 fs/file_table.c:280
       ____fput+0x16/0x20 fs/file_table.c:313
       task_work_run+0x145/0x1c0 kernel/task_work.c:113
       exit_task_work include/linux/task_work.h:22 [inline]
       do_exit+0x8e7/0x2ef0 kernel/exit.c:797
       do_group_exit+0x135/0x360 kernel/exit.c:895
       __do_sys_exit_group kernel/exit.c:906 [inline]
       __se_sys_exit_group kernel/exit.c:904 [inline]
       __x64_sys_exit_group+0x44/0x50 kernel/exit.c:904
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x43fef8
      Code: Bad RIP value.
      RSP: 002b:00007ffdb07d2338 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000043fef8
      RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
      RBP: 00000000004bf730 R08: 00000000000000e7 R09: ffffffffffffffd0
      R10: 00000000004002c8 R11: 0000000000000246 R12: 0000000000000001
      R13: 00000000006d1180 R14: 0000000000000000 R15: 0000000000000000
      
      Fixes: 6e4e2f81 ("6pack,mkiss: fix lock inconsistency")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
    • tcp/dccp: fix possible race __inet_lookup_established() · 8dbd76e7
      Eric Dumazet authored
      Michal Kubecek and Firo Yang did a very nice analysis of crashes
      happening in __inet_lookup_established().
      
      Since a TCP socket can go from TCP_ESTABLISH to TCP_LISTEN
      (via a close()/socket()/listen() cycle) without a RCU grace period,
      I should not have changed listeners linkage in their hash table.
      
      They must use the nulls protocol (Documentation/RCU/rculist_nulls.txt),
      so that a lookup can detect a socket in a hash list was moved in
      another one.
      
      Since we added code in commit d296ba60 ("soreuseport: Resolve
      merge conflict for v4/v6 ordering fix"), we have to add
      hlist_nulls_add_tail_rcu() helper.
      
      Fixes: 3b24d854 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Michal Kubecek <mkubecek@suse.cz>
      Reported-by: Firo Yang <firo.yang@suse.com>
      Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
      Link: https://lore.kernel.org/netdev/20191120083919.GH27852@unicorn.suse.cz/
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
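The idea behind the nulls protocol mentioned in this commit can be shown with a single-threaded model: each chain ends in a marker that records which bucket it belongs to, and a lookup that reaches a marker for a different bucket knows it was carried onto another chain and must restart. Everything below (`struct nnode`, `walk_chain()`, the result enum) is a hypothetical illustration, not the kernel's `hlist_nulls` API, and it ignores RCU and memory ordering entirely.

```c
/* Hypothetical chain element: either a normal node or the end marker. */
struct nnode {
    int is_end;               /* 1 for the end-of-chain marker */
    unsigned int value;       /* key for a normal node, bucket id for a marker */
    const struct nnode *next; /* unused for a marker */
};

enum walk_result { WALK_FOUND, WALK_MISS, WALK_RESTART };

/* Walk a chain that was entered through 'bucket', looking for 'key'. */
static enum walk_result walk_chain(const struct nnode *head,
                                   unsigned int bucket, unsigned int key)
{
    const struct nnode *n;

    for (n = head; !n->is_end; n = n->next) {
        if (n->value == key)
            return WALK_FOUND;
    }
    /* The marker tells us which chain we actually finished on. */
    return n->value == bucket ? WALK_MISS : WALK_RESTART;
}
```

A miss is only trusted when the end marker matches the starting bucket; otherwise the caller repeats the lookup.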
    • net/ibmvnic: Fix typo in retry check · 8f9cc1ee
      Thomas Falcon authored
      This conditional is missing a bang, with the intent
      being to break when the retry count reaches zero.
      
      Fixes: 476d96ca ("ibmvnic: Bound waits for device queries")
      Suggested-by: Juliet Kim <julietk@linux.vnet.ibm.com>
      Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
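The effect of a missing `!` in a retry check can be reproduced with a small bounded-wait model. Both functions below are hypothetical sketches, not the ibmvnic code; `ready_after_polls` is an assumed stand-in for "the device answers on the Nth poll".

```c
/* Returns the number of polls needed, or -1 when the retry budget runs out. */
static int bounded_wait_fixed(int retries, int ready_after_polls)
{
    int retry = retries;
    int polls = 0;

    for (;;) {
        if (!retry--)
            return -1; /* budget exhausted: give up */
        polls++;
        if (polls >= ready_after_polls)
            return polls;
    }
}

/* Same loop with the '!' missing: it gives up on the first pass
 * whenever the budget is non-zero. */
static int bounded_wait_buggy(int retries, int ready_after_polls)
{
    int retry = retries;
    int polls = 0;

    for (;;) {
        if (retry--)
            return -1; /* wrongly taken while retries remain */
        polls++;
        if (polls >= ready_after_polls)
            return polls;
    }
}
```

With a budget of five retries and a device that answers on the third poll, the fixed loop succeeds while the buggy one reports a timeout immediately.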
    • ipv6/addrconf: only check invalid header values when NETLINK_F_STRICT_CHK is set · 2beb6d29
      Hangbin Liu authored
      In commit 4b1373de ("net: ipv6: addr: perform strict checks also for
      doit handlers") we add strict check for inet6_rtm_getaddr(). But we did
      the invalid header values check before checking if NETLINK_F_STRICT_CHK
      is set. This may break backwards compatibility if user already set the
      ifm->ifa_prefixlen, ifm->ifa_flags, ifm->ifa_scope in their netlink code.
      
      I didn't move the nlmsg_len check because I thought it's a valid check.
      Reported-by: Jianlin Shi <jishi@redhat.com>
      Fixes: 4b1373de ("net: ipv6: addr: perform strict checks also for doit handlers")
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
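The ordering this commit describes, with the length check applied always and the header-value check applied only in strict mode, can be sketched as follows. The struct, the function name and the `strict_check` parameter are hypothetical; they model the behaviour described above rather than reproduce inet6_rtm_getaddr().

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the three header fields named in the commit. */
struct ifaddr_hdr_model {
    unsigned char prefixlen;
    unsigned char flags;
    unsigned char scope;
};

/* Returns 0 when the request is accepted, -EINVAL otherwise. */
static int validate_getaddr_req(const struct ifaddr_hdr_model *ifm,
                                size_t payload_len, bool strict_check)
{
    /* The length check is kept in both modes. */
    if (payload_len < sizeof(*ifm))
        return -EINVAL;

    /* Legacy callers may have set these fields; only strict mode rejects them. */
    if (!strict_check)
        return 0;

    if (ifm->prefixlen || ifm->flags || ifm->scope)
        return -EINVAL;

    return 0;
}
```

A legacy request with a non-zero prefix length is accepted unless the caller opted in to strict checking.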
  2. 13 Dec, 2019 4 commits
  3. 12 Dec, 2019 3 commits
  4. 11 Dec, 2019 22 commits
  5. 10 Dec, 2019 1 commit
  6. 09 Dec, 2019 2 commits