1. 17 May, 2021 1 commit
    • tipc: skb_linearize the head skb when reassembling msgs · b7df21cf
      Xin Long authored
      It's not a good idea to append the frag skb to a skb's frag_list if
      the frag_list already has skbs from elsewhere, such as when this skb
      was created by pskb_copy(), where the frag_list was cloned (all the
      skbs in it were skb_get'ed) and is shared by multiple skbs.

      However, the newly appended frag skb should only be seen by the
      current skb. Otherwise, it will cause use-after-free crashes, as this
      appended frag skb is seen by multiple skbs but skb_get was called on
      it only once.
      
      The same thing happens with a skb updated by pskb_may_pull() with a
      skb_cloned skb. Li Shuang has reported quite a few crashes caused
      by this when doing testing over macvlan devices:
      
        [] kernel BUG at net/core/skbuff.c:1970!
        [] Call Trace:
        []  skb_clone+0x4d/0xb0
        []  macvlan_broadcast+0xd8/0x160 [macvlan]
        []  macvlan_process_broadcast+0x148/0x150 [macvlan]
        []  process_one_work+0x1a7/0x360
        []  worker_thread+0x30/0x390
      
        [] kernel BUG at mm/usercopy.c:102!
        [] Call Trace:
        []  __check_heap_object+0xd3/0x100
        []  __check_object_size+0xff/0x16b
        []  simple_copy_to_iter+0x1c/0x30
        []  __skb_datagram_iter+0x7d/0x310
        []  __skb_datagram_iter+0x2a5/0x310
        []  skb_copy_datagram_iter+0x3b/0x90
        []  tipc_recvmsg+0x14a/0x3a0 [tipc]
        []  ____sys_recvmsg+0x91/0x150
        []  ___sys_recvmsg+0x7b/0xc0
      
        [] kernel BUG at mm/slub.c:305!
        [] Call Trace:
        []  <IRQ>
        []  kmem_cache_free+0x3ff/0x400
        []  __netif_receive_skb_core+0x12c/0xc40
        []  ? kmem_cache_alloc+0x12e/0x270
        []  netif_receive_skb_internal+0x3d/0xb0
        []  ? get_rx_page_info+0x8e/0xa0 [be2net]
        []  be_poll+0x6ef/0xd00 [be2net]
        []  ? irq_exit+0x4f/0x100
        []  net_rx_action+0x149/0x3b0
      
        ...
      
      This patch fixes it by linearizing the head skb in tipc_buf_append()
      if it has frag_list set. Note that we choose to do this before calling
      skb_unshare(), as __skb_linearize() will avoid skb_copy(). Also, we
      cannot simply drop the frag_list as was done previously.
      
      Fixes: 45c8b7b1 ("tipc: allow non-linear first fragment buffer")
      Reported-by: Li Shuang <shuali@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 14 May, 2021 8 commits
    • net: cdc_eem: fix URL to CDC EEM 1.0 spec · b81ac784
      Jonathan Davies authored
      The old URL is no longer accessible.
      Signed-off-by: Jonathan Davies <jonathan.davies@nutanix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'lockless-qdisc-packet-stuck' · a0c5393d
      David S. Miller authored
      Yunsheng Lin says:
      
      ====================
      Fix packet stuck problem for lockless qdisc
      
      This patchset fixes the packet stuck problem mentioned in [1].
      
      Patch 1: Add STATE_MISSED flag to fix packet stuck problem.
      Patch 2: Fix a tx_action rescheduling problem after STATE_MISSED
               flag is added in patch 1.
      Patch 3: Fix the significantly higher CPU consumption problem when
               multiple threads are competing on a saturated outgoing
               device.
      
      V8: Change function name as suggested by Jakub and fix some typos
          in patch 3, adjust commit log in patch 2, and add Acked-by
          from Jakub.
      V7: Fix netif_tx_wake_queue() data race noted by Jakub.
      V6: Some performance optimization in patch 1 suggested by Jakub
          and drop NET_XMIT_DROP checking in patch 3.
      V5: add patch 3 to fix the problem reported by Michal Kubecek.
      V4: Change STATE_NEED_RESCHEDULE to STATE_MISSED and add patch 2.
      
      [1]. https://lkml.org/lkml/2019/10/9/42
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: fix tx action reschedule issue with stopped queue · dcad9ee9
      Yunsheng Lin authored
      The netdev queue might be stopped when the byte queue limit has
      been reached or the tx hw ring is full. net_tx_action() may still
      be rescheduled if STATE_MISSED is set, which consumes CPU
      unnecessarily without dequeuing and transmitting any skb because
      the netdev queue is stopped, see qdisc_run_end().

      This patch fixes it by checking the netdev queue state before
      calling qdisc_run() and clearing STATE_MISSED if the netdev queue
      is stopped during qdisc_run(). net_tx_action() is rescheduled
      again when the netdev queue is restarted, see netif_tx_wake_queue().

      As there is a time window between the netif_xmit_frozen_or_stopped()
      check and the clearing of STATE_MISSED, during which STATE_MISSED
      may be set by net_tx_action() scheduled by netif_tx_wake_queue(),
      set STATE_MISSED again if the netdev queue is restarted.
      
      Fixes: 6b3ba914 ("net: sched: allow qdiscs to handle locking")
      Reported-by: Michal Kubecek <mkubecek@suse.cz>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: fix tx action rescheduling issue during deactivation · 102b55ee
      Yunsheng Lin authored
      Currently qdisc_run() checks STATE_DEACTIVATED of a lockless
      qdisc before calling __qdisc_run(), which ultimately clears
      STATE_MISSED when all skbs are dequeued. If STATE_DEACTIVATED
      is set before STATE_MISSED is cleared, net_tx_action() may be
      rescheduled at the end of qdisc_run_end(), see below:
      
      CPU0(net_tx_action)  CPU1(__dev_xmit_skb)  CPU2(dev_deactivate)
                .                   .                     .
                .            set STATE_MISSED             .
                .           __netif_schedule()            .
                .                   .           set STATE_DEACTIVATED
                .                   .                qdisc_reset()
                .                   .                     .
                .<---------------   .              synchronize_net()
      clear __QDISC_STATE_SCHED  |  .                     .
                .                |  .                     .
                .                |  .            some_qdisc_is_busy()
                .                |  .               return *false*
                .                |  .                     .
        test STATE_DEACTIVATED   |  .                     .
      __qdisc_run() *not* called |  .                     .
                .                |  .                     .
         test STATE_MISSED       |  .                     .
       __netif_schedule()--------|  .                     .
                .                   .                     .
                .                   .                     .
      
      __qdisc_run() is not called by net_tx_action() on CPU0 because
      CPU2 has set the STATE_DEACTIVATED flag during dev_deactivate().
      As STATE_MISSED is only cleared in __qdisc_run(), __netif_schedule()
      is called at the end of qdisc_run_end(), causing the tx action
      rescheduling problem.

      qdisc_run() called by net_tx_action() runs in softirq context,
      which should have the same semantics as the qdisc_run() called by
      __dev_xmit_skb() under rcu_read_lock_bh(). As there is a
      synchronize_net() between STATE_DEACTIVATED being set and
      qdisc_reset()/some_qdisc_is_busy() in dev_deactivate(), we can safely
      bail out for a deactivated lockless qdisc in net_tx_action(), and
      qdisc_reset() will reset all skbs not dequeued yet.
      
      So add the rcu_read_lock() explicitly to protect the qdisc_run()
      and do the STATE_DEACTIVATED checking in net_tx_action() before
      calling qdisc_run_begin(). Another option is to do the checking in
      the qdisc_run_end(), but it will add unnecessary overhead for
      non-tx_action case, because __dev_queue_xmit() will not see qdisc
      with STATE_DEACTIVATED after synchronize_net(), the qdisc with
      STATE_DEACTIVATED can only be seen by net_tx_action() because of
      __netif_schedule().
      
      The STATE_DEACTIVATED check in qdisc_run() is there to avoid a race
      between net_tx_action() and qdisc_reset(), see:
      commit d518d2ed ("net/sched: fix race between deactivation
      and dequeue for NOLOCK qdisc"). As the bailout added above for
      a deactivated lockless qdisc in net_tx_action() provides better
      protection against the race without calling qdisc_run() at all,
      remove the STATE_DEACTIVATED check in qdisc_run().
      
      After qdisc_reset(), there is no skb in qdisc to be dequeued, so
      clear the STATE_MISSED in dev_reset_queue() too.
      
      Fixes: 6b3ba914 ("net: sched: allow qdiscs to handle locking")
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      V8: Clearing STATE_MISSED before calling __netif_schedule() has
          avoided the endless rescheduling problem, but there may still
          be an unnecessary rescheduling, so adjust the commit log.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: fix packet stuck problem for lockless qdisc · a90c57f2
      Yunsheng Lin authored
      Lockless qdisc has the following concurrency problem:
          cpu0                 cpu1
           .                     .
      q->enqueue                 .
           .                     .
      qdisc_run_begin()          .
           .                     .
      dequeue_skb()              .
           .                     .
      sch_direct_xmit()          .
           .                     .
           .                q->enqueue
           .             qdisc_run_begin()
           .            return and do nothing
           .                     .
      qdisc_run_end()            .
      
      cpu1 enqueues an skb without calling __qdisc_run() because cpu0
      has not released the lock yet and spin_trylock() returns false
      for cpu1 in qdisc_run_begin(). cpu0 does not see the skb
      enqueued by cpu1 when calling dequeue_skb() because cpu1 may
      enqueue the skb after cpu0 calls dequeue_skb() and before
      cpu0 calls qdisc_run_end().

      Lockless qdisc has another concurrency problem, shown below,
      when tx_action is involved:
      
      cpu0(serving tx_action)     cpu1             cpu2
                .                   .                .
                .              q->enqueue            .
                .            qdisc_run_begin()       .
                .              dequeue_skb()         .
                .                   .            q->enqueue
                .                   .                .
                .             sch_direct_xmit()      .
                .                   .         qdisc_run_begin()
                .                   .       return and do nothing
                .                   .                .
       clear __QDISC_STATE_SCHED    .                .
       qdisc_run_begin()            .                .
       return and do nothing        .                .
                .                   .                .
                .            qdisc_run_end()         .
      
      This patch fixes the above data race by:
      1. If the first spin_trylock() returns false and STATE_MISSED is
         not set, setting STATE_MISSED and retrying spin_trylock(), in
         case the other CPU did not see STATE_MISSED when it released
         the lock.
      2. Rescheduling if STATE_MISSED is set after the lock is released
         at the end of qdisc_run_end().
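
The two steps above can be sketched in userspace with C11 atomics. This is only an illustration under stated assumptions: `toy_qdisc`, `toy_run_begin()` and `toy_run_end()` are hypothetical names, an `atomic_flag` stands in for the qdisc lock, and a counter stands in for `__netif_schedule()`. It is not the kernel implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of a lockless qdisc's run state. */
struct toy_qdisc {
    atomic_flag running;     /* stands in for the lock taken by spin_trylock() */
    atomic_bool missed;      /* stands in for STATE_MISSED */
    atomic_int reschedules;  /* counts stand-in __netif_schedule() calls */
};

static void toy_init(struct toy_qdisc *q)
{
    atomic_flag_clear(&q->running);
    atomic_init(&q->missed, false);
    atomic_init(&q->reschedules, 0);
}

/* Step 1: on a failed trylock, flag the miss and retry once. */
static bool toy_run_begin(struct toy_qdisc *q)
{
    if (!atomic_flag_test_and_set(&q->running))
        return true;                 /* first trylock succeeded */
    if (atomic_load(&q->missed))
        return false;                /* a miss is already recorded */
    atomic_store(&q->missed, true);
    /* The owner may have released the lock before seeing the flag. */
    return !atomic_flag_test_and_set(&q->running);
}

/* Step 2: after unlocking, reschedule if a miss was recorded. */
static void toy_run_end(struct toy_qdisc *q)
{
    atomic_flag_clear(&q->running);
    if (atomic_exchange(&q->missed, false))
        atomic_fetch_add(&q->reschedules, 1);
}
```

In this model a second caller that loses the race leaves a marker behind, so the lock owner triggers one more pass instead of leaving the enqueued packet stranded.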
      
      For the tx_action case, STATE_MISSED is also set when cpu1 is at
      the end of qdisc_run_end(), so tx_action will be rescheduled again
      to dequeue the skb enqueued by cpu2.
      
      Clear STATE_MISSED before retrying a dequeuing when dequeuing
      returns NULL in order to reduce the overhead of the second
      spin_trylock() and __netif_schedule() calling.
      
      Also clear the STATE_MISSED before calling __netif_schedule()
      at the end of qdisc_run_end() to avoid doing another round of
      dequeuing in the pfifo_fast_dequeue().
      
      The performance impact of this patch, tested using pktgen and
      dummy netdev with pfifo_fast qdisc attached:
      
       threads  without this patch   with this patch      delta
          1        2.61Mpps            2.60Mpps           -0.3%
          2        3.97Mpps            3.82Mpps           -3.7%
          4        5.62Mpps            5.59Mpps           -0.5%
          8        2.78Mpps            2.77Mpps           -0.3%
         16        2.22Mpps            2.22Mpps           -0.0%
      
      Fixes: 6b3ba914 ("net: sched: allow qdiscs to handle locking")
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Tested-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls splice: check SPLICE_F_NONBLOCK instead of MSG_DONTWAIT · 974271e5
      Jim Ma authored
      In tls_sw_splice_read(), checking MSG_* flags is inappropriate;
      SPLICE_* flags should be used instead. Update tls_wait_data() to
      accept a nonblock argument instead of flags, for both recvmsg and
      splice.
      
      Fixes: c46234eb ("tls: RX path for ktls")
      Signed-off-by: Jim Ma <majinjing3@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "net:tipc: Fix a double free in tipc_sk_mcast_rcv" · 75016891
      Hoang Le authored
      This reverts commit 6bf24dc0.
      The above fix is not correct and caused a memory leak.
      
      Fixes: 6bf24dc0 ("net:tipc: Fix a double free in tipc_sk_mcast_rcv")
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Acked-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
      Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · 414ed7fe
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for net:
      
      1) Remove the flowtable hardware refresh state, fall back to the
         existing hardware pending state instead, from Roi Dayan.
      
      2) Fix crash in pipapo avx2 lookup when FPU is in use from user
         context, from Stefano Brivio.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 13 May, 2021 7 commits
    • netfilter: nft_set_pipapo_avx2: Add irq_fpu_usable() check, fallback to non-AVX2 version · f0b3d338
      Stefano Brivio authored
      Arturo reported this backtrace:
      
      [709732.358791] WARNING: CPU: 3 PID: 456 at arch/x86/kernel/fpu/core.c:128 kernel_fpu_begin_mask+0xae/0xe0
      [709732.358793] Modules linked in: binfmt_misc nft_nat nft_chain_nat nf_nat nft_counter nft_ct nf_tables nf_conntrack_netlink nfnetlink 8021q garp stp mrp llc vrf intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp crc32_pclmul mgag200 ghash_clmulni_intel drm_kms_helper cec aesni_intel drm libaes crypto_simd cryptd glue_helper mei_me dell_smbios iTCO_wdt evdev intel_pmc_bxt iTCO_vendor_support dcdbas pcspkr rapl dell_wmi_descriptor wmi_bmof sg i2c_algo_bit watchdog mei acpi_ipmi ipmi_si button nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipmi_devintf ipmi_msghandler ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 dm_mod raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor sd_mod t10_pi crc_t10dif crct10dif_generic raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ahci libahci tg3 libata xhci_pci libphy xhci_hcd ptp usbcore crct10dif_pclmul crct10dif_common bnxt_en crc32c_intel scsi_mod
      [709732.358941]  pps_core i2c_i801 lpc_ich i2c_smbus wmi usb_common
      [709732.358957] CPU: 3 PID: 456 Comm: jbd2/dm-0-8 Not tainted 5.10.0-0.bpo.5-amd64 #1 Debian 5.10.24-1~bpo10+1
      [709732.358959] Hardware name: Dell Inc. PowerEdge R440/04JN2K, BIOS 2.9.3 09/23/2020
      [709732.358964] RIP: 0010:kernel_fpu_begin_mask+0xae/0xe0
      [709732.358969] Code: ae 54 24 04 83 e3 01 75 38 48 8b 44 24 08 65 48 33 04 25 28 00 00 00 75 33 48 83 c4 10 5b c3 65 8a 05 5e 21 5e 76 84 c0 74 92 <0f> 0b eb 8e f0 80 4f 01 40 48 81 c7 00 14 00 00 e8 dd fb ff ff eb
      [709732.358972] RSP: 0018:ffffbb9700304740 EFLAGS: 00010202
      [709732.358976] RAX: 0000000000000001 RBX: 0000000000000003 RCX: 0000000000000001
      [709732.358979] RDX: ffffbb9700304970 RSI: ffff922fe1952e00 RDI: 0000000000000003
      [709732.358981] RBP: ffffbb9700304970 R08: ffff922fc868a600 R09: ffff922fc711e462
      [709732.358984] R10: 000000000000005f R11: ffff922ff0b27180 R12: ffffbb9700304960
      [709732.358987] R13: ffffbb9700304b08 R14: ffff922fc664b6c8 R15: ffff922fc664b660
      [709732.358990] FS:  0000000000000000(0000) GS:ffff92371fec0000(0000) knlGS:0000000000000000
      [709732.358993] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [709732.358996] CR2: 0000557a6655bdd0 CR3: 000000026020a001 CR4: 00000000007706e0
      [709732.358999] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [709732.359001] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [709732.359003] PKRU: 55555554
      [709732.359005] Call Trace:
      [709732.359009]  <IRQ>
      [709732.359035]  nft_pipapo_avx2_lookup+0x4c/0x1cba [nf_tables]
      [709732.359046]  ? sched_clock+0x5/0x10
      [709732.359054]  ? sched_clock_cpu+0xc/0xb0
      [709732.359061]  ? record_times+0x16/0x80
      [709732.359068]  ? plist_add+0xc1/0x100
      [709732.359073]  ? psi_group_change+0x47/0x230
      [709732.359079]  ? skb_clone+0x4d/0xb0
      [709732.359085]  ? enqueue_task_rt+0x22b/0x310
      [709732.359098]  ? bnxt_start_xmit+0x1e8/0xaf0 [bnxt_en]
      [709732.359102]  ? packet_rcv+0x40/0x4a0
      [709732.359121]  nft_lookup_eval+0x59/0x160 [nf_tables]
      [709732.359133]  nft_do_chain+0x350/0x500 [nf_tables]
      [709732.359152]  ? nft_lookup_eval+0x59/0x160 [nf_tables]
      [709732.359163]  ? nft_do_chain+0x364/0x500 [nf_tables]
      [709732.359172]  ? fib4_rule_action+0x6d/0x80
      [709732.359178]  ? fib_rules_lookup+0x107/0x250
      [709732.359184]  nft_nat_do_chain+0x8a/0xf2 [nft_chain_nat]
      [709732.359193]  nf_nat_inet_fn+0xea/0x210 [nf_nat]
      [709732.359202]  nf_nat_ipv4_out+0x14/0xa0 [nf_nat]
      [709732.359207]  nf_hook_slow+0x44/0xc0
      [709732.359214]  ip_output+0xd2/0x100
      [709732.359221]  ? __ip_finish_output+0x210/0x210
      [709732.359226]  ip_forward+0x37d/0x4a0
      [709732.359232]  ? ip4_key_hashfn+0xb0/0xb0
      [709732.359238]  ip_sublist_rcv_finish+0x4f/0x60
      [709732.359243]  ip_sublist_rcv+0x196/0x220
      [709732.359250]  ? ip_rcv_finish_core.isra.22+0x400/0x400
      [709732.359255]  ip_list_rcv+0x137/0x160
      [709732.359264]  __netif_receive_skb_list_core+0x29b/0x2c0
      [709732.359272]  netif_receive_skb_list_internal+0x1a6/0x2d0
      [709732.359280]  gro_normal_list.part.156+0x19/0x40
      [709732.359286]  napi_complete_done+0x67/0x170
      [709732.359298]  bnxt_poll+0x105/0x190 [bnxt_en]
      [709732.359304]  ? irqentry_exit+0x29/0x30
      [709732.359309]  ? asm_common_interrupt+0x1e/0x40
      [709732.359315]  net_rx_action+0x144/0x3c0
      [709732.359322]  __do_softirq+0xd5/0x29c
      [709732.359329]  asm_call_irq_on_stack+0xf/0x20
      [709732.359332]  </IRQ>
      [709732.359339]  do_softirq_own_stack+0x37/0x40
      [709732.359346]  irq_exit_rcu+0x9d/0xa0
      [709732.359353]  common_interrupt+0x78/0x130
      [709732.359358]  asm_common_interrupt+0x1e/0x40
      [709732.359366] RIP: 0010:crc_41+0x0/0x1e [crc32c_intel]
      [709732.359370] Code: ff ff f2 4d 0f 38 f1 93 a8 fe ff ff f2 4c 0f 38 f1 81 b0 fe ff ff f2 4c 0f 38 f1 8a b0 fe ff ff f2 4d 0f 38 f1 93 b0 fe ff ff <f2> 4c 0f 38 f1 81 b8 fe ff ff f2 4c 0f 38 f1 8a b8 fe ff ff f2 4d
      [709732.359373] RSP: 0018:ffffbb97008dfcd0 EFLAGS: 00000246
      [709732.359377] RAX: 000000000000002a RBX: 0000000000000400 RCX: ffff922fc591dd50
      [709732.359379] RDX: ffff922fc591dea0 RSI: 0000000000000a14 RDI: ffffffffc00dddc0
      [709732.359382] RBP: 0000000000001000 R08: 000000000342d8c3 R09: 0000000000000000
      [709732.359384] R10: 0000000000000000 R11: ffff922fc591dff0 R12: ffffbb97008dfe58
      [709732.359386] R13: 000000000000000a R14: ffff922fd2b91e80 R15: ffff922fef83fe38
      [709732.359395]  ? crc_43+0x1e/0x1e [crc32c_intel]
      [709732.359403]  ? crc32c_pcl_intel_update+0x97/0xb0 [crc32c_intel]
      [709732.359419]  ? jbd2_journal_commit_transaction+0xaec/0x1a30 [jbd2]
      [709732.359425]  ? irq_exit_rcu+0x3e/0xa0
      [709732.359447]  ? kjournald2+0xbd/0x270 [jbd2]
      [709732.359454]  ? finish_wait+0x80/0x80
      [709732.359470]  ? commit_timeout+0x10/0x10 [jbd2]
      [709732.359476]  ? kthread+0x116/0x130
      [709732.359481]  ? kthread_park+0x80/0x80
      [709732.359488]  ? ret_from_fork+0x1f/0x30
      [709732.359494] ---[ end trace 081a19978e5f09f5 ]---
      
      that is, nft_pipapo_avx2_lookup() uses the FPU running from a softirq
      that interrupted a kthread, also using the FPU.
      
      That's exactly the reason why irq_fpu_usable() is there: use it, and
      if we can't use the FPU, fall back to the non-AVX2 version of the
      lookup operation, i.e. nft_pipapo_lookup().
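
The fallback pattern described above can be sketched in plain C. Everything here is a hypothetical stand-in: `toy_fpu_usable` plays the role of the `irq_fpu_usable()` result, and the two `toy_lookup_*()` functions replace the AVX2 and generic lookups. The sketch shows only the dispatch decision, not the set lookup itself.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the result of irq_fpu_usable(). */
static bool toy_fpu_usable = true;

/* Call counters, used only to observe which path was taken. */
static int toy_generic_calls;
static int toy_simd_calls;

/* Stand-in for the generic (non-AVX2) lookup. */
static bool toy_lookup_generic(unsigned int key)
{
    toy_generic_calls++;
    return key % 7 == 0;
}

/* Stand-in for the vectorised lookup; must return the same answer. */
static bool toy_lookup_simd(unsigned int key)
{
    toy_simd_calls++;
    return key % 7 == 0;
}

/* Use the fast path only when the FPU may be used in this context. */
static bool toy_lookup(unsigned int key)
{
    if (!toy_fpu_usable)
        return toy_lookup_generic(key);
    return toy_lookup_simd(key);
}
```

The point of the design is that both paths return the same result, so falling back costs only performance and never correctness.
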
      Reported-by: Arturo Borrero Gonzalez <arturo@netfilter.org>
      Cc: <stable@vger.kernel.org> # 5.6.x
      Fixes: 7400b063 ("nft_set_pipapo: Introduce AVX2-based lookup implementation")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: flowtable: Remove redundant hw refresh bit · c07531c0
      Roi Dayan authored
      Offloading connections can fail for multiple reasons, in which case
      a hw refresh bit is set so that the next sw packet retries the offload.
      In some cases, however, the hw refresh bit is not set even though a
      later refresh could succeed.
      Remove the hw refresh bit and do the offload refresh if requested.
      No new work entry is created if one is already pending, since the
      hw pending bit covers that case.
      
      Fixes: 8b3646d6 ("net/sched: act_ct: Support refreshing the flow table entries")
      Signed-off-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • openvswitch: meter: fix race when getting now_ms. · e4df1b0c
      Tao Liu authored
      We have observed meters behaving unexpectedly when traffic is
      3+ Gbit/s with multiple connections.

      now_ms is not protected by meter->lock, so we may get a negative
      long_delta_ms when another CPU has updated meter->used. Then:
          delta_ms = (u32)long_delta_ms;
      which will be a large value.
      
          band->bucket += delta_ms * band->rate;
      then we get a wrong band->bucket.
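
The wrap-around can be reproduced outside the kernel. In the sketch below, `elapsed_ms_unsafe()` and `elapsed_ms_clamped()` are hypothetical helper names, and clamping a negative delta to zero is shown as one possible guard, assumed here for illustration rather than quoted from the patch.

```c
#include <stdint.h>

/* Mirrors the problematic cast: a negative delta wraps to a huge u32. */
static uint32_t elapsed_ms_unsafe(int64_t now_ms, int64_t used_ms)
{
    int64_t long_delta_ms = now_ms - used_ms;

    return (uint32_t)long_delta_ms;
}

/* One possible guard: treat a negative delta as "no time elapsed". */
static uint32_t elapsed_ms_clamped(int64_t now_ms, int64_t used_ms)
{
    int64_t long_delta_ms = now_ms - used_ms;

    if (long_delta_ms < 0)
        long_delta_ms = 0;
    return (uint32_t)long_delta_ms;
}
```

With a timestamp that is 5 ms behind the stored one, the unsafe version yields 4294967291, which multiplied by the band rate would add an enormous amount to the bucket.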
      
      The Open vSwitch userspace datapath fixed the same issue[1] some
      time ago, and we port that implementation to the kernel datapath.
      
      [1] https://patchwork.ozlabs.org/project/openvswitch/patch/20191025114436.9746-1-i.maximets@ovn.org/
      
      Fixes: 96fbc13d ("openvswitch: Add meter infrastructure")
      Signed-off-by: Tao Liu <thomas.liu@ucloud.cn>
      Suggested-by: Ilya Maximets <i.maximets@ovn.org>
      Reviewed-by: Ilya Maximets <i.maximets@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: korina: Fix return value check in korina_probe() · c7d83024
      Wei Yongjun authored
      In case of error, the function devm_platform_ioremap_resource_byname()
      returns ERR_PTR() and never returns NULL. The NULL test in the return
      value check should be replaced with IS_ERR().
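
Why a NULL test misses the failure can be shown with a userspace imitation of the error-pointer convention. The helpers below (`toy_err_ptr()`, `toy_is_err()`, `toy_ptr_err()`, `toy_ioremap()`) are hypothetical re-creations written for this sketch, and the two probe functions are simplified stand-ins, not the driver code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer, as ERR_PTR() does. */
static void *toy_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* True if the pointer falls in the top TOY_MAX_ERRNO addresses. */
static int toy_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)(-(long)TOY_MAX_ERRNO);
}

static long toy_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* Hypothetical mapper: returns an error pointer, never NULL. */
static void *toy_ioremap(int fail)
{
    static int fake_register;

    return fail ? toy_err_ptr(-ENOMEM) : (void *)&fake_register;
}

/* Buggy pattern: the NULL test never fires, so the error is missed. */
static int toy_probe_null_check(int fail)
{
    void *regs = toy_ioremap(fail);

    if (!regs)
        return -ENOMEM;
    return 0;
}

/* Fixed pattern: detect and propagate the encoded error. */
static int toy_probe_is_err_check(int fail)
{
    void *regs = toy_ioremap(fail);

    if (toy_is_err(regs))
        return (int)toy_ptr_err(regs);
    return 0;
}
```

In the buggy variant the probe reports success and would go on to dereference an invalid pointer.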
      
      Fixes: b4cd249a ("net: korina: Use devres functions")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4/ch_ktls: Clear resources when pf4 device is removed · 65e302a9
      Ayush Sawal authored
      This patch maintains a list of active tids and clears all active
      connection resources when a DETACH notification arrives.
      
      Fixes: a8c16e8e ("crypto/chcr: move nic TLS functionality to drivers/net")
      Signed-off-by: Ayush Sawal <ayush.sawal@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mdio: octeon: Fix some double free issues · e1d027dd
      Christophe JAILLET authored
      'bus->mii_bus' has been allocated with 'devm_mdiobus_alloc_size()' in the
      probe function. So it must not be freed explicitly or there will be a
      double free.
      
      Remove the incorrect 'mdiobus_free' in the error handling path of the
      probe function and in the remove function.

      Suggested-by: Andrew Lunn <andrew@lunn.ch>
      Fixes: 35d2aeac ("phy: mdio-octeon: Use devm_mdiobus_alloc_size()")
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Reviewed-by: Russell King <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mdio: thunder: Fix a double free issue in the .remove function · a93a0a15
      Christophe JAILLET authored
      'bus->mii_bus' has been allocated with 'devm_mdiobus_alloc_size()' in the
      probe function. So it must not be freed explicitly or there will be a
      double free.
      
      Remove the incorrect 'mdiobus_free' in the remove function.
      
      Fixes: 379d7ac7 ("phy: mdio-thunder: Add driver for Cavium Thunder SoC MDIO buses.")
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Reviewed-by: Russell King <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 12 May, 2021 14 commits
  5. 11 May, 2021 10 commits
    • net: ipa: memory region array is variable size · 440c3247
      Alex Elder authored
      IPA configuration data includes an array of memory region
      descriptors.  That was a fixed-size array at one time, but
      at some point we started defining it such that it was only
      as big as required for a given platform.  The actual number
      of entries in the array is recorded in the configuration data
      along with the array.
      
      A loop in ipa_mem_config() still assumes the array has entries
      for all defined memory region IDs.  As a result, this loop can
      go past the end of the actual array and attempt to write
      "canary" values based on nonsensical data.
      
      Fix this by stashing the number of entries in the array, and
      using that rather than IPA_MEM_COUNT in the initialization loop
      found in ipa_mem_config().
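
The idea of bounding the loop by the recorded entry count can be sketched as follows. The structure and function names (`toy_mem_region`, `toy_mem_config`, `toy_count_canaries()`, `TOY_MEM_COUNT`) are hypothetical and chosen for this illustration; the real IPA structures differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical number of all defined memory region IDs. */
enum { TOY_MEM_COUNT = 8 };

struct toy_mem_region {
    uint32_t offset;
    uint16_t canary_count;
};

struct toy_mem_config {
    size_t mem_count;                  /* entries actually present */
    const struct toy_mem_region *mem;  /* platform-specific array */
};

/* Validation only: the array may be shorter than TOY_MEM_COUNT. */
static int toy_mem_config_valid(const struct toy_mem_config *cfg)
{
    return cfg->mem_count <= TOY_MEM_COUNT;
}

/* Walk only the recorded entries, never up to TOY_MEM_COUNT. */
static size_t toy_count_canaries(const struct toy_mem_config *cfg)
{
    size_t total = 0;

    for (size_t i = 0; i < cfg->mem_count; i++)
        total += cfg->mem[i].canary_count;
    return total;
}
```

A loop bounded by `TOY_MEM_COUNT` would read five entries past a three-entry array; bounding it by `mem_count` keeps every access inside the array.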
      
      The only remaining use of IPA_MEM_COUNT is in a validation check
      to ensure configuration data doesn't have too many entries.
      That's fine for now.
      
      Fixes: 3128aae8 ("net: ipa: redefine struct ipa_mem_data")
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ionic: fix ptp support config breakage · bcbda3fc
      Shannon Nelson authored
      When IONIC=y and PTP_1588_CLOCK=m were set in the .config file,
      the driver link failed with undefined references.

      We add the dependency
      	depends on PTP_1588_CLOCK || !PTP_1588_CLOCK
      to clear this up.
      
      If PTP_1588_CLOCK=m, the depends limits IONIC to =m (or disabled).
      If PTP_1588_CLOCK is disabled, IONIC can be any of y/m/n.
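
The effect of that expression follows from Kconfig's tristate rules, where n, m and y behave as 0, 1 and 2, `||` takes the maximum and `!` computes 2 minus the value. The small model below uses hypothetical names (`toy_tri_not()`, `toy_tri_or()`, `toy_ionic_limit()`) and evaluates only the upper bound the dependency places on IONIC.

```c
/* Kconfig tristate values: n < m < y. */
enum toy_tristate { TOY_N = 0, TOY_M = 1, TOY_Y = 2 };

/* Kconfig "!": 2 minus the value, so !m is still m. */
static enum toy_tristate toy_tri_not(enum toy_tristate a)
{
    return (enum toy_tristate)(TOY_Y - a);
}

/* Kconfig "||": the maximum of both operands. */
static enum toy_tristate toy_tri_or(enum toy_tristate a, enum toy_tristate b)
{
    return a > b ? a : b;
}

/* Upper bound on IONIC from "PTP_1588_CLOCK || !PTP_1588_CLOCK". */
static enum toy_tristate toy_ionic_limit(enum toy_tristate ptp)
{
    return toy_tri_or(ptp, toy_tri_not(ptp));
}
```

Only the PTP_1588_CLOCK=m case evaluates to m, which is what prevents the built-in driver from linking against a modular PTP core.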
      
      Fixes: 61db421d ("ionic: link in the new hw timestamp code")
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Allen Hubbe <allenbh@pensando.io>
      Signed-off-by: Shannon Nelson <snelson@pensando.io>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: fix data stream corruption · 29249eac
      Paolo Abeni authored
      Maxim reported several issues when forcing a TCP transparent proxy
      to use the MPTCP protocol for the inbound connections. He also
      provided a clean reproducer.
      
      The problem boils down to 'mptcp_frag_can_collapse_to()' assuming
      that only MPTCP will use the given page_frag.
      
      If others - e.g. the plain TCP protocol - allocate page fragments,
      we can end up re-using already allocated memory for mptcp_data_frag.

      Fix the issue by ensuring that the to-be-expanded data fragment is
      located at the current page frag end.
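
The added condition can be pictured with simplified structures. In this sketch `toy_page_frag`, `toy_data_frag` and `toy_can_collapse()` are hypothetical stand-ins, and the sequence-number checks that the real helper would also need are deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified allocator state: next free byte is at page + offset. */
struct toy_page_frag {
    const void *page;
    size_t offset;
    size_t size;
};

/* Simplified queued data fragment living inside a page. */
struct toy_data_frag {
    const void *page;
    size_t offset;
    size_t data_len;
};

/*
 * The fragment may grow in place only if it ends exactly where the
 * allocator will hand out memory next. If another user allocated from
 * the same page in the meantime, pfrag->offset has moved on and the
 * check fails.
 */
static bool toy_can_collapse(const struct toy_page_frag *pfrag,
                             const struct toy_data_frag *df)
{
    return df != NULL &&
           df->page == pfrag->page &&
           pfrag->offset < pfrag->size &&
           df->offset + df->data_len == pfrag->offset;
}
```

Comparing the page alone is not enough, because two users can share a page while owning different byte ranges within it.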
      
      v1 -> v2:
       - added missing fixes tag (Mat)
      
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/178
      Reported-and-tested-by: Maxim Galaganov <max@internet.ru>
      Fixes: 18b683bf ("mptcp: queue data for mptcp level retransmission")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · df6f8237
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2021-05-11
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 13 non-merge commits during the last 8 day(s) which contain
      a total of 21 files changed, 817 insertions(+), 382 deletions(-).
      
      The main changes are:
      
      1) Fix multiple ringbuf bugs in particular to prevent writable mmap of
         read-only pages, from Andrii Nakryiko & Thadeu Lima de Souza Cascardo.
      
      2) Fix verifier alu32 known-const subregister bound tracking for bitwise
         operations and/or/xor, from Daniel Borkmann.
      
      3) Reject trampoline attachment for functions with variable arguments,
         and also add a deny list of other forbidden functions, from Jiri Olsa.
      
      4) Fix nested bpf_bprintf_prepare() calls used by various helpers by
         switching to per-CPU buffers, from Florent Revest.
      
      5) Fix kernel compilation with BTF debug info on ppc64 due to pahole
         missing TCP-CC functions like cubictcp_init, from Martin KaFai Lau.
      
      6) Add a kconfig entry to provide an option to disallow unprivileged
         BPF by default, from Daniel Borkmann.
      
      7) Fix libbpf compilation for older libelf when GELF_ST_VISIBILITY()
         macro is not available, from Arnaldo Carvalho de Melo.
      
      8) Migrate test_tc_redirect to test_progs framework as prep work
         for upcoming skb_change_head() fix & selftest, from Jussi Maki.
      
      9) Fix a libbpf segfault in add_dummy_ksym_var() if BTF is not
         present, from Ian Rogers.
      
      10) Fix tx_only micro-benchmark in xdpsock BPF sample with proper frame
          size, from Magnus Karlsson.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      df6f8237
    • David S. Miller's avatar
      Merge tag 'mac80211-for-net-2021-05-11' of... · 9fe37a80
      David S. Miller authored
      Merge tag 'mac80211-for-net-2021-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
      
      Johannes Berg says:
      
      ====================
      pull-request: mac80211 2021-05-11
      
      So exciting times, for the first pull request for fixes I
      have a bunch of security things that have been under embargo
      for a while - see more details in the tag below, and at the
      patch posting message I linked to.
      
      I organized with Kalle to just have a single set of fixes
      for mac80211 and ath10k/ath11k, we don't know about any of
      the other vendors (the mac80211 + already released firmware
      is sufficient to fix iwlwifi.)
      
      Please pull and let me know if there's any problem.
      
      Several security issues in the 802.11 implementations were found by
      Mathy Vanhoef (New York University Abu Dhabi), and this contains the
      fixes developed for mac80211 and specifically Qualcomm drivers, I'm
      sending this together (as agreed with Kalle) to have just a single
      set of patches for now. We don't know about other vendors though.
      
      More details in the patch posting:
      https://lore.kernel.org/r/20210511180259.159598-1-johannes@sipsolutions.net
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9fe37a80
    • Joakim Zhang's avatar
      net: stmmac: Fix MAC WoL not working if PHY does not support WoL · 576f9eac
      Joakim Zhang authored
      Both the get and set WoL paths check device_can_wakeup(). If the MAC
      supports PMT, the driver sets the device wakeup capability. After commit
      1d8e5b0f ("net: stmmac: Support WOL with phy"), the device wakeup
      capability is overwritten in stmmac_init_phy() according to the PHY's WoL
      feature. If the PHY doesn't support WoL, the MAC loses its wakeup
      capability. To fix this issue, only overwrite the device wakeup capability
      when the MAC doesn't support PMT.
      
      With this change, the driver checks the MAC's WoL capability if the MAC
      supports PMT; otherwise it checks the PHY's WoL capability.
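
The decision described above can be modelled as a small truth table. The sketch below is a userspace illustration under stated assumptions: the struct and both function names are hypothetical, and the real driver works through the device wakeup API rather than returning booleans.

```c
#include <stdbool.h>

/* Hypothetical summary of what the hardware supports. */
struct wol_caps_sketch {
    bool mac_has_pmt;   /* MAC can wake the system by itself (PMT) */
    bool phy_has_wol;   /* PHY reports some WoL support */
};

/* Behaviour described as the bug: the PHY's answer always wins. */
static bool wakeup_capable_before_fix(const struct wol_caps_sketch *c)
{
    return c->phy_has_wol;
}

/* Behaviour described as the fix: the PHY is consulted only without PMT. */
static bool wakeup_capable_after_fix(const struct wol_caps_sketch *c)
{
    if (c->mac_has_pmt)
        return true;            /* keep the capability set for the MAC */
    return c->phy_has_wol;      /* fall back to the PHY's capability */
}
```

The two functions differ only for a PMT-capable MAC paired with a PHY that has no WoL support, which is the case the commit message describes.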
      
      Fixes: 1d8e5b0f ("net: stmmac: Support WOL with phy")
      Reviewed-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
      Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      576f9eac
    • Martin KaFai Lau's avatar
      bpf: Limit static tcp-cc functions in the .BTF_ids list to x86 · 569c484f
      Martin KaFai Lau authored
      During the discussion in [0], it was pointed out that static functions
      on ppc64 are prefixed with ".". For example, 'readelf -s vmlinux.ppc' shows:
      
        89326: c000000001383280    24 NOTYPE  LOCAL  DEFAULT   31 cubictcp_init
        89327: c000000000c97c50   168 FUNC    LOCAL  DEFAULT    2 .cubictcp_init
      
      The one with FUNC type is ".cubictcp_init" instead of "cubictcp_init".
      The "." seems to be done by arch/powerpc/include/asm/ppc_asm.h.
      
      As a result, pahole cannot generate the BTF for these tcp-cc kernel
      functions, because pahole only captures symbols of FUNC type and
      "cubictcp_init" is not one. Kernel compilation on ppc64 then failed.
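
The lookup failure can be reproduced with a toy symbol table. This is a minimal sketch of the behaviour described above, not pahole's actual code: the types, the table, and find_func() are all hypothetical, and the table holds only the two readelf entries quoted earlier.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced view of an ELF symbol: name and type only. */
enum sym_type_sketch { SYM_NOTYPE, SYM_FUNC };

struct sym_sketch {
    const char *name;
    enum sym_type_sketch type;
};

/* The two ppc64 entries from the readelf output above. */
static const struct sym_sketch ppc64_syms[] = {
    { "cubictcp_init",  SYM_NOTYPE },
    { ".cubictcp_init", SYM_FUNC   },
};

/* Return the FUNC symbol with exactly this name, or NULL if none exists. */
static const struct sym_sketch *find_func(const struct sym_sketch *tab,
                                          size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (tab[i].type == SYM_FUNC && strcmp(tab[i].name, name) == 0)
            return &tab[i];
    }
    return NULL;
}
```

Looking up "cubictcp_init" finds nothing, because the only entry with that exact name is NOTYPE, while ".cubictcp_init" is found as a FUNC.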
      
      This behavior is only reported in ppc64 so far. I tried arm64, s390,
      and sparc64 and did not observe this "." prefix and NOTYPE behavior.
      
      Since the kfunc call is only supported in the x86_64 and x86_32 JITs,
      this patch limits those tcp-cc functions to x86 only, to avoid unnecessary
      compilation issues on other architectures. In the future, we can examine
      whether it is better to change all those functions from static to extern.
      
        [0] https://lore.kernel.org/bpf/4e051459-8532-7b61-c815-f3435767f8a0@kernel.org/
      
      Fixes: e78aea8b ("bpf: tcp: Put some tcp cong functions in allowlist for bpf-tcp-cc")
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Michal Suchánek <msuchanek@suse.de>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Link: https://lore.kernel.org/bpf/20210508005011.3863757-1-kafai@fb.com
      569c484f
    • Jussi Maki's avatar
      selftests/bpf: Rewrite test_tc_redirect.sh as prog_tests/tc_redirect.c · 096eccde
      Jussi Maki authored
      As discussed in [0], this ports test_tc_redirect.sh to the test_progs
      framework and removes the old test.
      
      This brings it more in line with the rest of the tests and makes it
      possible to run this test case with vmtest.sh and under the bpf CI.
      
      The upcoming skb_change_head() helper fix in [0] depends on it and
      extends the test case to redirect a packet from an L3 device to veth.
      
        [0] https://lore.kernel.org/bpf/20210427135550.807355-1-joamaki@gmail.com
      
      Signed-off-by: Jussi Maki <joamaki@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210505085925.783985-1-joamaki@gmail.com
      096eccde
    • Florent Revest's avatar
      bpf: Fix nested bpf_bprintf_prepare with more per-cpu buffers · e2d5b2bb
      Florent Revest authored
      The bpf_seq_printf, bpf_trace_printk and bpf_snprintf helpers share one
      per-cpu buffer that they use to store temporary data (arguments to
      bprintf). They "get" that buffer with try_get_fmt_tmp_buf and "put" it
      by the end of their scope with bpf_bprintf_cleanup.
      
      If one of these helpers gets called within the scope of one of these
      helpers, for example: a first bpf program gets called, uses
      bpf_trace_printk which calls raw_spin_lock_irqsave which is traced by
      another bpf program that calls bpf_snprintf, then the second "get"
      fails. Essentially, these helpers are not re-entrant. They would return
      -EBUSY and print a warning message once.
      
      This patch triples the number of bprintf buffers to allow three levels
      of nesting. This is very similar to what was done for tracepoints in
      "9594dc3c bpf: fix nested bpf tracepoints with per-cpu data"
      
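The nesting scheme described above can be sketched in userspace C. This is a simplified model, not the kernel implementation: a plain global counter stands in for the per-CPU nesting level, the names get_tmp_buf() and put_tmp_buf() are hypothetical, and the buffer length is an arbitrary value chosen for the sketch.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_NEST_LEVEL 3    /* three levels of nesting, as in the message */
#define TMP_BUF_LEN    512  /* arbitrary size for this sketch */

/* One buffer per nesting level; a real kernel would keep these per CPU. */
static char tmp_bufs[MAX_NEST_LEVEL][TMP_BUF_LEN];
static int nest_level;      /* stands in for a per-CPU counter */

/* Hand out the buffer for the current nesting level, or fail with -EBUSY. */
static int get_tmp_buf(char **buf)
{
    if (nest_level >= MAX_NEST_LEVEL)
        return -EBUSY;      /* a fourth nested caller is still refused */
    *buf = tmp_bufs[nest_level++];
    return 0;
}

/* Release the most recently acquired buffer. */
static void put_tmp_buf(void)
{
    if (nest_level > 0)
        nest_level--;
}
```

Each nested "get" receives a distinct buffer, so an inner helper cannot overwrite the arguments an outer helper has already stored. Releases must happen in reverse order of acquisition, which holds naturally when every helper puts its buffer back at the end of its own scope.
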
      Fixes: d9c9e4db ("bpf: Factorize bpf_trace_printk and bpf_seq_printf")
      Reported-by: syzbot+63122d0bc347f18c1884@syzkaller.appspotmail.com
      Signed-off-by: Florent Revest <revest@chromium.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210511081054.2125874-1-revest@chromium.org
      e2d5b2bb