1. 22 Sep, 2022 8 commits
    • Merge branch 'bonding-fix-null-deref-in-bond_rr_gen_slave_id' · c5da4b68
      Jakub Kicinski authored
      Jonathan Toppins says:
      
      ====================
      bonding: fix NULL deref in bond_rr_gen_slave_id
      
      Fix a NULL dereference of the struct bonding.rr_tx_counter member. If a
      bond is initially created with a mode other than zero (Round Robin), the
      memory required for the counter is never allocated, and when the mode is
      later changed nothing verifies that the memory has been allocated.
      ====================
      
      Link: https://lore.kernel.org/r/cover.1663694476.git.jtoppins@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: bonding: cause oops in bond_rr_gen_slave_id · 2ffd5732
      Jonathan Toppins authored
      This bonding selftest used to cause a kernel oops on aarch64
      and should be architecture agnostic.

      Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
      Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bonding: fix NULL deref in bond_rr_gen_slave_id · 0e400d60
      Jonathan Toppins authored
      Fix a NULL dereference of the struct bonding.rr_tx_counter member. If a
      bond is initially created with a mode other than zero (Round Robin), the
      memory required for the counter is never allocated, and when the mode is
      later changed nothing verifies that the memory has been allocated.
      
      This causes the following Oops on an aarch64 machine:
          [  334.686773] Unable to handle kernel paging request at virtual address ffff2c91ac905000
          [  334.694703] Mem abort info:
          [  334.697486]   ESR = 0x0000000096000004
          [  334.701234]   EC = 0x25: DABT (current EL), IL = 32 bits
          [  334.706536]   SET = 0, FnV = 0
          [  334.709579]   EA = 0, S1PTW = 0
          [  334.712719]   FSC = 0x04: level 0 translation fault
          [  334.717586] Data abort info:
          [  334.720454]   ISV = 0, ISS = 0x00000004
          [  334.724288]   CM = 0, WnR = 0
          [  334.727244] swapper pgtable: 4k pages, 48-bit VAs, pgdp=000008044d662000
          [  334.733944] [ffff2c91ac905000] pgd=0000000000000000, p4d=0000000000000000
          [  334.740734] Internal error: Oops: 96000004 [#1] SMP
          [  334.745602] Modules linked in: bonding tls veth rfkill sunrpc arm_spe_pmu vfat fat acpi_ipmi ipmi_ssif ixgbe igb i40e mdio ipmi_devintf ipmi_msghandler arm_cmn arm_dsu_pmu cppc_cpufreq acpi_tad fuse zram crct10dif_ce ast ghash_ce sbsa_gwdt nvme drm_vram_helper drm_ttm_helper nvme_core ttm xgene_hwmon
          [  334.772217] CPU: 7 PID: 2214 Comm: ping Not tainted 6.0.0-rc4-00133-g64ae13ed #4
          [  334.779950] Hardware name: GIGABYTE R272-P31-00/MP32-AR1-00, BIOS F18v (SCP: 1.08.20211002) 12/01/2021
          [  334.789244] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
          [  334.796196] pc : bond_rr_gen_slave_id+0x40/0x124 [bonding]
          [  334.801691] lr : bond_xmit_roundrobin_slave_get+0x38/0xdc [bonding]
          [  334.807962] sp : ffff8000221733e0
          [  334.811265] x29: ffff8000221733e0 x28: ffffdbac8572d198 x27: ffff80002217357c
          [  334.818392] x26: 000000000000002a x25: ffffdbacb33ee000 x24: ffff07ff980fa000
          [  334.825519] x23: ffffdbacb2e398ba x22: ffff07ff98102000 x21: ffff07ff981029c0
          [  334.832646] x20: 0000000000000001 x19: ffff07ff981029c0 x18: 0000000000000014
          [  334.839773] x17: 0000000000000000 x16: ffffdbacb1004364 x15: 0000aaaabe2f5a62
          [  334.846899] x14: ffff07ff8e55d968 x13: ffff07ff8e55db30 x12: 0000000000000000
          [  334.854026] x11: ffffdbacb21532e8 x10: 0000000000000001 x9 : ffffdbac857178ec
          [  334.861153] x8 : ffff07ff9f6e5a28 x7 : 0000000000000000 x6 : 000000007c2b3742
          [  334.868279] x5 : ffff2c91ac905000 x4 : ffff2c91ac905000 x3 : ffff07ff9f554400
          [  334.875406] x2 : ffff2c91ac905000 x1 : 0000000000000001 x0 : ffff07ff981029c0
          [  334.882532] Call trace:
          [  334.884967]  bond_rr_gen_slave_id+0x40/0x124 [bonding]
          [  334.890109]  bond_xmit_roundrobin_slave_get+0x38/0xdc [bonding]
          [  334.896033]  __bond_start_xmit+0x128/0x3a0 [bonding]
          [  334.901001]  bond_start_xmit+0x54/0xb0 [bonding]
          [  334.905622]  dev_hard_start_xmit+0xb4/0x220
          [  334.909798]  __dev_queue_xmit+0x1a0/0x720
          [  334.913799]  arp_xmit+0x3c/0xbc
          [  334.916932]  arp_send_dst+0x98/0xd0
          [  334.920410]  arp_solicit+0xe8/0x230
          [  334.923888]  neigh_probe+0x60/0xb0
          [  334.927279]  __neigh_event_send+0x3b0/0x470
          [  334.931453]  neigh_resolve_output+0x70/0x90
          [  334.935626]  ip_finish_output2+0x158/0x514
          [  334.939714]  __ip_finish_output+0xac/0x1a4
          [  334.943800]  ip_finish_output+0x40/0xfc
          [  334.947626]  ip_output+0xf8/0x1a4
          [  334.950931]  ip_send_skb+0x5c/0x100
          [  334.954410]  ip_push_pending_frames+0x3c/0x60
          [  334.958758]  raw_sendmsg+0x458/0x6d0
          [  334.962325]  inet_sendmsg+0x50/0x80
          [  334.965805]  sock_sendmsg+0x60/0x6c
          [  334.969286]  __sys_sendto+0xc8/0x134
          [  334.972853]  __arm64_sys_sendto+0x34/0x4c
          [  334.976854]  invoke_syscall+0x78/0x100
          [  334.980594]  el0_svc_common.constprop.0+0x4c/0xf4
          [  334.985287]  do_el0_svc+0x38/0x4c
          [  334.988591]  el0_svc+0x34/0x10c
          [  334.991724]  el0t_64_sync_handler+0x11c/0x150
          [  334.996072]  el0t_64_sync+0x190/0x194
          [  334.999726] Code: b9001062 f9403c02 d53cd044 8b040042 (b8210040)
          [  335.005810] ---[ end trace 0000000000000000 ]---
          [  335.010416] Kernel panic - not syncing: Oops: Fatal exception in interrupt
          [  335.017279] SMP: stopping secondary CPUs
          [  335.021374] Kernel Offset: 0x5baca8eb0000 from 0xffff800008000000
          [  335.027456] PHYS_OFFSET: 0x80000000
          [  335.030932] CPU features: 0x0000,0085c029,19805c82
          [  335.035713] Memory Limit: none
          [  335.038756] Rebooting in 180 seconds..
      
      The fix is to allocate the memory in bond_open() which is guaranteed
      to be called before any packets are processed.
      
      Fixes: 848ca918 ("net: bonding: Use per-cpu rr_tx_counter")
      CC: Jussi Maki <joamaki@gmail.com>
      Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
      Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
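The allocate-on-open idea described in the commit above can be sketched in user space. This is a simplified model, not the kernel code: the `model_` names are hypothetical, and a single heap-allocated counter stands in for the per-CPU counter the driver uses.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the bonding modes; 0 is Round Robin. */
enum model_bond_mode { MODEL_MODE_ROUNDROBIN = 0, MODEL_MODE_ACTIVEBACKUP = 1 };

struct model_bond {
    enum model_bond_mode mode;
    uint32_t *rr_tx_counter; /* NULL until the open path allocates it */
};

/* Open path: runs before any packet is sent, so it is a safe place to
 * make sure the counter exists whenever the current mode needs it. */
int model_bond_open(struct model_bond *bond)
{
    if (bond->mode == MODEL_MODE_ROUNDROBIN && !bond->rr_tx_counter) {
        bond->rr_tx_counter = calloc(1, sizeof(*bond->rr_tx_counter));
        if (!bond->rr_tx_counter)
            return -1; /* allocation failed */
    }
    return 0;
}

/* Transmit path: hands out the next round-robin id from the counter. */
int model_bond_next_id(struct model_bond *bond, uint32_t *id)
{
    if (!bond->rr_tx_counter)
        return -1; /* would have been the NULL dereference */
    *id = (*bond->rr_tx_counter)++;
    return 0;
}
```

Because the check lives in the open path rather than in the creation path, a bond created in another mode and later switched to Round Robin still gets its counter before the first transmit.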
    • net: phy: micrel: fix shared interrupt on LAN8814 · 2002fbac
      Michael Walle authored
      Since commit ece19502 ("net: phy: micrel: 1588 support for LAN8814
      phy") the handler always returns IRQ_HANDLED, except in an error case.
      Before that commit, the interrupt status register was checked and if
      it was empty, IRQ_NONE was returned. Restore that behavior to play nice
      with the interrupt line being shared with others.
      
      Fixes: ece19502 ("net: phy: micrel: 1588 support for LAN8814 phy")
      Signed-off-by: Michael Walle <michael@walle.cc>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Reviewed-by: Divya Koppera <Divya.Koppera@microchip.com>
      Link: https://lore.kernel.org/r/20220920141619.808117-1-michael@walle.cc
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
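The shared-interrupt behavior restored by the commit above can be illustrated with a small user-space model. The enum and struct below are hypothetical stand-ins, not the kernel's irqreturn_t or the driver's register layout; the only point is that an empty status must be reported as "not mine".

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's IRQ_NONE / IRQ_HANDLED. */
enum model_irqreturn { MODEL_IRQ_NONE = 0, MODEL_IRQ_HANDLED = 1 };

/* Hypothetical device model: a latched interrupt status register. */
struct model_phy {
    uint16_t irq_status;
};

enum model_irqreturn model_phy_handle_interrupt(struct model_phy *phy)
{
    /* Read and clear the latched status. */
    uint16_t status = phy->irq_status;
    phy->irq_status = 0;

    /* Nothing pending: the interrupt came from another device on the
     * shared line, so let the other handlers have a look. */
    if (!status)
        return MODEL_IRQ_NONE;

    /* ... service the pending events here ... */
    return MODEL_IRQ_HANDLED;
}
```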
    • net/smc: Stop the CLC flow if no link to map buffers on · e738455b
      Wen Gu authored
      There might be a potential race between SMC-R buffer map and
      link group termination.
      
      smc_smcr_terminate_all()     | smc_connect_rdma()
      --------------------------------------------------------------
                                   | smc_conn_create()
      for links in smcibdev        |
              schedule links down  |
                                   | smc_buf_create()
                                   |  \- smcr_buf_map_usable_links()
                                   |      \- no usable links found,
                                   |         (rmb->mr = NULL)
                                   |
                                   | smc_clc_send_confirm()
                                   |  \- access conn->rmb_desc->mr[]->rkey
                                   |     (panic)
      
      During reboot and IB device module removal, all links are set
      down and no usable links remain in the link groups. In this situation
      smcr_buf_map_usable_links() should return an error and stop the
      CLC flow from accessing the uninitialized mr.
      
      Fixes: b9247544 ("net/smc: convert static link ID instances to support multiple links")
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Link: https://lore.kernel.org/r/1663656189-32090-1-git-send-email-guwen@linux.alibaba.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
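A minimal sketch of the "fail when nothing could be mapped" rule from the commit above, assuming a fixed-size array of links. All `model_` names and the choice of -EINVAL are illustrative assumptions, not taken from the SMC code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_LINKS_MAX 3

/* Hypothetical link and buffer models. */
struct model_link { bool usable; };
struct model_buf  { bool mapped[MODEL_LINKS_MAX]; };

/* Map the buffer on every usable link. If no link was usable, report an
 * error so the caller stops before touching per-link mapping state. */
int model_buf_map_usable_links(const struct model_link *links,
                               struct model_buf *buf)
{
    int mapped = 0;

    for (size_t i = 0; i < MODEL_LINKS_MAX; i++) {
        buf->mapped[i] = links[i].usable;
        if (links[i].usable)
            mapped++;
    }
    return mapped ? 0 : -EINVAL;
}
```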
    • Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · 624aea6b
      Jakub Kicinski authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2022-09-20 (ice)
      
      Michal re-sets TC configuration when changing number of queues.
      
      Mateusz moves the check and call for link-down-on-close to the specific
      path for downing/closing the interface.
      
      * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
        ice: Fix interface being down after reset with link-down-on-close flag on
        ice: config netdev tc before setting queues number
      ====================
      
      Link: https://lore.kernel.org/r/20220920205344.1860934-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ice: Fix ice_xdp_xmit() when XDP TX queue number is not sufficient · 114f398d
      Larysa Zaremba authored
      The original patch added a static branch to handle the situation
      where assigning an XDP TX queue to every CPU is not possible,
      so the queues have to be shared.

      However, in the XDP transmit handler ice_xdp_xmit(), an error was
      returned in such cases even before the static branch condition was
      checked, making queue sharing still impossible.
      
      Fixes: 22bf877e ("ice: introduce XDP_TX fallback path")
      Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Link: https://lore.kernel.org/r/20220919134346.25030-1-larysa.zaremba@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
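The ordering problem described in the commit above can be sketched as a queue-selection helper. This is an illustrative model with hypothetical names: the `shared` flag plays the role of the static branch, and the modulo mapping is an assumption about how CPUs could be folded onto fewer queues.

```c
#include <errno.h>
#include <stdbool.h>

/* Pick an XDP TX queue for a CPU. The sharing case is evaluated first,
 * so a CPU index beyond the queue count is only an error when queues
 * are not shared. */
int model_pick_xdp_txq(unsigned int cpu, unsigned int num_xdp_txq, bool shared)
{
    if (num_xdp_txq == 0)
        return -ENXIO;

    if (shared)
        return (int)(cpu % num_xdp_txq); /* several CPUs share one queue */

    if (cpu >= num_xdp_txq)
        return -ENXIO; /* no dedicated queue for this CPU */

    return (int)cpu;
}
```

Doing the bounds check before the `shared` test would reproduce the bug: every CPU without a dedicated queue would fail even though sharing was enabled.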
    • Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · f64780e3
      Jakub Kicinski authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2022-09-19 (iavf, i40e)
      
      Norbert adds checking of buffer size for Rx buffer checks in iavf.
      
      Michal corrects setting of max MTU in iavf to account for MTU data provided
      by PF, fixes i40e to set VF max MTU, and resolves lack of rate limiting
      when value was less than divisor for i40e.
      
      * '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
        i40e: Fix set max_tx_rate when it is lower than 1 Mbps
        i40e: Fix VF set max MTU size
        iavf: Fix set max MTU size with port VLAN and jumbo frames
        iavf: Fix bad page state
      ====================
      
      Link: https://lore.kernel.org/r/20220919223428.572091-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 21 Sep, 2022 8 commits
  3. 20 Sep, 2022 24 commits
    • netfilter: nf_ct_ftp: fix deadlock when nat rewrite is needed · d2508893
      Florian Westphal authored
      We can't use ct->lock, this is already used by the seqadj internals.
      When using ftp helper + nat, seqadj will attempt to acquire ct->lock
      again.
      
      Revert back to a global lock for now.
      
      Fixes: c783a29c ("netfilter: nf_ct_ftp: prefer skb_linearize")
      Reported-by: Bruno de Paula Larini <bruno.larini@riosoft.com.br>
      Signed-off-by: Florian Westphal <fw@strlen.de>
    • netfilter: ebtables: fix memory leak when blob is malformed · 62ce44c4
      Florian Westphal authored
      The bug fix was incomplete: it "replaced" the crash with a memory leak.
      The old code had an assignment to "ret" embedded into the conditional;
      restore this.
      
      Fixes: 7997eff8 ("netfilter: ebtables: reject blobs that don't provide all entry points")
      Reported-and-tested-by: syzbot+a24c5252f3e3ab733464@syzkaller.appspotmail.com
      Signed-off-by: Florian Westphal <fw@strlen.de>
    • netfilter: nf_tables: fix percpu memory leak at nf_tables_addchain() · 9a4d6dd5
      Tetsuo Handa authored
      It seems to me that percpu memory for chain stats started leaking since
      commit 3bc158f8 ("netfilter: nf_tables: map basechain priority to
      hardware priority") when nft_chain_offload_priority() returned an error.

      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Fixes: 3bc158f8 ("netfilter: nf_tables: map basechain priority to hardware priority")
      Signed-off-by: Florian Westphal <fw@strlen.de>
    • netfilter: nf_tables: fix nft_counters_enabled underflow at nf_tables_addchain() · 921ebde3
      Tetsuo Handa authored
      syzbot is reporting underflow of nft_counters_enabled counter at
      nf_tables_addchain() [1], for commit 43eb8949 ("netfilter:
      nf_tables: do not leave chain stats enabled on error") missed that
      nf_tables_chain_destroy() after nft_basechain_init() in the error path of
      nf_tables_addchain() decrements the counter because nft_basechain_init()
      makes nft_is_base_chain() return true by setting NFT_CHAIN_BASE flag.
      
      Increment the counter immediately after returning from
      nft_basechain_init().
      
      Link: https://syzkaller.appspot.com/bug?extid=b5d82a651b71cd8a75ab [1]
      Reported-by: syzbot <syzbot+b5d82a651b71cd8a75ab@syzkaller.appspotmail.com>
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Tested-by: syzbot <syzbot+b5d82a651b71cd8a75ab@syzkaller.appspotmail.com>
      Fixes: 43eb8949 ("netfilter: nf_tables: do not leave chain stats enabled on error")
      Signed-off-by: Florian Westphal <fw@strlen.de>
    • netfilter: conntrack: remove nf_conntrack_helper documentation · 76b907ee
      Pablo Neira Ayuso authored
      This toggle has already been removed by b1185090 ("netfilter: remove
      nf_conntrack_helper sysctl and modparam toggles").

      Remove the documentation entry for this toggle too.

      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Florian Westphal <fw@strlen.de>
    • MAINTAINERS: Add myself as a reviewer for Qualcomm ETHQOS Ethernet driver · 603ccb3a
      Bhupesh Sharma authored
      As suggested by Vinod, adding myself as the reviewer
      for the Qualcomm ETHQOS Ethernet driver.
      
      Recently I have enabled this driver on a few Qualcomm
      SoCs / boards and hence trying to keep a close eye on
      it.

      Signed-off-by: Bhupesh Sharma <bhupesh.sharma@linaro.org>
      Acked-by: Vinod Koul <vkoul@kernel.org>
      Link: https://lore.kernel.org/r/20220915112804.3950680-1-bhupesh.sharma@linaro.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ice: Fix interface being down after reset with link-down-on-close flag on · 8ac71327
      Mateusz Palczewski authored
      When a reset was performed on the ice driver with the link-down-on-close
      flag on, the interface would always stay down. Fix this by moving the
      check of this flag to ice_stop(), which is called only when the user
      wants to bring the interface down.
      
      Fixes: ab4ab73f ("ice: Add ethtool private flag to make forcing link down optional")
      Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
      Tested-by: Petr Oros <poros@redhat.com>
      Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: config netdev tc before setting queues number · 122045ca
      Michal Swiatkowski authored
      After lowering the number of tx queues, the following warning appears:
      "Number of in use tx queues changed invalidating tc mappings. Priority
      traffic classification disabled!"
      Example command to reproduce:
      ethtool -L enp24s0f0 tx 36 rx 36
      
      Fix this by setting correct tc mapping before setting real number of
      queues on netdev.
      
      Fixes: 0754d65b ("ice: Add infrastructure for mqprio support via ndo_setup_tc")
      Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • Merge branch 'fixes-for-tc-taprio-software-mode' · da847246
      Jakub Kicinski authored
      Vladimir Oltean says:
      
      ====================
      Fixes for tc-taprio software mode
      
      While working on some new features for tc-taprio, I found some strange
      behavior which looked like bugs. I was able to eventually trigger a NULL
      pointer dereference. This patch set fixes 2 issues I saw. Detailed
      explanation in patches.
      ====================
      
      Link: https://lore.kernel.org/r/20220915100802.2308279-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs · 1461d212
      Vladimir Oltean authored
      taprio can only operate as root qdisc, and to that end, there exists the
      following check in taprio_init(), just as in mqprio:
      
      	if (sch->parent != TC_H_ROOT)
      		return -EOPNOTSUPP;
      
      And indeed, when we try to attach taprio to an mqprio child, it fails as
      expected:
      
      $ tc qdisc add dev swp0 root handle 1: mqprio num_tc 8 \
      	map 0 1 2 3 4 5 6 7 \
      	queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
      $ tc qdisc replace dev swp0 parent 1:2 taprio num_tc 8 \
      	map 0 1 2 3 4 5 6 7 \
      	queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
      	base-time 0 sched-entry S 0x7f 990000 sched-entry S 0x80 100000 \
      	flags 0x0 clockid CLOCK_TAI
      Error: sch_taprio: Can only be attached as root qdisc.
      
      (extack message added by me)
      
      But when we try to attach a taprio child to a taprio root qdisc,
      surprisingly it doesn't fail:
      
      $ tc qdisc replace dev swp0 root handle 1: taprio num_tc 8 \
      	map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
      	base-time 0 sched-entry S 0x7f 990000 sched-entry S 0x80 100000 \
      	flags 0x0 clockid CLOCK_TAI
      $ tc qdisc replace dev swp0 parent 1:2 taprio num_tc 8 \
      	map 0 1 2 3 4 5 6 7 \
      	queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
      	base-time 0 sched-entry S 0x7f 990000 sched-entry S 0x80 100000 \
      	flags 0x0 clockid CLOCK_TAI
      
      This is because tc_modify_qdisc() behaves differently when mqprio is
      root, vs when taprio is root.
      
      In the mqprio case, it finds the parent qdisc through
      p = qdisc_lookup(dev, TC_H_MAJ(clid)), and then the child qdisc through
      q = qdisc_leaf(p, clid). This leaf qdisc q has handle 0, so it is
      ignored according to the comment right below ("It may be default qdisc,
      ignore it"). As a result, tc_modify_qdisc() goes through the
      qdisc_create() code path, and this gives taprio_init() a chance to check
      for sch->parent != TC_H_ROOT and error out.
      
      Whereas in the taprio case, the returned q = qdisc_leaf(p, clid) is
      different. It is not the default qdisc created for each netdev queue
      (both taprio and mqprio call qdisc_create_dflt() and keep them in
      a private q->qdiscs[], or priv->qdiscs[], respectively). Instead, taprio
      makes qdisc_leaf() return the _root_ qdisc, aka itself.
      
      When taprio does that, tc_modify_qdisc() goes through the qdisc_change()
      code path, because the qdisc layer never finds out about the child qdisc
      of the root. And through the ->change() ops, taprio has no reason to
      check whether its parent is root or not, just through ->init(), which is
      not called.
      
      The problem is the taprio_leaf() implementation. Even though code wise,
      it does the exact same thing as mqprio_leaf() which it is copied from,
      it works with different input data. This is because mqprio does not
      attach itself (the root) to each device TX queue, but one of the default
      qdiscs from its private array.
      
      In fact, since commit 13511704 ("net: taprio offload: enforce qdisc
      to netdev queue mapping"), taprio does this too, but just for the full
      offload case. So if we tried to attach a taprio child to a fully
      offloaded taprio root qdisc, it would properly fail too; just not to a
      software root taprio.
      
      To fix the problem, stop looking at the Qdisc that's attached to the TX
      queue, and instead, always return the default qdiscs that we've
      allocated (and to which we privately enqueue and dequeue, in software
      scheduling mode).
      
      Since Qdisc_class_ops :: leaf  is only called from tc_modify_qdisc(),
      the risk of unforeseen side effects introduced by this change is
      minimal.
      
      Fixes: 5a781ccb ("tc: Add support for configuring the taprio scheduler")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
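The difference between the two leaf lookups discussed in the commit above can be modeled in user space. The structures and the `model_` functions are hypothetical simplifications: `txq_attached[]` stands for "the qdisc attached to each TX queue" and `children[]` for the privately held default qdiscs.

```c
#include <stddef.h>

#define MODEL_NUM_TXQ 8

/* Hypothetical qdisc model; handle 0 marks a default (unnamed) qdisc. */
struct model_qdisc { unsigned int handle; };

struct model_taprio {
    struct model_qdisc root;
    struct model_qdisc *txq_attached[MODEL_NUM_TXQ]; /* what each TX queue points at */
    struct model_qdisc *children[MODEL_NUM_TXQ];     /* private default qdiscs */
};

/* Old behavior: return whatever is attached to the TX queue. In software
 * mode that is the root itself, so the caller sees a non-zero handle. */
struct model_qdisc *model_leaf_old(struct model_taprio *q, unsigned long cl)
{
    unsigned long ntx = cl - 1; /* class ids start at 1; cl == 0 wraps and is rejected */

    if (ntx >= MODEL_NUM_TXQ)
        return NULL;
    return q->txq_attached[ntx];
}

/* New behavior: always return the private per-queue child qdisc. */
struct model_qdisc *model_leaf_new(struct model_taprio *q, unsigned long cl)
{
    unsigned long ntx = cl - 1;

    if (ntx >= MODEL_NUM_TXQ)
        return NULL;
    return q->children[ntx];
}
```

With the new lookup the caller gets a handle-0 child, treats it as a default qdisc, and takes the create path, where the root-only check can reject the request.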
    • net/sched: taprio: avoid disabling offload when it was never enabled · db46e3a8
      Vladimir Oltean authored
      In an incredibly strange API design decision, qdisc->destroy() gets
      called even if qdisc->init() never succeeded, not exclusively since
      commit 87b60cfa ("net_sched: fix error recovery at qdisc creation"),
      but apparently also earlier (in the case of qdisc_create_dflt()).
      
      The taprio qdisc does not fully acknowledge this when it attempts full
      offload, because it starts off with q->flags = TAPRIO_FLAGS_INVALID in
      taprio_init(), then it replaces q->flags with TCA_TAPRIO_ATTR_FLAGS
      parsed from netlink (in taprio_change(), tail called from taprio_init()).
      
      But in taprio_destroy(), we call taprio_disable_offload(), and this
      determines what to do based on FULL_OFFLOAD_IS_ENABLED(q->flags).
      
      But looking at the implementation of FULL_OFFLOAD_IS_ENABLED()
      (a bitwise check of bit 1 in q->flags), it is invalid to call this macro
      on q->flags when it contains TAPRIO_FLAGS_INVALID, because that is set
      to U32_MAX, and therefore FULL_OFFLOAD_IS_ENABLED() will return true on
      an invalid set of flags.
      
      As a result, it is possible to crash the kernel if user space forces an
      error between setting q->flags = TAPRIO_FLAGS_INVALID, and the calling
      of taprio_enable_offload(). This is because drivers do not expect the
      offload to be disabled when it was never enabled.
      
      The error that we force here is to attach taprio as a non-root qdisc,
      but instead as child of an mqprio root qdisc:
      
      $ tc qdisc add dev swp0 root handle 1: \
      	mqprio num_tc 8 map 0 1 2 3 4 5 6 7 \
      	queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
      $ tc qdisc replace dev swp0 parent 1:1 \
      	taprio num_tc 8 map 0 1 2 3 4 5 6 7 \
      	queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 base-time 0 \
      	sched-entry S 0x7f 990000 sched-entry S 0x80 100000 \
      	flags 0x0 clockid CLOCK_TAI
      Unable to handle kernel paging request at virtual address fffffffffffffff8
      [fffffffffffffff8] pgd=0000000000000000, p4d=0000000000000000
      Internal error: Oops: 96000004 [#1] PREEMPT SMP
      Call trace:
       taprio_dump+0x27c/0x310
       vsc9959_port_setup_tc+0x1f4/0x460
       felix_port_setup_tc+0x24/0x3c
       dsa_slave_setup_tc+0x54/0x27c
       taprio_disable_offload.isra.0+0x58/0xe0
       taprio_destroy+0x80/0x104
       qdisc_create+0x240/0x470
       tc_modify_qdisc+0x1fc/0x6b0
       rtnetlink_rcv_msg+0x12c/0x390
       netlink_rcv_skb+0x5c/0x130
       rtnetlink_rcv+0x1c/0x2c
      
      Fix this by keeping track of the operations we made, and undo the
      offload only if we actually did it.
      
      I've added "bool offloaded" inside a 4 byte hole between "int clockid"
      and "atomic64_t picos_per_byte". Now the first cache line looks like
      below:
      
      $ pahole -C taprio_sched net/sched/sch_taprio.o
      struct taprio_sched {
              struct Qdisc * *           qdiscs;               /*     0     8 */
              struct Qdisc *             root;                 /*     8     8 */
              u32                        flags;                /*    16     4 */
              enum tk_offsets            tk_offset;            /*    20     4 */
              int                        clockid;              /*    24     4 */
              bool                       offloaded;            /*    28     1 */
      
              /* XXX 3 bytes hole, try to pack */
      
              atomic64_t                 picos_per_byte;       /*    32     0 */
      
              /* XXX 8 bytes hole, try to pack */
      
              spinlock_t                 current_entry_lock;   /*    40     0 */
      
              /* XXX 8 bytes hole, try to pack */
      
              struct sched_entry *       current_entry;        /*    48     8 */
              struct sched_gate_list *   oper_sched;           /*    56     8 */
              /* --- cacheline 1 boundary (64 bytes) --- */
      
      Fixes: 9c66d156 ("taprio: Add support for hardware offloading")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
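The flag pitfall described in the commit above is easy to reproduce in isolation. The sketch below uses hypothetical `MODEL_` names in place of TAPRIO_FLAGS_INVALID and FULL_OFFLOAD_IS_ENABLED(); the values (all bits set, and a test of bit 1) follow the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_FLAGS_INVALID     UINT32_MAX          /* "flags not parsed yet" */
#define MODEL_FLAG_FULL_OFFLOAD (UINT32_C(1) << 1)  /* bit 1 */

struct model_sched {
    uint32_t flags;
    bool offloaded; /* set only after offload was actually enabled */
};

/* Bitwise test of bit 1; true for the all-ones invalid marker too. */
bool model_full_offload_is_enabled(uint32_t flags)
{
    return (flags & MODEL_FLAG_FULL_OFFLOAD) != 0;
}

/* Old destroy logic: decide from the flags alone. */
bool model_destroy_disables_offload_old(const struct model_sched *q)
{
    return model_full_offload_is_enabled(q->flags);
}

/* New destroy logic: undo only what was actually done. */
bool model_destroy_disables_offload_new(const struct model_sched *q)
{
    return q->offloaded;
}
```

If init fails while `flags` still holds the invalid marker, the old logic asks the driver to disable an offload that was never enabled, while the tracked boolean stays false.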
    • ipv6: Fix crash when IPv6 is administratively disabled · 76dd0728
      Ido Schimmel authored
      The global 'raw_v6_hashinfo' variable can be accessed even when IPv6 is
      administratively disabled via the 'ipv6.disable=1' kernel command line
      option, leading to a crash [1].
      
      Fix by restoring the original behavior and always initializing the
      variable, regardless of IPv6 support being administratively disabled or
      not.
      
      [1]
       BUG: unable to handle page fault for address: ffffffffffffffc8
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 173e18067 P4D 173e18067 PUD 173e1a067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP KASAN
       CPU: 3 PID: 271 Comm: ss Not tainted 6.0.0-rc4-custom-00136-g0727a9a5 #1396
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
       RIP: 0010:raw_diag_dump+0x310/0x7f0
       [...]
       Call Trace:
        <TASK>
        __inet_diag_dump+0x10f/0x2e0
        netlink_dump+0x575/0xfd0
        __netlink_dump_start+0x67b/0x940
        inet_diag_handler_cmd+0x273/0x2d0
        sock_diag_rcv_msg+0x317/0x440
        netlink_rcv_skb+0x15e/0x430
        sock_diag_rcv+0x2b/0x40
        netlink_unicast+0x53b/0x800
        netlink_sendmsg+0x945/0xe60
        ____sys_sendmsg+0x747/0x960
        ___sys_sendmsg+0x13a/0x1e0
        __sys_sendmsg+0x118/0x1e0
        do_syscall_64+0x34/0x80
        entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Fixes: 0daf07e5 ("raw: convert raw sockets to RCU")
      Reported-by: Roberto Ricci <rroberto2r@gmail.com>
      Tested-by: Roberto Ricci <rroberto2r@gmail.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20220916084821.229287-1-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: enetc: deny offload of tc-based TSN features on VF interfaces · 5641c751
      Vladimir Oltean authored
      TSN features on the ENETC (taprio, cbs, gate, police) are configured
      through a mix of command BD ring messages and port registers:
      enetc_port_rd(), enetc_port_wr().
      
      Port registers are a region of the ENETC memory map which are only
      accessible from the PCIe Physical Function. They are not accessible from
      the Virtual Functions.
      
      Moreover, attempting to access these registers crashes the kernel:
      
      $ echo 1 > /sys/bus/pci/devices/0000\:00\:00.0/sriov_numvfs
      pci 0000:00:01.0: [1957:ef00] type 00 class 0x020001
      fsl_enetc_vf 0000:00:01.0: Adding to iommu group 15
      fsl_enetc_vf 0000:00:01.0: enabling device (0000 -> 0002)
      fsl_enetc_vf 0000:00:01.0 eno0vf0: renamed from eth0
      $ tc qdisc replace dev eno0vf0 root taprio num_tc 8 map 0 1 2 3 4 5 6 7 \
      	queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 base-time 0 \
      	sched-entry S 0x7f 900000 sched-entry S 0x80 100000 flags 0x2
      Unable to handle kernel paging request at virtual address ffff800009551a08
      Internal error: Oops: 96000007 [#1] PREEMPT SMP
      pc : enetc_setup_tc_taprio+0x170/0x47c
      lr : enetc_setup_tc_taprio+0x16c/0x47c
      Call trace:
       enetc_setup_tc_taprio+0x170/0x47c
       enetc_setup_tc+0x38/0x2dc
       taprio_change+0x43c/0x970
       taprio_init+0x188/0x1e0
       qdisc_create+0x114/0x470
       tc_modify_qdisc+0x1fc/0x6c0
       rtnetlink_rcv_msg+0x12c/0x390
      
      Split enetc_setup_tc() into separate functions for the PF and for the
      VF drivers. Also remove enetc_qos.o from being included into
      enetc-vf.ko, since it serves absolutely no purpose there.
      
      Fixes: 34c6adf1 ("enetc: Configure the Time-Aware Scheduler via tc-taprio offload")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20220916133209.3351399-2-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: enetc: move enetc_set_psfp() out of the common enetc_set_features() · fed38e64
      Vladimir Oltean authored
      The VF netdev driver shouldn't respond to changes in the NETIF_F_HW_TC
      flag; only PFs should. Moreover, TSN-specific code should go to
      enetc_qos.c, which should not be included in the VF driver.
      
      Fixes: 79e49982 ("net: enetc: add hw tc hw offload features for PSPF capability")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20220916133209.3351399-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fed38e64
    • Merge branch 'wireguard-patches-for-6-0-rc6' · 0507246d
      Jakub Kicinski authored
      Jason A. Donenfeld says:
      
      ====================
      wireguard patches for 6.0-rc6
      
      1) The ratelimiter timing test doesn't help outside of development, yet
         it is currently preventing the module from being inserted on some
         kernels when it flakes at insertion time. So we disable it.
      
      2) A fix for a build error on UML, caused by a recent change in a
         different tree.
      
      3) A WARN_ON() is triggered by Kees' new fortified memcpy() patch, due
         to memcpy()ing over a sockaddr pointer with the size of a
         sockaddr_in[6]. The type safe fix is pretty simple. Given how classic
         of a thing sockaddr punning is, I suspect this may be the first in a
         few patches like this throughout the net tree, once Kees' fortify
         series is more widely deployed (currently it's just in next).
      ====================
      
      Link: https://lore.kernel.org/r/20220916143740.831881-1-Jason@zx2c4.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0507246d
    • wireguard: netlink: avoid variable-sized memcpy on sockaddr · 26c01310
      Jason A. Donenfeld authored
      Doing a variable-sized memcpy is slower, and the compiler isn't smart
      enough to turn this into a constant-size assignment.
      
      Further, Kees' latest fortified memcpy will actually bark, because the
      destination pointer is type sockaddr, not explicitly sockaddr_in or
      sockaddr_in6, so it thinks there's an overflow:
      
          memcpy: detected field-spanning write (size 28) of single field
          "&endpoint.addr" at drivers/net/wireguard/netlink.c:446 (size 16)
      
      Fix this by assigning directly, using an explicit cast for each checked
      case.
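      
      A minimal userspace sketch of that approach, assuming an endpoint type
      that overlays the three sockaddr variants in a union. The struct and
      function names are illustrative only. What it shows is one fixed-size
      struct assignment per checked family, instead of a single memcpy()
      whose length is only known at run time:

```c
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical, simplified endpoint storage (C11 anonymous union). */
struct endpoint {
	union {
		struct sockaddr addr;
		struct sockaddr_in addr4;
		struct sockaddr_in6 addr6;
	};
};

/*
 * Copy a caller-supplied address into the endpoint.
 * Returns 0 on success, -1 if length and family do not match.
 */
static int endpoint_set(struct endpoint *ep, const struct sockaddr *sa,
			size_t len)
{
	/* Length is checked first, so sa_family is only read when valid. */
	if (len == sizeof(struct sockaddr_in) && sa->sa_family == AF_INET) {
		/* Constant-size assignment through the IPv4 view. */
		ep->addr4 = *(const struct sockaddr_in *)sa;
		return 0;
	}
	if (len == sizeof(struct sockaddr_in6) && sa->sa_family == AF_INET6) {
		/* Constant-size assignment through the IPv6 view. */
		ep->addr6 = *(const struct sockaddr_in6 *)sa;
		return 0;
	}
	return -1;
}
```

      Because each branch writes through a member of the matching type, the
      destination size is known at compile time and no write spans past the
      field being assigned.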
      
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reported-by: syzbot+a448cda4dba2dac50de5@syzkaller.appspotmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      26c01310
    • wireguard: selftests: do not install headers on UML · 8e25c02b
      Jason A. Donenfeld authored
      Since 1b620d53 ("kbuild: disable header exports for UML in a
      straightforward way"), installing headers fails on UML, so just disable
      installing them, since they're not needed anyway on the architecture.
      
      Fixes: b438b3b8 ("wireguard: selftests: support UML")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8e25c02b
    • wireguard: ratelimiter: disable timings test by default · 684dec3c
      Jason A. Donenfeld authored
      A previous commit tried to make the ratelimiter timings test more
      reliable but in the process made it less reliable on other
      configurations. This is an impossible problem to solve without
      increasingly ridiculous heuristics. And it's not even a problem that
      actually needs to be solved in any comprehensive way, since this is only
      ever used during development. So just cordon this off with a DEBUG_
      ifdef, just like we do for the trie's randomized tests, so it can be
      enabled while hacking on the code, and otherwise disabled in CI. In the
      process we also revert 151c8e49.
      
      Fixes: 151c8e49 ("wireguard: ratelimiter: use hrtimer in selftest")
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      684dec3c
    • sfc/siena: fix null pointer dereference in efx_hard_start_xmit · 589c6ede
      Íñigo Huguet authored
      As in the previous patch for sfc, prevent a potential (but unlikely)
      NULL pointer dereference.
      
      Fixes: 12804793 ("sfc: decouple TXQ type from label")
      Reported-by: Tianhao Zhao <tizhao@redhat.com>
      Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
      Link: https://lore.kernel.org/r/20220915141958.16458-1-ihuguet@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      589c6ede
    • sfc/siena: fix TX channel offset when using legacy interrupts · 974bb793
      Íñigo Huguet authored
      As in the previous commit for sfc, fix the TX channel offset when
      efx_siena_separate_tx_channels is false (the default).
      
      Fixes: 25bde571 ("sfc/siena: fix wrong tx channel offset with efx_separate_tx_channels")
      Reported-by: Tianhao Zhao <tizhao@redhat.com>
      Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
      Link: https://lore.kernel.org/r/20220915141653.15504-1-ihuguet@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      974bb793
    • net: clear msg_get_inq in __get_compat_msghdr() · d547c1b7
      Tetsuo Handa authored
      syzbot is still complaining about an uninit-value in tcp_recvmsg(),
      because commit 1228b34c ("net: clear msg_get_inq in __sys_recvfrom() and
      __copy_msghdr_from_user()") missed that __get_compat_msghdr() is called
      instead of copy_msghdr_from_user() when MSG_CMSG_COMPAT is specified.
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Fixes: 1228b34c ("net: clear msg_get_inq in __sys_recvfrom() and __copy_msghdr_from_user()")
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Link: https://lore.kernel.org/r/d06d0f7f-696c-83b4-b2d5-70b5f2730a37@I-love.SAKURA.ne.jp
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d547c1b7
    • Merge branch 'ipmr-always-call-ip-6-_mr_forward-from-rcu-read-side-critical-section' · 68fe503c
      Jakub Kicinski authored
      Ido Schimmel says:
      
      ====================
      ipmr: Always call ip{,6}_mr_forward() from RCU read-side critical section
      
      Patch #1 fixes a bug in ipmr code.
      
      Patch #2 adds corresponding test cases.
      ====================
      
      Link: https://lore.kernel.org/r/20220914075339.4074096-1-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      68fe503c
    • selftests: forwarding: Add test cases for unresolved multicast routes · 2b5a8c8f
      Ido Schimmel authored
      Add IPv4 and IPv6 test cases for unresolved multicast routes, testing
      that queued packets are forwarded after installing a matching (S, G)
      route.
      
      The test cases can be used to reproduce the bugs fixed in "ipmr: Always
      call ip{,6}_mr_forward() from RCU read-side critical section".
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2b5a8c8f
    • ipmr: Always call ip{,6}_mr_forward() from RCU read-side critical section · b07a9b26
      Ido Schimmel authored
      These functions expect to be called from an RCU read-side critical section,
      but this only happens when invoked from the data path via
      ip{,6}_mr_input(). They can also be invoked from process context in
      response to user space adding a multicast route which resolves a cache
      entry with queued packets [1][2].
      
      Fix by adding missing rcu_read_lock() / rcu_read_unlock() in these call
      paths.
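      
      The shape of that fix can be sketched in userspace as follows. This is
      a mock, not kernel code: rcu_read_lock() and rcu_read_unlock() are
      replaced here by a simple nesting counter, and mr_forward() and
      mfc_add_resolve_queue() are hypothetical stand-ins for the forwarding
      function and the process-context caller:

```c
/* Userspace stand-ins; the real primitives live in <linux/rcupdate.h>. */
static int rcu_depth;

static void rcu_read_lock(void)
{
	rcu_depth++;
}

static void rcu_read_unlock(void)
{
	rcu_depth--;
}

/*
 * Hypothetical forwarder: returns 1 if it ran inside a read-side
 * critical section, 0 otherwise (the case the kernel warns about).
 */
static int mr_forward(void)
{
	return rcu_depth > 0;
}

/*
 * Process-context path, e.g. user space adds a route that resolves a
 * cache entry with queued packets. The forwarder is bracketed by the
 * read-side lock so its expectation holds on this path too.
 */
static int mfc_add_resolve_queue(void)
{
	int ok;

	rcu_read_lock();
	ok = mr_forward();
	rcu_read_unlock();
	return ok;
}
```

      Calling mr_forward() directly, without the bracket, corresponds to the
      situation reported in the two traces.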
      
      [1]
      WARNING: suspicious RCU usage
      6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387 Not tainted
      -----------------------------
      net/ipv4/ipmr.c:84 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      1 lock held by smcrouted/246:
       #0: ffffffff862389b0 (rtnl_mutex){+.+.}-{3:3}, at: ip_mroute_setsockopt+0x11c/0x1420
      
      stack backtrace:
      CPU: 0 PID: 246 Comm: smcrouted Not tainted 6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
      Call Trace:
       <TASK>
       dump_stack_lvl+0x91/0xb9
       vif_dev_read+0xbf/0xd0
       ipmr_queue_xmit+0x135/0x1ab0
       ip_mr_forward+0xe7b/0x13d0
       ipmr_mfc_add+0x1a06/0x2ad0
       ip_mroute_setsockopt+0x5c1/0x1420
       do_ip_setsockopt+0x23d/0x37f0
       ip_setsockopt+0x56/0x80
       raw_setsockopt+0x219/0x290
       __sys_setsockopt+0x236/0x4d0
       __x64_sys_setsockopt+0xbe/0x160
       do_syscall_64+0x34/0x80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      [2]
      WARNING: suspicious RCU usage
      6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387 Not tainted
      -----------------------------
      net/ipv6/ip6mr.c:69 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      1 lock held by smcrouted/246:
       #0: ffffffff862389b0 (rtnl_mutex){+.+.}-{3:3}, at: ip6_mroute_setsockopt+0x6b9/0x2630
      
      stack backtrace:
      CPU: 1 PID: 246 Comm: smcrouted Not tainted 6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
      Call Trace:
       <TASK>
       dump_stack_lvl+0x91/0xb9
       vif_dev_read+0xbf/0xd0
       ip6mr_forward2.isra.0+0xc9/0x1160
       ip6_mr_forward+0xef0/0x13f0
       ip6mr_mfc_add+0x1ff2/0x31f0
       ip6_mroute_setsockopt+0x1825/0x2630
       do_ipv6_setsockopt+0x462/0x4440
       ipv6_setsockopt+0x105/0x140
       rawv6_setsockopt+0xd8/0x690
       __sys_setsockopt+0x236/0x4d0
       __x64_sys_setsockopt+0xbe/0x160
       do_syscall_64+0x34/0x80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Fixes: ebc31979 ("ipmr: add rcu protection over (struct vif_device)->dev")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b07a9b26