1. 25 Nov, 2021 12 commits
  2. 24 Nov, 2021 4 commits
  3. 23 Nov, 2021 18 commits
    • net: marvell: mvpp2: increase MTU limit when XDP enabled · 7b1b62bc
      Marek Behún authored
      Currently mvpp2_xdp_setup won't allow attaching an XDP program if
        mtu > ETH_DATA_LEN (1500).
      
      mvpp2_change_mtu, on the other hand, checks whether
        MVPP2_RX_PKT_SIZE(mtu) > MVPP2_BM_LONG_PKT_SIZE.
      
      These two checks are semantically different.
      
      Moreover this limit can be increased to MVPP2_MAX_RX_BUF_SIZE, since in
      mvpp2_rx we have
        xdp.data = data + MVPP2_MH_SIZE + MVPP2_SKB_HEADROOM;
        xdp.frame_sz = PAGE_SIZE;
      
      Change the checks to check whether
        mtu > MVPP2_MAX_RX_BUF_SIZE
      
      Fixes: 07dd0a7a ("mvpp2: add basic XDP support")
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: kill ipa_cmd_pipeline_clear() · e4e9bfb7
      Alex Elder authored
      Calling ipa_cmd_pipeline_clear() after stopping the channel
      underlying the AP<-modem RX endpoint can lead to a deadlock.
      
      This occurs in the ->runtime_suspend device power operation for the
      IPA driver.  While this callback is in progress, any other requests
      for power will block until the callback returns.
      
      Stopping the AP<-modem RX channel does not prevent the modem from
      sending another packet to this endpoint.  If a packet arrives for an
      RX channel when the channel is stopped, a SUSPEND IPA interrupt
      condition will be pending.  Handling an IPA interrupt requires
      power, so ipa_isr_thread() calls pm_runtime_get_sync() first thing.
      
      The problem occurs because a "pipeline clear" command will not
      complete while such a SUSPEND interrupt condition exists.  So the
      SUSPEND IPA interrupt handler won't proceed until it gets power;
      that won't happen until the ->runtime_suspend callback (and its
      "pipeline clear" command) completes; and that can't happen while
      the SUSPEND interrupt condition exists.
      
      It turns out that in this case there is no need to use the "pipeline
      clear" command.  There are scenarios in which clearing the pipeline
      is required while suspending, but those are not (yet) supported
      upstream.  So a simple fix, avoiding the potential deadlock, is to
      stop calling ipa_cmd_pipeline_clear() in ipa_endpoint_suspend().
      This removes the only user of ipa_cmd_pipeline_clear(), so get rid
      of that function.  It can be restored again whenever it's needed.
      
      This is basically a manual revert along with an explanation for
      commit 6cb63ea6 ("net: ipa: introduce ipa_cmd_tag_process()").
      
      Fixes: 6cb63ea6 ("net: ipa: introduce ipa_cmd_tag_process()")
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: usb: Correct PHY handling of smsc95xx · a049a30f
      Martyn Welch authored
      The smsc95xx driver is dropping phy speed settings and causing a stack
      trace at device unbind:
      
      [  536.379147] smsc95xx 2-1:1.0 eth1: unregister 'smsc95xx' usb-ci_hdrc.2-1, smsc95xx USB 2.0 Ethernet
      [  536.425029] ------------[ cut here ]------------
      [  536.429650] WARNING: CPU: 0 PID: 439 at fs/kernfs/dir.c:1535 kernfs_remove_by_name_ns+0xb8/0xc0
      [  536.438416] kernfs: can not remove 'attached_dev', no directory
      [  536.444363] Modules linked in: xts dm_crypt dm_mod atmel_mxt_ts smsc95xx usbnet
      [  536.451748] CPU: 0 PID: 439 Comm: sh Tainted: G        W         5.15.0 #1
      [  536.458636] Hardware name: Freescale i.MX53 (Device Tree Support)
      [  536.464735] Backtrace: 
      [  536.467190] [<80b1c904>] (dump_backtrace) from [<80b1cb48>] (show_stack+0x20/0x24)
      [  536.474787]  r7:000005ff r6:8035b294 r5:600f0013 r4:80d8af78
      [  536.480449] [<80b1cb28>] (show_stack) from [<80b1f764>] (dump_stack_lvl+0x48/0x54)
      [  536.488035] [<80b1f71c>] (dump_stack_lvl) from [<80b1f788>] (dump_stack+0x18/0x1c)
      [  536.495620]  r5:00000009 r4:80d9b820
      [  536.499198] [<80b1f770>] (dump_stack) from [<80124fac>] (__warn+0xfc/0x114)
      [  536.506187] [<80124eb0>] (__warn) from [<80b1d21c>] (warn_slowpath_fmt+0xa8/0xdc)
      [  536.513688]  r7:000005ff r6:80d9b820 r5:80d9b8e0 r4:83744000
      [  536.519349] [<80b1d178>] (warn_slowpath_fmt) from [<8035b294>] (kernfs_remove_by_name_ns+0xb8/0xc0)
      [  536.528416]  r9:00000001 r8:00000000 r7:824926dc r6:00000000 r5:80df6c2c r4:00000000
      [  536.536162] [<8035b1dc>] (kernfs_remove_by_name_ns) from [<80b1f56c>] (sysfs_remove_link+0x4c/0x50)
      [  536.545225]  r6:7f00f02c r5:80df6c2c r4:83306400
      [  536.549845] [<80b1f520>] (sysfs_remove_link) from [<806f9c8c>] (phy_detach+0xfc/0x11c)
      [  536.557780]  r5:82492000 r4:83306400
      [  536.561359] [<806f9b90>] (phy_detach) from [<806f9cf8>] (phy_disconnect+0x4c/0x58)
      [  536.568943]  r7:824926dc r6:7f00f02c r5:82492580 r4:83306400
      [  536.574604] [<806f9cac>] (phy_disconnect) from [<7f00a310>] (smsc95xx_disconnect_phy+0x30/0x38 [smsc95xx])
      [  536.584290]  r5:82492580 r4:82492580
      [  536.587868] [<7f00a2e0>] (smsc95xx_disconnect_phy [smsc95xx]) from [<7f001570>] (usbnet_stop+0x70/0x1a0 [usbnet])
      [  536.598161]  r5:82492580 r4:82492000
      [  536.601740] [<7f001500>] (usbnet_stop [usbnet]) from [<808baa70>] (__dev_close_many+0xb4/0x12c)
      [  536.610466]  r8:83744000 r7:00000000 r6:83744000 r5:83745b74 r4:82492000
      [  536.617170] [<808ba9bc>] (__dev_close_many) from [<808bab78>] (dev_close_many+0x90/0x120)
      [  536.625365]  r7:00000001 r6:83745b74 r5:83745b8c r4:82492000
      [  536.631026] [<808baae8>] (dev_close_many) from [<808bf408>] (unregister_netdevice_many+0x15c/0x704)
      [  536.640094]  r9:00000001 r8:81130b98 r7:83745b74 r6:83745bc4 r5:83745b8c r4:82492000
      [  536.647840] [<808bf2ac>] (unregister_netdevice_many) from [<808bfa50>] (unregister_netdevice_queue+0xa0/0xe8)
      [  536.657775]  r10:8112bcc0 r9:83306c00 r8:83306c80 r7:8291e420 r6:83744000 r5:00000000
      [  536.665608]  r4:82492000
      [  536.668143] [<808bf9b0>] (unregister_netdevice_queue) from [<808bfac0>] (unregister_netdev+0x28/0x30)
      [  536.677381]  r6:7f01003c r5:82492000 r4:82492000
      [  536.682000] [<808bfa98>] (unregister_netdev) from [<7f000b40>] (usbnet_disconnect+0x64/0xdc [usbnet])
      [  536.691241]  r5:82492000 r4:82492580
      [  536.694819] [<7f000adc>] (usbnet_disconnect [usbnet]) from [<8076b958>] (usb_unbind_interface+0x80/0x248)
      [  536.704406]  r5:7f01003c r4:83306c80
      [  536.707984] [<8076b8d8>] (usb_unbind_interface) from [<8061765c>] (device_release_driver_internal+0x1c4/0x1cc)
      [  536.718005]  r10:8112bcc0 r9:80dff1dc r8:83306c80 r7:83744000 r6:7f01003c r5:00000000
      [  536.725838]  r4:8291e420
      [  536.728373] [<80617498>] (device_release_driver_internal) from [<80617684>] (device_release_driver+0x20/0x24)
      [  536.738302]  r7:83744000 r6:810d4f4c r5:8291e420 r4:8176ae30
      [  536.743963] [<80617664>] (device_release_driver) from [<806156cc>] (bus_remove_device+0xf0/0x148)
      [  536.752858] [<806155dc>] (bus_remove_device) from [<80610018>] (device_del+0x198/0x41c)
      [  536.760880]  r7:83744000 r6:8116e2e4 r5:8291e464 r4:8291e420
      [  536.766542] [<8060fe80>] (device_del) from [<80768fe8>] (usb_disable_device+0xcc/0x1e0)
      [  536.774576]  r10:8112bcc0 r9:80dff1dc r8:00000001 r7:8112bc48 r6:8291e400 r5:00000001
      [  536.782410]  r4:83306c00
      [  536.784945] [<80768f1c>] (usb_disable_device) from [<80769c30>] (usb_set_configuration+0x514/0x8dc)
      [  536.794011]  r10:00000000 r9:00000000 r8:832c3600 r7:00000004 r6:810d5688 r5:00000000
      [  536.801844]  r4:83306c00
      [  536.804379] [<8076971c>] (usb_set_configuration) from [<80775fac>] (usb_generic_driver_disconnect+0x34/0x38)
      [  536.814236]  r10:832c3610 r9:83745ef8 r8:832c3600 r7:00000004 r6:810d5688 r5:83306c00
      [  536.822069]  r4:83306c00
      [  536.824605] [<80775f78>] (usb_generic_driver_disconnect) from [<8076b850>] (usb_unbind_device+0x30/0x70)
      [  536.834100]  r5:83306c00 r4:810d5688
      [  536.837678] [<8076b820>] (usb_unbind_device) from [<8061765c>] (device_release_driver_internal+0x1c4/0x1cc)
      [  536.847432]  r5:822fb480 r4:83306c80
      [  536.851009] [<80617498>] (device_release_driver_internal) from [<806176a8>] (device_driver_detach+0x20/0x24)
      [  536.860853]  r7:00000004 r6:810d4f4c r5:810d5688 r4:83306c80
      [  536.866515] [<80617688>] (device_driver_detach) from [<80614d98>] (unbind_store+0x70/0xe4)
      [  536.874793] [<80614d28>] (unbind_store) from [<80614118>] (drv_attr_store+0x30/0x3c)
      [  536.882554]  r7:00000000 r6:00000000 r5:83739200 r4:80614d28
      [  536.888217] [<806140e8>] (drv_attr_store) from [<8035cb68>] (sysfs_kf_write+0x48/0x54)
      [  536.896154]  r5:83739200 r4:806140e8
      [  536.899732] [<8035cb20>] (sysfs_kf_write) from [<8035be84>] (kernfs_fop_write_iter+0x11c/0x1d4)
      [  536.908446]  r5:83739200 r4:00000004
      [  536.912024] [<8035bd68>] (kernfs_fop_write_iter) from [<802b87fc>] (vfs_write+0x258/0x3e4)
      [  536.920317]  r10:00000000 r9:83745f58 r8:83744000 r7:00000000 r6:00000004 r5:00000000
      [  536.928151]  r4:82adacc0
      [  536.930687] [<802b85a4>] (vfs_write) from [<802b8b0c>] (ksys_write+0x74/0xf4)
      [  536.937842]  r10:00000004 r9:007767a0 r8:83744000 r7:00000000 r6:00000000 r5:82adacc0
      [  536.945676]  r4:82adacc0
      [  536.948213] [<802b8a98>] (ksys_write) from [<802b8ba4>] (sys_write+0x18/0x1c)
      [  536.955367]  r10:00000004 r9:83744000 r8:80100244 r7:00000004 r6:76f47b58 r5:76fc0350
      [  536.963200]  r4:00000004
      [  536.965735] [<802b8b8c>] (sys_write) from [<80100060>] (ret_fast_syscall+0x0/0x48)
      [  536.973320] Exception stack(0x83745fa8 to 0x83745ff0)
      [  536.978383] 5fa0:                   00000004 76fc0350 00000001 007767a0 00000004 00000000
      [  536.986569] 5fc0: 00000004 76fc0350 76f47b58 00000004 76f47c7c 76f48114 00000000 7e87991c
      [  536.994753] 5fe0: 00000498 7e879908 76e6dce8 76eca2e8
      [  536.999922] ---[ end trace 9b835d809816b435 ]---
      
      The driver should not connect and disconnect the PHY when the device
      is opened and closed; it should start and stop the PHY. The PHY should
      be connected as part of binding and disconnected during unbinding.
      
      Because the PHY is then no longer reset during open, link speed and
      other settings configured before the link comes up are no longer lost.
      
      It is necessary for phy_stop() to only be called when the phydev still
      exists (resolving the above stack trace). When unbinding, ".unbind" will be
      called prior to ".stop", with phy_disconnect() already having called
      phy_stop() before the phydev becomes inaccessible.
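
      The lifecycle described above can be pictured with a small userspace
      model. This is only a sketch: struct demo_phy and the demo_* functions
      are hypothetical stand-ins, not the smsc95xx or phylib API. It assumes
      that connecting resets the PHY settings, which is why connecting on
      every open would lose them.

```c
#include <stdbool.h>

/* Hypothetical model of a PHY; not the kernel's struct phy_device. */
struct demo_phy {
	bool connected;   /* attached to the network device */
	bool running;     /* PHY state machine started */
	int speed;        /* user-configured link speed, 0 = default */
};

/* Bind: connect once. Assumption: connecting resets the settings. */
static void demo_bind(struct demo_phy *phy)
{
	phy->connected = true;
	phy->running = false;
	phy->speed = 0;
}

/* Open: only start the PHY, so earlier settings survive. */
static void demo_open(struct demo_phy *phy)
{
	if (phy->connected)
		phy->running = true;
}

/* Stop: only touch the PHY while it still exists. */
static void demo_stop(struct demo_phy *phy)
{
	if (phy->connected)
		phy->running = false;
}

/* Unbind: disconnecting also stops the PHY. */
static void demo_unbind(struct demo_phy *phy)
{
	phy->running = false;
	phy->connected = false;
}
```

      In this model a speed set after bind survives any number of open/stop
      cycles, and a stop that arrives after unbind does nothing.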
      
      Signed-off-by: Martyn Welch <martyn.welch@collabora.com>
      Cc: Steve Glendinning <steve.glendinning@shawell.net>
      Cc: UNGLinuxDriver@microchip.com
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: stable@kernel.org # v5.15
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: chelsio: cxgb4vf: Fix an error code in cxgb4vf_pci_probe() · b82d71c0
      Zheyu Ma authored
      During driver probing, the probe function should return a value < 0
      on failure; otherwise the kernel treats a return value of 0 as success.
      
      Therefore, set err to -EINVAL when adapter->registered_device_map is
      NULL. Otherwise the kernel will assume that the driver has probed
      successfully, which will cause unexpected errors.
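
      The pattern can be illustrated with a minimal userspace sketch. The
      names demo_adapter, demo_probe_buggy and demo_probe_fixed are
      hypothetical, and the map is modelled as a plain bitmap that is zero
      when no device was registered; the real probe function is considerably
      longer.

```c
#include <errno.h>

/* Hypothetical stand-in for the adapter state consulted by probe. */
struct demo_adapter {
	unsigned long registered_device_map;   /* 0: nothing registered */
};

/* Buggy shape: err still holds 0 from the last successful step. */
static int demo_probe_buggy(const struct demo_adapter *adap)
{
	int err = 0;

	if (adap->registered_device_map == 0)
		goto err_out;          /* err is 0, so the caller sees success */
	return 0;

err_out:
	return err;
}

/* Fixed shape: set a negative error code before taking the error path. */
static int demo_probe_fixed(const struct demo_adapter *adap)
{
	int err = 0;

	if (adap->registered_device_map == 0) {
		err = -EINVAL;
		goto err_out;
	}
	return 0;

err_out:
	return err;
}
```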
      
      Signed-off-by: Zheyu Ma <zheyuma97@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8169: fix incorrect mac address assignment · c75a9ad4
      Heiner Kallweit authored
      The original change breaks MAC address assignment on older chip
      versions (see bug report [0]), and it breaks random MAC assignment.
      
      is_valid_ether_addr() requires that its argument is word-aligned.
      Add the missing alignment to array mac_addr.
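
      As a rough userspace illustration of the alignment requirement, the
      sketch below declares a word-aligned address buffer and passes it to a
      simplified validity check. demo_is_valid_ether_addr and
      demo_is_word_aligned are hypothetical helpers, not the kernel
      functions, and the check only approximates the real one (it rejects
      all-zero and multicast addresses).

```c
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ETH_ALEN 6

/* True when p is suitably aligned for 16-bit loads. */
static bool demo_is_word_aligned(const void *p)
{
	return ((uintptr_t)p % alignof(uint16_t)) == 0;
}

/*
 * Simplified imitation of an address validity check:
 * reject the all-zero address and multicast addresses.
 */
static bool demo_is_valid_ether_addr(const uint8_t *addr)
{
	uint16_t w[3];

	/* memcpy keeps this demo well defined for any alignment. */
	memcpy(w, addr, sizeof(w));
	if ((w[0] | w[1] | w[2]) == 0)
		return false;              /* all zeroes */
	return (addr[0] & 0x01) == 0;      /* multicast bit must be clear */
}
```

      A caller would declare the buffer as, for example,
      "alignas(2) uint8_t mac_addr[DEMO_ETH_ALEN]". In kernel code the
      corresponding annotation is typically __aligned(2) on the array
      declaration.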
      
      [0] https://bugzilla.kernel.org/show_bug.cgi?id=215087
      
      Fixes: 1c5d09d5 ("ethernet: r8169: use eth_hw_addr_set()")
      Reported-by: Richard Herbert <rherbert@sympatico.ca>
      Tested-by: Richard Herbert <rherbert@sympatico.ca>
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · 52911bb6
      David S. Miller authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2021-11-22
      
      Maciej Fijalkowski says:
      
      Here are the two fixes for issues around ethtool's set_channels()
      callback for ice driver. Both are related to XDP resources. First one
      corrects the size of vsi->txq_map that is used to track the usage of Tx
      resources and the second one prevents the wrong refcounting of bpf_prog.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ipa-fixes' · 60ebd673
      David S. Miller authored
      Alex Elder says:
      
      ====================
      net: ipa: prevent shutdown during setup
      
      The setup phase of the IPA driver occurs in one of two ways.
      Normally, it is done directly by the main driver probe function.
      But some systems (those having a "modem-init" DTS property) don't
      start setup until an SMP2P interrupt (sent by the modem) arrives.
      
      Because it isn't performed by the probe function, setup on
      "modem-init" systems could be underway at the time a driver
      remove (or shutdown) request arrives (or vice-versa).  This
      situation can lead to hardware state not being cleaned up
      properly.
      
      This series addresses this problem by having the driver remove
      function disable the setup interrupt.  A consequence of this is
      that setup will complete if it is underway when the remove function
      is called.
      
      So now, when removing the driver, setup:
        - will have already completed;
        - is underway, and will complete before proceeding; or
        - will not have begun (and will not occur).
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: separate disabling setup from modem stop · 8afc7e47
      Alex Elder authored
      The IPA setup_complete flag is set at the end of ipa_setup(), when
      the setup phase of initialization has completed successfully.  This
      occurs as part of driver probe processing, or (if "modem-init" is
      specified in the DTS file) it is triggered by the "ipa-setup-ready"
      SMP2P interrupt generated by the modem.
      
      In the latter case, it's possible for driver shutdown (or remove) to
      begin while setup processing is underway, and this can't be allowed.
      The problem is that the setup_complete flag is not adequate to signal
      that setup is underway.
      
      If setup_complete is set, it will never be un-set, so that case is
      not a problem.  But if setup_complete is false, there's a chance
      setup is underway.
      
      Because setup is triggered by an interrupt on a "modem-init" system,
      there is a simple way to ensure the value of setup_complete is safe
      to read.  The threaded handler--if it is executing--will complete as
      part of a request to disable the "ipa-setup-ready" interrupt.  This
      means that ipa_setup() (which is called from the handler) will run
      to completion if it was underway, or will never be called otherwise.
      
      The request to disable the "ipa-setup-ready" interrupt is currently
      made within ipa_modem_stop().  Instead, disable the interrupt
      outside that function in the two places it's called.  In the case of
      ipa_remove(), this ensures the setup_complete flag is safe to read
      before we read it.
      
      Rename ipa_smp2p_disable() to be ipa_smp2p_irq_disable_setup(), to be
      more specific about its effect.
      
      Fixes: 530f9216 ("soc: qcom: ipa: AP/modem communications")
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipa: directly disable ipa-setup-ready interrupt · 33a15310
      Alex Elder authored
      We currently maintain a "disabled" Boolean flag to determine whether
      the "ipa-setup-ready" SMP2P IRQ handler does anything.  That flag
      must be accessed under protection of a mutex.
      
      Instead, disable the SMP2P interrupt when requested, which prevents
      the interrupt handler from ever being called.  More importantly, it
      synchronizes a thread disabling the interrupt with the completion of
      the interrupt handler in case they run concurrently.
      
      Use the IPA setup_complete flag rather than the disabled flag in the
      handler to determine whether to ignore any interrupts arriving after
      the first.
      
      Rename the "disabled" flag to be "setup_disabled", to be specific
      about its purpose.
      
      Fixes: 530f9216 ("soc: qcom: ipa: AP/modem communications")
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mlxsw-fixes' · bd08ee23
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      mlxsw: Two small fixes
      
      Patch #1 fixes a recent regression that prevents the driver from loading
      with old firmware versions.
      
      Patch #2 protects the driver from a NULL pointer dereference when
      working on top of a buggy firmware. This was never observed in an actual
      system, only on top of an emulator during development.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Protect driver from buggy firmware · 63b08b1f
      Amit Cohen authored
      When processing port up/down events generated by the device's firmware,
      the driver protects itself from events reported for non-existent local
      ports, but not the CPU port (local port 0), which exists, but lacks a
      netdev.
      
      This can result in a NULL pointer dereference when calling
      netif_carrier_{on,off}().
      
      Fix this by bailing early when processing an event reported for the CPU
      port. Problem was only observed when running on top of a buggy emulator.
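
      The early bail-out can be sketched in userspace as follows. All names
      here (demo_port, demo_netdev, demo_port_event, DEMO_CPU_PORT) are
      hypothetical and only model the situation described above: the CPU
      port exists in the port table but has no netdev behind it.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_CPU_PORT  0
#define DEMO_MAX_PORTS 4

struct demo_netdev { bool carrier_on; };

/* The CPU port entry exists, but its dev pointer is NULL. */
struct demo_port { struct demo_netdev *dev; };

static int demo_port_event(struct demo_port *ports[DEMO_MAX_PORTS],
			   unsigned int local_port, bool up)
{
	/* Existing protection: ignore events for non-existent ports. */
	if (local_port >= DEMO_MAX_PORTS || !ports[local_port])
		return -EINVAL;

	/* Added protection: the CPU port has no netdev to update. */
	if (local_port == DEMO_CPU_PORT)
		return -EINVAL;

	ports[local_port]->dev->carrier_on = up;
	return 0;
}
```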
      
      Fixes: 28b1987e ("mlxsw: spectrum: Register CPU port with devlink")
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Allow driver to load with old firmware versions · ce4995bc
      Danielle Ratson authored
      The driver fails to load with old firmware versions that cannot report
      the maximum number of RIF MAC profiles [1].
      
      Fix this by defaulting to a maximum of a single profile in such
      situations, as multiple profiles are not supported by old firmware
      versions.
      
      [1]
      mlxsw_spectrum 0000:03:00.0: cannot register bus device
      mlxsw_spectrum: probe of 0000:03:00.0 failed with error -5
      
      Fixes: 1c375ffb ("mlxsw: spectrum_router: Expose RIF MAC profiles to devlink resource")
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Reported-by: Vadim Pasternak <vadimp@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'smc-fixes' · 5789d04b
      David S. Miller authored
      Tony Lu says:
      
      ====================
      smc: Fixes for closing process and minor cleanup
      
      Patch 1 is a minor cleanup for local struct sock variables.
      
      Patch 2 ensures the active closing side enters TIME_WAIT.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Ensure the active closing peer first closes clcsock · 606a63c9
      Tony Lu authored
      The clcsock of the side that actively closes the socket doesn't enter
      TIME_WAIT state, but that of the passive side does. It should show the
      same behavior as TCP sockets.
      
      Consider this: when the client actively closes the socket, the clcsock
      on the server enters TIME_WAIT state, which means the address is
      occupied and won't be reused until TIME_WAIT expires. If we restarted
      the server, the service would be unavailable for a long time.
      
      To solve this issue, shut down the clcsock at [A], performing the TCP
      active close process first, before the passively closed side closes it,
      so that the actively closing side enters TIME_WAIT, not the passive one.
      
      Client                                            |  Server
      close() // client actively close                  |
        smc_release()                                   |
            smc_close_active() // PEERCLOSEWAIT1        |
                smc_close_final() // abort or closed = 1|
                    smc_cdc_get_slot_and_msg_send()     |
                [A]                                     |
                                                        |smc_cdc_msg_recv_action() // ACTIVE
                                                        |  queue_work(smc_close_wq, &conn->close_work)
                                                        |    smc_close_passive_work() // PROCESSABORT or APPCLOSEWAIT1
                                                        |      smc_close_passive_abort_received() // only in abort
                                                        |
                                                        |close() // server recv zero, close
                                                        |  smc_release() // PROCESSABORT or APPCLOSEWAIT1
                                                        |    smc_close_active()
                                                        |      smc_close_abort() or smc_close_final() // CLOSED
                                                        |        smc_cdc_get_slot_and_msg_send() // abort or closed = 1
      smc_cdc_msg_recv_action()                         |    smc_clcsock_release()
        queue_work(smc_close_wq, &conn->close_work)     |      sock_release(tcp) // actively close clc, enter TIME_WAIT
          smc_close_passive_work() // PEERCLOSEWAIT1    |    smc_conn_free()
            smc_close_passive_abort_received() // CLOSED|
            smc_conn_free()                             |
            smc_clcsock_release()                       |
              sock_release(tcp) // passive close clc    |
      
      Link: https://www.spinics.net/lists/netdev/msg780407.html
      Fixes: b38d7324 ("smc: socket closing and linkgroup cleanup")
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Clean up local struct sock variables · 45c3ff7a
      Tony Lu authored
      Some variables remain that should be replaced with the local struct
      sock variable, so clean them all up.
      
      Fixes: 3163c507 ("net/smc: use local struct sock variables consistently")
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: nexthop: fix null pointer dereference when IPv6 is not enabled · 1c743127
      Nikolay Aleksandrov authored
      When we try to add an IPv6 nexthop and IPv6 is not enabled
      (!CONFIG_IPV6) we'll hit a NULL pointer dereference[1] in the error path
      of nh_create_ipv6() due to calling ipv6_stub->fib6_nh_release. The bug
      has been present since the beginning of IPv6 nexthop gateway support.
      Commit 1aefd3de ("ipv6: Add fib6_nh_init and release to stubs") tells
      us that only fib6_nh_init has a dummy stub because fib6_nh_release should
      not be called if fib6_nh_init returns an error, but the commit below added
      a call to ipv6_stub->fib6_nh_release in its error path. To fix it, return
      the dummy stub's -EAFNOSUPPORT error directly without calling
      ipv6_stub->fib6_nh_release in nh_create_ipv6()'s error path.
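
      The failure mode and the fix can be modelled in userspace with a stub
      table whose release callback is left NULL. This is a sketch under
      stated assumptions: demo_ipv6_stub, demo_dummy_nh_init and
      demo_nh_create_ipv6 are hypothetical names, and the callbacks take no
      arguments for brevity.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stub table: only init has a dummy implementation. */
struct demo_ipv6_stub {
	int (*nh_init)(void);
	void (*nh_release)(void);   /* left NULL in the dummy stub */
};

/* Dummy init used when IPv6 support is not built in. */
static int demo_dummy_nh_init(void)
{
	return -EAFNOSUPPORT;
}

static const struct demo_ipv6_stub demo_dummy_stub = {
	.nh_init = demo_dummy_nh_init,
	.nh_release = NULL,
};

static int demo_nh_create_ipv6(const struct demo_ipv6_stub *stub)
{
	int err = stub->nh_init();

	/* Dummy stub: return directly, calling a NULL release would oops. */
	if (err == -EAFNOSUPPORT)
		return err;

	/* Other errors keep the existing error path with its release call. */
	if (err) {
		stub->nh_release();
		return err;
	}
	return 0;
}
```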
      
      [1]
       Output is a bit truncated, but it clearly shows the error.
       BUG: kernel NULL pointer dereference, address: 0000000000000000
       #PF: supervisor instruction fetch in kernel mode
       #PF: error_code(0x0010) - not-present page
       PGD 0 P4D 0
       Oops: 0010 [#1] PREEMPT SMP NOPTI
       CPU: 4 PID: 638 Comm: ip Kdump: loaded Not tainted 5.16.0-rc1+ #446
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014
       RIP: 0010:0x0
       Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
       RSP: 0018:ffff888109f5b8f0 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: ffff888109f5ba28 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8881008a2860
       RBP: ffff888109f5b9d8 R08: 0000000000000000 R09: 0000000000000000
       R10: ffff888109f5b978 R11: ffff888109f5b948 R12: 00000000ffffff9f
       R13: ffff8881008a2a80 R14: ffff8881008a2860 R15: ffff8881008a2840
       FS:  00007f98de70f100(0000) GS:ffff88822bf00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffffffffffffffd6 CR3: 0000000100efc000 CR4: 00000000000006e0
       Call Trace:
        <TASK>
        nh_create_ipv6+0xed/0x10c
        rtm_new_nexthop+0x6d7/0x13f3
        ? check_preemption_disabled+0x3d/0xf2
        ? lock_is_held_type+0xbe/0xfd
        rtnetlink_rcv_msg+0x23f/0x26a
        ? check_preemption_disabled+0x3d/0xf2
        ? rtnl_calcit.isra.0+0x147/0x147
        netlink_rcv_skb+0x61/0xb2
        netlink_unicast+0x100/0x187
        netlink_sendmsg+0x37f/0x3a0
        ? netlink_unicast+0x187/0x187
        sock_sendmsg_nosec+0x67/0x9b
        ____sys_sendmsg+0x19d/0x1f9
        ? copy_msghdr_from_user+0x4c/0x5e
        ? rcu_read_lock_any_held+0x2a/0x78
        ___sys_sendmsg+0x6c/0x8c
        ? asm_sysvec_apic_timer_interrupt+0x12/0x20
        ? lockdep_hardirqs_on+0xd9/0x102
        ? sockfd_lookup_light+0x69/0x99
        __sys_sendmsg+0x50/0x6e
        do_syscall_64+0xcb/0xf2
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7f98dea28914
       Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 48 8d 05 e9 5d 0c 00 8b 00 85 c0 75 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 41 89 d4 55 48 89 f5 53
       RSP: 002b:00007fff859f5e68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
       RAX: ffffffffffffffda RBX: 00000000619cb810 RCX: 00007f98dea28914
       RDX: 0000000000000000 RSI: 00007fff859f5ed0 RDI: 0000000000000003
       RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000008
       R10: fffffffffffffce6 R11: 0000000000000246 R12: 0000000000000001
       R13: 000055c0097ae520 R14: 000055c0097957fd R15: 00007fff859f63a0
       </TASK>
       Modules linked in: bridge stp llc bonding virtio_net
      
      Cc: stable@vger.kernel.org
      Fixes: 53010f99 ("nexthop: Add support for IPv6 gateways")
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • slip: fix macro redefine warning · e5b40668
      Huang Pei authored
      MIPS and IA64 define END as the assembly function-ending macro, which
      conflicts with the END definition in slip.h, so undefine it first.
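
      A minimal sketch of the pattern: the first #define below is a
      hypothetical stand-in for an architecture header, and
      demo_slip_end_byte is an illustrative helper. The value 0300 (0xC0)
      is the SLIP end-of-frame byte.

```c
/* Stand-in for an architecture header that already defines END. */
#define END demo_arch_function_end

/* Drop the earlier definition before giving END its SLIP meaning. */
#undef END
#define END 0300   /* SLIP end-of-frame byte, 0xC0 */

/* Returns the byte that terminates a SLIP frame. */
static unsigned char demo_slip_end_byte(void)
{
	return END;
}
```

      Without the #undef, the second #define would redefine END with a
      different body, which is what triggers the compiler warning.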
      
      Reported-by: lkp@intel.com
      Signed-off-by: Huang Pei <huangpei@loongson.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • hamradio: fix macro redefine warning · 16517829
      Huang Pei authored
      MIPS and IA64 define END as the assembly function-ending macro, which
      conflicts with the END definition in mkiss.c, so undefine it first.
      
      Reported-by: lkp@intel.com
      Signed-off-by: Huang Pei <huangpei@loongson.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 22 Nov, 2021 6 commits
    • ice: avoid bpf_prog refcount underflow · f65ee535
      Marta Plantykow authored
      The ice driver has routines for managing XDP resources that are shared
      between the ndo_bpf op and the VSI rebuild flow. The latter takes
      place, for example, when a user changes the queue count on an
      interface via ethtool's set_channels().
      
      There is an issue around the bpf_prog refcounting when VSI is being
      rebuilt - since ice_prepare_xdp_rings() is called with vsi->xdp_prog as
      an argument that is used later on by ice_vsi_assign_bpf_prog(), same
      bpf_prog pointers are swapped with each other. Then it is also
      interpreted as an 'old_prog' which in turn causes us to call
      bpf_prog_put on it that will decrement its refcount.
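
      The refcount problem can be reproduced with a small userspace model.
      demo_prog, demo_prog_put and both assign helpers are hypothetical
      names. The guarded variant is just one illustrative way to avoid the
      self-swap and is not necessarily how the driver resolves it.

```c
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted program object. */
struct demo_prog { int refcnt; };

static void demo_prog_put(struct demo_prog *prog)
{
	prog->refcnt--;
}

/* Swap in a new program and drop the reference held on the old one. */
static void demo_assign_prog(struct demo_prog **slot, struct demo_prog *prog)
{
	struct demo_prog *old = *slot;

	*slot = prog;
	if (old)
		demo_prog_put(old);   /* wrong when old == prog */
}

/* Illustrative guard: re-assigning the current program is a no-op. */
static void demo_assign_prog_guarded(struct demo_prog **slot,
				     struct demo_prog *prog)
{
	if (*slot == prog)
		return;
	demo_assign_prog(slot, prog);
}
```

      With a program whose refcount is 1 already installed, the unguarded
      helper leaves it installed with a refcount of 0, which matches the
      situation described above.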
      
      The splat below suggests that, because its refcount dropped to zero,
      the bpf_prog was freed while the kernel still tried to refer to it:
      
      [  481.069429] BUG: unable to handle page fault for address: ffffc9000640f038
      [  481.077390] #PF: supervisor read access in kernel mode
      [  481.083335] #PF: error_code(0x0000) - not-present page
      [  481.089276] PGD 100000067 P4D 100000067 PUD 1001cb067 PMD 106d2b067 PTE 0
      [  481.097141] Oops: 0000 [#1] PREEMPT SMP PTI
      [  481.101980] CPU: 12 PID: 3339 Comm: sudo Tainted: G           OE     5.15.0-rc5+ #1
      [  481.110840] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016
      [  481.122021] RIP: 0010:dev_xdp_prog_id+0x25/0x40
      [  481.127265] Code: 80 00 00 00 00 0f 1f 44 00 00 89 f6 48 c1 e6 04 48 01 fe 48 8b 86 98 08 00 00 48 85 c0 74 13 48 8b 50 18 31 c0 48 85 d2 74 07 <48> 8b 42 38 8b 40 20 c3 48 8b 96 90 08 00 00 eb e8 66 2e 0f 1f 84
      [  481.148991] RSP: 0018:ffffc90007b63868 EFLAGS: 00010286
      [  481.155034] RAX: 0000000000000000 RBX: ffff889080824000 RCX: 0000000000000000
      [  481.163278] RDX: ffffc9000640f000 RSI: ffff889080824010 RDI: ffff889080824000
      [  481.171527] RBP: ffff888107af7d00 R08: 0000000000000000 R09: ffff88810db5f6e0
      [  481.179776] R10: 0000000000000000 R11: ffff8890885b9988 R12: ffff88810db5f4bc
      [  481.188026] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      [  481.196276] FS:  00007f5466d5bec0(0000) GS:ffff88903fb00000(0000) knlGS:0000000000000000
      [  481.205633] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  481.212279] CR2: ffffc9000640f038 CR3: 000000014429c006 CR4: 00000000003706e0
      [  481.220530] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  481.228771] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  481.237029] Call Trace:
      [  481.239856]  rtnl_fill_ifinfo+0x768/0x12e0
      [  481.244602]  rtnl_dump_ifinfo+0x525/0x650
      [  481.249246]  ? __alloc_skb+0xa5/0x280
      [  481.253484]  netlink_dump+0x168/0x3c0
      [  481.257725]  netlink_recvmsg+0x21e/0x3e0
      [  481.262263]  ____sys_recvmsg+0x87/0x170
      [  481.266707]  ? __might_fault+0x20/0x30
      [  481.271046]  ? _copy_from_user+0x66/0xa0
      [  481.275591]  ? iovec_from_user+0xf6/0x1c0
      [  481.280226]  ___sys_recvmsg+0x82/0x100
      [  481.284566]  ? sock_sendmsg+0x5e/0x60
      [  481.288791]  ? __sys_sendto+0xee/0x150
      [  481.293129]  __sys_recvmsg+0x56/0xa0
      [  481.297267]  do_syscall_64+0x3b/0xc0
      [  481.301395]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [  481.307238] RIP: 0033:0x7f5466f39617
      [  481.311373] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb bd 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2f 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
      [  481.342944] RSP: 002b:00007ffedc7f4308 EFLAGS: 00000246 ORIG_RAX: 000000000000002f
      [  481.361783] RAX: ffffffffffffffda RBX: 00007ffedc7f5460 RCX: 00007f5466f39617
      [  481.380278] RDX: 0000000000000000 RSI: 00007ffedc7f5360 RDI: 0000000000000003
      [  481.398500] RBP: 00007ffedc7f53f0 R08: 0000000000000000 R09: 000055d556f04d50
      [  481.416463] R10: 0000000000000077 R11: 0000000000000246 R12: 00007ffedc7f5360
      [  481.434131] R13: 00007ffedc7f5350 R14: 00007ffedc7f5344 R15: 0000000000000e98
      [  481.451520] Modules linked in: ice(OE) af_packet binfmt_misc nls_iso8859_1 ipmi_ssif intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp mxm_wmi mei_me coretemp mei ipmi_si ipmi_msghandler wmi acpi_pad acpi_power_meter ip_tables x_tables autofs4 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel ahci crypto_simd cryptd libahci lpc_ich [last unloaded: ice]
      [  481.528558] CR2: ffffc9000640f038
      [  481.542041] ---[ end trace d1f24c9ecf5b61c1 ]---
      
      Fix this by only calling ice_vsi_assign_bpf_prog() inside
      ice_prepare_xdp_rings() when the current vsi->xdp_prog pointer is NULL.
      This way the set_channels() flow will not attempt to swap the
      vsi->xdp_prog pointer with itself.
      
      Also, add comments that explain how the driver and the kernel interact
      in terms of bpf_prog refcounting.
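
      As an illustration only, the self-swap can be modelled in plain
      userspace C. This is a sketch, not the driver code: prog_model,
      vsi_model and the helper names below are hypothetical stand-ins for
      bpf_prog, ice_vsi, ice_vsi_assign_bpf_prog() and
      ice_prepare_xdp_rings(), and the refcount is a bare int.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct bpf_prog and struct ice_vsi. */
struct prog_model { int refcnt; };
struct vsi_model  { struct prog_model *xdp_prog; };

/* Drop one reference; a NULL program is ignored. */
static void prog_put(struct prog_model *p)
{
    if (p)
        p->refcnt--;
}

/* Install 'prog' on the VSI and drop the reference held on the old one. */
static void assign_prog(struct vsi_model *vsi, struct prog_model *prog)
{
    struct prog_model *old = vsi->xdp_prog;

    vsi->xdp_prog = prog;
    prog_put(old);
}

/* Rebuild path before the fix: always assigns, so when the caller passes
 * vsi->xdp_prog itself, old == new and one reference is lost. */
static void prepare_rings_buggy(struct vsi_model *vsi, struct prog_model *prog)
{
    assign_prog(vsi, prog);
}

/* Rebuild path after the fix: assign only when nothing is attached yet. */
static void prepare_rings_fixed(struct vsi_model *vsi, struct prog_model *prog)
{
    if (!vsi->xdp_prog)
        assign_prog(vsi, prog);
}
```

      In this model, a program with a single reference that is passed back
      in through the buggy path ends up with a refcount of zero while it is
      still attached. The fixed path leaves the count untouched.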
      
      Fixes: efc2214b ("ice: Add support for XDP")
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Signed-off-by: Marta Plantykow <marta.a.plantykow@intel.com>
      Co-developed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      f65ee535
    • Maciej Fijalkowski's avatar
      ice: fix vsi->txq_map sizing · 792b2086
      Maciej Fijalkowski authored
      The approach of having an XDP queue per CPU regardless of the user's
      setting exposed a hidden bug that can occur when the Rx queue count
      differs from the Tx queue count. Currently the size of vsi->txq_map is
      twice vsi->alloc_txq, which is not correct because XDP rings were
      previously based on the Rx queue count. The splat below can be seen when
      ethtool -L is used and XDP rings are configured:
      
      [  682.875339] BUG: kernel NULL pointer dereference, address: 000000000000000f
      [  682.883403] #PF: supervisor read access in kernel mode
      [  682.889345] #PF: error_code(0x0000) - not-present page
      [  682.895289] PGD 0 P4D 0
      [  682.898218] Oops: 0000 [#1] PREEMPT SMP PTI
      [  682.903055] CPU: 42 PID: 2878 Comm: ethtool Tainted: G           OE     5.15.0-rc5+ #1
      [  682.912214] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016
      [  682.923380] RIP: 0010:devres_remove+0x44/0x130
      [  682.928527] Code: 49 89 f4 55 48 89 fd 4c 89 ff 53 48 83 ec 10 e8 92 b9 49 00 48 8b 9d a8 02 00 00 48 8d 8d a0 02 00 00 49 89 c2 48 39 cb 74 0f <4c> 3b 63 10 74 25 48 8b 5b 08 48 39 cb 75 f1 4c 89 ff 4c 89 d6 e8
      [  682.950237] RSP: 0018:ffffc90006a679f0 EFLAGS: 00010002
      [  682.956285] RAX: 0000000000000286 RBX: ffffffffffffffff RCX: ffff88908343a370
      [  682.964538] RDX: 0000000000000001 RSI: ffffffff81690d60 RDI: 0000000000000000
      [  682.972789] RBP: ffff88908343a0d0 R08: 0000000000000000 R09: 0000000000000000
      [  682.981040] R10: 0000000000000286 R11: 3fffffffffffffff R12: ffffffff81690d60
      [  682.989282] R13: ffffffff81690a00 R14: ffff8890819807a8 R15: ffff88908343a36c
      [  682.997535] FS:  00007f08c7bfa740(0000) GS:ffff88a03fd00000(0000) knlGS:0000000000000000
      [  683.006910] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  683.013557] CR2: 000000000000000f CR3: 0000001080a66003 CR4: 00000000003706e0
      [  683.021819] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  683.030075] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  683.038336] Call Trace:
      [  683.041167]  devm_kfree+0x33/0x50
      [  683.045004]  ice_vsi_free_arrays+0x5e/0xc0 [ice]
      [  683.050380]  ice_vsi_rebuild+0x4c8/0x750 [ice]
      [  683.055543]  ice_vsi_recfg_qs+0x9a/0x110 [ice]
      [  683.060697]  ice_set_channels+0x14f/0x290 [ice]
      [  683.065962]  ethnl_set_channels+0x333/0x3f0
      [  683.070807]  genl_family_rcv_msg_doit+0xea/0x150
      [  683.076152]  genl_rcv_msg+0xde/0x1d0
      [  683.080289]  ? channels_prepare_data+0x60/0x60
      [  683.085432]  ? genl_get_cmd+0xd0/0xd0
      [  683.089667]  netlink_rcv_skb+0x50/0xf0
      [  683.094006]  genl_rcv+0x24/0x40
      [  683.097638]  netlink_unicast+0x239/0x340
      [  683.102177]  netlink_sendmsg+0x22e/0x470
      [  683.106717]  sock_sendmsg+0x5e/0x60
      [  683.110756]  __sys_sendto+0xee/0x150
      [  683.114894]  ? handle_mm_fault+0xd0/0x2a0
      [  683.119535]  ? do_user_addr_fault+0x1f3/0x690
      [  683.134173]  __x64_sys_sendto+0x25/0x30
      [  683.148231]  do_syscall_64+0x3b/0xc0
      [  683.161992]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fix this by taking into account the value that num_possible_cpus()
      yields in addition to vsi->alloc_txq instead of doubling the latter.
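
      The sizing arithmetic can be sketched as follows. This is a
      simplified userspace model, not the driver code: the function names
      are hypothetical, and it assumes one XDP ring per possible CPU, as
      the per-CPU XDP queue approach described above implies.

```c
#include <assert.h>

/* Old sizing: room for the regular Tx queues plus an equal number of
 * XDP rings, i.e. twice alloc_txq. */
static unsigned int txq_map_size_old(unsigned int alloc_txq)
{
    return 2 * alloc_txq;
}

/* New sizing: room for the regular Tx queues plus one XDP ring per
 * possible CPU. */
static unsigned int txq_map_size_new(unsigned int alloc_txq,
                                     unsigned int possible_cpus)
{
    return alloc_txq + possible_cpus;
}

/* Entries actually needed once XDP rings are configured. */
static unsigned int txq_map_needed(unsigned int num_txq,
                                   unsigned int num_xdp_rings)
{
    return num_txq + num_xdp_rings;
}
```

      For example, with 4 Tx queues on a machine with 16 possible CPUs, the
      old sizing yields 8 entries while 20 are needed. The new sizing
      yields 20.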
      
      Fixes: efc2214b ("ice: Add support for XDP")
      Fixes: 22bf877e ("ice: introduce XDP_TX fallback path")
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      792b2086
    • David S. Miller's avatar
      Merge branch 'nh-group-refcnt' · 03a000bf
      David S. Miller authored
      Nikolay Aleksandrov says:
      
      ====================
      net: nexthop: fix refcount issues when replacing groups
      
      This set fixes a refcount bug when replacing nexthop groups and
      modifying routes. It is complex because the objects look valid when
      debugging memory dumps, but we end up with a refcount dependency
      between unlinked objects which can never be released, so in turn they
      cannot free their resources and refcounts. The problem happens because
      we can have stale IPv6 per-cpu dsts in nexthops which were removed from
      a group. Even though the IPv6 gen is bumped, the dsts won't be released
      until traffic passes through them or the nexthop is freed. That can
      take an arbitrarily long time, and even worse, we can create a
      scenario[1] where they can never be released. The fix is to release the
      IPv6 per-cpu dsts of replaced nexthops after an RCU grace period so no
      new ones can be created. To do that we add a new IPv6 stub,
      fib6_nh_release_dsts, which is used by the nexthop code only when
      necessary. We can further optimize group replacement, but that is more
      suited for net-next as these patches would have to be backported to
      stable releases.
      
      v2: patch 02: update commit msg
          patch 03: check for mausezahn before testing and make a few comments
                    more verbose
      
      [1]
      This info is also present in patch 02's commit message.
      Initial state:
       $ ip nexthop list
        id 200 via 2002:db8::2 dev bridge.10 scope link onlink
        id 201 via 2002:db8::3 dev bridge scope link onlink
        id 203 group 201/200
       $ ip -6 route
        2001:db8::10 nhid 203 metric 1024 pref medium
           nexthop via 2002:db8::3 dev bridge weight 1 onlink
           nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink
      
      Create rt6_info through one of the multipath legs, e.g.:
       $ taskset -a -c 1  ./pkt_inj 24 bridge.10 2001:db8::10
       (pkt_inj is just a custom packet generator, nothing special)
      
      Then remove that leg from the group by replace (let's assume it is id
      200 in this case):
       $ ip nexthop replace id 203 group 201
      
      Now remove the IPv6 route:
       $ ip -6 route del 2001:db8::10/128
      
      The route won't be really deleted due to the stale rt6_info holding 1
      refcnt in nexthop id 200.
      At this point we have the following reference count dependency:
       (deleted) IPv6 route holds 1 reference over nhid 203
       nh 203 holds 1 ref over id 201
       nh 200 holds 1 ref over the net device and the route due to the stale
       rt6_info
      
      Now to create a circular dependency between nh 200 and the IPv6 route,
      and also to get a reference over nh 200, restore nhid 200 in the group:
       $ ip nexthop replace id 203 group 201/200
      
      And now we have a permanent circular dependency because nhid 203 holds a
      reference over nh 200 and 201, but the route holds a ref over nh 203 and
      is deleted.
      
      To trigger the bug just delete the group (nhid 203):
       $ ip nexthop del id 203
      
      It won't really be deleted due to the IPv6 route dependency, and now we
      have 2 unlinked and deleted objects that reference each other: the group
      and the IPv6 route. The group drops the references it holds over its
      entries only at free time (i.e. when its own refcount drops to 0). That
      will never happen, so we get a permanent ref on them, and since one of
      the entries holds a reference over the IPv6 route, the route will never
      be released either.
      
      At this point the dependencies are:
       (deleted, only unlinked) IPv6 route holds reference over group nh 203
       (deleted, only unlinked) group nh 203 holds reference over nh 201 and 200
       nh 200 holds 1 ref over the net device and the route due to the stale
       rt6_info
      
      This is the last point where it can be fixed by running traffic through
      nh 200, and specifically through the same CPU so the rt6_info (dst) will
      get released due to the IPv6 genid, that in turn will free the IPv6
      route, which in turn will free the ref count over the group nh 203.
      
      If nh 200 is deleted at this point, it will never be released due to the
      ref from the unlinked group 203; it will only be unlinked:
       $ ip nexthop del id 200
       $ ip nexthop
       $
      
      Now we can never release that stale rt6_info: we have an IPv6 route
      with a ref over group nh 203, group nh 203 with a ref over nh 200 and
      201, and nh 200 with an rt6_info (dst) with a ref over the net device
      and the IPv6 route. All of these objects are only unlinked and cannot
      be released, thus they can't release their ref counts.
      
       Message from syslogd@dev at Nov 19 14:04:10 ...
        kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
       Message from syslogd@dev at Nov 19 14:04:20 ...
        kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
      
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      03a000bf
    • Nikolay Aleksandrov's avatar
      selftests: net: fib_nexthops: add test for group refcount imbalance bug · 02ebe49a
      Nikolay Aleksandrov authored
      The new selftest runs a sequence which causes a circular refcount
      dependency between deleted objects which cannot be released and results
      in a netdevice refcount imbalance.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      02ebe49a
    • Nikolay Aleksandrov's avatar
      net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group · 1005f19b
      Nikolay Aleksandrov authored
      When replacing a nexthop group, we must release the IPv6 per-cpu dsts of
      the removed nexthop entries after an RCU grace period because they
      contain references to the nexthop's net device and to the fib6 info.
      With a specific series of events[1] we can reach a net device refcount
      imbalance which is unrecoverable. IPv4 is not affected because its dsts
      don't take a refcount on the route.
      
      [1]
       $ ip nexthop list
        id 200 via 2002:db8::2 dev bridge.10 scope link onlink
        id 201 via 2002:db8::3 dev bridge scope link onlink
        id 203 group 201/200
       $ ip -6 route
        2001:db8::10 nhid 203 metric 1024 pref medium
           nexthop via 2002:db8::3 dev bridge weight 1 onlink
           nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink
      
      Create rt6_info through one of the multipath legs, e.g.:
       $ taskset -a -c 1  ./pkt_inj 24 bridge.10 2001:db8::10
       (pkt_inj is just a custom packet generator, nothing special)
      
      Then remove that leg from the group by replace (let's assume it is id
      200 in this case):
       $ ip nexthop replace id 203 group 201
      
      Now remove the IPv6 route:
       $ ip -6 route del 2001:db8::10/128
      
      The route won't be really deleted due to the stale rt6_info holding 1
      refcnt in nexthop id 200.
      At this point we have the following reference count dependency:
       (deleted) IPv6 route holds 1 reference over nhid 203
       nh 203 holds 1 ref over id 201
       nh 200 holds 1 ref over the net device and the route due to the stale
       rt6_info
      
      Now to create a circular dependency between nh 200 and the IPv6 route,
      and also to get a reference over nh 200, restore nhid 200 in the group:
       $ ip nexthop replace id 203 group 201/200
      
      And now we have a permanent circular dependency because nhid 203 holds a
      reference over nh 200 and 201, but the route holds a ref over nh 203 and
      is deleted.
      
      To trigger the bug just delete the group (nhid 203):
       $ ip nexthop del id 203
      
      It won't really be deleted due to the IPv6 route dependency, and now we
      have 2 unlinked and deleted objects that reference each other: the group
      and the IPv6 route. The group drops the references it holds over its
      entries only at free time (i.e. when its own refcount drops to 0). That
      will never happen, so we get a permanent ref on them, and since one of
      the entries holds a reference over the IPv6 route, the route will never
      be released either.
      
      At this point the dependencies are:
       (deleted, only unlinked) IPv6 route holds reference over group nh 203
       (deleted, only unlinked) group nh 203 holds reference over nh 201 and 200
       nh 200 holds 1 ref over the net device and the route due to the stale
       rt6_info
      
      This is the last point where it can be fixed by running traffic through
      nh 200, and specifically through the same CPU so the rt6_info (dst) will
      get released due to the IPv6 genid, that in turn will free the IPv6
      route, which in turn will free the ref count over the group nh 203.
      
      If nh 200 is deleted at this point, it will never be released due to the
      ref from the unlinked group 203; it will only be unlinked:
       $ ip nexthop del id 200
       $ ip nexthop
       $
      
      Now we can never release that stale rt6_info: we have an IPv6 route
      with a ref over group nh 203, group nh 203 with a ref over nh 200 and
      201, and nh 200 with an rt6_info (dst) with a ref over the net device
      and the IPv6 route. All of these objects are only unlinked and cannot
      be released, thus they can't release their ref counts.
      
       Message from syslogd@dev at Nov 19 14:04:10 ...
        kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
       Message from syslogd@dev at Nov 19 14:04:20 ...
        kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
      
      Fixes: 7bf4796d ("nexthops: add support for replace")
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1005f19b
    • Nikolay Aleksandrov's avatar
      net: ipv6: add fib6_nh_release_dsts stub · 8837cbbf
      Nikolay Aleksandrov authored
      We need a way to release a fib6_nh's per-cpu dsts when replacing
      nexthops; otherwise we can end up with stale per-cpu dsts which hold net
      device references. Add a new IPv6 stub called fib6_nh_release_dsts.
      It must be used after an RCU grace period, so that no new dsts can be
      created through a group's nexthop entry.
      Similar to fib6_nh_release, it shouldn't be used if fib6_nh_init has
      failed, so it doesn't need a dummy stub when IPv6 is not enabled.
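
      The idea can be sketched in userspace C as below. This is only a
      model under stated assumptions, not the kernel implementation: the
      *_model types, NR_CPUS_MODEL and nh_release_dsts() are hypothetical
      names, the per-cpu storage is a plain array, and RCU and atomics
      are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_MODEL 4  /* hypothetical CPU count used by this model */

/* Hypothetical stand-ins for a net device, a cached dst and a nexthop. */
struct netdev_model { int refcnt; };
struct dst_model    { struct netdev_model *dev; };
struct nh_model     { struct dst_model *pcpu_dst[NR_CPUS_MODEL]; };

/* Walk every per-cpu slot, detach the cached dst if there is one and
 * drop the device reference that the dst was holding. */
static void nh_release_dsts(struct nh_model *nh)
{
    for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
        struct dst_model *dst = nh->pcpu_dst[cpu];

        nh->pcpu_dst[cpu] = NULL;
        if (dst) {
            dst->dev->refcnt--;
            dst->dev = NULL;
        }
    }
}
```

      In the model, a device with one base reference plus two cached dsts
      drops back to a refcount of 1 after the call, and a second call is a
      no-op because every slot is already empty.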
      
      Fixes: 7bf4796d ("nexthops: add support for replace")
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8837cbbf