1. 31 Dec, 2019 7 commits
    • tcp: Fix highest_sack and highest_sack_seq · 85369750
      Cambda Zhu authored
      From commit 50895b9d ("tcp: highest_sack fix"), the logic about
      setting tp->highest_sack to the head of the send queue was removed.
      Of course the logic is error prone, but it is logical. Before we
      remove the pointer to the highest sack skb and use the seq instead,
      we need to set tp->highest_sack to NULL when there is no skb after
      the last sack, and then replace NULL with the real skb when a new skb
      is inserted into the rtx queue, because NULL means the highest sack
      seq is tp->snd_nxt. If tp->highest_sack is NULL and new data is sent,
      the next ACK with a sack option will increase tp->reordering unexpectedly.
      
      This patch sets tp->highest_sack to the tail of the rtx queue if
      it's NULL and new data is sent. The patch keeps the rule that
      highest_sack can only be maintained by sack processing, except for
      this one case.
      
      Fixes: 50895b9d ("tcp: highest_sack fix")
      Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ptp: fix the race between the release of ptp_clock and cdev · a33121e5
      Vladis Dronov authored
      In a case when a ptp chardev (like /dev/ptp0) is open but an underlying
      device is removed, closing this file leads to a race. This reproduces
      easily in a kvm virtual machine:
      
      ts# cat openptp0.c
      int main() { ... fp = fopen("/dev/ptp0", "r"); ... sleep(10); }
      ts# uname -r
      5.5.0-rc3-46cf053e
      ts# cat /proc/cmdline
      ... slub_debug=FZP
      ts# modprobe ptp_kvm
      ts# ./openptp0 &
      [1] 670
      opened /dev/ptp0, sleeping 10s...
      ts# rmmod ptp_kvm
      ts# ls /dev/ptp*
      ls: cannot access '/dev/ptp*': No such file or directory
      ts# ...woken up
      [   48.010809] general protection fault: 0000 [#1] SMP
      [   48.012502] CPU: 6 PID: 658 Comm: openptp0 Not tainted 5.5.0-rc3-46cf053e #25
      [   48.014624] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
      [   48.016270] RIP: 0010:module_put.part.0+0x7/0x80
      [   48.017939] RSP: 0018:ffffb3850073be00 EFLAGS: 00010202
      [   48.018339] RAX: 000000006b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: ffff89a476c00ad0
      [   48.018936] RDX: fffff65a08d3ea08 RSI: 0000000000000247 RDI: 6b6b6b6b6b6b6b6b
      [   48.019470] ...                                              ^^^ a slub poison
      [   48.023854] Call Trace:
      [   48.024050]  __fput+0x21f/0x240
      [   48.024288]  task_work_run+0x79/0x90
      [   48.024555]  do_exit+0x2af/0xab0
      [   48.024799]  ? vfs_write+0x16a/0x190
      [   48.025082]  do_group_exit+0x35/0x90
      [   48.025387]  __x64_sys_exit_group+0xf/0x10
      [   48.025737]  do_syscall_64+0x3d/0x130
      [   48.026056]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   48.026479] RIP: 0033:0x7f53b12082f6
      [   48.026792] ...
      [   48.030945] Modules linked in: ptp i6300esb watchdog [last unloaded: ptp_kvm]
      [   48.045001] Fixing recursive fault but reboot is needed!
      
      This happens in:
      
      static void __fput(struct file *file)
      {   ...
          if (file->f_op->release)
              file->f_op->release(inode, file); <<< cdev is kfree'd here
          if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
                   !(mode & FMODE_PATH))) {
              cdev_put(inode->i_cdev); <<< cdev fields are accessed here
      
      Namely:
      
      __fput()
        posix_clock_release()
          kref_put(&clk->kref, delete_clock) <<< the last reference
            delete_clock()
              delete_ptp_clock()
                kfree(ptp) <<< cdev is embedded in ptp
        cdev_put
          module_put(p->owner) <<< *p is kfree'd, bang!
      
      Here cdev is embedded in posix_clock which is embedded in ptp_clock.
      The race happens because ptp_clock's lifetime is controlled by two
      refcounts: kref and cdev.kobj in posix_clock. This is wrong.
      
      Make ptp_clock's sysfs device a parent of cdev with cdev_device_add(),
      which was created especially for such cases. This way the parent device
      with its ptp_clock is not released until all references to the cdev are
      released.
      This adds a requirement that an initialized but not exposed struct
      device should be provided to posix_clock_register() by a caller instead
      of a simple dev_t.
      
      This approach was adopted from the commit 72139dfa ("watchdog: Fix
      the race between the release of watchdog_core_data and cdev"). See
      details of the implementation in the commit 233ed09d ("chardev: add
      helper function to register char devs with a struct device").
      
      Link: https://lore.kernel.org/linux-fsdevel/20191125125342.6189-1-vdronov@redhat.com/T/#u
      Analyzed-by: Stephen Johnston <sjohnsto@redhat.com>
      Analyzed-by: Vern Lovejoy <vlovejoy@redhat.com>
      Signed-off-by: Vladis Dronov <vdronov@redhat.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: Reconcile the meaning of TPID and TPID2 for E/T and P/Q/R/S · 54fa49ee
      Vladimir Oltean authored
      For first-generation switches (SJA1105E and SJA1105T):
      - TPID means C-Tag (typically 0x8100)
      - TPID2 means S-Tag (typically 0x88A8)
      
      While for the second generation switches (SJA1105P, SJA1105Q, SJA1105R,
      SJA1105S) it is the other way around:
      - TPID means S-Tag (typically 0x88A8)
      - TPID2 means C-Tag (typically 0x8100)
      
      In other words, E/T tags untagged traffic with TPID, and P/Q/R/S with
      TPID2.
      
      So the patch mentioned below fixed VLAN filtering for P/Q/R/S, but broke
      it for E/T.
      
      We strive for a common code path for all switches in the family, so for
      P/Q/R/S just lie in the static config packing functions and pack TPID
      and TPID2 at swapped bit offsets relative to where they actually are.
      This will make both switch generations understand TPID to be
      ETH_P_8021Q and TPID2 to be ETH_P_8021AD. The meaning from the original
      E/T was chosen over P/Q/R/S because E/T is actually the one with public
      documentation available (UM10944.pdf).
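
The offset swap described above can be sketched in plain C. This is only an illustration, not the driver's code: the bit offsets, `struct general_params`, `pack_field()` and `pack_general_params()` are hypothetical names and values chosen for the example. The only things taken from the text are the two EtherTypes and the idea of swapping the slots for P/Q/R/S.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ETH_P_8021Q  0x8100 /* C-Tag */
#define ETH_P_8021AD 0x88A8 /* S-Tag */

/* Hypothetical bit offsets of the hardware's TPID and TPID2 fields. */
#define HW_TPID_SHIFT  16
#define HW_TPID2_SHIFT 32

/* Driver-side view: tpid is always the C-Tag, tpid2 always the S-Tag. */
struct general_params {
	uint16_t tpid;
	uint16_t tpid2;
};

/* Store a 16-bit value at the given bit offset of a 64-bit word. */
static uint64_t pack_field(uint64_t word, uint16_t val, unsigned int shift)
{
	assert(shift <= 48);
	word &= ~((uint64_t)0xffff << shift);
	return word | ((uint64_t)val << shift);
}

/*
 * E/T: each value goes to the slot with the same name.
 * P/Q/R/S: the slots are swapped, so the hardware still sees the C-Tag
 * where it expects one, while callers keep using a single convention.
 */
uint64_t pack_general_params(const struct general_params *p, bool is_pqrs)
{
	unsigned int tpid_shift  = is_pqrs ? HW_TPID2_SHIFT : HW_TPID_SHIFT;
	unsigned int tpid2_shift = is_pqrs ? HW_TPID_SHIFT : HW_TPID2_SHIFT;
	uint64_t word = 0;

	word = pack_field(word, p->tpid, tpid_shift);
	word = pack_field(word, p->tpid2, tpid2_shift);
	return word;
}
```

Keeping the swap inside the packing step means the rest of the code can set `tpid = ETH_P_8021Q` and `tpid2 = ETH_P_8021AD` for every switch in the family.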
      
      Fixes: f9a1a764 ("net: dsa: sja1105: Reverse TPID and TPID2")
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Documentation: net: dsa: sja1105: Remove text about taprio base-time limitation · 3a323ed7
      Vladimir Oltean authored
      Since commit 86db36a3 ("net: dsa: sja1105: Implement state machine
      for TAS with PTP clock source"), this paragraph is no longer true. So
      remove it.

      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: Remove restriction of zero base-time for taprio offload · d00bdc0a
      Vladimir Oltean authored
      The check originates from the initial implementation, which was not
      based on PTP time but on a standalone clock source. We can now program
      the PTPSCHTM register at runtime with the dynamic base time (actually
      with a value that is 200 ns smaller, to avoid writing DELTA=0 in the
      Schedule Entry Points Parameters Table). We also have logic for moving
      the actual base time into the future relative to the PHC's current
      time, so the check for zero serves no purpose: even if the user
      specifies zero, that's not what will end up in the static config table
      where the limitation is.
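
The two steps mentioned above (advance the base time so it is not in the past, then subtract 200 ns) can be sketched as follows. This is a minimal sketch under assumptions: all values are plain nanoseconds, the register's real unit is not modeled, and the function names `future_base_time()` and `ptpschtm_value()` are chosen for this example rather than taken from the text.

```c
#include <assert.h>
#include <stdint.h>

/* From the text: the programmed value is 200 ns smaller than the base time. */
#define PTPSCHTM_OFFSET_NS 200

/*
 * Advance base_time by whole cycles until it is not earlier than now.
 * A base time that already lies in the future is returned unchanged.
 */
int64_t future_base_time(int64_t base_time, int64_t cycle_time, int64_t now)
{
	int64_t n;

	assert(cycle_time > 0);
	if (base_time >= now)
		return base_time;

	/* Smallest n such that base_time + n * cycle_time >= now. */
	n = (now - base_time + cycle_time - 1) / cycle_time;
	return base_time + n * cycle_time;
}

/* Value to program: the adjusted base time minus the fixed offset. */
int64_t ptpschtm_value(int64_t base_time, int64_t cycle_time, int64_t now)
{
	return future_base_time(base_time, cycle_time, now) - PTPSCHTM_OFFSET_NS;
}
```

With a user-supplied base time of zero, a 1000 ns cycle and a current time of 2500 ns, the adjusted base time is 3000 ns, which illustrates why a zero base time never reaches the hardware as zero.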
      
      Fixes: 86db36a3 ("net: dsa: sja1105: Implement state machine for TAS with PTP clock source")
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: Really make the PTP command read-write · 5a47f588
      Vladimir Oltean authored
      When activating tc-taprio offload on the switch ports, the TAS state
      machine will try to check whether it is running or not, but will find
      both the STARTED and STOPPED bits as false in the
      sja1105_tas_check_running function. So the function will return -EINVAL
      (an abnormal situation) and the kernel will keep printing this from the
      TAS FSM workqueue:
      
      [   37.691971] sja1105 spi0.1: An operation returned -22
      
      The reason is that the underlying function that gets called,
      sja1105_ptp_commit, does not actually do a SPI_READ, but a SPI_WRITE. So
      the command buffer remains initialized with zeroes instead of retrieving
      the hardware state. Fix that.
      
      Fixes: 41603d78 ("net: dsa: sja1105: Make the PTP command read-write")
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: Take PTP egress timestamp by port, not mgmt slot · 9fcf024d
      Vladimir Oltean authored
      The PTP egress timestamp N must be captured from register PTPEGR_TS[n],
      where n = 2 * PORT + TSREG. There are 10 PTPEGR_TS registers, 2 per
      port. We are only using TSREG=0.
      
      The management slots are different: there are 4 of them
      (SJA1105_NUM_PORTS, minus the CPU port). Any management frame (which
      includes PTP frames) can be sent to any non-CPU port through any
      management slot. When the CPU port is not the last port (#4), there will
      be a mismatch between the slot and the port number.
      
      Luckily, the only mainline occurrence with this switch
      (arch/arm/boot/dts/ls1021a-tsn.dts) does have the CPU port as #4, so the
      issue did not manifest itself thus far.
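
The register index formula from the text, n = 2 * PORT + TSREG, is easy to write down as a helper. The function name `ptpegr_ts_index()` and the constant `NUM_TSREG_PER_PORT` are made up for this sketch. The port count of 5 is inferred from the text (10 registers, 2 per port).

```c
#include <assert.h>

#define SJA1105_NUM_PORTS  5 /* inferred: 10 PTPEGR_TS registers, 2 per port */
#define NUM_TSREG_PER_PORT 2

/* Index n of PTPEGR_TS[n] for a given egress port and timestamp register. */
int ptpegr_ts_index(int port, int tsreg)
{
	assert(port >= 0 && port < SJA1105_NUM_PORTS);
	assert(tsreg >= 0 && tsreg < NUM_TSREG_PER_PORT);
	return 2 * port + tsreg;
}
```

For example, a frame sent to port 1 through management slot 0 has its timestamp in PTPEGR_TS[2], not PTPEGR_TS[0], which is the mismatch the commit describes when the index is derived from the slot instead of the port.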
      
      Fixes: 47ed985e ("net: dsa: sja1105: Add logic for TX timestamping")
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 30 Dec, 2019 1 commit
  3. 29 Dec, 2019 3 commits
    • Merge branch 'mlxsw-fixes' · 3faf6eda
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      mlxsw: Couple of fixes
      
      This patch set contains two fixes for mlxsw. Please consider both for
      stable.
      
      Patch #1 from Amit fixes a wrong check during MAC validation when
      creating router interfaces (RIFs). Given a particular order of
      configuration this can result in the driver refusing to create new RIFs.
      
      Patch #2 fixes a wrong trap configuration in which VRRP packets and
      routing exceptions were policed by the same policer towards the CPU. In
      certain situations this can prevent VRRP packets from reaching the CPU.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Use dedicated policer for VRRP packets · acca789a
      Ido Schimmel authored
      Currently, VRRP packets and packets that hit exceptions during routing
      (e.g., MTU error) are policed using the same policer towards the CPU.
      This means, for example, that misconfiguration of the MTU on a routed
      interface can prevent VRRP packets from reaching the CPU, which in turn
      can cause the VRRP daemon to assume it is the Master router.
      
      Fix this by using a dedicated policer for VRRP packets.
      
      Fixes: 11566d34 ("mlxsw: spectrum: Add VRRP traps")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reported-by: Alex Veber <alexve@mellanox.com>
      Tested-by: Alex Veber <alexve@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_router: Skip loopback RIFs during MAC validation · 314bd842
      Amit Cohen authored
      When a router interface (RIF) is created the MAC address of the backing
      netdev is verified to have the same MSBs as existing RIFs. This is
      required in order to avoid changing existing RIF MAC addresses that all
      share the same MSBs.
      
      Loopback RIFs are special in this regard as they do not have a MAC
      address, given they are only used to loop packets from the overlay to
      the underlay.
      
      Without this change, an error is returned when trying to create a RIF
      after the creation of a GRE tunnel that is represented by a loopback
      RIF. 'rif->dev->dev_addr' points to the GRE device's local IP, which
      does not share the same MSBs as physical interfaces. Adding an IP
      address to any physical interface results in:
      
      Error: mlxsw_spectrum: All router interface MAC addresses must have the
      same prefix.
      
      Fix this by skipping loopback RIFs during MAC validation.
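
A minimal sketch of the validation loop with the loopback skip is shown below. It is an illustration under assumptions, not the driver's implementation: `struct rif`, its `is_loopback` flag, `rif_mac_prefix_ok()` and the caller-supplied `msb_mask` are hypothetical, and the number of MSBs that must match is left to the caller because the text does not state it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a router interface. */
struct rif {
	bool is_loopback; /* loopback RIFs carry no meaningful MAC address */
	uint64_t mac;     /* 48-bit MAC address stored in the low bits */
};

/*
 * Return true if new_mac shares the masked most significant bits with
 * every existing RIF. Loopback RIFs are skipped, since the address found
 * on their backing device is not a MAC address.
 */
bool rif_mac_prefix_ok(const struct rif *rifs, size_t n,
		       uint64_t new_mac, uint64_t msb_mask)
{
	size_t i;

	assert(rifs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (rifs[i].is_loopback)
			continue;
		if ((rifs[i].mac & msb_mask) != (new_mac & msb_mask))
			return false;
	}
	return true;
}
```

Without the `continue`, a single loopback RIF whose device address holds an IP address would make every later comparison fail, which matches the error described above.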
      
      Fixes: 74bc9939 ("mlxsw: spectrum_router: Veto unsupported RIF MAC addresses")
      Signed-off-by: Amit Cohen <amitc@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 28 Dec, 2019 2 commits
    • net: stmmac: dwmac-meson8b: Fix the RGMII TX delay on Meson8b/8m2 SoCs · bd6f4854
      Martin Blumenstingl authored
      GXBB and newer SoCs use the fixed FCLK_DIV2 (1GHz) clock as input for
      the m250_sel clock. Meson8b and Meson8m2 use MPLL2 instead, whose rate
      can be adjusted at runtime.
      
      So far we have been running MPLL2 with ~250MHz (and the internal
      m250_div with value 1), which worked well enough that we could transfer
      data with a TX delay of 4ns. Unfortunately there is high packet loss with
      an RGMII PHY when transferring data (receiving data works fine though).
      Odroid-C1's u-boot is running with a TX delay of only 2ns as well as
      the internal m250_div set to 2 - no lost (TX) packets can be observed
      with that setting in u-boot.
      
      Manual testing has shown that the TX packet loss goes away when using
      the following settings in Linux (the vendor kernel uses the same
      settings):
      - MPLL2 clock set to ~500MHz
      - m250_div set to 2
      - TX delay set to 2ns on the MAC side
      
      Update the m250_div divider settings to only accept dividers greater
      than or equal to 2 to fix the TX delay generated by the MAC.
      
      iperf3 results before the change:
      [ ID] Interval           Transfer     Bitrate         Retr
      [  5]   0.00-10.00  sec   182 MBytes   153 Mbits/sec  514      sender
      [  5]   0.00-10.00  sec   182 MBytes   152 Mbits/sec           receiver
      
      iperf3 results after the change (including an updated TX delay of 2ns):
      [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
      [  5]   0.00-10.00  sec   927 MBytes   778 Mbits/sec    0      sender
      [  5]   0.00-10.01  sec   927 MBytes   777 Mbits/sec           receiver
      
      Fixes: 4f6a71b8 ("net: stmmac: dwmac-meson8b: fix internal RGMII clock configuration")
      Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: act_mirred: Pull mac prior redir to non mac_header_xmit device · 70cf3dc7
      Shmulik Ladkani authored
      There's no skb_pull performed when a mirred action is set at egress of a
      mac device, with a target device/action that expects skb->data to point
      at the network header.
      
      As a result, either the target device is erroneously given an skb with
      data pointing to the mac (egress case), or the net stack receives the
      skb with data pointing to the mac (ingress case).

      E.g.:
       # tc qdisc add dev eth9 root handle 1: prio
       # tc filter add dev eth9 parent 1: prio 9 protocol ip handle 9 basic \
         action mirred egress redirect dev tun0
      
       (tun0 is a tun device. Result: tun0 erroneously gets the eth header
        instead of the iph)
      
      Revise the push/pull logic of tcf_mirred_act() to not rely on the
      skb_at_tc_ingress() vs tcf_mirred_act_wants_ingress() comparison, as it
      does not cover all "pull" cases.
      
      Instead, calculate whether the required action on the target device
      requires the data to point at the network header, and compare this to
      whether skb->data points to network header - and make the push/pull
      adjustments as necessary.
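
The decision described above can be modeled in userspace as a comparison of two booleans: where the target expects skb->data to point, and where it points now. The sketch below is a reading of the commit text, not the kernel function: `enum hdr_adjust`, `mirred_hdr_adjust()` and `apply_hdr_adjust()` are hypothetical names, and the packet is reduced to a byte offset from the MAC header.

```c
#include <assert.h>
#include <stdbool.h>

enum hdr_adjust { HDR_NONE, HDR_PULL_MAC, HDR_PUSH_MAC };

/*
 * Assumption from the text: the ingress path and the egress of a device
 * that is not mac_header_xmit both expect data at the network header.
 */
enum hdr_adjust mirred_hdr_adjust(bool want_ingress,
				  bool target_mac_header_xmit,
				  bool data_at_network_header)
{
	bool expects_nh = want_ingress || !target_mac_header_xmit;

	if (data_at_network_header == expects_nh)
		return HDR_NONE;
	return expects_nh ? HDR_PULL_MAC : HDR_PUSH_MAC;
}

/* Apply the adjustment to the offset of the data pointer from the MAC header. */
int apply_hdr_adjust(int data_off, int mac_len, enum hdr_adjust adj)
{
	assert(mac_len > 0);
	switch (adj) {
	case HDR_PULL_MAC:
		assert(data_off == 0);       /* currently at the MAC header */
		return data_off + mac_len;
	case HDR_PUSH_MAC:
		assert(data_off == mac_len); /* currently at the network header */
		return data_off - mac_len;
	default:
		return data_off;
	}
}
```

In the tun0 example, the skb leaves eth9 at egress with data at the MAC header and the target is not a mac_header_xmit device, so the result is a pull of the MAC header length.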
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Shmulik Ladkani <sladkani@proofpoint.com>
      Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 26 Dec, 2019 18 commits
    • net_sched: sch_fq: properly set sk->sk_pacing_status · bb3d0b8b
      Eric Dumazet authored
      If fq_classify() recycles a struct fq_flow because
      a socket structure has been reallocated, we do not
      set sk->sk_pacing_status immediately, but later if the
      flow becomes detached.
      
      This means that any flow requiring pacing (BBR, or SO_MAX_PACING_RATE)
      might fall back to TCP internal pacing, which requires a per-socket
      high resolution timer, and therefore more cpu cycles.
      
      Fixes: 218af599 ("tcp: internal implementation for pacing")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bnx2x-Bug-fixes' · 4e55a11a
      David S. Miller authored
      Manish Chopra says:
      
      ====================
      bnx2x: Bug fixes
      
      This series has changes in the area of vlan resources
      management APIs to fix fw assert issue reported in max
      vlan configuration testing over the PF.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnx2x: Fix accounting of vlan resources among the PFs · 5cdc40c7
      Manish Chopra authored
      While testing max vlan configuration on the PF, the firmware asserted
      because the driver was configuring more vlans than are supported per
      port/engine. It turned out that there is an implicit vlan (a hidden
      default vlan consuming a hardware cam entry) which the adapter
      configures by default for all the clients (PF/VFs) on the client_init
      ramrod. So when allocating resources among the PFs, this implicit vlan
      should be accounted for, or the total number of vlan entries should be
      reduced by one, to accommodate the default/implicit vlan entry.

      Signed-off-by: Manish Chopra <manishc@marvell.com>
      Signed-off-by: Ariel Elior <aelior@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnx2x: Use appropriate define for vlan credit · 0444716a
      Manish Chopra authored
      Although it has the same value as MAX_MAC_CREDIT_E2,
      use MAX_VLAN_CREDIT_E2, which is the appropriate define for vlan credit.

      Signed-off-by: Manish Chopra <manishc@marvell.com>
      Signed-off-by: Ariel Elior <aelior@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 3c2f450e
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2019-12-23
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 2 non-merge commits during the last 1 day(s) which contain
      a total of 4 files changed, 34 insertions(+), 31 deletions(-).
      
      The main changes are:
      
      1) Fix libbpf build when building on a read-only filesystem with O=dir
         option, from Namhyung Kim.
      
      2) Fix a precision tracking bug for unknown scalars, from Daniel Borkmann.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • of: mdio: Add missing inline to of_mdiobus_child_is_phy() dummy · 7df2281a
      Geert Uytterhoeven authored
      If CONFIG_OF_MDIO=n:
      
          drivers/net/phy/mdio_bus.c:23:
          include/linux/of_mdio.h:58:13: warning: ‘of_mdiobus_child_is_phy’ defined but not used [-Wunused-function]
           static bool of_mdiobus_child_is_phy(struct device_node *child)
      		 ^~~~~~~~~~~~~~~~~~~~~~~
      
      Fix this by adding the missing "inline" keyword.
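
The pattern behind this fix is the usual header stub for a disabled config option. The sketch below shows it in a self-contained form. The stub's `false` return value and the caller `count_phy_children()` are assumptions for illustration and are not taken from the commit text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct device_node; /* opaque here; only pointers to it are used */

#ifdef CONFIG_OF_MDIO
bool of_mdiobus_child_is_phy(struct device_node *child);
#else
/*
 * Dummy for CONFIG_OF_MDIO=n. A plain "static" function in a header is
 * emitted into every file that includes it and triggers
 * -Wunused-function wherever it is not called; "static inline" avoids that.
 */
static inline bool of_mdiobus_child_is_phy(struct device_node *child)
{
	(void)child;
	return false; /* assumed: with the feature off, nothing is a PHY */
}
#endif

/* Hypothetical caller, present only to show the stub in use. */
int count_phy_children(struct device_node **children, int n)
{
	int i, count = 0;

	assert(n >= 0);
	for (i = 0; i < n; i++)
		if (of_mdiobus_child_is_phy(children[i]))
			count++;
	return count;
}
```

Files that include the header but never call the function then compile without the warning, and callers need no `#ifdef` of their own.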
      
      Fixes: 0aa4d016 ("of: mdio: export of_mdiobus_child_is_phy")
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Acked-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: aquantia: add suspend / resume ops for AQR105 · 1c93fb45
      Madalin Bucur authored
      The suspend/resume code for AQR107 works on AQR105 too.
      This patch fixes issues with the partner not seeing the link down
      when the interface using AQR105 is brought down.
      
      Fixes: bee8259d ("net: phy: add driver for aquantia phy")
      Signed-off-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dpaa_eth: fix DMA mapping leak · c27569fc
      Madalin Bucur authored
      On the error path some fragments remain DMA mapped. Fix this by
      unmapping all the fragments, and rework the cleanup path to be simpler.
      
      Fixes: 8151ee88 ("dpaa_eth: use page backed rx buffers")
      Signed-off-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · ec34c015
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for net:
      
      1) Fix endianness issue in flowtable TCP flags dissector,
         from Arnd Bergmann.
      
      2) Extend flowtable test script with dnat rules, from Florian Westphal.
      
      3) Reject padding in ebtables user entries and validate computed user
         offset, reported by syzbot, from Florian Westphal.
      
      4) Fix endianness in nft_tproxy, from Phil Sutter.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlxfw: Fix out-of-memory error in mfa2 flash burning · a5bcd72e
      Vladyslav Tarasiuk authored
      The burning process requires internal allocations of large chunks of
      memory. This memory doesn't need to be contiguous and can be safely
      allocated by vzalloc() instead of kzalloc(). This patch changes these
      allocations to avoid a possible out-of-memory failure.
      
      Fixes: 410ed13c ("Add the mlxfw module for Mellanox firmware flash process")
      Signed-off-by: Vladyslav Tarasiuk <vladyslavt@mellanox.com>
      Reviewed-by: Aya Levin <ayal@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Tested-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'hsr-fix-several-bugs-in-hsr-module' · 095e90e0
      David S. Miller authored
      Taehee Yoo says:
      
      ====================
      hsr: fix several bugs in hsr module
      
      1. The first patch fixes a debugfs warning that appears when the debugfs
      file is opened while the hsr module is being removed. When a debugfs
      file is opened, it tries to hold the .owner module, and it prints
      warning messages if it cannot. In order to avoid the warning, this patch
      makes the hsr module not set .owner. Leaving .owner unset is safe
      because these are protected by inode_lock().
      
      2. The second patch fixes wrong error handling in hsr_dev_finalize().
      a) hsr_dev_finalize() calls debugfs_create_{dir/file} to create debugfs
      entries. It checks for a NULL pointer, but debugfs doesn't return NULL,
      so the check is wrong.
      b) hsr_dev_finalize() calls register_netdevice(), so if it fails after
      register_netdevice(), it should call unregister_netdevice().
      But it doesn't.
      c) debugfs doesn't affect any actual logic of the hsr module,
      so a failure to create debugfs entries can be ignored.
      
      3. The third patch adds an hsr root debugfs directory.
      When an hsr interface is created, it creates a debugfs directory at
      /sys/kernel/debug/<interface name>.
      This path is fragile, because creation fails if the interface name
      matches another directory name in the same path. If an hsr root
      directory exists, creating the debugfs file is less likely to fail.
      
      4. The fourth patch adds a debugfs rename routine.
      The debugfs directory name is the same as the hsr interface name.
      So when the hsr interface name is changed, the debugfs directory name
      should be changed too.
      
      5. The fifth patch fixes a race condition in node list add and del.
      hsr nodes are protected by RCU and there is no write side lock.
      But node insertions and deletions can run concurrently,
      so write side locking is needed.
      
      6. The sixth patch resets the network header.
      When the tap routine is enabled, the message below is printed.
      
      [  175.852292][    C3] protocol 88fb is buggy, dev veth0
      
      The hsr module doesn't set the network header for supervision frames,
      but the tap routine validates the network header.
      If the network header isn't set, the tap routine resets it and warns
      about it.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • hsr: reset network header when supervision frame is created · 3ed0a1d5
      Taehee Yoo authored
      The supervision frame is an L2 frame.
      When a supervision frame is created, the hsr module doesn't set the
      network header. If the tap routine is enabled, dev_queue_xmit_nit() is
      called and it checks network_header. If the network_header pointer isn't
      set (or is invalid), it resets network_header and warns.
      In order to avoid the unnecessary warning message, network_header needs
      to be reset when the frame is created.
      
      Test commands:
          ip netns add nst
          ip link add veth0 type veth peer name veth1
          ip link add veth2 type veth peer name veth3
          ip link set veth1 netns nst
          ip link set veth3 netns nst
          ip link set veth0 up
          ip link set veth2 up
          ip link add hsr0 type hsr slave1 veth0 slave2 veth2
          ip a a 192.168.100.1/24 dev hsr0
          ip link set hsr0 up
          ip netns exec nst ip link set veth1 up
          ip netns exec nst ip link set veth3 up
          ip netns exec nst ip link add hsr1 type hsr slave1 veth1 slave2 veth3
          ip netns exec nst ip a a 192.168.100.2/24 dev hsr1
          ip netns exec nst ip link set hsr1 up
          tcpdump -nei veth0
      
      Splat looks like:
      [  175.852292][    C3] protocol 88fb is buggy, dev veth0
      
      Fixes: f421436a ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • hsr: fix a race condition in node list insertion and deletion · 92a35678
      Taehee Yoo authored
      hsr nodes are protected by RCU and there is no write side lock.
      But node insertions and deletions can run concurrently,
      so write side locking is needed.
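
The idea of serializing writers can be illustrated in userspace with a mutex around the lookup-and-insert step. This is a sketch under loud assumptions: it uses a pthread mutex where kernel code would use its own locking primitive, it does not model the RCU read side at all, and `struct node_db`, `find_node()` and `add_node()` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct node {
	struct node *next;
	unsigned char addr[6];
};

struct node_db {
	pthread_mutex_t lock; /* serializes writers (insertions and deletions) */
	struct node *head;
};

/* Linear search by address; the caller must hold db->lock. */
static struct node *find_node(struct node *head, const unsigned char *addr)
{
	for (; head; head = head->next)
		if (memcmp(head->addr, addr, 6) == 0)
			return head;
	return NULL;
}

/*
 * Insert a node for addr unless one already exists. The lookup and the
 * list update happen under the same lock, so two concurrent writers
 * cannot both link a node for the same address or corrupt the list.
 * Returns the existing or new node, or NULL if allocation fails.
 */
struct node *add_node(struct node_db *db, const unsigned char addr[6])
{
	struct node *new_node, *found;

	assert(db != NULL && addr != NULL);
	new_node = calloc(1, sizeof(*new_node));
	if (!new_node)
		return NULL;
	memcpy(new_node->addr, addr, 6);

	pthread_mutex_lock(&db->lock);
	found = find_node(db->head, addr);
	if (found) {
		pthread_mutex_unlock(&db->lock);
		free(new_node);
		return found;
	}
	new_node->next = db->head;
	db->head = new_node;
	pthread_mutex_unlock(&db->lock);
	return new_node;
}
```

Allocating before taking the lock keeps the critical section short, at the cost of an occasional wasted allocation when the node already exists.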
      
      Test commands:
          ip netns add nst
          ip link add veth0 type veth peer name veth1
          ip link add veth2 type veth peer name veth3
          ip link set veth1 netns nst
          ip link set veth3 netns nst
          ip link set veth0 up
          ip link set veth2 up
          ip link add hsr0 type hsr slave1 veth0 slave2 veth2
          ip a a 192.168.100.1/24 dev hsr0
          ip link set hsr0 up
          ip netns exec nst ip link set veth1 up
          ip netns exec nst ip link set veth3 up
          ip netns exec nst ip link add hsr1 type hsr slave1 veth1 slave2 veth3
          ip netns exec nst ip a a 192.168.100.2/24 dev hsr1
          ip netns exec nst ip link set hsr1 up
      
          for i in {0..9}
          do
              for j in {0..9}
              do
                  for k in {0..9}
                  do
                      for l in {0..9}
                      do
                          arping 192.168.100.2 -I hsr0 -s 00:01:3$i:4$j:5$k:6$l -c1 &
                      done
                  done
              done
          done
      
      Splat looks like:
      [  236.066091][ T3286] list_add corruption. next->prev should be prev (ffff8880a5940300), but was ffff8880a5940d0.
      [  236.069617][ T3286] ------------[ cut here ]------------
      [  236.070545][ T3286] kernel BUG at lib/list_debug.c:25!
      [  236.071391][ T3286] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      [  236.072343][ T3286] CPU: 0 PID: 3286 Comm: arping Tainted: G        W         5.5.0-rc1+ #209
      [  236.073463][ T3286] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [  236.074695][ T3286] RIP: 0010:__list_add_valid+0x74/0xd0
      [  236.075499][ T3286] Code: 48 39 da 75 27 48 39 f5 74 36 48 39 dd 74 31 48 83 c4 08 b8 01 00 00 00 5b 5d c3 48 b
      [  236.078277][ T3286] RSP: 0018:ffff8880aaa97648 EFLAGS: 00010286
      [  236.086991][ T3286] RAX: 0000000000000075 RBX: ffff8880d4624c20 RCX: 0000000000000000
      [  236.088000][ T3286] RDX: 0000000000000075 RSI: 0000000000000008 RDI: ffffed1015552ebf
      [  236.098897][ T3286] RBP: ffff88809b53d200 R08: ffffed101b3c04f9 R09: ffffed101b3c04f9
      [  236.099960][ T3286] R10: 00000000308769a1 R11: ffffed101b3c04f8 R12: ffff8880d4624c28
      [  236.100974][ T3286] R13: ffff8880d4624c20 R14: 0000000040310100 R15: ffff8880ce17ee02
      [  236.138967][ T3286] FS:  00007f23479fa680(0000) GS:ffff8880d9c00000(0000) knlGS:0000000000000000
      [  236.144852][ T3286] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  236.145720][ T3286] CR2: 00007f4a14bab210 CR3: 00000000a61c6001 CR4: 00000000000606f0
      [  236.146776][ T3286] Call Trace:
      [  236.147222][ T3286]  hsr_add_node+0x314/0x490 [hsr]
      [  236.153633][ T3286]  hsr_forward_skb+0x2b6/0x1bc0 [hsr]
      [  236.154362][ T3286]  ? rcu_read_lock_sched_held+0x90/0xc0
      [  236.155091][ T3286]  ? rcu_read_lock_bh_held+0xa0/0xa0
      [  236.156607][ T3286]  hsr_dev_xmit+0x70/0xd0 [hsr]
      [  236.157254][ T3286]  dev_hard_start_xmit+0x160/0x740
      [  236.157941][ T3286]  __dev_queue_xmit+0x1961/0x2e10
      [  236.158565][ T3286]  ? netdev_core_pick_tx+0x2e0/0x2e0
      [ ... ]
      
      Reported-by: syzbot+3924327f9ad5f4d2b343@syzkaller.appspotmail.com
      Fixes: f421436a ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      92a35678
    • Taehee Yoo's avatar
      hsr: rename debugfs file when interface name is changed · 4c2d5e33
      Taehee Yoo authored
      Each hsr interface has its own debugfs file, whose name matches the
      interface name. So when the interface name changes, the debugfs file
      name should change too.
      
      Fixes: fc4ecaee ("net: hsr: add debugfs support for display node list")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4c2d5e33
    • Taehee Yoo's avatar
      hsr: add hsr root debugfs directory · c6c4ccd7
      Taehee Yoo authored
      In the current hsr code, creating an hsr interface creates the debugfs
      directory /sys/kernel/debug/<interface name>.
      If a directory or file with the same name already exists there, creation
      fails.
      To reduce the chance of debugfs creation failing, this patch adds an
      hsr root directory.
      
      Test commands:
          ip link add dummy0 type dummy
          ip link add dummy1 type dummy
          ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1
      
      Before this patch:
          /sys/kernel/debug/hsr0/node_table
      
      After this patch:
          /sys/kernel/debug/hsr/hsr0/node_table

      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c6c4ccd7
    • Taehee Yoo's avatar
      hsr: fix error handling routine in hsr_dev_finalize() · 1d19e2d5
      Taehee Yoo authored
      hsr_dev_finalize() is called to create a new hsr interface.
      Its error handling has several problems.

      1. Wrong check of the return value of debugfs_create_{dir/file}.
      These functions don't return NULL. On error they return an error
      pointer, so the code should check for an error pointer instead of NULL.

      2. The interface isn't unregistered if setting up the hsr interface fails.
      If initialization of the hsr interface fails after register_netdevice(),
      unregister_netdevice() should be called.

      3. Ignore failure to create the debugfs entries.
      If creating the debugfs dir and file fails, creating the hsr interface
      fails too. But debugfs doesn't affect the actual logic of the hsr module,
      so ignoring the failure is more correct, and this behavior is more common.
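
Points 1 and 3 can be illustrated with a minimal userspace sketch. It is not the
hsr code: err_ptr() and is_err() are simplified stand-ins for the kernel's
ERR_PTR()/IS_ERR() helpers, and fake_debugfs_create_dir() and
hsr_debugfs_init_sketch() are hypothetical names invented for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's ERR_PTR()/IS_ERR(). */
#define MAX_ERRNO 4095

static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline int is_err(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO addresses. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for debugfs_create_dir(): on failure it hands
 * back an error pointer, never NULL. */
static int fake_dir;

static void *fake_debugfs_create_dir(int should_fail)
{
	return should_fail ? err_ptr(-ENOMEM) : (void *)&fake_dir;
}

/* Returns 1 when debugfs is usable and 0 when it is not. Either way the
 * caller carries on, because debugfs is optional (point 3). */
static int hsr_debugfs_init_sketch(int should_fail)
{
	void *de = fake_debugfs_create_dir(should_fail);

	if (is_err(de))	/* a "de == NULL" test would miss this (point 1) */
		return 0;
	return 1;
}
```

In this model a failed creation yields a non-NULL pointer, which is why a NULL
check alone would wrongly treat the failure as success.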
      
      Fixes: c5a75911 ("net/hsr: Use list_head (and rcu) instead of array for slave devices.")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1d19e2d5
    • Taehee Yoo's avatar
      hsr: avoid debugfs warning message when module is remove · 84bb59d7
      Taehee Yoo authored
      When the hsr module is being removed, debugfs_remove() is called to
      remove both the debugfs directory and file.

      When a module is being removed, its state is changed to
      MODULE_STATE_GOING and then exit() is called.
      At this point the module can no longer be held, so try_module_get()
      fails.

      debugfs's open() callback tries to hold the module if .owner is set.
      If that fails, a warning message is printed.
      
      CPU0				CPU1
      delete_module()
          try_stop_module()
          hsr_exit()			open() <-- WARNING
              debugfs_remove()
      
      To avoid the warning message, this patch makes the hsr module not set
      .owner. Leaving .owner unset is safe because these operations are
      protected by inode_lock().
      
      Test commands:
          #SHELL1
          ip link add dummy0 type dummy
          ip link add dummy1 type dummy
          while :
          do
              ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1
              modprobe -rv hsr
          done
      
          #SHELL2
          while :
          do
              cat /sys/kernel/debug/hsr0/node_table
          done
      
      Splat looks like:
      [  101.223783][ T1271] ------------[ cut here ]------------
      [  101.230309][ T1271] debugfs file owner did not clean up at exit: node_table
      [  101.230380][ T1271] WARNING: CPU: 3 PID: 1271 at fs/debugfs/file.c:309 full_proxy_open+0x10f/0x650
      [  101.233153][ T1271] Modules linked in: hsr(-) dummy veth openvswitch nsh nf_conncount nf_nat nf_conntrack nf_d]
      [  101.237112][ T1271] CPU: 3 PID: 1271 Comm: cat Tainted: G        W         5.5.0-rc1+ #204
      [  101.238270][ T1271] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [  101.240379][ T1271] RIP: 0010:full_proxy_open+0x10f/0x650
      [  101.241166][ T1271] Code: 48 c1 ea 03 80 3c 02 00 0f 85 c1 04 00 00 49 8b 3c 24 e8 04 86 7e ff 84 c0 75 2d 4c 8
      [  101.251985][ T1271] RSP: 0018:ffff8880ca22fa38 EFLAGS: 00010286
      [  101.273355][ T1271] RAX: dffffc0000000008 RBX: ffff8880cc6e6200 RCX: 0000000000000000
      [  101.274466][ T1271] RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff8880c4dd5c14
      [  101.275581][ T1271] RBP: 0000000000000000 R08: fffffbfff2922f5d R09: 0000000000000000
      [  101.276733][ T1271] R10: 0000000000000001 R11: 0000000000000000 R12: ffffffffc0551bc0
      [  101.277853][ T1271] R13: ffff8880c4059a48 R14: ffff8880be50a5e0 R15: ffffffff941adaa0
      [  101.278956][ T1271] FS:  00007f8871cda540(0000) GS:ffff8880da800000(0000) knlGS:0000000000000000
      [  101.280216][ T1271] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  101.282832][ T1271] CR2: 00007f88717cfd10 CR3: 00000000b9440005 CR4: 00000000000606e0
      [  101.283974][ T1271] Call Trace:
      [  101.285328][ T1271]  do_dentry_open+0x63c/0xf50
      [  101.286077][ T1271]  ? open_proxy_open+0x270/0x270
      [  101.288271][ T1271]  ? __x64_sys_fchdir+0x180/0x180
      [  101.288987][ T1271]  ? inode_permission+0x65/0x390
      [  101.289682][ T1271]  path_openat+0x701/0x2810
      [  101.290294][ T1271]  ? path_lookupat+0x880/0x880
      [  101.290957][ T1271]  ? check_chain_key+0x236/0x5d0
      [  101.291676][ T1271]  ? __lock_acquire+0xdfe/0x3de0
      [  101.292358][ T1271]  ? sched_clock+0x5/0x10
      [  101.292962][ T1271]  ? sched_clock_cpu+0x18/0x170
      [  101.293644][ T1271]  ? find_held_lock+0x39/0x1d0
      [  101.305616][ T1271]  do_filp_open+0x17a/0x270
      [  101.306061][ T1271]  ? may_open_dev+0xc0/0xc0
      [ ... ]
      
      Fixes: fc4ecaee ("net: hsr: add debugfs support for display node list")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      84bb59d7
    • Netanel Belgazal's avatar
  6. 25 Dec, 2019 9 commits
    • David S. Miller's avatar
      Merge branch 's390-qeth-fixes' · 7f936f2a
      David S. Miller authored
      Julian Wiedmann says:
      
      ====================
      s390/qeth: fixes 2019-12-23
      
      please apply the following patch series for qeth to your net tree.
      
      This brings two fixes for errors during device initialization, deals with
      several issues in the vnicc control code, and adds a missing lock.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7f936f2a
    • Julian Wiedmann's avatar
      s390/qeth: fix initialization on old HW · 0b698c83
      Julian Wiedmann authored
      I stumbled over an old OSA model that claims to support DIAG_ASSIST,
      but then rejects the cmd to query its DIAG capabilities.
      
      In the old code this was ok, as the returned raw error code was > 0.
      Now that we translate the raw codes to errnos, the "rc < 0" causes us
      to fail the initialization of the device.
      
      The fix is trivial: don't bail out when the DIAG query fails. Such an
      error is not critical, we can still use the device (with a slightly
      reduced set of features).
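
The idea of treating the failed DIAG query as non-fatal can be sketched as
follows. This is an illustrative userspace model under stated assumptions, not
the qeth driver: struct card_sketch, its fields, query_diag_sketch() and
card_init_sketch() are all hypothetical names.

```c
#include <errno.h>

/* Hypothetical model of a card; the field names are illustrative only. */
struct card_sketch {
	int claims_diag_assist;	/* card advertises DIAG_ASSIST */
	int diag_query_rc;	/* errno-style result the query will return */
	int diag_features;	/* 1 if DIAG features are usable, else 0 */
};

/* Models the DIAG capability query with a translated (negative) errno. */
static int query_diag_sketch(struct card_sketch *card)
{
	if (card->diag_query_rc < 0)
		return card->diag_query_rc;
	card->diag_features = 1;
	return 0;
}

/* Initialization keeps going when the DIAG query fails. */
static int card_init_sketch(struct card_sketch *card)
{
	card->diag_features = 0;

	if (card->claims_diag_assist) {
		int rc = query_diag_sketch(card);

		if (rc < 0)
			card->diag_features = 0; /* non-critical: fewer features */
	}
	return 0;
}
```

With a negative return code from the query, initialization in this model still
succeeds and the card simply runs without the DIAG features.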
      
      Fixes: 742d4d40 ("s390/qeth: convert remaining legacy cmd callbacks")
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0b698c83
    • Alexandra Winter's avatar
      s390/qeth: vnicc Fix init to default · d1b9ae18
      Alexandra Winter authored
      During vnicc_init wanted_char should be compared to cur_char and not
      to QETH_VNICC_DEFAULT. Without this patch there is no way to enforce
      the default values as desired values.
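
A small sketch of why the comparison target matters. It assumes the set of
characteristics to reprogram is computed as a bitwise difference, which this
commit message does not state; VNICC_DEFAULT_SKETCH and both helper functions
are hypothetical and only model the two comparisons.

```c
#include <stdint.h>

#define VNICC_DEFAULT_SKETCH 0x1u	/* hypothetical default mask */

/* Old comparison: wanted characteristics against the assumed default. */
static uint32_t chars_to_change_old(uint32_t wanted, uint32_t cur)
{
	(void)cur;	/* the card's current state is ignored */
	return wanted ^ VNICC_DEFAULT_SKETCH;
}

/* Fixed comparison: wanted characteristics against the current ones. */
static uint32_t chars_to_change_new(uint32_t wanted, uint32_t cur)
{
	return wanted ^ cur;
}
```

If the wanted value equals the default while the card currently holds something
else, the old comparison reports nothing to change, so the default can never be
enforced. The fixed comparison reports exactly the bits that differ.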
      
      Note that a card is expected to come online with default values.
      This patch was tested with private card firmware.
      
      Fixes: caa1f0b1 ("s390/qeth: add VNICC enable/disable support")
      Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d1b9ae18
    • Alexandra Winter's avatar
      s390/qeth: Fix vnicc_is_in_use if rx_bcast not set · e8a66d80
      Alexandra Winter authored
      Symptom: After vnicc/rx_bcast has been manually set to 0,
      	bridge_* sysfs parameters can still be set or written.
      Only occurs on HiperSockets, as OSA doesn't support changing rx_bcast.
      
      Vnic characteristics and bridgeport settings are mutually exclusive.
      rx_bcast defaults to 1, so manually setting it to 0 should disable
      bridge_* parameters.
      
      Instead it makes sense here to check the supported mask. If the card
      does not support vnicc at all, bridge commands are always allowed.
      
      Fixes: caa1f0b1 ("s390/qeth: add VNICC enable/disable support")
      Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e8a66d80
    • Alexandra Winter's avatar
      s390/qeth: fix false reporting of VNIC CHAR config failure · 68c57bfd
      Alexandra Winter authored
      Symptom: Error message "Configuring the VNIC characteristics failed"
      in dmesg whenever an OSA interface on z15 is set online.
      
      The VNIC characteristics get re-programmed when setting a L2 device
      online. This follows the selected 'wanted' characteristics - with the
      exception that the INVISIBLE characteristic unconditionally gets
      switched off.
      
      For devices that don't support INVISIBLE (ie. OSA), the resulting
      IO failure raises a noisy error message
      ("Configuring the VNIC characteristics failed").
      For IQD, INVISIBLE is off by default anyways.
      
      So don't unnecessarily special-case the INVISIBLE characteristic, and
      thereby suppress the misleading error message on OSA devices.
      
      Fixes: caa1f0b1 ("s390/qeth: add VNICC enable/disable support")
      Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
      Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      68c57bfd
    • Julian Wiedmann's avatar
      s390/qeth: lock the card while changing its hsuid · 5b6c7b55
      Julian Wiedmann authored
      qeth_l3_dev_hsuid_store() initially checks the card state, but doesn't
      take the conf_mutex to ensure that the card stays in this state while
      being reconfigured.
      
      Rework the code to take this lock, and drop a redundant state check in a
      helper function.
      
      Fixes: b3332930 ("qeth: add support for af_iucv HiperSockets transport")
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5b6c7b55
    • Julian Wiedmann's avatar
      s390/qeth: fix qdio teardown after early init error · 8b5026bc
      Julian Wiedmann authored
      qeth_l?_set_online() goes through a number of initialization steps, and
      on any error uses qeth_l?_stop_card() to tear down the residual state.
      
      The first initialization step is qeth_core_hardsetup_card(). When this
      fails after having established a QDIO context on the device
      (ie. somewhere after qeth_mpc_initialize()), qeth_l?_stop_card() doesn't
      shut down this QDIO context again (since the card state hasn't
      progressed from DOWN at this stage).
      
      Even worse, we then call qdio_free() as final teardown step to free the
      QDIO data structures - while some of them are still hooked into wider
      QDIO infrastructure such as the IRQ list. This is inevitably followed by
      use-after-frees and other nastiness.
      
      Fix this by unconditionally calling qeth_qdio_clear_card() to shut down
      the QDIO context, and also to halt/clear any pending activity on the
      various IO channels.
      Remove the naive attempt at handling the teardown in
      qeth_mpc_initialize(); it clearly doesn't suffice and we're handling it
      properly now in the wider teardown code.
      
      Fixes: 4a71df50 ("qeth: new qeth device driver")
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8b5026bc
    • David S. Miller's avatar
      Merge branch 'disable-neigh-update-for-tunnels-during-pmtu-update' · 47d0b2fe
      David S. Miller authored
      Hangbin Liu says:
      
      ====================
      disable neigh update for tunnels during pmtu update
      
      When we set up a pair of gretap devices, ping between them to populate
      the neighbour cache, and then delete and recreate one side, we are never
      able to ping6 the newly created gretap.

      The reason is that when we ping6 the remote via gretap, the call chain is
      
      gre_tap_xmit()
       - ip_tunnel_xmit()
         - tnl_update_pmtu()
           - skb_dst_update_pmtu()
             - ip6_rt_update_pmtu()
               - __ip6_rt_update_pmtu()
                 - dst_confirm_neigh()
                   - ip6_confirm_neigh()
                     - __ipv6_confirm_neigh()
                       - n->confirmed = now
      
      Because the confirmed time is updated, the NUD_DELAY confirm time check
      in neigh_timer_handler() passes and the neigh state goes back to
      NUD_REACHABLE. So the old/wrong mac address is used again.

      If we do not update the confirmed time, the neigh state goes to
      NUD_PROBE, then to NUD_FAILED, and the neigh is re-created later,
      which is what IPv4 does.
      
      We couldn't remove the ip6_confirm_neigh() directly as we still need it
      for TCP flows. To fix it, we have to pass a bool parameter to
      dst_ops.update_pmtu() and only disable neighbor update for tunnels.
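
A simplified userspace model of the approach, under stated assumptions: the
types and update_pmtu_sketch() are hypothetical and do not reproduce the
kernel's dst_ops.update_pmtu() signature. The only point shown is that the
caller decides whether the neighbour's confirmed time is refreshed.

```c
#include <stdbool.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct neigh_sketch {
	unsigned long confirmed;	/* time of last confirmation */
};

struct dst_sketch {
	unsigned int pmtu;
	struct neigh_sketch *neigh;
};

/* Lowers the path MTU; confirms the neighbour only when asked to. */
static void update_pmtu_sketch(struct dst_sketch *dst, unsigned int mtu,
			       unsigned long now, bool confirm_neigh)
{
	if (confirm_neigh)
		dst->neigh->confirmed = now;
	if (mtu < dst->pmtu)
		dst->pmtu = mtu;
}
```

In this model a tunnel transmit path would pass false, so a stale neighbour
entry can still time out, while a TCP flow would pass true as before.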
      
      v5: No code change, update some commit descriptions
      v4: No code change, update some commit descriptions
      v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
          dst_ops.update_pmtu to control whether we should do neighbor confirm.
          Also split the big patch to small ones for each area.
      v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      47d0b2fe
    • Hangbin Liu's avatar
      net/dst: do not confirm neighbor for vxlan and geneve pmtu update · f081042d
      Hangbin Liu authored
      When an IPv6 tunnel PMTU update ends up calling __ip6_rt_update_pmtu(),
      we should not call dst_confirm_neigh(), as there is no two-way communication.
      
      So disable the neigh confirm for vxlan and geneve pmtu update.
      
      v5: No change.
      v4: No change.
      v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
          dst_ops.update_pmtu to control whether we should do neighbor confirm.
          Also split the big patch to small ones for each area.
      v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.
      
      Fixes: a93bf0ff ("vxlan: update skb dst pmtu on tx path")
      Fixes: 52a589d5 ("geneve: update skb dst pmtu on tx path")
      Reviewed-by: Guillaume Nault <gnault@redhat.com>
      Tested-by: Guillaume Nault <gnault@redhat.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f081042d