1. 30 Dec, 2016 4 commits
    • ath10k: recal the txpower when removing interface · d679fa1b
      Ryan Hsu authored
      The txpower is recalculated when an interface is added, to make sure
      it does not overshoot the spec. When an interface is removed, the
      txpower should be recalculated again to restore the correct value
      from the list of active interfaces.
      
      The following is one such scenario:
      	vdev0 is created as STA and connected: txpower:23
      	vdev1 is created as P2P_DEVICE for control interface: txpower:0
      	vdev2 is created as p2p go/gc interface: txpower is 21
      
      So vdev2@txpower:21 is set in firmware when vdev2 is created.
      When we tear down vdev2, the txpower needs to be recalculated and
      reset to vdev0@txpower:23, as vdev0/vdev1 are the remaining active interfaces.
      
      	ath10k_pci mac vdev 0 peer create 8c:fd:f0:01:62:98
      	ath10k_pci mac vdev_id 0 txpower 23
      	... (adding interface)
      	ath10k_pci mac vdev create 2 (add interface) type 1 subtype 3
      	ath10k_pci mac vdev_id 2 txpower 21
      	ath10k_pci mac txpower 21
      	... (removing interface)
      	ath10k_pci mac vdev 2 delete (remove interface)
      	ath10k_pci vdev 1 txpower 0
      	ath10k_pci vdev 0 txpower 23
      	ath10k_pci mac txpower 23
      Signed-off-by: Ryan Hsu <ryanhsu@qca.qualcomm.com>
      Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
    • ath10k: support dev_coredump for crash dump · 727000e6
      Arun Khandavalli authored
      Whenever firmware crashes, and both CONFIG_ATH10K_DEBUGFS and
      CONFIG_ALLOW_DEV_COREDUMP are enabled, dump information about the crash via a
      devcoredump device. Dump can be read from userspace for further analysis from:
      
      /sys/class/devcoredump/devcd*/data
      
      Until now the firmware crash dump has been provided via the fw_crash_dump
      debugfs file. Keep it available, but deprecate it and print a warning that
      the user should switch to using dev_coredump.
      
      Future improvement would be not to depend on CONFIG_ATH10K_DEBUGFS, as there
      might be systems which want to get the firmware crash dump but not enable
      debugfs. How to handle memory consumption is also something which needs to be
      taken into account.
      Signed-off-by: Arun Khandavalli <akhandav@qti.qualcomm.com>
      [kvalo@qca.qualcomm.com: rebase, fixes, improve commit log]
      Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
    • ath10k: fix incorrect txpower set by P2P_DEVICE interface · 88407beb
      Ryan Hsu authored
      Ath10k reports the phy capability that supports P2P_DEVICE interface.
      
      When we use the P2P supported wpa_supplicant to start connection, it'll
      create two interfaces, one is wlan0 (vdev_id=0) and one is P2P_DEVICE
      p2p-dev-wlan0 which is for p2p control channel (vdev_id=1).
      
      	ath10k_pci mac vdev create 0 (add interface) type 2 subtype 0
      	ath10k_add_interface: vdev_id: 0, txpower: 0, bss_power: 0
      	...
      	ath10k_pci mac vdev create 1 (add interface) type 2 subtype 1
      	ath10k_add_interface: vdev_id: 1, txpower: 0, bss_power: 0
      
      The txpower in the per-vif bss_conf is only set to a valid tx power when
      the interface is assigned a channel_ctx.
      
      But this P2P_DEVICE interface will never be used for any connection, so
      the uninitialized bss_conf.txpower=0 is assigned to arvif->txpower when
      the interface is created.
      
      The firmware's txpower configuration is per physical interface, so the
      smallest txpower of all vifs limits the tx power of the physical device,
      which causes low txpower on the other active interfaces.
      
      	wlan0: Limiting TX power to 21 (24 - 3) dBm
      	ath10k_pci mac vdev_id 0 txpower 21
      	ath10k_mac_txpower_recalc: vdev_id: 1, txpower: 0
      	ath10k_mac_txpower_recalc: vdev_id: 0, txpower: 21
      	ath10k_pci mac txpower 0
      
      This issue only happens when we use the wpa_supplicant that supports
      P2P or if we use the iw tool to create the control P2P_DEVICE interface.
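      The recalculation described above can be sketched as a minimum over the
      active interfaces that skips unset values. This is only an illustration:
      struct vif and txpower_recalc() are hypothetical names, and treating a
      txpower of 0 as "never configured" is an assumption of the sketch, not a
      statement about the driver's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-vif record; not the driver's struct. */
struct vif {
    int vdev_id;
    int txpower;   /* dBm; 0 is treated as "never configured" here */
    int active;    /* non-zero while the interface exists */
};

/* Return the lowest configured txpower among active vifs, or -1 if none. */
static int txpower_recalc(const struct vif *vifs, size_t n)
{
    int txpower = -1;

    assert(vifs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!vifs[i].active || vifs[i].txpower <= 0)
            continue;   /* skip removed vifs and unset values */
        if (txpower == -1 || vifs[i].txpower < txpower)
            txpower = vifs[i].txpower;
    }
    return txpower;
}
```

      With vdev 0 at 21 dBm and the P2P_DEVICE vdev 1 at 0, this sketch yields
      21 for the physical device instead of 0.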
      Signed-off-by: Ryan Hsu <ryanhsu@qca.qualcomm.com>
      Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
    • ath10k: fix potential memory leak in ath10k_wmi_tlv_op_pull_fw_stats() · 097e46d2
      Christian Lamparter authored
      ath10k_wmi_tlv_op_pull_fw_stats() uses tb = ath10k_wmi_tlv_parse_alloc(...)
      function, which allocates memory. If any of the three error-paths are
      taken, this tb needs to be freed.
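      A common way to guarantee that every error path releases such a table is
      a single cleanup label. The sketch below shows that pattern in plain C;
      parse_alloc(), table_free() and pull_stats() are hypothetical stand-ins,
      and the live_allocs counter exists only to make the leak observable.

```c
#include <assert.h>
#include <stdlib.h>

static int live_allocs;   /* outstanding tables, for illustration only */

/* Hypothetical stand-in for a parse helper that allocates a table. */
static const void **parse_alloc(size_t entries)
{
    const void **tb;

    assert(entries > 0);
    tb = calloc(entries, sizeof(*tb));
    if (tb)
        live_allocs++;
    return tb;
}

static void table_free(const void **tb)
{
    if (tb)
        live_allocs--;
    free(tb);
}

/* Every exit after a successful parse_alloc() goes through "out". */
static int pull_stats(const void *ev, const void *data, size_t len)
{
    const void **tb;
    int ret = 0;

    tb = parse_alloc(4);
    if (!tb)
        return -1;      /* nothing allocated yet, a plain return is fine */

    if (!ev || !data) {
        ret = -2;       /* error path: must not leak tb */
        goto out;
    }
    if (len == 0) {
        ret = -3;       /* another error path, same cleanup */
        goto out;
    }
    /* ... consume the parsed table here ... */
out:
    table_free(tb);
    return ret;
}
```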
      Signed-off-by: Christian Lamparter <chunkeey@googlemail.com>
      Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
  2. 29 Dec, 2016 5 commits
  3. 15 Dec, 2016 10 commits
    • ath10k: Avoid potential page alloc BUG_ON in tx free path · 02a9e08d
      Mohammed Shafi Shajakhan authored
      'ath10k_htt_tx_free_cont_txbuf' and 'ath10k_htt_tx_free_cont_frag_desc'
      have NULL pointer checks to avoid a crash if they are called twice,
      but these checks are currently not sufficient because the pointers are
      not set to NULL once the contiguous DMA memory allocation is freed. Fix this.
      Though this may not be hit thanks to the explicit check of the state
      variable 'tx_mem_allocated', it is good to have this addressed as well.
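      The idea can be shown in a few lines of userspace C: a NULL check only
      protects against a second free if the pointer is cleared after the first
      one. In this sketch free() stands in for the DMA free call, and struct
      tx_state and tx_free_txbuf() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

struct tx_state {
    void *txbuf;      /* stands in for the contiguous DMA buffer */
    int free_calls;   /* how many times memory was really released */
};

static void tx_free_txbuf(struct tx_state *tx)
{
    assert(tx != NULL);
    if (!tx->txbuf)
        return;           /* already freed, or never allocated */
    free(tx->txbuf);
    tx->free_calls++;
    tx->txbuf = NULL;     /* without this, the check above never triggers */
}
```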
      
      Below BUG_ON is hit when the above scenario is simulated
      with kernel debugging enabled
      
       page:f6d09a00 count:0 mapcount:-127 mapping:  (null) index:0x0
       flags: 0x40000000()
       page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
       ------------[ cut here ]------------
       kernel BUG at ./include/linux/mm.h:445!
       invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
       EIP is at put_page_testzero.part.88+0xd/0xf
       Call Trace:
        [<c118a2cc>] __free_pages+0x3c/0x40
        [<c118a30e>] free_pages+0x3e/0x50
        [<c10222b4>] dma_generic_free_coherent+0x24/0x30
        [<f8c1d9a8>] ath10k_htt_tx_free_cont_txbuf+0xf8/0x140
      
        [<f8c1e2a9>] ath10k_htt_tx_destroy+0x29/0xa0
      
        [<f8c143e0>] ath10k_core_destroy+0x60/0x80 [ath10k_core]
        [<f8acd7e9>] ath10k_pci_remove+0x79/0xa0 [ath10k_pci]
        [<c13ed7a8>] pci_device_remove+0x38/0xb0
        [<c14d3492>] __device_release_driver+0x72/0x100
        [<c14d36b7>] driver_detach+0x97/0xa0
        [<c14d29c0>] bus_remove_driver+0x40/0x80
        [<c14d427a>] driver_unregister+0x2a/0x60
        [<c13ec768>] pci_unregister_driver+0x18/0x70
        [<f8aced4f>] ath10k_pci_exit+0xd/0x2be [ath10k_pci]
        [<c1101e78>] SyS_delete_module+0x158/0x210
        [<c11b34f1>] ? __might_fault+0x41/0xa0
        [<c11b353b>] ? __might_fault+0x8b/0xa0
        [<c1001a4b>] do_fast_syscall_32+0x9b/0x1c0
        [<c178da34>] sysenter_past_esp+0x45/0x74
      Signed-off-by: Mohammed Shafi Shajakhan <mohammed@qti.qualcomm.com>
      Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
    • ath9k: Turn ath_txq_lock/unlock() into static inlines. · 5c4607eb
      Toke Høiland-Jørgensen authored
      These are one-line functions that just call spin_lock/unlock_bh(); turn
      them into static inlines to avoid the function call overhead.
      Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
    • ath9k: Introduce airtime fairness scheduling between stations · 63fefa05
      Toke Høiland-Jørgensen authored
      This reworks the ath9k driver to schedule transmissions to connected
      stations in a way that enforces airtime fairness between them. It
      accomplishes this by measuring the time spent transmitting to or
      receiving from a station at TX and RX completion, and accounting this to
      a per-station, per-QoS level airtime deficit. Then, an FQ-CoDel based
      deficit scheduler is employed at packet dequeue time, to control which
      station gets the next transmission opportunity.
      
      Airtime fairness can significantly improve the efficiency of the network
      when station rates vary. The following throughput values are from a
      simple three-station test scenario, where two stations operate at the
      highest HT20 rate, and one station at the lowest, and the scheduler is
      employed at the access point:
      
                        Before   /   After
      Fast station 1:    19.17   /   25.09 Mbps
      Fast station 2:    19.83   /   25.21 Mbps
      Slow station:       2.58   /    1.77 Mbps
      Total:             41.58   /   52.07 Mbps
      
      The benefit of airtime fairness goes up the more stations are present.
      In a 30-station test with one station artificially limited to 1 Mbps,
      we have seen aggregate throughput go from 2.14 to 17.76 Mbps.
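      To illustrate the deficit idea, here is a simplified deficit round-robin
      over airtime. It is a sketch under stated assumptions, not the driver's
      scheduler: it has no per-QoS levels and no CoDel component, and
      QUANTUM_US, struct station and the per-packet cost model are invented
      for the example.

```c
#include <assert.h>
#include <stddef.h>

#define QUANTUM_US 300   /* assumed airtime quantum per round */

struct station {
    long deficit_us;    /* remaining airtime credit */
    long airtime_us;    /* total airtime charged so far */
    long pkt_cost_us;   /* airtime one packet costs at this station's rate */
    long packets;       /* packets sent so far */
};

/* Pick the next station with positive credit. A station that has used up
 * its credit gets one quantum and the cursor moves on to the next one. */
static size_t pick_station(struct station *st, size_t n, size_t *cursor)
{
    assert(st != NULL && n > 0 && *cursor < n);
    for (;;) {
        struct station *s = &st[*cursor];

        if (s->deficit_us > 0)
            return *cursor;
        s->deficit_us += QUANTUM_US;
        *cursor = (*cursor + 1) % n;
    }
}

/* Send one packet and charge its airtime to the chosen station. */
static void transmit_one(struct station *st, size_t n, size_t *cursor)
{
    size_t i = pick_station(st, n, cursor);

    st[i].deficit_us -= st[i].pkt_cost_us;
    st[i].airtime_us += st[i].pkt_cost_us;
    st[i].packets++;
}
```

      Because credit is handed out in equal quanta, a slow station that needs
      a lot of airtime per packet has to wait several rounds before it may
      send again, so stations end up with similar airtime rather than similar
      packet counts.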
      Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
    • ath9k: define all EEPROM fields in Little Endian format · 4bca5303
      Martin Blumenstingl authored
      The ar9300_eeprom logic is already using only 8-bit (endian neutral),
      __le16 and __le32 fields to state explicitly how the values should be
      interpreted.
      All other EEPROM implementations (4k, 9287 and def) were using u16 and
      u32 fields with additional logic to swap the values (read from the
      original EEPROM) so they match the current CPU's endianness.
      
      The EEPROM format defaults to "all values are Little Endian", indicated
      by the absence of the AR5416_EEPMISC_BIG_ENDIAN in the u8 EEPMISC
      register. If we detect that the EEPROM indicates Big Endian mode
      (AR5416_EEPMISC_BIG_ENDIAN is set in the EEPMISC register) then we'll
      swap the values to convert them into Little Endian. This is done by
      activating the EEPMISC based logic in ath9k_hw_nvram_swap_data even if
      AH_NO_EEP_SWAP is set (this makes ath9k behave like the FreeBSD driver,
      which also does not have a flag to enable swapping based on the
      AR5416_EEPMISC_BIG_ENDIAN bit). Previously, this logic was only used to
      enable swapping when "current CPU endianness != EEPROM endianness".
      
      After changing all relevant fields to __le16 and __le32 sparse was used
      to check that all code which reads any of these fields uses
      le{16,32}_to_cpu.
      Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
    • ath9k: Make the EEPROM swapping check use the eepmisc register · 68fbe792
      Martin Blumenstingl authored
      There are two ways of swapping the EEPROM data in the ath9k driver:
      1) swab16 based on the first two EEPROM "magic" bytes (same for all
         EEPROM formats)
      2) field and EEPROM format specific swab16/swab32 (different for
         eeprom_def, eeprom_4k and eeprom_9287)
      
      The result of the first check was used to also enable the second swap.
      This behavior seems incorrect, since the data may only be byte-swapped
      (afterwards the data could be in the correct endianness).
      Thus we introduce a separate check based on the "eepmisc" register
      (which is part of the EEPROM data). When bit 0 is set, then the EEPROM
      format specific values are in "big endian". This is also done by the
      FreeBSD kernel, see [0] for example.
      
      This allows us to parse EEPROMs with the "correct" magic bytes but
      swapped EEPROM format specific values. These EEPROMs (mostly found in
      lantiq and broadcom based big endian MIPS based devices) only worked
      due to platform specific "hacks" which swapped the EEPROM so the
      magic was inverted, which also enabled the format specific swapping.
      With this patch the old behavior is still supported, but neither
      recommended nor needed anymore.
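      The separation of the two checks can be sketched as two independent
      decisions. This is an illustration only: struct swap_plan and
      plan_swap() are hypothetical, and comparing the eepmisc bit against the
      host endianness is an assumption about how the second decision is made.

```c
#include <stdbool.h>
#include <stdint.h>

#define EEPMISC_BIG_ENDIAN 0x01   /* bit 0, as described above */

struct swap_plan {
    bool swab16_all;    /* byte-swap every 16-bit word of the image */
    bool swap_fields;   /* format-specific swab16/swab32 of the fields */
};

/* magic_ok: the first two bytes already read as the expected magic. */
static struct swap_plan plan_swap(bool magic_ok, uint8_t eepmisc,
                                  bool host_big_endian)
{
    struct swap_plan plan;
    bool eeprom_big_endian = (eepmisc & EEPMISC_BIG_ENDIAN) != 0;

    plan.swab16_all = !magic_ok;                               /* check 1 */
    plan.swap_fields = (eeprom_big_endian != host_big_endian); /* check 2 */
    return plan;
}
```

      An image with correct magic bytes but big-endian fields on a
      little-endian host gets only the field swap, which is the case that
      previously needed platform-specific workarounds.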
      
      [0]
      https://github.com/freebsd/freebsd/blob/50719b56d9ce8d7d4beb53b16e9edb2e9a4a7a18/sys/dev/ath/ath_hal/ah_eeprom_9287.c#L351
      Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
    • ath9k: consistently use get_eeprom_rev(ah) · 9bff7428
      Martin Blumenstingl authored
      The AR5416_VER_MASK macro does the same as get_eeprom_rev, except that
      one has to know the actual EEPROM type (and provide a reference to
      it in a variable named "eep"). Additionally, the eeprom_*.c
      implementations repeated the same shifting logic multiple times to get
      the EEPROM revision, which unnecessarily duplicated get_eeprom_rev.
      
      Also use the AR5416_EEP_VER_MINOR_MASK macro where needed and introduce
      a similar macro (AR5416_EEP_VER_MAJOR_MASK) for the major version.
      Finally drop AR9287_EEP_VER_MINOR_MASK since it simply duplicates the
      already defined AR5416_EEP_VER_MINOR_MASK.
      Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
    • ath9k: replace eeprom_param EEP_MINOR_REV with get_eeprom_rev · 7d7dc538
      Martin Blumenstingl authored
      get_eeprom(ah, EEP_MINOR_REV) and get_eeprom_rev(ah) are both doing the
      same thing: returning the EEPROM revision (12 lowest bits). Make the
      code consistent by using get_eeprom_rev(ah) everywhere.
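      As a small illustration of "12 lowest bits", the revision can be
      extracted from a 16-bit version word with a mask. The macro and function
      names below are hypothetical, and treating the remaining upper 4 bits as
      the major version is an assumption of this sketch.

```c
#include <stdint.h>

#define EEP_VER_MINOR_MASK  0x0FFF   /* 12 lowest bits: the revision */
#define EEP_VER_MAJOR_SHIFT 12       /* assumed: upper 4 bits are the major */

/* Revision part of the version word. */
static uint16_t eeprom_rev(uint16_t version)
{
    return version & EEP_VER_MINOR_MASK;
}

/* Major part of the version word. */
static uint16_t eeprom_major(uint16_t version)
{
    return version >> EEP_VER_MAJOR_SHIFT;
}
```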
      Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
    • ath9k: Add an eeprom_ops callback for retrieving the eepmisc value · d8ec2e2a
      Martin Blumenstingl authored
      This allows deciding if we have to swap the EEPROM data (so it matches
      the system's native endianness) even if no byte-swapping (swab16, based on
      the first two bytes in the EEPROM) is needed.
      Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
    • ath9k: indicate that the AR9003 EEPROM template values are little endian · 291478b7
      Martin Blumenstingl authored
      The eepMisc field was not set explicitly. The default value of 0 means
      that the values in the EEPROM (template) should be interpreted as little
      endian. However, this is not clear until comparing the AR9003 code with
      the other EEPROM formats.
      To make the code easier to understand we explicitly state that the values
      are little endian - there are no functional changes with this patch.
      Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
    • ath9k: Add a #define for the EEPROM "eepmisc" endianness bit · 81a834e3
      Martin Blumenstingl authored
      This replaces a magic number with a named #define. Additionally it
      removes two "eeprom format" specific #defines for the "big endianness"
      bit which are the same on all eeprom formats.
      Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
  4. 05 Dec, 2016 2 commits
  5. 04 Dec, 2016 19 commits
    • ipv6 addrconf: Implemented enhanced DAD (RFC7527) · adc176c5
      Erik Nordmark authored
      Implemented RFC7527 Enhanced DAD.
      IPv6 duplicate address detection can fail if there is some temporary
      loopback of Ethernet frames. RFC7527 solves this by including a random
      nonce in the NS messages used for DAD, and if an NS is received with the
      same nonce it is assumed to be a looped back DAD probe and is ignored.
      RFC7527 is enabled by default. It can be disabled by setting both
      conf/{all,interface}/enhanced_dad to zero.
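      The nonce comparison can be sketched as follows. This is a minimal
      illustration: struct dad_state, ns_is_looped_back() and the 6-byte nonce
      length are assumptions of the sketch, and real code would also have to
      parse the nonce option out of the NS message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DAD_NONCE_LEN 6   /* assumed nonce size for this sketch */

struct dad_state {
    uint8_t nonce[DAD_NONCE_LEN];   /* nonce sent in our own DAD probe */
};

/* True when a received NS carries the nonce we sent, which means it is
 * our own probe looped back and should be ignored. */
static bool ns_is_looped_back(const struct dad_state *dad,
                              const uint8_t *rx_nonce, size_t rx_len)
{
    assert(dad != NULL);
    if (!rx_nonce || rx_len != DAD_NONCE_LEN)
        return false;   /* no usable nonce: treat it as a genuine NS */
    return memcmp(dad->nonce, rx_nonce, DAD_NONCE_LEN) == 0;
}
```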
      Signed-off-by: Erik Nordmark <nordmark@arista.com>
      Signed-off-by: Bob Gilligan <gilligan@arista.com>
      Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mv88e6390-batch-three' · ce84c7c6
      David S. Miller authored
      Andrew Lunn says:
      
      ====================
      mv88e6390 batch 3
      
      More patches to support the MV88e6390. This is mostly refactoring
      existing code and adding implementations for the mv88e6390.  This
      patchset sets which reserved frames are sent to the CPU and the size of
      jumbo frames that will be accepted, turns off egress rate limiting, and
      configures pause frames.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Implement mv88e6390 pause control · 3ce0e65e
      Andrew Lunn authored
      The mv88e6390 has a number of flow control registers accessed via the
      Flow Control register. Use these to set the pause control.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Refactor pause configuration · b35d322a
      Andrew Lunn authored
      The mv88e6390 has a different mechanism for configuring pause.
      Refactor the code into an ops function, and for the moment, don't add
      any mv88e6390 code yet.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Refactor egress rate limiting · ef70b111
      Andrew Lunn authored
      There are two different rate limiting configurations, depending on the
      switch generation. Refactor this into ops.
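      "Refactor this into ops" refers to the usual function-pointer table
      pattern: common code calls through a per-chip table instead of branching
      on the chip family. The sketch below is generic C with hypothetical
      names (struct chip_ops, setup_port() and so on); treating a missing op
      as "nothing to do" is an assumption of the example.

```c
#include <assert.h>
#include <stddef.h>

struct chip;

/* Hypothetical per-chip operations table. */
struct chip_ops {
    int (*egress_rate_limiting)(struct chip *chip, int port);
};

struct chip {
    const struct chip_ops *ops;
    int last_layout;   /* records which implementation ran */
};

static int older_rate_limiting(struct chip *chip, int port)
{
    (void)port;
    chip->last_layout = 1;   /* pretend: older register layout */
    return 0;
}

static int newer_rate_limiting(struct chip *chip, int port)
{
    (void)port;
    chip->last_layout = 2;   /* pretend: newer register layout */
    return 0;
}

static const struct chip_ops older_ops = {
    .egress_rate_limiting = older_rate_limiting,
};

static const struct chip_ops newer_ops = {
    .egress_rate_limiting = newer_rate_limiting,
};

/* Common code no longer needs to know the switch generation. */
static int setup_port(struct chip *chip, int port)
{
    assert(chip != NULL && chip->ops != NULL);
    if (!chip->ops->egress_rate_limiting)
        return 0;   /* optional op: nothing to configure */
    return chip->ops->egress_rate_limiting(chip, port);
}
```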
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Refactor setting of jumbo frames · 5f436666
      Andrew Lunn authored
      Some switches support jumbo frames. Refactor this code into operations
      in the ops structure.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Reserved Management frames to CPU · 6e55f698
      Andrew Lunn authored
      Older devices have a couple of registers in global2. The mv88e6390
      family has a single register in global1 behind which hides similar
      configuration. Implement an op for this.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mv88e6390-batch-two' · 7a6c5cb9
      David S. Miller authored
      Andrew Lunn says:
      
      ====================
      MV88E6390 batch two
      
      This is the second batch of patches adding support for the
      MV88e6390. They are not sufficient to make it work properly.
      
      The mv88e6390 has a much expanded set of priority maps. Refactor the
      existing code, and implement basic support for the new device.
      
      Similarly, the monitor control register has been reworked.
      
      The mv88e6390 has something odd in its EDSA tagging implementation,
      which means it is not possible to use it. So we need to use DSA
      tagging. This is the first device with EDSA support where we need to
      use DSA, and the code does not support this. So two patches refactor
      the existing code. The two different register definitions are
      separated out, and using DSA on an EDSA capable device is added.
      
      v2:
      Add port prefix
      Add helper function for 6390
      Add _IEEE_ into #defines
      Split monitor_ctrl into a number of separate ops.
      Remove 6390 code which is management, used in a later patch
      s/EGREES/EGRESS/.
      Broke up setup_port_dsa() and set_port_dsa() into a number of ops
      
      v3:
      Verify mandatory ops for port setup
      Don't set ether type for DSA port.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Refactor CPU and DSA port setup · 56995cbc
      Andrew Lunn authored
      Older chips only support DSA tagging. Newer chips have both DSA and
      EDSA tagging. Refactor the code by adding port functions for setting the
      frame mode, the egress mode, and whether to forward unknown frames.
      
      This results in the helper mv88e6xxx_6065_family() becoming unused, so
      remove it.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      v3:
      Verify mandatory ops for port setup
      Don't set ether type for DSA port.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Move the tagging protocol into info · 443d5a1b
      Andrew Lunn authored
      Older chips support a single tagging protocol, DSA. New chips support
      both DSA and EDSA, an enhanced version. Having both as an option
      changes the register layouts. Up until now, it has been assumed that
      if EDSA is supported, it will be used. Hence the register layout has
      been determined by which protocol should be used. However, mv88e6390
      has a different implementation of EDSA, which requires us to use
      DSA tagging. Hence separate the selection of the protocol from the
      register layout.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Monitor and Management tables · 33641994
      Andrew Lunn authored
      The mv88e6390 changes the monitor control register into the Monitor
      and Management control, which is an indirection register to various
      registers.
      
      Add ops to set the CPU port and the ingress/egress port for both
      register layouts, to global1.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Implement mv88e6390 tag remap · ef0a7318
      Andrew Lunn authored
      The mv88e6390 does not have the two registers to set the frame
      priority map. Instead it has an indirection register for setting a
      number of different priority maps. Refactor the old code into a
      function, implement the mv88e6390 version, and use an op to call the
      right one.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'fib-notifier-event-replay' · 69248719
      David S. Miller authored
      Jiri Pirko says:
      
      ====================
      ipv4: fib: Replay events when registering FIB notifier
      
      Ido says:
      
      In kernel 4.9 the switchdev-specific FIB offload mechanism was replaced
      by a new FIB notification chain to which modules could register in order
      to be notified about the addition and deletion of FIB entries. The
      motivation for this change was that switchdev drivers need to be able to
      reflect the entire FIB table and not only FIBs configured on top of the
      port netdevs themselves. This is useful in case of in-band management.
      
      The fundamental problem with this approach is that upon registration
      listeners lose all the information previously sent in the chain and
      thus have an incomplete view of the FIB tables, which can result in
      packet loss. This patchset fixes that by dumping the FIB tables and
      replaying notifications previously sent in the chain for the registered
      notification block.
      
      The entire dump process is done under RCU and thus the FIB notification
      chain is converted to be atomic. The listeners are modified accordingly.
      This is done in the first eight patches.
      
      The ninth patch adds a change sequence counter to ensure the integrity
      of the FIB dump. The last patch adds the dump itself to the FIB chain
      registration function and modifies existing listeners to pass a callback
      to be executed in case dump was inconsistent.
      
      ---
      v3->v4:
      - Register the notification block after the dump and protect it using
        the change sequence counter (Hannes Frederic Sowa).
      - Since we now integrate the dump into the registration function, drop
        the sysctl to set maximum number of retries and instead set it to a
        fixed number. Let's see if it's really a problem before adding something
        we can never remove.
      - For the same reason, dump FIB tables for all net namespaces.
      - Add a comment regarding guarantees provided by mutex semantics.
      
      v2->v3:
      - Add sysctl to set the number of FIB dump retries (Hannes Frederic Sowa).
      - Read the sequence counter under RTNL to ensure synchronization
        between the dump process and other processes changing the routing
        tables (Hannes Frederic Sowa).
      - Pass a callback to the dump function to be executed prior to a retry.
      - Limit the dump to a single net namespace.
      
      v1->v2:
      - Add a sequence counter to ensure the integrity of the FIB dump
        (David S. Miller, Hannes Frederic Sowa).
      - Protect notifications from re-ordering in listeners by using an
        ordered workqueue (Hannes Frederic Sowa).
      - Introduce fib_info_hold() (Jiri Pirko).
      - Relieve rocker from the need to invoke the FIB dump by registering
        to the FIB notification chain prior to ports creation.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: fib: Replay events when registering FIB notifier · c3852ef7
      Ido Schimmel authored
      Commit b90eb754 ("fib: introduce FIB notification infrastructure")
      introduced a new notification chain to notify listeners (e.g., switchdev
      drivers) about addition and deletion of routes.
      
      However, upon registration to the chain the FIB tables can already be
      populated, which means potential listeners will have an incomplete view
      of the tables.
      
      Solve that by dumping the FIB tables and replaying the events to the
      passed notification block. The dump itself is done using RCU in order
      not to starve consumers that need RTNL to make progress.
      
      The integrity of the dump is ensured by reading the FIB change sequence
      counter before and after the dump under RTNL. This allows us to avoid
      the problematic situation in which the dumping process sends an ENTRY_ADD
      notification following an ENTRY_DEL generated by another process holding
      RTNL.
      
      Callers of the registration function may pass a callback that is
      executed in case the dump was inconsistent with current FIB tables.
      
      The number of retries until a consistent dump is achieved is set to a
      fixed number to prevent callers from looping for long periods of time.
      In case current limit proves to be problematic in the future, it can be
      easily converted to be configurable using a sysctl.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: fib: Allow for consistent FIB dumping · cacaad11
      Ido Schimmel authored
      The next patch will enable listeners of the FIB notification chain to
      request a dump of the FIB tables. However, since RTNL isn't taken during
      the dump, it's possible for the FIB tables to change mid-dump, which
      will result in inconsistency between the listener's table and the
      kernel's.
      
      Allow listeners to know about changes that occurred mid-dump, by adding
      a change sequence counter to each net namespace. The counter is
      incremented just before a notification is sent in the FIB chain.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: fib: Convert FIB notification chain to be atomic · d3f706f6
      Ido Schimmel authored
      In order not to hold RTNL for long periods of time we're going to dump
      the FIB tables using RCU.
      
      Convert the FIB notification chain to be atomic, as we can't block in
      RCU critical sections.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rocker: Register FIB notifier before creating ports · 17f8be7d
      Ido Schimmel authored
      We can miss FIB notifications sent between the time the ports were
      created and the FIB notification block registered.
      
      Instead of receiving these notifications only when they are replayed for
      the FIB notification block during registration, just register the
      notification block before the ports are created.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rocker: Implement FIB offload in deferred work · db701955
      Ido Schimmel authored
      Convert rocker to offload FIBs in deferred work in a similar fashion to
      mlxsw, which was converted in the previous commits.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rocker: Create an ordered workqueue for FIB offload · c1bb279c
      Ido Schimmel authored
      As explained in the previous commits, we need to process FIB entry
      addition / deletion events in FIFO order, or otherwise we can have a
      mismatch between the kernel's FIB table and the device's.
      
      Create an ordered workqueue for rocker to which these work items will be
      submitted.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>