1. 27 Nov, 2021 27 commits
    • Merge branch 'af_unix-replace-unix_table_lock-with-per-hash-locks' · d40ce48c
      Jakub Kicinski authored
      Kuniyuki Iwashima says:
      
      ====================
      af_unix: Replace unix_table_lock with per-hash locks.
      
      The hash table of AF_UNIX sockets is protected by a single big lock,
      unix_table_lock.  This series replaces it with small per-hash locks.
      
      1st -  2nd : Misc refactoring
      3rd -  8th : Separate BSD/abstract address logics
      9th - 11th : Prep to save a hash in each socket
      12th       : Replace the big lock
      13th       : Speed up autobind()
      
      Note to maintainers:
      The 12th patch adds two kinds of Sparse warnings on patchwork:
      
        about unix_table_double_lock/unlock()
          We can avoid this by adding two apparent acquires/releases annotations,
          but there are the same kinds of warnings about unix_state_double_lock().
      
        about unix_next_socket() and unix_seq_stop() (/proc/net/unix)
          This is because Sparse does not understand logic in unix_next_socket(),
          which leaves a spin lock held until it returns NULL.
          Also, tcp_seq_stop() causes a warning for the same reason.
      
      These warnings seem reasonable, but let me know if there is any better way.
      Please see [0] for details.
      
      [0]: https://lore.kernel.org/netdev/20211117001611.74123-1-kuniyu@amazon.co.jp/
      ====================
      
      Link: https://lore.kernel.org/r/20211124021431.48956-1-kuniyu@amazon.co.jp
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d40ce48c
    • af_unix: Relax race in unix_autobind(). · 9acbc584
      Kuniyuki Iwashima authored
      When we bind an AF_UNIX socket without a name specified, the kernel selects
      an available one from 0x00000 to 0xFFFFF.  unix_autobind() starts searching
      from a number in the 'static' variable and increments it after acquiring
      two locks.
      
      If multiple processes autobind at the same time, they contend for the same
      lock and check whether a socket in the hash list already has the same name.
      If not, one process takes the name, and all the others retry with the
      _next_ number (or a later one, since other processes may have incremented
      it in the meantime).  The more sockets we autobind in parallel, the longer
      the latency gets.  We can avoid such a race by starting the search from a
      random number.
      
      The histograms below show latency in unix_autobind() while 64 CPUs
      simultaneously autobind 1024 sockets each.
      
        Without this patch:
      
           usec          : count     distribution
              0          : 1176     |***                                     |
              2          : 3655     |***********                             |
              4          : 4094     |*************                           |
              6          : 3831     |************                            |
              8          : 3829     |************                            |
              10         : 3844     |************                            |
              12         : 3638     |***********                             |
              14         : 2992     |*********                               |
              16         : 2485     |*******                                 |
              18         : 2230     |*******                                 |
              20         : 2095     |******                                  |
              22         : 1853     |*****                                   |
              24         : 1827     |*****                                   |
              26         : 1677     |*****                                   |
              28         : 1473     |****                                    |
              30         : 1573     |*****                                   |
              32         : 1417     |****                                    |
              34         : 1385     |****                                    |
              36         : 1345     |****                                    |
              38         : 1344     |****                                    |
              40         : 1200     |***                                     |
      
        With this patch:
      
           usec          : count     distribution
              0          : 1855     |******                                  |
              2          : 6464     |*********************                   |
              4          : 9936     |********************************        |
              6          : 12107    |****************************************|
              8          : 10441    |**********************************      |
              10         : 7264     |***********************                 |
              12         : 4254     |**************                          |
              14         : 2538     |********                                |
              16         : 1596     |*****                                   |
              18         : 1088     |***                                     |
              20         : 800      |**                                      |
              22         : 670      |**                                      |
              24         : 601      |*                                       |
              26         : 562      |*                                       |
              28         : 525      |*                                       |
              30         : 446      |*                                       |
              32         : 378      |*                                       |
              34         : 337      |*                                       |
              36         : 317      |*                                       |
              38         : 314      |*                                       |
              40         : 298      |                                        |
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      9acbc584
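The search described in this commit can be illustrated in user space. The following is only a sketch, not the kernel code: `pick_autobind_name()`, the `name_in_use_fn` callback, and `first_three_taken()` are hypothetical names, and the caller is assumed to supply the starting number (a random value in the patched behaviour, a shared counter before it).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define AUTOBIND_MASK 0xFFFFFu /* names range over 0x00000..0xFFFFF */

/* Hypothetical predicate: returns true if the given number is already taken. */
typedef bool (*name_in_use_fn)(uint32_t num);

/*
 * Search for a free name, starting just after `start` and wrapping around.
 * On success, writes five hex digits plus a NUL into `out` and returns 0.
 * Returns -1 if every name in the range is in use.
 */
static int pick_autobind_name(uint32_t start, name_in_use_fn in_use, char out[6])
{
    uint32_t last = start & AUTOBIND_MASK;
    uint32_t num = last;

    assert(in_use != NULL);

    do {
        num = (num + 1) & AUTOBIND_MASK;
        if (!in_use(num)) {
            snprintf(out, 6, "%05x", (unsigned int)num);
            return 0;
        }
    } while (num != last); /* stop after one full lap */

    return -1;
}

/* Example predicate for illustration: pretend 0x00000..0x00002 are taken. */
static bool first_three_taken(uint32_t num)
{
    return num <= 2;
}
```

If every caller begins at the same shared counter, concurrent callers keep colliding on the same candidates. With different random starting points they mostly probe different names, which is the effect the histograms above measure.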
    • af_unix: Replace the big lock with small locks. · afd20b92
      Kuniyuki Iwashima authored
      The hash table of AF_UNIX sockets is protected by a single lock.  This
      patch replaces it with per-hash locks.
      
      The effect is noticeable when we handle multiple sockets simultaneously.
      Here is a test result on an EC2 c5.24xlarge instance.  It shows latency
      (under 10us only) in unix_insert_unbound_socket() while 64 CPUs each create
      1024 sockets in parallel.
      
        Without this patch:
      
           nsec          : count     distribution
              0          : 179      |                                        |
              500        : 3021     |*********                               |
              1000       : 6271     |*******************                     |
              1500       : 6318     |*******************                     |
              2000       : 5828     |*****************                       |
              2500       : 5124     |***************                         |
              3000       : 4426     |*************                           |
              3500       : 3672     |***********                             |
              4000       : 3138     |*********                               |
              4500       : 2811     |********                                |
              5000       : 2384     |*******                                 |
              5500       : 2023     |******                                  |
              6000       : 1954     |*****                                   |
              6500       : 1737     |*****                                   |
              7000       : 1749     |*****                                   |
              7500       : 1520     |****                                    |
              8000       : 1469     |****                                    |
              8500       : 1394     |****                                    |
              9000       : 1232     |***                                     |
              9500       : 1138     |***                                     |
              10000      : 994      |***                                     |
      
        With this patch:
      
           nsec          : count     distribution
              0          : 1634     |****                                    |
              500        : 13170    |****************************************|
              1000       : 13156    |*************************************** |
              1500       : 9010     |***************************             |
              2000       : 6363     |*******************                     |
              2500       : 4443     |*************                           |
              3000       : 3240     |*********                               |
              3500       : 2549     |*******                                 |
              4000       : 1872     |*****                                   |
              4500       : 1504     |****                                    |
              5000       : 1247     |***                                     |
              5500       : 1035     |***                                     |
              6000       : 889      |**                                      |
              6500       : 744      |**                                      |
              7000       : 634      |*                                       |
              7500       : 498      |*                                       |
              8000       : 433      |*                                       |
              8500       : 355      |*                                       |
              9000       : 336      |*                                       |
              9500       : 284      |                                        |
              10000      : 243      |                                        |
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      afd20b92
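The idea of per-hash locks can be sketched in user space with C11 atomics. This is a minimal illustration, not the kernel implementation: `TABLE_SIZE`, the `bucket_*` helpers, and the ascending lock order in `bucket_double_lock()` are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TABLE_SIZE 256 /* illustrative bucket count, not the kernel's value */

/* One spin lock per hash bucket instead of one lock for the whole table. */
static atomic_flag bucket_locks[TABLE_SIZE];

static void table_init(void)
{
    for (unsigned int i = 0; i < TABLE_SIZE; i++)
        atomic_flag_clear(&bucket_locks[i]);
}

/* Returns true if the lock was acquired without waiting. */
static bool bucket_trylock(unsigned int hash)
{
    return !atomic_flag_test_and_set_explicit(&bucket_locks[hash % TABLE_SIZE],
                                              memory_order_acquire);
}

static void bucket_lock(unsigned int hash)
{
    while (!bucket_trylock(hash))
        ; /* spin until the holder releases the bucket */
}

static void bucket_unlock(unsigned int hash)
{
    atomic_flag_clear_explicit(&bucket_locks[hash % TABLE_SIZE],
                               memory_order_release);
}

/* Take two bucket locks in ascending index order to avoid ABBA deadlock. */
static void bucket_double_lock(unsigned int h1, unsigned int h2)
{
    h1 %= TABLE_SIZE;
    h2 %= TABLE_SIZE;
    if (h1 == h2) {
        bucket_lock(h1);
        return;
    }
    if (h1 > h2) {
        unsigned int tmp = h1;
        h1 = h2;
        h2 = tmp;
    }
    bucket_lock(h1);
    bucket_lock(h2);
}

static void bucket_double_unlock(unsigned int h1, unsigned int h2)
{
    h1 %= TABLE_SIZE;
    h2 %= TABLE_SIZE;
    bucket_unlock(h1);
    if (h1 != h2)
        bucket_unlock(h2);
}
```

Threads that touch different buckets no longer wait on each other. Only operations that move a socket between two buckets need both locks, and a fixed acquisition order keeps two such operations from deadlocking.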
    • af_unix: Save hash in sk_hash. · e6b4b873
      Kuniyuki Iwashima authored
      To replace unix_table_lock with per-hash locks in the next patch, we need
      to save a hash in each socket because /proc/net/unix or BPF prog iterate
      sockets while holding a hash table lock and release it later in a different
      function.
      
      Currently, we store a real/pseudo hash in struct unix_address.  However, we
      do not allocate it for unbound sockets, nor should we do so just for that.
      For this purpose, we can use sk_hash.  Then, we no longer use the hash field
      in struct unix_address and can remove it.
      
      Also, this patch does
        - rename unix_insert_socket() to unix_insert_unbound_socket()
        - remove the redundant list argument from __unix_insert_socket() and
           unix_insert_unbound_socket()
        - use 'unsigned int' instead of 'unsigned' in __unix_set_addr_hash()
        - remove 'inline' from unix_remove_socket() and
           unix_insert_unbound_socket().
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e6b4b873
    • af_unix: Add helpers to calculate hashes. · f452be49
      Kuniyuki Iwashima authored
      This patch adds three helper functions that calculate hashes for unbound
      sockets and bound sockets with BSD/abstract addresses.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f452be49
    • af_unix: Remove UNIX_ABSTRACT() macro and test sun_path[0] instead. · 5ce7ab49
      Kuniyuki Iwashima authored
      In BSD and abstract address cases, we store sockets in the hash table with
      keys between 0 and UNIX_HASH_SIZE - 1.  However, the hash saved in a socket
      varies depending on its address type; sockets with BSD addresses always
      have UNIX_HASH_SIZE in their unix_sk(sk)->addr->hash.
      
      This is just for the UNIX_ABSTRACT() macro used to check the address type.
      The difference of the saved hashes comes from the first byte of the address
      in the first place.  So, we can test it directly.
      
      Then we can keep a real hash in each socket and replace unix_table_lock
      with per-hash locks in the later patch.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      5ce7ab49
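Testing the first byte of the address directly can be shown with the user-space `struct sockaddr_un`. This is a sketch only: `addr_is_abstract()` is a hypothetical helper, and it assumes the address carries at least one byte of `sun_path` (that is, it is not an unnamed address).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * An abstract address starts with a NUL byte in sun_path, while a
 * BSD (pathname) address starts with the first character of the path.
 * Assumes the address has at least one byte of sun_path.
 */
static bool addr_is_abstract(const struct sockaddr_un *sun)
{
    assert(sun != NULL);
    return sun->sun_path[0] == '\0';
}
```

Because the address type is recoverable from the address itself, no sentinel value has to be stored in the hash field to distinguish the two cases.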
    • af_unix: Allocate unix_address in unix_bind_(bsd|abstract)(). · 12f21c49
      Kuniyuki Iwashima authored
      To terminate address with '\0' in unix_bind_bsd(), we add
      unix_create_addr() and call it in unix_bind_bsd() and unix_bind_abstract().
      
      Also, unix_bind_abstract() does not return -EEXIST.  Only
      kern_path_create() and vfs_mknod() in unix_bind_bsd() can return it,
      so we move the last error check in unix_bind() to unix_bind_bsd().
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      12f21c49
    • af_unix: Remove unix_mkname(). · 5c32a3ed
      Kuniyuki Iwashima authored
      This patch removes unix_mkname() and postpones calculating a hash to
      unix_bind_abstract().  Some BSD-specific logic still remains in
      unix_bind(), but the next patch moves it into unix_bind_bsd().
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      5c32a3ed
    • af_unix: Copy unix_mkname() into unix_find_(bsd|abstract)(). · d2d8c9fd
      Kuniyuki Iwashima authored
      We should not call unix_mkname() before unix_find_other() and instead do
      the same thing where necessary based on the address type:
      
        - terminating the address with '\0' in unix_find_bsd()
        - calculating the hash in unix_find_abstract().
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d2d8c9fd
    • af_unix: Cut unix_validate_addr() out of unix_mkname(). · b8a58aa6
      Kuniyuki Iwashima authored
      unix_mkname() tests socket address length and family and does some
      processing based on the address type.  It is called at an early stage,
      and therefore some of its work is redundant and can end up wasted.
      
      The address length/family tests are done twice in unix_bind().  Also, the
      address type is rechecked later in unix_bind() and unix_find_other(), where
      we can do the same processing.  Moreover, in the BSD address case, the hash
      is set to 0 but never used, which is confusing.
      
      This patch moves the address tests out of unix_mkname(), and the following
      patches move the other part into appropriate places and remove
      unix_mkname() finally.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b8a58aa6
    • af_unix: Return an error as a pointer in unix_find_other(). · aed26f55
      Kuniyuki Iwashima authored
      We can return an error as a pointer and need not pass an additional
      argument to unix_find_other().
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      aed26f55
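Returning an error as a pointer follows the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() convention. The sketch below re-creates those helpers in user space for illustration; `struct peer` and `find_peer()` are hypothetical stand-ins and do not mirror unix_find_other().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in the top, never-valid range of pointers. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline bool IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct peer {
    int id;
};

static struct peer known_peer = { .id = 7 };

/* Hypothetical lookup: returns the object, or an encoded errno on failure. */
static struct peer *find_peer(int id)
{
    if (id < 0)
        return ERR_PTR(-EINVAL);
    if (id != known_peer.id)
        return ERR_PTR(-ECONNREFUSED);
    return &known_peer;
}
```

A caller checks `IS_ERR(p)` and, on failure, recovers the code with `PTR_ERR(p)`, so no separate `int *error` out-parameter is needed.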
    • af_unix: Factorise unix_find_other() based on address types. · fa39ef0e
      Kuniyuki Iwashima authored
      As done in the commit fa42d910 ("unix_bind(): take BSD and abstract
      address cases into new helpers"), this patch moves BSD and abstract address
      cases from unix_find_other() into unix_find_bsd() and unix_find_abstract().
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fa39ef0e
    • af_unix: Pass struct sock to unix_autobind(). · f7ed31f4
      Kuniyuki Iwashima authored
      We do not use struct socket in unix_autobind() and pass struct sock to
      unix_bind_bsd() and unix_bind_abstract().  Let's pass it to unix_autobind()
      as well.
      
      Also, this patch fixes these errors by checkpatch.pl.
      
        ERROR: do not use assignment in if condition
        #1795: FILE: net/unix/af_unix.c:1795:
        +	if (test_bit(SOCK_PASSCRED, &sock->flags) && !u->addr
      
        CHECK: Logical continuations should be on the previous line
        #1796: FILE: net/unix/af_unix.c:1796:
        +	if (test_bit(SOCK_PASSCRED, &sock->flags) && !u->addr
        +	    && (err = unix_autobind(sock)) != 0)
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f7ed31f4
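The two checkpatch.pl complaints quoted above are usually resolved by hoisting the assignment out of the condition. The following user-space sketch shows that general shape only; `prepare_send()`, its boolean parameters, and the stub callbacks are hypothetical and are not the code from net/unix/af_unix.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stub callbacks for illustration. */
static int autobind_ok(void)
{
    return 0;
}

static int autobind_fail(void)
{
    return -ENOMEM;
}

/*
 * Flagged style (assignment inside the condition, leading "&&"):
 *
 *     if (passcred && !bound
 *         && (err = autobind()) != 0)
 *             return err;
 *
 * Preferred style: test first, then assign and check on separate lines.
 */
static int prepare_send(bool passcred, bool bound, int (*autobind)(void))
{
    int err;

    assert(autobind != NULL);

    if (passcred && !bound) {
        err = autobind();
        if (err)
            return err;
    }

    return 0;
}
```

Both forms call `autobind()` under the same conditions; the rewrite only separates the side effect from the test so that the control flow is easier to read.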
    • af_unix: Use offsetof() instead of sizeof(). · 755662ce
      Kuniyuki Iwashima authored
      The length of the AF_UNIX socket address contains an offset to the member
      sun_path of struct sockaddr_un.
      
      Currently, the preceding member is just sun_family, and its type is
      sa_family_t, which resolves to short.  Therefore, the offset is represented
      by sizeof(short).  However, this is unclear and fragile against changes in
      struct sockaddr_storage or sockaddr_un.
      
      This commit makes it clear and robust by rewriting sizeof() with
      offsetof().
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      755662ce
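The same point applies to user-space code that computes the length passed to bind() or connect(). A minimal sketch, where `unix_addr_len()` is a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Address length for a pathname socket: the offset of sun_path within
 * struct sockaddr_un, plus the path and its terminating NUL.
 * offsetof() stays correct even if members are added before sun_path,
 * whereas a hard-coded sizeof(short) would not.
 */
static socklen_t unix_addr_len(const char *path)
{
    assert(path != NULL);
    return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + strlen(path) + 1);
}
```

On Linux the result happens to equal `sizeof(sa_family_t) + strlen(path) + 1`, but the `offsetof()` form states the intent directly.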
    • bridge: use __set_bit in __br_vlan_set_default_pvid · 442b03c3
      Xin Long authored
      The same optimization as the one in commit cc0be1ad ("net:
      bridge: Slightly optimize 'find_portno()'") is needed for the
      'changed' bitmap in __br_vlan_set_default_pvid().
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Link: https://lore.kernel.org/r/4e35f415226765e79c2a11d2c96fbf3061c486e2.1637782773.git.lucien.xin@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      442b03c3
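In the kernel, set_bit() is an atomic read-modify-write and __set_bit() is its non-atomic counterpart. The user-space sketch below mimics that difference with hypothetical helpers. It assumes the non-atomic variant is used only on a bitmap that no other thread can reach, which is the situation where the cheaper form is safe.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Non-atomic: a plain read-modify-write, fine for a private bitmap. */
static void set_bit_nonatomic(size_t nr, unsigned long *map)
{
    map[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

/* Atomic: needed only when other threads may update the same word. */
static void set_bit_atomic(size_t nr, _Atomic unsigned long *map)
{
    atomic_fetch_or(&map[nr / BITS_PER_LONG], 1UL << (nr % BITS_PER_LONG));
}

static bool test_bit_nonatomic(size_t nr, const unsigned long *map)
{
    return (map[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}
```

Both setters leave the same bits in the bitmap; the non-atomic one simply avoids the cost of an atomic instruction when no concurrent writer exists.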
    • net: ethtool: set a default driver name · bde3b0fd
      Tonghao Zhang authored
      For netdevs (e.g. ifb, bareudp) that do not support ethtool ops
      (e.g. .get_drvinfo), we can use the rtnl kind as a default name.
      
      An ifb netdev may be created with a name prefix other than ifbX.
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Hao Chen <chenhao288@hisilicon.com>
      Cc: Heiner Kallweit <hkallweit1@gmail.com>
      Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
      Cc: Danielle Ratson <danieller@nvidia.com>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Link: https://lore.kernel.org/r/20211125163049.84970-1-xiangxia.m.yue@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      bde3b0fd
    • Merge branch 'selftests-net-bridge-vlan-multicast-tests' · c2e0cf08
      Jakub Kicinski authored
      Nikolay Aleksandrov says:
      
      ====================
      selftests: net: bridge: vlan multicast tests
      
      This patch-set adds selftests for the new vlan multicast options that
      were recently added. Most of the tests check for default values,
      changing options and try to verify that the changes actually take
      effect. The last test checks if the dependency between vlan_filtering
      and mcast_vlan_snooping holds. The rest are pretty self-explanatory.
      
      TEST: Vlan multicast snooping enable                                [ OK ]
      TEST: Vlan global options existence                                 [ OK ]
      TEST: Vlan mcast_snooping global option default value               [ OK ]
      TEST: Vlan 10 multicast snooping control                            [ OK ]
      TEST: Vlan mcast_querier global option default value                [ OK ]
      TEST: Vlan 10 multicast querier enable                              [ OK ]
      TEST: Vlan 10 tagged IGMPv2 general query sent                      [ OK ]
      TEST: Vlan 10 tagged MLD general query sent                         [ OK ]
      TEST: Vlan mcast_igmp_version global option default value           [ OK ]
      TEST: Vlan mcast_mld_version global option default value            [ OK ]
      TEST: Vlan 10 mcast_igmp_version option changed to 3                [ OK ]
      TEST: Vlan 10 tagged IGMPv3 general query sent                      [ OK ]
      TEST: Vlan 10 mcast_mld_version option changed to 2                 [ OK ]
      TEST: Vlan 10 tagged MLDv2 general query sent                       [ OK ]
      TEST: Vlan mcast_last_member_count global option default value      [ OK ]
      TEST: Vlan mcast_last_member_interval global option default value   [ OK ]
      TEST: Vlan 10 mcast_last_member_count option changed to 3           [ OK ]
      TEST: Vlan 10 mcast_last_member_interval option changed to 200      [ OK ]
      TEST: Vlan mcast_startup_query_interval global option default value   [ OK ]
      TEST: Vlan mcast_startup_query_count global option default value    [ OK ]
      TEST: Vlan 10 mcast_startup_query_interval option changed to 100    [ OK ]
      TEST: Vlan 10 mcast_startup_query_count option changed to 3         [ OK ]
      TEST: Vlan mcast_membership_interval global option default value    [ OK ]
      TEST: Vlan 10 mcast_membership_interval option changed to 200       [ OK ]
      TEST: Vlan 10 mcast_membership_interval mdb entry expire            [ OK ]
      TEST: Vlan mcast_querier_interval global option default value       [ OK ]
      TEST: Vlan 10 mcast_querier_interval option changed to 100          [ OK ]
      TEST: Vlan 10 mcast_querier_interval expire after outside query     [ OK ]
      TEST: Vlan mcast_query_interval global option default value         [ OK ]
      TEST: Vlan 10 mcast_query_interval option changed to 200            [ OK ]
      TEST: Vlan mcast_query_response_interval global option default value   [ OK ]
      TEST: Vlan 10 mcast_query_response_interval option changed to 200   [ OK ]
      TEST: Port vlan 10 option mcast_router default value                [ OK ]
      TEST: Port vlan 10 mcast_router option changed to 2                 [ OK ]
      TEST: Flood unknown vlan multicast packets to router port only      [ OK ]
      TEST: Disable multicast vlan snooping when vlan filtering is disabled   [ OK ]
      ====================
      
      Link: https://lore.kernel.org/r/20211125140858.3639139-1-razor@blackwall.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c2e0cf08
    • selftests: net: bridge: add test for vlan_filtering dependency · f5a9dd58
      Nikolay Aleksandrov authored
      Add a test for dependency of mcast_vlan_snooping on vlan_filtering. If
      vlan_filtering gets disabled, then mcast_vlan_snooping must be
      automatically disabled as well.
      
      TEST: Disable multicast vlan snooping when vlan filtering is disabled   [ OK ]
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f5a9dd58
    • selftests: net: bridge: add vlan mcast_router tests · 2cd67a4e
      Nikolay Aleksandrov authored
      Add tests for the new per-port/vlan mcast_router option and verify that
      unknown multicast packets are flooded only to router ports.
      
      TEST: Port vlan 10 option mcast_router default value                [ OK ]
      TEST: Port vlan 10 mcast_router option changed to 2                 [ OK ]
      TEST: Flood unknown vlan multicast packets to router port only      [ OK ]
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2cd67a4e
    • selftests: net: bridge: add vlan mcast query and query response interval tests · b4ce7b95
      Nikolay Aleksandrov authored
      Add tests which change the new per-vlan mcast_query_interval and verify
      the new value is in effect.  Also add a test to change
      mcast_query_response_interval's value.
      
      TEST: Vlan mcast_query_interval global option default value         [ OK ]
      TEST: Vlan 10 mcast_query_interval option changed to 200            [ OK ]
      TEST: Vlan mcast_query_response_interval global option default value   [ OK ]
      TEST: Vlan 10 mcast_query_response_interval option changed to 200   [ OK ]
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b4ce7b95
    • selftests: net: bridge: add vlan mcast_querier_interval tests · 4d8610ee
      Nikolay Aleksandrov authored
      Add tests which change the new per-vlan mcast_querier_interval and
      verify that it is taken into account when an outside querier is present.
      
      TEST: Vlan mcast_querier_interval global option default value       [ OK ]
      TEST: Vlan 10 mcast_querier_interval option changed to 100          [ OK ]
      TEST: Vlan 10 mcast_querier_interval expire after outside query     [ OK ]
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      4d8610ee
    • selftests: net: bridge: add vlan mcast_membership_interval test · a45fe974
      Nikolay Aleksandrov authored
      Add a test which changes the new per-vlan mcast_membership_interval and
      verifies that a newly learned mdb entry would expire in that interval.
      
      TEST: Vlan mcast_membership_interval global option default value    [ OK ]
      TEST: Vlan 10 mcast_membership_interval option changed to 200       [ OK ]
      TEST: Vlan 10 mcast_membership_interval mdb entry expire            [ OK ]
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a45fe974
    • selftests: net: bridge: add vlan mcast_startup_query_count/interval tests · bdf1b2c0
      Nikolay Aleksandrov authored
      Add tests which change the new per-vlan startup query count/interval
      options and verify the proper number of queries are sent in the expected
      interval.
      
      TEST: Vlan mcast_startup_query_interval global option default value   [ OK ]
      TEST: Vlan mcast_startup_query_count global option default value    [ OK ]
      TEST: Vlan 10 mcast_startup_query_interval option changed to 100    [ OK ]
      TEST: Vlan 10 mcast_startup_query_count option changed to 3         [ OK ]
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      bdf1b2c0
    • selftests: net: bridge: add vlan mcast_last_member_count/interval tests · 3825f1fb
      Nikolay Aleksandrov authored
      Add tests which verify the default values of mcast_last_member_count and
      mcast_last_member_interval and also try to change them.
      
      TEST: Vlan mcast_last_member_count global option default value      [ OK ]
      TEST: Vlan mcast_last_member_interval global option default value   [ OK ]
      TEST: Vlan 10 mcast_last_member_count option changed to 3           [ OK ]
      TEST: Vlan 10 mcast_last_member_interval option changed to 200      [ OK ]
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      3825f1fb
    • selftests: net: bridge: add vlan mcast igmp/mld version tests · 2b75e9dd
      Nikolay Aleksandrov authored
      Add tests which change the new per-vlan IGMP/MLD versions and verify
      that proper tagged general query packets are sent.
      
      TEST: Vlan mcast_igmp_version global option default value           [ OK ]
      TEST: Vlan mcast_mld_version global option default value            [ OK ]
      TEST: Vlan 10 mcast_igmp_version option changed to 3                [ OK ]
      TEST: Vlan 10 tagged IGMPv3 general query sent                      [ OK ]
      TEST: Vlan 10 mcast_mld_version option changed to 2                 [ OK ]
      TEST: Vlan 10 tagged MLDv2 general query sent                       [ OK ]
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2b75e9dd
    • selftests: net: bridge: add vlan mcast querier test · dee2cdc0
      Nikolay Aleksandrov authored
      Add a test to try the new global vlan mcast_querier control and also
      verify that tagged general query packets are properly generated when
      querier is enabled for a single vlan.
      
      TEST: Vlan mcast_querier global option default value                [ OK ]
      TEST: Vlan 10 multicast querier enable                              [ OK ]
      TEST: Vlan 10 tagged IGMPv2 general query sent                      [ OK ]
      TEST: Vlan 10 tagged MLD general query sent                         [ OK ]
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      dee2cdc0
    • selftests: net: bridge: add vlan mcast snooping control test · 71ae450f
      Nikolay Aleksandrov authored
      Add the first test for bridge per-vlan multicast snooping, which checks
      that control of the global and per-vlan options works as expected; joins
      and leaves are tested at each option value.
      
      TEST: Vlan multicast snooping enable                                [ OK ]
      TEST: Vlan global options existence                                 [ OK ]
      TEST: Vlan mcast_snooping global option default value               [ OK ]
      TEST: Vlan 10 multicast snooping control                            [ OK ]
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      71ae450f
  2. 26 Nov, 2021 13 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 93d5404e
      Jakub Kicinski authored
      drivers/net/ipa/ipa_main.c
        8afc7e47 ("net: ipa: separate disabling setup from modem stop")
        76b5fbcd ("net: ipa: kill ipa_modem_init()")
      
      Duplicated include, drop one.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      93d5404e
    • Merge tag 'net-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · c5c17547
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Networking fixes, including fixes from netfilter.
      
        Current release - regressions:
      
         - r8169: fix incorrect mac address assignment
      
         - vlan: fix underflow for the real_dev refcnt when vlan creation
           fails
      
         - smc: avoid warning of possible recursive locking
      
        Current release - new code bugs:
      
         - vsock/virtio: suppress used length validation
      
         - neigh: fix crash in v6 module initialization error path
      
        Previous releases - regressions:
      
         - af_unix: fix change in behavior in read after shutdown
      
         - igb: fix netpoll exit with traffic, avoid warning
      
         - tls: fix splice_read() when starting mid-record
      
         - lan743x: fix deadlock in lan743x_phy_link_status_change()
      
         - marvell: prestera: fix bridge port operation
      
        Previous releases - always broken:
      
         - tcp_cubic: fix spurious Hystart ACK train detections for
           not-cwnd-limited flows
      
         - nexthop: fix refcount issues when replacing IPv6 groups
      
         - nexthop: fix null pointer dereference when IPv6 is not enabled
      
         - phylink: force link down and retrigger resolve on interface change
      
         - mptcp: fix delack timer length calculation and incorrect early
           clearing
      
         - ieee802154: handle iftypes as u32, prevent shift-out-of-bounds
      
         - nfc: virtual_ncidev: change default device permissions
      
         - netfilter: ctnetlink: fix error codes and flags used for kernel
           side filtering of dumps
      
         - netfilter: flowtable: fix IPv6 tunnel addr match
      
         - ncsi: align payload to 32-bit to fix dropped packets
      
         - iavf: fix deadlock and loss of config during VF interface reset
      
         - ice: avoid bpf_prog refcount underflow
      
         - ocelot: fix broken PTP over IP and PTP API violations
      
        Misc:
      
         - marvell: mvpp2: increase MTU limit when XDP enabled"
      
      * tag 'net-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits)
        net: dsa: microchip: implement multi-bridge support
        net: mscc: ocelot: correctly report the timestamping RX filters in ethtool
        net: mscc: ocelot: set up traps for PTP packets
        net: ptp: add a definition for the UDP port for IEEE 1588 general messages
        net: mscc: ocelot: create a function that replaces an existing VCAP filter
        net: mscc: ocelot: don't downgrade timestamping RX filters in SIOCSHWTSTAMP
        net: hns3: fix incorrect components info of ethtool --reset command
        net: hns3: fix one incorrect value of page pool info when queried by debugfs
        net: hns3: add check NULL address for page pool
        net: hns3: fix VF RSS failed problem after PF enable multi-TCs
        net: qed: fix the array may be out of bound
        net/smc: Don't call clcsock shutdown twice when smc shutdown
        net: vlan: fix underflow for the real_dev refcnt
        ptp: fix filter names in the documentation
        ethtool: ioctl: fix potential NULL deref in ethtool_set_coalesce()
        nfc: virtual_ncidev: change default device permissions
        net/sched: sch_ets: don't peek at classes beyond 'nbands'
        net: stmmac: Disable Tx queues when reconfiguring the interface
        selftests: tls: test for correct proto_ops
        tls: fix replacing proto_ops
        ...
      c5c17547
    • net: dsa: microchip: implement multi-bridge support · b3612ccd
      Oleksij Rempel authored
      The current driver version can handle only one bridge at a time.
      Configuring two bridges on two different ports ends up shorting these
      bridges in hardware. To reproduce it:
      
      	ip l a name br0 type bridge
      	ip l a name br1 type bridge
      	ip l s dev br0 up
      	ip l s dev br1 up
      	ip l s lan1 master br0
      	ip l s dev lan1 up
      	ip l s lan2 master br1
      	ip l s dev lan2 up
      
      	Ping on lan1 and get response on lan2, which should not happen.
      
      This happens because the current driver version stores one global "Port VLAN
      Membership" and applies it to all ports which are members of any
      bridge.
      To solve this issue, we need to handle each port separately.
      
      This patch drops the global port member storage and calculates
      membership dynamically depending on STP state and bridge participation.
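
      The per-port calculation described above can be sketched as a small
      Python model. This is not the driver's code: the Port class, the state
      names, the rule that only forwarding ports are members, and the rule
      that the CPU port is always a member are assumptions made for
      illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical STP state names, chosen for this sketch only.
FORWARDING = "forwarding"
BLOCKING = "blocking"


@dataclass
class Port:
    index: int
    bridge: Optional[str] = None   # bridge name, or None for a standalone port
    stp_state: str = FORWARDING


def port_membership(ports, port, cpu_port):
    """Return the set of port indices that `port` may forward to.

    Assumption: the CPU port is always reachable, and two user ports can
    reach each other only if both are forwarding in the same bridge.
    """
    members = {cpu_port}
    if port.bridge is None or port.stp_state != FORWARDING:
        return members
    for other in ports:
        if other.index == port.index:
            continue
        if other.bridge == port.bridge and other.stp_state == FORWARDING:
            members.add(other.index)
    return members


ports = [Port(1, "br0"), Port(2, "br1"), Port(3, "br0")]
print(port_membership(ports, ports[0], cpu_port=5))
print(port_membership(ports, ports[1], cpu_port=5))
```

      Because membership is derived per port, lan1 in br0 and lan2 in br1
      never end up in each other's set, which is the isolation the
      reproducer above checks for.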
      
      Note: STP support was broken before this patch and should be fixed
      separately.
      
      Fixes: c2e86691 ("net: dsa: microchip: break KSZ9477 DSA driver into two files")
      Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
      Link: https://lore.kernel.org/r/20211126123926.2981028-1-o.rempel@pengutronix.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b3612ccd
    • Merge tag 'acpi-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 5367cf1c
      Linus Torvalds authored
      Pull ACPI fixes from Rafael Wysocki:
       "These fix a NULL pointer dereference in the CPPC library code and a
        locking issue related to printing the names of ACPI device nodes in
        the device properties framework.
      
        Specifics:
      
         - Fix NULL pointer dereference in the CPPC library code occurring on
           hybrid systems without CPPC support (Rafael Wysocki).
      
         - Avoid attempts to acquire a semaphore with interrupts off when
           printing the names of ACPI device nodes and clean up code on top of
           that fix (Sakari Ailus)"
      
      * tag 'acpi-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPI: CPPC: Add NULL pointer check to cppc_get_perf()
        ACPI: Make acpi_node_get_parent() local
        ACPI: Get acpi_device's parent from the parent field
      5367cf1c
    • Merge tag 'pm-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 0ce629b1
      Linus Torvalds authored
      Pull power management fixes from Rafael Wysocki:
       "These address three issues in the intel_pstate driver and fix two
        problems related to hibernation.
      
        Specifics:
      
         - Make intel_pstate work correctly on Ice Lake server systems with
           out-of-band performance control enabled (Adamos Ttofari).
      
         - Fix EPP handling in intel_pstate during CPU offline and online in
           the active mode (Rafael Wysocki).
      
         - Make intel_pstate support ITMT on asymmetric systems with
           overclocking enabled (Srinivas Pandruvada).
      
         - Fix hibernation image saving when using the user space interface
           based on the snapshot special device file (Evan Green).
      
         - Make the hibernation code release the snapshot block device using
           the same mode that was used when acquiring it (Thomas Zeitlhofer)"
      
      * tag 'pm-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        PM: hibernate: Fix snapshot partial write lengths
        PM: hibernate: use correct mode for swsusp_close()
        cpufreq: intel_pstate: ITMT support for overclocked system
        cpufreq: intel_pstate: Fix active mode offline/online EPP handling
        cpufreq: intel_pstate: Add Ice Lake server to out-of-band IDs
      0ce629b1
    • Merge tag 'fuse-fixes-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse · 925c9437
      Linus Torvalds authored
      Pull fuse fix from Miklos Szeredi:
       "Fix a regression caused by a bugfix in the previous release. The
        symptom is a VM_BUG_ON triggered from splice to the fuse device.
      
        Unfortunately the original bugfix was already backported to a number
        of stable releases, so this fix-fix will need to be backported as
        well"
      
      * tag 'fuse-fixes-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
        fuse: release pipe buf after last use
      925c9437
    • Merge branch 'fix-broken-ptp-over-ip-on-ocelot-switches' · 32c54497
      Jakub Kicinski authored
      Vladimir Oltean says:
      
      ====================
      Fix broken PTP over IP on Ocelot switches
      
      Changes in v2: added patch 5, added Richard's ack for the whole series
      sans patch 5 which is new.
      
      Po Liu reported recently that timestamping PTP over IPv4 is broken using
      the felix driver on NXP LS1028A. This has been known for a while, of
      course, since it has always been broken. The reason is that IP PTP
      packets are currently treated as unknown IP multicast, which is not
      flooded to the CPU port in the ocelot driver design, so packets don't
      reach the ptp4l program.
      
      The series solves the problem by installing packet traps per port when
      the timestamping ioctl is called, depending on the RX filter selected
      (L2, L4 or both).
      ====================
      
      Link: https://lore.kernel.org/r/20211126172845.3149260-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      32c54497
    • net: mscc: ocelot: correctly report the timestamping RX filters in ethtool · c49a35ee
      Vladimir Oltean authored
      The driver doesn't support RX timestamping for non-PTP packets, but it
      declares that it does. Restrict the reported RX filters to PTP v2 over
      L2 and over L4.
      
      Fixes: 4e3b0468 ("net: mscc: PTP Hardware Clock (PHC) support")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c49a35ee
    • net: mscc: ocelot: set up traps for PTP packets · 96ca08c0
      Vladimir Oltean authored
      IEEE 1588 support was declared too soon for the Ocelot switch. Out of
      reset, this switch does not apply any special treatment for PTP packets,
      i.e. when an event message is received, the natural tendency is to
      forward it by MAC DA/VLAN ID. This poses a problem when the ingress port
      is under a bridge, since user space application stacks (written
      primarily for endpoint ports, not switches) like ptp4l expect that PTP
      messages are always received on AF_PACKET / AF_INET sockets (depending
      on the PTP transport being used), and are never autonomously
      forwarded. Any forwarding, if necessary (for example in Transparent
      Clock mode) is handled in software by ptp4l. Having the hardware forward
      these packets too will cause duplicates which will confuse endpoints
      connected to these switches.
      
      So PTP over L2 barely works, in the sense that PTP packets reach the CPU
      port, but they reach it via flooding, and therefore reach lots of other
      unwanted destinations too. But PTP over IPv4/IPv6 does not work at all.
      This is because the Ocelot switch has a separate destination port mask
      for unknown IP multicast (which PTP over IP is) flooding compared to
      unknown non-IP multicast (which PTP over L2 is) flooding. Specifically,
      the driver allows the CPU port to be in the PGID_MC port group, but not
      in PGID_MCIPV4 and PGID_MCIPV6. There are several presentations from
      Allan Nielsen which explain that the embedded MIPS CPU on Ocelot
      switches is not very powerful at all, so every penny they could save by
      not allowing flooding to the CPU port module matters. Unknown IP
      multicast did not make it.
      
      The de facto consensus is that when a switch is PTP-aware and an
      application stack for PTP is running, switches should have some sort of
      trapping mechanism for PTP packets, to extract them from the hardware
      data path. This avoids both problems:
      (a) PTP packets are no longer flooded to unwanted destinations
      (b) PTP over IP packets are no longer denied from reaching the CPU since
          they arrive there via a trap and not via flooding
      
      This is not the first time this change has been attempted. Last time, the
      feedback from Allan Nielsen and Andrew Lunn was that the traps should
      not be installed by default, and that PTP-unaware switching may be
      desired for some use cases:
      https://patchwork.ozlabs.org/project/netdev/patch/20190813025214.18601-5-yangbo.lu@nxp.com/
      
      To address that feedback, the present patch adds the necessary packet
      traps according to the RX filter configuration transmitted by user space
      through the SIOCSHWTSTAMP ioctl. Trapping is done via VCAP IS2, where we
      keep 5 filters, which are amended each time RX timestamping is enabled
      or disabled on a port:
      - 1 for PTP over L2
      - 2 for PTP over IPv4 (UDP ports 319 and 320)
      - 2 for PTP over IPv6 (UDP ports 319 and 320)
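
      As a rough illustration of how the five filters follow from the RX
      filter selection (L2, L4 or both), here is a minimal Python model. The
      function name, the tuple layout and the constant names are assumptions
      for this sketch; 0x88F7 is the EtherType used by PTP over Ethernet.

```python
ETH_P_1588 = 0x88F7    # EtherType of PTP over Ethernet (L2 transport)
PTP_EV_PORT = 319      # UDP destination port for PTP event messages
PTP_GEN_PORT = 320     # UDP destination port for PTP general messages


def ptp_traps(l2: bool, l4: bool):
    """Return the list of (transport, match value) traps for a selection."""
    traps = []
    if l2:
        traps.append(("l2", ETH_P_1588))
    if l4:
        # Two address families times two UDP ports gives four L4 traps.
        for family in ("ipv4", "ipv6"):
            for port in (PTP_EV_PORT, PTP_GEN_PORT):
                traps.append((family, port))
    return traps


for trap in ptp_traps(l2=True, l4=True):
    print(trap)
```

      With both transports selected the model yields the same count as the
      list above: one L2 trap plus four L4 traps.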
      
      The cookie by which these filters (invisible to tc) are identified is
      strategically chosen such that it does not collide with the filters used
      for the ocelot-8021q tagging protocol by the Felix driver, or with the
      MRP traps set up by the Ocelot library.
      
      Other alternatives were considered, like patching user space to do
      something, but there are so many ways in which PTP packets could be made
      to reach the CPU, generically speaking, that "do what?" is a very valid
      question. The ptp4l program from the linuxptp stack already attempts to
      do something: it calls setsockopt(IP_ADD_MEMBERSHIP) (and
      PACKET_ADD_MEMBERSHIP, respectively) which translates in both cases into
      a dev_mc_add() on the interface, in the kernel:
      https://github.com/richardcochran/linuxptp/blob/v3.1.1/udp.c#L73
      https://github.com/richardcochran/linuxptp/blob/v3.1.1/raw.c
      
      Reality shows that this is not sufficient in case the interface belongs
      to a switchdev driver, as dev_mc_add() does not show the intention to
      trap a packet to the CPU, but rather the intention to not drop it (it is
      strictly for RX filtering, same as promiscuous does not mean to send all
      traffic to the CPU, but to not drop traffic with unknown MAC DA). This
      topic is a can of worms in itself, and it would be great if user space
      could just stay out of it.
      
      On the other hand, setting up PTP traps privately within the driver is
      not new by any stretch of the imagination:
      https://elixir.bootlin.com/linux/v5.16-rc2/source/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c#L833
      https://elixir.bootlin.com/linux/v5.16-rc2/source/drivers/net/dsa/hirschmann/hellcreek.c#L1050
      https://elixir.bootlin.com/linux/v5.16-rc2/source/include/linux/dsa/sja1105.h#L21
      
      So this is the approach taken here as well. The difference here being
      that we prepare and destroy the traps per port, dynamically at runtime,
      as opposed to driver init time, because apparently, PTP-unaware
      forwarding is a use case.
      
      Fixes: 4e3b0468 ("net: mscc: PTP Hardware Clock (PHC) support")
      Reported-by: Po Liu <po.liu@nxp.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      96ca08c0
    • net: ptp: add a definition for the UDP port for IEEE 1588 general messages · ec15baec
      Vladimir Oltean authored
      As opposed to event messages (Sync, PdelayReq etc) which require
      timestamping, general messages (Announce, FollowUp etc) do not.
      In PTP they are part of different streams of data.
      
      IEEE 1588-2008 Annex D.2 "UDP port numbers" states that the UDP
      destination port assigned by IANA is 319 for event messages, and 320 for
      general messages. Yet the kernel seems to be missing the definition for
      general messages. This patch adds it.
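
      The split between the two message classes can be sketched in Python as
      follows. The constant and function names are illustrative assumptions,
      not taken from this commit message; the messageType values are those
      of IEEE 1588-2008.

```python
PTP_EV_PORT = 319     # event messages (timestamped)
PTP_GEN_PORT = 320    # general messages (not timestamped)

# messageType values from the PTP common header.
EVENT_TYPES = {
    0x0: "Sync",
    0x1: "Delay_Req",
    0x2: "Pdelay_Req",
    0x3: "Pdelay_Resp",
}
GENERAL_TYPES = {
    0x8: "Follow_Up",
    0x9: "Delay_Resp",
    0xA: "Pdelay_Resp_Follow_Up",
    0xB: "Announce",
    0xC: "Signaling",
    0xD: "Management",
}


def udp_port_for(message_type: int) -> int:
    """Map a PTP messageType to the UDP destination port it is sent to."""
    if message_type in EVENT_TYPES:
        return PTP_EV_PORT
    if message_type in GENERAL_TYPES:
        return PTP_GEN_PORT
    raise ValueError(f"unknown PTP messageType {message_type:#x}")


print(udp_port_for(0x0))   # Sync
print(udp_port_for(0xB))   # Announce
```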
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ec15baec
    • net: mscc: ocelot: create a function that replaces an existing VCAP filter · 95706be1
      Vladimir Oltean authored
      VCAP (Versatile Content Aware Processor) is the TCAM-based engine behind
      tc flower offload on ocelot, among other things. The ingress port mask
      on which VCAP rules match is present as a bit field in the actual key of
      the rule. This means that it is possible for a rule to be shared among
      multiple source ports. When the rule is added one by one on each desired
      port, the ingress port mask of the key must be edited and rewritten
      to hardware.
      
      But the API in ocelot_vcap.c does not allow for this. For one thing,
      ocelot_vcap_filter_add() and ocelot_vcap_filter_del() are not symmetric,
      because ocelot_vcap_filter_add() works with a preallocated and
      prepopulated filter and programs it to hardware, and
      ocelot_vcap_filter_del() does both the job of removing the specified
      filter from hardware, as well as kfreeing it. That is to say, the only
      option of editing a filter in place, which is to delete it, modify the
      structure and add it back, does not work because it results in
      use-after-free.
      
      This patch introduces ocelot_vcap_filter_replace(), which trivially
      reprograms a VCAP entry to hardware, at the exact same index at which it
      existed before, without modifying any list or allocating any memory.
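
      To make the asymmetry concrete, here is a toy Python model of a filter
      table with add, delete and replace operations. It only mirrors the
      behavior described above; the VcapTable class, the dict-based filter
      and the "freed" flag (standing in for kfree) are assumptions for
      illustration, not the ocelot_vcap.c API.

```python
class VcapTable:
    def __init__(self):
        self.filters = []   # software list; position doubles as hardware index
        self.hw = []        # ingress port mask currently programmed per index

    def _index_of(self, filt):
        for i, f in enumerate(self.filters):
            if f is filt:
                return i
        raise KeyError("filter is not in the table")

    def add(self, filt):
        """Program a caller-allocated, prepopulated filter to hardware."""
        self.filters.append(filt)
        self.hw.append(filt["ingress_port_mask"])

    def delete(self, filt):
        """Remove the filter from hardware and free it."""
        index = self._index_of(filt)
        del self.filters[index]
        del self.hw[index]
        filt["freed"] = True   # the caller must not touch filt after this

    def replace(self, filt):
        """Rewrite the entry at its existing index; no list or memory changes."""
        index = self._index_of(filt)
        self.hw[index] = filt["ingress_port_mask"]


table = VcapTable()
trap = {"ingress_port_mask": 0b0001, "freed": False}
table.add(trap)
trap["ingress_port_mask"] |= 0b0100   # share the rule with one more port
table.replace(trap)
print(bin(table.hw[0]))
```

      In this model, editing a rule via delete() followed by add() would
      reuse an object already marked as freed, which is the use-after-free
      hazard that replace() avoids.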
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      95706be1
    • net: mscc: ocelot: don't downgrade timestamping RX filters in SIOCSHWTSTAMP · 8a075464
      Vladimir Oltean authored
      The ocelot driver, when asked to timestamp all received packets, 1588
      v1 or NTP, says "nah, here's 1588 v2 for you".
      
      According to this discussion:
      https://patchwork.kernel.org/project/netdevbpf/patch/20211104133204.19757-8-martin.kaistra@linutronix.de/#24577647
      drivers that downgrade from a wider request to a narrower response (or
      even a response where the intersection with the request is empty) are
      buggy, and should return -ERANGE instead. This patch fixes that.
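
      The rule can be sketched in Python as a check that refuses unsupported
      requests instead of silently narrowing them. The function name and the
      exact supported set are assumptions for this sketch; the filter names
      are written as strings modeled on the HWTSTAMP_FILTER_* constants.

```python
import errno

# Assumed set of RX filters this hypothetical driver can honor.
SUPPORTED_RX_FILTERS = {
    "HWTSTAMP_FILTER_NONE",
    "HWTSTAMP_FILTER_PTP_V2_EVENT",
    "HWTSTAMP_FILTER_PTP_V2_L2_EVENT",
    "HWTSTAMP_FILTER_PTP_V2_L4_EVENT",
}


def set_rx_filter(requested: str) -> int:
    """Return 0 if the request can be honored as-is, else -ERANGE.

    The request is never downgraded to a narrower filter.
    """
    if requested not in SUPPORTED_RX_FILTERS:
        return -errno.ERANGE
    return 0


for name in ("HWTSTAMP_FILTER_PTP_V2_L4_EVENT",
             "HWTSTAMP_FILTER_ALL",
             "HWTSTAMP_FILTER_NTP_ALL"):
    print(name, set_rx_filter(name))
```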
      
      Fixes: 4e3b0468 ("net: mscc: PTP Hardware Clock (PHC) support")
      Suggested-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8a075464
    • Merge branch 'net-hns3-add-some-fixes-for-net' · b32e521e
      Jakub Kicinski authored
      Guangbin Huang says:
      
      ====================
      net: hns3: add some fixes for -net
      
      This series adds some fixes for the HNS3 ethernet driver.
      ====================
      
      Link: https://lore.kernel.org/r/20211126120318.33921-1-huangguangbin2@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b32e521e