1. 21 Mar, 2019 35 commits
    • David S. Miller
      Merge branch 'Refactor-flower-classifier-to-remove-dependency-on-rtnl-lock' · 1d965c4d
      David S. Miller authored
      Vlad Buslov says:
      
      ====================
      Refactor flower classifier to remove dependency on rtnl lock
      
      Currently, all netlink protocol handlers for updating rules, actions and
      qdiscs are protected with a single global rtnl lock, which removes any
      possibility for parallelism. This patch set is a third step to remove
      the rtnl lock dependency from the TC rules update path.
      
      Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added.
      TC rule update handlers (RTM_NEWTFILTER, RTM_DELTFILTER, etc.) are
      already registered with this flag and only take rtnl lock when qdisc or
      classifier requires it. Classifiers can indicate that their ops
      callbacks don't require caller to hold rtnl lock by setting the
      TCF_PROTO_OPS_DOIT_UNLOCKED flag. The goal of this change is to refactor
      flower classifier to support unlocked execution and register it with
      unlocked flag.
      
      This patch set implements following changes to make flower classifier
      concurrency-safe:
      
      - Implement reference counting for individual filters. Change fl_get to
        take reference to filter. Implement tp->ops->put callback that was
        introduced in cls API patch set to release reference to flower filter.
      
      - Use tp->lock spinlock to protect internal classifier data structures
        from concurrent modification.
      
      - Handle concurrent tcf proto deletion by returning EAGAIN, which will
        cause cls API to retry and create new proto instance or return error
        to the user (depending on message type).
      
      - Handle concurrent insertion of filter with same priority and handle by
        returning EAGAIN, which will cause cls API to lookup filter again and
        process it accordingly to netlink message flags.
      
      - Extend flower mask with reference counting and protect masks list with
        masks_lock spinlock.
      
      - Prevent concurrent mask insertion by inserting temporary value to
        masks hash table. This is necessary because mask initialization is a
        sleeping operation and cannot be done while holding tp->lock.
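      The EAGAIN-based retry at the heart of the list above can be sketched
      as a userspace analogue (all names and types here are illustrative
      stand-ins, not the kernel's):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace analogue of the cls API retry loop: an update
 * attempt may observe a concurrent tcf proto deletion and fail with
 * -EAGAIN, in which case the whole operation restarts against a fresh
 * instance. */

struct tp_sim {
    bool deleting;   /* set by a concurrent deleter */
    int  nfilters;
};

/* One update attempt: fails with -EAGAIN if the instance is being deleted. */
static int tp_sim_change(struct tp_sim *tp)
{
    if (tp->deleting)
        return -EAGAIN;
    tp->nfilters++;
    return 0;
}

/* Caller-side retry: on -EAGAIN, "recreate" the instance and try again. */
static int tp_sim_update_retry(struct tp_sim *tp)
{
    int err;

    while ((err = tp_sim_change(tp)) == -EAGAIN) {
        /* in the real code, cls API looks up or creates a new proto */
        tp->deleting = false;
    }
    return err;
}
```

      The point of the pattern is that only the failed operation is redone;
      unrelated updates on other protos proceed in parallel.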
      
      Both chain level and classifier level conflicts are resolved by
      returning -EAGAIN to cls API, which results in a restart of the whole
      operation.
      This retry mechanism is a result of fine-grained locking approach used
      in this and previous changes in series and is necessary to allow
      concurrent updates on same chain instance. Alternative approach would be
      to lock the whole chain while updating filters on any of child tp's,
      adding and removing classifier instances from the chain. However, since
      most CPU-intensive parts of filter update code are specifically in
      classifier code and its dependencies (extensions and hw offloads), such
      approach would negate most of the gains introduced by this change and
      previous changes in the series when updating same chain instance.
      
      Tcf hw offloads API is not changed by this patch set and still requires
      caller to hold rtnl lock. Refactored flower classifier tracks rtnl lock
      state by means of 'rtnl_held' flag provided by cls API and obtains the
      lock before calling hw offloads. Following patch set will lift this
      restriction and refactor cls hw offloads API to support unlocked
      execution.
      
      With these changes flower classifier is safely registered with
      TCF_PROTO_OPS_DOIT_UNLOCKED flag in last patch.
      
      Changes from V2 to V3:
      - Rebase on latest net-next
      
      Changes from V1 to V2:
      - Extend cover letter with explanation about retry mechanism.
      - Rebase on current net-next.
      - Patch 1:
        - Use rcu_dereference_raw() for tp->root dereference.
        - Update comment in fl_head_dereference().
      - Patch 2:
        - Remove redundant check in fl_change error handling code.
        - Add empty line between error check and new handle assignment.
      - Patch 3:
        - Refactor loop in fl_get_next_filter() to improve readability.
      - Patch 4:
        - Refactor __fl_delete() to improve readability.
      - Patch 6:
        - Fix comment in fl_check_assign_mask().
      - Patch 9:
        - Extend commit message.
        - Fix error code in comment.
      - Patch 11:
        - Fix fl_hw_replace_filter() to always release rtnl lock in error
          handlers.
      - Patch 12:
        - Don't take rtnl lock before calling __fl_destroy_filter() in
          workqueue context.
        - Extend commit message with explanation why flower still takes rtnl
          lock before calling hardware offloads API.
      
      Github: <https://github.com/vbuslov/linux/tree/unlocked-flower-cong3>
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1d965c4d
    • Vlad Buslov
      net: sched: flower: set unlocked flag for flower proto ops · 92149190
      Vlad Buslov authored
      Set TCF_PROTO_OPS_DOIT_UNLOCKED for flower classifier to indicate that its
      ops callbacks don't require caller to hold rtnl lock. Don't take rtnl lock
      in fl_destroy_filter_work() that is executed on workqueue instead of being
      called by cls API and is not affected by setting
      TCF_PROTO_OPS_DOIT_UNLOCKED. Rtnl mutex is still manually taken by flower
      classifier before calling hardware offloads API that has not been updated
      for unlocked execution.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      92149190
    • Vlad Buslov
      net: sched: flower: track rtnl lock state · c24e43d8
      Vlad Buslov authored
      Use 'rtnl_held' flag to track if caller holds rtnl lock. Propagate the flag
      to internal functions that need to know rtnl lock state. Take rtnl lock
      before calling tcf APIs that require it (hw offload, bind filter, etc.).
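      A minimal userspace sketch of the 'rtnl_held' pattern (the fake_rtnl_*
      helpers and counters are illustrative stand-ins, not kernel APIs):

```c
#include <assert.h>
#include <stdbool.h>

/* Helpers that must call rtnl-protected APIs take the lock themselves
 * only when the caller does not already hold it. The counter stands in
 * for rtnl_lock()/rtnl_unlock(). */

static int fake_rtnl_locks;

static void fake_rtnl_lock(void)   { fake_rtnl_locks++; }
static void fake_rtnl_unlock(void) { fake_rtnl_locks--; }

/* Returns the number of times the lock was taken internally. */
static int hw_offload_call(bool rtnl_held)
{
    int taken = 0;

    if (!rtnl_held) {
        fake_rtnl_lock();
        taken++;
    }
    /* ... call an API that still requires rtnl, e.g. hw offload ... */
    if (!rtnl_held)
        fake_rtnl_unlock();
    return taken;
}
```

      Propagating the flag avoids both double-locking (caller already holds
      rtnl) and calling a protected API unlocked.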
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c24e43d8
    • Vlad Buslov
      net: sched: flower: protect flower classifier state with spinlock · 3d81e711
      Vlad Buslov authored
      struct tcf_proto was extended with spinlock to be used by classifiers
      instead of global rtnl lock. Use it to protect shared flower classifier
      data structures (handle_idr, mask hashtable and list) and fields of
      individual filters that can be accessed concurrently. This patch set uses
      tcf_proto->lock as a per-instance lock that protects all filters on
      the tcf_proto.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3d81e711
    • Vlad Buslov
      net: sched: flower: handle concurrent tcf proto deletion · 272ffaad
      Vlad Buslov authored
      Without rtnl lock protection tcf proto can be deleted concurrently. Check
      tcf proto 'deleting' flag after taking tcf spinlock to verify that no
      concurrent deletion is in progress. Return EAGAIN error if concurrent
      deletion detected, which will cause caller to retry and possibly create new
      instance of tcf proto.
      
      Retry mechanism is a result of fine-grained locking approach used in this
      and previous changes in series and is necessary to allow concurrent updates
      on same chain instance. Alternative approach would be to lock the whole
      chain while updating filters on any of child tp's, adding and removing
      classifier instances from the chain. However, since most CPU-intensive
      parts of filter update code are specifically in classifier code and its
      dependencies (extensions and hw offloads), such approach would negate most
      of the gains introduced by this change and previous changes in the series
      when updating same chain instance.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      272ffaad
    • Vlad Buslov
      net: sched: flower: handle concurrent filter insertion in fl_change · 9a2d9389
      Vlad Buslov authored
      Check if user specified a handle and another filter with the same handle
      was inserted concurrently. Return EAGAIN to retry filter processing (in
      case it is an overwrite request).
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9a2d9389
    • Vlad Buslov
      net: sched: flower: protect masks list with spinlock · 259e60f9
      Vlad Buslov authored
      Protect modifications of flower masks list with spinlock to remove
      dependency on rtnl lock and allow concurrent access.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      259e60f9
    • Vlad Buslov
      net: sched: flower: handle concurrent mask insertion · 195c234d
      Vlad Buslov authored
      Without rtnl lock protection masks with same key can be inserted
      concurrently. Insert temporary mask with reference count zero to masks
      hashtable. This will cause any concurrent modifications to retry.
      
      Wait for rcu grace period to complete after removing temporary mask from
      masks hashtable to accommodate concurrent readers.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Suggested-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      195c234d
    • Vlad Buslov
      net: sched: flower: add reference counter to flower mask · f48ef4d5
      Vlad Buslov authored
      Extend fl_flow_mask structure with reference counter to allow parallel
      modification without relying on rtnl lock. Use rcu read lock to safely
      lookup mask and increment reference counter in order to accommodate
      concurrent deletes.
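      The lookup-then-tryget pattern this describes can be sketched with C11
      atomics (a hypothetical userspace analogue of refcount_inc_not_zero(),
      not the kernel implementation):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* A reader may only take a reference if the count is still non-zero;
 * if a concurrent delete already dropped it to zero, the lookup must
 * be retried from scratch. */

struct mask_sim {
    atomic_int refcnt;
};

/* Returns true and takes a reference iff the object is still live. */
static bool mask_tryget(struct mask_sim *m)
{
    int old = atomic_load(&m->refcnt);

    while (old != 0) {
        if (atomic_compare_exchange_weak(&m->refcnt, &old, old + 1))
            return true;
        /* 'old' was reloaded by the failed CAS; loop and re-check */
    }
    return false;   /* concurrent delete won: caller must re-lookup */
}

/* Drops a reference; returns true when the last one was released. */
static bool mask_put(struct mask_sim *m)
{
    return atomic_fetch_sub(&m->refcnt, 1) == 1;
}
```

      A temporary entry inserted with refcount zero (as in the previous
      patch) naturally makes every concurrent tryget fail and retry.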
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f48ef4d5
    • Vlad Buslov
      net: sched: flower: track filter deletion with flag · b2552b8c
      Vlad Buslov authored
      In order to prevent double deletion of filter by concurrent tasks when rtnl
      lock is not used for synchronization, add 'deleted' filter field. Check
      value of this field when modifying filters and return error if concurrent
      deletion is detected.
      
      Refactor __fl_delete() to accept pointer to 'last' boolean as argument,
      and return error code as function return value instead. This is necessary
      to signal concurrent filter delete to caller.
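      The interface change can be sketched as follows (a simplified,
      hypothetical stand-in for __fl_delete(), with the filter list reduced
      to a counter):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* The function reports "was this the last filter?" through an
 * out-parameter and frees its return value for error codes, so a
 * concurrent delete can be signalled to the caller as -EAGAIN. */

struct filter_sim {
    bool deleted;
};

static int fl_delete_sim(struct filter_sim *f, int *nfilters, bool *last)
{
    if (f->deleted)
        return -EAGAIN;   /* concurrent task already deleted it */

    f->deleted = true;
    (*nfilters)--;
    *last = (*nfilters == 0);
    return 0;
}
```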
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b2552b8c
    • Vlad Buslov
      net: sched: flower: introduce reference counting for filters · 06177558
      Vlad Buslov authored
      Extend flower filters with reference counting in order to remove dependency
      on rtnl lock in flower ops and allow to modify filters concurrently.
      Reference to flower filter can be taken/released concurrently as soon as it
      is marked as 'unlocked' by last patch in this series. Use atomic reference
      counter type to make concurrent modifications safe.
      
      Always take reference to flower filter while working with it:
      - Modify fl_get() to take reference to filter.
      - Implement tp->put() callback as fl_put() function to allow cls API to
      release reference taken by fl_get().
      - Modify fl_change() to assume that caller holds reference to fold and take
      reference to fnew.
      - Take reference to filter while using it in fl_walk().
      
      Implement helper functions to get/put filter reference counter.
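      The get/put lifecycle can be sketched in userspace C (a hypothetical
      analogue of the helpers; the freed counter stands in for the destroy
      path):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Every user of a filter holds a counted reference; memory is released
 * only when the final reference is dropped, regardless of which task
 * drops it. */

struct filt_sim {
    atomic_int refcnt;
};

static int filters_freed;

static struct filt_sim *filt_alloc(void)
{
    struct filt_sim *f = malloc(sizeof(*f));

    atomic_init(&f->refcnt, 1);   /* creator's reference */
    return f;
}

static void filt_get(struct filt_sim *f)
{
    atomic_fetch_add(&f->refcnt, 1);
}

static void filt_put(struct filt_sim *f)
{
    if (atomic_fetch_sub(&f->refcnt, 1) == 1) {
        free(f);          /* last reference: destroy */
        filters_freed++;
    }
}
```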
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      06177558
    • Vlad Buslov
      net: sched: flower: refactor fl_change · 620da486
      Vlad Buslov authored
      As a preparation for using classifier spinlock instead of relying on
      external rtnl lock, rearrange code in fl_change. The goal is to group the
      code which changes classifier state in single block in order to allow
      following commits in this set to protect it from parallel modification with
      tp->lock. Data structures that require tp->lock protection are mask
      hashtable and filters list, and classifier handle_idr.
      
      fl_hw_replace_filter() is a sleeping function and cannot be called while
      holding a spinlock. In order to execute all sequence of changes to shared
      classifier data structures atomically, call fl_hw_replace_filter() before
      modifying them.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      620da486
    • Vlad Buslov
      net: sched: flower: don't check for rtnl on head dereference · e474619a
      Vlad Buslov authored
      Flower classifier only changes root pointer during init and destroy. Cls
      API implements reference counting for tcf_proto, so there is no danger of
      concurrent access to tp when it is being destroyed, even without protection
      provided by rtnl lock.
      
      Implement new function fl_head_dereference() to dereference tp->root
      without checking for rtnl lock. Use it in all flower functions that
      obtain the head pointer, instead of rtnl_dereference().
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e474619a
    • Jakub Kicinski
      nfp: remove defines for unused control bits · 31f1a0e3
      Jakub Kicinski authored
      NFP driver ABI contains bits for L2 switching which were never
      implemented in initially envisioned form.
      
      Remove the defines, and open up the possibility of
      reclaiming the bits for other uses.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      31f1a0e3
    • David S. Miller
      Merge branch 'rhashtable-cleanups' · 143eb9ac
      David S. Miller authored
      NeilBrown says:
      
      ====================
      Two clean-ups for rhashtable.
      
      These two patches make small improvements to
      rhashtable, but are otherwise unrelated.
      
      Thanks to Herbert, Miguel, and Paul for the review.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      143eb9ac
    • NeilBrown
      rhashtable: rename rht_for_each*continue as *from. · f7ad68bf
      NeilBrown authored
      The pattern set by list.h is that for_each..continue()
      iterators start at the next entry after the given one,
      while for_each..from() iterators start at the given
      entry.
      
      The rht_for_each*continue() iterators are documented as though they
      start at the 'next' entry, but actually start at the given entry,
      and they are used expecting that behaviour. So fix the documentation
      and change the names to *from for consistency with list.h.
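      The distinction can be shown with a toy list and two iterator macros
      (the list type and macro names are invented for the example; they
      mirror the convention, not the rhashtable code):

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* *_from starts AT the given entry, matching list.h semantics. */
#define for_each_from(pos, start) \
    for ((pos) = (start); (pos); (pos) = (pos)->next)

/* *_continue starts at the entry AFTER the given one. */
#define for_each_continue(pos, start) \
    for ((pos) = (start)->next; (pos); (pos) = (pos)->next)

static int sum_from(struct node *start)
{
    struct node *pos;
    int sum = 0;

    for_each_from(pos, start)
        sum += pos->value;
    return sum;
}

static int sum_continue(struct node *start)
{
    struct node *pos;
    int sum = 0;

    for_each_continue(pos, start)
        sum += pos->value;
    return sum;
}
```

      Since the rhashtable iterators behave like the first macro, *from is
      the accurate name.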
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f7ad68bf
    • NeilBrown
      rhashtable: don't hold lock on first table throughout insertion. · 4feb7c7a
      NeilBrown authored
      rhashtable_try_insert() currently holds a lock on the bucket in
      the first table, while also locking buckets in subsequent tables.
      This is unnecessary and looks like a hold-over from some earlier
      version of the implementation.
      
      As insert and remove always lock a bucket in each table in turn, and
      as insert only inserts in the final table, there cannot be any races
      that are not covered by simply locking a bucket in each table in turn.
      
      When an insert call reaches that last table it can be sure that there
      is no matching entry in any other table as it has searched them all, and
      insertion never happens anywhere but in the last table.  The fact that
      code tests for the existence of future_tbl while holding a lock on
      the relevant bucket ensures that two threads inserting the same key
      will make compatible decisions about which is the "last" table.
      
      This simplifies the code and allows the ->rehash field to be
      discarded.
      
      We still need a way to ensure that a dead bucket_table is never
      re-linked by rhashtable_walk_stop().  This can be achieved by calling
      call_rcu() inside the locked region, and checking with
      rcu_head_after_call_rcu() in rhashtable_walk_stop() to see if the
      bucket table is empty and dead.
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4feb7c7a
    • David S. Miller
      Merge branch 'net-phy-Move-Omega-PHY-entry-to-Cygnus-PHY-driver' · 83b038db
      David S. Miller authored
      Florian Fainelli says:
      
      ====================
      net: phy: Move Omega PHY entry to Cygnus PHY driver
      
      In order to pave the way for adding some specific Omega PHY features
      that may not be desirable on other products covered by the bcm7xxx PHY
      driver, split the Omega PHY entry into the Cygnus PHY driver such that
      the PHY drivers are reflective of product lines/business units
      maintaining them within Broadcom.
      
      No functional changes intended.
      ====================
      Acked-by: Arun Parameswaran <arun.parameswaran@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      83b038db
    • Florian Fainelli
      net: phy: Move Omega PHY entry to Cygnus PHY driver · 17cc9821
      Florian Fainelli authored
      Cygnus and Omega are part of the same business unit and product line, so it
      makes sense to group PHY entries by products such that a platform can
      select only the drivers that it needs. Bring all the functionality that
      the BCM7XXX_28NM_GPHY() macro hides for us and remove the Omega PHY
      entry from bcm7xxx.c.
      
      As an added bonus, we now have a proper mdio_device_id entry to permit
      auto-loading.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Scott Branden <scott.branden@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      17cc9821
    • Florian Fainelli
      net: phy: Prepare for moving Omega out of bcm7xxx · f878fe56
      Florian Fainelli authored
      The Omega PHY entry was added to bcm7xxx.c out of convenience and this
      breaks the one driver per product line paradigm that was applied up
      until now. Since the AFE initialization is shared between Omega and
      BCM7xxx move the relevant functions to bcm-phy-lib.[ch]. No functional
      changes introduced.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Scott Branden <scott.branden@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f878fe56
    • Julian Wiedmann
      net: dst: remove gc leftovers · 02afc7ad
      Julian Wiedmann authored
      Get rid of some obsolete gc-related documentation and macros that were
      missed in commit 5b7c9a8f ("net: remove dst gc related code").
      
      CC: Wei Wang <weiwan@google.com>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Acked-by: Wei Wang <weiwan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      02afc7ad
    • David S. Miller
      Merge branch 'net-broadcom-Remove-print-of-base-address' · 88f808f3
      David S. Miller authored
      Florian Fainelli says:
      
      ====================
      net: broadcom: Remove print of base address
      
      Some broadcom MDIO/switch/Ethernet MAC drivers insist on printing the
      base register virtual address which has little value.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88f808f3
    • Florian Fainelli
      net: systemport: Remove print of base address · 62be757f
      Florian Fainelli authored
      Since commit ad67b74d ("printk: hash addresses printed with %p")
      pointers are being hashed when printed. Displaying the virtual memory at
      bootup time is not helpful, especially given we use a dev_info() which
      already displays the platform device's address.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      62be757f
    • Florian Fainelli
      net: dsa: bcm_sf2: Remove print of base address · fbb7bc45
      Florian Fainelli authored
      Since commit ad67b74d ("printk: hash addresses printed with %p")
      pointers are being hashed when printed. Displaying the virtual memory at
      bootup time is not helpful; we use a dev_info() print which already
      displays the platform device's address.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fbb7bc45
    • Florian Fainelli
      net: phy: mdio-bcm-unimac: Remove print of base address · 647aed23
      Florian Fainelli authored
      Since commit ad67b74d ("printk: hash addresses printed with %p")
      pointers are being hashed when printed. Displaying the virtual memory at
      bootup time is not helpful, especially given we use a dev_info() which
      already displays the platform device's address.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      647aed23
    • David Ahern
      ipv6: Remove fallback argument from ip6_hold_safe · 10585b43
      David Ahern authored
      net and null_fallback are redundant. Remove null_fallback in favor of
      !net check.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Acked-by: Wei Wang <weiwan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      10585b43
    • David Ahern
      ipv4: Allow amount of dirty memory from fib resizing to be controllable · 9ab948a9
      David Ahern authored
      fib_trie implementation calls synchronize_rcu when a certain amount of
      pages are dirty from freed entries. The number of pages was determined
      experimentally in 2009 (commit c3059477).
      
      At the current setting, synchronize_rcu is called often -- 51 times in a
      second in one test with an average of an 8 msec delay adding a fib entry.
      The total impact is a lot of slowdown when modifying the fib. This is seen
      in the output of 'time' - the difference between real time and sys+user.
      For example, using 720,022 single path routes and 'ip -batch'[1]:
      
          $ time ./ip -batch ipv4/routes-1-hops
          real    0m14.214s
          user    0m2.513s
          sys     0m6.783s
      
      So roughly 35% of the actual time to install the routes is from the ip
      command getting scheduled out, most notably due to synchronize_rcu (this
      is observed using 'perf sched timehist').
      
      This patch makes the amount of dirty memory configurable between 64k where
      the synchronize_rcu is called often (small, low end systems that are memory
      sensitive) to 64M where synchronize_rcu is called rarely during a large
      FIB change (for high end systems with lots of memory). The default is 512kB
      which corresponds to the current setting of 128 pages with a 4kB page size.
      
      As an example, at 16MB the worst interval shows 4 calls to synchronize_rcu
      in a second blocking for up to 30 msec in a single instance, and a total
      of almost 100 msec across the 4 calls in the second. The trade off is
      allowing FIB entries to consume more memory in a given time window
      but with much better fib insertion rates (~30% increase in
      prefixes/sec).
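      The threshold-based batching described above can be sketched like this
      (the struct, names, and sizes are illustrative, not the kernel's; the
      flush counter stands in for synchronize_rcu() plus the actual frees):

```c
#include <assert.h>

/* Freed-entry bytes accumulate until a configurable threshold is
 * crossed; only then is the expensive grace-period flush performed. */

struct flush_ctl {
    unsigned long dirty;      /* bytes freed since last flush */
    unsigned long threshold;  /* e.g. net.ipv4.fib_sync_mem */
    int flushes;              /* how many expensive flushes ran */
};

static void note_freed(struct flush_ctl *c, unsigned long bytes)
{
    c->dirty += bytes;
    if (c->dirty >= c->threshold) {
        c->flushes++;         /* stand-in for synchronize_rcu() */
        c->dirty = 0;
    }
}
```

      A larger threshold trades peak memory held by dead entries for fewer
      blocking flushes during a bulk FIB change.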
      With this patch and net.ipv4.fib_sync_mem set to 16MB, the same batch
      file runs in:
      
          $ time ./ip -batch ipv4/routes-1-hops
          real    0m9.692s
          user    0m2.491s
          sys     0m6.769s
      
      So the dead time is reduced to about 1/2 second or <5% of the real time.
      
      [1] 'ip' modified to not request ACK messages which improves route
          insertion times by about 20%
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9ab948a9
    • Kirill Tkhai
      tun: Add ioctl() TUNGETDEVNETNS cmd to allow obtaining real net ns of tun device · 0c3e0e3b
      Kirill Tkhai authored
      In commit f2780d6d "tun: Add ioctl() SIOCGSKNS cmd to allow
      obtaining net ns of tun device" it was missed that tun may change
      its net ns, while net ns of socket remains the same as it was
      created initially. SIOCGSKNS returns net ns of socket, so it is
      not suitable for obtaining net ns of device.
      
      We may have two tun devices with the same names in two net ns,
      and in this case it's not possible to determine which of them an
      fd refers to (TUNGETIFF will return the same name).
      
      This patch adds new ioctl() cmd for obtaining net ns of a device.
      Reported-by: Harald Albrecht <harald.albrecht@gmx.net>
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0c3e0e3b
    • David S. Miller
      Merge branch 'ipv6-Change-addrconf_f6i_alloc-to-use-ip6_route_info_create' · 28b18b39
      David S. Miller authored
      David Ahern says:
      
      ====================
      ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create
      
      addrconf_f6i_alloc is the last caller of fib6_info_alloc besides
      ip6_route_info_create. There really is no good reason for it do
      its own fib6_info initialization, so convert it to call
      ip6_route_info_create.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      28b18b39
    • David Ahern
      ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create · c7a1ce39
      David Ahern authored
      Change addrconf_f6i_alloc to generate a fib6_config and call
      ip6_route_info_create. addrconf_f6i_alloc is the last caller to
      fib6_info_alloc besides ip6_route_info_create, and there is no
      reason for it to do its own initialization on a fib6_info.
      
      Host routes need to be created even if the device is down, so add a
      new flag, fc_ignore_dev_down, to fib6_config and update fib6_nh_init
      to not error out if device is not up.
      
      Notes on the conversion:
      - ip_fib_metrics_init is the same as fib6_config with fc_mx set to NULL
        and fc_mx_len set to 0
      - dst_nocount is handled by the RTF_ADDRCONF flag
      - dst_host is handled by fc_dst_len = 128
      
      nh_gw does not get set after the conversion to ip6_route_info_create
      but it should not be set in addrconf_f6i_alloc since this is a host
      route not a gateway route.
      
      Everything else is a straightforward mapping between fib6_info and
      fib6_config.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c7a1ce39
    • David Ahern
      ipv6: Move setting default metric for routes · 67f69513
      David Ahern authored
      ip6_route_info_create is a low level function for ensuring fc_metric is
      set. Move the check and default setting to the 2 locations that do not
      already set fc_metric before calling ip6_route_info_create. This is
      required for the next patch which moves addrconf allocations to
      ip6_route_info_create and want the metric for host routes to be 0.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      67f69513
    • Vakul Garg
      net/tls: Replace kfree_skb() with consume_skb() · a88c26f6
      Vakul Garg authored
      To free the skb in the normal course of processing, consume_skb()
      should be used. Only for failure paths is kfree_skb() intended to be
      used.
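      The distinction can be sketched with a userspace analogue (the fake_*
      helpers and counters are illustrative stand-ins for the kernel's drop
      and consume tracepoints, not real APIs):

```c
#include <assert.h>
#include <stdbool.h>

/* Both helpers release the buffer, but kfree_skb() records a drop while
 * consume_skb() marks a successfully processed packet, which keeps drop
 * monitoring meaningful. */

static int dropped, consumed;

static void fake_kfree_skb(void)   { dropped++; }    /* error path only */
static void fake_consume_skb(void) { consumed++; }   /* normal completion */

static void process_packet(bool ok)
{
    if (ok)
        fake_consume_skb();   /* processed successfully */
    else
        fake_kfree_skb();     /* failure path: record a drop */
}
```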
      
      https://www.kernel.org/doc/htmldocs/networking/API-consume-skb.html
      Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a88c26f6
    • Hoang Le
      tipc: fix a null pointer deref · 08e046c8
      Hoang Le authored
      In commit c55c8eda ("tipc: smooth change between replicast and
      broadcast") we introduced a new method to eliminate the risk of
      message reordering between different nodes.
      Unfortunately, we forgot to add a check on the receiving side to
      ignore intra-node messages.
      
      We fix this by checking whether a message arrived from the own node
      and returning early if so.
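      Schematically, the receive-side fix amounts to an early return for
      self-originated messages (a hedged sketch, not the literal patch;
      msg_orignode() is an existing TIPC accessor, while 'self' stands in
      for the own node address):

      ```c
      /* In the multicast receive filter: messages that originated on this
       * node never crossed a link, so the inter-node reordering problem
       * being solved does not apply -- skip the sequencing logic, which
       * would otherwise dereference state that was never set up. */
      if (msg_orignode(hdr) == self)
              return;
      ```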
      
      syzbot report:
      
      ==================================================================
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 7820 Comm: syz-executor418 Not tainted 5.0.0+ #61
      Hardware name: Google Google Compute Engine/Google Compute Engine,
      BIOS Google 01/01/2011
      RIP: 0010:tipc_mcast_filter_msg+0x21b/0x13d0 net/tipc/bcast.c:782
      Code: 45 c0 0f 84 39 06 00 00 48 89 5d 98 e8 ce ab a5 fa 49 8d bc
       24 c8 00 00 00 48 b9 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03
       <80> 3c 08 00 0f 85 9a 0e 00 00 49 8b 9c 24 c8 00 00 00 48 be 00 00
      RSP: 0018:ffff8880959defc8 EFLAGS: 00010202
      RAX: 0000000000000019 RBX: ffff888081258a48 RCX: dffffc0000000000
      RDX: 0000000000000000 RSI: ffffffff86cab862 RDI: 00000000000000c8
      RBP: ffff8880959df030 R08: ffff8880813d0200 R09: ffffed1015d05bc8
      R10: ffffed1015d05bc7 R11: ffff8880ae82de3b R12: 0000000000000000
      R13: 000000000000002c R14: 0000000000000000 R15: ffff888081258a48
      FS:  000000000106a880(0000) GS:ffff8880ae800000(0000)
       knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020001cc0 CR3: 0000000094a20000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       tipc_sk_filter_rcv+0x182d/0x34f0 net/tipc/socket.c:2168
       tipc_sk_enqueue net/tipc/socket.c:2254 [inline]
       tipc_sk_rcv+0xc45/0x25a0 net/tipc/socket.c:2305
       tipc_sk_mcast_rcv+0x724/0x1020 net/tipc/socket.c:1209
       tipc_mcast_xmit+0x7fe/0x1200 net/tipc/bcast.c:410
       tipc_sendmcast+0xb36/0xfc0 net/tipc/socket.c:820
       __tipc_sendmsg+0x10df/0x18d0 net/tipc/socket.c:1358
       tipc_sendmsg+0x53/0x80 net/tipc/socket.c:1291
       sock_sendmsg_nosec net/socket.c:651 [inline]
       sock_sendmsg+0xdd/0x130 net/socket.c:661
       ___sys_sendmsg+0x806/0x930 net/socket.c:2260
       __sys_sendmsg+0x105/0x1d0 net/socket.c:2298
       __do_sys_sendmsg net/socket.c:2307 [inline]
       __se_sys_sendmsg net/socket.c:2305 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2305
       do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4401c9
      Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8
       48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05
       <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007ffd887fa9d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401c9
      RDX: 0000000000000000 RSI: 0000000020002140 RDI: 0000000000000003
      RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401a50
      R13: 0000000000401ae0 R14: 0000000000000000 R15: 0000000000000000
      Modules linked in:
      ---[ end trace ba79875754e1708f ]---
      
      Reported-by: syzbot+be4bdf2cc3e85e952c50@syzkaller.appspotmail.com
      Fixes: c55c8eda ("tipc: smooth change between replicast and broadcast")
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      08e046c8
    • Hoang Le's avatar
      tipc: fix use-after-free in tipc_sk_filter_rcv · 77d5ad40
      Hoang Le authored
      The skb is freed in:
        1/ condition 1: tipc_sk_filter_rcv -> tipc_sk_proto_rcv
        2/ condition 2: tipc_sk_filter_rcv -> tipc_group_filter_msg
      This leads to a use-after-free access in the next condition.
      
      We fix this by initializing the variable at declaration; it is then
      safe to check this variable before continuing processing when the
      condition matches.
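      The pattern of the fix, generically (a sketch with hypothetical
      names, not the literal diff): capture whatever is needed from the
      skb at declaration, while the skb is still guaranteed valid, and
      test only the cached value afterwards:

      ```c
      static void example_filter_rcv(struct sock *sk, struct sk_buff *skb)
      {
              /* Initialized at declaration, read while skb is still valid. */
              int mtyp = msg_type(buf_msg(skb));

              if (is_proto_msg(skb)) {                /* hypothetical check */
                      example_proto_rcv(sk, skb);     /* may free skb */
                      skb = NULL;                     /* never touch it again */
              }

              /* Safe: decide on the cached value, not the freed skb. */
              if (mtyp == EXAMPLE_CONN_MSG)
                      handle_connection_event(sk);
      }
      ```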
      
      syzbot report:
      
      ==================================================================
      BUG: KASAN: use-after-free in tipc_sk_filter_rcv+0x2166/0x34f0
       net/tipc/socket.c:2167
      Read of size 4 at addr ffff88808ea58534 by task kworker/u4:0/7
      
      CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.0.0+ #61
      Hardware name: Google Google Compute Engine/Google Compute Engine,
       BIOS Google 01/01/2011
      Workqueue: tipc_send tipc_conn_send_work
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
       kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:131
       tipc_sk_filter_rcv+0x2166/0x34f0 net/tipc/socket.c:2167
       tipc_sk_enqueue net/tipc/socket.c:2254 [inline]
       tipc_sk_rcv+0xc45/0x25a0 net/tipc/socket.c:2305
       tipc_topsrv_kern_evt+0x3b7/0x580 net/tipc/topsrv.c:610
       tipc_conn_send_to_sock+0x43e/0x5f0 net/tipc/topsrv.c:283
       tipc_conn_send_work+0x65/0x80 net/tipc/topsrv.c:303
       process_one_work+0x98e/0x1790 kernel/workqueue.c:2269
       worker_thread+0x98/0xe40 kernel/workqueue.c:2415
       kthread+0x357/0x430 kernel/kthread.c:253
       ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
      
      Reported-by: syzbot+e863893591cc7a622e40@syzkaller.appspotmail.com
      Fixes: c55c8eda ("tipc: smooth change between replicast and broadcast")
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      77d5ad40
  2. 20 Mar, 2019 5 commits
    • Stephen Suryaputra's avatar
      ipv6: Add icmp_echo_ignore_anycast for ICMPv6 · 0b03a5ca
      Stephen Suryaputra authored
      In addition to icmp_echo_ignore_multicast, there is a need to also
      prevent responding to pings to anycast addresses for security.
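      Assuming the new knob follows the naming of its
      icmp_echo_ignore_multicast sibling, enabling it would look like this
      (illustrative commands, not taken from the patch):

      ```shell
      # Make the stack ignore ICMPv6 echo requests sent to anycast addresses
      sysctl -w net.ipv6.icmp.echo_ignore_anycast=1
      # or, equivalently, via procfs:
      echo 1 > /proc/sys/net/ipv6/icmp/echo_ignore_anycast
      ```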
      Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0b03a5ca
    • YueHaibing's avatar
      net: isdn: Make isdn_ppp_mp_discard and isdn_ppp_mp_reassembly static · a534ea30
      YueHaibing authored
      Fix sparse warnings:
      
      drivers/isdn/i4l/isdn_ppp.c:1891:16: warning:
       symbol 'isdn_ppp_mp_discard' was not declared. Should it be static?
      drivers/isdn/i4l/isdn_ppp.c:1903:6: warning:
       symbol 'isdn_ppp_mp_reassembly' was not declared. Should it be static?
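      The fix for warnings like these is mechanical: a function used only
      within its own file gets internal linkage. A minimal standalone
      illustration of the pattern (not the driver code itself):

      ```c
      #include <stdio.h>

      /* 'static' gives the helper internal linkage, so the symbol is only
       * visible in this translation unit; without it, sparse warns that an
       * externally visible symbol has no declaration in any header. */
      static int local_helper(int x)
      {
              return x * 2;
      }

      int main(void)
      {
              printf("%d\n", local_helper(21));
              return 0;
      }
      ```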
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a534ea30
    • YueHaibing's avatar
      net: hns3: Make hclge_destroy_cmd_queue static · 881d7afd
      YueHaibing authored
      Fix sparse warning:
      
      drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c:414:6:
       warning: symbol 'hclge_destroy_cmd_queue' was not declared. Should it be static?
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      881d7afd
    • David S. Miller's avatar
      Merge branch 'net-refactor-ndo_select_queue' · 75d317c4
      David S. Miller authored
      Paolo Abeni says:
      
      ====================
      net: refactor ndo_select_queue()
      
      Currently, on most devices implementing ndo_select_queue(), we get 2
      indirect calls per xmit packet, at least in some scenarios.
      
      We can avoid one of these indirect calls by refactoring the
      ndo_select_queue() usage so that the 'fallback' argument is no
      longer needed.
      
      The first patch renames a helper used later as a public API, the
      second one changes the af_packet implementation so that it uses the
      common infrastructure to select the xmit queue, and the third patch
      drops the now unneeded argument from ndo_select_queue().
      
      Alternatively, we could use the INDIRECT_CALL_WRAPPER infrastructure
      to avoid the fallback indirect call in the common case, but this
      solution also allows for some code cleanup.
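      For context, the net effect on the ndo is the removal of the
      function-pointer argument (a sketch based on this cover letter; the
      'before' form is shown as a comment):

      ```c
      /* Before the series:
       * u16 (*ndo_select_queue)(struct net_device *dev, struct sk_buff *skb,
       *                         struct net_device *sb_dev,
       *                         select_queue_fallback_t fallback);
       */

      /* After the series: no fallback pointer; drivers wanting the default
       * behaviour call netdev_pick_tx() directly. */
      u16 (*ndo_select_queue)(struct net_device *dev, struct sk_buff *skb,
                              struct net_device *sb_dev);
      ```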
      
       v1 -> v2:
        - renamed select queue helpers, as per Eric's and David's suggestions
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      75d317c4
    • Paolo Abeni's avatar
      net: remove 'fallback' argument from dev->ndo_select_queue() · a350ecce
      Paolo Abeni authored
      After the previous patch, all the callers of ndo_select_queue()
      provide netdev_pick_tx as the 'fallback' argument.
      The only exceptions are nested calls to ndo_select_queue(),
      which pass down the 'fallback' available in the current scope
      - still netdev_pick_tx.
      
      We can drop this argument and replace the fallback() invocation
      with netdev_pick_tx(). This avoids an indirect call per xmit packet
      in some scenarios (TCP syn, UDP unconnected, XDP generic, pktgen)
      with device drivers implementing such an ndo. It also cleans up the
      code a bit.
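      In an affected driver the conversion is mechanical (a sketch with a
      hypothetical driver function; only netdev_pick_tx() and the ndo
      signature come from this series):

      ```c
      /* Before: the default path went through the 'fallback' argument,
       * costing an indirect call per packet:
       *
       *      return fallback(dev, skb, sb_dev);
       *
       * After: same decision logic, but the default is a direct call. */
      static u16 foo_select_queue(struct net_device *dev, struct sk_buff *skb,
                                  struct net_device *sb_dev)
      {
              if (foo_needs_special_queue(skb))       /* hypothetical driver logic */
                      return foo_special_queue(dev, skb);

              return netdev_pick_tx(dev, skb, sb_dev);
      }
      ```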
      
      Tested with ixgbe and CONFIG_FCOE=m
      
      With pktgen using queue xmit:
      threads		vanilla 	patched
      		(kpps)		(kpps)
      1		2334		2428
      2		4166		4278
      4		7895		8100
      
       v1 -> v2:
       - rebased after helper's name change
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a350ecce