1. 06 Jan, 2015 9 commits
    • net: fib6: convert cfg metric to u32 outside of table write lock · e715b6d3
      Florian Westphal authored
      Do the nla validation earlier, outside the write lock.
      
      This is needed by a follow-up patch, which must be able to call
      request_module() (which can sleep) if needed.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e715b6d3
    • net: fib6: fib6_commit_metrics: fix potential NULL pointer dereference · 0409c9a5
      Daniel Borkmann authored
      When IPv6 host routes with metrics attached are being added, we fetch
      the metrics store from the dst via COW through dst_metrics_write_ptr(),
      added through commit e5fd387a.
      
      One remaining problem here is that we actually call into inet_getpeer()
      and may end up allocating/creating a new peer from the kmemcache, which
      may fail.
      
      Example trace from perf probe (inet_getpeer:41) where create is 1:
      
      ip 6877 [002] 4221.391591: probe:inet_getpeer: (ffffffff8165e293)
        85e294 inet_getpeer.part.7 (<- kmem_cache_alloc())
        85e578 inet_getpeer
        8eb333 ipv6_cow_metrics
        8f10ff fib6_commit_metrics
      
      Therefore, a check for NULL on the return of dst_metrics_write_ptr()
      is necessary here.
      
      Joint work with Florian Westphal.
      
      Fixes: e5fd387a ("ipv6: do not overwrite inetpeer metrics prematurely")
      Cc: Michal Kubeček <mkubecek@suse.cz>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0409c9a5
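The NULL check described in the commit above can be illustrated with a small user-space sketch. This is not the kernel code: `struct route`, `metrics_write_ptr()`, `commit_metric()`, and `METRICS_MAX` are hypothetical stand-ins, and the `fail_alloc` field exists only to simulate an allocation failure. The point is that a "get writable pointer" helper which may allocate can return NULL, so the caller has to check before storing through it.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define METRICS_MAX 4 /* hypothetical number of metric slots */

struct route {
    uint32_t *metrics; /* NULL until a writable copy exists */
    int fail_alloc;    /* demo knob: simulate an allocation failure */
};

/* Hypothetical stand-in for a copy-on-write accessor such as
 * dst_metrics_write_ptr(): returns a writable metrics array,
 * allocating it on first use. May return NULL. */
static uint32_t *metrics_write_ptr(struct route *rt)
{
    if (!rt->metrics && !rt->fail_alloc)
        rt->metrics = calloc(METRICS_MAX, sizeof(*rt->metrics));
    return rt->metrics;
}

/* Store one metric, refusing to dereference a NULL store. */
static int commit_metric(struct route *rt, size_t idx, uint32_t val)
{
    uint32_t *mp;

    if (idx >= METRICS_MAX)
        return -EINVAL;

    mp = metrics_write_ptr(rt);
    if (!mp)
        return -ENOMEM; /* the check the commit message calls for */

    mp[idx] = val;
    return 0;
}
```

Returning an error code instead of writing through the pointer lets the caller abort the route insertion cleanly; which error code the kernel actually uses is not stated in the commit message.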
    • net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined · 6cb69742
      Hubert Sokolowski authored
      Check whether the call to ndo_dflt_fdb_dump is needed.
      Some drivers (e.g. qlcnic or macvlan) that define their own
      ndo_fdb_dump do not expect ndo_dflt_fdb_dump to be called
      unconditionally. Other drivers define their own ndo_fdb_dump
      and don't want ndo_dflt_fdb_dump to be called at all.
      At the same time it is desirable to call the default dump
      function on a bridge device.
      Fix the attributes that are passed to dev->netdev_ops->ndo_fdb_dump.
      Add extra checking in br_fdb_dump to avoid duplicate entries,
      as filter_dev can now be NULL.
      
      The following filtering tests were run before and after the
      patch was applied to confirm that the output is identical and
      that the change does not break the filtering algorithm.
      
      [root@localhost ~]# cd /root/iproute2-3.18.0/bridge
      [root@localhost bridge]# modprobe dummy
      [root@localhost bridge]# ./bridge fdb add f1:f2:f3:f4:f5:f6 dev dummy0
      [root@localhost bridge]# brctl addbr br0
      [root@localhost bridge]# brctl addif  br0 dummy0
      [root@localhost bridge]# ip link set dev br0 address 02:00:00:12:01:04
      [root@localhost bridge]# # show all
      [root@localhost bridge]# ./bridge fdb show
      33:33:00:00:00:01 dev p2p1 self permanent
      01:00:5e:00:00:01 dev p2p1 self permanent
      33:33:ff:ac:ce:32 dev p2p1 self permanent
      33:33:00:00:02:02 dev p2p1 self permanent
      01:00:5e:00:00:fb dev p2p1 self permanent
      33:33:00:00:00:01 dev p7p1 self permanent
      01:00:5e:00:00:01 dev p7p1 self permanent
      33:33:ff:79:50:53 dev p7p1 self permanent
      33:33:00:00:02:02 dev p7p1 self permanent
      01:00:5e:00:00:fb dev p7p1 self permanent
      f2:46:50:85:6d:d9 dev dummy0 master br0 permanent
      f2:46:50:85:6d:d9 dev dummy0 vlan 1 master br0 permanent
      33:33:00:00:00:01 dev dummy0 self permanent
      f1:f2:f3:f4:f5:f6 dev dummy0 self permanent
      33:33:00:00:00:01 dev br0 self permanent
      02:00:00:12:01:04 dev br0 vlan 1 master br0 permanent
      02:00:00:12:01:04 dev br0 master br0 permanent
      [root@localhost bridge]# # filter by bridge
      [root@localhost bridge]# ./bridge fdb show br br0
      f2:46:50:85:6d:d9 dev dummy0 master br0 permanent
      f2:46:50:85:6d:d9 dev dummy0 vlan 1 master br0 permanent
      33:33:00:00:00:01 dev dummy0 self permanent
      f1:f2:f3:f4:f5:f6 dev dummy0 self permanent
      33:33:00:00:00:01 dev br0 self permanent
      02:00:00:12:01:04 dev br0 vlan 1 master br0 permanent
      02:00:00:12:01:04 dev br0 master br0 permanent
      [root@localhost bridge]# # filter by port
      [root@localhost bridge]# ./bridge fdb show brport dummy0
      f2:46:50:85:6d:d9 master br0 permanent
      f2:46:50:85:6d:d9 vlan 1 master br0 permanent
      33:33:00:00:00:01 self permanent
      f1:f2:f3:f4:f5:f6 self permanent
      [root@localhost bridge]# # filter by port + bridge
      [root@localhost bridge]# ./bridge fdb show br br0 brport dummy0
      f2:46:50:85:6d:d9 master br0 permanent
      f2:46:50:85:6d:d9 vlan 1 master br0 permanent
      33:33:00:00:00:01 self permanent
      f1:f2:f3:f4:f5:f6 self permanent
      [root@localhost bridge]#
      Signed-off-by: Hubert Sokolowski <hubert.sokolowski@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6cb69742
    • Merge branch 'ip_cmsg_csum' · d4253c62
      David S. Miller authored
      Tom Herbert says:
      
      ====================
      ip: Support checksum returned in cmsg
      
      This patch set allows the packet checksum for a datagram socket
      to be returned in csum data in recvmsg. This allows userspace
      to implement its own checksum over the data; for instance, if an
      IP tunnel were implemented in user space, the inner checksum
      could be validated.
      
      Changes in this patch set:
        - Move checksum conversion to inet_sock from udp_sock. This
          generalizes checksum conversion for use with other protocols.
        - Move IP cmsg constants to a header file and make processing
          of the flags more efficient in ip_cmsg_recv
        - Return checksum value in cmsg. This is specifically the unfolded
          32 bit checksum of the full packet starting from the first byte
          returned in recvmsg
      
      Tested: Wrote a little server to get checksums in cmsg for UDP and
              verified that the correct checksum is returned.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d4253c62
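The cover letter above says the value returned in the cmsg is the unfolded 32-bit checksum. A consumer that wants the usual 16-bit Internet checksum form has to fold it. The helper below is a minimal sketch of that folding step; the name `demo_csum_fold32()` is made up for illustration, and how the 32-bit value is read out of the cmsg is deliberately left out because the text above does not specify it.

```c
#include <stdint.h>

/* Fold a 32-bit ones' complement accumulator into 16 bits by adding
 * the carries back in (end-around carry). Two rounds are enough,
 * because the first addition can itself carry at most once.
 * The result is NOT inverted; a caller that needs the checksum field
 * value would complement it separately. */
static uint16_t demo_csum_fold32(uint32_t sum)
{
    sum = (sum & 0xffffu) + (sum >> 16);
    sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}
```

For example, `demo_csum_fold32(0x12345678)` adds `0x5678` and `0x1234` and yields `0x68ac`.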
    • ip: Add offset parameter to ip_cmsg_recv · ad6f939a
      Tom Herbert authored
      Add an ip_cmsg_recv_offset function which takes an offset argument
      that indicates the starting offset in the skb from which data is
      being received. This will be useful for UDP when providing the
      checksum to user space.
      
      ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of
      zero.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ad6f939a
    • ip: Add offset parameter to ip_cmsg_recv · 5961de9f
      Tom Herbert authored
      Add an ip_cmsg_recv_offset function which takes an offset argument
      that indicates the starting offset in the skb from which data is
      being received. This will be useful for UDP when providing the
      checksum to user space.
      
      ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of
      zero.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5961de9f
    • ip: IP cmsg cleanup · c44d13d6
      Tom Herbert authored
      Move the IP_CMSG_* constants from ip_sockglue.c to inet_sock.h so that
      they can be referenced in other source files.
      
      Restructure ip_cmsg_recv so that it no longer walks the flags with
      a shift but tests each flag with 'and'. This eliminates both the
      shift and a conditional check per flag.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c44d13d6
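The restructuring described in the commit above (testing named flag bits with 'and' instead of shifting through the flag word) can be sketched as follows. The `DEMO_CMSG_*` constants and both functions are hypothetical; they only record which options would have been handled, so the two styles can be compared. The early return once no flags remain is an assumption about how such a loop is typically shortened, not something the commit message spells out.

```c
#include <stdint.h>

/* Hypothetical flag bits, standing in for the IP_CMSG_* constants. */
#define DEMO_CMSG_PKTINFO (1u << 0)
#define DEMO_CMSG_TTL     (1u << 1)
#define DEMO_CMSG_TOS     (1u << 2)

/* Shift style: shift the flag word after every option and test bit 0. */
static uint32_t demo_recv_by_shift(uint32_t flags)
{
    uint32_t handled = 0;

    if (flags & 1)
        handled |= DEMO_CMSG_PKTINFO;
    if ((flags >>= 1) & 1)
        handled |= DEMO_CMSG_TTL;
    if ((flags >>= 1) & 1)
        handled |= DEMO_CMSG_TOS;
    return handled;
}

/* Mask style: test each named bit with 'and', clear it, and stop
 * as soon as no requested flags remain. */
static uint32_t demo_recv_by_mask(uint32_t flags)
{
    uint32_t handled = 0;

    if (flags & DEMO_CMSG_PKTINFO) {
        handled |= DEMO_CMSG_PKTINFO;
        flags &= ~DEMO_CMSG_PKTINFO;
        if (!flags)
            return handled;
    }
    if (flags & DEMO_CMSG_TTL) {
        handled |= DEMO_CMSG_TTL;
        flags &= ~DEMO_CMSG_TTL;
        if (!flags)
            return handled;
    }
    if (flags & DEMO_CMSG_TOS)
        handled |= DEMO_CMSG_TOS;
    return handled;
}
```

Both functions report the same set of handled options for every input; the mask version simply no longer depends on the options being processed in bit order.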
    • ip: Move checksum convert defines to inet · 224d019c
      Tom Herbert authored
      Move convert_csum from udp_sock to inet_sock. This allows the
      possibility that we can use convert checksum for different types
      of sockets and also allows convert checksum to be enabled from
      inet layer (what we'll want to do when enabling IP_CHECKSUM cmsg).
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      224d019c
    • netlink: Warn on unordered or illegal nla_nest_cancel() or nlmsg_cancel() · 149118d8
      Thomas Graf authored
      Calling nla_nest_cancel() in a different order than the nesting was
      built up can lead to negative offsets being calculated, which
      results in skb_trim() being called with an underflowed unsigned
      int. Warn if mark < skb->data, as it's definitely a bug.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      149118d8
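The underflow described in the commit above can be reproduced in plain C. The sketch below uses a hypothetical `struct demo_buf` in place of an skb; `demo_trim_to_mark()` and `demo_unchecked_len()` are illustrative names, not kernel functions. It shows why a mark that lies before the start of the data yields a huge unsigned length when the pointer difference is converted, and how a simple comparison catches that case first.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical, much simplified stand-in for an skb. */
struct demo_buf {
    unsigned char *data; /* start of the message data */
    unsigned int len;    /* number of valid bytes from data */
};

/* What happens without a check: a negative offset converted to
 * unsigned int wraps around to a very large value. */
static unsigned int demo_unchecked_len(ptrdiff_t offset)
{
    return (unsigned int)offset;
}

/* Trim the buffer back to mark, rejecting marks that lie before the
 * start of the data. Both pointers must point into the same array. */
static int demo_trim_to_mark(struct demo_buf *b, const unsigned char *mark)
{
    if (mark < b->data)
        return -1; /* the condition the commit warns about */

    b->len = (unsigned int)(mark - b->data);
    return 0;
}
```

With an offset of -8, `demo_unchecked_len()` returns `UINT_MAX - 7`, which is the kind of length the unchecked trim would have been handed.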
  2. 05 Jan, 2015 17 commits
  3. 03 Jan, 2015 10 commits
    • Merge branch 'rhashtable-next' · 7beceebf
      David S. Miller authored
      Thomas Graf says:
      
      ====================
      rhashtable: Per bucket locks & deferred table resizing
      
      Prepares for and introduces per bucket spinlocks and deferred table
      resizing. This allows for parallel table mutations in different hash
      buckets from atomic context. The resizing occurs in the background
      in a separate worker thread while lookups, inserts, and removals can
      continue.
      
      Also modified the chain linked list to be terminated with a special
      nulls marker to allow entries to move between multiple lists.
      
      Last but not least, reintroduces lockless netlink_lookup() with
      deferred Netlink socket destruction to avoid the side effect of
      increased netlink_release() runtime.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7beceebf
    • netlink: Lockless lookup with RCU grace period in socket release · 21e4902a
      Thomas Graf authored
      Defers the release of the socket reference using call_rcu() to
      allow using an RCU read-side protected call to rhashtable_lookup().
      
      This restores behaviour and performance gains as previously
      introduced by e341694e ("netlink: Convert netlink_lookup() to use
      RCU protected hash table") without the side effect of severely
      delayed socket destruction.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      21e4902a
    • rhashtable: Supports for nulls marker · f89bd6f8
      Thomas Graf authored
      In order to allow for wider usage of rhashtable, use a special nulls
      marker to terminate each chain. The reason for not using the existing
      nulls_list is that the prev pointer usage would not be valid as entries
      can be linked in two different buckets at the same time.
      
      The 4 nulls base bits can be set through the rhashtable_params structure
      like this:
      
      struct rhashtable_params params = {
              [...]
              .nulls_base = (1U << RHT_BASE_SHIFT),
      };
      
      This reduces the hash length from 32 bits to 27 bits.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f89bd6f8
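The bit budget mentioned in the commit above (4 base bits plus a 27-bit hash) fits a 32-bit value once one bit is reserved to tell a marker apart from a real pointer. The sketch below shows one plausible encoding under that assumption: the low bit is set to flag a nulls marker, and the remaining bits carry base plus hash. All `DEMO_*` and `demo_*` names are hypothetical and the exact layout used by rhashtable may differ.

```c
#include <stdint.h>

#define DEMO_HASH_BITS  27u            /* hash bits kept in the marker */
#define DEMO_BASE_SHIFT DEMO_HASH_BITS /* base bits sit above the hash */
#define DEMO_HASH_MASK  ((1u << DEMO_HASH_BITS) - 1u)

/* Build a marker: shift the value up by one and set the low bit.
 * Properly aligned pointers have a clear low bit, so the two kinds of
 * chain terminator/link can be told apart. */
static uintptr_t demo_nulls_marker(uint32_t base, uint32_t hash)
{
    uint32_t value = base + (hash & DEMO_HASH_MASK);

    return ((uintptr_t)value << 1) | 1u;
}

/* True if the chain link is a marker rather than a pointer. */
static int demo_is_nulls(uintptr_t link)
{
    return (int)(link & 1u);
}

/* Recover base + hash from a marker. */
static uint32_t demo_nulls_value(uintptr_t marker)
{
    return (uint32_t)(marker >> 1);
}
```

A reader that reaches a marker at the end of a chain can compare the recovered hash with the bucket it started in, which is how a lookup can notice that an entry it followed had been moved to a different list.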
    • rhashtable: Per bucket locks & deferred expansion/shrinking · 97defe1e
      Thomas Graf authored
      Introduces an array of spinlocks to protect bucket mutations. The number
      of spinlocks per CPU is configurable, and the lock for a bucket is
      selected based on the bucket's hash. This allows for parallel insertions
      and removals of entries which do not share a lock.
      
      The patch also defers expansion and shrinking to a worker queue which
      allows insertion and removal from atomic context. Insertions and
      deletions may occur in parallel to it and are only held up briefly
      while the particular bucket is linked or unzipped.
      
      Mutations of the bucket table pointer are protected by a new mutex;
      read access is RCU protected.
      
      In the event of an expansion or shrinking, the newly allocated bucket
      table is exposed as a so-called future table as soon as the resize
      process starts. Lookups, deletions, and insertions will briefly use
      both tables. The future table becomes the main table after an RCU
      grace period has elapsed and the initial linking of the old to the new
      table has been performed. Optimization of the chains to make use of
      the new number of buckets follows only once the new table is in use.
      
      The side effect of this is that during that RCU grace period, a bucket
      traversal using any rht_for_each() variant on the main table will not see
      any insertions performed during the RCU grace period which would at that
      point land in the future table. The lookup will see them as it searches
      both tables if needed.
      
      Having multiple insertions and removals occur in parallel requires nelems
      to become an atomic counter.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      97defe1e
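The locking scheme in the commit above (a small array of locks, one chosen per bucket from the hash, plus an atomic element counter) can be sketched in user space with C11 atomics. Everything here is an illustrative assumption: the `demo_*` names, the fixed table size, the identity hash, and the use of `atomic_flag` as a spinlock in place of kernel spinlocks. Resizing and RCU are not modelled.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NBUCKETS 16u /* power of two */
#define DEMO_NLOCKS   4u  /* power of two, at most DEMO_NBUCKETS */

struct demo_entry {
    uint32_t key;
    struct demo_entry *next;
};

struct demo_table {
    struct demo_entry *buckets[DEMO_NBUCKETS];
    atomic_flag locks[DEMO_NLOCKS]; /* shared by groups of buckets */
    atomic_uint nelems;             /* updated outside any bucket lock */
};

static void demo_table_init(struct demo_table *t)
{
    for (size_t i = 0; i < DEMO_NBUCKETS; i++)
        t->buckets[i] = NULL;
    for (size_t i = 0; i < DEMO_NLOCKS; i++)
        atomic_flag_clear(&t->locks[i]);
    atomic_init(&t->nelems, 0u);
}

/* Buckets whose hashes agree in the low bits share one lock. */
static unsigned int demo_lock_index(uint32_t hash)
{
    return hash & (DEMO_NLOCKS - 1u);
}

static void demo_insert(struct demo_table *t, struct demo_entry *e)
{
    uint32_t hash = e->key; /* identity hash, for illustration only */
    uint32_t idx = hash & (DEMO_NBUCKETS - 1u);
    atomic_flag *lock = &t->locks[demo_lock_index(hash)];

    /* Spin until this bucket's lock is acquired. */
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ;
    e->next = t->buckets[idx];
    t->buckets[idx] = e;
    atomic_flag_clear_explicit(lock, memory_order_release);

    /* Inserts under different locks may run concurrently, so the
     * element count has to be an atomic counter. */
    atomic_fetch_add(&t->nelems, 1u);
}
```

Because the lock index uses a subset of the bits that select the bucket, two entries that land in the same bucket always take the same lock, while entries in buckets that map to different locks can be inserted in parallel.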
    • spinlock: Add spin_lock_bh_nested() · 113948d8
      Thomas Graf authored
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      113948d8
    • nft_hash: Remove rhashtable_remove_pprev() · 897362e4
      Thomas Graf authored
      The removal function of nft_hash currently stores a reference to the
      previous element during lookup which is used to optimize removal later
      on. This was possible because a lock is held throughout calling
      rhashtable_lookup() and rhashtable_remove().
      
      With the introduction of deferred table resizing in parallel to lookups
      and insertions, the nftables lock will no longer synchronize all
      table mutations and the stored pprev may become invalid.
      
      Removing this optimization makes removal slightly more expensive on
      average but allows taking the resize cost out of the insert and
      remove path.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Cc: netfilter-devel@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      897362e4
    • rhashtable: Factor out bucket_tail() function · b8e1943e
      Thomas Graf authored
      Subsequent patches will require access to the bucket tail. Access
      to the tail is relatively cheap as the automatic resizing of the
      table should keep the number of entries per bucket to no more
      than 0.75 on average.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b8e1943e
    • rhashtable: Convert bucket iterators to take table and index · 88d6ed15
      Thomas Graf authored
      This patch is in preparation to introduce per bucket spinlocks. It
      extends all iterator macros to take the bucket table and bucket
      index. It also introduces a new rht_dereference_bucket() to
      handle protected accesses to buckets.
      
      It adds a barrier() to the RCU iterators to prevent the
      compiler from caching the first element.
      
      The lockdep verifier is introduced as a stub which always succeeds
      and is properly implemented in the next patch, when the locks are
      introduced.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88d6ed15
    • rhashtable: Do hashing inside of rhashtable_lookup_compare() · 8d24c0b4
      Thomas Graf authored
      Hash the key inside of rhashtable_lookup_compare() like
      rhashtable_lookup() does. This allows simplifying the hashing
      functions and keeping them private.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Cc: netfilter-devel@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8d24c0b4
  4. 02 Jan, 2015 4 commits