1. 21 May, 2015 24 commits
    • Daniel Pieczko's avatar
      sfc: Enable a VF to get its own MAC address · 0d5e0fbb
      Daniel Pieczko authored
      A VF's MAC address is set by its parent PF and added to its vport.
      To get this MAC address, the VF must use MC_CMD_ VPORT_GET_MAC_ADDRESSES.
      In the current scheme, a VF's vport should only have one MAC address,
      so warn if this is not the case.
      Signed-off-by: default avatarShradha Shah <sshah@solarflare.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0d5e0fbb
    • Edward Cree's avatar
      sfc: protect filter table against use-after-free · 0d322413
      Edward Cree authored
      If MCDI timeouts are encountered during efx_ef10_filter_table_remove(),
      an FLR will be queued, but efx->filter_state will still be kfree()d.
      The queued FLR will then call efx_ef10_filter_table_restore(), which
      will try to use efx->filter_state. This previously caused a panic.
      This patch adds an rwsem to protect the existence of efx->filter_state,
      separately from the spinlock protecting its contents.  Users which can
      race against efx_ef10_filter_table_remove() should down_read this rwsem.
      Signed-off-by: default avatarShradha Shah <sshah@solarflare.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0d322413
    • Shradha Shah's avatar
      sfc: Store the efx_nic struct of the current VF in the VF data struct · f1122a34
      Shradha Shah authored
      Initialised in efx_probe_vf and removal is dealt with in
      efx_ef10_remove.
      
      vf->efx is needed in future patches to change the MAC address
      of the VF via the parent PF, while the driver is bound to the
      VF.
      Example: ip link set dev vf NUM mac LLADDR
      Signed-off-by: default avatarShradha Shah <sshah@solarflare.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f1122a34
    • Shradha Shah's avatar
      sfc: save old MAC address in case sriov_mac_address_changed fails · cfc77c2f
      Shradha Shah authored
      Otherwise the PF and VF can disagree on the VF's MAC address and
      this leads to strange behaviour, up to and including kernel panics.
      Signed-off-by: default avatarShradha Shah <sshah@solarflare.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cfc77c2f
    • Shradha Shah's avatar
      sfc: Store vf_index in nic_data for Ef10. · 88a37de6
      Shradha Shah authored
      Added function efx_ef10_get_vf_index to store the vf_index
      in nic_data during probe
      
      vf_index is needed in future patches to access a particular
      VF in the VF data structure.
      
      Moved efx_ef10_probe_pf and efx_ef10_probe_vf in order to
      used efx_ef10_remove
      Signed-off-by: default avatarShradha Shah <sshah@solarflare.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      88a37de6
    • Shradha Shah's avatar
      sfc: MC_CMD_SET_MAC can only be called by the link control Function · 862f894c
      Shradha Shah authored
      MC_CMD_SET_MAC is privileged and can only by called by the link
      control function.
      
      This patch adds efx_ef10_mac_reconfigure_vf which avoids the call
      to MC_CMD_SET_MAC by the Virtual function
      Signed-off-by: default avatarShradha Shah <sshah@solarflare.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      862f894c
    • Shradha Shah's avatar
      af6a074d
    • Shradha Shah's avatar
      sfc: Add permissions to MCDI commands · 75122ec8
      Shradha Shah authored
      There is one primary function per adaptor, one link control function
      per port and the rest as categorised as general.
      
      This patch adds privileges to the MCDI commands based on which
      functions are allowed to call them.
      Signed-off-by: default avatarShradha Shah <sshah@solarflare.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      75122ec8
    • Vineet Gupta's avatar
      stmmac: replace open coded __netdev_alloc_skb_ip_align() with actual call · 4ec49a37
      Vineet Gupta authored
      This also matches with the sibling call netdev_alloc_skb_ip_align() made in
      rx fast path.
      Signed-off-by: default avatarVineet Gupta <vgupta@synopsys.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4ec49a37
    • Joe Perches's avatar
      qlge: Move jiffies_to_usecs immediately before loop · 3f6e785f
      Joe Perches authored
      30 usecs (or really, 1 jiffy) can go by pretty fast.
      
      Move the set of the timeout immediately before the loop.
      
      Remove the unnecessary max(1ul, usecs_to_jiffies(30)) as
      usecs_to_jiffies with a non-zero constant is guaranteed
      to be non-zero.
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3f6e785f
    • David S. Miller's avatar
      Merge branch 'rocker-transaction-fixes' · 4ac2dc89
      David S. Miller authored
      Simon Horman says:
      
      ====================
      rocker: transaction fixes
      
      this series addresses what appear to be errors in the handling of
      prepare and then commit transactions in the rocker driver.
      
      In all cases the problem is that data structures visible outside of
      the transaction are modified during the prepare phase.
      
      In the case of the first two patches this results in the kernel reporting a
      BUG. I have noted test-cases in the change logs.
      
      The third patch is also a bug fix, as noted by  Toshiaki Makita,
      however I have not been able to reliably reproduce the problem and
      thus have not provided a test case.
      
      The last patch is a correctness fix that does not fix a bug
      that manifests as far as I can tell.
      
      Changes: v3->v4
      * All patches
        - Add Jiri Pirko's ack
      * "rocker: do not make neighbour entry changes when preparing transactions"
        - Setting of entry values in all transaction phases
          as suggested by Toshiaki Makita
      * "rocker: make rocker_port_internal_vlan_id_{get,put}() non-transactional"
        - Remove Fixes tag as I believe this is a correctness rather than a bug fix
      
      Changes: v2->v3
      * "rocker: do not make neighbour entry changes when preparing transactions"
        - Correct inverted logic
        - Added ack from Scott Feldman
      
      Changes: v1->v2
      * "rocker: do not make neighbour entry changes when preparing transactions"
        - Revised changelog to reflect information from Toshiaki Makita
          that there is a bug that can manifest
        - Update address and ttl regardless of the value of the transaction state
      * All other patches
        - Added acks from Scott Feldman
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4ac2dc89
    • Simon Horman's avatar
      rocker: make rocker_port_internal_vlan_id_{get, put}() non-transactional · df6a2067
      Simon Horman authored
      The motivation for this is that rocker_port_internal_vlan_id_{get,put} appear
      to only partially implement the transaction model: memory allocation
      and freeing is transactional, but hash and bitmap manipulation is not.
      
      The latter could be fixed, however, as it is not currently exercised
      due to trans always being SWITCHDEV_TRANS_NONE it seems cleaner
      to make rocker_port_internal_vlan_id_get non-transactional.
      
      This problem was introduced by c4f20321 ("rocker: support
      prepare-commit transaction model").
      
      Found by inspection.
      I do not believe that this change should have any run-time effect.
      Acked-by: default avatarScott Feldman <sfeldma@gmail.com>
      Acked-by: default avatarJiri Pirko <jiri@resnulli.us>
      Signed-off-by: default avatarSimon Horman <simon.horman@netronome.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      df6a2067
    • Simon Horman's avatar
      rocker: do not make neighbour entry changes when preparing transactions · 550ecc92
      Simon Horman authored
      rocker_port_ipv4_nh() and in turn rocker_port_ipv4_neigh() may be
      be called with trans == SWITCHDEV_TRANS_PREPARE and then
      trans == SWITCHDEV_TRANS_COMMIT from switchdev_port_obj_set() via
      fib_table_insert().
      
      The first time that rocker_port_ipv4_nh() is called, with
      trans == SWITCHDEV_TRANS_PREPARE, _rocker_neigh_add() adds a new entry to
      the neigh table.
      
      And the second time  rocker_port_ipv4_nh() is called, with
      trans == SWITCHDEV_TRANS_COMMIT, that entry is found. This causes
      rocker_port_ipv4_nh() to believe it is not adding an entry and thus it
      frees "entry", which is still present in rocker driver's neigh table.
      
      This problem does not appear to affect deletion as my analysis is that
      deletion is always performed with trans == SWITCHDEV_TRANS_NONE.
      
      For completeness _rocker_neigh_{add,del,prepare} are updated not to
      manipulate fib table entries if trans == SWITCHDEV_TRANS_PREPARE.
      
      Fixes: c4f20321 ("rocker: support prepare-commit transaction model")
      Reported-by: default avatarToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: default avatarScott Feldman <sfeldma@gmail.com>
      Acked-by: default avatarJiri Pirko <jiri@resnulli.us>
      Signed-off-by: default avatarSimon Horman <simon.horman@netronome.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      550ecc92
    • Simon Horman's avatar
      rocker: do not modify fdb table in rocker_port_fdb() when preparing transactions · 42e94889
      Simon Horman authored
      rocker_port_fdb_flush() may be called be called with
      trans == SWITCHDEV_TRANS_PREPARE and then trans == SWITCHDEV_TRANS_COMMIT from
      switchdev_port_attr_set() via switchdev_port_obj_add().
      
      Adding the new entry to the FDB table when trans == SWITCHDEV_TRANS_PREPARE
      may result in a memory leak because when trans == SWITCHDEV_TRANS_PREPARE
      rocker_flow_tbl_bridge() will allocate memory when called via
      rocker_port_fdb_learn(). However, when trans == SWITCHDEV_TRANS_COMMIT
      the presence of the FDB entry in the FDB table causes
      rocker_port_fdb() to set the ROCKER_OP_FLAG_REFRESH flag which results
      in rocker_port_fdb_learn() skipping the call to rocker_flow_tbl_bridge()
      which would free the memory allocated by it when
      trans == SWITCHDEV_TRANS_PREPARE.
      
      ip link add br0 type bridge
      ip link set up dev eth0
      ip link set dev eth0 master br0
      bridge fdb add 52:54:00:12:35:08 dev eth0
      bridge fdb add 52:54:00:12:35:09 dev eth0
      [    2.600730] ------------[ cut here ]------------
      [    2.601002] kernel BUG at drivers/net/ethernet/rocker/rocker.c:4369!
      [    2.601373] invalid opcode: 0000 [#1] SMP
      [    2.601963] Modules linked in:
      [    2.602355] CPU: 0 PID: 64 Comm: bridge Not tainted 4.1.0-rc3-01048-g6d0f50c50211-dirty #1075
      [    2.602721] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.0-0-g4c59f5d-20150219_092859-nilsson.home.kraxel.org 04/01/2014
      [    2.602721] task: ffff880019facef0 ti: ffff88001f96c000 task.ti: ffff88001f96c000
      [    2.602721] RIP: 0010:[<ffffffff811f1470>]  [<ffffffff811f1470>] rocker_port_obj_add+0x150/0x160
      [    2.602721] RSP: 0018:ffff88001f96fa98  EFLAGS: 00000212
      [    2.602721] RAX: ffff880019d4fa68 RBX: ffff88001f96fb18 RCX: 0000000000000000
      [    2.602721] RDX: ffff880019d4f000 RSI: ffff88001f96fb18 RDI: ffff880019d4f000
      [    2.602721] RBP: 0000000000000001 R08: 0000000000000000 R09: ffff88001f904620
      [    2.602721] R10: ffff88001f96fb60 R11: ffff880019e9d100 R12: ffff88001f96fb18
      [    2.602721] R13: ffff880019d4f680 R14: ffff88001f904610 R15: ffff8800198f7b80
      [    2.602721] FS:  00007f3eee917700(0000) GS:ffff88001b000000(0000) knlGS:0000000000000000
      [    2.602721] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    2.602721] CR2: 00007f3eee4a15cb CR3: 000000001f933000 CR4: 00000000000006b0
      [    2.602721] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    2.602721] DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
      [    2.602721] Stack:
      [    2.602721]  0000000000000000 ffff88001f96fb18 ffff880019d4f000 ffff88001f96fb18
      [    2.602721]  ffff880019d4f000 ffffffff81332105 ffff88001f96fb50 ffffffff814464c0
      [    2.602721]  ffff88001f96fb18 ffff88001f904600 ffff880019d4f000 ffffffff813326e5
      [    2.602721] Call Trace:
      [    2.602721]  [<ffffffff81332105>] ? __switchdev_port_obj_add+0x25/0x90
      [    2.602721]  [<ffffffff813326e5>] ? switchdev_port_obj_add+0x25/0xc0
      [    2.602721]  [<ffffffff813327b1>] ? switchdev_port_fdb_add+0x31/0x40
      [    2.602721]  [<ffffffff8123911f>] ? rtnl_fdb_add+0xff/0x1e0
      [    2.602721]  [<ffffffff81237d8e>] ? rtnetlink_rcv_msg+0x7e/0x250
      [    2.602721]  [<ffffffff8121d1ce>] ? __skb_recv_datagram+0xfe/0x4b0
      [    2.602721]  [<ffffffff81237d10>] ? rtnetlink_rcv+0x30/0x30
      [    2.602721]  [<ffffffff81247958>] ? netlink_rcv_skb+0xa8/0xd0
      [    2.602721]  [<ffffffff81237cff>] ? rtnetlink_rcv+0x1f/0x30
      [    2.602721]  [<ffffffff81247220>] ? netlink_unicast+0x150/0x200
      [    2.602721]  [<ffffffff81247714>] ? netlink_sendmsg+0x374/0x3e0
      [    2.602721]  [<ffffffff8120f8df>] ? sock_sendmsg+0xf/0x30
      [    2.602721]  [<ffffffff8120ffd3>] ? ___sys_sendmsg+0x1f3/0x200
      [    2.602721]  [<ffffffff812100e5>] ? ___sys_recvmsg+0x105/0x140
      [    2.602721]  [<ffffffff810a36f0>] ? SyS_readahead+0x90/0x90
      [    2.602721]  [<ffffffff81098dfd>] ? filemap_map_pages+0x1ed/0x210
      [    2.602721]  [<ffffffff810b77fc>] ? handle_mm_fault+0x5fc/0xe50
      [    2.602721]  [<ffffffff81210ef9>] ? __sys_sendmsg+0x39/0x70
      [    2.602721]  [<ffffffff8133ce17>] ? system_call_fastpath+0x12/0x6a
      [    2.602721] Code: b7 8f a0 06 00 00 48 83 bf 88 06 00 00 00 74 1d 48 83 c4 08 89 ee 4c 89 ef 5b 5d 41 5c 41 5d 0f b7 c9 45 31 c0 e9 51 db ff ff 90 <0f> 0b b8 ea ff ff ff e9 cf fe ff ff 0f 1f 40 00 41 57 41 56 b9
      [    2.602721] RIP  [<ffffffff811f1470>] rocker_port_obj_add+0x150/0x160
      [    2.602721]  RSP <ffff88001f96fa98>
      [    2.615848] ---[ end trace 4f7b4f1c98077108 ]---
      
      The above is resolved by not adding the new FDB entry to the FDB table
      if trans == SWITCHDEV_TRANS_PREPARE.
      
      For symmetry this patch also skips deleting FDB entries from the FDB
      table trans == SWITCHDEV_TRANS_PREPARE. However, my analysis is that
      this never occurs as trans is always SWITCHDEV_TRANS_NONE when removing
      FDB entries.
      
      Fixes: c4f20321 ("rocker: support prepare-commit transaction model")
      Acked-by: default avatarScott Feldman <sfeldma@gmail.com>
      Acked-by: default avatarJiri Pirko <jiri@resnulli.us>
      Signed-off-by: default avatarSimon Horman <simon.horman@netronome.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      42e94889
    • Simon Horman's avatar
      rocker: do not delete fdb entries in rocker_port_fdb_flush() when preparing transactions · 3098ac39
      Simon Horman authored
      rocker_port_fdb_flush() is called by rocker_port_stp_update() which in
      turn may be called with trans == SWITCHDEV_TRANS_PREPARE and then
      trans == SWITCHDEV_TRANS_COMMIT from switchdev_port_attr_set() via
      br_set_state().
      
      When rocker_port_fdb_flush() is called with trans == SWITCHDEV_TRANS_PREPARE
      it calls rocker_port_fdb_learn() for each entry in the FDB table which in
      turn calls rocker_flow_tbl_bridge() which will allocate memory using
      rocker_port_kzalloc(). rocker_port_fdb_learn() will then remove the entry
      from the FDB table.
      
      Then when rocker_port_fdb_learn() is called with
      trans == SWITCHDEV_TRANS_PREPARE no calls are made to rocker_port_fdb_learn()
      because there are no longer any entries present in the FDB table. Thus the
      memory previously allocated by rocker_port_fdb_learn() is leaked resulting
      in the kernel BUG() below.
      
      Furthermore, it looks like the driver ends up with an incorrect view of the
      fdb table as the FDB entries are purged from the driver's table but not the
      hardware's table.
      
      ip link add br0 type bridge
      ip link set up dev eth0
      sleep 1
      ip link set dev eth0 master br0
      [    3.704360] ------------[ cut here ]------------
      [    3.704611] kernel BUG at drivers/net/ethernet/rocker/rocker.c:4289!
      [    3.704962] invalid opcode: 0000 [#1] SMP
      [    3.705537] Modules linked in:
      [    3.705919] CPU: 0 PID: 63 Comm: ip Not tainted 4.1.0-rc3-01046-gb9fbe709 #1044
      [    3.706191] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.0-0-g4c59f5d-20150219_092859-nilsson.home.kraxel.org 04/01/2014
      [    3.706820] task: ffff880019f70150 ti: ffff88001f92c000 task.ti: ffff88001f92c000
      [    3.707138] RIP: 0010:[<ffffffff811f0080>]  [<ffffffff811f0080>] rocker_port_attr_set+0xe0/0xf0
      [    3.707990] RSP: 0018:ffff88001f92f808  EFLAGS: 00000212
      [    3.708200] RAX: ffff880019d4fa68 RBX: ffff880019d4f000 RCX: 0000000000000000
      [    3.708471] RDX: 000000000000000c RSI: ffff88001f92f890 RDI: ffff880019d4f680
      [    3.708740] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000004
      [    3.708999] R10: ffff880000034024 R11: 0000000000000000 R12: ffff88001f92f890
      [    3.709276] R13: ffff88001f8f1c00 R14: 000000000000000b R15: 0000000000000000
      [    3.709303] FS:  00007f8ab66bd700(0000) GS:ffff88001b000000(0000) knlGS:0000000000000000
      [    3.709303] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    3.709303] CR2: 0000000000654988 CR3: 000000001f8f3000 CR4: 00000000000006b0
      [    3.709303] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    3.709303] DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
      [    3.709303] Stack:
      [    3.709303]  ffff88001f8f1c00 000000000000000b ffff88001f92f890 ffff880019d4f000
      [    3.709303]  ffff88001f92f890 ffffffff813332f5 ffff88001f92f880 0000000000000000
      [    3.709303]  ffff88001f92f890 0000000000000001 ffff880019d4f000 ffffffff81333627
      [    3.709303] Call Trace:
      [    3.709303]  [<ffffffff813332f5>] ? __switchdev_port_attr_set+0x25/0x90
      [    3.709303]  [<ffffffff81333627>] ? switchdev_port_attr_set+0x27/0x120
      [    3.709303]  [<ffffffff81318e86>] ? br_set_state+0x36/0x50
      [    3.709303]  [<ffffffff8131795c>] ? br_add_if+0x37c/0x400
      [    3.709303]  [<ffffffff81238ce1>] ? do_setlink+0x7e1/0x800
      [    3.709303]  [<ffffffff8111f980>] ? radix_tree_lookup_slot+0x10/0x30
      [    3.709303]  [<ffffffff81136fba>] ? nla_parse+0xaa/0x110
      [    3.709303]  [<ffffffff81239c98>] ? rtnl_newlink+0x548/0x870
      [    3.709303]  [<ffffffff8111f900>] ? __radix_tree_lookup+0x40/0xb0
      [    3.709303]  [<ffffffff81136f3e>] ? nla_parse+0x2e/0x110
      [    3.709303]  [<ffffffff81237d7e>] ? rtnetlink_rcv_msg+0x7e/0x250
      [    3.709303]  [<ffffffff8121d1be>] ? __skb_recv_datagram+0xfe/0x4b0
      [    3.709303]  [<ffffffff81237d00>] ? rtnetlink_rcv+0x30/0x30
      [    3.709303]  [<ffffffff81247948>] ? netlink_rcv_skb+0xa8/0xd0
      [    3.709303]  [<ffffffff81237cef>] ? rtnetlink_rcv+0x1f/0x30
      [    3.709303]  [<ffffffff81247210>] ? netlink_unicast+0x150/0x200
      [    3.709303]  [<ffffffff81247704>] ? netlink_sendmsg+0x374/0x3e0
      [    3.709303]  [<ffffffff8120f8cf>] ? sock_sendmsg+0xf/0x30
      [    3.709303]  [<ffffffff8120ffc3>] ? ___sys_sendmsg+0x1f3/0x200
      [    3.709303]  [<ffffffff812100d5>] ? ___sys_recvmsg+0x105/0x140
      [    3.709303]  [<ffffffff812228d9>] ? dev_get_by_name_rcu+0x69/0x90
      [    3.709303]  [<ffffffff812228d9>] ? dev_get_by_name_rcu+0x69/0x90
      [    3.709303]  [<ffffffff81217b7d>] ? skb_dequeue+0x4d/0x60
      [    3.709303]  [<ffffffff81217bb0>] ? skb_queue_purge+0x20/0x30
      [    3.709303]  [<ffffffff810ebdcf>] ? __inode_wait_for_writeback+0x5f/0xb0
      [    3.709303]  [<ffffffff810648b0>] ? autoremove_wake_function+0x30/0x30
      [    3.709303]  [<ffffffff81210ee9>] ? __sys_sendmsg+0x39/0x70
      [    3.709303]  [<ffffffff8133e097>] ? system_call_fastpath+0x12/0x6a
      [    3.709303] Code: bb 90 06 00 00 48 c7 04 24 00 00 00 00 45 31 c9 45 31 c0 48 c7 c1 c0 b7 1e 81 89 ea e8 da da ff ff eb 95 0f 1f 84 00 00 00 00 00 <0f> 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 83 fe 15 75
      [    3.709303] RIP  [<ffffffff811f0080>] rocker_port_attr_set+0xe0/0xf0
      [    3.709303]  RSP <ffff88001f92f808>
      [    3.721409] ---[ end trace b7481fcb7cb032aa ]---
      Segmentation fault
      
      Fixes: c4f20321 ("rocker: support prepare-commit transaction model")
      Acked-by: default avatarScott Feldman <sfeldma@gmail.com>
      Acked-by: default avatarJiri Pirko <jiri@resnulli.us>
      Signed-off-by: default avatarSimon Horman <simon.horman@netronome.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3098ac39
    • Joe Perches's avatar
      spider_net: Use DECLARE_BITMAP · e26cc7ff
      Joe Perches authored
      Use the generic mechanism to declare a bitmap instead of unsigned long.
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e26cc7ff
    • David S. Miller's avatar
      Merge branch 'ebpf-tail-call' · 3f55b7ed
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      bpf: introduce bpf_tail_call() helper
      
      introduce bpf_tail_call(ctx, &jmp_table, index) helper function
      which can be used from BPF programs like:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        bpf_tail_call(ctx, &jmp_table, index);
        ...
      }
      that is roughly equivalent to:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        if (jmp_table[index])
          return (*jmp_table[index])(ctx);
        ...
      }
      The important detail that it's not a normal call, but a tail call.
      The kernel stack is precious, so this helper reuses the current
      stack frame and jumps into another BPF program without adding
      extra call frame.
      It's trivially done in interpreter and a bit trickier in JITs.
      
      Use cases:
      - simplify complex programs
      - dispatch into other programs
        (for example: index in jump table can be syscall number or network protocol)
      - build dynamic chains of programs
      
      The chain of tail calls can form unpredictable dynamic loops therefore
      tail_call_cnt is used to limit the number of calls and currently is set to 32.
      
      patch 1 - support bpf_tail_call() in interpreter
      patch 2 - support in x64 JIT
      We've discussed what's neccessary to support it in arm64/s390 JITs
      and it looks fine.
      patch 3 - sample example for tracing
      patch 4 - sample example for networking
      
      More details in every patch.
      
      This set went through several iterations of reviews/fixes and older
      attempts can be seen:
      https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/log/?h=tail_call_v[123456]
      - tail_call_v1 does it without touching JITs but introduces overhead
        for all programs that don't use this helper function.
      - tail_call_v2 still has some overhead and x64 JIT does full stack
        unwind (prologue skipping optimization wasn't there)
      - tail_call_v3 reuses 'call' instruction encoding and has interpreter
        overhead for every normal call
      - tail_call_v4 fixes above architectural shortcomings and v5,v6 fix few
        more bugs
      
      This last tail_call_v6 approach seems to be the best.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3f55b7ed
    • Alexei Starovoitov's avatar
      samples/bpf: bpf_tail_call example for networking · 530b2c86
      Alexei Starovoitov authored
      Usage:
      $ sudo ./sockex3
      IP     src.port -> dst.port               bytes      packets
      127.0.0.1.42010 -> 127.0.0.1.12865         1568            8
      127.0.0.1.59526 -> 127.0.0.1.33778     11422636       173070
      127.0.0.1.33778 -> 127.0.0.1.59526  11260224828       341974
      127.0.0.1.12865 -> 127.0.0.1.42010         1832           12
      IP     src.port -> dst.port               bytes      packets
      127.0.0.1.42010 -> 127.0.0.1.12865         1568            8
      127.0.0.1.59526 -> 127.0.0.1.33778     23198092       351486
      127.0.0.1.33778 -> 127.0.0.1.59526  22972698518       698616
      127.0.0.1.12865 -> 127.0.0.1.42010         1832           12
      
      this example is similar to sockex2 in a way that it accumulates per-flow
      statistics, but it does packet parsing differently.
      sockex2 inlines full packet parser routine into single bpf program.
      This sockex3 example have 4 independent programs that parse vlan, mpls, ip, ipv6
      and one main program that starts the process.
      bpf_tail_call() mechanism allows each program to be small and be called
      on demand potentially multiple times, so that many vlan, mpls, ip in ip,
      gre encapsulations can be parsed. These and other protocol parsers can
      be added or removed at runtime. TLVs can be parsed in similar manner.
      Note, tail_call_cnt dynamic check limits the number of tail calls to 32.
      Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      530b2c86
    • Alexei Starovoitov's avatar
      samples/bpf: bpf_tail_call example for tracing · 5bacd780
      Alexei Starovoitov authored
      kprobe example that demonstrates how future seccomp programs may look like.
      It attaches to seccomp_phase1() function and tail-calls other BPF programs
      depending on syscall number.
      
      Existing optimized classic BPF seccomp programs generated by Chrome look like:
      if (sd.nr < 121) {
        if (sd.nr < 57) {
          if (sd.nr < 22) {
            if (sd.nr < 7) {
              if (sd.nr < 4) {
                if (sd.nr < 1) {
                  check sys_read
                } else {
                  if (sd.nr < 3) {
                    check sys_write and sys_open
                  } else {
                    check sys_close
                  }
                }
              } else {
            } else {
          } else {
        } else {
      } else {
      }
      
      the future seccomp using native eBPF may look like:
        bpf_tail_call(&sd, &syscall_jmp_table, sd.nr);
      which is simpler, faster and leaves more room for per-syscall checks.
      
      Usage:
      $ sudo ./tracex5
      <...>-366   [001] d...     4.870033: : read(fd=1, buf=00007f6d5bebf000, size=771)
      <...>-369   [003] d...     4.870066: : mmap
      <...>-369   [003] d...     4.870077: : syscall=110 (one of get/set uid/pid/gid)
      <...>-369   [003] d...     4.870089: : syscall=107 (one of get/set uid/pid/gid)
         sh-369   [000] d...     4.891740: : read(fd=0, buf=00000000023d1000, size=512)
         sh-369   [000] d...     4.891747: : write(fd=1, buf=00000000023d3000, size=512)
         sh-369   [000] d...     4.891747: : read(fd=1, buf=00000000023d3000, size=512)
      Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5bacd780
    • Alexei Starovoitov's avatar
      x86: bpf_jit: implement bpf_tail_call() helper · b52f00e6
      Alexei Starovoitov authored
      bpf_tail_call() arguments:
      ctx - context pointer
      jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
      index - index in the jump table
      
      In this implementation x64 JIT bypasses stack unwind and jumps into the
      callee program after prologue, so the callee program reuses the same stack.
      
      The logic can be roughly expressed in C like:
      
      u32 tail_call_cnt;
      
      void *jumptable[2] = { &&label1, &&label2 };
      
      int bpf_prog1(void *ctx)
      {
      label1:
          ...
      }
      
      int bpf_prog2(void *ctx)
      {
      label2:
          ...
      }
      
      int bpf_prog1(void *ctx)
      {
          ...
          if (tail_call_cnt++ < MAX_TAIL_CALL_CNT)
              goto *jumptable[index]; ... and pass my 'ctx' to callee ...
      
          ... fall through if no entry in jumptable ...
      }
      
      Note that 'skip current program epilogue and next program prologue' is
      an optimization. Other JITs don't have to do it the same way.
      >From safety point of view it's valid as well, since programs always
      initialize the stack before use, so any residue in the stack left by
      the current program is not going be read. The same verifier checks are
      done for the calls from the kernel into all bpf programs.
      Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b52f00e6
    • Alexei Starovoitov's avatar
      bpf: allow bpf programs to tail-call other bpf programs · 04fd61ab
      Alexei Starovoitov authored
      introduce bpf_tail_call(ctx, &jmp_table, index) helper function
      which can be used from BPF programs like:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        bpf_tail_call(ctx, &jmp_table, index);
        ...
      }
      that is roughly equivalent to:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        if (jmp_table[index])
          return (*jmp_table[index])(ctx);
        ...
      }
      The important detail that it's not a normal call, but a tail call.
      The kernel stack is precious, so this helper reuses the current
      stack frame and jumps into another BPF program without adding
      extra call frame.
      It's trivially done in interpreter and a bit trickier in JITs.
      In case of x64 JIT the bigger part of generated assembler prologue
      is common for all programs, so it is simply skipped while jumping.
      Other JITs can do similar prologue-skipping optimization or
      do stack unwind before jumping into the next program.
      
      bpf_tail_call() arguments:
      ctx - context pointer
      jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
      index - index in the jump table
      
      Since all BPF programs are idenitified by file descriptor, user space
      need to populate the jmp_table with FDs of other BPF programs.
      If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
      and program execution continues as normal.
      
      New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
      populate this jmp_table array with FDs of other bpf programs.
      Programs can share the same jmp_table array or use multiple jmp_tables.
      
      The chain of tail calls can form unpredictable dynamic loops therefore
      tail_call_cnt is used to limit the number of calls and currently is set to 32.
      
      Use cases:
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      
      ==========
      - simplify complex programs by splitting them into a sequence of small programs
      
      - dispatch routine
        For tracing and future seccomp the program may be triggered on all system
        calls, but processing of syscall arguments will be different. It's more
        efficient to implement them as:
        int syscall_entry(struct seccomp_data *ctx)
        {
           bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
           ... default: process unknown syscall ...
        }
        int sys_write_event(struct seccomp_data *ctx) {...}
        int sys_read_event(struct seccomp_data *ctx) {...}
        syscall_jmp_table[__NR_write] = sys_write_event;
        syscall_jmp_table[__NR_read] = sys_read_event;
      
        For networking the program may call into different parsers depending on
        packet format, like:
        int packet_parser(struct __sk_buff *skb)
        {
           ... parse L2, L3 here ...
           __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
           bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
           ... default: process unknown protocol ...
        }
        int parse_tcp(struct __sk_buff *skb) {...}
        int parse_udp(struct __sk_buff *skb) {...}
        ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
        ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
      
      - for TC use case, bpf_tail_call() allows to implement reclassify-like logic
      
      - bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
        are atomic, so user space can build chains of BPF programs on the fly
      
      Implementation details:
      =======================
      - high performance of bpf_tail_call() is the goal.
        It could have been implemented without JIT changes as a wrapper on top of
        BPF_PROG_RUN() macro, but with two downsides:
        . all programs would have to pay performance penalty for this feature and
          tail call itself would be slower, since mandatory stack unwind, return,
          stack allocate would be done for every tailcall.
        . tailcall would be limited to programs running preempt_disabled, since
          generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
          need to be either global per_cpu variable accessed by helper and by wrapper
          or global variable protected by locks.
      
        In this implementation x64 JIT bypasses stack unwind and jumps into the
        callee program after prologue.
      
      - bpf_prog_array_compatible() ensures that prog_type of callee and caller
        are the same and JITed/non-JITed flag is the same, since calling JITed
        program from non-JITed is invalid, since stack frames are different.
        Similarly calling kprobe type program from socket type program is invalid.
      
      - jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
        abstraction, its user space API and all of verifier logic.
        It's in the existing arraymap.c file, since several functions are
        shared with regular array map.
      Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      04fd61ab
    • Daniel Borkmann's avatar
      net: dev: reduce both ingress hook ifdefs · e7582bab
      Daniel Borkmann authored
      Reduce ifdef pollution slightly, no functional change. We can simply
      remove the extra alternative definition of handle_ing() and nf_ingress().
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e7582bab
    • Eric Dumazet's avatar
      tcp: add a force_schedule argument to sk_stream_alloc_skb() · eb934478
      Eric Dumazet authored
      In commit 8e4d980a ("tcp: fix behavior for epoll edge trigger")
      we fixed a possible hang of TCP sockets under memory pressure,
      by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule()
      if no packet is in socket write queue.
      
      It turns out there are other cases where we want to force memory
      schedule :
      
      tcp_fragment() & tso_fragment() need to split a big TSO packet into
      two smaller ones. If we block here because of TCP memory pressure,
      we can effectively block TCP socket from sending new data.
      If no further ACK is coming, this hang would be definitive, and socket
      has no chance to effectively reduce its memory usage.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      eb934478
    • Erik Kline's avatar
      neigh: Better handling of transition to NUD_PROBE state · 765c9c63
      Erik Kline authored
      [1] When entering NUD_PROBE state via neigh_update(), perhaps received
          from userspace, correctly (re)initialize the probes count to zero.
      
          This is useful for forcing revalidation of a neighbor (for example
          if the host is attempting to do DNA [IPv4 4436, IPv6 6059]).
      
      [2] Notify listeners when a neighbor goes into NUD_PROBE state.
      
          By sending notifications on entry to NUD_PROBE state listeners get
          more timely warnings of imminent connectivity issues.
      
          The current notifications on entry to NUD_STALE have somewhat
          limited usefulness: NUD_STALE is a perfectly normal state, as is
          NUD_DELAY, whereas notifications on entry to NUD_FAILURE come after
          a neighbor reachability problem has been confirmed (typically after
          three probes).
      Signed-off-by: default avatarErik Kline <ek@google.com>
      Acked-By: default avatarLorenzo Colitti <lorenzo@google.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      765c9c63
  2. 19 May, 2015 13 commits
  3. 18 May, 2015 3 commits
    • Willem de Bruijn's avatar
      selftests/net: expect headroom in psock_fanout rollover · a2ad5d2a
      Willem de Bruijn authored
      psock_fanout tests the various fanout modes. Change the test for
      rollover mode to expect early rollover due to socket pressure
      as implemented in 2ccdbaa6 ("packet: rollover lock contention
      avoidance").
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a2ad5d2a
    • Edward Cree's avatar
      sfc: nicer log message on Siena SR-IOV probe fail · 7e91a210
      Edward Cree authored
      We expect that MC_CMD_SRIOV will fail if the card has no VFs configured.
      So output a readable message instead of a cryptic MCDI error.
      Signed-off-by: default avatarEdward Cree <ecree@solarflare.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7e91a210
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 0bc4c070
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter updates for net-next
      
      The following patchset contains Netfilter updates for net-next. Briefly
      speaking, cleanups and minor fixes for ipset from Jozsef Kadlecsik and
      Serget Popovich, more incremental updates to make br_netfilter a better
      place from Florian Westphal, ARP support to the x_tables mark match /
      target from and context Zhang Chunyu and the addition of context to know
      that the x_tables runs through nft_compat. More specifically, they are:
      
      1) Fix sparse warning in ipset/ip_set_hash_ipmark.c when fetching the
         IPSET_ATTR_MARK netlink attribute, from Jozsef Kadlecsik.
      
      2) Rename STREQ macro to STRNCMP in ipset, also from Jozsef.
      
      3) Use skb->network_header to calculate the transport offset in
         ip_set_get_ip{4,6}_port(). From Alexander Drozdov.
      
      4) Reduce memory consumption per element due to size miscalculation,
         this patch and follow up patches from Sergey Popovich.
      
      5) Expand nomatch field from 1 bit to 8 bits to allow to simplify
         mtype_data_reset_flags(), also from Sergey.
      
      6) Small clean for ipset macro trickery.
      
      7) Fix error reporting when both ip_set_get_hostipaddr4() and
         ip_set_get_extensions() from per-set uadt functions.
      
      8) Simplify IPSET_ATTR_PORT netlink attribute validation.
      
      9) Introduce HOST_MASK instead of hardcoded 32 in ipset.
      
      10) Return true/false instead of 0/1 in functions that return boolean
          in the ipset code.
      
      11) Validate maximum length of the IPSET_ATTR_COMMENT netlink attribute.
      
      12) Allow to dereference from ext_*() ipset macros.
      
      13) Get rid of incorrect definitions of HKEY_DATALEN.
      
      14) Include linux/netfilter/ipset/ip_set.h in the x_tables set match.
      
      15) Reduce nf_bridge_info size in br_netfilter, from Florian Westphal.
      
      16) Release nf_bridge_info after POSTROUTING since this is only needed
          from the physdev match, also from Florian.
      
      17) Reduce size of ipset code by deinlining ip_set_put_extensions(),
          from Denys Vlasenko.
      
      18) Oneliner to add ARP support to the x_tables mark match/target, from
          Zhang Chunyu.
      
      19) Add context to know if the x_tables extension runs from nft_compat,
          to address minor problems with three existing extensions.
      
      20) Correct return value in several seqfile *_show() functions in the
          netfilter tree, from Joe Perches.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0bc4c070