1. 21 Dec, 2019 7 commits
    • Chan Shu Tak, Alex's avatar
      llc2: Fix return statement of llc_stat_ev_rx_null_dsap_xid_c (and _test_c) · af1c0e4e
      Chan Shu Tak, Alex authored
      When a frame with NULL DSAP is received, llc_station_rcv is called.
      In turn, llc_stat_ev_rx_null_dsap_xid_c is called to check if it is a NULL
      XID frame. The return statement of llc_stat_ev_rx_null_dsap_xid_c returns 1
      when the incoming frame is not a NULL XID frame and 0 otherwise. Hence, a
      NULL XID response is returned unexpectedly, e.g. when the incoming frame is
      a NULL TEST command.
      
      To fix the error, simply remove the conditional operator.
      
      A similar error in llc_stat_ev_rx_null_dsap_test_c is also fixed.
      Signed-off-by: default avatarChan Shu Tak, Alex <alexchan@task.com.hk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      af1c0e4e
    • Jiangfeng Xiao's avatar
      net: hisilicon: Fix a BUG trigered by wrong bytes_compl · 90b3b339
      Jiangfeng Xiao authored
      When doing stress test, we get the following trace:
      kernel BUG at lib/dynamic_queue_limits.c:26!
      Internal error: Oops - BUG: 0 [#1] SMP ARM
      Modules linked in: hip04_eth
      CPU: 0 PID: 2003 Comm: tDblStackPcap0 Tainted: G           O L  4.4.197 #1
      Hardware name: Hisilicon A15
      task: c3637668 task.stack: de3bc000
      PC is at dql_completed+0x18/0x154
      LR is at hip04_tx_reclaim+0x110/0x174 [hip04_eth]
      pc : [<c041abfc>]    lr : [<bf0003a8>]    psr: 800f0313
      sp : de3bdc2c  ip : 00000000  fp : c020fb10
      r10: 00000000  r9 : c39b4224  r8 : 00000001
      r7 : 00000046  r6 : c39b4000  r5 : 0078f392  r4 : 0078f392
      r3 : 00000047  r2 : 00000000  r1 : 00000046  r0 : df5d5c80
      Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
      Control: 32c5387d  Table: 1e189b80  DAC: 55555555
      Process tDblStackPcap0 (pid: 2003, stack limit = 0xde3bc190)
      Stack: (0xde3bdc2c to 0xde3be000)
      [<c041abfc>] (dql_completed) from [<bf0003a8>] (hip04_tx_reclaim+0x110/0x174 [hip04_eth])
      [<bf0003a8>] (hip04_tx_reclaim [hip04_eth]) from [<bf0012c0>] (hip04_rx_poll+0x20/0x388 [hip04_eth])
      [<bf0012c0>] (hip04_rx_poll [hip04_eth]) from [<c04c8d9c>] (net_rx_action+0x120/0x374)
      [<c04c8d9c>] (net_rx_action) from [<c021eaf4>] (__do_softirq+0x218/0x318)
      [<c021eaf4>] (__do_softirq) from [<c021eea0>] (irq_exit+0x88/0xac)
      [<c021eea0>] (irq_exit) from [<c0240130>] (msa_irq_exit+0x11c/0x1d4)
      [<c0240130>] (msa_irq_exit) from [<c0267ba8>] (__handle_domain_irq+0x110/0x148)
      [<c0267ba8>] (__handle_domain_irq) from [<c0201588>] (gic_handle_irq+0xd4/0x118)
      [<c0201588>] (gic_handle_irq) from [<c0558360>] (__irq_svc+0x40/0x58)
      Exception stack(0xde3bdde0 to 0xde3bde28)
      dde0: 00000000 00008001 c3637668 00000000 00000000 a00f0213 dd3627a0 c0af6380
      de00: c086d380 a00f0213 c0a22a50 de3bde6c 00000002 de3bde30 c0558138 c055813c
      de20: 600f0213 ffffffff
      [<c0558360>] (__irq_svc) from [<c055813c>] (_raw_spin_unlock_irqrestore+0x44/0x54)
      Kernel panic - not syncing: Fatal exception in interrupt
      
      Pre-modification code:
      int hip04_mac_start_xmit(struct sk_buff *skb, struct net_device *ndev)
      {
      [...]
      [1]	priv->tx_head = TX_NEXT(tx_head);
      [2]	count++;
      [3]	netdev_sent_queue(ndev, skb->len);
      [...]
      }
      An rx interrupt occurs if hip04_mac_start_xmit just executes to the line 2,
      tx_head has been updated, but corresponding 'skb->len' has not been
      added to dql_queue.
      
      And then
      hip04_mac_interrupt->__napi_schedule->hip04_rx_poll->hip04_tx_reclaim
      
      In hip04_tx_reclaim, because tx_head has been updated,
      bytes_compl will plus an additional "skb-> len"
      which has not been added to dql_queue. And then
      trigger the BUG_ON(bytes_compl > num_queued - dql->num_completed).
      
      To solve the problem described above, we put
      "netdev_sent_queue(ndev, skb->len);"
      before
      "priv->tx_head = TX_NEXT(tx_head);"
      
      Fixes: a41ea46a ("net: hisilicon: new hip04 ethernet driver")
      Signed-off-by: default avatarJiangfeng Xiao <xiaojiangfeng@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      90b3b339
    • Michael Grzeschik's avatar
      net: dsa: ksz: use common define for tag len · 4249c507
      Michael Grzeschik authored
      Remove special taglen define KSZ8795_INGRESS_TAG_LEN
      and use generic KSZ_INGRESS_TAG_LEN instead.
      Signed-off-by: default avatarMichael Grzeschik <m.grzeschik@pengutronix.de>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4249c507
    • David S. Miller's avatar
      Merge branch 's390-fixes' · f80742b9
      David S. Miller authored
      Julian Wiedmann says:
      
      ====================
      s390/qeth: fixes 2019-12-18
      
      please apply the following patch series to your net tree.
      This brings two fixes for initialization / teardown issues, and one
      ENOTSUPP cleanup.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f80742b9
    • Julian Wiedmann's avatar
      s390/qeth: don't return -ENOTSUPP to userspace · 39bdbf3e
      Julian Wiedmann authored
      ENOTSUPP is not uapi, use EOPNOTSUPP instead.
      
      Fixes: d66cb37e ("qeth: Add new priority queueing options")
      Signed-off-by: default avatarJulian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      39bdbf3e
    • Julian Wiedmann's avatar
      s390/qeth: fix promiscuous mode after reset · 0f399305
      Julian Wiedmann authored
      When managing the promiscuous mode during an RX modeset, qeth caches the
      current HW state to avoid repeated programming of the same state on each
      modeset.
      
      But while tearing down a device, we forget to clear the cached state. So
      when the device is later set online again, the initial RX modeset
      doesn't program the promiscuous mode since we believe it is already
      enabled.
      Fix this by clearing the cached state in the tear-down path.
      
      Note that for the SBP variant of promiscuous mode, this accidentally
      works right now because we unconditionally restore the SBP role while
      re-initializing.
      
      Fixes: 4a71df50 ("qeth: new qeth device driver")
      Signed-off-by: default avatarJulian Wiedmann <jwi@linux.ibm.com>
      Reviewed-by: default avatarAlexandra Winter <wintera@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0f399305
    • Julian Wiedmann's avatar
      s390/qeth: handle error due to unsupported transport mode · 2e3d7fa5
      Julian Wiedmann authored
      Along with z/VM NICs, there's additional device types that only support
      a specific transport mode (eg. external-bridged IQD).
      Identify the corresponding error code, and raise a fitting error message
      so that the user knows to adjust their device configuration.
      
      On top of that also fix the subsequent error path, so that the rejected
      cmd doesn't need to wait for a timeout but gets cancelled straight away.
      
      Fixes: 4a71df50 ("qeth: new qeth device driver")
      Signed-off-by: default avatarJulian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2e3d7fa5
  2. 20 Dec, 2019 8 commits
    • Rahul Lakkireddy's avatar
      cxgb4: fix refcount init for TC-MQPRIO offload · ea8608d4
      Rahul Lakkireddy authored
      Properly initialize refcount to 1 when hardware queue arrays for
      TC-MQPRIO offload have been freshly allocated. Otherwise, following
      warning is observed. Also fix up error path to only free hardware
      queue arrays when refcount reaches 0.
      
      [  130.075342] ------------[ cut here ]------------
      [  130.075343] refcount_t: addition on 0; use-after-free.
      [  130.075355] WARNING: CPU: 0 PID: 10870 at lib/refcount.c:25
      refcount_warn_saturate+0xe1/0x100
      [  130.075356] Modules linked in: sch_mqprio iptable_nat ib_iser
      libiscsi scsi_transport_iscsi ib_ipoib rdma_ucm ib_umad iw_cxgb4 libcxgb
      ib_uverbs x86_pkg_temp_thermal cxgb4 igb
      [  130.075361] CPU: 0 PID: 10870 Comm: tc Kdump: loaded Not tainted
      5.5.0-rc1+ #11
      [  130.075362] Hardware name: Supermicro
      X9SRE/X9SRE-3F/X9SRi/X9SRi-3F/X9SRE/X9SRE-3F/X9SRi/X9SRi-3F, BIOS 3.2
      01/16/2015
      [  130.075363] RIP: 0010:refcount_warn_saturate+0xe1/0x100
      [  130.075364] Code: e8 14 41 c1 ff 0f 0b c3 80 3d 44 f4 10 01 00 0f 85
      63 ff ff ff 48 c7 c7 38 9f 83 8c 31 c0 c6 05 2e f4 10 01 01 e8 ef 40 c1
      ff <0f> 0b c3 48 c7 c7 10 9f 83 8c 31 c0 c6 05 17 f4 10 01 01 e8 d7 40
      [  130.075365] RSP: 0018:ffffa48d00c0b768 EFLAGS: 00010286
      [  130.075366] RAX: 0000000000000000 RBX: 0000000000000008 RCX:
      0000000000000001
      [  130.075366] RDX: 0000000000000001 RSI: 0000000000000096 RDI:
      ffff8a2e9fa187d0
      [  130.075367] RBP: ffff8a2e93890000 R08: 0000000000000398 R09:
      000000000000003c
      [  130.075367] R10: 00000000000142a0 R11: 0000000000000397 R12:
      ffffa48d00c0b848
      [  130.075368] R13: ffff8a2e94746498 R14: ffff8a2e966f7000 R15:
      0000000000000031
      [  130.075368] FS:  00007f689015f840(0000) GS:ffff8a2e9fa00000(0000)
      knlGS:0000000000000000
      [  130.075369] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  130.075369] CR2: 00000000006762a0 CR3: 00000007cf164005 CR4:
      00000000001606f0
      [  130.075370] Call Trace:
      [  130.075377]  cxgb4_setup_tc_mqprio+0xbee/0xc30 [cxgb4]
      [  130.075382]  ? cxgb4_ethofld_restart+0x50/0x50 [cxgb4]
      [  130.075384]  ? pfifo_fast_init+0x7e/0xf0
      [  130.075386]  mqprio_init+0x5f4/0x630 [sch_mqprio]
      [  130.075389]  qdisc_create+0x1bf/0x4a0
      [  130.075390]  tc_modify_qdisc+0x1ff/0x770
      [  130.075392]  rtnetlink_rcv_msg+0x28b/0x350
      [  130.075394]  ? rtnl_calcit.isra.32+0x110/0x110
      [  130.075395]  netlink_rcv_skb+0xc6/0x100
      [  130.075396]  netlink_unicast+0x1db/0x330
      [  130.075397]  netlink_sendmsg+0x2f5/0x460
      [  130.075399]  ? _copy_from_user+0x2e/0x60
      [  130.075400]  sock_sendmsg+0x59/0x70
      [  130.075401]  ____sys_sendmsg+0x1f0/0x230
      [  130.075402]  ? copy_msghdr_from_user+0xd7/0x140
      [  130.075403]  ___sys_sendmsg+0x77/0xb0
      [  130.075404]  ? ___sys_recvmsg+0x84/0xb0
      [  130.075406]  ? __handle_mm_fault+0x377/0xaf0
      [  130.075407]  __sys_sendmsg+0x53/0xa0
      [  130.075409]  do_syscall_64+0x44/0x130
      [  130.075412]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [  130.075413] RIP: 0033:0x7f688f13af10
      [  130.075414] Code: c3 48 8b 05 82 6f 2c 00 f7 db 64 89 18 48 83 cb ff
      eb dd 0f 1f 80 00 00 00 00 83 3d 8d d0 2c 00 00 75 10 b8 2e 00 00 00 0f
      05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ae cc 00 00 48 89 04 24
      [  130.075414] RSP: 002b:00007ffe6c7d9988 EFLAGS: 00000246 ORIG_RAX:
      000000000000002e
      [  130.075415] RAX: ffffffffffffffda RBX: 00000000006703a0 RCX:
      00007f688f13af10
      [  130.075415] RDX: 0000000000000000 RSI: 00007ffe6c7d99f0 RDI:
      0000000000000003
      [  130.075416] RBP: 000000005df38312 R08: 0000000000000002 R09:
      0000000000008000
      [  130.075416] R10: 00007ffe6c7d93e0 R11: 0000000000000246 R12:
      0000000000000000
      [  130.075417] R13: 00007ffe6c7e9c50 R14: 0000000000000001 R15:
      000000000067c600
      [  130.075418] ---[ end trace 8fbb3bf36a8671db ]---
      
      v2:
      - Move the refcount_set() closer to where the hardware queue arrays
        are being allocated.
      - Fix up error path to only free hardware queue arrays when refcount
        reaches 0.
      
      Fixes: 2d0cb84d ("cxgb4: add ETHOFLD hardware queue support")
      Signed-off-by: default avatarRahul Lakkireddy <rahul.lakkireddy@chelsio.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ea8608d4
    • David S. Miller's avatar
      Merge branch 'cls_u32-fix-refcount-leak' · 307201a3
      David S. Miller authored
      Davide Caratti says:
      
      ====================
      net/sched: cls_u32: fix refcount leak
      
      a refcount leak in the error path of u32_change() has been recently
      introduced. It can be observed with the following commands:
      
        [root@f31 ~]# tc filter replace dev eth0 ingress protocol ip prio 97 \
        > u32 match ip src 127.0.0.1/32 indev notexist20 flowid 1:1 action drop
        RTNETLINK answers: Invalid argument
        We have an error talking to the kernel
        [root@f31 ~]# tc filter replace dev eth0 ingress protocol ip prio 98 \
        > handle 42:42 u32 divisor 256
        Error: cls_u32: Divisor can only be used on a hash table.
        We have an error talking to the kernel
        [root@f31 ~]# tc filter replace dev eth0 ingress protocol ip prio 99 \
        > u32 ht 47:47
        Error: cls_u32: Specified hash table not found.
        We have an error talking to the kernel
      
      they all legitimately return -EINVAL; however, they leave semi-configured
      filters at eth0 tc ingress:
      
       [root@f31 ~]# tc filter show dev eth0 ingress
       filter protocol ip pref 97 u32 chain 0
       filter protocol ip pref 97 u32 chain 0 fh 800: ht divisor 1
       filter protocol ip pref 98 u32 chain 0
       filter protocol ip pref 98 u32 chain 0 fh 801: ht divisor 1
       filter protocol ip pref 99 u32 chain 0
       filter protocol ip pref 99 u32 chain 0 fh 802: ht divisor 1
      
      With older kernels, filters were unconditionally considered empty (and
      thus de-refcounted) on the error path of ->change().
      After commit 8b64678e ("net: sched: refactor tp insert/delete for
      concurrent execution"), filters were considered empty when the walk()
      function didn't set 'walker.stop' to 1.
      Finally, with commit 6676d5e4 ("net: sched: set dedicated tcf_walker
      flag when tp is empty"), tc filters are considered empty unless the walker
      function is called with a non-NULL handle. This last change doesn't fit
      cls_u32 design, because at least the "root hnode" is (almost) always
      non-NULL, as it's allocated in u32_init().
      
      - patch 1/2 is a proposal to restore the original kernel behavior, where
        no filter was installed in the error path of u32_change().
      - patch 2/2 adds tdc selftests that can be ued to verify the correct
        behavior of u32 in the error path of ->change().
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      307201a3
    • Davide Caratti's avatar
      tc-testing: initial tdc selftests for cls_u32 · 6649a3f3
      Davide Caratti authored
      - move test "e9a3 - Add u32 with source match" to u32.json, and change the
        match pattern to catch all hnodes
      - add testcases for relevant error paths of cls_u32 module
      Signed-off-by: default avatarDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6649a3f3
    • Davide Caratti's avatar
      net/sched: cls_u32: fix refcount leak in the error path of u32_change() · 275c44aa
      Davide Caratti authored
      when users replace cls_u32 filters with new ones having wrong parameters,
      so that u32_change() fails to validate them, the kernel doesn't roll-back
      correctly, and leaves semi-configured rules.
      
      Fix this in u32_walk(), avoiding a call to the walker function on filters
      that don't have a match rule connected. The side effect is, these "empty"
      filters are not even dumped when present; but that shouldn't be a problem
      as long as we are restoring the original behaviour, where semi-configured
      filters were not even added in the error path of u32_change().
      
      Fixes: 6676d5e4 ("net: sched: set dedicated tcf_walker flag when tp is empty")
      Signed-off-by: default avatarDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      275c44aa
    • Aditya Pakki's avatar
      nfc: s3fwrn5: replace the assertion with a WARN_ON · 615f22f5
      Aditya Pakki authored
      In s3fwrn5_fw_recv_frame, if fw_info->rsp is not empty, the
      current code causes a crash via BUG_ON. However, s3fwrn5_fw_send_msg
      does not crash in such a scenario. The patch replaces the BUG_ON
      by returning the error to the callers and frees up skb.
      Signed-off-by: default avatarAditya Pakki <pakki001@umn.edu>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      615f22f5
    • David S. Miller's avatar
      Merge branch 'macb-fix-probing-of-PHY-not-described-in-the-dt' · a019739c
      David S. Miller authored
      Antoine Tenart says:
      
      ====================
      net: macb: fix probing of PHY not described in the dt
      
      The macb Ethernet driver supports various ways of referencing its
      network PHY. When a device tree is used the PHY can be referenced with
      a phy-handle or, if connected to its internal MDIO bus, described in
      a child node. Some platforms omitted the PHY description while
      connecting the PHY to the internal MDIO bus and in such cases the MDIO
      bus has to be scanned "manually" by the macb driver.
      
      Prior to the phylink conversion the driver registered the MDIO bus with
      of_mdiobus_register and then in case the PHY couldn't be retrieved
      using dt or using phy_find_first (because registering an MDIO bus with
      of_mdiobus_register masks all PHYs) the macb driver was "manually"
      scanning the MDIO bus (like mdiobus_register does). The phylink
      conversion did break this particular case but reimplementing the manual
      scan of the bus in the macb driver wouldn't be very clean. The solution
      seems to be registering the MDIO bus based on if the PHYs are described
      in the device tree or not.
      
      There are multiple ways to do this, none is perfect. I chose to check if
      any of the child nodes of the macb node was a network PHY and based on
      this to register the MDIO bus with the of_ helper or not. The drawback
      is boards referencing the PHY through phy-handle, would scan the entire
      MDIO bus of the macb at boot time (as the MDIO bus would be registered
      with mdiobus_register). For this solution to work properly
      of_mdiobus_child_is_phy has to be exported, which means the patch doing
      so has to be backported to -stable as well.
      
      Another possible solution could have been to simply check if the macb
      node has a child node by counting its sub-nodes. This isn't techically
      perfect, as there could be other sub-nodes (in practice this should be
      fine, fixed-link being taken care of in the driver). We could also
      simply s/of_mdiobus_register/mdiobus_register/ but that could break
      boards using the PHY description in child node as a selector (which
      really would be not a proper way to do this...).
      
      The real issue here being having PHYs not described in the dt but we
      have dt backward compatibility, so we have to live with that.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a019739c
    • Antoine Tenart's avatar
      net: macb: fix probing of PHY not described in the dt · ef8a2e27
      Antoine Tenart authored
      This patch fixes the case where the PHY isn't described in the device
      tree. This is due to the way the MDIO bus is registered in the driver:
      whether the PHY is described in the device tree or not, the bus is
      registered through of_mdiobus_register. The function masks all the PHYs
      and only allow probing the ones described in the device tree. Prior to
      the Phylink conversion this was also done but later on in the driver
      the MDIO bus was manually scanned to circumvent the fact that the PHY
      wasn't described.
      
      This patch fixes it in a proper way, by registering the MDIO bus based
      on if the PHY attached to a given interface is described in the device
      tree or not.
      
      Fixes: 7897b071 ("net: macb: convert to phylink")
      Signed-off-by: default avatarAntoine Tenart <antoine.tenart@bootlin.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ef8a2e27
    • Antoine Tenart's avatar
      of: mdio: export of_mdiobus_child_is_phy · 0aa4d016
      Antoine Tenart authored
      This patch exports of_mdiobus_child_is_phy, allowing to check if a child
      node is a network PHY.
      Signed-off-by: default avatarAntoine Tenart <antoine.tenart@bootlin.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0aa4d016
  3. 19 Dec, 2019 9 commits
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 0fd26005
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2019-12-19
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 10 non-merge commits during the last 8 day(s) which contain
      a total of 21 files changed, 269 insertions(+), 108 deletions(-).
      
      The main changes are:
      
      1) Fix lack of synchronization between xsk wakeup and destroying resources
         used by xsk wakeup, from Maxim Mikityanskiy.
      
      2) Fix pruning with tail call patching, untrack programs in case of verifier
         error and fix a cgroup local storage tracking bug, from Daniel Borkmann.
      
      3) Fix clearing skb->tstamp in bpf_redirect() when going from ingress to
         egress which otherwise cause issues e.g. on fq qdisc, from Lorenz Bauer.
      
      4) Fix compile warning of unused proc_dointvec_minmax_bpf_restricted() when
         only cBPF is present, from Alexander Lobakin.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0fd26005
    • Daniel Borkmann's avatar
      bpf: Add further test_verifier cases for record_func_key · 3123d801
      Daniel Borkmann authored
      Expand dummy prog generation such that we can easily check on return
      codes and add few more test cases to make sure we keep on tracking
      pruning behavior.
      
        # ./test_verifier
        [...]
        #1066/p XDP pkt read, pkt_data <= pkt_meta', bad access 1 OK
        #1067/p XDP pkt read, pkt_data <= pkt_meta', bad access 2 OK
        Summary: 1580 PASSED, 0 SKIPPED, 0 FAILED
      
      Also verified that JIT dump of added test cases looks good.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/df7200b6021444fd369376d227de917357285b65.1576789878.git.daniel@iogearbox.net
      3123d801
    • Daniel Borkmann's avatar
      bpf: Fix record_func_key to perform backtracking on r3 · cc52d914
      Daniel Borkmann authored
      While testing Cilium with /unreleased/ Linus' tree under BPF-based NodePort
      implementation, I noticed a strange BPF SNAT engine behavior from time to
      time. In some cases it would do the correct SNAT/DNAT service translation,
      but at a random point in time it would just stop and perform an unexpected
      translation after SYN, SYN/ACK and stack would send a RST back. While initially
      assuming that there is some sort of a race condition in BPF code, adding
      trace_printk()s for debugging purposes at some point seemed to have resolved
      the issue auto-magically.
      
      Digging deeper on this Heisenbug and reducing the trace_printk() calls to
      an absolute minimum, it turns out that a single call would suffice to
      trigger / not trigger the seen RST issue, even though the logic of the
      program itself remains unchanged. Turns out the single call changed verifier
      pruning behavior to get everything to work. Reconstructing a minimal test
      case, the incorrect JIT dump looked as follows:
      
        # bpftool p d j i 11346
        0xffffffffc0cba96c:
        [...]
          21:   movzbq 0x30(%rdi),%rax
          26:   cmp    $0xd,%rax
          2a:   je     0x000000000000003a
          2c:   xor    %edx,%edx
          2e:   movabs $0xffff89cc74e85800,%rsi
          38:   jmp    0x0000000000000049
          3a:   mov    $0x2,%edx
          3f:   movabs $0xffff89cc74e85800,%rsi
          49:   mov    -0x224(%rbp),%eax
          4f:   cmp    $0x20,%eax
          52:   ja     0x0000000000000062
          54:   add    $0x1,%eax
          57:   mov    %eax,-0x224(%rbp)
          5d:   jmpq   0xffffffffffff6911
          62:   mov    $0x1,%eax
        [...]
      
      Hence, unexpectedly, JIT emitted a direct jump even though retpoline based
      one would have been needed since in line 2c and 3a we have different slot
      keys in BPF reg r3. Verifier log of the test case reveals what happened:
      
        0: (b7) r0 = 14
        1: (73) *(u8 *)(r1 +48) = r0
        2: (71) r0 = *(u8 *)(r1 +48)
        3: (15) if r0 == 0xd goto pc+4
         R0_w=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=ctx(id=0,off=0,imm=0) R10=fp0
        4: (b7) r3 = 0
        5: (18) r2 = 0xffff89cc74d54a00
        7: (05) goto pc+3
        11: (85) call bpf_tail_call#12
        12: (b7) r0 = 1
        13: (95) exit
        from 3 to 8: R0_w=inv13 R1=ctx(id=0,off=0,imm=0) R10=fp0
        8: (b7) r3 = 2
        9: (18) r2 = 0xffff89cc74d54a00
        11: safe
        processed 13 insns (limit 1000000) [...]
      
      Second branch is pruned by verifier since considered safe, but issue is that
      record_func_key() couldn't have seen the index in line 3a and therefore
      decided that emitting a direct jump at this location was okay.
      
      Fix this by reusing our backtracking logic for precise scalar verification
      in order to prevent pruning on the slot key. This means verifier will track
      content of r3 all the way backwards and only prune if both scalars were
      unknown in state equivalence check and therefore poisoned in the first place
      in record_func_key(). The range is [x,x] in record_func_key() case since
      the slot always would have to be constant immediate. Correct verification
      after fix:
      
        0: (b7) r0 = 14
        1: (73) *(u8 *)(r1 +48) = r0
        2: (71) r0 = *(u8 *)(r1 +48)
        3: (15) if r0 == 0xd goto pc+4
         R0_w=invP(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=ctx(id=0,off=0,imm=0) R10=fp0
        4: (b7) r3 = 0
        5: (18) r2 = 0x0
        7: (05) goto pc+3
        11: (85) call bpf_tail_call#12
        12: (b7) r0 = 1
        13: (95) exit
        from 3 to 8: R0_w=invP13 R1=ctx(id=0,off=0,imm=0) R10=fp0
        8: (b7) r3 = 2
        9: (18) r2 = 0x0
        11: (85) call bpf_tail_call#12
        12: (b7) r0 = 1
        13: (95) exit
        processed 15 insns (limit 1000000) [...]
      
      And correct corresponding JIT dump:
      
        # bpftool p d j i 11
        0xffffffffc0dc34c4:
        [...]
          21:	  movzbq 0x30(%rdi),%rax
          26:	  cmp    $0xd,%rax
          2a:	  je     0x000000000000003a
          2c:	  xor    %edx,%edx
          2e:	  movabs $0xffff9928b4c02200,%rsi
          38:	  jmp    0x0000000000000049
          3a:	  mov    $0x2,%edx
          3f:	  movabs $0xffff9928b4c02200,%rsi
          49:	  cmp    $0x4,%rdx
          4d:	  jae    0x0000000000000093
          4f:	  and    $0x3,%edx
          52:	  mov    %edx,%edx
          54:	  cmp    %edx,0x24(%rsi)
          57:	  jbe    0x0000000000000093
          59:	  mov    -0x224(%rbp),%eax
          5f:	  cmp    $0x20,%eax
          62:	  ja     0x0000000000000093
          64:	  add    $0x1,%eax
          67:	  mov    %eax,-0x224(%rbp)
          6d:	  mov    0x110(%rsi,%rdx,8),%rax
          75:	  test   %rax,%rax
          78:	  je     0x0000000000000093
          7a:	  mov    0x30(%rax),%rax
          7e:	  add    $0x19,%rax
          82:   callq  0x000000000000008e
          87:   pause
          89:   lfence
          8c:   jmp    0x0000000000000087
          8e:   mov    %rax,(%rsp)
          92:   retq
          93:   mov    $0x1,%eax
        [...]
      
      Also explicitly adding explicit env->allow_ptr_leaks to fixup_bpf_calls() since
      backtracking is enabled under former (direct jumps as well, but use different
      test). In case of only tracking different map pointers as in c93552c4 ("bpf:
      properly enforce index mask to prevent out-of-bounds speculation"), pruning
      cannot make such short-cuts, neither if there are paths with scalar and non-scalar
      types as r3. mark_chain_precision() is only needed after we know that
      register_is_const(). If it was not the case, we already poison the key on first
      path and non-const key in later paths are not matching the scalar range in regsafe()
      either. Cilium NodePort testing passes fine as well now. Note, released kernels
      not affected.
      
      Fixes: d2e4c1e6 ("bpf: Constant map key tracking for prog array pokes")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/ac43ffdeb7386c5bd688761ed266f3722bb39823.1576789878.git.daniel@iogearbox.net
      cc52d914
    • Alexander Lobakin's avatar
      net, sysctl: Fix compiler warning when only cBPF is present · 1148f9ad
      Alexander Lobakin authored
      proc_dointvec_minmax_bpf_restricted() has been firstly introduced
      in commit 2e4a3098 ("bpf: restrict access to core bpf sysctls")
      under CONFIG_HAVE_EBPF_JIT. Then, this ifdef has been removed in
      ede95a63 ("bpf: add bpf_jit_limit knob to restrict unpriv
      allocations"), because a new sysctl, bpf_jit_limit, made use of it.
      Finally, this parameter has become long instead of integer with
      fdadd049 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K")
      and thus, a new proc_dolongvec_minmax_bpf_restricted() has been
      added.
      
      With this last change, we got back to that
      proc_dointvec_minmax_bpf_restricted() is used only under
      CONFIG_HAVE_EBPF_JIT, but the corresponding ifdef has not been
      brought back.
      
      So, in configurations like CONFIG_BPF_JIT=y && CONFIG_HAVE_EBPF_JIT=n
      since v4.20 we have:
      
        CC      net/core/sysctl_net_core.o
      net/core/sysctl_net_core.c:292:1: warning: ‘proc_dointvec_minmax_bpf_restricted’ defined but not used [-Wunused-function]
        292 | proc_dointvec_minmax_bpf_restricted(struct ctl_table *table, int write,
            | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Suppress this by guarding it with CONFIG_HAVE_EBPF_JIT again.
      
      Fixes: fdadd049 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K")
      Signed-off-by: default avatarAlexander Lobakin <alobakin@dlink.ru>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20191218091821.7080-1-alobakin@dlink.ru
      1148f9ad
    • Daniel Borkmann's avatar
      Merge branch 'bpf-fix-xsk-wakeup' · ca8d0fa7
      Daniel Borkmann authored
      Maxim Mikityanskiy says:
      
      ====================
      This series addresses the issue described in the commit message of the
      first patch: lack of synchronization between XSK wakeup and destroying
      the resources used by XSK wakeup. The idea is similar to napi_synchronize.
      The series contains fixes for the drivers that implement XSK.
      
      v2 incorporates changes suggested by Björn:
      
      1. Call synchronize_rcu in Intel drivers only if the XDP program is
         being unloaded.
      2. Don't forget rcu_read_lock when wakeup is called from xsk_poll.
      3. Use xs->zc as the condition to call ndo_xsk_wakeup.
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      ca8d0fa7
    • Maxim Mikityanskiy's avatar
      net/ixgbe: Fix concurrency issues between config flow and XSK · c0fdccfd
      Maxim Mikityanskiy authored
      Use synchronize_rcu to wait until the XSK wakeup function finishes
      before destroying the resources it uses:
      
      1. ixgbe_down already calls synchronize_rcu after setting __IXGBE_DOWN.
      
      2. After switching the XDP program, call synchronize_rcu to let
      ixgbe_xsk_wakeup exit before the XDP program is freed.
      
      3. Changing the number of channels brings the interface down.
      
      4. Disabling UMEM sets __IXGBE_TX_DISABLED before closing hardware
      resources and resetting xsk_umem. Check that bit in ixgbe_xsk_wakeup to
      avoid using the XDP ring when it's already destroyed. synchronize_rcu is
      called from ixgbe_txrx_ring_disable.
      Signed-off-by: default avatarMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20191217162023.16011-5-maximmi@mellanox.com
      c0fdccfd
    • Maxim Mikityanskiy's avatar
      net/i40e: Fix concurrency issues between config flow and XSK · b3873a5b
      Maxim Mikityanskiy authored
      Use synchronize_rcu to wait until the XSK wakeup function finishes
      before destroying the resources it uses:
      
      1. i40e_down already calls synchronize_rcu. On i40e_down either
      __I40E_VSI_DOWN or __I40E_CONFIG_BUSY is set. Check the latter in
      i40e_xsk_wakeup (the former is already checked there).
      
      2. After switching the XDP program, call synchronize_rcu to let
      i40e_xsk_wakeup exit before the XDP program is freed.
      
      3. Changing the number of channels brings the interface down (see
      i40e_prep_for_reset and i40e_pf_quiesce_all_vsi).
      
      4. Disabling UMEM sets __I40E_CONFIG_BUSY, too.
      Signed-off-by: default avatarMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20191217162023.16011-4-maximmi@mellanox.com
      b3873a5b
    • Maxim Mikityanskiy's avatar
      net/mlx5e: Fix concurrency issues between config flow and XSK · 9cf88808
      Maxim Mikityanskiy authored
      After disabling resources necessary for XSK (the XDP program, channels,
      XSK queues), use synchronize_rcu to wait until the XSK wakeup function
      finishes, before freeing the resources.
      
      Suspend XSK wakeups during switching channels. If the XDP program is
      being removed, synchronize_rcu before closing the old channels to allow
      XSK wakeup to complete.
      Signed-off-by: default avatarMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20191217162023.16011-3-maximmi@mellanox.com
      9cf88808
    • Maxim Mikityanskiy's avatar
      xsk: Add rcu_read_lock around the XSK wakeup · 06870682
      Maxim Mikityanskiy authored
      The XSK wakeup callback in drivers makes some sanity checks before
      triggering NAPI. However, some configuration changes may occur during
      this function that affect the result of those checks. For example, the
      interface can go down, and all the resources will be destroyed after the
      checks in the wakeup function, but before it attempts to use these
      resources. Wrap this callback in rcu_read_lock to allow driver to
      synchronize_rcu before actually destroying the resources.
      
      xsk_wakeup is a new function that encapsulates calling ndo_xsk_wakeup
      wrapped into the RCU lock. After this commit, xsk_poll starts using
      xsk_wakeup and checks xs->zc instead of ndo_xsk_wakeup != NULL to decide
      ndo_xsk_wakeup should be called. It also fixes a bug introduced with the
      need_wakeup feature: a non-zero-copy socket may be used with a driver
      supporting zero-copy, and in this case ndo_xsk_wakeup should not be
      called, so the xs->zc check is the correct one.
      
      Fixes: 77cd0d7b ("xsk: add support for need_wakeup flag in AF_XDP rings")
      Signed-off-by: default avatarMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20191217162023.16011-2-maximmi@mellanox.com
      06870682
  4. 18 Dec, 2019 16 commits