1. 05 Oct, 2020 1 commit
    • rxrpc: Fix rxkad token xdr encoding · 56305118
      Marc Dionne authored
      The session key should be encoded with just the 8 data bytes and
      no length; ENCODE_DATA precedes it with a 4 byte length, which
      confuses some existing tools that try to parse this format.
      
      Add an ENCODE_BYTES macro that does not include a length, and use
      it for the key.  Also adjust the expected length.
      
      Note that commit 774521f3 ("rxrpc: Fix an assertion in
      rxrpc_read()") had fixed a BUG by changing the length rather than
      fixing the encoding.  The original length was correct.
      
      Fixes: 99455153 ("RxRPC: Parse security index 5 keys (Kerberos 5)")
      Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      56305118
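The difference between the two encodings can be sketched in userspace C. This is only an illustration: `put_be32()`, `encode_data()` and `encode_bytes()` are hypothetical stand-ins for the kernel macros, and the 4-byte padding is an assumption based on usual XDR rules.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store a 32-bit value in network byte order. */
static uint8_t *put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
    return p + 4;
}

/* Length-prefixed form: 4-byte length, then the data padded to 4 bytes. */
static uint8_t *encode_data(uint8_t *p, const void *data, uint32_t len)
{
    uint32_t padded = (len + 3) & ~3u;

    assert(p != NULL && data != NULL);
    p = put_be32(p, len);
    memset(p, 0, padded);
    memcpy(p, data, len);
    return p + padded;
}

/* Raw form: only the data padded to 4 bytes, no length word. */
static uint8_t *encode_bytes(uint8_t *p, const void *data, uint32_t len)
{
    uint32_t padded = (len + 3) & ~3u;

    assert(p != NULL && data != NULL);
    memset(p, 0, padded);
    memcpy(p, data, len);
    return p + padded;
}
```

For an 8-byte session key the raw form occupies 8 bytes and the length-prefixed form 12, which is why the expected length has to change together with the encoding.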
  2. 04 Oct, 2020 5 commits
    • net/core: check length before updating Ethertype in skb_mpls_{push,pop} · 4296adc3
      Guillaume Nault authored
      Open vSwitch allows dropping a packet's Ethernet header, therefore
      skb_mpls_push() and skb_mpls_pop() might be called with ethernet=true
      and mac_len=0. In that case the pointer passed to skb_mod_eth_type()
      doesn't point to an Ethernet header and the new Ethertype is written at
      unexpected locations.
      
      Fix this by verifying that mac_len is big enough to contain an Ethernet
      header.
      
      Fixes: fa4e0f88 ("net/sched: fix corrupted L2 header with MPLS 'push' and 'pop' actions")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Acked-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4296adc3
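A minimal userspace sketch of the length check described in this commit. `struct pkt` and `set_eth_type()` are hypothetical stand-ins for the relevant sk_buff fields and the kernel helper; only the guard on `mac_len` is the point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN 14 /* length of an Ethernet header in bytes */

/* Hypothetical stand-in for the sk_buff fields that matter here. */
struct pkt {
    uint8_t *mac_header;
    size_t mac_len; /* 0 when the Ethernet header was dropped */
};

/*
 * Update the Ethertype only when a full Ethernet header is present.
 * Returns true if the Ethertype was written.
 */
static bool set_eth_type(struct pkt *p, bool ethernet, uint16_t proto)
{
    assert(p != NULL);
    if (!ethernet || p->mac_len < ETH_HLEN)
        return false;
    p->mac_header[12] = (uint8_t)(proto >> 8);
    p->mac_header[13] = (uint8_t)proto;
    return true;
}
```

With `mac_len == 0` the function leaves the buffer untouched instead of writing the Ethertype at an unexpected location.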
    • net: mvneta: fix double free of txq->buf · f4544e53
      Tom Rix authored
      clang static analysis reports this problem:
      
      drivers/net/ethernet/marvell/mvneta.c:3465:2: warning:
        Attempt to free released memory
              kfree(txq->buf);
              ^~~~~~~~~~~~~~~
      
      When mvneta_txq_sw_init() fails to alloc txq->tso_hdrs,
      it frees without poisoning txq->buf.  The error is caught
      in the mvneta_setup_txqs() caller which handles the error
      by cleaning up all of the txqs with a call to
      mvneta_txq_sw_deinit which also frees txq->buf.
      
      Since mvneta_txq_sw_deinit() is a general cleaner, all of the
      partial cleaning in mvneta_txq_sw_init()'s error handling
      is not needed.
      
      Fixes: 2adb719d ("net: mvneta: Implement software TSO")
      Signed-off-by: Tom Rix <trix@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f4544e53
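The cleanup pattern described above can be sketched in userspace C, with `malloc()`/`free()` standing in for the kernel allocators. `struct txq`, `txq_init()` and `txq_deinit()` are hypothetical names, and `fail_tso` only simulates the allocation failure.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two buffers owned by a TX queue. */
struct txq {
    void *buf;
    void *tso_hdrs;
};

/* General cleaner: safe on a partially initialised queue, and idempotent. */
static void txq_deinit(struct txq *q)
{
    assert(q != NULL);
    free(q->buf);
    free(q->tso_hdrs);
    q->buf = NULL;
    q->tso_hdrs = NULL;
}

/* On failure, do no partial cleanup here; the caller runs txq_deinit(). */
static int txq_init(struct txq *q, int fail_tso)
{
    assert(q != NULL);
    q->tso_hdrs = NULL;
    q->buf = malloc(64);
    if (!q->buf)
        return -1;
    q->tso_hdrs = fail_tso ? NULL : malloc(64);
    if (!q->tso_hdrs)
        return -1;
    return 0;
}
```

Because the error path frees nothing itself, the buffer is released exactly once by the general cleaner.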
    • net_sched: check error pointer in tcf_dump_walker() · 580e4273
      Cong Wang authored
      Although we take RTNL on dump path, it is possible to
      skip RTNL on insertion path. So the following race condition
      is possible:
      
      rtnl_lock()		// no rtnl lock
      			mutex_lock(&idrinfo->lock);
      			// insert ERR_PTR(-EBUSY)
      			mutex_unlock(&idrinfo->lock);
      tc_dump_action()
      rtnl_unlock()
      
      So we have to skip those temporary -EBUSY entries on dump path
      too.
      
      Reported-and-tested-by: syzbot+b47bc4f247856fb4d9e1@syzkaller.appspotmail.com
      Fixes: 0fedc63f ("net_sched: commit action insertions together")
      Cc: Vlad Buslov <vladbu@mellanox.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      580e4273
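Skipping placeholder entries during a walk can be sketched as follows. `err_ptr()` and `is_err()` are userspace imitations of the kernel's ERR_PTR()/IS_ERR() helpers, and `dump_walk()` is a hypothetical simplification of a dump walker, not the actual function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace imitations of the kernel's ERR_PTR() and IS_ERR(). */
static void *err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

static int is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct action {
    int index;
};

/* Count the entries a dump would emit, skipping empty and placeholder slots. */
static size_t dump_walk(void *const *slots, size_t n)
{
    size_t dumped = 0;

    assert(slots != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!slots[i] || is_err(slots[i]))
            continue; /* e.g. a temporary ERR_PTR(-EBUSY) entry */
        dumped++;
    }
    return dumped;
}
```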
    • net: team: fix memory leak in __team_options_register · 9a9e7749
      Anant Thazhemadam authored
      The variable "i" isn't reset correctly after the first loop
      under the label inst_rollback gets executed.
      
      The value of "i" is assigned to be option_count - 1, and the ensuing
      loop (under alloc_rollback) begins by executing i--.
      Thus, the value of i when the loop begins execution will now become
      i = option_count - 2.
      
      Thus, when kfree(dst_opts[i]) is called in the second loop in this
      order (i.e., inst_rollback followed by alloc_rollback),
      dst_opts[option_count - 2] is the first element freed, and
      dst_opts[option_count - 1] does not get freed, and thus, a memory
      leak is caused.
      
      This memory leak can be fixed by assigning i = option_count (instead of
      option_count - 1).
      
      Fixes: 80f7c668 ("team: add support for per-port options")
      Reported-by: syzbot+69b804437cfec30deac3@syzkaller.appspotmail.com
      Tested-by: syzbot+69b804437cfec30deac3@syzkaller.appspotmail.com
      Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9a9e7749
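The off-by-one can be reproduced with a small userspace sketch. `rollback_free()` is a hypothetical helper that mimics a rollback loop of the form `for (i--; i >= 0; i--)`, where `start` is the value of i on entry.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Free dst[start - 1] down to dst[0], mimicking a loop that begins
 * with i-- before its first iteration. Returns the number of
 * elements freed.
 */
static int rollback_free(void **dst, int start)
{
    int freed = 0;

    assert(dst != NULL);
    for (int i = start - 1; i >= 0; i--) {
        free(dst[i]);
        dst[i] = NULL;
        freed++;
    }
    return freed;
}
```

Entering with `start = option_count - 1` frees one element too few, while `start = option_count` frees them all.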
    • Christophe JAILLET's avatar
      790ca79d
  3. 03 Oct, 2020 2 commits
    • net: hinic: fix DEVLINK build errors · 1f7e877c
      Randy Dunlap authored
      Fix many (lots deleted here) build errors in hinic by selecting NET_DEVLINK.
      
      ld: drivers/net/ethernet/huawei/hinic/hinic_hw_dev.o: in function `mgmt_watchdog_timeout_event_handler':
      hinic_hw_dev.c:(.text+0x30a): undefined reference to `devlink_health_report'
      ld: drivers/net/ethernet/huawei/hinic/hinic_devlink.o: in function `hinic_fw_reporter_dump':
      hinic_devlink.c:(.text+0x1c): undefined reference to `devlink_fmsg_u32_pair_put'
      ld: drivers/net/ethernet/huawei/hinic/hinic_devlink.o: in function `hinic_fw_reporter_dump':
      hinic_devlink.c:(.text+0x126): undefined reference to `devlink_fmsg_binary_pair_put'
      ld: drivers/net/ethernet/huawei/hinic/hinic_devlink.o: in function `hinic_hw_reporter_dump':
      hinic_devlink.c:(.text+0x1ba): undefined reference to `devlink_fmsg_string_pair_put'
      ld: hinic_devlink.c:(.text+0x227): undefined reference to `devlink_fmsg_u8_pair_put'
      ld: drivers/net/ethernet/huawei/hinic/hinic_devlink.o: in function `hinic_devlink_alloc':
      hinic_devlink.c:(.text+0xaee): undefined reference to `devlink_alloc'
      ld: drivers/net/ethernet/huawei/hinic/hinic_devlink.o: in function `hinic_devlink_free':
      hinic_devlink.c:(.text+0xb04): undefined reference to `devlink_free'
      ld: drivers/net/ethernet/huawei/hinic/hinic_devlink.o: in function `hinic_devlink_register':
      hinic_devlink.c:(.text+0xb26): undefined reference to `devlink_register'
      ld: drivers/net/ethernet/huawei/hinic/hinic_devlink.o: in function `hinic_devlink_unregister':
      hinic_devlink.c:(.text+0xb46): undefined reference to `devlink_unregister'
      ld: drivers/net/ethernet/huawei/hinic/hinic_devlink.o: in function `hinic_health_reporters_create':
      hinic_devlink.c:(.text+0xb75): undefined reference to `devlink_health_reporter_create'
      ld: hinic_devlink.c:(.text+0xb95): undefined reference to `devlink_health_reporter_create'
      ld: hinic_devlink.c:(.text+0xbac): undefined reference to `devlink_health_reporter_destroy'
      ld: drivers/net/ethernet/huawei/hinic/hinic_devlink.o: in function `hinic_health_reporters_destroy':
      
      Fixes: 51ba902a ("net-next/hinic: Initialize hw interface")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Bin Luo <luobin9@huawei.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Aviad Krawczyk <aviad.krawczyk@huawei.com>
      Cc: Zhao Chen <zhaochen6@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1f7e877c
    • net: stmmac: Modify configuration method of EEE timers · 388e201d
      Vineetha G. Jaya Kumaran authored
      The ethtool manual states that the tx-timer is "the amount of time the
      device should stay in idle mode prior to asserting its Tx LPI". The
      previous implementation of "ethtool --set-eee tx-timer" set the LPI TW
      timer duration, which is not correct. Hence, this patch fixes
      "ethtool --set-eee tx-timer" to configure the EEE LPI timer.
      
      The LPI TW timer now uses the defined default value instead of the value
      from "ethtool --set-eee tx-timer", which follows the EEE LS timer
      implementation.
      
      Changelog V2
      *Not removing/modifying the eee_timer.
      *EEE LPI timer can be configured through ethtool and also the eee_timer
      module param.
      *EEE TW Timer will be configured with default value only, not able to be
      configured through ethtool or module param. This follows the implementation
      of the EEE LS Timer.
      
      Fixes: d765955d ("stmmac: add the Energy Efficient Ethernet support")
      Signed-off-by: Vineetha G. Jaya Kumaran <vineetha.g.jaya.kumaran@intel.com>
      Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      388e201d
  4. 02 Oct, 2020 27 commits
    • Merge tag 'mlx5-fixes-2020-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · ab0faf5f
      David S. Miller authored
      From: Saeed Mahameed <saeedm@nvidia.com>
      
      ====================
      This series introduces some fixes to mlx5 driver.
      
      v1->v2:
       - Patch #1 Don't return while mutex is held. (Dave)
      
      v2->v3:
       - Drop patch #1, will consider a better approach (Jakub)
       - use cpu_relax() instead of cond_resched() (Jakub)
       - while(i--) to reverse a loop (Jakub)
       - Drop old mellanox email sign-off and change the committer email
         (Jakub)
      
      Please pull and let me know if there is any problem.
      
      For -stable v4.15
       ('net/mlx5e: Fix VLAN cleanup flow')
       ('net/mlx5e: Fix VLAN create flow')
      
      For -stable v4.16
       ('net/mlx5: Fix request_irqs error flow')
      
      For -stable v5.4
       ('net/mlx5e: Add resiliency in Striding RQ mode for packets larger than MTU')
       ('net/mlx5: Avoid possible free of command entry while timeout comp handler')
      
      For -stable v5.7
       ('net/mlx5e: Fix return status when setting unsupported FEC mode')
      
      For -stable v5.8
       ('net/mlx5e: Fix race condition on nhe->n pointer in neigh update')
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ab0faf5f
    • tcp: fix syn cookied MPTCP request socket leak · 9d8c05ad
      Paolo Abeni authored
      If a syn-cookies request socket doesn't pass MPTCP-level
      validation done in syn_recv_sock(), we need to release
      it immediately, or it will be leaked.
      
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/89
      Fixes: 9466a1cc ("mptcp: enable JOIN requests even if cookies are in use")
      Reported-and-tested-by: Geliang Tang <geliangtang@gmail.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9d8c05ad
    • Merge branch 'Introduce-sendpage_ok-to-detect-misused-sendpage-in-network-related-drivers' · e7d4005d
      David S. Miller authored
      Coly Li says:
      
      ====================
      Introduce sendpage_ok() to detect misused sendpage in network related drivers
      
      As Sagi Grimberg suggested, the original fix is refined to a more common
      inline routine:
          static inline bool sendpage_ok(struct page *page)
          {
              return  (!PageSlab(page) && page_count(page) >= 1);
          }
      If sendpage_ok() returns true, the checked page can be handled by the
      concrete zero-copy sendpage method in the network layer.
      
      The v10 series has 7 patches and fixes a WARN_ONCE() usage from the v9 series:
      - The 1st patch in this series introduces sendpage_ok() in header file
        include/linux/net.h.
      - The 2nd patch adds WARN_ONCE() for improper zero-copy send in
        kernel_sendpage().
      - The 3rd patch fixes the page checking issue in nvme-over-tcp driver.
      - The 4th patch adds page_count check by using sendpage_ok() in
        do_tcp_sendpages() as Eric Dumazet suggested.
      - The 5th, 6th and 7th patches just replace existing open coded checks
        with the inline sendpage_ok() routine.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e7d4005d
    • libceph: use sendpage_ok() in ceph_tcp_sendpage() · 40efc4dc
      Coly Li authored
      In libceph, ceph_tcp_sendpage() does the following check before handing
      the page to the network layer's zero-copy sendpage method:
      	if (page_count(page) >= 1 && !PageSlab(page))
      
      This check is exactly what sendpage_ok() does. This patch replaces the
      open-coded check with sendpage_ok() as a code cleanup.
      Signed-off-by: Coly Li <colyli@suse.de>
      Acked-by: Jeff Layton <jlayton@kernel.org>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      40efc4dc
    • scsi: libiscsi: use sendpage_ok() in iscsi_tcp_segment_map() · 6aa25c73
      Coly Li authored
      In the iSCSI driver, iscsi_tcp_segment_map() uses the following code to
      check whether the page should be handled by sendpage:
          if (!recv && page_count(sg_page(sg)) >= 1 && !PageSlab(sg_page(sg)))
      
      The "page_count(sg_page(sg)) >= 1 && !PageSlab(sg_page(sg))" part makes
      sure the page can be sent to the network layer's zero-copy path. This
      part is exactly what sendpage_ok() does.
      
      This patch uses sendpage_ok() in iscsi_tcp_segment_map() to replace
      the original open-coded checks.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Lee Duncan <lduncan@suse.com>
      Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Vasily Averin <vvs@virtuozzo.com>
      Cc: Cong Wang <amwang@redhat.com>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Chris Leech <cleech@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6aa25c73
    • drbd: code cleanup by using sendpage_ok() to check page for kernel_sendpage() · fb25ebe1
      Coly Li authored
      In _drbd_send_page(), a page is checked by the following code before
      sending it with kernel_sendpage():
              (page_count(page) < 1) || PageSlab(page)
      If the check is true, this page won't be sent by kernel_sendpage() and
      is handled by sock_no_sendpage() instead.
      
      This kind of check is exactly what the helper sendpage_ok() does (in
      negated form); the helper was introduced into include/linux/net.h to
      solve a similar send page issue in nvme-tcp code.
      
      This patch uses sendpage_ok() to replace the open-coded checks of
      page type and refcount in _drbd_send_page(), as a code cleanup.
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fb25ebe1
    • tcp: use sendpage_ok() to detect misused .sendpage · cf83a17e
      Coly Li authored
      Commit a10674bf ("tcp: detecting the misuse of .sendpage for Slab
      objects") adds the checks for Slab pages, but pages that don't have
      a page_count are still missing from the check.
      
      The network layer's sendpage method is not designed to send page_count 0
      pages either, therefore both PageSlab() and page_count() should be
      checked for the page being sent. This is exactly what sendpage_ok()
      does.
      
      This patch uses sendpage_ok() in do_tcp_sendpages() to detect misused
      .sendpage, to make the code more robust.
      
      Fixes: a10674bf ("tcp: detecting the misuse of .sendpage for Slab objects")
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Vasily Averin <vvs@virtuozzo.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: stable@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cf83a17e
    • nvme-tcp: check page by sendpage_ok() before calling kernel_sendpage() · 7d4194ab
      Coly Li authored
      Currently nvme_tcp_try_send_data() doesn't use kernel_sendpage() to
      send slab pages. But pages allocated by __get_free_pages() without
      __GFP_COMP, which also have a refcount of 0, are still sent by
      kernel_sendpage() to the remote end, which is problematic.
      
      The newly introduced helper sendpage_ok() checks both the PageSlab tag
      and the page_count counter, and returns true if the checked page is OK
      to be sent by kernel_sendpage().
      
      This patch fixes the page checking issue of nvme_tcp_try_send_data()
      with sendpage_ok(). If sendpage_ok() returns true, send this page by
      kernel_sendpage(), otherwise use sock_no_sendpage() to handle this page.
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Vlastimil Babka <vbabka@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7d4194ab
    • net: add WARN_ONCE in kernel_sendpage() for improper zero-copy send · 7b62d31d
      Coly Li authored
      If a page sent into kernel_sendpage() is a slab page or it doesn't have
      a ref_count, this page is improper to send by the zero-copy sendpage()
      method. Such a page might be unexpectedly released in the network code
      path and cause an unpredictable panic due to kernel memory management
      data structure corruption.
      
      This patch adds a WARN_ONCE() on the page before sending it into the
      concrete zero-copy sendpage() method. If the page is improper for the
      zero-copy sendpage() method, a warning message can be observed before
      the consequential unpredictable kernel panic.
      
      This patch does not change existing kernel_sendpage() behavior for the
      improper page zero-copy send; it just provides a warning message as a
      hint for the following potential panic due to kernel memory heap
      corruption.
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Cong Wang <amwang@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Sridhar Samudrala <sri@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7b62d31d
    • net: introduce helper sendpage_ok() in include/linux/net.h · c381b079
      Coly Li authored
      The original problem was from the nvme-over-tcp code, which mistakenly
      uses kernel_sendpage() to send pages allocated by __get_free_pages()
      without the __GFP_COMP flag. Such pages don't have a refcount
      (page_count is 0) on tail pages; sending them by kernel_sendpage() may
      trigger a kernel panic from a corrupted kernel heap, because these pages
      are incorrectly freed in the network stack as page_count 0 pages.
      
      This patch introduces a helper sendpage_ok(), which returns true if the
      checked page
      - is not a slab page: PageSlab(page) is false.
      - has a page refcount: page_count(page) is not zero.
      
      All drivers that want to send a page to the remote end by
      kernel_sendpage() may use this helper to check whether the page is OK.
      If the helper does not return true, the driver should try another
      non-sendpage method (e.g. sock_no_sendpage()) to handle the page.
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Vlastimil Babka <vbabka@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c381b079
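The intended calling pattern can be sketched in userspace C. `struct mock_page`, `sendpage_ok_mock()` and `choose_send_path()` are hypothetical stand-ins; in the kernel the two conditions come from PageSlab() and page_count().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the two page properties checked. */
struct mock_page {
    bool slab;    /* what PageSlab(page) would report */
    int refcount; /* what page_count(page) would report */
};

enum send_path { SEND_ZERO_COPY, SEND_COPY };

/* Same condition as the helper: not a slab page and refcount >= 1. */
static bool sendpage_ok_mock(const struct mock_page *page)
{
    assert(page != NULL);
    return !page->slab && page->refcount >= 1;
}

/* Pick the zero-copy path only for pages that pass the check. */
static enum send_path choose_send_path(const struct mock_page *page)
{
    return sendpage_ok_mock(page) ? SEND_ZERO_COPY : SEND_COPY;
}
```

A driver following this pattern would call kernel_sendpage() for the zero-copy case and fall back to a copying method such as sock_no_sendpage() otherwise.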
    • net: usb: pegasus: Proper error handing when setting pegasus' MAC address · f30e25a9
      Petko Manolov authored
      v2:
      
      If reading the MAC address from the EEPROM fails, don't throw an error; use
      a randomly generated MAC instead.  Either way the adapter will soldier on
      and the return type of set_ethernet_addr() can be reverted to void.
      
      v1:
      
      Fix a bug in set_ethernet_addr() which does not take into account possible
      errors (or partial reads) returned by its helpers.  This can potentially lead to
      writing random data into device's MAC address registers.
      Signed-off-by: Petko Manolov <petko.manolov@konsulko.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f30e25a9
    • net: core: document two new elements of struct net_device · a93bdcb9
      Mauro Carvalho Chehab authored
      As warned by "make htmldocs", there are two new struct elements
      that aren't documented:
      
      	../include/linux/netdevice.h:2159: warning: Function parameter or member 'unlink_list' not described in 'net_device'
      	../include/linux/netdevice.h:2159: warning: Function parameter or member 'nested_level' not described in 'net_device'
      
      Fixes: 1fc70edb ("net: core: add nested_level variable in net_device")
      Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a93bdcb9
    • netlink: fix policy dump leak · a95bc734
      Johannes Berg authored
      If userspace doesn't complete the policy dump, we leak the
      allocated state. Fix this.
      
      Fixes: d07dcf9a ("netlink: add infrastructure to expose policies to userspace")
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a95bc734
    • net/mlx5e: Fix race condition on nhe->n pointer in neigh update · 1253935a
      Vlad Buslov authored
      Current neigh update event handler implementation takes reference to
      neighbour structure, assigns it to nhe->n, tries to schedule workqueue task
      and releases the reference if task was already enqueued. This can result
      in overwriting the existing nhe->n pointer with another neighbour
      instance, which causes double release of the instance (once in neigh update
      handler that failed to enqueue to workqueue and another one in neigh update
      workqueue task that processes updated nhe->n pointer instead of original
      one):
      
      [ 3376.512806] ------------[ cut here ]------------
      [ 3376.513534] refcount_t: underflow; use-after-free.
      [ 3376.521213] Modules linked in: act_skbedit act_mirred act_tunnel_key vxlan ip6_udp_tunnel udp_tunnel nfnetlink act_gact cls_flower sch_ingress openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 mlx5_ib mlx5_core mlxfw pci_hyperv_intf ptp pps_core nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp rpcrdma rdma_ucm ib_umad ib_ipoib ib_iser rdma_cm ib_cm iw_cm rfkill ib_uverbs ib_core sunrpc kvm_intel kvm iTCO_wdt iTCO_vendor_support virtio_net irqbypass net_failover crc32_pclmul lpc_ich i2c_i801 failover pcspkr i2c_smbus mfd_core ghash_clmulni_intel sch_fq_codel drm i2c_core ip_tables crc32c_intel serio_raw [last unloaded: mlxfw]
      [ 3376.529468] CPU: 8 PID: 22756 Comm: kworker/u20:5 Not tainted 5.9.0-rc5+ #6
      [ 3376.530399] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
      [ 3376.531975] Workqueue: mlx5e mlx5e_rep_neigh_update [mlx5_core]
      [ 3376.532820] RIP: 0010:refcount_warn_saturate+0xd8/0xe0
      [ 3376.533589] Code: ff 48 c7 c7 e0 b8 27 82 c6 05 0b b6 09 01 01 e8 94 93 c1 ff 0f 0b c3 48 c7 c7 88 b8 27 82 c6 05 f7 b5 09 01 01 e8 7e 93 c1 ff <0f> 0b c3 0f 1f 44 00 00 8b 07 3d 00 00 00 c0 74 12 83 f8 01 74 13
      [ 3376.536017] RSP: 0018:ffffc90002a97e30 EFLAGS: 00010286
      [ 3376.536793] RAX: 0000000000000000 RBX: ffff8882de30d648 RCX: 0000000000000000
      [ 3376.537718] RDX: ffff8882f5c28f20 RSI: ffff8882f5c18e40 RDI: ffff8882f5c18e40
      [ 3376.538654] RBP: ffff8882cdf56c00 R08: 000000000000c580 R09: 0000000000001a4d
      [ 3376.539582] R10: 0000000000000731 R11: ffffc90002a97ccd R12: 0000000000000000
      [ 3376.540519] R13: ffff8882de30d600 R14: ffff8882de30d640 R15: ffff88821e000900
      [ 3376.541444] FS:  0000000000000000(0000) GS:ffff8882f5c00000(0000) knlGS:0000000000000000
      [ 3376.542732] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 3376.543545] CR2: 0000556e5504b248 CR3: 00000002c6f10005 CR4: 0000000000770ee0
      [ 3376.544483] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 3376.545419] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 3376.546344] PKRU: 55555554
      [ 3376.546911] Call Trace:
      [ 3376.547479]  mlx5e_rep_neigh_update.cold+0x33/0xe2 [mlx5_core]
      [ 3376.548299]  process_one_work+0x1d8/0x390
      [ 3376.548977]  worker_thread+0x4d/0x3e0
      [ 3376.549631]  ? rescuer_thread+0x3e0/0x3e0
      [ 3376.550295]  kthread+0x118/0x130
      [ 3376.550914]  ? kthread_create_worker_on_cpu+0x70/0x70
      [ 3376.551675]  ret_from_fork+0x1f/0x30
      [ 3376.552312] ---[ end trace d84e8f46d2a77eec ]---
      
      Fix the bug by moving work_struct to dedicated dynamically-allocated
      structure. This enables every event handler to work on its own private
      neighbour pointer and removes the need for handling the case when task is
      already enqueued.
      
      Fixes: 232c0013 ("net/mlx5e: Add support to neighbour update flow")
      Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      1253935a
    • net/mlx5e: Fix VLAN create flow · d4a16052
      Aya Levin authored
      When interface is attached while in promiscuous mode and with VLAN
      filtering turned off, both configurations are not respected and VLAN
      filtering is performed.
      There are 2 flows which add the any-vid rules during interface attach:
      VLAN creation table and set rx mode. Each is relying on the other to
      add any-vid rules, eventually none of them does.
      
      Fix this by adding any-vid rules on VLAN creation regardless of
      promiscuous mode.
      
      Fixes: 9df30601 ("net/mlx5e: Restore vlan filter after seamless reset")
      Signed-off-by: Aya Levin <ayal@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      d4a16052
    • net/mlx5e: Fix VLAN cleanup flow · 8c7353b6
      Aya Levin authored
      Prior to this patch unloading an interface in promiscuous mode with RX
      VLAN filtering feature turned off - resulted in a warning. This is due
      to a wrong condition in the VLAN rules cleanup flow, which left the
      any-vid rules in the VLAN steering table. These rules prevented
      destroying the flow group and the flow table.
      
      The any-vid rules are removed in 2 flows, but neither of them removes
      them when both promiscuous mode is set and VLAN filtering is off. Fix the
      issue by changing the condition of the VLAN table cleanup flow to also
      clean up in case of promiscuous mode.
      
      mlx5_core 0000:00:08.0: mlx5_destroy_flow_group:2123:(pid 28729): Flow group 20 wasn't destroyed, refcount > 1
      mlx5_core 0000:00:08.0: mlx5_destroy_flow_group:2123:(pid 28729): Flow group 19 wasn't destroyed, refcount > 1
      mlx5_core 0000:00:08.0: mlx5_destroy_flow_table:2112:(pid 28729): Flow table 262149 wasn't destroyed, refcount > 1
      ...
      ...
      ------------[ cut here ]------------
      FW pages counter is 11560 after reclaiming all pages
      WARNING: CPU: 1 PID: 28729 at
      drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c:660
      mlx5_reclaim_startup_pages+0x178/0x230 [mlx5_core]
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
      rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
      Call Trace:
        mlx5_function_teardown+0x2f/0x90 [mlx5_core]
        mlx5_unload_one+0x71/0x110 [mlx5_core]
        remove_one+0x44/0x80 [mlx5_core]
        pci_device_remove+0x3e/0xc0
        device_release_driver_internal+0xfb/0x1c0
        device_release_driver+0x12/0x20
        pci_stop_bus_device+0x68/0x90
        pci_stop_and_remove_bus_device+0x12/0x20
        hv_eject_device_work+0x6f/0x170 [pci_hyperv]
        ? __schedule+0x349/0x790
        process_one_work+0x206/0x400
        worker_thread+0x34/0x3f0
        ? process_one_work+0x400/0x400
        kthread+0x126/0x140
        ? kthread_park+0x90/0x90
        ret_from_fork+0x22/0x30
         ---[ end trace 6283bde8d26170dc ]---
      
      Fixes: 9df30601 ("net/mlx5e: Restore vlan filter after seamless reset")
      Signed-off-by: Aya Levin <ayal@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      8c7353b6
    • net/mlx5e: Fix return status when setting unsupported FEC mode · 2608a2f8
      Aya Levin authored
      Verify the configured FEC mode is supported by at least a single link
      mode before applying the command. Otherwise fail the command and return
      "Operation not supported".
      Prior to this patch, the command was successful, yet it falsely set all
      link modes to FEC auto mode - like configuring FEC mode to auto. Auto
      mode is the default configuration if a link mode doesn't support the
      configured FEC mode.
      
      Fixes: b5ede32d ("net/mlx5e: Add support for FEC modes based on 50G per lane links")
      Signed-off-by: Aya Levin <ayal@mellanox.com>
      Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      2608a2f8
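The check described in the commit above can be sketched in plain C. This is only an illustration of the idea, not the driver's code: the FEC bit values, the `supported[]` table, and the name `fec_mode_check` are all assumptions made for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical FEC policy bits; the real driver uses its own encoding. */
enum { FEC_OFF = 1u << 0, FEC_RS = 1u << 1, FEC_BASER = 1u << 2 };

/*
 * supported[i] is the bitmask of FEC modes that link mode i can use.
 * Return 0 if at least one link mode supports 'requested', otherwise
 * -EOPNOTSUPP so the caller can fail before applying anything.
 */
static int fec_mode_check(const uint32_t *supported, size_t n_link_modes,
                          uint32_t requested)
{
    for (size_t i = 0; i < n_link_modes; i++) {
        if (supported[i] & requested)
            return 0;
    }
    return -EOPNOTSUPP;
}
```

Failing early avoids the old behaviour where an unsupported request quietly degraded every link mode to auto.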
    • Aya Levin's avatar
      net/mlx5e: Fix driver's declaration to support GRE offload · 3d093bc2
      Aya Levin authored
      Declare GRE offload support with respect to the inner protocol. Add a
      list of supported inner protocols on which the driver can offload
      checksum and GSO. For other protocols, inform the stack to do the needed
      operations. There is no noticeable impact on GRE performance.
      
      Fixes: 27299841 ("net/mlx5e: Support TSO and TX checksum offloads for GRE tunnels")
      Signed-off-by: Aya Levin <ayal@mellanox.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      3d093bc2
    • Maor Dickman's avatar
      net/mlx5e: CT, Fix coverity issue · 2b021989
      Maor Dickman authored
      The cited commit introduced the following coverity issue at function
      mlx5_tc_ct_rule_to_tuple_nat:
      - Memory - corruptions (OVERRUN)
        Overrunning array "tuple->ip.src_v6.in6_u.u6_addr32" of 4 4-byte
        elements at element index 7 (byte offset 31) using index
        "ip6_offset" (which evaluates to 7).
      
      In case of IPv6 destination address rewrite, ip6_offset values are
      between 4 and 7, so indexing array
      "tuple->ip.src_v6.in6_u.u6_addr32" overruns it into array
      "tuple->ip.dst_v6.in6_u.u6_addr32".
      
      Fix this by writing the value directly to array
      "tuple->ip.dst_v6.in6_u.u6_addr32" when ip6_offset is between 4
      and 7.
      
      Fixes: bc562be9 ("net/mlx5e: CT: Save ct entries tuples in hashtables")
      Signed-off-by: Maor Dickman <maord@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      2b021989
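A minimal sketch of the overrun and its fix, assuming a simplified tuple with two 4-word arrays. The struct and function names here are illustrative and do not come from the driver.

```c
#include <stdint.h>

/* Simplified stand-in for the tuple; field names are illustrative. */
struct ip6_tuple {
    uint32_t src[4];
    uint32_t dst[4];
};

/*
 * Offsets 0..3 address the source words, 4..7 the destination words.
 * Indexing src[] with 4..7 would run past the array, so select the
 * target array explicitly. Returns 0 on success, -1 for a bad offset.
 */
static int ip6_tuple_set_word(struct ip6_tuple *t, unsigned int ip6_offset,
                              uint32_t val)
{
    if (ip6_offset < 4)
        t->src[ip6_offset] = val;
    else if (ip6_offset < 8)
        t->dst[ip6_offset - 4] = val;
    else
        return -1;
    return 0;
}
```

Writing through `src[ip6_offset]` for offsets 4 to 7 may happen to land in `dst[]` when the two arrays are adjacent, but it is still an out-of-bounds access, which is what the static analyzer reports.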
    • Aya Levin's avatar
      net/mlx5e: Add resiliency in Striding RQ mode for packets larger than MTU · c3c94023
      Aya Levin authored
      Prior to this fix, in Striding RQ mode the driver was vulnerable when
      receiving packets in the range (stride size - headroom, stride size],
      where stride size is calculated by mtu+headroom+tailroom aligned to the
      closest power of 2.
      Usually, this filtering is performed by the HW, except for a few cases:
      - Between 2 VFs over the same PF with different MTUs
      - On bluefield, when the host physical function sets a larger MTU than
        the ARM has configured on its representor and uplink representor.
      
      When the HW filtering is not present, packets that are larger than MTU
      might harm the RQ's integrity in the following ways:
      1) Overflow from one WQE to the next, causing a memory corruption that
      in most cases is harmless, as the write lands in the headroom of the
      next packet, which will be overwritten by build_skb(). In very rare
      cases, under high stress/load, this is harmful: when the next WQE is
      not yet reposted and points to an existing SKB head.
      2) Each oversize packet overflows to the headroom of the next WQE. On
      the last WQE of the WQ, where addresses wrap-around, the address of the
      remainder headroom does not belong to the next WQE, but it is out of the
      memory region range. This results in a HW CQE error that moves the RQ
      into an error state.
      
      Solution:
      Add a page buffer at the end of each WQE to absorb the leak. Actually
      the maximal overflow size is headroom but since all memory units must be
      of the same size, we use page size to comply with UMR WQEs. The increase
      in memory consumption is of a single page per RQ. Initialize the mkey
      with all MTTs pointing to a default page. When the channels are
      activated, UMR WQEs will redirect the RX WQEs to the actual memory from
      the RQ's pool, while the overflow MTTs remain mapped to the default page.
      
      Fixes: 73281b78 ("net/mlx5e: Derive Striding RQ size from MTU")
      Signed-off-by: Aya Levin <ayal@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      c3c94023
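The vulnerable packet range from the commit above can be worked through numerically. The sketch below assumes "aligned to the closest power of 2" means rounding up, and the helper names and the headroom/tailroom values used with it are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round v up to the next power of two (assumes 1 <= v <= 2^31). */
static uint32_t roundup_pow2(uint32_t v)
{
    uint32_t p = 1;

    while (p < v)
        p <<= 1;
    return p;
}

/* Stride size as described: mtu + headroom + tailroom, rounded up. */
static uint32_t stride_size(uint32_t mtu, uint32_t headroom, uint32_t tailroom)
{
    return roundup_pow2(mtu + headroom + tailroom);
}

/* True if pkt_len falls in (stride - headroom, stride]. */
static bool pkt_overflows_stride(uint32_t pkt_len, uint32_t stride,
                                 uint32_t headroom)
{
    return pkt_len > stride - headroom && pkt_len <= stride;
}
```

With an MTU of 1500 and assumed headroom and tailroom of 256 and 320 bytes, the sum is 2076 and the stride is 4096, so packets of 3841 to 4096 bytes would spill into the next stride's headroom unless HW filtering drops them first.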
    • Aya Levin's avatar
      net/mlx5e: Fix error path for RQ alloc · 08a762ce
      Aya Levin authored
      Increase granularity of the error path to avoid unneeded free/release.
      Fix the cleanup to be symmetric to the order of creation.
      
      Fixes: 0ddf5432 ("xdp/mlx5: setup xdp_rxq_info")
      Fixes: 422d4c40 ("net/mlx5e: RX, Split WQ objects for different RQ types")
      Signed-off-by: Aya Levin <ayal@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      08a762ce
    • Maor Gottlieb's avatar
      net/mlx5: Fix request_irqs error flow · 732ebfab
      Maor Gottlieb authored
      Fix the error flow in request_irqs, which tries to free an irq
      that we failed to request.
      This fixes the trace below.
      
      WARNING: CPU: 1 PID: 7587 at kernel/irq/manage.c:1684 free_irq+0x4d/0x60
      CPU: 1 PID: 7587 Comm: bash Tainted: G        W  OE    4.15.15-1.el7MELLANOXsmp-x86_64 #1
      Hardware name: Advantech SKY-6200/SKY-6200, BIOS F2.00 08/06/2020
      RIP: 0010:free_irq+0x4d/0x60
      RSP: 0018:ffffc9000ef47af0 EFLAGS: 00010282
      RAX: ffff88001476ae00 RBX: 0000000000000655 RCX: 0000000000000000
      RDX: ffff88001476ae00 RSI: ffffc9000ef47ab8 RDI: ffff8800398bb478
      RBP: ffff88001476a838 R08: ffff88001476ae00 R09: 000000000000156d
      R10: 0000000000000000 R11: 0000000000000004 R12: ffff88001476a838
      R13: 0000000000000006 R14: ffff88001476a888 R15: 00000000ffffffe4
      FS:  00007efeadd32740(0000) GS:ffff88047fc40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fc9cc010008 CR3: 00000001a2380004 CR4: 00000000007606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       mlx5_irq_table_create+0x38d/0x400 [mlx5_core]
       ? atomic_notifier_chain_register+0x50/0x60
       mlx5_load_one+0x7ee/0x1130 [mlx5_core]
       init_one+0x4c9/0x650 [mlx5_core]
       pci_device_probe+0xb8/0x120
       driver_probe_device+0x2a1/0x470
       ? driver_allows_async_probing+0x30/0x30
       bus_for_each_drv+0x54/0x80
       __device_attach+0xa3/0x100
       pci_bus_add_device+0x4a/0x90
       pci_iov_add_virtfn+0x2dc/0x2f0
       pci_enable_sriov+0x32e/0x420
       mlx5_core_sriov_configure+0x61/0x1b0 [mlx5_core]
       ? kstrtoll+0x22/0x70
       num_vf_store+0x4b/0x70 [mlx5_core]
       kernfs_fop_write+0x102/0x180
       __vfs_write+0x26/0x140
       ? rcu_all_qs+0x5/0x80
       ? _cond_resched+0x15/0x30
       ? __sb_start_write+0x41/0x80
       vfs_write+0xad/0x1a0
       SyS_write+0x42/0x90
       do_syscall_64+0x60/0x110
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Fixes: 24163189 ("net/mlx5: Separate IRQ request/free from EQ life cycle")
      Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
      Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      732ebfab
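The general shape of such an error flow can be sketched as follows. This is a userspace toy, not the driver's request_irqs: `struct fake_irq`, `fake_request_irq`, `fake_free_irq`, and the `bad_frees` counter are hypothetical stand-ins for the kernel's request_irq()/free_irq() and its WARNING.

```c
#include <stddef.h>

/* Toy IRQ slot: 'fail' forces the request to fail (illustration only). */
struct fake_irq {
    int fail;
    int requested;
};

static int bad_frees; /* counts frees of IRQs that were never requested */

static int fake_request_irq(struct fake_irq *irq)
{
    if (irq->fail)
        return -1;
    irq->requested = 1;
    return 0;
}

static void fake_free_irq(struct fake_irq *irq)
{
    if (!irq->requested)
        bad_frees++; /* the kernel would emit a WARNING here */
    irq->requested = 0;
}

static int request_all_irqs(struct fake_irq *irqs, size_t n)
{
    size_t i;
    int err = 0;

    for (i = 0; i < n; i++) {
        err = fake_request_irq(&irqs[i]);
        if (err)
            goto unwind;
    }
    return 0;

unwind:
    /* i is the index that failed; free only 0..i-1, newest first. */
    while (i-- > 0)
        fake_free_irq(&irqs[i]);
    return err;
}
```

The unwind loop starts below the failing index, so the IRQ whose request failed is never passed to the free routine.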
    • Saeed Mahameed's avatar
      net/mlx5: cmdif, Avoid skipping reclaim pages if FW is not accessible · b898ce7b
      Saeed Mahameed authored
      When PCI is offline, reclaim_pages_cmd() will still try to call the
      FW to release FW pages; cmd_exec() in this case will return a silent
      success without actually calling the FW.
      
      This is wrong and will cause page leaks. What we should do is detect
      PCI offline or command interface unavailability before trying to
      access the FW, and manually release the FW pages in the driver.
      
      In this patch we share the code to check for FW command interface
      availability and we call it in sensitive places e.g. reclaim_pages_cmd().
      
      Alternative fix:
       1. Remove MLX5_CMD_OP_MANAGE_PAGES from mlx5_internal_err_ret_value,
          command success simulation list.
       2. Always Release FW pages even if cmd_exec fails in reclaim_pages_cmd().
      
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      b898ce7b
    • Eran Ben Elisha's avatar
      net/mlx5: Add retry mechanism to the command entry index allocation · 410bd754
      Eran Ben Elisha authored
      It is possible that new command entry index allocation will temporarily
      fail. The new command holds the semaphore, which means that a free
      entry should be ready soon. Add a one-second retry mechanism before
      returning an error.
      
      Patch "net/mlx5: Avoid possible free of command entry while timeout comp
      handler" increases the chance of hitting this temporary failure,
      as it delays the entry index release for non-callback commands.
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      410bd754
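A bounded retry of this kind could look like the sketch below. It is a single-threaded userspace illustration with assumed names (`try_alloc_index`, `alloc_index_retry`, a 32-bit free mask); it uses the C11 wall clock for simplicity and ignores the locking or atomics that real concurrent code would need.

```c
#include <stdint.h>
#include <time.h>

/* Claim the lowest set bit in *free_mask and return its index, or -1. */
static int try_alloc_index(uint32_t *free_mask)
{
    for (int i = 0; i < 32; i++) {
        if (*free_mask & (UINT32_C(1) << i)) {
            *free_mask &= ~(UINT32_C(1) << i);
            return i;
        }
    }
    return -1;
}

/* Current time in milliseconds (wall clock; fine for a sketch). */
static int64_t now_ms(void)
{
    struct timespec ts;

    timespec_get(&ts, TIME_UTC);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Keep retrying until an index becomes free or timeout_ms elapses.
 * The loop busy-polls; another context is expected to set bits in
 * *free_mask when it releases an entry.
 */
static int alloc_index_retry(uint32_t *free_mask, int64_t timeout_ms)
{
    int64_t deadline = now_ms() + timeout_ms;
    int idx;

    do {
        idx = try_alloc_index(free_mask);
        if (idx >= 0)
            return idx;
    } while (now_ms() < deadline);

    return -1;
}
```

A caller following the commit message would pass a timeout of 1000 ms and only report an error once that window has passed without a free entry.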
    • Eran Ben Elisha's avatar
      net/mlx5: poll cmd EQ in case of command timeout · 1d5558b1
      Eran Ben Elisha authored
      Once the driver detects a command interface command timeout, it warns the
      user and returns a timeout error to the caller. In such a case, the entry
      of the command is not evacuated (because only a real event interrupt is
      allowed to clear a command interface entry). If the HW event interrupt
      of this entry never arrives, this entry will be left unused forever.
      Command interface entries are limited and eventually we can end up without
      the ability to post a new command.
      
      In addition, if the driver does not consume the EQE of the lost interrupt
      and rearm the EQ, no new interrupts will arrive for other commands.
      
      Add a resiliency mechanism for manually polling the command EQ in case of
      a command timeout. If the resiliency mechanism finds a non-handled EQE,
      it will consume it, and the command interface will be fully functional
      again. Once the resiliency flow has finished, wait another 5 seconds for
      the command interface to complete for this command entry.
      
      Define mlx5_cmd_eq_recover() to manage the cmd EQ polling resiliency flow.
      Add an async EQ spinlock to avoid races between resiliency flows and real
      interrupts that might run simultaneously.
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      1d5558b1
    • Eran Ben Elisha's avatar
      net/mlx5: Avoid possible free of command entry while timeout comp handler · 50b2412b
      Eran Ben Elisha authored
      Upon command completion timeout, the driver simulates a forced command
      completion. In a rare case where the real interrupt for that command
      arrives simultaneously, it might release the command entry while the
      forced handler might still access it.
      
      Fix that by adding an entry refcount, to track the current number of
      allowed handlers. The command entry is released only when this refcount
      is decremented to zero.
      
      Command refcount is always initialized to one. For callback commands,
      command completion handler is the symmetric flow to decrement it. For
      non-callback commands, it is wait_func().
      
      Before ringing the doorbell, increment the refcount for the real completion
      handler. Once the real completion handler is called, it will decrement it.
      
      For callback commands, once the delayed work is scheduled, increment the
      refcount. Upon callback command completion handler, we will try to cancel
      the timeout callback. In case of success, we need to decrement the callback
      refcount as it will never run.
      
      In addition, gather the entry index free and the entry free into a
      single flow for releasing all command types.
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      50b2412b
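The refcount scheme in the commit above can be illustrated with C11 atomics. This is a sketch under assumptions, not the driver's code: `struct cmd_ent` is reduced to two fields, and the `freed` flag stands in for releasing the entry index and the entry itself.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Reduced command entry; 'freed' stands in for the real release. */
struct cmd_ent {
    atomic_int refcnt;
    bool freed;
};

/* The refcount always starts at one, owned by the issuing flow. */
static void cmd_ent_init(struct cmd_ent *ent)
{
    atomic_init(&ent->refcnt, 1);
    ent->freed = false;
}

/* Take a reference for a handler that may still touch the entry. */
static void cmd_ent_get(struct cmd_ent *ent)
{
    atomic_fetch_add(&ent->refcnt, 1);
}

/*
 * Drop one reference. Only the caller that brings the count to zero
 * releases the entry. Returns true if this call released it.
 */
static bool cmd_ent_put(struct cmd_ent *ent)
{
    if (atomic_fetch_sub(&ent->refcnt, 1) != 1)
        return false;
    ent->freed = true;
    return true;
}
```

For a non-callback command, the flow would be: init, then get before ringing the doorbell; the completion handler and the waiter each call put, and whichever runs last performs the release, so neither can free the entry under the other.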
    • Eran Ben Elisha's avatar
      net/mlx5: Fix a race when moving command interface to polling mode · 432161ea
      Eran Ben Elisha authored
      As part of driver unload, it destroys the commands EQ (via FW command).
      Once the commands EQ is destroyed, FW will not generate EQEs for any
      command that the driver sends afterwards. The driver should poll for
      the status of later commands.
      
      Driver commands mode metadata is updated before the commands EQ is
      actually destroyed. This can lead to double completion handling by the
      driver (polling and interrupt), if a command is executed and completed by
      FW after the mode was changed, but before the EQ was destroyed.
      
      Fix that by using the mlx5_cmd_allowed_opcode mechanism to guarantee
      that only DESTROY_EQ command can be executed during this time period.
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      432161ea
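An allowed-opcode gate of the kind the commit above relies on might be modelled like this. The constants and names (`ALLOWED_OPCODE_ALL`, `OP_DESTROY_EQ`, `OP_OTHER`, `struct cmd_gate`) and their numeric values are placeholders chosen for the example, not the driver's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values for illustration only. */
#define ALLOWED_OPCODE_ALL 0xffffu /* no restriction */
#define OP_DESTROY_EQ      0x0001u
#define OP_OTHER           0x0002u

struct cmd_gate {
    uint16_t allowed_opcode;
};

/* A command may run if the gate is open or names exactly its opcode. */
static bool cmd_is_allowed(const struct cmd_gate *gate, uint16_t opcode)
{
    return gate->allowed_opcode == ALLOWED_OPCODE_ALL ||
           gate->allowed_opcode == opcode;
}
```

Restricting the gate to the EQ-destroy opcode before switching to polling mode, and reopening it afterwards, means no other command can complete in the window where both completion paths are live.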
  5. 01 Oct, 2020 2 commits
  6. 30 Sep, 2020 3 commits