1. 20 Sep, 2017 37 commits
    • Revert "net: fix percpu memory leaks" · a44bb1c4
      Jesper Dangaard Brouer authored
      
      [ Upstream commit 5a63643e ]
      
      This reverts commit 1d6119ba.
      
      After reverting commit 6d7b857d ("net: use lib/percpu_counter API
      for fragmentation mem accounting"), there is no need for this
      fix-up patch.  As percpu_counter is no longer used, it can no
      longer leak memory.
      
      Fixes: 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting")
      Fixes: 1d6119ba ("net: fix percpu memory leaks")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Revert "net: use lib/percpu_counter API for fragmentation mem accounting" · 8fbf9f91
      Jesper Dangaard Brouer authored
      
      [ Upstream commit fb452a1a ]
      
      This reverts commit 6d7b857d.
      
      There is a bug in the fragmentation code's use of the percpu_counter API
      that can cause issues on systems with many CPUs.

      The frag_mem_limit() just reads the global counter (fbc->count),
      without considering that other CPUs can each have up to batch size (130K)
      that hasn't been subtracted yet.  Due to the 3MBytes lower thresh limit,
      this becomes dangerous at >=24 CPUs (3*1024*1024/130000=24).
      
      The correct API usage would be to use __percpu_counter_compare() which
      does the right thing, and takes into account the number of (online)
      CPUs and batch size, to account for this and call __percpu_counter_sum()
      when needed.
      
      We choose to revert the use of the lib/percpu_counter API for frag
      memory accounting for several reasons:
      
      1) On systems with CPUs > 24, the heavier fully locked
         __percpu_counter_sum() is always invoked, which will be more
         expensive than the atomic_t that is reverted to.
      
      Given systems with more than 24 CPUs are becoming common this doesn't
      seem like a good option.  To mitigate this, the batch size could be
      decreased and thresh be increased.
      
      2) The add_frag_mem_limit+sub_frag_mem_limit pairs happen on the RX
         CPU, before SKBs are pushed into sockets on remote CPUs.  Given
         NICs can only hash on L2 part of the IP-header, the NIC-RXq's will
         likely be limited.  Thus, there is a fair chance that the atomic
         add+dec happen on the same CPU.
      
      Revert note: commit 1d6119ba ("net: fix percpu memory leaks")
      removed init_frag_mem_limit() and instead used inet_frags_init_net().
      After this revert, inet_frags_uninit_net() becomes empty.
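
      The batching problem described above can be illustrated with a toy
      model. This is only a sketch, not the kernel's lib/percpu_counter code:
      all toy_* names and the fixed CPU count are assumptions, and the batch
      and threshold values are the ones quoted in the message. With 24 CPUs
      each holding just under one batch, a read of the global count alone
      reports 0 while roughly 3.1 million bytes are accounted.

      ```c
      #include <assert.h>
      #include <stdint.h>

      #define TOY_NCPUS      24
      #define TOY_BATCH      130000              /* batch size quoted above */
      #define TOY_LOW_THRESH (3 * 1024 * 1024)   /* 3MBytes lower thresh */

      struct toy_percpu_counter {
          int64_t count;                 /* global part, all a cheap read sees */
          int32_t local[TOY_NCPUS];      /* per-CPU deltas not yet folded in */
      };

      /* Add on one CPU; fold into the global count once |delta| reaches the batch. */
      static void toy_add(struct toy_percpu_counter *c, int cpu, int32_t amount)
      {
          int64_t v = (int64_t)c->local[cpu] + amount;

          assert(cpu >= 0 && cpu < TOY_NCPUS);
          if (v >= TOY_BATCH || v <= -TOY_BATCH) {
              c->count += v;
              c->local[cpu] = 0;
          } else {
              c->local[cpu] = (int32_t)v;
          }
      }

      /* What a reader of only the global count observes. */
      static int64_t toy_read_cheap(const struct toy_percpu_counter *c)
      {
          return c->count;
      }

      /* Exact value: global count plus every CPU's pending delta. */
      static int64_t toy_sum_exact(const struct toy_percpu_counter *c)
      {
          int64_t sum = c->count;

          for (int i = 0; i < TOY_NCPUS; i++)
              sum += c->local[i];
          return sum;
      }
      ```

      Comparing toy_read_cheap() against a limit can therefore under-report
      by up to TOY_NCPUS * (TOY_BATCH - 1), which is why an exact sum (or a
      plain atomic counter) is needed near the threshold.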
      
      Fixes: 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting")
      Fixes: 1d6119ba ("net: fix percpu memory leaks")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bridge: switchdev: Clear forward mark when transmitting packet · 79f08820
      Ido Schimmel authored
      
      [ Upstream commit 79e99bdd ]
      
      Commit 6bc506b4 ("bridge: switchdev: Add forward mark support for
      stacked devices") added the 'offload_fwd_mark' bit to the skb in order
      to allow drivers to indicate to the bridge driver that they already
      forwarded the packet in L2.
      
      In case the bit is set, before transmitting the packet from each port,
      the port's mark is compared with the mark stored in the skb's control
      block. If both marks are equal, we know the packet arrived from a switch
      device that already forwarded the packet and it's not re-transmitted.
      
      However, if the packet is transmitted from the bridge device itself
      (e.g., br0), we should clear the 'offload_fwd_mark' bit as the mark
      stored in the skb's control block isn't valid.
      
      This scenario can happen in rare cases where a packet was trapped during
      L3 forwarding and forwarded by the kernel to a bridge device.
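
      A minimal sketch of the egress decision described above, assuming a
      simplified skb with one flag and one mark. The toy_* types and
      functions are hypothetical stand-ins, not the bridge driver's code.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      struct toy_skb {
          bool offload_fwd_mark_set;   /* the 'offload_fwd_mark' bit */
          int  cb_mark;                /* mark stored in the control block */
      };

      /* True if the bridge should still transmit the packet from this port. */
      static bool toy_allowed_egress(const struct toy_skb *skb, int port_mark)
      {
          return !skb->offload_fwd_mark_set || skb->cb_mark != port_mark;
      }

      /* Transmit from the bridge device itself: the stored mark is not
       * valid here, so the bit is cleared before the per-port check runs. */
      static void toy_bridge_dev_xmit(struct toy_skb *skb)
      {
          skb->offload_fwd_mark_set = false;
      }
      ```

      Without the clearing step, a stale mark that happens to equal a port's
      mark would wrongly suppress transmission on that port.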
      
      Fixes: 6bc506b4 ("bridge: switchdev: Add forward mark support for stacked devices")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reported-by: Yotam Gigi <yotamg@mellanox.com>
      Tested-by: Yotam Gigi <yotamg@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mlxsw: spectrum: Forbid linking to devices that have uppers · 2f4232ba
      Ido Schimmel authored
      
      [ Upstream commit 25cc72a3 ]
      
      The mlxsw driver relies on NETDEV_CHANGEUPPER events to configure the
      device in case a port is enslaved to a master netdev such as bridge or
      bond.
      
      Since the driver ignores events unrelated to its ports and their
      uppers, it's possible to engineer situations in which the device's data
      path differs from the kernel's.
      
      One example of such a situation is when a port is enslaved to a bond
      that is already enslaved to a bridge. When the bond was enslaved the
      driver ignored the event - as the bond wasn't one of its uppers - and
      therefore a bridge port instance isn't created in the device.
      
      Until such configurations are supported, forbid them by checking that the
      upper device doesn't have uppers of its own.
      
      Fixes: 0d65fc13 ("mlxsw: spectrum: Implement LAG port join/leave")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reported-by: Nogah Frankel <nogahf@mellanox.com>
      Tested-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: fec: Allow reception of frames bigger than 1522 bytes · a9e548de
      Andrew Lunn authored
      
      [ Upstream commit fbbeefdd ]
      
      The FEC Receive Control Register has a 14 bit field indicating the
      longest frame that may be received. It is being set to 1522. Frames
      longer than this are discarded, but counted as being in error.
      
      When using DSA, frames from the switch have an additional header,
      either 4 or 8 bytes if a Marvell switch is used. Thus a full MTU frame
      of 1522 bytes received by the switch on a port becomes 1530 bytes when
      passed to the host via the FEC interface.
      
      Change the maximum receive size to 2048 - 64, where 64 is the maximum
      rx_alignment applied on the receive buffer for AVB capable FEC
      cores. Use this value also for the maximum receive buffer size. The
      driver is already allocating a receive SKB of 2048 bytes, so this
      change should not have any significant effects.
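
      The size arithmetic above can be checked with a small sketch. The
      TOY_* constants and helper names are assumptions made for
      illustration; only the numbers come from the message.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      #define TOY_OLD_MAX_FL   1522   /* previous maximum frame length */
      #define TOY_RX_SKB_SIZE  2048   /* receive SKB size already allocated */
      #define TOY_RX_ALIGN_MAX 64     /* largest rx_alignment (AVB capable cores) */

      /* Length seen by the host once the switch adds its DSA header. */
      static int toy_host_frame_len(int wire_len, int dsa_hdr_len)
      {
          return wire_len + dsa_hdr_len;
      }

      /* New maximum receive size: 2048 - 64. */
      static int toy_new_max_fl(void)
      {
          return TOY_RX_SKB_SIZE - TOY_RX_ALIGN_MAX;
      }

      static bool toy_frame_accepted(int len, int max_fl)
      {
          return len <= max_fl;
      }
      ```

      A 1522-byte frame with an 8-byte header is 1530 bytes on the host
      side, which the old limit rejects and the new limit of 1984 accepts.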
      
      Tested on imx51, imx6, vf610.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()" · b8fcbae2
      Florian Fainelli authored
      
      [ Upstream commit ebc8254a ]
      
      This reverts commit 7ad813f2 ("net: phy:
      Correctly process PHY_HALTED in phy_stop_machine()") because it is
      creating the possibility for a NULL pointer dereference.
      
      David Daney provided the following call trace and diagram of events:
      
      When ndo_stop() is called we call:
      
       phy_disconnect()
          +---> phy_stop_interrupts() implies: phydev->irq = PHY_POLL;
          +---> phy_stop_machine()
          |      +---> phy_state_machine()
          |              +----> queue_delayed_work(): Work queued.
          +--->phy_detach() implies: phydev->attached_dev = NULL;
      
      Now at a later time the queued work does:
      
       phy_state_machine()
          +---->netif_carrier_off(phydev->attached_dev): Oh no! It is NULL:
      
       CPU 12 Unable to handle kernel paging request at virtual address
      0000000000000048, epc == ffffffff80de37ec, ra == ffffffff80c7c
      Oops[#1]:
      CPU: 12 PID: 1502 Comm: kworker/12:1 Not tainted 4.9.43-Cavium-Octeon+ #1
      Workqueue: events_power_efficient phy_state_machine
      task: 80000004021ed100 task.stack: 8000000409d70000
      $ 0   : 0000000000000000 ffffffff84720060 0000000000000048 0000000000000004
      $ 4   : 0000000000000000 0000000000000001 0000000000000004 0000000000000000
      $ 8   : 0000000000000000 0000000000000000 00000000ffff98f3 0000000000000000
      $12   : 8000000409d73fe0 0000000000009c00 ffffffff846547c8 000000000000af3b
      $16   : 80000004096bab68 80000004096babd0 0000000000000000 80000004096ba800
      $20   : 0000000000000000 0000000000000000 ffffffff81090000 0000000000000008
      $24   : 0000000000000061 ffffffff808637b0
      $28   : 8000000409d70000 8000000409d73cf0 80000000271bd300 ffffffff80c7804c
      Hi    : 000000000000002a
      Lo    : 000000000000003f
      epc   : ffffffff80de37ec netif_carrier_off+0xc/0x58
      ra    : ffffffff80c7804c phy_state_machine+0x48c/0x4f8
      Status: 14009ce3        KX SX UX KERNEL EXL IE
      Cause : 00800008 (ExcCode 02)
      BadVA : 0000000000000048
      PrId  : 000d9501 (Cavium Octeon III)
      Modules linked in:
      Process kworker/12:1 (pid: 1502, threadinfo=8000000409d70000,
      task=80000004021ed100, tls=0000000000000000)
      Stack : 8000000409a54000 80000004096bab68 80000000271bd300 80000000271c1e00
              0000000000000000 ffffffff808a1708 8000000409a54000 80000000271bd300
              80000000271bd320 8000000409a54030 ffffffff80ff0f00 0000000000000001
              ffffffff81090000 ffffffff808a1ac0 8000000402182080 ffffffff84650000
              8000000402182080 ffffffff84650000 ffffffff80ff0000 8000000409a54000
              ffffffff808a1970 0000000000000000 80000004099e8000 8000000402099240
              0000000000000000 ffffffff808a8598 0000000000000000 8000000408eeeb00
              8000000409a54000 00000000810a1d00 0000000000000000 8000000409d73de8
              8000000409d73de8 0000000000000088 000000000c009c00 8000000409d73e08
              8000000409d73e08 8000000402182080 ffffffff808a84d0 8000000402182080
              ...
      Call Trace:
      [<ffffffff80de37ec>] netif_carrier_off+0xc/0x58
      [<ffffffff80c7804c>] phy_state_machine+0x48c/0x4f8
      [<ffffffff808a1708>] process_one_work+0x158/0x368
      [<ffffffff808a1ac0>] worker_thread+0x150/0x4c0
      [<ffffffff808a8598>] kthread+0xc8/0xe0
      [<ffffffff808617f0>] ret_from_kernel_thread+0x14/0x1c
      
      The original motivation for this change originated from Marc Gonzales
      indicating that his network driver did not have its adjust_link callback
      executing with phydev->link = 0 while he was expecting it.
      
      PHYLIB has never made any such guarantee, because phy_stop() merely
      tells the workqueue to move into the PHY_HALTED state, which happens
      asynchronously.
      Reported-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Reported-by: David Daney <ddaney.cavm@gmail.com>
      Fixes: 7ad813f2 ("net: phy: Correctly process PHY_HALTED in phy_stop_machine()")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5e: Fix CQ moderation mode not set properly · b88be44f
      Tal Gilboa authored
      
      [ Upstream commit 1213ad28 ]
      
      cq_period_mode assignment was mistakenly removed so it was always set to "0",
      which is EQE based moderation, regardless of the device CAPs and
      requested value in ethtool.
      
      Fixes: 6a9764ef ("net/mlx5e: Isolate open_channels from priv->params")
      Signed-off-by: Tal Gilboa <talgi@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5e: Fix inline header size for small packets · 8049c41d
      Moshe Shemesh authored
      
      [ Upstream commit 6aace17e ]
      
      Fix inline header size, make sure it is not greater than skb len.
      This bug affects small packets, for example L2 packets with size < 18.
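
      The fix amounts to clamping the inline header size to the packet
      length. A minimal sketch, where toy_inline_hdr_size() is a
      hypothetical helper and not the driver's function:

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Never inline more header bytes than the packet actually contains. */
      static uint16_t toy_inline_hdr_size(uint16_t wanted_ihs, uint32_t skb_len)
      {
          return skb_len < wanted_ihs ? (uint16_t)skb_len : wanted_ihs;
      }
      ```

      For a 14-byte L2 packet and a wanted inline size of 18, the helper
      yields 14 rather than reading 4 bytes beyond the packet data.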
      
      Fixes: ae76715d ("net/mlx5e: Check the minimum inline header mode before xmit")
      Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5: E-Switch, Unload the representors in the correct order · 8db40bcf
      Shahar Klein authored
      
      [ Upstream commit 19122039 ]
      
      When changing from switchdev to legacy mode, all the representor port
      devices (uplink nic and reps) are cleaned up. Part of this cleaning
      process is removing the neigh entries and the hash table containing them.
      However, a representor neigh entry might be linked to the uplink port
      hash table and if the uplink nic is cleaned first the cleaning of the
      representor will end up in null deref.
      Fix that by unloading the representors in the opposite order of load.
      
      Fixes: cb67b832 ("net/mlx5e: Introduce SRIOV VF representors")
      Signed-off-by: Shahar Klein <shahark@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5e: Properly resolve TC offloaded ipv6 vxlan tunnel source address · b0034cb5
      Paul Blakey authored
      
      [ Upstream commit 08820528 ]
      
      Currently if vxlan tunnel ipv6 src isn't supplied the driver fails to
      resolve it as part of the route lookup. The resulting encap header
      is left with a zeroed out ipv6 src address so the packets are sent
      with this src ip.
      
      Use an appropriate route lookup API that also resolves the source
      ipv6 address if it's not supplied.
      
      Fixes: ce99f6b9 ('net/mlx5e: Support SRIOV TC encapsulation offloads for IPv6 tunnels')
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5e: Don't override user RSS upon set channels · 53c55257
      Inbar Karmy authored
      
      [ Upstream commit 5a8e1267 ]
      
      Currently, increasing the number of combined channels changes
      the RSS spread to use the newly created channels.
      Prevent the RSS spread change if the user explicitly configured it,
      to avoid overriding user configuration.
      
      Tested:
      when RSS default:
      
      # ethtool -L ens8 combined 4
      RSS spread will change and point to 4 channels.
      
      # ethtool -X ens8 equal 4
      # ethtool -L ens8 combined 6
      RSS will not change after increasing the number of the channels.
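
      The tested behaviour can be modelled as follows. This is a sketch
      under assumptions: the toy_* names, the 8-entry indirection table and
      the user_configured flag are illustrative, not the driver's data
      structures.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      #define TOY_INDIR_SIZE 8

      struct toy_rss {
          bool user_configured;        /* set once the user declares a spread */
          int  indir[TOY_INDIR_SIZE];  /* indirection table: slot -> channel */
      };

      static void toy_spread_equal(struct toy_rss *rss, int nch)
      {
          for (int i = 0; i < TOY_INDIR_SIZE; i++)
              rss->indir[i] = i % nch;
      }

      /* Models "ethtool -X <dev> equal N": an explicit user choice. */
      static void toy_set_rss_equal(struct toy_rss *rss, int nch)
      {
          toy_spread_equal(rss, nch);
          rss->user_configured = true;
      }

      /* Models "ethtool -L <dev> combined N": respread only by default. */
      static void toy_set_channels(struct toy_rss *rss, int nch)
      {
          if (!rss->user_configured)
              toy_spread_equal(rss, nch);
      }
      ```

      After toy_set_rss_equal(&rss, 4), a later toy_set_channels(&rss, 6)
      leaves the table pointing at channels 0 to 3 only.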
      
      Fixes: 8bf36862 ('ethtool: ensure channel counts are within bounds during SCHANNELS')
      Signed-off-by: Inbar Karmy <inbark@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5e: Fix dangling page pointer on DMA mapping error · ba008489
      Eran Ben Elisha authored
      
      [ Upstream commit 0556ce72 ]
      
      Function mlx5e_dealloc_rx_wqe uses the page pointer value as an
      indication of a valid DMA mapping. If the mapping failed, we
      released the page but kept the dangling pointer. Store the page pointer
      only after the DMA mapping succeeds, to avoid an invalid page DMA unmap.
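
      A user-space sketch of the "publish the pointer only after mapping
      succeeds" pattern. malloc() stands in for page allocation and
      toy_dma_map() for the DMA mapping call; every toy_* name is an
      assumption for illustration.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stdlib.h>

      struct toy_dma_info {
          void *page;   /* non-NULL means "mapped and must be unmapped" */
      };

      /* Stand-in for the DMA mapping call; fails on request. */
      static bool toy_dma_map(void *page, bool fail)
      {
          (void)page;
          return !fail;
      }

      static int toy_page_alloc_mapped(struct toy_dma_info *di, bool map_fails)
      {
          void *page = malloc(4096);

          if (!page)
              return -1;
          if (!toy_dma_map(page, map_fails)) {
              free(page);
              return -1;        /* di->page is left untouched */
          }
          di->page = page;      /* stored only after the mapping succeeded */
          return 0;
      }

      static void toy_dealloc(struct toy_dma_info *di)
      {
          if (!di->page)
              return;           /* nothing was mapped */
          free(di->page);
          di->page = NULL;
      }
      ```

      Because the pointer is never set on the failure path, the dealloc
      side cannot mistake a released page for a valid mapping.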
      
      Fixes: bc77b240 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5: Fix arm SRQ command for ISSI version 0 · 7ae1eccb
      Noa Osherovich authored
      
      [ Upstream commit 672d0880 ]
      
      Support for ISSI version 0 was recently broken as the arm_srq_cmd
      command, which is used only for ISSI version 0, was given the opcode
      for ISSI version 1 instead of ISSI version 0.
      
      Change arm_srq_cmd to use the correct command opcode for ISSI version
      0.
      
      Fixes: af1ba291 ('{net, IB}/mlx5: Refactor internal SRQ API')
      Signed-off-by: Noa Osherovich <noaos@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5e: Fix DCB_CAP_ATTR_DCBX capability for DCBNL getcap. · 0b6b3028
      Huy Nguyen authored
      
      [ Upstream commit 9e10bf1d ]
      
      Current code doesn't report DCB_CAP_DCBX_HOST capability when queried
      through getcap. User space lldptool expects capability to have HOST mode
      set when it wants to configure DCBX CEE mode. In absence of HOST mode
      capability, lldptool fails to switch to CEE mode.
      
      This fix returns DCB_CAP_DCBX_HOST capability when port's DCBX
      controlled mode is under software control.
      
      Fixes: 3a6a931d ("net/mlx5e: Support DCBX CEE API")
      Signed-off-by: Huy Nguyen <huyn@mellanox.com>
      Reviewed-by: Parav Pandit <parav@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5e: Check for qos capability in dcbnl_initialize · 9b919ad3
      Huy Nguyen authored
      
      [ Upstream commit 33c52b67 ]
      
      The qos capability is the master capability bit that determines
      whether DCBX is supported for the PCI function. If this bit is off,
      the driver cannot run any dcbx code.
      
      Fixes: e207b7e9 ("net/mlx5e: ConnectX-4 firmware support for DCBX")
      Signed-off-by: Huy Nguyen <huyn@mellanox.com>
      Reviewed-by: Parav Pandit <parav@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: dsa: bcm_sf2: Fix number of CFP entries for BCM7278 · 31034e44
      Florian Fainelli authored
      
      [ Upstream commit df191632 ]
      
      BCM7278 has only 128 entries while BCM7445 has the full 256 entries set,
      fix that.
      
      Fixes: 7318166c ("net: dsa: bcm_sf2: Add support for ethtool::rxnfc")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kcm: do not attach PF_KCM sockets to avoid deadlock · f9901adf
      Eric Dumazet authored
      
      [ Upstream commit 351050ec ]
      
      syzkaller had no problem triggering a deadlock by attaching a KCM socket
      to another one (or itself). (The original syzkaller report was a very
      confusing lockdep splat during a sendmsg().)
      
      It seems KCM claims to only support TCP, but no enforcement is done,
      so we might need to add additional checks.
      
      Fixes: ab7ac4eb ("kcm: Kernel Connection Multiplexor module")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Tom Herbert <tom@quantonium.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • packet: Don't write vnet header beyond end of buffer · e7ebdeb4
      Benjamin Poirier authored
      
      [ Upstream commit edbd58be ]
      
      ... which may happen with certain values of tp_reserve and maclen.
      
      Fixes: 58d19b19 ("packet: vnet_hdr support for tpacket_rcv")
      Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ipv6: do not set sk_destruct in IPV6_ADDRFORM sockopt · ef5a20f0
      Xin Long authored
      
      [ Upstream commit e8d411d2 ]
      
      ChunYu found a kernel warn_on during syzkaller fuzzing:
      
      [40226.038539] WARNING: CPU: 5 PID: 23720 at net/ipv4/af_inet.c:152 inet_sock_destruct+0x78d/0x9a0
      [40226.144849] Call Trace:
      [40226.147590]  <IRQ>
      [40226.149859]  dump_stack+0xe2/0x186
      [40226.176546]  __warn+0x1a4/0x1e0
      [40226.180066]  warn_slowpath_null+0x31/0x40
      [40226.184555]  inet_sock_destruct+0x78d/0x9a0
      [40226.246355]  __sk_destruct+0xfa/0x8c0
      [40226.290612]  rcu_process_callbacks+0xaa0/0x18a0
      [40226.336816]  __do_softirq+0x241/0x75e
      [40226.367758]  irq_exit+0x1f6/0x220
      [40226.371458]  smp_apic_timer_interrupt+0x7b/0xa0
      [40226.376507]  apic_timer_interrupt+0x93/0xa0
      
      The warn_on happened when sk->sk_rmem_alloc wasn't 0 in inet_sock_destruct.
      After commit f970bd9e ("udp: implement memory accounting helpers"),
      udp uses udp_destruct_sock as sk_destruct, where it calls
      udp_rmem_release to release all rmem.
      
      But IPV6_ADDRFORM sockopt sets sk_destruct with inet_sock_destruct after
      changing family to PF_INET. If rmem is not 0 at that time, and there is
      no place to release rmem before calling inet_sock_destruct, the warn_on
      will be triggered.
      
      This patch fixes it by no longer setting sk_destruct in the IPV6_ADDRFORM
      sockopt. IPV6_ADDRFORM only works for tcp and udp: a TCP sock has already
      set its sk_destruct to inet_sock_destruct, and UDP has set it to
      udp_destruct_sock, since they were created.
      
      Fixes: f970bd9e ("udp: implement memory accounting helpers")
      Reported-by: ChunYu Wang <chunwang@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ipv6: set dst.obsolete when a cached route has expired · 440ea29a
      Xin Long authored
      
      [ Upstream commit 1e2ea8ad ]
      
      Currently, ipv6's dst_ops->check() doesn't check for cached route
      expiration, because it trusts dst_gc to clean the cached route up
      when it has expired.

      The problem is that dst_gc cleans the cached route only when its
      refcount is 1. Some other module (like xfrm) may keep holding it,
      and that module only releases it when dst_ops->check() fails.

      But without checking for the cached route expiration, .check()
      may always return true. Meanwhile, without the cached route being
      released, dst_gc can't delete it. As a result, this cached route
      never expires.
      
      This patch is to set dst.obsolete with DST_OBSOLETE_KILL in .gc
      when it's expired, and check obsolete != DST_OBSOLETE_FORCE_CHK
      in .check.
      
      Note that this is even needed when ipv6 dst_gc timer is removed
      one day. It would set dst.obsolete in .redirect and .update_pmtu
      instead, and check for cached route expiration when getting it,
      just like what ipv4 route does.
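
      The gc/check interplay can be sketched with a toy model. The enum
      values, struct layout and toy_* functions are assumptions made for
      illustration and do not mirror the kernel's dst code.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      enum toy_obsolete { TOY_FORCE_CHK, TOY_KILL };

      struct toy_dst {
          enum toy_obsolete obsolete;
          int  refcnt;     /* 1 means only the routing table holds it */
          long expires;
      };

      /* gc pass: mark an expired entry; report whether it can be freed now. */
      static bool toy_gc(struct toy_dst *d, long now)
      {
          if (now < d->expires)
              return false;
          d->obsolete = TOY_KILL;     /* tell holders the route is dead */
          return d->refcnt == 1;
      }

      /* .check(): the entry is usable only while it is not marked. */
      static bool toy_check(const struct toy_dst *d)
      {
          return d->obsolete == TOY_FORCE_CHK;
      }
      ```

      Once gc marks the entry, a holder's next toy_check() fails, the holder
      drops its reference, and a later gc pass can free the route.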
      Reported-by: Jianlin Shi <jishi@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cxgb4: Fix stack out-of-bounds read due to wrong size to t4_record_mbox() · 24bd86e6
      Stefano Brivio authored
      
      [ Upstream commit 0f308686 ]
      
      Passing commands for logging to t4_record_mbox() with size
      MBOX_LEN, when the actual command size is smaller,
      causes out-of-bounds stack accesses in t4_record_mbox() while
      copying command words here:
      
      	for (i = 0; i < size / 8; i++)
      		entry->cmd[i] = be64_to_cpu(cmd[i]);
      
      Up to 48 bytes from the stack are then leaked to debugfs.
      
      This happens whenever we send (and log) commands described by
      structs fw_sched_cmd (32 bytes leaked), fw_vi_rxmode_cmd (48),
      fw_hello_cmd (48), fw_bye_cmd (48), fw_initialize_cmd (48),
      fw_reset_cmd (48), fw_pfvf_cmd (32), fw_eq_eth_cmd (16),
      fw_eq_ctrl_cmd (32), fw_eq_ofld_cmd (32), fw_acl_mac_cmd(16),
      fw_rss_glb_config_cmd(32), fw_rss_vi_config_cmd(32),
      fw_devlog_cmd(32), fw_vi_enable_cmd(48), fw_port_cmd(32),
      fw_sched_cmd(32), fw_devlog_cmd(32).
      
      The cxgb4vf driver got this right instead.
      
      When we call t4_record_mbox() to log a command reply, a MBOX_LEN
      size can be used though, as get_mbox_rpl() will fill cmd_rpl up
      completely.
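
      The leaked byte counts listed above follow from simple subtraction. In
      this sketch TOY_MBOX_LEN is assumed to be 64 bytes, which is consistent
      with the "up to 48 bytes" figure for a 16-byte command; the toy_*
      helpers are hypothetical and the byte-order conversion is omitted.

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>

      #define TOY_MBOX_LEN 64   /* assumed mailbox length in bytes */

      struct toy_entry {
          uint64_t cmd[TOY_MBOX_LEN / 8];
      };

      /* Bytes read past a cmd_size-byte command when `logged` bytes are copied. */
      static size_t toy_overread(size_t cmd_size, size_t logged)
      {
          return logged > cmd_size ? logged - cmd_size : 0;
      }

      /* Copy exactly `size` bytes of the command and zero the remainder. */
      static void toy_record(struct toy_entry *e, const uint64_t *cmd, size_t size)
      {
          memset(e, 0, sizeof(*e));
          for (size_t i = 0; i < size / 8; i++)
              e->cmd[i] = cmd[i];
      }
      ```

      Passing the real command size, as in toy_record(&e, cmd, sizeof(cmd)),
      keeps the copy loop inside the caller's buffer.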
      
      Fixes: 7f080c3f ("cxgb4: Add support to enable logging of firmware mailbox commands")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: mvpp2: fix the mac address used when using PPv2.2 · 59b304fd
      Antoine Tenart authored
      
      [ Upstream commit 4c228682 ]
      
      The mac address is only retrieved from h/w when using PPv2.1. Otherwise
      the variable holding it is still checked and used if it contains a valid
      value. As the variable isn't initialized to an invalid mac address
      value, we end up with random mac addresses which can be the same for all
      the ports handled by this PPv2 driver.
      
      Fix this by initializing the h/w mac address variable to {0}, which is
      an invalid mac address value. This way the random assignment fallback
      is called and all ports end up with their own addresses.
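
      Why {0} triggers the fallback can be sketched as below. The validity
      rule (not all-zero, not multicast) matches the usual Ethernet address
      check; the toy_* function names are assumptions for illustration.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stdint.h>

      /* Valid unicast MAC: not all-zero and multicast bit clear. */
      static bool toy_is_valid_mac(const uint8_t mac[6])
      {
          uint8_t any = 0;

          for (int i = 0; i < 6; i++)
              any |= mac[i];
          return any != 0 && !(mac[0] & 0x01);
      }

      /* The random-address fallback runs when the h/w address is invalid. */
      static bool toy_needs_random_mac(const uint8_t hw_mac[6])
      {
          return !toy_is_valid_mac(hw_mac);
      }
      ```

      An uninitialized buffer may pass the validity check by accident, while
      a zeroed one never does.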
      Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
      Fixes: 26975821 ("net: mvpp2: handle misc PPv2.1/PPv2.2 differences")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • udp6: set rx_dst_cookie on rx_dst updates · 38ca2d39
      Paolo Abeni authored
      
      [ Upstream commit 64f0f5d1 ]
      
      Currently, in the udp6 code, the dst cookie is not initialized/updated
      concurrently with the RX dst used by early demux.
      
      As a result, the dst_check() in the early_demux path always fails,
      the rx dst cache is always invalidated, and we can't really
      leverage significant gain from the demux lookup.
      
      Fix it by adding a udp6-specific variant of sk_rx_dst_set() and using it
      to set the dst cookie when the dst entry is really changed.

      The issue has been there since the introduction of early demux for ipv6.
      
      Fixes: 5425077d ("net: ipv6: Add early demux handler for UDP unicast")
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • netvsc: fix deadlock betwen link status and removal · b4426cf2
      stephen hemminger authored
      
      [ Upstream commit 9b4e946c ]
      
      There is a deadlock possible when canceling the link status
      delayed work queue. The removal process is run with RTNL held,
      and the link status callback is acquiring RTNL.

      Resolve the issue by using trylock and rescheduling.
      If cancel is in progress, that blocks the rescheduling from happening.
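
      The trylock-and-reschedule idea can be sketched in user space. Here a
      C11 atomic_flag stands in for RTNL and a return value stands in for
      re-queuing the delayed work; all toy_* names are assumptions.

      ```c
      #include <assert.h>
      #include <stdatomic.h>
      #include <stdbool.h>

      static atomic_flag toy_rtnl = ATOMIC_FLAG_INIT;   /* stand-in for RTNL */

      static bool toy_rtnl_trylock(void)
      {
          return !atomic_flag_test_and_set(&toy_rtnl);
      }

      static void toy_rtnl_unlock(void)
      {
          atomic_flag_clear(&toy_rtnl);
      }

      struct toy_work {
          int runs;          /* times the link change was processed */
          int reschedules;   /* times the work asked to be re-queued */
      };

      /* Link-status work: never blocks on the lock; true means "re-queue me". */
      static bool toy_link_change(struct toy_work *w)
      {
          if (!toy_rtnl_trylock()) {
              w->reschedules++;
              return true;
          }
          w->runs++;
          toy_rtnl_unlock();
          return false;
      }
      ```

      Because the work never sleeps waiting for the lock, a remover that
      holds the lock while waiting for the work to finish cannot deadlock
      against it.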
      
      Fixes: 122a5f64 ("staging: hv: use delayed_work for netvsc_send_garp()")
      Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: systemport: Free DMA coherent descriptors on errors · 3f0204b0
      Florian Fainelli authored
      
      [ Upstream commit c2062ee3 ]
      
      In case bcm_sysport_init_tx_ring() is not able to allocate ring->cbs, we
      would return with an error, and call bcm_sysport_fini_tx_ring() and it
      would see that ring->cbs is NULL and do nothing. This would leak the
      coherent DMA descriptor area, so we need to free it on error before
      returning.
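
      A user-space sketch of the error path described above. malloc() stands
      in for the coherent DMA allocation, and the toy_* names are
      hypothetical rather than the driver's functions.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stdlib.h>

      struct toy_ring {
          void *desc;   /* stands in for the coherent DMA descriptor area */
          void *cbs;    /* control blocks */
      };

      static int toy_init_ring(struct toy_ring *r, bool cbs_alloc_fails)
      {
          r->desc = malloc(256);
          if (!r->desc)
              return -1;

          r->cbs = cbs_alloc_fails ? NULL : calloc(16, sizeof(void *));
          if (!r->cbs) {
              /* Free here: the fini path below bails out when cbs is NULL. */
              free(r->desc);
              r->desc = NULL;
              return -1;
          }
          return 0;
      }

      static void toy_fini_ring(struct toy_ring *r)
      {
          if (!r->cbs)
              return;           /* mirrors the early return described above */
          free(r->cbs);
          free(r->desc);
          r->cbs = NULL;
          r->desc = NULL;
      }
      ```

      Without the free() in the init error path, the descriptor area would
      be leaked because the fini routine returns early.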
      Reported-by: Eric Dumazet <edumazet@gmail.com>
      Fixes: 80105bef ("net: systemport: add Broadcom SYSTEMPORT Ethernet MAC driver")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: bcmgenet: Be drop monitor friendly · 71dd9ac5
      Florian Fainelli authored
      
      [ Upstream commit d4fec855 ]
      
      There are 3 spots where we call dev_kfree_skb() but we are actually
      just doing a normal SKB consumption: __bcmgenet_tx_reclaim() for normal
      TX reclamation, bcmgenet_alloc_rx_buffers() during the initial RX ring
      setup and bcmgenet_free_rx_buffers() during RX ring cleanup.
      
      Fixes: d6707bec ("net: bcmgenet: rewrite bcmgenet_rx_refill()")
      Fixes: f48bed16 ("net: bcmgenet: Free skb after last Tx frag")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      71dd9ac5
    • Florian Fainelli's avatar
      net: systemport: Be drop monitor friendly · 7def678f
      Florian Fainelli authored
      
      [ Upstream commit c45182eb ]
      
      Utilize dev_consume_skb_any(cb->skb) in bcm_sysport_free_cb() which is
      used when a TX packet is completed, as well as when the RX ring is
      cleaned on shutdown. Neither of these two cases is a packet drop, so be
      drop monitor friendly.
      Suggested-by: Eric Dumazet <edumazet@gmail.com>
      Fixes: 80105bef ("net: systemport: add Broadcom SYSTEMPORT Ethernet MAC driver")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7def678f
    • Bob Peterson's avatar
      tipc: Fix tipc_sk_reinit handling of -EAGAIN · c86a65cf
      Bob Peterson authored
      
      [ Upstream commit 6c7e983b ]
      
      In commit 9dbbfb0a, function tipc_sk_reinit() had additional logic
      added to loop in the event that rhashtable_walk_next() returned
      -EAGAIN. No worries.
      
      However, if rhashtable_walk_start returns -EAGAIN, it does "continue",
      and therefore skips the call to rhashtable_walk_stop(). That has
      the effect of calling rcu_read_lock() without its paired call to
      rcu_read_unlock(). Since rcu_read_lock() may be nested, the problem
      may not be apparent for a while, especially since resize events may
      be rare. But the comments to rhashtable_walk_start() state:
      
       * ...Note that we take the RCU lock in all
       * cases including when we return an error.  So you must always call
       * rhashtable_walk_stop to clean up.
      
      This patch replaces the continue with a goto and label to ensure a
      matching call to rhashtable_walk_stop().
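
      The control flow can be sketched with a toy walker that counts lock
      depth. walk_start(), walk_stop(), rcu_depth, start_failures_left and
      reinit_all() are hypothetical stand-ins, not the rhashtable API; the
      point is only that the goto keeps every start paired with a stop.

```c
#include <errno.h>

static int rcu_depth;            /* models rcu_read_lock() nesting */
static int start_failures_left;  /* how many times walk_start() reports -EAGAIN */

/* Like the real walker, takes the lock even when it returns an error. */
static int walk_start(void)
{
	rcu_depth++;
	if (start_failures_left > 0) {
		start_failures_left--;
		return -EAGAIN;
	}
	return 0;
}

static void walk_stop(void)
{
	rcu_depth--;
}

/* Returns the number of completed passes over the table. */
static int reinit_all(void)
{
	int err, passes = 0;

	do {
		err = walk_start();
		if (err)
			goto stop;  /* a bare "continue" here would skip walk_stop() */
		passes++;           /* the per-socket work would happen here */
stop:
		walk_stop();
	} while (err == -EAGAIN);

	return passes;
}
```
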
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c86a65cf
    • Arnd Bergmann's avatar
      qlge: avoid memcpy buffer overflow · 8aafed19
      Arnd Bergmann authored
      
      [ Upstream commit e58f9583 ]
      
      gcc-8.0.0 (snapshot) points out that we copy a variable-length string
      into a fixed length field using memcpy() with the destination length,
      and that ends up copying whatever follows the string:
      
          inlined from 'ql_core_dump' at drivers/net/ethernet/qlogic/qlge/qlge_dbg.c:1106:2:
      drivers/net/ethernet/qlogic/qlge/qlge_dbg.c:708:2: error: 'memcpy' reading 15 bytes from a region of size 14 [-Werror=stringop-overflow=]
        memcpy(seg_hdr->description, desc, (sizeof(seg_hdr->description)) - 1);
      
      Changing it to use strncpy() will instead zero-pad the destination,
      which seems to be the right thing to do here.
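
      A small userspace sketch of the difference, assuming a 16-byte
      description field. struct seg_hdr here and set_description() are
      illustrative stand-ins, and the explicit terminator on the last byte is
      an addition for this sketch.

```c
#include <string.h>

struct seg_hdr {
	char description[16];  /* fixed-length field, size assumed for the sketch */
};

static void set_description(struct seg_hdr *hdr, const char *desc)
{
	/*
	 * strncpy() stops reading at the end of desc and zero-fills the rest
	 * of the first 15 bytes, so nothing past the source string is read.
	 * memcpy() with the same length would read 15 bytes regardless.
	 */
	strncpy(hdr->description, desc, sizeof(hdr->description) - 1);
	hdr->description[sizeof(hdr->description) - 1] = '\0';
}
```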
      
      The bug is probably harmless, but it seems like a good idea to address
      it in stable kernels as well, if only for the purpose of building with
      gcc-8 without warnings.
      
      Fixes: a61f8026 ("qlge: Add ethtool register dump function.")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8aafed19
    • Stefano Brivio's avatar
      sctp: Avoid out-of-bounds reads from address storage · 6da13824
      Stefano Brivio authored
      
      [ Upstream commit ee6c88bb ]
      
      inet_diag_msg_sctp{,l}addr_fill() and sctp_get_sctp_info() copy
      sizeof(sockaddr_storage) bytes to fill in sockaddr structs used
      to export diagnostic information to userspace.
      
      However, the memory allocated to store sockaddr information is
      smaller than that and depends on the address family, so we leak
      up to 100 uninitialized bytes to userspace. Just use the size of
      the source structs instead; in all three cases this is what
      userspace expects. Zero out the remaining memory.
      
      Unused bytes (i.e. when IPv4 addresses are used) in source
      structs sctp_sockaddr_entry and sctp_transport are already
      cleared by sctp_add_bind_addr() and sctp_transport_new(),
      respectively.
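
      The copy-then-zero pattern can be sketched as a generic helper. This
      is a userspace illustration under the assumption that the destination
      slot is larger than the source object; fill_addr() is a hypothetical
      name, not a kernel function.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy only the bytes the source object really has, then zero the rest of
 * the destination slot so no stale memory reaches the consumer.
 */
static void fill_addr(unsigned char *dst, size_t dst_len,
		      const void *src, size_t src_len)
{
	if (src_len > dst_len)
		src_len = dst_len;
	memcpy(dst, src, src_len);
	memset(dst + src_len, 0, dst_len - src_len);
}
```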
      
      Noticed while testing KASAN-enabled kernel with 'ss':
      
      [ 2326.885243] BUG: KASAN: slab-out-of-bounds in inet_sctp_diag_fill+0x42c/0x6c0 [sctp_diag] at addr ffff881be8779800
      [ 2326.896800] Read of size 128 by task ss/9527
      [ 2326.901564] CPU: 0 PID: 9527 Comm: ss Not tainted 4.11.0-22.el7a.x86_64 #1
      [ 2326.909236] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017
      [ 2326.917585] Call Trace:
      [ 2326.920312]  dump_stack+0x63/0x8d
      [ 2326.924014]  kasan_object_err+0x21/0x70
      [ 2326.928295]  kasan_report+0x288/0x540
      [ 2326.932380]  ? inet_sctp_diag_fill+0x42c/0x6c0 [sctp_diag]
      [ 2326.938500]  ? skb_put+0x8b/0xd0
      [ 2326.942098]  ? memset+0x31/0x40
      [ 2326.945599]  check_memory_region+0x13c/0x1a0
      [ 2326.950362]  memcpy+0x23/0x50
      [ 2326.953669]  inet_sctp_diag_fill+0x42c/0x6c0 [sctp_diag]
      [ 2326.959596]  ? inet_diag_msg_sctpasoc_fill+0x460/0x460 [sctp_diag]
      [ 2326.966495]  ? __lock_sock+0x102/0x150
      [ 2326.970671]  ? sock_def_wakeup+0x60/0x60
      [ 2326.975048]  ? remove_wait_queue+0xc0/0xc0
      [ 2326.979619]  sctp_diag_dump+0x44a/0x760 [sctp_diag]
      [ 2326.985063]  ? sctp_ep_dump+0x280/0x280 [sctp_diag]
      [ 2326.990504]  ? memset+0x31/0x40
      [ 2326.994007]  ? mutex_lock+0x12/0x40
      [ 2326.997900]  __inet_diag_dump+0x57/0xb0 [inet_diag]
      [ 2327.003340]  ? __sys_sendmsg+0x150/0x150
      [ 2327.007715]  inet_diag_dump+0x4d/0x80 [inet_diag]
      [ 2327.012979]  netlink_dump+0x1e6/0x490
      [ 2327.017064]  __netlink_dump_start+0x28e/0x2c0
      [ 2327.021924]  inet_diag_handler_cmd+0x189/0x1a0 [inet_diag]
      [ 2327.028045]  ? inet_diag_rcv_msg_compat+0x1b0/0x1b0 [inet_diag]
      [ 2327.034651]  ? inet_diag_dump_compat+0x190/0x190 [inet_diag]
      [ 2327.040965]  ? __netlink_lookup+0x1b9/0x260
      [ 2327.045631]  sock_diag_rcv_msg+0x18b/0x1e0
      [ 2327.050199]  netlink_rcv_skb+0x14b/0x180
      [ 2327.054574]  ? sock_diag_bind+0x60/0x60
      [ 2327.058850]  sock_diag_rcv+0x28/0x40
      [ 2327.062837]  netlink_unicast+0x2e7/0x3b0
      [ 2327.067212]  ? netlink_attachskb+0x330/0x330
      [ 2327.071975]  ? kasan_check_write+0x14/0x20
      [ 2327.076544]  netlink_sendmsg+0x5be/0x730
      [ 2327.080918]  ? netlink_unicast+0x3b0/0x3b0
      [ 2327.085486]  ? kasan_check_write+0x14/0x20
      [ 2327.090057]  ? selinux_socket_sendmsg+0x24/0x30
      [ 2327.095109]  ? netlink_unicast+0x3b0/0x3b0
      [ 2327.099678]  sock_sendmsg+0x74/0x80
      [ 2327.103567]  ___sys_sendmsg+0x520/0x530
      [ 2327.107844]  ? __get_locked_pte+0x178/0x200
      [ 2327.112510]  ? copy_msghdr_from_user+0x270/0x270
      [ 2327.117660]  ? vm_insert_page+0x360/0x360
      [ 2327.122133]  ? vm_insert_pfn_prot+0xb4/0x150
      [ 2327.126895]  ? vm_insert_pfn+0x32/0x40
      [ 2327.131077]  ? vvar_fault+0x71/0xd0
      [ 2327.134968]  ? special_mapping_fault+0x69/0x110
      [ 2327.140022]  ? __do_fault+0x42/0x120
      [ 2327.144008]  ? __handle_mm_fault+0x1062/0x17a0
      [ 2327.148965]  ? __fget_light+0xa7/0xc0
      [ 2327.153049]  __sys_sendmsg+0xcb/0x150
      [ 2327.157133]  ? __sys_sendmsg+0xcb/0x150
      [ 2327.161409]  ? SyS_shutdown+0x140/0x140
      [ 2327.165688]  ? exit_to_usermode_loop+0xd0/0xd0
      [ 2327.170646]  ? __do_page_fault+0x55d/0x620
      [ 2327.175216]  ? __sys_sendmsg+0x150/0x150
      [ 2327.179591]  SyS_sendmsg+0x12/0x20
      [ 2327.183384]  do_syscall_64+0xe3/0x230
      [ 2327.187471]  entry_SYSCALL64_slow_path+0x25/0x25
      [ 2327.192622] RIP: 0033:0x7f41d18fa3b0
      [ 2327.196608] RSP: 002b:00007ffc3b731218 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      [ 2327.205055] RAX: ffffffffffffffda RBX: 00007ffc3b731380 RCX: 00007f41d18fa3b0
      [ 2327.213017] RDX: 0000000000000000 RSI: 00007ffc3b731340 RDI: 0000000000000003
      [ 2327.220978] RBP: 0000000000000002 R08: 0000000000000004 R09: 0000000000000040
      [ 2327.228939] R10: 00007ffc3b730f30 R11: 0000000000000246 R12: 0000000000000003
      [ 2327.236901] R13: 00007ffc3b731340 R14: 00007ffc3b7313d0 R15: 0000000000000084
      [ 2327.244865] Object at ffff881be87797e0, in cache kmalloc-64 size: 64
      [ 2327.251953] Allocated:
      [ 2327.254581] PID = 9484
      [ 2327.257215]  save_stack_trace+0x1b/0x20
      [ 2327.261485]  save_stack+0x46/0xd0
      [ 2327.265179]  kasan_kmalloc+0xad/0xe0
      [ 2327.269165]  kmem_cache_alloc_trace+0xe6/0x1d0
      [ 2327.274138]  sctp_add_bind_addr+0x58/0x180 [sctp]
      [ 2327.279400]  sctp_do_bind+0x208/0x310 [sctp]
      [ 2327.284176]  sctp_bind+0x61/0xa0 [sctp]
      [ 2327.288455]  inet_bind+0x5f/0x3a0
      [ 2327.292151]  SYSC_bind+0x1a4/0x1e0
      [ 2327.295944]  SyS_bind+0xe/0x10
      [ 2327.299349]  do_syscall_64+0xe3/0x230
      [ 2327.303433]  return_from_SYSCALL_64+0x0/0x6a
      [ 2327.308194] Freed:
      [ 2327.310434] PID = 4131
      [ 2327.313065]  save_stack_trace+0x1b/0x20
      [ 2327.317344]  save_stack+0x46/0xd0
      [ 2327.321040]  kasan_slab_free+0x73/0xc0
      [ 2327.325220]  kfree+0x96/0x1a0
      [ 2327.328530]  dynamic_kobj_release+0x15/0x40
      [ 2327.333195]  kobject_release+0x99/0x1e0
      [ 2327.337472]  kobject_put+0x38/0x70
      [ 2327.341266]  free_notes_attrs+0x66/0x80
      [ 2327.345545]  mod_sysfs_teardown+0x1a5/0x270
      [ 2327.350211]  free_module+0x20/0x2a0
      [ 2327.354099]  SyS_delete_module+0x2cb/0x2f0
      [ 2327.358667]  do_syscall_64+0xe3/0x230
      [ 2327.362750]  return_from_SYSCALL_64+0x0/0x6a
      [ 2327.367510] Memory state around the buggy address:
      [ 2327.372855]  ffff881be8779700: fc fc fc fc 00 00 00 00 00 00 00 00 fc fc fc fc
      [ 2327.380914]  ffff881be8779780: fb fb fb fb fb fb fb fb fc fc fc fc 00 00 00 00
      [ 2327.388972] >ffff881be8779800: 00 00 00 00 fc fc fc fc fb fb fb fb fb fb fb fb
      [ 2327.397031]                                ^
      [ 2327.401792]  ffff881be8779880: fc fc fc fc fb fb fb fb fb fb fb fb fc fc fc fc
      [ 2327.409850]  ffff881be8779900: 00 00 00 00 00 04 fc fc fc fc fc fc 00 00 00 00
      [ 2327.417907] ==================================================================
      
      This fixes CVE-2017-7558.
      
      References: https://bugzilla.redhat.com/show_bug.cgi?id=1480266
      Fixes: 8f840e47 ("sctp: add the sctp_diag.c file")
      Cc: Xin Long <lucien.xin@gmail.com>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Reviewed-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6da13824
    • Florian Fainelli's avatar
      fsl/man: Inherit parent device and of_node · 207ab5d5
      Florian Fainelli authored
      
      [ Upstream commit a1a50c8e ]
      
      Junote Cai reported that he was not able to get a DSA setup involving the
      Freescale DPAA/FMAN driver to work and narrowed it down to
      of_find_net_device_by_node(). This function requires the network device's
      device reference to be correctly set which is the case here, though we have
      lost any device_node association there.
      
      The problem is that dpaa_eth_add_device() allocates a "dpaa-ethernet" platform
      device, and later on dpaa_eth_probe() is called but SET_NETDEV_DEV() won't be
      propagating &pdev->dev.of_node properly. Fix this by inheriting both the parent
      device and the of_node when dpaa_eth_add_device() creates the platform device.
      
      Fixes: 39339616 ("fsl/fman: Add FMan MAC driver")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      207ab5d5
    • Daniel Borkmann's avatar
      bpf: fix map value attribute for hash of maps · 4670d796
      Daniel Borkmann authored
      
      [ Upstream commit 33ba43ed ]
      
      Currently, iproute2's BPF ELF loader works fine with array of maps
      when retrieving the fd from a pinned node and doing a selfcheck
      against the provided map attributes from the object file, but we
      fail to do the same for hash of maps and thus refuse to get the
      map from pinned node.
      
      Reason is that when allocating hash of maps, fd_htab_map_alloc() will
      set the value size to sizeof(void *), and any user space map creation
      requests are forced to set 4 bytes as value size. Thus, selfcheck
      will complain about exposed 8 bytes on 64 bit archs vs. 4 bytes from
      object file as value size. Contract is that fdinfo or BPF_MAP_GET_FD_BY_ID
      returns the value size used to create the map.
      
      Fix it by handling it the same way as we do for array of maps, which
      means that we leave value size at 4 bytes and in the allocation phase
      round up value size to 8 bytes. alloc_htab_elem() needs an adjustment
      in order to copy rounded up 8 bytes due to bpf_fd_htab_map_update_elem()
      calling into htab_map_update_elem() with the pointer of the map
      pointer as value. Unlike array of maps where we just xchg(), we're
      using the generic htab_map_update_elem() callback also used from helper
      calls, which has already published the key/value on return, so we need
      to make sure memcpy() copies the right size.
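
      The size arithmetic above (a 4-byte value size reported to user space,
      8 bytes reserved and copied internally) amounts to rounding up to a
      multiple of 8. A minimal sketch, with round_up8() as a hypothetical
      userspace helper rather than the kernel's own macro:

```c
#include <stddef.h>

/* Round n up to the next multiple of 8 (8 is a power of two). */
static size_t round_up8(size_t n)
{
	return (n + 7) & ~(size_t)7;
}
```

      With this, a declared value size of 4 yields an 8-byte storage slot,
      while the size exposed to user space stays at 4.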
      
      Fixes: bcc6b1b7 ("bpf: Add hash of maps support")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4670d796
    • Eric Dumazet's avatar
      udp: on peeking bad csum, drop packets even if not at head · 79d6457e
      Eric Dumazet authored
      
      [ Upstream commit fd6055a8 ]
      
      When peeking, if a bad csum is discovered, the skb is unlinked from
      the queue with __sk_queue_drop_skb and the peek operation restarted.
      
      __sk_queue_drop_skb only drops packets that match the queue head.
      
      This fails if the skb was found after the head, using SO_PEEK_OFF
      socket option. This causes an infinite loop.
      
      We MUST drop this problematic skb, and we can simply check whether the
      skb was already removed by another thread by looking at skb->next:

      This pointer is set to NULL by the __skb_unlink() operation, which can
      only have happened under the spinlock protection.
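
      The check can be sketched with a toy circular doubly linked queue in
      userspace C. struct node, queue_init(), queue_tail(), unlink_node() and
      drop_node() are hypothetical names; locking is omitted, so the sketch
      assumes the caller already holds the queue lock.

```c
#include <errno.h>
#include <stddef.h>

struct node {
	struct node *next, *prev;
};

static void queue_init(struct node *head)
{
	head->next = head->prev = head;
}

static void queue_tail(struct node *head, struct node *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlinking clears the pointers, which is what makes the check below work. */
static void unlink_node(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = NULL;
}

/* Drop n wherever it sits in the queue, not only when it is at the head. */
static int drop_node(struct node *n)
{
	if (n->next == NULL)
		return -ENOENT;  /* someone else already removed it */
	unlink_node(n);
	return 0;
}
```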
      
      Many thanks to syzkaller team (and particularly Dmitry Vyukov who
      provided us nice C reproducers exhibiting the lockup) and Willem de
      Bruijn who provided first version for this patch and a test program.
      
      Fixes: 627d2d6b ("udp: enable MSG_PEEK at non-zero offset")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      79d6457e
    • Sabrina Dubroca's avatar
      macsec: add genl family module alias · 1999821f
      Sabrina Dubroca authored
      
      [ Upstream commit 78362998 ]
      
      This helps tools such as wpa_supplicant start even if the macsec
      module isn't loaded yet.
      
      Fixes: c09440f7 ("macsec: introduce IEEE 802.1AE driver")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1999821f
    • Wei Wang's avatar
      ipv6: fix sparse warning on rt6i_node · 517e43bd
      Wei Wang authored
      
      [ Upstream commit 4e587ea7 ]
      
      Commit c5cff856 adds rcu grace period before freeing fib6_node. This
      generates a new sparse warning on rt->rt6i_node related code:
        net/ipv6/route.c:1394:30: error: incompatible types in comparison
        expression (different address spaces)
        ./include/net/ip6_fib.h:187:14: error: incompatible types in comparison
        expression (different address spaces)
      
      This commit adds "__rcu" tag for rt6i_node and makes sure corresponding
      rcu API is used for it.
      After this fix, sparse no longer generates the above warning.
      
      Fixes: c5cff856 ("ipv6: add rcu grace period before freeing fib6_node")
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      517e43bd
    • Wei Wang's avatar
      ipv6: add rcu grace period before freeing fib6_node · 640efece
      Wei Wang authored
      
      [ Upstream commit c5cff856 ]
      
      We currently keep rt->rt6i_node pointing to the fib6_node for the route.
      And some functions make use of this pointer to dereference the fib6_node
      from rt structure, e.g. rt6_check(). However, as there is neither
      refcount nor rcu taken when dereferencing rt->rt6i_node, it could
      potentially cause crashes as rt->rt6i_node could be set to NULL by other
      CPUs when doing a route deletion.
      This patch introduces an rcu grace period before freeing fib6_node and
      makes sure the functions that dereference it take rcu_read_lock().

      Note: there is no "Fixes" tag because this bug has been there since a
      very early stage.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      640efece
    • Stefano Brivio's avatar
      ipv6: accept 64k - 1 packet length in ip6_find_1stfragopt() · 76d3e7ff
      Stefano Brivio authored
      
      [ Upstream commit 3de33e1b ]
      
      A packet length of exactly IPV6_MAXPLEN is allowed, we should
      refuse parsing options only if the size is 64KiB or more.
      
      While at it, remove one extra variable and one assignment which
      were also introduced by the commit that introduced the size
      check. Checking the sum 'offset + len' and only later adding
      'len' to 'offset' doesn't provide any advantage over directly
      summing to 'offset' and checking it.
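
      The boundary condition can be sketched as follows. advance_offset() is
      a hypothetical helper, not the kernel function; IPV6_MAXPLEN is defined
      locally as 65535 so that the snippet is self-contained.

```c
#include <errno.h>

#define IPV6_MAXPLEN 65535

/*
 * Add the option length straight to the running offset and reject only
 * when the result exceeds IPV6_MAXPLEN, so exactly 64k - 1 is accepted.
 */
static int advance_offset(unsigned int *offset, unsigned int optlen)
{
	*offset += optlen;
	if (*offset > IPV6_MAXPLEN)
		return -EINVAL;
	return 0;
}
```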
      
      Fixes: 6399f1fa ("ipv6: avoid overflow of offset in ip6_find_1stfragopt")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      76d3e7ff
  2. 13 Sep, 2017 3 commits
    • Greg Kroah-Hartman's avatar
      Linux 4.12.13 · 5d7d2e03
      Greg Kroah-Hartman authored
      5d7d2e03
    • Richard Wareing's avatar
      xfs: XFS_IS_REALTIME_INODE() should be false if no rt device present · 9f7df0bc
      Richard Wareing authored
      commit b31ff3cd upstream.
      
      If using a kernel with CONFIG_XFS_RT=y and we set the RTINHERIT flag on
      a directory in a filesystem that does not have a realtime device and
      create a new file in that directory, it gets marked as a real time file.
      When data is written and a fsync is issued, the filesystem attempts to
      flush a non-existent rt device during the fsync process.
      
      This results in a crash dereferencing a null buftarg pointer in
      xfs_blkdev_issue_flush():
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: xfs_blkdev_issue_flush+0xd/0x20
        .....
        Call Trace:
          xfs_file_fsync+0x188/0x1c0
          vfs_fsync_range+0x3b/0xa0
          do_fsync+0x3d/0x70
          SyS_fsync+0x10/0x20
          do_syscall_64+0x4d/0xb0
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Setting RT inode flags does not require special privileges so any
      unprivileged user can cause this oops to occur.  To reproduce, confirm
      kernel is compiled with CONFIG_XFS_RT=y and run:
      
        # mkfs.xfs -f /dev/pmem0
        # mount /dev/pmem0 /mnt/test
        # mkdir /mnt/test/foo
        # xfs_io -c 'chattr +t' /mnt/test/foo
        # xfs_io -f -c 'pwrite 0 5m' -c fsync /mnt/test/foo/bar
      
      Or just run xfstests with MKFS_OPTIONS="-d rtinherit=1" and wait.
      
      Kernels built with CONFIG_XFS_RT=n are not exposed to this bug.
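
      The idea in the subject line, treating an inode as realtime only when
      a realtime device is actually present, can be sketched like this. The
      structs, the FLAG_REALTIME bit and is_realtime_inode() are hypothetical
      simplifications, not the XFS definitions.

```c
#include <stdbool.h>
#include <stddef.h>

#define FLAG_REALTIME 0x1u  /* hypothetical flag bit for this sketch */

struct rt_mount {
	void *rt_dev;  /* NULL when the filesystem has no realtime device */
};

struct rt_inode {
	unsigned int flags;
	const struct rt_mount *mnt;
};

/* The flag alone is not enough: the mount must also have an rt device. */
static bool is_realtime_inode(const struct rt_inode *ip)
{
	return (ip->flags & FLAG_REALTIME) && ip->mnt->rt_dev != NULL;
}
```

      With a predicate of this shape, an inode that merely inherited the
      flag no longer sends fsync down the path that flushes a device that
      does not exist.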
      
      Fixes: f538d4da ("[XFS] write barrier support")
      Signed-off-by: Richard Wareing <rwareing@fb.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9f7df0bc
    • Trond Myklebust's avatar
      NFSv4: Fix up mirror allocation · da0f4931
      Trond Myklebust authored
      commit 14abcb0b upstream.
      
      There are a number of callers of nfs_pageio_complete() that want to
      continue using the nfs_pageio_descriptor without needing to call
      nfs_pageio_init() again. Examples include nfs_pageio_resend() and
      nfs_pageio_cond_complete().
      
      The problem is that nfs_pageio_complete() also calls
      nfs_pageio_cleanup_mirroring(), which frees up the array of mirrors.
      This can lead to writeback errors, in the next call to
      nfs_pageio_setup_mirroring().
      
      Fix by simply moving the allocation of the mirrors to
      nfs_pageio_setup_mirroring().
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=196709
      Reported-by: JianhongYin <yin-jianhong@163.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      da0f4931