1. 02 Jul, 2018 1 commit
    • net: fix use-after-free in GRO with ESP · 603d4cf8
      Sabrina Dubroca authored
      Since the addition of GRO for ESP, gro_receive can consume the skb and
      return -EINPROGRESS. In that case, the lower layer GRO handler cannot
      touch the skb anymore.
      
      Commit 5f114163 ("net: Add a skb_gro_flush_final helper.") converted
      some of the gro_receive handlers that can lead to ESP's gro_receive so
      that they wouldn't access the skb when -EINPROGRESS is returned, but
      missed other spots, mainly in tunneling protocols.
      
      This patch finishes the conversion to using skb_gro_flush_final(), and
      adds a new helper, skb_gro_flush_final_remcsum(), used in VXLAN and
      GUE.
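
      The rule above can be sketched as a small Python model. This is not
      kernel code: Skb, esp_gro_receive and tunnel_gro_receive are
      hypothetical stand-ins, and only the -EINPROGRESS check mirrors what
      this message describes for skb_gro_flush_final().

```python
import errno


class Skb:
    """Toy stand-in for an sk_buff; `consumed` marks that ESP took ownership."""

    def __init__(self):
        self.flush = 0
        self.consumed = False

    def set_flush(self, flush):
        # Touching a consumed skb is the use-after-free being modelled.
        if self.consumed:
            raise RuntimeError("skb touched after it was consumed")
        self.flush |= flush


def esp_gro_receive(skb):
    """Models an inner handler that consumes the skb and returns -EINPROGRESS."""
    skb.consumed = True
    return -errno.EINPROGRESS


def flush_final(skb, result, flush):
    """Models the helper's rule: leave the skb alone if it was consumed."""
    if result != -errno.EINPROGRESS:
        skb.set_flush(flush)


def tunnel_gro_receive(skb, inner):
    """Outer (tunnel) handler: run the inner handler, then finalize the flush."""
    result = inner(skb)
    flush_final(skb, result, flush=1)
    return result
```

      With the check in place the outer handler never touches a consumed
      skb; calling skb.set_flush() unconditionally would raise in this model.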
      
      Fixes: 5f114163 ("net: Add a skb_gro_flush_final helper.")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 01 Jul, 2018 2 commits
    • tcp: prevent bogus FRTO undos with non-SACK flows · 1236f22f
      Ilpo Järvinen authored
      If SACK is not enabled and the first cumulative ACK after the RTO
      retransmission covers more than the retransmitted skb, a spurious
      FRTO undo will trigger (assuming FRTO is enabled for that RTO).
      The reason is that any non-retransmitted segment acknowledged will
      set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
      no indication that it would have been delivered for real (the
      scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
      case so the check for that bit won't help like it does with SACK).
      Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
      in tcp_process_loss.
      
      We need to use a stricter condition for the non-SACK case and check
      that none of the cumulatively ACKed segments were retransmitted
      to prove that progress is due to original transmissions. Only then
      keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
      non-SACK case.
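
      A simplified Python sketch of that condition follows. The flag bit
      values, the Segment type and clean_rtx_queue_flags() are illustrative
      assumptions, not the kernel's definitions; only the final non-SACK
      check reflects the rule described here.

```python
from dataclasses import dataclass

# Illustrative bit values; the kernel's actual flag values are not assumed.
FLAG_RETRANS_DATA_ACKED = 0x1
FLAG_ORIG_SACK_ACKED = 0x2


@dataclass
class Segment:
    retransmitted: bool
    sacked: bool = False  # only meaningful when SACK keeps a scoreboard


def clean_rtx_queue_flags(acked, sack_enabled):
    """Return the flag word after processing cumulatively ACKed segments."""
    flag = 0
    for seg in acked:
        if seg.retransmitted:
            flag |= FLAG_RETRANS_DATA_ACKED
        elif not seg.sacked:
            flag |= FLAG_ORIG_SACK_ACKED
    # Stricter non-SACK rule: without a scoreboard, any retransmitted
    # segment in the ACKed range means progress cannot be attributed to
    # original transmissions, so the flag is dropped.
    if not sack_enabled and flag & FLAG_RETRANS_DATA_ACKED:
        flag &= ~FLAG_ORIG_SACK_ACKED
    return flag
```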
      
      (FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS
      to better indicate its purpose but to keep this change minimal, it
      will be done in another patch).
      
      Besides burstiness and congestion control violations, this problem
      can result in an RTO loop: when the loss recovery is prematurely
      undone, only new data will be transmitted (if available) and
      the next retransmission can occur only after a new RTO, which in case
      of multiple losses (that are not for consecutive packets) requires
      one RTO per loss to recover.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 271b955e
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2018-07-01
      
      The following pull-request contains BPF updates for your *net* tree.
      
      The main changes are:
      
      1) A bpf_fib_lookup() helper fix to change the API before freeze to
         return an encoding of the FIB lookup result and return the nexthop
         device index in the params struct (instead of device index as return
         code that we had before), from David.
      
      2) Various BPF JIT fixes to address syzkaller fallout, that is, do not
         reject progs when set_memory_*() fails since it could still be RO.
         Also arm32 JIT was not using bpf_jit_binary_lock_ro() API which was
         an issue, and a memory leak in s390 JIT found during review, from
         Daniel.
      
      3) Multiple fixes for sockmap/hash to address most of the syzkaller
         triggered bugs. Usage with IPv6 was crashing, a GPF in bpf_tcp_close(),
         a missing sock_map_release() routine to hook up to callbacks, and a
         fix for an omitted bucket lock in sock_close(), from John.
      
      4) Two bpftool fixes to remove duplicated error message on program load,
         and another one to close the libbpf object after program load. One
         additional fix for nfp driver's BPF offload to avoid stopping offload
         completely if replace of program failed, from Jakub.
      
      5) Couple of BPF selftest fixes that bail out in some of the test
         scripts if the user does not have the right privileges, from Jeffrin.
      
      6) Fixes in test_bpf for s390 when CONFIG_BPF_JIT_ALWAYS_ON is set
         where we need to set the flag that some of the test cases are expected
         to fail, from Kleber.
      
      7) Fix to detangle BPF_LIRC_MODE2 dependency from CONFIG_CGROUP_BPF
         since it has no relation to it and lirc2 users often have configs
         without cgroups enabled and thus would not be able to use it, from Sean.
      
      8) Fix a selftest failure in sockmap by removing a useless setrlimit()
         call that would set a too low limit where at the same time we are
         already including bpf_rlimit.h that does the job, from Yonghong.
      
      9) Fix BPF selftest config with missing NET_SCHED, from Anders.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 30 Jun, 2018 27 commits
    • Merge branch 'bpf-sockmap-fixes' · bf2b866a
      Daniel Borkmann authored
      John Fastabend says:
      
      ====================
      This addresses two syzbot issues that lead to identifying (by Eric and
      Wei) a class of bugs where we don't correctly check for IPv4/v6
      sockets and their associated state. The second issue was a locking
      omission in sockhash.
      
      The first patch addresses IPv6 socks and fixes an error where
      sockhash would overwrite the prot pointer with the IPv4 prot. To fix
      this, build a solution similar to the one in TLS ULP. We continue to
      allow socks in all states, not just ESTABLISH, in this patch set
      because, as Martin points out, there should be no issue with this
      on the sockmap ULP since we don't use the ctx in this code. Once
      multiple ULPs coexist we may need to revisit this. However, we
      can do this in *next trees.
      
      The other issue syzbot found is that the tcp_close() handler missed
      locking the hash bucket lock, which could result in corrupting the
      sockhash bucket list if delete and close ran at the same time.
      Also, the smap_list_remove() routine was not working correctly
      at all. This was not caught in my testing because in general my
      tests (to date at least; let's add some more robust selftests in
      bpf-next) do things in the "expected" order: create map, add socks,
      delete socks, then tear down maps. The tests we have that do the
      ops out of this order were only working on single maps, not
      multi-maps, so we never saw the issue. Thanks syzbot. The fix is to
      restructure the tcp_close() lock handling and fix the obvious
      bug in smap_list_remove().
      
      Finally, during review I noticed the release handler was omitted
      from the upstream code (patch 4) due to an incorrect merge conflict
      fix when I ported the code to latest bpf-next before submitting.
      This would leave references to the map around if the user never
      closes the map.
      
      v3: rework patches, dropping ESTABLISH check and adding rcu
          annotation along with the smap_list_remove fix
      
      v4: missed one more case where maps were being accessed without
          the sk_callback_lock, spotted by Martin as well.

      v5: changed to use a specific lock for maps and reduced callback
          lock so that it is only used to guard sk callbacks. I think
          this makes the logic a bit cleaner and avoids confusion
          over what each lock is doing.
      
      Also, big thanks to Martin for the thorough review; he caught at least
      one case where I missed a rcu_call().
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: sockhash, add release routine · caac76a5
      John Fastabend authored
      Add map_release_uref pointer to hashmap ops. This was dropped when
      original sockhash code was ported into bpf-next before initial
      commit.
      
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: sockhash fix omitted bucket lock in sock_close · e9db4ef6
      John Fastabend authored
      First, the sk_callback_lock() was being used to protect both the
      sock callback hooks and the psock->maps list. This got overly
      convoluted after the addition of sockhash (in sockmap it made
      some sense because maps and callbacks were tightly coupled), so
      let's split out a specific lock for maps and only use the callback
      lock for its intended purpose. This fixes a couple of cases where
      we missed using the maps lock when it was in fact needed. It also
      makes the code easier to follow because now we can put the
      locking closer to the actual code it's serializing.
      
      Next, in sock_hash_delete_elem() the pattern was as follows,
      
        sock_hash_delete_elem()
           [...]
           spin_lock(bucket_lock)
           l = lookup_elem_raw()
           if (l)
              hlist_del_rcu()
              write_lock(sk_callback_lock)
               .... destroy psock ...
              write_unlock(sk_callback_lock)
           spin_unlock(bucket_lock)
      
      The ordering is necessary because we only know the {p}sock after
      dereferencing the hash table which we can't do unless we have the
      bucket lock held. Once we have the bucket lock and the psock element
      it is deleted from the hashmap to ensure any other path doing a lookup
      will fail. Finally, the refcnt is decremented and if zero the psock
      is destroyed.
      
      In parallel with the above (or free'ing the map) a tcp close event
      may trigger tcp_close(), which at the moment omits the bucket lock
      altogether (oops!). The flow looks like this,
      
        bpf_tcp_close()
           [...]
           write_lock(sk_callback_lock)
           for each psock->maps // list of maps this sock is part of
               hlist_del_rcu(ref_hash_node);
               .... destroy psock ...
           write_unlock(sk_callback_lock)
      
      Obviously, and demonstrated by syzbot, this is broken because
      we can have multiple threads deleting entries via hlist_del_rcu().
      
      To fix this we might be tempted to wrap the hlist operation in a
      bucket lock but that would create a lock inversion problem. In
      summary to follow locking rules the psocks maps list needs the
      sk_callback_lock (after this patch maps_lock) but we need the bucket
      lock to do the hlist_del_rcu.
      
      To resolve the lock inversion problem pop the head of the maps list
      repeatedly and remove the reference until no more are left. If a
      delete happens in parallel from the BPF API that is OK as well because
      it will do a similar action, lookup the lock in the map/hash, delete
      it from the map/hash, and dec the refcnt. We check for this case
      before doing a destroy on the psock to ensure we don't have two
      threads tearing down a psock. The new logic is as follows,
      
        bpf_tcp_close()
        e = psock_map_pop(psock->maps) // done with map lock
        bucket_lock() // lock hash list bucket
        l = lookup_elem_raw(head, hash, key, key_size);
        if (l) {
           //only get here if elmnt was not already removed
           hlist_del_rcu()
           ... destroy psock...
        }
        bucket_unlock()
      
      And finally for all the above to work add missing locking around map
      operations per above. Then add RCU annotations and use
      rcu_dereference/rcu_assign_pointer to manage values relying on RCU so
      that the object is not free'd from sock_hash_free() while it is being
      referenced in bpf_tcp_close().
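
      The pop-the-head scheme can be modelled in user space as below. This
      is a hedged Python sketch, not the kernel implementation: Bucket,
      Psock, hash_add, hash_delete_elem, psock_put and tcp_close are
      hypothetical names, threading.Lock stands in for the spinlocks, and
      RCU is not modelled. The point it illustrates is that the maps lock
      is released before the bucket lock is taken, so the two are never
      nested in the maps -> bucket order.

```python
import threading


class Bucket:
    """One hash bucket: its own lock plus the entries it holds."""

    def __init__(self):
        self.lock = threading.Lock()
        self.entries = {}  # key -> Psock


class Psock:
    def __init__(self):
        self.maps_lock = threading.Lock()
        self.maps = []  # (bucket, key) references to this psock
        self.refcnt = 0
        self.destroyed = False


def psock_put(psock):
    """Drop one reference; mark the psock destroyed on the last one."""
    with psock.maps_lock:
        psock.refcnt -= 1
        if psock.refcnt == 0:
            psock.destroyed = True


def hash_add(psock, bucket, key):
    with bucket.lock:
        bucket.entries[key] = psock
        with psock.maps_lock:
            psock.maps.append((bucket, key))
            psock.refcnt += 1


def hash_delete_elem(bucket, key):
    """Delete via the map API: bucket lock first, then the maps list."""
    with bucket.lock:
        psock = bucket.entries.pop(key, None)
        if psock is None:
            return False
        with psock.maps_lock:
            if (bucket, key) in psock.maps:
                psock.maps.remove((bucket, key))
        psock_put(psock)
        return True


def psock_map_pop(psock):
    """Pop the head of the maps list under the maps lock only."""
    with psock.maps_lock:
        return psock.maps.pop(0) if psock.maps else None


def tcp_close(psock):
    """Pop references one at a time, then take the bucket lock for each."""
    while True:
        ref = psock_map_pop(psock)  # maps_lock is taken and released here
        if ref is None:
            break
        bucket, key = ref
        with bucket.lock:
            # Only act if a parallel delete has not removed the element.
            if bucket.entries.get(key) is psock:
                del bucket.entries[key]
                psock_put(psock)
```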
      
      Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: sockmap, fix smap_list_map_remove when psock is in many maps · 54fedb42
      John Fastabend authored
      If a hashmap is free'd with open socks it removes the reference to
      the hash entry from the psock. If that is the last reference to the
      psock then it will also be free'd by the reference counting logic.
      However the current logic that removes the hash reference from the
      list of references is broken. In smap_list_remove() we first check
      if the sockmap entry matches and then check if the hashmap entry
      matches. But the sockmap entry will always match because it's NULL in
      this case which causes the first entry to be removed from the list.
      If this is always the "right" entry (because the user adds/removes
      entries in order) then everything is OK but otherwise a subsequent
      bpf_tcp_close() may reference a free'd object.
      
      To fix this create two list handlers one for sockmap and one for
      sockhash.
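
      A small Python model of the matching bug, under the assumption that
      each list element carries either a sockmap entry or a sockhash link
      and leaves the other one unset (None here, NULL in the description
      above). MapRef, buggy_remove and remove_hash are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MapRef:
    entry: Optional[str] = None      # sockmap slot; None for sockhash refs
    hash_link: Optional[str] = None  # sockhash element; None for sockmap refs


def buggy_remove(maps, entry, hash_link):
    """Combined handler: for a hash removal `entry` is None, and so is
    e.entry of every hash reference, so the first element always matches."""
    for i, e in enumerate(maps):
        if e.entry == entry or e.hash_link == hash_link:
            del maps[i]
            return


def remove_hash(maps, hash_link):
    """Dedicated sockhash handler: compares the hash link only."""
    for i, e in enumerate(maps):
        if e.hash_link == hash_link:
            del maps[i]
            return
```

      With two hash references A and B in the list, asking the combined
      handler to remove B drops A instead; the dedicated handler removes B.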
      
      Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: sockmap, fix crash when ipv6 sock is added · 9901c5d7
      John Fastabend authored
      This fixes a crash where we assign tcp_prot to IPv6 sockets instead
      of tcpv6_prot.
      
      Previously we overwrote the sk->prot field with tcp_prot even in the
      AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
      are used.
      
      Tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
      'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
      crashing case here.
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net: fib_rules: bring back rule_exists to match rule during add · 35e8c7ba
      Roopa Prabhu authored
      After commit f9d4b0c1 ("fib_rules: move common handling of newrule
      delrule msgs into fib_nl2rule"), rule_exists got replaced by rule_find
      for existing rule lookup in both the add and del paths. While this
      is good for the delete path, it solves a few problems but opens up
      a few invalid key matches in the add path.
      
      $ip -4 rule add table main tos 10 fwmark 1
      $ip -4 rule add table main tos 10
      RTNETLINK answers: File exists
      
      The problem here is rule_find does not check if the key masks in
      the new and old rule are the same and hence ends up matching a more
      specific rule. Rule key masks cannot be easily compared today without
      an elaborate if-else block. It's best to introduce key masks for easier
      and accurate rule comparison in the future. Until then, due to fear of
      regressions this patch re-introduces older loose rule_exists during add.
      Also fixes both rule_exists and rule_find to cover missing attributes.
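
      The mismatch in the example above can be illustrated with a toy
      Python model. Rules are plain dicts here, and the two functions only
      approximate the behaviours described in this message (a lookup that
      compares just the attributes present in the request versus a check
      over the full key set); they are not the kernel's rule_find() or
      rule_exists().

```python
def find_style_match(rules, new):
    """Compare only the attributes set in the request; attributes that
    exist on the stored rule but not in the request are ignored."""
    for r in rules:
        if all(r.get(k) == v for k, v in new.items()):
            return r
    return None


def exists_style_match(rules, new):
    """Compare the full key set; an attribute missing on one side must
    also be missing on the other."""
    for r in rules:
        keys = set(r) | set(new)
        if all(r.get(k) == new.get(k) for k in keys):
            return True
    return False
```

      For a table holding {table: main, tos: 10, fwmark: 1}, adding
      {table: main, tos: 10} matches in the first style (hence "File
      exists") but not in the second.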
      
      Fixes: f9d4b0c1 ("fib_rules: move common handling of newrule delrule msgs into fib_nl2rule")
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • hv_netvsc: split sub-channel setup into async and sync · 3ffe64f1
      Stephen Hemminger authored
      When doing device hotplug the sub channel must be async to avoid
      deadlock issues because device is discovered in softirq context.
      
      When doing changes to MTU and number of channels, the setup
      must be synchronous to avoid races such as when MTU and device
      settings are done in a single ip command.
      Reported-by: Thomas Walker <Thomas.Walker@twosigma.com>
      Fixes: 8195b139 ("hv_netvsc: fix deadlock on hotplug")
      Fixes: 732e4985 ("netvsc: fix race on sub channel creation")
      Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: use dev_change_tx_queue_len() for SIOCSIFTXQLEN · 3f76df19
      Cong Wang authored
      As noticed by Eric, we need to switch to the helper
      dev_change_tx_queue_len() for SIOCSIFTXQLEN call path too,
      otherwise we still miss dev_qdisc_change_tx_queue_len().
      
      Fixes: 6a643ddb ("net: introduce helper dev_change_tx_queue_len()")
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • atm: zatm: Fix potential Spectre v1 · ced9e191
      Gustavo A. R. Silva authored
      pool can be indirectly controlled by user-space, hence leading to
      a potential exploitation of the Spectre variant 1 vulnerability.
      
      This issue was detected with the help of Smatch:
      
      drivers/atm/zatm.c:1491 zatm_ioctl() warn: potential spectre issue
      'zatm_dev->pool_info' (local cap)
      
      Fix this by sanitizing pool before using it to index
      zatm_dev->pool_info
      
      Notice that given that speculation windows are large, the policy is
      to kill the speculation on the first load and not worry if it can be
      completed with a dependent load/store [1].
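
      The index sanitizing step can be illustrated with the mask
      arithmetic below. This is a Python sketch under assumptions: it
      emulates 64-bit arithmetic for a branch-free clamp in the style of
      the kernel's generic array_index_nospec() helper, which this message
      does not name, and Python itself does not speculate, so only the
      arithmetic is shown.

```python
BITS = 64
MASK64 = (1 << BITS) - 1


def to_signed(x):
    """Interpret a 64-bit pattern as a signed two's-complement value."""
    x &= MASK64
    return x - (1 << BITS) if x >> (BITS - 1) else x


def array_index_mask_nospec(index, size):
    """All ones when 0 <= index < size, zero otherwise, without branching
    on the comparison itself."""
    v = (index | ((size - 1 - index) & MASK64)) & MASK64
    # Arithmetic right shift of the signed complement spreads the sign bit.
    return (to_signed(~v) >> (BITS - 1)) & MASK64


def array_index_nospec(index, size):
    """Clamp an out-of-range index to 0; leave in-range indexes unchanged."""
    return index & array_index_mask_nospec(index, size)
```

      For a hypothetical table of 8 pools, indexes 0..7 pass through
      unchanged and anything larger becomes 0 before it is used to index
      the array.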
      
      [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2

      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 's390-qeth-fixes' · c7f653e0
      David S. Miller authored
      Julian Wiedmann says:
      
      ====================
      s390/qeth: fixes 2018-06-29
      
      please apply a few qeth fixes for -net and your 4.17 stable queue.
      
      Patches 1-3 fix several issues wrt MAC address management that were
      introduced during the 4.17 cycle.
      Patch 4 tackles a long-standing issue with busy multi-connection workloads
      on devices in af_iucv mode.
      Patch 5 makes sure to re-enable all active HW offloads, after a card was
      previously set offline and thus lost its HW context.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • s390/qeth: consistently re-enable device features · d025da9e
      Julian Wiedmann authored
      commit e830baa9 ("qeth: restore device features after recovery") and
      commit ce344356 ("s390/qeth: rely on kernel for feature recovery")
      made sure that the HW functions for device features get re-programmed
      after recovery.
      
      But we missed that the same handling is also required when a card is
      first set offline (destroying all HW context), and then online again.
      Fix this by moving the re-enable action out of the recovery-only path.
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • s390/qeth: don't clobber buffer on async TX completion · ce28867f
      Julian Wiedmann authored
      If qeth_qdio_output_handler() detects that a transmit requires async
      completion, it replaces the pending buffer's metadata object
      (qeth_qdio_out_buffer) so that this queue buffer can be re-used while
      the data is pending completion.
      
      Later when the CQ indicates async completion of such a metadata object,
      qeth_qdio_cq_handler() tries to free any data associated with this
      object (since HW has now completed the transfer). By calling
      qeth_clear_output_buffer(), it erroneously operates on the queue buffer
      that _previously_ belonged to this transfer ... but which has been
      potentially re-used several times by now.
      This results in double-free's of the buffer's data, and failing
      transmits as the buffer descriptor is scrubbed in mid-air.
      
      The correct way of handling this situation is to
      1. scrub the queue buffer when it is prepared for re-use, and
      2. later obtain the data addresses from the async-completion notifier
         (ie. the AOB), instead of the queue buffer.
      
      All this only affects qeth devices used for af_iucv HiperTransport.
      
      Fixes: 0da9581d ("qeth: exploit asynchronous delivery of storage blocks")
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • s390/qeth: avoid using is_multicast_ether_addr_64bits on (u8 *)[6] · 9d0a58fb
      Vasily Gorbik authored
      *ether_addr*_64bits functions have been introduced to optimize
      performance critical paths, which access 6-byte ethernet address as u64
      value to get "nice" assembly. A harmless hack works nicely on ethernet
      addresses shoved into a structure or a larger buffer, until busted by
      Kasan on something like plain (u8 *)[6].
      
      qeth_l2_set_mac_address calls qeth_l2_remove_mac passing
      u8 old_addr[ETH_ALEN] as an argument.
      
      Adding/removing macs for an ethernet adapter is not that performance
      critical. Moreover is_multicast_ether_addr_64bits itself on s390 is not
      faster than is_multicast_ether_addr:
      
      is_multicast_ether_addr(%r2) -> %r2
      llc	%r2,0(%r2)
      risbg	%r2,%r2,63,191,0
      
      is_multicast_ether_addr_64bits(%r2) -> %r2
      llgc	%r2,0(%r2)
      risbg	%r2,%r2,63,191,0
      
      So, let's just use is_multicast_ether_addr instead of
      is_multicast_ether_addr_64bits.
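
      Why the 64-bit variant trips on a plain 6-byte array can be shown
      with a Python model. The two functions below reuse the kernel names
      for readability but are only sketches: the 64-bit one assumes a
      little-endian load of 8 bytes, and Python raises struct.error where
      C would silently read 2 bytes past the array. overreads() is a
      hypothetical helper for the demonstration.

```python
import struct


def is_multicast_ether_addr(addr: bytes) -> bool:
    """Looks only at the first octet: the group bit is its lowest bit."""
    return bool(addr[0] & 0x01)


def is_multicast_ether_addr_64bits(buf: bytes) -> bool:
    """Models the optimized variant: loads 8 bytes although only the
    first 6 belong to the address (little-endian load assumed)."""
    (word,) = struct.unpack_from("<Q", buf)
    return bool(word & 0x01)


def overreads(buf: bytes) -> bool:
    """True when the 8-byte load would run past the end of `buf`."""
    try:
        is_multicast_ether_addr_64bits(buf)
    except struct.error:
        return True
    return False
```

      An address embedded in a larger buffer (at least 8 readable bytes)
      works with either function; a bare 6-byte array is only safe with
      the byte-wise check.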
      
      Fixes: bcacfcbc ("s390/qeth: fix MAC address update sequence")
      Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • s390/qeth: fix race when setting MAC address · 4789a218
      Julian Wiedmann authored
      When qeth_l2_set_mac_address() finds the card in a non-reachable state,
      it merely copies the new MAC address into dev->dev_addr so that
      __qeth_l2_set_online() can later register it with the HW.
      
      But __qeth_l2_set_online() may very well be running concurrently, so we
      can't trust the card state without appropriate locking:
      If the online sequence is past the point where it registers
      dev->dev_addr (but not yet in SOFTSETUP state), any address change needs
      to be properly programmed into the HW. Otherwise the netdevice ends up
      with a different MAC address than what's set in the HW, and inbound
      traffic is not forwarded as expected.
      
      This is most likely to occur for OSD in LPAR, where
      commit 21b1702a ("s390/qeth: improve fallback to random MAC address")
      now triggers eg. systemd to immediately change the MAC when the netdevice
      is registered with a NET_ADDR_RANDOM address.
      
      Fixes: bcacfcbc ("s390/qeth: fix MAC address update sequence")
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "s390/qeth: use Read device to query hypervisor for MAC" · 46646105
      Julian Wiedmann authored
      This reverts commit b7493e91.
      
      On its own, querying RDEV for a MAC address works fine. But when upgrading
      from a qeth that previously queried DDEV on a z/VM NIC (ie. any kernel with
      commit ec61bd2f), the RDEV query now returns a _different_ MAC address
      than the DDEV query.
      
      If the NIC is configured with MACPROTECT, z/VM apparently requires us to
      use the MAC that was initially returned (on DDEV) and registered. So after
      upgrading to a kernel that uses RDEV, the SETVMAC registration cmd for the
      new MAC address fails and we end up with a non-operable interface.
      
      To avoid regressions on upgrade, switch back to using DDEV for the MAC
      address query. The downgrade path (first RDEV, later DDEV) is fine, in this
      case both queries return the same MAC address.
      
      Fixes: b7493e91 ("s390/qeth: use Read device to query hypervisor for MAC")
      Reported-by: Michal Kubecek <mkubecek@suse.com>
      Tested-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • alx: take rtnl before calling __alx_open from resume · bc800e8b
      Sabrina Dubroca authored
      The __alx_open function can be called from ndo_open, which is called
      under RTNL, or from alx_resume, which isn't. Since commit d768319c,
      we're calling the netif_set_real_num_{tx,rx}_queues functions, which
      need to be called under RTNL.
      
      This is similar to commit 0c2cc02e ("igb: Move the calls to set the
      Tx and Rx queues into igb_open").
      
      Fixes: d768319c ("alx: enable multiple tx queues")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sfc: correctly initialise filter rwsem for farch · cafb3960
      Bert Kenward authored
      Fixes: fc7a6c28 ("sfc: use a semaphore to lock farch filters too")
      Suggested-by: Joseph Korty <joe.korty@concurrent-rt.com>
      Signed-off-by: Bert Kenward <bkenward@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: DP83TC811: Fix disabling interrupts · 713b4a33
      Dan Murphy authored
      Fix a bug where INT_STAT1 was written twice and
      INT_STAT2 was ignored when disabling interrupts.
      
      Fixes: b753a9fa ("net: phy: DP83TC811: Introduce support for the DP83TC811 phy")
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Dan Murphy <dmurphy@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv6: Fix updates to prefix route · e7c7faa9
      David Ahern authored
      Sowmini reported that a recent commit broke prefix routes for linklocal
      addresses. The newly added modify_prefix_route is attempting to add a
      new prefix route when the ifp priority does not match the route metric;
      however, the check needs to account for the default priority. In addition,
      the route add fails because the route already exists, and then the delete
      removes the one that exists. Flip the order to do the delete first.
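
      The ordering problem can be shown with a toy Python routing table.
      RouteTable and the two modify_* functions are hypothetical, and the
      metric value used in the usage note is purely illustrative; the
      model only captures that an add of an existing route fails and that
      a delete removes whatever matches.

```python
class RouteTable:
    """Toy routing table keyed by (prefix, metric)."""

    def __init__(self):
        self.routes = set()

    def add(self, prefix, metric):
        if (prefix, metric) in self.routes:
            return False  # models the add failing because the route exists
        self.routes.add((prefix, metric))
        return True

    def delete(self, prefix, metric):
        self.routes.discard((prefix, metric))


def modify_add_then_delete(table, prefix, old_metric, new_metric):
    """Old order: a failed add is followed by a delete of the only route."""
    table.add(prefix, new_metric)
    table.delete(prefix, old_metric)


def modify_delete_then_add(table, prefix, old_metric, new_metric):
    """Flipped order: the route is always present afterwards."""
    table.delete(prefix, old_metric)
    table.add(prefix, new_metric)
```

      When old and new metric resolve to the same value (say 256 for a
      fe80::/64 route), the old order leaves the table empty while the
      flipped order keeps the route.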
      
      Fixes: 8308f3ff ("net/ipv6: Add support for specifying metric of connected routes")
      Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Tested-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: cleanup gfp mask in alloc_skb_with_frags · d14b56f5
      Michal Hocko authored
      alloc_skb_with_frags uses __GFP_NORETRY for non-sleeping allocations
      which is just a noop and a little bit confusing.
      
      __GFP_NORETRY was added by ed98df33 ("net: use __GFP_NORETRY for
      high order allocations") to keep the OOM killer from being invoked. Yet this was
      not enough because fb05e7a8 ("net: don't wait for order-3 page
      allocation") didn't want an excessive reclaim for non-costly orders
      so it made it completely NOWAIT while it preserved __GFP_NORETRY in
      place which is now redundant.
      
      Drop the pointless __GFP_NORETRY because this function is used as
      copy&paste source for other places.
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'DPAA-fixes' · 70098289
      David S. Miller authored
      Madalin Bucur says:
      
      ====================
      DPAA fixes
      
      A couple of fixes for the DPAA drivers, addressing an issue
      with short UDP or TCP frames (with padding) that were marked
      as having a wrong checksum and dropped by the FMan hardware
      and a problem with the buffer used for the scatter-gather
      table being too small as per the hardware requirements.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dpaa_eth: DPAA SGT needs to be 256B · 595e802e
      Madalin Bucur authored
      The DPAA HW requires that at least 256 bytes from the start of the
      first scatter-gather table entry are allocated and accessible. The
      hardware reads the maximum size the table can have in one access,
      thus requiring that the allocation and mapping to be done for the
      maximum size of 256B even if there is a smaller number of entries
      in the table.
      Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fsl/fman: fix parser reporting bad checksum on short frames · b95f6fbc
      Madalin Bucur authored
      The FMan hardware parser needs to be configured to remove the
      short frame padding from the checksum calculation, otherwise
      short UDP and TCP frames are likely to be marked as having a
      bad checksum.
      Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b95f6fbc
    • bnx2x: Fix receiving tx-timeout in error or recovery state. · 484c016d
      Sudarsana Reddy Kalluru authored
      The driver performs an internal reload when it receives a tx-timeout event
      from the OS. The internal reload might fail in some scenarios, e.g. fatal HW
      issues. In such cases the OS still sees the link as up, which results in
      undesirable behavior such as repeated tx-timeouts.
      The patch addresses this issue by indicating link-down to the OS when a
      tx-timeout is detected, and keeping the link down until the internal
      reload succeeds.
      
      Please consider applying it to 'net' branch.
      Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
      Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      484c016d
    • cnic: tidy up a size calculation · 5037c628
      Dan Carpenter authored
      Static checkers complain that id_tbl->table points to longs and 4 bytes
      is smaller than sizeof(long).  But since the other side is dividing by
      32 instead of sizeof(long), that means the current code works fine.
      
      Anyway, it's more conventional to use the BITS_TO_LONGS() macro when
      we're allocating a bitmap.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5037c628
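      For readers unfamiliar with the convention the cnic message refers to,
      the following is a userspace sketch of bitmap sizing by whole longs. The
      macros are stand-ins written for this example that mirror the kernel
      macros of the same name; bitmap_bytes() is a hypothetical helper.

```c
#include <limits.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel macros of the same name. */
#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)
#define BITS_TO_LONGS(nbits) (((nbits) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Bytes needed for a bitmap of nbits bits, rounded up to whole longs. */
static size_t bitmap_bytes(size_t nbits)
{
    return BITS_TO_LONGS(nbits) * sizeof(long);
}
```

      Sizing in units of long keeps the allocation consistent with bitmap
      helpers that index the table one long at a time.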
    • atm: iphase: fix a 64 bit bug · 92291c95
      Dan Carpenter authored
      The code assumes that there are 4 bytes in a pointer and it doesn't
      allocate enough memory.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      92291c95
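      The class of bug described in the iphase message can be illustrated with
      a short userspace sketch. It is not the driver's code: struct desc and
      alloc_ptr_table() are hypothetical names. The point is only that a
      pointer table should be sized from the pointer type, not from a
      hard-coded 4.

```c
#include <stdlib.h>

struct desc; /* hypothetical opaque element type */

/*
 * Allocate a zeroed table of n pointers. Using sizeof(*tbl) gives
 * 4 bytes per slot on 32-bit targets and 8 on 64-bit targets, so the
 * table is large enough on both.
 */
static struct desc **alloc_ptr_table(size_t n)
{
    struct desc **tbl = calloc(n, sizeof(*tbl));

    return tbl;
}
```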
    • tcp: fix Fast Open key endianness · c860e997
      Yuchung Cheng authored
      The Fast Open key could be stored in a different byte order depending on
      the CPU. Previously, hosts of different endianness in a server farm using
      the same key config (sysctl value) would produce different cookies.
      This patch fixes it by always storing the key as little endian, which
      keeps the same API for LE hosts.
      Reported-by: Daniele Iamartino <danielei@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c860e997
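      The idea behind the Fast Open fix, storing a key in one fixed byte order
      regardless of the host, can be sketched in portable C as follows. This is
      an illustration only; the kernel uses its own byte-order helpers, and
      put_le32() and key_words_to_bytes() are hypothetical names. The split of
      the key into four 32-bit words is an assumption of the example.

```c
#include <stdint.h>

/* Serialize a 32-bit word as little-endian bytes on any host. */
static void put_le32(uint32_t v, uint8_t out[4])
{
    out[0] = (uint8_t)(v & 0xff);
    out[1] = (uint8_t)((v >> 8) & 0xff);
    out[2] = (uint8_t)((v >> 16) & 0xff);
    out[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Lay out a key given as four 32-bit words into 16 key bytes. */
static void key_words_to_bytes(const uint32_t words[4], uint8_t key[16])
{
    for (int i = 0; i < 4; i++)
        put_le32(words[i], key + 4 * i);
}
```

      Because the shifts operate on values rather than on memory, big-endian
      and little-endian hosts given the same words end up with identical key
      bytes, and therefore with identical cookies.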
  4. 29 Jun, 2018 9 commits
    • Merge branch 'bpf-fixes' · ca09cb04
      Alexei Starovoitov authored
      Daniel Borkmann says:
      
      ====================
      This set contains three fixes that are mostly JIT and set_memory_*()
      related. The third in the series in particular fixes the syzkaller
      bugs that were still pending; aside from local reproduction & testing,
      also 'syz test' wasn't able to trigger them anymore. I've tested this
      series on x86_64, arm64 and s390x, and kbuild bot wasn't yelling either
      for the rest. For details, please see patches as usual, thanks!
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ca09cb04
    • bpf: undo prog rejection on read-only lock failure · 85782e03
      Daniel Borkmann authored
      Partially undo commit 9facc336 ("bpf: reject any prog that failed
      read-only lock") since it caused a regression, that is, syzkaller was
      able to manage to cause a panic via fault injection deep in set_memory_ro()
      path by letting an allocation fail: In x86's __change_page_attr_set_clr()
      it was able to change the attributes of the primary mapping but not in
      the alias mapping via cpa_process_alias(), so the second, inner call
      to the __change_page_attr() via __change_page_attr_set_clr() had to split
      a larger page and failed in the alloc_pages() with the artificially triggered
      allocation error which is then propagated down to the call site.
      
      Thus, for set_memory_ro() this means that it returned with an error, but
      from debugging a probe_kernel_write() revealed EFAULT on that memory since
      the primary mapping succeeded to get changed. Therefore the subsequent
      hdr->locked = 0 reset triggered the panic as it was performed on read-only
      memory, so call-site assumptions were in fact wrong to assume that it would
      either succeed /or/ not succeed at all since there's no such rollback in
      set_memory_*() calls from partial change of mappings, in other words, we're
      left in a state that is "half done". A later undo via set_memory_rw() is
      succeeding though due to matching permissions on that part (aka due to the
      try_preserve_large_page() succeeding). While reproducing locally with
      explicitly triggering this error, the initial splitting only happens on
      rare occasions and in real world it would additionally need oom conditions,
      but that said, it could partially fail. Therefore, it is definitely wrong
      to bail out on set_memory_ro() error and reject the program with the
      set_memory_*() semantics we have today. Shouldn't have gone the extra mile
      since no other user in tree today in fact checks for any set_memory_*()
      errors, e.g. neither module_enable_ro() / module_disable_ro() for module
      RO/NX handling which is mostly default these days nor kprobes core with
      alloc_insn_page() / free_insn_page() as examples that could be invoked long
      after bootup and original 314beb9b ("x86: bpf_jit_comp: secure bpf jit
      against spraying attacks") did neither when it got first introduced to BPF
      so "improving" with bailing out was clearly not right when set_memory_*()
      cannot handle it today.
      
      Kees suggested that if set_memory_*() can fail, we should annotate it with
      __must_check, and all callers need to deal with it gracefully given those
      set_memory_*() markings aren't "advisory", but they're expected to actually
      do what they say. This might be an option worth moving forward with in future
      but would at the same time require that set_memory_*() calls from supporting
      archs are guaranteed to be "atomic" in that they provide rollback if part
      of the range fails, once that happened, the transition from RW -> RO could
      be made more robust that way, while subsequent RO -> RW transition /must/
      continue guaranteeing to always succeed the undo part.
      
      Reported-by: syzbot+a4eb8c7766952a1ca872@syzkaller.appspotmail.com
      Reported-by: syzbot+d866d1925855328eac3b@syzkaller.appspotmail.com
      Fixes: 9facc336 ("bpf: reject any prog that failed read-only lock")
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      85782e03
    • bpf, s390: fix potential memleak when later bpf_jit_prog fails · f605ce5e
      Daniel Borkmann authored
      If we would ever fail in the bpf_jit_prog() pass that writes the
      actual insns to the image after we got header via bpf_jit_binary_alloc()
      then we also need to make sure to free it through bpf_jit_binary_free()
      again when bailing out. Given we had prior bpf_jit_prog() passes to
      initially probe for clobbered registers, program size and to fill in
      addrs array for jump targets, this is more of a theoretical one,
      but at least make sure this doesn't break with future changes.
      
      Fixes: 05462310 ("s390/bpf: Add s390x eBPF JIT compiler backend")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f605ce5e
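      The error-path pattern described in the s390 message (free the header
      again if a later pass fails) can be sketched in userspace as below. None
      of these names come from the kernel: struct image_hdr, hdr_alloc(),
      hdr_free() and compile() are hypothetical stand-ins, and live_headers
      exists only to make the leak observable.

```c
#include <stdlib.h>

struct image_hdr { size_t size; }; /* hypothetical stand-in for a JIT image header */

static int live_headers; /* outstanding allocations, for illustration only */

static struct image_hdr *hdr_alloc(size_t size)
{
    struct image_hdr *hdr = calloc(1, sizeof(*hdr) + size);

    if (hdr) {
        hdr->size = size;
        live_headers++;
    }
    return hdr;
}

static void hdr_free(struct image_hdr *hdr)
{
    if (hdr) {
        free(hdr);
        live_headers--;
    }
}

/* emit_ok simulates whether the final code-emission pass succeeds. */
static struct image_hdr *compile(size_t size, int emit_ok)
{
    struct image_hdr *hdr = hdr_alloc(size);

    if (!hdr)
        return NULL;
    if (!emit_ok) {
        hdr_free(hdr); /* release the header when bailing out */
        return NULL;
    }
    return hdr;
}
```

      Without the hdr_free() call on the failure branch, every failed emission
      pass would leave one header allocated.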
    • bpf, arm32: fix to use bpf_jit_binary_lock_ro api · 18d405af
      Daniel Borkmann authored
      Any eBPF JIT whose underlying arch supports ARCH_HAS_SET_MEMORY
      needs to use the bpf_jit_binary_{un,}lock_ro() pair instead of the
      set_memory_{ro,rw}() pair directly, as otherwise changes to the former
      might break. arm32's eBPF conversion missed changing it, so fix this
      up here.
      
      Fixes: 39c13c20 ("arm: eBPF JIT compiler")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      18d405af
    • Merge tag 'mac80211-for-davem-2018-06-29' of... · 0933cc29
      David S. Miller authored
      Merge tag 'mac80211-for-davem-2018-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
      
      Johannes Berg says:
      
      ====================
      Just three fixes:
       * fix HT operation in mesh mode
       * disable preemption in control frame TX
       * check nla_parse_nested() return values
         where missing (two places)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0933cc29
    • net, mm: account sock objects to kmemcg · e699e2c6
      Shakeel Butt authored
      Currently the kernel accounts the memory for network traffic through
      mem_cgroup_[un]charge_skmem() interface. However the memory accounted
      only includes the truesize of sk_buff which does not include the size of
      sock objects. In our production environment, with opt-out kmem
      accounting, the sock kmem caches (TCP[v6], UDP[v6], RAW[v6], UNIX) are
      among the top most charged kmem caches and consume a significant amount
      of memory which can not be left as system overhead. So, this patch
      converts the kmem caches of all sock objects to SLAB_ACCOUNT.
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e699e2c6
    • nl80211: check nla_parse_nested() return values · 95bca62f
      Johannes Berg authored
      At the very least we should check the return value if
      nla_parse_nested() is called with a non-NULL policy.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      95bca62f
    • nl80211: relax ht operation checks for mesh · 188f60ab
      Bob Copeland authored
      Commit 9757235f ("nl80211: correct checks for
      NL80211_MESHCONF_HT_OPMODE value") relaxed the range for the HT
      operation field in meshconf, while also adding checks requiring
      the non-greenfield and non-ht-sta bits to be set in certain
      circumstances.  The latter bit is actually reserved for mesh BSSes
      according to Table 9-168 in 802.11-2016, so in fact it should not
      be set.
      
      wpa_supplicant sets these bits because the mesh and AP code share
      the same implementation, but authsae does not.  As a result, some
      meshconf updates from authsae which set only the NONHT_MIXED
      protection bits were being rejected.
      
      In order to avoid breaking userspace by changing the rules again,
      simply accept the values with or without the bits set, and mask
      off the reserved bit to match the spec.
      
      While in here, update the 802.11-2012 reference to 802.11-2016.
      
      Fixes: 9757235f ("nl80211: correct checks for NL80211_MESHCONF_HT_OPMODE value")
      Cc: Masashi Honma <masashi.honma@gmail.com>
      Signed-off-by: Bob Copeland <bobcopeland@fb.com>
      Reviewed-by: Masashi Honma <masashi.honma@gmail.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      188f60ab
    • mac80211: disable BHs/preemption in ieee80211_tx_control_port() · e7441c92
      Denis Kenzior authored
      On pre-emption enabled kernels the following print was being seen due to
      missing local_bh_disable/local_bh_enable calls.  mac80211 assumes that
      pre-emption is disabled in the data path.
      
          BUG: using smp_processor_id() in preemptible [00000000] code: iwd/517
          caller is __ieee80211_subif_start_xmit+0x144/0x210 [mac80211]
          [...]
          Call Trace:
          dump_stack+0x5c/0x80
          check_preemption_disabled.cold.0+0x46/0x51
          __ieee80211_subif_start_xmit+0x144/0x210 [mac80211]
      
      Fixes: 91180649 ("mac80211: Add support for tx_control_port")
      Signed-off-by: Denis Kenzior <denkenz@gmail.com>
      [commit message rewrite, fixes tag]
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      e7441c92
  5. 28 Jun, 2018 1 commit
    • bpf: Change bpf_fib_lookup to return lookup status · 4c79579b
      David Ahern authored
      For ACLs implemented using either FIB rules or FIB entries, the BPF
      program needs the FIB lookup status to be able to drop the packet.
      Since the bpf_fib_lookup API has not reached a released kernel yet,
      change the return code to contain an encoding of the FIB lookup
      result and return the nexthop device index in the params struct.
      
      In addition, inform the BPF program of any post FIB lookup reason as
      to why the packet needs to go up the stack.
      
      The fib result for unicast routes must have an egress device, so remove
      the check that it is non-NULL.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      4c79579b
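      How a program might act on a lookup status of the kind the bpf_fib_lookup
      message describes can be sketched as follows. This is a plain userspace
      illustration, not BPF code: the enum names and values are hypothetical
      (the real ones are defined in the kernel's BPF UAPI header), and the
      mapping from status to action, including passing errors up the stack, is
      an assumed policy chosen for the example.

```c
/* Hypothetical status codes: negative is an error, 0 is success. */
enum lkup_status {
    LKUP_SUCCESS = 0,
    LKUP_BLACKHOLE,
    LKUP_UNREACHABLE,
    LKUP_PROHIBIT,
    LKUP_NOT_FORWARDED,
};

enum action { ACT_REDIRECT, ACT_DROP, ACT_PASS_TO_STACK };

/* Map a lookup return value to a packet action (assumed policy). */
static enum action decide(int status)
{
    if (status < 0)
        return ACT_PASS_TO_STACK; /* lookup error: let the stack handle it */

    switch (status) {
    case LKUP_SUCCESS:
        return ACT_REDIRECT;      /* forward using the looked-up nexthop */
    case LKUP_BLACKHOLE:
    case LKUP_UNREACHABLE:
    case LKUP_PROHIBIT:
        return ACT_DROP;          /* ACL-style routes: drop the packet */
    default:
        return ACT_PASS_TO_STACK; /* anything else goes up the stack */
    }
}
```

      Returning the status separately from the nexthop device index is what
      lets the drop cases be told apart from the cases that merely need the
      full stack.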