1. 30 Jun, 2018 4 commits
    • John Fastabend's avatar
      bpf: sockhash, add release routine · caac76a5
      John Fastabend authored
      Add map_release_uref pointer to hashmap ops. This was dropped when
      original sockhash code was ported into bpf-next before initial
      commit.
      
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      caac76a5
    • John Fastabend's avatar
      bpf: sockhash fix omitted bucket lock in sock_close · e9db4ef6
      John Fastabend authored
      First the sk_callback_lock() was being used to protect both the
      sock callback hooks and the psock->maps list. This got overly
      convoluted after the addition of sockhash (in sockmap it made
      some sense because masp and callbacks were tightly coupled) so
      lets split out a specific lock for maps and only use the callback
      lock for its intended purpose. This fixes a couple cases where
      we missed using maps lock when it was in fact needed. Also this
      makes it easier to follow the code because now we can put the
      locking closer to the actual code its serializing.
      
      Next, in sock_hash_delete_elem() the pattern was as follows,
      
        sock_hash_delete_elem()
           [...]
           spin_lock(bucket_lock)
           l = lookup_elem_raw()
           if (l)
              hlist_del_rcu()
              write_lock(sk_callback_lock)
               .... destroy psock ...
              write_unlock(sk_callback_lock)
           spin_unlock(bucket_lock)
      
      The ordering is necessary because we only know the {p}sock after
      dereferencing the hash table which we can't do unless we have the
      bucket lock held. Once we have the bucket lock and the psock element
      it is deleted from the hashmap to ensure any other path doing a lookup
      will fail. Finally, the refcnt is decremented and if zero the psock
      is destroyed.
      
      In parallel with the above (or free'ing the map) a tcp close event
      may trigger tcp_close(). Which at the moment omits the bucket lock
      altogether (oops!) where the flow looks like this,
      
        bpf_tcp_close()
           [...]
           write_lock(sk_callback_lock)
           for each psock->maps // list of maps this sock is part of
               hlist_del_rcu(ref_hash_node);
               .... destroy psock ...
           write_unlock(sk_callback_lock)
      
      Obviously, and demonstrated by syzbot, this is broken because
      we can have multiple threads deleting entries via hlist_del_rcu().
      
      To fix this we might be tempted to wrap the hlist operation in a
      bucket lock but that would create a lock inversion problem. In
      summary to follow locking rules the psocks maps list needs the
      sk_callback_lock (after this patch maps_lock) but we need the bucket
      lock to do the hlist_del_rcu.
      
      To resolve the lock inversion problem pop the head of the maps list
      repeatedly and remove the reference until no more are left. If a
      delete happens in parallel from the BPF API that is OK as well because
      it will do a similar action, lookup the lock in the map/hash, delete
      it from the map/hash, and dec the refcnt. We check for this case
      before doing a destroy on the psock to ensure we don't have two
      threads tearing down a psock. The new logic is as follows,
      
        bpf_tcp_close()
        e = psock_map_pop(psock->maps) // done with map lock
        bucket_lock() // lock hash list bucket
        l = lookup_elem_raw(head, hash, key, key_size);
        if (l) {
           //only get here if elmnt was not already removed
           hlist_del_rcu()
           ... destroy psock...
        }
        bucket_unlock()
      
      And finally for all the above to work add missing locking around  map
      operations per above. Then add RCU annotations and use
      rcu_dereference/rcu_assign_pointer to manage values relying on RCU so
      that the object is not free'd from sock_hash_free() while it is being
      referenced in bpf_tcp_close().
      
      Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      e9db4ef6
    • John Fastabend's avatar
      bpf: sockmap, fix smap_list_map_remove when psock is in many maps · 54fedb42
      John Fastabend authored
      If a hashmap is free'd with open socks it removes the reference to
      the hash entry from the psock. If that is the last reference to the
      psock then it will also be free'd by the reference counting logic.
      However the current logic that removes the hash reference from the
      list of references is broken. In smap_list_remove() we first check
      if the sockmap entry matches and then check if the hashmap entry
      matches. But, the sockmap entry sill always match because its NULL in
      this case which causes the first entry to be removed from the list.
      If this is always the "right" entry (because the user adds/removes
      entries in order) then everything is OK but otherwise a subsequent
      bpf_tcp_close() may reference a free'd object.
      
      To fix this create two list handlers one for sockmap and one for
      sockhash.
      
      Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
      Fixes: 81110384 ("bpf: sockmap, add hash map support")
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      54fedb42
    • John Fastabend's avatar
      bpf: sockmap, fix crash when ipv6 sock is added · 9901c5d7
      John Fastabend authored
      This fixes a crash where we assign tcp_prot to IPv6 sockets instead
      of tcpv6_prot.
      
      Previously we overwrote the sk->prot field with tcp_prot even in the
      AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
      are used.
      
      Tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
      'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
      crashing case here.
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      9901c5d7
  2. 29 Jun, 2018 4 commits
    • Alexei Starovoitov's avatar
      Merge branch 'bpf-fixes' · ca09cb04
      Alexei Starovoitov authored
      Daniel Borkmann says:
      
      ====================
      This set contains three fixes that are mostly JIT and set_memory_*()
      related. The third in the series in particular fixes the syzkaller
      bugs that were still pending; aside from local reproduction & testing,
      also 'syz test' wasn't able to trigger them anymore. I've tested this
      series on x86_64, arm64 and s390x, and kbuild bot wasn't yelling either
      for the rest. For details, please see patches as usual, thanks!
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      ca09cb04
    • Daniel Borkmann's avatar
      bpf: undo prog rejection on read-only lock failure · 85782e03
      Daniel Borkmann authored
      Partially undo commit 9facc336 ("bpf: reject any prog that failed
      read-only lock") since it caused a regression, that is, syzkaller was
      able to manage to cause a panic via fault injection deep in set_memory_ro()
      path by letting an allocation fail: In x86's __change_page_attr_set_clr()
      it was able to change the attributes of the primary mapping but not in
      the alias mapping via cpa_process_alias(), so the second, inner call
      to the __change_page_attr() via __change_page_attr_set_clr() had to split
      a larger page and failed in the alloc_pages() with the artifically triggered
      allocation error which is then propagated down to the call site.
      
      Thus, for set_memory_ro() this means that it returned with an error, but
      from debugging a probe_kernel_write() revealed EFAULT on that memory since
      the primary mapping succeeded to get changed. Therefore the subsequent
      hdr->locked = 0 reset triggered the panic as it was performed on read-only
      memory, so call-site assumptions were infact wrong to assume that it would
      either succeed /or/ not succeed at all since there's no such rollback in
      set_memory_*() calls from partial change of mappings, in other words, we're
      left in a state that is "half done". A later undo via set_memory_rw() is
      succeeding though due to matching permissions on that part (aka due to the
      try_preserve_large_page() succeeding). While reproducing locally with
      explicitly triggering this error, the initial splitting only happens on
      rare occasions and in real world it would additionally need oom conditions,
      but that said, it could partially fail. Therefore, it is definitely wrong
      to bail out on set_memory_ro() error and reject the program with the
      set_memory_*() semantics we have today. Shouldn't have gone the extra mile
      since no other user in tree today infact checks for any set_memory_*()
      errors, e.g. neither module_enable_ro() / module_disable_ro() for module
      RO/NX handling which is mostly default these days nor kprobes core with
      alloc_insn_page() / free_insn_page() as examples that could be invoked long
      after bootup and original 314beb9b ("x86: bpf_jit_comp: secure bpf jit
      against spraying attacks") did neither when it got first introduced to BPF
      so "improving" with bailing out was clearly not right when set_memory_*()
      cannot handle it today.
      
      Kees suggested that if set_memory_*() can fail, we should annotate it with
      __must_check, and all callers need to deal with it gracefully given those
      set_memory_*() markings aren't "advisory", but they're expected to actually
      do what they say. This might be an option worth to move forward in future
      but would at the same time require that set_memory_*() calls from supporting
      archs are guaranteed to be "atomic" in that they provide rollback if part
      of the range fails, once that happened, the transition from RW -> RO could
      be made more robust that way, while subsequent RO -> RW transition /must/
      continue guaranteeing to always succeed the undo part.
      
      Reported-by: syzbot+a4eb8c7766952a1ca872@syzkaller.appspotmail.com
      Reported-by: syzbot+d866d1925855328eac3b@syzkaller.appspotmail.com
      Fixes: 9facc336 ("bpf: reject any prog that failed read-only lock")
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      85782e03
    • Daniel Borkmann's avatar
      bpf, s390: fix potential memleak when later bpf_jit_prog fails · f605ce5e
      Daniel Borkmann authored
      If we would ever fail in the bpf_jit_prog() pass that writes the
      actual insns to the image after we got header via bpf_jit_binary_alloc()
      then we also need to make sure to free it through bpf_jit_binary_free()
      again when bailing out. Given we had prior bpf_jit_prog() passes to
      initially probe for clobbered registers, program size and to fill in
      addrs arrray for jump targets, this is more of a theoretical one,
      but at least make sure this doesn't break with future changes.
      
      Fixes: 05462310 ("s390/bpf: Add s390x eBPF JIT compiler backend")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      f605ce5e
    • Daniel Borkmann's avatar
      bpf, arm32: fix to use bpf_jit_binary_lock_ro api · 18d405af
      Daniel Borkmann authored
      Any eBPF JIT that where its underlying arch supports ARCH_HAS_SET_MEMORY
      would need to use bpf_jit_binary_{un,}lock_ro() pair instead of the
      set_memory_{ro,rw}() pair directly as otherwise changes to the former
      might break. arm32's eBPF conversion missed to change it, so fix this
      up here.
      
      Fixes: 39c13c20 ("arm: eBPF JIT compiler")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      18d405af
  3. 28 Jun, 2018 2 commits
  4. 26 Jun, 2018 4 commits
  5. 25 Jun, 2018 1 commit
    • Jakub Kicinski's avatar
      nfp: bpf: don't stop offload if replace failed · 68d676a0
      Jakub Kicinski authored
      Stopping offload completely if replace of program failed dates
      back to days of transparent offload.  Back then we wanted to
      silently fall back to the in-driver processing.  Today we mark
      programs for offload when they are loaded into the kernel, so
      the transparent offload is no longer a reality.
      
      Flags check in the driver will only allow replace of a driver
      program with another driver program or an offload program with
      another offload program.
      
      When driver program is replaced stopping offload is a no-op,
      because driver program isn't offloaded.  When replacing
      offloaded program if the offload fails the entire operation
      will fail all the way back to user space and we should continue
      using the old program.  IOW when replacing a driver program
      stopping offload is unnecessary and when replacing offloaded
      program - it's a bug, old program should continue to run.
      
      In practice this bug would mean that if offload operation was to
      fail (either due to FW communication error, kernel OOM or new
      program being offloaded but for a different netdev) driver
      would continue reporting that previous XDP program is offloaded
      but in fact no program will be loaded in hardware.  The failure
      is fairly unlikely (found by inspection, when working on the code)
      but it's unpleasant.
      
      Backport note: even though the bug was introduced in commit
      cafa92ac ("nfp: bpf: add support for XDP_FLAGS_HW_MODE"),
      this fix depends on commit 441a3303 ("net: xdp: don't allow
      device-bound programs in driver mode"), so this fix is sufficient
      only in v4.15 or newer.  Kernels v4.13.x and v4.14.x do need to
      stop offload if it was transparent/opportunistic, i.e. if
      XDP_FLAGS_HW_MODE was not set on running program.
      
      Fixes: cafa92ac ("nfp: bpf: add support for XDP_FLAGS_HW_MODE")
      Signed-off-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: default avatarQuentin Monnet <quentin.monnet@netronome.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      68d676a0
  6. 21 Jun, 2018 16 commits
  7. 20 Jun, 2018 9 commits
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 1abd8a8f
      Linus Torvalds authored
      Pull rdma fixes from Jason Gunthorpe:
       "Here are eight fairly small fixes collected over the last two weeks.
      
        Regression and crashing bug fixes:
      
         - mlx4/5: Fixes for issues found from various checkers
      
         - A resource tracking and uverbs regression in the core code
      
         - qedr: NULL pointer regression found during testing
      
         - rxe: Various small bugs"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
        IB/rxe: Fix missing completion for mem_reg work requests
        RDMA/core: Save kernel caller name when creating CQ using ib_create_cq()
        IB/uverbs: Fix ordering of ucontext check in ib_uverbs_write
        IB/mlx4: Fix an error handling path in 'mlx4_ib_rereg_user_mr()'
        RDMA/qedr: Fix NULL pointer dereference when running over iWARP without RDMA-CM
        IB/mlx5: Fix return value check in flow_counters_set_data()
        IB/mlx5: Fix memory leak in mlx5_ib_create_flow
        IB/rxe: avoid double kfree skb
      1abd8a8f
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · d8894a08
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix crash on bpf_prog_load() errors, from Daniel Borkmann.
      
       2) Fix ATM VCC memory accounting, from David Woodhouse.
      
       3) fib6_info objects need RCU freeing, from Eric Dumazet.
      
       4) Fix SO_BINDTODEVICE handling for TCP sockets, from David Ahern.
      
       5) Fix clobbered error code in enic_open() failure path, from
          Govindarajulu Varadarajan.
      
       6) Propagate dev_get_valid_name() error returns properly, from Li
          RongQing.
      
       7) Fix suspend/resume in davinci_emac driver, from Bartosz Golaszewski.
      
       8) Various act_ife fixes (recursive locking, IDR leaks, etc.) from
          Davide Caratti.
      
       9) Fix buggy checksum handling in sungem driver, from Eric Dumazet.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (40 commits)
        ip: limit use of gso_size to udp
        stmmac: fix DMA channel hang in half-duplex mode
        net: stmmac: socfpga: add additional ocp reset line for Stratix10
        net: sungem: fix rx checksum support
        bpfilter: ignore binary files
        bpfilter: fix build error
        net/usb/drivers: Remove useless hrtimer_active check
        net/sched: act_ife: preserve the action control in case of error
        net/sched: act_ife: fix recursive lock and idr leak
        net: ethernet: fix suspend/resume in davinci_emac
        net: propagate dev_get_valid_name return code
        enic: do not overwrite error code
        net/tcp: Fix socket lookups with SO_BINDTODEVICE
        ptp: replace getnstimeofday64() with ktime_get_real_ts64()
        net/ipv6: respect rcu grace period before freeing fib6_info
        net: net_failover: fix typo in net_failover_slave_register()
        ipvlan: use ETH_MAX_MTU as max mtu
        net: hamradio: use eth_broadcast_addr
        enic: initialize enic->rfs_h.lock in enic_probe
        MAINTAINERS: Add Sam as the maintainer for NCSI
        ...
      d8894a08
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid · 81e97f01
      Linus Torvalds authored
      Pull HID fixes from Jiri Kosina:
      
       - Wacom 2nd-gen Intuos Pro large Y axis handling fix from Jason Gerecke
      
       - fix for hibernation in Intel ISH driver, from Even Xu
      
       - crash fix for hid-steam driver, from Rodrigo Rivas Costa
      
       - new device ID addition to google-hammer driver
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
        HID: wacom: Correct logical maximum Y for 2nd-gen Intuos Pro large
        HID: intel_ish-hid: ipc: register more pm callbacks to support hibernation
        HID: steam: use hid_device.driver_data instead of hid_set_drvdata()
        HID: google: Add support for whiskers
      81e97f01
    • Linus Torvalds's avatar
      Merge tag 'dma-rename-4.18' of git://git.infradead.org/users/hch/dma-mapping · 6d90eb7b
      Linus Torvalds authored
      Pull dma-mapping rename from Christoph Hellwig:
       "Move all the dma-mapping code to kernel/dma and lose their dma-*
        prefixes"
      
      * tag 'dma-rename-4.18' of git://git.infradead.org/users/hch/dma-mapping:
        dma-mapping: move all DMA mapping code to kernel/dma
        dma-mapping: use obj-y instead of lib-y for generic dma ops
      6d90eb7b
    • Jason Gerecke's avatar
      HID: wacom: Correct logical maximum Y for 2nd-gen Intuos Pro large · d471b6b2
      Jason Gerecke authored
      The HID descriptor for the 2nd-gen Intuos Pro large (PTH-860) contains
      a typo which defines an incorrect logical maximum Y value. This causes
      a small portion of the bottom of the tablet to become unusable (both
      because the area is below the "bottom" of the tablet and because
      'wacom_wac_event' ignores out-of-range values). It also results in a
      skewed aspect ratio.
      
      To fix this, we add a quirk to 'wacom_usage_mapping' which overwrites
      the data with the correct value.
      Signed-off-by: default avatarJason Gerecke <jason.gerecke@wacom.com>
      CC: stable@vger.kernel.org # v4.10+
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      d471b6b2
    • Even Xu's avatar
      HID: intel_ish-hid: ipc: register more pm callbacks to support hibernation · ebeaa367
      Even Xu authored
      Current ISH driver only registers suspend/resume PM callbacks which don't
      support hibernation (suspend to disk). Basically after hiberation, the ISH
      can't resume properly and user may not see sensor events (for example: screen
      		rotation may not work).
      
      User will not see a crash or panic or anything except the following message
      in log:
      
      	hid-sensor-hub 001F:8086:22D8.0001: timeout waiting for response from ISHTP device
      
      So this patch adds support for S4/hiberbation to ISH by using the
      SIMPLE_DEV_PM_OPS() MACRO instead of struct dev_pm_ops directly. The suspend
      and resume functions will now be used for both suspend to RAM and hibernation.
      
      If power management is disabled, SIMPLE_DEV_PM_OPS will do nothing, the suspend
      and resume related functions won't be used, so mark them as __maybe_unused to
      clarify that this is the intended behavior, and remove #ifdefs for power
      management.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarEven Xu <even.xu@intel.com>
      Acked-by: default avatarSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      ebeaa367
    • Rodrigo Rivas Costa's avatar
      HID: steam: use hid_device.driver_data instead of hid_set_drvdata() · 4bff980f
      Rodrigo Rivas Costa authored
      When creating the low-level hidraw device, the reference to steam_device
      was stored using hid_set_drvdata(). But this value is not guaranteed to
      be kept when set before calling probe. If this pointer is reset, it
      crashes when opening the emulated hidraw device.
      
      It looks like hid_set_drvdata() is for users "avobe" this hid_device,
      while hid_device.driver_data it for users "below" this one.
      
      In this case, we are creating a virtual hidraw device, so we must use
      hid_device.driver_data.
      Signed-off-by: default avatarRodrigo Rivas Costa <rodrigorivascosta@gmail.com>
      Tested-by: default avatarMariusz Ceier <mceier+kernel@gmail.com>
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      4bff980f
    • Linus Torvalds's avatar
      proc: fix missing final NUL in get_mm_cmdline() rewrite · f5b65348
      Linus Torvalds authored
      The rewrite of the cmdline fetching missed the fact that we used to also
      return the final terminating NUL character of the last argument.  I
      hadn't noticed, and none of the tools I tested cared, but something
      obviously must care, because Michal Kubecek noticed the change in
      behavior.
      
      Tweak the "find the end" logic to actually include the NUL character,
      and once past the eend of argv, always start the strnlen() at the
      expected (original) argument end.
      
      This whole "allow people to rewrite their arguments in place" is a nasty
      hack and requires that odd slop handling at the end of the argv array,
      but it's our traditional model, so we continue to support it.
      Repored-and-bisected-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Reviewed-and-tested-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f5b65348
    • Willem de Bruijn's avatar
      ip: limit use of gso_size to udp · 9887cba1
      Willem de Bruijn authored
      The ipcm(6)_cookie field gso_size is set only in the udp path. The ip
      layer copies this to cork only if sk_type is SOCK_DGRAM. This check
      proved too permissive. Ping and l2tp sockets have the same type.
      
      Limit to sockets of type SOCK_DGRAM and protocol IPPROTO_UDP to
      exclude ping sockets.
      
      v1 -> v2
      - remove irrelevant whitespace changes
      
      Fixes: bec1f6f6 ("udp: generate gso with UDP_SEGMENT")
      Reported-by: default avatarMaciej Żenczykowski <maze@google.com>
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9887cba1