1. 03 Nov, 2015 5 commits
    • bpf: add support for persistent maps/progs · b2197755
      Daniel Borkmann authored
      This work adds support for "persistent" eBPF maps/programs. Here,
      "persistent" means that maps/programs have a facility that lets them
      survive process termination. This is desired by various eBPF
      subsystem users.
      
      Just to name one example: tc classifier/action. Whenever tc parses
      the ELF object, extracts and loads maps/progs into the kernel, these
      file descriptors will be out of reach after the tc instance exits.
      So a subsequent tc invocation won't be able to access/relocate on this
      resource, and therefore maps cannot easily be shared, f.e. between the
      ingress and egress networking data path.
      
      The current workaround is that Unix domain sockets (UDS) need to be
      instrumented in order to pass the created eBPF map/program file
      descriptors to a third party management daemon through UDS' socket
      passing facility. This makes it a bit complicated to deploy shared
      eBPF maps or programs (programs f.e. for tail calls) among various
      processes.
      
      We've been brainstorming on how we could tackle this issue and various
      approaches have been tried out so far, which are discussed further in
      the reference below.
      
      The architecture we eventually ended up with is a minimal file system
      that can hold map/prog objects. The file system is a per mount namespace
      singleton, and the default mount point is /sys/fs/bpf/. Any subsequent
      mounts within a given namespace will point to the same instance. The
      file system allows for creating a user-defined directory structure.
      The objects for maps/progs are created/fetched through bpf(2) with
      two new commands (BPF_OBJ_PIN/BPF_OBJ_GET). To pin an object, a bpf
      file descriptor along with a pathname is passed to bpf(2), which in
      turn creates the file system node (we call this eBPF object pinning).
      To get a new BPF file descriptor for an existing node, only the
      pathname is passed to bpf(2). The user can then use that descriptor to
      access maps and progs later on, through bpf(2). Removal of file system
      nodes is managed through normal VFS functions such as unlink(2), etc.
      The file system code is kept to a minimum and can be further extended
      later on.
      
      The next step I'm working on is to add dump eBPF map/prog commands
      to bpf(2), so that a specification from a given file descriptor can
      be retrieved. This can be used by things like CRIU but also applications
      can inspect the meta data after calling BPF_OBJ_GET.
      
      Big thanks also to Alexei and Hannes who significantly contributed
      in the design discussion that eventually let us end up with this
      architecture here.
      
      Reference: https://lkml.org/lkml/2015/10/15/925
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
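The pin/get flow described in the commit above can be sketched from user space roughly as follows. This is a minimal sketch, assuming kernel headers that already define BPF_OBJ_PIN and BPF_OBJ_GET and the pathname/bpf_fd members of union bpf_attr; the wrapper names sys_bpf(), bpf_obj_pin_path() and bpf_obj_get_path() are hypothetical and not part of the commit.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical thin wrapper: issue bpf(2) through syscall(2). */
static int sys_bpf(int cmd, union bpf_attr *attr)
{
    return (int)syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/*
 * Pin an already loaded map/prog fd at pathname inside a mounted bpf
 * file system. Returns 0 on success, -1 with errno set on failure.
 */
static int bpf_obj_pin_path(int fd, const char *pathname)
{
    union bpf_attr attr;

    assert(pathname != NULL);
    memset(&attr, 0, sizeof(attr));
    attr.pathname = (uint64_t)(uintptr_t)pathname;
    attr.bpf_fd = (uint32_t)fd;
    return sys_bpf(BPF_OBJ_PIN, &attr);
}

/*
 * Fetch a new fd for a previously pinned object; only the pathname is
 * needed. Returns the fd on success, -1 with errno set on failure.
 */
static int bpf_obj_get_path(const char *pathname)
{
    union bpf_attr attr;

    assert(pathname != NULL);
    memset(&attr, 0, sizeof(attr));
    attr.pathname = (uint64_t)(uintptr_t)pathname;
    return sys_bpf(BPF_OBJ_GET, &attr);
}
```

One process would call bpf_obj_pin_path(map_fd, path) with a path under the bpf mount point, and a later, unrelated process would call bpf_obj_get_path(path) to obtain its own descriptor; unlink(2) on the path removes the node again.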
    • bpf: consolidate bpf_prog_put{, _rcu} dismantle paths · e9d8afa9
      Daniel Borkmann authored
      We currently have duplicated cleanup code in the bpf_prog_put() and
      bpf_prog_put_rcu() paths. Back then we decided that it was
      not worth it to make it a common helper called by both, but with
      the recent addition of resource charging, we could have avoided
      the fix in commit ac00737f ("bpf: Need to call bpf_prog_uncharge_memlock
      from bpf_prog_put") if we had had only a single, common path.
      We can simplify it further by assigning aux->prog only once during
      allocation time.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
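The idea of funnelling both put variants through one teardown helper can be illustrated with a user-space analogy. This is only a sketch under stated assumptions: struct prog, total_charged, the deferred list and all function names are hypothetical stand-ins (for the program object, memlock accounting and an RCU callback queue respectively), and the code is single-threaded, unlike the kernel paths.

```c
#include <assert.h>
#include <stdlib.h>

struct prog {
    int refcnt;
    size_t charged;             /* bytes charged against an accounting limit */
    struct prog *next_deferred; /* link in the deferred-free list */
};

static size_t total_charged;    /* stand-in for resource accounting */
static struct prog *deferred;   /* stand-in for a deferred callback queue */

static struct prog *prog_alloc(size_t size)
{
    struct prog *p = calloc(1, sizeof(*p));

    if (!p)
        return NULL;
    p->refcnt = 1;
    p->charged = size;
    total_charged += size;
    return p;
}

/* The single teardown path: uncharging can no longer be forgotten. */
static void prog_put_common(struct prog *p)
{
    assert(p->refcnt == 0);
    total_charged -= p->charged;
    free(p);
}

/* Immediate variant: tear down as soon as the last reference drops. */
static void prog_put(struct prog *p)
{
    if (--p->refcnt == 0)
        prog_put_common(p);
}

/* Deferred variant: queue the object, tear down later via the same helper. */
static void prog_put_deferred(struct prog *p)
{
    if (--p->refcnt == 0) {
        p->next_deferred = deferred;
        deferred = p;
    }
}

/* Run all queued teardowns. */
static void run_deferred(void)
{
    while (deferred) {
        struct prog *p = deferred;

        deferred = p->next_deferred;
        prog_put_common(p);
    }
}
```

Because both variants end in prog_put_common(), a new cleanup step such as uncharging only has to be added in one place.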
    • bpf: align and clean bpf_{map,prog}_get helpers · c2101297
      Daniel Borkmann authored
      Add a bpf_map_get() function that we're going to use later on and
      align/clean the remaining helpers so that they are more consistent:
      
        - __bpf_map_get() and __bpf_prog_get() that both work on the fd
          struct, check whether the descriptor is eBPF and return the
          pointer to the map/prog stored in the private data.
      
          Also, we can return f.file->private_data directly; the function
          signature is documentation enough already.
      
        - bpf_map_get() and bpf_prog_get() that both work on u32 user fd,
          call their respective __bpf_map_get()/__bpf_prog_get() variants,
          and take a reference.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
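The two-level helper layout described above (a low-level variant that only validates and returns the pointer, and a wrapper that looks up the user fd and takes a reference) might look like this in a user-space analogy. Everything here is a hypothetical stand-in: fd_table, struct file, struct map and the function names are illustrative, and NULL is returned where kernel code would typically return an error pointer.

```c
#include <assert.h>
#include <stddef.h>

enum file_kind { KIND_OTHER, KIND_MAP };

struct map {
    int refcnt;
};

struct file {
    enum file_kind kind;  /* what kind of object backs this descriptor */
    void *private_data;   /* the object itself */
};

#define MAX_FDS 8
static struct file *fd_table[MAX_FDS]; /* stand-in for a process fd table */

static struct file *fd_lookup(unsigned int ufd)
{
    return ufd < MAX_FDS ? fd_table[ufd] : NULL;
}

/*
 * Low-level variant: check that the descriptor really refers to a map
 * and return the stored pointer. Does NOT take a reference.
 */
static struct map *__map_get(struct file *f)
{
    if (!f || f->kind != KIND_MAP)
        return NULL;
    return f->private_data;
}

/* High-level variant: resolve the user fd, validate, take a reference. */
static struct map *map_get(unsigned int ufd)
{
    struct map *m = __map_get(fd_lookup(ufd));

    if (m) {
        assert(m->refcnt > 0); /* a live object holds at least one reference */
        m->refcnt++;
    }
    return m;
}
```

Callers that already hold the descriptor and only need a short look at the object can use the low-level variant; callers that keep the pointer around use the reference-taking one.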
    • bpf: abstract anon_inode_getfd invocations · aa79781b
      Daniel Borkmann authored
      Since we're going to use anon_inode_getfd() invocations in more than just
      the current places, make a helper function for each of maps and progs, so
      that we only need to pass a map/prog pointer to the helper itself in order
      to get a fd. The new helpers are called bpf_map_new_fd() and
      bpf_prog_new_fd().
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fix percpu memory leaks · 1d6119ba
      Eric Dumazet authored
      This patch fixes the following problems:

      1) percpu_counter_init() can return an error, therefore
         init_frag_mem_limit() must propagate this error so that
         inet_frags_init_net() can do the same up to its callers.

      2) If ip[46]_frags_ns_ctl_register() fails, we must unwind
         properly and free the percpu_counter.

      Without this fix, we leave a freed object in the percpu_counters
      global list (if CONFIG_HOTPLUG_CPU), leading to crashes.

      This bug was detected by KASAN and the syzkaller tool
      (http://github.com/google/syzkaller)
      
      Fixes: 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
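The two points of the fix above (propagate the error from the first init step, and undo that step if a later one fails) follow the usual goto-unwind pattern. The following is a user-space sketch of that pattern only, not the kernel code: struct counter, live_counters, the fail flags and all function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct counter {
    long *slots;
};

static int live_counters; /* number of counters currently registered */

/* First init step; the fail flag simulates an allocation failure. */
static int counter_init(struct counter *c, int fail)
{
    if (fail)
        return -ENOMEM;
    c->slots = calloc(4, sizeof(long));
    if (!c->slots)
        return -ENOMEM;
    live_counters++;
    return 0;
}

static void counter_destroy(struct counter *c)
{
    free(c->slots);
    c->slots = NULL;
    live_counters--;
}

/* Second init step; the fail flag simulates a registration failure. */
static int ctl_register(int fail)
{
    return fail ? -ENOMEM : 0;
}

static int frags_init(struct counter *c, int fail_counter, int fail_register)
{
    int err;

    assert(c != NULL);

    err = counter_init(c, fail_counter);
    if (err)
        return err;            /* point 1: propagate, nothing to undo yet */

    err = ctl_register(fail_register);
    if (err)
        goto err_counter;      /* point 2: undo the first step */

    return 0;

err_counter:
    counter_destroy(c);
    return err;
}
```

After a failed frags_init() no counter remains registered, so nothing dangling is left behind for later code to trip over.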
  2. 02 Nov, 2015 17 commits
  3. 01 Nov, 2015 10 commits
  4. 31 Oct, 2015 5 commits
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client · 523e1345
      Linus Torvalds authored
      Pull Ceph fix from Sage Weil:
       "This sets the stable pages flag on the RBD block device when we have
        CRCs enabled.  (This is necessary since the default assumption for
        block devices changed in 3.9)"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
        rbd: require stable pages if message data CRCs are enabled
    • Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs · 4bb0fb57
      Linus Torvalds authored
      Pull overlayfs bug fixes from Miklos Szeredi:
       "This contains fixes for bugs that appeared in earlier kernels (all are
        marked for -stable)"
      
      * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
        ovl: free lower_mnt array in ovl_put_super
        ovl: free stack of paths in ovl_fill_super
        ovl: fix open in stacked overlay
        ovl: fix dentry reference leak
        ovl: use O_LARGEFILE in ovl_copy_up()
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · c94eee8a
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix two regressions in ipv6 route lookups, particularly wrt output
          interface specifications in the lookup key.  From David Ahern.
      
       2) Fix checks in ipv6 IPSEC tunnel pre-encap fragmentation, from
          Herbert Xu.
      
       3) Fix mis-advertisement of 1000BASE-T on bcm63xx_enet, from Simon
          Arlott.
      
       4) Some smsc phys misbehave with energy detect mode enabled, so add a
          DT property and disable it on such switches.  From Heiko Schocher.
      
       5) Fix TSO corruption on TX in mv643xx_eth, from Philipp Kirchhofer.
      
       6) Fix regression added by removal of openvswitch vport stats, from
          James Morse.
      
       7) Vendor Kconfig options should be bool, not tristate, from Andreas
          Schwab.
      
       8) Use non-_BH() net stats bump in tcp_xmit_probe_skb(), otherwise we
          barf during TCP REPAIR operations.
      
       9) Fix various bugs in openvswitch conntrack support, from Joe
          Stringer.
      
      10) Fix NETLINK_LIST_MEMBERSHIPS locking, from David Herrmann.
      
      11) Don't have VSOCK do sock_put() in interrupt context, from Jorgen
          Hansen.
      
      12) Fix skb_realloc_headroom() failures properly in ISDN, from Karsten
          Keil.
      
      13) Add some device IDs to qmi_wwan, from Bjorn Mork.
      
      14) Fix ovs egress tunnel information when using lwtunnel devices, from
          Pravin B Shelar.
      
      15) Add missing NETIF_F_FRAGLIST to macvtap feature list, from Jason
          Wang.
      
      16) Fix incorrect handling of throw routes when the result of the throw
          cannot find a match, from Xin Long.
      
      17) Protect ipv6 MTU calculations from wrap-around, from Hannes Frederic
          Sowa.
      
      18) Fix failed autonegotiation on KSZ9031 micrel PHYs, from Nathan
          Sullivan.
      
      19) Add missing memory barriers in descriptor accesses of the xgbe
          driver, from Thomas Lendacky.

      20) Fix release condition test in pppoe_release(), from Guillaume Nault.
      
      21) Fix gianfar bugs wrt filter configuration, from Claudiu Manoil.
      
      22) Fix violations of RX buffer alignment in sh_eth driver, from Sergei
          Shtylyov.
      
      23) Fix missing of_node_put() calls in various places around the
          networking code, from Julia Lawall.

      24) Fix incorrect leaf node walking in ipv4 routing tree, from Alexander
          Duyck.
      
      25) RDS doesn't check pskb_pull()/pskb_trim() return values, from
          Sowmini Varadhan.
      
      26) Fix VLAN configuration in mlx4 driver, from Jack Morgenstein.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (79 commits)
        ipv6: protect mtu calculation of wrap-around and infinite loop by rounding issues
        Revert "Merge branch 'ipv6-overflow-arith'"
        net/mlx4: Copy/set only sizeof struct mlx4_eqe bytes
        net/mlx4_en: Explicitly set no vlan tags in WQE ctrl segment when no vlan is present
        vhost: fix performance on LE hosts
        bpf: sample: define aarch64 specific registers
        amd-xgbe: Fix race between access of desc and desc index
        RDS-TCP: Recover correctly from pskb_pull()/pksb_trim() failure in rds_tcp_data_recv
        forcedeth: fix unilateral interrupt disabling in netpoll path
        openvswitch: Fix skb leak using IPv6 defrag
        ipv6: Export nf_ct_frag6_consume_orig()
        openvswitch: Fix double-free on ip_defrag() errors
        fib_trie: leaf_walk_rcu should not compute key if key is less than pn->key
        net: mv643xx_eth: add missing of_node_put
        ath6kl: add missing of_node_put
        net: phy: mdio: add missing of_node_put
        netdev/phy: add missing of_node_put
        net: netcp: add missing of_node_put
        net: thunderx: add missing of_node_put
        ipv6: gre: support SIT encapsulation
        ...
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 38dab9ac
      Linus Torvalds authored
      Pull input layer fixes from Dmitry Torokhov:
      
       - a change to the ALPS driver where we had to limit the quirk for
         trackstick handling, previously active on all Dells, to just a few
         models
      
       - a fix for a build dependency issue in the sur40 driver
      
       - a small clock handling fixup in the LPC32xx touchscreen driver
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: alps - only the Dell Latitude D420/430/620/630 have separate stick button bits
        Input: sur40 - add dependency on VIDEO_V4L2
        Input: lpc32xx_ts - fix warnings caused by enabling unprepared clock
    • Merge tag 'pci-v4.3-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · f9793e37
      Linus Torvalds authored
      Pull PCI fix from Bjorn Helgaas:
       "Sorry for this last-minute update; it's been in -next for quite a
        while, but I forgot about it until I started getting ready for the
        merge window.
      
        It's small and fixes a way a user could cause a panic via sysfs, so I
        think it's worth getting it in v4.3.
      
        NUMA:
          - Prevent out of bounds access in sysfs numa_node override (Sasha Levin)"
      
      * tag 'pci-v4.3-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
        PCI: Prevent out of bounds access in numa_node override
  5. 30 Oct, 2015 3 commits
    • Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 9b971e77
      Linus Torvalds authored
      Pull arm64 fixes from Will Deacon:
       "Apologies for this being so late, but we've uncovered a few nasty
        issues on arm64 which didn't settle down until yesterday and the fixes
        all look suitable for 4.3.  Of the four patches, three of them are
        Cc'd to stable, with the remaining patch fixing an issue that only
        took effect during the merge window.
      
        Summary:
      
         - Fix corruption in SWP emulation when STXR fails due to contention
         - Fix MMU re-initialisation when resuming from a low-power state
         - Fix stack unwinding code to match what ftrace expects
         - Fix relocation code in the EFI stub when DRAM base is not 2MB aligned"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64/efi: do not assume DRAM base is aligned to 2 MB
        Revert "ARM64: unwind: Fix PC calculation"
        arm64: kernel: fix tcr_el1.t0sz restore on systems with extended idmap
        arm64: compat: fix stxr failure case in SWP emulation
    • Merge tag 'please-pull-syscalls' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux · 7c0f488f
      Linus Torvalds authored
      Pull ia64 kcmp syscall from Tony Luck:
       "Missed adding the kcmp() syscall a long time ago.  Now it seems that
        it is essential to build systemd"
      
      * tag 'please-pull-syscalls' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
        [IA64] Wire up kcmp syscall
    • rbd: require stable pages if message data CRCs are enabled · bae818ee
      Ronny Hegewald authored
      rbd requires stable pages, as it computes a CRC of the page data before
      the pages are sent to the OSDs.
      
      But since kernel 3.9 (patch 1d1d1a76
      "mm: only enforce stable page writes if the backing device requires
      it") it is not assumed anymore that block devices require stable pages.
      
      This patch sets the necessary flag to get stable pages back for rbd.
      
      In a ceph installation that provides multiple ext4 formatted rbd
      devices, "bad crc" messages appeared regularly (about 1 message every
      1-2 minutes on every OSD that provided the data for the rbd) in the
      OSD logs before this patch. After this patch these messages are pretty
      much gone (only about 1-2 / month / OSD).
      
      Cc: stable@vger.kernel.org # 3.9+, needs backporting
      Signed-off-by: Ronny Hegewald <Ronny.Hegewald@online.de>
      [idryomov@gmail.com: require stable pages only in crc case, changelog]
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
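Why a checksum taken before sending needs stable pages can be shown with a small user-space sketch: if the buffer changes between the checksum and the transmission, the receiver sees a mismatch. The bitwise CRC-32 below is only a stand-in for whatever checksum the messenger actually uses, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain bitwise CRC-32 (reflected polynomial 0xEDB88320), used as a stand-in. */
static uint32_t crc32_bitwise(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Sender side: checksum the page contents as they are right now. */
static uint32_t sender_checksum(const unsigned char *page, size_t len)
{
    return crc32_bitwise(page, len);
}

/*
 * Receiver side: recompute over the bytes that actually arrived and
 * compare with the checksum carried in the message. Returns 1 if the
 * data is accepted, 0 on a "bad crc".
 */
static int receiver_accepts(const unsigned char *page, size_t len, uint32_t crc)
{
    return crc32_bitwise(page, len) == crc;
}
```

If a writer modifies the page after sender_checksum() has run but before the bytes go out, receiver_accepts() fails even though nothing was corrupted in transit; keeping the page stable during writeback closes that window.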