1. 25 Apr, 2018 2 commits
    • Gianluca Borello's avatar
      bpf, x64: fix JIT emission for dead code · 1612a981
      Gianluca Borello authored
      Commit 2a5418a1 ("bpf: improve dead code sanitizing") replaced dead
      code with a series of ja-1 instructions, for safety. That made JIT
      compilation much more complex for some BPF programs. One instance of such
      programs is, for example:
      
      bool flag = false
      ...
      /* A bunch of other code */
      ...
      if (flag)
              do_something()
      
      In some cases llvm is not able to remove at compile time the code for
      do_something(), so the generated BPF program ends up with a large amount
      of dead instructions. In one specific real life example, there are two
      series of ~500 and ~1000 dead instructions in the program. When the
      verifier replaces them with a series of ja-1 instructions, it causes an
      interesting behavior at JIT time.
      
      During the first pass, since all the instructions are estimated at 64
      bytes, the ja-1 instructions end up being translated as 5 bytes JMP
      instructions (0xE9), since the jump offsets become increasingly large (>
      127) as each instruction gets discovered to be 5 bytes instead of the
      estimated 64.
      
      Starting from the second pass, the first N instructions of the ja-1
      sequence get translated into 2 bytes JMPs (0xEB) because the jump offsets
      become <= 127 this time. In particular, N is defined as roughly 127 / (5
      - 2) ~= 42. So, each further pass will make the subsequent N JMP
      instructions shrink from 5 to 2 bytes, making the image shrink every time.
      This means that in order to have the entire program converge, there need
      to be, in the real example above, at least ~1000 / 42 ~= 24 passes just
      for translating the dead code. If we add this number to the passes needed
      to translate the other non dead code, it brings such program to 40+
      passes, and JIT doesn't complete. Ultimately the userspace loader fails
      because such BPF program was supposed to be part of a prog array owner
      being JITed.
      
      While it is certainly possible to try to refactor such programs to help
      the compiler remove dead code, the behavior is not really intuitive and it
      puts further burden on the BPF developer who is not expecting such
      behavior. To make things worse, such programs are working just fine in all
      the kernel releases prior to the ja-1 fix.
      
      A possible approach to mitigate this behavior consists into noticing that
      for ja-1 instructions we don't really need to rely on the estimated size
      of the previous and current instructions, we know that a -1 BPF jump
      offset can be safely translated into a 0xEB instruction with a jump offset
      of -2.
      
      Such fix brings the BPF program in the previous example to complete again
      in ~9 passes.
      
      Fixes: 2a5418a1 ("bpf: improve dead code sanitizing")
      Signed-off-by: default avatarGianluca Borello <g.borello@gmail.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      1612a981
    • William Tu's avatar
      bpf: clear the ip_tunnel_info. · 5540fbf4
      William Tu authored
      The percpu metadata_dst might carry the stale ip_tunnel_info
      and cause incorrect behavior.  When mixing tests using ipv4/ipv6
      bpf vxlan and geneve tunnel, the ipv6 tunnel info incorrectly uses
      ipv4's src ip addr as its ipv6 src address, because the previous
      tunnel info does not clean up.  The patch zeros the fields in
      ip_tunnel_info.
      Signed-off-by: default avatarWilliam Tu <u9012063@gmail.com>
      Reported-by: default avatarYifeng Sun <pkusunyifeng@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      5540fbf4
  2. 23 Apr, 2018 19 commits
    • Daniel Borkmann's avatar
      Merge branch 'bpf-sockmap-fixes' · b3f8adee
      Daniel Borkmann authored
      John Fastabend says:
      
      ====================
      While testing sockmap with more programs (besides our test programs)
      I found a couple issues.
      
      The attached series fixes an issue where pinned maps were not
      working correctly, blocking sockets returned zero, and an error
      path that when the sock hit an out of memory case resulted in a
      double page_put() while doing ingress redirects.
      
      See individual patches for more details.
      
      v2: Incorporated Daniel's feedback to use map ops for uref put op
          which also fixed the build error discovered in v1.
      v3: rename map_put_uref to map_release_uref
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      b3f8adee
    • John Fastabend's avatar
      bpf: sockmap, fix double page_put on ENOMEM error in redirect path · 4fcfdfb8
      John Fastabend authored
      In the case where the socket memory boundary is hit the redirect
      path returns an ENOMEM error. However, before checking for this
      condition the redirect scatterlist buffer is setup with a valid
      page and length. This is never unwound so when the buffers are
      released latter in the error path we do a put_page() and clear
      the scatterlist fields. But, because the initial error happens
      before completing the scatterlist buffer we end up with both the
      original buffer and the redirect buffer pointing to the same page
      resulting in duplicate put_page() calls.
      
      To fix this simply move the initial configuration of the redirect
      scatterlist buffer below the sock memory check.
      
      Found this while running TCP_STREAM test with netperf using Cilium.
      
      Fixes: fa246693 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      4fcfdfb8
    • John Fastabend's avatar
      bpf: sockmap, sk_wait_event needed to handle blocking cases · e20f7334
      John Fastabend authored
      In the recvmsg handler we need to add a wait event to support the
      blocking use cases. Without this we return zero and may confuse
      user applications. In the wait event any data received on the
      sk either via sk_receive_queue or the psock ingress list will
      wake up the sock.
      
      Fixes: fa246693 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      e20f7334
    • John Fastabend's avatar
      bpf: sockmap, map_release does not hold refcnt for pinned maps · ba6b8de4
      John Fastabend authored
      Relying on map_release hook to decrement the reference counts when a
      map is removed only works if the map is not being pinned. In the
      pinned case the ref is decremented immediately and the BPF programs
      released. After this BPF programs may not be in-use which is not
      what the user would expect.
      
      This patch moves the release logic into bpf_map_put_uref() and brings
      sockmap in-line with how a similar case is handled in prog array maps.
      
      Fixes: 3d9e9526 ("bpf: sockmap, fix leaking maps with attached but not detached progs")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      ba6b8de4
    • John Fastabend's avatar
      bpf: sockmap sample use clang flag, -target bpf · 4dfe1bb9
      John Fastabend authored
      Per Documentation/bpf/bpf_devel_QA.txt add the -target flag to the
      sockmap Makefile. Relevant text quoted here,
      
         Otherwise, you can use bpf target. Additionally, you _must_ use
         bpf target when:
      
       - Your program uses data structures with pointer or long / unsigned
         long types that interface with BPF helpers or context data
         structures. Access into these structures is verified by the BPF
         verifier and may result in verification failures if the native
         architecture is not aligned with the BPF architecture, e.g. 64-bit.
         An example of this is BPF_PROG_TYPE_SK_MSG require '-target bpf'
      
      Fixes: 69e8cc13 ("bpf: sockmap sample program")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      4dfe1bb9
    • John Fastabend's avatar
      bpf: Document sockmap '-target bpf' requirement for PROG_TYPE_SK_MSG · 514d6c19
      John Fastabend authored
      BPF_PROG_TYPE_SK_MSG programs use a 'void *' for both data and the
      data_end pointers. Additionally, the verifier ensures that every
      accesses into the values is a __u64 read. This correctly maps on
      to the BPF 64-bit architecture.
      
      However, to ensure that when building on 32bit architectures that
      clang uses correct types the '-target bpf' option _must_ be
      specified. To make this clear add a note to the Documentation.
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      514d6c19
    • Roman Gushchin's avatar
      bpf: disable and restore preemption in __BPF_PROG_RUN_ARRAY · 6899b32b
      Roman Gushchin authored
      Running bpf programs requires disabled preemption,
      however at least some* of the BPF_PROG_RUN_ARRAY users
      do not follow this rule.
      
      To fix this bug, and also to make it not happen in the future,
      let's add explicit preemption disabling/re-enabling
      to the __BPF_PROG_RUN_ARRAY code.
      
      * for example:
       [   17.624472] RIP: 0010:__cgroup_bpf_run_filter_sk+0x1c4/0x1d0
       ...
       [   17.640890]  inet6_create+0x3eb/0x520
       [   17.641405]  __sock_create+0x242/0x340
       [   17.641939]  __sys_socket+0x57/0xe0
       [   17.642370]  ? trace_hardirqs_off_thunk+0x1a/0x1c
       [   17.642944]  SyS_socket+0xa/0x10
       [   17.643357]  do_syscall_64+0x79/0x220
       [   17.643879]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      6899b32b
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · 77621f02
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter/IPVS fixes for net
      
      The following patchset contains Netfilter/IPVS fixes for your net tree,
      they are:
      
      1) Fix SIP conntrack with phones sending session descriptions for different
         media types but same port numbers, from Florian Westphal.
      
      2) Fix incorrect rtnl_lock mutex logic from IPVS sync thread, from Julian
         Anastasov.
      
      3) Skip compat array allocation in ebtables if there is no entries, also
         from Florian.
      
      4) Do not lose left/right bits when shifting marks from xt_connmark, from
         Jack Ma.
      
      5) Silence false positive memleak in conntrack extensions, from Cong Wang.
      
      6) Fix CONFIG_NF_REJECT_IPV6=m link problems, from Arnd Bergmann.
      
      7) Cannot kfree rule that is already in list in nf_tables, switch order
         so this error handling is not required, from Florian Westphal.
      
      8) Release set name in error path, from Florian.
      
      9) include kmemleak.h in nf_conntrack_extend.c, from Stepheh Rothwell.
      
      10) NAT chain and extensions depend on NF_TABLES.
      
      11) Out of bound access when renaming chains, from Taehee Yoo.
      
      12) Incorrect casting in xt_connmark leads to wrong bitshifting.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      77621f02
    • Eric Dumazet's avatar
      ipv6: add RTA_TABLE and RTA_PREFSRC to rtm_ipv6_policy · aa8f8778
      Eric Dumazet authored
      KMSAN reported use of uninit-value that I tracked to lack
      of proper size check on RTA_TABLE attribute.
      
      I also believe RTA_PREFSRC lacks a similar check.
      
      Fixes: 86872cb5 ("[IPv6] route: FIB6 configuration using struct fib6_config")
      Fixes: c3968a85 ("ipv6: RTA_PREFSRC support for ipv6 route source address selection")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Acked-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      aa8f8778
    • Xin Long's avatar
      bonding: do not set slave_dev npinfo before slave_enable_netpoll in bond_enslave · ddea788c
      Xin Long authored
      After Commit 8a8efa22 ("bonding: sync netpoll code with bridge"), it
      would set slave_dev npinfo in slave_enable_netpoll when enslaving a dev
      if bond->dev->npinfo was set.
      
      However now slave_dev npinfo is set with bond->dev->npinfo before calling
      slave_enable_netpoll. With slave_dev npinfo set, __netpoll_setup called
      in slave_enable_netpoll will not call slave dev's .ndo_netpoll_setup().
      It causes that the lower dev of this slave dev can't set its npinfo.
      
      One way to reproduce it:
      
        # modprobe bonding
        # brctl addbr br0
        # brctl addif br0 eth1
        # ifconfig bond0 192.168.122.1/24 up
        # ifenslave bond0 eth2
        # systemctl restart netconsole
        # ifenslave bond0 br0
        # ifconfig eth2 down
        # systemctl restart netconsole
      
      The netpoll won't really work.
      
      This patch is to remove that slave_dev npinfo setting in bond_enslave().
      
      Fixes: 8a8efa22 ("bonding: sync netpoll code with bridge")
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ddea788c
    • Jann Horn's avatar
      tcp: don't read out-of-bounds opsize · 7e5a206a
      Jann Horn authored
      The old code reads the "opsize" variable from out-of-bounds memory (first
      byte behind the segment) if a broken TCP segment ends directly after an
      opcode that is neither EOL nor NOP.
      
      The result of the read isn't used for anything, so the worst thing that
      could theoretically happen is a pagefault; and since the physmap is usually
      mostly contiguous, even that seems pretty unlikely.
      
      The following C reproducer triggers the uninitialized read - however, you
      can't actually see anything happen unless you put something like a
      pr_warn() in tcp_parse_md5sig_option() to print the opsize.
      
      ====================================
      #define _GNU_SOURCE
      #include <arpa/inet.h>
      #include <stdlib.h>
      #include <errno.h>
      #include <stdarg.h>
      #include <net/if.h>
      #include <linux/if.h>
      #include <linux/ip.h>
      #include <linux/tcp.h>
      #include <linux/in.h>
      #include <linux/if_tun.h>
      #include <err.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <string.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/ioctl.h>
      #include <assert.h>
      
      void systemf(const char *command, ...) {
        char *full_command;
        va_list ap;
        va_start(ap, command);
        if (vasprintf(&full_command, command, ap) == -1)
          err(1, "vasprintf");
        va_end(ap);
        printf("systemf: <<<%s>>>\n", full_command);
        system(full_command);
      }
      
      char *devname;
      
      int tun_alloc(char *name) {
        int fd = open("/dev/net/tun", O_RDWR);
        if (fd == -1)
          err(1, "open tun dev");
        static struct ifreq req = { .ifr_flags = IFF_TUN|IFF_NO_PI };
        strcpy(req.ifr_name, name);
        if (ioctl(fd, TUNSETIFF, &req))
          err(1, "TUNSETIFF");
        devname = req.ifr_name;
        printf("device name: %s\n", devname);
        return fd;
      }
      
      #define IPADDR(a,b,c,d) (((a)<<0)+((b)<<8)+((c)<<16)+((d)<<24))
      
      void sum_accumulate(unsigned int *sum, void *data, int len) {
        assert((len&2)==0);
        for (int i=0; i<len/2; i++) {
          *sum += ntohs(((unsigned short *)data)[i]);
        }
      }
      
      unsigned short sum_final(unsigned int sum) {
        sum = (sum >> 16) + (sum & 0xffff);
        sum = (sum >> 16) + (sum & 0xffff);
        return htons(~sum);
      }
      
      void fix_ip_sum(struct iphdr *ip) {
        unsigned int sum = 0;
        sum_accumulate(&sum, ip, sizeof(*ip));
        ip->check = sum_final(sum);
      }
      
      void fix_tcp_sum(struct iphdr *ip, struct tcphdr *tcp) {
        unsigned int sum = 0;
        struct {
          unsigned int saddr;
          unsigned int daddr;
          unsigned char pad;
          unsigned char proto_num;
          unsigned short tcp_len;
        } fakehdr = {
          .saddr = ip->saddr,
          .daddr = ip->daddr,
          .proto_num = ip->protocol,
          .tcp_len = htons(ntohs(ip->tot_len) - ip->ihl*4)
        };
        sum_accumulate(&sum, &fakehdr, sizeof(fakehdr));
        sum_accumulate(&sum, tcp, tcp->doff*4);
        tcp->check = sum_final(sum);
      }
      
      int main(void) {
        int tun_fd = tun_alloc("inject_dev%d");
        systemf("ip link set %s up", devname);
        systemf("ip addr add 192.168.42.1/24 dev %s", devname);
      
        struct {
          struct iphdr ip;
          struct tcphdr tcp;
          unsigned char tcp_opts[20];
        } __attribute__((packed)) syn_packet = {
          .ip = {
            .ihl = sizeof(struct iphdr)/4,
            .version = 4,
            .tot_len = htons(sizeof(syn_packet)),
            .ttl = 30,
            .protocol = IPPROTO_TCP,
            /* FIXUP check */
            .saddr = IPADDR(192,168,42,2),
            .daddr = IPADDR(192,168,42,1)
          },
          .tcp = {
            .source = htons(1),
            .dest = htons(1337),
            .seq = 0x12345678,
            .doff = (sizeof(syn_packet.tcp)+sizeof(syn_packet.tcp_opts))/4,
            .syn = 1,
            .window = htons(64),
            .check = 0 /*FIXUP*/
          },
          .tcp_opts = {
            /* INVALID: trailing MD5SIG opcode after NOPs */
            1, 1, 1, 1, 1,
            1, 1, 1, 1, 1,
            1, 1, 1, 1, 1,
            1, 1, 1, 1, 19
          }
        };
        fix_ip_sum(&syn_packet.ip);
        fix_tcp_sum(&syn_packet.ip, &syn_packet.tcp);
        while (1) {
          int write_res = write(tun_fd, &syn_packet, sizeof(syn_packet));
          if (write_res != sizeof(syn_packet))
            err(1, "packet write failed");
        }
      }
      ====================================
      
      Fixes: cfb6eeb4 ("[TCP]: MD5 Signature Option (RFC2385) support.")
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7e5a206a
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 986e54cd
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2018-04-21
      
      The following pull-request contains BPF updates for your *net* tree.
      
      The main changes are:
      
      1) Fix a deadlock between mm->mmap_sem and bpf_event_mutex when
         one task is detaching a BPF prog via perf_event_detach_bpf_prog()
         and another one dumping through bpf_prog_array_copy_info(). For
         the latter we move the copy_to_user() out of the bpf_event_mutex
         lock to fix it, from Yonghong.
      
      2) Fix test_sock and test_sock_addr.sh failures. The former was
         hitting rlimit issues and the latter required ping to specify
         the address family, from Yonghong.
      
      3) Remove a dead check in sockmap's sock_map_alloc(), from Jann.
      
      4) Add generated files to BPF kselftests gitignore that were previously
         missed, from Anders.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      986e54cd
    • Thomas Falcon's avatar
      ibmvnic: Clean actual number of RX or TX pools · 660e309d
      Thomas Falcon authored
      Avoid using value stored in the login response buffer when
      cleaning TX and RX buffer pools since these could be inconsistent
      depending on the device state. Instead use the field in the driver's
      private data that tracks the number of active pools.
      Signed-off-by: default avatarThomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      660e309d
    • David S. Miller's avatar
      Merge branch 'net-sched-ife-malformed-ife-packet-fixes' · 906cce04
      David S. Miller authored
      Alexander Aring says:
      
      ====================
      net: sched: ife: malformed ife packet fixes
      
      As promised at netdev 2.2 tc workshop I am working on adding scapy support for
      tdc testing. It is still work in progress. I will submit the patches to tdc
      later (they are not in good shape yet). The good news is I have been able to
      find bugs which normal packet testing would not be able to find.
      With fuzzy testing I was able to craft certain malformed packets that IFE
      action was not able to deal with. This patch set fixes those bugs.
      
      changes since v4:
       - use pskb_may_pull before pointer assign
      
      changes since v3:
       - use pskb_may_pull
      
      changes since v2:
       - remove inline from __ife_tlv_meta_valid
       - add const to cast to meta_tlvhdr
       - add acked and reviewed tags
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      906cce04
    • Alexander Aring's avatar
      net: sched: ife: check on metadata length · d57493d6
      Alexander Aring authored
      This patch checks if sk buffer is available to dererence ife header. If
      not then NULL will returned to signal an malformed ife packet. This
      avoids to crashing the kernel from outside.
      Signed-off-by: default avatarAlexander Aring <aring@mojatatu.com>
      Reviewed-by: default avatarYotam Gigi <yotam.gi@gmail.com>
      Acked-by: default avatarJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d57493d6
    • Alexander Aring's avatar
      net: sched: ife: handle malformed tlv length · cc74eddd
      Alexander Aring authored
      There is currently no handling to check on a invalid tlv length. This
      patch adds such handling to avoid killing the kernel with a malformed
      ife packet.
      Signed-off-by: default avatarAlexander Aring <aring@mojatatu.com>
      Reviewed-by: default avatarYotam Gigi <yotam.gi@gmail.com>
      Acked-by: default avatarJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cc74eddd
    • Alexander Aring's avatar
      net: sched: ife: signal not finding metaid · f6cd1453
      Alexander Aring authored
      We need to record stats for received metadata that we dont know how
      to process. Have find_decode_metaid() return -ENOENT to capture this.
      Signed-off-by: default avatarAlexander Aring <aring@mojatatu.com>
      Reviewed-by: default avatarYotam Gigi <yotam.gi@gmail.com>
      Acked-by: default avatarJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f6cd1453
    • Doron Roberts-Kedes's avatar
      strparser: Do not call mod_delayed_work with a timeout of LONG_MAX · 7c5aba21
      Doron Roberts-Kedes authored
      struct sock's sk_rcvtimeo is initialized to
      LONG_MAX/MAX_SCHEDULE_TIMEOUT in sock_init_data. Calling
      mod_delayed_work with a timeout of LONG_MAX causes spurious execution of
      the work function. timer->expires is set equal to jiffies + LONG_MAX.
      When timer_base->clk falls behind the current value of jiffies,
      the delta between timer_base->clk and jiffies + LONG_MAX causes the
      expiration to be in the past. Returning early from strp_start_timer if
      timeo == LONG_MAX solves this problem.
      
      Found while testing net/tls_sw recv path.
      
      Fixes: 43a0c675 ("strparser: Stream parser for messages")
      Reviewed-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarDoron Roberts-Kedes <doronrk@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7c5aba21
    • Ahmed Abdelsalam's avatar
      ipv6: sr: fix NULL pointer dereference in seg6_do_srh_encap()- v4 pkts · a957fa19
      Ahmed Abdelsalam authored
      In case of seg6 in encap mode, seg6_do_srh_encap() calls set_tun_src()
      in order to set the src addr of outer IPv6 header.
      
      The net_device is required for set_tun_src(). However calling ip6_dst_idev()
      on dst_entry in case of IPv4 traffic results on the following bug.
      
      Using just dst->dev should fix this BUG.
      
      [  196.242461] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      [  196.242975] PGD 800000010f076067 P4D 800000010f076067 PUD 10f060067 PMD 0
      [  196.243329] Oops: 0000 [#1] SMP PTI
      [  196.243468] Modules linked in: nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd input_leds glue_helper led_class pcspkr serio_raw mac_hid video autofs4 hid_generic usbhid hid e1000 i2c_piix4 ahci pata_acpi libahci
      [  196.244362] CPU: 2 PID: 1089 Comm: ping Not tainted 4.16.0+ #1
      [  196.244606] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [  196.244968] RIP: 0010:seg6_do_srh_encap+0x1ac/0x300
      [  196.245236] RSP: 0018:ffffb2ce00b23a60 EFLAGS: 00010202
      [  196.245464] RAX: 0000000000000000 RBX: ffff8c7f53eea300 RCX: 0000000000000000
      [  196.245742] RDX: 0000f10000000000 RSI: ffff8c7f52085a6c RDI: ffff8c7f41166850
      [  196.246018] RBP: ffffb2ce00b23aa8 R08: 00000000000261e0 R09: ffff8c7f41166800
      [  196.246294] R10: ffffdce5040ac780 R11: ffff8c7f41166828 R12: ffff8c7f41166808
      [  196.246570] R13: ffff8c7f52085a44 R14: ffffffffb73211c0 R15: ffff8c7e69e44200
      [  196.246846] FS:  00007fc448789700(0000) GS:ffff8c7f59d00000(0000) knlGS:0000000000000000
      [  196.247286] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  196.247526] CR2: 0000000000000000 CR3: 000000010f05a000 CR4: 00000000000406e0
      [  196.247804] Call Trace:
      [  196.247972]  seg6_do_srh+0x15b/0x1c0
      [  196.248156]  seg6_output+0x3c/0x220
      [  196.248341]  ? prandom_u32+0x14/0x20
      [  196.248526]  ? ip_idents_reserve+0x6c/0x80
      [  196.248723]  ? __ip_select_ident+0x90/0x100
      [  196.248923]  ? ip_append_data.part.50+0x6c/0xd0
      [  196.249133]  lwtunnel_output+0x44/0x70
      [  196.249328]  ip_send_skb+0x15/0x40
      [  196.249515]  raw_sendmsg+0x8c3/0xac0
      [  196.249701]  ? _copy_from_user+0x2e/0x60
      [  196.249897]  ? rw_copy_check_uvector+0x53/0x110
      [  196.250106]  ? _copy_from_user+0x2e/0x60
      [  196.250299]  ? copy_msghdr_from_user+0xce/0x140
      [  196.250508]  sock_sendmsg+0x36/0x40
      [  196.250690]  ___sys_sendmsg+0x292/0x2a0
      [  196.250881]  ? _cond_resched+0x15/0x30
      [  196.251074]  ? copy_termios+0x1e/0x70
      [  196.251261]  ? _copy_to_user+0x22/0x30
      [  196.251575]  ? tty_mode_ioctl+0x1c3/0x4e0
      [  196.251782]  ? _cond_resched+0x15/0x30
      [  196.251972]  ? mutex_lock+0xe/0x30
      [  196.252152]  ? vvar_fault+0xd2/0x110
      [  196.252337]  ? __do_fault+0x1f/0xc0
      [  196.252521]  ? __handle_mm_fault+0xc1f/0x12d0
      [  196.252727]  ? __sys_sendmsg+0x63/0xa0
      [  196.252919]  __sys_sendmsg+0x63/0xa0
      [  196.253107]  do_syscall_64+0x72/0x200
      [  196.253305]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      [  196.253530] RIP: 0033:0x7fc4480b0690
      [  196.253715] RSP: 002b:00007ffde9f252f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      [  196.254053] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007fc4480b0690
      [  196.254331] RDX: 0000000000000000 RSI: 000000000060a360 RDI: 0000000000000003
      [  196.254608] RBP: 00007ffde9f253f0 R08: 00000000002d1e81 R09: 0000000000000002
      [  196.254884] R10: 00007ffde9f250c0 R11: 0000000000000246 R12: 0000000000b22070
      [  196.255205] R13: 20c49ba5e353f7cf R14: 431bde82d7b634db R15: 00007ffde9f278fe
      [  196.255484] Code: a5 0f b6 45 c0 41 88 41 28 41 0f b6 41 2c 48 c1 e0 04 49 8b 54 01 38 49 8b 44 01 30 49 89 51 20 49 89 41 18 48 8b 83 b0 00 00 00 <48> 8b 30 49 8b 86 08 0b 00 00 48 8b 40 20 48 8b 50 08 48 0b 10
      [  196.256190] RIP: seg6_do_srh_encap+0x1ac/0x300 RSP: ffffb2ce00b23a60
      [  196.256445] CR2: 0000000000000000
      [  196.256676] ---[ end trace 71af7d093603885c ]---
      
      Fixes: 8936ef76 ("ipv6: sr: fix NULL pointer dereference when setting encap source address")
      Signed-off-by: default avatarAhmed Abdelsalam <amsalam20@gmail.com>
      Acked-by: default avatarDavid Lebrun <dlebrun@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a957fa19
  3. 22 Apr, 2018 11 commits
  4. 20 Apr, 2018 8 commits
    • Jann Horn's avatar
      bpf: sockmap remove dead check · 6ab690aa
      Jann Horn authored
      Remove dead code that bails on `attr->value_size > KMALLOC_MAX_SIZE` - the
      previous check already bails on `attr->value_size != 4`.
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      6ab690aa
    • Linus Torvalds's avatar
      Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal · 83beed7b
      Linus Torvalds authored
      Pull thermal fixes from Eduardo Valentin:
       "A couple of fixes for the thermal subsystem"
      
      * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal:
        dt-bindings: thermal: Remove "cooling-{min|max}-level" properties
        dt-bindings: thermal: remove no longer needed samsung thermal properties
      83beed7b
    • Linus Torvalds's avatar
      Merge tag 'mmc-v4.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc · 7e3cb169
      Linus Torvalds authored
      Pull MMC fixes from Ulf Hansson:
       "A couple of MMC host fixes:
      
         - sdhci-pci: Fixup tuning for AMD for eMMC HS200 mode
      
         - renesas_sdhi_internal_dmac: Avoid data corruption by limiting
           DMA RX"
      
      * tag 'mmc-v4.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
        mmc: renesas_sdhi_internal_dmac: limit DMA RX for old SoCs
        mmc: sdhci-pci: Only do AMD tuning for HS200
      7e3cb169
    • Linus Torvalds's avatar
      Merge tag 'md/4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md · 7768ee3f
      Linus Torvalds authored
      Pull MD fixes from Shaohua Li:
       "Three small fixes for MD:
      
         - md-cluster fix for faulty device from Guoqing
      
         - writehint fix for writebehind IO for raid1 from Mariusz
      
         - a live lock fix for interrupted recovery from Yufen"
      
      * tag 'md/4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
        raid1: copy write hint from master bio to behind bio
        md/raid1: exit sync request if MD_RECOVERY_INTR is set
        md-cluster: don't update recovery_offset for faulty device
      7768ee3f
    • David Howells's avatar
      vfs: Undo an overly zealous MS_RDONLY -> SB_RDONLY conversion · a9e5b732
      David Howells authored
      In do_mount() when the MS_* flags are being converted to MNT_* flags,
      MS_RDONLY got accidentally convered to SB_RDONLY.
      
      Undo this change.
      
      Fixes: e462ec50 ("VFS: Differentiate mount flags (MS_*) from internal superblock flags")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a9e5b732
    • David Howells's avatar
      afs: Fix server record deletion · 66062592
      David Howells authored
      AFS server records get removed from the net->fs_servers tree when
      they're deleted, but not from the net->fs_addresses{4,6} lists, which
      can lead to an oops in afs_find_server() when a server record has been
      removed, for instance during rmmod.
      
      Fix this by deleting the record from the by-address lists before posting
      it for RCU destruction.
      
      The reason this hasn't been noticed before is that the fileserver keeps
      probing the local cache manager, thereby keeping the service record
      alive, so the oops would only happen when a fileserver eventually gets
      bored and stops pinging or if the module gets rmmod'd and a call comes
      in from the fileserver during the window between the server records
      being destroyed and the socket being closed.
      
      The oops looks something like:
      
        BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
        ...
        Workqueue: kafsd afs_process_async_call [kafs]
        RIP: 0010:afs_find_server+0x271/0x36f [kafs]
        ...
        Call Trace:
         afs_deliver_cb_init_call_back_state3+0x1f2/0x21f [kafs]
         afs_deliver_to_call+0x1ee/0x5e8 [kafs]
         afs_process_async_call+0x5b/0xd0 [kafs]
         process_one_work+0x2c2/0x504
         worker_thread+0x1d4/0x2ac
         kthread+0x11f/0x127
         ret_from_fork+0x24/0x30
      
      Fixes: d2ddc776 ("afs: Overhaul volume and server record caching and fileserver rotation")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      66062592
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · a72db42c
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Unbalanced refcounting in TIPC, from Jon Maloy.
      
       2) Only allow TCP_MD5SIG to be set on sockets in close or listen state.
          Once the connection is established it makes no sense to change this.
          From Eric Dumazet.
      
       3) Missing attribute validation in neigh_dump_table(), also from Eric
          Dumazet.
      
       4) Fix address comparisons in SCTP, from Xin Long.
      
       5) Neigh proxy table clearing can deadlock, from Wolfgang Bumiller.
      
       6) Fix tunnel refcounting in l2tp, from Guillaume Nault.
      
       7) Fix double list insert in team driver, from Paolo Abeni.
      
       8) af_vsock.ko module was accidently made unremovable, from Stefan
          Hajnoczi.
      
       9) Fix reference to freed llc_sap object in llc stack, from Cong Wang.
      
      10) Don't assume netdevice struct is DMA'able memory in virtio_net
          driver, from Michael S. Tsirkin.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits)
        net/smc: fix shutdown in state SMC_LISTEN
        bnxt_en: Fix memory fault in bnxt_ethtool_init()
        virtio_net: sparse annotation fix
        virtio_net: fix adding vids on big-endian
        virtio_net: split out ctrl buffer
        net: hns: Avoid action name truncation
        docs: ip-sysctl.txt: fix name of some ipv6 variables
        vmxnet3: fix incorrect dereference when rxvlan is disabled
        llc: hold llc_sap before release_sock()
        MAINTAINERS: Direct networking documentation changes to netdev
        atm: iphase: fix spelling mistake: "Tansmit" -> "Transmit"
        net: qmi_wwan: add Wistron Neweb D19Q1
        net: caif: fix spelling mistake "UKNOWN" -> "UNKNOWN"
        net: stmmac: Disable ACS Feature for GMAC >= 4
        net: mvpp2: Fix DMA address mask size
        net: change the comment of dev_mc_init
        net: qualcomm: rmnet: Fix warning seen with fill_info
        tun: fix vlan packet truncation
        tipc: fix infinite loop when dumping link monitor summary
        tipc: fix use-after-free in tipc_nametbl_stop
        ...
      a72db42c
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · b9abdcfd
      Linus Torvalds authored
      Pull vfs fixes from Al Viro:
       "Assorted fixes.
      
        Some of that is only a matter with fault injection (broken handling of
        small allocation failure in various mount-related places), but the
        last one is a root-triggerable stack overflow, and combined with
        userns it gets really nasty ;-/"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        Don't leak MNT_INTERNAL away from internal mounts
        mm,vmscan: Allow preallocating memory for register_shrinker().
        rpc_pipefs: fix double-dput()
        orangefs_kill_sb(): deal with allocation failures
        jffs2_kill_sb(): deal with failed allocations
        hypfs_kill_super(): deal with failed allocations
      b9abdcfd