1. 09 May, 2018 2 commits
  2. 08 May, 2018 8 commits
    • Florian Westphal's avatar
      netfilter: nf_tables: don't assume chain stats are set when jumplabel is set · 00924094
      Florian Westphal authored
      nft_chain_stats_replace() and all other spots assume ->stats can be
      NULL, but nft_update_chain_stats does not.  It must do this check,
      just because the jump label is set doesn't mean all basechains have stats
      assigned.
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      00924094
    • Florian Westphal's avatar
      netfilter: x_tables: add module alias for icmp matches · a44f6d82
      Florian Westphal authored
      The icmp matches are implemented in ip_tables and ip6_tables,
      respectively, so for normal iptables they are always available:
      those modules are loaded once iptables calls getsockopt() to fetch
      available module revisions.
      
      In iptables-over-nftables case probing occurs via nfnetlink, so
      these modules might not be loaded.  Add aliases so modprobe can load
      these when icmp/icmp6 is requested.
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      a44f6d82
    • Florian Westphal's avatar
      netfilter: prefer nla_strlcpy for dealing with NLA_STRING attributes · 4e09fc87
      Florian Westphal authored
      fixes these warnings:
      'nfnl_cthelper_create' at net/netfilter/nfnetlink_cthelper.c:237:2,
      'nfnl_cthelper_new' at net/netfilter/nfnetlink_cthelper.c:450:9:
      ./include/linux/string.h:246:9: warning: '__builtin_strncpy' specified bound 16 equals destination size [-Wstringop-truncation]
        return __builtin_strncpy(p, q, size);
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Moreover, strncpy assumes null-terminated source buffers, but thats
      not the case here.
      Unlike strlcpy, nla_strlcpy *does* pad the destination buffer
      while also considering nla attribute size.
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      4e09fc87
    • Florian Westphal's avatar
      netfilter: core: add missing __rcu annotation · 25fd386e
      Florian Westphal authored
      removes following sparse error:
      net/netfilter/core.c:598:30: warning: incorrect type in argument 1 (different address spaces)
      net/netfilter/core.c:598:30:    expected struct nf_hook_entries **e
      net/netfilter/core.c:598:30:    got struct nf_hook_entries [noderef] <asn:4>**<noident>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      25fd386e
    • Julian Anastasov's avatar
      ipvs: fix stats update from local clients · d5e032fc
      Julian Anastasov authored
      Local clients are not properly synchronized on 32-bit CPUs when
      updating stats (3.10+). Now it is possible estimation_timer (timer),
      a stats reader, to interrupt the local client in the middle of
      write_seqcount_{begin,end} sequence leading to loop (DEADLOCK).
      The same interrupt can happen from received packet (SoftIRQ)
      which updates the same per-CPU stats.
      
      Fix it by disabling BH while updating stats.
      
      Found with debug:
      
      WARNING: inconsistent lock state
      4.17.0-rc2-00105-g35cb6d7-dirty #2 Not tainted
      --------------------------------
      inconsistent {IN-SOFTIRQ-R} -> {SOFTIRQ-ON-W} usage.
      ftp/2545 [HC0[0]:SC0[0]:HE1:SE1] takes:
      86845479 (&syncp->seq#6){+.+-}, at: ip_vs_schedule+0x1c5/0x59e [ip_vs]
      {IN-SOFTIRQ-R} state was registered at:
       lock_acquire+0x44/0x5b
       estimation_timer+0x1b3/0x341 [ip_vs]
       call_timer_fn+0x54/0xcd
       run_timer_softirq+0x10c/0x12b
       __do_softirq+0xc1/0x1a9
       do_softirq_own_stack+0x1d/0x23
       irq_exit+0x4a/0x64
       smp_apic_timer_interrupt+0x63/0x71
       apic_timer_interrupt+0x3a/0x40
       default_idle+0xa/0xc
       arch_cpu_idle+0x9/0xb
       default_idle_call+0x21/0x23
       do_idle+0xa0/0x167
       cpu_startup_entry+0x19/0x1b
       start_secondary+0x133/0x182
       startup_32_smp+0x164/0x168
      irq event stamp: 42213
      
      other info that might help us debug this:
      Possible unsafe locking scenario:
      
            CPU0
            ----
       lock(&syncp->seq#6);
       <Interrupt>
         lock(&syncp->seq#6);
      
      *** DEADLOCK ***
      
      Fixes: ac69269a ("ipvs: do not disable bh for long time")
      Signed-off-by: default avatarJulian Anastasov <ja@ssi.bg>
      Acked-by: default avatarSimon Horman <horms@verge.net.au>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      d5e032fc
    • Julian Anastasov's avatar
      ipvs: fix refcount usage for conns in ops mode · a050d345
      Julian Anastasov authored
      Connections in One-packet scheduling mode (-o, --ops) are
      removed with refcnt=0 because they are not hashed in conn table.
      To avoid refcount_dec reporting this as error, change them to be
      removed with refcount_dec_if_one as all other connections.
      
      refcount_t hit zero at ip_vs_conn_put+0x31/0x40 [ip_vs]
      in sh[15519], uid/euid: 497/497
      WARNING: CPU: 0 PID: 15519 at ../kernel/panic.c:657
      refcount_error_report+0x94/0x9e
      Modules linked in: ip_vs_rr cirrus ttm sb_edac
      edac_core drm_kms_helper crct10dif_pclmul crc32_pclmul
      ghash_clmulni_intel pcbc mousedev drm aesni_intel aes_x86_64
      crypto_simd glue_helper cryptd psmouse evdev input_leds led_class
      intel_agp fb_sys_fops syscopyarea sysfillrect intel_rapl_perf mac_hid
      intel_gtt serio_raw sysimgblt agpgart i2c_piix4 i2c_core ata_generic
      pata_acpi floppy cfg80211 rfkill button loop macvlan ip_vs
      nf_conntrack libcrc32c crc32c_generic ip_tables x_tables ipv6
      crc_ccitt autofs4 ext4 crc16 mbcache jbd2 fscrypto ata_piix libata
      atkbd libps2 scsi_mod crc32c_intel i8042 rtc_cmos serio af_packet
      dm_mod dax fuse xen_netfront xen_blkfront
      CPU: 0 PID: 15519 Comm: sh Tainted: G        W
      4.15.17 #1-NixOS
      Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
      RIP: 0010:refcount_error_report+0x94/0x9e
      RSP: 0000:ffffa344dde039c8 EFLAGS: 00010296
      RAX: 0000000000000057 RBX: ffffffff92f20e06 RCX: 0000000000000006
      RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffffa344dde165c0
      RBP: ffffa344dde03b08 R08: 0000000000000218 R09: 0000000000000004
      R10: ffffffff93006a80 R11: 0000000000000001 R12: ffffa344d68cd100
      R13: 00000000000001f1 R14: ffffffff92f12fb0 R15: 0000000000000004
      FS:  00007fc9d2040fc0(0000) GS:ffffa344dde00000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000000262a000 CR3: 0000000016a0c004 CR4: 00000000001606f0
      Call Trace:
       <IRQ>
       ex_handler_refcount+0x4e/0x80
       fixup_exception+0x33/0x40
       do_trap+0x83/0x140
       do_error_trap+0x83/0xf0
       ? ip_vs_conn_drop_conntrack+0x120/0x1a5 [ip_vs]
       ? ip_finish_output2+0x29c/0x390
       ? ip_finish_output2+0x1a2/0x390
       invalid_op+0x1b/0x40
      RIP: 0010:ip_vs_conn_put+0x31/0x40 [ip_vs]
      RSP: 0000:ffffa344dde03bb8 EFLAGS: 00010246
      RAX: 0000000000000001 RBX: ffffa344df31cf00 RCX: ffffa344d7450198
      RDX: 0000000000000003 RSI: 00000000fffffe01 RDI: ffffa344d7450140
      RBP: 0000000000000002 R08: 0000000000000476 R09: 0000000000000000
      R10: ffffa344dde03b28 R11: ffffa344df200000 R12: ffffa344d7d09000
      R13: ffffa344def3a980 R14: ffffffffc04f6e20 R15: 0000000000000008
       ip_vs_in.part.29.constprop.36+0x34f/0x640 [ip_vs]
       ? ip_vs_conn_out_get+0xe0/0xe0 [ip_vs]
       ip_vs_remote_request4+0x47/0xa0 [ip_vs]
       ? ip_vs_in.part.29.constprop.36+0x640/0x640 [ip_vs]
       nf_hook_slow+0x43/0xc0
       ip_local_deliver+0xac/0xc0
       ? ip_rcv_finish+0x400/0x400
       ip_rcv+0x26c/0x380
       __netif_receive_skb_core+0x3a0/0xb10
       ? inet_gro_receive+0x23c/0x2b0
       ? netif_receive_skb_internal+0x24/0xb0
       netif_receive_skb_internal+0x24/0xb0
       napi_gro_receive+0xb8/0xe0
       xennet_poll+0x676/0xb40 [xen_netfront]
       net_rx_action+0x139/0x3a0
       __do_softirq+0xde/0x2b4
       irq_exit+0xae/0xb0
       xen_evtchn_do_upcall+0x2c/0x40
       xen_hvm_callback_vector+0x7d/0x90
       </IRQ>
      RIP: 0033:0x7fc9d11c91f9
      RSP: 002b:00007ffebe8a2ea0 EFLAGS: 00000202 ORIG_RAX:
      ffffffffffffff0c
      RAX: 00000000ffffffff RBX: 0000000002609808 RCX: 0000000000000054
      RDX: 0000000000000001 RSI: 0000000002605440 RDI: 00000000025f940e
      RBP: 00000000025f940e R08: 000000000260213d R09: 1999999999999999
      R10: 000000000262a808 R11: 00000000025f942d R12: 00000000025f940e
      R13: 00007fc9d1301e20 R14: 00000000025f9408 R15: 00007fc9d1302720
      Code: 48 8b 95 80 00 00 00 41 55 49 8d 8c 24 e0 05 00
      00 45 8b 84 24 38 04 00 00 41 89 c1 48 89 de 48 c7 c7 a8 2f f2 92 e8
      7c fa ff ff <0f> 0b 58 5b 5d 41 5c 41 5d c3 0f 1f 44 00 00 55 48 89 e5
      41 56
      Reported-by: default avatarNet Filter <netfilternetfilter@gmail.com>
      Fixes: b54ab92b ("netfilter: refcounter conversions")
      Signed-off-by: default avatarJulian Anastasov <ja@ssi.bg>
      Acked-by: default avatarSimon Horman <horms@verge.net.au>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      a050d345
    • Florian Westphal's avatar
      netfilter: nf_tables: nft_compat: fix refcount leak on xt module · b8e9dc1c
      Florian Westphal authored
      Taehee Yoo reported following bug:
          iptables-compat -I OUTPUT -m cpu --cpu 0
          iptables-compat -F
          lsmod |grep xt_cpu
          xt_cpu                 16384  1
      
      Quote:
      "When above command is given, a netlink message has two expressions that
      are the cpu compat and the nft_counter.
      The nft_expr_type_get() in the nf_tables_expr_parse() successes
      first expression then, calls select_ops callback.
      (allocates memory and holds module)
      But, second nft_expr_type_get() in the nf_tables_expr_parse()
      returns -EAGAIN because of request_module().
      In that point, by the 'goto err1',
      the 'module_put(info[i].ops->type->owner)' is called.
      There is no release routine."
      
      The core problem is that unlike all other expression,
      nft_compat select_ops has side effects.
      
      1. it allocates dynamic memory which holds an nft ops struct.
         In all other expressions, ops has static storage duration.
      2. It grabs references to the xt module that it is supposed to
         invoke.
      
      Depending on where things go wrong, error unwinding doesn't
      always do the right thing.
      
      In the above scenario, a new nft_compat_expr is created and
      xt_cpu module gets loaded with a refcount of 1.
      
      Due to to -EAGAIN, the netlink messages get re-parsed.
      When that happens, nft_compat finds that xt_cpu is already present
      and increments module refcount again.
      
      This fixes the problem by making select_ops to have no visible
      side effects and removes all extra module_get/put.
      
      When select_ops creates a new nft_compat expression, the new
      expression has a refcount of 0, and the xt module gets its refcount
      incremented.
      
      When error happens, the next call finds existing entry, but will no
      longer increase the reference count -- the presence of existing
      nft_xt means we already hold a module reference.
      
      Because nft_xt_put is only called from nft_compat destroy hook,
      it will never see the initial zero reference count.
      ->destroy can only be called after ->init(), and that will increase the
      refcount.
      
      Lastly, we now free nft_xt struct with kfree_rcu.
      Else, we get use-after free in nf_tables_rule_destroy:
      
        while (expr != nft_expr_last(rule) && expr->ops) {
          nf_tables_expr_destroy(ctx, expr);
          expr = nft_expr_next(expr); // here
      
      nft_expr_next() dereferences expr->ops. This is safe
      for all users, as ops have static storage duration.
      In nft_compat case however, its ->destroy callback can
      free the memory that hold the ops structure.
      Tested-by: default avatarTaehee Yoo <ap420073@gmail.com>
      Reported-by: default avatarTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      b8e9dc1c
    • Stephen Hemminger's avatar
      netfilter: bridge: stp fix reference to uninitialized data · a4995684
      Stephen Hemminger authored
      The destination mac (destmac) is only valid if EBT_DESTMAC flag
      is set. Fix by changing the order of the comparison to look for
      the flag first.
      
      Reported-by: syzbot+5c06e318fc558cc27823@syzkaller.appspotmail.com
      Signed-off-by: default avatarStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      a4995684
  3. 26 Apr, 2018 4 commits
  4. 25 Apr, 2018 7 commits
    • John Fastabend's avatar
      bpf: fix for lex/yacc build error with gcc-5 · 9c299a32
      John Fastabend authored
      Fix build error found with Ubuntu shipped gcc-5
      
      ~/git/bpf/tools/bpf$ make all
      
      Auto-detecting system features:
      ...                        libbfd: [ OFF ]
      ...        disassembler-four-args: [ OFF ]
      
        CC       bpf_jit_disasm.o
        LINK     bpf_jit_disasm
        CC       bpf_dbg.o
      /home/john/git/bpf/tools/bpf/bpf_dbg.c: In function ‘cmd_load’:
      /home/john/git/bpf/tools/bpf/bpf_dbg.c:1077:13: warning: ‘cont’ may be used uninitialized in this function [-Wmaybe-uninitialized]
        } else if (matches(subcmd, "pcap") == 0) {
                   ^
        LINK     bpf_dbg
        CC       bpf_asm.o
      make: *** No rule to make target `bpf_exp.yacc.o', needed by `bpf_asm'.  Stop.
      
      Fixes: 5a8997f2 ("tools: bpf: respect output directory during build")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      9c299a32
    • Dag Moxnes's avatar
      rds: ib: Fix missing call to rds_ib_dev_put in rds_ib_setup_qp · 91a82529
      Dag Moxnes authored
      The function rds_ib_setup_qp is calling rds_ib_get_client_data and
      should correspondingly call rds_ib_dev_put. This call was lost in
      the non-error path with the introduction of error handling done in
      commit 3b12f73a ("rds: ib: add error handle")
      Signed-off-by: default avatarDag Moxnes <dag.moxnes@oracle.com>
      Reviewed-by: default avatarHåkon Bugge <haakon.bugge@oracle.com>
      Acked-by: default avatarSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      91a82529
    • Ursula Braun's avatar
      net/smc: keep clcsock reference in smc_tcp_listen_work() · 070204a3
      Ursula Braun authored
      The internal CLC socket should exist till the SMC-socket is released.
      Function tcp_listen_worker() releases the internal CLC socket of a
      listen socket, if an smc_close_active() is called. This function
      is called for the final release(), but it is called for shutdown
      SHUT_RDWR as well. This opens a door for protection faults, if
      socket calls using the internal CLC socket are called for a
      shutdown listen socket.
      
      With the changes of
      commit 3d502067 ("net/smc: simplify wait when closing listen socket")
      there is no need anymore to release the internal CLC socket in
      function tcp_listen_worker((). It is sufficient to release it in
      smc_release().
      
      Fixes: 127f4970 ("net/smc: release clcsock from tcp_listen_worker")
      Signed-off-by: default avatarUrsula Braun <ubraun@linux.ibm.com>
      Reported-by: syzbot+9045fc589fcd196ef522@syzkaller.appspotmail.com
      Reported-by: syzbot+28a2c86cf19c81d871fa@syzkaller.appspotmail.com
      Reported-by: syzbot+9605e6cace1b5efd4a0a@syzkaller.appspotmail.com
      Reported-by: syzbot+cf9012c597c8379d535c@syzkaller.appspotmail.com
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      070204a3
    • Alexandre Belloni's avatar
      net: phy: allow scanning busses with missing phys · 02a6efca
      Alexandre Belloni authored
      Some MDIO busses will error out when trying to read a phy address with no
      phy present at that address. In that case, probing the bus will fail
      because __mdiobus_register() is scanning the bus for all possible phys
      addresses.
      
      In case MII_PHYSID1 returns -EIO or -ENODEV, consider there is no phy at
      this address and set the phy ID to 0xffffffff which is then properly
      handled in get_phy_device().
      Suggested-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarAlexandre Belloni <alexandre.belloni@bootlin.com>
      Reviewed-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      02a6efca
    • Gianluca Borello's avatar
      bpf, x64: fix JIT emission for dead code · 1612a981
      Gianluca Borello authored
      Commit 2a5418a1 ("bpf: improve dead code sanitizing") replaced dead
      code with a series of ja-1 instructions, for safety. That made JIT
      compilation much more complex for some BPF programs. One instance of such
      programs is, for example:
      
      bool flag = false
      ...
      /* A bunch of other code */
      ...
      if (flag)
              do_something()
      
      In some cases llvm is not able to remove at compile time the code for
      do_something(), so the generated BPF program ends up with a large amount
      of dead instructions. In one specific real life example, there are two
      series of ~500 and ~1000 dead instructions in the program. When the
      verifier replaces them with a series of ja-1 instructions, it causes an
      interesting behavior at JIT time.
      
      During the first pass, since all the instructions are estimated at 64
      bytes, the ja-1 instructions end up being translated as 5 bytes JMP
      instructions (0xE9), since the jump offsets become increasingly large (>
      127) as each instruction gets discovered to be 5 bytes instead of the
      estimated 64.
      
      Starting from the second pass, the first N instructions of the ja-1
      sequence get translated into 2 bytes JMPs (0xEB) because the jump offsets
      become <= 127 this time. In particular, N is defined as roughly 127 / (5
      - 2) ~= 42. So, each further pass will make the subsequent N JMP
      instructions shrink from 5 to 2 bytes, making the image shrink every time.
      This means that in order to have the entire program converge, there need
      to be, in the real example above, at least ~1000 / 42 ~= 24 passes just
      for translating the dead code. If we add this number to the passes needed
      to translate the other non dead code, it brings such program to 40+
      passes, and JIT doesn't complete. Ultimately the userspace loader fails
      because such BPF program was supposed to be part of a prog array owner
      being JITed.
      
      While it is certainly possible to try to refactor such programs to help
      the compiler remove dead code, the behavior is not really intuitive and it
      puts further burden on the BPF developer who is not expecting such
      behavior. To make things worse, such programs are working just fine in all
      the kernel releases prior to the ja-1 fix.
      
      A possible approach to mitigate this behavior consists into noticing that
      for ja-1 instructions we don't really need to rely on the estimated size
      of the previous and current instructions, we know that a -1 BPF jump
      offset can be safely translated into a 0xEB instruction with a jump offset
      of -2.
      
      Such fix brings the BPF program in the previous example to complete again
      in ~9 passes.
      
      Fixes: 2a5418a1 ("bpf: improve dead code sanitizing")
      Signed-off-by: default avatarGianluca Borello <g.borello@gmail.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      1612a981
    • William Tu's avatar
      bpf: clear the ip_tunnel_info. · 5540fbf4
      William Tu authored
      The percpu metadata_dst might carry the stale ip_tunnel_info
      and cause incorrect behavior.  When mixing tests using ipv4/ipv6
      bpf vxlan and geneve tunnel, the ipv6 tunnel info incorrectly uses
      ipv4's src ip addr as its ipv6 src address, because the previous
      tunnel info does not clean up.  The patch zeros the fields in
      ip_tunnel_info.
      Signed-off-by: default avatarWilliam Tu <u9012063@gmail.com>
      Reported-by: default avatarYifeng Sun <pkusunyifeng@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      5540fbf4
    • Linus Torvalds's avatar
      Merge branch 'userns-linus' of... · 3be4aaf4
      Linus Torvalds authored
      Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
      
      Pull userns bug fix from Eric Biederman:
       "Just a small fix to properly set the return code on error"
      
      * 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
        commoncap: Handle memory allocation failure.
      3be4aaf4
  5. 24 Apr, 2018 19 commits