1. 06 May, 2016 24 commits
    • ipv4: tcp: ip_send_unicast_reply() is not BH safe · 47dcc20a
      Eric Dumazet authored
      I forgot that ip_send_unicast_reply() is not BH safe (yet).
      
      Disabling preemption before calling it was not a good move.
      
      Fixes: c10d9310 ("tcp: do not assume TCP code is non preemptible")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Andres Lagar-Cavilla <andreslc@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bpf-direct-pkt-access' · 4b307a8e
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      bpf: introduce direct packet access
      
      This set of patches introduces 'direct packet access' from
      cls_bpf and act_bpf programs (which are root only).
      
      Current bpf programs use LD_ABS, LD_IND instructions which have
      to do 'if (off < skb_headlen)' for every packet access.
      It's ok for socket filters, but too slow for XDP, since a single
      LD_ABS insn consumes 3% of cpu. Therefore we have to amortize the cost
      of the length check over multiple packet accesses via direct access
      to skb->data, data_end pointers.
      
      Existing packet parsers typically look like:
        if (load_half(skb, offsetof(struct ethhdr, h_proto)) != ETH_P_IP)
           return 0;
        if (load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol)) != IPPROTO_UDP ||
            load_byte(skb, ETH_HLEN) != 0x45)
           return 0;
        ...
      With 'direct packet access' the bpf program becomes:
         void *data = (void *)(long)skb->data;
         void *data_end = (void *)(long)skb->data_end;
         struct eth_hdr *eth = data;
         struct iphdr *iph = data + sizeof(*eth);
         struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph);
      
         if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end)
            return 0;
         if (eth->h_proto != htons(ETH_P_IP))
            return 0;
         if (iph->protocol != IPPROTO_UDP || iph->ihl != 5)
            return 0;
         ...
      which is more natural to write and significantly faster.
      See patch 6 for performance tests:
      21Mpps(old) vs 24Mpps(new) with just 5 loads.
      For more complex parsers the performance gain is higher.
      
      The other approach implemented in [1] was adding two new instructions
      to interpreter and JITs and was too hard to use from llvm side.
      The approach presented here doesn't need any instruction changes,
      but the verifier has to work harder to check safety of the packet access.
      
      Patch 1 prepares the code and Patch 2 adds new checks for direct
      packet access and all of them are gated with 'env->allow_ptr_leaks'
      which is true for root only.
      Patch 3 improves search pruning for large programs.
      Patch 4 wires in verifier's changes with net/core/filter side.
      Patch 5 updates docs.
      Patches 6 and 7 add tests.
      
      [1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=ld_abs_dw
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
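
The single length check described in the 'bpf-direct-pkt-access' cover letter can be illustrated outside the kernel. The following is a minimal user-space sketch, not the sample code from the series: the function name, the `SK_*` constants and the raw byte offsets are assumptions made for illustration, and it assumes an untagged Ethernet frame carrying an IPv4 header without options.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes for this sketch: Ethernet, IPv4 without options, UDP. */
#define SK_ETH_HLEN 14
#define SK_IP_HLEN  20
#define SK_UDP_HLEN 8

/*
 * Returns 1 if [data, data_end) starts with an IPv4/UDP packet whose IP
 * header has no options, 0 otherwise.  One length check up front covers
 * every load that follows, which is the idea behind direct packet access.
 */
int sketch_is_ipv4_udp(const uint8_t *data, const uint8_t *data_end)
{
    /* Compare lengths rather than forming an out-of-range pointer. */
    if ((size_t)(data_end - data) < SK_ETH_HLEN + SK_IP_HLEN + SK_UDP_HLEN)
        return 0;

    /* EtherType lives at offset 12 in network byte order (0x0800 = IPv4). */
    if (data[12] != 0x08 || data[13] != 0x00)
        return 0;

    /* First IP byte: version 4, header length 5 words (no options). */
    if (data[SK_ETH_HLEN] != 0x45)
        return 0;

    /* The IP protocol field is at offset 9 of the IP header; 17 is UDP. */
    if (data[SK_ETH_HLEN + 9] != 17)
        return 0;

    return 1;
}
```

In the kernel the verifier proves that such a check dominates every load; in this sketch the same discipline is left to the programmer.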
    • samples/bpf: add verifier tests · 883e44e4
      Alexei Starovoitov authored
      Add a few tests for the "pointer to packet" logic of the verifier.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples/bpf: add 'pointer to packet' tests · 65d472fb
      Alexei Starovoitov authored
      parse_simple.c - packet parser example with a single length check that
      filters out udp packets for port 9
      
      parse_varlen.c - variable length parser that understands multiple vlan headers,
      ipip, ipip6 and ip options to filter out udp or tcp packets on port 9.
      The packet is parsed layer by layer with multiple length checks.
      
      parse_ldabs.c - classic style of packet parsing using LD_ABS instruction.
      Same functionality as parse_simple.
      
      simple = 24.1Mpps per core
      varlen = 22.7Mpps
      ldabs  = 21.4Mpps
      
      The parser using LD_ABS instructions is slower than even the full direct
      access parser (varlen), which does more packet accesses and checks.
      
      These examples demonstrate the trade-off bpf program authors can make between
      parser flexibility and speed.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add documentation for 'direct packet access' · f9c8d19d
      Alexei Starovoitov authored
      Explain how the verifier checks the safety of packet access
      and update email addresses.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: wire in data and data_end for cls_act_bpf · db58ba45
      Alexei Starovoitov authored
      Allow cls_bpf and act_bpf programs to access the skb->data and skb->data_end pointers.
      The bpf helpers that change skb->data need to update the data_end pointer as well.
      The verifier checks that programs always reload the data and data_end pointers
      after calls to such bpf helpers.
      We cannot add the 'data_end' pointer to struct qdisc_skb_cb directly,
      since it's embedded as-is by infiniband ipoib, so a wrapper struct is needed.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: improve verifier state equivalence · 735b4333
      Alexei Starovoitov authored
      Since the UNKNOWN_VALUE type is weaker than CONST_IMM, we can un-teach
      the verifier its recognition of constants in conditional branches
      without affecting safety.
      Ex:
      if (reg == 123) {
        .. here the verifier was marking reg->type as CONST_IMM;
           instead, keep reg as UNKNOWN_VALUE
      }
      
      Two verifier states with UNKNOWN_VALUE are equivalent, whereas
      CONST_IMM_X != CONST_IMM_Y, since CONST_IMM is used for stack range
      verification and other cases.
      So help search pruning by marking registers as UNKNOWN_VALUE
      where possible instead of CONST_IMM.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
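
The pruning argument in "bpf: improve verifier state equivalence" can be made concrete with a toy model. This is a sketch under stated assumptions, not the kernel's verifier code: `struct sketch_reg` and `sketch_regs_equal()` are hypothetical names, and the model tracks only the two register types the commit message discusses.

```c
#include <stdbool.h>

/* Hypothetical two-type model of a verifier register. */
enum sketch_reg_type { SK_UNKNOWN_VALUE, SK_CONST_IMM };

struct sketch_reg {
    enum sketch_reg_type type;
    long imm;   /* meaningful only when type == SK_CONST_IMM */
};

/*
 * Two registers match when their types agree and, for constants, when the
 * constant values agree as well.  Unknown values carry no payload, so any
 * two of them are equivalent, which is what lets more states be pruned.
 */
bool sketch_regs_equal(const struct sketch_reg *a, const struct sketch_reg *b)
{
    if (a->type != b->type)
        return false;
    if (a->type == SK_CONST_IMM)
        return a->imm == b->imm;
    return true;
}
```

Under this model, two branches that compared a register against 123 and 456 end up equivalent if both leave it as an unknown value, but distinct if both record the constant.
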
    • bpf: direct packet access · 969bf05e
      Alexei Starovoitov authored
      Extended BPF carried over two instructions from classic BPF to access
      packet data: LD_ABS and LD_IND. They're highly optimized in JITs,
      but due to their design they have to do a length check for every access.
      When BPF is processing 20M packets per second, a single LD_ABS after JIT
      consumes 3% of cpu. Hence the need to optimize it further by amortizing
      the cost of 'off < skb_headlen' over multiple packet accesses.
      One option is to introduce two new eBPF instructions, LD_ABS_DW and LD_IND_DW,
      with usage similar to skb_header_pointer().
      The kernel part for the interpreter and x64 JIT was implemented in [1], but such
      new insns behave like the old ld_abs and abort the program with 'return 0' if
      the access is beyond linear data. Such hidden control flow is hard to work around,
      and changing JITs and rolling out a new llvm is inconvenient.
      
      Therefore allow cls_bpf/act_bpf programs to access skb->data directly:
      int bpf_prog(struct __sk_buff *skb)
      {
        struct iphdr *ip;
      
        if (skb->data + sizeof(struct iphdr) + ETH_HLEN > skb->data_end)
            /* packet too small */
            return 0;
      
        ip = skb->data + ETH_HLEN;
      
        /* access IP header fields with direct loads */
        if (ip->version != 4 || ip->saddr == 0x7f000001)
            return 1;
        [...]
      }
      
      This solution avoids the introduction of new instructions. llvm stays
      the same and all JITs stay the same, but the verifier has to work extra hard
      to prove the safety of the above program.
      
      For XDP the direct store instructions can be allowed as well.
      
      The skb->data is NET_IP_ALIGNED, so for common cases the verifier can check
      the alignment. Complex packet parsers where the packet pointer is adjusted
      incrementally cannot be tracked for alignment, so allow byte access in such cases
      and misaligned access on architectures that define efficient_unaligned_access.
      
      [1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=ld_abs_dw
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: cleanup verifier code · 1a0dc1ac
      Alexei Starovoitov authored
      Clean up the verifier code and prepare it for the addition of "pointer to packet" logic.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 95aef7ce
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      40GbE Intel Wired LAN Driver Updates 2016-05-05
      
      This series contains updates to i40e and i40evf.
      
      The theme behind this series is code reduction, yeah!  Jesse provides
      most of the changes, starting with a refactor of the interpretation of
      a tunnel, which lets us start using the hardware's parsing.  He removes
      the packet split receive routine and ancillary code in preparation
      for the Rx refactor.  The refactor of the receive routine aligns it
      with the one in ixgbe, which was highly optimized.  The hardware
      supports a 16 byte descriptor for receive, but the driver was never
      using it in production.  There was no performance benefit to the real
      driver of 16 byte descriptors, so he drops a whole lot of complexity
      while getting rid of the code.  He also fixes a bug where, while
      changing the number of descriptors using ethtool, the driver did not
      test the limits of the system memory before permanently assuming it
      would be able to get receive buffer memory.
      
      Mitch fixes a memory leak of one page each time the driver is opened by
      allocating the correct number of receive buffers and not fiddling with
      next_to_use in the VF driver.
      
      Arnd Bergmann fixed an indentation issue by adding the appropriate
      curly braces in i40e_vc_config_promiscuous_mode_msg().
      
      Julia Lawall fixed an issue found by Coccinelle, where i40e_client_ops
      structure can be const since it is never modified.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: vrf: Create FIB tables on link create · b3b4663c
      David Ahern authored
      Tables have to exist for VRFs to function. Ensure they exist
      when the VRF device is created.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cnic: call cp->stop_hw() in cnic_start_hw() on allocation failure · f37bd0cc
      Jon Maxwell authored
      We recently had a system crash in the cnic module. Vmcore analysis confirmed
      that "ip link up" was executed and failed due to an allocation failure
      caused by memory fragmentation. Further analysis revealed that the cnic irq
      vector was still allocated after the failed "ip link up". When
      "ip link down" was executed, it called free_msi_irqs(), which crashed the system
      because the cnic irq was still in use.
      
      PANIC: "kernel BUG at drivers/pci/msi.c:411!"
      
      The code execution was:
      
      cnic_netdev_event()
      if (event == NETDEV_UP) {
      .
      .
             if (!cnic_start_hw(dev))
      cnic_start_hw()
      calls cnic_cm_open() which failed with -ENOMEM
      cnic_start_hw() then took the err1 path:
      
      err1:
             cp->free_resc(dev); <---- frees resources but not irq vector
             pci_dev_put(dev->pcidev);
             return err;
      }
      
      This returns control back to cnic_netdev_event(), but now the cnic irq vector
      is still allocated even though cnic_cm_open() failed. The next
      "ip link down" will trigger the crash.
      
      The cnic_start_hw() routine is not handling the allocation failure correctly.
      Fix this by checking whether CNIC_DRV_STATE_HANDLES_IRQ flag is set indicating
      that the hardware has been started in cnic_start_hw(). If it has then call
      cp->stop_hw() which frees the cnic irq vector and cnic resources. Otherwise
      just maintain the previous behaviour and free cnic resources.
      
      I reproduced this by injecting an ENOMEM error into cnic_cm_alloc_mem()'s return
      code.
      
      # ip link set dev enpX down
      # ip link set dev enpX up <--- hits allocation failure
      # ip link set dev enpX down <--- crashes here
      
      With this patch I confirmed there was no crash in the reproducer.
      Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
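
The cleanup decision described in the cnic fix can be sketched as follows. This is an illustrative user-space model, not the driver code: `struct sketch_dev`, `SKETCH_HANDLES_IRQ` and the helper functions are hypothetical stand-ins for the driver's state, the CNIC_DRV_STATE_HANDLES_IRQ flag, cp->stop_hw() and cp->free_resc().

```c
/* Hypothetical stand-in for the flag named in the commit message. */
#define SKETCH_HANDLES_IRQ 0x1UL

/* Hypothetical, heavily reduced device state. */
struct sketch_dev {
    unsigned long flags;
    int irq_allocated;
    int resc_allocated;
};

/* Models cp->stop_hw(): releases the irq vector and the resources. */
void sketch_stop_hw(struct sketch_dev *dev)
{
    dev->irq_allocated = 0;
    dev->resc_allocated = 0;
    dev->flags &= ~SKETCH_HANDLES_IRQ;
}

/* Models cp->free_resc(): releases resources but leaves the irq alone. */
void sketch_free_resc(struct sketch_dev *dev)
{
    dev->resc_allocated = 0;
}

/*
 * Error path of the start routine: if the hardware was already started,
 * undo that fully; otherwise keep the old behaviour and free resources.
 */
void sketch_start_hw_error(struct sketch_dev *dev)
{
    if (dev->flags & SKETCH_HANDLES_IRQ)
        sketch_stop_hw(dev);
    else
        sketch_free_resc(dev);
}
```

The point of the branch is that a later "link down" never finds an irq vector left behind by a failed "link up".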
    • i40e: constify i40e_client_ops structure · 3949c4ac
      Julia Lawall authored
      The i40e_client_ops structure is never modified, so declare it as const.
      
      Done with the help of Coccinelle.
      Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: fix misleading indentation · ce927db4
      Arnd Bergmann authored
      Newly added code in i40e_vc_config_promiscuous_mode_msg() is indented
      in a way that gcc rightly complains about:
      
      drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c: In function 'i40e_vc_config_promiscuous_mode_msg':
      drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:1543:4: error: this 'if' clause does not guard... [-Werror=misleading-indentation]
          if (f->vlan >= 0 && f->vlan <= I40E_MAX_VLANID)
          ^~
      drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:1550:5: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the 'if'
           aq_err = pf->hw.aq.asq_last_status;
      
      From the context, it looks like the aq_err assignment was meant to be
      inside of the conditional expression, so I'm adding the appropriate
      curly braces now.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Fixes: 5676a8b9 ("i40e: Add VF promiscuous mode driver support")
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
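
The effect of the missing braces in "i40e: fix misleading indentation" can be shown with a reduced example. This is a sketch, not the driver function: `sketch_do_op()`, `sketch_unguarded()` and `sketch_guarded()` are hypothetical names, and the unbraced variant is indented honestly here so that it compiles without the warning.

```c
/* Stand-in for the guarded hardware call; purely illustrative. */
int sketch_ops;

void sketch_do_op(void)
{
    sketch_ops++;
}

/* Without braces, only the first statement belongs to the 'if'. */
int sketch_unguarded(int in_range, int last_status)
{
    int aq_err = 0;

    if (in_range)
        sketch_do_op();
    aq_err = last_status;   /* runs whether or not in_range is true */
    return aq_err;
}

/* With braces, the status is only picked up when the call was made. */
int sketch_guarded(int in_range, int last_status)
{
    int aq_err = 0;

    if (in_range) {
        sketch_do_op();
        aq_err = last_status;
    }
    return aq_err;
}
```

With `in_range` false, the first variant still reports the stale status while the second leaves `aq_err` at zero, which is the behaviour the commit message says was intended.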
    • i40e: Test memory before ethtool alloc succeeds · 147e81ec
      Jesse Brandeburg authored
      When testing on systems with very limited amounts of RAM, a bug was
      found where, while changing the number of descriptors using ethtool,
      the driver didn't test the limits of system memory before permanently
      assuming it would be able to get receive buffer memory.
      
      Work around this issue by pre-allocation of the receive buffer
      memory, in the "ghost" ring, which is then used during reinit
      using the new ring length.
      
      Change-Id: I92d7a5fb59a6c884b2efdd1ec652845f101c3359
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
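
The "ghost" ring approach from "i40e: Test memory before ethtool alloc succeeds" follows a common pattern: allocate everything for the new size first, and only replace the live ring once that has succeeded. The sketch below models this in user space with malloc(); `struct sketch_ring` and `sketch_ring_resize()` are hypothetical and do not mirror the driver's data structures.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical ring: an array of separately allocated buffers. */
struct sketch_ring {
    void **bufs;
    size_t count;
};

/*
 * Resize the ring to new_count buffers of buf_size bytes each.
 * Returns 0 on success.  On failure returns -1 and leaves the
 * existing ring untouched, because nothing is freed until the
 * replacement ("ghost") buffers are all in hand.
 */
int sketch_ring_resize(struct sketch_ring *ring, size_t new_count,
                       size_t buf_size)
{
    void **ghost;
    size_t i;

    if (new_count == 0)
        return -1;

    ghost = calloc(new_count, sizeof(*ghost));
    if (!ghost)
        return -1;

    for (i = 0; i < new_count; i++) {
        ghost[i] = malloc(buf_size);
        if (!ghost[i])
            goto fail;
    }

    /* Commit: release the old buffers and switch to the ghost set. */
    for (i = 0; i < ring->count; i++)
        free(ring->bufs[i]);
    free(ring->bufs);
    ring->bufs = ghost;
    ring->count = new_count;
    return 0;

fail:
    /* Roll back only what this call allocated. */
    while (i--)
        free(ghost[i]);
    free(ghost);
    return -1;
}
```

The cost of this design is a temporary peak in memory use while both buffer sets exist, in exchange for never ending up with a ring that has no buffers.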
    • i40evf: Allocate Rx buffers properly · b163098e
      Mitch Williams authored
      Allocate the correct number of RX buffers, and don't fiddle with
      next_to_use. The common RX code handles all of this. This fixes a memory
      leak of one page each time the driver is opened.
      
      Change-Id: Id06eca353086e084921f047acad28c14745684ee
      Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e/i40evf: Remove unused hardware receive descriptor code · bec60fc4
      Jesse Brandeburg authored
      The hardware supports a 16 byte descriptor for receive, but the
      driver was never using it in production.  There was no performance
      benefit to the real driver of 16 byte descriptors, so drop a whole
      lot of complexity while getting rid of the code.
      
      Also since the previous patch made us use no-split mode all the
      time, drop any support in the driver for any other value in dtype
      and assume it is always zero (aka no-split).
      
      Hooray for code removal!
      
      Change-ID: I2257e902e4dad84a07b94db6d2e6f4ce69b27bc0
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40evf: refactor receive routine · ab9ad98e
      Jesse Brandeburg authored
      This is part 2 of the Rx refactor series, just including
      changes to i40evf.
      
      This refactor aligns the receive routine with the one in
      ixgbe which was highly optimized.  This reduces the code
      we have to maintain and allows for (hopefully) more readable
      and maintainable RX hot path.
      
      In order to do this:
      - consolidate the receive path into a single function that doesn't
        use packet split but *does* use pages for Rx buffers.
      - remove the old _1buf routine
      - consolidate several routines into helper functions
      - remove VF ethtool control over packet split
      - remove priv_flags interface since it is unused
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40evf: Drop packet split receive routine · 19b85e67
      Jesse Brandeburg authored
      As part of preparation for the rx-refactor, remove the
      packet split receive routine and ancillary code.
      
      Some of the split related context set up code stays in
      i40e_virtchnl_pf.c in case an older VF driver tries to load
      and still wants to use packet split.
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: Refactor receive routine · 1a557afc
      Jesse Brandeburg authored
      This is part 1 of the Rx refactor series, just including
      changes to i40e.
      
      This refactor aligns the receive routine with the one in
      ixgbe which was highly optimized.  This reduces the code
      we have to maintain and allows for (hopefully) more readable
      and maintainable RX hot path.
      
      In order to do this:
      - consolidate the receive path into a single function that doesn't
        use packet split but *does* use pages for Rx buffers.
      - remove the old _1buf routine
      - consolidate several routines into helper functions
      - remove ethtool control over packet split
      
      Change-ID: I5ca100721de65992aa0114f8b4bac844b84758e0
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • net/mlx4: Avoid wrong virtual mappings · 73898db0
      Haggai Abramovsky authored
      The dma_alloc_coherent() function returns a virtual address which can
      be used for coherent access to the underlying memory.  On some
      architectures, like arm64, undefined behavior results if this memory is
      also accessed via virtual mappings that are not coherent.  Because of
      their undefined nature, operations like virt_to_page() return garbage
      when passed virtual addresses obtained from dma_alloc_coherent().  Any
      subsequent mappings via vmap() of the garbage page values are unusable
      and result in bad things like bus errors (synchronous aborts in ARM64
      speak).
      
      The mlx4 driver contains code that does the equivalent of
      vmap(virt_to_page(dma_alloc_coherent)), which results in an oops when the
      device is opened.
      
      Prevent the Ethernet driver from running this problematic code by forcing it to
      allocate contiguous memory. The Infiniband driver first tries to
      allocate contiguous memory and, on failure, falls back to working with
      fragmented memory.
      Signed-off-by: Haggai Abramovsky <hagaya@mellanox.com>
      Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
      Reported-by: David Daney <david.daney@cavium.com>
      Tested-by: Sinan Kaya <okaya@codeaurora.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • i40e/i40evf: Remove reference to ring->dtype · 04b3b779
      Jesse Brandeburg authored
      As part of the rx-refactor, the dtype variable in the i40e_ring
      struct is no longer used, so remove it.
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: Drop packet split receive routine · b32bfa17
      Jesse Brandeburg authored
      As part of preparation for the rx-refactor, remove the
      packet split receive routine and ancillary code.
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e/i40evf: Refactor tunnel interpretation · f8a952cb
      Jesse Brandeburg authored
      Refactor the interpretation of a tunnel.  This removes
      some code and lets us start using the hardware's parsing.
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
  2. 05 May, 2016 2 commits
  3. 04 May, 2016 14 commits