1. 21 Sep, 2016 23 commits
    • tcp: export tcp_tso_autosize() and parameterize minimum number of TSO segments · 1b3878ca
      Neal Cardwell authored
      To allow congestion control modules to use the default TSO auto-sizing
      algorithm as one of the ingredients in their own decision about TSO sizing:
      
      1) Export tcp_tso_autosize() so that CC modules can use it.
      
      2) Change tcp_tso_autosize() to allow callers to specify a minimum
         number of segments per TSO skb, in case the congestion control
         module has a different notion of the best floor for TSO skbs for
         the connection right now. For very low-rate paths or policed
         connections it can be appropriate to use smaller TSO skbs.
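
      A minimal userspace sketch of the sizing idea described above, with a
      caller-supplied floor. The struct and function names are hypothetical,
      and the "about 1/1024 s of data at the pacing rate" budget is an
      assumption made for illustration, not a statement of what the in-tree
      function does.

      ```c
      #include <stdint.h>

      /* Hypothetical stand-in for the socket fields the sizing logic reads. */
      struct tso_params {
          uint64_t pacing_rate;   /* bytes per second */
          uint32_t gso_max_size;  /* largest burst the device accepts, bytes */
          uint32_t gso_max_segs;  /* largest segment count the device accepts */
      };

      /* Pick a segment count for one TSO skb, never below min_tso_segs. */
      static uint32_t tso_autosize_sketch(const struct tso_params *p,
                                          uint32_t mss, uint32_t min_tso_segs)
      {
          uint64_t bytes;
          uint32_t segs;

          if (mss == 0)
              return min_tso_segs;

          bytes = p->pacing_rate >> 10;      /* assumed budget: ~1/1024 s */
          if (bytes > p->gso_max_size)
              bytes = p->gso_max_size;

          segs = (uint32_t)(bytes / mss);
          if (segs < min_tso_segs)           /* floor chosen by the caller */
              segs = min_tso_segs;
          if (segs > p->gso_max_segs)
              segs = p->gso_max_segs;
          return segs;
      }
      ```

      With a low pacing rate the byte budget rounds down to zero segments, so
      the result is whatever floor the caller passes; a module that sees a
      policed path could pass a smaller floor than the default.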
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: allow congestion control module to request TSO skb segment count · ed6e7268
      Neal Cardwell authored
      Add the tso_segs_goal() function in tcp_congestion_ops to allow the
      congestion control module to specify the number of segments that
      should be in a TSO skb sent by tcp_write_xmit() and
      tcp_xmit_retransmit_queue(). The congestion control module can either
      request a particular number of segments in TSO skb that we transmit,
      or return 0 if it doesn't care.
      
      This allows the upcoming BBR congestion control module to select small
      TSO skb sizes if the module detects that the bottleneck bandwidth is
      very low, or that the connection is policed to a low rate.
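
      The "return 0 if it doesn't care" contract can be sketched as an
      optional callback with a fallback. This is a userspace model: the
      struct, the helper pick_tso_segs(), and the two example callbacks are
      hypothetical names, and only the tso_segs_goal name comes from the
      message above.

      ```c
      #include <stddef.h>
      #include <stdint.h>

      struct conn;  /* opaque, hypothetical connection type */

      struct cc_ops_sketch {
          /* Optional: desired segments per TSO skb, or 0 for no preference. */
          uint32_t (*tso_segs_goal)(const struct conn *c);
      };

      /* Use the module's goal when it has one, else the autosized default. */
      static uint32_t pick_tso_segs(const struct cc_ops_sketch *ops,
                                    const struct conn *c,
                                    uint32_t autosized_segs)
      {
          uint32_t goal = ops->tso_segs_goal ? ops->tso_segs_goal(c) : 0;

          return goal ? goal : autosized_segs;
      }

      /* Example module that wants single-segment skbs on a slow path. */
      static uint32_t low_rate_goal(const struct conn *c)
      {
          (void)c;
          return 1;
      }

      /* Example module that leaves the decision to the stack. */
      static uint32_t no_preference(const struct conn *c)
      {
          (void)c;
          return 0;
      }
      ```

      A module that does not implement the hook at all behaves the same as
      one that returns 0.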
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: export data delivery rate · eb8329e0
      Yuchung Cheng authored
      This commit exports two new fields in struct tcp_info:
      
        tcpi_delivery_rate: The most recent goodput, as measured by
          tcp_rate_gen(). If the socket is limited by the sending
          application (e.g., no data to send), it reports the highest
          measurement instead of the most recent. The unit is bytes per
          second (like other rate fields in tcp_info).
      
        tcpi_delivery_rate_app_limited: A boolean indicating if the goodput
          was measured when the socket's throughput was limited by the
          sending application.
      
      This delivery rate information can be useful for applications that
      want to know the current throughput the TCP connection is seeing,
      e.g. adaptive bitrate video streaming. It can also be very useful for
      debugging or troubleshooting.
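
      An application might read the two fields roughly as follows. This is a
      sketch that assumes a Linux system whose <linux/tcp.h> already declares
      them; the helper name read_delivery_rate() is hypothetical.

      ```c
      #include <linux/tcp.h>   /* struct tcp_info, TCP_INFO */
      #include <netinet/in.h>  /* IPPROTO_TCP */
      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>
      #include <sys/socket.h>

      /*
       * Fetch the delivery rate (bytes per second) and the app-limited flag
       * for a TCP socket. Returns 0 on success and -1 if getsockopt() fails
       * or the kernel returned a tcp_info too short to hold the rate field.
       */
      static int read_delivery_rate(int fd, uint64_t *rate_bytes_per_sec,
                                    int *app_limited)
      {
          struct tcp_info info;
          socklen_t len = sizeof(info);

          memset(&info, 0, sizeof(info));
          if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) != 0)
              return -1;

          /* Kernels that predate the fields return a shorter struct. */
          if (len < offsetof(struct tcp_info, tcpi_delivery_rate) +
                    sizeof(info.tcpi_delivery_rate))
              return -1;

          *rate_bytes_per_sec = info.tcpi_delivery_rate;
          *app_limited = info.tcpi_delivery_rate_app_limited;
          return 0;
      }
      ```

      An adaptive bitrate player could poll this once per segment download
      and treat samples with the app-limited flag set as a lower bound
      rather than as the path capacity.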
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: track application-limited rate samples · d7722e85
      Soheil Hassas Yeganeh authored
      This commit adds code to track whether the delivery rate represented
      by each rate_sample was limited by the application.
      
      Upon each transmit, we store in the is_app_limited field in the skb a
      boolean bit indicating whether there is a known "bubble in the pipe":
      a point in the rate sample interval where the sender was
      application-limited, and did not transmit even though the cwnd and
      pacing rate allowed it.
      
      This logic marks the flow app-limited on a write if *all* of the
      following are true:
      
        1) There is less than 1 MSS of unsent data in the write queue
           available to transmit.
      
        2) There is no packet in the sender's queues (e.g. in fq or the NIC
           tx queue).
      
        3) The connection is not limited by cwnd.
      
        4) There are no lost packets to retransmit.
      
      The tcp_rate_check_app_limited() code in tcp_rate.c determines whether
      the connection is application-limited at the moment. If the flow is
      application-limited, it sets the tp->app_limited field. If the flow is
      application-limited then that means there is effectively a "bubble" of
      silence in the pipe now, and this silence will be reflected in a lower
      bandwidth sample for any rate samples from now until we get an ACK
      indicating this bubble has exited the pipe: specifically, until we get
      an ACK for the next packet we transmit.
      
      When we send every skb we record in scb->tx.is_app_limited whether the
      resulting rate sample will be application-limited.
      
      The code in tcp_rate_gen() checks to see when it is safe to mark all
      known application-limited bubbles of silence as having exited the
      pipe. It does this by checking to see when the delivered count moves
      past the tp->app_limited marker. At this point it zeroes the
      tp->app_limited marker, as all known bubbles are out of the pipe.
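
      The four conditions and the marker lifecycle described above can be
      modeled in userspace as follows. The struct and its field names are
      hypothetical, the marker value (delivered plus in-flight packets) is an
      assumption chosen to match "until we get an ACK for the next packet we
      transmit", and sequence wraparound is ignored for brevity.

      ```c
      #include <stdbool.h>
      #include <stdint.h>

      /* Hypothetical per-flow state; counts are in packets unless noted. */
      struct flow_sketch {
          uint32_t unsent_bytes;      /* unsent data in the write queue */
          uint32_t mss;               /* bytes */
          uint32_t queued_below_tcp;  /* bytes sitting in qdisc or NIC queue */
          uint32_t in_flight;
          uint32_t cwnd;
          uint32_t lost_out;          /* packets marked lost */
          uint32_t retrans_out;       /* lost packets already retransmitted */
          uint32_t delivered;
          uint32_t app_limited;       /* marker; 0 means no known bubble */
      };

      /* All four conditions must hold for the flow to be app-limited. */
      static bool is_app_limited_now(const struct flow_sketch *f)
      {
          return f->unsent_bytes < f->mss &&      /* 1) under 1 MSS to send */
                 f->queued_below_tcp == 0 &&      /* 2) nothing queued below */
                 f->in_flight < f->cwnd &&        /* 3) not cwnd-limited */
                 f->lost_out <= f->retrans_out;   /* 4) nothing to retransmit */
      }

      /* Called on a write: remember where the bubble will leave the pipe. */
      static void check_app_limited(struct flow_sketch *f)
      {
          if (is_app_limited_now(f)) {
              uint32_t marker = f->delivered + f->in_flight;

              f->app_limited = marker ? marker : 1;  /* keep it nonzero */
          }
      }

      /* Called per ACK: clear the marker once delivered moves past it. */
      static void on_ack(struct flow_sketch *f, uint32_t newly_delivered)
      {
          f->delivered += newly_delivered;
          if (f->app_limited && f->delivered > f->app_limited)
              f->app_limited = 0;
      }
      ```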
      
      We make room for the tx.is_app_limited bit in the skb by borrowing a
      bit from the in_flight field used by NV to record the number of bytes
      in flight. The receive window in the TCP header is 16 bits, and the
      max receive window scaling shift factor is 14 (RFC 1323). So the max
      receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we
      only need 30 bits for the tx.in_flight used by NV.
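
      The bit budget can be checked with a small sketch. The struct name and
      layout below are illustrative assumptions; only the 30-bit in_flight
      and 1-bit is_app_limited widths come from the text.

      ```c
      #include <stdint.h>

      /* Illustrative layout: 30 bits of in-flight bytes plus one flag bit. */
      struct tx_cb_sketch {
          unsigned int in_flight : 30;
          unsigned int is_app_limited : 1;
          unsigned int unused : 1;
      };

      /* Largest window: 16-bit field shifted by the maximum scale of 14. */
      #define MAX_TCP_WINDOW (65535u << 14)

      /* Compile-time proof that the largest window fits in 30 bits. */
      _Static_assert(MAX_TCP_WINDOW < (1u << 30),
                     "max scaled window must fit in 30 bits");
      ```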
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: track data delivery rate for a TCP connection · b9f64820
      Yuchung Cheng authored
      This patch generates data delivery rate (throughput) samples on a
      per-ACK basis. These rate samples can be used by congestion control
      modules, and specifically will be used by TCP BBR in later patches in
      this series.
      
      Key state:
      
      tp->delivered: Tracks the total number of data packets (original or not)
      	       delivered so far. This is an already-existing field.
      
      tp->delivered_mstamp: the last time tp->delivered was updated.
      
      Algorithm:
      
      A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis:
      
        d1: the current tp->delivered after processing the ACK
        t1: the current time after processing the ACK
      
        d0: the prior tp->delivered when the acked skb was transmitted
        t0: the prior tp->delivered_mstamp when the acked skb was transmitted
      
      When an skb is transmitted, we snapshot d0 and t0 in its control
      block in tcp_rate_skb_sent().
      
      When an ACK arrives, it may SACK and ACK some skbs. For each SACKed
      or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct
      to reflect the latest (d0, t0).
      
      Finally, tcp_rate_gen() generates a rate sample by storing
      (d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us.
      
      One caveat: if an skb was sent with no packets in flight, then
      tp->delivered_mstamp may be either invalid (if the connection is
      starting) or outdated (if the connection was idle). In that case,
      we'll re-stamp tp->delivered_mstamp.
      
      At first glance it seems t0 should always be the time when an skb was
      transmitted, but actually this could over-estimate the rate due to
      phase mismatch between transmit and ACK events. To track the delivery
      rate, we ensure that if packets are in flight then t0 and t1 are
      times at which packets were marked delivered.
      
      If the initial and final RTTs are different then one may be corrupted
      by some sort of noise. The noise we see most often is sending gaps
      caused by delayed, compressed, or stretched acks. This either affects
      both RTTs equally or artificially reduces the final RTT. We approach
      this by recording the info we need to compute the initial RTT
      (duration of the "send phase" of the window) when we recorded the
      associated inflight. Then, for a filter to avoid bandwidth
      overestimates, we generalize the per-sample bandwidth computation
      from:
      
          bw = delivered / ack_phase_rtt
      
      to the following:
      
          bw = delivered / max(send_phase_rtt, ack_phase_rtt)
      
      In large-scale experiments, this filtering approach incorporating
      send_phase_rtt is effective at avoiding bandwidth overestimates due to
      ACK compression or stretched ACKs.
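
      The per-sample computation, including the max() filter, can be written
      out as a short userspace sketch. The struct and function names are
      hypothetical, and reporting the result in packets per second is a
      choice made here for readability.

      ```c
      #include <stdint.h>

      /* Inputs for one rate sample, named after the symbols in the text. */
      struct rate_inputs {
          uint32_t d0;             /* delivered count when the skb was sent */
          uint32_t d1;             /* delivered count after this ACK */
          uint64_t t0_us;          /* delivered timestamp when skb was sent */
          uint64_t t1_us;          /* time after processing this ACK */
          uint64_t send_phase_us;  /* duration of the window's send phase */
      };

      /* bw = delivered / max(send_phase_rtt, ack_phase_rtt), in packets/s. */
      static uint64_t delivery_rate_pps(const struct rate_inputs *in)
      {
          uint64_t delivered = (uint32_t)(in->d1 - in->d0);
          uint64_t ack_phase_us = in->t1_us - in->t0_us;
          uint64_t interval_us = in->send_phase_us > ack_phase_us
                                     ? in->send_phase_us
                                     : ack_phase_us;

          if (interval_us == 0)
              return 0;  /* no usable sample */
          return delivered * 1000000ULL / interval_us;
      }
      ```

      For example, 10 packets delivered with a 10 ms ACK phase would give
      1000 packets/s, but if the same packets took 20 ms to send, the longer
      interval wins and the sample is 500 packets/s. That is how compressed
      ACKs are kept from inflating the estimate.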
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: count packets marked lost for a TCP connection · 0682e690
      Neal Cardwell authored
      Count the number of packets that a TCP connection marks lost.
      
      Congestion control modules can use this loss rate information for more
      intelligent decisions about how fast to send.
      
      Specifically, this is used in TCP BBR policer detection. BBR uses a
      high packet loss rate as one signal in its policer detection and
      policer bandwidth estimation algorithm.
      
      The BBR policer detection algorithm cannot simply track retransmits,
      because a retransmit can be (and often is) an indicator of packets
      lost long, long ago. This is particularly true in a long CA_Loss
      period that repairs the initial massive losses when a policer kicks
      in.
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: switch back to proper tcp_skb_cb size check in tcp_init() · b2d3ea4a
      Eric Dumazet authored
      Revert to the tcp_skb_cb size check that tcp_init() had before commit
      b4772ef8 ("net: use common macro for assering skb->cb[] available
      size in protocol families"). As related commit 744d5a3e ("net:
      move skb->dropcount to skb->cb[]") explains, the
      sock_skb_cb_check_size() mechanism was added to ensure that there is
      space for dropcount, "for protocol families using it". But TCP is not
      a protocol using dropcount, so tcp_init() doesn't need to provision
      space for dropcount in the skb->cb[], and thus we can revert to the
      older form of the tcp_skb_cb size check. Doing so allows TCP to use 4
      more bytes of the skb->cb[] space.
      
      Fixes: b4772ef8 ("net: use common macro for assering skb->cb[] available size in protocol families")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: sch_fq: add low_rate_threshold parameter · 77879147
      Eric Dumazet authored
      This commit adds to the fq module a low_rate_threshold parameter to
      insert a delay after all packets if the socket requests a pacing rate
      below the threshold.
      
      This helps achieve more precise control of the sending rate with
      low-rate paths, especially policers. The basic issue is that if a
      congestion control module detects a policer at a certain rate, it may
      want fq to be able to shape to that policed rate. That way the sender
      can avoid policer drops by having the packets arrive at the policer at
      or just under the policed rate.
      
      The default threshold of 550Kbps was chosen analytically so that for
      policers or links at 500Kbps or 512Kbps fq would very likely invoke
      this mechanism, even if the pacing rate was briefly slightly above the
      available bandwidth. This value was then empirically validated with
      two years of production testing on YouTube video servers.
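
      A rough userspace model of the behavior: below the threshold, every
      packet is followed by its full serialization delay at the requested
      rate. The function name and the burst_credit_left flag are
      hypothetical; the flag stands in for whatever burst allowance a pacer
      grants to faster flows and is an assumption, not a description of fq
      internals.

      ```c
      #include <stdbool.h>
      #include <stdint.h>

      #define NSEC_PER_SEC 1000000000ULL
      /* Default from the text: 550 Kbps, expressed in bytes per second. */
      #define LOW_RATE_THRESHOLD_DEFAULT (550000u / 8u)

      /* Nanoseconds to wait before this flow may send its next packet. */
      static uint64_t pacing_delay_ns(uint32_t pkt_len,
                                      uint64_t rate_bytes_per_sec,
                                      uint64_t low_rate_threshold,
                                      bool burst_credit_left)
      {
          if (rate_bytes_per_sec == 0)
              return 0;  /* no pacing rate requested */

          /* Faster flows may skip the delay while burst credit remains. */
          if (rate_bytes_per_sec > low_rate_threshold && burst_credit_left)
              return 0;

          /* Low-rate flows are delayed after every packet. */
          return (uint64_t)pkt_len * NSEC_PER_SEC / rate_bytes_per_sec;
      }
      ```

      At 500 Kbps (62500 bytes/s), a 1500-byte packet is followed by a
      24 ms gap, which spaces packets so they arrive at the policer at the
      policed rate instead of in bursts.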
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: use windowed min filter library for TCP min_rtt estimation · 64033892
      Neal Cardwell authored
      Refactor the TCP min_rtt code to reuse the new win_minmax library in
      lib/win_minmax.c to simplify the TCP code.
      
      This is a pure refactor: the functionality is exactly the same. We
      just moved the windowed min code to make TCP easier to read and
      maintain, and to allow other parts of the kernel to use the windowed
      min/max filter code.
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lib/win_minmax: windowed min or max estimator · a4f1f9ac
      Neal Cardwell authored
      This commit introduces a generic library to estimate either the min or
      max value of a time-varying variable over a recent time window. This
      is code originally from Kathleen Nichols. The current form of the code
      is from Van Jacobson.
      
      A single struct minmax_sample will track the estimated windowed-max
      value of the series if you call minmax_running_max() or the estimated
      windowed-min value of the series if you call minmax_running_min().
      
      Nearly equivalent code is already in place for minimum RTT estimation
      in the TCP stack. This commit extracts that code and generalizes it to
      handle both min and max. Moving the code here reduces the footprint
      and complexity of the TCP code base and makes the filter generally
      available for other parts of the codebase, including an upcoming TCP
      congestion control module.
      
      This library works well for time series where the measurements are
      smoothly increasing or decreasing.
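
      A userspace sketch of a windowed-max estimator that keeps the best,
      second-best, and third-best samples, so an aging maximum can be
      replaced without rescanning history. The names winmax_reset() and
      winmax_update() are hypothetical, and the quarter-window and
      half-window refresh rules are written from an understanding of this
      style of filter, so treat them as assumptions rather than as the
      library's exact code.

      ```c
      #include <stdint.h>

      struct sample {
          uint32_t t;  /* time of measurement */
          uint32_t v;  /* measured value */
      };

      struct winmax {
          struct sample s[3];  /* s[0] best, s[1] second, s[2] third */
      };

      /* Forget history and start again from a single measurement. */
      static uint32_t winmax_reset(struct winmax *m, uint32_t t, uint32_t v)
      {
          struct sample val = { t, v };

          m->s[0] = m->s[1] = m->s[2] = val;
          return v;
      }

      /* Feed one measurement; returns the current windowed maximum. */
      static uint32_t winmax_update(struct winmax *m, uint32_t win,
                                    uint32_t t, uint32_t v)
      {
          struct sample val = { t, v };
          uint32_t dt;

          /* New overall max, or every stored sample is too old. */
          if (v >= m->s[0].v || t - m->s[2].t > win)
              return winmax_reset(m, t, v);

          if (v >= m->s[1].v)
              m->s[2] = m->s[1] = val;
          else if (v >= m->s[2].v)
              m->s[2] = val;

          dt = t - m->s[0].t;
          if (dt > win) {
              /* Best sample aged out: promote the runners-up. */
              m->s[0] = m->s[1];
              m->s[1] = m->s[2];
              m->s[2] = val;
              if (t - m->s[0].t > win) {
                  m->s[0] = m->s[1];
                  m->s[1] = m->s[2];
                  m->s[2] = val;
              }
          } else if (m->s[1].t == m->s[0].t && dt > win / 4) {
              /* A quarter window passed with no second choice: take one. */
              m->s[2] = m->s[1] = val;
          } else if (m->s[2].t == m->s[1].t && dt > win / 2) {
              /* Half a window passed with no third choice: take one. */
              m->s[2] = val;
          }
          return m->s[0].v;
      }
      ```

      A windowed-min variant is the same code with the comparisons on v
      reversed.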
      Signed-off-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: cdg: rename struct minmax in tcp_cdg.c to avoid a naming conflict · f78e73e2
      Soheil Hassas Yeganeh authored
      The upcoming change "lib/win_minmax: windowed min or max estimator"
      introduces a struct called minmax, which is then included in
      include/linux/tcp.h in the upcoming change "tcp: use windowed min
      filter library for TCP min_rtt estimation". This would create a
      compilation error for tcp_cdg.c, which defines its own minmax
      struct. To avoid this naming conflict (and potentially others in the
      future), this commit renames the version used in tcp_cdg.c to
      cdg_minmax.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
      Acked-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: mediatek: enhance with avoiding superfluous assignment inside mtk_get_ethtool_stats · 94d308d0
      Sean Wang authored
      data_src is unchanged inside the loop, so this patch moves
      the assignment outside the loop to avoid an unnecessary
      assignment.
      Signed-off-by: Sean Wang <sean.wang@mediatek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: handle multiple ports in ATU · 88472939
      Vivien Didelot authored
      An address can be loaded in the ATU with multiple ports, for instance
      when adding multiple ports to a Multicast group with "bridge mdb".
      
      The current code doesn't allow that. Add a helper to get a single entry
      from the ATU, then set or clear the requested port, before loading the
      entry back in the ATU.
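
      The read-modify-write step can be pictured with a small model of one
      ATU entry holding a port bit vector. The struct, its fields, and the
      rule that an entry with no ports left becomes invalid are all
      assumptions made for illustration; they are not the driver's types.

      ```c
      #include <stdbool.h>
      #include <stdint.h>

      /* Hypothetical in-memory copy of one address table entry. */
      struct atu_entry_sketch {
          uint8_t mac[6];
          uint16_t portvec;  /* bit N set means port N is a member */
          bool valid;
      };

      /* Set or clear one port, leaving the other members untouched. */
      static void atu_update_port(struct atu_entry_sketch *e,
                                  unsigned int port, bool add)
      {
          uint16_t bit = (uint16_t)(1u << port);

          if (add)
              e->portvec |= bit;
          else
              e->portvec &= (uint16_t)~bit;

          /* Assumption: an entry with no member ports is purged. */
          e->valid = e->portvec != 0;
      }
      ```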
      
      Note that the required _mv88e6xxx_atu_getnext function is defined below
      mv88e6xxx_port_db_load_purge, so forward-declare it for the moment. The
      ATU code will be isolated in future patches.
      
      Fixes: 83dabd1f ("net: dsa: mv88e6xxx: make switchdev DB ops generic")
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net sched actions: fix GETing actions · aecc5cef
      Jamal Hadi Salim authored
      With the batch changes that translated transient actions into
      a temporary list, lost in the translation was the fact that
      tcf_action_destroy() will eventually delete the action from
      the permanent location if the refcount is zero.
      
      Example of what broke:
      ...add a gact action to drop
      sudo $TC actions add action drop index 10
      ...now retrieve it, looks good
      sudo $TC actions get action gact index 10
      ...retrieve it again and find it is gone!
      sudo $TC actions get action gact index 10
      
      Fixes: 22dc13c8 ("net_sched: convert tcf_exts from list to pointer array")
      Fixes: 824a7e88 ("net_sched: remove an unnecessary list_del()")
      Fixes: f07fed82 ("net_sched: remove the leftover cleanup_a()")
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bpf-direct-packet-access-improvements' · 1d9423ae
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      BPF direct packet access improvements
      
      This set adds write support to the currently available read support
      for {cls,act}_bpf programs. First one is a fix for affected commit
      sitting in net-next and prerequisite for the second one, last patch
      adds a number of test cases against the verifier. For details, please
      see individual patches.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add test cases for direct packet access · 7d95b0ab
      Daniel Borkmann authored
      Add a couple of test cases for direct write and the negative size issue, and
      also adjust the direct packet access test4 since it asserts that writes are
      not possible, but since we've just added support for writes, we need to
      invert the verdict to ACCEPT, of course. Summary: 133 PASSED, 0 FAILED.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: direct packet write and access for helpers for clsact progs · 36bbef52
      Daniel Borkmann authored
      This work implements direct packet access for helpers and direct packet
      write in a similar fashion as already available for XDP types via commits
      4acf6c0b ("bpf: enable direct packet data write for xdp progs") and
      6841de8b ("bpf: allow helpers access the packet directly"), and as a
      complementary feature to the already available direct packet read for tc
      (cls/act) programs.
      
      For enabling this, we need to introduce two helpers, bpf_skb_pull_data()
      and bpf_csum_update(). The first is generally needed for both, read and
      write, because they would otherwise only be limited to the current linear
      skb head. Usually, when the data_end test fails, programs just bail out,
      or, in the direct read case, use bpf_skb_load_bytes() as an alternative
      to overcome this limitation. If such data sits in non-linear parts, we
      can just pull them in once with the new helper, retest and eventually
      access them.
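
      The "test, pull once, retest, then access" pattern can be modeled in
      plain userspace C. This is not an eBPF program: the struct, pull_data()
      and write_byte() are hypothetical stand-ins, with linear_len playing
      the role of the distance between data and data_end.

      ```c
      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      /* Toy packet: only the first linear_len bytes are directly reachable. */
      struct pkt_sketch {
          uint8_t bytes[64];
          size_t linear_len;  /* models data_end - data */
          size_t total_len;   /* must not exceed sizeof(bytes) */
      };

      /* Stand-in for a pull helper: make the first len bytes linear. */
      static int pull_data(struct pkt_sketch *p, size_t len)
      {
          if (len > p->total_len)
              return -1;  /* the packet is not that long */
          if (len > p->linear_len)
              p->linear_len = len;
          return 0;
      }

      /* Write one byte at off, pulling data in once if the check fails. */
      static bool write_byte(struct pkt_sketch *p, size_t off, uint8_t val)
      {
          if (off + 1 > p->linear_len) {      /* bounds test fails */
              if (pull_data(p, off + 1) != 0)
                  return false;
              if (off + 1 > p->linear_len)    /* retest after the pull */
                  return false;
          }
          p->bytes[off] = val;                /* direct write */
          return true;
      }
      ```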
      
      At the same time, this also makes sure the skb is uncloned, which is, of
      course, a necessary condition for direct write. As this needs to be an
      invariant for the write part only, the verifier detects writes and adds
      a prologue that is calling bpf_skb_pull_data() to effectively unclone the
      skb from the very beginning in case it is indeed cloned. The heuristic
      makes use of a similar trick that was done in 233577a2 ("net: filter:
      constify detection of pkt_type_offset"). This comes at zero cost for other
      programs that do not use the direct write feature. Should a program use
      this feature only sparsely and has read access for the most parts with,
      for example, drop return codes, then such write action can be delegated
      to a tail called program for mitigating this cost of potential uncloning
      to a late point in time where it would have been paid similarly with the
      bpf_skb_store_bytes() as well. Advantage of direct write is that the
      writes are inlined whereas the helper cannot make any length assumptions
      and thus needs to generate a call to memcpy() also for small sizes, as well
      as cost of helper call itself with sanity checks are avoided. Plus, when
      direct read is already used, we don't need to cache or perform rechecks
      on the data boundaries (due to verifier invalidating previous checks for
      helpers that change skb->data), so more complex programs using rewrites
      can benefit from switching to direct read plus write.
      
      For direct packet access to helpers, we save the otherwise needed copy into
      a temp struct sitting on stack memory when use-case allows. Both facilities
      are enabled via may_access_direct_pkt_data() in verifier. For now, we limit
      this to map helpers and csum_diff, and can successively enable other helpers
      where we find it makes sense. Helpers that definitely cannot be allowed for
      this are those part of bpf_helper_changes_skb_data() since they can change
      underlying data, and those that write into memory as this could happen for
      packet typed args when still cloned. bpf_csum_update() helper accommodates
      for the fact that we need to fixup checksum_complete when using direct write
      instead of bpf_skb_store_bytes(), meaning the programs can use available
      helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(),
      csum_block_add(), csum_block_sub() equivalents in eBPF together with the
      new helper. A usage example will be provided for iproute2's examples/bpf/
      directory.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf, verifier: enforce larger zero range for pkt on overloading stack buffs · b399cf64
      Daniel Borkmann authored
      Current contract for the following two helper argument types is:
      
        * ARG_CONST_STACK_SIZE: passed argument pair must be (ptr, >0).
        * ARG_CONST_STACK_SIZE_OR_ZERO: passed argument pair can be either
          (NULL, 0) or (ptr, >0).
      
      With 6841de8b ("bpf: allow helpers access the packet directly"), we can
      pass also raw packet data to helpers, so depending on the argument type
      being PTR_TO_PACKET, we now either assert memory via check_packet_access()
      or check_stack_boundary(). As a result, the tests in check_packet_access()
      currently allow more than intended with regards to reg->imm.
      
      Back in 969bf05e ("bpf: direct packet access"), check_packet_access()
      was fine to ignore size argument since in check_mem_access() size was
      bpf_size_to_bytes() derived and prior to the call to check_packet_access()
      guaranteed to be larger than zero.
      
      However, for the above two argument types, it currently means, we can have
      a <= 0 size and thus breaking current guarantees for helpers. Enforce a
      check for size <= 0 and bail out if so.
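
      The added guard amounts to one extra condition in the range check. A
      simplified sketch follows; the function name and its flat integer
      parameters are hypothetical, since the real verifier works on register
      state rather than plain ints.

      ```c
      #include <errno.h>

      /*
       * Accept an access of `size` bytes at offset `off` into a packet
       * region whose verified length is `range`. A size of zero or less is
       * rejected explicitly, so a helper is never handed an empty or
       * negative length together with a packet pointer.
       */
      static int check_packet_access_sketch(int off, int size, int range)
      {
          if (off < 0 || size <= 0 || off + size > range)
              return -EACCES;
          return 0;
      }
      ```

      Without the size <= 0 test, a call such as (off 0, size 0) would pass
      the range comparison even though nothing about the access had actually
      been verified.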
      
      check_stack_boundary() doesn't have such an issue since it already tests
      for access_size <= 0 and bails out, resp. access_size == 0 in case of NULL
      pointer passed when allowed.
      
      Fixes: 6841de8b ("bpf: allow helpers access the packet directly")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipvlan: Fix dependency issue · cf714ac1
      Mahesh Bandewar authored
      kbuild-build-bot reported that if NETFILTER is not selected, the
      build fails pointing to netfilter symbols.
      
      Fixes: 4fbae7d8 ("ipvlan: Introduce l3s mode")
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • openvswitch: avoid resetting flow key while installing new flow. · 2279994d
      pravin shelar authored
      Since commit db74a333 ("openvswitch: use percpu
      flow stats"), flow alloc resets the flow-key. So there is no need
      to reset the flow-key again if OVS is using a newly allocated
      flow-key.
      Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • openvswitch: Fix Frame-size larger than 1024 bytes warning. · 190aa3e7
      pravin shelar authored
      There is no need to declare separate key on stack,
      we can just use sw_flow->key to store the key directly.
      
      This commit fixes following warning:
      
      net/openvswitch/datapath.c: In function ‘ovs_flow_cmd_new’:
      net/openvswitch/datapath.c:1080:1: warning: the frame size of 1040 bytes
      is larger than 1024 bytes [-Wframe-larger-than=]
      Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'for-upstream' of... · 204dfe17
      David S. Miller authored
      Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
      
      Johan Hedberg says:
      
      ====================
      pull request: bluetooth-next 2016-09-19
      
      Here's the main bluetooth-next pull request for the 4.9 kernel.
      
       - Added new messages for monitor sockets for better mgmt tracing
       - Added local name and appearance support in scan response
       - Added new Qualcomm WCNSS SMD based HCI driver
       - Minor fixes & cleanup to 802.15.4 code
       - New USB ID to btusb driver
       - Added Marvell support to HCI UART driver
       - Add combined LED trigger for controller power
       - Other minor fixes here and there
      
      Please let me know if there are any issues pulling. Thanks.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • 6pack: fix buffer length mishandling · ad979896
      Alan Cox authored
      Dmitry Vyukov wrote:
      > different runs). Looking at code, the following looks suspicious -- we
      > limit copy by 512 bytes, but use the original count which can be
      > larger than 512:
      >
      > static void sixpack_receive_buf(struct tty_struct *tty,
      >     const unsigned char *cp, char *fp, int count)
      > {
      >     unsigned char buf[512];
      >     ....
      >     memcpy(buf, cp, count < sizeof(buf) ? count : sizeof(buf));
      >     ....
      >     sixpack_decode(sp, buf, count1);
      
      With the sane tty locking we now have, I believe the following is safe as
      we consume the bytes and move them into the decoded buffer before
      returning.
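
      The mismatch Dmitry describes (copy at most 512 bytes, then decode
      `count` bytes) can be avoided in two ways. The following is a minimal
      user-space sketch, not the actual patch: `decode_sketch()`,
      `receive_clamped()` and `receive_direct()` are hypothetical names, and
      the decoder is a stand-in that just sums bytes.

```c
#include <stddef.h>
#include <string.h>

#define BOUNCE_SIZE 512

/* Hypothetical stand-in for a decoder: sums the bytes it is given. */
static unsigned long decode_sketch(const unsigned char *data, size_t len)
{
    unsigned long sum = 0;

    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}

/* Variant A: keep the bounce buffer, but decode only what was copied. */
static unsigned long receive_clamped(const unsigned char *cp, size_t count)
{
    unsigned char buf[BOUNCE_SIZE];
    size_t n = count < sizeof(buf) ? count : sizeof(buf);

    memcpy(buf, cp, n);
    return decode_sketch(buf, n);   /* n, never the original count */
}

/* Variant B: no bounce buffer; consume the caller's bytes directly. */
static unsigned long receive_direct(const unsigned char *cp, size_t count)
{
    return decode_sketch(cp, count);
}
```

      Variant A silently drops anything past 512 bytes, while variant B
      processes the whole input and is only safe if the caller's buffer
      stays valid until decoding returns, which is the property the message
      attributes to the current tty locking.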
      Signed-off-by: Alan Cox <alan@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ad979896
  2. 20 Sep, 2016 17 commits
    • Jesper Dangaard Brouer's avatar
      mlx4: add missed recycle opportunity for XDP_TX on TX failure · 5737f6c9
      Jesper Dangaard Brouer authored
      Correct drop handling for XDP_TX on TX failure was recently added in
      commit 95357907 ("mlx4: fix XDP_TX is acting like XDP_PASS on TX
      ring full").
      
      The change missed an opportunity for recycling the RX page, instead of
      going through the page allocator, like the regular XDP_DROP action does.
      This patch seizes the opportunity by going through the XDP_DROP case.
      
      Fixes: 95357907 ("mlx4: fix XDP_TX is acting like XDP_PASS on TX ring full")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5737f6c9
    • David S. Miller's avatar
      Merge branch 'dsa-set_addr-optional' · 1860e688
      David S. Miller authored
      John Crispin says:
      
      ====================
      net-next: dsa: set_addr should be optional
      
      The Marvell driver is the only one that actually sets the switch's HW
      address. All other drivers have an empty stub. Fix this by making the
      callback optional.
      ====================
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1860e688
    • John Crispin's avatar
      net-next: dsa: qca8k: remove empty set_addr() stub · 8941ee36
      John Crispin authored
      The set_addr() callback is now optional. Remove the empty stub that qca8k
      has.
      Signed-off-by: John Crispin <john@phrozen.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8941ee36
    • John Crispin's avatar
      net-next: dsa: b53: remove empty set_addr() stub · 1f449736
      John Crispin authored
      The set_addr() callback is now optional. Remove the empty stub that b53
      has.
      Signed-off-by: John Crispin <john@phrozen.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1f449736
    • John Crispin's avatar
      net-next: dsa: make the set_addr() operation optional · 092183df
      John Crispin authored
      Only 1 of the 3 drivers currently has a set_addr() operation. Make the
      set_addr() callback optional to reduce the amount of empty stubs inside
      the drivers.
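
      An optional callback usually means the core checks the function
      pointer before calling it, so drivers can leave it unset. A minimal
      user-space sketch of that pattern follows; the struct and function
      names are hypothetical and do not mirror the real DSA structures.

```c
#include <stddef.h>

/* Hypothetical switch state, only for illustration. */
struct switch_sketch {
    unsigned char hw_addr[6];
    int addr_set;
};

struct switch_ops_sketch {
    /* Optional: drivers with no HW address to program leave this NULL. */
    int (*set_addr)(struct switch_sketch *sw, const unsigned char *addr);
};

/* A driver that does program an address. */
static int example_set_addr(struct switch_sketch *sw, const unsigned char *addr)
{
    for (int i = 0; i < 6; i++)
        sw->hw_addr[i] = addr[i];
    sw->addr_set = 1;
    return 0;
}

/* Core setup path: call the hook only when the driver provides one. */
static int setup_switch_sketch(const struct switch_ops_sketch *ops,
                               struct switch_sketch *sw,
                               const unsigned char *addr)
{
    if (ops->set_addr)
        return ops->set_addr(sw, addr);
    return 0;   /* nothing to do is not an error */
}
```

      With this shape, a driver without the hook needs no empty stub at all.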
      Signed-off-by: John Crispin <john@phrozen.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      092183df
    • John Crispin's avatar
      net-next: dsa: fix duplicate invocation of set_addr() · 06f8ec90
      John Crispin authored
      commit 83c0afae ("net: dsa: Add new binding implementation")
      has a duplicate invocation of the set_addr() operation callback. Remove one
      of them.
      Signed-off-by: John Crispin <john@phrozen.org>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      06f8ec90
    • David S. Miller's avatar
      Merge branch 'rhashtable-dups' · f361bdde
      David S. Miller authored
      Herbert Xu says:
      
      ====================
      rhashtable: rhashtable with duplicate objects
      
      v3 fixes a bug in the remove path that causes the element count
      to decrease when it shouldn't, leading to a gigantic hash table
      when it underflows.
      
      v2 contains a reworked insertion slowpath to ensure that the
      spinlock for the table we're inserting into is taken.
      
      This series contains two patches.  The first adds the rhlist
      interface and the second converts mac80211 to use it.  If this
      works out I'll then proceed to convert the other insecure_elasticity
      users over to this.
      
      I've tested the rhlist code with test_rhashtable but I haven't
      tested the mac80211 conversion.  So please give it a go and see
      if it still works.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f361bdde
    • Herbert Xu's avatar
      mac80211: Use rhltable instead of rhashtable · 83e7e4ce
      Herbert Xu authored
      mac80211 currently uses rhashtable with insecure_elasticity set
      to true.  The latter is because of duplicate objects.  What's
      more, mac80211 walks the rhashtable chains by hand which is broken
      as rhashtable may contain multiple tables due to resizing or
      rehashing.
      
      This patch fixes it by converting it to the newly added rhltable
      interface which is designed for use with duplicate objects.
      
      With rhltable a lookup returns a list of objects instead of a
      single one.  This is then fed into the existing for_each_sta_info
      macro.
      
      This patch also deletes the sta_addr_hash function since rhashtable
      defaults to jhash.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      83e7e4ce
    • Herbert Xu's avatar
      rhashtable: Add rhlist interface · ca26893f
      Herbert Xu authored
      The insecure_elasticity setting is an ugly wart brought out by
      users who need to insert duplicate objects (that is, distinct
      objects with identical keys) into the same table.
      
      In fact, those users have a much bigger problem.  Once those
      duplicate objects are inserted, they don't have an interface to
      find them (unless you count the walker interface which walks
      over the entire table).
      
      Some users have resorted to doing a manual walk over the hash
      table which is of course broken because they don't handle the
      potential existence of multiple hash tables.  The result is that
      they will break sporadically when they encounter a hash table
      resize/rehash.
      
      This patch provides a way out for those users, at the expense
      of an extra pointer per object.  Essentially each object is now
      a list of objects carrying the same key.  The hash table will
      only see the lists so nothing changes as far as rhashtable is
      concerned.
      
      To use this new interface, you need to insert a struct rhlist_head
      into your objects instead of struct rhash_head.  While the hash
      table is unchanged, for type-safety you'll need to use struct
      rhltable instead of struct rhashtable.  All the existing interfaces
      have been duplicated for rhlist, including the hash table walker.
      
      One missing feature is nulls marking because AFAIK the only potential
      user of it does not need duplicate objects.  Should anyone need
      this it shouldn't be too hard to add.
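
      The idea of "each object is now a list of objects carrying the same
      key" can be illustrated with a toy chained hash table. This is a
      user-space sketch under simplifying assumptions (fixed bucket count,
      no resizing, no RCU or locking); all names are hypothetical and it is
      not the rhashtable or rhltable API.

```c
#include <stddef.h>

#define NBUCKETS 8

struct obj_sketch {
    unsigned key;
    int value;
    struct obj_sketch *next_bucket; /* next distinct key in this bucket */
    struct obj_sketch *next_dup;    /* the extra pointer: next object, same key */
};

struct table_sketch {
    struct obj_sketch *buckets[NBUCKETS];
};

/* Returns the head of the list of objects carrying `key`, or NULL. */
static struct obj_sketch *lookup_sketch(struct table_sketch *t, unsigned key)
{
    struct obj_sketch *o = t->buckets[key % NBUCKETS];

    while (o && o->key != key)
        o = o->next_bucket;
    return o;
}

static void insert_sketch(struct table_sketch *t, struct obj_sketch *o)
{
    struct obj_sketch *head = lookup_sketch(t, o->key);

    if (head) {
        /* Duplicate key: chain behind the head; the bucket chain is untouched. */
        o->next_bucket = NULL;
        o->next_dup = head->next_dup;
        head->next_dup = o;
        return;
    }
    /* New key: the object becomes a one-element list in its bucket. */
    o->next_dup = NULL;
    o->next_bucket = t->buckets[o->key % NBUCKETS];
    t->buckets[o->key % NBUCKETS] = o;
}

/* Walk every object that shares `key`. */
static int count_dups_sketch(struct table_sketch *t, unsigned key)
{
    int n = 0;

    for (struct obj_sketch *o = lookup_sketch(t, key); o; o = o->next_dup)
        n++;
    return n;
}
```

      The bucket chains only ever see one entry per key, which matches the
      description that nothing changes as far as the hash table itself is
      concerned; a lookup hands back the list head and the caller iterates.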
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ca26893f
    • Vitaly Kuznetsov's avatar
      xen-netfront: avoid packet loss when ethernet header crosses page boundary · fd07160b
      Vitaly Kuznetsov authored
      Small packet loss is reported on complex multi host network configurations
      including tunnels, NAT, ... My investigation led me to the following check
      in netback which drops packets:
      
              if (unlikely(txreq.size < ETH_HLEN)) {
                      netdev_err(queue->vif->dev,
                                 "Bad packet size: %d\n", txreq.size);
                      xenvif_tx_err(queue, &txreq, extra_count, idx);
                      break;
              }
      
      But this check itself is legitimate. SKBs consist of a linear part (which
      has to have the ethernet header) and (optionally) a number of frags.
      Netfront transmits the head of the linear part up to the page boundary
      as the first request and all the rest becomes frags so when we're
      reconstructing the SKB in netback we can't distinguish between original
      frags and the 'tail' of the linear part. The first SKB needs to be at
      least ETH_HLEN size. So in case we have an SKB with its linear part
      starting too close to the page boundary the packet is lost.
      
      I see two ways to fix the issue:
      - Change the 'wire' protocol between netfront and netback to start keeping
        the original SKB structure. We'll have to add a flag indicating the fact
        that the particular request is a part of the original linear part and not
        a frag. We'll need to know the length of the linear part to pre-allocate
        memory.
      - Avoid transmitting SKBs with linear parts starting too close to the page
        boundary. That seems preferable short-term and shouldn't bring
        significant performance degradation as such packets are rare. That's what
        this patch is trying to achieve with skb_copy().
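
      The condition for "too close to the page boundary" can be worked
      through with plain arithmetic. The sketch below is user-space C with
      hypothetical helper names; it assumes a 4096-byte page (the real
      value is architecture-dependent) and uses 14 for the Ethernet header
      length.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH 4096u  /* assumption: 4 KiB pages */
#define ETH_HLEN_SKETCH  14u    /* Ethernet header length in bytes */

/* Size of the first request: linear data up to the next page boundary. */
static size_t first_req_size(uintptr_t data_addr, size_t linear_len)
{
    size_t offset = data_addr % PAGE_SIZE_SKETCH;
    size_t room = PAGE_SIZE_SKETCH - offset;

    return linear_len < room ? linear_len : room;
}

/* True when the first request could not hold a full Ethernet header,
 * i.e. when the data would have to be copied to a better-aligned buffer. */
static int needs_copy(uintptr_t data_addr)
{
    size_t offset = data_addr % PAGE_SIZE_SKETCH;

    return PAGE_SIZE_SKETCH - offset < ETH_HLEN_SKETCH;
}
```

      For example, linear data starting 6 bytes before a page boundary
      yields a 6-byte first request, which the netback check quoted above
      would reject; data starting exactly 14 bytes before the boundary is
      still acceptable.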
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Acked-by: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fd07160b
    • Raju Lakkaraju's avatar
      net: phy: Add MAC-IF driver for Microsemi PHYs. · 1a21101d
      Raju Lakkaraju authored
      All review comments have been addressed; resending for review.

      This adds the MAC interface (MAC-IF) feature.
      Microsemi PHYs can support an RGMII, RMII or GMII/MII interface between MAC and PHY.
      The MAC-IF function programs the right value based on the device tree configuration.
      
      Tested on Beaglebone Black with VSC 8531 PHY.
      Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1a21101d
    • Ido Schimmel's avatar
      mlxsw: spectrum: Fix sparse warnings · 1a9234e6
      Ido Schimmel authored
      drivers/net/ethernet/mellanox/mlxsw//spectrum.c:251:28: warning: symbol
      'mlxsw_sp_span_entry_find' was not declared. Should it be static?
      drivers/net/ethernet/mellanox/mlxsw//spectrum.c:265:28: warning: symbol
      'mlxsw_sp_span_entry_get' was not declared. Should it be static?
      drivers/net/ethernet/mellanox/mlxsw//spectrum.c:367:56: warning: mixing
      different enum types
      drivers/net/ethernet/mellanox/mlxsw//spectrum.c:367:56:     int enum
      mlxsw_sp_span_type  versus
      drivers/net/ethernet/mellanox/mlxsw//spectrum.c:367:56:     int enum
      mlxsw_reg_mpar_i_e
      ...
      drivers/net/ethernet/mellanox/mlxsw//spectrum_buffers.c:598:32: warning:
      mixing different enum types
      drivers/net/ethernet/mellanox/mlxsw//spectrum_buffers.c:598:32:     int
      enum mlxsw_reg_sbxx_dir  versus
      drivers/net/ethernet/mellanox/mlxsw//spectrum_buffers.c:598:32:     int
      enum devlink_sb_pool_type
      drivers/net/ethernet/mellanox/mlxsw//spectrum_buffers.c:600:39: warning:
      mixing different enum types
      drivers/net/ethernet/mellanox/mlxsw//spectrum_buffers.c:600:39:     int
      enum mlxsw_reg_sbpr_mode  versus
      drivers/net/ethernet/mellanox/mlxsw//spectrum_buffers.c:600:39:     int
      enum devlink_sb_threshold_type
      ...
      drivers/net/ethernet/mellanox/mlxsw//spectrum_router.c:255:54: warning:
      mixing different enum types
      drivers/net/ethernet/mellanox/mlxsw//spectrum_router.c:255:54:     int
      enum mlxsw_sp_l3proto  versus
      drivers/net/ethernet/mellanox/mlxsw//spectrum_router.c:255:54:     int
      enum mlxsw_reg_ralxx_protocol
      ...
      drivers/net/ethernet/mellanox/mlxsw//spectrum_router.c:1749:6: warning:
      symbol 'mlxsw_sp_fib_entry_put' was not declared. Should it be static?
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1a9234e6
    • Elad Raz's avatar
      mlxsw: Change the RX LAG hash function from XOR to CRC · 18c2d2c1
      Elad Raz authored
      Change the RX hash function from XOR to CRC in order to have better
      distribution of the traffic.
      Signed-off-by: Elad Raz <eladr@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      18c2d2c1
    • Rob Swindell's avatar
      bnxt_en: Fix build error for kernels without RTC-LIB · 878786d9
      Rob Swindell authored
      bnxt_hwrm_fw_set_time() now returns -EOPNOTSUPP when built for a kernel
      without RTC_LIB.  Setting the firmware time is not critical to the
      successful completion of the firmware update process.
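
      A common way to express "unsupported in this build, and not fatal" is
      a compile-time guard plus a caller that tolerates -EOPNOTSUPP. The
      sketch below is user-space C with hypothetical names;
      HAVE_RTC_LIB_SKETCH stands in for the kernel's RTC_LIB option, and
      how the real driver's caller treats the return value is an assumption.

```c
#include <errno.h>

/* Hypothetical build switch; leave undefined to model a build without RTC_LIB. */
/* #define HAVE_RTC_LIB_SKETCH 1 */

static int fw_set_time_sketch(void)
{
#ifdef HAVE_RTC_LIB_SKETCH
    /* A real implementation would convert the current time and send it
     * to the firmware here. */
    return 0;
#else
    return -EOPNOTSUPP;
#endif
}

/* Assumed caller behaviour: a missing time source does not abort the update. */
static int firmware_update_sketch(void)
{
    int rc = fw_set_time_sketch();

    if (rc && rc != -EOPNOTSUPP)
        return rc;  /* a genuine failure is still reported */

    /* ... continue with the rest of the update ... */
    return 0;
}
```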
      Signed-off-by: Rob Swindell <Rob.Swindell@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      878786d9
    • Jamal Hadi Salim's avatar
      net sched: stylistic cleanups · 5a7a5555
      Jamal Hadi Salim authored
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5a7a5555
    • Roman Mashak's avatar
      net sched actions police: peg drop stats for conforming traffic · f71b109f
      Roman Mashak authored
      Setting the conforming action to drop is a valid policy.
      When it is set, we need to at least see the stats indicating it
      for debugging.
      Signed-off-by: Roman Mashak <mrv@mojatatu.com>
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f71b109f
    • Jamal Hadi Salim's avatar
      net sched ife action: Introduce skb tcindex metadata encap decap · 408fbc22
      Jamal Hadi Salim authored
      Sample use case of how this is encoded:
      user space via tuntap (or a connected VM/Machine/container)
      encodes the tcindex TLV.
      
      Sample use case of decoding:
      IFE action decodes it and the skb->tc_index is then used to classify.
      So something like this for encoded ICMP packets:
      
      .. first decode then reclassify... skb->tc_index will be set
      sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xbeef \
      u32 match u32 0 0 flowid 1:1 \
      action ife decode reclassify
      
      ...next match the decoded icmp packet...
      sudo $TC filter add dev $ETH parent ffff: prio 4 protocol ip \
      u32 match ip protocol 1 0xff flowid 1:1 \
      action continue
      
      ... last classify it using the tcindex classifier and do some action..
      sudo $TC filter add dev $ETH parent ffff: prio 5 protocol ip \
      handle 0x11 tcindex classid 1:1 \
      action blah..
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      408fbc22