Commits · 63c0ad4d4135d3bdb81a1ee42436f3a403632a3e · Kirill Smelkov / linux

04 May, 2015 23 commits

sched: Call skb_get_hash_perturb in sch_sfb · 63c0ad4d

Tom Herbert authored May 01, 2015

Call skb_get_hash_perturb instead of doing skb_flow_dissect and then
jhash by hand.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

63c0ad4d

sched: Call skb_get_hash_perturb in sch_hhf · f969777a

Tom Herbert authored May 01, 2015

Call skb_get_hash_perturb instead of doing skb_flow_dissect and then
jhash by hand.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f969777a

sched: Call skb_get_hash_perturb in sch_fq_codel · 342db221

Tom Herbert authored May 01, 2015

Call skb_get_hash_perturb instead of doing skb_flow_dissect and then
jhash by hand.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

342db221

net: Add skb_get_hash_perturb · 50fb7992

Tom Herbert authored May 01, 2015

This calls flow_dissect and __skb_get_hash to procure a hash for a
packet. Input includes a key to initialize jhash. This function
does not set skb->hash.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

50fb7992
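
As a rough illustration of the idea described in 50fb7992 (dissect the flow keys, seed the hash with a caller-supplied key, and return the result without caching it), here is a minimal userspace sketch. The struct `flow_keys_sketch`, the helper `mix32()` and the function `flow_hash_perturb()` are hypothetical names invented for this example. The mixer is a generic stand-in and is not the kernel's jhash.

```c
#include <stdint.h>

/* Hypothetical, simplified flow keys; the kernel's struct flow_keys differs. */
struct flow_keys_sketch {
    uint32_t src;   /* source address */
    uint32_t dst;   /* destination address */
    uint32_t ports; /* source and destination port packed together */
};

/* Stand-in 32-bit mixer (NOT jhash): xorshift and odd-multiply steps. */
static uint32_t mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/*
 * Hash the flow keys, using the caller-supplied perturbation as the seed.
 * The value is only returned; nothing is stored back on the packet,
 * which mirrors the note that skb->hash is not set.
 */
static uint32_t flow_hash_perturb(const struct flow_keys_sketch *k,
                                  uint32_t perturb)
{
    uint32_t h = perturb;

    h = mix32(h ^ k->src);
    h = mix32(h ^ k->dst);
    h = mix32(h ^ k->ports);
    return h;
}
```

Because each qdisc passes its own perturbation value, the same flow can map to different buckets in different qdiscs, or in the same qdisc after the key is changed.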

net: ipv4: route: Fix sending IGMP messages with link address · 6a211654

Andrew Lunn authored May 01, 2015

In setups with a global scope address on an interface, and a lesser
scope address on an interface sending IGMP reports, the reports can be
sent using the other interface's global scope address rather than the
local interface address. RFC 2236 suggests:

     Ignore the Report if you cannot identify the source address of
     the packet as belonging to a subnet assigned to the interface on
     which the packet was received.

since such reports could be forged.

Look at the protocol when deciding if a RT_SCOPE_LINK address should
be used for the packet.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

6a211654

net: sched: run ingress qdisc without locks · 087c1a60

Alexei Starovoitov authored Apr 30, 2015

TC classifiers/actions were converted to RCU by John in the series:
http://thread.gmane.org/gmane.linux.network/329739/focus=329739
and many follow on patches.
This is the last patch from that series that finally drops
ingress spin_lock.

Single cpu ingress+u32 performance goes from 22.9 Mpps to 24.5 Mpps.

In the two-CPU case, when both cores are receiving traffic on the same
device and go into the same ingress+u32, the performance jumps
from 4.5 + 4.5 Mpps to 23.5 + 23.5 Mpps.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

087c1a60

Merge branch 'tcp_sack_rttm' · a89f96c9

David S. Miller authored May 03, 2015

Kenneth Klette Jonassen says:

====================
tcp: SACK RTTM changes for congestion control

This patch series improves SACK RTT measurements for congestion control:
  o Picks the latest sequence SACKed for RTT, i.e. most accurate delay
    signal.
  o Calls the congestion control's pkts_acked hook with SACK RTTMs
    even when not sequentially ACKing new data.

V2: amend misleading comment
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

a89f96c9

tcp: invoke pkts_acked hook on every ACK · 138998fd

Kenneth Klette Jonassen authored May 01, 2015

Invoking pkts_acked is currently conditioned on FLAG_ACKED:
receiving a cumulative ACK of new data, or ACK with SYN flag set.

Remove this condition so that CC may get RTT measurements from all SACKs.

Cc: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: David S. Miller <davem@davemloft.net>

138998fd

tcp: improve RTT from SACK for CC · 31231a8a

Kenneth Klette Jonassen authored May 01, 2015

tcp_sacktag_one() always picks the earliest sequence SACKed for RTT.
This might not make sense for congestion control in cases where:

  1. ACKs are lost, i.e. a SACK following a lost SACK covers both
     new and old segments at the receiver.
  2. The receiver disregards the RFC 5681 recommendation to immediately
     ACK out-of-order segments.

Give congestion control a RTT for the latest segment SACKed, which is the
most accurate RTT estimate, but preserve the conservative RTT for RTO.

Removes the call to skb_mstamp_get() in tcp_sacktag_one().

Cc: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

31231a8a
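
The distinction drawn in 31231a8a between the two RTT samples can be sketched in userspace as follows. This is only an illustration under stated assumptions: the struct `sack_rtt_sketch` and the function `sack_rtt_sample()` are hypothetical, and the input array is assumed to hold the send timestamps of the newly SACKed segments in sequence order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the two RTT samples derived from one ACK. */
struct sack_rtt_sketch {
    int64_t rto_rtt_us; /* from the earliest SACKed sequence (conservative) */
    int64_t ca_rtt_us;  /* from the latest SACKed sequence (for congestion control) */
};

/*
 * sent_us[] holds the send timestamps of the segments newly SACKed by one
 * ACK, ordered by sequence number; now_us is the ACK arrival time.
 * Both fields are -1 when no segment was SACKed.
 */
static struct sack_rtt_sketch sack_rtt_sample(const int64_t *sent_us,
                                              size_t n, int64_t now_us)
{
    struct sack_rtt_sketch r = { -1, -1 };

    if (n == 0)
        return r;

    r.rto_rtt_us = now_us - sent_us[0];     /* earliest sequence SACKed */
    r.ca_rtt_us  = now_us - sent_us[n - 1]; /* latest sequence SACKed */
    return r;
}
```

With segments sent at 1000, 1500 and 1900 µs and an ACK arriving at 2000 µs, the RTO sample is 1000 µs while the congestion control sample is 100 µs.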

tcp: move struct tcp_sacktag_state to tcp_ack() · 196da974

Kenneth Klette Jonassen authored May 01, 2015

A later patch passes two values set in tcp_sacktag_one() to
tcp_clean_rtx_queue(). Prepare for passing them via struct tcp_sacktag_state.
Acked-by: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: David S. Miller <davem@davemloft.net>

196da974

Merge branch 'rhashtable-test' · 10308220

David S. Miller authored May 03, 2015

Thomas Graf says:

====================
rhashtable self-test improvements

This series improves the rhashtable self-test to:
  * Avoid allocation of test objects
  * Measure the time of test runs
  * Use the iterator to walk the table for consistency
  * Account for failed insertions due to memory pressure or
    utilization pressure
  * Ignore failed insertions when checking for consistency
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

10308220

rhashtable-test: Detect insertion failures · 67b7cbf4

Thomas Graf authored Apr 30, 2015

Account for failed inserts due to memory pressure or EBUSY and
ignore failed entries during the consistency check.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

67b7cbf4

rhashtable-test: Use walker to test bucket statistics · 246b23a7

Thomas Graf authored Apr 30, 2015

As resizes may continue to run in the background, use walker to
ensure we see all entries. Also print the encountered number
of rehashes queued up while traversing.

This may lead to warnings due to entries being seen multiple
times. We consider them non-fatal.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

246b23a7

rhashtable-test: Do not allocate individual test objects · fcc57020

Thomas Graf authored Apr 30, 2015

By far the most expensive part of the selftest was the allocation
of entries. Using a static array allows us to measure the rhashtable
operations themselves.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

fcc57020

rhashtable-test: Get rid of ptr in test_obj structure · c2c8a901

Thomas Graf authored Apr 30, 2015

This only blows up the size of the test structure for no gain
in test coverage. Reduces size of test_obj from 24 to 16 bytes.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

c2c8a901

rhashtable-test: Measure time to insert, remove & traverse entries · 1aa661f5

Thomas Graf authored Apr 30, 2015

Make the test configurable by allowing all relevant knobs to be
specified through module parameters.

Do several test runs and measure the average time it takes to
insert & remove all entries. Note, a deferred resize might still
continue to run in the background.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

1aa661f5

rhashtable-test: Remove unused TEST_NEXPANDS · f54e84b6

Thomas Graf authored Apr 30, 2015

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

f54e84b6

Merge branch 'eth_type_trans' · 7a852021

David S. Miller authored May 03, 2015

Alexander Duyck says:

====================
A few minor clean-ups to eth_type_trans

This series addresses a few minor issues I found in eth_type_trans
that allow us to gain back something like 3 or more cycles per packet.

The first change is to drop the byte swap since it isn't necessary.  On x86
we could just check the first byte and compare that against the upper 8
bits of the Ethertype to determine if we are dealing with a size value or
not.

The second makes it so that the value we read in to test for multicast can
be used for the address comparison.  This allows us to avoid a second read
of the destination address.

The final change is to avoid some unneeded instructions in computing the
Ethernet header pointer.  When we start the call the Ethernet header is at
skb->data, so we just use that rather than computing mac_header, and then
adding that back to skb->head.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

7a852021

etherdev: Use skb->data to retrieve Ethernet header instead of eth_hdr · 610986e7

Alexander Duyck authored Apr 30, 2015

Avoid recomputing the Ethernet header location and instead just use the
pointer provided by skb->data.  The problem with using eth_hdr is that the
compiler wasn't smart enough to realize that skb->head + skb->mac_header
was the same thing as skb->data before it added ETH_HLEN.  By just caching
it off before calling skb_pull_inline we can avoid a few unnecessary
instructions.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

610986e7

etherdev: Process is_multicast_ether_addr at same size as other operations · d54385ce

Alexander Duyck authored Apr 30, 2015

This change makes it so that we process the address in
is_multicast_ether_addr at the same size as the other calls.  This allows
us to avoid duplicate reads when used with other calls such as
is_zero_ether_addr or eth_addr_copy.  In addition I have added a 64 bit
version of the function so in eth_type_trans we can process the destination
address as a 64 bit value throughout.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d54385ce
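
The wide-load idea from d54385ce can be sketched in portable userspace C. This is a hedged illustration, not the kernel helper: the function names are made up for the example, and `memcpy` is used for the 8-byte load so that the code stays well-defined on any alignment. The caller is assumed to guarantee that 8 bytes are readable at the address.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Byte-wise check: the group bit is the least significant bit of octet 0. */
static bool is_multicast_bytewise(const uint8_t addr[6])
{
    return addr[0] & 0x01;
}

/*
 * Wide-load variant: reads the 6 address bytes plus 2 trailing bytes as one
 * 64-bit value, so later comparisons can reuse the same loaded word.
 * The caller must guarantee that addr[6] and addr[7] are readable.
 */
static bool is_multicast_64bits(const uint8_t addr[6 + 2])
{
    const uint64_t probe = 1;
    bool little_endian = *(const uint8_t *)&probe == 1;
    uint64_t word;

    memcpy(&word, addr, sizeof(word));

    /* Octet 0 is the lowest byte on little endian, the highest on big endian. */
    return little_endian ? (word & 0x01) : ((word >> 56) & 0x01);
}
```

Both functions give the same answer; the second one simply performs the test at the width that the other address operations already use.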

etherdev: Avoid unnecessary byte swap in check for Ethertype · 849b920e

Alexander Duyck authored Apr 30, 2015

This change takes advantage of the fact that ETH_P_802_3_MIN is aligned to
512 so as a result we can actually ignore the lower 8b when comparing the
Ethertype to ETH_P_802_3_MIN.  This allows us to avoid a byte swap by simply
masking the value and comparing it to the byte swapped value for
ETH_P_802_3_MIN.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

849b920e
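
The masking trick from 849b920e can be checked in userspace with a short sketch. The function names below are invented for this example, and `htons()`/`ntohs()` come from the POSIX header `<arpa/inet.h>`, which is an assumption about the build environment. The point is that the low byte of ETH_P_802_3_MIN (0x0600) is zero, so only the high-order byte of the on-wire field decides the comparison.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

#define ETH_P_802_3_MIN 0x0600 /* smallest value treated as an Ethertype */

/* Reference version: convert the field to host order, then compare. */
static bool is_ethertype_swapped(uint16_t proto_be)
{
    return ntohs(proto_be) >= ETH_P_802_3_MIN;
}

/*
 * Swap-free version: mask off the low-order byte of the big-endian field
 * and compare against the byte-swapped constant. Both htons() calls take
 * constants, so no swap of the packet data is needed.
 */
static bool is_ethertype_masked(uint16_t proto_be)
{
    uint16_t high_only = proto_be & htons(0xFF00);

    return high_only >= htons(ETH_P_802_3_MIN);
}
```

The two versions agree for every possible 16-bit value on both little-endian and big-endian hosts.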

ipv6: Flow label state ranges · 82a584b7

Tom Herbert authored Apr 29, 2015

This patch divides the IPv6 flow label space into two ranges:
0-7ffff is reserved for flow label manager, 80000-fffff will be
used for creating auto flow labels (per RFC6438). This only affects how
labels are set on transmit; it does not affect receive. This range split
can be disabled by sysctl.

Background:

IPv6 flow labels have been an unmitigated disappointment thus far
in the lifetime of IPv6. Support in HW devices to use them for ECMP
is lacking, and OSes don't turn them on by default. If we had these
we could get much better hashing in IPv6 networks without resorting
to DPI, possibly eliminating some of the motivations to define new
encaps in UDP just for getting ECMP.

Unfortunately, the initial specifications of IPv6 did not clarify
how they are to be used. There has always been a vague concept that
these can be used for ECMP, flow hashing, etc. and we do now have a
good standard for how to do this in RFC6438. The problem is that flow labels
can be either stateful or stateless (as in RFC6438), and we are
presented with the possibility that a stateless label may collide
with a stateful one.  Attempts to split the flow label space were
rejected in IETF. When we added support in Linux for RFC6438, we
could not turn on flow labels by default due to this conflict.

This patch splits the flow label space and should give us
a path to enabling auto flow labels by default for all IPv6 packets.
This is an API change so we need to consider compatibility with
existing deployment. The stateful range is chosen to be the lower
values in hopes that most uses would have chosen small numbers.

Once we resolve the stateless/stateful issue, we can proceed to
look at enabling RFC6438 flow labels by default (starting with
scaled testing).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

82a584b7
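
The range split described in 82a584b7 amounts to reserving the top bit of the 20-bit label for automatically generated labels. A minimal sketch, assuming host byte order and using hypothetical names (`FLOWLABEL_STATELESS`, `make_auto_flowlabel()`, `flowlabel_is_stateful_range()`) that are not taken from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

#define FLOWLABEL_MASK      0xFFFFFu /* flow labels are 20 bits wide */
#define FLOWLABEL_STATELESS 0x80000u /* top bit marks the auto (stateless) range */

/*
 * Derive an automatic flow label from a flow hash. When state_ranges is
 * true the label is forced into 80000-fffff; when false (the split is
 * disabled) the full 20-bit space is used.
 */
static uint32_t make_auto_flowlabel(uint32_t hash, bool state_ranges)
{
    uint32_t label = hash & FLOWLABEL_MASK;

    if (state_ranges)
        label |= FLOWLABEL_STATELESS;
    return label;
}

/* True if a label lies in 0-7ffff, the range left to the flow label manager. */
static bool flowlabel_is_stateful_range(uint32_t label)
{
    return (label & FLOWLABEL_MASK) < FLOWLABEL_STATELESS;
}
```

With the split enabled, an automatically generated label can never land in the lower half of the space, so it cannot collide with a label handed out by the flow label manager.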

ipv6: Check RTF_LOCAL on rt->rt6i_flags instead of rt->dst.flags · 7035870d

Martin KaFai Lau authored May 03, 2015

In my earlier commit:
653437d0 ("ipv6: Stop /128 route from disappearing after pmtu update"),
there was a horrible typo.  Instead of checking RTF_LOCAL on
rt->rt6i_flags, it was checked on rt->dst.flags.  This patch fixes
it.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hajime Tazaki <tazaki@sfc.wide.ad.jp>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

7035870d

03 May, 2015 3 commits

net: sched: remove TC_MUNGED bits · 4749c3ef

Florian Westphal authored Apr 30, 2015

Not used.

pedit sets TC_MUNGED when packet content was altered, but all the core
does is unset MUNGED again and then set OK2MUNGE.

And the latter isn't tested anywhere. So let's remove both
TC_MUNGED and TC_OK2MUNGE.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4749c3ef

ipv4: remove the unnecessary codes in fib_info_hash_move · 7eee8cd4

Li RongQing authored Apr 30, 2015

The whole hlist will be moved, so there is no need to call hlist_del before
adding the hlist_node to the other hlist_head.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7eee8cd4

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 37155447

David S. Miller authored May 02, 2015

Merge net into net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>

37155447

02 May, 2015 12 commits

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 6c3c1eb3

Linus Torvalds authored May 01, 2015

Pull networking fixes from David Miller:

 1) Receive packet length needs to be adjusted by 2 on RX to accommodate
    the two padding bytes in altera_tse driver.  From Vlastimil Setka.

 2) If rx frame is dropped due to out of memory in macb driver, we leave
    the receive ring descriptors in an undefined state.  From Punnaiah
    Choudary Kalluri.

 3) Some netlink subsystems erroneously signal NLM_F_MULTI.  That is
    only for dumps.  Fix from Nicolas Dichtel.

 4) Fix mis-use of raw rt->rt_pmtu value in ipv4, one must always go via
    the ipv4_mtu() helper.  From Herbert Xu.

 5) Fix null deref in bridge netfilter, and miscalculated lengths in
    jump/goto nf_tables verdicts.  From Florian Westphal.

 6) Unhash ping sockets properly.

 7) Software implementation of BPF divide did 64/32 rather than 64/64
    bit divide.  The JITs got it right.  Fix from Alexei Starovoitov.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (30 commits)
  ipv4: Missing sk_nulls_node_init() in ping_unhash().
  net: fec: Fix RGMII-ID mode
  net/mlx4_en: Schedule napi when RX buffers allocation fails
  netxen_nic: use spin_[un]lock_bh around tx_clean_lock
  net/mlx4_core: Fix unaligned accesses
  mlx4_en: Use correct loop cursor in error path.
  cxgb4: Fix MC1 memory offset calculation
  bnx2x: Delay during kdump load
  net: Fix Kernel Panic in bonding driver debugfs file: rlb_hash_table
  net: dsa: Fix scope of eeprom-length property
  net: macb: Fix race condition in driver when Rx frame is dropped
  hv_netvsc: Fix a bug in netvsc_start_xmit()
  altera_tse: Correct rx packet length
  mlx4: Fix tx ring affinity_mask creation
  tipc: fix problem with parallel link synchronization mechanism
  tipc: remove wrong use of NLM_F_MULTI
  bridge/nl: remove wrong use of NLM_F_MULTI
  bridge/mdb: remove wrong use of NLM_F_MULTI
  net: sched: act_connmark: don't zap skb->nfct
  trivial: net: systemport: bcmsysport.h: fix 0x0x prefix
  ...

6c3c1eb3

virtio: fix typo in vring_need_event() doc comment · e412d3a3

Stefan Hajnoczi authored May 02, 2015

Here the "other side" refers to the guest or host.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e412d3a3

virtio: pass baton to Michael Tsirkin · feda5f93

Rusty Russell authored May 02, 2015

With my job change kernel work will be "own time"; I'm keeping lguest
and modules (and the virtio standards work), but virtio kernel has to
go.

This makes it clear that Michael is in charge.  He's good, but having
me watch over his shoulder won't help.

Good luck Michael!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

feda5f93

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client · 6fa72720

Linus Torvalds authored May 01, 2015

Pull Ceph RBD fix from Sage Weil.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: end I/O the entire obj_request on error

6fa72720

ipv4: Missing sk_nulls_node_init() in ping_unhash(). · a134f083

David S. Miller authored May 01, 2015

If we don't do that, then the poison value is left in the ->pprev
backlink.

This can cause crashes if we do a disconnect, followed by a connect().
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Wen Xu <hotdog3645@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a134f083

net: rocker: Use ether_addr_equal · 629161f6

Simon Horman authored Apr 30, 2015

A small cleanup to make use of the ether_addr_equal helper.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

629161f6

Merge branch 'rt6_pmtu' · 36c82963

David S. Miller authored May 01, 2015

Martin KaFai Lau says:

====================
ipv6: Stop /128 route from disappearing after pmtu update

The series is separated from another patch series,
'ipv6: Only create RTF_CACHE route after encountering pmtu exception',
which can be found here:
http://thread.gmane.org/gmane.linux.network/359140

This series focuses on fixing the /128 route issues.  It is currently targeted
for net-next due to the amount of code churn, but it is also applicable
to net (it should apply without conflict).  The original reported problem can be
found here:
http://thread.gmane.org/gmane.linux.network/348138

Patch 01 and 02 are to prepare the fib6 search to expect both the
RTF_CACHE clone and its original route exist at the same fib6_node.

Patch 03 fixes the /128 route disappearing bug.

Patch 04 and 05 stop rt6_info from using the inet_peer's metrics to
keep the /128 routes (like the /128 clone and its original route)
from stepping on each other's metrics.

The second patch is by 'Steffen Klassert <steffen.klassert@secunet.com>'
which I pulled off from netdev.  The third patch is also mostly by
Steffen with one minor optimization.

Many thanks to Hannes Frederic Sowa <hannes@stressinduktion.org> for
reviewing the patches and giving advice.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

36c82963

ipv6: Remove DST_METRICS_FORCE_OVERWRITE and _rt6i_peer · afc4eef8

Martin KaFai Lau authored Apr 28, 2015

_rt6i_peer is no longer needed after the last patch,
'ipv6: Stop rt6_info from using inet_peer's metrics'.

DST_METRICS_FORCE_OVERWRITE is added by
commit e5fd387a ("ipv6: do not overwrite inetpeer metrics prematurely").
Since inetpeer is no longer used for metrics, this bit is also not needed.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Michal Kubeček <mkubecek@suse.cz>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

afc4eef8

ipv6: Stop rt6_info from using inet_peer's metrics · 4b32b5ad

Martin KaFai Lau authored Apr 28, 2015

inet_peer is indexed by the dst address alone.  However, the fib6 tree
could have multiple routing entries (rt6_info) for the same dst. For
example,
1. A /128 dst via multiple gateways.
2. A RTF_CACHE route cloned from a /128 route.

In the above cases, all of them will share the same metrics and
step on each other.

This patch will steer away from inet_peer's metrics and use
dst_cow_metrics_generic() for everything.

Change Highlights:
1. Remove rt6_cow_metrics() which currently acquires metrics from
   inet_peer for DST_HOST route (i.e. /128 route).
2. Add rt6i_pmtu to take care of the pmtu update to avoid creating a
   full size metrics just to override the RTAX_MTU.
3. After (2), the RTF_CACHE route can also share the metrics with its
   dst.from route, by:
   dst_init_metrics(&cache_rt->dst, dst_metrics_ptr(cache_rt->dst.from), true);
4. Stop creating RTF_CACHE route by cloning another RTF_CACHE route.  Instead,
   directly clone from rt->dst.

   [ Currently, cloning from another RTF_CACHE is only possible during
     rt6_do_redirect().  Also, the old clone is removed from the tree
     immediately after the new clone is added. ]

   In case of cloning from an older redirect RTF_CACHE, it should work as
   before.

   In case of cloning from an older pmtu RTF_CACHE, this patch will forget
   the pmtu and re-learn it (if there is any) from the redirected route.

The _rt6i_peer and DST_METRICS_FORCE_OVERWRITE will be removed
in the next cleanup patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4b32b5ad

ipv6: Stop /128 route from disappearing after pmtu update · 653437d0

Martin KaFai Lau authored Apr 28, 2015

This patch is mostly from Steffen Klassert <steffen.klassert@secunet.com>.
I only removed the (rt6->rt6i_dst.plen == 128) check from
ip6_rt_update_pmtu() because the (rt6->rt6i_flags & RTF_CACHE) test
has already implied it.

This patch:
1. Create RTF_CACHE route for /128 non local route
2. After (1), all routes that allow pmtu update should have a RTF_CACHE
   clone.  Hence, stop updating MTU for any non RTF_CACHE route.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

653437d0

ipv6: Extend the route lookups to low priority metrics. · 9fbdcfaf

Steffen Klassert authored Apr 28, 2015

We search only for routes with the highest priority metric in
find_rr_leaf(). However, if one of these routes is marked
as invalid, we may fail to find a route even if there is
an appropriate route with lower priority. Then we lose
connectivity until the garbage collector deletes the
invalid route. This typically happens if a host route
expires after a pmtu event. Fix this by also searching
for routes with a lower priority metric.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

9fbdcfaf

ipv6: Consider RTF_CACHE when searching the fib6 tree · 1f56a01f

Martin KaFai Lau authored Apr 28, 2015

This is prep work for the later bug-fix patch, which will stop the /128 route
from disappearing after a pmtu update.

The later bug-fix patch will allow a /128 route and its RTF_CACHE clone
to both exist at the same fib6_node.  To do this, we need to prepare the
existing fib6 tree search to expect RTF_CACHE for a /128 route.

Note that the fn->leaf is sorted by rt6i_metric.  Hence,
RTF_CACHE (if there is any) is always at the front.  This property
leads to the following:

1. When doing ip6_route_del(), it should honor the RTF_CACHE flag, which
   the caller uses to ask for deleting a clone or a non-clone.
   rtm_to_fib6_config() should also check RTM_F_CLONED and
   then set RTF_CACHE accordingly so that:
   - 'ip -6 r del...' will make ip6_route_del() delete a route
     and all its clones. Note that its clones are flushed by fib6_del().
   - 'ip -6 r flush table cache' will make ip6_route_del()
     delete only the clone(s).

2. Exclude RTF_CACHE from addrconf_get_prefix_route() which
   should not configure on a cloned route.

3. No change is needed for rt6_device_match() since it currently could
   return a RTF_CACHE clone route, so the later bug-fix patch will not
   affect it.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1f56a01f

01 May, 2015 2 commits

rbd: end I/O the entire obj_request on error · 082a75da

Ilya Dryomov authored Apr 25, 2015

When we end I/O struct request with error, we need to pass
obj_request->length as @nr_bytes so that the entire obj_request worth
of bytes is completed.  Otherwise block layer ends up confused and we
trip on

    rbd_assert(more ^ (which == img_request->obj_request_count));

in rbd_img_obj_callback() due to more being true no matter what.  We
already do it in most cases but we are missing some, in particular
those where we don't even get a chance to submit any obj_requests, due
to an early -ENOMEM for example.

A number of obj_request->xferred assignments seem to be redundant but
I haven't touched any of obj_request->xferred stuff to keep this small
and isolated.

Cc: Alex Elder <elder@linaro.org>
Cc: stable@vger.kernel.org # 3.10+
Reported-by: Shawn Edwards <lesser.evil@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

082a75da

ipv4: speedup ip_idents_reserve() · 355b590c

Eric Dumazet authored May 01, 2015

Under stress, ip_idents_reserve() is accessing a contended
cache line twice, with non-optimal MESI transactions.

If we place timestamps in a separate location, we reduce this
pressure by ~50% and allow atomic_add_return() to issue
a Request for Ownership.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

355b590c
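
The layout change described in 355b590c can be sketched with C11 atomics. This is a simplified userspace illustration under several assumptions: the table size `IDENTS_SZ`, the array names and the function `idents_reserve()` are hypothetical, `now` stands in for a kernel tick counter, and the jump applied after idle time is the elapsed tick count itself rather than a random value, so that the example stays deterministic.

```c
#include <stdatomic.h>
#include <stdint.h>

#define IDENTS_SZ 2048u /* hypothetical table size for this sketch */

/* Counters and timestamps live in two separate arrays. */
static _Atomic uint32_t idents[IDENTS_SZ];
static _Atomic uint32_t tstamps[IDENTS_SZ];

/*
 * Reserve `segs` consecutive IDs from the bucket selected by `hash` and
 * return the first one. `now` is the current tick supplied by the caller.
 */
static uint32_t idents_reserve(uint32_t hash, uint32_t segs, uint32_t now)
{
    _Atomic uint32_t *p_ts = &tstamps[hash % IDENTS_SZ];
    _Atomic uint32_t *p_id = &idents[hash % IDENTS_SZ];
    uint32_t old = atomic_load_explicit(p_ts, memory_order_relaxed);
    uint32_t delta = 0;

    /* Only the caller that wins the timestamp update applies the jump. */
    if (old != now && atomic_compare_exchange_strong(p_ts, &old, now))
        delta = now - old;

    /* A single read-modify-write on the counter's own cache line. */
    return atomic_fetch_add(p_id, segs + delta) + delta;
}
```

In the common case where the timestamp is already current, the timestamp slot is only read, and the counter is touched by exactly one atomic add.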