Commits · 1f9248e5606afc6485255e38ad57bdac08fa7711 · Kirill Smelkov / linux

10 Dec, 2013 22 commits

neigh: convert parms to an array · 1f9248e5

Jiri Pirko authored Dec 07, 2013

This patch converts the neigh param members to an array. This allows easier
manipulation which will be needed later on to provide better management of
default values.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

1f9248e5

Merge branch 'phy_reset' · 65be6291

David S. Miller authored Dec 09, 2013

Florian Fainelli says:

====================
net: phy: consolidate PHY reset

This patchset consolidates the PHY reset through the MII BMCR
register by using a central place were this is done.

This patchset resumes the work Kyle Moffett started here:
https://lkml.org/lkml/2011/10/20/301

Note that at this point, drivers doing funky things after issuing
a PHY reset using phy_init_hw() will still suffer from PHY state
machine problems, this will be taken care of later on.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

65be6291

net: sh_eth: do not issue a wild PHY reset through BMCR · 0c9eb5b9

Florian Fainelli authored Dec 06, 2013

The sh_eth driver issues an uncontrolled PHY reset through the MII
register BMCR but fails to wait for the reset to complete, and will also
implicitely wipe out all possible PHY fixups applied. Use phy_init_hw()
which remedies both problems.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0c9eb5b9

net: tc35815: use phy_init_hw for PHY reset · 01b0114e

Florian Fainelli authored Dec 06, 2013

Instead of open-coding the PHY reset through MII BMCR, use phy_init_hw()
which does that for us and also makes sure that any PHY specific fixups
are applied.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

01b0114e

net: pxa168_eth: use phy_init_hw for PHY reset · 78de53f0

Florian Fainelli authored Dec 06, 2013

Instead of open-coding a PHY reset through the MII BMCR register, use
phy_init_hw() which does this for us and ensures that PHY device fixups
are also applied. We also remove a call to ethernet_phy_reset() which is
now unncessary since phy_attach() calls phy_attach_direct() which in
turns calls phy_init_hw().
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

78de53f0

net: mv643xx_eth: use phy_init_hw to reset PHY · 7cd14636

Florian Fainelli authored Dec 06, 2013

Instead of open-coding a PHY reset through the MII BMCR register, use
phy_init_hw() which does that for us and will also make sure that PHY
fixups are applied if required. We also remove a call to phy_reset()
due to the following sequence of calls in the driver:

phy_scan()
	-> phy_connect()
		-> phy_connect_direct()
			-> phy_attach_direct()
				-> phy_init_hw()

and we only have a call to phy_init() after phy_scan().
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7cd14636

net: phy: consolidate PHY reset in phy_init_hw() · 87aa9f9c

Florian Fainelli authored Dec 06, 2013

There are quite a lot of drivers touching a PHY device MII_BMCR
register to reset the PHY without taking care of:

1) ensuring that BMCR_RESET is cleared after a given timeout
2) the PHY state machine resuming to the proper state and re-applying
potentially changed settings such as auto-negotiation

Introduce phy_poll_reset() which will take care of polling the MII_BMCR
for the BMCR_RESET bit to be cleared after a given timeout or return a
timeout error code.

In order to make sure the PHY is in a correct state, phy_init_hw() first
issues a software reset through MII_BMCR and then applies any fixups.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

87aa9f9c

net: bfin_mac: do not reset PHY after phy_start() · 06d87cec

Florian Fainelli authored Dec 06, 2013

The PHY is already reset during driver probing, and this manual reset
after calling phy_start() will wipe out board-specific PHY fixups and
driver specific configuration initialization. Remove that explicit PHY
reset.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

06d87cec

net: greth: use phy_read_status() · f0528ce7

Florian Fainelli authored Dec 06, 2013

In case the greth driver is bound to anything but the Generic PHY
driver or the PHY has a special read_status callback implemented,
unexpected things will happen. Make sure we that we use
phy_read_status() which does the proper abstraction of calling the
driver specific read_status() callback for a given PHY.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f0528ce7

net: phy: use phy_init_hw instead of open-coding it · 2613f95f

Florian Fainelli authored Dec 06, 2013

Use phy_init_hw() instead of open-coding it in phy_mii_ioctl(), this
improves consistenty and makes sure that we will not duplicate the same
routine somewhere else.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2613f95f

net: phy: report link partner features through ethtool · 114002bc

Florian Fainelli authored Dec 06, 2013

The PHY library already reads the MII_STAT1000 and MII_LPA registers in
genphy_read_status(), so extend it to also populate the PHY device link
partner advertised features such that we can feed this back into ethtool
when asked for it in phy_ethtool_gset().
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

114002bc

tun: remove useless codes in tun_chr_aio_read() and tun_recvmsg() · 73713357

Zhi Yong Wu authored Dec 07, 2013

By checking related codes, it is impossible that ret > len or total_len,
so we should remove some useless codes in both above functions.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

73713357

macvtap: remove useless codes in macvtap_aio_read() and macvtap_recvmsg() · 41e4af69

Zhi Yong Wu authored Dec 07, 2013

By checking related codes, it is impossible that ret > len or total_len,
so we should remove some useless coeds in both above functions.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

41e4af69

xen-netback: improve guest-receive-side flow control · ca2f09f2

Paul Durrant authored Dec 06, 2013

The way that flow control works without this patch is that, in start_xmit()
the code uses xenvif_count_skb_slots() to predict how many slots
xenvif_gop_skb() will consume and then adds this to a 'req_cons_peek'
counter which it then uses to determine if the shared ring has that amount
of space available by checking whether 'req_prod' has passed that value.
If the ring doesn't have space the tx queue is stopped.
xenvif_gop_skb() will then consume slots and update 'req_cons' and issue
responses, updating 'rsp_prod' as it goes. The frontend will consume those
responses and post new requests, by updating req_prod. So, req_prod chases
req_cons which chases rsp_prod, and can never exceed that value. Thus if
xenvif_count_skb_slots() ever returns a number of slots greater than
xenvif_gop_skb() uses, req_cons_peek will get to a value that req_prod cannot
possibly achieve (since it's limited by the 'real' req_cons) and, if this
happens enough times, req_cons_peek gets more than a ring size ahead of
req_cons and the tx queue then remains stopped forever waiting for an
unachievable amount of space to become available in the ring.

Having two routines trying to calculate the same value is always going to be
fragile, so this patch does away with that. All we essentially need to do is
make sure that we have 'enough stuff' on our internal queue without letting
it build up uncontrollably. So start_xmit() makes a cheap optimistic check
of how much space is needed for an skb and only turns the queue off if that
is unachievable. net_rx_action() is the place where we could do with an
accurate predicition but, since that has proven tricky to calculate, a cheap
worse-case (but not too bad) estimate is all we really need since the only
thing we *must* prevent is xenvif_gop_skb() consuming more slots than are
available.

Without this patch I can trivially stall netback permanently by just doing
a large guest to guest file copy between two Windows Server 2008R2 VMs on a
single host.

Patch tested with frontends in:
- Windows Server 2008R2
- CentOS 6.0
- Debian Squeeze
- Debian Wheezy
- SLES11
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Annie Li <annie.li@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ca2f09f2

tipc: remove interface state mirroring in bearer · 512137ee

Erik Hugne authored Dec 06, 2013

struct 'tipc_bearer' is a generic representation of the underlying
media type, and exists in a one-to-one relationship to each interface
TIPC is using. The struct contains a 'blocked' flag that mirrors the
operational and execution state of the represented interface, and is
updated through notification calls from the latter. The users of
tipc_bearer are checking this flag before each attempt to send a
packet via the interface.

This state mirroring serves no purpose in the current code base. TIPC
links will not discover a media failure any faster through this
mechanism, and in reality the flag only adds overhead at packet
sending and reception.

Furthermore, the fact that the flag needs to be protected by a spinlock
aggregated into tipc_bearer has turned out to cause a serious and
completely unnecessary deadlock problem.

CPU0                                    CPU1
----                                    ----
Time 0: bearer_disable()                link_timeout()
Time 1:   spin_lock_bh(&b_ptr->lock)      tipc_link_push_queue()
Time 2:   tipc_link_delete()                tipc_bearer_blocked(b_ptr)
Time 3:     k_cancel_timer(&req->timer)       spin_lock_bh(&b_ptr->lock)
Time 4:       del_timer_sync(&req->timer)

I.e., del_timer_sync() on CPU0 never returns, because the timer handler
on CPU1 is waiting for the bearer lock.

We eliminate the 'blocked' flag from struct tipc_bearer, along with all
tests on this flag. This not only resolves the deadlock, but also
simplifies and speeds up the data path execution of TIPC. It also fits
well into our ongoing effort to make the locking policy simpler and
more manageable.

An effect of this change is that we can get rid of functions such as
tipc_bearer_blocked(), tipc_continue() and tipc_block_bearer().
We replace the latter with a new function, tipc_reset_bearer(), which
resets all links associated to the bearer immediately after an
interface goes down.

A user might notice one slight change in link behaviour after this
change. When an interface goes down, (e.g. through a NETDEV_DOWN
event) all attached links will be reset immediately, instead of
leaving it to each link to detect the failure through a timer-driven
mechanism. We consider this an improvement, and see no obvious risks
with the new behavior.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <Paul.Gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

512137ee

x25: convert printks to pr_<level> · b73e9e3c

wangweidong authored Dec 06, 2013

use pr_<level> instead of printk(LEVEL)
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b73e9e3c

packet: introduce PACKET_QDISC_BYPASS socket option · d346a3fa

Daniel Borkmann authored Dec 06, 2013

This patch introduces a PACKET_QDISC_BYPASS socket option, that
allows for using a similar xmit() function as in pktgen instead
of taking the dev_queue_xmit() path. This can be very useful when
PF_PACKET applications are required to be used in a similar
scenario as pktgen, but with full, flexible packet payload that
needs to be provided, for example.

On default, nothing changes in behaviour for normal PF_PACKET
TX users, so everything stays as is for applications. New users,
however, can now set PACKET_QDISC_BYPASS if needed to prevent
own packets from i) reentering packet_rcv() and ii) to directly
push the frame to the driver.

In doing so we can increase pps (here 64 byte packets) for
PF_PACKET a bit:

  # CPUs -- QDISC_BYPASS   -- qdisc path -- qdisc path[**]
  1 CPU  ==  1,509,628 pps --  1,208,708 --  1,247,436
  2 CPUs ==  3,198,659 pps --  2,536,012 --  1,605,779
  3 CPUs ==  4,787,992 pps --  3,788,740 --  1,735,610
  4 CPUs ==  6,173,956 pps --  4,907,799 --  1,909,114
  5 CPUs ==  7,495,676 pps --  5,956,499 --  2,014,422
  6 CPUs ==  9,001,496 pps --  7,145,064 --  2,155,261
  7 CPUs == 10,229,776 pps --  8,190,596 --  2,220,619
  8 CPUs == 11,040,732 pps --  9,188,544 --  2,241,879
  9 CPUs == 12,009,076 pps -- 10,275,936 --  2,068,447
 10 CPUs == 11,380,052 pps -- 11,265,337 --  1,578,689
 11 CPUs == 11,672,676 pps -- 11,845,344 --  1,297,412
 [...]
 20 CPUs == 11,363,192 pps -- 11,014,933 --  1,245,081

 [**]: qdisc path with packet_rcv(), how probably most people
       seem to use it (hopefully not anymore if not needed)

The test was done using a modified trafgen, sending a simple
static 64 bytes packet, on all CPUs.  The trick in the fast
"qdisc path" case, is to avoid reentering packet_rcv() by
setting the RAW socket protocol to zero, like:
socket(PF_PACKET, SOCK_RAW, 0);

Tradeoffs are documented as well in this patch, clearly, if
queues are busy, we will drop more packets, tc disciplines are
ignored, and these packets are not visible to taps anymore. For
a pktgen like scenario, we argue that this is acceptable.

The pointer to the xmit function has been placed in packet
socket structure hole between cached_dev and prot_hook that
is hot anyway as we're working on cached_dev in each send path.

Done in joint work together with Jesper Dangaard Brouer.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d346a3fa

net: dev: move inline skb_needs_linearize helper to header · 4262e5cc

Daniel Borkmann authored Dec 06, 2013

As we need it elsewhere, move the inline helper function of
skb_needs_linearize() over to skbuff.h include file. While
at it, also convert the return to 'bool' instead of 'int'
and add a proper kernel doc.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4262e5cc

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 34f9f437

David S. Miller authored Dec 09, 2013

Merge 'net' into 'net-next' to get the AF_PACKET bug fix that
Daniel's direct transmit changes depend upon.
Signed-off-by: David S. Miller <davem@davemloft.net>

34f9f437

packet: fix send path when running with proto == 0 · 66e56cd4

Daniel Borkmann authored Dec 06, 2013

Commit e40526cb introduced a cached dev pointer, that gets
hooked into register_prot_hook(), __unregister_prot_hook() to
update the device used for the send path.

We need to fix this up, as otherwise this will not work with
sockets created with protocol = 0, plus with sll_protocol = 0
passed via sockaddr_ll when doing the bind.

So instead, assign the pointer directly. The compiler can inline
these helper functions automagically.

While at it, also assume the cached dev fast-path as likely(),
and document this variant of socket creation as it seems it is
not widely used (seems not even the author of TX_RING was aware
of that in his reference example [1]). Tested with reproducer
from e40526cb.

 [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap#Example

Fixes: e40526cb ("packet: fix use after free race in send path when dev is released")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Tested-by: Salam Noureddine <noureddine@aristanetworks.com>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

66e56cd4

pkt_sched: give visibility to mq slave qdiscs · 95dc1929

Eric Dumazet authored Dec 05, 2013

Commit 6da7c8fc ("qdisc: allow setting default queuing discipline")
added the ability to change default qdisc from pfifo_fast to say fq

But as most modern ethernet devices are multiqueue, we cant really
see all the statistics from "tc -s qdisc show", as the default root
qdisc is mq.

This patch adds the calls to qdisc_list_add() to mq and mqprio
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

95dc1929

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next · fbec3706

David S. Miller authored Dec 09, 2013

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates

This series contains updates to i40e only.

Jacob provides a i40e patch to get 1588 work correctly by separating
TSYNVALID and TSYNINDX fields in the receive descriptor.

Jesse provides several i40e patches, first to correct the checking
of the multi-bit state.  The hash is reported correctly in the RSS
field if and only if the filter status is 3.  Other values of the
filter status mean different things and we should not depend on a
bitwise result.  Then provides a patch to enable a couple of
workarounds based on revision ID that allow the driver to work
more fully on early hardware.

Shannon provides several i40e patches as well.  First sets the media
type in the hardware structure based on the external connection type.
Then provides a patch to only setup the rings that will be used.  Lastly
provides a fix where the TESTING state was still set when exiting the
ethtool diagnostics.

Kevin Scott provides one i40e patch to add a new flag to the i40e_add_veb()
which allows the driver to request the hardware to filter on layer 2
parameters.

Anjali provides four i40e patches, first refactors the reset code in
order to re-size queues and vectors while the interface is still up.
Then provides a patch to enable all PCTYPEs expect FCoE for RSS.  Adds
a message to notify the user of how many VFs are initialized on each
port.  Lastly adds a new variable to track the number of PF instances,
this is a global counter on purpose so that each PF loaded has a
unique ID.

Catherine bumps the driver version.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

fbec3706

09 Dec, 2013 12 commits