Commits · 20138cf9efc1a761a17839581b0130d8e4209882 · Kirill Smelkov / linux

07 Dec, 2016 32 commits

net: ethernet: ti: cpts: fix overflow check period · 20138cf9

Grygorii Strashko authored Dec 06, 2016

The CPTS drivers uses 8sec period for overflow checking with
assumption that CPTS retclk will not exceed 500MHz. But that's not
true on some TI platforms (Kesytone 2). As result, it is possible that
CPTS counter will overflow more than once between two readings.

Hence, fix it by selecting overflow check period dynamically as
max_sec_before_overflow/2, where
 max_sec_before_overflow = max_counter_val / rftclk_freq.

Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

20138cf9

net: ethernet: ti: cpts: calc mult and shift from refclk freq · 88f0f0b0

Grygorii Strashko authored Dec 06, 2016

The cyclecounter mult and shift values can be calculated based on the
CPTS rfclk frequency and timekeepnig framework provides required algos
and API's.

Hence, calc mult and shift basing on CPTS rfclk frequency if both
cpts_clock_shift and cpts_clock_mult properties are not provided in DT (the
basis of calculation algorithm is borrowed from
__clocksource_update_freq_scale() commit 7d2f944a ("clocksource:
Provide a generic mult/shift factor calculation")). After this change
cpts_clock_shift and cpts_clock_mult DT properties will become optional.

Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

88f0f0b0

clocksource: export the clocks_calc_mult_shift to use by timestamp code · 5304121a

Murali Karicheri authored Dec 06, 2016

The CPSW CPTS driver is capable of doing timestamping on tx/rx packets and
requires to know mult and shift factors for timestamp conversion from raw
value to nanoseconds (ptp clock). Now these mult and shift factors are
calculated manually and provided through DT, which makes very hard to
support of a lot number of platforms, especially if CPTS refclk is not the
same for some kind of boards and depends on efuse settings (Keystone 2
platforms). Hence, export clocks_calc_mult_shift() to allow drivers like
CPSW CPTS (and other ptp drivesr) to benefit from automaitc calculation of
mult and shift factors.

Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

5304121a

net: ethernet: ti: cpts: move dt props parsing to cpts driver · 4a88fb95

Grygorii Strashko authored Dec 06, 2016

Move DT properties parsing into CPTS driver to simplify CPSW
code and CPTS driver porting on other SoC in the future
(like Keystone 2) - with this change it will not be required
to add the same DT parsing code in Keystone 2 NETCP driver.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4a88fb95

net: ethernet: ti: cpts: rework initialization/deinitialization · 8a2c9a5a

Grygorii Strashko authored Dec 06, 2016

The current implementation CPTS initialization and deinitialization
(represented by cpts_register/unregister()) does too many static
initialization from .ndo_open(), which is reasonable to do once at probe
time instead, and also require caller to allocate memory for struct cpts,
which is internal for CPTS driver in general.

This patch splits CPTS initialization and deinitialization on two parts:

- static initializtion cpts_create()/cpts_release() which expected to be
executed when parent driver is probed/removed;

- dynamic part cpts_register/unregister() which expected to be executed
when network device is opened/closed.

As result, current code of CPTS parent driver - CPSW - will be simplified
(and it also will allow simplify adding support for Keystone 2 devices in
the future), plus more initialization errors will be catched earlier. In
addition, this change allows to clean up cpts.h for the case when CPTS is
disabled.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8a2c9a5a

net: ethernet: ti: cpts: drop excessive writes to CTRL and INT_EN regs · 2a79df3e

Grygorii Strashko authored Dec 06, 2016

CPTS module and IRQs are always enabled when CPTS is registered,
before starting overflow check work, and disabled during
deregistration, when overflow check work has been canceled already.
So, It doesn't require to (re)enable CPTS module and IRQs in
cpts_overflow_check().
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2a79df3e

net: ethernet: ti: cpts: clean up event list if event pool is empty · e4439fa8

WingMan Kwok authored Dec 06, 2016

When a CPTS user does not exit gracefully by disabling cpts
timestamping and leaving a joined multicast group, the system
continues to receive and timestamps the ptp packets which eventually
occupy all the event list entries.  When this happns, the added code
tries to remove some list entries which are expired.
Signed-off-by: WingMan Kwok <w-kwok2@ti.com>
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e4439fa8

net: ethernet: ti: cpts: disable cpts when unregistered · 8fcd6891

Grygorii Strashko authored Dec 06, 2016

The cpts now is left enabled after unregistration.
Hence, disable it in cpts_unregister().
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8fcd6891

net: ethernet: ti: cpts: fix registration order · 6c691405

Grygorii Strashko authored Dec 06, 2016

The ptp clock registered before spinlock, which is protecting it, and
before timecounter and cyclecounter initialization in cpts_register().

So, ensure that ptp clock is registered the last, after everything
else is done.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6c691405

net: ethernet: ti: cpts: fix unbalanced clk api usage in cpts_register/unregister · fd123a94

Grygorii Strashko authored Dec 06, 2016

There are two issues with TI CPTS code which are reproducible when TI
CPSW ethX device passes few up/down iterations:
- cpts refclk prepare counter continuously incremented after each
up/down iteration;
- devm_clk_get(dev, "cpts") is called many times.

Hence, fix these issues by using clk_disable_unprepare() in
cpts_clk_release() and skipping devm_clk_get() if cpts refclk has been
acquired already.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fd123a94

net: ethernet: ti: cpsw: minimize direct access to struct cpts · b63ba58e

Grygorii Strashko authored Dec 06, 2016

This will provide more flexibility in changing CPTS internals and also
required for further changes.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b63ba58e

net: ethernet: ti: allow cpts to be built separately · c8395d4e

Grygorii Strashko authored Dec 06, 2016

TI CPTS IP is used as part of TI OMAP CPSW driver, but it's also
present as part of NETCP on TI Keystone 2 SoCs. So, It's required
to enable build of CPTS for both this drivers and this can be
achieved by allowing CPTS to be built separately.

Hence, allow cpts to be built separately and convert it to be
a module as both CPSW and NETCP drives can be built as modules.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c8395d4e

net: ethernet: ti: cpts: switch to readl/writel_relaxed() · 391fd6ca

Grygorii Strashko authored Dec 06, 2016

Switch to readl/writel_relaxed() APIs, because this is recommended
API and the CPTS IP is reused on Keystone 2 SoCs
where LE/BE modes are supported.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

391fd6ca

Merge branch 'bnxt_en-RDMA' · d3243aef

David S. Miller authored Dec 07, 2016

Michael Chan says:

====================
bnxt_en: Add interface to support RDMA driver.

This series adds an interface to support a brand new RDMA driver bnxt_re.
The first step is to re-arrange some code so that pci_enable_msix() can
be called during pci probe.  The purpose is to allow the RDMA driver to
initialize and stay initialized whether the netdev is up or down.

Then we make some changes to VF resource allocation so that there is
enough resources to support RDMA.

Finally the last patch adds a simple interface to allow the RDMA driver to
probe and register itself with any bnxt_en devices that support RDMA.
Once registered, the RDMA driver can request MSIX, send fw messages, and
receive some notifications.

v2: Fixed kbuild test robot warnings.

David, please consider this series for net-next.  Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

d3243aef

bnxt_en: Add interface to support RDMA driver. · a588e458

Michael Chan authored Dec 07, 2016

Since the network driver and RDMA driver operate on the same PCI function,
we need to create an interface to allow the RDMA driver to share resources
with the network driver.

1. Create a new bnxt_en_dev struct which will be returned by
bnxt_ulp_probe() upon success.  After that, all calls from the RDMA driver
to bnxt_en will pass a pointer to this struct.

2. This struct contains additional function pointers to register, request
msix, send fw messages, register for async events.

3. If the RDMA driver wants to enable RDMA on the function, it needs to
call the function pointer bnxt_register_device().  A ulp_ops structure
is passed for RCU protected upcalls from bnxt_en to the RDMA driver.

4. The RDMA driver can call firmware APIs using the bnxt_send_fw_msg()
function pointer.

5. 1 stats context is reserved when the RDMA driver registers.  MSIX
and completion rings are reserved when the RDMA driver calls
bnxt_request_msix() function pointer.

6. When the RDMA driver calls bnxt_unregister_device(), all RDMA resources
will be cleaned up.

v2: Fixed 2 uninitialized variable warnings.
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a588e458

bnxt_en: Refactor the driver registration function with firmware. · a1653b13

Michael Chan authored Dec 07, 2016

The driver register function with firmware consists of passing version
information and registering for async events.  To support the RDMA driver,
the async events that we need to register may change.  Separate the
driver register function into 2 parts so that we can just update the
async events for the RDMA driver.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a1653b13

bnxt_en: Reserve RDMA resources by default. · e4060d30

Michael Chan authored Dec 07, 2016

If the device supports RDMA, we'll setup network default rings so that
there are enough minimum resources for RDMA, if possible. However, the
user can still increase network rings to the max if he wants. The actual
RDMA resources won't be reserved until the RDMA driver registers.

v2: Fix compile warning when BNXT_CONFIG_SRIOV is not set.
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e4060d30

bnxt_en: Improve completion ring allocation for VFs. · 7b08f661

Michael Chan authored Dec 07, 2016

All available remaining completion rings not used by the PF should be
made available for the VFs so that there are enough rings in the VF to
support RDMA.  The earlier workaround code of capping the rings by the
statistics context is removed.

When SRIOV is disabled, call a new function bnxt_restore_pf_fw_resources()
to restore FW resources.  Later on we need to add some logic to account
for RDMA resources.
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7b08f661

bnxt_en: Move function reset to bnxt_init_one(). · aa8ed021

Michael Chan authored Dec 07, 2016

Now that MSIX is enabled in bnxt_init_one(), resources may be allocated by
the RDMA driver before the network device is opened.  So we cannot do
function reset in bnxt_open() which will clear all the resources.

The proper place to do function reset now is in bnxt_init_one().
If we get AER, we'll do function reset as well.
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

aa8ed021

bnxt_en: Enable MSIX early in bnxt_init_one(). · 7809592d

Michael Chan authored Dec 07, 2016

To better support the new RDMA driver, we need to move pci_enable_msix()
from bnxt_open() to bnxt_init_one().  This way, MSIX vectors are available
to the RDMA driver whether the network device is up or down.

Part of the existing bnxt_setup_int_mode() function is now refactored into
a new bnxt_init_int_mode().  bnxt_init_int_mode() is called during
bnxt_init_one() to enable MSIX.  The remaining logic in
bnxt_setup_int_mode() to map the IRQs to the completion rings is called
during bnxt_open().

v2: Fixed compile warning when CONFIG_BNXT_SRIOV is not set.
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7809592d

bnxt_en: Add bnxt_set_max_func_irqs(). · 33c2657e

Michael Chan authored Dec 07, 2016

By refactoring existing code into this new function.  The new function
will be used in subsequent patches.

v2: Fixed compile warning when CONFIG_BNXT_SRIOV is not set.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

33c2657e

net: sock_rps_record_flow() is for connected sockets · 5b8e2f61

Eric Dumazet authored Dec 06, 2016

Paolo noticed a cache line miss in UDP recvmsg() to access
sk_rxhash, sharing a cache line with sk_drops.

sk_drops might be heavily incremented by cpus handling a flood targeting
this socket.

We might place sk_drops on a separate cache line, but lets try
to avoid wasting 64 bytes per socket just for this, since we have
other bottlenecks to take care of.

sock_rps_record_flow() should only access sk_rxhash for connected
flows.

Testing sk_state for TCP_ESTABLISHED covers most of the cases for
connected sockets, for a zero cost, since system calls using
sock_rps_record_flow() also access sk->sk_prot which is on the
same cache line.

A follow up patch will provide a static_key (Jump Label) since most
hosts do not even use RFS.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5b8e2f61

tun: Use netif_receive_skb instead of netif_rx · d4aea20d

Andrey Konovalov authored Dec 01, 2016

This patch changes tun.c to call netif_receive_skb instead of netif_rx
when a packet is received (if CONFIG_4KSTACKS is not enabled to avoid
stack exhaustion). The difference between the two is that netif_rx queues
the packet into the backlog, and netif_receive_skb proccesses the packet
in the current context.

This patch is required for syzkaller [1] to collect coverage from packet
receive paths, when a packet being received through tun (syzkaller collects
coverage per process in the process context).

As mentioned by Eric this change also speeds up tun/tap. As measured by
Peter it speeds up his closed-loop single-stream tap/OVS benchmark by
about 23%, from 700k packets/second to 867k packets/second.

A similar patch was introduced back in 2010 [2, 3], but the author found
out that the patch doesn't help with the task he had in mind (for cgroups
to shape network traffic based on the original process) and decided not to
go further with it. The main concern back then was about possible stack
exhaustion with 4K stacks.

[1] https://github.com/google/syzkaller

[2] https://www.spinics.net/lists/netdev/thrd440.html#130570

[3] https://www.spinics.net/lists/netdev/msg130570.htmlSigned-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d4aea20d

Merge branch 'w83977af_ir-neatening' · d10d0198

David S. Miller authored Dec 06, 2016

Joe Perches says:

====================
irda: w83977af_ir: Neatening

Originally on top of Arnd's overly long udelay patches because I
noticed a misindented block.  That's now already fixed along with some
other whitespace problems.  These patches are the remainder style
issues from my original series.

Even though I haven't turned on the netwinder in a box in the
garage in who knows how long, if this device is still used somewhere,
might as well neaten the code too.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

d10d0198

irda: w83977af_ir: Neaten logging · 99d8d215

Joe Perches authored Dec 06, 2016

Use more common logging style, standardize function output logging use.

Miscellanea:

o Add and use pr_fmt
o Convert printks to pr_<level>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

99d8d215

irda: w83977af_ir: Parenthesis alignment · 646bf092

Joe Perches authored Dec 06, 2016

Neaten function declaration and definition arguments.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

646bf092

irda: w83977af_ir: Use the common brace style · ae9e736b

Joe Perches authored Dec 06, 2016

Add braces where appropriate and remove an unnecessary else.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ae9e736b

irda: w83977af_ir: Neaten pointer comparisons · c9471646

Joe Perches authored Dec 06, 2016

Convert pointer comparisons to NULL.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c9471646

irda: w83977af_ir: Remove and add blank lines · 47fda1a7

Joe Perches authored Dec 06, 2016

Use a more typical vertical spacing style.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

47fda1a7

irda: w83977af_ir: More whitespace neatening · 8a3fb40d

Joe Perches authored Dec 06, 2016

Add spaces around operators.
git diff -w shows no differences.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8a3fb40d

irda: w83977af_ir: Whitespace neatening · 352019c8

Joe Perches authored Dec 06, 2016

Remove leading and trailing whitespace.
git diff -w shows no differences.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

352019c8

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · c63d352f
David S. Miller authored Dec 06, 2016

c63d352f

06 Dec, 2016 8 commits

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc · bc3913a5

Linus Torvalds authored Dec 06, 2016

Pull sparc fix from David Miller:
 "A use-before-NULL-check from Dan Carpenter"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  dbri: move dereference after check for NULL

bc3913a5

dbri: move dereference after check for NULL · 163117e8

Dan Carpenter authored Dec 01, 2016

We accidentally introduced a dereference before the NULL check in
xmit_descs() as part of silencing a GCC warning.

Fixes: 16f46050 ("dbri: Fix compiler warning")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

163117e8

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · da1b466f

Linus Torvalds authored Dec 06, 2016

Pull networking fixes from David Miller:

 1) When dcbnl_cee_fill() fails to be able to push a new netlink
    attribute, it return 0 instead of an error code. From Pan Bian.

 2) Two suffix handling fixes to FIB trie code, from Alexander Duyck.

 3) bnxt_hwrm_stat_ctx_alloc() goes through all the trouble of setting
    and maintaining a return code 'rc' but fails to actually return it.
    Also from Pan Bian.

 4) ping socket ICMP handler needs to validate ICMP header length, from
    Kees Cook.

 5) caif_sktinit_module() has this interesting logic:

        int err = sock_register(...);
        if (!err)
                return err;
        return 0;

    Just return sock_register()'s return value directly which is the
    only possible correct thing to do.

 6) Two bnx2x driver fixes from Yuval Mintz, return a reasonable
    estimate from get_ringparam() ethtool op when interface is down and
    avoid trying to use UDP port based tunneling on 577xx chips.

 7) Fix ep93xx_eth crash on module unload from Florian Fainelli.

 8) Missing uapi exports, from Stephen Hemminger.

 9) Don't schedule work from sk_destruct(), because the socket will be
    freed upon return from that function. From Herbert Xu.

10) Buggy drivers, of which we know there is at least one, can send a
    huge packet into the TCP stack but forget to set the gso_size in the
    SKB, which causes all kinds of problems.

    Correct this when it happens, and emit a one-time warning with the
    device name included so that it can be diagnosed more easily.

    From Marcelo Ricardo Leitner.

11) virtio-net does DMA off the stack causes hiccups with VMAP_STACK,
    fix from Andy Lutomirski.

12) Fix fec driver compilation with CONFIG_M5272, from Nikita
    Yushchenko.

13) mlx5 fixes from Kamal Heib, Saeed Mahameed, and Mohamad Haj Yahia.
    (erroneously flushing queues on error, module parameter validation,
    etc)

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (34 commits)
  net/mlx5e: Change the SQ/RQ operational state to positive logic
  net/mlx5e: Don't flush SQ on error
  net/mlx5e: Don't notify HW when filling the edge of ICO SQ
  net/mlx5: Fix query ISSI flow
  net/mlx5: Remove duplicate pci dev name print
  net/mlx5: Verify module parameters
  net: fec: fix compile with CONFIG_M5272
  be2net: Add DEVSEC privilege to SET_HSW_CONFIG command.
  virtio-net: Fix DMA-from-the-stack in virtnet_set_mac_address()
  tcp: warn on bogus MSS and try to amend it
  uapi glibc compat: fix outer guard of net device flags enum
  net: stmmac: clear reset value of snps, wr_osr_lmt/snps, rd_osr_lmt before writing
  netlink: Do not schedule work from sk_destruct
  uapi: export nf_log.h
  uapi: export tc_skbmod.h
  net: ep93xx_eth: Do not crash unloading module
  bnx2x: Prevent tunnel config for 577xx
  bnx2x: Correct ringparam estimate when DOWN
  isdn: hisax: set error code on failure
  net: bnx2x: fix improper return value
  ...

da1b466f

shmem: fix shm fallocate() list corruption · 10d20bd2

Linus Torvalds authored Dec 05, 2016

The shmem hole punching with fallocate(FALLOC_FL_PUNCH_HOLE) does not
want to race with generating new pages by faulting them in.

However, the wait-queue used to delay the page faulting has a serious
problem: the wait queue head (in shmem_fallocate()) is allocated on the
stack, and the code expects that "wake_up_all()" will make sure that all
the queue entries are gone before the stack frame is de-allocated.

And that is not at all necessarily the case.

Yes, a normal wake-up sequence will remove the wait-queue entry that
caused the wakeup (see "autoremove_wake_function()"), but the key
wording there is "that caused the wakeup".  When there are multiple
possible wakeup sources, the wait queue entry may well stay around.

And _particularly_ in a page fault path, we may be faulting in new pages
from user space while we also have other things going on, and there may
well be other pending wakeups.

So despite the "wake_up_all()", it's not at all guaranteed that all list
entries are removed from the wait queue head on the stack.

Fix this by introducing a new wakeup function that removes the list
entry unconditionally, even if the target process had already woken up
for other reasons.  Use that "synchronous" function to set up the
waiters in shmem_fault().

This problem has never been seen in the wild afaik, but Dave Jones has
reported it on and off while running trinity.  We thought we fixed the
stack corruption with the blk-mq rq_list locking fix (commit
7fe31130: "blk-mq: update hardware and software queues for sleeping
alloc"), but it turns out there was _another_ stack corruptor hiding
in the trinity runs.

Vegard Nossum (also running trinity) was able to trigger this one fairly
consistently, and made us look once again at the shmem code due to the
faults often being in that area.

Reported-and-tested-by: Vegard Nossum <vegard.nossum@oracle.com>.
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

10d20bd2

Merge branch 'mlx5-fixes' · 32f16e14

David S. Miller authored Dec 06, 2016

Saeed Mahameed says:

====================
Mellanox 100G mlx5 fixes 2016-12-04

Some bug fixes for mlx5 core and mlx5e driver.

v1->v2:
 - replace "uint" with "unsigned int"
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

32f16e14

net/mlx5e: Change the SQ/RQ operational state to positive logic · c0f1147d

Mohamad Haj Yahia authored Dec 06, 2016

When using the negative logic (i.e. FLUSH state), after the RQ/SQ reopen
we will have a time interval that the RQ/SQ is not really ready and the
state indicates that its not in FLUSH state because the initial SQ/RQ struct
memory starts as zeros.
Now we changed the state to indicate if the SQ/RQ is opened and we will
set the READY state after finishing preparing all the SQ/RQ resources.

Fixes: 6e8dd6d6 ("net/mlx5e: Don't wait for SQ completions on close")
Fixes: f2fde18c ("net/mlx5e: Don't wait for RQ completions on close")
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c0f1147d

net/mlx5e: Don't flush SQ on error · 3c8591d5

Saeed Mahameed authored Dec 06, 2016

We are doing SQ descriptors cleanup in driver.

Fixes: 6e8dd6d6 ("net/mlx5e: Don't wait for SQ completions on close")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3c8591d5

net/mlx5e: Don't notify HW when filling the edge of ICO SQ · b8335d91

Saeed Mahameed authored Dec 06, 2016

We are going to do this a couple of steps ahead anyway.

Fixes: d3c9bc27 ("net/mlx5e: Added ICO SQs")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b8335d91