Commits · c3eab85ff11a8cd4def8cf2b4cc0610f6b47a8cd · Kirill Smelkov / linux

05 Jul, 2017 40 commits

Btrfs: fix truncate down when no_holes feature is enabled · c3eab85f

Liu Bo authored Dec 01, 2016


[ Upstream commit 91298eec ]

For such a file mapping,

[0-4k][hole][8k-12k]

In NO_HOLES mode, we don't have the [hole] extent any more.
Commit c1aa4575 ("Btrfs: fix shrinking truncate when the no_holes feature is enabled")
 fixed disk isize not being updated in NO_HOLES mode when data is not flushed.

However, even if data has been flushed, we can still have trouble
in updating disk isize since we updated disk isize to 'start' of
the last evicted extent.
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

c3eab85f

Btrfs: Fix deadlock between direct IO and fast fsync · e8b5068b

Chandan Rajendra authored Dec 23, 2016


[ Upstream commit 97dcdea0 ]

The following deadlock is seen when executing generic/113 test,

 ---------------------------------------------------------+----------------------------------------------------
  Direct I/O task                                           Fast fsync task
 ---------------------------------------------------------+----------------------------------------------------
  btrfs_direct_IO
    __blockdev_direct_IO
     do_blockdev_direct_IO
      do_direct_IO
       btrfs_get_blocks_direct
        while (blocks needs to written)
         get_more_blocks (first iteration)
          btrfs_get_blocks_direct
           btrfs_create_dio_extent
             down_read(&BTRFS_I(inode) >dio_sem)
             Create and add extent map and ordered extent
             up_read(&BTRFS_I(inode) >dio_sem)
                                                            btrfs_sync_file
                                                              btrfs_log_dentry_safe
                                                               btrfs_log_inode_parent
                                                                btrfs_log_inode
                                                                 btrfs_log_changed_extents
                                                                  down_write(&BTRFS_I(inode) >dio_sem)
                                                                   Collect new extent maps and ordered extents
                                                                    wait for ordered extent completion
         get_more_blocks (second iteration)
          btrfs_get_blocks_direct
           btrfs_create_dio_extent
             down_read(&BTRFS_I(inode) >dio_sem)
 --------------------------------------------------------------------------------------------------------------

In the above description, Btrfs direct I/O code path has not yet started
submitting bios for file range covered by the initial ordered
extent. Meanwhile, The fast fsync task obtains the write semaphore and
waits for I/O on the ordered extent to get completed. However, the
Direct I/O task is now blocked on obtaining the read semaphore.

To resolve the deadlock, this commit modifies the Direct I/O code path
to obtain the read semaphore before invoking
__blockdev_direct_IO(). The semaphore is then given up after
__blockdev_direct_IO() returns. This allows the Direct I/O code to
complete I/O on all the ordered extents it creates.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

e8b5068b

gianfar: Do not reuse pages from emergency reserve · 83571e9e

Eric Dumazet authored Jan 18, 2017


[ Upstream commit 69fed99b ]

A driver using dev_alloc_page() must not reuse a page that had to
use emergency memory reserve.

Otherwise all packets using this page will be immediately dropped,
unless for very specific sockets having SOCK_MEMALLOC bit set.

This issue might be hard to debug, because only a fraction of the RX
ring buffer would suffer from drops.

Fixes: 75354148 ("gianfar: Add paged allocation and Rx S/G")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Claudiu Manoil <claudiu.manoil@freescale.com>
Acked-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

83571e9e

objtool: Fix IRET's opcode · c48a862c

Jiri Slaby authored Jan 18, 2017


[ Upstream commit b5b46c47 ]

The IRET opcode is 0xcf according to the Intel manual and also to objdump of my
vmlinux:

    1ea8:       48 cf                   iretq

Fix the opcode in arch_decode_instruction().

The previous value (0xc5) seems to correspond to LDS.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170118132921.19319-1-jslaby@suse.czSigned-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

c48a862c

bpf: don't trigger OOM killer under pressure with map alloc · 251d00bf

Daniel Borkmann authored Jan 18, 2017


[ Upstream commit d407bd25 ]

This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
that are to be used for map allocations. Using kmalloc() for very large
allocations can cause excessive work within the page allocator, so i) fall
back earlier to vmalloc() when the attempt is considered costly anyway,
and even more importantly ii) don't trigger OOM killer with any of the
allocators.

Since this is based on a user space request, for example, when creating
maps with element pre-allocation, we really want such requests to fail
instead of killing other user space processes.

Also, don't spam the kernel log with warnings should any of the allocations
fail under pressure. Given that, we can make backend selection in
bpf_map_area_alloc() generic, and convert all maps over to use this API
for spots with potentially large allocation requests.

Note, replacing the one kmalloc_array() is fine as overflow checks happen
earlier in htab_map_alloc(), since it must also protect the multiplication
for vmalloc() should kmalloc_array() fail.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

251d00bf

bnxt_en: Fix "uninitialized variable" bug in TPA code path. · a7a2a6d3

Michael Chan authored Jan 17, 2017


[ Upstream commit 719ca811 ]

In the TPA GRO code path, initialize the tcp_opt_len variable to 0 so
that it will be correct for packets without TCP timestamps.  The bug
caused the SKB fields to be incorrectly set up for packets without
TCP timestamps, leading to these packets being rejected by the stack.
Reported-by: Andy Gospodarek <andrew.gospodarek@broadocm.com>
Acked-by: Andy Gospodarek <andrew.gospodarek@broadocm.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

a7a2a6d3

xen-netback: protect resource cleaning on XenBus disconnect · da805bc7

Igor Druzhinin authored Jan 17, 2017


[ Upstream commit f16f1df6 ]

vif->lock is used to protect statistics gathering agents from using the
queue structure during cleaning.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

da805bc7

xen-netback: fix memory leaks on XenBus disconnect · 7bdccaa5

Igor Druzhinin authored Jan 17, 2017


[ Upstream commit 9a6cdf52 ]

Eliminate memory leaks introduced several years ago by cleaning the
queue resources which are allocated on XenBus connection event. Namely, queue
structure array and pages used for IO rings.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

7bdccaa5

net: ethtool: Initialize buffer when querying device channel settings · 5dcd0859

Eran Ben Elisha authored Jan 17, 2017


[ Upstream commit 31a86d13 ]

Ethtool channels respond struct was uninitialized when querying device
channel boundaries settings. As a result, unreported fields by the driver
hold garbage.  This may cause sending unsupported params to driver.

Fixes: 8bf36862 ('ethtool: ensure channel counts are within bounds ...')
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
CC: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

5dcd0859

powerpc/eeh: Enable IO path on permanent error · 6e315b2b

Gavin Shan authored Jan 06, 2017


[ Upstream commit 387bbc97 ]

We give up recovery on permanent error, simply shutdown the affected
devices and remove them. If the devices can't be put into quiet state,
they spew more traffic that is likely to cause another unexpected EEH
error. This was observed on "p8dtu2u" machine:

   0002:00:00.0 PCI bridge: IBM Device 03dc
   0002:01:00.0 Ethernet controller: Intel Corporation \
                Ethernet Controller X710/X557-AT 10GBASE-T (rev 02)
   0002:01:00.1 Ethernet controller: Intel Corporation \
                Ethernet Controller X710/X557-AT 10GBASE-T (rev 02)
   0002:01:00.2 Ethernet controller: Intel Corporation \
                Ethernet Controller X710/X557-AT 10GBASE-T (rev 02)
   0002:01:00.3 Ethernet controller: Intel Corporation \
                Ethernet Controller X710/X557-AT 10GBASE-T (rev 02)

On P8 PowerNV platform, the IO path is frozen when shutdowning the
devices, meaning the memory registers are inaccessible. It is why
the devices can't be put into quiet state before removing them.
This fixes the issue by enabling IO path prior to putting the devices
into quiet state.
Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

6e315b2b

net: korina: Fix NAPI versus resources freeing · ea7b8081

Florian Fainelli authored Dec 23, 2016

commit e6afb1ad upstream.

Commit beb0babf ("korina: disable napi on close and restart")
introduced calls to napi_disable() that were missing before,
unfortunately this leaves a small window during which NAPI has a chance
to run, yet we just freed resources since korina_free_ring() has been
called:

Fix this by disabling NAPI first then freeing resource, and make sure
that we also cancel the restart task before doing the resource freeing.

Fixes: beb0babf ("korina: disable napi on close and restart")
Reported-by: Alexandros C. Couloumbis <alex@ozo.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ea7b8081

perf/x86/intel: Handle exclusive threadid correctly on CPU hotplug · fded17be

Zhou Chengming authored Jan 16, 2017


[ Upstream commit 4e71de79 ]

The CPU hotplug function intel_pmu_cpu_starting() sets
cpu_hw_events.excl_thread_id unconditionally to 1 when the shared exclusive
counters data structure is already availabe for the sibling thread.

This works during the boot process because the first sibling gets threadid
0 assigned and the second sibling which shares the data structure gets 1.

But when the first thread of the core is offlined and onlined again it
shares the data structure with the second thread and gets exclusive thread
id 1 assigned as well.

Prevent this by checking the threadid of the already online thread.

[ tglx: Rewrote changelog ]
Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Cc: NuoHan Qiao <qiaonuohan@huawei.com>
Cc: ak@linux.intel.com
Cc: peterz@infradead.org
Cc: kan.liang@intel.com
Cc: dave.hansen@linux.intel.com
Cc: eranian@google.com
Cc: qiaonuohan@huawei.com
Cc: davidcc@google.com
Cc: guohanjun@huawei.com
Link: http://lkml.kernel.org/r/1484536871-3131-1-git-send-email-zhouchengming1@huawei.comSigned-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

fded17be

net: phy: dp83848: add DP83620 PHY support · 3eeb3459

Alvaro G. M authored Jan 17, 2017


[ Upstream commit 93b43fd1 ]

This PHY with fiber support is register compatible with DP83848,
so add support for it.
Signed-off-by: Alvaro Gamez Machado <alvaro.gamez@hazent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

3eeb3459

drm/amdgpu: add support for new hainan variants · 10c24e89

Alex Deucher authored Jan 17, 2017


[ Upstream commit 17324b6a ]

New hainan parts require updated smc firmware.

Cc: Sonny Jiang <sonny.jiang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

10c24e89

drm/amdgpu: fix program vce instance logic error. · 9f2a36a7

Rex Zhu authored Jan 10, 2017


[ Upstream commit 50a1ebc7 ]

need to clear bit31-29 in GRBM_GFX_INDEX,
then the program can be valid.
Signed-off-by: Rex Zhu <Rex.Zhu@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

9f2a36a7

qla2xxx: Fix erroneous invalid handle message · 0c962661

Quinn Tran authored Dec 23, 2016


[ Upstream commit 4f060736 ]

Termination of Immediate Notify IOCB was using wrong
IOCB handle. IOCB completion code was unable to find
appropriate code path due to wrong handle.

Following message is seen in the logs.

"Error entry - invalid handle/queue (ffff)."
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[ bvanassche: Fixed word order in patch title ]
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

0c962661

qla2xxx: Terminate exchange if corrupted · 8cfcaa28

Quinn Tran authored Dec 23, 2016


[ Upstream commit 5f35509d ]

Corrupted ATIO is defined as length of fcp_header & fcp_cmd
payload is less than 0x38. It's the minimum size for a frame to
carry 8..16 bytes SCSI CDB. The exchange will be dropped or
terminated if corrupted.
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[ bvanassche: Fixed spelling in patch title ]
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

8cfcaa28

scsi: lpfc: Set elsiocb contexts to NULL after freeing it · 42a1d5b4

Johannes Thumshirn authored Jan 10, 2017


[ Upstream commit 8667f515 ]

Set the elsiocb contexts to NULL after freeing as others depend on it.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

42a1d5b4

stmmac: add missing of_node_put · 7782ab22

Julia Lawall authored Jan 17, 2017


[ Upstream commit a249708b ]

The function stmmac_dt_phy provides several possibilities for initializing
plat->mdio_node, all of which have the effect of increasing the reference
count of the assigned value.  This field is not updated elsewhere, so the
value is live until the end of the lifetime of plat (devm_allocated), just
after the end of stmmac_remove_config_dt.  Thus, add an of_node_put on
plat->mdio_node in stmmac_remove_config_dt.  It is possible that the field
mdio_node is never initialized, but of_node_put is NULL-safe, so it is also
safe to call of_node_put in that case.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

7782ab22

scsi: sd: Fix wrong DPOFUA disable in sd_read_cache_type · ee4494c6

Damien Le Moal authored Jan 12, 2017


[ Upstream commit 26f28197 ]

Zoned block devices force the use of READ/WRITE(16) commands by setting
sdkp->use_16_for_rw and clearing sdkp->use_10_for_rw. This result in
DPOFUA always being disabled for these drives as the assumed use of
the deprecated READ/WRITE(6) commands only looks at sdkp->use_10_for_rw.
Strenghten the test by also checking that sdkp->use_16_for_rw is false.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ee4494c6

KVM: x86: fix fixing of hypercalls · 80b1a118

Dmitry Vyukov authored Jan 17, 2017


[ Upstream commit ce2e852e ]

emulator_fix_hypercall() replaces hypercall with vmcall instruction,
but it does not handle GP exception properly when writes the new instruction.
It can return X86EMUL_PROPAGATE_FAULT without setting exception information.
This leads to incorrect emulation and triggers
WARN_ON(ctxt->exception.vector > 0x1f) in x86_emulate_insn()
as discovered by syzkaller fuzzer:

WARNING: CPU: 2 PID: 18646 at arch/x86/kvm/emulate.c:5558
Call Trace:
 warn_slowpath_null+0x2c/0x40 kernel/panic.c:582
 x86_emulate_insn+0x16a5/0x4090 arch/x86/kvm/emulate.c:5572
 x86_emulate_instruction+0x403/0x1cc0 arch/x86/kvm/x86.c:5618
 emulate_instruction arch/x86/include/asm/kvm_host.h:1127 [inline]
 handle_exception+0x594/0xfd0 arch/x86/kvm/vmx.c:5762
 vmx_handle_exit+0x2b7/0x38b0 arch/x86/kvm/vmx.c:8625
 vcpu_enter_guest arch/x86/kvm/x86.c:6888 [inline]
 vcpu_run arch/x86/kvm/x86.c:6947 [inline]

Set exception information when write in emulator_fix_hypercall() fails.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: kvm@vger.kernel.org
Cc: syzkaller@googlegroups.com
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

80b1a118

xen/blkback: don't free be structure too early · afaee3ef

Juergen Gross authored May 18, 2017

commit 71df1d7c upstream.

The be structure must not be freed when freeing the blkif structure
isn't done. Otherwise a use-after-free of be when unmapping the ring
used for communicating with the frontend will occur in case of a
late call of xenblk_disconnect() (e.g. due to an I/O still active
when trying to disconnect).
Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Steven Haigh <netwiz@crc.id.au>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

afaee3ef

ARM64: dts: meson-gxbb-odroidc2: fix GbE tx link breakage · 13fa36f9

Jerome Brunet authored Jan 20, 2017


[ Upstream commit feb3cbea ]

OdroidC2 GbE link breaks under heavy tx transfer. This happens even if the
MAC does not enable Energy Efficient Ethernet (No Low Power state Idle on
the Tx path). The problem seems to come from the phy Rx path, entering the
LPI state.

Disabling EEE advertisement on the phy prevent this feature to be
negociated with the link partner and solve the issue.
Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Signed-off-by: Kevin Hilman <khilman@baylibre.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

13fa36f9

dt: bindings: net: use boolean dt properties for eee broken modes · 8bface14

jbrunet authored Dec 19, 2016


[ Upstream commit 308d3165 ]

The patches regarding eee-broken-modes was merged before all people
involved could find an agreement on the best way to move forward.

While we agreed on having a DT property to mark particular modes as broken,
the value used for eee-broken-modes mapped the phy register in very direct
way. Because of this, the concern is that it could be used to implement
configuration policies instead of describing a broken HW.

In the end, having a boolean property for each mode seems to be preferred
over one bit field value mapping the register (too) directly.

Cc: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

8bface14

net: phy: use boolean dt properties for eee broken modes · 3897ae12

jbrunet authored Dec 19, 2016


[ Upstream commit 57f39862 ]

The patches regarding eee-broken-modes was merged before all people
involved could find an agreement on the best way to move forward.

While we agreed on having a DT property to mark particular modes as broken,
the value used for eee-broken-modes mapped the phy register in very direct
way. Because of this, the concern is that it could be used to implement
configuration policies instead of describing a broken HW.

In the end, having a boolean property for each mode seems to be preferred
over one bit field value mapping the register (too) directly.

Cc: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

3897ae12

net: phy: fix sign type error in genphy_config_eee_advert · 40373d91

jbrunet authored Dec 19, 2016


[ Upstream commit 3bb9ab63 ]

In genphy_config_eee_advert, the return value of phy_read_mmd_indirect is
checked to know if the register could be accessed but the result is
assigned to a 'u32'.
Changing to 'int' to correctly get errors from phy_read_mmd_indirect.

Fixes: d853d145 ("net: phy: add an option to disable EEE advertisement")
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

40373d91

dt-bindings: net: add EEE capability constants · 752ba680

jbrunet authored Nov 28, 2016


[ Upstream commit 1fc31357 ]
Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Tested-by: Yegor Yefremov <yegorslists@googlemail.com>
Tested-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

752ba680

net: phy: add an option to disable EEE advertisement · 97ace183

jbrunet authored Nov 28, 2016


[ Upstream commit d853d145 ]

This patch adds an option to disable EEE advertisement in the generic PHY
by providing a mask of prohibited modes corresponding to the value found in
the MDIO_AN_EEE_ADV register.

On some platforms, PHY Low power idle seems to be causing issues, even
breaking the link some cases. The patch provides a convenient way for these
platforms to disable EEE advertisement and work around the issue.
Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Tested-by: Yegor Yefremov <yegorslists@googlemail.com>
Tested-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

97ace183

net: ethtool: add support for 2500BaseT and 5000BaseT link modes · 0e8eca98

Pavel Belous authored Jan 28, 2017


[ Upstream commit 94842b4f ]

This patch introduce support for 2500BaseT and 5000BaseT link modes.
These modes are included in the new IEEE 802.3bz standard.
Signed-off-by: Pavel Belous <pavel.s.belous@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

0e8eca98

sparc64: Zero pages on allocation for mondo and error queues. · 8886196a

Liam R. Howlett authored May 23, 2017


[ Upstream commit 7a7dc961 ]

Error queues use a non-zero first word to detect if the queues are full.
Using pages that have not been zeroed may result in false positive
overflow events.  These queues are set up once during boot so zeroing
all mondo and error queue pages is safe.

Note that the false positive overflow does not always occur because the
page allocation for these queues is so early in the boot cycle that
higher number CPUs get fresh pages.  It is only when traps are serviced
with lower number CPUs who were given already used pages that this issue
is exposed.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

8886196a

sparc64: Handle PIO & MEM non-resumable errors. · 41172b77

Liam R. Howlett authored May 23, 2017


[ Upstream commit 04748724 ]

User processes trying to access an invalid memory address via PIO will
receive a SIGBUS signal instead of causing a panic.  Memory errors will
receive a SIGKILL since a SIGBUS may result in a coredump which may
attempt to repeat the faulting access.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

41172b77

mm: numa: avoid waiting on freed migrated pages · 2aa6d036

Mark Rutland authored Jun 16, 2017

commit 3c226c63 upstream.

In do_huge_pmd_numa_page(), we attempt to handle a migrating thp pmd by
waiting until the pmd is unlocked before we return and retry.  However,
we can race with migrate_misplaced_transhuge_page():

    // do_huge_pmd_numa_page                // migrate_misplaced_transhuge_page()
    // Holds 0 refs on page                 // Holds 2 refs on page

    vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
    /* ... */
    if (pmd_trans_migrating(*vmf->pmd)) {
            page = pmd_page(*vmf->pmd);
            spin_unlock(vmf->ptl);
                                            ptl = pmd_lock(mm, pmd);
                                            if (page_count(page) != 2)) {
                                                    /* roll back */
                                            }
                                            /* ... */
                                            mlock_migrate_page(new_page, page);
                                            /* ... */
                                            spin_unlock(ptl);
                                            put_page(page);
                                            put_page(page); // page freed here
            wait_on_page_locked(page);
            goto out;
    }

This can result in the freed page having its waiters flag set
unexpectedly, which trips the PAGE_FLAGS_CHECK_AT_PREP checks in the
page alloc/free functions.  This has been observed on arm64 KVM guests.

We can avoid this by having do_huge_pmd_numa_page() take a reference on
the page before dropping the pmd lock, mirroring what we do in
__migration_entry_wait().

When we hit the race, migrate_misplaced_transhuge_page() will see the
reference and abort the migration, as it may do today in other cases.

Fixes: b8916634 ("mm: Prevent parallel splits during THP migration")
Link: http://lkml.kernel.org/r/1497349722-6731-2-git-send-email-will.deacon@arm.comSigned-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

2aa6d036

l2tp: take a reference on sessions used in genetlink handlers · 08cb8e5f

Guillaume Nault authored Mar 31, 2017

commit 2777e2ab upstream.

Callers of l2tp_nl_session_find() need to hold a reference on the
returned session since there's no guarantee that it isn't going to
disappear from under them.

Relying on the fact that no l2tp netlink message may be processed
concurrently isn't enough: sessions can be deleted by other means
(e.g. by closing the PPPOL2TP socket of a ppp pseudowire).

l2tp_nl_cmd_session_delete() is a bit special: it runs a callback
function that may require a previous call to session->ref(). In
particular, for ppp pseudowires, the callback is l2tp_session_delete(),
which then calls pppol2tp_session_close() and dereferences the PPPOL2TP
socket. The socket might already be gone at the moment
l2tp_session_delete() calls session->ref(), so we need to take a
reference during the session lookup. So we need to pass the do_ref
variable down to l2tp_session_get() and l2tp_session_get_by_ifname().

Since all callers have to be updated, l2tp_session_find_by_ifname() and
l2tp_nl_session_find() are renamed to reflect their new behaviour.

Fixes: 309795f4 ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

08cb8e5f

l2tp: hold session while sending creation notifications · 599e6f03

Guillaume Nault authored Mar 31, 2017

commit 5e6a9e5a upstream.

l2tp_session_find() doesn't take any reference on the returned session.
Therefore, the session may disappear while sending the notification.

Use l2tp_session_get() instead and decrement session's refcount once
the notification is sent.

Fixes: 33f72e6f ("l2tp : multicast notification to the registered listeners")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

599e6f03

l2tp: fix duplicate session creation · d9face6f

Guillaume Nault authored Mar 31, 2017

commit dbdbc73b upstream.

l2tp_session_create() relies on its caller for checking for duplicate
sessions. This is racy since a session can be concurrently inserted
after the caller's verification.

Fix this by letting l2tp_session_create() verify sessions uniqueness
upon insertion. Callers need to be adapted to check for
l2tp_session_create()'s return code instead of calling
l2tp_session_find().

pppol2tp_connect() is a bit special because it has to work on existing
sessions (if they're not connected) or to create a new session if none
is found. When acting on a preexisting session, a reference must be
held or it could go away on us. So we have to use l2tp_session_get()
instead of l2tp_session_find() and drop the reference before exiting.

Fixes: d9e31d17 ("l2tp: Add L2TP ethernet pseudowire support")
Fixes: fd558d18 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

d9face6f

l2tp: ensure session can't get removed during pppol2tp_session_ioctl() · 806e9883

Guillaume Nault authored Mar 31, 2017

commit 57377d63 upstream.

Holding a reference on session is required before calling
pppol2tp_session_ioctl(). The session could get freed while processing the
ioctl otherwise. Since pppol2tp_session_ioctl() uses the session's socket,
we also need to take a reference on it in l2tp_session_get().

Fixes: fd558d18 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

806e9883

l2tp: fix race in l2tp_recv_common() · 6539c4f9

Guillaume Nault authored Mar 31, 2017

commit 61b9a047 upstream.

Taking a reference on sessions in l2tp_recv_common() is racy; this
has to be done by the callers.

To this end, a new function is required (l2tp_session_get()) to
atomically lookup a session and take a reference on it. Callers then
have to manually drop this reference.

Fixes: fd558d18 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

6539c4f9

usb: gadget: f_fs: Fix possibe deadlock · d2da8d39

Baolin Wang authored Dec 08, 2016

commit b3ce3ce0 upstream.

When system try to close /dev/usb-ffs/adb/ep0 on one core, at the same
time another core try to attach new UDC, which will cause deadlock as
below scenario. Thus we should release ffs lock before issuing
unregister_gadget_item().

[   52.642225] c1 ======================================================
[   52.642228] c1 [ INFO: possible circular locking dependency detected ]
[   52.642236] c1 4.4.6+ #1 Tainted: G        W  O
[   52.642241] c1 -------------------------------------------------------
[   52.642245] c1 usb ffs open/2808 is trying to acquire lock:
[   52.642270] c0  (udc_lock){+.+.+.}, at: [<ffffffc00065aeec>]
		usb_gadget_unregister_driver+0x3c/0xc8
[   52.642272] c1  but task is already holding lock:
[   52.642283] c0  (ffs_lock){+.+.+.}, at: [<ffffffc00066b244>]
		ffs_data_clear+0x30/0x140
[   52.642285] c1 which lock already depends on the new lock.
[   52.642287] c1
               the existing dependency chain (in reverse order) is:
[   52.642295] c0
	       -> #1 (ffs_lock){+.+.+.}:
[   52.642307] c0        [<ffffffc00012340c>] __lock_acquire+0x20f0/0x2238
[   52.642314] c0        [<ffffffc000123b54>] lock_acquire+0xe4/0x298
[   52.642322] c0        [<ffffffc000aaf6e8>] mutex_lock_nested+0x7c/0x3cc
[   52.642328] c0        [<ffffffc00066f7bc>] ffs_func_bind+0x504/0x6e8
[   52.642334] c0        [<ffffffc000654004>] usb_add_function+0x84/0x184
[   52.642340] c0        [<ffffffc000658ca4>] configfs_composite_bind+0x264/0x39c
[   52.642346] c0        [<ffffffc00065b348>] udc_bind_to_driver+0x58/0x11c
[   52.642352] c0        [<ffffffc00065b49c>] usb_udc_attach_driver+0x90/0xc8
[   52.642358] c0        [<ffffffc0006598e0>] gadget_dev_desc_UDC_store+0xd4/0x128
[   52.642369] c0        [<ffffffc0002c14e8>] configfs_write_file+0xd0/0x13c
[   52.642376] c0        [<ffffffc00023c054>] vfs_write+0xb8/0x214
[   52.642381] c0        [<ffffffc00023cad4>] SyS_write+0x54/0xb0
[   52.642388] c0        [<ffffffc000085ff0>] el0_svc_naked+0x24/0x28
[   52.642395] c0
              -> #0 (udc_lock){+.+.+.}:
[   52.642401] c0        [<ffffffc00011e3d0>] print_circular_bug+0x84/0x2e4
[   52.642407] c0        [<ffffffc000123454>] __lock_acquire+0x2138/0x2238
[   52.642412] c0        [<ffffffc000123b54>] lock_acquire+0xe4/0x298
[   52.642420] c0        [<ffffffc000aaf6e8>] mutex_lock_nested+0x7c/0x3cc
[   52.642427] c0        [<ffffffc00065aeec>] usb_gadget_unregister_driver+0x3c/0xc8
[   52.642432] c0        [<ffffffc00065995c>] unregister_gadget_item+0x28/0x44
[   52.642439] c0        [<ffffffc00066b34c>] ffs_data_clear+0x138/0x140
[   52.642444] c0        [<ffffffc00066b374>] ffs_data_reset+0x20/0x6c
[   52.642450] c0        [<ffffffc00066efd0>] ffs_data_closed+0xac/0x12c
[   52.642454] c0        [<ffffffc00066f070>] ffs_ep0_release+0x20/0x2c
[   52.642460] c0        [<ffffffc00023dbe4>] __fput+0xb0/0x1f4
[   52.642466] c0        [<ffffffc00023dd9c>] ____fput+0x20/0x2c
[   52.642473] c0        [<ffffffc0000ee944>] task_work_run+0xb4/0xe8
[   52.642482] c0        [<ffffffc0000cd45c>] do_exit+0x360/0xb9c
[   52.642487] c0        [<ffffffc0000cf228>] do_group_exit+0x4c/0xb0
[   52.642494] c0        [<ffffffc0000dd3c8>] get_signal+0x380/0x89c
[   52.642501] c0        [<ffffffc00008a8f0>] do_signal+0x154/0x518
[   52.642507] c0        [<ffffffc00008af00>] do_notify_resume+0x70/0x78
[   52.642512] c0        [<ffffffc000085ee8>] work_pending+0x1c/0x20
[   52.642514] c1
              other info that might help us debug this:
[   52.642517] c1  Possible unsafe locking scenario:
[   52.642518] c1        CPU0                    CPU1
[   52.642520] c1        ----                    ----
[   52.642525] c0   lock(ffs_lock);
[   52.642529] c0                                lock(udc_lock);
[   52.642533] c0                                lock(ffs_lock);
[   52.642537] c0   lock(udc_lock);
[   52.642539] c1
                      *** DEADLOCK ***
[   52.642543] c1 1 lock held by usb ffs open/2808:
[   52.642555] c0  #0:  (ffs_lock){+.+.+.}, at: [<ffffffc00066b244>]
		ffs_data_clear+0x30/0x140
[   52.642557] c1 stack backtrace:
[   52.642563] c1 CPU: 1 PID: 2808 Comm: usb ffs open Tainted: G
[   52.642565] c1 Hardware name: Spreadtrum SP9860g Board (DT)
[   52.642568] c1 Call trace:
[   52.642573] c1 [<ffffffc00008b430>] dump_backtrace+0x0/0x170
[   52.642577] c1 [<ffffffc00008b5c0>] show_stack+0x20/0x28
[   52.642583] c1 [<ffffffc000422694>] dump_stack+0xa8/0xe0
[   52.642587] c1 [<ffffffc00011e548>] print_circular_bug+0x1fc/0x2e4
[   52.642591] c1 [<ffffffc000123454>] __lock_acquire+0x2138/0x2238
[   52.642595] c1 [<ffffffc000123b54>] lock_acquire+0xe4/0x298
[   52.642599] c1 [<ffffffc000aaf6e8>] mutex_lock_nested+0x7c/0x3cc
[   52.642604] c1 [<ffffffc00065aeec>] usb_gadget_unregister_driver+0x3c/0xc8
[   52.642608] c1 [<ffffffc00065995c>] unregister_gadget_item+0x28/0x44
[   52.642613] c1 [<ffffffc00066b34c>] ffs_data_clear+0x138/0x140
[   52.642618] c1 [<ffffffc00066b374>] ffs_data_reset+0x20/0x6c
[   52.642621] c1 [<ffffffc00066efd0>] ffs_data_closed+0xac/0x12c
[   52.642625] c1 [<ffffffc00066f070>] ffs_ep0_release+0x20/0x2c
[   52.642629] c1 [<ffffffc00023dbe4>] __fput+0xb0/0x1f4
[   52.642633] c1 [<ffffffc00023dd9c>] ____fput+0x20/0x2c
[   52.642636] c1 [<ffffffc0000ee944>] task_work_run+0xb4/0xe8
[   52.642640] c1 [<ffffffc0000cd45c>] do_exit+0x360/0xb9c
[   52.642644] c1 [<ffffffc0000cf228>] do_group_exit+0x4c/0xb0
[   52.642647] c1 [<ffffffc0000dd3c8>] get_signal+0x380/0x89c
[   52.642651] c1 [<ffffffc00008a8f0>] do_signal+0x154/0x518
[   52.642656] c1 [<ffffffc00008af00>] do_notify_resume+0x70/0x78
[   52.642659] c1 [<ffffffc000085ee8>] work_pending+0x1c/0x20
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
Cc: Jerry Zhang <zhangjerry@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

d2da8d39

x86/mm: Fix boot crash caused by incorrect loop count calculation in sync_global_pgds() · ed96148d

Baoquan He authored May 04, 2017

commit fc5f9d5f upstream.

Jeff Moyer reported that on his system with two memory regions 0~64G and
1T~1T+192G, and kernel option "memmap=192G!1024G" added, enabling KASLR
will make the system hang intermittently during boot. While adding 'nokaslr'
won't.

The back trace is:

 Oops: 0000 [#1] SMP

 RIP: memcpy_erms()
 [ .... ]
 Call Trace:
  pmem_rw_page()
  bdev_read_page()
  do_mpage_readpage()
  mpage_readpages()
  blkdev_readpages()
  __do_page_cache_readahead()
  force_page_cache_readahead()
  page_cache_sync_readahead()
  generic_file_read_iter()
  blkdev_read_iter()
  __vfs_read()
  vfs_read()
  SyS_read()
  entry_SYSCALL_64_fastpath()

This crash happens because the for loop count calculation in sync_global_pgds()
is not correct. When a mapping area crosses PGD entries, we should
calculate the starting address of region which next PGD covers and assign
it to next for loop count, but not add PGDIR_SIZE directly. The old
code works right only if the mapping area is an exact multiple of PGDIR_SIZE,
otherwize the end region could be skipped so that it can't be synchronized
to all other processes from kernel PGD init_mm.pgd.

In Jeff's system, emulated pmem area [1024G, 1216G) is smaller than
PGDIR_SIZE. While 'nokaslr' works because PAGE_OFFSET is 1T aligned, it
makes this area be mapped inside one PGD entry. With KASLR enabled,
this area could cross two PGD entries, then the next PGD entry won't
be synced to all other processes. That is why we saw empty PGD.

Fix it.
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jinbum Park <jinb.park7@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1493864747-8506-1-git-send-email-bhe@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ed96148d

dm thin: do not queue freed thin mapping for next stage processing · 1c0fa383

Vallish Vaidyeshwara authored Jun 23, 2017

commit 00a0ea33 upstream.

process_prepared_discard_passdown_pt1() should cleanup
dm_thin_new_mapping in cases of error.

dm_pool_inc_data_range() can fail trying to get a block reference:

metadata operation 'dm_pool_inc_data_range' failed: error = -61

When dm_pool_inc_data_range() fails, dm thin aborts current metadata
transaction and marks pool as PM_READ_ONLY. Memory for thin mapping
is released as well. However, current thin mapping will be queued
onto next stage as part of queue_passdown_pt2() or passdown_endio().
This dangling thin mapping memory when processed and accessed in
next stage will lead to device mapper crashing.

Code flow without fix:
-> process_prepared_discard_passdown_pt1(m)
   -> dm_thin_remove_range()
   -> discard passdown
      --> passdown_endio(m) queues m onto next stage
   -> dm_pool_inc_data_range() fails, frees memory m
            but does not remove it from next stage queue

-> process_prepared_discard_passdown_pt2(m)
   -> processes freed memory m and crashes

One such stack:

Call Trace:
[<ffffffffa037a46f>] dm_cell_release_no_holder+0x2f/0x70 [dm_bio_prison]
[<ffffffffa039b6dc>] cell_defer_no_holder+0x3c/0x80 [dm_thin_pool]
[<ffffffffa039b88b>] process_prepared_discard_passdown_pt2+0x4b/0x90 [dm_thin_pool]
[<ffffffffa0399611>] process_prepared+0x81/0xa0 [dm_thin_pool]
[<ffffffffa039e735>] do_worker+0xc5/0x820 [dm_thin_pool]
[<ffffffff8152bf54>] ? __schedule+0x244/0x680
[<ffffffff81087e72>] ? pwq_activate_delayed_work+0x42/0xb0
[<ffffffff81089f53>] process_one_work+0x153/0x3f0
[<ffffffff8108a71b>] worker_thread+0x12b/0x4b0
[<ffffffff8108a5f0>] ? rescuer_thread+0x350/0x350
[<ffffffff8108fd6a>] kthread+0xca/0xe0
[<ffffffff8108fca0>] ? kthread_park+0x60/0x60
[<ffffffff81530b45>] ret_from_fork+0x25/0x30

The fix is to first take the block ref count for discarded block and
then do a passdown discard of this block. If block ref count fails,
then bail out aborting current metadata transaction, mark pool as
PM_READ_ONLY and also free current thin mapping memory (existing error
handling code) without queueing this thin mapping onto next stage of
processing. If block ref count succeeds, then passdown discard of this
block. Discard callback of passdown_endio() will queue this thin mapping
onto next stage of processing.

Code flow with fix:
-> process_prepared_discard_passdown_pt1(m)
   -> dm_thin_remove_range()
   -> dm_pool_inc_data_range()
      --> if fails, free memory m and bail out
   -> discard passdown
      --> passdown_endio(m) queues m onto next stage
Reviewed-by: Eduardo Valentin <eduval@amazon.com>
Reviewed-by: Cristian Gafton <gafton@amazon.com>
Reviewed-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Vallish Vaidyeshwara <vallish@amazon.com>
Reviewed-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

1c0fa383