- 02 Aug, 2016 2 commits
-
-
Jason Wang authored
This patch tries to implement an device IOTLB for vhost. This could be used with userspace(qemu) implementation of DMA remapping to emulate an IOMMU for the guest. The idea is simple, cache the translation in a software device IOTLB (which is implemented as an interval tree) in vhost and use vhost_net file descriptor for reporting IOTLB miss and IOTLB update/invalidation. When vhost meets an IOTLB miss, the fault address, size and access can be read from the file. After userspace finishes the translation, it writes the translated address to the vhost_net file to update the device IOTLB. When device IOTLB is enabled by setting VIRTIO_F_IOMMU_PLATFORM all vq addresses set by ioctl are treated as iova instead of virtual address and the accessing can only be done through IOTLB instead of direct userspace memory access. Before each round or vq processing, all vq metadata is prefetched in device IOTLB to make sure no translation fault happens during vq processing. In most cases, virtqueues are contiguous even in virtual address space. The IOTLB translation for virtqueue itself may make it a little slower. We might add fast path cache on top of this patch. Signed-off-by: Jason Wang <jasowang@redhat.com> [mst: use virtio feature bit: VHOST_F_DEVICE_IOTLB -> VIRTIO_F_IOMMU_PLATFORM ] [mst: fix build warnings ] Signed-off-by: Michael S. Tsirkin <mst@redhat.com> [ weiyj.lk: missing unlock on error ] Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
-
Michael S. Tsirkin authored
vringh isn't used by vhost net or scsi - it's used by CAIF only at the moment. Drop the dependency. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-
- 01 Aug, 2016 14 commits
-
-
Jason Wang authored
Current pre-sorted memory region array has some limitations for future device IOTLB conversion: 1) need extra work for adding and removing a single region, and it's expected to be slow because of sorting or memory re-allocation. 2) need extra work of removing a large range which may intersect several regions with different size. 3) need trick for a replacement policy like LRU To overcome the above shortcomings, this patch convert it to interval tree which can easily address the above issue with almost no extra work. The patch could be used for: - Extend the current API and only let the userspace to send diffs of memory table. - Simplify Device IOTLB implementation. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-
Jason Wang authored
This patch introduces vhost memory accessors which were just wrappers for userspace address access helpers. This is a requirement for vhost device iotlb implementation which will add iotlb translations in those accessors. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-
Asias He authored
Enable virtio-vsock and vhost-vsock. Signed-off-by: Asias He <asias@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-
Asias He authored
VM sockets vhost transport implementation. This driver runs on the host. Signed-off-by: Asias He <asias@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-
Asias He authored
VM sockets virtio transport implementation. This driver runs in the guest. Signed-off-by: Asias He <asias@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-
Asias He authored
This module contains the common code and header files for the following virtio_transporto and vhost_vsock kernel modules. Signed-off-by: Asias He <asias@redhat.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-
Stefan Hajnoczi authored
The virtio transport will implement graceful shutdown and the related SO_LINGER socket option. This requires orphaning the sock but keeping it in the table of connections after .release(). This patch adds the vsock_remove_sock() function and leaves it up to the transport when to remove the sock. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-
Stefan Hajnoczi authored
struct vsock_transport contains function pointers called by AF_VSOCK core code. The transport may want its own transport-specific function pointers and they can be added after struct vsock_transport. Allow the transport to fetch vsock_transport. It can downcast it to access transport-specific function pointers. The virtio transport will use this. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-
Michael S. Tsirkin authored
vringh isn't used by vhost net or scsi - it's used by CAIF only at the moment. Drop the dependency. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-
Michael S. Tsirkin authored
VOP selects VHOST_RING. Pull in Kconfig that includes it to make it self-containing. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-
Michael S. Tsirkin authored
The interaction between virtio and IOMMUs is messy. On most systems with virtio, physical addresses match bus addresses, and it doesn't particularly matter which one we use to program the device. On some systems, including Xen and any system with a physical device that speaks virtio behind a physical IOMMU, we must program the IOMMU for virtio DMA to work at all. On other systems, including SPARC and PPC64, virtio-pci devices are enumerated as though they are behind an IOMMU, but the virtio host ignores the IOMMU, so we must either pretend that the IOMMU isn't there or somehow map everything as the identity. Add a feature bit to detect that quirk: VIRTIO_F_IOMMU_PLATFORM. Any device with this feature bit set to 0 needs a quirk and has to be passed physical addresses (as opposed to bus addresses) even though the device is behind an IOMMU. Note: it has to be a per-device quirk because for example, there could be a mix of passed-through and virtual virtio devices. As another example, some devices could be implemented by an out of process hypervisor backend (in case of qemu vhost, or vhost-user) and so support for an IOMMU needs to be coded up separately. It would be cleanest to handle this in IOMMU core code, but that needs per-device DMA ops. While we are waiting for that to be implemented, use a work-around in virtio core. Note: a "noiommu" feature is a quirk - add a wrapper to make that clear. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-
Konstantin Neumoin authored
The balloon has a special mechanism that is subscribed to the oom notification which leads to deflation for a fixed number of pages. The number is always fixed even when the balloon is fully deflated. But leak_balloon did not expect that the pages to deflate will be more than taken, and raise a "BUG" in balloon_page_dequeue when page list will be empty. So, the simplest solution would be to check that the number of releases pages is less or equal to the number taken pages. Cc: stable@vger.kernel.org Signed-off-by: Konstantin Neumoin <kneumoin@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-
Jason Wang authored
We use spinlock to synchronize the work list now which may cause unnecessary contentions. So this patch switch to use llist to remove this contention. Pktgen tests shows about 5% improvement: Before: ~1300000 pps After: ~1370000 pps Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-
Jason Wang authored
We used to implement the work flushing through tracking queued seq, done seq, and the number of flushing. This patch simplify this by just implement work flushing through another kind of vhost work with completion. This will be used by lockless enqueuing patch. Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-
- 24 Jul, 2016 2 commits
-
-
Linus Torvalds authored
-
git://github.com/ceph/ceph-clientLinus Torvalds authored
Pull ceph fix from Ilya Dryomov: "A fix for a long-standing bug in the incremental osdmap handling code that caused misdirected requests, tagged for stable" The tag is signed with a brand new key - Sage is on vacation and I didn't anticipate this" * tag 'ceph-for-4.7-rc8' of git://github.com/ceph/ceph-client: libceph: apply new_state before new_up_client on incrementals
-
- 23 Jul, 2016 22 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds authored
Pull networking fixes from David Miller: 1) Fix memory leak in nftables, from Liping Zhang. 2) Need to check result of vlan_insert_tag() in batman-adv otherwise we risk NULL skb derefs, from Sven Eckelmann. 3) Check for dev_alloc_skb() failures in cfg80211, from Gregory Greenman. 4) Handle properly when we have ppp_unregister_channel() happening in parallel with ppp_connect_channel(), from WANG Cong. 5) Fix DCCP deadlock, from Eric Dumazet. 6) Bail out properly in UDP if sk_filter() truncates the packet to be smaller than even the space that the protocol headers need. From Michal Kubecek. 7) Similarly for rose, dccp, and sctp, from Willem de Bruijn. 8) Make TCP challenge ACKs less predictable, from Eric Dumazet. 9) Fix infinite loop in bgmac_dma_tx_add() from Florian Fainelli. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits) packet: propagate sock_cmsg_send() error net/mlx5e: Fix del vxlan port command buffer memset packet: fix second argument of sock_tx_timestamp() net: switchdev: change ageing_time type to clock_t Update maintainer for EHEA driver. net/mlx4_en: Add resilience in low memory systems net/mlx4_en: Move filters cleanup to a proper location sctp: load transport header after sk_filter net/sched/sch_htb: clamp xstats tokens to fit into 32-bit int net: cavium: liquidio: Avoid dma_unmap_single on uninitialized ndata net: nb8800: Fix SKB leak in nb8800_receive() et131x: Fix logical vs bitwise check in et131x_tx_timeout() vlan: use a valid default mtu value for vlan over macsec net: bgmac: Fix infinite loop in bgmac_dma_tx_add() mlxsw: spectrum: Prevent invalid ingress buffer mapping mlxsw: spectrum: Prevent overwrite of DCB capability fields mlxsw: spectrum: Don't emit errors when PFC is disabled mlxsw: spectrum: Indicate support for autonegotiation mlxsw: spectrum: Force link training according to admin state r8152: add MODULE_VERSION ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfsLinus Torvalds authored
Pull overlayfs fixes from Miklos Szeredi: "This contains a fix for a potential crash/corruption issue and another where the suid/sgid bits weren't cleared on write" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: verify upper dentry in ovl_remove_and_whiteout() ovl: Copy up underlying inode's ->i_mode to overlay inode ovl: handle ATTR_KILL*
-
Linus Torvalds authored
Merge misc fixes from Andrew Morton: "Five fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: pps: do not crash when failed to register tools/vm/slabinfo: fix an unintentional printf testing/radix-tree: fix a macro expansion bug radix-tree: fix radix_tree_iter_retry() for tagged iterators. mm: memcontrol: fix cgroup creation failure after many small jobs
-
git://people.freedesktop.org/~airlied/linuxLinus Torvalds authored
Pull intel kabylake drm fixes from Dave Airlie: "As mentioned Intel has gathered all the Kabylake fixes from -next, which we've enabled in 4.7 for the first time, these are pretty much limited in scope to only affects kabylake, which is hw that isn't shipping yet. So I'm mostly okay with it going in now. If we don't land this, it might be a good idea to disable kabylake support in 4.7 before we ship" * tag 'drm-fixes-for-v4.7-rc8-intel-kbl' of git://people.freedesktop.org/~airlied/linux: (28 commits) drm/i915/kbl: Introduce the first official DMC for Kabylake. drm/i915: Introduce Kabypoint PCH for Kabylake H/DT. drm/i915/gen9: implement WaConextSwitchWithConcurrentTLBInvalidate drm/i915/gen9: Add WaFbcHighMemBwCorruptionAvoidance drm/i195/fbc: Add WaFbcNukeOnHostModify drm/i915/gen9: Add WaFbcWakeMemOn drm/i915/gen9: Add WaFbcTurnOffFbcWatermark drm/i915/kbl: Add WaClearSlmSpaceAtContextSwitch drm/i915/gen9: Add WaEnableChickenDCPR drm/i915/kbl: Add WaDisableSbeCacheDispatchPortSharing drm/i915/kbl: Add WaDisableGafsUnitClkGating drm/i915/kbl: Add WaForGAMHang drm/i915: Add WaInsertDummyPushConstP for bxt and kbl drm/i915/kbl: Add WaDisableDynamicCreditSharing drm/i915/kbl: Add WaDisableGamClockGating drm/i915/gen9: Enable must set chicken bits in config0 reg drm/i915/kbl: Add WaDisableLSQCROPERFforOCL drm/i915/kbl: Add WaDisableSDEUnitClockGating drm/i915/kbl: Add WaDisableFenceDestinationToSLM for A0 drm/i915/kbl: Add WaEnableGapsTsvCreditFix ...
-
git://people.freedesktop.org/~airlied/linuxLinus Torvalds authored
Pull drm fixes from Dave Airlie: "Two i915 regression fixes. Intel have submitted some Kabylake fixes I'll send separately, since this is the first kernel with kabylake support and they don't go much outside that area I think they should be fine" * tag 'drm-fixes-for-v4.7-rc8-intel' of git://people.freedesktop.org/~airlied/linux: drm/i915: add missing condition for committing planes on crtc drm/i915: Treat eDP as always connected, again
-
git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68kLinus Torvalds authored
Pull m68k upddates from Geert Uytterhoeven: - assorted spelling fixes - defconfig updates * tag 'm68k-for-v4.8-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k: m68k/defconfig: Update defconfigs for v4.7-rc2 m68k: Assorted spelling fixes
-
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-socLinus Torvalds authored
Pull ARM SoC fixes from Olof Johansson: "A handful of fixes before final release: Marvell Armada: - One to fix a typo in the devicetree specifying memory ranges for the crypto engine - Two to deal with marking PCI and device-memory as strongly ordered to avoid hardware deadlocks, in particular when enabling above crypto driver. - Compile fix for PM Allwinner: - DT clock fixes to deal with u-boot-enabled framebuffer (simplefb). - Make R8 (C.H.I.P. SoC) inherit system compatibility from A13 to make clocks register proper. Tegra: - Fix SD card voltage setting on the Tegra3 Beaver dev board Misc: - Two maintainers updates for STM32 and STi platforms" * tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: ARM: tegra: beaver: Allow SD card voltage to be changed MAINTAINERS: update STi maintainer list MAINTAINERS: update STM32 maintainers list ARM: mvebu: compile pm code conditionally ARM: dts: sun7i: Fix pll3x2 and pll7x2 not having a parent clock ARM: dts: sunxi: Add pll3 to simplefb nodes clocks lists ARM: dts: armada-38x: fix MBUS_ID for crypto SRAM on Armada 385 Linksys ARM: mvebu: map PCI I/O regions strongly ordered ARM: mvebu: fix HW I/O coherency related deadlocks ARM: sunxi/dt: make the CHIP inherit from allwinner,sun5i-a13
-
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds authored
Pull crypto fixes from Herbert Xu: "This fixes a sporadic build failure in the qat driver as well as a memory corruption bug in rsa-pkcs1pad" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: rsa-pkcs1pad - fix rsa-pkcs1pad request struct crypto: qat - make qat_asym_algs.o depend on asn1 headers
-
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-securityLinus Torvalds authored
Pull key handling fixes from James Morris: "Quoting David Howells: Here are three miscellaneous fixes: (1) Fix a panic in some debugging code in PKCS#7. This can only happen by explicitly inserting a #define DEBUG into the code. (2) Fix the calculation of the digest length in the PE file parser. This causes a failure where there should be a success. (3) Fix the case where an X.509 cert can be added as an asymmetric key to a trusted keyring with no trust restriction if no AKID is supplied. Bugs (1) and (2) aren't particularly problematic, but (3) allows a security check to be bypassed. Happily, this is a recent regression and never made it into a released kernel" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: KEYS: Fix for erroneous trust of incorrectly signed X.509 certs pefile: Fix the failure of calculation for digest PKCS#7: Fix panic when referring to the empty AKID when DEBUG defined
-
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/inputLinus Torvalds authored
Pull input fixes from Dmitry Torokhov: "A few more fixes for the input subsystem: - restore naming for tsc2005 touchscreens as some userspace match on it - fix out of bound access in legacy keyboard driver - fixup in RMI4 driver Everything is tagged for stable as well" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: tsc200x - report proper input_dev name tty/vt/keyboard: fix OOB access in do_compute_shiftstate() Input: synaptics-rmi4 - fix maximum size check for F12 control register 8
-
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimmLinus Torvalds authored
Pull libnvdimm fix from Dan Williams: "This contains a regression fix for a problem that was introduced in v4.7-rc6. In 4.7-rc1 we introduced auto-probing for the ACPI DSM (device- specific-method) format that the platform firmware implements for nvdimm devices. We initially fixed a regression in probing the QEMU DSM implementation by making acpi_check_dsm() tolerant of the way QEMU reports the "0 DSMs supported" condition. However, that broke HPE platforms since that tolerance caused the driver to mistakenly match the 1-zero-byte response those platforms give to "unknown" commands. Instead, we simply make the driver tolerant of not finding any supported DSMs. This has been tested to work with both QEMU and HPE platforms. This commit has appeared in a -next release with no reported issues" * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: nfit: make DIMM DSMs optional
-
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpioLinus Torvalds authored
Pull GPIO fix from Linus Walleij: "Compile problem fix for Tegra, Sorry to send this in the last minute but Ingo says this build failure is very prominent so I'm not going to wait for v4.7 before sending it. It is a case of COMPILE_TEST causing more problems than it solves and I'm already swearing about me shooting myself in the foot with that gun :(" * tag 'gpio-v4.7-6' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: gpio: tegra: don't auto-enable for COMPILE_TEST
-
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linuxLinus Torvalds authored
Pull clk fixes from Michael Turquette: "Fix a bug in the at91 clk driver, two compile time warnings in sunxi clk drivers, and one bug in a sunxi clk driver introduced in the 4.7 merge window" * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: clk: at91: fix clk_programmable_set_parent() clk: sunxi: remove unused variable clk: sunxi: display: Add per-clock flags clk: sunxi: tcon-ch1: Do not return a negative error in get_parent
-
git://git.kernel.org/pub/scm/linux/kernel/git/tj/libataLinus Torvalds authored
Pull libata fix from Tejun Heo: "Another fallout from max_sectors bump a couple years ago. The lite-on optical drive times out on large requests" * 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: libata: LITE-ON CX1-JB256-HP needs lower max_sectors
-
git://git.linaro.org/people/ulf.hansson/mmcLinus Torvalds authored
Pull MMC fixes from Ulf Hansson: "Here are a few late mmc fixes intended for v4.7 final. MMC core: - Fix eMMC packed command header endianness - Fix free of uninitialized buffer for mmc ioctl MMC host: - pxamci: Fix potential oops in ->probe()" * tag 'mmc-v4.7-rc7' of git://git.linaro.org/people/ulf.hansson/mmc: mmc: pxamci: fix potential oops mmc: block: fix packed command header endianness mmc: block: fix free of uninitialized 'idata->buf'
-
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/soundLinus Torvalds authored
Pull sound fixes from Takashi Iwai: "No surprise, just a few small fixes: a couple of changes are seen in the core part, and both of them are rather for unusual error paths. The rest are the regular HD-audio fixes and one USB-audio regression fix" * tag 'sound-4.7-fix2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ALSA: usb-audio: Fix quirks code is not called ALSA: hda: add AMD Stoney PCI ID with proper driver caps ALSA: hda - fix use-after-free after module unload ALSA: pcm: Free chmap at PCM free callback, too ALSA: ctl: Stop notification after disconnection ALSA: hda/realtek - add new pin definition in alc225 pin quirk table
-
git://git.kernel.dk/linux-blockLinus Torvalds authored
Pull NVMe fix from Jens Axboe: "Late addition here, it's basically a revert of a patch that was added in this merge window, but has proven to cause problems. This is swapping out the RCU based namespace protection with a good old mutex instead" * 'for-linus' of git://git.kernel.dk/linux-block: nvme: Remove RCU namespace protection
-
Jiri Slaby authored
With this command sequence: modprobe plip modprobe pps_parport rmmod pps_parport the partport_pps modules causes this crash: BUG: unable to handle kernel NULL pointer dereference at (null) IP: parport_detach+0x1d/0x60 [pps_parport] Oops: 0000 [#1] SMP ... Call Trace: parport_unregister_driver+0x65/0xc0 [parport] SyS_delete_module+0x187/0x210 The sequence that builds up to this is: 1) plip is loaded and takes the parport device for exclusive use: plip0: Parallel port at 0x378, using IRQ 7. 2) pps_parport then fails to grab the device: pps_parport: parallel port PPS client parport0: cannot grant exclusive access for device pps_parport pps_parport: couldn't register with parport0 3) rmmod of pps_parport is then killed because it tries to access pardev->name, but pardev (taken from port->cad) is NULL. So add a check for NULL in the test there too. Link: http://lkml.kernel.org/r/20160714115245.12651-1-jslaby@suse.czSigned-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Rodolfo Giometti <giometti@enneenne.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Dan Carpenter authored
The curly braces are missing here so we print stuff unintentionally. Fixes: 9da4714a ('slub: slabinfo update for cmpxchg handling') Link: http://lkml.kernel.org/r/20160715211243.GE19522@mwandaSigned-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Laura Abbott <labbott@fedoraproject.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Dan Carpenter authored
There are no parentheses around this macro and it causes a problem when we do: index = rand() % THRASH_SIZE; Link: http://lkml.kernel.org/r/20160715210953.GC19522@mwandaSigned-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andrey Ryabinin authored
radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags. Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot() leading to crash: RIP: radix_tree_next_slot include/linux/radix-tree.h:473 find_get_pages_tag+0x334/0x930 mm/filemap.c:1452 .... Call Trace: pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960 mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516 ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736 do_writepages+0x97/0x100 mm/page-writeback.c:2364 __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300 filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490 ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115 vfs_fsync_range+0x10a/0x250 fs/sync.c:195 vfs_fsync fs/sync.c:209 do_fsync+0x42/0x70 fs/sync.c:219 SYSC_fdatasync fs/sync.c:232 SyS_fdatasync+0x19/0x20 fs/sync.c:230 entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207 We must reset iterator's tags to bail out from radix_tree_next_slot() and go to the slow-path in radix_tree_next_chunk(). Fixes: 46437f9a ("radix-tree: fix race in gang lookup") Link: http://lkml.kernel.org/r/1468495196-10604-1-git-send-email-aryabinin@virtuozzo.comSigned-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Johannes Weiner authored
The memory controller has quite a bit of state that usually outlives the cgroup and pins its CSS until said state disappears. At the same time it imposes a 16-bit limit on the CSS ID space to economically store IDs in the wild. Consequently, when we use cgroups to contain frequent but small and short-lived jobs that leave behind some page cache, we quickly run into the 64k limitations of outstanding CSSs. Creating a new cgroup fails with -ENOSPC while there are only a few, or even no user-visible cgroups in existence. Although pinning CSSs past cgroup removal is common, there are only two instances that actually need an ID after a cgroup is deleted: cache shadow entries and swapout records. Cache shadow entries reference the ID weakly and can deal with the CSS having disappeared when it's looked up later. They pose no hurdle. Swap-out records do need to pin the css to hierarchically attribute swapins after the cgroup has been deleted; though the only pages that remain swapped out after offlining are tmpfs/shmem pages. And those references are under the user's control, so they are manageable. This patch introduces a private 16-bit memcg ID and switches swap and cache shadow entries over to using that. This ID can then be recycled after offlining when the CSS remains pinned only by objects that don't specifically need it. This script demonstrates the problem by faulting one cache page in a new cgroup and deleting it again: set -e mkdir -p pages for x in `seq 128000`; do [ $((x % 1000)) -eq 0 ] && echo $x mkdir /cgroup/foo echo $$ >/cgroup/foo/cgroup.procs echo trex >pages/$x echo $$ >/cgroup/cgroup.procs rmdir /cgroup/foo done When run on an unpatched kernel, we eventually run out of possible IDs even though there are no visible cgroups: [root@ham ~]# ./cssidstress.sh [...] 65000 mkdir: cannot create directory '/cgroup/foo': No space left on device After this patch, the IDs get released upon cgroup destruction and the cache and css objects get released once memory reclaim kicks in. [hannes@cmpxchg.org: init the IDR] Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org Fixes: b2052564 ("mm: memcontrol: continue cache reclaim from offlined groups") Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.orgSigned-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: John Garcia <john.garcia@mesosphere.io> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Nikolay Borisov <kernel@kyup.com> Cc: <stable@vger.kernel.org> [3.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-