- 15 Sep, 2015 1 commit
-
-
Daniel Axtens authored
Currently the first thing we do in cxl_probe is to grab a reference on the pci device. Later on, we call device_register on our adapter. In our remove path, we call device_unregister, but we never call pci_dev_put. We therefore leak the device every time we do a reflash. device_register/unregister is sufficient to hold the reference. Therefore, drop the call to pci_dev_get. Here's why this is safe. The proposed cxl_probe(pdev) calls cxl_adapter_init: a) init calls cxl_adapter_alloc, which creates a struct cxl, conventionally called adapter. This struct contains a device entry, adapter->dev. b) init calls cxl_configure_adapter, where we set adapter->dev.parent = &dev->dev (here dev is the pci dev) So at this point, the cxl adapter's device's parent is the PCI device that I want to be refcounted properly. c) init calls cxl_register_adapter *) cxl_register_adapter calls device_register(&adapter->dev) So now we're in device_register, where dev is the adapter device, and we want to know if the PCI device is safe after we return. device_register(&adapter->dev) calls device_initialize() and then device_add(). device_add() does a get_device(). device_add() also explicitly grabs the device's parent, and calls get_device() on it: parent = get_device(dev->parent); So therefore, device_register() takes a lock on the parent PCI dev, which is what pci_dev_get() was guarding. pci_dev_get() can therefore be safely removed. Fixes: f204e0b8 ("cxl: Driver code for powernv PCIe based cards for userspace access") Cc: stable@vger.kernel.org Signed-off-by: Daniel Axtens <dja@axtens.net> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 10 Sep, 2015 1 commit
-
-
Paul Mackerras authored
This fixes a race which can result in the same virtual IRQ number being assigned to two different MSI interrupts. The most visible consequence of that is usually a warning and stack trace from the sysfs code about an attempt to create a duplicate entry in sysfs. The race happens when one CPU (say CPU 0) is disposing of an MSI while another CPU (say CPU 1) is setting up an MSI. CPU 0 calls (for example) pnv_teardown_msi_irqs(), which calls msi_bitmap_free_hwirqs() to indicate that the MSI (i.e. its hardware IRQ number) is no longer in use. Then, before CPU 0 gets to calling irq_dispose_mapping() to free up the virtal IRQ number, CPU 1 comes in and calls msi_bitmap_alloc_hwirqs() to allocate an MSI, and gets the same hardware IRQ number that CPU 0 just freed. CPU 1 then calls irq_create_mapping() to get a virtual IRQ number, which sees that there is currently a mapping for that hardware IRQ number and returns the corresponding virtual IRQ number (which is the same virtual IRQ number that CPU 0 was using). CPU 0 then calls irq_dispose_mapping() and frees that virtual IRQ number. Now, if another CPU comes along and calls irq_create_mapping(), it is likely to get the virtual IRQ number that was just freed, resulting in the same virtual IRQ number apparently being used for two different hardware interrupts. To fix this race, we just move the call to msi_bitmap_free_hwirqs() to after the call to irq_dispose_mapping(). Since virq_to_hw() doesn't work for the virtual IRQ number after irq_dispose_mapping() has been called, we need to call it before irq_dispose_mapping() and remember the result for the msi_bitmap_free_hwirqs() call. The pattern of calling msi_bitmap_free_hwirqs() before irq_dispose_mapping() appears in 5 places under arch/powerpc, and appears to have originated in commit 05af7bd2 ("[POWERPC] MPIC U3/U4 MSI backend") from 2007. Fixes: 05af7bd2 ("[POWERPC] MPIC U3/U4 MSI backend") Cc: stable@vger.kernel.org # v2.6.22+ Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 09 Sep, 2015 1 commit
-
-
Michael Ellerman authored
The selftest passes on 64-bit LE and BE. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 08 Sep, 2015 1 commit
-
-
Bharata B Rao authored
Commit f32393c9 ("powerpc/pseries: Correct cpu affinity for dlpar added cpus") moved dlpar_acquire_drc() call to before dlpar_configure_connector() call in dlpar_cpu_probe(), but missed to release the DRC if dlpar_configure_connector() failed. During CPU hotplug, if configure-connector fails for any reason, then this will result in subsequent CPU hotplug attempts to fail. Release the acquired DRC if dlpar_configure_connector() call fails so that the DRC is left in right isolation and allocation state for the subsequent hotplug operation to succeed. Fixes: f32393c9 ("powerpc/pseries: Correct cpu affinity for dlpar added cpus") Cc: stable@vger.kernel.org # 4.1+ Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 07 Sep, 2015 3 commits
-
-
Andrew Donnellan authored
cxl_pci_enable_device_hook() is called when attempting to enable an AFU sitting on a vPHB. At present, the state of the underlying CXL card's PCI channel is only checked when it calls cxl_afu_check_and_enable() at the very end, after it has already set DMA options and initialised a default context. Check the CXL card's link status before setting DMA options or initialising a default context. If the link is down, print a warning and return immediately. Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nishanth Aravamudan authored
The 32-bit TCE table initialization relies on the DMA window having a size equal to a power of 2 (and checks for it explicitly). But crashkernel= has no constraint that requires a power-of-2 be specified. This causes the kdump kernel to fail to boot as none of the PCI devices (including the disk controller) are successfully initialized. After this change, the PCI devices successfully set up the 32-bit TCE table and kdump succeeds. Fixes: aca6913f ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages") Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: stable@vger.kernel.org # 4.2 Tested-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nishanth Aravamudan authored
When attempting to kdump with the 4.2 kernel, we see for each PCI device: pci 0003:01 : [PE# 000] Assign DMA32 space pci 0003:01 : [PE# 000] Setting up 32-bit TCE table at 0..80000000 pci 0003:01 : [PE# 000] Failed to create 32-bit TCE table, err -22 PCI: Domain 0004 has 8 available 32-bit DMA segments PCI: 4 PE# for a total weight of 70 pci 0004:01 : [PE# 002] Assign DMA32 space pci 0004:01 : [PE# 002] Setting up 32-bit TCE table at 0..80000000 pci 0004:01 : [PE# 002] Failed to create 32-bit TCE table, err -22 pci 0004:0d : [PE# 005] Assign DMA32 space pci 0004:0d : [PE# 005] Setting up 32-bit TCE table at 0..80000000 pci 0004:0d : [PE# 005] Failed to create 32-bit TCE table, err -22 pci 0004:0e : [PE# 006] Assign DMA32 space pci 0004:0e : [PE# 006] Setting up 32-bit TCE table at 0..80000000 pci 0004:0e : [PE# 006] Failed to create 32-bit TCE table, err -22 pci 0004:10 : [PE# 008] Assign DMA32 space pci 0004:10 : [PE# 008] Setting up 32-bit TCE table at 0..80000000 pci 0004:10 : [PE# 008] Failed to create 32-bit TCE table, err -22 and eventually the kdump kernel fails to boot as none of the PCI devices (including the disk controller) are successfully initialized. The EINVAL response is because the DMA window (the 2GB base window) is larger than the kdump kernel's reserved memory (crashkernel=, in this case specified to be 1024M). The check in question, if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size)) is a valid sanity check for pnv_pci_ioda2_table_alloc_pages(), so adjust the caller to pass in a smaller window size if our maximum memory value is smaller than the DMA window. After this change, the PCI devices successfully set up the 32-bit TCE table and kdump succeeds. The problem was seen on a Firestone machine originally. Fixes: aca6913f ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages") Cc: stable@vger.kernel.org # 4.2 Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> [mpe: Coding style pedantry, use u64, change the indentation] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 06 Sep, 2015 6 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds authored
Pull vfs updates from Al Viro: "In this one: - d_move fixes (Eric Biederman) - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error handling ought to be fixed) - switch of sb_writers to percpu rwsem (Oleg Nesterov) - superblock scalability (Josef Bacik and Dave Chinner) - swapon(2) race fix (Hugh Dickins)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits) vfs: Test for and handle paths that are unreachable from their mnt_root dcache: Reduce the scope of i_lock in d_splice_alias dcache: Handle escaped paths in prepend_path mm: fix potential data race in SyS_swapon inode: don't softlockup when evicting inodes inode: rename i_wb_list to i_io_list sync: serialise per-superblock sync operations inode: convert inode_sb_list_lock to per-sb inode: add hlist_fake to avoid the inode hash lock in evict writeback: plug writeback at a high level change sb_writers to use percpu_rw_semaphore shift percpu_counter_destroy() into destroy_super_work() percpu-rwsem: kill CONFIG_PERCPU_RWSEM percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire() percpu-rwsem: introduce percpu_down_read_trylock() document rwsem_release() in sb_wait_write() fix the broken lockdep logic in __sb_start_write() introduce __sb_writers_{acquired,release}() helpers ufs_inode_get{frag,block}(): get rid of 'phys' argument ufs_getfrag_block(): tidy up a bit ...
-
Linus Torvalds authored
Merge tag 'for-linus-4.3-merge-window-part-1' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs Pull 9p updates from Eric Van Hensbergen: "Just a few cleanups for 4.3 merge window for the 9p file system. I've gotten several more over the past week, but this group has been in for-next for at least a couple of weeks so I figured I'd push them first while I test the rest. Most of the ones not in this set are bug-fixes anyways so I could hold them for rc1" * tag 'for-linus-4.3-merge-window-part-1' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: 9p: fix return code of read() when count is 0 9p: remove unused option Opt_trans
-
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-mediaLinus Torvalds authored
Pull media updates from Mauro Carvalho Chehab: - new DVB frontend drivers: ascot2e, cxd2841er, horus3a, lnbh25 - new HDMI capture driver: tc358743 - new driver for NetUP DVB new boards (netup_unidvb) - IR support for DVBSky cards (smipcie-ir) - Coda driver has gain macroblock tiling support - Renesas R-Car gains JPEG codec driver - new DVB platform driver for STi boards: c8sectpfe - added documentation for the media core kABI to device-drivers DocBook - lots of driver fixups, cleanups and improvements * tag 'media/v4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (297 commits) [media] c8sectpfe: Remove select on undefined LIBELF_32 [media] i2c: fix platform_no_drv_owner.cocci warnings [media] cx231xx: Use wake_up_interruptible() instead of wake_up_interruptible_nr() [media] tc358743: only queue subdev notifications if devnode is set [media] tc358743: add missing Kconfig dependency/select [media] c8sectpfe: Use %pad to print 'dma_addr_t' [media] DocBook media: Fix typo "the the" in xml files [media] tc358743: make reset gpio optional [media] tc358743: set direction of reset gpio using devm_gpiod_get [media] dvbdev: document most of the functions/data structs [media] dvb_frontend.h: document the struct dvb_frontend [media] dvb-frontend.h: document struct dtv_frontend_properties [media] dvb-frontend.h: document struct dvb_frontend_ops [media] dvb: Use DVBFE_ALGO_HW where applicable [media] dvb_frontend.h: document struct analog_demod_ops [media] dvb_frontend.h: Document struct dvb_tuner_ops [media] Docbook: Document struct analog_parameters [media] dvb_frontend.h: get rid of dvbfe_modcod [media] add documentation for struct dvb_tuner_info [media] dvb_frontend: document dvb_frontend_tune_settings ...
-
git://git.linaro.org/landing-teams/working/fujitsu/integrationLinus Torvalds authored
Pull mailbox updates from Jassi Brar: "Mainly we move from jiffy based timer to HRTIMER for finer control over polling. Then a controller reduces its polling period from 10 to 1ms" * 'mailbox-for-next' of git://git.linaro.org/landing-teams/working/fujitsu/integration: mailbox: arm_mhu: reduce txpoll_period from 10ms to 1 ms mailbox: switch to hrtimer for tx_complete polling mailbox: Drop owner assignment from platform_driver
-
git://neil.brown.name/mdLinus Torvalds authored
Pull md updates from Neil Brown: - an assortment of little fixes, several for minor races only likely to be hit during testing - further cluster-md-raid1 development, not ready for real use yet. - new RAID6 syndrome code for ARM NEON - fix a race where a write can return before failure of one device is properly recorded in metadata, so an immediate crash might result in that write being lost. * tag 'md/4.3' of git://neil.brown.name/md: (33 commits) md/raid5: ensure device failure recorded before write request returns. md/raid5: use bio_list for the list of bios to return. md/raid10: ensure device failure recorded before write request returns. md/raid1: ensure device failure recorded before write request returns. md-cluster: remove inappropriate try_module_get from join() md: extend spinlock protection in register_md_cluster_operations md-cluster: Read the disk bitmap sb and check if it needs recovery md-cluster: only call complete(&cinfo->completion) when node join cluster md-cluster: add missed lockres_free md-cluster: remove the unused sb_lock md-cluster: init suspend_list and suspend_lock early in join md-cluster: add the error check if failed to get dlm lock md-cluster: init completion within lockres_init md-cluster: fix deadlock issue on message lock md-cluster: transfer the resync ownership to another node md-cluster: split recover_slot for future code reuse md-cluster: use %pU to print UUIDs md: setup safemode_timer before it's being used md/raid5: handle possible race as reshape completes. md: sync sync_completed has correct value as recovery finishes. ...
-
git://linux-nfs.org/~bfields/linuxLinus Torvalds authored
Pull nfsd updates from Bruce Fields: "Nothing major, but: - Add Jeff Layton as an nfsd co-maintainer: no change to existing practice, just an acknowledgement of the status quo. - Two patches ("nfsd: ensure that...") for a race overlooked by the state locking rewrite, causing a crash noticed by multiple users. - Lots of smaller bugfixes all over from Kinglong Mee. - From Jeff, some cleanup of server rpc code in preparation for possible shift of nfsd threads to workqueues" * tag 'nfsd-4.3' of git://linux-nfs.org/~bfields/linux: (52 commits) nfsd: deal with DELEGRETURN racing with CB_RECALL nfsd: return CLID_INUSE for unexpected SETCLIENTID_CONFIRM case nfsd: ensure that delegation stateid hash references are only put once nfsd: ensure that the ol stateid hash reference is only put once net: sunrpc: fix tracepoint Warning: unknown op '->' nfsd: allow more than one laundry job to run at a time nfsd: don't WARN/backtrace for invalid container deployment. fs: fix fs/locks.c kernel-doc warning nfsd: Add Jeff Layton as co-maintainer NFSD: Return word2 bitmask if setting security label in OPEN/CREATE NFSD: Set the attributes used to store the verifier for EXCLUSIVE4_1 nfsd: SUPPATTR_EXCLCREAT must be encoded before SECURITY_LABEL. nfsd: Fix an FS_LAYOUT_TYPES/LAYOUT_TYPES encode bug NFSD: Store parent's stat in a separate value nfsd: Fix two typos in comments lockd: NLM grace period shouldn't block NFSv4 opens nfsd: include linux/nfs4.h in export.h sunrpc: Switch to using hash list instead single list sunrpc/nfsd: Remove redundant code by exports seq_operations functions sunrpc: Store cache_detail in seq_file's private directly ...
-
- 05 Sep, 2015 4 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfsLinus Torvalds authored
Pull btrfs updates from Chris Mason: "This has Jeff Mahoney's long standing trim patch that fixes corners where trims were missing. Omar has some raid5/6 fixes, especially for using scrub and device replace when devices are missing. Zhao Lie continues cleaning and fixing things, this series fixes some really hard to hit corners in xfstests. I had to pull it last merge window due to some deadlocks, but those are now resolved. I added support for Tejun's new blkio controllers. It seems to work well for single devices, we'll expand to multi-device as well" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (47 commits) btrfs: fix compile when block cgroups are not enabled Btrfs: fix file read corruption after extent cloning and fsync Btrfs: check if previous transaction aborted to avoid fs corruption btrfs: use __GFP_NOFAIL in alloc_btrfs_bio btrfs: Prevent from early transaction abort btrfs: Remove unused arguments in tree-log.c btrfs: Remove useless condition in start_log_trans() Btrfs: add support for blkio controllers Btrfs: remove unused mutex from struct 'btrfs_fs_info' Btrfs: fix parity scrub of RAID 5/6 with missing device Btrfs: fix device replace of a missing RAID 5/6 device Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation Btrfs: count devices correctly in readahead during RAID 5/6 replace Btrfs: remove misleading handling of missing device scrub btrfs: fix clone / extent-same deadlocks Btrfs: fix defrag to merge tail file extent Btrfs: fix warning in backref walking btrfs: Add WARN_ON() for double lock in btrfs_tree_lock() btrfs: Remove root argument in extent_data_ref_count() btrfs: Fix wrong comment of btrfs_alloc_tree_block() ...
-
Linus Torvalds authored
Merge patch-bomb from Andrew Morton: - a few misc things - Andy's "ambient capabilities" - fs/nofity updates - the ocfs2 queue - kernel/watchdog.c updates and feature work. - some of MM. Includes Andrea's userfaultfd feature. [ Hadn't noticed that userfaultfd was 'default y' when applying the patches, so that got fixed in this merge instead. We do _not_ mark new features that nobody uses yet 'default y' - Linus ] * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits) mm/hugetlb.c: make vma_has_reserves() return bool mm/madvise.c: make madvise_behaviour_valid() return bool mm/memory.c: make tlb_next_batch() return bool mm/dmapool.c: change is_page_busy() return from int to bool mm: remove struct node_active_region mremap: simplify the "overlap" check in mremap_to() mremap: don't do uneccesary checks if new_len == old_len mremap: don't do mm_populate(new_addr) on failure mm: move ->mremap() from file_operations to vm_operations_struct mremap: don't leak new_vma if f_op->mremap() fails mm/hugetlb.c: make vma_shareable() return bool mm: make GUP handle pfn mapping unless FOLL_GET is requested mm: fix status code which move_pages() returns for zero page mm: memcontrol: bring back the VM_BUG_ON() in mem_cgroup_swapout() genalloc: add support of multiple gen_pools per device genalloc: add name arg to gen_pool_get() and devm_gen_pool_create() mm/memblock: WARN_ON when nid differs from overlap region Documentation/features/vm: add feature description and arch support status for batched TLB flush after unmap mm: defer flush of writable TLB entries mm: send one IPI per CPU to TLB flush all entries after unmapping pages ...
-
Eric Dumazet authored
In commit f341861f ("task_work: add a scheduling point in task_work_run()") I fixed a latency problem adding a cond_resched() call. Later, commit ac3d0da8 added yet another loop to reverse a list, bringing back the latency spike : I've seen in some cases this loop taking 275 ms, if for example a process with 2,000,000 files is killed. We could add yet another cond_resched() in the reverse loop, or we can simply remove the reversal, as I do not think anything would depend on order of task_work_add() submitted works. Fixes: ac3d0da8 ("task_work: Make task_work_add() lockless") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Maciej Żenczykowski <maze@google.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
NeilBrown authored
There were a few conflicts that are fairly easy to resolve. Signed-off-by: NeilBrown <neilb@suse.com>
-
- 04 Sep, 2015 23 commits
-
-
Nicholas Krause authored
This makes vma_has_reserves() return bool due to this particular function only returning either one or zero as its return value. Signed-off-by: Nicholas Krause <xerofoify@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Nicholas Krause authored
This makes the madvise_bahaviour_valid() function return bool due to this particular function always returning the value of either one or zero as its return value. Signed-off-by: Nicholas Krause <xerofoify@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Nicholas Krause authored
This makes the tlb_next_batch() bool due to this particular function only ever returning either one or zero as its return value. Signed-off-by: Nicholas Krause <xerofoify@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Nicholas Krause authored
This makes the function is_page_busy() return bool rather then an int now due to this particular function's single return statement only ever evaulating to either one or zero. Signed-off-by: Nicholas Krause <xerofoify@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
minkyung88.kim authored
struct node_active_region is not used anymore. Remove it. Signed-off-by: minkyung88.kim <minkyung88.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Oleg Nesterov authored
Minor, but this check is overcomplicated. Two half-intervals do NOT overlap if END1 <= START2 || END2 <= START1, mremap_to() just needs to negate this check. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Oleg Nesterov authored
The "new_len > old_len" branch in vma_to_resize() looks very confusing. It only covers the VM_DONTEXPAND/pgoff checks but everything below is equally unneeded if new_len == old_len. Change this code to return if "new_len == old_len", new_len < old_len is not possible, otherwise the code below is wrong anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Oleg Nesterov authored
move_vma() sets *locked even if move_page_tables() or ->mremap() fails, change sys_mremap() to check "ret & ~PAGE_MASK". I think we should simply remove the VM_LOCKED code in move_vma(), that is why this patch doesn't change move_vma(). But this needs more cleanups. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Oleg Nesterov authored
vma->vm_ops->mremap() looks more natural and clean in move_vma(), and this way ->mremap() can have more users. Say, vdso. While at it, s/aio_ring_remap/aio_ring_mremap/. Note: this is the minimal change before ->mremap() finds another user in file_operations; this method should have more arguments, and it can be used to kill arch_remap(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Oleg Nesterov authored
move_vma() can't just return if f_op->mremap() fails, we should unmap the new vma like we do if move_page_tables() fails. To avoid the code duplication this patch moves the "move entries back" under the new "if (err)" branch. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Nicholas Krause authored
This makes vma_shareable() return bool now due to this particular function only ever returning either one or zero as its return value. Signed-off-by: Nicholas Krause <xerofoify@gmail.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kirill A. Shutemov authored
With DAX, pfn mapping becoming more common. The patch adjusts GUP code to cover pfn mapping for cases when we don't need struct page to proceed. To make it possible, let's change follow_page() code to return -EEXIST error code if proper page table entry exists, but no corresponding struct page. __get_user_page() would ignore the error code and move to the next page frame. The immediate effect of the change is working MAP_POPULATE and mlock() on DAX mappings. [akpm@linux-foundation.org: fix arm64 build] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Toshi Kani <toshi.kani@hp.com> Acked-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kirill A. Shutemov authored
The manpage for move_pages(2) specifies that status code for zero page is supposed to be -EFAULT. Currently kernel return -ENOENT in this case. follow_page() can do it for us, if we would ask for FOLL_DUMP. The use of FOLL_DUMP also means that the upper layer page tables pages are no longer allocated. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Sebastian Andrzej Siewior authored
Clark stumbled over a VM_BUG_ON() in -RT which was then was removed by Johannes in commit f371763a ("mm: memcontrol: fix false-positive VM_BUG_ON() on -rt"). The comment before that patch was a tiny bit better than it is now. While the patch claimed to fix a false-postive on -RT this was not the case. None of the -RT folks ACKed it and it was not a false positive report. That was a *real* problem. This patch updates the comment that is improper because it refers to "disabled preemption" as a consequence of that lock being taken. A spin_lock() disables preemption, true, but in this case the code relies on the fact that the lock _also_ disables interrupts once it is acquired. And this is the important detail (which was checked the VM_BUG_ON()) which needs to be pointed out. This is the hint one needs while looking at the code. It was explained by Johannes on the list that the per-CPU variables are protected by local_irq_save(). The BUG_ON() was helpful. This code has been workarounded in -RT in the meantime. I wouldn't mind running into more of those if the code in question uses *special* kind of locking since now there is no verification (in terms of lockdep or BUG_ON()) and therefore I bring the VM_BUG_ON() check back in. The two functions after the comment could also have a "local_irq_save()" dance around them in order to serialize access to the per-CPU variables. This has been avoided because the interrupts should be off. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Clark Williams <williams@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vladimir Zapolskiy authored
This change fills devm_gen_pool_create()/gen_pool_get() "name" argument stub with contents and extends of_gen_pool_get() functionality on this basis. If there is no associated platform device with a device node passed to of_gen_pool_get(), the function attempts to get a label property or device node name (= repeats MTD OF partition standard) and seeks for a named gen_pool registered by device of the parent device node. The main idea of the change is to allow registration of independent gen_pools under the same umbrella device, say "partitions" on "storage device", the original functionality of one "partition" per "storage device" is untouched. [akpm@linux-foundation.org: fix constness in devres_find()] [dan.carpenter@oracle.com: freeing const data pointers] Signed-off-by: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com> Cc: Philipp Zabel <p.zabel@pengutronix.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Nicolas Ferre <nicolas.ferre@atmel.com> Cc: Alexandre Belloni <alexandre.belloni@free-electrons.com> Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vladimir Zapolskiy authored
This change modifies gen_pool_get() and devm_gen_pool_create() client interfaces adding one more argument "name" of a gen_pool object. Due to implementation gen_pool_get() is capable to retrieve only one gen_pool associated with a device even if multiple gen_pools are created, fortunately right at the moment it is sufficient for the clients, hence provide NULL as a valid argument on both producer devm_gen_pool_create() and consumer gen_pool_get() sides. Because only one created gen_pool per device is addressable, explicitly add a restriction to devm_gen_pool_create() to create only one gen_pool per device, this implies two possible error codes returned by the function, account it on client side (only misc/sram). This completes client side changes related to genalloc updates. [akpm@linux-foundation.org: gen_pool_get() cleanup] Signed-off-by: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com> Cc: Philipp Zabel <p.zabel@pengutronix.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Nicolas Ferre <nicolas.ferre@atmel.com> Cc: Alexandre Belloni <alexandre.belloni@free-electrons.com> Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wei Yang authored
Each memblock_region has nid to indicates the Node ID of this range. For the overlap case, memblock_add_range() inserts the lower part and leave the upper part as indicated in the overlapped region. If the nid of the new range differs from the overlapped region, the information recorded is not correct. This patch adds a WARN_ON when the nid of the new range differs from the overlapped region. Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
Documentation/features/vm: add feature description and arch support status for batched TLB flush after unmap Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
If a PTE is unmapped and it's dirty then it was writable recently. Due to deferred TLB flushing, it's best to assume a writable TLB cache entry exists. With that assumption, the TLB must be flushed before any IO can start or the page is freed to avoid lost writes or data corruption. This patch defers flushing of potentially writable TLBs as long as possible. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
An IPI is sent to flush remote TLBs when a page is unmapped that was potentially accesssed by other CPUs. There are many circumstances where this happens but the obvious one is kswapd reclaiming pages belonging to a running process as kswapd and the task are likely running on separate CPUs. On small machines, this is not a significant problem but as machine gets larger with more cores and more memory, the cost of these IPIs can be high. This patch uses a simple structure that tracks CPUs that potentially have TLB entries for pages being unmapped. When the unmapping is complete, the full TLB is flushed on the assumption that a refill cost is lower than flushing individual entries. Architectures wishing to do this must give the following guarantee. If a clean page is unmapped and not immediately flushed, the architecture must guarantee that a write to that linear address from a CPU with a cached TLB entry will trap a page fault. This is essentially what the kernel already depends on but the window is much larger with this patch applied and is worth highlighting. The architecture should consider whether the cost of the full TLB flush is higher than sending an IPI to flush each individual entry. An additional architecture helper called flush_tlb_local is required. It's a trivial wrapper with some accounting in the x86 case. The impact of this patch depends on the workload as measuring any benefit requires both mapped pages co-located on the LRU and memory pressure. The case with the biggest impact is multiple processes reading mapped pages taken from the vm-scalability test suite. The test case uses NR_CPU readers of mapped files that consume 10*RAM. Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 Ops lru-file-mmap-read-elapsed 159.62 ( 0.00%) 120.68 ( 24.40%) Ops lru-file-mmap-read-time_range 30.59 ( 0.00%) 2.80 ( 90.85%) Ops lru-file-mmap-read-time_stddv 6.70 ( 0.00%) 0.64 ( 90.38%) 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 User 581.00 611.43 System 5804.93 4111.76 Elapsed 161.03 122.12 This is showing that the readers completed 24.40% faster with 29% less system CPU time. From vmstats, it is known that the vanilla kernel was interrupted roughly 900K times per second during the steady phase of the test and the patched kernel was interrupts 180K times per second. The impact is lower on a single socket machine. 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 Ops lru-file-mmap-read-elapsed 25.33 ( 0.00%) 20.38 ( 19.54%) Ops lru-file-mmap-read-time_range 0.91 ( 0.00%) 1.44 (-58.24%) Ops lru-file-mmap-read-time_stddv 0.28 ( 0.00%) 0.47 (-65.34%) 4.2.0-rc1 4.2.0-rc1 vanilla flushfull-v7 User 58.09 57.64 System 111.82 76.56 Elapsed 27.29 22.55 It's still a noticeable improvement with vmstat showing interrupts went from roughly 500K per second to 45K per second. The patch will have no impact on workloads with no memory pressure or have relatively few mapped pages. It will have an unpredictable impact on the workload running on the CPU being flushed as it'll depend on how many TLB entries need to be refilled and how long that takes. Worst case, the TLB will be completely cleared of active entries when the target PFNs were not resident at all. [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush] Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
When unmapping pages it is necessary to flush the TLB. If that page was accessed by another CPU then an IPI is used to flush the remote CPU. That is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second. There already is a window between when a page is unmapped and when it is TLB flushed. This series increases the window so multiple pages can be flushed using a single IPI. This should be safe or the kernel is hosed already. Patch 1 simply made the rest of the series easier to write as ftrace could identify all the senders of TLB flush IPIS. Patch 2 tracks what CPUs potentially map a PFN and then sends an IPI to flush the entire TLB. Patch 3 tracks when there potentially are writable TLB entries that need to be batched differently Patch 4 increases SWAP_CLUSTER_MAX to further batch flushes The performance impact is documented in the changelogs but in the optimistic case on a 4-socket machine the full series reduces interrupts from 900K interrupts/second to 60K interrupts/second. This patch (of 4): It is easy to trace when an IPI is received to flush a TLB but harder to detect what event sent it. This patch makes it easy to identify the source of IPIs being transmitted for TLB flushes on x86. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andrea Arcangeli authored
This test allocates two virtual areas and bounces the physical memory across the two virtual areas using only userfaultfd. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Shuah Khan <shuah.kh@samsung.com> Cc: Shuah Khan <shuahkh@osg.samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andrea Arcangeli authored
During the refile in userfaultfd_read both waitqueues could look empty to the lockless wake_userfault(). Use a seqcount to prevent this false negative that could leave an userfault blocked. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-