- 14 May, 2015 40 commits
-
-
Andrea Arcangeli authored
This implements the uABI of UFFDIO_REMAP. Notably, one mode bitflag is also forwarded to (and in turn known by) the low-level remap_pages method.
-
Andrea Arcangeli authored
Provide a new swapfile method for remap_pages() to verify that the swap entry is mapped in only one vma before relocating the swap entry to a different virtual address. Otherwise, if the swap entry is mapped in multiple vmas, when the page is swapped back in it could get mapped in a non-linear way in some anon_vma.
-
Andrea Arcangeli authored
As far as the rmap code is concerned, remap_pages only alters page->mapping and page->index, and it does so while holding the page lock. However, there are a few places that, for anon pages, are allowed to do rmap walks without the page lock (split_huge_page and page_referenced_anon). Those places that do rmap walks without taking the page lock first must be updated to re-check that page->mapping didn't change after they obtained the anon_vma lock. remap_pages takes the anon_vma lock for writing before altering page->mapping, so if page->mapping is still the same after obtaining the anon_vma lock (without the page lock), the rmap walks can go ahead safely (and remap_pages will wait for them to complete before proceeding). remap_pages serializes against itself with the page lock.

All other places taking the anon_vma lock while holding the mmap_sem for writing don't need to check if page->mapping has changed after taking the anon_vma lock, regardless of the page lock, because remap_pages holds the mmap_sem for reading.

There's one constraint enforced to allow this simplification: the source pages passed to remap_pages must be mapped in only one vma, but this is not a limitation when used to handle userland page faults. The source addresses passed to remap_pages should be set as VM_DONTCOPY with MADV_DONTFORK to avoid any risk of the mapcount of the pages increasing if fork runs in parallel in another thread, before or while remap_pages runs.
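For illustration only, a minimal sketch of the re-check pattern described above, assuming the usual anon_vma helpers (page_get_anon_vma(), page_anon_vma(), anon_vma_lock_read()); the actual code in split_huge_page and page_referenced_anon is more involved:

    /*
     * Illustrative sketch, not the literal kernel diff: an rmap walker
     * that takes the anon_vma lock without holding the page lock must
     * re-check page->mapping, because remap_pages may have moved the
     * page to another anon_vma in the meantime.
     */
    struct anon_vma *anon_vma = page_get_anon_vma(page); /* takes a reference */

    if (!anon_vma)
            return;

    anon_vma_lock_read(anon_vma);
    if (page_anon_vma(page) != anon_vma) {
            /* page->mapping changed under us (e.g. by remap_pages): bail out */
            anon_vma_unlock_read(anon_vma);
            put_anon_vma(anon_vma);
            return;
    }
    /* safe: remap_pages blocks on the anon_vma write lock until we finish */
    /* ... rmap walk ... */
    anon_vma_unlock_read(anon_vma);
    put_anon_vma(anon_vma);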
-
Andrea Arcangeli authored
These two ioctls allow either atomically copying pages or mapping zeropages into the virtual address space. They are used by the thread that opened the userfaultfd to resolve the userfaults.
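As a rough userland sketch (field names follow the uffdio_copy layout as later merged upstream; the exact layout in this series may differ), the resolving thread would do something like:

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <string.h>

    /* Copy one prepared page into the faulting (registered) range. */
    static int resolve_missing_fault(int uffd, unsigned long dst,
                                     void *src_page, unsigned long page_size)
    {
            struct uffdio_copy copy;

            memset(&copy, 0, sizeof(copy));
            copy.dst = dst;                         /* faulting address, page aligned */
            copy.src = (unsigned long)src_page;     /* page filled by the transfer thread */
            copy.len = page_size;
            copy.mode = 0;                          /* 0: also wake the blocked faulter */

            if (ioctl(uffd, UFFDIO_COPY, &copy))
                    return -1;
            /* copy.copy reports how many bytes were actually copied */
            return copy.copy == (long)page_size ? 0 : -1;
    }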
-
Andrea Arcangeli authored
If the rwsem starves writers it wasn't strictly a bug, but lockdep doesn't like it, and this avoids depending on low-level implementation details of the lock.
-
Andrea Arcangeli authored
This implements mcopy_atomic and mfill_zeropage, the low-level VM methods invoked respectively by the UFFDIO_COPY and UFFDIO_ZEROPAGE userfaultfd commands.
-
Andrea Arcangeli authored
This implements the uABI of UFFDIO_COPY and UFFDIO_ZEROPAGE.
-
Andrea Arcangeli authored
This activates the userfaultfd syscall.
-
Andrea Arcangeli authored
This allows selecting userfaultfd at configuration time so it gets built.
-
Andrea Arcangeli authored
Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and userfaultfd_read if they are run on different threads simultaneously.

Until now qemu solved the race in userland: the race was explicitly and intentionally left for userland to solve. However we can also solve it in the kernel. Requiring all users to solve this race if they use two threads (one for the background transfer and one for the userfault reads) isn't very attractive from an API perspective; furthermore, this allows removing a whole bunch of mutex and bitmap code from qemu, making it faster. The cost of __get_user_pages_fast should be insignificant considering it scales perfectly and the pagetables are already hot in the CPU cache, compared to the overhead in userland to maintain those structures.

Applying this patch is backwards compatible with respect to the userfaultfd userland API; however, reverting this change wouldn't be backwards compatible anymore.

Without this patch, qemu in the background transfer thread has to read the old state, and do UFFDIO_WAKE if old_state was MISSING but it became REQUESTED by the time it tries to set it to RECEIVED (signaling the other side received a userfault):

    vcpu                     background_thr                userfault_thr
    -----                    -----                         -----
    vcpu0 handle_mm_fault()

                             postcopy_place_page
                             read old_state -> MISSING
                             UFFDIO_COPY 0x7fb76a139000
                             (no wakeup, still pending)

    vcpu0 fault at 0x7fb76a139000
    enters handle_userfault
    poll() is kicked

                                                           poll() -> POLLIN
                                                           read() -> 0x7fb76a139000
                                                           postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED

                             tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
                             /* check that no userfault raced with UFFDIO_COPY */
                             if (old_state == MISSING && tmp_state == REQUESTED)
                                     UFFDIO_WAKE from background thread

And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:

    vcpu                     background_thr                userfault_thr
    -----                    -----                         -----
    vcpu0 handle_mm_fault()

                             postcopy_place_page
                             read old_state -> MISSING
                             UFFDIO_COPY 0x7fb76a139000
                             (no wakeup, still pending)
                             tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED

    vcpu0 fault at 0x7fb76a139000
    enters handle_userfault
    poll() is kicked

                                                           poll() -> POLLIN
                                                           read() -> 0x7fb76a139000
                                                           if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
                                                                   UFFDIO_WAKE from userfault thread

This patch removes the need for both the UFFDIO_WAKE calls and the associated per-page tristate.
-
Andrea Arcangeli authored
Use a proper slab to guarantee alignment.
-
Andrea Arcangeli authored
This makes read() O(1), and poll(), which was already O(1), becomes lockless.
-
Andrea Arcangeli authored
This is an optimization, but it's a userland-visible one and it affects the API.

The downside of this optimization is that if you call poll() and you get POLLIN, read(ufd) may still return -EAGAIN. The blocked userfault may be woken by a different thread before read(ufd) comes around. In short this means that poll() isn't really usable if the userfaultfd is opened in blocking mode.

Userfaults won't wait in "pending" state to be read anymore, and any UFFDIO_WAKE or similar operation that has the objective of waking userfaults after their resolution will wake all blocked userfaults for the resolved range, including those that haven't been read() by userland yet.

The behavior of poll() becomes non-standard, but this obviates the need for "spurious" UFFDIO_WAKE and it lets the userland threads restart immediately without requiring a UFFDIO_WAKE. This is even more significant in case of repeated faults on the same address from multiple threads.

This optimization is justified by the measurement that the number of spurious UFFDIO_WAKE accounts for 5% to 10% of the total userfaults for heavy workloads, so it's worth optimizing those away.
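In userland terms, a hedged sketch of the consequence (assuming the fd was opened non-blocking; the read() payload format is defined elsewhere in the series):

    #include <poll.h>
    #include <unistd.h>
    #include <errno.h>

    static void uffd_reader_loop(int uffd)  /* uffd opened with O_NONBLOCK */
    {
            struct pollfd pfd = { .fd = uffd, .events = POLLIN };
            char buf[256];  /* event payload; format defined by the uffd API */

            for (;;) {
                    if (poll(&pfd, 1, -1) <= 0)
                            continue;
                    ssize_t n = read(uffd, buf, sizeof(buf));
                    if (n < 0 && errno == EAGAIN)
                            continue;       /* fault already woken by another thread */
                    /* ... handle the userfault(s) described in buf ... */
            }
    }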
-
Andrea Arcangeli authored
I had requests to return the full address (not the page aligned one) to userland.

It's not entirely clear how the page offset could be relevant, because userfaults aren't like SIGBUS, which can sigjump to a different place and actually skip resolving the fault depending on a page offset. There's currently no real way to skip the fault, especially because after a UFFDIO_COPY|ZEROPAGE the fault is optimized to be retried within the kernel without having to return to userland first (not even self-modifying code replacing the .text that touched the faulting address would prevent the fault from being repeated). Userland cannot skip repeating the fault, even more so if the fault was triggered by a KVM secondary page fault or any get_user_pages or any copy-user inside some syscall that will return to kernel code. The second time, FAULT_FLAG_RETRY_NOWAIT won't be set, leading to a SIGBUS being raised because the userfault can't wait if it cannot release the mmap_sem first (and FAULT_FLAG_RETRY_NOWAIT is required for that).

Still, returning a proper structure to userland during the read() on the uffd allows using the current UFFD_API for the future non-cooperative extensions too, and it looks cleaner as well. Once we get additional fields, there's no point in returning the fault address page aligned anymore to reuse the bits below PAGE_SHIFT.

The only downside is that the read() syscall will read 32 bytes instead of 8 bytes, but that's not going to be measurable overhead.

The total number of new events that can be extended, or of new future bits for already shipped events, is limited to 64 by the features field of the uffdio_api structure. If more are needed, a bump of UFFD_API will be required.
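For illustration, a hedged sketch of reading the structured event (the layout shown follows the uffd_msg structure as later merged upstream; field and constant names in this series may differ):

    #include <linux/userfaultfd.h>
    #include <unistd.h>

    /* One 32-byte event per read(); the fault address is no longer page aligned. */
    static int read_one_event(int uffd, unsigned long *addr, unsigned long *flags)
    {
            struct uffd_msg msg;

            if (read(uffd, &msg, sizeof(msg)) != (ssize_t)sizeof(msg))
                    return -1;
            if (msg.event != UFFD_EVENT_PAGEFAULT)
                    return -1;      /* future non-cooperative events would land here */
            *addr  = msg.arg.pagefault.address;
            *flags = msg.arg.pagefault.flags;       /* e.g. UFFD_PAGEFAULT_FLAG_WRITE */
            return 0;
    }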
-
Andrea Arcangeli authored
Update comment.
-
Pavel Emelyanov authored
This is (or seems to be) the minimal thing required to unblock standard uffd usage from the non-cooperative one. Now more bits can be added to the features field indicating e.g. UFFD_FEATURE_FORK and others needed for the latter use case. Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
-
Andrea Arcangeli authored
Once a userfaultfd has been created and certain regions of the process virtual address space have been registered with it, the thread responsible for doing the memory externalization can manage the page faults in userland by talking to the kernel using the userfaultfd protocol. poll() can be used to know when there are new pending userfaults to be read (POLLIN).
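A minimal userland sketch of that setup, using the syscall number and ioctl/struct names of the uAPI as later merged upstream (details in this series may differ):

    #include <linux/userfaultfd.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <fcntl.h>
    #include <unistd.h>

    static int uffd_setup(void *area, unsigned long len)
    {
            struct uffdio_api api = { .api = UFFD_API };
            struct uffdio_register reg;
            int uffd;

            /* create the userfaultfd (__NR_userfaultfd as defined by this series) */
            uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
            if (uffd < 0)
                    return -1;

            /* handshake the API/feature bits */
            if (ioctl(uffd, UFFDIO_API, &api))
                    return -1;

            /* register a range for missing-page tracking */
            reg.range.start = (unsigned long)area;
            reg.range.len = len;
            reg.mode = UFFDIO_REGISTER_MODE_MISSING;
            if (ioctl(uffd, UFFDIO_REGISTER, &reg))
                    return -1;

            /* from here on, poll(uffd) reports POLLIN whenever a userfault is pending */
            return uffd;
    }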
-
Andrea Arcangeli authored
If userfaultfd is armed on a certain vma we can't "fill" the holes with zeroes or we'll break userland on-demand paging. If the userfault is armed, the holes are really missing information (not zeroes) that userland has to load from the network or elsewhere. The same issue happens for wrprotected ptes, which we can't just convert into a single writable pmd_trans_huge. We could however in theory still merge across zeropages if only VM_UFFD_MISSING is set (so if VM_UFFD_WP is not set)... that could be slightly improved, but it'd be much more complex code for a tiny corner case.
-
Andrea Arcangeli authored
vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge must be aware of, so that we can merge vmas back together as they were originally before arming the userfaultfd on some memory range.
-
Andrea Arcangeli authored
This is where the page fault paths must be modified to call handle_userfault() if userfaultfd_missing() is true (i.e. if the vma->vm_flags have VM_UFFD_MISSING set). handle_userfault() then takes care of blocking the page fault and delivering it to userland. The fault flags must also be passed as a parameter so the read/write kind of fault can be reported to userland.
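Roughly, the hook in an anonymous-memory fault path looks like the following sketch (simplified fragment; the argument list of handle_userfault() in the actual series may differ):

    /* in the anonymous no-page fault path, before allocating the page */
    if (userfaultfd_missing(vma)) {
            /* drop the pte lock, then block the fault and notify the uffd reader */
            pte_unmap_unlock(page_table, ptl);
            return handle_userfault(vma, address, flags, VM_UFFD_MISSING);
    }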
-
Andrea Arcangeli authored
These two flags get set in vma->vm_flags to tell the VM common code whether the userfaultfd is armed and in which mode (only tracking missing faults, only tracking wrprotect faults, or both). If neither flag is set, the userfaultfd is not armed on the vma.
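In sketch form, the helpers the VM common code uses to test these bits look like this (close to what a userfaultfd_k.h-style header would provide, shown here only as an illustration):

    static inline bool userfaultfd_missing(struct vm_area_struct *vma)
    {
            return vma->vm_flags & VM_UFFD_MISSING;
    }

    static inline bool userfaultfd_wp(struct vm_area_struct *vma)
    {
            return vma->vm_flags & VM_UFFD_WP;
    }

    static inline bool userfaultfd_armed(struct vm_area_struct *vma)
    {
            return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
    }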
-
Andrea Arcangeli authored
This adds the vm_userfaultfd_ctx to the vm_area_struct.
-
Andrea Arcangeli authored
Kernel header defining the methods needed by the VM common code to interact with the userfaultfd.
-
Andrea Arcangeli authored
Defines the uAPI of the userfaultfd, notably the ioctl numbers and protocol.
-
Andrea Arcangeli authored
userfaultfd needs to wake all waitqueues (pass 0 as nr parameter), instead of the current hardcoded 1 (that would wake just the first waitqueue in the head list).
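For illustration (assuming the standard __wake_up() signature; ctx->fault_wqh and range are placeholder names, and the actual callsite in this series may use a different wrapper), passing 0 as nr_exclusive wakes every waiter instead of just the first exclusive one:

    /* nr_exclusive == 1: wake only the first exclusive waiter */
    __wake_up(&ctx->fault_wqh, TASK_NORMAL, 1, &range);

    /* nr_exclusive == 0: wake all waiters queued on the head */
    __wake_up(&ctx->fault_wqh, TASK_NORMAL, 0, &range);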
-
Andrea Arcangeli authored
Add documentation.
-
Andrea Arcangeli authored
memslot->userfault_addr is set by the kernel with an mmap executed from the kernel, but userland can still munmap it, leading to the oops below once memslot->userfault_addr points to a host virtual address that has no vma or mapping.

[ 327.538306] BUG: unable to handle kernel paging request at fffffffffffffffe
[ 327.538407] IP: [<ffffffff811a7b55>] put_page+0x5/0x50
[ 327.538474] PGD 1a01067 PUD 1a03067 PMD 0
[ 327.538529] Oops: 0000 [#1] SMP
[ 327.538574] Modules linked in: macvtap macvlan xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT iptable_filter ip_tables tun bridge stp llc rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ipmi_devintf iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp dcdbas intel_rapl kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core ipmi_si ipmi_msghandler acpi_pad wmi acpi_power_meter lpc_ich mfd_core mei_me
[ 327.539488] mei shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc mlx4_ib ib_sa ib_mad ib_core mlx4_en vxlan ib_addr ip_tunnel xfs libcrc32c sd_mod crc_t10dif crct10dif_common crc32c_intel mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm drm ahci i2c_core libahci mlx4_core libata tg3 ptp pps_core megaraid_sas ntb dm_mirror dm_region_hash dm_log dm_mod
[ 327.539956] CPU: 3 PID: 3161 Comm: qemu-kvm Not tainted 3.10.0-240.el7.userfault19.4ca4011.x86_64.debug #1
[ 327.540045] Hardware name: Dell Inc. PowerEdge R420/0CN7CM, BIOS 2.1.2 01/20/2014
[ 327.540115] task: ffff8803280ccf00 ti: ffff880317c58000 task.ti: ffff880317c58000
[ 327.540184] RIP: 0010:[<ffffffff811a7b55>] [<ffffffff811a7b55>] put_page+0x5/0x50
[ 327.540261] RSP: 0018:ffff880317c5bcf8 EFLAGS: 00010246
[ 327.540313] RAX: 00057ffffffff000 RBX: ffff880616a20000 RCX: 0000000000000000
[ 327.540379] RDX: 0000000000002014 RSI: 00057ffffffff000 RDI: fffffffffffffffe
[ 327.540445] RBP: ffff880317c5bd10 R08: 0000000000000103 R09: 0000000000000000
[ 327.540511] R10: 0000000000000000 R11: 0000000000000000 R12: fffffffffffffffe
[ 327.540576] R13: 0000000000000000 R14: ffff880317c5bd70 R15: ffff880317c5bd50
[ 327.540643] FS: 00007fd230b7f700(0000) GS:ffff880630800000(0000) knlGS:0000000000000000
[ 327.540717] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 327.540771] CR2: fffffffffffffffe CR3: 000000062a2c3000 CR4: 00000000000427e0
[ 327.540837] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 327.540904] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 327.540974] Stack:
[ 327.541008]  ffffffffa05d6d0c ffff880616a20000 0000000000000000 ffff880317c5bdc0
[ 327.541093]  ffffffffa05ddaa2 0000000000000000 00000000002191bf 00000042f3feab2d
[ 327.541177]  00000042f3feab2d 0000000000000002 0000000000000001 0321000000000000
[ 327.541261] Call Trace:
[ 327.541321]  [<ffffffffa05d6d0c>] ? kvm_vcpu_reload_apic_access_page+0x6c/0x80 [kvm]
[ 327.543615]  [<ffffffffa05ddaa2>] vcpu_enter_guest+0x3f2/0x10f0 [kvm]
[ 327.545918]  [<ffffffffa05e2f10>] kvm_arch_vcpu_ioctl_run+0x2b0/0x5a0 [kvm]
[ 327.548211]  [<ffffffffa05e2d02>] ? kvm_arch_vcpu_ioctl_run+0xa2/0x5a0 [kvm]
[ 327.550500]  [<ffffffffa05ca845>] kvm_vcpu_ioctl+0x2b5/0x680 [kvm]
[ 327.552768]  [<ffffffff810b8d12>] ? creds_are_invalid.part.1+0x12/0x50
[ 327.555069]  [<ffffffff810b8d71>] ? creds_are_invalid+0x21/0x30
[ 327.557373]  [<ffffffff812d6066>] ? inode_has_perm.isra.49.constprop.65+0x26/0x80
[ 327.559663]  [<ffffffff8122d985>] do_vfs_ioctl+0x305/0x530
[ 327.561917]  [<ffffffff8122dc51>] SyS_ioctl+0xa1/0xc0
[ 327.564185]  [<ffffffff816de829>] system_call_fastpath+0x16/0x1b
[ 327.566480] Code: 0b 31 f6 4c 89 e7 e8 4b 7f ff ff 0f 0b e8 24 fd ff ff e9 a9 fd ff ff 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 <48> f7 07 00 c0 00 00 55 48 89 e5 75 2a 8b 47 1c 85 c0 74 1e f0
-
Andrea Arcangeli authored
Just an optimization: where possible, use get_user_pages_fast.
-
Andrea Arcangeli authored
This teaches gup_fast and __gup_fast to re-enable irqs and call cond_resched() if possible every BATCH_PAGES pages. This must be implemented by the other archs as well, and it's a requirement before converting more get_user_pages() callers to get_user_pages_fast() as an optimization (instead of using get_user_pages_unlocked, which would be slower).
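A hedged sketch of the batching idea (illustrative only; the real change lives inside the arch gup_fast implementation, and BATCH_PAGES here is an assumed value):

    #define BATCH_PAGES     64      /* assumed batch size, for illustration */

    static int gup_fast_in_batches(unsigned long start, int nr_pages,
                                   int write, struct page **pages)
    {
            int done = 0;

            while (done < nr_pages) {
                    int chunk = min(nr_pages - done, BATCH_PAGES);
                    int ret;

                    /* irqs end up disabled only for one chunk at a time */
                    ret = __get_user_pages_fast(start + done * PAGE_SIZE,
                                                chunk, write, pages + done);
                    if (ret <= 0)
                            break;
                    done += ret;
                    if (ret < chunk)
                            break;
                    cond_resched(); /* give the scheduler a chance between batches */
            }
            return done;
    }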
-
Andrea Arcangeli authored
This adds compaction to zone_reclaim so that enabling THP won't decrease NUMA locality with /proc/sys/vm/zone_reclaim_mode > 0. It is important to boot with numa_zonelist_order=n (n means nodes) to get more accurate NUMA locality if there are multiple zones per node. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
-
Andrea Arcangeli authored
If we're in the fast path and zone_reclaim() succeeded, it means we freed enough memory and we can use the min watermark to have some margin against concurrent allocations from other CPUs or interrupts.
-
Andrea Arcangeli authored
Needed by zone_reclaim_mode compaction-awareness.
-
Andrea Arcangeli authored
Prevent the scaling down from reducing the watermarks too much. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
-
Andrea Arcangeli authored
The min wmark should be satisfied with just 1 hugepage, and the other wmarks should be adjusted accordingly. We need the low wmark check to succeed if there's a significant amount of order-0 pages, but we don't need plenty of high order pages, because the PF_MEMALLOC paths don't require those. Creating a ton of high order pages that cannot be allocated by the high order allocation paths (no PF_MEMALLOC) is quite wasteful, because they can be split into lower order pages before anybody has a chance to allocate them.
-
Andrea Arcangeli authored
If kswapd never needs to run (only __GFP_NO_KSWAPD allocations and plenty of free memory), compaction is otherwise crippled and stops running for a while after the free/isolation cursors meet. After that, allocations can fail for a full cycle of compaction_deferred, until compaction_restarting finally resets it again.

Stopping compaction for a full cycle after the cursors meet, even if it never failed and it's not going to fail, doesn't make sense. We already throttle compaction CPU utilization using defer_compaction. We shouldn't prevent compaction from running after each pass completes when the cursors meet, unless it failed. This makes direct compaction functional again. The throttling of direct compaction is still controlled by the defer_compaction logic.

kswapd still won't risk resetting compaction, and it will wait for direct compaction to do so. Not sure if this is ideal, but it at least decreases the risk of kswapd doing too much work. kswapd will only run one pass of compaction until some allocation invokes compaction again.

This decreased reliability of compaction was introduced in commit 62997027. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de>
-
Andrea Arcangeli authored
Reset the stats so /proc/sys/vm/compact_memory will scan all memory. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de>
-
Andrea Arcangeli authored
Zone reclaim locked (ZONE_RECLAIM_LOCKED) breaks zone_reclaim_mode=1. If more than one thread allocates memory at the same time, it forces premature allocation from remote NUMA nodes even when there's plenty of clean cache to reclaim in the local nodes. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
-
Oleg Nesterov authored
get/put_page(thp_tail) paths do get_page_unless_zero(page_head) + compound_lock(). In theory this page_head can be already freed and reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case get_page_unless_zero() can succeed right after set_page_refcounted(), and compound_lock() can race with the non-atomic __SetPageHead(). Perhaps we should rework the thp locking (under discussion), but until then this patch moves set_page_refcounted() and adds wmb() to ensure that page->_count != 0 comes as the last change. I am not sure about other callers of set_page_refcounted(), but at first glance they look fine to me. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
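In sketch form, the ordering this enforces (illustrative only, not the literal diff):

    /* initialize the compound page first ... */
    prep_compound_page(page, order);        /* __SetPageHead() etc. */

    /* ... then make those stores visible before the refcount store, */
    smp_wmb();

    /* so that page->_count != 0 is the last change an observer can see */
    set_page_refcounted(page);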
-
Andrea Arcangeli authored
With the previous two commits I cannot reproduce any ext4-related livelocks anymore; however, I hit ext4 memory corruption. ext4 thinks it can handle alloc_pages failures and it doesn't use __GFP_NOFAIL in some places, but it actually cannot. No surprise, as those error paths could never run, so they're likely untested. I logged all the stack traces of all ext4 failures that lead to the final ext4 corruption; at least one of them should be the culprit (the last ones are the more probable). The actual bug in the error paths should be found by code review (or the error paths should be deleted and __GFP_NOFAIL should be added to the gfp_mask).

Until ext4 is fixed, it is safer to treat !__GFP_FS like __GFP_NOFAIL if TIF_MEMDIE is not set (so we cannot exercise any new allocation error path in kernel threads, because they're never picked as OOM killer victims and TIF_MEMDIE never gets set on them). I assume other filesystems may have become complacent about this accommodating allocator behavior that cannot fail an allocation if invoked by a kernel thread too, but the longer we keep the __GFP_NOFAIL behavior in should_alloc_retry for small order allocations, the less robust these error paths will become and the harder it will be to remove this livelock-prone assumption in should_alloc_retry. In fact, we should remove that assumption not just for !__GFP_FS allocations.

In practice, with this fix there's no regression and all livelocks are still gone. The only risk in this approach is to exhaust the emergency reserves earlier than before, but only during OOM (during normal runtime GFP_ATOMIC allocation or other __GFP_MEMALLOC allocation reliability is not affected). Clearly this actually reduces the livelock risk (verified in practice too), so it is a low-risk net improvement to the OOM handling, with no risk of regression because this way no new allocation error paths are exercised.
-
Andrea Arcangeli authored
The previous commit fixed an ext4 livelock by not making !__GFP_FS allocations behave similarly to __GFP_NOFAIL, and I mentioned how __GFP_NOFAIL is livelock prone. After letting the trinity load run for a while I actually hit the very __GFP_NOFAIL livelock too:

#0  get_page_from_freelist (gfp_mask=0x20858, nodemask=0x0 <irq_stack_union>, order=0x0, zonelist=0xffff88007fffc100, high_zoneidx=0x2, alloc_flags=0xc0, preferred_zone=0xffff88007fffa840, classzone_idx=classzone_idx@entry=0x1, migratetype=migratetype@entry=0x2) at mm/page_alloc.c:1953
#1  0xffffffff81178e88 in __alloc_pages_slowpath (migratetype=0x2, classzone_idx=0x1, preferred_zone=0xffff88007fffa840, nodemask=0x0 <irq_stack_union>, high_zoneidx=ZONE_NORMAL, zonelist=0xffff88007fffc100, order=0x0, gfp_mask=0x20858) at mm/page_alloc.c:2597
#2  __alloc_pages_nodemask (gfp_mask=<optimized out>, order=0x0, zonelist=0xffff87fffffffffa, nodemask=0x0 <irq_stack_union>) at mm/page_alloc.c:2832
#3  0xffffffff811becab in alloc_pages_current (gfp=0x20858, order=0x0) at mm/mempolicy.c:2100
#4  0xffffffff8116e450 in alloc_pages (order=0x0, gfp_mask=0x20858) at include/linux/gfp.h:336
#5  __page_cache_alloc (gfp=0x20858) at mm/filemap.c:663
#6  0xffffffff8116f03c in pagecache_get_page (mapping=0xffff88007cc03908, offset=0xc920f, fgp_flags=0x7, cache_gfp_mask=0x20858, radix_gfp_mask=0x850) at mm/filemap.c:1096
#7  0xffffffff812160f4 in find_or_create_page (mapping=<optimized out>, gfp_mask=<optimized out>, offset=0xc920f) at include/linux/pagemap.h:336
#8  grow_dev_page (sizebits=0x0, size=0x1000, index=0xc920f, block=0xc920f, bdev=0xffff88007cc03580) at fs/buffer.c:1022
#9  grow_buffers (size=<optimized out>, block=<optimized out>, bdev=<optimized out>) at fs/buffer.c:1095
#10 __getblk_slow (size=0x1000, block=0xc920f, bdev=0xffff88007cc03580) at fs/buffer.c:1121
#11 __getblk (bdev=0xffff88007cc03580, block=0xc920f, size=0x1000) at fs/buffer.c:1395
#12 0xffffffff8125c8ed in sb_getblk (block=0xc920f, sb=<optimized out>) at include/linux/buffer_head.h:310
#13 ext4_read_block_bitmap_nowait (sb=0xffff88007c579000, block_group=0x2f) at fs/ext4/balloc.c:407
#14 0xffffffff8125ced4 in ext4_read_block_bitmap (sb=0xffff88007c579000, block_group=0x2f) at fs/ext4/balloc.c:489
#15 0xffffffff8167963b in ext4_mb_discard_group_preallocations (sb=0xffff88007c579000, group=0x2f, needed=0x38) at fs/ext4/mballoc.c:3798
#16 0xffffffff8129ddbd in ext4_mb_discard_preallocations (needed=0x38, sb=0xffff88007c579000) at fs/ext4/mballoc.c:4346
#17 ext4_mb_new_blocks (handle=0xffff88003305ee98, ar=0xffff88001f50b890, errp=0xffff88001f50b880) at fs/ext4/mballoc.c:4479
#18 0xffffffff81290fd3 in ext4_ext_map_blocks (handle=0xffff88003305ee98, inode=0xffff88007b85b178, map=0xffff88001f50ba50, flags=0x25) at fs/ext4/extents.c:4453
#19 0xffffffff81265688 in ext4_map_blocks (handle=0xffff88003305ee98, inode=0xffff88007b85b178, map=0xffff88001f50ba50, flags=0x25) at fs/ext4/inode.c:648
#20 0xffffffff8126af77 in mpage_map_one_extent (mpd=0xffff88001f50ba28, handle=0xffff88003305ee98) at fs/ext4/inode.c:2164
#21 mpage_map_and_submit_extent (give_up_on_write=<synthetic pointer>, mpd=0xffff88001f50ba28, handle=0xffff88003305ee98) at fs/ext4/inode.c:2219
#22 ext4_writepages (mapping=0xffff88007b85b350, wbc=0xffff88001f50bb60) at fs/ext4/inode.c:2557
#23 0xffffffff8117ce81 in do_writepages (mapping=0xffff88007b85b350, wbc=0xffff88001f50bb60) at mm/page-writeback.c:2046
#24 0xffffffff812096c0 in __writeback_single_inode (inode=0xffff88007b85b178, wbc=0xffff88001f50bb60) at fs/fs-writeback.c:460
#25 0xffffffff8120b311 in writeback_sb_inodes (sb=0xffff88007c579000, wb=0xffff88007bceb060, work=0xffff8800130f9d80) at fs/fs-writeback.c:687
#26 0xffffffff8120b68f in __writeback_inodes_wb (wb=0xffff88007bceb060, work=0xffff8800130f9d80) at fs/fs-writeback.c:732
#27 0xffffffff8120b94b in wb_writeback (wb=0xffff88007bceb060, work=0xffff8800130f9d80) at fs/fs-writeback.c:863
#28 0xffffffff8120befc in wb_do_writeback (wb=0xffff88007bceb060) at fs/fs-writeback.c:998
#29 bdi_writeback_workfn (work=0xffff88007bceb078) at fs/fs-writeback.c:1043
#30 0xffffffff81092cf5 in process_one_work (worker=0xffff88002c555e80, work=0xffff88007bceb078) at kernel/workqueue.c:2081
#31 0xffffffff8109376b in worker_thread (__worker=0xffff88002c555e80) at kernel/workqueue.c:2212
#32 0xffffffff8109ba54 in kthread (_create=0xffff88007bf2e2c0) at kernel/kthread.c:207
#33 <signal handler called>
#34 0x0000000000000000 in irq_stack_union ()
#35 0x0000000000000000 in ?? ()

To solve this I manually set ALLOC_NO_WATERMARKS in alloc_flags with gdb, and the livelock resolved itself.

The fix simply allows a __GFP_NOFAIL allocation to get access to the emergency reserves in the buddy allocator if __GFP_NOFAIL triggers a reclaim failure signaling an out of memory condition. Worst case it'll deadlock because we run out of emergency reserves, but not giving it access to the emergency reserves after __GFP_NOFAIL hits an out of memory condition may actually result in a livelock even though there are still ~50MB free! So this is safer. After applying this OOM livelock fix I cannot reproduce the livelock anymore in __GFP_NOFAIL.
-