oom: allow !__GFP_FS allocations access emergency reserves like __GFP_NOFAIL

With the previous two commits I can no longer reproduce any ext4-related livelocks; however, I now hit ext4 memory corruption. ext4 assumes it can handle alloc_pages failing, and in some places it does not use __GFP_NOFAIL, but in fact it cannot handle the failure. This is no surprise: those error paths could never run before, so they are likely untested. I logged the stack traces of all the ext4 allocation failures that led up to the final ext4 corruption; at least one of them should be the culprit (the last ones are the most probable). The actual bug in the error paths should be found by code review (or the error paths should be deleted and __GFP_NOFAIL added to the gfp_mask).

Until ext4 is fixed, it is safer to treat !__GFP_FS like __GFP_NOFAIL if TIF_MEMDIE is not set. That way no new allocation error path can be exercised in kernel threads, because they are never picked as OOM killer victims and TIF_MEMDIE never gets set on them.

I assume other filesystems may also have come to depend on this accommodating allocator behavior, which cannot fail an allocation invoked by a kernel thread. But the longer we keep the __GFP_NOFAIL behavior in should_alloc_retry for small-order allocations, the less robust these error paths will become, and the harder it will be to remove this livelock-prone assumption from should_alloc_retry. In fact we should remove that assumption for all allocations, not just for !__GFP_FS ones.

In practice, with this fix there is no regression and all livelocks are still gone. The only risk in this approach is exhausting the emergency reserves earlier than before, and only during OOM (during normal runtime, the reliability of GFP_ATOMIC and other __GFP_MEMALLOC allocations is not affected). This actually reduces the livelock risk (verified in practice too), so it is a low-risk net improvement to the OOM handling, with no risk of regression because no new allocation error paths are exercised.
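
As a rough illustration of the rule described above, the following is a minimal user-space sketch of the test the patched gfp_emergency() applies. It is not kernel code: the MODEL_* bit values and the model_gfp_emergency() name are hypothetical placeholders, and the real function additionally recomputes alloc_flags through gfp_to_alloc_flags(), which is left out here.

```c
#include <stdbool.h>

/*
 * Placeholder bit values for illustration only; they are NOT the
 * kernel's real __GFP_* definitions.
 */
#define MODEL_GFP_FS         0x01u
#define MODEL_GFP_NOFAIL     0x02u
#define MODEL_GFP_MEMALLOC   0x04u
#define MODEL_GFP_NOMEMALLOC 0x08u

/*
 * Hypothetical model of the patched check: an order-0 allocation that
 * is either __GFP_NOFAIL or !__GFP_FS is treated as livelock prone and
 * is given access to the emergency reserves.
 * Returns true when the mask was upgraded.
 */
static bool model_gfp_emergency(unsigned int *gfp_mask, unsigned int order)
{
	bool livelock_prone = (*gfp_mask & MODEL_GFP_NOFAIL) ||
			      !(*gfp_mask & MODEL_GFP_FS);

	if (!livelock_prone || order != 0)
		return false;

	*gfp_mask |= MODEL_GFP_MEMALLOC;	/* allow dipping into reserves */
	*gfp_mask &= ~MODEL_GFP_NOMEMALLOC;	/* drop any opt-out of reserves */
	return true;
}
```

In this model a mask with MODEL_GFP_FS set and MODEL_GFP_NOFAIL clear is left untouched, as is any request with order greater than zero; only the order-0 !__GFP_FS and __GFP_NOFAIL cases are upgraded.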

fa175d10 · Andrea Arcangeli · 47fb3887 · fa175d10
Commit fa175d10 authored Jun 23, 2014 by Andrea Arcangeli

Showing with 10 additions and 8 deletions

mm/page_alloc.c +10 -8

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2359,7 +2359,7 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 	 * the PG_lock and in turn preventing the OOM killer victim
 	 * task to exit).
 	 */
-	if (order <= PAGE_ALLOC_COSTLY_ORDER && (gfp_mask & __GFP_FS))
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
 		return 1;

 	/*
@@ -2377,16 +2377,17 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,

 static inline int gfp_to_alloc_flags(gfp_t gfp_mask);

-static void gfp_nofail_emergency(gfp_t *gfp_mask, int *alloc_flags,
-				 unsigned int order)
+static void gfp_emergency(gfp_t *gfp_mask, int *alloc_flags,
+			  unsigned int order)
 {
 	/*
 	 * If we reached an out of memory condition in the context of
-	 * a __GFP_NOFAIL (in turn livelock prone) allocation try to
-	 * give access to the emergency pools, otherwise we could
-	 * livelock.
+	 * a __GFP_NOFAIL or a !__GFP_FS (in turn livelock prone)
+	 * allocation try to give access to the emergency pools,
+	 * otherwise we could livelock.
 	 */
-	if ((*gfp_mask & __GFP_NOFAIL) && !order) {
+	if (((*gfp_mask & __GFP_NOFAIL) || !(*gfp_mask & __GFP_FS)) &&
+	    !order) {
 		*gfp_mask |= __GFP_MEMALLOC;
 		*gfp_mask &= ~__GFP_NOMEMALLOC;
 		*alloc_flags = gfp_to_alloc_flags(*gfp_mask);
@@ -2434,6 +2435,7 @@ __alloc_pages_may_oom(gfp_t *gfp_mask, unsigned int order, int *alloc_flags,
 			goto out;
 		/* The OOM killer does not compensate for light reclaim */
 		if (!(*gfp_mask & __GFP_FS)) {
+			gfp_emergency(gfp_mask, alloc_flags, order);
 			/*
 			 * XXX: Page reclaim didn't yield anything,
 			 * and the OOM killer can't be invoked, but
@@ -2446,7 +2448,7 @@ __alloc_pages_may_oom(gfp_t *gfp_mask, unsigned int order, int *alloc_flags,
 		if (*gfp_mask & __GFP_THISNODE)
 			goto out;
 	} else
-		gfp_nofail_emergency(gfp_mask, alloc_flags, order);
+		gfp_emergency(gfp_mask, alloc_flags, order);
 	/* Exhausted what can be done so it's blamo time */
 	if (out_of_memory(ac->zonelist, *gfp_mask, order, ac->nodemask, false)
 			|| WARN_ON_ONCE(*gfp_mask & __GFP_NOFAIL))