1. 11 Dec, 2014 40 commits
    • mm: memcontrol: clarify migration where old page is uncharged · 7d5e3245
      Johannes Weiner authored
      Better explain re-entrant migration when compaction races with reclaim,
      and also mention swapcache readahead pages as possible uncharged migration
      sources.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d5e3245
    • mm: memcontrol: update mem_cgroup_page_lruvec() documentation · dfe0e773
      Johannes Weiner authored
      Commit 7512102c ("memcg: fix GPF when cgroup removal races with last
      exit") added a pc->mem_cgroup reset into mem_cgroup_page_lruvec() to
      prevent a crash where an anon page gets uncharged on unmap, the memcg is
      released, and then the final LRU isolation on free dereferences the
      stale pc->mem_cgroup pointer.
      
      But since commit 0a31bc97 ("mm: memcontrol: rewrite uncharge API"),
      pages are only uncharged AFTER that final LRU isolation, which
      guarantees the memcg's lifetime until then.  pc->mem_cgroup now only
      needs to be reset for swapcache readahead pages.
      
      Update the comment and callsite requirements accordingly.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dfe0e773
    • memcg: simplify unreclaimable groups handling in soft limit reclaim · bc2f2e7f
      Vladimir Davydov authored
      If we fail to reclaim anything from a cgroup during a soft reclaim pass
      we want to get the next largest cgroup exceeding its soft limit. To
      achieve this, we should obviously remove the current group from the tree
      and then pick the largest group. Currently we have a weird loop instead.
      Let's simplify it.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bc2f2e7f
    • mm/numa balancing: rearrange Kconfig entry · 6f7c97e8
      Aneesh Kumar K.V authored
      Add the default-enable config option after the NUMA_BALANCING option so
      that the two appear together in the nconfig interface.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f7c97e8
    • mm, compaction: more focused lru and pcplists draining · fdaf7f5c
      Vlastimil Babka authored
      The goal of memory compaction is to create high-order freepages through
      page migration.  Page migration, however, puts pages on the per-cpu
      lru_add cache, which is later flushed to the per-cpu pcplists, and only
      after the pcplists are drained can the pages actually merge.  This can
      happen either when the per-cpu caches fill up through further freeing, or
      explicitly.
      
      During direct compaction, it is useful to do the draining explicitly so
      that pages merge as soon as possible and compaction can detect success
      immediately, keeping the latency impact to a minimum.  However, the current
      implementation is far from ideal.  Draining is done only in
      __alloc_pages_direct_compact(), after all zones have already been compacted,
      so the decisions to continue or stop compaction in individual zones are made
      without the last batch of migrations having been merged.  The draining of
      the lru_add cache before the pcplists is also missing.
      
      This patch moves the draining for direct compaction into compact_zone().
      It adds the missing lru_cache draining and uses the newly introduced
      single zone pcplists draining to reduce overhead and avoid impact on
      unrelated zones.  Draining is only performed when it can actually lead to
      merging of a page of desired order (passed by cc->order).  This means it
      is only done when migration occurred in the previously scanned cc->order
      aligned block(s) and the migration scanner is now pointing to the next
      cc->order aligned block.
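
      The gating condition can be sketched roughly as follows (a simplified
      illustration of the logic described above; names such as
      cc->last_migrated_pfn are illustrative and may not match the final code):

        if (cc->order > 0 && cc->last_migrated_pfn) {
                unsigned long current_block_start =
                        cc->migrate_pfn & ~((1UL << cc->order) - 1);

                /* scanner moved past the block(s) where migration happened? */
                if (cc->last_migrated_pfn < current_block_start) {
                        int cpu = get_cpu();

                        lru_add_drain_cpu(cpu);         /* flush lru_add cache */
                        drain_local_pages(zone);        /* this zone's pcplists only */
                        put_cpu();
                        cc->last_migrated_pfn = 0;
                }
        }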
      
      The patch has been tested with stress-highalloc benchmark from mmtests.
      Although overall allocation success rates of the benchmark were not
      affected, the number of detected compaction successes has doubled.  This
      suggests that allocations were previously successful due to implicit
      merging caused by background activity, making a later allocation attempt
      succeed immediately, but not attributing the success to compaction.  Since
      stress-highalloc always tries to allocate almost the whole memory, it
      cannot show the improvement in its reported success rate metric.  However
      after this patch, compaction should detect success and terminate earlier,
      reducing the direct compaction latencies in a real scenario.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fdaf7f5c
    • mm, compaction: always update cached scanner positions · 6bace090
      Vlastimil Babka authored
      Compaction caches the migration and free scanner positions between
      compaction invocations, so that the whole zone gets eventually scanned and
      there is no bias towards the initial scanner positions at the
      beginning/end of the zone.
      
      The cached positions are continuously updated as scanners progress and the
      updating stops as soon as a page is successfully isolated.  The reasoning
      behind this is that a pageblock where isolation succeeded is likely to
      succeed again in near future and it should be worth revisiting it.
      
      However, the downside is that potentially many pages are rescanned without
      successful isolation.  At worst, there might be a page where isolation
      from LRU succeeds but migration fails (potentially always).  So upon
      encountering this page, cached position would always stop being updated
      for no good reason.  It might have been useful to let such page be
      rescanned with sync compaction after async one failed, but this is now
      handled by caching scanner position for async and sync mode separately
      since commit 35979ef3 ("mm, compaction: add per-zone migration pfn
      cache for async compaction").
      
      After this patch, cached positions are updated unconditionally.  In
      stress-highalloc benchmark, this has decreased the number of scanned
      pages by a few percent, without affecting allocation success rates.
      
      To prevent free scanner from leaving free pages behind after they are
      returned due to page migration failure, the cached scanner pfn is changed
      to point to the pageblock of the returned free page with the highest pfn,
      before leaving compact_zone().
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6bace090
    • mm, compaction: defer only on COMPACT_COMPLETE · f8669795
      Vlastimil Babka authored
      Deferred compaction is employed to avoid compacting a zone where sync direct
      compaction has recently failed.  As such, it makes sense to only defer
      when a full zone was scanned, which is when compact_zone() returns with
      COMPACT_COMPLETE.  It's less useful to defer when compact_zone() returns
      with apparent success (COMPACT_PARTIAL), followed by a watermark check
      failure, which can happen due to parallel allocation activity.  It also
      does not make much sense to defer compaction that was skipped entirely
      (COMPACT_SKIPPED) for being unsuitable in the first place.
      
      This patch therefore makes deferred compaction trigger only when
      COMPACT_COMPLETE is returned from compact_zone().  Results of
      stress-highalloc benchmark show the difference is within measurement error,
      so the issue is rather cosmetic.
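
      In code terms the change in the direct compaction path boils down to
      something like this (a sketch, not the verbatim hunk):

        /* only a fully scanned zone gets deferred now */
        if (status == COMPACT_COMPLETE)
                defer_compaction(zone, order);
        /* COMPACT_PARTIAL followed by a failed watermark check, and
         * COMPACT_SKIPPED, no longer trigger deferral */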
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f8669795
    • mm, compaction: simplify deferred compaction · 97d47a65
      Vlastimil Babka authored
      Since commit 53853e2d ("mm, compaction: defer each zone individually
      instead of preferred zone"), compaction is deferred for each zone where
      sync direct compaction fails, and reset where it succeeds.  However, it
      was observed that for the DMA zone, compaction often appeared to succeed
      while the subsequent allocation attempt would not, due to a different
      outcome of the watermark check.
      
      In order to properly defer compaction in this zone, the candidate zone
      has to be passed back to __alloc_pages_direct_compact() and compaction
      deferred in the zone after the allocation attempt fails.
      
      The large source of mismatch between watermark check in compaction and
      allocation was the lack of alloc_flags and classzone_idx values in
      compaction, which has been fixed in the previous patch.  So with this
      problem fixed, we can simplify the code by removing the candidate_zone
      parameter and deferring in __alloc_pages_direct_compact().
      
      After this patch, the compaction activity during stress-highalloc
      benchmark is still somewhat increased, but it's negligible compared to the
      increase that occurred without the better watermark checking.  This
      suggests that it is still possible to apparently succeed in compaction but
      fail to allocate, possibly due to parallel allocation activity.
      
      [akpm@linux-foundation.org: fix build]
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97d47a65
    • mm, compaction: pass classzone_idx and alloc_flags to watermark checking · ebff3980
      Vlastimil Babka authored
      Compaction relies on zone watermark checks for decisions such as if it's
      worth to start compacting in compaction_suitable() or whether compaction
      should stop in compact_finished().  The watermark checks take
      classzone_idx and alloc_flags parameters, which are related to the memory
      allocation request.  But from the context of compaction they are currently
      passed as 0, even for direct compaction, which is invoked to satisfy
      the allocation request and could therefore use the proper values.
      
      The lack of proper values can lead to mismatch between decisions taken
      during compaction and decisions related to the allocation request.  Lack
      of proper classzone_idx value means that lowmem_reserve is not taken into
      account.  This has manifested (during recent changes to deferred
      compaction) when DMA zone was used as fallback for preferred Normal zone.
      compaction_suitable() without proper classzone_idx would think that the
      watermarks are already satisfied, but watermark check in
      get_page_from_freelist() would fail.  Because of this problem, deferring
      compaction has extra complexity that can be removed in the following
      patch.
      
      The issue (not confirmed in practice) with missing alloc_flags is the
      opposite in nature.  For allocations that include ALLOC_HIGH, ALLOC_HARDER
      or ALLOC_CMA in alloc_flags (the last covers all MOVABLE allocations on
      CMA-enabled systems), the watermark checking in compaction with 0 passed
      will be stricter than in get_page_from_freelist().  In these cases
      compaction might run for longer than is really needed.
      
      Another issue in compaction_suitable() is that the check for "does the zone
      need compaction at all?" comes only after the check "does the zone have
      enough free pages to succeed at compaction".  The latter considers extra
      pages for migration and can therefore in some situations fail and return
      COMPACT_SKIPPED, although the high-order allocation would succeed and we
      should return COMPACT_PARTIAL.
      
      This patch fixes these problems by adding alloc_flags and classzone_idx to
      struct compact_control and related functions involved in direct compaction
      and watermark checking.  Where possible, all other callers of
      compaction_suitable() pass proper values where those are known.  This is
      currently limited to classzone_idx, which is sometimes known in kswapd
      context.  However, the direct reclaim callers should_continue_reclaim()
      and compaction_ready() do not currently know the proper values, so the
      coordination between reclaim and compaction may still not be as accurate
      as it could.  This can be fixed later, if it's shown to be an issue.
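
      A rough sketch of the plumbing described above (field and parameter names
      follow the changelog; the exact declarations may differ):

        struct compact_control {
                /* existing fields omitted */
                int order;              /* order of the pending allocation */
                int alloc_flags;        /* alloc flags of the allocation request */
                int classzone_idx;      /* zone index of the allocation request */
        };

        /* compaction_suitable() can then mirror the allocator's own check: */
        unsigned long watermark = low_wmark_pages(zone) + (2UL << order);

        if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags))
                return COMPACT_SKIPPED;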
      
      Additionally, the checks in compaction_suitable() are reordered to address
      the second issue described above.
      
      The effect of this patch should be slightly better high-order allocation
      success rates and/or less compaction overhead, depending on the type of
      allocations and presence of CMA.  It allows simplifying deferred
      compaction code in a followup patch.
      
      When testing with stress-highalloc, there was some slight improvement
      (which might be just due to variance) in success rates of non-THP-like
      allocations.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebff3980
    • mm: vmscan: count only dirty pages as congested · 1da58ee2
      Jamie Liu authored
      shrink_page_list() counts all pages with a mapping, including clean pages,
      toward nr_congested if they're on a write-congested BDI.
      shrink_inactive_list() then sets ZONE_CONGESTED if nr_dirty ==
      nr_congested.  Fix this apples-to-oranges comparison by only counting
      pages for nr_congested if they count for nr_dirty.
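
      Roughly, the accounting in shrink_page_list() becomes (an illustrative
      sketch of the shape of the check, not the exact hunk):

        if (dirty || writeback)
                nr_dirty++;

        /* a clean page on a congested BDI no longer counts as congested */
        if ((dirty || writeback) && mapping &&
            bdi_write_congested(mapping->backing_dev_info))
                nr_congested++;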
      Signed-off-by: Jamie Liu <jamieliu@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1da58ee2
    • mm: verify compound order when freeing a page · ab1f306f
      Yu Zhao authored
      This allows us to catch the bug fixed in the previous patch (mm: free
      compound page with correct order).
      
      Here we also verify whether a page is a tail page or not -- tail pages are
      supposed to be freed along with their head, not by themselves.
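
      The added sanity checks on the free path amount to something like the
      following (a sketch; the exact predicates may differ):

        VM_BUG_ON_PAGE(PageTail(page), page);   /* tails are freed via their head */
        VM_BUG_ON_PAGE(PageHead(page) && compound_order(page) != order, page);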
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab1f306f
    • cma: make default CMA area size zero for x86 · d7be003a
      Akinobu Mita authored
      This makes the default CMA memory area size zero for x86 (the other
      architectures are unchanged).  If the default CMA size is zero, DMA_CMA is
      disabled.  It can still be enabled by passing cma= to the kernel.

      This reduces the impact on x86: there is no mainline driver that requires
      it on x86, and Peter Hurley reported a performance regression caused by
      trying to drive _all_ dma mapping allocations through a _very_ small
      window.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Reported-by: Peter Hurley <peter@hurleysoftware.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
      Cc: Jean Delvare <jdelvare@suse.de>
      Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7be003a
    • mm, memory_hotplug/failure: drain single zone pcplists · c0554329
      Vlastimil Babka authored
      Memory hotplug and failure mechanisms have several places where pcplists
      are drained so that pages are returned to the buddy allocator and can be
      e.g. prepared for offlining.  Since this is always done in the context of
      a single zone, we can reduce the pcplists drain to that single zone, which
      is now possible.
      
      The change should make memory offlining due to hotremove or failure
      faster and no longer disturb unrelated pcplists.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0554329
    • mm, cma: drain single zone pcplists · 510f5507
      Vlastimil Babka authored
      CMA allocation drains pcplists so that pages can merge back into the
      buddy allocator.  Since it operates on a single zone, we can reduce the
      pcplists drain to that single zone, which is now possible.
      
      The change should make CMA allocations faster and no longer disturb
      unrelated pcplists.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      510f5507
    • mm, page_isolation: drain single zone pcplists · ec25af84
      Vlastimil Babka authored
      When setting MIGRATE_ISOLATE on a pageblock, pcplists are drained to
      have a better chance that all pages will be successfully isolated and
      not left in the per-cpu caches.  Since isolation is always concerned
      with a single zone, we can reduce the pcplists drain to the single zone,
      which is now possible.
      
      The change should make memory isolation faster and no longer disturb
      unrelated pcplists.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec25af84
    • mm: introduce single zone pcplists drain · 93481ff0
      Vlastimil Babka authored
      The functions for draining per-cpu pages back to buddy allocators
      currently always operate on all zones.  There are however several cases
      where the drain is only needed in the context of a single zone, and
      spilling other pcplists is a waste of time both due to the extra
      spilling and later refilling.
      
      This patch introduces a new zone pointer parameter to drain_all_pages()
      and changes the dummy parameter of drain_local_pages() to also be a zone
      pointer.  When NULL is passed, the functions operate on all zones as
      usual.  Passing a specific zone pointer reduces the work to that single
      zone.
      
      All callers are updated to pass the NULL pointer in this patch.
      Conversion to single zone (where appropriate) is done in further
      patches.
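
      The resulting entry points look roughly like this, where a NULL zone keeps
      the old all-zones behaviour:

        void drain_local_pages(struct zone *zone);  /* this CPU's pcplists  */
        void drain_all_pages(struct zone *zone);    /* pcplists of all CPUs */

        drain_all_pages(NULL);      /* all zones, as before */
        drain_all_pages(zone);      /* only the zone being offlined/isolated/etc. */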
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93481ff0
    • mm/vmscan.c: replace printk with pr_err · 8612c663
      Pintu Kumar authored
      This patch replaces printk(KERN_ERR ...) with pr_err() in shrink_slab().
      It also saves one line that was previously needed for formatting.
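
      A generic illustration of the conversion (the actual message in
      shrink_slab() differs):

        printk(KERN_ERR
               "shrink_slab: negative objects to delete nr=%ld\n", nr);
        /* becomes a single line: */
        pr_err("shrink_slab: negative objects to delete nr=%ld\n", nr);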
      Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8612c663
    • mm/vmalloc.c: replace printk with pr_warn · 0cbc8533
      Pintu Kumar authored
      This patch replaces printk(KERN_WARNING ...) with pr_warn().
      It also saves one line that was previously needed for formatting.
      Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0cbc8533
    • Anton Blanchard
    • mm: memcontrol: remove synchronous stock draining code · 6d3d6aa2
      Johannes Weiner authored
      With charge reparenting gone, the last synchronous stock drainer is gone
      as well.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d3d6aa2
    • mm: memcontrol: continue cache reclaim from offlined groups · b2052564
      Johannes Weiner authored
      On cgroup deletion, outstanding page cache charges are moved to the parent
      group so that they're not lost and can be reclaimed during pressure
      on/inside said parent.  But this reparenting is fairly tricky and its
      synchronous nature has led to several lock-ups in the past.
      
      Since c2931b70 ("cgroup: iterate cgroup_subsys_states directly") css
      iterators now also include offlined css, so memcg iterators can be changed
      to include offlined children during reclaim of a group, and leftover cache
      can just stay put.
      
      There is a slight change of behavior in that charges of deleted groups no
      longer show up as local charges in the parent.  But they are still
      included in the parent's hierarchical statistics.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2052564
    • mm: memcontrol: remove obsolete kmemcg pinning tricks · 64f21993
      Johannes Weiner authored
      As charges now pin the css explicitly, there is no more need for kmemcg
      to acquire a proxy reference for outstanding pages during offlining, or
      maintain state to identify such "dead" groups.
      
      This was the last user of the uncharge functions' return values, so remove
      them as well.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64f21993
    • mm: memcontrol: take a css reference for each charged page · e8ea14cc
      Johannes Weiner authored
      Charges currently pin the css indirectly by playing tricks during
      css_offline(): user pages stall the offlining process until all of them
      have been reparented, whereas kmemcg acquires a keep-alive reference if
      outstanding kernel pages are detected at that point.
      
      In preparation for removing all this complexity, make the pinning explicit
      and acquire a css reference for every charged page.
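
      Conceptually, the charge and uncharge paths now bracket every page with a
      plain css reference (a much simplified sketch):

        /* charge path: one reference per charged page */
        css_get(&memcg->css);

        /* uncharge path: drop it again */
        css_put(&memcg->css);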
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e8ea14cc
    • mm: memcontrol: convert reclaim iterator to simple css refcounting · 5ac8fb31
      Johannes Weiner authored
      The memcg reclaim iterators use a complicated weak reference scheme to
      prevent pinning cgroups indefinitely in the absence of memory pressure.
      
      However, during the ongoing cgroup core rework, css lifetime has been
      decoupled such that a pinned css no longer interferes with removal of
      the user-visible cgroup, and all this complexity is now unnecessary.
      
      [mhocko@suse.cz: ensure that the cached reference is always released]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ac8fb31
    • kernel: res_counter: remove the unused API · 5b1efc02
      Johannes Weiner authored
      All memory accounting and limiting has been switched over to the
      lockless page counters.  Bye, res_counter!
      
      [akpm@linux-foundation.org: update Documentation/cgroups/memory.txt]
      [mhocko@suse.cz: ditch the last remainings of res_counter]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b1efc02
    • mm: hugetlb_cgroup: convert to lockless page counters · 71f87bee
      Johannes Weiner authored
      Abandon the spinlock-protected byte counters in favor of the unlocked
      page counters in the hugetlb controller as well.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71f87bee
    • mm: memcontrol: lockless page counters · 3e32cb2e
      Johannes Weiner authored
      Memory is internally accounted in bytes, using spinlock-protected 64-bit
      counters, even though the smallest accounting delta is a page.  The
      counter interface is also convoluted and does too many things.
      
      Introduce a new lockless word-sized page counter API, then change all
      memory accounting over to it.  The translation from and to bytes then only
      happens when interfacing with userspace.
      
      The removed locking overhead is noticeable when scaling beyond the per-cpu
      charge caches - on a 4-socket machine with 144 threads, the following test
      shows the performance difference of 288 memcgs concurrently running a
      page fault benchmark:
      
      vanilla:
      
         18631648.500498      task-clock (msec)         #  140.643 CPUs utilized            ( +-  0.33% )
               1,380,638      context-switches          #    0.074 K/sec                    ( +-  0.75% )
                  24,390      cpu-migrations            #    0.001 K/sec                    ( +-  8.44% )
           1,843,305,768      page-faults               #    0.099 M/sec                    ( +-  0.00% )
      50,134,994,088,218      cycles                    #    2.691 GHz                      ( +-  0.33% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       8,049,712,224,651      instructions              #    0.16  insns per cycle          ( +-  0.04% )
       1,586,970,584,979      branches                  #   85.176 M/sec                    ( +-  0.05% )
           1,724,989,949      branch-misses             #    0.11% of all branches          ( +-  0.48% )
      
           132.474343877 seconds time elapsed                                          ( +-  0.21% )
      
      lockless:
      
         12195979.037525      task-clock (msec)         #  133.480 CPUs utilized            ( +-  0.18% )
                 832,850      context-switches          #    0.068 K/sec                    ( +-  0.54% )
                  15,624      cpu-migrations            #    0.001 K/sec                    ( +- 10.17% )
           1,843,304,774      page-faults               #    0.151 M/sec                    ( +-  0.00% )
      32,811,216,801,141      cycles                    #    2.690 GHz                      ( +-  0.18% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       9,999,265,091,727      instructions              #    0.30  insns per cycle          ( +-  0.10% )
       2,076,759,325,203      branches                  #  170.282 M/sec                    ( +-  0.12% )
           1,656,917,214      branch-misses             #    0.08% of all branches          ( +-  0.55% )
      
            91.369330729 seconds time elapsed                                          ( +-  0.45% )
      
      On top of improved scalability, this also gets rid of the icky long long
      types in the very heart of memcg, which is great for 32 bit and also makes
      the code a lot more readable.
      
      Notable differences between the old and new API:
      
      - res_counter_charge() and res_counter_charge_nofail() become
        page_counter_try_charge() and page_counter_charge() resp. to match
        the more common kernel naming scheme of try_do()/do()
      
      - res_counter_uncharge_until() is only ever used to cancel a local
        counter and never to uncharge bigger segments of a hierarchy, so
        it's replaced by the simpler page_counter_cancel()
      
      - res_counter_set_limit() is replaced by page_counter_limit(), which
        expects its callers to serialize against themselves
      
      - res_counter_memparse_write_strategy() is replaced by
        page_counter_memparse(), which rounds down to the nearest page size
        rather than up.  This is more reasonable for explicitly requested
        hard upper limits.
      
      - to keep charging light-weight, page_counter_try_charge() charges
        speculatively, only to roll back if the result exceeds the limit.
        Because of this, a failing bigger charge can temporarily lock out
        smaller charges that would otherwise succeed.  The error is bounded
        to the difference between the smallest and the biggest possible
        charge size, so for memcg, this means that a failing THP charge can
        send base page charges into reclaim up to 2MB (4MB) before the limit
        would have been reached.  This should be acceptable (see the sketch
        below).
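
      The speculative charge-and-roll-back idea can be illustrated with a small
      userspace sketch using C11 atomics (this is not the kernel's page_counter
      implementation, just the idea behind it):

        #include <stdatomic.h>
        #include <stdbool.h>

        struct counter_sketch {
                _Atomic long count;
                long limit;
        };

        /* add speculatively, roll back if the limit was overshot */
        static bool try_charge(struct counter_sketch *c, long nr_pages)
        {
                long new_count = atomic_fetch_add(&c->count, nr_pages) + nr_pages;

                if (new_count > c->limit) {
                        atomic_fetch_sub(&c->count, nr_pages);  /* roll back */
                        return false;
                }
                return true;
        }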
      
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e32cb2e
    • slab: replace smp_read_barrier_depends() with lockless_dereference() · 8df0c2dc
      Pranith Kumar authored
      Recently lockless_dereference() was added, which can be used in place of
      hard-coding smp_read_barrier_depends().  This patch makes that change.
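
      The conversion pattern looks like this (a generic illustration, not the
      exact mm/slab.c hunk):

        /* before */
        entry = ACCESS_ONCE(head->next);
        smp_read_barrier_depends();     /* order dependent loads after the read */

        /* after */
        entry = lockless_dereference(head->next);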
      Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8df0c2dc
    • slab: improve checking for invalid gfp_flags · c871ac4e
      Andrew Morton authored
      The code goes BUG, but doesn't tell us which bits were unexpectedly set.
      Print that out.
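
      The idea is roughly the following (a sketch of the shape of the check, not
      the verbatim hunk):

        if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
                pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
                BUG();
        }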
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c871ac4e
    • mm: slub: fix format mismatches in slab_err() callers · f6edde9c
      Andrey Ryabinin authored
      Adding __printf(3, 4) to slab_err() exposed the following:
      
        mm/slub.c: In function `check_slab':
        mm/slub.c:852:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
            s->name, page->objects, maxobj);
            ^
        mm/slub.c:852:4: warning: too many arguments for format [-Wformat-extra-args]
        mm/slub.c:857:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
            s->name, page->inuse, page->objects);
            ^
        mm/slub.c:857:4: warning: too many arguments for format [-Wformat-extra-args]
      
        mm/slub.c: In function `on_freelist':
        mm/slub.c:905:4: warning: format `%d' expects argument of type `int', but argument 5 has type `long unsigned int' [-Wformat=]
            "should be %d", page->objects, max_objects);
      
      Fix the first two warnings by removing the redundant s->name.  Fix the
      last one by changing the type of max_objects from unsigned long to int.
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6edde9c
    • mm/slab: reverse iteration on find_mergeable() · 54362057
      Joonsoo Kim authored
      Unlike SLUB, in SLAB an object sometimes does not start at the beginning
      of the slab.  This caused an alignment problem once slab merging became
      supported by commit 12220dea ("mm/slab: support slab merge"), so an
      alignment mismatch check was introduced ("mm/slab: fix unalignment problem
      on Malta with EVA due to slab merge") to prevent merging in that case.

      This has the undesirable result that merging tends to happen between
      infrequently used kmem_caches: for example, kmem_caches of size 256 bytes
      end up merged into pool_workqueue rather than kmalloc-256, because the
      kmem_caches for kmalloc are at the tail of the list.

      To prevent this situation, this patch reverses the iteration order in
      find_mergeable() so that frequently used kmem_caches are found first.
      This helps merge a kmem_cache into frequently used kmem_caches, such as
      the kmalloc kmem_caches.
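
      In effect the lookup now walks the cache list from the tail, where the
      long-lived kmalloc caches sit (a sketch):

        /* before: first match wins, starting from the head (newest caches) */
        list_for_each_entry(s, &slab_caches, list) { /* match checks */ }

        /* after: walk from the tail so kmalloc-* caches are preferred */
        list_for_each_entry_reverse(s, &slab_caches, list) { /* match checks */ }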
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      54362057
    • slab: print slabinfo header in seq show · 1df3b26f
      Vladimir Davydov authored
      Currently we print the slabinfo header in the seq start method, which
      makes it unusable for showing leaks, so we have leaks_show, which does
      practically the same as s_show except it doesn't show the header.
      
      However, we can print the header in the seq show method - we only need
      to check if the current element is the first on the list.  This will
      allow us to use the same set of seq iterators for both leaks and
      slabinfo reporting, which is nice.
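
      The check boils down to emitting the header when the element being shown
      is the first one on the cache list (a sketch with helper names simplified):

        static int slab_show(struct seq_file *m, void *p)
        {
                struct kmem_cache *s = list_entry(p, struct kmem_cache, list);

                if (p == slab_caches.next)      /* first element: print header */
                        print_slabinfo_header(m);
                cache_show(s, m);               /* shared with leaks reporting */
                return 0;
        }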
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1df3b26f
    • mm: slab/slub: coding style: whitespaces and tabs mixture · b455def2
      LQYMGT authored
      Some code in mm/slab.c and mm/slub.c uses spaces instead of tabs for
      indentation.  Clean it up.
      Signed-off-by: LQYMGT <lqymgt@gmail.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b455def2
    • fs/char_dev.c: remove pointless assignment from __register_chrdev_region() · e2ab879e
      Jan Kara authored
      At one place we assign the major number we found to ret.  That assignment is
      then never used and actually doesn't make any sense given how the code is
      currently structured (the assignment comes from pre-git times).  Just
      remove it.
      
      Coverity id: 1226852.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e2ab879e
    • ocfs2: remove unneeded NULL check · b3e3e5af
      Dan Carpenter authored
      In commit 1faf2894 ("ocfs2_dlm: disallow a domain join if node maps
      mismatch") we introduced a new earlier NULL check so this one is not
      needed.  Also static checkers complain because we dereference it first
      and then check for NULL.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3e3e5af
    • ocfs2: remove bogus NULL check in ocfs2_move_extents() · 88d69b92
      Dan Carpenter authored
      "inode" isn't NULL here, and also we dereference it on the previous line
      so static checkers get annoyed.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88d69b92
    • ocfs2: do not set filesystem readonly if link down · 61fb9ea4
      jiangyiwen authored
      Do not set the filesystem read-only if the storage link is down.  In this
      case the metadata is not corrupted and only -EIO is returned.  If the
      metadata really is corrupted, ocfs2_error() has already been called in
      ocfs2_validate_inode_block().
      Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      61fb9ea4
    • ocfs2: do not set OCFS2_LOCK_UPCONVERT_FINISHING if nonblocking lock can not be granted at once · d1e78238
      Xue jiufei authored
      ocfs2_readpages() uses a nonblocking flag to avoid page lock inversion.
      This can trigger a cluster hang, because the flag
      OCFS2_LOCK_UPCONVERT_FINISHING is not cleared if the nonblocking lock
      cannot be granted at once.  The flag then prevents the dc thread from
      downconverting, so other nodes can never acquire this lockres.

      So do not set OCFS2_LOCK_UPCONVERT_FINISHING when receiving the ast if
      the nonblocking lock attempt has already returned.
      Signed-off-by: joyce.xue <xuejiufei@huawei.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d1e78238
    • ocfs2: fix error handling when creating debugfs root in ocfs2_init() · dc171580
      Jan Kara authored
      Error handling when creation of the debugfs root in ocfs2_init() fails is
      broken.  Although the error code is set, we fail to exit ocfs2_init() with
      an error, and so initialization ends up succeeding.  Later, when mounting
      a filesystem, ocfs2 debugfs entries end up being created in the root of
      the debugfs filesystem, which is confusing.
      
      Fix the error handling to bail out.
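
      A sketch of the fixed flow (simplified; the actual error code and label in
      ocfs2_init() may differ):

        ocfs2_debugfs_root = debugfs_create_dir("ocfs2", NULL);
        if (!ocfs2_debugfs_root) {
                status = -ENOMEM;
                goto out;       /* previously the error was set but not acted upon */
        }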
      
      Coverity id: 1227009.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc171580
    • ocfs2: remove filesize checks for sync I/O journal commit · 86b9c6f3
      Goldwyn Rodrigues authored
      Filesize is not a good indication that the file needs to be synced.
      An example where this breaks is:
       1. Open the file in O_SYNC|O_RDWR
       2. Read a small portion of the file (say 64 bytes)
       3. Lseek to starting of the file
       4. Write 64 bytes
      
      If the node crashes, the write never reaches disk because it was not
      committed to the journal, and another node that reads the file after
      recovery sees stale data (even though the write on the crashed node
      appeared to succeed).
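
      The scenario above can be reproduced with a few lines of userspace code
      (illustrative only; the path is hypothetical and error handling is
      trimmed):

        #include <fcntl.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
                char buf[64];
                int fd = open("/mnt/ocfs2/file", O_SYNC | O_RDWR); /* step 1 */

                read(fd, buf, sizeof(buf));     /* step 2: read 64 bytes     */
                lseek(fd, 0, SEEK_SET);         /* step 3: back to the start */
                memset(buf, 'x', sizeof(buf));
                write(fd, buf, sizeof(buf));    /* step 4: O_SYNC write that
                                                   must reach the journal    */
                close(fd);
                return 0;
        }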
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86b9c6f3