1. 03 Jul, 2013 40 commits
    • mm: vmscan: limit the number of pages kswapd reclaims at each priority · 75485363
      Mel Gorman authored
      This series does not fix all the current known problems with reclaim but
      it addresses one important swapping bug when there is background IO.
      
      Changelog since V3
       - Drop the slab shrink changes in light of Glauber's series and
         discussions that highlighted a number of potential problems
         with the patch.						(mel)
       - Rebased to 3.10-rc1
      
      Changelog since V2
       - Preserve ratio properly for proportional scanning		(kamezawa)
      
      Changelog since V1
       - Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY			(andi)
       - Reformat comment in shrink_page_list				(andi)
       - Clarify some comments					(dhillf)
       - Rework how the proportional scanning is preserved
       - Add PageReclaim check before kswapd starts writeback
       - Reset sc.nr_reclaimed on every full zone scan
      
      Kswapd and page reclaim behaviour has been screwy in one way or the
      other for a long time.  Very broadly speaking it worked in the far past
      because machines were limited in memory so it did not have that many
      pages to scan and it stalled in congestion_wait() frequently to prevent it
      going completely nuts.  In recent times it has behaved very
      unsatisfactorily with some of the problems compounded by the removal of
      stall logic and the introduction of transparent hugepage support with
      high-order reclaims.
      
      There are many variations of bugs that are rooted in this area.  One
      example is reports of large copy operations or backups causing the
      machine to grind to a halt or applications being pushed to swap.  Sometimes in
      low memory situations a large percentage of memory suddenly gets
      reclaimed.  In other cases an application starts and kswapd hits 100%
      CPU usage for prolonged periods of time and so on.  There is now talk of
      introducing features like an extra free kbytes tunable to work around
      aspects of the problem instead of trying to deal with it.  It's
      compounded by the problem that it can be very workload and machine
      specific.
      
      This series aims at addressing some of the worst of these problems
      without attempting to fundamentally alter how page reclaim works.
      
      Patches 1-2 limit the number of pages kswapd reclaims while still obeying
      	the anon/file proportion of the LRUs it should be scanning.
      
      Patches 3-4 control how and when kswapd raises its scanning priority and
      	delete the scanning restart logic which is tricky to follow.
      
      Patch 5 notes that it is too easy for kswapd to reach priority 0 when
      	scanning and then reclaim the world. Down with that sort of thing.
      
      Patch 6 notes that kswapd starts writeback based on scanning priority which
      	is not necessarily related to dirty pages. It will have kswapd
      	writeback pages if a number of unqueued dirty pages have been
      	recently encountered at the tail of the LRU.
      
      Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
      	to reduce LRU churn and the likelihood that it'll reclaim young
      	clean pages or push applications to swap. It will cause kswapd
      	to block on IO if it detects that pages being reclaimed under
      	writeback are recycling through the LRU before the IO completes.
      
      Patches 8-9 are cosmetic but balance_pgdat() is easier to follow after they
      	are applied.
      
      This was tested using memcached+memcachetest while some background IO
      was in progress, as implemented by the parallel IO tests in MM
      Tests.
      
      memcachetest benchmarks how many operations/second memcached can service
      and it is run multiple times.  It starts with no background IO and then
      re-runs the test with larger amounts of IO in the background to roughly
      simulate a large copy in progress.  The expectation is that the IO
      should have little or no impact on memcachetest which is running
      entirely in memory.
      
                                              3.10.0-rc1                  3.10.0-rc1
                                                 vanilla            lessdisrupt-v4
      Ops memcachetest-0M             22155.00 (  0.00%)          22180.00 (  0.11%)
      Ops memcachetest-715M           22720.00 (  0.00%)          22355.00 ( -1.61%)
      Ops memcachetest-2385M           3939.00 (  0.00%)          23450.00 (495.33%)
      Ops memcachetest-4055M           3628.00 (  0.00%)          24341.00 (570.92%)
      Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
      Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)
      Ops io-duration-2385M             118.00 (  0.00%)             21.00 ( 82.20%)
      Ops io-duration-4055M             162.00 (  0.00%)             36.00 ( 77.78%)
      Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-715M             140134.00 (  0.00%)             18.00 ( 99.99%)
      Ops swaptotal-2385M            392438.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-4055M            449037.00 (  0.00%)          27864.00 ( 93.79%)
      Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-715M                     0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-2385M               148031.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-4055M               135109.00 (  0.00%)              0.00 (  0.00%)
      Ops minorfaults-0M            1529984.00 (  0.00%)        1530235.00 ( -0.02%)
      Ops minorfaults-715M          1794168.00 (  0.00%)        1613750.00 ( 10.06%)
      Ops minorfaults-2385M         1739813.00 (  0.00%)        1609396.00 (  7.50%)
      Ops minorfaults-4055M         1754460.00 (  0.00%)        1614810.00 (  7.96%)
      Ops majorfaults-0M                  0.00 (  0.00%)              0.00 (  0.00%)
      Ops majorfaults-715M              185.00 (  0.00%)            180.00 (  2.70%)
      Ops majorfaults-2385M           24472.00 (  0.00%)            101.00 ( 99.59%)
      Ops majorfaults-4055M           22302.00 (  0.00%)            229.00 ( 98.97%)
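
The percentage columns in the table above can be reproduced from the raw
values.  The following is a minimal sketch, assuming the report treats
memcachetest as a higher-is-better metric and the remaining rows as
lower-is-better; the function names are hypothetical and are not part of
MM Tests.

```python
def gain_higher_is_better(baseline, patched):
    """Percentage gain for a metric where a larger value is better."""
    return (patched - baseline) / baseline * 100.0


def gain_lower_is_better(baseline, patched):
    """Percentage gain for a metric where a smaller value is better."""
    return (baseline - patched) / baseline * 100.0


if __name__ == "__main__":
    # memcachetest-2385M row: vanilla 3939.00, patched 23450.00
    print(f"{gain_higher_is_better(3939.0, 23450.0):.2f}%")
    # io-duration-715M row: vanilla 12.00, patched 7.00
    print(f"{gain_lower_is_better(12.0, 7.0):.2f}%")
```

Both helpers divide by the baseline, so rows with a baseline of zero need
separate handling.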
      
      Note how the vanilla kernel's performance collapses when there is enough
      IO taking place in the background.  This drop in performance is part of
      what users complain of when they start backups.  Note how the swapin and
      major fault figures indicate that processes were being pushed to swap
      prematurely.  With the series applied, there is no noticeable performance
      drop and while there is still some swap activity, it's tiny.
      
      20 iterations of this test were run in total and averaged.  Every 5
      iterations, additional IO was generated in the background using dd to
      measure how the workload was impacted.  The 0M, 715M, 2385M and 4055M
      subblocks refer to the amount of IO going on in the background at each
      iteration.  So memcachetest-2385M is reporting how many
      transactions/second memcachetest recorded on average over 5 iterations
      while there was 2385M of IO going on in the background.  There are six
      blocks of information reported here:
      
      memcachetest is the transactions/second reported by memcachetest. In
      	the vanilla kernel note that performance drops from around
      	22K/sec to just under 4K/second when there is 2385M of IO going
      	on in the background. This is one type of performance collapse
      	users complain about if a large cp or backup starts in the
      	background
      
      io-duration refers to how long it takes for the background IO to
      	complete. It shows that with the patched kernel the IO
      	completes faster while not interfering with the memcache
      	workload
      
      swaptotal is the total amount of swap traffic. With the patched kernel,
      	the total amount of swapping is much reduced although it is
      	still not zero.
      
      swapin in this case is an indication as to whether we are swap thrashing.
      	The closer the swapin/swapout ratio is to 1, the worse the
      	thrashing is.  Note with the patched kernel that there is no swapin
      	activity indicating that all the pages swapped were really inactive
      	unused pages.
      
      minorfaults are just minor faults. An increased number of minor faults
      	can indicate that page reclaim is unmapping the pages but not
      	swapping them out before they are faulted back in. With the
      	patched kernel, there is only a small change in minor faults
      
      majorfaults are just major faults in the target workload and a high
      	number can indicate that a workload is being prematurely
      	swapped. With the patched kernel, major faults are much reduced. As
      	there are no swapins recorded, it is not being swapped. The likely
      	explanation is that libraries or configuration files used by
      	the workload during startup get paged out by the background IO.
      
      Overall with the series applied, there is no noticeable performance drop
      due to background IO and while there is still some swap activity, it's
      tiny and the lack of swapins implies that the swapped pages were inactive
      and unused.
      
                                  3.10.0-rc1  3.10.0-rc1
                                     vanilla lessdisrupt-v4
      Page Ins                       1234608      101892
      Page Outs                     12446272    11810468
      Swap Ins                        283406           0
      Swap Outs                       698469       27882
      Direct pages scanned                 0      136480
      Kswapd pages scanned           6266537     5369364
      Kswapd pages reclaimed         1088989      930832
      Direct pages reclaimed               0      120901
      Kswapd efficiency                  17%         17%
      Kswapd velocity               5398.371    4635.115
      Direct efficiency                 100%         88%
      Direct velocity                  0.000     117.817
      Percentage direct scans             0%          2%
      Page writes by reclaim         1655843     4009929
      Page writes file                957374     3982047
      Page writes anon                698469       27882
      Page reclaim immediate            5245        1745
      Page rescued immediate               0           0
      Slabs scanned                    33664       25216
      Direct inode steals                  0           0
      Kswapd inode steals              19409         778
      Kswapd skipped wait                  0           0
      THP fault alloc                     35          30
      THP collapse alloc                 472         401
      THP splits                          27          22
      THP fault fallback                   0           0
      THP collapse fail                    0           1
      Compaction stalls                    0           4
      Compaction success                   0           0
      Compaction failures                  0           4
      Page migrate success                 0           0
      Page migrate failure                 0           0
      Compaction pages isolated            0           0
      Compaction migrate scanned           0           0
      Compaction free scanned              0           0
      Compaction cost                      0           0
      NUMA PTE updates                     0           0
      NUMA hint faults                     0           0
      NUMA hint local faults               0           0
      NUMA pages migrated                  0           0
      AutoNUMA cost                        0           0
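
The derived rows in the table above follow from the raw counters.  The
sketch below is an illustration only: the function names are hypothetical,
and truncation to an integer percentage is an assumption chosen because it
matches the reported 17%, 88% and 2% figures.

```python
def efficiency(reclaimed, scanned):
    """Pages reclaimed per page scanned, as a truncated integer percentage.

    Returns None when nothing was scanned, because the ratio is undefined.
    """
    if scanned == 0:
        return None
    return 100 * reclaimed // scanned


def direct_scan_share(direct_scanned, kswapd_scanned):
    """Share of all scanned pages that were scanned by direct reclaim."""
    total = direct_scanned + kswapd_scanned
    if total == 0:
        return None
    return 100 * direct_scanned // total


def swapin_ratio(swap_ins, swap_outs):
    """Swapin/swapout ratio; values near 1 suggest swap thrashing."""
    if swap_outs == 0:
        return None
    return swap_ins / swap_outs


if __name__ == "__main__":
    print("kswapd efficiency, vanilla:", efficiency(1088989, 6266537))
    print("kswapd efficiency, patched:", efficiency(930832, 5369364))
    print("direct efficiency, patched:", efficiency(120901, 136480))
    print("direct scan share, patched:", direct_scan_share(136480, 5369364))
    print("swapin ratio, vanilla:", swapin_ratio(283406, 698469))
    print("swapin ratio, patched:", swapin_ratio(0, 27882))
```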
      
      Unfortunately, note that there is a small amount of direct reclaim due to
      kswapd no longer reclaiming the world.  ftrace indicates that the direct
      reclaim stalls are mostly harmless with the vast bulk of the stalls
      incurred by dd:
      
           23 tclsh-3367
           38 memcachetest-13733
           49 memcachetest-12443
           57 tee-3368
         1541 dd-13826
         1981 dd-12539
      
      A consequence of the direct reclaim for dd is that the processes for the
      IO workload may show a higher system CPU usage.  There is also a risk that
      kswapd not reclaiming the world may mean that it stays awake balancing
      zones, does not stall on the appropriate events and continually scans
      pages it cannot reclaim consuming CPU.  This will be visible as continued
      high CPU usage but in my own tests I only saw a single spike lasting less
      than a second and I did not observe any problems related to reclaim while
      running the series on my desktop.
      
      This patch:
      
      The number of pages kswapd can reclaim is bound by the number of pages it
      scans which is related to the size of the zone and the scanning priority.
      In many cases the priority remains low because it's reset every
      SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
      number of pages it cannot reclaim, it will raise the priority and
      potentially discard a large percentage of the zone as sc->nr_to_reclaim is
      ULONG_MAX.  The user-visible effect is a reclaim "spike" where a large
      percentage of memory is suddenly freed.  It would be bad enough if this
      was just unused memory but because of how anon/file pages are balanced it
      is possible that applications get pushed to swap unnecessarily.
      
      This patch limits the number of pages kswapd will reclaim to the high
      watermark.  Reclaim will still overshoot due to it not being a hard limit
      as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
      prevents kswapd reclaiming the world at higher priorities.  The number of
      pages it reclaims is not adjusted for high-order allocations as kswapd
      will reclaim excessively if it is to balance zones for high-order
      allocations.
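
The effect of capping the reclaim target can be illustrated with a toy
userspace model.  This is a sketch under loud assumptions, not the kernel
code: the scan budget is modelled as zone_pages >> priority, every scanned
page is treated as reclaimable, and kswapd_target() and toy_reclaim() are
hypothetical names.  Note that "raising" the priority means moving the
numeric value from DEF_PRIORITY towards 0.

```python
SWAP_CLUSTER_MAX = 32      # assumed batch size for this model
DEF_PRIORITY = 12          # assumed default (lowest-pressure) priority
ULONG_MAX = 2**64 - 1      # models an unlimited reclaim target


def kswapd_target(high_wmark_pages):
    """Reclaim target described above: the high watermark, with a floor
    of one batch (the floor is an assumption of this model)."""
    return max(SWAP_CLUSTER_MAX, high_wmark_pages)


def toy_reclaim(zone_pages, priority, nr_to_reclaim):
    """Reclaim in batches until the scan budget is used up or, at any
    priority other than DEF_PRIORITY, the target has been reached."""
    budget = zone_pages >> priority
    reclaimed = 0
    while budget > 0:
        step = min(SWAP_CLUSTER_MAX, budget)
        budget -= step
        reclaimed += step  # toy assumption: every scanned page is freed
        if priority < DEF_PRIORITY and reclaimed >= nr_to_reclaim:
            break
    return reclaimed


if __name__ == "__main__":
    zone = 1 << 20  # 4 GiB zone with 4 KiB pages
    print("unlimited target at priority 0:", toy_reclaim(zone, 0, ULONG_MAX))
    print("capped target at priority 0:",
          toy_reclaim(zone, 0, kswapd_target(4096)))
```

In the model, an unlimited target at priority 0 frees the whole zone, while
the capped target stops at the watermark; at DEF_PRIORITY the target is
ignored, which mirrors the overshoot mentioned above.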
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: don't re-init pageset in zone_pcp_update() · 169f6c19
      Cody P Schafer authored
      When memory hotplug is triggered, we call pageset_init() on
      per-cpu-pagesets which both contain pages and are in use, causing both the
      leakage of those pages and (potentially) bad behaviour if a page is
      allocated from a pageset while it is being cleared.
      
      Avoid this by factoring out pageset_set_high_and_batch() (which contains
      all needed logic to set a pageset's ->high and ->batch irrespective of
      system state) from zone_pageset_init() and using the new
      pageset_set_high_and_batch() instead of zone_pageset_init() in
      zone_pcp_update().
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch() · 3664033c
      Cody P Schafer authored
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: in zone_pcp_update(), use zone_pageset_init() · 737af4c0
      Cody P Schafer authored
      Previously, zone_pcp_update() called pageset_set_batch() directly,
      essentially assuming that percpu_pagelist_fraction == 0.
      
      Correct this by calling zone_pageset_init(), which chooses the
      appropriate ->batch and ->high calculations.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset() · 56cef2b8
      Cody P Schafer authored
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: relocate comment to be directly above code it refers to. · dd1895e2
      Cody P Schafer authored
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch() · 88c90dbc
      Cody P Schafer authored
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: when handling percpu_pagelist_fraction, don't needlessly recalculate high · 22a7f12b
      Cody P Schafer authored
      Simply moves calculation of the new 'high' value outside the
      for_each_possible_cpu() loop, as it does not depend on the cpu.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine() · 0a647f38
      Cody P Schafer authored
      zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a
      percpu pageset based on a zone's ->managed_pages.  We don't need to drain
      the entire percpu pageset just to modify these fields.
      
      This lets us avoid calling setup_pageset() (and the draining required to
      call it) and instead allows simply setting the fields' values (with some
      attention paid to memory barriers to prevent the relationship between
      ->batch and ->high from being thrown off).
      
      This does change the behavior of zone_pcp_update() as the percpu pagesets
      will not be drained when zone_pcp_update() is called (they will end up
      being shrunk, not completely drained, later when a 0-order page is freed
      in free_hot_cold_page()).
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: protect pcp->batch accesses with ACCESS_ONCE · 998d39cb
      Cody P Schafer authored
      pcp->batch could change at any point, avoid relying on it being a stable
      value.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: insert memory barriers to allow async update of pcp batch and high · 8d7a8fa9
      Cody P Schafer authored
      Introduce pageset_update() to perform a safe transition from one set of
      pcp->{batch,high} to a new set using memory barriers.
      
      This ensures that batch is always set to a safe value (1) prior to
      updating high, and ensures that high is fully updated before setting the
      real value of batch.  It avoids ->batch ever rising above ->high.
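
The ordering can be illustrated with a small model.  This is a sketch only:
Python cannot express memory barriers, so the model simply records the
state a concurrent reader could observe after each store.  The class and
function names are hypothetical, and the model assumes every ->high is at
least 1 and the new batch does not exceed the new high.

```python
from dataclasses import dataclass


@dataclass
class PerCpuPages:
    high: int
    batch: int


def update_safe(pcp, new_high, new_batch):
    """Three-step order described above; returns the observable states."""
    states = []
    pcp.batch = 1                  # step 1: safe batch value
    states.append((pcp.high, pcp.batch))
    pcp.high = new_high            # step 2: publish the new high
    states.append((pcp.high, pcp.batch))
    pcp.batch = new_batch          # step 3: publish the real batch
    states.append((pcp.high, pcp.batch))
    return states


def update_naive(pcp, new_high, new_batch):
    """Counter-example: writing batch first can expose batch > high."""
    states = []
    pcp.batch = new_batch
    states.append((pcp.high, pcp.batch))
    pcp.high = new_high
    states.append((pcp.high, pcp.batch))
    return states


def invariant_holds(states):
    """True if batch never exceeded high in any observed state."""
    return all(batch <= high for high, batch in states)


if __name__ == "__main__":
    print(invariant_holds(update_safe(PerCpuPages(6, 1), 186, 31)))
    print(invariant_holds(update_naive(PerCpuPages(6, 1), 186, 31)))
```

In the kernel the barriers keep these stores from being reordered; the
model only shows why this order preserves the invariant.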
      
      Suggested by Gilad Ben-Yossef in these threads:
      
      	https://lkml.org/lkml/2013/4/9/23
      	https://lkml.org/lkml/2013/4/10/49
      
      Also reproduces his proposed comment.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Reviewed-by: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high · c8e251fa
      Cody P Schafer authored
      Because we are going to rely upon a careful transition between old and new
      ->high and ->batch values using memory barriers and will remove
      stop_machine(), we need to prevent multiple updaters from interweaving
      their memory writes.
      
      Add a simple mutex to protect both update loops.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: factor out setting of pcp->high and pcp->batch · 4008bab7
      Cody P Schafer authored
      "Problems" with the current code:
      
      1: there is a lack of synchronization in setting ->high and ->batch in
         percpu_pagelist_fraction_sysctl_handler()
      
      2: stop_machine() in zone_pcp_update() is unnecessary.
      
      3: zone_pcp_update() does not consider the case where
         percpu_pagelist_fraction is non-zero
      
      To fix:
      
      1: add memory barriers, a safe ->batch value, an update side mutex when
         updating ->high and ->batch, and use ACCESS_ONCE() for ->batch users
         that expect a stable value.
      
      2: avoid draining pages in zone_pcp_update(), rely upon the memory
         barriers added to fix #1
      
      3: factor out quite a few functions, and then call the appropriate one.
      
      Note that it results in a change to the behavior of zone_pcp_update(),
      which is used by memory_hotplug.  I'm rather certain that I've discerned
      (and preserved) the essential behavior (changing ->high and ->batch), and
      only eliminated unneeded actions (draining the per cpu pages), but this
      may not be the case.
      
      Further note that the draining of pages that previously took place in
      zone_pcp_update() occurred after repeated draining when attempting to
      offline a page, and after the offline has "succeeded".  It appears that
      the draining was added to zone_pcp_update() to avoid refactoring
      setup_pageset() into 2 functions.
      
      This patch:
      
      Creates pageset_set_batch() for use in setup_pageset().
      pageset_set_batch() imitates the functionality of
      setup_pagelist_highmark(), but uses the boot time
      (percpu_pagelist_fraction == 0) calculations for determining ->high based
      on ->batch.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • uio: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT · 52c2dad9
      Libin authored
      The (*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented
      as an inline function vma_pages() in linux/mm.h, so use it.
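
As an illustration of the arithmetic that vma_pages() wraps, here is a
userspace sketch; it is a Python model and not the kernel helper, and the
PAGE_SHIFT value of 12 (4 KiB pages) is an assumption.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def vma_pages(vm_start, vm_end, page_shift=PAGE_SHIFT):
    """Number of pages spanned by the range [vm_start, vm_end)."""
    return (vm_end - vm_start) >> page_shift


if __name__ == "__main__":
    # A 16 KiB mapping spans four 4 KiB pages.
    print(vma_pages(0x7F0000000000, 0x7F0000004000))
```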
      Signed-off-by: Libin <huawei.libin@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ncpfs: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT · ef9f515a
      Libin authored
      The (*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented
      as an inline function vma_pages() in linux/mm.h, so use it.
      Signed-off-by: Libin <huawei.libin@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT · d6e93217
      Libin authored
      The (*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented
      as an inline function vma_pages() in linux/mm.h, so use it.
      Signed-off-by: Libin <huawei.libin@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove compressed copy from zram in-memory · b430e9d1
      Minchan Kim authored
      The swap subsystem frees swap slots lazily, expecting that the page may
      be swapped out again, so that an unnecessary write can be avoided.
      
      The problem with in-memory swap (e.g. zram) is that it consumes memory
      space until the vm_swap_full() condition (i.e. half of all swap space
      is in use) is met.  This can be bad if we use multiple swap devices (a
      small in-memory swap and a big storage swap) or in-memory swap alone.
      
      This patch makes the swap subsystem free the swap slot as soon as the
      swap read completes and marks the swapcache page dirty, so the page must
      be written out to the swap device again before it can be reclaimed.
      This means we never lose it.
      
      I tested this patch with kernel compile workload.
      
      1. before
      
         compile time : 9882.42
         zram max wasted space by fragmentation: 13471881 byte
         memory space consumed by zram: 174227456 byte
         the number of slot free notify: 206684
      
      2. after
      
         compile time : 9653.90
         zram max wasted space by fragmentation: 11805932 byte
         memory space consumed by zram: 154001408 byte
         the number of slot free notify: 426972
      
      [akpm@linux-foundation.org: tweak comment text]
      [artem.savkov@gmail.com: fix BUG due to non-swapcache pages in end_swap_bio_read()]
      [akpm@linux-foundation.org: invert unlikely() test, augment comment, 80-col cleanup]
      Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Artem Savkov <artem.savkov@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>
      Cc: Shaohua Li <shli@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: don't take task_lock in task_in_mem_cgroup · ffbdccf5
      David Rientjes authored
      For processes that have detached their mm's, task_in_mem_cgroup()
      unnecessarily takes task_lock() when rcu_read_lock() is all that is
      necessary to call mem_cgroup_from_task().
      
      While we're here, switch task_in_mem_cgroup() to return bool.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pagemap: prepare to reuse constant bits with page-shift · 541c237c
      Pavel Emelyanov authored
      In order to reuse bits from pagemap entries gracefully, we leave the
      entries as is but on pagemap open emit a warning in dmesg, that bits
      55-60 are about to change in a couple of releases.  Next, if a user
      issues soft-dirty clear command via the clear_refs file (it was disabled
      before v3.9) we assume that he's aware of the new pagemap format, note
      that fact and report the bits in pagemap in the new manner.
      
      The "migration strategy" looks like this then:
      
      1. existing users are not affected -- they don't touch soft-dirty feature, thus
         see old bits in pagemap, but are warned and have time to fix themselves
      2. those who use soft-dirty know about new pagemap format
      3. some time soon we get rid of any signs of page-shift in pagemap as well as
         this trick with clear-soft-dirty affecting pagemap format.
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: soft-dirty bits for user memory changes tracking · 0f8975ec
      Pavel Emelyanov authored
      The soft-dirty is a bit on a PTE which helps to track which pages a task
      writes to.  In order to do this tracking one should
      
        1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs")
        2. Wait some time.
        3. Read soft-dirty bits (bit 55 of the /proc/PID/pagemap2 entries)
      
      To do this tracking, the writable bit is cleared from PTEs when the
      soft-dirty bit is.  Thus, after this, when the task tries to modify a
      page at some virtual address the #PF occurs and the kernel sets the
      soft-dirty bit on the respective PTE.
      
      Note that although the task's entire address space is marked r/o after
      the soft-dirty bits are cleared, the #PFs that occur after that are
      processed fast.  This is because the pages are still mapped to physical
      memory, so all the kernel does is detect this and put the writable,
      dirty and soft-dirty bits back on the PTE.
      
      Another thing to note is that when mremap moves PTEs they are marked
      with soft-dirty as well, since from the user perspective mremap modifies
      the virtual memory at mremap's new address.
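      
      As a rough illustration of step 3, the sketch below decodes one 64-bit
      pagemap entry in user space.  It assumes one 8-byte entry per virtual
      page and the bit position quoted above (bit 55).  The helper names
      pagemap_offset() and entry_is_soft_dirty() are made up for this example
      and are not kernel or libc APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Bit position taken from the commit text (bit 55 of a pagemap entry). */
#define PM_SOFT_DIRTY_BIT 55

/* Assumption: each pagemap entry is one 64-bit word per virtual page. */
#define PM_ENTRY_BYTES 8

/* Byte offset of the pagemap entry that describes virtual address vaddr. */
uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
	assert(page_size != 0);
	return (vaddr / page_size) * PM_ENTRY_BYTES;
}

/* Returns 1 if the entry has the soft-dirty bit set, 0 otherwise. */
int entry_is_soft_dirty(uint64_t entry)
{
	return (int)((entry >> PM_SOFT_DIRTY_BIT) & 1);
}
```

      A caller would read 8 bytes at pagemap_offset(vaddr, page_size) from the
      pagemap file and pass the value to entry_is_soft_dirty().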
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pagemap: introduce pagemap_entry_t without pmshift bits · 2b0a9f01
      Pavel Emelyanov authored
      These bits are always constant (== PAGE_SHIFT) and just occupy space in
      the entry.  Moreover, in the next patch we will need to report one more
      bit in the pagemap, but all of its bits are already in use.
      
      That said, describe a pagemap entry that has 6 more free zero bits.
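      
      For illustration, the legacy field described above can be isolated with
      plain masking.  This sketch assumes the page-shift field sits in bits
      55-60, as stated elsewhere in this series.  All macro and function names
      below are hypothetical and do not mirror the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define PM_PSHIFT_OFFSET 55	/* assumed first bit of the page-shift field */
#define PM_PSHIFT_BITS   6	/* assumed field width (bits 55-60) */

/* Mask covering the six page-shift bits. */
uint64_t pm_pshift_mask(void)
{
	return ((1ULL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET;
}

/* Builds a legacy-style entry from the other bits plus a page shift. */
uint64_t pack_legacy_entry(uint64_t other_bits, unsigned shift)
{
	assert(shift < (1u << PM_PSHIFT_BITS));
	return (other_bits & ~pm_pshift_mask()) |
	       ((uint64_t)shift << PM_PSHIFT_OFFSET);
}

/* Extracts the constant page shift stored in a legacy entry. */
unsigned legacy_page_shift(uint64_t entry)
{
	return (unsigned)((entry & pm_pshift_mask()) >> PM_PSHIFT_OFFSET);
}

/* Clears the page-shift field, leaving six zero bits free for reuse. */
uint64_t strip_page_shift(uint64_t entry)
{
	return entry & ~pm_pshift_mask();
}
```

      Because the field always carries the same value, clearing it loses no
      information, which is what makes the six bits available for new flags.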
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • clear_refs: introduce private struct for mm_walk · af9de7eb
      Pavel Emelyanov authored
      In the next patch the clear-refs type will be required in the
      clear_refs_pte_range() function, so prepare walk->private to carry
      this info.
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • clear_refs: sanitize accepted commands declaration · 040fa020
      Pavel Emelyanov authored
      This is the implementation of the soft-dirty bit concept that should
      help keep track of changes in user memory, which in turn is badly
      needed by the checkpoint-restore project (http://criu.org).
      
      To create a dump of one or more applications we save all the information
      about them to files, and the biggest part of such a dump is the contents
      of the tasks' memory.  However, there are usage scenarios where it's not
      required to get _all_ the task memory while creating a dump.  For
      example, when doing periodic dumps, a full memory dump is required only
      at the first step; after that, only incremental changes of memory need
      to be taken.  Another example is live migration.  We copy all the memory
      to the destination node without stopping the tasks, then stop them,
      check which pages have changed, dump those pages and the rest of the
      state, then copy that to the destination node.  This decreases freeze
      time significantly.
      
      That said, some help from kernel to watch how processes modify the
      contents of their memory is required.
      
      The proposal is to track changes with the help of new soft-dirty bit
      this way:
      
      1. First do "echo 4 > /proc/$pid/clear_refs".
         At that point kernel clears the soft dirty _and_ the writable bits from all
         ptes of process $pid. From now on every write to any page will result in #pf
         and the subsequent call to pte_mkdirty/pmd_mkdirty, which in turn will set
         the soft dirty flag.
      
      2. Then read /proc/$pid/pagemap2 and check the soft-dirty bit reported there
         (bit 55). If set, the respective pte was written to since the last call
         to clear refs.
      
      The soft-dirty bit is the _PAGE_BIT_HIDDEN one.  Although it's used by
      kmemcheck, the latter marks kernel pages with it, while the soft-dirty
      bit is put on user pages, so they do not conflict with each other.
      
      This patch:
      
      A new clear-refs type will be added in the next patch, so prepare
      code for that.
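      
      A minimal user-space sketch of the kind of validation this prepares for:
      parse the written command into a plain integer, range-check it, and only
      then treat it as an enum value, which avoids any assumption about
      sizeof(enum).  The enumerator names are modeled on the kernel's but
      should be read as assumptions, CLEAR_REFS_SOFT_DIRTY stands for the type
      the next patch adds (command 4 in the text above), and
      parse_clear_refs() is a hypothetical helper.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Assumed names and values; CLEAR_REFS_LAST is a sentinel, not a command. */
enum clear_refs_types {
	CLEAR_REFS_ALL = 1,
	CLEAR_REFS_ANON,
	CLEAR_REFS_MAPPED,
	CLEAR_REFS_SOFT_DIRTY,	/* the "echo 4" command from the text */
	CLEAR_REFS_LAST,
};

/* Returns the command as a positive enum value, or -EINVAL if rejected. */
int parse_clear_refs(const char *buf)
{
	char *end;
	long val;

	assert(buf != NULL);
	errno = 0;
	val = strtol(buf, &end, 10);
	if (errno != 0 || end == buf)
		return -EINVAL;
	/* Range-check the plain integer before using it as an enum value. */
	if (val < CLEAR_REFS_ALL || val >= CLEAR_REFS_LAST)
		return -EINVAL;
	return (int)val;
}
```

      With a sentinel at the end of the enum, adding a new command only needs
      a new enumerator; the range check picks it up automatically.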
      
      [akpm@linux-foundation.org: don't assume that sizeof(enum clear_refs_types) == sizeof(int)]
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • crypto: sanitize argument for format string · 1c8fca1d
      Kees Cook authored
      The template lookup interface does not provide a way to use format
      strings, so make sure that the interface cannot be abused accidentally.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • block: do not pass disk names as format strings · ffc8b308
      Kees Cook authored
      Disk names may contain arbitrary strings, so they must not be
      interpreted as format strings.  It seems that only md allows arbitrary
      strings to be used for disk names, but this could allow for a local
      memory corruption from uid 0 into ring 0.
      
      CVE-2013-2851
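      
      The general pattern behind this fix can be sketched in user space with
      snprintf(): an externally supplied name is passed as an argument to a
      constant "%s" format instead of being used as the format itself.  The
      functions set_label() and label_roundtrips() are hypothetical helpers
      for this example and are not the kernel interfaces touched by the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copies name into out verbatim; '%' sequences in name are not interpreted. */
int set_label(char *out, size_t out_size, const char *name)
{
	assert(out != NULL && name != NULL && out_size > 0);
	/* The unsafe variant would be snprintf(out, out_size, name). */
	return snprintf(out, out_size, "%s", name);
}

/* Returns 1 if a name survives set_label() unchanged, 0 otherwise. */
int label_roundtrips(const char *name)
{
	char buf[64];

	set_label(buf, sizeof(buf), name);
	return strcmp(buf, name) == 0;
}
```

      With the constant format, a name such as "md_%s%n%x" is stored as-is
      rather than driving the formatter.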
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/cdrom/cdrom.c: use kzalloc() for failing hardware · 542db015
      Jonathan Salwan authored
      In drivers/cdrom/cdrom.c mmc_ioctl_cdrom_read_data() allocates a memory
      area with kmalloc in line 2885.
      
        2885         cgc->buffer = kmalloc(blocksize, GFP_KERNEL);
        2886         if (cgc->buffer == NULL)
        2887                 return -ENOMEM;
      
      In line 2908 we can find the copy_to_user function:
      
        2908         if (!ret && copy_to_user(arg, cgc->buffer, blocksize))
      
      The cgc->buffer is never cleared or initialized before this call.
      If ret is 0 after the preceding basic block, userspace can read some
      bytes of kernel memory.
      
      When we read a block from the disk it normally fills the ->buffer but if
      the drive is malfunctioning there is a chance that it would only be
      partially filled.  The result is an information leak to userspace.
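      
      A user-space analogy of the fix, using calloc() in place of kzalloc():
      if the buffer starts out zeroed, a partial fill leaves nothing stale in
      the unwritten tail.  flaky_read() is a hypothetical stand-in for a
      malfunctioning drive that fills only half of the block, and
      stale_tail_bytes() is a helper invented for this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical drive model: writes only the first half of the buffer. */
size_t flaky_read(unsigned char *buf, size_t blocksize)
{
	size_t filled = blocksize / 2;

	memset(buf, 0xAB, filled);
	return filled;
}

/* Returns the number of nonzero bytes past the part the drive filled. */
size_t stale_tail_bytes(size_t blocksize)
{
	unsigned char *buf;
	size_t filled, i, stale = 0;

	assert(blocksize > 0);
	buf = calloc(1, blocksize);	/* zeroed allocation, like kzalloc() */
	assert(buf != NULL);
	filled = flaky_read(buf, blocksize);
	for (i = filled; i < blocksize; i++)
		if (buf[i] != 0)
			stale++;
	free(buf);
	return stale;
}
```

      With malloc() instead of calloc(), the tail would hold whatever the
      allocator last left there, which is the leak described above.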
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • block/compat_ioctl.c: do not leak info to user-space · 8b0d77f1
      Cong Wang authored
      There is a hole in struct hd_geometry, so we have to zero the struct on
      the stack before copying it to user-space.
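      
      The idea can be sketched in user space as follows.  struct geo is a
      hypothetical layout modeled on hd_geometry; whether and where a padding
      hole exists depends on the ABI, so the sketch locates it with
      offsetof().  It also relies on the compiler leaving padding untouched on
      member stores, which is the same assumption the kernel pattern makes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout; on many 64-bit ABIs a hole follows cylinders. */
struct geo {
	unsigned char heads;
	unsigned char sectors;
	unsigned short cylinders;
	unsigned long start;
};

/* Zeroes the whole object, hole included, then fills the fields. */
void fill_geo(struct geo *g)
{
	assert(g != NULL);
	memset(g, 0, sizeof(*g));
	g->heads = 16;
	g->sectors = 63;
	g->cylinders = 1024;
	g->start = 2048;
}

/* Counts nonzero bytes in the gap between cylinders and start. */
size_t dirty_hole_bytes(void)
{
	struct geo g;
	const unsigned char *raw = (const unsigned char *)&g;
	size_t lo = offsetof(struct geo, cylinders) + sizeof(g.cylinders);
	size_t hi = offsetof(struct geo, start);
	size_t i, dirty = 0;

	memset(&g, 0xAA, sizeof(g));	/* simulate stale stack contents */
	fill_geo(&g);
	for (i = lo; i < hi; i++)
		if (raw[i] != 0)
			dirty++;
	return dirty;
}
```

      Without the memset() in fill_geo(), the 0xAA bytes in the hole would be
      copied out along with the real fields.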
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/cdrom/gdrom.c: fix device number leak · 31bd8fbb
      Libo Chen authored
      Without this patch, gdrom_major will leak when gd.cd_info alloc fails.
      Signed-off-by: Libo Chen <libo.chen@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix NULL pointer dereference when traversing o2hb_all_regions · 4a184b4f
      Xue jiufei authored
      A NULL pointer dereference may occur in config_item_name() when one
      volume (say Volume A) unmounts while another (say Volume B) is mounting.
      
           Volume A                          Volume B
      
        already Mounted.
        Unmounting, call
        o2hb_heartbeat_group_drop_item()
          -> config_item_put(item)
          set reg(A)->item.ci_name to NULL
          in function config_item_cleanup().
      
                                          begin mounting, call
                                          o2hb_region_pin() and traverse all
                                          regions. When reading
                                          reg(A)->item.ci_name, it causes
                                          NULL pointer dereference.
      
        call o2hb_region_release() and
        del reg(A) from list.
      
      So we should skip regions that are about to be released when
      traversing o2hb_all_regions.
      Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Signed-off-by: joyce <xuejiufei@huawei.com>
      Acked-by: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: adjust switch_case syntax at o2net_state_change() · 44e89cb8
      Jie Liu authored
      Adjust switch..case syntax at o2net_state_change to meet the kernel coding
      standard.
      
      s/printk/pr_info/.
      
      [akpm@linux-foundation.org: revert pr_foo() change]
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Acked-by: Joel Becker <jlbec@evilplan.org>
      Cc: Gurudas Pai <gurudas.pai@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Noboru Iwamatsu <n_iwamatsu@jp.fujitsu.com>
      Cc: Srinivas Eeeda <srinivas.eeda@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: Tao Ma <tm@tao.ma>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix a comments typo at o2quo_hb_still_up() · b4d8ed4f
      Jie Liu authored
      Fix a comment typo in o2quo_hb_still_up()
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Cc: Gurudas Pai <gurudas.pai@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Noboru Iwamatsu <n_iwamatsu@jp.fujitsu.com>
      Cc: Srinivas Eeeda <srinivas.eeda@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: Tao Ma <tm@tao.ma>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: consolidate o2hb_global_hearbeat_mode_set() naming convention · 70f651ed
      Jie Liu authored
      s/o2hb_global_hearbeat_mode_set/o2hb_global_heartbeat_mode_set/ to make
      the name of this routine consistent with the other heartbeat
      routines.
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: Gurudas Pai <gurudas.pai@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Noboru Iwamatsu <n_iwamatsu@jp.fujitsu.com>
      Cc: Srinivas Eeeda <srinivas.eeda@oracle.com>
      Cc: Tao Ma <tm@tao.ma>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: submit disk heartbeat bio using WRITE_SYNC · e873fdb5
      Noboru Iwamatsu authored
      Under heavy I/O load, writing the disk heartbeat can be forced to wait for
      minutes, and this causes the node to be fenced.
      
      This patch uses WRITE_SYNC when submitting the heartbeat bio, so
      that writing the heartbeat has priority over other requests.
      Signed-off-by: Noboru Iwamatsu <n_iwamatsu@jp.fujitsu.com>
      Acked-by: Tao Ma <tm@tao.ma>
      Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: Srinivas Eeeda <srinivas.eeda@oracle.com>
      Reviewed-by: Jie Liu <jeff.liu@oracle.com>
      Tested-by: Gurudas Pai <gurudas.pai@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: xattr: fix inlined xattr reflink · ef962df0
      Junxiao Bi authored
      An inlined xattr shares the free space of the inode block with inlined
      data or data extent records, so the size of the latter two must be
      adjusted when inlined xattr is enabled.  See ocfs2_xattr_ibody_init().
      But this isn't done properly on reflink.  For an inode with inlined
      data, the maximum inlined data size is adjusted in
      ocfs2_duplicate_inline_data(), so there is no problem.  But for an inode
      with data extent records, the record count isn't adjusted.  Fix it;
      otherwise data extent records and inlined xattrs may overwrite each
      other, causing data corruption or xattr failure.
      
      One panic caused by this bug in our test environment is the following:
      
        kernel BUG at fs/ocfs2/xattr.c:1435!
        invalid opcode: 0000 [#1] SMP
        Pid: 10871, comm: multi_reflink_t Not tainted 2.6.39-300.17.1.el5uek #1
        RIP: ocfs2_xa_offset_pointer+0x17/0x20 [ocfs2]
        RSP: e02b:ffff88007a587948  EFLAGS: 00010283
        RAX: 0000000000000000 RBX: 0000000000000010 RCX: 00000000000051e4
        RDX: ffff880057092060 RSI: 0000000000000f80 RDI: ffff88007a587a68
        RBP: ffff88007a587948 R08: 00000000000062f4 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000010
        R13: ffff88007a587a68 R14: 0000000000000001 R15: ffff88007a587c68
        FS:  00007fccff7f06e0(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
        CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 00000000015cf000 CR3: 000000007aa76000 CR4: 0000000000000660
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process multi_reflink_t
        Call Trace:
          ocfs2_xa_reuse_entry+0x60/0x280 [ocfs2]
          ocfs2_xa_prepare_entry+0x17e/0x2a0 [ocfs2]
          ocfs2_xa_set+0xcc/0x250 [ocfs2]
          ocfs2_xattr_ibody_set+0x98/0x230 [ocfs2]
          __ocfs2_xattr_set_handle+0x4f/0x700 [ocfs2]
          ocfs2_xattr_set+0x6c6/0x890 [ocfs2]
          ocfs2_xattr_user_set+0x46/0x50 [ocfs2]
          generic_setxattr+0x70/0x90
          __vfs_setxattr_noperm+0x80/0x1a0
          vfs_setxattr+0xa9/0xb0
          setxattr+0xc3/0x120
          sys_fsetxattr+0xa8/0xd0
          system_call_fastpath+0x16/0x1b
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Jie Liu <jeff.liu@oracle.com>
      Acked-by: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix readonly issue in ocfs2_unlink() · b5a8bb71
      Younger Liu authored
      ocfs2_unlink() has a bug that can leave the filesystem read-only when
      a file is deleted.
      
      After ocfs2_orphan_add() is called, the file to be deleted has been
      added to the orphan dir.  If ocfs2_delete_entry() then fails, the file
      still exists in the parent dir, and this scenario introduces a metadata
      conflict.
      
      If a file has been added to the orphan dir, then when we put the inode of
      the file with iput(), OCFS2_VALID_FL is cleared from the inode's i_flags
      (~OCFS2_VALID_FL) in ocfs2_remove_inode(), and the inode is then written
      back to disk.
      
      But as previously mentioned, the file still exists in the parent dir.
      On other nodes, the file can still be accessed.  When a node first reads
      the file from disk with ocfs2_read_blocks(), it checks and validates the
      inode using ocfs2_validate_inode_block().  The filesystem then becomes
      read-only because the inode is invalid: OCFS2_VALID_FL has been cleared
      from its i_flags (~OCFS2_VALID_FL).
      
      [akpm@linux-foundation.org: cleanups]
      [jeff.liu@oracle.com: s/inode_is_unlinkable/ocfs2_inode_is_unlinkable/]
      Signed-off-by: Younger Liu <younger.liu@huawei.com>
      Signed-off-by: Jensen <shencanquan@huawei.com>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: remove duplicated mlog_errno() in ocfs2_relink_block_group · 25e28921
      Andrew Morton authored
      Cc: Jie Liu <jeff.liu@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: Younger Liu <younger.liu@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: rework transaction rollback in ocfs2_relink_block_group() · 49309841
      Jie Liu authored
      In ocfs2_relink_block_group(), we roll back all changes whenever
      declaring the intent to modify a buffer for a metadata update fails,
      even if the relevant buffer has not yet been modified or dirtied at that
      point.  That is not quite right because:
      
       - No buffer has been modified or dirtied if ocfs2_journal_access_gd()
         fails against the previous block group buffer
      
       - Only the previous block group buffer has been dirtied if
         ocfs2_journal_access_gd() fails against the block group buffer
      
       - There is no need to roll back the change for the file entry buffer at all
      
      These rollbacks cause no harm but are unnecessary.  This patch fixes
      them and removes the useless bg_ptr variable as well.
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Cc: Younger Liu <younger.liu@huawei.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: need rollback when journal_access failed in ocfs2_orphan_add() · ea45466a
      Younger Liu authored
      While adding a file into the orphan dir, ocfs2_orphan_add() calls
      __ocfs2_add_entry() before ocfs2_journal_access_di().  If
      ocfs2_journal_access_di() fails, the file has been added into the orphan
      dir and the orphan dir dinode updated, but the file dinode has not been
      updated.  Accordingly, the data is not consistent between the file dinode
      and the orphan dir.
      
      So we need to call ocfs2_journal_access_di() before __ocfs2_add_entry(),
      and if ocfs2_journal_access_di() fails, orphan_fe and
      orphan_dir_inode->i_nlink need to be rolled back.
      
      This bug was added by 3939fda4 ("Ocfs2: Journaling i_flags and
      i_orphaned_slot when adding inode to orphan dir.").
      Signed-off-by: Younger Liu <younger.liu@huawei.com>
      Acked-by: Jeff Liu <jeff.liu@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: dlmlock_master() should return DLM_NORMAL after adding lock to blocked list · 096b2ef8
      Xue jiufei authored
      dlmlock_master() returns DLM_RECOVERING/DLM_MIGRATING/DLM_FORWARD after
      adding the lock to the blocked list if the lockres has the state
      DLM_LOCK_RES_RECOVERING/DLM_LOCK_RES_MIGRATING/DLM_LOCK_RES_IN_PROGRESS,
      so dlmlock() will retry.  This may cause dlm_thread to fall into an
      infinite loop:
      
      	Thread1                                  dlm_thread
      
        calls dlm_lock->dlmlock_master,
        if lockresA is in state
        DLM_LOCK_RES_RECOVERING, calls
        __dlm_wait_on_lockres() and waits
        until other threads clear this
        state;
      
        If cannot grant this lock,
        adding lock to blocked list,
        and return DLM_RECOVERING;
      
                                              Grant this lock and move it to
                                              grant list;
      
        After a while, retry and
        calls list_add_tail(), adding lock
        to blocked list again.
      
      The granted and blocked lists of this lockres will end up in the
      following state:
      
          lock_res->granted.next = dlm_lock->list_head;
          lock_res->blocked.next = dlm_lock->list_head;
          dlm_lock->list_head.next = dlm_lock_resource->blocked;
      
      When dlm_thread traverses the granted list, it will fall into an endless
      loop, checking dlm_lock.list_head, dlm_lock->list_head.next
      (i.e. lock_res->blocked), lock_res->blocked.next (i.e. dlm_lock.list_head
      again) .....
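      
      The corruption can be reproduced in user space with a minimal circular
      doubly linked list.  The sketch below reimplements list_add_tail() and
      list_del() with the usual kernel semantics and replays the sequence
      described above.  walk_terminates() and granted_walk_ok() are
      hypothetical helpers for this example, and the step bound exists only so
      the demonstration itself cannot hang.

```c
#include <assert.h>

struct list_head {
	struct list_head *next, *prev;
};

void list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Links entry just before head, i.e. at the tail of the list. */
void list_add_tail(struct list_head *entry, struct list_head *head)
{
	struct list_head *prev = head->prev;

	entry->next = head;
	entry->prev = prev;
	prev->next = entry;
	head->prev = entry;
}

void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* Returns 1 if a walk from head gets back to head within max_steps. */
int walk_terminates(const struct list_head *head, int max_steps)
{
	const struct list_head *pos = head->next;

	assert(max_steps > 0);
	while (max_steps-- > 0) {
		if (pos == head)
			return 1;
		pos = pos->next;
	}
	return 0;
}

/* Replays the sequence; retry != 0 re-adds the already granted lock. */
int granted_walk_ok(int retry)
{
	struct list_head granted, blocked, lock;

	list_init(&granted);
	list_init(&blocked);
	list_add_tail(&lock, &blocked);	/* Thread1 queues the lock */
	list_del(&lock);		/* dlm_thread grants it ... */
	list_add_tail(&lock, &granted);	/* ... and moves it to granted */
	if (retry)
		list_add_tail(&lock, &blocked);	/* Thread1 retries */
	return walk_terminates(&granted, 64);
}
```

      After the retry, the lock's next pointer leads to the blocked head and
      back to the lock, so a walk that starts at the granted head never
      returns to it.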
      Signed-off-by: joyce <xuejiufei@huawei.com>
      Reviewed-by: jensen <shencanquan@huawei.com>
      Cc: Jeff Liu <jeff.liu@oracle.com>
      Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: xattr: remove useless free space checking · b30f14c4
      Junxiao Bi authored
      Free space checking will be done in ocfs2_xattr_ibody_init().  So remove
      it here.
      
      [akpm@linux-foundation.org: remove unused local]
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Jie Liu <jeff.liu@oracle.com>
      Acked-by: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>