1. 01 Aug, 2012 17 commits
    • mm: compaction: trivial clean up in acct_isolated() · f665a680
      Minchan Kim authored
      commit b9e84ac1 upstream.
      
      Stable note: Not tracked in Bugzilla. This patch makes later patches
      	easier to apply but has no other impact.
      
      acct_isolated() of compaction uses page_lru_base_type(), which returns only
      the base type of the LRU list, so it never returns LRU_ACTIVE_ANON or
      LRU_ACTIVE_FILE.  In addition, cc->nr_[anon|file] is used only in
      acct_isolated(), so there is no need to keep these fields in compact_control.

      This patch removes the fields from compact_control and makes the role of
      acct_isolated() clear: it counts the number of anon|file pages isolated.
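
      For illustration only, the cleaned-up helper can derive the anon/file
      split directly from the migration list instead of carrying counters in
      compact_control; a minimal sketch along those lines (not necessarily the
      exact upstream code):

        static void acct_isolated(struct zone *zone, struct compact_control *cc)
        {
                struct page *page;
                unsigned int count[2] = { 0, };

                /* Walk the pages isolated for migration and classify each one. */
                list_for_each_entry(page, &cc->migratepages, lru)
                        count[!!page_is_file_cache(page)]++;

                /* Account the isolated pages against the zone counters. */
                __mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
                __mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
        }
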
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f665a680
    • vmscan: abort reclaim/compaction if compaction can proceed · 4682e89d
      Mel Gorman authored
      commit e0c23279 upstream.
      
      Stable note: Not tracked on Bugzilla. THP and compaction were found to
              aggressively reclaim pages and stall systems under various
              situations; this was addressed piecemeal over time.
      
      If compaction can proceed, shrink_zones() stops doing any work but its
      callers still call shrink_slab(), which raises the priority and potentially
      sleeps.  This is unnecessary and wasteful, so this patch aborts direct
      reclaim/compaction entirely if compaction can proceed.
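
      A minimal sketch of the intended control flow (helper names such as
      reclaim_slab_and_writeback() are placeholders, not the upstream code):
      shrink_zones() reports that compaction can proceed and its caller bails
      out instead of raising the priority and calling shrink_slab():

        static unsigned long direct_reclaim_loop(struct zonelist *zonelist,
                                                 struct scan_control *sc)
        {
                int priority;

                for (priority = DEF_PRIORITY; priority >= 0; priority--) {
                        bool aborted = shrink_zones(priority, zonelist, sc);

                        if (aborted)
                                break;  /* compaction can proceed: stop reclaiming */

                        /* Otherwise keep going: slab shrinking, writeback, etc. */
                        reclaim_slab_and_writeback(sc);         /* placeholder */
                        if (sc->nr_reclaimed >= sc->nr_to_reclaim)
                                break;
                }

                return sc->nr_reclaimed;
        }
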
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Johannes Weiner <jweiner@redhat.com>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4682e89d
    • vmscan: limit direct reclaim for higher order allocations · 4d472406
      Rik van Riel authored
      commit e0887c19 upstream.
      
      Stable note: Not tracked on Bugzilla. THP and compaction were found to
              aggressively reclaim pages and stall systems under various
              situations; this was addressed piecemeal over time.  Paragraph
              3 of this changelog is the motivation for this patch.
      
      When suffering from memory fragmentation due to unfreeable pages, THP page
      faults will repeatedly try to compact memory.  Due to the unfreeable
      pages, compaction fails.
      
      Needless to say, at that point page reclaim also fails to create free
      contiguous 2MB areas.  However, that doesn't stop the current code from
      trying, over and over again, and freeing a minimum of 4MB (2UL <<
      sc->order pages) at every single invocation.
      
      This resulted in my 12GB system having 2-3GB free memory, a corresponding
      amount of used swap and very sluggish response times.
      
      This can be avoided by having the direct reclaim code not reclaim from
      zones that already have plenty of free memory available for compaction.
      
      If compaction still fails due to unmovable memory, doing additional
      reclaim will only hurt the system, not help.
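
      As a sketch of the check this implies (simplified, not the exact
      upstream hunk), shrink_zones() can skip zones that are already suitable
      for compaction when reclaiming for a costly high-order allocation:

        static void shrink_zones_sketch(int priority, struct zonelist *zonelist,
                                        struct scan_control *sc)
        {
                struct zoneref *z;
                struct zone *zone;

                for_each_zone_zonelist(zone, z, zonelist, gfp_zone(sc->gfp_mask)) {
                        if (!populated_zone(zone))
                                continue;

                        /*
                         * For costly orders, a zone that already has enough
                         * free memory for compaction gains nothing from more
                         * reclaim, so leave it alone and let compaction run.
                         */
                        if (COMPACTION_BUILD &&
                            sc->order > PAGE_ALLOC_COSTLY_ORDER &&
                            compaction_suitable(zone, sc->order) == COMPACT_CONTINUE)
                                continue;

                        shrink_zone(priority, zone, sc);
                }
        }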
      
      [jweiner@redhat.com: change comment to explain the order check]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <jweiner@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4d472406
    • vmscan: reduce wind up shrinker->nr when shrinker can't do work · 7554e344
      Dave Chinner authored
      commit 3567b59a upstream.
      
      Stable note: Not tracked in Bugzilla. This patch reduces excessive
              reclaim of slab objects, reducing the amount of information that
              has to be brought back in from disk. The third and fourth
              paragraphs in the series describe the impact.
      
      When a shrinker returns -1 to shrink_slab() to indicate it cannot do
      any work given the current memory reclaim requirements, it adds the
      entire total_scan count to shrinker->nr. The idea behind this is that
      when the shrinker is next called and can do work, it will do the work
      of the previously aborted shrinker call as well.
      
      However, if a filesystem is doing lots of allocation with GFP_NOFS
      set, then we get many, many more aborts from the shrinkers than we
      do successful calls. The result is that shrinker->nr winds up to
      its maximum permissible value (twice the current cache size) and
      then when the next shrinker call that can do work is issued, it
      has enough scan count built up to free the entire cache twice over.
      
      This manifests itself in the cache going from full to empty in a
      matter of seconds, even when only a small part of the cache needs
      to be emptied to free sufficient memory.
      
      Under metadata intensive workloads on ext4 and XFS, I'm seeing the
      VFS caches increase memory consumption up to 75% of memory (no page
      cache pressure) over a period of 30-60s, and then the shrinker
      empties them down to zero in the space of 2-3s. This cycle repeats
      over and over again, with the shrinker completely trashing the inode
      and dentry caches every minute or so for as long as the workload continues.
      
      This behaviour was made obvious by the shrink_slab tracepoints added
      earlier in the series, and made worse by the patch that corrected
      the concurrent accounting of shrinker->nr.
      
      To avoid this problem, stop repeated small increments of the total
      scan value from winding shrinker->nr up to a value that can cause
      the entire cache to be freed. We still need to allow it to wind up,
      so use the delta as the "large scan" threshold check - if the delta
      is more than a quarter of the entire cache size, then it is a large
      scan and is allowed to cause lots of windup because we clearly need
      to free lots of memory.
      
      If it isn't a large scan then limit the total scan to half the size
      of the cache so that windup never increases to consume the whole
      cache. Reducing the total scan limit further does not allow enough
      wind-up to maintain the current levels of performance, whilst a
      higher threshold does not prevent the windup from freeing the entire
      cache under sustained workloads.
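
      The clamp described above boils down to a few lines inside the
      shrink_slab() loop; a sketch using the variable names of that era
      ('delta' is this call's own scan count, 'max_pass' the current cache
      size, 'total_scan' includes the deferred work from shrinker->nr):

        /* Small deltas must not let deferred work empty the cache. */
        if (delta < max_pass / 4)
                total_scan = min(total_scan, max_pass / 2);

        /* The existing hard cap on windup still applies. */
        if (total_scan > max_pass * 2)
                total_scan = max_pass * 2;
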
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7554e344
    • vmscan: shrinker->nr updates race and go wrong · 6a5091a0
      Dave Chinner authored
      commit acf92b48 upstream.
      
      Stable note: Not tracked in Bugzilla. This patch reduces excessive
      	reclaim of slab objects reducing the amount of information
      	that has to be brought back in from disk.
      
      shrink_slab() allows shrinkers to be called in parallel so the
      struct shrinker can be updated concurrently. It does not provide any
      exclusion for such updates, so we can get the shrinker->nr value
      increasing or decreasing incorrectly.
      
      As a result, when a shrinker repeatedly returns a value of -1 (e.g.
      a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
      sometimes updating with the scan count that wasn't used, sometimes
      losing it altogether. Worse is when a shrinker does work and that
      update is lost due to racy updates, which means the shrinker will do
      the work again!
      
      Fix this by making the total_scan calculations independent of
      shrinker->nr, and making the shrinker->nr updates atomic w.r.t.
      other updates via cmpxchg loops.
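
      A sketch of that scheme (fragment, simplified): take ownership of the
      deferred count up front, work against a local total_scan, and merge the
      unused remainder back with a cmpxchg loop so concurrent callers neither
      lose nor double-count work:

        long nr, new_nr;

        /* Atomically claim the deferred scan count and zero it. */
        do {
                nr = shrinker->nr;
        } while (cmpxchg(&shrinker->nr, nr, 0) != nr);

        total_scan = nr + delta;        /* local only, never shrinker->nr */

        /* ... perform the actual scanning against total_scan ... */

        /* Return whatever was not used, tolerating concurrent updates. */
        do {
                nr = shrinker->nr;
                new_nr = nr + total_scan;
        } while (total_scan > 0 && cmpxchg(&shrinker->nr, nr, new_nr) != nr);
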
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6a5091a0
    • vmscan: add shrink_slab tracepoints · 5e5b3d2e
      Dave Chinner authored
      commit 09576073 upstream.
      
      Stable note: This patch makes later patches easier to apply but otherwise
              has little to justify it. It is a diagnostic patch that was part
              of a series addressing excessive slab shrinking after GFP_NOFS
              failures. There is detailed information on the series' motivation
              at https://lkml.org/lkml/2011/6/2/42 .
      
      It is impossible to understand what the shrinkers are actually doing
      without instrumenting the code, so add some tracepoints to allow
      insight to be gained.
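
      As a rough illustration of the kind of instrumentation meant here (a
      trimmed-down, hypothetical event; the real ones carry more fields), a
      shrink_slab tracepoint can expose the shrinker and its scan counts to
      the tracing infrastructure:

        TRACE_EVENT(mm_shrink_slab_sketch,

                TP_PROTO(struct shrinker *shr, long nr_objects, long total_scan),

                TP_ARGS(shr, nr_objects, total_scan),

                TP_STRUCT__entry(
                        __field(struct shrinker *, shr)
                        __field(long,   nr_objects)
                        __field(long,   total_scan)
                ),

                TP_fast_assign(
                        __entry->shr            = shr;
                        __entry->nr_objects     = nr_objects;
                        __entry->total_scan     = total_scan;
                ),

                TP_printk("shrinker=%p nr_objects=%ld total_scan=%ld",
                          __entry->shr, __entry->nr_objects, __entry->total_scan)
        );
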
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      5e5b3d2e
    • vmscan: clear ZONE_CONGESTED for zone with good watermark · 564ea9dd
      Shaohua Li authored
      commit 439423f6 upstream.
      
      Stable note: Not tracked in Bugzilla. kswapd is responsible for clearing
      	ZONE_CONGESTED after it balances a zone and this patch fixes a bug
      	where that was failing to happen. Without this patch, processes
      	can stall in wait_iff_congested unnecessarily. For users, this can
      	look like an interactivity stall but some workloads would see it
              as a sudden drop in throughput.
      
      ZONE_CONGESTED is only cleared in kswapd, but pages can be freed in any
      task.  It's possible ZONE_CONGESTED isn't cleared in some cases:
      
       1. the zone is already balanced when entering balance_pgdat() for
          order-0 because concurrent tasks have freed memory.  In this case,
          the later check will skip the zone as it is balanced, so the flag
          isn't cleared.

       2. high order balancing falls back to order-0.  Quote from Mel: at the
          end of balance_pgdat(), kswapd uses the following logic;
      
      	If reclaiming at high order {
      		for each zone {
      			if all_unreclaimable
      				skip
      			if watermark is not met
      				order = 0
      				loop again
      
      			/* watermark is met */
      			clear congested
      		}
      	}
      
          i.e. it clears ZONE_CONGESTED if the zone is balanced.  If not,
          it restarts balancing at order-0.  However, if the higher zones are
          balanced for order-0, kswapd will miss clearing ZONE_CONGESTED as
          that only happens after a zone is shrunk.  This can mean that
          wait_iff_congested() stalls unnecessarily.
      
      This patch makes kswapd clear ZONE_CONGESTED during its initial
      highmem->dma scan for zones that are already balanced.
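
      A sketch of where the fix sits in kswapd's initial highmem->dma scan in
      balance_pgdat() (simplified):

        if (!zone_watermark_ok_safe(zone, order,
                                    high_wmark_pages(zone), 0, 0)) {
                end_zone = i;           /* zone still needs balancing */
                break;
        } else {
                /*
                 * The zone is already balanced: there is nothing to reclaim,
                 * but clear the congested flag so tasks sitting in
                 * wait_iff_congested() are not stalled needlessly.
                 */
                zone_clear_flag(zone, ZONE_CONGESTED);
        }
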
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      564ea9dd
    • mm: vmscan: fix force-scanning small targets without swap · 33c17eaf
      Johannes Weiner authored
      commit a4d3e9e7 upstream.
      
      Stable note: Not tracked in Bugzilla. This patch augments an earlier commit
              that avoids the scanning priority being artificially raised. The older
      	fix was particularly important for small memcgs to avoid calling
      	wait_iff_congested() unnecessarily.
      
      Without swap, anonymous pages are not scanned.  As such, they should not
      count when considering force-scanning a small target if there is no swap.
      
      Otherwise, targets are not force-scanned even when their effective scan
      number is zero and the other conditions (kswapd/memcg) apply.
      
      This fixes 246e87a9 ("memcg: fix get_scan_count() for small
      targets").
      
      [akpm@linux-foundation.org: fix comment]
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      33c17eaf
    • mm: reduce the amount of work done when updating min_free_kbytes · 71a07f4c
      Mel Gorman authored
      commit 938929f1 upstream.
      
      Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=726210 .
              Large machines with 1TB or more of RAM take a long time to boot
              without this patch and may spew out soft lockup warnings.
      
      When min_free_kbytes is updated, some pageblocks are marked
      MIGRATE_RESERVE.  Ordinarily, this work is unnoticeable as it happens early
      in boot but on large machines with 1TB of memory, this has been reported
      to delay boot times, probably due to the NUMA distances involved.
      
      The bulk of the work is due to calling pageblock_is_reserved() an
      unnecessary number of times and accessing far more struct page metadata
      than is necessary.  This patch significantly reduces the amount of work
      done by setup_zone_migrate_reserve(), improving boot times on 1TB machines.
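
      A simplified sketch of the loop in setup_zone_migrate_reserve() after
      the change: the expensive per-pageblock tests only run while reserve
      blocks are still needed:

        for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
                if (!pfn_valid(pfn))
                        continue;
                page = pfn_to_page(pfn);
                block_migratetype = get_pageblock_migratetype(page);

                /* Only do the expensive checks while reserves are needed. */
                if (reserve > 0) {
                        if (pageblock_is_reserved(pfn, pfn + pageblock_nr_pages))
                                continue;       /* never freed, skip it */

                        if (block_migratetype == MIGRATE_RESERVE) {
                                reserve--;      /* existing reserve block */
                                continue;
                        }
                        if (block_migratetype == MIGRATE_MOVABLE) {
                                set_pageblock_migratetype(page, MIGRATE_RESERVE);
                                move_freepages_block(zone, page, MIGRATE_RESERVE);
                                reserve--;
                                continue;
                        }
                }

                /* Reserves are met: just demote leftover reserve blocks. */
                if (block_migratetype == MIGRATE_RESERVE) {
                        set_pageblock_migratetype(page, MIGRATE_MOVABLE);
                        move_freepages_block(zone, page, MIGRATE_MOVABLE);
                }
        }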
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      71a07f4c
    • mm: memory hotplug: Check if pages are correctly reserved on a per-section basis · 1126e709
      Mel Gorman authored
      commit 2bbcb878 upstream.
      
      Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=721039 .
              Without the patch, memory hot-add can fail for kernel configurations
              that do not set CONFIG_SPARSEMEM_VMEMMAP.
      
      It is expected that memory being brought online is PageReserved
      similar to what happens when the page allocator is being brought up.
      Memory is onlined in "memory blocks" which consist of one or more
      sections. Unfortunately, the code that verifies PageReserved is
      currently assuming that the memmap backing all these pages is virtually
      contiguous which is only the case when CONFIG_SPARSEMEM_VMEMMAP is set.
      As a result, memory hot-add is failing on those configurations with
      the message:
      
      kernel: section number XXX page number 256 not reserved, was it already online?
      
      This patch updates the PageReserved check to look up the struct page once
      per section to guarantee the correct struct page is being checked.
      
      [Check pages within sections properly: rientjes@google.com]
      [original patch by: nfont@linux.vnet.ibm.com]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Tested-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      1126e709
    • mm/vmstat.c: cache align vm_stat · 9116bc4f
      Dimitri Sivanich authored
      commit a1cb2c60 upstream.
      
      Stable note: Not tracked on Bugzilla. This patch is known to make a big
              difference to tmpfs performance on larger machines.
      
      Sharing a cache line for the vm_stat array was found to adversely affect
      tmpfs I/O performance; cache aligning it avoids that.
      
      Tests run on a 640 cpu UV system.
      
      With 120 threads doing parallel writes, each to different tmpfs mounts:
      No patch:		~300 MB/sec
      With vm_stat alignment:	~430 MB/sec
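
      The change itself is essentially a one-line annotation; a sketch of the
      declaration:

        /* Give the hot global counter array its own cacheline(s) so frequent
         * updates from many CPUs do not false-share with unrelated data. */
        atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
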
      Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
      Acked-by: Christoph Lameter <cl@gentwo.org>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      9116bc4f
    • dm raid1: fix crash with mirror recovery and discard · fbb41f55
      Mikulas Patocka authored
      commit 751f188d upstream.
      
      This patch fixes a crash when a discard request is sent during mirror
      recovery.
      
      Firstly, some background.  Generally, the following sequence happens during
      mirror synchronization:
      - function do_recovery is called
      - do_recovery calls dm_rh_recovery_prepare
      - dm_rh_recovery_prepare uses a semaphore to limit the number of
        simultaneously recovered regions (by default the semaphore value is 1,
        so only one region at a time is recovered)
      - dm_rh_recovery_prepare calls __rh_recovery_prepare,
        __rh_recovery_prepare asks the log driver for the next region to
        recover. Then, it sets the region state to DM_RH_RECOVERING. If there
        are no pending I/Os on this region, the region is added to
        quiesced_regions list. If there are pending I/Os, the region is not
        added to any list. It is added to the quiesced_regions list later (by
        dm_rh_dec function) when all I/Os finish.
      - when the region is on quiesced_regions list, there are no I/Os in
        flight on this region. The region is popped from the list in
        dm_rh_recovery_start function. Then, a kcopyd job is started in the
        recover function.
      - when the kcopyd job finishes, recovery_complete is called. It calls
        dm_rh_recovery_end. dm_rh_recovery_end adds the region to
        recovered_regions or failed_recovered_regions list (depending on
        whether the copy operation was successful or not).
      
      The above mechanism assumes that if the region is in DM_RH_RECOVERING
      state, no new I/Os are started on this region. When I/O is started,
      dm_rh_inc_pending is called, which increases reg->pending count. When
      I/O is finished, dm_rh_dec is called. It decreases reg->pending count.
      If the count is zero and the region was in DM_RH_RECOVERING state,
      dm_rh_dec adds it to the quiesced_regions list.
      
      Consequently, if we call dm_rh_inc_pending/dm_rh_dec while the region is
      in DM_RH_RECOVERING state, it could be added to quiesced_regions list
      multiple times or it could be added to this list when kcopyd is copying
      data (it is assumed that the region is not on any list while kcopyd does
      its jobs). This results in memory corruption and crash.
      
      There already exist bypasses for REQ_FLUSH requests: REQ_FLUSH requests
      do not belong to any region, so they are always added to the sync list
      in do_writes. dm_rh_inc_pending does not increase count for REQ_FLUSH
      requests. In mirror_end_io, dm_rh_dec is never called for REQ_FLUSH
      requests. These bypasses avoid the crash possibility described above.
      
      These bypasses were improperly implemented for REQ_DISCARD when
      the mirror target gained discard support in commit
      5fc2ffea (dm raid1: support discard).
      
      In do_writes, REQ_DISCARD requests are always added to the sync queue and
      immediately dispatched (even if the region is in DM_RH_RECOVERING).  However,
      dm_rh_inc and dm_rh_dec are called for REQ_DISCARD requests.  This violates
      the rule that no I/Os are started on DM_RH_RECOVERING regions and causes the
      list corruption described above.
      
      This patch changes it so that REQ_DISCARD requests follow the same path
      as REQ_FLUSH. This avoids the crash.
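
      The shape of the change can be sketched as follows (simplified):
      REQ_DISCARD bios take the same bypass as REQ_FLUSH bios, so they never
      bump a region's pending count and dm_rh_dec() is never called for them:

        /* dm_rh_inc_pending(): skip discards exactly like flushes. */
        for (bio = bios->head; bio; bio = bio->bi_next) {
                if (bio->bi_rw & (REQ_FLUSH | REQ_DISCARD))
                        continue;
                rh_inc(rh, dm_rh_bio_to_region(rh, bio));
        }

        /* mirror_end_io(): likewise, never dec pending for flush/discard. */
        if (rw == WRITE) {
                if (!(bio->bi_rw & (REQ_FLUSH | REQ_DISCARD)))
                        dm_rh_dec(ms->rh, map_context->ll);
                return error;
        }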
      
      Reference: https://bugzilla.redhat.com/837607
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fbb41f55
    • UBIFS: fix a bug in empty space fix-up · cd050f56
      Artem Bityutskiy authored
      commit c6727932 upstream.
      
      UBIFS has a feature called "empty space fix-up", which is a quirk to work
      around limitations of dumb flasher programs, namely those flashers that are
      unable to skip NAND pages full of 0xFFs while flashing, which leaves the
      empty space at the end of half-filled eraseblocks unusable for UBIFS. This
      feature is relatively new (introduced in v3.0).
      
      The fix-up routine (fixup_free_space()) is executed only once at the very first
      mount if the superblock has the 'space_fixup' flag set (can be done with -F
      option of mkfs.ubifs). It basically reads all the UBIFS data and metadata and
      writes it back to the same LEB. The routine assumes the image is pristine and
      does not have anything in the journal.
      
      There was a bug in 'fixup_free_space()' where it fixed up the log incorrectly.
      All but one LEB of the log of a pristine file-system are empty; the remaining
      one contains just a commit start node. 'fixup_free_space()' simply unmapped
      this LEB, which wiped the commit start node. As a result, some users were
      unable to mount the file-system next time with the following symptom:
      
      UBIFS error (pid 1): replay_log_leb: first log node at LEB 3:0 is not CS node
      UBIFS error (pid 1): replay_log_leb: log error detected while replaying the log at LEB 3:0
      
      The root-cause of this bug was that 'fixup_free_space()' wrongly assumed
      that the beginning of empty space in the log head (c->lhead_offs) was known
      on mount. However, that is not the case - it was always 0. UBIFS does not
      store it in the master node and finds it out by scanning the log on every
      mount.
      
      The fix is simple - just pass commit start node size instead of 0 to
      'fixup_leb()'.
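
      In code terms, the change amounts to something like the following (the
      exact alignment may differ): tell 'fixup_leb()' that the log head
      already contains a commit start node rather than nothing:

        /* The log head LEB holds a CS node at offset 0; preserve it. */
        err = fixup_leb(c, c->lhead_lnum,
                        ALIGN(UBIFS_CS_NODE_SZ, c->min_io_size));
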
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@linux.intel.com>
      Reported-by: Iwo Mergler <Iwo.Mergler@netcommwireless.com>
      Tested-by: Iwo Mergler <Iwo.Mergler@netcommwireless.com>
      Reported-by: James Nute <newten82@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cd050f56
    • MIPS: Properly align the .data..init_task section. · 689415c1
      David Daney authored
      commit 7b1c0d26 upstream.
      
      Improper alignment can lead to unbootable systems and/or random
      crashes.
      
      [ralf@linux-mips.org: This is a long-standing bug since
      6eb10bc9 (kernel.org) resp.
      c422a10917f75fd19fa7fe070aaaa23e384dae6f (lmo) [MIPS: Clean up linker script
      using new linker script macros.], so it dates back to 2.6.32.]
      Signed-off-by: David Daney <david.daney@cavium.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/3881/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      689415c1
    • mm: fix lost kswapd wakeup in kswapd_stop() · 6d40de83
      Aaditya Kumar authored
      commit 1c7e7f6c upstream.
      
      Offlining memory may block forever, waiting for kswapd() to wake up
      because kswapd() does not check the event kthread->should_stop before
      sleeping.
      
      The proper pattern, from Documentation/memory-barriers.txt, is:
      
         ---  waker  ---
         event_indicated = 1;
         wake_up_process(event_daemon);
      
         ---  sleeper  ---
         for (;;) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            if (event_indicated)
               break;
            schedule();
         }
      
         set_current_state() may be wrapped by:
            prepare_to_wait();
      
      In the kswapd() case, event_indicated is kthread->should_stop.
      
        === offlining memory (waker) ===
         kswapd_stop()
            kthread_stop()
               kthread->should_stop = 1
               wake_up_process()
               wait_for_completion()
      
        ===  kswapd_try_to_sleep (sleeper) ===
         kswapd_try_to_sleep()
            prepare_to_wait()
                 .
                 .
            schedule()
                 .
                 .
            finish_wait()
      
      The schedule() needs to be protected by a test of kthread->should_stop,
      which is wrapped by kthread_should_stop().
      
      Reproducer:
         Do heavy file I/O in background.
         Do a memory offline/online in a tight loop
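
      A sketch of the resulting sleep path in kswapd_try_to_sleep() (fragment):
      the stop flag is re-checked after prepare_to_wait() and before actually
      sleeping, so a concurrent kswapd_stop() cannot be missed:

        DEFINE_WAIT(wait);

        prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
        if (!kthread_should_stop())     /* don't sleep past a pending stop */
                schedule();
        finish_wait(&pgdat->kswapd_wait, &wait);
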
      Signed-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6d40de83
    • ntp: Fix STA_INS/DEL clearing bug · dccecc64
      John Stultz authored
      commit 6b1859db upstream.
      
      In commit 6b43ae8a, I
      introduced a bug that kept the STA_INS or STA_DEL bit
      from being cleared from time_status via adjtimex()
      without forcing STA_PLL first.
      
      Usually once STA_INS is set, it isn't cleared
      until the leap second is applied, so it's unlikely this
      affected anyone. However, during testing I noticed it
      took some effort to cancel a leap second once STA_INS
      was set.
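
      For reference, the operation that was affected is the ordinary
      adjtimex() status update; a small, hypothetical user-space illustration
      of cancelling a pending leap second without touching STA_PLL:

        #include <stdio.h>
        #include <sys/timex.h>

        int main(void)
        {
                struct timex txc = { .modes = 0 };

                adjtimex(&txc);                         /* read the current status */

                txc.modes = ADJ_STATUS;
                txc.status &= ~(STA_INS | STA_DEL);     /* cancel a pending leap second */

                if (adjtimex(&txc) == -1)
                        perror("adjtimex");             /* writing needs CAP_SYS_TIME */
                else
                        printf("time_status is now 0x%x\n", txc.status);
                return 0;
        }
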
      Signed-off-by: John Stultz <johnstul@us.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Link: http://lkml.kernel.org/r/1342156917-25092-2-git-send-email-john.stultz@linaro.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      dccecc64
    • cifs: always update the inode cache with the results from a FIND_* · adccea44
      Jeff Layton authored
      commit cd60042c upstream.
      
      When we get back a FIND_FIRST/NEXT result, we have some info about the
      dentry that we use to instantiate a new inode. We were ignoring and
      discarding that info when we had an existing dentry in the cache.
      
      Fix this by updating the inode in place when we find an existing dentry
      and the uniqueid is the same.
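
      A sketch of the idea in the readdir lookup path (simplified, not the
      exact upstream hunk): when a dentry already exists and the uniqueid
      matches, the cached inode is refreshed in place from the FIND_* result:

        if (dentry && dentry->d_inode) {
                struct inode *inode = dentry->d_inode;

                if (CIFS_I(inode)->uniqueid == fattr->cf_uniqueid) {
                        cifs_fattr_to_inode(inode, fattr);  /* update in place */
                        return dentry;
                }
                /* uniqueid changed: fall through and build a new inode */
        }
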
      Reported-and-Tested-by: Andrew Bartlett <abartlet@samba.org>
      Reported-by: Bill Robertson <bill_robertson@debortoli.com.au>
      Reported-by: Dion Edwards <dion_edwards@debortoli.com.au>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Steve French <smfrench@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      adccea44
  2. 19 Jul, 2012 23 commits