1. 24 Apr, 2015 8 commits
    • writeback: add missing INITIAL_JIFFIES init in global_update_bandwidth() · 59881078
      Tejun Heo authored
      [ Upstream commit 7d70e154 ]
      
      global_update_bandwidth() uses the static variable update_time as the
      timestamp of the last update, but forgets to initialize it to
      INITIAL_JIFFIES.
      
      This means that update_time starts out 5 minutes in the future on 32-bit
      and a large number of jiffies in the past on 64-bit.  This isn't
      critical, as the only effect is that global_dirty_limit won't be updated
      for the first 5 minutes after booting on 32-bit machines, especially
      given the auxiliary nature of global_dirty_limit's role - protecting
      against sudden dips of the global dirty threshold; however, it does lead
      to unintended suboptimal behavior.  Fix it.
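
      For illustration, a minimal C sketch of the idea; the helper name and
      the interval constant are stand-ins (BANDWIDTH_INTERVAL lives in
      mm/page-writeback.c), not the verbatim upstream hunk:

        #include <linux/jiffies.h>      /* jiffies, INITIAL_JIFFIES, time_before() */

        /* Sketch only: the static timestamp must not start at 0. */
        static void global_update_bandwidth_sketch(unsigned long now)
        {
                /*
                 * Without the explicit initializer this is implicitly 0,
                 * which is ~5 minutes ahead of the initial jiffies value
                 * on 32-bit.
                 */
                static unsigned long update_time = INITIAL_JIFFIES;

                if (time_before(now, update_time + HZ / 4))     /* stand-in for BANDWIDTH_INTERVAL */
                        return;

                /* ... recompute global_dirty_limit here ... */
                update_time = now;
        }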
      
      Fixes: c42843f2 ("writeback: introduce smoothed global dirty limit")
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • cpufreq: Schedule work for the first-online CPU on resume · 5f44e397
      Viresh Kumar authored
      [ Upstream commit c75de0ac ]
      
      All CPUs other than the first-online CPU are hotplugged out on suspend,
      and the cpufreq core stops managing them.
      
      On resume, we need to call cpufreq_update_policy() for this CPU's policy
      to make sure its frequency is in sync with cpufreq's cached value, as it
      might have been updated by the hardware during suspend/resume.
      
      Policies are always added to the head of the policy list, so under
      normal circumstances CPU0's policy will be the last one in the list, and
      the code therefore checks the last policy.
      
      But there are cases where this fails.  Consider a quad-core system with
      a policy per core: if CPU0 is hotplugged out and added back again, the
      last policy in the list will be CPU1's :(
      
      To fix this properly, always look up the policy of the first online CPU.
      That way we can be sure we are calling cpufreq_update_policy() for the
      only CPU that wasn't hotplugged out.
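
      A minimal sketch of the approach, not the exact upstream hunk; the
      function name is illustrative, while cpufreq_cpu_get()/cpufreq_cpu_put()
      and the policy->update work item are taken from the cpufreq core of this
      era:

        #include <linux/cpufreq.h>
        #include <linux/cpumask.h>
        #include <linux/workqueue.h>

        /* Sketch: sync the policy of the first online CPU on resume. */
        static void cpufreq_resume_sync_sketch(void)
        {
                struct cpufreq_policy *policy;

                /* Don't assume CPU0's policy is the last list entry. */
                policy = cpufreq_cpu_get(cpumask_first(cpu_online_mask));
                if (WARN_ON(!policy))
                        return;

                /* policy->update is the work item that ends up calling
                 * cpufreq_update_policy() for this policy. */
                schedule_work(&policy->update);
                cpufreq_cpu_put(policy);
        }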
      
      Cc: 3.15+ <stable@vger.kernel.org> # 3.15+
      Fixes: 2f0aea93 ("cpufreq: suspend governors on system suspend/hibernate")
      Reported-by: Saravana Kannan <skannan@codeaurora.org>
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Acked-by: Saravana Kannan <skannan@codeaurora.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • sched: Fix RLIMIT_RTTIME when PI-boosting to RT · 42d921d3
      Brian Silverman authored
      [ Upstream commit 746db944 ]
      
      When non-realtime tasks get priority-inheritance boosted to a realtime
      scheduling class, RLIMIT_RTTIME starts to apply to them.  However, the
      counter used for checking this (the same one used for SCHED_RR
      timeslices) was not getting reset.  This meant that tasks running under
      a non-realtime scheduling class which are repeatedly boosted to a
      realtime one, but never block while running as realtime, eventually hit
      the limit without ever having run over it in a single boosted stretch.
      This patch resets the realtime timeslice counter when un-PI-boosting
      from an RT to a non-RT scheduling class.
      
      I have test code with two threads and a shared PTHREAD_PRIO_INHERIT
      mutex that induces priority boosting and spins while boosted; it gets
      killed by SIGXCPU on unpatched kernels but not with this patch applied.
      The failure shows up much faster on a CONFIG_PREEMPT_RT kernel, and
      eventually on PREEMPT_VOLUNTARY kernels as well.
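
      A sketch of the core of the fix, with an illustrative helper name; the
      real hunk sits in the PI de-boost path of kernel/sched/core.c:

        #include <linux/sched.h>
        #include <linux/sched/rt.h>     /* rt_prio() */

        /* Sketch: when dropping from an RT priority back to a non-RT one,
         * clear the counter shared by RLIMIT_RTTIME and SCHED_RR timeslices. */
        static void deboost_from_rt_sketch(struct task_struct *p, int oldprio)
        {
                if (rt_prio(oldprio))
                        p->rt.timeout = 0;
        }
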
      Signed-off-by: Brian Silverman <brian@peloton-tech.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: austin@peloton-tech.com
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1424305436-6716-1-git-send-email-brian@peloton-tech.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • mm/page_alloc.c: call kernel_map_pages in unset_migratetype_isolate · a38edfb2
      Laura Abbott authored
      [ Upstream commit cfa86943 ]
      
      Commit 3c605096 ("mm/page_alloc: restrict max order of merging on
      isolated pageblock") changed the logic of unset_migratetype_isolate to
      check the buddy allocator and explicitly call __free_pages to merge.
      
      The page being freed in this path never had prep_new_page() called on
      it, so set_page_refcounted() is called explicitly, but there is no call
      to kernel_map_pages().  With the default kernel_map_pages() this is
      mostly harmless, but if kernel_map_pages() does any manipulation of the
      page tables (unmapping or setting pages read-only) this may trigger a
      fault:
      
          alloc_contig_range test_pages_isolated(ceb00, ced00) failed
          Unable to handle kernel paging request at virtual address ffffffc0cec00000
          pgd = ffffffc045fc4000
          [ffffffc0cec00000] *pgd=0000000000000000
          Internal error: Oops: 9600004f [#1] PREEMPT SMP
          Modules linked in: exfatfs
          CPU: 1 PID: 23237 Comm: TimedEventQueue Not tainted 3.10.49-gc72ad36-dirty #1
          task: ffffffc03de52100 ti: ffffffc015388000 task.ti: ffffffc015388000
          PC is at memset+0xc8/0x1c0
          LR is at kernel_map_pages+0x1ec/0x244
      
      Fix this by calling kernel_map_pages() to ensure the page is properly
      mapped in the page tables.
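
      A sketch of the idea; the helper name is illustrative, the real change
      is in the isolation-undo path of mm/page_isolation.c, and
      set_page_refcounted() is an mm-internal helper from mm/internal.h:

        #include <linux/mm.h>   /* kernel_map_pages(), __free_pages() */

        /* Sketch: the page handed back here never went through
         * prep_new_page(), so map it explicitly before freeing it. */
        static void free_isolated_page_sketch(struct page *page, unsigned int order)
        {
                kernel_map_pages(page, 1 << order, 1); /* 1 == map the pages */
                set_page_refcounted(page);
                __free_pages(page, order);
        }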
      
      Fixes: 3c605096 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
      Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • mm/memory hotplug: postpone the reset of obsolete pgdat · ec4db266
      Gu Zheng authored
      [ Upstream commit b0dc3a34 ]
      
      Qiu Xishi reported the following BUG when testing node hot-add/hot-remove
      under stress:
      
        BUG: unable to handle kernel paging request at 0000000000025f60
        IP: next_online_pgdat+0x1/0x50
        PGD 0
        Oops: 0000 [#1] SMP
        ACPI: Device does not support D3cold
        Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
        CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G           O 3.10.15-5885-euler0302 #1
        Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
        Workqueue: events vmstat_update
        task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000
        RIP: 0010: next_online_pgdat+0x1/0x50
        RSP: 0018:ffffa800d32afce8  EFLAGS: 00010286
        RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082
        RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000
        RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96
        R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440
        R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800
        FS:  0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
          refresh_cpu_vm_stats+0xd0/0x140
          vmstat_update+0x11/0x50
          process_one_work+0x194/0x3d0
          worker_thread+0x12b/0x410
          kthread+0xc6/0xd0
          ret_from_fork+0x7c/0xb0
      
      The cause is the memset(pgdat, 0, sizeof(*pgdat)) at the end of
      try_offline_node(), which resets all of the pgdat's contents to 0.
      Because the pgdat is accessed lock-free, users that are still using it,
      such as the vmstat_update routine, will panic.
      
      process A:				offline node XX:
      
      vmstat_update()
         refresh_cpu_vm_stats()
           for_each_populated_zone()
             find online node XX
           cond_resched()
      					offline cpu and memory, then try_offline_node()
      					node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
             zone = next_zone(zone)
               pg_data_t *pgdat = zone->zone_pgdat;  // here pgdat is NULL now
                 next_online_pgdat(pgdat)
                   next_online_node(pgdat->node_id);  // NULL pointer access
      
      So the solution here is to postpone the reset of the obsolete pgdat from
      try_offline_node() to hotadd_new_pgdat(), and to reset only
      pgdat->nr_zones and pgdat->classzone_idx to 0 rather than memsetting the
      whole structure, so that the pointer information in the pgdat is not
      broken.
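
      A sketch of the effect of the fix; the helper name is illustrative, and
      the real change moves the reset from try_offline_node() into
      hotadd_new_pgdat():

        #include <linux/mmzone.h>

        /* Sketch: when an existing pgdat is reused on node hot-add, clear
         * only the stale fields instead of memset()ing the whole structure,
         * so lock-free readers such as vmstat_update() never see wiped
         * pointers. */
        static void reuse_obsolete_pgdat_sketch(pg_data_t *pgdat)
        {
                /* Reset the nr_zones and classzone_idx to 0 before reuse */
                pgdat->nr_zones = 0;
                pgdat->classzone_idx = 0;
        }
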
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Reported-by: Xishi Qiu <qiuxishi@huawei.com>
      Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • mm: fix anon_vma->degree underflow in anon_vma endless growing prevention · 5bc650f0
      Leon Yu authored
      [ Upstream commit 3fe89b3e ]
      
      I have constantly stumbled upon "kernel BUG at mm/rmap.c:399!" after
      upgrading to 3.19, and had no luck with 4.0-rc1 either.
      
      So, after looking into the new logic introduced by commit 7a3ef208 ("mm:
      prevent endless growth of anon_vma hierarchy"), I found that
      unlink_anon_vmas() can be called without dst->anon_vma->degree having
      been incremented in anon_vma_clone(), due to an allocation failure.  If
      dst->anon_vma is not NULL in the error path, its degree will be
      incorrectly decremented in unlink_anon_vmas() and will eventually
      underflow at exit as a result of another call to unlink_anon_vmas().
      That's how "kernel BUG at mm/rmap.c:399!" is triggered for me.
      
      This patch fixes the underflow by dropping dst->anon_vma when the
      allocation fails.  It's safe to do so regardless of the original value
      of dst->anon_vma, because dst->anon_vma has no valid meaning if
      anon_vma_clone() fails.  Besides, callers don't care about dst->anon_vma
      in that case either.
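
      A sketch of the failure handling, factored into an illustrative helper;
      the real change sits directly in the error path of anon_vma_clone() in
      mm/rmap.c:

        #include <linux/mm_types.h>     /* struct vm_area_struct */
        #include <linux/rmap.h>         /* unlink_anon_vmas() */
        #include <linux/errno.h>

        /* Sketch: once cloning has failed, dst->anon_vma has no valid
         * meaning, and leaving it set would make unlink_anon_vmas()
         * decrement a degree that was never incremented. */
        static int anon_vma_clone_failed_sketch(struct vm_area_struct *dst)
        {
                dst->anon_vma = NULL;   /* callers ignore it on failure anyway */
                unlink_anon_vmas(dst);  /* tear down whatever was already cloned */
                return -ENOMEM;
        }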
      
      As also suggested by Michal Hocko, we can clean up vma_adjust() a bit,
      since anon_vma_clone() now does the work.
      
      [akpm@linux-foundation.org: tweak comment]
      Fixes: 7a3ef208 ("mm: prevent endless growth of anon_vma hierarchy")
      Signed-off-by: Leon Yu <chianglungyu@gmail.com>
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • mm: fix corner case in anon_vma endless growing prevention · f94faaaa
      Konstantin Khlebnikov authored
      [ Upstream commit b800c91a ]
      
      Fix for BUG_ON(anon_vma->degree) splats in unlink_anon_vmas() ("kernel
      BUG at mm/rmap.c:399!") caused by commit 7a3ef208 ("mm: prevent endless
      growth of anon_vma hierarchy").
      
      anon_vma_clone() is usually called with a copy of the source vma as the
      destination argument.  If the source vma has an anon_vma, it should
      already be in dst->anon_vma.  A NULL dst->anon_vma is used as a sign
      that the function is called from anon_vma_fork(); in that case
      anon_vma_clone() looks for an anon_vma to reuse.
      
      vma_adjust() calls it differently, and this breaks the anon_vma reuse
      logic: anon_vma_clone() links the vma to the old anon_vma and updates
      the degree counters, but vma_adjust() overrides vma->anon_vma right
      after that.  As a result, the final unlink_anon_vmas() decrements the
      degree of the wrong anon_vma.
      
      This patch assigns ->anon_vma before calling anon_vma_clone().
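
      A sketch of the ordering fix as an illustrative helper; the real change
      is inside vma_adjust() in mm/mmap.c:

        #include <linux/mm_types.h>
        #include <linux/rmap.h>         /* anon_vma_clone() */

        /* Sketch: assign importer->anon_vma before calling anon_vma_clone(),
         * so the clone links the vma to the anon_vma it will actually keep
         * and the degree counters stay balanced. */
        static int share_anon_vma_then_clone_sketch(struct vm_area_struct *importer,
                                                    struct vm_area_struct *exporter)
        {
                importer->anon_vma = exporter->anon_vma;
                return anon_vma_clone(importer, exporter); /* 0 or -ENOMEM */
        }
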
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Reported-and-tested-by: Chris Clayton <chris2553@googlemail.com>
      Reported-and-tested-by: Oded Gabbay <oded.gabbay@amd.com>
      Reported-and-tested-by: Chih-Wei Huang <cwhuang@android-x86.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Daniel Forrest <dan.forrest@ssec.wisc.edu>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: stable@vger.kernel.org  # to match back-porting of 7a3ef208
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • mm: prevent endless growth of anon_vma hierarchy · 3f161800
      Konstantin Khlebnikov authored
      [ Upstream commit 7a3ef208 ]
      
      A constantly forking task causes unlimited growth of the anon_vma chain.
      Each new child allocates a new level of anon_vmas and links its vma to
      all previous levels, because pages might be inherited from any level.
      
      This patch adds a heuristic which decides to reuse an existing anon_vma
      instead of forking a new one.  It adds a counter, anon_vma->degree,
      which counts linked vmas and directly descending anon_vmas, and reuses
      an anon_vma if that counter is lower than two.  As a result, each
      anon_vma has either a vma or at least two descending anon_vmas.  In such
      trees, half of the nodes are leaves with live vmas, so the number of
      anon_vmas is no more than twice the number of vmas.
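
      A sketch of the reuse check, with illustrative names ("candidate",
      "src_root"); the real check lives in anon_vma_clone() and
      anon_vma_fork() in mm/rmap.c:

        #include <linux/types.h>
        #include <linux/mm_types.h>
        #include <linux/rmap.h>         /* struct anon_vma, with the new ->degree */

        /* Sketch: while walking the parent chain, an ancestor anon_vma with
         * fewer than two users may simply be reused by the child instead of
         * growing the hierarchy by another level. */
        static bool maybe_reuse_anon_vma_sketch(struct vm_area_struct *dst,
                                                struct anon_vma *candidate,
                                                struct anon_vma *src_root)
        {
                /* ->degree counts linked vmas plus directly descending
                 * anon_vmas; degree < 2 means at most one current user. */
                if (!dst->anon_vma && candidate != src_root &&
                    candidate->degree < 2) {
                        dst->anon_vma = candidate;
                        return true;
                }
                return false;
        }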
      
      This heuristic reuses anon_vmas as little as possible, because each
      reuse adds false aliasing among vmas, and the rmap walker then has to
      scan more ptes when it searches for where a page might be mapped.
      
      Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu
      Fixes: 5beb4930 ("mm: change anon_vma linking to fix multi-process server scalability issue")
      [akpm@linux-foundation.org: fix typo, per Rik]
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Reported-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>
      Tested-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Jerome Marchand <jmarchan@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>	[2.6.34+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
  2. 23 Apr, 2015 32 commits