1. 01 Apr, 2014 31 commits
  2. 15 Feb, 2014 9 commits
    • Ben Hutchings's avatar
      Linux 3.2.55 · 39716f2c
      Ben Hutchings authored
      39716f2c
    • Ying Xue's avatar
      sched/rt: Avoid updating RT entry timeout twice within one tick period · b01e0013
      Ying Xue authored
      commit 57d2aa00 upstream.
      
      The issue below was found in 2.6.34-rt rather than mainline rt
      kernel, but the issue still exists upstream as well.
      
      So please let me describe how it was noticed on 2.6.34-rt:
      
      On this version, each softirq has its own thread, it means there
      is at least one RT FIFO task per cpu. The priority of these
      tasks is set to 49 by default. If user launches an RT FIFO task
      with priority lower than 49 of softirq RT tasks, it's possible
      there are two RT FIFO tasks enqueued one cpu runqueue at one
      moment. By current strategy of balancing RT tasks, when it comes
      to RT tasks, we really need to put them off to a CPU that they
      can run on as soon as possible. Even if it means a bit of cache
      line flushing, we want RT tasks to be run with the least latency.
      
      When the user RT FIFO task which just launched before is
      running, the sched timer tick of the current cpu happens. In this
      tick period, the timeout value of the user RT task will be
      updated once. Subsequently, we try to wake up one softirq RT
      task on its local cpu. As the priority of current user RT task
      is lower than the softirq RT task, the current task will be
      preempted by the higher priority softirq RT task. Before
      preemption, we check to see if current can readily move to a
      different cpu. If so, we will reschedule to allow the RT push logic
      to try to move current somewhere else. Whenever the woken
      softirq RT task runs, it first tries to migrate the user FIFO RT
      task over to a cpu that is running a task of lesser priority. If
      migration is done, it will send a reschedule request to the found
      cpu by IPI interrupt. Once the target cpu responds the IPI
      interrupt, it will pick the migrated user RT task to preempt its
      current task. When the user RT task is running on the new cpu,
      the sched timer tick of the cpu fires. So it will tick the user
      RT task again. This also means the RT task timeout value will be
      updated again. As the migration may be done in one tick period,
      it means the user RT task timeout value will be updated twice
      within one tick.
      
      If we set a limit on the amount of cpu time for the user RT task
      by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted
      upon reaching the soft limit.
      
      But exactly when the SIGXCPU signal should be sent depends on the
      RT task timeout value. In fact the timeout mechanism of sending
      the SIGXCPU signal assumes the RT task timeout is increased once
      every tick.
      
      However, currently the timeout value may be added twice per
      tick. So it results in the SIGXCPU signal being sent earlier
      than expected.
      
      To solve this issue, we prevent the timeout value from increasing
      twice within one tick time by remembering the jiffies value of
      last updating the timeout. As long as the RT task's jiffies is
      different with the global jiffies value, we allow its timeout to
      be updated.
      Signed-off-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarFan Du <fan.du@windriver.com>
      Reviewed-by: default avatarYong Zhang <yong.zhang0@gmail.com>
      Acked-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1342508623-2887-1-git-send-email-ying.xue@windriver.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      [ lizf: backported to 3.4: adjust context ]
      Signed-off-by: default avatarLi Zefan <lizefan@huawei.com>
      [bwh: Backported to 3.2: adjust filename]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      b01e0013
    • Peter Boonstoppel's avatar
      sched: Unthrottle rt runqueues in __disable_runtime() · 4553dab7
      Peter Boonstoppel authored
      commit a4c96ae3 upstream.
      
      migrate_tasks() uses _pick_next_task_rt() to get tasks from the
      real-time runqueues to be migrated. When rt_rq is throttled
      _pick_next_task_rt() won't return anything, in which case
      migrate_tasks() can't move all threads over and gets stuck in an
      infinite loop.
      
      Instead unthrottle rt runqueues before migrating tasks.
      
      Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair()
      Signed-off-by: default avatarPeter Boonstoppel <pboonstoppel@nvidia.com>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Turner <pjt@google.com>
      Link: http://lkml.kernel.org/r/5FBF8E85CA34454794F0F7ECBA79798F379D3648B7@HQMAIL04.nvidia.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      [ lizf: backported to 3.4: adjust context ]
      Signed-off-by: default avatarLi Zefan <lizefan@huawei.com>
      [bwh: Backported to 3.2:
       - Adjust filenames
       - unthrottle_offline_cfs_rqs() is already static, but defined in sched.c
         after including sched_fair.c, so add forward declaration
       - unthrottle_offline_cfs_rqs() also needs to be defined for all CONFIG_SMP
         configurations now]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      4553dab7
    • Mike Galbraith's avatar
      sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttled · aee1f8b8
      Mike Galbraith authored
      commit e221d028 upstream.
      
      Root task group bandwidth replenishment must service all CPUs, regardless of
      where the timer was last started, and regardless of the isolation mechanism,
      lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.
      Signed-off-by: default avatarMike Galbraith <efault@gmx.de>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1344326558.6968.25.camel@marge.simpson.netSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      [bwh: Backported to 3.2: adjust filename]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      aee1f8b8
    • Colin Cross's avatar
      sched/rt: Fix SCHED_RR across cgroups · 13d8ff3f
      Colin Cross authored
      commit 454c7999 upstream.
      
      task_tick_rt() has an optimization to only reschedule SCHED_RR tasks
      if they were the only element on their rq.  However, with cgroups
      a SCHED_RR task could be the only element on its per-cgroup rq but
      still be competing with other SCHED_RR tasks in its parent's
      cgroup.  In this case, the SCHED_RR task in the child cgroup would
      never yield at the end of its timeslice.  If the child cgroup
      rt_runtime_us was the same as the parent cgroup rt_runtime_us,
      the task in the parent cgroup would starve completely.
      
      Modify task_tick_rt() to check that the task is the only task on its
      rq, and that the each of the scheduling entities of its ancestors
      is also the only entity on its rq.
      Signed-off-by: default avatarColin Cross <ccross@android.com>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1337229266-15798-1-git-send-email-ccross@android.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      [bwh: Backported to 3.2: adjust filename]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      13d8ff3f
    • Andrea Arcangeli's avatar
      mm: hugetlbfs: fix hugetlbfs optimization · 8ae94088
      Andrea Arcangeli authored
      commit 27c73ae7 upstream.
      
      Commit 7cb2ef56 ("mm: fix aio performance regression for database
      caused by THP") can cause dereference of a dangling pointer if
      split_huge_page runs during PageHuge() if there are updates to the
      tail_page->private field.
      
      Also it is repeating compound_head twice for hugetlbfs and it is running
      compound_head+compound_trans_head for THP when a single one is needed in
      both cases.
      
      The new code within the PageSlab() check doesn't need to verify that the
      THP page size is never bigger than the smallest hugetlbfs page size, to
      avoid memory corruption.
      
      A longstanding theoretical race condition was found while fixing the
      above (see the change right after the skip_unlock label, that is
      relevant for the compound_lock path too).
      
      By re-establishing the _mapcount tail refcounting for all compound
      pages, this also fixes the below problem:
      
        echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
      
        BUG: Bad page state in process bash  pfn:59a01
        page:ffffea000139b038 count:0 mapcount:10 mapping:          (null) index:0x0
        page flags: 0x1c00000000008000(tail)
        Modules linked in:
        CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x55/0x76
          bad_page+0xd5/0x130
          free_pages_prepare+0x213/0x280
          __free_pages+0x36/0x80
          update_and_free_page+0xc1/0xd0
          free_pool_huge_page+0xc2/0xe0
          set_max_huge_pages.part.58+0x14c/0x220
          nr_hugepages_store_common.isra.60+0xd0/0xf0
          nr_hugepages_store+0x13/0x20
          kobj_attr_store+0xf/0x20
          sysfs_write_file+0x189/0x1e0
          vfs_write+0xc5/0x1f0
          SyS_write+0x55/0xb0
          system_call_fastpath+0x16/0x1b
      Signed-off-by: default avatarKhalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Tested-by: default avatarKhalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [Khalid Aziz: Backported to 3.4]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      8ae94088
    • Khalid Aziz's avatar
      mm: fix aio performance regression for database caused by THP · b6416444
      Khalid Aziz authored
      commit 7cb2ef56 upstream.
      
      I am working with a tool that simulates oracle database I/O workload.
      This tool (orion to be specific -
      <http://docs.oracle.com/cd/E11882_01/server.112/e16638/iodesign.htm#autoId24>)
      allocates hugetlbfs pages using shmget() with SHM_HUGETLB flag.  It then
      does aio into these pages from flash disks using various common block
      sizes used by database.  I am looking at performance with two of the most
      common block sizes - 1M and 64K.  aio performance with these two block
      sizes plunged after Transparent HugePages was introduced in the kernel.
      Here are performance numbers:
      
      		pre-THP		2.6.39		3.11-rc5
      1M read		8384 MB/s	5629 MB/s	6501 MB/s
      64K read	7867 MB/s	4576 MB/s	4251 MB/s
      
      I have narrowed the performance impact down to the overheads introduced by
      THP in __get_page_tail() and put_compound_page() routines.  perf top shows
      >40% of cycles being spent in these two routines.  Every time direct I/O
      to hugetlbfs pages starts, kernel calls get_page() to grab a reference to
      the pages and calls put_page() when I/O completes to put the reference
      away.  THP introduced significant amount of locking overhead to get_page()
      and put_page() when dealing with compound pages because hugepages can be
      split underneath get_page() and put_page().  It added this overhead
      irrespective of whether it is dealing with hugetlbfs pages or transparent
      hugepages.  This resulted in 20%-45% drop in aio performance when using
      hugetlbfs pages.
      
      Since hugetlbfs pages can not be split, there is no reason to go through
      all the locking overhead for these pages from what I can see.  I added
      code to __get_page_tail() and put_compound_page() to bypass all the
      locking code when working with hugetlbfs pages.  This improved performance
      significantly.  Performance numbers with this patch:
      
      		pre-THP		3.11-rc5	3.11-rc5 + Patch
      1M read		8384 MB/s	6501 MB/s	8371 MB/s
      64K read	7867 MB/s	4251 MB/s	6510 MB/s
      
      Performance with 64K read is still lower than what it was before THP, but
      still a 53% improvement.  It does mean there is more work to be done but I
      will take a 53% improvement for now.
      
      Please take a look at the following patch and let me know if it looks
      reasonable.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: default avatarKhalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      b6416444
    • Robert Richter's avatar
      perf/x86/amd/ibs: Fix waking up from S3 for AMD family 10h · e07518e9
      Robert Richter authored
      commit bee09ed9 upstream.
      
      On AMD family 10h we see following error messages while waking up from
      S3 for all non-boot CPUs leading to a failed IBS initialization:
      
       Enabling non-boot CPUs ...
       smpboot: Booting Node 0 Processor 1 APIC 0x1
       [Firmware Bug]: cpu 1, try to use APIC500 (LVT offset 0) for vector 0x400, but the register is already in use for vector 0xf9 on another cpu
       perf: IBS APIC setup failed on cpu #1
       process: Switch to broadcast mode on CPU1
       CPU1 is up
       ...
       ACPI: Waking up from system sleep state S3
      
      Reason for this is that during suspend the LVT offset for the IBS
      vector gets lost and needs to be reinialized while resuming.
      
      The offset is read from the IBSCTL msr. On family 10h the offset needs
      to be 1 as offset 0 is used for the MCE threshold interrupt, but
      firmware assings it for IBS to 0 too. The kernel needs to reprogram
      the vector. The msr is a readonly node msr, but a new value can be
      written via pci config space access. The reinitialization is
      implemented for family 10h in setup_ibs_ctl() which is forced during
      IBS setup.
      
      This patch fixes IBS setup after waking up from S3 by adding
      resume/supend hooks for the boot cpu which does the offset
      reinitialization.
      
      Marking it as stable to let distros pick up this fix.
      Signed-off-by: default avatarRobert Richter <rric@kernel.org>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1389797849-5565-1-git-send-email-rric.net@gmail.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      e07518e9
    • Andreas Rohner's avatar
      nilfs2: fix segctor bug that causes file system corruption · 028d56ae
      Andreas Rohner authored
      commit 70f2fe3a upstream.
      
      There is a bug in the function nilfs_segctor_collect, which results in
      active data being written to a segment, that is marked as clean.  It is
      possible, that this segment is selected for a later segment
      construction, whereby the old data is overwritten.
      
      The problem shows itself with the following kernel log message:
      
        nilfs_sufile_do_cancel_free: segment 6533 must be clean
      
      Usually a few hours later the file system gets corrupted:
      
        NILFS: bad btree node (blocknr=8748107): level = 0, flags = 0x0, nchildren = 0
        NILFS error (device sdc1): nilfs_bmap_last_key: broken bmap (inode number=114660)
      
      The issue can be reproduced with a file system that is nearly full and
      with the cleaner running, while some IO intensive task is running.
      Although it is quite hard to reproduce.
      
      This is what happens:
      
       1. The cleaner starts the segment construction
       2. nilfs_segctor_collect is called
       3. sc_stage is on NILFS_ST_SUFILE and segments are freed
       4. sc_stage is on NILFS_ST_DAT current segment is full
       5. nilfs_segctor_extend_segments is called, which
          allocates a new segment
       6. The new segment is one of the segments freed in step 3
       7. nilfs_sufile_cancel_freev is called and produces an error message
       8. Loop around and the collection starts again
       9. sc_stage is on NILFS_ST_SUFILE and segments are freed
          including the newly allocated segment, which will contain active
          data and can be allocated at a later time
      10. A few hours later another segment construction allocates the
          segment and causes file system corruption
      
      This can be prevented by simply reordering the statements.  If
      nilfs_sufile_cancel_freev is called before nilfs_segctor_extend_segments
      the freed segments are marked as dirty and cannot be allocated any more.
      Signed-off-by: default avatarAndreas Rohner <andreas.rohner@gmx.net>
      Reviewed-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Tested-by: default avatarAndreas Rohner <andreas.rohner@gmx.net>
      Signed-off-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      028d56ae