1. 18 Oct, 2010 12 commits
    • x86: Add IRQ_TIME_ACCOUNTING · e82b8e4e
      Venkatesh Pallipadi authored
      This patch adds IRQ_TIME_ACCOUNTING option on x86 and runtime enables it
      when TSC is enabled.
      
      This change just enables fine-grained irq time accounting; it isn't used yet.
      Following patches use it for different purposes.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-6-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time · b52bfee4
      Venkatesh Pallipadi authored
      s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does
      the fine granularity accounting of user, system, hardirq, softirq times.
      Adding that option on archs like x86 will be challenging however, given the
      state of TSC reliability on various platforms and also the overhead it will
      add in syscall entry exit.
      
      Instead, add a lighter variant that only does finer accounting of
      hardirq and softirq times, providing precise irq times (instead of timer tick
      based samples). This accounting is added with a new config option
      CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not
      interested in paying the perf penalty.
      
      This accounting is based on sched_clock, with the code being generic.
      So, other archs may find it useful as well.
      
      This patch just adds the core logic and does not enable this logic yet.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Add a PF flag for ksoftirqd identification · 6cdd5199
      Venkatesh Pallipadi authored
      To account softirq time cleanly in scheduler, we need to identify whether
      softirq is invoked in ksoftirqd context or softirq at hardirq tail context.
      Add PF_KSOFTIRQD for that purpose.
      
      As all PF flag bits are currently taken, create space by moving one of the
      infrequently used bits (PF_THREAD_BOUND) down in task_struct to be along
      with some other state fields.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-4-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Consolidate account_system_vtime extern declaration · e1e10a26
      Venkatesh Pallipadi authored
      Just a minor cleanup patch that makes things easier for the following patches.
      No functionality change in this patch.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Fix softirq time accounting · 75e1056f
      Venkatesh Pallipadi authored
      Peter Zijlstra found a bug in the way softirq time is accounted in
      VIRT_CPU_ACCOUNTING on this thread:
      
         http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html
      
      The problem is, softirq processing uses local_bh_disable internally. There
      is no way, later in the flow, to differentiate between softirq being
      processed and bh merely having been disabled. So, a hardirq when bh
      is disabled results in time being wrongly accounted as softirq.

      Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
      as well, because account_system_time() in normal tick based accounting also uses
      softirq_count, which will be set even when not in softirq with bh disabled.
      
      Peter also suggested the solution of using 2*SOFTIRQ_OFFSET as the irq count
      for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while processing
      softirqs. The patch below does that and adds an API, in_serving_softirq(), which
      returns whether we are currently processing softirq or not.
      
      Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
      to in_serving_softirq.
      
      Looks like many usages of in_softirq really want in_serving_softirq. Those
      changes can be made individually on a case by case basis.
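
      The counting scheme described above can be modelled outside the kernel.
      The following Python sketch is only an illustration: the class name, the
      method names enter_softirq/exit_softirq, and the bit position of the
      softirq field are assumptions made for this example, not the kernel's
      code. It shows why adding 2*SOFTIRQ_OFFSET for bh-disable leaves the
      lowest softirq bit free to mean "currently serving a softirq".

```python
# Toy model of the softirq field of preempt_count.
# The shift value and all names below are assumptions for illustration.
SOFTIRQ_SHIFT = 8
SOFTIRQ_OFFSET = 1 << SOFTIRQ_SHIFT
SOFTIRQ_MASK = 0xFF << SOFTIRQ_SHIFT
SOFTIRQ_DISABLE_OFFSET = 2 * SOFTIRQ_OFFSET


class CpuModel:
    def __init__(self):
        self.preempt_count = 0

    def local_bh_disable(self):
        # bh-disable sections count in steps of two offsets
        self.preempt_count += SOFTIRQ_DISABLE_OFFSET

    def local_bh_enable(self):
        self.preempt_count -= SOFTIRQ_DISABLE_OFFSET

    def enter_softirq(self):
        # actual softirq processing counts a single offset
        self.preempt_count += SOFTIRQ_OFFSET

    def exit_softirq(self):
        self.preempt_count -= SOFTIRQ_OFFSET

    def softirq_count(self):
        return self.preempt_count & SOFTIRQ_MASK

    def in_softirq(self):
        # true for both "bh disabled" and "serving softirq"
        return self.softirq_count() != 0

    def in_serving_softirq(self):
        # only the lowest bit of the field marks real softirq processing
        return (self.softirq_count() & SOFTIRQ_OFFSET) != 0


cpu = CpuModel()
cpu.local_bh_disable()
print(cpu.in_softirq(), cpu.in_serving_softirq())
```

      In this model a hardirq arriving while bh is merely disabled sees
      in_softirq() true but in_serving_softirq() false, which is the
      distinction the accounting code needs.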
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Drop group_capacity to 1 only if local group has extra capacity · 75dd321d
      Nikhil Rao authored
      When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
      only if the local group has extra capacity. The extra check prevents the case
      where you always pull from the heaviest group when it is already under-utilized
      (possible when a large weight task outweighs the other tasks on the system).
      
      For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
      scheduling domains. Let's say we spawn 15 nice0 tasks and one nice -15 task,
      and each task is running on one core. In this case, we observe the following
      events when balancing at the NUMA domain:
      
      - find_busiest_group() will always pick the sched group containing the niced
        task to be the busiest group.
      - find_busiest_queue() will then always pick one of the cpus running the
        nice0 task (never picks the cpu with the nice -15 task since
        weighted_cpuload > imbalance).
      - The load balancer fails to migrate the task since it is the running task
        and increments sd->nr_balance_failed.
      - It repeats the above steps a few more times until sd->nr_balance_failed > 5,
        at which point it kicks off the active load balancer, wakes up the migration
        thread and kicks the nice 0 task off the cpu.
      
      The load balancer doesn't stop until we kick out all nice 0 tasks from
      the sched group, leaving you with 3 idle cpus and one cpu running the
      nice -15 task.
      
      When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child
      domain (in this case MC) has SD_PREFER_SIBLING set.  Subsequent load checks are
      not relevant because the niced task has a very large weight.
      
      In this patch, we add an extra condition to the "if(prefer_sibling)" check in
      update_sd_lb_stats(). We drop the capacity of a group only if the local group
      has extra capacity, i.e. nr_running < group_capacity. This patch preserves the
      original intent of the prefer_siblings check (to spread tasks across the system
      in low utilization scenarios) and fixes the case above.
      
      It helps in the following ways:
      - In low utilization cases (where nr_tasks << nr_cpus), we still drop
        group_capacity down to 1 if we prefer siblings.
      - On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most
        likely be > sgs.group_capacity.
      - When balancing large weight tasks, if the local group does not have extra
        capacity, we do not pick the group with the niced task as the busiest group.
        This prevents failed balances, active migration and the under-utilization
        described above.
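
      The extra condition can be sketched as a small predicate. This is a
      hedged Python model, not the kernel's update_sd_lb_stats(): the function
      and parameter names are hypothetical, and only the rule stated above
      (drop capacity to 1 when siblings are preferred and the local group has
      nr_running < group_capacity) is taken from the description.

```python
def has_extra_capacity(nr_running, group_capacity):
    # A group has extra capacity when it runs fewer tasks than it can hold.
    return nr_running < group_capacity


def effective_group_capacity(group_capacity, prefer_sibling,
                             local_nr_running, local_capacity):
    # Hypothetical helper: capacity used for a remote group during balancing.
    # Capacity is dropped to 1 only if siblings are preferred AND the local
    # group could actually take another task.
    if prefer_sibling and has_extra_capacity(local_nr_running, local_capacity):
        return min(group_capacity, 1)
    return group_capacity


# Low utilization: local group has room, so the remote capacity drops to 1.
print(effective_group_capacity(4, True, local_nr_running=1, local_capacity=4))
# Local group is full: the remote group keeps its real capacity.
print(effective_group_capacity(4, True, local_nr_running=4, local_capacity=4))
```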
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-5-git-send-email-ncrao@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Force balancing on newidle balance if local group has capacity · fab47622
      Nikhil Rao authored
      This patch forces a load balance on a newly idle cpu when the local group has
      extra capacity and the busiest group does not have any. It improves system
      utilization when balancing tasks with a large weight differential.
      
      Under certain situations, such as a niced down task (i.e. nice = -15) in the
      presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and
      kicks away other tasks because of its large weight. This leads to sub-optimal
      utilization of the machine. Even though the sched group has capacity, it does
      not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL.
      
      With this patch, if the local group has extra capacity, we shortcut the checks
      in f_b_g() and try to pull a task over. A sched group has extra capacity if the
      group capacity is greater than the number of running tasks in that group.
      
      Thanks to Mike Galbraith for discussions leading to this patch and for the
      insight to reuse SD_NEWIDLE_BALANCE.
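
      As a rough illustration of the shortcut, the decision can be written as
      a predicate over the two groups. The names below are hypothetical and the
      sketch only encodes what the message states: force a balance on a newly
      idle cpu when the local group has extra capacity and the busiest group
      has none.

```python
def group_has_capacity(group_capacity, nr_running):
    # Extra capacity: the group can hold more tasks than it is running.
    return group_capacity > nr_running


def should_force_balance(newly_idle, local_capacity, local_nr_running,
                         busiest_capacity, busiest_nr_running):
    # Hypothetical model of the shortcut taken before the load comparisons.
    return (newly_idle
            and group_has_capacity(local_capacity, local_nr_running)
            and not group_has_capacity(busiest_capacity, busiest_nr_running))


# Local group: 4 cpus, 1 task. Busiest group: 4 cpus, 5 tasks.
print(should_force_balance(True, 4, 1, 4, 5))
```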
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-4-git-send-email-ncrao@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Set group_imb only a task can be pulled from the busiest cpu · 2582f0eb
      Nikhil Rao authored
      When cycling through sched groups to determine the busiest group, set
      group_imb only if the busiest cpu has more than 1 runnable task. This patch
      fixes the case where two cpus in a group have one runnable task each, but there
      is a large weight differential between these two tasks. Since the load
      balancer is unable to migrate any task from such a group, do not consider
      the group to be imbalanced.
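
      A minimal Python sketch of the check follows. The function name, the
      explicit threshold parameter, and the load values are assumptions chosen
      for illustration; the only part taken from the description is the added
      requirement that the busiest cpu has more than one runnable task.

```python
def group_is_imbalanced(cpu_loads, cpu_nr_running, threshold):
    # Index of the most loaded cpu in the group.
    busiest = max(range(len(cpu_loads)), key=lambda i: cpu_loads[i])
    spread = max(cpu_loads) - min(cpu_loads)
    # A large load spread alone is not enough: the busiest cpu must also
    # have a task that can be pulled, i.e. more than one runnable task.
    return spread > threshold and cpu_nr_running[busiest] > 1


# Two cpus, one task each, very different weights: nothing can be migrated.
print(group_is_imbalanced([29000, 1024], [1, 1], threshold=2048))
# Same loads, but the busiest cpu has two runnable tasks.
print(group_is_imbalanced([29000, 1024], [2, 1], threshold=2048))
```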
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286996978-7007-3-git-send-email-ncrao@google.com>
      [ small code readability edits ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Do not consider SCHED_IDLE tasks to be cache hot · ef8002f6
      Nikhil Rao authored
      This patch adds a check in task_hot to return if the task has SCHED_IDLE
      policy. SCHED_IDLE tasks have very low weight, and when run with regular
      workloads, are typically scheduled many milliseconds apart. There is no
      need to consider these tasks hot for load balancing.
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-2-git-send-email-ncrao@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Drop all load weight manipulation for RT tasks · 17bdcf94
      Linus Walleij authored
      Load weights are for the CFS, they do not belong in the RT task. This makes all
      RT scheduling classes leave the CFS weights alone.
      
      This fixes a real bug as well: I noticed the following phenomenon: a process
      elevated to SCHED_RR forks with SCHED_RESET_ON_FORK set, and the child is
      indeed SCHED_OTHER, and the niceval is indeed reset to 0. However the weight
      inserted by set_load_weight() remains at 0, giving the task insignificant
      priority.
      
      With this fix, the weight is reset to what the task had before being elevated
      to SCHED_RR/SCHED_FIFO.
      
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: stable@kernel.org
      Signed-off-by: Linus Walleij <linus.walleij@stericsson.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286807811-10568-1-git-send-email-linus.walleij@stericsson.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Create special class for stop/migrate work · 34f971f6
      Peter Zijlstra authored
      In order to separate the stop/migrate work thread from the SCHED_FIFO
      implementation, create a special class for it that is of higher priority than
      SCHED_FIFO itself.
      
      This currently solves a problem where cpu-hotplug consumes so much cpu-time
      that the SCHED_FIFO class gets throttled, but has the bandwidth replenishment
      timer pending on the now dead cpu.
      
      It is also required for when we add the planned deadline scheduling class above
      SCHED_FIFO, as the stop/migrate thread still needs to transcend those tasks.
      Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1285165776.2275.1022.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Unindent labels · 49246274
      Peter Zijlstra authored
      Labels should be on column 0.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 14 Oct, 2010 2 commits
  3. 13 Oct, 2010 9 commits
  4. 12 Oct, 2010 13 commits
    • ARM: relax ioremap prohibition (309caa9c) for -final and -stable · 06c10884
      Russell King authored
      ... but produce a big warning about the problem as encouragement
      for people to fix their drivers.
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
    • Russell King's avatar
    • ARM: 6440/1: ep93xx: DMA: fix channel_disable · 10d48b39
      Mika Westerberg authored
      When channel_disable() is called, it disables per channel interrupts and
      waits until the channel's state becomes STATE_STALL, and then disables the
      channel. Now, if the DMA transfer is disabled while the channel is in
      STATE_NEXT, we do not wait at all and disable the channel immediately.
      This seems to cause weird data corruption, for example in audio transfers.
      
      Fix is to wait while we are in STATE_NEXT or STATE_ON and only then
      disable the channel.
      Signed-off-by: Mika Westerberg <mika.westerberg@iki.fi>
      Acked-by: Ryan Mallon <ryan@bluewatersys.com>
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
    • Merge branch 'kvm-updates/2.6.36' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 0acc1b2a
      Linus Torvalds authored
      * 'kvm-updates/2.6.36' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: x86: Move TSC reset out of vmcb_init
        KVM: x86: Fix SVM VMCB reset
    • ring-buffer: Fix typo of time extends per page · d0134324
      Steven Rostedt authored
      Time stamps for the ring buffer are created by the difference between
      two events. Each page of the ring buffer holds a full 64 bit timestamp.
      Each event has a 27 bit delta stamp from the last event. The unit of time
      is nanoseconds, so 27 bits can hold ~134 milliseconds. If two events
      happen more than 134 milliseconds apart, a time extend is inserted
      to add more bits for the delta. The time extend has 59 bits, which
      is good for ~18 years.
      
      Currently the time extend is committed separately from the event.
      If an event is discarded before it is committed, due to filtering,
      the time extend still exists. If all events are being filtered, then
      after ~134 milliseconds a new time extend will be added to the buffer.
      
      This can only happen till the end of the page. Since each page holds
      a full timestamp, there is no reason to add a time extend to the
      beginning of a page. Time extends can only fill a page that has actual
      data at the beginning, so there is no fear that time extends will fill
      more than a page without any data.
      
      When reading an event, a loop is made to skip over time extends
      since they are only used to maintain the time stamp and are never
      given to the caller. As a paranoid check to prevent the loop running
      forever, with the knowledge that time extends may only fill a page,
      a check is made that tests the iteration of the loop, and if the
      iteration is more than the number of time extends that can fit in a page
      a warning is printed and the ring buffer is disabled (all of ftrace
      is also disabled with it).
      
      There is another event type that is called a TIMESTAMP which can
      hold 64 bits of data in the theoretical case that two events happen
      18 years apart. This code has not been implemented, but the name
      of this event exists, as well as the structure for it. The
      size of a TIMESTAMP is 16 bytes, whereas a time extend is only
      8 bytes. The macro used to calculate how many time extends can fit on
      a page used the TIMESTAMP size instead of the time extend size,
      cutting the amount in half.
      
      The following test case can easily trigger the warning since we only
      need to have half the page filled with time extends to trigger the
      warning:
      
       # cd /sys/kernel/debug/tracing/
       # echo function > current_tracer
       # echo 'common_pid < 0' > events/ftrace/function/filter
       # echo > trace
       # echo 1 > trace_marker
       # sleep 120
       # cat trace
      
      This enables the function tracer, sets the filter to only trace
      functions where the process id is negative (no events), clears
      the trace buffer to ensure that we have nothing in the buffer,
      writes to trace_marker to add an event to the beginning of a page,
      sleeps for 2 minutes (only 35 seconds is probably needed, but this
      guarantees the bug), and finally reads the trace, which
      triggers the bug.
      
      This patch fixes the typo and prevents the false positive of that warning.
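
      The figures quoted in this message can be checked with a few lines of
      arithmetic. The Python sketch below is only a back-of-the-envelope
      check: the 4096-byte page data size is an assumption (the message does
      not state it), and the variable names are made up for the example.

```python
NSEC_PER_SEC = 1_000_000_000
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Range of the 27-bit per-event delta, in milliseconds.
delta_ms = 2**27 / 1_000_000

# Range of the 59-bit time extend, in years.
extend_years = 2**59 / NSEC_PER_SEC / SECONDS_PER_YEAR

# Assumed usable bytes per ring buffer page (not given in the message).
PAGE_DATA_BYTES = 4096
TIME_EXTEND_BYTES = 8
TIMESTAMP_BYTES = 16

# Loop limit with the correct size versus the mistaken TIMESTAMP size.
correct_limit = PAGE_DATA_BYTES // TIME_EXTEND_BYTES
buggy_limit = PAGE_DATA_BYTES // TIMESTAMP_BYTES

# One time extend is added roughly every 2**27 ns while all events are
# filtered, so this is how long it takes to exceed the halved limit.
seconds_to_trigger = buggy_limit * 2**27 / NSEC_PER_SEC

print(f"{delta_ms:.1f} ms, {extend_years:.1f} years")
print(f"limit {buggy_limit} instead of {correct_limit}")
print(f"about {seconds_to_trigger:.1f} s to trigger the warning")
```

      Under the assumed page size, the halved limit is reached after roughly
      34 seconds, which is consistent with the "only 35 seconds is probably
      needed" remark above.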
      Reported-by: Hans J. Koch <hjk@linutronix.de>
      Tested-by: Hans J. Koch <hjk@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Stable Kernel <stable@kernel.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • perf, MIPS: Support cross compiling of tools/perf for MIPS · c1e028ef
      Deng-Cheng Zhu authored
      Changes:
       v4: Fix the cosmetic issue of redundant dot-ops
       v3: Change rmb() to use SYNC
       v2: Include mips unistd.h and define rmb()/cpu_relax() in tools/perf/perf.h
      Signed-off-by: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
      Acked-by: Ralf Baechle <ralf@linux-mips.org>
      Cc: David Daney <ddaney@caviumnetworks.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • drm/radeon/kms: Silent spurious error message · a8c051f0
      Jean Delvare authored
      I see the following error message in my kernel log from time to time:
      radeon 0000:07:00.0: ffff88007c334000 reserve failed for wait
      radeon 0000:07:00.0: ffff88007c334000 reserve failed for wait
      
      After investigation, it turns out that there's nothing to be afraid of
      and everything works as intended. So remove the spurious log message.
      Signed-off-by: Jean Delvare <khali@linux-fr.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
    • Alex Deucher's avatar
      d31dba58
    • drm/radeon/kms: make TV/DFP table info less verbose · 40f76d81
      Alex Deucher authored
      Make TV standard and DFP table revisions debug only.
      Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
    • drm/radeon/kms: leave certain CP int bits enabled · 3555e53b
      Alex Deucher authored
      These bits are used for internal communication and should
      be left enabled.  This may fix s/r issues on some systems.
      Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
    • drm/radeon/kms: avoid corner case issue with unmappable vram V2 · c919b371
      Jerome Glisse authored
      We should not allocate any object into unmappable vram if we
      have no means to access it, which on all GPUs means having the
      CP running and on newer GPUs having the blit utility working.

      This patch limits the vram allocation to visible vram until
      we have acceleration up and running.

      Note that it's more than unlikely that we run into any issue
      related to that, as when acceleration is not working userspace
      should not allocate any object in vram besides the front buffer,
      which should fit in visible vram.
      
      V2: use real_vram_size, as mc_vram_size could be bigger than
          the actual amount of vram
      
      [airlied: fixup r700_cp_stop case]
      Signed-off-by: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
    • perf: Fix incorrect copy_from_user() usage · ad0cf347
      John Blackwood authored
      perf events: repair incorrect use of copy_from_user
      
      This makes the perf_event_period() return 0 instead of
      -EFAULT on success.
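
      The class of mistake can be illustrated with a user-space model. In the
      kernel, copy_from_user() returns the number of bytes that could NOT be
      copied, so 0 means success. The Python sketch below models that
      convention; the exact shape of the original buggy check is an assumption
      (the message only says the function returned -EFAULT on success), and
      all function names here are stand-ins, not kernel code.

```python
import errno


def copy_from_user(dst, src, n):
    # Model of the kernel convention: return the number of bytes NOT copied.
    # A src of None stands in for a faulting user pointer.
    if src is None:
        return n
    dst[:n] = src[:n]
    return 0


def period_buggy(user_buf):
    value = bytearray(8)
    # Assumed bug: the return value is tested with inverted sense, so a
    # fully successful copy (0 bytes left) is reported as a fault.
    if not copy_from_user(value, user_buf, 8):
        return -errno.EFAULT
    return 0


def period_fixed(user_buf):
    value = bytearray(8)
    # Non-zero means some bytes were left uncopied: that is the fault case.
    if copy_from_user(value, user_buf, 8):
        return -errno.EFAULT
    return 0


print(period_buggy(bytes(8)), period_fixed(bytes(8)))
```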
      
      Signed-off-by: John Blackwood <john.blackwood@ccur.com>
      Signed-off-by: Joe Korty <joe.korty@ccur.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100928220311.GA18145@tsunami.ccur.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • fanotify: disable fanotify syscalls · 7c534773
      Eric Paris authored
      This patch disables the fanotify syscalls by just not building them and
      letting the cond_syscall() statements in kernel/sys_ni.c redirect them
      to sys_ni_syscall().
      
      It was pointed out by Tvrtko Ursulin that the fanotify interface did not
      include an explicit prioritization between groups.  This is necessary
      for fanotify to be usable for hierarchical storage management software,
      as they must get first access to the file, before inotify-like notifiers
      see the file.
      
      This feature can be added in an ABI compatible way in the next release
      (by using a number of bits in the flags field to carry the info) but it
      was suggested by Alan that maybe we should just hold off and do it in
      the next cycle, likely with an (new) explicit argument to the syscall.
      I don't like this approach best as I know people are already starting to
      use the current interface, but Alan is all wise and no one on the list backed
      me up with just using what we have.  I feel this is needlessly ripping
      the rug out from under people at the last minute, but if others think it
      needs to be a new argument it might be the best way forward.
      
      Three choices:
      Go with what we got (and implement the new feature next cycle).  Add a
      new field right now (and implement the new feature next cycle).  Wait
      till next cycle to release the ABI (and implement the new feature next
      cycle).  This is number 3.
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 11 Oct, 2010 4 commits