1. 17 Feb, 2011 (40 commits)
    • Venkatesh Pallipadi
      sched: Add a PF flag for ksoftirqd identification · 9b511401
      Venkatesh Pallipadi authored
      Commit: 6cdd5199 upstream
      
      To account softirq time cleanly in the scheduler, we need to identify
      whether softirq is invoked in ksoftirqd context or at the tail of a
      hardirq. Add PF_KSOFTIRQD for that purpose.
      
      As all PF flag bits are currently taken, create space by moving one of
      the infrequently used bits (PF_THREAD_BOUND) down in task_struct to sit
      alongside some other state fields.
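
      A minimal sketch of the check this flag enables (bit value and helper
      name are illustrative, not taken from the patch):

        #define PF_KSOFTIRQD  0x04000000  /* I am ksoftirqd (bit assumed) */

        /* softirq time accounting can now tell ksoftirqd context apart
         * from softirq served at the tail of a hardirq: */
        static inline bool is_ksoftirqd_context(struct task_struct *p)
        {
                return p->flags & PF_KSOFTIRQD;
        }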
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-4-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Dave Young
      sched: Remove unused PF_ALIGNWARN flag · 82f7e90e
      Dave Young authored
      Commit: 637bbdc5 upstream
      
      PF_ALIGNWARN is not implemented; per its comment, it was intended
      for the 486.
      
      It is unlikely that anyone will ever implement this flag feature,
      so remove it and leave the valuable 0x00000001 bit free for
      future use.
      Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20100913121903.GB22238@darkstar>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Venkatesh Pallipadi
      sched: Consolidate account_system_vtime extern declaration · 95824433
      Venkatesh Pallipadi authored
      Commit: e1e10a26 upstream
      
      Just a minor cleanup patch that makes things easier for the following
      patches. No functional change in this patch.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Venkatesh Pallipadi
      sched: Fix softirq time accounting · 49c6f4a2
      Venkatesh Pallipadi authored
      Commit: 75e1056f upstream
      
      Peter Zijlstra found a bug in the way softirq time is accounted in
      VIRT_CPU_ACCOUNTING on this thread:
      
         http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html
      
      The problem is that softirq processing uses local_bh_disable internally.
      There is no way, later in the flow, to differentiate between softirq
      actually being processed and bh merely having been disabled. So a hardirq
      arriving while bh is disabled results in time being wrongly accounted as
      softirq.
      
      Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
      as well, since account_system_time() in normal tick-based accounting also
      uses softirq_count, which will be set even when not in softirq but with bh
      disabled.
      
      Peter also suggested the solution of using 2*SOFTIRQ_OFFSET as the irq
      count for local_bh_{disable,enable} and just SOFTIRQ_OFFSET while
      processing softirq. The patch below does that and adds an API,
      in_serving_softirq(), which returns whether we are currently processing
      a softirq or not.
      
      Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
      to in_serving_softirq.
      
      Looks like many usages of in_softirq really want in_serving_softirq. Those
      changes can be made individually on a case by case basis.
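
      For reference, a sketch of the resulting preempt_count convention
      (shift value assumed for illustration):

        #define SOFTIRQ_SHIFT           8
        #define SOFTIRQ_OFFSET          (1UL << SOFTIRQ_SHIFT)
        #define SOFTIRQ_DISABLE_OFFSET  (2 * SOFTIRQ_OFFSET)  /* local_bh_disable() */

        /* bh-disable and softirq processing both raise softirq_count(),
         * but only actual softirq processing sets the low softirq bit: */
        #define in_softirq()            (softirq_count())
        #define in_serving_softirq()    (softirq_count() & SOFTIRQ_OFFSET)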
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Nikhil Rao
      sched: Drop group_capacity to 1 only if local group has extra capacity · 1d3d2371
      Nikhil Rao authored
      Commit: 75dd321d upstream
      
      When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
      only if the local group has extra capacity. The extra check prevents the
      case where we always pull from the heaviest group when it is already
      under-utilized (possible when a large-weight task outweighs the remaining
      tasks on the system).
      
      For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
      scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task,
      and each task is running on one core. In this case, we observe the following
      events when balancing at the NUMA domain:
      
      - find_busiest_group() will always pick the sched group containing the niced
        task to be the busiest group.
      - find_busiest_queue() will then always pick one of the cpus running the
        nice0 task (never picks the cpu with the nice -15 task since
        weighted_cpuload > imbalance).
      - The load balancer fails to migrate the task since it is the running task
        and increments sd->nr_balance_failed.
      - It repeats the above steps a few more times until sd->nr_balance_failed > 5,
        at which point it kicks off the active load balancer, wakes up the migration
        thread and kicks the nice 0 task off the cpu.
      
      The load balancer doesn't stop until we kick out all nice 0 tasks from
      the sched group, leaving you with 3 idle cpus and one cpu running the
      nice -15 task.
      
      When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child
      domain (in this case MC) has SD_PREFER_SIBLING set.  Subsequent load checks are
      not relevant because the niced task has a very large weight.
      
      In this patch, we add an extra condition to the "if (prefer_sibling)"
      check in update_sd_lb_stats(). We drop the capacity of a group only if
      the local group has extra capacity, i.e. nr_running < group_capacity.
      This patch preserves the original intent of the prefer_siblings check
      (to spread tasks across the system in low utilization scenarios) and
      fixes the case above.
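
      A sketch of the adjusted check (field names assumed, not verbatim
      from the patch):

        /* in update_sd_lb_stats(): only cap a remote group's capacity
         * when the local group still has room to absorb tasks */
        if (prefer_sibling && !local_group &&
            sds->this_nr_running < sds->this_group_capacity)
                sgs.group_capacity = min(sgs.group_capacity, 1UL);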
      
      It helps in the following ways:
      - In low utilization cases (where nr_tasks << nr_cpus), we still drop
        group_capacity down to 1 if we prefer siblings.
      - On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most
        likely be > sgs.group_capacity.
      - When balancing large weight tasks, if the local group does not have extra
        capacity, we do not pick the group with the niced task as the busiest group.
        This prevents failed balances, active migration and the under-utilization
        described above.
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-5-git-send-email-ncrao@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Nikhil Rao
      sched: Force balancing on newidle balance if local group has capacity · 703482e7
      Nikhil Rao authored
      Commit: fab47622 upstream
      
      This patch forces a load balance on a newly idle cpu when the local group has
      extra capacity and the busiest group does not have any. It improves system
      utilization when balancing tasks with a large weight differential.
      
      Under certain situations, such as a niced down task (i.e. nice = -15) in the
      presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and
      kicks away other tasks because of its large weight. This leads to sub-optimal
      utilization of the machine. Even though the sched group has capacity, it does
      not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL.
      
      With this patch, if the local group has extra capacity, we shortcut the checks
      in f_b_g() and try to pull a task over. A sched group has extra capacity if the
      group capacity is greater than the number of running tasks in that group.
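
      A sketch of the shortcut described above (simplified; field names
      assumed):

        /* in find_busiest_group(), before the usual load comparisons */
        if (idle == CPU_NEWLY_IDLE && sds.this_has_capacity &&
            !sds.busiest_has_capacity)
                goto force_balance;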
      
      Thanks to Mike Galbraith for discussions leading to this patch and for the
      insight to reuse SD_NEWIDLE_BALANCE.
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-4-git-send-email-ncrao@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Nikhil Rao
      sched: Set group_imb only if a task can be pulled from the busiest cpu · 6e1d0fe9
      Nikhil Rao authored
      Commit: 2582f0eb upstream
      
      When cycling through sched groups to determine the busiest group, set
      group_imb only if the busiest cpu has more than 1 runnable task. This
      patch fixes the case where two cpus in a group have one runnable task
      each, but there is a large weight differential between these two tasks.
      The load balancer is unable to migrate any task from this group, and
      hence should not consider this group to be imbalanced.
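
      A sketch of the tightened condition (the threshold shown is
      illustrative):

        /* flag an imbalance only if some cpu in the group has a second
         * runnable task that could actually be pulled */
        if ((max_cpu_load - min_cpu_load) > 2 * avg_load_per_task &&
            max_nr_running > 1)
                sgs->group_imb = 1;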
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286996978-7007-3-git-send-email-ncrao@google.com>
      [ small code readability edits ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Nikhil Rao
      sched: Do not consider SCHED_IDLE tasks to be cache hot · 215856a4
      Nikhil Rao authored
      Commit: ef8002f6 upstream
      
      This patch adds a check in task_hot to return early if the task has
      SCHED_IDLE policy. SCHED_IDLE tasks have very low weight, and when run
      with regular workloads, are typically scheduled many milliseconds apart.
      There is no need to consider these tasks hot for load balancing.
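
      The added early-out amounts to (sketch; placement in task_hot()
      assumed):

        /* SCHED_IDLE tasks are never considered cache hot */
        if (unlikely(p->policy == SCHED_IDLE))
                return 0;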
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-2-git-send-email-ncrao@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Peter Zijlstra
      sched: fix RCU lockdep splat from task_group() · acb2c6dc
      Peter Zijlstra authored
      Commit: 6506cf6c upstream
      
      This addresses the following RCU lockdep splat:
      
      [0.051203] CPU0: AMD QEMU Virtual CPU version 0.12.4 stepping 03
      [0.052999] lockdep: fixing up alternatives.
      [0.054105]
      [0.054106] ===================================================
      [0.054999] [ INFO: suspicious rcu_dereference_check() usage. ]
      [0.054999] ---------------------------------------------------
      [0.054999] kernel/sched.c:616 invoked rcu_dereference_check() without protection!
      [0.054999]
      [0.054999] other info that might help us debug this:
      [0.054999]
      [0.054999]
      [0.054999] rcu_scheduler_active = 1, debug_locks = 1
      [0.054999] 3 locks held by swapper/1:
      [0.054999]  #0:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff814be933>] cpu_up+0x42/0x6a
      [0.054999]  #1:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff810400d8>] cpu_hotplug_begin+0x2a/0x51
      [0.054999]  #2:  (&rq->lock){-.-...}, at: [<ffffffff814be2f7>] init_idle+0x2f/0x113
      [0.054999]
      [0.054999] stack backtrace:
      [0.054999] Pid: 1, comm: swapper Not tainted 2.6.35 #1
      [0.054999] Call Trace:
      [0.054999]  [<ffffffff81068054>] lockdep_rcu_dereference+0x9b/0xa3
      [0.054999]  [<ffffffff810325c3>] task_group+0x7b/0x8a
      [0.054999]  [<ffffffff810325e5>] set_task_rq+0x13/0x40
      [0.054999]  [<ffffffff814be39a>] init_idle+0xd2/0x113
      [0.054999]  [<ffffffff814be78a>] fork_idle+0xb8/0xc7
      [0.054999]  [<ffffffff81068717>] ? mark_held_locks+0x4d/0x6b
      [0.054999]  [<ffffffff814bcebd>] do_fork_idle+0x17/0x2b
      [0.054999]  [<ffffffff814bc89b>] native_cpu_up+0x1c1/0x724
      [0.054999]  [<ffffffff814bcea6>] ? do_fork_idle+0x0/0x2b
      [0.054999]  [<ffffffff814be876>] _cpu_up+0xac/0x127
      [0.054999]  [<ffffffff814be946>] cpu_up+0x55/0x6a
      [0.054999]  [<ffffffff81ab562a>] kernel_init+0xe1/0x1ff
      [0.054999]  [<ffffffff81003854>] kernel_thread_helper+0x4/0x10
      [0.054999]  [<ffffffff814c353c>] ? restore_args+0x0/0x30
      [0.054999]  [<ffffffff81ab5549>] ? kernel_init+0x0/0x1ff
      [0.054999]  [<ffffffff81003850>] ? kernel_thread_helper+0x0/0x10
      [0.056074] Booting Node   0, Processors  #1lockdep: fixing up alternatives.
      [0.130045]  #2lockdep: fixing up alternatives.
      [0.203089]  #3 Ok.
      [0.275286] Brought up 4 CPUs
      [0.276005] Total of 4 processors activated (16017.17 BogoMIPS).
      
      The cgroup_subsys_state structures referenced by idle tasks are never
      freed, because the idle tasks should be part of the root cgroup,
      which is not removable.
      
      The problem is that while we do in-fact hold rq->lock, the newly spawned
      idle thread's cpu is not yet set to the correct cpu so the lockdep check
      in task_group():
      
        lockdep_is_held(&task_rq(p)->lock)
      
      will fail.
      
      But this is a chicken and egg problem.  Setting the CPU's runqueue requires
      that the CPU's runqueue already be set.  ;-)
      
      So insert an RCU read-side critical section to avoid the complaint.
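
      The fix amounts to wrapping the assignment (sketch; call site in
      init_idle() assumed):

        rcu_read_lock();
        __set_task_cpu(idle, cpu);
        rcu_read_unlock();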
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Paul E. McKenney
      sched: suppress RCU lockdep splat in task_fork_fair · f4de371f
      Paul E. McKenney authored
      Commit: b0a0f667 upstream
      
      > ===================================================
      > [ INFO: suspicious rcu_dereference_check() usage. ]
      > ---------------------------------------------------
      > /home/greearb/git/linux.wireless-testing/kernel/sched.c:618 invoked rcu_dereference_check() without protection!
      >
      > other info that might help us debug this:
      >
      > rcu_scheduler_active = 1, debug_locks = 1
      > 1 lock held by ifup/23517:
      >   #0:  (&rq->lock){-.-.-.}, at: [<c042f782>] task_fork_fair+0x3b/0x108
      >
      > stack backtrace:
      > Pid: 23517, comm: ifup Not tainted 2.6.36-rc6-wl+ #5
      > Call Trace:
      >   [<c075e219>] ? printk+0xf/0x16
      >   [<c0455842>] lockdep_rcu_dereference+0x74/0x7d
      >   [<c0426854>] task_group+0x6d/0x79
      >   [<c042686e>] set_task_rq+0xe/0x57
      >   [<c042f79e>] task_fork_fair+0x57/0x108
      >   [<c042e965>] sched_fork+0x82/0xf9
      >   [<c04334b3>] copy_process+0x569/0xe8e
      >   [<c0433ef0>] do_fork+0x118/0x262
      >   [<c076302f>] ? do_page_fault+0x16a/0x2cf
      >   [<c044b80c>] ? up_read+0x16/0x2a
      >   [<c04085ae>] sys_clone+0x1b/0x20
      >   [<c04030a5>] ptregs_clone+0x15/0x30
      >   [<c0402f1c>] ? sysenter_do_call+0x12/0x38
      
      Here a newly created task is having its runqueue assigned.  The new task
      is not yet on the tasklist, so it cannot go away.  This is therefore a
      false positive; suppress it with an RCU read-side critical section.
      
      Reported-by: Ben Greear <greearb@candelatech.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Ben Greear <greearb@candelatech.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • stable-bot for Steven Rostedt
      sched: Give CPU bound RT tasks preference · f1d70344
      stable-bot for Steven Rostedt authored
      From: Steven Rostedt <srostedt@redhat.com>
      
      Commit: b3bc211c upstream
      
      If a high priority task is waking up on a CPU that is running a
      lower priority task that is bound to a CPU, see if we can move the
      high prio RT task to another CPU first. Note, if all other CPUs are
      running higher priority tasks than the CPU-bound current task,
      then it will be preempted regardless.
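
      A sketch of the wakeup-time check (simplified; exact conditions in
      select_task_rq_rt() assumed):

        if (curr && unlikely(rt_task(curr)) &&
            (curr->rt.nr_cpus_allowed < 2 ||   /* curr is CPU-bound, or */
             curr->prio < p->prio) &&          /* curr outranks the waker */
            p->rt.nr_cpus_allowed > 1) {
                int target = find_lowest_rq(p);

                if (target != -1)
                        cpu = target;          /* move the waking RT task */
        }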
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Gregory Haskins <ghaskins@novell.com>
      LKML-Reference: <20100921024138.888922071@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Steven Rostedt
      sched: Try not to migrate higher priority RT tasks · f266611e
      Steven Rostedt authored
      Commit: 43fa5460 upstream
      
      When first working on the RT scheduler design, we concentrated on
      keeping all CPUs running RT tasks instead of having multiple RT
      tasks on a single CPU waiting for the migration thread to move
      them. Instead we take a more proactive stance and push or pull RT
      tasks from one CPU to another on wakeup or scheduling.
      
      When an RT task wakes up on a CPU that is running another RT task,
      instead of preempting it and killing the cache of the running RT
      task, we look to see if we can migrate the RT task that is waking
      up, even if the RT task waking up is of higher priority.
      
      This may sound a bit odd, but RT tasks should be limited in
      migration by the user anyway. But in practice, people do not do
      this, which causes high prio RT tasks to bounce around the CPUs.
      This becomes even worse when we have priority inheritance, because
      a high prio task can block on a lower prio task and boost its
      priority. When the lower prio task wakes up the high prio task, if
      it happens to be on the same CPU it will migrate off of it.
      
      In reality, the above does not happen much either, because when the
      lower prio task (which has already been boosted) wakes up the higher
      prio task on the same CPU, the higher prio task would then migrate
      off of it. But either way, we do not want to migrate them.
      
      To examine the scheduling, I created a test program and examined it
      under kernelshark. The test program created CPU * 2 threads, where
      each thread had a different priority. The program takes different
      options. The options used in this change log were to have priority
      inheritance mutexes or not.
      
      All threads did the following loop:
      
      static void grab_lock(long id, int iter, int l)
      {
      	ftrace_write("thread %ld iter %d, taking lock %d\n",
      		     id, iter, l);
      	pthread_mutex_lock(&locks[l]);
      	ftrace_write("thread %ld iter %d, took lock %d\n",
      		     id, iter, l);
      	busy_loop(nr_tasks - id);
      	ftrace_write("thread %ld iter %d, unlock lock %d\n",
      		     id, iter, l);
      	pthread_mutex_unlock(&locks[l]);
      }
      
      void *start_task(void *id)
      {
      	[...]
      	while (!done) {
      		for (l = 0; l < nr_locks; l++) {
      			grab_lock(id, i, l);
      			ftrace_write("thread %ld iter %d sleeping\n",
      				     id, i);
      			ms_sleep(id);
      		}
      		i++;
      	}
      	[...]
      }
      
      The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The
      ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes
      to the ftrace buffer to help analyze via ftrace.
      
      The higher the id, the higher the prio, the shorter its busy loop,
      and the longer it sleeps. This is usually the case with RT tasks;
      the lower priority tasks usually run longer than higher priority
      tasks.
      
      At the end of the test, it records the number of loops each thread
      took, as well as the number of voluntary preemptions, non-voluntary
      preemptions, and number of migrations each thread took, taking the
      information from /proc/$$/sched and /proc/$$/status.
      
      Running this on a 4 CPU processor, the results without changes to
      the kernel looked like this:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:         53      3220       1470             98
        1:        562       773        724             98
        2:        752       933       1375             98
        3:        749        39        697             98
        4:        758         5        515             98
        5:        764         2        679             99
        6:        761         2        535             99
        7:        757         3        346             99
      
      total:     5156       4977      6341            787
      
      Each thread, regardless of priority, migrated a few hundred times.
      The higher priority tasks were a little better off, but still took
      quite an impact.
      
      By letting higher priority tasks bump the lower prio task from the
      CPU, things changed a bit:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:         37      2835       1937             98
        1:        666      1821       1865             98
        2:        654      1003       1385             98
        3:        664       635        973             99
        4:        698       197        352             99
        5:        703       101        159             99
        6:        708         1         75             99
        7:        713         1          2             99
      
      total:     4843       6594      6748            789
      
      The total # of migrations did not change (several runs showed the
      difference all within the noise). But we now see a dramatic
      improvement to the higher priority tasks. (kernelshark showed that
      the watchdog timer bumped the highest priority task to give it the
      2 count. This was actually consistent with every run).
      
      Notice that the # of iterations did not change either.
      
      The above was with priority inheritance mutexes. That is, when the
      higher priority task blocked on a lower priority task, the lower
      priority task would inherit the higher priority task's priority
      (which shows why task 6 was bumped so many times). When not using
      priority inheritance mutexes, the current kernel shows this:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:         56      3101       1892             95
        1:        594       713        937             95
        2:        625       188        618             95
        3:        628         4        491             96
        4:        640         7        468             96
        5:        631         2        501             96
        6:        641         1        466             96
        7:        643         2        497             96
      
      total:     4458       4018      5870            765
      
      Not much changed with or without priority inheritance mutexes. But
      if we let the high priority task bump lower priority tasks on
      wakeup we see:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:        115      3439       2782             98
        1:        633      1354       1583             99
        2:        652       919       1218             99
        3:        645       713        934             99
        4:        690         3          3             99
        5:        694         1          4             99
        6:        720         3          4             99
        7:        747         0          1            100
      
      Which shows an even bigger change. The big difference between task 3
      and task 4 is because we have only 4 CPUs on the machine, causing
      the 4 highest prio tasks to always have preference.
      
      Although I did not measure cache misses, and I'm sure there would
      be little to measure since the test was not data intensive, I could
      imagine large improvements for higher priority tasks when dealing
      with lower priority tasks. Thus, I'm satisfied with making the
      change and agreeing with what Gregory Haskins argued a few years
      ago when we first had this discussion.
      
      One final note. All tasks in the above tests were RT tasks. Any RT
      task will always preempt a non RT task that is running on the CPU
      the RT task wants to run on.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Gregory Haskins <ghaskins@novell.com>
      LKML-Reference: <20100921024138.605460343@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Venkatesh Pallipadi
      sched: Increment cache_nice_tries only on periodic lb · 00f3566f
      Venkatesh Pallipadi authored
      Commit: 58b26c4c upstream
      
      The scheduler uses cache_nice_tries as an indicator for doing cache_hot
      and active load balance when normal load balance fails. Currently,
      this value is changed on any failed load balance attempt. That ends
      up being not so nice to workloads that enter/exit idle often, as
      they do more frequent new_idle balance and that pretty soon results
      in cache hot tasks being pulled in.
      
      Making cache_nice_tries ignore failed new_idle balance makes better
      sense. With that, only the failed load balances in periodic load
      balancing get accounted, and the rate of accumulation of
      cache_nice_tries will not depend on idle entry/exit (short running
      sleep-wakeup kind of tasks). This reduces movement of cache_hot
      tasks.
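
      The change boils down to guarding the failure counter (sketch;
      placement assumed):

        /* only failures of periodic balancing count toward triggering
         * active load balance */
        if (idle != CPU_NEWLY_IDLE)
                sd->nr_balance_failed++;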
      
      schedstat diff (after-before) excerpt from a workload that has a
      frequent and short wakeup-idle pattern (:2 in the cpu col below refers
      to the NEWIDLE idx). This snapshot was taken across ~400 seconds.
      
      Without this change:
      domainstats:  domain0
       cpu     cnt      bln      fld      imb     gain    hgain  nobusyq  nobusyg
       0:2  306487   219575    73167  110069413    44583    19070     1172   218403
       1:2  292139   194853    81421  120893383    50745    21902     1259   193594
       2:2  283166   174607    91359  129699642    54931    23688     1287   173320
       3:2  273998   161788    93991  132757146    57122    24351     1366   160422
       4:2  289851   215692    62190  83398383    36377    13680      851   214841
       5:2  316312   222146    77605  117582154    49948    20281      988   221158
       6:2  297172   195596    83623  122133390    52801    21301      929   194667
       7:2  283391   178078    86378  126622761    55122    22239      928   177150
       8:2  297655   210359    72995  110246694    45798    19777     1125   209234
       9:2  297357   202011    79363  119753474    50953    22088     1089   200922
      10:2  278797   178703    83180  122514385    52969    22726     1128   177575
      11:2  272661   167669    86978  127342327    55857    24342     1195   166474
      12:2  293039   204031    73211  110282059    47285    19651      948   203083
      13:2  289502   196762    76803  114712942    49339    20547     1016   195746
      14:2  264446   169609    78292  115715605    50459    21017      982   168627
      15:2  260968   163660    80142  116811793    51483    21281     1064   162596
      
      With this change:
      domainstats:  domain0
       cpu     cnt      bln      fld      imb     gain    hgain  nobusyq  nobusyg
       0:2  272347   187380    77455  105420270    24975        1      953   186427
       1:2  267276   172360    86234  116242264    28087        6     1028   171332
       2:2  259769   156777    93281  123243134    30555        1     1043   155734
       3:2  250870   143129    97627  127370868    32026        6     1188   141941
       4:2  248422   177116    64096  78261112    22202        2      757   176359
       5:2  275595   180683    84950  116075022    29400        6      778   179905
       6:2  262418   162609    88944  119256898    31056        4      817   161792
       7:2  252204   147946    92646  122388300    32879        4      824   147122
       8:2  262335   172239    81631  110477214    26599        4      864   171375
       9:2  261563   164775    88016  117203621    28331        3      849   163926
      10:2  243389   140949    93379  121353071    29585        2      909   140040
      11:2  242795   134651    98310  124768957    30895        2     1016   133635
      12:2  255234   166622    79843  104696912    26483        4      746   165876
      13:2  244944   151595    83855  109808099    27787        3      801   150794
      14:2  241301   140982    89935  116954383    30403        6      845   140137
      15:2  232271   128564    92821  119185207    31207        4     1416   127148
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1284167957-3675-1-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Suresh Siddha
      sched: Move sched_avg_update() to update_cpu_load() · deb28d43
      Suresh Siddha authored
      Commit: da2b71ed upstream
      
      Currently sched_avg_update() (which updates rt_avg stats in the rq)
      is getting called from scale_rt_power() (in the load balance context)
      which doesn't take rq->lock.
      
      Fix it by moving the sched_avg_update() to more appropriate
      update_cpu_load() where the CFS load gets updated as well.
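
      Sketch of the move (simplified; surrounding code elided):

        /* update_cpu_load() runs with rq->lock held from the tick */
        static void update_cpu_load(struct rq *this_rq)
        {
                /* ... existing cpu_load[] decay ... */
                sched_avg_update(this_rq);  /* moved from scale_rt_power() */
        }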
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1282596171.2694.3.camel@sbsiddha-MOBL3>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Li Zefan
      sched: Remove remaining USER_SCHED code · 05db2a0c
      Li Zefan authored
      Commit: 32bd7eb5 upstream
      
      This is left over from commit 7c941438 ("sched: Remove USER_SCHED")
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Dhaval Giani <dhaval.giani@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: David Howells <dhowells@redhat.com>
      LKML-Reference: <4BA9A05F.7010407@cn.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    • Dhaval Giani
      sched: Remove USER_SCHED · b271aebc
      Dhaval Giani authored
      Commit: 7c941438 upstream
      
      Remove the USER_SCHED feature. It has been scheduled to be removed in
      2.6.34 as per http://marc.info/?l=linux-kernel&m=125728479022976&w=2
      
      [trace from referenced thread]
      [1046577.884289] general protection fault: 0000 [#1] SMP
      [1046577.911332] last sysfs file: /sys/devices/platform/coretemp.7/temp1_input
      [1046577.938715] CPU 3
      [1046577.965814] Modules linked in: ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables coretemp k8temp
      [1046577.994456] Pid: 38, comm: events/3 Not tainted 2.6.32.27intel #1 X8DT3
      [1046578.023166] RIP: 0010:[] [] sched_destroy_group+0x3c/0x10d
      [1046578.052639] RSP: 0000:ffff88043e5abe10 EFLAGS: 00010097
      [1046578.081360] RAX: ffff880139fa5540 RBX: ffff8803d18419c0 RCX: ffff8801d2f8fb78
      [1046578.109903] RDX: dead000000200200 RSI: 0000000000000000 RDI: 0000000000000000
      [1046578.109905] RBP: 0000000000000246 R08: 0000000000000020 R09: ffffffff816339b8
      [1046578.109907] R10: 0000000004e6e5f0 R11: 0000000000000006 R12: ffffffff816339b8
      [1046578.109909] R13: ffff8803d63ac4e0 R14: ffff88043e582340 R15: ffffffff8104a216
      [1046578.109911] FS: 0000000000000000(0000) GS:ffff880028260000(0000) knlGS:0000000000000000
      [1046578.109914] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      [1046578.109915] CR2: 00007f55ab220000 CR3: 00000001e5797000 CR4: 00000000000006e0
      [1046578.109917] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [1046578.109919] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [1046578.109922] Process events/3 (pid: 38, threadinfo ffff88043e5aa000, task ffff88043e582340)
      [1046578.109923] Stack:
      [1046578.109924] ffff8803d63ac498 ffff8803d63ac4d8 ffff8803d63ac440 ffffffff8104a2c3
      [1046578.109927] <0> ffff88043e5abef8 ffff880028276040 ffff8803d63ac4d8 ffffffff81050395
      [1046578.109929] <0> ffff88043e582340 ffff88043e5826c8 ffff88043e582340 ffff88043e5abfd8
      [1046578.109932] Call Trace:
      [1046578.109938] [] ? cleanup_user_struct+0xad/0xcc
      [1046578.109942] [] ? worker_thread+0x148/0x1d4
      [1046578.109946] [] ? autoremove_wake_function+0x0/0x2e
      [1046578.109948] [] ? worker_thread+0x0/0x1d4
      [1046578.109951] [] ? kthread+0x79/0x81
      [1046578.109955] [] ? child_rip+0xa/0x20
      [1046578.109957] [] ? kthread+0x0/0x81
      [1046578.109959] [] ? child_rip+0x0/0x20
      [1046578.109961] Code: 3c 00 4c 8b 25 02 98 3d 00 48 89 c5 83 cf ff eb 5c 48 8b 43 10 48 63 f7 48 8b 04 f0 48 8b 90 80 00 00 00 48 8b 48 78 48 89 51 08 <48> 89 0a 48 b9 00 02 20 00 00 00 ad de 48 89 88 80 00 00 00 48
      [1046578.109975] RIP [] sched_destroy_group+0x3c/0x10d
      [1046578.109979] RSP
      [1046578.109981] ---[ end trace 5ebc2944b7872d4a ]---
      Signed-off-by: Dhaval Giani <dhaval.giani@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1263990378.24844.3.camel@localhost>
      LKML-Reference: http://marc.info/?l=linux-kernel&m=129466345327931
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    • Sarah Sharp
      usb: Realloc xHCI structures after a hub is verified. · 265fed58
      Sarah Sharp authored
      commit 653a39d1 upstream.
      
      When there's an xHCI host power loss after a suspend from memory, the USB
      core attempts to reset and verify the USB devices that are attached to the
      system.  The xHCI driver has to reallocate those devices, since the
      hardware lost all knowledge of them during the power loss.
      
      When a hub is plugged in, and the host loses power, the xHCI hardware
      structures are not updated to say the device is a hub.  This is usually
      done in hub_configure() when the USB hub is detected.  That function is
      skipped during a reset and verify by the USB core, since the core restores
      the old configuration and alternate settings, and the hub driver has no
      idea this happened.  This bug makes the xHCI host controller reject the
      enumeration of low speed devices under the resumed hub.
      
      Therefore, make the USB core re-setup the internal xHCI hub device
      information by calling update_hub_device() when hub_activate() is called
      for a hub reset resume.  After a host power loss, all devices under the
      roothub get a reset-resume or a disconnect.
      
      This patch should be queued for the 2.6.37 stable tree.
      Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Suresh Siddha
      x86, mm: avoid possible bogus tlb entries by clearing prev mm_cpumask after switching mm · 45bfd7bf
      Suresh Siddha authored
      commit 831d52bc upstream.
      
      Clearing the cpu in prev's mm_cpumask early means that flush TLB
      IPIs are skipped while the cr3 is still pointing to the prev mm.  And
      this window can lead to the possibility of bogus TLB fills resulting
      in strange failures.  One such problematic scenario is mentioned below.
      
       T1. CPU-1 is context switching from mm1 to mm2 context and got a NMI
           etc between the point of clearing the cpu from the mm_cpumask(mm1)
           and before reloading the cr3 with the new mm2.
      
       T2. CPU-2 is tearing down a specific vma for mm1 and will proceed with
           flushing the TLB for mm1.  It doesn't send the flush TLB to CPU-1
           as it doesn't see that cpu listed in the mm_cpumask(mm1).
      
       T3. After the TLB flush is complete, CPU-2 goes ahead and frees the
           page-table pages associated with the removed vma mapping.
      
       T4. CPU-2 now allocates those freed page-table pages for something
           else.
      
       T5. As the CR3 and TLB caches for mm1 is still active on CPU-1, CPU-1
           can potentially speculate and walk through the page-table caches
           and can insert new TLB entries.  As the page-table pages are
           already freed and being used on CPU-2, this page walk can
           potentially insert a bogus global TLB entry depending on the
           (random) contents of the page that is being used on CPU-2.
      
       T6. This bogus TLB entry being global will be active across future CR3
           changes and can result in weird memory corruption etc.
      
      To avoid this issue, for the prev mm that is handing over the cpu to
      another mm, clear the cpu from the mm_cpumask(prev) after the cr3 is
      changed.
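
      A sketch of the corrected ordering in switch_mm() (simplified):

        if (likely(prev != next)) {
                cpumask_set_cpu(cpu, mm_cpumask(next));
                load_cr3(next->pgd);            /* switch address space first */
                /* only now stop flush IPIs for the previous mm */
                cpumask_clear_cpu(cpu, mm_cpumask(prev));
        }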
      
      Marking it for -stable, though we haven't seen any reported failure that
      can be attributed to this.
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Chris Wilson
      drm/i915: Add dependency on CONFIG_TMPFS · 7a9e4b42
      Chris Wilson authored
      commit f7ab9b40 upstream.
      
      Without tmpfs, shmem_readpage() is not compiled in, causing an OOPS as
      soon as we try to allocate some swappable pages for GEM.
      
      Jan 19 22:52:26 harlie kernel: Modules linked in: i915(+) drm_kms_helper cfbcopyarea video backlight cfbimgblt cfbfillrect
      Jan 19 22:52:26 harlie kernel:
      Jan 19 22:52:26 harlie kernel: Pid: 1125, comm: modprobe Not tainted 2.6.37Harlie #10 To be filled by O.E.M./To be filled by O.E.M.
      Jan 19 22:52:26 harlie kernel: EIP: 0060:[<00000000>] EFLAGS: 00010246 CPU: 3
      Jan 19 22:52:26 harlie kernel: EIP is at 0x0
      Jan 19 22:52:26 harlie kernel: EAX: 00000000 EBX: f7b7d000 ECX: f3383100 EDX: f7b7d000
      Jan 19 22:52:26 harlie kernel: ESI: f1456118 EDI: 00000000 EBP: f2303c98 ESP: f2303c7c
      Jan 19 22:52:26 harlie kernel:  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
      Jan 19 22:52:26 harlie kernel: Process modprobe (pid: 1125, ti=f2302000 task=f259cd80 task.ti=f2302000)
      Jan 19 22:52:26 harlie kernel: Stack:
      Jan 19 22:52:26 harlie udevd-work[1072]: '/sbin/modprobe -b pci:v00008086d00000046sv00000000sd00000000bc03sc00i00' unexpected exit with status 0x0009
      Jan 19 22:52:26 harlie kernel:  c1074061 000000d0 f2f42b80 00000000 000a13d2 f2d5dcc0 00000001 f2303cac
      Jan 19 22:52:26 harlie kernel:  c107416f 00000000 000a13d2 00000000 f2303cd4 f8d620ed f2cee620 00001000
      Jan 19 22:52:26 harlie kernel:  00000000 000a13d2 f1456118 f2d5dcc0 f1a40000 00001000 f2303d04 f8d637ab
      Jan 19 22:52:26 harlie kernel: Call Trace:
      Jan 19 22:52:26 harlie kernel:  [<c1074061>] ? do_read_cache_page+0x71/0x160
      Jan 19 22:52:26 harlie kernel:  [<c107416f>] ? read_cache_page_gfp+0x1f/0x30
      Jan 19 22:52:26 harlie kernel:  [<f8d620ed>] ? i915_gem_object_get_pages+0xad/0x1d0 [i915]
      Jan 19 22:52:26 harlie kernel:  [<f8d637ab>] ? i915_gem_object_bind_to_gtt+0xeb/0x2d0 [i915]
      Jan 19 22:52:26 harlie kernel:  [<f8d65961>] ? i915_gem_object_pin+0x151/0x190 [i915]
      Jan 19 22:52:26 harlie kernel:  [<c11e16ed>] ? drm_gem_object_init+0x3d/0x60
      Jan 19 22:52:26 harlie kernel:  [<f8d65aa5>] ? i915_gem_init_ringbuffer+0x105/0x1e0 [i915]
      Jan 19 22:52:26 harlie kernel:  [<f8d571b7>] ? i915_driver_load+0x667/0x1160 [i915]
      Reported-by: John J. Stimson-III <john@idsfa.net>
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Knut Petersen
    • Alex Deucher
      drm/radeon/kms: fix s/r issues with bios scratch regs · df362518
      Alex Deucher authored
      commit 87364760 upstream.
      
      The accelerate mode bit gets checked by certain atom
      command tables to set up some register state.  It needs
      to be clear when setting modes and set when not.
      
      Fixes:
      https://bugzilla.kernel.org/show_bug.cgi?id=26942
      Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Alex Deucher
      drm/radeon: remove 0x4243 pci id · cce639ad
      Alex Deucher authored
      commit 63a50780 upstream.
      
      0x4243 is a PCI bridge, not a GPU.
      
      Fixes:
      https://bugs.freedesktop.org/show_bug.cgi?id=33815
      Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Alex Deucher
      drm/radeon/kms: add pll debugging output · fdf06b2a
      Alex Deucher authored
      commit 51d4bf84 upstream.
      Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Alex Deucher
      drm/radeon/kms: make the mac rv630 quirk generic · e45fd967
      Alex Deucher authored
      commit be23da8a upstream.
      
      Seems some other boards do this as well.
      Reported-by: Andrea Merello <andrea.merello@gmail.com>
      Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
      Signed-off-by: Dave Airlie <airlied@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Alex Deucher
      drm/radeon/kms: add quirk for Mac Radeon HD 2600 card · f06aaf2b
      Alex Deucher authored
      commit f598aa75 upstream.
      Reported-by: 屋国遥 <hyagni@gmail.com>
      Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Mike Snitzer
      dm mpath: disable blk_abort_queue · 05b0d76d
      Mike Snitzer authored
      commit 09c9d4c9 upstream.
      
      Revert commit 224cb3e9
        dm: Call blk_abort_queue on failed paths
      
      Multipath began to use blk_abort_queue() to allow for
      lower latency path deactivation.  This was found to
      cause list corruption:
      
         the cmd gets blk_abort_queued/timedout run on it and the scsi eh
         somehow is able to complete and run scsi_queue_insert while
         scsi_request_fn is still trying to process the request.
      
         https://www.redhat.com/archives/dm-devel/2010-November/msg00085.html
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Cc: Mike Anderson <andmike@linux.vnet.ibm.com>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Mike Snitzer
      dm: dont take i_mutex to change device size · 67c39be2
      Mike Snitzer authored
      commit c217649b upstream.
      
      No longer needlessly hold md->bdev->bd_inode->i_mutex when changing the
      size of a DM device.  This additional locking is unnecessary because
      i_size_write() is already protected by the existing critical section in
      dm_swap_table().  DM already has a reference on md->bdev so the
      associated bd_inode may be changed without lifetime concerns.
      
      A negative side-effect of having held md->bdev->bd_inode->i_mutex was
      that a concurrent DM device resize and flush (via fsync) would deadlock.
      Dropping md->bdev->bd_inode->i_mutex eliminates this potential for
      deadlock.  The following reproducer no longer deadlocks:
        https://www.redhat.com/archives/dm-devel/2009-July/msg00284.html
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Amitkumar Karwar
      ieee80211: correct IEEE80211_ADDBA_PARAM_BUF_SIZE_MASK macro · b2e68330
      Amitkumar Karwar authored
      commit 8d661f1e upstream.
      
      It is defined in include/linux/ieee80211.h. As per the IEEE spec,
      bits 6 to 15 in the block ack parameter represent the buffer size,
      so the bitmask should be 0xFFC0.
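
      For illustration, the corrected mask and how it would be used (the
      extraction snippet is hypothetical):

        #define IEEE80211_ADDBA_PARAM_BUF_SIZE_MASK 0xFFC0

        /* buffer size occupies bits 6..15 of the block ack parameter set */
        u16 buf_size = (params & IEEE80211_ADDBA_PARAM_BUF_SIZE_MASK) >> 6;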
      Signed-off-by: Amitkumar Karwar <akarwar@marvell.com>
      Signed-off-by: Bing Zhao <bzhao@marvell.com>
      Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Eric Paris
      SELinux: do not compute transition labels on mountpoint labeled filesystems · 638f3ef4
      Eric Paris authored
      commit 415103f9 upstream.
      
      selinux_inode_init_security computes transition sids even for filesystems
      that use mount point labeling. It shouldn't do that; it should always use
      the mount point label, no matter what.
      
      This causes 2 problems: 1) it makes file creation slower than it needs to
      be, since we calculate the transition sid, and 2) it allows files to be
      created with a different label than the mount point!
      
      # id -Z
      staff_u:sysadm_r:sysadm_t:s0-s0:c0.c1023
      # sesearch --type --class file --source sysadm_t --target tmp_t
      Found 1 semantic te rules:
         type_transition sysadm_t tmp_t : file user_tmp_t;
      
      # mount -o loop,context="system_u:object_r:tmp_t:s0"  /tmp/fs /mnt/tmp
      
      # ls -lZ /mnt/tmp
      drwx------. root root system_u:object_r:tmp_t:s0       lost+found
      # touch /mnt/tmp/file1
      # ls -lZ /mnt/tmp
      -rw-r--r--. root root staff_u:object_r:user_tmp_t:s0   file1
      drwx------. root root system_u:object_r:tmp_t:s0       lost+found
      
      Whoops, we have a mount point labeled filesystem tmp_t with a user_tmp_t
      labeled file!
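
      A sketch of the fix described above (identifier names assumed from
      SELinux internals):

        /* in selinux_inode_init_security() */
        if ((sbsec->flags & SE_SBINITIALIZED) &&
            (sbsec->behavior == SECURITY_FS_USE_MNTPOINT)) {
                newsid = sbsec->mntpoint_sid;   /* always the mount label */
        } else {
                /* compute the transition sid, as before */
        }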
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Reviewed-by: James Morris <jmorris@namei.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Eric Paris
      SELinux: define permissions for DCB netlink messages · 599fde16
      Eric Paris authored
      commit 350e4f31 upstream.
      
      Commit 2f90b865 added two new netlink message types to the netlink route
      socket. SELinux has hooks to define whether netlink messages are allowed
      to be sent or received, but it did not know about these two new message
      types. By default we allow such actions, so no one likely noticed. This
      patch adds the proper definitions and thus proper permissions
      enforcement.
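
      The additions are of this shape (table and constant names assumed):

        /* new entries in the netlink route socket permission table */
        { RTM_GETDCB, NETLINK_ROUTE_SOCKET__NLMSG_READ  },
        { RTM_SETDCB, NETLINK_ROUTE_SOCKET__NLMSG_WRITE },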
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Stefan Berger
      tpm_tis: Use timeouts returned from TPM · 25e442e2
      Stefan Berger authored
      commit 9b29050f upstream.
      
      The current TPM TIS driver in git discards the timeout values returned
      from the TPM. The check of the response packet needs to consider that
      the return_code field is 0 on success and the size of the expected
      packet is equivalent to the header size + u32 length indicator for the
      TPM_GetCapability() result + 3 timeout indicators of type u32.
      
      I am also adding a sysfs entry 'timeouts' showing the timeouts that are
      being used.
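
      The expected-size computation described above, as a sketch (names
      illustrative):

        /* header + u32 cap length + 3 u32 timeout values */
        u32 expected = TPM_HEADER_SIZE + sizeof(u32) + 3 * sizeof(u32);

        if (rc == 0 && be32_to_cpu(cmd.header.out.length) == expected)
                save_returned_timeouts(chip, &cmd);  /* hypothetical helper */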
      Signed-off-by: Stefan Berger <stefanb@linux.vnet.ibm.com>
      Tested-by: Guillaume Chazarain <guichaz@gmail.com>
      Signed-off-by: Rajiv Andrade <srajiv@linux.vnet.ibm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Rajiv Andrade
      TPM: Long default timeout fix · 57180437
      Rajiv Andrade authored
      commit c4ff4b82 upstream.
      
      If the duration variable is 0 at this point, it's because
      chip->vendor.duration wasn't filled by tpm_get_timeouts() yet.
      This patch then sets the lowest timeout, just to give
      tpm_get_timeouts() enough time to further succeed.
      
      This fix avoids long boot times in case another entity attempts
      to send commands to the TPM when the TPM isn't accessible.
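
      A sketch of the fallback (names and the short value are assumed,
      not taken from the patch):

        duration = chip->vendor.duration[duration_idx];
        /* not yet filled by tpm_get_timeouts(): use the lowest timeout
         * instead of the long default */
        if (duration <= 0)
                duration = TPM_SHORT_DURATION;
        return duration;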
      Signed-off-by: Rajiv Andrade <srajiv@linux.vnet.ibm.com>
      Signed-off-by: James Morris <jmorris@namei.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Tejun Heo
      pata_mpc52xx: inherit from ata_bmdma_port_ops · c873af89
      Tejun Heo authored
      commit 77c5fd19 upstream.
      
      pata_mpc52xx supports BMDMA but inherits ata_sff_port_ops which
      triggers BUG_ON() when a DMA command is issued.  Fix it.
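
      The fix is essentially a one-liner (sketch; other hooks unchanged):

        static struct ata_port_operations mpc52xx_ata_port_ops = {
                .inherits = &ata_bmdma_port_ops,  /* was &ata_sff_port_ops */
                /* ... driver-specific callbacks ... */
        };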
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Roman Fietze <roman.fietze@telemotive.de>
      Cc: Sergei Shtylyov <sshtylyov@mvista.com>
      Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • NeilBrown
      md: fix regression with re-adding devices to arrays with no metadata · 6d82749e
      NeilBrown authored
      commit bf572541 upstream.
      
      Commit 1a855a06 (2.6.37-rc4) fixed a problem where devices were
      re-added when they shouldn't be but caused a regression in a less
      common case that means sometimes devices cannot be re-added when they
      should be.
      
      In particular, when re-adding a device to an array without metadata
      we should always accept the device, but after the above commit we
      didn't.
      
      This patch sets the In_sync flag in that case so that the re-add
      succeeds.
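
      The shape of the fix (sketch; the exact condition is assumed):

        /* re-add to an array without persistent metadata: treat the
         * device as in sync so the re-add can succeed */
        if (!mddev->persistent && rdev->raid_disk >= 0)
                set_bit(In_sync, &rdev->flags);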
      
      This patch is suitable for any -stable kernel to which 1a855a06 was
      applied.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Stanislaw Gruszka
      hostap_cs: fix sleeping function called from invalid context · 04c7ff05
      Stanislaw Gruszka authored
      commit 4e5518ca upstream.
      
      pcmcia_request_irq() and pcmcia_enable_device() are intended
      to be called from process context (the first function allocates
      memory with GFP_KERNEL, the second takes a mutex). We can not take
      a spin lock and call them.
      
      It's safe to move the spin lock to after pcmcia_enable_device() as
      we still hold off the IRQ while dev->base_addr is 0, and the driver
      will not proceed with interrupts when it is not ready.
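
      A sketch of the reordering (error handling elided; lock and field
      names assumed):

        ret = pcmcia_request_irq(link, prism2_interrupt);  /* may sleep */
        ret = pcmcia_enable_device(link);                  /* may sleep */

        spin_lock_irqsave(&local->irq_init_lock, flags);
        dev->base_addr = link->resource[0]->start;         /* IRQs live now */
        spin_unlock_irqrestore(&local->irq_init_lock, flags);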
      
      Patch resolves:
      https://bugzilla.redhat.com/show_bug.cgi?id=643758
      
      Reported-and-tested-by: rbugz@biobind.com
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    • Anton Blanchard
      kernel/smp.c: fix smp_call_function_many() SMP race · b5dc8db4
      Anton Blanchard authored
      commit 6dc19899 upstream.
      
      I noticed a failure where we hit the following WARN_ON in
      generic_smp_call_function_interrupt:
      
                      if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
                              continue;
      
                      data->csd.func(data->csd.info);
      
                      refs = atomic_dec_return(&data->refs);
                      WARN_ON(refs < 0);      <-------------------------
      
      We atomically tested and cleared our bit in the cpumask, and yet the
      number of cpus left (ie refs) was 0.  How can this be?
      
      It turns out commit 54fdade1
      ("generic-ipi: make struct call_function_data lockless") is at fault.  It
      removes locking from smp_call_function_many and in doing so creates a
      rather complicated race.
      
      The problem comes about because:
      
       - The smp_call_function_many interrupt handler walks call_function.queue
         without any locking.
       - We reuse a percpu data structure in smp_call_function_many.
       - We do not wait for any RCU grace period before starting the next
         smp_call_function_many.
      
      Imagine a scenario where CPU A does two smp_call_functions back to back,
      and CPU B does an smp_call_function in between.  We concentrate on how CPU
      C handles the calls:
      
      CPU A            CPU B                  CPU C              CPU D
      
      smp_call_function
                                              smp_call_function_interrupt
                                                  walks
                                              call_function.queue sees
                                              data from CPU A on list
      
                       smp_call_function
      
                                              smp_call_function_interrupt
                                                  walks
      
                                              call_function.queue sees
                                                (stale) CPU A on list
                                                                 smp_call_function int
                                                                 clears last ref on A
                                                                 list_del_rcu, unlock
      smp_call_function reuses
      percpu *data A
                                              data->cpumask sees and
                                              clears bit in cpumask
                                              might be using old or new fn!
                                              decrements refs below 0
      
      set data->refs (too late!)
      
      The important thing to note is that, since the interrupt handler walks
      a potentially stale call_function.queue without any locking, another
      cpu can view the percpu *data structure at any time, even while the
      owner is in the process of initialising it.
      
      The following test case hits the WARN_ON 100% of the time on my PowerPC
      box (having 128 threads does help :)
      
      #include <linux/module.h>
      #include <linux/init.h>
      #include <linux/kernel.h>	/* printk() */
      #include <linux/smp.h>	/* smp_call_function() */
      #include <linux/workqueue.h>	/* INIT_WORK(), schedule_work_on() */
      
      #define ITERATIONS 100
      
      static void do_nothing_ipi(void *dummy)
      {
      }
      
      static void do_ipis(struct work_struct *dummy)
      {
      	int i;
      
      	for (i = 0; i < ITERATIONS; i++)
      		smp_call_function(do_nothing_ipi, NULL, 1);
      
      	printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
      }
      
      static struct work_struct work[NR_CPUS];
      
      static int __init testcase_init(void)
      {
      	int cpu;
      
      	for_each_online_cpu(cpu) {
      		INIT_WORK(&work[cpu], do_ipis);
      		schedule_work_on(cpu, &work[cpu]);
      	}
      
      	return 0;
      }
      
      static void __exit testcase_exit(void)
      {
      }
      
      module_init(testcase_init);
      module_exit(testcase_exit);
      MODULE_LICENSE("GPL");
      MODULE_AUTHOR("Anton Blanchard");
      
      I tried to fix it by ordering the read and the write of ->cpumask and
      ->refs.  In doing so I missed a critical case, but Paul McKenney was
      thankfully able to spot my bug :) To ensure we aren't viewing previous
      iterations, the interrupt handler needs to read ->refs, then ->cpumask,
      then ->refs _again_.
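      
      A simplified sketch of that ordering in
      generic_smp_call_function_interrupt(); the barrier placement follows
      the description above rather than the exact upstream diff:
      
      	/* Skip entries whose owner hasn't finished arming them. */
      	if (!atomic_read(&data->refs))
      		continue;
      	smp_rmb();	/* order the refs read before the cpumask read */
      
      	if (!cpumask_test_cpu(cpu, data->cpumask))
      		continue;
      	smp_rmb();	/* order the cpumask read before re-reading refs */
      
      	/* Re-check refs: the owner may be re-initialising this entry. */
      	if (!atomic_read(&data->refs))
      		continue;
      
      	if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
      		continue;
      
      	data->csd.func(data->csd.info);
      	refs = atomic_dec_return(&data->refs);
      	WARN_ON(refs < 0);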
      
      Thanks to Milton Miller and Paul McKenney for helping to debug this issue.
      
      [miltonm@bga.com: add WARN_ON and BUG_ON, remove extra read of refs before initial read of mask that doesn't help (also noted by Peter Zijlstra), adjust comments, hopefully clarify scenario]
      [miltonm@bga.com: remove excess tests]
      Signed-off-by: default avatarAnton Blanchard <anton@samba.org>
      Signed-off-by: default avatarMilton Miller <miltonm@bga.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@suse.de>
      b5dc8db4
    • Guy Martin's avatar
      parisc : Remove broken line wrapping handling pdc_iodc_print() · 8dfd4910
      Guy Martin authored
      commit fbea6684 upstream.
      
      Remove the broken line wrapping handling in pdc_iodc_print().
      It is broken in 3 ways:
        - It doesn't keep track of the current screen position; it just
          assumes that the new buffer will be printed at the beginning of
          the screen.
        - It doesn't take into account that non-printable characters won't
          increase the current position on the screen.
        - And last but not least, it triggers a kernel panic if a backspace
          is the first char in the provided buffer:
      
       Backtrace:
        [<0000000040128ec4>] pdc_console_write+0x44/0x78
        [<0000000040128f18>] pdc_console_tty_write+0x20/0x38
        [<000000004032f1ac>] n_tty_write+0x2a4/0x550
        [<000000004032b158>] tty_write+0x1e0/0x2d8
        [<00000000401bb420>] vfs_write+0xb8/0x188
        [<00000000401bb630>] sys_write+0x68/0xb8
        [<0000000040104eb8>] syscall_exit+0x0/0x14
      
      Most terminals handle the line wrapping just fine. I've confirmed that
      it works correctly on a C8000 with both vga and serial output.
      Signed-off-by: default avatarGuy Martin <gmsoft@tuxicoman.be>
      Signed-off-by: default avatarJames Bottomley <James.Bottomley@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@suse.de>
      8dfd4910
    • Benjamin Herrenschmidt's avatar
      powerpc: Fix some 6xx/7xxx CPU setup functions · 6513be80
      Benjamin Herrenschmidt authored
      commit 1f1936ff upstream.
      
      Some of those functions try to adjust the CPU features, for example
      to remove NAP support on some revisions. However, they seem to use
      r5 as an index into the CPU table entry, which might have been right
      a long time ago but no longer is. r4 is the right register to use.
      
      This probably caused some odd behaviours on some PowerMac variants
      using 750cx or 7455 processor revisions.
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@suse.de>
      6513be80
    • David Miller's avatar
      klist: Fix object alignment on 64-bit. · d845c588
      David Miller authored
      commit 795abaf1 upstream.
      
      Commit c0e69a5b ("klist.c: bit 0 in pointer can't be used as flag")
      intended to make sure that all klist objects were at least pointer size
      aligned, but used the constant "4" which only works on 32-bit.
      
      Use "sizeof(void *)" which is correct in all cases.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Acked-by: default avatarJesper Nilsson <jesper.nilsson@axis.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@suse.de>
      d845c588
    • Dario Lombardo's avatar
      drivers: update to pl2303 usb-serial to support Motorola cables · 018f5892
      Dario Lombardo authored
      commit 96a3e79e upstream.
      
      Added device id 0x0307 to the pl2303 usb-serial driver to support
      Motorola cables. These cables carry a modified chip that is a pl2303
      but declares itself as 0307. Fixed by adding the right device id to
      the supported devices list, assigning it the label
      PL2303_PRODUCT_ID_MOTOROLA.
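      
      A sketch of what such an addition typically looks like; the macro
      name is given above, while the table context follows the usual
      usb-serial pattern rather than the exact upstream diff:
      
      	/* pl2303.h */
      	#define PL2303_PRODUCT_ID_MOTOROLA	0x0307
      
      	/* pl2303.c device table */
      	static const struct usb_device_id id_table[] = {
      		/* ... existing entries ... */
      		{ USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_MOTOROLA) },
      		{ }	/* terminating entry */
      	};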
      Signed-off-by: default avatarDario Lombardo <dario.lombardo@libero.it>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@suse.de>
      018f5892