1. 13 Mar, 2014 9 commits
  2. 12 Mar, 2014 31 commits
    • SELinux: Increase ebitmap_node size for 64-bit configuration · 17d633ca
      Waiman Long authored
      commit a767f680 upstream.
      
      Currently, the ebitmap_node structure has a fixed size of 32 bytes. On
      a 32-bit system, the overhead is 8 bytes, leaving 24 bytes for being
      used as bitmaps. The overhead ratio is 1/4.
      
      On a 64-bit system, the overhead is 16 bytes. Therefore, only 16 bytes
      are left for bitmap purpose and the overhead ratio is 1/2. With a
      3.8.2 kernel, a boot-up operation will cause the ebitmap_get_bit()
      function to be called about 9 million times. The average number of
      ebitmap_node traversal is about 3.7.
      
      This patch increases the size of the ebitmap_node structure to 64
      bytes on 64-bit systems to keep the overhead ratio at 1/4. This may
      also improve performance slightly by making node-to-node traversal
      less frequent (fewer than 2 per call on average) as more bits are
      available in each node.
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: Paul Moore <pmoore@redhat.com>
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      17d633ca
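      The size arithmetic in the commit above can be checked with a small
      sketch. It assumes, as a simplification not taken from the kernel
      source, that the per-node overhead is one pointer plus a 32-bit start
      index, padded up to pointer alignment. The function names are
      hypothetical.

```c
#include <stddef.h>

/* Assumed model: overhead = 'next' pointer + 32-bit start index,
 * rounded up to a multiple of the pointer size. */
static size_t node_overhead(size_t pointer_size)
{
    size_t raw = pointer_size + 4;

    return (raw + pointer_size - 1) / pointer_size * pointer_size;
}

/* Bytes left for bitmap words in a node of node_size bytes. */
static size_t node_bitmap_bytes(size_t node_size, size_t pointer_size)
{
    return node_size - node_overhead(pointer_size);
}
```

      Under this model a 32-byte node leaves 24 bitmap bytes with 4-byte
      pointers and 16 with 8-byte pointers, while a 64-byte node with 8-byte
      pointers leaves 48, which restores the 1/4 overhead ratio.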
    • SELinux: Reduce overhead of mls_level_isvalid() function call · 3b32517b
      Waiman Long authored
      commit fee71142 upstream.
      
      While running the high_systime workload of the AIM7 benchmark on
      a 2-socket 12-core Westmere x86-64 machine running 3.10-rc4 kernel
      (with HT on), it was found that a pretty sizable amount of time was
      spent in the SELinux code. Below was the perf trace of the "perf
      record -a -s" of a test run at 1500 users:
      
        5.04%            ls  [kernel.kallsyms]     [k] ebitmap_get_bit
        1.96%            ls  [kernel.kallsyms]     [k] mls_level_isvalid
        1.95%            ls  [kernel.kallsyms]     [k] find_next_bit
      
      The ebitmap_get_bit() was the hottest function in the perf-report
      output.  Both the ebitmap_get_bit() and find_next_bit() functions
      were, in fact, called by mls_level_isvalid(). As a result, the
      mls_level_isvalid() call consumed 8.95% of the total CPU time of
      all the 24 virtual CPUs which is quite a lot. The majority of the
      mls_level_isvalid() function invocations come from the socket creation
      system call.
      
      Looking at the mls_level_isvalid() function, it checks that all the
      bits set in one ebitmap structure are also set in another one, and
      that the highest set bit is no bigger than the one specified by the
      given policydb data structure. It does this in a bit-by-bit manner,
      so if the ebitmap structure has many bits set, the loop iterates
      many times.
      
      The current code can be rewritten to use a similar algorithm as the
      ebitmap_contains() function with an additional check for the
      highest set bit. The ebitmap_contains() function was extended to
      cover an optional additional check for the highest set bit, and the
      mls_level_isvalid() function was modified to call ebitmap_contains().
      
      With that change, the perf trace showed that the used CPU time dropped
      to just 0.08% (ebitmap_contains + mls_level_isvalid) of the
      total, which is about 100X less than before.
      
        0.07%            ls  [kernel.kallsyms]     [k] ebitmap_contains
        0.05%            ls  [kernel.kallsyms]     [k] ebitmap_get_bit
        0.01%            ls  [kernel.kallsyms]     [k] mls_level_isvalid
        0.01%            ls  [kernel.kallsyms]     [k] find_next_bit
      
      The remaining ebitmap_get_bit() and find_next_bit() function calls
      are made by other kernel routines, as the new mls_level_isvalid()
      function no longer calls them.
      
      This patch also improves the high_systime AIM7 benchmark result,
      though the improvement is not as impressive as is suggested by the
      reduction in CPU time spent in the ebitmap functions. The table below
      shows the performance change on the 2-socket x86-64 system (with HT
      on) mentioned above.
      
      +--------------+---------------+----------------+-----------------+
      |   Workload   | mean % change | mean % change  | mean % change   |
      |              | 10-100 users  | 200-1000 users | 1100-2000 users |
      +--------------+---------------+----------------+-----------------+
      | high_systime |     +0.1%     |     +0.9%      |     +2.6%       |
      +--------------+---------------+----------------+-----------------+
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: Paul Moore <pmoore@redhat.com>
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      3b32517b
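      The difference between the bit-by-bit loop and a word-at-a-time
      containment check, as described in the commit above, can be sketched
      roughly as follows. This is not the kernel's ebitmap code: it works on
      flat arrays of 64-bit words rather than linked ebitmap nodes, and the
      function names and the last_bit parameter (the highest bit index
      allowed) are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-by-bit check: one test per bit position. */
static bool subset_bitwise(const uint64_t *sub, const uint64_t *sup,
                           size_t nwords, size_t last_bit)
{
    for (size_t bit = 0; bit < nwords * 64; bit++) {
        bool in_sub = (sub[bit / 64] >> (bit % 64)) & 1;
        bool in_sup = (sup[bit / 64] >> (bit % 64)) & 1;

        if (in_sub && (!in_sup || bit > last_bit))
            return false;
    }
    return true;
}

/* Word-at-a-time check: one AND-NOT per 64 bits, plus a range mask. */
static bool subset_wordwise(const uint64_t *sub, const uint64_t *sup,
                            size_t nwords, size_t last_bit)
{
    for (size_t w = 0; w < nwords; w++) {
        size_t base = w * 64;
        uint64_t too_high;  /* positions in this word above last_bit */

        if (sub[w] & ~sup[w])
            return false;   /* some bit of sub is missing from sup */

        if (last_bit < base)
            too_high = ~UINT64_C(0);
        else if (last_bit - base >= 63)
            too_high = 0;
        else
            too_high = ~UINT64_C(0) << (last_bit - base + 1);

        if (sub[w] & too_high)
            return false;   /* a set bit lies above the allowed maximum */
    }
    return true;
}
```

      Both functions return the same answer; the second touches each word
      once instead of each bit once, which is the kind of saving the perf
      numbers above reflect.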
    • mm: do not walk all of system memory during show_mem · 67e204a0
      Mel Gorman authored
      commit c78e9363 upstream.
      
      It has been reported on very large machines that show_mem is taking almost
      5 minutes to display information.  This is a serious problem if there is
      an OOM storm.  The bulk of the cost is in show_mem doing a very expensive
      PFN walk to give us the following information
      
        Total RAM:       Also available as totalram_pages
        Highmem pages:   Also available as totalhigh_pages
        Reserved pages:  Can be inferred from the zone structure
        Shared pages:    PFN walk required
        Unshared pages:  PFN walk required
        Quick pages:     Per-cpu walk required
      
      Only the shared/unshared page counts require a full PFN walk but that
      information is useless.  It is also inaccurate as page pins of unshared
      pages would be accounted for as shared.  Even if the information was
      accurate, I'm struggling to think how the shared/unshared information
      could be useful for debugging OOM conditions.  Maybe it was useful before
      rmap existed when reclaiming shared pages was costly but it is less
      relevant today.
      
      The PFN walk could be optimised a bit but why bother as the information is
      useless.  This patch deletes the PFN walker and infers the total RAM,
      highmem and reserved pages count from struct zone.  It omits the
      shared/unshared page usage on the grounds that it is useless.  It also
      corrects the reporting of HighMem as HighMem/MovableOnly as ZONE_MOVABLE
      has similar problems to HighMem with respect to lowmem/highmem exhaustion.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      67e204a0
    • epoll: do not take the nested ep->mtx on EPOLL_CTL_DEL · 230a558c
      Jason Baron authored
      commit 4ff36ee9 upstream.
      
      The EPOLL_CTL_DEL path of epoll contains a classic ab-ba deadlock.
      That is, epoll_ctl(a, EPOLL_CTL_DEL, b, x) will deadlock with
      epoll_ctl(b, EPOLL_CTL_DEL, a, x).  The deadlock was introduced with
      commit 67347fe4 ("epoll: do not take global 'epmutex' for simple
      topologies").
      
      The acquisition of the ep->mtx for the destination 'ep' was added such
      that a concurrent EPOLL_CTL_ADD operation would see the correct state of
      the ep (specifically, the check for '!list_empty(&f.file->f_ep_links)').
      
      However, by simply not acquiring the lock, we do not serialize behind
      the ep->mtx from the add path, and thus may perform a full path check
      when if we had waited a little longer it may not have been necessary.
      However, this is a transient state, and performing the full loop
      checking in this case is not harmful.
      
      The important point is that we wouldn't miss doing the full loop
      checking when required, since EPOLL_CTL_ADD always locks any 'ep's that
      it's operating upon.  The reason we don't need to do lock ordering in the
      add path is that we are already holding the global 'epmutex'
      whenever we do the double lock.  Further, the original posting of this
      patch, which was tested for the intended performance gains, did not
      perform this additional locking.
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Cc: Nathan Zimmer <nzimmer@sgi.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Nelson Elhage <nelhage@nelhage.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      230a558c
    • epoll: do not take global 'epmutex' for simple topologies · 107c1943
      Jason Baron authored
      commit 67347fe4 upstream.
      
      When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached
      directly to a wakeup source, we do not need to take the global 'epmutex',
      unless the epoll file descriptor is nested.  The purpose of taking the
      'epmutex' on add is to prevent complex topologies such as loops and deep
      wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD
      operations.  However, for the simple case of an epoll file descriptor
      attached directly to a wakeup source (with no nesting), we do not need to
      hold the 'epmutex'.
      
      This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves
      scalability on larger systems.  Quoting Nathan Zimmer's mail on SPECjbb
      performance:
      
      "On the 16 socket run the performance went from 35k jOPS to 125k jOPS.  In
      addition the benchmark went from scaling well on 10 sockets to scaling
      well on just over 40 sockets.
      
      ...
      
      Currently the benchmark stops scaling at around 40-44 sockets but it seems like
      I found a second unrelated bottleneck."
      
      [akpm@linux-foundation.org: use `bool' for boolean variables, remove unneeded/undesirable cast of void*, add missed ep_scan_ready_list() kerneldoc]
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Tested-by: Nathan Zimmer <nzimmer@sgi.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Nelson Elhage <nelhage@nelhage.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      107c1943
    • epoll: optimize EPOLL_CTL_DEL using rcu · 81ff0d3b
      Jason Baron authored
      commit ae10b2b4 upstream.
      
      Nathan Zimmer found that once we get over 10 cpus, the scalability of
      SPECjbb falls over due to the contention on the global 'epmutex', which is
      taken on EPOLL_CTL_ADD and EPOLL_CTL_DEL operations.
      
      Patch #1 removes the 'epmutex' lock completely from the EPOLL_CTL_DEL path
      by using rcu to guard against any concurrent traversals.
      
      Patch #2 removes the 'epmutex' lock from EPOLL_CTL_ADD operations for
      simple topologies, i.e. when adding a link from an epoll file descriptor to
      a wakeup source, where the epoll file descriptor is not nested.
      
      This patch (of 2):
      
      Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by
      converting the file->f_ep_links list into an rcu one.  In this way, we can
      traverse the epoll network on the add path in parallel with deletes.
      Since deletes can't create loops or worse wakeup paths, this is safe.
      
      This patch in combination with the patch "epoll: Do not take global 'epmutex'
      for simple topologies", shows a dramatic performance improvement in
      scalability for SPECjbb.
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Tested-by: Nathan Zimmer <nzimmer@sgi.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Nelson Elhage <nelhage@nelhage.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      CC: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      81ff0d3b
    • x86/dumpstack: Fix printk_address for direct addresses · dd7ab812
      Jiri Slaby authored
      commit 5f01c988 upstream.
      
      Consider a kernel crash in a module, simulated the following way:
      
       static int my_init(void)
       {
               char *map = (void *)0x5;
               *map = 3;
               return 0;
       }
       module_init(my_init);
      
      When we turn off FRAME_POINTERs, the very first instruction in
      that function causes a BUG. The problem is that we print IP in
      the BUG report using %pB (from printk_address). And %pB
      decrements the pointer by one to fix printing addresses of
      functions with tail calls.
      
      This was added in commit 71f9e598 ("x86, dumpstack: Use
      %pB format specifier for stack trace") to fix the call stack
      printouts.
      
      So instead of correct output:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000005
        IP: [<ffffffffa01ac000>] my_init+0x0/0x10 [pb173]
      
      We get:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000005
        IP: [<ffffffffa0152000>] 0xffffffffa0151fff
      
      To fix that, we use %pB only for stack address printouts (via the
      newly added printk_stack_address) and %pS for regs->ip (via
      printk_address). I.e. we revert to the old behaviour for all
      except call stacks. And since reliable is 1 for all of the remaining
      printk_address callers, we remove that parameter from printk_address.
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: joe@perches.com
      Cc: jirislaby@gmail.com
      Link: http://lkml.kernel.org/r/1382706418-8435-1-git-send-email-jslaby@suse.cz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      dd7ab812
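      Why decrementing the address helps for return addresses but hurts for
      a faulting IP can be illustrated with a toy symbol lookup. This is a
      sketch only: the table, the addresses and the helper names are
      hypothetical, and the real kernel resolves symbols through kallsyms
      rather than a linear scan.

```c
#include <stddef.h>
#include <stdint.h>

struct sym {
    uintptr_t start;
    const char *name;
};

/* Hypothetical symbol table, sorted by start address. */
static const struct sym symtab[] = {
    { 0x1000, "helper" },
    { 0x1040, "my_init" },
    { 0x1050, "my_exit" },
};

/* Return the name of the last symbol starting at or before addr. */
static const char *lookup(uintptr_t addr)
{
    const char *name = NULL;

    for (size_t i = 0; i < sizeof(symtab) / sizeof(symtab[0]); i++)
        if (symtab[i].start <= addr)
            name = symtab[i].name;
    return name;
}

/* Direct address (e.g. a faulting IP): resolve as-is, like %pS. */
static const char *resolve_direct(uintptr_t addr)
{
    return lookup(addr);
}

/* Return address from a call stack: step back one byte first, like %pB. */
static const char *resolve_backtrace(uintptr_t addr)
{
    return lookup(addr - 1);
}
```

      With this table, a fault on the first instruction of my_init (0x1040)
      resolves to my_init when looked up directly, but to helper when the
      address is decremented first. A return address of 0x1050, left by a
      call that is the last instruction of my_init, resolves to my_init only
      with the decrement.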
    • SUNRPC: close a rare race in xs_tcp_setup_socket. · 6456bd97
      NeilBrown authored
      commit 93dc41bd upstream.
      
      We have one report of a crash in xs_tcp_setup_socket.
      The call path to the crash is:
      
        xs_tcp_setup_socket -> inet_stream_connect -> lock_sock_nested.
      
      The 'sock' passed to that last function is NULL.
      
      The only way I can see this happening is a concurrent call to
      xs_close:
      
        xs_close -> xs_reset_transport -> sock_release -> inet_release
      
      inet_release sets:
         sock->sk = NULL;
      inet_stream_connect calls
         lock_sock(sock->sk);
      which gets NULL.
      
      All calls to xs_close are protected by XPRT_LOCKED as are most
      activations of the workqueue which runs xs_tcp_setup_socket.
      The exception is xs_tcp_schedule_linger_timeout.
      
      So presumably the timeout queued by the latter fires exactly when some
      other code runs xs_close().
      
      To protect against this we can move the cancel_delayed_work_sync()
      call from xs_destroy() to xs_close().
      
      As xs_close is never called from the worker scheduled on
      ->connect_worker, this can never deadlock.
      Signed-off-by: NeilBrown <neilb@suse.de>
      [Trond: Make it safe to call cancel_delayed_work_sync() on AF_LOCAL sockets]
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      6456bd97
    • sched/rt: Remove redundant nr_cpus_allowed test · 4aef0b11
      Shawn Bohrer authored
      commit 6bfa687c upstream.
      
      In 76854c7e ("sched: Use rt.nr_cpus_allowed to recover
      select_task_rq() cycles") an optimization was added to
      select_task_rq_rt() that immediately returns when
      p->nr_cpus_allowed == 1 at the beginning of the function.
      
      This makes the later p->nr_cpus_allowed > 1 check redundant,
      which can now be removed.
      Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mike Galbraith <mgalbraith@suse.de>
      Cc: tomk@rgmadvisors.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1380914693-24634-1-git-send-email-shawn.bohrer@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      4aef0b11
    • sched/rt: Add missing rmb() · d171cbfc
      Peter Zijlstra authored
      commit 7c3f2ab7 upstream.
      
      While discussing the proposed SCHED_DEADLINE patches which in parts
      mimic the existing FIFO code it was noticed that the wmb in
      rt_set_overloaded() didn't have a matching barrier.
      
      The only site using rt_overloaded() to test the rto_count is
      pull_rt_task() and we should issue a matching rmb before then assuming
      there's an rto_mask bit set.
      
      Without that smp_rmb() in there we could actually miss seeing the
      rto_mask bit.
      
      Also, change to using smp_[wr]mb(), even though this is SMP only code;
      memory barriers without smp_ always make me think they're against
      hardware of some sort.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: vincent.guittot@linaro.org
      Cc: luca.abeni@unitn.it
      Cc: bruce.ashfield@windriver.com
      Cc: dhaval.giani@gmail.com
      Cc: rostedt@goodmis.org
      Cc: hgu1972@gmail.com
      Cc: oleg@redhat.com
      Cc: fweisbec@gmail.com
      Cc: darren@dvhart.com
      Cc: johan.eker@ericsson.com
      Cc: p.faure@akatech.ch
      Cc: paulmck@linux.vnet.ibm.com
      Cc: raistlin@linux.it
      Cc: claudio@evidence.eu.com
      Cc: insop.song@gmail.com
      Cc: michael@amarulasolutions.com
      Cc: liming.wang@windriver.com
      Cc: fchecconi@gmail.com
      Cc: jkacur@redhat.com
      Cc: tommaso.cucinotta@sssup.it
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Cc: harald.gustafsson@ericsson.com
      Cc: nicola.manica@disi.unitn.it
      Cc: tglx@linutronix.de
      Link: http://lkml.kernel.org/r/20131015103507.GF10651@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      d171cbfc
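      The barrier pairing described in the commit above can be sketched in
      userspace with C11 atomics. This is an analogy, not the scheduler
      code: release and acquire fences stand in for smp_wmb() and smp_rmb(),
      the variables merely borrow the names rto_mask and rto_count, and the
      assumed write order (mask bit first, then the count) is an assumption
      of this sketch.

```c
#include <stdatomic.h>

static atomic_ulong rto_mask;   /* stand-in for the overloaded-CPU mask */
static atomic_int rto_count;    /* stand-in for the overload counter */

/* Writer: publish the mask bit, then make the count visible. */
static void set_overloaded(int cpu)
{
    atomic_fetch_or_explicit(&rto_mask, 1UL << cpu, memory_order_relaxed);
    /* Plays the role of smp_wmb(): mask store ordered before count store. */
    atomic_thread_fence(memory_order_release);
    atomic_fetch_add_explicit(&rto_count, 1, memory_order_relaxed);
}

/* Reader: test the count, then read the mask behind a matching fence. */
static unsigned long read_overloaded_mask(void)
{
    if (atomic_load_explicit(&rto_count, memory_order_relaxed) == 0)
        return 0;
    /* Plays the role of smp_rmb(): count load ordered before mask load. */
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&rto_mask, memory_order_relaxed);
}
```

      Without the reader-side fence, a reader could observe a non-zero count
      and still load a stale mask, which is the missed rto_mask bit the
      commit message refers to.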
    • sched: Assign correct scheduling domain to 'sd_llc' · 4edad9c1
      Mel Gorman authored
      commit 5d4cf996 upstream.
      
      Commit 42eb088e (sched: Avoid NULL dereference on sd_busy) corrected a NULL
      dereference on sd_busy but the fix also altered what scheduling domain it
      used for the 'sd_llc' percpu variable.
      
      One impact of this is that a task selecting a runqueue may consider
      idle CPUs that are not cache siblings as candidates for running.
      Tasks are then running on CPUs that are not cache hot.
      
      This was found through bisection where ebizzy threads were not seeing equal
      performance and it looked like a scheduling fairness issue. This patch
      mitigates but does not completely fix the problem on all machines tested
      implying there may be an additional bug or a common root cause. Here are
      the average range of performance seen by individual ebizzy threads. It
      was tested on top of candidate patches related to x86 TLB range flushing.
      
      	4-core machine
      			    3.13.0-rc3            3.13.0-rc3
      			       vanilla            fixsd-v3r3
      	Mean   1        0.00 (  0.00%)        0.00 (  0.00%)
      	Mean   2        0.34 (  0.00%)        0.10 ( 70.59%)
      	Mean   3        1.29 (  0.00%)        0.93 ( 27.91%)
      	Mean   4        7.08 (  0.00%)        0.77 ( 89.12%)
      	Mean   5      193.54 (  0.00%)        2.14 ( 98.89%)
      	Mean   6      151.12 (  0.00%)        2.06 ( 98.64%)
      	Mean   7      115.38 (  0.00%)        2.04 ( 98.23%)
      	Mean   8      108.65 (  0.00%)        1.92 ( 98.23%)
      
      	8-core machine
      	Mean   1         0.00 (  0.00%)        0.00 (  0.00%)
      	Mean   2         0.40 (  0.00%)        0.21 ( 47.50%)
      	Mean   3        23.73 (  0.00%)        0.89 ( 96.25%)
      	Mean   4        12.79 (  0.00%)        1.04 ( 91.87%)
      	Mean   5        13.08 (  0.00%)        2.42 ( 81.50%)
      	Mean   6        23.21 (  0.00%)       69.46 (-199.27%)
      	Mean   7        15.85 (  0.00%)      101.72 (-541.77%)
      	Mean   8       109.37 (  0.00%)       19.13 ( 82.51%)
      	Mean   12      124.84 (  0.00%)       28.62 ( 77.07%)
      	Mean   16      113.50 (  0.00%)       24.16 ( 78.71%)
      
      It's eliminated for one machine and reduced for another.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Alex Shi <alex.shi@linaro.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: H Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20131217092124.GV11295@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      4edad9c1
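      The parenthesised percentages in the ebizzy tables above appear to be
      the relative reduction of the per-thread range against the vanilla
      column. A minimal sketch of that arithmetic, with a hypothetical helper
      name:

```c
/* Relative reduction in percent; negative values mean the range grew. */
static double improvement_pct(double before, double after)
{
    return (before - after) / before * 100.0;
}
```

      For example, going from 0.34 to 0.10 gives roughly 70.59%, and going
      from 23.21 to 69.46 gives roughly -199.27%, matching the table rows
      for 2 threads on the 4-core machine and 6 threads on the 8-core
      machine.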
    • sched: Initialize power_orig for overlapping groups · 8dc051a7
      Peter Zijlstra authored
      commit 8e8339a3 upstream.
      
      Yinghai reported that he saw a division by zero in sg_capacity on his
      EX parts.  Make sure to always initialize power_orig now that we
      actually use it.
      
      Ideally build_sched_domains() -> init_sched_groups_power() would also
      initialize this; but for some yet unexplained reason some setups seem
      to miss updates there.
      Reported-by: Yinghai Lu <yinghai@kernel.org>
      Tested-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-l8ng2m9uml6fhibln8wqpom7@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      8dc051a7
    • sched: Avoid NULL dereference on sd_busy · a2198407
      Peter Zijlstra authored
      commit 42eb088e upstream.
      
      Commit 37dc6b50 ("sched: Remove unnecessary iteration over sched
      domains to update nr_busy_cpus") forgot to clear 'sd_busy' under some
      conditions leading to a possible NULL deref in set_cpu_sd_state_idle().
      Reported-by: Anton Blanchard <anton@samba.org>
      Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20131118113701.GF3866@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      a2198407
    • sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus · ec6317cc
      Preeti U Murthy authored
      commit 37dc6b50 upstream.
      
      nr_busy_cpus parameter is used by nohz_kick_needed() to find out the
      number of busy cpus in a sched domain which has SD_SHARE_PKG_RESOURCES
      flag set.  Therefore instead of updating nr_busy_cpus at every level
      of sched domain, since it is irrelevant, we can update this parameter
      only at the parent domain of the sd which has this flag set. Introduce
      a per-cpu parameter sd_busy which represents this parent domain.
      
      In nohz_kick_needed() we directly query the nr_busy_cpus parameter
      associated with the groups of sd_busy.
      
      By associating sd_busy with the highest domain which has
      SD_SHARE_PKG_RESOURCES flag set, we cover all lower level domains
      which could have this flag set and trigger nohz_idle_balancing if any
      of the levels have more than one busy cpu.
      
      sd_busy is irrelevant for asymmetric load balancing. However sd_asym
      has been introduced to represent the highest sched domain which has
      SD_ASYM_PACKING flag set so that it can be queried directly when
      required.
      
      While we are at it, we might as well change the nohz_idle parameter to
      be updated at the sd_busy domain level alone and not the base domain
      level of a CPU.  This will unify the concept of busy cpus at just one
      level of sched domain where it is currently used.
      
      Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: svaidy@linux.vnet.ibm.com
      Cc: vincent.guittot@linaro.org
      Cc: bitbucket@online.de
      Cc: benh@kernel.crashing.org
      Cc: anton@samba.org
      Cc: Morten.Rasmussen@arm.com
      Cc: pjt@google.com
      Cc: peterz@infradead.org
      Cc: mikey@neuling.org
      Link: http://lkml.kernel.org/r/20131030031252.23426.4417.stgit@preeti.in.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      ec6317cc
    • rcu: Throttle rcu_try_advance_all_cbs() execution · 66802dc6
      Paul E. McKenney authored
      commit c229828c upstream.
      
      The rcu_try_advance_all_cbs() function is invoked on each attempted
      entry to and every exit from idle.  If this function determines that
      there are callbacks ready to invoke, the caller will invoke the RCU
      core, which in turn will result in a pair of context switches.  If a
      CPU enters and exits idle extremely frequently, this can result in
      an excessive number of context switches and high CPU overhead.
      
      This commit therefore causes rcu_try_advance_all_cbs() to throttle
      itself, refusing to do work more than once per jiffy.
      Reported-by: Tibor Billes <tbilles@gmx.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Tibor Billes <tbilles@gmx.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      66802dc6
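      The once-per-jiffy throttling described in the commit above follows a
      common pattern: remember the tick at which work was last done and
      refuse to repeat it within the same tick. A minimal sketch, assuming a
      hypothetical struct and helper rather than the fields the RCU code
      actually uses:

```c
#include <stdbool.h>

struct throttle {
    unsigned long last_tick;  /* tick at which work was last allowed */
    bool primed;              /* false until the first call */
};

/* Return true at most once per distinct tick value. */
static bool throttle_allow(struct throttle *t, unsigned long now)
{
    if (t->primed && t->last_tick == now)
        return false;         /* already did the work during this tick */
    t->last_tick = now;
    t->primed = true;
    return true;
}
```

      A caller on the idle entry and exit paths would pass the current tick
      count and skip the expensive work whenever this returns false.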
    • rcu: Throttle invoke_rcu_core() invocations due to non-lazy callbacks · ac631f75
      Paul E. McKenney authored
      commit c337f8f5 upstream.
      
      If a non-lazy callback arrives on a CPU that has previously gone idle
      with no non-lazy callbacks, invoke_rcu_core() forces the RCU core to
      run.  However, it does not update the conditions, which could result
      in several closely spaced invocations of the RCU core, which in turn
      could result in an excessively high context-switch rate and resulting
      high overhead.
      
      This commit therefore updates the ->all_lazy and ->nonlazy_posted_snap
      fields to prevent closely spaced invocations.
      Reported-by: Tibor Billes <tbilles@gmx.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Tibor Billes <tbilles@gmx.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      ac631f75
    • powerpc: Fix fatal SLB miss when restoring PPR · d30b39cb
      Benjamin Herrenschmidt authored
      commit 0c4888ef upstream.
      
      When restoring the PPR value, we incorrectly access the thread structure
      at a time when MSR:RI is clear, which means we cannot recover from nested
      faults. However, the thread structure isn't covered by the "bolted" SLB
      entries and thus accessing it can fault.
      
      This fixes it by splitting the code so that the PPR value is loaded into
      a GPR before MSR:RI is cleared.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      d30b39cb
    • sched: Fix asymmetric scheduling for POWER7 · 82a6be9e
      Vaidyanathan Srinivasan authored
      commit 2042abe7 upstream.
      
      Asymmetric scheduling within a core is a scheduler loadbalancing
      feature that is triggered when SD_ASYM_PACKING flag is set.  The goal
      for the load balancer is to move tasks to lower order idle SMT threads
      within a core on a POWER7 system.
      
      In nohz_kick_needed(), we intend to check if our sched domain (core)
      is completely busy or has an idle cpu.
      
      The following check for SD_ASYM_PACKING:
      
          (cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu)
      
      already covers the case of checking if the domain has an idle cpu,
      because cpumask_first_and() will not yield any set bits if this domain
      has no idle cpu.
      
      Hence, nr_busy check against group weight can be removed.
      Reported-by: Michael Neuling <michael.neuling@au1.ibm.com>
      Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Tested-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: vincent.guittot@linaro.org
      Cc: bitbucket@online.de
      Cc: benh@kernel.crashing.org
      Cc: anton@samba.org
      Cc: Morten.Rasmussen@arm.com
      Cc: pjt@google.com
      Link: http://lkml.kernel.org/r/20131030031242.23426.13019.stgit@preeti.in.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      82a6be9e
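      The reasoning in the commit above, that the first-idle-cpu comparison
      already covers the "no idle cpu" case, can be sketched with plain
      64-bit masks. This is a simplified model, not the kernel's cpumask
      API: first_and() and asym_kick_needed() are hypothetical names, and
      first_and() returns NR_CPUS when no bit is set, mirroring how an empty
      intersection yields an out-of-range index.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 64u

/* Index of the lowest bit set in both masks, or NR_CPUS if none. */
static unsigned int first_and(uint64_t a, uint64_t b)
{
    uint64_t both = a & b;

    for (unsigned int i = 0; i < NR_CPUS; i++)
        if ((both >> i) & 1)
            return i;
    return NR_CPUS;
}

/* True only if the domain has an idle CPU numbered lower than 'cpu'. */
static bool asym_kick_needed(uint64_t idle_mask, uint64_t domain_span,
                             unsigned int cpu)
{
    return first_and(idle_mask, domain_span) < cpu;
}
```

      When the domain has no idle CPU, first_and() returns NR_CPUS, which is
      never smaller than a valid CPU number, so the comparison is false
      without any separate busy-count check.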
    • PCI: Drop warning about drivers that don't use pci_set_master() · 6dc265bd
      Bjorn Helgaas authored
      commit fbeeb822 upstream.
      
      f41f064c ("PCI: Workaround missing pci_set_master in pci drivers") made
      pci_enable_bridge() turn on bus mastering if the driver hadn't done so
      already.  It also added a warning in this case.  But there's no reason to
      warn about it unless it's actually a problem to enable bus mastering here.
      
      This patch drops the warning because I'm not aware of any such problem.
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      CC: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      6dc265bd
    • Thomas Gleixner's avatar
      NOHZ: Check for nohz active instead of nohz enabled · 9321256a
      Thomas Gleixner authored
      commit d689fe22 upstream.
      
      RCU and the fine grained idle time accounting functions check
      tick_nohz_enabled. But that variable is merely telling that NOHZ has
      been enabled in the config and not been disabled on the command line.
      
      But it does not tell anything about nohz being active. That's what all
      this should check for.
      
      Matthew reported that the idle accounting on his old P1 machine
      showed bogus values when he enabled NOHZ in the config and did not
      disable it on the kernel command line. The reason is that his machine
      uses (refined) jiffies as a clocksource, which explains why the "fine"
      grained accounting went into lala land: it depends on when the
      system enters and leaves idle relative to the jiffies increment.
      
      Provide a tick_nohz_active indicator and let RCU and the accounting
      code use this instead of tick_nohz_enabled.
      Reported-and-tested-by: Matthew Whitehead <tedheadster@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: john.stultz@linaro.org
      Cc: mwhitehe@redhat.com
      Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1311132052240.30673@ionos.tec.linutronix.de
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      9321256a
    • Thomas Gleixner's avatar
      nohz: Fix another inconsistency between CONFIG_NO_HZ=n and nohz=off · efca618a
      Thomas Gleixner authored
      commit 0e576acb upstream.
      
      If CONFIG_NO_HZ=n tick_nohz_get_sleep_length() returns NSEC_PER_SEC/HZ.
      
      If CONFIG_NO_HZ=y and the nohz functionality is disabled via the
      command line option "nohz=off" or not enabled due to missing hardware
      support, then tick_nohz_get_sleep_length() returns 0. That happens
      because ts->sleep_length is never set in that case.
      
      Set it to NSEC_PER_SEC/HZ when the NOHZ mode is inactive.
      Reported-by: Michal Hocko <mhocko@suse.cz>
      Reported-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      efca618a
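
The inconsistency fixed by the tick_nohz_get_sleep_length() commit above can be illustrated with a small userspace sketch. It is not the kernel implementation: `struct tick_sched_sketch`, `update_sleep_length()` and the `HZ` value are assumptions made for the example. The only behaviour taken from the commit message is that the sleep length is set to NSEC_PER_SEC/HZ when NOHZ mode is inactive, instead of being left at 0.

```c
#include <assert.h>
#include <stdint.h>

#define HZ 250                    /* assumption: tick rate for this sketch */
#define NSEC_PER_SEC 1000000000LL

/* Hypothetical, heavily reduced stand-in for the per-CPU tick state. */
struct tick_sched_sketch {
	int nohz_active;          /* 0 with "nohz=off" or missing hw support */
	int64_t sleep_length_ns;  /* value a governor would read back */
};

/*
 * When NOHZ is inactive the tick keeps firing, so the longest possible
 * sleep is one tick period. Before the fix this field stayed 0 here.
 */
static void update_sleep_length(struct tick_sched_sketch *ts,
				int64_t next_event_delta_ns)
{
	if (!ts->nohz_active) {
		ts->sleep_length_ns = NSEC_PER_SEC / HZ;
		return;
	}
	ts->sleep_length_ns = next_event_delta_ns;
}

/* Small usage example with made-up deltas. */
void sleep_length_examples(void)
{
	struct tick_sched_sketch ts = { 0, 0 };

	update_sleep_length(&ts, 123);
	assert(ts.sleep_length_ns == NSEC_PER_SEC / HZ);
}
```

With this shape, a caller sees the same value whether NOHZ was compiled out or merely switched off at boot, which is the consistency the commit aims for.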
    • Linus Torvalds's avatar
      futex: move user address verification up to common code · 319a69da
      Linus Torvalds authored
      commit 5cdec2d8 upstream.
      
      When debugging the read-only hugepage case, I was confused by the fact
      that get_futex_key() did an access_ok() only for the non-shared futex
      case, since the user address checking really isn't in any way specific
      to the private key handling.
      
      Now, it turns out that the shared key handling does effectively do the
      equivalent checks inside get_user_pages_fast() (it doesn't actually
      check the address range on x86, but does check the page protections for
      being a user page).  So it wasn't actually a bug, but the fact that we
      treat the address differently for private and shared futexes threw me
      for a loop.
      
      Just move the check up, so that it gets done for both cases.  Also, use
      the 'rw' parameter for the type, even if it doesn't actually matter any
      more (it's a historical artifact of the old racy i386 "page faults from
      kernel space don't check write protections").
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      319a69da
    • James Bates's avatar
      efifb: prevent null-deref when iterating dmi_list · 2a352e27
      James Bates authored
      commit 55aa42f2 upstream.
      
      The dmi_list array is initialized using GNU designated initializers, and
      therefore may contain fewer explicitly defined entries than there are
      elements in it. This is because the enum above with M_xyz constants
      contains more items than the designated initializer. Those elements not
      explicitly initialized are implicitly set to 0.
      
      Now efifb_setup() loops through all these array elements, and performs
      a strcmp on each item. For non explicitly initialized elements this will
      be a null pointer:
      
      This patch swaps the order of the checks in the if statement, so that
      it first checks whether dmi_list[i].base is null.
      Signed-off-by: James Bates <james.h.bates@gmail.com>
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      2a352e27
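
The efifb problem described above is easy to reproduce in a standalone sketch. The table, enum constants and `find_entry()` below are hypothetical and only mirror the shape the commit message describes: an array sized by an enum, filled with designated initializers that skip one index, so the skipped element has a null name pointer and a zero base. Testing `base` first keeps strcmp() from ever seeing the null pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model identifiers; M_UNLISTED has no table entry below. */
enum { M_FIRST, M_UNLISTED, M_THIRD, M_COUNT };

struct dmi_entry_sketch {
	const char *optname;
	unsigned long base;
};

/* Designated initializers: index M_UNLISTED is implicitly zero-filled. */
static const struct dmi_entry_sketch dmi_list_sketch[M_COUNT] = {
	[M_FIRST] = { "first", 0x80000000UL },
	[M_THIRD] = { "third", 0x90000000UL },
};

/* Returns the matching index, or -1 when no entry matches. */
static int find_entry(const char *opt)
{
	int i;

	for (i = 0; i < M_COUNT; i++) {
		/* base is checked first, so a null optname is never compared */
		if (dmi_list_sketch[i].base != 0 &&
		    strcmp(opt, dmi_list_sketch[i].optname) == 0)
			return i;
	}
	return -1;
}

/* Small usage example. */
void find_entry_examples(void)
{
	assert(dmi_list_sketch[M_UNLISTED].optname == NULL);
	assert(find_entry("third") == M_THIRD);
}
```

The fix relies on short-circuit evaluation of `&&`: with the operands in the opposite order, the string comparison would run on the zero-filled element before the base check could reject it.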
    • Jan Kara's avatar
      blktrace: Send BLK_TN_PROCESS events to all running traces · d4ea1c7f
      Jan Kara authored
      commit a404d557 upstream.
      
      Currently each task sends a BLK_TN_PROCESS event to the first traced
      device it interacts with after a new trace is started. When there are
      several traced devices and the task accesses more than one of them,
      this logic can result in BLK_TN_PROCESS being sent several times to
      some devices while it is never sent to others. As a result, blkparse
      doesn't display the command name when parsing some blktrace files.

      Fix the problem by sending a BLK_TN_PROCESS event to all traced devices
      when a task interacts with any of them.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Review-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      d4ea1c7f
    • Huang Rui's avatar
      usb: ohci: use amd_chipset_type to filter for SB800 prefetch · e89b9f7e
      Huang Rui authored
      commit 02c123ee upstream.
      
      Commit "usb: pci-quirks: refactor AMD quirk to abstract AMD chipset types"
      introduced a new AMD chipset type to filter AMD platforms with different
      chipsets.
      
      According to a recent thread [1], this patch updates the SB800 prefetch
      routine in the AMD PLL quirk and makes it use the new chipset type to
      represent the SB800 generation.
      
      [1] http://marc.info/?l=linux-usb&m=138012321616452&w=2
      Signed-off-by: Huang Rui <ray.huang@amd.com>
      Acked-by: Alan Stern <stern@rowland.harvard.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      e89b9f7e
    • Huang Rui's avatar
      usb: ehci: use amd_chipset_type to filter for usb subsystem hang bug · 6a3f0afd
      Huang Rui authored
      commit 3ad145b6 upstream.
      
      Commit "usb: pci-quirks: refactor AMD quirk to abstract AMD chipset types"
      introduced a new AMD chipset type to filter AMD platforms with different
      chipsets.
      
      According to a recent thread [1], this patch updates the quirk for the
      USB subsystem hang symptom, which is observed on all AMD SB600 chipsets
      and on SB700 revisions 0x3a/0x3b, and makes it use the new chipset type
      to represent them.
      
      [1] http://marc.info/?l=linux-usb&m=138012321616452&w=2
      Signed-off-by: Huang Rui <ray.huang@amd.com>
      Acked-by: Alan Stern <stern@rowland.harvard.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      6a3f0afd
    • Huang Rui's avatar
      usb: pci-quirks: refactor AMD quirk to abstract AMD chipset types · 87689625
      Huang Rui authored
      commit 22b4f0cd upstream.
      
      This patch abstracts out an AMD chipset type which includes the
      southbridge generation and its revision. Once the OS executes the
      usb_amd_find_chipset_info routine to initialize the AMD chipset type,
      the driver knows which kind of chipset is in use.

      This update has the following benefits:
      - The driver can determine which southbridge generation and revision
        are in use with a single chipset detection.
      - Describing chipset generations with enumeration types improves
        readability.
      - It becomes easier to filter AMD platforms when implementing new
        quirks in the future.
      Signed-off-by: Huang Rui <ray.huang@amd.com>
      Cc: Andiry Xu <andiry.xu@gmail.com>
      Acked-by: Alan Stern <stern@rowland.harvard.edu>
      Acked-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      87689625
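
The idea behind the chipset-type refactoring can be sketched as follows. All identifiers here (`amd_gen_sketch`, `amd_chipset_sketch`, `needs_usb_hang_quirk()`) are hypothetical and are not the names used in pci-quirks; only the concept of "generation plus revision" and the SB600 / SB700 revision 0x3a/0x3b condition come from the commit messages above.

```c
#include <assert.h>

/* Hypothetical enumeration of southbridge generations. */
enum amd_gen_sketch {
	GEN_NOT_AMD = 0,
	GEN_SB600,
	GEN_SB700,
	GEN_SB800,
	GEN_UNKNOWN,
};

/* Generation and revision, filled in once by a detection routine. */
struct amd_chipset_sketch {
	enum amd_gen_sketch gen;
	unsigned char rev;
};

/*
 * Filter modelled on the EHCI commit text: every SB600, and SB700 with
 * revision 0x3a or 0x3b, is treated as affected by the hang symptom.
 */
static int needs_usb_hang_quirk(const struct amd_chipset_sketch *c)
{
	if (c->gen == GEN_SB600)
		return 1;
	return c->gen == GEN_SB700 && (c->rev == 0x3a || c->rev == 0x3b);
}

/* Small usage example with made-up revisions. */
void chipset_filter_examples(void)
{
	struct amd_chipset_sketch sb700 = { GEN_SB700, 0x3a };
	struct amd_chipset_sketch sb800 = { GEN_SB800, 0x40 };

	assert(needs_usb_hang_quirk(&sb700));
	assert(!needs_usb_hang_quirk(&sb800));
}
```

Keeping detection separate from the per-quirk predicates means a new quirk only needs another small filter over the same structure, which matches the flexibility argument in the commit message.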
    • David Henningsson's avatar
      ALSA: hda - Explicitly keep codec powered up in hdmi_present_sense · d4660f50
      David Henningsson authored
      commit da4a7a39 upstream.
      
      This should help us avoid the following mutex deadlock:
      
      [] mutex_lock+0x2a/0x50
      [] hdmi_present_sense+0x53/0x3a0 [snd_hda_codec_hdmi]
      [] generic_hdmi_resume+0x5a/0x70 [snd_hda_codec_hdmi]
      [] hda_call_codec_resume+0xec/0x1d0 [snd_hda_codec]
      [] snd_hda_power_save+0x1e4/0x280 [snd_hda_codec]
      [] codec_exec_verb+0x5f/0x290 [snd_hda_codec]
      [] snd_hda_codec_read+0x5b/0x90 [snd_hda_codec]
      [] snd_hdmi_get_eld_size+0x1e/0x20 [snd_hda_codec_hdmi]
      [] snd_hdmi_get_eld+0x2c/0xd0 [snd_hda_codec_hdmi]
      [] hdmi_present_sense+0x9a/0x3a0 [snd_hda_codec_hdmi]
      [] hdmi_repoll_eld+0x34/0x50 [snd_hda_codec_hdmi]
      Signed-off-by: David Henningsson <david.henningsson@canonical.com>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      d4660f50
    • Takashi Iwai's avatar
      ALSA: hda - Delay HDMI presence reports while waiting for ELD information · f6baee71
      Takashi Iwai authored
      commit efe47108 upstream.
      
      There is a small gap between the jack detection unsolicited event and
      the time the ELD is updated.  When user-space queries the HDMI ELD
      immediately after receiving the notification, it might fail because of
      this gap.
      
      To avoid such a problem, this patch tries to delay the HDMI jack
      detect notification until ELD information is fully updated.  The
      workaround is imperfect, but good enough as a starting point.
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      f6baee71
    • Takashi Iwai's avatar
      ALSA: hda - Name Haswell HDMI controllers better · 19c7fb34
      Takashi Iwai authored
      commit fab1285a upstream.
      
      "HDA Intel MID" is not the correct name for Haswell HDMI controllers.
      Give them a better name, "HDA Intel HDMI".
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      19c7fb34
    • Clemens Ladisch's avatar
      ALSA: hda: add device IDs for AMD Evergreen/Northern Islands HDMI · af3d652b
      Clemens Ladisch authored
      commit bbaa0d66 upstream.
      
      The device IDs of the AMD Cypress/Juniper/Redwood/Cedar/Cayman/Antilles/
      Barts/Turks/Caicos HDMI HDA controllers weren't added explicitly
      because the generic entry works, but that made the device appear as
      "Generic", which confuses people into thinking it is not a proper HDMI
      controller.  Add them so that the name shows up properly as "ATI HDMI"
      instead of "Generic".
      
      According to Takashi's tests and the lack of complaints, these devices
      work fine without disabling snooping.
      Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      af3d652b