Commits · 4beb31f3657348a8b702dd014d01c520e522012f · Kirill Smelkov / linux

30 Jul, 2013 4 commits

perf: Split the per-cpu accounting part of the event accounting code · 4beb31f3

Frederic Weisbecker authored Jul 23, 2013

This way we can use the per-cpu handling seperately.
This is going to be used by to fix the event migration
code accounting.
Original-patch-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1374539466-4799-5-git-send-email-fweisbec@gmail.comSigned-off-by: Ingo Molnar <mingo@kernel.org>

4beb31f3

perf: Factor out event accounting code to account_event()/__free_event() · 766d6c07

Frederic Weisbecker authored Jul 23, 2013

Gather all the event accounting code to a single place,
once all the prerequisites are completed. This simplifies
the refcounting.
Original-patch-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1374539466-4799-4-git-send-email-fweisbec@gmail.comSigned-off-by: Ingo Molnar <mingo@kernel.org>

766d6c07

perf: Sanitize get_callchain_buffer() · 90983b16

Frederic Weisbecker authored Jul 23, 2013

In case of allocation failure, get_callchain_buffer() keeps the
refcount incremented for the current event.

As a result, when get_callchain_buffers() returns an error,
we must cleanup what it did by cancelling its last refcount
with a call to put_callchain_buffers().

This is a hack in order to be able to call free_event()
after that failure.

The original purpose of that was to simplify the failure
path. But this error handling is actually counter intuitive,
ugly and not very easy to follow because one expect to
see the resources used to perform a service to be cleaned
by the callee if case of failure, not by the caller.

So lets clean this up by cancelling the refcount from
get_callchain_buffer() in case of failure. And correctly free
the event accordingly in perf_event_alloc().
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1374539466-4799-3-git-send-email-fweisbec@gmail.comSigned-off-by: Ingo Molnar <mingo@kernel.org>

90983b16

perf: Fix branch stack refcount leak on callchain init failure · 6050cb0b

Frederic Weisbecker authored Jul 23, 2013

On callchain buffers allocation failure, free_event() is
called and all the accounting performed in perf_event_alloc()
for that event is cancelled.

But if the event has branch stack sampling, it is unaccounted
as well from the branch stack sampling events refcounts.

This is a bug because this accounting is performed after the
callchain buffer allocation. As a result, the branch stack sampling
events refcount can become negative.

To fix this, move the branch stack event accounting before the
callchain buffer allocation.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1374539466-4799-2-git-send-email-fweisbec@gmail.comSigned-off-by: Ingo Molnar <mingo@kernel.org>

6050cb0b

23 Jul, 2013 9 commits

sched: Micro-optimize the smart wake-affine logic · 7d9ffa89

Peter Zijlstra authored Jul 04, 2013

Smart wake-affine is using node-size as the factor currently, but the overhead
of the mask operation is high.

Thus, this patch introduce the 'sd_llc_size' percpu variable, which will record
the highest cache-share domain size, and make it to be the new factor, in order
to reduce the overhead and make it more reasonable.
Tested-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Tested-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Link: http://lkml.kernel.org/r/51D5008E.6030102@linux.vnet.ibm.com
[ Tidied up the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>

7d9ffa89

sched: Implement smarter wake-affine logic · 62470419

Michael Wang authored Jul 04, 2013

The wake-affine scheduler feature is currently always trying to pull
the wakee close to the waker. In theory this should be beneficial if
the waker's CPU caches hot data for the wakee, and it's also beneficial
in the extreme ping-pong high context switch rate case.

Testing shows it can benefit hackbench up to 15%.

However, the feature is somewhat blind, from which some workloads
such as pgbench suffer. It's also time-consuming algorithmically.

Testing shows it can damage pgbench up to 50% - far more than the
benefit it brings in the best case.

So wake-affine should be smarter and it should realize when to
stop its thankless effort at trying to find a suitable CPU to wake on.

This patch introduces 'wakee_flips', which will be increased each
time the task flips (switches) its wakee target.

So a high 'wakee_flips' value means the task has more than one
wakee, and the bigger the number, the higher the wakeup frequency.

Now when making the decision on whether to pull or not, pay attention to
the wakee with a high 'wakee_flips', pulling such a task may benefit
the wakee. Also imply that the waker will face cruel competition later,
it could be very cruel or very fast depends on the story behind
'wakee_flips', waker therefore suffers.

Furthermore, if waker also has a high 'wakee_flips', that implies that
multiple tasks rely on it, then waker's higher latency will damage all
of them, so pulling wakee seems to be a bad deal.

Thus, when 'waker->wakee_flips / wakee->wakee_flips' becomes
higher and higher, the cost of pulling seems to be worse and worse.

The patch therefore helps the wake-affine feature to stop its pulling
work when:

	wakee->wakee_flips > factor &&
	waker->wakee_flips > (factor * wakee->wakee_flips)

The 'factor' here is the number of CPUs in the current CPU's NUMA node,
so a bigger node will lead to more pulling since the trial becomes more
severe.

After applying the patch, pgbench shows up to 40% improvements and no regressions.

Tested with 12 cpu x86 server and tip 3.10.0-rc7.

The percentages in the final column highlight the areas with the biggest wins,
all other areas improved as well:

	pgbench		    base	smart

	| db_size | clients |  tps  |	|  tps  |
	+---------+---------+-------+   +-------+
	| 22 MB   |       1 | 10598 |   | 10796 |
	| 22 MB   |       2 | 21257 |   | 21336 |
	| 22 MB   |       4 | 41386 |   | 41622 |
	| 22 MB   |       8 | 51253 |   | 57932 |
	| 22 MB   |      12 | 48570 |   | 54000 |
	| 22 MB   |      16 | 46748 |   | 55982 | +19.75%
	| 22 MB   |      24 | 44346 |   | 55847 | +25.93%
	| 22 MB   |      32 | 43460 |   | 54614 | +25.66%
	| 7484 MB |       1 |  8951 |   |  9193 |
	| 7484 MB |       2 | 19233 |   | 19240 |
	| 7484 MB |       4 | 37239 |   | 37302 |
	| 7484 MB |       8 | 46087 |   | 50018 |
	| 7484 MB |      12 | 42054 |   | 48763 |
	| 7484 MB |      16 | 40765 |   | 51633 | +26.66%
	| 7484 MB |      24 | 37651 |   | 52377 | +39.11%
	| 7484 MB |      32 | 37056 |   | 51108 | +37.92%
	| 15 GB   |       1 |  8845 |   |  9104 |
	| 15 GB   |       2 | 19094 |   | 19162 |
	| 15 GB   |       4 | 36979 |   | 36983 |
	| 15 GB   |       8 | 46087 |   | 49977 |
	| 15 GB   |      12 | 41901 |   | 48591 |
	| 15 GB   |      16 | 40147 |   | 50651 | +26.16%
	| 15 GB   |      24 | 37250 |   | 52365 | +40.58%
	| 15 GB   |      32 | 36470 |   | 50015 | +37.14%
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51D50057.9000809@linux.vnet.ibm.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>

62470419

sched: Move h_load calculation to task_h_load() · 68520796

Vladimir Davydov authored Jul 15, 2013

The bad thing about update_h_load(), which computes hierarchical load
factor for task groups, is that it is called for each task group in the
system before every load balancer run, and since rebalance can be
triggered very often, this function can eat really a lot of cpu time if
there are many cpu cgroups in the system.

Although the situation was improved significantly by commit a35b6466
('sched, cgroup: Reduce rq->lock hold times for large cgroup
hierarchies'), the problem still can arise under some kinds of loads,
e.g. when cpus are switching from idle to busy and back very frequently.

For instance, when I start 1000 of processes that wake up every
millisecond on my 8 cpus host, 'top' and 'perf top' show:

Cpu(s): 17.8%us, 24.3%sy,  0.0%ni, 57.9%id,  0.0%wa,  0.0%hi,  0.0%si
Events: 243K cycles
  7.57%  [kernel]               [k] __schedule
  7.08%  [kernel]               [k] timerqueue_add
  6.13%  libc-2.12.so           [.] usleep

Then if I create 10000 *idle* cpu cgroups (no processes in them), cpu
usage increases significantly although the 'wakers' are still executing
in the root cpu cgroup:

Cpu(s): 19.1%us, 48.7%sy,  0.0%ni, 31.6%id,  0.0%wa,  0.0%hi,  0.7%si
Events: 230K cycles
 24.56%  [kernel]            [k] tg_load_down
  5.76%  [kernel]            [k] __schedule

This happens because this particular kind of load triggers 'new idle'
rebalance very frequently, which requires calling update_h_load(),
which, in turn, calls tg_load_down() for every *idle* cpu cgroup even
though it is absolutely useless, because idle cpu cgroups have no tasks
to pull.

This patch tries to improve the situation by making h_load calculation
proceed only when h_load is really necessary. To achieve this, it
substitutes update_h_load() with update_cfs_rq_h_load(), which computes
h_load only for a given cfs_rq and all its ascendants, and makes the
load balancer call this function whenever it considers if a task should
be pulled, i.e. it moves h_load calculations directly to task_h_load().
For h_load of the same cfs_rq not to be updated multiple times (in case
several tasks in the same cgroup are considered during the same balance
run), the patch keeps the time of the last h_load update for each cfs_rq
and breaks calculation when it finds h_load to be uptodate.

The benefit of it is that h_load is computed only for those cfs_rq's,
which really need it, in particular all idle task groups are skipped.
Although this, in fact, moves h_load calculation under rq lock, it
should not affect latency much, because the amount of work done under rq
lock while trying to pull tasks is limited by sched_nr_migrate.

After the patch applied with the setup described above (1000 wakers in
the root cgroup and 10000 idle cgroups), I get:

Cpu(s): 16.9%us, 24.8%sy,  0.0%ni, 58.4%id,  0.0%wa,  0.0%hi,  0.0%si
Events: 242K cycles
  7.57%  [kernel]                  [k] __schedule
  6.70%  [kernel]                  [k] timerqueue_add
  5.93%  libc-2.12.so              [.] usleep
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1373896159-1278-1-git-send-email-vdavydov@parallels.comSigned-off-by: Ingo Molnar <mingo@kernel.org>

68520796

perf tools: Add test for converting perf time to/from TSC · 3bd5a5fc

Adrian Hunter authored Jun 28, 2013

The test uses the newly added cap_usr_time_zero and time_zero of
perf_event_mmap_page.  TSC from rdtsc is compared with the time
from 2 perf events.  The test passes if the calculated times are
all in the correct order.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1372425741-1676-4-git-send-email-adrian.hunter@intel.comSigned-off-by: Ingo Molnar <mingo@kernel.org>

3bd5a5fc

perf/x86: Add ability to calculate TSC from perf sample timestamps · c73deb6a

Adrian Hunter authored Jun 28, 2013

For modern CPUs, perf clock is directly related to TSC.  TSC
can be calculated from perf clock and vice versa using a simple
calculation.  Two of the three componenets of that calculation
are already exported in struct perf_event_mmap_page.  This patch
exports the third.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1372425741-1676-3-git-send-email-adrian.hunter@intel.comSigned-off-by: Ingo Molnar <mingo@kernel.org>

c73deb6a

perf: Fix broken union in 'struct perf_event_mmap_page' · 860f085b

Adrian Hunter authored Jun 28, 2013

The capabilities bits must not be "union'ed" together.
Put them in a separate struct.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1372425741-1676-2-git-send-email-adrian.hunter@intel.comSigned-off-by: Ingo Molnar <mingo@kernel.org>

860f085b

perf: Update perf_event_type documentation · a5cdd40c

Peter Zijlstra authored Jul 16, 2013

Due to a discussion with Adrian I had a good look at the perf_event_type record
layout and found the documentation to be somewhat unclear.

Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130716150907.GL23818@dyad.programming.kicks-ass.netSigned-off-by: Ingo Molnar <mingo@kernel.org>

a5cdd40c

kprobes/x86: Call out into INT3 handler directly instead of using notifier · 17f41571

Jiri Kosina authored Jul 23, 2013

In fd4363ff ("x86: Introduce int3 (breakpoint)-based
instruction patching"), the mechanism that was introduced for
notifying alternatives code from int3 exception handler that and
exception occured was die_notifier.

This is however problematic, as early code might be using jump
labels even before the notifier registration has been performed,
which will then lead to an oops due to unhandled exception. One
of such occurences has been encountered by Fengguang:

 int3: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
 Modules linked in:
 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.11.0-rc1-01429-g04bf576 #8
 task: ffff88000da1b040 ti: ffff88000da1c000 task.ti: ffff88000da1c000
 RIP: 0010:[<ffffffff811098cc>]  [<ffffffff811098cc>] ttwu_do_wakeup+0x28/0x225
 RSP: 0000:ffff88000dd03f10  EFLAGS: 00000006
 RAX: 0000000000000000 RBX: ffff88000dd12940 RCX: ffffffff81769c40
 RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000001
 RBP: ffff88000dd03f28 R08: ffffffff8176a8c0 R09: 0000000000000002
 R10: ffffffff810ff484 R11: ffff88000dd129e8 R12: ffff88000dbc90c0
 R13: ffff88000dbc90c0 R14: ffff88000da1dfd8 R15: ffff88000da1dfd8
 FS:  0000000000000000(0000) GS:ffff88000dd00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: 00000000ffffffff CR3: 0000000001c88000 CR4: 00000000000006e0
 Stack:
  ffff88000dd12940 ffff88000dbc90c0 ffff88000da1dfd8 ffff88000dd03f48
  ffffffff81109e2b ffff88000dd12940 0000000000000000 ffff88000dd03f68
  ffffffff81109e9e 0000000000000000 0000000000012940 ffff88000dd03f98
 Call Trace:
  <IRQ>
  [<ffffffff81109e2b>] ttwu_do_activate.constprop.56+0x6d/0x79
  [<ffffffff81109e9e>] sched_ttwu_pending+0x67/0x84
  [<ffffffff8110c845>] scheduler_ipi+0x15a/0x2b0
  [<ffffffff8104dfb4>] smp_reschedule_interrupt+0x38/0x41
  [<ffffffff8173bf5d>] reschedule_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff810ff484>] ? __atomic_notifier_call_chain+0x5/0xc1
  [<ffffffff8105cc30>] ? native_safe_halt+0xd/0x16
  [<ffffffff81015f10>] default_idle+0x147/0x282
  [<ffffffff81017026>] arch_cpu_idle+0x3d/0x5d
  [<ffffffff81127d6a>] cpu_idle_loop+0x46d/0x5db
  [<ffffffff81127f5c>] cpu_startup_entry+0x84/0x84
  [<ffffffff8104f4f8>] start_secondary+0x3c8/0x3d5
  [...]

Fix this by directly calling poke_int3_handler() from the int3
exception handler (analogically to what ftrace has been doing
already), instead of relying on notifier, registration of which
might not have yet been finalized by the time of the first trap.
Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1307231007490.14024@pobox.suse.czSigned-off-by: Ingo Molnar <mingo@kernel.org>

17f41571

Merge tag 'perf-core-for-mingo' of... · 4f16d61f

Ingo Molnar authored Jul 23, 2013

Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core

Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:

  * Fix memcpy benchmark for large sizes, from Andi Kleen.

  * Support callchain sorting based on addresses, from Andi Kleen

  * Move weight back to common sort keys, From Andi Kleen.

  * Fix named threads support in 'perf script', from David Ahern.

  * Handle ENODEV on default cycles event, fix from David Ahern.

  * More install tests, from Jiri Olsa.

  * Fix build with perl 5.18, from Kirill A. Shutemov.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>

4f16d61f

22 Jul, 2013 11 commits

perf tools: Move weight back to common sort keys · f9ea55d0

Andi Kleen authored Jul 18, 2013

This is a partial revert of Namhyung's patch

 afab87b9
 perf sort: Separate out memory-specific sort keys

He wrote

 For global/local weights, I'm not entirely sure to place them into the
 memory dimension.  But it's the only user at this time.

Well TSX is another (in fact the original) user of the flags, and it
needs them to be common. So move local/global weight back to the common
sort keys.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Link: http://lkml.kernel.org/r/1374188333-17899-1-git-send-email-andi@firstfloor.orgSigned-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

f9ea55d0

perf tests: Add broken install-* tests into tests/make · dbad4189

Jiri Olsa authored Jul 22, 2013

Adding install-* tests into tests/make. Those tests are
broken, so commenting them out right away.

* Nothing get installed for install-man, install_doc and
  install_html targets, they just rebuild the documentation.

* I've got following error for 'install-info':

  $ make -f tests/make make_install_info
  - make_install_info: cd . && make -f Makefile DESTDIR=/tmp/tmp.Xi4mb9J1a0 install-info

  $ tail -f make_install_info
  ...
  PERF_VERSION = 3.11.rc1.g9b3c2d
  make[2]: *** No rule to make target `user-manual.xml', needed by `user-manual.texi'.  Stop.
  make[1]: *** [install-info] Error 2

* I've got following error for 'install-pdf':

  $ make -f tests/make make_install_pdf
  - make_install_pdf: cd . && make -f Makefile DESTDIR=/tmp/tmp.fXseECBbt1 install-pdf

  $ tail -f make_install_pdf
  ...
  PERF_VERSION = 3.11.rc1.g9b3c2d
  make[2]: *** No rule to make target `user-manual.xml', needed by `user-manual.pdf'.  Stop.
  make[1]: *** [install-pdf] Error 2
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1374497014-2817-6-git-send-email-jolsa@redhat.comSigned-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

dbad4189

perf tests: Add 'make install/install-bin' tests into tests/make · c0ec1108

Jiri Olsa authored Jul 22, 2013

Adding 'make install' and 'make install-bin' tests into tests/make. It's
run as part of the suite, but could be run separately like:

  $ make -f tests/make make_install
  - make_install: cd . && make -f Makefile DESTDIR=/tmp/tmp.LpkYbk5pfs install
    test: test -x /tmp/tmp.LpkYbk5pfs/bin/perf
  $ make -f tests/make make_install_bin
  - make_install_bin: cd . && make -f Makefile DESTDIR=/tmp/tmp.dMxePBMcFT
    install-bin
    test: test -x /tmp/tmp.dMxePBMcFT/bin/perf
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1374497014-2817-5-git-send-email-jolsa@redhat.comSigned-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

c0ec1108

perf tests: Add DESTDIR=TMP_DEST tests/make variable · c9311674

Jiri Olsa authored Jul 22, 2013

Adding TMP_DEST tests/make variable to provide the DESTDIR directory for
installation tests.

Adding this to existing test targets, since DESTDIR variable 'should
not' affect other than install* targets. We can always separate this if
there's a need for DESTDIR-free build test.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1374497014-2817-4-git-send-email-jolsa@redhat.comSigned-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

c9311674

perf tests: Rename TMP to TMP_O tests/make variable · 8ba7cdea

Jiri Olsa authored Jul 22, 2013

Renaming TMP to TMP_O tests/make variable to make a name space for other
temp variables.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1374497014-2817-3-git-send-email-jolsa@redhat.comSigned-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

8ba7cdea

perf tests: Run ctags/cscope make tests only with needed binaries · 0659e669

Jiri Olsa authored Jul 22, 2013

Running tags and cscope make tests only if the 'ctags' and 'cscope'
binaries are installed, so we don't have false alarm test failures.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1374497014-2817-2-git-send-email-jolsa@redhat.comSigned-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

0659e669

perf tools: Fix build with perl 5.18 · 575bf1d0

Kirill A. Shutemov authored Jun 24, 2013

perl.h from new Perl release doesn't like -Wundef and -Wswitch-default:

/usr/lib/perl5/core_perl/CORE/perl.h:548:5: error: "SILENT_NO_TAINT_SUPPORT" is not defined [-Werror=undef]
 #if SILENT_NO_TAINT_SUPPORT && !defined(NO_TAINT_SUPPORT)
     ^
/usr/lib/perl5/core_perl/CORE/perl.h:556:5: error: "NO_TAINT_SUPPORT" is not defined [-Werror=undef]
 #if NO_TAINT_SUPPORT
     ^
In file included from /usr/lib/perl5/core_perl/CORE/perl.h:3471:0,
                 from util/scripting-engines/trace-event-perl.c:30:
/usr/lib/perl5/core_perl/CORE/sv.h:1455:5: error: "NO_TAINT_SUPPORT" is not defined [-Werror=undef]
 #if NO_TAINT_SUPPORT
     ^
In file included from /usr/lib/perl5/core_perl/CORE/perl.h:3472:0,
                 from util/scripting-engines/trace-event-perl.c:30:
/usr/lib/perl5/core_perl/CORE/regexp.h:436:5: error: "NO_TAINT_SUPPORT" is not defined [-Werror=undef]
 #if NO_TAINT_SUPPORT
     ^
In file included from /usr/lib/perl5/core_perl/CORE/hv.h:592:0,
                 from /usr/lib/perl5/core_perl/CORE/perl.h:3480,
                 from util/scripting-engines/trace-event-perl.c:30:
/usr/lib/perl5/core_perl/CORE/hv_func.h: In function ‘S_perl_hash_siphash_2_4’:
/usr/lib/perl5/core_perl/CORE/hv_func.h:222:3: error: switch missing default case [-Werror=switch-default]
   switch( left )
   ^
/usr/lib/perl5/core_perl/CORE/hv_func.h: In function ‘S_perl_hash_superfast’:
/usr/lib/perl5/core_perl/CORE/hv_func.h:274:5: error: switch missing default case [-Werror=switch-default]
     switch (rem) { \
     ^
/usr/lib/perl5/core_perl/CORE/hv_func.h: In function ‘S_perl_hash_murmur3’:
/usr/lib/perl5/core_perl/CORE/hv_func.h:398:5: error: switch missing default case [-Werror=switch-default]
     switch(bytes_in_carry) { /* how many bytes in carry */
     ^

Let's disable the warnings for code which uses perl.h.
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1372063394-20126-1-git-send-email-kirill@shutemov.nameSigned-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

575bf1d0

perf tools: Support callchain sorting based on addresses · 99571ab3

Andi Kleen authored Jul 18, 2013

With programs with very large functions it can be useful to distinguish
the callgraph nodes on more than just function names. So for example if
you have multiple calls to the same function, it ends up being separate
nodes in the chain.

This patch adds a new key field to the callgraph options, that allows
comparing nodes on functions (as today, default) and addresses.

Longer term it would be nice to also handle src lines, but that would
need more changes and address is a reasonable proxy for it today.

I right now reference the global params, as there was no simple way to
register a params pointer.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-0uskktybf0e7wrnoi5e9b9it@git.kernel.orgSigned-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

99571ab3

perf bench: Fix memcpy benchmark for large sizes · a198996c

Andi Kleen authored Jul 18, 2013

The glibc calloc() function has an optimization to not explicitely
memset() very large calloc allocations that just came from mmap(),
because they are known to be zero.

This could result in the perf memcpy benchmark reading only from
the zero page, which gives unrealistic results.

Always call memset explicitly on the source area to avoid this problem.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Hitoshi Mitake <h.mitake@gmail.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/n/tip-pzz2qrdq9eymxda0y8yxdn33@git.kernel.orgSigned-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

a198996c

perf evsel: Handle ENODEV on default cycles event · 2b821cce

David Ahern authored Jul 18, 2013

Some systems (e.g., VMs on qemu-0.13 with the default vcpu model) report
an unsupported CPU model:

Performance Events: unsupported p6 CPU model 2 no PMU driver, software events only.

Subsequent invocations of perf fail with:

The sys_perf_event_open() syscall returned with 19 (No such device) for event (cycles).
/bin/dmesg may provide additional information.
No CONFIG_PERF_EVENTS=y kernel support configured?

Add ENODEV to the list of errno's to fallback to cpu-clock.
Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1374190079-28507-1-git-send-email-dsahern@gmail.comSigned-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

2b821cce

perf script: Fix named threads support · 2eaa1b40

David Ahern authored Jul 18, 2013

Commit 73994dc1 broke named thread support in perf-script. The thread
struct in al is the main thread for a multithreaded process. The thread
struct used for analysis (e.g., dumping events) should be the specific
thread for the sample.
Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: Feng Tang <feng.tang@intel.com>
Link: http://lkml.kernel.org/r/1374185175-28272-1-git-send-email-dsahern@gmail.comSigned-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

2eaa1b40

19 Jul, 2013 9 commits

kprobes/x86: Remove unused text_poke_smp() and text_poke_smp_batch() functions · ea8596bb

Masami Hiramatsu authored Jul 18, 2013

Since introducing the text_poke_bp() for all text_poke_smp*()
callers, text_poke_smp*() are now unused. This patch basically
reverts:

  3d55cc8a ("x86: Add text_poke_smp for SMP cross modifying code")
  7deb18dc ("x86: Introduce text_poke_smp_batch() for batch-code modifying")

and related commits.

This patch also fixes a Kconfig dependency issue on STOP_MACHINE
in the case of CONFIG_SMP && !CONFIG_MODULE_UNLOAD.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jiri Kosina <jkosina@suse.cz>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Borislav Petkov <bpetkov@suse.de>
Link: http://lkml.kernel.org/r/20130718114753.26675.18714.stgit@mhiramat-M0-7522Signed-off-by: Ingo Molnar <mingo@kernel.org>

ea8596bb

kprobes/x86: Use text_poke_bp() instead of text_poke_smp*() · a7b0133e

Masami Hiramatsu authored Jul 18, 2013

Use text_poke_bp() for optimizing kprobes instead of
text_poke_smp*(). Since the number of kprobes is usually not so
large (<100) and text_poke_bp() is much lighter than
text_poke_smp() [which uses stop_machine()], this just stops
using batch processing.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jiri Kosina <jkosina@suse.cz>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Borislav Petkov <bpetkov@suse.de>
Link: http://lkml.kernel.org/r/20130718114750.26675.9174.stgit@mhiramat-M0-7522Signed-off-by: Ingo Molnar <mingo@kernel.org>

a7b0133e

kprobes/x86: Remove an incorrect comment about int3 in NMI/MCE · c7e85c42

Masami Hiramatsu authored Jul 18, 2013

Remove a comment about an int3 issue in NMI/MCE, since
commit:

  3f3c8b8c ("x86: Add workaround to NMI iret woes")

already fixed that. Keeping this incorrect comment can mislead developers.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jiri Kosina <jkosina@suse.cz>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Borislav Petkov <bpetkov@suse.de>
Link: http://lkml.kernel.org/r/20130718114747.26675.84110.stgit@mhiramat-M0-7522Signed-off-by: Ingo Molnar <mingo@kernel.org>

c7e85c42

Merge branch 'x86/jumplabel' into perf/core · 9bb15425

Ingo Molnar authored Jul 19, 2013

Upcoming kprobes patches rely on the int3 code-patching machinery introduced by:

fd4363ff x86: Introduce int3 (breakpoint)-based instruction patching
Signed-off-by: Ingo Molnar <mingo@kernel.org>

9bb15425

Merge tag 'perf-core-for-mingo' of... · 5a982132

Ingo Molnar authored Jul 19, 2013

Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core

Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:

 * Add missing 'finished_round' event forwarding in 'perf inject', from Adrian Hunter.

 * Assorted tidy ups, from Adrian Hunter.

 * Fall back to sysfs event names when parsing fails, from Andi Kleen.

 * List pmu events in perf list, from Andi Kleen.

 * Cleanup some memory allocation/freeing uses, from David Ahern.

 * Add option to collapse undesired parts of call graph, from Greg Price.

 * Prep work for multi perf data file storage, from Jiri Olsa.

 * Add support for more than two files comparision in 'perf diff', from Jiri Olsa

 * A few more 'perf test' improvements, from Jiri Olsa

 * libtraceevent cleanups, from Namhyung Kim.

 * Remove odd build stall in 'perf sched' by moving a large struct initialization
   from a local variable to a global one, from Namhyung Kim.

 * Add support for callchains in the gtk UI, from Namhyung Kim.

 * Do not apply symfs for an absolute vmlinux path, fix from Namhyung Kim.

 * Use default include path notation for libtraceevent, from Robert Richter.

 * Fix 'make tools/perf', from Robert Richter.

 * Make Power7 events available, from Runzhen Wang.

 * Add --objdump option to 'perf top', from Sukadev Bhattiprolu.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>

5a982132

Merge branch 'linus' into perf/core · e43fff2b

Ingo Molnar authored Jul 19, 2013

Merge in a v3.11-rc1-ish branch to go from v3.10 based development
to a v3.11 based one.
Signed-off-by: Ingo Molnar <mingo@kernel.org>

e43fff2b

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · ecb2cf1a

Linus Torvalds authored Jul 18, 2013

Pull networking fixes from David Miller:
 "A couple interesting SKB fragment handling fixes, plus the usual small
  bits here and there:

   1) Fix 64-bit divide build failure on 32-bit platforms in mlx5, from
      Tim Gardner.

   2) Get rid of a stupid reimplementation on "%*phC" in our sysfs MAC
      address printing helper.

   3) Fix NETIF_F_SG capability advertisement in hyperv driver, if the
      device can't do checksumming offloads then it shouldn't say it can
      do SG either.  From Haiyang Zhang.

   4) bgmac needs to depend on PHYLIB, from Hauke Mehrtens.

   5) Don't leak DMA mappings on mapping failures, from Neil Horman.

   6) We need to reset the transport header of SKBs in ipv4 before we
      attempt to perform early socket demux, just like ipv6 does.  From
      Eric Dumazet.

   7) Add missing locking on vxlan device removal, from Stephen
      Hemminger.

   8) xen-netfront has to make two passes over an SKB to prepare it for
      transfer.  One pass calculates the number of slots needed, the
      second massages the SKB and fills the slots.  Unfortunately, the
      first pass doesn't calculate the number of slots properly so we
      can end up trying to build a MAX_SKB_FRAGS + 1 SKB which doesn't
      work out so well.  Fix from Jan Beulich with help and discussion
      with several others.

   9) Fix a similar problem in tun and macvtap, which have to split up
      scatter-gather elements at PAGE_SIZE boundaries.  Don't do
      zerocopy if it would result in a > MAX_SKB_FRAGS skb.  Fixes from
      Jason Wang.

  10) On receive, once we've decoded the VLAN state completely, clear
      skb->vlan_tci.  Otherwise demuxed tunnels underneath can trigger
      the VLAN code again, corrupting the packet.  Fix from Eric
      Dumazet"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  vlan: fix a race in egress prio management
  vlan: mask vlan prio bits
  macvtap: do not zerocopy if iov needs more pages than MAX_SKB_FRAGS
  tuntap: do not zerocopy if iov needs more pages than MAX_SKB_FRAGS
  pkt_sched: sch_qfq: remove a source of high packet delay/jitter
  xen-netfront: pull on receive skb may need to happen earlier
  vxlan: add necessary locking on device removal
  hyperv: Fix the NETIF_F_SG flag setting in netvsc
  net: Fix sysfs_format_mac() code duplication.
  be2net: Fix to avoid hardware workaround when not needed
  macvtap: do not assume 802.1Q when send vlan packets
  macvtap: fix the missing ret value of TUNSETQUEUE
  ipv4: set transport header earlier
  mlx5 core: Fix __udivdi3 when compiling for 32 bit arches
  bgmac: add dependency to phylib
  net/irda: fixed style issues in irlan_eth
  ethtool: fixed trailing statements in ethtool
  ndisc: bool initializations should use true and false
  atl1e: unmap partially mapped skb on dma error and free skb

ecb2cf1a

Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ee114b97

Linus Torvalds authored Jul 18, 2013

Pull x86 fixes from Peter Anvin:
 "Trying again to get the fixes queue, including the fixed IDT alignment
  patch.

  The UEFI patch is by far the biggest issue at hand: it is currently
  causing quite a few machines to boot.  Which is sad, because the only
  reason they would is because their BIOSes touch memory that has
  already been freed.  The other major issue is that we finally have
  tracked down the root cause of a significant number of machines
  failing to suspend/resume"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Make sure IDT is page aligned
  x86, suspend: Handle CPUs which fail to #GP on RDMSR
  x86/platform/ce4100: Add header file for reboot type
  Revert "UEFI: Don't pass boot services regions to SetVirtualAddressMap()"
  efivars: check for EFI_RUNTIME_SERVICES

ee114b97

Merge tag 'md-3.11-fixes' of git://neil.brown.name/md · 4b8b8a4a

Linus Torvalds authored Jul 18, 2013

Pull md bug fixes from NeilBrown:
 "Sorry boss, back at work now boss.  Here's them nice shiny patches ya
  wanted.  All nicely tagged and justified for -stable and everyfing:

  Three bug fixes for md in 3.10

  3.10 wasn't a good release for md.  The bio changes left a couple of
  bugs, and an md "fix" created another one.

  These three patches appear to fix the issues and have been tagged for
  -stable"

* tag 'md-3.11-fixes' of git://neil.brown.name/md:
  md/raid1: fix bio handling problems in process_checks()
  md: Remove recent change which allows devices to skip recovery.
  md/raid10: fix two problems with RAID10 resync.

4b8b8a4a

18 Jul, 2013 7 commits

Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux · 0a693ab6

Linus Torvalds authored Jul 18, 2013

Pull drm fixes from Dave Airlie:
 "You'll be terribly disappointed in this, I'm not trying to sneak any
  features in or anything, its mostly radeon and intel fixes, a couple
  of ARM driver fixes"

* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (34 commits)
  drm/radeon/dpm: add debugfs support for RS780/RS880 (v3)
  drm/radeon/dpm/atom: fix broken gcc harder
  drm/radeon/dpm/atom: restructure logic to work around a compiler bug
  drm/radeon/dpm: fix atom vram table parsing
  drm/radeon: fix an endian bug in atom table parsing
  drm/radeon: add a module parameter to disable aspm
  drm/rcar-du: Use the GEM PRIME helpers
  drm/shmobile: Use the GEM PRIME helpers
  uvesafb: Really allow mtrr being 0, as documented and warn()ed
  radeon kms: do not flush uninitialized hotplug work
  drm/radeon/dpm/sumo: handle boost states properly when forcing a perf level
  drm/radeon: align VM PTBs (Page Table Blocks) to 32K
  drm/radeon: allow selection of alignment in the sub-allocator
  drm/radeon: never unpin UVD bo v3
  drm/radeon: fix UVD fence emit
  drm/radeon: add fault decode function for CIK
  drm/radeon: add fault decode function for SI (v2)
  drm/radeon: add fault decode function for cayman/TN (v2)
  drm/radeon: use radeon device for request firmware
  drm/radeon: add missing ttm_eu_backoff_reservation to radeon_bo_list_validate
  ...

0a693ab6

vlan: fix a race in egress prio management · 3e3aac49

Eric Dumazet authored Jul 18, 2013

egress_priority_map[] hash table updates are protected by rtnl,
and we never remove elements until device is dismantled.

We have to make sure that before inserting an new element in hash table,
all its fields are committed to memory or else another cpu could
find corrupt values and crash.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

3e3aac49

vlan: mask vlan prio bits · d4b812de

Eric Dumazet authored Jul 18, 2013

In commit 48cc32d3
("vlan: don't deliver frames for unknown vlans to protocols")
Florian made sure we set pkt_type to PACKET_OTHERHOST
if the vlan id is set and we could find a vlan device for this
particular id.

But we also have a problem if prio bits are set.

Steinar reported an issue on a router receiving IPv6 frames with a
vlan tag of 4000 (id 0, prio 2), and tunneled into a sit device,
because skb->vlan_tci is set.

Forwarded frame is completely corrupted : We can see (8100:4000)
being inserted in the middle of IPv6 source address :

16:48:00.780413 IP6 2001:16d8:8100:4000:ee1c:0:9d9:bc87 >
9f94:4d95:2001:67c:29f4::: ICMP6, unknown icmp6 type (0), length 64
       0x0000:  0000 0029 8000 c7c3 7103 0001 a0ae e651
       0x0010:  0000 0000 ccce 0b00 0000 0000 1011 1213
       0x0020:  1415 1617 1819 1a1b 1c1d 1e1f 2021 2223
       0x0030:  2425 2627 2829 2a2b 2c2d 2e2f 3031 3233

It seems we are not really ready to properly cope with this right now.

We can probably do better in future kernels :
vlan_get_ingress_priority() should be a netdev property instead of
a per vlan_dev one.

For stable kernels, lets clear vlan_tci to fix the bugs.
Reported-by: Steinar H. Gunderson <sesse@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d4b812de

macvtap: do not zerocopy if iov needs more pages than MAX_SKB_FRAGS · ece793fc

Jason Wang authored Jul 18, 2013

We try to linearize part of the skb when the number of iov is greater than
MAX_SKB_FRAGS. This is not enough since each single vector may occupy more than
one pages, so zerocopy_sg_fromiovec() may still fail and may break the guest
network.

Solve this problem by calculate the pages needed for iov before trying to do
zerocopy and switch to use copy instead of zerocopy if it needs more than
MAX_SKB_FRAGS.

This is done through introducing a new helper to count the pages for iov, and
call uarg->callback() manually when switching from zerocopy to copy to notify
vhost.

We can do further optimization on top.

This bug were introduced from b92946e2
(macvtap: zerocopy: validate vectors before building skb).

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ece793fc

tuntap: do not zerocopy if iov needs more pages than MAX_SKB_FRAGS · 88529176

Jason Wang authored Jul 18, 2013

We try to linearize part of the skb when the number of iov is greater than
MAX_SKB_FRAGS. This is not enough since each single vector may occupy more than
one pages, so zerocopy_sg_fromiovec() may still fail and may break the guest
network.

Solve this problem by calculate the pages needed for iov before trying to do
zerocopy and switch to use copy instead of zerocopy if it needs more than
MAX_SKB_FRAGS.

This is done through introducing a new helper to count the pages for iov, and
call uarg->callback() manually when switching from zerocopy to copy to notify
vhost.

We can do further optimization on top.

The bug were introduced from commit 0690899b
(tun: experimental zero copy tx support)

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

88529176

pkt_sched: sch_qfq: remove a source of high packet delay/jitter · 87f40dd6

Paolo Valente authored Jul 16, 2013

QFQ+ inherits from QFQ a design choice that may cause a high packet
delay/jitter and a severe short-term unfairness. As QFQ, QFQ+ uses a
special quantity, the system virtual time, to track the service
provided by the ideal system it approximates. When a packet is
dequeued, this quantity must be incremented by the size of the packet,
divided by the sum of the weights of the aggregates waiting to be
served. Tracking this sum correctly is a non-trivial task, because, to
preserve tight service guarantees, the decrement of this sum must be
delayed in a special way [1]: this sum can be decremented only after
that its value would decrease also in the ideal system approximated by
QFQ+. For efficiency, QFQ+ keeps track only of the 'instantaneous'
weight sum, increased and decreased immediately as the weight of an
aggregate changes, and as an aggregate is created or destroyed (which,
in its turn, happens as a consequence of some class being
created/destroyed/changed). However, to avoid the problems caused to
service guarantees by these immediate decreases, QFQ+ increments the
system virtual time using the maximum value allowed for the weight
sum, 2^10, in place of the dynamic, instantaneous value. The
instantaneous value of the weight sum is used only to check whether a
request of weight increase or a class creation can be satisfied.

Unfortunately, the problems caused by this choice are worse than the
temporary degradation of the service guarantees that may occur, when a
class is changed or destroyed, if the instantaneous value of the
weight sum was used to update the system virtual time. In fact, the
fraction of the link bandwidth guaranteed by QFQ+ to each aggregate is
equal to the ratio between the weight of the aggregate and the sum of
the weights of the competing aggregates. The packet delay guaranteed
to the aggregate is instead inversely proportional to the guaranteed
bandwidth. By using the maximum possible value, and not the actual
value of the weight sum, QFQ+ provides each aggregate with the worst
possible service guarantees, and not with service guarantees related
to the actual set of competing aggregates. To see the consequences of
this fact, consider the following simple example.

Suppose that only the following aggregates are backlogged, i.e., that
only the classes in the following aggregates have packets to transmit:
one aggregate with weight 10, say A, and ten aggregates with weight 1,
say B1, B2, ..., B10. In particular, suppose that these aggregates are
always backlogged. Given the weight distribution, the smoothest and
fairest service order would be:
A B1 A B2 A B3 A B4 A B5 A B6 A B7 A B8 A B9 A B10 A B1 A B2 ...

QFQ+ would provide exactly this optimal service if it used the actual
value for the weight sum instead of the maximum possible value, i.e.,
11 instead of 2^10. In contrast, since QFQ+ uses the latter value, it
serves aggregates as follows (easy to prove and to reproduce
experimentally):
A B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 A A A A A A A A A A B1 B2 ... B10 A A ...

By replacing 10 with N in the above example, and by increasing N, one
can increase at will the maximum packet delay and the jitter
experienced by the classes in aggregate A.

This patch addresses this issue by just using the above
'instantaneous' value of the weight sum, instead of the maximum
possible value, when updating the system virtual time. After the
instantaneous weight sum is decreased, QFQ+ may deviate from the ideal
service for a time interval in the order of the time to serve one
maximum-size packet for each backlogged class. The worst-case extent
of the deviation exhibited by QFQ+ during this time interval [1] is
basically the same as of the deviation described above (but, without
this patch, QFQ+ suffers from such a deviation all the time). Finally,
this patch modifies the comment to the function qfq_slot_insert, to
make it coherent with the fact that the weight sum used by QFQ+ can
now be lower than the maximum possible value.

[1] P. Valente, "Extending WF2Q+ to support a dynamic traffic mix",
Proceedings of AAA-IDEA'05, June 2005.
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: David S. Miller <davem@davemloft.net>

87f40dd6

Merge tag 'driver-core-3.11-rc2' of... · 7a62711a

Linus Torvalds authored Jul 18, 2013

Merge tag 'driver-core-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core patches from Greg KH:
 "Here are some driver core patches for 3.11-rc2.  They aren't really
  bugfixes, but a bunch of new helper macros for drivers to properly
  create attribute groups, which drivers and subsystems need to fix up a
  ton of race issues with incorrectly creating sysfs files (binary and
  normal) after userspace has been told that the device is present.

  Also here is the ability to create binary files as attribute groups,
  to solve that race condition, which was impossible to do before this,
  so that's my fault the drivers were broken.

  The majority of the .c changes is indenting and moving code around a
  bit.  It affects no existing code, but allows the large backlog of 70+
  patches that I already have created to start flowing into the
  different subtrees, instead of having to live in my driver-core tree,
  causing merge nightmares in linux-next for the next few months.

  These were finalized too late for the -rc1 merge window, which is why
  they were didn't make that pull request, testing and review from
  others didn't happen until a few weeks ago, and then there's the whole
  distraction of the past few days, which prevented these from getting
  to you sooner, sorry about that.

  Oh, and there's a bugfix for the documentation build warning in here
  as well.  All of these have been in linux-next this week, with no
  reported problems"

* tag 'driver-core-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  driver-core: fix new kernel-doc warning in base/platform.c
  sysfs: use file mode defines from stat.h
  sysfs: add more helper macro's for (bin_)attribute(_groups)
  driver core: add default groups to struct class
  driver core: Introduce device_create_groups
  sysfs: prevent warning when only using binary attributes
  sysfs: add support for binary attributes in groups
  driver core: device.h: add RW and RO attribute macros
  sysfs.h: add BIN_ATTR macro
  sysfs.h: add ATTRIBUTE_GROUPS() macro
  sysfs.h: add __ATTR_RW() macro

7a62711a