Commits · 26057621c2e797988dcffadfc26addb5f6c74c6f · Kirill Smelkov / linux

13 Sep, 2019 2 commits

UBUNTU: SAUCE: vhost: make sure log_num < in_num · 26057621

yongduan authored Sep 13, 2019

The code assumes log_num < in_num everywhere, and
that is true as long as in_num is incremented by
descriptor iov count, and log_num by 1.
However this breaks if there's a zero sized descriptor.

As a result, if a malicious guest creates a vring desc with desc.len = 0,
it may cause the host kernel to crash by overflowing
the log array. This bug can be triggered during the VM migration.

There's no need to log when desc.len = 0, so just don't increment
log_num in this case.
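
The accounting rule can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the vhost code itself: `struct desc`, `iov_count()`, and `count_log_entries()` are hypothetical names, and address translation is reduced to "a zero-length descriptor produces no iovec entries".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified descriptor: only the length matters here. */
struct desc {
    uint32_t len;
};

/* Stand-in for address translation: a zero-length descriptor maps to
 * zero iovec entries, any other descriptor to exactly one. */
static unsigned int iov_count(const struct desc *d)
{
    return d->len ? 1 : 0;
}

/* Walk a descriptor chain. Returns the number of log entries used and
 * stores the number of iovec entries in *in_num. */
static unsigned int count_log_entries(const struct desc *chain, size_t n,
                                      unsigned int *in_num)
{
    unsigned int log_num = 0;

    *in_num = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned int ret = iov_count(&chain[i]);

        *in_num += ret;
        if (ret)        /* nothing to log for an empty descriptor */
            log_num++;
    }
    return log_num;
}
```

For a chain with lengths 0, 16, 0, 0, 8 both counters end at 2, so log_num can no longer run ahead of in_num.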

Fixes: 3a4d5c94 ("vhost_net: a kernel-level virtio server")
Reviewed-by: Lidong Chen <lidongchen@tencent.com>
Signed-off-by: ruippan <ruippan@tencent.com>
Signed-off-by: yongduan <yongduan@tencent.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>

CVE-2019-14835

(backported from email patch attachment)
[juergh: Adjusted context.]
Signed-off-by: Juerg Haefliger <juergh@canonical.com>

26057621

UBUNTU: Start new release · fc75e4ae
Juerg Haefliger authored Sep 13, 2019
```
Ignore: yes
Signed-off-by: Juerg Haefliger <juergh@canonical.com>
```
fc75e4ae

27 Aug, 2019 5 commits

UBUNTU: Ubuntu-4.4.0-161.189 · d047c452
Stefan Bader authored Aug 27, 2019
```
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
```
d047c452

Revert "UBUNTU: SAUCE: apparmor: flock mediation is not being, enforced on cache check" · ac9535d8

Stefan Bader authored Aug 27, 2019

BugLink: https://bugs.launchpad.net/bugs/1658219

This reverts commit 97ac9e61 as it
is currently causing regressions in snaps which would break networking
for all core16 images.
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

ac9535d8

UBUNTU: link-to-tracker: update tracking bug · 37cb4a3b

Stefan Bader authored Aug 27, 2019

BugLink: https://bugs.launchpad.net/bugs/1841544
Properties: no-test-build
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

37cb4a3b

UBUNTU: Start new release · 56fef70e
Stefan Bader authored Aug 27, 2019
```
Ignore: yes
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
```
56fef70e

UBUNTU: [Packaging] resync getabis · 7e7f1f16

Stefan Bader authored Aug 27, 2019

BugLink: http://bugs.launchpad.net/bugs/1786013
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

7e7f1f16

13 Aug, 2019 26 commits

UBUNTU: Ubuntu-4.4.0-160.188 · 9d2699c7
Connor Kuehl authored Aug 13, 2019
```
Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com>
```
9d2699c7

UBUNTU: link-to-tracker: update tracking bug · 4d298ab5

Connor Kuehl authored Aug 13, 2019

BugLink: https://bugs.launchpad.net/bugs/1840021
Properties: no-test-build
Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com>

4d298ab5

UBUNTU: Start new release · 33405e2a
Connor Kuehl authored Aug 13, 2019
```
Ignore: yes
Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com>
```
33405e2a

UBUNTU: [Packaging] update helper scripts · 44305a2e

Connor Kuehl authored Aug 13, 2019

BugLink: http://bugs.launchpad.net/bugs/1786013
Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com>

44305a2e

platform/x86: asus-wmi: Only Tell EC the OS will handle display hotkeys from asus_nb_wmi · ccb7110c

Hans de Goede authored Jul 25, 2019

BugLink: https://bugs.launchpad.net/bugs/1837117

Commit 78f3ac76 ("platform/x86: asus-wmi: Tell the EC the OS will
handle the display off hotkey") causes the backlight to be permanently off
on various EeePC laptop models using the eeepc-wmi driver (Asus EeePC
1015BX, Asus EeePC 1025C).

The asus_wmi_set_devstate(ASUS_WMI_DEVID_BACKLIGHT, 2, NULL) call added
by that commit is made conditional in this commit and only enabled in
the quirk_entry structs in the asus-nb-wmi driver fixing the broken
display / backlight on various EeePC laptop models.

Cc: João Paulo Rechi Vita <jprvita@endlessm.com>
Fixes: 78f3ac76 ("platform/x86: asus-wmi: Tell the EC the OS will handle the display off hotkey")
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
(backported from commit 1dd93f87)
[PHLin: context adjustment, only add quirks for models existing in X]
Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com>
Acked-by: Connor Kuehl <connor.kuehl@canonical.com>
Acked-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

ccb7110c

inet: switch IP ID generator to siphash · a79a8219

Eric Dumazet authored Aug 02, 2019

CVE-2019-10638

According to Amit Klein and Benny Pinkas, IP ID generation is too weak
and might be used by attackers.

Even with the recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix()),
having a 64-bit key and the Jenkins hash is risky.

It is time to switch to siphash and its 128-bit keys.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reported-by: Benny Pinkas <benny@pinkas.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
(backported from commit df453700)
[ Connor Kuehl: Adjusted patch to communicate the id return value
  through the skbuf as the function signature for ipv6_proxy_select_ident
  is still void (whereas the patch context expects it to return a
  value). This function signature change doesn't happen until upstream
  commit: 0c19f846 "net: accept UFO datagrams from tuntap and packet" ]
Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com>
Acked-by: Kleber Souza <kleber.souza@canonical.com>
Acked-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

a79a8219

siphash: add cryptographically secure PRF · babc199b

Jason A. Donenfeld authored Aug 02, 2019

CVE-2019-10638

SipHash is a 64-bit keyed hash function that is actually a
cryptographically secure PRF, like HMAC. Except SipHash is super fast,
and is meant to be used as a hashtable keyed lookup function, or as a
general PRF for short input use cases, such as sequence numbers or RNG
chaining.

For the first usage:

There are a variety of attacks known as "hashtable poisoning" in which an
attacker forms some data such that the hash of that data will be the
same, and then proceeds to fill up all entries of a hash bucket. This is
a realistic and well-known denial-of-service vector. Currently
hashtables use jhash, which is fast but not secure, and some kind of
rotating key scheme (or none at all, which isn't good). SipHash is meant
as a replacement for jhash in these cases.

There are a modicum of places in the kernel that are vulnerable to
hashtable poisoning attacks, either via userspace vectors or network
vectors, and there's not a reliable mechanism inside the kernel at the
moment to fix it. The first step toward fixing these issues is actually
getting a secure primitive into the kernel for developers to use. Then
we can, bit by bit, port things over to it as deemed appropriate.

While SipHash is extremely fast for a cryptographically secure function,
it is likely a bit slower than the insecure jhash, and so replacements
will be evaluated on a case-by-case basis based on whether or not the
difference in speed is negligible and whether or not the current jhash usage
poses a real security risk.

For the second usage:

A few places in the kernel are using MD5 or SHA1 for creating secure
sequence numbers, syn cookies, port numbers, or fast random numbers.
SipHash is a faster, more fitting, and more secure replacement for MD5
in those situations. Replacing MD5 and SHA1 with SipHash for these uses is
obvious and straight-forward, and so is submitted along with this patch
series. There shouldn't be much of a debate over its efficacy.

Dozens of languages are already using this internally for their hash
tables and PRFs. Some of the BSDs already use this in their kernels.
SipHash is a widely known high-speed solution to a widely known set of
problems, and it's time we catch-up.
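
To make the primitive concrete, here is a compact user-space sketch of SipHash-2-4 (two compression rounds per 8-byte word, four finalization rounds) with a 128-bit key and a 64-bit result. It follows the published algorithm; the function names `siphash24()`, `sipround()`, and `load_le64()` are chosen for this illustration and are not the kernel's API.

```c
#include <stddef.h>
#include <stdint.h>

#define ROTL64(x, b) (((x) << (b)) | ((x) >> (64 - (b))))

/* Read 8 bytes as a little-endian 64-bit word. */
static uint64_t load_le64(const uint8_t *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* One SipRound over the four-word internal state. */
static void sipround(uint64_t v[4])
{
    v[0] += v[1]; v[1] = ROTL64(v[1], 13); v[1] ^= v[0]; v[0] = ROTL64(v[0], 32);
    v[2] += v[3]; v[3] = ROTL64(v[3], 16); v[3] ^= v[2];
    v[0] += v[3]; v[3] = ROTL64(v[3], 21); v[3] ^= v[0];
    v[2] += v[1]; v[1] = ROTL64(v[1], 17); v[1] ^= v[2]; v[2] = ROTL64(v[2], 32);
}

/* SipHash-2-4 of data[0..len) under a 16-byte key. */
static uint64_t siphash24(const uint8_t *data, size_t len, const uint8_t key[16])
{
    uint64_t k0 = load_le64(key), k1 = load_le64(key + 8);
    uint64_t v[4] = {
        k0 ^ 0x736f6d6570736575ULL, k1 ^ 0x646f72616e646f6dULL,
        k0 ^ 0x6c7967656e657261ULL, k1 ^ 0x7465646279746573ULL,
    };
    size_t full = len & ~(size_t)7;     /* bytes covered by whole words */
    uint64_t b = (uint64_t)len << 56;   /* last word carries the length */

    for (size_t i = 0; i < full; i += 8) {
        uint64_t m = load_le64(data + i);

        v[3] ^= m;
        sipround(v); sipround(v);
        v[0] ^= m;
    }
    for (size_t i = full; i < len; i++)
        b |= (uint64_t)data[i] << (8 * (i - full));

    v[3] ^= b;
    sipround(v); sipround(v);
    v[0] ^= b;

    v[2] ^= 0xff;
    sipround(v); sipround(v); sipround(v); sipround(v);
    return v[0] ^ v[1] ^ v[2] ^ v[3];
}
```

With key bytes 00..0f and the 15-byte message 00..0e this should produce 0xa129ca6149be45e5, the test vector given in the SipHash paper. Because the output depends on a secret key, an attacker who cannot read the key cannot precompute inputs that land in the same hash bucket.
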
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(backported from commit 2c956a60)
[ Connor Kuehl: Minor offset adjustments required due to the high
  traffic nature of things like Kconfig and Makefiles. Had to make
  sure the proper siphash entries made it in to both files since
  the patch context that surrounds it is so different. ]
Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com>
Acked-by: Kleber Souza <kleber.souza@canonical.com>
Acked-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

babc199b

UBUNTU: [Config] CONFIG_TEST_HASH=n · 932e344b

Connor Kuehl authored Aug 02, 2019

CVE-2019-10638
Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com>
Acked-by: Kleber Souza <kleber.souza@canonical.com>
Acked-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

932e344b

UBUNTU: SAUCE: apparmor: fix nnp subset check failure, when stacking · 0924738c

John Johansen authored Aug 05, 2019

This is a backport of a fix that landed as part of a larger patch
in 4.17 commit 9fcf78cc ("apparmor: update domain transitions that are subsets of confinement at nnp")

Domain transitions that add a new profile to the confinement stack
when under NO NEW PRIVS are allowed, as they cannot expand privileges.

However such transitions are failing due to how/where the subset
test is being applied. Applying the test per profile in the
profile transition and profile_onexec call backs is incorrect as
it disregards the other profiles in the stack so it can not
correctly determine if the old confinement stack is a subset of
the new confinement stack.

Move the test to after the new confinement stack is constructed.

BugLink: http://bugs.launchpad.net/bugs/1839037
Signed-off-by: John Johansen <john.johansen@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

0924738c

UBUNTU: SAUCE: apparmor: fix audit failures when performing profile transitions · a0e9375b

John Johansen authored Jul 30, 2019

v2. Add fix to profile_transition() also

There are 2 cases where a denial in onexec profile transitions can
occur that results in an apparmor WARN traceback. The first occurs if
onexec is denied by policy, the second if onexec fails due to
no-new-privs.

A similar failure can occur in profile_transition() when directed to
perform a stack, resulting in a similar traceback with handle_onexec()
replaced by profile_transition().

[1140910.816457] ------------[ cut here ]------------
[1140910.816466] WARNING: CPU: 4 PID: 32497 at /build/linux-UdetSb/linux-4.4.0/security/apparmor/file.c:136 aa_audit_file+0x16e/0x180()
[1140910.816467] AppArmor WARN aa_audit_file: ((!(&sa)->apparmor_audit_data->request)):
[1140910.816469] Modules linked in:
[1140910.816470]  xt_mark xt_comment ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs xt_REDIRECT nf_nat_redirect xt_nat veth btrfs xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c msr nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter pci_stub vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) xt_CHECKSUM iptable_mangle rfcomm ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables xt_multiport iptable_filter ip_tables x_tables aufs overlay bnep uvcvideo videobuf2_vmalloc btusb videobuf2_memops videobuf2_v4l2 btrtl btbcm videobuf2_core btintel v4l2_common bluetooth videodev media binfmt_misc arc4
[1140910.816508]  iwlmvm mac80211 intel_rapl snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek snd_hda_codec_generic iwlwifi joydev input_leds serio_raw cfg80211 snd_hda_intel snd_hda_codec snd_hda_core lpc_ich snd_hwdep thinkpad_acpi nvram snd_pcm snd_seq_midi mei_me snd_seq_midi_event shpchp ie31200_edac mei snd_rawmidi edac_core snd_seq wmi snd_seq_device snd_timer snd soundcore kvm_intel mac_hid kvm irqbypass coretemp parport_pc ppdev lp parport autofs4 drbg ansi_cprng algif_skcipher af_alg dm_crypt hid_generic hid_logitech_hidpp hid_logitech_dj usbhid hid uas usb_storage crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel i915 aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd i2c_algo_bit drm_kms_helper psmouse syscopyarea sysfillrect ahci sysimgblt e1000e
[1140910.816544]  fb_sys_fops libahci sdhci_pci drm sdhci ptp pps_core fjes video
[1140910.816549] CPU: 4 PID: 32497 Comm: runc:[2:INIT] Tainted: G        W  OE   4.4.0-151-generic #178-Ubuntu
[1140910.816551] Hardware name: LENOVO 20EFCTO1WW/20EFCTO1WW, BIOS GNET82WW (2.30 ) 03/21/2017
[1140910.816552]  0000000000000286 312c35d8d7e796cb ffff880637cef9d0 ffffffff8140b481
[1140910.816554]  ffff880637cefa18 ffffffff81d02fe8 ffff880637cefa08 ffffffff81085432
[1140910.816555]  ffff880108206400 ffff880637cefb6c ffff880825129b88 ffff880637cefd88
[1140910.816557] Call Trace:
[1140910.816563]  [<ffffffff8140b481>] dump_stack+0x63/0x82
[1140910.816567]  [<ffffffff81085432>] warn_slowpath_common+0x82/0xc0
[1140910.816569]  [<ffffffff810854cc>] warn_slowpath_fmt+0x5c/0x80
[1140910.816571]  [<ffffffff81397ebc>] ? label_match.constprop.9+0x3dc/0x6c0
[1140910.816573]  [<ffffffff813a696e>] aa_audit_file+0x16e/0x180
[1140910.816575]  [<ffffffff813982dd>] profile_onexec+0x13d/0x3d0
[1140910.816577]  [<ffffffff8139a33e>] handle_onexec+0x10e/0x10d0
[1140910.816581]  [<ffffffff81242957>] ? vfs_getxattr_alloc+0x67/0x100
[1140910.816584]  [<ffffffff81355395>] ? cap_inode_getsecurity+0x95/0x220
[1140910.816588]  [<ffffffff8135965d>] ? security_inode_getsecurity+0x5d/0x70
[1140910.816590]  [<ffffffff8139b417>] apparmor_bprm_set_creds+0x117/0xa60
[1140910.816591]  [<ffffffff81242a8e>] ? vfs_getxattr+0x9e/0xb0
[1140910.816595]  [<ffffffffc05be712>] ? ovl_getxattr+0x52/0xb0 [overlay]
[1140910.816597]  [<ffffffff8135619d>] ? get_vfs_caps_from_disk+0x7d/0x180
[1140910.816599]  [<ffffffff81356343>] ? cap_bprm_set_creds+0xa3/0x5f0
[1140910.816601]  [<ffffffff81358909>] security_bprm_set_creds+0x39/0x50
[1140910.816605]  [<ffffffff812229d5>] prepare_binprm+0x85/0x190
[1140910.816607]  [<ffffffff812240f4>] do_execveat_common.isra.31+0x4b4/0x770
[1140910.816610]  [<ffffffff8122460a>] SyS_execve+0x3a/0x50
[1140910.816613]  [<ffffffff81863ed5>] stub_execve+0x5/0x5
[1140910.816615]  [<ffffffff81863b5b>] ? entry_SYSCALL_64_fastpath+0x22/0xcb
[1140910.816616] ---[ end trace cf4320c1d43eedd8 ]---

This is because the error is being audited as if onexec was not denied,
thus triggering the AA_BUG check.

BugLink: http://bugs.launchpad.net/bugs/1838627
Signed-off-by: John Johansen <john.johansen@canonical.com>
Acked-by: Kleber Souza <kleber.souza@canonical.com>
Acked-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

a0e9375b

UBUNTU: SAUCE: apparmor: flock mediation is not being, enforced on cache check · 97ac9e61

John Johansen authored Aug 05, 2019

When an open file with cached permissions is checked for the flock
permission, the cache check fails and falls through to no error instead
of auditing and returning an error.

Force the fall through to do a permission check, so it will audit the
failed flock permission check.

BugLink: https://bugs.launchpad.net/bugs/1838090
BugLink: https://bugs.launchpad.net/bugs/1658219
Signed-off-by: John Johansen <john.johansen@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

97ac9e61

UBUNTU: SAUCE: bcache: fix deadlock in bcache_allocator · b8ee2db9

Andrea Righi authored Aug 06, 2019

bcache_allocator() can call the following:

 bch_allocator_thread()
  -> bch_prio_write()
     -> bch_bucket_alloc()
        -> wait on &ca->set->bucket_wait

But the wake up event on bucket_wait is supposed to come from
bch_allocator_thread() itself => deadlock:

[ 1158.490744] INFO: task bcache_allocato:15861 blocked for more than 10 seconds.
[ 1158.495929]       Not tainted 5.3.0-050300rc3-generic #201908042232
[ 1158.500653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1158.504413] bcache_allocato D    0 15861      2 0x80004000
[ 1158.504419] Call Trace:
[ 1158.504429]  __schedule+0x2a8/0x670
[ 1158.504432]  schedule+0x2d/0x90
[ 1158.504448]  bch_bucket_alloc+0xe5/0x370 [bcache]
[ 1158.504453]  ? wait_woken+0x80/0x80
[ 1158.504466]  bch_prio_write+0x1dc/0x390 [bcache]
[ 1158.504476]  bch_allocator_thread+0x233/0x490 [bcache]
[ 1158.504491]  kthread+0x121/0x140
[ 1158.504503]  ? invalidate_buckets+0x890/0x890 [bcache]
[ 1158.504506]  ? kthread_park+0xb0/0xb0
[ 1158.504510]  ret_from_fork+0x35/0x40

Fix by making the call to bch_prio_write() non-blocking, so that
bch_allocator_thread() never waits on itself.

Moreover, make sure to wake up the garbage collector thread when
bch_prio_write() is failing to allocate buckets.
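
The shape of the fix can be modelled in a few lines of user-space C. This is only a sketch: `struct cache_model`, `prio_write()`, and `allocator_pass()` are hypothetical stand-ins for the bcache structures and functions named above, and the blocking path is deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a cache device's free-bucket state. */
struct cache_model {
    unsigned int free_buckets;  /* buckets available right now */
    bool gc_woken;              /* set when garbage collection is requested */
};

/*
 * Model of a priority write that needs `needed` buckets.
 * With wait == false it never sleeps: it returns -1 when there are not
 * enough free buckets instead of waiting for someone to free more.
 */
static int prio_write(struct cache_model *ca, unsigned int needed, bool wait)
{
    if (ca->free_buckets < needed) {
        if (!wait)
            return -1;
        /* Blocking path elided in this model: a real caller would sleep
         * here until another thread refills the free list. */
        return -1;
    }
    ca->free_buckets -= needed;
    return 0;
}

/* One pass of the allocator thread. It is the thread that refills the
 * free list, so it must not block on that list itself. */
static void allocator_pass(struct cache_model *ca, unsigned int needed)
{
    if (prio_write(ca, needed, false) < 0)
        ca->gc_woken = true;    /* ask GC to free buckets, retry later */
}
```

The design point is that a failed non-blocking attempt is turned into a request for garbage collection rather than a sleep that only the sleeping thread could end.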

BugLink: https://bugs.launchpad.net/bugs/1784665
BugLink: https://bugs.launchpad.net/bugs/1796292
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

b8ee2db9

bcache: Move couple of functions to sysfs.c · 87eaa6bc

Andy Shevchenko authored May 28, 2018

BugLink: https://bugs.launchpad.net/bugs/1784665

There are a couple of functions that are used exclusively in sysfs.c.
Move them there and make them static.

Besides the above, this will allow further cleanup.

No functional change intended.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit ecb37ce9)
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

87eaa6bc

bcache: Reduce the number of sparse complaints about lock imbalances · ccdd507f

Bart Van Assche authored Mar 18, 2018

BugLink: https://bugs.launchpad.net/bugs/1784665

Add more annotations for sparse to inform it about which functions do
not have the same number of spin_lock() and spin_unlock() calls.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 20d3a518)
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

ccdd507f

bcache: Suppress more warnings about set-but-not-used variables · f5fc6cba

Bart Van Assche authored Mar 18, 2018

BugLink: https://bugs.launchpad.net/bugs/1784665

This patch does not change any functionality.
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 42361469)
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

f5fc6cba

bcache: Remove an unused variable · fb2a5a7a

Bart Van Assche authored Mar 18, 2018

BugLink: https://bugs.launchpad.net/bugs/1784665
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit f0d38140)
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

fb2a5a7a

bcache: Fix kernel-doc warnings · 1bf1734f

Bart Van Assche authored Mar 18, 2018

BugLink: https://bugs.launchpad.net/bugs/1784665

Prevent building with W=1 from triggering warnings about the kernel-doc
headers.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 47344e33)
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

1bf1734f

bcache: Annotate switch fall-through · 04763d64

Bart Van Assche authored Mar 18, 2018

BugLink: https://bugs.launchpad.net/bugs/1784665

This patch prevents building with W=1 from triggering complaints about
switch fall-throughs.
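
As an illustration of this kind of annotation, GCC's -Wimplicit-fallthrough check accepts a "fall through" comment placed directly before the next case label. The function below is a hypothetical example, not bcache code.

```c
/* Hypothetical example: level 2 implies everything level 1 does. */
static int count_steps(int level)
{
    int steps = 0;

    switch (level) {
    case 2:
        steps++;
        /* fall through */
    case 1:
        steps++;
        break;
    default:
        break;
    }
    return steps;
}
```

Without the comment the compiler cannot tell an intentional fall-through from a forgotten `break`.
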
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 9dfbdec7)
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

04763d64

bcache: Add __printf annotation to __bch_check_keys() · 56ff47f0

Bart Van Assche authored Mar 18, 2018

BugLink: https://bugs.launchpad.net/bugs/1784665

Make it possible for the compiler to verify the consistency of the
format string passed to __bch_check_keys() and the arguments that
should be formatted according to that format string.
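
The kernel's `__printf(a, b)` macro expands to the GCC/Clang `format(printf, a, b)` function attribute. The sketch below applies the same attribute to a hypothetical user-space helper, `format_msg()`, to show what the compiler is then able to check; the macro name `PRINTF_LIKE` is also made up for this example.

```c
#include <stdarg.h>
#include <stdio.h>

/* Argument fmt_idx is the format string; checking starts at first_arg. */
#define PRINTF_LIKE(fmt_idx, first_arg) \
    __attribute__((format(printf, fmt_idx, first_arg)))

/* Hypothetical helper: formats a message into buf like snprintf(). */
PRINTF_LIKE(3, 4)
static int format_msg(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    return n;
}
```

With the attribute in place, a call such as `format_msg(buf, sizeof(buf), "%s keys", 3)` draws a -Wformat warning at compile time instead of misbehaving at run time.
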
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 4a4e4438)
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

56ff47f0

bcache: Fix indentation · 809c070b

Bart Van Assche authored Mar 18, 2018

BugLink: https://bugs.launchpad.net/bugs/1784665

This patch prevents smatch from complaining about inconsistent indentation.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit fd01991d)
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

809c070b

bcache: fix using of loop variable in memory shrink · 07ca203b

Tang Junhui authored Mar 18, 2018

BugLink: https://bugs.launchpad.net/bugs/1784665

In bch_mca_scan(), there is some confusion and a logical error in the use of
loop variables. In this patch, we clarify them as:
1) nr: the number of btree nodes that need to be scanned, which decreases after
we scan a btree node, and should not be less than 0;
2) i: the number of btree nodes that have been scanned, including both
btree_cache_freeable and btree_cache, which should not be bigger than
btree_cache_used;
3) freed: the number of btree nodes that have been freed.
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit ca71df31)
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

07ca203b

bcache: fix error return value in memory shrink · 1bf905e9

Tang Junhui authored Mar 18, 2018

BugLink: https://bugs.launchpad.net/bugs/1784665

In bch_mca_scan(), the return value should not be the number of freed btree
nodes, but the number of pages of freed btree nodes.
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit f3641c3a)
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

1bf905e9

bcache: fix incorrect sysfs output value of strip size · 80ad663d

Tang Junhui authored Mar 18, 2018

BugLink: https://bugs.launchpad.net/bugs/1784665

Stripe size is shown as zero when the backing device has no stripe:
[root@ceph132 ~]# cat /sys/block/sdd/bcache/stripe_size
0.0k

Actually it should be 1 TiB (1 << 31 sectors), but in the sysfs
interface, stripe_size is converted from sectors to bytes by shifting
it left by 9 bits, so the 32-bit variable overflows.

This patch changes the variable to a 64-bit type before shifting.
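
The arithmetic can be checked in isolation. In this sketch the two helper names are hypothetical; they only contrast doing the sectors-to-bytes shift in 32 bits with widening to 64 bits first.

```c
#include <stdint.h>

/* Shift performed in 32 bits: the high bits are lost before widening. */
static uint64_t sectors_to_bytes_32(uint32_t sectors)
{
    return (uint32_t)(sectors << 9);
}

/* Widen first, then shift: the full byte count is preserved. */
static uint64_t sectors_to_bytes_64(uint32_t sectors)
{
    return (uint64_t)sectors << 9;
}
```

For 1 << 31 sectors the first helper returns 0, matching the "0.0k" shown above, while the second returns 1 << 40 bytes.
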
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 688892b3)
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

80ad663d

bcache: fix high CPU occupancy during journal · 43711a47

Tang Junhui authored Feb 07, 2018

BugLink: https://bugs.launchpad.net/bugs/1784665

After running small write I/O for a long time, we found that CPU occupancy
is very high and I/O performance has been reduced by about half:

[root@ceph151 internal]# top
top - 15:51:05 up 1 day,2:43,  4 users,  load average: 16.89, 15.15, 16.53
Tasks: 2063 total,   4 running, 2059 sleeping,   0 stopped,   0 zombie
%Cpu(s):4.3 us, 17.1 sy 0.0 ni, 66.1 id, 12.0 wa,  0.0 hi,  0.5 si,  0.0 st
KiB Mem : 65450044 total, 24586420 free, 38909008 used,  1954616 buff/cache
KiB Swap: 65667068 total, 65667068 free,        0 used. 25136812 avail Mem

  PID USER PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 2023 root 20  0       0      0      0 S 55.1  0.0   0:04.42 kworker/11:191
14126 root 20  0       0      0      0 S 42.9  0.0   0:08.72 kworker/10:3
 9292 root 20  0       0      0      0 S 30.4  0.0   1:10.99 kworker/6:1
 8553 ceph 20  0 4242492 1.805g  18804 S 30.0  2.9 410:07.04 ceph-osd
12287 root 20  0       0      0      0 S 26.7  0.0   0:28.13 kworker/7:85
31019 root 20  0       0      0      0 S 26.1  0.0   1:30.79 kworker/22:1
 1787 root 20  0       0      0      0 R 25.7  0.0   5:18.45 kworker/8:7
32169 root 20  0       0      0      0 S 14.5  0.0   1:01.92 kworker/23:1
21476 root 20  0       0      0      0 S 13.9  0.0   0:05.09 kworker/1:54
 2204 root 20  0       0      0      0 S 12.5  0.0   1:25.17 kworker/9:10
16994 root 20  0       0      0      0 S 12.2  0.0   0:06.27 kworker/5:106
15714 root 20  0       0      0      0 R 10.9  0.0   0:01.85 kworker/19:2
 9661 ceph 20  0 4246876 1.731g  18800 S 10.6  2.8 403:00.80 ceph-osd
11460 ceph 20  0 4164692 2.206g  18876 S 10.6  3.5 360:27.19 ceph-osd
 9960 root 20  0       0      0      0 S 10.2  0.0   0:02.75 kworker/2:139
11699 ceph 20  0 4169244 1.920g  18920 S 10.2  3.1 355:23.67 ceph-osd
 6843 ceph 20  0 4197632 1.810g  18900 S  9.6  2.9 380:08.30 ceph-osd

The kernel workers consumed a lot of CPU, and I found they were running journal
work. The journal is reclaiming resources and flushing btree nodes with surprising
frequency.

Through further analysis, we found that in btree_flush_write(), we try to
get the btree node with the smallest fifo index to flush by traversing all the
btree nodes in c->bucket_hash. After we get it, since no lock protects
it, this btree node may already have been written to the cache device by other
workers, and if this occurs, we retry the traversal of c->bucket_hash to get
another btree node. When the problem occurred, the retry count was very high,
and we consumed a lot of CPU looking for an appropriate btree node.

In this patch, we record the 128 btree nodes with the smallest fifo index
in a heap, and pop them one by one when we need to flush a btree node. This greatly
reduces the time the loop takes to find an appropriate btree node, and also
reduces CPU occupancy.
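
The selection step can be sketched with a bounded max-heap: keep at most K candidates and evict the largest whenever a smaller index arrives, so one pass over all nodes yields the K smallest. Everything here is an assumption made for illustration: `struct smallest_set`, `offer()`, and a capacity of 4 instead of the 128 used by the patch. Popping the kept entries in order is not shown.

```c
#include <stddef.h>

#define NODES_KEPT 4    /* the patch keeps 128; smaller here for clarity */

/* Max-heap of the smallest indexes seen so far; v[0] is the largest kept. */
struct smallest_set {
    unsigned int v[NODES_KEPT];
    size_t used;
};

static void swap_uint(unsigned int *a, unsigned int *b)
{
    unsigned int t = *a;

    *a = *b;
    *b = t;
}

/* Move v[i] up while it is larger than its parent. */
static void sift_up(struct smallest_set *s, size_t i)
{
    while (i > 0 && s->v[(i - 1) / 2] < s->v[i]) {
        swap_uint(&s->v[(i - 1) / 2], &s->v[i]);
        i = (i - 1) / 2;
    }
}

/* Move v[i] down while a child is larger. */
static void sift_down(struct smallest_set *s, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, big = i;

        if (l < s->used && s->v[l] > s->v[big])
            big = l;
        if (r < s->used && s->v[r] > s->v[big])
            big = r;
        if (big == i)
            return;
        swap_uint(&s->v[i], &s->v[big]);
        i = big;
    }
}

/* Offer one candidate index; it is kept only if it is among the smallest. */
static void offer(struct smallest_set *s, unsigned int idx)
{
    if (s->used < NODES_KEPT) {
        size_t i = s->used++;

        s->v[i] = idx;
        sift_up(s, i);
    } else if (idx < s->v[0]) {
        s->v[0] = idx;      /* evict the largest kept index */
        sift_down(s, 0);
    }
}
```

Each offer costs O(log K), so a single traversal replaces the repeated full scans described above.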

[note by mpl: this triggers a checkpatch error because of adjacent,
pre-existing style violations]
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit c4dc2497)
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

43711a47

bcache: add journal statistic · 84052c3c

Tang Junhui authored Feb 07, 2018

BugLink: https://bugs.launchpad.net/bugs/1784665

Sometimes the journal takes up a lot of CPU, and we need statistics
to know what the journal is doing. So this patch provides
some journal statistics:
1) reclaim: how many times the journal tries to reclaim resources,
   usually because the journal buckets and/or the pins are exhausted.
2) flush_write: how many times the journal tries to flush a btree node
   to the cache device, usually because the journal buckets are exhausted.
3) retry_flush_write: how many times the journal retries flushing
   the next btree node, usually because the previous btree node has been
   flushed by another thread.
We show these statistics through the sysfs interface. With these statistics
we can see the full status of the journal module when CPU usage is too
high.
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit a728eacb)
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

84052c3c

bcache: improve bcache_reboot() · 5e31ee24

Coly Li authored Apr 25, 2019

BugLink: https://bugs.launchpad.net/bugs/1784665

This patch tries to release the mutex bch_register_lock early, to give
the cache set and bcache devices a chance to stop early.

This patch also extends the timeout for stopping all bcache devices from
2 seconds to 10 seconds, because stopping the writeback rate update worker
may be delayed for 5 seconds, so 2 seconds is not enough.

After this patch is applied, timeouts while stopping bcache devices during
system reboot or shutdown are very hard to observe any more.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit eb8cbb6d)
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

5e31ee24

12 Aug, 2019 7 commits

vhost: scsi: add weight support · 19402747

Jason Wang authored May 17, 2019

This patch checks the weight and exits the loop if we exceed the
weight. This is useful for preventing the scsi kthread from hogging the CPU,
which is guest triggerable.

This addresses CVE-2019-3900.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Fixes: 057cbf49 ("tcm_vhost: Initial merge for vhost level target fabric driver")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

CVE-2019-3900

(backported from commit c1ea02f1)
[tyhicks: Backport to Xenial:
 - Minor context adjustment in local variables
 - Adjust context around the loop in vhost_scsi_handle_vq()
 - No need to modify vhost_scsi_ctl_handle_vq() since it was added later
   in commit 0d02dbd6 ("vhost/scsi: Respond to control queue
   operations")]
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Connor Kuehl <connor.kuehl@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

19402747

vhost_net: fix possible infinite loop · 5479b12b

Jason Wang authored May 17, 2019

When the rx buffer is too small for a packet, we will discard the vq
descriptor and retry it for the next packet:

while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
					      &busyloop_intr))) {
...
	/* On overrun, truncate and discard */
	if (unlikely(headcount > UIO_MAXIOV)) {
		iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
		err = sock->ops->recvmsg(sock, &msg,
					 1, MSG_DONTWAIT | MSG_TRUNC);
		pr_debug("Discarded rx packet: len %zd\n", sock_len);
		continue;
	}
...
}

This makes it possible to trigger an infinite while..continue loop
through the cooperation of two VMs, like:

1) Malicious VM1 allocates a 1 byte rx buffer and tries to slow down the
   vhost process as much as possible, e.g. using indirect descriptors or
   other.
2) Malicious VM2 generates packets to VM1 as fast as possible

Fix this by checking against the weight at the end of the RX and TX
loops. This also eliminates other similar cases when:

- userspace is consuming the packets in the meanwhile
- theoretical TOCTOU attack if the guest moves the avail index back and forth
  to hit the continue after vhost finds the guest just added new buffers
This addresses CVE-2019-3900.

Fixes: d8316f39 ("vhost: fix total length when packets are too short")
Fixes: 3a4d5c94 ("vhost_net: a kernel-level virtio server")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

CVE-2019-3900

(backported from commit e2412c07)
[tyhicks: Backport to Xenial:
 - Adjust handle_tx() instead of handle_tx_{copy,zerocopy}() due to
   missing commit 0d20bdf3 ("vhost_net: split out datacopy logic")
 - Minor context adjustments due to the missing iov_limit
   member of the vhost_dev struct, which was added later in commit
   b46a0bf7 ("vhost: fix OOB in get_rx_bufs()")
 - handle_rx() still uses peek_head_len() due to missing and unneeded commit
   03088137 ("vhost_net: basic polling support")
 - Context adjustment in call to vhost_log_write() in hunk #3 of net.c due to
   missing and unneeded commit cc5e7107 ("vhost: log dirty page correctly")
 - Context adjustment in hunk #4 due to using break instead of goto out
 - Context adjustment in hunk #5 due to missing and unneeded commit
   c67df11f ("vhost_net: try batch dequing from skb array")]
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Connor Kuehl <connor.kuehl@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

5479b12b

vhost: introduce vhost_exceeds_weight() · 1801314e

Jason Wang authored May 17, 2019

We used to have vhost_exceeds_weight() for vhost-net to:

- prevent vhost kthread from hogging the cpu
- balance the time spent between TX and RX

This function could be useful for vsock and scsi as well. So move it
to vhost.c. A device must specify a weight, which counts the number of
requests, or it can also specify a byte_weight, which counts the
number of bytes that have been processed.
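
A rough sketch of such a per-device check follows. The struct and function names are hypothetical, and treating a byte_weight of 0 as "no byte limit" is an assumption of this sketch, not something the message states. The real helper in vhost.c operates on the vhost structures.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-device limits. */
struct sketch_dev {
	int weight;		/* max requests per pass (required) */
	int byte_weight;	/* max bytes per pass; 0 = no byte limit (assumed) */
};

/*
 * True when the current pass should stop: either the request count
 * reached the device weight, or a byte limit is set and was reached.
 */
static bool dev_exceeds_weight(const struct sketch_dev *dev,
			       int pkts, int total_len)
{
	if (dev->byte_weight && total_len >= dev->byte_weight)
		return true;

	return pkts >= dev->weight;
}
```

A request-oriented device such as scsi would then set only weight, while a device that also cares about bytes moved would set both fields.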
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

CVE-2019-3900

(backported from commit e82b9b07)
[tyhicks: Backport to Xenial:
 - Adjust handle_tx() instead of handle_tx_{copy,zerocopy}() due to
   missing commit 0d20bdf3 ("vhost_net: split out datacopy logic")
 - Considerable context adjustments throughout the patch due to the
   missing iov_limit member of the vhost_dev struct, which was
   added later in commit b46a0bf7 ("vhost: fix OOB in get_rx_bufs()")
 - Context adjustment in call to vhost_log_write() in hunk #3 of net.c due to
   missing and unneeded commit cc5e7107 ("vhost: log dirty page correctly")
 - Context adjustment in hunk #3 of net.c due to using break instead of goto
   out
 - Context adjustment in hunk #4 of net.c due to missing and unneeded commit
   c67df11f ("vhost_net: try batch dequing from skb array")
 - Don't patch vsock.c since Xenial doesn't have vhost vsock support
 - Adjust context in vhost_dev_init() to account for different local variables
 - Adjust context in struct vhost_dev to account for different struct members]
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Connor Kuehl <connor.kuehl@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

1801314e

vhost_net: introduce vhost_exceeds_weight() · edc26183

Jason Wang authored Jul 20, 2018

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

CVE-2019-3900

(backported from commit 272f35cb)
[tyhicks: Backport to Xenial:
 - Minor context adjustment in net.c due to missing commit b0d0ea50
   ("vhost_net: introduce helper to initialize tx iov iter")
 - Context adjustment in call to vhost_log_write() in hunk #4 due to missing
   and unneeded commit cc5e7107 ("vhost: log dirty page correctly")
 - Context adjustment in hunk #4 due to using break instead of goto out]
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Connor Kuehl <connor.kuehl@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

edc26183

vhost_net: use packet weight for rx handler, too · 440a2294

Paolo Abeni authored Apr 24, 2018

Similar to commit a2ac9990 ("vhost-net: set packet weight of
tx polling to 2 * vq size"), we need a packet-based limit for
handle_rx, too. Otherwise, under rx flood with small packets,
tx can be delayed for a very long time, even without busypolling.

The pkt limit applied to handle_rx must be the same applied by
handle_tx, or we will get unfair scheduling between rx and tx.
Tying such a limit to the queue length makes it less effective for
large queue length values and can introduce large process
scheduler latencies, so a constant value is used, like
the existing bytes limit.

The selected limit has been validated with PVP[1] performance
test with different queue sizes:

queue size		256	512	1024

baseline		366	354	362
weight 128		715	723	670
weight 256		740	745	733
weight 512		600	460	583
weight 1024		423	427	418

A packet weight of 256 gives peak performance in all the
tested scenarios.

No measurable regression in unidirectional performance tests has
been detected.

[1] https://developers.redhat.com/blog/2017/06/05/measuring-and-comparing-open-vswitch-performance/
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

CVE-2019-3900

(backported from commit db688c24)
[tyhicks: Backport to Xenial:
 - Context adjustment in call to mutex_lock_nested() in hunk #3 due to missing
   and unneeded commit aaa3149b ("vhost_net: add missing lock nesting
   notation")
 - Context adjustment in call to vhost_log_write() in hunk #4 due to missing
   and unneeded commit cc5e7107 ("vhost: log dirty page correctly")
 - Context adjustment in hunk #4 due to using break instead of goto out]
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Connor Kuehl <connor.kuehl@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

440a2294

vhost-net: set packet weight of tx polling to 2 * vq size · 054d27bf

haibinzhang(张海斌) authored Apr 09, 2018

handle_tx will delay rx for tens or even hundreds of milliseconds when tx is busy
polling udp packets with small length (e.g. 1 byte udp payload), because
VHOST_NET_WEIGHT takes into account only the bytes sent, not the length of
individual packets.

Ping-Latencies shown below were tested between two Virtual Machines using
netperf (UDP_STREAM, len=1), and then another machine pinged the client:

vq size=256
Packet-Weight   Ping-Latencies(millisecond)
                   min      avg       max
Origin           3.319   18.489    57.303
64               1.643    2.021     2.552
128              1.825    2.600     3.224
256              1.997    2.710     4.295
512              1.860    3.171     4.631
1024             2.002    4.173     9.056
2048             2.257    5.650     9.688
4096             2.093    8.508    15.943

vq size=512
Packet-Weight   Ping-Latencies(millisecond)
                   min      avg       max
Origin           6.537   29.177    66.245
64               2.798    3.614     4.403
128              2.861    3.820     4.775
256              3.008    4.018     4.807
512              3.254    4.523     5.824
1024             3.079    5.335     7.747
2048             3.944    8.201    12.762
4096             4.158   11.057    19.985

Seems pretty consistent, a small dip at 2 VQ sizes.
Ring size is a hint from device about a burst size it can tolerate. Based on
benchmarks, set the weight to 2 * vq size.
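
To illustrate why a byte-only limit is ineffective for tiny packets, here is a small user-space sketch. The function name is hypothetical and the byte weight of 0x80000 is an assumed value for illustration. It counts how many packets one tx pass would send before a limit stops it, with and without a packet weight of 2 * vq size.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed byte budget per pass; illustrative value only. */
#define SKETCH_NET_WEIGHT 0x80000

/*
 * Count packets sent in one pass, assuming the guest always has another
 * packet of payload_len bytes ready. payload_len must be non-zero,
 * otherwise the byte budget alone would never end the pass.
 */
static unsigned long packets_per_pass(unsigned int vq_num, size_t payload_len,
				      bool use_pkt_weight)
{
	unsigned long sent = 0;
	size_t total_len = 0;

	for (;;) {
		total_len += payload_len;
		sent++;

		if (total_len >= SKETCH_NET_WEIGHT)
			break;		/* byte budget used up */
		if (use_pkt_weight && sent >= 2UL * vq_num)
			break;		/* packet budget: 2 * vq size */
	}

	return sent;
}
```

Under these assumptions, a pass with 1 byte payloads and vq size 256 sends 524288 packets when only bytes are counted, but stops after 512 packets once the packet weight applies.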

To evaluate this change, further tests were done using netperf (RR, TX) between
two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz, and vq size was
tweaked through qemu. The results shown below do not show obvious changes.

vq size=256 TCP_RR                vq size=512 TCP_RR
size/sessions/+thu%/+normalize%   size/sessions/+thu%/+normalize%
   1/       1/  -7%/        -2%      1/       1/   0%/        -2%
   1/       4/  +1%/         0%      1/       4/  +1%/         0%
   1/       8/  +1%/        -2%      1/       8/   0%/        +1%
  64/       1/  -6%/         0%     64/       1/  +7%/        +3%
  64/       4/   0%/        +2%     64/       4/  -1%/        +1%
  64/       8/   0%/         0%     64/       8/  -1%/        -2%
 256/       1/  -3%/        -4%    256/       1/  -4%/        -2%
 256/       4/  +3%/        +4%    256/       4/  +1%/        +2%
 256/       8/  +2%/         0%    256/       8/  +1%/        -1%

vq size=256 UDP_RR                vq size=512 UDP_RR
size/sessions/+thu%/+normalize%   size/sessions/+thu%/+normalize%
   1/       1/  -5%/        +1%      1/       1/  -3%/        -2%
   1/       4/  +4%/        +1%      1/       4/  -2%/        +2%
   1/       8/  -1%/        -1%      1/       8/  -1%/         0%
  64/       1/  -2%/        -3%     64/       1/  +1%/        +1%
  64/       4/  -5%/        -1%     64/       4/  +2%/         0%
  64/       8/   0%/        -1%     64/       8/  -2%/        +1%
 256/       1/  +7%/        +1%    256/       1/  -7%/         0%
 256/       4/  +1%/        +1%    256/       4/  -3%/        -4%
 256/       8/  +2%/        +2%    256/       8/  +1%/        +1%

vq size=256 TCP_STREAM            vq size=512 TCP_STREAM
size/sessions/+thu%/+normalize%   size/sessions/+thu%/+normalize%
  64/       1/   0%/        -3%     64/       1/   0%/         0%
  64/       4/  +3%/        -1%     64/       4/  -2%/        +4%
  64/       8/  +9%/        -4%     64/       8/  -1%/        +2%
 256/       1/  +1%/        -4%    256/       1/  +1%/        +1%
 256/       4/  -1%/        -1%    256/       4/  -3%/         0%
 256/       8/  +7%/        +5%    256/       8/  -3%/         0%
 512/       1/  +1%/         0%    512/       1/  -1%/        -1%
 512/       4/  +1%/        -1%    512/       4/   0%/         0%
 512/       8/  +7%/        -5%    512/       8/  +6%/        -1%
1024/       1/   0%/        -1%   1024/       1/   0%/        +1%
1024/       4/  +3%/         0%   1024/       4/  +1%/         0%
1024/       8/  +8%/        +5%   1024/       8/  -1%/         0%
2048/       1/  +2%/        +2%   2048/       1/  -1%/         0%
2048/       4/  +1%/         0%   2048/       4/   0%/        -1%
2048/       8/  -2%/         0%   2048/       8/   5%/        -1%
4096/       1/  -2%/         0%   4096/       1/  -2%/         0%
4096/       4/  +2%/         0%   4096/       4/   0%/         0%
4096/       8/  +9%/        -2%   4096/       8/  -5%/        -1%
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Haibin Zhang <haibinzhang@tencent.com>
Signed-off-by: Yunfang Tai <yunfangtai@tencent.com>
Signed-off-by: Lidong Chen <lidongchen@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

CVE-2019-3900

(cherry picked from commit a2ac9990)
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Connor Kuehl <connor.kuehl@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

054d27bf

vhost_net: do not stall on zerocopy depletion · 1073e969

Willem de Bruijn authored Oct 06, 2017

Vhost-net has a hard limit on the number of zerocopy skbs in flight.
When reached, transmission stalls. Stalls cause latency, as well as
head-of-line blocking of other flows that do not use zerocopy.

Instead of stalling, revert to copy-based transmission.

Tested by sending two udp flows from guest to host, one with payload
of VHOST_GOODCOPY_LEN, the other too small for zerocopy (1B). The
large flow is redirected to a netem instance with 1MBps rate limit
and deep 1000 entry queue.

modprobe ifb
ip link set dev ifb0 up
tc qdisc add dev ifb0 root netem limit 1000 rate 1MBit

tc qdisc add dev tap0 ingress
tc filter add dev tap0 parent ffff: protocol ip \
u32 match ip dport 8000 0xffff \
action mirred egress redirect dev ifb0

Before the delay, both flows process around 80K pps. With the delay,
before this patch, both process around 400. After this patch, the
large flow is still rate limited, while the small reverts to its
original rate. See also discussion in the first link, below.

Without rate limiting, {1, 10, 100}x TCP_STREAM tests continued to
send at 100% zerocopy.

The limit in vhost_exceeds_maxpend must be carefully chosen. With
vq->num >> 1, the flows remain correlated. This value happens to
correspond to VHOST_MAX_PENDING for vq->num == 256. Allow smaller
fractions and ensure correctness also for much smaller values of
vq->num, by testing the min() of both explicitly. See also the
discussion in the second link below.
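
The limit described above might be sketched as follows. This is not the patch itself: the names are hypothetical, the shift is a parameter so that different fractions of vq->num can be tried, SKETCH_MAX_PEND is 128 because the text says the limit equals vq->num >> 1 for vq->num == 256, and the value used for the zerocopy length threshold is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_PEND     128u	/* 256 >> 1, per the text above */
#define SKETCH_GOODCOPY_LEN 256u	/* assumed zerocopy threshold */

static unsigned int sketch_min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * pending: zerocopy buffers submitted but not yet completed.
 * The limit is the smaller of the fixed cap and a fraction of the
 * ring size, so small rings get a proportionally small limit.
 */
static bool sketch_exceeds_maxpend(unsigned int pending, unsigned int vq_num,
				   unsigned int shift)
{
	return pending > sketch_min_uint(SKETCH_MAX_PEND, vq_num >> shift);
}

/*
 * Instead of stalling when the limit is hit, fall back to copying:
 * zerocopy is used only for large enough packets below the limit.
 */
static bool sketch_use_zerocopy(size_t len, unsigned int pending,
				unsigned int vq_num, unsigned int shift)
{
	return len >= SKETCH_GOODCOPY_LEN &&
	       !sketch_exceeds_maxpend(pending, vq_num, shift);
}
```

For a 16 entry ring and a shift of 1, the limit becomes 8 rather than 128, which is the small vq->num case the explicit min() is meant to cover.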

Changes
v1 -> v2
- replaced min with typed min_t
- avoid unnecessary whitespace change

Link: http://lkml.kernel.org/r/CAF=yD-+Wk9sc9dXMUq1+x_hh=3ThTXa6BnZkygP3tgVpjbp93g@mail.gmail.com
Link: http://lkml.kernel.org/r/20170819064129.27272-1-den@klaipeden.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

CVE-2019-3900

(cherry picked from commit 1e6f7453)
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Connor Kuehl <connor.kuehl@canonical.com>
Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>

1073e969