- 13 Aug, 2019 21 commits
-
-
Eric Dumazet authored
CVE-2019-10638 According to Amit Klein and Benny Pinkas, IP ID generation is too weak and might be used by attackers. Even with the recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix()), having a 64bit key and Jenkins hash is risky. It is time to switch to siphash and its 128bit keys. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Amit Klein <aksecurity@gmail.com> Reported-by: Benny Pinkas <benny@pinkas.net> Signed-off-by: David S. Miller <davem@davemloft.net> (backported from commit df453700) [ Connor Kuehl: Adjusted patch to communicate the id return value through the skb as the function signature for ipv6_proxy_select_ident is still void (whereas the patch context expects it to return a value). This function signature change doesn't happen until upstream commit: 0c19f846 "net: accept UFO datagrams from tuntap and packet" ] Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Acked-by: Kleber Souza <kleber.souza@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Jason A. Donenfeld authored
CVE-2019-10638 SipHash is a 64-bit keyed hash function that is actually a cryptographically secure PRF, like HMAC. Except SipHash is super fast, and is meant to be used as a hashtable keyed lookup function, or as a general PRF for short input use cases, such as sequence numbers or RNG chaining. For the first usage: There are a variety of attacks known as "hashtable poisoning" in which an attacker forms some data such that the hash of that data will be the same, and then proceeds to fill up all entries of a hashbucket. This is a realistic and well-known denial-of-service vector. Currently hashtables use jhash, which is fast but not secure, and some kind of rotating key scheme (or none at all, which isn't good). SipHash is meant as a replacement for jhash in these cases. There are a number of places in the kernel that are vulnerable to hashtable poisoning attacks, either via userspace vectors or network vectors, and there's not a reliable mechanism inside the kernel at the moment to fix it. The first step toward fixing these issues is actually getting a secure primitive into the kernel for developers to use. Then we can, bit by bit, port things over to it as deemed appropriate. While SipHash is extremely fast for a cryptographically secure function, it is likely a bit slower than the insecure jhash, and so replacements will be evaluated on a case-by-case basis based on whether or not the difference in speed is negligible and whether or not the current jhash usage poses a real security risk. For the second usage: A few places in the kernel are using MD5 or SHA1 for creating secure sequence numbers, syn cookies, port numbers, or fast random numbers. SipHash is a faster, more fitting, and more secure replacement for MD5 in those situations. Replacing MD5 and SHA1 with SipHash for these uses is obvious and straight-forward, and so is submitted along with this patch series. There shouldn't be much of a debate over its efficacy. Dozens of languages are already using this internally for their hash tables and PRFs. Some of the BSDs already use this in their kernels. SipHash is a widely known high-speed solution to a widely known set of problems, and it's time we catch up. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Eric Biggers <ebiggers3@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> (backported from commit 2c956a60) [ Connor Kuehl: Minor offset adjustments required due to the high traffic nature of things like Kconfig and Makefiles. Had to make sure the proper siphash entries made it in to both files since the patch context that surrounds it is so different. ] Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Acked-by: Kleber Souza <kleber.souza@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
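As a rough illustration of the keyed-hash idea this series introduces, a minimal sketch against the siphash API it adds (siphash_key_t, siphash() and get_random_bytes() are from the patch and linux/random.h; hash_key and bucket_of() are hypothetical names):

    #include <linux/siphash.h>
    #include <linux/random.h>

    static siphash_key_t hash_key __read_mostly;

    static void hash_key_init(void)
    {
            /* 128-bit secret key: an attacker who cannot learn it cannot
             * precompute colliding inputs to poison a single bucket */
            get_random_bytes(&hash_key, sizeof(hash_key));
    }

    static u32 bucket_of(const void *data, size_t len, u32 nbuckets)
    {
            return (u32)siphash(data, len, &hash_key) % nbuckets;
    }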
-
Connor Kuehl authored
CVE-2019-10638 Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Acked-by: Kleber Souza <kleber.souza@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
John Johansen authored
This is a backport of a fix that landed as part of a larger patch in 4.17 commit 9fcf78cc ("apparmor: update domain transitions that are subsets of confinement at nnp") Domain transitions that add a new profile to the confinement stack when under NO NEW PRIVS are allowed, as they cannot expand privileges. However, such transitions are failing due to how/where the subset test is being applied. Applying the test per profile in the profile transition and profile_onexec callbacks is incorrect, as it disregards the other profiles in the stack and so cannot correctly determine whether the old confinement stack is a subset of the new confinement stack. Move the test to after the new confinement stack is constructed. BugLink: http://bugs.launchpad.net/bugs/1839037 Signed-off-by: John Johansen <john.johansen@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
John Johansen authored
v2: Also add the fix to profile_transition(). There are 2 cases where a denial in onexec profile transitions can occur that result in an apparmor WARN traceback. The first occurs if onexec is denied by policy, the second if onexec fails due to no-new-privs. A similar failure can occur in profile_transition() when directed to perform a stack, resulting in a similar traceback with handle_onexec() replaced by profile_transition().

[1140910.816457] ------------[ cut here ]------------
[1140910.816466] WARNING: CPU: 4 PID: 32497 at /build/linux-UdetSb/linux-4.4.0/security/apparmor/file.c:136 aa_audit_file+0x16e/0x180()
[1140910.816467] AppArmor WARN aa_audit_file: ((!(&sa)->apparmor_audit_data->request)):
[1140910.816469] Modules linked in:
[1140910.816470] xt_mark xt_comment ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs xt_REDIRECT nf_nat_redirect xt_nat veth btrfs xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c msr nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter pci_stub vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) xt_CHECKSUM iptable_mangle rfcomm ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables xt_multiport iptable_filter ip_tables x_tables aufs overlay bnep uvcvideo videobuf2_vmalloc btusb videobuf2_memops videobuf2_v4l2 btrtl btbcm videobuf2_core btintel v4l2_common bluetooth videodev media binfmt_misc arc4
[1140910.816508] iwlmvm mac80211 intel_rapl snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek snd_hda_codec_generic iwlwifi joydev input_leds serio_raw cfg80211 snd_hda_intel snd_hda_codec snd_hda_core lpc_ich snd_hwdep thinkpad_acpi nvram snd_pcm snd_seq_midi mei_me snd_seq_midi_event shpchp ie31200_edac mei snd_rawmidi edac_core snd_seq wmi snd_seq_device snd_timer snd soundcore kvm_intel mac_hid kvm irqbypass coretemp parport_pc ppdev lp parport autofs4 drbg ansi_cprng algif_skcipher af_alg dm_crypt hid_generic hid_logitech_hidpp hid_logitech_dj usbhid hid uas usb_storage crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel i915 aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd i2c_algo_bit drm_kms_helper psmouse syscopyarea sysfillrect ahci sysimgblt e1000e
[1140910.816544] fb_sys_fops libahci sdhci_pci drm sdhci ptp pps_core fjes video
[1140910.816549] CPU: 4 PID: 32497 Comm: runc:[2:INIT] Tainted: G W OE 4.4.0-151-generic #178-Ubuntu
[1140910.816551] Hardware name: LENOVO 20EFCTO1WW/20EFCTO1WW, BIOS GNET82WW (2.30 ) 03/21/2017
[1140910.816552] 0000000000000286 312c35d8d7e796cb ffff880637cef9d0 ffffffff8140b481
[1140910.816554] ffff880637cefa18 ffffffff81d02fe8 ffff880637cefa08 ffffffff81085432
[1140910.816555] ffff880108206400 ffff880637cefb6c ffff880825129b88 ffff880637cefd88
[1140910.816557] Call Trace:
[1140910.816563] [<ffffffff8140b481>] dump_stack+0x63/0x82
[1140910.816567] [<ffffffff81085432>] warn_slowpath_common+0x82/0xc0
[1140910.816569] [<ffffffff810854cc>] warn_slowpath_fmt+0x5c/0x80
[1140910.816571] [<ffffffff81397ebc>] ? label_match.constprop.9+0x3dc/0x6c0
[1140910.816573] [<ffffffff813a696e>] aa_audit_file+0x16e/0x180
[1140910.816575] [<ffffffff813982dd>] profile_onexec+0x13d/0x3d0
[1140910.816577] [<ffffffff8139a33e>] handle_onexec+0x10e/0x10d0
[1140910.816581] [<ffffffff81242957>] ? vfs_getxattr_alloc+0x67/0x100
[1140910.816584] [<ffffffff81355395>] ? cap_inode_getsecurity+0x95/0x220
[1140910.816588] [<ffffffff8135965d>] ? security_inode_getsecurity+0x5d/0x70
[1140910.816590] [<ffffffff8139b417>] apparmor_bprm_set_creds+0x117/0xa60
[1140910.816591] [<ffffffff81242a8e>] ? vfs_getxattr+0x9e/0xb0
[1140910.816595] [<ffffffffc05be712>] ? ovl_getxattr+0x52/0xb0 [overlay]
[1140910.816597] [<ffffffff8135619d>] ? get_vfs_caps_from_disk+0x7d/0x180
[1140910.816599] [<ffffffff81356343>] ? cap_bprm_set_creds+0xa3/0x5f0
[1140910.816601] [<ffffffff81358909>] security_bprm_set_creds+0x39/0x50
[1140910.816605] [<ffffffff812229d5>] prepare_binprm+0x85/0x190
[1140910.816607] [<ffffffff812240f4>] do_execveat_common.isra.31+0x4b4/0x770
[1140910.816610] [<ffffffff8122460a>] SyS_execve+0x3a/0x50
[1140910.816613] [<ffffffff81863ed5>] stub_execve+0x5/0x5
[1140910.816615] [<ffffffff81863b5b>] ? entry_SYSCALL_64_fastpath+0x22/0xcb
[1140910.816616] ---[ end trace cf4320c1d43eedd8 ]---

This is because the error is being audited as if onexec was not denied, thus triggering the AA_BUG check. BugLink: http://bugs.launchpad.net/bugs/1838627 Signed-off-by: John Johansen <john.johansen@canonical.com> Acked-by: Kleber Souza <kleber.souza@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
John Johansen authored
When an open file with cached permissions is checked for the flock permission, the cache check fails and falls through to returning no error instead of auditing and returning an error. Fix the fall-through to do a permission check, so that it audits the failed flock permission check. BugLink: https://bugs.launchpad.net/bugs/1838090 BugLink: https://bugs.launchpad.net/bugs/1658219 Signed-off-by: John Johansen <john.johansen@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Andrea Righi authored
bcache_allocator() can call the following:

  bch_allocator_thread()
   -> bch_prio_write()
      -> bch_bucket_alloc()
         -> wait on &ca->set->bucket_wait

But the wake up event on bucket_wait is supposed to come from bch_allocator_thread() itself => deadlock:

[ 1158.490744] INFO: task bcache_allocato:15861 blocked for more than 10 seconds.
[ 1158.495929] Not tainted 5.3.0-050300rc3-generic #201908042232
[ 1158.500653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1158.504413] bcache_allocato D 0 15861 2 0x80004000
[ 1158.504419] Call Trace:
[ 1158.504429] __schedule+0x2a8/0x670
[ 1158.504432] schedule+0x2d/0x90
[ 1158.504448] bch_bucket_alloc+0xe5/0x370 [bcache]
[ 1158.504453] ? wait_woken+0x80/0x80
[ 1158.504466] bch_prio_write+0x1dc/0x390 [bcache]
[ 1158.504476] bch_allocator_thread+0x233/0x490 [bcache]
[ 1158.504491] kthread+0x121/0x140
[ 1158.504503] ? invalidate_buckets+0x890/0x890 [bcache]
[ 1158.504506] ? kthread_park+0xb0/0xb0
[ 1158.504510] ret_from_fork+0x35/0x40

Fix by making the call to bch_prio_write() non-blocking, so that bch_allocator_thread() never waits on itself. Moreover, make sure to wake up the garbage collector thread when bch_prio_write() fails to allocate buckets. BugLink: https://bugs.launchpad.net/bugs/1784665 BugLink: https://bugs.launchpad.net/bugs/1796292 Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Andy Shevchenko authored
BugLink: https://bugs.launchpad.net/bugs/1784665 There are a couple of functions that are used exclusively in sysfs.c. Move them there and make them static. Besides the above, this will allow further cleanup. No functional change intended. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit ecb37ce9) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 Add more annotations for sparse to inform it about which functions do not have the same number of spin_lock() and spin_unlock() calls. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 20d3a518) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 This patch does not change any functionality. Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 42361469) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit f0d38140) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 Avoid warnings about the kernel-doc headers when building with W=1. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 47344e33) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 This patch avoids complaints about switch fall-throughs when building with W=1. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 9dfbdec7) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 Make it possible for the compiler to verify the consistency of the format string passed to __bch_check_keys() and the arguments that should be formatted according to that format string. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 4a4e4438) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 This patch avoids smatch complaints about inconsistent indentation. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit fd01991d) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Tang Junhui authored
BugLink: https://bugs.launchpad.net/bugs/1784665 In bch_mca_scan(), there is some confusion and a logical error in the use of loop variables. In this patch, we clarify them as: 1) nr: the number of btree nodes that still need to be scanned, which decreases after we scan a btree node, and should not be less than 0; 2) i: the number of btree nodes we have scanned, including both btree_cache_freeable and btree_cache, which should not be bigger than btree_cache_used; 3) freed: the number of btree nodes we have freed. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit ca71df31) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Tang Junhui authored
BugLink: https://bugs.launchpad.net/bugs/1784665 In bch_mca_scan(), the return value should not be the number of freed btree nodes, but the number of pages of freed btree nodes. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit f3641c3a) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
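A minimal sketch of what the corrected tail of bch_mca_scan() conveys (assuming bcache's c->btree_pages field; the shrinker contract is in units of pages, so the freed node count is scaled before being returned):

    /* report freed pages, not freed nodes, to the shrinker core */
    return freed * c->btree_pages;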
-
Tang Junhui authored
BugLink: https://bugs.launchpad.net/bugs/1784665 Stripe size is shown as zero when there is no stripe in the backend device: [root@ceph132 ~]# cat /sys/block/sdd/bcache/stripe_size 0.0k Actually it should be 1T bytes (1 << 31 sectors), but in the sysfs interface, stripe_size was changed from sectors to bytes and moved 9 bits left, so the 32-bit variable overflows. This patch changes the variable to a 64-bit type before shifting the bits. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 688892b3) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
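A sketch of the corrected sysfs read path (sysfs_hprint() is bcache's human-readable print helper; the exact line in this tree may differ):

    /* widen before the sectors-to-bytes shift: a 32-bit value of
     * 1 << 31 sectors shifted left by 9 would otherwise wrap */
    sysfs_hprint(stripe_size, ((uint64_t)dc->disk.stripe_size) << 9);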
-
Tang Junhui authored
BugLink: https://bugs.launchpad.net/bugs/1784665 After running small-write I/O for a long time, we found that CPU occupancy was very high and I/O performance had been reduced by about half:

[root@ceph151 internal]# top
top - 15:51:05 up 1 day, 2:43, 4 users, load average: 16.89, 15.15, 16.53
Tasks: 2063 total, 4 running, 2059 sleeping, 0 stopped, 0 zombie
%Cpu(s): 4.3 us, 17.1 sy, 0.0 ni, 66.1 id, 12.0 wa, 0.0 hi, 0.5 si, 0.0 st
KiB Mem : 65450044 total, 24586420 free, 38909008 used, 1954616 buff/cache
KiB Swap: 65667068 total, 65667068 free, 0 used. 25136812 avail Mem

  PID USER PR NI    VIRT    RES   SHR S %CPU %MEM     TIME+ COMMAND
 2023 root 20  0       0      0     0 S 55.1  0.0   0:04.42 kworker/11:191
14126 root 20  0       0      0     0 S 42.9  0.0   0:08.72 kworker/10:3
 9292 root 20  0       0      0     0 S 30.4  0.0   1:10.99 kworker/6:1
 8553 ceph 20  0 4242492 1.805g 18804 S 30.0  2.9 410:07.04 ceph-osd
12287 root 20  0       0      0     0 S 26.7  0.0   0:28.13 kworker/7:85
31019 root 20  0       0      0     0 S 26.1  0.0   1:30.79 kworker/22:1
 1787 root 20  0       0      0     0 R 25.7  0.0   5:18.45 kworker/8:7
32169 root 20  0       0      0     0 S 14.5  0.0   1:01.92 kworker/23:1
21476 root 20  0       0      0     0 S 13.9  0.0   0:05.09 kworker/1:54
 2204 root 20  0       0      0     0 S 12.5  0.0   1:25.17 kworker/9:10
16994 root 20  0       0      0     0 S 12.2  0.0   0:06.27 kworker/5:106
15714 root 20  0       0      0     0 R 10.9  0.0   0:01.85 kworker/19:2
 9661 ceph 20  0 4246876 1.731g 18800 S 10.6  2.8 403:00.80 ceph-osd
11460 ceph 20  0 4164692 2.206g 18876 S 10.6  3.5 360:27.19 ceph-osd
 9960 root 20  0       0      0     0 S 10.2  0.0   0:02.75 kworker/2:139
11699 ceph 20  0 4169244 1.920g 18920 S 10.2  3.1 355:23.67 ceph-osd
 6843 ceph 20  0 4197632 1.810g 18900 S  9.6  2.9 380:08.30 ceph-osd

The kernel workers consumed a lot of CPU, and I found they were running journal work. The journal was reclaiming resources and flushing btree nodes with surprising frequency. Through further analysis, we found that in btree_flush_write() we try to get the btree node with the smallest fifo index to flush by traversing all the btree nodes in c->bucket_hash. After getting it, since no lock protects it, this btree node may have been written to the cache device by other work, and if this occurred, we retry the traversal of c->bucket_hash and get another btree node. When the problem occurred, the retry count was very high, and we consumed a lot of CPU looking for an appropriate btree node. In this patch, we try to record the 128 btree nodes with the smallest fifo index in a heap, and pop them one by one when we need to flush a btree node. This greatly reduces the time the loop takes to find an appropriate btree node, and also reduces the CPU occupancy. [note by mpl: this triggers a checkpatch error because of adjacent, pre-existing style violations] Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit c4dc2497) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Tang Junhui authored
BugLink: https://bugs.launchpad.net/bugs/1784665 Sometimes the journal takes up a lot of CPU; we need statistics to know what the journal is doing. So this patch provides some journal statistics: 1) reclaim: how many times the journal tries to reclaim resources, usually because the journal bucket and/or the pin are exhausted; 2) flush_write: how many times the journal tries to flush a btree node to the cache device, usually because the journal buckets are exhausted; 3) retry_flush_write: how many times the journal retries to flush the next btree node, usually because the previous btree node has been flushed by another thread. We expose these statistics through the sysfs interface. Through them we can fully see the status of the journal module when CPU usage is too high. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit a728eacb) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Coly Li authored
BugLink: https://bugs.launchpad.net/bugs/1784665 This patch tries to release the mutex bch_register_lock early, to give a chance to stop the cache set and bcache device early. This patch also extends the timeout for stopping all bcache devices from 2 seconds to 10 seconds, because stopping the writeback rate update worker may be delayed by up to 5 seconds, so 2 seconds is not enough. After this patch is applied, hangs while stopping bcache devices during system reboot or shutdown are very hard to observe any more. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit eb8cbb6d) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
- 12 Aug, 2019 12 commits
-
-
Jason Wang authored
This patch will check the weight and exit the loop if we exceed the weight. This is useful for preventing the scsi kthread from hogging the cpu, which is guest triggerable. This addresses CVE-2019-3900. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Fixes: 057cbf49 ("tcm_vhost: Initial merge for vhost level target fabric driver") Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> CVE-2019-3900 (backported from commit c1ea02f1) [tyhicks: Backport to Xenial: - Minor context adjustment in local variables - Adjust context around the loop in vhost_scsi_handle_vq() - No need to modify vhost_scsi_ctl_handle_vq() since it was added later in commit 0d02dbd6 ("vhost/scsi: Respond to control queue operations")] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Jason Wang authored
When the rx buffer is too small for a packet, we will discard the vq descriptor and retry it for the next packet:

    while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
                                                  &busyloop_intr))) {
            ...
            /* On overrun, truncate and discard */
            if (unlikely(headcount > UIO_MAXIOV)) {
                    iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
                    err = sock->ops->recvmsg(sock, &msg, 1,
                                             MSG_DONTWAIT | MSG_TRUNC);
                    pr_debug("Discarded rx packet: len %zd\n", sock_len);
                    continue;
            }
            ...
    }

This makes it possible to trigger an infinite while..continue loop through the co-operation of two VMs, like: 1) Malicious VM1 allocates a 1 byte rx buffer and tries to slow down the vhost process as much as possible, e.g. by using indirect descriptors or other means; 2) Malicious VM2 generates packets to VM1 as fast as possible. Fix this by checking against the weight at the end of the RX and TX loops. This also eliminates other similar cases when: - userspace is consuming the packets in the meanwhile - a theoretical TOCTOU attack if the guest moves the avail index back and forth to hit the continue after vhost finds the guest has just added new buffers. This addresses CVE-2019-3900. Fixes: d8316f39 ("vhost: fix total length when packets are too short") Fixes: 3a4d5c94 ("vhost_net: a kernel-level virtio server") Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> CVE-2019-3900 (backported from commit e2412c07) [tyhicks: Backport to Xenial: - Adjust handle_tx() instead of handle_tx_{copy,zerocopy}() due to missing commit 0d20bdf3 ("vhost_net: split out datacopy logic") - Minor context adjustments due to the lack of the iov_limit member of the vhost_dev struct which was added later in commit b46a0bf7 ("vhost: fix OOB in get_rx_bufs()") - handle_rx() still uses peek_head_len() due to missing and unneeded commit 03088137 ("vhost_net: basic polling support") - Context adjustment in call to vhost_log_write() in hunk #3 of net.c due to missing and unneeded commit cc5e7107 ("vhost: log dirty page correctly") - Context adjustment in hunk #4 due to using break instead of goto out - Context adjustment in hunk #5 due to missing and unneeded commit c67df11f ("vhost_net: try batch dequing from skb array")] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Jason Wang authored
We used to have vhost_exceeds_weight() for vhost-net to: - prevent the vhost kthread from hogging the cpu - balance the time spent between TX and RX This function could be useful for vsock and scsi as well, so move it to vhost.c. A device must specify a weight, which counts the number of requests, and it can also specify a byte_weight, which counts the number of bytes that have been processed. Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> CVE-2019-3900 (backported from commit e82b9b07) [tyhicks: Backport to Xenial: - Adjust handle_tx() instead of handle_tx_{copy,zerocopy}() due to missing commit 0d20bdf3 ("vhost_net: split out datacopy logic") - Considerable context adjustments throughout the patch due to the lack of the iov_limit member of the vhost_dev struct which was added later in commit b46a0bf7 ("vhost: fix OOB in get_rx_bufs()") - Context adjustment in call to vhost_log_write() in hunk #3 of net.c due to missing and unneeded commit cc5e7107 ("vhost: log dirty page correctly") - Context adjustment in hunk #3 of net.c due to using break instead of goto out - Context adjustment in hunk #4 of net.c due to missing and unneeded commit c67df11f ("vhost_net: try batch dequing from skb array") - Don't patch vsock.c since Xenial doesn't have vhost vsock support - Adjust context in vhost_dev_init() to account for different local variables - Adjust context in struct vhost_dev to account for different struct members] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
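Roughly the shape of the helper this commit moves into vhost.c, sketched from the description above (field and helper names follow the upstream commit; byte_weight is optional, so it only trips when set):

    bool vhost_exceeds_weight(struct vhost_virtqueue *vq,
                              int pkts, int total_len)
    {
            struct vhost_dev *dev = vq->dev;

            if ((dev->byte_weight && total_len >= dev->byte_weight) ||
                pkts >= dev->weight) {
                    /* requeue the work instead of hogging the cpu */
                    vhost_poll_queue(&vq->poll);
                    return true;
            }

            return false;
    }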
-
Jason Wang authored
Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> CVE-2019-3900 (backported from commit 272f35cb) [tyhicks: Backport to Xenial: - Minor context adjustment in net.c due to missing commit b0d0ea50 ("vhost_net: introduce helper to initialize tx iov iter") - Context adjustment in call to vhost_log_write() in hunk #4 due to missing and unneeded commit cc5e7107 ("vhost: log dirty page correctly") - Context adjustment in hunk #4 due to using break instead of goto out] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Paolo Abeni authored
Similar to commit a2ac9990 ("vhost-net: set packet weight of tx polling to 2 * vq size"), we need a packet-based limit for handle_rx, too - otherwise, under rx flood with small packets, tx can be delayed for a very long time, even without busypolling. The pkt limit applied to handle_rx must be the same applied by handle_tx, or we will get unfair scheduling between rx and tx. Tying such a limit to the queue length makes it less effective for large queue length values and can introduce large process scheduler latencies, so a constant value is used - likewise the existing bytes limit. The selected limit has been validated with PVP[1] performance tests with different queue sizes:

  queue size        256     512    1024
  baseline          366     354     362
  weight 128        715     723     670
  weight 256        740     745     733
  weight 512        600     460     583
  weight 1024       423     427     418

A packet weight of 256 gives peak performance in all the tested scenarios. No measurable regression in unidirectional performance tests has been detected. [1] https://developers.redhat.com/blog/2017/06/05/measuring-and-comparing-open-vswitch-performance/ Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> CVE-2019-3900 (backported from commit db688c24) [tyhicks: Backport to Xenial: - Context adjustment in call to mutex_lock_nested() in hunk #3 due to missing and unneeded commit aaa3149b ("vhost_net: add missing lock nesting notation") - Context adjustment in call to vhost_log_write() in hunk #4 due to missing and unneeded commit cc5e7107 ("vhost: log dirty page correctly") - Context adjustment in hunk #4 due to using break instead of goto out] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
haibinzhang(张海斌) authored
handle_tx will delay rx for tens or even hundreds of milliseconds when tx busy polling udp packets with small length (e.g. 1 byte udp payload), because setting VHOST_NET_WEIGHT takes into account only sent bytes but not single packet length. Ping latencies shown below were tested between two Virtual Machines using netperf (UDP_STREAM, len=1), and then another machine pinged the client:

  vq size=256
  Packet-Weight   Ping-Latencies (millisecond)
                     min        avg        max
  Origin           3.319     18.489     57.303
  64               1.643      2.021      2.552
  128              1.825      2.600      3.224
  256              1.997      2.710      4.295
  512              1.860      3.171      4.631
  1024             2.002      4.173      9.056
  2048             2.257      5.650      9.688
  4096             2.093      8.508     15.943

  vq size=512
  Packet-Weight   Ping-Latencies (millisecond)
                     min        avg        max
  Origin           6.537     29.177     66.245
  64               2.798      3.614      4.403
  128              2.861      3.820      4.775
  256              3.008      4.018      4.807
  512              3.254      4.523      5.824
  1024             3.079      5.335      7.747
  2048             3.944      8.201     12.762
  4096             4.158     11.057     19.985

Seems pretty consistent, a small dip at 2 VQ sizes. Ring size is a hint from the device about a burst size it can tolerate. Based on benchmarks, set the weight to 2 * vq size. To evaluate this change, further tests were done using netperf (RR, TX) between two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz, and vq size was tweaked through qemu. Results shown below do not show obvious changes:

  vq size=256 TCP_RR                   vq size=512 TCP_RR
  size/sessions/+thu%/+normalize%      size/sessions/+thu%/+normalize%
     1/  1/  -7%/  -2%                    1/  1/   0%/  -2%
     1/  4/  +1%/   0%                    1/  4/  +1%/   0%
     1/  8/  +1%/  -2%                    1/  8/   0%/  +1%
    64/  1/  -6%/   0%                   64/  1/  +7%/  +3%
    64/  4/   0%/  +2%                   64/  4/  -1%/  +1%
    64/  8/   0%/   0%                   64/  8/  -1%/  -2%
   256/  1/  -3%/  -4%                  256/  1/  -4%/  -2%
   256/  4/  +3%/  +4%                  256/  4/  +1%/  +2%
   256/  8/  +2%/   0%                  256/  8/  +1%/  -1%

  vq size=256 UDP_RR                   vq size=512 UDP_RR
  size/sessions/+thu%/+normalize%      size/sessions/+thu%/+normalize%
     1/  1/  -5%/  +1%                    1/  1/  -3%/  -2%
     1/  4/  +4%/  +1%                    1/  4/  -2%/  +2%
     1/  8/  -1%/  -1%                    1/  8/  -1%/   0%
    64/  1/  -2%/  -3%                   64/  1/  +1%/  +1%
    64/  4/  -5%/  -1%                   64/  4/  +2%/   0%
    64/  8/   0%/  -1%                   64/  8/  -2%/  +1%
   256/  1/  +7%/  +1%                  256/  1/  -7%/   0%
   256/  4/  +1%/  +1%                  256/  4/  -3%/  -4%
   256/  8/  +2%/  +2%                  256/  8/  +1%/  +1%

  vq size=256 TCP_STREAM               vq size=512 TCP_STREAM
  size/sessions/+thu%/+normalize%      size/sessions/+thu%/+normalize%
    64/  1/   0%/  -3%                   64/  1/   0%/   0%
    64/  4/  +3%/  -1%                   64/  4/  -2%/  +4%
    64/  8/  +9%/  -4%                   64/  8/  -1%/  +2%
   256/  1/  +1%/  -4%                  256/  1/  +1%/  +1%
   256/  4/  -1%/  -1%                  256/  4/  -3%/   0%
   256/  8/  +7%/  +5%                  256/  8/  -3%/   0%
   512/  1/  +1%/   0%                  512/  1/  -1%/  -1%
   512/  4/  +1%/  -1%                  512/  4/   0%/   0%
   512/  8/  +7%/  -5%                  512/  8/  +6%/  -1%
  1024/  1/   0%/  -1%                 1024/  1/   0%/  +1%
  1024/  4/  +3%/   0%                 1024/  4/  +1%/   0%
  1024/  8/  +8%/  +5%                 1024/  8/  -1%/   0%
  2048/  1/  +2%/  +2%                 2048/  1/  -1%/   0%
  2048/  4/  +1%/   0%                 2048/  4/   0%/  -1%
  2048/  8/  -2%/   0%                 2048/  8/   5%/  -1%
  4096/  1/  -2%/   0%                 4096/  1/  -2%/   0%
  4096/  4/  +2%/   0%                 4096/  4/   0%/   0%
  4096/  8/  +9%/  -2%                 4096/  8/  -5%/  -1%

Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Haibin Zhang <haibinzhang@tencent.com> Signed-off-by: Yunfang Tai <yunfangtai@tencent.com> Signed-off-by: Lidong Chen <lidongchen@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net> CVE-2019-3900 (cherry picked from commit a2ac9990) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Willem de Bruijn authored
Vhost-net has a hard limit on the number of zerocopy skbs in flight. When reached, transmission stalls. Stalls cause latency, as well as head-of-line blocking of other flows that do not use zerocopy. Instead of stalling, revert to copy-based transmission. Tested by sending two udp flows from guest to host, one with payload of VHOST_GOODCOPY_LEN, the other too small for zerocopy (1B). The large flow is redirected to a netem instance with 1MBps rate limit and deep 1000 entry queue.

  modprobe ifb
  ip link set dev ifb0 up
  tc qdisc add dev ifb0 root netem limit 1000 rate 1MBit

  tc qdisc add dev tap0 ingress
  tc filter add dev tap0 parent ffff: protocol ip \
      u32 match ip dport 8000 0xffff \
      action mirred egress redirect dev ifb0

Before the delay, both flows process around 80K pps. With the delay, before this patch, both process around 400. After this patch, the large flow is still rate limited, while the small reverts to its original rate. See also discussion in the first link, below. Without rate limiting, {1, 10, 100}x TCP_STREAM tests continued to send at 100% zerocopy. The limit in vhost_exceeds_maxpend must be carefully chosen. With vq->num >> 1, the flows remain correlated. This value happens to correspond to VHOST_MAX_PENDING for vq->num == 256. Allow smaller fractions and ensure correctness also for much smaller values of vq->num, by testing the min() of both explicitly. See also the discussion in the second link below. Changes v1 -> v2 - replaced min with typed min_t - avoid unnecessary whitespace change Link: http://lkml.kernel.org/r/CAF=yD-+Wk9sc9dXMUq1+x_hh=3ThTXa6BnZkygP3tgVpjbp93g@mail.gmail.com Link: http://lkml.kernel.org/r/20170819064129.27272-1-den@klaipeden.com Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> CVE-2019-3900 (cherry picked from commit 1e6f7453) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
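The resulting limit check looks roughly like this (a sketch using the vhost-net names referenced above; VHOST_MAX_PEND is the historical hard cap, and the fraction of vq->num is an assumption based on the discussion in the message):

    static bool vhost_exceeds_maxpend(struct vhost_net *net)
    {
            struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
            struct vhost_virtqueue *vq = &nvq->vq;

            /* pending zerocopy completions, measured as a ring distance;
             * compare against the min() of the hard cap and a fraction of
             * vq->num so small rings stay correct */
            return (nvq->upend_idx + UIO_MAXIOV - nvq->done_idx) % UIO_MAXIOV >
                   min_t(unsigned int, VHOST_MAX_PEND, vq->num >> 2);
    }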
-
Jason Wang authored
This patch tries to utilize tuntap rx batching by peeking the tx virtqueue during transmission; if there are more available buffers in the virtqueue, set the MSG_MORE flag as a hint for the backend (e.g. tuntap) to batch the packets. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> CVE-2019-3900 (backported from commit 0ed005ce) [tyhicks: Minor context adjustment due to missing commit 03088137 ("vhost_net: basic polling support")] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
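In handle_tx() the hint ends up looking roughly like this (a sketch combining the helpers from the surrounding patches in this series; the exact condition in this tree may differ):

    /* ask the backend to hold the packet for batching only when more
     * buffers are already queued and the zerocopy path isn't saturated */
    if (total_len < VHOST_NET_WEIGHT &&
        !vhost_vq_avail_empty(&net->dev, vq) &&
        likely(!vhost_exceeds_maxpend(net)))
            msg.msg_flags |= MSG_MORE;
    else
            msg.msg_flags &= ~MSG_MORE;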
-
Jason Wang authored
This patch introduces a helper which will return true if we're sure that the available ring is empty for a specific vq. When we're not sure, e.g. on vq access failure, return false instead. This could be used by busy polling code to exit the busy loop. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> CVE-2019-3900 (cherry picked from commit d4a60603) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
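The helper is short; roughly (a sketch following the upstream commit):

    bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
    {
            __virtio16 avail_idx;
            int r;

            r = __get_user(avail_idx, &vq->avail->idx);
            if (r)
                    /* can't read the ring: claim "not empty" so callers
                     * stop busy polling and fall back to the normal path */
                    return false;

            return vhost16_to_cpu(vq, avail_idx) == vq->avail_idx;
    }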
-
Colin Ian King authored
BugLink: https://launchpad.net/bugs/1839521 Sync ZFS 0.6.5.6-0ubuntu28 to Xenial: Fix shrinker deadlock with xattrs (LP: #1839521) Upstream ZFS fix 31b6111fd92a ("Kill zp->z_xattr_parent to prevent pinning") and ddae16a9cf0b ("xattr dir doesn't get purged during iput") fix a deadlock in shrinker path when a xattr directory inode and its xattr inode are in the same disposal list and the xattr dir inode is evicted before the xattr inode. Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Michael Neuling authored
CVE-2019-13648 On systems like P9 powernv where we have no TM (or P8 booted with ppc_tm=off), userspace can construct a signal context which still has the MSR TS bits set. The kernel tries to restore this context which results in the following crash:

Unexpected TM Bad Thing exception at c0000000000022fc (msr 0x8000000102a03031) tm_scratch=800000020280f033
Oops: Unrecoverable exception, sig: 6 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 0 PID: 1636 Comm: sigfuz Not tainted 5.2.0-11043-g0a8ad0ff #69
NIP: c0000000000022fc LR: 00007fffb2d67e48 CTR: 0000000000000000
REGS: c00000003fffbd70 TRAP: 0700 Not tainted (5.2.0-11045-g7142b497d8)
MSR: 8000000102a03031 <SF,VEC,VSX,FP,ME,IR,DR,LE,TM[E]> CR: 42004242 XER: 00000000
CFAR: c0000000000022e0 IRQMASK: 0
GPR00: 0000000000000072 00007fffb2b6e560 00007fffb2d87f00 0000000000000669
GPR04: 00007fffb2b6e728 0000000000000000 0000000000000000 00007fffb2b6f2a8
GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fffb2b76900 0000000000000000 0000000000000000
GPR16: 00007fffb2370000 00007fffb2d84390 00007fffea3a15ac 000001000a250420
GPR20: 00007fffb2b6f260 0000000010001770 0000000000000000 0000000000000000
GPR24: 00007fffb2d843a0 00007fffea3a14a0 0000000000010000 0000000000800000
GPR28: 00007fffea3a14d8 00000000003d0f00 0000000000000000 00007fffb2b6e728
NIP [c0000000000022fc] rfi_flush_fallback+0x7c/0x80
LR [00007fffb2d67e48] 0x7fffb2d67e48
Call Trace:
Instruction dump:
e96a0220 e96a02a8 e96a0330 e96a03b8 394a0400 4200ffdc 7d2903a6 e92d0c00
e94d0c08 e96d0c10 e82d0c18 7db242a6 <4c000024> 7db243a6 7db142a6 f82d0c18

The problem is the signal code assumes TM is enabled when CONFIG_PPC_TRANSACTIONAL_MEM is enabled. This may not be the case as with P9 powernv or if `ppc_tm=off` is used on P8. This means any local user can crash the system. Fix the problem by returning a bad stack frame to the user if they try to set the MSR TS bits with sigreturn() on systems where TM is not supported. Found with sigfuz kernel selftest on P9. This fixes CVE-2019-13648. Fixes: 2b0a576d ("powerpc: Add new transactional memory state to the signal context") Cc: stable@vger.kernel.org # v3.9 Reported-by: Praveen Pandey <Praveen.Pandey@in.ibm.com> Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190719050502.405-1-mikey@neuling.org (cherry picked from commit f16d80b7) Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
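The shape of the check the fix adds, as a sketch of the sigreturn restore path (assuming the upstream names MSR_TM_ACTIVE(), cpu_has_feature() and CPU_FTR_TM; the surrounding restore logic is elided):

    if (MSR_TM_ACTIVE(msr)) {
            /* TM is unavailable (e.g. P9 powernv, or ppc_tm=off on P8),
             * so a context with the TS bits set can only be forged:
             * reject the frame instead of trying to recheckpoint */
            if (!cpu_has_feature(CPU_FTR_TM))
                    goto badframe;
    }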
-
xiao jin authored
CVE-2018-20856 We found a memory use-after-free issue in __blk_drain_queue() on kernel 4.14. After reading the latest kernel 4.18-rc6, we think it has the same problem. Memory is allocated for q->fq in blk_init_allocated_queue(). If the elevator init function returns an error, it will run into the fail case and free q->fq. Then __blk_drain_queue() uses the same memory after q->fq has been freed, leading to unpredictable behavior. The patch sets q->fq to NULL in the fail case of blk_init_allocated_queue(). Fixes: commit 7c94e1c1 ("block: introduce blk_flush_queue to drive flush machinery") Cc: <stable@vger.kernel.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: xiao jin <jin.xiao@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> (backported from commit 54648cf1) [ Connor Kuehl: had to place the line from the patch in manually since the patch context disagreed with what the routine looks like now (different label, different return statement). Barely more involved than an offset adjustment. ] Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
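The fix itself is one line; its shape in blk_init_allocated_queue()'s failure path is roughly (per the backport note, the label and return statement differ in this tree):

    fail:
            blk_free_flush_queue(q->fq);
            q->fq = NULL;   /* don't leave a dangling pointer for
                             * __blk_drain_queue() to dereference */
            return NULL;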
-
- 07 Aug, 2019 2 commits
-
-
Denis Efremov authored
CVE-2019-14283 This fixes a global out-of-bounds read access in the copy_buffer function of the floppy driver. The FDDEFPRM ioctl allows one to set the geometry of a disk. The sect and head fields (unsigned int) of the floppy_struct structure are used to compute the max_sector (int) in the make_raw_rw_request function. It is possible to overflow max_sector. Next, max_sector is passed to the copy_buffer function and used in one of the memcpy calls. An unprivileged user could trigger the bug if the device is accessible, but it requires a floppy disk to be inserted. The patch adds a check that the .sect * .head multiplication does not overflow in the set_geometry function. The bug was found by syzkaller. Signed-off-by: Denis Efremov <efremov@ispras.ru> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit da99466a) Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Acked-by: Colin Ian King <colin.king@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
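A sketch of the added sanity check in set_geometry() (field names from struct floppy_struct; the exact surrounding conditions in the upstream patch differ):

    /* g->sect * g->head feeds max_sector (an int) in
     * make_raw_rw_request(); reject geometries that overflow it */
    if (g->sect <= 0 || g->head <= 0 ||
        (int)(g->sect * g->head) <= 0)
            return -EINVAL;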
-
Denis Efremov authored
This fixes a divide by zero error in the setup_format_params function of the floppy driver. Two consecutive ioctls can trigger the bug: the first one should set the drive geometry with .sect and .rate values such that F_SECT_PER_TRACK becomes zero, then the floppy format operation should be called. A floppy disk is not required to be inserted. An unprivileged user could trigger the bug if the device is accessible. The patch checks F_SECT_PER_TRACK for a non-zero value in the set_geometry function. The proper check would involve a reasonable upper limit for the .sect and .rate fields, but that could change the UAPI. The patch also checks F_SECT_PER_TRACK in setup_format_params, and cancels the formatting operation if it is zero. The bug was found by syzkaller. Signed-off-by: Denis Efremov <efremov@ispras.ru> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> CVE-2019-14284 (cherry picked from commit f3554aeb) Signed-off-by: Sultan Alsawaf <sultan.alsawaf@canonical.com> Acked-by: Colin Ian King <colin.king@canonical.com> Acked-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
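A sketch of the two checks described above (names from the floppy driver; the set_geometry() condition is assumed to mirror the F_SECT_PER_TRACK computation):

    /* in set_geometry(): reject geometries whose computed
     * F_SECT_PER_TRACK would be zero */
    if ((unsigned char)((g->sect << 2) >> FD_SIZECODE(g)) == 0)
            return -EINVAL;

    /* in setup_format_params(): cancel the formatting operation
     * before the value is used as a divisor */
    if (F_SECT_PER_TRACK == 0)
            return -EINVAL;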
-
- 05 Aug, 2019 5 commits
-
-
Greg Kroah-Hartman authored
BugLink: https://bugs.launchpad.net/bugs/1838467 Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Paolo Bonzini authored
BugLink: https://bugs.launchpad.net/bugs/1838467 commit 250715a6 upstream. The syzkaller folks reported a NULL pointer dereference that seems to be caused by a race between KVM_CREATE_IRQCHIP and KVM_CREATE_PIT2. The former takes kvm->lock (except when registering the devices, which needs kvm->slots_lock); the latter takes kvm->slots_lock only. Change KVM_CREATE_PIT2 to follow the same model as KVM_CREATE_IRQCHIP. Testcase:

    #include <pthread.h>
    #include <linux/kvm.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <stdint.h>
    #include <string.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    long r[23];

    void* thr(void* arg)
    {
            struct kvm_pit_config pitcfg = { .flags = 4 };
            switch ((long)arg) {
            case 0: r[2] = open("/dev/kvm", O_RDONLY|O_ASYNC); break;
            case 1: r[3] = ioctl(r[2], KVM_CREATE_VM, 0); break;
            case 2: r[4] = ioctl(r[3], KVM_CREATE_IRQCHIP, 0); break;
            case 3: r[22] = ioctl(r[3], KVM_CREATE_PIT2, &pitcfg); break;
            }
            return 0;
    }

    int main(int argc, char **argv)
    {
            long i;
            pthread_t th[4];

            memset(r, -1, sizeof(r));
            for (i = 0; i < 4; i++) {
                    pthread_create(&th[i], 0, thr, (void*)i);
                    if (argc > 1 && rand()%2) usleep(rand()%1000);
            }
            usleep(20000);
            return 0;
    }

Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Zubin Mithra <zsm@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
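A sketch of the kernel-side change described above (the KVM_CREATE_PIT2 arm of kvm_arch_vm_ioctl(); names follow the upstream commit, and the copy_from_user error handling is abbreviated):

    case KVM_CREATE_PIT2:
            r = -EFAULT;
            if (copy_from_user(&u.pit_config, argp,
                               sizeof(struct kvm_pit_config)))
                    goto out;
            /* take kvm->lock, as KVM_CREATE_IRQCHIP does; the device
             * registration inside kvm_create_pit() still uses slots_lock */
            mutex_lock(&kvm->lock);
            r = -EEXIST;
            if (kvm->arch.vpit)
                    goto create_pit_unlock;
            r = -ENOMEM;
            kvm->arch.vpit = kvm_create_pit(kvm, u.pit_config.flags);
            if (kvm->arch.vpit)
                    r = 0;
    create_pit_unlock:
            mutex_unlock(&kvm->lock);
            break;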
-
Julian Wiedmann authored
BugLink: https://bugs.launchpad.net/bugs/1838467 commit ac6639cd upstream. Current code sets the dsci to 0x00000080, which doesn't make any sense, as the indicator area is located in the _left-most_ byte. Worse: if the dsci is the _shared_ indicator, this potentially clears the indication of activity for a _different_ device. tiqdio_thinint_handler() will then have no reason to call that device's IRQ handler, and the device ends up stalling. Fixes: d0c9d4a8 ("[S390] qdio: set correct bit in dsci") Cc: <stable@vger.kernel.org> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Julian Wiedmann authored
BugLink: https://bugs.launchpad.net/bugs/1838467 commit e54e4785 upstream. When tiqdio_remove_input_queues() removes a queue from the tiq_list as part of qdio_shutdown(), it doesn't re-initialize the queue's list entry and the prev/next pointers go stale. If a subsequent qdio_establish() fails while sending the ESTABLISH cmd, it calls qdio_shutdown() again in QDIO_IRQ_STATE_ERR state and tiqdio_remove_input_queues() will attempt to remove the queue entry a second time. This dereferences the stale pointers, and bad things ensue. Fix this by re-initializing the list entry after removing it from the list. For good practice also initialize the list entry when the queue is first allocated, and remove the quirky checks that papered over this omission. Note that prior to commit e5218134 ("s390/qdio: fix access to uninitialized qdio_q fields"), these checks were bogus anyway. setup_queues_misc() clears the whole queue struct, and thus needs to re-init the prev/next pointers as well. Fixes: 779e6e1c ("[S390] qdio: new qdio driver.") Cc: <stable@vger.kernel.org> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Heiko Carstens authored
BugLink: https://bugs.launchpad.net/bugs/1838467 commit 4f18d869 upstream. The stfle inline assembly returns the number of double words written (condition code 0) or the number of double words it would have written (condition code 3), if the memory array it got as a parameter had been large enough. The current stfle implementation assumes that the array is always large enough and clears those parts of the array that have not been written to with a subsequent memset call. If however the array is not large enough, memset will get a negative length parameter, which means that memset clears memory until it gets an exception and the kernel crashes. To fix this, simply limit the maximum length. Also move the inline assembly to an extra function to avoid clobbering of register 0, which might happen because of the added min_t invocation together with code instrumentation. The bug was introduced with commit 14375bc4 ("[S390] cleanup facility list handling") but was rather harmless, since it would only write to a rather large array. It became a potential problem with commit 3ab121ab ("[S390] kernel: Add z/VM LGR detection"). Since then it writes to an array with only four double words, while some machines already deliver three double words. As soon as machines have a facility bit within the fifth double word, a crash on IPL would happen. Fixes: 14375bc4 ("[S390] cleanup facility list handling") Cc: <stable@vger.kernel.org> # v2.6.37+ Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
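A sketch of the clamping described above (assuming the upstream shape: __stfle_asm() is the extracted inline assembly returning the double-word count minus one, and nr counts bytes actually written):

    /* stfle reports how many double words it wrote (or wanted to
     * write); clamp to the caller's buffer so the trailing memset
     * length can never go negative */
    stfle_size = __stfle_asm(stfle_fac_list, size) + 1;
    nr = min_t(unsigned long, stfle_size * 8, size * 8);
    memset((char *)stfle_fac_list + nr, 0, size * 8 - nr);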
-