- 13 Sep, 2019 2 commits
-
-
yongduan authored
The code assumes log_num < in_num everywhere, and that is true as long as in_num is incremented by descriptor iov count, and log_num by 1. However this breaks if there's a zero sized descriptor. As a result, if a malicious guest creates a vring desc with desc.len = 0, it may cause the host kernel to crash by overflowing the log array. This bug can be triggered during the VM migration. There's no need to log when desc.len = 0, so just don't increment log_num in this case. Fixes: 3a4d5c94 ("vhost_net: a kernel-level virtio server") Reviewed-by: Lidong Chen <lidongchen@tencent.com> Signed-off-by: ruippan <ruippan@tencent.com> Signed-off-by: yongduan <yongduan@tencent.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> CVE-2019-14835 (backported from email patch attachment) [juergh: Adjusted context.] Signed-off-by: Juerg Haefliger <juergh@canonical.com>
-
Juerg Haefliger authored
Ignore: yes Signed-off-by: Juerg Haefliger <juergh@canonical.com>
-
- 27 Aug, 2019 5 commits
-
-
Stefan Bader authored
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
-
Stefan Bader authored
BugLink: https://bugs.launchpad.net/bugs/1658219 This reverts commit 97ac9e61 as it is currently causing regressions in snaps which would break networking for all core16 images. Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
-
Stefan Bader authored
BugLink: https://bugs.launchpad.net/bugs/1841544 Properties: no-test-build Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
-
Stefan Bader authored
Ignore: yes Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
-
Stefan Bader authored
BugLink: http://bugs.launchpad.net/bugs/1786013Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
-
- 13 Aug, 2019 26 commits
-
-
Connor Kuehl authored
Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com>
-
Connor Kuehl authored
BugLink: https://bugs.launchpad.net/bugs/1840021 Properties: no-test-build Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com>
-
Connor Kuehl authored
Ignore: yes Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com>
-
Connor Kuehl authored
BugLink: http://bugs.launchpad.net/bugs/1786013Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com>
-
Hans de Goede authored
BugLink: https://bugs.launchpad.net/bugs/1837117 Commit 78f3ac76 ("platform/x86: asus-wmi: Tell the EC the OS will handle the display off hotkey") causes the backlight to be permanently off on various EeePC laptop models using the eeepc-wmi driver (Asus EeePC 1015BX, Asus EeePC 1025C). The asus_wmi_set_devstate(ASUS_WMI_DEVID_BACKLIGHT, 2, NULL) call added by that commit is made conditional in this commit and only enabled in the quirk_entry structs in the asus-nb-wmi driver fixing the broken display / backlight on various EeePC laptop models. Cc: João Paulo Rechi Vita <jprvita@endlessm.com> Fixes: 78f3ac76 ("platform/x86: asus-wmi: Tell the EC the OS will handle the display off hotkey") Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> (backported from commit 1dd93f87) [PHLin: context adjustment, only add quirks for models existing in X] Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Acked-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Eric Dumazet authored
CVE-2019-10638 According to Amit Klein and Benny Pinkas, IP ID generation is too weak and might be used by attackers. Even with recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix()) having 64bit key and Jenkins hash is risky. It is time to switch to siphash and its 128bit keys. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Amit Klein <aksecurity@gmail.com> Reported-by: Benny Pinkas <benny@pinkas.net> Signed-off-by: David S. Miller <davem@davemloft.net> (backported from commit df453700) [ Connor Kuehl: Adjusted patch to communicate the id return value through the skbuf as the function signature for ipv6_proxy_select_ident is still void (whereas the patch context expects it to return a value). This function signature change doesn't happen until upstream commit: 0c19f846 "net: accept UFO datagrams from tuntap and packet" ] Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Acked-by: Kleber Souza <kleber.souza@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Jason A. Donenfeld authored
CVE-2019-10638 SipHash is a 64-bit keyed hash function that is actually a cryptographically secure PRF, like HMAC. Except SipHash is super fast, and is meant to be used as a hashtable keyed lookup function, or as a general PRF for short input use cases, such as sequence numbers or RNG chaining. For the first usage: There are a variety of attacks known as "hashtable poisoning" in which an attacker forms some data such that the hash of that data will be the same, and then preceeds to fill up all entries of a hashbucket. This is a realistic and well-known denial-of-service vector. Currently hashtables use jhash, which is fast but not secure, and some kind of rotating key scheme (or none at all, which isn't good). SipHash is meant as a replacement for jhash in these cases. There are a modicum of places in the kernel that are vulnerable to hashtable poisoning attacks, either via userspace vectors or network vectors, and there's not a reliable mechanism inside the kernel at the moment to fix it. The first step toward fixing these issues is actually getting a secure primitive into the kernel for developers to use. Then we can, bit by bit, port things over to it as deemed appropriate. While SipHash is extremely fast for a cryptographically secure function, it is likely a bit slower than the insecure jhash, and so replacements will be evaluated on a case-by-case basis based on whether or not the difference in speed is negligible and whether or not the current jhash usage poses a real security risk. For the second usage: A few places in the kernel are using MD5 or SHA1 for creating secure sequence numbers, syn cookies, port numbers, or fast random numbers. SipHash is a faster and more fitting, and more secure replacement for MD5 in those situations. Replacing MD5 and SHA1 with SipHash for these uses is obvious and straight-forward, and so is submitted along with this patch series. There shouldn't be much of a debate over its efficacy. Dozens of languages are already using this internally for their hash tables and PRFs. Some of the BSDs already use this in their kernels. SipHash is a widely known high-speed solution to a widely known set of problems, and it's time we catch-up. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Eric Biggers <ebiggers3@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> (backported from commit 2c956a60) [ Connor Kuehl: Minor offset adjustments required due to the high traffic nature of things like Kconfig and Makefiles. Had to make sure the proper siphash entries made it in to both files since the patch context that surrounds it is so different. ] Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Acked-by: Kleber Souza <kleber.souza@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Connor Kuehl authored
CVE-2019-10638 Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Acked-by: Kleber Souza <kleber.souza@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
John Johansen authored
This is a backport of a fix that landed as part of a larger patch in 4.17 commit 9fcf78cc ("apparmor: update domain transitions that are subsets of confinement at nnp") Domain transitions that add a new profile to the confinement stack when under NO NEW PRIVS is allowed as it can not expand privileges. However such transitions are failing due to how/where the subset test is being applied. Applying the test per profile in the profile transition and profile_onexec call backs is incorrect as it disregards the other profiles in the stack so it can not correctly determine if the old confinement stack is a subset of the new confinement stack. Move the test to after the new confinement stack is constructed. BugLink: http://bugs.launchpad.net/bugs/1839037Signed-off-by: John Johansen <john.johansen@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
John Johansen authored
v2. Add fix to profile_transition() also There are 2 cases where a denial in onexec profile transitions can occur that results in an apparmor WARN traceback. The first occurs if onexec is denied by policy, the second if onexec fails due to no-new-privs. A similar failure can occur in profile_transition() when directed to perform a stack, resulting in a simiar traceback with handle_onexec() replaced by profile_transition(). [1140910.816457] ------------[ cut here ]------------ [1140910.816466] WARNING: CPU: 4 PID: 32497 at /build/linux-UdetSb/linux-4.4.0/security/apparmor/file.c:136 aa_audit_file+0x16e/0x180() [1140910.816467] AppArmor WARN aa_audit_file: ((!(&sa)->apparmor_audit_data->request)): [1140910.816469] Modules linked in: [1140910.816470] xt_mark xt_comment ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs xt_REDIRECT nf_nat_redirect xt_nat veth btrfs xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c msr nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter pci_stub vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) xt_CHECKSUM iptable_mangle rfcomm ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables xt_multiport iptable_filter ip_tables x_tables aufs overlay bnep uvcvideo videobuf2_vmalloc btusb videobuf2_memops videobuf2_v4l2 btrtl btbcm videobuf2_core btintel v4l2_common bluetooth videodev media binfmt_misc arc4 [1140910.816508] iwlmvm mac80211 intel_rapl snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek snd_hda_codec_generic iwlwifi joydev input_leds serio_raw cfg80211 snd_hda_intel snd_hda_codec snd_hda_core lpc_ich snd_hwdep thinkpad_acpi nvram snd_pcm snd_seq_midi mei_me snd_seq_midi_event shpchp ie31200_edac mei snd_rawmidi edac_core snd_seq wmi snd_seq_device snd_timer snd soundcore kvm_intel mac_hid kvm irqbypass coretemp parport_pc ppdev lp parport autofs4 drbg ansi_cprng algif_skcipher af_alg dm_crypt hid_generic hid_logitech_hidpp hid_logitech_dj usbhid hid uas usb_storage crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel i915 aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd i2c_algo_bit drm_kms_helper psmouse syscopyarea sysfillrect ahci sysimgblt e1000e [1140910.816544] fb_sys_fops libahci sdhci_pci drm sdhci ptp pps_core fjes video [1140910.816549] CPU: 4 PID: 32497 Comm: runc:[2:INIT] Tainted: G W OE 4.4.0-151-generic #178-Ubuntu [1140910.816551] Hardware name: LENOVO 20EFCTO1WW/20EFCTO1WW, BIOS GNET82WW (2.30 ) 03/21/2017 [1140910.816552] 0000000000000286 312c35d8d7e796cb ffff880637cef9d0 ffffffff8140b481 [1140910.816554] ffff880637cefa18 ffffffff81d02fe8 ffff880637cefa08 ffffffff81085432 [1140910.816555] ffff880108206400 ffff880637cefb6c ffff880825129b88 ffff880637cefd88 [1140910.816557] Call Trace: [1140910.816563] [<ffffffff8140b481>] dump_stack+0x63/0x82 [1140910.816567] [<ffffffff81085432>] warn_slowpath_common+0x82/0xc0 [1140910.816569] [<ffffffff810854cc>] warn_slowpath_fmt+0x5c/0x80 [1140910.816571] [<ffffffff81397ebc>] ? label_match.constprop.9+0x3dc/0x6c0 [1140910.816573] [<ffffffff813a696e>] aa_audit_file+0x16e/0x180 [1140910.816575] [<ffffffff813982dd>] profile_onexec+0x13d/0x3d0 [1140910.816577] [<ffffffff8139a33e>] handle_onexec+0x10e/0x10d0 [1140910.816581] [<ffffffff81242957>] ? vfs_getxattr_alloc+0x67/0x100 [1140910.816584] [<ffffffff81355395>] ? cap_inode_getsecurity+0x95/0x220 [1140910.816588] [<ffffffff8135965d>] ? security_inode_getsecurity+0x5d/0x70 [1140910.816590] [<ffffffff8139b417>] apparmor_bprm_set_creds+0x117/0xa60 [1140910.816591] [<ffffffff81242a8e>] ? vfs_getxattr+0x9e/0xb0 [1140910.816595] [<ffffffffc05be712>] ? ovl_getxattr+0x52/0xb0 [overlay] [1140910.816597] [<ffffffff8135619d>] ? get_vfs_caps_from_disk+0x7d/0x180 [1140910.816599] [<ffffffff81356343>] ? cap_bprm_set_creds+0xa3/0x5f0 [1140910.816601] [<ffffffff81358909>] security_bprm_set_creds+0x39/0x50 [1140910.816605] [<ffffffff812229d5>] prepare_binprm+0x85/0x190 [1140910.816607] [<ffffffff812240f4>] do_execveat_common.isra.31+0x4b4/0x770 [1140910.816610] [<ffffffff8122460a>] SyS_execve+0x3a/0x50 [1140910.816613] [<ffffffff81863ed5>] stub_execve+0x5/0x5 [1140910.816615] [<ffffffff81863b5b>] ? entry_SYSCALL_64_fastpath+0x22/0xcb [1140910.816616] ---[ end trace cf4320c1d43eedd8 ]--- This is because the error is being audited as if onexec was not denied this triggering the AA_BUG check. BugLink: http://bugs.launchpad.net/bugs/1838627Signed-off-by: John Johansen <john.johansen@canonical.com> Acked-by: Kleber Souza <kleber.souza@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
John Johansen authored
When an open file with cached permissions is checked for the flock permission. The cache check fails and falls through to no error instead of auditing, and returning an error. For the fall through to do a permission check, so it will audit the failed flock permission check. BugLink: https://bugs.launchpad.net/bugs/1838090 BugLink: https://bugs.launchpad.net/bugs/1658219Signed-off-by: John Johansen <john.johansen@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Andrea Righi authored
bcache_allocator() can call the following: bch_allocator_thread() -> bch_prio_write() -> bch_bucket_alloc() -> wait on &ca->set->bucket_wait But the wake up event on bucket_wait is supposed to come from bch_allocator_thread() itself => deadlock: [ 1158.490744] INFO: task bcache_allocato:15861 blocked for more than 10 seconds. [ 1158.495929] Not tainted 5.3.0-050300rc3-generic #201908042232 [ 1158.500653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1158.504413] bcache_allocato D 0 15861 2 0x80004000 [ 1158.504419] Call Trace: [ 1158.504429] __schedule+0x2a8/0x670 [ 1158.504432] schedule+0x2d/0x90 [ 1158.504448] bch_bucket_alloc+0xe5/0x370 [bcache] [ 1158.504453] ? wait_woken+0x80/0x80 [ 1158.504466] bch_prio_write+0x1dc/0x390 [bcache] [ 1158.504476] bch_allocator_thread+0x233/0x490 [bcache] [ 1158.504491] kthread+0x121/0x140 [ 1158.504503] ? invalidate_buckets+0x890/0x890 [bcache] [ 1158.504506] ? kthread_park+0xb0/0xb0 [ 1158.504510] ret_from_fork+0x35/0x40 Fix by making the call to bch_prio_write() non-blocking, so that bch_allocator_thread() never waits on itself. Moreover, make sure to wake up the garbage collector thread when bch_prio_write() is failing to allocate buckets. BugLink: https://bugs.launchpad.net/bugs/1784665 BugLink: https://bugs.launchpad.net/bugs/1796292Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Andy Shevchenko authored
BugLink: https://bugs.launchpad.net/bugs/1784665 There is couple of functions that are used exclusively in sysfs.c. Move it to there and make them static. Besides above, it will allow further clean up. No functional change intended. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit ecb37ce9) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 Add more annotations for sparse to inform it about which functions do not have the same number of spin_lock() and spin_unlock() calls. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 20d3a518) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 This patch does not change any functionality. Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 42361469) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit f0d38140) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 Avoid that building with W=1 triggers warnings about the kernel-doc headers. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 47344e33) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 This patch avoids that building with W=1 triggers complaints about switch fall-throughs. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 9dfbdec7) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 Make it possible for the compiler to verify the consistency of the format string passed to __bch_check_keys() and the arguments that should be formatted according to that format string. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 4a4e4438) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 This patch avoids that smatch complains about inconsistent indentation. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit fd01991d) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Tang Junhui authored
BugLink: https://bugs.launchpad.net/bugs/1784665 In bch_mca_scan(), There are some confusion and logical error in the use of loop variables. In this patch, we clarify them as: 1) nr: the number of btree nodes needs to scan, which will decrease after we scan a btree node, and should not be less than 0; 2) i: the number of btree nodes have scanned, includes both btree_cache_freeable and btree_cache, which should not be bigger than btree_cache_used; 3) freed: the number of btree nodes have freed. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit ca71df31) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Tang Junhui authored
BugLink: https://bugs.launchpad.net/bugs/1784665 In bch_mca_scan(), the return value should not be the number of freed btree nodes, but the number of pages of freed btree nodes. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit f3641c3a) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Tang Junhui authored
BugLink: https://bugs.launchpad.net/bugs/1784665 Stripe size is shown as zero when no strip in back end device: [root@ceph132 ~]# cat /sys/block/sdd/bcache/stripe_size 0.0k Actually it should be 1T Bytes (1 << 31 sectors), but in sysfs interface, stripe_size was changed from sectors to bytes, and move 9 bits left, so the 32 bits variable overflows. This patch change the variable to a 64 bits type before moving bits. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 688892b3) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Tang Junhui authored
BugLink: https://bugs.launchpad.net/bugs/1784665 After long time small writing I/O running, we found the occupancy of CPU is very high and I/O performance has been reduced by about half: [root@ceph151 internal]# top top - 15:51:05 up 1 day,2:43, 4 users, load average: 16.89, 15.15, 16.53 Tasks: 2063 total, 4 running, 2059 sleeping, 0 stopped, 0 zombie %Cpu(s):4.3 us, 17.1 sy 0.0 ni, 66.1 id, 12.0 wa, 0.0 hi, 0.5 si, 0.0 st KiB Mem : 65450044 total, 24586420 free, 38909008 used, 1954616 buff/cache KiB Swap: 65667068 total, 65667068 free, 0 used. 25136812 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2023 root 20 0 0 0 0 S 55.1 0.0 0:04.42 kworker/11:191 14126 root 20 0 0 0 0 S 42.9 0.0 0:08.72 kworker/10:3 9292 root 20 0 0 0 0 S 30.4 0.0 1:10.99 kworker/6:1 8553 ceph 20 0 4242492 1.805g 18804 S 30.0 2.9 410:07.04 ceph-osd 12287 root 20 0 0 0 0 S 26.7 0.0 0:28.13 kworker/7:85 31019 root 20 0 0 0 0 S 26.1 0.0 1:30.79 kworker/22:1 1787 root 20 0 0 0 0 R 25.7 0.0 5:18.45 kworker/8:7 32169 root 20 0 0 0 0 S 14.5 0.0 1:01.92 kworker/23:1 21476 root 20 0 0 0 0 S 13.9 0.0 0:05.09 kworker/1:54 2204 root 20 0 0 0 0 S 12.5 0.0 1:25.17 kworker/9:10 16994 root 20 0 0 0 0 S 12.2 0.0 0:06.27 kworker/5:106 15714 root 20 0 0 0 0 R 10.9 0.0 0:01.85 kworker/19:2 9661 ceph 20 0 4246876 1.731g 18800 S 10.6 2.8 403:00.80 ceph-osd 11460 ceph 20 0 4164692 2.206g 18876 S 10.6 3.5 360:27.19 ceph-osd 9960 root 20 0 0 0 0 S 10.2 0.0 0:02.75 kworker/2:139 11699 ceph 20 0 4169244 1.920g 18920 S 10.2 3.1 355:23.67 ceph-osd 6843 ceph 20 0 4197632 1.810g 18900 S 9.6 2.9 380:08.30 ceph-osd The kernel work consumed a lot of CPU, and I found they are running journal work, The journal is reclaiming source and flush btree node with surprising frequency. Through further analysis, we found that in btree_flush_write(), we try to get a btree node with the smallest fifo idex to flush by traverse all the btree nodein c->bucket_hash, after we getting it, since no locker protects it, this btree node may have been written to cache device by other works, and if this occurred, we retry to traverse in c->bucket_hash and get another btree node. When the problem occurrd, the retry times is very high, and we consume a lot of CPU in looking for a appropriate btree node. In this patch, we try to record 128 btree nodes with the smallest fifo idex in heap, and pop one by one when we need to flush btree node. It greatly reduces the time for the loop to find the appropriate BTREE node, and also reduce the occupancy of CPU. [note by mpl: this triggers a checkpatch error because of adjacent, pre-existing style violations] Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit c4dc2497) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Tang Junhui authored
BugLink: https://bugs.launchpad.net/bugs/1784665 Sometimes, Journal takes up a lot of CPU, we need statistics to know what's the journal is doing. So this patch provide some journal statistics: 1) reclaim: how many times the journal try to reclaim resource, usually the journal bucket or/and the pin are exhausted. 2) flush_write: how many times the journal try to flush btree node to cache device, usually the journal bucket are exhausted. 3) retry_flush_write: how many times the journal retry to flush the next btree node, usually the previous tree node have been flushed by other thread. we show these statistic by sysfs interface. Through these statistics We can totally see the status of journal module when the CPU is too high. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit a728eacb) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Coly Li authored
BugLink: https://bugs.launchpad.net/bugs/1784665 This patch tries to release mutex bch_register_lock early, to give chance to stop cache set and bcache device early. This patch also expends time out of stopping all bcache device from 2 seconds to 10 seconds, because stopping writeback rate update worker may delay for 5 seconds, 2 seconds is not enough. After this patch applied, stopping bcache devices during system reboot or shutdown is very hard to be observed any more. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit eb8cbb6d) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
- 12 Aug, 2019 7 commits
-
-
Jason Wang authored
This patch will check the weight and exit the loop if we exceeds the weight. This is useful for preventing scsi kthread from hogging cpu which is guest triggerable. This addresses CVE-2019-3900. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Fixes: 057cbf49 ("tcm_vhost: Initial merge for vhost level target fabric driver") Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> CVE-2019-3900 (backported from commit c1ea02f1) [tyhicks: Backport to Xenial: - Minor context adjustment in local variables - Adjust context around the loop in vhost_scsi_handle_vq() - No need to modify vhost_scsi_ctl_handle_vq() since it was added later in commit 0d02dbd6 ("vhost/scsi: Respond to control queue operations")] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Jason Wang authored
When the rx buffer is too small for a packet, we will discard the vq descriptor and retry it for the next packet: while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk, &busyloop_intr))) { ... /* On overrun, truncate and discard */ if (unlikely(headcount > UIO_MAXIOV)) { iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1); err = sock->ops->recvmsg(sock, &msg, 1, MSG_DONTWAIT | MSG_TRUNC); pr_debug("Discarded rx packet: len %zd\n", sock_len); continue; } ... } This makes it possible to trigger a infinite while..continue loop through the co-opreation of two VMs like: 1) Malicious VM1 allocate 1 byte rx buffer and try to slow down the vhost process as much as possible e.g using indirect descriptors or other. 2) Malicious VM2 generate packets to VM1 as fast as possible Fixing this by checking against weight at the end of RX and TX loop. This also eliminate other similar cases when: - userspace is consuming the packets in the meanwhile - theoretical TOCTOU attack if guest moving avail index back and forth to hit the continue after vhost find guest just add new buffers This addresses CVE-2019-3900. Fixes: d8316f39 ("vhost: fix total length when packets are too short") Fixes: 3a4d5c94 ("vhost_net: a kernel-level virtio server") Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> CVE-2019-3900 (backported from commit e2412c07) [tyhicks: Backport to Xenial: - Adjust handle_tx() instead of handle_tx_{copy,zerocopy}() due to missing commit 0d20bdf3 ("vhost_net: split out datacopy logic") - Minor context adjustments due to a lack of missing the iov_limit member of the vhost_dev struct which was added later in commit b46a0bf7 ("vhost: fix OOB in get_rx_bufs()") - handle_rx() still uses peek_head_len() due to missing and unneeded commit 03088137 ("vhost_net: basic polling support") - Context adjustment in call to vhost_log_write() in hunk #3 of net.c due to missing and unneeded commit cc5e7107 ("vhost: log dirty page correctly") - Context adjustment in hunk #4 due to using break instead of goto out - Context adjustment in hunk #5 due to missing and unneeded commit c67df11f ("vhost_net: try batch dequing from skb array")] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Jason Wang authored
We used to have vhost_exceeds_weight() for vhost-net to: - prevent vhost kthread from hogging the cpu - balance the time spent between TX and RX This function could be useful for vsock and scsi as well. So move it to vhost.c. Device must specify a weight which counts the number of requests, or it can also specific a byte_weight which counts the number of bytes that has been processed. Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> CVE-2019-3900 (backported from commit e82b9b07) [tyhicks: Backport to Xenial: - Adjust handle_tx() instead of handle_tx_{copy,zerocopy}() due to missing commit 0d20bdf3 ("vhost_net: split out datacopy logic") - Considerable context adjustments throughout the patch due to a lack of missing the iov_limit member of the vhost_dev struct which was added later in commit b46a0bf7 ("vhost: fix OOB in get_rx_bufs()") - Context adjustment in call to vhost_log_write() in hunk #3 of net.c due to missing and unneeded commit cc5e7107 ("vhost: log dirty page correctly") - Context adjustment in hunk #3 of net.c due to using break instead of goto out - Context adjustment in hunk #4 of net.c due to missing and unneeded commit c67df11f ("vhost_net: try batch dequing from skb array") - Don't patch vsock.c since Xenial doesn't have vhost vsock support - Adjust context in vhost_dev_init() to account for different local variables - Adjust context in struct vhost_dev to account for different struct members] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Jason Wang authored
Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> CVE-2019-3900 (backported from commit 272f35cb) [tyhicks: Backport to Xenial: - Minor context adjustment in net.c due to missing commit b0d0ea50 ("vhost_net: introduce helper to initialize tx iov iter") - Context adjustment in call to vhost_log_write() in hunk #4 due to missing and unneeded commit cc5e7107 ("vhost: log dirty page correctly") - Context adjustment in hunk #4 due to using break instead of goto out] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Paolo Abeni authored
Similar to commit a2ac9990 ("vhost-net: set packet weight of tx polling to 2 * vq size"), we need a packet-based limit for handler_rx, too - elsewhere, under rx flood with small packets, tx can be delayed for a very long time, even without busypolling. The pkt limit applied to handle_rx must be the same applied by handle_tx, or we will get unfair scheduling between rx and tx. Tying such limit to the queue length makes it less effective for large queue length values and can introduce large process scheduler latencies, so a constant valued is used - likewise the existing bytes limit. The selected limit has been validated with PVP[1] performance test with different queue sizes: queue size 256 512 1024 baseline 366 354 362 weight 128 715 723 670 weight 256 740 745 733 weight 512 600 460 583 weight 1024 423 427 418 A packet weight of 256 gives peek performances in under all the tested scenarios. No measurable regression in unidirectional performance tests has been detected. [1] https://developers.redhat.com/blog/2017/06/05/measuring-and-comparing-open-vswitch-performance/Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> CVE-2019-3900 (backported from commit db688c24) [tyhicks: Backport to Xenial: - Context adjustment in call to mutex_lock_nested() in hunk #3 due to missing and unneeded commit aaa3149b ("vhost_net: add missing lock nesting notation") - Context adjustment in call to vhost_log_write() in hunk #4 due to missing and unneeded commit cc5e7107 ("vhost: log dirty page correctly") - Context adjustment in hunk #4 due to using break instead of goto out] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
haibinzhang(张海斌) authored
handle_tx will delay rx for tens or even hundreds of milliseconds when tx busy polling udp packets with small length(e.g. 1byte udp payload), because setting VHOST_NET_WEIGHT takes into account only sent-bytes but no single packet length. Ping-Latencies shown below were tested between two Virtual Machines using netperf (UDP_STREAM, len=1), and then another machine pinged the client: vq size=256 Packet-Weight Ping-Latencies(millisecond) min avg max Origin 3.319 18.489 57.303 64 1.643 2.021 2.552 128 1.825 2.600 3.224 256 1.997 2.710 4.295 512 1.860 3.171 4.631 1024 2.002 4.173 9.056 2048 2.257 5.650 9.688 4096 2.093 8.508 15.943 vq size=512 Packet-Weight Ping-Latencies(millisecond) min avg max Origin 6.537 29.177 66.245 64 2.798 3.614 4.403 128 2.861 3.820 4.775 256 3.008 4.018 4.807 512 3.254 4.523 5.824 1024 3.079 5.335 7.747 2048 3.944 8.201 12.762 4096 4.158 11.057 19.985 Seems pretty consistent, a small dip at 2 VQ sizes. Ring size is a hint from device about a burst size it can tolerate. Based on benchmarks, set the weight to 2 * vq size. To evaluate this change, another tests were done using netperf(RR, TX) between two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz, and vq size was tweaked through qemu. Results shown below does not show obvious changes. vq size=256 TCP_RR vq size=512 TCP_RR size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize% 1/ 1/ -7%/ -2% 1/ 1/ 0%/ -2% 1/ 4/ +1%/ 0% 1/ 4/ +1%/ 0% 1/ 8/ +1%/ -2% 1/ 8/ 0%/ +1% 64/ 1/ -6%/ 0% 64/ 1/ +7%/ +3% 64/ 4/ 0%/ +2% 64/ 4/ -1%/ +1% 64/ 8/ 0%/ 0% 64/ 8/ -1%/ -2% 256/ 1/ -3%/ -4% 256/ 1/ -4%/ -2% 256/ 4/ +3%/ +4% 256/ 4/ +1%/ +2% 256/ 8/ +2%/ 0% 256/ 8/ +1%/ -1% vq size=256 UDP_RR vq size=512 UDP_RR size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize% 1/ 1/ -5%/ +1% 1/ 1/ -3%/ -2% 1/ 4/ +4%/ +1% 1/ 4/ -2%/ +2% 1/ 8/ -1%/ -1% 1/ 8/ -1%/ 0% 64/ 1/ -2%/ -3% 64/ 1/ +1%/ +1% 64/ 4/ -5%/ -1% 64/ 4/ +2%/ 0% 64/ 8/ 0%/ -1% 64/ 8/ -2%/ +1% 256/ 1/ +7%/ +1% 256/ 1/ -7%/ 0% 256/ 4/ +1%/ +1% 256/ 4/ -3%/ -4% 256/ 8/ +2%/ +2% 256/ 8/ +1%/ +1% vq size=256 TCP_STREAM vq size=512 TCP_STREAM size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize% 64/ 1/ 0%/ -3% 64/ 1/ 0%/ 0% 64/ 4/ +3%/ -1% 64/ 4/ -2%/ +4% 64/ 8/ +9%/ -4% 64/ 8/ -1%/ +2% 256/ 1/ +1%/ -4% 256/ 1/ +1%/ +1% 256/ 4/ -1%/ -1% 256/ 4/ -3%/ 0% 256/ 8/ +7%/ +5% 256/ 8/ -3%/ 0% 512/ 1/ +1%/ 0% 512/ 1/ -1%/ -1% 512/ 4/ +1%/ -1% 512/ 4/ 0%/ 0% 512/ 8/ +7%/ -5% 512/ 8/ +6%/ -1% 1024/ 1/ 0%/ -1% 1024/ 1/ 0%/ +1% 1024/ 4/ +3%/ 0% 1024/ 4/ +1%/ 0% 1024/ 8/ +8%/ +5% 1024/ 8/ -1%/ 0% 2048/ 1/ +2%/ +2% 2048/ 1/ -1%/ 0% 2048/ 4/ +1%/ 0% 2048/ 4/ 0%/ -1% 2048/ 8/ -2%/ 0% 2048/ 8/ 5%/ -1% 4096/ 1/ -2%/ 0% 4096/ 1/ -2%/ 0% 4096/ 4/ +2%/ 0% 4096/ 4/ 0%/ 0% 4096/ 8/ +9%/ -2% 4096/ 8/ -5%/ -1% Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Haibin Zhang <haibinzhang@tencent.com> Signed-off-by: Yunfang Tai <yunfangtai@tencent.com> Signed-off-by: Lidong Chen <lidongchen@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net> CVE-2019-3900 (cherry picked from commit a2ac9990) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Willem de Bruijn authored
Vhost-net has a hard limit on the number of zerocopy skbs in flight. When reached, transmission stalls. Stalls cause latency, as well as head-of-line blocking of other flows that do not use zerocopy. Instead of stalling, revert to copy-based transmission. Tested by sending two udp flows from guest to host, one with payload of VHOST_GOODCOPY_LEN, the other too small for zerocopy (1B). The large flow is redirected to a netem instance with 1MBps rate limit and deep 1000 entry queue. modprobe ifb ip link set dev ifb0 up tc qdisc add dev ifb0 root netem limit 1000 rate 1MBit tc qdisc add dev tap0 ingress tc filter add dev tap0 parent ffff: protocol ip \ u32 match ip dport 8000 0xffff \ action mirred egress redirect dev ifb0 Before the delay, both flows process around 80K pps. With the delay, before this patch, both process around 400. After this patch, the large flow is still rate limited, while the small reverts to its original rate. See also discussion in the first link, below. Without rate limiting, {1, 10, 100}x TCP_STREAM tests continued to send at 100% zerocopy. The limit in vhost_exceeds_maxpend must be carefully chosen. With vq->num >> 1, the flows remain correlated. This value happens to correspond to VHOST_MAX_PENDING for vq->num == 256. Allow smaller fractions and ensure correctness also for much smaller values of vq->num, by testing the min() of both explicitly. See also the discussion in the second link below. Changes v1 -> v2 - replaced min with typed min_t - avoid unnecessary whitespace change Link:http://lkml.kernel.org/r/CAF=yD-+Wk9sc9dXMUq1+x_hh=3ThTXa6BnZkygP3tgVpjbp93g@mail.gmail.com Link:http://lkml.kernel.org/r/20170819064129.27272-1-den@klaipeden.comSigned-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> CVE-2019-3900 (cherry picked from commit 1e6f7453) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-