- 13 Aug, 2019 21 commits
-
-
Eric Dumazet authored
CVE-2019-10638 According to Amit Klein and Benny Pinkas, IP ID generation is too weak and might be used by attackers. Even with the recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix()), having a 64bit key and Jenkins hash is risky. It is time to switch to siphash and its 128bit keys. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Amit Klein <aksecurity@gmail.com> Reported-by: Benny Pinkas <benny@pinkas.net> Signed-off-by: David S. Miller <davem@davemloft.net> (backported from commit df453700) [ Connor Kuehl: Adjusted patch to communicate the id return value through the skb as the function signature for ipv6_proxy_select_ident is still void (whereas the patch context expects it to return a value). This function signature change doesn't happen until upstream commit: 0c19f846 "net: accept UFO datagrams from tuntap and packet" ] Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Acked-by: Kleber Souza <kleber.souza@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Jason A. Donenfeld authored
CVE-2019-10638 SipHash is a 64-bit keyed hash function that is actually a cryptographically secure PRF, like HMAC. Except SipHash is super fast, and is meant to be used as a hashtable keyed lookup function, or as a general PRF for short input use cases, such as sequence numbers or RNG chaining. For the first usage: There are a variety of attacks known as "hashtable poisoning" in which an attacker forms some data such that the hash of that data will be the same, and then proceeds to fill up all entries of a hashbucket. This is a realistic and well-known denial-of-service vector. Currently hashtables use jhash, which is fast but not secure, and some kind of rotating key scheme (or none at all, which isn't good). SipHash is meant as a replacement for jhash in these cases. There are a number of places in the kernel that are vulnerable to hashtable poisoning attacks, either via userspace vectors or network vectors, and there's not a reliable mechanism inside the kernel at the moment to fix it. The first step toward fixing these issues is actually getting a secure primitive into the kernel for developers to use. Then we can, bit by bit, port things over to it as deemed appropriate. While SipHash is extremely fast for a cryptographically secure function, it is likely a bit slower than the insecure jhash, and so replacements will be evaluated on a case-by-case basis based on whether or not the difference in speed is negligible and whether or not the current jhash usage poses a real security risk. For the second usage: A few places in the kernel are using MD5 or SHA1 for creating secure sequence numbers, syn cookies, port numbers, or fast random numbers. SipHash is a faster, more fitting, and more secure replacement for MD5 in those situations. Replacing MD5 and SHA1 with SipHash for these uses is obvious and straight-forward, and so is submitted along with this patch series. There shouldn't be much of a debate over its efficacy. Dozens of languages are already using this internally for their hash tables and PRFs. Some of the BSDs already use this in their kernels. SipHash is a widely known high-speed solution to a widely known set of problems, and it's time we catch up. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Eric Biggers <ebiggers3@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> (backported from commit 2c956a60) [ Connor Kuehl: Minor offset adjustments required due to the high traffic nature of things like Kconfig and Makefiles. Had to make sure the proper siphash entries made it in to both files since the patch context that surrounds it is so different. ] Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Acked-by: Kleber Souza <kleber.souza@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
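As a rough illustration of the keyed-hash idea this series introduces, a minimal sketch against the siphash API it adds (siphash_key_t, siphash() and get_random_bytes() are from the patch and linux/random.h; hash_key and bucket_of() are hypothetical names):

    #include <linux/siphash.h>
    #include <linux/random.h>

    static siphash_key_t hash_key __read_mostly;

    static void hash_key_init(void)
    {
            /* 128-bit secret key: an attacker who cannot learn it cannot
             * precompute colliding inputs to poison a single bucket */
            get_random_bytes(&hash_key, sizeof(hash_key));
    }

    static u32 bucket_of(const void *data, size_t len, u32 nbuckets)
    {
            return (u32)siphash(data, len, &hash_key) % nbuckets;
    }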
-
Connor Kuehl authored
CVE-2019-10638 Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Acked-by: Kleber Souza <kleber.souza@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
John Johansen authored
This is a backport of a fix that landed as part of a larger patch in 4.17 commit 9fcf78cc ("apparmor: update domain transitions that are subsets of confinement at nnp") Domain transitions that add a new profile to the confinement stack when under NO NEW PRIVS are allowed, as they cannot expand privileges. However, such transitions are failing due to how/where the subset test is being applied. Applying the test per profile in the profile transition and profile_onexec callbacks is incorrect, as it disregards the other profiles in the stack and so cannot correctly determine whether the old confinement stack is a subset of the new confinement stack. Move the test to after the new confinement stack is constructed. BugLink: http://bugs.launchpad.net/bugs/1839037 Signed-off-by: John Johansen <john.johansen@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
John Johansen authored
v2: Also add the fix to profile_transition(). There are 2 cases where a denial in onexec profile transitions can occur that result in an apparmor WARN traceback. The first occurs if onexec is denied by policy, the second if onexec fails due to no-new-privs. A similar failure can occur in profile_transition() when directed to perform a stack, resulting in a similar traceback with handle_onexec() replaced by profile_transition().

[1140910.816457] ------------[ cut here ]------------
[1140910.816466] WARNING: CPU: 4 PID: 32497 at /build/linux-UdetSb/linux-4.4.0/security/apparmor/file.c:136 aa_audit_file+0x16e/0x180()
[1140910.816467] AppArmor WARN aa_audit_file: ((!(&sa)->apparmor_audit_data->request)):
[1140910.816469] Modules linked in:
[1140910.816470] xt_mark xt_comment ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs xt_REDIRECT nf_nat_redirect xt_nat veth btrfs xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c msr nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter pci_stub vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) xt_CHECKSUM iptable_mangle rfcomm ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables xt_multiport iptable_filter ip_tables x_tables aufs overlay bnep uvcvideo videobuf2_vmalloc btusb videobuf2_memops videobuf2_v4l2 btrtl btbcm videobuf2_core btintel v4l2_common bluetooth videodev media binfmt_misc arc4
[1140910.816508] iwlmvm mac80211 intel_rapl snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek snd_hda_codec_generic iwlwifi joydev input_leds serio_raw cfg80211 snd_hda_intel snd_hda_codec snd_hda_core lpc_ich snd_hwdep thinkpad_acpi nvram snd_pcm snd_seq_midi mei_me snd_seq_midi_event shpchp ie31200_edac mei snd_rawmidi edac_core snd_seq wmi snd_seq_device snd_timer snd soundcore kvm_intel mac_hid kvm irqbypass coretemp parport_pc ppdev lp parport autofs4 drbg ansi_cprng algif_skcipher af_alg dm_crypt hid_generic hid_logitech_hidpp hid_logitech_dj usbhid hid uas usb_storage crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel i915 aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd i2c_algo_bit drm_kms_helper psmouse syscopyarea sysfillrect ahci sysimgblt e1000e
[1140910.816544] fb_sys_fops libahci sdhci_pci drm sdhci ptp pps_core fjes video
[1140910.816549] CPU: 4 PID: 32497 Comm: runc:[2:INIT] Tainted: G W OE 4.4.0-151-generic #178-Ubuntu
[1140910.816551] Hardware name: LENOVO 20EFCTO1WW/20EFCTO1WW, BIOS GNET82WW (2.30 ) 03/21/2017
[1140910.816552] 0000000000000286 312c35d8d7e796cb ffff880637cef9d0 ffffffff8140b481
[1140910.816554] ffff880637cefa18 ffffffff81d02fe8 ffff880637cefa08 ffffffff81085432
[1140910.816555] ffff880108206400 ffff880637cefb6c ffff880825129b88 ffff880637cefd88
[1140910.816557] Call Trace:
[1140910.816563] [<ffffffff8140b481>] dump_stack+0x63/0x82
[1140910.816567] [<ffffffff81085432>] warn_slowpath_common+0x82/0xc0
[1140910.816569] [<ffffffff810854cc>] warn_slowpath_fmt+0x5c/0x80
[1140910.816571] [<ffffffff81397ebc>] ? label_match.constprop.9+0x3dc/0x6c0
[1140910.816573] [<ffffffff813a696e>] aa_audit_file+0x16e/0x180
[1140910.816575] [<ffffffff813982dd>] profile_onexec+0x13d/0x3d0
[1140910.816577] [<ffffffff8139a33e>] handle_onexec+0x10e/0x10d0
[1140910.816581] [<ffffffff81242957>] ? vfs_getxattr_alloc+0x67/0x100
[1140910.816584] [<ffffffff81355395>] ? cap_inode_getsecurity+0x95/0x220
[1140910.816588] [<ffffffff8135965d>] ? security_inode_getsecurity+0x5d/0x70
[1140910.816590] [<ffffffff8139b417>] apparmor_bprm_set_creds+0x117/0xa60
[1140910.816591] [<ffffffff81242a8e>] ? vfs_getxattr+0x9e/0xb0
[1140910.816595] [<ffffffffc05be712>] ? ovl_getxattr+0x52/0xb0 [overlay]
[1140910.816597] [<ffffffff8135619d>] ? get_vfs_caps_from_disk+0x7d/0x180
[1140910.816599] [<ffffffff81356343>] ? cap_bprm_set_creds+0xa3/0x5f0
[1140910.816601] [<ffffffff81358909>] security_bprm_set_creds+0x39/0x50
[1140910.816605] [<ffffffff812229d5>] prepare_binprm+0x85/0x190
[1140910.816607] [<ffffffff812240f4>] do_execveat_common.isra.31+0x4b4/0x770
[1140910.816610] [<ffffffff8122460a>] SyS_execve+0x3a/0x50
[1140910.816613] [<ffffffff81863ed5>] stub_execve+0x5/0x5
[1140910.816615] [<ffffffff81863b5b>] ? entry_SYSCALL_64_fastpath+0x22/0xcb
[1140910.816616] ---[ end trace cf4320c1d43eedd8 ]---

This is because the error is being audited as if onexec was not denied, thus triggering the AA_BUG check. BugLink: http://bugs.launchpad.net/bugs/1838627 Signed-off-by: John Johansen <john.johansen@canonical.com> Acked-by: Kleber Souza <kleber.souza@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
John Johansen authored
When an open file with cached permissions is checked for the flock permission, the cache check fails and falls through to returning no error instead of auditing and returning an error. Fix the fall-through to do a permission check, so that it audits the failed flock permission check. BugLink: https://bugs.launchpad.net/bugs/1838090 BugLink: https://bugs.launchpad.net/bugs/1658219 Signed-off-by: John Johansen <john.johansen@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Andrea Righi authored
bcache_allocator() can call the following:

  bch_allocator_thread()
   -> bch_prio_write()
      -> bch_bucket_alloc()
         -> wait on &ca->set->bucket_wait

But the wake up event on bucket_wait is supposed to come from bch_allocator_thread() itself => deadlock:

[ 1158.490744] INFO: task bcache_allocato:15861 blocked for more than 10 seconds.
[ 1158.495929] Not tainted 5.3.0-050300rc3-generic #201908042232
[ 1158.500653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1158.504413] bcache_allocato D 0 15861 2 0x80004000
[ 1158.504419] Call Trace:
[ 1158.504429] __schedule+0x2a8/0x670
[ 1158.504432] schedule+0x2d/0x90
[ 1158.504448] bch_bucket_alloc+0xe5/0x370 [bcache]
[ 1158.504453] ? wait_woken+0x80/0x80
[ 1158.504466] bch_prio_write+0x1dc/0x390 [bcache]
[ 1158.504476] bch_allocator_thread+0x233/0x490 [bcache]
[ 1158.504491] kthread+0x121/0x140
[ 1158.504503] ? invalidate_buckets+0x890/0x890 [bcache]
[ 1158.504506] ? kthread_park+0xb0/0xb0
[ 1158.504510] ret_from_fork+0x35/0x40

Fix by making the call to bch_prio_write() non-blocking, so that bch_allocator_thread() never waits on itself. Moreover, make sure to wake up the garbage collector thread when bch_prio_write() fails to allocate buckets. BugLink: https://bugs.launchpad.net/bugs/1784665 BugLink: https://bugs.launchpad.net/bugs/1796292 Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Andy Shevchenko authored
BugLink: https://bugs.launchpad.net/bugs/1784665 There are a couple of functions that are used exclusively in sysfs.c. Move them there and make them static. Besides the above, this will allow further cleanup. No functional change intended. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit ecb37ce9) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 Add more annotations for sparse to inform it about which functions do not have the same number of spin_lock() and spin_unlock() calls. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 20d3a518) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 This patch does not change any functionality. Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 42361469) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit f0d38140) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 Avoid warnings about the kernel-doc headers when building with W=1. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 47344e33) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 This patch avoids complaints about switch fall-throughs when building with W=1. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 9dfbdec7) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 Make it possible for the compiler to verify the consistency of the format string passed to __bch_check_keys() and the arguments that should be formatted according to that format string. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 4a4e4438) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Bart Van Assche authored
BugLink: https://bugs.launchpad.net/bugs/1784665 This patch avoids smatch complaints about inconsistent indentation. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit fd01991d) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Tang Junhui authored
BugLink: https://bugs.launchpad.net/bugs/1784665 In bch_mca_scan(), there is some confusion and a logical error in the use of loop variables. In this patch, we clarify them as: 1) nr: the number of btree nodes that still need to be scanned, which decreases after we scan a btree node, and should not be less than 0; 2) i: the number of btree nodes we have scanned, including both btree_cache_freeable and btree_cache, which should not be bigger than btree_cache_used; 3) freed: the number of btree nodes we have freed. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit ca71df31) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Tang Junhui authored
BugLink: https://bugs.launchpad.net/bugs/1784665 In bch_mca_scan(), the return value should not be the number of freed btree nodes, but the number of pages of freed btree nodes. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit f3641c3a) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
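A minimal sketch of what the corrected tail of bch_mca_scan() conveys (assuming bcache's c->btree_pages field; the shrinker contract is in units of pages, so the freed node count is scaled before being returned):

    /* report freed pages, not freed nodes, to the shrinker core */
    return freed * c->btree_pages;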
-
Tang Junhui authored
BugLink: https://bugs.launchpad.net/bugs/1784665 Stripe size is shown as zero when there is no stripe in the backend device: [root@ceph132 ~]# cat /sys/block/sdd/bcache/stripe_size 0.0k Actually it should be 1T bytes (1 << 31 sectors), but in the sysfs interface, stripe_size was changed from sectors to bytes and moved 9 bits left, so the 32-bit variable overflows. This patch changes the variable to a 64-bit type before shifting the bits. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 688892b3) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
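A sketch of the corrected sysfs read path (sysfs_hprint() is bcache's human-readable print helper; the exact line in this tree may differ):

    /* widen before the sectors-to-bytes shift: a 32-bit value of
     * 1 << 31 sectors shifted left by 9 would otherwise wrap */
    sysfs_hprint(stripe_size, ((uint64_t)dc->disk.stripe_size) << 9);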
-
Tang Junhui authored
BugLink: https://bugs.launchpad.net/bugs/1784665 After running small-write I/O for a long time, we found that CPU occupancy was very high and I/O performance had been reduced by about half:

[root@ceph151 internal]# top
top - 15:51:05 up 1 day, 2:43, 4 users, load average: 16.89, 15.15, 16.53
Tasks: 2063 total, 4 running, 2059 sleeping, 0 stopped, 0 zombie
%Cpu(s): 4.3 us, 17.1 sy, 0.0 ni, 66.1 id, 12.0 wa, 0.0 hi, 0.5 si, 0.0 st
KiB Mem : 65450044 total, 24586420 free, 38909008 used, 1954616 buff/cache
KiB Swap: 65667068 total, 65667068 free, 0 used. 25136812 avail Mem

  PID USER PR NI    VIRT    RES   SHR S %CPU %MEM     TIME+ COMMAND
 2023 root 20  0       0      0     0 S 55.1  0.0   0:04.42 kworker/11:191
14126 root 20  0       0      0     0 S 42.9  0.0   0:08.72 kworker/10:3
 9292 root 20  0       0      0     0 S 30.4  0.0   1:10.99 kworker/6:1
 8553 ceph 20  0 4242492 1.805g 18804 S 30.0  2.9 410:07.04 ceph-osd
12287 root 20  0       0      0     0 S 26.7  0.0   0:28.13 kworker/7:85
31019 root 20  0       0      0     0 S 26.1  0.0   1:30.79 kworker/22:1
 1787 root 20  0       0      0     0 R 25.7  0.0   5:18.45 kworker/8:7
32169 root 20  0       0      0     0 S 14.5  0.0   1:01.92 kworker/23:1
21476 root 20  0       0      0     0 S 13.9  0.0   0:05.09 kworker/1:54
 2204 root 20  0       0      0     0 S 12.5  0.0   1:25.17 kworker/9:10
16994 root 20  0       0      0     0 S 12.2  0.0   0:06.27 kworker/5:106
15714 root 20  0       0      0     0 R 10.9  0.0   0:01.85 kworker/19:2
 9661 ceph 20  0 4246876 1.731g 18800 S 10.6  2.8 403:00.80 ceph-osd
11460 ceph 20  0 4164692 2.206g 18876 S 10.6  3.5 360:27.19 ceph-osd
 9960 root 20  0       0      0     0 S 10.2  0.0   0:02.75 kworker/2:139
11699 ceph 20  0 4169244 1.920g 18920 S 10.2  3.1 355:23.67 ceph-osd
 6843 ceph 20  0 4197632 1.810g 18900 S  9.6  2.9 380:08.30 ceph-osd

The kernel workers consumed a lot of CPU, and I found they were running journal work. The journal was reclaiming resources and flushing btree nodes with surprising frequency. Through further analysis, we found that in btree_flush_write() we try to get the btree node with the smallest fifo index to flush by traversing all the btree nodes in c->bucket_hash. After getting it, since no lock protects it, this btree node may have been written to the cache device by other work, and if this occurred, we retry the traversal of c->bucket_hash and get another btree node. When the problem occurred, the retry count was very high, and we consumed a lot of CPU looking for an appropriate btree node. In this patch, we try to record the 128 btree nodes with the smallest fifo index in a heap, and pop them one by one when we need to flush a btree node. This greatly reduces the time the loop takes to find an appropriate btree node, and also reduces the CPU occupancy. [note by mpl: this triggers a checkpatch error because of adjacent, pre-existing style violations] Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit c4dc2497) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Tang Junhui authored
BugLink: https://bugs.launchpad.net/bugs/1784665 Sometimes the journal takes up a lot of CPU; we need statistics to know what the journal is doing. So this patch provides some journal statistics: 1) reclaim: how many times the journal tries to reclaim resources, usually because the journal bucket and/or the pin are exhausted; 2) flush_write: how many times the journal tries to flush a btree node to the cache device, usually because the journal buckets are exhausted; 3) retry_flush_write: how many times the journal retries to flush the next btree node, usually because the previous btree node has been flushed by another thread. We expose these statistics through the sysfs interface. Through them we can fully see the status of the journal module when CPU usage is too high. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit a728eacb) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Coly Li authored
BugLink: https://bugs.launchpad.net/bugs/1784665 This patch tries to release the mutex bch_register_lock early, to give a chance to stop the cache set and bcache device early. This patch also extends the timeout for stopping all bcache devices from 2 seconds to 10 seconds, because stopping the writeback rate update worker may be delayed by up to 5 seconds, so 2 seconds is not enough. After this patch is applied, hangs while stopping bcache devices during system reboot or shutdown are very hard to observe any more. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit eb8cbb6d) Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
- 12 Aug, 2019 12 commits
-
-
Jason Wang authored
This patch will check the weight and exit the loop if we exceed the weight. This is useful for preventing the scsi kthread from hogging the cpu, which is guest triggerable. This addresses CVE-2019-3900. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Fixes: 057cbf49 ("tcm_vhost: Initial merge for vhost level target fabric driver") Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> CVE-2019-3900 (backported from commit c1ea02f1) [tyhicks: Backport to Xenial: - Minor context adjustment in local variables - Adjust context around the loop in vhost_scsi_handle_vq() - No need to modify vhost_scsi_ctl_handle_vq() since it was added later in commit 0d02dbd6 ("vhost/scsi: Respond to control queue operations")] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Jason Wang authored
When the rx buffer is too small for a packet, we will discard the vq descriptor and retry it for the next packet:

    while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
                                                  &busyloop_intr))) {
            ...
            /* On overrun, truncate and discard */
            if (unlikely(headcount > UIO_MAXIOV)) {
                    iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
                    err = sock->ops->recvmsg(sock, &msg, 1,
                                             MSG_DONTWAIT | MSG_TRUNC);
                    pr_debug("Discarded rx packet: len %zd\n", sock_len);
                    continue;
            }
            ...
    }

This makes it possible to trigger an infinite while..continue loop through the co-operation of two VMs, like: 1) Malicious VM1 allocates a 1 byte rx buffer and tries to slow down the vhost process as much as possible, e.g. by using indirect descriptors or other means; 2) Malicious VM2 generates packets to VM1 as fast as possible. Fix this by checking against the weight at the end of the RX and TX loops. This also eliminates other similar cases when: - userspace is consuming the packets in the meanwhile - a theoretical TOCTOU attack if the guest moves the avail index back and forth to hit the continue after vhost finds the guest has just added new buffers. This addresses CVE-2019-3900. Fixes: d8316f39 ("vhost: fix total length when packets are too short") Fixes: 3a4d5c94 ("vhost_net: a kernel-level virtio server") Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> CVE-2019-3900 (backported from commit e2412c07) [tyhicks: Backport to Xenial: - Adjust handle_tx() instead of handle_tx_{copy,zerocopy}() due to missing commit 0d20bdf3 ("vhost_net: split out datacopy logic") - Minor context adjustments due to the lack of the iov_limit member of the vhost_dev struct which was added later in commit b46a0bf7 ("vhost: fix OOB in get_rx_bufs()") - handle_rx() still uses peek_head_len() due to missing and unneeded commit 03088137 ("vhost_net: basic polling support") - Context adjustment in call to vhost_log_write() in hunk #3 of net.c due to missing and unneeded commit cc5e7107 ("vhost: log dirty page correctly") - Context adjustment in hunk #4 due to using break instead of goto out - Context adjustment in hunk #5 due to missing and unneeded commit c67df11f ("vhost_net: try batch dequing from skb array")] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Jason Wang authored
We used to have vhost_exceeds_weight() for vhost-net to: - prevent the vhost kthread from hogging the cpu - balance the time spent between TX and RX This function could be useful for vsock and scsi as well, so move it to vhost.c. A device must specify a weight, which counts the number of requests, and it can also specify a byte_weight, which counts the number of bytes that have been processed. Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> CVE-2019-3900 (backported from commit e82b9b07) [tyhicks: Backport to Xenial: - Adjust handle_tx() instead of handle_tx_{copy,zerocopy}() due to missing commit 0d20bdf3 ("vhost_net: split out datacopy logic") - Considerable context adjustments throughout the patch due to the lack of the iov_limit member of the vhost_dev struct which was added later in commit b46a0bf7 ("vhost: fix OOB in get_rx_bufs()") - Context adjustment in call to vhost_log_write() in hunk #3 of net.c due to missing and unneeded commit cc5e7107 ("vhost: log dirty page correctly") - Context adjustment in hunk #3 of net.c due to using break instead of goto out - Context adjustment in hunk #4 of net.c due to missing and unneeded commit c67df11f ("vhost_net: try batch dequing from skb array") - Don't patch vsock.c since Xenial doesn't have vhost vsock support - Adjust context in vhost_dev_init() to account for different local variables - Adjust context in struct vhost_dev to account for different struct members] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
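Roughly the shape of the helper this commit moves into vhost.c, sketched from the description above (field and helper names follow the upstream commit; byte_weight is optional, so it only trips when set):

    bool vhost_exceeds_weight(struct vhost_virtqueue *vq,
                              int pkts, int total_len)
    {
            struct vhost_dev *dev = vq->dev;

            if ((dev->byte_weight && total_len >= dev->byte_weight) ||
                pkts >= dev->weight) {
                    /* requeue the work instead of hogging the cpu */
                    vhost_poll_queue(&vq->poll);
                    return true;
            }

            return false;
    }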
-
Jason Wang authored
Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> CVE-2019-3900 (backported from commit 272f35cb) [tyhicks: Backport to Xenial: - Minor context adjustment in net.c due to missing commit b0d0ea50 ("vhost_net: introduce helper to initialize tx iov iter") - Context adjustment in call to vhost_log_write() in hunk #4 due to missing and unneeded commit cc5e7107 ("vhost: log dirty page correctly") - Context adjustment in hunk #4 due to using break instead of goto out] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Paolo Abeni authored
Similar to commit a2ac9990 ("vhost-net: set packet weight of tx polling to 2 * vq size"), we need a packet-based limit for handle_rx, too - otherwise, under rx flood with small packets, tx can be delayed for a very long time, even without busypolling. The pkt limit applied to handle_rx must be the same applied by handle_tx, or we will get unfair scheduling between rx and tx. Tying such a limit to the queue length makes it less effective for large queue length values and can introduce large process scheduler latencies, so a constant value is used - likewise the existing bytes limit. The selected limit has been validated with PVP[1] performance tests with different queue sizes:

  queue size        256     512    1024
  baseline          366     354     362
  weight 128        715     723     670
  weight 256        740     745     733
  weight 512        600     460     583
  weight 1024       423     427     418

A packet weight of 256 gives peak performance in all the tested scenarios. No measurable regression in unidirectional performance tests has been detected. [1] https://developers.redhat.com/blog/2017/06/05/measuring-and-comparing-open-vswitch-performance/ Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> CVE-2019-3900 (backported from commit db688c24) [tyhicks: Backport to Xenial: - Context adjustment in call to mutex_lock_nested() in hunk #3 due to missing and unneeded commit aaa3149b ("vhost_net: add missing lock nesting notation") - Context adjustment in call to vhost_log_write() in hunk #4 due to missing and unneeded commit cc5e7107 ("vhost: log dirty page correctly") - Context adjustment in hunk #4 due to using break instead of goto out] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
haibinzhang(张海斌) authored
handle_tx will delay rx for tens or even hundreds of milliseconds when tx busy polling udp packets with small length (e.g. 1 byte udp payload), because setting VHOST_NET_WEIGHT takes into account only sent bytes but not single packet length. Ping latencies shown below were tested between two Virtual Machines using netperf (UDP_STREAM, len=1), and then another machine pinged the client:

  vq size=256
  Packet-Weight   Ping-Latencies (millisecond)
                     min        avg        max
  Origin           3.319     18.489     57.303
  64               1.643      2.021      2.552
  128              1.825      2.600      3.224
  256              1.997      2.710      4.295
  512              1.860      3.171      4.631
  1024             2.002      4.173      9.056
  2048             2.257      5.650      9.688
  4096             2.093      8.508     15.943

  vq size=512
  Packet-Weight   Ping-Latencies (millisecond)
                     min        avg        max
  Origin           6.537     29.177     66.245
  64               2.798      3.614      4.403
  128              2.861      3.820      4.775
  256              3.008      4.018      4.807
  512              3.254      4.523      5.824
  1024             3.079      5.335      7.747
  2048             3.944      8.201     12.762
  4096             4.158     11.057     19.985

Seems pretty consistent, a small dip at 2 VQ sizes. Ring size is a hint from the device about a burst size it can tolerate. Based on benchmarks, set the weight to 2 * vq size. To evaluate this change, further tests were done using netperf (RR, TX) between two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz, and vq size was tweaked through qemu. Results shown below do not show obvious changes:

  vq size=256 TCP_RR                   vq size=512 TCP_RR
  size/sessions/+thu%/+normalize%      size/sessions/+thu%/+normalize%
     1/  1/  -7%/  -2%                    1/  1/   0%/  -2%
     1/  4/  +1%/   0%                    1/  4/  +1%/   0%
     1/  8/  +1%/  -2%                    1/  8/   0%/  +1%
    64/  1/  -6%/   0%                   64/  1/  +7%/  +3%
    64/  4/   0%/  +2%                   64/  4/  -1%/  +1%
    64/  8/   0%/   0%                   64/  8/  -1%/  -2%
   256/  1/  -3%/  -4%                  256/  1/  -4%/  -2%
   256/  4/  +3%/  +4%                  256/  4/  +1%/  +2%
   256/  8/  +2%/   0%                  256/  8/  +1%/  -1%

  vq size=256 UDP_RR                   vq size=512 UDP_RR
  size/sessions/+thu%/+normalize%      size/sessions/+thu%/+normalize%
     1/  1/  -5%/  +1%                    1/  1/  -3%/  -2%
     1/  4/  +4%/  +1%                    1/  4/  -2%/  +2%
     1/  8/  -1%/  -1%                    1/  8/  -1%/   0%
    64/  1/  -2%/  -3%                   64/  1/  +1%/  +1%
    64/  4/  -5%/  -1%                   64/  4/  +2%/   0%
    64/  8/   0%/  -1%                   64/  8/  -2%/  +1%
   256/  1/  +7%/  +1%                  256/  1/  -7%/   0%
   256/  4/  +1%/  +1%                  256/  4/  -3%/  -4%
   256/  8/  +2%/  +2%                  256/  8/  +1%/  +1%

  vq size=256 TCP_STREAM               vq size=512 TCP_STREAM
  size/sessions/+thu%/+normalize%      size/sessions/+thu%/+normalize%
    64/  1/   0%/  -3%                   64/  1/   0%/   0%
    64/  4/  +3%/  -1%                   64/  4/  -2%/  +4%
    64/  8/  +9%/  -4%                   64/  8/  -1%/  +2%
   256/  1/  +1%/  -4%                  256/  1/  +1%/  +1%
   256/  4/  -1%/  -1%                  256/  4/  -3%/   0%
   256/  8/  +7%/  +5%                  256/  8/  -3%/   0%
   512/  1/  +1%/   0%                  512/  1/  -1%/  -1%
   512/  4/  +1%/  -1%                  512/  4/   0%/   0%
   512/  8/  +7%/  -5%                  512/  8/  +6%/  -1%
  1024/  1/   0%/  -1%                 1024/  1/   0%/  +1%
  1024/  4/  +3%/   0%                 1024/  4/  +1%/   0%
  1024/  8/  +8%/  +5%                 1024/  8/  -1%/   0%
  2048/  1/  +2%/  +2%                 2048/  1/  -1%/   0%
  2048/  4/  +1%/   0%                 2048/  4/   0%/  -1%
  2048/  8/  -2%/   0%                 2048/  8/   5%/  -1%
  4096/  1/  -2%/   0%                 4096/  1/  -2%/   0%
  4096/  4/  +2%/   0%                 4096/  4/   0%/   0%
  4096/  8/  +9%/  -2%                 4096/  8/  -5%/  -1%

Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Haibin Zhang <haibinzhang@tencent.com> Signed-off-by: Yunfang Tai <yunfangtai@tencent.com> Signed-off-by: Lidong Chen <lidongchen@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net> CVE-2019-3900 (cherry picked from commit a2ac9990) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Willem de Bruijn authored
Vhost-net has a hard limit on the number of zerocopy skbs in flight. When reached, transmission stalls. Stalls cause latency, as well as head-of-line blocking of other flows that do not use zerocopy. Instead of stalling, revert to copy-based transmission. Tested by sending two udp flows from guest to host, one with payload of VHOST_GOODCOPY_LEN, the other too small for zerocopy (1B). The large flow is redirected to a netem instance with 1MBps rate limit and deep 1000 entry queue.

  modprobe ifb
  ip link set dev ifb0 up
  tc qdisc add dev ifb0 root netem limit 1000 rate 1MBit

  tc qdisc add dev tap0 ingress
  tc filter add dev tap0 parent ffff: protocol ip \
      u32 match ip dport 8000 0xffff \
      action mirred egress redirect dev ifb0

Before the delay, both flows process around 80K pps. With the delay, before this patch, both process around 400. After this patch, the large flow is still rate limited, while the small reverts to its original rate. See also discussion in the first link, below. Without rate limiting, {1, 10, 100}x TCP_STREAM tests continued to send at 100% zerocopy. The limit in vhost_exceeds_maxpend must be carefully chosen. With vq->num >> 1, the flows remain correlated. This value happens to correspond to VHOST_MAX_PENDING for vq->num == 256. Allow smaller fractions and ensure correctness also for much smaller values of vq->num, by testing the min() of both explicitly. See also the discussion in the second link below. Changes v1 -> v2 - replaced min with typed min_t - avoid unnecessary whitespace change Link: http://lkml.kernel.org/r/CAF=yD-+Wk9sc9dXMUq1+x_hh=3ThTXa6BnZkygP3tgVpjbp93g@mail.gmail.com Link: http://lkml.kernel.org/r/20170819064129.27272-1-den@klaipeden.com Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> CVE-2019-3900 (cherry picked from commit 1e6f7453) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
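The resulting limit check looks roughly like this (a sketch using the vhost-net names referenced above; VHOST_MAX_PEND is the historical hard cap, and the fraction of vq->num is an assumption based on the discussion in the message):

    static bool vhost_exceeds_maxpend(struct vhost_net *net)
    {
            struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
            struct vhost_virtqueue *vq = &nvq->vq;

            /* pending zerocopy completions, measured as a ring distance;
             * compare against the min() of the hard cap and a fraction of
             * vq->num so small rings stay correct */
            return (nvq->upend_idx + UIO_MAXIOV - nvq->done_idx) % UIO_MAXIOV >
                   min_t(unsigned int, VHOST_MAX_PEND, vq->num >> 2);
    }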
-
Jason Wang authored
This patch tries to utilize tuntap rx batching by peeking the tx virtqueue during transmission; if there are more available buffers in the virtqueue, set the MSG_MORE flag as a hint for the backend (e.g. tuntap) to batch the packets. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> CVE-2019-3900 (backported from commit 0ed005ce) [tyhicks: Minor context adjustment due to missing commit 03088137 ("vhost_net: basic polling support")] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
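In handle_tx() the hint ends up looking roughly like this (a sketch combining the helpers from the surrounding patches in this series; the exact condition in this tree may differ):

    /* ask the backend to hold the packet for batching only when more
     * buffers are already queued and the zerocopy path isn't saturated */
    if (total_len < VHOST_NET_WEIGHT &&
        !vhost_vq_avail_empty(&net->dev, vq) &&
        likely(!vhost_exceeds_maxpend(net)))
            msg.msg_flags |= MSG_MORE;
    else
            msg.msg_flags &= ~MSG_MORE;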
-
Jason Wang authored
This patch introduces a helper which will return true if we're sure that the available ring is empty for a specific vq. When we're not sure, e.g. on vq access failure, return false instead. This could be used by busy polling code to exit the busy loop. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> CVE-2019-3900 (cherry picked from commit d4a60603) Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
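The helper is short; roughly (a sketch following the upstream commit):

    bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
    {
            __virtio16 avail_idx;
            int r;

            r = __get_user(avail_idx, &vq->avail->idx);
            if (r)
                    /* can't read the ring: claim "not empty" so callers
                     * stop busy polling and fall back to the normal path */
                    return false;

            return vhost16_to_cpu(vq, avail_idx) == vq->avail_idx;
    }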
-
Colin Ian King authored
BugLink: https://launchpad.net/bugs/1839521 Sync ZFS 0.6.5.6-0ubuntu28 to Xenial: Fix shrinker deadlock with xattrs (LP: #1839521) Upstream ZFS fix 31b6111fd92a ("Kill zp->z_xattr_parent to prevent pinning") and ddae16a9cf0b ("xattr dir doesn't get purged during iput") fix a deadlock in shrinker path when a xattr directory inode and its xattr inode are in the same disposal list and the xattr dir inode is evicted before the xattr inode. Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kleber Souza <kleber.souza@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Michael Neuling authored
CVE-2019-13648 On systems like P9 powernv where we have no TM (or P8 booted with ppc_tm=off), userspace can construct a signal context which still has the MSR TS bits set. The kernel tries to restore this context which results in the following crash:

Unexpected TM Bad Thing exception at c0000000000022fc (msr 0x8000000102a03031) tm_scratch=800000020280f033
Oops: Unrecoverable exception, sig: 6 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 0 PID: 1636 Comm: sigfuz Not tainted 5.2.0-11043-g0a8ad0ff #69
NIP: c0000000000022fc LR: 00007fffb2d67e48 CTR: 0000000000000000
REGS: c00000003fffbd70 TRAP: 0700 Not tainted (5.2.0-11045-g7142b497d8)
MSR: 8000000102a03031 <SF,VEC,VSX,FP,ME,IR,DR,LE,TM[E]> CR: 42004242 XER: 00000000
CFAR: c0000000000022e0 IRQMASK: 0
GPR00: 0000000000000072 00007fffb2b6e560 00007fffb2d87f00 0000000000000669
GPR04: 00007fffb2b6e728 0000000000000000 0000000000000000 00007fffb2b6f2a8
GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fffb2b76900 0000000000000000 0000000000000000
GPR16: 00007fffb2370000 00007fffb2d84390 00007fffea3a15ac 000001000a250420
GPR20: 00007fffb2b6f260 0000000010001770 0000000000000000 0000000000000000
GPR24: 00007fffb2d843a0 00007fffea3a14a0 0000000000010000 0000000000800000
GPR28: 00007fffea3a14d8 00000000003d0f00 0000000000000000 00007fffb2b6e728
NIP [c0000000000022fc] rfi_flush_fallback+0x7c/0x80
LR [00007fffb2d67e48] 0x7fffb2d67e48
Call Trace:
Instruction dump:
e96a0220 e96a02a8 e96a0330 e96a03b8 394a0400 4200ffdc 7d2903a6 e92d0c00
e94d0c08 e96d0c10 e82d0c18 7db242a6 <4c000024> 7db243a6 7db142a6 f82d0c18

The problem is the signal code assumes TM is enabled when CONFIG_PPC_TRANSACTIONAL_MEM is enabled. This may not be the case as with P9 powernv or if `ppc_tm=off` is used on P8. This means any local user can crash the system. Fix the problem by returning a bad stack frame to the user if they try to set the MSR TS bits with sigreturn() on systems where TM is not supported. Found with sigfuz kernel selftest on P9. This fixes CVE-2019-13648. Fixes: 2b0a576d ("powerpc: Add new transactional memory state to the signal context") Cc: stable@vger.kernel.org # v3.9 Reported-by: Praveen Pandey <Praveen.Pandey@in.ibm.com> Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190719050502.405-1-mikey@neuling.org (cherry picked from commit f16d80b7) Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
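The shape of the check the fix adds, as a sketch of the sigreturn restore path (assuming the upstream names MSR_TM_ACTIVE(), cpu_has_feature() and CPU_FTR_TM; the surrounding restore logic is elided):

    if (MSR_TM_ACTIVE(msr)) {
            /* TM is unavailable (e.g. P9 powernv, or ppc_tm=off on P8),
             * so a context with the TS bits set can only be forged:
             * reject the frame instead of trying to recheckpoint */
            if (!cpu_has_feature(CPU_FTR_TM))
                    goto badframe;
    }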
-
xiao jin authored
CVE-2018-20856 We found a memory use-after-free issue in __blk_drain_queue() on kernel 4.14. After reading the latest kernel 4.18-rc6, we think it has the same problem. Memory is allocated for q->fq in blk_init_allocated_queue(). If the elevator init function returns an error, it will run into the fail case and free q->fq. Then __blk_drain_queue() uses the same memory after q->fq has been freed, leading to unpredictable behavior. The patch sets q->fq to NULL in the fail case of blk_init_allocated_queue(). Fixes: commit 7c94e1c1 ("block: introduce blk_flush_queue to drive flush machinery") Cc: <stable@vger.kernel.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: xiao jin <jin.xiao@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> (backported from commit 54648cf1) [ Connor Kuehl: had to place the line from the patch in manually since the patch context disagreed with what the routine looks like now (different label, different return statement). Barely more involved than an offset adjustment. ] Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Acked-by: Khalid Elmously <khalid.elmously@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
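The fix itself is one line; its shape in blk_init_allocated_queue()'s failure path is roughly (per the backport note, the label and return statement differ in this tree):

    fail:
            blk_free_flush_queue(q->fq);
            q->fq = NULL;   /* don't leave a dangling pointer for
                             * __blk_drain_queue() to dereference */
            return NULL;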
-
- 07 Aug, 2019 2 commits
-
-
Denis Efremov authored
CVE-2019-14283 This fixes a global out-of-bounds read access in the copy_buffer function of the floppy driver. The FDDEFPRM ioctl allows one to set the geometry of a disk. The sect and head fields (unsigned int) of the floppy_struct structure are used to compute the max_sector (int) in the make_raw_rw_request function. It is possible to overflow max_sector. Next, max_sector is passed to the copy_buffer function and used in one of the memcpy calls. An unprivileged user could trigger the bug if the device is accessible, but it requires a floppy disk to be inserted. The patch adds a check that the .sect * .head multiplication does not overflow in the set_geometry function. The bug was found by syzkaller. Signed-off-by: Denis Efremov <efremov@ispras.ru> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit da99466a) Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Acked-by: Colin Ian King <colin.king@canonical.com> Acked-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
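A sketch of the added sanity check in set_geometry() (field names from struct floppy_struct; the exact surrounding conditions in the upstream patch differ):

    /* g->sect * g->head feeds max_sector (an int) in
     * make_raw_rw_request(); reject geometries that overflow it */
    if (g->sect <= 0 || g->head <= 0 ||
        (int)(g->sect * g->head) <= 0)
            return -EINVAL;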
-
Denis Efremov authored
This fixes a divide by zero error in the setup_format_params function of the floppy driver. Two consecutive ioctls can trigger the bug: the first one should set the drive geometry with .sect and .rate values such that F_SECT_PER_TRACK becomes zero, then the floppy format operation should be called. A floppy disk is not required to be inserted. An unprivileged user could trigger the bug if the device is accessible. The patch checks F_SECT_PER_TRACK for a non-zero value in the set_geometry function. The proper check would involve a reasonable upper limit for the .sect and .rate fields, but that could change the UAPI. The patch also checks F_SECT_PER_TRACK in setup_format_params, and cancels the formatting operation if it is zero. The bug was found by syzkaller. Signed-off-by: Denis Efremov <efremov@ispras.ru> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> CVE-2019-14284 (cherry picked from commit f3554aeb) Signed-off-by: Sultan Alsawaf <sultan.alsawaf@canonical.com> Acked-by: Colin Ian King <colin.king@canonical.com> Acked-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
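A sketch of the two checks described above (names from the floppy driver; the set_geometry() condition is assumed to mirror the F_SECT_PER_TRACK computation):

    /* in set_geometry(): reject geometries whose computed
     * F_SECT_PER_TRACK would be zero */
    if ((unsigned char)((g->sect << 2) >> FD_SIZECODE(g)) == 0)
            return -EINVAL;

    /* in setup_format_params(): cancel the formatting operation
     * before the value is used as a divisor */
    if (F_SECT_PER_TRACK == 0)
            return -EINVAL;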
-
- 05 Aug, 2019 5 commits
-
-
Greg Kroah-Hartman authored
BugLink: https://bugs.launchpad.net/bugs/1838467 Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Paolo Bonzini authored
BugLink: https://bugs.launchpad.net/bugs/1838467 commit 250715a6 upstream. The syzkaller folks reported a NULL pointer dereference that seems to be caused by a race between KVM_CREATE_IRQCHIP and KVM_CREATE_PIT2. The former takes kvm->lock (except when registering the devices, which needs kvm->slots_lock); the latter takes kvm->slots_lock only. Change KVM_CREATE_PIT2 to follow the same model as KVM_CREATE_IRQCHIP. Testcase:

    #include <pthread.h>
    #include <linux/kvm.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <stdint.h>
    #include <string.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    long r[23];

    void* thr(void* arg)
    {
            struct kvm_pit_config pitcfg = { .flags = 4 };
            switch ((long)arg) {
            case 0: r[2] = open("/dev/kvm", O_RDONLY|O_ASYNC); break;
            case 1: r[3] = ioctl(r[2], KVM_CREATE_VM, 0); break;
            case 2: r[4] = ioctl(r[3], KVM_CREATE_IRQCHIP, 0); break;
            case 3: r[22] = ioctl(r[3], KVM_CREATE_PIT2, &pitcfg); break;
            }
            return 0;
    }

    int main(int argc, char **argv)
    {
            long i;
            pthread_t th[4];

            memset(r, -1, sizeof(r));
            for (i = 0; i < 4; i++) {
                    pthread_create(&th[i], 0, thr, (void*)i);
                    if (argc > 1 && rand()%2) usleep(rand()%1000);
            }
            usleep(20000);
            return 0;
    }

Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Zubin Mithra <zsm@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
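A sketch of the kernel-side change described above (the KVM_CREATE_PIT2 arm of kvm_arch_vm_ioctl(); names follow the upstream commit, and the copy_from_user error handling is abbreviated):

    case KVM_CREATE_PIT2:
            r = -EFAULT;
            if (copy_from_user(&u.pit_config, argp,
                               sizeof(struct kvm_pit_config)))
                    goto out;
            /* take kvm->lock, as KVM_CREATE_IRQCHIP does; the device
             * registration inside kvm_create_pit() still uses slots_lock */
            mutex_lock(&kvm->lock);
            r = -EEXIST;
            if (kvm->arch.vpit)
                    goto create_pit_unlock;
            r = -ENOMEM;
            kvm->arch.vpit = kvm_create_pit(kvm, u.pit_config.flags);
            if (kvm->arch.vpit)
                    r = 0;
    create_pit_unlock:
            mutex_unlock(&kvm->lock);
            break;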
-
Julian Wiedmann authored
BugLink: https://bugs.launchpad.net/bugs/1838467 commit ac6639cd upstream. Current code sets the dsci to 0x00000080, which doesn't make any sense, as the indicator area is located in the _left-most_ byte. Worse: if the dsci is the _shared_ indicator, this potentially clears the indication of activity for a _different_ device. tiqdio_thinint_handler() will then have no reason to call that device's IRQ handler, and the device ends up stalling. Fixes: d0c9d4a8 ("[S390] qdio: set correct bit in dsci") Cc: <stable@vger.kernel.org> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Julian Wiedmann authored
BugLink: https://bugs.launchpad.net/bugs/1838467 commit e54e4785 upstream. When tiqdio_remove_input_queues() removes a queue from the tiq_list as part of qdio_shutdown(), it doesn't re-initialize the queue's list entry and the prev/next pointers go stale. If a subsequent qdio_establish() fails while sending the ESTABLISH cmd, it calls qdio_shutdown() again in QDIO_IRQ_STATE_ERR state and tiqdio_remove_input_queues() will attempt to remove the queue entry a second time. This dereferences the stale pointers, and bad things ensue. Fix this by re-initializing the list entry after removing it from the list. For good practice also initialize the list entry when the queue is first allocated, and remove the quirky checks that papered over this omission. Note that prior to commit e5218134 ("s390/qdio: fix access to uninitialized qdio_q fields"), these checks were bogus anyway. setup_queues_misc() clears the whole queue struct, and thus needs to re-init the prev/next pointers as well. Fixes: 779e6e1c ("[S390] qdio: new qdio driver.") Cc: <stable@vger.kernel.org> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
-
Heiko Carstens authored
BugLink: https://bugs.launchpad.net/bugs/1838467 commit 4f18d869 upstream. The stfle inline assembly returns the number of double words written (condition code 0) or the number of double words it would have written (condition code 3), if the memory array it got as a parameter had been large enough. The current stfle implementation assumes that the array is always large enough and clears those parts of the array that have not been written to with a subsequent memset call. If however the array is not large enough, memset will get a negative length parameter, which means that memset clears memory until it gets an exception and the kernel crashes. To fix this, simply limit the maximum length. Also move the inline assembly to an extra function to avoid clobbering of register 0, which might happen because of the added min_t invocation together with code instrumentation. The bug was introduced with commit 14375bc4 ("[S390] cleanup facility list handling") but was rather harmless, since it would only write to a rather large array. It became a potential problem with commit 3ab121ab ("[S390] kernel: Add z/VM LGR detection"). Since then it writes to an array with only four double words, while some machines already deliver three double words. As soon as machines have a facility bit within the fifth double word, a crash on IPL would happen. Fixes: 14375bc4 ("[S390] cleanup facility list handling") Cc: <stable@vger.kernel.org> # v2.6.37+ Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
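A sketch of the clamping described above (assuming the upstream shape: __stfle_asm() is the extracted inline assembly returning the double-word count minus one, and nr counts bytes actually written):

    /* stfle reports how many double words it wrote (or wanted to
     * write); clamp to the caller's buffer so the trailing memset
     * length can never go negative */
    stfle_size = __stfle_asm(stfle_fac_list, size) + 1;
    nr = min_t(unsigned long, stfle_size * 8, size * 8);
    memset((char *)stfle_fac_list + nr, 0, size * 8 - nr);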
-