1. 13 Sep, 2019 2 commits
  2. 27 Aug, 2019 5 commits
  3. 13 Aug, 2019 26 commits
  4. 12 Aug, 2019 7 commits
    • Jason Wang's avatar
      vhost: scsi: add weight support · 19402747
      Jason Wang authored
      This patch will check the weight and exit the loop if we exceeds the
      weight. This is useful for preventing scsi kthread from hogging cpu
      which is guest triggerable.
      
      This addresses CVE-2019-3900.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Fixes: 057cbf49 ("tcm_vhost: Initial merge for vhost level target fabric driver")
      Signed-off-by: default avatarJason Wang <jasowang@redhat.com>
      Reviewed-by: default avatarStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Reviewed-by: default avatarStefan Hajnoczi <stefanha@redhat.com>
      
      CVE-2019-3900
      
      (backported from commit c1ea02f1)
      [tyhicks: Backport to Xenial:
       - Minor context adjustment in local variables
       - Adjust context around the loop in vhost_scsi_handle_vq()
       - No need to modify vhost_scsi_ctl_handle_vq() since it was added later
         in commit 0d02dbd6 ("vhost/scsi: Respond to control queue
         operations")]
      Signed-off-by: default avatarTyler Hicks <tyhicks@canonical.com>
      Acked-by: default avatarStefan Bader <stefan.bader@canonical.com>
      Acked-by: default avatarConnor Kuehl <connor.kuehl@canonical.com>
      Signed-off-by: default avatarKhalid Elmously <khalid.elmously@canonical.com>
      19402747
    • Jason Wang's avatar
      vhost_net: fix possible infinite loop · 5479b12b
      Jason Wang authored
      When the rx buffer is too small for a packet, we will discard the vq
      descriptor and retry it for the next packet:
      
      while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
      					      &busyloop_intr))) {
      ...
      	/* On overrun, truncate and discard */
      	if (unlikely(headcount > UIO_MAXIOV)) {
      		iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
      		err = sock->ops->recvmsg(sock, &msg,
      					 1, MSG_DONTWAIT | MSG_TRUNC);
      		pr_debug("Discarded rx packet: len %zd\n", sock_len);
      		continue;
      	}
      ...
      }
      
      This makes it possible to trigger a infinite while..continue loop
      through the co-opreation of two VMs like:
      
      1) Malicious VM1 allocate 1 byte rx buffer and try to slow down the
         vhost process as much as possible e.g using indirect descriptors or
         other.
      2) Malicious VM2 generate packets to VM1 as fast as possible
      
      Fixing this by checking against weight at the end of RX and TX
      loop. This also eliminate other similar cases when:
      
      - userspace is consuming the packets in the meanwhile
      - theoretical TOCTOU attack if guest moving avail index back and forth
        to hit the continue after vhost find guest just add new buffers
      
      This addresses CVE-2019-3900.
      
      Fixes: d8316f39 ("vhost: fix total length when packets are too short")
      Fixes: 3a4d5c94 ("vhost_net: a kernel-level virtio server")
      Signed-off-by: default avatarJason Wang <jasowang@redhat.com>
      Reviewed-by: default avatarStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      
      CVE-2019-3900
      
      (backported from commit e2412c07)
      [tyhicks: Backport to Xenial:
       - Adjust handle_tx() instead of handle_tx_{copy,zerocopy}() due to
         missing commit 0d20bdf3 ("vhost_net: split out datacopy logic")
       - Minor context adjustments due to a lack of missing the iov_limit
         member of the vhost_dev struct which was added later in commit
         b46a0bf7 ("vhost: fix OOB in get_rx_bufs()")
       - handle_rx() still uses peek_head_len() due to missing and unneeded commit
         03088137 ("vhost_net: basic polling support")
       - Context adjustment in call to vhost_log_write() in hunk #3 of net.c due to
         missing and unneeded commit cc5e7107 ("vhost: log dirty page correctly")
       - Context adjustment in hunk #4 due to using break instead of goto out
       - Context adjustment in hunk #5 due to missing and unneeded commit
         c67df11f ("vhost_net: try batch dequing from skb array")]
      Signed-off-by: default avatarTyler Hicks <tyhicks@canonical.com>
      Acked-by: default avatarStefan Bader <stefan.bader@canonical.com>
      Acked-by: default avatarConnor Kuehl <connor.kuehl@canonical.com>
      Signed-off-by: default avatarKhalid Elmously <khalid.elmously@canonical.com>
      5479b12b
    • Jason Wang's avatar
      vhost: introduce vhost_exceeds_weight() · 1801314e
      Jason Wang authored
      We used to have vhost_exceeds_weight() for vhost-net to:
      
      - prevent vhost kthread from hogging the cpu
      - balance the time spent between TX and RX
      
      This function could be useful for vsock and scsi as well. So move it
      to vhost.c. Device must specify a weight which counts the number of
      requests, or it can also specific a byte_weight which counts the
      number of bytes that has been processed.
      Signed-off-by: default avatarJason Wang <jasowang@redhat.com>
      Reviewed-by: default avatarStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      
      CVE-2019-3900
      
      (backported from commit e82b9b07)
      [tyhicks: Backport to Xenial:
       - Adjust handle_tx() instead of handle_tx_{copy,zerocopy}() due to
         missing commit 0d20bdf3 ("vhost_net: split out datacopy logic")
       - Considerable context adjustments throughout the patch due to a lack
         of missing the iov_limit member of the vhost_dev struct which was
         added later in commit b46a0bf7 ("vhost: fix OOB in get_rx_bufs()")
       - Context adjustment in call to vhost_log_write() in hunk #3 of net.c due to
         missing and unneeded commit cc5e7107 ("vhost: log dirty page correctly")
       - Context adjustment in hunk #3 of net.c due to using break instead of goto
         out
       - Context adjustment in hunk #4 of net.c due to missing and unneeded commit
         c67df11f ("vhost_net: try batch dequing from skb array")
       - Don't patch vsock.c since Xenial doesn't have vhost vsock support
       - Adjust context in vhost_dev_init() to account for different local variables
       - Adjust context in struct vhost_dev to account for different struct members]
      Signed-off-by: default avatarTyler Hicks <tyhicks@canonical.com>
      Acked-by: default avatarStefan Bader <stefan.bader@canonical.com>
      Acked-by: default avatarConnor Kuehl <connor.kuehl@canonical.com>
      Signed-off-by: default avatarKhalid Elmously <khalid.elmously@canonical.com>
      1801314e
    • Jason Wang's avatar
      vhost_net: introduce vhost_exceeds_weight() · edc26183
      Jason Wang authored
      Signed-off-by: default avatarJason Wang <jasowang@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      CVE-2019-3900
      
      (backported from commit 272f35cb)
      [tyhicks: Backport to Xenial:
       - Minor context adjustment in net.c due to missing commit b0d0ea50
         ("vhost_net: introduce helper to initialize tx iov iter")
       - Context adjustment in call to vhost_log_write() in hunk #4 due to missing
         and unneeded commit cc5e7107 ("vhost: log dirty page correctly")
       - Context adjustment in hunk #4 due to using break instead of goto out]
      Signed-off-by: default avatarTyler Hicks <tyhicks@canonical.com>
      Acked-by: default avatarStefan Bader <stefan.bader@canonical.com>
      Acked-by: default avatarConnor Kuehl <connor.kuehl@canonical.com>
      Signed-off-by: default avatarKhalid Elmously <khalid.elmously@canonical.com>
      edc26183
    • Paolo Abeni's avatar
      vhost_net: use packet weight for rx handler, too · 440a2294
      Paolo Abeni authored
      Similar to commit a2ac9990 ("vhost-net: set packet weight of
      tx polling to 2 * vq size"), we need a packet-based limit for
      handler_rx, too - elsewhere, under rx flood with small packets,
      tx can be delayed for a very long time, even without busypolling.
      
      The pkt limit applied to handle_rx must be the same applied by
      handle_tx, or we will get unfair scheduling between rx and tx.
      Tying such limit to the queue length makes it less effective for
      large queue length values and can introduce large process
      scheduler latencies, so a constant valued is used - likewise
      the existing bytes limit.
      
      The selected limit has been validated with PVP[1] performance
      test with different queue sizes:
      
      queue size		256	512	1024
      
      baseline		366	354	362
      weight 128		715	723	670
      weight 256		740	745	733
      weight 512		600	460	583
      weight 1024		423	427	418
      
      A packet weight of 256 gives peek performances in under all the
      tested scenarios.
      
      No measurable regression in unidirectional performance tests has
      been detected.
      
      [1] https://developers.redhat.com/blog/2017/06/05/measuring-and-comparing-open-vswitch-performance/Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Acked-by: default avatarJason Wang <jasowang@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      CVE-2019-3900
      
      (backported from commit db688c24)
      [tyhicks: Backport to Xenial:
       - Context adjustment in call to mutex_lock_nested() in hunk #3 due to missing
         and unneeded commit aaa3149b ("vhost_net: add missing lock nesting
         notation")
       - Context adjustment in call to vhost_log_write() in hunk #4 due to missing
         and unneeded commit cc5e7107 ("vhost: log dirty page correctly")
       - Context adjustment in hunk #4 due to using break instead of goto out]
      Signed-off-by: default avatarTyler Hicks <tyhicks@canonical.com>
      Acked-by: default avatarStefan Bader <stefan.bader@canonical.com>
      Acked-by: default avatarConnor Kuehl <connor.kuehl@canonical.com>
      Signed-off-by: default avatarKhalid Elmously <khalid.elmously@canonical.com>
      440a2294
    • haibinzhang(张海斌)'s avatar
      vhost-net: set packet weight of tx polling to 2 * vq size · 054d27bf
      haibinzhang(张海斌) authored
      handle_tx will delay rx for tens or even hundreds of milliseconds when tx busy
      polling udp packets with small length(e.g. 1byte udp payload), because setting
      VHOST_NET_WEIGHT takes into account only sent-bytes but no single packet length.
      
      Ping-Latencies shown below were tested between two Virtual Machines using
      netperf (UDP_STREAM, len=1), and then another machine pinged the client:
      
      vq size=256
      Packet-Weight   Ping-Latencies(millisecond)
                         min      avg       max
      Origin           3.319   18.489    57.303
      64               1.643    2.021     2.552
      128              1.825    2.600     3.224
      256              1.997    2.710     4.295
      512              1.860    3.171     4.631
      1024             2.002    4.173     9.056
      2048             2.257    5.650     9.688
      4096             2.093    8.508    15.943
      
      vq size=512
      Packet-Weight   Ping-Latencies(millisecond)
                         min      avg       max
      Origin           6.537   29.177    66.245
      64               2.798    3.614     4.403
      128              2.861    3.820     4.775
      256              3.008    4.018     4.807
      512              3.254    4.523     5.824
      1024             3.079    5.335     7.747
      2048             3.944    8.201    12.762
      4096             4.158   11.057    19.985
      
      Seems pretty consistent, a small dip at 2 VQ sizes.
      Ring size is a hint from device about a burst size it can tolerate. Based on
      benchmarks, set the weight to 2 * vq size.
      
      To evaluate this change, another tests were done using netperf(RR, TX) between
      two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz, and vq size was
      tweaked through qemu. Results shown below does not show obvious changes.
      
      vq size=256 TCP_RR                vq size=512 TCP_RR
      size/sessions/+thu%/+normalize%   size/sessions/+thu%/+normalize%
         1/       1/  -7%/        -2%      1/       1/   0%/        -2%
         1/       4/  +1%/         0%      1/       4/  +1%/         0%
         1/       8/  +1%/        -2%      1/       8/   0%/        +1%
        64/       1/  -6%/         0%     64/       1/  +7%/        +3%
        64/       4/   0%/        +2%     64/       4/  -1%/        +1%
        64/       8/   0%/         0%     64/       8/  -1%/        -2%
       256/       1/  -3%/        -4%    256/       1/  -4%/        -2%
       256/       4/  +3%/        +4%    256/       4/  +1%/        +2%
       256/       8/  +2%/         0%    256/       8/  +1%/        -1%
      
      vq size=256 UDP_RR                vq size=512 UDP_RR
      size/sessions/+thu%/+normalize%   size/sessions/+thu%/+normalize%
         1/       1/  -5%/        +1%      1/       1/  -3%/        -2%
         1/       4/  +4%/        +1%      1/       4/  -2%/        +2%
         1/       8/  -1%/        -1%      1/       8/  -1%/         0%
        64/       1/  -2%/        -3%     64/       1/  +1%/        +1%
        64/       4/  -5%/        -1%     64/       4/  +2%/         0%
        64/       8/   0%/        -1%     64/       8/  -2%/        +1%
       256/       1/  +7%/        +1%    256/       1/  -7%/         0%
       256/       4/  +1%/        +1%    256/       4/  -3%/        -4%
       256/       8/  +2%/        +2%    256/       8/  +1%/        +1%
      
      vq size=256 TCP_STREAM            vq size=512 TCP_STREAM
      size/sessions/+thu%/+normalize%   size/sessions/+thu%/+normalize%
        64/       1/   0%/        -3%     64/       1/   0%/         0%
        64/       4/  +3%/        -1%     64/       4/  -2%/        +4%
        64/       8/  +9%/        -4%     64/       8/  -1%/        +2%
       256/       1/  +1%/        -4%    256/       1/  +1%/        +1%
       256/       4/  -1%/        -1%    256/       4/  -3%/         0%
       256/       8/  +7%/        +5%    256/       8/  -3%/         0%
       512/       1/  +1%/         0%    512/       1/  -1%/        -1%
       512/       4/  +1%/        -1%    512/       4/   0%/         0%
       512/       8/  +7%/        -5%    512/       8/  +6%/        -1%
      1024/       1/   0%/        -1%   1024/       1/   0%/        +1%
      1024/       4/  +3%/         0%   1024/       4/  +1%/         0%
      1024/       8/  +8%/        +5%   1024/       8/  -1%/         0%
      2048/       1/  +2%/        +2%   2048/       1/  -1%/         0%
      2048/       4/  +1%/         0%   2048/       4/   0%/        -1%
      2048/       8/  -2%/         0%   2048/       8/   5%/        -1%
      4096/       1/  -2%/         0%   4096/       1/  -2%/         0%
      4096/       4/  +2%/         0%   4096/       4/   0%/         0%
      4096/       8/  +9%/        -2%   4096/       8/  -5%/        -1%
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarHaibin Zhang <haibinzhang@tencent.com>
      Signed-off-by: default avatarYunfang Tai <yunfangtai@tencent.com>
      Signed-off-by: default avatarLidong Chen <lidongchen@tencent.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      CVE-2019-3900
      
      (cherry picked from commit a2ac9990)
      Signed-off-by: default avatarTyler Hicks <tyhicks@canonical.com>
      Acked-by: default avatarStefan Bader <stefan.bader@canonical.com>
      Acked-by: default avatarConnor Kuehl <connor.kuehl@canonical.com>
      Signed-off-by: default avatarKhalid Elmously <khalid.elmously@canonical.com>
      054d27bf
    • Willem de Bruijn's avatar
      vhost_net: do not stall on zerocopy depletion · 1073e969
      Willem de Bruijn authored
      Vhost-net has a hard limit on the number of zerocopy skbs in flight.
      When reached, transmission stalls. Stalls cause latency, as well as
      head-of-line blocking of other flows that do not use zerocopy.
      
      Instead of stalling, revert to copy-based transmission.
      
      Tested by sending two udp flows from guest to host, one with payload
      of VHOST_GOODCOPY_LEN, the other too small for zerocopy (1B). The
      large flow is redirected to a netem instance with 1MBps rate limit
      and deep 1000 entry queue.
      
        modprobe ifb
        ip link set dev ifb0 up
        tc qdisc add dev ifb0 root netem limit 1000 rate 1MBit
      
        tc qdisc add dev tap0 ingress
        tc filter add dev tap0 parent ffff: protocol ip \
            u32 match ip dport 8000 0xffff \
            action mirred egress redirect dev ifb0
      
      Before the delay, both flows process around 80K pps. With the delay,
      before this patch, both process around 400. After this patch, the
      large flow is still rate limited, while the small reverts to its
      original rate. See also discussion in the first link, below.
      
      Without rate limiting, {1, 10, 100}x TCP_STREAM tests continued to
      send at 100% zerocopy.
      
      The limit in vhost_exceeds_maxpend must be carefully chosen. With
      vq->num >> 1, the flows remain correlated. This value happens to
      correspond to VHOST_MAX_PENDING for vq->num == 256. Allow smaller
      fractions and ensure correctness also for much smaller values of
      vq->num, by testing the min() of both explicitly. See also the
      discussion in the second link below.
      
      Changes
        v1 -> v2
          - replaced min with typed min_t
          - avoid unnecessary whitespace change
      
      Link:http://lkml.kernel.org/r/CAF=yD-+Wk9sc9dXMUq1+x_hh=3ThTXa6BnZkygP3tgVpjbp93g@mail.gmail.com
      Link:http://lkml.kernel.org/r/20170819064129.27272-1-den@klaipeden.comSigned-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      
      CVE-2019-3900
      
      (cherry picked from commit 1e6f7453)
      Signed-off-by: default avatarTyler Hicks <tyhicks@canonical.com>
      Acked-by: default avatarStefan Bader <stefan.bader@canonical.com>
      Acked-by: default avatarConnor Kuehl <connor.kuehl@canonical.com>
      Signed-off-by: default avatarKhalid Elmously <khalid.elmously@canonical.com>
      1073e969