Commits · 2638134f710364c9e696a155bf16c6847959b1d9 · Kirill Smelkov / linux

09 Jul, 2024 32 commits

vdpa/mlx5: Don't reset VQs more than necessary · 2638134f

Dragos Tatulea authored Jun 26, 2024

The vdpa device can be reset many times in sequence without any
significant state changes in between. Previously this was not a problem:
VQs were torn down only on first reset. But after VQ pre-creation was
introduced, each reset will delete and re-create the hardware VQs and
their associated resources.

To solve this problem, avoid resetting hardware VQs if the VQs are still
in a blank state.
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-23-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

2638134f

vdpa/mlx5: Re-create HW VQs under certain conditions · 0fe963d6

Dragos Tatulea authored Jun 26, 2024

There are a few conditions under which the hardware VQs need a full
teardown and setup:

- VQ size changed to something else than default value. Hardware VQ size
  modification is not supported.

- User turns off certain device features: mergeable buffers, checksum
  virtio 1.0 compliance. In these cases, the TIR and RQT need to be
  re-created.

Add a needs_teardown configuration variable and set it when detecting
the above scenarios. On next DRIVER_OK, the resources will be torn down
first.
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-22-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

0fe963d6

vdpa/mlx5: Pre-create hardware VQs at vdpa .dev_add time · ffb1aae4

Dragos Tatulea authored Jun 26, 2024

Currently, hardware VQs are created right when the vdpa device gets into
DRIVER_OK state. That is easier because most of the VQ state is known by
then.

This patch switches to creating all VQs and their associated resources
at device creation time. The motivation is to reduce the vdpa device
live migration downtime by moving the expensive operation of creating
all the hardware VQs and their associated resources out of downtime on
the destination VM.

The VQs are now created in a blank state. The VQ configuration will
happen later, on DRIVER_OK. Then the configuration will be applied when
the VQs are moved to the Ready state.

When .set_vq_ready() is called on a VQ before DRIVER_OK, special care is
needed: now that the VQ is already created a resume_vq() will be
triggered too early when no mr has been configured yet. Skip calling
resume_vq() in this case, let it be handled during DRIVER_OK.

For virtio-vdpa, the device configuration is done earlier during
.vdpa_dev_add() by vdpa_register_device(). Avoid calling
setup_vq_resources() a second time in that case.

On a 64 CPU, 256 GB VM with 1 vDPA device of 16 VQps, the full VQ
resource creation + resume time was ~370ms. Now it's down to 60 ms
(only VQ config and resume). The measurements were done on a ConnectX6DX
based vDPA device.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-21-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

ffb1aae4

vdpa/mlx5: Use suspend/resume during VQP change · 3b3adb3b

Dragos Tatulea authored Jun 26, 2024

Resume a VQ if it is already created when the number of VQ pairs
increases. This is done in preparation for VQ pre-creation which is
coming in a later patch. It is necessary because calling setup_vq() on
an already created VQ will return early and will not enable the queue.

For symmetry, suspend a VQ instead of tearing it down when the number of
VQ pairs decreases. But only if the resume operation is supported.
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-20-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

3b3adb3b

vdpa/mlx5: Forward error in suspend/resume device · ac85cd90

Dragos Tatulea authored Jun 26, 2024

Start using the suspend/resume_vq() error return codes previously added.
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-19-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>

ac85cd90

vdpa/mlx5: Consolidate all VQ modify to Ready to use resume_vq() · 84325027

Dragos Tatulea authored Jun 26, 2024

There are a few more places modifying the VQ to Ready directly. Let's
consolidate them into resume_vq().

The redundant warnings for resume_vq() errors can also be dropped.

There is one special case that needs to be handled for virtio-vdpa:
the initialized flag must be set to true earlier in setup_vq() so that
resume_vq() doesn't return early.
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-18-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

84325027

vdpa/mlx5: Add error code for suspend/resume VQ · b89bb349

Dragos Tatulea authored Jun 26, 2024

Instead of blindly calling suspend/resume_vqs(), make then return error
codes.

To keep compatibility, keep suspending or resuming VQs on error and
return the last error code. The assumption here is that the error code
would be the same.
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-17-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

b89bb349

vdpa/mlx5: Accept Init -> Ready VQ transition in resume_vq() · fc9af25d

Dragos Tatulea authored Jun 26, 2024

Until now resume_vq() was used only for the suspend/resume scenario.
This change also allows calling resume_vq() to bring it from Init to
Ready state (VQ initialization).
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-16-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>

fc9af25d

vdpa/mlx5: Allow creation of blank VQs · e60e9eeb

Dragos Tatulea authored Jun 26, 2024

Based on the filled flag, create VQs that are filled or blank.
Blank VQs will be filled in later through VQ modify.

Downstream patches will make use of this to pre-create blank VQs at
vdpa device creation.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-15-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>

e60e9eeb

vdpa/mlx5: Set mkey modified flags on all VQs · ebebaf45

Dragos Tatulea authored Jun 26, 2024

Otherwise, when virtqueues are moved from INIT to READY the latest mkey
will not be set appropriately.
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-14-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

ebebaf45

vdpa/mlx5: Start off rqt_size with max VQPs · 1e8dac7b

Dragos Tatulea authored Jun 26, 2024

Currently rqt_size is initialized during device flag configuration.
That's because it is the earliest moment when device knows if MQ
(multi queue) is on or off.

Shift this configuration earlier to device creation time. This implies
that non-MQ devices will have a larger RQT size. But the configuration
will still be correct.

This is done in preparation for the pre-creation of hardware virtqueues
at device add time. When that change will be added, RQT will be created
at device creation time so it needs to be initialized to its max size.
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-13-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

1e8dac7b

vdpa/mlx5: Set an initial size on the VQ · ad9758fd

Dragos Tatulea authored Jun 26, 2024

The virtqueue size is a pre-requisite for setting up any virtqueue
resources. For the upcoming optimization of creating virtqueues at
device add, the virtqueue size has to be configured.

The queue size check in setup_vq() will always be false. So remove it.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-12-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

ad9758fd

vdpa/mlx5: Add support for modifying the VQ features field · cdc3c7ea

Dragos Tatulea authored Jun 26, 2024

This is done in preparation for the pre-creation of hardware virtqueues
at device add time.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-11-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

cdc3c7ea

vdpa/mlx5: Add support for modifying the virtio_version VQ field · f70080c5

Dragos Tatulea authored Jun 26, 2024

This is done in preparation for the pre-creation of hardware virtqueues
at device add time.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-10-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

f70080c5

vdpa/mlx5: Rename init_mvqs · 4a19f294

Dragos Tatulea authored Jun 26, 2024

Function is used to set default values, so name it accordingly.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-9-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>

4a19f294

vdpa/mlx5: Clear and reinitialize software VQ data on reset · e5bcbd1d

Dragos Tatulea authored Jun 26, 2024

The hardware VQ configuration is mirrored by data in struct
mlx5_vdpa_virtqueue . Instead of clearing just a few fields at reset,
fully clear the struct and initialize with the appropriate default
values.

As clear_vqs_ready() is used only during reset, get rid of it.
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-8-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

e5bcbd1d

vdpa/mlx5: Initialize and reset device with one queue pair · 1835ed4a

Dragos Tatulea authored Jun 26, 2024

The virtio spec says that a vdpa device should start off with one queue
pair. The driver is already compliant.

This patch moves the initialization to device add and reset times. This
is done in preparation for the pre-creation of hardware virtqueues at
device add time.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-7-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>

1835ed4a

vdpa/mlx5: Remove duplicate suspend code · a366465b

Dragos Tatulea authored Jun 26, 2024

Use the dedicated suspend_vqs() function instead.
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-6-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

a366465b

vdpa/mlx5: Iterate over active VQs during suspend/resume · 34bd86c7

Dragos Tatulea authored Jun 26, 2024

No need to iterate over max number of VQs.
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-5-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

34bd86c7

vdpa/mlx5: Drop redundant check in teardown_virtqueues() · ad807392

Dragos Tatulea authored Jun 26, 2024

The check is done inside teardown_vq().
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-4-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

ad807392

vdpa/mlx5: Drop redundant code · 4c90a60a

Dragos Tatulea authored Jun 26, 2024

Originally, the second loop initialized the CVQ. But (acde3929
("vdpa/mlx5: Use consistent RQT size") initialized all the queues in the
first loop, so the second iteration in init_mvqs() is never called
because the first one will iterate up to max_vqs.
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-3-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

4c90a60a

vdpa/mlx5: Make setup/teardown_vq_resources() symmetrical · 63f0cbad

Dragos Tatulea authored Jun 26, 2024

... by changing the setup_vq_resources() parameter type.
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-2-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

63f0cbad

vdpa/mlx5: Clarify meaning thorough function rename · 1f5d6476

Dragos Tatulea authored Jun 26, 2024

setup_driver()/teardown_driver() are a bit vague. These functions are
used for virtqueue resources.

Same for alloc_resources()/teardown_resources(): they represent fixed
resources that are meant to exist during the device lifetime.
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-1-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

1f5d6476

virtio-fs: improved request latencies when Virtio queue is full · 106e4df1

Peter-Jan Gootzen authored May 17, 2024

Currently, when the Virtio queue is full, a work item is scheduled
to execute in 1ms that retries adding the request to the queue.
This is a large amount of time on the scale on which a
virtio-fs device can operate. When using a DPU this is around
30-40us baseline without going to a remote server (4k, QD=1).

This patch changes the retrying behavior to immediately filling the
Virtio queue up again when a completion has been received.

This reduces the 99.9th percentile latencies in our tests by
60x and slightly increases the overall throughput, when using a
workload IO depth 2x the size of the Virtio queue and a
DPU-powered virtio-fs device (NVIDIA BlueField DPU).
Signed-off-by: Peter-Jan Gootzen <pgootzen@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Yoray Zack <yorayz@nvidia.com>
Message-Id: <20240517190435.152096-3-pgootzen@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

106e4df1

virtio-fs: let -ENOMEM bubble up or burst gently · 2106e1f4

Peter-Jan Gootzen authored May 17, 2024

Currently, when the enqueueing of a request or forget operation fails
with -ENOMEM, the enqueueing is retried after a timeout. This patch
removes this behavior and treats -ENOMEM in these scenarios like any
other error. By bubbling up the error to user space in the case of a
request, and by dropping the operation in case of a forget. This
behavior matches that of the FUSE layer above, and also simplifies the
error handling. The latter will come in handy for upcoming patches that
optimize the retrying of operations in case of -ENOSPC.
Signed-off-by: Peter-Jan Gootzen <pgootzen@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Yoray Zack <yorayz@nvidia.com>
Message-Id: <20240517190435.152096-2-pgootzen@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

2106e1f4

vDPA: add missing MODULE_DESCRIPTION() macros · e7909ad6

Jeff Johnson authored Jun 11, 2024

With ARCH=x86, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/vdpa/vdpa.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/vdpa/ifcvf/ifcvf.o

Add the missing invocations of the MODULE_DESCRIPTION() macro.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Message-Id: <20240611-md-drivers-vdpa-v1-1-efaf2de15152@quicinc.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

e7909ad6

virtio: add missing MODULE_DESCRIPTION() macros · ab0727f3

Jeff Johnson authored Jul 02, 2024

With ARCH=sh, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/virtio/virtio.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/virtio/virtio_ring.o

Add the missing invocations of the MODULE_DESCRIPTION() macro.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Message-Id: <20240702-md-sh-drivers-virtio-v1-1-cf7325ab6ccc@quicinc.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

ab0727f3

vringh: add MODULE_DESCRIPTION() · e400ddf0

Jeff Johnson authored May 16, 2024

Fix the allmodconfig 'make w=1' issue:

WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/vhost/vringh.o
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Message-Id: <20240516-md-vringh-v1-1-31bf37779a5a@quicinc.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>

e400ddf0

MAINTAINERS: Change lingshan's email to kernel.org · 9be237df

Zhu Lingshan authored May 15, 2024

This commit changes lingshan's email from intel.com
to kernel.org.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240514165125.74802-1-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

9be237df

vhost: move smp_rmb() into vhost_get_avail_idx() · 7ad47239

Michael S. Tsirkin authored Apr 30, 2024

All callers of vhost_get_avail_idx() use smp_rmb() to
order the available ring entry read and avail_idx read.

Make vhost_get_avail_idx() call smp_rmb() itself whenever the avail_idx
is accessed. This way, the callers don't need to worry about the memory
barrier. As a side benefit, we also validate the index on all paths now,
which will hopefully help prevent/catch earlier future bugs.

Note that current code is inconsistent in how the errors are handled.
They are treated as an empty ring in some places, but as non-empty
ring in other places. This patch doesn't attempt to change the existing
behaviour.

No functional change intended.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Message-Id: <20240429232748.642356-1-gshan@redhat.com>

7ad47239

virtio_balloon: separate vm events into a function · fdba68d2

zhenwei pi authored Apr 23, 2024

All the VM events related statistics have dependence on
'CONFIG_VM_EVENT_COUNTERS', separate these events into a function to
make code clean. Then we can remove 'CONFIG_VM_EVENT_COUNTERS' from
'update_balloon_stats'.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Message-Id: <20240423034109.1552866-2-pizhenwei@bytedance.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>

fdba68d2

virtio: vdpa: vDPA driver for Marvell OCTEON DPU devices · 8b6c724c

Srujana Challa authored Jun 14, 2024

This commit introduces a new vDPA driver specifically designed for
managing the virtio control plane over the vDPA bus for OCTEON DPU
devices. The driver consists of two layers:

1. Octep HW Layer (Octeon Endpoint): Responsible for handling hardware
operations and configurations related to the DPU device.

2. Octep Main Layer: Compliant with the vDPA bus framework, this layer
implements device operations for the vDPA bus. It handles device
probing, bus attachment, vring operations, and other relevant tasks.
Signed-off-by: Srujana Challa <schalla@marvell.com>
Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
Signed-off-by: Shijith Thotton <sthotton@marvell.com>
Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20240614144659.1776067-1-schalla@marvell.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

8b6c724c

04 Jul, 2024 4 commits

net: missing check virtio · e269d79c

Denis Arefev authored Jun 13, 2024

Two missing check in virtio_net_hdr_to_skb() allowed syzbot
to crash kernels again

1. After the skb_segment function the buffer may become non-linear
(nr_frags != 0), but since the SKBTX_SHARED_FRAG flag is not set anywhere
the __skb_linearize function will not be executed, then the buffer will
remain non-linear. Then the condition (offset >= skb_headlen(skb))
becomes true, which causes WARN_ON_ONCE in skb_checksum_help.

2. The struct sk_buff and struct virtio_net_hdr members must be
mathematically related.
(gso_size) must be greater than (needed) otherwise WARN_ON_ONCE.
(remainder) must be greater than (needed) otherwise WARN_ON_ONCE.
(remainder) may be 0 if division is without remainder.

offset+2 (4191) > skb_headlen() (1116)
WARNING: CPU: 1 PID: 5084 at net/core/dev.c:3303 skb_checksum_help+0x5e2/0x740 net/core/dev.c:3303
Modules linked in:
CPU: 1 PID: 5084 Comm: syz-executor336 Not tainted 6.7.0-rc3-syzkaller-00014-gdf60cee2 #0
Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 11/10/2023
RIP: 0010:skb_checksum_help+0x5e2/0x740 net/core/dev.c:3303
Code: 89 e8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 52 01 00 00 44 89 e2 2b 53 74 4c 89 ee 48 c7 c7 40 57 e9 8b e8 af 8f dd f8 90 <0f> 0b 90 90 e9 87 fe ff ff e8 40 0f 6e f9 e9 4b fa ff ff 48 89 ef
RSP: 0018:ffffc90003a9f338 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff888025125780 RCX: ffffffff814db209
RDX: ffff888015393b80 RSI: ffffffff814db216 RDI: 0000000000000001
RBP: ffff8880251257f4 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: 000000000000045c
R13: 000000000000105f R14: ffff8880251257f0 R15: 000000000000105d
FS:  0000555555c24380(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000002000f000 CR3: 0000000023151000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ip_do_fragment+0xa1b/0x18b0 net/ipv4/ip_output.c:777
 ip_fragment.constprop.0+0x161/0x230 net/ipv4/ip_output.c:584
 ip_finish_output_gso net/ipv4/ip_output.c:286 [inline]
 __ip_finish_output net/ipv4/ip_output.c:308 [inline]
 __ip_finish_output+0x49c/0x650 net/ipv4/ip_output.c:295
 ip_finish_output+0x31/0x310 net/ipv4/ip_output.c:323
 NF_HOOK_COND include/linux/netfilter.h:303 [inline]
 ip_output+0x13b/0x2a0 net/ipv4/ip_output.c:433
 dst_output include/net/dst.h:451 [inline]
 ip_local_out+0xaf/0x1a0 net/ipv4/ip_output.c:129
 iptunnel_xmit+0x5b4/0x9b0 net/ipv4/ip_tunnel_core.c:82
 ipip6_tunnel_xmit net/ipv6/sit.c:1034 [inline]
 sit_tunnel_xmit+0xed2/0x28f0 net/ipv6/sit.c:1076
 __netdev_start_xmit include/linux/netdevice.h:4940 [inline]
 netdev_start_xmit include/linux/netdevice.h:4954 [inline]
 xmit_one net/core/dev.c:3545 [inline]
 dev_hard_start_xmit+0x13d/0x6d0 net/core/dev.c:3561
 __dev_queue_xmit+0x7c1/0x3d60 net/core/dev.c:4346
 dev_queue_xmit include/linux/netdevice.h:3134 [inline]
 packet_xmit+0x257/0x380 net/packet/af_packet.c:276
 packet_snd net/packet/af_packet.c:3087 [inline]
 packet_sendmsg+0x24ca/0x5240 net/packet/af_packet.c:3119
 sock_sendmsg_nosec net/socket.c:730 [inline]
 __sock_sendmsg+0xd5/0x180 net/socket.c:745
 __sys_sendto+0x255/0x340 net/socket.c:2190
 __do_sys_sendto net/socket.c:2202 [inline]
 __se_sys_sendto net/socket.c:2198 [inline]
 __x64_sys_sendto+0xe0/0x1b0 net/socket.c:2198
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x40/0x110 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x63/0x6b

Found by Linux Verification Center (linuxtesting.org) with Syzkaller

Fixes: 0f6925b3 ("virtio_net: Do not pull payload in skb->head")
Signed-off-by: Denis Arefev <arefev@swemel.ru>
Message-Id: <20240613095448.27118-1-arefev@swemel.ru>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

e269d79c

tools/virtio: creating pipe assertion in vringh_test · ede9c33e

Yunseong Kim authored Jun 25, 2024

parallel_test() function in vringh_test needs to verify
the creation of the guest/host pipe.
Signed-off-by: Yunseong Kim <yskelg@gmail.com>
Message-Id: <20240624174905.27980-2-yskelg@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

ede9c33e

virtio_ring: fix KMSAN error for premapped mode · 840b2d39

Xuan Zhuo authored Jun 06, 2024

Add kmsan for virtqueue_dma_map_single_attrs to fix:

BUG: KMSAN: uninit-value in receive_buf+0x45ca/0x6990
 receive_buf+0x45ca/0x6990
 virtnet_poll+0x17e0/0x3130
 net_rx_action+0x832/0x26e0
 handle_softirqs+0x330/0x10f0
 [...]

Uninit was created at:
 __alloc_pages_noprof+0x62a/0xe60
 alloc_pages_noprof+0x392/0x830
 skb_page_frag_refill+0x21a/0x5c0
 virtnet_rq_alloc+0x50/0x1500
 try_fill_recv+0x372/0x54c0
 virtnet_open+0x210/0xbe0
 __dev_open+0x56e/0x920
 __dev_change_flags+0x39c/0x2000
 dev_change_flags+0xaa/0x200
 do_setlink+0x197a/0x7420
 rtnl_setlink+0x77c/0x860
 [...]
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Tested-by: Alexander Potapenko <glider@google.com>
Message-Id: <20240606111345.93600-1-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>  # s390x
Acked-by: Jason Wang <jasowang@redhat.com>

840b2d39

vhost/vsock: always initialize seqpacket_allow · 1e1fdcbd

Michael S. Tsirkin authored Apr 22, 2024

There are two issues around seqpacket_allow:
1. seqpacket_allow is not initialized when socket is
   created. Thus if features are never set, it will be
   read uninitialized.
2. if VIRTIO_VSOCK_F_SEQPACKET is set and then cleared,
   then seqpacket_allow will not be cleared appropriately
   (existing apps I know about don't usually do this but
    it's legal and there's no way to be sure no one relies
    on this).

To fix:
	- initialize seqpacket_allow after allocation
	- set it unconditionally in set_features

Reported-by: syzbot+6c21aeb59d0e82eb2782@syzkaller.appspotmail.com
Reported-by: Jeongjun Park <aha310510@gmail.com>
Fixes: ced7b713 ("vhost/vsock: support SEQPACKET for transport").
Tested-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20240422100010-mutt-send-email-mst@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>

1e1fdcbd

02 Jul, 2024 4 commits

Merge tag 'linux_kselftest-fixes-6.10-rc7' of... · e9d22f7a

Linus Torvalds authored Jul 02, 2024

Merge tag 'linux_kselftest-fixes-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kselftest fixes from Shuah Khan:
 "One single patch to fix the non-contiguous CBM resctrl:

  - AMD supports non-contiguous CBM but does not report it via CPUID.
    This test should not use CPUID on AMD to detect non-contiguous CBM
    support. Fix the problem so the test uses CPUID to discover
    non-contiguous CBM support only on Intel"

* tag 'linux_kselftest-fixes-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests/resctrl: Fix non-contiguous CBM for AMD

e9d22f7a

Merge tag 'vfs-6.10-rc7.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · dbd8132a

Linus Torvalds authored Jul 02, 2024

Pull vfs fixes from Christian Brauner:
 "VFS:

   - Improve handling of deep ancestor chains in is_subdir()

   - Release locks cleanly when fctnl_setlk() races with close().

     When setting a file lock fails the VFS tries to cleanup the already
     created lock. The helper used for this calls back into the LSM
     layer which may cause it to fail, leaving the stale lock accessible
     via /proc/locks.

  AFS:

   - Fix a comma/semicolon typo"

* tag 'vfs-6.10-rc7.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  afs: Convert comma to semicolon
  fs: better handle deep ancestor chains in is_subdir()
  filelock: Remove locks reliably when fcntl/close race is detected

dbd8132a

afs: Convert comma to semicolon · 655593a4

Chen Ni authored Jul 02, 2024

Replace a comma between expression statements by a semicolon.
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Link: https://lore.kernel.org/r/20240702024055.1411407-1-nichen@iscas.ac.cn/
Link: https://lore.kernel.org/r/20240702024055.1411407-1-nichen@iscas.ac.cnAcked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>

655593a4

fs: better handle deep ancestor chains in is_subdir() · 391b59b0

Christian Brauner authored Jul 02, 2024

Jan reported that 'cd ..' may take a long time in deep directory
hierarchies under a bind-mount. If concurrent renames happen it is
possible to livelock in is_subdir() because it will keep retrying.

Change is_subdir() from simply retrying over and over to retry once and
then acquire the rename lock to handle deep ancestor chains better. The
list of alternatives to this approach were less then pleasant. Change
the scope of rcu lock to cover the whole walk while at it.

A big thanks to Jan and Linus. Both Jan and Linus had proposed
effectively the same thing just that one version ended up being slightly
more elegant.
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>

391b59b0