lwn.git - Linux kernel documentation tree maintained by Jonathan Corbet

Age	Commit message (Collapse)	Author
2024-02-22	vfio: mdev: make mdev_bus_type const	Ricardo B. Marliere
	Now that the driver core can properly handle constant struct bus_type, move the mdev_bus_type variable to be a constant structure as well, placing it into read-only memory which can not be modified at runtime. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com> Link: https://lore.kernel.org/r/20240208-bus_cleanup-vfio-v1-1-ed5da3019949@marliere.net Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-02-22	vfio/mlx5: Let firmware knows upon leaving PRE_COPY back to RUNNING	Yishai Hadas
	Let firmware knows upon leaving PRE_COPY back to RUNNING as of some error in the target/migration cancellation. This will let firmware cleaning its internal resources that were turned on upon PRE_COPY. The flow is based on the device specification in this area. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Leon Romanovsky <leon@kernel.org> Link: https://lore.kernel.org/r/20240205124828.232701-6-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-02-22	vfio/mlx5: Block incremental query upon migf state error	Yishai Hadas
	Block incremental query which is state-dependent once the migration file was previously marked with state error. This may prevent redundant calls to firmware upon PRE_COPY which will end-up with a failure and a syndrome printed in dmesg. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Leon Romanovsky <leon@kernel.org> Link: https://lore.kernel.org/r/20240205124828.232701-5-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-02-22	vfio/mlx5: Handle the EREMOTEIO error upon the SAVE command	Yishai Hadas
	The SAVE command uses the async command interface over the PF. Upon a failure in the firmware -EREMOTEIO is returned. In that case call mlx5_cmd_out_err() to let it print the command failure details including the firmware syndrome. Note: The other commands in the driver use the sync command interface in a way that a firmware syndrome is printed upon an error inside mlx5_core. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Leon Romanovsky <leon@kernel.org> Link: https://lore.kernel.org/r/20240205124828.232701-4-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-02-22	vfio/mlx5: Add support for tracker object change event	Yishai Hadas
	Add support for tracker object change event by referring to its MLX5_EVENT_TYPE_OBJECT_CHANGE event when occurs. This lets the driver recognize whether the firmware moved the tracker object to an error state. In that case, the driver will skip/block any usage of that object including an early exit in case the object was previously marked with an error. This functionality also covers the case when no CQE is delivered as of the error state. The driver was adapted to the device specification to handle the above. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Leon Romanovsky <leon@kernel.org> Link: https://lore.kernel.org/r/20240205124828.232701-3-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-02-22	vfio/pci: WARN_ON driver_override kasprintf failure	Kunwu Chan
	kasprintf() returns a pointer to dynamically allocated memory which can be NULL upon failure. This is a blocking notifier callback, so errno isn't a proper return value. Use WARN_ON to small allocation failures. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Link: https://lore.kernel.org/r/20240115063434.20278-1-chentao@kylinos.cn Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-02-08	vfio: replace CONFIG_HAVE_KVM with IS_ENABLED(CONFIG_KVM)	Paolo Bonzini
	It is more accurate to Check if KVM is enabled, instead of having the architecture say so. Architectures always "have" KVM, so for example checking CONFIG_HAVE_KVM in vfio code is pointless, but if KVM is disabled in a specific build, there is no need for support code. Alternatively, the #ifdefs could simply be deleted. However, this would add dead code. For example, when KVM is disabled, there is no need to include code in VFIO that uses symbol_get, as that symbol_get would always fail. Cc: Alex Williamson <alex.williamson@redhat.com> Cc: x86@kernel.org Cc: kbingham@kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-01-18	Merge tag 'vfio-v6.8-rc1' of https://github.com/awilliam/linux-vfio	Linus Torvalds
	Pull VFIO updates from Alex Williamson: - Add debugfs support, initially used for reporting device migration state (Longfang Liu) - Fixes and support for migration dirty tracking across multiple IOVA regions in the pds-vfio-pci driver (Brett Creeley) - Improved IOMMU allocation accounting visibility (Pasha Tatashin) - Virtio infrastructure and a new virtio-vfio-pci variant driver, which provides emulation of a legacy virtio interfaces on modern virtio hardware for virtio-net VF devices where the PF driver exposes support for legacy admin queues, ie. an emulated IO BAR on an SR-IOV VF to provide driver ABI compatibility to legacy devices (Yishai Hadas & Feng Liu) - Migration fixes for the hisi-acc-vfio-pci variant driver (Shameer Kolothum) - Kconfig dependency fix for new virtio-vfio-pci variant driver (Arnd Bergmann) * tag 'vfio-v6.8-rc1' of https://github.com/awilliam/linux-vfio: (22 commits) vfio/virtio: fix virtio-pci dependency hisi_acc_vfio_pci: Update migration data pointer correctly on saving/resume vfio/virtio: Declare virtiovf_pci_aer_reset_done() static vfio/virtio: Introduce a vfio driver over virtio devices vfio/pci: Expose vfio_pci_core_iowrite/read##size() vfio/pci: Expose vfio_pci_core_setup_barmap() virtio-pci: Introduce APIs to execute legacy IO admin commands virtio-pci: Initialize the supported admin commands virtio-pci: Introduce admin commands virtio-pci: Introduce admin command sending function virtio-pci: Introduce admin virtqueue virtio: Define feature bit for administration virtqueue vfio/type1: account iommu allocations vfio/pds: Add multi-region support vfio/pds: Move seq/ack bitmaps into region struct vfio/pds: Pass region info to relevant functions vfio/pds: Move and rename region specific info vfio/pds: Only use a single SGL for both seq and ack vfio/pds: Fix calculations in pds_vfio_dirty_sync MAINTAINERS: Add vfio debugfs interface doc link ...
2024-01-10	vfio/virtio: fix virtio-pci dependency	Arnd Bergmann
	The new vfio-virtio driver already has a dependency on VIRTIO_PCI_ADMIN_LEGACY, but that is a bool symbol and allows vfio-virtio to be built-in even if virtio-pci itself is a loadable module. This leads to a link failure: aarch64-linux-ld: drivers/vfio/pci/virtio/main.o: in function `virtiovf_pci_probe': main.c:(.text+0xec): undefined reference to `virtio_pci_admin_has_legacy_io' aarch64-linux-ld: drivers/vfio/pci/virtio/main.o: in function `virtiovf_pci_init_device': main.c:(.text+0x260): undefined reference to `virtio_pci_admin_legacy_io_notify_info' aarch64-linux-ld: drivers/vfio/pci/virtio/main.o: in function `virtiovf_pci_bar0_rw': main.c:(.text+0x6ec): undefined reference to `virtio_pci_admin_legacy_common_io_read' aarch64-linux-ld: main.c:(.text+0x6f4): undefined reference to `virtio_pci_admin_legacy_device_io_read' aarch64-linux-ld: main.c:(.text+0x7f0): undefined reference to `virtio_pci_admin_legacy_common_io_write' aarch64-linux-ld: main.c:(.text+0x7f8): undefined reference to `virtio_pci_admin_legacy_device_io_write' Add another explicit dependency on the tristate symbol. Fixes: eb61eca0e8c3 ("vfio/virtio: Introduce a vfio driver over virtio devices") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20240109075731.2726731-1-arnd@kernel.org Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-01-08	Merge tag 'vfs-6.8.misc' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Add Jan Kara as VFS reviewer - Show correct device and inode numbers in proc/<pid>/maps for vma files on stacked filesystems. This is now easily doable thanks to the backing file work from the last cycles. This comes with selftests Cleanups: - Remove a redundant might_sleep() from wait_on_inode() - Initialize pointer with NULL, not 0 - Clarify comment on access_override_creds() - Rework and simplify eventfd_signal() and eventfd_signal_mask() helpers - Process aio completions in batches to avoid needless wakeups - Completely decouple struct mnt_idmap from namespaces. We now only keep the actual idmapping around and don't stash references to namespaces - Reformat maintainer entries to indicate that a given subsystem belongs to fs/ - Simplify fput() for files that were never opened - Get rid of various pointless file helpers - Rename various file helpers - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from last cycle - Make relatime_need_update() return bool - Use GFP_KERNEL instead of GFP_USER when allocating superblocks - Replace deprecated ida_simple_() calls with their current ida_() counterparts Fixes: - Fix comments on user namespace id mapping helpers. They aren't kernel doc comments so they shouldn't be using /** - s/Retuns/Returns/g in various places - Add missing parameter documentation on can_move_mount_beneath() - Rename i_mapping->private_data to i_mapping->i_private_data - Fix a false-positive lockdep warning in pipe_write() for watch queues - Improve __fget_files_rcu() code generation to improve performance - Only notify writer that pipe resizing has finished after setting pipe->max_usage otherwise writers are never notified that the pipe has been resized and hang - Fix some kernel docs in hfsplus - s/passs/pass/g in various places - Fix kernel docs in ntfs - Fix kcalloc() arguments order reported by gcc 14 - Fix uninitialized value in reiserfs" * tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits) reiserfs: fix uninit-value in comp_keys watch_queue: fix kcalloc() arguments order ntfs: dir.c: fix kernel-doc function parameter warnings fs: fix doc comment typo fs tree wide selftests/overlayfs: verify device and inode numbers in /proc/pid/maps fs/proc: show correct device and inode numbers in /proc/pid/maps eventfd: Remove usage of the deprecated ida_simple_xx() API fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation fs/hfsplus: wrapper.c: fix kernel-doc warnings fs: add Jan Kara as reviewer fs/inode: Make relatime_need_update return bool pipe: wakeup wr_wait after setting max_usage file: remove __receive_fd() file: stop exposing receive_fd_user() fs: replace f_rcuhead with f_task_work file: remove pointless wrapper file: s/close_fd_get_file()/file_close_fd()/g Improve __fget_files_rcu() code generation (and thus __fget_light()) file: massage cleanup of files that failed to open fs/pipe: Fix lockdep false-positive in watchqueue pipe_write() ...
2024-01-05	hisi_acc_vfio_pci: Update migration data pointer correctly on saving/resume	Shameer Kolothum
	When the optional PRE_COPY support was added to speed up the device compatibility check, it failed to update the saving/resuming data pointers based on the fd offset. This results in migration data corruption and when the device gets started on the destination the following error is reported in some cases, [ 478.907684] arm-smmu-v3 arm-smmu-v3.2.auto: event 0x10 received: [ 478.913691] arm-smmu-v3 arm-smmu-v3.2.auto: 0x0000310200000010 [ 478.919603] arm-smmu-v3 arm-smmu-v3.2.auto: 0x000002088000007f [ 478.925515] arm-smmu-v3 arm-smmu-v3.2.auto: 0x0000000000000000 [ 478.931425] arm-smmu-v3 arm-smmu-v3.2.auto: 0x0000000000000000 [ 478.947552] hisi_zip 0000:31:00.0: qm_axi_rresp [error status=0x1] found [ 478.955930] hisi_zip 0000:31:00.0: qm_db_timeout [error status=0x400] found [ 478.955944] hisi_zip 0000:31:00.0: qm sq doorbell timeout in function 2 Fixes: d9a871e4a143 ("hisi_acc_vfio_pci: Introduce support for PRE_COPY state transitions") Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20231120091406.780-1-shameerali.kolothum.thodi@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-20	vfio/virtio: Declare virtiovf_pci_aer_reset_done() static	Yishai Hadas
	Declare virtiovf_pci_aer_reset_done() as a static function to prevent the below build warning. "warning: no previous prototype for 'virtiovf_pci_aer_reset_done' [-Wmissing-prototypes]" Fixes: eb61eca0e8c3 ("vfio/virtio: Introduce a vfio driver over virtio devices") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/lkml/20231220143122.63337669@canb.auug.org.au/ Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20231220082456.241973-1-yishaih@nvidia.com Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202312202115.oDmvN1VE-lkp@intel.com/ Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-19	Merge branch 'v6.8/vfio/virtio' into v6.8/vfio/next	Alex Williamson

2023-12-19	vfio/virtio: Introduce a vfio driver over virtio devices	Yishai Hadas
	Introduce a vfio driver over virtio devices to support the legacy interface functionality for VFs. Background, from the virtio spec [1]. -------------------------------------------------------------------- In some systems, there is a need to support a virtio legacy driver with a device that does not directly support the legacy interface. In such scenarios, a group owner device can provide the legacy interface functionality for the group member devices. The driver of the owner device can then access the legacy interface of a member device on behalf of the legacy member device driver. For example, with the SR-IOV group type, group members (VFs) can not present the legacy interface in an I/O BAR in BAR0 as expected by the legacy pci driver. If the legacy driver is running inside a virtual machine, the hypervisor executing the virtual machine can present a virtual device with an I/O BAR in BAR0. The hypervisor intercepts the legacy driver accesses to this I/O BAR and forwards them to the group owner device (PF) using group administration commands. -------------------------------------------------------------------- Specifically, this driver adds support for a virtio-net VF to be exposed as a transitional device to a guest driver and allows the legacy IO BAR functionality on top. This allows a VM which uses a legacy virtio-net driver in the guest to work transparently over a VF which its driver in the host is that new driver. The driver can be extended easily to support some other types of virtio devices (e.g virtio-blk), by adding in a few places the specific type properties as was done for virtio-net. For now, only the virtio-net use case was tested and as such we introduce the support only for such a device. Practically, Upon probing a VF for a virtio-net device, in case its PF supports legacy access over the virtio admin commands and the VF doesn't have BAR 0, we set some specific 'vfio_device_ops' to be able to simulate in SW a transitional device with I/O BAR in BAR 0. The existence of the simulated I/O bar is reported later on by overwriting the VFIO_DEVICE_GET_REGION_INFO command and the device exposes itself as a transitional device by overwriting some properties upon reading its config space. Once we report the existence of I/O BAR as BAR 0 a legacy driver in the guest may use it via read/write calls according to the virtio specification. Any read/write towards the control parts of the BAR will be captured by the new driver and will be translated into admin commands towards the device. In addition, any data path read/write access (i.e. virtio driver notifications) will be captured by the driver and forwarded to the physical BAR which its properties were supplied by the admin command VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO upon the probing/init flow. With that code in place a legacy driver in the guest has the look and feel as if having a transitional device with legacy support for both its control and data path flows. [1] https://github.com/oasis-tcs/virtio-spec/commit/03c2d32e5093ca9f2a17797242fbef88efe94b8c Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20231219093247.170936-10-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-19	vfio/pci: Expose vfio_pci_core_iowrite/read##size()	Yishai Hadas
	Expose vfio_pci_core_iowrite/read##size() to let it be used by drivers. This functionality is needed to enable direct access to some physical BAR of the device with the proper locks/checks in place. The next patches from this series will use this functionality on a data path flow when a direct access to the BAR is needed. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20231219093247.170936-9-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-19	vfio/pci: Expose vfio_pci_core_setup_barmap()	Yishai Hadas
	Expose vfio_pci_core_setup_barmap() to be used by drivers. This will let drivers to mmap a BAR and re-use it from both vfio and the driver when it's applicable. This API will be used in the next patches by the vfio/virtio coming driver. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20231219093247.170936-8-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04	Merge branches 'v6.8/vfio/debugfs', 'v6.8/vfio/pds' and ↵	Alex Williamson
	'v6.8/vfio/type1-account' into v6.8/vfio/next
2023-12-04	vfio/type1: account iommu allocations	Pasha Tatashin
	iommu allocations should be accounted in order to allow admins to monitor and limit the amount of iommu memory. Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20231130200900.2320829-1-pasha.tatashin@soleen.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04	vfio/pds: Add multi-region support	Brett Creeley
	Only supporting a single region/range is limiting, wasteful, and in some cases broken (i.e. when there are large gaps in the iova memory ranges). Fix this by adding support for multiple regions based on what the device tells the driver it can support. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231117001207.2793-7-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04	vfio/pds: Move seq/ack bitmaps into region struct	Brett Creeley
	Since the host seq/ack bitmaps are part of a region move them into struct pds_vfio_region. Also, make use of the bmp_bytes value for validation purposes. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231117001207.2793-6-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04	vfio/pds: Pass region info to relevant functions	Brett Creeley
	A later patch in the series implements multi-region support. That will require specific regions to be passed to relevant functions. Prepare for that change by passing the region structure to relevant functions. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231117001207.2793-5-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04	vfio/pds: Move and rename region specific info	Brett Creeley
	An upcoming change in this series will add support for multiple regions. To prepare for that, move region specific information into struct pds_vfio_region and rename the members for readability. This will reduce the size of the patch that actually implements multiple region support. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231117001207.2793-4-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04	vfio/pds: Only use a single SGL for both seq and ack	Brett Creeley
	Since the seq/ack operations never happen in parallel there is no need for multiple scatter gather lists per region. The current implementation is wasting memory. Fix this by only using a single scatter-gather list for both the seq and ack operations. Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231117001207.2793-3-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04	vfio/pds: Fix calculations in pds_vfio_dirty_sync	Brett Creeley
	The incorrect check is being done for comparing the iova/length being requested to sync. This can cause the dirty sync operation to fail. Fix this by making sure the iova offset added to the requested sync length doesn't exceed the region_size. Also, the region_start is assumed to always be at 0. This can cause dirty tracking to fail because the device/driver bitmap offset always starts at 0, however, the region_start/iova may not. Fix this by determining the iova offset from region_start to determine the bitmap offset. Fixes: f232836a9152 ("vfio/pds: Add support for dirty page tracking") Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231117001207.2793-2-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-12-04	vfio/migration: Add debugfs to live migration driver	Longfang Liu
	There are multiple devices, software and operational steps involved in the process of live migration. An error occurred on any node may cause the live migration operation to fail. This complex process makes it very difficult to locate and analyze the cause when the function fails. In order to quickly locate the cause of the problem when the live migration fails, I added a set of debugfs to the vfio live migration driver. +-------------------------------------------+ \| \| \| \| \| QEMU \| \| \| \| \| +---+----------------------------+----------+ \| ^ \| ^ \| \| \| \| \| \| \| \| v \| v \| +---------+--+ +---------+--+ \|src vfio_dev\| \|dst vfio_dev\| +--+---------+ +--+---------+ \| ^ \| ^ \| \| \| \| v \| \| \| +-----------+----+ +-----------+----+ \|src dev debugfs \| \|dst dev debugfs \| +----------------+ +----------------+ The entire debugfs directory will be based on the definition of the CONFIG_DEBUG_FS macro. If this macro is not enabled, the interfaces in vfio.h will be empty definitions, and the creation and initialization of the debugfs directory will not be executed. vfio \| +---<dev_name1> \| +---migration \| +--state \| +---<dev_name2> +---migration +--state debugfs will create a public root directory "vfio" file. then create a dev_name() file for each live migration device. First, create a unified state acquisition file of "migration" in this device directory. Then, create a public live migration state lookup file "state". Signed-off-by: Longfang Liu <liulongfang@huawei.com> Reviewed-by: Cédric Le Goater <clg@redhat.com> Link: https://lore.kernel.org/r/20231106072225.28577-2-liulongfang@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-11-28	eventfd: simplify eventfd_signal()	Christian Brauner
	Ever since the eventfd type was introduced back in 2007 in commit e1ad7468c77d ("signal/timer/event: eventfd core") the eventfd_signal() function only ever passed 1 as a value for @n. There's no point in keeping that additional argument. Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-2-bd549b14ce0c@kernel.org Acked-by: Xu Yilun <yilun.xu@intel.com> Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl Acked-by: Eric Farman <farman@linux.ibm.com> # s390 Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-27	vfio/pds: Fix possible sleep while in atomic context	Brett Creeley
	The driver could possibly sleep while in atomic context resulting in the following call trace while CONFIG_DEBUG_ATOMIC_SLEEP=y is set: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:283 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 2817, name: bash preempt_count: 1, expected: 0 RCU nest depth: 0, expected: 0 Call Trace: <TASK> dump_stack_lvl+0x36/0x50 __might_resched+0x123/0x170 mutex_lock+0x1e/0x50 pds_vfio_put_lm_file+0x1e/0xa0 [pds_vfio_pci] pds_vfio_put_save_file+0x19/0x30 [pds_vfio_pci] pds_vfio_state_mutex_unlock+0x2e/0x80 [pds_vfio_pci] pci_reset_function+0x4b/0x70 reset_store+0x5b/0xa0 kernfs_fop_write_iter+0x137/0x1d0 vfs_write+0x2de/0x410 ksys_write+0x5d/0xd0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 This can happen if pds_vfio_put_restore_file() and/or pds_vfio_put_save_file() grab the mutex_lock(&lm_file->lock) while the spin_lock(&pds_vfio->reset_lock) is held, which can happen during while calling pds_vfio_state_mutex_unlock(). Fix this by changing the reset_lock to reset_mutex so there are no such conerns. Also, make sure to destroy the reset_mutex in the driver specific VFIO device release function. This also fixes a spinlock bad magic BUG that was caused by not calling spinlock_init() on the reset_lock. Since, the lock is being changed to a mutex, make sure to call mutex_init() on it. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/kvm/1f9bc27b-3de9-4891-9687-ba2820c1b390@moroto.mountain/ Fixes: bb500dbe2ac6 ("vfio/pds: Add VFIO live migration support") Signed-off-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231122192532.25791-3-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-11-27	vfio/pds: Fix mutex lock->magic != lock warning	Brett Creeley
	The following BUG was found when running on a kernel with CONFIG_DEBUG_MUTEXES=y set: DEBUG_LOCKS_WARN_ON(lock->magic != lock) RIP: 0010:mutex_trylock+0x10d/0x120 Call Trace: <TASK> ? __warn+0x85/0x140 ? mutex_trylock+0x10d/0x120 ? report_bug+0xfc/0x1e0 ? handle_bug+0x3f/0x70 ? exc_invalid_op+0x17/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? mutex_trylock+0x10d/0x120 ? mutex_trylock+0x10d/0x120 pds_vfio_reset+0x3a/0x60 [pds_vfio_pci] pci_reset_function+0x4b/0x70 reset_store+0x5b/0xa0 kernfs_fop_write_iter+0x137/0x1d0 vfs_write+0x2de/0x410 ksys_write+0x5d/0xd0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 As shown, lock->magic != lock. This is because mutex_init(&pds_vfio->state_mutex) is called in the VFIO open path. So, if a reset is initiated before the VFIO device is opened the mutex will have never been initialized. Fix this by calling mutex_init(&pds_vfio->state_mutex) in the VFIO init path. Also, don't destroy the mutex on close because the device may be re-opened, which would cause mutex to be uninitialized. Fix this by implementing a driver specific vfio_device_ops.release callback that destroys the mutex before calling vfio_pci_core_release_dev(). Fixes: bb500dbe2ac6 ("vfio/pds: Add VFIO live migration support") Signed-off-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20231122192532.25791-2-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-11-03	Merge tag 'char-misc-6.7-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc updates from Greg KH: "Here is the big set of char/misc and other small driver subsystem changes for 6.7-rc1. Included in here are: - IIO subsystem driver updates and additions (largest part of this pull request) - FPGA subsystem driver updates - Counter subsystem driver updates - ICC subsystem driver updates - extcon subsystem driver updates - mei driver updates and additions - nvmem subsystem driver updates and additions - comedi subsystem dependency fixes - parport driver fixups - cdx subsystem driver and core updates - splice support for /dev/zero and /dev/full - other smaller driver cleanups All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (326 commits) cdx: add sysfs for subsystem, class and revision cdx: add sysfs for bus reset cdx: add support for bus enable and disable cdx: Register cdx bus as a device on cdx subsystem cdx: Create symbol namespaces for cdx subsystem cdx: Introduce lock to protect controller ops cdx: Remove cdx controller list from cdx bus system dts: ti: k3-am625-beagleplay: Add beaglecc1352 greybus: Add BeaglePlay Linux Driver dt-bindings: net: Add ti,cc1352p7 dt-bindings: eeprom: at24: allow NVMEM cells based on old syntax dt-bindings: nvmem: SID: allow NVMEM cells based on old syntax Revert "nvmem: add new config option" MAINTAINERS: coresight: Add missing Coresight files misc: pci_endpoint_test: Add deviceID for J721S2 PCIe EP device support firmware: xilinx: Move EXPORT_SYMBOL_GPL next to zynqmp_pm_feature definition uacce: make uacce_class constant ocxl: make ocxl_class constant cxl: make cxl_class constant misc: phantom: make phantom_class constant ...
2023-11-01	Merge tag 'for-linus-iommufd' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd Pull iommufd updates from Jason Gunthorpe: "This brings three new iommufd capabilities: - Dirty tracking for DMA. AMD/ARM/Intel CPUs can now record if a DMA writes to a page in the IOPTEs within the IO page table. This can be used to generate a record of what memory is being dirtied by DMA activities during a VM migration process. A VMM like qemu will combine the IOMMU dirty bits with the CPU's dirty log to determine what memory to transfer. VFIO already has a DMA dirty tracking framework that requires PCI devices to implement tracking HW internally. The iommufd version provides an alternative that the VMM can select, if available. The two are designed to have very similar APIs. - Userspace controlled attributes for hardware page tables (HWPT/iommu_domain). There are currently a few generic attributes for HWPTs (support dirty tracking, and parent of a nest). This is an entry point for the userspace iommu driver to control the HW in detail. - Nested translation support for HWPTs. This is a 2D translation scheme similar to the CPU where a DMA goes through a first stage to determine an intermediate address which is then translated trough a second stage to a physical address. Like for CPU translation the first stage table would exist in VM controlled memory and the second stage is in the kernel and matches the VM's guest to physical map. As every IOMMU has a unique set of parameter to describe the S1 IO page table and its associated parameters the userspace IOMMU driver has to marshal the information into the correct format. This is 1/3 of the feature, it allows creating the nested translation and binding it to VFIO devices, however the API to support IOTLB and ATC invalidation of the stage 1 io page table, and forwarding of IO faults are still in progress. The series includes AMD and Intel support for dirty tracking. Intel support for nested translation. Along the way are a number of internal items: - New iommu core items: ops->domain_alloc_user(), ops->set_dirty_tracking, ops->read_and_clear_dirty(), IOMMU_DOMAIN_NESTED, and iommu_copy_struct_from_user - UAF fix in iopt_area_split() - Spelling fixes and some test suite improvement" * tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (52 commits) iommufd: Organize the mock domain alloc functions closer to Joerg's tree iommufd/selftest: Fix page-size check in iommufd_test_dirty() iommufd: Add iopt_area_alloc() iommufd: Fix missing update of domains_itree after splitting iopt_area iommu/vt-d: Disallow read-only mappings to nest parent domain iommu/vt-d: Add nested domain allocation iommu/vt-d: Set the nested domain to a device iommu/vt-d: Make domain attach helpers to be extern iommu/vt-d: Add helper to setup pasid nested translation iommu/vt-d: Add helper for nested domain allocation iommu/vt-d: Extend dmar_domain to support nested domain iommufd: Add data structure for Intel VT-d stage-1 domain allocation iommu/vt-d: Enhance capability check for nested parent domain allocation iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with nested HWPTs iommufd/selftest: Add nested domain allocation for mock domain iommu: Add iommu_copy_struct_from_user helper iommufd: Add a nested HW pagetable object iommu: Pass in parent domain with user_data to domain_alloc_user op iommufd: Share iommufd_hwpt_alloc with IOMMUFD_OBJ_HWPT_NESTED iommufd: Derive iommufd_hwpt_paging from iommufd_hw_pagetable ...
2023-10-27	cdx: Create symbol namespaces for cdx subsystem	Abhijit Gangurde
	Create CDX_BUS and CDX_BUS_CONTROLLER symbol namespace for cdx bus subsystem. CDX controller modules are required to import symbols from CDX_BUS_CONTROLLER namespace and other than controller modules to import from CDX_BUS namespace. Signed-off-by: Abhijit Gangurde <abhijit.gangurde@amd.com> Link: https://lore.kernel.org/r/20231017160505.10640-4-abhijit.gangurde@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-24	iommufd/iova_bitmap: Move symbols to IOMMUFD namespace	Joao Martins
	Have the IOVA bitmap exported symbols adhere to the IOMMUFD symbol export convention i.e. using the IOMMUFD namespace. In doing so, import the namespace in the current users. This means VFIO and the vfio-pci drivers that use iova_bitmap_set(). Link: https://lore.kernel.org/r/20231024135109.73787-4-joao.m.martins@oracle.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-10-24	vfio: Move iova_bitmap into iommufd	Joao Martins
	Both VFIO and IOMMUFD will need iova bitmap for storing dirties and walking the user bitmaps, so move to the common dependency into IOMMUFD. In doing so, create the symbol IOMMUFD_DRIVER which designates the builtin code that will be used by drivers when selected. Today this means MLX5_VFIO_PCI and PDS_VFIO_PCI. IOMMU drivers will do the same (in future patches) when supporting dirty tracking and select IOMMUFD_DRIVER accordingly. Given that the symbol maybe be disabled, add header definitions in iova_bitmap.h for when IOMMUFD_DRIVER=n Link: https://lore.kernel.org/r/20231024135109.73787-3-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-10-24	vfio/iova_bitmap: Export more API symbols	Joao Martins
	In preparation to move iova_bitmap into iommufd, export the rest of API symbols that will be used in what could be used by modules, namely: iova_bitmap_alloc iova_bitmap_free iova_bitmap_for_each Link: https://lore.kernel.org/r/20231024135109.73787-2-joao.m.martins@oracle.com Suggested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-10-09	vfio: Fix smatch errors in vfio_combine_iova_ranges()	Alex Williamson
	smatch reports: vfio_combine_iova_ranges() error: uninitialized symbol 'last'. vfio_combine_iova_ranges() error: potentially dereferencing uninitialized 'comb_end'. vfio_combine_iova_ranges() error: potentially dereferencing uninitialized 'comb_start'. These errors are only reachable via invalid input, in the case of @last when we receive an empty rb-tree or for @comb_{start,end} if the rb-tree is empty or otherwise fails to produce a second node that reduces the gap. Add tests with warnings for these cases. Reported-by: Cong Liu <liucong2@kylinos.cn> Link: https://lore.kernel.org/all/20230920095532.88135-1-liucong2@kylinos.cn Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Link: https://lore.kernel.org/r/20231002224325.3150842-1-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-10-03	vfio/cdx: Add parentheses between bitwise AND expression and logical NOT	Nathan Chancellor
	When building with clang, there is a warning (or error with CONFIG_WERROR=y) due to a bitwise AND and logical NOT in vfio_cdx_bm_ctrl(): drivers/vfio/cdx/main.c:77:6: error: logical not is only applied to the left hand side of this bitwise operator [-Werror,-Wlogical-not-parentheses] 77 \| if (!vdev->flags & BME_SUPPORT) \| ^ ~ drivers/vfio/cdx/main.c:77:6: note: add parentheses after the '!' to evaluate the bitwise operator first 77 \| if (!vdev->flags & BME_SUPPORT) \| ^ \| ( ) drivers/vfio/cdx/main.c:77:6: note: add parentheses around left hand side expression to silence this warning 77 \| if (!vdev->flags & BME_SUPPORT) \| ^ \| ( ) 1 error generated. Add the parentheses as suggested in the first note, which is clearly what was intended here. Closes: https://github.com/ClangBuiltLinux/linux/issues/1939 Fixes: 8a97ab9b8b31 ("vfio-cdx: add bus mastering device feature support") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Acked-by: Nikhil Agarwal <nikhil.agarwal@amd.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20231002-vfio-cdx-logical-not-parentheses-v1-1-a8846c7adfb6@kernel.org Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28	vfio/mlx5: Activate the chunk mode functionality	Yishai Hadas
	Now that all pieces are in place, activate the chunk mode functionality based on device capabilities. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230911093856.81910-10-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28	vfio/mlx5: Add support for READING in chunk mode	Yishai Hadas
	Add support for READING in chunk mode. In case the last SAVE command recognized that there was still some image to be read, however, there was no available chunk to use for, this task was delayed for the reader till one chunk will be consumed and becomes available. In the above case, a work will be executed to read in the background the next image from the device. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230911093856.81910-9-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28	vfio/mlx5: Add support for SAVING in chunk mode	Yishai Hadas
	Add support for SAVING in chunk mode, it includes running a work that will fill the next chunk from the device. In case the number of available chunks will reach the MAX_NUM_CHUNKS, the next chunk SAVING will be delayed till the reader will consume one chunk. The next patch from the series will add the reader part of the chunk mode. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230911093856.81910-8-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28	vfio/mlx5: Pre-allocate chunks for the STOP_COPY phase	Yishai Hadas
	This patch is another preparation step towards working in chunk mode. It pre-allocates chunks for the STOP_COPY phase to let the driver use them immediately and prevent an extra allocation upon that phase. Before that patch we had a single large buffer that was dedicated for the STOP_COPY phase as there was a single SAVE in the source for the last image. Once we'll move to chunk mode the idea is to have some small buffers that will be used upon the STOP_COPY phase. The driver will read-ahead from the firmware the full state in small/optimized chunks while letting QEMU/user space read in parallel the available data. Each buffer holds its chunk number to let it be recognized down the road in the coming patches. The chunk buffer size is picked-up based on the minimum size that firmware requires, the total full size and some max value in the driver code which was set to 8MB to achieve some optimized downtime in the general case. As the chunk mode is applicable even if we move directly to STOP_COPY the buffers preparation and some other related stuff is done unconditionally with regards to STOP/PRE-COPY. Note: In that phase in the series we still didn't activate the chunk mode and the first buffer will be used in all the places. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230911093856.81910-7-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28	vfio/mlx5: Rename some stuff to match chunk mode	Yishai Hadas
	Upon chunk mode there may be multiple images that will be read from the device upon STOP_COPY. This patch is some preparation for that mode by replacing the relevant stuff to a better matching name. As part of that, be stricter to recognize PRE_COPY error only when it didn't occur on a STOP_COPY chunk. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230911093856.81910-6-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28	vfio/mlx5: Enable querying state size which is > 4GB	Yishai Hadas
	Once the device supports 'chunk mode' the driver can support state size which is larger than 4GB. In that case the device has the capability to split a single image to multiple chunks as long as the software provides a buffer in the minimum size reported by the device. The driver should query for the minimum buffer size required using QUERY_VHCA_MIGRATION_STATE command with the 'chunk' bit set in its input, in that case, the output will include both the minimum buffer size (i.e. required_umem_size) and also the remaining total size to be reported/used where that it will be applicable. At that point in the series the 'chunk' bit is off, the last patch will activate the feature once all pieces will be ready. Note: Before this change we were limited to 4GB state size as of 4 bytes max value based on the device specification for the query/save/load commands. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230911093856.81910-5-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28	vfio/mlx5: Refactor the SAVE callback to activate a work only upon an error	Yishai Hadas
	Upon a successful SAVE callback there is no need to activate a work, all the required stuff can be done directly. As so, refactor the above flow to activate a work only upon an error. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230911093856.81910-4-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28	vfio/mlx5: Wake up the reader post of disabling the SAVING migration file	Yishai Hadas
	Post of disabling the SAVING migration file, which includes setting the file state to be MLX5_MIGF_STATE_ERROR, call to wake_up_interruptible() on its poll_wait member. This lets any potential reader which is waiting already for data as part of mlx5vf_save_read() to wake up, recognize the error state and return with an error. Post of that we don't need to rely on any other condition to wake up the reader as of the returning of the SAVE command that was previously executed, etc. In addition, this change will simplify error flows (e.g health recovery) once we'll move to chunk mode and multiple SAVE commands may run in the STOP_COPY phase as we won't need to rely any more on a SAVE command to wake-up a potential waiting reader. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230911093856.81910-3-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-28	vfio-cdx: add bus mastering device feature support	Nipun Gupta
	Support Bus master enable and disable on VFIO-CDX devices using VFIO_DEVICE_FEATURE_BUS_MASTER flag over VFIO_DEVICE_FEATURE IOCTL. Co-developed-by: Shubham Rohila <shubham.rohila@amd.com> Signed-off-by: Shubham Rohila <shubham.rohila@amd.com> Signed-off-by: Nipun Gupta <nipun.gupta@amd.com> Link: https://lore.kernel.org/r/20230915045423.31630-3-nipun.gupta@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-22	vfio/mdev: Fix a null-ptr-deref bug for mdev_unregister_parent()	Jinjie Ruan
	Inject fault while probing mdpy.ko, if kstrdup() of create_dir() fails in kobject_add_internal() in kobject_init_and_add() in mdev_type_add() in parent_create_sysfs_files(), it will return 0 and probe successfully. And when rmmod mdpy.ko, the mdpy_dev_exit() will call mdev_unregister_parent(), the mdev_type_remove() may traverse uninitialized parent->types[i] in parent_remove_sysfs_files(), and it will cause below null-ptr-deref. If mdev_type_add() fails, return the error code and kset_unregister() to fix the issue. general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] CPU: 2 PID: 10215 Comm: rmmod Tainted: G W N 6.6.0-rc2+ #20 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:__kobject_del+0x62/0x1c0 Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 51 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b 6b 28 48 8d 7d 10 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 24 01 00 00 48 8b 75 10 48 89 df 48 8d 6b 3c e8 RSP: 0018:ffff88810695fd30 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: ffffffffa0270268 RCX: 0000000000000000 RDX: 0000000000000002 RSI: 0000000000000004 RDI: 0000000000000010 RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed10233a4ef1 R10: ffff888119d2778b R11: 0000000063666572 R12: 0000000000000000 R13: fffffbfff404e2d4 R14: dffffc0000000000 R15: ffffffffa0271660 FS: 00007fbc81981540(0000) GS:ffff888119d00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fc14a142dc0 CR3: 0000000110a62003 CR4: 0000000000770ee0 DR0: ffffffff8fb0bce8 DR1: ffffffff8fb0bce9 DR2: ffffffff8fb0bcea DR3: ffffffff8fb0bceb DR6: 00000000fffe0ff0 DR7: 0000000000000600 PKRU: 55555554 Call Trace: <TASK> ? die_addr+0x3d/0xa0 ? exc_general_protection+0x144/0x220 ? asm_exc_general_protection+0x22/0x30 ? __kobject_del+0x62/0x1c0 kobject_del+0x32/0x50 parent_remove_sysfs_files+0xd6/0x170 [mdev] mdev_unregister_parent+0xfb/0x190 [mdev] ? mdev_register_parent+0x270/0x270 [mdev] ? find_module_all+0x9d/0xe0 mdpy_dev_exit+0x17/0x63 [mdpy] __do_sys_delete_module.constprop.0+0x2fa/0x4b0 ? module_flags+0x300/0x300 ? __fput+0x4e7/0xa00 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7fbc813221b7 Code: 73 01 c3 48 8b 0d d1 8c 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a1 8c 2c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe780e0648 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 00007ffe780e06a8 RCX: 00007fbc813221b7 RDX: 000000000000000a RSI: 0000000000000800 RDI: 000055e214df9b58 RBP: 000055e214df9af0 R08: 00007ffe780df5c1 R09: 0000000000000000 R10: 00007fbc8139ecc0 R11: 0000000000000206 R12: 00007ffe780e0870 R13: 00007ffe780e0ed0 R14: 000055e214df9260 R15: 000055e214df9af0 </TASK> Modules linked in: mdpy(-) mdev vfio_iommu_type1 vfio [last unloaded: mdpy] Dumping ftrace buffer: (ftrace buffer empty) ---[ end trace 0000000000000000 ]--- RIP: 0010:__kobject_del+0x62/0x1c0 Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 51 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b 6b 28 48 8d 7d 10 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 24 01 00 00 48 8b 75 10 48 89 df 48 8d 6b 3c e8 RSP: 0018:ffff88810695fd30 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: ffffffffa0270268 RCX: 0000000000000000 RDX: 0000000000000002 RSI: 0000000000000004 RDI: 0000000000000010 RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed10233a4ef1 R10: ffff888119d2778b R11: 0000000063666572 R12: 0000000000000000 R13: fffffbfff404e2d4 R14: dffffc0000000000 R15: ffffffffa0271660 FS: 00007fbc81981540(0000) GS:ffff888119d00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fc14a142dc0 CR3: 0000000110a62003 CR4: 0000000000770ee0 DR0: ffffffff8fb0bce8 DR1: ffffffff8fb0bce9 DR2: ffffffff8fb0bcea DR3: ffffffff8fb0bceb DR6: 00000000fffe0ff0 DR7: 0000000000000600 PKRU: 55555554 Kernel panic - not syncing: Fatal exception Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: disabled Rebooting in 1 seconds.. Fixes: da44c340c4fe ("vfio/mdev: simplify mdev_type handling") Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Reviewed-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230918115551.1423193-1-ruanjinjie@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-18	vfio/pds: Use proper PF device access helper	Shixiong Ou
	The pci_physfn() helper exists to support cases where the physfn field may not be compiled into the pci_dev structure. We've declared this driver dependent on PCI_IOV to avoid this problem, but regardless we should follow the precedent not to access this field directly. Signed-off-by: Shixiong Ou <oushixiong@kylinos.cn> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230914021332.1929155-1-oushixiong@kylinos.cn Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-09-18	vfio/pds: Add missing PCI_IOV depends	Shixiong Ou
	If PCI_ATS isn't set, then pdev->physfn is not defined. it causes a compilation issue: ../drivers/vfio/pci/pds/vfio_dev.c:165:30: error: ‘struct pci_dev’ has no member named ‘physfn’; did you mean ‘is_physfn’? 165 \| __func__, pci_dev_id(pdev->physfn), pci_id, vf_id, \| ^~~~~~ So adding PCI_IOV depends to select PCI_ATS. Signed-off-by: Shixiong Ou <oushixiong@kylinos.cn> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230906014942.1658769-1-oushixiong@kylinos.cn Fixes: 63f77a7161a2 ("vfio/pds: register with the pds_core PF") Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-08-30	Merge tag 'for-linus-iommufd' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd Pull iommufd updates from Jason Gunthorpe: "On top of the vfio updates is built some new iommufd functionality: - IOMMU_HWPT_ALLOC allows userspace to directly create the low level IO Page table objects and affiliate them with IOAS objects that hold the translation mapping. This is the basic functionality for the normal IOMMU_DOMAIN_PAGING domains. - VFIO_DEVICE_ATTACH_IOMMUFD_PT can be used to replace the current translation. This is wired up to through all the layers down to the driver so the driver has the ability to implement a hitless replacement. This is necessary to fully support guest behaviors when emulating HW (eg guest atomic change of translation) - IOMMU_GET_HW_INFO returns information about the IOMMU driver HW that owns a VFIO device. This includes support for the Intel iommu, and patches have been posted for all the other server IOMMU. Along the way are a number of internal items: - New iommufd kernel APIs: iommufd_ctx_has_group(), iommufd_device_to_ictx(), iommufd_device_to_id(), iommufd_access_detach(), iommufd_ctx_from_fd(), iommufd_device_replace() - iommufd now internally tracks iommu_groups as it needs some per-group data - Reorganize how the internal hwpt allocation flows to have more robust locking - Improve the access interfaces to support detach and replace of an IOAS from an access - New selftests and a rework of how the selftests creates a mock iommu driver to be more like a real iommu driver" Link: https://lore.kernel.org/lkml/ZO%2FTe6LU1ENf58ZW@nvidia.com/ * tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (34 commits) iommufd/selftest: Don't leak the platform device memory when unloading the module iommu/vt-d: Implement hw_info for iommu capability query iommufd/selftest: Add coverage for IOMMU_GET_HW_INFO ioctl iommufd: Add IOMMU_GET_HW_INFO iommu: Add new iommu op to get iommu hardware information iommu: Move dev_iommu_ops() to private header iommufd: Remove iommufd_ref_to_users() iommufd/selftest: Make the mock iommu driver into a real driver vfio: Support IO page table replacement iommufd/selftest: Add IOMMU_TEST_OP_ACCESS_REPLACE_IOAS coverage iommufd: Add iommufd_access_replace() API iommufd: Use iommufd_access_change_ioas in iommufd_access_destroy_object iommufd: Add iommufd_access_change_ioas(_id) helpers iommufd: Allow passing in iopt_access_list_id to iopt_remove_access() vfio: Do not allow !ops->dma_unmap in vfio_pin/unpin_pages() iommufd/selftest: Add a selftest for IOMMU_HWPT_ALLOC iommufd/selftest: Return the real idev id from selftest mock_domain iommufd: Add IOMMU_HWPT_ALLOC iommufd/selftest: Test iommufd_device_replace() iommufd: Make destroy_rwsem use a lock class per object type ...
2023-08-30	Merge tag 'vfio-v6.6-rc1' of https://github.com/awilliam/linux-vfio	Linus Torvalds
	Pull VFIO updates from Alex Williamson: - VFIO direct character device (cdev) interface support. This extracts the vfio device fd from the container and group model, and is intended to be the native uAPI for use with IOMMUFD (Yi Liu) - Enhancements to the PCI hot reset interface in support of cdev usage (Yi Liu) - Fix a potential race between registering and unregistering vfio files in the kvm-vfio interface and extend use of a lock to avoid extra drop and acquires (Dmitry Torokhov) - A new vfio-pci variant driver for the AMD/Pensando Distributed Services Card (PDS) Ethernet device, supporting live migration (Brett Creeley) - Cleanups to remove redundant owner setup in cdx and fsl bus drivers, and simplify driver init/exit in fsl code (Li Zetao) - Fix uninitialized hole in data structure and pad capability structures for alignment (Stefan Hajnoczi) * tag 'vfio-v6.6-rc1' of https://github.com/awilliam/linux-vfio: (53 commits) vfio/pds: Send type for SUSPEND_STATUS command vfio/pds: fix return value in pds_vfio_get_lm_file() pds_core: Fix function header descriptions vfio: align capability structures vfio/type1: fix cap_migration information leak vfio/fsl-mc: Use module_fsl_mc_driver macro to simplify the code vfio/cdx: Remove redundant initialization owner in vfio_cdx_driver vfio/pds: Add Kconfig and documentation vfio/pds: Add support for firmware recovery vfio/pds: Add support for dirty page tracking vfio/pds: Add VFIO live migration support vfio/pds: register with the pds_core PF pds_core: Require callers of register/unregister to pass PF drvdata vfio/pds: Initial support for pds VFIO driver vfio: Commonize combine_ranges for use in other VFIO drivers kvm/vfio: avoid bouncing the mutex when adding and deleting groups kvm/vfio: ensure kvg instance stays around in kvm_vfio_group_add() docs: vfio: Add vfio device cdev description vfio: Compile vfio_group infrastructure optionally vfio: Move the IOMMU_CAP_CACHE_COHERENCY check in __vfio_register_dev() ...