lwn.git - Linux kernel documentation tree maintained by Jonathan Corbet

Age	Commit message	Author
43 hours	Merge tag 'vfio-v7.1-rc1' of https://github.com/awilliam/linux-vfio	Linus Torvalds
	Pull VFIO updates from Alex Williamson: - Update QAT vfio-pci variant driver for Gen 5, 420xx devices (Vijay Sundar Selvamani, Suman Kumar Chakraborty, Giovanni Cabiddu) - Fix vfio selftest MMIO DMA mapping selftest (Alex Mastro) - Conversions to const struct class in support of class_create() deprecation (Jori Koolstra) - Improve selftest compiler compatibility by avoiding initializer on variable-length array (Manish Honap) - Define new uAPI for drivers supporting migration to advise user-space of new initial data for reducing target startup latency. Implemented for mlx5 vfio-pci variant driver (Yishai Hadas) - Enable vfio selftests on aarch64, not just cross-compiles reporting arm64 (Ted Logan) - Update vfio selftest driver support to include additional DSA devices (Yi Lai) - Unconditionally include debugfs root pointer in vfio device struct, avoiding a build failure seen in hisi_acc variant driver without debugfs otherwise (Arnd Bergmann) - Add support for the s390 ISM (Internal Shared Memory) device via a new variant driver. The device is unique in the size of its BAR space (256TiB) and lack of mmap support (Julian Ruess) - Enforce that vfio-pci drivers implement a name in their ops structure for use in sequestering SR-IOV VFs (Alex Williamson) - Prune leftover group notifier code (Paolo Bonzini) - Fix Xe vfio-pci variant driver to avoid migration support as a dependency in the reset path and missing release call (Michał Winiarski) * tag 'vfio-v7.1-rc1' of https://github.com/awilliam/linux-vfio: (23 commits) vfio/xe: Add a missing vfio_pci_core_release_dev() vfio/xe: Reorganize the init to decouple migration from reset vfio: remove dead notifier code vfio/pci: Require vfio_device_ops.name MAINTAINERS: add VFIO ISM PCI DRIVER section vfio/ism: Implement vfio_pci driver for ISM devices vfio/pci: Rename vfio_config_do_rw() to vfio_pci_config_rw_single() and export it vfio: unhide vdev->debug_root vfio/qat: add support for Intel QAT 420xx VFs vfio: selftests: Support DMR and GNR-D DSA devices vfio: selftests: Build tests on aarch64 vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO vfio/mlx5: consider inflight SAVE during PRE_COPY net/mlx5: Add IFC bits for migration state vfio: Adapt drivers to use the core helper vfio_check_precopy_ioctl vfio: Add support for VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 vfio: Define uAPI for re-init initial bytes during the PRE_COPY phase vfio: selftests: Fix VLA initialisation in vfio_pci_irq_set() vfio: uapi: fix comment typo vfio: mdev: replace mtty_dev->vd_class with a const struct class ...
6 days	Merge branch 'mlx5-next' of ↵	Jakub Kicinski
	git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Tariq Toukan says: ==================== mlx5-next updates 2026-04-09 * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Add icm_mng_function_id_mode cap bit net/mlx5: Rename MLX5_PF page counter type to MLX5_SELF net/mlx5: Add vhca_id_type bit to alias context mlx5: Remove redundant iseg base ==================== Link: https://patch.msgid.link/20260409110431.154894-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
9 days	net/mlx5: Add icm_mng_function_id_mode cap bit	Moshe Shemesh
	Introduce the capability bit icm_mng_function_id_mode to indicate that the device firmware uses vhca_id instead of function_id as the effective identifier for the firmware commands MANAGE_PAGES, QUERY_PAGES, and page request event. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Akiva Goldberger <agoldberger@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260403090028.137783-3-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
9 days	net/mlx5: Rename MLX5_PF page counter type to MLX5_SELF	Moshe Shemesh
	The MLX5_PF enum value in mlx5_func_type is used to track firmware page allocations for the page manager function itself, which is either the ECPF on SmartNIC systems or the host PF when there is no ECPF. Rename it to MLX5_SELF to accurately reflect that this counter tracks pages allocated by the manager for its own use, regardless of whether it is a PF or ECPF. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260403090028.137783-2-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-03-22	net/mlx5: Add vhca_id_type bit to alias context	Patrisious Haddad
	Add a vhca_id_type bit to the alias context, which indicates the type of the vhca_id passed in vhca_id_to_be_accessed: either HW or SW. Note that SW_VHCA_ID must be used for the alias to work properly after migration. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260319122211.27384-3-tariqt@nvidia.com Reviewed-by: Joe Damato <joe@dama.to> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-03-22	mlx5: Remove redundant iseg base	Parav Pandit
	iseg_base and base_addr both point to BAR0, making iseg_base redundant. Remove iseg_base and rely on base_addr instead, reducing the size of struct mlx5_core_dev. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Shay Drori <shayd@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260319122211.27384-2-tariqt@nvidia.com Reviewed-by: Joe Damato <joe@dama.to> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-03-19	net/mlx5: Add IFC bits for migration state	Yishai Hadas
	Add the relevant IFC bits for querying an extra migration state from the device. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20260317161753.18964-5-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2026-03-18	Merge branch 'mlx5-next' of ↵	Jakub Kicinski
	git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Tariq Toukan says: ==================== mlx5-next updates 2026-03-17 The following pull-request contains common mlx5 updates * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Expose MLX5_UMR_ALIGN definition {net/RDMA}/mlx5: Add LAG demux table API and vport demux rules net/mlx5: Add VHCA RX flow destination support for FW steering net/mlx5: LAG, replace mlx5_get_dev_index with LAG sequence number net/mlx5: E-switch, modify peer miss rule index to vhca_id net/mlx5: LAG, use xa_alloc to manage LAG device indices net/mlx5: LAG, replace pf array with xarray net/mlx5: Add silent mode set/query and VHCA RX IFC bits net/mlx5: Add IFC bits for shared headroom pool PBMC support net/mlx5: Expose TLP emulation capabilities net/mlx5: Add TLP emulation device capabilities ==================== Link: https://patch.msgid.link/20260317075844.12066-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-16	net/mlx5: Expose MLX5_UMR_ALIGN definition	Tariq Toukan
	Expose HW constant value in a shared header, to be used by core/EN drivers. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260309093435.1850724-10-tariqt@nvidia.com Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-03-16	{net/RDMA}/mlx5: Add LAG demux table API and vport demux rules	Shay Drory
	Downstream patches will introduce SW-only LAG (e.g. shared_fdb without HW LAG). In this mode the firmware cannot create the LAG demux table, but vport demuxing is still required. Move LAG demux flow-table ownership to the LAG layer and introduce APIs to init/cleanup the demux table and add/delete per-vport rules. Adjust the RDMA driver to use the new APIs. In this mode, the LAG layer will create a flow group that matches vport metadata. Vports that are not native to the LAG master eswitch add the demux rule during IB representor load and remove it on unload. The demux rule forwards traffic from said vports to their native eswitch manager via a new dest type - MLX5_FLOW_DESTINATION_TYPE_VHCA_RX. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260309093435.1850724-9-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-03-16	net/mlx5: Add VHCA RX flow destination support for FW steering	Shay Drory
	Introduce MLX5_FLOW_DESTINATION_TYPE_VHCA_RX as a new flow steering destination type. Wire the new destination through flow steering command setup by mapping it to MLX5_IFC_FLOW_DESTINATION_TYPE_VHCA_RX and passing the vhca id, extend forward-destination validation to accept it, and teach the flow steering tracepoint formatter to print rx_vhca_id. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260309093435.1850724-8-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-03-16	net/mlx5: LAG, replace mlx5_get_dev_index with LAG sequence number	Shay Drory
	Introduce mlx5_lag_get_dev_seq() which returns a device's sequence number within the LAG: master is always 0, remaining devices numbered sequentially. This provides a stable index for peer flow tracking and vport ordering without depending on native_port_num. Replace mlx5_get_dev_index() usage in en_tc.c (peer flow array indexing) and ib_rep.c (vport index ordering) with the new API. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260309093435.1850724-7-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
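The numbering rule described in this commit (master is 0, the remaining devices are numbered sequentially) can be illustrated with a small standalone sketch. This is not the driver's mlx5_lag_get_dev_seq(); the struct, the function name lag_dev_seq(), and the choice of array order for the non-master devices are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for a LAG member; not the driver's real structure. */
struct lag_member {
	int dev_id;     /* illustrative device identifier */
	int is_master;  /* nonzero for the LAG master */
};

/*
 * Return the sequence number of members[target] within the LAG:
 * the master is always 0, the remaining members are numbered 1, 2, ...
 * in array order (an assumption). Returns -1 if target is out of range.
 */
int lag_dev_seq(const struct lag_member *members, size_t n, size_t target)
{
	size_t i;
	int seq = 1;

	if (target >= n)
		return -1;
	if (members[target].is_master)
		return 0;

	/* Count the non-master members that precede the target. */
	for (i = 0; i < target; i++)
		if (!members[i].is_master)
			seq++;
	return seq;
}
```

Because the result depends only on membership and on which device is the master, it can index a per-peer array without reference to any port number.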
2026-03-16	net/mlx5: Add silent mode set/query and VHCA RX IFC bits	Shay Drory
	Update the mlx5 IFC headers with newly defined capability and command-layout bits: - Add silent_mode_query and rename silent_mode to silent_mode_set cap fields. - Add forward_vhca_rx and MLX5_IFC_FLOW_DESTINATION_TYPE_VHCA_RX. - Expose silent mode fields in the L2 table query command structures. Update the SD support check to use the new capability name (silent_mode_set) to match the updated IFC definition. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260309093435.1850724-3-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-03-16	net/mlx5: Add IFC bits for shared headroom pool PBMC support	Alexei Lazar
	Add hardware interface definitions for shared headroom pool (SHP) in port buffer management: - shp_pbmc_pbsr_support: capability bit in PCAM enhanced features indicating device support for shared headroom pool in PBMC/PBSR. - shared_headroom_pool: buffer entry in PBMC register (pbmc_reg_bits) for the shared headroom pool configuration, reusing the bufferx layout; reduce trailing reserved region accordingly. Signed-off-by: Alexei Lazar <alazar@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260309093435.1850724-2-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-03-14	net/mlx5: Add a shared devlink instance for PFs on same chip	Jiri Pirko
	Use the previously introduced shared devlink infrastructure to create a shared devlink instance for mlx5 PFs that reside on the same physical chip. The shared instance is identified by the chip's serial number extracted from PCI VPD (V3 keyword, with fallback to serial number for older devices). Each PF that probes calls mlx5_shd_init() which extracts the chip serial number and uses devlink_shd_get() to get or create the shared instance. When a PF is removed, mlx5_shd_uninit() calls devlink_shd_put() to release the reference. The shared instance is automatically destroyed when the last PF is removed. Make the PF devlink instances nested in this shared devlink instance, allowing userspace to identify which PFs belong to the same physical chip. Example: pci/0000:08:00.0: index 0 nested_devlink: auxiliary/mlx5_core.eth.0 devlink_index/1: index 1 nested_devlink: pci/0000:08:00.0 pci/0000:08:00.1 auxiliary/mlx5_core.eth.0: index 2 pci/0000:08:00.1: index 3 nested_devlink: auxiliary/mlx5_core.eth.1 auxiliary/mlx5_core.eth.1: index 4 Signed-off-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20260312100407.551173-14-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-05	net/mlx5: Expose TLP emulation capabilities	Maher Sanalla
	Expose and query TLP device emulation caps on driver load. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2026-03-05	net/mlx5: Add TLP emulation device capabilities	Maher Sanalla
	Introduce the hardware structures and definitions needed for the driver support of TLP emulation in mlx5_ifc. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2026-02-19	net/mlx5: Fix multiport device check over light SFs	Shay Drory
	The driver uses the num_vhca_ports capability to distinguish between a multiport master device and a multiport slave device. num_vhca_ports is a capability the driver sets according to the MAX num_vhca_ports capability reported by FW. Light SFs, however, don't set this capability, which leads to wrong results whenever a light SF checks whether it is a multiport master or slave. Therefore, use the MAX capability to distinguish between master and slave devices. Fixes: e71383fb9cd1 ("net/mlx5: Light probe local SFs") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <Jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260218072904.1764634-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-05	net/mlx5: Fix 1600G link mode enum naming	Yael Chemla
	Rename TAUI/TBASE to GAUI/GBASE in 1600G link mode identifier and its usage in ethtool and link-info tables. Reported-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Signed-off-by: Yael Chemla <ychemla@nvidia.com> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reported-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Signed-off-by: Yael Chemla <ychemla@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://patch.msgid.link/20260204194324.1723534-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-04	net/mlx5e: RX, Drop oversized packets in non-linear mode	Dragos Tatulea
	Currently the driver has an inconsistent behaviour between modes when it comes to oversized packets that are not dropped through the physical MTU check in HW. This can happen for Multi Host configurations where each port has a different MTU. Current behavior: 1) Striding RQ in linear mode drops the packet in SW and counts it with oversize_pkts_sw_drop. 2) Striding RQ in non-linear mode allows it like a normal packet. 3) Legacy RQ can't receive oversized packets by design: the RX WQE uses MTU sized packet buffers. This inconsistency is not a violation of the netdev policy [1] but it is better to be consistent across modes. This patch aligns (2) with (1) and (3). One exception is added for LRO: don't drop the oversized packet if it is an LRO packet. As now rq->hw_mtu always needs to be updated during the MTU change flow, drop the reset avoidance optimization from mlx5e_change_mtu(). Extract the CQE LRO segments reading into a helper function as it is used twice now. [1] Documentation/networking/netdevices.rst#L205 Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260203072130.1710255-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-13	net/mlx5: Add IFC bits for extended ETS rate limit bandwidth value	Alexei Lazar
	Add hardware interface definitions to support extended bandwidth rate limiting in the QoS Enhanced Transmission Selection (ETS) configuration. The new fields include: - max_bw_value: extended from 8-bit to 16-bit in ets_tcn_config_reg, simplifying the implementation by using a single field instead of separate MSB/LSB fields. - qetcr_qshr_max_bw_val_msb: capability bit in qcam_qos_feature_cap_mask indicating device support for the extended 16-bit max_bw_value field. These interface additions are prerequisites for increasing the per-TC rate limit beyond 255 Gbps to support higher-bandwidth NICs. Signed-off-by: Alexei Lazar <alazar@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1768200608-1543180-1-git-send-email-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
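To see why widening max_bw_value matters, the following sketch clamps a requested rate to what the field can hold. It is a minimal illustration only: the function name ets_encode_max_bw(), the clamping behavior, and the assumption that the field counts whole rate units (so that an 8-bit field tops out at 255) are not taken from the driver.

```c
#include <stdint.h>

/*
 * Clamp a requested per-TC rate to the widest value the field can carry.
 * has_ext_cap models the qetcr_qshr_max_bw_val_msb capability bit:
 * without it only 8 bits are usable, with it the full 16 bits.
 * Clamping (rather than rejecting) is an assumption of this sketch.
 */
uint16_t ets_encode_max_bw(uint32_t rate, int has_ext_cap)
{
	uint32_t limit = has_ext_cap ? UINT16_MAX : UINT8_MAX;

	return (uint16_t)(rate > limit ? limit : rate);
}
```

With the capability absent, a request of 400 units saturates at 255; with it present, the same request is representable as-is.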
2026-01-05	net/mlx5: Add support for querying bond speed	Or Har-Toov
	Add mlx5_lag_query_bond_speed() to query the aggregated speed of lag configurations with a bond device. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-05	net/mlx5: Handle port and vport speed change events in MPESW	Or Har-Toov
	Add port change event handling logic for MPESW LAG mode, ensuring VFs are updated when the speed of LAG physical ports changes. This triggers a speed update workflow when relevant port state changes occur, enabling consistent and accurate reporting of VF bandwidth. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-01-05	net/mlx5: Propagate LAG effective max_tx_speed to vports	Or Har-Toov
	Currently, vports report only their parent's uplink speed, which in LAG setups does not reflect the true aggregated bandwidth. This makes it hard for upper-layer software to optimize load balancing decisions based on accurate bandwidth information. Fix the issue by calculating the possible maximum speed of a LAG as the sum of speeds of all active uplinks that are part of the LAG. Propagate this effective max speed to vports associated with the LAG whenever a relevant event occurs, such as physical port link state changes or LAG creation/modification. With this change, upper-layer components receive accurate bandwidth information corresponding to the active members of the LAG and can make better load balancing decisions. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
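The calculation described here, the sum of the speeds of all active uplinks in the LAG, can be sketched as follows. The struct uplink type, the Mbps unit, and the function name are hypothetical; the real driver queries port state and speed through its own interfaces.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one LAG uplink. */
struct uplink {
	uint32_t speed_mbps;  /* link speed, in Mbps (assumed unit) */
	int active;           /* nonzero if the uplink is an active LAG member */
};

/* Effective max speed of the LAG: sum over active uplinks only. */
uint64_t lag_effective_max_speed(const struct uplink *ports, size_t n)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (ports[i].active)
			sum += ports[i].speed_mbps;
	return sum;
}
```

Recomputing this value on each link state change or LAG modification is what keeps the speed reported to the vports in step with the set of active members.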
2026-01-05	net/mlx5: Add max_tx_speed and its CAP bit to IFC	Or Har-Toov
	Introduce the max_tx_speed field to the query and modify_vport_state structures. Add the esw_vport_state_max_tx_speed capability bit, indicating that the firmware supports modifying the max_tx_speed field via the MODIFY_VPORT_STATE command. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-19	net/mlx5: Move SF dev table notifier registration outside the PF devlink lock	Cosmin Ratiu
	This completes the previous patches by moving notifier registration for SF dev tables outside the devlink locked critical section in mlx5_init_one() / mlx5_uninit_one() and into the mlx5_mdev_init() / mlx5_mdev_uninit() functions. This is only done for non-SFs, since SFs do not have a SF HW table themselves. After this patch, notifiers can grab the PF devlink lock (soon to be necessary) without creating a locking cycle. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1763325940-1231508-7-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19	net/mlx5: Move the SF table notifiers outside the devlink lock	Cosmin Ratiu
	Move the SF table notifiers registration/unregistration outside of mlx5_init_one() / mlx5_uninit_one() and into the mlx5_mdev_init() / mlx5_mdev_uninit() functions. This is only done for non-SFs, since SFs do not have a SF table themselves and thus don't need notifiers. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1763325940-1231508-6-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19	net/mlx5: Move the SF HW table notifier outside the devlink lock	Cosmin Ratiu
	Move the SF HW table notifier registration/unregistration outside of mlx5_init_one() / mlx5_uninit_one() and into the mlx5_mdev_init() / mlx5_mdev_uninit() functions. This is only done for non-SFs, since SFs do not have a SF HW table themselves. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1763325940-1231508-5-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19	net/mlx5: Move the vhca event notifier outside of the devlink lock	Cosmin Ratiu
	The vhca event notifier consists of an atomic notifier for vhca state changes (used for SF events), multiple workqueues and a blocking notifier chain for delivering the vhca state change events for further processing. This patch moves the vhca notifier head outside of mlx5_init_one() / mlx5_uninit_one() and into the mlx5_mdev_init() / mlx5_mdev_uninit() functions. This allows called notifiers to grab the PF devlink lock which was previously impossible because it would create a circular lock dependency. mlx5_vhca_event_stop() is now called earlier in the cleanup phase and flushes the workqueues to ensure that after the call, there are no pending events. This simplifies the cleanup flow for vhca event consumers. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1763325940-1231508-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-19	net/mlx5: Move the esw mode notifier chain outside the devlink lock	Cosmin Ratiu
	The esw mode change notifier chain is initialized/cleaned up in mlx5_init_one() / mlx5_uninit_one() with the devlink lock held. Move the notifier head from the eswitch struct into mlx5_priv directly, and initialize it outside the critical section. This will allow notifier registration to happen earlier in the init procedure in subsequent patches. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1763325940-1231508-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-18	net/mlx5: Abort new commands if all command slots are stalled	Saeed Mahameed
	In case of a FW issue, FW might stop responding to FW commands, locking up the kernel for a long period of time, e.g. rtnl_lock is held while ethtool tries to collect stats and waits for FW to respond to multiple commands, all of which will time out. While there's no immediate indication of the FW lockout, we can safely assume that something is wrong when all command slots are busy and in a timeout state and no FW completion was received on any of them. In that case, immediately start failing new commands. Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1763415729-1238421-5-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
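The stall heuristic from this commit message (every command slot is busy, timed out, and without a FW completion) can be sketched in isolation. The struct cmd_slot fields and both function names below are hypothetical, and the real driver's bookkeeping and locking are not modeled.

```c
#include <stddef.h>

/* Hypothetical per-slot state; the real driver tracks this differently. */
struct cmd_slot {
	int busy;       /* a command currently occupies the slot */
	int timed_out;  /* that command exceeded its timeout */
	int completed;  /* FW delivered a completion for it */
};

/* Return 1 when every slot is busy, timed out, and has no FW completion. */
int all_slots_stalled(const struct cmd_slot *slots, size_t n)
{
	size_t i;

	if (n == 0)
		return 0;
	for (i = 0; i < n; i++)
		if (!slots[i].busy || !slots[i].timed_out || slots[i].completed)
			return 0;
	return 1;
}

/* Return 0 if a new command may be issued, -1 if it should fail fast. */
int cmd_admit(const struct cmd_slot *slots, size_t n)
{
	return all_slots_stalled(slots, n) ? -1 : 0;
}
```

A single slot that is idle, still within its timeout, or completed is enough to keep admitting commands, so the fast-fail path only triggers when there is no sign of FW progress at all.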
2025-11-14	Merge branch 'mlx5-next' of ↵	Jakub Kicinski
	git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Tariq Toukan says: ==================== mlx5-next updates 2025-11-13 The following pull-request contains common mlx5 updates * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Expose definition for 1600Gbps link mode net/mlx5: fs, set non default device per namespace net/mlx5: fs, Add other_eswitch support for steering tables net/mlx5: Add OTHER_ESWITCH HW capabilities net/mlx5: Add direct ST mode support for RDMA PCI/TPH: Expose pcie_tph_get_st_table_loc() {rdma,net}/mlx5: Query vports mac address from device ==================== Link: https://patch.msgid.link/1763027252-1168760-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-13	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Jakub Kicinski
	Cross-merge networking fixes after downstream PR (net-6.18-rc6). No conflicts, adjacent changes in: drivers/net/phy/micrel.c 96a9178a29a6 ("net: phy: micrel: lan8814 fix reset of the QSGMII interface") 61b7ade9ba8c ("net: phy: micrel: Add support for non PTP SKUs for lan8814") and a trivial one in tools/testing/selftests/drivers/net/Makefile. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-12	net/mlx5: Expose definition for 1600Gbps link mode	Tariq Toukan
	This patch exposes new link mode for 1600Gbps, utilizing 8 lanes at 200Gbps per lane. Co-developed-by: Yael Chemla <ychemla@nvidia.com> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1762863888-1092798-1-git-send-email-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-11	mlx5: Fix default values in create CQ	Akiva Goldberger
	Currently, CQs without a completion function are assigned the mlx5_add_cq_to_tasklet function by default. This is problematic since only user CQs created through the mlx5_ib driver are intended to use this function. Additionally, all CQs that will use doorbells instead of polling for completions must call mlx5_cq_arm. However, the default CQ creation flow leaves a valid value in the CQ's arm_db field, allowing FW to send interrupts to polling-only CQs in certain corner cases. These two factors would allow a polling-only kernel CQ to be triggered by an EQ interrupt and call a completion function intended only for user CQs, causing a null pointer exception. Some areas in the driver have prevented this issue with one-off fixes but did not address the root cause. This patch fixes the described issue by adding defaults to the create CQ flow. It adds a default dummy completion function to protect against null pointer exceptions, and it sets an invalid command sequence number by default in kernel CQs to prevent the FW from sending an interrupt to the CQ until it is armed. User CQs are responsible for their own initialization values. Callers of mlx5_core_create_cq are responsible for changing the completion function and arming the CQ per their needs. Fixes: cdd04f4d4d71 ("net/mlx5: Add support to create SQ and CQ for ASO") Signed-off-by: Akiva Goldberger <agoldberger@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Leon Romanovsky <leon@kernel.org> Link: https://patch.msgid.link/1762681743-1084694-1-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-11	net/mlx5: E-Switch, support eswitch inactive mode	Saeed Mahameed
	Add support for eswitch switchdev inactive mode. Inactive mode: drop all traffic going to FDB, remove mpfs l2 rules and disconnect adjacent vports. Active mode: traffic flows through FDB, mpfs table populated, and adjacent vports are connected. Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Adithya Jayachandran <ajayachandra@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20251108070404.1551708-4-saeed@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-09	net/mlx5: fs, set non default device per namespace	Patrisious Haddad
	Add mlx5_fs_set_root_dev() function which swaps the root namespace core device with another one for a given table_type. It is intended for usage only by RDMA_TRANSPORT tables in case of LAG configuration, to allow the creation of tables during LAG always through the LAG master device, which is valid since during LAG the master is allowed to manage the RDMA_TRANSPORT tables of its slaves. In addition move the table_type enum to global include to allow its use in a downstream patch in the RDMA driver. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-3-98bb707b5d57@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09	net/mlx5: fs, Add other_eswitch support for steering tables	Patrisious Haddad
	Add other_eswitch support, which allows creating flow tables above vports that reside on different esw managers. The new flag MLX5_FLOW_TABLE_OTHER_ESWITCH indicates if the esw_owner_vhca_id attribute is supported. Note that this is only supported if the Advanced-RDMA cap rdma_transport_manager_other_eswitch is set, and it is the caller's responsibility to check that. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-2-98bb707b5d57@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09	net/mlx5: Add OTHER_ESWITCH HW capabilities	Patrisious Haddad
	Add OTHER_ESWITCH capabilities which includes other_eswitch and eswitch_owner_vhca_id to all steering objects. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-1-98bb707b5d57@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-28	net/mlx5: Add balance ID support for LAG multiplane groups	Mark Bloch
	Implement balance ID support for multiplane LAG configurations. This feature enables per-multiplane group load balancing by extending the software system image GUID with a balance ID component. Key implementations: - Enable lag_per_mp_group capability when supported by hardware. - Append load_balance_id to software system image GUID when conditions are met. - Increase MLX5_SW_IMAGE_GUID_MAX_BYTES from 8 to 9 to accommodate the extra byte. The balance ID is appended to the system image GUID only when both load_balance_id and lag_per_mp_group capabilities are available, ensuring backward compatibility while enabling enhanced LAG functionality. This enhancement allows for more granular load balancing control in complex multi-plane LAG deployments, improving network performance and flexibility. Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Shay Drori <shayd@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1761211020-925651-6-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
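The conditional append described above can be sketched as a function that fills a 9-byte buffer and returns how many bytes are valid. This is an illustration under assumptions: the function name, the big-endian layout of the 8-byte GUID, and the parameter names are not taken from the driver; only the 8-versus-9 byte lengths and the two-capability condition come from the commit message.

```c
#include <stddef.h>
#include <stdint.h>

#define SW_IMAGE_GUID_MAX_BYTES 9  /* mirrors the 8 -> 9 increase */

/*
 * Build a software system image GUID into out[] and return its length.
 * The balance ID is appended only when both capabilities are present,
 * so devices without them keep the original 8-byte identifier.
 */
size_t build_sw_image_guid(uint64_t hw_guid, int has_load_balance_id,
			   int has_lag_per_mp_group, uint8_t balance_id,
			   uint8_t out[SW_IMAGE_GUID_MAX_BYTES])
{
	size_t len = 0;
	int shift;

	/* Assumed byte order: most significant byte first. */
	for (shift = 56; shift >= 0; shift -= 8)
		out[len++] = (uint8_t)(hw_guid >> shift);

	if (has_load_balance_id && has_lag_per_mp_group)
		out[len++] = balance_id;

	return len;
}
```

Returning the length alongside the buffer is what lets callers compare identifiers of either size with a plain byte comparison.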
2025-10-28	net/mlx5: Add software system image GUID infrastructure	Mark Bloch
	Replace direct hardware system image GUID usage with a new software system image GUID function that supports variable-length identifiers. Key changes: - Add mlx5_query_nic_sw_system_image_guid() function with length parameter. - Update all callsites to use the new function and buffer/length approach. - Modify mapping contexts to use byte arrays instead of u64 keys. - Update devcom matching to support variable-length keys. - Change mlx5_same_hw_devs() to use buffer comparison instead of u64. This refactoring prepares the infrastructure for balance ID support, which requires extending the system image GUID with additional data. The change maintains backward compatibility while enabling future enhancements. Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Shay Drori <shayd@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1761211020-925651-3-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
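To show what moving from u64 comparison to buffer comparison can look like, here is a minimal userspace sketch. The function same_sw_image_guid() is a hypothetical stand-in, not the signature of mlx5_same_hw_devs(); it only assumes that each device exposes its identifier as a byte buffer plus a length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical comparison of two variable-length identifiers: they match
 * only if both the lengths and the bytes are equal, so an 8-byte GUID
 * never matches a 9-byte GUID that merely shares its first 8 bytes.
 */
static bool same_sw_image_guid(const uint8_t *a, size_t a_len,
                               const uint8_t *b, size_t b_len)
{
    assert(a != NULL && b != NULL);

    return a_len == b_len && memcmp(a, b, a_len) == 0;
}
```

Checking the length first keeps identifiers of different sizes from aliasing, which a plain u64 key cannot express once an extra byte is appended.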
2025-10-24	{rdma,net}/mlx5: Query vports mac address from device	Adithya Jayachandran
	Before this patch, we cleared the MAC address of vports whenever switchdev or legacy mode was enabled. This change preserves the vport MAC address across eswitch mode changes. Vports hold information for VFs/SFs such as the permanent MAC address. The VF/SF MAC can be set through either the iproute VF interface or the devlink function interface. For no obvious reason we reset it to 0 on switchdev/legacy mode changes; this patch fixes that, to align with other vport information that is never reset, e.g. GUID, MTU, promisc mode, etc. Signed-off-by: Adithya Jayachandran <ajayachandra@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Acked-by: Leon Romanovsky <leon@kernel.org> # RDMA
2025-10-23	net/mlx5: Add PPHCR to PCAM supported registers mask	Alexei Lazar
	Add the PPHCR bit to the port_access_reg_cap_mask field of PCAM register to indicate that the device supports the PPHCR register and the RS-FEC histogram feature. Signed-off-by: Alexei Lazar <alazar@nvidia.com> Reviewed-by: Yael Chemla <ychemla@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1761136182-918470-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-29	Merge tag 'mlx5-next-lag' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux	Jakub Kicinski
	Tariq Toukan says: ==================== mlx5-next updates 2025-09-28 * tag 'mlx5-next-lag' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: IFC add balance ID and LAG per MP group bits net/mlx5: Add IFC bit for TIR/SQ order capability ==================== Link: https://patch.msgid.link/1759093989-841873-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-28	net/mlx5: IFC add balance ID and LAG per MP group bits	Mark Bloch
	Add interface definitions for load balance ID and LAG per multiplane group functionality. This patch introduces the hardware capability bits needed to support balance ID in multiplane LAG configurations. The new fields include: - load_balance_id: 4-bit field for balance identifier. - lag_per_mp_group: capability bit for LAG per multiplane group support. These interface additions are prerequisites for implementing balance ID support in the MLX5 driver. Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Shay Drori <shayd@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1758521191-814350-3-git-send-email-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-28	net/mlx5: Add IFC bit for TIR/SQ order capability	Tariq Toukan
	Before this cap, firmware requested a certain creation order between TIR objects and SQs of the same transport domain to properly support the self loopback prevention feature. If order is not preserved, explicit modify_tir operations are necessary after the opening of the SQs. When set, this cap bit indicates that this firmware requirement / limitation no longer holds. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1758521191-814350-2-git-send-email-tariqt@nvidia.com Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-09-25	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Jakub Kicinski
	Cross-merge networking fixes after downstream PR (net-6.17-rc8). Conflicts: drivers/net/can/spi/hi311x.c 6b6968084721 ("can: hi311x: fix null pointer dereference when resuming from sleep before interface was enabled") 27ce71e1ce81 ("net: WQ_PERCPU added to alloc_workqueue users") https://lore.kernel.org/72ce7599-1b5b-464a-a5de-228ff9724701@kernel.org net/smc/smc_loopback.c drivers/dibs/dibs_loopback.c a35c04de2565 ("net/smc: fix warning in smc_rx_splice() when calling get_page()") cc21191b584c ("dibs: Move data path to dibs layer") https://lore.kernel.org/74368a5c-48ac-4f8e-a198-40ec1ed3cf5f@kernel.org Adjacent changes: drivers/net/dsa/lantiq/lantiq_gswip.c c0054b25e2f1 ("net: dsa: lantiq_gswip: move gswip_add_single_port_br() call to port_setup()") 7a1eaef0a791 ("net: dsa: lantiq_gswip: support model-specific mac_select_pcs()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-23	net/mlx5: fs, fix UAF in flow counter release	Moshe Shemesh
	Fix a kernel trace [1] caused by releasing an HWS action of a local flow counter in mlx5_cmd_hws_delete_fte(), where the HWS action refcount and mutex were not initialized and the counter struct could already be freed when deleting the rule. Fix it by adding the missing initializations and adding refcount for the local flow counter struct. [1] Kernel log: Call Trace: <TASK> dump_stack_lvl+0x34/0x48 mlx5_fs_put_hws_action.part.0.cold+0x21/0x94 [mlx5_core] mlx5_fc_put_hws_action+0x96/0xad [mlx5_core] mlx5_fs_destroy_fs_actions+0x8b/0x152 [mlx5_core] mlx5_cmd_hws_delete_fte+0x5a/0xa0 [mlx5_core] del_hw_fte+0x1ce/0x260 [mlx5_core] mlx5_del_flow_rules+0x12d/0x240 [mlx5_core] ? ttwu_queue_wakelist+0xf4/0x110 mlx5_ib_destroy_flow+0x103/0x1b0 [mlx5_ib] uverbs_free_flow+0x20/0x50 [ib_uverbs] destroy_hw_idr_uobject+0x1b/0x50 [ib_uverbs] uverbs_destroy_uobject+0x34/0x1a0 [ib_uverbs] uobj_destroy+0x3c/0x80 [ib_uverbs] ib_uverbs_run_method+0x23e/0x360 [ib_uverbs] ? uverbs_finalize_object+0x60/0x60 [ib_uverbs] ib_uverbs_cmd_verbs+0x14f/0x2c0 [ib_uverbs] ? do_tty_write+0x1a9/0x270 ? file_tty_write.constprop.0+0x98/0xc0 ? new_sync_write+0xfc/0x190 ib_uverbs_ioctl+0xd7/0x160 [ib_uverbs] __x64_sys_ioctl+0x87/0xc0 do_syscall_64+0x59/0x90 Fixes: b581f4266928 ("net/mlx5: fs, manage flow counters HWS action sharing by refcount") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1758525094-816583-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-22	Merge tag 'mlx5-next-counters' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux	Jakub Kicinski
	Tariq Toukan says: ==================== mlx5-next updates 2025-09-21 * tag 'mlx5-next-counters' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Add uar access and odp page fault counters ==================== Link: https://patch.msgid.link/1758443940-708689-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-18	Merge tag 'mlx5-next-09-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux	Jakub Kicinski
	Tariq Toukan says: ==================== mlx5-next updates 2025-09-17 This series by Carolina contains cleanups significantly touching shared mlx5 net and rdma headers. * tag 'mlx5-next-09-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5e: Prevent WQE metadata conflicts between timestamping and offloads net/mlx5: Refactor MACsec WQE metadata shifts net/mlx5: Remove VLAN insertion fields from WQE Ether segment ==================== Link: https://patch.msgid.link/1757574619-604874-1-git-send-email-tariqt@nvidia.com Link: https://patch.msgid.link/1758104780-642426-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>