summaryrefslogtreecommitdiff
path: root/arch/powerpc/platforms/pseries/iommu.c
AgeCommit message (Collapse)Author
2025-01-21powerpc/pseries/iommu: Don't unset window if it was never setShivaprasad G Bhat
On pSeries, when user attempts to use the same vfio container used by different iommu group, the spapr_tce_set_window() returns -EPERM and the subsequent cleanup leads to the below crash. Kernel attempted to read user page (308) - exploit attempt? BUG: Kernel NULL pointer dereference on read at 0x00000308 Faulting instruction address: 0xc0000000001ce358 Oops: Kernel access of bad area, sig: 11 [#1] NIP: c0000000001ce358 LR: c0000000001ce05c CTR: c00000000005add0 <snip> NIP [c0000000001ce358] spapr_tce_unset_window+0x3b8/0x510 LR [c0000000001ce05c] spapr_tce_unset_window+0xbc/0x510 Call Trace: spapr_tce_unset_window+0xbc/0x510 (unreliable) tce_iommu_attach_group+0x24c/0x340 [vfio_iommu_spapr_tce] vfio_container_attach_group+0xec/0x240 [vfio] vfio_group_fops_unl_ioctl+0x548/0xb00 [vfio] sys_ioctl+0x754/0x1580 system_call_exception+0x13c/0x330 system_call_vectored_common+0x15c/0x2ec <snip> --- interrupt: 3000 Fix this by having null check for the tbl passed to the spapr_tce_unset_window(). Fixes: f431a8cde7f1 ("powerpc/iommu: Reimplement the iommu_table_group_ops for pSeries") Cc: stable@vger.kernel.org Reported-by: Vaishnavi Bhat <vaish123@in.ibm.com> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/173674009556.1559.12487885286848752833.stgit@linux.ibm.com
2025-01-11powerpc/pseries/iommu: IOMMU incorrectly marks MMIO range in DDWGaurav Batra
Power Hypervisor can possibily allocate MMIO window intersecting with Dynamic DMA Window (DDW) range, which is over 32-bit addressing. These MMIO pages needs to be marked as reserved so that IOMMU doesn't map DMA buffers in this range. The current code is not marking these pages correctly which is resulting in LPAR to OOPS while booting. The stack is at below BUG: Unable to handle kernel data access on read at 0xc00800005cd40000 Faulting instruction address: 0xc00000000005cdac Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: af_packet rfkill ibmveth(X) lpfc(+) nvmet_fc nvmet nvme_keyring crct10dif_vpmsum nvme_fc nvme_fabrics nvme_core be2net(+) nvme_auth rtc_generic nfsd auth_rpcgss nfs_acl lockd grace sunrpc fuse configfs ip_tables x_tables xfs libcrc32c dm_service_time ibmvfc(X) scsi_transport_fc vmx_crypto gf128mul crc32c_vpmsum dm_mirror dm_region_hash dm_log dm_multipath dm_mod sd_mod scsi_dh_emc scsi_dh_rdac scsi_dh_alua t10_pi crc64_rocksoft_generic crc64_rocksoft sg crc64 scsi_mod Supported: Yes, External CPU: 8 PID: 241 Comm: kworker/8:1 Kdump: loaded Not tainted 6.4.0-150600.23.14-default #1 SLE15-SP6 b44ee71c81261b9e4bab5e0cde1f2ed891d5359b Hardware name: IBM,9080-M9S POWER9 (raw) 0x4e2103 0xf000005 of:IBM,FW950.B0 (VH950_149) hv:phyp pSeries Workqueue: events work_for_cpu_fn NIP: c00000000005cdac LR: c00000000005e830 CTR: 0000000000000000 REGS: c00001400c9ff770 TRAP: 0300 Not tainted (6.4.0-150600.23.14-default) MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24228448 XER: 00000001 CFAR: c00000000005cdd4 DAR: c00800005cd40000 DSISR: 40000000 IRQMASK: 0 GPR00: c00000000005e830 c00001400c9ffa10 c000000001987d00 c00001400c4fe800 GPR04: 0000080000000000 0000000000000001 0000000004000000 0000000000800000 GPR08: 0000000004000000 0000000000000001 c00800005cd40000 ffffffffffffffff GPR12: 0000000084228882 c00000000a4c4f00 0000000000000010 0000080000000000 GPR16: c00001400c4fe800 0000000004000000 0800000000000000 c00000006088b800 GPR20: c00001401a7be980 c00001400eff3800 c000000002a2da68 000000000000002b GPR24: c0000000026793a8 c000000002679368 000000000000002a c0000000026793c8 GPR28: 000008007effffff 0000080000000000 0000000000800000 c00001400c4fe800 NIP [c00000000005cdac] iommu_table_reserve_pages+0xac/0x100 LR [c00000000005e830] iommu_init_table+0x80/0x1e0 Call Trace: [c00001400c9ffa10] [c00000000005e810] iommu_init_table+0x60/0x1e0 (unreliable) [c00001400c9ffa90] [c00000000010356c] iommu_bypass_supported_pSeriesLP+0x9cc/0xe40 [c00001400c9ffc30] [c00000000005c300] dma_iommu_dma_supported+0xf0/0x230 [c00001400c9ffcb0] [c00000000024b0c4] dma_supported+0x44/0x90 [c00001400c9ffcd0] [c00000000024b14c] dma_set_mask+0x3c/0x80 [c00001400c9ffd00] [c0080000555b715c] be_probe+0xc4/0xb90 [be2net] [c00001400c9ffdc0] [c000000000986f3c] local_pci_probe+0x6c/0x110 [c00001400c9ffe40] [c000000000188f28] work_for_cpu_fn+0x38/0x60 [c00001400c9ffe70] [c00000000018e454] process_one_work+0x314/0x620 [c00001400c9fff10] [c00000000018f280] worker_thread+0x2b0/0x620 [c00001400c9fff90] [c00000000019bb18] kthread+0x148/0x150 [c00001400c9fffe0] [c00000000000ded8] start_kernel_thread+0x14/0x18 There are 2 issues in the code 1. The index is "int" while the address is "unsigned long". This results in negative value when setting the bitmap. 2. The DMA offset is page shifted but the MMIO range is used as-is (64-bit address). MMIO address needs to be page shifted as well. Fixes: 3c33066a2190 ("powerpc/kernel/iommu: Add new iommu_table_in_use() helper") Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20241206210039.93172-1-gbatra@linux.ibm.com
2024-07-04powerpc/pseries/iommu: Define spapr_tce_table_group_ops only with ↵Shivaprasad G Bhat
CONFIG_IOMMU_API The patch fixes the below warning: arch/powerpc/platforms/pseries/iommu.c:1824:37: warning: 'spapr_tce_table_group_ops' defined but not used Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202407020357.Hz8kQkKf-lkp@intel.com/ Fixes: b09c031d9433 ("powerpc/iommu: Move pSeries specific functions to pseries/iommu.c") Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/172008854222.784.13666247605789409729.stgit@linux.ibm.com
2024-06-28powerpc/iommu: Reimplement the iommu_table_group_ops for pSeriesShivaprasad G Bhat
PPC64 IOMMU API defines iommu_table_group_ops which handles DMA windows for PEs, their ownership transfer, create/set/unset the TCE tables for the Dynamic DMA wundows(DDW). VFIOS uses these APIs for support on POWER. The commit 9d67c9433509 ("powerpc/iommu: Add "borrowing" iommu_table_group_ops") implemented partial support for this API with "borrow" mechanism wherein the DMA windows if created already by the host driver, they would be available for VFIO to use. Also, it didn't have the support to control/modify the window size or the IO page size. The current patch implements all the necessary iommu_table_group_ops APIs there by avoiding the "borrrowing". So, just the way it is on the PowerNV platform, with this patch the iommu table group ownership is transferred to the VFIO PPC subdriver, the iommu table, DMA windows creation/deletion all driven through the APIs. The pSeries uses the query-pe-dma-window, create-pe-dma-window and reset-pe-dma-window RTAS calls for DMA window creation, deletion and reset to defaul. The RTAs calls do show some minor differences to the way things are to be handled on the pSeries which are listed below. * On pSeries, the default DMA window size is "fixed" cannot be custom sized as requested by the user. For non-SRIOV VFs, It is fixed at 2GB and for SRIOV VFs, its variable sized based on the capacity assigned to it during the VF assignment to the LPAR. So, for the default DMA window alone the size if requested less than tce32_size, the smaller size is enforced using the iommu table->it_size. * The DMA start address for 32-bit window is 0, and for the 64-bit window in case of PowerNV is hardcoded to TVE select (bit 59) at 512PiB offset. This address is returned at the time of create_table() API call (even before the window is created), the subsequent set_window() call actually opens the DMA window. On pSeries, the DMA start address for 32-bit window is known from the 'ibm,dma-window' DT property. However, the 64-bit window start address is not known until the create-pe-dma RTAS call is made. So, the create_table() which returns the DMA window start address actually opens the DMA window and returns the DMA start address as returned by the Hypervisor for the create-pe-dma RTAS call. * The reset-pe-dma RTAS call resets the DMA windows and restores the default DMA window, however it does not clear the TCE table entries if there are any. In case of ownership transfer from platform domain which used direct mapping, the patch chooses remove-pe-dma instead of reset-pe for the 64-bit window intentionally so that the clear_dma_window() is called. Other than the DMA window management changes mentioned above, the patch also brings back the userspace view for the single level TCE as it existed before commit 090bad39b237a ("powerpc/powernv: Add indirect levels to it_userspace") along with the relavent refactoring. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/171923275958.1397.907964437142542242.stgit@linux.ibm.com
2024-06-28powerpc/pseries/iommu: Use the iommu table[0] for IOV VF's DDWShivaprasad G Bhat
This patch basically brings consistency with PowerNV approach to use the first freely available iommu table when the default window is removed. The pSeries iommu code convention has been that the table[0] is for the default 32 bit DMA window and the table[1] is for the 64 bit DDW. With VFs having only 1 DMA window, the default has to be removed for creating the larger DMA window. The existing code uses the table[1] for that, while marking the table[0] as NULL. This is fine as long as the host driver itself uses the device. For the VFIO user, on pSeries there is no way to skip table[0] as the VFIO subdriver uses the first freely available table. The window 0, when created as 64-bit DDW in that context would still be on table[0], as the maximum number of windows is 1. This won't have any impact for the host driver as the table is fetched from the device's iommu_table_base. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> [mpe: Rebase and resolve conflicts] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/171923272328.1397.1817843961216868850.stgit@linux.ibm.com
2024-06-28powerpc/pseries/iommu: Fix the VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctl outputShivaprasad G Bhat
The ioctl VFIO_IOMMU_SPAPR_TCE_GET_INFO is not reporting the actuals on the platform as not all the details are correctly collected during the platform probe/scan into the iommu_table_group. Collect the information during the device setup time as the DMA window property is already looked up on parent nodes anyway. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/171923271138.1397.7908302630061814623.stgit@linux.ibm.com
2024-06-28powerpc/iommu: Move pSeries specific functions to pseries/iommu.cShivaprasad G Bhat
The PowerNV specific table_group_ops are defined in powernv/pci-ioda.c. The pSeries specific table_group_ops are sitting in the generic powerpc file. Move it to where it actually belong(pseries/iommu.c). The functions are currently defined even for CONFIG_PPC_POWERNV which are unused on PowerNV. Only code movement, no functional changes intended. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/171923269701.1397.15758640002786937132.stgit@linux.ibm.com
2024-06-04powerpc/pseries/iommu: Split Dynamic DMA Window to be used in Hybrid modeGaurav Batra
Dynamic DMA Window (DDW) supports TCEs that are backed by 2MB page size. In most configurations, DDW is big enough to pre-map all of LPAR memory for IO. Pre-mapping of memory for DMA results in improvements in IO performance. Persistent memory, vPMEM, can be assigned to an LPAR as well. vPMEM is not contiguous with LPAR memory and usually is assigned at high memory addresses. This makes is not possible to pre-map both vPMEM and LPAR memory in the same DDW. For a dedicated adapter this limitation is not an issue. Dedicated adapters can have both Default DMA window, which is backed by 4K page size and a DDW backed by 2MB page size TCEs. In this scenario, LPAR memory is pre-mapped in the DDW. Any DMA going to the vPMEM is routed via dynamically allocated TCEs in the default window. The issue arises with SR-IOV adapters. There is only one DMA window - either Default or DDW. If an LPAR has vPMEM assigned, memory is not pre-mapped in the DDW since TCEs needs to be allocated for vPMEM as well. In this case, DDW is created and TCEs are dynamically allocated for both vPMEM and LPAR memory. Today, DDW is only used in single mode - direct mapped TCEs or dynamically mapped TCEs. This enhancement breaks a single DDW in 2 regions - 1. First region to pre-map LPAR memory 2. Second region to dynamically allocate TCEs for IO to vPMEM The DDW is split only if it is big enough to pre-map complete LPAR memory and still have some space left to dynamically map vPMEM. Maximum size possible DDW is created as permitted by the Hypervisor. Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com> Reviewed-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240514014608.35537-1-gbatra@linux.ibm.com
2024-04-23powerpc/pseries/iommu: LPAR panics during boot up with a frozen PEGaurav Batra
At the time of LPAR boot up, partition firmware provides Open Firmware property ibm,dma-window for the PE. This property is provided on the PCI bus the PE is attached to. There are execptions where the partition firmware might not provide this property for the PE at the time of LPAR boot up. One of the scenario is where the firmware has frozen the PE due to some error condition. This PE is frozen for 24 hours or unless the whole system is reinitialized. Within this time frame, if the LPAR is booted, the frozen PE will be presented to the LPAR but ibm,dma-window property could be missing. Today, under these circumstances, the LPAR oopses with NULL pointer dereference, when configuring the PCI bus the PE is attached to. BUG: Kernel NULL pointer dereference on read at 0x000000c8 Faulting instruction address: 0xc0000000001024c0 Oops: Kernel access of bad area, sig: 7 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries Modules linked in: Supported: Yes CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.4.0-150600.9-default #1 Hardware name: IBM,9043-MRX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_023) hv:phyp pSeries NIP: c0000000001024c0 LR: c0000000001024b0 CTR: c000000000102450 REGS: c0000000037db5c0 TRAP: 0300 Not tainted (6.4.0-150600.9-default) MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 28000822 XER: 00000000 CFAR: c00000000010254c DAR: 00000000000000c8 DSISR: 00080000 IRQMASK: 0 ... NIP [c0000000001024c0] pci_dma_bus_setup_pSeriesLP+0x70/0x2a0 LR [c0000000001024b0] pci_dma_bus_setup_pSeriesLP+0x60/0x2a0 Call Trace: pci_dma_bus_setup_pSeriesLP+0x60/0x2a0 (unreliable) pcibios_setup_bus_self+0x1c0/0x370 __of_scan_bus+0x2f8/0x330 pcibios_scan_phb+0x280/0x3d0 pcibios_init+0x88/0x12c do_one_initcall+0x60/0x320 kernel_init_freeable+0x344/0x3e4 kernel_init+0x34/0x1d0 ret_from_kernel_user_thread+0x14/0x1c Fixes: b1fc44eaa9ba ("pseries/iommu/ddw: Fix kdump to work in absence of ibm,dma-window") Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240422205141.10662-1-gbatra@linux.ibm.com
2024-02-23powerpc/pseries/iommu: IOMMU table is not initialized for kdump over SR-IOVGaurav Batra
When kdump kernel tries to copy dump data over SR-IOV, LPAR panics due to NULL pointer exception: Kernel attempted to read user page (0) - exploit attempt? (uid: 0) BUG: Kernel NULL pointer dereference on read at 0x00000000 Faulting instruction address: 0xc000000020847ad4 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries Modules linked in: mlx5_core(+) vmx_crypto pseries_wdt papr_scm libnvdimm mlxfw tls psample sunrpc fuse overlay squashfs loop CPU: 12 PID: 315 Comm: systemd-udevd Not tainted 6.4.0-Test102+ #12 Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries NIP: c000000020847ad4 LR: c00000002083b2dc CTR: 00000000006cd18c REGS: c000000029162ca0 TRAP: 0300 Not tainted (6.4.0-Test102+) MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48288244 XER: 00000008 CFAR: c00000002083b2d8 DAR: 0000000000000000 DSISR: 40000000 IRQMASK: 1 ... NIP _find_next_zero_bit+0x24/0x110 LR bitmap_find_next_zero_area_off+0x5c/0xe0 Call Trace: dev_printk_emit+0x38/0x48 (unreliable) iommu_area_alloc+0xc4/0x180 iommu_range_alloc+0x1e8/0x580 iommu_alloc+0x60/0x130 iommu_alloc_coherent+0x158/0x2b0 dma_iommu_alloc_coherent+0x3c/0x50 dma_alloc_attrs+0x170/0x1f0 mlx5_cmd_init+0xc0/0x760 [mlx5_core] mlx5_function_setup+0xf0/0x510 [mlx5_core] mlx5_init_one+0x84/0x210 [mlx5_core] probe_one+0x118/0x2c0 [mlx5_core] local_pci_probe+0x68/0x110 pci_call_probe+0x68/0x200 pci_device_probe+0xbc/0x1a0 really_probe+0x104/0x540 __driver_probe_device+0xb4/0x230 driver_probe_device+0x54/0x130 __driver_attach+0x158/0x2b0 bus_for_each_dev+0xa8/0x130 driver_attach+0x34/0x50 bus_add_driver+0x16c/0x300 driver_register+0xa4/0x1b0 __pci_register_driver+0x68/0x80 mlx5_init+0xb8/0x100 [mlx5_core] do_one_initcall+0x60/0x300 do_init_module+0x7c/0x2b0 At the time of LPAR dump, before kexec hands over control to kdump kernel, DDWs (Dynamic DMA Windows) are scanned and added to the FDT. For the SR-IOV case, default DMA window "ibm,dma-window" is removed from the FDT and DDW added, for the device. Now, kexec hands over control to the kdump kernel. When the kdump kernel initializes, PCI busses are scanned and IOMMU group/tables created, in pci_dma_bus_setup_pSeriesLP(). For the SR-IOV case, there is no "ibm,dma-window". The original commit: b1fc44eaa9ba, fixes the path where memory is pre-mapped (direct mapped) to the DDW. When TCEs are direct mapped, there is no need to initialize IOMMU tables. iommu_table_setparms_lpar() only considers "ibm,dma-window" property when initiallizing IOMMU table. In the scenario where TCEs are dynamically allocated for SR-IOV, newly created IOMMU table is not initialized. Later, when the device driver tries to enter TCEs for the SR-IOV device, NULL pointer execption is thrown from iommu_area_alloc(). The fix is to initialize the IOMMU table with DDW property stored in the FDT. There are 2 points to remember: 1. For the dedicated adapter, kdump kernel would encounter both default and DDW in FDT. In this case, DDW property is used to initialize the IOMMU table. 2. A DDW could be direct or dynamic mapped. kdump kernel would initialize IOMMU table and mark the existing DDW as "dynamic". This works fine since, at the time of table initialization, iommu_table_clear() makes some space in the DDW, for some predefined number of TCEs which are needed for kdump to succeed. Fixes: b1fc44eaa9ba ("pseries/iommu/ddw: Fix kdump to work in absence of ibm,dma-window") Signed-off-by: Gaurav Batra <gbatra@linux.vnet.ibm.com> Reviewed-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240125203017.61014-1-gbatra@linux.ibm.com
2023-10-19powerpc/pseries/iommu: enable_ddw incorrectly returns direct mapping for ↵Gaurav Batra
SR-IOV device When a device is initialized, the driver invokes dma_supported() twice - first for streaming mappings followed by coherent mappings. For an SR-IOV device, default window is deleted and DDW created. With vPMEM enabled, TCE mappings are dynamically created for both vPMEM and SR-IOV device. There are no direct mappings. First time when dma_supported() is called with 64 bit mask, DDW is created and marked as dynamic window. The second time dma_supported() is called, enable_ddw() finds existing window for the device and incorrectly returns it as "direct mapping". This only happens when size of DDW is big enough to map max LPAR memory. This results in streaming TCEs to not get dynamically mapped, since code incorrently assumes these are already pre-mapped. The adapter initially comes up but goes down due to EEH. Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping") Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Gaurav Batra <gbatra@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231003030802.47914-1-gbatra@linux.vnet.ibm.com
2023-08-18powerpc: Move DMA64_PROPNAME define to a headerMichal Suchanek
Avoid redefining the same value in multiple source. Signed-off-by: Michal Suchanek <msuchanek@suse.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230817162411.429-1-msuchanek@suse.de
2023-06-26powerpc/iommu: TCEs are incorrectly manipulated with DLPAR add/remove of memoryGaurav Batra
When memory is dynamically added/removed, iommu_mem_notifier() is invoked. This routine traverses through all the DMA windows (DDW only, not default windows) to add/remove "direct" TCE mappings. The routines for this purpose are tce_clearrange_multi_pSeriesLP() and tce_clearrange_multi_pSeriesLP(). Both these routines are designed for Direct mapped DMA windows only. The issue is that there could be some DMA windows in the list which are not "direct" mapped. Calling these routines will either, 1) remove some dynamically mapped TCEs, Or 2) try to add TCEs which are out of bounds and HCALL returns H_PARAMETER Here are the side affects when these routines are incorrectly invoked for "dynamically" mapped DMA windows. tce_setrange_multi_pSeriesLP() This adds direct mapped TCEs. Now, this could invoke HCALL to add TCEs with out-of-bound range. In this scenario, HCALL will return H_PARAMETER and DLAR ADD of memory will fail. tce_clearrange_multi_pSeriesLP() This will remove range of TCEs. The TCE range that is calculated, depending on the memory range being added, could infact be mapping some other memory address (for dynamic DMA window scenario). This will wipe out those TCEs. The solution is for iommu_mem_notifier() to only invoke these routines for "direct" mapped DMA windows. Signed-off-by: Gaurav Batra <gbatra@linux.vnet.ibm.com> Reviewed-by: Brian King <brking@linux.vnet.ibm.com> [mpe: Initialise direct at allocation time in ddw_list_new_entry()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230613171641.15641-1-gbatra@linux.vnet.ibm.com
2023-05-30powerpc/iommu: Limit number of TCEs to 512 for H_STUFF_TCE hcallGaurav Batra
Currently in tce_freemulti_pSeriesLP() there is no limit on how many TCEs are passed to the H_STUFF_TCE hcall. This has not caused an issue until now, but newer firmware releases have started enforcing a limit of 512 TCEs per call. The limit is correct per the specification (PAPR v2.12 § 14.5.4.2.3). The code has been in it's current form since it was initially merged. Cc: stable@vger.kernel.org Signed-off-by: Gaurav Batra <gbatra@linux.vnet.ibm.com> Reviewed-by: Brian King <brking@linux.vnet.ibm.com> [mpe: Tweak change log wording & add PAPR reference] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230525143454.56878-1-gbatra@linux.vnet.ibm.com
2023-05-17powerpc/iommu: Incorrect DDW Table is referenced for SR-IOV deviceGaurav Batra
For an SR-IOV device, while enabling DDW, a new table is created and added at index 1 in the group. In the below 2 scenarios, the table is incorrectly referenced at index 0 (which is where the table is for default DMA window). 1. When adding DDW This issue is exposed with "slub_debug". Error thrown out from dma_iommu_dma_supported() Warning: IOMMU offset too big for device mask mask: 0xffffffff, table offset: 0x800000000000000 2. During Dynamic removal of the PCI device. Error is from iommu_tce_table_put() since a NULL table pointer is passed in. Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping") Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Gaurav Batra <gbatra@linux.vnet.ibm.com> Reviewed-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230505184701.91613-1-gbatra@linux.vnet.ibm.com
2023-05-17powerpc/iommu: Remove iommu_del_device()Jason Gunthorpe
Now that power calls iommu_device_register() and populates its groups using iommu_ops->device_group it should not be calling iommu_group_remove_device(). The core code owns the groups and all the other related iommu data, it will clean it up automatically. Remove the bus notifiers and explicit calls to iommu_group_remove_device(). Fixes: a940904443e4 ("powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/0-v1-1421774b874b+167-ppc_device_group_jgg@nvidia.com
2023-04-04powerpc: Use of_address_to_resource()Rob Herring
Replace open coded reading of "reg" or of_get_address()/ of_translate_address() calls with a single call to of_address_to_resource(). Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230329220337.141295-1-robh@kernel.org
2023-03-30powerpc/pseries: Add spaces around / operatorPetr Vaněk
This is follow up change after 14b5d59a261b ("powerpc/pseries: Fix formatting to make code look more beautiful") to conform to kernel coding style. Signed-off-by: Petr Vaněk <arkamar@atlas.cz> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230324220041.11378-1-arkamar@atlas.cz
2023-03-15powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domainsAlexey Kardashevskiy
Up until now PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development added 2 uses of iommu_ops to the generic VFIO which broke POWER: - a coherency capability check; - blocking IOMMU domain - iommu_group_dma_owner_claimed()/... This adds a simple iommu_ops which reports support for cache coherency and provides a basic support for blocking domains. No other domain types are implemented so the default domain is NULL. Since now iommu_ops controls the group ownership, this takes it out of VFIO. This adds an IOMMU device into a pci_controller (=PHB) and registers it in the IOMMU subsystem, iommu_ops is registered at this point. This setup is done in postcore_initcall_sync. This replaces iommu_group_add_device() with iommu_probe_device() as the former misses necessary steps in connecting PCI devices to IOMMU devices. This adds a comment about why explicit iommu_probe_device() is still needed. The previous discussion is here: https://lore.kernel.org/r/20220707135552.3688927-1-aik@ozlabs.ru/ https://lore.kernel.org/r/20220701061751.1955857-1-aik@ozlabs.ru/ Fixes: e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence") Fixes: 70693f470848 ("vfio: Set DMA ownership for VFIO devices") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> [mpe: Fix CONFIG_IOMMU_API=n build] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/2000135730.16998523.1678123860135.JavaMail.zimbra@raptorengineeringinc.com
2023-03-14powerpc/iommu: Add "borrowing" iommu_table_group_opsAlexey Kardashevskiy
PPC64 IOMMU API defines iommu_table_group_ops which handles DMA windows for PEs: control the ownership, create/set/unset a table the hardware for dynamic DMA windows (DDW). VFIO uses the API to implement support on POWER. So far only PowerNV IODA2 (POWER8 and newer machines) implemented this and other cases (POWER7 or nested KVM) did not and instead reused existing iommu_table structs. This means 1) no DDW 2) ownership transfer is done directly in the VFIO SPAPR TCE driver. Soon POWER is going to get its own iommu_ops and ownership control is going to move there. This implements spapr_tce_table_group_ops which borrows iommu_table tables. The upside is that VFIO needs to know less about POWER. The new ops returns the existing table from create_table() and only checks if the same window is already set. This is only going to work if the default DMA window starts table_group.tce32_start and as big as pe->table_group.tce32_size (not the case for IODA2+ PowerNV). This changes iommu_table_group_ops::take_ownership() to return an error if borrowing a table failed. This should not cause any visible change in behavior for PowerNV. pSeries was not that well tested/supported anyway. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> [mpe: Fix CONFIG_IOMMU_API=n build (skiroot_defconfig), & formatting] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/525438831.16998517.1678123820075.JavaMail.zimbra@raptorengineeringinc.com
2022-11-24powerpc/pseries: Fix formatting to make code look more beautifulDeming Wang
Operators should be separated by spaces in tce_buildmulti_pSeriesLP() Signed-off-by: Deming Wang <wangdeming@inspur.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220701094553.1722-1-wangdeming@inspur.com
2022-07-28pseries/iommu/ddw: Fix kdump to work in absence of ibm,dma-windowAlexey Kardashevskiy
The pseries platform uses 32bit default DMA window (always 4K pages) and optional 64bit DMA window available via DDW ("Dynamic DMA Windows"), 64K or 2M pages. For ages the default one was not removed and a huge window was created in addition. Things changed with SRIOV-enabled PowerVM which creates a default-and-bigger DMA window in 64bit space (still using 4K pages) for IOV VFs so certain OSes do not need to use the DDW API in order to utilize all available TCE budget. Linux on the other hand removes the default window and creates a bigger one (with more TCEs or/and a bigger page size - 64K/2M) in a bid to map the entire RAM, and if the new window size is smaller than that - it still uses this new bigger window. The result is that the default window is removed but the "ibm,dma-window" property is not. When kdump is invoked, the existing code tries reusing the existing 64bit DMA window which location and parameters are stored in the device tree but this fails as the new property does not make it to the kdump device tree blob. So the code falls back to the default window which does not exist anymore although the device tree says that it does. The result of that is that PCI devices become unusable and cannot be used for kdumping. This preserves the DMA64 and DIRECT64 properties in the device tree blob for the crash kernel. Since the crash kernel setup is done after device drivers are loaded and probed, the proper DMA config is stored at least for boot time devices. Because DDW window is optional and the code configures the default window first, the existing code creates an IOMMU table descriptor for the non-existing default DMA window. It is harmless for kdump as it does not touch the actual window (only reads what is mapped and marks those IO pages as used) but it is bad for kexec which clears it thinking it is a smaller default window rather than a bigger DDW window. This removes the "ibm,dma-window" property from the device tree after a bigger window is created and the crash kernel setup picks it up. Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220629060614.1680476-1-aik@ozlabs.ru
2022-06-29powerpc/pseries/iommu: Print ibm,query-pe-dma-windows parametersAlexey Kardashevskiy
PowerVM has a stricter policy about allocating TCEs for LPARs and often there is not enough TCEs for 1:1 mapping, this adds the supported numbers into dev_info() to help analyzing bugreports. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220601040117.1467710-1-aik@ozlabs.ru
2022-05-19Merge branch 'topic/ppc-kvm' into nextMichael Ellerman
Merge our KVM topic branch.
2022-05-19KVM: PPC: Book3s: Retire H_PUT_TCE/etc real mode handlersAlexey Kardashevskiy
LoPAPR defines guest visible IOMMU with hypercalls to use it - H_PUT_TCE/etc. Implemented first on POWER7 where hypercalls would trap in the KVM in the real mode (with MMU off). The problem with the real mode is some memory is not available and some API usage crashed the host but enabling MMU was an expensive operation. The problems with the real mode handlers are: 1. Occasionally these cannot complete the request so the code is copied+modified to work in the virtual mode, very little is shared; 2. The real mode handlers have to be linked into vmlinux to work; 3. An exception in real mode immediately reboots the machine. If the small DMA window is used, the real mode handlers bring better performance. However since POWER8, there has always been a bigger DMA window which VMs use to map the entire VM memory to avoid calling H_PUT_TCE. Such 1:1 mapping happens once and uses H_PUT_TCE_INDIRECT (a bulk version of H_PUT_TCE) which virtual mode handler is even closer to its real mode version. On POWER9 hypercalls trap straight to the virtual mode so the real mode handlers never execute on POWER9 and later CPUs. So with the current use of the DMA windows and MMU improvements in POWER9 and later, there is no point in duplicating the code. The 32bit passed through devices may slow down but we do not have many of these in practice. For example, with this applied, a 1Gbit ethernet adapter still demostrates above 800Mbit/s of actual throughput. This removes the real mode handlers from KVM and related code from the powernv platform. This updates the list of implemented hcalls in KVM-HV as the realmode handlers are removed. This changes ABI - kvmppc_h_get_tce() moves to the KVM module and kvmppc_find_table() is static now. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220506053755.3820702-1-aik@ozlabs.ru
2022-05-05powerpc: fix typos in commentsJulia Lawall
Various spelling mistakes in comments. Detected with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Reviewed-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220430185654.5855-1-Julia.Lawall@inria.fr
2021-12-23powerpc/pseries: Add __init attribute to eligible functionsNick Child
Some functions defined in 'arch/powerpc/platforms/pseries' are deserving of an `__init` macro attribute. These functions are only called by other initialization functions and therefore should inherit the attribute. Also, change function declarations in header files to include `__init`. Signed-off-by: Nick Child <nick.child@ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211216220035.605465-13-nick.child@ibm.com
2021-11-15powerpc/pseries/ddw: Do not try direct mapping with persistent memory and ↵Alexey Kardashevskiy
one window There is a possibility of having just one DMA window available with a limited capacity which the existing code does not handle that well. If the window is big enough for the system RAM but less than MAX_PHYSMEM_BITS (which we want when persistent memory is present), we create 1:1 window and leave persistent memory without DMA. This disables 1:1 mapping entirely if there is persistent memory and either: - the huge DMA window does not cover the entire address space; - the default DMA window is removed. This relies on reverted 54fc3c681ded ("powerpc/pseries/ddw: Extend upper limit for huge DMA window for persistent memory") to return the actual amount RAM in ddw_memory_hotplug_max() (posted separately). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211108040320.3857636-4-aik@ozlabs.ru
2021-11-15powerpc/pseries/ddw: simplify enable_ddw()Alexey Kardashevskiy
This drops rather useless ddw_enabled flag as direct_mapping implies it anyway. While at this, fix indents in enable_ddw(). This should not cause any behavioral change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211108040320.3857636-3-aik@ozlabs.ru
2021-11-15powerpc/pseries/ddw: Revert "Extend upper limit for huge DMA window for ↵Alexey Kardashevskiy
persistent memory" This reverts commit 54fc3c681ded9437e4548e2501dc1136b23cfa9a which does not allow 1:1 mapping even for the system RAM which is usually possible. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211108040320.3857636-2-aik@ozlabs.ru
2021-11-05Merge tag 'powerpc-5.16-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Enable STRICT_KERNEL_RWX for Freescale 85xx platforms. - Activate CONFIG_STRICT_KERNEL_RWX by default, while still allowing it to be disabled. - Add support for out-of-line static calls on 32-bit. - Fix oopses doing bpf-to-bpf calls when STRICT_KERNEL_RWX is enabled. - Fix boot hangs on e5500 due to stale value in ESR passed to do_page_fault(). - Fix several bugs on pseries in handling of device tree cache information for hotplugged CPUs, and/or during partition migration. - Various other small features and fixes. Thanks to Alexey Kardashevskiy, Alistair Popple, Anatolij Gustschin, Andrew Donnellan, Athira Rajeev, Bixuan Cui, Bjorn Helgaas, Cédric Le Goater, Christophe Leroy, Daniel Axtens, Daniel Henrique Barboza, Denis Kirjanov, Fabiano Rosas, Frederic Barrat, Gustavo A. R. Silva, Hari Bathini, Jacques de Laval, Joel Stanley, Kai Song, Kajol Jain, Laurent Vivier, Leonardo Bras, Madhavan Srinivasan, Nathan Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Nick Desaulniers, Niklas Schnelle, Oliver O'Halloran, Rob Herring, Russell Currey, Srikar Dronamraju, Stan Johnson, Tyrel Datwyler, Uwe Kleine-König, Vasant Hegde, Wan Jiabing, and Xiaoming Ni, * tag 'powerpc-5.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (73 commits) powerpc/8xx: Fix Oops with STRICT_KERNEL_RWX without DEBUG_RODATA_TEST powerpc/32e: Ignore ESR in instruction storage interrupt handler powerpc/powernv/prd: Unregister OPAL_MSG_PRD2 notifier during module unload powerpc: Don't provide __kernel_map_pages() without ARCH_SUPPORTS_DEBUG_PAGEALLOC MAINTAINERS: Update powerpc KVM entry powerpc/xmon: fix task state output powerpc/44x/fsp2: add missing of_node_put powerpc/dcr: Use cmplwi instead of 3-argument cmpli KVM: PPC: Tick accounting should defer vtime accounting 'til after IRQ handling powerpc/security: Use a mutex for interrupt exit code patching powerpc/83xx/mpc8349emitx: Make mcu_gpiochip_remove() return void powerpc/fsl_booke: Fix setting of exec flag when setting TLBCAMs powerpc/book3e: Fix set_memory_x() and set_memory_nx() powerpc/nohash: Fix __ptep_set_access_flags() and ptep_set_wrprotect() powerpc/bpf: Fix write protecting JIT code selftests/powerpc: Use date instead of EPOCHSECONDS in mitigation-patching.sh powerpc/64s/interrupt: Fix check_return_regs_valid() false positive powerpc/boot: Set LC_ALL=C in wrapper script powerpc/64s: Default to 64K pages for 64 bit book3s Revert "powerpc/audit: Convert powerpc to AUDIT_ARCH_COMPAT_GENERIC" ...
2021-10-25powerpc/pseries/iommu: Create huge DMA window if no MMIO32 is presentAlexey Kardashevskiy
The iommu_init_table() helper takes an address range to reserve in the IOMMU table being initialized to exclude MMIO addresses, this is useful if the window stretches far beyond 4GB (although wastes some TCEs). At the moment the code searches for such MMIO32 range and fails if none found which is considered a problem while it really is not: it is actually better as this says there is no MMIO32 to reserve and we can use usually wasted TCEs. Furthermore PHYP never actually allows creating windows starting at busaddress=0 so this MMIO32 range is never useful. This removes error exit and initializes the table with zero range if no MMIO32 is detected. Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211020132315.2287178-5-aik@ozlabs.ru
2021-10-25powerpc/pseries/iommu: Check if the default window in use before removing itAlexey Kardashevskiy
At the moment this check is performed after we remove the default window which is late and disallows to revert whatever changes enable_ddw() has made to DMA windows. This moves the check and error exit before removing the window. This raised the message severity from "debug" to "warning" as this should not happen in practice and cannot be triggered by the userspace. Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211020132315.2287178-4-aik@ozlabs.ru
2021-10-25powerpc/pseries/iommu: Use correct vfree for it_mapAlexey Kardashevskiy
The it_map array is vzalloc'ed so use vfree() for it when creating a huge DMA window failed for whatever reason. While at this, write zero to it_map. Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211020132315.2287178-3-aik@ozlabs.ru
2021-10-22powerpc/pseries/iommu: Add of_node_put() before breakWan Jiabing
Fix following coccicheck warning: ./arch/powerpc/platforms/pseries/iommu.c:924:1-28: WARNING: Function for_each_node_with_property should have of_node_put() before break Early exits from for_each_node_with_property should decrement the node reference counter. Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Reviewed-by: Leonardo Bras <leobras.c@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211014075624.16344-1-wanjiabing@vivo.com
2021-10-09powerps/pseries/dma: Add support for 2M IOMMU page sizeAlexey Kardashevskiy
The upcoming PAPR spec adds a 2M page size, bit 23 right after 16G page size in the "ibm,query-pe-dma-window" call. This adds support for the new page size. Since the new page size is out of sorted order, this changes the loop to not assume that shift[] is sorted. This has now been tested and is known to work on a pre-release version of phyp. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211006044735.1114669-1-aik@ozlabs.ru
2021-08-27powerpc/pseries/iommu: Rename "direct window" to "dma window"Leonardo Bras
A previous change introduced the usage of DDW as a bigger indirect DMA mapping when the DDW available size does not map the whole partition. As most of the code that manipulates direct mappings was reused for indirect mappings, it's necessary to rename all names and debug/info messages to reflect that it can be used for both kinds of mapping. This should cause no behavioural change, just adjust naming. Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210817063929.38701-12-leobras.c@gmail.com
2021-08-27powerpc/pseries/iommu: Make use of DDW for indirect mappingLeonardo Bras
So far it's assumed possible to map the guest RAM 1:1 to the bus, which works with a small number of devices. SRIOV changes it as the user can configure hundreds VFs and since phyp preallocates TCEs and does not allow IOMMU pages bigger than 64K, it has to limit the number of TCEs per a PE to limit waste of physical pages. As of today, if the assumed direct mapping is not possible, DDW creation is skipped and the default DMA window "ibm,dma-window" is used instead. By using DDW, indirect mapping can get more TCEs than available for the default DMA window, and also get access to using much larger pagesizes (16MB as implemented in qemu vs 4k from default DMA window), causing a significant increase on the maximum amount of memory that can be IOMMU mapped at the same time. Indirect mapping will only be used if direct mapping is not a possibility. For indirect mapping, it's necessary to re-create the iommu_table with the new DMA window parameters, so iommu_alloc() can use it. Removing the default DMA window for using DDW with indirect mapping is only allowed if there is no current IOMMU memory allocated in the iommu_table. enable_ddw() is aborted otherwise. Even though there won't be both direct and indirect mappings at the same time, we can't reuse the DIRECT64_PROPNAME property name, or else an older kexec()ed kernel can assume direct mapping, and skip iommu_alloc(), causing undesirable behavior. So a new property name DMA64_PROPNAME "linux,dma64-ddr-window-info" was created to represent a DDW that does not allow direct mapping. Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210817063929.38701-11-leobras.c@gmail.com
2021-08-27powerpc/pseries/iommu: Find existing DDW with given property nameLeonardo Bras
At the moment pseries stores information about created directly mapped DDW window in DIRECT64_PROPNAME. With the objective of implementing indirect DMA mapping with DDW, it's necessary to have another propriety name to make sure kexec'ing into older kernels does not break, as it would if we reuse DIRECT64_PROPNAME. In order to have this, find_existing_ddw_windows() needs to be able to look for different property names. Extract find_existing_ddw_windows() into find_existing_ddw_windows_named() and calls it with current property name. Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210817063929.38701-10-leobras.c@gmail.com
2021-08-27powerpc/pseries/iommu: Update remove_dma_window() to accept property nameLeonardo Bras
Update remove_dma_window() so it can be used to remove DDW with a given property name. This enables the creation of new property names for DDW, so we can have different usage for it, like indirect mapping. Also, add return values to it so we can check if the property was found while removing the active DDW. This allows skipping the remaining property names while reducing the impact of multiple property names. Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210817063929.38701-9-leobras.c@gmail.com
2021-08-27powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new helperLeonardo Bras
Add a new helper _iommu_table_setparms(), and use it in iommu_table_setparms() and iommu_table_setparms_lpar() to avoid duplicated code. Also, setting tbl->it_ops was happening outsite iommu_table_setparms*(), so move it to the new helper. Since we need the iommu_table_ops to be declared before used, declare iommu_table_lpar_multi_ops and iommu_table_pseries_ops to before their respective iommu_table_setparms*(). Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210817063929.38701-8-leobras.c@gmail.com
2021-08-27powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()Leonardo Bras
Code used to create a ddw property that was previously scattered in enable_ddw() is now gathered in ddw_property_create(), which deals with allocation and filling the property, letting it ready for of_property_add(), which now occurs in sequence. This created an opportunity to reorganize the second part of enable_ddw(): Without this patch enable_ddw() does, in order: kzalloc() property & members, create_ddw(), fill ddwprop inside property, ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory, of_add_property(), and list_add(). With this patch enable_ddw() does, in order: create_ddw(), ddw_property_create(), of_add_property(), ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory, and list_add(). This change requires of_remove_property() in case anything fails after of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk in all memory, which looks the most expensive operation, only if everything else succeeds. Also, the error path got remove_ddw() replaced by a new helper __remove_dma_window(), which only removes the new DDW with an rtas-call. For this, a new helper clean_dma_window() was needed to clean anything that could left if walk_system_ram_range() fails. Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210817063929.38701-7-leobras.c@gmail.com
2021-08-27powerpc/pseries/iommu: Allow DDW windows starting at 0x00Leonardo Bras
enable_ddw() currently returns the address of the DMA window, which is considered invalid if has the value 0x00. Also, it only considers valid an address returned from find_existing_ddw if it's not 0x00. Changing this behavior makes sense, given the users of enable_ddw() only need to know if direct mapping is possible. It can also allow a DMA window starting at 0x00 to be used. This will be helpful for using a DDW with indirect mapping, as the window address will be different than 0x00, but it will not map the whole partition. Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210817063929.38701-6-leobras.c@gmail.com
2021-08-27powerpc/pseries/iommu: Add ddw_list_new_entry() helperLeonardo Bras
There are two functions creating direct_window_list entries in a similar way, so create a ddw_list_new_entry() to avoid duplicity and simplify those functions. Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210817063929.38701-5-leobras.c@gmail.com
2021-08-27powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helperLeonardo Bras
Creates a helper to allow allocating a new iommu_table without the need to reallocate the iommu_group. This will be helpful for replacing the iommu_table for the new DMA window, after we remove the old one with iommu_tce_table_put(). Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210817063929.38701-4-leobras.c@gmail.com
2021-08-27powerpc/pseries/iommu: Replace hard-coded page shiftLeonardo Bras
Some functions assume IOMMU page size can only be 4K (pageshift == 12). Update them to accept any page size passed, so we can use 64K pages. In the process, some defines like TCE_SHIFT were made obsolete, and then removed. IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figures 3.4 and 3.5 show a RPN of 52-bit, and considers a 12-bit pageshift, so there should be no need of using TCE_RPN_MASK, which masks out any bit after 40 in rpn. It's usage removed from tce_build_pSeries(), tce_build_pSeriesLP(), and tce_buildmulti_pSeriesLP(). Most places had a tbl struct, so using tbl->it_page_shift was simple. tce_free_pSeriesLP() was a special case, since callers not always have a tbl struct, so adding a tceshift parameter seems the right thing to do. Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210817063929.38701-2-leobras.c@gmail.com
2021-04-23powerpc/iommu: Do not immediately panic when failed IOMMU table allocationAlexey Kardashevskiy
Most platforms allocate IOMMU table structures (specifically it_map) at the boot time and when this fails - it is a valid reason for panic(). However the powernv platform allocates it_map after a device is returned to the host OS after being passed through and this happens long after the host OS booted. It is quite possible to trigger the it_map allocation panic() and kill the host even though it is not necessary - the host OS can still use the DMA bypass mode (requires a tiny fraction of it_map's memory) and even if that fails, the host OS is runnnable as it was without the device for which allocating it_map causes the panic. Instead of immediately crashing in a powernv/ioda2 system, this prints an error and continues. All other platforms still call panic(). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Leonardo Bras <leobras.c@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210216033307.69863-3-aik@ozlabs.ru
2021-04-21powerpc/pseries/iommu: Fix window size for direct mapping with pmemLeonardo Bras
As of today, if the DDW is big enough to fit (1 << MAX_PHYSMEM_BITS) it's possible to use direct DMA mapping even with pmem region. But, if that happens, the window size (len) is set to (MAX_PHYSMEM_BITS - page_shift) instead of MAX_PHYSMEM_BITS, causing a pagesize times smaller DDW to be created, being insufficient for correct usage. Fix this so the correct window size is used in this case. Fixes: bf6e2d562bbc4 ("powerpc/dma: Fallback to dma_ops when persistent memory present") Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210420045404.438735-1-leobras.c@gmail.com
2021-04-14powerpc/iommu: Enable remaining IOMMU Pagesizes present in LoPARLeonardo Bras
According to LoPAR, ibm,query-pe-dma-window output named "IO Page Sizes" will let the OS know all possible pagesizes that can be used for creating a new DDW. Currently Linux will only try using 3 of the 8 available options: 4K, 64K and 16M. According to LoPAR, Hypervisor may also offer 32M, 64M, 128M, 256M and 16G. Enabling bigger pages would be interesting for direct mapping systems with a lot of RAM, while using less TCE entries. Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210408201915.174217-1-leobras.c@gmail.com
2020-11-27powerpc/dma: Fallback to dma_ops when persistent memory presentAlexey Kardashevskiy
So far we have been using huge DMA windows to map all the RAM available. The RAM is normally mapped to the VM address space contiguously, and there is always a reasonable upper limit for possible future hot plugged RAM which makes it easy to map all RAM via IOMMU. Now there is persistent memory ("ibm,pmemory" in the FDT) which (unlike normal RAM) can map anywhere in the VM space beyond the maximum RAM size and since it can be used for DMA, it requires extending the huge window up to MAX_PHYSMEM_BITS which requires hypervisor support for: 1. huge TCE tables; 2. multilevel TCE tables; 3. huge IOMMU pages. Certain hypervisors cannot do either so the only option left is restricting the huge DMA window to include only RAM and fallback to the default DMA window for persistent memory. This defines arch_dma_map_direct/etc to allow generic DMA code perform additional checks on whether direct DMA is still possible. This checks if the system has persistent memory. If it does not, the DMA bypass mode is selected, i.e. * dev->bus_dma_limit = 0 * dev->dma_ops_bypass = true <- this avoid calling dma_ops for mapping. If there is such memory, this creates identity mapping only for RAM and sets the dev->bus_dma_limit to let the generic code decide whether to call into the direct DMA or the indirect DMA ops. This should not change the existing behaviour when no persistent memory as dev->dma_ops_bypass is expected to be set. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Christoph Hellwig <hch@lst.de>