arm64/crc32: Implement 4-way interleave using PMULL - lwn.git - Linux kernel documentation tree maintained by Jonathan Corbet

author	Ard Biesheuvel <ardb@kernel.org>	2024-10-18 09:53:51 +0200
committer	Catalin Marinas <catalin.marinas@arm.com>	2024-10-22 11:54:43 +0100
commit	a6478d69cf56d5deb4c28a6486376d9c7895abec
tree	67c13a9b2a78e82813d236c8205e2ebae12d8572
parent	b98b23e19492f4009070761c53b755f623f60e49

arm64/crc32: Implement 4-way interleave using PMULL

Now that kernel mode NEON no longer disables preemption, using FP/SIMD in library code which is not obviously part of the crypto subsystem is no longer problematic, as it will no longer incur unexpected latencies.

So accelerate the CRC-32 library code on arm64 to use a 4-way interleave, using PMULL instructions to implement the folding.

On Apple M2, this results in a speedup of 2 - 2.8x when using input sizes of 1k - 8k. For smaller sizes, the overhead of preserving and restoring the FP/SIMD register file may not be worth it, so 1k is used as a threshold for choosing this code path.

The coefficient tables were generated using code provided by Eric. [0]

[0] https://github.com/ebiggers/libdeflate/blob/master/scripts/gen_crc32_multipliers.c

Cc: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20241018075347.2821102-8-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
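The interleave-and-fold idea described above can be illustrated with a portable userspace sketch. This is not the code from this patch: the real implementation is arm64 assembly using PMULL and precomputed coefficient tables, while the sketch below uses a bitwise CRC and computes the folding multiplier at run time. All function names and the FOURWAY_MIN_LEN macro are hypothetical; only the 1k threshold and the 4-way split are taken from the commit message. The buffer is split into four equal lanes whose CRCs advance independently, and the lane results are then folded together by multiplying with x^(8n) modulo the CRC-32 polynomial, where n is the lane length in bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY_LE   0xEDB88320u /* bit-reflected CRC-32 polynomial */
#define FOURWAY_MIN_LEN 1024u       /* 1k threshold named in the commit message */

/* Advance a raw (unconditioned) reflected CRC by one input byte. */
static uint32_t crc32_byte(uint32_t crc, uint8_t b)
{
    crc ^= b;
    for (int i = 0; i < 8; i++)
        crc = (crc & 1) ? (crc >> 1) ^ CRC32_POLY_LE : crc >> 1;
    return crc;
}

/* Plain one-stream CRC; the caller applies any initial/final inversion. */
static uint32_t crc32_le_scalar(uint32_t crc, const uint8_t *p, size_t len)
{
    while (len--)
        crc = crc32_byte(crc, *p++);
    return crc;
}

/* Multiply two polynomials modulo the CRC polynomial (bit 31 holds x^0). */
static uint32_t gf2_mul(uint32_t a, uint32_t b)
{
    uint32_t prod = 0;
    for (uint32_t m = 1u << 31; m != 0; m >>= 1) {
        if (a & m)
            prod ^= b;
        /* b *= x, reducing when the x^31 coefficient shifts out */
        b = (b & 1) ? (b >> 1) ^ CRC32_POLY_LE : b >> 1;
    }
    return prod;
}

/* x^(8n) mod P by square-and-multiply; stands in for a coefficient table. */
static uint32_t x_pow_8n(size_t n)
{
    uint32_t result = 1u << 31; /* x^0 */
    uint32_t base = 1u << 23;   /* x^8: the effect of one zero byte */
    while (n) {
        if (n & 1)
            result = gf2_mul(result, base);
        base = gf2_mul(base, base);
        n >>= 1;
    }
    return result;
}

/* Four independent lanes, folded together afterwards (hypothetical helper). */
static uint32_t crc32_le_4way(uint32_t crc, const uint8_t *p, size_t len)
{
    size_t n = len / 4; /* bytes per lane */
    uint32_t c0 = crc, c1 = 0, c2 = 0, c3 = 0;

    /* The four updates have no data dependency on each other. */
    for (size_t i = 0; i < n; i++) {
        c0 = crc32_byte(c0, p[i]);
        c1 = crc32_byte(c1, p[n + i]);
        c2 = crc32_byte(c2, p[2 * n + i]);
        c3 = crc32_byte(c3, p[3 * n + i]);
    }

    /* Fold: shift the running CRC past the next lane, then XOR that lane in. */
    uint32_t shift = x_pow_8n(n);
    crc = gf2_mul(c0, shift) ^ c1;
    crc = gf2_mul(crc, shift) ^ c2;
    crc = gf2_mul(crc, shift) ^ c3;

    /* Up to three leftover bytes go through the scalar path. */
    return crc32_le_scalar(crc, p + 4 * n, len - 4 * n);
}

/* Pick the 4-way path only for inputs at or above the threshold. */
static uint32_t crc32_le_dispatch(uint32_t crc, const uint8_t *p, size_t len)
{
    if (len < FOURWAY_MIN_LEN)
        return crc32_le_scalar(crc, p, len);
    return crc32_le_4way(crc, p, len);
}
```

In this sketch the lanes run in a single scalar loop, so it shows only that the split-and-fold arithmetic yields the same CRC as a single pass; any speedup in practice comes from hardware that can overlap the independent lane updates, which is what the patch targets.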
