summaryrefslogtreecommitdiff
path: root/net/sunrpc
AgeCommit message (Collapse)Author
12 dayssvcrdma: wake sq waiters when the transport closesChuck Lever
Threads parked in svc_rdma_sq_wait() on sc_sq_ticket_wait or sc_send_wait can hang indefinitely in TASK_UNINTERRUPTIBLE state across transport teardown, pinning svc_xprt references and blocking svc_rdma_free(). The close path sets XPT_CLOSE before invoking xpo_detach and both wait_event predicates include an XPT_CLOSE term, but the predicates are re-evaluated only on wakeup. sc_sq_ticket_wait has no completion-driven wake path; it is advanced solely by the chained ticket handoff inside svc_rdma_sq_wait() itself. Without an explicit wake at close, parked threads never observe XPT_CLOSE, hold their svc_xprt_get reference forever, and svc_rdma_free() blocks on xpt_ref dropping to zero. Two close entry points reach this transport. Local teardown runs svc_rdma_detach() from svc_handle_xprt() -> svc_delete_xprt() -> xpo_detach() on a worker thread. A remote disconnect arrives at svc_rdma_cma_handler(), which calls svc_xprt_deferred_close(): that sets XPT_CLOSE and enqueues the transport but does not access either RDMA waitqueue, so a worker already parked in svc_rdma_sq_wait() never re-evaluates its predicate. With every worker parked on this transport, no thread is available to run the local teardown either, and the wake site there is unreachable. Introduce svc_rdma_xprt_deferred_close(), a thin svcrdma wrapper that calls svc_xprt_deferred_close() and then wakes both sc_sq_ticket_wait and sc_send_wait. Convert the svcrdma producers that called svc_xprt_deferred_close() directly: svc_rdma_cma_handler(), qp_event_handler(), svc_rdma_post_send_err(), svc_rdma_wc_send(), the sendto drop path, the rw completion error paths, and the recvfrom flush and read-list error paths. Wake both waitqueues from svc_rdma_detach() as well. The synchronous svc_xprt_close() path (backchannel ENOTCONN, device removal via svc_rdma_xprt_done) reaches detach without flowing through svc_xprt_deferred_close() and therefore does not invoke the new helper. Fixes: ccc89b9d1ed2 ("svcrdma: Add fair queuing for Send Queue access") Cc: stable@vger.kernel.org Assisted-by: kres (claude-opus-4-7) Signed-off-by: Chris Mason <clm@meta.com> [ cel: add svc_rdma_xprt_deferred_close() to complete the fix ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 dayssunrpc: wait for in-flight TLS handshake callback when cancel loses raceChuck Lever
When wait_for_completion_interruptible_timeout() in svc_tcp_handshake() returns 0 (timeout) or -ERESTARTSYS (signal) and tls_handshake_cancel() then returns false, handshake_complete() has won the cancellation race: it has set HANDSHAKE_F_REQ_COMPLETED and is about to invoke svc_tcp_handshake_done(), but the callback's side effects on xpt_flags and on svsk->sk_handshake_done have not yet committed. The current code reads xpt_flags immediately to decide whether the session succeeded. Two races result. If the callback has executed set_bit(XPT_TLS_SESSION) but not yet clear_bit(XPT_HANDSHAKE), svc_tcp_handshake() sees a session, enqueues the transport, and returns. svc_xprt_received() then clears XPT_BUSY, a worker thread picks the transport up, the dispatcher in svc_handle_xprt() observes XPT_HANDSHAKE still set, and xpo_handshake is invoked a second time. That svc_tcp_handshake() calls init_completion(&svsk->sk_handshake_done) while the original callback concurrently calls complete_all() on it, corrupting the embedded swait_queue. If the callback has set HANDSHAKE_F_REQ_COMPLETED but not yet entered svc_tcp_handshake_done(), svc_tcp_handshake() reads XPT_TLS_SESSION as clear and tears the connection down even though the handshake is about to succeed. Wait for the callback to commit before inspecting xpt_flags. The completion is guaranteed to fire because handshake_complete() invokes svc_tcp_handshake_done() unconditionally once it has set HANDSHAKE_F_REQ_COMPLETED. Fixes: b3cbf98e2fdf ("SUNRPC: Support TLS handshake in the server-side TCP socket code") Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 dayssunrpc: pin svc_xprt across the asynchronous TLS handshake callbackChris Mason
svc_tcp_handshake() stores the raw svc_xprt pointer in tls_handshake_args.ta_data and submits the request through tls_server_hello_x509(). The handshake core takes only sock_hold(req->hr_sk); nothing references the embedding struct svc_sock that svc_tcp_handshake_done() reaches via container_of(). Two close races leave the in-flight callback writing through a freed svc_sock. svc_sock_free() calls tls_handshake_cancel() and discards its return value: a false return means handshake_complete() has already set HANDSHAKE_F_REQ_COMPLETED but hp_done() may not have finished, yet svc_sock_free() proceeds to kfree(svsk). The cancel-loser fall-through inside svc_tcp_handshake() itself produces the same window: when wait_for_completion_interruptible_timeout() returns <= 0 (timeout or signal) and tls_handshake_cancel() returns false, the function does not drain, returns, and svc_handle_xprt() calls svc_xprt_received(), which clears XPT_BUSY and can drop the last reference. A concurrent close then runs svc_sock_free() while svc_tcp_handshake_done() is still updating xpt_flags and walking svsk->sk_handshake_done. The corruption surfaces as set_bit/clear_bit RMW into the freed xpt_flags slab slot and as complete_all() walking and writing the freed wait_queue_head_t list embedded in sk_handshake_done -- a slab-corruption primitive, not a benign read. The path is reachable on any TLS-enabled NFS server whenever a connection close overlaps the tlshd downcall delivery window; the interruptible wait means signal delivery suffices, not just SVC_HANDSHAKE_TO expiry. Take svc_xprt_get(xprt) immediately before tls_server_hello_x509() so the in-flight callback owns its own reference. Release it on the two edges where the callback is guaranteed not to fire -- submission failure from tls_server_hello_x509() and a successful tls_handshake_cancel() -- and at the tail of svc_tcp_handshake_done() after complete_all(). Fixes: b3cbf98e2fdf ("SUNRPC: Support TLS handshake in the server-side TCP socket code") Cc: stable@vger.kernel.org Signed-off-by: Chris Mason <clm@meta.com> Assisted-by: kres (claude-opus-4-7) [cel: rewrote commit message to describe the actual change] Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 dayssunrpc: harden rq_procinfo lifecycle to prevent double-freeLuxiao Xu
The svc_release_rqst() function executes the callback inside rqstp->rq_procinfo->pc_release. However, if a worker thread begins processing a new request and encounters an early error path (e.g., unsupported protocol, short frame, or bad auth) before a valid rq_procinfo is installed, a stale release hook can be re-triggered against reused state from the previous RPC, resulting in a double-free or use-after-free vulnerability. Harden the lifecycle of rq_procinfo by: 1. Ensuring svc_release_rqst() always clears rq_procinfo after the optional pc_release() call, regardless of whether the hook exists. 2. Explicitly clearing rq_procinfo at request entry in svc_process() before any early decode or drop paths. 3. Ensuring svc_process_bc() does the same at backchannel entry. This guarantees that error flows will not encounter a non-NULL stale rq_procinfo pointer when there is nothing to release. Fixes: d9adbb6e10bf ("sunrpc: delay pc_release callback until after the reply is sent") Cc: stable@vger.kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Suggested-by: Chuck Lever <cel@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Luxiao Xu <rakukuip@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Return an error from xdr_buf_to_bvec() on overflowChuck Lever
xdr_buf_to_bvec() returns a slot count even when the caller's bvec budget is exhausted partway through the xdr_buf. Callers feed that count into iov_iter_bvec() and continue as if the conversion had succeeded, silently sending or writing fewer bytes than the data length declares. For an NFS WRITE the server reports the truncated transfer to the client as full success. The overflow represents an internal invariant violation: a higher layer reserved a bvec budget too small for the xdr_buf it then asked the encoder to convert. That is a server-side fault, not a media I/O failure and not a malformed client argument. Change xdr_buf_to_bvec() to return a signed int and have the overflow label return -ESERVERFAULT. Update the three callers to detect the negative return and fail the request: nfsd_vfs_write() folds the error into host_err, which nfserrno() translates to nfserr_serverfault for the WRITE reply; svc_udp_sendto() and svc_tcp_sendmsg() propagate the error out of the send path. Reported-by: Chris Mason <clm@meta.com> Fixes: 2eb2b9358181 ("SUNRPC: Convert svc_tcp_sendmsg to use bio_vecs directly") Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Bound-check xdr_buf_to_bvec() stores before writingChuck Lever
xdr_buf_to_bvec() writes a bio_vec into the caller's array before testing whether that slot is in range, and the head branch performs the store with no check at all. When the caller's budget is exactly used up, the next store lands one element past the end of the array. The overflow label returns count - 1, which masks the surplus store but cannot undo it. rq_bvec, the array passed by nfsd_vfs_write(), is allocated to exactly rq_maxpages entries with no slack. The OOB store can land in adjacent slab memory; the bv_len and bv_offset fields written there are derived from client-supplied RPC payload sizes. Move the in-range check ahead of the store in the head, page-loop, and tail branches. With the check at the top of each sequence, count is incremented only after a successful store, so the overflow label can return count directly. Reported-by: Chris Mason <clm@meta.com> Fixes: 2eb2b9358181 ("SUNRPC: Convert svc_tcp_sendmsg to use bio_vecs directly") Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysRevert "svcrdma: Use contiguous pages for RDMA Read sink buffers"Chuck Lever
Jonathan Flynn reports that commit 18755b8c2f24 ("svcrdma: Use contiguous pages for RDMA Read sink buffers") regresses NFS/RDMA WRITE throughput from 73.9 GiB/s to 30.3 GiB/s on a 128-core single-NUMA-node server driving dual 400Gb/s links with 640 nfsd threads. Server CPU utilization rises from 8.5% to 76%, with roughly three quarters of all cycles spent spinning on zone->lock. The sink buffers are allocated as high-order page blocks, split into single pages so each sub-page carries an independent refcount, and later released one page at a time through folio batches. The per-CPU page caches cannot satisfy an allocation stream whose alloc order differs from its free order, so every sink buffer page makes a round trip through the buddy allocator's free lists, serialized on the zone lock of the single NUMA node. The rq_pages entries that the split pages displace, bulk-allocated moments earlier by svc_alloc_arg(), are freed without ever being used, doubling the allocator traffic. The regression cannot be addressed trivially. Revert the commit now; a reworked approach can return in an upcoming merge window. Reported-by: Jonathan Flynn <jonathan.flynn@hammerspace.com> Reported-by: Mike Snitzer <snitzer@kernel.org> Closes: https://lore.kernel.org/linux-nfs/aiHlPmeZq3WgMwoJ@kernel.org/ Closes: https://lore.kernel.org/linux-nfs/3cb119b4b2a8aada30c0c60286778a54@mail.gmail.com/ Fixes: 18755b8c2f24 ("svcrdma: Use contiguous pages for RDMA Read sink buffers") Cc: stable@vger.kernel.org Tested-by: Jonathan Flynn <jonathan.flynn@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 dayssvcrdma: Defer send context release to xpo_release_ctxtChuck Lever
Send completion currently queues a work item to an unbound workqueue for each completed send context. Under load, the Send Completion handlers contend for the shared workqueue pool lock. Replace the workqueue with a per-transport lock-free list (llist). The Send completion handler appends the send_ctxt to sc_send_release_list and does no further teardown. The nfsd thread drains the list in xpo_release_ctxt between RPCs, performing DMA unmapping, chunk I/O resource release, and page release in a batch. This eliminates both the workqueue pool lock and the DMA unmap cost from the Send completion path. DMA unmapping can be expensive when an IOMMU is present in strict mode, as each unmap triggers a synchronous hardware IOTLB invalidation. Moving it to the nfsd thread, where that latency is harmless, avoids penalizing completion handler throughput. The nfsd threads absorb the release cost at a point where the client is no longer waiting on a reply, and natural batching amortizes the overhead when completions arrive faster than RPCs complete. A self-enqueue backstops drain on a quiescing transport. When svc_rdma_send_ctxt_put() observes that its llist_add() transitions sc_send_release_list from empty to non-empty, it sets XPT_DATA and calls svc_xprt_enqueue() so that svc_xprt_ready() schedules an nfsd thread. The thread enters svc_rdma_recvfrom(), finds no pending receive, clears XPT_DATA, and returns 0; svc_xprt_release() then runs xpo_release_ctxt and drains the list. Under steady load the foreground drain keeps the list non-empty between adds and no enqueue fires; only the trailing edge of a burst pays for a wakeup. Without this path, a Send completion arriving after the last xpo_release_ctxt on an idle connection would leave the send_ctxt's DMA mappings and reply pages pinned until the next RPC, send-context exhaustion, or transport close. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 dayssvcrdma: Release write chunk resources without re-queuingChuck Lever
Each RDMA Send completion triggers a cascade of work items on the svcrdma_wq unbound workqueue: ib_cq_poll_work (on ib_comp_wq, per-CPU) -> svc_rdma_send_ctxt_put -> queue_work [work item 1] -> svc_rdma_write_info_free -> queue_work [work item 2] Every transition through queue_work contends on the unbound pool's spinlock. Profiling an 8KB NFSv3 read/write workload over RDMA shows about 4% of total CPU cycles spent on this lock, with the cascading re-queue of write_info release contributing roughly 1%. The initial queue_work in svc_rdma_send_ctxt_put is needed to move release work off the CQ completion context (which runs on a per-CPU bound workqueue). However, once executing on svcrdma_wq, there is no need to re-queue for each write_info structure. svc_rdma_reply_chunk_release already calls svc_rdma_cc_release inline from the same svcrdma_wq context, and svc_rdma_recv_ctxt_put does the same from nfsd thread context. Release write chunk resources inline in svc_rdma_write_info_free, removing the intermediate svc_rdma_write_info_free_async work item and the wi_work field from struct svc_rdma_write_info. Reviewed-by: Mike Snitzer <snitzer@kernel.org> Tested-by: Jonathan Flynn <jonathan.flynn@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Remove dead rpcsec_gss_krb5 definitionsChuck Lever
The migration to crypto/krb5 eliminated the per-enctype function dispatch and direct crypto API usage, leaving behind a number of orphaned definitions. Remove the following from gss_krb5.h: - GSS_KRB5_K5CLENGTH, used only by removed key derivation - KG_TOK_MIC_MSG and KG_TOK_WRAP_MSG (Kerberos v1 token types; v1 support was dropped earlier) - KG2_TOK_INITIAL and KG2_TOK_RESPONSE (context establishment token types; no remaining users) - KG2_RESP_FLAG_ERROR and KG2_RESP_FLAG_DELEG_OK - enum sgn_alg and enum seal_alg (v1 algorithm constants) - All CKSUMTYPE_* definitions, now duplicated by KRB5_CKSUMTYPE_* in <crypto/krb5.h> - The KG_ error constants from gssapi_err_krb5.h, which have no remaining users - The ENCTYPE_* constant block, replaced by KRB5_ENCTYPE_* from <crypto/krb5.h> - KG_USAGE_SEAL/SIGN/SEQ (3DES usage constants) - KEY_USAGE_SEED_CHECKSUM/ENCRYPTION/INTEGRITY, duplicated by <crypto/krb5.h> - #include <crypto/skcipher.h>, no longer needed Remove the cksum[] field from struct krb5_ctx in gss_krb5_internal.h; no code reads or writes it after the key derivation removal. Switch gss_krb5_enctypes[] in gss_krb5_mech.c to the canonical KRB5_ENCTYPE_* names from <crypto/krb5.h>. Remove stale #include directives: - <crypto/skcipher.h> from gss_krb5_wrap.c - <linux/random.h> and <linux/crypto.h> from gss_krb5_seal.c Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Remove redundant crypto Kconfig dependenciesChuck Lever
With all per-message crypto operations now routed through crypto/krb5, rpcsec_gss_krb5 no longer calls individual crypto algorithms directly. The CRYPTO_KRB5 symbol already selects CRYPTO_SKCIPHER and CRYPTO_HASH (the latter transitively via CRYPTO_HMAC). Drop the top-level select CRYPTO_SKCIPHER and select CRYPTO_HASH from RPCSEC_GSS_KRB5, as these are redundant with CRYPTO_KRB5's own dependencies. Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Remove per-enctype Kconfig optionsChuck Lever
The RPCSEC_GSS_KRB5_ENCTYPES_AES_SHA1, RPCSEC_GSS_KRB5_ENCTYPES_CAMELLIA, and RPCSEC_GSS_KRB5_ENCTYPES_AES_SHA2 Kconfig options originally gated both algorithm availability and the advertised enctype list. Now that per-message crypto operations are routed through crypto/krb5, these options control only which enctype numbers appear in the gssd upcall string; the underlying algorithms are always present. Remove the per-enctype Kconfig options and replace the ifdef-gated enctype table with a candidate list looked up in the crypto/krb5 enctype table at module init time. Each enctype is included in the advertised list only if crypto_krb5_find_enctype() finds it in the library's enctype table. When a new enctype is added to crypto/krb5, adding its constant to the candidate array is sufficient to begin advertising it. Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Remove dead code from rpcsec_gss_krb5Chuck Lever
With all per-message crypto operations routed through crypto/krb5, a substantial body of code in rpcsec_gss_krb5 has no remaining callers. The internal key derivation functions (krb5_derive_key_v2, krb5_kdf_hmac_sha2, krb5_kdf_feedback_cmac) and the low-level crypto primitives (krb5_encrypt, gss_krb5_checksum, krb5_cbc_cts_ encrypt/decrypt, krb5_etm_checksum) are unreachable because their only call sites were the per-enctype function pointers removed in previous patches. Delete gss_krb5_keys.c entirely and strip the dead functions from gss_krb5_crypto.c. The KUnit test suite in gss_krb5_test.c exercised exactly these internal functions: RFC 3961 n-fold, RFC 3962 key derivation, RFC 6803 Camellia key derivation, and RFC 8009 AES-SHA2 key derivation, plus encryption self-tests that drove the now-removed encrypt routines. The corresponding test coverage is provided by the crypto/krb5 selftests in crypto/krb5/selftest.c. Remove the test file, the RPCSEC_GSS_KRB5_KUNIT_TEST Kconfig symbol, the .kunitconfig, and all VISIBLE_IF_KUNIT / EXPORT_SYMBOL_IF_KUNIT annotations. xdr_process_buf() walked xdr_buf segments through a per-segment callback and existed solely for the crypto routines in gss_krb5_crypto.c. With that file removed, xdr_process_buf() has no remaining callers. Its successor, xdr_buf_to_sg(), populates a scatterlist directly from an xdr_buf byte range and was introduced earlier in this series. With every consumer of struct gss_krb5_enctype removed, replace its remaining uses with the equivalent fields from struct krb5_enctype (key_len). Remove struct gss_krb5_enctype, the supported_gss_krb5_enctypes[] table, gss_krb5_lookup_enctype(), and the gk5e pointer from krb5_ctx. Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Remove legacy skcipher/ahash handles from krb5_ctxChuck Lever
Previous patches switched all per-message crypto operations (encrypt, decrypt, get_mic, verify_mic) from the internal skcipher/ahash primitives to crypto/krb5 AEAD and shash handles. The old crypto_sync_skcipher and crypto_ahash fields in struct krb5_ctx are no longer referenced at runtime. Remove the ten legacy handle fields from struct krb5_ctx along with the key derivation and handle allocation code in gss_krb5_import_ctx_v2() that populated them. Context import now prepares only the four crypto/krb5 handles (two AEAD for encryption, two shash for checksums). The corresponding cleanup in gss_krb5_delete_sec_context() and the error path is likewise reduced. The krb5_derive_key() inline wrapper, gss_krb5_alloc_cipher_v2(), and gss_krb5_alloc_hash_v2() become unused and are removed. The per-enctype encrypt/decrypt functions (gss_krb5_aes_encrypt, gss_krb5_aes_decrypt, krb5_etm_encrypt, krb5_etm_decrypt) that were the sole remaining consumers of these fields are also removed; their function-pointer call sites were already deleted in earlier patches. Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Remove encrypt/decrypt function pointers from enctype tableChuck Lever
All enctypes now route through gss_krb5_aead_encrypt() and gss_krb5_aead_decrypt(). The per-enctype .encrypt and .decrypt function pointers served the same purpose as .get_mic and .wrap before them: dispatching v1 versus v2 implementations. With v1 support long removed and the Camellia decrypt path migrated in a preceding patch, every table entry points to the same pair of functions. Call gss_krb5_aead_encrypt() and gss_krb5_aead_decrypt() directly from gss_krb5_wrap_v2() and gss_krb5_unwrap_v2(), and drop the function pointers from struct gss_krb5_enctype. While here, propagate the GSS status code returned by gss_krb5_aead_decrypt() instead of discarding it. The old indirect call sites returned GSS_S_FAILURE unconditionally, losing the distinction between an integrity failure (GSS_S_BAD_SIG) and a structural error (GSS_S_DEFECTIVE_TOKEN). Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Remove wrap/unwrap function pointers from enctype tableChuck Lever
Every enctype points .wrap and .unwrap at gss_krb5_wrap_v2() and gss_krb5_unwrap_v2(). As with get_mic/verify_mic, the indirection dates from when v1 enctypes had different wrap implementations. Call the functions directly and remove the pointers from struct gss_krb5_enctype. Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Remove get_mic/verify_mic function pointers from enctype tableChuck Lever
Every enctype in the table points .get_mic and .verify_mic at the same pair of functions. The indirection served no purpose after the v1 enctype support was removed. Call gss_krb5_get_mic_v2() and gss_krb5_verify_mic_v2() directly from the GSS mechanism dispatch and drop the function pointers from struct gss_krb5_enctype. Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Switch MIC token verification to crypto/krb5Chuck Lever
gss_krb5_verify_mic_v2() currently recomputes a checksum using gss_krb5_checksum() and then compares it against the received checksum with memcmp(). Replace this with a call to crypto_krb5_verify_mic(), which performs the hash, comparison, and offset/length adjustment in a single operation through the crypto/krb5 library. The scatterlist layout required by RFC 4121 Section 4.2.4 is constructed via gss_krb5_mic_build_sg(), the shared helper introduced in the preceding commit. The received checksum occupies the first scatterlist entry, pointing directly into the token buffer. The errno result from crypto_krb5_verify_mic() is mapped to a GSS major status code via gss_krb5_errno_to_status(), which returns GSS_S_BAD_SIG for -EBADMSG (checksum mismatch). Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Switch MIC token generation to crypto/krb5Chuck Lever
gss_krb5_get_mic_v2() currently computes the MIC checksum by driving a crypto_ahash directly, calling gss_krb5_checksum() with the message body and GSS token header. Replace this with a call to crypto_krb5_get_mic(), which performs the same keyed hash operation through the crypto/krb5 library. RFC 4121 Section 4.2.4 specifies that the checksum covers the message body followed by the token header. Because the crypto/krb5 metadata parameter is hashed before the data, the GSS header cannot be passed as metadata. Instead, the header is appended to the scatterlist after the body data, producing the correct hash input ordering without using the metadata parameter. The scatterlist layout is: [checksum_output | message_body | gss_header] The first scatterlist entry points directly into the token buffer, so the checksum is written in place. A shared helper, gss_krb5_mic_build_sg(), is introduced in gss_krb5_crypto.c to construct this scatterlist layout. The helper handles overflow allocation and scatterlist chaining for large xdr_buf page arrays. It is reused by the verify_mic counterpart in the following commit. Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Switch Camellia decrypt to crypto/krb5Chuck Lever
The Camellia enctypes (RFC 6803) use the same MtE authenticated encryption construction as AES-SHA1 (RFC 3962), implemented in crypto/krb5 by the rfc3961_simplified profile. The encrypt path already uses gss_krb5_aead_encrypt() for Camellia, but the decrypt path was left on the old gss_krb5_aes_decrypt() code when the AES enctypes were migrated. Switch the Camellia .decrypt callback to gss_krb5_aead_decrypt() to complete the AEAD migration for all enctypes. The conf_len and cksum_len values in crypto/krb5's Camellia enctype descriptors match the block size and checksum length that gss_krb5_aes_decrypt() was using, so the headskip and tailskip returned to the unwrap layer are unchanged. Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Switch wrap token decryption to crypto/krb5Chuck Lever
Replace the per-enctype .decrypt callbacks (gss_krb5_aes_decrypt and krb5_etm_decrypt) with a single gss_krb5_aead_decrypt() wrapper that delegates to crypto_krb5_decrypt(). The new wrapper builds a scatterlist covering the secured region (confounder through checksum), passes it to the AEAD decrypt operation, and derives the confounder and checksum lengths from the data offset and length that crypto_krb5_decrypt() reports. The caller's token header verification and buffer adjustment logic is unchanged. Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Switch wrap token encryption to crypto/krb5Chuck Lever
Replace the per-enctype .encrypt callbacks (gss_krb5_aes_encrypt and krb5_etm_encrypt) with a single gss_krb5_aead_encrypt() wrapper that delegates to crypto_krb5_encrypt(). The xdr_buf setup -- GSS header insertion, confounder space allocation, and token header copy -- remains unchanged. The difference is that the CBC-CTS encryption and HMAC computation are now a single AEAD operation through the crypto/krb5 library. Both the MtE construction (RFC 3962) and the EtM construction (RFC 8009) are handled transparently by the AEAD transform. The plaintext page data must be copied from the page cache pages to the scratch output pages before building the scatterlist, since the AEAD operates in-place rather than using separate input and output scatterlists. Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Prepare crypto/krb5 encryption and checksum handlesChuck Lever
Allocate crypto_aead handles for encryption (one per direction) and crypto_shash handles for checksumming (one per direction) using the crypto/krb5 library's key preparation functions. These four handles derive their subkeys from the session key and the RFC 4121 usage numbers and are ready for use in encrypt, decrypt, get_mic, and verify_mic operations. The existing crypto_sync_skcipher and crypto_ahash handles remain in place for now; subsequent patches switch the per-message operations to the new handles and then remove the old ones. Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Add errno-to-GSS status conversion helperChuck Lever
The crypto/krb5 library returns standard negative errno values, but the GSS mechanism layer reports results as GSS_S_* major status codes. A translation is needed at each call site that will be switched to the new library. Rather than open-coding the mapping in every wrapper, provide a single helper function. Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Add helpers to convert xdr_buf byte ranges to scatterlistsChuck Lever
The crypto/krb5 library accepts data in scatterlist form, but the GSS-API layer presents RPC payloads as struct xdr_buf. Bridge that gap with a pair of helper functions: xdr_buf_to_sg() - populate a caller-supplied scatterlist array from a byte range xdr_buf_to_sg_alloc() - populate a caller-supplied inline scatterlist, chaining to a heap- allocated overflow for large payloads The inline array (typically stack-allocated at eight entries) covers the common case of small RPCs with no heap allocation on the encrypt/decrypt path. Only buffers spanning many pages incur a kmalloc for the chained extension. The segment-walking logic follows the same head, page array, tail traversal as xdr_process_buf(), but populates a scatterlist directly rather than invoking a per-segment callback. sg_next() traversal makes the walker safe for chained scatterlists. Once subsequent patches reroute all per-message crypto operations through crypto/krb5, xdr_process_buf() loses its last callers and is removed. Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Add crypto/krb5 enctype lookup to krb5_ctxChuck Lever
Each krb5_ctx currently points to a gss_krb5_enctype, the rpcsec_gss_krb5 module's own enctype descriptor. To begin using the common crypto/krb5 library, store a pointer to the corresponding struct krb5_enctype (from <crypto/krb5.h>) as well. The lookup is performed in gss_import_v2_context() immediately after the existing gss_krb5_lookup_enctype() call. If crypto_krb5_find_enctype() cannot find a matching enctype the context import fails, ensuring the module never operates with a partially-initialized krb5_ctx. Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
12 daysSUNRPC: Add Kconfig dependency on CRYPTO_KRB5Chuck Lever
The rpcsec_gss_krb5 module currently contains its own Kerberos 5 crypto implementation (key derivation, encryption, checksumming) that duplicates functionality available in the common crypto/krb5 library. As a first step toward migrating to that library, add a Kconfig select so that building rpcsec_gss_krb5 pulls in the common Kerberos 5 crypto support. The per-enctype Kconfig options (AES_SHA1, CAMELLIA, AES_SHA2) remain: they continue to gate which encryption types are offered by the GSS mechanism. The individual crypto algorithm selects they carry become redundant once the migration is complete, since CRYPTO_KRB5 already selects all needed ciphers and hashes. Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-01NFSD: Put cache get-reqs dump attrs under replyChuck Lever
The new get-reqs dump operations added to sunrpc_cache.yaml and nfsd.yaml place the "requests" nested attribute under dump.request. A netlink dump carries an empty request; its payload travels back in the reply. Because the spec names no reply attributes, the YNL C code generator synthesizes a forward reference to a <op>_rsp struct that is never defined, breaking any consumer of these specs. This first surfaced when Thorsten Leemhuis built tools/net/ynl against -next: nfsd-user.h:746: error: field 'obj' has incomplete type struct nfsd_svc_export_get_reqs_rsp obj ... nfsd-user.h:826: error: field 'obj' has incomplete type struct nfsd_expkey_get_reqs_rsp obj ... nfsd-user.c:1211: error: 'nfsd_svc_export_get_reqs_rsp_parse' undeclared sunrpc_cache.yaml has the same defect in ip-map-get-reqs and unix-gid-get-reqs, but nfsd.yaml errors out first in the Makefile's alphabetical build order and hides the sunrpc failures. These bugs were introduced by incorrect merge conflict resolution. Reported-by: Thorsten Leemhuis <linux@leemhuis.info> Closes: https://lore.kernel.org/linux-nfs/f6a3ca6d-e5cb-4a5c-9af2-8d2b1ce33ef0@leemhuis.info/ Fixes: 1045ccf519ce30 ("sunrpc: add netlink upcall for the auth.unix.ip cache") Tested-by: Thorsten Leemhuis <linux@leemhuis.info> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-01sunrpc: add SUNRPC_CMD_CACHE_FLUSH netlink commandJeff Layton
Add a new SUNRPC_CMD_CACHE_FLUSH generic netlink command that allows userspace to flush the sunrpc auth caches (ip_map and unix_gid) without writing to /proc/net/rpc/*/flush. An optional SUNRPC_A_CACHE_FLUSH_MASK u32 attribute selects which caches to flush (bit 1 = ip_map, bit 2 = unix_gid). If the attribute is omitted, all sunrpc caches are flushed. This is used by exportfs to replace its /proc-based cache_flush() with a netlink equivalent, with /proc fallback for older kernels. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-01nfsd: add netlink upcall for the svc_export cacheJeff Layton
Add netlink-based cache upcall support for the svc_export (nfsd.export) cache to Documentation/netlink/specs/nfsd.yaml and regenerate the resulting files. Implement nfsd_cache_notify() which sends a NFSD_CMD_CACHE_NOTIFY multicast event to the "exportd" group, carrying the cache type so userspace knows which cache has pending requests. Implement nfsd_nl_svc_export_get_reqs_dumpit() which snapshots pending svc_export cache requests and sends each entry's seqno, client name, and path over netlink. Implement nfsd_nl_svc_export_set_reqs_doit() which parses svc_export cache responses from userspace (client, path, expiry, flags, anon uid/gid, fslocations, uuid, secinfo, xprtsec, fsid, or negative flag) and updates the cache via svc_export_lookup() / svc_export_update(). Wire up the svc_export_notify() callback in svc_export_cache_template so cache misses trigger NFSD_CMD_CACHE_NOTIFY multicast events with NFSD_CACHE_TYPE_SVC_EXPORT. Note that the export-flags and xprtsec-mode enums are organized to match their counterparts in include/uapi/linux/nfsd/export.h. The intent is that future export options will only be added to the netlink headers, which should eliminate the need to keep so much in sync. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-01sunrpc: add netlink upcall for the auth.unix.gid cacheJeff Layton
Add netlink-based cache upcall support for the unix_gid (auth.unix.gid) cache, using the sunrpc generic netlink family. Add unix-gid attribute-set (seqno, uid, gids multi-attr, negative, expiry), unix-gid-reqs wrapper, and unix-gid-get-reqs / unix-gid-set-reqs operations to the sunrpc_cache YAML spec and generated headers. Implement sunrpc_nl_unix_gid_get_reqs_dumpit() which snapshots pending unix_gid cache requests and sends each entry's seqno and uid over netlink. Implement sunrpc_nl_unix_gid_set_reqs_doit() which parses unix_gid cache responses from userspace (uid, expiry, gids as u32 multi-attr or negative flag) and updates the cache via unix_gid_lookup() / sunrpc_cache_update(). Wire up unix_gid_notify() callback in unix_gid_cache_template so cache misses trigger SUNRPC_CMD_CACHE_NOTIFY multicast events with SUNRPC_CACHE_TYPE_UNIX_GID. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-01sunrpc: add netlink upcall for the auth.unix.ip cacheJeff Layton
Add netlink-based cache upcall support for the ip_map (auth.unix.ip) cache, using the sunrpc generic netlink family. Add ip-map attribute-set (seqno, class, addr, domain, negative, expiry), ip-map-reqs wrapper, and ip-map-get-reqs / ip-map-set-reqs operations to the sunrpc_cache YAML spec and generated headers. Implement sunrpc_nl_ip_map_get_reqs_dumpit() which snapshots pending ip_map cache requests and sends each entry's seqno, class name, and IP address over netlink. Implement sunrpc_nl_ip_map_set_reqs_doit() which parses ip_map cache responses from userspace (class, addr, expiry, and domain name or negative flag) and updates the cache via __ip_map_lookup() / __ip_map_update(). Wire up ip_map_notify() callback in ip_map_cache_template so cache misses trigger SUNRPC_CMD_CACHE_NOTIFY multicast events with SUNRPC_CACHE_TYPE_IP_MAP. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-01sunrpc: add a generic netlink family for cache upcallsJeff Layton
The auth.unix.ip and auth.unix.gid caches live in the sunrpc module, so they cannot use the nfsd generic netlink family. Create a new "sunrpc" generic netlink family with its own "exportd" multicast group to support cache upcall notifications for sunrpc-resident caches. Define a YAML spec (sunrpc_cache.yaml) with a cache-type enum (ip_map, unix_gid), a cache-notify multicast event, and the corresponding uapi header. Implement sunrpc_cache_notify() in cache.c, which checks for listeners on the exportd multicast group, builds and sends a SUNRPC_CMD_CACHE_NOTIFY message with the cache-type attribute. Register/unregister the sunrpc_nl_family in init_sunrpc() and cleanup_sunrpc(). Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-01sunrpc: add helpers to count and snapshot pending cache requestsJeff Layton
Add sunrpc_cache_requests_count() and sunrpc_cache_requests_snapshot() to allow callers to count and snapshot the pending upcall request list without exposing struct cache_request outside of cache.c. Both functions skip entries that no longer have CACHE_PENDING set. The snapshot function takes a cache_get() reference on each item so the caller can safely use them after the queue_lock is released. These will be used by the nfsd generic netlink dumpit handler for svc_export upcall requests. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-01sunrpc: add a cache_notify callbackJeff Layton
A later patch will be changing the kernel to send a netlink notification when there is a pending cache_request. Add a new cache_notify operation to struct cache_detail for this purpose. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-01sunrpc: rename cache_pipe_upcall() to cache_do_upcall()Jeff Layton
Rename cache_pipe_upcall() to cache_do_upcall() in anticipation of the addition of a netlink-based upcall mechanism. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-01sunrpc: rename sunrpc_cache_pipe_upcall_timeout()Jeff Layton
This function doesn't have anything to do with a timeout. The only difference is that it warns if there are no listeners. Rename it to sunrpc_cache_upcall_warn(). Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-01sunrpc: rename sunrpc_cache_pipe_upcall() to sunrpc_cache_upcall()Jeff Layton
Since it will soon also send an upcall via netlink, if configured. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-01sunrpc: skip svc_xprt_enqueue when transport is busyChuck Lever
svc_xprt_resource_released() calls svc_xprt_enqueue() whenever XPT_DATA or XPT_DEFERRED is set. During RPC processing, svc_reserve_auth() reduces the reservation counter and triggers this path while the current thread still holds XPT_BUSY. The enqueue enters svc_xprt_ready(), executes an smp_rmb(), READ_ONCE(), and tracepoint, then returns false on seeing XPT_BUSY. Trace data from a 256KB NFSv3 WRITE workload over TCP shows this pattern generates roughly 195,000 wasted enqueue calls -- approximately one per RPC -- each paying the full svc_xprt_ready() cost for no benefit. Add a BUSY check alongside the existing DATA|DEFERRED check in svc_xprt_resource_released(). When the transport is BUSY, the holder will call svc_xprt_received() upon completion, which already checks for pending work flags and re-enqueues. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-01sunrpc: skip svc_xprt_enqueue in svc_xprt_received when idleChuck Lever
svc_xprt_received() unconditionally calls svc_xprt_enqueue() after clearing XPT_BUSY. When no work flags are pending, the enqueue traverses svc_xprt_ready() -- executing an smp_rmb(), READ_ONCE(), and tracepoint -- before returning false. Trace data from a 256KB NFSv3 workload over RDMA shows 85% of svc_xprt_received() invocations reach svc_xprt_enqueue() with no pending work flags. In the WRITE phase, 167,335 of 196,420 calls find no work; in the READ phase, 97,165 of 98,276. Each unnecessary call executes a memory barrier, a flags read, and (when tracing is active) fires the svc_xprt_enqueue tracepoint. Add a flags pre-check between clear_bit(XPT_BUSY) and svc_xprt_enqueue(). Both the clear and the subsequent READ_ONCE operate on the same xpt_flags word, so cache-line serialization of the atomic bitops ensures the read observes any flag set by a concurrent producer before the line was acquired for the clear. If a producer's set_bit occurs after the clear_bit, that producer's own svc_xprt_enqueue() call observes !XPT_BUSY and dispatches the transport. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-01sunrpc: skip svc_xprt_enqueue when no work is pendingChuck Lever
svc_reserve() and svc_xprt_release_slot() call svc_xprt_enqueue() after modifying xpt_reserved or xpt_nr_rqsts. The purpose is to re-dispatch the transport when write-space or a slot becomes available. However, when neither XPT_DATA nor XPT_DEFERRED is set, no thread can make progress on the transport and the enqueue accomplishes nothing. Trace data from a 256KB NFSv3 WRITE workload over RDMA shows 11.2 svc_xprt_enqueue() calls per RPC. Of these, 6.9 per RPC lack XPT_DATA and exit svc_xprt_ready() immediately after executing the smp_rmb(), READ_ONCE(), and tracepoint. svc_reserve() and svc_xprt_release_slot() account for roughly five of these per RPC. A new helper, svc_xprt_resource_released(), checks XPT_DATA | XPT_DEFERRED before calling svc_xprt_enqueue(). The existing smp_wmb() barriers are upgraded to smp_mb() to ensure the flags check observes a concurrent producer's set_bit(XPT_DATA). Each producer (svc_rdma_wc_receive, etc.) both sets XPT_DATA and calls svc_xprt_enqueue(), so even if the check reads a stale value, the producer's own enqueue provides a fallback path. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-05-26Merge tag 'nfsd-7.1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: "Regressions: - Tighten bounds checking for sunrpc cache hash tables - Don't report key material in the ftrace log Stable fix: - Fix lockd's implementation of the NLM TEST procedure" * tag 'nfsd-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: lockd: fix TEST handling when not all permissions are available. NFSD: Report whether fh_key was actually updated sunrpc: prevent out-of-bounds read in __cache_seq_start()
2026-05-21sunrpc: prevent out-of-bounds read in __cache_seq_start()Chuck Lever
Commit 7b546bd89975 ("sunrpc/cache: improve RCU safety in cache_list walking.") changed the tail of __cache_seq_start() to unconditionally store *pos = ((long long)hash << 32) + 1 before returning, dropping a prior "hash >= cd->hash_size" guard. When the while loop exits because every remaining bucket was empty, hash equals cd->hash_size, so the stored *pos is one position past the table's last valid bucket. seq_read_iter() caches that index in m->index. A subsequent pread(2) at the same file offset skips traverse() and hands the stored index back to __cache_seq_start(), which decodes hash = cd->hash_size and dereferences cd->hash_table[cd->hash_size] -- one hlist_head past the end of the kzalloc'd table. KASAN reports an eight-byte slab-out-of-bounds read at the tail of the 2048-byte hash_table allocation for the NFSD export cache (EXPORT_HASHMAX * sizeof(struct hlist_head) == 256 * 8). Reject an input hash that is out of range before touching the hash table. cache_seq_next() already bounds-checks its own loop; the start routine needs to be symmetric. Reported-by: syzbot+60cfa08822470bbebe44@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=60cfa08822470bbebe44 Fixes: 7b546bd89975 ("sunrpc/cache: improve RCU safety in cache_list walking.") Reviewed-by: Benjamin Coddington <bcodding@hammerspace.com> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-05-15Merge tag 'nfsd-7.1-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: "Fixes for this release: - Correctness fix for the new sunrpc cache netlink protocol Marked for stable: - Correctness fixes for delegated attributes - Prevent an infinite loop when revoking layouts" * tag 'nfsd-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: Fix infinite loop in layout state revocation sunrpc: start cache request seqno at 1 to fix netlink GET_REQS nfsd: update mtime/ctime on COPY in presence of delegated attributes nfsd: update mtime/ctime on CLONE in presense of delegated attributes nfsd: fix file change detection in CB_GETATTR nfsd: fix GET_DIR_DELEGATION when VFS leases are disabled
2026-05-10sunrpc: start cache request seqno at 1 to fix netlink GET_REQSJeff Layton
sunrpc_cache_requests_snapshot() filters requests with crq->seqno <= min_seqno. The min_seqno for the first netlink dump call is cb->args[0] which is 0. Since next_seqno was initialized to 0, the very first cache request got seqno=0 and was silently skipped by the snapshot (0 <= 0 is true). This caused netlink-based GET_REQS to return 0 pending requests even when a request was queued, preventing mountd from resolving cache entries (particularly expkey/nfsd.fh). The unresolved CACHE_PENDING state blocked all further notifications for the entry, leading to permanent NFS4ERR_DELAY hangs. Start next_seqno at 1 so all requests have seqno >= 1 and pass the snapshot filter when min_seqno is 0. Fixes: facc4e3c8042 ("sunrpc: split cache_detail queue into request and reader lists") Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-04-24Merge tag 'nfs-for-7.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Bugfixes: - Fix handling of ENOSPC so that if we have to resend writes, they are written synchronously - SUNRPC RDMA transport fixes from Chuck - Several fixes for delegated timestamps in NFSv4.2 - Failure to obtain a directory delegation should not cause stat() to fail with NFSv4 - Rename was failing to update timestamps when a directory delegation is held on NFSv4 - Ensure we check rsize/wsize after crossing a NFSv4 filesystem boundary - NFSv4/pnfs: - If the server is down, retry the layout returns on reboot - Fallback to MDS could result in a short write being incorrectly logged Cleanups: - Use memcpy_and_pad in decode_fh" * tag 'nfs-for-7.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (21 commits) NFS: Fix RCU dereference of cl_xprt in nfs_compare_super_address NFS: remove redundant __private attribute from nfs_page_class NFSv4.2: fix CLONE/COPY attrs in presence of delegated attributes NFS: fix writeback in presence of errors nfs: use memcpy_and_pad in decode_fh NFSv4.1: Apply session size limits on clone path NFSv4: retry GETATTR if GET_DIR_DELEGATION failed NFS: fix RENAME attr in presence of directory delegations pnfs/flexfiles: validate ds_versions_cnt is non-zero NFS/blocklayout: print each device used for SCSI layouts xprtrdma: Post receive buffers after RPC completion xprtrdma: Scale receive batch size with credit window xprtrdma: Replace rpcrdma_mr_seg with xdr_buf cursor xprtrdma: Decouple frwr_wp_create from frwr_map xprtrdma: Close lost-wakeup race in xprt_rdma_alloc_slot xprtrdma: Avoid 250 ms delay on backlog wakeup xprtrdma: Close sendctx get/put race that can block a transport nfs: update inode ctime after removexattr operation nfs: fix utimensat() for atime with delegated timestamps NFS: improve "Server wrote zero bytes" error ...
2026-04-20Merge tag 'nfsd-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds
Pull nfsd updates from Chuck Lever: - filehandle signing to defend against filehandle-guessing attacks (Benjamin Coddington) The server now appends a SipHash-2-4 MAC to each filehandle when the new "sign_fh" export option is enabled. NFSD then verifies filehandles received from clients against the expected MAC; mismatches return NFS error STALE - convert the entire NLMv4 server-side XDR layer from hand-written C to xdrgen-generated code, spanning roughly thirty patches (Chuck Lever) XDR functions are generally boilerplate code and are easy to get wrong. The goals of this conversion are improved memory safety, lower maintenance burden, and groundwork for eventual Rust code generation for these functions. - improve pNFS block/SCSI layout robustness with two related changes (Dai Ngo) SCSI persistent reservation fencing is now tracked per client and per device via an xarray, to avoid both redundant preempt operations on devices already fenced and a potential NFSD deadlock when all nfsd threads are waiting for a layout return. - scalability and infrastructure improvements Sincere thanks to all contributors, reviewers, testers, and bug reporters who participated in the v7.1 NFSD development cycle. * tag 'nfsd-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (83 commits) NFSD: Docs: clean up pnfs server timeout docs nfsd: fix comment typo in nfsxdr nfsd: fix comment typo in nfs3xdr NFSD: convert callback RPC program to per-net namespace NFSD: use per-operation statidx for callback procedures svcrdma: Use contiguous pages for RDMA Read sink buffers SUNRPC: Add svc_rqst_page_release() helper SUNRPC: xdr.h: fix all kernel-doc warnings svcrdma: Factor out WR chain linking into helper svcrdma: Add Write chunk WRs to the RPC's Send WR chain svcrdma: Clean up use of rdma->sc_pd->device svcrdma: Clean up use of rdma->sc_pd->device in Receive paths svcrdma: Add fair queuing for Send Queue access SUNRPC: Optimize rq_respages allocation in svc_alloc_arg SUNRPC: Track consumed rq_pages entries svcrdma: preserve rq_next_page in svc_rdma_save_io_pages SUNRPC: Handle NULL entries in svc_rqst_release_pages SUNRPC: Allocate a separate Reply page array SUNRPC: Tighten bounds checking in svc_rqst_replace_page NFSD: Sign filehandles ...
2026-04-13xprtrdma: Post receive buffers after RPC completionChuck Lever
rpcrdma_post_recvs() runs in CQ poll context and its cost falls on the latency-critical path between polling a Receive completion and waking the RPC consumer. Every cycle spent refilling the Receive Queue delays delivery of the reply to the NFS layer. Move the rpcrdma_post_recvs() call in rpcrdma_reply_handler() to after the RPC has been decoded and completed. The larger batch size from the preceding patch provides sufficient Receive Queue headroom to absorb the brief delay before buffers are replenished. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2026-04-13xprtrdma: Scale receive batch size with credit windowChuck Lever
The fixed RPCRDMA_MAX_RECV_BATCH of 7 results in frequent small ib_post_recv batches during high-rate workloads. With a 128-slot credit window, receives are reposted every 7th completion, each batch incurring atomic serialization and a doorbell write. Replace the fixed batch constant with a per-endpoint value scaled to 25% of the negotiated credit window. For a typical 128-credit connection this raises the batch from 7 to 32, reducing doorbell frequency by roughly 4x and amortizing the per-batch atomic and MMIO costs over a larger group of receive WRs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2026-04-13xprtrdma: Replace rpcrdma_mr_seg with xdr_buf cursorChuck Lever
The FRWR registration path converts data through three representations: xdr_buf -> rpcrdma_mr_seg[] -> scatterlist[] -> ib_map_mr_sg(). The rpcrdma_mr_seg intermediate is a relic of when multiple registration strategies existed (FMR, physical, FRWR). Only FRWR remains, so this indirection and the 6240-byte rl_segments[260] array embedded in each rpcrdma_req serve no purpose. Introduce struct rpcrdma_xdr_cursor to track position within an xdr_buf during iterative MR registration. Rewrite frwr_map to populate scatterlist entries directly from the xdr_buf regions (head kvec, page list, tail kvec). The boundary logic for non-SG_GAPS devices is simpler because the xdr_buf structure guarantees that page-region entries after the first start at offset 0, and that head/tail kvecs are separate regions that naturally break at MR boundaries. Fix a pre-existing bug in rpcrdma_encode_write_list where the write-pad statistics accumulator added mr->mr_length from the last data MR rather than the write-pad MR. The refactored code uses ep->re_write_pad_mr->mr_length. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>