linux-next.git/include/linux/sunrpc, branch master

SUNRPC: close backchannel before destroying callback service

2026-06-30T13:13:42+00:00

A backchannel receive can complete a request while the NFS callback service is being torn down. xprt_complete_bc_request() removes the request from bc_pa_list, drops bc_alloc_count, marks the request in use, and then asks xprt_enqueue_bc_request() to hand it to the callback service.

If teardown has already cleared xprt->bc_serv, xprt_enqueue_bc_request() currently returns without enqueueing or freeing the committed request. The xprt_get() taken on entry is leaked as well.

If the producer wins the race before bc_serv is cleared, it can also enqueue onto sv_cb_list after nfs_callback_down() has stopped the callback threads, leaving the request linked to a svc_serv that is about to be freed.

Close the producer side before callback threads are stopped. Add xprt_svc_shutdown_bc() to clear xprt->bc_serv under bc_pa_lock, and call it on callback shutdown and callback-start failure before stopping the service threads. Requests that lose the NULL transition in xprt_enqueue_bc_request() are released through the normal backchannel free path after balancing bc_slot_count. Finally, drain any remaining sv_cb_list requests after the callback threads have stopped and before svc_destroy() frees the service.

Fixes: 441244d4273a ("SUNRPC: cleanup common code in backchannel request")
Fixes: 9e9fdd0ad0fb ("NFSv4.1: protect destroying and nullifying bc_serv structure")
Cc: stable@vger.kernel.org
Signed-off-by: Chris Mason
Reviewed-by: Jeff Layton
Signed-off-by: Chuck Lever
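The teardown ordering in this commit can be modeled in userspace. What follows is a minimal single-threaded sketch, not the kernel patch: every demo_* name is hypothetical, the bc_pa_lock critical sections are reduced to comments, and bc_slot_count accounting is left out. It only shows that once the service pointer is cleared, a late producer releases its request instead of queueing it, and that whatever was queued earlier is drained before the service goes away.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for svc_serv; cb_queued models the sv_cb_list length. */
struct demo_serv {
	int cb_queued;
};

/* Hypothetical stand-in for rpc_xprt; released counts freed requests. */
struct demo_xprt {
	struct demo_serv *bc_serv;
	int released;
};

/* Models the shutdown helper: clear the pointer (under bc_pa_lock in the
 * kernel) and hand the old service back so the caller can drain it. */
static struct demo_serv *demo_shutdown_bc(struct demo_xprt *xprt)
{
	struct demo_serv *serv = xprt->bc_serv;

	xprt->bc_serv = NULL;
	return serv;
}

/* Models the producer: queue while the service is live, otherwise release
 * the request so that it is neither leaked nor left on a dying list. */
static bool demo_enqueue_bc_request(struct demo_xprt *xprt)
{
	if (xprt->bc_serv) {
		xprt->bc_serv->cb_queued++;
		return true;
	}
	xprt->released++;
	return false;
}

/* Models the final drain, run after the callback threads have stopped.
 * Returns how many queued requests were discarded. */
static int demo_drain_cb_list(struct demo_serv *serv)
{
	int drained = serv->cb_queued;

	serv->cb_queued = 0;
	return drained;
}
```

In this model a request enqueued before demo_shutdown_bc() is picked up by demo_drain_cb_list(), while one arriving afterwards increments released and never touches the service.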

svcrdma: Validate Read chunk positions at decode time

2026-06-30T13:13:42+00:00

Read chunk position and length validation is currently scattered across three consumer functions: svc_rdma_read_data_item(), svc_rdma_read_multiple_chunks(), and svc_rdma_read_call_chunk(). Each independently guards against the same class of unsigned arithmetic underflow from untrusted wire values. Any new consumer of the parsed Read chunk list must replicate these checks or risk re-introducing the defects fixed by earlier patches in this series.

Add pcl_check_read_chunk_positions() to consolidate position and length validation into a single post-decode pass, called from svc_rdma_xdr_decode_req() after all three chunk lists have been parsed and the inline body length is known. The pass verifies three properties:

 - Each Read chunk's inline-body offset (its unreduced-stream position minus the cumulative length of preceding Read chunks) falls within the inline body length, or within the Call chunk length for interleaved reads.

 - Adjacent Read chunk positions do not overlap: cumulative read bytes at each transition do not exceed the next position.

 - Each chunk length does not exceed the receive context's page budget.

Malformed frames are rejected before reaching any consumer. The existing consumer-side guards remain as defense in depth.

Acked-by: Jeff Layton
Signed-off-by: Chuck Lever
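The three properties can be expressed as a single pass over a position-sorted chunk array. The sketch below is an illustration under stated assumptions, not the code added by this commit: struct demo_read_chunk, demo_check_read_chunk_positions(), and the max_chunk_len parameter (standing in for the page budget, expressed in bytes) are all hypothetical, and the Call chunk case for interleaved reads is not modeled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened view of one parsed Read chunk. */
struct demo_read_chunk {
	uint32_t position;	/* offset in the unreduced XDR stream */
	uint32_t length;	/* total bytes the chunk carries */
};

/* Validate chunks that are already sorted by position.
 * inline_len is the inline body length; max_chunk_len is an assumed
 * byte-sized stand-in for the receive context's page budget. */
static bool demo_check_read_chunk_positions(const struct demo_read_chunk *chunks,
					    size_t count, uint32_t inline_len,
					    uint32_t max_chunk_len)
{
	uint64_t consumed = 0;	/* bytes carried by preceding Read chunks */
	size_t i;

	for (i = 0; i < count; i++) {
		const struct demo_read_chunk *c = &chunks[i];

		/* Length must fit the budget. */
		if (c->length > max_chunk_len)
			return false;
		/* Preceding chunk data must end at or before this position. */
		if (consumed > c->position)
			return false;
		/* The subtraction cannot underflow after the check above. */
		if (c->position - consumed > inline_len)
			return false;
		consumed += c->length;
	}
	return true;
}
```

Accumulating into a 64-bit counter keeps the running sum from wrapping, so the only subtraction in the loop is guarded by the comparison immediately before it.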

svcrdma: Fix pcl_for_each_segment for empty chunks

2026-06-30T13:13:42+00:00

When a parsed chunk list contains a chunk whose ch_segcount is zero, pcl_for_each_segment computes its inclusive upper bound as &chunk->ch_segments[ch_segcount - 1]. ch_segcount is u32, so the subtraction wraps to 0xFFFFFFFF and the bound lands far past the ch_segments flex array. The loop body then walks unrelated memory at sizeof(struct svc_rdma_segment) stride until it faults.

A zero-segcount chunk is reachable from the wire: xdr_check_write_chunk() only rejects segcount values greater than rc_maxpages, and pcl_alloc_write() links a freshly allocated chunk onto rc_write_pcl/rc_reply_pcl before its segment-fill loop runs, so a Write or Reply chunk advertising zero segments leaves ch_segcount == 0 on the list.

When the transport has negotiated Send-With-Invalidate, svc_rdma_get_inv_rkey() iterates all four PCLs with pcl_for_each_segment and dereferences segment->rs_handle on each iteration, turning the underflow into an out-of-bounds read and a general protection fault.

  xdr_check_write_list / xdr_check_reply_chunk
    pcl_alloc_write()
      chunk = pcl_alloc_chunk(...)        /* ch_segcount = 0 */
      list_add_tail(&chunk->ch_list, &pcl->cl_chunks)
      /* fill loop iterates zero times for wire segcount 0 */

  svc_rdma_get_inv_rkey()
    pcl_for_each_chunk(rc_write_pcl)
      pcl_for_each_segment(segment, chunk)
        pos <= &ch_segments[0u - 1u]      /* 0xFFFFFFFF */
        segment->rs_handle                /* OOB read -> GPF */

Fix by switching the macro to a half-open upper bound that uses ch_segcount directly. For ch_segcount == 0 the loop start equals the loop end and the body is skipped; for ch_segcount > 0 the iteration range is unchanged.

All six existing call sites in net/sunrpc/xprtrdma/svc_rdma_recvfrom.c and net/sunrpc/xprtrdma/svc_rdma_rw.c remain correct under the new bound, so no caller changes are needed.

Fixes: 78147ca8b4a9 ("svcrdma: Add a "parsed chunk list" data structure")
Cc: stable@vger.kernel.org
Assisted-by: kres (claude-opus-4-7)
Signed-off-by: Chris Mason
Acked-by: Jeff Layton
Signed-off-by: Chuck Lever
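The difference between the two bounds is easy to see in a small userspace model. This is a sketch of the half-open form only, assuming simplified stand-in types: struct demo_segment, struct demo_chunk (with a fixed-size array in place of the flex array), and demo_for_each_segment are hypothetical and are not the kernel definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct svc_rdma_segment. */
struct demo_segment {
	uint32_t rs_handle;
};

/* Hypothetical stand-in for struct svc_rdma_chunk. */
struct demo_chunk {
	uint32_t ch_segcount;
	struct demo_segment ch_segments[4];
};

/* Half-open bound: stop when pos reaches &ch_segments[ch_segcount].
 * With ch_segcount == 0 the start already equals the end, so the body
 * never runs and no "count - 1" subtraction can wrap. */
#define demo_for_each_segment(pos, chunk)				\
	for ((pos) = &(chunk)->ch_segments[0];				\
	     (pos) < &(chunk)->ch_segments[(chunk)->ch_segcount];	\
	     (pos)++)

/* Count how many times the loop body executes for a given chunk. */
static unsigned int demo_count_segments(struct demo_chunk *chunk)
{
	struct demo_segment *seg;
	unsigned int n = 0;

	demo_for_each_segment(seg, chunk)
		n++;
	return n;
}
```

For a chunk with ch_segcount of 3 the loop visits indexes 0 through 2, exactly as an inclusive bound would; for an empty chunk it visits nothing.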

svcrdma: wake sq waiters when the transport closes

2026-06-09T20:32:59+00:00

Threads parked in svc_rdma_sq_wait() on sc_sq_ticket_wait or sc_send_wait can hang indefinitely in TASK_UNINTERRUPTIBLE state across transport teardown, pinning svc_xprt references and blocking svc_rdma_free().

The close path sets XPT_CLOSE before invoking xpo_detach and both wait_event predicates include an XPT_CLOSE term, but the predicates are re-evaluated only on wakeup. sc_sq_ticket_wait has no completion-driven wake path; it is advanced solely by the chained ticket handoff inside svc_rdma_sq_wait() itself. Without an explicit wake at close, parked threads never observe XPT_CLOSE, hold their svc_xprt_get reference forever, and svc_rdma_free() blocks on xpt_ref dropping to zero.

Two close entry points reach this transport. Local teardown runs svc_rdma_detach() from svc_handle_xprt() -> svc_delete_xprt() -> xpo_detach() on a worker thread. A remote disconnect arrives at svc_rdma_cma_handler(), which calls svc_xprt_deferred_close(): that sets XPT_CLOSE and enqueues the transport but does not access either RDMA waitqueue, so a worker already parked in svc_rdma_sq_wait() never re-evaluates its predicate. With every worker parked on this transport, no thread is available to run the local teardown either, and the wake site there is unreachable.

Introduce svc_rdma_xprt_deferred_close(), a thin svcrdma wrapper that calls svc_xprt_deferred_close() and then wakes both sc_sq_ticket_wait and sc_send_wait. Convert the svcrdma producers that called svc_xprt_deferred_close() directly: svc_rdma_cma_handler(), qp_event_handler(), svc_rdma_post_send_err(), svc_rdma_wc_send(), the sendto drop path, the rw completion error paths, and the recvfrom flush and read-list error paths. Wake both waitqueues from svc_rdma_detach() as well.

The synchronous svc_xprt_close() path (backchannel ENOTCONN, device removal via svc_rdma_xprt_done) reaches detach without flowing through svc_xprt_deferred_close() and therefore does not invoke the new helper.

Fixes: ccc89b9d1ed2 ("svcrdma: Add fair queuing for Send Queue access")
Cc: stable@vger.kernel.org
Assisted-by: kres (claude-opus-4-7)
Signed-off-by: Chris Mason
[ cel: add svc_rdma_xprt_deferred_close() to complete the fix ]
Signed-off-by: Chuck Lever
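The core of the hang is that setting a flag does nothing for a sleeper whose predicate is only re-checked on wakeup. The following is a deliberately simplified, thread-free model of that behavior: all demo_* names are hypothetical, each waitqueue is reduced to a count of parked sleepers, and only the close term of the wait predicate is modeled.

```c
#include <stdbool.h>

/* Hypothetical model of a waitqueue: just a count of parked sleepers. */
struct demo_waitqueue {
	int parked;
};

/* Hypothetical model of the transport state relevant to this fix. */
struct demo_rdma_xprt {
	bool closed;				/* models XPT_CLOSE */
	struct demo_waitqueue sq_ticket_wait;	/* models sc_sq_ticket_wait */
	struct demo_waitqueue send_wait;	/* models sc_send_wait */
};

/* A wakeup makes every sleeper re-check its predicate. Only the close
 * term is modeled, so sleepers leave only once the transport is closed. */
static void demo_wake_up(const struct demo_rdma_xprt *xprt,
			 struct demo_waitqueue *wq)
{
	if (xprt->closed)
		wq->parked = 0;
}

/* Models the generic deferred close: sets the flag, wakes nobody. */
static void demo_xprt_deferred_close(struct demo_rdma_xprt *xprt)
{
	xprt->closed = true;
}

/* Models the wrapper: set the flag first, then wake both queues so the
 * sleepers observe it. */
static void demo_rdma_xprt_deferred_close(struct demo_rdma_xprt *xprt)
{
	demo_xprt_deferred_close(xprt);
	demo_wake_up(xprt, &xprt->sq_ticket_wait);
	demo_wake_up(xprt, &xprt->send_wait);
}
```

Calling demo_xprt_deferred_close() alone leaves both parked counts unchanged, which corresponds to the hang; the wrapper empties both queues because the flag is already visible when the sleepers re-check.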

SUNRPC: Return an error from xdr_buf_to_bvec() on overflow

2026-06-09T20:32:59+00:00

xdr_buf_to_bvec() returns a slot count even when the caller's bvec budget is exhausted partway through the xdr_buf. Callers feed that count into iov_iter_bvec() and continue as if the conversion had succeeded, silently sending or writing fewer bytes than the data length declares. For an NFS WRITE the server reports the truncated transfer to the client as full success.

The overflow represents an internal invariant violation: a higher layer reserved a bvec budget too small for the xdr_buf it then asked the encoder to convert. That is a server-side fault, not a media I/O failure and not a malformed client argument.

Change xdr_buf_to_bvec() to return a signed int and have the overflow label return -ESERVERFAULT. Update the three callers to detect the negative return and fail the request: nfsd_vfs_write() folds the error into host_err, which nfserrno() translates to nfserr_serverfault for the WRITE reply; svc_udp_sendto() and svc_tcp_sendmsg() propagate the error out of the send path.

Reported-by: Chris Mason
Fixes: 2eb2b9358181 ("SUNRPC: Convert svc_tcp_sendmsg to use bio_vecs directly")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton
Signed-off-by: Chuck Lever
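The change in return convention can be sketched with a toy converter. This is not the kernel function: demo_segs_to_slots() and its parameters are hypothetical, an array of segment lengths stands in for the xdr_buf, and DEMO_ESERVERFAULT is a local constant assumed here to mirror the kernel-internal ESERVERFAULT value, which userspace errno.h does not provide.

```c
#include <stddef.h>

/* Assumed local stand-in for the kernel-internal ESERVERFAULT code. */
#define DEMO_ESERVERFAULT 526

/* Copy each non-empty segment length into one output slot.
 * Returns the number of slots used, or a negative error when the
 * budget runs out before every segment has been placed. */
static int demo_segs_to_slots(const size_t *seg_lens, size_t nsegs,
			      size_t *slots, unsigned int budget)
{
	unsigned int used = 0;
	size_t i;

	for (i = 0; i < nsegs; i++) {
		if (seg_lens[i] == 0)
			continue;
		/* Fail instead of returning a short, plausible count. */
		if (used == budget)
			return -DEMO_ESERVERFAULT;
		slots[used++] = seg_lens[i];
	}
	return (int)used;
}
```

A caller in this model treats any negative return as a failed request rather than passing the value on as a slot count, which is the behavior the commit asks of nfsd_vfs_write(), svc_udp_sendto(), and svc_tcp_sendmsg().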

Documentation: Add the RPC language description of NLM version 3

2026-06-09T20:32:59+00:00

In order to generate source code to encode and decode NLMv3 protocol elements, include a copy of the RPC language description of NLMv3 for xdrgen to process. The language description is derived from the Open Group's XNFS specification:

  https://pubs.opengroup.org/onlinepubs/9629799/chap10.htm#tagcjh_11_03

The C code committed here was generated from the new nlm3.x file using tools/net/sunrpc/xdrgen/xdrgen.

The goals of replacing hand-written XDR functions with ones that are tool-generated are to improve memory safety and make XDR encoding and decoding less brittle to maintain.

Parts of the NFSv4 protocol are still being extended actively. Tool-generated XDR code reduces the time it takes to get a working implementation of new protocol elements.

The xdrgen utility derives both the type definitions and the encode/decode functions directly from protocol specifications, using names and symbols familiar to anyone who knows those specs. Unlike hand-written code that can inadvertently diverge from the specification, xdrgen guarantees that the generated code matches the specification exactly.

We would eventually like xdrgen to generate Rust code as well, making the conversion of the kernel's NFS stacks to use Rust just a little easier for us.

Reviewed-by: Jeff Layton
Signed-off-by: Chuck Lever

svcrdma: Defer send context release to xpo_release_ctxt

2026-06-09T20:32:59+00:00

Send completion currently queues a work item to an unbound workqueue for each completed send context. Under load, the Send Completion handlers contend for the shared workqueue pool lock.

Replace the workqueue with a per-transport lock-free list (llist). The Send completion handler appends the send_ctxt to sc_send_release_list and does no further teardown. The nfsd thread drains the list in xpo_release_ctxt between RPCs, performing DMA unmapping, chunk I/O resource release, and page release in a batch.

This eliminates both the workqueue pool lock and the DMA unmap cost from the Send completion path. DMA unmapping can be expensive when an IOMMU is present in strict mode, as each unmap triggers a synchronous hardware IOTLB invalidation. Moving it to the nfsd thread, where that latency is harmless, avoids penalizing completion handler throughput.

The nfsd threads absorb the release cost at a point where the client is no longer waiting on a reply, and natural batching amortizes the overhead when completions arrive faster than RPCs complete.

A self-enqueue backstops drain on a quiescing transport. When svc_rdma_send_ctxt_put() observes that its llist_add() transitions sc_send_release_list from empty to non-empty, it sets XPT_DATA and calls svc_xprt_enqueue() so that svc_xprt_ready() schedules an nfsd thread. The thread enters svc_rdma_recvfrom(), finds no pending receive, clears XPT_DATA, and returns 0; svc_xprt_release() then runs xpo_release_ctxt and drains the list. Under steady load the foreground drain keeps the list non-empty between adds and no enqueue fires; only the trailing edge of a burst pays for a wakeup. Without this path, a Send completion arriving after the last xpo_release_ctxt on an idle connection would leave the send_ctxt's DMA mappings and reply pages pinned until the next RPC, send-context exhaustion, or transport close.

Signed-off-by: Chuck Lever
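The empty-to-non-empty test that drives the self-enqueue can be sketched in userspace with C11 atomics. This is an approximation of the lock-free list pattern, not the kernel's llist implementation: struct demo_node, struct demo_llist, and the demo_* functions are hypothetical names, and the wakeup itself is reduced to the boolean that the add returns.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical list node embedded in each object awaiting release. */
struct demo_node {
	struct demo_node *next;
};

/* Hypothetical lock-free list head. */
struct demo_llist {
	_Atomic(struct demo_node *) first;
};

/* Push a node at the head. Returns true if the list was empty before
 * the push, which is the cue for the producer to schedule a drain. */
static bool demo_llist_add(struct demo_node *node, struct demo_llist *list)
{
	struct demo_node *first = atomic_load(&list->first);

	do {
		node->next = first;
	} while (!atomic_compare_exchange_weak(&list->first, &first, node));
	return first == NULL;
}

/* Detach every node in one atomic step and return the old head. */
static struct demo_node *demo_llist_del_all(struct demo_llist *list)
{
	return atomic_exchange(&list->first, (struct demo_node *)NULL);
}

/* Consumer side: take the whole batch and count what was released. */
static unsigned int demo_drain(struct demo_llist *list)
{
	struct demo_node *node = demo_llist_del_all(list);
	unsigned int n = 0;

	for (; node; node = node->next)
		n++;
	return n;
}
```

Only the first add after a drain reports true, so in this model a burst of completions costs at most one wakeup, and adds that land on an already non-empty list cost none.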

svcrdma: Release write chunk resources without re-queuing

2026-06-09T20:32:59+00:00

Each RDMA Send completion triggers a cascade of work items on the svcrdma_wq unbound workqueue:

  ib_cq_poll_work (on ib_comp_wq, per-CPU)
    -> svc_rdma_send_ctxt_put
      -> queue_work                   [work item 1]
        -> svc_rdma_write_info_free
          -> queue_work               [work item 2]

Every transition through queue_work contends on the unbound pool's spinlock. Profiling an 8KB NFSv3 read/write workload over RDMA shows about 4% of total CPU cycles spent on this lock, with the cascading re-queue of write_info release contributing roughly 1%.

The initial queue_work in svc_rdma_send_ctxt_put is needed to move release work off the CQ completion context (which runs on a per-CPU bound workqueue). However, once executing on svcrdma_wq, there is no need to re-queue for each write_info structure. svc_rdma_reply_chunk_release already calls svc_rdma_cc_release inline from the same svcrdma_wq context, and svc_rdma_recv_ctxt_put does the same from nfsd thread context.

Release write chunk resources inline in svc_rdma_write_info_free, removing the intermediate svc_rdma_write_info_free_async work item and the wi_work field from struct svc_rdma_write_info.

Reviewed-by: Mike Snitzer
Tested-by: Jonathan Flynn
Signed-off-by: Chuck Lever

SUNRPC: Remove dead rpcsec_gss_krb5 definitions

2026-06-09T20:32:59+00:00

The migration to crypto/krb5 eliminated the per-enctype function dispatch and direct crypto API usage, leaving behind a number of orphaned definitions.

Remove the following from gss_krb5.h:

 - GSS_KRB5_K5CLENGTH, used only by removed key derivation
 - KG_TOK_MIC_MSG and KG_TOK_WRAP_MSG (Kerberos v1 token types; v1 support was dropped earlier)
 - KG2_TOK_INITIAL and KG2_TOK_RESPONSE (context establishment token types; no remaining users)
 - KG2_RESP_FLAG_ERROR and KG2_RESP_FLAG_DELEG_OK
 - enum sgn_alg and enum seal_alg (v1 algorithm constants)
 - All CKSUMTYPE_* definitions, now duplicated by KRB5_CKSUMTYPE_* in
 - The KG_ error constants from gssapi_err_krb5.h, which have no remaining users
 - The ENCTYPE_* constant block, replaced by KRB5_ENCTYPE_* from
 - KG_USAGE_SEAL/SIGN/SEQ (3DES usage constants)
 - KEY_USAGE_SEED_CHECKSUM/ENCRYPTION/INTEGRITY, duplicated by
 - #include , no longer needed

Remove the cksum[] field from struct krb5_ctx in gss_krb5_internal.h; no code reads or writes it after the key derivation removal.

Switch gss_krb5_enctypes[] in gss_krb5_mech.c to the canonical KRB5_ENCTYPE_* names from .

Remove stale #include directives:

 - from gss_krb5_wrap.c
 - and from gss_krb5_seal.c

Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton
Acked-by: Anna Schumaker
Signed-off-by: Chuck Lever

SUNRPC: Remove dead code from rpcsec_gss_krb5

2026-06-09T20:32:59+00:00

With all per-message crypto operations routed through crypto/krb5, a substantial body of code in rpcsec_gss_krb5 has no remaining callers.

The internal key derivation functions (krb5_derive_key_v2, krb5_kdf_hmac_sha2, krb5_kdf_feedback_cmac) and the low-level crypto primitives (krb5_encrypt, gss_krb5_checksum, krb5_cbc_cts_encrypt/decrypt, krb5_etm_checksum) are unreachable because their only call sites were the per-enctype function pointers removed in previous patches. Delete gss_krb5_keys.c entirely and strip the dead functions from gss_krb5_crypto.c.

The KUnit test suite in gss_krb5_test.c exercised exactly these internal functions: RFC 3961 n-fold, RFC 3962 key derivation, RFC 6803 Camellia key derivation, and RFC 8009 AES-SHA2 key derivation, plus encryption self-tests that drove the now-removed encrypt routines. The corresponding test coverage is provided by the crypto/krb5 selftests in crypto/krb5/selftest.c. Remove the test file, the RPCSEC_GSS_KRB5_KUNIT_TEST Kconfig symbol, the .kunitconfig, and all VISIBLE_IF_KUNIT / EXPORT_SYMBOL_IF_KUNIT annotations.

xdr_process_buf() walked xdr_buf segments through a per-segment callback and existed solely for the crypto routines in gss_krb5_crypto.c. With that file removed, xdr_process_buf() has no remaining callers. Its successor, xdr_buf_to_sg(), populates a scatterlist directly from an xdr_buf byte range and was introduced earlier in this series.

With every consumer of struct gss_krb5_enctype removed, replace its remaining uses with the equivalent fields from struct krb5_enctype (key_len). Remove struct gss_krb5_enctype, the supported_gss_krb5_enctypes[] table, gss_krb5_lookup_enctype(), and the gk5e pointer from krb5_ctx.

Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton
Acked-by: Anna Schumaker
Signed-off-by: Chuck Lever