| Age | Commit message (Collapse) | Author |
|
Threads parked in svc_rdma_sq_wait() on sc_sq_ticket_wait or
sc_send_wait can hang indefinitely in TASK_UNINTERRUPTIBLE state
across transport teardown, pinning svc_xprt references and
blocking svc_rdma_free().
The close path sets XPT_CLOSE before invoking xpo_detach and both
wait_event predicates include an XPT_CLOSE term, but the
predicates are re-evaluated only on wakeup. sc_sq_ticket_wait has
no completion-driven wake path; it is advanced solely by the
chained ticket handoff inside svc_rdma_sq_wait() itself. Without
an explicit wake at close, parked threads never observe
XPT_CLOSE, hold their svc_xprt_get reference forever, and
svc_rdma_free() blocks on xpt_ref dropping to zero.
Two close entry points reach this transport. Local teardown runs
svc_rdma_detach() from svc_handle_xprt() -> svc_delete_xprt() ->
xpo_detach() on a worker thread. A remote disconnect arrives at
svc_rdma_cma_handler(), which calls svc_xprt_deferred_close():
that sets XPT_CLOSE and enqueues the transport but does not
access either RDMA waitqueue, so a worker already parked in
svc_rdma_sq_wait() never re-evaluates its predicate. With every
worker parked on this transport, no thread is available to run
the local teardown either, and the wake site there is
unreachable.
Introduce svc_rdma_xprt_deferred_close(), a thin svcrdma wrapper
that calls svc_xprt_deferred_close() and then wakes both
sc_sq_ticket_wait and sc_send_wait. Convert the svcrdma producers
that called svc_xprt_deferred_close() directly:
svc_rdma_cma_handler(), qp_event_handler(),
svc_rdma_post_send_err(), svc_rdma_wc_send(), the sendto drop
path, the rw completion error paths, and the recvfrom flush and
read-list error paths.
Wake both waitqueues from svc_rdma_detach() as well. The
synchronous svc_xprt_close() path (backchannel ENOTCONN, device
removal via svc_rdma_xprt_done) reaches detach without flowing
through svc_xprt_deferred_close() and therefore does not invoke
the new helper.
Fixes: ccc89b9d1ed2 ("svcrdma: Add fair queuing for Send Queue access")
Cc: stable@vger.kernel.org
Assisted-by: kres (claude-opus-4-7)
Signed-off-by: Chris Mason <clm@meta.com>
[ cel: add svc_rdma_xprt_deferred_close() to complete the fix ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
When wait_for_completion_interruptible_timeout() in
svc_tcp_handshake() returns 0 (timeout) or -ERESTARTSYS (signal) and
tls_handshake_cancel() then returns false, handshake_complete() has
won the cancellation race: it has set HANDSHAKE_F_REQ_COMPLETED and
is about to invoke svc_tcp_handshake_done(), but the callback's
side effects on xpt_flags and on svsk->sk_handshake_done have not
yet committed.
The current code reads xpt_flags immediately to decide whether the
session succeeded. Two races result.
If the callback has executed set_bit(XPT_TLS_SESSION) but not yet
clear_bit(XPT_HANDSHAKE), svc_tcp_handshake() sees a session,
enqueues the transport, and returns. svc_xprt_received() then
clears XPT_BUSY, a worker thread picks the transport up, the
dispatcher in svc_handle_xprt() observes XPT_HANDSHAKE still set,
and xpo_handshake is invoked a second time. That svc_tcp_handshake()
calls init_completion(&svsk->sk_handshake_done) while the original
callback concurrently calls complete_all() on it, corrupting the
embedded swait_queue.
If the callback has set HANDSHAKE_F_REQ_COMPLETED but not yet
entered svc_tcp_handshake_done(), svc_tcp_handshake() reads
XPT_TLS_SESSION as clear and tears the connection down even though
the handshake is about to succeed.
Wait for the callback to commit before inspecting xpt_flags. The
completion is guaranteed to fire because handshake_complete()
invokes svc_tcp_handshake_done() unconditionally once it has set
HANDSHAKE_F_REQ_COMPLETED.
Fixes: b3cbf98e2fdf ("SUNRPC: Support TLS handshake in the server-side TCP socket code")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
svc_tcp_handshake() stores the raw svc_xprt pointer in
tls_handshake_args.ta_data and submits the request through
tls_server_hello_x509(). The handshake core takes only
sock_hold(req->hr_sk); nothing references the embedding struct
svc_sock that svc_tcp_handshake_done() reaches via container_of().
Two close races leave the in-flight callback writing through a freed
svc_sock. svc_sock_free() calls tls_handshake_cancel() and discards
its return value: a false return means handshake_complete() has
already set HANDSHAKE_F_REQ_COMPLETED but hp_done() may not have
finished, yet svc_sock_free() proceeds to kfree(svsk). The
cancel-loser fall-through inside svc_tcp_handshake() itself produces
the same window: when wait_for_completion_interruptible_timeout()
returns <= 0 (timeout or signal) and tls_handshake_cancel() returns
false, the function does not drain, returns, and svc_handle_xprt()
calls svc_xprt_received(), which clears XPT_BUSY and can drop the
last reference. A concurrent close then runs svc_sock_free() while
svc_tcp_handshake_done() is still updating xpt_flags and walking
svsk->sk_handshake_done.
The corruption surfaces as set_bit/clear_bit RMW into the freed
xpt_flags slab slot and as complete_all() walking and writing the
freed wait_queue_head_t list embedded in sk_handshake_done -- a
slab-corruption primitive, not a benign read. The path is reachable
on any TLS-enabled NFS server whenever a connection close overlaps
the tlshd downcall delivery window; the interruptible wait means
signal delivery suffices, not just SVC_HANDSHAKE_TO expiry.
Take svc_xprt_get(xprt) immediately before tls_server_hello_x509()
so the in-flight callback owns its own reference. Release it on the
two edges where the callback is guaranteed not to fire -- submission
failure from tls_server_hello_x509() and a successful
tls_handshake_cancel() -- and at the tail of
svc_tcp_handshake_done() after complete_all().
Fixes: b3cbf98e2fdf ("SUNRPC: Support TLS handshake in the server-side TCP socket code")
Cc: stable@vger.kernel.org
Signed-off-by: Chris Mason <clm@meta.com>
Assisted-by: kres (claude-opus-4-7)
[cel: rewrote commit message to describe the actual change]
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The svc_release_rqst() function executes the callback inside
rqstp->rq_procinfo->pc_release. However, if a worker thread begins
processing a new request and encounters an early error path (e.g.,
unsupported protocol, short frame, or bad auth) before a valid
rq_procinfo is installed, a stale release hook can be re-triggered
against reused state from the previous RPC, resulting in a double-free
or use-after-free vulnerability.
Harden the lifecycle of rq_procinfo by:
1. Ensuring svc_release_rqst() always clears rq_procinfo after the
optional pc_release() call, regardless of whether the hook exists.
2. Explicitly clearing rq_procinfo at request entry in svc_process()
before any early decode or drop paths.
3. Ensuring svc_process_bc() does the same at backchannel entry.
This guarantees that error flows will not encounter a non-NULL stale
rq_procinfo pointer when there is nothing to release.
Fixes: d9adbb6e10bf ("sunrpc: delay pc_release callback until after the reply is sent")
Cc: stable@vger.kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Suggested-by: Chuck Lever <cel@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Luxiao Xu <rakukuip@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
xdr_buf_to_bvec() returns a slot count even when the caller's bvec
budget is exhausted partway through the xdr_buf. Callers feed that
count into iov_iter_bvec() and continue as if the conversion had
succeeded, silently sending or writing fewer bytes than the data
length declares. For an NFS WRITE the server reports the truncated
transfer to the client as full success.
The overflow represents an internal invariant violation: a higher
layer reserved a bvec budget too small for the xdr_buf it then
asked the encoder to convert. That is a server-side fault, not a
media I/O failure and not a malformed client argument.
Change xdr_buf_to_bvec() to return a signed int and have the
overflow label return -ESERVERFAULT. Update the three callers to
detect the negative return and fail the request: nfsd_vfs_write()
folds the error into host_err, which nfserrno() translates to
nfserr_serverfault for the WRITE reply; svc_udp_sendto() and
svc_tcp_sendmsg() propagate the error out of the send path.
Reported-by: Chris Mason <clm@meta.com>
Fixes: 2eb2b9358181 ("SUNRPC: Convert svc_tcp_sendmsg to use bio_vecs directly")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
xdr_buf_to_bvec() writes a bio_vec into the caller's array before
testing whether that slot is in range, and the head branch performs
the store with no check at all. When the caller's budget is exactly
used up, the next store lands one element past the end of the array.
The overflow label returns count - 1, which masks the surplus store
but cannot undo it.
rq_bvec, the array passed by nfsd_vfs_write(), is allocated to
exactly rq_maxpages entries with no slack. The OOB store can land in
adjacent slab memory; the bv_len and bv_offset fields written there
are derived from client-supplied RPC payload sizes.
Move the in-range check ahead of the store in the head, page-loop,
and tail branches. With the check at the top of each sequence, count
is incremented only after a successful store, so the overflow label
can return count directly.
Reported-by: Chris Mason <clm@meta.com>
Fixes: 2eb2b9358181 ("SUNRPC: Convert svc_tcp_sendmsg to use bio_vecs directly")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Jonathan Flynn reports that commit 18755b8c2f24 ("svcrdma: Use
contiguous pages for RDMA Read sink buffers") regresses NFS/RDMA
WRITE throughput from 73.9 GiB/s to 30.3 GiB/s on a 128-core
single-NUMA-node server driving dual 400Gb/s links with 640 nfsd
threads. Server CPU utilization rises from 8.5% to 76%, with
roughly three quarters of all cycles spent spinning on zone->lock.
The sink buffers are allocated as high-order page blocks, split
into single pages so each sub-page carries an independent refcount,
and later released one page at a time through folio batches. The
per-CPU page caches cannot satisfy an allocation stream whose alloc
order differs from its free order, so every sink buffer page makes
a round trip through the buddy allocator's free lists, serialized
on the zone lock of the single NUMA node. The rq_pages entries that
the split pages displace, bulk-allocated moments earlier by
svc_alloc_arg(), are freed without ever being used, doubling the
allocator traffic.
The regression cannot be addressed trivially. Revert the commit
now; a reworked approach can return in an upcoming merge window.
Reported-by: Jonathan Flynn <jonathan.flynn@hammerspace.com>
Reported-by: Mike Snitzer <snitzer@kernel.org>
Closes: https://lore.kernel.org/linux-nfs/aiHlPmeZq3WgMwoJ@kernel.org/
Closes: https://lore.kernel.org/linux-nfs/3cb119b4b2a8aada30c0c60286778a54@mail.gmail.com/
Fixes: 18755b8c2f24 ("svcrdma: Use contiguous pages for RDMA Read sink buffers")
Cc: stable@vger.kernel.org
Tested-by: Jonathan Flynn <jonathan.flynn@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Send completion currently queues a work item to an unbound
workqueue for each completed send context. Under load, the
Send Completion handlers contend for the shared workqueue
pool lock.
Replace the workqueue with a per-transport lock-free list
(llist). The Send completion handler appends the send_ctxt
to sc_send_release_list and does no further teardown. The
nfsd thread drains the list in xpo_release_ctxt between
RPCs, performing DMA unmapping, chunk I/O resource release,
and page release in a batch.
This eliminates both the workqueue pool lock and the DMA
unmap cost from the Send completion path. DMA unmapping can
be expensive when an IOMMU is present in strict mode, as
each unmap triggers a synchronous hardware IOTLB
invalidation. Moving it to the nfsd thread, where that
latency is harmless, avoids penalizing completion handler
throughput.
The nfsd threads absorb the release cost at a point where
the client is no longer waiting on a reply, and natural
batching amortizes the overhead when completions arrive
faster than RPCs complete.
A self-enqueue backstops drain on a quiescing transport.
When svc_rdma_send_ctxt_put() observes that its llist_add()
transitions sc_send_release_list from empty to non-empty,
it sets XPT_DATA and calls svc_xprt_enqueue() so that
svc_xprt_ready() schedules an nfsd thread. The thread
enters svc_rdma_recvfrom(), finds no pending receive,
clears XPT_DATA, and returns 0; svc_xprt_release() then
runs xpo_release_ctxt and drains the list. Under steady
load the foreground drain keeps the list non-empty between
adds and no enqueue fires; only the trailing edge of a
burst pays for a wakeup. Without this path, a Send
completion arriving after the last xpo_release_ctxt on an
idle connection would leave the send_ctxt's DMA mappings
and reply pages pinned until the next RPC, send-context
exhaustion, or transport close.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Each RDMA Send completion triggers a cascade of work items on the
svcrdma_wq unbound workqueue:
ib_cq_poll_work (on ib_comp_wq, per-CPU)
-> svc_rdma_send_ctxt_put -> queue_work [work item 1]
-> svc_rdma_write_info_free -> queue_work [work item 2]
Every transition through queue_work contends on the unbound
pool's spinlock. Profiling an 8KB NFSv3 read/write workload
over RDMA shows about 4% of total CPU cycles spent on this
lock, with the cascading re-queue of write_info release
contributing roughly 1%.
The initial queue_work in svc_rdma_send_ctxt_put is needed to
move release work off the CQ completion context (which runs on
a per-CPU bound workqueue). However, once executing on
svcrdma_wq, there is no need to re-queue for each write_info
structure. svc_rdma_reply_chunk_release already calls
svc_rdma_cc_release inline from the same svcrdma_wq context,
and svc_rdma_recv_ctxt_put does the same from nfsd thread
context.
Release write chunk resources inline in
svc_rdma_write_info_free, removing the intermediate
svc_rdma_write_info_free_async work item and the wi_work
field from struct svc_rdma_write_info.
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Tested-by: Jonathan Flynn <jonathan.flynn@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The migration to crypto/krb5 eliminated the per-enctype
function dispatch and direct crypto API usage, leaving
behind a number of orphaned definitions.
Remove the following from gss_krb5.h:
- GSS_KRB5_K5CLENGTH, used only by removed key derivation
- KG_TOK_MIC_MSG and KG_TOK_WRAP_MSG (Kerberos v1 token
types; v1 support was dropped earlier)
- KG2_TOK_INITIAL and KG2_TOK_RESPONSE (context
establishment token types; no remaining users)
- KG2_RESP_FLAG_ERROR and KG2_RESP_FLAG_DELEG_OK
- enum sgn_alg and enum seal_alg (v1 algorithm constants)
- All CKSUMTYPE_* definitions, now duplicated by
KRB5_CKSUMTYPE_* in <crypto/krb5.h>
- The KG_ error constants from gssapi_err_krb5.h, which
have no remaining users
- The ENCTYPE_* constant block, replaced by KRB5_ENCTYPE_*
from <crypto/krb5.h>
- KG_USAGE_SEAL/SIGN/SEQ (3DES usage constants)
- KEY_USAGE_SEED_CHECKSUM/ENCRYPTION/INTEGRITY, duplicated
by <crypto/krb5.h>
- #include <crypto/skcipher.h>, no longer needed
Remove the cksum[] field from struct krb5_ctx in
gss_krb5_internal.h; no code reads or writes it after the
key derivation removal.
Switch gss_krb5_enctypes[] in gss_krb5_mech.c to the
canonical KRB5_ENCTYPE_* names from <crypto/krb5.h>.
Remove stale #include directives:
- <crypto/skcipher.h> from gss_krb5_wrap.c
- <linux/random.h> and <linux/crypto.h> from
gss_krb5_seal.c
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
With all per-message crypto operations now routed through
crypto/krb5, rpcsec_gss_krb5 no longer calls individual
crypto algorithms directly. The CRYPTO_KRB5 symbol already
selects CRYPTO_SKCIPHER and CRYPTO_HASH (the latter
transitively via CRYPTO_HMAC).
Drop the top-level select CRYPTO_SKCIPHER and select
CRYPTO_HASH from RPCSEC_GSS_KRB5, as these are redundant
with CRYPTO_KRB5's own dependencies.
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The RPCSEC_GSS_KRB5_ENCTYPES_AES_SHA1,
RPCSEC_GSS_KRB5_ENCTYPES_CAMELLIA, and
RPCSEC_GSS_KRB5_ENCTYPES_AES_SHA2 Kconfig options
originally gated both algorithm availability and the
advertised enctype list. Now that per-message crypto
operations are routed through crypto/krb5, these options
control only which enctype numbers appear in the gssd
upcall string; the underlying algorithms are always
present.
Remove the per-enctype Kconfig options and replace the
ifdef-gated enctype table with a candidate list looked
up in the crypto/krb5 enctype table at module init
time. Each enctype is included in the advertised list
only if crypto_krb5_find_enctype() finds it in the
library's enctype table. When a new enctype is added
to crypto/krb5, adding its constant to the candidate
array is sufficient to begin advertising it.
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
With all per-message crypto operations routed through crypto/krb5,
a substantial body of code in rpcsec_gss_krb5 has no remaining
callers. The internal key derivation functions (krb5_derive_key_v2,
krb5_kdf_hmac_sha2, krb5_kdf_feedback_cmac) and the low-level
crypto primitives (krb5_encrypt, gss_krb5_checksum, krb5_cbc_cts_
encrypt/decrypt, krb5_etm_checksum) are unreachable because their
only call sites were the per-enctype function pointers removed in
previous patches. Delete gss_krb5_keys.c entirely and strip the
dead functions from gss_krb5_crypto.c.
The KUnit test suite in gss_krb5_test.c exercised exactly these
internal functions: RFC 3961 n-fold, RFC 3962 key derivation,
RFC 6803 Camellia key derivation, and RFC 8009 AES-SHA2 key
derivation, plus encryption self-tests that drove the now-removed
encrypt routines. The corresponding test coverage is provided by
the crypto/krb5 selftests in crypto/krb5/selftest.c. Remove the
test file, the RPCSEC_GSS_KRB5_KUNIT_TEST Kconfig symbol, the
.kunitconfig, and all VISIBLE_IF_KUNIT / EXPORT_SYMBOL_IF_KUNIT
annotations.
xdr_process_buf() walked xdr_buf segments through a per-segment
callback and existed solely for the crypto routines in
gss_krb5_crypto.c. With that file removed, xdr_process_buf()
has no remaining callers. Its successor, xdr_buf_to_sg(),
populates a scatterlist directly from an xdr_buf byte range
and was introduced earlier in this series.
With every consumer of struct gss_krb5_enctype removed, replace
its remaining uses with the equivalent fields from struct
krb5_enctype (key_len). Remove struct gss_krb5_enctype, the
supported_gss_krb5_enctypes[] table, gss_krb5_lookup_enctype(),
and the gk5e pointer from krb5_ctx.
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Previous patches switched all per-message crypto operations
(encrypt, decrypt, get_mic, verify_mic) from the internal
skcipher/ahash primitives to crypto/krb5 AEAD and shash
handles. The old crypto_sync_skcipher and crypto_ahash fields in
struct krb5_ctx are no longer referenced at runtime.
Remove the ten legacy handle fields from struct krb5_ctx
along with the key derivation and handle allocation code in
gss_krb5_import_ctx_v2() that populated them. Context import
now prepares only the four crypto/krb5 handles (two AEAD for
encryption, two shash for checksums). The corresponding cleanup
in gss_krb5_delete_sec_context() and the error path is likewise
reduced.
The krb5_derive_key() inline wrapper, gss_krb5_alloc_cipher_v2(),
and gss_krb5_alloc_hash_v2() become unused and are removed.
The per-enctype encrypt/decrypt functions (gss_krb5_aes_encrypt,
gss_krb5_aes_decrypt, krb5_etm_encrypt, krb5_etm_decrypt) that
were the sole remaining consumers of these fields are also removed;
their function-pointer call sites were already deleted in earlier
patches.
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
All enctypes now route through gss_krb5_aead_encrypt() and
gss_krb5_aead_decrypt(). The per-enctype .encrypt and .decrypt
function pointers served the same purpose as .get_mic and
.wrap before them: dispatching v1 versus v2 implementations.
With v1 support long removed and the Camellia decrypt path
migrated in a preceding patch, every table entry points to
the same pair of functions.
Call gss_krb5_aead_encrypt() and gss_krb5_aead_decrypt()
directly from gss_krb5_wrap_v2() and gss_krb5_unwrap_v2(),
and drop the function pointers from struct gss_krb5_enctype.
While here, propagate the GSS status code returned by
gss_krb5_aead_decrypt() instead of discarding it.
The old indirect call sites returned GSS_S_FAILURE
unconditionally, losing the distinction between an
integrity failure (GSS_S_BAD_SIG) and a structural
error (GSS_S_DEFECTIVE_TOKEN).
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Every enctype points .wrap and .unwrap at gss_krb5_wrap_v2()
and gss_krb5_unwrap_v2(). As with get_mic/verify_mic, the
indirection dates from when v1 enctypes had different wrap
implementations. Call the functions directly and remove the
pointers from struct gss_krb5_enctype.
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Every enctype in the table points .get_mic and .verify_mic at
the same pair of functions. The indirection served no purpose
after the v1 enctype support was removed. Call
gss_krb5_get_mic_v2() and gss_krb5_verify_mic_v2() directly
from the GSS mechanism dispatch and drop the function pointers
from struct gss_krb5_enctype.
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
gss_krb5_verify_mic_v2() currently recomputes a checksum using
gss_krb5_checksum() and then compares it against the received
checksum with memcmp(). Replace this with a call to
crypto_krb5_verify_mic(), which performs the hash, comparison,
and offset/length adjustment in a single operation through the
crypto/krb5 library.
The scatterlist layout required by RFC 4121 Section 4.2.4 is
constructed via gss_krb5_mic_build_sg(), the shared helper
introduced in the preceding commit. The received checksum
occupies the first scatterlist entry, pointing directly into
the token buffer.
The errno result from crypto_krb5_verify_mic() is mapped to a
GSS major status code via gss_krb5_errno_to_status(), which
returns GSS_S_BAD_SIG for -EBADMSG (checksum mismatch).
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
gss_krb5_get_mic_v2() currently computes the MIC checksum by
driving a crypto_ahash directly, calling gss_krb5_checksum()
with the message body and GSS token header. Replace this with
a call to crypto_krb5_get_mic(), which performs the same keyed
hash operation through the crypto/krb5 library.
RFC 4121 Section 4.2.4 specifies that the checksum covers the
message body followed by the token header. Because the
crypto/krb5 metadata parameter is hashed before the data, the
GSS header cannot be passed as metadata. Instead, the header
is appended to the scatterlist after the body data, producing
the correct hash input ordering without using the metadata
parameter.
The scatterlist layout is:
[checksum_output | message_body | gss_header]
The first scatterlist entry points directly into the
token buffer, so the checksum is written in place.
A shared helper, gss_krb5_mic_build_sg(), is introduced in
gss_krb5_crypto.c to construct this scatterlist layout. The
helper handles overflow allocation and scatterlist chaining
for large xdr_buf page arrays. It is reused by the verify_mic
counterpart in the following commit.
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The Camellia enctypes (RFC 6803) use the same MtE authenticated
encryption construction as AES-SHA1 (RFC 3962), implemented in
crypto/krb5 by the rfc3961_simplified profile. The encrypt path
already uses gss_krb5_aead_encrypt() for Camellia, but the decrypt
path was left on the old gss_krb5_aes_decrypt() code when the AES
enctypes were migrated.
Switch the Camellia .decrypt callback to gss_krb5_aead_decrypt() to
complete the AEAD migration for all enctypes. The conf_len and
cksum_len values in crypto/krb5's Camellia enctype descriptors match
the block size and checksum length that gss_krb5_aes_decrypt() was
using, so the headskip and tailskip returned to the unwrap layer are
unchanged.
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Replace the per-enctype .decrypt callbacks (gss_krb5_aes_decrypt
and krb5_etm_decrypt) with a single gss_krb5_aead_decrypt()
wrapper that delegates to crypto_krb5_decrypt().
The new wrapper builds a scatterlist covering the secured
region (confounder through checksum), passes it to the AEAD
decrypt operation, and derives the confounder and checksum
lengths from the data offset and length that
crypto_krb5_decrypt() reports. The caller's token header
verification and buffer adjustment logic is unchanged.
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Replace the per-enctype .encrypt callbacks (gss_krb5_aes_encrypt and
krb5_etm_encrypt) with a single gss_krb5_aead_encrypt() wrapper that
delegates to crypto_krb5_encrypt().
The xdr_buf setup -- GSS header insertion, confounder space
allocation, and token header copy -- remains unchanged. The
difference is that the CBC-CTS encryption and HMAC computation are
now a single AEAD operation through the crypto/krb5 library. Both
the MtE construction (RFC 3962) and the EtM construction (RFC 8009)
are handled transparently by the AEAD transform.
The plaintext page data must be copied from the page cache pages to
the scratch output pages before building the scatterlist, since the
AEAD operates in-place rather than using separate input and output
scatterlists.
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Allocate crypto_aead handles for encryption (one per direction)
and crypto_shash handles for checksumming (one per direction)
using the crypto/krb5 library's key preparation functions.
These four handles derive their subkeys from the session key
and the RFC 4121 usage numbers and are ready for use in
encrypt, decrypt, get_mic, and verify_mic operations.
The existing crypto_sync_skcipher and crypto_ahash handles
remain in place for now; subsequent patches switch the
per-message operations to the new handles and then remove
the old ones.
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The crypto/krb5 library returns standard negative errno values,
but the GSS mechanism layer reports results as GSS_S_* major
status codes. A translation is needed at each call site that
will be switched to the new library.
Rather than open-coding the mapping in every wrapper, provide a
single helper function.
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The crypto/krb5 library accepts data in scatterlist form, but
the GSS-API layer presents RPC payloads as struct xdr_buf.
Bridge that gap with a pair of helper functions:
xdr_buf_to_sg() - populate a caller-supplied scatterlist
array from a byte range
xdr_buf_to_sg_alloc() - populate a caller-supplied inline
scatterlist, chaining to a heap-
allocated overflow for large payloads
The inline array (typically stack-allocated at eight entries)
covers the common case of small RPCs with no heap allocation
on the encrypt/decrypt path. Only buffers spanning many pages
incur a kmalloc for the chained extension.
The segment-walking logic follows the same head, page array,
tail traversal as xdr_process_buf(), but populates a
scatterlist directly rather than invoking a per-segment
callback. sg_next() traversal makes the walker safe for
chained scatterlists. Once subsequent patches reroute all
per-message crypto operations through crypto/krb5,
xdr_process_buf() loses its last callers and is removed.
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Each krb5_ctx currently points to a gss_krb5_enctype, the
rpcsec_gss_krb5 module's own enctype descriptor. To begin
using the common crypto/krb5 library, store a pointer to the
corresponding struct krb5_enctype (from <crypto/krb5.h>) as
well.
The lookup is performed in gss_import_v2_context() immediately
after the existing gss_krb5_lookup_enctype() call. If
crypto_krb5_find_enctype() cannot find a matching enctype the
context import fails, ensuring the module never operates with
a partially-initialized krb5_ctx.
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The rpcsec_gss_krb5 module currently contains its own Kerberos 5
crypto implementation (key derivation, encryption, checksumming)
that duplicates functionality available in the common crypto/krb5
library. As a first step toward migrating to that library, add a
Kconfig select so that building rpcsec_gss_krb5 pulls in the
common Kerberos 5 crypto support.
The per-enctype Kconfig options (AES_SHA1, CAMELLIA, AES_SHA2)
remain: they continue to gate which encryption types are offered
by the GSS mechanism. The individual crypto algorithm selects
they carry become redundant once the migration is complete, since
CRYPTO_KRB5 already selects all needed ciphers and hashes.
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The new get-reqs dump operations added to sunrpc_cache.yaml and
nfsd.yaml place the "requests" nested attribute under dump.request.
A netlink dump carries an empty request; its payload travels back
in the reply. Because the spec names no reply attributes, the YNL
C code generator synthesizes a forward reference to a
<op>_rsp struct that is never defined, breaking any consumer of
these specs.
This first surfaced when Thorsten Leemhuis built tools/net/ynl
against -next:
nfsd-user.h:746: error: field 'obj' has incomplete type
struct nfsd_svc_export_get_reqs_rsp obj ...
nfsd-user.h:826: error: field 'obj' has incomplete type
struct nfsd_expkey_get_reqs_rsp obj ...
nfsd-user.c:1211: error: 'nfsd_svc_export_get_reqs_rsp_parse'
undeclared
sunrpc_cache.yaml has the same defect in ip-map-get-reqs and
unix-gid-get-reqs, but nfsd.yaml errors out first in the Makefile's
alphabetical build order and hides the sunrpc failures.
These bugs were introduced by incorrect merge conflict resolution.
Reported-by: Thorsten Leemhuis <linux@leemhuis.info>
Closes: https://lore.kernel.org/linux-nfs/f6a3ca6d-e5cb-4a5c-9af2-8d2b1ce33ef0@leemhuis.info/
Fixes: 1045ccf519ce30 ("sunrpc: add netlink upcall for the auth.unix.ip cache")
Tested-by: Thorsten Leemhuis <linux@leemhuis.info>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Add a new SUNRPC_CMD_CACHE_FLUSH generic netlink command that allows
userspace to flush the sunrpc auth caches (ip_map and unix_gid) without
writing to /proc/net/rpc/*/flush.
An optional SUNRPC_A_CACHE_FLUSH_MASK u32 attribute selects which caches
to flush (bit 1 = ip_map, bit 2 = unix_gid). If the attribute is
omitted, all sunrpc caches are flushed.
This is used by exportfs to replace its /proc-based cache_flush() with a
netlink equivalent, with /proc fallback for older kernels.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Add netlink-based cache upcall support for the svc_export (nfsd.export)
cache to Documentation/netlink/specs/nfsd.yaml and regenerate the
resulting files.
Implement nfsd_cache_notify() which sends a NFSD_CMD_CACHE_NOTIFY
multicast event to the "exportd" group, carrying the cache type so
userspace knows which cache has pending requests.
Implement nfsd_nl_svc_export_get_reqs_dumpit() which snapshots
pending svc_export cache requests and sends each entry's seqno,
client name, and path over netlink.
Implement nfsd_nl_svc_export_set_reqs_doit() which parses svc_export
cache responses from userspace (client, path, expiry, flags, anon
uid/gid, fslocations, uuid, secinfo, xprtsec, fsid, or negative
flag) and updates the cache via svc_export_lookup() /
svc_export_update().
Wire up the svc_export_notify() callback in svc_export_cache_template
so cache misses trigger NFSD_CMD_CACHE_NOTIFY multicast events with
NFSD_CACHE_TYPE_SVC_EXPORT.
Note that the export-flags and xprtsec-mode enums are organized to match
their counterparts in include/uapi/linux/nfsd/export.h. The intent is
that future export options will only be added to the netlink headers,
which should eliminate the need to keep so much in sync.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Add netlink-based cache upcall support for the unix_gid (auth.unix.gid)
cache, using the sunrpc generic netlink family.
Add unix-gid attribute-set (seqno, uid, gids multi-attr, negative,
expiry), unix-gid-reqs wrapper, and unix-gid-get-reqs /
unix-gid-set-reqs operations to the sunrpc_cache YAML spec and
generated headers.
Implement sunrpc_nl_unix_gid_get_reqs_dumpit() which snapshots pending
unix_gid cache requests and sends each entry's seqno and uid over
netlink.
Implement sunrpc_nl_unix_gid_set_reqs_doit() which parses unix_gid
cache responses from userspace (uid, expiry, gids as u32 multi-attr
or negative flag) and updates the cache via unix_gid_lookup() /
sunrpc_cache_update().
Wire up unix_gid_notify() callback in unix_gid_cache_template so
cache misses trigger SUNRPC_CMD_CACHE_NOTIFY multicast events with
SUNRPC_CACHE_TYPE_UNIX_GID.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Add netlink-based cache upcall support for the ip_map (auth.unix.ip)
cache, using the sunrpc generic netlink family.
Add ip-map attribute-set (seqno, class, addr, domain, negative, expiry),
ip-map-reqs wrapper, and ip-map-get-reqs / ip-map-set-reqs operations
to the sunrpc_cache YAML spec and generated headers.
Implement sunrpc_nl_ip_map_get_reqs_dumpit() which snapshots pending
ip_map cache requests and sends each entry's seqno, class name, and
IP address over netlink.
Implement sunrpc_nl_ip_map_set_reqs_doit() which parses ip_map cache
responses from userspace (class, addr, expiry, and domain name or
negative flag) and updates the cache via __ip_map_lookup() /
__ip_map_update().
Wire up ip_map_notify() callback in ip_map_cache_template so cache
misses trigger SUNRPC_CMD_CACHE_NOTIFY multicast events with
SUNRPC_CACHE_TYPE_IP_MAP.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The auth.unix.ip and auth.unix.gid caches live in the sunrpc module,
so they cannot use the nfsd generic netlink family. Create a new
"sunrpc" generic netlink family with its own "exportd" multicast
group to support cache upcall notifications for sunrpc-resident
caches.
Define a YAML spec (sunrpc_cache.yaml) with a cache-type enum
(ip_map, unix_gid), a cache-notify multicast event, and the
corresponding uapi header.
Implement sunrpc_cache_notify() in cache.c, which checks for
listeners on the exportd multicast group, builds and sends a
SUNRPC_CMD_CACHE_NOTIFY message with the cache-type attribute.
Register/unregister the sunrpc_nl_family in init_sunrpc() and
cleanup_sunrpc().
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Add sunrpc_cache_requests_count() and sunrpc_cache_requests_snapshot()
to allow callers to count and snapshot the pending upcall request list
without exposing struct cache_request outside of cache.c.
Both functions skip entries that no longer have CACHE_PENDING set.
The snapshot function takes a cache_get() reference on each item so the
caller can safely use them after the queue_lock is released.
These will be used by the nfsd generic netlink dumpit handler for
svc_export upcall requests.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
A later patch will be changing the kernel to send a netlink notification
when there is a pending cache_request. Add a new cache_notify operation
to struct cache_detail for this purpose.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Rename cache_pipe_upcall() to cache_do_upcall() in anticipation of the
addition of a netlink-based upcall mechanism.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
This function doesn't have anything to do with a timeout. The only
difference is that it warns if there are no listeners. Rename it to
sunrpc_cache_upcall_warn().
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Since it will soon also send an upcall via netlink, if configured.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
svc_xprt_resource_released() calls svc_xprt_enqueue()
whenever XPT_DATA or XPT_DEFERRED is set. During RPC
processing, svc_reserve_auth() reduces the reservation
counter and triggers this path while the current thread
still holds XPT_BUSY. The enqueue enters svc_xprt_ready(),
executes an smp_rmb(), READ_ONCE(), and tracepoint, then
returns false on seeing XPT_BUSY.
Trace data from a 256KB NFSv3 WRITE workload over TCP
shows this pattern generates roughly 195,000 wasted
enqueue calls -- approximately one per RPC -- each
paying the full svc_xprt_ready() cost for no benefit.
Add a BUSY check alongside the existing DATA|DEFERRED
check in svc_xprt_resource_released(). When the
transport is BUSY, the holder will call
svc_xprt_received() upon completion, which already
checks for pending work flags and re-enqueues.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
svc_xprt_received() unconditionally calls
svc_xprt_enqueue() after clearing XPT_BUSY. When no
work flags are pending, the enqueue traverses
svc_xprt_ready() -- executing an smp_rmb(), READ_ONCE(),
and tracepoint -- before returning false.
Trace data from a 256KB NFSv3 workload over RDMA shows
85% of svc_xprt_received() invocations reach
svc_xprt_enqueue() with no pending work flags. In the
WRITE phase, 167,335 of 196,420 calls find no work; in
the READ phase, 97,165 of 98,276. Each unnecessary call
executes a memory barrier, a flags read, and (when
tracing is active) fires the svc_xprt_enqueue
tracepoint.
Add a flags pre-check between clear_bit(XPT_BUSY) and
svc_xprt_enqueue(). Both the clear and the subsequent
READ_ONCE operate on the same xpt_flags word, so
cache-line serialization of the atomic bitops ensures
the read observes any flag set by a concurrent producer
before the line was acquired for the clear. If a
producer's set_bit occurs after the clear_bit, that
producer's own svc_xprt_enqueue() call observes
!XPT_BUSY and dispatches the transport.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
svc_reserve() and svc_xprt_release_slot() call
svc_xprt_enqueue() after modifying xpt_reserved or
xpt_nr_rqsts. The purpose is to re-dispatch the
transport when write-space or a slot becomes available.
However, when neither XPT_DATA nor XPT_DEFERRED is
set, no thread can make progress on the transport and
the enqueue accomplishes nothing.
Trace data from a 256KB NFSv3 WRITE workload over RDMA
shows 11.2 svc_xprt_enqueue() calls per RPC. Of these,
6.9 per RPC lack XPT_DATA and exit svc_xprt_ready()
immediately after executing the smp_rmb(), READ_ONCE(),
and tracepoint. svc_reserve() and svc_xprt_release_slot()
account for roughly five of these per RPC.
A new helper, svc_xprt_resource_released(), checks
XPT_DATA | XPT_DEFERRED before calling
svc_xprt_enqueue(). The existing smp_wmb() barriers
are upgraded to smp_mb() to ensure the flags check
observes a concurrent producer's set_bit(XPT_DATA).
Each producer (svc_rdma_wc_receive, etc.) both sets
XPT_DATA and calls svc_xprt_enqueue(), so even if the
check reads a stale value, the producer's own enqueue
provides a fallback path.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
"Regressions:
- Tighten bounds checking for sunrpc cache hash tables
- Don't report key material in the ftrace log
Stable fix:
- Fix lockd's implementation of the NLM TEST procedure"
* tag 'nfsd-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
lockd: fix TEST handling when not all permissions are available.
NFSD: Report whether fh_key was actually updated
sunrpc: prevent out-of-bounds read in __cache_seq_start()
|
|
Commit 7b546bd89975 ("sunrpc/cache: improve RCU safety in
cache_list walking.") changed the tail of __cache_seq_start()
to unconditionally store
*pos = ((long long)hash << 32) + 1
before returning, dropping a prior "hash >= cd->hash_size"
guard. When the while loop exits because every remaining
bucket was empty, hash equals cd->hash_size, so the stored
*pos is one position past the table's last valid bucket.
seq_read_iter() caches that index in m->index. A subsequent
pread(2) at the same file offset skips traverse() and hands
the stored index back to __cache_seq_start(), which decodes
hash = cd->hash_size and dereferences
cd->hash_table[cd->hash_size] -- one hlist_head past the end
of the kzalloc'd table.
KASAN reports an eight-byte slab-out-of-bounds read at the
tail of the 2048-byte hash_table allocation for the NFSD
export cache (EXPORT_HASHMAX * sizeof(struct hlist_head) ==
256 * 8).
Reject an input hash that is out of range before touching the
hash table. cache_seq_next() already bounds-checks its own
loop; the start routine needs to be symmetric.
Reported-by: syzbot+60cfa08822470bbebe44@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=60cfa08822470bbebe44
Fixes: 7b546bd89975 ("sunrpc/cache: improve RCU safety in cache_list walking.")
Reviewed-by: Benjamin Coddington <bcodding@hammerspace.com>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
"Fixes for this release:
- Correctness fix for the new sunrpc cache netlink protocol
Marked for stable:
- Correctness fixes for delegated attributes
- Prevent an infinite loop when revoking layouts"
* tag 'nfsd-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: Fix infinite loop in layout state revocation
sunrpc: start cache request seqno at 1 to fix netlink GET_REQS
nfsd: update mtime/ctime on COPY in presence of delegated attributes
nfsd: update mtime/ctime on CLONE in presense of delegated attributes
nfsd: fix file change detection in CB_GETATTR
nfsd: fix GET_DIR_DELEGATION when VFS leases are disabled
|
|
sunrpc_cache_requests_snapshot() filters requests with
crq->seqno <= min_seqno. The min_seqno for the first netlink
dump call is cb->args[0] which is 0. Since next_seqno was
initialized to 0, the very first cache request got seqno=0
and was silently skipped by the snapshot (0 <= 0 is true).
This caused netlink-based GET_REQS to return 0 pending requests
even when a request was queued, preventing mountd from resolving
cache entries (particularly expkey/nfsd.fh). The unresolved
CACHE_PENDING state blocked all further notifications for the
entry, leading to permanent NFS4ERR_DELAY hangs.
Start next_seqno at 1 so all requests have seqno >= 1 and pass
the snapshot filter when min_seqno is 0.
Fixes: facc4e3c8042 ("sunrpc: split cache_detail queue into request and reader lists")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Pull NFS client updates from Trond Myklebust:
"Bugfixes:
- Fix handling of ENOSPC so that if we have to resend writes, they
are written synchronously
- SUNRPC RDMA transport fixes from Chuck
- Several fixes for delegated timestamps in NFSv4.2
- Failure to obtain a directory delegation should not cause stat() to
fail with NFSv4
- Rename was failing to update timestamps when a directory delegation
is held on NFSv4
- Ensure we check rsize/wsize after crossing a NFSv4 filesystem
boundary
- NFSv4/pnfs:
- If the server is down, retry the layout returns on reboot
- Fallback to MDS could result in a short write being incorrectly
logged
Cleanups:
- Use memcpy_and_pad in decode_fh"
* tag 'nfs-for-7.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (21 commits)
NFS: Fix RCU dereference of cl_xprt in nfs_compare_super_address
NFS: remove redundant __private attribute from nfs_page_class
NFSv4.2: fix CLONE/COPY attrs in presence of delegated attributes
NFS: fix writeback in presence of errors
nfs: use memcpy_and_pad in decode_fh
NFSv4.1: Apply session size limits on clone path
NFSv4: retry GETATTR if GET_DIR_DELEGATION failed
NFS: fix RENAME attr in presence of directory delegations
pnfs/flexfiles: validate ds_versions_cnt is non-zero
NFS/blocklayout: print each device used for SCSI layouts
xprtrdma: Post receive buffers after RPC completion
xprtrdma: Scale receive batch size with credit window
xprtrdma: Replace rpcrdma_mr_seg with xdr_buf cursor
xprtrdma: Decouple frwr_wp_create from frwr_map
xprtrdma: Close lost-wakeup race in xprt_rdma_alloc_slot
xprtrdma: Avoid 250 ms delay on backlog wakeup
xprtrdma: Close sendctx get/put race that can block a transport
nfs: update inode ctime after removexattr operation
nfs: fix utimensat() for atime with delegated timestamps
NFS: improve "Server wrote zero bytes" error
...
|
|
Pull nfsd updates from Chuck Lever:
- filehandle signing to defend against filehandle-guessing attacks
(Benjamin Coddington)
The server now appends a SipHash-2-4 MAC to each filehandle when
the new "sign_fh" export option is enabled. NFSD then verifies
filehandles received from clients against the expected MAC;
mismatches return NFS error STALE
- convert the entire NLMv4 server-side XDR layer from hand-written C to
xdrgen-generated code, spanning roughly thirty patches (Chuck Lever)
XDR functions are generally boilerplate code and are easy to get
wrong. The goals of this conversion are improved memory safety, lower
maintenance burden, and groundwork for eventual Rust code generation
for these functions.
- improve pNFS block/SCSI layout robustness with two related changes
(Dai Ngo)
SCSI persistent reservation fencing is now tracked per client and
per device via an xarray, to avoid both redundant preempt operations
on devices already fenced and a potential NFSD deadlock when all nfsd
threads are waiting for a layout return.
- scalability and infrastructure improvements
Sincere thanks to all contributors, reviewers, testers, and bug
reporters who participated in the v7.1 NFSD development cycle.
* tag 'nfsd-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (83 commits)
NFSD: Docs: clean up pnfs server timeout docs
nfsd: fix comment typo in nfsxdr
nfsd: fix comment typo in nfs3xdr
NFSD: convert callback RPC program to per-net namespace
NFSD: use per-operation statidx for callback procedures
svcrdma: Use contiguous pages for RDMA Read sink buffers
SUNRPC: Add svc_rqst_page_release() helper
SUNRPC: xdr.h: fix all kernel-doc warnings
svcrdma: Factor out WR chain linking into helper
svcrdma: Add Write chunk WRs to the RPC's Send WR chain
svcrdma: Clean up use of rdma->sc_pd->device
svcrdma: Clean up use of rdma->sc_pd->device in Receive paths
svcrdma: Add fair queuing for Send Queue access
SUNRPC: Optimize rq_respages allocation in svc_alloc_arg
SUNRPC: Track consumed rq_pages entries
svcrdma: preserve rq_next_page in svc_rdma_save_io_pages
SUNRPC: Handle NULL entries in svc_rqst_release_pages
SUNRPC: Allocate a separate Reply page array
SUNRPC: Tighten bounds checking in svc_rqst_replace_page
NFSD: Sign filehandles
...
|
|
rpcrdma_post_recvs() runs in CQ poll context and its cost
falls on the latency-critical path between polling a Receive
completion and waking the RPC consumer. Every cycle spent
refilling the Receive Queue delays delivery of the reply to
the NFS layer.
Move the rpcrdma_post_recvs() call in rpcrdma_reply_handler()
to after the RPC has been decoded and completed. The larger
batch size from the preceding patch provides sufficient
Receive Queue headroom to absorb the brief delay before
buffers are replenished.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
The fixed RPCRDMA_MAX_RECV_BATCH of 7 results in frequent
small ib_post_recv batches during high-rate workloads. With
a 128-slot credit window, receives are reposted every 7th
completion, each batch incurring atomic serialization and a
doorbell write.
Replace the fixed batch constant with a per-endpoint value
scaled to 25% of the negotiated credit window. For a typical
128-credit connection this raises the batch from 7 to 32,
reducing doorbell frequency by roughly 4x and amortizing the
per-batch atomic and MMIO costs over a larger group of
receive WRs.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
The FRWR registration path converts data through three
representations: xdr_buf -> rpcrdma_mr_seg[] -> scatterlist[]
-> ib_map_mr_sg(). The rpcrdma_mr_seg intermediate is a relic
of when multiple registration strategies existed (FMR, physical,
FRWR). Only FRWR remains, so this indirection and the 6240-byte
rl_segments[260] array embedded in each rpcrdma_req serve no
purpose.
Introduce struct rpcrdma_xdr_cursor to track position within
an xdr_buf during iterative MR registration. Rewrite frwr_map to
populate scatterlist entries directly from the xdr_buf regions
(head kvec, page list, tail kvec). The boundary logic for
non-SG_GAPS devices is simpler because the xdr_buf structure
guarantees that page-region entries after the first start at
offset 0, and that head/tail kvecs are separate regions that
naturally break at MR boundaries.
Fix a pre-existing bug in rpcrdma_encode_write_list where the
write-pad statistics accumulator added mr->mr_length from the last
data MR rather than the write-pad MR. The refactored code uses
ep->re_write_pad_mr->mr_length.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|