summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorSagi Grimberg <sagi@grimberg.me>2022-09-28 09:23:26 +0300
committerChristoph Hellwig <hch@lst.de>2022-10-12 11:35:46 +0200
commita1ae8d4d9be0178132df7c4931a1ba77d0e76039 (patch)
tree07c97ceece4624710d5be1e5cb454c1618c343fe
parent24a403340d70aad3667b3ee0f9a7aa5c0a5193a0 (diff)
downloadlwn-a1ae8d4d9be0178132df7c4931a1ba77d0e76039.tar.gz
lwn-a1ae8d4d9be0178132df7c4931a1ba77d0e76039.zip
nvme-rdma: fix possible hang caused during ctrl deletion
When we delete a controller, we execute the following: 1. nvme_stop_ctrl() - stop some work elements that may be inflight or scheduled (specifically also .stop_ctrl which cancels ctrl error recovery work) 2. nvme_remove_namespaces() - which first flushes scan_work to avoid competing ns addition/removal 3. continue to teardown the controller However, if err_work was scheduled to run in (1), it is designed to cancel any inflight I/O, particularly I/O that is originating from ns scan_work in (2), but because it is cancelled in .stop_ctrl(), we can prevent forward progress of (2) as ns scanning is blocking on I/O (that will never be cancelled). The race is: 1. transport layer error observed -> err_work is scheduled 2. scan_work executes, discovers ns, generate I/O to it 3. nvme_ctop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work) - err_work never executed 4. nvme_remove_namespaces() -> flush_work(scan_work) --> deadlock, because scan_work is blocked on I/O that was supposed to be cancelled by err_work, but was cancelled before executing. Fix this by flushing err_work instead of cancelling it, to force it to execute and cancel all inflight I/O. Fixes: b435ecea2a4d ("nvme: Add .stop_ctrl to nvme ctrl ops") Fixes: f6c8e432cb04 ("nvme: flush namespace scanning work just before removing namespaces") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-rw-r--r--drivers/nvme/host/rdma.c2
1 files changed, 1 insertions, 1 deletions
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 5ad0ab2853a4..6e079abb22ee 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -996,7 +996,7 @@ static void nvme_rdma_stop_ctrl(struct nvme_ctrl *nctrl)
{
struct nvme_rdma_ctrl *ctrl = to_rdma_ctrl(nctrl);
- cancel_work_sync(&ctrl->err_work);
+ flush_work(&ctrl->err_work);
cancel_delayed_work_sync(&ctrl->reconnect_work);
}