author | Sagi Grimberg <sagi@grimberg.me> | 2022-09-28 09:23:26 +0300 |
---|---|---|
committer | Christoph Hellwig <hch@lst.de> | 2022-10-12 11:35:46 +0200 |
commit | a1ae8d4d9be0178132df7c4931a1ba77d0e76039 (patch) | |
tree | 07c97ceece4624710d5be1e5cb454c1618c343fe | |
parent | 24a403340d70aad3667b3ee0f9a7aa5c0a5193a0 (diff) | |
nvme-rdma: fix possible hang caused during ctrl deletion
When we delete a controller, we execute the following:
1. nvme_stop_ctrl() - stop some work elements that may be
inflight or scheduled (specifically also .stop_ctrl
which cancels ctrl error recovery work)
2. nvme_remove_namespaces() - which first flushes scan_work
to avoid competing ns addition/removal
3. continue to teardown the controller
However, err_work may already be scheduled when (1) runs. It is designed
to cancel any inflight I/O, including I/O issued by ns scan_work in (2).
Because .stop_ctrl() cancels err_work before it runs, that I/O is never
cancelled, and (2) cannot make forward progress: ns scanning blocks on
it indefinitely.
The race is:
1. transport layer error observed -> err_work is scheduled
2. scan_work executes, discovers ns, generates I/O to it
3. nvme_stop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work)
- err_work never executed
4. nvme_remove_namespaces() -> flush_work(scan_work)
--> deadlock, because scan_work is blocked on I/O that was supposed
to be cancelled by err_work, but err_work itself was cancelled before
it could execute.
Fix this by flushing err_work instead of cancelling it, to force it
to execute and cancel all inflight I/O.
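
The difference between the two calls can be illustrated with a small
userspace model. This is only a sketch, not kernel code: every toy_*
name is hypothetical, the model is single-threaded, and it covers only
the case described above, where err_work is still pending (queued but
not yet started) when .stop_ctrl() runs. The real cancel_work_sync()
and flush_work() also wait for a handler that is already executing,
which the model does not represent.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded model. None of these names are kernel APIs. */
struct toy_ctrl;

struct toy_work {
	bool pending;                      /* queued but not yet run */
	void (*fn)(struct toy_ctrl *ctrl); /* work handler */
};

struct toy_ctrl {
	struct toy_work err_work;
	int inflight_io; /* requests issued by the scan, not yet completed */
};

/* Stand-in for error recovery: fails every in-flight request. */
static void toy_err_work_fn(struct toy_ctrl *ctrl)
{
	ctrl->inflight_io = 0;
}

/* Models cancel_work_sync() on a pending item: dropped, never run. */
static bool toy_cancel_work(struct toy_work *w)
{
	bool was_pending = w->pending;

	w->pending = false;
	return was_pending;
}

/* Models flush_work() on a pending item: the handler runs to completion. */
static bool toy_flush_work(struct toy_ctrl *ctrl, struct toy_work *w)
{
	if (!w->pending)
		return false;
	w->pending = false;
	w->fn(ctrl);
	return true;
}

/*
 * Replays steps 1-4 of the race. Returns true if the scan could finish
 * (flush_work(scan_work) would return), false if it would wait forever.
 */
bool toy_delete_ctrl(bool use_flush)
{
	struct toy_ctrl ctrl = {
		.err_work = { .pending = false, .fn = toy_err_work_fn },
		.inflight_io = 0,
	};

	ctrl.err_work.pending = true; /* 1. transport error, err_work queued */
	ctrl.inflight_io = 1;         /* 2. scan issues I/O and blocks on it */

	/* 3. .stop_ctrl(): old behaviour cancels, fixed behaviour flushes */
	if (use_flush)
		toy_flush_work(&ctrl, &ctrl.err_work);
	else
		toy_cancel_work(&ctrl.err_work);
	assert(!ctrl.err_work.pending);

	/* 4. the scan only completes once no I/O is outstanding */
	return ctrl.inflight_io == 0;
}
```

In this model, toy_delete_ctrl(false) leaves the request outstanding,
which corresponds to the hang, while toy_delete_ctrl(true) runs the
recovery handler first and lets the scan complete.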
Fixes: b435ecea2a4d ("nvme: Add .stop_ctrl to nvme ctrl ops")
Fixes: f6c8e432cb04 ("nvme: flush namespace scanning work just before removing namespaces")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
-rw-r--r-- | drivers/nvme/host/rdma.c | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 5ad0ab2853a4..6e079abb22ee 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -996,7 +996,7 @@ static void nvme_rdma_stop_ctrl(struct nvme_ctrl *nctrl)
 {
 	struct nvme_rdma_ctrl *ctrl = to_rdma_ctrl(nctrl);
 
-	cancel_work_sync(&ctrl->err_work);
+	flush_work(&ctrl->err_work);
 	cancel_delayed_work_sync(&ctrl->reconnect_work);
 }