bcachefs: Deadlock cycle detector

We've outgrown our own deadlock avoidance strategy. The btree iterator API provides an interface where the user doesn't need to concern themselves with lock ordering - different btree iterators can be traversed in any order. Without special care, this will lead to deadlocks. Our previous strategy was to define a lock ordering internally, and whenever we attempt to take a lock and trylock() fails, we'd check if the current btree transaction is holding any locks that cause a lock ordering violation. If so, we'd issue a transaction restart, and then bch2_trans_begin() would re-traverse all previously used iterators, but in the correct order. That approach had some issues, though. - Sometimes we'd issue transaction restarts unnecessarily, when no deadlock would have actually occured. Lock ordering restarts have become our primary cause of transaction restarts, on some workloads totally 20% of actual transaction commits. - To avoid deadlock or livelock, we'd often have to take intent locks when we only wanted a read lock: with the lock ordering approach, it is actually illegal to hold _any_ read lock while blocking on an intent lock, and this has been causing us unnecessary lock contention. - It was getting fragile - the various lock ordering rules are not trivial, and we'd been seeing occasional livelock issues related to this machinery. So, since bcachefs is already a relational database masquerading as a filesystem, we're stealing the next traditional database technique and switching to a cycle detector for avoiding deadlocks. When we block taking a btree lock, after adding ourself to the waitlist but before sleeping, we do a DFS of btree transactions waiting on other btree transactions, starting with the current transaction and walking our held locks, and transactions blocking on our held locks. If we find a cycle, we emit a transaction restart. Occasionally (e.g. the btree split path) we can not allow the lock() operation to fail, so if necessary we'll tell another transaction that it has to fail. Result: trans_restart_would_deadlock events are reduced by a factor of 10 to 100, and we'll be able to delete a whole bunch of grotty, fragile code. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
author: Kent Overstreet <kent.overstreet@gmail.com> 2022-08-22 13:23:47 -0400
committer: Kent Overstreet <kent.overstreet@linux.dev> 2023-10-22 17:09:41 -0400
commit: 33bd5d068603f9e81e0b73dbe50e9b88b2e56d0d (patch)
tree: 6fff6e218b381e0fa2cd4580da3a2e919d18ccd8 /fs/bcachefs/trace.h
parent: 62448afee714354a26db8a0f3c644f58628f0792 (diff)
download: lwn-33bd5d068603f9e81e0b73dbe50e9b88b2e56d0d.tar.gz
lwn-33bd5d068603f9e81e0b73dbe50e9b88b2e56d0d.zip
1 files changed, 6 insertions, 0 deletions
diff --git a/fs/bcachefs/trace.h b/fs/bcachefs/trace.h
index 62de89fcb74b..35c40678f4b5 100644
--- a/fs/bcachefs/trace.h
+++ b/fs/bcachefs/trace.h
@@ -1065,6 +1065,12 @@ TRACE_EVENT(trans_restart_would_deadlock,
 		  __entry->want_pos_snapshot)
 );
 
+DEFINE_EVENT(transaction_event,	trans_restart_would_deadlock_recursion_limit,
+	TP_PROTO(struct btree_trans *trans,
+		 unsigned long caller_ip),
+	TP_ARGS(trans, caller_ip)
+);
+
 TRACE_EVENT(trans_restart_would_deadlock_write,
 	TP_PROTO(struct btree_trans *trans),
 	TP_ARGS(trans),
author	Kent Overstreet <kent.overstreet@gmail.com>	2022-08-22 13:23:47 -0400
committer	Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:41 -0400
commit	33bd5d068603f9e81e0b73dbe50e9b88b2e56d0d (patch)
tree	6fff6e218b381e0fa2cd4580da3a2e919d18ccd8 /fs/bcachefs/trace.h
parent	62448afee714354a26db8a0f3c644f58628f0792 (diff)
download	lwn-33bd5d068603f9e81e0b73dbe50e9b88b2e56d0d.tar.gz lwn-33bd5d068603f9e81e0b73dbe50e9b88b2e56d0d.zip