author: Shaohua Li <shli@fb.com> 2017-03-27 10:51:41 -0700
committer: Jens Axboe <axboe@fb.com> 2017-03-28 08:02:20 -0600
commit: 9e234eeafbe17e85908584392f249f0b329b8e1b (patch)
tree: 9d822cd38526ecc8132ffd4f4a720bb53a8eef0f /block/bio.c
parent: 7394e31fa440ab7cd20cebd233580b360a7e9ecc (diff)
blk-throttle: add a simple idle detection

A cgroup gets assigned a low limit, but the cgroup could never dispatch enough IO to cross the low limit. In such a case, the queue state machine will remain in LIMIT_LOW state and all other cgroups will be throttled according to the low limit. This is unfair to the other cgroups. We should treat the cgroup as idle and upgrade the state machine to lower state.

We also have downgrade logic. If the state machine upgrades because a cgroup is idle (real idle), the state machine will downgrade soon, as the cgroup is below its low limit. This isn't what we want. A more complicated case is a cgroup that isn't idle while the queue is in LIMIT_LOW. When the queue gets upgraded to lower state, other cgroups could dispatch more IO and this cgroup can't dispatch enough IO, so the cgroup is below its low limit and looks idle (fake idle). In this case, the queue should downgrade soon. The key to deciding whether we should downgrade is to detect whether the cgroup is truly idle.

Unfortunately it's very hard to determine whether a cgroup is really idle. This patch uses the 'think time check' idea from CFQ for the purpose. Please note, the idea doesn't work for all workloads. For example, a workload with io depth 8 has disk utilization 100%, hence think time is 0, i.e., not idle. But the workload can run at higher bandwidth with io depth 16. Compared to io depth 16, the io depth 8 workload is idle. We use the idea to roughly determine whether a cgroup is idle.

We treat a cgroup as idle if its think time is above a threshold (by default 1ms for SSD and 100ms for HD). The idea is that think time above the threshold will start to harm performance. HD is much slower, so a longer think time is ok.

The patch (and the later patches) uses 'unsigned long' to track time. We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses precision, which should not be a big deal.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

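The think time check and the 'ns >> 10' conversion described above can be sketched in userspace C as follows. This is only an illustration under stated assumptions, not the code added to blk-throttle by this patch: the struct, the function names, and the 7/8 weighted average are hypothetical choices made for the example. Only the shift by 10 and the default thresholds (1ms for SSD, 100ms for HD) come from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Default thresholds from the commit message, in approximate microseconds. */
#define IDLE_THRESHOLD_SSD_US 1000UL   /* 1ms */
#define IDLE_THRESHOLD_HD_US  100000UL /* 100ms */

/*
 * Approximate ns -> us conversion: divides by 1024 instead of 1000.
 * Cheap (a single shift), but the result is about 2.3% too small.
 */
static unsigned long ns_to_approx_us(uint64_t ns)
{
	return (unsigned long)(ns >> 10);
}

/* Hypothetical per-cgroup state, for illustration only. */
struct think_time_tracker {
	unsigned long last_finish_us; /* when the previous IO completed */
	unsigned long avg_think_us;   /* smoothed gap between finish and next submit */
};

/* Record the completion time of an IO. */
static void tracker_on_io_finish(struct think_time_tracker *t, uint64_t now_ns)
{
	t->last_finish_us = ns_to_approx_us(now_ns);
}

/*
 * On a new submission, the think time is the gap since the last completion.
 * Fold it into a running average (assumed weighting: 7/8 old, 1/8 new).
 */
static void tracker_on_io_submit(struct think_time_tracker *t, uint64_t now_ns)
{
	unsigned long think_us = ns_to_approx_us(now_ns) - t->last_finish_us;

	t->avg_think_us = (t->avg_think_us * 7 + think_us) >> 3;
}

/* A cgroup is treated as idle when its think time is above the threshold. */
static bool tracker_is_idle(const struct think_time_tracker *t,
			    unsigned long threshold_us)
{
	return t->avg_think_us > threshold_us;
}
```

A caller would invoke tracker_on_io_finish() from the IO completion path and tracker_on_io_submit() when the next IO arrives, then compare against the SSD or HD threshold depending on the device. Note that 1,000,000 ns converts to 976 rather than 1000, which is the precision loss the commit message mentions.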
Diffstat (limited to 'block/bio.c')
-rw-r--r-- block/bio.c | 2
1 files changed, 2 insertions, 0 deletions
diff --git a/block/bio.c b/block/bio.c
index 6194a8cf2aab..f1857c0f0826 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -30,6 +30,7 @@
 #include <linux/cgroup.h>
 #include <trace/events/block.h>
+#include "blk.h"
 /*
  * Test patch to inline a certain number of bi_io_vec's inside the bio
@@ -1845,6 +1846,7 @@ again:
 		goto again;
 	}
+	blk_throtl_bio_endio(bio);
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio);
 }