<feed xmlns='http://www.w3.org/2005/Atom'>
<title>lwn.git/block/cfq-iosched.c, branch v3.10-rc1</title>
<subtitle>Linux kernel documentation tree maintained by Jonathan Corbet</subtitle>
<id>http://mirrors.hust.edu.cn/git/lwn.git/atom?h=v3.10-rc1</id>
<link rel='self' href='http://mirrors.hust.edu.cn/git/lwn.git/atom?h=v3.10-rc1'/>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/'/>
<updated>2013-03-23T21:15:29+00:00</updated>
<entry>
<title>block: Add bio_end_sector()</title>
<updated>2013-03-23T21:15:29+00:00</updated>
<author>
<name>Kent Overstreet</name>
<email>koverstreet@google.com</email>
</author>
<published>2012-09-25T22:05:12+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=f73a1c7d117d07a96d89475066188a2b79e53c48'/>
<id>urn:sha1:f73a1c7d117d07a96d89475066188a2b79e53c48</id>
<content type='text'>
Just a little convenience macro. The main reason to add it now is to
prepare for immutable bio vecs: it'll reduce the size of the patch that
puts bi_sector/bi_size/bi_idx into a struct bvec_iter.
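
As a rough illustration of what such a helper computes, here is a small
Python model (not the kernel macro itself).  It assumes 512-byte sectors
and that the end sector is the starting sector plus the bio's length in
sectors; the Bio class is a hypothetical stand-in whose fields merely
mirror bi_sector and bi_size.

```python
from dataclasses import dataclass

SECTOR_SIZE = 512  # assumption: 512-byte sectors


@dataclass
class Bio:
    """Hypothetical stand-in for the two struct bio fields used here."""
    bi_sector: int  # first sector covered by the bio
    bi_size: int    # remaining I/O size in bytes


def bio_end_sector(bio):
    """Return the first sector past the end of the bio."""
    return bio.bi_sector + bio.bi_size // SECTOR_SIZE


print(bio_end_sector(Bio(bi_sector=100, bi_size=4096)))
```

With a helper like this, callers no longer open-code the sum, so moving
the fields into another structure later touches fewer call sites.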

Signed-off-by: Kent Overstreet &lt;koverstreet@google.com&gt;
CC: Jens Axboe &lt;axboe@kernel.dk&gt;
CC: Lars Ellenberg &lt;drbd-dev@lists.linbit.com&gt;
CC: Jiri Kosina &lt;jkosina@suse.cz&gt;
CC: Alasdair Kergon &lt;agk@redhat.com&gt;
CC: dm-devel@redhat.com
CC: Neil Brown &lt;neilb@suse.de&gt;
CC: Martin Schwidefsky &lt;schwidefsky@de.ibm.com&gt;
CC: Heiko Carstens &lt;heiko.carstens@de.ibm.com&gt;
CC: linux-s390@vger.kernel.org
CC: Chris Mason &lt;chris.mason@fusionio.com&gt;
CC: Steven Whitehouse &lt;swhiteho@redhat.com&gt;
Acked-by: Steven Whitehouse &lt;swhiteho@redhat.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'for-3.9/core' of git://git.kernel.dk/linux-block</title>
<updated>2013-02-28T20:52:24+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2013-02-28T20:52:24+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=ee89f81252179dcbf6cd65bd48299f5e52292d88'/>
<id>urn:sha1:ee89f81252179dcbf6cd65bd48299f5e52292d88</id>
<content type='text'>
Pull block IO core bits from Jens Axboe:
 "Below are the core block IO bits for 3.9.  It was delayed a few days
  since my workstation kept crashing every 2-8h after pulling it into
  current -git, but turns out it is a bug in the new pstate code (divide
  by zero, will report separately).  In any case, it contains:

   - The big cfq/blkcg update from Tejun and Vivek.

   - Additional block and writeback tracepoints from Tejun.

   - Improvement of the should sort (based on queues) logic in the plug
     flushing.

   - _io() variants of the wait_for_completion() interface, using
     io_schedule() instead of schedule() to contribute to io wait
     properly.

   - Various little fixes.

  You'll get two trivial merge conflicts, which should be easy enough to
  fix up"

Fix up the trivial conflicts due to hlist traversal cleanups (commit
b67bfe0d42ca: "hlist: drop the node parameter from iterators").

* 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
  block: remove redundant check to bd_openers()
  block: use i_size_write() in bd_set_size()
  cfq: fix lock imbalance with failed allocations
  drivers/block/swim3.c: fix null pointer dereference
  block: don't select PERCPU_RWSEM
  block: account iowait time when waiting for completion of IO request
  sched: add wait_for_completion_io[_timeout]
  writeback: add more tracepoints
  block: add block_{touch|dirty}_buffer tracepoint
  buffer: make touch_buffer() an exported function
  block: add @req to bio_{front|back}_merge tracepoints
  block: add missing block_bio_complete() tracepoint
  block: Remove should_sort judgement when flush blk_plug
  block,elevator: use new hashtable implementation
  cfq-iosched: add hierarchical cfq_group statistics
  cfq-iosched: collect stats from dead cfqgs
  cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
  blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
  block: RCU free request_queue
  blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
  ...
</content>
</entry>
<entry>
<title>hlist: drop the node parameter from iterators</title>
<updated>2013-02-28T03:10:24+00:00</updated>
<author>
<name>Sasha Levin</name>
<email>sasha.levin@oracle.com</email>
</author>
<published>2013-02-28T01:06:00+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=b67bfe0d42cac56c512dd5da4b1b347a23f4b70a'/>
<id>urn:sha1:b67bfe0d42cac56c512dd5da4b1b347a23f4b70a</id>
<content type='text'>
I'm not sure why, but the hlist for each entry iterators were conceived
differently from the list ones.  The list ones take three parameters:

        list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

        hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
do they not really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

 - Fix up the actual hlist iterators in linux/list.h
 - Fix up the declaration of other iterators based on the hlist ones.
 - A very small number of places were using the 'node' parameter; these
 were modified to use 'obj-&gt;member' instead.
 - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
 properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
    &lt;+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
    ...+&gt;

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foundation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin &lt;peter.senna@gmail.com&gt;
Acked-by: Paul E. McKenney &lt;paulmck@linux.vnet.ibm.com&gt;
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Cc: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
Cc: Marcelo Tosatti &lt;mtosatti@redhat.com&gt;
Cc: Gleb Natapov &lt;gleb@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>cfq: fix lock imbalance with failed allocations</title>
<updated>2013-02-22T09:42:46+00:00</updated>
<author>
<name>Glauber Costa</name>
<email>glommer@parallels.com</email>
</author>
<published>2013-02-21T23:16:41+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=a3cc86c2f00839453d2dbeb46bfc44e885b073db'/>
<id>urn:sha1:a3cc86c2f00839453d2dbeb46bfc44e885b073db</id>
<content type='text'>
While stress-running very-small container scenarios with the Kernel Memory
Controller, I've run into a lockdep-detected lock imbalance in
cfq-iosched.c.

I'll apologize beforehand for not posting a backlog: I didn't anticipate
it would be so hard to reproduce, so I didn't save my serial output and
went straight to debugging.  It turns out that it did not happen again in
more than 20 runs, making it quite a rare pattern.

But here is my analysis:

When we are in very low-memory situations, we will arrive at
cfq_find_alloc_queue and may not find a queue, having to resort to the oom
queue, in an rcu-locked condition:

  if (!cfqq || cfqq == &amp;cfqd-&gt;oom_cfqq)
      [ ... ]

Next, we will release the rcu lock, and try to allocate a queue, retrying
if we succeed:

  rcu_read_unlock();
  spin_unlock_irq(cfqd-&gt;queue-&gt;queue_lock);
  new_cfqq = kmem_cache_alloc_node(cfq_pool,
                  gfp_mask | __GFP_ZERO,
                  cfqd-&gt;queue-&gt;node);
   spin_lock_irq(cfqd-&gt;queue-&gt;queue_lock);
   if (new_cfqq)
       goto retry;

We are unlocked at this point, but it should be fine, since we will
reacquire the rcu_read_lock when we retry.

Except, of course, that we may not retry: the allocation may very well
fail, and we'll keep on going through the flow.

The next branch is:

    if (cfqq) {
	[ ... ]
    } else
        cfqq = &amp;cfqd-&gt;oom_cfqq;

And right before exiting, we'll issue rcu_read_unlock().

Being already unlocked, this is the likely source of our imbalance.  Since
cfqq is either already NULL or made NULL in the first statement of the
outer branch, the only viable alternative here seems to be to return the
oom queue right away in case of allocation failure.
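
The flow described above can be modelled with a plain counter standing
in for the RCU read-side nesting depth.  This is a simplified Python
sketch, not the kernel code: the function name, its parameters and the
string queue names are illustrative assumptions, and the queue_lock
handling is left out.

```python
def find_alloc_queue(alloc_succeeds, return_oom_on_failure):
    """Return (queue, rcu_depth) for one pass through the modelled flow.

    A final rcu_depth of 0 means lock and unlock calls were balanced.
    """
    depth = 0         # modelled RCU read-side nesting
    new_queue = None  # queue obtained by the blocking allocation, if any
    while True:
        depth += 1    # rcu_read_lock() at the retry label
        if new_queue is None:
            depth -= 1  # rcu_read_unlock() before the blocking allocation
            new_queue = "new" if alloc_succeeds else None
            if new_queue is not None:
                continue  # goto retry: the lock is retaken at the top
            if return_oom_on_failure:
                return "oom", depth  # proposed fix: leave right away
            queue = "oom"  # old flow: carry on with the lock already dropped
        else:
            queue = new_queue
        depth -= 1    # rcu_read_unlock() right before exiting
        return queue, depth


print(find_alloc_queue(alloc_succeeds=False, return_oom_on_failure=False))
print(find_alloc_queue(alloc_succeeds=False, return_oom_on_failure=True))
```

In this model only the failed-allocation path without the early return
ends with a negative depth, which corresponds to the extra unlock.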

Please review the following patch and apply if you agree with my analysis.

Signed-off-by: Glauber Costa &lt;glommer@parallels.com&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>cfq-iosched: add hierarchical cfq_group statistics</title>
<updated>2013-01-09T16:05:13+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2013-01-09T16:05:13+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=43114018cb0b253fd03c4ff4d42bcdc43389ac1c'/>
<id>urn:sha1:43114018cb0b253fd03c4ff4d42bcdc43389ac1c</id>
<content type='text'>
Unfortunately, at this point, there's no way to make the existing
statistics hierarchical without creating nasty surprises for the
existing users.  Just create recursive counterparts of the existing
stats.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Vivek Goyal &lt;vgoyal@redhat.com&gt;
</content>
</entry>
<entry>
<title>cfq-iosched: collect stats from dead cfqgs</title>
<updated>2013-01-09T16:05:13+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2013-01-09T16:05:13+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=0b39920b5f9f3ad37dd259bfa2e9cbca33475b28'/>
<id>urn:sha1:0b39920b5f9f3ad37dd259bfa2e9cbca33475b28</id>
<content type='text'>
To support hierarchical stats, it's necessary to remember stats from
dead children.  Add cfqg-&gt;dead_stats and make a dying cfqg transfer
its stats to the parent's dead-stats.

The transfer happens from -&gt;pd_offline_fn() and it is possible that
there are some residual IOs completing afterwards.  Currently, we lose
these stats.  Given that cgroup removal isn't a very high frequency
operation and the number of residual IOs at offline is likely to be
nil or small, this shouldn't be a big deal and the complexity needed
to handle residual IOs - another callback and rather elaborate
synchronization to reach and lock the matching q - doesn't seem
justified.
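
The idea can be sketched in a few lines of Python.  This is a toy model
under stated assumptions, not the cfq code: each group carries a single
integer counter, the Group class and its method names are hypothetical,
and a dying group is assumed to hand over both its own stats and any
dead stats it had already inherited.

```python
class Group:
    """Toy cgroup node with one counter instead of a full stats struct."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.stats = 0       # IO accounted to this group itself
        self.dead_stats = 0  # IO inherited from children that went away
        if parent is not None:
            parent.children.append(self)

    def offline(self):
        """Transfer everything this group knows about to its parent."""
        if self.parent is not None:
            self.parent.dead_stats += self.stats + self.dead_stats
            self.parent.children.remove(self)
        self.stats = 0
        self.dead_stats = 0

    def recursive_total(self):
        """Own stats plus dead stats plus all live descendants."""
        live = sum(child.recursive_total() for child in self.children)
        return self.stats + self.dead_stats + live


root = Group()
a = Group(root)
aa = Group(a)
root.stats, a.stats, aa.stats = 1, 5, 7
before = root.recursive_total()
aa.offline()
a.offline()
print(before, root.recursive_total())
```

The recursive total at the root is the same before and after the
removals.  Any IO completing after offline() would simply not be counted
in this model, which matches the loss accepted above.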

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Vivek Goyal &lt;vgoyal@redhat.com&gt;
</content>
</entry>
<entry>
<title>cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()</title>
<updated>2013-01-09T16:05:13+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2013-01-09T16:05:13+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=689665af4489f779bc82e7869509c9ac11b5a903'/>
<id>urn:sha1:689665af4489f779bc82e7869509c9ac11b5a903</id>
<content type='text'>
Separate out cfqg_stats_reset() which takes struct cfqg_stats * from
cfq_pd_reset_stats() and move the latter to where other pd methods are
defined.  cfqg_stats_reset() will be used to implement hierarchical
stats.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Vivek Goyal &lt;vgoyal@redhat.com&gt;
</content>
</entry>
<entry>
<title>blkcg: s/blkg_rwstat_sum()/blkg_rwstat_total()/</title>
<updated>2013-01-09T16:05:12+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2013-01-09T16:05:12+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=4d5e80a76074786a49879ff482a83e72ad634606'/>
<id>urn:sha1:4d5e80a76074786a49879ff482a83e72ad634606</id>
<content type='text'>
Rename blkg_rwstat_sum() to blkg_rwstat_total().  sum will be used for
summing up stats from multiple blkgs.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Vivek Goyal &lt;vgoyal@redhat.com&gt;
</content>
</entry>
<entry>
<title>cfq-iosched: enable full blkcg hierarchy support</title>
<updated>2013-01-09T16:05:11+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2013-01-09T16:05:11+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=d02f7aa8dce8166dbbc515ce393912aa45e6b8a6'/>
<id>urn:sha1:d02f7aa8dce8166dbbc515ce393912aa45e6b8a6</id>
<content type='text'>
With the previous two patches, all cfqg scheduling decisions are based
on vfraction and ready for hierarchy support.  The only thing which
keeps the behavior flat is cfqg_flat_parent() which makes vfraction
calculation consider all non-root cfqgs to be children of the root cfqg.

Replace it with cfqg_parent() which returns the real parent.  This
enables full blkcg hierarchy support for cfq-iosched.  For example,
consider the following hierarchy.

        root
      /      \
   A:500      B:250
  /     \
 AA:500  AB:1000

For simplicity, let's say all the leaf nodes have active tasks and are
on service tree.  For each leaf node, vfraction would be

 AA: (500  / 1500) * (500 / 750) =~ 0.2222
 AB: (1000 / 1500) * (500 / 750) =~ 0.4444
  B:                 (250 / 750) =~ 0.3333

and vdisktime will be distributed accordingly.  For more detail,
please refer to Documentation/block/cfq-iosched.txt.
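
The arithmetic above can be reproduced with a short Python sketch.  It
is only a model of the example: the GROUPS table and the vfraction()
helper are illustrative names, every listed group is assumed to be
active, and exact fractions are used where the scheduler itself would
use fixed-point integers.

```python
from fractions import Fraction

# Example hierarchy from the text: name -> (parent, weight).
GROUPS = {
    "A": ("root", 500),
    "B": ("root", 250),
    "AA": ("A", 500),
    "AB": ("A", 1000),
}


def vfraction(name, groups=GROUPS):
    """Multiply weight / total sibling weight at each level up to the root."""
    frac = Fraction(1)
    while name != "root":
        parent, weight = groups[name]
        sibling_total = sum(w for p, w in groups.values() if p == parent)
        frac *= Fraction(weight, sibling_total)
        name = parent
    return frac


for leaf in ("AA", "AB", "B"):
    print(leaf, round(float(vfraction(leaf)), 4))
```

The three leaf fractions come out as 2/9, 4/9 and 1/3, which sum to 1.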

v2: cfq-iosched.txt updated to describe group scheduling as suggested
    by Vivek.

v3: blkio-controller.txt updated.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Vivek Goyal &lt;vgoyal@redhat.com&gt;
</content>
</entry>
<entry>
<title>cfq-iosched: convert cfq_group_slice() to use cfqg-&gt;vfraction</title>
<updated>2013-01-09T16:05:11+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2013-01-09T16:05:11+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=41cad6ab2cb9ccb3b11546ad56b8b285e47c6279'/>
<id>urn:sha1:41cad6ab2cb9ccb3b11546ad56b8b285e47c6279</id>
<content type='text'>
cfq_group_slice() calculates slice by taking a fraction of
cfq_target_latency according to the ratio of cfqg-&gt;weight against
service_tree-&gt;total_weight.  This currently works only because all
cfqgs are treated as being at the same level.

To prepare for proper hierarchy support, convert cfq_group_slice() to
base the calculation on cfqg-&gt;vfraction.  As cfqg-&gt;vfraction is always
a fraction of 1 and represents the fraction allocated to the cfqg with
hierarchy considered, the slice can be simply calculated by
multiplying cfq_target_latency by cfqg-&gt;vfraction (with fixed point
shift factored in).
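
A minimal Python sketch of that fixed-point multiplication follows.  The
shift of 12 bits, the helper names and the target latency of 300 are
assumptions chosen for illustration; only the shape of the calculation
is taken from the description above.

```python
FRAC_SHIFT = 12             # assumed number of fractional bits
FRAC_ONE = 2 ** FRAC_SHIFT  # fixed-point representation of 1.0


def to_fixed(numerator, denominator):
    """Convert a ratio in [0, 1] to fixed point, rounding down."""
    return numerator * FRAC_ONE // denominator


def group_slice(target_latency, vfraction_fixed):
    """Scale the target latency by a fixed-point fraction of 1."""
    return target_latency * vfraction_fixed // FRAC_ONE


one_third = to_fixed(1, 3)
print(one_third, group_slice(300, one_third))
```

Because both conversions round down, a group holding a third of the
device gets a slice of 99 rather than 100 in this model.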

As vfraction calculation currently treats all non-root cfqgs as
children of the root cfqg, this patch doesn't introduce any noticeable
difference in behavior.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Vivek Goyal &lt;vgoyal@redhat.com&gt;
</content>
</entry>
</feed>
