<feed xmlns='http://www.w3.org/2005/Atom'>
<title>lwn.git/fs/btrfs/delayed-ref.h, branch docs-5.13</title>
<subtitle>Linux kernel documentation tree maintained by Jonathan Corbet</subtitle>
<id>http://mirrors.hust.edu.cn/git/lwn.git/atom?h=docs-5.13</id>
<link rel='self' href='http://mirrors.hust.edu.cn/git/lwn.git/atom?h=docs-5.13'/>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/'/>
<updated>2021-02-08T21:58:56+00:00</updated>
<entry>
<title>btrfs: only let one thread pre-flush delayed refs in commit</title>
<updated>2021-02-08T21:58:56+00:00</updated>
<author>
<name>Josef Bacik</name>
<email>josef@toxicpanda.com</email>
</author>
<published>2020-12-18T19:24:20+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=e19eb11f4f3d3b0463cd897016064a79cb6d8c6d'/>
<id>urn:sha1:e19eb11f4f3d3b0463cd897016064a79cb6d8c6d</id>
<content type='text'>
I've been running a stress test that runs 20 workers in their own
subvolume, which are running an fsstress instance with 4 threads per
worker, which is 80 total fsstress threads.  In addition to this I'm
running balance in the background as well as creating and deleting
snapshots.  This test takes around 12 hours to run normally, going
slower and slower as the test goes on.

The reason for this is because fsstress is running fsync sometimes, and
because we're messing with block groups we often fall through to
btrfs_commit_transaction, so will often have 20-30 threads all calling
btrfs_commit_transaction at the same time.

These all get stuck contending on the extent tree while they try to run
delayed refs during the initial part of the commit.

This is suboptimal, really because the extent tree is a single point of
failure we only want one thread acting on that tree at once to reduce
lock contention.

Fix this by making the flushing mechanism a bit operation, to make it
easy to use test_and_set_bit() in order to make sure only one task does
this initial flush.

Once we're into the transaction commit we only have one thread doing
delayed ref running, it's just this initial pre-flush that is
problematic.  With this patch my stress test takes around 90 minutes to
run, instead of 12 hours.

The memory barrier is not necessary for the flushing bit as it's
ordered, unlike plain int. The transaction state accessed in
btrfs_should_end_transaction could be affected by that too as it's not
always used under transaction lock. Upon Nikolay's analysis in [1]
it's not necessary:

  In should_end_transaction it's read without holding any locks. (U)

  It's modified in btrfs_cleanup_transaction without holding the
  fs_info-&gt;trans_lock (U), but the STATE_ERROR flag is going to be set.

  set in cleanup_transaction under fs_info-&gt;trans_lock (L)
  set in btrfs_commit_trans to COMMIT_START under fs_info-&gt;trans_lock.(L)
  set in btrfs_commit_trans to COMMIT_DOING under fs_info-&gt;trans_lock.(L)
  set in btrfs_commit_trans to COMMIT_UNBLOCK under
  fs_info-&gt;trans_lock.(L)

  set in btrfs_commit_trans to COMMIT_COMPLETED without locks but at this
  point the transaction is finished and fs_info-&gt;running_trans is NULL (U
  but irrelevant).

  So by the looks of it we can have a concurrent READ race with a WRITE,
  due to reads not taking a lock. In this case what we want to ensure is
  we either see new or old state. I consulted with Will Deacon and he said
  that in such a case we'd want to annotate the accesses to -&gt;state with
  (READ|WRITE)_ONCE so as to avoid a theoretical tear, in this case I
  don't think this could happen but I imagine at some point KCSAN would
  flag such an access as racy (which it is).

[1] https://lore.kernel.org/linux-btrfs/e1fd5cc1-0f28-f670-69f4-e9958b4964e6@suse.com

Reviewed-by: Nikolay Borisov &lt;nborisov@suse.com&gt;
Signed-off-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
[ add comments regarding memory barrier ]
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: handle space_info::total_bytes_pinned inside the delayed ref itself</title>
<updated>2021-02-08T21:58:55+00:00</updated>
<author>
<name>Josef Bacik</name>
<email>josef@toxicpanda.com</email>
</author>
<published>2021-01-15T21:48:55+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=2187374f35fe9cadbddaa9fcf0c4121365d914e8'/>
<id>urn:sha1:2187374f35fe9cadbddaa9fcf0c4121365d914e8</id>
<content type='text'>
Currently we pass things around to figure out if we maybe freeing data
based on the state of the delayed refs head.  This makes the accounting
sort of confusing and hard to follow, as it's distinctly separate from
the delayed ref heads stuff, but also depends on it entirely.

Fix this by explicitly adjusting the space_info-&gt;total_bytes_pinned in
the delayed refs code.  We now have two places where we modify this
counter, once where we create the delayed and destroy the delayed refs,
and once when we pin and unpin the extents.  This means there is a
slight overlap between delayed refs and the pin/unpin mechanisms, but
this is simply used by the ENOSPC infrastructure to determine if we need
to commit the transaction, so there's no adverse affect from this, we
might simply commit thinking it will give us enough space when it might
not.

CC: stable@vger.kernel.org # 5.10
Reviewed-by: Nikolay Borisov &lt;nborisov@suse.com&gt;
Signed-off-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: migrate the delayed refs rsv code</title>
<updated>2019-07-04T15:26:17+00:00</updated>
<author>
<name>Josef Bacik</name>
<email>josef@toxicpanda.com</email>
</author>
<published>2019-06-19T19:11:58+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=6ef03debdb3d82d7deec65f96e143b9adcfb2cd4'/>
<id>urn:sha1:6ef03debdb3d82d7deec65f96e143b9adcfb2cd4</id>
<content type='text'>
These belong with the delayed refs related code, not in extent-tree.c.

Signed-off-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: remove unused parameter fs_info from btrfs_add_delayed_extent_op</title>
<updated>2019-04-29T17:02:51+00:00</updated>
<author>
<name>David Sterba</name>
<email>dsterba@suse.com</email>
</author>
<published>2019-03-20T10:42:34+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=c6e340bc1c9e3411c40aafca4c69b989530c9347'/>
<id>urn:sha1:c6e340bc1c9e3411c40aafca4c69b989530c9347</id>
<content type='text'>
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_data_ref()</title>
<updated>2019-04-29T17:02:49+00:00</updated>
<author>
<name>Qu Wenruo</name>
<email>wqu@suse.com</email>
</author>
<published>2019-04-04T06:45:32+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=76675593b69f2fcd57e24d9dd2a9b278f0130d0b'/>
<id>urn:sha1:76675593b69f2fcd57e24d9dd2a9b278f0130d0b</id>
<content type='text'>
Just like btrfs_add_delayed_tree_ref(), use btrfs_ref to refactor
btrfs_add_delayed_data_ref().

Signed-off-by: Qu Wenruo &lt;wqu@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_tree_ref()</title>
<updated>2019-04-29T17:02:48+00:00</updated>
<author>
<name>Qu Wenruo</name>
<email>wqu@suse.com</email>
</author>
<published>2019-04-04T06:45:31+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=ed4f255b9bacb774c99ded17647f138c3f61546d'/>
<id>urn:sha1:ed4f255b9bacb774c99ded17647f138c3f61546d</id>
<content type='text'>
btrfs_add_delayed_tree_ref() has a longer and longer parameter list, and
some callers like btrfs_inc_extent_ref() are using @owner as level for
delayed tree ref.

Instead of making the parameter list longer, use btrfs_ref to refactor
it, so each parameter assignment should be self-explaining without dirty
level/owner trick, and provides the basis for later refactoring.

Signed-off-by: Qu Wenruo &lt;wqu@suse.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: delayed-ref: Introduce better documented delayed ref structures</title>
<updated>2019-04-29T17:02:48+00:00</updated>
<author>
<name>Qu Wenruo</name>
<email>wqu@suse.com</email>
</author>
<published>2019-04-04T06:45:29+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=b28b1f0ce44c1b9ebc1c43e3eba18c1f1f5d9cec'/>
<id>urn:sha1:b28b1f0ce44c1b9ebc1c43e3eba18c1f1f5d9cec</id>
<content type='text'>
Current delayed ref interface has several problems:

- Longer and longer parameter lists
  bytenr
  num_bytes
  parent
  ---------- so far so good
  ref_root
  owner
  offset
  ---------- I don't feel good now

- Different interpretation of the same parameter

  Above @owner for data ref is inode number (u64),
  while for tree ref, it's level (int).

  They are even in different size range.
  For level we only need 0 ~ 8, while for ino it's
  BTRFS_FIRST_FREE_OBJECTID ~ BTRFS_LAST_FREE_OBJECTID.

  And @offset doesn't even make sense for tree ref.

  Such parameter reuse may look clever as an hidden union, but it
  destroys code readability.

To solve both problems, we introduce a new structure, btrfs_ref to solve
them:

- Structure instead of long parameter list
  This makes later expansion easier, and is better documented.

- Use btrfs_ref::type to distinguish data and tree ref

- Use proper union to store data/tree ref specific structures.

- Use separate functions to fill data/tree ref data, with a common generic
  function to fill common bytenr/num_bytes members.

All parameters will find its place in btrfs_ref, and an extra member,
@real_root, inspired by ref-verify code, is newly introduced for later
qgroup code, to record which tree is triggered by this extent modification.

This patch doesn't touch any code, but provides the basis for further
refactoring.

Signed-off-by: Qu Wenruo &lt;wqu@suse.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record</title>
<updated>2019-02-25T13:13:39+00:00</updated>
<author>
<name>Qu Wenruo</name>
<email>wqu@suse.com</email>
</author>
<published>2019-01-23T07:15:12+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=1418bae1c22951aad9883bc8f8f4dccb272cce1e'/>
<id>urn:sha1:1418bae1c22951aad9883bc8f8f4dccb272cce1e</id>
<content type='text'>
[BUG]
Btrfs/139 will fail with a high probability if the testing machine (VM)
has only 2G RAM.

Resulting the final write success while it should fail due to EDQUOT,
and the fs will have quota exceeding the limit by 16K.

The simplified reproducer will be: (needs a 2G ram VM)

  $ mkfs.btrfs -f $dev
  $ mount $dev $mnt

  $ btrfs subv create $mnt/subv
  $ btrfs quota enable $mnt
  $ btrfs quota rescan -w $mnt
  $ btrfs qgroup limit -e 1G $mnt/subv

  $ for i in $(seq -w  1 8); do
  	xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i &gt; /dev/null
  	echo "file $i written" &gt; /dev/kmsg
    done
  $ sync
  $ btrfs qgroup show -pcre --raw $mnt

The last pwrite will not trigger EDQUOT and final 'qgroup show' will
show something like:

  qgroupid         rfer         excl     max_rfer     max_excl parent  child
  --------         ----         ----     --------     -------- ------  -----
  0/5             16384        16384         none         none ---     ---
  0/256      1073758208   1073758208         none   1073741824 ---     ---

And 1073758208 is larger than
  &gt; 1073741824.

[CAUSE]
It's a bug in btrfs qgroup data reserved space management.

For quota limit, we must ensure that:
  reserved (data + metadata) + rfer/excl &lt;= limit

Since rfer/excl is only updated at transaction commmit time, reserved
space needs to be taken special care.

One important part of reserved space is data, and for a new data extent
written to disk, we still need to take the reserved space until
rfer/excl numbers get updated.

Originally when an ordered extent finishes, we migrate the reserved
qgroup data space from extent_io tree to delayed ref head of the data
extent, expecting delayed ref will only be cleaned up at commit
transaction time.

However for small RAM machine, due to memory pressure dirty pages can be
flushed back to disk without committing a transaction.

The related events will be something like:

  file 1 written
  btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840
  btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096
  btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344
  btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192
  btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344
  cleanup_ref_head: num_bytes=54947840
  cleanup_ref_head: num_bytes=5636096
  cleanup_ref_head: num_bytes=569344
  cleanup_ref_head: num_bytes=57344
  cleanup_ref_head: num_bytes=8192
  ^^^^^^^^^^^^^^^^ This will free qgroup data reserved space
  file 2 written
  ...
  file 8 written
  cleanup_ref_head: num_bytes=8192
  ...
  btrfs_commit_transaction  &lt;&lt;&lt; the only transaction committed during
				the test

When file 2 is written, we have already freed 128M reserved qgroup data
space for ino 258. Thus later write won't trigger EDQUOT.

This allows us to write more data beyond qgroup limit.

In my 2G ram VM, it could reach about 1.2G before hitting EDQUOT.

[FIX]
By moving reserved qgroup data space from btrfs_delayed_ref_head to
btrfs_qgroup_extent_record, we can ensure that reserved qgroup data
space won't be freed half way before commit transaction, thus fix the
problem.

Fixes: f64d5ca86821 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref")
Signed-off-by: Qu Wenruo &lt;wqu@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: add btrfs_delete_ref_head helper</title>
<updated>2018-12-17T13:51:46+00:00</updated>
<author>
<name>Josef Bacik</name>
<email>jbacik@fb.com</email>
</author>
<published>2018-12-03T15:20:29+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=d7baffdaf9f9df8c9715aa507e3be2f409347c74'/>
<id>urn:sha1:d7baffdaf9f9df8c9715aa507e3be2f409347c74</id>
<content type='text'>
We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
into a helper and cleanup the calling functions.

Reviewed-by: Omar Sandoval &lt;osandov@fb.com&gt;
Reviewed-by: Nikolay Borisov &lt;nborisov@suse.com&gt;
Signed-off-by: Josef Bacik &lt;jbacik@fb.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: delayed-ref: pass delayed_refs directly to btrfs_delayed_ref_lock</title>
<updated>2018-10-15T15:23:41+00:00</updated>
<author>
<name>Lu Fengqi</name>
<email>lufq.fnst@cn.fujitsu.com</email>
</author>
<published>2018-10-11T05:40:34+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=9e920a6f03e40b1eb712f38b29ad5880153754e2'/>
<id>urn:sha1:9e920a6f03e40b1eb712f38b29ad5880153754e2</id>
<content type='text'>
Since trans is only used for referring to delayed_refs, there is no need
to pass it instead of delayed_refs to btrfs_delayed_ref_lock().

No functional change.

Reviewed-by: Nikolay Borisov &lt;nborisov@suse.com&gt;
Signed-off-by: Lu Fengqi &lt;lufq.fnst@cn.fujitsu.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
</feed>
