Age | Commit message (Collapse) | Author |
|
[ Upstream commit f1cd1f0b7d1b5d4aaa5711e8f4e4898b0045cb6d ]
When listing a inode's xattrs we have a time window where we race against
a concurrent operation for adding a new hard link for our inode that makes
us not return any xattr to user space. In order for this to happen, the
first xattr of our inode needs to be at slot 0 of a leaf and the previous
leaf must still have room for an inode ref (or extref) item, and this can
happen because an inode's listxattrs callback does not lock the inode's
i_mutex (nor does the VFS does it for us), but adding a hard link to an
inode makes the VFS lock the inode's i_mutex before calling the inode's
link callback.
If we have the following leafs:
Leaf X (has N items) Leaf Y
[ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 XATTR_ITEM 12345), ... ]
slot N - 2 slot N - 1 slot 0
The race illustrated by the following sequence diagram is possible:
CPU 1 CPU 2
btrfs_listxattr()
searches for key (257 XATTR_ITEM 0)
gets path with path->nodes[0] == leaf X
and path->slots[0] == N
because path->slots[0] is >=
btrfs_header_nritems(leaf X), it calls
btrfs_next_leaf()
btrfs_next_leaf()
releases the path
adds key (257 INODE_REF 666)
to the end of leaf X (slot N),
and leaf X now has N + 1 items
searches for the key (257 INODE_REF 256),
with path->keep_locks == 1, because that
is the last key it saw in leaf X before
releasing the path
ends up at leaf X again and it verifies
that the key (257 INODE_REF 256) is no
longer the last key in leaf X, so it
returns with path->nodes[0] == leaf X
and path->slots[0] == N, pointing to
the new item with key (257 INODE_REF 666)
btrfs_listxattr's loop iteration sees that
the type of the key pointed by the path is
different from the type BTRFS_XATTR_ITEM_KEY
and so it breaks the loop and stops looking
for more xattr items
--> the application doesn't get any xattr
listed for our inode
So fix this by breaking the loop only if the key's type is greater than
BTRFS_XATTR_ITEM_KEY and skip the current key if its type is smaller.
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 1d512cb77bdbda80f0dd0620a3b260d697fd581d ]
If we are using the NO_HOLES feature, we have a tiny time window when
running delalloc for a nodatacow inode where we can race with a concurrent
link or xattr add operation leading to a BUG_ON.
This happens because at run_delalloc_nocow() we end up casting a leaf item
of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a
file extent item (struct btrfs_file_extent_item) and then analyse its
extent type field, which won't match any of the expected extent types
(values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an
explicit BUG_ON(1).
The following sequence diagram shows how the race happens when running a
no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following
neighbour leafs:
Leaf X (has N items) Leaf Y
[ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 EXTENT_DATA 8192), ... ]
slot N - 2 slot N - 1 slot 0
(Note the implicit hole for inode 257 regarding the [0, 8K[ range)
CPU 1 CPU 2
run_dealloc_nocow()
btrfs_lookup_file_extent()
--> searches for a key with value
(257 EXTENT_DATA 4096) in the
fs/subvol tree
--> returns us a path with
path->nodes[0] == leaf X and
path->slots[0] == N
because path->slots[0] is >=
btrfs_header_nritems(leaf X), it
calls btrfs_next_leaf()
btrfs_next_leaf()
--> releases the path
hard link added to our inode,
with key (257 INODE_REF 500)
added to the end of leaf X,
so leaf X now has N + 1 keys
--> searches for the key
(257 INODE_REF 256), because
it was the last key in leaf X
before it released the path,
with path->keep_locks set to 1
--> ends up at leaf X again and
it verifies that the key
(257 INODE_REF 256) is no longer
the last key in the leaf, so it
returns with path->nodes[0] ==
leaf X and path->slots[0] == N,
pointing to the new item with
key (257 INODE_REF 500)
the loop iteration of run_dealloc_nocow()
does not break out the loop and continues
because the key referenced in the path
at path->nodes[0] and path->slots[0] is
for inode 257, its type is < BTRFS_EXTENT_DATA_KEY
and its offset (500) is less then our delalloc
range's end (8192)
the item pointed by the path, an inode reference item,
is (incorrectly) interpreted as a file extent item and
we get an invalid extent type, leading to the BUG_ON(1):
if (extent_type == BTRFS_FILE_EXTENT_REG ||
extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
(...)
} else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
(...)
} else {
BUG_ON(1)
}
The same can happen if a xattr is added concurrently and ends up having
a key with an offset smaller then the delalloc's range end.
So fix this by skipping keys with a type smaller than
BTRFS_EXTENT_DATA_KEY.
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit aeafbf8486c9e2bd53f5cc3c10c0b7fd7149d69c ]
While running a stress test I got the following warning triggered:
[191627.672810] ------------[ cut here ]------------
[191627.673949] WARNING: CPU: 8 PID: 8447 at fs/btrfs/file.c:779 __btrfs_drop_extents+0x391/0xa50 [btrfs]()
(...)
[191627.701485] Call Trace:
[191627.702037] [<ffffffff8145f077>] dump_stack+0x4f/0x7b
[191627.702992] [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
[191627.704091] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
[191627.705380] [<ffffffffa0664499>] ? __btrfs_drop_extents+0x391/0xa50 [btrfs]
[191627.706637] [<ffffffff8104b46d>] warn_slowpath_null+0x1a/0x1c
[191627.707789] [<ffffffffa0664499>] __btrfs_drop_extents+0x391/0xa50 [btrfs]
[191627.709155] [<ffffffff8115663c>] ? cache_alloc_debugcheck_after.isra.32+0x171/0x1d0
[191627.712444] [<ffffffff81155007>] ? kmemleak_alloc_recursive.constprop.40+0x16/0x18
[191627.714162] [<ffffffffa06570c9>] insert_reserved_file_extent.constprop.40+0x83/0x24e [btrfs]
[191627.715887] [<ffffffffa065422b>] ? start_transaction+0x3bb/0x610 [btrfs]
[191627.717287] [<ffffffffa065b604>] btrfs_finish_ordered_io+0x273/0x4e2 [btrfs]
[191627.728865] [<ffffffffa065b888>] finish_ordered_fn+0x15/0x17 [btrfs]
[191627.730045] [<ffffffffa067d688>] normal_work_helper+0x14c/0x32c [btrfs]
[191627.731256] [<ffffffffa067d96a>] btrfs_endio_write_helper+0x12/0x14 [btrfs]
[191627.732661] [<ffffffff81061119>] process_one_work+0x24c/0x4ae
[191627.733822] [<ffffffff810615b0>] worker_thread+0x206/0x2c2
[191627.734857] [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
[191627.736052] [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
[191627.737349] [<ffffffff810669a6>] kthread+0xef/0xf7
[191627.738267] [<ffffffff810f3b3a>] ? time_hardirqs_on+0x15/0x28
[191627.739330] [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
[191627.741976] [<ffffffff81465592>] ret_from_fork+0x42/0x70
[191627.743080] [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
[191627.744206] ---[ end trace bbfddacb7aaada8d ]---
$ cat -n fs/btrfs/file.c
691 int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
(...)
758 btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
759 if (key.objectid > ino ||
760 key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end)
761 break;
762
763 fi = btrfs_item_ptr(leaf, path->slots[0],
764 struct btrfs_file_extent_item);
765 extent_type = btrfs_file_extent_type(leaf, fi);
766
767 if (extent_type == BTRFS_FILE_EXTENT_REG ||
768 extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
(...)
774 } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
(...)
778 } else {
779 WARN_ON(1);
780 extent_end = search_start;
781 }
(...)
This happened because the item we were processing did not match a file
extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even on this
case we cast the item to a struct btrfs_file_extent_item pointer and
then find a type field value that does not match any of the expected
values (BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]). This scenario happens
due to a tiny time window where a race can happen as exemplified below.
For example, consider the following scenario where we're using the
NO_HOLES feature and we have the following two neighbour leafs:
Leaf X (has N items) Leaf Y
[ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 EXTENT_DATA 8192), ... ]
slot N - 2 slot N - 1 slot 0
Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather
than explicit because NO_HOLES is enabled). Now if our inode has an
ordered extent for the range [4K, 8K[ that is finishing, the following
can happen:
CPU 1 CPU 2
btrfs_finish_ordered_io()
insert_reserved_file_extent()
__btrfs_drop_extents()
Searches for the key
(257 EXTENT_DATA 4096) through
btrfs_lookup_file_extent()
Key not found and we get a path where
path->nodes[0] == leaf X and
path->slots[0] == N
Because path->slots[0] is >=
btrfs_header_nritems(leaf X), we call
btrfs_next_leaf()
btrfs_next_leaf() releases the path
inserts key
(257 INODE_REF 4096)
at the end of leaf X,
leaf X now has N + 1 keys,
and the new key is at
slot N
btrfs_next_leaf() searches for
key (257 INODE_REF 256), with
path->keep_locks set to 1,
because it was the last key it
saw in leaf X
finds it in leaf X again and
notices it's no longer the last
key of the leaf, so it returns 0
with path->nodes[0] == leaf X and
path->slots[0] == N (which is now
< btrfs_header_nritems(leaf X)),
pointing to the new key
(257 INODE_REF 4096)
__btrfs_drop_extents() casts the
item at path->nodes[0], slot
path->slots[0], to a struct
btrfs_file_extent_item - it does
not skip keys for the target
inode with a type less than
BTRFS_EXTENT_DATA_KEY
(BTRFS_INODE_REF_KEY < BTRFS_EXTENT_DATA_KEY)
sees a bogus value for the type
field triggering the WARN_ON in
the trace shown above, and sets
extent_end = search_start (4096)
does the if-then-else logic to
fixup 0 length extent items created
by a past bug from hole punching:
if (extent_end == key.offset &&
extent_end >= search_start)
goto delete_extent_item;
that evaluates to true and it ends
up deleting the key pointed to by
path->slots[0], (257 INODE_REF 4096),
from leaf X
The same could happen for example for a xattr that ends up having a key
with an offset value that matches search_start (very unlikely but not
impossible).
So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are
skipped, never casted to struct btrfs_file_extent_item and never deleted
by accident. Also protect against the unexpected case of getting a key
for a lower inode number by skipping that key and issuing a warning.
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 8039d87d9e473aeb740d4fdbd59b9d2f89b2ced9 ]
Currently the clone ioctl allows to clone an inline extent from one file
to another that already has other (non-inlined) extents. This is a problem
because btrfs is not designed to deal with files having inline and regular
extents, if a file has an inline extent then it must be the only extent
in the file and must start at file offset 0. Having a file with an inline
extent followed by regular extents results in EIO errors when doing reads
or writes against the first 4K of the file.
Also, the clone ioctl allows one to lose data if the source file consists
of a single inline extent, with a size of N bytes, and the destination
file consists of a single inline extent with a size of M bytes, where we
have M > N. In this case the clone operation removes the inline extent
from the destination file and then copies the inline extent from the
source file into the destination file - we lose the M - N bytes from the
destination file, a read operation will get the value 0x00 for any bytes
in the the range [N, M] (the destination inode's i_size remained as M,
that's why we can read past N bytes).
So fix this by not allowing such destructive operations to happen and
return errno EOPNOTSUPP to user space.
Currently the fstest btrfs/035 tests the data loss case but it totally
ignores this - i.e. expects the operation to succeed and does not check
the we got data loss.
The following test case for fstests exercises all these cases that result
in file corruption and data loss:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
_require_btrfs_fs_feature "no_holes"
_require_btrfs_mkfs_feature "no-holes"
rm -f $seqres.full
test_cloning_inline_extents()
{
local mkfs_opts=$1
local mount_opts=$2
_scratch_mkfs $mkfs_opts >>$seqres.full 2>&1
_scratch_mount $mount_opts
# File bar, the source for all the following clone operations, consists
# of a single inline extent (50 bytes).
$XFS_IO_PROG -f -c "pwrite -S 0xbb 0 50" $SCRATCH_MNT/bar \
| _filter_xfs_io
# Test cloning into a file with an extent (non-inlined) where the
# destination offset overlaps that extent. It should not be possible to
# clone the inline extent from file bar into this file.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 16K" $SCRATCH_MNT/foo \
| _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo
# Doing IO against any range in the first 4K of the file should work.
# Due to a past clone ioctl bug which allowed cloning the inline extent,
# these operations resulted in EIO errors.
echo "File foo data after clone operation:"
# All bytes should have the value 0xaa (clone operation failed and did
# not modify our file).
od -t x1 $SCRATCH_MNT/foo
$XFS_IO_PROG -c "pwrite -S 0xcc 0 100" $SCRATCH_MNT/foo | _filter_xfs_io
# Test cloning the inline extent against a file which has a hole in its
# first 4K followed by a non-inlined extent. It should not be possible
# as well to clone the inline extent from file bar into this file.
$XFS_IO_PROG -f -c "pwrite -S 0xdd 4K 12K" $SCRATCH_MNT/foo2 \
| _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo2
# Doing IO against any range in the first 4K of the file should work.
# Due to a past clone ioctl bug which allowed cloning the inline extent,
# these operations resulted in EIO errors.
echo "File foo2 data after clone operation:"
# All bytes should have the value 0x00 (clone operation failed and did
# not modify our file).
od -t x1 $SCRATCH_MNT/foo2
$XFS_IO_PROG -c "pwrite -S 0xee 0 90" $SCRATCH_MNT/foo2 | _filter_xfs_io
# Test cloning the inline extent against a file which has a size of zero
# but has a prealloc extent. It should not be possible as well to clone
# the inline extent from file bar into this file.
$XFS_IO_PROG -f -c "falloc -k 0 1M" $SCRATCH_MNT/foo3 | _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo3
# Doing IO against any range in the first 4K of the file should work.
# Due to a past clone ioctl bug which allowed cloning the inline extent,
# these operations resulted in EIO errors.
echo "First 50 bytes of foo3 after clone operation:"
# Should not be able to read any bytes, file has 0 bytes i_size (the
# clone operation failed and did not modify our file).
od -t x1 $SCRATCH_MNT/foo3
$XFS_IO_PROG -c "pwrite -S 0xff 0 90" $SCRATCH_MNT/foo3 | _filter_xfs_io
# Test cloning the inline extent against a file which consists of a
# single inline extent that has a size not greater than the size of
# bar's inline extent (40 < 50).
# It should be possible to do the extent cloning from bar to this file.
$XFS_IO_PROG -f -c "pwrite -S 0x01 0 40" $SCRATCH_MNT/foo4 \
| _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo4
# Doing IO against any range in the first 4K of the file should work.
echo "File foo4 data after clone operation:"
# Must match file bar's content.
od -t x1 $SCRATCH_MNT/foo4
$XFS_IO_PROG -c "pwrite -S 0x02 0 90" $SCRATCH_MNT/foo4 | _filter_xfs_io
# Test cloning the inline extent against a file which consists of a
# single inline extent that has a size greater than the size of bar's
# inline extent (60 > 50).
# It should not be possible to clone the inline extent from file bar
# into this file.
$XFS_IO_PROG -f -c "pwrite -S 0x03 0 60" $SCRATCH_MNT/foo5 \
| _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo5
# Reading the file should not fail.
echo "File foo5 data after clone operation:"
# Must have a size of 60 bytes, with all bytes having a value of 0x03
# (the clone operation failed and did not modify our file).
od -t x1 $SCRATCH_MNT/foo5
# Test cloning the inline extent against a file which has no extents but
# has a size greater than bar's inline extent (16K > 50).
# It should not be possible to clone the inline extent from file bar
# into this file.
$XFS_IO_PROG -f -c "truncate 16K" $SCRATCH_MNT/foo6 | _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo6
# Reading the file should not fail.
echo "File foo6 data after clone operation:"
# Must have a size of 16K, with all bytes having a value of 0x00 (the
# clone operation failed and did not modify our file).
od -t x1 $SCRATCH_MNT/foo6
# Test cloning the inline extent against a file which has no extents but
# has a size not greater than bar's inline extent (30 < 50).
# It should be possible to clone the inline extent from file bar into
# this file.
$XFS_IO_PROG -f -c "truncate 30" $SCRATCH_MNT/foo7 | _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo7
# Reading the file should not fail.
echo "File foo7 data after clone operation:"
# Must have a size of 50 bytes, with all bytes having a value of 0xbb.
od -t x1 $SCRATCH_MNT/foo7
# Test cloning the inline extent against a file which has a size not
# greater than the size of bar's inline extent (20 < 50) but has
# a prealloc extent that goes beyond the file's size. It should not be
# possible to clone the inline extent from bar into this file.
$XFS_IO_PROG -f -c "falloc -k 0 1M" \
-c "pwrite -S 0x88 0 20" \
$SCRATCH_MNT/foo8 | _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo8
echo "File foo8 data after clone operation:"
# Must have a size of 20 bytes, with all bytes having a value of 0x88
# (the clone operation did not modify our file).
od -t x1 $SCRATCH_MNT/foo8
_scratch_unmount
}
echo -e "\nTesting without compression and without the no-holes feature...\n"
test_cloning_inline_extents
echo -e "\nTesting with compression and without the no-holes feature...\n"
test_cloning_inline_extents "" "-o compress"
echo -e "\nTesting without compression and with the no-holes feature...\n"
test_cloning_inline_extents "-O no-holes" ""
echo -e "\nTesting with compression and with the no-holes feature...\n"
test_cloning_inline_extents "-O no-holes" "-o compress"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit dc6c5fb3b514221f2e9d21ee626a9d95d3418dff ]
The code for btrfs inode-resolve has never worked properly for
files with enough hard links to trigger extrefs. It was trying to
get the leaf out of a path after freeing the path:
btrfs_release_path(path);
leaf = path->nodes[0];
item_size = btrfs_item_size_nr(leaf, slot);
The fix here is to use the extent buffer we cloned just a little higher
up to avoid deadlocks caused by using the leaf in the path.
Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org # v3.7+
cc: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 808f80b46790f27e145c72112189d6a3be2bc884 ]
My previous fix in commit 005efedf2c7d ("Btrfs: fix read corruption of
compressed and shared extents") was effective only if the compressed
extents cover a file range with a length that is not a multiple of 16
pages. That's because the detection of when we reached a different range
of the file that shares the same compressed extent as the previously
processed range was done at extent_io.c:__do_contiguous_readpages(),
which covers subranges with a length up to 16 pages, because
extent_readpages() groups the pages in clusters no larger than 16 pages.
So fix this by tracking the start of the previously processed file
range's extent map at extent_readpages().
The following test case for fstests reproduces the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
local mount_opts=$1
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts
# Create our test file with a single extent of 64Kb that is going to
# be compressed no matter which compression algo is used (zlib/lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 64K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone the compressed extent into an adjacent file offset.
$CLONER_PROG -s 0 -d $((64 * 1024)) -l $((64 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
echo "File digest before unmount:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
# Remount the fs or clear the page cache to trigger the bug in
# btrfs. Because the extent has an uncompressed length that is a
# multiple of 16 pages, all the pages belonging to the second range
# of the file (64K to 128K), which points to the same extent as the
# first range (0K to 64K), had their contents full of zeroes instead
# of the byte 0xaa. This was a bug exclusively in the read path of
# compressed extents, the correct data was stored on disk, btrfs
# just failed to fill in the pages correctly.
_scratch_remount
echo "File digest after remount:"
# Must match the digest we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 005efedf2c7d0a270ffbe28d8997b03844f3e3e7 ]
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.
Consider the following example:
File layout
[0 - 8K] [8K - 24K]
| |
| |
points to extent X, points to extent X,
offset 4K, length of 8K offset 0, length 16K
[extent X, compressed length = 4K uncompressed length = 16K]
If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. This has a consequence of the compressed read end io
handler (compression.c:end_compressed_bio_read()) finish once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).
So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.
The following test case for fstests triggers the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
local mount_opts=$1
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts
# Create a test file with a single extent that is compressed (the
# data we write into it is highly compressible no matter which
# compression algorithm is used, zlib or lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
-c "pwrite -S 0xbb 4K 8K" \
-c "pwrite -S 0xcc 12K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone our extent into an adjacent offset.
$CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
# Same as before but for this file we clone the extent into a lower
# file offset.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
-c "pwrite -S 0xbb 12K 8K" \
-c "pwrite -S 0xcc 20K 4K" \
$SCRATCH_MNT/bar | _filter_xfs_io
$CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
$SCRATCH_MNT/bar $SCRATCH_MNT/bar
echo "File digests before unmounting filesystem:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
# Evicting the inode or clearing the page cache before reading
# again the file would also trigger the bug - reads were returning
# all bytes in the range corresponding to the second reference to
# the extent with a value of 0, but the correct data was persisted
# (it was a bug exclusively in the read path). The issue happened
# only if the same readpages() call targeted pages belonging to the
# first and second ranges that point to the same compressed extent.
_scratch_remount
echo "File digests after mounting filesystem again:"
# Must match the same digests we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit a30e577c96f59b1e1678ea5462432b09bf7d5cbc ]
In btrfs_evict_inode, we properly truncate the page cache for evicted
inodes but then we call btrfs_wait_ordered_range for every inode as well.
It's the right thing to do for regular files but results in incorrect
behavior for device inodes for block devices.
filemap_fdatawrite_range gets called with inode->i_mapping which gets
resolved to the block device inode before getting passed to
wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi. What happens
next depends on whether there's an open file handle associated with the
inode. If there is, we write to the block device, which is unexpected
behavior. If there isn't, we through normally and inode->i_data is used.
We can also end up racing against open/close which can result in crashes
when i_mapping points to a block device inode that has been closed.
Since there can't be any page cache associated with special file inodes,
it's safe to skip the btrfs_wait_ordered_range call entirely and avoid
the problem.
Cc: <stable@vger.kernel.org>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911
Tested-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 1f9b8c8fbc9a4d029760b16f477b9d15500e3a34 ]
While we are committing a transaction, it's possible the previous one is
still finishing its commit and therefore we wait for it to finish first.
However we were not checking if that previous transaction ended up getting
aborted after we waited for it to commit, so we ended up committing the
current transaction which can lead to fs corruption because the new
superblock can point to trees that have had one or more nodes/leafs that
were never durably persisted.
The following sequence diagram exemplifies how this is possible:
CPU 0 CPU 1
transaction N starts
(...)
btrfs_commit_transaction(N)
cur_trans->state = TRANS_STATE_COMMIT_START;
(...)
cur_trans->state = TRANS_STATE_COMMIT_DOING;
(...)
cur_trans->state = TRANS_STATE_UNBLOCKED;
root->fs_info->running_transaction = NULL;
btrfs_start_transaction()
--> starts transaction N + 1
btrfs_write_and_wait_transaction(trans, root);
--> starts writing all new or COWed ebs created
at transaction N
creates some new ebs, COWs some
existing ebs but doesn't COW or
deletes eb X
btrfs_commit_transaction(N + 1)
(...)
cur_trans->state = TRANS_STATE_COMMIT_START;
(...)
wait_for_commit(root, prev_trans);
--> prev_trans == transaction N
btrfs_write_and_wait_transaction() continues
writing ebs
--> fails writing eb X, we abort transaction N
and set bit BTRFS_FS_STATE_ERROR on
fs_info->fs_state, so no new transactions
can start after setting that bit
cleanup_transaction()
btrfs_cleanup_one_transaction()
wakes up task at CPU 1
continues, doesn't abort because
cur_trans->aborted (transaction N + 1)
is zero, and no checks for bit
BTRFS_FS_STATE_ERROR in fs_info->fs_state
are made
btrfs_write_and_wait_transaction(trans, root);
--> succeeds, no errors during writeback
write_ctree_super(trans, root, 0);
--> succeeds
--> we have now a superblock that points us
to some root that uses eb X, which was
never written to disk
In this scenario future attempts to read eb X from disk results in an
error message like "parent transid verify failed on X wanted Y found Z".
So fix this by aborting the current transaction if after waiting for the
previous transaction we verify that it was aborted.
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 727b9784b6085c99c2f836bf4fcc2848dc9cf904 ]
Orphans in the fs tree are cleaned up via open_ctree and subvolume
orphans are cleaned via btrfs_lookup_dentry -- except when a default
subvolume is in use. The name for the default subvolume uses a manual
lookup that doesn't trigger orphan cleanup and needs to trigger it
manually as well. This doesn't apply to the remount case since the
subvolumes are cleaned up by walking the root radix tree.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 26e726afe01c1c82072cf23a5ed89ce25f39d9f2 ]
fiemap_fill_next_extent returns 0 on success, -errno on error, 1 if this was
the last extent that will fit in user array. If 1 is returned, the return
value may eventually returned to user space, which should not happen, according
to manpage of ioctl.
Signed-off-by: Chengyu Song <csong84@gatech.edu>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 497b4050e0eacd4c746dd396d14916b1e669849d ]
We were allocating memory with memdup_user() but we were never releasing
that memory. This affected pretty much every call to the ioctl, whether
it deduplicated extents or not.
This issue was reported on IRC by Julian Taylor and on the mailing list
by Marcel Ritter, credit goes to them for finding the issue.
Reported-by: Julian Taylor <jtaylor.debian@googlemail.com>
Reported-by: Marcel Ritter <ritter.marcel@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit c3f4a1685bb87e59c886ee68f7967eae07d4dffa ]
The free space entries are allocated using kmem_cache_zalloc(),
through __btrfs_add_free_space(), therefore we should use
kmem_cache_free() and not kfree() to avoid any confusion and
any potential problem. Looking at the kfree() definition at
mm/slab.c it has the following comment:
/*
* (...)
*
* Don't free memory not originally allocated by kmalloc()
* or you will run into trouble.
*/
So better be safe and use kmem_cache_free().
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 02590fd855d1690568b2fa439c942e933221b57a ]
commit 5f5bc6b1e2d5a6f827bc860ef2dc5b6f365d1339 upstream.
Replacing a xattr consists of doing a lookup for its existing value, delete
the current value from the respective leaf, release the search path and then
finally insert the new value. This leaves a time window where readers (getxattr,
listxattrs) won't see any value for the xattr. Xattrs are used to store ACLs,
so this has security implications.
This change also fixes 2 other existing issues which were:
*) Deleting the old xattr value without verifying first if the new xattr will
fit in the existing leaf item (in case multiple xattrs are packed in the
same item due to name hash collision);
*) Returning -EEXIST when the flag XATTR_CREATE is given and the xattr doesn't
exist but we have have an existing item that packs muliple xattrs with
the same name hash as the input xattr. In this case we should return ENOSPC.
A test case for xfstests follows soon.
Thanks to Alexandre Oliva for reporting the non-atomicity of the xattr replace
implementation.
Reported-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 64ad6c488975d7516230cf7849190a991fd615ae ]
Since commit bafc9b754f75 ("vfs: More precise tests in d_invalidate"),
mounted subvolumes can be deleted because d_invalidate() won't fail.
However, we run into problems when we attempt to delete the default
subvolume while it is mounted as the root filesystem:
# btrfs subvol list /
ID 257 gen 306 top level 5 path rootvol
ID 267 gen 334 top level 5 path snap1
# btrfs subvol get-default /
ID 267 gen 334 top level 5 path snap1
# btrfs inspect-internal rootid /
267
# mount -o subvol=/ /dev/vda1 /mnt
# btrfs subvol del /mnt/snap1
Delete subvolume (no-commit): '/mnt/snap1'
ERROR: cannot delete '/mnt/snap1' - Operation not permitted
# findmnt /
findmnt: can't read /proc/mounts: No such file or directory
# ls /proc
#
Markus reported that this same scenario simply led to a kernel oops.
This happens because in btrfs_ioctl_snap_destroy(), we call
d_invalidate() before we check may_destroy_subvol(), which means that we
detach the submounts and drop the dentry before erroring out. Instead,
we should only invalidate the dentry once the deletion has succeeded.
Additionally, the shrink_dcache_sb() isn't necessary; d_invalidate()
will prune the dcache for the deleted subvolume.
Cc: <stable@vger.kernel.org>
Fixes: bafc9b754f75 ("vfs: More precise tests in d_invalidate")
Reported-by: Markus Schauler <mschauler@gmail.com>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 909e26dce3f7600f5e293ac0522c28790a0c8c9c ]
Whenever the check for a send in progress introduced in commit
521e0546c970 (btrfs: protect snapshots from deleting during send) is
hit, we return without unlocking inode->i_mutex. This is easy to see
with lockdep enabled:
[ +0.000059] ================================================
[ +0.000028] [ BUG: lock held when returning to user space! ]
[ +0.000029] 4.0.0-rc5-00096-g3c435c1 #93 Not tainted
[ +0.000026] ------------------------------------------------
[ +0.000029] btrfs/211 is leaving the kernel with locks still held!
[ +0.000029] 1 lock held by btrfs/211:
[ +0.000023] #0: (&type->i_mutex_dir_key){+.+.+.}, at: [<ffffffff8135b8df>] btrfs_ioctl_snap_destroy+0x2df/0x7a0
Make sure we unlock it in the error path.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit HEAD ]
commit 113e8283869b9855c8b999796aadd506bbac155f upstream.
If we pass a length of 0 to the extent_same ioctl, we end up locking an
extent range with a start offset greater then its end offset (if the
destination file's offset is greater than zero). This results in a warning
from extent_io.c:insert_state through the following call chain:
btrfs_extent_same()
btrfs_double_lock()
lock_extent_range()
lock_extent(inode->io_tree, offset, offset + len - 1)
lock_extent_bits()
__set_extent_bit()
insert_state()
--> WARN_ON(end < start)
This leads to an infinite loop when evicting the inode. This is the same
problem that my previous patch titled
"Btrfs: fix inode eviction infinite loop after cloning into it" addressed
but for the extent_same ioctl instead of the clone ioctl.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 9dc106617d5669a6f8d86e08f620dc2fb0413e21)
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit HEAD ]
commit ccccf3d67294714af2d72a6fd6fd7d73b01c9329 upstream.
If we attempt to clone a 0 length region into a file we can end up
inserting a range in the inode's extent_io tree with a start offset
that is greater then the end offset, which triggers immediately the
following warning:
[ 3914.619057] WARNING: CPU: 17 PID: 4199 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
[ 3914.620886] BTRFS: end < start 4095 4096
(...)
[ 3914.638093] Call Trace:
[ 3914.638636] [<ffffffff81425fd9>] dump_stack+0x4c/0x65
[ 3914.639620] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
[ 3914.640789] [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs]
[ 3914.642041] [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
[ 3914.643236] [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs]
[ 3914.644441] [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs]
[ 3914.645711] [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs]
[ 3914.646914] [<ffffffff8142b2fb>] ? _raw_spin_unlock+0x28/0x33
[ 3914.648058] [<ffffffffa03cbac4>] ? test_range_bit+0xcc/0xde [btrfs]
[ 3914.650105] [<ffffffffa03cb3c3>] lock_extent+0x13/0x15 [btrfs]
[ 3914.651361] [<ffffffffa03db39e>] lock_extent_range+0x3d/0xcd [btrfs]
[ 3914.652761] [<ffffffffa03de1fe>] btrfs_ioctl_clone+0x278/0x388 [btrfs]
[ 3914.654128] [<ffffffff811226dd>] ? might_fault+0x58/0xb5
[ 3914.655320] [<ffffffffa03e0909>] btrfs_ioctl+0xb51/0x2195 [btrfs]
(...)
[ 3914.669271] ---[ end trace 14843d3e2e622fc1 ]---
This later makes the inode eviction handler enter an infinite loop that
keeps dumping the following warning over and over:
[ 3915.117629] WARNING: CPU: 22 PID: 4228 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
[ 3915.119913] BTRFS: end < start 4095 4096
(...)
[ 3915.137394] Call Trace:
[ 3915.137913] [<ffffffff81425fd9>] dump_stack+0x4c/0x65
[ 3915.139154] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
[ 3915.140316] [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs]
[ 3915.141505] [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
[ 3915.142709] [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs]
[ 3915.143849] [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs]
[ 3915.145120] [<ffffffffa038c1e3>] ? btrfs_kill_super+0x17/0x23 [btrfs]
[ 3915.146352] [<ffffffff811548f6>] ? deactivate_locked_super+0x3b/0x50
[ 3915.147565] [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs]
[ 3915.148785] [<ffffffff8142b7e2>] ? _raw_write_unlock+0x28/0x33
[ 3915.149931] [<ffffffffa03bc325>] btrfs_evict_inode+0x196/0x482 [btrfs]
[ 3915.151154] [<ffffffff81168904>] evict+0xa0/0x148
[ 3915.152094] [<ffffffff811689e5>] dispose_list+0x39/0x43
[ 3915.153081] [<ffffffff81169564>] evict_inodes+0xdc/0xeb
[ 3915.154062] [<ffffffff81154418>] generic_shutdown_super+0x49/0xef
[ 3915.155193] [<ffffffff811546d1>] kill_anon_super+0x13/0x1e
[ 3915.156274] [<ffffffffa038c1e3>] btrfs_kill_super+0x17/0x23 [btrfs]
(...)
[ 3915.167404] ---[ end trace 14843d3e2e622fc2 ]---
So just bail out of the clone ioctl if the length of the region to clone
is zero, without locking any extent range, in order to prevent this issue
(same behaviour as a pwrite with a 0 length for example).
This is trivial to reproduce. For example, the steps for the test I just
made for fstests:
mkfs.btrfs -f SCRATCH_DEV
mount SCRATCH_DEV $SCRATCH_MNT
touch $SCRATCH_MNT/foo
touch $SCRATCH_MNT/bar
$CLONER_PROG -s 0 -d 4096 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar
umount $SCRATCH_MNT
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 449b46275ce58e1d3fc20d1efacd0d0369c6070f)
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit HEAD ]
commit 3c3b04d10ff1811a27f86684ccd2f5ba6983211d upstream.
Due to insufficient check in btrfs_is_valid_xattr, this unexpectedly
works:
$ touch file
$ setfattr -n user. -v 1 file
$ getfattr -d file
user.="1"
ie. the missing attribute name after the namespace.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=94291
Reported-by: William Douglas <william.douglas@intel.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 1bb2835ed4f8ee186d8110817cf5a96ef9e35ab3)
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit HEAD ]
commit dcc82f4783ad91d4ab654f89f37ae9291cdc846a upstream.
While committing a transaction we free the log roots before we write the
new super block. Freeing the log roots implies marking the disk location
of every node/leaf (metadata extent) as pinned before the new super block
is written. This is to prevent the disk location of log metadata extents
from being reused before the new super block is written, otherwise we
would have a corrupted log tree if before the new super block is written
a crash/reboot happens and the location of any log tree metadata extent
ended up being reused and rewritten.
Even though we pinned the log tree's metadata extents, we were issuing a
discard against them if the fs was mounted with the -o discard option,
resulting in corruption of the log tree if a crash/reboot happened before
writing the new super block - the next time the fs was mounted, during
the log replay process we would find nodes/leafs of the log btree with
a content full of zeroes, causing the process to fail and require the
use of the tool btrfs-zero-log to wipeout the log tree (and all data
previously fsynced becoming lost forever).
Fix this by not doing a discard when pinning an extent. The discard will
be done later when it's safe (after the new super block is committed) at
extent-tree.c:btrfs_finish_extent_commit().
Fixes: e688b7252f78 (Btrfs: fix extent pinning bugs in the tree log)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3909e5a93ed64a186a396c1b7fd1db07e065728f)
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 9c4f61f01d269815bb7c37be3ede59c5587747c6 ]
We can search and add the orphan item in one go,
btrfs_insert_orphan_item will find out if the item already exists.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
commit dd9ef135e3542ffc621c4eb7f0091870ec7a1504 upstream.
Improper arithmetics when calculting the address of the extended ref could
lead to an out of bounds memory read and kernel panic.
Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
commit 3a8b36f378060d20062a0918e99fae39ff077bf0 upstream.
When using the fast file fsync code path we can miss the fact that new
writes happened since the last file fsync and therefore return without
waiting for the IO to finish and write the new extents to the fsync log.
Here's an example scenario where the fsync will miss the fact that new
file data exists that wasn't yet durably persisted:
1. fs_info->last_trans_committed == N - 1 and current transaction is
transaction N (fs_info->generation == N);
2. do a buffered write;
3. fsync our inode, this clears our inode's full sync flag, starts
an ordered extent and waits for it to complete - when it completes
at btrfs_finish_ordered_io(), the inode's last_trans is set to the
value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
btrfs_set_inode_last_trans);
4. transaction N is committed, so fs_info->last_trans_committed is now
set to the value N and fs_info->generation remains with the value N;
5. do another buffered write, when this happens btrfs_file_write_iter
sets our inode's last_trans to the value N + 1 (that is
fs_info->generation + 1 == N + 1);
6. transaction N + 1 is started and fs_info->generation now has the
value N + 1;
7. transaction N + 1 is committed, so fs_info->last_trans_committed
is set to the value N + 1;
8. fsync our inode - because it doesn't have the full sync flag set,
we only start the ordered extent, we don't wait for it to complete
(only in a later phase) therefore its last_trans field has the
value N + 1 set previously by btrfs_file_write_iter(), and so we
have:
inode->last_trans <= fs_info->last_trans_committed
(N + 1) (N + 1)
Which made us not log the last buffered write and exit the fsync
handler immediately, returning success (0) to user space and resulting
in data loss after a crash.
This can actually be triggered deterministically and the following excerpt
from a testcase I made for xfstests triggers the issue. It moves a dummy
file across directories and then fsyncs the old parent directory - this
is just to trigger a transaction commit, so moving files around isn't
directly related to the issue but it was chosen because running 'sync' for
example does more than just committing the current transaction, as it
flushes/waits for all file data to be persisted. The issue can also happen
at random periods, since the transaction kthread periodicaly commits the
current transaction (about every 30 seconds by default).
The body of the test is:
_scratch_mkfs >> $seqres.full 2>&1
_init_flakey
_mount_flakey
# Create our main test file 'foo', the one we check for data loss.
# By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
# bit from its flags (btrfs inode specific flags).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
-c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io
# Now create one other file and 2 directories. We will move this second file
# from one directory to the other later because it forces btrfs to commit its
# currently open transaction if we fsync the old parent directory. This is
# necessary to trigger the data loss bug that affected btrfs.
mkdir $SCRATCH_MNT/testdir_1
touch $SCRATCH_MNT/testdir_1/bar
mkdir $SCRATCH_MNT/testdir_2
# Make sure everything is durably persisted.
sync
# Write more 8Kb of data to our file.
$XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io
# Move our 'bar' file into a new directory.
mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar
# Fsync our first directory. Because it had a file moved into some other
# directory, this made btrfs commit the currently open transaction. This is
# a condition necessary to trigger the data loss bug.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1
# Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
# data we wrote previously to be persisted and available if a crash happens.
# This did not happen with btrfs, because of the transaction commit that
# happened when we fsynced the parent directory.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# Now check that all data we wrote before are available.
echo "File content after log replay:"
od -t x1 $SCRATCH_MNT/foo
status=0
exit
The expected golden output for the test, which is what we get with this
fix applied (or when running against ext3/4 and xfs), is:
wrote 8192/8192 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 8192/8192 bytes at offset 8192
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
*
0040000
Without this fix applied, the output shows the test file does not have
the second 8Kb extent that we successfully fsynced:
wrote 8192/8192 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 8192/8192 bytes at offset 8192
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0020000
So fix this by skipping the fsync only if we're doing a full sync and
if the inode's last_trans is <= fs_info->last_trans_committed, or if
the inode is already in the log. Also remove setting the inode's
last_trans in btrfs_file_write_iter since it's useless/unreliable.
Also because btrfs_file_write_iter no longer sets inode->last_trans to
fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
bail out if last_trans is 0, otherwise something as simple as the following
example wouldn't log the second write on the last fsync:
1. write to file
2. fsync file
3. fsync file
|--> btrfs_inode_in_log() returns true and it set last_trans to 0
4. write to file
|--> btrfs_file_write_iter() no longers sets last_trans, so it
remained with a value of 0
5. fsync
|--> inode->last_trans == 0, so it bails out without logging the
second write
A test case for xfstests will be sent soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
commit 1932b7be973b554ffe20a5bba6ffaed6fa995cdc upstream.
A block-local variable stores error code but btrfs_get_blocks_direct may
not return it in the end as there's a ret defined in the function scope.
Fixes: d187663ef24c ("Btrfs: lock extents as we map them in DIO")
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
commit 4d884fceaa2c838abb598778813e93f6d9fea723 upstream.
We can have multiple fsync operations against the same file during the
same transaction and they can collect the same ordered extents while they
don't complete (still accessible from the inode's ordered tree). If this
happens, those ordered extents will never get their reference counts
decremented to 0, leading to memory leaks and inode leaks (an iput for an
ordered extent's inode is scheduled only when the ordered extent's refcount
drops to 0). The following sequence diagram explains this race:
CPU 1 CPU 2
btrfs_sync_file()
btrfs_sync_file()
mutex_lock(inode->i_mutex)
btrfs_log_inode()
btrfs_get_logged_extents()
--> collects ordered extent X
--> increments ordered
extent X's refcount
btrfs_submit_logged_extents()
mutex_unlock(inode->i_mutex)
mutex_lock(inode->i_mutex)
btrfs_sync_log()
btrfs_wait_logged_extents()
--> list_del_init(&ordered->log_list)
btrfs_log_inode()
btrfs_get_logged_extents()
--> Adds ordered extent X
to logged_list because
at this point:
list_empty(&ordered->log_list)
&& test_bit(BTRFS_ORDERED_LOGGED,
&ordered->flags) == 0
--> Increments ordered extent
X's refcount
--> check if ordered extent's io is
finished or not, start it if
necessary and wait for it to finish
--> sets bit BTRFS_ORDERED_LOGGED
on ordered extent X's flags
and adds it to trans->ordered
btrfs_sync_log() finishes
btrfs_submit_logged_extents()
btrfs_log_inode() finishes
mutex_unlock(inode->i_mutex)
btrfs_sync_file() finishes
btrfs_sync_log()
btrfs_wait_logged_extents()
--> Sees ordered extent X has the
bit BTRFS_ORDERED_LOGGED set in
its flags
--> X's refcount is untouched
btrfs_sync_log() finishes
btrfs_sync_file() finishes
btrfs_commit_transaction()
--> called by transaction kthread for e.g.
btrfs_wait_pending_ordered()
--> waits for ordered extent X to
complete
--> decrements ordered extent X's
refcount by 1 only, corresponding
to the increment done by the fsync
task ran by CPU 1
In the scenario of the above diagram, after the transaction commit,
the ordered extent will remain with a refcount of 1 forever, leaking
the ordered extent structure and preventing the i_count of its inode
from ever decreasing to 0, since the delayed iput is scheduled only
when the ordered extent's refcount drops to 0, preventing the inode
from ever being evicted by the VFS.
Fix this by using the flag BTRFS_ORDERED_LOGGED differently. Use it to
mean that an ordered extent is already being processed by an fsync call,
which will attach it to the current transaction, preventing it from being
collected by subsequent fsync operations against the same inode.
This race was introduced with the following change (added in 3.19 and
backported to stable 3.18 and 3.17):
Btrfs: make sure logged extents complete in the current transaction V3
commit 50d9aa99bd35c77200e0e3dd7a72274f8304701f
I ran into this issue while running xfstests/generic/113 in a loop, which
failed about 1 out of 10 runs with the following warning in dmesg:
[ 2612.440038] WARNING: CPU: 4 PID: 22057 at fs/btrfs/disk-io.c:3558 free_fs_root+0x36/0x133 [btrfs]()
[ 2612.442810] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop processor parport_pc parport psmouse therma
l_sys i2c_piix4 serio_raw pcspkr evdev microcode button i2c_core ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic virtio_pci ata_piix virtio_ring libata virtio flo
ppy e1000 scsi_mod [last unloaded: btrfs]
[ 2612.452711] CPU: 4 PID: 22057 Comm: umount Tainted: G W 3.19.0-rc5-btrfs-next-4+ #1
[ 2612.454921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 2612.457709] 0000000000000009 ffff8801342c3c78 ffffffff8142425e ffff88023ec8f2d8
[ 2612.459829] 0000000000000000 ffff8801342c3cb8 ffffffff81045308 ffff880046460000
[ 2612.461564] ffffffffa036da56 ffff88003d07b000 ffff880046460000 ffff880046460068
[ 2612.463163] Call Trace:
[ 2612.463719] [<ffffffff8142425e>] dump_stack+0x4c/0x65
[ 2612.464789] [<ffffffff81045308>] warn_slowpath_common+0xa1/0xbb
[ 2612.466026] [<ffffffffa036da56>] ? free_fs_root+0x36/0x133 [btrfs]
[ 2612.467247] [<ffffffff810453c5>] warn_slowpath_null+0x1a/0x1c
[ 2612.468416] [<ffffffffa036da56>] free_fs_root+0x36/0x133 [btrfs]
[ 2612.469625] [<ffffffffa036f2a7>] btrfs_drop_and_free_fs_root+0x93/0x9b [btrfs]
[ 2612.471251] [<ffffffffa036f353>] btrfs_free_fs_roots+0xa4/0xd6 [btrfs]
[ 2612.472536] [<ffffffff8142612e>] ? wait_for_completion+0x24/0x26
[ 2612.473742] [<ffffffffa0370bbc>] close_ctree+0x1f3/0x33c [btrfs]
[ 2612.475477] [<ffffffff81059d1d>] ? destroy_workqueue+0x148/0x1ba
[ 2612.476695] [<ffffffffa034e3da>] btrfs_put_super+0x19/0x1b [btrfs]
[ 2612.477911] [<ffffffff81153e53>] generic_shutdown_super+0x73/0xef
[ 2612.479106] [<ffffffff811540e2>] kill_anon_super+0x13/0x1e
[ 2612.480226] [<ffffffffa034e1e3>] btrfs_kill_super+0x17/0x23 [btrfs]
[ 2612.481471] [<ffffffff81154307>] deactivate_locked_super+0x3b/0x50
[ 2612.482686] [<ffffffff811547a7>] deactivate_super+0x3f/0x43
[ 2612.483791] [<ffffffff8116b3ed>] cleanup_mnt+0x59/0x78
[ 2612.484842] [<ffffffff8116b44c>] __cleanup_mnt+0x12/0x14
[ 2612.485900] [<ffffffff8105d019>] task_work_run+0x8f/0xbc
[ 2612.486960] [<ffffffff810028d8>] do_notify_resume+0x5a/0x6b
[ 2612.488083] [<ffffffff81236e5b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 2612.489333] [<ffffffff8142a17f>] int_signal+0x12/0x17
[ 2612.490353] ---[ end trace 54a960a6bdcb8d93 ]---
[ 2612.557253] VFS: Busy inodes after unmount of sdb. Self-destruct in 5 seconds. Have a nice day...
Kmemleak confirmed the ordered extent leak (and btrfs inode specific
structures such as delayed nodes):
$ cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff880154290db0 (size 576):
comm "btrfsck", pid 21980, jiffies 4295542503 (age 1273.412s)
hex dump (first 32 bytes):
01 40 00 00 01 00 00 00 b0 1d f1 4e 01 88 ff ff .@.........N....
00 00 00 00 00 00 00 00 c8 0d 29 54 01 88 ff ff ..........)T....
backtrace:
[<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a
[<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83
[<ffffffff8122fb26>] __radix_tree_create+0x109/0x190
[<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac
[<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
[<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
[<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs]
[<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs]
[<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs]
[<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed
[<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa
[<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b
[<ffffffff81429e92>] system_call_fastpath+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff88014ef11db0 (size 576):
comm "rm", pid 22009, jiffies 4295542593 (age 1273.052s)
hex dump (first 32 bytes):
02 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 c8 1d f1 4e 01 88 ff ff ...........N....
backtrace:
[<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a
[<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83
[<ffffffff8122fb26>] __radix_tree_create+0x109/0x190
[<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac
[<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
[<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
[<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs]
[<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs]
[<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs]
[<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed
[<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa
[<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b
[<ffffffff81429e92>] system_call_fastpath+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff8800336feda8 (size 584):
comm "aio-stress", pid 22031, jiffies 4295543006 (age 1271.400s)
hex dump (first 32 bytes):
00 40 3e 00 00 00 00 00 00 00 8f 42 00 00 00 00 .@>........B....
00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 ................
backtrace:
[<ffffffff8114eb34>] create_object+0x172/0x29a
[<ffffffff8141d790>] kmemleak_alloc+0x25/0x41
[<ffffffff81141ae6>] kmemleak_alloc_recursive.constprop.52+0x16/0x18
[<ffffffff81145288>] kmem_cache_alloc+0xf7/0x198
[<ffffffffa0389243>] __btrfs_add_ordered_extent+0x43/0x309 [btrfs]
[<ffffffffa038968b>] btrfs_add_ordered_extent_dio+0x12/0x14 [btrfs]
[<ffffffffa03810e2>] btrfs_get_blocks_direct+0x3ef/0x571 [btrfs]
[<ffffffff81181349>] do_blockdev_direct_IO+0x62a/0xb47
[<ffffffff8118189a>] __blockdev_direct_IO+0x34/0x36
[<ffffffffa03776e5>] btrfs_direct_IO+0x16a/0x1e8 [btrfs]
[<ffffffff81100373>] generic_file_direct_write+0xb8/0x12d
[<ffffffffa038615c>] btrfs_file_write_iter+0x24b/0x42f [btrfs]
[<ffffffff8118bb0d>] aio_run_iocb+0x2b7/0x32e
[<ffffffff8118c99a>] do_io_submit+0x26e/0x2ff
[<ffffffff8118ca3b>] SyS_io_submit+0x10/0x12
[<ffffffff81429e92>] system_call_fastpath+0x12/0x17
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
commit 1a4bcf470c886b955adf36486f4c86f2441d85cb upstream.
We have a scenario where after the fsync log replay we can lose file data
that had been previously fsync'ed if we added an hard link for our inode
and after that we sync'ed the fsync log (for example by fsync'ing some
other file or directory).
This is because when adding an hard link we updated the inode item in the
log tree with an i_size value of 0. At that point the new inode item was
in memory only and a subsequent fsync log replay would not make us lose
the file data. However if after adding the hard link we sync the log tree
to disk, by fsync'ing some other file or directory for example, we ended
up losing the file data after log replay, because the inode item in the
persisted log tree had an an i_size of zero.
This is easy to reproduce, and the following excerpt from my test for
xfstests shows this:
_scratch_mkfs >> $seqres.full 2>&1
_init_flakey
_mount_flakey
# Create one file with data and fsync it.
# This made the btrfs fsync log persist the data and the inode metadata with
# a correct inode->i_size (4096 bytes).
$XFS_IO_PROG -f -c "pwrite -S 0xaa -b 4K 0 4K" -c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now add one hard link to our file. This made the btrfs code update the fsync
# log, in memory only, with an inode metadata having a size of 0.
ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link
# Now force persistence of the fsync log to disk, for example, by fsyncing some
# other file.
touch $SCRATCH_MNT/bar
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
# Before a power loss or crash, we could read the 4Kb of data from our file as
# expected.
echo "File content before:"
od -t x1 $SCRATCH_MNT/foo
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# After the fsync log replay, because the fsync log had a value of 0 for our
# inode's i_size, we couldn't read anymore the 4Kb of data that we previously
# wrote and fsync'ed. The size of the file became 0 after the fsync log replay.
echo "File content after:"
od -t x1 $SCRATCH_MNT/foo
Another alternative test, that doesn't need to fsync an inode in the same
transaction it was created, is:
_scratch_mkfs >> $seqres.full 2>&1
_init_flakey
_mount_flakey
# Create our test file with some data.
$XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Make sure the file is durably persisted.
sync
# Append some data to our file, to increase its size.
$XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Fsync the file, so from this point on if a crash/power failure happens, our
# new data is guaranteed to be there next time the fs is mounted.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
# Add one hard link to our file. This made btrfs write into the in memory fsync
# log a special inode with generation 0 and an i_size of 0 too. Note that this
# didn't update the inode in the fsync log on disk.
ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link
# Now make sure the in memory fsync log is durably persisted.
# Creating and fsync'ing another file will do it.
touch $SCRATCH_MNT/bar
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
# As expected, before the crash/power failure, we should be able to read the
# 12Kb of file data.
echo "File content before:"
od -t x1 $SCRATCH_MNT/foo
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# After mounting the fs again, the fsync log was replayed.
# The btrfs fsync log replay code didn't update the i_size of the persisted
# inode because the inode item in the log had a special generation with a
# value of 0 (and it couldn't know the correct i_size, since that inode item
# had a 0 i_size too). This made the last 4Kb of file data inaccessible and
# effectively lost.
echo "File content after:"
od -t x1 $SCRATCH_MNT/foo
This isn't a new issue/regression. This problem has been around since the
log tree code was added in 2008:
Btrfs: Add a write ahead tree log to optimize synchronous operations
(commit e02119d5a7b4396c5a872582fddc8bd6d305a70a)
Test cases for xfstests follow soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 381cf6587f8a8a8e981bc0c1aaaa8859b51dc756 upstream.
If btrfs_find_item is called with NULL path it allocates one locally but
does not free it. Affected paths are inserting an orphan item for a file
and for a subvol root.
Move the path allocation to the callers.
Fixes: 3f870c289900 ("btrfs: expand btrfs_find_item() to include find_orphan_item functionality")
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5efa0490cc94aee06cd8d282683e22a8ce0a0026 upstream.
This has been confusing people for too long, the message is really just
informative.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6f8960541b1eb6054a642da48daae2320fddba93 upstream.
Commit 1d52c78afbb (Btrfs: try not to ENOSPC on log replay) added a
check to skip delayed inode updates during log replay because it
confuses the enospc code. But the delayed processing will end up
ignoring delayed refs from log replay because the inode itself wasn't
put through the delayed code.
This can end up triggering a warning at commit time:
WARNING: CPU: 2 PID: 778 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x32/0x34()
Which is repeated for each commit because we never process the delayed
inode ref update.
The fix used here is to change btrfs_delayed_delete_inode_ref to return
an error if we're currently in log replay. The caller will do the ref
deletion immediately and everything will work properly.
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 678886bdc6378c1cbd5072da2c5a3035000214e3 upstream.
When we abort a transaction we iterate over all the ranges marked as dirty
in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
from those trees, add them back (unpin) to the free space caches and, if
the fs was mounted with "-o discard", perform a discard on those regions.
Also, after adding the regions to the free space caches, a fitrim ioctl call
can see those ranges in a block group's free space cache and perform a discard
on the ranges, so the same issue can happen without "-o discard" as well.
This causes corruption, affecting one or multiple btree nodes (in the worst
case leaving the fs unmountable) because some of those ranges (the ones in
the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
referred by the last committed super block - breaking the rule that anything
that was committed by a transaction is untouched until the next transaction
commits successfully.
I ran into this while running in a loop (for several hours) the fstest that
I recently submitted:
[PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim
The corruption always happened when a transaction aborted and then fsck complained
like this:
_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
*** fsck.btrfs output ***
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
read block failed check_tree_block
Couldn't open file system
In this case 94945280 corresponded to the root of a tree.
Using frace what I observed was the following sequence of steps happened:
1) transaction N started, fs_info->pinned_extents pointed to
fs_info->freed_extents[0];
2) node/eb 94945280 is created;
3) eb is persisted to disk;
4) transaction N commit starts, fs_info->pinned_extents now points to
fs_info->freed_extents[1], and transaction N completes;
5) transaction N + 1 starts;
6) eb is COWed, and btrfs_free_tree_block() called for this eb;
7) eb range (94945280 to 94945280 + 16Kb) is added to
fs_info->pinned_extents (fs_info->freed_extents[1]);
8) Something goes wrong in transaction N + 1, like hitting ENOSPC
for example, and the transaction is aborted, turning the fs into
readonly mode. The stack trace I got for example:
[112065.253935] [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
[112065.254271] [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
[112065.254567] [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
[112065.261674] [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
[112065.261922] [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
[112065.262211] [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
[112065.262545] [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
[112065.262771] [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
[112065.263105] [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
(...)
[112065.264493] ---[ end trace dd7903a975a31a08 ]---
[112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
[112065.264997] BTRFS info (device sdc): forced readonly
9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
turn calls btrfs_destroy_pinned_extent();
10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
marked as dirty in fs_info->freed_extents[], and for each one
it calls discard, if the fs was mounted with "-o discard", and
adds the range to the free space cache of the respective block
group;
11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
sees the free space entries and performs a discard;
12) After an umount and mount (or fsck), our eb's location on disk was full
of zeroes, and it should have been untouched, because it was marked as
dirty in the fs_info->pinned_extents tree, and therefore used by the
trees that the last committed superblock points to.
Fix this by not performing a discard and not adding the ranges to the free space
caches - it's useless from this point since the fs is now in readonly mode and
we won't write free space caches to disk anymore (otherwise we would leak space)
nor any new superblock. By not adding the ranges to the free space caches, it
prevents other code paths from allocating that space and write to it as well,
therefore being safer and simpler.
This isn't a new problem, as it's been present since 2011 (git commit
acce952b0263825da32cf10489413dec78053347).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 50d9aa99bd35c77200e0e3dd7a72274f8304701f upstream.
Liu Bo pointed out that my previous fix would lose the generation update in the
scenario I described. It is actually much worse than that, we could lose the
entire extent if we lose power right after the transaction commits. Consider
the following
write extent 0-4k
log extent in log tree
commit transaction
< power fail happens here
ordered extent completes
We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
the transaction commit will reset the log so it isn't replayed. If we lose
power before the transaction commit we are save, otherwise we are not.
Fix this by keeping track of all extents we logged in this transaction. Then
when we go to commit the transaction make sure we wait for all of those ordered
extents to complete before proceeding. This will make sure that if we lose
power after the transaction commit we still have our data. This also fixes the
problem of the improperly updated extent generation. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a28046956c71985046474283fa3bcd256915fb72 upstream.
We use the modified list to keep track of which extents have been modified so we
know which ones are candidates for logging at fsync() time. Newly modified
extents are added to the list at modification time, around the same time the
ordered extent is created. We do this so that we don't have to wait for ordered
extents to complete before we know what we need to log. The problem is when
something like this happens
log extent 0-4k on inode 1
copy csum for 0-4k from ordered extent into log
sync log
commit transaction
log some other extent on inode 1
ordered extent for 0-4k completes and adds itself onto modified list again
log changed extents
see ordered extent for 0-4k has already been logged
at this point we assume the csum has been copied
sync log
crash
On replay we will see the extent 0-4k in the log, drop the original 0-4k extent
which is the same one that we are replaying which also drops the csum, and then
we won't find the csum in the log for that bytenr. This of course causes us to
have errors about not having csums for certain ranges of our inode. So remove
the modified list manipulation in unpin_extent_cache, any modified extents
should have been added well before now, and we don't want them re-logged. This
fixes my test that I could reliably reproduce this problem with. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0d95c1bec906dd1ad951c9c001e798ca52baeb0f upstream.
The sizes that are obtained from space infos are in raw units and have
to be adjusted according to the raid factor. This was missing for
f_bavail and df reported doubled size for raid1.
Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Fixes: ba7b6e62f420 ("btrfs: adjust statfs calculations according to raid profiles")
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9dba8cf128ef98257ca719722280c9634e7e9dc7 upstream.
If we have two fsync()'s race on different subvols one will do all of its work
to get into the log_tree, wait on it's outstanding IO, and then allow the
log_tree to finish it's commit. The problem is we were just free'ing that
subvols logged extents instead of waiting on them, so whoever lost the race
wouldn't really have their data on disk. Fix this by waiting properly instead
of freeing the logged extents. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Don Bailey noticed that our page zeroing for compression at end-io time
isn't complete. This reworks a patch from Linus to push the zeroing
into the zlib and lzo specific functions instead of trying to handle the
corners inside btrfs_decompress_buf2page
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reported-by: Don A. Bailey <donb@securitymouse.com>
cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs deadlock fix from Chris Mason:
"This has a fix for a long standing deadlock that we've been trying to
nail down for a while. It ended up being a bad interaction with the
fair reader/writer locks and the order btrfs reacquires locks in the
btree"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: fix lockups from btrfs_clear_path_blocking
|
|
The fair reader/writer locks mean that btrfs_clear_path_blocking needs
to strictly follow lock ordering rules even when we already have
blocking locks on a given path.
Before we can clear a blocking lock on the path, we need to make sure
all of the locks have been converted to blocking. This will remove lock
inversions against anyone spinning in write_lock() against the buffers
we're trying to get read locks on. These inversions didn't exist before
the fair read/writer locks, but now we need to be more careful.
We papered over this deadlock in the past by changing
btrfs_try_read_lock() to be a true trylock against both the spinlock and
the blocking lock. This was slower, and not sufficient to fix all the
deadlocks. This patch adds a btrfs_tree_read_lock_atomic(), which
basically means get the spinlock but trylock on the blocking lock.
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reported-by: Patrick Schmid <schmid@phys.ethz.ch>
cc: stable@vger.kernel.org #v3.15+
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
"It's a one liner for an error cleanup path that leads to crashes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup
|
|
If we hit any errors in btrfs_lookup_csums_range, we'll loop through all
the csums we allocate and free them. But the code was using list_entry
incorrectly, and ended up trying to free the on-stack list_head instead.
This bug came from commit 0678b6185
btrfs: Don't BUG_ON kzalloc error in btrfs_lookup_csums_range()
Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Erik Berg <btrfs@slipsprogrammoer.no>
cc: stable@vger.kernel.org # 3.3 or newer
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"Filipe is nailing down some problems with our skinny extent variation,
and Dave's patch fixes endian problems in the new super block checks"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items
Btrfs: properly clean up btrfs_end_io_wq_cache
Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()
btrfs: use macro accessors in superblock validation checks
|
|
We have a race that can lead us to miss skinny extent items in the function
btrfs_lookup_extent_info() when the skinny metadata feature is enabled.
So basically the sequence of steps is:
1) We search in the extent tree for the skinny extent, which returns > 0
(not found);
2) We check the previous item in the returned leaf for a non-skinny extent,
and we don't find it;
3) Because we didn't find the non-skinny extent in step 2), we release our
path to search the extent tree again, but this time for a non-skinny
extent key;
4) Right after we released our path in step 3), a skinny extent was inserted
in the extent tree (delayed refs were run) - our second extent tree search
will miss it, because it's not looking for a skinny extent;
5) After the second search returned (with ret > 0), we look for any delayed
ref for our extent's bytenr (and we do it while holding a read lock on the
leaf), but we won't find any, as such delayed ref had just run and completed
after we released out path in step 3) before doing the second search.
Fix this by removing completely the path release and re-search logic. This is
safe, because if we seach for a metadata item and we don't find it, we have the
guarantee that the returned leaf is the one where the item would be inserted,
and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the
non-skinny extent item is if it exists. The only case where path->slots[0] is
zero is when there are no smaller keys in the tree (i.e. no left siblings for
our leaf), in which case the re-search logic isn't needed as well.
This race has been present since the introduction of skinny metadata (change
3173a18f70554fe7880bb2d85c7da566e364eb3c).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
In one of Dave's cleanup commits he forgot to call btrfs_end_io_wq_exit on
unload, which makes us unable to unload and then re-load the btrfs module. This
fixes the problem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
If we couldn't find our extent item, we accessed the current slot
(path->slots[0]) to check if it corresponds to an equivalent skinny
metadata item. However this slot could be beyond our last item in the
leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case
we shouldn't process it.
Since btrfs_lookup_extent() is only used to find extent items for data
extents, fix this by removing completely the logic that looks up for an
equivalent skinny metadata item, since it can not exist.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
The initial patch c926093ec516f5d316 (btrfs: add more superblock checks)
did not properly use the macro accessors that wrap endianness and the
code would not work correctly on big endian machines.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
It's already duplicated in btrfs and about to be used in overlayfs too.
Move the sticky bit check to an inline helper and call the out-of-line
helper only in the unlikly case of the sticky bit being set.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs data corruption fix from Chris Mason:
"I'm testing a pull with more fixes, but wanted to get this one out so
Greg can pick it up.
The corruption isn't easy to hit, you have to do a readonly snapshot
and have orphans in the snapshot. But my review and testing missed
the bug. Filipe has added a better xfstest to cover it"
* 'for-linus-update' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Revert "Btrfs: race free update of commit root for ro snapshots"
|
|
Pull core block layer changes from Jens Axboe:
"This is the core block IO pull request for 3.18. Apart from the new
and improved flush machinery for blk-mq, this is all mostly bug fixes
and cleanups.
- blk-mq timeout updates and fixes from Christoph.
- Removal of REQ_END, also from Christoph. We pass it through the
->queue_rq() hook for blk-mq instead, freeing up one of the request
bits. The space was overly tight on 32-bit, so Martin also killed
REQ_KERNEL since it's no longer used.
- blk integrity updates and fixes from Martin and Gu Zheng.
- Update to the flush machinery for blk-mq from Ming Lei. Now we
have a per hardware context flush request, which both cleans up the
code should scale better for flush intensive workloads on blk-mq.
- Improve the error printing, from Rob Elliott.
- Backing device improvements and cleanups from Tejun.
- Fixup of a misplaced rq_complete() tracepoint from Hannes.
- Make blk_get_request() return error pointers, fixing up issues
where we NULL deref when a device goes bad or missing. From Joe
Lawrence.
- Prep work for drastically reducing the memory consumption of dm
devices from Junichi Nomura. This allows creating clone bio sets
without preallocating a lot of memory.
- Fix a blk-mq hang on certain combinations of queue depths and
hardware queues from me.
- Limit memory consumption for blk-mq devices for crash dump
scenarios and drivers that use crazy high depths (certain SCSI
shared tag setups). We now just use a single queue and limited
depth for that"
* 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
block: Remove REQ_KERNEL
blk-mq: allocate cpumask on the home node
bio-integrity: remove the needless fail handle of bip_slab creating
block: include func name in __get_request prints
block: make blk_update_request print prefix match ratelimited prefix
blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
block: fix alignment_offset math that assumes io_min is a power-of-2
blk-mq: Make bt_clear_tag() easier to read
blk-mq: fix potential hang if rolling wakeup depth is too high
block: add bioset_create_nobvec()
block: use bio_clone_fast() in blk_rq_prep_clone()
block: misplaced rq_complete tracepoint
sd: Honor block layer integrity handling flags
block: Replace strnicmp with strncasecmp
block: Add T10 Protection Information functions
block: Don't merge requests if integrity flags differ
block: Integrity checksum flag
block: Relocate bio integrity flags
block: Add a disk flag to block integrity profile
block: Add prefix to block integrity profile flags
...
|
|
This reverts commit 9c3b306e1c9e6be4be09e99a8fe2227d1005effc.
Switching only one commit root during a transaction is wrong because it
leads the fs into an inconsistent state. All commit roots should be
switched at once, at transaction commit time, otherwise backref walking
can often miss important references that were only accessible through
the old commit root. Plus, the root item for the snapshot's root wasn't
getting updated and preventing the next transaction commit to do it.
This made several users get into random corruption issues after creation
of readonly snapshots.
A regression test for xfstests will follow soon.
Cc: stable@vger.kernel.org # 3.17
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch instead allocates the appropriate amount of
memory using a char array using the SHASH_DESC_ON_STACK macro.
The new code can be compiled with both gcc and clang.
Signed-off-by: Vinícius Tinti <viniciustinti@gmail.com>
Reviewed-by: Jan-Simon Möller <dl9pf@gmx.de>
Reviewed-by: Mark Charlebois <charlebm@gmail.com>
Signed-off-by: Behan Webster <behanw@converseincode.com>
Acked-by: Chris Mason <clm@fb.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
"The big thing in this pile is Eric's unmount-on-rmdir series; we
finally have everything we need for that. The final piece of prereqs
is delayed mntput() - now filesystem shutdown always happens on
shallow stack.
Other than that, we have several new primitives for iov_iter (Matt
Wilcox, culled from his XIP-related series) pushing the conversion to
->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c
cleanups and fixes (including the external name refcounting, which
gives consistent behaviour of d_move() wrt procfs symlinks for long
and short names alike) and assorted cleanups and fixes all over the
place.
This is just the first pile; there's a lot of stuff from various
people that ought to go in this window. Starting with
unionmount/overlayfs mess... ;-/"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits)
fs/file_table.c: Update alloc_file() comment
vfs: Deduplicate code shared by xattr system calls operating on paths
reiserfs: remove pointless forward declaration of struct nameidata
don't need that forward declaration of struct nameidata in dcache.h anymore
take dname_external() into fs/dcache.c
let path_init() failures treated the same way as subsequent link_path_walk()
fix misuses of f_count() in ppp and netlink
ncpfs: use list_for_each_entry() for d_subdirs walk
vfs: move getname() from callers to do_mount()
gfs2_atomic_open(): skip lookups on hashed dentry
[infiniband] remove pointless assignments
gadgetfs: saner API for gadgetfs_create_file()
f_fs: saner API for ffs_sb_create_file()
jfs: don't hash direct inode
[s390] remove pointless assignment of ->f_op in vmlogrdr ->open()
ecryptfs: ->f_op is never NULL
android: ->f_op is never NULL
nouveau: __iomem misannotations
missing annotation in fs/file.c
fs: namespace: suppress 'may be used uninitialized' warnings
...
|