Age | Commit message (Collapse) | Author |
|
[ Upstream commit 2f36db71009304b3f0b95afacd8eba1f9f046b87 ]
This prevents users from triggering a stack overflow through a recursive
invocation of pagefault handling that involves mapping procfs files into
virtual memory.
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Tyler Hicks <tyhicks@canonical.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit e54ad7f1ee263ffa5a2de9c609d58dfa27b21cd9 ]
This prevents stacking filesystems (ecryptfs and overlayfs) from using
procfs as lower filesystem. There is too much magic going on inside
procfs, and there is no good reason to stack stuff on top of procfs.
(For example, procfs does access checks in VFS open handlers, and
ecryptfs by design calls open handlers from a kernel thread that doesn't
drop privileges or so.)
Signed-off-by: Jann Horn <jannh@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 3d56c25e3bb0726a5c5e16fc2d9e38f8ed763085 ]
Ascend-to-parent logics in d_walk() depends on all encountered child
dentries not getting freed without an RCU delay. Unfortunately, in
quite a few cases it is not true, with hard-to-hit oopsable race as
the result.
Fortunately, the fix is simiple; right now the rule is "if it ever
been hashed, freeing must be delayed" and changing it to "if it
ever had a parent, freeing must be delayed" closes that hole and
covers all cases the old rule used to cover. Moreover, pipes and
sockets remain _not_ covered, so we do not introduce RCU delay in
the cases which are the reason for having that delay conditional
in the first place.
Cc: stable@vger.kernel.org # v3.2+ (and watch out for __d_materialise_dentry())
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit d71ed6c930ac7d8f88f3cef6624a7e826392d61f ]
MNT_LOCKED implies on a child mount implies the child is locked to the
parent. So while looping through the children the children should be
tested (not their parent).
Typically an unshare of a mount namespace locks all mounts together
making both the parent and the slave as locked but there are a few
corner cases where other things work.
Cc: stable@vger.kernel.org
Fixes: ceeb0e5d39fc ("vfs: Ignore unlocked mounts in fs_fully_visible")
Reported-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 97c1df3e54e811aed484a036a798b4b25d002ecf ]
Add this trivial missing error handling.
Cc: stable@vger.kernel.org
Fixes: 1b852bceb0d1 ("mnt: Refactor the logic for mounting sysfs and proc in a user namespace")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit b74cb9a80268be5c80cf4c87c74debf0ff2129ac ]
The session key is the default keyring set for request_key operations.
This session key is revoked when the user owning the session logs out.
Any long running daemon processes started by this session ends up with
revoked session keyring which prevents these processes from using the
request_key mechanism from obtaining the krb5 keys.
The problem has been reported by a large number of autofs users. The
problem is also seen with multiuser mounts where the share may be used
by processes run by a user who has since logged out. A reproducer using
automount is available on the Red Hat bz.
The patch creates a new keyring which is used to cache cifs spnego
upcalls.
Red Hat bz: 1267754
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reported-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 7d3aa7fe970791f1a674b14572a411accf2f4d4e ]
We don't write back stale inodes so we should skip them in
xfs_iflush_cluster, too.
cc: <stable@vger.kernel.org> # 3.10.x-
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 51b07f30a71c27405259a0248206ed4e22adbee2 ]
Some careless idiot(*) wrote crap code in commit 1a3e8f3 ("xfs:
convert inode cache lookups to use RCU locking") back in late 2010,
and so xfs_iflush_cluster checks the wrong inode for whether it is
still valid under RCU protection. Fix it to lock and check the
correct inode.
(*) Careless-idiot: Dave Chinner <dchinner@redhat.com>
cc: <stable@vger.kernel.org> # 3.10.x-
Discovered-by: Brain Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit b1438f477934f5a4d5a44df26f3079a7575d5946 ]
When a failure due to an inode buffer occurs, the error handling
fails to abort the inode writeback correctly. This can result in the
inode being reclaimed whilst still in the AIL, leading to
use-after-free situations as well as filesystems that cannot be
unmounted as the inode log items left in the AIL never get removed.
Fix this by ensuring fatal errors from xfs_imap_to_bp() result in
the inode flush being aborted correctly.
cc: <stable@vger.kernel.org> # 3.10.x-
Reported-by: Shyam Kaushik <shyam@zadarastorage.com>
Diagnosed-by: Shyam Kaushik <shyam@zadarastorage.com>
Tested-by: Shyam Kaushik <shyam@zadarastorage.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
directories
[ Upstream commit 897fba1172d637d344f009d700f7eb8a1fa262f1 ]
Wrong return code was being returned on SMB3 rmdir of
non-empty directory.
For SMB3 (unlike for cifs), we attempt to delete a directory by
set of delete on close flag on the open. Windows clients set
this flag via a set info (SET_FILE_DISPOSITION to set this flag)
which properly checks if the directory is empty.
With this patch on smb3 mounts we correctly return
"DIRECTORY NOT EMPTY"
on attempts to remove a non-empty directory.
Signed-off-by: Steve French <steve.french@primarydata.com>
CC: Stable <stable@vger.kernel.org>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 1a967d6c9b39c226be1b45f13acd4d8a5ab3dc44 ]
Only server which map unknown users to guest will allow
access using a non-null NTLMv2_Response.
For Samba it's the "map to guest = bad user" option.
BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913
Signed-off-by: Stefan Metzmacher <metze@samba.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 777f69b8d26bf35ade4a76b08f203c11e048365d ]
Only server which map unknown users to guest will allow
access using a non-null NTChallengeResponse.
For Samba it's the "map to guest = bad user" option.
BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913
Signed-off-by: Stefan Metzmacher <metze@samba.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit fa8f3a354bb775ec586e4475bcb07f7dece97e0c ]
Only server which map unknown users to guest will allow
access using a non-null LMChallengeResponse.
For Samba it's the "map to guest = bad user" option.
BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913
Signed-off-by: Stefan Metzmacher <metze@samba.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit cfda35d98298131bf38fbad3ce4cd5ecb3cf18db ]
See [MS-NLMP] 3.2.5.1.2 Server Receives an AUTHENTICATE_MESSAGE from the Client:
...
Set NullSession to FALSE
If (AUTHENTICATE_MESSAGE.UserNameLen == 0 AND
AUTHENTICATE_MESSAGE.NtChallengeResponse.Length == 0 AND
(AUTHENTICATE_MESSAGE.LmChallengeResponse == Z(1)
OR
AUTHENTICATE_MESSAGE.LmChallengeResponse.Length == 0))
-- Special case: client requested anonymous authentication
Set NullSession to TRUE
...
Only server which map unknown users to guest will allow
access using a non-null NTChallengeResponse.
For Samba it's the "map to guest = bad user" option.
BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit c164e038eee805147e95789dddb88ae3b3aca11c ]
As a small zero page, huge zero page should not be accounted in smaps
report as normal page.
For small pages we rely on vm_normal_page() to filter out zero page, but
vm_normal_page() is not designed to handle pmds. We only get here due
hackish cast pmd to pte in smaps_pte_range() -- pte and pmd format is not
necessary compatible on each and every architecture.
Let's add separate codepath to handle pmds. follow_trans_huge_pmd() will
detect huge zero page for us.
We would need pmd_dirty() helper to do this properly. The patch adds it
to THP-enabled architectures which don't yet have one.
[akpm@linux-foundation.org: use do_div to fix 32-bit build]
Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengwei Yin <yfw.kernel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 935244cd54b86ca46e69bc6604d2adfb1aec2d42 ]
Currently, in ext4_mb_init(), there's a loop like the following:
do {
...
offset += 1 << (sb->s_blocksize_bits - i);
i++;
} while (i <= sb->s_blocksize_bits + 1);
Note that the updated offset is used in the loop's next iteration only.
However, at the last iteration, that is at i == sb->s_blocksize_bits + 1,
the shift count becomes equal to (unsigned)-1 > 31 (c.f. C99 6.5.7(3))
and UBSAN reports
UBSAN: Undefined behaviour in fs/ext4/mballoc.c:2621:15
shift exponent 4294967295 is too large for 32-bit type 'int'
[...]
Call Trace:
[<ffffffff818c4d25>] dump_stack+0xbc/0x117
[<ffffffff818c4c69>] ? _atomic_dec_and_lock+0x169/0x169
[<ffffffff819411ab>] ubsan_epilogue+0xd/0x4e
[<ffffffff81941cac>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
[<ffffffff81941ab1>] ? __ubsan_handle_load_invalid_value+0x158/0x158
[<ffffffff814b6dc1>] ? kmem_cache_alloc+0x101/0x390
[<ffffffff816fc13b>] ? ext4_mb_init+0x13b/0xfd0
[<ffffffff814293c7>] ? create_cache+0x57/0x1f0
[<ffffffff8142948a>] ? create_cache+0x11a/0x1f0
[<ffffffff821c2168>] ? mutex_lock+0x38/0x60
[<ffffffff821c23ab>] ? mutex_unlock+0x1b/0x50
[<ffffffff814c26ab>] ? put_online_mems+0x5b/0xc0
[<ffffffff81429677>] ? kmem_cache_create+0x117/0x2c0
[<ffffffff816fcc49>] ext4_mb_init+0xc49/0xfd0
[...]
Observe that the mentioned shift exponent, 4294967295, equals (unsigned)-1.
Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
such calculated value of offset is never used again.
Silence UBSAN by introducing another variable, offset_incr, holding the
next increment to apply to offset and adjust that one by right shifting it
by one position per loop iteration.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161
Cc: stable@vger.kernel.org
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit b5cb316cdf3a3f5f6125412b0f6065185240cfdc ]
Currently, in mb_find_order_for_block(), there's a loop like the following:
while (order <= e4b->bd_blkbits + 1) {
...
bb += 1 << (e4b->bd_blkbits - order);
}
Note that the updated bb is used in the loop's next iteration only.
However, at the last iteration, that is at order == e4b->bd_blkbits + 1,
the shift count becomes negative (c.f. C99 6.5.7(3)) and UBSAN reports
UBSAN: Undefined behaviour in fs/ext4/mballoc.c:1281:11
shift exponent -1 is negative
[...]
Call Trace:
[<ffffffff818c4d35>] dump_stack+0xbc/0x117
[<ffffffff818c4c79>] ? _atomic_dec_and_lock+0x169/0x169
[<ffffffff819411bb>] ubsan_epilogue+0xd/0x4e
[<ffffffff81941cbc>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
[<ffffffff81941ac1>] ? __ubsan_handle_load_invalid_value+0x158/0x158
[<ffffffff816e93a0>] ? ext4_mb_generate_from_pa+0x590/0x590
[<ffffffff816502c8>] ? ext4_read_block_bitmap_nowait+0x598/0xe80
[<ffffffff816e7b7e>] mb_find_order_for_block+0x1ce/0x240
[...]
Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
such calculated value of bb is never used again.
Silence UBSAN by introducing another variable, bb_incr, holding the next
increment to apply to bb and adjust that one by right shifting it by one
position per loop iteration.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161
Cc: stable@vger.kernel.org
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 74177f55b70e2f2be770dd28684dd6d17106a4ba ]
When filesystem is corrupted in the right way, it can happen
ext4_mark_iloc_dirty() in ext4_orphan_add() returns error and we
subsequently remove inode from the in-memory orphan list. However this
deletion is done with list_del(&EXT4_I(inode)->i_orphan) and thus we
leave i_orphan list_head with a stale content. Later we can look at this
content causing list corruption, oops, or other issues. The reported
trace looked like:
WARNING: CPU: 0 PID: 46 at lib/list_debug.c:53 __list_del_entry+0x6b/0x100()
list_del corruption, 0000000061c1d6e0->next is LIST_POISON1
0000000000100100)
CPU: 0 PID: 46 Comm: ext4.exe Not tainted 4.1.0-rc4+ #250
Stack:
60462947 62219960 602ede24 62219960
602ede24 603ca293 622198f0 602f02eb
62219950 6002c12c 62219900 601b4d6b
Call Trace:
[<6005769c>] ? vprintk_emit+0x2dc/0x5c0
[<602ede24>] ? printk+0x0/0x94
[<600190bc>] show_stack+0xdc/0x1a0
[<602ede24>] ? printk+0x0/0x94
[<602ede24>] ? printk+0x0/0x94
[<602f02eb>] dump_stack+0x2a/0x2c
[<6002c12c>] warn_slowpath_common+0x9c/0xf0
[<601b4d6b>] ? __list_del_entry+0x6b/0x100
[<6002c254>] warn_slowpath_fmt+0x94/0xa0
[<602f4d09>] ? __mutex_lock_slowpath+0x239/0x3a0
[<6002c1c0>] ? warn_slowpath_fmt+0x0/0xa0
[<60023ebf>] ? set_signals+0x3f/0x50
[<600a205a>] ? kmem_cache_free+0x10a/0x180
[<602f4e88>] ? mutex_lock+0x18/0x30
[<601b4d6b>] __list_del_entry+0x6b/0x100
[<601177ec>] ext4_orphan_del+0x22c/0x2f0
[<6012f27c>] ? __ext4_journal_start_sb+0x2c/0xa0
[<6010b973>] ? ext4_truncate+0x383/0x390
[<6010bc8b>] ext4_write_begin+0x30b/0x4b0
[<6001bb50>] ? copy_from_user+0x0/0xb0
[<601aa840>] ? iov_iter_fault_in_readable+0xa0/0xc0
[<60072c4f>] generic_perform_write+0xaf/0x1e0
[<600c4166>] ? file_update_time+0x46/0x110
[<60072f0f>] __generic_file_write_iter+0x18f/0x1b0
[<6010030f>] ext4_file_write_iter+0x15f/0x470
[<60094e10>] ? unlink_file_vma+0x0/0x70
[<6009b180>] ? unlink_anon_vmas+0x0/0x260
[<6008f169>] ? free_pgtables+0xb9/0x100
[<600a6030>] __vfs_write+0xb0/0x130
[<600a61d5>] vfs_write+0xa5/0x170
[<600a63d6>] SyS_write+0x56/0xe0
[<6029fcb0>] ? __libc_waitpid+0x0/0xa0
[<6001b698>] handle_syscall+0x68/0x90
[<6002633d>] userspace+0x4fd/0x600
[<6002274f>] ? save_registers+0x1f/0x40
[<60028bd7>] ? arch_prctl+0x177/0x1b0
[<60017bd5>] fork_handler+0x85/0x90
Fix the problem by using list_del_init() as we always should with
i_orphan list.
CC: stable@vger.kernel.org
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 7827a7f6ebfcb7f388dc47fddd48567a314701ba ]
Instead of just printing warning messages, if the orphan list is
corrupted, declare the file system is corrupted. If there are any
reserved inodes in the orphaned inode list, declare the file system
corrupted and stop right away to avoid doing more potential damage to
the file system.
Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit c9eb13a9105e2e418f72e46a2b6da3f49e696902 ]
If the orphaned inode list contains inode #5, ext4_iget() returns a
bad inode (since the bootloader inode should never be referenced
directly). Because of the bad inode, we end up processing the inode
repeatedly and this hangs the machine.
This can be reproduced via:
mke2fs -t ext4 /tmp/foo.img 100
debugfs -w -R "ssv last_orphan 5" /tmp/foo.img
mount -o loop /tmp/foo.img /mnt
(But don't do this if you are using an unpatched kernel if you care
about the system staying functional. :-)
This bug was found by the port of American Fuzzy Lop into the kernel
to find file system problems[1]. (Since it *only* happens if inode #5
shows up on the orphan list --- 3, 7, 8, etc. won't do it, it's not
surprising that AFL needed two hours before it found it.)
[1] http://events.linuxfoundation.org/sites/events/files/slides/AFL%20filesystem%20fuzzing%2C%20Vault%202016_0.pdf
Cc: stable@vger.kernel.org
Reported by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 4c63c2454eff996c5e27991221106eb511f7db38 ]
32-bit ioctl uses these rather than the regular FS_IOC_* versions. They can
be handled in btrfs using the same code. Without this, 32-bit {ch,ls}attr
fail.
Signed-off-by: Luke Dashjr <luke-jr+git@utopios.org>
Cc: stable@vger.kernel.org
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit d0a58e833931234c44e515b5b8bede32bd4e6eed ]
Today, a kernel which refuses to mount a filesystem read-write
due to unknown ro-compat features can still transition to read-write
via the remount path. The old kernel is most likely none the wiser,
because it's unaware of the new feature, and isn't using it. However,
writing to the filesystem may well corrupt metadata related to that
new feature, and moving to a newer kernel which understand the feature
will have problems.
Right now the only ro-compat feature we have is the free inode btree,
which showed up in v3.16. It would be good to push this back to
all the active stable kernels, I think, so that if anyone is using
newer mkfs (which enables the finobt feature) with older kernel
releases, they'll be protected.
Cc: <stable@vger.kernel.org> # 3.10.x-
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit bbe051c841d522bf2aaa1d362b57fe47457187bf ]
There's a bit of a loophole in norecovery mount handling right
now: an initial mount must be readonly, but nothing prevents
a mount -o remount,rw from producing a writable, unrecovered
xfs filesystem.
It might be possible to try to perform a log recovery when this
is requested, but I'm not sure it's worth the effort. For now,
simply disallow this sort of transition.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit c25a1e0671fbca7b2c0d0757d533bd2650d6dc0c ]
Commit 702e5bc68ad2 ("ocfs2: use generic posix ACL infrastructure")
refactored code to use posix_acl_create. The problem with this function
is that it is not mindful of the cluster wide inode lock making it
unsuitable for use with ocfs2 inode creation with ACLs. For example,
when used in ocfs2_mknod, this function can cause deadlock as follows.
The parent dir inode lock is taken when calling posix_acl_create ->
get_acl -> ocfs2_iop_get_acl which takes the inode lock again. This can
cause deadlock if there is a blocked remote lock request waiting for the
lock to be downconverted. And same deadlock happened in ocfs2_reflink.
This fix is to revert back using ocfs2_init_acl.
Fixes: 702e5bc68ad2 ("ocfs2: use generic posix ACL infrastructure")
Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 5ee0fbd50fdf1c1329de8bee35ea9d7c6a81a2e0 ]
Commit 743b5f1434f5 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
introduced this issue. ocfs2_setattr called by chmod command holds
cluster wide inode lock when calling posix_acl_chmod. This latter
function in turn calls ocfs2_iop_get_acl and ocfs2_iop_set_acl. These
two are also called directly from vfs layer for getfacl/setfacl commands
and therefore acquire the cluster wide inode lock. If a remote
conversion request comes after the first inode lock in ocfs2_setattr,
OCFS2_LOCK_BLOCKED will be set. And this will cause the second call to
inode lock from the ocfs2_iop_get_acl() to block indefinetly.
The deleted version of ocfs2_acl_chmod() calls __posix_acl_chmod() which
does not call back into the filesystem. Therefore, we restore
ocfs2_acl_chmod(), modify it slightly for locking as needed, and use that
instead.
Fixes: 743b5f1434f5 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit e073fc58dfe6a4c9b614320c1d56bb71cb213ec4 ]
The code at the "out" label assumes that "default_acl" and "acl" are NULL,
but actually the pointers can be NULL, unitialized, or freed.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 854ee2e944b4daf795e32562a7d2f9e90ab5a6a8 ]
Commit 8f1eb48758aa ("ocfs2: fix umask ignored issue") introduced an
issue, SGID of sub dir was not inherited from its parents dir. It is
because SGID is set into "inode->i_mode" in ocfs2_get_init_inode(), but
is overwritten by "mode" which don't have SGID set later.
Fixes: 8f1eb48758aa ("ocfs2: fix umask ignored issue")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Acked-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 99d825822eade8d827a1817357cbf3f889a552d6 ]
Payloads of NM entries are not supposed to contain NUL. When we run
into such, only the part prior to the first NUL goes into the
concatenation (i.e. the directory entry name being encoded by a bunch
of NM entries). We do stop when the amount collected so far + the
claimed amount in the current NM entry exceed 254. So far, so good,
but what we return as the total length is the sum of *claimed*
sizes, not the actual amount collected. And that can grow pretty
large - not unlimited, since you'd need to put CE entries in
between to be able to get more than the maximum that could be
contained in one isofs directory entry / continuation chunk and
we are stop once we'd encountered 32 CEs, but you can get about 8Kb
easily. And that's what will be passed to readdir callback as the
name length. 8Kb __copy_to_user() from a buffer allocated by
__get_free_page()
Cc: stable@vger.kernel.org # 0.98pl6+ (yes, really)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 5ec0811d30378ae104f250bfc9b3640242d81e3f ]
When the first propgated copy was a slave the following oops would result:
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
> IP: [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
> PGD bacd4067 PUD bac66067 PMD 0
> Oops: 0000 [#1] SMP
> Modules linked in:
> CPU: 1 PID: 824 Comm: mount Not tainted 4.6.0-rc5userns+ #1523
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> task: ffff8800bb0a8000 ti: ffff8800bac3c000 task.ti: ffff8800bac3c000
> RIP: 0010:[<ffffffff811fba4e>] [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
> RSP: 0018:ffff8800bac3fd38 EFLAGS: 00010283
> RAX: 0000000000000000 RBX: ffff8800bb77ec00 RCX: 0000000000000010
> RDX: 0000000000000000 RSI: ffff8800bb58c000 RDI: ffff8800bb58c480
> RBP: ffff8800bac3fd48 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000001ca1 R11: 0000000000001c9d R12: 0000000000000000
> R13: ffff8800ba713800 R14: ffff8800bac3fda0 R15: ffff8800bb77ec00
> FS: 00007f3c0cd9b7e0(0000) GS:ffff8800bfb00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000010 CR3: 00000000bb79d000 CR4: 00000000000006e0
> Stack:
> ffff8800bb77ec00 0000000000000000 ffff8800bac3fd88 ffffffff811fbf85
> ffff8800bac3fd98 ffff8800bb77f080 ffff8800ba713800 ffff8800bb262b40
> 0000000000000000 0000000000000000 ffff8800bac3fdd8 ffffffff811f1da0
> Call Trace:
> [<ffffffff811fbf85>] propagate_mnt+0x105/0x140
> [<ffffffff811f1da0>] attach_recursive_mnt+0x120/0x1e0
> [<ffffffff811f1ec3>] graft_tree+0x63/0x70
> [<ffffffff811f1f6b>] do_add_mount+0x9b/0x100
> [<ffffffff811f2c1a>] do_mount+0x2aa/0xdf0
> [<ffffffff8117efbe>] ? strndup_user+0x4e/0x70
> [<ffffffff811f3a45>] SyS_mount+0x75/0xc0
> [<ffffffff8100242b>] do_syscall_64+0x4b/0xa0
> [<ffffffff81988f3c>] entry_SYSCALL64_slow_path+0x25/0x25
> Code: 00 00 75 ec 48 89 0d 02 22 22 01 8b 89 10 01 00 00 48 89 05 fd 21 22 01 39 8e 10 01 00 00 0f 84 e0 00 00 00 48 8b 80 d8 00 00 00 <48> 8b 50 10 48 89 05 df 21 22 01 48 89 15 d0 21 22 01 8b 53 30
> RIP [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
> RSP <ffff8800bac3fd38>
> CR2: 0000000000000010
> ---[ end trace 2725ecd95164f217 ]---
This oops happens with the namespace_sem held and can be triggered by
non-root users. An all around not pleasant experience.
To avoid this scenario when finding the appropriate source mount to
copy stop the walk up the mnt_master chain when the first source mount
is encountered.
Further rewrite the walk up the last_source mnt_master chain so that
it is clear what is going on.
The reason why the first source mount is special is that it it's
mnt_parent is not a mount in the dest_mnt propagation tree, and as
such termination conditions based up on the dest_mnt mount propgation
tree do not make sense.
To avoid other kinds of confusion last_dest is not changed when
computing last_source. last_dest is only used once in propagate_one
and that is above the point of the code being modified, so changing
the global variable is meaningless and confusing.
Cc: stable@vger.kernel.org
fixes: f2ebb3a921c1ca1e2ddd9242e95a1989a50c4c68 ("smarter propagate_mnt()")
Reported-by: Tycho Andersen <tycho.andersen@canonical.com>
Reviewed-by: Seth Forshee <seth.forshee@canonical.com>
Tested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 7ae8fd0351f912b075149a1e03a017be8b903b9a ]
propagate_one(m) calculates "type" argument for copy_tree() like this:
> if (m->mnt_group_id == last_dest->mnt_group_id) {
> type = CL_MAKE_SHARED;
> } else {
> type = CL_SLAVE;
> if (IS_MNT_SHARED(m))
> type |= CL_MAKE_SHARED;
> }
The "type" argument then governs clone_mnt() behavior with respect to flags
and mnt_master of new mount. When we iterate through a slave group, it is
possible that both current "m" and "last_dest" are not shared (although,
both are slaves, i.e. have non-NULL mnt_master-s). Then the comparison
above erroneously makes new mount shared and sets its mnt_master to
last_source->mnt_master. The patch fixes the problem by handling zero
mnt_group_id-s as though they are unequal.
The similar problem exists in the implementation of "else" clause above
when we have to ascend upward in the master/slave tree by calling:
> last_source = last_source->mnt_master;
> last_dest = last_source->mnt_parent;
proper number of times. The last step is governed by
"n->mnt_group_id != last_dest->mnt_group_id" condition that may lie if
both are zero. The patch fixes this case in the same way as the former one.
[AV: don't open-code an obvious helper...]
Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 10c64cea04d3c75c306b3f990586ffb343b63287 ]
* if we have a hashed negative dentry and either CREAT|EXCL on
r/o filesystem, or CREAT|TRUNC on r/o filesystem, or CREAT|EXCL
with failing may_o_create(), we should fail with EROFS or the
error may_o_create() has returned, but not ENOENT. Which is what
the current code ends up returning.
* if we have CREAT|TRUNC hitting a regular file on a read-only
filesystem, we can't fail with EROFS here. At the very least,
not until we'd done follow_managed() - we might have a writable
file (or a device, for that matter) bound on top of that one.
Moreover, the code downstream will see that O_TRUNC and attempt
to grab the write access (*after* following possible mount), so
if we really should fail with EROFS, it will happen. No need
to do that inside atomic_open().
The real logics is much simpler than what the current code is
trying to do - if we decided to go for simple lookup, ended
up with a negative dentry *and* had create_error set, fail with
create_error. No matter whether we'd got that negative dentry
from lookup_real() or had found it in dcache.
Cc: stable@vger.kernel.org # v3.6+
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit d3efe08400317888f559bbedf0e42cd31575d0ef ]
When we call btrfs_commit_transaction(), we splice the list "ordered"
of our transaction handle into the transaction's "pending_ordered"
list, but we don't re-initialize the "ordered" list of our transaction
handle, this means it still points to the same elements it used to
before the splice. Then we check if the current transaction's state is
>= TRANS_STATE_COMMIT_START and if it is we end up calling
btrfs_end_transaction() which simply splices again the "ordered" list
of our handle into the transaction's "pending_ordered" list, leaving
multiple pointers to the same ordered extents which results in list
corruption when we are iterating, removing and freeing ordered extents
at btrfs_wait_pending_ordered(), resulting in access to dangling
pointers / use-after-free issues.
Similarly, btrfs_end_transaction() can end up in some cases calling
btrfs_commit_transaction(), and both did a list splice of the transaction
handle's "ordered" list into the transaction's "pending_ordered" without
re-initializing the handle's "ordered" list, resulting in exactly the
same problem.
This produces the following warning on a kernel with linked list
debugging enabled:
[109749.265416] ------------[ cut here ]------------
[109749.266410] WARNING: CPU: 7 PID: 324 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
[109749.267969] list_del corruption. prev->next should be ffff8800ba087e20, but was fffffff8c1f7c35d
(...)
[109749.287505] Call Trace:
[109749.288135] [<ffffffff8145f077>] dump_stack+0x4f/0x7b
[109749.298080] [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
[109749.331605] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
[109749.334849] [<ffffffff81260642>] ? __list_del_entry+0x5a/0x98
[109749.337093] [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
[109749.337847] [<ffffffff81260642>] __list_del_entry+0x5a/0x98
[109749.338678] [<ffffffffa053e8bf>] btrfs_wait_pending_ordered+0x46/0xdb [btrfs]
[109749.340145] [<ffffffffa058a65f>] ? __btrfs_run_delayed_items+0x149/0x163 [btrfs]
[109749.348313] [<ffffffffa054077d>] btrfs_commit_transaction+0x36b/0xa10 [btrfs]
[109749.349745] [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
[109749.350819] [<ffffffffa055370d>] btrfs_sync_file+0x36f/0x3fc [btrfs]
[109749.351976] [<ffffffff8118ec98>] vfs_fsync_range+0x8f/0x9e
[109749.360341] [<ffffffff8118ecc3>] vfs_fsync+0x1c/0x1e
[109749.368828] [<ffffffff8118ee1d>] do_fsync+0x34/0x4e
[109749.369790] [<ffffffff8118f045>] SyS_fsync+0x10/0x14
[109749.370925] [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
[109749.382274] ---[ end trace 48e0d07f7c03d95a ]---
On a non-debug kernel this leads to invalid memory accesses, causing a
crash. Fix this by using list_splice_init() instead of list_splice() in
btrfs_commit_transaction() and btrfs_end_transaction().
Cc: stable@vger.kernel.org
Fixes: 50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current transaction V3"
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 56f23fdbb600e6087db7b009775b95ce07cc3195 ]
If we rename an inode A (be it a file or a directory), create a new
inode B with the old name of inode A and under the same parent directory,
fsync inode B and then power fail, at log tree replay time we end up
removing inode A completely. If inode A is a directory then all its files
are gone too.
Example scenarios where this happens:
This is reproducible with the following steps, taken from a couple of
test cases written for fstests which are going to be submitted upstream
soon:
# Scenario 1
mkfs.btrfs -f /dev/sdc
mount /dev/sdc /mnt
mkdir -p /mnt/a/x
echo "hello" > /mnt/a/x/foo
echo "world" > /mnt/a/x/bar
sync
mv /mnt/a/x /mnt/a/y
mkdir /mnt/a/x
xfs_io -c fsync /mnt/a/x
<power failure happens>
The next time the fs is mounted, log tree replay happens and
the directory "y" does not exist nor do the files "foo" and
"bar" exist anywhere (neither in "y" nor in "x", nor the root
nor anywhere).
# Scenario 2
mkfs.btrfs -f /dev/sdc
mount /dev/sdc /mnt
mkdir /mnt/a
echo "hello" > /mnt/a/foo
sync
mv /mnt/a/foo /mnt/a/bar
echo "world" > /mnt/a/foo
xfs_io -c fsync /mnt/a/foo
<power failure happens>
The next time the fs is mounted, log tree replay happens and the
file "bar" does not exists anymore. A file with the name "foo"
exists and it matches the second file we created.
Another related problem that does not involve file/data loss is when a
new inode is created with the name of a deleted snapshot and we fsync it:
mkfs.btrfs -f /dev/sdc
mount /dev/sdc /mnt
mkdir /mnt/testdir
btrfs subvolume snapshot /mnt /mnt/testdir/snap
btrfs subvolume delete /mnt/testdir/snap
rmdir /mnt/testdir
mkdir /mnt/testdir
xfs_io -c fsync /mnt/testdir # or fsync some file inside /mnt/testdir
<power failure>
The next time the fs is mounted the log replay procedure fails because
it attempts to delete the snapshot entry (which has dir item key type
of BTRFS_ROOT_ITEM_KEY) as if it were a regular (non-root) entry,
resulting in the following error that causes mount to fail:
[52174.510532] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
[52174.512570] ------------[ cut here ]------------
[52174.513278] WARNING: CPU: 12 PID: 28024 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x178/0x351 [btrfs]()
[52174.514681] BTRFS: Transaction aborted (error -2)
[52174.515630] Modules linked in: btrfs dm_flakey dm_mod overlay crc32c_generic ppdev xor raid6_pq acpi_cpufreq parport_pc tpm_tis sg parport tpm evdev i2c_piix4 proc
[52174.521568] CPU: 12 PID: 28024 Comm: mount Tainted: G W 4.5.0-rc6-btrfs-next-27+ #1
[52174.522805] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[52174.524053] 0000000000000000 ffff8801df2a7710 ffffffff81264e93 ffff8801df2a7758
[52174.524053] 0000000000000009 ffff8801df2a7748 ffffffff81051618 ffffffffa03591cd
[52174.524053] 00000000fffffffe ffff88015e6e5000 ffff88016dbc3c88 ffff88016dbc3c88
[52174.524053] Call Trace:
[52174.524053] [<ffffffff81264e93>] dump_stack+0x67/0x90
[52174.524053] [<ffffffff81051618>] warn_slowpath_common+0x99/0xb2
[52174.524053] [<ffffffffa03591cd>] ? __btrfs_unlink_inode+0x178/0x351 [btrfs]
[52174.524053] [<ffffffff81051679>] warn_slowpath_fmt+0x48/0x50
[52174.524053] [<ffffffffa03591cd>] __btrfs_unlink_inode+0x178/0x351 [btrfs]
[52174.524053] [<ffffffff8118f5e9>] ? iput+0xb0/0x284
[52174.524053] [<ffffffffa0359fe8>] btrfs_unlink_inode+0x1c/0x3d [btrfs]
[52174.524053] [<ffffffffa038631e>] check_item_in_log+0x1fe/0x29b [btrfs]
[52174.524053] [<ffffffffa0386522>] replay_dir_deletes+0x167/0x1cf [btrfs]
[52174.524053] [<ffffffffa038739e>] fixup_inode_link_count+0x289/0x2aa [btrfs]
[52174.524053] [<ffffffffa038748a>] fixup_inode_link_counts+0xcb/0x105 [btrfs]
[52174.524053] [<ffffffffa038a5ec>] btrfs_recover_log_trees+0x258/0x32c [btrfs]
[52174.524053] [<ffffffffa03885b2>] ? replay_one_extent+0x511/0x511 [btrfs]
[52174.524053] [<ffffffffa034f288>] open_ctree+0x1dd4/0x21b9 [btrfs]
[52174.524053] [<ffffffffa032b753>] btrfs_mount+0x97e/0xaed [btrfs]
[52174.524053] [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf
[52174.524053] [<ffffffff8117bafa>] mount_fs+0x67/0x131
[52174.524053] [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde
[52174.524053] [<ffffffffa032af81>] btrfs_mount+0x1ac/0xaed [btrfs]
[52174.524053] [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf
[52174.524053] [<ffffffff8108c262>] ? lockdep_init_map+0xb9/0x1b3
[52174.524053] [<ffffffff8117bafa>] mount_fs+0x67/0x131
[52174.524053] [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde
[52174.524053] [<ffffffff8119590f>] do_mount+0x8a6/0x9e8
[52174.524053] [<ffffffff811358dd>] ? strndup_user+0x3f/0x59
[52174.524053] [<ffffffff81195c65>] SyS_mount+0x77/0x9f
[52174.524053] [<ffffffff814935d7>] entry_SYSCALL_64_fastpath+0x12/0x6b
[52174.561288] ---[ end trace 6b53049efb1a3ea6 ]---
Fix this by forcing a transaction commit when such cases happen.
This means we check in the commit root of the subvolume tree if there
was any other inode with the same reference when the inode we are
fsync'ing is a new inode (created in the current transaction).
Test cases for fstests, covering all the scenarios given above, were
submitted upstream for fstests:
* fstests: generic test for fsync after renaming directory
https://patchwork.kernel.org/patch/8694281/
* fstests: generic test for fsync after renaming file
https://patchwork.kernel.org/patch/8694301/
* fstests: add btrfs test for fsync after snapshot deletion
https://patchwork.kernel.org/patch/8670671/
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit a89ca6f24ffe435edad57de02eaabd37a2c6bff6 ]
When we have the no_holes feature enabled, if a we truncate a file to a
smaller size, truncate it again but to a size greater than or equals to
its original size and fsync it, the log tree will not have any information
about the hole covering the range [truncate_1_offset, new_file_size[.
Which means if the fsync log is replayed, the file will remain with the
state it had before both truncate operations.
Without the no_holes feature this does not happen, since when the inode
is logged (full sync flag is set) it will find in the fs/subvol tree a
leaf with a generation matching the current transaction id that has an
explicit extent item representing the hole.
Fix this by adding an explicit extent item representing a hole between
the last extent and the inode's i_size if we are doing a full sync.
The issue is easy to reproduce with the following test case for fstests:
. ./common/rc
. ./common/filter
. ./common/dmflakey
_need_to_be_root
_supported_fs generic
_supported_os Linux
_require_scratch
_require_dm_flakey
# This test was motivated by an issue found in btrfs when the btrfs
# no-holes feature is enabled (introduced in kernel 3.14). So enable
# the feature if the fs being tested is btrfs.
if [ $FSTYP == "btrfs" ]; then
_require_btrfs_fs_feature "no_holes"
_require_btrfs_mkfs_feature "no-holes"
MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
fi
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_init_flakey
_mount_flakey
# Create our test files and make sure everything is durably persisted.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" \
-c "pwrite -S 0xbb 64K 61K" \
$SCRATCH_MNT/foo | _filter_xfs_io
$XFS_IO_PROG -f -c "pwrite -S 0xee 0 64K" \
-c "pwrite -S 0xff 64K 61K" \
$SCRATCH_MNT/bar | _filter_xfs_io
sync
# Now truncate our file foo to a smaller size (64Kb) and then truncate
# it to the size it had before the shrinking truncate (125Kb). Then
# fsync our file. If a power failure happens after the fsync, we expect
# our file to have a size of 125Kb, with the first 64Kb of data having
# the value 0xaa and the second 61Kb of data having the value 0x00.
$XFS_IO_PROG -c "truncate 64K" \
-c "truncate 125K" \
-c "fsync" \
$SCRATCH_MNT/foo
# Do something similar to our file bar, but the first truncation sets
# the file size to 0 and the second truncation expands the size to the
# double of what it was initially.
$XFS_IO_PROG -c "truncate 0" \
-c "truncate 253K" \
-c "fsync" \
$SCRATCH_MNT/bar
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again, mount to trigger log replay and validate file
# contents.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# We expect foo to have a size of 125Kb, the first 64Kb of data all
# having the value 0xaa and the remaining 61Kb to be a hole (all bytes
# with value 0x00).
echo "File foo content after log replay:"
od -t x1 $SCRATCH_MNT/foo
# We expect bar to have a size of 253Kb and no extents (any byte read
# from bar has the value 0x00).
echo "File bar content after log replay:"
od -t x1 $SCRATCH_MNT/bar
status=0
exit
The expected file contents in the golden output are:
File foo content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0372000
File bar content after log replay:
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0772000
Without this fix, their contents are:
File foo content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
*
0372000
File bar content after log replay:
0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0200000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0372000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0772000
A test case submission for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 36283bf777d963fac099213297e155d071096994 ]
After commit 4f764e515361 ("Btrfs: remove deleted xattrs on fsync log
replay"), we can end up in a situation where during log replay we end up
deleting xattrs that were never deleted when their file was last fsynced.
This happens in the fast fsync path (flag BTRFS_INODE_NEEDS_FULL_SYNC is
not set in the inode) if the inode has the flag BTRFS_INODE_COPY_EVERYTHING
set, the xattr was added in a past transaction and the leaf where the
xattr is located was not updated (COWed or created) in the current
transaction. In this scenario the xattr item never ends up in the log
tree and therefore at log replay time, which makes the replay code delete
the xattr from the fs/subvol tree as it thinks that xattr was deleted
prior to the last fsync.
Fix this by always logging all xattrs, which is the simplest and most
reliable way to detect deleted xattrs and replay the deletes at log replay
time.
This issue is reproducible with the following test case for fstests:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
here=`pwd`
tmp=/tmp/$$
status=1 # failure is the default!
_cleanup()
{
_cleanup_flakey
rm -f $tmp.*
}
trap "_cleanup; exit \$status" 0 1 2 3 15
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
. ./common/attr
# real QA test starts here
# We create a lot of xattrs for a single file. Only btrfs and xfs are currently
# able to store such a large mount of xattrs per file, other filesystems such
# as ext3/4 and f2fs for example, fail with ENOSPC even if we attempt to add
# less than 1000 xattrs with very small values.
_supported_fs btrfs xfs
_supported_os Linux
_need_to_be_root
_require_scratch
_require_dm_flakey
_require_attrs
_require_metadata_journaling $SCRATCH_DEV
rm -f $seqres.full
_scratch_mkfs >> $seqres.full 2>&1
_init_flakey
_mount_flakey
# Create the test file with some initial data and make sure everything is
# durably persisted.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" $SCRATCH_MNT/foo | _filter_xfs_io
sync
# Add many small xattrs to our file.
# We create such a large amount because it's needed to trigger the issue found
# in btrfs - we need to have an amount that causes the fs to have at least 3
# btree leafs with xattrs stored in them, and it must work on any leaf size
# (maximum leaf/node size is 64Kb).
num_xattrs=2000
for ((i = 1; i <= $num_xattrs; i++)); do
name="user.attr_$(printf "%04d" $i)"
$SETFATTR_PROG -n $name -v "val_$(printf "%04d" $i)" $SCRATCH_MNT/foo
done
# Sync the filesystem to force a commit of the current btrfs transaction, this
# is a necessary condition to trigger the bug on btrfs.
sync
# Now update our file's data and fsync the file.
# After a successful fsync, if the fsync log/journal is replayed we expect to
# see all the xattrs we added before with the same values (and the updated file
# data of course). Btrfs used to delete some of these xattrs when it replayed
# its fsync log/journal.
$XFS_IO_PROG -c "pwrite -S 0xbb 8K 16K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again and mount. This makes the fs replay its fsync log.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
echo "File content after crash and log replay:"
od -t x1 $SCRATCH_MNT/foo
echo "File xattrs after crash and log replay:"
for ((i = 1; i <= $num_xattrs; i++)); do
name="user.attr_$(printf "%04d" $i)"
echo -n "$name="
$GETFATTR_PROG --absolute-names -n $name --only-values $SCRATCH_MNT/foo
echo
done
status=0
exit
The golden output expects all xattrs to be available, and with the correct
values, after the fsync log is replayed.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit e4545de5b035c7debb73d260c78377dbb69cbfb5 ]
If we do an append write to a file (which increases its inode's i_size)
that does not have the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode,
and the previous transaction added a new hard link to the file, which sets
the flag BTRFS_INODE_COPY_EVERYTHING in the file's inode, and then fsync
the file, the inode's new i_size isn't logged. This has the consequence
that after the fsync log is replayed, the file size remains what it was
before the append write operation, which means users/applications will
not be able to read the data that was successsfully fsync'ed before.
This happens because neither the inode item nor the delayed inode get
their i_size updated when the append write is made - doing so would
require starting a transaction in the buffered write path, something that
we do not do intentionally for performance reasons.
Fix this by making sure that when the flag BTRFS_INODE_COPY_EVERYTHING is
set the inode is logged with its current i_size (log the in-memory inode
into the log tree).
This issue is not a recent regression and is easy to reproduce with the
following test case for fstests:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
here=`pwd`
tmp=/tmp/$$
status=1 # failure is the default!
_cleanup()
{
_cleanup_flakey
rm -f $tmp.*
}
trap "_cleanup; exit \$status" 0 1 2 3 15
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_supported_fs generic
_supported_os Linux
_need_to_be_root
_require_scratch
_require_dm_flakey
_require_metadata_journaling $SCRATCH_DEV
_crash_and_mount()
{
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again and mount. This makes the fs replay its fsync log.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
}
rm -f $seqres.full
_scratch_mkfs >> $seqres.full 2>&1
_init_flakey
_mount_flakey
# Create the test file with some initial data and then fsync it.
# The fsync here is only needed to trigger the issue in btrfs, as it causes the
# the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
sync
# Add a hard link to our file.
# On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
# which is a necessary condition to trigger the issue.
ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Sync the filesystem to force a commit of the current btrfs transaction, this
# is a necessary condition to trigger the bug on btrfs.
sync
# Now append more data to our file, increasing its size, and fsync the file.
# In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
# write path did not update the inode item in the btree nor the delayed inode
# item (in memory struture) in the current transaction (created by the fsync
# handler), the fsync did not record the inode's new i_size in the fsync
# log/journal. This made the data unavailable after the fsync log/journal is
# replayed.
$XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
echo "File content after fsync and before crash:"
od -t x1 $SCRATCH_MNT/foo
_crash_and_mount
echo "File content after crash and log replay:"
od -t x1 $SCRATCH_MNT/foo
status=0
exit
The expected file output before and after the crash/power failure expects the
appended data to be available, which is:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0100000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
*
0200000
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit c325a67c72903e1cc30e990a15ce745bda0dbfde ]
Previously, ext4 would fail the mount if the file system had the quota
feature enabled and quota mount options (used for the older quota
setups) were present. This broke xfstests, since xfs silently ignores
the usrquote and grpquota mount options if they are specified. This
commit changes things so that we are consistent with xfs; having the
mount options specified is harmless, so no sense break users by
forbidding them.
Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit daf647d2dd58cec59570d7698a45b98e580f2076 ]
With the internal Quota feature, mke2fs creates empty quota inodes and
quota usage tracking is enabled as soon as the file system is mounted.
Since quotacheck is no longer preallocating all of the blocks in the
quota inode that are likely needed to be written to, we are now seeing
a lockdep false positive caused by needing to allocate a quota block
from inside ext4_map_blocks(), while holding i_data_sem for a data
inode. This results in this complaint:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&ei->i_data_sem);
lock(&s->s_dquot.dqio_mutex);
lock(&ei->i_data_sem);
lock(&s->s_dquot.dqio_mutex);
Google-Bug-Id: 27907753
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit be12b299a83fc807bbaccd2bcb8ec50cbb0cb55c ]
When master handles convert request, it queues ast first and then
returns status. This may happen that the ast is sent before the request
status because the above two messages are sent by two threads. And
right after the ast is sent, if master down, it may trigger BUG in
dlm_move_lockres_to_recovery_list in the requested node because ast
handler moves it to grant list without clear lock->convert_pending. So
remove BUG_ON statement and check if the ast is processed in
dlmconvert_remote.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Tariq Saeed <tariq.x.saeed@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit ac7cf246dfdbec3d8fed296c7bf30e16f5099dac ]
There is a race window between dlmconvert_remote and
dlm_move_lockres_to_recovery_list, which will cause a lock with
OCFS2_LOCK_BUSY in grant list, thus system hangs.
dlmconvert_remote
{
spin_lock(&res->spinlock);
list_move_tail(&lock->list, &res->converting);
lock->convert_pending = 1;
spin_unlock(&res->spinlock);
status = dlm_send_remote_convert_request();
>>>>>> race window, master has queued ast and return DLM_NORMAL,
and then down before sending ast.
this node detects master down and calls
dlm_move_lockres_to_recovery_list, which will revert the
lock to grant list.
Then OCFS2_LOCK_BUSY won't be cleared as new master won't
send ast any more because it thinks already be authorized.
spin_lock(&res->spinlock);
lock->convert_pending = 0;
if (status != DLM_NORMAL)
dlm_revert_pending_convert(res, lock);
spin_unlock(&res->spinlock);
}
In this case, check if res->state has DLM_LOCK_RES_RECOVERING bit set
(res is still in recovering) or res master changed (new master has
finished recovery), reset the status to DLM_RECOVERING, then it will
retry convert.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Tariq Saeed <tariq.x.saeed@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 378c6520e7d29280f400ef2ceaf155c86f05a71a ]
This commit fixes the following security hole affecting systems where
all of the following conditions are fulfilled:
- The fs.suid_dumpable sysctl is set to 2.
- The kernel.core_pattern sysctl's value starts with "/". (Systems
where kernel.core_pattern starts with "|/" are not affected.)
- Unprivileged user namespace creation is permitted. (This is
true on Linux >=3.8, but some distributions disallow it by
default using a distro patch.)
Under these conditions, if a program executes under secure exec rules,
causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
namespace, changes its root directory and crashes, the coredump will be
written using fsuid=0 and a path derived from kernel.core_pattern - but
this path is interpreted relative to the root directory of the process,
allowing the attacker to control where a coredump will be written with
root privileges.
To fix the security issue, always interpret core_pattern for dumps that
are written under SUID_DUMP_ROOT relative to the root directory of init.
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit d6785d9152147596f60234157da2b02540c3e60f ]
Running the following command:
busybox cat /sys/kernel/debug/tracing/trace_pipe > /dev/null
with any tracing enabled pretty very quickly leads to various NULL
pointer dereferences and VM BUG_ON()s, such as these:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: [<ffffffff8119df6c>] generic_pipe_buf_release+0xc/0x40
Call Trace:
[<ffffffff811c48a3>] splice_direct_to_actor+0x143/0x1e0
[<ffffffff811c42e0>] ? generic_pipe_buf_nosteal+0x10/0x10
[<ffffffff811c49cf>] do_splice_direct+0x8f/0xb0
[<ffffffff81196869>] do_sendfile+0x199/0x380
[<ffffffff81197600>] SyS_sendfile64+0x90/0xa0
[<ffffffff8192cbee>] entry_SYSCALL_64_fastpath+0x12/0x6d
page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
kernel BUG at include/linux/mm.h:367!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
RIP: [<ffffffff8119df9c>] generic_pipe_buf_release+0x3c/0x40
Call Trace:
[<ffffffff811c48a3>] splice_direct_to_actor+0x143/0x1e0
[<ffffffff811c42e0>] ? generic_pipe_buf_nosteal+0x10/0x10
[<ffffffff811c49cf>] do_splice_direct+0x8f/0xb0
[<ffffffff81196869>] do_sendfile+0x199/0x380
[<ffffffff81197600>] SyS_sendfile64+0x90/0xa0
[<ffffffff8192cd1e>] tracesys_phase2+0x84/0x89
(busybox's cat uses sendfile(2), unlike the coreutils version)
This is because tracing_splice_read_pipe() can call splice_to_pipe()
with spd->nr_pages == 0. spd_pages underflows in splice_to_pipe() and
we fill the page pointers and the other fields of the pipe_buffers with
garbage.
All other callers of splice_to_pipe() avoid calling it when nr_pages ==
0, and we could make tracing_splice_read_pipe() do that too, but it
seems reasonable to have splice_to_page() handle this condition
gracefully.
Cc: stable@vger.kernel.org
Signed-off-by: Rabin Vincent <rabin@rab.in>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 2f6fc056e899bd0144a08da5cacaecbe8997cd74 ]
nfsd_lookup_dentry exits with the parent filehandle locked. fh_put also
unlocks if necessary (nfsd filehandle locking is probably too lenient),
so it gets unlocked eventually, but if the following op in the compound
needs to lock it again, we can deadlock.
A fuzzer ran into this; normal clients don't send a secinfo followed by
a readdir in the same compound.
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit c0a2ad9b50dd80eeccd73d9ff962234590d5ec93 ]
On umount path, jbd2_journal_destroy() writes latest transaction ID
(->j_tail_sequence) to be used at next mount.
The bug is that ->j_tail_sequence is not holding latest transaction ID
in some cases. So, at next mount, there is chance to conflict with
remaining (not overwritten yet) transactions.
mount (id=10)
write transaction (id=11)
write transaction (id=12)
umount (id=10) <= the bug doesn't write latest ID
mount (id=10)
write transaction (id=11)
crash
mount
[recovery process]
transaction (id=11)
transaction (id=12) <= valid transaction ID, but old commit
must not replay
Like above, this bug become the cause of recovery failure, or FS
corruption.
So why ->j_tail_sequence doesn't point latest ID?
Because if checkpoint transactions was reclaimed by memory pressure
(i.e. bdev_try_to_free_page()), then ->j_tail_sequence is not updated.
(And another case is, __jbd2_journal_clean_checkpoint_list() is called
with empty transaction.)
So in above cases, ->j_tail_sequence is not pointing latest
transaction ID at umount path. Plus, REQ_FLUSH for checkpoint is not
done too.
So, to fix this problem with minimum changes, this patch updates
->j_tail_sequence, and issue REQ_FLUSH. (With more complex changes,
some optimizations would be possible to avoid unnecessary REQ_FLUSH
for example though.)
BTW,
journal->j_tail_sequence =
++journal->j_transaction_sequence;
Increment of ->j_transaction_sequence seems to be unnecessary, but
ext3 does this.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit ab73ef46398e2c0159f3a71de834586422d2a44a ]
When dqget() in __dquot_initialize() fails e.g. due to IO error,
__dquot_initialize() will pass an array of uninitialized pointers to
dqput_all() and thus can lead to deference of random data. Fix the
problem by properly initializing the array.
CC: stable@vger.kernel.org
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 2e83b79b2d6c78bf1b4aa227938a214dcbddc83f ]
This plugs 2 trivial leaks in xfs_attr_shortform_list and
xfs_attr3_leaf_list_int.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 4aed9c46afb80164401143aa0fdcfe3798baa9d5 ]
A number of spots in the xdr decoding follow a pattern like
n = be32_to_cpup(p++);
READ_BUF(n + 4);
where n is a u32. The only bounds checking is done in READ_BUF itself,
but since it's checking (n + 4), it won't catch cases where n is very
large, (u32)(-4) or higher. I'm not sure exactly what the consequences
are, but we've seen crashes soon after.
Instead, just break these up into two READ_BUF()s.
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 87f9a031af48defee9f34c6aaf06d6f1988c244d ]
In commit bcff24887d00 ("ext4: don't read blocks from disk after extents
being swapped") bh is not updated correctly in the for loop and wrong
data has been written to disk. generic/324 catches this on sub-page
block size ext4.
Fixes: bcff24887d00 ("ext4: don't read blocks from disk after extentsbeing swapped")
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 0a95b851370b84a4b9d92ee6d1fa0926901d0454 ]
Parameter of trace_btrfs_work_queued() can be freed in its workqueue.
So no one use use that pointer after queue_work().
Fix the user-after-free bug by move the trace line before queue_work().
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|
|
[ Upstream commit 546bed631203344611f42b2af1d224d2eedb4e6b ]
I managed to trigger this:
| INFO: trying to register non-static key.
| the code is fine but needs lockdep annotation.
| turning off the locking correctness validator.
| CPU: 1 PID: 781 Comm: systemd-gpt-aut Not tainted 4.4.0-rt2+ #14
| Hardware name: ARM-Versatile Express
| [<80307cec>] (dump_stack)
| [<80070e98>] (__lock_acquire)
| [<8007184c>] (lock_acquire)
| [<80287800>] (btrfs_ioctl)
| [<8012a8d4>] (do_vfs_ioctl)
| [<8012ac14>] (SyS_ioctl)
so I think that btrfs_device_data_ordered_init() is not invoked behind
a macro somewhere.
Fixes: 7cc8e58d53cd ("Btrfs: fix unprotected device's variants on 32bits machine")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
|