Age | Commit message (Collapse) | Author |
|
commit 3d56c25e3bb0726a5c5e16fc2d9e38f8ed763085 upstream.
Ascend-to-parent logics in d_walk() depends on all encountered child
dentries not getting freed without an RCU delay. Unfortunately, in
quite a few cases it is not true, with hard-to-hit oopsable race as
the result.
Fortunately, the fix is simiple; right now the rule is "if it ever
been hashed, freeing must be delayed" and changing it to "if it
ever had a parent, freeing must be delayed" closes that hole and
covers all cases the old rule used to cover. Moreover, pipes and
sockets remain _not_ covered, so we do not introduce RCU delay in
the cases which are the reason for having that delay conditional
in the first place.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e54ad7f1ee263ffa5a2de9c609d58dfa27b21cd9 upstream.
This prevents stacking filesystems (ecryptfs and overlayfs) from using
procfs as lower filesystem. There is too much magic going on inside
procfs, and there is no good reason to stack stuff on top of procfs.
(For example, procfs does access checks in VFS open handlers, and
ecryptfs by design calls open handlers from a kernel thread that doesn't
drop privileges or so.)
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2f36db71009304b3f0b95afacd8eba1f9f046b87 upstream.
This prevents users from triggering a stack overflow through a recursive
invocation of pagefault handling that involves mapping procfs files into
virtual memory.
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7d3aa7fe970791f1a674b14572a411accf2f4d4e upstream.
We don't write back stale inodes so we should skip them in
xfs_iflush_cluster, too.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 51b07f30a71c27405259a0248206ed4e22adbee2 upstream.
Some careless idiot(*) wrote crap code in commit 1a3e8f3 ("xfs:
convert inode cache lookups to use RCU locking") back in late 2010,
and so xfs_iflush_cluster checks the wrong inode for whether it is
still valid under RCU protection. Fix it to lock and check the
correct inode.
(*) Careless-idiot: Dave Chinner <dchinner@redhat.com>
Discovered-by: Brain Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b1438f477934f5a4d5a44df26f3079a7575d5946 upstream.
When a failure due to an inode buffer occurs, the error handling
fails to abort the inode writeback correctly. This can result in the
inode being reclaimed whilst still in the AIL, leading to
use-after-free situations as well as filesystems that cannot be
unmounted as the inode log items left in the AIL never get removed.
Fix this by ensuring fatal errors from xfs_imap_to_bp() result in
the inode flush being aborted correctly.
Reported-by: Shyam Kaushik <shyam@zadarastorage.com>
Diagnosed-by: Shyam Kaushik <shyam@zadarastorage.com>
Tested-by: Shyam Kaushik <shyam@zadarastorage.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8179c03629de67f515d3ab825b5a9428687d4b85 upstream.
Joe Lawrence reported a list_add corruption with 4.6-rc1 when
testing some custom md administration code that made it's own
block device nodes for the md array. The simple test loop of:
for i in {0..100}; do
mknod --mode=0600 $tmp/tmp_node b $MAJOR $MINOR
mdadm --detail --export $tmp/tmp_node > /dev/null
rm -f $tmp/tmp_node
done
Would produce this warning in bd_acquire() when mdadm opened the
device node:
list_add double add: new=ffff88043831c7b8, prev=ffff8804380287d8, next=ffff88043831c7b8.
And then produce this from bd_forget from kdevtmpfs evicting a block
dev inode:
list_del corruption. prev->next should be ffff8800bb83eb10, but was ffff88043831c7b8
This is a regression caused by commit c19b3b05 ("xfs: mode di_mode
to vfs inode"). The issue is that xfs_inactive() frees the
unlinked inode, and the above commit meant that this freeing zeroed
the mode in the struct inode. The problem is that after evict() has
called ->evict_inode, it expects the i_mode to be intact so that it
can call bd_forget() or cd_forget() to drop the reference to the
block device inode attached to the XFS inode.
In reality, the only thing we do in xfs_fs_evict_inode() that is not
generic is call xfs_inactive(). We can move the xfs_inactive() call
to xfs_fs_destroy_inode() without any problems at all, and this
will leave the VFS inode intact until it is completely done with it.
So, remove xfs_fs_evict_inode(), and do the work it used to do in
->destroy_inode instead.
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ad747e3b299671e1a53db74963cc6c5f6cdb9f6d upstream.
Commit 96f859d ("libxfs: pack the agfl header structure so
XFS_AGFL_SIZE is correct") allowed the freelist to use the empty
slot at the end of the freelist on 64 bit systems that was not
being used due to sizeof() rounding up the structure size.
This has caused versions of xfs_repair prior to 4.5.0 (which also
has the fix) to report this as a corruption once the filesystem has
been grown. Older kernels can also have problems (seen from a whacky
container/vm management environment) mounting filesystems grown on a
system with a newer kernel than the vm/container it is deployed on.
To avoid this problem, change the initial free list indexes not to
wrap across the end of the AGFL, hence avoiding the initialisation
of agf_fllast to the last index in the AGFL.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d0a58e833931234c44e515b5b8bede32bd4e6eed upstream.
Today, a kernel which refuses to mount a filesystem read-write
due to unknown ro-compat features can still transition to read-write
via the remount path. The old kernel is most likely none the wiser,
because it's unaware of the new feature, and isn't using it. However,
writing to the filesystem may well corrupt metadata related to that
new feature, and moving to a newer kernel which understand the feature
will have problems.
Right now the only ro-compat feature we have is the free inode btree,
which showed up in v3.16. It would be good to push this back to
all the active stable kernels, I think, so that if anyone is using
newer mkfs (which enables the finobt feature) with older kernel
releases, they'll be protected.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ade8febde0271513360bac44883dbebad44276c3 upstream.
Since the patch "NFS: Allow multiple commit requests in flight per file"
we can run multiple simultaneous commits on the same inode. This
introduced a race over collecting pages to commit that made it possible
to call nfs_init_commit() with an empty list - which causes crashes like
the one below.
The fix is to catch this race and avoid calling nfs_init_commit and
initiate_commit when there is no work to do.
Here is the crash:
[600522.076832] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[600522.078475] IP: [<ffffffffa0479e72>] nfs_init_commit+0x22/0x130 [nfs]
[600522.078745] PGD 4272b1067 PUD 4272cb067 PMD 0
[600522.078972] Oops: 0000 [#1] SMP
[600522.079204] Modules linked in: nfsv3 nfs_layout_flexfiles rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache dcdbas ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw vmw_vsock_vmci_transport vsock bonding ipmi_devintf ipmi_msghandler coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev vmw_balloon parport_pc parport acpi_cpufreq vmw_vmci i2c_piix4 shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c vmwgfx drm_kms_helper ttm drm crc32c_intel serio_raw vmxnet3
[600522.081380] vmw_pvscsi ata_generic pata_acpi
[600522.081809] CPU: 3 PID: 15667 Comm: /usr/bin/python Not tainted 4.1.9-100.pd.88.el7.x86_64 #1
[600522.082281] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
[600522.082814] task: ffff8800bbbfa780 ti: ffff88042ae84000 task.ti: ffff88042ae84000
[600522.083378] RIP: 0010:[<ffffffffa0479e72>] [<ffffffffa0479e72>] nfs_init_commit+0x22/0x130 [nfs]
[600522.083973] RSP: 0018:ffff88042ae87438 EFLAGS: 00010246
[600522.084571] RAX: 0000000000000000 RBX: ffff880003485e40 RCX: ffff88042ae87588
[600522.085188] RDX: 0000000000000000 RSI: ffff88042ae874b0 RDI: ffff880003485e40
[600522.085756] RBP: ffff88042ae87448 R08: ffff880003486010 R09: ffff88042ae874b0
[600522.086332] R10: 0000000000000000 R11: 0000000000000005 R12: ffff88042ae872d0
[600522.086905] R13: ffff88042ae874b0 R14: ffff880003485e40 R15: ffff88042704c840
[600522.087484] FS: 00007f4728ff2740(0000) GS:ffff88043fd80000(0000) knlGS:0000000000000000
[600522.088070] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[600522.088663] CR2: 0000000000000040 CR3: 000000042b6aa000 CR4: 00000000001406e0
[600522.089327] Stack:
[600522.089926] 0000000000000001 ffff88042ae87588 ffff88042ae874f8 ffffffffa04f09fa
[600522.090549] 0000000000017840 0000000000017840 ffff88042ae87588 ffff8803258d9930
[600522.091169] ffff88042ae87578 ffffffffa0563d80 0000000000000000 ffff88042704c840
[600522.091789] Call Trace:
[600522.092420] [<ffffffffa04f09fa>] pnfs_generic_commit_pagelist+0x1da/0x320 [nfsv4]
[600522.093052] [<ffffffffa0563d80>] ? ff_layout_commit_prepare_v3+0x30/0x30 [nfs_layout_flexfiles]
[600522.093696] [<ffffffffa0562645>] ff_layout_commit_pagelist+0x15/0x20 [nfs_layout_flexfiles]
[600522.094359] [<ffffffffa047bc78>] nfs_generic_commit_list+0xe8/0x120 [nfs]
[600522.095032] [<ffffffffa047bd6a>] nfs_commit_inode+0xba/0x110 [nfs]
[600522.095719] [<ffffffffa046ac54>] nfs_release_page+0x44/0xd0 [nfs]
[600522.096410] [<ffffffff811a8122>] try_to_release_page+0x32/0x50
[600522.097109] [<ffffffff811bd4f1>] shrink_page_list+0x961/0xb30
[600522.097812] [<ffffffff811bdced>] shrink_inactive_list+0x1cd/0x550
[600522.098530] [<ffffffff811bea65>] shrink_lruvec+0x635/0x840
[600522.099250] [<ffffffff811bed60>] shrink_zone+0xf0/0x2f0
[600522.099974] [<ffffffff811bf312>] do_try_to_free_pages+0x192/0x470
[600522.100709] [<ffffffff811bf6ca>] try_to_free_pages+0xda/0x170
[600522.101464] [<ffffffff811b2198>] __alloc_pages_nodemask+0x588/0x970
[600522.102235] [<ffffffff811fbbd5>] alloc_pages_vma+0xb5/0x230
[600522.103000] [<ffffffff813a1589>] ? cpumask_any_but+0x39/0x50
[600522.103774] [<ffffffff811d6115>] wp_page_copy.isra.55+0x95/0x490
[600522.104558] [<ffffffff810e3438>] ? __wake_up+0x48/0x60
[600522.105357] [<ffffffff811d7d3b>] do_wp_page+0xab/0x4f0
[600522.106137] [<ffffffff810a1bbb>] ? release_task+0x36b/0x470
[600522.106902] [<ffffffff8126dbd7>] ? eventfd_ctx_read+0x67/0x1c0
[600522.107659] [<ffffffff811da2a8>] handle_mm_fault+0xc78/0x1900
[600522.108431] [<ffffffff81067ef1>] __do_page_fault+0x181/0x420
[600522.109173] [<ffffffff811446a6>] ? __audit_syscall_exit+0x1e6/0x280
[600522.109893] [<ffffffff810681c0>] do_page_fault+0x30/0x80
[600522.110594] [<ffffffff81024f36>] ? syscall_trace_leave+0xc6/0x120
[600522.111288] [<ffffffff81790a58>] page_fault+0x28/0x30
[600522.111947] Code: 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 4c 8d 87 d0 01 00 00 48 89 e5 53 48 89 fb 48 83 ec 08 4c 8b 0e 49 8b 41 18 4c 39 ce <48> 8b 40 40 4c 8b 50 30 74 24 48 8b 87 d0 01 00 00 48 8b 7e 08
[600522.113343] RIP [<ffffffffa0479e72>] nfs_init_commit+0x22/0x130 [nfs]
[600522.114003] RSP <ffff88042ae87438>
[600522.114636] CR2: 0000000000000040
Fixes: af7cf057 (NFS: Allow multiple commit requests in flight per file)
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 935244cd54b86ca46e69bc6604d2adfb1aec2d42 upstream.
Currently, in ext4_mb_init(), there's a loop like the following:
do {
...
offset += 1 << (sb->s_blocksize_bits - i);
i++;
} while (i <= sb->s_blocksize_bits + 1);
Note that the updated offset is used in the loop's next iteration only.
However, at the last iteration, that is at i == sb->s_blocksize_bits + 1,
the shift count becomes equal to (unsigned)-1 > 31 (c.f. C99 6.5.7(3))
and UBSAN reports
UBSAN: Undefined behaviour in fs/ext4/mballoc.c:2621:15
shift exponent 4294967295 is too large for 32-bit type 'int'
[...]
Call Trace:
[<ffffffff818c4d25>] dump_stack+0xbc/0x117
[<ffffffff818c4c69>] ? _atomic_dec_and_lock+0x169/0x169
[<ffffffff819411ab>] ubsan_epilogue+0xd/0x4e
[<ffffffff81941cac>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
[<ffffffff81941ab1>] ? __ubsan_handle_load_invalid_value+0x158/0x158
[<ffffffff814b6dc1>] ? kmem_cache_alloc+0x101/0x390
[<ffffffff816fc13b>] ? ext4_mb_init+0x13b/0xfd0
[<ffffffff814293c7>] ? create_cache+0x57/0x1f0
[<ffffffff8142948a>] ? create_cache+0x11a/0x1f0
[<ffffffff821c2168>] ? mutex_lock+0x38/0x60
[<ffffffff821c23ab>] ? mutex_unlock+0x1b/0x50
[<ffffffff814c26ab>] ? put_online_mems+0x5b/0xc0
[<ffffffff81429677>] ? kmem_cache_create+0x117/0x2c0
[<ffffffff816fcc49>] ext4_mb_init+0xc49/0xfd0
[...]
Observe that the mentioned shift exponent, 4294967295, equals (unsigned)-1.
Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
such calculated value of offset is never used again.
Silence UBSAN by introducing another variable, offset_incr, holding the
next increment to apply to offset and adjust that one by right shifting it
by one position per loop iteration.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b5cb316cdf3a3f5f6125412b0f6065185240cfdc upstream.
Currently, in mb_find_order_for_block(), there's a loop like the following:
while (order <= e4b->bd_blkbits + 1) {
...
bb += 1 << (e4b->bd_blkbits - order);
}
Note that the updated bb is used in the loop's next iteration only.
However, at the last iteration, that is at order == e4b->bd_blkbits + 1,
the shift count becomes negative (c.f. C99 6.5.7(3)) and UBSAN reports
UBSAN: Undefined behaviour in fs/ext4/mballoc.c:1281:11
shift exponent -1 is negative
[...]
Call Trace:
[<ffffffff818c4d35>] dump_stack+0xbc/0x117
[<ffffffff818c4c79>] ? _atomic_dec_and_lock+0x169/0x169
[<ffffffff819411bb>] ubsan_epilogue+0xd/0x4e
[<ffffffff81941cbc>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
[<ffffffff81941ac1>] ? __ubsan_handle_load_invalid_value+0x158/0x158
[<ffffffff816e93a0>] ? ext4_mb_generate_from_pa+0x590/0x590
[<ffffffff816502c8>] ? ext4_read_block_bitmap_nowait+0x598/0xe80
[<ffffffff816e7b7e>] mb_find_order_for_block+0x1ce/0x240
[...]
Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
such calculated value of bb is never used again.
Silence UBSAN by introducing another variable, bb_incr, holding the next
increment to apply to bb and adjust that one by right shifting it by one
position per loop iteration.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 74177f55b70e2f2be770dd28684dd6d17106a4ba upstream.
When filesystem is corrupted in the right way, it can happen
ext4_mark_iloc_dirty() in ext4_orphan_add() returns error and we
subsequently remove inode from the in-memory orphan list. However this
deletion is done with list_del(&EXT4_I(inode)->i_orphan) and thus we
leave i_orphan list_head with a stale content. Later we can look at this
content causing list corruption, oops, or other issues. The reported
trace looked like:
WARNING: CPU: 0 PID: 46 at lib/list_debug.c:53 __list_del_entry+0x6b/0x100()
list_del corruption, 0000000061c1d6e0->next is LIST_POISON1
0000000000100100)
CPU: 0 PID: 46 Comm: ext4.exe Not tainted 4.1.0-rc4+ #250
Stack:
60462947 62219960 602ede24 62219960
602ede24 603ca293 622198f0 602f02eb
62219950 6002c12c 62219900 601b4d6b
Call Trace:
[<6005769c>] ? vprintk_emit+0x2dc/0x5c0
[<602ede24>] ? printk+0x0/0x94
[<600190bc>] show_stack+0xdc/0x1a0
[<602ede24>] ? printk+0x0/0x94
[<602ede24>] ? printk+0x0/0x94
[<602f02eb>] dump_stack+0x2a/0x2c
[<6002c12c>] warn_slowpath_common+0x9c/0xf0
[<601b4d6b>] ? __list_del_entry+0x6b/0x100
[<6002c254>] warn_slowpath_fmt+0x94/0xa0
[<602f4d09>] ? __mutex_lock_slowpath+0x239/0x3a0
[<6002c1c0>] ? warn_slowpath_fmt+0x0/0xa0
[<60023ebf>] ? set_signals+0x3f/0x50
[<600a205a>] ? kmem_cache_free+0x10a/0x180
[<602f4e88>] ? mutex_lock+0x18/0x30
[<601b4d6b>] __list_del_entry+0x6b/0x100
[<601177ec>] ext4_orphan_del+0x22c/0x2f0
[<6012f27c>] ? __ext4_journal_start_sb+0x2c/0xa0
[<6010b973>] ? ext4_truncate+0x383/0x390
[<6010bc8b>] ext4_write_begin+0x30b/0x4b0
[<6001bb50>] ? copy_from_user+0x0/0xb0
[<601aa840>] ? iov_iter_fault_in_readable+0xa0/0xc0
[<60072c4f>] generic_perform_write+0xaf/0x1e0
[<600c4166>] ? file_update_time+0x46/0x110
[<60072f0f>] __generic_file_write_iter+0x18f/0x1b0
[<6010030f>] ext4_file_write_iter+0x15f/0x470
[<60094e10>] ? unlink_file_vma+0x0/0x70
[<6009b180>] ? unlink_anon_vmas+0x0/0x260
[<6008f169>] ? free_pgtables+0xb9/0x100
[<600a6030>] __vfs_write+0xb0/0x130
[<600a61d5>] vfs_write+0xa5/0x170
[<600a63d6>] SyS_write+0x56/0xe0
[<6029fcb0>] ? __libc_waitpid+0x0/0xa0
[<6001b698>] handle_syscall+0x68/0x90
[<6002633d>] userspace+0x4fd/0x600
[<6002274f>] ? save_registers+0x1f/0x40
[<60028bd7>] ? arch_prctl+0x177/0x1b0
[<60017bd5>] fork_handler+0x85/0x90
Fix the problem by using list_del_init() as we always should with
i_orphan list.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ff0bc08454917964291f72ee5b8eca66de4bc250 upstream.
A failed call to dqget() returns an ERR_PTR() and not null. Fix
the check in ext4_ioctl_setproject() to handle this correctly.
Fixes: 9b7365fc1c82 ("ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support")
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7827a7f6ebfcb7f388dc47fddd48567a314701ba upstream.
Instead of just printing warning messages, if the orphan list is
corrupted, declare the file system is corrupted. If there are any
reserved inodes in the orphaned inode list, declare the file system
corrupted and stop right away to avoid doing more potential damage to
the file system.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c9eb13a9105e2e418f72e46a2b6da3f49e696902 upstream.
If the orphaned inode list contains inode #5, ext4_iget() returns a
bad inode (since the bootloader inode should never be referenced
directly). Because of the bad inode, we end up processing the inode
repeatedly and this hangs the machine.
This can be reproduced via:
mke2fs -t ext4 /tmp/foo.img 100
debugfs -w -R "ssv last_orphan 5" /tmp/foo.img
mount -o loop /tmp/foo.img /mnt
(But don't do this if you are using an unpatched kernel if you care
about the system staying functional. :-)
This bug was found by the port of American Fuzzy Lop into the kernel
to find file system problems[1]. (Since it *only* happens if inode #5
shows up on the orphan list --- 3, 7, 8, etc. won't do it, it's not
surprising that AFL needed two hours before it found it.)
[1] http://events.linuxfoundation.org/sites/events/files/slides/AFL%20filesystem%20fuzzing%2C%20Vault%202016_0.pdf
Reported by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 06bd3c36a733ac27962fea7d6f47168841376824 upstream.
Huang has reported that in his powerfail testing he is seeing stale
block contents in some of recently allocated blocks although he mounts
ext4 in data=ordered mode. After some investigation I have found out
that indeed when delayed allocation is used, we don't add inode to
transaction's list of inodes needing flushing before commit. Originally
we were doing that but commit f3b59291a69d removed the logic with a
flawed argument that it is not needed.
The problem is that although for delayed allocated blocks we write their
contents immediately after allocating them, there is no guarantee that
the IO scheduler or device doesn't reorder things and thus transaction
allocating blocks and attaching them to inode can reach stable storage
before actual block contents. Actually whenever we attach freshly
allocated blocks to inode using a written extent, we should add inode to
transaction's ordered inode list to make sure we properly wait for block
contents to be written before committing the transaction. So that is
what we do in this patch. This also handles other cases where stale data
exposure was possible - like filling hole via mmap in
data=ordered,nodelalloc mode.
The only exception to the above rule are extending direct IO writes where
blkdev_direct_IO() waits for IO to complete before increasing i_size and
thus stale data exposure is not possible. For now we don't complicate
the code with optimizing this special case since the overhead is pretty
low. In case this is observed to be a performance problem we can always
handle it using a special flag to ext4_map_blocks().
Fixes: f3b59291a69d0b734be1fc8be489fef2dd846d3d
Reported-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 01d6e08711bf90bc4d7ead14a93a0cbd73b1896a upstream.
Commit c8f33d0bec99 ("affs: kstrdup() memory handling") checks if the
kstrdup function returns NULL due to out-of-memory condition.
However, if we are remounting a filesystem with no change to
filesystem-specific options, the parameter data is NULL. In this case,
kstrdup returns NULL (because it was passed NULL parameter), although no
out of memory condition exists. The mount syscall then fails with
ENOMEM.
This patch fixes the bug. We fail with ENOMEM only if data is non-NULL.
The patch also changes the call to replace_mount_options - if we didn't
pass any filesystem-specific options, we don't call
replace_mount_options (thus we don't erase existing reported options).
Fixes: c8f33d0bec99 ("affs: kstrdup() memory handling")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ab47036d8f7227361cad7894adee8e66ab6f95b2 upstream.
Below backtrace info was reported by Yunlei He:
Call Trace:
[<ffffffff817a9395>] schedule+0x35/0x80
[<ffffffff817abb7d>] rwsem_down_read_failed+0xed/0x130
[<ffffffff813c12a8>] call_rwsem_down_read_failed+0x18/0x
[<ffffffff817ab1d0>] down_read+0x20/0x30
[<ffffffffa02a1a12>] f2fs_evict_inode+0x242/0x3a0 [f2fs]
[<ffffffff81217057>] evict+0xc7/0x1a0
[<ffffffff81217cd6>] iput+0x196/0x200
[<ffffffff812134f9>] __dentry_kill+0x179/0x1e0
[<ffffffff812136f9>] dput+0x199/0x1f0
[<ffffffff811fe77b>] __fput+0x18b/0x220
[<ffffffff811fe84e>] ____fput+0xe/0x10
[<ffffffff81097427>] task_work_run+0x77/0x90
[<ffffffff81074d62>] exit_to_usermode_loop+0x73/0xa2
[<ffffffff81003b7a>] do_syscall_64+0xfa/0x110
[<ffffffff817acf65>] entry_SYSCALL64_slow_path+0x25/0x25
Call Trace:
[<ffffffff817a9395>] schedule+0x35/0x80
[<ffffffff81216dc3>] __wait_on_freeing_inode+0xa3/0xd0
[<ffffffff810bc300>] ? autoremove_wake_function+0x40/0x4
[<ffffffff8121771d>] find_inode_fast+0x7d/0xb0
[<ffffffff8121794a>] ilookup+0x6a/0xd0
[<ffffffffa02bc740>] sync_node_pages+0x210/0x650 [f2fs]
[<ffffffff8122e690>] ? do_fsync+0x70/0x70
[<ffffffffa02b085e>] block_operations+0x9e/0xf0 [f2fs]
[<ffffffff8137b795>] ? bio_endio+0x55/0x60
[<ffffffffa02b0942>] write_checkpoint+0x92/0xba0 [f2fs]
[<ffffffff8117da57>] ? mempool_free_slab+0x17/0x20
[<ffffffff8117de8b>] ? mempool_free+0x2b/0x80
[<ffffffff8122e690>] ? do_fsync+0x70/0x70
[<ffffffffa02a53e3>] f2fs_sync_fs+0x63/0xd0 [f2fs]
[<ffffffff8129630f>] ? ext4_sync_fs+0xbf/0x190
[<ffffffff8122e6b0>] sync_fs_one_sb+0x20/0x30
[<ffffffff812002e9>] iterate_supers+0xb9/0x110
[<ffffffff8122e7b5>] sys_sync+0x55/0x90
[<ffffffff81003ae9>] do_syscall_64+0x69/0x110
[<ffffffff817acf65>] entry_SYSCALL64_slow_path+0x25/0x25
With following excuting serials, we will set inline_node in inode page
after inode was unlinked, result in a deadloop described as below:
1. open file
2. write file
3. unlink file
4. write file
5. close file
Thread A Thread B
- dput
- iput_final
- inode->i_state |= I_FREEING
- evict
- f2fs_evict_inode
- f2fs_sync_fs
- write_checkpoint
- block_operations
- f2fs_lock_all (down_write(cp_rwsem))
- f2fs_lock_op (down_read(cp_rwsem))
- sync_node_pages
- ilookup
- find_inode_fast
- __wait_on_freeing_inode
(wait on I_FREEING clear)
Here, we change to set inline_node flag only for linked inode for fixing.
Reported-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Tested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 037369b872940cd923835a0a589763180c4a36bc upstream.
The HPFS filesystem used generic_show_options to produce string that is
displayed in /proc/mounts. However, there is a problem that the options
may disappear after remount. If we mount the filesystem with option1
and then remount it with option2, /proc/mounts should show both option1
and option2, however it only shows option2 because the whole option
string is replaced with replace_mount_options in hpfs_remount_fs.
To fix this bug, implement the hpfs_show_options function that prints
options that are currently selected.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 44d51706b4685f965cd32acde3fe0fcc1e6198e8 upstream.
Commit ce657611baf9 ("hpfs: kstrdup() out of memory handling") checks if
the kstrdup function returns NULL due to out-of-memory condition.
However, if we are remounting a filesystem with no change to
filesystem-specific options, the parameter data is NULL. In this case,
kstrdup returns NULL (because it was passed NULL parameter), although no
out of memory condition exists. The mount syscall then fails with
ENOMEM.
This patch fixes the bug. We fail with ENOMEM only if data is non-NULL.
The patch also changes the call to replace_mount_options - if we didn't
pass any filesystem-specific options, we don't call
replace_mount_options (thus we don't erase existing reported options).
Fixes: ce657611baf9 ("hpfs: kstrdup() out of memory handling")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 56244ef151c3cd11f505020ab0b3f45454363bcc upstream.
When btrfs_copy_from_user isn't able to copy all of the pages, we need
to adjust our accounting to reflect the work that was actually done.
Commit 2e78c927d79 changed around the decisions a little and we ended up
skipping the accounting adjustments some of the time. This commit makes
sure that when we don't copy anything at all, we still hop into
the adjustments, and switches to release_bytes instead of write_bytes,
since write_bytes isn't aligned.
The accounting errors led to warnings during btrfs_destroy_inode:
[ 70.847532] WARNING: CPU: 10 PID: 514 at fs/btrfs/inode.c:9350 btrfs_destroy_inode+0x2b3/0x2c0
[ 70.847536] Modules linked in: i2c_piix4 virtio_net i2c_core input_leds button led_class serio_raw acpi_cpufreq sch_fq_codel autofs4 virtio_blk
[ 70.847538] CPU: 10 PID: 514 Comm: umount Tainted: G W 4.6.0-rc6_00062_g2997da1-dirty #23
[ 70.847539] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
[ 70.847542] 0000000000000000 ffff880ff5cafab8 ffffffff8149d5e9 0000000000000202
[ 70.847543] 0000000000000000 0000000000000000 0000000000000000 ffff880ff5cafb08
[ 70.847547] ffffffff8107bdfd ffff880ff5cafaf8 000024868120013d ffff880ff5cafb28
[ 70.847547] Call Trace:
[ 70.847550] [<ffffffff8149d5e9>] dump_stack+0x51/0x78
[ 70.847551] [<ffffffff8107bdfd>] __warn+0xfd/0x120
[ 70.847553] [<ffffffff8107be3d>] warn_slowpath_null+0x1d/0x20
[ 70.847555] [<ffffffff8139c9e3>] btrfs_destroy_inode+0x2b3/0x2c0
[ 70.847556] [<ffffffff812003a1>] ? __destroy_inode+0x71/0x140
[ 70.847558] [<ffffffff812004b3>] destroy_inode+0x43/0x70
[ 70.847559] [<ffffffff810b7b5f>] ? wake_up_bit+0x2f/0x40
[ 70.847560] [<ffffffff81200c68>] evict+0x148/0x1d0
[ 70.847562] [<ffffffff81398ade>] ? start_transaction+0x3de/0x460
[ 70.847564] [<ffffffff81200d49>] dispose_list+0x59/0x80
[ 70.847565] [<ffffffff81201ba0>] evict_inodes+0x180/0x190
[ 70.847566] [<ffffffff812191ff>] ? __sync_filesystem+0x3f/0x50
[ 70.847568] [<ffffffff811e95f8>] generic_shutdown_super+0x48/0x100
[ 70.847569] [<ffffffff810b75c0>] ? woken_wake_function+0x20/0x20
[ 70.847571] [<ffffffff811e9796>] kill_anon_super+0x16/0x30
[ 70.847573] [<ffffffff81365cde>] btrfs_kill_super+0x1e/0x130
[ 70.847574] [<ffffffff811e99be>] deactivate_locked_super+0x4e/0x90
[ 70.847576] [<ffffffff811e9e61>] deactivate_super+0x51/0x70
[ 70.847577] [<ffffffff8120536f>] cleanup_mnt+0x3f/0x80
[ 70.847579] [<ffffffff81205402>] __cleanup_mnt+0x12/0x20
[ 70.847581] [<ffffffff81098358>] task_work_run+0x68/0xa0
[ 70.847582] [<ffffffff810022b6>] exit_to_usermode_loop+0xd6/0xe0
[ 70.847583] [<ffffffff81002e1d>] do_syscall_64+0xbd/0x170
[ 70.847586] [<ffffffff817d4dbc>] entry_SYSCALL64_slow_path+0x25/0x25
This is the test program I used to force short returns from
btrfs_copy_from_user
void *dontneed(void *arg)
{
char *p = arg;
int ret;
while(1) {
ret = madvise(p, BUFSIZE/4, MADV_DONTNEED);
if (ret) {
perror("madvise");
exit(1);
}
}
}
int main(int ac, char **av) {
int ret;
int fd;
char *filename;
unsigned long offset;
char *buf;
int i;
pthread_t tid;
if (ac != 2) {
fprintf(stderr, "usage: dammitdave filename\n");
exit(1);
}
buf = mmap(NULL, BUFSIZE, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
if (buf == MAP_FAILED) {
perror("mmap");
exit(1);
}
memset(buf, 'a', BUFSIZE);
filename = av[1];
ret = pthread_create(&tid, NULL, dontneed, buf);
if (ret) {
fprintf(stderr, "error %d from pthread_create\n", ret);
exit(1);
}
ret = pthread_detach(tid);
if (ret) {
fprintf(stderr, "pthread detach failed %d\n", ret);
exit(1);
}
while (1) {
fd = open(filename, O_RDWR | O_CREAT, 0600);
if (fd < 0) {
perror("open");
exit(1);
}
for (i = 0; i < ROUNDS; i++) {
int this_write = BUFSIZE;
offset = rand() % MAXSIZE;
ret = pwrite(fd, buf, this_write, offset);
if (ret < 0) {
perror("pwrite");
exit(1);
} else if (ret != this_write) {
fprintf(stderr, "short write to %s offset %lu ret %d\n",
filename, offset, ret);
exit(1);
}
if (i == ROUNDS - 1) {
ret = sync_file_range(fd, offset, 4096,
SYNC_FILE_RANGE_WRITE);
if (ret < 0) {
perror("sync_file_range");
exit(1);
}
}
}
ret = ftruncate(fd, 0);
if (ret < 0) {
perror("ftruncate");
exit(1);
}
ret = close(fd);
if (ret) {
perror("close");
exit(1);
}
ret = unlink(filename);
if (ret) {
perror("unlink");
exit(1);
}
}
return 0;
}
Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Dave Jones <dsj@fb.com>
Fixes: 2e78c927d79333f299a8ac81c2fd2952caeef335
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4c63c2454eff996c5e27991221106eb511f7db38 upstream.
32-bit ioctl uses these rather than the regular FS_IOC_* versions. They can
be handled in btrfs using the same code. Without this, 32-bit {ch,ls}attr
fail.
Signed-off-by: Luke Dashjr <luke-jr+git@utopios.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c2985d001d2fb77357aeae675545893b61c50044 upstream.
Commit 80f9642724af5 ("NFSv4.x: Enforce the ca_maxreponsesize_cached
on the back channel") causes an oops when it receives a callback with
cachethis=yes.
[ 109.667378] BUG: unable to handle kernel NULL pointer dereference at 00000000000002c8
[ 109.669476] IP: [<ffffffffa08a3e68>] nfs4_callback_compound+0x4f8/0x690 [nfsv4]
[ 109.671216] PGD 0
[ 109.671736] Oops: 0000 [#1] SMP
[ 109.705427] CPU: 1 PID: 3579 Comm: nfsv4.1-svc Not tainted 4.5.0-rc1+ #1
[ 109.706987] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
[ 109.709468] task: ffff8800b4408000 ti: ffff88008448c000 task.ti: ffff88008448c000
[ 109.711207] RIP: 0010:[<ffffffffa08a3e68>] [<ffffffffa08a3e68>] nfs4_callback_compound+0x4f8/0x690 [nfsv4]
[ 109.713521] RSP: 0018:ffff88008448fca0 EFLAGS: 00010286
[ 109.714762] RAX: ffff880081ee202c RBX: ffff8800b7b5b600 RCX: 0000000000000001
[ 109.716427] RDX: 0000000000000008 RSI: 0000000000000008 RDI: 0000000000000000
[ 109.718091] RBP: ffff88008448fda8 R08: 0000000000000000 R09: 000000000b000000
[ 109.719757] R10: ffff880137786000 R11: ffff8800b7b5b600 R12: 0000000001000000
[ 109.721415] R13: 0000000000000002 R14: 0000000053270000 R15: 000000000000000b
[ 109.723061] FS: 0000000000000000(0000) GS:ffff880139640000(0000) knlGS:0000000000000000
[ 109.724931] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 109.726278] CR2: 00000000000002c8 CR3: 0000000034d50000 CR4: 00000000001406e0
[ 109.727972] Stack:
[ 109.728465] ffff880081ee202c ffff880081ee201c 000000008448fcc0 ffff8800baccb800
[ 109.730349] ffff8800baccc800 ffffffffa08d0380 0000000000000000 0000000000000000
[ 109.732211] ffff8800b7b5b600 0000000000000001 ffffffff81d073c0 ffff880081ee3090
[ 109.734056] Call Trace:
[ 109.734657] [<ffffffffa03795d4>] svc_process_common+0x5c4/0x6c0 [sunrpc]
[ 109.736267] [<ffffffffa0379a4c>] bc_svc_process+0x1fc/0x360 [sunrpc]
[ 109.737775] [<ffffffffa08a2c2c>] nfs41_callback_svc+0x10c/0x1d0 [nfsv4]
[ 109.739335] [<ffffffff810cb380>] ? prepare_to_wait_event+0xf0/0xf0
[ 109.740799] [<ffffffffa08a2b20>] ? nfs4_callback_svc+0x50/0x50 [nfsv4]
[ 109.742349] [<ffffffff810a6998>] kthread+0xd8/0xf0
[ 109.743495] [<ffffffff810a68c0>] ? kthread_park+0x60/0x60
[ 109.744776] [<ffffffff816abc4f>] ret_from_fork+0x3f/0x70
[ 109.746037] [<ffffffff810a68c0>] ? kthread_park+0x60/0x60
[ 109.747324] Code: cc 45 31 f6 48 8b 85 00 ff ff ff 44 89 30 48 8b 85 f8 fe ff ff 44 89 20 48 8b 9d 38 ff ff ff 48 8b bd 30 ff ff ff 48 85 db 74 4c <4c> 8b af c8 02 00 00 4d 8d a5 08 02 00 00 49 81 c5 98 02 00 00
[ 109.754361] RIP [<ffffffffa08a3e68>] nfs4_callback_compound+0x4f8/0x690 [nfsv4]
[ 109.756123] RSP <ffff88008448fca0>
[ 109.756951] CR2: 00000000000002c8
[ 109.757738] ---[ end trace 2b8555511ab5dfb4 ]---
[ 109.758819] Kernel panic - not syncing: Fatal exception
[ 109.760126] Kernel Offset: disabled
[ 118.938934] ---[ end Kernel panic - not syncing: Fatal exception
It doesn't unlock the table nor does it set the cps->clp pointer which
is later needed by nfs4_cb_free_slot().
Fixes: 80f9642724af5 ("NFSv4.x: Enforce the ca_maxresponsesize_cached ...")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b5a7aef1ef436ec005fef0efe31a676ec5f4ab31 upstream.
This patch allows fscrypto to handle a second key prefix given by filesystem.
The main reason is to provide backward compatibility, since previously f2fs
used "f2fs:" as a crypto prefix instead of "fscrypt:".
Later, ext4 should also provide key_prefix() to give "ext4:".
One concern decribed by Ted would be kinda double check overhead of prefixes.
In x86, for example, validate_user_key consumes 8 ms after boot-up, which turns
out derive_key_aes() consumed most of the time to load specific crypto module.
After such the cold miss, it shows almost zero latencies, which treats as a
negligible overhead.
Note that request_key() detects wrong prefix in prior to derive_key_aes() even.
Cc: Ted Tso <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1a967d6c9b39c226be1b45f13acd4d8a5ab3dc44 upstream.
Only server which map unknown users to guest will allow
access using a non-null NTLMv2_Response.
For Samba it's the "map to guest = bad user" option.
BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 777f69b8d26bf35ade4a76b08f203c11e048365d upstream.
Only server which map unknown users to guest will allow
access using a non-null NTChallengeResponse.
For Samba it's the "map to guest = bad user" option.
BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fa8f3a354bb775ec586e4475bcb07f7dece97e0c upstream.
Only server which map unknown users to guest will allow
access using a non-null LMChallengeResponse.
For Samba it's the "map to guest = bad user" option.
BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit cfda35d98298131bf38fbad3ce4cd5ecb3cf18db upstream.
See [MS-NLMP] 3.2.5.1.2 Server Receives an AUTHENTICATE_MESSAGE from the Client:
...
Set NullSession to FALSE
If (AUTHENTICATE_MESSAGE.UserNameLen == 0 AND
AUTHENTICATE_MESSAGE.NtChallengeResponse.Length == 0 AND
(AUTHENTICATE_MESSAGE.LmChallengeResponse == Z(1)
OR
AUTHENTICATE_MESSAGE.LmChallengeResponse.Length == 0))
-- Special case: client requested anonymous authentication
Set NullSession to TRUE
...
Only server which map unknown users to guest will allow
access using a non-null NTChallengeResponse.
For Samba it's the "map to guest = bad user" option.
BUG: https://bugzilla.samba.org/show_bug.cgi?id=11913
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
directories
commit 897fba1172d637d344f009d700f7eb8a1fa262f1 upstream.
Wrong return code was being returned on SMB3 rmdir of
non-empty directory.
For SMB3 (unlike for cifs), we attempt to delete a directory by
set of delete on close flag on the open. Windows clients set
this flag via a set info (SET_FILE_DISPOSITION to set this flag)
which properly checks if the directory is empty.
With this patch on smb3 mounts we correctly return
"DIRECTORY NOT EMPTY"
on attempts to remove a non-empty directory.
Signed-off-by: Steve French <steve.french@primarydata.com>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
"Overlayfs fixes from Miklos, assorted fixes from me.
Stable fodder of varying severity, all sat in -next for a while"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ovl: ignore permissions on underlying lookup
vfs: add lookup_hash() helper
vfs: rename: check backing inode being equal
vfs: add vfs_select_inode() helper
get_rock_ridge_filename(): handle malformed NM entries
ecryptfs: fix handling of directory opening
atomic_open(): fix the handling of create_error
fix the copy vs. map logics in blk_rq_map_user_iov()
do_splice_to(): cap the size before passing to ->splice_read()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"During v4.6-rc1 cgroup namespace support was merged. There is an
issue where it's impossible to tell whether a given cgroup mount point
is bind mounted or namespaced. Serge has been working on the issue
but it took longer than expected to resolve, so the late pull request.
Given that it's a completely new feature and the patches don't touch
anything else, the risk seems acceptable. However, if this is too
late, an alternative is plugging new cgroup ns creation for v4.6 and
retrying for v4.7"
* 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix compile warning
kernfs: kernfs_sop_show_path: don't return 0 after seq_dentry call
cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
kernfs_path_from_node_locked: don't overwrite nlen
|
|
Commit 702e5bc68ad2 ("ocfs2: use generic posix ACL infrastructure")
refactored code to use posix_acl_create. The problem with this function
is that it is not mindful of the cluster wide inode lock making it
unsuitable for use with ocfs2 inode creation with ACLs. For example,
when used in ocfs2_mknod, this function can cause deadlock as follows.
The parent dir inode lock is taken when calling posix_acl_create ->
get_acl -> ocfs2_iop_get_acl which takes the inode lock again. This can
cause deadlock if there is a blocked remote lock request waiting for the
lock to be downconverted. And same deadlock happened in ocfs2_reflink.
This fix is to revert back using ocfs2_init_acl.
Fixes: 702e5bc68ad2 ("ocfs2: use generic posix ACL infrastructure")
Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 743b5f1434f5 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
introduced this issue. ocfs2_setattr called by chmod command holds
cluster wide inode lock when calling posix_acl_chmod. This latter
function in turn calls ocfs2_iop_get_acl and ocfs2_iop_set_acl. These
two are also called directly from vfs layer for getfacl/setfacl commands
and therefore acquire the cluster wide inode lock. If a remote
conversion request comes after the first inode lock in ocfs2_setattr,
OCFS2_LOCK_BLOCKED will be set. And this will cause the second call to
inode lock from the ocfs2_iop_get_acl() to block indefinetly.
The deleted version of ocfs2_acl_chmod() calls __posix_acl_chmod() which
does not call back into the filesystem. Therefore, we restore
ocfs2_acl_chmod(), modify it slightly for locking as needed, and use that
instead.
Fixes: 743b5f1434f5 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Our caller expects 0 on success, not >0.
This fixes a bug in the patch
cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
where /sys does not show up in mountinfo, breaking criu.
Thanks for catching this, Andrei.
Reported-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
|
|
Generally permission checking is not necessary when overlayfs looks up a
dentry on one of the underlying layers, since search permission on base
directory was already checked in ovl_permission().
More specifically using lookup_one_len() causes a problem when the lower
directory lacks search permission for a specific user while the upper
directory does have search permission. Since lookups are cached, this
causes inconsistency in behavior: success depends on who did the first
lookup.
So instead use lookup_hash() which doesn't do the permission check.
Reported-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Overlayfs needs lookup without inode_permission() and already has the name
hash (in form of dentry->d_name on overlayfs dentry). It also doesn't
support filesystems with d_op->d_hash() so basically it only needs
the actual hashed lookup from lookup_one_len_unlocked()
So add a new helper that does unlocked lookup of a hashed name.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
If a file is renamed to a hardlink of itself POSIX specifies that rename(2)
should do nothing and return success.
This condition is checked in vfs_rename(). However it won't detect hard
links on overlayfs where these are given separate inodes on the overlayfs
layer.
Overlayfs itself detects this condition and returns success without doing
anything, but then vfs_rename() will proceed as if this was a successful
rename (detach_mounts(), d_move()).
The correct thing to do is to detect this condition before even calling
into overlayfs. This patch does this by calling vfs_select_inode() to get
the underlying inodes.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.2+
|
|
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.2+
|
|
"cat /proc/$pid/wchan""
This reverts the 4.6-rc1 commit 7e2bc81da333 ("proc/base: make prompt
shell start from new line after executing "cat /proc/$pid/wchan")
because it breaks /proc/$PID/whcan formatting in ps and top.
Revert also because the patch is inconsistent - it adds a newline at the
end of only the '0' wchan, and does not add a newline when
/proc/$PID/wchan contains a symbol name.
eg.
$ ps -eo pid,stat,wchan,comm
PID STAT WCHAN COMMAND
...
1189 S - dbus-launch
1190 Ssl 0
dbus-daemon
1198 Sl 0
lightdm
1299 Ss ep_pol systemd
1301 S - (sd-pam)
1304 Ss wait sh
Signed-off-by: Robin Humble <plaguedbypenguins@gmail.com>
Cc: Minfei Huang <mnfhuang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch summary:
When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.
Short explanation (courtesy of mkerrisk):
If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point. Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.
Long version:
When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
namespace, and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b. The root dentry field of the mountinfo
entry will show '/a/b'.
cat > /tmp/do1 << EOF
mount -t cgroup -o freezer freezer /mnt
grep freezer /proc/self/mountinfo
EOF
unshare -Gm bash /tmp/do1
> 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
> 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer
The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':
grep freezer /proc/self/cgroup
9:freezer:/
If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:
mount --bind /sys/fs/cgroup/freezer/a/b /mnt
130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer
In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.
Example (by mkerrisk):
First, a little script to save some typing and verbiage:
echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'
Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:
2653
2653 # Our shell
14254 # cat(1)
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):
Look at the state of the /proc files:
/proc/self/cgroup: 10:freezer:/
mountinfo: / /sys/fs/cgroup/freezer
The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.
However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:
# propagating to other mountns
/proc/self/cgroup: 7:freezer:/
mountinfo: /a/b /mnt/freezer
The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname directory of the directory
within the newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.
With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace. So the same steps as above:
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: /../.. /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: / /mnt/freezer
cgroup.clone_children freezer.parent_freezing freezer.state tasks
cgroup.procs freezer.self_freezing notify_on_release
3164
2653 # First shell that placed in this cgroup
3164 # Shell started by 'unshare'
14197 # cat(1)
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Payloads of NM entries are not supposed to contain NUL. When we run
into such, only the part prior to the first NUL goes into the
concatenation (i.e. the directory entry name being encoded by a bunch
of NM entries). We do stop when the amount collected so far + the
claimed amount in the current NM entry exceed 254. So far, so good,
but what we return as the total length is the sum of *claimed*
sizes, not the actual amount collected. And that can grow pretty
large - not unlimited, since you'd need to put CE entries in
between to be able to get more than the maximum that could be
contained in one isofs directory entry / continuation chunk and
we are stop once we'd encountered 32 CEs, but you can get about 8Kb
easily. And that's what will be passed to readdir callback as the
name length. 8Kb __copy_to_user() from a buffer allocated by
__get_free_page()
Cc: stable@vger.kernel.org # 0.98pl6+ (yes, really)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
If /proc/<PID>/environ gets read before the envp[] array is fully set up
in create_{aout,elf,elf_fdpic,flat}_tables(), we might end up trying to
read more bytes than are actually written, as env_start will already be
set but env_end will still be zero, making the range calculation
underflow, allowing to read beyond the end of what has been written.
Fix this as it is done for /proc/<PID>/cmdline by testing env_end for
zero. It is, apparently, intentionally set last in create_*_tables().
This bug was found by the PaX size_overflow plugin that detected the
arithmetic underflow of 'this_len = env_end - (env_start + src)' when
env_end is still zero.
The expected consequence is that userland trying to access
/proc/<PID>/environ of a not yet fully set up process may get
inconsistent data as we're in the middle of copying in the environment
variables.
Fixes: https://forums.grsecurity.net/viewtopic.php?f=3&t=4363
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=116461
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: Pax Team <pageexec@freemail.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull userns fix from Eric Biederman:
"This contains just a single fix for a nasty oops"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
propogate_mnt: Handle the first propogated copy being a slave
|
|
When the first propgated copy was a slave the following oops would result:
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
> IP: [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
> PGD bacd4067 PUD bac66067 PMD 0
> Oops: 0000 [#1] SMP
> Modules linked in:
> CPU: 1 PID: 824 Comm: mount Not tainted 4.6.0-rc5userns+ #1523
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> task: ffff8800bb0a8000 ti: ffff8800bac3c000 task.ti: ffff8800bac3c000
> RIP: 0010:[<ffffffff811fba4e>] [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
> RSP: 0018:ffff8800bac3fd38 EFLAGS: 00010283
> RAX: 0000000000000000 RBX: ffff8800bb77ec00 RCX: 0000000000000010
> RDX: 0000000000000000 RSI: ffff8800bb58c000 RDI: ffff8800bb58c480
> RBP: ffff8800bac3fd48 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000001ca1 R11: 0000000000001c9d R12: 0000000000000000
> R13: ffff8800ba713800 R14: ffff8800bac3fda0 R15: ffff8800bb77ec00
> FS: 00007f3c0cd9b7e0(0000) GS:ffff8800bfb00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000010 CR3: 00000000bb79d000 CR4: 00000000000006e0
> Stack:
> ffff8800bb77ec00 0000000000000000 ffff8800bac3fd88 ffffffff811fbf85
> ffff8800bac3fd98 ffff8800bb77f080 ffff8800ba713800 ffff8800bb262b40
> 0000000000000000 0000000000000000 ffff8800bac3fdd8 ffffffff811f1da0
> Call Trace:
> [<ffffffff811fbf85>] propagate_mnt+0x105/0x140
> [<ffffffff811f1da0>] attach_recursive_mnt+0x120/0x1e0
> [<ffffffff811f1ec3>] graft_tree+0x63/0x70
> [<ffffffff811f1f6b>] do_add_mount+0x9b/0x100
> [<ffffffff811f2c1a>] do_mount+0x2aa/0xdf0
> [<ffffffff8117efbe>] ? strndup_user+0x4e/0x70
> [<ffffffff811f3a45>] SyS_mount+0x75/0xc0
> [<ffffffff8100242b>] do_syscall_64+0x4b/0xa0
> [<ffffffff81988f3c>] entry_SYSCALL64_slow_path+0x25/0x25
> Code: 00 00 75 ec 48 89 0d 02 22 22 01 8b 89 10 01 00 00 48 89 05 fd 21 22 01 39 8e 10 01 00 00 0f 84 e0 00 00 00 48 8b 80 d8 00 00 00 <48> 8b 50 10 48 89 05 df 21 22 01 48 89 15 d0 21 22 01 8b 53 30
> RIP [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
> RSP <ffff8800bac3fd38>
> CR2: 0000000000000010
> ---[ end trace 2725ecd95164f217 ]---
This oops happens with the namespace_sem held and can be triggered by
non-root users. An all around not pleasant experience.
To avoid this scenario when finding the appropriate source mount to
copy stop the walk up the mnt_master chain when the first source mount
is encountered.
Further rewrite the walk up the last_source mnt_master chain so that
it is clear what is going on.
The reason why the first source mount is special is that it it's
mnt_parent is not a mount in the dest_mnt propagation tree, and as
such termination conditions based up on the dest_mnt mount propgation
tree do not make sense.
To avoid other kinds of confusion last_dest is not changed when
computing last_source. last_dest is only used once in propagate_one
and that is above the point of the code being modified, so changing
the global variable is meaningless and confusing.
Cc: stable@vger.kernel.org
fixes: f2ebb3a921c1ca1e2ddd9242e95a1989a50c4c68 ("smarter propagate_mnt()")
Reported-by: Tycho Andersen <tycho.andersen@canonical.com>
Reviewed-by: Seth Forshee <seth.forshee@canonical.com>
Tested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
First of all, trying to open them r/w is idiocy; it's guaranteed to fail.
Moreover, assigning ->f_pos and assuming that everything will work is
blatantly broken - try that with e.g. tmpfs as underlying layer and watch
the fireworks. There may be a non-trivial amount of state associated with
current IO position, well beyond the numeric offset. Using the single
struct file associated with underlying inode is really not a good idea;
we ought to open one for each ecryptfs directory struct file.
Additionally, file_operations both for directories and non-directories are
full of pointless methods; non-directories should *not* have ->iterate(),
directories should not have ->flush(), ->fasync() and ->splice_read().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
"Fix a regression and update the MAINTAINERS entry for fuse"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: update mailing list in MAINTAINERS
fuse: Fix return value from fuse_get_user_pages()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF fix from Jan Kara:
"A fix of a regression in UDF that got introduced in 4.6-rc1 by one of
the charset encoding fixes"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Fix conversion of 'dstring' fields to UTF8
|
|
We've calculated @len to be the bytes we need for '/..' entries from
@kn_from to the common ancestor, and calculated @nlen to be the extra
bytes we need to get from the common ancestor to @kn_to. We use them
as such at the end. But in the loop copying the actual entries, we
overwrite @nlen. Use a temporary variable for that instead.
Without this, the return length, when the buffer is large enough, is
wrong. (When the buffer is NULL or too small, the returned value is
correct. The buffer contents are also correct.)
Interestingly, no callers of this function are affected by this as of
yet. However the upcoming cgroup_show_path() will be.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
|