summaryrefslogtreecommitdiff
path: root/fs/ext4/super.c
AgeCommit message (Collapse)Author
2014-11-14ext4: check s_chksum_driver when looking for bg csum presenceDarrick J. Wong
commit 813d32f91333e4c33d5a19b67167c4bae42dae75 upstream. Convert the ext4_has_group_desc_csum predicate to look for a checksum driver instead of the metadata_csum flag and change the bg checksum calculation function to look for GDT_CSUM before taking the crc16 path. Without this patch, if we mount with ^uninit_bg,^metadata_csum and later metadata_csum gets turned on by accident, the block group checksum functions will incorrectly assume that checksumming is enabled (metadata_csum) but that crc16 should be used (!s_chksum_driver). This is totally wrong, so fix the predicate and the checksum formula selection. (Granted, if the metadata_csum feature bit gets enabled on a live FS then something underhanded is going on, but we could at least avoid writing garbage into the on-disk fields.) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: add ext4_iget_normal() which is to be used for dir tree lookupsTheodore Ts'o
commit f4bb2981024fc91b23b4d09a8817c415396dbabb upstream. If there is a corrupted file system which has directory entries that point at reserved, metadata inodes, prohibit them from being used by treating them the same way we treat Boot Loader inodes --- that is, mark them to be bad inodes. This prohibits them from being opened, deleted, or modified via chmod, chown, utimes, etc. In particular, this prevents a corrupted file system which has a directory entry which points at the journal inode from being deleted and its blocks released, after which point Much Hilarity Ensues. Reported-by: Sami Liedes <sami.liedes@iki.fi> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14ext4: don't check quota format when there are no quota filesJan Kara
commit 279bf6d390933d5353ab298fcc306c391a961469 upstream. The check whether quota format is set even though there are no quota files with journalled quota is pointless and it actually makes it impossible to turn off journalled quotas (as there's no way to unset journalled quota format). Just remove the check. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-17ext4: disable synchronous transaction batching if max_batch_time==0Eric Sandeen
commit 5dd214248f94d430d70e9230bda72f2654ac88a8 upstream. The mount manpage says of the max_batch_time option, This optimization can be turned off entirely by setting max_batch_time to 0. But the code doesn't do that. So fix the code to do that. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-17ext4: clarify error count warning messagesTheodore Ts'o
commit ae0f78de2c43b6fadd007c231a352b13b5be8ed2 upstream. Make it clear that values printed are times, and that it is error since last fsck. Also add note about fsck version required. Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-06ext4: don't try to modify s_flags if the the file system is read-onlyTheodore Ts'o
commit 23301410972330c0ae9a8afc379ba2005e249cc6 upstream. If an ext4 file system is created by some tool other than mke2fs (perhaps by someone who has a pathalogical fear of the GPL) that doesn't set one or the other of the EXT2_FLAGS_{UN}SIGNED_HASH flags, and that file system is then mounted read-only, don't try to modify the s_flags field. Otherwise, if dm_verity is in use, the superblock will change, causing an dm_verity failure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09ext4: Do not reserve clusters when fs doesn't support extentsJan Kara
commit 30fac0f75da24dd5bb43c9e911d2039a984ac815 upstream. When the filesystem doesn't support extents (like in ext2/3 compatibility modes), there is no need to reserve any clusters. Space estimates for writing are exact, hole punching doesn't need new metadata, and there are no unwritten extents to convert. This fixes a problem when filesystem still having some free space when accessed with a native ext2/3 driver suddently reports ENOSPC when accessed with ext4 driver. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-14ext4: fix mount/remount error messages for incompatible mount optionsPiotr Sarna
commit 6ae6514b33f941d3386da0dfbe2942766eab1577 upstream. Commit 5688978 ("ext4: improve handling of conflicting mount options") introduced incorrect messages shown while choosing wrong mount options. First of all, both cases of incorrect mount options, "data=journal,delalloc" and "data=journal,dioread_nolock" result in the same error message. Secondly, the problem above isn't solved for remount option: the mismatched parameter is simply ignored. Moreover, ext4_msg states that remount with options "data=journal,delalloc" succeeded, which is not true. To fix it up, I added a simple check after parse_options() call to ensure that data=journal and delalloc/dioread_nolock parameters are not present at the same time. Signed-off-by: Piotr Sarna <p.sarna@partner.samsung.com> Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-14ext4: allow the mount options nodelalloc and data=journalTheodore Ts'o
commit 59d9fa5c2e9086db11aa287bb4030151d0095a17 upstream. Commit 26092bf ("ext4: use a table-driven handler for mount options") wrongly disallows the specifying the mount options nodelalloc and data=journal simultaneously. This is incorrect; it should have only disallowed the combination of delalloc and data=journal simultaneously. Reported-by: Piotr Sarna <p.sarna@partner.samsung.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-14ext4: destroy ext4_es_cachep on module unloadEric Sandeen
commit dd12ed144e9797094c04736f97aa27d5fe401476 upstream. Without this, module can't be reloaded. [ 500.521980] kmem_cache_sanity_check (ext4_extent_status): Cache name already exists. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-21ext4: don't show usrquota/grpquota twice in /proc/mountsTheodore Ts'o
commit ad065dd01662ae22138899e6b1c8eeb3a529964f upstream. We now print mount options in a generic fashion in ext4_show_options(), so we shouldn't be explicitly printing the {usr,grp}quota options in ext4_show_quota_options(). Without this patch, /proc/mounts can look like this: /dev/vdb /vdb ext4 rw,relatime,quota,usrquota,data=ordered,usrquota 0 0 ^^^^^^^^ ^^^^^^^^ Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-21ext4: fix ext4_get_group_number()Theodore Ts'o
commit 960fd856fdc3b08b3638f3f9b6b4bfceb77660c7 upstream. The function ext4_get_group_number() was introduced as an optimization in commit bd86298e60b8. Unfortunately, this commit incorrectly calculate the group number for file systems with a 1k block size (when s_first_data_block is 1 instead of zero). This could cause the following kernel BUG: [ 568.877799] ------------[ cut here ]------------ [ 568.877833] kernel BUG at fs/ext4/mballoc.c:3728! [ 568.877840] Oops: Exception in kernel mode, sig: 5 [#1] [ 568.877845] SMP NR_CPUS=32 NUMA pSeries [ 568.877852] Modules linked in: binfmt_misc [ 568.877861] CPU: 1 PID: 3516 Comm: fs_mark Not tainted 3.10.0-03216-g7c6809f-dirty #1 [ 568.877867] task: c0000001fb0b8000 ti: c0000001fa954000 task.ti: c0000001fa954000 [ 568.877873] NIP: c0000000002f42a4 LR: c0000000002f4274 CTR: c000000000317ef8 [ 568.877879] REGS: c0000001fa956ed0 TRAP: 0700 Not tainted (3.10.0-03216-g7c6809f-dirty) [ 568.877884] MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: 24000428 XER: 00000000 [ 568.877902] SOFTE: 1 [ 568.877905] CFAR: c0000000002b5464 [ 568.877908] GPR00: 0000000000000001 c0000001fa957150 c000000000c6a408 c0000001fb588000 GPR04: 0000000000003fff c0000001fa9571c0 c0000001fa9571c4 000138098c50625f GPR08: 1301200000000000 0000000000000002 0000000000000001 0000000000000000 GPR12: 0000000024000422 c00000000f33a300 0000000000008000 c0000001fa9577f0 GPR16: c0000001fb7d0100 c000000000c29190 c0000000007f46e8 c000000000a14672 GPR20: 0000000000000001 0000000000000008 ffffffffffffffff 0000000000000000 GPR24: 0000000000000100 c0000001fa957278 c0000001fdb2bc78 c0000001fa957288 GPR28: 0000000000100100 c0000001fa957288 c0000001fb588000 c0000001fdb2bd10 [ 568.877993] NIP [c0000000002f42a4] .ext4_mb_release_group_pa+0xec/0x1c0 [ 568.877999] LR [c0000000002f4274] .ext4_mb_release_group_pa+0xbc/0x1c0 [ 568.878004] Call Trace: [ 568.878008] [c0000001fa957150] [c0000000002f4274] .ext4_mb_release_group_pa+0xbc/0x1c0 (unreliable) [ 568.878017] [c0000001fa957200] [c0000000002fb070] .ext4_mb_discard_lg_preallocations+0x394/0x444 [ 568.878025] [c0000001fa957340] [c0000000002fb45c] .ext4_mb_release_context+0x33c/0x734 [ 568.878032] [c0000001fa957440] [c0000000002fbcf8] .ext4_mb_new_blocks+0x4a4/0x5f4 [ 568.878039] [c0000001fa957510] [c0000000002ef56c] .ext4_ext_map_blocks+0xc28/0x1178 [ 568.878047] [c0000001fa957640] [c0000000002c1a94] .ext4_map_blocks+0x2c8/0x490 [ 568.878054] [c0000001fa957730] [c0000000002c536c] .ext4_writepages+0x738/0xc60 [ 568.878062] [c0000001fa957950] [c000000000168a78] .do_writepages+0x5c/0x80 [ 568.878069] [c0000001fa9579d0] [c00000000015d1c4] .__filemap_fdatawrite_range+0x88/0xb0 [ 568.878078] [c0000001fa957aa0] [c00000000015d23c] .filemap_write_and_wait_range+0x50/0xfc [ 568.878085] [c0000001fa957b30] [c0000000002b8edc] .ext4_sync_file+0x220/0x3c4 [ 568.878092] [c0000001fa957be0] [c0000000001f849c] .vfs_fsync_range+0x64/0x80 [ 568.878098] [c0000001fa957c70] [c0000000001f84f0] .vfs_fsync+0x38/0x4c [ 568.878105] [c0000001fa957d00] [c0000000001f87f4] .do_fsync+0x54/0x90 [ 568.878111] [c0000001fa957db0] [c0000000001f8894] .SyS_fsync+0x28/0x3c [ 568.878120] [c0000001fa957e30] [c000000000009c88] syscall_exit+0x0/0x7c [ 568.878125] Instruction dump: [ 568.878130] 60000000 813d0034 81610070 38000000 7f8b4800 419e001c 813f007c 7d2bfe70 [ 568.878144] 7d604a78 7c005850 54000ffe 7c0007b4 <0b000000> e8a10076 e87f0090 7fa4eb78 [ 568.878160] ---[ end trace 594d911d9654770b ]--- In addition fix the STD_GROUP optimization so that it works for bigalloc file systems as well. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Li Zhong <lizhongfs@gmail.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-07make blkdev_put() return voidAl Viro
same story as with the previous patches - note that return value of blkdev_close() is lost, since there's nowhere the caller (__fput()) could return it to. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS updates from Al Viro, Misc cleanups all over the place, mainly wrt /proc interfaces (switch create_proc_entry to proc_create(), get rid of the deprecated create_proc_read_entry() in favor of using proc_create_data() and seq_file etc). 7kloc removed. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits) don't bother with deferred freeing of fdtables proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h proc: Make the PROC_I() and PDE() macros internal to procfs proc: Supply a function to remove a proc entry by PDE take cgroup_open() and cpuset_open() to fs/proc/base.c ppc: Clean up scanlog ppc: Clean up rtas_flash driver somewhat hostap: proc: Use remove_proc_subtree() drm: proc: Use remove_proc_subtree() drm: proc: Use minor->index to label things, not PDE->name drm: Constify drm_proc_list[] zoran: Don't print proc_dir_entry data in debug reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show() proc: Supply an accessor for getting the data from a PDE's parent airo: Use remove_proc_subtree() rtl8192u: Don't need to save device proc dir PDE rtl8187se: Use a dir under /proc/net/r8180/ proc: Add proc_mkdir_data() proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h} proc: Move PDE_NET() to fs/proc/proc_net.c ...
2013-04-20ext4: mark all metadata I/O with REQ_METATheodore Ts'o
As Dave Chinner pointed out at the 2013 LSF/MM workshop, it's important that metadata I/O requests are marked as such to avoid priority inversions caused by I/O bandwidth throttling. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-11ext4: Use kstrtoul() instead of parse_strtoul()Lukas Czerner
In parse_strtoul() we're still using deprecated simple_strtoul(). Remove parse_strtoul() altogether and replace it with kstrtoul() Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09ext4: fix miscellaneous big endian warningsTheodore Ts'o
None of these result in any bug, but they makes sparse complain. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09ext4: introduce reserved spaceLukas Czerner
Currently in ENOSPC condition when writing into unwritten space, or punching a hole, we might need to split the extent and grow extent tree. However since we can not allocate any new metadata blocks we'll have to zero out unwritten part of extent or punched out part of extent, or in the worst case return ENOSPC even though use actually does not allocate any space. Also in delalloc path we do reserve metadata and data blocks for the time we're going to write out, however metadata block reservation is very tricky especially since we expect that logical connectivity implies physical connectivity, however that might not be the case and hence we might end up allocating more metadata blocks than previously reserved. So in future, metadata reservation checks should be removed since we can not assure that we do not under reserve. And this is where reserved space comes into the picture. When mounting the file system we slice off a little bit of the file system space (2% or 4096 clusters, whichever is smaller) which can be then used for the cases mentioned above to prevent costly zeroout, or unexpected ENOSPC. The number of reserved clusters can be set via sysfs, however it can never be bigger than number of free clusters in the file system. Note that this patch fixes the failure of xfstest 274 as expected. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2013-04-09procfs: new helper - PDE_DATA(inode)Al Viro
The only part of proc_dir_entry the code outside of fs/proc really cares about is PDE(inode)->data. Provide a helper for that; static inline for now, eventually will be moved to fs/proc, along with the knowledge of struct proc_dir_entry layout. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09ext4: fix deadlock with quota featureJan Kara
We didn't mark hidden quota files with S_NOQUOTA flag and thus quota was accounted even for quota files. Thus we could recurse back to quota code when adding new blocks to quota file which can easily deadlock. Mark hidden quota files properly. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03ext4: make ext4_block_in_group() much more efficientLukas Czerner
Currently in when getting the block group number for a particular block in ext4_block_in_group() we're using ext4_get_group_no_and_offset() which uses do_div() to get the block group and the remainer which is offset within the group. We don't need all of that in ext4_block_in_group() as we only need to figure out the group number. This commit changes ext4_block_in_group() to calculate group number directly. This shows as a big improvement with regards to cpu utilization. Measuring fallocate -l 15T on fresh file system with perf showed that 23% of cpu time was spend in the ext4_get_group_no_and_offset(). With this change it completely disappears from the list only bumping the occurrence of ext4_init_block_bitmap() which is the biggest user of ext4_block_in_group() by 4%. As the result of this change on my system the fallocate call was approx. 10% faster. However since there is '-g' option in mkfs which allow us setting different groups size (mostly for developers) I've introduced new per file system flag whether we have a standard block group size or not. The flag is used to determine whether we can use the bit shift optimization or not. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03ext4: unregister es_shrinker if mount failedDmitry Monakhov
Otherwise destroyed ext_sb_info will be part of global shinker list and result in the following OOPS: JBD2: corrupted journal superblock JBD2: recovery failed EXT4-fs (dm-2): error loading journal general protection fault: 0000 [#1] SMP Modules linked in: fuse acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode sg button sd_mod crc_t10dif ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_\ mod CPU 1 Pid: 2758, comm: mount Not tainted 3.8.0-rc3+ #136 /DH55TC RIP: 0010:[<ffffffff811bfb2d>] [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0 RSP: 0000:ffff88011d5cbcd8 EFLAGS: 00010207 RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b53 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000246 RBP: ffff88011d5cbce8 R08: 0000000000000002 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000000 R12: ffff88011cd3f848 R13: ffff88011cd3f830 R14: ffff88011cd3f000 R15: 0000000000000000 FS: 00007f7b721dd7e0(0000) GS:ffff880121a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007fffa6f75038 CR3: 000000011bc1c000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process mount (pid: 2758, threadinfo ffff88011d5ca000, task ffff880116aacb80) Stack: ffff88011cd3f000 ffffffff8209b6c0 ffff88011d5cbd18 ffffffff812482f1 00000000000003f3 00000000ffffffea ffff880115f4c200 0000000000000000 ffff88011d5cbda8 ffffffff81249381 ffff8801219d8bf8 ffffffff00000000 Call Trace: [<ffffffff812482f1>] deactivate_locked_super+0x91/0xb0 [<ffffffff81249381>] mount_bdev+0x331/0x340 [<ffffffff81376730>] ? ext4_alloc_flex_bg_array+0x180/0x180 [<ffffffff81362035>] ext4_mount+0x15/0x20 [<ffffffff8124869a>] mount_fs+0x9a/0x2e0 [<ffffffff81277e25>] vfs_kern_mount+0xc5/0x170 [<ffffffff81279c02>] do_new_mount+0x172/0x2e0 [<ffffffff8127aa56>] do_mount+0x376/0x380 [<ffffffff8127ab98>] sys_mount+0x138/0x150 [<ffffffff818ffed9>] system_call_fastpath+0x16/0x1b Code: 8b 05 88 04 eb 00 48 3d 90 ff 06 82 48 8d 58 e8 75 19 4c 89 e7 e8 e4 d7 2c 00 48 c7 c7 00 ff 06 82 e8 58 5f ef ff 5b 41 5c c9 c3 <48> 8b 4b 18 48 8b 73 20 48 89 da 31 c0 48 c7 c7 c5 a0 e4 81 e\ 8 RIP [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0 RSP <ffff88011d5cbcd8> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-04-03ext4: fix journal callback list traversalDmitry Monakhov
It is incorrect to use list_for_each_entry_safe() for journal callback traversial because ->next may be removed by other task: ->ext4_mb_free_metadata() ->ext4_mb_free_metadata() ->ext4_journal_callback_del() This results in the following issue: WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250() Hardware name: list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod Pid: 16400, comm: jbd2/dm-1-8 Tainted: G W 3.8.0-rc3+ #107 Call Trace: [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0 [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50 [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0 [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250 [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0 [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570 [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0 [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0 [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0 [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40 [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80 [<ffffffff810ac6be>] kthread+0x10e/0x120 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70 [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70 This patch fix the issue as follows: - ext4_journal_commit_callback() make list truly traversial safe simply by always starting from list_head - fix race between two ext4_journal_callback_del() and ext4_journal_callback_try_del() Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.com
2013-03-21Merge tag 'ext4_for_linue' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Fix a number of regression and other bugs in ext4, most of which were relatively obscure cornercases or races that were found using regression tests." * tag 'ext4_for_linue' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits) ext4: fix data=journal fast mount/umount hang ext4: fix ext4_evict_inode() racing against workqueue processing code ext4: fix memory leakage in mext_check_coverage ext4: use s_extent_max_zeroout_kb value as number of kb ext4: use atomic64_t for the per-flexbg free_clusters count jbd2: fix use after free in jbd2_journal_dirty_metadata() ext4: reserve metadata block for every delayed write ext4: update reserved space after the 'correction' ext4: do not use yield() ext4: remove unused variable in ext4_free_blocks() ext4: fix WARN_ON from ext4_releasepage() ext4: fix the wrong number of the allocated blocks in ext4_split_extent() ext4: update extent status tree after an extent is zeroed out ext4: fix wrong m_len value after unwritten extent conversion ext4: add self-testing infrastructure to do a sanity check ext4: avoid a potential overflow in ext4_es_can_be_merged() ext4: invalidate extent status tree during extent migration ext4: remove unnecessary wait for extent conversion in ext4_fallocate() ext4: add warning to ext4_convert_unwritten_extents_endio ext4: disable merging of uninitialized extents ...
2013-03-12fs: Readd the fs module aliases.Eric W. Biederman
I had assumed that the only use of module aliases for filesystems prior to "fs: Limit sys_mount to only request filesystem modules." was in request_module. It turns out I was wrong. At least mkinitcpio in Arch linux uses these aliases. So readd the preexising aliases, to keep from breaking userspace. Userspace eventually will have to follow and use the same aliases the kernel does. So at some point we may be delete these aliases without problems. However that day is not today. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-11ext4: use atomic64_t for the per-flexbg free_clusters countTheodore Ts'o
A user who was using a 8TB+ file system and with a very large flexbg size (> 65536) could cause the atomic_t used in the struct flex_groups to overflow. This was detected by PaX security patchset: http://forums.grsecurity.net/viewtopic.php?f=3&t=3289&p=12551#p12551 This bug was introduced in commit 9f24e4208f7e, so it's been around since 2.6.30. :-( Fix this by using an atomic64_t for struct orlav_stats's free_clusters. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Cc: stable@vger.kernel.org
2013-03-03fs: Limit sys_mount to only request filesystem modules.Eric W. Biederman
Modify the request_module to prefix the file system type with "fs-" and add aliases to all of the filesystems that can be built as modules to match. A common practice is to build all of the kernel code and leave code that is not commonly needed as modules, with the result that many users are exposed to any bug anywhere in the kernel. Looking for filesystems with a fs- prefix limits the pool of possible modules that can be loaded by mount to just filesystems trivially making things safer with no real cost. Using aliases means user space can control the policy of which filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf with blacklist and alias directives. Allowing simple, safe, well understood work-arounds to known problematic software. This also addresses a rare but unfortunate problem where the filesystem name is not the same as it's module name and module auto-loading would not work. While writing this patch I saw a handful of such cases. The most significant being autofs that lives in the module autofs4. This is relevant to user namespaces because we can reach the request module in get_fs_type() without having any special permissions, and people get uncomfortable when a user specified string (in this case the filesystem type) goes all of the way to request_module. After having looked at this issue I don't think there is any particular reason to perform any filtering or permission checks beyond making it clear in the module request that we want a filesystem module. The common pattern in the kernel is to call request_module() without regards to the users permissions. In general all a filesystem module does once loaded is call register_filesystem() and go to sleep. Which means there is not much attack surface exposed by loading a filesytem module unless the filesystem is mounted. In a user namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT, which most filesystems do not set today. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Reported-by: Kees Cook <keescook@google.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-02Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bug fixes from Ted Ts'o: "Various bug fixes for ext4. The most important is a fix for the new extent cache's slab shrinker which can cause significant, user-visible pauses when the system is under memory pressure." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: enable quotas before orphan cleanup ext4: don't allow quota mount options when quota feature enabled ext4: fix a warning from sparse check for ext4_dir_llseek ext4: convert number of blocks to clusters properly ext4: fix possible memory leak in ext4_remount() jbd2: fix ERR_PTR dereference in jbd2__journal_start ext4: use percpu counter for extent cache count ext4: optimize ext4_es_shrink()
2013-03-02ext4: enable quotas before orphan cleanupJan Kara
When using quota feature we need to enable quotas before orphan cleanup so that changes happening during it are properly reflected in quota accounting. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-02ext4: don't allow quota mount options when quota feature enabledJan Kara
So far we silently ignored when quota mount options were set while quota feature was enabled. But this can create confusion in userspace when mount options are set but silently ignored and also creates opportunities for bugs when we don't properly test all quota types. Actually ext4_mark_dquot_dirty() forgets to test for quota feature so it was dependent on journaled quota options being set. OTOH ext4_orphan_cleanup() tries to enable journaled quota when quota options are specified which is wrong when quota feature is enabled. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-02ext4: convert number of blocks to clusters properlyLukas Czerner
We're using macro EXT4_B2C() to convert number of blocks to number of clusters for bigalloc file systems. However, we should be using EXT4_NUM_B2C(). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-03-02ext4: fix possible memory leak in ext4_remount()Wei Yongjun
'orig_data' is malloced in ext4_remount() and should be freed before leaving from the error handling cases, otherwise it will cause memory leak. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Cc: stable@vger.kernel.org
2013-03-02ext4: use percpu counter for extent cache countTheodore Ts'o
Use a percpu counter rather than atomic types for shrinker accounting. There's no need for ultimate accuracy in the shrinker, so this should come a little more cheaply. The percpu struct is somewhat large, but there was a big gap before the cache-aligned s_es_lru_lock anyway, and it fits nicely in there. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle *and* ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ...
2013-02-22new helper: file_inode(file)Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-18ext4: reclaim extents from extent status treeZheng Liu
Although extent status is loaded on-demand, we also need to reclaim extent from the tree when we are under a heavy memory pressure because in some cases fragmented extent tree causes status tree costs too much memory. Here we maintain a lru list in super_block. When the extent status of an inode is accessed and changed, this inode will be move to the tail of the list. The inode will be dropped from this list when it is cleared. In the inode, a counter is added to count the number of cached objects in extent status tree. Here only written/unwritten/hole extent is counted because delayed extent doesn't be reclaimed due to fiemap, bigalloc and seek_data/hole need it. The counter will be increased as a new extent is allocated, and it will be decreased as a extent is freed. In this commit we use normal shrinker framework to reclaim memory from the status tree. ext4_es_reclaim_extents_count() traverses the lru list to count the number of reclaimable extents. ext4_es_shrink() tries to reclaim written/unwritten/hole extents from extent status tree. The inode that has been shrunk is moved to the tail of lru list. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan kara <jack@suse.cz>
2013-02-18ext4: remove single extent cacheZheng Liu
Single extent cache could be removed because we have extent status tree as a extent cache, and it would be better. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan kara <jack@suse.cz>
2013-02-08ext4: pass context information to jbd2__journal_start()Theodore Ts'o
So we can better understand what bits of ext4 are responsible for long-running jbd2 handles, use jbd2__journal_start() so we can pass context information for logging purposes. The recommended way for finding the longer-running handles is: T=/sys/kernel/debug/tracing EVENT=$T/events/jbd2/jbd2_handle_stats echo "interval > 5" > $EVENT/filter echo 1 > $EVENT/enable ./run-my-fs-benchmark cat $T/trace > /tmp/problem-handles This will list handles that were active for longer than 20ms. Having longer-running handles is bad, because a commit started at the wrong time could stall for those 20+ milliseconds, which could delay an fsync() or an O_SYNC operation. Here is an example line from the trace file describing a handle which lived on for 311 jiffies, or over 1.2 seconds: postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32 tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1 dirtied_blocks 0 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-08ext4: move the jbd2 wrapper functions out of super.cTheodore Ts'o
Move the jbd2 wrapper functions which start and stop handles out of super.c, where they don't really logically belong, and into ext4_jbd2.c. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02ext4: check incompatible mount options while mounting ext2/3Theodore Ts'o
Check for incompatible mount options when using the ext4 file system driver to mount ext2 or ext3 file systems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02ext4: print error when argument of inode_readahead_blk is invalidJan Kara
If argument of inode_readahead_blk is too big, we just bail out without printing any error. Fix this since it could confuse users. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02ext4: make mount option parsing loop more logicalJan Kara
The loop looking for correct mount option entry is more logical if it is written rewritten as an empty loop looking for correct option entry and then code handling the option. It also saves one level of indentation for a lot of code so we can join a couple of split lines. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02ext4: move several mount options to standard handling loopJan Kara
Several mount option (resuid, resgid, journal_dev, journal_ioprio) are currently handled before we enter standard option handling loop. I don't see a reason for this so move them to normal handling loop to make things more regular. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28ext4: remove unnecessary NULL pointer checkGuo Chao
brelse() and ext4_journal_force_commit() are both inlined and able to handle NULL. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28ext4: move work from io_end to inodeJan Kara
It does not make much sense to have struct work in ext4_io_end_t because we always use it for only one ext4_io_end_t per inode (the first one in the i_completed_io list). So just move the structure to inode itself. This also allows for a small simplification in processing io_end structures. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28ext4: Always use ext4_bio_write_page() for writeoutJan Kara
Currently we sometimes used block_write_full_page() and sometimes ext4_bio_write_page() for writeback (depending on mount options and call path). Let's always use ext4_bio_write_page() to simplify things a bit. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-24ext4: fix memory leak when quota options are specified multiple timesChen Gang
When usrjquota or grpjquota mount options are specified several times, we leak memory storing the names. Free the memory correctly. Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-01-24ext4: release sysfs kobject when failing to enable quotas on mountTheodore Ts'o
In addition, print the error returned from ext4_enable_quotas() Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Cc: stable@vger.kernel.org
2013-01-13ext4: trigger the lazy inode table initialization after resizeTheodore Ts'o
After we have finished extending the file system, we need to trigger a the lazy inode table thread to zero out the inode tables. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-27ext4: lock i_mutex when truncating orphan inodesTheodore Ts'o
Commit c278531d39 added a warning when ext4_flush_unwritten_io() is called without i_mutex being taken. It had previously not been taken during orphan cleanup since races weren't possible at that point in the mount process, but as a result of this c278531d39, we will now see a kernel WARN_ON in this case. Take the i_mutex in ext4_orphan_cleanup() to suppress this warning. Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Cc: stable@vger.kernel.org