lwn.git - Linux kernel documentation tree maintained by Jonathan Corbet

Age	Commit message	Author
2010-05-13	fs: Resolve mntput_no_expire issues.	john stultz
	In testing the mnt_count typo fix, I hit a few BUG_ON/WARN_ON messages in the mntput_no_expire code. The first issue was a race against the MNT_MOUNTED flag: if someone changes the value after the optimistic lock-free check is done, we might BUG_ON after getting the lock. The fix is to re-check the MNT_MOUNTED bit after getting the lock, and to drop the lock and try again if it has changed. The second issue was a call to smp_processor_id() in add_mnt_count() that was done while preemptible. This was missed in my earlier commit 070976b5b038218900648ea4cc88786d5dfcd58d. Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Clark Williams <williams@redhat.com> Cc: Darren Hart <dvhltc@us.ibm.com> Cc: Nick Piggin <npiggin@suse.de> LKML-Reference: <1273711934.2856.22.camel@localhost.localdomain> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
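The re-check described in the mntput_no_expire fix above can be illustrated with a small userspace sketch. This is not the kernel code: `struct mnt_sketch`, `MNT_MOUNTED_SKETCH`, `put_ref_sketch()` and the spin lock built on `atomic_flag` are all hypothetical stand-ins, assumed here only to show the "check without the lock, take the lock, check again, retry on mismatch" shape.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MNT_MOUNTED_SKETCH 0x1 /* hypothetical flag bit, not the kernel's value */

struct mnt_sketch {
    atomic_int flags;  /* may be changed by other threads at any time */
    atomic_flag lock;  /* stand-in for the vfsmount write lock */
    int count;         /* protected by lock */
};

void mnt_sketch_init(struct mnt_sketch *m, int flags, int count)
{
    atomic_init(&m->flags, flags);
    atomic_flag_clear(&m->lock);
    m->count = count;
}

static void sketch_lock(struct mnt_sketch *m)
{
    while (atomic_flag_test_and_set(&m->lock))
        ; /* spin until the lock is free */
}

static void sketch_unlock(struct mnt_sketch *m)
{
    atomic_flag_clear(&m->lock);
}

/*
 * Drop one reference. Returns true only when the mount was seen as
 * unmounted both before and after taking the lock and the count hit zero.
 */
bool put_ref_sketch(struct mnt_sketch *m)
{
    for (;;) {
        /* Optimistic, lock-free look at the flag. */
        int mounted = atomic_load(&m->flags) & MNT_MOUNTED_SKETCH;

        sketch_lock(m);
        /* Re-check under the lock: the flag may have changed meanwhile. */
        if ((atomic_load(&m->flags) & MNT_MOUNTED_SKETCH) != mounted) {
            sketch_unlock(m);
            continue; /* drop the lock and try again */
        }
        assert(m->count > 0);
        m->count--;
        bool last = !mounted && m->count == 0;
        sketch_unlock(m);
        return last;
    }
}
```

The point of the retry is that a decision made before the lock was held is only trusted once it has been confirmed while holding the lock.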
2010-05-13	fs: Fix mnt_count typo	john stultz
	Clark noticed the following snippet in commit 070976b5b038218900648ea4cc88786d5dfcd58d: if (mnt->mnt_pinned) { - inc_mnt_count(mnt); + preempt_disable(); + dec_mnt_count(mnt); + preempt_enable(); mnt->mnt_pinned--; } vfsmount_write_unlock(); I accidentally replaced an inc_mnt_count() with a dec_mnt_count(). The issue went unnoticed, as the only user of mnt_unpin is the acct syscall. This patch corrects the mistake. Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Clark Williams <williams@redhat.com> Cc: Darren Hart <dvhltc@us.ibm.com> LKML-Reference: <1273711544.2856.15.camel@localhost.localdomain> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-05-13	Merge branch '2.6.33.4' into rt/2.6.33	Thomas Gleixner
	Conflicts: Makefile Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-05-12	xfs: add a shrinker to background inode reclaim	Dave Chinner
	commit 9bf729c0af67897ea8498ce17c29b0683f7f2028 upstream. On low memory boxes or those with highmem, the kernel can OOM before the background reclaims inodes via xfssyncd. Add a shrinker to run inode reclaim so that inode reclaim is expedited when memory is low. This is more complex than it needs to be because the VM folk don't want a context added to the shrinker infrastructure. Hence we need to add a global list of XFS mount structures so the shrinker can traverse them. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Alex Elder <aelder@sgi.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-12	jfs: fix diAllocExt error in resizing filesystem	Bill Pemberton
	commit 2b0b39517d1af5294128dbc2fd7ed39c8effa540 upstream. Resizing the filesystem would result in a diAllocExt error in some instances because changes in bmp->db_agsize would not get noticed if goto extendBmap was called. Signed-off-by: Bill Pemberton <wfp5p@virginia.edu> Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: jfs-discussion@lists.sourceforge.net Cc: linux-kernel@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-12	ext4: correctly calculate number of blocks for fiemap	Leonard Michlmayr
	commit aca92ff6f57c000d1b4523e383c8bd6b8269b8b1 upstream. ext4_fiemap() rounds the length of the requested range down to blocksize, which is not the true number of blocks that cover the requested region. This problem is especially impressive if the user requests only the first byte of a file: not a single extent will be reported. We fix this by calculating the last block of the region and then subtracting to find the number of blocks in the extents. Signed-off-by: Leonard Michlmayr <leonard.michlmayr@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
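The arithmetic behind the fiemap fix above can be shown with a short worked example. The function names `blocks_rounded_down()` and `blocks_covering()` are hypothetical and this is only a sketch of the calculation the message describes (find the first and last block of the byte range, then subtract), not the ext4 source.

```c
#include <assert.h>
#include <stdint.h>

/* The rounding-down approach the commit message describes as wrong. */
uint64_t blocks_rounded_down(uint64_t len, unsigned int blkbits)
{
    return len >> blkbits;
}

/*
 * Number of blocks that overlap the byte range [start, start + len).
 * blkbits is log2 of the block size; len must be non-zero.
 */
uint64_t blocks_covering(uint64_t start, uint64_t len, unsigned int blkbits)
{
    assert(len > 0);
    uint64_t first_blk = start >> blkbits;
    uint64_t last_blk = (start + len - 1) >> blkbits;

    return last_blk - first_blk + 1;
}
```

With 4096-byte blocks (`blkbits` of 12), a request for the first byte of a file gives zero blocks when rounded down but one block when computed from the last block, and a two-byte request that straddles a block boundary correctly gives two.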
2010-05-12	NFS: rsize and wsize settings ignored on v4 mounts	Chuck Lever
	commit 356e76b855bdbfd8d1c5e75bcf0c6bf0dfe83496 upstream. NFSv4 mounts ignore the rsize and wsize mount options, and always use the default transfer size for both. This seems to be because all NFSv4 mounts are now cloned, and the cloning logic doesn't copy the rsize and wsize settings from the parent nfs_server. I tested Fedora's 2.6.32.11-99 and it seems to have this problem as well, so I'm guessing that .33, .32, and perhaps older kernels have this issue as well. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-12	nfs d_revalidate() is too trigger-happy with d_drop()	Al Viro
	commit d9e80b7de91db05c1c4d2e5ebbfd70b3b3ba0e0f upstream. If dentry found stale happens to be a root of disconnected tree, we can't d_drop() it; its d_hash is actually part of s_anon and d_drop() would simply hide it from shrink_dcache_for_umount(), leading to all sorts of fun, including busy inodes on umount and oopsen after that. Bug had been there since at least 2006 (commit c636eb already has it), so it's definitely -stable fodder. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-12	ocfs2_dlmfs: Fix math error when reading LVB.	Joel Becker
	commit a36d515c7a2dfacebcf41729f6812dbc424ebcf0 upstream. When asked for a partial read of the LVB in a dlmfs file, we can accidentally calculate a negative count. Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-12	ocfs2: Compute metaecc for superblocks during online resize.	Joel Becker
	commit a42ab8e1a37257da37e0f018e707bf365ac24531 upstream. Online resize writes out the new superblock and its backups directly. The metaecc data wasn't being recomputed. Let's do that directly. Signed-off-by: Joel Becker <joel.becker@oracle.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-12	ocfs2: potential ERR_PTR dereference on error paths	Dan Carpenter
	commit 0350cb078f5035716ebdad4ad4709d02fe466a8a upstream. If "handle" is non null at the end of the function then we assume it's a valid pointer and pass it to ocfs2_commit_trans(); Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-12	ocfs2: Update VFS inode's id info after reflink.	Tao Ma
	commit c21a534e2f24968cf74976a4e721ac194db30ded upstream. In reflink we update the id info on the disk but forgot to update the corresponding information in the VFS inode. Update them accordingly when we want to preserve the attributes. Reported-by: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-12	nfsd4: bug in read_buf	Neil Brown
	commit 2bc3c1179c781b359d4f2f3439cb3df72afc17fc upstream. When read_buf is called to move over to the next page in the pagelist of an NFSv4 request, it sets argp->end to essentially a random number, certainly not an address within the page which argp->p now points to. So subsequent calls to READ_BUF will think there is much more than a page of spare space (the cast to u32 ensures an unsigned comparison) so we can expect to fall off the end of the second page. We never encountered this in testing because typically the only operations which use more than two pages are write-like operations, which have their own decoding logic. Something like a getattr after a write may cross a page boundary, but it would be very unusual for it to cross another boundary after that. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-12	procfs: fix tid fdinfo	Jerome Marchand
	commit 3835541dd481091c4dbf5ef83c08aed12e50fd61 upstream. Correct the file_operations struct in fdinfo entry of tid_base_stuff[]. Presently /proc/*/task/*/fdinfo contains symlinks to opened files like /proc/*/fd/. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Miklos Szeredi <mszeredi@suse.cz> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-12	reiserfs: fix corruption during shrinking of xattrs	Jeff Mahoney
	commit fb2162df74bb19552db3d988fd11c787cf5fad56 upstream. Commit 48b32a3553a54740d236b79a90f20147a25875e3 ("reiserfs: use generic xattr handlers") introduced a problem that causes corruption when extended attributes are replaced with a smaller value. The issue is that the reiserfs_setattr to shrink the xattr file was moved from before the write to after the write. The root issue has always been in the reiserfs xattr code, but was papered over by the fact that in the shrink case, the file would just be expanded again while the xattr was written. The end result is that the last 8 bytes of xattr data are lost. This patch fixes it to use new_size. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=14826 Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reported-by: Christian Kujau <lists@nerdbynature.de> Tested-by: Christian Kujau <lists@nerdbynature.de> Cc: Edward Shishkin <edward.shishkin@gmail.com> Cc: Jethro Beekman <kernel@jbeekman.nl> Cc: Greg Surbey <gregsurbey@hotmail.com> Cc: Marco Gatti <marco.gatti@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-12	reiserfs: fix permissions on .reiserfs_priv	Jeff Mahoney
	commit cac36f707119b792b2396aed371d6b5cdc194890 upstream. Commit 677c9b2e393a0cd203bd54e9c18b012b2c73305a ("reiserfs: remove privroot hiding in lookup") removed the magic from the lookup code to hide the .reiserfs_priv directory since it was getting loaded at mount-time instead. The intent was that the entry would be hidden from the user via a poisoned d_compare, but this was faulty. This introduced a security issue where unprivileged users could access and modify extended attributes or ACLs belonging to other users, including root. This patch resolves the issue by properly hiding .reiserfs_priv. This was the intent of the xattr poisoning code, but it appears to have never worked as expected. This is fixed by using d_revalidate instead of d_compare. This patch makes -oexpose_privroot a no-op. I'm fine leaving it this way. The effort involved in working out the corner cases wrt permissions and caching outweigh the benefit of the feature. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Acked-by: Edward Shishkin <edward.shishkin@gmail.com> Reported-by: Matt McCutchen <matt@mattmccutchen.net> Tested-by: Matt McCutchen <matt@mattmccutchen.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-11	autofs4: Remove another autofs4_lock deadlock	John Stultz
	Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-05-10	autofs: Remove deadlock	john stultz
	Apparently the conversion from using the dcache_lock -> autofs4_lock forgot that this function already grabs the autofs4_lock for a small moment, so we end up grabbing the lock, then a moment later grab it again. Splat. Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nick Piggin <npiggin@suse.de> Cc: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> LKML-Reference: <1273279153.2776.7.camel@localhost.localdomain> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-05-02	fs: Add missing parentheses	Olaf Hering
	Fix for this compile warning: fs/namespace.c:757: warning: suggest parentheses around operand of '!' or change '&' to '&&' or '!' to '~' Signed-off-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-05-02	fs: Prevent dput race	Thomas Gleixner
	dput() drops dentry->d_lock when it fails to lock inode->i_lock or parent->d_lock. dentry->d_count is 0 at this point so the dentry can be killed and freed by someone else. This leaves dput with a stale pointer in the retry code which results in interesting kernel crashes. Prevent this by incrementing dentry->d_count before dropping the lock. Go back to start after dropping the lock so d_count is decremented again. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-30	fs: Use s_inodes not s_files for inode lists	Thomas Gleixner
	The VFS scalability rework broke UP due to a stupid typo which enqueued inodes on the file list. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-29	fs: Fix namespace related hangs	John Stultz
	Nick converted the dentry->d_mounted counter to a flag, however with namespaces, dentries can be mounted multiple times (and more importantly unmounted multiple times). If a namespace was created and then released, the unmount_tree would remove the DCACHE_MOUNTED flag and that would make d_mountpoint fail, causing the mounts to be lost. This patch converts it back to a counter, and adds some extra WARN_ONs to make sure things are accounted properly. Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: "Luis Claudio R. Goncalves" <lclaudio@uudg.org> Cc: Nick Piggin <npiggin@suse.de> LKML-Reference: <1272522942.1967.12.camel@work-vm> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
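Why a single flag loses information here, while a counter does not, can be seen in a tiny sketch. The struct and helpers below (`struct dentry_sketch`, `mount_on()`, `unmount_from()`) are hypothetical and only model the bookkeeping described in the message above; they are not the dcache code.

```c
#include <assert.h>
#include <stdbool.h>

struct dentry_sketch {
    bool mounted_flag; /* flag-style tracking */
    int mounted_count; /* counter-style tracking */
};

/* Record one more mount on this dentry in both styles. */
void mount_on(struct dentry_sketch *d)
{
    d->mounted_flag = true;
    d->mounted_count++;
}

/* Record one unmount; the flag cannot remember other mounts remain. */
void unmount_from(struct dentry_sketch *d)
{
    assert(d->mounted_count > 0);
    d->mounted_flag = false;
    d->mounted_count--;
}

bool is_mountpoint_by_flag(const struct dentry_sketch *d)
{
    return d->mounted_flag;
}

bool is_mountpoint_by_count(const struct dentry_sketch *d)
{
    return d->mounted_count > 0;
}
```

After two mounts and one unmount (for example, a namespace being created and released), the flag-based check reports that nothing is mounted, whereas the counter still reports one remaining mount.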
2010-04-29	xfs: Make i_count access non-atomic	John Stultz
	i_count is no longer atomic. Fix up the leftover. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-28	fs: Fix d_count fallout	Thomas Gleixner
	d_count got converted to int and back to atomic_t. Two instances were missed in the backward conversion. Fix them up. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-28	fs: namespace: Fix MNT_MOUNTED handling for cloned rootfs	John Stultz
	We don't call attach_mnt on a cloned rootfs so set the MNT_MOUNTED flag in copy_tree(). Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-28	fs: namespace: Make put_mnt_ns rt aware	Thomas Gleixner
	On RT the lock() inside the preempt disabled region of get_cpu_var() results in a might sleep warning. Restructure the code and check the atomic transition to 0 open coded to avoid vfsmount_write_lock() in the case when ns->count is > 1. If ns->count == 1 then do the atomic decrement under full locking of namespace_sem and vfsmount_write_lock(). In most cases the atomic_dec_and_test() will have dropped ns->count to 0 so we need the full locking anyway. Based on a patch from John Stultz Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
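The "decrement without locking unless this would be the final reference" idea in the message above can be sketched with C11 atomics. This is an assumption-laden userspace illustration: `dec_unless_one()` and `put_ns_sketch()` are hypothetical names, and a single `atomic_flag` spin lock stands in for the combination of namespace_sem and vfsmount_write_lock().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Decrement *count unless it is exactly 1. Returns true if it decremented. */
bool dec_unless_one(atomic_int *count)
{
    int old = atomic_load(count);

    while (old != 1) {
        assert(old > 0);
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(count, &old, old - 1))
            return true;
    }
    return false;
}

/* Drop a reference. Returns true when the caller dropped the last one. */
bool put_ns_sketch(atomic_int *count, atomic_flag *lock)
{
    /* Fast path: other references remain, so no lock is needed. */
    if (dec_unless_one(count))
        return false;

    /* Slow path: possibly the last reference, so decide under the lock. */
    while (atomic_flag_test_and_set(lock))
        ; /* spin */
    bool last = (atomic_fetch_sub(count, 1) == 1);
    /* Teardown of the namespace would happen here when last is true. */
    atomic_flag_clear(lock);
    return last;
}
```

The fast path never sleeps or takes a lock, which is the property the RT change is after; only the rare transition from 1 to 0 pays for full locking.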
2010-04-28	fs: namespace: Fix potential deadlock	John Stultz
	do_unmount() does a lock() instead of unlock() in a return path which will lead to a deadlock when this code path is taken. Fix the typo. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	Fixup some compilation warnings and errors	John Stultz
	Amit Arora noticed some compile issues with coda, and an fs.h include issue, so this patch fixes those along with btrfs warnings. Thanks to Amit for the testing! Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	Remove j_state_lock usage in jbd2_journal_stop()	tytso@mit.edu
	On Wed, Apr 07, 2010 at 04:21:18PM -0700, john stultz wrote: > Further using lockstat I was able to isolate the contention down to > the journal j_state_lock, and then adding some lock owner tracking, I > was able to see that the lock owners were almost always in > start_this_handle, and jbd2_journal_stop when we saw contention (with > the freq breakdown being about 55% in jbd2_journal_stop and 45% in > start_this_handle). Hmm.... I've taken a very close look at jbd2_journal_stop(), and I don't think we need to take j_state_lock() at all except if we need to call jbd2_log_start_commit(). t_outstanding_credits, h_buffer_credits, and t_updates are all documented (and verified by me) to be protected by the t_handle_lock spinlock. So I *think* the following might be safe. WARNING! WARNING!! No real testing done on this patch, other than "it compiles! ship it!!". I'll let other people review it, and maybe you could give this a run and see what happens with this patch? - Ted Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	Revert Nick's fs-scale-pseudo	John Stultz
	After adding an xfs partition to my system, I started seeing boot time NULL pointer oopses, and bisected it down to the fs-scale-pseudo change. Not sure what the right fix is, but this change avoids the issue. Here's the bug I was seeing on boot:
	BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
	IP: [<ffffffff81103d42>] link_path_walk+0xd12/0xda0
	PGD 42b12e067 PUD 42cb2a067 PMD 0
	Oops: 0000 [#1] PREEMPT SMP
	last sysfs file: /sys/block/md0/dev
	CPU 7
	Pid: 2993, comm: vgs Not tainted 2.6.33-rc8john #272 Server Blade/IBM eServer BladeCenter HS21 -[7995AC1]-
	RIP: 0010:[<ffffffff81103d42>] [<ffffffff81103d42>] link_path_walk+0xd12/0xda0
	RSP: 0018:ffff88042a929b78 EFLAGS: 00010246
	RAX: 0000000000000000 RBX: ffff88042ab41000 RCX: ffff88042ab41028
	RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88042aa0fcc0
	RBP: ffff88042a929c28 R08: ffff88042aa0fcc0 R09: 0000000000000000
	R10: 0000000000000001 R11: 0000000000000000 R12: ffff88042c6a40b0
	R13: 0000000000000000 R14: 0000000000000000 R15: ffff88042a929dc8
	FS: 00007f6f8c481710(0000) GS:ffff8800283c0000(0000) knlGS:0000000000000000
	CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
	CR2: 0000000000000030 CR3: 000000042b310000 CR4: 00000000000006e0
	DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
	DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
	Process vgs (pid: 2993, threadinfo ffff88042a928000, task ffff88042ab41000)
	Stack:
	ffff88042ab41000 ffff88042ab41000 ffff88042ab41000 ffff88042ab41000
	<0> 0000000100000000 ffff88042a929de8 ffff880400000000 0000000000000000
	<0> ffff88042f6b5610 0000000000000000 0000000000000000 ffff88042f418920
	Call Trace:
	[<ffffffff811006c2>] ? path_get+0x32/0x50
	[<ffffffff81103c50>] link_path_walk+0xc20/0xda0
	[<ffffffff811006c2>] ? path_get+0x32/0x50
	[<ffffffff81103f7c>] path_walk+0x5c/0xd0
	[<ffffffff811041de>] do_path_lookup+0x1ee/0x250
	[<ffffffff81103ff0>] ? do_path_lookup+0x0/0x250
	[<ffffffff81104ebb>] user_path_at+0x7b/0xb0
	[<ffffffff81112bb1>] ? vfsmount_read_unlock+0x31/0x60
	[<ffffffff81114788>] ? mntput_no_expire+0x48/0x190
	[<ffffffff810fb293>] ? cp_new_stat+0xe3/0xf0
	[<ffffffff810fb4ac>] vfs_fstatat+0x3c/0x80
	[<ffffffff810fb616>] vfs_stat+0x16/0x20
	[<ffffffff810fb63f>] sys_newstat+0x1f/0x50
	[<ffffffff81994a33>] ? lockdep_sys_exit_thunk+0x35/0x67
	[<ffffffff810025eb>] system_call_fastpath+0x16/0x1b
	Code: ec e8 93 c8 ff ff 0f 1f 00 e9 46 ff ff ff 41 83 7f 34 04 66 0f 1f 44 00 00 0f 85 38 ff ff ff 4d 8b 67 08 49 8b 84 24 b8 00 00 00 <48> 8b 40 30 f6 40 09 40 0f 84 1e ff ff ff 49 8b 44 24 70 4c 89
	RIP [<ffffffff81103d42>] link_path_walk+0xd12/0xda0
	RSP <ffff88042a929b78>
	CR2: 0000000000000030
	Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
	---[ end trace 0dd94d94b1b27094 ]---
	Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	Call synchronize_rcu in unregister_filesystem	John Stultz
	Quoting Nick: "BTW there are a few issues Al pointed out. We have to synchronize RCU after unregistering a filesystem so d_ops/i_ops doesn't go away, and mntput can sleep so we can't do it under RCU read lock." This patch simply calls synchronize_rcu in unregister_filesystem to avoid this issue. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	Make sure MNT_MOUNTED isn't cleared on remount	John Stultz
	Originally found by Anton Blanchard, this patch makes sure we keep the MNT_MOUNTED flag set in do_remount(). Without this scalability suffers pretty badly. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	Revert d_count back to an atomic_t	John Stultz
	This patch reverts the portion of Nick's vfs scalability patch that converts the dentry d_count from an atomic_t to an int protected by the d_lock. This greatly improves vfs scalability with the -rt kernel, as the extra lock contention on the d_lock hurts very badly when CONFIG_PREEMPT_RT is enabled and the spinlocks become rtmutexes. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	Fixup get_cpu_var holds over spinlock() calls.	John Stultz
	In Nick's patches, there are a few spots that use get_cpu_var to access a per-cpu spinlock. However, the put_cpu_var isn't called until after the lock is acquired and released. This causes might-sleep warnings with -rt. Move the put_cpu_var above the spin_lock/unlock call to avoid this. Not sure if this is 100% right, but seems to work. Not sure what holding the get does on the lock, since once we have the lock, the reference shouldn't change. Other users of the same lock don't bother with the get_cpu_var method and just use per_cpu. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	Fix inc/dec_mnt_count for -rt	John Stultz
	With Nick's vfs patches, inc/dec_mnt_count use per-cpu counters, so this patch makes sure we disable preemption before calling. It's not a great fix, but works because count_mnt_count() sums all the percpu values, so each one individually doesn't need to be 0'ed out. I suspect the better fix for -rt is to revert the mnt_count back to an atomic counter. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
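The remark above that each per-cpu value does not need to return to zero can be illustrated with a plain array standing in for per-cpu storage. Everything here (`struct pcpu_count_sketch`, `NR_CPUS_SKETCH`, the helper names) is hypothetical; the sketch only shows that the sum is what carries meaning.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4 /* hypothetical CPU count for the sketch */

struct pcpu_count_sketch {
    long per_cpu[NR_CPUS_SKETCH]; /* one slot per CPU, may go negative */
};

/* Increment the slot of the CPU the caller is (assumed to be) running on. */
void inc_count_on(struct pcpu_count_sketch *c, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
    c->per_cpu[cpu]++;
}

/* Decrement may happen on a different CPU than the matching increment. */
void dec_count_on(struct pcpu_count_sketch *c, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
    c->per_cpu[cpu]--;
}

/* The real count is the sum over all slots. */
long count_total(const struct pcpu_count_sketch *c)
{
    long sum = 0;

    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        sum += c->per_cpu[cpu];
    return sum;
}
```

A reference taken on CPU 0 and dropped on CPU 3 leaves the slots at +1 and -1, yet the total is still zero. What the sketch cannot show is why preemption must be disabled in the kernel: the read-modify-write of one slot has to complete on the CPU that owns it.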
2010-04-27	Fix vfsmount_read_lock to work with -rt	John Stultz
	Because vfsmount_read_lock acquires the vfsmount spinlock for the current cpu, it causes problems with -rt, as you might migrate between cpus between a lock and unlock. This patch fixes the issue by having the caller pick a cpu, then consistently use that cpu between the lock and unlock. We may migrate in between lock and unlock, but that's ok because we're not doing anything cpu specific, other than avoiding contention on the read side across the cpus. It's not pretty, but it works and statistically shouldn't hurt performance. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
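The "caller picks a cpu and reuses it" approach described above might look roughly like the following in userspace terms. The names `read_lock_sketch()` and `read_unlock_sketch()` and the array of atomic integers used as spin locks are assumptions made for this sketch; the kernel interface itself is not reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 4 /* hypothetical CPU count for the sketch */

/* One lock word per CPU: 0 means free, 1 means held. */
static atomic_int cpu_lock[NR_CPUS_SKETCH];

/*
 * Take the read-side lock belonging to the CPU chosen by the caller and
 * return that index, so the caller can hand it back at unlock time.
 */
unsigned int read_lock_sketch(unsigned int cpu_hint)
{
    unsigned int cpu = cpu_hint % NR_CPUS_SKETCH;

    while (atomic_exchange(&cpu_lock[cpu], 1))
        ; /* spin until this CPU's lock is free */
    return cpu;
}

/* Release exactly the lock that was taken, wherever we now run. */
void read_unlock_sketch(unsigned int cpu)
{
    assert(cpu < NR_CPUS_SKETCH);
    atomic_store(&cpu_lock[cpu], 0);
}

bool read_lock_held(unsigned int cpu)
{
    assert(cpu < NR_CPUS_SKETCH);
    return atomic_load(&cpu_lock[cpu]) != 0;
}
```

Because the index travels with the caller, a task that migrates between lock and unlock still releases the lock it acquired rather than the lock of whichever CPU it ended up on.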
2010-04-27	Fixup rt hack for mnt_want_write	John Stultz
	The rt hack in mnt_want_write needs to be changed to work with Nick's VFS patches. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	Fix MNT_MOUNTED WARN_ON	John Stultz
	I was seeing MNT_MOUNTED already set WARN_ON messages in commit_tree. This seems to be caused by clone_mnt copying the flag of an already mounted mnt to the mount before it is used by commit_tree. My fix (which may not be correct) is to unmark MNT_MOUNTED on the cloned mnt. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	Fixups from 09102009.patch.gz	Nick Piggin
	This patch is just the delta from Nick's 06102009 and his 09102009 megapatches Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	fs-fixes	Nick Piggin
	Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	fs-inode-hash-rcu	Nick Piggin
	Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	fs-sb-inodes-percpu	Nick Piggin
	Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	fs-nr_inodes-percpu	Eric Dumazet
	fs: inode per-cpu nr_inodes counter Avoids cache line ping pongs between cpus and prepares for the next patch, because updates of nr_inodes don't need inode_lock anymore. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	fs-last_ino-percpu	Eric Dumazet
	fs: inode per-cpu last_ino allocator new_inode() dirties a contended cache line to get increasing inode numbers. Solve this problem by giving each cpu a per_cpu variable, fed by the shared last_ino, but only once every 1024 allocations. This reduces contention on the shared last_ino and gives the same spread of inode numbers as before (same wraparound after 2^32 allocations). Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
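The batching scheme described above (each cpu refills its private counter from the shared one once every 1024 allocations) can be sketched as follows. This is a userspace approximation under stated assumptions: a plain array stands in for per-cpu variables, the CPU index is passed in explicitly, and the names `next_ino_sketch()`, `INO_BATCH_SKETCH` and `NR_CPUS_SKETCH` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define INO_BATCH_SKETCH 1024u /* allocations served per refill */
#define NR_CPUS_SKETCH 4       /* hypothetical CPU count for the sketch */

/* Shared counter, touched only once per batch. */
static atomic_uint shared_last_ino;

/* Per-cpu cursor; only its owning CPU is assumed to touch it. */
static unsigned int cpu_last_ino[NR_CPUS_SKETCH];

unsigned int next_ino_sketch(unsigned int cpu)
{
    assert(cpu < NR_CPUS_SKETCH);
    unsigned int res = cpu_last_ino[cpu];

    /* A multiple of the batch size means the local batch is used up. */
    if ((res & (INO_BATCH_SKETCH - 1)) == 0)
        res = atomic_fetch_add(&shared_last_ino, INO_BATCH_SKETCH);

    cpu_last_ino[cpu] = ++res;
    return res;
}
```

In this sketch the first CPU to allocate gets numbers 1 to 1024 and the next CPU gets 1025 to 2048, so the shared cache line is written once per 1024 allocations instead of on every call, and unsigned arithmetic wraps naturally after 2^32 allocations.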
2010-04-27	fs-inode-nr_inodes	Nick Piggin
	XXX: this should be folded back into the individual locking patches Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	fs-scale-pseudo	Nick Piggin
	Regardless of how much we possibly try to scale dcache, there is likely always going to be some fundamental contention when adding or removing children under the same parent. Pseudo filesystems do not seem to need connected dentries because by definition they are disconnected. XXX: is this right? I can't see any reason why they need to have a real parent. TODO: add a d_instantiate_something() and avoid adding the extra checks for !d_parent Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	fs-inode_lock-scale-11	Nick Piggin
	This enables locking to be reduced and simplified. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	fs-inode_rcu	Nick Piggin
	RCU free the struct inode. This will allow: - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want to take i_lock no longer need to take sb_inode_list_lock to walk the list in the first place. This will simplify and optimize locking. - eventually, completely write-free RCU path walking. The inode must be consulted for permissions when walking, so a write-free reference (i.e. RCU) is helpful. - can potentially simplify things a bit in VM land. May not need to take the page lock to get back to the page->mapping. - can remove some nested trylock loops in dcache code. Todo: convert all filesystems. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	fs-inode_lock-scale-10	Nick Piggin
	Implement lazy inode lru similarly to dcache. This should reduce inode list lock acquisition (todo: measure). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-27	fs-inode_lock-scale-9	Nick Piggin
	Remove the global inode_hash_lock and replace it with per-hash-bucket locks. Todo: should use bit spinlock in hlist_head pointer to save space. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>