lwn.git - Linux kernel documentation tree maintained by Jonathan Corbet

Age	Commit message	Author
2011-03-14	open-style analog of vfs_path_lookup()	Al Viro
	New function: file_open_root(dentry, mnt, name, flags) opens the file that vfs_path_lookup() would arrive at. Note that name can be empty; in that case the usual requirement that dentry should be a directory is lifted. Open-coded equivalents are switched to it; may_open() is down to exactly one caller and became static. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	reduce vfs_path_lookup() to do_path_lookup()	Al Viro
	New lookup flag: LOOKUP_ROOT. nd->root is set (and held) by the caller, path_init() starts walking from that place, and the pathname resolution machinery never drops nd->root if that flag is set. That turns vfs_path_lookup() into a special case of do_path_lookup() and gets us down to 3 callers of link_path_walk(), making it finally feasible to rip the handling of trailing symlinks out of link_path_walk(). That will not only simplify the living hell out of it, but also make life much simpler for the unionfs merge. Trailing symlink handling will become iterative, which is a good thing for stack footprint in a lot of situations as well. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	untangle do_lookup()	Al Viro
	That thing has devolved into a rat's nest of gotos; sane use of unlikely() gets rid of that horror and gives a much more readable structure:
	* make a fast attempt to find a dentry; false negatives are OK. In RCU mode, if everything went fine, we are done; otherwise just drop out of RCU. If we'd done (RCU) ->d_revalidate() and it had not refused outright (i.e. didn't give us -ECHILD), remember its result.
	* now we are not in RCU mode and hopefully have a dentry. If we do not, lock the parent, do a full d_lookup() and, if that has not found anything, allocate a dentry and call ->lookup(). If we'd done that ->lookup(), remember that the dentry is good and we don't need to revalidate it.
	* now we have a dentry. If it has ->d_revalidate() and we can't skip it, call it.
	* hopefully the dentry is good; if not, either fail (in case of error) or try to invalidate it. If d_invalidate() has succeeded, drop it and retry everything as if the original attempt had not found a dentry.
	* now we can finish it up - deal with mountpoint crossing and automount.
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	path_openat: clean ELOOP handling a bit	Al Viro
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	do_last: kill a rudiment of old ->d_revalidate() workaround	Al Viro
	There used to be a time when ->d_revalidate() couldn't return an error. So the intents code had lookup_instantiate_filp() stash ERR_PTR(error) in nd->intent.open.filp and had it checked after lookup_hash(), to catch the otherwise silent failures. That had been introduced by commit 4af4c52f34606bdaab6930a845550c6fb02078a4. These days ->d_revalidate() can and does propagate errors back to callers explicitly, so this check isn't needed anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	fold __open_namei_create() and open_will_truncate() into do_last()	Al Viro
	... and clean up a bit more Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	do_last: unify may_open() call and everything after it	Al Viro
	We have a bunch of diverging codepaths in do_last(); some of them converge, but the case of having to create a new file duplicates large part of common tail of the rest and exits separately. Massage them so that they could be merged. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	move may_open() from __open_namei_create() to do_last()	Al Viro
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	expand finish_open() in its only caller	Al Viro
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	sanitize pathname component hash calculation	Al Viro
	Lift it to lookup_one_len() and link_path_walk() resp. into the same place where we calculated default hash function of the same name. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	kill __lookup_one_len()	Al Viro
	only one caller left Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	switch non-create side of open() to use of do_last()	Al Viro
	Instead of path_lookupat() doing trailing symlink resolution, use the same scheme as on the O_CREAT side. Walk with LOOKUP_PARENT, then (in do_last()) look the final component up, then either open it or return an error or, if it's a symlink, give the symlink back to path_openat() to be resolved there. The really messy complication here is RCU. We don't want to drop out of RCU mode before the final lookup, since we don't want to bounce the parent directory's ->d_count without a good reason. The result is _not_ pretty; later in the series we'll clean it up. For now we are roughly back where we'd been before the revert done by Nick's series - the top-level logic of path_openat() is cleaned up, do_last() does the actual opening, and symlink resolution is done uniformly. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	get rid of nd->file	Al Viro
	Don't stash the struct file * used as starting point of walk in nameidata; pass file ** to path_init() instead. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	get rid of the last LOOKUP_RCU dependencies in link_path_walk()	Al Viro
	New helper: terminate_walk(). An error has happened during pathname resolution and we either drop nd->path or terminate RCU, depending on the mode we had been in. After that, nd is essentially empty. Switch link_path_walk() to using that for cleanup. Now the top-level logic in link_path_walk() is back to sanity. RCU dependencies are in the lower-level functions. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	make nameidata_dentry_drop_rcu_maybe() always leave RCU mode	Al Viro
	Now we have do_follow_link() guaranteed to leave without dangling RCU, and the next step will get the LOOKUP_RCU logic completely out of link_path_walk(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	make handle_dots() leave RCU mode on error	Al Viro
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	clear RCU on all failure exits from link_path_walk()	Al Viro
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	pull handling of . and .. into inlined helper	Al Viro
	getting LOOKUP_RCU checks out of link_path_walk()... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	kill out_dput: in link_path_walk()	Al Viro
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	separate -ESTALE/-ECHILD retries in do_filp_open() from real work	Al Viro
	new helper: path_openat(). Does what do_filp_open() does, except that it tries only the walk mode (RCU/normal/force revalidation) it had been told to. Both create and non-create branches are using path_lookupat() now. Fixed the double audit_inode() in non-create branch. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
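The split between "one attempt in one mode" and "retry policy" can be sketched like this. The fallback order (RCU first, plain walk on -ECHILD, forced revalidation on -ESTALE) is an assumption based on the three modes named above; `enum walk_mode`, `open_with_retries()` and the scripted `fake_attempt()` are hypothetical names, not kernel API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical walk modes: RCU, normal, forced revalidation. */
enum walk_mode { WALK_RCU, WALK_NORMAL, WALK_REVAL };

/* One attempt in exactly one mode; returns 0 or a negative errno. */
typedef int (*open_attempt_fn)(const char *name, enum walk_mode mode, void *ctx);

/* Retry policy only; the real work lives in the attempt callback. */
static int open_with_retries(const char *name, open_attempt_fn attempt, void *ctx)
{
    int err;

    assert(attempt != NULL);
    err = attempt(name, WALK_RCU, ctx);
    if (err == -ECHILD)             /* RCU walk gave up: retry without RCU */
        err = attempt(name, WALK_NORMAL, ctx);
    if (err == -ESTALE)             /* stale data: retry with revalidation */
        err = attempt(name, WALK_REVAL, ctx);
    return err;
}

/* Scripted fake: results[mode] is what an attempt in that mode returns. */
struct fake_fs { int results[3]; int calls; };

static int fake_attempt(const char *name, enum walk_mode mode, void *ctx)
{
    struct fake_fs *fs = ctx;

    (void)name;
    fs->calls++;
    return fs->results[mode];
}
```

With a fake scripted as `{ -ECHILD, -ESTALE, 0 }` the wrapper makes three attempts and succeeds; any other error from the first attempt is returned as-is.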
2011-03-14	switch do_filp_open() to struct open_flags	Al Viro
	take calculation of open_flags by open(2) arguments into new helper in fs/open.c, move filp_open() over there, have it and do_sys_open() use that helper, switch exec.c callers of do_filp_open() to explicit (and constant) struct open_flags. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
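As an illustration of "compute once, pass by reference", here is a userspace sketch of deriving such a structure from open(2) arguments. The struct layout, the `MODEL_MAY_*` bit values and the rule that O_TRUNC implies write access are assumptions of this model, not a copy of the kernel helper.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>

/* Hypothetical access bits for this model. */
#define MODEL_MAY_WRITE  0x2
#define MODEL_MAY_READ   0x4
#define MODEL_MAY_APPEND 0x8

/* Everything about the open that stays constant while it is processed. */
struct model_open_flags {
    int open_flag;  /* raw open(2) flags */
    int mode;       /* creation mode, only meaningful with O_CREAT */
    int acc_mode;   /* access the opener needs on the file */
};

static void model_build_open_flags(int flags, int mode,
                                   struct model_open_flags *op)
{
    assert(op != NULL);
    op->open_flag = flags;
    op->mode = (flags & O_CREAT) ? mode : 0;

    switch (flags & O_ACCMODE) {
    case O_RDONLY:
        op->acc_mode = MODEL_MAY_READ;
        break;
    case O_WRONLY:
        op->acc_mode = MODEL_MAY_WRITE;
        break;
    default:        /* O_RDWR and anything else */
        op->acc_mode = MODEL_MAY_READ | MODEL_MAY_WRITE;
        break;
    }
    if (flags & O_TRUNC)
        op->acc_mode |= MODEL_MAY_WRITE;   /* truncation modifies the file */
    if (flags & O_APPEND)
        op->acc_mode |= MODEL_MAY_APPEND;
}
```

Callers that always open the same way can fill such a structure with constants instead of calling the builder.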
2011-03-14	Collect "operation mode" arguments of do_last() into a structure	Al Viro
	No point messing with passing shitloads of "operation mode" arguments to do_last() one by one, especially since they are not going to change during do_filp_open(). Collect them into a struct, fill it and pass it to do_last() by reference. Make sure that lookup intent flags are correctly set and removed - we want them for do_last(), but they make no sense for __do_follow_link(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	clean up the failure exits after __do_follow_link() in do_filp_open()	Al Viro
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	pull security_inode_follow_link() into __do_follow_link()	Al Viro
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	pull dropping RCU on success of link_path_walk() into path_lookupat()	Al Viro
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	untangle the "need_reval_dot" mess	Al Viro
	instead of ad-hackery around need_reval_dot(), do the following: set a flag (LOOKUP_JUMPED) in the beginning of path, on absolute symlink traversal, on ".." and on procfs-style symlinks. Clear on normal components, leave unchanged on ".". Non-nested callers of link_path_walk() call handle_reval_path(), which checks that flag is set and that fs does want the final revalidate thing, then does ->d_revalidate(). In link_path_walk() all the return_reval stuff is gone. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
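The flag rules above amount to a small state machine. The sketch below models only the parts visible in a plain path string (start of path, "..", "." and normal components); symlink traversal is left out, and `ends_jumped()` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns whether the "jumped" flag would still be set after walking
 * every component of a slash-separated path. Symlinks are not modelled.
 */
static bool ends_jumped(const char *path)
{
    bool jumped = true;             /* set at the beginning of the path */
    const char *p = path;

    assert(path != NULL);
    while (*p) {
        while (*p == '/')           /* skip separators */
            p++;
        size_t len = strcspn(p, "/");
        if (len == 0)
            break;                  /* trailing slashes only */
        if (len == 1 && p[0] == '.') {
            /* "." leaves the flag unchanged */
        } else if (len == 2 && p[0] == '.' && p[1] == '.') {
            jumped = true;          /* ".." sets it */
        } else {
            jumped = false;         /* a normal component clears it */
        }
        p += len;
    }
    return jumped;
}
```

A caller in the position of handle_reval_path() would consult the final flag value: paths such as "." or "a/.." end with it set, while "a/b" ends with it clear.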
2011-03-14	merge component type recognition	Al Viro
	no need to do it in three places... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	merge path_init and path_init_rcu	Al Viro
	Actual dependency on whether we want RCU or not is in 3 small areas (as it ought to be) and everything around those is the same in both versions. Since each function has only one caller and those callers are on two sides of if (flags & LOOKUP_RCU), it's easier and cleaner to merge them and pull the checks inside. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	sanitize path_walk() mess	Al Viro
	New helper: path_lookupat(). Basically, what do_path_lookup() boils down to modulo the -ECHILD/-ESTALE handler. The path_walk* family is gone; vfs_path_lookup() is using link_path_walk() directly, do_path_lookup() and do_filp_open() are using path_lookupat(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	take RCU-dependent stuff around exec_permission() into a new helper	Al Viro
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-14	kill path_lookup()	Al Viro
	all remaining callers pass LOOKUP_PARENT to it, so flags argument can die; renamed to kern_path_parent() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-13	compat breakage in preadv() and pwritev()	Al Viro
	Fix for a dumb preadv()/pwritev() compat bug - unlike the native variants, compat_... ones forget to check FMODE_P{READ,WRITE}, so e.g. on pipe the native preadv() will fail with -ESPIPE and compat one will act as readv() and succeed. Not critical, but it's a clear bug with trivial fix. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
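The native behaviour described above can be observed from userspace with a short probe. This is a sketch assuming a Linux system with glibc; `preadv_errno_on_pipe()` is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Calls preadv() on the read end of a fresh pipe and returns the errno
 * it failed with, or 0 if it unexpectedly succeeded.
 */
static int preadv_errno_on_pipe(void)
{
    int fds[2];
    char buf[8];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };
    int err = 0;

    if (pipe(fds) != 0)
        return errno;

    /* Queue one byte so that a plain readv() on the pipe would succeed. */
    ssize_t written = write(fds[1], "x", 1);
    assert(written == 1);

    /* Positional reads make no sense on a pipe. */
    if (preadv(fds[0], &iov, 1, 0) < 0)
        err = errno;

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

For a native binary the helper is expected to return ESPIPE; the bug was that a compat binary would instead have read the byte as if readv() had been called.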
2011-03-10	fs/dcache: allow d_obtain_alias() to return unhashed dentries	J. Bruce Fields
	Without this patch, inodes are not promptly freed on last close of an unlinked file by an nfs client:
	  client$ mount -tnfs4 server:/export/ /mnt/
	  client$ tail -f /mnt/FOO
	  ...
	  server$ df -i /export
	  server$ rm /export/FOO
	  (^C the tail -f)
	  server$ df -i /export
	  server$ echo 2 >/proc/sys/vm/drop_caches
	  server$ df -i /export
	The df's will show that the inode is not freed on the filesystem until the last step, when it could have been freed after killing the client's tail -f. On-disk data won't be deallocated either, leading to possible spurious ENOSPC.
	This occurs because when the client does the close, it arrives in a compound with a putfh and a close, processed like:
	- putfh: look up the filehandle. The only alias found for the inode will be the DCACHE_UNHASHED alias referenced by the filp, so it creates a new DCACHE_DISCONNECTED dentry and returns that instead.
	- close: closes the existing filp, whose dentry is destroyed immediately by dput() since it's DCACHE_UNHASHED.
	- end of the compound: release the reference to the current filehandle, and dput() the new DCACHE_DISCONNECTED dentry, which gets put on the unused list instead of being destroyed immediately.
	Nick Piggin suggested fixing this by allowing d_obtain_alias() to return the unhashed dentry that is referenced by the filp, instead of making it create a new dentry. Leave __d_find_alias() alone to avoid changing the behavior of other callers. Also, nfsd doesn't need all the checks of __d_find_alias(); any dentry, hashed or unhashed, disconnected or not, should work.
	Signed-off-by: J. Bruce Fields <bfields@redhat.com>
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-10	Check for immutable/append flag in fallocate path	Marco Stornelli
	In the fallocate path the kernel doesn't check for the immutable/append flag. A race is possible in this scenario: an application opens a file read/write and does something; meanwhile root sets the immutable flag on the file; the application can at that point still call fallocate successfully. In addition, on an append-only file we allow only the reserve operation, not any unreserve operation. Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
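The added checks can be modelled as a small predicate. The sketch uses hypothetical types (`struct model_inode`, `enum model_op`) and assumes -EPERM as the error code; neither is taken from the patch itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the inode flags and the fallocate mode. */
struct model_inode {
    bool immutable;
    bool append_only;
};

enum model_op { MODEL_RESERVE, MODEL_UNRESERVE };

/* Returns 0 if the operation may proceed, or a negative errno. */
static int model_fallocate_check(const struct model_inode *inode,
                                 enum model_op op)
{
    assert(inode != NULL);
    /* Checked at call time, so a flag set after open() is still honoured. */
    if (inode->immutable)
        return -EPERM;
    /* An append-only file may grow its reservation but never shrink it. */
    if (inode->append_only && op == MODEL_UNRESERVE)
        return -EPERM;
    return 0;
}
```

Because the check runs on every call rather than at open time, the race described above no longer lets the operation through.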
2011-03-10	fat: fix d_revalidate oopsen on NFS exports	Al Viro
	can't blindly check nd->flags in ->d_revalidate() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-10	jfs: fix d_revalidate oopsen on NFS exports	Al Viro
	can't blindly check nd->flags in ->d_revalidate() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-10	ocfs2: fix d_revalidate oopsen on NFS exports	Al Viro
	can't blindly check nd->flags in ->d_revalidate() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-10	gfs2: fix d_revalidate oopsen on NFS exports	Al Viro
	can't blindly check nd->flags in ->d_revalidate() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-10	fuse: fix d_revalidate oopsen on NFS exports	Al Viro
	can't blindly check nd->flags in ->d_revalidate() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-10	ceph: fix d_revalidate oopsen on NFS exports	Al Viro
	can't blindly check nd->flags in ->d_revalidate() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
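The six fixes above share one pattern, sketched here as a userspace model. It assumes that the oops came from ->d_revalidate() being called with a NULL nd on the NFS export path; `struct model_nameidata`, `MODEL_LOOKUP_RCU` and the return value 1 for "valid" are illustrative choices.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_LOOKUP_RCU 0x1u   /* hypothetical flag value */

/* Hypothetical stand-in for struct nameidata. */
struct model_nameidata {
    unsigned int flags;
};

/* Returns 1 if the entry is valid, or -ECHILD to refuse under RCU. */
static int model_d_revalidate(const struct model_nameidata *nd)
{
    /*
     * The buggy form tested nd->flags unconditionally and crashed when
     * nd was NULL. Test the pointer before looking at the flags.
     */
    if (nd && (nd->flags & MODEL_LOOKUP_RCU))
        return -ECHILD;     /* cannot do blocking work in RCU mode */

    return 1;               /* blocking revalidation would happen here */
}
```

A NULL nd is simply treated as "not in RCU mode", so the caller gets the normal blocking behaviour.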
2011-03-10	reiserfs xattr ->d_revalidate() shouldn't care about RCU	Al Viro
	... it returns an error unconditionally Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-10	/proc/self is never going to be invalidated...	Al Viro
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-09	Merge branch 'for-2.6.38' of git://linux-nfs.org/~bfields/linux	Linus Torvalds
	* 'for-2.6.38' of git://linux-nfs.org/~bfields/linux:
	  nfsd: wrong index used in inner loop
	  nfsd4: fix bad pointer on failure to find delegation
	  NFSD: fix decode_cb_sequence4resok
2011-03-09	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6	Linus Torvalds
	* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
	  nd->inode is not set on the second attempt in path_walk()
	  unfuck proc_sysctl ->d_compare()
	  minimal fix for do_filp_open() race
2011-03-08	nd->inode is not set on the second attempt in path_walk()	Al Viro
	We leave it at whatever it had been pointing to after the first link_path_walk() had failed with -ESTALE. Things do not work well after that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-08	nfsd: wrong index used in inner loop	roel
	Index i was already used in the outer loop Cc: stable@kernel.org Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-03-08	unfuck proc_sysctl ->d_compare()	Al Viro
	a) struct inode is not going to be freed under ->d_compare(); however, the thing PROC_I(inode)->sysctl points to just might. Fortunately, it's enough to make freeing that sucker delayed, provided that we don't step on its ->unregistering, clear the pointer to it in PROC_I(inode) before dropping the reference and check if it's NULL in ->d_compare(). b) I'm not sure that we can walk into NULL inode here (we recheck dentry->seq between verifying that it's still hashed / fetching dentry->d_inode and passing it to ->d_compare() and there's no negative hashed dentries in /proc/sys/*), but if we can walk into that, we really should not have ->d_compare() return 0 on it! Said that, I really suspect that this check can be simply killed. Nick? Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-07	nfsd4: fix bad pointer on failure to find delegation	J. Bruce Fields
	In case of a nonempty list, the return on error here is obviously bogus; it ends up being a pointer to the list head instead of to any valid delegation on the list. In particular, if nfsd4_delegreturn() hits this case, and you're quite unlucky, then renew_client may oops, and it may take an embarrassingly long time to figure out why. Facepalm.
	  BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
	  IP: [<ffffffff81292965>] nfsd4_delegreturn+0x125/0x200
	  ...
	Cc: stable@kernel.org
	Signed-off-by: J. Bruce Fields <bfields@redhat.com>
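The bug class described above is easy to reproduce with any intrusive circular list: after a full pass without a match, the cursor refers to the list head rather than to a real entry. The sketch below uses a hypothetical minimal list (`struct list_node`, `struct deleg`, `find_deleg()`), not the kernel's list API, and shows the corrected lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list; the head is a bare node. */
struct list_node {
    struct list_node *next, *prev;
};

/* Hypothetical entry type with an embedded list node. */
struct deleg {
    int id;
    struct list_node link;
};

#define node_to_deleg(n) \
    ((struct deleg *)((char *)(n) - offsetof(struct deleg, link)))

static void list_init(struct list_node *head)
{
    head->next = head->prev = head;
}

static void list_add_tail(struct list_node *node, struct list_node *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Returns the matching entry, or NULL if there is none. */
static struct deleg *find_deleg(struct list_node *head, int id)
{
    struct list_node *pos;

    assert(head != NULL);
    for (pos = head->next; pos != head; pos = pos->next) {
        struct deleg *d = node_to_deleg(pos);

        if (d->id == id)
            return d;
    }
    /*
     * Returning the cursor here would hand back an "entry" computed
     * from the head node, which is not a struct deleg at all.
     */
    return NULL;
}
```

Callers can then test for NULL instead of dereferencing a pointer that was never a valid entry.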
2011-03-05	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client	Linus Torvalds
	* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
	  ceph: no .snap inside of snapped namespace
	  libceph: fix msgr standby handling
	  libceph: fix msgr keepalive flag
	  libceph: fix msgr backoff
	  libceph: retry after authorization failure
	  libceph: fix handling of short returns from get_user_pages
	  ceph: do not clear I_COMPLETE from d_release
	  ceph: do not set I_COMPLETE
	  Revert "ceph: keep reference to parent inode on ceph_dentry"
2011-03-04	nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v3)	Neil Horman
	The "bad_page()" page allocator sanity check was reported recently (call chain as follows):
	  bad_page+0x69/0x91
	  free_hot_cold_page+0x81/0x144
	  skb_release_data+0x5f/0x98
	  __kfree_skb+0x11/0x1a
	  tcp_ack+0x6a3/0x1868
	  tcp_rcv_established+0x7a6/0x8b9
	  tcp_v4_do_rcv+0x2a/0x2fa
	  tcp_v4_rcv+0x9a2/0x9f6
	  do_timer+0x2df/0x52c
	  ip_local_deliver+0x19d/0x263
	  ip_rcv+0x539/0x57c
	  netif_receive_skb+0x470/0x49f
	  :virtio_net:virtnet_poll+0x46b/0x5c5
	  net_rx_action+0xac/0x1b3
	  __do_softirq+0x89/0x133
	  call_softirq+0x1c/0x28
	  do_softirq+0x2c/0x7d
	  do_IRQ+0xec/0xf5
	  default_idle+0x0/0x50
	  ret_from_intr+0x0/0xa
	  default_idle+0x29/0x50
	  cpu_idle+0x95/0xb8
	  start_kernel+0x220/0x225
	  _sinittext+0x22f/0x236
	It occurs because an skb with a fraglist was freed from the tcp retransmit queue when it was acked, but a page on that fraglist had PG_slab set (indicating it was allocated from the slab allocator), which means the free path above can't safely free it via put_page().
	We tracked this back to an nfsv4 setacl operation, in which the nfs code attempted to convert the passed-in buffer to an array of pages in __nfs4_proc_set_acl, which gets used by the skb->frags list in xs_sendpages. __nfs4_proc_set_acl just converts each page in the buffer to a page struct via virt_to_page, but the vfs allocates the buffer via kmalloc, meaning the PG_slab bit is set. We can't create a buffer with kmalloc and free it later in the tcp ack path with put_page, so we need to either:
	1) ensure that when we create the list of pages, no page struct has PG_slab set, or
	2) not use a page list to send this data.
	Given that these buffers can be multiple pages and arbitrarily sized, I think (1) is the right way to go. I've written the below patch to allocate a page from the buddy allocator directly and copy the data over to it. This ensures that we have a put_page free-able page for every entry that winds up on an skb frag list, so it can be safely freed when the frame is acked. We do a put_page on each entry after the rpc_call_sync call so as to drop our own reference count to the page, leaving only the ref count taken by tcp_sendpages. This way the data will be properly freed when the ack comes in.
	Successfully tested by myself to solve the above oops.
	Note, as this is the result of a setacl operation that exceeded a page of data, I think this amounts to a local DoS triggerable by an unprivileged user, so I'm CCing security on this as well.
	Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
	CC: Trond Myklebust <Trond.Myklebust@netapp.com>
	CC: security@kernel.org
	CC: Jeff Layton <jlayton@redhat.com>
	Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>