FS-Cache: Provide the ability to enable/disable cookies

2013-09-27T17:40:25+00:00

Provide the ability to enable and disable fscache cookies. A disabled cookie will reject or ignore further requests to: Acquire a child cookie Invalidate and update backing objects Check the consistency of a backing object Allocate storage for backing page Read backing pages Write to backing pages but still allows: Checks/waits on the completion of already in-progress objects Uncaching of pages Relinquishment of cookies Two new operations are provided: (1) Disable a cookie: void fscache_disable_cookie(struct fscache_cookie *cookie, bool invalidate); If the cookie is not already disabled, this locks the cookie against other dis/enablement ops, marks the cookie as being disabled, discards or invalidates any backing objects and waits for cessation of activity on any associated object. This is a wrapper around a chunk split out of fscache_relinquish_cookie(), but it reinitialises the cookie such that it can be reenabled. All possible failures are handled internally. The caller should consider calling fscache_uncache_all_inode_pages() afterwards to make sure all page markings are cleared up. (2) Enable a cookie: void fscache_enable_cookie(struct fscache_cookie *cookie, bool (*can_enable)(void *data), void *data) If the cookie is not already enabled, this locks the cookie against other dis/enablement ops, invokes can_enable() and, if the cookie is not an index cookie, will begin the procedure of acquiring backing objects. The optional can_enable() function is passed the data argument and returns a ruling as to whether or not enablement should actually be permitted to begin. All possible failures are handled internally. The cookie will only be marked as enabled if provisional backing objects are allocated. A later patch will introduce these to NFS. Cookie enablement during nfs_open() is then contingent on i_writecount <= 0. can_enable() checks for a race between open(O_RDONLY) and open(O_WRONLY/O_RDWR). This simplifies NFS's cookie handling and allows us to get rid of open(O_RDONLY) accidentally introducing caching to an inode that's open for writing already. One operation has its API modified: (3) Acquire a cookie. struct fscache_cookie *fscache_acquire_cookie( struct fscache_cookie *parent, const struct fscache_cookie_def *def, void *netfs_data, bool enable); This now has an additional argument that indicates whether the requested cookie should be enabled by default. It doesn't need the can_enable() function because the caller must prevent multiple calls for the same netfs object and it doesn't need to take the enablement lock because no one else can get at the cookie before this returns. Signed-off-by: David Howells

FS-Cache: Add use/unuse/wake cookie wrappers

2013-09-27T17:40:25+00:00

Add wrapper functions for dealing with cookie->n_active: (*) __fscache_use_cookie() to increment it. (*) __fscache_unuse_cookie() to decrement and test against zero. (*) __fscache_wake_unused_cookie() to wake up anyone waiting for it to reach zero. The second and third are split so that the third can be done after cookie->lock has been released in case the waiter wakes up whilst we're still holding it and tries to get it. We will need to wake-on-zero once the cookie disablement patch is applied because it will then be possible to see n_active become zero without the cookie being relinquished. Also move the cookie usement out of fscache_attr_changed_op() and into fscache_attr_changed() and the operation struct so that cookie disablement will be able to track it. Whilst we're at it, only increment n_active if we're about to do fscache_submit_op() so that we don't have to deal with undoing it if anything earlier fails. Possibly this should be moved into fscache_submit_op() which could look at FSCACHE_OP_UNUSE_COOKIE. Signed-off-by: David Howells

lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interrupt

2013-09-11T22:59:36+00:00

With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is one such possible user), the following race can happen: radix_tree_preload() ... radix_tree_insert() radix_tree_node_alloc() if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; ... radix_tree_preload() ... radix_tree_insert() radix_tree_node_alloc() if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; And we give out one radix tree node twice. That clearly results in radix tree corruption with different results (usually OOPS) depending on which two users of radix tree race. We fix the problem by making radix_tree_node_alloc() always allocate fresh radix tree nodes when in interrupt. Using preloading when in interrupt doesn't make sense since all the allocations have to be atomic anyway and we cannot steal nodes from process-context users because some users rely on radix_tree_insert() succeeding after radix_tree_preload(). in_interrupt() check is somewhat ugly but we cannot simply key off passed gfp_mask as that is acquired from root_gfp_mask() and thus the same for all preload users. Another part of the fix is to avoid node preallocation in radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again, preallocation in such case doesn't make sense and when preallocation would happen in interrupt we could possibly leak some allocated nodes. However, some users of radix_tree_preload() require following radix_tree_insert() to succeed. To avoid unexpected effects for these users, radix_tree_preload() only warns if passed gfp mask doesn't allow waiting and we provide a new function radix_tree_maybe_preload() for those users which get different gfp mask from different call sites and which are prepared to handle radix_tree_insert() failure. Signed-off-by: Jan Kara Cc: Jens Axboe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

fscache: Netfs function for cleanup post readpages

2013-09-06T08:17:30+00:00

Currently the fscache code expect the netfs to call fscache_readpages_or_alloc inside the aops readpages callback. It marks all the pages in the list provided by readahead with PG_private_2. In the cases that the netfs fails to read all the pages (which is legal) it ends up returning to the readahead and triggering a BUG. This happens because the page list still contains marked pages. This patch implements a simple fscache_readpages_cancel function that the netfs should call before returning from readpages. It will revoke the pages from the underlying cache backend and unmark them. The problem was originally worked out in the Ceph devel tree, but it also occurs in CIFS. It appears that NFS, AFS and 9P are okay as read_cache_pages() will clean up the unprocessed pages in the case of an error. This can be used to address the following oops: [12410647.597278] BUG: Bad page state in process petabucket pfn:3d504e [12410647.597292] page:ffffea000f541380 count:0 mapcount:0 mapping: (null) index:0x0 [12410647.597298] page flags: 0x200000000001000(private_2) ... [12410647.597334] Call Trace: [12410647.597345] [] dump_stack+0x19/0x1b [12410647.597356] [] bad_page+0xc7/0x120 [12410647.597359] [] free_pages_prepare+0x10e/0x120 [12410647.597361] [] free_hot_cold_page+0x40/0x170 [12410647.597363] [] __put_single_page+0x27/0x30 [12410647.597365] [] put_page+0x25/0x40 [12410647.597376] [] ceph_readpages+0x2e9/0x6e0 [ceph] [12410647.597379] [] __do_page_cache_readahead+0x1af/0x260 [12410647.597382] [] ra_submit+0x21/0x30 [12410647.597384] [] filemap_fault+0x254/0x490 [12410647.597387] [] __do_fault+0x6f/0x4e0 [12410647.597391] [] ? __switch_to+0x16d/0x4a0 [12410647.597395] [] ? finish_task_switch+0x5a/0xc0 [12410647.597398] [] handle_pte_fault+0xf6/0x930 [12410647.597401] [] ? pte_mfn_to_pfn+0x93/0x110 [12410647.597403] [] ? xen_pmd_val+0xe/0x10 [12410647.597405] [] ? __raw_callee_save_xen_pmd_val+0x11/0x1e [12410647.597407] [] handle_mm_fault+0x251/0x370 [12410647.597411] [] ? call_rwsem_down_read_failed+0x14/0x30 [12410647.597414] [] __do_page_fault+0x1aa/0x550 [12410647.597418] [] ? up_write+0x1d/0x20 [12410647.597422] [] ? vm_mmap_pgoff+0xbc/0xe0 [12410647.597425] [] ? SyS_mmap_pgoff+0xd8/0x240 [12410647.597427] [] do_page_fault+0xe/0x10 [12410647.597431] [] page_fault+0x28/0x30 Signed-off-by: Milosz Tanski Signed-off-by: David Howells

FS-Cache: Add interface to check consistency of a cached object

2013-09-06T08:17:30+00:00

Extend the fscache netfs API so that the netfs can ask as to whether a cache object is up to date with respect to its corresponding netfs object: int fscache_check_consistency(struct fscache_cookie *cookie) This will call back to the netfs to check whether the auxiliary data associated with a cookie is correct. It returns 0 if it is and -ESTALE if it isn't; it may also return -ENOMEM and -ERESTARTSYS. The backends now have to implement a mandatory operation pointer: int (*check_consistency)(struct fscache_object *object) that corresponds to the above API call. FS-Cache takes care of pinning the object and the cookie in memory and managing this call with respect to the object state. Original-author: Hongyi Jia Signed-off-by: David Howells cc: Hongyi Jia cc: Milosz Tanski

FS-Cache: The retrieval remaining-pages counter needs to be atomic_t

2013-06-19T13:16:47+00:00

struct fscache_retrieval contains a count of the number of pages that still need some processing (n_pages). This is decremented as the pages are processed. However, this needs to be atomic as fscache_retrieval_complete() (I think) just occasionally may be called from cachefiles_read_backing_file() and cachefiles_read_copier() simultaneously. This happens when an fscache_read_or_alloc_pages() request containing a lot of pages (say a couple of hundred) is being processed. The read on each backing page is dispatched individually because we need to insert a monitor into the waitqueue to catch when the read completes. However, under low-memory conditions, we might be forced to wait in the allocator - and this gives the I/O on the backing page a chance to complete first. When the I/O completes, fscache_enqueue_retrieval() chucks the retrieval onto the workqueue without waiting for the operation to finish the initial I/O dispatch (we want to release any pages we can as soon as we can), thus both can end up running simultaneously and potentially attempting to partially complete the retrieval simultaneously (ENOMEM may occur, backing pages may already be in the page cache). This was demonstrated by parallelling the non-atomic counter with an atomic counter and printing both of them when the assertion fails. At this point, the atomic counter has reached zero, but the non-atomic counter has not. To fix this, make the counter an atomic_t. This results in the following bug appearing FS-Cache: Assertion failed 3 == 5 is false ------------[ cut here ]------------ kernel BUG at fs/fscache/operation.c:421! or FS-Cache: Assertion failed 3 == 5 is false ------------[ cut here ]------------ kernel BUG at fs/fscache/operation.c:414! With a backtrace like the following: RIP: 0010:[] fscache_put_operation+0x1ad/0x240 [fscache] Call Trace: [] fscache_retrieval_work+0x55/0x270 [fscache] [] ? fscache_retrieval_work+0x0/0x270 [fscache] [] worker_thread+0x170/0x2a0 [] ? autoremove_wake_function+0x0/0x40 [] ? worker_thread+0x0/0x2a0 [] kthread+0x96/0xa0 [] child_rip+0xa/0x20 [] ? kthread+0x0/0xa0 [] ? child_rip+0x0/0x20 Signed-off-by: David Howells Reviewed-and-tested-By: Milosz Tanski Acked-by: Jeff Layton

FS-Cache: Simplify cookie retention for fscache_objects, fixing oops

2013-06-19T13:16:47+00:00

Simplify the way fscache cache objects retain their cookie. The way I implemented the cookie storage handling made synchronisation a pain (ie. the object state machine can't rely on the cookie actually still being there). Instead of the the object being detached from the cookie and the cookie being freed in __fscache_relinquish_cookie(), we defer both operations: (*) The detachment of the object from the list in the cookie now takes place in fscache_drop_object() and is thus governed by the object state machine (fscache_detach_from_cookie() has been removed). (*) The release of the cookie is now in fscache_object_destroy() - which is called by the cache backend just before it frees the object. This means that the fscache_cookie struct is now available to the cache all the way through from ->alloc_object() to ->drop_object() and ->put_object() - meaning that it's no longer necessary to take object->lock to guarantee access. However, __fscache_relinquish_cookie() doesn't wait for the object to go all the way through to destruction before letting the netfs proceed. That would massively slow down the netfs. Since __fscache_relinquish_cookie() leaves the cookie around, in must therefore break all attachments to the netfs - which includes ->def, ->netfs_data and any outstanding page read/writes. To handle this, struct fscache_cookie now has an n_active counter: (1) This starts off initialised to 1. (2) Any time the cache needs to get at the netfs data, it calls fscache_use_cookie() to increment it - if it is not zero. If it was zero, then access is not permitted. (3) When the cache has finished with the data, it calls fscache_unuse_cookie() to decrement it. This does a wake-up on it if it reaches 0. (4) __fscache_relinquish_cookie() decrements n_active and then waits for it to reach 0. The initialisation to 1 in step (1) ensures that we only get wake ups when we're trying to get rid of the cookie. This leaves __fscache_relinquish_cookie() a lot simpler. *** This fixes a problem in the current code whereby if fscache_invalidate() is followed sufficiently quickly by fscache_relinquish_cookie() then it is possible for __fscache_relinquish_cookie() to have detached the cookie from the object and cleared the pointer before a thread is dispatched to process the invalidation state in the object state machine. Since the pending write clearance was deferred to the invalidation state to make it asynchronous, we need to either wait in relinquishment for the stores tree to be cleared in the invalidation state or we need to handle the clearance in relinquishment. Further, if the relinquishment code does clear the tree, then the invalidation state need to make the clearance contingent on still having the cookie to hand (since that's where the tree is rooted) and we have to prevent the cookie from disappearing for the duration. This can lead to an oops like the following: BUG: unable to handle kernel NULL pointer dereference at 000000000000000c ... RIP: 0010:[] _spin_lock+0xe/0x30 ... CR2: 000000000000000c ... ... Process kslowd002 (...) .... Call Trace: [] fscache_invalidate_writes+0x38/0xd0 [fscache] [] ? __switch_to+0xd0/0x320 [] ? find_busiest_queue+0x69/0x150 [] ? slow_work_enqueue+0x104/0x180 [] fscache_object_slow_work_execute+0x5e3/0x9d0 [fscache] [] ? bit_waitqueue+0x17/0xd0 [] slow_work_execute+0x233/0x310 [] slow_work_thread+0x205/0x360 [] ? autoremove_wake_function+0x0/0x40 [] ? slow_work_thread+0x0/0x360 [] kthread+0x96/0xa0 [] child_rip+0xa/0x20 [] ? kthread+0x0/0xa0 [] ? child_rip+0x0/0x20 The parameter to fscache_invalidate_writes() was object->cookie which is NULL. Signed-off-by: David Howells Tested-By: Milosz Tanski Acked-by: Jeff Layton

FS-Cache: Fix object state machine to have separate work and wait states

2013-06-19T13:16:47+00:00

Fix object state machine to have separate work and wait states as that makes it easier to envision. There are now three kinds of state: (1) Work state. This is an execution state. No event processing is performed by a work state. The function attached to a work state returns a pointer indicating the next state to which the OSM should transition. Returning NO_TRANSIT repeats the current state, but goes back to the scheduler first. (2) Wait state. This is an event processing state. No execution is performed by a wait state. Wait states are just tables of "if event X occurs, clear it and transition to state Y". The dispatcher returns to the scheduler if none of the events in which the wait state has an interest are currently pending. (3) Out-of-band state. This is a special work state. Transitions to normal states can be overridden when an unexpected event occurs (eg. I/O error). Instead the dispatcher disables and clears the OOB event and transits to the specified work state. This then acts as an ordinary work state, though object->state points to the overridden destination. Returning NO_TRANSIT resumes the overridden transition. In addition, the states have names in their definitions, so there's no need for tables of state names. Further, the EV_REQUEUE event is no longer necessary as that is automatic for work states. Since the states are now separate structs rather than values in an enum, it's not possible to use comparisons other than (non-)equality between them, so use some object->flags to indicate what phase an object is in. The EV_RELEASE, EV_RETIRE and EV_WITHDRAW events have been squished into one (EV_KILL). An object flag now carries the information about retirement. Similarly, the RELEASING, RECYCLING and WITHDRAWING states have been merged into an KILL_OBJECT state and additional states have been added for handling waiting dependent objects (JUMPSTART_DEPS and KILL_DEPENDENTS). A state has also been added for synchronising with parent object initialisation (WAIT_FOR_PARENT) and another for initiating look up (PARENT_READY). Signed-off-by: David Howells Tested-By: Milosz Tanski Acked-by: Jeff Layton

FS-Cache: Don't sleep in page release if __GFP_FS is not set

2013-06-19T13:16:47+00:00

Don't sleep in __fscache_maybe_release_page() if __GFP_FS is not set. This goes some way towards mitigating fscache deadlocking against ext4 by way of the allocator, eg: INFO: task flush-8:0:24427 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. flush-8:0 D ffff88003e2b9fd8 0 24427 2 0x00000000 ffff88003e2b9138 0000000000000046 ffff880012e3a040 ffff88003e2b9fd8 0000000000011c80 ffff88003e2b9fd8 ffffffff81a10400 ffff880012e3a040 0000000000000002 ffff880012e3a040 ffff88003e2b9098 ffffffff8106dcf5 Call Trace: [] ? __lock_is_held+0x31/0x53 [] ? radix_tree_lookup_element+0xf4/0x12a [] schedule+0x60/0x62 [] __fscache_wait_on_page_write+0x8b/0xa5 [fscache] [] ? __init_waitqueue_head+0x4d/0x4d [] __fscache_maybe_release_page+0x30c/0x324 [fscache] [] ? __fscache_maybe_release_page+0x6c/0x324 [fscache] [] ? trace_hardirqs_on_caller+0x114/0x170 [] nfs_fscache_release_page+0x68/0x94 [nfs] [] nfs_release_page+0x7e/0x86 [nfs] [] try_to_release_page+0x32/0x3b [] shrink_page_list+0x535/0x71a [] ? trace_hardirqs_on_caller+0x114/0x170 [] shrink_inactive_list+0x20a/0x2dd [] ? mark_held_locks+0xbe/0xea [] shrink_lruvec+0x34c/0x3eb [] do_try_to_free_pages+0xcf/0x355 [] try_to_free_pages+0x9a/0xa1 [] __alloc_pages_nodemask+0x494/0x6f7 [] kmem_getpages+0x58/0x155 [] fallback_alloc+0x120/0x1f3 [] ? trace_hardirqs_off+0xd/0xf [] ____cache_alloc_node+0x177/0x186 [] ? ext4_init_io_end+0x1c/0x37 [] kmem_cache_alloc+0xf1/0x176 [] ? test_set_page_writeback+0x101/0x113 [] ext4_init_io_end+0x1c/0x37 [] ext4_bio_write_page+0x20f/0x3af [] mpage_da_submit_io+0x26e/0x2f6 [] ? __find_get_block_slow+0x38/0x133 [] mpage_da_map_and_submit+0x3a7/0x3bd [] ext4_da_writepages+0x30d/0x426 [] do_writepages+0x1c/0x2a [] __writeback_single_inode+0x3e/0xe5 [] writeback_sb_inodes+0x1bd/0x2f4 [] __writeback_inodes_wb+0x6f/0xb4 [] wb_writeback+0x101/0x195 [] ? trace_hardirqs_on_caller+0x114/0x170 [] ? wb_do_writeback+0xaa/0x173 [] wb_do_writeback+0x4a/0x173 [] ? trace_hardirqs_on+0xd/0xf [] ? del_timer+0x4b/0x5b [] bdi_writeback_thread+0x6d/0x147 [] ? wb_do_writeback+0x173/0x173 [] kthread+0xd0/0xd8 [] ? _raw_spin_unlock_irq+0x29/0x3e [] ? __init_kthread_worker+0x55/0x55 [] ret_from_fork+0x7c/0xb0 [] ? __init_kthread_worker+0x55/0x55 2 locks held by flush-8:0/24427: #0: (&type->s_umount_key#41){.+.+..}, at: [] grab_super_passive+0x4c/0x76 #1: (jbd2_handle){+.+...}, at: [] start_this_handle+0x475/0x4ea The problem here is that another thread, which is attempting to write the to-be-stored NFS page to the on-ext4 cache file is waiting for the journal lock, eg: INFO: task kworker/u:2:24437 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kworker/u:2 D ffff880039589768 0 24437 2 0x00000000 ffff8800395896d8 0000000000000046 ffff8800283bf040 ffff880039589fd8 0000000000011c80 ffff880039589fd8 ffff880039f0b040 ffff8800283bf040 0000000000000006 ffff8800283bf6b8 ffff880039589658 ffffffff81071a13 Call Trace: [] ? mark_held_locks+0xbe/0xea [] ? _raw_spin_unlock_irqrestore+0x3a/0x50 [] ? trace_hardirqs_on_caller+0x114/0x170 [] ? trace_hardirqs_on+0xd/0xf [] schedule+0x60/0x62 [] start_this_handle+0x317/0x4ea [] ? __init_waitqueue_head+0x4d/0x4d [] jbd2__journal_start+0xb3/0x12e [] __ext4_journal_start_sb+0xb2/0xc6 [] ext4_da_write_begin+0x109/0x233 [] generic_file_buffered_write+0x11a/0x264 [] ? __mark_inode_dirty+0x2d/0x1ee [] __generic_file_aio_write+0x2a5/0x2d5 [] generic_file_aio_write+0x6f/0xd0 [] ext4_file_write+0x38c/0x3c4 [] do_sync_write+0x91/0xd1 [] cachefiles_write_page+0x26f/0x310 [cachefiles] [] fscache_write_op+0x21e/0x37a [fscache] [] ? _raw_spin_unlock_irq+0x29/0x3e [] fscache_op_work_func+0x78/0xd7 [fscache] [] process_one_work+0x232/0x3a8 [] ? process_one_work+0x1d7/0x3a8 [] worker_thread+0x214/0x303 [] ? manage_workers+0x245/0x245 [] kthread+0xd0/0xd8 [] ? _raw_spin_unlock_irq+0x29/0x3e [] ? __init_kthread_worker+0x55/0x55 [] ret_from_fork+0x7c/0xb0 [] ? __init_kthread_worker+0x55/0x55 4 locks held by kworker/u:2/24437: #0: (fscache_operation){.+.+.+}, at: [] process_one_work+0x1d7/0x3a8 #1: ((&op->work)){+.+.+.}, at: [] process_one_work+0x1d7/0x3a8 #2: (sb_writers#14){.+.+.+}, at: [] generic_file_aio_write+0x51/0xd0 #3: (&sb->s_type->i_mutex_key#19){+.+.+.}, at: [] generic_file_aio_write+0x5b/0x fscache already tries to cancel pending stores, but it can't cancel a write for which I/O is already in progress. An alternative would be to accept writing garbage to the cache under extreme circumstances and to kill the afflicted cache object if we have to do this. However, we really need to know how strapped the allocator is before deciding to do that. Signed-off-by: David Howells Tested-By: Milosz Tanski Acked-by: Jeff Layton

fs/fscache: remove spin_lock() from the condition in while()

2013-06-19T13:16:47+00:00

The spinlock() within the condition in while() will cause a compile error if it is not a function. This is not a problem on mainline but it does not look pretty and there is no reason to do it that way. That patch writes it a little differently and avoids the double condition. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: David Howells Tested-By: Milosz Tanski Acked-by: Jeff Layton

lwn.git/fs/fscache/page.c, branch v3.14.3

FS-Cache: Provide the ability to enable/disable cookies

FS-Cache: Add use/unuse/wake cookie wrappers

lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interrupt

fscache: Netfs function for cleanup post readpages

FS-Cache: Add interface to check consistency of a cached object

FS-Cache: The retrieval remaining-pages counter needs to be atomic_t

FS-Cache: Simplify cookie retention for fscache_objects, fixing oops

FS-Cache: Fix object state machine to have separate work and wait states

FS-Cache: Don't sleep in page release if __GFP_FS is not set

fs/fscache: remove spin_lock() from the condition in while()