lwn.git/mm/truncate.c, branch v2.6.27.8

VFS: fix dio write returning EIO when try_to_release_page fails

2008-09-03T02:21:37+00:00

Dio write returns EIO when try_to_release_page fails because bh is still referenced. The patch commit 3f31fddfa26b7594b44ff2b34f9a04ba409e0f91 Author: Mingming Cao Date: Fri Jul 25 01:46:22 2008 -0700 jbd: fix race between free buffer and commit transaction was merged into 2.6.27-rc1, but I noticed that this patch is not enough to fix the race. I did fsstress test heavily to 2.6.27-rc1, and found that dio write still sometimes got EIO through this test. The patch above fixed race between freeing buffer(dio) and committing transaction(jbd) but I discovered that there is another race, freeing buffer(dio) and ext3/4_ordered_writepage. : background_writeout() ->write_cache_pages() ->ext3_ordered_writepage() walk_page_buffers() -> take a bh ref block_write_full_page() -> unlock_page : <- end_page_writeback : <- race! (dio write->try_to_release_page fails) walk_page_buffers() ->release a bh ref ext3_ordered_writepage holds bh ref and does unlock_page remaining taking a bh ref, so this causes the race and failure of try_to_release_page. To fix this race, I used the approach of falling back to buffered writes if try_to_release_page() fails on a page. [akpm@linux-foundation.org: cleanups] Signed-off-by: Hisashi Hifumi Cc: Chris Mason Cc: Jan Kara Cc: Mingming Cao Cc: Zach Brown Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: rename page trylock

2008-08-05T04:31:34+00:00

Converting page lock to new locking bitops requires a change of page flag operation naming, so we might as well convert it to something nicer (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked). This also facilitates lockdeping of page lock. Signed-off-by: Nick Piggin Acked-by: KOSAKI Motohiro Acked-by: Peter Zijlstra Acked-by: Andrew Morton Acked-by: Benjamin Herrenschmidt Signed-off-by: Linus Torvalds

mm: dont clear PG_uptodate on truncate/invalidate

2008-08-02T16:12:34+00:00

Brian Wang reported that a FUSE filesystem exported through NFS could return I/O errors on read. This was traced to splice_direct_to_actor() returning a short or zero count when racing with page invalidation. However this is not FUSE or NFSD specific, other filesystems (notably NFS) also call invalidate_inode_pages2() to purge stale data from the cache. If this happens while such pages are sitting in a pipe buffer, then splice(2) from the pipe can return zero, and read(2) from the pipe can return ENODATA. The zero return is especially bad, since it implies end-of-file or disconnected pipe/socket, and is documented as such for splice. But returning an error for read() is also nasty, when in fact there was no error (data becoming stale is not an error). The same problems can be triggered by "hole punching" with madvise(MADV_REMOVE). Fix this by not clearing the PG_uptodate flag on truncation and invalidation. Signed-off-by: Miklos Szeredi Acked-by: Nick Piggin Cc: Andrew Morton Cc: Jens Axboe Signed-off-by: Linus Torvalds

mm: spinlock tree_lock

2008-07-26T19:00:06+00:00

mapping->tree_lock has no read lockers. convert the lock from an rwlock to a spinlock. Signed-off-by: Nick Piggin Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Hugh Dickins Cc: "Paul E. McKenney" Reviewed-by: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

fix invalidate_inode_pages2_range() to not clear ret

2008-04-28T15:58:18+00:00

DIO invalidates page cache through invalidate_inode_pages2_range(). invalidate_inode_pages2_range() sets ret=-EIO when invalidate_complete_page2() fails, but this ret is cleared if do_launder_page() succeed on a page of next index. In this case, dio is carried out even if invalidate_complete_page2() fails on some pages. This can cause inconsistency between memory and blocks on HDD because the page cache still exists. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Hisashi Hifumi Cc: Badari Pulavarty Cc: Ken Chen Cc: Zach Brown Cc: Nick Piggin Cc: Trond Myklebust Cc: "J. Bruce Fields" Cc: Chuck Lever Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

docbook: fix kernel-api source files

2008-03-03T18:47:14+00:00

Fix docbook problems in kernel-api.tmpl. These cause the generated docbook to be incorrect. Signed-off-by: Randy Dunlap Signed-off-by: Linus Torvalds

page migraton: handle orphaned pages

2008-02-05T17:44:19+00:00

Orphaned page might have fs-private metadata, the page is truncated. As the page hasn't mapping, page migration refuse to migrate the page. It appears the page is only freed in page reclaim and if zone watermark is low, the page is never freed, as a result migration always fail. I thought we could free the metadata so such page can be freed in migration and make migration more reliable. [akpm@linux-foundation.org: go direct to try_to_free_buffers()] Signed-off-by: Shaohua Li Acked-by: Nick Piggin Acked-by: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

Fix dirty page accounting leak with ext3 data=journal

2008-02-05T17:44:19+00:00

In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 ("Clean up and make try_to_free_buffers() not race with dirty pages"), try_to_free_buffers was changed to bail out if the page was dirty. That in turn caused truncate_complete_page to leak massive amounts of memory, because the dirty bit was only cleared after the call to try_to_free_buffers. So the call to cancel_dirty_page was moved up to have the dirty bit cleared early in 3e67c0987d7567ad666641164a153dca9a43b11d ("truncate: clear page dirtiness before running try_to_free_buffers()"). The problem with that fix is, that the page can be redirtied after cancel_dirty_page was called, eg. like this: truncate_complete_page() cancel_dirty_page() // PG_dirty cleared, decr. dirty pages do_invalidatepage() ext3_invalidatepage() journal_invalidatepage() journal_unmap_buffer() __dispose_buffer() __journal_unfile_buffer() __journal_temp_unlink_buffer() mark_buffer_dirty(); // PG_dirty set, incr. dirty pages And then we end up with dirty pages being wrongly accounted. As a result, in ecdfc9787fe527491baefc22dce8b2dbd5b2908d ("Resurrect 'try_to_free_buffers()' VM hackery") the changes to try_to_free_buffers were reverted, so the original reason for the massive memory leak is gone, and we can also revert the move of the call to cancel_dirty_page from truncate_complete_page and get the accounting right again. I'm not sure if it matters, but opposed to the final check in __remove_from_page_cache, this one also cares about the task io accounting, so maybe we want to use this instead, although it's not quite the clean fix either. Signed-off-by: Björn Steinbrink Tested-by: Krzysztof Piotr Oledzki Cc: Jan Kara Cc: Nick Piggin Cc: Peter Zijlstra Cc: Thomas Osterried Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user

2008-02-05T17:44:13+00:00

Simplify page cache zeroing of segments of pages through 3 functions zero_user_segments(page, start1, end1, start2, end2) Zeros two segments of the page. It takes the position where to start and end the zeroing which avoids length calculations and makes code clearer. zero_user_segment(page, start, end) Same for a single segment. zero_user(page, start, length) Length variant for the case where we know the length. We remove the zero_user_page macro. Issues: 1. Its a macro. Inline functions are preferable. 2. The KM_USER0 macro is only defined for HIGHMEM. Having to treat this special case everywhere makes the code needlessly complex. The parameter for zeroing is always KM_USER0 except in one single case that we open code. Avoiding KM_USER0 makes a lot of code not having to be dealing with the special casing for HIGHMEM anymore. Dealing with kmap is only necessary for HIGHMEM configurations. In those configurations we use KM_USER0 like we do for a series of other functions defined in highmem.h. Since KM_USER0 is depends on HIGHMEM the existing zero_user_page function could not be a macro. zero_user_* functions introduced here can be be inline because that constant is not used when these functions are called. Also extract the flushing of the caches to be outside of the kmap. [akpm@linux-foundation.org: fix nfs and ntfs build] [akpm@linux-foundation.org: fix ntfs build some more] Signed-off-by: Christoph Lameter Cc: Steven French Cc: Michael Halcrow Cc: Cc: Steven Whitehouse Cc: Trond Myklebust Cc: "J. Bruce Fields" Cc: Anton Altaparmakov Cc: Mark Fasheh Cc: David Chinner Cc: Michael Halcrow Cc: Steven French Cc: Steven Whitehouse Cc: Trond Myklebust Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

do_invalidatepage() comment typo fix

2008-02-03T16:04:10+00:00

Fix a typo in the comment for do_invalidatepage(). Signed-off-by: Fengguang Wu Signed-off-by: Adrian Bunk