lwn.git - Linux kernel documentation tree maintained by Jonathan Corbet

Age	Commit message (Collapse)	Author
2014-09-25	NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()	NeilBrown
	Now that nfs_release_page() doesn't block indefinitely, other deadlock avoidance mechanisms aren't needed. - it doesn't hurt for kswapd to block occasionally. If it doesn't want to block it would clear __GFP_WAIT. The current_is_kswapd() was only added to avoid deadlocks and we have a new approach for that. - memory allocation in the SUNRPC layer can very rarely try to ->releasepage() a page it is trying to handle. The deadlock is removed as nfs_release_page() doesn't block indefinitely. So we don't need to set PF_FSTRANS for sunrpc network operations any more. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25	NFS: avoid waiting at all in nfs_release_page when congested.	NeilBrown
	If nfs_release_page() is called on a sequence of pages which are all in the same file which is blocked on COMMIT, each page could contribute a 1 second delay which could be come excessive. I have seen delays of as much as 208 seconds. To keep the delay to one second, mark the bdi as write-congested if the commit didn't finished. Once it does finish, the write-congested flag will be cleared by nfs_commit_release_pages(). With this, the longest total delay in try_to_free_pages that I have seen is under 3 seconds. With no waiting in nfs_release_page at all I have seen delays of nearly 1.5 seconds. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25	NFS: avoid deadlocks with loop-back mounted NFS filesystems.	NeilBrown
	Support for loop-back mounted NFS filesystems is useful when NFS is used to access shared storage in a high-availability cluster. If the node running the NFS server fails, some other node can mount the filesystem and start providing NFS service. If that node already had the filesystem NFS mounted, it will now have it loop-back mounted. nfsd can suffer a deadlock when allocating memory and entering direct reclaim. While direct reclaim does not write to the NFS filesystem it can send and wait for a COMMIT through nfs_release_page(). This patch modifies nfs_release_page() to wait a limited time for the commit to complete - one second. If the commit doesn't complete in this time, nfs_release_page() will fail. This means it might now fail in some cases where it wouldn't before. These cases are only when 'gfp' includes '__GFP_WAIT'. nfs_release_page() is only called by try_to_release_page(), and that can only be called on an NFS page with required 'gfp' flags from - page_cache_pipe_buf_steal() in splice.c - shrink_page_list() in vmscan.c - invalidate_inode_pages2_range() in truncate.c The first two handle failure quite safely. The last is only called after ->launder_page() has been called, and that will have waited for the commit to finish already. So aborting if the commit takes longer than 1 second is perfectly safe. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24	NFS: don't use STABLE writes during writeback.	NeilBrown
	commit b31268ac793fd300da66b9c28bbf0a200339ab96 FS: Use stable writes when not doing a bulk flush was a bit heavy handed. The particular problem that lead to this patch was that small writes to an O_SYNC file we being written as UNSTABLE writes followed by a commit. This is appropriate for large writes (which require multiple NFS requests) but for small writes (single NFS request), using NFS_FILE_SYNC is more efficient. So that patch causes the code to select between the two methods depending on how many nfs requests get generated. Unfortunately this ends up applying to non O_SYNC writes as well. In particular if you memory-map a file and update random pages, then when they are eventually written out by writeback they will go as NFS_FILE_SYNC. This is inefficient and slows down the application. So: only set FLUSH_COND_STABLE when wbc->sync_mode is WB_SYNC_ALL. With this patch: O_SYNC writes are NFS_FILE_SYNC for single requests, and NFS_UNSTABLE followed by COMMIT for multiple requests Writing immediately before close of fsync follow the same pattern. Non-O_SYNC writes without an fsync of close eventually get flushed out as UNSTABLE and a commit follows eventually as appropriate. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24	NFSv4: use exponential retry on NFS4ERR_DELAY for async requests.	NeilBrown
	Currently asynchronous NFSv4 request will be retried with exponential timeout (from 1/10 to 15 seconds), but async requests will always use a 15second retry. Some "async" requests are really synchronous though. The async mechanism is used to allow the request to continue if the requesting process is killed. In those cases, an exponential retry is appropriate. For example, if two different clients both open a file and get a READ delegation, and one client then unlinks the file (while still holding an open file descriptor), that unlink will used the "silly-rename" handling which is async. The first rename will result in NFS4ERR_DELAY while the delegation is reclaimed from the other client. The rename will not be retried for 15 seconds, causing an unlink to take 15 seconds rather than 100msec. This patch only added exponential timeout for async unlink and async rename. Other async calls, such as 'close' are sometimes waited for so they might benefit from exponential timeout too. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24	lockd: Try to reconnect if statd has moved	Benjamin Coddington
	If rpc.statd is restarted, upcalls to monitor hosts can fail with ECONNREFUSED. In that case force a lookup of statd's new port and retry the upcall. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24	Fixing lease renewal	Olga Kornievskaia
	Commit c9fdeb28 removed a 'continue' after checking if the lease needs to be renewed. However, if client hasn't moved, the code falls down to starting reboot recovery erroneously (ie., sends open reclaim and gets back stale_clientid error) before recovering from getting stale_clientid on the renew operation. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Fixes: c9fdeb280b8c (NFS: Add basic migration support to state manager thread) Cc: stable@vger.kernel.org # 3.13+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24	nfs: fix duplicate proc entries	Fabian Frederick
	Commit 65b38851a174 ("NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes") updated the following function: static int nfs_volume_list_open(struct inode inode, struct file file) it used &nfs_server_list_ops instead of &nfs_volume_list_ops which means cat /proc/fs/nfsfs/volumes = /proc/fs/nfsfs/servers Signed-off-by: Fabian Frederick <fabf@skynet.be> Fixes: 65b38851a174 (NFS: Fix /proc/fs/nfsfs/servers and...) Cc: stable@vger.kernel.org # 3.4.x+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-23	f2fs: use more free segments until SSR is activated	Jaegeuk Kim
	Previously, f2fs activates SSR if the # of free segments reaches to the # of overprovisioned segments. In this case, SSR starts to use dirty segments only, so that the overprovisoned space cannot be selected for new data. This means that we have no chance to utilizae the overprovisioned space at all. This patch fixes that by allowing LFS allocations until the # of free segments reaches to the last threshold, reserved space. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23	f2fs: change the ipu_policy option to enable combinations	Jaegeuk Kim
	This patch changes the ipu_policy setting to use any combination of orthogonal policies. Signed-off-by: Changman Lee <cm224.lee@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23	f2fs: fix to search whole dirty segmap when get_victim	Chao Yu
	In ->get_victim we get max_search value from dirty_i->nr_dirty without protection of seglist_lock, after that, nr_dirty can be increased/decreased before we hold seglist_lock lock. Then in main loop we attempt to traverse all dirty section one time to find victim section, but it's not accurate to use max_search as the total loop count, because we might lose checking several sections or check sections redundantly for the case of nr_dirty are increased or decreased previously. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23	f2fs: fix to clean previous mount option when remount_fs	Chao Yu
	In manual of mount, we descript remount as below: "mount -o remount,rw /dev/foo /dir After this call all old mount options are replaced and arbitrary stuff from fstab is ignored, except the loop= option which is internally generated and maintained by the mount command." Previously f2fs do not clear up old mount options when remount_fs, so we have no chance of disabling previous option (e.g. flush_merge). Fix it. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23	f2fs: skip punching hole in special condition	Chao Yu
	Now punching hole in directory is not supported in f2fs, so let's limit file type in punch_hole(). In addition, in punch_hole if offset is exceed file size, we should skip punching hole. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23	f2fs: support large sector size	Chao Yu
	Block size in f2fs is 4096 bytes, so theoretically, f2fs can support 4096 bytes sector device at maximum. But now f2fs only support 512 bytes size sector, so block device such as zRAM which uses page cache as its block storage space will not be mounted successfully as mismatch between sector size of zRAM and sector size of f2fs supported. In this patch we support large sector size in f2fs, so block device with sector size of 512/1024/2048/4096 bytes can be supported in f2fs. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23	f2fs: fix to truncate blocks past EOF in ->setattr	Chao Yu
	By using FALLOC_FL_KEEP_SIZE in ->fallocate of f2fs, we can fallocate block past EOF without changing i_size of inode. These blocks past EOF will not be truncated in ->setattr as we truncate them only when change the file size. We should give a chance to truncate blocks out of filesize in setattr(). Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23	f2fs: update i_size when __allocate_data_block	Jaegeuk Kim
	The f2fs_direct_IO uses __allocate_data_block, but inside the allocation path, we should update i_size at the changed time to update its inode page. Otherwise, we can get wrong i_size after roll-forward recovery. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23	f2fs: use MAX_BIO_BLOCKS(sbi)	Jaegeuk Kim
	This patch cleans up a simple macro. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23	f2fs: remove redundant operation during roll-forward recovery	Jaegeuk Kim
	If same data is updated multiple times, we don't need to redo whole the operations. Let's just update the lastest one. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23	f2fs: do not skip latest inode information	Jaegeuk Kim
	In f2fs_sync_file, if there is no written appended writes, it skips to write its node blocks. But, if there is up-to-date inode page, we should write it to update its metadata during the roll-forward recovery. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23	f2fs: fix roll-forward missing scenarios	Jaegeuk Kim
	We can summarize the roll forward recovery scenarios as follows. [Term] F: fsync_mark, D: dentry_mark 1. inode(x) \| CP \| inode(x) \| dnode(F) -> Update the latest inode(x). 2. inode(x) \| CP \| inode(F) \| dnode(F) -> No problem. 3. inode(x) \| CP \| dnode(F) \| inode(x) -> Recover to the latest dnode(F), and drop the last inode(x) 4. inode(x) \| CP \| dnode(F) \| inode(F) -> No problem. 5. CP \| inode(x) \| dnode(F) -> The inode(DF) was missing. Should drop this dnode(F). 6. CP \| inode(DF) \| dnode(F) -> No problem. 7. CP \| dnode(F) \| inode(DF) -> If f2fs_iget fails, then goto next to find inode(DF). 8. CP \| dnode(F) \| inode(x) -> If f2fs_iget fails, then goto next to find inode(DF). But it will fail due to no inode(DF). So, this patch adds some missing points such as #1, #5, #7, and #8. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23	f2fs: fix conditions to remain recovery information in f2fs_sync_file	Jaegeuk Kim
	This patch revisited whole the recovery information during the f2fs_sync_file. In this patch, there are three information to make a decision. a) IS_CHECKPOINTED, /* is it checkpointed before? / b) HAS_FSYNCED_INODE, / is the inode fsynced before? / c) HAS_LAST_FSYNC, / has the latest node fsync mark? */ And, the scenarios for our rule are based on: [Term] F: fsync_mark, D: dentry_mark 1. inode(x) \| CP \| inode(x) \| dnode(F) 2. inode(x) \| CP \| inode(F) \| dnode(F) 3. inode(x) \| CP \| dnode(F) \| inode(x) \| inode(F) 4. inode(x) \| CP \| dnode(F) \| inode(F) 5. CP \| inode(x) \| dnode(F) \| inode(DF) 6. CP \| inode(DF) \| dnode(F) 7. CP \| dnode(F) \| inode(DF) 8. CP \| dnode(F) \| inode(x) \| inode(DF) For example, #3, the three conditions should be changed as follows. inode(x) \| CP \| dnode(F) \| inode(x) \| inode(F) a) x o o o o b) x x x x o c) x o o x o If f2fs_sync_file stops ------^, it should write inode(F) --------------^ So, the need_inode_block_update should return true, since c) get_nat_flag(e, HAS_LAST_FSYNC), is false. For example, #8, CP \| alloc \| dnode(F) \| inode(x) \| inode(DF) a) o x x x x b) x x x o c) o o x o If f2fs_sync_file stops -------^, it should write inode(DF) --------------^ Note that, the roll-forward policy should follow this rule, which means, if there are any missing blocks, we doesn't need to recover that inode. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23	f2fs: introduce a flag to represent each nat entry information	Jaegeuk Kim
	This patch introduces a flag in the nat entry structure to merge various information such as checkpointed and fsync_done marks. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23	f2fs: use meta_inode cache to improve roll-forward speed	Jaegeuk Kim
	Previously, all the dnode pages should be read during the roll-forward recovery. Even worsely, whole the chain was traversed twice. This patch removes that redundant and costly read operations by using page cache of meta_inode and readahead function as well. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-22	Merge tag 'fscache-fixes-20140917' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull fs-cache fixes from David Howells: - Put a timeout in releasepage() to deal with a recursive hang between the memory allocator, writeback, ext4 and fscache under memory pressure. - Fix a pair of refcount bugs in the fscache error handling. - Remove a couple of unused pagevecs. - The cachefiles requirement that the base directory support rename should permit rename2 as an alternative - otherwise certain filesystems cannot now be used as backing stores (such as ext4). * tag 'fscache-fixes-20140917' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: CacheFiles: Handle rename2 cachefiles: remove two unused pagevecs. FS-Cache: refcount becomes corrupt under vma pressure. FS-Cache: Reduce cookie ref count if submit fails. FS-Cache: Timeout for releasepage()
2014-09-22	Fix nasty 32-bit overflow bug in buffer i/o code.	Anton Altaparmakov
	On 32-bit architectures, the legacy buffer_head functions are not always handling the sector number with the proper 64-bit types, and will thus fail on 4TB+ disks. Any code that uses __getblk() (and thus bread(), breadahead(), sb_bread(), sb_breadahead(), sb_getblk()), and calls it using a 64-bit block on a 32-bit arch (where "long" is 32-bit) causes an inifinite loop in __getblk_slow() with an infinite stream of errors logged to dmesg like this: __find_get_block_slow() failed. block=6740375944, b_blocknr=2445408648 b_state=0x00000020, b_size=512 device sda1 blocksize: 512 Note how in hex block is 0x191C1F988 and b_blocknr is 0x91C1F988 i.e. the top 32-bits are missing (in this case the 0x1 at the top). This is because grow_dev_page() is broken and has a 32-bit overflow due to shifting the page index value (a pgoff_t - which is just 32 bits on 32-bit architectures) left-shifted as the block number. But the top bits to get lost as the pgoff_t is not type cast to sector_t / 64-bit before the shift. This patch fixes this issue by type casting "index" to sector_t before doing the left shift. Note this is not a theoretical bug but has been seen in the field on a 4TiB hard drive with logical sector size 512 bytes. This patch has been verified to fix the infinite loop problem on 3.17-rc5 kernel using a 4TB disk image mounted using "-o loop". Without this patch doing a "find /nt" where /nt is an NTFS volume causes the inifinite loop 100% reproducibly whilst with the patch it works fine as expected. Signed-off-by: Anton Altaparmakov <aia21@cantab.net> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-21	pnfs/blocklayout: Fix a 64-bit division/remainder issue in bl_map_stripe	Trond Myklebust
	kbuild test robot reports: fs/built-in.o: In function `bl_map_stripe': >> :(.text+0x965b4): undefined reference to `__aeabi_uldivmod' >> :(.text+0x965cc): undefined reference to `__aeabi_uldivmod' >> :(.text+0x96604): undefined reference to `__aeabi_uldivmod' Fixes: 5c83746a0cf2 (pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing) Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-19	Merge branch 'for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "I've got a revert to fix a regression with btrfs device registration, and Filipe has part two of his fsync fix from last week" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Revert "Btrfs: device_list_add() should not update list when mounted" Btrfs: set inode's logged_trans/last_log_commit after ranged fsync
2014-09-19	Merge tag 'nfs-for-3.17-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs	Linus Torvalds
	Pull NFS client fixes from Trond Myklebust: "Highligts: - fix an Oops in nfs4_open_and_get_state - fix an Oops in the nfs4_state_manager - fix another bug in the close/open_downgrade code" * tag 'nfs-for-3.17-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4: Fix another bug in the close/open_downgrade code NFSv4: nfs4_state_manager() vs. nfs_server_remove_lists() NFS: remove BUG possibility in nfs4_open_and_get_state
2014-09-18	Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6	Linus Torvalds
	Pull cifs/smb3 fixes from Steve French: "Fixes for problems found during testing and debugging at the SMB3 storage test event (plugfest) this week" * 'for-linus' of git://git.samba.org/sfrench/cifs-2.6: Fix mfsymlinks file size check Update version number displayed by modinfo for cifs.ko cifs: remove dead code Revert "cifs: No need to send SIGKILL to demux_thread during umount" [SMB3] Fix oops when creating symlinks on smb3 [CIFS] Fix setting time before epoch (negative time values)
2014-09-18	NFSv4: Fix another bug in the close/open_downgrade code	Trond Myklebust
	James Drew reports another bug whereby the NFS client is now sending an OPEN_DOWNGRADE in a situation where it should really have sent a CLOSE: the client is opening the file for O_RDWR, but then trying to do a downgrade to O_RDONLY, which is not allowed by the NFSv4 spec. Reported-by: James Drews <drews@engr.wisc.edu> Link: http://lkml.kernel.org/r/541AD7E5.8020409@engr.wisc.edu Fixes: aee7af356e15 (NFSv4: Fix problems with close in the presence...) Cc: stable@vger.kernel.org # 2.6.33+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-18	NFSv4: nfs4_state_manager() vs. nfs_server_remove_lists()	Steve Dickson
	There is a race between nfs4_state_manager() and nfs_server_remove_lists() that happens during a nfsv3 mount. The v3 mount notices there is already a supper block so nfs_server_remove_lists() called which uses the nfs_client_lock spin lock to synchronize access to the client list. At the same time nfs4_state_manager() is running through the client list looking for work to do, using the same lock. When nfs4_state_manager() wins the race to the list, a v3 client pointer is found and not ignored properly which causes the panic. Moving some protocol checks before the state checking avoids the panic. CC: Stable Tree <stable@vger.kernel.org> Signed-off-by: Steve Dickson <steved@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-18	Revert "Btrfs: device_list_add() should not update list when mounted"	Chris Mason
	This reverts commit b96de000bc8bc9688b3a2abea4332bd57648a49f. This commit is triggering failures to mount by subvolume id in some configurations. The main problem is how many different ways this scanning function is used, both for scanning while mounted and unmounted. A proper cleanup is too big for late rcs. For now, just revert the commit and we'll put a better fix into a later merge window. Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17	CacheFiles: Handle rename2	David Howells
	Not all filesystems now provide the rename i_op - ext4 for one - but rather provide the rename2 i_op. CacheFiles checks that the filesystem has rename and so will reject ext4 now with EPERM: CacheFiles: Failed to register: -1 Fix this by checking for rename2 as an alternative. The call to vfs_rename() actually handles selection of the appropriate function, so we needn't worry about that. Turning on debugging shows: [cachef] ==> cachefiles_get_directory(,,cache) [cachef] subdir -> ffff88000b22b778 positive [cachef] <== cachefiles_get_directory() = -1 [check] where -1 is EPERM. Signed-off-by: David Howells <dhowells@redhat.com>
2014-09-17	cachefiles: remove two unused pagevecs.	NeilBrown
	These two have been unused since commit c4d6d8dbf335c7fa47341654a37c53a512b519bb CacheFiles: Fix the marking of cached pages in 3.8. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: David Howells <dhowells@redhat.com>
2014-09-17	FS-Cache: refcount becomes corrupt under vma pressure.	Milosz Tanski
	In rare cases under heavy VMA pressure the ref count for a fscache cookie becomes corrupt. In this case we decrement ref count even if we fail before incrementing the refcount. FS-Cache: Assertion failed bnode-eca5f9c6/syslog 0 > 0 is false ------------[ cut here ]------------ kernel BUG at fs/fscache/cookie.c:519! invalid opcode: 0000 [#1] SMP Call Trace: [<ffffffffa01ba060>] __fscache_relinquish_cookie+0x50/0x220 [fscache] [<ffffffffa02d64ce>] ceph_fscache_unregister_inode_cookie+0x3e/0x50 [ceph] [<ffffffffa02ae1d3>] ceph_destroy_inode+0x33/0x200 [ceph] [<ffffffff811cf67e>] ? __fsnotify_inode_delete+0xe/0x10 [<ffffffff811a9e0c>] destroy_inode+0x3c/0x70 [<ffffffff811a9f51>] evict+0x111/0x180 [<ffffffff811aa763>] iput+0x103/0x190 [<ffffffff811a5de8>] __dentry_kill+0x1c8/0x220 [<ffffffff811a5f31>] shrink_dentry_list+0xf1/0x250 [<ffffffff811a762c>] prune_dcache_sb+0x4c/0x60 [<ffffffff811930af>] super_cache_scan+0xff/0x170 [<ffffffff8113d7a0>] shrink_slab_node+0x140/0x2c0 [<ffffffff8113f2da>] shrink_slab+0x8a/0x130 [<ffffffff81142572>] balance_pgdat+0x3e2/0x5d0 [<ffffffff811428ca>] kswapd+0x16a/0x4a0 [<ffffffff810a43f0>] ? __wake_up_sync+0x20/0x20 [<ffffffff81142760>] ? balance_pgdat+0x5d0/0x5d0 [<ffffffff81083e09>] kthread+0xc9/0xe0 [<ffffffff81010000>] ? ftrace_raw_event_xen_mmu_release_ptpage+0x70/0x90 [<ffffffff81083d40>] ? flush_kthread_worker+0xb0/0xb0 [<ffffffff8159f63c>] ret_from_fork+0x7c/0xb0 [<ffffffff81083d40>] ? flush_kthread_worker+0xb0/0xb0 RIP [<ffffffffa01b984b>] __fscache_disable_cookie+0x1db/0x210 [fscache] RSP <ffff8803bc85f9b8> ---[ end trace 254d0d7c74a01f25 ]--- Signed-off-by: Milosz Tanski <milosz@adfin.com> Signed-off-by: David Howells <dhowells@redhat.com>
2014-09-17	nfsd4: clarify how grace period ends	J. Bruce Fields
	The grace period is ended in two steps--first userland is notified that the grace period is now long enough that any clients who have not yet reclaimed can be safely forgotten, then we flip the switch that forbids reclaims and allows new opens. I had to think a bit to convince myself that the ordering was right here. Document it. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-17	nfsd4: stop grace_time update at end of grace period	J. Bruce Fields
	The attempt to automatically set a new grace period time at the end of the grace period isn't really helpful. We'll probably shut down and reboot before we actually make use of the new grace period time anyway. So may as well leave it up to the init system to get this right. This just confuses people when they see /proc/fs/nfsd/nfsv4gracetime change from what they set it to. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-17	nfsd: skip subsequent UMH "create" operations after the first one for v4.0 ↵	Jeff Layton
	clients In the case of v4.0 clients, we may call into the "create" client tracking operation multiple times (once for each openowner). Upcalling for each one of those is wasteful and slow however. We can skip doing further "create" operations after the first one if we know that one has already been done. v4.1+ clients generally only call into this function once (on RECLAIM_COMPLETE), and we can't skip upcalling on the create even if the STABLE bit is set. Doing so would make it impossible for nfsdcltrack to lift the grace period early since the timestamp has a different meaning in the case where the client is expected to issue a RECLAIM_COMPLETE. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17	nfsd: set and test NFSD4_CLIENT_STABLE bit to reduce nfsdcltrack upcalls	Jeff Layton
	The nfsdcltrack upcall doesn't utilize the NFSD4_CLIENT_STABLE flag, which basically results in an upcall every time we call into the client tracking ops. Change it to set this bit on a successful "check" or "create" request, and clear it on a "remove" request. Also, check to see if that bit is set before upcalling on a "check" or "remove" request, and skip upcalling appropriately, depending on its state. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17	nfsd: serialize nfsdcltrack upcalls for a particular client	Jeff Layton
	In a later patch, we want to add a flag that will allow us to reduce the need for upcalls. In order to handle that correctly, we'll need to ensure that racing upcalls for the same client can't occur. In practice it should be rare for this to occur with a well-behaved client, but it is possible. Convert one of the bits in the cl_flags field to be an upcall bitlock, and use it to ensure that upcalls for the same client are serialized. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17	nfsd: pass extra info in env vars to upcalls to allow for early grace period end	Jeff Layton
	In order to support lifting the grace period early, we must tell nfsdcltrack what sort of client the "create" upcall is for. We can't reliably tell if a v4.0 client has completed reclaiming, so we can only lift the grace period once all the v4.1+ clients have issued a RECLAIM_COMPLETE and if there are no v4.0 clients. Also, in order to lift the grace period, we have to tell userland when the grace period started so that it can tell whether a RECLAIM_COMPLETE has been issued for each client since then. Since this is all optional info, we pass it along in environment variables to the "init" and "create" upcalls. By doing this, we don't need to revise the upcall format. The UMH upcall can simply make use of this info if it happens to be present. If it's not then it can just avoid lifting the grace period early. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17	nfsd: add a v4_end_grace file to /proc/fs/nfsd	Jeff Layton
	Allow a privileged userland process to end the v4 grace period early. Writing "Y", "y", or "1" to the file will cause the v4 grace period to be lifted. The basic idea with this will be to allow the userland client tracking program to lift the grace period once it knows that no more clients will be reclaiming state. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17	lockd: add a /proc/fs/lockd/nlm_end_grace file	Jeff Layton
	Add a new procfile that will allow a (privileged) userland process to end the NLM grace period early. The basic idea here will be to have sm-notify write to this file, if it sent out no NOTIFY requests when it runs. In that situation, we can generally expect that there will be no reclaim requests so the grace period can be lifted early. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17	nfsd: reject reclaim request when client has already sent RECLAIM_COMPLETE	Jeff Layton
	As stated in RFC 5661, section 18.51.3: Once a RECLAIM_COMPLETE is done, there can be no further reclaim operations for locks whose scope is defined as having completed recovery. Once the client sends RECLAIM_COMPLETE, the server will not allow the client to do subsequent reclaims of locking state for that scope and, if these are attempted, will return NFS4ERR_NO_GRACE. Ensure that we enforce that requirement. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17	nfsd: remove redundant boot_time parm from grace_done client tracking op	Jeff Layton
	Since it's stored in nfsd_net, we don't need to pass it in separately. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17	lockd: move lockd's grace period handling into its own module	Jeff Layton
	Currently, all of the grace period handling is part of lockd. Eventually though we'd like to be able to build v4-only servers, at which point we'll need to put all of this elsewhere. Move the code itself into fs/nfs_common and have it build a grace.ko module. Then, rejigger the Kconfig options so that both nfsd and lockd enable it automatically. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-16	Btrfs: set inode's logged_trans/last_log_commit after ranged fsync	Filipe Manana
	When a ranged fsync finishes if there are still extent maps in the modified list, still set the inode's logged_trans and last_log_commit. This is important in case an inode is fsync'ed and unlinked in the same transaction, to ensure its inode ref gets deleted from the log and the respective dentries in its parent are deleted too from the log (if the parent directory was fsync'ed in the same transaction). Instead make btrfs_inode_in_log() return false if the list of modified extent maps isn't empty. This is an incremental on top of the v4 version of the patch: "Btrfs: fix fsync data loss after a ranged fsync" which was added to its v5, but didn't make it on time. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-16	Merge tag 'gfs2-fixes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes Pull gfs2 fixes from Steven Whitehouse: "Here are a number of small fixes for GFS2. There is a fix for FIEMAP on large sparse files, a negative dentry hashing fix, a fix for flock, and a bug fix relating to d_splice_alias usage. There are also (patches 1 and 5) a couple of updates which are less critical, but small and low risk" * tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes: GFS2: fix d_splice_alias() misuses GFS2: Don't use MAXQUOTAS value GFS2: Hash the negative dentry during inode lookup GFS2: Request demote when a "try" flock fails GFS2: Change maxlen variables to size_t GFS2: fs/gfs2/super.c: replace seq_printf by seq_puts
2014-09-16	vfs: workaround gcc <4.6 build error in link_path_walk()	James Hogan
	Commit d6bb3e9075bb ("vfs: simplify and shrink stack frame of link_path_walk()") introduced build problems with GCC versions older than 4.6 due to the initialisation of a member of an anonymous union in struct qstr without enclosing braces. This hits GCC bug 10676 [1] (which was fixed in GCC 4.6 by [2]), and causes the following build error: fs/namei.c: In function 'link_path_walk': fs/namei.c:1778: error: unknown field 'hash_len' specified in initializer This is worked around by adding explicit braces. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676 [2] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=159206 Fixes: d6bb3e9075bb (vfs: simplify and shrink stack frame of link_path_walk()) Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: linux-fsdevel@vger.kernel.org Cc: linux-metag@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-16	Fix mfsymlinks file size check	Steve French
	If the mfsymlinks file size has changed (e.g. the file no longer represents an emulated symlink) we were not returning an error properly. Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Stefan Metzmacher <metze@samba.org>