lwn.git - Linux kernel documentation tree maintained by Jonathan Corbet

Age	Commit message (Collapse)	Author
2013-01-17	dm thin: fix race between simultaneous io and discards to same block	Joe Thornber
	commit e8088073c9610af017fd47fddd104a2c3afb32e8 upstream. There is a race when discard bios and non-discard bios are issued simultaneously to the same block. Discard support is expensive for all thin devices precisely because you have to be careful to quiesce the area you're discarding. DM thin must handle this conflicting IO pattern (simultaneous non-discard vs discard) even though a sane application shouldn't be issuing such IO. The race manifests as follows: 1. A non-discard bio is mapped in thin_bio_map. This doesn't lock out parallel activity to the same block. 2. A discard bio is issued to the same block as the non-discard bio. 3. The discard bio is locked in a dm_bio_prison_cell in process_discard to lock out parallel activity against the same block. 4. The non-discard bio's mapping continues and its all_io_entry is incremented so the bio is accounted for in the thin pool's all_io_ds which is a dm_deferred_set used to track time locality of non-discard IO. 5. The non-discard bio is finally locked in a dm_bio_prison_cell in process_bio. The race can result in deadlock, leaving the block layer hanging waiting for completion of a discard bio that never completes, e.g.: INFO: task ruby:15354 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ruby D ffffffff8160f0e0 0 15354 15314 0x00000000 ffff8802fb08bc58 0000000000000082 ffff8802fb08bfd8 0000000000012900 ffff8802fb08a010 0000000000012900 0000000000012900 0000000000012900 ffff8802fb08bfd8 0000000000012900 ffff8803324b9480 ffff88032c6f14c0 Call Trace: [<ffffffff814e5a19>] schedule+0x29/0x70 [<ffffffff814e3d85>] schedule_timeout+0x195/0x220 [<ffffffffa06b9bc1>] ? _dm_request+0x111/0x160 [dm_mod] [<ffffffff814e589e>] wait_for_common+0x11e/0x190 [<ffffffff8107a170>] ? try_to_wake_up+0x2b0/0x2b0 [<ffffffff814e59ed>] wait_for_completion+0x1d/0x20 [<ffffffff81233289>] blkdev_issue_discard+0x219/0x260 [<ffffffff81233e79>] blkdev_ioctl+0x6e9/0x7b0 [<ffffffff8119a65c>] block_ioctl+0x3c/0x40 [<ffffffff8117539c>] do_vfs_ioctl+0x8c/0x340 [<ffffffff8119a547>] ? block_llseek+0x67/0xb0 [<ffffffff811756f1>] sys_ioctl+0xa1/0xb0 [<ffffffff810561f6>] ? sys_rt_sigprocmask+0x86/0xd0 [<ffffffff814ef099>] system_call_fastpath+0x16/0x1b The thinp-test-suite's test_discard_random_sectors reliably hits this deadlock on fast SSD storage. The fix for this race is that the all_io_entry for a bio must be incremented whilst the dm_bio_prison_cell is held for the bio's associated virtual and physical blocks. That cell locking wasn't occurring early enough in thin_bio_map. This patch fixes this. Care is taken to always call the new function inc_all_io_entry() with the relevant cells locked, but they are generally unlocked before calling issue() to try to avoid holding the cells locked across generic_submit_request. Also, now that thin_bio_map may lock bios in a cell, process_bio() is no longer the only thread that will do so. Because of this we must be sure to use cell_defer_except() to release all non-holder entries, that were added by the other thread, because they must be deferred. This patch depends on "dm thin: replace dm_cell_release_singleton with cell_defer_except". Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-17	dm thin: replace dm_cell_release_singleton with cell_defer_except	Joe Thornber
	commit b7ca9c9273e5eebd63880dd8a6e4e5c18fc7901d upstream. Change existing users of the function dm_cell_release_singleton to share cell_defer_except instead, and then remove the now-unused function. Everywhere that calls dm_cell_release_singleton, the bio in question is the holder of the cell. If there are no non-holder entries in the cell then cell_defer_except behaves exactly like dm_cell_release_singleton. Conversely, if there are non-holder entries then dm_cell_release_singleton must not be used because those entries would need to be deferred. Consequently, it is safe to replace use of dm_cell_release_singleton with cell_defer_except. This patch is a pre-requisite for "dm thin: fix race between simultaneous io and discards to same block". Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-17	dm ioctl: prevent unsafe change to dm_ioctl data_size	Alasdair G Kergon
	commit e910d7ebecd1aac43125944a8641b6cb1a0dfabe upstream. Abort dm ioctl processing if userspace changes the data_size parameter after we validated it but before we finished copying the data buffer from userspace. The dm ioctl parameters are processed in the following sequence: 1. ctl_ioctl() calls copy_params(); 2. copy_params() makes a first copy of the fixed-sized portion of the userspace parameters into the local variable "tmp"; 3. copy_params() then validates tmp.data_size and allocates a new structure big enough to hold the complete data and copies the whole userspace buffer there; 4. ctl_ioctl() reads userspace data the second time and copies the whole buffer into the pointer "param"; 5. ctl_ioctl() reads param->data_size without any validation and stores it in the variable "input_param_size"; 6. "input_param_size" is further used as the authoritative size of the kernel buffer. The problem is that userspace code could change the contents of user memory between steps 2 and 4. In particular, the data_size parameter can be changed to an invalid value after the kernel has validated it. This lets userspace force the kernel to access invalid kernel memory. The fix is to ensure that the size has not changed at step 4. This patch shouldn't have a security impact because CAP_SYS_ADMIN is required to run this code, but it should be fixed anyway. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-17	dm persistent data: rename node to btree_node	Mikulas Patocka
	commit 550929faf89e2e2cdb3e9945ea87d383989274cf upstream. This patch fixes a compilation failure on sparc32 by renaming struct node. struct node is already defined in include/linux/node.h. On sparc32, it happens to be included through other dependencies and persistent-data doesn't compile because of conflicting declarations. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-17	dm: disable WRITE SAME	Mike Snitzer
	commit c1a94672a830e01d58c7c7e8de530c3f136d6ff2 upstream. WRITE SAME bios are not yet handled correctly by device-mapper so disable their use on device-mapper devices by setting max_write_same_sectors to zero. As an example, a ciphertext device is incompatible because the data gets changed according to the location at which it written and so the dm crypt target cannot support it. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-12-02	Merge tag 'md-3.7-fixes' of git://neil.brown.name/md	Linus Torvalds
	Pull md bugfix from NeilBrown: "Single bugfix for raid1/raid10. Fixes a recently introduced deadlock." * tag 'md-3.7-fixes' of git://neil.brown.name/md: md/raid1{,0}: fix deadlock in bitmap_unplug.
2012-11-27	md/raid1{,0}: fix deadlock in bitmap_unplug.	NeilBrown
	If the raid1 or raid10 unplug function gets called from a make_request function (which is very possible) when there are bios on the current->bio_list list, then it will not be able to successfully call bitmap_unplug() and it could need to submit more bios and wait for them to complete. But they won't complete while current->bio_list is non-empty. So detect that case and handle the unplugging off to another thread just like we already do when called from within the scheduler. RAID1 version of bug was introduced in 3.6, so that part of fix is suitable for 3.6.y. RAID10 part won't apply. Cc: stable@vger.kernel.org Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com> Reported-by: Peter Maloney <peter.maloney@brockmann-consult.de> Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-23	Merge tag 'md-3.7-fixes' of git://neil.brown.name/md	Linus Torvalds
	Pull md fixes from NeilBrown: "Several bug fixes for md in 3.7: - raid5 discard has problems - raid10 replacement devices have problems - bad block lock seqlock usage has problems - dm-raid doesn't free everything" * tag 'md-3.7-fixes' of git://neil.brown.name/md: md/raid10: decrement correct pending counter when writing to replacement. md/raid10: close race that lose writes lost when replacement completes. md/raid5: Make sure we clear R5_Discard when discard is finished. md/raid5: move resolving of reconstruct_state earlier in stripe_handle. md/raid5: round discard alignment up to power of 2. md: make sure everything is freed when dm-raid stops an array. md: Avoid write invalid address if read_seqretry returned true. md: Reassigned the parameters if read_seqretry returned true in func md_is_badblock.
2012-11-23	dm: fix deadlock with request based dm and queue request_fn recursion	Jens Axboe
	Request based dm attempts to re-run the request queue off the request completion path. If used with a driver that potentially does end_io from its request_fn, we could deadlock trying to recurse back into request dispatch. Fix this by punting the request queue run to kblockd. Tested to fix a quickly reproducible deadlock in such a scenario. Cc: stable@kernel.org Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-22	md/raid10: decrement correct pending counter when writing to replacement.	NeilBrown
	When a write to a replacement device completes, we carefully and correctly found the rdev that the write actually went to and the blithely called rdev_dec_pending on the primary rdev, even if this write was to the replacement. This means that any writes to an array while a replacement was ongoing would cause the nr_pending count for the primary device to go negative, so it could never be removed. This bug has been present since replacement was introduced in 3.3, so it is suitable for any -stable kernel since then. Reported-by: "George Spelvin" <linux@horizon.com> Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-22	md/raid10: close race that lose writes lost when replacement completes.	NeilBrown
	When a replacement operation completes there is a small window when the original device is marked 'faulty' and the replacement still looks like a replacement. The faulty should be removed and the replacement moved in place very quickly, bit it isn't instant. So the code write out to the array must handle the possibility that the only working device for some slot in the replacement - but it doesn't. If the primary device is faulty it just gives up. This can lead to corruption. So make the code more robust: if either the primary or the replacement is present and working, write to them. Only when neither are present do we give up. This bug has been present since replacement was introduced in 3.3, so it is suitable for any -stable kernel since then. Reported-by: "George Spelvin" <linux@horizon.com> Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-22	md/raid5: Make sure we clear R5_Discard when discard is finished.	NeilBrown
	commit 9e44476851e91c86c98eb92b9bc27fb801f89072 MD: raid5 avoid unnecessary zero page for trim change raid5 to clear R5_Discard when the complete request is handled rather than when submitting the per-device discard request. However it did not clear R5_Discard for the parity device. This means that if the stripe_head was reused before it expired from the cache, the setting would be wrong and a hang would result. Also if the R5_Uptodate bit happens to be set, R5_Discard again won't be cleared. But R5_Uptodate really should be clear at this point. So make sure R5_Discard is cleared in all cases, and clear R5_Uptodate when a 'discard' completes. Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-22	md/raid5: move resolving of reconstruct_state earlier in	NeilBrown
	stripe_handle. The chunk of code in stripe_handle which responds to a *_result value in reconstruct_state is really the completion of some processing that happened outside of handle_stripe (possibly asynchronously) and so should be one of the first things done in handle_stripe(). After the next patch it will be important that it happens before handle_stripe_clean_event(), as that will clear some dev->flags bit that this code tests. Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-20	md/raid5: round discard alignment up to power of 2.	NeilBrown
	blkdev_issue_discard currently assumes that the granularity is a power of 2. So in raid5, round the chosen number up to avoid embarrassment. Cc: Shaohua Li <shli@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-20	md: make sure everything is freed when dm-raid stops an array.	NeilBrown
	md_stop() would stop an array, but not free various attached data structures. For internal arrays, these are freed later in do_md_stop() or mddev_put(), but they don't apply for dm-raid arrays. So get md_stop() to free them, and only all it from dm-raid. For internal arrays we now call __md_stop. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-20	md: Avoid write invalid address if read_seqretry returned true.	majianpeng
	If read_seqretry returned true and bbp was changed, it will write invalid address which can cause some serious problem. This bug was introduced by commit v3.0-rc7-130-g2699b67. So fix is suitable for 3.0.y thru 3.6.y. Reported-by: zhuwenfeng@kedacom.com Tested-by: zhuwenfeng@kedacom.com Cc: stable@vger.kernel.org Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-20	md: Reassigned the parameters if read_seqretry returned true in func ↵	majianpeng
	md_is_badblock. This bug was introduced by commit(v3.0-rc7-126-g2230dfe). So fix is suitable for 3.0.y thru 3.6.y. Cc: stable@vger.kernel.org Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-31	MD RAID10: Fix oops when creating RAID10 arrays via dm-raid.c	Jonathan Brassow
	Commit 2863b9eb didn't take into account the changes to add TRIM support to RAID10 (commit 532a2a3fb). That is, when using dm-raid.c to create the RAID10 arrays, there is no mddev->gendisk or mddev->queue. The code added to support TRIM simply assumes that mddev->queue is available without checking. The result is an oops any time dm-raid.c attempts to create a RAID10 device. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-31	md/raid1: Fix assembling of arrays containing Replacements.	NeilBrown
	setup_conf in raid1.c uses conf->raid_disks before assigning a value. It is used when including 'Replacement' devices. The consequence is that assembling an array which contains a replacement will misbehave and either not include the replacement, or not include the device being replaced. Though this doesn't lead directly to data corruption, it could lead to reduced data safety. So use mddev->raid_disks, which is initialised, instead. Bug was introduced by commit c19d57980b38a5bb613a898937a1cf85f422fb9b md/raid1: recognise replacements when assembling arrays. in 3.3, so fix is suitable for 3.3.y thru 3.6.y. Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-22	md faulty: use disk_stack_limits()	Eric Sandeen
	in: fe86cdce block: do not artificially constrain max_sectors for stacking drivers max_sectors defaults to UINT_MAX. md faulty wasn't using disk_stack_limits(), so inherited this large value as well. This triggered a bug in XFS when stressed over md_faulty, when a very large bio_alloc() failed. That was on an older kernel, and I can't reproduce exactly the same thing upstream, but I think the fix is appropriate in any case. Thanks to Mike Snitzer for pointing out the problem. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-13	Merge tag 'md-3.7' of git://neil.brown.name/md	Linus Torvalds
	Pull md updates from NeilBrown: - "discard" support, some dm-raid improvements and other assorted bits and pieces. * tag 'md-3.7' of git://neil.brown.name/md: (29 commits) md: refine reporting of resync/reshape delays. md/raid5: be careful not to resize_stripes too big. md: make sure manual changes to recovery checkpoint are saved. md/raid10: use correct limit variable md: writing to sync_action should clear the read-auto state. Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races md/raid5: make sure to_read and to_write never go negative. md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write. md/raid5: protect debug message against NULL derefernce. md/raid5: add some missing locking in handle_failed_stripe. MD: raid5 avoid unnecessary zero page for trim MD: raid5 trim support md/bitmap:Don't use IS_ERR to judge alloc_page(). md/raid1: Don't release reference to device while handling read error. raid: replace list_for_each_continue_rcu with new interface add further __init annotations to crypto/xor.c DM RAID: Fix for "sync" directive ineffectiveness DM RAID: Fix comparison of index and quantity for "rebuild" parameter DM RAID: Add rebuild capability for RAID10 DM RAID: Move 'rebuild' checking code to its own function ...
2012-10-12	dm: store dm_target_io in bio front_pad	Mikulas Patocka
	Use the recently-added bio front_pad field to allocate struct dm_target_io. Prior to this patch, dm_target_io was allocated from a mempool. For each dm_target_io, there is exactly one bio allocated from a bioset. This patch merges these two allocations into one allocation: we create a bioset with front_pad equal to the size of dm_target_io so that every bio allocated from the bioset has sizeof(struct dm_target_io) bytes before it. We allocate a bio and use the bytes before the bio as dm_target_io. _tio_cache is removed and the tio_pool mempool is now only used for request-based devices. This idea was introduced by Kent Overstreet. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: tj@kernel.org Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Bill Pemberton <wfp5p@viridian.itc.virginia.edu> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12	dm thin: move bio_prison code to separate module	Mike Snitzer
	The bio prison code will be useful to other future DM targets so move it to a separate module. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12	dm thin: prepare to separate bio_prison code	Mike Snitzer
	The bio prison code will be useful to share with future DM targets. Prepare to move this code into a separate module, adding a dm prefix to structures and functions that will be exported. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12	dm thin: support discard with non power of two block size	Mike Snitzer
	Support discards when the pool's block size is not a power of 2. The block layer assumes discard_granularity is a power of 2 (in blkdev_issue_discard), so we set this to the largest power of 2 that is a divides into the number of sectors in each block, but never less than DATA_DEV_BLOCK_SIZE_MIN_SECTORS. This patch eliminates the "Discard support must be disabled when the block size is not a power of 2" constraint that was imposed in commit 55f2b8b ("dm thin: support for non power of 2 pool blocksize"). That commit was incomplete: using a block size that is not a power of 2 shouldn't mean disabling discard support on the device completely. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12	dm persistent data: convert to use le32_add_cpu	Wei Yongjun
	Convert cpu_to_le32(le32_to_cpu(E1) + E2) to use le32_add_cpu(). dpatch engine is used to auto generate this patch. (https://github.com/weiyj/dpatch) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12	dm: use ACCESS_ONCE for sysfs values	Mikulas Patocka
	Use the ACCESS_ONCE macro in dm-bufio and dm-verity where a variable can be modified asynchronously (through sysfs) and we want to prevent compiler optimizations that assume that the variable hasn't changed. (See Documentation/atomic_ops.txt.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12	dm bufio: use list_move	Wei Yongjun
	Use list_move() instead of list_del() + list_add(). spatch with a semantic match was used to find this. (http://coccinelle.lip6.fr/) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12	dm mpath: fix check for null mpio in end_io fn	Wei Yongjun
	The mpio dereference should be moved below the BUG_ON NULL test in multipath_end_io(). spatch with a semantic match was used to found this. (http://coccinelle.lip6.fr/) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-11	md: refine reporting of resync/reshape delays.	NeilBrown
	If 'resync_max' is set to 0 (as is often done when starting a reshape, so the mdadm can remain in control during a sensitive period), and if the reshape request is initially delayed because another array using the same array is resyncing or reshaping etc, when user-space cannot easily tell when the delay changes from being due to a conflicting reshape, to being due to resync_max = 0. So introduce a new state: (curr_resync == 3) to reflect this, make sure it is visible both via /proc/mdstat and via the "sync_completed" sysfs attribute, and ensure that the event transition from one delay state to the other is properly notified. Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	md/raid5: be careful not to resize_stripes too big.	NeilBrown
	When a RAID5 is reshaping, conf->raid_disks is increased before mddev->delta_disks becomes zero. This can result in check_reshape calling resize_stripes with a number that is too large. This particularly happens when md_check_recovery calls ->check_reshape(). If we use ->previous_raid_disks, we don't risk this. Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	md: make sure manual changes to recovery checkpoint are saved.	NeilBrown
	If you make an array bigger but suppress resync of the new region with mdadm --grow /dev/mdX --size=max --assume-clean then stop the array before anything is written to it, the effect of the "--assume-clean" is lost and the array will resync the new space when restarted. So ensure that we update the metadata in the case. Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	md/raid10: use correct limit variable	Dan Carpenter
	Clang complains that we are assigning a variable to itself. This should be using bad_sectors like the similar earlier check does. Bug has been present since 3.1-rc1. It is minor but could conceivably cause corruption or other bad behaviour. Cc: stable@vger.kernel.org Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	md: writing to sync_action should clear the read-auto state.	NeilBrown
	In some cases array are started in 'read-auto' state where in nothing gets written to any device until the array is written to. The purpose of this is to make accidental auto-assembly of the wrong arrays less of a risk, and to allow arrays to be started to read suspend-to-disk images without actually changing anything (as might happen if the array were dirty and a resync seemed necessary). Explicitly writing the 'sync_action' for a read-auto array currently doesn't clear the read-auto state, so the sync action doesn't happen, which can be confusing. So allow any successful write to sync_action to clear any read-auto state. Reported-by: Alexander Kühn <alexander.kuehn@nagilum.de> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races	Jianpeng Ma
	Now that multiple threads can handle stripes, it is safer to use an atomic64_t for resync_mismatches, to avoid update races. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	md/raid5: make sure to_read and to_write never go negative.	NeilBrown
	to_read and to_write are part of the result of analysing a stripe before handling it. Their use is to avoid some loops and tests if the values are known to be zero. Thus it is not a problem if they are a little bit larger than they should be. So decrementing them in handle_failed_stripe serves little value, and due to races it could cause some loops to be skipped incorrectly. So remove those decrements. Reported-by: "Jianpeng Ma" <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.	Alexander Lyakas
	Signed-off-by: Alex Lyakas <alex@zadarastorage.com> Suggested-by: Yair Hershko <yair@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	md/raid5: protect debug message against NULL derefernce.	NeilBrown
	The pr_debug in add_stripe_bio could race with something changing *bip, so it is best to hold the lock until after the pr_debug. Reported-by: "Jianpeng Ma" <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	md/raid5: add some missing locking in handle_failed_stripe.	NeilBrown
	We really should hold the stripe_lock while accessing 'toread' else we could race with add_stripe_bio and corrupt a list. Reported-by: "Jianpeng Ma" <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	MD: raid5 avoid unnecessary zero page for trim	Shaohua Li
	We want to avoid zero discarded dev page, because it's useless for discard. But if we don't zero it, another read/write hit such page in the cache and will get inconsistent data. To avoid zero the page, we don't set R5_UPTODATE flag after construction is done. In this way, discard write request is still issued and finished, but read will not hit the page. If the stripe gets accessed soon, we need reread the stripe, but since the chance is low, the reread isn't a big deal. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	MD: raid5 trim support	Shaohua Li
	Discard for raid4/5/6 has limitation. If discard request size is small, we do discard for one disk, but we need calculate parity and write parity disk. To correctly calculate parity, zero_after_discard must be guaranteed. Even it's true, we need do discard for one disk but write another disks, which makes the parity disks wear out fast. This doesn't make sense. So an efficient discard for raid4/5/6 should discard all data disks and parity disks, which requires the write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size is smaller than chunk_size, such pattern is almost impossible in practice. So in this patch, I only handle the case that A's size equals to chunk_size. That is discard request should be aligned to stripe size and its size is multiple of stripe size. Since we can only handle request with specific alignment and size (or part of the request fitting stripes), we can't guarantee zero_after_discard even zero_after_discard is true in low level drives. The block layer doesn't send down correctly aligned requests even correct discard alignment is set, so I must filter out. For raid4/5/6 parity calculation, if data is 0, parity is 0. So if zero_after_discard is true for all disks, data is consistent after discard. Otherwise, data might be lost. Let's consider a scenario: discard a stripe, write data to one disk and write parity disk. The stripe could be still inconsistent till then depending on using data from other data disks or parity disks to calculate new parity. If the disk is broken, we can't restore it. So in this patch, we only enable discard support if all disks have zero_after_discard. If discard fails in one disk, we face the similar inconsistent issue above. The patch will make discard follow the same path as normal write request. If discard fails, a resync will be scheduled to make the data consistent. This isn't good to have extra writes, but data consistency is important. If a subsequent read/write request hits raid5 cache of a discarded stripe, the discarded dev page should have zero filled, so the data is consistent. This patch will always zero dev page for discarded request stripe. This isn't optimal because discard request doesn't need such payload. Next patch will avoid it. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	md/bitmap:Don't use IS_ERR to judge alloc_page().	Jianpeng Ma
	Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	md/raid1: Don't release reference to device while handling read error.	NeilBrown
	When we get a read error, we arrange for raid1d to handle it. Currently we release the reference on the device. This can result in conf->mirrors[read_disk].rdev being NULL in fix_read_error, if the device happens to get removed before the read error is handled. So instead keep the reference until the read error has been fully handled. Reported-by: hank <pyu@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	raid: replace list_for_each_continue_rcu with new interface	Michael Wang
	This patch replaces list_for_each_continue_rcu() with list_for_each_entry_continue_rcu() to save a few lines of code and allow removing list_for_each_continue_rcu(). Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	DM RAID: Fix for "sync" directive ineffectiveness	Jonathan Brassow
	There are two table arguments that can be given to a DM RAID target that control whether the array is forced to (re)synchronize or skip initialization: "sync" and "nosync". When "sync" is given, we set mddev->recovery_cp to 0 in order to cause the device to resynchronize. This is insufficient if there is a bitmap in use, because the array will simply look at the bitmap and see that there is no recovery necessary. The fix is to skip over the loading of the superblocks when "sync" is given, causing new superblocks to be written that will force the array to go through initialization (i.e. synchronization). Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	DM RAID: Fix comparison of index and quantity for "rebuild" parameter	Jonathan Brassow
	DM RAID: Fix comparison of index and quantity for "rebuild" parameter The "rebuild" parameter takes an index argument that starts counting from zero. The conditional used to validate the index was using '>' rather than '>=', leaving the door open for an index value that would be 1 too large. Reported-by: Neil Brown <neilb@suse.de> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	DM RAID: Add rebuild capability for RAID10	Jonathan Brassow
	DM RAID: Add code to validate replacement slots for RAID10 arrays RAID10 can handle 'copies - 1' failures for each mirror group. This code ensures the user has provided a valid array - one whose devices specified for rebuild do not exceed the amount of redundancy available. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	DM RAID: Move 'rebuild' checking code to its own function	Jonathan Brassow
	DM RAID: Move chunk of code to it's own function The code that checks whether device replacements/rebuilds are possible given a specific RAID type is moved to it's own function. It will further expand when the code to check RAID10 is added. A separate function makes it easier to read. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	MD RAID10: Prep for DM RAID10 device replacement capability	Jonathan Brassow
	MD RAID10: Fix a couple potential kernel panics if RAID10 is used by dm-raid When device-mapper uses the RAID10 personality through dm-raid.c, there is no 'gendisk' structure in mddev and some sysfs information is also not populated. This patch avoids touching those non-existent structures. Signed-off-by: Jonathan Brassow <jbrassow@rehdat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11	md: avoid taking the mutex on some ioctls.	NeilBrown
	Some ioctls don't need to take the mutex and doing so can cause a delay as it is held during super-block update. So move those ioctls out of the mutex and rely on rcu locking to ensure we don't access stale data. Signed-off-by: NeilBrown <neilb@suse.de>