summaryrefslogtreecommitdiff
path: root/drivers/md/dm-mpath.c
AgeCommit message (Collapse)Author
2015-04-15dm: optimize dm_mq_queue_rq to _not_ use kthread if using pure blk-mqMike Snitzer
dm_mq_queue_rq() is in atomic context so care must be taken to not sleep -- as such GFP_ATOMIC is used for the md->bs bioset allocations and dm-mpath's call to blk_get_request(). In the future the bioset allocations will hopefully go away (by removing support for partial completions of bios in a cloned request). Also prepare for supporting DM blk-mq ontop of old-style request_fn device(s) if a new dm-mod 'use_blk_mq' parameter is set. The kthread will still be used to queue work if blk-mq is used ontop of old-style request_fn device(s). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-15dm: add full blk-mq support to request-based DMMike Snitzer
Commit e5863d9ad ("dm: allocate requests in target when stacking on blk-mq devices") served as the first step toward fully utilizing blk-mq in request-based DM -- it enabled stacking an old-style (request_fn) request_queue ontop of the underlying blk-mq device(s). That first step didn't improve performance of DM multipath ontop of fast blk-mq devices (e.g. NVMe) because the top-level old-style request_queue was severely limited by the queue_lock. The second step offered here enables stacking a blk-mq request_queue ontop of the underlying blk-mq device(s). This unlocks significant performance gains on fast blk-mq devices, Keith Busch tested on his NVMe testbed and offered this really positive news: "Just providing a performance update. All my fio tests are getting roughly equal performance whether accessed through the raw block device or the multipath device mapper (~470k IOPS). I could only push ~20% of the raw iops through dm before this conversion, so this latest tree is looking really solid from a performance standpoint." Signed-off-by: Mike Snitzer <snitzer@redhat.com> Tested-by: Keith Busch <keith.busch@intel.com>
2015-03-31dm: remove unnecessary wrapper around blk_lld_busyMike Snitzer
There is no need for DM to export a wrapper around the already exported blk_lld_busy(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09dm mpath: simplify failure path of dm_multipath_init()Johannes Thumshirn
Currently the cleanup of all error cases are open-coded. Introduce a common exit path and labels. Signed-off-by: Johannes Thumshirn <morbidrsa@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09dm: allocate requests in target when stacking on blk-mq devicesMike Snitzer
For blk-mq request-based DM the responsibility of allocating a cloned request is transfered from DM core to the target type. Doing so enables the cloned request to be allocated from the appropriate blk-mq request_queue's pool (only the DM target, e.g. multipath, can know which block device to send a given cloned request to). Care was taken to preserve compatibility with old-style block request completion that requires request-based DM _not_ acquire the clone request's queue lock in the completion path. As such, there are now 2 different request-based DM target_type interfaces: 1) the original .map_rq() interface will continue to be used for non-blk-mq devices -- the preallocated clone request is passed in from DM core. 2) a new .clone_and_map_rq() and .release_clone_rq() will be used for blk-mq devices -- blk_get_request() and blk_put_request() are used respectively from these hooks. dm_table_set_type() was updated to detect if the request-based target is being stacked on blk-mq devices, if so DM_TYPE_MQ_REQUEST_BASED is set. DM core disallows switching the DM table's type after it is set. This means that there is no mixing of non-blk-mq and blk-mq devices within the same request-based DM table. [This patch was started by Keith and later heavily modified by Mike] Tested-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-09dm: submit stacked requests in irq enabled contextKeith Busch
Switch to having request-based DM enqueue all prep'ed requests into work processed by another thread. This allows request-based DM to invoke block APIs that assume interrupt enabled context (e.g. blk_get_request) and is a prerequisite for adding blk-mq support to request-based DM. The new kernel thread is only initialized for request-based DM devices. multipath_map() is now always in irq enabled context so change multipath spinlock (m->lock) locking to always disable interrupts. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-10-05dm mpath: stop queueing IO when no valid paths existBenjamin Marzinski
'queue_io' is set so that IO is queued while paths are being initialized. Clear queue_io in __choose_pgpath if there are no valid paths, since there are obviously no paths that can be initialized. Otherwise IOs to the device will back up. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-08-01dm mpath: eliminate pg_ready() wrapperMike Snitzer
pg_ready() is not comprehensive in its logic and only serves to obfuscate code. Replace pg_ready() with the appropriate logic in multipath_map(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-07-10dm mpath: fix IO hang due to logic bug in multipath_busyJun'ichi Nomura
Commit e80991773 ("dm mpath: push back requests instead of queueing") modified multipath_busy() to return true if !pg_ready(). pg_ready() checks the current state of the multipath device and may return false even if a new IO is needed to change the state. Bart Van Assche reported that he had multipath IO lockup when he was performing cable pull tests. Analysis showed that the multipath device had a single path group with both paths active, but that the path group itself was not active. During the multipath device state transitions 'queue_io' got set but nothing could clear it. Clearing 'queue_io' only happens in __choose_pgpath(), but it won't be called if multipath_busy() returns true due to pg_ready() returning false when 'queue_io' is set. As such the !pg_ready() check in multipath_busy() is wrong because new IO will not be sent to multipath target and the multipath state change won't happen. That results in multipath IO lockup. The intent of multipath_busy() is to avoid unnecessary cycles of dequeue + request_fn + requeue if it is known that the multipath device will requeue. Such "busy" situations would be: - path group is being activated - there is no path and the multipath is setup to requeue if no path Fix multipath_busy() to return "busy" early only for these specific situations. Reported-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v3.15
2014-06-04dm: disable WRITE SAME if it failsMike Snitzer
Add DM core support for disabling WRITE SAME on first failure to both request-based and bio-based targets. The need to disable WRITE SAME stems from SCSI enabling it by default but then disabling it when it fails. When SCSI does this it returns "permanent target failure, do not retry" using -EREMOTEIO. Update DM core to only disable WRITE SAME on failure if the returned error is -EREMOTEIO. Commit f84cb8a4 ("dm mpath: disable WRITE SAME if it fails") implemented multipath specific disabling of WRITE SAME if it fails. However, as that commit detailed, the multipath-only solution doesn't go far enough if bio-based DM targets are stacked ontop of the request-based dm-multipath target (as is commonly done using dm-linear to support partitions on multipath devices, via kpartx). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Tested-by: Alex Chen <alex.chen@huawei.com>
2014-05-27dm mpath: really fix lockdep warningHannes Reinecke
lockdep complains about a circular locking. And indeed, we need to release the lock before calling dm_table_run_md_queue_async(). As such, commit 4cdd2ad ("dm mpath: fix lock order inconsistency in multipath_ioctl") must also be reverted in addition to fixing the lock order in the other dm_table_run_md_queue_async() callers. Reported-by: Bart van Assche <bvanassche@acm.org> Tested-by: Bart van Assche <bvanassche@acm.org> Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-05-14dm mpath: fix lock order inconsistency in multipath_ioctlMike Snitzer
Commit 3e9f1be1b40 ("dm mpath: remove process_queued_ios()") did not consistently take the multipath device's spinlock (m->lock) before calling dm_table_run_md_queue_async() -- which takes the q->queue_lock. Found with code inspection using hint from reported lockdep warning. Reported-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27dm mpath: print more useful warnings in multipath_message()Jose Castillo
The warning message "Unrecognised multipath message received" is displayed in two different situations in multipath_message(): when the number of arguments passed is invalid and when the string passed in argv[0] is not recognized. Make it easier to identify where the problem is by making these warnings more specific with additional context for each case. Signed-off-by: Jose Castillo <jcastillo@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27dm-mpath: do not activate failed pathsHannes Reinecke
activate_path() is run without a lock, so the path might be set to failed before activate_path() had a chance to run. This patch add a check for ->active in activate_path() to avoid unnecessary overhead by calling functions which are known to be failing. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27dm mpath: remove extra nesting in map functionMike Snitzer
Return early for case when no path exists, and when the pathgroup isn't ready. This eliminates the need for extra nesting for the the common case. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Hannes Reinecke <hare@suse.de>
2014-03-27dm mpath: remove map_io()Hannes Reinecke
multipath_map() is now just a wrapper around map_io(), so we can rename map_io() to multipath_map(). Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
2014-03-27dm mpath: reduce memory pressure when requeuingHannes Reinecke
When multipath needs to requeue I/O in the block layer the per-request context shouldn't be allocated, as it will be freed immediately afterwards anyway. Avoiding this memory allocation will reduce memory pressure during requeuing. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
2014-03-27dm mpath: remove process_queued_ios()Hannes Reinecke
process_queued_ios() has served 3 functions: 1) select pg and pgpath if none is selected 2) start pg_init if requested 3) dispatch queued IOs when pg is ready Basically, a call to queue_work(process_queued_ios) can be replaced by dm_table_run_md_queue_async(), which runs request queue and ends up calling map_io(), which does 1), 2) and 3). Exception is when !pg_ready() (which means either pg_init is running or requested), then multipath_busy() prevents map_io() being called from request_fn. If pg_init is running, it should be ok as long as pg_init_done() does the right thing when pg_init is completed, I.e.: restart pg_init if !pg_ready() or call dm_table_run_md_queue_async() to kick map_io(). If pg_init is requested, we have to make sure the request is detected and pg_init will be started. pg_init is requested in 3 places: a) __choose_pgpath() in map_io() b) __choose_pgpath() in multipath_ioctl() c) pg_init retry in pg_init_done() a) is ok because map_io() calls __pg_init_all_paths(), which does 2). b) needs a call to __pg_init_all_paths(), which does 2). c) needs a call to __pg_init_all_paths(), which does 2). So this patch removes process_queued_ios() and ensures that __pg_init_all_paths() is called at the appropriate locations. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
2014-03-27dm mpath: push back requests instead of queueingHannes Reinecke
There is no reason why multipath needs to queue requests internally for queue_if_no_path or pg_init; we should rather push them back onto the request queue. And while we're at it we can simplify the conditional statement in map_io() to make it easier to read. Since mpath no longer does internal queuing of I/O the table info no longer emits the internal queue_size. Instead it displays 1 if queuing is being used or 0 if it is not. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
2014-03-27dm mpath: do not call pg_init when it is already runningHannes Reinecke
This patch moves condition checks as a preparation of following patches and has no effect on behaviour. process_queued_ios() is the only caller of __pg_init_all_paths() and 2 condition checks are moved from outside to inside without side effects. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
2014-02-26dm mpath: fix stalls when handling invalid ioctlsHannes Reinecke
An invalid ioctl will never be valid, irrespective of whether multipath has active paths or not. So for invalid ioctls we do not have to wait for multipath to activate any paths, but can rather return an error code immediately. This fix resolves numerous instances of: udevd[]: worker [] unexpectedly returned with status 0x0100 that have been seen during testing. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2013-11-05dm mpath: requeue I/O during pg_initHannes Reinecke
When pg_init is running no I/O can be submitted to the underlying devices, as the path priority etc might change. When using queue_io for this, requests will be piling up within multipath as the block I/O scheduler just sees a _very fast_ device. All of this queued I/O has to be resubmitted from within multipathing once pg_init is done. This approach has the problem that it's virtually impossible to abort I/O when pg_init is running, and we're adding heavy load to the devices after pg_init since all of the queued I/O needs to be resubmitted _before_ any requests can be pulled off of the request queue and normal operation continues. This patch will requeue the I/O that triggers the pg_init call, and return 'busy' when pg_init is in progress. With these changes the block I/O scheduler will stop submitting I/O during pg_init, resulting in a quicker path switch and less I/O pressure (and memory consumption) after pg_init. Signed-off-by: Hannes Reinecke <hare@suse.de> [patch header edited for clarity and typos by Mike Snitzer] Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-10-31dm mpath: fix race condition between multipath_dtr and pg_init_doneShiva Krishna Merla
Whenever multipath_dtr() is happening we must prevent queueing any further path activation work. Implement this by adding a new 'pg_init_disabled' flag to the multipath structure that denotes future path activation work should be skipped if it is set. By disabling pg_init and then re-enabling in flush_multipath_work() we also avoid the potential for pg_init to be initiated while suspending an mpath device. Without this patch a race condition exists that may result in a kernel panic: 1) If after pg_init_done() decrements pg_init_in_progress to 0, a call to wait_for_pg_init_completion() assumes there are no more pending path management commands. 2) If pg_init_required is set by pg_init_done(), due to retryable mode_select errors, then process_queued_ios() will again queue the path activation work. 3) If free_multipath() completes before activate_path() work is called a NULL pointer dereference like the following can be seen when accessing members of the recently destructed multipath: BUG: unable to handle kernel NULL pointer dereference at 0000000000000090 RIP: 0010:[<ffffffffa003db1b>] [<ffffffffa003db1b>] activate_path+0x1b/0x30 [dm_multipath] [<ffffffff81090ac0>] worker_thread+0x170/0x2a0 [<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40 [switch to disabling pg_init in flush_multipath_work & header edits by Mike Snitzer] Signed-off-by: Shiva Krishna Merla <shivakrishna.merla@netapp.com> Reviewed-by: Krishnasamy Somasundaram <somasundaram.krishnasamy@netapp.com> Tested-by: Speagle Andy <Andy.Speagle@netapp.com> Acked-by: Junichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2013-09-23dm: add reserved_rq_based_ios module parameterMike Snitzer
Allow user to change the number of IOs that are reserved by request-based DM's mempools by writing to this file: /sys/module/dm_mod/parameters/reserved_rq_based_ios The default value is RESERVED_REQUEST_BASED_IOS (256). The maximum allowed value is RESERVED_MAX_IOS (1024). Export dm_get_reserved_rq_based_ios() for use by DM targets and core code. Switch to sizing dm-mpath's mempool using DM core's configurable 'reserved_rq_based_ios'. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Frank Mayhar <fmayhar@google.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com>
2013-09-20dm mpath: disable WRITE SAME if it failsMike Snitzer
Workaround the SCSI layer's problematic WRITE SAME heuristics by disabling WRITE SAME in the DM multipath device's queue_limits if an underlying device disabled it. The WRITE SAME heuristics, with both the original commit 5db44863b6eb ("[SCSI] sd: Implement support for WRITE SAME") and the updated commit 66c28f971 ("[SCSI] sd: Update WRITE SAME heuristics"), default to enabling WRITE SAME(10) even without successfully determining it is supported. After the first failed WRITE SAME the SCSI layer will disable WRITE SAME for the device (by setting sdkp->device->no_write_same which results in 'max_write_same_sectors' in device's queue_limits to be set to 0). When a device is stacked ontop of such a SCSI device any changes to that SCSI device's queue_limits do not automatically propagate up the stack. As such, a DM multipath device will not have its WRITE SAME support disabled. This causes the block layer to continue to issue WRITE SAME requests to the mpath device which causes paths to fail and (if mpath IO isn't configured to queue when no paths are available) it will result in actual IO errors to the upper layers. This fix doesn't help configurations that have additional devices stacked ontop of the mpath device (e.g. LVM created linear DM devices ontop). A proper fix that restacks all the queue_limits from the bottom of the device stack up will need to be explored if SCSI will continue to use this model of optimistically allowing op codes and then disabling them after they fail for the first time. Before this patch: EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null) device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121) device-mapper: multipath: XXX snitm debugging: failing WRITE SAME IO with error=-121 end_request: critical target error, dev dm-6, sector 528 dm-6: WRITE SAME failed. Manually zeroing. device-mapper: multipath: Failing path 8:112. end_request: I/O error, dev dm-6, sector 4616 dm-6: WRITE SAME failed. Manually zeroing. end_request: I/O error, dev dm-6, sector 4616 end_request: I/O error, dev dm-6, sector 5640 end_request: I/O error, dev dm-6, sector 6664 end_request: I/O error, dev dm-6, sector 7688 end_request: I/O error, dev dm-6, sector 524288 Buffer I/O error on device dm-6, logical block 65536 lost page write due to I/O error on dm-6 JBD2: Error -5 detected when updating journal superblock for dm-6-8. end_request: I/O error, dev dm-6, sector 524296 Aborting journal on device dm-6-8. end_request: I/O error, dev dm-6, sector 524288 Buffer I/O error on device dm-6, logical block 65536 lost page write due to I/O error on dm-6 JBD2: Error -5 detected when updating journal superblock for dm-6-8. # cat /sys/block/sdh/queue/write_same_max_bytes 0 # cat /sys/block/dm-6/queue/write_same_max_bytes 33553920 After this patch: EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null) device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121) device-mapper: multipath: XXX snitm debugging: WRITE SAME I/O failed with error=-121 end_request: critical target error, dev dm-6, sector 528 dm-6: WRITE SAME failed. Manually zeroing. # cat /sys/block/sdh/queue/write_same_max_bytes 0 # cat /sys/block/dm-6/queue/write_same_max_bytes 0 It should be noted that WRITE SAME support wasn't enabled in DM multipath until v3.10. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Hannes Reinecke <hare@suse.de> Cc: stable@vger.kernel.org # 3.10+
2013-09-18dm mpath: do not fail path on -ENOSPCJun'ichi Nomura
Since ENOSPC is a target-side error, dm-mpath should just pass the error information to upper layer instead of retrying itself with path failover. Otherwise it will end up failing all paths down while path checkers find all paths ok. ENOSPC can now be returned from SCSI device after commit a9d6ceb8 ("[SCSI] return ENOSPC on thin provisioning failure"). Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2013-08-23[SCSI] Return ENODATA on medium errorHannes Reinecke
When a medium error is detected the SCSI stack should return ENODATA to the upper layers. [jejb: fix whitespace error] Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2013-07-10dm mpath: fix ioctl deadlock when no pathsHannes Reinecke
When multipath needs to retry an ioctl the reference to the current live table needs to be dropped. Otherwise a deadlock occurs when all paths are down: - dm_blk_ioctl takes a reference to the current table and spins in multipath_ioctl(). - A new table is being loaded, but upon resume the process hangs in dm_table_destroy() waiting for references to drop to zero. With this patch the reference to the old table is dropped prior to retry, thereby avoiding the deadlock. Signed-off-by: Hannes Reinecke <hare@suse.de> Cc: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10dm mpath: enable WRITE SAME supportMike Snitzer
Enable WRITE SAME support in dm multipath. As far as multipath is concerned it is just another write request. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Tested-by: Bharata B Rao <bharata.rao@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm: rename request variables to biosAlasdair G Kergon
Use 'bio' in the name of variables and functions that deal with bios rather than 'request' to avoid confusion with the normal block layer use of 'request'. No functional changes. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-03-01dm: fix truncated status stringsMikulas Patocka
Avoid returning a truncated table or status string instead of setting the DM_BUFFER_FULL_FLAG when the last target of a table fills the buffer. When processing a table or status request, the function retrieve_status calls ti->type->status. If ti->type->status returns non-zero, retrieve_status assumes that the buffer overflowed and sets DM_BUFFER_FULL_FLAG. However, targets don't return non-zero values from their status method on overflow. Most targets returns always zero. If a buffer overflow happens in a target that is not the last in the table, it gets noticed during the next iteration of the loop in retrieve_status; but if a buffer overflow happens in the last target, it goes unnoticed and erroneously truncated data is returned. In the current code, the targets behave in the following way: * dm-crypt returns -ENOMEM if there is not enough space to store the key, but it returns 0 on all other overflows. * dm-thin returns errors from the status method if a disk error happened. This is incorrect because retrieve_status doesn't check the error code, it assumes that all non-zero values mean buffer overflow. * all the other targets always return 0. This patch changes the ti->type->status function to return void (because most targets don't use the return code). Overflow is detected in retrieve_status: if the status method fills up the remaining space completely, it is assumed that buffer overflow happened. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12dm mpath: fix check for null mpio in end_io fnWei Yongjun
The mpio dereference should be moved below the BUG_ON NULL test in multipath_end_io(). spatch with a semantic match was used to found this. (http://coccinelle.lip6.fr/) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-02Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue changes from Tejun Heo: "This is workqueue updates for v3.7-rc1. A lot of activities this round including considerable API and behavior cleanups. * delayed_work combines a timer and a work item. The handling of the timer part has always been a bit clunky leading to confusing cancelation API with weird corner-case behaviors. delayed_work is updated to use new IRQ safe timer and cancelation now works as expected. * Another deficiency of delayed_work was lack of the counterpart of mod_timer() which led to cancel+queue combinations or open-coded timer+work usages. mod_delayed_work[_on]() are added. These two delayed_work changes make delayed_work provide interface and behave like timer which is executed with process context. * A work item could be executed concurrently on multiple CPUs, which is rather unintuitive and made flush_work() behavior confusing and half-broken under certain circumstances. This problem doesn't exist for non-reentrant workqueues. While non-reentrancy check isn't free, the overhead is incurred only when a work item bounces across different CPUs and even in simulated pathological scenario the overhead isn't too high. All workqueues are made non-reentrant. This removes the distinction between flush_[delayed_]work() and flush_[delayed_]_work_sync(). The former is now as strong as the latter and the specified work item is guaranteed to have finished execution of any previous queueing on return. * In addition to the various bug fixes, Lai redid and simplified CPU hotplug handling significantly. * Joonsoo introduced system_highpri_wq and used it during CPU hotplug. There are two merge commits - one to pull in IRQ safe timer from tip/timers/core and the other to pull in CPU hotplug fixes from wq/for-3.6-fixes as Lai's hotplug restructuring depended on them." Fixed a number of trivial conflicts, but the more interesting conflicts were silent ones where the deprecated interfaces had been used by new code in the merge window, and thus didn't cause any real data conflicts. Tejun pointed out a few of them, I fixed a couple more. * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits) workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending() workqueue: use cwq_set_max_active() helper for workqueue_set_max_active() workqueue: introduce cwq_set_max_active() helper for thaw_workqueues() workqueue: remove @delayed from cwq_dec_nr_in_flight() workqueue: fix possible stall on try_to_grab_pending() of a delayed work item workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback() workqueue: use __cpuinit instead of __devinit for cpu callbacks workqueue: rename manager_mutex to assoc_mutex workqueue: WORKER_REBIND is no longer necessary for idle rebinding workqueue: WORKER_REBIND is no longer necessary for busy rebinding workqueue: reimplement idle worker rebinding workqueue: deprecate __cancel_delayed_work() workqueue: reimplement cancel_delayed_work() using try_to_grab_pending() workqueue: use mod_delayed_work() instead of __cancel + queue workqueue: use irqsafe timer for delayed_work workqueue: clean up delayed_work initializers and add missing one workqueue: make deferrable delayed_work initializer names consistent workqueue: cosmetic whitespace updates for macro definitions workqueue: deprecate system_nrt[_freezable]_wq workqueue: deprecate flush[_delayed]_work_sync() ...
2012-09-26dm mpath: only retry ioctl when no paths if queue_if_no_path setMike Snitzer
When there are no paths and multipath receives an ioctl, it waits until a path becomes available. This behaviour is incorrect if the "queue_if_no_path" setting was not specified, as then the ioctl should be rejected immediately, which this patch now does. commit 35991652b ("dm mpath: allow ioctls to trigger pg init") should have checked if queue_if_no_path was configured before queueing IO. Checking for the queue_if_no_path feature, like is done in map_io(), allows the following table load to work without blocking in the multipath_ioctl retry loop: echo "0 1024 multipath 0 0 0 0" | dmsetup create mpath_nodevs Without this fix the multipath_ioctl will block with the following stack trace: blkid D 0000000000000002 0 23936 1 0x00000000 ffff8802b89e5cd8 0000000000000082 ffff8802b89e5fd8 0000000000012440 ffff8802b89e4010 0000000000012440 0000000000012440 0000000000012440 ffff8802b89e5fd8 0000000000012440 ffff88030c2aab30 ffff880325794040 Call Trace: [<ffffffff814ce099>] schedule+0x29/0x70 [<ffffffff814cc312>] schedule_timeout+0x182/0x2e0 [<ffffffff8104dee0>] ? lock_timer_base+0x70/0x70 [<ffffffff814cc48e>] schedule_timeout_uninterruptible+0x1e/0x20 [<ffffffff8104f840>] msleep+0x20/0x30 [<ffffffffa0000839>] multipath_ioctl+0x109/0x170 [dm_multipath] [<ffffffffa06bfb9c>] dm_blk_ioctl+0xbc/0xd0 [dm_mod] [<ffffffff8122a408>] __blkdev_driver_ioctl+0x28/0x30 [<ffffffff8122a79e>] blkdev_ioctl+0xce/0x730 [<ffffffff811970ac>] block_ioctl+0x3c/0x40 [<ffffffff8117321c>] do_vfs_ioctl+0x8c/0x340 [<ffffffff81166293>] ? sys_newfstat+0x33/0x40 [<ffffffff81173571>] sys_ioctl+0xa1/0xb0 [<ffffffff814d70a9>] system_call_fastpath+0x16/0x1b Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.5+ Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-08-20workqueue: deprecate flush[_delayed]_work_sync()Tejun Heo
flush[_delayed]_work_sync() are now spurious. Mark them deprecated and convert all users to flush[_delayed]_work(). If you're cc'd and wondering what's going on: Now all workqueues are non-reentrant and the regular flushes guarantee that the work item is not pending or running on any CPU on return, so there's no reason to use the sync flushes at all and they're going away. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mattia Dongili <malattia@linux.it> Cc: Kent Yoder <key@linux.vnet.ibm.com> Cc: David Airlie <airlied@linux.ie> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Karsten Keil <isdn@linux-pingi.de> Cc: Bryan Wu <bryan.wu@canonical.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: David Woodhouse <dwmw2@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: linux-wireless@vger.kernel.org Cc: Anton Vorontsov <cbou@mail.ru> Cc: Sangbeom Kim <sbkim73@samsung.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Petr Vandrovec <petr@vandrovec.name> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Avi Kivity <avi@redhat.com>
2012-07-27dm thin: commit before gathering statusAlasdair G Kergon
Commit outstanding metadata before returning the status for a dm thin pool so that the numbers reported are as up-to-date as possible. The commit is not performed if the device is suspended or if the DM_NOFLUSH_FLAG is supplied by userspace and passed to the target through a new 'status_flags' parameter in the target's dm_status_fn. The userspace dmsetup tool will support the --noflush flag with the 'dmsetup status' and 'dmsetup wait' commands from version 1.02.76 onwards. Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27dm mpath: add retain_attached_hw_handler featureMike Snitzer
A SCSI device handler might get attached to a device during the initial device scan. We do not necessarily want to override this when loading a multipath table, so this patch adds a new multipath feature argument "retain_attached_hw_handler". During SCSI device scan all loaded SCSI device handlers will be consulted for a match (via scsi_dh's provided .match). If a match is found that device handler will be attached. We need a way to have userspace multipathd's provided 'hw_handler' not override the already attached hardware handler. When specifying the new feature 'retain_attached_hw_handler' multipath will use the currently attached hardware handler instead of trying to attach the one specified during table load. If no hardware handler is attached the specified hardware handler will still be used. Leverages scsi_dh_attach's ability to increment the scsi_dh's reference count if the same scsi_dh name is provided when attaching - currently attached scsi_dh name is determined with scsi_dh_attached_handler_name. Depends upon commit 7e8a74b177f17d100916b6ad415450f7c9508691 ("[SCSI] scsi_dh: add scsi_dh_attached_handler_name"). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Tested-by: Babu Moger <babu.moger@netapp.com> Reviewed-by: Chandra Seetharaman <sekharan@us.ibm.com> Acked-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-06-03dm mpath: allow ioctls to trigger pg initMikulas Patocka
After the failure of a group of paths, any alternative paths that need initialising do not become available until further I/O is sent to the device. Until this has happened, ioctls return -EAGAIN. With this patch, new paths are made available in response to an ioctl too. The processing of the ioctl gets delayed until this has happened. Instead of returning an error, we submit a work item to kmultipathd (that will potentially activate the new path) and retry in ten milliseconds. Note that the patch doesn't retry an ioctl if the ioctl itself fails due to a path failure. Such retries should be handled intelligently by the code that generated the ioctl in the first place, noting that some SCSI commands should not be retried because they are not idempotent (XOR write commands). For commands that could be retried, there is a danger that if the device rejected the SCSI command, the path could be errorneously marked as failed, and the request would be retried on another path which might fail too. It can be determined if the failure happens on the device or on the SCSI controller, but there is no guarantee that all SCSI drivers set these flags correctly. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-06-03dm mpath: delay retry of bypassed pgMike Christie
If I/O needs retrying and only bypassed priority groups are available, set the pg_init_delay_retry flag to wait before retrying. If, for example, the reason for the bypass is that the controller is getting reset or there is a firmware upgrade happening, retrying right away would cause a flood of log messages and retries for what could be a few seconds or even several minutes. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-06-03dm mpath: reduce size of struct multipathMike Snitzer
Move multipath structure's 'lock' and 'queue_size' members to eliminate two 4-byte holes. Also use a bit within a single unsigned int for each existing flag (saves 8-bytes). This allows future flags to be added without each consuming an unsigned int. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-05-12dm mpath: check if scsi_dh module already loaded before trying to loadMike Snitzer
If the requested scsi_dh module is already loaded then skip request_module(). Multipath table loads can hang in an unnecessary __request_module. Reported-by: Ben Marzinski <bmarzins@redhat.com> Cc: stable@kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28dm: reject trailing characters in sccanf inputMikulas Patocka
Device mapper uses sscanf to convert arguments to numbers. The problem is that the way we use it ignores additional unmatched characters in the scanned string. For example, this `if (sscanf(string, "%d", &number) == 1)' will match a number, but also it will match number with some garbage appended, like "123abc". As a result, device mapper accepts garbage after some numbers. For example the command `dmsetup create vg1-new --table "0 16384 linear 254:1bla 34816bla"' will pass without an error. This patch fixes all sscanf uses in device mapper. It appends "%c" with a pointer to a dummy character variable to every sscanf statement. The construct `if (sscanf(string, "%d%c", &number, &dummy) == 1)' succeeds only if string is a null-terminated number (optionally preceded by some whitespace characters). If there is some character appended after the number, sscanf matches "%c", writes the character to the dummy variable and returns 2. We check the return value for 1 and consequently reject numbers with some garbage appended. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28dm mpath: detect invalid map_contextJun'ichi Nomura
The map_context pointer should always be set. However, we have reports that upon requeuing it is not set correctly. So add set and clear functions with a BUG_ON() to track the issue properly. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Mike Snitzer <snitzer@redhat.com> Acked-by: Hannes Reinecke <hare@suse.de> Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-01-14dm: do not forward ioctls from logical volumes to the underlying devicePaolo Bonzini
A logical volume can map to just part of underlying physical volume. In this case, it must be treated like a partition. Based on a patch from Alasdair G Kergon. Cc: Alasdair G Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-02dm table: share target argument parsing functionsMike Snitzer
Move multipath target argument parsing code into dm-table so other targets can share it. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm mpath: fix potential NULL pointer in feature arg processingMike Snitzer
Avoid dereferencing a NULL pointer if the number of feature arguments supplied is fewer than indicated. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org
2011-07-26atomic: use <linux/atomic.h>Arun Sharma
This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-29dm mpath: do not fail paths after integrity errorsMartin K. Petersen
Integrity errors need to be passed to the owner of the integrity metadata for processing. Consequently EILSEQ should be passed up the stack. Cc: stable@kernel.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-03-24dm mpath: allow table load with no priority groupsMike Snitzer
This patch adjusts the multipath target to allow a table with both 0 priority groups and 0 for the initial priority group number. If any mpath device is held open when all paths in the last priority group have failed, userspace multipathd will attempt to reload the associated DM table to reflect the fact that the device no longer has any priority groups. But the reload attempt always failed because the multipath target did not allow 0 priority groups. All multipath target messages related to priority group (enable_group, disable_group, switch_group) will handle a priority group of 0 (will cause error). When reloading a multipath table with 0 priority groups, userspace multipathd must be updated to specify an initial priority group number of 0 (rather than 1). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Babu Moger <babu.moger@lsi.com> Acked-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-03-24dm mpath: fail message ioctl if specified path is not validMike Snitzer
Fail the reinstate_path and fail_path message ioctl if the specified path is not valid. The message ioctl would succeed for the 'reinistate_path' and 'fail_path' messages even if action was not taken because the specified device was not a valid path of the multipath device. Before, when /dev/vdb is not a path of mpathb: $ dmsetup message mpathb 0 reinstate_path /dev/vdb $ echo $? 0 After: $ dmsetup message mpathb 0 reinstate_path /dev/vdb device-mapper: message ioctl failed: Invalid argument Command failed $ echo $? 1 Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>