author | Linus Torvalds <torvalds@linux-foundation.org> | 2016-05-17 16:13:00 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2016-05-17 16:13:00 -0700 |
commit | b80fed9595513384424cd141923c9161c4b5021b (patch) | |
tree | a7ca08c40a41f157f3cb472b9bc7cfc123859d8d | |
parent | 24b9f0cf00c8e8df29a4ddfec8c139ad62753113 (diff) | |
parent | 202bae52934d4eb79ffaebf49f49b1cc64d8e40b (diff) | |
Merge tag 'dm-4.7-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- based on Jens' 'for-4.7/core' to have DM thinp's discard support use
bio_inc_remaining() and the block core's new async __blkdev_issue_discard()
interface
- make DM multipath's fast code-paths lockless, using lockless_dereference,
to significantly improve large NUMA performance when using blk-mq.
The m->lock spinlock contention was a serious bottleneck.
- a few other small code cleanups and Documentation fixes
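The multipath change above replaces spinlock-protected bitfields and counters with atomic bit operations and atomic counters. The following is a minimal user-space sketch of that pattern, assuming C11 <stdatomic.h> as a stand-in for the kernel's set_bit()/clear_bit()/test_bit() and atomic_t. The names mp_state, flag_set, flag_clear, flag_test and must_queue are hypothetical and do not appear in the patch; only the shape of the check mirrors the lockless test in multipath_busy().

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag numbers, modelled on the MPATHF_* bit indexes. */
enum { FLAG_QUEUE_IO = 0, FLAG_QUEUE_IF_NO_PATH = 1 };

/* Hypothetical user-space analogue of the relevant 'struct multipath' members. */
struct mp_state {
	atomic_ulong flags;		/* one bit per state flag */
	atomic_int nr_valid_paths;	/* number of usable paths */
};

/* Stand-in for the kernel's set_bit(): atomically set one bit. */
static void flag_set(struct mp_state *s, int bit)
{
	assert(bit >= 0 && bit < (int)(sizeof(unsigned long) * CHAR_BIT));
	atomic_fetch_or(&s->flags, 1UL << bit);
}

/* Stand-in for the kernel's clear_bit(): atomically clear one bit. */
static void flag_clear(struct mp_state *s, int bit)
{
	assert(bit >= 0 && bit < (int)(sizeof(unsigned long) * CHAR_BIT));
	atomic_fetch_and(&s->flags, ~(1UL << bit));
}

/* Stand-in for the kernel's test_bit(): read one bit without a lock. */
static bool flag_test(struct mp_state *s, int bit)
{
	return (atomic_load(&s->flags) >> bit) & 1UL;
}

/* Fast-path check that takes no lock, only atomic reads. */
static bool must_queue(struct mp_state *s)
{
	return atomic_load(&s->nr_valid_paths) == 0 &&
	       flag_test(s, FLAG_QUEUE_IF_NO_PATH);
}
```

Readers on the fast path never serialize on a shared lock, which is the property the pull message credits for the NUMA improvement. Updates that must change several fields together still take m->lock in the patch.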
* tag 'dm-4.7-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm thin: unroll issue_discard() to create longer discard bio chains
dm thin: use __blkdev_issue_discard for async discard support
dm thin: remove __bio_inc_remaining() and switch to using bio_inc_remaining()
dm raid: make sure no feature flags are set in metadata
dm ioctl: drop use of __GFP_REPEAT in copy_params()'s __vmalloc() call
dm stats: fix spelling mistake in Documentation
dm cache: update cache-policies.txt now that mq is an alias for smq
dm mpath: eliminate use of spinlock in IO fast-paths
dm mpath: move trigger_event member to the end of 'struct multipath'
dm mpath: use atomic_t for counting members of 'struct multipath'
dm mpath: switch to using bitops for state flags
dm thin: Remove return statement from void function
dm: remove unused mapped_device argument from free_tio()
-rw-r--r-- | Documentation/device-mapper/cache-policies.txt | 34
-rw-r--r-- | Documentation/device-mapper/statistics.txt | 2
-rw-r--r-- | drivers/md/dm-ioctl.c | 2
-rw-r--r-- | drivers/md/dm-mpath.c | 351
-rw-r--r-- | drivers/md/dm-raid.c | 7
-rw-r--r-- | drivers/md/dm-thin.c | 165
-rw-r--r-- | drivers/md/dm.c | 10
7 files changed, 298 insertions, 273 deletions
diff --git a/Documentation/device-mapper/cache-policies.txt b/Documentation/device-mapper/cache-policies.txt
index e5062ad18717..d3ca8af21a31 100644
--- a/Documentation/device-mapper/cache-policies.txt
+++ b/Documentation/device-mapper/cache-policies.txt
@@ -11,7 +11,7 @@ Every bio that is mapped by the target is referred to the policy.
 The policy can return a simple HIT or MISS or issue a migration.

 Currently there's no way for the policy to issue background work,
-e.g. to start writing back dirty blocks that are going to be evicte
+e.g. to start writing back dirty blocks that are going to be evicted
 soon.

 Because we map bios, rather than requests it's easy for the policy
@@ -48,7 +48,7 @@ with the multiqueue (mq) policy.

 The smq policy (vs mq) offers the promise of less memory utilization,
 improved performance and increased adaptability in the face of changing
-workloads. SMQ also does not have any cumbersome tuning knobs.
+workloads. smq also does not have any cumbersome tuning knobs.

 Users may switch from "mq" to "smq" simply by appropriately reloading a
 DM table that is using the cache target. Doing so will cause all of the
@@ -57,47 +57,45 @@ degrade slightly until smq recalculates the origin device's hotspots
 that should be cached.

 Memory usage:
-The mq policy uses a lot of memory; 88 bytes per cache block on a 64
+The mq policy used a lot of memory; 88 bytes per cache block on a 64
 bit machine.

-SMQ uses 28bit indexes to implement it's data structures rather than
+smq uses 28bit indexes to implement it's data structures rather than
 pointers. It avoids storing an explicit hit count for each block. It
-has a 'hotspot' queue rather than a pre cache which uses a quarter of
+has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of
 the entries (each hotspot block covers a larger area than a single
 cache block).

-All these mean smq uses ~25bytes per cache block. Still a lot of
+All this means smq uses ~25bytes per cache block. Still a lot of
 memory, but a substantial improvement nontheless.

 Level balancing:
-MQ places entries in different levels of the multiqueue structures
-based on their hit count (~ln(hit count)). This means the bottom
-levels generally have the most entries, and the top ones have very
-few. Having unbalanced levels like this reduces the efficacy of the
+mq placed entries in different levels of the multiqueue structures
+based on their hit count (~ln(hit count)). This meant the bottom
+levels generally had the most entries, and the top ones had very
+few. Having unbalanced levels like this reduced the efficacy of the
 multiqueue.

-SMQ does not maintain a hit count, instead it swaps hit entries with
-the least recently used entry from the level above. The over all
+smq does not maintain a hit count, instead it swaps hit entries with
+the least recently used entry from the level above. The overall
 ordering being a side effect of this stochastic process. With this
 scheme we can decide how many entries occupy each multiqueue level,
 resulting in better promotion/demotion decisions.

 Adaptability:
-The MQ policy maintains a hit count for each cache block. For a
+The mq policy maintained a hit count for each cache block. For a
 different block to get promoted to the cache it's hit count has to
-exceed the lowest currently in the cache. This means it can take a
+exceed the lowest currently in the cache. This meant it could take a
 long time for the cache to adapt between varying IO patterns.
-Periodically degrading the hit counts could help with this, but I
-haven't found a nice general solution.

-SMQ doesn't maintain hit counts, so a lot of this problem just goes
+smq doesn't maintain hit counts, so a lot of this problem just goes
 away. In addition it tracks performance of the hotspot queue, which
 is used to decide which blocks to promote. If the hotspot queue is
 performing badly then it starts moving entries more quickly between
 levels. This lets it adapt to new IO patterns very quickly.

 Performance:
-Testing SMQ shows substantially better performance than MQ.
+Testing smq shows substantially better performance than mq.

 cleaner
 -------
diff --git a/Documentation/device-mapper/statistics.txt b/Documentation/device-mapper/statistics.txt
index 6f5ef944ca4c..170ac02a1f50 100644
--- a/Documentation/device-mapper/statistics.txt
+++ b/Documentation/device-mapper/statistics.txt
@@ -205,7 +205,7 @@ statistics on them:

   dmsetup message vol 0 @stats_create - /100

-Set the auxillary data string to "foo bar baz" (the escape for each
+Set the auxiliary data string to "foo bar baz" (the escape for each
 space must also be escaped, otherwise the shell will consume them):

   dmsetup message vol 0 @stats_set_aux 0 foo\\ bar\\ baz
diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 2adf81d81fca..2c7ca258c4e4 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1723,7 +1723,7 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
 	if (!dmi) {
 		unsigned noio_flag;
 		noio_flag = memalloc_noio_save();
-		dmi = __vmalloc(param_kernel->data_size, GFP_NOIO | __GFP_REPEAT | __GFP_HIGH | __GFP_HIGHMEM, PAGE_KERNEL);
+		dmi = __vmalloc(param_kernel->data_size, GFP_NOIO | __GFP_HIGH | __GFP_HIGHMEM, PAGE_KERNEL);
 		memalloc_noio_restore(noio_flag);
 		if (dmi)
 			*param_flags |= DM_PARAMS_VMALLOC;
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 677ba223e2ae..52baf8a5b0f4 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -76,26 +76,18 @@ struct multipath {

 	wait_queue_head_t pg_init_wait;	/* Wait for pg_init completion */

-	unsigned pg_init_in_progress;	/* Only one pg_init allowed at once */
-
-	unsigned nr_valid_paths;	/* Total number of usable paths */
 	struct pgpath *current_pgpath;
 	struct priority_group *current_pg;
 	struct priority_group *next_pg;	/* Switch to this PG if set */

-	bool queue_io:1;		/* Must we queue all I/O? */
-	bool queue_if_no_path:1;	/* Queue I/O if last path fails? */
-	bool saved_queue_if_no_path:1;	/* Saved state during suspension */
-	bool retain_attached_hw_handler:1; /* If there's already a hw_handler present, don't change it. */
-	bool pg_init_disabled:1;	/* pg_init is not currently allowed */
-	bool pg_init_required:1;	/* pg_init needs calling? */
-	bool pg_init_delay_retry:1;	/* Delay pg_init retry? */
+	unsigned long flags;		/* Multipath state flags */

 	unsigned pg_init_retries;	/* Number of times to retry pg_init */
-	unsigned pg_init_count;		/* Number of times pg_init called */
 	unsigned pg_init_delay_msecs;	/* Number of msecs before pg_init retry */

-	struct work_struct trigger_event;
+	atomic_t nr_valid_paths;	/* Total number of usable paths */
+	atomic_t pg_init_in_progress;	/* Only one pg_init allowed at once */
+	atomic_t pg_init_count;		/* Number of times pg_init called */

 	/*
 	 * We must use a mempool of dm_mpath_io structs so that we
@@ -104,6 +96,7 @@ struct multipath {
 	mempool_t *mpio_pool;

 	struct mutex work_mutex;
+	struct work_struct trigger_event;
 };

 /*
@@ -122,6 +115,17 @@ static struct workqueue_struct *kmultipathd, *kmpath_handlerd;
 static void trigger_event(struct work_struct *work);
 static void activate_path(struct work_struct *work);

+/*-----------------------------------------------
+ * Multipath state flags.
+ *-----------------------------------------------*/
+
+#define MPATHF_QUEUE_IO 0			/* Must we queue all I/O? */
+#define MPATHF_QUEUE_IF_NO_PATH 1		/* Queue I/O if last path fails? */
+#define MPATHF_SAVED_QUEUE_IF_NO_PATH 2		/* Saved state during suspension */
+#define MPATHF_RETAIN_ATTACHED_HW_HANDLER 3	/* If there's already a hw_handler present, don't change it. */
+#define MPATHF_PG_INIT_DISABLED 4		/* pg_init is not currently allowed */
+#define MPATHF_PG_INIT_REQUIRED 5		/* pg_init needs calling? */
+#define MPATHF_PG_INIT_DELAY_RETRY 6		/* Delay pg_init retry? */

 /*-----------------------------------------------
  * Allocation routines
@@ -189,7 +193,10 @@ static struct multipath *alloc_multipath(struct dm_target *ti, bool use_blk_mq)
 	if (m) {
 		INIT_LIST_HEAD(&m->priority_groups);
 		spin_lock_init(&m->lock);
-		m->queue_io = true;
+		set_bit(MPATHF_QUEUE_IO, &m->flags);
+		atomic_set(&m->nr_valid_paths, 0);
+		atomic_set(&m->pg_init_in_progress, 0);
+		atomic_set(&m->pg_init_count, 0);
 		m->pg_init_delay_msecs = DM_PG_INIT_DELAY_DEFAULT;
 		INIT_WORK(&m->trigger_event, trigger_event);
 		init_waitqueue_head(&m->pg_init_wait);
@@ -274,17 +281,17 @@ static int __pg_init_all_paths(struct multipath *m)
 	struct pgpath *pgpath;
 	unsigned long pg_init_delay = 0;

-	if (m->pg_init_in_progress || m->pg_init_disabled)
+	if (atomic_read(&m->pg_init_in_progress) || test_bit(MPATHF_PG_INIT_DISABLED, &m->flags))
 		return 0;

-	m->pg_init_count++;
-	m->pg_init_required = false;
+	atomic_inc(&m->pg_init_count);
+	clear_bit(MPATHF_PG_INIT_REQUIRED, &m->flags);

 	/* Check here to reset pg_init_required */
 	if (!m->current_pg)
 		return 0;

-	if (m->pg_init_delay_retry)
+	if (test_bit(MPATHF_PG_INIT_DELAY_RETRY, &m->flags))
 		pg_init_delay = msecs_to_jiffies(m->pg_init_delay_msecs != DM_PG_INIT_DELAY_DEFAULT ?
 						 m->pg_init_delay_msecs : DM_PG_INIT_DELAY_MSECS);
 	list_for_each_entry(pgpath, &m->current_pg->pgpaths, list) {
@@ -293,65 +300,99 @@ static int __pg_init_all_paths(struct multipath *m)
 			continue;
 		if (queue_delayed_work(kmpath_handlerd, &pgpath->activate_path,
 				       pg_init_delay))
-			m->pg_init_in_progress++;
+			atomic_inc(&m->pg_init_in_progress);
 	}
-	return m->pg_init_in_progress;
+	return atomic_read(&m->pg_init_in_progress);
+}
+
+static int pg_init_all_paths(struct multipath *m)
+{
+	int r;
+	unsigned long flags;
+
+	spin_lock_irqsave(&m->lock, flags);
+	r = __pg_init_all_paths(m);
+	spin_unlock_irqrestore(&m->lock, flags);
+
+	return r;
 }

-static void __switch_pg(struct multipath *m, struct pgpath *pgpath)
+static void __switch_pg(struct multipath *m, struct priority_group *pg)
 {
-	m->current_pg = pgpath->pg;
+	m->current_pg = pg;

 	/* Must we initialise the PG first, and queue I/O till it's ready? */
 	if (m->hw_handler_name) {
-		m->pg_init_required = true;
-		m->queue_io = true;
+		set_bit(MPATHF_PG_INIT_REQUIRED, &m->flags);
+		set_bit(MPATHF_QUEUE_IO, &m->flags);
 	} else {
-		m->pg_init_required = false;
-		m->queue_io = false;
+		clear_bit(MPATHF_PG_INIT_REQUIRED, &m->flags);
+		clear_bit(MPATHF_QUEUE_IO, &m->flags);
 	}

-	m->pg_init_count = 0;
+	atomic_set(&m->pg_init_count, 0);
 }

-static int __choose_path_in_pg(struct multipath *m, struct priority_group *pg,
-			       size_t nr_bytes)
+static struct pgpath *choose_path_in_pg(struct multipath *m,
+					struct priority_group *pg,
+					size_t nr_bytes)
 {
+	unsigned long flags;
 	struct dm_path *path;
+	struct pgpath *pgpath;

 	path = pg->ps.type->select_path(&pg->ps, nr_bytes);
 	if (!path)
-		return -ENXIO;
+		return ERR_PTR(-ENXIO);

-	m->current_pgpath = path_to_pgpath(path);
+	pgpath = path_to_pgpath(path);

-	if (m->current_pg != pg)
-		__switch_pg(m, m->current_pgpath);
+	if (unlikely(lockless_dereference(m->current_pg) != pg)) {
+		/* Only update current_pgpath if pg changed */
+		spin_lock_irqsave(&m->lock, flags);
+		m->current_pgpath = pgpath;
+		__switch_pg(m, pg);
+		spin_unlock_irqrestore(&m->lock, flags);
+	}

-	return 0;
+	return pgpath;
 }

-static void __choose_pgpath(struct multipath *m, size_t nr_bytes)
+static struct pgpath *choose_pgpath(struct multipath *m, size_t nr_bytes)
 {
+	unsigned long flags;
 	struct priority_group *pg;
+	struct pgpath *pgpath;
 	bool bypassed = true;

-	if (!m->nr_valid_paths) {
-		m->queue_io = false;
+	if (!atomic_read(&m->nr_valid_paths)) {
+		clear_bit(MPATHF_QUEUE_IO, &m->flags);
 		goto failed;
 	}

 	/* Were we instructed to switch PG? */
-	if (m->next_pg) {
+	if (lockless_dereference(m->next_pg)) {
+		spin_lock_irqsave(&m->lock, flags);
 		pg = m->next_pg;
+		if (!pg) {
+			spin_unlock_irqrestore(&m->lock, flags);
+			goto check_current_pg;
+		}
 		m->next_pg = NULL;
-		if (!__choose_path_in_pg(m, pg, nr_bytes))
-			return;
+		spin_unlock_irqrestore(&m->lock, flags);
+		pgpath = choose_path_in_pg(m, pg, nr_bytes);
+		if (!IS_ERR_OR_NULL(pgpath))
+			return pgpath;
 	}

 	/* Don't change PG until it has no remaining paths */
-	if (m->current_pg && !__choose_path_in_pg(m, m->current_pg, nr_bytes))
-		return;
+check_current_pg:
+	pg = lockless_dereference(m->current_pg);
+	if (pg) {
+		pgpath = choose_path_in_pg(m, pg, nr_bytes);
+		if (!IS_ERR_OR_NULL(pgpath))
+			return pgpath;
+	}

 	/*
 	 * Loop through priority groups until we find a valid path.
@@ -363,34 +404,38 @@ static void __choose_pgpath(struct multipath *m, size_t nr_bytes)
 		list_for_each_entry(pg, &m->priority_groups, list) {
 			if (pg->bypassed == bypassed)
 				continue;
-			if (!__choose_path_in_pg(m, pg, nr_bytes)) {
+			pgpath = choose_path_in_pg(m, pg, nr_bytes);
+			if (!IS_ERR_OR_NULL(pgpath)) {
 				if (!bypassed)
-					m->pg_init_delay_retry = true;
-				return;
+					set_bit(MPATHF_PG_INIT_DELAY_RETRY, &m->flags);
+				return pgpath;
 			}
 		}
 	} while (bypassed--);

 failed:
+	spin_lock_irqsave(&m->lock, flags);
 	m->current_pgpath = NULL;
 	m->current_pg = NULL;
+	spin_unlock_irqrestore(&m->lock, flags);
+
+	return NULL;
 }

 /*
  * Check whether bios must be queued in the device-mapper core rather
  * than here in the target.
  *
- * m->lock must be held on entry.
- *
  * If m->queue_if_no_path and m->saved_queue_if_no_path hold the
  * same value then we are not between multipath_presuspend()
  * and multipath_resume() calls and we have no need to check
  * for the DMF_NOFLUSH_SUSPENDING flag.
  */
-static int __must_push_back(struct multipath *m)
+static int must_push_back(struct multipath *m)
 {
-	return (m->queue_if_no_path ||
-		(m->queue_if_no_path != m->saved_queue_if_no_path &&
+	return (test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags) ||
+		((test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags) !=
+		  test_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags)) &&
 		 dm_noflush_suspending(m->ti)));
 }

@@ -408,35 +453,31 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
 	struct block_device *bdev;
 	struct dm_mpath_io *mpio;

-	spin_lock_irq(&m->lock);
-
 	/* Do we need to select a new pgpath? */
-	if (!m->current_pgpath || !m->queue_io)
-		__choose_pgpath(m, nr_bytes);
-
-	pgpath = m->current_pgpath;
+	pgpath = lockless_dereference(m->current_pgpath);
+	if (!pgpath || !test_bit(MPATHF_QUEUE_IO, &m->flags))
+		pgpath = choose_pgpath(m, nr_bytes);

 	if (!pgpath) {
-		if (!__must_push_back(m))
+		if (!must_push_back(m))
 			r = -EIO;	/* Failed */
-		goto out_unlock;
-	} else if (m->queue_io || m->pg_init_required) {
-		__pg_init_all_paths(m);
-		goto out_unlock;
+		return r;
+	} else if (test_bit(MPATHF_QUEUE_IO, &m->flags) ||
+		   test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags)) {
+		pg_init_all_paths(m);
+		return r;
 	}

 	mpio = set_mpio(m, map_context);
 	if (!mpio)
 		/* ENOMEM, requeue */
-		goto out_unlock;
+		return r;

 	mpio->pgpath = pgpath;
 	mpio->nr_bytes = nr_bytes;

 	bdev = pgpath->path.dev->bdev;

-	spin_unlock_irq(&m->lock);
-
 	if (clone) {
 		/*
 		 * Old request-based interface: allocated clone is passed in.
@@ -468,11 +509,6 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
 					      &pgpath->path,
 					      nr_bytes);
 	return DM_MAPIO_REMAPPED;
-
-out_unlock:
-	spin_unlock_irq(&m->lock);
-
-	return r;
 }

 static int multipath_map(struct dm_target *ti, struct request *clone,
@@ -503,11 +539,22 @@ static int queue_if_no_path(struct multipath *m, bool queue_if_no_path,

 	spin_lock_irqsave(&m->lock, flags);

-	if (save_old_value)
-		m->saved_queue_if_no_path = m->queue_if_no_path;
+	if (save_old_value) {
+		if (test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags))
+			set_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags);
+		else
+			clear_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags);
+	} else {
+		if (queue_if_no_path)
+			set_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags);
+		else
+			clear_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags);
+	}
+	if (queue_if_no_path)
+		set_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags);
 	else
-		m->saved_queue_if_no_path = queue_if_no_path;
-	m->queue_if_no_path = queue_if_no_path;
+		clear_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags);
+
 	spin_unlock_irqrestore(&m->lock, flags);

 	if (!queue_if_no_path)
@@ -600,10 +647,10 @@ static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps
 		goto bad;
 	}

-	if (m->retain_attached_hw_handler || m->hw_handler_name)
+	if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags) || m->hw_handler_name)
 		q = bdev_get_queue(p->path.dev->bdev);

-	if (m->retain_attached_hw_handler) {
+	if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags)) {
 retain:
 		attached_handler_name = scsi_dh_attached_handler_name(q, GFP_KERNEL);
 		if (attached_handler_name) {
@@ -808,7 +855,7 @@ static int parse_features(struct dm_arg_set *as, struct multipath *m)
 		}

 		if (!strcasecmp(arg_name, "retain_attached_hw_handler")) {
-			m->retain_attached_hw_handler = true;
+			set_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags);
 			continue;
 		}

@@ -884,6 +931,7 @@ static int multipath_ctr(struct dm_target *ti, unsigned int argc,
 	/* parse the priority groups */
 	while (as.argc) {
 		struct priority_group *pg;
+		unsigned nr_valid_paths = atomic_read(&m->nr_valid_paths);

 		pg = parse_priority_group(&as, m);
 		if (IS_ERR(pg)) {
@@ -891,7 +939,9 @@ static int multipath_ctr(struct dm_target *ti, unsigned int argc,
 			goto bad;
 		}

-		m->nr_valid_paths += pg->nr_pgpaths;
+		nr_valid_paths += pg->nr_pgpaths;
+		atomic_set(&m->nr_valid_paths, nr_valid_paths);
+
 		list_add_tail(&pg->list, &m->priority_groups);
 		pg_count++;
 		pg->pg_num = pg_count;
@@ -921,19 +971,14 @@ static int multipath_ctr(struct dm_target *ti, unsigned int argc,
 static void multipath_wait_for_pg_init_completion(struct multipath *m)
 {
 	DECLARE_WAITQUEUE(wait, current);
-	unsigned long flags;

 	add_wait_queue(&m->pg_init_wait, &wait);

 	while (1) {
 		set_current_state(TASK_UNINTERRUPTIBLE);

-		spin_lock_irqsave(&m->lock, flags);
-		if (!m->pg_init_in_progress) {
-			spin_unlock_irqrestore(&m->lock, flags);
+		if (!atomic_read(&m->pg_init_in_progress))
 			break;
-		}
-		spin_unlock_irqrestore(&m->lock, flags);

 		io_schedule();
 	}
@@ -944,20 +989,16 @@ static void multipath_wait_for_pg_init_completion(struct multipath *m)

 static void flush_multipath_work(struct multipath *m)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&m->lock, flags);
-	m->pg_init_disabled = true;
-	spin_unlock_irqrestore(&m->lock, flags);
+	set_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
+	smp_mb__after_atomic();

 	flush_workqueue(kmpath_handlerd);
 	multipath_wait_for_pg_init_completion(m);
 	flush_workqueue(kmultipathd);
 	flush_work(&m->trigger_event);

-	spin_lock_irqsave(&m->lock, flags);
-	m->pg_init_disabled = false;
-	spin_unlock_irqrestore(&m->lock, flags);
+	clear_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
+	smp_mb__after_atomic();
 }

 static void multipath_dtr(struct dm_target *ti)
@@ -987,13 +1028,13 @@ static int fail_path(struct pgpath *pgpath)
 	pgpath->is_active = false;
 	pgpath->fail_count++;

-	m->nr_valid_paths--;
+	atomic_dec(&m->nr_valid_paths);

 	if (pgpath == m->current_pgpath)
 		m->current_pgpath = NULL;

 	dm_path_uevent(DM_UEVENT_PATH_FAILED, m->ti,
-		      pgpath->path.dev->name, m->nr_valid_paths);
+		       pgpath->path.dev->name, atomic_read(&m->nr_valid_paths));

 	schedule_work(&m->trigger_event);

@@ -1011,6 +1052,7 @@ static int reinstate_path(struct pgpath *pgpath)
 	int r = 0, run_queue = 0;
 	unsigned long flags;
 	struct multipath *m = pgpath->pg->m;
+	unsigned nr_valid_paths;

 	spin_lock_irqsave(&m->lock, flags);

@@ -1025,16 +1067,17 @@ static int reinstate_path(struct pgpath *pgpath)

 	pgpath->is_active = true;

-	if (!m->nr_valid_paths++) {
+	nr_valid_paths = atomic_inc_return(&m->nr_valid_paths);
+	if (nr_valid_paths == 1) {
 		m->current_pgpath = NULL;
 		run_queue = 1;
 	} else if (m->hw_handler_name && (m->current_pg == pgpath->pg)) {
 		if (queue_work(kmpath_handlerd, &pgpath->activate_path.work))
-			m->pg_init_in_progress++;
+			atomic_inc(&m->pg_init_in_progress);
 	}

 	dm_path_uevent(DM_UEVENT_PATH_REINSTATED, m->ti,
-		      pgpath->path.dev->name, m->nr_valid_paths);
+		       pgpath->path.dev->name, nr_valid_paths);

 	schedule_work(&m->trigger_event);

@@ -1152,8 +1195,9 @@ static bool pg_init_limit_reached(struct multipath *m, struct pgpath *pgpath)

 	spin_lock_irqsave(&m->lock, flags);

-	if (m->pg_init_count <= m->pg_init_retries && !m->pg_init_disabled)
-		m->pg_init_required = true;
+	if (atomic_read(&m->pg_init_count) <= m->pg_init_retries &&
+	    !test_bit(MPATHF_PG_INIT_DISABLED, &m->flags))
+		set_bit(MPATHF_PG_INIT_REQUIRED, &m->flags);
 	else
 		limit_reached = true;

@@ -1219,19 +1263,23 @@ static void pg_init_done(void *data, int errors)
 			m->current_pgpath = NULL;
 			m->current_pg = NULL;
 		}
-	} else if (!m->pg_init_required)
+	} else if (!test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags))
 		pg->bypassed = false;

-	if (--m->pg_init_in_progress)
+	if (atomic_dec_return(&m->pg_init_in_progress) > 0)
 		/* Activations of other paths are still on going */
 		goto out;

-	if (m->pg_init_required) {
-		m->pg_init_delay_retry = delay_retry;
+	if (test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags)) {
+		if (delay_retry)
+			set_bit(MPATHF_PG_INIT_DELAY_RETRY, &m->flags);
+		else
+			clear_bit(MPATHF_PG_INIT_DELAY_RETRY, &m->flags);
+
 		if (__pg_init_all_paths(m))
 			goto out;
 	}
-	m->queue_io = false;
+	clear_bit(MPATHF_QUEUE_IO, &m->flags);

 	/*
 	 * Wake up any thread waiting to suspend.
@@ -1287,7 +1335,6 @@ static int do_end_io(struct multipath *m, struct request *clone,
 	 * clone bios for it and resubmit it later.
 	 */
 	int r = DM_ENDIO_REQUEUE;
-	unsigned long flags;

 	if (!error && !clone->errors)
 		return 0;	/* I/O complete */
@@ -1298,17 +1345,15 @@ static int do_end_io(struct multipath *m, struct request *clone,
 	if (mpio->pgpath)
 		fail_path(mpio->pgpath);

-	spin_lock_irqsave(&m->lock, flags);
-	if (!m->nr_valid_paths) {
-		if (!m->queue_if_no_path) {
-			if (!__must_push_back(m))
+	if (!atomic_read(&m->nr_valid_paths)) {
+		if (!test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)) {
+			if (!must_push_back(m))
 				r = -EIO;
 		} else {
 			if (error == -EBADE)
 				r = error;
 		}
 	}
-	spin_unlock_irqrestore(&m->lock, flags);

 	return r;
 }
@@ -1364,11 +1409,12 @@ static void multipath_postsuspend(struct dm_target *ti)
 static void multipath_resume(struct dm_target *ti)
 {
 	struct multipath *m = ti->private;
-	unsigned long flags;

-	spin_lock_irqsave(&m->lock, flags);
-	m->queue_if_no_path = m->saved_queue_if_no_path;
-	spin_unlock_irqrestore(&m->lock, flags);
+	if (test_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags))
+		set_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags);
+	else
+		clear_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags);
+	smp_mb__after_atomic();
 }

 /*
@@ -1402,19 +1448,20 @@ static void multipath_status(struct dm_target *ti, status_type_t type,

 	/* Features */
 	if (type == STATUSTYPE_INFO)
-		DMEMIT("2 %u %u ", m->queue_io, m->pg_init_count);
+		DMEMIT("2 %u %u ", test_bit(MPATHF_QUEUE_IO, &m->flags),
+		       atomic_read(&m->pg_init_count));
 	else {
-		DMEMIT("%u ", m->queue_if_no_path +
+		DMEMIT("%u ", test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags) +
 			      (m->pg_init_retries > 0) * 2 +
 			      (m->pg_init_delay_msecs != DM_PG_INIT_DELAY_DEFAULT) * 2 +
-			      m->retain_attached_hw_handler);
-		if (m->queue_if_no_path)
+			      test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags));
+		if (test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags))
 			DMEMIT("queue_if_no_path ");
 		if (m->pg_init_retries)
 			DMEMIT("pg_init_retries %u ", m->pg_init_retries);
 		if (m->pg_init_delay_msecs != DM_PG_INIT_DELAY_DEFAULT)
 			DMEMIT("pg_init_delay_msecs %u ", m->pg_init_delay_msecs);
-		if (m->retain_attached_hw_handler)
+		if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags))
 			DMEMIT("retain_attached_hw_handler ");
 	}

@@ -1563,18 +1610,17 @@ static int multipath_prepare_ioctl(struct dm_target *ti,
 		struct block_device **bdev, fmode_t *mode)
 {
 	struct multipath *m = ti->private;
-	unsigned long flags;
+	struct pgpath *current_pgpath;
 	int r;

-	spin_lock_irqsave(&m->lock, flags);
+	current_pgpath = lockless_dereference(m->current_pgpath);
+	if (!current_pgpath)
+		current_pgpath = choose_pgpath(m, 0);

-	if (!m->current_pgpath)
-		__choose_pgpath(m, 0);
-
-	if (m->current_pgpath) {
-		if (!m->queue_io) {
-			*bdev = m->current_pgpath->path.dev->bdev;
-			*mode = m->current_pgpath->path.dev->mode;
+	if (current_pgpath) {
+		if (!test_bit(MPATHF_QUEUE_IO, &m->flags)) {
+			*bdev = current_pgpath->path.dev->bdev;
+			*mode = current_pgpath->path.dev->mode;
 			r = 0;
 		} else {
 			/* pg_init has not started or completed */
@@ -1582,23 +1628,19 @@ static int multipath_prepare_ioctl(struct dm_target *ti,
 		}
 	} else {
 		/* No path is available */
-		if (m->queue_if_no_path)
+		if (test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags))
 			r = -ENOTCONN;
 		else
 			r = -EIO;
 	}

-	spin_unlock_irqrestore(&m->lock, flags);
-
 	if (r == -ENOTCONN) {
-		spin_lock_irqsave(&m->lock, flags);
-		if (!m->current_pg) {
+		if (!lockless_dereference(m->current_pg)) {
 			/* Path status changed, redo selection */
-			__choose_pgpath(m, 0);
+			(void) choose_pgpath(m, 0);
 		}
-		if (m->pg_init_required)
-			__pg_init_all_paths(m);
-		spin_unlock_irqrestore(&m->lock, flags);
+		if (test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags))
+			pg_init_all_paths(m);
 		dm_table_run_md_queue_async(m->ti->table);
 	}

@@ -1649,39 +1691,37 @@ static int multipath_busy(struct dm_target *ti)
 {
 	bool busy = false, has_active = false;
 	struct multipath *m = ti->private;
-	struct priority_group *pg;
+	struct priority_group *pg, *next_pg;
 	struct pgpath *pgpath;
-	unsigned long flags;
-
-	spin_lock_irqsave(&m->lock, flags);

 	/* pg_init in progress or no paths available */
-	if (m->pg_init_in_progress ||
-	    (!m->nr_valid_paths && m->queue_if_no_path)) {
-		busy = true;
-		goto out;
-	}
+	if (atomic_read(&m->pg_init_in_progress) ||
+	    (!atomic_read(&m->nr_valid_paths) && test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)))
+		return true;
+
 	/* Guess which priority_group will be used at next mapping time */
-	if (unlikely(!m->current_pgpath && m->next_pg))
-		pg = m->next_pg;
-	else if (likely(m->current_pg))
-		pg = m->current_pg;
-	else
+	pg = lockless_dereference(m->current_pg);
+	next_pg = lockless_dereference(m->next_pg);
+	if (unlikely(!lockless_dereference(m->current_pgpath) && next_pg))
+		pg = next_pg;
+
+	if (!pg) {
 		/*
 		 * We don't know which pg will be used at next mapping time.
-		 * We don't call __choose_pgpath() here to avoid to trigger
+		 * We don't call choose_pgpath() here to avoid to trigger
 		 * pg_init just by busy checking.
 		 * So we don't know whether underlying devices we will be using
 		 * at next mapping time are busy or not. Just try mapping.
 		 */
-		goto out;
+		return busy;
+	}

 	/*
 	 * If there is one non-busy active path at least, the path selector
 	 * will be able to select it. So we consider such a pg as not busy.
 	 */
 	busy = true;
-	list_for_each_entry(pgpath, &pg->pgpaths, list)
+	list_for_each_entry(pgpath, &pg->pgpaths, list) {
 		if (pgpath->is_active) {
 			has_active = true;
 			if (!pgpath_busy(pgpath)) {
@@ -1689,17 +1729,16 @@ static int multipath_busy(struct dm_target *ti)
 				break;
 			}
 		}
+	}

-	if (!has_active)
+	if (!has_active) {
 		/*
 		 * No active path in this pg, so this pg won't be used and
 		 * the current_pg will be changed at next mapping time.
 		 * We need to try mapping to determine it.
 		 */
 		busy = false;
-
-out:
-	spin_unlock_irqrestore(&m->lock, flags);
+	}

 	return busy;
 }
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index a0901214aef5..52532745a50f 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -1037,6 +1037,11 @@ static int super_validate(struct raid_set *rs, struct md_rdev *rdev)
 	if (!mddev->events && super_init_validation(mddev, rdev))
 		return -EINVAL;

+	if (le32_to_cpu(sb->features)) {
+		rs->ti->error = "Unable to assemble array: No feature flags supported yet";
+		return -EINVAL;
+	}
+
 	/* Enable bitmap creation for RAID levels != 0 */
 	mddev->bitmap_info.offset = (rs->raid_type->level) ? to_sector(4096) : 0;
 	rdev->mddev->bitmap_info.default_offset = mddev->bitmap_info.offset;
@@ -1718,7 +1723,7 @@ static void raid_resume(struct dm_target *ti)

 static struct target_type raid_target = {
 	.name = "raid",
-	.version = {1, 7, 0},
+	.version = {1, 8, 0},
 	.module = THIS_MODULE,
 	.ctr = raid_ctr,
 	.dtr = raid_dtr,
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 92237b6fa8cd..fc803d50f9f0 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -322,56 +322,6 @@ struct thin_c {

 /*----------------------------------------------------------------*/

-/**
- * __blkdev_issue_discard_async - queue a discard with async completion
- * @bdev:	blockdev to issue discard for
- * @sector:	start sector
- * @nr_sects:	number of sectors to discard
- * @gfp_mask:	memory allocation flags (for bio_alloc)
- * @flags:	BLKDEV_IFL_* flags to control behaviour
- * @parent_bio: parent discard bio that all sub discards get chained to
- *
- * Description:
- *    Asynchronously issue a discard request for the sectors in question.
- */
-static int __blkdev_issue_discard_async(struct block_device *bdev, sector_t sector,
-					sector_t nr_sects, gfp_t gfp_mask, unsigned long flags,
-					struct bio *parent_bio)
-{
-	struct request_queue *q = bdev_get_queue(bdev);
-	int type = REQ_WRITE | REQ_DISCARD;
-	struct bio *bio;
-
-	if (!q || !nr_sects)
-		return -ENXIO;
-
-	if (!blk_queue_discard(q))
-		return -EOPNOTSUPP;
-
-	if (flags & BLKDEV_DISCARD_SECURE) {
-		if (!blk_queue_secdiscard(q))
-			return -EOPNOTSUPP;
-		type |= REQ_SECURE;
-	}
-
-	/*
-	 * Required bio_put occurs in bio_endio thanks to bio_chain below
-	 */
-	bio = bio_alloc(gfp_mask, 1);
-	if (!bio)
-		return -ENOMEM;
-
-	bio_chain(bio, parent_bio);
-
-	bio->bi_iter.bi_sector = sector;
-	bio->bi_bdev = bdev;
-	bio->bi_iter.bi_size = nr_sects << 9;
-
-	submit_bio(type, bio);
-
-	return 0;
-}
-
 static bool block_size_is_power_of_two(struct pool *pool)
 {
 	return pool->sectors_per_block_shift >= 0;
@@ -384,14 +334,55 @@ static sector_t block_to_sectors(struct pool *pool, dm_block_t b)
 		(b * pool->sectors_per_block);
 }

-static int issue_discard(struct thin_c *tc, dm_block_t data_b, dm_block_t data_e,
-			 struct bio *parent_bio)
+/*----------------------------------------------------------------*/
+
+struct discard_op {
+	struct thin_c *tc;
+	struct blk_plug plug;
+	struct bio *parent_bio;
+	struct bio *bio;
+};
+
+static void begin_discard(struct discard_op *op, struct thin_c *tc, struct bio *parent)
+{
+	BUG_ON(!parent);
+
+	op->tc = tc;
+	blk_start_plug(&op->plug);
+	op->parent_bio = parent;
+	op->bio = NULL;
+}
+
+static int issue_discard(struct discard_op *op, dm_block_t data_b, dm_block_t data_e)
 {
+	struct thin_c *tc = op->tc;
 	sector_t s = block_to_sectors(tc->pool, data_b);
 	sector_t len = block_to_sectors(tc->pool, data_e - data_b);

-	return __blkdev_issue_discard_async(tc->pool_dev->bdev, s, len,
-					    GFP_NOWAIT, 0, parent_bio);
+	return __blkdev_issue_discard(tc->pool_dev->bdev, s, len,
+				      GFP_NOWAIT, REQ_WRITE | REQ_DISCARD, &op->bio);
+}
+
+static void end_discard(struct discard_op *op, int r)
+{
+	if (op->bio) {
+		/*
+		 * Even if one of the calls to issue_discard failed, we
+		 * need to wait for the chain to complete.
+		 */
+		bio_chain(op->bio, op->parent_bio);
+		submit_bio(REQ_WRITE | REQ_DISCARD, op->bio);
+	}
+
+	blk_finish_plug(&op->plug);
+
+	/*
+	 * Even if r is set, there could be sub discards in flight that we
+	 * need to wait for.
+	 */
+	if (r && !op->parent_bio->bi_error)
+		op->parent_bio->bi_error = r;
+	bio_endio(op->parent_bio);
 }

 /*----------------------------------------------------------------*/
@@ -632,7 +623,7 @@ static void error_retry_list(struct pool *pool)
 {
 	int error = get_pool_io_error_code(pool);

-	return error_retry_list_with_code(pool, error);
+	error_retry_list_with_code(pool, error);
 }

 /*
@@ -1006,24 +997,28 @@ static void process_prepared_discard_no_passdown(struct dm_thin_new_mapping *m)
 	mempool_free(m, tc->pool->mapping_pool);
 }

-static int passdown_double_checking_shared_status(struct dm_thin_new_mapping *m)
+/*----------------------------------------------------------------*/
+
+static void passdown_double_checking_shared_status(struct dm_thin_new_mapping *m)
 {
 	/*
 	 * We've already unmapped this range of blocks, but before we
 	 * passdown we have to check that these blocks are now unused.
 	 */
-	int r;
+	int r = 0;
 	bool used = true;
 	struct thin_c *tc = m->tc;
 	struct pool *pool = tc->pool;
 	dm_block_t b = m->data_block, e, end = m->data_block + m->virt_end - m->virt_begin;
+	struct discard_op op;

+	begin_discard(&op, tc, m->bio);
 	while (b != end) {
 		/* find start of unmapped run */
 		for (; b < end; b++) {
 			r = dm_pool_block_is_used(pool->pmd, b, &used);
 			if (r)
-				return r;
+				goto out;

 			if (!used)
 				break;
@@ -1036,20 +1031,20 @@ static int passdown_double_checking_shared_status(struct dm_thin_new_mapping *m)
 		for (e = b + 1; e != end; e++) {
 			r = dm_pool_block_is_used(pool->pmd, e, &used);
 			if (r)
-				return r;
+				goto out;

 			if (used)
 				break;
 		}

-		r = issue_discard(tc, b, e, m->bio);
+		r = issue_discard(&op, b, e);
 		if (r)
-			return r;
+			goto out;

 		b = e;
 	}
-
-	return 0;
+out:
+	end_discard(&op, r);
 }

 static void process_prepared_discard_passdown(struct dm_thin_new_mapping *m)
@@ -1059,20 +1054,21 @@ static void process_prepared_discard_passdown(struct dm_thin_new_mapping *m)
 	struct pool *pool = tc->pool;

 	r = dm_thin_remove_range(tc->td, m->virt_begin, m->virt_end);
-	if (r)
+	if (r) {
 		metadata_operation_failed(pool, "dm_thin_remove_range", r);
+		bio_io_error(m->bio);

-	else if (m->maybe_shared)
-		r = passdown_double_checking_shared_status(m);
-	else
-		r = issue_discard(tc, m->data_block, m->data_block + (m->virt_end - m->virt_begin), m->bio);
+	} else if (m->maybe_shared) {
+		passdown_double_checking_shared_status(m);
+
+	} else {
+		struct discard_op op;
+		begin_discard(&op, tc, m->bio);
+		r = issue_discard(&op, m->data_block,
+				  m->data_block + (m->virt_end - m->virt_begin));
+		end_discard(&op, r);
+	}

-	/*
-	 * Even if r is set, there could be sub discards in flight that we
-	 * need to wait for.
- */ - m->bio->bi_error = r; - bio_endio(m->bio); cell_defer_no_holder(tc, m->cell); mempool_free(m, pool->mapping_pool); } @@ -1494,17 +1490,6 @@ static void process_discard_cell_no_passdown(struct thin_c *tc, pool->process_prepared_discard(m); } -/* - * __bio_inc_remaining() is used to defer parent bios's end_io until - * we _know_ all chained sub range discard bios have completed. - */ -static inline void __bio_inc_remaining(struct bio *bio) -{ - bio->bi_flags |= (1 << BIO_CHAIN); - smp_mb__before_atomic(); - atomic_inc(&bio->__bi_remaining); -} - static void break_up_discard_bio(struct thin_c *tc, dm_block_t begin, dm_block_t end, struct bio *bio) { @@ -1554,13 +1539,13 @@ static void break_up_discard_bio(struct thin_c *tc, dm_block_t begin, dm_block_t /* * The parent bio must not complete before sub discard bios are - * chained to it (see __blkdev_issue_discard_async's bio_chain)! + * chained to it (see end_discard's bio_chain)! * * This per-mapping bi_remaining increment is paired with * the implicit decrement that occurs via bio_endio() in - * process_prepared_discard_{passdown,no_passdown}. + * end_discard(). 
*/ - __bio_inc_remaining(bio); + bio_inc_remaining(bio); if (!dm_deferred_set_add_work(pool->all_io_ds, &m->list)) pool->process_prepared_discard(m); @@ -3899,7 +3884,7 @@ static struct target_type pool_target = { .name = "thin-pool", .features = DM_TARGET_SINGLETON | DM_TARGET_ALWAYS_WRITEABLE | DM_TARGET_IMMUTABLE, - .version = {1, 18, 0}, + .version = {1, 19, 0}, .module = THIS_MODULE, .ctr = pool_ctr, .dtr = pool_dtr, @@ -4273,7 +4258,7 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits) static struct target_type thin_target = { .name = "thin", - .version = {1, 18, 0}, + .version = {1, 19, 0}, .module = THIS_MODULE, .ctr = thin_ctr, .dtr = thin_dtr, diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 3d3ac13287a4..1b2f96205361 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -674,7 +674,7 @@ static void free_io(struct mapped_device *md, struct dm_io *io) mempool_free(io, md->io_pool); } -static void free_tio(struct mapped_device *md, struct dm_target_io *tio) +static void free_tio(struct dm_target_io *tio) { bio_put(&tio->clone); } @@ -1055,7 +1055,7 @@ static void clone_endio(struct bio *bio) !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)) disable_write_same(md); - free_tio(md, tio); + free_tio(tio); dec_pending(io, error); } @@ -1517,7 +1517,6 @@ static void __map_bio(struct dm_target_io *tio) { int r; sector_t sector; - struct mapped_device *md; struct bio *clone = &tio->clone; struct dm_target *ti = tio->ti; @@ -1540,9 +1539,8 @@ static void __map_bio(struct dm_target_io *tio) generic_make_request(clone); } else if (r < 0 || r == DM_MAPIO_REQUEUE) { /* error the io and bail out, or requeue it if needed */ - md = tio->io->md; dec_pending(tio->io, r); - free_tio(md, tio); + free_tio(tio); } else if (r != DM_MAPIO_SUBMITTED) { DMWARN("unimplemented target map return value: %d", r); BUG(); @@ -1663,7 +1661,7 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti, tio->len_ptr = len; r 
= clone_bio(tio, bio, sector, *len); if (r < 0) { - free_tio(ci->md, tio); + free_tio(tio); break; } __map_bio(tio); |