Age | Commit message (Collapse) | Author |
|
This patch avoids that workloads with large block sizes (megabytes)
can trigger the following call stack with the ib_srpt driver (that
driver is the only driver that chains scatterlists allocated by
sgl_alloc_order()):
BUG: Bad page state in process kworker/0:1H pfn:2423a78
page:fffffb03d08e9e00 count:-3 mapcount:0 mapping: (null) index:0x0
flags: 0x57ffffc0000000()
raw: 0057ffffc0000000 0000000000000000 0000000000000000 fffffffdffffffff
raw: dead000000000100 dead000000000200 0000000000000000 0000000000000000
page dumped because: nonzero _count
CPU: 0 PID: 733 Comm: kworker/0:1H Tainted: G I 4.15.0-rc7.bart+ #1
Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
Call Trace:
dump_stack+0x5c/0x83
bad_page+0xf5/0x10f
get_page_from_freelist+0xa46/0x11b0
__alloc_pages_nodemask+0x103/0x290
sgl_alloc_order+0x101/0x180
target_alloc_sgl+0x2c/0x40 [target_core_mod]
srpt_alloc_rw_ctxs+0x173/0x2d0 [ib_srpt]
srpt_handle_new_iu+0x61e/0x7f0 [ib_srpt]
__ib_process_cq+0x55/0xa0 [ib_core]
ib_cq_poll_work+0x1b/0x60 [ib_core]
process_one_work+0x141/0x340
worker_thread+0x47/0x3e0
kthread+0xf5/0x130
ret_from_fork+0x1f/0x30
Fixes: e80a0af4759a ("lib/scatterlist: Introduce sgl_alloc() and sgl_free()")
Reported-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull NVMe fixes for 4.16 from Christoph.
* 'nvme-4.16' of git://git.infradead.org/nvme:
nvme-pci: clean up SMBSZ bit definitions
nvme-pci: clean up CMB initialization
nvme-fc: correct hang in nvme_ns_remove()
nvme-fc: fix rogue admin cmds stalling teardown
nvmet: release a ns reference in nvmet_req_uninit if needed
nvme-fabrics: fix memory leak when parsing host ID option
nvme: fix comment typos in nvme_create_io_queues
nvme: host delete_work and reset_work on separate workqueues
nvme-pci: allocate device queues storage space at probe
nvme-pci: serialize pci resets
|
|
Define the bit positions instead of macros using the magic values,
and move the expanded helpers to calculate the size and size unit into
the implementation C file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
|
|
Refactor the call to nvme_map_cmb, and change the conditions for probing
for the CMB. First remove the version check as NVMe TPs always apply
to earlier versions of the spec as well. Second check for the whole CMBSZ
register for support of the CMB feature instead of just the size field
inside of it to simplify the code a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
|
|
When connectivity is lost to a device, the association is terminated
and the blk-mq queues are quiesced/stopped. When connectivity is
re-established, they are resumed.
If connectivity is lost for a sufficient amount of time that the
controller is then deleted, the delete path starts tearing down queues,
and eventually calling nvme_ns_remove(). It appears that pending
commands may cause blk_cleanup_queue() to never complete and the
teardown stalls.
Correct by starting the ns queues after transitioning to a DELETING
state, allowing pending commands to be flushed with io failures. Thus
the delete path is clear when reached.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
When connectivity is lost to a device, the association is terminated
and the blk-mq queues are quiesced/stopped. When connectivity is
re-established, they are resumed.
If an admin command is received while connectivity is list, the ioctl
queues the command on the admin_q and the command stalls (the thread
issuing the ioctl hangs/waits). if the connectivity is lost long
enough such that the controller is then deleted, the delete code
makes its calls to initiate the delete, which then expects the core
layer to call the transport when all references are removed and the
controller can be freed. Unfortunately, nothing in this path dequeued
the admin command, so a reference sits outstanding and things stop,
hanging the delete indefinitely.
Correct by unquiescing the admin queue in the delete association. This
means any admin command (which should only be from an ioctl) issued
after connectivity is lost will detect the controller is in a
reconnecting state and will (fast) fail the command. Thus, a pending
reference can no longer be created. Once connectivity is re-established,
a new ioctl/admin command would see proper device state and function again.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
blk_insert_cloned_request() is called in the fast path of a dm-rq driver
(e.g. blk-mq request-based DM mpath). blk_insert_cloned_request() uses
blk_mq_request_bypass_insert() to directly append the request to the
blk-mq hctx->dispatch_list of the underlying queue.
1) This way isn't efficient enough because the hctx spinlock is always
used.
2) With blk_insert_cloned_request(), we completely bypass underlying
queue's elevator and depend on the upper-level dm-rq driver's elevator
to schedule IO. But dm-rq currently can't get the underlying queue's
dispatch feedback at all. Without knowing whether a request was issued
or not (e.g. due to underlying queue being busy) the dm-rq elevator will
not be able to provide effective IO merging (as a side-effect of dm-rq
currently blindly destaging a request from its elevator only to requeue
it after a delay, which kills any opportunity for merging). This
obviously causes very bad sequential IO performance.
Fix this by updating blk_insert_cloned_request() to use
blk_mq_request_direct_issue(). blk_mq_request_direct_issue() allows a
request to be issued directly to the underlying queue and returns the
dispatch feedback (blk_status_t). If blk_mq_request_direct_issue()
returns BLK_SYS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
to _not_ destage the request. Whereby preserving the opportunity to
merge IO.
With this, request-based DM's blk-mq sequential IO performance is vastly
improved (as much as 3X in mpath/virtio-scsi testing).
Signed-off-by: Ming Lei <ming.lei@redhat.com>
[blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
they were refactored to make them less fragile and easier to read/review]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
'struct frame' uses two variables to store the sent timestamp - 'struct
timeval' and jiffies. jiffies is used to avoid discrepancies caused by
updates to system time. 'struct timeval' is deprecated because it uses
32-bit representation for seconds which will overflow in year 2038.
This patch does the following:
- Replace the use of 'struct timeval' and jiffies with ktime_t, which
is the recommended type for timestamping
- ktime_t provides both long range (like jiffies) and high resolution
(like timeval). Using ktime_get (monotonic time) instead of wall-clock
time prevents any discprepancies caused by updates to system time.
[updates by Arnd below]
The original patch from Tina never went anywhere as we discussed how
to keep the impact on performance minimal. I've started over now but
arrived at basically the same patch that she had originally, except for
an slightly improved tsince_hr() function. I'm making it more robust
against overflows, and also optimize explicitly for the common case
in which a frame is less than 4.2 seconds old, using only a 32-bit
division in that case.
This should make the new version more efficient than the old code,
since we replace the existing two 32-bit division in do_gettimeofday()
plus one multiplication with a single single 32-bit division in
tsince_hr() and drop the double bookkeeping. It's also more efficient
than the ktime_get_us() API we discussed before, since that would
also rely on multiple divisions.
Link: https://lists.linaro.org/pipermail/y2038/2015-May/000276.html
Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com>
Cc: Ed Cashin <ed.cashin@acm.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
nvmet_req_init looked up a namespace and took a reference on it (unless it
failed prior to that). If the request is uninitialized (in error cases) we
need to remove that reference in case it was taken, otherwise we leak
namespace reference when calling nvme_req_uninit.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
We use match_strdup() to get a copy of the option string for host ID string, but
we just pass it to uuid_parse() and don't store the string pointer, so we need to
kfree() the string after parsing it.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
fix comment typos in nvme_create_io_queues() like below.
_aount_ to _amount_
_an_ to _can_
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
We need to ensure that delete_work will be hosted on a different
workqueue than all the works we flush or cancel from it.
Otherwise we may hit a circular dependency warning [1].
Also, given that delete_work flushes reset_work, host reset_work
on nvme_reset_wq and delete_work on nvme_delete_wq. In addition,
fix the flushing in the individual drivers to flush nvme_delete_wq
when draining queued deletes.
[1]:
[ 178.491942] =============================================
[ 178.492718] [ INFO: possible recursive locking detected ]
[ 178.493495] 4.9.0-rc4-c844263313a8-lb #3 Tainted: G OE
[ 178.494382] ---------------------------------------------
[ 178.495160] kworker/5:1/135 is trying to acquire lock:
[ 178.495894] (
[ 178.496120] "nvme-wq"
[ 178.496471] ){++++.+}
[ 178.496599] , at:
[ 178.496921] [<ffffffffa70ac206>] flush_work+0x1a6/0x2d0
[ 178.497670]
but task is already holding lock:
[ 178.498499] (
[ 178.498724] "nvme-wq"
[ 178.499074] ){++++.+}
[ 178.499202] , at:
[ 178.499520] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
[ 178.500343]
other info that might help us debug this:
[ 178.501269] Possible unsafe locking scenario:
[ 178.502113] CPU0
[ 178.502472] ----
[ 178.502829] lock(
[ 178.503115] "nvme-wq"
[ 178.503467] );
[ 178.503716] lock(
[ 178.504001] "nvme-wq"
[ 178.504353] );
[ 178.504601]
*** DEADLOCK ***
[ 178.505441] May be due to missing lock nesting notation
[ 178.506453] 2 locks held by kworker/5:1/135:
[ 178.507068] #0:
[ 178.507330] (
[ 178.507598] "nvme-wq"
[ 178.507726] ){++++.+}
[ 178.508079] , at:
[ 178.508173] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
[ 178.509004] #1:
[ 178.509265] (
[ 178.509532] (&ctrl->delete_work)
[ 178.509795] ){+.+.+.}
[ 178.510145] , at:
[ 178.510239] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
[ 178.511070]
stack backtrace:
:
[ 178.511693] CPU: 5 PID: 135 Comm: kworker/5:1 Tainted: G OE 4.9.0-rc4-c844263313a8-lb #3
[ 178.512974] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
[ 178.514247] Workqueue: nvme-wq nvme_del_ctrl_work [nvme_tcp]
[ 178.515071] ffffc2668175bae0 ffffffffa7450823 ffffffffa88abd80 ffffffffa88abd80
[ 178.516195] ffffc2668175bb98 ffffffffa70eb012 ffffffffa8d8d90d ffff9c472e9ea700
[ 178.517318] ffff9c472e9ea700 ffff9c4700000000 ffff9c4700007200 ab83be61bec0d50e
[ 178.518443] Call Trace:
[ 178.518807] [<ffffffffa7450823>] dump_stack+0x85/0xc2
[ 178.519542] [<ffffffffa70eb012>] __lock_acquire+0x17d2/0x18f0
[ 178.520377] [<ffffffffa75839a7>] ? serial8250_console_putchar+0x27/0x30
[ 178.521330] [<ffffffffa7583980>] ? wait_for_xmitr+0xa0/0xa0
[ 178.522174] [<ffffffffa70ac1eb>] ? flush_work+0x18b/0x2d0
[ 178.522975] [<ffffffffa70eb7cb>] lock_acquire+0x11b/0x220
[ 178.523753] [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
[ 178.524535] [<ffffffffa70ac229>] flush_work+0x1c9/0x2d0
[ 178.525291] [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
[ 178.526077] [<ffffffffa70a9cf0>] ? flush_workqueue_prep_pwqs+0x220/0x220
[ 178.527040] [<ffffffffa70ae7cf>] __cancel_work_timer+0x10f/0x1d0
[ 178.527907] [<ffffffffa70fecb9>] ? vprintk_default+0x29/0x40
[ 178.528726] [<ffffffffa71cb507>] ? printk+0x48/0x50
[ 178.529434] [<ffffffffa70ae8c3>] cancel_delayed_work_sync+0x13/0x20
[ 178.530381] [<ffffffffc042100b>] nvme_stop_ctrl+0x5b/0x70 [nvme_core]
[ 178.531314] [<ffffffffc0403dcc>] nvme_del_ctrl_work+0x2c/0x50 [nvme_tcp]
[ 178.532271] [<ffffffffa70ad741>] process_one_work+0x1e1/0x6a0
[ 178.533101] [<ffffffffa70ad6c2>] ? process_one_work+0x162/0x6a0
[ 178.533954] [<ffffffffa70adc4e>] worker_thread+0x4e/0x490
[ 178.534735] [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
[ 178.535588] [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
[ 178.536441] [<ffffffffa70b48cf>] kthread+0xff/0x120
[ 178.537149] [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
[ 178.538094] [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
[ 178.538900] [<ffffffffa78e332a>] ret_from_fork+0x2a/0x40
Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
DM is no longer prone to having its request_queue be improperly
initialized.
Summary of changes:
- defer DM's blk_register_queue() from add_disk()-time until
dm_setup_md_queue() by using add_disk_no_queue_reg() in alloc_dev().
- dm_setup_md_queue() is updated to fully initialize DM's request_queue
(_after_ all table loads have occurred and the request_queue's type,
features and limits are known).
A very welcome side-effect of these changes is DM no longer needs to:
1) backfill the "mq" sysfs entry (because historically DM didn't
initialize the request_queue to use blk-mq until _after_
blk_register_queue() was called via add_disk()).
2) call elv_register_queue() to get .request_fn request-based DM
device's "iosched" exposed in syfs.
In addition, blk-mq debugfs support is now made available because
request-based DM's blk-mq request_queue is now properly initialized
before dm_setup_md_queue() calls blk_register_queue().
These changes also stave off the need to introduce new DM-specific
workarounds in block core, e.g. this proposal:
https://patchwork.kernel.org/patch/10067961/
In the end DM devices should be less unicorn in nature (relative to
initialization and availability of block core infrastructure provided by
the request_queue).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
It may cause race by setting 'nvmeq' in nvme_init_request()
because .init_request is called inside switching io scheduler, which
may happen when the NVMe device is being resetted and its nvme queues
are being freed and created. We don't have any sync between the two
pathes.
This patch changes the nvmeq allocation to occur at probe time so
there is no way we can dereference it at init_request.
[ 93.268391] kernel BUG at drivers/nvme/host/pci.c:408!
[ 93.274146] invalid opcode: 0000 [#1] SMP
[ 93.278618] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss
nfsv4 dns_resolver nfs lockd grace fscache sunrpc ipmi_ssif vfat fat
intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel
kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt
intel_cstate ipmi_si iTCO_vendor_support intel_uncore mxm_wmi mei_me
ipmi_devintf intel_rapl_perf pcspkr sg ipmi_msghandler lpc_ich dcdbas mei
shpchp acpi_power_meter wmi dm_multipath ip_tables xfs libcrc32c sd_mod
mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
fb_sys_fops ttm drm ahci libahci nvme libata crc32c_intel nvme_core tg3
megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
[ 93.349071] CPU: 5 PID: 1842 Comm: sh Not tainted 4.15.0-rc2.ming+ #4
[ 93.356256] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
[ 93.364801] task: 00000000fb8abf2a task.stack: 0000000028bd82d1
[ 93.371408] RIP: 0010:nvme_init_request+0x36/0x40 [nvme]
[ 93.377333] RSP: 0018:ffffc90002537ca8 EFLAGS: 00010246
[ 93.383161] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000008
[ 93.391122] RDX: 0000000000000000 RSI: ffff880276ae0000 RDI: ffff88047bae9008
[ 93.399084] RBP: ffff88047bae9008 R08: ffff88047bae9008 R09: 0000000009dabc00
[ 93.407045] R10: 0000000000000004 R11: 000000000000299c R12: ffff880186bc1f00
[ 93.415007] R13: ffff880276ae0000 R14: 0000000000000000 R15: 0000000000000071
[ 93.422969] FS: 00007f33cf288740(0000) GS:ffff88047ba80000(0000) knlGS:0000000000000000
[ 93.431996] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 93.438407] CR2: 00007f33cf28e000 CR3: 000000047e5bb006 CR4: 00000000001606e0
[ 93.446368] Call Trace:
[ 93.449103] blk_mq_alloc_rqs+0x231/0x2a0
[ 93.453579] blk_mq_sched_alloc_tags.isra.8+0x42/0x80
[ 93.459214] blk_mq_init_sched+0x7e/0x140
[ 93.463687] elevator_switch+0x5a/0x1f0
[ 93.467966] ? elevator_get.isra.17+0x52/0xc0
[ 93.472826] elv_iosched_store+0xde/0x150
[ 93.477299] queue_attr_store+0x4e/0x90
[ 93.481580] kernfs_fop_write+0xfa/0x180
[ 93.485958] __vfs_write+0x33/0x170
[ 93.489851] ? __inode_security_revalidate+0x4c/0x60
[ 93.495390] ? selinux_file_permission+0xda/0x130
[ 93.500641] ? _cond_resched+0x15/0x30
[ 93.504815] vfs_write+0xad/0x1a0
[ 93.508512] SyS_write+0x52/0xc0
[ 93.512113] do_syscall_64+0x61/0x1a0
[ 93.516199] entry_SYSCALL64_slow_path+0x25/0x25
[ 93.521351] RIP: 0033:0x7f33ce96aab0
[ 93.525337] RSP: 002b:00007ffe57570238 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 93.533785] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f33ce96aab0
[ 93.541746] RDX: 0000000000000006 RSI: 00007f33cf28e000 RDI: 0000000000000001
[ 93.549707] RBP: 00007f33cf28e000 R08: 000000000000000a R09: 00007f33cf288740
[ 93.557669] R10: 00007f33cf288740 R11: 0000000000000246 R12: 00007f33cec42400
[ 93.565630] R13: 0000000000000006 R14: 0000000000000001 R15: 0000000000000000
[ 93.573592] Code: 4c 8d 40 08 4c 39 c7 74 16 48 8b 00 48 8b 04 08 48 85 c0
74 16 48 89 86 78 01 00 00 31 c0 c3 8d 4a 01 48 63 c9 48 c1 e1 03 eb de <0f>
0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 85 f6 53 48 89
[ 93.594676] RIP: nvme_init_request+0x36/0x40 [nvme] RSP: ffffc90002537ca8
[ 93.602273] ---[ end trace 810dde3993e5f14e ]---
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Selecting FAULT_INJECTION causes a Kconfig warning when CONFIG_DEBUG_KERNEL
is not set:
warning: (BLK_DEV_NULL_BLK && DRM_I915_SELFTEST) selects FAULT_INJECTION which has unmet direct dependencies (DEBUG_KERNEL)
The other drivers that use FAULT_INJECTION tend to have a separate
Kconfig symbol for turning on that feature, so let's do the same
thing here. This may add a bit more complexity than we like, but
it avoids the warning and is more consistent with the rest of the
kernel.
Fixes: 93b570464cce ("null_blk: add option for managing IO timeouts")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Uses common code for determining if an error should be retried on
alternate path.
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Uses common code for determining if an error should be retried on
alternate path.
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This removes nvme multipath's specific status decoding to see if failover
is needed, using the generic blk_status_t that was decoded earlier. This
abstraction from the raw NVMe status means all status decoding exists
in one place.
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This adds more NVMe status code translations to blk_status_t values,
and captures all the current status codes NVMe multipath uses.
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use the fault injection framework to provide a way for null_blk
to configure timeouts. This only works for queue_mode 1 and 2,
since the bio mode doesn't have code for tracking timeouts.
Let's say you want to have a 10% chance of timing out every
100,000 requests, and for 5 total timeouts, you could do:
modprobe null_blk timeout="100000,10,0,5"
This is useful for adding blktests to test that IO timeouts
are handled appropriately.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is needed to ensure that we actually handle timeouts.
Without it, the queue_mode=1 path will never call blk_add_timer(),
and the queue_mode=2 path will continually just return
EH_RESET_TIMER and we never actually complete the offending request.
This was used to test the new timeout code, and the changes around
killing off REQ_ATOM_COMPLETE.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Otherwise, architectures that do negated adds of atomics (e.g. s390)
to do atomic_sub fail in closure_set_stopped.
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Bcache needs to scale the dirty data in the cache over the multiple
backing disks in order to calculate writeback rates for each.
The previous code did this by multiplying the target number of dirty
sectors by the backing device size, and expected it to fit into a
uint64_t; this blows up on relatively small backing devices.
The new approach figures out the bdev's share in 16384ths of the overall
cached data. This is chosen to cope well when bdevs drastically vary in
size and to ensure that bcache can cross the petabyte boundary for each
backing device.
This has been improved based on Tang Junhui's feedback to ensure that
every device gets a share of dirty data, no matter how small it is
compared to the total backing pool.
The existing mechanism is very limited; this is purely a bug fix to
remove limits on volume size. However, there still needs to be change
to make this "fair" over many volumes where some are idle.
Reported-by: Jack Douglas <jack@douglastechnology.co.uk>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Bcache only does recoverable I/O for read operations by calling
cached_dev_read_error(). For write opertions there is no I/O recovery for
failed requests.
But in bch_count_io_errors() no matter read or write I/Os, before errors
counter reaches io error limit, pr_err() always prints "IO error on %,
recoverying". For write requests this information is misleading, because
there is no I/O recovery at all.
This patch adds a parameter 'is_read' to bch_count_io_errors(), and only
prints "recovering" by pr_err() when the bio direction is READ.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Member devices of struct cache_set is used to reference all attached
bcache devices to this cache set. If it is treated as array of pointers,
size of devices[] is indicated by member nr_uuids of struct cache_set.
nr_uuids is calculated in drivers/md/super.c:bch_cache_set_alloc(),
bucket_bytes(c) / sizeof(struct uuid_entry)
Bucket size is determined by user space tool "make-bcache", by default it
is 1024 sectors (defined in bcache-tools/make-bcache.c:main()). So default
nr_uuids value is 4096 from the above calculation.
Every time when bcache code iterates bcache devices of a cache set, all
the 4096 pointers are checked even only 1 bcache device is attached to the
cache set, that's a wast of time and unncessary.
This patch adds a member devices_max_used to struct cache_set. Its value
is 1 + the maximum used index of devices[] in a cache set. When iterating
all valid bcache devices of a cache set, use c->devices_max_used in
for-loop may reduce a lot of useless checking.
Personally, my motivation of this patch is not for performance, I use it
in bcache debugging, which helps me to narrow down the scape to check
valid bcached devices of a cache set.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The function cached_dev_make_request() and flash_dev_make_request() call
generic_start_io_acct() with (struct bcache_device)->disk when they start a
closure. Then the function bio_complete() calls generic_end_io_acct() with
(struct search)->orig_bio->bi_disk when the closure has done.
Since the `bi_disk` is not the bcache device, the generic_end_io_acct() is
called with a wrong device queue.
It causes the "inflight" (in struct hd_struct) counter keep increasing
without decreasing.
This patch fix the problem by calling generic_end_io_acct() with
(struct bcache_device)->disk.
Signed-off-by: Zhai Zhaoxuan <kxuanobj@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
[edit by mlyle: include sched/debug.h to get __sched]
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Eliminates cases where sync can race and fail to complete / get stuck.
Removes many status flags and simplifies entering-and-exiting closure
sleeping behaviors.
[mlyle: fixed conflicts due to changed return behavior in mainline.
extended commit comment, and squashed down two commits that were mostly
contradictory to get to this state. Changed __set_current_state to
set_current_state per Jens review comment]
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If the control system would wait for at least half a second, and there's
been no reqs hitting the backing disk for awhile: use an alternate mode
where we have at most one contiguous set of writebacks in flight at a
time. (But don't otherwise delay). If front-end IO appears, it will
still be quick, as it will only have to contend with one real operation
in flight. But otherwise, we'll be sending data to the backing disk as
quickly as it can accept it (with one op at a time).
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Writeback keys are presently iterated and dispatched for writeback in
order of the logical block address on the backing device. Multiple may
be, in parallel, read from the cache device and then written back
(especially when there are contiguous I/O).
However-- there was no guarantee with the existing code that the writes
would be issued in LBA order, as the reads from the cache device are
often re-ordered. In turn, when writing back quickly, the backing disk
often has to seek backwards-- this slows writeback and increases
utilization.
This patch introduces an ordering mechanism that guarantees that the
original order of issue is maintained for the write portion of the I/O.
Performance for writeback is significantly improved when there are
multiple contiguous keys or high writeback rates.
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Tested-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
in bch_debug_init(), ret is always 0, and the return value is useless,
change it to return 0 if be success after calling debugfs_create_dir(),
else return a non-zero value.
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In such scenario that there are some flash only volumes
, and some cached devices, when many tasks request these devices in
writeback mode, the write IOs may fall to the same bucket as bellow:
| cached data | flash data | cached data | cached data| flash data|
then after writeback of these cached devices, the bucket would
be like bellow bucket:
| free | flash data | free | free | flash data |
So, there are many free space in this bucket, but since data of flash
only volumes still exists, so this bucket cannot be reclaimable,
which would cause waste of bucket space.
In this patch, we segregate flash only volume write streams from
cached devices, so data from flash only volumes and cached devices
can store in different buckets.
Compare to v1 patch, this patch do not add a additionally open bucket
list, and it is try best to segregate flash only volume write streams
from cached devices, sectors of flash only volumes may still be mixed
with dirty sectors of cached device, but the number is very small.
[mlyle: fixed commit log formatting, permissions, line endings]
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Fix ptr_ret.cocci warnings:
drivers/md/bcache/btree.c:1800:1-3: WARNING: PTR_ERR_OR_ZERO can be used
Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR
Generated by: scripts/coccinelle/api/ptr_ret.cocci
Signed-off-by: Vasyl Gomonovych <gomonovych@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently, when a cached device detaching from cache, writeback thread is
not stopped, and writeback_rate_update work is not canceled. For example,
after the following command:
echo 1 >/sys/block/sdb/bcache/detach
you can still see the writeback thread. Then you attach the device to the
cache again, bcache will create another writeback thread, for example,
after below command:
echo ba0fb5cd-658a-4533-9806-6ce166d883b9 > /sys/block/sdb/bcache/attach
then you will see 2 writeback threads.
This patch stops writeback thread and cancels writeback_rate_update work
when cached device detaching from cache.
Compare with patch v1, this v2 patch moves code down into the register
lock for safety in case of any future changes as Coly and Mike suggested.
[edit by mlyle: commit log spelling/formatting]
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The read request might meet error when searching the btree, but the error
was not handled in cache_lookup(), and this kind of metadata failure will
not go into cached_dev_read_error(), finally, the upper layer will receive
bi_status=0. In this patch we judge the metadata error by the return
value of bch_btree_map_keys(), there are two potential paths give rise to
the error:
1. Because the btree is not totally cached in memery, we maybe get error
when read btree node from cache device (see bch_btree_node_get()), the
likely errno is -EIO, -ENOMEM
2. When read miss happens, bch_btree_insert_check_key() will be called to
insert a "replace_key" to btree(see cached_dev_cache_miss(), just for
doing preparatory work before insert the missed data to cache device),
a failure can also happen in this situation, the likely errno is
-ENOMEM
bch_btree_map_keys() will return MAP_DONE in normal scenario, but we will
get either -EIO or -ENOMEM in above two cases. if this happened, we should
NOT recover data from backing device (when cache device is dirty) because
we don't know whether bkeys the read request covered are all clean. And
after that happened, s->iop.status is still its initially value(0) before
we submit s->bio.bio, we set it to BLK_STS_IOERR, so it can go into
cached_dev_read_error(), and finally it can be passed to upper layer, or
recovered by reread from backing device.
[edit by mlyle: patch formatting, word-wrap, comment spelling,
commit log format]
Signed-off-by: Hua Rui <huarui.dev@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is a problem when another module (e.g. nvmet) takes a reference on
the nvme block device and the physical nvme drive is removed. In that
case nvme_free_ctrl() will not be called and the controller state will be
"deleting" or "dead" unless nvmet module releases the block device.
Later on, the same nvme drive probes back and nvme_init_subsystem() will
be called and fail due to duplicate subnqn (if the nvme device doesn't
support subsystem with multiple controllers). This will cause a probe
failure. This commit changes the check of multiple controllers support
at nvme_init_subsystem() by not counting all the controllers at "dead" or
"deleting" state (this is safe because controllers at this state will
never be active again).
Fixes: ab9e00cc72fa ("nvme: track subsystems")
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The block device is backed by the transport so we must ensure that the
transport driver will not be removed until all references are released.
Otherwise, we might end up referencing freed memory.
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
When the io queues setup or tagset allocation failed, ctrl.tagset is
NULL. But the scan work will still be queued and executed, then panic
comes up due to NULL pointer reference of ctrl.tagset.
To fix this, add a new ctrl state NVME_CTRL_ADMIN_ONLY to inidcate only
admin queue is live. When non io queues or tagset allocation failed, ctrl
enters into this state, scan work will not be started. But async event
work and nvme dev ioctl will be still available. This will be helpful to
do further investigation and recovery.
Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
When an NVMe controller reports RTD3 Entry Latency larger than the value
of shutdown_timeout module parameter, we update the shutdown_timeout
accordingly to honor RTD3 Entry Latency. Use an informational debug level
instead of a warning level for it.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Make it symmetric to nvmet_alloc_ctrl().
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Remove the allocated id on error.
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The local variable __size__ will be set a bit later in a for-loop.
Remove the explicit initialization at the beginning of this function.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
NVMe transport driver module unload may (and usually does) trigger
iteration over the active controllers and delete them all (sometimes
under a mutex). However, a controller can be created concurrently with
module unload which can lead to leakage of resources (most important char
device node leakage) in case the controller creation occured after the
unload delete and drain sequence. To protect against this, we take a
module reference to guarantee that the nvme transport driver is not
unloaded while creating a controller.
Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The current fc transport add_port routine validates that there is a
matching port to the target port config. It then takes a reference
on the targetport. The del_port removes the reference.
Unfortunately, if the LLDD undergoes a hw reset or driver unload and
wants to unreg the targetport, due to the reference, the targetport
effectively can't be removed. It requires the admin to remove the
port from the nvmet config first, which calls the del_port.
Note: it appears nvmetcli clear skips over the del_port call (I'm
not attempting to change that).
There's no real reason to take the reference. With FC, there is nothing
to enable or disable as the presence of the FC targetport implicitly
means its enabled, and removal of the targtport means its disabled.
Change add_port to simply validate and change remove_port to a noop.
No references are taken on the targetport.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The split between what the host accesses on its flows vs what the
target side accesses was flawed. Abort handling didn't properly
clear initiator vs target structure cross-reference and locks
weren't used for synchronization. Thus, there were issues of
freeing structures too soon and access after free.
A couple of these existed pre the IN_ISR mods, but when the
target upcalls were converted to work items, thus adding delays
between the 2 sides of accesses, the problems became pronounced.
Resolve by:
- tracking io state mainly in the tgt-side io structure.
- make the tgt-side io structure released by reference not by
code flow.
- when changing initiator structures, use locks for
synchronization
- aborts are clearly tracked for which side saw the abort, and
after seeing the abort, cross-references are cleared under lock.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The existing fcloop driver expects the target side upcalls to
the transport to context switch, thus the calls into the nvmet layer
are not done in the calling context of the host/initiator down calls.
The xxx_IN_ISR feature flags are used to select this logic.
The xxx_IN_ISR feature flags should go away in the nvmet_fc transport
as no other lldd utilizes them. Both Broadcom and Cavium lldds have their
own non-ISR deferred handlers thus the nvmet calls can be made directly.
This patch converts the paths that make the target upcalls (command
receive, abort receive) such that they schedule a work item rather
than expecting the transport to schedule the work item.
The patch also cleans up the following:
- The completion path from target to host scheduled a host work
element called "work". Rename it "tio_done_work" for code clarity.
- The abort io path called a iniwork item to call the host side
io done. This is no longer needed as the abort routine can make
the same call.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The current fcloop driver gets its lport structure from the private
area co-allocated with the fc_localport. All is fine except the
teardown path, which wants to wait on the completion, which is marked
complete by the delete_localport callback performed after
unregister_localport. The issue is, the nvme_fc transport frees the
localport structure immediately after delete_localport is called,
meaning the original routine is trying to wait on a complete that
was just freed.
Change such that a lport struct is allocated coincident with the
addition and registration of a localport. The private area of the
localport now contains just a backpointer to the real lport struct.
Now, the completion can be waited for, and after completing, the
new structure can be kfree'd.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
A test case revealed a race condition of an i/o completing on a thread
parallel to the delete_association generating the aborts for the
outstanding ios on the controller. The i/o completion was freeing the
target fcloop context, thus the abort task referenced the just-freed
memory.
Correct by clearing the target/initiator cross pointers in the io
completion and abort tasks before calling the callbacks. On aborts
that detect already finished io's, ensure the complete context is
called.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|