Age / Commit message / Author
2020-09-24bdi: invert BDI_CAP_NO_ACCT_WBChristoph Hellwig
Replace BDI_CAP_NO_ACCT_WB with a positive BDI_CAP_WRITEBACK_ACCT to make the checks more obvious. Also remove the pointless bdi_cap_account_writeback wrapper that just obfuscates the check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flagChristoph Hellwig
BDI_CAP_STABLE_WRITES is one of the few bits of information in the backing_dev_info shared between the block drivers and the writeback code. To help untangle the dependency, replace it with a queue flag and a superblock flag derived from it. This also helps with the case of e.g. a file system requiring stable writes due to its own checksumming, but not forcing it on other users of the block device like the swap code. One downside is that we can't support the stable_pages_required bdi attribute in sysfs anymore. It is replaced with a queue attribute which is also writable for easier testing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24mm: use SWP_SYNCHRONOUS_IO more intelligentlyChristoph Hellwig
There is no point in trying to call bdev_read_page if SWP_SYNCHRONOUS_IO is not set, as the device won't support it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24bdi: remove BDI_CAP_SYNCHRONOUS_IOChristoph Hellwig
BDI_CAP_SYNCHRONOUS_IO is only checked in the swap code, and used to decide if ->rw_page can be used on a block device. Just check for the method instead. The only complication is that zram needs a second set of block_device_operations as it can switch between modes that actually support ->rw_page and those that don't. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24bdi: remove BDI_CAP_CGROUP_WRITEBACKChristoph Hellwig
Just checking SB_I_CGROUPWB for cgroup writeback support is enough. Either the file system allocates its own bdi (e.g. btrfs), in which case it is known to support cgroup writeback, or the bdi comes from the block layer, which always supports cgroup writeback. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24block: lift setting the readahead size into the block layerChristoph Hellwig
Drivers shouldn't really mess with the readahead size, as that is a VM concept. Instead set it based on the optimal I/O size by lifting the algorithm from the md driver when registering the disk. Also set bdi->io_pages there as well by applying the same scheme based on max_sectors. To ensure the limits work well for stacking drivers a new helper is added to update the readahead limits from the block limits, which is also called from disk_stack_limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
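A minimal user-space sketch of deriving both values from the block limits, as the message above describes. The struct and function names are hypothetical, and the concrete scheme (twice the optimal I/O size, with an assumed 128 KiB floor and 4 KiB pages) is an assumption for illustration rather than something the commit message states:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE        4096u   /* assumed page size */
#define SKETCH_SECTOR_SIZE      512u
/* Assumed default readahead floor of 128 KiB, expressed in pages. */
#define SKETCH_DEFAULT_RA_PAGES (128u * 1024u / SKETCH_PAGE_SIZE)

/* Hypothetical stand-ins for the queue limits and the bdi fields. */
struct limits_sketch {
	uint32_t io_opt;       /* optimal I/O size in bytes, 0 if unset */
	uint32_t max_sectors;  /* largest request size in 512-byte sectors */
};

struct bdi_sketch {
	unsigned long ra_pages;
	unsigned long io_pages;
};

/* Derive the readahead window and io_pages from the block limits. */
static void update_readahead_sketch(const struct limits_sketch *lim,
				    struct bdi_sketch *bdi)
{
	/* Assumed scheme: read ahead two optimal-I/O units, in pages. */
	uint64_t from_io_opt = (uint64_t)lim->io_opt * 2 / SKETCH_PAGE_SIZE;

	assert(lim->max_sectors != 0);
	bdi->ra_pages = from_io_opt > SKETCH_DEFAULT_RA_PAGES ?
			(unsigned long)from_io_opt : SKETCH_DEFAULT_RA_PAGES;
	/* Convert max_sectors (512-byte units) to pages. */
	bdi->io_pages = lim->max_sectors / (SKETCH_PAGE_SIZE / SKETCH_SECTOR_SIZE);
}
```

Under these assumptions, a device advertising a 1280K optimal I/O size and 2560 max_sectors would end up with 640 readahead pages and 320 io_pages, while a device with no io_opt keeps the default floor.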
2020-09-24md: update the optimal I/O size on reshapeChristoph Hellwig
The raid5 and raid10 drivers currently update the read-ahead size, but not the optimal I/O size on reshape. To prepare for deriving the read-ahead size from the optimal I/O size make sure it is updated as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24bdi: initialize ->ra_pages and ->io_pages in bdi_initChristoph Hellwig
Set up a readahead size by default, as very few users have a good reason to change it. This means coda, ecryptfs, and orangefs now set up the values while they were previously missing them, while ubifs, mtd and vboxsf manually set it to 0 to avoid readahead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24aoe: set an optimal I/O sizeChristoph Hellwig
aoe forces a larger readahead size, but any reason to do larger I/O is not limited to readahead. Also set the optimal I/O size, and remove the local constants in favor of just using SZ_2G. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24bcache: inherit the optimal I/O sizeChristoph Hellwig
Inherit the optimal I/O size setting just like the readahead window, as any reason to do larger I/O does not apply to just readahead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24drbd: remove dead code in device_to_statisticsChristoph Hellwig
Ever since the switch to blk-mq, a lower device not used for VM writeback will not be marked congested, so the check will never trigger. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24fs: remove the unused SB_I_MULTIROOT flagChristoph Hellwig
The last user of SB_I_MULTIROOT disappeared with commit f2aedb713c28 ("NFS: Add fs_context support.") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23block: mark blkdev_get staticChristoph Hellwig
There are no users outside the core block code left now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23PM: mm: cleanup swsusp_swap_checkChristoph Hellwig
Use blkdev_get_by_dev instead of bdget + blkdev_get. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23mm: split swap_type_ofChristoph Hellwig
swap_type_of is used for two entirely different purposes: (1) check what swap type a given device/offset corresponds to (2) find the first available swap device that can be written to Mixing both in a single function creates an unreadable mess. Create two separate functions instead, and switch both to pass a dev_t instead of a struct block_device to further simplify the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23PM: rewrite is_hibernate_resume_dev to not require an inodeChristoph Hellwig
Just check the dev_t to help simplify the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23mm: cleanup claim_swapfileChristoph Hellwig
Use blkdev_get_by_dev instead of bdgrab + blkdev_get. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23ocfs2: cleanup o2hb_region_dev_storeChristoph Hellwig
Use blkdev_get_by_dev instead of igrab (aka open coded bdgrab) + blkdev_get. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23dasd: cleanup dasd_scan_partitionsChristoph Hellwig
Use blkdev_get_by_dev instead of bdget_disk + blkdev_get. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Stefan Haberland <sth@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23raw: don't keep unopened block device aroundChristoph Hellwig
Turn binding into a normal dev_t as the struct block_device doesn't buy us anything and use blkdev_get_by_dev to actually open it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23zram: cleanup backing_dev_storeChristoph Hellwig
Use blkdev_get_by_dev instead of bdgrab + blkdev_get. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23pktcdvd: use blkdev_get_by_dev instead of open coding itChristoph Hellwig
Replace bdget + blkdev_get by blkdev_get_by_dev. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23pktcdvd: remove the if 0'ed pkt_start_recovery functionChristoph Hellwig
Remove code which has been dead since the initial commit. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23block: cleanup blkdev_bszsetChristoph Hellwig
Use blkdev_get_by_dev instead of bdgrab + blkdev_get. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23block: cleanup partition scanning in register_diskChristoph Hellwig
Use blkdev_get_by_dev instead of open coding it using bdget_disk + blkdev_get, and split the code to read the partition table into a separate helper to make it a little more obvious. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23block: move the NEED_PART_SCAN flag to struct gendiskChristoph Hellwig
We can only scan for partitions on the whole disk, so move the flag from struct block_device to struct gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23block: allow 'chunk_sectors' to be non-power-of-2Mike Snitzer
It is possible, albeit more unlikely, for a block device to have a non power-of-2 for chunk_sectors (e.g. 10+2 RAID6 with 128K chunk_sectors, which results in a full-stripe size of 1280K. This causes the RAID6's io_opt to be advertised as 1280K, and a stacked device _could_ then be made to use a blocksize, aka chunk_sectors, that matches non power-of-2 io_opt of underlying RAID6 -- resulting in stacked device's chunk_sectors being a non power-of-2). Update blk_queue_chunk_sectors() and blk_max_size_offset() to accommodate drivers that need a non power-of-2 chunk_sectors. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
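As a rough illustration of why a non-power-of-2 chunk_sectors needs different arithmetic, the sketch below computes how many sectors remain before the next chunk boundary. The function names are hypothetical and this is a user-space approximation, not the kernel's blk_max_size_offset(): the bitmask shortcut is only valid for power-of-2 chunk sizes, so other sizes have to fall back to a modulo.

```c
#include <assert.h>
#include <stdint.h>

/* True when n is a non-zero power of two. */
static int is_power_of_2_sketch(uint32_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Number of sectors from 'offset' up to the next chunk boundary.
 * Power-of-2 chunk sizes can use a cheap mask; anything else
 * (e.g. 2560 sectors for a 1280K full stripe) needs a real modulo.
 */
static uint32_t sectors_to_chunk_boundary(uint64_t offset, uint32_t chunk_sectors)
{
	uint32_t in_chunk;

	assert(chunk_sectors != 0);
	if (is_power_of_2_sketch(chunk_sectors))
		in_chunk = (uint32_t)(offset & (chunk_sectors - 1));
	else
		in_chunk = (uint32_t)(offset % chunk_sectors);
	return chunk_sectors - in_chunk;
}
```

With the 1280K example from the message (2560 sectors per chunk), an I/O starting at sector 3000 sits 440 sectors into its chunk, so 2120 sectors remain before the boundary. Applying the mask to that chunk size would give a wrong answer.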
2020-09-23block: use lcm_not_zero() when stacking chunk_sectorsMike Snitzer
Like 'io_opt', blk_stack_limits() should stack 'chunk_sectors' using lcm_not_zero() rather than min_not_zero() -- otherwise the final 'chunk_sectors' could result in sub-optimal alignment of IO to component devices in the IO stack. Also, if 'chunk_sectors' isn't a multiple of 'physical_block_size' then it is a bug in the driver and the device should be flagged as 'misaligned'. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
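The difference between the two stacking rules can be sketched in user-space C. The helpers below are illustrative re-implementations with hypothetical names, not the kernel's own lcm_not_zero() and min_not_zero(), and the misalignment check assumes chunk_sectors is counted in 512-byte sectors:

```c
#include <assert.h>
#include <stdint.h>

/* Greatest common divisor (Euclid); gcd(a, 0) == a. */
static uint64_t gcd_sketch(uint64_t a, uint64_t b)
{
	while (b != 0) {
		uint64_t t = a % b;
		a = b;
		b = t;
	}
	return a;
}

/* Least common multiple, treating 0 as "not set" (returns the other value). */
static uint64_t lcm_not_zero_sketch(uint64_t a, uint64_t b)
{
	if (a == 0 || b == 0)
		return a ? a : b;
	return a / gcd_sketch(a, b) * b;
}

/* Smaller of the two values, again treating 0 as "not set". */
static uint64_t min_not_zero_sketch(uint64_t a, uint64_t b)
{
	if (a == 0 || b == 0)
		return a ? a : b;
	return a < b ? a : b;
}

/* A chunk size that is not a multiple of the physical block size is misaligned. */
static int chunk_is_misaligned(uint64_t chunk_sectors, uint64_t physical_block_size)
{
	assert(physical_block_size >= 512);
	return chunk_sectors != 0 &&
	       ((chunk_sectors << 9) % physical_block_size) != 0;
}
```

For two component devices with chunk sizes of 256 and 384 sectors, the minimum (256) does not line up with the 384-sector device's boundaries, whereas the least common multiple (768) is a whole number of chunks on both.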
2020-09-23block: fix bmd->is_null_mapped initializationChristoph Hellwig
bmd is allocated using kmalloc in bio_alloc_map_data, so make sure is_null_mapped is properly initialized to false for the !null_mapped case. Fixes: f3256075ba49 ("block: remove the BIO_NULL_MAPPED flag") Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
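The class of bug fixed here can be shown with a small user-space analogue, using malloc in place of kmalloc. The struct layout and function names are hypothetical. The point is only that a non-zeroing allocator leaves every field indeterminate, so each flag must be set explicitly on every path:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-bio map data. */
struct map_data_sketch {
	bool is_our_pages;
	bool is_null_mapped;
};

/*
 * malloc(), like kmalloc(), does not zero the memory it returns, so
 * both flags are given a defined value before the struct is used.
 */
static struct map_data_sketch *alloc_map_data_sketch(bool our_pages)
{
	struct map_data_sketch *bmd = malloc(sizeof(*bmd));

	if (!bmd)
		return NULL;
	bmd->is_our_pages = our_pages;
	bmd->is_null_mapped = false;	/* explicit default for the !null_mapped case */
	return bmd;
}

/* Only the null-mapped path flips the flag to true. */
static void mark_null_mapped(struct map_data_sketch *bmd)
{
	assert(bmd != NULL);
	bmd->is_null_mapped = true;
}
```

An alternative would be a zeroing allocation, at the cost of clearing memory that is about to be overwritten anyway.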
2020-09-23block: drop double zeroingJulia Lawall
sg_init_table zeroes its first argument, so the allocation of that argument doesn't have to. The semantic patch that makes this change is as follows (http://coccinelle.lip6.fr/):

    // <smpl>
    @@
    expression x;
    @@

    x =
    - kzalloc
    + kmalloc
     (...)
    ...
    sg_init_table(x,...)
    // </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14blk-throttle: Avoid checking bps/iops limitation if bps or iops is unlimitedBaolin Wang
There is no need to check the bps or iops limit if bps or iops is unlimited. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14blk-throttle: Avoid calculating bps/iops limitation repeatedlyBaolin Wang
The tg_may_dispatch() will call tg_with_in_bps_limit() and tg_with_in_iops_limit() to check if we can dispatch a bio or not, which will calculate bps/iops limitation multiple times. But tg_may_dispatch() is always called under queue lock, which means the bps/iops limitation will not change in tg_may_dispatch(). So we can calculate the bps/iops limitation only once, and pass them to tg_with_in_bps_limit() and tg_with_in_iops_limit() to avoid calculating bps/iops limitation repeatedly. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14blk-throttle: Define readable macros instead of static variablesBaolin Wang
'throtl_grp_quantum' and 'throtl_quantum' are both read-only variables, so it is better to use readable macros instead of static variables, which can also save some space in the .bss area. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14blk-throttle: Use readable READ/WRITE macrosBaolin Wang
Use readable READ/WRITE macros instead of magic numbers. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14blk-throttle: Fix some comments' typosBaolin Wang
Fix some comments' typos. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14iocost: fix infinite loop bug in adjust_inuse_and_calc_cost()Tejun Heo
adjust_inuse_and_calc_cost() is responsible for reducing the amount of donated weights dynamically in period as the budget runs low. Because we don't want to do full donation calculation in period, we keep latching up inuse by INUSE_ADJ_STEP_PCT of the active weight of the cgroup until the resulting hweight_inuse is satisfactory. Unfortunately, the adj_step calculation was reading the active weight before acquiring ioc->lock. Because the current thread could have lost the race to activate the iocg to another thread before entering this function, it may read the active weight as zero before acquiring ioc->lock. When this happens, the adj_step is calculated as zero and the incremental adjustment loop becomes an infinite one. Fix it by fetching the active weight after acquiring ioc->lock. Fixes: b0853ab4a238 ("blk-iocost: revamp in-period donation snapbacks") Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-11blk-iocost: fix divide-by-zero in transfer_surpluses()Tejun Heo
Conceptually, root_iocg->hweight_donating must be less than WEIGHT_ONE but all hweight calculations round up and thus it may end up >= WEIGHT_ONE triggering divide-by-zero and other issues. Bound the value to avoid surprises. Fixes: e08d02aa5fc9 ("blk-iocost: implement Andy's method for donation weight updates") Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
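A hedged sketch of the kind of bound described above. The fixed-point scale, the function names and the shape of the consumer (dividing by the non-donating share) are all assumptions chosen to show how a value that rounds up to the full scale can produce a zero divisor. They are not taken from the blk-iocost source:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed-point representation of "100%". */
#define WEIGHT_ONE_SKETCH (1u << 16)

/* Keep the donating weight strictly below WEIGHT_ONE_SKETCH. */
static uint32_t bound_hweight_donating(uint32_t hweight_donating)
{
	if (hweight_donating > WEIGHT_ONE_SKETCH - 1)
		return WEIGHT_ONE_SKETCH - 1;
	return hweight_donating;
}

/*
 * Hypothetical consumer: scale a value by the share that is NOT being
 * donated. Without the bound above, a rounded-up hweight_donating equal
 * to WEIGHT_ONE_SKETCH would make 'remaining' zero.
 */
static uint64_t scale_by_remaining_share(uint64_t value, uint32_t hweight_donating)
{
	uint32_t remaining = WEIGHT_ONE_SKETCH - bound_hweight_donating(hweight_donating);

	assert(remaining != 0);
	return value * WEIGHT_ONE_SKETCH / remaining;
}
```

Clamping at the point where the value is produced, rather than at each division, keeps every later consumer safe without further checks.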
2020-09-11bcache: use part_[begin|end]_io_acct instead of disk_[begin|end]_io_acctSong Liu
This enables proper statistics in /proc/diskstats for bcache partitions. Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Coly Li <colyli@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-11md: use part_[begin|end]_io_acct instead of disk_[begin|end]_io_acctSong Liu
This enables proper statistics in /proc/diskstats for md partitions. Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-11block: introduce part_[begin|end]_io_acctSong Liu
These functions can be used to enable iostat for partitions on devices like md, bcache. Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-11blk-mq: always allow reserved allocation in hctx_may_queueMing Lei
NVMe shares a tagset between the fabric queue and admin queue, or between connect_q and the NS queue, so hctx_may_queue() can be called to allocate requests for these queues. Tags can be reserved in these tagsets. Before error recovery, there are often lots of in-flight requests which can't be completed, and a new reserved request may be needed in the error recovery path. However, hctx_may_queue() can always return false because there are too many in-flight requests which can't be completed during error handling. Finally, nothing can proceed. Fix this issue by always allowing reserved tag allocation in hctx_may_queue(). This is reasonable because reserved tags are supposed to always be available. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: David Milburn <dmilburn@redhat.com> Cc: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
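The idea can be sketched as a fairness check that reserved allocations bypass. Everything below is hypothetical: the struct, the function names, and the fair-share rule (an even split of the tag depth with an assumed minimum of 4) are stand-ins for illustration and are not the blk-mq implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical view of a tag set shared by several queues. */
struct tagset_sketch {
	unsigned int depth;	/* total number of tags */
	unsigned int users;	/* active queues sharing the tag set */
};

/* Assumed fairness rule: each user gets an even share, but at least 4 tags. */
static bool fair_share_allows(const struct tagset_sketch *set, unsigned int in_flight)
{
	unsigned int share;

	if (set->users <= 1)
		return true;
	share = (set->depth + set->users - 1) / set->users;
	if (share < 4)
		share = 4;
	return in_flight < share;
}

/*
 * Reserved allocations skip the fairness check entirely, so an error
 * recovery path can still get a tag while ordinary requests are stuck.
 */
static bool may_allocate_tag(const struct tagset_sketch *set,
			     unsigned int in_flight, bool reserved)
{
	assert(set->depth != 0);
	if (reserved)
		return true;
	return fair_share_allows(set, in_flight);
}
```

In this sketch, a queue that has already used its 16-tag share of a 64-tag set shared four ways is refused a normal tag but can still obtain a reserved one.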
2020-09-11block: remove duplicate include statement in scsi_ioctl.cTian Tao
scsi/sg.h is included more than once. Remove the one that isn't necessary. Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-10blkcg: add plugging support for punt bioXianting Tian
The test and the explanation of the patch are as below.

Before the test we added more debug code in blkg_async_bio_workfn():

    int count = 0;
    if (bios.head && bios.head->bi_next) {
            need_plug = true;
            blk_start_plug(&plug);
    }
    while ((bio = bio_list_pop(&bios))) {
            /* io_punt is a sysctl user interface to control the print */
            if (io_punt) {
                    printk("[%s:%d] bio start,size:%llu,%d count=%d plug?%d\n",
                           current->comm, current->pid, bio->bi_iter.bi_sector,
                           (bio->bi_iter.bi_size)>>9, count++, need_plug);
            }
            submit_bio(bio);
    }
    if (need_plug)
            blk_finish_plug(&plug);

Steps that need to be set to trigger *PUNT* io before testing:

    mount -t btrfs -o compress=lzo /dev/sda6 /btrfs
    mount -t cgroup2 nodev /cgroup2
    mkdir /cgroup2/cg3
    echo "+io" > /cgroup2/cgroup.subtree_control
    echo "8:0 wbps=1048576000" > /cgroup2/cg3/io.max #1000M/s
    echo $$ > /cgroup2/cg3/cgroup.procs

Then use the dd command to test btrfs PUNT io in the current shell:

    dd if=/dev/zero of=/btrfs/file bs=64K count=100000

Test hardware environment as below:

    [root@localhost btrfs]# lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                32
    On-line CPU(s) list:   0-31
    Thread(s) per core:    2
    Core(s) per socket:    8
    Socket(s):             2
    NUMA node(s):          2
    Vendor ID:             GenuineIntel

With the above debug code, test command and test environment, I did the tests under 3 different system loads, which are triggered by stress:

1, Run 64 threads by command "stress -c 64 &"

    [53615.975974] [kworker/u66:18:1490] bio start,size:45583056,8 count=0 plug?1
    [53615.975980] [kworker/u66:18:1490] bio start,size:45583064,8 count=1 plug?1
    [53615.975984] [kworker/u66:18:1490] bio start,size:45583072,8 count=2 plug?1
    [53615.975987] [kworker/u66:18:1490] bio start,size:45583080,8 count=3 plug?1
    [53615.975990] [kworker/u66:18:1490] bio start,size:45583088,8 count=4 plug?1
    [53615.975993] [kworker/u66:18:1490] bio start,size:45583096,8 count=5 plug?1
    ... ...
    [53615.977041] [kworker/u66:18:1490] bio start,size:45585480,8 count=303 plug?1
    [53615.977044] [kworker/u66:18:1490] bio start,size:45585488,8 count=304 plug?1
    [53615.977047] [kworker/u66:18:1490] bio start,size:45585496,8 count=305 plug?1
    [53615.977050] [kworker/u66:18:1490] bio start,size:45585504,8 count=306 plug?1
    [53615.977053] [kworker/u66:18:1490] bio start,size:45585512,8 count=307 plug?1
    [53615.977056] [kworker/u66:18:1490] bio start,size:45585520,8 count=308 plug?1
    [53615.977058] [kworker/u66:18:1490] bio start,size:45585528,8 count=309 plug?1

2, Run 32 threads by command "stress -c 32 &"

    [50586.290521] [kworker/u66:6:32351] bio start,size:45806496,8 count=0 plug?1
    [50586.290526] [kworker/u66:6:32351] bio start,size:45806504,8 count=1 plug?1
    [50586.290529] [kworker/u66:6:32351] bio start,size:45806512,8 count=2 plug?1
    [50586.290531] [kworker/u66:6:32351] bio start,size:45806520,8 count=3 plug?1
    [50586.290533] [kworker/u66:6:32351] bio start,size:45806528,8 count=4 plug?1
    [50586.290535] [kworker/u66:6:32351] bio start,size:45806536,8 count=5 plug?1
    ... ...
    [50586.299640] [kworker/u66:5:32350] bio start,size:45808576,8 count=252 plug?1
    [50586.299643] [kworker/u66:5:32350] bio start,size:45808584,8 count=253 plug?1
    [50586.299646] [kworker/u66:5:32350] bio start,size:45808592,8 count=254 plug?1
    [50586.299649] [kworker/u66:5:32350] bio start,size:45808600,8 count=255 plug?1
    [50586.299652] [kworker/u66:5:32350] bio start,size:45808608,8 count=256 plug?1
    [50586.299663] [kworker/u66:5:32350] bio start,size:45808616,8 count=257 plug?1
    [50586.299665] [kworker/u66:5:32350] bio start,size:45808624,8 count=258 plug?1
    [50586.299668] [kworker/u66:5:32350] bio start,size:45808632,8 count=259 plug?1

3, Don't run thread by stress

    [50861.355246] [kworker/u66:19:32376] bio start,size:13544504,8 count=0 plug?0
    [50861.355288] [kworker/u66:19:32376] bio start,size:13544512,8 count=0 plug?0
    [50861.355322] [kworker/u66:19:32376] bio start,size:13544520,8 count=0 plug?0
    [50861.355353] [kworker/u66:19:32376] bio start,size:13544528,8 count=0 plug?0
    [50861.355392] [kworker/u66:19:32376] bio start,size:13544536,8 count=0 plug?0
    [50861.355431] [kworker/u66:19:32376] bio start,size:13544544,8 count=0 plug?0
    [50861.355468] [kworker/u66:19:32376] bio start,size:13544552,8 count=0 plug?0
    [50861.355499] [kworker/u66:19:32376] bio start,size:13544560,8 count=0 plug?0
    [50861.355532] [kworker/u66:19:32376] bio start,size:13544568,8 count=0 plug?0
    [50861.355575] [kworker/u66:19:32376] bio start,size:13544576,8 count=0 plug?0
    [50861.355618] [kworker/u66:19:32376] bio start,size:13544584,8 count=0 plug?0
    [50861.355659] [kworker/u66:19:32376] bio start,size:13544592,8 count=0 plug?0
    [50861.355740] [kworker/u66:0:32346] bio start,size:13544600,8 count=0 plug?1
    [50861.355748] [kworker/u66:0:32346] bio start,size:13544608,8 count=1 plug?1
    [50861.355962] [kworker/u66:2:32347] bio start,size:13544616,8 count=0 plug?0
    [50861.356272] [kworker/u66:7:31962] bio start,size:13544624,8 count=0 plug?0
    [50861.356446] [kworker/u66:7:31962] bio start,size:13544632,8 count=0 plug?0
    [50861.356567] [kworker/u66:7:31962] bio start,size:13544640,8 count=0 plug?0
    [50861.356707] [kworker/u66:19:32376] bio start,size:13544648,8 count=0 plug?0
    [50861.356748] [kworker/u66:15:32355] bio start,size:13544656,8 count=0 plug?0
    [50861.356825] [kworker/u66:17:31970] bio start,size:13544664,8 count=0 plug?0

Analysis of the above 3 test results with different system loads:

From the above tests, we can see that more and more contiguous bios can be plugged as system load increases. When running "stress -c 64 &", 310 contiguous bios are plugged; when running "stress -c 32 &", 260 contiguous bios are plugged; when stress is not running, at most 2 contiguous bios are plugged, and in most cases bio_list only contains a single bio.

How to explain the above phenomenon:

We know that, in submit_bio(), if the bio is a REQ_CGROUP_PUNT io, it will queue a work item to the workqueue blkcg_punt_bio_wq. But when the workqueue is scheduled depends on the system load. When system load is low, the workqueue will be scheduled quickly, and the bio in bio_list will be processed quickly in blkg_async_bio_workfn(), so there is less chance that the same io submit thread can add multiple contiguous bios to bio_list before the workqueue is scheduled to run. This analysis matches test "3" above. When system load is high, there is some delay before the workqueue can be scheduled to run; the higher the system load, the greater the delay. So there is more chance that the same io submit thread can add multiple contiguous bios to bio_list. Then, when the workqueue is scheduled to run, there are more contiguous bios in bio_list, which will be processed in blkg_async_bio_workfn(). This analysis matches tests "1" and "2" above.

According to the tests, io performance improves with the patch, especially when system load is higher. Another optimization is to use the plug only when bio_list contains at least 2 bios.

Signed-off-by: Xianting Tian <tian.xianting@h3c.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-10block: remove check_disk_changeChristoph Hellwig
Remove the now unused check_disk_change helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-10sr: simplify sr_block_revalidate_diskChristoph Hellwig
Both callers have a valid CD structure available, so rely on that instead of getting another reference. Also move the function to avoid a forward declaration. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-10sr: use bdev_check_media_changeChristoph Hellwig
Switch to use bdev_check_media_change instead of check_disk_change and call sr_block_revalidate_disk manually. Also add an explicit call to sr_block_revalidate_disk just before disk_add() to ensure we always check for a ready unit and read the TOC, and then stop wiring up ->revalidate_disk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-10sd: use bdev_check_media_changeChristoph Hellwig
Switch to use bdev_check_media_change instead of check_disk_change and call sd_revalidate_disk manually. As sd also calls sd_revalidate_disk manually during probe and open, the extra call into ->revalidate_disk from bdev_disk_changed is not required either, so stop wiring up the method. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-10md: use bdev_check_media_changeChristoph Hellwig
The md driver does not have a ->revalidate_disk method, so it can just use bdev_check_media_change without any additional changes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-10ide-gd: stop using the disk events mechanismChristoph Hellwig
ide-gd is only using the disk events mechanism to be able to force an invalidation and partition scan on opening removable media. Just open code the logic without involving the block layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-10ide-cd: remove idecd_revalidate_diskChristoph Hellwig
Just merge the trivial function into its only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>