lwn.git - Linux kernel documentation tree maintained by Jonathan Corbet

Age	Commit message (Collapse)	Author
2020-07-30	io_uring: get rid of atomic FAA for cq_timeouts	Pavel Begunkov
	If ->cq_timeouts modifications are done under ->completion_lock, we don't really nee any fetch-and-add and other complex atomics. Replace it with non-atomic FAA, that saves an implicit full memory barrier. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-30	io_uring: consolidate *_check_overflow accounting	Pavel Begunkov
	Add a helper to mark ctx->{cq,sq}_check_overflow to get rid of duplicates, and it's clearer to check cq_overflow_list directly anyway. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-30	io_uring: fix stalled deferred requests	Pavel Begunkov
	Always do io_commit_cqring() after completing a request, even if it was accounted as overflowed on the CQ side. Failing to do that may lead to not to pushing deferred requests when needed, and so stalling the whole ring. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-30	io_uring: fix racy overflow count reporting	Pavel Begunkov
	All ->cq_overflow modifications should be under completion_lock, otherwise it can report a wrong number to the userspace. Fix it in io_uring_cancel_files(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-30	io_uring: deduplicate __io_complete_rw()	Pavel Begunkov
	Call __io_complete_rw() in io_iopoll_queue() instead of hand coding it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-30	io_uring: de-unionise io_kiocb	Pavel Begunkov
	As io_kiocb have enough space, move ->work out of a union. It's safer this way and removes ->work memcpy bouncing. By the way make tabulation in struct io_kiocb consistent. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25	io-wq: update hash bits	Pavel Begunkov
	Linked requests are hashed, remove a comment stating otherwise. Also move hash bits to emphasise that we don't carry it through loop iteration and set it every time. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25	io_uring: fix missing io_queue_linked_timeout()	Pavel Begunkov
	Whoever called io_prep_linked_timeout() should also do io_queue_linked_timeout(). __io_queue_sqe() doesn't follow that for the punting path leaving linked timeouts prepared but never queued. Fixes: 6df1db6b54243 ("io_uring: fix mis-refcounting linked timeouts") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25	io_uring: mark ->work uninitialised after cleanup	Pavel Begunkov
	Remove REQ_F_WORK_INITIALIZED after io_req_clean_work(). That's a cold path but is safer for those using io_req_clean_work() out of dismantle_req()/io_free(). And for the same reason zero work.fs Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: deduplicate io_grab_files() calls	Pavel Begunkov
	Move io_req_init_async() into io_grab_files(), it's safer this way. Note that io_queue_async_work() does *init_async(), so it's valid to move out of __io_queue_sqe() punt path. Also, add a helper around io_grab_files(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: don't do opcode prep twice	Pavel Begunkov
	Calling into opcode prep handlers may be dangerous, as they re-read SQE but might not re-initialise requests completely. If io_req_defer() passed fast checks and is done with preparations, punt it async. As all other cases are covered with nulling @sqe, this guarantees that io_[opcode]_prep() are visited only once per request. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works	Xiaoguang Wang
	In io_sq_thread(), if there are task works to handle, current codes will skip schedule() and go on polling sq again, but forget to clear IORING_SQ_NEED_WAKEUP flag, fix this issue. Also add two helpers to set and clear IORING_SQ_NEED_WAKEUP flag, Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: batch put_task_struct()	Pavel Begunkov
	As every iopoll request have a task ref, it becomes expensive to put them one by one, instead we can put several at once integrating that into io_req_free_batch(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: return locked and pinned page accounting	Pavel Begunkov
	Locked and pinned memory accounting in io_{,un}account_mem() depends on having ->sqo_mm, which is NULL after a recent change for non SQPOLL'ed io_ring. That disables the accounting. Return ->sqo_mm initialisation back, and do __io_sq_thread_acquire_mm() based on IORING_SETUP_SQPOLL flag. Fixes: 8eb06d7e8dd85 ("io_uring: fix missing ->mm on exit") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: don't miscount pinned memory	Pavel Begunkov
	io_sqe_buffer_unregister() uses cxt->sqo_mm for memory accounting, but io_ring_ctx_free() drops ->sqo_mm before leaving pinned_vm over-accounted. Postpone mm cleanup for when it's not needed anymore. Fixes: 309758254ea62 ("io_uring: report pinned memory usage") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: don't open-code recv kbuf managment	Pavel Begunkov
	Don't implement fast path of kbuf freeing and management inlined into io_recv{,msg}(), that's error prone and duplicates handling. Replace it with a helper io_put_recv_kbuf(), which mimics io_put_rw_kbuf() in the io_read/write(). This also keeps cflags calculation in one place, removing duplication between rw and recv/send. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: extract io_put_kbuf() helper	Pavel Begunkov
	Extract a common helper for cleaning up a selected buffer, this will be used shortly. By the way, correct cflags types to unsigned and, as kbufs are anyway tracked by a flag, remove useless zeroing req->rw.addr. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: move BUFFER_SELECT check into *recv[msg]	Pavel Begunkov
	Move REQ_F_BUFFER_SELECT flag check out of io_recv_buffer_select(), and do that in its call sites That saves us from double error checking and possibly an extra function call. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: free selected-bufs if error'ed	Pavel Begunkov
	io_clean_op() may be skipped even if there is a selected io_buffer, that's because *select_buffer() funcions never set REQ_F_NEED_CLEANUP. Trigger io_clean_op() when REQ_F_BUFFER_SELECTED is set as well, and and clear the flag if was freed out of it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: don't forget cflags in io_recv()	Pavel Begunkov
	Instead of returning error from io_recv(), go through generic cleanup path, because it'll retain cflags for userspace. Do the same for io_send() for consistency. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: remove extra checks in send/recv	Pavel Begunkov
	With the return on a bad socket, kmsg is always non-null by the end of the function, prune left extra checks and initialisations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: indent left {send,recv}[msg]()	Pavel Begunkov
	Flip over "if (sock)" condition with return on error, the upper layer will take care. That change will be handy later, but already removes an extra jump from hot path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: simplify file ref tracking in submission state	Pavel Begunkov
	Currently, file refs in struct io_submit_state are tracked with 2 vars: @has_refs -- how many refs were initially taken @used_refs -- number of refs used Replace it with a single variable counting how many refs left at the current moment. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring/io-wq: move RLIMIT_FSIZE to io-wq	Pavel Begunkov
	RLIMIT_SIZE in needed only for execution from an io-wq context, hence move all preparations from hot path to io-wq work setup. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: alloc ->io in io_req_defer_prep()	Pavel Begunkov
	Every call to io_req_defer_prep() is prepended with allocating ->io, just do that in the function. And while we're at it, mark error paths with unlikey and replace "if (ret < 0)" with "if (ret)". There is only one change in the observable behaviour, that's instead of killing the head request right away on error, it postpones it until the link is assembled, that looks more preferable. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: remove empty cleanup of OP_OPEN* reqs	Pavel Begunkov
	A switch in __io_clean_op() doesn't have default, it's pointless to list opcodes that doesn't do any cleanup. Remove IORING_OP_OPEN* from there. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: inline io_req_work_grab_env()	Pavel Begunkov
	The only caller of io_req_work_grab_env() is io_prep_async_work(), and they are both initialising req->work. Inline grab_env(), it's easier to keep this way, moreover there already were bugs with misplacing io_req_init_async(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: place cflags into completion data	Pavel Begunkov
	req->cflags is used only for defer-completion path, just use completion data to store it. With the 4 bytes from the ->sequence patch and compacting io_kiocb, this frees 8 bytes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: remove sequence from io_kiocb	Pavel Begunkov
	req->sequence is used only for deferred (i.e. DRAIN) requests, but initialised for every request. Remove req->sequence from io_kiocb together with its initialisation in io_init_req(). Replace it with a new field in struct io_defer_entry, that will be calculated only when needed in io_req_defer(), which is a slow path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: use non-intrusive list for defer	Pavel Begunkov
	The only left user of req->list is DRAIN, hence instead of keeping a separate per request list for it, do that with old fashion non-intrusive lists allocated on demand. That's a really slow path, so that's OK. This removes req->list and so sheds 16 bytes from io_kiocb. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: remove init for unused list	Pavel Begunkov
	poll*() doesn't use req->list, don't init it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: add req->timeout.list	Pavel Begunkov
	Instead of using shared req->list, hang timeouts up on their own list entry. struct io_timeout have enough extra space for it, but if that will be a problem ->inflight_entry can reused for that. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: use completion list for CQ overflow	Pavel Begunkov
	As with the completion path, also use compl.list for overflowed requests. If cleaned up properly, nobody needs per-op data there anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: use inflight_entry list for iopoll'ing	Pavel Begunkov
	req->inflight_entry is used to track requests that grabbed files_struct. Let's share it with iopoll list, because the only iopoll'ed ops are reads and writes, which don't need a file table. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: rename ctx->poll into ctx->iopoll	Pavel Begunkov
	It supports both polling and I/O polling. Rename ctx->poll to clearly show that it's only in I/O poll case. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: share completion list w/ per-op space	Pavel Begunkov
	Calling io_req_complete(req) means that the request is done, and there is nothing left but to clean it up. That also means that per-op data after that should not be used, so we're free to reuse it in completion path, e.g. to store overflow_list as done in this patch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: follow **iovec idiom in io_import_iovec	Pavel Begunkov
	As for import_iovec(), return !=NULL iovec from io_import_iovec() only when it should be freed. That includes returning NULL when iovec is already in req->io, because it should be deallocated by other means, e.g. inside op handler. After io_setup_async_rw() local iovec to ->io, just mark it NULL, to follow the idea in io_{read,write} as well. That's easier to follow, and especially useful if we want to reuse per-op space for completion data. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: only call kfree() on non-NULL pointer] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: add a helper for async rw iovec prep	Pavel Begunkov
	Preparing reads/writes for async is a bit tricky. Extract a helper to not repeat it twice. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: simplify io_req_map_rw()	Pavel Begunkov
	Don't deref req->io->rw every time, but put it in a local variable. This looks prettier, generates less instructions, and doesn't break alias analysis. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: replace rw->task_work with rq->task_work	Pavel Begunkov
	io_kiocb::task_work was de-unionised, and is not planned to be shared back, because it's too useful and commonly used. Hence, instead of keeping a separate task_work in struct io_async_rw just reuse req->task_work. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: extract io_sendmsg_copy_hdr()	Pavel Begunkov
	Don't repeat send msg initialisation code, it's error prone. Extract and use a helper function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: use more specific type in rcv/snd msg cp	Pavel Begunkov
	send/recv msghdr initialisation works with struct io_async_msghdr, but pulls the whole struct io_async_ctx for no reason. That complicates it with composite accessing, e.g. io->msg. Use and pass the most specific type, which is struct io_async_msghdr. It is the larget field in union io_async_ctx and doesn't save stack space, but looks clearer. The most of the changes are replacing "io->msg." with "iomsg->" Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: rename sr->msg into umsg	Pavel Begunkov
	Every second field in send/recv is called msg, make it a bit more understandable by renaming ->msg, which is a user provided ptr, to ->umsg. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: fix sq array offset calculation	Dmitry Vyukov
	rings_size() sets sq_offset to the total size of the rings (the returned value which is used for memory allocation). This is wrong: sq array should be located within the rings, not after them. Set sq_offset to where it should be. Fixes: 75b28affdd6a ("io_uring: allocate the two rings together") Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Hristo Venev <hristo@venev.name> Cc: io-uring@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	Merge branch 'io_uring-5.8' into for-5.9/io_uring	Jens Axboe
	Merge in io_uring-5.8 fixes, as changes/cleanups to how we do locked mem accounting require a fixup, and only one of the spots are noticed by git as the other merges cleanly. The flags fix from io_uring-5.8 also causes a merge conflict, the leak fix for recvmsg, the double poll fix, and the link failure locking fix. * io_uring-5.8: io_uring: fix lockup in io_fail_links() io_uring: fix ->work corruption with poll_add io_uring: missed req_init_async() for IOSQE_ASYNC io_uring: always allow drain/link/hardlink/async sqe flags io_uring: ensure double poll additions work with both request types io_uring: fix recvmsg memory leak with buffer selection io_uring: fix not initialised work->flags io_uring: fix missing msg_name assignment io_uring: account user memory freed when exit has been queued io_uring: fix memleak in io_sqe_files_register() io_uring: fix memleak in __io_sqe_files_update() io_uring: export cq overflow status to userspace Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: fix lockup in io_fail_links()	Pavel Begunkov
	io_fail_links() doesn't consider REQ_F_COMP_LOCKED leading to nested spin_lock(completion_lock) and lockup. [ 197.680409] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 6-... } 18239 jiffies s: 1421 root: 0x40/. [ 197.680411] rcu: blocking rcu_node structures: [ 197.680412] Task dump for CPU 6: [ 197.680413] link-timeout R running task 0 1669 1 0x8000008a [ 197.680414] Call Trace: [ 197.680420] ? io_req_find_next+0xa0/0x200 [ 197.680422] ? io_put_req_find_next+0x2a/0x50 [ 197.680423] ? io_poll_task_func+0xcf/0x140 [ 197.680425] ? task_work_run+0x67/0xa0 [ 197.680426] ? do_exit+0x35d/0xb70 [ 197.680429] ? syscall_trace_enter+0x187/0x2c0 [ 197.680430] ? do_group_exit+0x43/0xa0 [ 197.680448] ? __x64_sys_exit_group+0x18/0x20 [ 197.680450] ? do_syscall_64+0x52/0xa0 [ 197.680452] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-24	io_uring: fix ->work corruption with poll_add	Pavel Begunkov
	req->work might be already initialised by the time it gets into __io_arm_poll_handler(), which will corrupt it by using fields that are in an union with req->work. Luckily, the only side effect is missing put_creds(). Clean req->work before going there. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-23	io_uring: missed req_init_async() for IOSQE_ASYNC	Pavel Begunkov
	IOSQE_ASYNC branch of io_queue_sqe() is another place where an unitialised req->work can be accessed (i.e. prior io_req_init_async()). Nothing really bad though, it just looses IO_WQ_WORK_CONCURRENT flag. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-18	io_uring: always allow drain/link/hardlink/async sqe flags	Daniele Albano
	We currently filter these for timeout_remove/async_cancel/files_update, but we only should be filtering for fixed file and buffer select. This also causes a second read of sqe->flags, which isn't needed. Just check req->flags for the relevant bits. This then allows these commands to be used in links, for example, like everything else. Signed-off-by: Daniele Albano <d.albano@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-17	io_uring: ensure double poll additions work with both request types	Jens Axboe
	The double poll additions were centered around doing POLL_ADD on file descriptors that use more than one waitqueue (typically one for read, one for write) when being polled. However, it can also end up being triggered for when we use poll triggered retry. For that case, we cannot safely use req->io, as that could be used by the request type itself. Add a second io_poll_iocb pointer in the structure we allocate for poll based retry, and ensure we use the right one from the two paths. Fixes: 18bceab101ad ("io_uring: allow POLL_ADD with double poll_wait() users") Signed-off-by: Jens Axboe <axboe@kernel.dk>