git.baikalelectronics.ru Git

io_uring: make apoll_events a __poll_t

apoll_events is fed to vfs_poll and the poll tables, so it should be
a __poll_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: drop a spurious inline on a forward declaration

io_file_get_normal isn't marked inline, so don't claim it as such in the
forward declaration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: don't use ERR_PTR for user pointers

ERR_PTR abuses the high bits of a pointer to transport error information.
This is only safe for kernel pointers and not user pointers. Fix
io_buffer_select and its helpers to just return NULL for failure and get
rid of this abuse.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: use a rwf_t for io_rw.flags

Use the proper type.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add support for ring mapped supplied buffers

Provided buffers allow an application to supply io_uring with buffers
that can then be grabbed for a read/receive request, when the data
source is ready to deliver data. The existing scheme relies on using
IORING_OP_PROVIDE_BUFFERS to do that, but it can be difficult to use
in real world applications. It's pretty efficient if the application
is able to supply back batches of provided buffers when they have been
consumed and the application is ready to recycle them, but if
fragmentation occurs in the buffer space, it can become difficult to
supply enough buffers at the time. This hurts efficiency.

Add a register op, IORING_REGISTER_PBUF_RING, which allows an application
to setup a shared queue for each buffer group of provided buffers. The
application can then supply buffers simply by adding them to this ring,
and the kernel can consume then just as easily. The ring shares the head
with the application, the tail remains private in the kernel.

Provided buffers setup with IORING_REGISTER_PBUF_RING cannot use
IORING_OP_{PROVIDE,REMOVE}_BUFFERS for adding or removing entries to the
ring, they must use the mapped ring. Mapped provided buffer rings can
co-exist with normal provided buffers, just not within the same group ID.

To gauge overhead of the existing scheme and evaluate the mapped ring
approach, a simple NOP benchmark was written. It uses a ring of 128
entries, and submits/completes 32 at the time. 'Replenish' is how
many buffers are provided back at the time after they have been
consumed:

Test Replenish NOPs/sec
================================================================
No provided buffers NA ~30M
Provided buffers 32 ~16M
Provided buffers 1 ~10M
Ring buffers 32 ~27M
Ring buffers 1 ~27M

The ring mapped buffers perform almost as well as not using provided
buffers at all, and they don't care if you provided 1 or more back at
the same time. This means application can just replenish as they go,
rather than need to batch and compact, further reducing overhead in the
application. The NOP benchmark above doesn't need to do any compaction,
so that overhead isn't even reflected in the above test.

Co-developed-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add io_pin_pages() helper

Abstract this out from io_sqe_buffer_register() so we can use it
elsewhere too without duplicating this code.

No intended functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add buffer selection support to IORING_OP_NOP

Obviously not really useful since it's not transferring data, but it
is helpful in benchmarking overhead of provided buffers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: fix locking state for empty buffer group

io_provided_buffer_select() must drop the submit lock, if needed, even
in the error handling case. Failure to do so will leave us with the
ctx->uring_lock held, causing spew like:

====================================
WARNING: iou-wrk-366/368 still has locks held!
5.18.0-rc6-00294-gdf8dc7004331 #994 Not tainted
------------------------------------
1 lock held by iou-wrk-366/368:
#0: ffff0000c72598a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_ring_submit_lock+0x20/0x48

stack backtrace:
CPU: 4 PID: 368 Comm: iou-wrk-366 Not tainted 5.18.0-rc6-00294-gdf8dc7004331 #994
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace.part.0+0xa4/0xd4
show_stack+0x14/0x5c
dump_stack_lvl+0x88/0xb0
dump_stack+0x14/0x2c
debug_check_no_locks_held+0x84/0x90
try_to_freeze.isra.0+0x18/0x44
get_signal+0x94/0x6ec
io_wqe_worker+0x1d8/0x2b4
ret_from_fork+0x10/0x20

and triggering later hangs off get_signal() because we attempt to
re-grab the lock.

Reported-by: syzbot+987d7bb19195ae45208c@syzkaller.appspotmail.com
Fixes: 69095f7adadd ("io_uring: abstract out provided buffer list selection")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: implement multishot mode for accept

Refactor io_accept() to support multishot mode.

theoretical analysis:
  1) when connections come in fast
    - singleshot:
              add accept sqe(userspace) --> accept inline
                              ^                 |
                              |-----------------|
    - multishot:
             add accept sqe(userspace) --> accept inline
                                              ^     |
                                              |--*--|

    we do accept repeatedly in * place until get EAGAIN

  2) when connections come in at a low pressure
    similar thing like 1), we reduce a lot of userspace-kernel context
    switch and useless vfs_poll()

tests:
Did some tests, which goes in this way:

  server    client(multiple)
  accept    connect
  read      write
  write     read
  close     close

Basically, raise up a number of clients(on same machine with server) to
connect to the server, and then write some data to it, the server will
write those data back to the client after it receives them, and then
close the connection after write return. Then the client will read the
data and then close the connection. Here I test 10000 clients connect
one server, data size 128 bytes. And each client has a go routine for
it, so they come to the server in short time.
test 20 times before/after this patchset, time spent:(unit cycle, which
is the return value of clock())
before:
  1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
  +1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
  +1934226+1914385)/20.0 = 1927633.75
after:
  1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
  +1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
  +1871324+1940803)/20.0 = 1894750.45

(1927633.75 - 1894750.45) / 1927633.75 = 1.65%

Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-5-haoxu.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: let fast poll support multishot

For operations like accept, multishot is a useful feature, since we can
reduce a number of accept sqe. Let's integrate it to fast poll, it may
be good for other operations in the future.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-4-haoxu.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add REQ_F_APOLL_MULTISHOT for requests

Add a flag to indicate multishot mode for fast poll. currently only
accept use it, but there may be more operations leveraging it in the
future. Also add a mask IO_APOLL_MULTI_POLLED which stands for
REQ_F_APOLL_MULTI | REQ_F_POLLED, to make the code short and cleaner.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-3-haoxu.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add IORING_ACCEPT_MULTISHOT for accept

add an accept_flag IORING_ACCEPT_MULTISHOT for accept, which is to
support multishot.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-2-haoxu.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: only wake when the correct events are set

The check for waking up a request compares the poll_t bits, however this
will always contain some common flags so this always wakes up.

For files with single wait queues such as sockets this can cause the
request to be sent to the async worker unnecesarily. Further if it is
non-blocking will complete the request with EAGAIN which is not desired.

Here exclude these common events, making sure to not exclude POLLERR which
might be important.

Fixes: 4163e41e4431 ("io_uring: use poll driven retry for files that support it")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220512091834.728610-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: avoid io-wq -EAGAIN looping for !IOPOLL

If an opcode handler semi-reliably returns -EAGAIN, io_wq_submit_work()
might continue busily hammer the same handler over and over again, which
is not ideal. The -EAGAIN handling in question was put there only for
IOPOLL, so restrict it to IOPOLL mode only where there is no other
recourse than to retry as we cannot wait.

Fixes: c28c32ed82eb9 ("io_uring: support for IO polling")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f168b4f24181942f3614dd8ff648221736f572e6.1652433740.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add flag for allocating a fully sparse direct descriptor space

Currently to setup a fully sparse descriptor space upfront, the app needs
to alloate an array of the full size and memset it to -1 and then pass
that in. Make this a bit easier by allowing a flag that simply does
this internally rather than needing to copy each slot separately.

This works with IORING_REGISTER_FILES2 as the flag is set in struct
io_uring_rsrc_register, and is only allow when the type is
IORING_RSRC_FILE as this doesn't make sense for registered buffers.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: bump max direct descriptor count to 1M

We currently limit these to 32K, but since we're now backing the table
space with vmalloc when needed, there's no reason why we can't make it
bigger. The total space is limited by RLIMIT_NOFILE as well.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: allow allocated fixed files for accept

If the application passes in IORING_FILE_INDEX_ALLOC as the file_slot,
then that's a hint to allocate a fixed file descriptor rather than have
one be passed in directly.

This can be useful for having io_uring manage the direct descriptor space,
and also allows multi-shot support to work with fixed files.

Normal accept direct requests will complete with 0 for success, and < 0
in case of error. If io_uring is asked to allocated the direct descriptor,
then the direct descriptor is returned in case of success.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: allow allocated fixed files for openat/openat2

If the application passes in IORING_FILE_INDEX_ALLOC as the file_slot,
then that's a hint to allocate a fixed file descriptor rather than have
one be passed in directly.

This can be useful for having io_uring manage the direct descriptor space.

Normal open direct requests will complete with 0 for success, and < 0
in case of error. If io_uring is asked to allocated the direct descriptor,
then the direct descriptor is returned in case of success.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add basic fixed file allocator

Applications currently always pick where they want fixed files to go.
In preparation for allowing these types of commands with multishot
support, add a basic allocator in the fixed file table.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: track fixed files with a bitmap

In preparation for adding a basic allocator for direct descriptors,
add helpers that set/clear whether a file slot is used.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: don't clear req->kbuf when buffer selection is done

It's not needed as the REQ_F_BUFFER_SELECTED flag tracks the state of
whether or not kbuf is valid, so just drop it.

Suggested-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: eliminate the need to track provided buffer ID separately

We have io_kiocb->buf_index which is used for either fixed buffers, or
for provided buffers. For the latter, it's used to hold the buffer group
ID for buffer selection. Post selection, req->kbuf->bid is used to get
the buffer ID.

Store the buffer ID, when selected, in req->buf_index. If we do end up
recycling the buffer, reset it back to the buffer group ID.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: move provided buffer state closer to submit state

The timeout and other items that follow are less hot, so let's move the
provided buffer state above that.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: move provided and fixed buffers into the same io_kiocb area

These are mutually exclusive - if you use provided buffers, then you
cannot use fixed buffers and vice versa. Move them into the same spot
in the io_kiocb, which is also advantageous for provided buffers as
they get near the submit side hot cacheline.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: abstract out provided buffer list selection

In preparation for providing another way to select a buffer, move the
existing logic into a helper.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: never call io_buffer_select() for a buffer re-select

Callers already have room to store the addr and length information,
clean it up by having the caller just assign the previously provided
data.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: get rid of hashed provided buffer groups

Use a plain array for any group ID that's less than 64, and punt
anything beyond that to an xarray. 64 fits in a page even for 4KB
page sizes and with the planned additions.

This makes the expected group usage faster by avoiding a hash and lookup
to find our list, and it uses less memory upfront by not allocating any
memory for provided buffers unless it's actually being used.

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: always use req->buf_index for the provided buffer group

The read/write opcodes use it already, but the recv/recvmsg do not. If
we switch them over and read and validate this at init time while we're
checking if the opcode supports it anyway, then we can do it in one spot
and we don't have to pass in a separate group ID for io_buffer_select().

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: ignore ->buf_index if REQ_F_BUFFER_SELECT isn't set

There's no point in validity checking buf_index if the request doesn't
have REQ_F_BUFFER_SELECT set, as we will never use it for that case.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: kill io_rw_buffer_select() wrapper

After the recent changes, this is direct call to io_buffer_select()
anyway. With this change, there are no wrappers left for provided
buffer selection.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: make io_buffer_select() return the user address directly

There's no point in having callers provide a kbuf, we're just returning
the address anyway.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: kill io_recv_buffer_select() wrapper

It's just a thin wrapper around io_buffer_select(), get rid of it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: use 'sr' vs 'req->sr_msg' consistently

For all of send/sendmsg and recv/recvmsg we have the local 'sr' variable,
yet some cases still use req->sr_msg which sr points to. Use 'sr'
consistently.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add POLL_FIRST support for send/sendmsg and recv/recvmsg

If IORING_RECVSEND_POLL_FIRST is set for recv/recvmsg or send/sendmsg,
then we arm poll first rather than attempt a receive or send upfront.
This can be useful if we expect there to be no data (or space) available
for the request, as we can then avoid wasting time on the initial
issue attempt.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: check IOPOLL/ioprio support upfront

Don't punt this check to the op prep handlers, add the support to
io_op_defs and we can check them while setting up the request.

This reduces the text size by 500 bytes on aarch64, and makes this less
fragile by having the check in one spot and needing opcodes to opt in
to IOPOLL or ioprio support.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: replace smp_mb() with smp_mb__after_atomic() in io_sq_thread()

The IORING_SQ_NEED_WAKEUP flag is now set using atomic_or() which
implies a full barrier on some architectures but it is not required to
do so. Use the more appropriate smp_mb__after_atomic() which avoids the
extra barrier on those architectures.

Signed-off-by: Almog Khaikin <almogkh@gmail.com>
Link: https://lore.kernel.org/r/20220426163403.112692-1-almogkh@gmail.com
Fixes: 8018823e6987 ("io_uring: serialize ctx->rings->sq_flags with atomic_or/and")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add IORING_SETUP_TASKRUN_FLAG

If IORING_SETUP_COOP_TASKRUN is set to use cooperative scheduling for
running task_work, then IORING_SETUP_TASKRUN_FLAG can be set so the
application can tell if task_work is pending in the kernel for this
ring. This allows use cases like io_uring_peek_cqe() to still function
appropriately, or for the task to know when it would be useful to
call io_uring_wait_cqe() to run pending events.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-7-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: use TWA_SIGNAL_NO_IPI if IORING_SETUP_COOP_TASKRUN is used

If this is set, io_uring will never use an IPI to deliver a task_work
notification. This can be used in the common case where a single task or
thread communicates with the ring, and doesn't rely on
io_uring_cqe_peek().

This provides a noticeable win in performance, both from eliminating
the IPI itself, but also from avoiding interrupting the submitting
task unnecessarily.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-6-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: set task_work notify method at init time

While doing so, switch SQPOLL to TWA_SIGNAL_NO_IPI as well, as that
just does a task wakeup and then we can remove the special wakeup we
have in task_work_add.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-5-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io-wq: use __set_notify_signal() to wake workers

The only difference between set_notify_signal() and __set_notify_signal()
is that the former checks if it needs to deliver an IPI to force a
reschedule. As the io-wq workers never leave the kernel, and IPI is never
needed, they simply need a wakeup.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: serialize ctx->rings->sq_flags with atomic_or/and

Rather than require ctx->completion_lock for ensuring that we don't
clobber the flags, use the atomic bitop helpers instead. This removes
the need to grab the completion_lock, in preparation for needing to set
or clear sq_flags when we don't know the status of this lock.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220426014904.60384-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>

task_work: allow TWA_SIGNAL without a rescheduling IPI

Some use cases don't always need an IPI when sending a TWA_SIGNAL
notification. Add TWA_SIGNAL_NO_IPI, which is just like TWA_SIGNAL, except
it doesn't send an IPI to the target task. It merely sets
TIF_NOTIFY_SIGNAL and wakes up the task.

This can be useful in avoiding a forceful transition to the kernel if the
task is running in userspace. Depending on the task_work in question, it
may be quite fine waiting for the next reschedule or kernel enter anyway,
or the use case may even have other mechanisms for hinting to the task
that a transition may be useful. This can drive more cooperative
scheduling of task_work.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/821f42b6-7d91-8074-8212-d34998097de4@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: fix compile warning for 32-bit builds

If IO_URING_SCM_ALL isn't set, as it would not be on 32-bit builds,
then we trigger a warning:

fs/io_uring.c: In function '__io_sqe_files_unregister':
fs/io_uring.c:8992:13: warning: unused variable 'i' [-Wunused-variable]
8992 | int i;
| ^

Move the ifdef up to include the 'i' variable declaration.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 8bd895003118 ("io_uring: store SCM state in io_fixed_file->file_ptr")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: return an error when cqe is dropped

Right now io_uring will not actively inform userspace if a CQE is
dropped. This is extremely rare, requiring a CQ ring overflow, as well as
a GFP_ATOMIC kmalloc failure. However the consequences could cause for
example applications to go into an undefined state, possibly waiting for a
CQE that never arrives.

Return an error code (EBADR) in these cases. Since this is expected to be
incredibly rare, try and avoid as much as possible affecting the hot code
paths, and so it only is returned lazily and when there is no other
available CQEs.

Once the error is returned, reset the error condition assuming the user is
either ok with it or will clean up appropriately.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220421091345.2115755-6-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: use constants for cq_overflow bitfield

Prepare to use this bitfield for more flags by using constants instead of
magic value 0

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220421091345.2115755-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: rework io_uring_enter to simplify return value

io_uring_enter returns the count submitted preferrably over an error
code. In some code paths this check is not required, so reorganise the
code so that the check is only done as needed.
This is also a prep for returning error codes only in waiting scenarios.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220421091345.2115755-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: trace cqe overflows

Trace cqe overflows in io_uring. Print ocqe before the check, so if it is
NULL it indicates that it has been dropped.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220421091345.2115755-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add trace support for CQE overflow

Add trace function for overflowing CQ ring.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220421091345.2115755-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: allow re-poll if we made progress

We currently check REQ_F_POLLED before arming async poll for a
notification to retry. If it's set, then we don't allow poll and will
punt to io-wq instead. This is done to prevent a situation where a buggy
driver will repeatedly return that there's space/data available yet we
get -EAGAIN.

However, if we already transferred data, then it should be safe to rely
on poll again. Gate the check on whether or not REQ_F_PARTIAL_IO is
also set.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: support MSG_WAITALL for IORING_OP_SEND(MSG)

Like commit cb9342f20212 for recv/recvmsg, support MSG_WAITALL for the
send side. If this flag is set and we do a short send, retry for a
stream of seqpacket socket.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add support for IORING_ASYNC_CANCEL_ANY

Rather than match on a specific key, be it user_data or file, allow
canceling any request that we can lookup. Works like
IORING_ASYNC_CANCEL_ALL in that it cancels multiple requests, but it
doesn't key off user_data or the file.

Can't be set with IORING_ASYNC_CANCEL_FD, as that's a key selector.
Only one may be used at the time.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220418164402.75259-6-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: allow IORING_OP_ASYNC_CANCEL with 'fd' key

Currently sqe->addr must contain the user_data of the request being
canceled. Introduce the IORING_ASYNC_CANCEL_FD flag, which tells the
kernel that we're keying off the file fd instead for cancelation. This
allows canceling any request that a) uses a file, and b) was assigned the
file based on the value being passed in.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220418164402.75259-5-axboe@kernel.dk

io_uring: add support for IORING_ASYNC_CANCEL_ALL

The current cancelation will lookup and cancel the first request it
finds based on the key passed in. Add a flag that allows to cancel any
request that matches they key. It completes with the number of requests
found and canceled, or res < 0 if an error occured.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220418164402.75259-4-axboe@kernel.dk

io_uring: pass in struct io_cancel_data consistently

In preparation for being able to not only key cancel off the user_data,
pass in the io_cancel_data struct for the various functions that deal
with request cancelation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220418164402.75259-3-axboe@kernel.dk

io_uring: remove dead 'poll_only' argument to io_poll_cancel()

It's only called from one location, and it always passes in 'false'.
Kill the argument, and just pass in 'false' to io_poll_find().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220418164402.75259-2-axboe@kernel.dk

io_uring: refactor io_disarm_next() locking

Split timeout handling into removal + failing, so we can reduce
spinlocking time and remove another instance of triple nested locking.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0f00d115f9d4c5749028f19623708ad3695512d6.1650458197.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: move timeout locking in io_timeout_cancel()

Move ->timeout_lock grabbing inside of io_timeout_cancel(), so
we can do io_req_task_queue_fail() outside of the lock. It's much nicer
than relying on triple nested locking.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cde758c2897930d31e205ed8f476d4ec879a8849.1650458197.git.asml.silence@gmail.com
[axboe: drop now wrong timeout_lock annotation]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: store SCM state in io_fixed_file->file_ptr

A previous commit removed SCM accounting for non-unix sockets, as those
are the only ones that can cause a fixed file reference. While that is
true, it also means we're now dereferencing the file as part of the
workqueue driven __io_sqe_files_unregister() after the process has
exited. This isn't safe for SCM files, as unix gc may have already
reaped them when the process exited. KASAN complains about this:

[   12.307040] Freed by task 0:
[   12.307592]  kasan_save_stack+0x28/0x4c
[   12.308318]  kasan_set_track+0x28/0x38
[   12.309049]  kasan_set_free_info+0x24/0x44
[   12.309890]  ____kasan_slab_free+0x108/0x11c
[   12.310739]  __kasan_slab_free+0x14/0x1c
[   12.311482]  slab_free_freelist_hook+0xd4/0x164
[   12.312382]  kmem_cache_free+0x100/0x1dc
[   12.313178]  file_free_rcu+0x58/0x74
[   12.313864]  rcu_core+0x59c/0x7c0
[   12.314675]  rcu_core_si+0xc/0x14
[   12.315496]  _stext+0x30c/0x414
[   12.316287]
[   12.316687] Last potentially related work creation:
[   12.317885]  kasan_save_stack+0x28/0x4c
[   12.318845]  __kasan_record_aux_stack+0x9c/0xb0
[   12.319976]  kasan_record_aux_stack_noalloc+0x10/0x18
[   12.321268]  call_rcu+0x50/0x35c
[   12.322082]  __fput+0x2fc/0x324
[   12.322873]  ____fput+0xc/0x14
[   12.323644]  task_work_run+0xac/0x10c
[   12.324561]  do_notify_resume+0x37c/0xe74
[   12.325420]  el0_svc+0x5c/0x68
[   12.326050]  el0t_64_sync_handler+0xb0/0x12c
[   12.326918]  el0t_64_sync+0x164/0x168
[   12.327657]
[   12.327976] Second to last potentially related work creation:
[   12.329134]  kasan_save_stack+0x28/0x4c
[   12.329864]  __kasan_record_aux_stack+0x9c/0xb0
[   12.330735]  kasan_record_aux_stack+0x10/0x18
[   12.331576]  task_work_add+0x34/0xf0
[   12.332284]  fput_many+0x11c/0x134
[   12.332960]  fput+0x10/0x94
[   12.333524]  __scm_destroy+0x80/0x84
[   12.334213]  unix_destruct_scm+0xc4/0x144
[   12.334948]  skb_release_head_state+0x5c/0x6c
[   12.335696]  skb_release_all+0x14/0x38
[   12.336339]  __kfree_skb+0x14/0x28
[   12.336928]  kfree_skb_reason+0xf4/0x108
[   12.337604]  unix_gc+0x1e8/0x42c
[   12.338154]  unix_release_sock+0x25c/0x2dc
[   12.338895]  unix_release+0x58/0x78
[   12.339531]  __sock_release+0x68/0xec
[   12.340170]  sock_close+0x14/0x20
[   12.340729]  __fput+0x18c/0x324
[   12.341254]  ____fput+0xc/0x14
[   12.341763]  task_work_run+0xac/0x10c
[   12.342367]  do_notify_resume+0x37c/0xe74
[   12.343086]  el0_svc+0x5c/0x68
[   12.343510]  el0t_64_sync_handler+0xb0/0x12c
[   12.344086]  el0t_64_sync+0x164/0x168

We have an extra bit we can use in file_ptr on 64-bit, use that to store
whether this file is SCM'ed or not, avoiding the need to look at the
file contents itself. This does mean that 32-bit will be stuck with SCM
for all registered files, just like 64-bit did before the referenced
commit.

Fixes: fbd4ef2c417a ("io_uring: don't scm-account for non af_unix sockets")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: kill ctx arg from io_req_put_rsrc

The ctx argument of io_req_put_rsrc() is not used, kill it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bb51bf3ff02775b03e6ea21bc79c25d7870d1644.1650311386.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add a helper for putting rsrc nodes

Add a simple helper to encapsulating dropping rsrc nodes references,
it's cleaner and will help if we'd change rsrc refcounting or play with
percpu_ref_put() [no]inlining.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/63fdd953ac75898734cd50e8f69e95e6664f46fe.1650311386.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: store rsrc node in req instead of refs

req->fixed_rsrc_refs keeps a pointer to rsrc node pcpu references, but
it's more natural just to store rsrc node directly. There were some
reasons for that in the past but not anymore.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cee1c86ec9023f3e4f6ce8940d58c017ef8782f4.1650311386.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: refactor io_assign_file error path

All io_assign_file() callers do error handling themselves,
req_set_fail() in the io_assign_file()'s fail path needlessly bloats the
kernel and is not the best abstraction to have. Simplify the error path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/eff77fb1eac2b6a90cca5223813e6a396ffedec0.1650311386.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: use right helpers for file assign locking

We have io_ring_submit_[un]lock() functions helping us with conditional
->uring_lock locking, use them in io_file_get_fixed() instead of hand
coding.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c9c9ff1e046f6eb68da0a251962a697f8a2275fa.1650311386.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add data_race annotations

We have several racy reads, mark them with data_race() to demonstrate
this fact.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7e56e750d294c70b2a56938bd733386f19f0eb53.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: inline io_req_complete_fail_submit()

Inline io_req_complete_fail_submit(), there is only one caller and the
name doesn't tell us much.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fe5851af01dcd39fc84b71b8539c7cbe4658fb6d.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: refactor io_submit_sqe()

Remove one extra if for non-linked path of io_submit_sqe().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/03183199d1bf494b4a72eca16d792c8a5945acb4.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: refactor lazy link fail

Remove the lazy link fail logic from io_submit_sqe() and hide it into a
helper. It simplifies the code and will be needed in next patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6a68aca9cf4492132da1d7c8a09068b74aba3c65.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: introduce IO_REQ_LINK_FLAGS

Add a macro for all link request flags to avoid duplication.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/df38b883e31e7e0ca4e364d25a0743862961b180.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: refactor io_queue_sqe()

io_queue_sqe() is a part of the submission path and we try hard to keep
it inlined, so shed some extra bytes from it by moving the error
checking part into io_queue_sqe_arm_apoll() and renaming it accordingly.

note: io_queue_sqe_arm_apoll() is not inlined, thus the patch doesn't
change the number of function calls for the apoll path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9b79edd246336decfaca79b949a15ac69123490d.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: rename io_queue_async_work()

Rename io_queue_async_work(). The name is pretty old but now doesn't
reflect well what the function is doing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5d4b25c54cccf084f9f2fd63bd4e4fa4515e998e.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: inline io_queue_sqe()

Inline io_queue_sqe() as there is only one caller left, and rename
__io_queue_sqe().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d5742683b7a7caceb1c054e91e5b9135b0f3b858.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: helper for prep+queuing linked timeouts

We try to aggresively inline the submission path, so it's a good idea to
not pollute it with colder code. One of them is linked timeout
preparation + queue, which can be extracted into a function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ecf74df7ac77389b6d9211211ec4954e91de98ba.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: inline io_free_req()

Inline io_free_req() into its only user and remove an underscore prefix
from __io_free_req().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ed114edef5c256a644f4839bb372df70d8df8e3f.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: kill io_put_req_deferred()

We have several spots where a call to io_fill_cqe_req() is immediately
followed by io_put_req_deferred(). Replace them with
__io_req_complete_post() and get rid of io_put_req_deferred() and
io_fill_cqe_req().

> size ./fs/io_uring.o
   text    data     bss     dec     hex filename
  86942   13734       8  100684   1894c ./fs/io_uring.o
> size ./fs/io_uring.o
   text    data     bss     dec     hex filename
  86438   13654       8  100100   18704 ./fs/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/10672a538774ac8986bee6468d960527af59169d.1650056133.git.asml.silence@gmail.com
[axboe: fold in followup fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: minor refactoring for some tw handlers

Get rid of some useless local variables

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7798327b684b7015f7e4300420142ddfcd317297.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: clean poll tw PF_EXITING handling

When we meet PF_EXITING in io_poll_check_events(), don't overcomplicate
the code with io_poll_mark_cancelled() but just return -ECANCELED and
the callers will deal with the rest.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f0cc981af82a5b193658f8f44397eeb3bf838b7b.1650056133.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: optimise io_get_cqe()

io_get_cqe() is expensive because of a bunch of loads, masking, etc.
However, most of the time we should have enough of entries in the CQ,
so we can cache two pointers representing a range of contiguous CQE
memory we can use. When the range is exhausted we'll go through a slower
path to set up a new range. When there are no CQEs avaliable, pointers
will naturally point to the same address.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/487eeef00f3146537b3d9c1a9cef2fc0b9a86f81.1649771823.git.asml.silence@gmail.com
[axboe: santinel -> sentinel]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: optimise submission left counting

Considering all inlining io_submit_sqe() is huge and usually ends up
calling some other functions.

We decrement @left in io_submit_sqes() just before calling
io_submit_sqe() and use it later after the call. Considering how huge
io_submit_sqe() is, there is not much hope @left will be treated
gracefully by compilers.

Decrement it after the call, not only it's easier on register spilling
and probably saves stack write/read, but also at least for x64 uses
CPU flags set by the dec instead of doing (read/write and tests).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/807f9a276b54ee8ff4e42e2b78721484f1c71743.1649771823.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: optimise submission loop invariant

Instead of keeping @submitted in io_submit_sqes(), which for each
iteration requires comparison with the initial number of SQEs, store the
number of SQEs left to submit. We'll need nr only for when we're done
with SQE handling.

note: if we can't allocate a req for the first SQE we always has been
returning -EAGAIN to the userspace, save this behaviour by looking into
the cache in a slow path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c3b3df9aeae4c2f7a53fd8386385742e4e261e77.1649771823.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add helper to return req to cache list

Don't hand code wq_stack_add_head() to ->free_list, which serves for
recycling io_kiocb, add a helper doing it for us.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f206f575486a8dd3d52f074ab37ed146b2d215b7.1649771823.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: helper for empty req cache checks

Add io_req_cache_empty(), which checks if there are requests in the
inline req cache or not. It'll be needed in the future, but also nicely
cleans up a few spots poking into ->free_list directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b18662389f3fb483d0bd07906647f65f6037475a.1649771823.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: inline io_flush_cached_reqs

io_flush_cached_reqs() isn't descriptive and has only one caller, inline
it into __io_alloc_req_refill().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ec38abe65a883d9fe6b169793119ce86806655a4.1649771823.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: shrink final link flush

All good users should not set IOSQE_IO_*LINK flags for the last request
of a link. io_uring flushes collected links at the end of submission,
but it's not the optimal way and so we don't care too much about it.
Replace io_queue_sqe() call with io_queue_sqe_fallback() as the former
one is inlined and will generate a bunch of extra code. This will also
help compilers with the submission path inlining.

> size ./fs/io_uring.o
   text    data     bss     dec     hex filename
  87265   13734       8  101007   18a8f ./fs/io_uring.o
> size ./fs/io_uring.o
   text    data     bss     dec     hex filename
  87073   13734       8  100815   189cf ./fs/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/01fb5e417ef49925d544a0b0bae30409845ed2b4.1649771823.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: memcpy CQE from req

We can do CQE filling a bit more efficiently when req->cqe is fully
filled by memcpy()'ing it to the userspace instead of doing it field by
field. It's easier on register spilling, removes a couple of extra
loads/stores and write combines two u32 memory writes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ee3f514ff28b1fe3347a8eca93a9d91647f2eaad.1649771823.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: explicitly keep a CQE in io_kiocb

We already have req->{result,user_data,cflags}, which mimic struct
io_uring_cqe and are intended to store CQE data. Combine them into a
struct io_uring_cqe field.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e1efe65d5005cd6a9ec3440767eb15a9fa9351cf.1649771823.git.asml.silence@gmail.com
[axboe: add mirror cqe to cater to fd union]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: rename io_sqe_file_register

Rename io_sqe_file_register(), so the name better reflects what the
function is doing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d5091518883786969e244d2f0854a47bbdaa5061.1649334991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: deduplicate SCM accounting

Merge io_sqe_file_register() and io_sqe_file_register(). The only
real difference left between them is from where we get an skb.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dddda3039c71fcbec24b3465cbe8c7e7ae7bb0e8.1649334991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: don't pass around fixed index for scm

There is an old API nuisance where io_uring's SCM accounting functions
traverse fixed file tables and so requires them to be set in advance,
which leads to some implicit rules of how io_sqe_file_register() should
be used.

__io_sqe_files_scm() now works with only one file at a time, pass a file
directly and get rid of all fixed table dereferencing inside. Clean
io_sqe_file_register() callers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fb32031d892e61a7748c70da7999725d5e798671.1649334991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: refactor __io_sqe_files_scm

__io_sqe_files_scm() is now called only from one place passing a single
file, so nr argument can be killed and __io_sqe_files_scm() simplified.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/66b492bc66dc8356d45d64076bb31d677d11a7c9.1649334991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: uniform SCM accounting

Channel all SCM accounting through io_sqe_file_register(), so we do it
uniformely for updates and initial registration and can kill duplicated
code. Registration might be slightly slower in some case, but first we
skip most of SCM accounting now so it's not a problem. Moreover, it's
nicer for an empty set registration as we don't even try to allocate
skb for them anymore.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6c9afbeb22812777d0c43e52353b63db5b87ed1e.1649334991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: don't scm-account for non af_unix sockets

io_uring deals with file reference loops by registering all fixed files
in the SCM/GC infrastrucure. However, only a small subset of all file
types can keep long-term references to other files and those that don't
are not interesting for the garbage collector as they can't be in a
reference loop. They neither can be directly recycled by GC nor affect
loop searching.

Let's skip io_uring SCM accounting for loop-less files, i.e. all but
af_unix sockets, quite imroving fixed file updates performance and
greatly helpnig with memory footprint.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9c44ecf6e89d69130a8c4360cce2183ffc5ddd6f.1649277098.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: move finish_wait() outside of loop in cqring_wait()

We don't need to call this for every loop. This is particularly
troublesome if we are task_work intensive, and get woken more often than
we desire due to that.

Just do it at the end, that's always safe as we initialize the waitqueue
list head anyway. This can save a considerable amount of hammering on
the waitqueue lock, which is also hot from the request completion side.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: refactor io_req_add_compl_list()

A small refactoring for io_req_add_compl_list() deduplicating some code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f0a5272b45efe4ffc41cb79b99784e39c699aade.1648209006.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: silence io_for_each_link() warning

Some tooling keep complaining about self assignment in
io_for_each_link(), the code is correct but still let's workaround it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f0de77b0b0f8309554ba6fba34327b7813bcc3ff.1648209006.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: partially uninline io_put_task()

In most cases io_put_task() is called from the submitter task and go
through a higly optimised fast path, which has to be inlined. The other
branch though is bulkier and we don't care about it as much because it
implies atomics and other heavy calls. Extract it into a helper, which
is expected not to be inlined.

[before] size ./fs/io_uring.o
   text    data     bss     dec     hex filename
  89328   13646       8  102982   19246 ./fs/io_uring.o
[after] size ./fs/io_uring.o
   text    data     bss     dec     hex filename
  89096   13646       8  102750   1915e ./fs/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dec213db0e0b8605132da81e0a0be687a4d140cb.1648209006.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: cleanup conditional submit locking

Refactor io_ring_submit_[un]lock(), make it accept issue_flags and
remove manual IO_URING_F_UNLOCKED checks. It also allows us to place
lockdep annotations inside instead of sprinkling them in a bunch of
places. There is only one user that doesn't fit now, so hand code
locking in __io_rsrc_put_work().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e55c2c06767676a801252e8094c9ab09912487a4.1648209006.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: optimise mutex locking for submit+iopoll

Both submittion and iopolling requires holding uring_lock. IOPOLL can
users do them together in a single syscall, however it would still do 2
pairs of lock/unlock. Optimise this case combining locking into one
lock/unlock pair, which especially nice for low QD.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/034b6c41658648ad3ad3c9485ac8eb546f010bc4.1647957378.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: pre-calculate syscall iopolling decision

Syscall should only iopoll for events when it's a IOPOLL ring and is not
SQPOLL. Instead of check both flags every time we can save it in ring
flags so it's easier to use. We don't care much about an extra if there,
however it will be inconvenient to copy-paste this chunk with checks in
future patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7fd2f8fc2606305aa06dd8c0ff8f76a66b39c383.1647957378.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: split off IOPOLL argument verifiction

IOPOLL doesn't use additional arguments like sigsets, but it still
needs some basic verification, which is currently done by
io_get_ext_arg(). This patch adds a separate function for the IOPOLL
path, which is a bit simpler and doesn't do extra. This prepares us for
further patches, which would have hurt inlining in the hot path otherwise.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/71b23fca412e3374b74be7711cfd42a3d9d5dfe0.1647957378.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: clean up io_queue_next()

Move fast check out of io_queue_next(), it makes req->flags checks in
__io_submit_flush_completions() a bit clearer and grants us better
comtrol, e.g. can remove now not justified unlikely() in
__io_submit_flush_completions(). Also, we don't care about having this
check in io_free_req() as the function is a slow path and
io_req_find_next() handles it correctly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1f9e1cc80adbb11b37017d511df4a2c6141a3f08.1647897811.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>