mmc uses the block layer struct request pointer to indirect their own
lock to the mmc_queue structure, given that the original lock isn't
reachable outside of block.c. Add a lock pointer to struct mmc_queue
instead and stop overriding the block layer lock which protects fields
entirely separate from the mmc use.
Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
->queue_flags is generally not set or cleared in the fast path, and also
generally set or cleared one flag at a time. Make use of the normal
atomic bitops for it so that we don't need to take the queue_lock,
which is otherwise mostly unused in the core block layer now.
Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
No users left since the removal of the legacy request interface, we can
remove all the magic bit stealing now and make it a normal field.
But use WRITE_ONCE/READ_ONCE on the new deadline field, given that we
don't seem to have any mechanism to guarantee a new value actually
gets seen by other threads.
Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Colin Ian King [Wed, 14 Nov 2018 22:17:05 +0000 (22:17 +0000)]
block: clean up dead code that is now redundant
The boolean next_sorted is set to false and is never changed, hence
the code that checks if it is true is dead code and can now be
removed. This dead code occurred from a previous commit that cleaned
up the elevator and removed the setting of next_sorted to true.
Detected by CoverityScan, CID#1475401 ("'Constant' variable guards
dead code")
Fixes: a1ce35fa4985 ("block: remove dead elevator code") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 14 Nov 2018 17:13:50 +0000 (10:13 -0700)]
nvme: fix boot hang with only being able to get one IRQ vector
NVMe always asks for io_queues + 1 worth of IRQ vectors, which
means that even when we scale all the way down, we still ask
for 2 vectors and get -ENOSPC in return if the system can't
support more than 1.
Getting just 1 vector is fine, it just means that we'll have
1 IO queue and 1 admin queue, with a shared vector between them.
Check for this case and don't add our + 1 if it happens.
Fixes: 3b6592f70ad7 ("nvme: utilize two queue maps, one for reads and one for writes") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 13 Nov 2018 00:19:32 +0000 (17:19 -0700)]
ide: don't clear special on ide_queue_rq() entry
We can't use RQF_DONTPREP to see if we should clear ->special,
as someone could have set that while inserting the request. Make
sure we clear it in our ->initialize_rq_fn() helper instead.
Fixes: 22ce0a7ccf23 ("ide: don't use req->special") Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tetsuo Handa [Mon, 12 Nov 2018 15:42:14 +0000 (08:42 -0700)]
loop: Fix double mutex_unlock(&loop_ctl_mutex) in loop_control_ioctl()
Commit 0a42e99b58a20883 ("loop: Get rid of loop_index_mutex") forgot to
remove mutex_unlock(&loop_ctl_mutex) from loop_control_ioctl() when
replacing loop_index_mutex with loop_ctl_mutex.
Fixes: 0a42e99b58a20883 ("loop: Get rid of loop_index_mutex") Reported-by: syzbot <syzbot+c0138741c2290fc5e63f@syzkaller.appspotmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Jens Axboe <axboe@kernel.dk>
The way these functions abuse ->special to try to store the dummy
request looks completely broken, given that it actually stores the
original scsi command.
Instead switch to ->host_scribble and store the actual dummy command.
Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
There isn't much need for this helper - we can just calculate the offset
for the command header once late in the submission path and fill out
the ctba and ctbau fields there.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
mtip32xx: merge mtip_submit_request into mtip_queue_rq
Factor out a new is_stopped helper that matches the existing
is_se_active helper, and merge the trivial amount of remaining code
into the only caller. This also allows better error handling by
returning a BLK_STS_* directly instead of explicitly calling
blk_mq_end_request, and moving blk_mq_start_request closer to the
actual issue to hardware.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current sx8 code spends a lot of effort dealing with the fact that
tags are per-host, but there might be multiple queues. Now that the
driver has been converted to blk-mq it can take care of the blk-mq
tag_set concept that has been designed just for that.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 8 Nov 2018 17:24:07 +0000 (10:24 -0700)]
blk-mq-tag: change busy_iter_fn to return whether to continue or not
We have this functionality in sbitmap, but we don't export it in
blk-mq for users of the tags busy iteration. This can be useful
for stopping the iteration, if the caller doesn't need to find
more requests.
Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 8 Nov 2018 13:01:16 +0000 (14:01 +0100)]
loop: Get rid of 'nested' acquisition of loop_ctl_mutex
The nested acquisition of loop_ctl_mutex (->lo_ctl_mutex back then) has
been introduced by commit f028f3b2f987e "loop: fix circular locking in
loop_clr_fd()" to fix lockdep complains about bd_mutex being acquired
after lo_ctl_mutex during partition rereading. Now that these are
properly fixed, let's stop fooling lockdep.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 8 Nov 2018 13:01:15 +0000 (14:01 +0100)]
loop: Avoid circular locking dependency between loop_ctl_mutex and bd_mutex
Code in loop_change_fd() drops reference to the old file (and also the
new file in a failure case) under loop_ctl_mutex. Similarly to a
situation in loop_set_fd() this can create a circular locking dependency
if this was the last reference holding the file open. Delay dropping of
the file reference until we have released loop_ctl_mutex.
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 8 Nov 2018 13:01:14 +0000 (14:01 +0100)]
loop: Fix deadlock when calling blkdev_reread_part()
Calling blkdev_reread_part() under loop_ctl_mutex causes lockdep to
complain about circular lock dependency between bdev->bd_mutex and
lo->lo_ctl_mutex. The problem is that on loop device open or close
lo_open() and lo_release() get called with bdev->bd_mutex held and they
need to acquire loop_ctl_mutex. OTOH when loop_reread_partitions() is
called with loop_ctl_mutex held, it will call blkdev_reread_part() which
acquires bdev->bd_mutex. See syzbot report for details [1].
Move call to blkdev_reread_part() in __loop_clr_fd() from under
loop_ctl_mutex to finish fixing of the lockdep warning and the possible
deadlock.
Jan Kara [Thu, 8 Nov 2018 13:01:13 +0000 (14:01 +0100)]
loop: Move loop_reread_partitions() out of loop_ctl_mutex
Calling loop_reread_partitions() under loop_ctl_mutex causes lockdep to
complain about circular lock dependency between bdev->bd_mutex and
lo->lo_ctl_mutex. The problem is that on loop device open or close
lo_open() and lo_release() get called with bdev->bd_mutex held and they
need to acquire loop_ctl_mutex. OTOH when loop_reread_partitions() is
called with loop_ctl_mutex held, it will call blkdev_reread_part() which
acquires bdev->bd_mutex. See syzbot report for details [1].
Move all calls of loop_rescan_partitions() out of loop_ctl_mutex to
avoid lockdep warning and fix deadlock possibility.
Jan Kara [Thu, 8 Nov 2018 13:01:12 +0000 (14:01 +0100)]
loop: Move special partition reread handling in loop_clr_fd()
The call of __blkdev_reread_part() from loop_reread_partition() happens
only when we need to invalidate partitions from loop_release(). Thus
move a detection for this into loop_clr_fd() and simplify
loop_reread_partition().
This makes loop_reread_partition() safe to use without loop_ctl_mutex
because we use only lo->lo_number and lo->lo_file_name in case of error
for reporting purposes (thus possibly reporting outdate information is
not a big deal) and we are safe from 'lo' going away under us by
elevated lo->lo_refcnt.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 8 Nov 2018 13:01:07 +0000 (14:01 +0100)]
loop: Push loop_ctl_mutex down into loop_clr_fd()
loop_clr_fd() has a weird locking convention that is expects
loop_ctl_mutex held, releases it on success and keeps it on failure.
Untangle the mess by moving locking of loop_ctl_mutex into
loop_clr_fd().
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 8 Nov 2018 13:01:06 +0000 (14:01 +0100)]
loop: Split setting of lo_state from loop_clr_fd
Move setting of lo_state to Lo_rundown out into the callers. That will
allow us to unlock loop_ctl_mutex while the loop device is protected
from other changes by its special state.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 8 Nov 2018 13:01:05 +0000 (14:01 +0100)]
loop: Push lo_ctl_mutex down into individual ioctls
Push acquisition of lo_ctl_mutex down into individual ioctl handling
branches. This is a preparatory step for pushing the lock down into
individual ioctl handling functions so that they can release the lock as
they need it. We also factor out some simple ioctl handlers that will
not need any special handling to reduce unnecessary code duplication.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 8 Nov 2018 13:01:04 +0000 (14:01 +0100)]
loop: Get rid of loop_index_mutex
Now that loop_ctl_mutex is global, just get rid of loop_index_mutex as
there is no good reason to keep these two separate and it just
complicates the locking.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 8 Nov 2018 13:01:03 +0000 (14:01 +0100)]
loop: Fold __loop_release into loop_release
__loop_release() has a single call site. Fold it there. This is
currently not a huge win but it will make following replacement of
loop_index_mutex more obvious.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tetsuo Handa [Thu, 8 Nov 2018 13:01:02 +0000 (14:01 +0100)]
block/loop: Use global lock for ioctl() operation.
syzbot is reporting NULL pointer dereference [1] which is caused by
race condition between ioctl(loop_fd, LOOP_CLR_FD, 0) versus
ioctl(other_loop_fd, LOOP_SET_FD, loop_fd) due to traversing other
loop devices at loop_validate_file() without holding corresponding
lo->lo_ctl_mutex locks.
Since ioctl() request on loop devices is not frequent operation, we don't
need fine grained locking. Let's use global lock in order to allow safe
traversal at loop_validate_file().
Note that syzbot is also reporting circular locking dependency between
bdev->bd_mutex and lo->lo_ctl_mutex [2] which is caused by calling
blkdev_reread_part() with lock held. This patch does not address it.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <syzbot+bf89c128e05dd6c62523@syzkaller.appspotmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tetsuo Handa [Thu, 8 Nov 2018 13:01:01 +0000 (14:01 +0100)]
block/loop: Don't grab "struct file" for vfs_getattr() operation.
vfs_getattr() needs "struct path" rather than "struct file".
Let's use path_get()/path_put() rather than get_file()/fput().
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 8 Nov 2018 04:17:57 +0000 (21:17 -0700)]
sunvdc: fix compiler warning
Stephen reports:
After merging the block tree, today's linux-next build (sparc64 defconfig)
produced this warning:
/home/sfr/next/next/drivers/block/sunvdc.c: In function 'init_queue':
/home/sfr/next/next/drivers/block/sunvdc.c:788:6: warning: unused variable 'ret' [-Wunused-variable]
int ret;
^~~
Kill the unused variable.
Fixes: fa182a1fa97d ("sunvdc: convert to blk-mq") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 5 Nov 2018 19:44:33 +0000 (12:44 -0700)]
nvme: add separate poll queue map
Adds support for defining a variable number of poll queues, currently
configurable with the 'poll_queues' module parameter. Defaults to
a single poll queue.
And now we finally have poll support without triggering interrupts!
Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 29 Aug 2018 16:36:56 +0000 (10:36 -0600)]
block: add REQ_HIPRI and inherit it from IOCB_HIPRI
We use IOCB_HIPRI to poll for IO in the caller instead of scheduling.
This information is not available for (or after) IO submission. The
driver may make different queue choices based on the type of IO, so
make the fact that we will poll for this IO known to the lower layers
as well.
Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 31 Oct 2018 14:36:31 +0000 (08:36 -0600)]
nvme: utilize two queue maps, one for reads and one for writes
NVMe does round-robin between queues by default, which means that
sharing a queue map for both reads and writes can be problematic
in terms of read servicing. It's much easier to flood the queue
with writes and reduce the read servicing.
Implement two queue maps, one for reads and one for writes. The
write queue count is configurable through the 'write_queues'
parameter.
By default, we retain the previous behavior of having a single
queue set, shared between reads and writes. Setting 'write_queues'
to a non-zero value will create two queue sets, one for reads and
one for writes, the latter using the configurable number of
queues (hardware queue counts permitting).
Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 30 Oct 2018 18:24:04 +0000 (12:24 -0600)]
blk-mq: improve plug list sorting
Currently we only look at the software queue, but with support
for multiple maps, we should also look at the hardware queue.
This is important since we'll flush out the request list if
either the software queue or hardware queue don't match.
This sorts by software queue first, then hardware queue if
that differs. Finally we sort by request location like before.
This minimizes the flush points per plug list.
Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 30 Oct 2018 17:31:51 +0000 (11:31 -0600)]
blk-mq: cleanup and improve list insertion
It's somewhat strange to have a list insertion function that
relies on the fact that the caller has mapped things correctly.
Pass in the hardware queue directly for insertion, which makes
for a much cleaner interface and implementation.
Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 29 Oct 2018 21:06:13 +0000 (15:06 -0600)]
blk-mq: cache request hardware queue mapping
We call blk_mq_map_queue() a lot, at least two times for each
request per IO, sometimes more. Since we now have an indirect
call as well in that function. cache the mapping so we don't
have to re-call blk_mq_map_queue() for the same request
multiple times.
Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 29 Oct 2018 19:25:27 +0000 (13:25 -0600)]
blk-mq: separate number of hardware queues from nr_cpu_ids
With multiple maps, nr_cpu_ids is no longer the maximum number of
hardware queues we support on a given devices. The initializer of
the tag_set can have set ->nr_hw_queues larger than the available
number of CPUs, since we can exceed that with multiple queue maps.
Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 30 Oct 2018 16:36:06 +0000 (10:36 -0600)]
blk-mq: support multiple hctx maps
Add support for the tag set carrying multiple queue maps, and
for the driver to inform blk-mq how many it wishes to support
through setting set->nr_maps.
This adds an mq_ops helper for drivers that support more than 1
map, mq_ops->rq_flags_to_type(). The function takes request/bio
flags and CPU, and returns a queue map index for that. We then
use the type information in blk_mq_map_queue() to index the map
set.
Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 29 Oct 2018 19:13:29 +0000 (13:13 -0600)]
blk-mq: allow software queue to map to multiple hardware queues
The mapping used to be dependent on just the CPU location, but
now it's a tuple of (type, cpu) instead. This is a prep patch
for allowing a single software queue to map to multiple hardware
queues. No functional changes in this patch.
This changes the software queue count to an unsigned short
to save a bit of space. We can still support 64K-1 CPUs,
which should be enough. Add a check to catch a wrap.
Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 7 Nov 2018 20:43:54 +0000 (13:43 -0700)]
Merge branch 'irq/for-block' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-4.21/block
Pull in the irq affinity commits, that are staged through Thomas's
tree.
* 'irq/for-block' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/affinity: Add support for allocating interrupt sets
genirq/affinity: Pass first vector to __irq_build_affinity_masks()
genirq/affinity: Move two stage affinity spreading into a helper function
genirq/affinity: Spread IRQs to all available NUMA nodes
Jens Axboe [Wed, 31 Oct 2018 23:01:22 +0000 (17:01 -0600)]
block: kill request ->cpu member
This was used for completion placement for the legacy path,
but for mq we have rq->mq_ctx->cpu for that. Add a helper
to get the request CPU assignment, as the mq_ctx type is
private to blk-mq.
Jens Axboe [Mon, 29 Oct 2018 16:25:07 +0000 (10:25 -0600)]
block: kill legacy parts of timeout handling
The only user of legacy timing now is BSG, which is invoked
from the mq timeout handler. Kill the legacy code, and rename
the q->rq_timed_out_fn to q->bsg_job_timeout_fn.
Reviewed-by: Hannes Reinecke <hare@suse.com> Tested-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 29 Oct 2018 16:23:51 +0000 (10:23 -0600)]
block: remove dead elevator code
This removes a bunch of core and elevator related code. On the core
front, we remove anything related to queue running, draining,
initialization, plugging, and congestions. We also kill anything
related to request allocation, merging, retrieval, and completion.
Remove any checking for single queue IO schedulers, as they no
longer exist. This means we can also delete a bunch of code related
to request issue, adding, completion, etc - and all the SQ related
ops and helpers.
Also kill the load_default_modules(), as all that did was provide
for a way to load the default single queue elevator.
Tested-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 31 Oct 2018 18:43:24 +0000 (12:43 -0600)]
block: cleanup kick/queued handling
Now that blk_flush_queue_rq() always returns false, we can
remove that return value. That bubbles through the stack,
allowing us to remove a bunch of state tracking around it.
Tested-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>