Split nvme_dev_add into a helper to actually allocate the tag set, and
one that just update the number of queues. Add a local variable for
the tag_set to clean up the code a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split nvme_alloc_admin_tags into a helper to actually allocate the
tag set, and one that just restarts the admin queue. Add a local
variable for the tag_set to clean up the code a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Nick Bowler [Thu, 21 Jul 2022 03:57:35 +0000 (23:57 -0400)]
nvme: define compat_ioctl again to unbreak 32-bit userspace.
Commit 7e042cd0cf1e ("nvme: simplify the compat ioctl handling") removed
the initialization of compat_ioctl from the nvme block_device_operations
structures.
Presumably the expectation was that 32-bit ioctls would be directed
through the regular handler but this is not the case: failing to assign
.compat_ioctl actually means that the compat case is disabled entirely,
and any attempt to submit nvme ioctls from 32-bit userspace fails
outright with -ENOTTY.
Bean Huo [Fri, 15 Jul 2022 21:27:21 +0000 (23:27 +0200)]
nvme: use command_id instead of req->tag in trace_nvme_complete_rq()
Use command_id instead of req->tag in trace_nvme_complete_rq(),
because of commit e048b22adf66 ("nvme: code command_id with a genctr
for use authentication after release"), cmd->common.command_id is set to
((genctl & 0xf)< 12 | req->tag), no longer req->tag, which makes cid in
trace_nvme_complete_rq and trace_nvme_setup_cmd are not the same.
Fixes: e048b22adf66 ("nvme: code command_id with a genctr for use authentication after release") Signed-off-by: Bean Huo <beanhuo@micron.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's a KASAN warning in raid10_remove_disk when running the lvm
test lvconvert-raid-reshape.sh. We fix this warning by verifying that the
value "number" is valid.
BUG: KASAN: slab-out-of-bounds in raid10_remove_disk+0x61/0x2a0 [raid10]
Read of size 8 at addr ffff889108f3d300 by task mdX_raid10/124682
Last potentially related work creation:
kasan_save_stack+0x1e/0x40
__kasan_record_aux_stack+0x9e/0xc0
kvfree_call_rcu+0x84/0x480
timerfd_release+0x82/0x140
L __fput+0xfa/0x400
task_work_run+0x80/0xc0
exit_to_user_mode_prepare+0x155/0x160
syscall_exit_to_user_mode+0x12/0x40
do_syscall_64+0x42/0xc0
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Second to last potentially related work creation:
kasan_save_stack+0x1e/0x40
__kasan_record_aux_stack+0x9e/0xc0
kvfree_call_rcu+0x84/0x480
timerfd_release+0x82/0x140
__fput+0xfa/0x400
task_work_run+0x80/0xc0
exit_to_user_mode_prepare+0x155/0x160
syscall_exit_to_user_mode+0x12/0x40
do_syscall_64+0x42/0xc0
entry_SYSCALL_64_after_hwframe+0x46/0xb0
The buggy address belongs to the object at ffff889108f3d200
which belongs to the cache kmalloc-256 of size 256
The buggy address is located 0 bytes to the right of
256-byte region [ffff889108f3d200, ffff889108f3d300)
Memory state around the buggy address: ffff889108f3d200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff889108f3d280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff889108f3d300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^ ffff889108f3d380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff889108f3d400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
md-raid: destroy the bitmap after destroying the thread
When we ran the lvm test "shell/integrity-blocksize-3.sh" on a kernel with
kasan, we got failure in write_page.
The reason for the failure is that md_bitmap_destroy is called before
destroying the thread and the thread may be waiting in the function
write_page for the bio to complete. When the thread finishes waiting, it
executes "if (test_bit(BITMAP_WRITE_ERROR, &bitmap->flags))", which
triggers the kasan warning.
Note that the commit 56c17cbc4063 that caused this bug claims that it is
neede for md-cluster, you should check md-cluster and possibly find
another bugfix for it.
BUG: KASAN: use-after-free in write_page+0x18d/0x680 [md_mod]
Read of size 8 at addr ffff889162030c78 by task mdX_raid1/5539
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Fixes: 56c17cbc4063 ("md: move bitmap_destroy to the beginning of __md_stop") Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Two callers of md_alloc want to use the newly allocated devices, so
return it instead of letting them find it cumbersomely after the
allocation.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-and-tested-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
autorun_devices should not be limited to the controls for the legacy
probe on open, so just call md_alloc directly.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-and-tested-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yang Li [Fri, 22 Jul 2022 00:27:55 +0000 (08:27 +0800)]
md: remove unneeded semicolon
Eliminate the following coccicheck warning:
./drivers/md/md.c:8208:2-3: Unneeded semicolon
Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
This driver is for fairly obscure hardware, and has only seen random
drive-by changes after the maintainer stopped working on it in 2005
(about a year and a half after it was introduced). It has some
"interesting" block layer interactions, so let's just drop it unless
anyone complains.
Stephen Rothwell [Thu, 21 Jul 2022 03:11:32 +0000 (13:11 +1000)]
md: fix build failure for !MODULE
After merging the block tree, today's linux-next build (x86_64
allmodconfig) failed like this:
drivers/md/md.c:717:22: error: 'mddev_find' defined but not used [-Werror=unused-function]
717 | static struct mddev *mddev_find(dev_t unit)
| ^~~~~~~~~~
cc1: all warnings being treated as errors
Now that devices are on the all_mddevs list until the gendisk is freed,
there can't be any duplicates. Remove the global list lookup and just
grab a reference.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
md: only delete entries from all_mddevs when the disk is freed
This ensures device names don't get prematurely reused. Instead add a
deleted flag to skip already deleted devices in mddev_get and other
places that only want to see live mddevs.
Reported-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just do a simple list_for_each_entry_safe on all_mddevs, and only grab a
reference when we drop the lock and delete the now unused for_each_mddev
macro.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just do a simple list_for_each_entry_safe on all_mddevs, and only grab a
reference when we drop the lock.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just do a plain list_for_each that only grabs a mddev reference in
the case where the thread sleeps and restarts the list iteration.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
md: factor out the rdev overlaps check from rdev_size_store
This splits the code into nicely readable chunks and also avoids
the refcount inc/dec manipulations.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
The md_free name is rather misleading, so pick a better one.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure that all private data is only freed once all accesses are done.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Error handling in md_alloc is a mess. Untangle it to just free the mddev
directly before add_disk is called and thus the gendisk is globally
visible. After that clear the hold flag and let the mddev_put take care
of cleaning up the mddev through the usual mechanisms.
Fixes: c21799acd0a5 ("[PATCH] md: convert compile time warnings into runtime warnings") Fixes: 5deeaf1b6480 ("md: add error handling support for add_disk()") Fixes: fee490e12a53 ("md: properly unwind when failing to add the kobject in md_alloc") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Once a kobject is initialized, the containing object should not be
directly freed. So delay initialization until it is added. Also
remove the kobject_del call as the last put will remove the kobject as
well. The explicitly delete isn't needed here, and dropping it will
simplify further fixes.
With this md_free now does not need to check that ->gendisk is non-NULL
as it is always set by the time that kobject_init is called on
mddev->kobj.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
md/raid5: Convert prepare_to_wait() to wait_woken() api
raid5_get_active_stripe() can sleep in various situations and it
is called by make_stripe_request() while inside the
prepare_to_wait()/finish_wait() section. Nested waits like this are
not supported.
This was noticed while making other changes that add different sleeps
to raid5_get_active_stripe() that caused a WARNING with
CONFIG_DEBUG_ATOMIC_SLEEP.
No ill effects have been noticed with the code as is, but theoretically
a nested and here could cause a dead lock so it should be fixed.
To fix this, convert the prepare_to_wait() call to use wake_woken()
which supports nested sleeps.
Link: https://lwn.net/Articles/628628/ Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
md/raid5: Fix sectors_to_do bitmap overflow in raid5_make_request()
For unaligned IO that have nearly maximum sectors, the number of stripes
will end up being one greater than the size of the bitmap. When this
happens, the last stripe in the IO will not be processed as it should
be, resulting in data corruption.
However, this is not normally seen when the backing block devices have
4K physical block sizes since the block layer will split the request
before that happens.
To fix this increase the bitmap size by one bit and ensure the full
number of stripes are checked when calling find_first_bit().
Reported-by: David Sloan <David.Sloan@eideticom.com> Fixes: 53843dc9902f ("md/raid5: Pivot raid5_make_request()") Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Tue, 19 Jul 2022 04:27:24 +0000 (12:27 +0800)]
bcache: remove EXPERIMENTAL for Kconfig option 'Asynchronous device registration'
The "Asynchronous device registration (EXPERIMENTAL)" Kconfig option is
for 2+ years, it is used when registration takes too much time for
massive amount of cached data, to avoid udev task timeout during boot
time.
Many users and products enable this Kconfig option for quite long time
(e.g. SUSE Linux) and it works as expected and no issue reported.
It is time to remove the "EXPERIMENTAL" tag from this Kconfig item.
commit 78bd39615e0b ("nbd: use pr_err to output error message") tries
to define pr_fmt and use short pr_err() to output error message,
however, the definition is missed.
This patch also remove existing "nbd:" inside pr_err().
Dan Carpenter [Fri, 15 Jul 2022 08:12:14 +0000 (11:12 +0300)]
null_blk: fix ida error handling in null_add_dev()
There needs to be some error checking if ida_simple_get() fails.
Also call ida_free() if there are errors later.
Fixes: 211c7e757b27 ("nullb: use ida to manage index") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/YtEhXsr6vJeoiYhd@kili Signed-off-by: Jens Axboe <axboe@kernel.dk>
Guixin Liu [Fri, 8 Jul 2022 03:06:05 +0000 (11:06 +0800)]
nvme-apple: use nvme core helper to cancel requests in tagset
Use nvme core helper nvme_cancel_tagset and nvme_cancel_admin_tagset
instead of same logic code.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Ruozhu Li <liruozhu@huawei.com> Reviewed-by: Sven Peter <sven@svenpeter.dev> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, command data is only sent in-capsule on the for admin or I/O
commands on queues that indicate support for it. Send fabrics command
data in-capsule for I/O queues as well to avoid needing a separate
H2CData PDU for the connect command.
This is optimization. Without this change, we send the connect command
capsule and data in separate PDUs (CapsuleCmd and H2CData), and must wait
for the controller to respond with an R2T PDU before sending the H2CData.
With the change, we send a single CapsuleCmd PDU that includes the data.
This reduces the number of bytes (and likely packets) sent across the network,
and simplifies the send state machine handling in the driver.
Israel Rukshin [Sun, 15 May 2022 15:04:40 +0000 (18:04 +0300)]
nvme-rdma: remove timeout for getting RDMA-CM established event
In case many controllers start error recovery at the same time (i.e.,
when port is down and up), they may never succeed to reconnect again.
This is because the target can't handle all the connect requests at
three seconds (the arbitrary value set today). Even if some of the
connections are established, when a single queue fails to connect,
all the controller's queues are destroyed as well. So, on the
following reconnection attempts the number of connect requests may
remain the same. To fix this, remove the timeout and wait for RDMA-CM
event to abort/complete the connect request. RDMA-CM sends unreachable
event when a timeout of ~90 seconds is expired. This approach is used
at other RDMA-CM users like SRP and iSER at blocking mode. The commit
also renames NVME_RDMA_CONNECT_TIMEOUT_MS to NVME_RDMA_CM_TIMEOUT_MS.
Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
block/rnbd-srv: Replace sess_dev_list with index_idr
The structure rnbd_srv_session maintains a list and an xarray of
rnbd_srv_dev. There is no need to keep both as one of them can serve the
purpose.
Since one of the places where the lookup of rnbd_srv_dev using
rnbd_srv_session is IO path, an xarray would serve us better than a list
traversal. Hence remove sess_dev_list from rnbd_srv_session, and replace
its uses from xarray.
block/rnbd-srv: Set keep_id to true after mutex_trylock
After setting keep_id if the mutex trylock fails, the keep_id stays set
for the rest of the sess_dev lifetime.
Therefore, set keep_id to true after mutex_trylock succeeds, so that a
failure of trylock does'nt touch keep_id.
Fixes: ba30fa27397a ("block/rnbd-srv: Prevent a deadlock generated by accessing sysfs in parallel") Cc: gi-oh.kim@ionos.com Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Link: https://lore.kernel.org/r/20220707143122.460362-2-haris.iqbal@ionos.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hannes Reinecke [Mon, 27 Jun 2022 09:52:07 +0000 (11:52 +0200)]
nvmet-auth: expire authentication sessions
Each authentication step is required to be completed within the
KATO interval (or two minutes if not set). So add a workqueue function
to reset the transaction ID and the expected next protocol step;
this will automatically the next authentication command referring
to the terminated authentication.
Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hannes Reinecke [Mon, 27 Jun 2022 09:52:06 +0000 (11:52 +0200)]
nvmet-auth: Diffie-Hellman key exchange support
Implement Diffie-Hellman key exchange using FFDHE groups for NVMe
In-Band Authentication.
This patch adds a new host configfs attribute 'dhchap_dhgroup' to
select the FFDHE group to use.
Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hannes Reinecke [Mon, 27 Jun 2022 09:52:05 +0000 (11:52 +0200)]
nvmet: implement basic In-Band Authentication
Implement NVMe-oF In-Band authentication according to NVMe TPAR 8006.
This patch adds three additional configfs entries 'dhchap_key',
'dhchap_ctrl_key', and 'dhchap_hash' to the 'host' configfs directory.
The 'dhchap_key' and 'dhchap_ctrl_key' entries need to be in the ASCII
format as specified in NVMe Base Specification v2.0 section 8.13.5.8
'Secret representation'.
'dhchap_hash' defaults to 'hmac(sha256)', and can be written to to
switch to a different HMAC algorithm.
Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hannes Reinecke [Mon, 27 Jun 2022 09:52:04 +0000 (11:52 +0200)]
nvmet: parse fabrics commands on io queues
Some fabrics commands can be sent via io queues, so add a new
function nvmet_parse_fabrics_io_cmd() and rename the existing
nvmet_parse_fabrics_cmd() to nvmet_parse_fabrics_admin_cmd().
Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hannes Reinecke [Mon, 27 Jun 2022 09:52:02 +0000 (11:52 +0200)]
nvme: implement In-Band authentication
Implement NVMe-oF In-Band authentication according to NVMe TPAR 8006.
This patch adds two new fabric options 'dhchap_secret' to specify the
pre-shared key (in ASCII respresentation according to NVMe 2.0 section
8.13.5.8 'Secret representation') and 'dhchap_ctrl_secret' to specify
the pre-shared controller key for bi-directional authentication of both
the host and the controller.
Re-authentication can be triggered by writing the PSK into the new
controller sysfs attribute 'dhchap_secret' or 'dhchap_ctrl_secret'.
Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
[axboe: fold in clang build fix] Signed-off-by: Jens Axboe <axboe@kernel.dk>
When qid value of the function parameter __nvme_submit_sync_cmd() is > 0
from above callers, we use blk_mq_alloc_request_hctx(), where we pass
last parameter as 0 if qid functional parameter value is set to 0 with
conditional operators, see 1002 :-
But qid function parameter value of the __nvme_submit_sync_cmd() will
never be 0 from above caller list see [1], and all the other callers of
__nvme_submit_sync_cmd() use NVME_QID_ANY as qid value :-
1. nvme_submit_sync_cmd()
2. nvme_features()
3. nvme_sec_submit()
4. nvmf_reg_read32()
5. nvmf_reg_read64()
6. nvmf_ref_write32()
7. nvmf_connect_admin_queue()
Remove the conditional operator to pass the qid as 0 in the call to
blk_mq_alloc_requst_hctx().
Michael Kelley [Wed, 8 Jun 2022 18:52:21 +0000 (11:52 -0700)]
nvme: handle the persistent internal error AER
In the NVM Express Revision 1.4 spec, Figure 145 describes possible
values for an AER with event type "Error" (value 000b). For a
Persistent Internal Error (value 03h), the host should perform a
controller reset.
Add support for this error using code that already exists for
doing a controller reset. As part of this support, introduce
two utility functions for parsing the AER type and subtype.
This new support was tested in a lab environment where we can
generate the persistent internal error on demand, and observe
both the Linux side and NVMe controller side to see that the
controller reset has been done.
Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, process_msg_open_rsp checks if capacity changed or not before
call rnbd_clt_change_capacity while the checking also make sense for
rnbd_clt_resize_dev_store, let's move the checking into the function.
Previously, both map and remap trigger rnbd_clt_set_dev_attr to set
some members in rnbd_clt_dev such as wc, fua and logical_block_size
etc, but those members are only useful for map scenario given the
setup_request_queue is only called from the path:
rnbd_clt_map_device -> rnbd_client_setup_device
Since rnbd_clt_map_device frees rsp after rnbd_client_setup_device,
we can pass rsp to rnbd_client_setup_device and it's callees, which
means queue's attributes can be set directly from relevant members
of rsp instead from rnbd_clt_dev.
After that, we can kill 11 members from rnbd_clt_dev, and we don't
need rnbd_clt_set_dev_attr either.
rnbd-clt: don't free rsp in msg_open_conf for map scenario
For map scenario, rsp is freed in two places:
1. msg_open_conf frees rsp if rtrs_clt_request returns 0.
2. Otherwise, rsp is freed by the call sites of rtrs_clt_request.
Now, We'd like to control full lifecycle of rsp in rnbd_clt_map_device,
with that, it is feasible to pass rsp to rnbd_client_setup_device in
next commit.
For 1, it is possible to free rsp from the caller of send_usr_msg
because of the synchronization of iu->comp.wait. And we put iu later
in rnbd_clt_map_device to ensure order of release rsp and iu.
rnbd-clt: open code send_msg_open in rnbd_clt_map_device
Let's open code it in rnbd_clt_map_device, then we can use information
from rsp to setup gendisk and request_queue in next commits. After that,
we can remove some members (wc, fua and max_hw_sectors etc) from struct
rnbd_clt_dev.
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:45 +0000 (13:19 -0600)]
md/raid5: Increase restriction on max segments per request
The block layer defaults the maximum segments to 128, which means
requests tend to get split around the 512KB depending on how many
pages can be merged. There's no such restriction in the raid5 code
so increase the limit to USHRT_MAX so that larger requests can be
sent as one.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:43 +0000 (13:19 -0600)]
md/raid5: Pivot raid5_make_request()
raid5_make_request() loops through every page in the request,
finds the appropriate stripe and adds the bio for that page in the
disk.
This causes a great deal of contention on the hash_lock and extra
work seeing each stripe must be found once for every data disk.
The number of times a stripe must be found can be reduced by pivoting
raid5_make_request() so that it loops through every stripe and then
loops through every disk in that stripe to see if the bio must be
added. This reduces the number of times the hash lock must be taken
by a factor equal to the number of data disks.
To accomplish this, the logical sectors that have already been added
must be tracked. Tracking them is done with a bitmap: the bits
for all pages are set at the start of the request and each bit
is cleared once the bio is added to a stripe.
Finding the next sector to be done is then just a call to
find_first_bit() so that sectors that have been done can simply be
skipped.
One minor downside is that the maximum sectors for a request must be
limited so that the bitmap can be appropriately sized on the stack.
This limit is arbitrarily chosen to be 256 stripe pages which works out
to 1MB if PAGE_SIZE == DEFAULT_STRIPE_SIZE. This doesn't actually
restrict the maximum request further seeing the default block queue
settings are used which restricts the number of segments to 128 (which
results in request sizes that are approximately 512KB).
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:42 +0000 (13:19 -0600)]
md/raid5: Check all disks in a stripe_head for reshape progress
When testing if a previous stripe has had reshape expand past it, use
the earliest or latest logical sector in all the disks for that stripe
head. This will allow adding multiple disks at a time in a subesquent
patch.
To do this cleaner, refactor the check into a helper function called
stripe_ahead_of_reshape().
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:41 +0000 (13:19 -0600)]
md/raid5: Refactor add_stripe_bio()
Factor out two helper functions from add_stripe_bio(): one to check for
overlap (stripe_bio_overlaps()), and one to actually add the bio to the
stripe (__add_stripe_bio()). The latter function will always succeed.
This will be useful in the next patch so that overlap can be checked for
multiple disks before adding any
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:40 +0000 (13:19 -0600)]
md/raid5: Keep a reference to last stripe_head for batch
When batching, every stripe head has to find the previous stripe head to
add to the batch list. This involves taking the hash lock which is
highly contended during IO.
Instead of finding the previous stripe_head each time, store a
reference to the previous stripe_head in a pointer so that it doesn't
require taking the contended lock another time.
The reference to the previous stripe must be released before scheduling
and waiting for work to get done. Otherwise, it can hold up
raid5_activate_delayed() and deadlock.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:37 +0000 (13:19 -0600)]
md/raid5: Drop the do_prepare flag in raid5_make_request()
prepare_to_wait() can be reasonably called after schedule instead of
setting a flag and preparing in the next loop iteration.
This means that prepare_to_wait() will be called before
read_seqcount_begin(), but there shouldn't be any reason that the order
matters here. On the first iteration of the loop prepare_to_wait() is
already called first.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:36 +0000 (13:19 -0600)]
md/raid5: Factor out helper from raid5_make_request() loop
Factor out the inner loop of raid5_make_request() into it's own helper
called make_stripe_request().
The helper returns a number of statuses: SUCCESS, RETRY,
SCHEDULE_AND_RETRY and FAIL. This makes the code a bit easier to
understand and allows the SCHEDULE_AND_RETRY path to be made common.
A context structure is added to contain do_flush. It will be used
more in subsequent patches for state that needs to be kept
outside the loop.
No functional changes intended. This will be cleaned up further in
subsequent patches to untangle the gen_lock and do_prepare logic
further.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:34 +0000 (13:19 -0600)]
md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
stripe_add_to_batch_list() is better done in the loop in make_request
instead of inside add_stripe_bio(). This is clearer and allows for
storing the batch_head state outside the loop in a subsequent patch.
The call to add_stripe_bio() in retry_aligned_read() is for read
and batching only applies to write. So it's impossible for batching
to happen at that call site.
No functional changes intended.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:32 +0000 (13:19 -0600)]
md/raid5: Factor out ahead_of_reshape() function
There are a few uses of an ugly ternary operator in raid5_make_request()
to check if a sector is a head of a reshape sector.
Factor this out into a simple helper called ahead_of_reshape().
No functional changes intended.
Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:31 +0000 (13:19 -0600)]
md/raid5: Make logic blocking check consistent with logic that blocks
The check in raid5_make_request differs very slightly from the logic
that causes it to block lower down. This likely does not cause a bug
as the check is fuzzy anyway (as reshape may move on between the first
check and the subsequent check). However, make it consistent so it can
be cleaned up in a subsequent patch.
Guoqing Jiang [Tue, 21 Jun 2022 03:11:29 +0000 (11:11 +0800)]
md: unlock mddev before reap sync_thread in action_store
Since the bug which commit d0d1827b64334 ("md: don't unregister sync_thread
with reconfig_mutex held") fixed is related with action_store path, other
callers which reap sync_thread didn't need to be changed.
Let's pull md_unregister_thread from md_reap_sync_thread, then fix previous
bug with belows.
1. unlock mddev before md_reap_sync_thread in action_store.
2. save reshape_position before unlock, then restore it to ensure position
not changed accidentally by others.
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Boot-time assembly of arrays with md= command-line arguments breaks when
CONFIG_BLOCK_LEGACY_AUTOLOAD is unset. md_setup_drive() in md-autodetect.c
calls blkdev_get_by_dev(), assuming this implicitly creates the block
device.
Fix this by attempting to md_alloc() the array first. As in the probe path,
ignore any error as failure is caught by blkdev_get_by_dev() anyway.
Signed-off-by: Chris Webb <chris@arachsys.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:56 +0000 (10:27 -0600)]
md: Notify sysfs sync_completed in md_reap_sync_thread()
The mdadm test 07layouts randomly produces a kernel hung task deadlock.
The deadlock is caused by the suspend_lo/suspend_hi files being set by
the mdadm background process during reshape and not being cleared
because the process hangs. (Leaving aside the issue of the fragility of
freezing kernel tasks by buggy userspace processes...)
When the background mdadm process hangs it, is waiting (without a
timeout) on a change to the sync_completed file signalling that the
reshape has completed. The process is woken up a couple times when
the reshape finishes but it is woken up before MD_RECOVERY_RUNNING
is cleared so sync_completed_show() reports 0 instead of "none".
To fix this, notify the sysfs file in md_reap_sync_thread() after
MD_RECOVERY_RUNNING has been cleared. This wakes up mdadm and causes
it to continue and write to suspend_lo/suspend_hi to allow IO to
continue.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:55 +0000 (10:27 -0600)]
md: Ensure resync is reported after it starts
The 07layouts test in mdadm fails on some systems. The failure
presents itself as the backup file not being removed before the next
layout is grown into:
This is because the background mdadm process, which is responsible for
cleaning up this backup file gets into an infinite loop waiting for
the reshape to start. mdadm checks the mdstat file if a reshape is
going and, if it is not, it waits for an event on the file or times
out in 5 seconds. On faster machines, the reshape may complete before
the 5 seconds times out, and thus the background mdadm process loops
waiting for a reshape to start that has already occurred.
mdadm reads the mdstat file to start, but mdstat does not report that the
reshape has begun, even though it has indeed begun. So the mdstat_wait()
call (in mdadm) which polls on the mdstat file won't ever return until
timing out.
The reason mdstat reports the reshape has started is due to an issue
in status_resync(). recovery_active is subtracted from curr_resync which
will result in a value of zero for the first chunk of reshaped data, and
the resulting read will report no reshape in progress.
To fix this, if "resync - recovery_active" is an overloaded value, force
the value to be MD_RESYNC_ACTIVE so the code reports a resync in progress.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:54 +0000 (10:27 -0600)]
md: Use enum for overloaded magic numbers used by mddev->curr_resync
Comments in the code document special values used for
mddev->curr_resync. Make this clearer by using an enum to label these
values.
The only functional change is a couple places use the wrong comparison
operator that implied 3 is another special value. They are all
fixed to imply that 3 or greater is an active resync.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:53 +0000 (10:27 -0600)]
md/raid5-cache: Annotate pslot with __rcu notation
radix_tree_lookup_slot() and radix_tree_replace_slot() API expect the
slot returned and looked up to be marked with __rcu. Otherwise
sparse warnings are generated:
drivers/md/raid5-cache.c:2939:23: warning: incorrect type in
assignment (different address spaces)
drivers/md/raid5-cache.c:2939:23: expected void **pslot
drivers/md/raid5-cache.c:2939:23: got void [noderef] __rcu **
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:51 +0000 (10:27 -0600)]
md/raid5-cache: Drop RCU usage of conf->log
The only place that uses RCU to access conf->log is in
r5l_log_disk_error(). This function is mostly used in the IO path
and once with mddev_lock() held in raid5_change_consistency_policy().
It is known that the IO will be suspended before the log is freed and
r5l_log_exit() is called with the mddev_lock() held.
This should mean that conf->log can not be freed while the function is
being called, so the RCU protection is not necessary. Drop the
rcu_read_lock() as well as the synchronize_rcu() and
rcu_assign_pointer() usage.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:49 +0000 (10:27 -0600)]
md/raid5: suspend the array for calls to log_exit()
The raid5-cache code relies on there being no IO in flight when
log_exit() is called. There are two places where this is not
guaranteed so add mddev_suspend() and mddev_resume() calls to these
sites.
The site in raid5_change_consistency_policy() is in the error path,
and another similar call site already has suspend/resume calls just
below it; so it should be equally safe to make that change here.
There is one remaining site in raid5_remove_disk() that we call log_exit()
without suspending the array. Unfortunately, as the comment stated, we
cannot call mddev_suspend from raid5d.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:48 +0000 (10:27 -0600)]
md/raid5-ppl: Drop unused argument from ppl_handle_flush_request()
ppl_handle_flush_request() takes an struct r5log argument but doesn't
use it. It has no buisiness taking this argument as it is only used
by raid5-cache and has no way to derference it anyway. Remove
the argument.
No functional changes intended.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lars Ellenberg [Wed, 22 Jun 2022 20:49:32 +0000 (22:49 +0200)]
drbd: bm_page_async_io: fix spurious bitmap "IO error" on large volumes
We usually do all our bitmap IO in units of PAGE_SIZE.
With very small or oddly sized external meta data, or with
PAGE_SIZE != 4k, it can happen that our last on-disk bitmap page
is not fully PAGE_SIZE aligned, so we may need to adjust the size
of the IO.
We used to do that with
min_t(unsigned int, PAGE_SIZE,
last_allowed_sector - current_offset);
And for just the right diff, (unsigned int)(diff) will result in 0.
A bio of length 0 will correctly be rejected with an IO error
(and some scary WARN_ON_ONCE()) by the scsi layer.
Linus Torvalds [Tue, 2 Aug 2022 21:21:25 +0000 (14:21 -0700)]
Merge tag 'for-6.0/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Refactor DM core's mempool allocation so that it clearer by not being
split acorss files.
- Improve DM core's BLK_STS_DM_REQUEUE and BLK_STS_AGAIN handling.
- Optimize DM core's more common bio splitting by eliminating the use
of bio cloning with bio_split+bio_chain. Shift that cloning cost to
the relatively unlikely dm_io requeue case that only occurs during
error handling. Introduces dm_io_rewind() that will clone a bio that
reflects the subset of the original bio that must be requeued.
- Remove DM core's dm_table_get_num_targets() wrapper and audit all
dm_table_get_target() callers.
- Fix potential for OOM with DM writecache target by setting a default
MAX_WRITEBACK_JOBS (set to 256MiB or 1/16 of total system memory,
whichever is smaller).
- Fix DM writecache target's stats that are reported through
DM-specific table info.
- Fix use-after-free crash in dm_sm_register_threshold_callback().
- Refine DM core's Persistent Reservation handling in preparation for
broader work Mike Christie is doing to add compatibility with
Microsoft Windows Failover Cluster.
- Fix various KASAN reported bugs in the DM raid target.
- Fix DM raid target crash due to md_handle_request() bio splitting
that recurses to block core without properly initializing the bio's
bi_dev.
- Fix some code comment typos and fix some Documentation formatting.
* tag 'for-6.0/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (29 commits)
dm: fix dm-raid crash if md_handle_request() splits bio
dm raid: fix address sanitizer warning in raid_resume
dm raid: fix address sanitizer warning in raid_status
dm: Start pr_preempt from the same starting path
dm: Fix PR release handling for non All Registrants
dm: Start pr_reserve from the same starting path
dm: Allow dm_call_pr to be used for path searches
dm: return early from dm_pr_call() if DM device is suspended
dm thin: fix use-after-free crash in dm_sm_register_threshold_callback
dm writecache: count number of blocks discarded, not number of discard bios
dm writecache: count number of blocks written, not number of write bios
dm writecache: count number of blocks read, not number of read bios
dm writecache: return void from functions
dm kcopyd: use __GFP_HIGHMEM when allocating pages
dm writecache: set a default MAX_WRITEBACK_JOBS
Documentation: dm writecache: Render status list as list
Documentation: dm writecache: add blank line before optional parameters
dm snapshot: fix typo in snapshot_map() comment
dm raid: remove redundant "the" in parse_raid_params() comment
dm cache: fix typo in 2 comment blocks
...