James Smart [Wed, 13 Mar 2019 17:55:03 +0000 (18:55 +0100)]
nvmet-fc: fix issues with targetport assoc_list list walking
There are two changes:
1) The logic in the __nvmet_fc_free_assoc() routine is bad. It uses
"safe" routines assuming pointers will come back valid. However, the
intervening next structure being linked can be removed from the list and
the resulting safe pointers are bad, resulting in NULL ptrs being hit.
Correct by scheduling a work element to perform the association delete,
which can be done while under lock.
2) Prior patch that added the work element scheduling left a possible
reference on the object if the work element couldn't be scheduled.
Correct by doing the put on a failing schedule_work() call.
Signed-off-by: Nigel Kirkland <nigel.kirkland@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
James Smart [Wed, 13 Mar 2019 17:55:02 +0000 (18:55 +0100)]
nvme-fc: reject reconnect if io queue count is reduced to zero
If:
- A successful connect has occurred with an io queue count greater than
zero and namespaces detected and running.
- An error or something occurs which causes a termination of the prior
association and then starts a reconnect,
- The reconnect then creates a new controller, but for whatever reason,
nvme_set_queue_count() results in io queue count set to zero. This
will skip io queue and tag set changes.
- But... the controller will transition to live, calling
nvme_start_ctrl, which calls nvme_start_queues(), which then releases
I/Os into the transport which then sends them to the driver.
As there are no queues, things eventually hit the driver looking for a
handle, which was cleared when the original controller was reset, and it
can't proceed. Worst case, things progress, but everything fails.
In the failing scenario, the nvme_set_features(NVME_FEAT_NUM_QUEUES)
command actually failed with a NVME_SC_INTERNAL error. For some reason,
although nvme_set_queue_count() saw the error and set io queue count to
zero, it doesn't return a failure status to the transport, which allows
the transport to continue using the controller.
Fix the problem by simply rejecting the new association if at least 1
I/O queue can't be created. The association reject will fail the
reconnect attempt and fall into the reconnect retry policy.
Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
James Smart [Wed, 13 Mar 2019 17:55:01 +0000 (18:55 +0100)]
nvme-fc: fix numa_node when dev is null
A recent change added a numa_node field to the nvme controller
and has the transport assign the node using dev_to_node().
However, fcloop registers with a NULL device struct, so the
dev_to_node() call oops.
Revise the assignment to assign no node when device struct is null.
Fixes: 103e515efa89b ("nvme: add a numa_node field to struct nvme_ctrl") Reported-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com>
[hch: small coding style fixup] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
James Smart [Wed, 13 Mar 2019 17:55:00 +0000 (18:55 +0100)]
nvme-fc: use nr_phys_segments to determine existence of sgl
For some nvme command, when issued by the nvme core layer, there
is an internal buffer which can cause blk_rq_payload_bytes() to
return a non-zero value yet there is no actual/real command payload
and sg list. An example is the WRITE ZEROES command.
To address this, when making choices on whether to dma map an sgl,
use blk_rq_nr_phys_segments() instead of blk_rq_payload_bytes().
When there is a sgl, blk_rq_payload_bytes() will return the amount
of data to be transferred by the sgl.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yufen Yu [Wed, 13 Mar 2019 17:54:58 +0000 (18:54 +0100)]
nvme: update comment to make the code easier to read
After commit a686ed75c0fb ("nvme: introduce a helper function for
controller deletion), nvme_delete_ctrl_sync no longer use flush_work.
Update comment, accordingly.
Sagi Grimberg [Wed, 13 Mar 2019 17:54:57 +0000 (18:54 +0100)]
nvme: put ns_head ref if namespace fails allocation
In case nvme_alloc_ns fails after we initialize ns_head but before we
add the ns to the controller namespaces list we need to explicitly put
the ns_head reference because when we teardown the controller we
won't find it, causing us to leak a dangling subsystem eventually.
Keith Busch [Wed, 13 Mar 2019 17:54:55 +0000 (18:54 +0100)]
nvme: don't warn on block content change effects
A write or flush IO passthrough command is expected to change the
logical block content, so don't warn on these as no additional handling
is necessary.
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 13 Mar 2019 16:47:25 +0000 (10:47 -0600)]
Merge branch 'for-5.1/md-post' of https://github.com/liu-song-6/linux into for-5.1/block-post
Pull MD fixes from Song.
* 'for-5.1/md-post' of https://github.com/liu-song-6/linux:
md: Fix failed allocation of md_register_thread
It's wrong to add len to sector_nr in raid10 reshape twice
raid5: set write hint for PPL
Xiao Ni [Fri, 8 Mar 2019 15:52:05 +0000 (23:52 +0800)]
It's wrong to add len to sector_nr in raid10 reshape twice
In reshape_request it already adds len to sector_nr already. It's wrong to add len to
sector_nr again after adding pages to bio. If there is bad block it can't copy one chunk
at a time, it needs to goto read_more. Now the sector_nr is wrong. It can cause data
corruption.
Cc: stable@vger.kernel.org # v3.16+ Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
When the Partial Parity Log is enabled, circular buffer is used to store
PPL data. Each write to RAID device causes overwrite of data in this buffer
so some write_hint can be set to those request to help drives handle
garbage collection. This patch adds new sysfs attribute which can be used
to specify which write_hint should be assigned to PPL.
Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com> Signed-off-by: Song Liu <songliubraving@fb.com>
Javier González [Thu, 7 Mar 2019 12:18:53 +0000 (13:18 +0100)]
pblk: fix max_io calculation
When calculating the maximun I/O size allowed into the buffer, consider
the write size (ws_opt) used by the write thread in order to cover the
case in which, due to flushes, the mem and subm pointers are disaligned
by (ws_opt - 1). This case currently translates into a stall when
an I/O of the largest possible size is submitted.
Fixes: f9f9d1ae2c66 ("lightnvm: pblk: prevent stall due to wb threshold") Signed-off-by: Javier González <javier@javigon.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sun, 3 Mar 2019 13:17:48 +0000 (21:17 +0800)]
block: fix segment calculation for passthrough IO
blk_recount_segments() can be called in bio_add_pc_page() for
calculating how many segments this bio will has after one page is added
to this bio. If the resulted segment number is beyond the queue limit,
the added page will be removed.
The try-and-fix policy requires blk_recount_segments(__blk_recalc_rq_segments)
to not consider the segment number limit. Unfortunately bvec_split_segs()
does check this limit, and causes small segment number returned to
bio_add_pc_page(), then page still may be added to the bio even though
segment number limit becomes broken.
Fixes this issue by not considering segment number limit when calcualting
bio's segment number.
Fixes: dcebd755926b ("block: use bio_for_each_bvec() to compute multi-page bvec count") Cc: Christoph Hellwig <hch@lst.de> Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 6 Mar 2019 16:41:54 +0000 (09:41 -0700)]
Merge branch 'stable/for-jens-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-5.1/block-post
Pull two xen blkback fixes from Konrad.
* 'stable/for-jens-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront
xen/blkback: add stack variable 'blkif' in connect_ring()
Keyur Patel [Sun, 17 Feb 2019 15:21:56 +0000 (10:21 -0500)]
block: Replace function name in string with __func__
Replace hard coded function name register_blkdev with __func__, to
improve robustness and to conform to the Linux kernel coding
style. Issue found using checkpatch.
zhengbin [Wed, 20 Feb 2019 13:27:05 +0000 (21:27 +0800)]
block: fix NULL pointer dereference in register_disk
If __device_add_disk-->bdi_register_owner-->bdi_register-->
bdi_register_va-->device_create_vargs fails, bdi->dev is still
NULL, __device_add_disk-->register_disk will visit bdi->dev->kobj.
This patch fixes that.
Carlos Maiolino [Tue, 26 Feb 2019 10:51:50 +0000 (11:51 +0100)]
fs: fix guard_bio_eod to check for real EOD errors
guard_bio_eod() can truncate a segment in bio to allow it to do IO on
odd last sectors of a device.
It already checks if the IO starts past EOD, but it does not consider
the possibility of an IO request starting within device boundaries can
contain more than one segment past EOD.
In such cases, truncated_bytes can be bigger than PAGE_SIZE, and will
underflow bvec->bv_len.
Fix this by checking if truncated_bytes is lower than PAGE_SIZE.
This situation has been found on filesystems such as isofs and vfat,
which doesn't check the device size before mount, if the device is
smaller than the filesystem itself, a readahead on such filesystem,
which spans EOD, can trigger this situation, leading a call to
zero_user() with a wrong size possibly corrupting memory.
I didn't see any crash, or didn't let the system run long enough to
check if memory corruption will be hit somewhere, but adding
instrumentation to guard_bio_end() to check truncated_bytes size, was
enough to see the error.
block: optimize bvec iteration in bvec_iter_advance
There is no need to only iterate in chunks of PAGE_SIZE or less in
bvec_iter_advance, given that the callers pass in the chunk length that
they are operating on - either that already is less than PAGE_SIZE
because they do classic page-based iteration, or it is larger because
the caller operates on multi-page bvecs.
This should help shaving off a few cycles of the I/O hot path.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 27 Feb 2019 12:40:13 +0000 (20:40 +0800)]
block: introduce mp_bvec_for_each_page() for iterating over page
mp_bvec_for_each_segment() is a bit big for the iteration, so introduce
a light-weight helper for iterating over pages, then 32bytes stack
space can be saved.
Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 27 Feb 2019 12:40:10 +0000 (20:40 +0800)]
block: introduce bvec_nth_page()
Single-page bvec can often be seen in small BS workloads, so
introduce bvec_nth_page() for avoiding to call nth_page() unnecessarily,
which looks not cheap.
Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Store the request queue the last bio was submitted to in the iocb
private data in addition to the cookie so that we find the right block
device. Also refactor the common direct I/O bio submission code into a
nice little helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Modified to use bio_set_polled().
Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 21 Dec 2018 16:10:46 +0000 (09:10 -0700)]
block: add bio_set_polled() helper
For the upcoming async polled IO, we can't sleep allocating requests.
If we do, then we introduce a deadlock where the submitter already
has async polled IO in-flight, but can't wait for them to complete
since polled requests must be active found and reaped.
Utilize the helper in the blockdev DIRECT_IO code.
Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
fs: add an iopoll method to struct file_operations
This new methods is used to explicitly poll for I/O completion for an
iocb. It must be called for any iocb submitted asynchronously (that
is with a non-null ki_complete) which has the IOCB_HIPRI flag set.
The method is assisted by a new ki_cookie field in struct iocb to store
the polling cookie.
Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dongli Zhang [Sun, 24 Feb 2019 15:17:27 +0000 (10:17 -0500)]
xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront
The xenstore 'ring-page-order' is used globally for each blkback queue and
therefore should be read from xenstore only once. However, it is obtained
in read_per_ring_refs() which might be called multiple times during the
initialization of each blkback queue.
If the blkfront is malicious and the 'ring-page-order' is set in different
value by blkfront every time before blkback reads it, this may end up at
the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
xen_blkif_disconnect() when frontend is destroyed.
This patch reworks connect_ring() to read xenstore 'ring-page-order' only
once.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Dongli Zhang [Fri, 22 Feb 2019 14:10:20 +0000 (22:10 +0800)]
loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
Commit 0da03cab87e6
("loop: Fix deadlock when calling blkdev_reread_part()") moves
blkdev_reread_part() out of the loop_ctl_mutex. However,
GENHD_FL_NO_PART_SCAN is set before __blkdev_reread_part(). As a result,
__blkdev_reread_part() will fail the check of GENHD_FL_NO_PART_SCAN and
will not rescan the loop device to delete all partitions.
Ming Lei [Thu, 21 Feb 2019 15:43:36 +0000 (23:43 +0800)]
block: bounce: make sure that bvec table is updated
Block bounce needs to allocate new page for doing IO, and the
new page has to be updated to bvec table.
Commit 6dc4f100c switches __blk_queue_bounce() to use the new
bio_for_each_segment_all() interface. Unfortunately the new
bio_for_each_segment_all() can't be used to update bvec table.
This patch fixes this issue by retrieving bvec from the table
directly, then the new allocated page can be updated to the bio.
This way is safe because the cloned bio is single page bvec.
Fixes: 6dc4f100c ("block: allow bio_for_each_segment_all() to iterate over multi-page bvec") Cc: Christoph Hellwig <hch@lst.de> Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 21 Feb 2019 17:42:37 +0000 (10:42 -0700)]
Merge branch 'nvme-5.1' of git://git.infradead.org/nvme into for-5.1/block
Pull NVMe changes for 5.1 from Christoph
* 'nvme-5.1' of git://git.infradead.org/nvme: (22 commits)
nvme-rdma: use nr_phys_segments when map rq to sgl
nvmet: convert to SPDX identifiers
nvmet-rdma: convert to SPDX identifiers
nvme-loop: convert to SPDX identifiers
nvmet-fcloop: convert to SPDX identifiers
nvmet-fc: convert to SPDX identifiers
nvme: convert to SPDX identifiers
nvme-pci: convert to SPDX identifiers
nvme-lightnvm: convert to SPDX identifiers
nvme-rdma: convert to SPDX identifiers
nvme-fc: convert to SPDX identifiers
nvme-fabrics: convert to SPDX identifiers
nvme-tcp.h: fix SPDX header
nvme_ioctl.h: remove duplicate GPL boilerplate
nvme: return error from nvme_alloc_ns()
nvme: avoid that deleting a controller triggers a circular locking complaint
nvme: introduce a helper function for controller deletion
nvme: unexport nvme_delete_ctrl_sync()
nvme-pci: check kstrtoint() return value in queue_count_set()
nvme-fabrics: document the poll function argument
...
nvme-rdma: use nr_phys_segments when map rq to sgl
Use blk_rq_nr_phys_segments() instead of blk_rq_payload_bytes() to check
if a command contains data to be mapped. This fixes the case where
a struct request contains LBAs, but it has no payload, such as
Write Zeroes support.
Fixes: 6e02318eaea5 ("nvme: add support for the Write Zeroes command") Reported-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Tested-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Bart Van Assche [Thu, 14 Feb 2019 22:50:57 +0000 (14:50 -0800)]
nvme: avoid that deleting a controller triggers a circular locking complaint
Rework nvme_delete_ctrl_sync() such that it does not have to wait for
queued work. This patch avoids that test nvme/008 triggers the following
complaint:
WARNING: possible circular locking dependency detected
5.0.0-rc6-dbg+ #10 Not tainted
------------------------------------------------------
nvme/7918 is trying to acquire lock: 000000009a1a7b69 ((work_completion)(&ctrl->delete_work)){+.+.}, at: __flush_work+0x379/0x410
but task is already holding lock: 00000000ef5a45b4 (kn->count#389){++++}, at: kernfs_remove_self+0x196/0x210
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
Bart Van Assche [Thu, 14 Feb 2019 22:50:54 +0000 (14:50 -0800)]
nvme-pci: check kstrtoint() return value in queue_count_set()
This patch avoids that the compiler complains about 'ret' being set
but not being used when building with W=1.
Fixes: 3b6592f70ad7 ("nvme: utilize two queue maps, one for reads and one for writes") # v5.0-rc1 Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Mon, 18 Feb 2019 10:43:26 +0000 (11:43 +0100)]
nvme-multipath: round-robin I/O policy
Implement a simple round-robin I/O policy for multipathing. Path
selection is done in two rounds, first iterating across all optimized
paths, and if that doesn't return any valid paths, iterate over all
optimized and non-optimized paths. If no paths are found, use the
existing algorithm. Also add a sysfs attribute 'iopolicy' to switch
between the current NUMA-aware I/O policy and the 'round-robin' I/O
policy.
Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Ming Lei [Tue, 19 Feb 2019 03:08:19 +0000 (11:08 +0800)]
block: avoid to READ fields of null bio
rq->bio can be NULL sometimes, such as flush request, so don't
read bio->bi_seg_front_size until this 'bio' is checked as valid.
Cc: Bart Van Assche <bvanassche@acm.org> Reported-by: Bart Van Assche <bvanassche@acm.org> Fixes: dcebd755926b0f39dd1e ("block: use bio_for_each_bvec() to compute multi-page bvec count") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 15 Feb 2019 15:43:59 +0000 (08:43 -0700)]
Merge tag 'v5.0-rc6' into for-5.1/block
Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c.
This is needed since io_uring is now based on the block branch,
to avoid a conflict between the multi-page bvecs and the bits
of io_uring that touch the core block parts.
* tag 'v5.0-rc6': (525 commits)
Linux 5.0-rc6
x86/mm: Make set_pmd_at() paravirt aware
MAINTAINERS: Update the ocores i2c bus driver maintainer, etc
blk-mq: remove duplicated definition of blk_mq_freeze_queue
Blk-iolatency: warn on negative inflight IO counter
blk-iolatency: fix IO hang due to negative inflight counter
MAINTAINERS: unify reference to xen-devel list
x86/mm/cpa: Fix set_mce_nospec()
futex: Handle early deadlock return correctly
futex: Fix barrier comment
net: dsa: b53: Fix for failure when irq is not defined in dt
blktrace: Show requests without sector
mips: cm: reprime error cause
mips: loongson64: remove unreachable(), fix loongson_poweroff().
sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
geneve: should not call rt6_lookup() when ipv6 was disabled
KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
signal: Better detection of synchronous signals
...
Ming Lei [Fri, 15 Feb 2019 11:13:23 +0000 (19:13 +0800)]
block: kill QUEUE_FLAG_NO_SG_MERGE
Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting"),
physical segment number is mainly figured out in blk_queue_split() for
fast path, and the flag of BIO_SEG_VALID is set there too.
Now only blk_recount_segments() and blk_recalc_rq_segments() use this
flag.
Basically blk_recount_segments() is bypassed in fast path given BIO_SEG_VALID
is set in blk_queue_split().
For another user of blk_recalc_rq_segments():
- run in partial completion branch of blk_update_request, which is an unusual case
- run in blk_cloned_rq_check_limits(), still not a big problem if the flag is killed
since dm-rq is the only user.
Multi-page bvec is enabled now, not doing S/G merging is rather pointless with the
current setup of the I/O path, as it isn't going to save you a significant amount
of cycles.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:19 +0000 (19:13 +0800)]
block: allow bio_for_each_segment_all() to iterate over multi-page bvec
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
then we can allow bio_for_each_segment_all() to iterate over multi-page bvec.
Given it is just one mechannical & simple change on all bio_for_each_segment_all()
users, this patch does tree-wide change in one single patch, so that we can
avoid to use a temporary helper for this conversion.
Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:18 +0000 (19:13 +0800)]
bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()
bch_bio_alloc_pages() is always called on one new bio, so it is safe
to access the bvec table directly. Given it is the only kind of this
case, open code the bvec table access since bio_for_each_segment_all()
will be changed to support for iterating over multipage bvec.
Acked-by: Coly Li <colyli@suse.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:17 +0000 (19:13 +0800)]
block: loop: pass multi-page bvec to iov_iter
iov_iter is implemented on bvec itererator helpers, so it is safe to pass
multi-page bvec to it, and this way is much more efficient than passing one
page in each bvec.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:12 +0000 (19:13 +0800)]
block: use bio_for_each_bvec() to compute multi-page bvec count
First it is more efficient to use bio_for_each_bvec() in both
blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how
many multi-page bvecs there are in the bio.
Secondly once bio_for_each_bvec() is used, the bvec may need to be
splitted because its length can be very longer than max segment size,
so we have to split the big bvec into several segments.
Thirdly when splitting multi-page bvec into segments, the max segment
limit may be reached, so the bio split need to be considered under
this situation too.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:10 +0000 (19:13 +0800)]
block: introduce multi-page bvec helpers
This patch introduces helpers of 'mp_bvec_iter_*' for multi-page bvec
support.
The introduced helpers treate one bvec as real multi-page segment,
which may include more than one pages.
The existed helpers of bvec_iter_* are interfaces for supporting current
bvec iterator which is thought as single-page by drivers, fs, dm and
etc. These introduced helpers will build single-page bvec in flight, so
this way won't break current bio/bvec users, which needn't any change.
Follows some multi-page bvec background:
- bvecs stored in bio->bi_io_vec is always multi-page style
- bvec(struct bio_vec) represents one physically contiguous I/O
buffer, now the buffer may include more than one page after
multi-page bvec is supported, and all these pages represented
by one bvec is physically contiguous. Before multi-page bvec
support, at most one page is included in one bvec, we call it
single-page bvec.
- .bv_page of the bvec points to the 1st page in the multi-page bvec
- .bv_offset of the bvec is the offset of the buffer in the bvec
The effect on the current drivers/filesystem/dm/bcache/...:
- almost everyone supposes that one bvec only includes one single
page, so we keep the sp interface not changed, for example,
bio_for_each_segment() still returns single-page bvec
- bio_for_each_segment_all() will return single-page bvec too
- during iterating, iterator variable(struct bvec_iter) is always
updated in multi-page bvec style, and bvec_iter_advance() is kept
not changed
- returned(copied) single-page bvec is built in flight by bvec
helpers from the stored multi-page bvec
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Feb 2019 11:13:08 +0000 (19:13 +0800)]
block: don't use bio->bi_vcnt to figure out segment number
It is wrong to use bio->bi_vcnt to figure out how many segments
there are in the bio even though CLONED flag isn't set on this bio,
because this bio may be splitted or advanced.
So always use bio_segments() in blk_recount_segments(), and it shouldn't
cause any performance loss now because the physical segment number is figured
out in blk_queue_split() and BIO_SEG_VALID is set meantime since bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting").
Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Fixes: 76d8137a3113 ("blk-merge: recaculate segment if it isn't less than max segments") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_readpage_error currently uses bi_vcnt to decide if it is worth
retrying an I/O. But the vector count is mostly an implementation
artifact - it really should figure out if there is more than a
single sector worth retrying. Use bi_size for that and shift by
PAGE_SHIFT. This really should be blocks/sectors, but given that
btrfs doesn't support a sector size different from the PAGE_SIZE
using the page size keeps the changes to a minimum.
Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Heiner Litz [Mon, 11 Feb 2019 12:25:09 +0000 (13:25 +0100)]
lightnvm: pblk: fix race condition on GC
This patch fixes a race condition where a write is mapped to the last
sectors of a line. The write is synced to the device but the L2P is not
updated yet. When the line is garbage collected before the L2P update
is performed, the sectors are ignored by the GC logic and the line is
freed before all sectors are moved. When the L2P is finally updated, it
contains a mapping to a freed line, subsequent reads of the
corresponding LBAs fail.
This patch introduces a per line counter specifying the number of
sectors that are synced to the device but have not been updated in the
L2P. Lines with a counter of greater than zero will not be selected
for GC.
Signed-off-by: Heiner Litz <hlitz@ucsc.edu> Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by: Javier González <javier@javigon.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Mon, 11 Feb 2019 12:25:08 +0000 (13:25 +0100)]
lightnvm: pblk: prevent stall due to wb threshold
In order to respect mw_cuinits, pblk's write buffer maintains a
backpointer to protect data not yet persisted; when writing to the write
buffer, this backpointer defines a threshold that pblk's rate-limiter
enforces.
On small PU configurations, the following scenarios might take place: (i)
the threshold is larger than the write buffer and (ii) the threshold is
smaller than the write buffer, but larger than the maximun allowed
split bio - 256KB at this moment (Note that writes are not always
split - we only do this when we the size of the buffer is smaller
than the buffer). In both cases, pblk's rate-limiter prevents the I/O to
be written to the buffer, thus stalling.
This patch fixes the original backpointer implementation by considering
the threshold both on buffer creation and on the rate-limiters path,
when bio_split is triggered (case (ii) above).
Fixes: 766c8ceb16fc ("lightnvm: pblk: guarantee that backpointer is respected on writer stall") Signed-off-by: Javier González <javier@javigon.com> Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hans Holmberg [Mon, 11 Feb 2019 12:25:07 +0000 (13:25 +0100)]
lightnvm: pblk: extend line wp balance check
pblk stripes writes of minimal write size across all non-offline chunks
in a line, which means that the maximum write pointer delta should not
exceed the minimal write size.
Extend the line write pointer balance check to cover this case, and
ignore the offline chunk wps.
This will render us a warning during recovery if something unexpected
has happened to the chunk write pointers (i.e. powerloss, a spurious
chunk reset, ..).
Masahiro Yamada [Mon, 11 Feb 2019 12:25:06 +0000 (13:25 +0100)]
lightnvm: pblk: fix TRACE_INCLUDE_PATH
As the comment block in include/trace/define_trace.h says,
TRACE_INCLUDE_PATH should be a relative path to the define_trace.h
../../drivers/lightnvm is the correct relative path.
../../../drivers/lightnvm is working by coincidence because the top
Makefile adds -I$(srctree)/arch/$(SRCARCH)/include as a header
search path, but we should not rely on it.
Andy Shevchenko [Mon, 11 Feb 2019 12:25:04 +0000 (13:25 +0100)]
lightnvm: Use u64 instead of __le64 for CPU visible side
Sparse complains about using strict data types:
drivers/lightnvm/pblk-read.c:254:43: warning: incorrect type in assignment (different base types)
drivers/lightnvm/pblk-read.c:254:43: expected restricted __le64 <noident>
drivers/lightnvm/pblk-read.c:254:43: got unsigned long long [unsigned] [usertype] <noident>
drivers/lightnvm/pblk-read.c:255:29: warning: cast from restricted __le64
drivers/lightnvm/pblk-read.c:268:29: warning: cast from restricted __le64
drivers/lightnvm/pblk-read.c:328:41: warning: incorrect type in assignment (different base types)
drivers/lightnvm/pblk-read.c:328:41: expected restricted __le64 <noident>
drivers/lightnvm/pblk-read.c:328:41: got unsigned long long [unsigned] [usertype] <noident>
In the code it seems explicit that lba_list_mem and lba_list_media members
of struct pblk_pr_ctx are used on CPU side, which means they should not be
of strict types.
Change types of lba_list_mem and lba_list_media members to be u64.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Javier González <javier@javigon.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hans Holmberg [Mon, 11 Feb 2019 12:25:02 +0000 (13:25 +0100)]
lightnvm: pblk: stop taking the free lock in in pblk_lines_free
pblk_line_meta_free might sleep (it can end up calling vfree, depending
on how we allocate lba lists), and this can lead to a BUG()
if we wake up on a different cpu and release the lock.
As there is no point of grabbing the free lock when pblk has shut down,
remove the lock.
Linus Torvalds [Sun, 10 Feb 2019 18:39:37 +0000 (10:39 -0800)]
Merge tag 'dmaengine-fix-5.0-rc6' of git://git.infradead.org/users/vkoul/slave-dma
Pull dmaengine fixes from Vinod Koul:
- Fix in at_xdmac fr wrongful channel state
- Fix for imx driver for wrong callback invocation
- Fix to bcm driver for interrupt race & transaction abort.
- Fix in dmatest to abort in mapping error
* tag 'dmaengine-fix-5.0-rc6' of git://git.infradead.org/users/vkoul/slave-dma:
dmaengine: dmatest: Abort test in case of mapping error
dmaengine: bcm2835: Fix abort of transactions
dmaengine: bcm2835: Fix interrupt race on RT
dmaengine: imx-dma: fix wrong callback invoke
dmaengine: at_xdmac: Fix wrongfull report of a channel as in use
Linus Torvalds [Sun, 10 Feb 2019 17:57:42 +0000 (09:57 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"A handful of fixes:
- Fix an MCE corner case bug/crash found via MCE injection testing
- Fix 5-level paging boot crash
- Fix MCE recovery cache invalidation bug
- Fix regression on Xen guests caused by a recent PMD level mremap
speedup optimization"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Make set_pmd_at() paravirt aware
x86/mm/cpa: Fix set_mce_nospec()
x86/boot/compressed/64: Do not corrupt EDX on EFER.LME=1 setting
x86/MCE: Initialize mce.bank in the case of a fatal error in mce_no_way_out()
Linus Torvalds [Sun, 10 Feb 2019 17:54:19 +0000 (09:54 -0800)]
Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Ingo Molnar:
"irqchip driver fixes: most of them are race fixes for ARM GIC (General
Interrupt Controller) variants, but also a fix for the ARM MMP
(Marvell PXA168 et al) irqchip affecting OLPC keyboards"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3-its: Fix ITT_entry_size accessor
irqchip/mmp: Only touch the PJ4 IRQ & FIQ bits on enable/disable
irqchip/gic-v3-its: Gracefully fail on LPI exhaustion
irqchip/gic-v3-its: Plug allocation race for devices sharing a DevID
irqchip/gic-v4: Fix occasional VLPI drop
Linus Torvalds [Sun, 10 Feb 2019 17:48:18 +0000 (09:48 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"A couple of kernel side fixes:
- Fix the Intel uncore driver on certain hardware configurations
- Fix a CPU hotplug related memory allocation bug
- Remove a spurious WARN()
... plus also a handful of perf tooling fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf script python: Add Python3 support to tests/attr.py
perf trace: Support multiple "vfs_getname" probes
perf symbols: Filter out hidden symbols from labels
perf symbols: Add fallback definitions for GELF_ST_VISIBILITY()
tools headers uapi: Sync linux/in.h copy from the kernel sources
perf clang: Do not use 'return std::move(something)'
perf mem/c2c: Fix perf_mem_events to support powerpc
perf tests evsel-tp-sched: Fix bitwise operator
perf/core: Don't WARN() for impossible ring-buffer sizes
perf/x86/intel: Delay memory deallocation until x86_pmu_dead_cpu()
perf/x86/intel/uncore: Add Node ID mask
blk-sysfs: Rework documention of __blk_release_queue
The Notes section of the comment was removed, because now
blk_release_queue can only be executed from blk_cleanup_queue (being
called when the q->kobj reaches zero), and also blk_init_queue was removed
in a1ce35fa4985.
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since 4cf6324b17e9, a portion of function blk_cleanup_queue was moved to
a newly created function called blk_exit_queue, including the call of
blkcg_exit_queue. So, adjust the documenation according.
Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>