Dongli Zhang [Wed, 27 May 2020 16:13:52 +0000 (09:13 -0700)]
nvme-pci: avoid race between nvme_reap_pending_cqes() and nvme_poll()
There may be a race between nvme_reap_pending_cqes() and nvme_poll(), e.g.,
when doing live reset while polling the nvme device.
CPU X CPU Y
nvme_poll()
nvme_dev_disable()
-> nvme_stop_queues()
-> nvme_suspend_io_queues()
-> nvme_suspend_queue()
-> spin_lock(&nvmeq->cq_poll_lock);
-> nvme_reap_pending_cqes()
-> nvme_process_cq() -> nvme_process_cq()
In the above scenario, the nvme_process_cq() for the same queue may be
running on both CPU X and CPU Y concurrently.
It is much more easier to reproduce the issue when CONFIG_PREEMPT is
enabled in kernel. When CONFIG_PREEMPT is disabled, it would take longer
time for nvme_stop_queues()-->blk_mq_quiesce_queue() to wait for grace
period.
This patch protects nvme_process_cq() with nvmeq->cq_poll_lock in
nvme_reap_pending_cqes().
Fixes: 60266beac222 ("nvme/pci: move cqe check after device shutdown") Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Keith Busch [Fri, 8 May 2020 20:04:06 +0000 (13:04 -0700)]
nvme-pci: dma read memory barrier for completions
Control dependencies do not guarantee load order across the condition,
allowing a CPU to predict and speculate memory reads.
Commit 3954c88498ec inlined verifying a new completion with its
handling. At least one architecture was observed to access the contents
out of order, resulting in the driver using stale data for the
completion.
Add a dma read barrier before reading the completion queue entry and
after the condition its contents depend on to ensure the read order is
determinsitic.
Reported-by: John Garry <john.garry@huawei.com> Suggested-by: Will Deacon <will@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Tested-by: John Garry <john.garry@huawei.com> Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Wed, 6 May 2020 22:44:02 +0000 (15:44 -0700)]
nvme: fix possible hang when ns scanning fails during error recovery
When the controller is reconnecting, the host fails I/O and admin
commands as the host cannot reach the controller. ns scanning may
revalidate namespaces during that period and it is wrong to remove
namespaces due to these failures as we may hang (see 42930c14f26d).
One command that may fail is nvme_identify_ns_descs. Since we return
success due to having ns identify descriptor list optional, we continue
to compare ns identifiers in nvme_revalidate_disk, obviously fail and
return -ENODEV to nvme_validate_ns, which will remove the namespace.
Exactly what we don't want to happen.
Fixes: 20c608c9b372 ("nvme: Namepace identification descriptor list is optional") Tested-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
But the code is minimal now: one read for head, one read for q_depth,
one increment, one comparison, single instruction phase bit update and
one write for new head.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reported-by: John Garry <john.garry@huawei.com> Tested-by: John Garry <john.garry@huawei.com> Fixes: 3f08cba5632d9e9 ("nvme-pci: slimmer CQ head update") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
bdi: add a ->dev_name field to struct backing_dev_info
Cache a copy of the name for the life time of the backing_dev_info
structure so that we can reference it even after unregistering.
Fixes: 72e7720691f5 ("memcg: fix a crash in wb_workfn when a device disappears") Reported-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
bdi_dev_name is not a fast path function, move it out of line. This
prepares for using it from modular callers without having to export
an implementation detail like bdi_unknown_name.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Simplify the bdi name to mirror what we are doing elsewhere, and
drop them name in favor of just using a number. This avoids a
potentially very long bdi name.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tejun Heo [Mon, 4 May 2020 23:27:54 +0000 (19:27 -0400)]
iocost: protect iocg->abs_vdebt with iocg->waitq.lock
abs_vdebt is an atomic_64 which tracks how much over budget a given cgroup
is and controls the activation of use_delay mechanism. Once a cgroup goes
over budget from forced IOs, it has to pay it back with its future budget.
The progress guarantee on debt paying comes from the iocg being active -
active iocgs are processed by the periodic timer, which ensures that as time
passes the debts dissipate and the iocg returns to normal operation.
However, both iocg activation and vdebt handling are asynchronous and a
sequence like the following may happen.
1. The iocg is in the process of being deactivated by the periodic timer.
2. A bio enters ioc_rqos_throttle(), calls iocg_activate() which returns
without anything because it still sees that the iocg is already active.
3. The iocg is deactivated.
4. The bio from #2 is over budget but needs to be forced. It increases
abs_vdebt and goes over the threshold and enables use_delay.
5. IO control is enabled for the iocg's subtree and now IOs are attributed
to the descendant cgroups and the iocg itself no longer issues IOs.
This leaves the iocg with stuck abs_vdebt - it has debt but inactive and no
further IOs which can activate it. This can end up unduly punishing all the
descendants cgroups.
The usual throttling path has the same issue - the iocg must be active while
throttled to ensure that future event will wake it up - and solves the
problem by synchronizing the throttling path with a spinlock. abs_vdebt
handling is another form of overage handling and shares a lot of
characteristics including the fact that it isn't in the hottest path.
This patch fixes the above and other possible races by strictly
synchronizing abs_vdebt and use_delay handling with iocg->waitq.lock.
Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Vlad Dmitriev <vvd@fb.com> Cc: stable@vger.kernel.org # v5.4+ Fixes: bd0660afb6bb ("blk-iocost: Don't let merges push vtime into the future") Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: remove the bd_openers checks in blk_drop_partitions
When replacing the bd_super check with a bd_openers I followed a logical
conclusion, which turns out to be utterly wrong. When a block device has
bd_super sets it has a mount file system on it (although not every
mounted file system sets bd_super), but that also implies it doesn't even
have partitions to start with.
So instead of trying to come up with a logical check for all openers,
just remove the check entirely.
Fixes: 0ddb04a50871 ("block: fix busy device checking in blk_drop_partitions") Fixes: 9bb6f718079a ("block: fix busy device checking in blk_drop_partitions again") Reported-by: Michal Koutný <mkoutny@suse.com> Reported-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Thu, 23 Apr 2020 03:02:38 +0000 (12:02 +0900)]
null_blk: Cleanup zoned device initialization
Move all zoned mode related code from null_blk_main.c to
null_blk_zoned.c, avoiding an ugly #ifdef in the process.
Rename null_zone_init() into null_init_zoned_dev(), null_zone_exit()
into null_free_zoned_dev() and add the new function
null_register_zoned_dev() to finalize the zoned dev setup before
add_disk().
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Thu, 23 Apr 2020 03:02:37 +0000 (12:02 +0900)]
null_blk: Fix zoned command handling
For write operations issued to a null_blk device with zoned mode
enabled, the state and write pointer position of the zone targeted by
the command should be checked before badblocks and memory backing
are handled as the write may be first failed due to, for instance, a
sector position not aligned with the zone write pointer. This order of
checking for errors reflects more accuratly the behavior of physical
zoned devices.
Furthermore, the write pointer position of the target zone should be
incremented only and only if no errors are reported by badblocks and
memory backing handling.
To fix this, introduce the small helper function null_process_cmd()
which execute null_handle_badblocks() and null_handle_memory_backed()
and use this function in null_zone_write() to correctly handle write
requests to zoned null devices depending on the type and state of the
write target zone. Also call this function in null_handle_zoned() to
process read requests to zoned null devices.
null_process_cmd() is called directly from null_handle_cmd() for
regular null devices, resulting in no functional change for these type
of devices. To have symmetric names, the function null_handle_zoned()
is renamed to null_process_zoned_cmd().
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Waiman Long [Tue, 21 Apr 2020 13:07:55 +0000 (09:07 -0400)]
blk-iocost: Fix error on iocost_ioc_vrate_adj
Systemtap 4.2 is unable to correctly interpret the "u32 (*missed_ppm)[2]"
argument of the iocost_ioc_vrate_adj trace entry defined in
include/trace/events/iocost.h leading to the following error:
/tmp/stapAcz0G0/stap_c89c58b83cea1724e26395efa9ed4939_6321_aux_6.c:78:8:
error: expected ‘;’, ‘,’ or ‘)’ before ‘*’ token
, u32[]* __tracepoint_arg_missed_ppm
That argument type is indeed rather complex and hard to read. Looking
at block/blk-iocost.c. It is just a 2-entry u32 array. By simplifying
the argument to a simple "u32 *missed_ppm" and adjusting the trace
entry accordingly, the compilation error was gone.
Douglas Anderson [Tue, 24 Mar 2020 21:48:27 +0000 (14:48 -0700)]
bdev: Reduce time holding bd_mutex in sync in blkdev_close()
While trying to "dd" to the block device for a USB stick, I
encountered a hung task warning (blocked for > 120 seconds). I
managed to come up with an easy way to reproduce this on my system
(where /dev/sdb is the block device for my USB stick) with:
while true; do dd if=/dev/zero of=/dev/sdb bs=4M; done
With my reproduction here are the relevant bits from the hung task
detector:
INFO: task udevd:294 blocked for more than 122 seconds.
...
udevd D 0 294 1 0x00400008
Call trace:
...
mutex_lock_nested+0x40/0x50
__blkdev_get+0x7c/0x3d4
blkdev_get+0x118/0x138
blkdev_open+0x94/0xa8
do_dentry_open+0x268/0x3a0
vfs_open+0x34/0x40
path_openat+0x39c/0xdf4
do_filp_open+0x90/0x10c
do_sys_open+0x150/0x3c8
...
...
Showing all locks held in the system:
...
1 lock held by dd/2798:
#0: ffffff814ac1a3b8 (&bdev->bd_mutex){+.+.}, at: __blkdev_put+0x50/0x204
...
dd D 0 2798 2764 0x00400208
Call trace:
...
schedule+0x8c/0xbc
io_schedule+0x1c/0x40
wait_on_page_bit_common+0x238/0x338
__lock_page+0x5c/0x68
write_cache_pages+0x194/0x500
generic_writepages+0x64/0xa4
blkdev_writepages+0x24/0x30
do_writepages+0x48/0xa8
__filemap_fdatawrite_range+0xac/0xd8
filemap_write_and_wait+0x30/0x84
__blkdev_put+0x88/0x204
blkdev_put+0xc4/0xe4
blkdev_close+0x28/0x38
__fput+0xe0/0x238
____fput+0x1c/0x28
task_work_run+0xb0/0xe4
do_notify_resume+0xfc0/0x14bc
work_pending+0x8/0x14
The problem appears related to the fact that my USB disk is terribly
slow and that I have a lot of RAM in my system to cache things.
Specifically my writes seem to be happening at ~15 MB/s and I've got
~4 GB of RAM in my system that can be used for buffering. To write 4
GB of buffer to disk thus takes ~4000 MB / ~15 MB/s = ~267 seconds.
The 267 second number is a problem because in __blkdev_put() we call
sync_blockdev() while holding the bd_mutex. Any other callers who
want the bd_mutex will be blocked for the whole time.
The problem is made worse because I believe blkdev_put() specifically
tells other tasks (namely udev) to go try to access the device at right
around the same time we're going to hold the mutex for a long time.
Putting some traces around this (after disabling the hung task detector),
I could confirm:
dd: 437.608600: __blkdev_put() right before sync_blockdev() for sdb
udevd: 437.623901: blkdev_open() right before blkdev_get() for sdb
dd: 661.468451: __blkdev_put() right after sync_blockdev() for sdb
udevd: 663.820426: blkdev_open() right after blkdev_get() for sdb
A simple fix for this is to realize that sync_blockdev() works fine if
you're not holding the mutex. Also, it's not the end of the world if
you sync a little early (though it can have performance impacts).
Thus we can make a guess that we're going to need to do the sync and
then do it without holding the mutex. We still do one last sync with
the mutex but it should be much, much faster.
With this, my hung task warnings for my test case are gone.
Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Guenter Roeck <groeck@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Zhiqiang Liu [Mon, 13 Apr 2020 05:12:10 +0000 (13:12 +0800)]
buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason.
free_more_memory func has been completely removed in commit f7d5ba6981dc
("buffer: eliminate the need to call free_more_memory() in __getblk_slow()")
So comment and `WB_REASON_FREE_MORE_MEM` reason about free_more_memory
are no longer needed.
Fixes: f7d5ba6981dc ("buffer: eliminate the need to call free_more_memory() in __getblk_slow()") Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull documentation fixes from Jonathan Corbet:
"A handful of fixes for reasonably obnoxious documentation issues"
* tag 'docs-fixes' of git://git.lwn.net/linux:
scripts: documentation-file-ref-check: Add line break before exit
scripts/kernel-doc: Add missing close-paren in c:function directives
docs: admin-guide: merge sections for the kernel.modprobe sysctl
docs: timekeeping: Use correct prototype for deprecated functions
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Remove vdso code trying to free unallocated pages.
- Delete the space separator in the __emit_inst macro as it breaks the
clang integrated assembler.
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Delete the space separator in __emit_inst
arm64: vdso: don't free unallocated pages
Merge tag 'for-linus-5.7-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen update from Juergen Gross:
- a small cleanup patch
- a security fix for a bug in the Xen hypervisor to avoid enabling Xen
guests to crash dom0 on an unfixed hypervisor.
* tag 'for-linus-5.7-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
arm/xen: make _xen_start_info static
xen/xenbus: ensure xenbus_map_ring_valloc() returns proper grant status
Merge tag 'io_uring-5.7-2020-04-17' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- wrap up the init/setup cleanup (Pavel)
- fix some issues around deferral sequences (Pavel)
- fix splice punt check using the wrong struct file member
- apply poll re-arm logic for pollable retry too
- pollable retry should honor cancelation
- fix setup time error handling syzbot reported crash
- restore work state when poll is canceled
* tag 'io_uring-5.7-2020-04-17' of git://git.kernel.dk/linux-block:
io_uring: don't count rqs failed after current one
io_uring: kill already cached timeout.seq_offset
io_uring: fix cached_sq_head in io_timeout()
io_uring: only post events in io_poll_remove_all() if we completed some
io_uring: io_async_task_func() should check and honor cancelation
io_uring: check for need to re-wait in polled async handling
io_uring: correct O_NONBLOCK check for splice punt
io_uring: restore req->work when canceling poll request
io_uring: move all request init code in one place
io_uring: keep all sqe->flags in req->flags
io_uring: early submission req fail code
io_uring: track mm through current->mm
io_uring: remove obsolete @mm_fault
Merge tag 'block-5.7-2020-04-17' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Fix for a driver tag leak in error handling (John)
- Remove now defunct Kconfig selection from dasd (Stefan)
- blk-wbt trace fiexs (Tommi)
* tag 'block-5.7-2020-04-17' of git://git.kernel.dk/linux-block:
blk-wbt: Drop needless newlines from tracepoint format strings
blk-wbt: Use tracepoint_string() for wbt_step tracepoint string literals
s390/dasd: remove IOSCHED_DEADLINE from DASD Kconfig
blk-mq: Put driver tag in blk_mq_dispatch_rq_list() when no budget
Merge tag 'pm-5.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management update from Rafael Wysocki:
"Allow the operating performance points (OPP) core to be used in the
case when the same driver is used on different platforms, some of
which have an OPP table and some of which have a clock node (Rajendra
Nayak)"
* tag 'pm-5.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
opp: Manage empty OPP tables with clk handle
Merge tag 'sound-5.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"One significant regression fix is for HD-audio buffer preallocation.
In 5.6 it was set to non-prompt for x86 and forced to 0, but this
turned out to be problematic for some applications, hence it gets
reverted. Distros would need to restore CONFIG_SND_HDA_PREALLOC_SIZE
value to the earlier values they've used in the past.
Other than that, we've received quite a few small fixes for HD-audio
and USB-audio. Most of them are for dealing with the broken TRX40
mobos and the runtime PM without HD-audio codecs"
* tag 'sound-5.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda: call runtime_allow() for all hda controllers
ALSA: hda: Allow setting preallocation again for x86
ALSA: hda: Explicitly permit using autosuspend if runtime PM is supported
ALSA: hda: Skip controller resume if not needed
ALSA: hda: Keep the controller initialization even if no codecs found
ALSA: hda: Release resources at error in delayed probe
ALSA: hda: Honor PM disablement in PM freeze and thaw_noirq ops
ALSA: hda: Don't release card at firmware loading error
ALSA: usb-audio: Check mapping at creating connector controls, too
ALSA: usb-audio: Don't create jack controls for PCM terminals
ALSA: usb-audio: Don't override ignore_ctl_error value from the map
ALSA: usb-audio: Filter error from connector kctl ops, too
ALSA: hda/realtek - Enable the headset mic on Asus FX505DT
ALSA: ctxfi: Remove unnecessary cast in kfree
kbuild: check libyaml installation for 'make dt_binding_check'
If you run 'make dtbs_check' without installing the libyaml package,
the error message "dtc needs libyaml ..." is shown.
This should be checked also for 'make dt_binding_check' because dtc
needs to validate *.example.dts extracted from *.yaml files.
It is missing since commit 2f4b5be5d172 ("kbuild: Add support for DT
binding schema checks"), but this fix-up is applicable only after commit 992a1b7fa3d0 ("kbuild: allow to run dt_binding_check and dtbs_check
in a single command").
I gave the Fixes tag to the latter in case somebody is interested in
back-porting this.
Fixes: 992a1b7fa3d0 ("kbuild: allow to run dt_binding_check and dtbs_check in a single command") Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Rob Herring <robh@kernel.org>
Frank Rowand [Thu, 16 Apr 2020 21:42:50 +0000 (16:42 -0500)]
of: unittest: kmemleak in duplicate property update
kmemleak reports several memory leaks from devicetree unittest.
This is the fix for problem 5 of 5.
When overlay 'overlay_bad_add_dup_prop' is applied, the apply code
properly detects that a memory leak will occur if the overlay is removed
since the duplicate property is located in a base devicetree node and
reports via printk():
OF: overlay: WARNING: memory leak will occur if overlay removed, property: /testcase-data-2/substation@100/motor-1/rpm_avail
OF: overlay: WARNING: memory leak will occur if overlay removed, property: /testcase-data-2/substation@100/motor-1/rpm_avail
The overlay is removed when the apply code detects multiple changesets
modifying the same property. This is reported via printk():
As a result of this error, the overlay is removed resulting in the
expected memory leak.
Add another device node level to the overlay so that the duplicate
property is located in a node added by the overlay, thus no memory
leak will occur when the overlay is removed.
Thus users of kmemleak will not have to debug this leak in the future.
Fixes: 68148a105f03 ("of: overlay: check prevents multiple fragments touching same property") Reported-by: Erhard F. <erhard_f@mailbox.org> Signed-off-by: Frank Rowand <frank.rowand@sony.com> Signed-off-by: Rob Herring <robh@kernel.org>
Frank Rowand [Thu, 16 Apr 2020 21:42:49 +0000 (16:42 -0500)]
of: overlay: kmemleak in dup_and_fixup_symbol_prop()
kmemleak reports several memory leaks from devicetree unittest.
This is the fix for problem 4 of 5.
target_path was not freed in the non-error path.
Fixes: 38667a10bd3b ("of: overlay: remove a dependency on device node full_name") Reported-by: Erhard F. <erhard_f@mailbox.org> Signed-off-by: Frank Rowand <frank.rowand@sony.com> Signed-off-by: Rob Herring <robh@kernel.org>
Frank Rowand [Thu, 16 Apr 2020 21:42:48 +0000 (16:42 -0500)]
of: unittest: kmemleak in of_unittest_overlay_high_level()
kmemleak reports several memory leaks from devicetree unittest.
This is the fix for problem 3 of 5.
of_unittest_overlay_high_level() failed to kfree the newly created
property when the property named 'name' is skipped.
Fixes: a4d4abc95a87 ("of: change overlay apply input data from unflattened to FDT") Reported-by: Erhard F. <erhard_f@mailbox.org> Signed-off-by: Frank Rowand <frank.rowand@sony.com> Signed-off-by: Rob Herring <robh@kernel.org>
Frank Rowand [Thu, 16 Apr 2020 21:42:47 +0000 (16:42 -0500)]
of: unittest: kmemleak in of_unittest_platform_populate()
kmemleak reports several memory leaks from devicetree unittest.
This is the fix for problem 2 of 5.
of_unittest_platform_populate() left an elevated reference count for
grandchild nodes (which are platform devices). Fix the platform
device reference counts so that the memory will be freed.
Fixes: d41501864f95 ("of/selftest: add testcase for nodes with same name and address") Reported-by: Erhard F. <erhard_f@mailbox.org> Signed-off-by: Frank Rowand <frank.rowand@sony.com> Signed-off-by: Rob Herring <robh@kernel.org>
Frank Rowand [Thu, 16 Apr 2020 21:42:46 +0000 (16:42 -0500)]
of: unittest: kmemleak on changeset destroy
kmemleak reports several memory leaks from devicetree unittest.
This is the fix for problem 1 of 5.
of_unittest_changeset() reaches deeply into the dynamic devicetree
functions. Several nodes were left with an elevated reference
count and thus were not properly cleaned up. Fix the reference
counts so that the memory will be freed.
Fixes: e9dc07885451 ("of: Transactional DT support.") Reported-by: Erhard F. <erhard_f@mailbox.org> Signed-off-by: Frank Rowand <frank.rowand@sony.com> Signed-off-by: Rob Herring <robh@kernel.org>
Changeset c4ae047634cc ("dt-bindings: display: Convert Allwinner display pipeline to schemas")
split Documentation/devicetree/bindings/display/sunxi/sun4i-drm.txt
into several files. Yet, it kept the old place at MAINTAINERS.
Update it to point to the new place.
Fixes: c4ae047634cc ("dt-bindings: display: Convert Allwinner display pipeline to schemas") Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Rob Herring <robh@kernel.org>
Josef Bacik [Fri, 10 Apr 2020 15:42:48 +0000 (11:42 -0400)]
btrfs: fix setting last_trans for reloc roots
I made a mistake with my previous fix, I assumed that we didn't need to
mess with the reloc roots once we were out of the part of relocation where
we are actually moving the extents.
The subtle thing that I missed is that btrfs_init_reloc_root() also
updates the last_trans for the reloc root when we do
btrfs_record_root_in_trans() for the corresponding fs_root. I've added a
comment to make sure future me doesn't make this mistake again.
This showed up as a WARN_ON() in btrfs_copy_root() because our
last_trans didn't == the current transid. This could happen if we
snapshotted a fs root with a reloc root after we set
rc->create_reloc_tree = 0, but before we actually merge the reloc root.
Worth mentioning that the regression produced the following warning
when running snapshot creation and balance in parallel:
Fixes: e4e50e1b59a5 ("btrfs: do not init a reloc root if we aren't relocating") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'tag-chrome-platform-fixes-for-v5.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux
Pull chrome-platform fixes from Benson Leung:
"Two small fixes for cros_ec_sensorhub_ring.c, addressing issues
introduced in the cros_ec_sensorhub FIFO support commit"
* tag 'tag-chrome-platform-fixes-for-v5.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux:
platform/chrome: cros_ec_sensorhub: Add missing '\n' in log messages
platform/chrome: cros_ec_sensorhub: Off by one in cros_sensorhub_send_sample()
1) Disable RISCV BPF JIT builds when !MMU, from Björn Töpel.
2) nf_tables leaves dangling pointer after free, fix from Eric Dumazet.
3) Out of boundary write in __xsk_rcv_memcpy(), fix from Li RongQing.
4) Adjust icmp6 message source address selection when routes have a
preferred source address set, from Tim Stallard.
5) Be sure to validate HSR protocol version when creating new links,
from Taehee Yoo.
6) CAP_NET_ADMIN should be sufficient to manage l2tp tunnels even in
non-initial namespaces, from Michael Weiß.
7) Missing release firmware call in mlx5, from Eran Ben Elisha.
8) Fix variable type in macsec_changelink(), caught by KASAN. Fix from
Taehee Yoo.
9) Fix pause frame negotiation in marvell phy driver, from Clemens
Gruber.
10) Record RX queue early enough in tun packet paths such that XDP
programs will see the correct RX queue index, from Gilberto Bertin.
11) Fix double unlock in mptcp, from Florian Westphal.
12) Fix offset overflow in ARM bpf JIT, from Luke Nelson.
13) marvell10g needs to soft reset PHY when coming out of low power
mode, from Russell King.
14) Fix MTU setting regression in stmmac for some chip types, from
Florian Fainelli.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (101 commits)
amd-xgbe: Use __napi_schedule() in BH context
mISDN: make dmril and dmrim static
net: stmmac: dwmac-sunxi: Provide TX and RX fifo sizes
net: dsa: mt7530: fix tagged frames pass-through in VLAN-unaware mode
tipc: fix incorrect increasing of link window
Documentation: Fix tcp_challenge_ack_limit default value
net: tulip: make early_486_chipsets static
dt-bindings: net: ethernet-phy: add desciption for ethernet-phy-id1234.d400
ipv6: remove redundant assignment to variable err
net/rds: Use ERR_PTR for rds_message_alloc_sgs()
net: mscc: ocelot: fix untagged packet drops when enslaving to vlan aware bridge
selftests/bpf: Check for correct program attach/detach in xdp_attach test
libbpf: Fix type of old_fd in bpf_xdp_set_link_opts
libbpf: Always specify expected_attach_type on program load if supported
xsk: Add missing check on user supplied headroom size
mac80211: fix channel switch trigger from unknown mesh peer
mac80211: fix race in ieee80211_register_hw()
net: marvell10g: soft-reset the PHY when coming out of low power
net: marvell10g: report firmware version
net/cxgb4: Check the return from t4_query_params properly
...
The driver uses __napi_schedule_irqoff() which is fine as long as it is
invoked with disabled interrupts by everybody. Since the commit
mentioned below the driver may invoke xgbe_isr_task() in tasklet/softirq
context. This may lead to list corruption if another driver uses
__napi_schedule_irqoff() in IRQ context.
Use __napi_schedule() which safe to use from IRQ and softirq context.
Fixes: f43c3d6c34200 ("amd-xgbe: Re-issue interrupt if interrupt status not cleared") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Yan [Wed, 15 Apr 2020 08:42:26 +0000 (16:42 +0800)]
mISDN: make dmril and dmrim static
Fix the following sparse warning:
drivers/isdn/hardware/mISDN/mISDNisar.c:746:12: warning: symbol 'dmril'
was not declared. Should it be static?
drivers/isdn/hardware/mISDN/mISDNisar.c:749:12: warning: symbol 'dmrim'
was not declared. Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: stmmac: dwmac-sunxi: Provide TX and RX fifo sizes
After commit 8aa479224b68c4fbedc27a502cd67f067ff8206e ("net: dsa:
configure the MTU for switch ports") my Lamobo R1 platform which uses
an allwinner,sun7i-a20-gmac compatible Ethernet MAC started to fail
by rejecting a MTU of 1536. The reason for that is that the DMA
capabilities are not readable on this version of the IP, and there
is also no 'tx-fifo-depth' property being provided in Device Tree. The
property is documented as optional, and is not provided.
Chen-Yu indicated that the FIFO sizes are 4KB for TX and 16KB for RX, so
provide these values through platform data as an immediate fix until
various Device Tree sources get updated accordingly.
Fixes: db3a713a7d70 ("net: stmmac: Do not accept invalid MTU values") Suggested-by: Chen-Yu Tsai <wens@csie.org> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Chen-Yu Tsai <wens@csie.org> Signed-off-by: David S. Miller <davem@davemloft.net>
DENG Qingfang [Tue, 14 Apr 2020 06:34:08 +0000 (14:34 +0800)]
net: dsa: mt7530: fix tagged frames pass-through in VLAN-unaware mode
In VLAN-unaware mode, the Egress Tag (EG_TAG) field in Port VLAN
Control register must be set to Consistent to let tagged frames pass
through as is, otherwise their tags will be stripped.
Fixes: aaf9e358e672 ("net: dsa: mediatek: add VLAN support for MT7530") Signed-off-by: DENG Qingfang <dqfext@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: René van Dorst <opensource@vdorst.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'ceph-for-5.7-rc2' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
- a set of patches for a deadlock on "rbd map" error path
- a fix for invalid pointer dereference and uninitialized variable use
on asynchronous create and unlink error paths.
* tag 'ceph-for-5.7-rc2' of git://github.com/ceph/ceph-client:
ceph: fix potential bad pointer deref in async dirops cb's
rbd: don't mess with a page vector in rbd_notify_op_lock()
rbd: don't test rbd_dev->opts in rbd_dev_image_release()
rbd: call rbd_dev_unprobe() after unwatching and flushing notifies
rbd: avoid a deadlock on header_rwsem when flushing notifies
Merge tag 'trace-v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"This fixes a small race between allocating a snapshot buffer and
setting the snapshot trigger.
On a slow machine, the trigger can occur before the snapshot is
allocated causing a warning to be displayed in the ring buffer, and no
snapshot triggering. Reversing the allocation and the enabling of the
trigger fixes the problem"
* tag 'trace-v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix the race between registering 'snapshot' event trigger and triggering 'snapshot' operation
John Garry [Thu, 16 Apr 2020 11:18:51 +0000 (19:18 +0800)]
blk-mq: Put driver tag in blk_mq_dispatch_rq_list() when no budget
If in blk_mq_dispatch_rq_list() we find no budget, then we break of the
dispatch loop, but the request may keep the driver tag, evaulated
in 'nxt' in the previous loop iteration.
Fix by putting the driver tag for that request.
Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Looking at the code now that it the internal mount of proc is no
longer used it is possible to unmount proc. If proc is unmounted
the fields of the pid namespace that were used for filesystem
specific state are not reinitialized.
Which means that proc_self and proc_thread_self can be pointers to
already freed dentries.
The reported user after free appears to be from mounting and
unmounting proc followed by mounting proc again and using error
injection to cause the new root dentry allocation to fail. This in
turn results in proc_kill_sb running with proc_self and
proc_thread_self still retaining their values from the previous mount
of proc. Then calling dput on either proc_self of proc_thread_self
will result in double put. Which KASAN sees as a use after free.
Solve this by always reinitializing the filesystem state stored
in the struct pid_namespace, when proc is unmounted.
Reported-by: syzbot+72868dd424eb66c6b95f@syzkaller.appspotmail.com Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Fixes: 55878bfe1385 ("proc: Remove the now unnecessary internal mount of proc") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Merge tag 'efi-urgent-2020-04-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI fixes from Ingo Molnar:
"Misc EFI fixes, including the boot failure regression caused by the
BSS section not being cleared by the loaders"
* tag 'efi-urgent-2020-04-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/x86: Revert struct layout change to fix kexec boot regression
efi/x86: Don't remap text<->rodata gap read-only for mixed mode
efi/x86: Fix the deletion of variables in mixed mode
efi/libstub/file: Merge file name buffers to reduce stack usage
Documentation/x86, efi/x86: Clarify EFI handover protocol and its requirements
efi/arm: Deal with ADR going out of range in efi_enter_kernel()
efi/x86: Always relocate the kernel for EFI handover entry
efi/x86: Move efi stub globals from .bss to .data
efi/libstub/x86: Remove redundant assignment to pointer hdr
efi/cper: Use scnprintf() for avoiding potential buffer overflow
In commit e04b9759e7ae ("tipc: introduce variable window congestion
control"), we allow link window to change with the congestion avoidance
algorithm. However, there is a bug that during the slow-start if packet
retransmission occurs, the link will enter the fast-recovery phase, set
its window to the 'ssthresh' which is never less than 300, so the link
window suddenly increases to that limit instead of decreasing.
Consequently, two issues have been observed:
- For broadcast-link: it can leave a gap between the link queues that a
new packet will be inserted and sent before the previous ones, i.e. not
in-order.
- For unicast: the algorithm does not work as expected, the link window
jumps to the slow-start threshold whereas packet retransmission occurs.
This commit fixes the issues by avoiding such the link window increase,
but still decreasing if the 'ssthresh' is lowered.
Fixes: e04b9759e7ae ("tipc: introduce variable window congestion control") Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Jonker [Wed, 15 Apr 2020 20:01:49 +0000 (22:01 +0200)]
dt-bindings: net: ethernet-phy: add desciption for ethernet-phy-id1234.d400
The description below is already in use in
'rk3228-evb.dts', 'rk3229-xms6.dts' and 'rk3328.dtsi'
but somehow never added to a document, so add
"ethernet-phy-id1234.d400", "ethernet-phy-ieee802.3-c22"
for ethernet-phy nodes on Rockchip platforms to
'ethernet-phy.yaml'.
Signed-off-by: Johan Jonker <jbx6244@gmail.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Wed, 15 Apr 2020 23:16:30 +0000 (00:16 +0100)]
ipv6: remove redundant assignment to variable err
The variable err is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.
Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Rob Herring [Wed, 15 Apr 2020 17:57:04 +0000 (12:57 -0500)]
dt-bindings: pwm: Fix cros-ec-pwm example dtc 'reg' warning
The example for the CrOS EC PWM is incomplete and now generates a dtc
warning:
Documentation/devicetree/bindings/pwm/google,cros-ec-pwm.example.dts:17.11-23.11:
Warning (unit_address_vs_reg): /example-0/cros-ec@0: node has a unit name, but no reg or ranges property
Fixing this results in more warnings as a parent spi node is needed as
well.
Ondrej Mosnacek [Tue, 14 Apr 2020 14:23:51 +0000 (16:23 +0200)]
selinux: free str on error in str_read()
In [see "Fixes:"] I missed the fact that str_read() may give back an
allocated pointer even if it returns an error, causing a potential
memory leak in filename_trans_read_one(). Fix this by making the
function free the allocated string whenever it returns a non-zero value,
which also makes its behavior more obvious and prevents repeating the
same mistake in the future.
Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
Addresses-Coverity-ID: 1461665 ("Resource leaks") Fixes: df2852430e39 ("selinux: optimize storage of filename transitions") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Paul Moore <paul@paul-moore.com>
Tiezhu Yang [Tue, 14 Apr 2020 09:41:48 +0000 (17:41 +0800)]
scripts: documentation-file-ref-check: Add line break before exit
If execute ./scripts/documentation-file-ref-check in a directory which is
not a git tree, it will exit without a line break, fix it.
Without this patch:
[loongson@localhost linux-5.7-rc1]$ ./scripts/documentation-file-ref-check
Warning: can't check if file exists, as this is not a git tree[loongson@localhost linux-5.7-rc1]$
With this patch:
[loongson@localhost linux-5.7-rc1]$ ./scripts/documentation-file-ref-check
Warning: can't check if file exists, as this is not a git tree
[loongson@localhost linux-5.7-rc1]$
Peter Maydell [Tue, 14 Apr 2020 14:37:43 +0000 (15:37 +0100)]
scripts/kernel-doc: Add missing close-paren in c:function directives
When kernel-doc generates a 'c:function' directive for a function
one of whose arguments is a function pointer, it fails to print
the close-paren after the argument list of the function pointer
argument. For instance:
long work_on_cpu(int cpu, long (*fn) (void *, void * arg)
in driver-api/basics.html is missing a ')' separating the
"void *" of the 'fn' arguments from the ", void * arg" which
is an argument to work_on_cpu().
Add the missing close-paren, so that we render the prototype
correctly:
long work_on_cpu(int cpu, long (*fn)(void *), void * arg)
(Note that Sphinx stops rendering a space between the '(fn*)' and the
'(void *)' once it gets something that's syntactically valid.)
Eric Biggers [Tue, 14 Apr 2020 17:24:30 +0000 (10:24 -0700)]
docs: admin-guide: merge sections for the kernel.modprobe sysctl
Documentation for the kernel.modprobe sysctl was added both by
commit fd207f0fe38e ("docs: merge debugging-modules.txt into
sysctl/kernel.rst") and by commit 561f77febbd9 ("docs: admin-guide:
document the kernel.modprobe sysctl"), resulting in the same sysctl
being documented in two places. Merge these into one place.
Chris Packham [Tue, 14 Apr 2020 22:12:22 +0000 (10:12 +1200)]
docs: timekeeping: Use correct prototype for deprecated functions
Use the correct prototypes for do_gettimeofday(), getnstimeofday() and
getnstimeofday64(). All of these returned void and passed the return
value by reference. This should make the documentation of their
deprecation and replacements easier to search for.
Jason Gunthorpe [Tue, 14 Apr 2020 23:02:07 +0000 (20:02 -0300)]
net/rds: Use ERR_PTR for rds_message_alloc_sgs()
Returning the error code via a 'int *ret' when the function returns a
pointer is very un-kernely and causes gcc 10's static analysis to choke:
net/rds/message.c: In function ‘rds_message_map_pages’:
net/rds/message.c:358:10: warning: ‘ret’ may be used uninitialized in this function [-Wmaybe-uninitialized]
358 | return ERR_PTR(ret);
Use a typical ERR_PTR return instead.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 14 Apr 2020 19:36:15 +0000 (22:36 +0300)]
net: mscc: ocelot: fix untagged packet drops when enslaving to vlan aware bridge
To rehash a previous explanation given in commit 3261a0c9ec0e ("net:
mscc: ocelot: fix vlan_filtering when enslaving to bridge before link is
up"), the switch driver operates the in a mode where a single VLAN can
be transmitted as untagged on a particular egress port. That is the
"native VLAN on trunk port" use case.
The configuration for this native VLAN is driven in 2 ways:
- Set the egress port rewriter to strip the VLAN tag for the native
VID (as it is egress-untagged, after all).
- Configure the ingress port to drop untagged and priority-tagged
traffic, if there is no native VLAN. The intention of this setting is
that a trunk port with no native VLAN should not accept untagged
traffic.
Since both of the above configurations for the native VLAN should only
be done if VLAN awareness is requested, they are actually done from the
ocelot_port_vlan_filtering function, after the basic procedure of
toggling the VLAN awareness flag of the port.
But there's a problem with that simplistic approach: we are trying to
juggle with 2 independent variables from a single function:
- Native VLAN of the port - its value is held in port->vid.
- VLAN awareness state of the port - currently there are some issues
here, more on that later*.
The actual problem can be seen when enslaving the switch ports to a VLAN
filtering bridge:
0. The driver configures a pvid of zero for each port, when in
standalone mode. While the bridge configures a default_pvid of 1 for
each port that gets added as a slave to it.
1. The bridge calls ocelot_port_vlan_filtering with vlan_aware=true.
The VLAN-filtering-dependent portion of the native VLAN
configuration is done, considering that the native VLAN is 0.
2. The bridge calls ocelot_vlan_add with vid=1, pvid=true,
untagged=true. The native VLAN changes to 1 (change which gets
propagated to hardware).
3. ??? - nobody calls ocelot_port_vlan_filtering again, to reapply the
VLAN-filtering-dependent portion of the native VLAN configuration,
for the new native VLAN of 1. One can notice that after toggling "ip
link set dev br0 type bridge vlan_filtering 0 && ip link set dev br0
type bridge vlan_filtering 1", the new native VLAN finally makes it
through and untagged traffic finally starts flowing again. But
obviously that shouldn't be needed.
So it is clear that 2 independent variables need to both re-trigger the
native VLAN configuration. So we introduce the second variable as
ocelot_port->vlan_aware.
*Actually both the DSA Felix driver and the Ocelot driver already had
each its own variable:
- Ocelot: ocelot_port_private->vlan_aware
- Felix: dsa_port->vlan_filtering
but the common Ocelot library needs to work with a single, common,
variable, so there is some refactoring done to move the vlan_aware
property from the private structure into the common ocelot_port
structure.
Fixes: 305ea5355ad1 ("net: mscc: ocelot: break apart ocelot_vlan_port_apply") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 15 Apr 2020 18:27:23 +0000 (11:27 -0700)]
Merge tag 'mac80211-for-net-2020-04-15' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
A couple of fixes:
* FTM responder policy netlink validation fix
(but the only user validates again later)
* kernel-doc fixes
* a fix for a race in mac80211 radio registration vs. userspace
* a mesh channel switch fix
* a fix for a syzbot reported kasprintf() issue
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fangrui Song [Tue, 14 Apr 2020 16:32:55 +0000 (09:32 -0700)]
arm64: Delete the space separator in __emit_inst
In assembly, many instances of __emit_inst(x) expand to a directive. In
a few places __emit_inst(x) is used as an assembler macro argument. For
example, in arch/arm64/kvm/hyp/entry.S
Both comma and space are separators, with an exception that content
inside a pair of parentheses/quotes is not split, so the clang
integrated assembler splits the arguments to:
GNU as preprocesses the input with do_scrub_chars(). Its arm64 backend
(along with many other non-x86 backends) sees:
alternative_insn nop,.inst(0xd500401f|((0)<<16|(4)<<5)|((!!1)<<8)),4,1
# .inst(...) is parsed as one argument
while its x86 backend sees:
alternative_insn nop,.inst (0xd500401f|((0)<<16|(4)<<5)|((!!1)<<8)),4,1
# The extra space before '(' makes the whole .inst (...) parsed as two arguments
The non-x86 backend's behavior is considered unintentional
(https://sourceware.org/bugzilla/show_bug.cgi?id=25750).
So drop the space separator inside `.inst (...)` to make the clang
integrated assembler work.
Suggested-by: Ilie Halip <ilie.halip@gmail.com> Signed-off-by: Fangrui Song <maskray@google.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Link: https://github.com/ClangBuiltLinux/linux/issues/939 Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
selftests/bpf: Check for correct program attach/detach in xdp_attach test
David Ahern noticed that there was a bug in the EXPECTED_FD code so
programs did not get detached properly when that parameter was supplied.
This case was not included in the xdp_attach tests; so let's add it to be
sure that such a bug does not sneak back in down.
Fixes: 6be3e9595cb2 ("selftests/bpf: Add tests for attaching XDP programs") Reported-by: David Ahern <dsahern@gmail.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200414145025.182163-2-toke@redhat.com
libbpf: Fix type of old_fd in bpf_xdp_set_link_opts
The 'old_fd' parameter used for atomic replacement of XDP programs is
supposed to be an FD, but was left as a u32 from an earlier iteration of
the patch that added it. It was converted to an int when read, so things
worked correctly even with negative values, but better change the
definition to correctly reflect the intention.
Fixes: 380cda6a6342 ("libbpf: Add function to set link XDP fd while specifying old program") Reported-by: David Ahern <dsahern@gmail.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: David Ahern <dsahern@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200414145025.182163-1-toke@redhat.com
libbpf: Always specify expected_attach_type on program load if supported
For some types of BPF programs that utilize expected_attach_type, libbpf won't
set load_attr.expected_attach_type, even if expected_attach_type is known from
section definition. This was done to preserve backwards compatibility with old
kernels that didn't recognize expected_attach_type attribute yet (which was
added in 75c980841d07 ("bpf: Check attach type at prog load time"). But this
is problematic for some BPF programs that utilize newer features that require
kernel to know specific expected_attach_type (e.g., extended set of return
codes for cgroup_skb/egress programs).
This patch makes libbpf specify expected_attach_type by default, but also
detect support for this field in kernel and not set it during program load.
This allows to have a good metadata for bpf_program
(e.g., bpf_program__get_extected_attach_type()), but still work with old
kernels (for cases where it can work at all).
Additionally, due to expected_attach_type being always set for recognized
program types, bpf_program__attach_cgroup doesn't have to do extra checks to
determine correct attach type, so remove that additional logic.
Also adjust section_names selftest to account for this change.
Magnus Karlsson [Tue, 14 Apr 2020 07:35:15 +0000 (09:35 +0200)]
xsk: Add missing check on user supplied headroom size
Add a check that the headroom cannot be larger than the available
space in the chunk. In the current code, a malicious user can set the
headroom to a value larger than the chunk size minus the fixed XDP
headroom. That way packets with a length larger than the supported
size in the umem could get accepted and result in an out-of-bounds
write.
Mark Rutland [Tue, 14 Apr 2020 10:42:48 +0000 (11:42 +0100)]
arm64: vdso: don't free unallocated pages
The aarch32_vdso_pages[] array never has entries allocated in the C_VVAR
or C_VDSO slots, and as the array is zero initialized these contain
NULL.
However in __aarch32_alloc_vdso_pages() when
aarch32_alloc_kuser_vdso_page() fails we attempt to free the page whose
struct page is at NULL, which is obviously nonsensical.
This patch removes the erroneous page freeing.
Fixes: 2ee4ed1826db ("arm64: compat: VDSO setup for compat layer") Cc: <stable@vger.kernel.org> # 5.3.x- Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Tamizh chelvam [Sat, 28 Mar 2020 13:53:24 +0000 (19:23 +0530)]
mac80211: fix channel switch trigger from unknown mesh peer
Previously mesh channel switch happens if beacon contains
CSA IE without checking the mesh peer info. Due to that
channel switch happens even if the beacon is not from
its own mesh peer. Fixing that by checking if the CSA
originated from the same mesh network before proceeding
for channel switch.
A race condition leading to a kernel crash is observed during invocation
of ieee80211_register_hw() on a dragonboard410c device having wcn36xx
driver built as a loadable module along with a wifi manager in user-space
waiting for a wifi device (wlanX) to be active.
Sequence diagram for a particular kernel crash scenario:
As evident from above sequence diagram, this race condition isn't specific
to a particular wifi driver but rather the initialization sequence in
ieee80211_register_hw() needs to be fixed. So re-order the initialization
sequence and the updated sequence diagram would look like:
Xiao Yang [Tue, 14 Apr 2020 01:51:45 +0000 (09:51 +0800)]
tracing: Fix the race between registering 'snapshot' event trigger and triggering 'snapshot' operation
Traced event can trigger 'snapshot' operation(i.e. calls snapshot_trigger()
or snapshot_count_trigger()) when register_snapshot_trigger() has completed
registration but doesn't allocate buffer for 'snapshot' event trigger. In
the rare case, 'snapshot' operation always detects the lack of allocated
buffer so make register_snapshot_trigger() allocate buffer first.
Pavel Begunkov [Tue, 14 Apr 2020 21:39:50 +0000 (00:39 +0300)]
io_uring: don't count rqs failed after current one
When checking for draining with __req_need_defer(), it tries to match
how many requests were sent before a current one with number of already
completed. Dropped SQEs are included in req->sequence, and they won't
ever appear in CQ. To compensate for that, __req_need_defer() substracts
ctx->cached_sq_dropped.
However, what it should really use is number of SQEs dropped __before__
the current one. In other words, any submitted request shouldn't
shouldn't affect dequeueing from the drain queue of previously submitted
ones.
Instead of saving proper ctx->cached_sq_dropped in each request,
substract from req->sequence it at initialisation, so it includes number
of properly submitted requests.
note: it also changes behaviour of timeouts, but
1. it's already diverge from the description because of using SQ
2. the description is ambiguous regarding dropped SQEs
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
====================
Fix 88x3310 leaving power save mode
This series fixes a problem with the 88x3310 PHY on Macchiatobin
coming out of powersave mode noticed by Matteo Croce. It seems
that certain PHY firmwares do not properly exit powersave mode,
resulting in a fibre link not coming up.
The solution appears to be to soft-reset the PHY after clearing
the powersave bit.
We add support for reporting the PHY firmware version to the kernel
log, and use it to trigger this new behaviour if we have v0.3.x.x
or more recent firmware on the PHY. This, however, is a guess as
the firmware revision documentation does not mention this issue,
and we know that v0.2.1.0 works without this fix but v0.3.3.0 and
later does not.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King [Tue, 14 Apr 2020 19:49:08 +0000 (20:49 +0100)]
net: marvell10g: soft-reset the PHY when coming out of low power
Soft-reset the PHY when coming out of low power mode, which seems to
be necessary with firmware versions 0.3.3.0 and 0.3.10.0.
This depends on ("net: marvell10g: report firmware version")
Fixes: 63086549d9ae ("net: phy: marvell10g: place in powersave mode at probe") Reported-by: Matteo Croce <mcroce@redhat.com> Tested-by: Matteo Croce <mcroce@redhat.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King [Tue, 14 Apr 2020 19:49:03 +0000 (20:49 +0100)]
net: marvell10g: report firmware version
Report the firmware version when probing the PHY to allow issues
attributable to firmware to be diagnosed.
Tested-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Gunthorpe [Tue, 14 Apr 2020 15:27:08 +0000 (12:27 -0300)]
net/cxgb4: Check the return from t4_query_params properly
Positive return values are also failures that don't set val,
although this probably can't happen. Fixes gcc 10 warning:
drivers/net/ethernet/chelsio/cxgb4/t4_hw.c: In function ‘t4_phy_fw_ver’:
drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:3747:14: warning: ‘val’ may be used uninitialized in this function [-Wmaybe-uninitialized]
3747 | *phy_fw_ver = val;
Fixes: 63431aa748e8 ("cxgb4: Add PHY firmware support for T420-BT cards") Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 14 Apr 2020 23:33:26 +0000 (16:33 -0700)]
Merge branch 'mv88e6xxx-fixed-link-fixes'
Andrew Lunn says:
====================
mv88e6xxx fixed link fixes
Recent changes for how the MAC is configured broke fixed links, as
used by CPU/DSA ports, and for SFPs when phylink cannot be used. The
first fix is unchanged from v1. The second fix takes a different
solution than v1. If a CPU or DSA port is known to have a PHYLINK
instance, configure the port down before instantiating the PHYLINK, so
it is in the down state as expected by PHYLINK.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Tue, 14 Apr 2020 00:34:39 +0000 (02:34 +0200)]
net: dsa: Down cpu/dsa ports phylink will control
DSA and CPU ports can be configured in two ways. By default, the
driver should configure such ports to there maximum bandwidth. For
most use cases, this is sufficient. When this default is insufficient,
a phylink instance can be bound to such ports, and phylink will
configure the port, e.g. based on fixed-link properties. phylink
assumes the port is initially down. Given that the driver should have
already configured it to its maximum speed, ask the driver to down
the port before instantiating the phylink instance.
Fixes: af0024100559 ("net: mv88e6xxx: use resolved link config in mac_link_up()") Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Tue, 14 Apr 2020 00:34:38 +0000 (02:34 +0200)]
net: dsa: mv88e6xxx: Configure MAC when using fixed link
The 88e6185 is reporting it has detected a PHY, when a port is
connected to an SFP. As a result, the fixed-phy configuration is not
being applied. That then breaks packet transfer, since the port is
reported as being down.
Add additional conditions to check the interface mode, and if it is
fixed always configure the port on link up/down, independent of the
PPU status.
Fixes: af0024100559 ("net: mv88e6xxx: use resolved link config in mac_link_up()") Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Mon, 13 Apr 2020 17:33:11 +0000 (10:33 -0700)]
ionic: fix unused assignment
Remove an unused initialized value.
Fixes: 8b4eaed2ebbf ("ionic: replay filters after fw upgrade") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Shannon Nelson <snelson@pensando.io> Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Mon, 13 Apr 2020 17:33:10 +0000 (10:33 -0700)]
ionic: add dynamic_debug header
Add the appropriate header for using dynamic_hex_dump(), which
seems to be incidentally included in some configurations but
not all.
Fixes: 8b4eaed2ebbf ("ionic: replay filters after fw upgrade") Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Shannon Nelson <snelson@pensando.io> Signed-off-by: David S. Miller <davem@davemloft.net>
David Howells [Mon, 13 Apr 2020 12:57:14 +0000 (13:57 +0100)]
rxrpc: Fix DATA Tx to disable nofrag for UDP on AF_INET6 socket
Fix the DATA packet transmission to disable nofrag for UDPv4 on an AF_INET6
socket as well as UDPv6 when trying to transmit fragmentably.
Without this, packets filled to the normal size used by the kernel AFS
client of 1412 bytes be rejected by udp_sendmsg() with EMSGSIZE
immediately. The ->sk_error_report() notification hook is called, but
rxrpc doesn't generate a trace for it.
This is a temporary fix; a more permanent solution needs to involve
changing the size of the packets being filled in accordance with the MTU,
which isn't currently done in AF_RXRPC. The reason for not doing so was
that, barring the last packet in an rx jumbo packet, jumbos can only be
assembled out of 1412-byte packets - and the plan was to construct jumbos
on the fly at transmission time.
Also, there's no point turning on IPV6_MTU_DISCOVER, since IPv6 has to
engage in this anyway since fragmentation is only done by the sender. We
can then condense the switch-statement in rxrpc_send_data_packet().
Fixes: f1721f4f1c81 ("rxrpc: Add IPv6 support") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>