Include sys/time.h and pthread.h in tmon.h, so that types
"pthread_mutex_t" and "struct timeval tv" are known when tmon.h
references them.
Without these headers, compiling tmon against musl-libc will fail with
these errors:
In file included from sysfs.c:31:0:
tmon.h:47:8: error: unknown type name 'pthread_mutex_t'
extern pthread_mutex_t input_lock;
^~~~~~~~~~~~~~~
make[3]: *** [<builtin>: sysfs.o] Error 1
make[3]: *** Waiting for unfinished jobs....
In file included from tui.c:31:0:
tmon.h:54:17: error: field 'tv' has incomplete type
struct timeval tv;
^~
make[3]: *** [<builtin>: tui.o] Error 1
make[2]: *** [Makefile:83: tmon] Error 2
Signed-off-by: Markus Mayer <mmayer@broadcom.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Sumeet Pawnikar <sumeet.r.pawnikar@intel.com> Acked-by: Alejandro González <alejandro.gonzalez.correo@gmail.com> Tested-by: Alejandro González <alejandro.gonzalez.correo@gmail.com> Fixes: ef0e98c9b0aa ("tools/thermal: Introduce tmon, a tool for thermal subsystem") Link: https://lore.kernel.org/r/20220718031040.44714-1-f.fainelli@gmail.com Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
dequeue_task_rt() only decrements 'rt_rq->rt_nr_running' after having
called sched_update_tick_dependency() preventing it from re-enabling the
tick on systems that no longer have pending SCHED_RT tasks but have
multiple runnable SCHED_OTHER tasks:
Every other scheduler class performs the operation in the opposite
order, and sched_update_tick_dependency() expects the values to be
updated as such. So avoid the misbehaviour by inverting the order in
which the above operations are performed in the RT scheduler.
Fixes: fefbf9ffb60d ("sched: Migrate sched to use new tick dependency mask model") Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20220628092259.330171-1-nsaenzju@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
We should call the of_node_put() for the reference returned by
of_get_child_by_name() which has increased the refcount.
Fixes: 5b2041ba9280 ("regulator: of: Add support for parsing regulator_state for suspend state") Signed-off-by: Liang He <windhl@126.com> Link: https://lore.kernel.org/r/20220715111027.391032-1-windhl@126.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
blk_mq_debugfs_register_hctx() can be called by blk_mq_update_nr_hw_queues
when gendisk isn't added yet, such as nvme tcp.
Fixes the warning of 'debugfs: Directory 'hctx0' with parent '/' already present!'
which can be observed reliably when running blktests nvme/005.
Fixes: af37d63c8b61 ("blk-mq: no need to check return value of debugfs_create functions") Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220711090808.259682-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
The bananapi R64 (BPI-R64) experiences wrong WPS button signals.
In OpenWrt pushing the WPS button while powering on the device will set
it to recovery mode. Currently, this also happens without any user
interaction. In particular, the wrong signals appear while booting the
device or restarting it, e.g. after doing a system upgrade. If the
device is in recovery mode the user needs to manually power cycle or
restart it.
The official BPI-R64 sources set the WPS button to GPIO_ACTIVE_LOW in
the device tree. This setting seems to suppress the unwanted WPS button
press signals. So this commit changes the button from GPIO_ACTIVE_HIGH to
GPIO_ACTIVE_LOW.
The official BPI-R64 sources can be found on
https://github.com/BPI-SINOVOIP/BPI-R64-openwrt
In error case in hisi_lpc_acpi_probe() after calling platform_device_add(),
hisi_lpc_acpi_remove() can't release the failed 'pdev', so it will be leak,
call platform_device_put() to fix this problem.
I'v constructed this error case and tested this patch on D05 board.
Fixes: 775896d243f9 ("HISI LPC: Re-Add ACPI child enumeration support") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Acked-by: John Garry <john.garry@huawei.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Every iteration of for_each_available_child_of_node() decrements
the reference count of the previous node.
When breaking early from a for_each_available_child_of_node() loop,
we need to explicitly call of_node_put() on the child node.
Add missing of_node_put() to avoid refcount leak.
Fixes: 40c7bca794c8 ("soc: qcom: Extend AOSS QMP driver to support resources that are used to wake up the SoC.") Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org> Link: https://lore.kernel.org/r/20220606064252.42595-1-linmq006@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
of_find_compatible_node() returns a node pointer with refcount
incremented, we should use of_node_put() on it when done.
Add missing of_node_put() to avoid refcount leak.
Fixes: f439cd63c4a3 ("ARM: zynq: Add support for SOC_BUS") Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Link: https://lore.kernel.org/r/20220605082807.21526-1-linmq006@gmail.com Signed-off-by: Michal Simek <michal.simek@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
of_find_matching_node() returns a node pointer with refcount
incremented, we should use of_node_put() on it when not need anymore.
Add missing of_node_put() to avoid refcount leak.
Fixes: c63b31c998f3 ("ARM: OMAP3+: PRM: register interrupt information from DT") Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Message-Id: <20220526073724.21169-1-linmq006@gmail.com> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
omapdss_find_dss_of_node() calls of_find_compatible_node() to get device
node. of_find_compatible_node() returns a node pointer with refcount
incremented, we should use of_node_put() on it when done.
Add missing of_node_put() in later error path and normal path.
Fixes: 61b9921aa1251 ("drm/omap: Populate DSS children in omapdss driver") Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Message-Id: <20220601044858.3352-1-linmq006@gmail.com> Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
of_find_matching_node() returns a node pointer with refcount
incremented, we should use of_node_put() on it when not need anymore.
Add missing of_node_put() to avoid refcount leak.
of_find_matching_node() returns a node pointer with refcount
incremented, we should use of_node_put() on it when not need anymore.
Add missing of_node_put() to avoid refcount leak.
Fixes: a07535819dbd ("soc: amlogic: Add Meson6/Meson8/Meson8b/Meson8m2 SoC Information driver") Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Signed-off-by: Neil Armstrong <narmstrong@baylibre.com> Link: https://lore.kernel.org/r/20220524065729.33689-1-linmq006@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
When offset is larger than the size of the bit array, we should not
attempt to access the array as we can perform an access beyond the
end of the array. Fix this by changing the pre-condition.
Using "cmp r2, r1; bhs ..." covers us for the size == 0 case, since
this will always take the branch when r1 is zero, irrespective of
the value of r2. This means we can fix this bug without adding any
additional code!
Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
RSPI IP on RZ/{A, G2L} SoC's has the same signal for both interrupt
and DMA transfer request. Setting DMARS register for DMA transfer
makes the signal to work as a DMA transfer request signal and
subsequent interrupt requests to the interrupt controller
are masked.
PIO fallback does not work as interrupt signal is disabled.
This patch fixes this issue by re-enabling the interrupts by
calling dmaengine_synchronize().
syzbot is reporting hung task at misc_open() [1], for there is a race
window of AB-BA deadlock which involves probe_count variable. Currently
wait_for_device_probe() from snapshot_open() from misc_open() can sleep
forever with misc_mtx held if probe_count cannot become 0.
When a device is probed by hub_event() work function, probe_count is
incremented before the probe function starts, and probe_count is
decremented after the probe function completed.
There are three cases that can prevent probe_count from dropping to 0.
(a) A device being probed stopped responding (i.e. broken/malicious
hardware).
(b) A process emulating a USB device using /dev/raw-gadget interface
stopped responding for some reason.
(c) New device probe requests keeps coming in before existing device
probe requests complete.
The phenomenon syzbot is reporting is (b). A process which is holding
system_transition_mutex and misc_mtx is waiting for probe_count to become
0 inside wait_for_device_probe(), but the probe function which is called
from hub_event() work function is waiting for the processes which are
blocked at mutex_lock(&misc_mtx) to respond via /dev/raw-gadget interface.
This patch mitigates (b) by deferring wait_for_device_probe() from
snapshot_open() to snapshot_write() and snapshot_ioctl(). Please note that
the possibility of (b) remains as long as any thread which is emulating a
USB device via /dev/raw-gadget interface can be blocked by uninterruptible
blocking operations (e.g. mutex_lock()).
Please also note that (a) and (c) are not addressed. Regarding (c), we
should change the code to wait for only one device which contains the
image for resuming from hibernation. I don't know how to address (a), for
use of timeout for wait_for_device_probe() might result in loss of user
data in the image. Maybe we should require the userland to wait for the
image device before opening /dev/snapshot interface.
In rcar_gen2_regulator_quirk(), for_each_matching_node_and_match() will
automatically increase and decrease the refcount. However, we should
call of_node_get() for the new reference created in 'quirk->np'.
Besides, we also should call of_node_put() before the 'quirk' being
freed.
But, Lenovo G40-45, a platform released in 2015, still needs NVS memory
saving during S3. A quirk is introduced for this platform.
Signed-off-by: Manyi Li <limanyi@uniontech.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Somehow the "ThinkPad X1 Carbon 6th" entry ended up twice in the
struct dmi_system_id acpi_ec_no_wakeup[] array. Remove one of
the entries.
Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
In omapdss_init_fbdev(), of_find_node_by_name() will return a node
pointer with refcount incremented. We should use of_node_put() when
it is not used anymore.
Signed-off-by: Liang He <windhl@126.com>
Message-Id: <20220617145803.4050918-1-windhl@126.com> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
imx6ul is not compatible to imx6sx, both have different erratas.
Fixes the dt_binding_check warning:
spi@21e0000: compatible: 'oneOf' conditional failed, one must be fixed:
['fsl,imx6ul-qspi', 'fsl,imx6sx-qspi'] is too long
Additional items are not allowed ('fsl,imx6sx-qspi' was unexpected)
'fsl,imx6ul-qspi' is not one of ['fsl,ls1043a-qspi']
'fsl,imx6ul-qspi' is not one of ['fsl,imx8mq-qspi']
'fsl,ls1021a-qspi' was expected
'fsl,imx7d-qspi' was expected
In yaml binding "fsl,imx6ul-lcdif" is listed as compatible to imx6sx-lcdif,
but not imx28-lcdif. Change the list accordingly. Fixes the
dt_binding_check warning:
lcdif@21c8000: compatible: 'oneOf' conditional failed, one must be fixed:
['fsl,imx6ul-lcdif', 'fsl,imx28-lcdif'] is too long
Additional items are not allowed ('fsl,imx28-lcdif' was unexpected)
'fsl,imx6ul-lcdif' is not one of ['fsl,imx23-lcdif', 'fsl,imx28-lcdif',
'fsl,imx6sx-lcdif']
'fsl,imx6sx-lcdif' was expected
"fsl,imx6ul-csi" was never listed as compatible to "fsl,imx7-csi", neither
in yaml bindings, nor previous txt binding. Remove the imx7 part. Fixes
the dt schema check warning:
csi@21c4000: compatible: 'oneOf' conditional failed, one must be fixed:
['fsl,imx6ul-csi', 'fsl,imx7-csi'] is too long
Additional items are not allowed ('fsl,imx7-csi' was unexpected)
'fsl,imx8mm-csi' was expected
operating-points is a uint32-matrix as per opp-v1.yaml. Change it
accordingly. While at it, change fsl,soc-operating-points as well,
although there is no bindings file (yet). But they should have the same
format. Fixes the dt_binding_check warning:
cpu@0: operating-points:0: [696000, 1275000, 528000, 1175000, 396000, 1025000, 198000, 950000] is too long
cpu@0: operating-points:0: Additional items are not allowed (528000, 1175000, 396000, 1025000, 198000, 950000 were unexpected)
All 3 properties are required by sram.yaml. Fixes the dtbs_check
warning:
sram@900000: '#address-cells' is a required property
sram@900000: '#size-cells' is a required property
sram@900000: 'ranges' is a required property
Changes to hrtimer mode (potentially made by __hrtimer_init_sleeper on
PREEMPT_RT) are not visible to hrtimer_start_range_ns, thus not
accounted for by hrtimer_start_expires call paths. In particular,
__wait_event_hrtimeout suffers from this problem as we have, for
example:
fs/aio.c::read_events
wait_event_interruptible_hrtimeout
__wait_event_hrtimeout
hrtimer_init_sleeper_on_stack <- this might "mode |= HRTIMER_MODE_HARD"
on RT if task runs at RT/DL priority
hrtimer_start_range_ns
WARN_ON_ONCE(!(mode & HRTIMER_MODE_HARD) ^ !timer->is_hard)
fires since the latter doesn't see the change of mode done by
init_sleeper
Fix it by making __wait_event_hrtimeout call hrtimer_sleeper_start_expires,
which is aware of the special RT/DL case, instead of hrtimer_start_range_ns.
Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20220627095051.42470-1-juri.lelli@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
Function irq_chip::irq_request_resources() is reported as optional
in the declaration of struct irq_chip.
If the parent irq_chip does not implement it, we should ignore it
and return.
Add checks verifying number of inodes stored in the superblock matches
the number computed from number of inodes per group. Also verify we have
at least one block worth of inodes per group. This prevents crashes on
corrupted filesystems.
Reported-by: syzbot+d273f7d7f58afd93be48@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
To fix this issue, keep the table->data as &insn->current_mode and
use container_of() to retrieve the insn pointer. Another mutex is
used to protect against the current_mode update but not for retrieving
insn_emulation as table->data is no longer changing.
Enable tracing of the execve*() system calls with the
syscalls:sys_exit_execve tracepoint by removing the call to
forget_syscall() when starting a new thread and preserving the value of
regs->syscallno across exec.
When kernel is booted with idle=nomwait do not use MWAIT as the
default idle state.
If the user boots the kernel with idle=nomwait, it is a clear
direction to not use mwait as the default idle state.
However, the current code does not take this into consideration
while selecting the default idle state on x86.
Fix it by checking for the idle=nomwait boot option in
prefer_mwait_c1_over_halt().
Also update the documentation around idle=nomwait appropriately.
If a process is killed or otherwise exits while having active network
connections and many threads waiting on epoll_wait, the threads will all
be woken immediately, but not removed from ep->wq. Then when network
traffic scans ep->wq in wake_up, every wakeup attempt will fail, and will
not remove the entries from the list.
This means that the cost of the wakeup attempt is far higher than usual,
does not decrease, and this also competes with the dying threads trying to
actually make progress and remove themselves from the wq.
Handle this by removing visited epoll wq entries unconditionally, rather
than only when the wakeup succeeds - the structure of ep_poll means that
the only potential loss is the timed_out->eavail heuristic, which now can
race and result in a redundant ep_send_events attempt. (But only when
incoming data and a timeout actually race, not on every timeout)
Shakeel added:
: We are seeing this issue in production with real workloads and it has
: caused hard lockups. Particularly network heavy workloads with a lot
: of threads in epoll_wait() can easily trigger this issue if they get
: killed (oom-killed in our case).
Link: https://lkml.kernel.org/r/xm26fsjotqda.fsf@google.com Signed-off-by: Ben Segall <bsegall@google.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Roman Penyaev <rpenyaev@suse.de> Cc: Jason Baron <jbaron@akamai.com> Cc: Khazhismel Kumykov <khazhy@google.com> Cc: Heiher <r@hev.cc> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In nf_tables_updtable, if nf_tables_table_enable returns an error,
nft_trans_destroy is called to free the transaction object.
nft_trans_destroy() calls list_del(), but the transaction was never
placed on a list -- the list head is all zeroes, this results in
a null dereference:
BUG: KASAN: null-ptr-deref in nft_trans_destroy+0x26/0x59
Call Trace:
nft_trans_destroy+0x26/0x59
nf_tables_newtable+0x4bc/0x9bc
[..]
Its sane to assume that nft_trans_destroy() can be called
on the transaction object returned by nft_trans_alloc(), so
make sure the list head is initialised.
Fixes: 70acafd59c8d ("netfilter: nf_tables: use new transaction infrastructure to handle table") Reported-by: mingi cho <mgcho.minic@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When doing lookups for rules on the same batch by using its ID, a rule from
a different chain can be used. If a rule is added to a chain but tries to
be positioned next to a rule from a different chain, it will be linked to
chain2, but the use counter on chain1 would be the one to be incremented.
When looking for rules by ID, use the chain that was used for the lookup by
name. The chain used in the context copied to the transaction needs to
match that same chain. That way, struct nft_rule does not need to get
enlarged with another member.
Fixes: 391c8ef5ec05 ("netfilter: nf_tables: add NFTA_RULE_ID attribute") Fixes: c5a31e5b33aa ("netfilter: nf_tables: Support RULE_ID reference in new rule") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Cc: <stable@vger.kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When doing lookups for sets on the same batch by using its ID, a set from a
different table can be used.
Then, when the table is removed, a reference to the set may be kept after
the set is freed, leading to a potential use-after-free.
When looking for sets by ID, use the table that was used for the lookup by
name, and only return sets belonging to that same table.
This fixes CVE-2022-2586, also reported as ZDI-CAN-17470.
Reported-by: Team Orca of Sea Security (@seasecresponse) Fixes: dc63100f750d ("netfilter: nf_tables: use new transaction infrastructure to handle sets") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Cc: <stable@vger.kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Usb core introduce the mechanism of giveback of URB in tasklet context to
reduce hardware interrupt handling time. On some test situation(such as
FIO with 4KB block size), when tasklet callback function called to
giveback URB, interrupt handler add URB node to the bh->head list also.
If check bh->head list again after finish all URB giveback of local_list,
then it may introduce a "dynamic balance" between giveback URB and add URB
to bh->head list. This tasklet callback function may not exit for a long
time, which will cause other tasklet function calls to be delayed. Some
real-time applications(such as KB and Mouse) will see noticeable lag.
In order to prevent the tasklet function from occupying the cpu for a long
time at a time, new URBS will not be added to the local_list even though
the bh->head list is not empty. But also need to ensure the left URB
giveback to be processed in time, so add a member high_prio for structure
giveback_urb_bh to prioritize tasklet and schelule this tasklet again if
bh->head list is not empty.
At the same time, we are able to prioritize tasklet through structure
member high_prio. So, replace the local high_prio_bh variable with this
structure member in usb_hcd_giveback_urb.
Fixes: 22b4930c5dc8 ("USB: HCD: support giveback of URB in tasklet context") Cc: stable <stable@kernel.org> Reviewed-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Weitao Wang <WeitaoWang-oc@zhaoxin.com> Link: https://lore.kernel.org/r/20220726074918.5114-1-WeitaoWang-oc@zhaoxin.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
coresight devices track their connections (output connections) and
hold a reference to the fwnode. When a device goes away, we walk through
the devices on the coresight bus and make sure that the references
are dropped. This happens both ways:
a) For all output connections from the device, drop the reference to
the target device via coresight_release_platform_data()
b) Iterate over all the devices on the coresight bus and drop the
reference to fwnode if *this* device is the target of the output
connection, via coresight_remove_conns()->coresight_remove_match().
However, the coresight_remove_match() doesn't clear the fwnode field,
after dropping the reference, this causes use-after-free and
additional refcount drops on the fwnode.
e.g., if we have two devices, A and B, with a connection, A -> B.
If we remove B first, B would clear the reference on B, from A
via coresight_remove_match(). But when A is removed, it still has
a connection with fwnode still pointing to B. Thus it tries to drops
the reference in coresight_release_platform_data(), raising the bells
like :
When CONFIG_CPUMASK_OFFSTACK and CONFIG_DEBUG_PER_CPU_MAPS is selected,
cpu_max_bits_warn() generates a runtime warning similar as below while
we show /proc/cpuinfo. Fix this by using nr_cpu_ids (the runtime limit)
instead of NR_CPUS to iterate CPUs.
On a bare-metal Power8 system that doesn't have an "ibm,power-rng", a
malicious QEMU and guest that ignore the absence of the
KVM_CAP_PPC_HWRNG flag, and calls H_RANDOM anyway, will dereference a
NULL pointer.
In practice all Power8 machines have an "ibm,power-rng", but let's not
rely on that, add a NULL check and early return in
powernv_get_random_real_mode().
Fixes: 77df4a0b2a38 ("KVM: PPC: Book3S HV: Add fast real-mode H_RANDOM implementation.") Cc: stable@vger.kernel.org # v4.1+ Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220727143219.2684192-1-mpe@ellerman.id.au Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On FSL_BOOK3E, _PAGE_RW is defined with two bits, one for user and one
for supervisor. As soon as one of the two bits is set, the page has
to be display as RW. But the way it is implemented today requires both
bits to be set in order to display it as RW.
Instead of display RW when _PAGE_RW bits are set and R otherwise,
reverse the logic and display R when _PAGE_RW bits are all 0 and
RW otherwise.
This change has no impact on other platforms as _PAGE_RW is a single
bit on all of them.
By default old pre-3.0 Freescale PCIe controllers reports invalid PCI Class
Code 0x0b20 for PCIe Root Port. It can be seen by lspci -b output on P2020
board which has this pre-3.0 controller:
$ lspci -bvnn
00:00.0 Power PC [0b20]: Freescale Semiconductor Inc P2020E [1957:0070] (rev 21)
!!! Invalid class 0b20 for header type 01
Capabilities: [4c] Express Root Port (Slot-), MSI 00
Fix this issue by programming correct PCI Class Code 0x0604 for PCIe Root
Port to the Freescale specific PCIe register 0x474.
Fix activated by U-Boot stay active also after booting Linux kernel.
But boards which use older U-Boot version without that fix are affected and
still require this fix.
So implement this class code fix also in kernel fsl_pci.c driver.
test_bit(), as any other bitmap op, takes `unsigned long *` as a
second argument (pointer to the actual bitmap), as any bitmap
itself is an array of unsigned longs. However, the ia64_get_irr()
code passes a ref to `u64` as a second argument.
This works with the ia64 bitops implementation due to that they
have `void *` as the second argument and then cast it later on.
This works with the bitmap API itself due to that `unsigned long`
has the same size on ia64 as `u64` (`unsigned long long`), but
from the compiler PoV those two are different.
Define @irr as `unsigned long` to fix that. That implies no
functional changes. Has been hidden for 16 years!
There's a KASAN warning in raid10_remove_disk when running the lvm
test lvconvert-raid-reshape.sh. We fix this warning by verifying that the
value "number" is valid.
BUG: KASAN: slab-out-of-bounds in raid10_remove_disk+0x61/0x2a0 [raid10]
Read of size 8 at addr ffff889108f3d300 by task mdX_raid10/124682
Last potentially related work creation:
kasan_save_stack+0x1e/0x40
__kasan_record_aux_stack+0x9e/0xc0
kvfree_call_rcu+0x84/0x480
timerfd_release+0x82/0x140
L __fput+0xfa/0x400
task_work_run+0x80/0xc0
exit_to_user_mode_prepare+0x155/0x160
syscall_exit_to_user_mode+0x12/0x40
do_syscall_64+0x42/0xc0
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Second to last potentially related work creation:
kasan_save_stack+0x1e/0x40
__kasan_record_aux_stack+0x9e/0xc0
kvfree_call_rcu+0x84/0x480
timerfd_release+0x82/0x140
__fput+0xfa/0x400
task_work_run+0x80/0xc0
exit_to_user_mode_prepare+0x155/0x160
syscall_exit_to_user_mode+0x12/0x40
do_syscall_64+0x42/0xc0
entry_SYSCALL_64_after_hwframe+0x46/0xb0
The buggy address belongs to the object at ffff889108f3d200
which belongs to the cache kmalloc-256 of size 256
The buggy address is located 0 bytes to the right of
256-byte region [ffff889108f3d200, ffff889108f3d300)
Memory state around the buggy address: ffff889108f3d200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff889108f3d280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff889108f3d300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^ ffff889108f3d380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff889108f3d400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The driver use the non-managed form of the register function in
isl29028_remove(). To keep the release order as mirroring the ordering
in probe, the driver should use non-managed form in probe, too.
When pinning a buffer, we should check to see if there are any
additional restrictions imposed by bo->preferred_domains. This will
prevent the BO from being moved to an invalid domain when pinning.
For example, this can happen if the user requests to create a BO in GTT
domain for display scanout. amdgpu_dm will allow pinning to either VRAM
or GTT domains, since DCN can scanout from either or. However, in
amdgpu_bo_pin_restricted(), pinning to VRAM is preferred if there is
adequate carveout. This can lead to pinning to VRAM despite the user
requesting GTT placement for the BO.
v2: Allow the kernel to override the domain, which can happen when
exporting a BO to a V4L camera (for example).
Signed-off-by: Leo Li <sunpeng.li@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Use ww_acquire_fini() in the error code paths. Otherwise lockdep
thinks that lock is held when lock's memory is freed after the
drm_gem_lock_reservations() error. The ww_acquire_context needs to be
annotated as "released", which fixes the noisy "WARNING: held lock freed!"
splat of VirtIO-GPU driver with CONFIG_DEBUG_MUTEXES=y and enabled lockdep.
Cc: stable@vger.kernel.org Fixes: 26b8a18e3cc11 ("drm: Add helpers for locking an array of BO reservations.") Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20220630200405.1883897-2-dmitry.osipenko@collabora.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For all syscalls in 32-bit compat mode on 64-bit kernels the upper
32-bits of the 64-bit registers are zeroed out, so a negative 32-bit
signed value will show up as positive 64-bit signed value.
This behaviour breaks the io_pgetevents_time64() syscall which expects
signed 64-bit values for the "min_nr" and "nr" parameters.
Fix this by switching to the compat_sys_io_pgetevents_time64() syscall,
which uses "compat_long_t" types for those parameters.
Fix the output of /proc/iomem to show the real hardware device name
including the pa_pathname, e.g. "Merlin 160 Core Centronics [8:16:0]".
Up to now only the pa_pathname ("[8:16.0]") was shown.
Some code paths cannot guarantee the inode have any dentry alias. So
WARN_ON() all !dentry may flood the kernel logs.
For example, when an overlayfs inode is watched by inotifywait (1), and
someone is trying to read the /proc/$(pidof inotifywait)/fdinfo/INOTIFY_FD,
at that time if the dentry has been reclaimed by kernel (such as
echo 2 > /proc/sys/vm/drop_caches), there will be a WARN_ON(). The
printed call stack would be like:
usbnet uses the work usbnet_deferred_kevent() to perform tasks which may
sleep. On disconnect, completion of the work was originally awaited in
->ndo_stop(). But in 2003, that was moved to ->disconnect() by historic
commit "[PATCH] USB: usbnet, prevent exotic rtnl deadlock":
The change was made because back then, the kernel's workqueue
implementation did not allow waiting for a single work. One had to wait
for completion of *all* work by calling flush_scheduled_work(), and that
could deadlock when waiting for usbnet_deferred_kevent() with rtnl_mutex
held in ->ndo_stop().
The commit solved one problem but created another: It causes a
use-after-free in USB Ethernet drivers aqc111.c, asix_devices.c,
ax88179_178a.c, ch9200.c and smsc75xx.c:
* If the drivers receive a link change interrupt immediately before
disconnect, they raise EVENT_LINK_RESET in their (non-sleepable)
->status() callback and schedule usbnet_deferred_kevent().
* usbnet_deferred_kevent() invokes the driver's ->link_reset() callback,
which calls netif_carrier_{on,off}().
* That in turn schedules the work linkwatch_event().
Because usbnet_deferred_kevent() is awaited after unregister_netdev(),
netif_carrier_{on,off}() may operate on an unregistered netdev and
linkwatch_event() may run after free_netdev(), causing a use-after-free.
In 2010, usbnet was changed to only wait for a single instance of
usbnet_deferred_kevent() instead of *all* work by commit b489eba7fc5c
("drivers/net: don't use flush_scheduled_work()").
Unfortunately the commit neglected to move the wait back to
->ndo_stop(). Rectify that omission at long last.
The user may use the fbcon=vc:<n1>-<n2> option to tell fbcon to take
over the given range (n1...n2) of consoles. The value for n1 and n2
needs to be a positive number and up to (MAX_NR_CONSOLES - 1).
The given values were not fully checked against those boundaries yet.
To fix the issue, convert first_fb_vc and last_fb_vc to unsigned
integers and check them against the upper boundary, and make sure that
first_fb_vc is smaller than last_fb_vc.
If cooling_device_stats_setup() fails to create the stats object, it
must clear the last slot in cooling_device_attr_groups that was
initially empty (so as to make it possible to add stats attributes to
the cooling device attribute groups).
Failing to do so may cause the stats attributes to be created by
mistake for a device that doesn't have a stats object, because the
slot in question might be populated previously during the registration
of another cooling device.
Fixes: 75acf6d9af82 ("thermal: Add cooling device's statistics in sysfs") Reported-by: Di Shen <di.shen@unisoc.com> Tested-by: Di Shen <di.shen@unisoc.com> Cc: 4.17+ <stable@vger.kernel.org> # 4.17+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
All creation paths except for O_TMPFILE handle umask in the vfs directly
if the filesystem doesn't support or enable POSIX ACLs. If the filesystem
does then umask handling is deferred until posix_acl_create().
Because, O_TMPFILE misses umask handling in the vfs it will not honor
umask settings. Fix this by adding the missing umask handling.
Link: https://lore.kernel.org/r/1657779088-2242-2-git-send-email-xuyang2018.jy@fujitsu.com Fixes: 1520d1d56327 ("[O_TMPFILE] it's still short a few helpers, but infrastructure should be OK now...") Cc: <stable@vger.kernel.org> # 4.19+ Reported-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If something manages to set the maximum file size to MAX_OFFSET+1, this
can cause the xfs and ext4 filesystems at least to become corrupt.
Ordinarily, the kernel protects against userspace trying this by
checking the value early in the truncate() and ftruncate() system calls
calls - but there are at least two places that this check is bypassed:
(1) Cachefiles will round up the EOF of the backing file to DIO block
size so as to allow DIO on the final block - but this might push
the offset negative. It then calls notify_change(), but this
inadvertently bypasses the checking. This can be triggered if
someone puts an 8EiB-1 file on a server for someone else to try and
access by, say, nfs.
(2) ksmbd doesn't check the value it is given in set_end_of_file_info()
and then calls vfs_truncate() directly - which also bypasses the
check.
In both cases, it is potentially possible for a network filesystem to
cause a disk filesystem to be corrupted: cachefiles in the client's
cache filesystem; ksmbd in the server's filesystem.
nfsd is okay as it checks the value, but we can then remove this check
too.
Fix this by adding a check to inode_newsize_ok(), as called from
setattr_prepare(), thereby catching the issue as filesystems set up to
perform the truncate with minimal opportunity for bypassing the new
check.
Fixes: 724af98fc23d ("cachefiles: Implement backing file wrangling") Fixes: ffc65102a17f ("cifsd: add file operations") Signed-off-by: David Howells <dhowells@redhat.com> Reported-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Cc: stable@kernel.org Acked-by: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Steve French <sfrench@samba.org>
cc: Hyunchul Lee <hyc.lee@gmail.com>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is another Asus K42JZ model with the PCI SSID 1043:1313
that requires the quirk ALC269VB_FIXUP_ASUS_MIC_NO_PRESENCE.
Add the corresponding entry to the quirk table.
The 12,1 model requires the same configuration as the 12,2 model
to enable headphones but has a different codec SSID. Adds
12,1 SSID for matching quirk.
There is another LENOVO 20149 (Type1Sku0) Notebook model with
CX20590, the device PCI SSID is 17aa:3977, which headphones are
not responding, that requires the quirk CXT_PINCFG_LENOVO_NOTEBOOK.
Add the corresponding entry to the quirk table.
To avoid a race between rmap walk and mremap, mremap does
take_rmap_locks(). The lock was taken to ensure that rmap walk don't miss
a page table entry due to PTE moves via move_pagetables(). The kernel
does further optimization of this lock such that if we are going to find
the newly added vma after the old vma, the rmap lock is not taken. This
is because rmap walk would find the vmas in the same order and if we don't
find the page table attached to older vma we would find it with the new
vma which we would iterate later.
As explained in commit 3cf8ab7353d1 ("mremap: properly flush TLB before
releasing the page") mremap is special in that it doesn't take ownership
of the page. The optimized version for PUD/PMD aligned mremap also
doesn't hold the ptl lock. This can result in stale TLB entries as show
below.
This patch updates the rmap locking requirement in mremap to handle the race condition
explained below with optimized mremap::
When injecting a #GP on LLDT/LTR due to a non-canonical LDT/TSS base, set
the error code to the selector. Intel SDM's says nothing about the #GP,
but AMD's APM explicitly states that both LLDT and LTR set the error code
to the selector, not zero.
Note, a non-canonical memory operand on LLDT/LTR does generate a #GP(0),
but the KVM code in question is specific to the base from the descriptor.
Fixes: 0545389c5924 ("KVM: x86: Emulator ignores LDTR/TR extended base on LLDT/LTR") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220711232750.1092012-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Wait to mark the TSS as busy during LTR emulation until after all fault
checks for the LTR have passed. Specifically, don't mark the TSS busy if
the new TSS base is non-canonical.
Opportunistically drop the one-off !seg_desc.PRESENT check for TR as the
only reason for the early check was to avoid marking a !PRESENT TSS as
busy, i.e. the common !PRESENT is now done before setting the busy bit.
Restrict the nVMX MSRs based on KVM's config, not based on the guest's
current config. Using the guest's config to audit the new config
prevents userspace from restoring the original config (KVM's config) if
at any point in the past the guest's config was restricted in any way.
Fixes: 0dd15b119da0 ("KVM: nVMX: support restore of VMX capability MSRs") Cc: stable@vger.kernel.org Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220607213604.3346000-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Don't BUG/WARN on interrupt injection due to GIF being cleared,
since it's trivial for userspace to force the situation via
KVM_SET_VCPU_EVENTS (even if having at least a WARN there would be correct
for KVM internally generated injections).
If a nested run isn't pending, snapshot vmcs01.GUEST_IA32_DEBUGCTL
irrespective of whether or not VM_ENTRY_LOAD_DEBUG_CONTROLS is set in
vmcs12. When restoring nested state, e.g. after migration, without a
nested run pending, prepare_vmcs02() will propagate
nested.vmcs01_debugctl to vmcs02, i.e. will load garbage/zeros into
vmcs02.GUEST_IA32_DEBUGCTL.
If userspace restores nested state before MSRs, then loading garbage is a
non-issue as loading DEBUGCTL will also update vmcs02. But if usersepace
restores MSRs first, then KVM is responsible for propagating L2's value,
which is actually thrown into vmcs01, into vmcs02.
Restoring L2 MSRs into vmcs01, i.e. loading all MSRs before nested state
is all kinds of bizarre and ideally would not be supported. Sadly, some
VMMs do exactly that and rely on KVM to make things work.
Note, there's still a lurking SMM bug, as propagating vmcs01's DEBUGCTL
to vmcs02 across RSM may corrupt L2's DEBUGCTL. But KVM's entire VMX+SMM
emulation is flawed as SMI+RSM should not toouch _any_ VMCS when use the
"default treatment of SMIs", i.e. when not using an SMI Transfer Monitor.
If a nested run isn't pending, snapshot vmcs01.GUEST_BNDCFGS irrespective
of whether or not VM_ENTRY_LOAD_BNDCFGS is set in vmcs12. When restoring
nested state, e.g. after migration, without a nested run pending,
prepare_vmcs02() will propagate nested.vmcs01_guest_bndcfgs to vmcs02,
i.e. will load garbage/zeros into vmcs02.GUEST_BNDCFGS.
If userspace restores nested state before MSRs, then loading garbage is a
non-issue as loading BNDCFGS will also update vmcs02. But if usersepace
restores MSRs first, then KVM is responsible for propagating L2's value,
which is actually thrown into vmcs01, into vmcs02.
Restoring L2 MSRs into vmcs01, i.e. loading all MSRs before nested state
is all kinds of bizarre and ideally would not be supported. Sadly, some
VMMs do exactly that and rely on KVM to make things work.
Note, there's still a lurking SMM bug, as propagating vmcs01.GUEST_BNDFGS
to vmcs02 across RSM may corrupt L2's BNDCFGS. But KVM's entire VMX+SMM
emulation is flawed as SMI+RSM should not toouch _any_ VMCS when use the
"default treatment of SMIs", i.e. when not using an SMI Transfer Monitor.
Link: https://lore.kernel.org/all/Yobt1XwOfb5M6Dfa@google.com Fixes: e6f8ff3bf4c6 ("KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGS") Cc: stable@vger.kernel.org Cc: Lei Wang <lei4.wang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220614215831.3762138-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Touch switch state is received through WACOM_PAD_FIELD. However, it
is reported by touch_input. Don't register pad_input if no other pad
events require the interface.
The generic routine, wacom_wac_pen_event, turns rotation value 90
degree anti-clockwise before posting the events. This non-zero
event trggers a non-zero ABS_Z event for non art pen tools. However,
HID_DG_TWIST is only supported by art pen.
Neither wait_on_buffer nor buffer_uptodate contain any memory barrier.
Consequently, if someone calls sb_bread and then reads the buffer data,
the read of buffer data may be executed before wait_on_buffer(bh) on
architectures with weak memory ordering and it may return invalid data.
Fix this bug by adding a memory barrier to set_buffer_uptodate and an
acquire barrier to buffer_uptodate (in a similar way as
folio_test_uptodate and folio_mark_uptodate).
We won't really have enough skbs to need a 64-bit cookie,
and on 32-bit platforms storing the 64-bit cookie into the
void *rate_driver_data doesn't work anyway. Switch back to
using just a 32-bit cookie and uintptr_t for the type to
avoid compiler warnings about all this.
Fixes: d8f417c7e23a ("wifi: mac80211_hwsim: fix race condition in pending packet") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Cc: Jeongik Cha <jeongik@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
i225 devices have only one PHY vendor. There is no point checking
_I_PHY_ID during the link establishment and auto-negotiation process.
This patch comes to clean up these pointless checkings.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Cc: Linjun Bao <meljbao@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Users of GNU ld (BFD) from binutils 2.39+ will observe multiple
instances of a new warning when linking kernels in the form:
ld: warning: arch/x86/boot/pmjump.o: missing .note.GNU-stack section implies executable stack
ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
ld: warning: arch/x86/boot/compressed/vmlinux has a LOAD segment with RWX permissions
Generally, we would like to avoid the stack being executable. Because
there could be a need for the stack to be executable, assembler sources
have to opt-in to this security feature via explicit creation of the
.note.GNU-stack feature (which compilers create by default) or command
line flag --noexecstack. Or we can simply tell the linker the
production of such sections is irrelevant and to link the stack as
--noexecstack.
LLVM's LLD linker defaults to -z noexecstack, so this flag isn't
strictly necessary when linking with LLD, only BFD, but it doesn't hurt
to be explicit here for all linkers IMO. --no-warn-rwx-segments is
currently BFD specific and only available in the current latest release,
so it's wrapped in an ld-option check.
While the kernel makes extensive usage of ELF sections, it doesn't use
permissions from ELF segments.
Users of GNU ld (BFD) from binutils 2.39+ will observe multiple
instances of a new warning when linking kernels in the form:
ld: warning: vmlinux: missing .note.GNU-stack section implies executable stack
ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
ld: warning: vmlinux has a LOAD segment with RWX permissions
Generally, we would like to avoid the stack being executable. Because
there could be a need for the stack to be executable, assembler sources
have to opt-in to this security feature via explicit creation of the
.note.GNU-stack feature (which compilers create by default) or command
line flag --noexecstack. Or we can simply tell the linker the
production of such sections is irrelevant and to link the stack as
--noexecstack.
LLVM's LLD linker defaults to -z noexecstack, so this flag isn't
strictly necessary when linking with LLD, only BFD, but it doesn't hurt
to be explicit here for all linkers IMO. --no-warn-rwx-segments is
currently BFD specific and only available in the current latest release,
so it's wrapped in an ld-option check.
While the kernel makes extensive usage of ELF sections, it doesn't use
permissions from ELF segments.
RSB fill sequence does not have any protection for miss-prediction of
conditional branch at the end of the sequence. CPU can speculatively
execute code immediately after the sequence, while RSB filling hasn't
completed yet.
tl;dr: The Enhanced IBRS mitigation for Spectre v2 does not work as
documented for RET instructions after VM exits. Mitigate it with a new
one-entry RSB stuffing mechanism and a new LFENCE.
== Background ==
Indirect Branch Restricted Speculation (IBRS) was designed to help
mitigate Branch Target Injection and Speculative Store Bypass, i.e.
Spectre, attacks. IBRS prevents software run in less privileged modes
from affecting branch prediction in more privileged modes. IBRS requires
the MSR to be written on every privilege level change.
To overcome some of the performance issues of IBRS, Enhanced IBRS was
introduced. eIBRS is an "always on" IBRS, in other words, just turn
it on once instead of writing the MSR on every privilege level change.
When eIBRS is enabled, more privileged modes should be protected from
less privileged modes, including protecting VMMs from guests.
== Problem ==
Here's a simplification of how guests are run on Linux' KVM:
void run_kvm_guest(void)
{
// Prepare to run guest
VMRESUME();
// Clean up after guest runs
}
The execution flow for that would look something like this to the
processor:
1. Host-side: call run_kvm_guest()
2. Host-side: VMRESUME
3. Guest runs, does "CALL guest_function"
4. VM exit, host runs again
5. Host might make some "cleanup" function calls
6. Host-side: RET from run_kvm_guest()
Now, when back on the host, there are a couple of possible scenarios of
post-guest activity the host needs to do before executing host code:
* on pre-eIBRS hardware (legacy IBRS, or nothing at all), the RSB is not
touched and Linux has to do a 32-entry stuffing.
* on eIBRS hardware, VM exit with IBRS enabled, or restoring the host
IBRS=1 shortly after VM exit, has a documented side effect of flushing
the RSB except in this PBRSB situation where the software needs to stuff
the last RSB entry "by hand".
IOW, with eIBRS supported, host RET instructions should no longer be
influenced by guest behavior after the host retires a single CALL
instruction.
However, if the RET instructions are "unbalanced" with CALLs after a VM
exit as is the RET in #6, it might speculatively use the address for the
instruction after the CALL in #3 as an RSB prediction. This is a problem
since the (untrusted) guest controls this address.
Balanced CALL/RET instruction pairs such as in step #5 are not affected.
== Solution ==
The PBRSB issue affects a wide variety of Intel processors which
support eIBRS. But not all of them need mitigation. Today,
X86_FEATURE_RETPOLINE triggers an RSB filling sequence that mitigates
PBRSB. Systems setting RETPOLINE need no further mitigation - i.e.,
eIBRS systems which enable retpoline explicitly.
However, such systems (X86_FEATURE_IBRS_ENHANCED) do not set RETPOLINE
and most of them need a new mitigation.
Therefore, introduce a new feature flag X86_FEATURE_RSB_VMEXIT_LITE
which triggers a lighter-weight PBRSB mitigation versus RSB Filling at
vmexit.
The lighter-weight mitigation performs a CALL instruction which is
immediately followed by a speculative execution barrier (INT3). This
steers speculative execution to the barrier -- just like a retpoline
-- which ensures that speculation can never reach an unbalanced RET.
Then, ensure this CALL is retired before continuing execution with an
LFENCE.
In other words, the window of exposure is opened at VM exit where RET
behavior is troublesome. While the window is open, force RSB predictions
sampling for RET targets to a dead end at the INT3. Close the window
with the LFENCE.
There is a subset of eIBRS systems which are not vulnerable to PBRSB.
Add these systems to the cpu_vuln_whitelist[] as NO_EIBRS_PBRSB.
Future systems that aren't vulnerable will set ARCH_CAP_PBRSB_NO.
[ bp: Massage, incorporate review comments from Andy Cooper. ]
[ Pawan: Update commit message to replace RSB_VMEXIT with RETPOLINE ]
In do_adb_query() function of drivers/macintosh/adb.c, req->data is copied
form userland. The parameter "req->data[2]" is missing check, the array
size of adb_handler[] is 16, so adb_handler[req->data[2]].original_address and
adb_handler[req->data[2]].handler_id will lead to oob read.
Cc: stable <stable@kernel.org> Signed-off-by: Ning Qiang <sohu0106@126.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220713153734.2248-1-sohu0106@126.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
DST_QUEUE_OFF_BASE is applied to offset/mem_offset on MMAP capture buffers
only for the VIDIOC_QUERYBUF ioctl, while the userspace fields (including
offset/mem_offset) are filled in for VIDIOC_{QUERY,PREPARE,Q,DQ}BUF
ioctls. This leads to differences in the values presented to userspace.
If userspace attempts to mmap the capture buffer directly using values
from DQBUF, it will fail.
Move the code that applies the magic offset into a helper, and call
that helper from all four ioctl entry points.
[hverkuil: drop unnecessary '= 0' in v4l2_m2m_querybuf() for ret]
Fixes: 1ede4d08c8c3 ("V4L/DVB: add memory-to-memory device helper framework for videobuf") Fixes: 35ca39141453 ("[media] v4l: mem2mem: port to videobuf2") Signed-off-by: Chen-Yu Tsai <wenst@chromium.org> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
[OP: backport to 5.4: adjusted return logic in v4l2_m2m_qbuf() to match the
logic in the original commit: call v4l2_m2m_adjust_mem_offset() only if !ret
and before the v4l2_m2m_try_schedule() call] Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The selftests, when built with newer versions of clang, is found
to have over optimized guests' ucall() function, and eliminating
the stores for uc.cmd (perhaps due to no immediate readers). This
resulted in the userspace side always reading a value of '0', and
causing multiple test failures.
As a result, prevent the compiler from optimizing the stores in
ucall() with WRITE_ONCE().
Suggested-by: Ricardo Koller <ricarkol@google.com> Suggested-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Message-Id: <20220615185706.1099208-1-rananta@google.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
A KVM device cleanup happens in either of two callbacks:
1) destroy() which is called when the VM is being destroyed;
2) release() which is called when a device fd is closed.
Most KVM devices use 1) but Book3s's interrupt controller KVM devices
(XICS, XIVE, XIVE-native) use 2) as they need to close and reopen during
the machine execution. The error handling in kvm_ioctl_create_device()
assumes destroy() is always defined which leads to NULL dereference as
discovered by Syzkaller.
This adds a checks for destroy!=NULL and adds a missing release().
This is not changing kvm_destroy_devices() as devices with defined
release() should have been removed from the KVM devices list by then.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>