Catalin Marinas [Tue, 31 Aug 2021 08:10:00 +0000 (09:10 +0100)]
Merge remote-tracking branch 'tip/sched/arm64' into for-next/core
* tip/sched/arm64: (785 commits)
Documentation: arm64: describe asymmetric 32-bit support
arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores
arm64: Hook up cmdline parameter to allow mismatched 32-bit EL0
arm64: Advertise CPUs capable of running 32-bit applications in sysfs
arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system
arm64: exec: Adjust affinity for compat tasks with mismatched 32-bit EL0
arm64: Implement task_cpu_possible_mask()
sched: Introduce dl_task_check_affinity() to check proposed affinity
sched: Allow task CPU affinity to be restricted on asymmetric systems
sched: Split the guts of sched_setaffinity() into a helper function
sched: Introduce task_struct::user_cpus_ptr to track requested affinity
sched: Reject CPU affinity changes based on task_cpu_possible_mask()
cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq()
cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()
cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1
sched: Introduce task_cpu_possible_mask() to limit fallback rq selection
sched: Cgroup SCHED_IDLE support
sched/topology: Skip updating masks for non-online nodes
Linux 5.14-rc6
lib: use PFN_PHYS() in devmem_is_allowed()
...
Catalin Marinas [Thu, 26 Aug 2021 10:49:33 +0000 (11:49 +0100)]
Merge branch 'for-next/entry' into for-next/core
* for-next/entry:
: More entry.S clean-ups and conversion to C.
arm64: entry: call exit_to_user_mode() from C
arm64: entry: move bulk of ret_to_user to C
arm64: entry: clarify entry/exit helpers
arm64: entry: consolidate entry/exit helpers
Catalin Marinas [Thu, 26 Aug 2021 10:49:27 +0000 (11:49 +0100)]
Merge branches 'for-next/mte', 'for-next/misc' and 'for-next/kselftest', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf:
arm64/perf: Replace '0xf' instances with ID_AA64DFR0_PMUVER_IMP_DEF
* for-next/mte:
: Miscellaneous MTE improvements.
arm64/cpufeature: Optionally disable MTE via command-line
arm64: kasan: mte: remove redundant mte_report_once logic
arm64: kasan: mte: use a constant kernel GCR_EL1 value
arm64: avoid double ISB on kernel entry
arm64: mte: optimize GCR_EL1 modification on kernel entry/exit
Documentation: document the preferred tag checking mode feature
arm64: mte: introduce a per-CPU tag checking mode preference
arm64: move preemption disablement to prctl handlers
arm64: mte: change ASYNC and SYNC TCF settings into bitfields
arm64: mte: rename gcr_user_excl to mte_ctrl
arm64: mte: avoid TFSRE0_EL1 related operations unless in async mode
* for-next/misc:
: Miscellaneous updates.
arm64: Do not trap PMSNEVFR_EL1
arm64: mm: fix comment typo of pud_offset_phys()
arm64: signal32: Drop pointless call to sigdelsetmask()
arm64/sve: Better handle failure to allocate SVE register storage
arm64: Document the requirement for SCR_EL3.HCE
arm64: head: avoid over-mapping in map_memory
arm64/sve: Add a comment documenting the binutils needed for SVE asm
arm64/sve: Add some comments for sve_save/load_state()
arm64: replace in_irq() with in_hardirq()
arm64: mm: Fix TLBI vs ASID rollover
arm64: entry: Add SYM_CODE annotation for __bad_stack
arm64: fix typo in a comment
arm64: move the (z)install rules to arch/arm64/Makefile
arm64/sve: Make fpsimd_bind_task_to_cpu() static
arm64: unnecessary end 'return;' in void functions
arm64/sme: Document boot requirements for SME
arm64: use __func__ to get function name in pr_err
arm64: SSBS/DIT: print SSBS and DIT bit when printing PSTATE
arm64: cpufeature: Use defined macro instead of magic numbers
arm64/kexec: Test page size support with new TGRAN range values
* for-next/kselftest:
: Kselftest additions for arm64.
kselftest/arm64: signal: Add a TODO list for signal handling tests
kselftest/arm64: signal: Add test case for SVE register state in signals
kselftest/arm64: signal: Verify that signals can't change the SVE vector length
kselftest/arm64: signal: Check SVE signal frame shows expected vector length
kselftest/arm64: signal: Support signal frames with SVE register data
kselftest/arm64: signal: Add SVE to the set of features we can check for
kselftest/arm64: pac: Fix skipping of tests on systems without PAC
kselftest/arm64: mte: Fix misleading output when skipping tests
kselftest/arm64: Add a TODO list for floating point tests
kselftest/arm64: Add tests for SVE vector configuration
kselftest/arm64: Validate vector lengths are set in sve-probe-vls
kselftest/arm64: Provide a helper binary and "library" for SVE RDVL
kselftest/arm64: Ignore check_gcr_el1_cswitch binary
Alexandru Elisei [Tue, 24 Aug 2021 15:45:23 +0000 (16:45 +0100)]
arm64: Do not trap PMSNEVFR_EL1
Commit 9ae637a9dd61 ("arm64: Disable fine grained traps on boot") zeroed
the fine grained trap registers to prevent unwanted register traps from
occuring. However, for the PMSNEVFR_EL1 register, the corresponding
HDFG{R,W}TR_EL2.nPMSNEVFR_EL1 fields must be 1 to disable trapping. Set
both fields to 1 if FEAT_SPEv1p2 is detected to disable read and write
traps.
Fixes: 9ae637a9dd61 ("arm64: Disable fine grained traps on boot") Cc: <stable@vger.kernel.org> # 5.13.x Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210824154523.906270-1-alexandru.elisei@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Will Deacon [Wed, 25 Aug 2021 09:39:11 +0000 (10:39 +0100)]
arm64: signal32: Drop pointless call to sigdelsetmask()
Commit bb08f0e2649e ("most of set_current_blocked() callers want
SIGKILL/SIGSTOP removed from set") extended set_current_blocked() to
remove SIGKILL and SIGSTOP from the new signal set and updated all
callers accordingly.
Unfortunately, this collided with the merge of the arm64 architecture,
which duly removes these signals when restoring the compat sigframe, as
this was what was previously done by arch/arm/.
Remove the redundant call to sigdelsetmask() from
compat_restore_sigframe().
Mark Brown [Tue, 24 Aug 2021 15:34:17 +0000 (16:34 +0100)]
arm64/sve: Better handle failure to allocate SVE register storage
Currently we "handle" failure to allocate the SVE register storage by
doing a BUG_ON() and hoping for the best. This is obviously not great and
the memory allocation failure will already be loud enough without the
BUG_ON(). As the comment says it is a corner case but let's try to do a bit
better, remove the BUG_ON() and add code to handle the failure in the
callers.
For the ptrace and signal code we can return -ENOMEM gracefully however
we have no real error reporting path available to us for the SVE access
trap so instead generate a SIGKILL if the allocation fails there. This
at least means that we won't try to soldier on and end up trying to
access the nonexistant state and while it's obviously not ideal for
userspace SIGKILL doesn't allow any handling so minimises the ABI
impact, making it easier to improve the interface later if we come up
with a better idea.
Marc Zyngier [Thu, 12 Aug 2021 19:02:13 +0000 (20:02 +0100)]
arm64: Document the requirement for SCR_EL3.HCE
It is amazing that we never documented this absolutely basic
requirement: if you boot the kernel at EL2, you'd better
enable the HVC instruction from EL3.
Mark Rutland [Mon, 23 Aug 2021 10:12:53 +0000 (11:12 +0100)]
arm64: head: avoid over-mapping in map_memory
The `compute_indices` and `populate_entries` macros operate on inclusive
bounds, and thus the `map_memory` macro which uses them also operates
on inclusive bounds.
We pass `_end` and `_idmap_text_end` to `map_memory`, but these are
exclusive bounds, and if one of these is sufficiently aligned (as a
result of kernel configuration, physical placement, and KASLR), then:
* In `compute_indices`, the computed `iend` will be in the page/block *after*
the final byte of the intended mapping.
* In `populate_entries`, an unnecessary entry will be created at the end
of each level of table. At the leaf level, this entry will map up to
SWAPPER_BLOCK_SIZE bytes of physical addresses that we did not intend
to map.
As we may map up to SWAPPER_BLOCK_SIZE bytes more than intended, we may
violate the boot protocol and map physical address past the 2MiB-aligned
end address we are permitted to map. As we map these with Normal memory
attributes, this may result in further problems depending on what these
physical addresses correspond to.
The final entry at each level may require an additional table at that
level. As EARLY_ENTRIES() calculates an inclusive bound, we allocate
enough memory for this.
Avoid the extraneous mapping by having map_memory convert the exclusive
end address to an inclusive end address by subtracting one, and do
likewise in EARLY_ENTRIES() when calculating the number of required
tables. For clarity, comments are updated to more clearly document which
boundaries the macros operate on. For consistency with the other
macros, the comments in map_memory are also updated to describe `vstart`
and `vend` as virtual addresses.
Fixes: f71b21f6c1e5 ("arm64: Extend early page table code to allow for larger kernels") Cc: <stable@vger.kernel.org> # 4.16.x Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Steve Capper <steve.capper@arm.com> Cc: Will Deacon <will@kernel.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20210823101253.55567-1-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Brown [Mon, 16 Aug 2021 12:50:23 +0000 (13:50 +0100)]
arm64/sve: Add a comment documenting the binutils needed for SVE asm
At some point it would be nice to avoid the need to manually encode SVE
instructions, add a note of the binutils version required to save looking
it up.
Mark Brown [Thu, 12 Aug 2021 20:11:43 +0000 (21:11 +0100)]
arm64/sve: Add some comments for sve_save/load_state()
The use of macros for the actual function bodies means legibility is always
going to be a bit of a challenge, especially while we can't rely on SVE
support in the toolchain, but this helps a little.
Mark Brown [Thu, 19 Aug 2021 13:42:44 +0000 (14:42 +0100)]
kselftest/arm64: signal: Add test case for SVE register state in signals
Currently this doesn't actually verify that the register contents do the
right thing, it just verifes that a SVE context with appropriate size
appears.
Mark Brown [Thu, 19 Aug 2021 13:42:43 +0000 (14:42 +0100)]
kselftest/arm64: signal: Verify that signals can't change the SVE vector length
We do not support changing the SVE vector length as part of signal return,
verify that this is the case if the system supports multiple vector lengths.
Mark Brown [Thu, 19 Aug 2021 13:42:42 +0000 (14:42 +0100)]
kselftest/arm64: signal: Check SVE signal frame shows expected vector length
As a basic check that the SVE signal frame is being set up correctly
verify that the vector length in the signal frame is the vector length
that the process has.
Mark Brown [Thu, 19 Aug 2021 13:42:41 +0000 (14:42 +0100)]
kselftest/arm64: signal: Support signal frames with SVE register data
A signal frame with SVE may validly either be a bare struct sve_context or
a struct sve_context followed by vector length dependent register data.
Support either in the generic helpers for the signal tests, and while we're
at it validate the SVE vector length reported.
Mark Brown [Thu, 19 Aug 2021 16:57:23 +0000 (17:57 +0100)]
kselftest/arm64: pac: Fix skipping of tests on systems without PAC
The PAC tests check to see if the system supports the relevant PAC features
but instead of skipping the tests if they can't be executed they fail the
tests which makes things look like they're not working when they are.
Will Deacon [Fri, 30 Jul 2021 11:24:42 +0000 (12:24 +0100)]
arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores
The scheduler now knows enough about these braindead systems to place
32-bit tasks accordingly, so throw out the safety checks and allow the
ret-to-user path to avoid do_notify_resume() if there is nothing to do.
Will Deacon [Fri, 30 Jul 2021 11:24:40 +0000 (12:24 +0100)]
arm64: Advertise CPUs capable of running 32-bit applications in sysfs
Since 32-bit applications will be killed if they are caught trying to
execute on a 64-bit-only CPU in a mismatched system, advertise the set
of 32-bit capable CPUs to userspace in sysfs.
Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20210730112443.23245-14-will@kernel.org
Will Deacon [Fri, 30 Jul 2021 11:24:39 +0000 (12:24 +0100)]
arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system
If we want to support 32-bit applications, then when we identify a CPU
with mismatched 32-bit EL0 support we must ensure that we will always
have an active 32-bit CPU available to us from then on. This is important
for the scheduler, because is_cpu_allowed() will be constrained to 32-bit
CPUs for compat tasks and forced migration due to a hotplug event will
hang if no 32-bit CPUs are available.
On detecting a mismatch, prevent offlining of either the mismatching CPU
if it is 32-bit capable, or find the first active 32-bit capable CPU
otherwise.
Will Deacon [Fri, 30 Jul 2021 11:24:37 +0000 (12:24 +0100)]
arm64: Implement task_cpu_possible_mask()
Provide an implementation of task_cpu_possible_mask() so that we can
prevent 64-bit-only cores being added to the 'cpus_mask' for compat
tasks on systems with mismatched 32-bit support at EL0,
Will Deacon [Fri, 30 Jul 2021 11:24:36 +0000 (12:24 +0100)]
sched: Introduce dl_task_check_affinity() to check proposed affinity
In preparation for restricting the affinity of a task during execve()
on arm64, introduce a new dl_task_check_affinity() helper function to
give an indication as to whether the restricted mask is admissible for
a deadline task.
Will Deacon [Fri, 30 Jul 2021 11:24:35 +0000 (12:24 +0100)]
sched: Allow task CPU affinity to be restricted on asymmetric systems
Asymmetric systems may not offer the same level of userspace ISA support
across all CPUs, meaning that some applications cannot be executed by
some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
not feature support for 32-bit applications on both clusters.
Although userspace can carefully manage the affinity masks for such
tasks, one place where it is particularly problematic is execve()
because the CPU on which the execve() is occurring may be incompatible
with the new application image. In such a situation, it is desirable to
restrict the affinity mask of the task and ensure that the new image is
entered on a compatible CPU. From userspace's point of view, this looks
the same as if the incompatible CPUs have been hotplugged off in the
task's affinity mask. Similarly, if a subsequent execve() reverts to
a compatible image, then the old affinity is restored if it is still
valid.
In preparation for restricting the affinity mask for compat tasks on
arm64 systems without uniform support for 32-bit applications, introduce
{force,relax}_compatible_cpus_allowed_ptr(), which respectively restrict
and restore the affinity mask for a task based on the compatible CPUs.
Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20210730112443.23245-9-will@kernel.org
Will Deacon [Fri, 30 Jul 2021 11:24:34 +0000 (12:24 +0100)]
sched: Split the guts of sched_setaffinity() into a helper function
In preparation for replaying user affinity requests using a saved mask,
split sched_setaffinity() up so that the initial task lookup and
security checks are only performed when the request is coming directly
from userspace.
Will Deacon [Fri, 30 Jul 2021 11:24:32 +0000 (12:24 +0100)]
sched: Reject CPU affinity changes based on task_cpu_possible_mask()
Reject explicit requests to change the affinity mask of a task via
set_cpus_allowed_ptr() if the requested mask is not a subset of the
mask returned by task_cpu_possible_mask(). This ensures that the
'cpus_mask' for a given task cannot contain CPUs which are incapable of
executing it, except in cases where the affinity is forced.
Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Reviewed-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20210730112443.23245-6-will@kernel.org
Will Deacon [Fri, 30 Jul 2021 11:24:31 +0000 (12:24 +0100)]
cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq()
select_fallback_rq() only needs to recheck for an allowed CPU if the
affinity mask of the task has changed since the last check.
Return a 'bool' from cpuset_cpus_allowed_fallback() to indicate whether
the affinity mask was updated, and use this to elide the allowed check
when the mask has been left alone.
No functional change.
Suggested-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20210730112443.23245-5-will@kernel.org
Will Deacon [Fri, 30 Jul 2021 11:24:30 +0000 (12:24 +0100)]
cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()
Asymmetric systems may not offer the same level of userspace ISA support
across all CPUs, meaning that some applications cannot be executed by
some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
not feature support for 32-bit applications on both clusters.
Modify guarantee_online_cpus() to take task_cpu_possible_mask() into
account when trying to find a suitable set of online CPUs for a given
task. This will avoid passing an invalid mask to set_cpus_allowed_ptr()
during ->attach() and will subsequently allow the cpuset hierarchy to be
taken into account when forcefully overriding the affinity mask for a
task which requires migration to a compatible CPU.
Will Deacon [Fri, 30 Jul 2021 11:24:29 +0000 (12:24 +0100)]
cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1
If the scheduler cannot find an allowed CPU for a task,
cpuset_cpus_allowed_fallback() will widen the affinity to cpu_possible_mask
if cgroup v1 is in use.
In preparation for allowing architectures to provide their own fallback
mask, just return early if we're either using cgroup v1 or we're using
cgroup v2 with a mask that contains invalid CPUs. This will allow
select_fallback_rq() to figure out the mask by itself.
Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Quentin Perret <qperret@google.com> Link: https://lkml.kernel.org/r/20210730112443.23245-3-will@kernel.org
Will Deacon [Fri, 30 Jul 2021 11:24:28 +0000 (12:24 +0100)]
sched: Introduce task_cpu_possible_mask() to limit fallback rq selection
Asymmetric systems may not offer the same level of userspace ISA support
across all CPUs, meaning that some applications cannot be executed by
some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do
not feature support for 32-bit applications on both clusters.
On such a system, we must take care not to migrate a task to an
unsupported CPU when forcefully moving tasks in select_fallback_rq()
in response to a CPU hot-unplug operation.
Introduce a task_cpu_possible_mask() hook which, given a task argument,
allows an architecture to return a cpumask of CPUs that are capable of
executing that task. The default implementation returns the
cpu_possible_mask, since sane machines do not suffer from per-cpu ISA
limitations that affect scheduling. The new mask is used when selecting
the fallback runqueue as a last resort before forcing a migration to the
first active CPU.
Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Reviewed-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20210730112443.23245-2-will@kernel.org
Extending SCHED_IDLE to cgroups means that we incorporate the existing
aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its
descendant threads towards the idle_h_nr_running count of all of its
ancestor cgroups. Thus, sched_idle_rq() will work properly.
Additionally, SCHED_IDLE cgroups are configured with minimum weight.
There are two key differences between the per-task and per-cgroup
SCHED_IDLE interface:
- The cgroup interface allows tasks within a SCHED_IDLE hierarchy to
maintain their relative weights. The entity that is "idle" is the
cgroup, not the tasks themselves.
- Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption
decision is not made by comparing the current task with the woken
task, but rather by comparing their matching sched_entity.
A typical use-case for this is a user that creates an idle and a
non-idle subtree. The non-idle subtree will dominate competition vs
the idle subtree, but the idle subtree will still be high priority vs
other users on the system. The latter is accomplished via comparing
matching sched_entity in the waken preemption path (this could also be
improved by making the sched_idle_rq() decision dependent on the
perspective of a specific task).
For now, we maintain the existing SCHED_IDLE semantics. Future patches
may make improvements that extend how we treat SCHED_IDLE entities.
The per-task_group idle field is an integer that currently only holds
either a 0 or a 1. This is explicitly typed as an integer to allow for
further extensions to this API. For example, a negative value may
indicate a highly latency-sensitive cgroup that should be preferred
for preemption/placement/etc.
sched/topology: Skip updating masks for non-online nodes
The scheduler currently expects NUMA node distances to be stable from
init onwards, and as a consequence builds the related data structures
once-and-for-all at init (see sched_init_numa()).
Unfortunately, on some architectures node distance is unreliable for
offline nodes and may very well change upon onlining.
Skip over offline nodes during sched_init_numa(). Track nodes that have
been onlined at least once, and trigger a build of a node's NUMA masks
when it is first onlined post-init.
Mark Brown [Thu, 19 Aug 2021 17:29:02 +0000 (18:29 +0100)]
kselftest/arm64: mte: Fix misleading output when skipping tests
When skipping the tests due to a lack of system support for MTE we
currently print a message saying FAIL which makes it look like the test
failed even though the test did actually report KSFT_SKIP, creating some
confusion. Change the error message to say SKIP instead so things are
clearer.
Linus Torvalds [Sun, 15 Aug 2021 16:57:43 +0000 (06:57 -1000)]
Merge tag 'powerpc-5.14-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix crashes coming out of nap on 32-bit Book3s (eg. powerbooks).
- Fix critical and debug interrupts on BookE, seen as crashes when
using ptrace.
- Fix an oops when running an SMP kernel on a UP system.
- Update pseries LPAR security flavor after partition migration.
- Fix an oops when using kprobes on BookE.
- Fix oops on 32-bit pmac by not calling do_IRQ() from
timer_interrupt().
- Fix softlockups on CPU hotplug into a CPU-less node with xive (P9).
Thanks to Cédric Le Goater, Christophe Leroy, Finn Thain, Geetika
Moolchandani, Laurent Dufour, Laurent Vivier, Nicholas Piggin, Pu Lehui,
Radu Rendec, Srikar Dronamraju, and Stan Johnson.
* tag 'powerpc-5.14-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/xive: Do not skip CPU-less nodes when creating the IPIs
powerpc/interrupt: Do not call single_step_exception() from other exceptions
powerpc/interrupt: Fix OOPS by not calling do_IRQ() from timer_interrupt()
powerpc/kprobes: Fix kprobe Oops happens in booke
powerpc/pseries: Fix update of LPAR security flavor after LPM
powerpc/smp: Fix OOPS in topology_init()
powerpc/32: Fix critical and debug interrupts on BOOKE
powerpc/32s: Fix napping restore in data storage interrupt (DSI)
Linus Torvalds [Sun, 15 Aug 2021 16:49:40 +0000 (06:49 -1000)]
Merge tag 'irq-urgent-2021-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A set of fixes for PCI/MSI and x86 interrupt startup:
- Mask all MSI-X entries when enabling MSI-X otherwise stale unmasked
entries stay around e.g. when a crashkernel is booted.
- Enforce masking of a MSI-X table entry when updating it, which
mandatory according to speification
- Ensure that writes to MSI[-X} tables are flushed.
- Prevent invalid bits being set in the MSI mask register
- Properly serialize modifications to the mask cache and the mask
register for multi-MSI.
- Cure the violation of the affinity setting rules on X86 during
interrupt startup which can cause lost and stale interrupts. Move
the initial affinity setting ahead of actualy enabling the
interrupt.
- Ensure that MSI interrupts are completely torn down before freeing
them in the error handling case.
- Prevent an array out of bounds access in the irq timings code"
* tag 'irq-urgent-2021-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
driver core: Add missing kernel doc for device::msi_lock
genirq/msi: Ensure deactivation on teardown
genirq/timings: Prevent potential array overflow in __irq_timings_store()
x86/msi: Force affinity setup before startup
x86/ioapic: Force affinity setup before startup
genirq: Provide IRQCHIP_AFFINITY_PRE_STARTUP
PCI/MSI: Protect msi_desc::masked for multi-MSI
PCI/MSI: Use msi_mask_irq() in pci_msi_shutdown()
PCI/MSI: Correct misleading comments
PCI/MSI: Do not set invalid bits in MSI mask
PCI/MSI: Enforce MSI[X] entry updates to be visible
PCI/MSI: Enforce that MSI-X table entry is masked for update
PCI/MSI: Mask all unused MSI-X entries
PCI/MSI: Enable and mask MSI-X early
Linus Torvalds [Sun, 15 Aug 2021 16:46:04 +0000 (06:46 -1000)]
Merge tag 'locking_urgent_for_v5.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Borislav Petkov:
- Fix a CONFIG symbol's spelling
* tag 'locking_urgent_for_v5.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rtmutex: Use the correct rtmutex debugging config option
Linus Torvalds [Sun, 15 Aug 2021 16:38:26 +0000 (06:38 -1000)]
Merge tag 'efi_urgent_for_v5.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI fixes from Borislav Petkov:
"A batch of fixes for the arm64 stub image loader:
- fix a logic bug that can make the random page allocator fail
spuriously
- force reallocation of the Image when it overlaps with firmware
reserved memory regions
- fix an oversight that defeated on optimization introduced earlier
where images loaded at a suitable offset are never moved if booting
without randomization
- complain about images that were not loaded at the right offset by
the firmware image loader"
* tag 'efi_urgent_for_v5.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/libstub: arm64: Double check image alignment at entry
efi/libstub: arm64: Warn when efi_random_alloc() fails
efi/libstub: arm64: Relax 2M alignment again for relocatable kernels
efi/libstub: arm64: Force Image reallocation if BSS was not reserved
arm64: efi: kaslr: Fix occasional random alloc (and boot) failure
Linus Torvalds [Sun, 15 Aug 2021 16:30:24 +0000 (06:30 -1000)]
Merge tag 'x86_urgent_for_v5.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"Two fixes:
- An objdump checker fix to ignore parenthesized strings in the
objdump version
- Fix resctrl default monitoring groups reporting when new subgroups
get created"
* tag 'x86_urgent_for_v5.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Fix default monitoring groups reporting
x86/tools: Fix objdump version check again
Linus Torvalds [Sun, 15 Aug 2021 16:21:30 +0000 (06:21 -1000)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"ARM:
- Plug race between enabling MTE and creating vcpus
- Fix off-by-one bug when checking whether an address range is RAM
x86:
- Fixes for the new MMU, especially a memory leak on hosts with <39
physical address bits
- Remove bogus EFER.NX checks on 32-bit non-PAE hosts
- WAITPKG fix"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock
KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs
KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs
KVM: nVMX: Use vmx_need_pf_intercept() when deciding if L0 wants a #PF
kvm: vmx: Sync all matching EPTPs when injecting nested EPT fault
KVM: x86: remove dead initialization
KVM: x86: Allow guest to set EFER.NX=1 on non-PAE 32-bit kernels
KVM: VMX: Use current VMCS to query WAITPKG support for MSR emulation
KVM: arm64: Fix race when enabling KVM_ARM_CAP_MTE
KVM: arm64: Fix off-by-one in range_is_memory
Linus Torvalds [Sun, 15 Aug 2021 05:46:39 +0000 (19:46 -1000)]
Merge tag 'libnvdimm-fixes-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
"A couple of fixes for long standing bugs, a warning fixup, and some
miscellaneous dax cleanups.
The bugs were recently found due to new platforms looking to use the
ACPI NFIT "virtual" device definition, and new error injection
capabilities to trigger error responses to label area requests. Ira's
cleanups have been long pending, I neglected to send them earlier, and
see no harm in including them now. This has all appeared in -next with
no reported issues.
Summary:
- Fix support for NFIT "virtual" ranges (BIOS-defined memory disks)
- Fix recovery from failed label storage areas on NVDIMM devices
- Miscellaneous cleanups from Ira's investigation of
dax_direct_access paths preparing for stray-write protection"
* tag 'libnvdimm-fixes-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
tools/testing/nvdimm: Fix missing 'fallthrough' warning
libnvdimm/region: Fix label activation vs errors
ACPI: NFIT: Fix support for virtual SPA ranges
dax: Ensure errno is returned from dax_direct_access
fs/dax: Clarify nr_pages to dax_direct_access()
fs/fuse: Remove unneeded kaddr parameter
Linus Torvalds [Sun, 15 Aug 2021 05:22:33 +0000 (19:22 -1000)]
Merge tag 'usb-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fix from Greg KH:
"A single revert of a commit that caused problems in 5.14-rc5 for
5.14-rc6. It has been in linux-next almost all week, and has resolved
the issues that were reported on lots of different systems that were
not the platform that the change was originally tested on (gotta love
SoC cores used in multiple devices from multiple vendors...)"
* tag 'usb-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
Revert "usb: dwc3: gadget: Use list_replace_init() before traversing lists"
Linus Torvalds [Sun, 15 Aug 2021 05:16:30 +0000 (19:16 -1000)]
Merge tag 'staging-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull IIO driver fixes from Greg KH:
"Here are some small IIO driver fixes for reported problems for
5.14-rc6 (no staging driver fixes at the moment).
All of them resolve reported issues and have been in linux-next all
week with no reported problems. Full details are in the shortlog"
* tag 'staging-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
iio: adc: Fix incorrect exit of for-loop
iio: humidity: hdc100x: Add margin to the conversion time
dt-bindings: iio: st: Remove wrong items length check
iio: accel: fxls8962af: fix i2c dependency
iio: adis: set GPIO reset pin direction
iio: adc: ti-ads7950: Ensure CS is deasserted after reading channels
iio: accel: fxls8962af: fix potential use of uninitialized symbol
Linus Torvalds [Sun, 15 Aug 2021 04:59:53 +0000 (18:59 -1000)]
Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"One driver bugfix, a documentation bugfix, and an "uninitialized data"
leak fix for the core"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
Documentation: i2c: add i2c-sysfs into index
i2c: dev: zero out array used for i2c reads from userspace
i2c: iproc: fix race between client unreg and tasklet
Linus Torvalds [Sat, 14 Aug 2021 16:31:22 +0000 (06:31 -1000)]
Merge tag 'for-linus-5.14-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"A small cleanup patch and a fix of a rare race in the Xen evtchn
driver"
* tag 'for-linus-5.14-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/events: Fix race in set_evtchn_to_irq
xen/events: remove redundant initialization of variable irq
Linus Torvalds [Sat, 14 Aug 2021 16:28:19 +0000 (06:28 -1000)]
Merge tag 'riscv-for-linus-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- avoid passing -mno-relax to compilers that don't support it
- a comment fix
* tag 'riscv-for-linus-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Fix comment regarding kernel mapping overlapping with IS_ERR_VALUE
riscv: kexec: do not add '-mno-relax' flag if compiler doesn't support it
Linus Torvalds [Sat, 14 Aug 2021 01:05:23 +0000 (15:05 -1000)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"7 patches.
Subsystems affected by this patch series: mm (kasan, mm/slub,
mm/madvise, and memcg), and lib"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
lib: use PFN_PHYS() in devmem_is_allowed()
mm/memcg: fix incorrect flushing of lruvec data in obj_stock
mm/madvise: report SIGBUS as -EFAULT for MADV_POPULATE_(READ|WRITE)
mm: slub: fix slub_debug disabling for list of slabs
slub: fix kmalloc_pagealloc_invalid_free unit test
kasan, slub: reset tag when printing address
kasan, kmemleak: reset tags when scanning block
Linus Torvalds [Sat, 14 Aug 2021 00:44:32 +0000 (14:44 -1000)]
Merge tag '5.14-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Four CIFS/SMB3 Fixes, all for stable, two relating to deferred close,
and one for the 'modefromsid' mount option (when 'idsfromsid' not
specified)"
* tag '5.14-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Call close synchronously during unlink/rename/lease break.
cifs: Handle race conditions during rename
cifs: use the correct max-length for dentry_path_raw()
cifs: create sd context must be a multiple of 8
Linus Torvalds [Sat, 14 Aug 2021 00:32:38 +0000 (14:32 -1000)]
Merge tag 'linux-kselftest-fixes-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull Kselftest fix from Shuah Khan:
"A single patch to sgx test to fix Q1 and Q2 calculation"
* tag 'linux-kselftest-fixes-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/sgx: Fix Q1 and Q2 calculation in sigstruct.c
Liang Wang [Fri, 13 Aug 2021 23:54:45 +0000 (16:54 -0700)]
lib: use PFN_PHYS() in devmem_is_allowed()
The physical address may exceed 32 bits on 32-bit systems with more than
32 bits of physcial address. Use PFN_PHYS() in devmem_is_allowed(), or
the physical address may overflow and be truncated.
We found this bug when mapping a high addresses through devmem tool,
when CONFIG_STRICT_DEVMEM is enabled on the ARM with ARM_LPAE and devmem
is used to map a high address that is not in the iomem address range, an
unexpected error indicating no permission is returned.
This bug was initially introduced from v2.6.37, and the function was
moved to lib in v5.11.
Link: https://lkml.kernel.org/r/20210731025057.78825-1-wangliang101@huawei.com Fixes: dc33085d6c82 ("ARM: implement CONFIG_STRICT_DEVMEM by disabling access to RAM via /dev/mem") Fixes: 5ea6e0ec50c1 ("lib: Add a generic version of devmem_is_allowed()") Signed-off-by: Liang Wang <wangliang101@huawei.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Palmer Dabbelt <palmerdabbelt@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Liang Wang <wangliang101@huawei.com> Cc: Xiaoming Ni <nixiaoming@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [2.6.37+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Waiman Long [Fri, 13 Aug 2021 23:54:41 +0000 (16:54 -0700)]
mm/memcg: fix incorrect flushing of lruvec data in obj_stock
When mod_objcg_state() is called with a pgdat that is different from
that in the obj_stock, the old lruvec data cached in obj_stock are
flushed out. Unfortunately, they were flushed to the new pgdat and so
the data go to the wrong node. This will screw up the slab data
reported in /sys/devices/system/node/node*/meminfo.
Fix that by flushing the data to the cached pgdat instead.
Link: https://lkml.kernel.org/r/20210802143834.30578-1-longman@redhat.com Fixes: 661c91113934 ("mm/memcg: cache vmstat data in percpu memcg_stock_pcp") Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Chris Down <chris@chrisdown.name> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/madvise: report SIGBUS as -EFAULT for MADV_POPULATE_(READ|WRITE)
Doing some extended tests and polishing the man page update for
MADV_POPULATE_(READ|WRITE), I realized that we end up converting also
SIGBUS (via -EFAULT) to -EINVAL, making it look like yet another
madvise() user error.
We want to report only problematic mappings and permission problems that
the user could have know as -EINVAL.
Let's not convert -EFAULT arising due to SIGBUS (or SIGSEGV) to -EINVAL,
but instead indicate -EFAULT to user space. While we could also convert
it to -ENOMEM, using -EFAULT looks more helpful when user space might
want to troubleshoot what's going wrong: MADV_POPULATE_(READ|WRITE) is
not part of an final Linux release and we can still adjust the behavior.
Link: https://lkml.kernel.org/r/20210726154932.102880-1-david@redhat.com Fixes: 9e93c56a5f61 ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables") Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Fri, 13 Aug 2021 23:54:34 +0000 (16:54 -0700)]
mm: slub: fix slub_debug disabling for list of slabs
Vijayanand Jitta reports:
Consider the scenario where CONFIG_SLUB_DEBUG_ON is set and we would
want to disable slub_debug for few slabs. Using boot parameter with
slub_debug=-,slab_name syntax doesn't work as expected i.e; only
disabling debugging for the specified list of slabs. Instead it
disables debugging for all slabs, which is wrong.
This patch fixes it by delaying the moment when the global slub_debug
flags variable is updated. In case a "slub_debug=-,slab_name" has been
passed, the global flags remain as initialized (depending on
CONFIG_SLUB_DEBUG_ON enabled or disabled) and are not simply reset to 0.
Link: https://lkml.kernel.org/r/8a3d992a-473a-467b-28a0-4ad2ff60ab82@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Vijayanand Jitta <vjitta@codeaurora.org> Reviewed-by: Vijayanand Jitta <vjitta@codeaurora.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shakeel Butt [Fri, 13 Aug 2021 23:54:31 +0000 (16:54 -0700)]
slub: fix kmalloc_pagealloc_invalid_free unit test
The unit test kmalloc_pagealloc_invalid_free makes sure that for the
higher order slub allocation which goes to page allocator, the free is
called with the correct address i.e. the virtual address of the head
page.
Commit 1d037af8bf4d ("slub: fix unreclaimable slab stat for bulk free")
unified the free code paths for page allocator based slub allocations
but instead of using the address passed by the caller, it extracted the
address from the page. Thus making the unit test
kmalloc_pagealloc_invalid_free moot. So, fix this by using the address
passed by the caller.
Should we fix this? I think yes because dev expect kasan to catch these
type of programming bugs.
Link: https://lkml.kernel.org/r/20210802180819.1110165-1-shakeelb@google.com Fixes: 1d037af8bf4d ("slub: fix unreclaimable slab stat for bulk free") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kuan-Ying Lee [Fri, 13 Aug 2021 23:54:27 +0000 (16:54 -0700)]
kasan, slub: reset tag when printing address
The address still includes the tags when it is printed. With hardware
tag-based kasan enabled, we will get a false positive KASAN issue when
we access metadata.
Reset the tag before we access the metadata.
Link: https://lkml.kernel.org/r/20210804090957.12393-3-Kuan-Ying.Lee@mediatek.com Fixes: 7815ec9f5f76 ("kasan, mm: reset tags when accessing metadata") Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Nicholas Tang <nicholas.tang@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kuan-Ying Lee [Fri, 13 Aug 2021 23:54:24 +0000 (16:54 -0700)]
kasan, kmemleak: reset tags when scanning block
Patch series "kasan, slub: reset tag when printing address", v3.
With hardware tag-based kasan enabled, we reset the tag when we access
metadata to avoid from false alarm.
This patch (of 2):
Kmemleak needs to scan kernel memory to check memory leak. With hardware
tag-based kasan enabled, when it scans on the invalid slab and
dereference, the issue will occur as below.
Hardware tag-based KASAN doesn't use compiler instrumentation, we can not
use kasan_disable_current() to ignore tag check.
Based on the below report, there are 11 0xf7 granules, which amounts to
176 bytes, and the object is allocated from the kmalloc-256 cache. So
when kmemleak accesses the last 256-176 bytes, it causes faults, as those
are marked with KASAN_KMALLOC_REDZONE == KASAN_TAG_INVALID == 0xfe.
Thus, we reset tags before accessing metadata to avoid from false positives.
BUG: KASAN: out-of-bounds in scan_block+0x58/0x170
Read at addr f7ff0000c0074eb0 by task kmemleak/138
Pointer tag: [f7], memory tag: [fe]
Memory state around the buggy address: ffff0000c0074c00: f0 f0 f0 f0 f0 f0 f0 f0 f0 fe fe fe fe fe fe fe ffff0000c0074d00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
>ffff0000c0074e00: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 fe fe fe fe fe
^ ffff0000c0074f00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe ffff0000c0075000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Disabling lock debugging due to kernel taint
kmemleak: 181 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
Link: https://lkml.kernel.org/r/20210804090957.12393-1-Kuan-Ying.Lee@mediatek.com Link: https://lkml.kernel.org/r/20210804090957.12393-2-Kuan-Ying.Lee@mediatek.com Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Cc: Nicholas Tang <nicholas.tang@mediatek.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 13 Aug 2021 23:36:42 +0000 (13:36 -1000)]
Merge tag 'block-5.14-2021-08-13' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A few fixes for block that should go into 5.14:
- Revert the mq-deadline cgroup addition. More work is needed on this
front, let's revert it for now and get it right before having it in
a released kernel (Tejun)
- blk-iocost lockdep fix (Ming)
- nbd double completion fix (Xie)
- Fix for non-idling when clearing the shared tag flag (Yu)"
* tag 'block-5.14-2021-08-13' of git://git.kernel.dk/linux-block:
nbd: Aovid double completion of a request
blk-mq: clear active_queues before clearing BLK_MQ_F_TAG_QUEUE_SHARED
Revert "block/mq-deadline: Add cgroup support"
blk-iocost: fix lockdep warning on blkcg->lock
Linus Torvalds [Fri, 13 Aug 2021 23:25:08 +0000 (13:25 -1000)]
Merge tag 'io_uring-5.14-2021-08-13' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"A bit bigger than the previous weeks, but mostly just a few stable
bound fixes. In detail:
- Followup fixes to patches from last week for io-wq, turns out they
weren't complete (Hao)
- Two lockdep reported fixes out of the RT camp (me)
- Sync the io_uring-cp example with liburing, as a few bug fixes
never made it to the kernel carried version (me)
- SQPOLL related TIF_NOTIFY_SIGNAL fix (Nadav)
- Use WRITE_ONCE() when writing sq flags (Nadav)
- io_rsrc_put_work() deadlock fix (Pavel)"
* tag 'io_uring-5.14-2021-08-13' of git://git.kernel.dk/linux-block:
tools/io_uring/io_uring-cp: sync with liburing example
io_uring: fix ctx-exit io_rsrc_put_work() deadlock
io_uring: drop ctx->uring_lock before flushing work item
io-wq: fix IO_WORKER_F_FIXED issue in create_io_worker()
io-wq: fix bug of creating io-wokers unconditionally
io_uring: rsrc ref lock needs to be IRQ safe
io_uring: Use WRITE_ONCE() when writing to sq_flags
io_uring: clear TIF_NOTIFY_SIGNAL when running task work
Linus Torvalds [Fri, 13 Aug 2021 22:41:45 +0000 (12:41 -1000)]
Merge tag 'pinctrl-v5.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
"An assortment of pin control fixes of varying importance, the most
important ones affecting Intel and AMD laptops turned up the recent
few days so it's time to push this to your tree.
- Fix the Kconfig dependency for Qualcomm SM8350 pin controller
- Fix pin biasing fallback behaviour on the Mediatek pin controller
- Fix the GPIO numbering scheme for Intel Tiger Lake-H to correspond
to the products that are now actually out on the market
- Fix a pin control function itemization in the Sunxi driver
out-of-bounds access bug
- Fix disable clocking for the RISC-V K210 pin controller on the
errorpath
- Fix a system shutdown bug affecting AMD Ryzen-based laptops, the
system would not suspend but just bounce back up"
* tag 'pinctrl-v5.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: amd: Fix an issue with shutdown when system set to s0ix
pinctrl: k210: Fix k210_fpioa_probe()
pinctrl: sunxi: Don't underestimate number of functions
pinctrl: tigerlake: Fix GPIO mapping for newer version of software
pinctrl: mediatek: Fix fallback behavior for bias_set_combo
pinctrl: qcom: fix GPIOLIB dependencies
Xie Yongji [Fri, 13 Aug 2021 15:13:30 +0000 (23:13 +0800)]
nbd: Aovid double completion of a request
There is a race between iterating over requests in
nbd_clear_que() and completing requests in recv_work(),
which can lead to double completion of a request.
To fix it, flush the recv worker before iterating over
the requests and don't abort the completed request
while iterating.
blk-mq: clear active_queues before clearing BLK_MQ_F_TAG_QUEUE_SHARED
We run a test that delete and recover devcies frequently(two devices on
the same host), and we found that 'active_queues' is super big after a
period of time.
If device a and device b share a tag set, and a is deleted, then
blk_mq_exit_queue() will clear BLK_MQ_F_TAG_QUEUE_SHARED because there
is only one queue that are using the tag set. However, if b is still
active, the active_queues of b might never be cleared even if b is
deleted.
Thus clear active_queues before BLK_MQ_F_TAG_QUEUE_SHARED is cleared.
Thomas Gleixner [Fri, 13 Aug 2021 10:36:14 +0000 (12:36 +0200)]
driver core: Add missing kernel doc for device::msi_lock
Fixes: 2548434fa5d1 ("PCI/MSI: Protect msi_desc::masked for multi-MSI") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock
Add yet another spinlock for the TDP MMU and take it when marking indirect
shadow pages unsync. When using the TDP MMU and L1 is running L2(s) with
nested TDP, KVM may encounter shadow pages for the TDP entries managed by
L1 (controlling L2) when handling a TDP MMU page fault. The unsync logic
is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
which runs with mmu_lock held for read, not write.
Lack of a critical section manifests most visibly as an underflow of
unsync_children in clear_unsync_child_bit() due to unsync_children being
corrupted when multiple CPUs write it without a critical section and
without atomic operations. But underflow is the best case scenario. The
worst case scenario is that unsync_children prematurely hits '0' and
leads to guest memory corruption due to KVM neglecting to properly sync
shadow pages.
Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
would functionally be ok. Usurping the lock could degrade performance when
building upper level page tables on different vCPUs, especially since the
unsync flow could hold the lock for a comparatively long time depending on
the number of indirect shadow pages and the depth of the paging tree.
For simplicity, take the lock for all MMUs, even though KVM could fairly
easily know that mmu_lock is held for write. If mmu_lock is held for
write, there cannot be contention for the inner spinlock, and marking
shadow pages unsync across multiple vCPUs will be slow enough that
bouncing the kvm_arch cacheline should be in the noise.
Note, even though L2 could theoretically be given access to its own EPT
entries, a nested MMU must hold mmu_lock for write and thus cannot race
against a TDP MMU page fault. I.e. the additional spinlock only _needs_ to
be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
that is running with the TDP MMU enabled. Holding mmu_lock for read also
prevents the indirect shadow page from being freed. But as above, keep
it simple and always take the lock.
Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
effectively disable unsync behavior for nested TDP. Write protecting leaf
shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
VMMs typically don't modify TDP entries, but the same may not hold true for
non-standard use cases and/or VMMs that are migrating physical pages (from
L1's perspective).
Alternative #2, the unsync logic could be made thread safe. In theory,
simply converting all relevant kvm_mmu_page fields to atomics and using
atomic bitops for the bitmap would suffice. However, (a) an in-depth audit
would be required, (b) the code churn would be substantial, and (c) legacy
shadow paging would incur additional atomic operations in performance
sensitive paths for no benefit (to legacy shadow paging).
Fixes: 804fe1e3a0fc ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU") Cc: stable@vger.kernel.org Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181815.3378104-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs
Set the min_level for the TDP iterator at the root level when zapping all
SPTEs to optimize the iterator's try_step_down(). Zapping a non-leaf
SPTE will recursively zap all its children, thus there is no need for the
iterator to attempt to step down. This avoids rereading the top-level
SPTEs after they are zapped by causing try_step_down() to short-circuit.
In most cases, optimizing try_step_down() will be in the noise as the cost
of zapping SPTEs completely dominates the overall time. The optimization
is however helpful if the zap occurs with relatively few SPTEs, e.g. if KVM
is zapping in response to multiple memslot updates when userspace is adding
and removing read-only memslots for option ROMs. In that case, the task
doing the zapping likely isn't a vCPU thread, but it still holds mmu_lock
for read and thus can be a noisy neighbor of sorts.
Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181414.3376143-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs
Pass "all ones" as the end GFN to signal "zap all" for the TDP MMU and
really zap all SPTEs in this case. As is, zap_gfn_range() skips non-leaf
SPTEs whose range exceeds the range to be zapped. If shadow_phys_bits is
not aligned to the range size of top-level SPTEs, e.g. 512gb with 4-level
paging, the "zap all" flows will skip top-level SPTEs whose range extends
beyond shadow_phys_bits and leak their SPs when the VM is destroyed.
Use the current upper bound (based on host.MAXPHYADDR) to detect that the
caller wants to zap all SPTEs, e.g. instead of using the max theoretical
gfn, 1 << (52 - 12). The more precise upper bound allows the TDP iterator
to terminate its walk earlier when running on hosts with MAXPHYADDR < 52.
Add a WARN on kmv->arch.tdp_mmu_pages when the TDP MMU is destroyed to
help future debuggers should KVM decide to leak SPTEs again.
The bug is most easily reproduced by running (and unloading!) KVM in a
VM whose host.MAXPHYADDR < 39, as the SPTE for gfn=0 will be skipped.
Fixes: 56d8cae20031 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU") Cc: stable@vger.kernel.org Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181414.3376143-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: nVMX: Use vmx_need_pf_intercept() when deciding if L0 wants a #PF
Use vmx_need_pf_intercept() when determining if L0 wants to handle a #PF
in L2 or if the VM-Exit should be forwarded to L1. The current logic fails
to account for the case where #PF is intercepted to handle
guest.MAXPHYADDR < host.MAXPHYADDR and ends up reflecting all #PFs into
L1. At best, L1 will complain and inject the #PF back into L2. At
worst, L1 will eat the unexpected fault and cause L2 to hang on infinite
page faults.
Note, while the bug was technically introduced by the commit that added
support for the MAXPHYADDR madness, the shame is all on commit 11d5d2cf17ce ("KVM: VMX: introduce vmx_need_pf_intercept").
Fixes: e6c9cf38a9a6 ("KVM: VMX: Add guest physical address check in EPT violation and misconfig") Cc: stable@vger.kernel.org Cc: Peter Shier <pshier@google.com> Cc: Oliver Upton <oupton@google.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812045615.3167686-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Junaid Shahid [Fri, 6 Aug 2021 22:22:29 +0000 (15:22 -0700)]
kvm: vmx: Sync all matching EPTPs when injecting nested EPT fault
When a nested EPT violation/misconfig is injected into the guest,
the shadow EPT PTEs associated with that address need to be synced.
This is done by kvm_inject_emulated_page_fault() before it calls
nested_ept_inject_page_fault(). However, that will only sync the
shadow EPT PTE associated with the current L1 EPTP. Since the ASID
is based on EP4TA rather than the full EPTP, so syncing the current
EPTP is not enough. The SPTEs associated with any other L1 EPTPs
in the prev_roots cache with the same EP4TA also need to be synced.
Signed-off-by: Junaid Shahid <junaids@google.com>
Message-Id: <20210806222229.1645356-1-junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Allow guest to set EFER.NX=1 on non-PAE 32-bit kernels
Remove an ancient restriction that disallowed exposing EFER.NX to the
guest if EFER.NX=0 on the host, even if NX is fully supported by the CPU.
The motivation of the check, added by commit 23b51503cba3 ("KVM: VMX:
Avoid saving and restoring msr_efer on lightweight vmexit"), was to rule
out the case of host.EFER.NX=0 and guest.EFER.NX=1 so that KVM could run
the guest with the host's EFER.NX and thus avoid context switching EFER
if the only divergence was the NX bit.
Fast forward to today, and KVM has long since stopped running the guest
with the host's EFER.NX. Not only does KVM context switch EFER if
host.EFER.NX=1 && guest.EFER.NX=0, KVM also forces host.EFER.NX=0 &&
guest.EFER.NX=1 when using shadow paging (to emulate SMEP). Furthermore,
the entire motivation for the restriction was made obsolete over a decade
ago when Intel added dedicated host and guest EFER fields in the VMCS
(Nehalem timeframe), which reduced the overhead of context switching EFER
from 400+ cycles (2 * WRMSR + 1 * RDMSR) to a mere ~2 cycles.
In practice, the removed restriction only affects non-PAE 32-bit kernels,
as EFER.NX is set during boot if NX is supported and the kernel will use
PAE paging (32-bit or 64-bit), regardless of whether or not the kernel
will actually use NX itself (mark PTEs non-executable).
Alternatively and/or complementarily, startup_32_smp() in head_32.S could
be modified to set EFER.NX=1 regardless of paging mode, thus eliminating
the scenario where NX is supported but not enabled. However, that runs
the risk of breaking non-KVM non-PAE kernels (though the risk is very,
very low as there are no known EFER.NX errata), and also eliminates an
easy-to-use mechanism for stressing KVM's handling of guest vs. host EFER
across nested virtualization transitions.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210805183804.1221554-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Linus Torvalds [Fri, 13 Aug 2021 02:24:03 +0000 (16:24 -1000)]
Merge tag 'net-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes, including fixes from netfilter, bpf, can and
ieee802154.
The size of this is pretty normal, but we got more fixes for 5.14
changes this week than last week. Nothing major but the trend is the
opposite of what we like. We'll see how the next week goes..
Current release - regressions:
- r8169: fix ASPM-related link-up regressions
- bridge: fix flags interpretation for extern learn fdb entries
- phy: micrel: fix link detection on ksz87xx switch
- Revert "tipc: Return the correct errno code"
- ptp: fix possible memory leak caused by invalid cast
Current release - new code bugs:
- bpf: add missing bpf_read_[un]lock_trace() for syscall program
- bpf: fix potentially incorrect results with bpf_get_local_storage()
- page_pool: mask the page->signature before the checking, avoid dma
mapping leaks
- netfilter: nfnetlink_hook: 5 fixes to information in netlink dumps
- bnxt_en: fix firmware interface issues with PTP
- mlx5: Bridge, fix ageing time
Previous releases - regressions:
- linkwatch: fix failure to restore device state across
suspend/resume
- bareudp: fix invalid read beyond skb's linear data
Previous releases - always broken:
- bpf: fix integer overflow involving bucket_size
- ppp: fix issues when desired interface name is specified via
netlink
- wwan: mhi_wwan_ctrl: fix possible deadlock
- dsa: microchip: ksz8795: fix number of VLAN related bugs
- dsa: drivers: fix broken backpressure in .port_fdb_dump
- dsa: qca: ar9331: make proper initial port defaults
Misc:
- bpf: add lockdown check for probe_write_user helper
- netfilter: conntrack: remove offload_pickup sysctl before 5.14 is
out
- netfilter: conntrack: collect all entries in one cycle,
heuristically slow down garbage collection scans on idle systems to
prevent frequent wake ups"
* tag 'net-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits)
vsock/virtio: avoid potential deadlock when vsock device remove
wwan: core: Avoid returning NULL from wwan_create_dev()
net: dsa: sja1105: unregister the MDIO buses during teardown
Revert "tipc: Return the correct errno code"
net: mscc: Fix non-GPL export of regmap APIs
net: igmp: increase size of mr_ifc_count
MAINTAINERS: switch to my OMP email for Renesas Ethernet drivers
tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B packets
net: pcs: xpcs: fix error handling on failed to allocate memory
net: linkwatch: fix failure to restore device state across suspend/resume
net: bridge: fix memleak in br_add_if()
net: switchdev: zero-initialize struct switchdev_notifier_fdb_info emitted by drivers towards the bridge
net: bridge: fix flags interpretation for extern learn fdb entries
net: dsa: sja1105: fix broken backpressure in .port_fdb_dump
net: dsa: lantiq: fix broken backpressure in .port_fdb_dump
net: dsa: lan9303: fix broken backpressure in .port_fdb_dump
net: dsa: hellcreek: fix broken backpressure in .port_fdb_dump
bpf, core: Fix kernel-doc notation
net: igmp: fix data-race in igmp_ifc_timer_expire()
net: Fix memory leak in ieee802154_raw_deliver
...
Linus Torvalds [Fri, 13 Aug 2021 02:16:01 +0000 (16:16 -1000)]
Merge tag 'ceph-for-5.14-rc6' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A patch to avoid a soft lockup in ceph_check_delayed_caps() from Luis
and a reference handling fix from Jeff that should address some memory
corruption reports in the snaprealm area.
Both marked for stable"
* tag 'ceph-for-5.14-rc6' of git://github.com/ceph/ceph-client:
ceph: take snap_empty_lock atomically with snaprealm refcount change
ceph: reduce contention in ceph_check_delayed_caps()
i915:
- GVT fix for Windows VM hang.
- Display fix of 12 BPC bits for display 12 and newer.
- Don't try to access some media register for fused off domains.
- Fix kerneldoc build warnings.
* tag 'drm-fixes-2021-08-13' of git://anongit.freedesktop.org/drm/drm:
drm/doc/rfc: drop lmem uapi section
drm/i915: Only access SFC_DONE when media domain is not fused off
drm/i915/display: Fix the 12 BPC bits for PIPE_MISC reg
drm/amd/display: use GFP_ATOMIC in amdgpu_dm_irq_schedule_work
drm/amd/display: Remove invalid assert for ODM + MPC case
drm/amd/pm: bug fix for the runtime pm BACO
drm/amdgpu: handle VCN instances when harvesting (v2)
drm/meson: fix colour distortion from HDR set during vendor u-boot
drm/i915/gvt: Fix cached atomics setting for Windows VM
drm/amdgpu: Add preferred mode in modeset when freesync video mode's enabled.
drm/amd/pm: Fix a memory leak in an error handling path in 'vangogh_tables_init()'
drm/amdgpu: don't enable baco on boco platforms in runpm
drm/amdgpu: set RAS EEPROM address from VBIOS
drm/amd/pm: update smu v13.0.1 firmware header
drm/mediatek: Fix cursor plane no update
drm/mediatek: mtk-dpi: Set out_fmt from config if not the last bridge
drm/mediatek: dpi: Fix NULL dereference in mtk_dpi_bridge_atomic_check
Dave Airlie [Thu, 12 Aug 2021 20:29:12 +0000 (06:29 +1000)]
Merge tag 'drm-intel-fixes-2021-08-12' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
- GVT fix for Windows VM hang.
- Display fix of 12 BPC bits for display 12 and newer.
- Don't try to access some media register for fused off domains.
- Fix kerneldoc build warnings.
Jakub Kicinski [Thu, 12 Aug 2021 18:50:16 +0000 (11:50 -0700)]
Merge tag 'ieee802154-for-davem-2021-08-12' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan
Stefan Schmidt says:
====================
ieee802154 for net 2021-08-12
Mostly fixes coming from bot reports. Dongliang Mu tackled some syzkaller
reports in hwsim again and Takeshi Misawa a memory leak in ieee802154 raw.
* tag 'ieee802154-for-davem-2021-08-12' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan:
net: Fix memory leak in ieee802154_raw_deliver
ieee802154: hwsim: fix GPF in hwsim_new_edge_nl
ieee802154: hwsim: fix GPF in hwsim_set_edge_lqi
====================
Babu Moger [Mon, 2 Aug 2021 19:38:58 +0000 (14:38 -0500)]
x86/resctrl: Fix default monitoring groups reporting
Creating a new sub monitoring group in the root /sys/fs/resctrl leads to
getting the "Unavailable" value for mbm_total_bytes and mbm_local_bytes
on the entire filesystem.
When a new monitoring group is created, a new RMID is assigned to the
new group. But the RMID is not active yet. When the events are read on
the new RMID, it is expected to report the status as "Unavailable".
When the user reads the events on the default monitoring group with
multiple subgroups, the events on all subgroups are consolidated
together. Currently, if any of the RMID reads report as "Unavailable",
then everything will be reported as "Unavailable".
Fix the issue by discarding the "Unavailable" reads and reporting all
the successful RMID reads. This is not a problem on Intel systems as
Intel reports 0 on Inactive RMIDs.
lock_sock() may do initiative schedule when the 'sk' is owned by
other thread at the same time, we would receivce a warning message
that "scheduling while atomic".
Even worse, if the next task (selected by the scheduler) try to
release a 'sk', it need to request vsock_table_lock and the deadlock
occur, cause the system into softlockup state.
Call trace:
queued_spin_lock_slowpath
vsock_remove_bound
vsock_remove_sock
virtio_transport_release
__vsock_release
vsock_release
__sock_release
sock_close
__fput
____fput
So we should not require sk_lock in this case, just like the behavior
in vhost_vsock or vmci.
Linus Torvalds [Thu, 12 Aug 2021 17:20:16 +0000 (07:20 -1000)]
Merge branch 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ucounts fix from Eric Biederman:
"This fixes the ucount sysctls on big endian architectures.
The counts were expanded to be longs instead of ints, and the sysctl
code was overlooked, so only the low 32bit were being processed. On
litte endian just processing the low 32bits is fine, but on 64bit big
endian processing just the low 32bits results in the high order bits
instead of the low order bits being processed and nothing works
proper.
This change took a little bit to mature as we have the SYSCTL_ZERO,
and SYSCTL_INT_MAX macros that are only usable for sysctls operating
on ints, but unfortunately are not obviously broken. Which resulted in
the versions of this change working on big endian and not on little
endian, because the int SYSCTL_ZERO when extended 64bit wound up being
0x100000000. So we only allowed values greater than 0x100000000 and
less than 0faff. Which unfortunately broken everything that tried to
set the sysctls. (First reported with the windows subsystem for
linux).
I have tested this on x86_64 64bit after first reproducing the
problems with the earlier version of this change, and then verifying
the problems do not exist when we use appropriate long min and max
values for extra1 and extra2"
* 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ucounts: add missing data type changes
Andy Shevchenko [Wed, 11 Aug 2021 12:48:45 +0000 (15:48 +0300)]
wwan: core: Avoid returning NULL from wwan_create_dev()
Make wwan_create_dev() to return either valid or error pointer,
In some cases it may return NULL. Prevent this by converting
it to the respective error pointer.
Fixes: 079212ec2776 ("net: Add a WWAN subsystem") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Reviewed-by: Loic Poulain <loic.poulain@linaro.org> Link: https://lore.kernel.org/r/20210811124845.10955-1-andriy.shevchenko@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cifs: Call close synchronously during unlink/rename/lease break.
During unlink/rename/lease break, deferred work for close is
scheduled immediately but in an asynchronous manner which might
lead to race with actual(unlink/rename) commands.
This change will schedule close synchronously which will avoid
the race conditions with other commands.
Signed-off-by: Rohith Surabattula <rohiths@microsoft.com> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Cc: stable@vger.kernel.org # 5.13 Signed-off-by: Steve French <stfrench@microsoft.com>
Maximilian Heyne [Thu, 12 Aug 2021 13:09:27 +0000 (13:09 +0000)]
xen/events: Fix race in set_evtchn_to_irq
There is a TOCTOU issue in set_evtchn_to_irq. Rows in the evtchn_to_irq
mapping are lazily allocated in this function. The check whether the row
is already present and the row initialization is not synchronized. Two
threads can at the same time allocate a new row for evtchn_to_irq and
add the irq mapping to the their newly allocated row. One thread will
overwrite what the other has set for evtchn_to_irq[row] and therefore
the irq mapping is lost. This will trigger a BUG_ON later in
bind_evtchn_to_cpu:
This patch sets evtchn_to_irq rows via a cmpxchg operation so that they
will be set only once. The row is now cleared before writing it to
evtchn_to_irq in order to not create a race once the row is visible for
other threads.
While at it, do not require the page to be zeroed, because it will be
overwritten with -1's in clear_evtchn_to_irq_row anyway.
Signed-off-by: Maximilian Heyne <mheyne@amazon.de> Fixes: 01cb019159ba ("xen/events: Refactor evtchn_to_irq array to be dynamically allocated") Link: https://lore.kernel.org/r/20210812130930.127134-1-mheyne@amazon.de Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Randy Dunlap [Sat, 31 Jul 2021 00:01:46 +0000 (17:01 -0700)]
x86/tools: Fix objdump version check again
Skip (omit) any version string info that is parenthesized.
Warning: objdump version 15) is older than 2.19
Warning: Skipping posttest.
where 'objdump -v' says:
GNU objdump (GNU Binutils; SUSE Linux Enterprise 15) 2.35.1.20201123-7.18
Fixes: 642a24382a76c ("x86: Fix objdump version check in chkobjdump.awk for different formats.") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20210731000146.2720-1-rdunlap@infradead.org
riscv: Fix comment regarding kernel mapping overlapping with IS_ERR_VALUE
The current comment states that we check if the 64-bit kernel mapping
overlaps with the last 4K of the address space that is reserved to
error values in create_kernel_page_table, which is not the case since it
is done in setup_vm. But anyway, remove the reference to any function
and simply note that in 64-bit kernel, the check should be done as soon
as the kernel mapping base address is known.
Fixes: f57cd5ec86b6 ("riscv: Make sure the kernel mapping does not overlap with IS_ERR_VALUE") Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Cc: stable@vger.kernel.org Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Changbin Du [Thu, 22 Jul 2021 02:17:15 +0000 (10:17 +0800)]
riscv: kexec: do not add '-mno-relax' flag if compiler doesn't support it
The RISC-V special option '-mno-relax' which to disable linker relaxations
is supported by GCC8+. For GCC7 and lower versions do not support this
option.
powerpc/xive: Do not skip CPU-less nodes when creating the IPIs
On PowerVM, CPU-less nodes can be populated with hot-plugged CPUs at
runtime. Today, the IPI is not created for such nodes, and hot-plugged
CPUs use a bogus IPI, which leads to soft lockups.
We can not directly allocate and request the IPI on demand because
bringup_up() is called under the IRQ sparse lock. The alternative is
to allocate the IPIs for all possible nodes at startup and to request
the mapping on demand when the first CPU of a node is brought up.
Fixes: 2dc2a052d7af ("powerpc/xive: Map one IPI interrupt per node") Cc: stable@vger.kernel.org # v5.13 Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> Signed-off-by: Cédric Le Goater <clg@kaod.org> Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Tested-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210807072057.184698-1-clg@kaod.org
Christophe Leroy [Tue, 10 Aug 2021 16:13:17 +0000 (16:13 +0000)]
powerpc/interrupt: Do not call single_step_exception() from other exceptions
single_step_exception() is called by emulate_single_step() which
is called from (at least) alignment exception() handler and
program_check_exception() handler.
Redefine it as a regular __single_step_exception() which is called
by both single_step_exception() handler and emulate_single_step()
function.