Vaibhav Jain [Wed, 11 May 2022 08:26:36 +0000 (13:56 +0530)]
powerpc/papr_scm: Fix leaking nvdimm_events_map elements
Right now 'char *' elements allocated for individual 'stat_id' in
'papr_scm_priv.nvdimm_events_map[]' during papr_scm_pmu_check_events(), get
leaked in papr_scm_remove() and papr_scm_pmu_register(),
papr_scm_pmu_check_events() error paths.
Also individual 'stat_id' arent NULL terminated 'char *' instead they are fixed
8-byte sized identifiers. However papr_scm_pmu_register() assumes it to be a
NULL terminated 'char *' and at other places it assumes it to be a
'papr_scm_perf_stat.stat_id' sized string which is 8-byes in size.
Fix this by allocating the memory for papr_scm_priv.nvdimm_events_map to also
include space for 'stat_id' entries. This is possible since number of available
events/stat_ids are known upfront. This saves some memory and one extra level of
indirection from 'nvdimm_events_map' to 'stat_id'. Also rest of the code
can continue to call 'kfree(papr_scm_priv.nvdimm_events_map)' without needing to
iterate over the array and free up individual elements.
Miaoqian Lin [Thu, 12 May 2022 12:37:18 +0000 (16:37 +0400)]
powerpc/fsl_rio: Fix refcount leak in fsl_rio_setup
of_parse_phandle() returns a node pointer with refcount
incremented, we should use of_node_put() on it when not need anymore.
Add missing of_node_put() to avoid refcount leak.
Miaoqian Lin [Thu, 12 May 2022 09:05:33 +0000 (13:05 +0400)]
powerpc/xive: Fix refcount leak in xive_spapr_init
of_find_compatible_node() returns a node pointer with refcount
incremented, we should use of_node_put() on it when done.
Add missing of_node_put() to avoid refcount leak.
Fixes: 7a69a82db40b ("powerpc/xive: guest exploitation of the XIVE interrupt controller") Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220512090535.33397-1-linmq006@gmail.com
Oscar Salvador [Mon, 11 Apr 2022 07:49:34 +0000 (09:49 +0200)]
powerpc/numa: Associate numa node to its cpu earlier
powerpc is the only platform that do not rely on
cpu_up()->try_online_node() to bring up a numa node,
and special cases it, instead, deep in its own machinery:
This should not be needed, but the thing is that the try_online_node()
from cpu_up() will not apply on the right node, because cpu_to_node()
will return the old mapping numa<->cpu that gets set on boot stage
for all possible cpus.
That can be seen easily if we try to print out the numa node passed
to try_online_node() in cpu_up().
The thing is that the numa<->cpu mapping does not get updated till a much
later stage in start_secondary:
But we do not really care, as we already now the
CPU <-> NUMA associativity back in find_and_online_cpu_nid(),
so let us make use of that and set the proper numa<->cpu mapping,
so cpu_to_node() in cpu_up() returns the right node and
try_online_node() can do its work.
Signed-off-by: Oscar Salvador <osalvador@suse.de> Tested-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220411074934.4632-1-osalvador@suse.de
Randy Dunlap [Sun, 10 Apr 2022 16:10:35 +0000 (09:10 -0700)]
macintosh: via-pmu and via-cuda need RTC_LIB
Fix build when RTC_LIB is not set/enabled.
Eliminates these build errors:
m68k-linux-ld: drivers/macintosh/via-pmu.o: in function `pmu_set_rtc_time':
drivers/macintosh/via-pmu.c:1769: undefined reference to `rtc_tm_to_time64'
m68k-linux-ld: drivers/macintosh/via-cuda.o: in function `cuda_set_rtc_time':
drivers/macintosh/via-cuda.c:797: undefined reference to `rtc_tm_to_time64'
Fixes: f7c707070480 ("macintosh: Use common code to access RTC") Reported-by: kernel test robot <lkp@intel.com> Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220410161035.592-1-rdunlap@infradead.org
Finn Thain [Thu, 7 Apr 2022 10:11:32 +0000 (20:11 +1000)]
macintosh/via-pmu: Fix build failure when CONFIG_INPUT is disabled
drivers/macintosh/via-pmu-event.o: In function `via_pmu_event':
via-pmu-event.c:(.text+0x44): undefined reference to `input_event'
via-pmu-event.c:(.text+0x68): undefined reference to `input_event'
via-pmu-event.c:(.text+0x94): undefined reference to `input_event'
via-pmu-event.c:(.text+0xb8): undefined reference to `input_event'
drivers/macintosh/via-pmu-event.o: In function `via_pmu_event_init':
via-pmu-event.c:(.init.text+0x20): undefined reference to `input_allocate_device'
via-pmu-event.c:(.init.text+0xc4): undefined reference to `input_register_device'
via-pmu-event.c:(.init.text+0xd4): undefined reference to `input_free_device'
make[1]: *** [Makefile:1155: vmlinux] Error 1
make: *** [Makefile:350: __build_one_by_one] Error 2
Don't call into the input subsystem unless CONFIG_INPUT is built-in.
Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Finn Thain <fthain@linux-m68k.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/5edbe76ce68227f71e09af4614cc4c1bd61c7ec8.1649326292.git.fthain@linux-m68k.org
Kajol Jain [Fri, 6 May 2022 06:10:15 +0000 (11:40 +0530)]
powerpc/perf: Fix the threshold compare group constraint for power9
Thresh compare bits for a event is used to program thresh compare
field in Monitor Mode Control Register A (MMCRA: 9-18 bits for power9).
When scheduling events as a group, all events in that group should
match value in threshold bits (like thresh compare, thresh control,
thresh select). Otherwise event open for the sibling events should fail.
But in the current code, incase thresh compare bits are not valid,
we are not failing in group_constraint function which can result
in invalid group schduling.
Fix the issue by returning -1 incase event is threshold and threshold
compare value is not valid.
Thresh control bits in the event code is used to program thresh_ctl
field in Monitor Mode Control Register A (MMCRA: 48-55). In below example,
the scheduling of group events PM_MRK_INST_CMPL (873534401e0) and
PM_THRESH_MET (8734340101ec) is expected to fail as both event
request different thresh control bits and invalid thresh compare value.
Result before the patch changes:
[command]# perf stat -e "{r8735340401e0,r8734340101ec}" sleep 1
[command]# perf stat -e "{r8735340401e0,r8734340101ec}" sleep 1
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument)
for event (r8735340401e0).
/bin/dmesg | grep -i perf may provide additional information.
Kajol Jain [Fri, 6 May 2022 06:10:14 +0000 (11:40 +0530)]
powerpc/perf: Fix the threshold compare group constraint for power10
Thresh compare bits for a event is used to program thresh compare
field in Monitor Mode Control Register A (MMCRA: 8-18 bits for power10).
When scheduling events as a group, all events in that group should
match value in threshold bits. Otherwise event open for the sibling
events should fail. But in the current code, incase thresh compare bits are
not valid, we are not failing in group_constraint function which can result
in invalid group schduling.
Fix the issue by returning -1 incase event is threshold and threshold
compare value is not valid in group_constraint function.
Patch also fixes the p10_thresh_cmp_val function to return -1,
incase threshold bits are not valid and changes corresponding check in
is_thresh_cmp_valid function to return false only when the thresh_cmp
value is less then 0.
Thresh control bits in the event code is used to program thresh_ctl
field in Monitor Mode Control Register A (MMCRA: 48-55). In below example,
the scheduling of group events PM_MRK_INST_CMPL (3534401e0) and
PM_THRESH_MET (34340101ec) is expected to fail as both event
request different thresh control bits.
Result before the patch changes:
[command]# perf stat -e "{r35340401e0,r34340101ec}" sleep 1
Fixes: b67ba2054935b ("powerpc/perf: Adds support for programming of Thresholding in P10") Signed-off-by: Kajol Jain <kjain@linux.ibm.com> Reviewed-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220506061015.43916-1-kjain@linux.ibm.com
YueHaibing [Tue, 17 May 2022 09:49:00 +0000 (17:49 +0800)]
powerpc/kaslr_booke: Fix build error
arch/powerpc/mm/nohash/kaslr_booke.c: In function ‘kaslr_get_cmdline’:
arch/powerpc/mm/nohash/kaslr_booke.c:46:2: error: implicit declaration of function ‘early_init_dt_scan_chosen’
early_init_dt_scan_chosen(boot_command_line);
^~~~~~~~~~~~~~~~~~~~~~~~~
arch/powerpc/mm/nohash/kaslr_booke.c: In function ‘get_initrd_range’:
arch/powerpc/mm/nohash/kaslr_booke.c:210:10: error: implicit declaration of function ‘of_read_number’
start = of_read_number(prop, len / 4);
^~~~~~~~~~~~~~
YueHaibing [Tue, 17 May 2022 09:48:30 +0000 (17:48 +0800)]
powerpc/book3e: Fix build error
arch/powerpc/mm/nohash/fsl_book3e.c: In function ‘relocate_init’:
arch/powerpc/mm/nohash/fsl_book3e.c:348:2: error: implicit declaration of function ‘early_get_first_memblock_info’
early_get_first_memblock_info(__va(dt_ptr), &size);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Add missing include file linux/of_fdt.h to fix this.
Fixes: 5c414ed7f128 ("powerpc: Remove asm/prom.h from all files that don't need it") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220517094830.27560-1-yuehaibing@huawei.com
Daniel Axtens [Wed, 18 May 2022 10:05:31 +0000 (20:05 +1000)]
powerpc: Book3S 64-bit outline-only KASAN support
Implement a limited form of KASAN for Book3S 64-bit machines running under
the Radix MMU, supporting only outline mode.
- Enable the compiler instrumentation to check addresses and maintain the
shadow region. (This is the guts of KASAN which we can easily reuse.)
- Require kasan-vmalloc support to handle modules and anything else in
vmalloc space.
- KASAN needs to be able to validate all pointer accesses, but we can't
instrument all kernel addresses - only linear map and vmalloc. On boot,
set up a single page of read-only shadow that marks all iomap and
vmemmap accesses as valid.
- Document KASAN in powerpc docs.
Background
----------
KASAN support on Book3S is a bit tricky to get right:
- It would be good to support inline instrumentation so as to be able to
catch stack issues that cannot be caught with outline mode.
- Inline instrumentation requires a fixed offset.
- Book3S runs code with translations off ("real mode") during boot,
including a lot of generic device-tree parsing code which is used to
determine MMU features.
[ppc64 mm note: The kernel installs a linear mapping at effective
address c000...-c008.... This is a one-to-one mapping with physical
memory from 0000... onward. Because of how memory accesses work on
powerpc 64-bit Book3S, a kernel pointer in the linear map accesses the
same memory both with translations on (accessing as an 'effective
address'), and with translations off (accessing as a 'real
address'). This works in both guests and the hypervisor. For more
details, see s5.7 of Book III of version 3 of the ISA, in particular
the Storage Control Overview, s5.7.3, and s5.7.5 - noting that this
KASAN implementation currently only supports Radix.]
- Some code - most notably a lot of KVM code - also runs with translations
off after boot.
- Therefore any offset has to point to memory that is valid with
translations on or off.
One approach is just to give up on inline instrumentation. This way
boot-time checks can be delayed until after the MMU is set is up, and we
can just not instrument any code that runs with translations off after
booting. Take this approach for now and require outline instrumentation.
Previous attempts allowed inline instrumentation. However, they came with
some unfortunate restrictions: only physically contiguous memory could be
used and it had to be specified at compile time. Maybe we can do better in
the future.
[paulus@ozlabs.org - Rebased onto 5.17. Note that a kernel with
CONFIG_KASAN=y will crash during boot on a machine using HPT
translation because not all the entry points to the generic
KASAN code are protected with a call to kasan_arch_is_ready().]
Originally-by: Balbir Singh <bsingharora@gmail.com> # ppc64 out-of-line radix version Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
[mpe: Update copyright year and comment formatting] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/YoTE69OQwiG7z+Gu@cleo
Daniel Axtens [Wed, 18 May 2022 10:07:05 +0000 (20:07 +1000)]
powerpc/kasan: Disable address sanitization in kexec paths
The kexec code paths involve code that necessarily run in real mode, as
CPUs are disabled and control is transferred to the new kernel. Disable
address sanitization for the kexec code and the functions called in real
mode on CPUs being disabled.
[paulus@ozlabs.org: combined a few work-in-progress commits of
Daniel's and wrote the commit message.]
Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
[mpe: Move pseries_machine_kexec() into kexec.c so setup.c can be instrumented] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/YoTFSQ2TUSEaDdVC@cleo
Daniel Axtens [Wed, 18 May 2022 10:06:17 +0000 (20:06 +1000)]
powerpc/kasan: Don't instrument non-maskable or raw interrupts
Disable address sanitization for raw and non-maskable interrupt
handlers, because they can run in real mode, where we cannot access
the shadow memory. (Note that kasan_arch_is_ready() doesn't test for
real mode, since it is a static branch for speed, and in any case not
all the entry points to the generic KASAN code are protected by
kasan_arch_is_ready guards.)
The changes to interrupt_nmi_enter/exit_prepare() look larger than
they actually are. The changes are equivalent to adding
!IS_ENABLED(CONFIG_KASAN) to the conditions for calling nmi_enter() or
nmi_exit() in real mode. That is, the code is equivalent to using the
following condition for calling nmi_enter/exit:
That unwieldy condition has been split into several statements with
comments, for easier reading.
The nmi_ipi_lock functions that call atomic functions (i.e.,
nmi_ipi_lock_start(), nmi_ipi_lock() and nmi_ipi_unlock()), besides
being marked noinstr, now call arch_atomic_* functions instead of
atomic_* functions because with KASAN enabled, the atomic_* functions
are wrappers which explicitly do address sanitization on their
arguments. Since we are trying to avoid address sanitization, we have
to use the lower-level arch_atomic_* versions.
In hv_nmi_check_nonrecoverable(), the regs_set_unrecoverable() call
has been open-coded so as to avoid having to either trust the inlining
or mark regs_set_unrecoverable() as noinstr.
[paulus@ozlabs.org: combined a few work-in-progress commits of
Daniel's and wrote the commit message.]
Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/YoTFGaKM8Pd46PIK@cleo
Russell Currey [Tue, 8 Jun 2021 06:48:09 +0000 (16:48 +1000)]
selftests/powerpc: Better reporting in spectre_v2
In commit 611eed9f0bdd ("selftests/powerpc: Return skip code for
spectre_v2"), the spectre_v2 selftest is updated to be aware of cases
where the vulnerability status reported in sysfs is incorrect, skipping
the test instead.
This happens because qemu can misrepresent the mitigation status of the
host to the guest. If the count cache is disabled in the host, and this
is correctly reported to the guest, then the guest won't apply
mitigations. If the guest is then migrated to a new host where
mitigations are necessary, it is now vulnerable because it has not
applied mitigations.
Update the selftest to report when we see excessive misses, indicative of
the count cache being disabled. If software flushing is enabled, also
warn that these flushes are just wasting performance.
Russell Currey [Mon, 4 Apr 2022 10:15:36 +0000 (20:15 +1000)]
powerpc/powernv: Get STF barrier requirements from device-tree
The device-tree property no-need-store-drain-on-priv-state-switch is
equivalent to H_CPU_BEHAV_NO_STF_BARRIER from the
H_CPU_GET_CHARACTERISTICS hcall on pseries.
Since commit 88632ca22cfe ("powerpc/security: Add a security feature for
STF barrier") powernv systems with this device-tree property have been
enabling the STF barrier when they have no need for it. This patch
fixes this by clearing the STF barrier feature on those systems.
Fixes: 88632ca22cfe ("powerpc/security: Add a security feature for STF barrier") Reported-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220404101536.104794-2-ruscur@russell.cc
Russell Currey [Mon, 4 Apr 2022 10:15:35 +0000 (20:15 +1000)]
powerpc/powernv: Get L1D flush requirements from device-tree
The device-tree properties no-need-l1d-flush-msr-pr-1-to-0 and
no-need-l1d-flush-kernel-on-user-access are the equivalents of
H_CPU_BEHAV_NO_L1D_FLUSH_ENTRY and H_CPU_BEHAV_NO_L1D_FLUSH_UACCESS
from the H_GET_CPU_CHARACTERISTICS hcall on pseries respectively.
In commit 374018712601 ("powerpc/powernv: Remove POWER9 PVR version
check for entry and uaccess flushes") the condition for disabling the
L1D flush on kernel entry and user access was changed from any non-P9
CPU to only checking P7 and P8. Without the appropriate device-tree
checks for newer processors on powernv, these flushes are unnecessarily
enabled on those systems. This patch corrects this.
Fixes: 374018712601 ("powerpc/powernv: Remove POWER9 PVR version check for entry and uaccess flushes") Reported-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220404101536.104794-1-ruscur@russell.cc
Pali Rohár [Fri, 6 May 2022 20:36:21 +0000 (22:36 +0200)]
powerpc/85xx/p2020: Add fsl,mpc8548-pmc node
P2020 also contains Power Management Controller and their registers at
offset 0xe0070 compatible with mpc8548. So add PMC node into DTS include
file fsl/p2020si-post.dtsi
powerpc/64: Only WARN if __pa()/__va() called with bad addresses
We added checks to __pa() / __va() to ensure they're only called with
appropriate addresses. But using BUG_ON() is too strong, it means
virt_addr_valid() will BUG when DEBUG_VIRTUAL is enabled.
Instead switch them to warnings, arm64 does the same.
arch/Kconfig: Drop references to powerpc PAGE_SIZE symbols
In the previous commit powerpc added PAGE_SIZE related config symbols
using the generic names.
So there's no need to refer to them in the definition of
PAGE_SIZE_LESS_THAN_64KB etc, the negative dependency on the generic
symbol is sufficient (in this case !PAGE_SIZE_64KB).
Haren Myneni [Sat, 9 Apr 2022 08:44:16 +0000 (01:44 -0700)]
powerpc/powernv/vas: Assign real address to rx_fifo in vas_rx_win_attr
In init_winctx_regs(), __pa() is called on winctx->rx_fifo and this
function is called to initialize registers for receive and fault
windows. But the real address is passed in winctx->rx_fifo for
receive windows and the virtual address for fault windows which
causes errors with DEBUG_VIRTUAL enabled. Fixes this issue by
assigning only real address to rx_fifo in vas_rx_win_attr struct
for both receive and fault windows.
The following PPC_INST_XXX macros are not used anymore
outside ppc-opcode.h:
- PPC_INST_LD
- PPC_INST_STD
- PPC_INST_ADDIS
- PPC_INST_ADD
- PPC_INST_DIVD
powerpc/ftrace: Don't use copy_from_kernel_nofault() in module_trampoline_target()
module_trampoline_target() is quite a hot path used when
activating/deactivating function tracer.
Avoid the heavy copy_from_kernel_nofault() by doing four calls
to copy_inst_from_kernel_nofault().
Use __copy_inst_from_kernel_nofault() for the 3 last calls. First call
is done to copy_from_kernel_nofault() to check address is within
kernel space. No risk to wrap out the top of kernel space because the
last page is never mapped so if address is in last page the first copy
will fails and the other ones will never be performed.
And also make it notrace just like all functions that call it.
powerpc/ftrace: Use CONFIG_FUNCTION_TRACER instead of CONFIG_DYNAMIC_FTRACE
Since commit 16b642fcedb5 ("powerpc: Only support DYNAMIC_FTRACE not
static"), CONFIG_DYNAMIC_FTRACE is always selected when
CONFIG_FUNCTION_TRACER is selected.
To avoid confusion and have the reader wonder what's happen when
CONFIG_FUNCTION_TRACER is selected and CONFIG_DYNAMIC_FTRACE is not,
use CONFIG_FUNCTION_TRACER in ifdefs instead of CONFIG_DYNAMIC_FTRACE.
As CONFIG_FUNCTION_GRAPH_TRACER depends on CONFIG_FUNCTION_TRACER,
ftrace.o doesn't need to appear for both symbols in Makefile.
Then as ftrace.o is built only when CONFIG_FUNCTION_TRACER is selected
ifdef CONFIG_FUNCTION_TRACER is not needed in ftrace.c, and since it
implies CONFIG_DYNAMIC_FTRACE, CONFIG_DYNAMIC_FTRACE is not needed
in ftrace.c
powerpc: Add CONFIG_PPC64_ELF_ABI_V1 and CONFIG_PPC64_ELF_ABI_V2
At the time being, we use CONFIG_CPU_LITTLE_ENDIAN and
CONFIG_CPU_BIG_ENDIAN to pass -mabi=elfv1 or elfv2 to
compiler, then define a PPC64_ELF_ABI_v1 or PPC64_ELF_ABI_v2
macro in asm/types.h based on _CALL_ELF define set by the compiler.
Make it more straight forward with a CONFIG option that
is directly usable.
It improves ftrace activation/deactivation duration by about 3%.
Modify patch_instruction() return on failure to -EPERM in order to
match with ftrace expectations. Other users of patch_instruction()
do not care about the exact error value returned.
Inlining ftrace_modify_code(), it increases a bit the
size of ftrace code but brings 5% improvment on ftrace
activation.
Usually in C files we let gcc decide what to do but here
it really help to 'help' gcc to decide to inline, thought
we don't want to force it with an __always_inline that
would be too much for CONFIG_CC_OPTIMIZE_FOR_SIZE.
Since commit 55afba3325e1 ("powerpc/code-patching: Fix patch_branch()
return on out-of-range failure") patch_branch() fails with -ERANGE
when trying to branch out of range.
No need to perform the test twice. Remove redundant create_branch()
calls.
When we have CONFIG_DYNAMIC_FTRACE_WITH_ARGS,
prepare_ftrace_return() is called by ftrace_graph_func()
otherwise prepare_ftrace_return() is called from assembly.
Refactor prepare_ftrace_return() into a static
__prepare_ftrace_return() that will be called by both
prepare_ftrace_return() and ftrace_graph_func().
It will allow GCC to fold __prepare_ftrace_return() inside
ftrace_graph_func().
Nicholas Piggin [Tue, 8 Mar 2022 13:50:46 +0000 (23:50 +1000)]
powerpc/rtas: enture rtas_call is called with MMU enabled
rtas_call must not be called with the MMU disabled because in case
of rtas error, log_error is called which requires MMU enabled. Add
a test and warning for this.
Nicholas Piggin [Tue, 8 Mar 2022 13:50:42 +0000 (23:50 +1000)]
powerpc/rtas: Leave MSR[RI] enabled over RTAS call
PAPR specifies that RTAS may be called with MSR[RI] enabled if the
calling context is recoverable, and RTAS will manage RI as necessary.
Call the rtas entry point with RI enabled, and add a check to ensure
the caller has RI enabled.
Nicholas Piggin [Tue, 8 Mar 2022 13:50:40 +0000 (23:50 +1000)]
powerpc/rtas: PACA can be restored directly from SPRG
On 64-bit, PACA is saved in a SPRG so it does not need to be saved on
stack. We also don't need to mask off the top bits for real mode
addresses because the architecture does this for us.
Nicholas Piggin [Mon, 7 Mar 2022 18:27:34 +0000 (04:27 +1000)]
powerpc/signal: Report minimum signal frame size to userspace via AT_MINSIGSTKSZ
Implement the AT_MINSIGSTKSZ AUXV entry, allowing userspace to
dynamically size stack allocations in a manner forward-compatible with
new processor state saved in the signal frame
For now these statically find the maximum signal frame size rather than
doing any runtime testing of features to minimise the size.
glibc 2.34 will take advantage of this, as will applications that use
use _SC_MINSIGSTKSZ and _SC_SIGSTKSZ.
Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
References: 9c59c095080b ("arm64: signal: Report signal frame size to userspace via auxv") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220307182734.289289-2-npiggin@gmail.com
Nicholas Piggin [Mon, 7 Mar 2022 18:27:33 +0000 (04:27 +1000)]
powerpc/64: Bump SIGSTKSZ and MINSIGSTKSZ
The sad tale of SIGSTKSZ and MINSIGSTKSZ is documented in glibc.git
commit f7c399cff5bd ("PowerPC SIGSTKSZ"), which explains why glibc
does not use the kernel defines for these constants.
Since then in fact there has been a further expansion of the signal
stack frame size on little-endian with linux commit 1db982d5dc65 ("powerpc: Increase stack redzone for 64-bit userspace to
512 bytes"), which has caused it to exceed even the glibc defines.
See kernel commit 22e607f57960 ("powerpc: Allow 4224 bytes of stack
expansion for the signal frame") for more details of the history of the
expansion.
Increase MINSIGSTKSZ to 8192 which is double the current glibc value and
fits the current stack frame with room to grow. SIGSTKSZ is set to 4x
the minimum as convention.
The PowerPC vDSO uses $(CC) to link, which differs from the rest of the
kernel, which uses $(LD) directly. As a result, the default linker of
the compiler is used, which may differ from the linker requested by the
builder. For example:
$ make ARCH=powerpc LLVM=1 mrproper defconfig arch/powerpc/kernel/vdso/
...
File: arch/powerpc/kernel/vdso/vdso32.so.dbg
String dump of section '.comment':
[ 0] clang version 14.0.0 (Fedora 14.0.0-1.fc37)
File: arch/powerpc/kernel/vdso/vdso64.so.dbg
String dump of section '.comment':
[ 0] clang version 14.0.0 (Fedora 14.0.0-1.fc37)
LLVM=1 sets LD=ld.lld but ld.lld is not used to link the vDSO; GNU ld is
because "ld" is the default linker for clang on most Linux platforms.
This is a problem for Clang's Link Time Optimization as implemented in
the kernel because use of GNU ld with LTO requires the LLVMgold plugin,
which is not technically supported for ld.bfd per
https://llvm.org/docs/GoldPlugin.html. Furthermore, if LLVMgold.so is
missing from a user's system, the build will fail, even though LTO as it
is implemented in the kernel requires ld.lld to avoid this dependency in
the first place.
Ultimately, the PowerPC vDSO should be converted to compiling and
linking with $(CC) and $(LD) respectively but there were issues last
time this was tried, potentially due to older but supported tool
versions. To avoid regressing GCC + binutils, use the compiler option
'-fuse-ld', which tells the compiler which linker to use when it is
invoked as both the compiler and linker. Use '-fuse-ld=lld' when
LD=ld.lld has been specified (CONFIG_LD_IS_LLD) so that the vDSO is
linked with the same linker as the rest of the kernel.
File: arch/powerpc/kernel/vdso/vdso32.so.dbg
String dump of section '.comment':
[ 0] Linker: LLD 14.0.0
[ 14] clang version 14.0.0 (Fedora 14.0.0-1.fc37)
File: arch/powerpc/kernel/vdso/vdso64.so.dbg
String dump of section '.comment':
[ 0] Linker: LLD 14.0.0
[ 14] clang version 14.0.0 (Fedora 14.0.0-1.fc37)
LD can be a full path to ld.lld, which will not be handled properly by
'-fuse-ld=lld' if the full path to ld.lld is outside of the compiler's
search path. '-fuse-ld' can take a path to the linker but it is
deprecated in clang 12.0.0; '--ld-path' is preferred for this scenario.
Use '--ld-path' if it is supported, as it will handle a full path or
just 'ld.lld' properly. See the LLVM commit below for the full details
of '--ld-path'.
Kevin Hao [Tue, 29 Mar 2022 08:57:09 +0000 (16:57 +0800)]
powerpc: Export mmu_feature_keys[] as non-GPL
When the mmu_feature_keys[] was introduced in the commit c0f90d2745f3
("powerpc: Add option to use jump label for mmu_has_feature()"),
it is unlikely that it would be used either directly or indirectly in
the out of tree modules. So we exported it as GPL only.
But with the evolution of the codes, especially the PPC_KUAP support, it
may be indirectly referenced by some primitive macro or inline functions
such as get_user() or __copy_from_user_inatomic(), this will make it
impossible to build many non GPL modules (such as ZFS) on ppc
architecture. Fix this by exposing the mmu_feature_keys[] to the non-GPL
modules too.
Fixes: fe40a6bcea3a ("powerpc/64s/kuap: Use mmu_has_feature()") Reported-by: Nathaniel Filardo <nwfilardo@gmail.com> Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220329085709.4132729-1-haokexin@gmail.com
The panic notifiers infrastructure is a bit limited in the scope of
the callbacks - basically every kind of functionality is dropped
in a list that runs in the same point during the kernel panic path.
This is not really on par with the complexities and particularities
of architecture / hypervisors' needs, and a refactor is ongoing.
As part of this refactor, it was observed that powerpc has 2 notifiers,
with mixed goals: one is just a KASLR offset dumper, whereas the other
aims to hard-disable IRQs (necessary on panic path), warn firmware of
the panic event (fadump) and run low-level platform-specific machinery
that might stop kernel execution and never come back.
Clearly, the 2nd notifier has opposed goals: disable IRQs / fadump
should run earlier while low-level platform actions should
run late since it might not even return. Hence, this patch decouples
the notifiers splitting them in three:
- First one is responsible for hard-disable IRQs and fadump,
should run early;
- The kernel KASLR offset dumper is really an informative notifier,
harmless and may run at any moment in the panic path;
- The last notifier should run last, since it aims to perform
low-level actions for specific platforms, and might never return.
It is also only registered for 2 platforms, pseries and ps3.
The patch better documents the notifiers and clears the code too,
also removing a useless header.
Currently no functionality change should be observed, but after
the planned panic refactor we should expect more panic reliability
with this patch.
KVM: PPC: Book3s: Remove real mode interrupt controller hcalls handlers
Currently we have 2 sets of interrupt controller hypercalls handlers
for real and virtual modes, this is from POWER8 times when switching
MMU on was considered an expensive operation.
POWER9 however does not have dependent threads and MMU is enabled for
handling hcalls so the XIVE native or XICS-on-XIVE real mode handlers
never execute on real P9 and later CPUs.
This untemplate the handlers and only keeps the real mode handlers for
XICS native (up to POWER8) and remove the rest of dead code. Changes
in functions are mechanical except few missing empty lines to make
checkpatch.pl happy.
The default implemented hcalls list already contains XICS hcalls so
no change there.
This should not cause any behavioral change.
Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220509071150.181250-1-aik@ozlabs.ru
When KVM_CAP_PPC_ENABLE_HCALL was introduced, H_GET_TCE and H_PUT_TCE
were already implemented and enabled by default; however H_GET_TCE
was missed out on PR KVM (probably because the handler was in
the real mode code at the time).
This enables H_GET_TCE by default. While at this, this wraps
the checks in ifdef CONFIG_SPAPR_TCE_IOMMU just like HV KVM.
KVM: PPC: Book3s: Retire H_PUT_TCE/etc real mode handlers
LoPAPR defines guest visible IOMMU with hypercalls to use it -
H_PUT_TCE/etc. Implemented first on POWER7 where hypercalls would trap
in the KVM in the real mode (with MMU off). The problem with the real mode
is some memory is not available and some API usage crashed the host but
enabling MMU was an expensive operation.
The problems with the real mode handlers are:
1. Occasionally these cannot complete the request so the code is
copied+modified to work in the virtual mode, very little is shared;
2. The real mode handlers have to be linked into vmlinux to work;
3. An exception in real mode immediately reboots the machine.
If the small DMA window is used, the real mode handlers bring better
performance. However since POWER8, there has always been a bigger DMA
window which VMs use to map the entire VM memory to avoid calling
H_PUT_TCE. Such 1:1 mapping happens once and uses H_PUT_TCE_INDIRECT
(a bulk version of H_PUT_TCE) which virtual mode handler is even closer
to its real mode version.
On POWER9 hypercalls trap straight to the virtual mode so the real mode
handlers never execute on POWER9 and later CPUs.
So with the current use of the DMA windows and MMU improvements in
POWER9 and later, there is no point in duplicating the code.
The 32bit passed through devices may slow down but we do not have many
of these in practice. For example, with this applied, a 1Gbit ethernet
adapter still demostrates above 800Mbit/s of actual throughput.
This removes the real mode handlers from KVM and related code from
the powernv platform.
This updates the list of implemented hcalls in KVM-HV as the realmode
handlers are removed.
This changes ABI - kvmppc_h_get_tce() moves to the KVM module and
kvmppc_find_table() is static now.
Bo Liu [Fri, 1 Apr 2022 06:52:52 +0000 (02:52 -0400)]
KVM: PPC: Book3S HV: Use consistent type for return value of kvm_age_rmapp()
The return value type defined in the function kvm_age_rmapp() is
"bool", but the return value type defined in the implementation of the
function kvm_age_rmapp() is "int".
KVM: PPC: Book3S HV: fix incorrect NULL check on list iterator
The bug is here:
if (!p)
return ret;
The list iterator value 'p' will *always* be set and non-NULL by
list_for_each_entry(), so it is incorrect to assume that the iterator
value will be NULL if the list is empty or no element is found.
To fix the bug, Use a new value 'iter' as the list iterator, while use
the old value 'p' as a dedicated variable to point to the found element.
Fixes: 7d57ac025eeb ("KVM: PPC: Book3S HV: In H_SVM_INIT_DONE, migrate remaining normal-GFNs to secure-GFNs") Cc: stable@vger.kernel.org # v5.9+ Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220414062103.8153-1-xiam0nd.tong@gmail.com
The L1 should not be able to adjust LPES mode for the L2. Setting LPES
if the L0 needs it clear would cause external interrupts to be sent to
L2 and missed by the L0.
Clearing LPES when it may be set, as typically happens with XIVE enabled
could cause a performance issue despite having no native XIVE support in
the guest, because it will cause mediated interrupts for the L2 to be
taken in HV mode, which then have to be injected.
Nicholas Piggin [Thu, 3 Mar 2022 05:33:14 +0000 (15:33 +1000)]
KVM: PPC: Book3S HV Nested: L2 must not run with L1 xive context
The PowerNV L0 currently pushes the OS xive context when running a vCPU,
regardless of whether it is running a nested guest. The problem is that
xive OS ring interrupts will be delivered while the L2 is running.
At the moment, by default, the L2 guest runs with LPCR[LPES]=0, which
actually makes external interrupts go to the L0. That causes the L2 to
exit and the interrupt taken or injected into the L1, so in some
respects this behaves like an escalation. It's not clear if this was
deliberate or not, there's no comment about it and the L1 is actually
allowed to clear LPES in the L2, so it's confusing at best.
When the L2 is running, the L1 is essentially in a ceded state with
respect to external interrupts (it can't respond to them directly and
won't get scheduled again absent some additional event). So the natural
way to solve this is when the L0 handles a H_ENTER_NESTED hypercall to
run the L2, have it arm the escalation interrupt and don't push the L1
context while running the L2.
If there is a pending xive interrupt, inject it at guest entry (if
MSR[EE] is enabled) rather than take another interrupt when the guest
is entered. If xive is enabled then LPCR[LPES] is set so this behaviour
should be expected.
Nicholas Piggin [Sun, 23 Jan 2022 12:00:43 +0000 (22:00 +1000)]
KVM: PPC: Book3S HV: Remove KVMPPC_NR_LPIDS
KVMPPC_NR_LPIDS no longer represents any size restriction on the
LPID space and can be removed. A CPU with more than 12 LPID bits
implemented will now be able to create more than 4095 guests.
Nicholas Piggin [Sun, 23 Jan 2022 12:00:42 +0000 (22:00 +1000)]
KVM: PPC: Book3S Nested: Use explicit 4096 LPID maximum
Rather than tie this to KVMPPC_NR_LPIDS which is becoming more dynamic,
fix it to 4096 (12-bits) explicitly for now.
kvmhv_get_nested() does not have to check against KVM_MAX_NESTED_GUESTS
because the L1 partition table registration hcall already did that, and
it checks against the partition table size.
This patch also puts all the partition table size calculations into the
same form, using 12 for the architected size field shift and 4 for the
shift corresponding to the partition table entry size.
The LPID allocator init is changed to:
- use mmu_lpid_bits rather than hard-coding;
- use KVM_MAX_NESTED_GUESTS for nested hypervisors;
- not reserve the top LPID on POWER9 and newer CPUs.
The reserved LPID is made a POWER7/8-specific detail.
Nicholas Piggin [Sun, 23 Jan 2022 12:00:38 +0000 (22:00 +1000)]
KVM: PPC: Remove kvmppc_claim_lpid
Removing kvmppc_claim_lpid makes the lpid allocator API a bit simpler to
change the underlying implementation in a future patch.
The host LPID is always 0, so that can be a detail of the allocator. If
the allocator range is restricted, that can reserve LPIDs at the top of
the range. This allows kvmppc_claim_lpid to be removed.
Nicholas Piggin [Sun, 23 Jan 2022 11:47:25 +0000 (21:47 +1000)]
KVM: PPC: Book3S HV P9: Optimise loads around context switch
It is better to get all loads for the register values in flight
before starting to switch LPID, PID, and LPCR because those
mtSPRs are expensive and serialising.
This also just tidies up the code for a potential future change
to the context switching sequence.
Nicholas Piggin [Sat, 22 Jan 2022 10:56:39 +0000 (20:56 +1000)]
KVM: PPC: Book3S HV: HFSCR[PREFIX] does not exist
This facility is controlled by FSCR only. Reserved bits should not be
set in the HFSCR register (although it's likely harmless as this
position would not be re-used, and the L0 is forgiving here too).
This happens because MSR[RI] is unset when entering RTAS but there is no
valid reason to not set it here.
RTAS is expected to be called with MSR[RI] as specified in PAPR+ section
"7.2.1 Machine State":
R1–7.2.1–9. If called with MSR[RI] equal to 1, then RTAS must protect
its own critical regions from recursion by setting the MSR[RI] bit to
0 when in the critical regions.
Fixing this by reviewing the way MSR is compute before calling RTAS. Now a
hardcoded value meaning real mode, 32 bits big endian mode and Recoverable
Interrupt is loaded. In the case MSR[S] is set, it will remain set while
entering RTAS as only urfid can unset it (thanks Fabiano).
In addition a check is added in do_enter_rtas() to detect calls made with
MSR[RI] unset, as we are forcing it on later.
This patch has been tested on the following machines:
Power KVM Guest
P8 S822L (host Ubuntu kernel 5.11.0-49-generic)
PowerVM LPAR
P8 9119-MME (FW860.A1)
p9 9008-22L (FW950.00)
P10 9080-HEX (FW1010.00)
powerpc/8xx: Convert CPM1 interrupt controller to platform_device
In the same logic as commit e49be8134be4 ("soc: fsl: qe: convert QE
interrupt controller to platform_device"), convert CPM1 interrupt
controller to platform_device.