powerpc/pci: Add config option for using all 256 PCI buses
By default on PPC32 PCI bus numbers are unique across all PCI domains.
So a system could have only 256 PCI buses independently of available PCI
domains.
This is due to filling DT property pci-OF-bus-map which does not support
a multi-domain setup.
On all powerpc platforms except chrp and powermac there is no DT
property pci-OF-bus-map anymore and therefore it is possible on
non-chrp/powermac platforms to avoid this limitation of maximum number
of 256 PCI buses in a system even on multi-domain setup.
But avoiding this limitation would mean that all PCI and PCIe devices
would be present on completely different BDF addresses as every PCI
domain starts numbering PCI bueses from zero (instead of the last bus
number of previous enumerated PCI domain). Such change could break
existing software which expects fixed PCI bus numbers.
So add a new config option CONFIG_PPC_PCI_BUS_NUM_DOMAIN_DEPENDENT which
enables this change. By default it is disabled. It causes the initial
value of hose->first_busno to be zero.
powerpc/pci: Disable filling pci-OF-bus-map for non-chrp/powermac
Creating or filling pci-OF-bus-map property in the device-tree is
deprecated since May 2006 [1] and was used only in old platforms like
PowerMac.
Currently kernel code handles it only for chrp and powermac code. So
completely disable filling pci-OF-bus-map property for non-chrp and
non-powermac platforms.
By default old pre-3.0 Freescale PCIe controllers reports invalid PCI Class
Code 0x0b20 for PCIe Root Port. It can be seen by lspci -b output on P2020
board which has this pre-3.0 controller:
$ lspci -bvnn
00:00.0 Power PC [0b20]: Freescale Semiconductor Inc P2020E [1957:0070] (rev 21)
!!! Invalid class 0b20 for header type 01
Capabilities: [4c] Express Root Port (Slot-), MSI 00
Fix this issue by programming correct PCI Class Code 0x0604 for PCIe Root
Port to the Freescale specific PCIe register 0x474.
Fix activated by U-Boot stay active also after booting Linux kernel.
But boards which use older U-Boot version without that fix are affected and
still require this fix.
So implement this class code fix also in kernel fsl_pci.c driver.
The .incbin assembler directive is much faster than bin2c + $(CC).
Do similar refactoring as in commit b8abea4e0c09 ("s390/purgatory:
Omit use of bin2c").
Please note the .quad directive matches to size_t in C (both 8 byte)
because the purgatory is compiled only for the 64-bit kernel.
(KEXEC_FILE depends on PPC64).
powerpc/pseries/mobility: set NMI watchdog factor during an LPM
During an LPM, while the memory transfer is in progress on the arrival
side, some latencies are generated when accessing not yet transferred
pages on the arrival side. Thus, the NMI watchdog may be triggered too
frequently, which increases the risk to hit an NMI interrupt in a bad
place in the kernel, leading to a kernel panic.
Disabling the Hard Lockup Watchdog until the memory transfer could be a
too strong work around, some users would want this timeout to be
eventually triggered if the system is hanging even during an LPM.
Introduce a new sysctl variable nmi_watchdog_factor. It allows to apply
a factor to the NMI watchdog timeout during an LPM. Just before the CPUs
are stopped for the switchover sequence, the NMI watchdog timer is set
to watchdog_thresh + factor%
A value of 0 has no effect. The default value is 200, meaning that the
NMI watchdog is set to 30s during LPM (based on a 10s watchdog_thresh
value). Once the memory transfer is achieved, the factor is reset to 0.
Setting this value to a high number is like disabling the NMI watchdog
during an LPM.
powerpc/watchdog: introduce a NMI watchdog's factor
Introduce a factor which would apply to the NMI watchdog timeout.
This factor is a percentage added to the watchdog_tresh value. The value is
set under the watchdog_mutex protection and lockup_detector_reconfigure()
is called to recompute wd_panic_timeout_tb.
Once the factor is set, it remains until it is set back to 0, which means
no impact.
In some circumstances it may be interesting to reconfigure the watchdog
from inside the kernel.
On PowerPC, this may helpful before and after a LPAR migration (LPM) is
initiated, because it implies some latencies, watchdog, and especially NMI
watchdog is expected to be triggered during this operation. Reconfiguring
the watchdog with a factor, would prevent it to happen too frequently
during LPM.
Rename lockup_detector_reconfigure() as __lockup_detector_reconfigure() and
create a new function lockup_detector_reconfigure() calling
__lockup_detector_reconfigure() under the protection of watchdog_mutex.
powerpc/mobility: wait for memory transfer to complete
In pseries_migration_partition(), loop until the memory transfer is
complete. This way the calling drmgr process will not exit earlier,
allowing callbacks to be run only once the migration is fully completed.
If reading the VASI state is done after the hypervisor has completed the
migration, the HCALL is returning H_PARAMETER. We can safely assume that
the memory transfer is achieved if this happens.
This will also allow to manage the NMI watchdog state in the next commits.
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220713154729.80789-2-ldufour@linux.ibm.com
Michael Ellerman [Mon, 27 Jun 2022 14:02:38 +0000 (00:02 +1000)]
selftests/powerpc/ptrace: Use more interesting values
The ptrace-gpr test uses fixed values to test that registers can be
read/written via ptrace. In particular it sets all GPRs to 1, which
means the test could miss some types of bugs - eg. if the kernel was
only returning the low word.
So generate some random values at startup and use those instead.
Michael Ellerman [Mon, 27 Jun 2022 14:02:36 +0000 (00:02 +1000)]
selftests/powerpc/ptrace: Do more of ptrace-gpr in asm
The ptrace-gpr test includes some inline asm to load GPR and FPR
registers. It then goes back to C to wait for the parent to trace it and
then checks register contents.
The split between inline asm and C is fragile, it relies on the compiler
not using any non-volatile GPRs after the inline asm block. It also
requires a very large and unwieldy inline asm block.
So convert the logic to set registers, wait, and store registers to a
single asm function, meaning there's no window for the compiler to
intervene.
Michael Ellerman [Mon, 27 Jun 2022 14:02:34 +0000 (00:02 +1000)]
selftests/powerpc/ptrace: Convert to load/store doubles
Some of the ptrace tests check the contents of floating pointer
registers. Currently these use float, which is always 4 bytes, but the
ptrace API supports saving/restoring 8 bytes per register, so switch to
using doubles to exercise the code more fully.
Michael Ellerman [Mon, 27 Jun 2022 14:02:31 +0000 (00:02 +1000)]
selftests/powerpc: Don't save TOC by default in asm helpers
Thare are some asm helpers for creating/popping stack frames in
basic_asm.h. They always save/restore r2 (TOC pointer), but none of the
selftests change r2, so it's unnecessary to save it by default.
Michael Ellerman [Mon, 27 Jun 2022 14:02:30 +0000 (00:02 +1000)]
selftests/powerpc: Don't save CR by default in asm helpers
Thare are some asm helpers for creating/popping stack frames in
basic_asm.h. They always save/restore CR, but none of the selftests
tests touch non-volatile CR fields, so it's unnecessary to save them by
default.
Michael Ellerman [Mon, 27 Jun 2022 14:02:29 +0000 (00:02 +1000)]
selftests/powerpc/ptrace: Split CFLAGS better
Currently all ptrace tests are built 64-bit and with TM enabled.
Only the TM tests need TM enabled, so split those out into a separate
variable so that can be specified precisely.
Split the rest of the tests into a variable, and add -m64 to CFLAGS for
those tests, so that in a subsequent patch some tests can be made to
build 32-bit.
Michael Ellerman [Mon, 27 Jun 2022 14:02:28 +0000 (00:02 +1000)]
selftests/powerpc/ptrace: Set LOCAL_HDRS
Set LOCAL_HDRS so header changes cause rebuilds. The lib.mk logic adds
all the headers in LOCAL_HDRS as dependencies, so there's no need to
also list them explicitly.
Michael Ellerman [Mon, 18 Jul 2022 09:51:58 +0000 (19:51 +1000)]
powerpc: Fix all occurences of duplicate words
Since commit 4584e27870a9 ("powerpc: Fix all occurences of "the the"")
fixed "the the", there's now a steady stream of patches fixing other
duplicate words.
Just fix them all at once, to save the overhead of dealing with
individual patches for each case.
This leaves a few cases of "that that", which in some contexts is
correct.
Ning Qiang [Wed, 13 Jul 2022 15:37:34 +0000 (23:37 +0800)]
macintosh/adb: fix oob read in do_adb_query() function
In do_adb_query() function of drivers/macintosh/adb.c, req->data is copied
form userland. The parameter "req->data[2]" is missing check, the array
size of adb_handler[] is 16, so adb_handler[req->data[2]].original_address and
adb_handler[req->data[2]].handler_id will lead to oob read.
Cc: stable <stable@kernel.org> Signed-off-by: Ning Qiang <sohu0106@126.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220713153734.2248-1-sohu0106@126.com
Scott Cheloha [Wed, 13 Jul 2022 20:23:35 +0000 (15:23 -0500)]
watchdog/pseries-wdt: initial support for H_WATCHDOG-based watchdog timers
PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits
guest control of one or more virtual watchdog timers. The timers have
millisecond granularity. The guest is terminated when a timer
expires.
This patch adds a watchdog driver for these timers, "pseries-wdt".
pseries_wdt_probe() currently assumes the existence of only one
platform device and always assigns it watchdogNumber 1. If we ever
expose more than one timer to userspace we will need to devise a way
to assign a distinct watchdogNumber to each platform device at device
registration time.
Scott Cheloha [Wed, 13 Jul 2022 20:23:34 +0000 (15:23 -0500)]
powerpc/pseries: register pseries-wdt device with platform bus
PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits
guest control of one or more virtual watchdog timers.
These timers do not conform to PowerPC device conventions. They are
not affixed to any extant bus, nor do they have full representation in
the device tree.
As a workaround we represent them as platform devices.
This patch registers a single platform device, "pseries-wdt", with the
platform bus if the FW_FEATURE_WATCHDOG flag is set.
A driver for this device, "pseries-wdt", will be introduced in a
subsequent patch.
Scott Cheloha [Wed, 13 Jul 2022 20:23:33 +0000 (15:23 -0500)]
powerpc/pseries: add FW_FEATURE_WATCHDOG flag
PAPR v2.12 specifies a new optional function set, "hcall-watchdog",
for the /rtas/ibm,hypertas-functions property. The presence of this
function set indicates support for the H_WATCHDOG hypercall.
Check for this function set and, if present, set the new
FW_FEATURE_WATCHDOG flag.
Michael Ellerman [Mon, 18 Jul 2022 13:44:18 +0000 (23:44 +1000)]
powerpc/64s: Disable stack variable initialisation for prom_init
With GCC 12 allmodconfig prom_init fails to build:
Error: External symbol 'memset' referenced from prom_init.c
make[2]: *** [arch/powerpc/kernel/Makefile:204: arch/powerpc/kernel/prom_init_check] Error 1
The allmodconfig build enables KASAN, so all calls to memset in
prom_init should be converted to __memset by the #ifdefs in
asm/string.h, because prom_init must use the non-KASAN instrumented
versions.
The build failure happens because there's a call to memset that hasn't
been caught by the pre-processor and converted to __memset. Typically
that's because it's a memset generated by the compiler itself, and that
is the case here.
With GCC 12, allmodconfig enables CONFIG_INIT_STACK_ALL_PATTERN, which
causes the compiler to emit memset calls to initialise on-stack
variables with a pattern.
Because prom_init is non-user-facing boot-time only code, as a
workaround just disable stack variable initialisation to unbreak the
build.
Uwe Kleine-König [Sun, 12 Jun 2022 21:34:00 +0000 (23:34 +0200)]
powerpc/52xx: Mark gpt driver as not removable
Returning an error code (here -EBUSY) from a remove callback doesn't
prevent the driver from being unloaded. The only effect is that an error
message is emitted and the driver is removed anyhow.
So instead drop the remove function (which is equivalent to returning zero)
and set the suppress_bind_attrs property to make it impossible to unload
the driver via sysfs.
This is a preparation for making platform remove callbacks return void.
Athira Rajeev [Fri, 20 May 2022 08:46:30 +0000 (14:16 +0530)]
docs: ABI: sysfs-bus-event_source-devices: Document sysfs caps entry for PMU
Details is added about "caps" attribute group in the ABI documentation.
This is used to expose some of the PMU attributes in "caps"
directory under : /sys/bus/event_source/devices/<dev>/. The dev/caps
will contain information about features that platform specific PMU
supports.
Athira Rajeev [Fri, 20 May 2022 08:46:29 +0000 (14:16 +0530)]
powerpc/perf: Add support for caps under sysfs in powerpc
Add caps support under "/sys/bus/event_source/devices/<pmu>/"
for powerpc. This directory can be used to expose some of the
specific features that powerpc PMU supports to the user.
Example: pmu_name. The name of PMU registered will depend on
platform, say power9 or power10 or it could be Generic Compat
PMU.
Currently the only way to know which is the registered
PMU is from the dmesg logs. But clearing the dmesg will make it
difficult to know exact PMU backend used. And even extracting
from dmesg will be complicated, as we need to parse the dmesg
logs and add filters for pmu name. Whereas by exposing it via
caps will make it easy as we just need to directly read it from
the sysfs.
Add a caps directory to /sys/bus/event_source/devices/cpu/
for power8, power9, power10 and generic compat PMU in respective
PMU driver code. Update the pmu_name file under caps folder
in core-book3s using "attr_update".
The information exposed currently:
- pmu_name : Underlying PMU name from the driver
Example result with power9 pmu:
# ls /sys/bus/event_source/devices/cpu/caps
pmu_name
Merge our fixes branch. In particular this brings in commit 2f2e917c1ba1 ("powerpc/book3e: Fix PUD allocation size in
map_kernel_page()") which fixes a build failure in next, because commit 96bb43d3c142 ("powerpc/64e: Rewrite p4d_populate() as a static inline
function") depends on it.
powerpc/powernv: delay rng platform device creation until later in boot
The platform device for the rng must be created much later in boot.
Otherwise it tries to connect to a parent that doesn't yet exist,
resulting in this splat:
This patch fixes the issue by doing the platform device creation inside
of machine_subsys_initcall.
Fixes: 555c3a79e2b8 ("powerpc/powernv: wire up rng during setup_arch") Cc: stable@vger.kernel.org Reported-by: Sachin Sant <sachinp@linux.ibm.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Tested-by: Sachin Sant <sachinp@linux.ibm.com>
[mpe: Change "of node" to "platform device" in change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220630121654.1939181-1-Jason@zx2c4.com
powerpc/memhotplug: Add add_pages override for PPC
With commit 5952aaa7a933 ("powerpc: Fix virt_addr_valid() for 64-bit Book3E & 32-bit")
the kernel now validate the addr against high_memory value. This results
in the below BUG_ON with dax pfns.
The fix is to make sure we update high_memory on memory hotplug.
This is similar to what x86 does in commit e6d361888aa0 ("mm/memory_hotplug: introduce add_pages")
Fixes: 5952aaa7a933 ("powerpc: Fix virt_addr_valid() for 64-bit Book3E & 32-bit") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220629050925.31447-1-aneesh.kumar@linux.ibm.com
Naveen N. Rao [Mon, 27 Jun 2022 19:11:19 +0000 (00:41 +0530)]
powerpc/bpf: Fix use of user_pt_regs in uapi
Trying to build a .c file that includes <linux/bpf_perf_event.h>:
$ cat test_bpf_headers.c
#include <linux/bpf_perf_event.h>
throws the below error:
/usr/include/linux/bpf_perf_event.h:14:28: error: field ‘regs’ has incomplete type
14 | bpf_user_pt_regs_t regs;
| ^~~~
This is because we typedef bpf_user_pt_regs_t to 'struct user_pt_regs'
in arch/powerpc/include/uaps/asm/bpf_perf_event.h, but 'struct
user_pt_regs' is not exposed to userspace.
Powerpc has both pt_regs and user_pt_regs structures. However, unlike
arm64 and s390, we expose user_pt_regs to userspace as just 'pt_regs'.
As such, we should typedef bpf_user_pt_regs_t to 'struct pt_regs' for
userspace.
Within the kernel though, we want to typedef bpf_user_pt_regs_t to
'struct user_pt_regs'.
Remove arch/powerpc/include/uapi/asm/bpf_perf_event.h so that the
uapi/asm-generic version of the header is exposed to userspace.
Introduce arch/powerpc/include/asm/bpf_perf_event.h so that we can
typedef bpf_user_pt_regs_t to 'struct user_pt_regs' for use within the
kernel.
Note that this was not showing up with the bpf selftest build since
tools/include/uapi/asm/bpf_perf_event.h didn't include the powerpc
variant.
Fixes: f17a2af27d08b0 ("powerpc/bpf: Fix broken uapi for BPF_PROG_TYPE_PERF_EVENT") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
[mpe: Use typical naming for header include guard] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220627191119.142867-1-naveen.n.rao@linux.vnet.ibm.com
Pali Rohár [Fri, 24 Jun 2022 08:55:50 +0000 (10:55 +0200)]
powerpc: dts: Add DTS file for CZ.NIC Turris 1.x routers
CZ.NIC Turris 1.0 and 1.1 are open source routers, they have dual-core
PowerPC Freescale P2020 CPU and are based on Freescale P2020RDB-PC-A board.
Hardware design is fully open source, all firmware and hardware design
files are available at Turris project website:
Juerg Haefliger [Fri, 20 May 2022 11:54:31 +0000 (13:54 +0200)]
KVM: PPC: Kconfig: Fix indentation
The convention for indentation seems to be a single tab. Help text is
further indented by an additional two whitespaces. Fix the lines that
violate these rules.
powerpc/perf: Update MMCR2 to support event exclude_idle
struct perf_event_attr supports exclude counting of idle task.
This is sent to kernel via perf_event_attr.exclude_idle and
in perf tool, user can use ":I" event modifier to enable this
for specific event.
Monitor Mode Control Register 2 (MMCR2) SPR has control bits
for each PMCs to freeze counting based on the Control Register
CTRL[RUN] state. CTRL[RUN] is not set when idle task is
running. Patch adds a check for event attr.exclude_idle to
set MMCR2[FCnWAIT] bit.
PowerVM has a stricter policy about allocating TCEs for LPARs and
often there is not enough TCEs for 1:1 mapping, this adds the supported
numbers into dev_info() to help analyzing bugreports.
KVM: PPC: Do not warn when userspace asked for too big TCE table
KVM manages emulated TCE tables for guest LIOBNs by a two level table
which maps up to 128TiB with 16MB IOMMU pages (enabled in QEMU by default)
and MAX_ORDER=11 (the kernel's default). Note that the last level of
the table is allocated when actual TCE is updated.
However these tables are created via ioctl() on kvmfd and the userspace
can trigger WARN_ON_ONCE_GFP(order >= MAX_ORDER, gfp) in mm/page_alloc.c
and flood dmesg.
Hari Bathini [Fri, 10 Jun 2022 15:55:52 +0000 (21:25 +0530)]
powerpc/bpf/32: Add instructions for atomic_[cmp]xchg
This adds two atomic opcodes BPF_XCHG and BPF_CMPXCHG on ppc32, both
of which include the BPF_FETCH flag. The kernel's atomic_cmpxchg
operation fundamentally has 3 operands, but we only have two register
fields. Therefore the operand we compare against (the kernel's API
calls it 'old') is hard-coded to be BPF_REG_R0. Also, kernel's
atomic_cmpxchg returns the previous value at dst_reg + off. JIT the
same for BPF too with return value put in BPF_REG_0.
Hari Bathini [Fri, 10 Jun 2022 15:55:50 +0000 (21:25 +0530)]
powerpc/bpf/64: Add instructions for atomic_[cmp]xchg
This adds two atomic opcodes BPF_XCHG and BPF_CMPXCHG on ppc64, both
of which include the BPF_FETCH flag. The kernel's atomic_cmpxchg
operation fundamentally has 3 operands, but we only have two register
fields. Therefore the operand we compare against (the kernel's API
calls it 'old') is hard-coded to be BPF_REG_R0. Also, kernel's
atomic_cmpxchg returns the previous value at dst_reg + off. JIT the
same for BPF too with return value put in BPF_REG_0.
Laurent Dufour [Mon, 23 May 2022 16:43:53 +0000 (18:43 +0200)]
powerpc/64s: Don't read H_BLOCK_REMOVE characteristics in radix mode
There is no need to read the H_BLOCK_REMOVE characteristics when running in
Radix mode because this hcall is never called.
Furthermore since the commit f10b804e100f ("powerpc/64s: Move hash MMU
support code under CONFIG_PPC_64S_HASH_MMU") define
pseries_lpar_read_hblkrm_characteristics as un empty function if
CONFIG_PPC_64S_HASH_MMU is not set, the #ifdef block can be removed.
Michael Ellerman [Tue, 31 May 2022 06:59:36 +0000 (16:59 +1000)]
powerpc/64: Drop ppc_inst_as_str()
The ppc_inst_as_str() macro tries to make printing variable length,
aka "prefixed", instructions convenient. It mostly succeeds, but it does
hide an on-stack buffer, which triggers stack protector.
More problematically it doesn't compile at all with GCC 12,
with -Wdangling-pointer, due to the fact that it returns the char buffer
declared inside the macro:
arch/powerpc/kernel/trace/ftrace.c: In function '__ftrace_modify_call':
./include/linux/printk.h:475:44: error: using a dangling pointer to '__str' [-Werror=dangling-pointer=]
475 | #define printk(fmt, ...) printk_index_wrap(_printk, fmt, ##__VA_ARGS__)
...
arch/powerpc/kernel/trace/ftrace.c:567:17: note: in expansion of macro 'pr_err'
567 | pr_err("Not expected bl: opcode is %s\n", ppc_inst_as_str(op));
| ^~~~~~
./arch/powerpc/include/asm/inst.h:156:14: note: '__str' declared here
156 | char __str[PPC_INST_STR_LEN]; \
| ^~~~~
This could be fixed by having the caller declare the buffer, but in some
places there'd need to be two buffers. In all cases where
ppc_inst_as_str() is used the output is not really meant for user
consumption, it's almost always indicative of a kernel bug.
A simpler solution is to just print the value as an unsigned long. For
normal instructions the output is identical. For prefixed instructions
the value is printed as a single 64-bit quantity, whereas previously the
low half was printed first. But that is good enough for debug output,
especially as prefixed instructions will be rare in kernel code in
practice.
Fabiano Rosas [Fri, 24 Jun 2022 14:27:12 +0000 (11:27 -0300)]
KVM: PPC: Align pt_regs in kvm_vcpu_arch structure
The H_ENTER_NESTED hypercall receives as second parameter the address
of a region of memory containing the values for the nested guest
privileged registers. We currently use the pt_regs structure contained
within kvm_vcpu_arch for that end.
Most hypercalls that receive a memory address expect that region to
not cross a 4K page boundary. We would want H_ENTER_NESTED to follow
the same pattern so this patch ensures the pt_regs structure sits
within a page.
Note: the pt_regs structure is currently 384 bytes in size, so
aligning to 512 is sufficient to ensure it will not cross a 4K page
and avoids punching too big a hole in struct kvm_vcpu_arch.
Fabiano Rosas [Wed, 25 May 2022 13:05:54 +0000 (10:05 -0300)]
KVM: PPC: Book3S HV: Provide more detailed timings for P9 entry path
Alter the data collection points for the debug timing code in the P9
path to be more in line with what the code does. The points where we
accumulate time are now the following:
vcpu_entry: From vcpu_run_hv entry until the start of the inner loop;
guest_entry: From the start of the inner loop until the guest entry in
asm;
in_guest: From the guest entry in asm until the return to KVM C code;
guest_exit: From the return into KVM C code until the corresponding
hypercall/page fault handling or re-entry into the guest;
hypercall: Time spent handling hcalls in the kernel (hcalls can go to
QEMU, not accounted here);
page_fault: Time spent handling page faults;
vcpu_exit: vcpu_run_hv exit (almost no code here currently).
Like before, these are exposed in debugfs in a file called
"timings". There are four values:
- number of occurrences of the accumulation point;
- total time the vcpu spent in the phase in ns;
- shortest time the vcpu spent in the phase in ns;
- longest time the vcpu spent in the phase in ns;
Fabiano Rosas [Wed, 25 May 2022 13:05:52 +0000 (10:05 -0300)]
KVM: PPC: Book3S HV: Decouple the debug timing from the P8 entry path
We are currently doing the timing for debug purposes of the P9 entry
path using the accumulators and terminology defined by the old entry
path for P8 machines.
Not only the "real-mode" and "napping" mentions are out of place for
the P9 Radix entry path but also we cannot change them because the
timing code is coupled to the structures defined in struct
kvm_vcpu_arch.
Add a new CONFIG_KVM_BOOK3S_HV_P9_TIMING to enable the timing code for
the P9 entry path. For now, just add the new CONFIG and duplicate the
structures. A subsequent patch will add the P9 changes.
The "rm_exit" is always showing zero because it is the last one and
end_timing does not increment the counter of the previous entry.
We can fix it by calling accumulate_time again instead of
end_timing. That way the counter gets incremented. The rest of the
arithmetic can be ignored because there are no timing points after
this and the accumulators are reset before the next round.
Christophe Leroy [Tue, 28 Jun 2022 14:48:59 +0000 (16:48 +0200)]
powerpc/64e: KASAN Full support for BOOK3E/64
We now have memory organised in a way that allows
implementing KASAN.
Unlike book3s/64, book3e always has translation active so the only
thing needed to use KASAN is to setup an early zero shadow mapping
just after setting a stack pointer and before calling early_setup().
The memory layout is now as follows
+------------------------+ Kernel virtual map end (0xc000200000000000)
| |
| 16TB of KASAN map |
| |
+------------------------+ Kernel KASAN shadow map start
| |
| 16TB of IO map |
| |
+------------------------+ Kernel IO map start
| |
| 16TB of vmemmap |
| |
+------------------------+ Kernel vmemmap start
| |
| 16TB of vmap |
| |
+------------------------+ Kernel virt start (0xc000100000000000)
| |
| 64TB of linear mem |
| |
+------------------------+ Kernel linear (0xc.....)
Christophe Leroy [Tue, 28 Jun 2022 14:48:57 +0000 (16:48 +0200)]
powerpc/64e: Move virtual memory closer to linear memory
Today nohash/64 have linear memory based at 0xc000000000000000 and
virtual memory based at 0x8000000000000000.
In order to implement KASAN, we need to regroup both areas.
Move virtual memmory at 0xc000100000000000.
This complicates a bit TLB miss handlers. Until now, memory region
was easily identified with the 4 higher bits of address:
- 0 ==> User
- c ==> Linear Memory
- 8 ==> Virtual Memory
Now we need to rely on the 20 higher bits, with:
- 0xxxx ==> User
- c0000 ==> Linear Memory
- c0001 ==> Virtual Memory
Christophe Leroy [Tue, 28 Jun 2022 14:48:55 +0000 (16:48 +0200)]
powerpc/64e: Remove MMU_FTR_USE_TLBRSRV and MMU_FTR_USE_PAIRED_MAS
Commit a28b266b4168 ("powerpc: Remove platforms/wsp and associated
pieces") removed the last CPU having features MMU_FTRS_A2 and
commit abe841db5849 ("powerpc: Clean up MMU_FTRS_A2 and
MMU_FTR_TYPE_3E") removed MMU_FTRS_A2 which was the last user of
MMU_FTR_USE_TLBRSRV and MMU_FTR_USE_PAIRED_MAS.
Remove all code that relies on MMU_FTR_USE_TLBRSRV and
MMU_FTR_USE_PAIRED_MAS.
With this change done, TLB miss can happen before the mmu feature
fixups.
Christophe Leroy [Tue, 28 Jun 2022 14:48:54 +0000 (16:48 +0200)]
powerpc/64e: Fix early TLB miss with KUAP
With KUAP, the TLB miss handler bails out when an access to user
memory is performed with a nul TID.
But the normal TLB miss routine which is only used early during boot
does the check regardless for all memory areas, not only user memory.
By chance there is no early IO or vmalloc access, but when KASAN
come we will start having early TLB misses.
Fix it by creating a special branch for user accesses similar to the
one in the 'bolted' TLB miss handlers. Unfortunately SPRN_MAS1 is
now read too early and there are no registers available to preserve
it so it will be read a second time.
Christophe Leroy [Tue, 28 Jun 2022 14:43:35 +0000 (16:43 +0200)]
powerpc/ptdump: Fix display of RW pages on FSL_BOOK3E
On FSL_BOOK3E, _PAGE_RW is defined with two bits, one for user and one
for supervisor. As soon as one of the two bits is set, the page has
to be display as RW. But the way it is implemented today requires both
bits to be set in order to display it as RW.
Instead of display RW when _PAGE_RW bits are set and R otherwise,
reverse the logic and display R when _PAGE_RW bits are all 0 and
RW otherwise.
This change has no impact on other platforms as _PAGE_RW is a single
bit on all of them.
Christophe Leroy [Tue, 14 Jun 2022 10:32:24 +0000 (12:32 +0200)]
powerpc/32: Remove 'noltlbs' kernel parameter
Mapping without large TLBs has no added value on the 8xx.
Mapping without large TLBs is still necessary on 40x when
selecting CONFIG_KFENCE or CONFIG_DEBUG_PAGEALLOC or
CONFIG_STRICT_KERNEL_RWX, but this is done automatically
and doesn't require user selection.
Remove 'noltlbs' kernel parameter, the user has no reason
to use it.
powerpc/irq: Perform stack_overflow detection after switching to IRQ stack
When KASAN is enabled, as shown by the Oops below, the 2k limit is not
enough to allow stack dump after a stack overflow detection when
CONFIG_DEBUG_STACKOVERFLOW is selected:
As the detection is asynchronously performed at IRQs, we could be long
after the limit has been crossed, so increasing the limit would not
solve the problem completely.
In order to be sure that there is enough stack space for the stack
dump, do it after the switch to the IRQ stack. That way it is sure
that the stack is large enough, unless the IRQ stack has been
overfilled in which case the end of life is close.
powerpc/irq: Increase stack_overflow detection limit when KASAN is enabled
When KASAN is enabled, as shown by the Oops below, the 2k limit is not
enough to allow stack dump after a stack overflow detection when
CONFIG_DEBUG_STACKOVERFLOW is selected:
An investigation shows that on PPC32, calling dump_stack() requires
more than 1k when KASAN is not selected and a bit more than 2k bytes
when KASAN is selected.
On PPC64 the registers are twice the size of PPC32 registers, so the
need should be approximately twice the need on PPC32.
In the meantime we have THREAD_SIZE which is twice larger on PPC64
than PPC32, and twice larger when KASAN is selected.
So we can easily use the value of THREAD_SIZE to set the limit.
On PPC32, THREAD_SIZE is 8k without KASAN and 16k with KASAN.
On PPC64, THREAD_SIZE is 16k without KASAN.
To be on the safe side, leave 2k on PPC32 without KASAN, 4k with
KASAN, and 4k on PPC64 without KASAN. It means the limit should be
one fourth of THREAD_SIZE.
Kajol Jain [Fri, 10 Jun 2022 13:41:13 +0000 (19:11 +0530)]
selftests/powerpc/pmu: Add test for hardware cache events
The testcase checks if the transalation of a generic hardware cache
event is done properly via perf interface. The hardware cache events has
type as PERF_TYPE_HW_CACHE and each event points to raw event code id.
Testcase checks different combination of cache level, cache event
operation type and cache event result type and verify for a given event
code, whether transalation matches with the current cache event mappings
via perf interface.
Kajol Jain [Fri, 10 Jun 2022 13:41:12 +0000 (19:11 +0530)]
selftests/powerpc/pmu: Add selftest for group constraint check for MMCRA thresh_sel field
Thresh select bits in the event code is used to program thresh_sel field
in Monitor Mode Control Register A (MMCRA: 45-47). When scheduling
events as a group, all events in that group should match value in these
bits. Otherwise event open for the sibling events will fail.
Testcase uses event code PM_MRK_INST_CMPL (0x401e0) as leader and
another event PM_THRESH_MET (0x101ec) as sibling event, and checks if
group constraint checks for thresh_sel field added correctly via perf
interface.
Kajol Jain [Fri, 10 Jun 2022 13:41:11 +0000 (19:11 +0530)]
selftests/powerpc/pmu: Add selftest for group constraint check for MMCRA thresh_ctl field
Thresh control bits in the event code is used to program thresh_ctl
field in Monitor Mode Control Register A (MMCRA: 48-55). When scheduling
events as a group, all events in that group should match value in these
bits. Otherwise event open for the sibling events will fail.
Testcase uses event code PM_MRK_INST_CMPL (0x401e0) as leader and
another event PM_THRESH_MET (101ec) as sibling event, and checks if
group constraint checks for thresh_ctl field added correctly via perf
interface.
Kajol Jain [Fri, 10 Jun 2022 13:41:10 +0000 (19:11 +0530)]
selftests/powerpc/pmu: Add selftest for group constraint for unit and pmc field in p9
Unit and pmu bits in the event code is used to program unit and pmc
fields in Monitor Mode Control Register 1 (MMCR1). For power9 platform,
incase unit field value is within 6 to 9, one of the event in the group
should use PMC4. Otherwise event_open should fail for that group.
Kajol Jain [Fri, 10 Jun 2022 13:41:09 +0000 (19:11 +0530)]
selftests/powerpc/pmu: Add selftest for group constraint check for MMCRA thresh_cmp field
Thresh compare bits for a event is used to program thresh compare field
in Monitor Mode Control Register A (MMCRA: 9-18 bits for power9 and
MMCRA: 8-18 bits for power10). When scheduling events as a group, all
events in that group should match value in thresh compare bits.
Otherwise event open for the sibling events will fail.
Testcase uses event code "0x401e0" as leader and another event "0x101ec"
as sibling event, and checks for thresh compare constraint via perf
interface.
Kajol Jain [Fri, 10 Jun 2022 13:41:08 +0000 (19:11 +0530)]
selftests/powerpc/pmu: Add selftest for group constraint check for MMCR1 cache bits
Data and instruction cache qualifier bits in the event code is used to
program cache select field in Monitor Mode Control Register 1 (MMCR1:
16-17). When scheduling events as a group, all events in that group
should match value in these bits. Otherwise event open for the sibling
events will fail.
Testcase uses event code "0x1100fc" as leader and other events like
"0x23e054" and "0x13e054" as sibling events to checks for l1 cache
select field constraints via perf interface.
Kajol Jain [Fri, 10 Jun 2022 13:41:07 +0000 (19:11 +0530)]
selftests/powerpc/pmu: Add selftest for group constraint check for MMCR0 l2l3_sel bits
In power10, L2L3 select bits in the event code is used to program
l2l3_sel field in Monitor Mode Control Register 0 (MMCR0: 56-60). When
scheduling events as a group, all events in that group should match
value in these bits. Otherwise event open for the sibling events will
fail.
Testcase uses event code "0x010000046080" as leader and another events
"0x26880" and "0x010000026880" as sibling events, and checks for
l2l3_sel constraints via perf interface for ISA v3.1 platform.
Athira Rajeev [Fri, 10 Jun 2022 13:41:06 +0000 (19:11 +0530)]
selftests/powerpc/pmu: Add selftest for PERF_TYPE_HARDWARE events valid check
Testcase to ensure that using invalid event in generic event for
PERF_TYPE_HARDWARE will fail. Invalid generic events in power10 are:
- PERF_COUNT_HW_BUS_CYCLES
- PERF_COUNT_HW_STALLED_CYCLES_FRONTEND
- PERF_COUNT_HW_STALLED_CYCLES_BACKEND
- PERF_COUNT_HW_REF_CPU_CYCLES
Invalid generic events in power9 are:
- PERF_COUNT_HW_BUS_CYCLES
- PERF_COUNT_HW_REF_CPU_CYCLES
Testcase does event open for valid and invalid generic events to ensure
event open works for all valid events and fails for invalid events.
Athira Rajeev [Fri, 10 Jun 2022 13:41:05 +0000 (19:11 +0530)]
selftests/powerpc/pmu: Add selftest for event alternatives for power10
Platform specific PMU supports alternative event for some of the event
codes. During perf_event_open, it any event group doesn't match
constraint check criteria, further lookup is done to find alternative
event. Code checks to see if it is possible to schedule event as group
using alternative events.
Testcase exercises the alternative event find code for power10. Example,
Using PMC1 to PMC4 in a group and again trying to schedule
PM_CYC_ALT (0x0001e) will fail since this exceeds number of programmable
events in group. But since 0x600f4 is an alternative event for 0x0001e,
it is possible to use 0x0001e in the group. Testcase uses such
combination all events in power10 which has alternative event.
Athira Rajeev [Fri, 10 Jun 2022 13:41:04 +0000 (19:11 +0530)]
selftests/powerpc/pmu: Add selftest for event alternatives for power9
Platform specific PMU supports alternative event for some of the event
codes. During perf_event_open, it any event group doesn't match
constraint check criteria, further lookup is done to find alternative
event. Code checks to see if it is possible to schedule event as group
using alternative events.
Testcase exercises the alternative event find code for power9. Example,
since events in same PMC can't go in as a group, ideally using
PM_RUN_CYC_ALT (0x200f4) and PM_BR_TAKEN_CMPL (0x200fa) will fail. But
since RUN_CYC (0x600f4) is alternative event for 0x200f4, it is possible
to use 0x600f4 and 0x200fa as group. Testcase uses such combination for
all events in power9 which has an alternative event.