powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
At the moment the userspace tool is expected to request pinning of
the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
When the userspace process finishes, all the pinned pages need to
be put; this is done as a part of the userspace memory context (MM)
destruction which happens on the very last mmdrop().
This approach has a problem that a MM of the userspace process
may live longer than the userspace process itself as kernel threads
use userspace process MMs which was runnning on a CPU where
the kernel thread was scheduled to. If this happened, the MM remains
referenced until this exact kernel thread wakes up again
and releases the very last reference to the MM, on an idle system this
can take even hours.
This moves preregistered regions tracking from MM to VFIO; insteads of
using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
added so each container releases regions which it has pre-registered.
This changes the userspace interface to return EBUSY if a memory
region is already registered in a container. However it should not
have any practical effect as the only userspace tool available now
does register memory region once per container anyway.
As tce_iommu_register_pages/tce_iommu_unregister_pages are called
under container->lock, this does not need additional locking.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In some situations the userspace memory context may live longer than
the userspace process itself so if we need to do proper memory context
cleanup, we better have tce_container take a reference to mm_struct and
use it later when the process is gone (@current or @current->mm is NULL).
This references mm and stores the pointer in the container; this is done
in a new helper - tce_iommu_mm_set() - when one of the following happens:
- a container is enabled (IOMMU v1);
- a first attempt to pre-register memory is made (IOMMU v2);
- a DMA window is created (IOMMU v2).
The @mm stays referenced till the container is destroyed.
This replaces current->mm with container->mm everywhere except debug
prints.
This adds a check that current->mm is the same as the one stored in
the container to prevent userspace from making changes to a memory
context of other processes.
DMA map/unmap ioctls() do not check for @mm as they already check
for @enabled which is set after tce_iommu_mm_set() is called.
This does not reference a task as multiple threads within the same mm
are allowed to ioctl() to vfio and supposedly they will have same limits
and capabilities and if they do not, we'll just fail with no harm made.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We are going to allow the userspace to configure container in
one memory context and pass container fd to another so
we are postponing memory allocations accounted against
the locked memory limit. One of previous patches took care of
it_userspace.
At the moment we create the default DMA window when the first group is
attached to a container; this is done for the userspace which is not
DDW-aware but familiar with the SPAPR TCE IOMMU v2 in the part of memory
pre-registration - such client expects the default DMA window to exist.
This postpones the default DMA window allocation till one of
the folliwing happens:
1. first map/unmap request arrives;
2. new window is requested;
This adds noop for the case when the userspace requested removal
of the default window which has not been created yet.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
vfio/spapr: Add a helper to create default DMA window
There is already a helper to create a DMA window which does allocate
a table and programs it to the IOMMU group. However
tce_iommu_take_ownership_ddw() did not use it and did these 2 calls
itself to simplify error path.
Since we are going to delay the default window creation till
the default window is accessed/removed or new window is added,
we need a helper to create a default window from all these cases.
This adds tce_iommu_create_default_window(). Since it relies on
a VFIO container to have at least one IOMMU group (for future use),
this changes tce_iommu_attach_group() to add a group to the container
first and then call the new helper.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
vfio/spapr: Postpone allocation of userspace version of TCE table
The iommu_table struct manages a hardware TCE table and a vmalloc'd
table with corresponding userspace addresses. Both are allocated when
the default DMA window is created and this happens when the very first
group is attached to a container.
As we are going to allow the userspace to configure container in one
memory context and pas container fd to another, we have to postpones
such allocations till a container fd is passed to the destination
user process so we would account locked memory limit against the actual
container user constrainsts.
This postpones the it_userspace array allocation till it is used first
time for mapping. The unmapping patch already checks if the array is
allocated.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/iommu: Stop using @current in mm_iommu_xxx
This changes mm_iommu_xxx helpers to take mm_struct as a parameter
instead of getting it from @current which in some situations may
not have a valid reference to mm.
This changes helpers to receive @mm and moves all references to @current
to the caller, including checks for !current and !current->mm;
checks in mm_iommu_preregistered() are removed as there is no caller
yet.
This moves the mm_iommu_adjust_locked_vm() call to the caller as
it receives mm_iommu_table_group_mem_t but it needs mm.
This should cause no behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/iommu: Pass mm_struct to init/cleanup helpers
We are going to get rid of @current references in mmu_context_boos3s64.c
and cache mm_struct in the VFIO container. Since mm_context_t does not
have reference counting, we will be using mm_struct which does have
the reference counter.
This changes mm_iommu_init/mm_iommu_cleanup to receive mm_struct rather
than mm_context_t (which is embedded into mm).
This should not cause any behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Tue, 15 Nov 2016 10:59:38 +0000 (21:59 +1100)]
powerpc/64: Define ILLEGAL_POINTER_VALUE for 64-bit
This is used in poison.h to offset poison values so that they don't
point directly into user space.
The value we choose sits roughly between user and kernel space, which
means on their own the poison values don't point anywhere useful. If an
attacker can cause an access at some offset from the poison value then
we may still be in trouble, but by putting the poison values between
user and kernel space we maximise the required size of that offset.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Balbir Singh [Wed, 30 Nov 2016 06:45:09 +0000 (17:45 +1100)]
powerpc Don't print misleading facility name in facility unavailable exception
The current facility_strings[] are correct when the trap address is
0xf80 (hypervisor facility unavailable). When the trap address is
0xf60 (facility unavailable) IC (Interruption Cause) a.k.a status in the
code is undefined for values 0 and 1.
Add a check to prevent printing the (misleading) facility name for IC 0
and 1 when we came in via 0xf60. In all cases, print the actual IC
value, to avoid any confusion.
This hasn't been seen on real hardware, on only qemu which was
misreporting an exception.
SPU_FS selects MEMORY_HOTPLUG, which is problematic because
MEMORY_HOTPLUG is user selectable, meaning we can end up with a broken
.config where MEMORY_HOTPLUG is enabled but its dependencies are not,
leading to build breakages.
The select of MEMORY_HOTPLUG for SPU_FS was added back in 2006, in
commit cc595bb8d0d1 ("[POWERPC] spufs: fix memory hotplug dependency").
However we reworked the spufs code and removed the dependency on memory
hotplug in 2007 in commit f846e054a754 ("[POWERPC] spufs: remove need
for struct page for SPEs").
So drop the select as it's no longer needed and causes problems.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nathan Fontenot [Mon, 28 Nov 2016 16:50:45 +0000 (11:50 -0500)]
powerpc/pseries: Use lmb_is_removable() to check removability
We should be using lmb_is_removable() to validate that enough LMBs
are available to remove when doing a remove by count. This will check
that the LMB is owned by the system and it is considered removable.
This patch also adds a pr_info() notification to report the LMB count
to remove was not satisfied.
What we do now is just check that there are enough LMBs owned by the
system when validating there are enough LMBs to remove. This can
lead to situations where there are enough LMBs owned by the system
but not enough that are considered removable. This results in having
to bail out of the remove operation instead of just failing the request
that we should have known wouldn't succeed.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Wed, 30 Nov 2016 08:41:02 +0000 (19:41 +1100)]
powerpc/mm: Fix page table dump build on non-Book3S
In the recent commit c200ee447bce ("powerpc/mm: Dump hash table") we
added code to dump the hage page table. Currently this can be selected
to build on any platform. However it breaks the build if we're building
for a non-Book3S platform, because none of the hash page table related
defines and so on exist. So restrict it to building only on Book3S.
Similarly in commit 0796cc493713 ("powerpc/mm: Dump linux pagetables")
we added code to dump the Linux page tables, which uses some constants
which are only defined on Book3S - so guard those with an #ifdef.
Fixes: c200ee447bce ("powerpc/mm: Dump hash table") Fixes: 0796cc493713 ("powerpc/mm: Dump linux pagetables") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Geoff Levand [Tue, 29 Nov 2016 18:47:32 +0000 (10:47 -0800)]
powerpc/ps3: Fix system hang with GCC 5 builds
GCC 5 generates different code for this bootwrapper null check that
causes the PS3 to hang very early in its bootup. This check is of
limited value, so just get rid of it.
Cc: stable@vger.kernel.org Signed-off-by: Geoff Levand <geoff@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Fri, 18 Nov 2016 12:15:42 +0000 (23:15 +1100)]
powerpc/prom: Switch to using structs for ibm_architecture_vec
Now that we've defined structures to describe each of the client
architecture vectors, we can use those to construct the value we pass to
firmware.
This avoids the tricks we previously played with the W() macro, allows
us to properly endian annotate fields, and should help to avoid bugs
introduced by failing to have the correct number of zero pad bytes
between fields.
It also means we can avoid hard coding IBM_ARCH_VEC_NRCORES_OFFSET in
order to update the max_cpus value and instead just set it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Fri, 18 Nov 2016 12:15:41 +0000 (23:15 +1100)]
powerpc/prom: Define structs for client architecture vectors
The "client architecture vectors" are a series of structures we pass to
firmware to define various things, such as what processors we support
and many other options.
Each structure is entirely different so we have to define a different
struct for each one, but that's OK.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Generate a system reset NMI on the threads indicated by target.
Values for target:
-1 = target all online threads including the caller
-2 = target all online threads except for the caller
All other negative values: reserved
Positive values: The thread to be targeted, obtained from the value
of the "ibm,ppc-interrupt-server#s" property of the CPU in the OF
device tree.
Semantics:
- Invalid target: return H_Parameter.
- Otherwise: Generate a system reset NMI on target thread(s),
return H_Success.
This will be used by crash/debug code to get stuck CPUs into a known
state.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds the support code needed for implementing
kexec_file_load() on powerpc.
This consists of functions to load the ELF kernel, either big or little
endian, and setup the purgatory enviroment which switches from the first
kernel to the second kernel.
None of this code is built yet, as it depends on CONFIG_KEXEC_FILE which
we have not yet defined. Although we could define CONFIG_KEXEC_FILE in
this patch, we'd then have a window in history where the kconfig symbol
is present but the syscall is not, which would be awkward.
powerpc: Change places using CONFIG_KEXEC to use CONFIG_KEXEC_CORE instead.
Commit 684a52948b25 ("kexec: split kexec_load syscall from kexec core
code") introduced CONFIG_KEXEC_CORE so that CONFIG_KEXEC means whether
the kexec_load system call should be compiled-in and CONFIG_KEXEC_FILE
means whether the kexec_file_load system call should be compiled-in.
These options can be set independently from each other.
Since until now powerpc only supported kexec_load, CONFIG_KEXEC and
CONFIG_KEXEC_CORE were synonyms. That is not the case anymore, so we
need to make a distinction. Almost all places where CONFIG_KEXEC was
being used should be using CONFIG_KEXEC_CORE instead, since
kexec_file_load also needs that code compiled in.
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
kexec_file: Factor out kexec_locate_mem_hole from kexec_add_buffer.
kexec_locate_mem_hole will be used by the PowerPC kexec_file_load
implementation to find free memory for the purgatory stack.
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Acked-by: Dave Young <dyoung@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
kexec_file: Change kexec_add_buffer to take kexec_buf as argument.
This is done to simplify the kexec_add_buffer argument list.
Adapt all callers to set up a kexec_buf to pass to kexec_add_buffer.
In addition, change the type of kexec_buf.buffer from char * to void *.
There is no particular reason for it to be a char *, and the change
allows us to get rid of 3 existing casts to char * in the code.
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Acked-by: Dave Young <dyoung@redhat.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
kexec_file: Allow arch-specific memory walking for kexec_add_buffer
Allow architectures to specify a different memory walking function for
kexec_add_buffer. x86 uses iomem to track reserved memory ranges, but
PowerPC uses the memblock subsystem.
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Acked-by: Dave Young <dyoung@redhat.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Balbir Singh [Wed, 30 Nov 2016 00:35:36 +0000 (11:35 +1100)]
powerpc/mm: Fix no execute fault handling on pre-POWER5
Aneesh/Ben reported that the change to do_page_fault() we made in commit 5851f5492d30 ("powerpc/mm: Detect instruction fetch denied and report")
needs to handle the case where CPU_FTR_COHERENT_ICACHE is missing but we
have CPU_FTR_NOEXECUTE. In those cases the check added for
SRR1_ISI_N_OR_G might trigger a false positive.
This patch adds a check for CPU_FTR_COHERENT_ICACHE in addition to the
MSR value.
Fixes: 5851f5492d30 ("powerpc/mm: Detect instruction fetch denied and report") Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Mon, 21 Nov 2016 10:14:35 +0000 (21:14 +1100)]
powerpc/boot: Fix rebuild when changing kernel endian
Now that we don't set ARCH incorrectly when calling the boot Makefile,
we can use the generic cpp_lds_S rule for converting our zImage.lds.S
into zImage.lds.
The main advantage of using the generic rule is that it correctly uses
if_changed, which means we correctly regenerate the linker script when
switching endian. Fixing that means we are finally able to build one
endian and then rebuild the other endian without requiring to clean
between builds.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Mon, 21 Nov 2016 10:14:33 +0000 (21:14 +1100)]
powerpc: Stop passing ARCH=ppc64 to boot Makefile
Back in 2005 when the ppc/ppc64 merge started, we used to build the
kernel code in arch/powerpc but use the boot code from arch/ppc or
arch/ppc64 depending on whether we were building for 32 or 64-bit.
Originally we called the boot Makefile passing ARCH=$(OLDARCH), where
OLDARCH was ppc or ppc64.
In commit ae9d1d121ea9 ("powerpc: Make building the boot image work for
both 32-bit and 64-bit") (2005-10-11) we split the call for 32/64-bit
using an ifeq check, because the two Makefiles took different targets,
and explicitly passed ARCH=ppc64 for the 64-bit case and ARCH=ppc for
the 32-bit case.
Then in commit 81f1d70266fa ("powerpc: Move ppc64 boot wrapper code over
to arch/powerpc") (2005-11-16) we moved the boot code into arch/powerpc
and dropped the ppc case, but kept passing ARCH=ppc64 to
arch/powerpc/boot/Makefile.
Since then there have been several more boot targets added, all of which
have copied the ARCH=ppc64 setting, such that now we have four targets
using it.
Currently it seems that nothing actually uses the ARCH value, but that's
basically just luck, and in particular it prevents us from using the
generic cpp_lds_S rule. It's also clearly wrong, ARCH=ppc64 is dead,
buried and cremated.
Fix it by dropping the setting of ARCH completely, the correct value is
exported by the top level Makefile.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: update radix__pte_update to not do full mm tlb flush
When we are updating a pte, we just need to flush the tlb mapping
that pte. Right now we do a full mm flush because we don't track page
size. Now that we have page size details in pte use that to do the
optimized flush
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: update radix__ptep_set_access_flag to not do full mm tlb flush
When we are updating a pte, we just need to flush the tlb mapping
that pte. Right now we do a full mm flush because we don't track the page
size. Now that we have page size details in pte use that to do the
optimized flush
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds a new software defined pte bit. We use the reserved
fields of ISA 3.0 pte definition since we will only be using this on DD1
code paths. We can possibly look at removing this code later.
The software bit will be used to differentiate between 64K/4K and 2M
ptes. This helps in finding the page size mapping by a pte so that we
can do efficient tlb flush.
We don't support 1G hugetlb pages yet. So we add a DEBUG WARN_ON to
catch wrong usage.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Wed, 23 Nov 2016 13:02:07 +0000 (00:02 +1100)]
powerpc/64e: Convert cmpi to cmpwi in head_64.S
From ac42336d509f ("powerpc: Convert cmp to cmpd in idle enter sequence"):
PowerPC's "cmp" instruction has four operands. Normally people write
"cmpw" or "cmpd" for the second cmp operand 0 or 1. But, frequently
people forget, and write "cmp" with just three operands.
With older binutils this is silently accepted as if this was "cmpw",
while often "cmpd" is wanted. With newer binutils GAS will complain
about this for 64-bit code. For 32-bit code it still silently assumes
"cmpw" is what is meant.
In this case, cmpwi is called for, so this is just a build fix for
new toolchains.
Cc: stable@vger.kernel.org # v3.0+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Balbir Singh [Tue, 15 Nov 2016 06:56:16 +0000 (17:56 +1100)]
powerpc/mm/radix: Prevent kernel execution of user space
ISA 3 defines new encoded access authority that allows instruction
access prevention in privileged mode and allows normal access
to problem state. This patch just enables IAMR (Instruction Authority
Mask Register), enabling AMR would require more work.
I've tested this with a buggy driver and a simple payload. The payload
is specific to the build I've tested.
mpe: Also tested with LKDTM:
# echo EXEC_USERSPACE > /sys/kernel/debug/provoke-crash/DIRECT
lkdtm: Performing direct entry EXEC_USERSPACE
lkdtm: attempting ok execution at c0000000005bf560
lkdtm: attempting bad execution at 00003fff8d940000
Unable to handle kernel paging request for instruction fetch
Faulting instruction address: 0x3fff8d940000
Oops: Kernel access of bad area, sig: 11 [#1]
NIP: 00003fff8d940000 LR: c0000000005bfa58 CTR: 00003fff8d940000
REGS: c0000000f1fcf900 TRAP: 0400 Not tainted (4.9.0-rc5-compiler_gcc-6.2.0-00109-g956dbc06232a)
MSR: 9000000010009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 48002222 XER: 00000000
...
Call Trace:
lkdtm_EXEC_USERSPACE+0x104/0x120 (unreliable)
lkdtm_do_action+0x3c/0x80
direct_entry+0x100/0x1b0
full_proxy_write+0x94/0x100
__vfs_write+0x3c/0x1b0
vfs_write+0xcc/0x230
SyS_write+0x60/0x110
system_call+0x38/0xfc
Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Balbir Singh [Tue, 15 Nov 2016 06:56:15 +0000 (17:56 +1100)]
powerpc/mm: Detect instruction fetch denied and report
ISA 3 allows for prevention of instruction fetch and execution
of user mode pages. If such an error occurs, SRR1 bit 35 reports the
error. We catch and report the error in do_page_fault().
Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powernv: Clear SPRN_PSSCR when a POWER9 CPU comes online
Ensure that PSSCR is set to a safe value corresponding to no
state-loss each time a POWER9 CPU comes online.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Acked-By: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/xmon: Add 'dt' command to dump trace buffers
There is a nice interface for asking ftrace to dump all its tracing
buffers. The only down side for use in xmon is that it uses printk.
Depending on circumstances printk may not work when in xmon, but it also
may, so add a 'dt' command which dumps the ftrace buffers, and add a
note to the help to mentiont that it uses printk.
Calling this routine also disables tracing, which is problematic if you
return from xmon and expect the system to keep operating normally. So
after we do the dump turn tracing back on.
Both functions already have nop versions defined for when ftrace is not
enabled, so we don't need any extra #ifdefs.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Andrew Donnellan [Tue, 22 Nov 2016 10:13:27 +0000 (21:13 +1100)]
cxl: Fix coccinelle warnings
Fix the following coccinelle warnings:
drivers/misc/cxl/debugfs.c:46:0-23: WARNING: fops_io_x64 should be
defined with DEFINE_DEBUGFS_ATTRIBUTE
drivers/misc/cxl/guest.c:890:5-26: WARNING: Comparison to bool
drivers/misc/cxl/irq.c:107:3-23: WARNING: Assignment of bool to 0/1
drivers/misc/cxl/native.c:57:2-3: Unneeded semicolon
drivers/misc/cxl/native.c:170:2-3: Unneeded semicolon
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Matthew R. Ochs <mrochs@linux.vnet.ibm.com> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Tue, 22 Nov 2016 10:49:32 +0000 (11:49 +0100)]
powerpc/32: Change the stack protector canary value per task
Partially copied from commit 8b499865a0929 ("ARM: stack protector:
change the canary value per task")
A new random value for the canary is stored in the task struct whenever
a new task is forked. This is meant to allow for different canary values
per task. On powerpc, GCC expects the canary value to be found in a global
variable called __stack_chk_guard. So this variable has to be updated
with the value stored in the task struct whenever a task switch occurs.
Because the variable GCC expects is global, this cannot work on SMP
unfortunately. So, on SMP, the same initial canary value is kept
throughout, making this feature a bit less effective although it is still
useful.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This is the very basic stuff without the changing canary upon
task switch yet. Just the Kconfig option and a constant canary
value initialized at boot time.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/pseries/ibmebus: Remove legacy suspend/resume support
There are no ibmebus driver that make use of legacy suspend/resume. This
patch removes the support for it from ibmebus framework, new ibmebus
driver (as unlikely as they are) wanting to use suspend/resume should
use dev_pm_ops.
Since there aren't any special bus specific things to do during
suspend/resume and since the PM core will automatically fallback
directly to using the device's PM ops if no bus PM ops are specified
there is no need to have any special ibmebus PM ops at all.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Naveen N. Rao [Mon, 21 Nov 2016 17:06:41 +0000 (22:36 +0530)]
powerpc/kprobes: Invoke handlers directly
Invoke the kprobe handlers directly rather than through notify_die(), to
reduce path taken for handling kprobes. Similar to commit 6b85732c85e3
("kprobes/x86: Call exception handlers directly from do_int3/do_debug").
While at it, rename post_kprobe_handler() to kprobe_post_handler() for
more uniform naming.
Reported-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Naveen N. Rao [Mon, 21 Nov 2016 17:06:40 +0000 (22:36 +0530)]
powerpc: Remove extraneous header from asm-prototypes.h
Commit dfa2512d4999 ("powerpc: Use kprobe blacklist for exception
handlers") removed __kprobes annotation from some of the prototypes,
but left the kprobes header include directive unchanged. Remove it as it
is no longer needed.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Tue, 22 Nov 2016 03:50:39 +0000 (14:50 +1100)]
powerpc/reg: Add definition for LPCR_PECE_HVEE
ISA 3.0 defines a new PECE (Power-saving mode Exit Cause Enable) field
in the LPCR (Logical Partitioning Control Register), called
LPCR_PECE_HVEE (Hypervisor Virtualization Exit Enable).
KVM code will need to know about this bit, so add a definition for it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Paul Mackerras [Mon, 21 Nov 2016 05:00:58 +0000 (16:00 +1100)]
powerpc/64: Provide functions for accessing POWER9 partition table
POWER9 requires the host to set up a partition table, which is a
table in memory indexed by logical partition ID (LPID) which
contains the pointers to page tables and process tables for the
host and each guest.
This factors out the initialization of the partition table into
a single function. This code was previously duplicated between
hash_utils_64.c and pgtable-radix.c.
This provides a function for setting a partition table entry,
which is used in early MMU initialization, and will be used by
KVM whenever a guest is created. This function includes a tlbie
instruction which will flush all TLB entries for the LPID and
all caches of the partition table entry for the LPID, across the
system.
This also moves a call to memblock_set_current_limit(), which was
in radix_init_partition_table(), but has nothing to do with the
partition table. By analogy with the similar code for hash, the
call gets moved to near the end of radix__early_init_mmu(). It
now gets called when running as a guest, whereas previously it
would only be called if the kernel is running as the host.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Russell Currey [Thu, 17 Nov 2016 05:07:47 +0000 (16:07 +1100)]
powerpc/eeh: Refactor EEH PE reset functions
eeh_pe_reset and eeh_reset_pe are two different functions in the same
file which do mostly the same thing. Not only is this confusing, but
potentially causes disrepancies in functionality, notably eeh_reset_pe
as it does not check return values for failure.
Refactor this into the following:
- eeh_pe_reset(): stays as is, performs a single operation, exported
- eeh_pe_reset_full(): new, full reset process that calls eeh_pe_reset()
- eeh_reset_pe(): removed and replaced by eeh_pe_reset_full()
- eeh_reset_pe_once(): removed
Signed-off-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Russell Currey [Wed, 16 Nov 2016 03:02:15 +0000 (14:02 +1100)]
powerpc/pci: Always print PHB and PE numbers as hexadecimal
PHB, PE (and by association MVE) numbers are printed as a mix of decimal
and hexadecimal throughout the kernel. This can be misleading, so make
them all hexadecimal.
Standardising on hex instead of dec because:
- PHB numbers are presented in hex in sysfs/debugfs (and lspci, etc)
- PE numbers are presented as hex in sysfs and parsed in hex in debugfs
The only place I think this could cause confusing are the messages during
boot, i.e.
pci 000a:01 : [PE# 000] Secondary bus 1 associated with PE#0
which can be a quick way to check PE numbers. pe_level_printk() will
only print two characters instead of three, so the above would be
pci 000a:01 : [PE# 00] Secondary bus 1 associated with PE#0
which gives a hint it's in hex.
Signed-off-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Russell Currey [Wed, 16 Nov 2016 01:12:26 +0000 (12:12 +1100)]
powerpc/powernv: Don't warn on PE init if unfreeze is unsupported
Whenever a PE is initialised in powernv, opal_pci_eeh_freeze_clear() is
called. This is to remove any existing freeze, and has no negative side
effects if the PE is already in an unfrozen state. On PHB backends that
don't support this operation and return OPAL_UNSUPPORTED, this creates a
scary and misleading warning message.
Skip the warning message on init if OPAL_UNSUPPORTED is returned.
As far as I'm aware, this currently only affects NPUs.
Fixes: c25e97d ("powerpc/powernv: Unfreeze PE on allocation") Signed-off-by: Russell Currey <ruscur@russell.cc> Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Fri, 28 Oct 2016 06:39:53 +0000 (17:39 +1100)]
powerpc/64: Used named initialisers for ibm_pa_features
The ibm_pa_features array consists of structures that describe which bit
and byte in the ibm,pa-features property toggles one or more flags in
either the CPU, MMU, or user visible feature flags.
Each one consists of 7 values, which are all unsigned long, int or char,
meaning the compiler gives us no warning if we assign the wrong values
to the wrong elements. In fact we have had a bug here in the past, where
we were setting incorrect bits, see commit 001be8d6f46e ("powerpc:
scan_features() updates incorrect bits for REAL_LE").
So switch to using named initialisers for the structure elements, to
reduce the likelihood of future bugs, and hopefully improve readability
also.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Michael Ellerman [Tue, 15 Nov 2016 03:47:44 +0000 (14:47 +1100)]
powerpc/pseries: Disable IBMEBUS on little endian builds
The IBMEBUS code supports the GX bus found on Power7 and earlier CPUs.
On Power8 it has been replaced, and so we have no need for it.
We don't actually have a config symbol for Power8 vs Power7 etc., but
we only support booting little endian on Power8 or later, so use that as
a reasonable approximation.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Frederic Barrat [Fri, 18 Nov 2016 12:00:31 +0000 (23:00 +1100)]
cxl: Fix coredump generation when cxl_get_fd() is used
If a process dumps core while owning a cxl file descriptor obtained
from an AFU driver (e.g. cxlflash) through the cxl_get_fd() API, the
following error occurs:
The root cause is that the address_space structure for the file
doesn't define a 'host' member.
When cxl allocates a file descriptor, it's using the anonymous inode
to back the file, but allocates a private address_space for each
context. The private address_space allows to track memory allocation
for each context. cxl doesn't define the 'host' member of the address
space, i.e. the inode. We don't want to define it as the anonymous
inode, since there's no longer a 1-to-1 relation between address_space
and inode.
To fix it, instead of using the anonymous inode, we introduce a simple
pseudo filesystem so that cxl can allocate its own inodes. So we now
have one inode for each file and address_space. The pseudo filesystem
is only mounted on the first allocation of a file descriptor by
cxl_get_fd().
Tested with cxlflash.
Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Matthew R. Ochs <mrochs@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Vaibhav Jain [Wed, 16 Nov 2016 14:09:33 +0000 (19:39 +0530)]
cxl: Do adapter fence check before handling afu interrupt
If an afu interrupt is in flight when an eeh error is triggered the
control still reaches the function native_irq_multiplexed and the
PE-Handle read from the CXL_PSL_PEHandle_An register is 0xffff. The
function then erroneously assumes that the interrupt belonged to a
detached context and generates a warning with full stack dump in the
kernel log complaining:
"Unable to demultiplex CXL PSL IRQ for PE 65535 DSISR ffffffff DAR ffffffff. (Possible AFU HW issue - was a term/remove acked with
outstanding transactions"
To fix this the patch adds new code to the function
native_irq_multiplexed function to compares the read value of register
CXL_PSL_PEHandle_An to ~0ULL. If true then logs a warning message
saying that the interrupt is being ignored and returns IRQ_HANDLED from
the irq handler.
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
'cxl_context_alloc()' does not return an error pointer. It is just a
shortcut for a call to 'kzalloc' with 'sizeof(struct cxl_context)' as the
size parameter.
So its return value should be compared with NULL.
While fixing it, simplify a bit the code.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Tue, 8 Nov 2016 12:14:45 +0000 (23:14 +1100)]
powerpc: Fix second nested oops hang
When ending an oops, don't clear die_owner unless the nest count
went to zero. This prevents a second nested oops from hanging forever
on the die_lock.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Tue, 8 Nov 2016 12:14:44 +0000 (23:14 +1100)]
powerpc: Fix graceful debugger recovery
When exiting xmon with 'x' (exit and recover), oops_begin bails
out immediately, but die then calls __die() and oops_end(), which
cause a lot of bad things to happen.
If the debugger was attached then went to graceful recovery, exit
from die() immediately.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Wed, 19 Oct 2016 03:15:59 +0000 (14:15 +1100)]
powerpc: Add option to use thin archives
Add an option to use thin archives to build the kernel.
Thin archives are explained in commit 4d4f9a52c923 ("kbuild: allow
architectures to use thin archives instead of ld -r").
This is a gradual way to introduce the option to testers.
Some change to the way we invoke ar is required so it can be used
by scripts/link-vmlinux.sh.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Make it an explicit option not dependant on COMPILE_TEST] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nathan Fontenot [Thu, 17 Nov 2016 16:38:10 +0000 (11:38 -0500)]
powerpc/pseries: Correct possible read beyond dlpar sysfs buffer
The pasrsing of data written to the dlpar file in sysfs does not correctly
account for the possibility of reading past the end of the buffer. The code
assumes that all pieces of the command witten to the sysfs file are present
in the form "<resource> <action> <id_type> <id>".
Correct this by updating the buffer parsing code to make a local copy and
use the strsep() and sysfs_streq() routines to parse the buffer. This patch
also separates the parsing code into subroutines for each piece of the
command.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
powerpc/mm: Correct process and partition table max size
Version 3.00 of the ISA states that the PATS (partition table size) field
of the PTCR (partition table control register) and the PRTS (process table
size) field of the partition table entry must both be less than or equal
to 24. However the actual size of the partition and process tables is equal
to 2 to the power of 12 plus the PATS and PRTS fields, respectively. This
means that the max allowable size of each of these tables is 2^36 or 64GB
for both.
Thus when checking the size shift for each we should be checking for values
of greater than 36 instead of the current check for shifts larger than 24
and 23.
selftests/powerpc: Add ptrace tests for VSX, VMX registers in TM
This patch adds ptrace interface test for VSX, VMX registers
inside TM context. This also adds ptrace interface based helper
functions related to chckpointed VSX, VMX registers access.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Simon Guo <wei.guo.simon@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
selftests/powerpc: Add ptrace tests for VSX, VMX registers
This patch adds ptrace interface test for VSX, VMX registers.
This also adds ptrace interface based helper functions related
to VSX, VMX registers access. This also adds some assembly
helper functions related to VSX and VMX registers.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Simon Guo <wei.guo.simon@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
selftests/powerpc: Add ptrace tests for TAR, PPR, DSCR in TM
This patch adds ptrace interface test for TAR, PPR, DSCR
registers inside TM context. This also adds ptrace
interface based helper functions related to checkpointed
TAR, PPR, DSCR register access.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Simon Guo <wei.guo.simon@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
selftests/powerpc: Add ptrace tests for TAR, PPR, DSCR registers
This patch adds ptrace interface test for TAR, PPR, DSCR
registers. This also adds ptrace interface based helper
functions related to TAR, PPR, DSCR register access.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Simon Guo <wei.guo.simon@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
selftests/powerpc: Add ptrace tests for GPR/FPR registers in TM
This patch adds ptrace interface test for GPR/FPR registers
inside TM context. This adds ptrace interface based helper
functions related to checkpointed GPR/FPR access.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Simon Guo <wei.guo.simon@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
selftests/powerpc: Add ptrace tests for GPR/FPR registers
This patch adds ptrace interface test for GPR/FPR registers.
This adds ptrace interface based helper functions related to
GPR/FPR access and some assembly helper functions related to
GPR/FPR registers.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
[mpe: Add #defines for the new note types when headers don't define them] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
selftests/powerpc: Add more SPR numbers, TM & VMX instructions to 'reg.h'/'instructions.h'
This patch adds SPR number for TAR, PPR, DSCR special
purpose registers. It also adds TM, VSX, VMX related
instructions which will then be used by patches later
in the series.
Now that the new DSCR register definitions (SPRN_DSCR_PRIV and
SPRN_DSCR) are defined outside this directory, use them instead.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Simon Guo <wei.guo.simon@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Rashmica Gupta [Fri, 27 May 2016 05:49:00 +0000 (15:49 +1000)]
powerpc/mm: Dump hash table
Useful to be able to dump the kernel hash page table to check
which pages are hashed along with their sizes and other details.
Add a debugfs file to check the hash page table. If radix is enabled
(and so there is no hash page table) then this file doesn't exist. To
use this the PPC_PTDUMP config option must be selected.
Signed-off-by: Rashmica Gupta <rashmicy@gmail.com>
[mpe: Fix build with SPARSEMEM_VMEMMAP=n & PSERIES=n] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Mon, 14 Nov 2016 05:28:10 +0000 (16:28 +1100)]
powerpc/pseries: Move CMO code from plapr_wrappers.h to platforms/pseries
Currently there's some CMO (Cooperative Memory Overcommit) code, in
plpar_wrappers.h. Some of it is #ifdef CONFIG_PSERIES and some of it
isn't. The end result being if a file includes plpar_wrappers.h it won't
build with CONFIG_PSERIES=n.
Fix it by moving the CMO code into platforms/pseries. The two hcall
wrappers can just be moved into their only caller, cmm.c, and the
accessors can go in pseries.h.
Note we need the accessors because cmm.c can be built as a module, so
there needs to be a split between the built-in code vs the module, and
that's achieved by using those accessors.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Paul Mackerras [Fri, 11 Nov 2016 05:55:03 +0000 (16:55 +1100)]
powerpc/64: Simplify adaptation to new ISA v3.00 HPTE format
This changes the way that we support the new ISA v3.00 HPTE format.
Instead of adapting everything that uses HPTE values to handle either
the old format or the new format, depending on which CPU we are on,
we now convert explicitly between old and new formats if necessary
in the low-level routines that actually access HPTEs in memory.
This limits the amount of code that needs to know about the new
format and makes the conversions explicit. This is OK because the
old format contains all the information that is in the new format.
This also fixes operation under a hypervisor, because the H_ENTER
hypercall (and other hypercalls that deal with HPTEs) will continue
to require the HPTE value to be supplied in the old format. At
present the kernel will not boot in HPT mode on POWER9 under a
hypervisor.
This fixes and partially reverts commit 0666fb66b8f5
("powerpc/mm/hash: Add support for Power9 Hash", 2016-04-29).
Fixes: 0666fb66b8f5 ("powerpc/mm/hash: Add support for Power9 Hash") Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>