Ilias Stamatis [Mon, 7 Jun 2021 10:54:38 +0000 (11:54 +0100)]
KVM: X86: Add vendor callbacks for writing the TSC multiplier
Currently vmx_vcpu_load_vmcs() writes the TSC_MULTIPLIER field of the
VMCS every time the VMCS is loaded. Instead of doing this, set this
field from common code on initialization and whenever the scaling ratio
changes.
Additionally remove vmx->current_tsc_ratio. This field is redundant as
vcpu->arch.tsc_scaling_ratio already tracks the current TSC scaling
ratio. The vmx->current_tsc_ratio field is only used for avoiding
unnecessary writes but it is no longer needed after removing the code
from the VMCS load path.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Message-Id: <20210607105438.16541-1-ilstam@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Ilias Stamatis [Wed, 26 May 2021 18:44:15 +0000 (19:44 +0100)]
KVM: X86: Move write_l1_tsc_offset() logic to common code and rename it
The write_l1_tsc_offset() callback has a misleading name. It does not
set L1's TSC offset, it rather updates the current TSC offset which
might be different if a nested guest is executing. Additionally, both
the vmx and svm implementations use the same logic for calculating the
current TSC before writing it to hardware.
Rename the function and move the common logic to the caller. The vmx/svm
specific code now merely sets the given offset to the corresponding
hardware structure.
Ilias Stamatis [Wed, 26 May 2021 18:44:14 +0000 (19:44 +0100)]
KVM: X86: Add functions that calculate the nested TSC fields
When L2 is entered we need to "merge" the TSC multiplier and TSC offset
values of 01 and 12 together.
The merging is done using the following equations:
offset_02 = ((offset_01 * mult_12) >> shift_bits) + offset_12
mult_02 = (mult_01 * mult_12) >> shift_bits
Where shift_bits is kvm_tsc_scaling_ratio_frac_bits.
Ilias Stamatis [Wed, 26 May 2021 18:44:13 +0000 (19:44 +0100)]
KVM: X86: Add functions for retrieving L2 TSC fields from common code
In order to implement as much of the nested TSC scaling logic as
possible in common code, we need these vendor callbacks for retrieving
the TSC offset and the TSC multiplier that L1 has set for L2.
Ilias Stamatis [Wed, 26 May 2021 18:44:11 +0000 (19:44 +0100)]
KVM: X86: Add a ratio parameter to kvm_scale_tsc()
Sometimes kvm_scale_tsc() needs to use the current scaling ratio and
other times (like when reading the TSC from user space) it needs to use
L1's scaling ratio. Have the caller specify this by passing the ratio as
a parameter.
Ilias Stamatis [Wed, 26 May 2021 18:44:09 +0000 (19:44 +0100)]
KVM: X86: Store L1's TSC scaling ratio in 'struct kvm_vcpu_arch'
Store L1's scaling ratio in the kvm_vcpu_arch struct like we already do
for L1's TSC offset. This allows for easy save/restore when we enter and
then exit the nested guest.
Ilias Stamatis [Wed, 26 May 2021 18:44:08 +0000 (19:44 +0100)]
math64.h: Add mul_s64_u64_shr()
This function is needed for KVM's nested virtualization. The nested TSC
scaling implementation requires multiplying the signed TSC offset with
the unsigned TSC multiplier.
Ben Gardon [Tue, 18 May 2021 17:34:14 +0000 (10:34 -0700)]
KVM: x86/mmu: Lazily allocate memslot rmaps
If the TDP MMU is in use, wait to allocate the rmaps until the shadow
MMU is actually used. (i.e. a nested VM is launched.) This saves memory
equal to 0.2% of guest memory in cases where the TDP MMU is used and
there are no nested guests involved.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-8-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Ben Gardon [Tue, 18 May 2021 17:34:13 +0000 (10:34 -0700)]
KVM: x86/mmu: Skip rmap operations if rmaps not allocated
If only the TDP MMU is being used to manage the memory mappings for a VM,
then many rmap operations can be skipped as they are guaranteed to be
no-ops. This saves some time which would be spent on the rmap operation.
It also avoids acquiring the MMU lock in write mode for many operations.
This makes it safe to run the VM without rmaps allocated, when only
using the TDP MMU and sets the stage for waiting to allocate the rmaps
until they're needed.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-7-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Ben Gardon [Tue, 18 May 2021 17:34:12 +0000 (10:34 -0700)]
KVM: x86/mmu: Add a field to control memslot rmap allocation
Add a field to control whether new memslots should have rmaps allocated
for them. As of this change, it's not safe to skip allocating rmaps, so
the field is always set to allocate rmaps. Future changes will make it
safe to operate without rmaps, using the TDP MMU. Then further changes
will allow the rmaps to be allocated lazily when needed for nested
oprtation.
No functional change expected.
Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-6-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Ben Gardon [Tue, 18 May 2021 17:34:11 +0000 (10:34 -0700)]
KVM: mmu: Add slots_arch_lock for memslot arch fields
Add a new lock to protect the arch-specific fields of memslots if they
need to be modified in a kvm->srcu read critical section. A future
commit will use this lock to lazily allocate memslot rmaps for x86.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-5-bgardon@google.com>
[Add Documentation/ hunk. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Ben Gardon [Tue, 18 May 2021 17:34:10 +0000 (10:34 -0700)]
KVM: mmu: Refactor memslot copy
Factor out copying kvm_memslots from allocating the memory for new ones
in preparation for adding a new lock to protect the arch-specific fields
of the memslots.
No functional change intended.
Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-4-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Ben Gardon [Tue, 18 May 2021 17:34:09 +0000 (10:34 -0700)]
KVM: x86/mmu: Factor out allocating memslot rmap
Small refactor to facilitate allocating rmaps for all memslots at once.
No functional change expected.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-3-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Ben Gardon [Tue, 18 May 2021 17:34:08 +0000 (10:34 -0700)]
KVM: x86/mmu: Deduplicate rmap freeing
Small code deduplication. No functional change expected.
Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-2-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Do not write protect huge page in initially-all-set mode
Currently, when dirty logging is started in initially-all-set mode,
we write protect huge pages to prepare for splitting them into
4K pages, and leave normal pages untouched as the logging will
be enabled lazily as dirty bits are cleared.
However, enabling dirty logging lazily is also feasible for huge pages.
This not only reduces the time of start dirty logging, but it also
greatly reduces side-effect on guest when there is high dirty rate.
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Message-Id: <20210429034115.35560-3-zhukeqian1@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: kvm_hv_flush_tlb use inputs from XMM registers
Hyper-V supports the use of XMM registers to perform fast hypercalls.
This allows guests to take advantage of the improved performance of the
fast hypercall interface even though a hypercall may require more than
(the current maximum of) two input registers.
The XMM fast hypercall interface uses six additional XMM registers (XMM0
to XMM5) to allow the guest to pass an input parameter block of up to
112 bytes.
Add framework to read from XMM registers in kvm_hv_hypercall() and use
the additional hypercall inputs from XMM registers in kvm_hv_flush_tlb()
when possible.
KVM: hyper-v: Collect hypercall params into struct
As of now there are 7 parameters (and flags) that are used in various
hyper-v hypercall handlers. There are 6 more input/output parameters
passed from XMM registers which are to be added in an upcoming patch.
To make passing arguments to the handlers more readable, capture all
these parameters into a single structure.
Cc: Alexander Graf <graf@amazon.com> Cc: Evgeny Iakovlev <eyakovl@amazon.de> Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de>
Message-Id: <273f7ed510a1f6ba177e61b73a5c7bfbee4a4a87.1622019133.git.sidcha@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hyper-v XMM fast hypercalls use XMM registers to pass input/output
parameters. To access these, hyperv.c can reuse some FPU register
accessors defined in emulator.c. Move them to a common location so both
can access them.
While at it, reorder the parameters of these accessor methods to make
them more readable.
Cc: Alexander Graf <graf@amazon.com> Cc: Evgeny Iakovlev <eyakovl@amazon.de> Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de>
Message-Id: <01a85a6560714d4d3637d3d86e5eba65073318fa.1622019133.git.sidcha@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86/mmu: Calculate and check "full" mmu_role for nested MMU
Calculate and check the full mmu_role when initializing the MMU context
for the nested MMU, where "full" means the bits and pieces of the role
that aren't handled by kvm_calc_mmu_role_common(). While the nested MMU
isn't used for shadow paging, things like the number of levels in the
guest's page tables are surprisingly important when walking the guest
page tables. Failure to reinitialize the nested MMU context if L2's
paging mode changes can result in unexpected and/or missed page faults,
and likely other explosions.
E.g. if an L1 vCPU is running both a 32-bit PAE L2 and a 64-bit L2, the
"common" role calculation will yield the same role for both L2s. If the
64-bit L2 is run after the 32-bit PAE L2, L0 will fail to reinitialize
the nested MMU context, ultimately resulting in a bad walk of L2's page
tables as the MMU will still have a guest root_level of PT32E_ROOT_LEVEL.
Wanpeng Li [Fri, 11 Jun 2021 04:59:33 +0000 (21:59 -0700)]
KVM: X86: Fix x86_emulator slab cache leak
Commit 95cea3c59bcf6 (KVM: x86: Dynamically allocate per-vCPU emulation context)
tries to allocate per-vCPU emulation context dynamically, however, the
x86_emulator slab cache is still exiting after the kvm module is unload
as below after destroying the VM and unloading the kvm module.
Alper Gun [Thu, 10 Jun 2021 17:46:04 +0000 (17:46 +0000)]
KVM: SVM: Call SEV Guest Decommission if ASID binding fails
Send SEV_CMD_DECOMMISSION command to PSP firmware if ASID binding
fails. If a failure happens after a successful LAUNCH_START command,
a decommission command should be executed. Otherwise, guest context
will be unfreed inside the AMD SP. After the firmware will not have
memory to allocate more SEV guest context, LAUNCH_START command will
begin to fail with SEV_RET_RESOURCE_LIMIT error.
The existing code calls decommission inside sev_unbind_asid, but it is
not called if a failure happens before guest activation succeeds. If
sev_bind_asid fails, decommission is never called. PSP firmware has a
limit for the number of guests. If sev_asid_binding fails many times,
PSP firmware will not have resources to create another guest context.
Cc: stable@vger.kernel.org Fixes: 324f894fc441 ("KVM: SVM: Add support for KVM_SEV_LAUNCH_START command") Reported-by: Peter Gonda <pgonda@google.com> Signed-off-by: Alper Gun <alpergun@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210610174604.2554090-1-alpergun@google.com>
KVM: x86: Immediately reset the MMU context when the SMM flag is cleared
Immediately reset the MMU context when the vCPU's SMM flag is cleared so
that the SMM flag in the MMU role is always synchronized with the vCPU's
flag. If RSM fails (which isn't correctly emulated), KVM will bail
without calling post_leave_smm() and leave the MMU in a bad state.
The bad MMU role can lead to a NULL pointer dereference when grabbing a
shadow page's rmap for a page fault as the initial lookups for the gfn
will happen with the vCPU's SMM flag (=0), whereas the rmap lookup will
use the shadow page's SMM flag, which comes from the MMU (=1). SMM has
an entirely different set of memslots, and so the initial lookup can find
a memslot (SMM=0) and then explode on the rmap memslot lookup (SMM=1).
In preparation to enable -Wimplicit-fallthrough for Clang, fix a couple
of warnings by explicitly adding break statements instead of just letting
the code fall through to the next case.
ChenXiaoSong [Wed, 9 Jun 2021 12:22:17 +0000 (20:22 +0800)]
KVM: SVM: fix doc warnings
Fix kernel-doc warnings:
arch/x86/kvm/svm/avic.c:233: warning: Function parameter or member 'activate' not described in 'avic_update_access_page'
arch/x86/kvm/svm/avic.c:233: warning: Function parameter or member 'kvm' not described in 'avic_update_access_page'
arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'e' not described in 'get_pi_vcpu_info'
arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'kvm' not described in 'get_pi_vcpu_info'
arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'svm' not described in 'get_pi_vcpu_info'
arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'vcpu_info' not described in 'get_pi_vcpu_info'
arch/x86/kvm/svm/avic.c:1009: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Message-Id: <20210609122217.2967131-1-chenxiaosong2@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Yanan Wang [Thu, 10 Jun 2021 08:54:18 +0000 (16:54 +0800)]
KVM: selftests: Fix compiling errors when initializing the static structure
Errors like below were produced from test_util.c when compiling the KVM
selftests on my local platform.
lib/test_util.c: In function 'vm_mem_backing_src_alias':
lib/test_util.c:177:12: error: initializer element is not constant
.flag = anon_flags,
^~~~~~~~~~
lib/test_util.c:177:12: note: (near initialization for 'aliases[0].flag')
The reason is that we are using non-const expressions to initialize the
static structure, which will probably trigger a compiling error/warning
on stricter GCC versions. Fix it by converting the two const variables
"anon_flags" and "anon_huge_flags" into more stable macros.
Jim Mattson [Wed, 2 Jun 2021 20:52:24 +0000 (13:52 -0700)]
kvm: LAPIC: Restore guard to prevent illegal APIC register access
Per the SDM, "any access that touches bytes 4 through 15 of an APIC
register may cause undefined behavior and must not be executed."
Worse, such an access in kvm_lapic_reg_read can result in a leak of
kernel stack contents. Prior to commit b21c36f0e3e9 ("kvm: LAPIC:
write down valid APIC registers"), such an access was explicitly
disallowed. Restore the guard that was removed in that commit.
Fixes: b21c36f0e3e9 ("kvm: LAPIC: write down valid APIC registers") Signed-off-by: Jim Mattson <jmattson@google.com> Reported-by: syzbot <syzkaller@googlegroups.com>
Message-Id: <20210602205224.3189316-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 9 Jun 2021 05:49:13 +0000 (01:49 -0400)]
kvm: fix previous commit for 32-bit builds
array_index_nospec does not work for uint64_t on 32-bit builds.
However, the size of a memory slot must be less than 20 bits wide
on those system, since the memory slot must fit in the user
address space. So just store it in an unsigned long.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 8 Jun 2021 19:31:42 +0000 (15:31 -0400)]
kvm: avoid speculation-based attacks from out-of-range memslot accesses
KVM's mechanism for accessing guest memory translates a guest physical
address (gpa) to a host virtual address using the right-shifted gpa
(also known as gfn) and a struct kvm_memory_slot. The translation is
performed in __gfn_to_hva_memslot using the following formula:
hva = slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE
It is expected that gfn falls within the boundaries of the guest's
physical memory. However, a guest can access invalid physical addresses
in such a way that the gfn is invalid.
__gfn_to_hva_memslot is called from kvm_vcpu_gfn_to_hva_prot, which first
retrieves a memslot through __gfn_to_memslot. While __gfn_to_memslot
does check that the gfn falls within the boundaries of the guest's
physical memory or not, a CPU can speculate the result of the check and
continue execution speculatively using an illegal gfn. The speculation
can result in calculating an out-of-bounds hva. If the resulting host
virtual address is used to load another guest physical address, this
is effectively a Spectre gadget consisting of two consecutive reads,
the second of which is data dependent on the first.
Right now it's not clear if there are any cases in which this is
exploitable. One interesting case was reported by the original author
of this patch, and involves visiting guest page tables on x86. Right
now these are not vulnerable because the hva read goes through get_user(),
which contains an LFENCE speculation barrier. However, there are
patches in progress for x86 uaccess.h to mask kernel addresses instead of
using LFENCE; once these land, a guest could use speculation to read
from the VMM's ring 3 address space. Other architectures such as ARM
already use the address masking method, and would be susceptible to
this same kind of data-dependent access gadgets. Therefore, this patch
proactively protects from these attacks by masking out-of-bounds gfns
in __gfn_to_hva_memslot, which blocks speculation of invalid hvas.
Sean Christopherson noted that this patch does not cover
kvm_read_guest_offset_cached. This however is limited to a few bytes
past the end of the cache, and therefore it is unlikely to be useful in
the context of building a chain of data dependent accesses.
Lai Jiangshan [Mon, 31 May 2021 17:22:56 +0000 (01:22 +0800)]
KVM: x86: Unload MMU on guest TLB flush if TDP disabled to force MMU sync
When using shadow paging, unload the guest MMU when emulating a guest TLB
flush to ensure all roots are synchronized. From the guest's perspective,
flushing the TLB ensures any and all modifications to its PTEs will be
recognized by the CPU.
Note, unloading the MMU is overkill, but is done to mirror KVM's existing
handling of INVPCID(all) and ensure the bug is squashed. Future cleanup
can be done to more precisely synchronize roots when servicing a guest
TLB flush.
If TDP is enabled, synchronizing the MMU is unnecessary even if nested
TDP is in play, as a "legacy" TLB flush from L1 does not invalidate L1's
TDP mappings. For EPT, an explicit INVEPT is required to invalidate
guest-physical mappings; for NPT, guest mappings are always tagged with
an ASID and thus can only be invalidated via the VMCB's ASID control.
This bug has existed since the introduction of KVM_VCPU_FLUSH_TLB.
It was only recently exposed after Linux guests stopped flushing the
local CPU's TLB prior to flushing remote TLBs (see commit d1ad05e04ef6,
"x86/mm/tlb: Flush remote and local TLBs concurrently"), but is also
visible in Windows 10 guests.
Tested-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Fixes: c825a543ef76 ("KVM: X86: support paravirtualized help for TLB shootdowns") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
[sean: massaged comment and changelog]
Message-Id: <20210531172256.2908-1-jiangshanlai@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Ensure liveliness of nested VM-Enter fail tracepoint message
Use the __string() machinery provided by the tracing subystem to make a
copy of the string literals consumed by the "nested VM-Enter failed"
tracepoint. A complete copy is necessary to ensure that the tracepoint
can't outlive the data/memory it consumes and deference stale memory.
Because the tracepoint itself is defined by kvm, if kvm-intel and/or
kvm-amd are built as modules, the memory holding the string literals
defined by the vendor modules will be freed when the module is unloaded,
whereas the tracepoint and its data in the ring buffer will live until
kvm is unloaded (or "indefinitely" if kvm is built-in).
This bug has existed since the tracepoint was added, but was recently
exposed by a new check in tracing to detect exactly this type of bug.
Zhenzhong Duan [Tue, 8 Jun 2021 23:38:16 +0000 (07:38 +0800)]
selftests: kvm: Add support for customized slot0 memory size
Until commit 176f94264682 ("selftests: kvm: make allocation of extra
memory take effect", 2021-05-27), parameter extra_mem_pages was used
only to calculate the page table size for all the memory chunks,
because real memory allocation happened with calls of
vm_userspace_mem_region_add() after vm_create_default().
Commit 176f94264682 however changed the meaning of extra_mem_pages to
the size of memory slot 0. This makes the memory allocation more
flexible, but makes it harder to account for the number of
pages needed for the page tables. For example, memslot_perf_test
has a small amount of memory in slot 0 but a lot in other slots,
and adding that memory twice (both in slot 0 and with later
calls to vm_userspace_mem_region_add()) causes an error that
was fixed in commit 2ab628c740ab ("selftests: kvm: fix overlapping
addresses in memslot_perf_test", 2021-05-29)
Since both uses are sensible, add a new parameter slot0_mem_pages
to vm_create_with_vcpus() and some comments to clarify the meaning of
slot0_mem_pages and extra_mem_pages. With this change,
memslot_perf_test can go back to passing the number of memory
pages as extra_mem_pages.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Message-Id: <20210608233816.423958-4-zhenzhong.duan@intel.com>
[Squashed in a single patch and rewrote the commit message. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
s390x can have up to 47bits of physical guest and 64bits of virtual
address bits. Add a new address mode to avoid errors of testcases
going beyond 47bits.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Message-Id: <20210608123954.10991-1-borntraeger@de.ibm.com> Fixes: 68ab48867421 ("KVM: selftests: Fix 32-bit truncation of vm_get_max_gfn()") Cc: stable@vger.kernel.org Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In record_steal_time(), st->preempted is read twice, and
trace_kvm_pv_tlb_flush() might output result inconsistent if
kvm_vcpu_flush_tlb_guest() see a different st->preempted later.
It is a very trivial problem and hardly has actual harm and can be
avoided by reseting and reading st->preempted in atomic way via xchg().
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210531174628.10265-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Lai Jiangshan [Thu, 3 Jun 2021 05:24:55 +0000 (13:24 +0800)]
KVM: X86: MMU: Use the correct inherited permissions to get shadow page
When computing the access permissions of a shadow page, use the effective
permissions of the walk up to that point, i.e. the logic AND of its parents'
permissions. Two guest PxE entries that point at the same table gfn need to
be shadowed with different shadow pages if their parents' permissions are
different. KVM currently uses the effective permissions of the last
non-leaf entry for all non-leaf entries. Because all non-leaf SPTEs have
full ("uwx") permissions, and the effective permissions are recorded only
in role.access and merged into the leaves, this can lead to incorrect
reuse of a shadow page and eventually to a missing guest protection page
fault.
pud1 and pud2 point to the same pmd table, so:
- ptr1 and ptr3 points to the same page.
- ptr2 and ptr4 points to the same page.
(pud1 and pud2 here are pud entries, while pmd1 and pmd2 here are pmd entries)
- First, the guest reads from ptr1 first and KVM prepares a shadow
page table with role.access=u--, from ptr1's pud1 and ptr1's pmd1.
"u--" comes from the effective permissions of pgd, pud1 and
pmd1, which are stored in pt->access. "u--" is used also to get
the pagetable for pud1, instead of "uw-".
- Then the guest writes to ptr2 and KVM reuses pud1 which is present.
The hypervisor set up a shadow page for ptr2 with pt->access is "uw-"
even though the pud1 pmd (because of the incorrect argument to
kvm_mmu_get_page in the previous step) has role.access="u--".
- Then the guest reads from ptr3. The hypervisor reuses pud1's
shadow pmd for pud2, because both use "u--" for their permissions.
Thus, the shadow pmd already includes entries for both pmd1 and pmd2.
- At last, the guest writes to ptr4. This causes no vmexit or pagefault,
because pud1's shadow page structures included an "uw-" page even though
its role.access was "u--".
Any kind of shared pagetable might have the similar problem when in
virtual machine without TDP enabled if the permissions are different
from different ancestors.
In order to fix the problem, we change pt->access to be an array, and
any access in it will not include permissions ANDed from child ptes.
The test code is: https://lore.kernel.org/kvm/20210603050537.19605-1-jiangshanlai@gmail.com/
Remember to test it with TDP disabled.
The problem had existed long before the commit 944dc52c8dc4 ("KVM: MMU:
Fix inherited permissions for emulated guest pte updates"), and it
is hard to find which is the culprit. So there is no fixes tag here.
Wanpeng Li [Mon, 7 Jun 2021 07:19:43 +0000 (00:19 -0700)]
KVM: LAPIC: Write 0 to TMICT should also cancel vmx-preemption timer
According to the SDM 10.5.4.1:
A write of 0 to the initial-count register effectively stops the local
APIC timer, in both one-shot and periodic mode.
However, the lapic timer oneshot/periodic mode which is emulated by vmx-preemption
timer doesn't stop by writing 0 to TMICT since vmx->hv_deadline_tsc is still
programmed and the guest will receive the spurious timer interrupt later. This
patch fixes it by also cancelling the vmx-preemption timer when writing 0 to
the initial-count register.
Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1623050385-100988-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Ashish Kalra [Mon, 7 Jun 2021 06:15:32 +0000 (06:15 +0000)]
KVM: SVM: Fix SEV SEND_START session length & SEND_UPDATE_DATA query length after commit 5d45c26c5576
Commit 5d45c26c5576 ("KVM: SVM: Allocate SEV command structures on local stack")
uses the local stack to allocate the structures used to communicate with the PSP,
which were earlier being kzalloced. This breaks SEV live migration for
computing the SEND_START session length and SEND_UPDATE_DATA query length as
session_len and trans_len and hdr_len fields are not zeroed respectively for
the above commands before issuing the SEV Firmware API call, hence the
firmware returns incorrect session length and update data header or trans length.
Also the SEV Firmware API returns SEV_RET_INVALID_LEN firmware error
for these length query API calls, and the return value and the
firmware error needs to be passed to the userspace as it is, so
need to remove the return check in the KVM code.
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-Id: <20210607061532.27459-1-Ashish.Kalra@amd.com> Fixes: 5d45c26c5576 ("KVM: SVM: Allocate SEV command structures on local stack") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 28 May 2021 19:10:58 +0000 (15:10 -0400)]
selftests: kvm: fix overlapping addresses in memslot_perf_test
vm_create allocates memory and maps it close to GPA. This memory
is separate from what is allocated in subsequent calls to
vm_userspace_mem_region_add, so it is incorrect to pass the
test memory size to vm_create_default. Just pass a small
fixed amount of memory which can be used later for page table,
otherwise GPAs are already allocated at MEM_GPA and the
test aborts.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wanpeng Li [Fri, 28 May 2021 00:01:37 +0000 (17:01 -0700)]
KVM: X86: Kill off ctxt->ud
ctxt->ud is consumed only by x86_decode_insn(), we can kill it off by
passing emulation_type to x86_decode_insn() and dropping ctxt->ud
altogether. Tracking that info in ctxt for literally one call is silly.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <1622160097-37633-2-git-send-email-wanpengli@tencent.com>
Commit 2f13bb5d9502 ("KVM: x86: handle hardware breakpoints during emulation())
adds hardware breakpoints check before emulation the instruction and parts of
emulation context initialization, actually we don't have the EMULTYPE_NO_DECODE flag
here and the emulation context will not be reused. Commit 6cf7a8b11844 ("KVM: x86:
set ctxt->have_exception in x86_decode_insn()) triggers the warning because it
catches the stale emulation context has #UD, however, it is not during instruction
decoding which should result in EMULATION_FAILED. This patch fixes it by moving
the second part emulation context initialization into init_emulate_ctxt() and
before hardware breakpoints check. The ctxt->ud will be dropped by a follow-up
patch.
Yuan Yao [Wed, 26 May 2021 06:38:28 +0000 (14:38 +0800)]
KVM: X86: Use kvm_get_linear_rip() in single-step and #DB/#BP interception
The kvm_get_linear_rip() handles x86/long mode cases well and has
better readability, __kvm_set_rflags() also use the paired
function kvm_is_linear_rip() to check the vcpu->arch.singlestep_rip
set in kvm_arch_vcpu_ioctl_set_guest_debug(), so change the
"CS.BASE + RIP" code in kvm_arch_vcpu_ioctl_set_guest_debug() and
handle_exception_nmi() to this one.
Signed-off-by: Yuan Yao <yuan.yao@intel.com>
Message-Id: <20210526063828.1173-1-yuan.yao@linux.intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Wed, 26 May 2021 16:32:27 +0000 (16:32 +0000)]
KVM: x86/mmu: Fix comment mentioning skip_4k
This comment was left over from a previous version of the patch that
introduced wrprot_gfn_range, when skip_4k was passed in instead of
min_level.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210526163227.3113557-1-dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Marcelo Tosatti [Wed, 26 May 2021 17:20:14 +0000 (14:20 -0300)]
KVM: VMX: update vcpu posted-interrupt descriptor when assigning device
For VMX, when a vcpu enters HLT emulation, pi_post_block will:
1) Add vcpu to per-cpu list of blocked vcpus.
2) Program the posted-interrupt descriptor "notification vector"
to POSTED_INTR_WAKEUP_VECTOR
With interrupt remapping, an interrupt will set the PIR bit for the
vector programmed for the device on the CPU, test-and-set the
ON bit on the posted interrupt descriptor, and if the ON bit is clear
generate an interrupt for the notification vector.
This way, the target CPU wakes upon a device interrupt and wakes up
the target vcpu.
Problem is that pi_post_block only programs the notification vector
if kvm_arch_has_assigned_device() is true. Its possible for the
following to happen:
1) vcpu V HLTs on pcpu P, kvm_arch_has_assigned_device is false,
notification vector is not programmed
2) device is assigned to VM
3) device interrupts vcpu V, sets ON bit
(notification vector not programmed, so pcpu P remains in idle)
4) vcpu 0 IPIs vcpu V (in guest), but since pi descriptor ON bit is set,
kvm_vcpu_kick is skipped
5) vcpu 0 busy spins on vcpu V's response for several seconds, until
RCU watchdog NMIs all vCPUs.
To fix this, use the start_assignment kvm_x86_ops callback to kick
vcpus out of the halt loop, so the notification vector is
properly reprogrammed to the wakeup vector.
Wanpeng Li [Tue, 18 May 2021 12:00:35 +0000 (05:00 -0700)]
KVM: LAPIC: Narrow the timer latency between wait_lapic_expire and world switch
Let's treat lapic_timer_advance_ns automatic tuning logic as hypervisor
overhead, move it before wait_lapic_expire instead of between wait_lapic_expire
and the world switch, the wait duration should be calculated by the
up-to-date guest_tsc after the overhead of automatic tuning logic. This
patch reduces ~30+ cycles for kvm-unit-tests/tscdeadline-latency when testing
busy waits.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-5-git-send-email-wanpengli@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Axel Rasmussen [Wed, 19 May 2021 20:03:39 +0000 (13:03 -0700)]
KVM: selftests: add shared hugetlbfs backing source type
This lets us run the demand paging test on top of a shared
hugetlbfs-backed area. The "shared" is key, as this allows us to
exercise userfaultfd minor faults on hugetlbfs.
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Message-Id: <20210519200339.829146-11-axelrasmussen@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Axel Rasmussen [Wed, 19 May 2021 20:03:38 +0000 (13:03 -0700)]
KVM: selftests: allow using UFFD minor faults for demand paging
UFFD handling of MINOR faults is a new feature whose use case is to
speed up demand paging (compared to MISSING faults). So, it's
interesting to let this selftest exercise this new mode.
Modify the demand paging test to have the option of using UFFD minor
faults, as opposed to missing faults. Now, when turning on userfaultfd
with '-u', the desired mode has to be specified ("MISSING" or "MINOR").
If we're in minor mode, before registering, prefault via the *alias*.
This way, the guest will trigger minor faults, instead of missing
faults, and we can UFFDIO_CONTINUE to resolve them.
Modify the page fault handler function to use the right ioctl depending
on the mode we're running in. In MINOR mode, use UFFDIO_CONTINUE.
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Message-Id: <20210519200339.829146-10-axelrasmussen@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Axel Rasmussen [Wed, 19 May 2021 20:03:37 +0000 (13:03 -0700)]
KVM: selftests: create alias mappings when using shared memory
When a memory region is added with a src_type specifying that it should
use some kind of shared memory, also create an alias mapping to the same
underlying physical pages.
And, add an API so tests can get access to these alias addresses.
Basically, for a guest physical address, let us look up the analogous
host *alias* address.
In a future commit, we'll modify the demand paging test to take
advantage of this to exercise UFFD minor faults. The idea is, we
pre-fault the underlying pages *via the alias*. When the *guest*
faults, it gets a "minor" fault (PTEs don't exist yet, but a page is
already in the page cache). Then, the userfaultfd theads can handle the
fault: they could potentially modify the underlying memory *via the
alias* if they wanted to, and then they install the PTEs and let the
guest carry on via a UFFDIO_CONTINUE ioctl.
Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Message-Id: <20210519200339.829146-9-axelrasmussen@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Axel Rasmussen [Wed, 19 May 2021 20:03:36 +0000 (13:03 -0700)]
KVM: selftests: add shmem backing source type
This lets us run the demand paging test on top of a shmem-backed area.
In follow-up commits, we'll 1) leverage this new capability to create an
alias mapping, and then 2) use the alias mapping to exercise UFFD minor
faults.
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Message-Id: <20210519200339.829146-8-axelrasmussen@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Each struct vm_mem_backing_src_alias has a flags field, which denotes
the flags used to mmap() an area of that type. Previously, this field
never included MAP_PRIVATE | MAP_ANONYMOUS, because
vm_userspace_mem_region_add assumed that *all* types would always use
those flags, and so it hardcoded them.
In a follow-up commit, we'll add a new type: shmem. Areas of this type
must not have MAP_PRIVATE | MAP_ANONYMOUS, and instead they must have
MAP_SHARED.
So, refactor things. Make it so that the flags field of
struct vm_mem_backing_src_alias really is a complete set of flags, and
don't add in any extras in vm_userspace_mem_region_add. This will let us
easily tack on shmem.
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Message-Id: <20210519200339.829146-7-axelrasmussen@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Axel Rasmussen [Wed, 19 May 2021 20:03:34 +0000 (13:03 -0700)]
KVM: selftests: allow different backing source types
Add an argument which lets us specify a different backing memory type
for the test. The default is just to use anonymous, matching existing
behavior.
This is in preparation for testing UFFD minor faults. For that, we'll
need to use a new backing memory type which is setup with MAP_SHARED.
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Message-Id: <20210519200339.829146-6-axelrasmussen@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is a preparatory commit needed before we can use different kinds of
backing pages for guest memory.
Previously, we used perf_test_args.host_page_size, which is the host's
native page size (commonly 4K). For VM_MEM_SRC_ANONYMOUS this turns out
to be okay, but in a follow-up commit we want to allow using different
kinds of backing memory.
Take VM_MEM_SRC_ANONYMOUS_HUGETLB for example. Without this change, if
we used that backing page type, when we issued a UFFDIO_COPY ioctl we'd
only do so with 4K, rather than the full 2M of a backing hugepage. In
this case, UFFDIO_COPY returns -EINVAL (__mcopy_atomic_hugetlb checks
the size).
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Message-Id: <20210519200339.829146-5-axelrasmussen@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
r = setup_demand_paging(...);
if (r < 0) exit(-r);
Since we're just going to exit anyway, instead of returning an error we
can just re-use TEST_ASSERT. This makes the caller simpler, as well as
the function itself - no need to write our branches, etc.
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Message-Id: <20210519200339.829146-3-axelrasmussen@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Tue, 11 May 2021 20:21:20 +0000 (20:21 +0000)]
KVM: selftests: Print a message if /dev/kvm is missing
If a KVM selftest is run on a machine without /dev/kvm, it will exit
silently. Make it easy to tell what's happening by printing an error
message.
Opportunistically consolidate all codepaths that open /dev/kvm into a
single function so they all print the same message.
This slightly changes the semantics of vm_is_unrestricted_guest() by
changing a TEST_ASSERT() to exit(KSFT_SKIP). However
vm_is_unrestricted_guest() is only called in one place
(x86_64/mmio_warning_test.c) and that is to determine if the test should
be skipped or not.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210511202120.1371800-1-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Axel Rasmussen [Wed, 19 May 2021 20:03:30 +0000 (13:03 -0700)]
KVM: selftests: trivial comment/logging fixes
Some trivial fixes I found while touching related code in this series,
factored out into a separate commit for easier reviewing:
- s/gor/got/ and add a newline in demand_paging_test.c
- s/backing_src/src_type/ in a comment to be consistent with the real
function signature in kvm_util.c
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Message-Id: <20210519200339.829146-2-axelrasmussen@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Fri, 14 May 2021 23:05:21 +0000 (23:05 +0000)]
KVM: selftests: Fix hang in hardware_disable_test
If /dev/kvm is not available then hardware_disable_test will hang
indefinitely because the child process exits before posting to the
semaphore for which the parent is waiting.
Fix this by making the parent periodically check if the child has
exited. We have to be careful to forward the child's exit status to
preserve a KSFT_SKIP status.
I considered just checking for /dev/kvm before creating the child
process, but there are so many other reasons why the child could exit
early that it seemed better to handle that as general case.
Tested:
$ ./hardware_disable_test
/dev/kvm not available, skipping test
$ echo $?
4
$ modprobe kvm_intel
$ ./hardware_disable_test
$ echo $?
0
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210514230521.2608768-1-dmatlack@google.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Wed, 19 May 2021 21:13:45 +0000 (21:13 +0000)]
KVM: selftests: Ignore CPUID.0DH.1H in get_cpuid_test
Similar to CPUID.0DH.0H this entry depends on the vCPU's XCR0 register
and IA32_XSS MSR. Since this test does not control for either before
assigning the vCPU's CPUID, these entries will not necessarily match
the supported CPUID exposed by KVM.
This fixes get_cpuid_test on Cascade Lake CPUs.
Suggested-by: Jim Mattson <jmattson@google.com> Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210519211345.3944063-1-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Fri, 21 May 2021 17:38:28 +0000 (17:38 +0000)]
KVM: selftests: Fix 32-bit truncation of vm_get_max_gfn()
vm_get_max_gfn() casts vm->max_gfn from a uint64_t to an unsigned int,
which causes the upper 32-bits of the max_gfn to get truncated.
Nobody noticed until now likely because vm_get_max_gfn() is only used
as a mechanism to create a memslot in an unused region of the guest
physical address space (the top), and the top of the 32-bit physical
address space was always good enough.
This fix reveals a bug in memslot_modification_stress_test which was
trying to create a dummy memslot past the end of guest physical memory.
Fix that by moving the dummy memslot lower.
Fixes: 4871f3f0ea8e ("KVM: selftests: Remove duplicate guest mode handling") Reviewed-by: Venkatesh Srinivas <venkateshs@chromium.org> Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210521173828.1180619-1-dmatlack@google.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: selftests: add a memslot-related performance benchmark
This benchmark contains the following tests:
* Map test, where the host unmaps guest memory while the guest writes to
it (maps it).
The test is designed in a way to make the unmap operation on the host
take a negligible amount of time in comparison with the mapping
operation in the guest.
The test area is actually split in two: the first half is being mapped
by the guest while the second half in being unmapped by the host.
Then a guest <-> host sync happens and the areas are reversed.
* Unmap test which is broadly similar to the above map test, but it is
designed in an opposite way: to make the mapping operation in the guest
take a negligible amount of time in comparison with the unmap operation
on the host.
This test is available in two variants: with per-page unmap operation
or a chunked one (using 2 MiB chunk size).
* Move active area test which involves moving the last (highest gfn)
memslot a bit back and forth on the host while the guest is
concurrently writing around the area being moved (including over the
moved memslot).
* Move inactive area test which is similar to the previous move active
area test, but now guest writes all happen outside of the area being
moved.
* Read / write test in which the guest writes to the beginning of each
page of the test area while the host writes to the middle of each such
page.
Then each side checks the values the other side has written.
This particular test is not expected to give different results depending
on particular memslots implementation, it is meant as a rough sanity
check and to provide insight on the spread of test results expected.
Each test performs its operation in a loop until a test period ends
(this is 5 seconds by default, but it is configurable).
Then the total count of loops done is divided by the actual elapsed
time to give the test result.
The tests have a configurable memslot cap with the "-s" test option, by
default the system maximum is used.
Each test is repeated a particular number of times (by default 20
times), the best result achieved is printed.
The test memory area is divided equally between memslots, the reminder
is added to the last memslot.
The test area size does not depend on the number of memslots in use.
The tests also measure the time that it took to add all these memslots.
The best result from the tests that use the whole test area is printed
after all the requested tests are done.
In general, these tests are designed to use as much memory as possible
(within reason) while still doing 100+ loops even on high memslot counts
with the default test length.
Increasing the test runtime makes it increasingly more likely that some
event will happen on the system during the test run, which might lower
the test result.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Andrew Jones <drjones@redhat.com>
Message-Id: <8d31bb3d92bc8fa33a9756fa802ee14266ab994e.1618253574.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: selftests: Keep track of memslots more efficiently
The KVM selftest framework was using a simple list for keeping track of
the memslots currently in use.
This resulted in lookups and adding a single memslot being O(n), the
later due to linear scanning of the existing memslot set to check for
the presence of any conflicting entries.
Before this change, benchmarking high count of memslots was more or less
impossible as pretty much all the benchmark time was spent in the
selftest framework code.
We can simply use a rbtree for keeping track of both of gfn and hva.
We don't need an interval tree for hva here as we can't have overlapping
memslots because we allocate a completely new memory chunk for each new
memslot.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Andrew Jones <drjones@redhat.com>
Message-Id: <b12749d47ee860468240cf027412c91b76dbe3db.1618253574.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 24 May 2021 12:27:38 +0000 (14:27 +0200)]
selftests: kvm: fix potential issue with ELF loading
vm_vaddr_alloc() sets up GVA to GPA mapping page by page; therefore, GPAs
may not be continuous if same memslot is used for data and page table allocation.
kvm_vm_elf_load() however expects a continuous range of HVAs (and thus GPAs)
because it does not try to read file data page by page. Fix this mismatch
by allocating memory in one step.
Reported-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wanpeng Li [Tue, 18 May 2021 12:00:33 +0000 (05:00 -0700)]
KVM: X86: Fix vCPU preempted state from guest's point of view
Commit 3760a1fc47cd (kvm: x86: only provide PV features if enabled in guest's
CPUID) avoids to access pv tlb shootdown host side logic when this pv feature
is not exposed to guest, however, kvm_steal_time.preempted not only leveraged
by pv tlb shootdown logic but also mitigate the lock holder preemption issue.
From guest's point of view, vCPU is always preempted since we lose the reset
of kvm_steal_time.preempted before vmentry if pv tlb shootdown feature is not
exposed. This patch fixes it by clearing kvm_steal_time.preempted before
vmentry.
Fixes: 3760a1fc47cd (kvm: x86: only provide PV features if enabled in guest's CPUID) Reviewed-by: Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-3-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wanpeng Li [Tue, 18 May 2021 12:00:32 +0000 (05:00 -0700)]
KVM: X86: Bail out of direct yield in case of under-committed scenarios
In case of under-committed scenarios, vCPUs can be scheduled easily;
kvm_vcpu_yield_to adds extra overhead, and it is also common to see
when vcpu->ready is true but yield later failing due to p->state is
TASK_RUNNING.
Let's bail out in such scenarios by checking the length of current cpu
runqueue, which can be treated as a hint of under-committed instead of
guarantee of accuracy. 30%+ of directed-yield attempts can now avoid
the expensive lookups in kvm_sched_yield() in an under-committed scenario.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-2-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wanpeng Li [Tue, 18 May 2021 12:00:31 +0000 (05:00 -0700)]
KVM: PPC: exit halt polling on need_resched()
This is inspired by commit ed7cda6f3d80e9 (kvm: exit halt polling on
need_resched() as well). Due to PPC implements an arch specific halt
polling logic, we have to the need_resched() check there as well. This
patch adds a helper function that can be shared between book3s and generic
halt-polling loops.
Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Venkatesh Srinivas <venkateshs@chromium.org> Cc: Ben Segall <bsegall@google.com> Cc: Venkatesh Srinivas <venkateshs@chromium.org> Cc: Jim Mattson <jmattson@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-1-git-send-email-wanpengli@tencent.com>
[Make the function inline. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Marc Zyngier [Mon, 24 May 2021 17:07:52 +0000 (18:07 +0100)]
KVM: arm64: Prevent mixed-width VM creation
It looks like we have tolerated creating mixed-width VMs since...
forever. However, that was never the intention, and we'd rather
not have to support that pointless complexity.
Forbid such a setup by making sure all the vcpus have the same
register width.
Zenghui Yu [Wed, 26 May 2021 14:18:31 +0000 (22:18 +0800)]
KVM: arm64: Resolve all pending PC updates before immediate exit
Commit 4c9174ce0a87 ("KVM: arm64: Commit pending PC adjustemnts before
returning to userspace") fixed the PC updating issue by forcing an explicit
synchronisation of the exception state on vcpu exit to userspace.
However, we forgot to take into account the case where immediate_exit is
set by userspace and KVM_RUN will exit immediately. Fix it by resolving all
pending PC updates before returning to userspace.
Since __kvm_adjust_pc() relies on a loaded vcpu context, I moved the
immediate_exit checking right after vcpu_load(). We will get some overhead
if immediate_exit is true (which should hopefully be rare).
Fixes: 4c9174ce0a87 ("KVM: arm64: Commit pending PC adjustemnts before returning to userspace") Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210526141831.1662-1-yuzenghui@huawei.com Cc: stable@vger.kernel.org # 5.11
Paolo Bonzini [Mon, 17 May 2021 07:55:12 +0000 (09:55 +0200)]
Merge tag 'kvmarm-fixes-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 5.13, take #1
- Fix regression with irqbypass not restarting the guest on failed connect
- Fix regression with debug register decoding resulting in overlapping access
- Commit exception state on exit to usrspace
- Fix the MMU notifier return values
- Add missing 'static' qualifiers in the new host stage-2 code
Marc Zyngier [Fri, 14 May 2021 08:05:41 +0000 (09:05 +0100)]
KVM: arm64: Fix debug register indexing
Commit c205a5d3b2ac7 ("KVM: arm64: Don't write junk to sysregs on
reset") flipped the register number to 0 for all the debug registers
in the sysreg table, hereby indicating that these registers live
in a separate shadow structure.
However, the author of this patch failed to realise that all the
accessors are using that particular index instead of the register
encoding, resulting in all the registers hitting index 0. Not quite
a valid implementation of the architecture...
Address the issue by fixing all the accessors to use the CRm field
of the encoding, which contains the debug register index.
Fixes: c205a5d3b2ac7 ("KVM: arm64: Don't write junk to sysregs on reset") Reported-by: Ricardo Koller <ricarkol@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org
Marc Zyngier [Thu, 6 May 2021 14:20:12 +0000 (15:20 +0100)]
KVM: arm64: Commit pending PC adjustemnts before returning to userspace
KVM currently updates PC (and the corresponding exception state)
using a two phase approach: first by setting a set of flags,
then by converting these flags into a state update when the vcpu
is about to enter the guest.
However, this creates a disconnect with userspace if the vcpu thread
returns there with any exception/PC flag set. In this case, the exposed
context is wrong, as userspace doesn't have access to these flags
(they aren't architectural). It also means that these flags are
preserved across a reset, which isn't expected.
To solve this problem, force an explicit synchronisation of the
exception state on vcpu exit to userspace. As an optimisation
for nVHE systems, only perform this when there is something pending.
arch/arm64/kvm/mmu.c:1114:9-10: WARNING: return of 0/1 in function 'kvm_age_gfn' with return type bool
arch/arm64/kvm/mmu.c:1084:9-10: WARNING: return of 0/1 in function 'kvm_set_spte_gfn' with return type bool
arch/arm64/kvm/mmu.c:1127:9-10: WARNING: return of 0/1 in function 'kvm_test_age_gfn' with return type bool
arch/arm64/kvm/mmu.c:1070:9-10: WARNING: return of 0/1 in function 'kvm_unmap_gfn_range' with return type bool
Return statements in functions returning bool should use
true/false instead of 1/0.
Generated by: scripts/coccinelle/misc/boolreturn.cocci
Fixes: 83edcee78f98 ("KVM: arm64: Convert to the gfn-based MMU notifier callbacks") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: kernel test robot <lkp@intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210426223357.GA45871@cd4295a34ed8
The reverted commit may cause VM freeze on arm64 with GICv4,
where stopping a consumer is implemented by suspending the VM.
Should the connect fail, the VM will not be resumed, which
is a bit of a problem.
It also erroneously calls the producer destructor unconditionally,
which is unexpected.
Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Suggested-by: Marc Zyngier <maz@kernel.org> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
[maz: tags and cc-stable, commit message update] Signed-off-by: Marc Zyngier <maz@kernel.org> Fixes: b25bf29ffea9 ("irqbypass: do not start cons/prod when failed connect") Link: https://lore.kernel.org/r/3a2c66d6-6ca0-8478-d24b-61e8e3241b20@hisilicon.com Link: https://lore.kernel.org/r/20210508071152.722425-1-lingshan.zhu@intel.com Cc: stable@vger.kernel.org
Linus Torvalds [Sun, 9 May 2021 21:03:33 +0000 (14:03 -0700)]
fbmem: fix horribly incorrect placement of __maybe_unused
Commit f4f4e6994baa ("fbmem: Mark proc_fb_seq_ops as __maybe_unused")
places the '__maybe_unused' in an entirely incorrect location between
the "struct" keyword and the structure name.
It's a wonder that gcc accepts that silently, but clang quite reasonably
warns about it:
Linus Torvalds [Sun, 9 May 2021 20:42:39 +0000 (13:42 -0700)]
Merge tag 'drm-next-2021-05-10' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Bit later than usual, I queued them all up on Friday then promptly
forgot to write the pull request email. This is mainly amdgpu fixes,
with some radeon/msm/fbdev and one i915 gvt fix thrown in.
amdgpu:
- MPO hang workaround
- Fix for concurrent VM flushes on vega/navi
- dcefclk is not adjustable on navi1x and newer
- MST HPD debugfs fix
- Suspend/resumes fixes
- Register VGA clients late in case driver fails to load
- Fix GEM leak in user framebuffer create
- Add support for polaris12 with 32 bit memory interface
- Fix duplicate cursor issue when using overlay
- Fix corruption with tiled surfaces on VCN3
- Add BO size and stride check to fix BO size verification
radeon:
- Fix off-by-one in power state parsing
- Fix possible memory leak in power state parsing
msm:
- NULL ptr dereference fix
fbdev:
- procfs disabled warning fix
i915:
- gvt: Fix a possible division by zero in vgpu display rate
calculation"
* tag 'drm-next-2021-05-10' of git://anongit.freedesktop.org/drm/drm:
drm/amdgpu: Use device specific BO size & stride check.
drm/amdgpu: Init GFX10_ADDR_CONFIG for VCN v3 in DPG mode.
drm/amd/pm: initialize variable
drm/radeon: Avoid power table parsing memory leaks
drm/radeon: Fix off-by-one power_state index heap overwrite
drm/amd/display: Fix two cursor duplication when using overlay
drm/amdgpu: add new MC firmware for Polaris12 32bit ASIC
fbmem: Mark proc_fb_seq_ops as __maybe_unused
drm/msm/dpu: Delete bonkers code
drm/i915/gvt: Prevent divided by zero when calculating refresh rate
amdgpu: fix GEM obj leak in amdgpu_display_user_framebuffer_create
drm/amdgpu: Register VGA clients after init can no longer fail
drm/amdgpu: Handling of amdgpu_device_resume return value for graceful teardown
drm/amdgpu: fix r initial values
drm/amd/display: fix wrong statement in mst hpd debugfs
amdgpu/pm: set pp_dpm_dcefclk to readonly on NAVI10 and newer gpus
amdgpu/pm: Prevent force of DCEFCLK on NAVI10 and SIENNA_CICHLID
drm/amdgpu: fix concurrent VM flushes on Vega/Navi v2
drm/amd/display: Reject non-zero src_y and src_x for video planes
Linus Torvalds [Sun, 9 May 2021 20:25:14 +0000 (13:25 -0700)]
Merge tag 'block-5.13-2021-05-09' of git://git.kernel.dk/linux-block
Pull block fix from Jens Axboe:
"Turns out the bio max size change still has issues, so let's get it
reverted for 5.13-rc1. We'll shake out the issues there and defer it
to 5.14 instead"
* tag 'block-5.13-2021-05-09' of git://git.kernel.dk/linux-block:
Revert "bio: limit bio max size"
Linus Torvalds [Sun, 9 May 2021 20:19:29 +0000 (13:19 -0700)]
Merge tag '5.13-rc-smb3-part3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three small SMB3 chmultichannel related changesets (also for stable)
from the SMB3 test event this week.
The other fixes are still in review/testing"
* tag '5.13-rc-smb3-part3' of git://git.samba.org/sfrench/cifs-2.6:
smb3: if max_channels set to more than one channel request multichannel
smb3: do not attempt multichannel to server which does not support it
smb3: when mounting with multichannel include it in requested capabilities
Linus Torvalds [Sun, 9 May 2021 20:14:34 +0000 (13:14 -0700)]
Merge tag 'sched-urgent-2021-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
"A set of scheduler updates:
- Prevent PSI state corruption when schedule() races with cgroup
move.
A recent commit combined two PSI callbacks to reduce the number of
cgroup tree updates, but missed that schedule() can drop rq::lock
for load balancing, which opens the race window for
cgroup_move_task() which then observes half updated state.
The fix is to solely use task::ps_flags instead of looking at the
potentially mismatching scheduler state
- Prevent an out-of-bounds access in uclamp caused bu a rounding
division which can lead to an off-by-one error exceeding the
buckets array size.
- Prevent unfairness caused by missing load decay when a task is
attached to a cfs runqueue.
The old load of the task was attached to the runqueue and never
removed. Fix it by enforcing the load update through the hierarchy
for unthrottled run queue instances.
- A documentation fix fot the 'sched_verbose' command line option"
* tag 'sched-urgent-2021-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix unfairness caused by missing load decay
sched: Fix out-of-bound access in uclamp
psi: Fix psi state corruption when schedule() races with cgroup move
sched,doc: sched_debug_verbose cmdline should be sched_verbose
Linus Torvalds [Sun, 9 May 2021 20:07:03 +0000 (13:07 -0700)]
Merge tag 'locking-urgent-2021-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
"A set of locking related fixes and updates:
- Two fixes for the futex syscall related to the timeout handling.
FUTEX_LOCK_PI does not support the FUTEX_CLOCK_REALTIME bit and
because it's not set the time namespace adjustment for clock
MONOTONIC is applied wrongly.
FUTEX_WAIT cannot support the FUTEX_CLOCK_REALTIME bit because its
always a relative timeout.
- Cleanups in the futex syscall entry points which became obvious
when the two timeout handling bugs were fixed.
- Cleanup of queued_write_lock_slowpath() as suggested by Linus
- Fixup of the smp_call_function_single_async() prototype"
* tag 'locking-urgent-2021-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Make syscall entry points less convoluted
futex: Get rid of the val2 conditional dance
futex: Do not apply time namespace adjustment on FUTEX_LOCK_PI
Revert 6a2ffac5de31 ("futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op")
locking/qrwlock: Cleanup queued_write_lock_slowpath()
smp: Fix smp_call_function_single_async prototype
Linus Torvalds [Sun, 9 May 2021 20:00:26 +0000 (13:00 -0700)]
Merge tag 'perf_urgent_for_v5.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf fix from Borislav Petkov:
"Handle power-gating of AMD IOMMU perf counters properly when they are
used"
* tag 'perf_urgent_for_v5.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/events/amd/iommu: Fix invalid Perf result due to IOMMU PMC power-gating
Linus Torvalds [Sun, 9 May 2021 19:52:25 +0000 (12:52 -0700)]
Merge tag 'x86_urgent_for_v5.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"A bunch of things accumulated for x86 in the last two weeks:
- Fix guest vtime accounting so that ticks happening while the guest
is running can also be accounted to it. Along with a consolidation
to the guest-specific context tracking helpers.
- Provide for the host NMI handler running after a VMX VMEXIT to be
able to run on the kernel stack correctly.
- Initialize MSR_TSC_AUX when RDPID is supported and not RDTSCP (virt
relevant - real hw supports both)
- A code generation improvement to TASK_SIZE_MAX through the use of
alternatives
- The usual misc and related cleanups and improvements"
* tag 'x86_urgent_for_v5.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
KVM: x86: Consolidate guest enter/exit logic to common helpers
context_tracking: KVM: Move guest enter/exit wrappers to KVM's domain
context_tracking: Consolidate guest enter/exit wrappers
sched/vtime: Move guest enter/exit vtime accounting to vtime.h
sched/vtime: Move vtime accounting external declarations above inlines
KVM: x86: Defer vtime accounting 'til after IRQ handling
context_tracking: Move guest exit vtime accounting to separate helpers
context_tracking: Move guest exit context tracking to separate helpers
KVM/VMX: Invoke NMI non-IST entry instead of IST entry
x86/cpu: Remove write_tsc() and write_rdtscp_aux() wrappers
x86/cpu: Initialize MSR_TSC_AUX if RDTSCP *or* RDPID is supported
x86/resctrl: Fix init const confusion
x86: Delete UD0, UD1 traces
x86/smpboot: Remove duplicate includes
x86/cpu: Use alternative to generate the TASK_SIZE_MAX constant
Linus Torvalds [Sat, 8 May 2021 18:52:37 +0000 (11:52 -0700)]
Merge tag 'riscv-for-linus-5.13-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A fix to avoid over-allocating the kernel's mapping on !MMU systems,
which could lead to up to 2MiB of lost memory
- The SiFive address extension errata only manifest on rv64, they are
now disabled on rv32 where they are unnecessary
- A pair of late-landing cleanups
* tag 'riscv-for-linus-5.13-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: remove unused handle_exception symbol
riscv: Consistify protect_kernel_linear_mapping_text_rodata() use
riscv: enable SiFive errata CIP-453 and CIP-1200 Kconfig only if CONFIG_64BIT=y
riscv: Only extend kernel reservation if mapped read-only
Linus Torvalds [Sat, 8 May 2021 18:30:22 +0000 (11:30 -0700)]
drm/i915/display: fix compiler warning about array overrun
intel_dp_check_mst_status() uses a 14-byte array to read the DPRX Event
Status Indicator data, but then passes that buffer at offset 10 off as
an argument to drm_dp_channel_eq_ok().
End result: there are only 4 bytes remaining of the buffer, yet
drm_dp_channel_eq_ok() wants a 6-byte buffer. gcc-11 correctly warns
about this case:
drivers/gpu/drm/i915/display/intel_dp.c: In function ‘intel_dp_check_mst_status’:
drivers/gpu/drm/i915/display/intel_dp.c:3491:22: warning: ‘drm_dp_channel_eq_ok’ reading 6 bytes from a region of size 4 [-Wstringop-overread]
3491 | !drm_dp_channel_eq_ok(&esi[10], intel_dp->lane_count)) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/i915/display/intel_dp.c:3491:22: note: referencing argument 1 of type ‘const u8 *’ {aka ‘const unsigned char *’}
In file included from drivers/gpu/drm/i915/display/intel_dp.c:38:
include/drm/drm_dp_helper.h:1466:6: note: in a call to function ‘drm_dp_channel_eq_ok’
1466 | bool drm_dp_channel_eq_ok(const u8 link_status[DP_LINK_STATUS_SIZE],
| ^~~~~~~~~~~~~~~~~~~~
6:14 elapsed
This commit just extends the original array by 2 zero-initialized bytes,
avoiding the warning.
There may be some underlying bug in here that caused this confusion, but
this is at least no worse than the existing situation that could use
random data off the stack.
Cc: Jani Nikula <jani.nikula@intel.com> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dave Airlie <airlied@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>