Paolo Bonzini [Tue, 17 Aug 2021 11:22:32 +0000 (07:22 -0400)]
KVM: MMU: inline set_spte in FNAME(sync_page)
Since the two callers of set_spte do different things with the results,
inlining it actually makes the code simpler to reason about. For example,
FNAME(sync_page) already has a struct kvm_mmu_page *, but set_spte had to
fish it back out of sptep's private page data.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 17 Aug 2021 11:22:32 +0000 (07:22 -0400)]
KVM: MMU: inline set_spte in mmu_set_spte
Since the two callers of set_spte do different things with the results,
inlining it actually makes the code simpler to reason about. For example,
mmu_set_spte looks quite like tdp_mmu_map_handle_target_level, but the
similarity is hidden by set_spte.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Fri, 13 Aug 2021 20:35:02 +0000 (20:35 +0000)]
KVM: x86/mmu: Avoid memslot lookup in page_fault_handle_page_track
Now that kvm_page_fault has a pointer to the memslot it can be passed
down to the page tracking code to avoid a redundant slot lookup.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210813203504.2742757-5-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Fri, 24 Sep 2021 09:05:26 +0000 (05:05 -0400)]
KVM: x86/mmu: Pass the memslot around via struct kvm_page_fault
The memslot for the faulting gfn is used throughout the page fault
handling code, so capture it in kvm_page_fault as soon as we know the
gfn and use it in the page fault handling code that has direct access
to the kvm_page_fault struct. Replace various tests using is_noslot_pfn
with more direct tests on fault->slot being NULL.
This, in combination with the subsequent patch, improves "Populate
memory time" in dirty_log_perf_test by 5% when using the legacy MMU.
There is no discernible improvement to the performance of the TDP MMU.
No functional change intended.
Suggested-by: Ben Gardon <bgardon@google.com> Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210813203504.2742757-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
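The idea of capturing the memslot once and testing it for NULL later can be sketched in userspace C. This is a minimal illustration, not the kernel code: toy_memslot, toy_page_fault, toy_lookup_slot(), toy_fault_init() and toy_fault_has_slot() are hypothetical stand-ins, and the linear slot search is an assumption made only to keep the example short.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a memslot: a contiguous range of guest frames. */
struct toy_memslot {
	uint64_t base_gfn;
	uint64_t npages;
};

/* Hypothetical stand-in for the fault structure. */
struct toy_page_fault {
	uint64_t gfn;
	const struct toy_memslot *slot; /* NULL when no slot covers gfn */
};

static int toy_lookups; /* counts slot searches, to show there is only one */

static const struct toy_memslot *toy_lookup_slot(const struct toy_memslot *slots,
						  size_t nslots, uint64_t gfn)
{
	toy_lookups++;
	for (size_t i = 0; i < nslots; i++) {
		if (gfn >= slots[i].base_gfn &&
		    gfn - slots[i].base_gfn < slots[i].npages)
			return &slots[i];
	}
	return NULL;
}

/* Capture the slot as soon as the gfn is known. */
static void toy_fault_init(struct toy_page_fault *fault,
			   const struct toy_memslot *slots, size_t nslots,
			   uint64_t gfn)
{
	fault->gfn = gfn;
	fault->slot = toy_lookup_slot(slots, nslots, gfn);
}

/* Later stages test the cached pointer instead of searching again. */
static bool toy_fault_has_slot(const struct toy_page_fault *fault)
{
	return fault->slot != NULL;
}
```

Every later stage that receives the fault structure can call toy_fault_has_slot() any number of times without triggering another search.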
Paolo Bonzini [Thu, 23 Sep 2021 15:20:48 +0000 (11:20 -0400)]
KVM: MMU: unify tdp_mmu_map_set_spte_atomic and tdp_mmu_set_spte_atomic_no_dirty_log
tdp_mmu_map_set_spte_atomic no longer takes care of dirty logging, so
the only remaining difference is that it takes a vCPU instead of
the struct kvm. Merge the two functions.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 24 Sep 2021 08:52:23 +0000 (04:52 -0400)]
KVM: MMU: mark page dirty in make_spte
This simplifies set_spte, which we want to remove, and unifies code
between the shadow MMU and the TDP MMU. The warning will be added
back later to make_spte as well.
There is a small disadvantage in the TDP MMU; it may unnecessarily mark
a page as dirty twice if two vCPUs end up mapping the same page twice.
However, this is a very small cost for a case that is already rare.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Fri, 13 Aug 2021 20:35:00 +0000 (20:35 +0000)]
KVM: x86/mmu: Fold rmap_recycle into rmap_add
Consolidate rmap_recycle and rmap_add into a single function since they
are only ever called together (and only from one place). This has a nice
side effect of eliminating an extra kvm_vcpu_gfn_to_memslot(). In
addition it makes mmu_set_spte(), which is a very long function, a
little shorter.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210813203504.2742757-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86/mmu: Verify shadow walk doesn't terminate early in page faults
WARN and bail if the shadow walk for faulting in a SPTE terminates early,
i.e. doesn't reach the expected level because the walk encountered a
terminal SPTE. The shadow walks for page faults are subtle in that they
install non-leaf SPTEs (zapping leaf SPTEs if necessary!) in the loop
body, and consume the newly created non-leaf SPTE in the loop control,
e.g. __shadow_walk_next(). In other words, the walks guarantee that the
walk will stop if and only if the target level is reached by installing
non-leaf SPTEs to guarantee the walk remains valid.
Opportunistically use fault->goal_level instead of it.level in
FNAME(fetch) to further clarify that KVM always installs the leaf SPTE at
the target level.
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210906122547.263316-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 6 Aug 2021 08:35:50 +0000 (04:35 -0400)]
KVM: MMU: change tracepoints arguments to kvm_page_fault
Pass struct kvm_page_fault to tracepoints instead of extracting the
arguments from the struct. This also lets the kvm_mmu_spte_requested
tracepoint pick the gfn directly from fault->gfn, instead of using
the address.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 6 Aug 2021 08:35:50 +0000 (04:35 -0400)]
KVM: MMU: change disallowed_hugepage_adjust() arguments to kvm_page_fault
Pass struct kvm_page_fault to disallowed_hugepage_adjust() instead of
extracting the arguments from the struct. Tweak the conditions a bit
to avoid long lines.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Sat, 7 Aug 2021 13:21:53 +0000 (09:21 -0400)]
KVM: MMU: change kvm_mmu_hugepage_adjust() arguments to kvm_page_fault
Pass struct kvm_page_fault to kvm_mmu_hugepage_adjust() instead of
extracting the arguments from the struct; the results are also stored
in the struct, so the callers are adjusted accordingly.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Sat, 7 Aug 2021 12:57:34 +0000 (08:57 -0400)]
KVM: MMU: change kvm_faultin_pfn() arguments to kvm_page_fault
Add fields to struct kvm_page_fault corresponding to outputs of
kvm_faultin_pfn(). For now they have to be extracted again from struct
kvm_page_fault in the subsequent steps, but this is temporary until
other functions in the chain are switched over as well.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 6 Aug 2021 08:21:58 +0000 (04:21 -0400)]
KVM: MMU: change page_fault_handle_page_track() arguments to kvm_page_fault
Add fields to struct kvm_page_fault corresponding to the arguments
of page_fault_handle_page_track(). The fields are initialized in the
callers, and page_fault_handle_page_track() receives a struct
kvm_page_fault instead of having to extract the arguments out of it.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 6 Aug 2021 08:21:58 +0000 (04:21 -0400)]
KVM: MMU: change direct_page_fault() arguments to kvm_page_fault
Add fields to struct kvm_page_fault corresponding to
the arguments of direct_page_fault(). The fields are
initialized in the callers, and direct_page_fault()
receives a struct kvm_page_fault instead of having to
extract the arguments out of it.
Also adjust FNAME(page_fault) to store the max_level in
struct kvm_page_fault, to keep it similar to the direct
map path.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 6 Aug 2021 08:35:50 +0000 (04:35 -0400)]
KVM: MMU: change mmu->page_fault() arguments to kvm_page_fault
Pass struct kvm_page_fault to mmu->page_fault() instead of
extracting the arguments from the struct. FNAME(page_fault) can use
the precomputed bools from the error code.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 6 Aug 2021 07:52:18 +0000 (03:52 -0400)]
KVM: MMU: Introduce struct kvm_page_fault
Create a single structure for arguments that are passed from
kvm_mmu_do_page_fault to the page fault handlers. Later
the structure will grow to include various output parameters
that are passed back to the next steps in the page fault
handling.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
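As a rough illustration of bundling page fault arguments into one structure, the sketch below decodes an x86 page fault error code once into booleans that later stages can read directly. The bit positions are the architectural x86 error code bits; toy_fault_args, toy_fault_args_make() and the TOY_PFERR_* names are hypothetical and are not claimed to match the fields of struct kvm_page_fault.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page fault error code bits (architectural positions). */
#define TOY_PFERR_PRESENT (1u << 0)
#define TOY_PFERR_WRITE   (1u << 1)
#define TOY_PFERR_USER    (1u << 2)
#define TOY_PFERR_RSVD    (1u << 3)
#define TOY_PFERR_FETCH   (1u << 4)

/* Hypothetical argument bundle passed down the fault handling chain. */
struct toy_fault_args {
	/* inputs */
	uint64_t addr;
	uint32_t error_code;
	bool prefetch;
	/* derived once from error_code */
	bool present;
	bool write;
	bool user;
	bool rsvd;
	bool exec;
};

static struct toy_fault_args toy_fault_args_make(uint64_t addr,
						 uint32_t error_code,
						 bool prefetch)
{
	struct toy_fault_args f = {
		.addr = addr,
		.error_code = error_code,
		.prefetch = prefetch,
		.present = (error_code & TOY_PFERR_PRESENT) != 0,
		.write = (error_code & TOY_PFERR_WRITE) != 0,
		.user = (error_code & TOY_PFERR_USER) != 0,
		.rsvd = (error_code & TOY_PFERR_RSVD) != 0,
		.exec = (error_code & TOY_PFERR_FETCH) != 0,
	};
	return f;
}
```

Callees then take a pointer to the bundle, so adding an output field later does not change any function signature.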
Paolo Bonzini [Fri, 6 Aug 2021 09:21:17 +0000 (05:21 -0400)]
KVM: MMU: pass unadulterated gpa to direct_page_fault
Do not bother removing the low bits of the gpa. This masking dates back
to the very first commit of KVM but it is unnecessary, as exemplified
by the other call in kvm_tdp_page_fault.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Oliver Upton [Thu, 16 Sep 2021 18:15:34 +0000 (18:15 +0000)]
KVM: x86: Fix potential race in KVM_GET_CLOCK
Sean noticed that KVM_GET_CLOCK was checking kvm_arch.use_master_clock
outside of the pvclock sync lock. This is problematic, as the clock
value written to the user may or may not actually correspond to a stable
TSC.
Fix the race by populating the entire kvm_clock_data structure behind
the pvclock_gtod_sync_lock.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20210916181538.968978-4-oupton@google.com> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
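The shape of the fix, reading the "master clock in use" flag and filling the whole result while holding one lock, can be sketched as follows. This is a userspace approximation under stated assumptions: a C11 atomic_flag spinlock stands in for pvclock_gtod_sync_lock, and toy_clock_state, toy_clock_data, toy_get_clock() and TOY_CLOCK_TSC_STABLE are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_CLOCK_TSC_STABLE 2u /* hypothetical flag value */

struct toy_clock_state {
	atomic_flag lock;       /* toy spinlock standing in for the sync lock */
	bool use_master_clock;
	uint64_t master_ns;
};

struct toy_clock_data {
	uint64_t clock;
	uint32_t flags;
};

static void toy_lock(struct toy_clock_state *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
		; /* spin */
}

static void toy_unlock(struct toy_clock_state *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/*
 * Both the decision (use_master_clock) and the values derived from it are
 * read under the same lock, so the reported flag always matches the clock.
 */
static void toy_get_clock(struct toy_clock_state *s, uint64_t fallback_ns,
			  struct toy_clock_data *out)
{
	toy_lock(s);
	if (s->use_master_clock) {
		out->clock = s->master_ns;
		out->flags = TOY_CLOCK_TSC_STABLE;
	} else {
		out->clock = fallback_ns;
		out->flags = 0;
	}
	toy_unlock(s);
}
```

Checking use_master_clock before taking the lock would reintroduce the window in which the flag and the clock value can disagree.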
Paolo Bonzini [Thu, 16 Sep 2021 18:15:32 +0000 (18:15 +0000)]
kvm: x86: abstract locking around pvclock_update_vm_gtod_copy
Updates to the kvmclock parameters need to do a complicated dance of
KVM_REQ_MCLOCK_INPROGRESS and KVM_REQ_CLOCK_UPDATE in addition to taking
pvclock_gtod_sync_lock. Place that in two functions that can be called
on all of master clock update, KVM_SET_CLOCK, and Hyper-V reenlightenment.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Lai Jiangshan [Mon, 6 Sep 2021 12:25:47 +0000 (20:25 +0800)]
KVM: X86: Move PTE present check from loop body to __shadow_walk_next()
So far, the loop bodies already ensure the PTE is present before calling
__shadow_walk_next(): Some loop bodies simply exit with a !PRESENT
directly and some other loop bodies, i.e. FNAME(fetch) and __direct_map()
do not currently guard their walks with is_shadow_present_pte, but only
because they install present non-leaf SPTEs in the loop itself.
But checking that the PTE is present in __shadow_walk_next() (which is called from
shadow_walk_okay()) is more prudent; walking past a !PRESENT SPTE
would lead to attempting to read the next level SPTE from a garbage
iter->shadow_addr. It also allows removing the is_shadow_present_pte()
checks from the loop bodies.
Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210906122547.263316-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
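A toy model of a walk whose "next" step performs the present check itself is sketched below. It is a simplification under stated assumptions: one entry per level, bit 0 as the present bit, and the names toy_iter, toy_walk_okay(), toy_walk_next() and toy_walk_visited() are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PRESENT    1ull
#define TOY_ROOT_LEVEL 4

struct toy_iter {
	const uint64_t *sptes; /* sptes[level - 1] is the entry at 'level' */
	int level;             /* 0 means the walk is over */
};

static bool toy_walk_okay(const struct toy_iter *it)
{
	return it->level >= 1;
}

/* The step itself refuses to descend through a non-present entry. */
static void toy_walk_next(struct toy_iter *it)
{
	uint64_t spte = it->sptes[it->level - 1];

	if (!(spte & TOY_PRESENT) || it->level == 1) {
		it->level = 0;
		return;
	}
	it->level--;
}

/* Loop bodies no longer need their own present check. */
static int toy_walk_visited(const uint64_t *sptes)
{
	struct toy_iter it = { sptes, TOY_ROOT_LEVEL };
	int visited = 0;

	for (; toy_walk_okay(&it); toy_walk_next(&it))
		visited++;
	return visited;
}
```

With the check in the step function, a loop body that forgets to test for presence still cannot walk into a table that does not exist.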
Paolo Bonzini [Thu, 23 Sep 2021 16:46:07 +0000 (12:46 -0400)]
KVM: x86: SVM: don't set VMLOAD/VMSAVE intercepts on vCPU reset
Commit 69055e5a91f4 ("KVM: nSVM: improve SYSENTER emulation on AMD"),
made init_vmcb set vmload/vmsave intercepts unconditionally,
and relied on svm_vcpu_after_set_cpuid to clear them when possible.
However init_vmcb is also called when the vCPU is reset, and it is
not followed by another call to svm_vcpu_after_set_cpuid because
the CPUID is already set. This mistake causes the VMSAVE/VMLOAD intercepts
to be set when they are not needed, and harms performance of the nested
guest.
Extract the relevant parts of svm_vcpu_after_set_cpuid so that they
can be called again on reset.
Fixes: 69055e5a91f4 ("KVM: nSVM: improve SYSENTER emulation on AMD") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Longpeng(Mike) [Fri, 27 Aug 2021 08:00:03 +0000 (16:00 +0800)]
kvm: irqfd: avoid update unmodified entries of the routing
All of the irqfds are updated whenever the irq routing is updated,
which is too expensive when there are many irqfds.
However, the cost can be reduced by avoiding unnecessary
updates. For irqs of MSI type on X86, the update can be
skipped if the MSI values have not changed.
VFIO migration benefits from this optimization. The test VM
has 128 vCPUs and 8 VFs (with 65 vectors
enabled), so the VM has more than 520 irqfds. We measured the
cost of vfio_msix_enable (in QEMU, it sets the routing
for each irqfd) for each VF, and the total cost
is significantly reduced.
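The optimization amounts to comparing the old and new MSI values and skipping the update when they are equal. A minimal sketch, assuming a hypothetical toy_msi_route with address and data fields and a hypothetical toy_irqfd that counts how often it was really updated:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical MSI routing entry. */
struct toy_msi_route {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

/* Hypothetical irqfd with a cached route and an update counter. */
struct toy_irqfd {
	struct toy_msi_route route;
	unsigned int updates;
};

static bool toy_msi_route_equal(const struct toy_msi_route *a,
				const struct toy_msi_route *b)
{
	return a->address_lo == b->address_lo &&
	       a->address_hi == b->address_hi &&
	       a->data == b->data;
}

/* Skip the (expensive) update when the MSI values did not change. */
static void toy_irqfd_update(struct toy_irqfd *fd,
			     const struct toy_msi_route *new_route)
{
	if (toy_msi_route_equal(&fd->route, new_route))
		return;
	fd->route = *new_route;
	fd->updates++;
}
```

The fields are compared one by one rather than with memcmp() so that padding bytes, if a real structure had any, cannot cause a spurious mismatch.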
Lai Jiangshan [Sat, 18 Sep 2021 00:56:36 +0000 (08:56 +0800)]
KVM: X86: Don't check unsync if the original spte is writible
If the original spte is writable, the target gfn should not be the
gfn of a synchronized shadow page, and the spte can continue to be writable.
When !can_unsync, speculative must be false. So when the
"!can_unsync" check is removed, the "out" label needs to be moved up.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-11-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Lai Jiangshan [Sat, 18 Sep 2021 00:56:35 +0000 (08:56 +0800)]
KVM: X86: Don't unsync pagetables when speculative
We should only unsync the pagetable when there was a real
write fault on a level-1 pagetable.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-10-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Lai Jiangshan [Sat, 18 Sep 2021 00:56:34 +0000 (08:56 +0800)]
KVM: X86: Remove FNAME(update_pte)
Its sole caller is changed to use FNAME(prefetch_gpte) directly.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-9-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Lai Jiangshan [Sat, 18 Sep 2021 00:56:33 +0000 (08:56 +0800)]
KVM: X86: Zap the invalid list after remote tlb flushing
mmu_sync_children() can zap the invalid list after remote TLB flushing.
Emptying the invalid list ASAP might help avoid a remote TLB flush
in some cases.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-8-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Lai Jiangshan [Sat, 18 Sep 2021 00:56:32 +0000 (08:56 +0800)]
KVM: X86: Change kvm_sync_page() to return true when remote flush is needed
Currently kvm_sync_page() returns true when there is any present spte.
But the return value is ignored in the callers.
Changing kvm_sync_page() to return true when remote flush is needed and
changing mmu->sync_page() not to directly flush can combine and reduce
remote flush requests.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-7-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
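The benefit of returning "remote flush needed" instead of flushing inside the per-page sync can be sketched as follows. This is an illustrative model only: toy_shadow_page, toy_sync_page(), toy_sync_pages() and the flush counter are hypothetical, and "the entry changed" is an assumed stand-in for whatever condition really requires a flush.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shadow page: one shadow entry and the guest entry it mirrors. */
struct toy_shadow_page {
	uint64_t spte;
	uint64_t guest_pte;
};

static int toy_remote_flushes;

static void toy_flush_remote_tlbs(void)
{
	toy_remote_flushes++;
}

/* Returns true when a remote flush is needed; does not flush by itself. */
static bool toy_sync_page(struct toy_shadow_page *sp)
{
	if (sp->spte == sp->guest_pte)
		return false;
	sp->spte = sp->guest_pte;
	return true;
}

/* The caller accumulates the results and issues at most one flush. */
static void toy_sync_pages(struct toy_shadow_page *pages, size_t n)
{
	bool flush = false;

	for (size_t i = 0; i < n; i++)
		flush |= toy_sync_page(&pages[i]);
	if (flush)
		toy_flush_remote_tlbs();
}
```

Syncing N stale pages this way costs one flush request rather than N.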
Lai Jiangshan [Sat, 18 Sep 2021 00:56:31 +0000 (08:56 +0800)]
KVM: X86: Remove kvm_mmu_flush_or_zap()
Because local_flush is useless, kvm_mmu_flush_or_zap() can be removed
and kvm_mmu_remote_flush_or_zap is used instead.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-6-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Lai Jiangshan [Sat, 18 Sep 2021 00:56:30 +0000 (08:56 +0800)]
KVM: X86: Don't flush current tlb on shadow page modification
After any shadow page modification, flushing the TLB only on the current vCPU
is odd, because the TLBs of other vCPUs might still be stale.
In other words, if any TLB flush is mandatory after a shadow page
modification, SET_SPTE_NEED_REMOTE_TLB_FLUSH or remote_flush should be
set and the TLBs of all vCPUs should be flushed. There is no point in
flushing only the current TLB, except when the request comes from the
vCPU's or pCPU's activities.
If there is a bug where a mandatory TLB flush is required but
SET_SPTE_NEED_REMOTE_TLB_FLUSH/remote_flush fails to be set, this patch
would expose the bug in a more destructive way. The related code paths
were checked and no missing SET_SPTE_NEED_REMOTE_TLB_FLUSH has been found so far.
Currently, no optional TLB flush remains now that the sync page related code
has been changed to flush the TLB in a timely manner. So we can just remove
the local flushing code.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86/mmu: Complete prefetch for trailing SPTEs for direct, legacy MMU
Make a final call to direct_pte_prefetch_many() if there are "trailing"
SPTEs to prefetch, i.e. SPTEs for GFNs following the faulting GFN. The
call to direct_pte_prefetch_many() in the loop only handles the case
where there are !PRESENT SPTEs preceding a PRESENT SPTE.
E.g. if the faulting GFN is a multiple of 8 (the prefetch size) and all
SPTEs for the following GFNs are !PRESENT, the loop will terminate with
"start = sptep+1" and not prefetch any SPTEs.
Prefetching trailing SPTEs as intended can drastically reduce the number
of guest page faults, e.g. accessing the first byte of every 4kb page in
a 6gb chunk of virtual memory, in a VM with 8gb of preallocated memory,
the number of pf_fixed events observed in L0 drops from ~1.75M to <0.27M.
Note, this only affects memory that is backed by 4kb pages as KVM doesn't
prefetch when installing hugepages. Shadow paging prefetching is not
affected as it does not batch the prefetches due to the need to process
the corresponding guest PTE. The TDP MMU is not affected because it
doesn't have prefetching, yet...
Fixes: 37c58d016c00 ("KVM: MMU: prefetch ptes when intercepted guest #PF") Cc: Sergey Senozhatsky <senozhatsky@google.com> Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210818235615.2047588-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
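The loop shape described above, and the effect of the final call, can be reproduced in a small userspace model. This is a sketch under stated assumptions: an array of booleans stands in for the SPTEs of one 8-entry prefetch window, and toy_prefetch_many() and toy_prefetch() are hypothetical names, with a flag that switches the trailing call on or off.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_PREFETCH_NUM 8

/* Hypothetical stand-in: mark entries [start, end) present, return the count. */
static int toy_prefetch_many(bool *present, size_t start, size_t end)
{
	for (size_t i = start; i < end; i++)
		present[i] = true;
	return (int)(end - start);
}

/*
 * Walk the aligned window around fault_idx. A run of non-present entries is
 * prefetched when a present entry (or the faulting entry) ends it. A run
 * that extends to the end of the window is only handled by the final call.
 */
static int toy_prefetch(bool *present, size_t fault_idx, bool handle_trailing)
{
	size_t base = fault_idx & ~(size_t)(TOY_PREFETCH_NUM - 1);
	size_t start = 0;
	bool have_start = false;
	int total = 0;
	size_t i;

	for (i = base; i < base + TOY_PREFETCH_NUM; i++) {
		if (present[i] || i == fault_idx) {
			if (!have_start)
				continue;
			total += toy_prefetch_many(present, start, i);
			have_start = false;
		} else if (!have_start) {
			start = i;
			have_start = true;
		}
	}
	if (handle_trailing && have_start)
		total += toy_prefetch_many(present, start, i);
	return total;
}
```

With the faulting entry at the start of the window and nothing else present, the loop alone prefetches nothing, while the version with the trailing call covers the remaining seven entries.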
Thomas Huth [Thu, 26 Aug 2021 07:49:28 +0000 (09:49 +0200)]
KVM: selftests: Fix kvm_vm_free() in cr4_cpuid_sync and vmx_tsc_adjust tests
The kvm_vm_free() statement here is currently dead code, since the loop
in front of it can only be left with the "goto done" that jumps right
after the kvm_vm_free(). Fix it by swapping the locations of the "done"
label and the kvm_vm_free().
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20210826074928.240942-1-thuth@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
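The corrected control flow can be illustrated with a minimal sketch in which the label precedes the cleanup call, so the only exit from the loop still reaches it. The names toy_vm_free() and toy_run() are hypothetical and the loop body is reduced to a stage counter.

```c
static int toy_frees;

/* Hypothetical stand-in for the cleanup call. */
static void toy_vm_free(void)
{
	toy_frees++;
}

/*
 * The loop has no exit other than "goto done", so cleanup placed before the
 * label would never run. Placing the label in front of the cleanup call
 * makes it reachable.
 */
static void toy_run(int stages)
{
	for (int stage = 1;; stage++) {
		if (stage >= stages)
			goto done;
	}
done:
	toy_vm_free();
}
```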
There is a spelling mistake in an error message. Fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Message-Id: <20210826120752.12633-1-colin.king@canonical.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Manually retrieve CPUID.0x1 when getting FMS for RESET/INIT
Manually look for a CPUID.0x1 entry instead of bouncing through
kvm_cpuid() when retrieving the Family-Model-Stepping information for
vCPU RESET/INIT. This fixes a potential undefined behavior bug due to
kvm_cpuid() using the uninitialized "dummy" param as the ECX _input_,
a.k.a. the index.
A more minimal fix would be to simply zero "dummy", but the extra work in
kvm_cpuid() is wasteful, and KVM should be treating the FMS retrieval as
an out-of-band access, e.g. same as how KVM computes guest.MAXPHYADDR.
Both Intel's SDM and AMD's APM describe the RDX value at RESET/INIT as
holding the CPU's FMS information, not as holding CPUID.0x1.EAX. KVM's
usage of CPUID entries to get FMS is simply a pragmatic approach to avoid
having yet another way for userspace to provide inconsistent data.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210929222426.1855730-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
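Looking up the CPUID.0x1 entry directly, without going through a general CPUID emulation path that takes an index input, can be sketched like this. The structure and function names (toy_cpuid_entry, toy_get_fms()) are hypothetical, and the fallback value is a caller-supplied assumption rather than KVM's actual default.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPUID table entry. */
struct toy_cpuid_entry {
	uint32_t function;
	uint32_t index;
	uint32_t eax, ebx, ecx, edx;
};

/*
 * Return the Family-Model-Stepping value, i.e. EAX of leaf 0x1, or the
 * caller-supplied fallback when the table has no such leaf. No index
 * (ECX input) is consulted, so no uninitialized value can leak in.
 */
static uint32_t toy_get_fms(const struct toy_cpuid_entry *entries, size_t n,
			    uint32_t fallback)
{
	for (size_t i = 0; i < n; i++) {
		if (entries[i].function == 1)
			return entries[i].eax;
	}
	return fallback;
}
```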
KVM: x86: WARN on non-zero CRs at RESET to detect improper initalization
WARN if CR0, CR3, or CR4 are non-zero at RESET, which given the current
KVM implementation, really means WARN if they're not zeroed at vCPU
creation. VMX in particular has several ->set_*() flows that read other
registers to handle side effects, and because those flows are common to
RESET and INIT, KVM subtly relies on emulated/virtualized registers to be
zeroed at vCPU creation in order to do the right thing at RESET.
Use CRs as a sentinel because they are most likely to be written as side
effects, and because KVM specifically needs CR0.PG and CR0.PE to be '0'
to correctly reflect the state of the vCPU's MMU. CRs are also loaded
and stored from/to the VMCS, and so adds some level of coverage to verify
that KVM doesn't conflate zero-allocating the VMCS with properly
initializing the VMCS with VMWRITEs.
Note, '0' is somewhat arbitrary, vCPU creation can technically stuff any
value for a register so long as it's coherent with respect to the current
vCPU state. In practice, '0' works for all registers and is convenient.
Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SVM: Move RESET emulation to svm_vcpu_reset()
Move RESET emulation for SVM vCPUs to svm_vcpu_reset(), and drop an extra
init_vmcb() from svm_create_vcpu() in the process. Hopefully KVM will
someday expose a dedicated RESET ioctl(), and in the meantime separating
"create" from "RESET" is a nice cleanup.
Keep the call to svm_switch_vmcb() so that misuse of svm->vmcb at worst
breaks the guest, e.g. premature accesses don't cause a NULL pointer
dereference.
Cc: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: VMX: Move RESET emulation to vmx_vcpu_reset()
Move vCPU RESET emulation, including initialization of select VMCS state,
to vmx_vcpu_reset(). Drop the open coded "vCPU load" sequence, as
->vcpu_reset() is invoked while the vCPU is properly loaded (which is
kind of the point of ->vcpu_reset()...). Hopefully KVM will someday
expose a dedicated RESET ioctl(), and in the meantime separating "create"
from "RESET" is a nice cleanup.
Deferring VMCS initialization is effectively a nop as it's impossible to
safely access the VMCS between the current call site and its new home, as
both the vCPU and the pCPU are put immediately after init_vmcs(), i.e.
the VMCS isn't guaranteed to be loaded.
Note, task preemption is not a problem as vmx_sched_in() _can't_ touch
the VMCS as ->sched_in() is invoked before the vCPU, and thus VMCS, is
reloaded. I.e. the preemption path also can't consume VMCS state.
Cc: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: VMX: Drop explicit zeroing of MSR guest values at vCPU creation
Don't zero out user return and nested MSRs during vCPU creation, and
instead rely on vcpu_vmx being zero-allocated. Explicitly zeroing MSRs
is not wrong, and is in fact necessary if KVM ever emulates vCPU RESET
outside of vCPU creation, but zeroing only a subset of MSRs is confusing.
Poking directly into KVM's backing is also undesirable in that it doesn't
scale and is error prone. Ideally KVM would have a common RESET path for
all MSRs, e.g. by expanding kvm_set_msr(), which would obviate the need
for this out-of-band code (to support standalone RESET).
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Fold fx_init() into kvm_arch_vcpu_create()
Move the few bits of relevant fx_init() code into kvm_arch_vcpu_create(),
dropping the superfluous check on vcpu->arch.guest_fpu that was blindly
and wrongly added by commit 2f43e8a68a13 ("KVM: SVM: Guest FPU state
save/restore not needed for SEV-ES guest").
Note, KVM currently allocates and then frees FPU state for SEV-ES guests,
rather than avoid the allocation in the first place. While that approach
is inarguably inefficient and unnecessary, it's a cleanup for the future.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Remove defunct setting of XCR0 for guest during vCPU create
Drop code to initialize XCR0 during fx_init(), a.k.a. vCPU creation, as
XCR0 has been initialized during kvm_vcpu_reset() (for RESET) since
commit 90751d1d97a4 ("KVM: X86: Processor States following Reset or INIT").
Back when XCR0 support was added by commit 9e34e753b462 ("KVM: VMX:
Enable XSAVE/XRSTOR for guest"), KVM didn't differentiate between RESET
and INIT. Ignoring the fact that calling fx_init() for INIT is obviously
wrong, e.g. FPU state after INIT is not the same as after RESET, setting
XCR0 in fx_init() was correct.
Eventually fx_init() got moved to kvm_arch_vcpu_init(), a.k.a. vCPU
creation (ignore the terrible name) by commit 09552a85a841 ("x86/fpu,
kvm: Simplify fx_init()"). Finally, commit 26b586da3295 ("KVM: x86: Move
all vcpu init code into kvm_arch_vcpu_create()") killed off
kvm_arch_vcpu_init(), leaving behind the oddity of redundant setting of
guest state during vCPU creation.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Remove defunct setting of CR0.ET for guests during vCPU create
Drop code to set CR0.ET for the guest during initialization of the guest
FPU. The code was added as a misguided bug fix by commit 155459039134
("KVM Set the ET flag in CR0 after initializing FX") to resolve an issue
where vcpu->cr0 (now vcpu->arch.cr0) was not correctly initialized on SVM
systems. While init_vmcb() did set CR0.ET, it only did so in the VMCB,
and subtly did not update vcpu->cr0. Stuffing CR0.ET worked around the
immediate problem, but did not fix the real bug of vcpu->cr0 and the VMCB
being out of sync. That underlying bug was eventually remedied by commit 5b03e2f5f77c ("KVM: SVM: Reset cr0 properly on vcpu reset").
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Do not mark all registers as avail/dirty during RESET/INIT
Do not blindly mark all registers as available+dirty at RESET/INIT, and
instead rely on writes to registers to go through the proper mutators or
to explicitly mark registers as dirty. INIT in particular does not blindly
overwrite all registers, e.g. select bits in CR0 are preserved across INIT,
thus marking registers available+dirty without first reading the register
from hardware is incorrect.
In practice this is a benign bug as KVM doesn't let the guest control CR0
bits that are preserved across INIT, and all other true registers are
explicitly written during the RESET/INIT flows. The PDPTRs and EX_INFO
"registers" are not explicitly written, but accessing those values during
RESET/INIT is nonsensical and would be a KVM bug regardless of register
caching.
Fixes: 13060218d09b ("KVM: x86: Make register state after reset conform to specification")
[sean: !!! NOT FOR STABLE !!!] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Simplify retrieving the page offset when loading PDTPRs
Replace impressively complex "logic" for computing the page offset from
CR3 when loading PDPTRs. Unlike other paging modes, the address held in
CR3 for PAE paging is 32-byte aligned, i.e. occupies bits 31:5, thus bits
11:5 need to be used as the offset from the gfn when reading PDPTRs.
The existing calculation originated in commit e0b8f9e1894f ("[PATCH] KVM:
MMU: Load the pae pdptrs on cr3 change like the processor does"), which
read the PDPTRs from guest memory as individual 8-byte loads. At the
time, the so called "offset" was the base index of PDPTR0 as a _u64_, not
a byte offset. Naming aside, the computation was useful and arguably
simplified the overall flow.
Unfortunately, when commit b7e21654b8b2 ("KVM: Add general accessors to
read and write guest memory") added accessors with offsets at byte
granularity, the cleverness of the original code was lost and KVM was
left with convoluted code for a simple operation.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210831164224.1119728-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
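The arithmetic described above is small enough to show directly: with PAE paging, CR3 bits 31:5 hold a 32-byte aligned address, so the frame number is CR3 shifted right by the page shift and the byte offset within that page is bits 11:5. The helper names below are hypothetical, and a 4 KiB page size is assumed.

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PDPT_SIZE  32u /* four 8-byte PDPTEs */

/* Guest frame number of the page that contains the PDPT. */
static uint64_t toy_pdpt_gfn(uint64_t cr3)
{
	return cr3 >> TOY_PAGE_SHIFT;
}

/* Byte offset of the PDPT within that page: CR3 bits 11:5. */
static unsigned int toy_pdpt_offset(uint64_t cr3)
{
	return (unsigned int)(cr3 & 0xfe0u);
}
```

Because the offset is a multiple of 32 and at most 0xfe0, the 32-byte table never crosses a page boundary.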
KVM: x86: Subsume nested GPA read helper into load_pdptrs()
Open code the call to mmu->translate_gpa() when loading nested PDPTRs and
kill off the existing helper, kvm_read_guest_page_mmu(), to discourage
incorrect use. Reading guest memory straight from an L2 GPA is extremely
rare (as evidenced by the lack of users), as very few constructs in x86
specify physical addresses, even fewer are virtualized by KVM, and even
fewer yet require emulation of L2 by L0 KVM.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210831164224.1119728-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_MAX_VCPU_ID is not specifying the highest allowed vcpu-id, but the
number of allowed vcpu-ids. This has already led to confusion, so
rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS to make its semantics more
clear.
Suggested-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210913135745.13944-3-jgross@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
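The distinction between "highest allowed ID" and "number of allowed IDs" is an off-by-one trap that a short sketch makes concrete. The value 1024 and the names TOY_MAX_VCPU_IDS, toy_vcpu_id_valid() and toy_highest_vcpu_id() are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* A count of IDs, not the largest ID. The value is hypothetical. */
#define TOY_MAX_VCPU_IDS 1024u

/* Valid IDs are 0 .. TOY_MAX_VCPU_IDS - 1, so the comparison is strict. */
static bool toy_vcpu_id_valid(uint32_t id)
{
	return id < TOY_MAX_VCPU_IDS;
}

/* The largest valid ID is one less than the count. */
static uint32_t toy_highest_vcpu_id(void)
{
	return TOY_MAX_VCPU_IDS - 1;
}
```

A constant whose name ends in _IDS makes it more obvious that arrays and bitmaps are sized by it directly and that range checks use a strict comparison.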
The commit has the wrong reasoning, as KVM_MAX_VCPU_ID is not defining the
maximum allowed vcpu-id as its name suggests, but the number of vcpu-ids.
So revert this patch again.
Suggested-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210913135745.13944-2-jgross@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: Make kvm_make_vcpus_request_mask() use pre-allocated cpu_kick_mask
kvm_make_vcpus_request_mask() already disables preemption so just like
kvm_make_all_cpus_request_except() it can be switched to using
pre-allocated per-cpu cpumasks. This allows for improvements for both
users of the function: in Hyper-V emulation code 'tlb_flush' can now be
dropped from 'struct kvm_vcpu_hv' and kvm_make_scan_ioapic_request_mask()
gets rid of dynamic allocation.
cpumask_available() checks in kvm_make_vcpu_request() and
kvm_kick_many_cpus() can now be dropped as they check for an impossible
condition: kvm_init() makes sure per-cpu masks are allocated.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210903075141.403071-9-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: Pre-allocate cpumasks for kvm_make_all_cpus_request_except()
Allocating cpumask dynamically in zalloc_cpumask_var() is not ideal.
Allocation is somewhat slow and can (in theory and when CPUMASK_OFFSTACK)
fail. kvm_make_all_cpus_request_except() already disables preemption so
we can use pre-allocated per-cpu cpumasks instead.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210903075141.403071-8-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Iterating over set bits in 'vcpu_bitmap' should be faster than going
through all vCPUs, especially when just a few bits are set.
Drop kvm_make_vcpus_request_mask() call from kvm_make_all_cpus_request_except()
to avoid handling the special case when 'vcpu_bitmap' is NULL, move the
code to kvm_make_all_cpus_request_except() itself.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210903075141.403071-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
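To illustrate why walking only the set bits can be cheaper than visiting every vCPU, here is a userspace sketch of the idea. It is not the kernel's for_each_set_bit(); the helper name is hypothetical, and __builtin_ctzl() assumes GCC or Clang.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define EXAMPLE_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Store the indices of all set bits below nbits into out[] and return
 * how many were stored. Work is proportional to the number of words
 * plus the number of set bits, not to nbits.
 */
static size_t example_collect_set_bits(const unsigned long *bitmap, size_t nbits,
				       size_t *out, size_t out_cap)
{
	size_t nwords = (nbits + EXAMPLE_BITS_PER_LONG - 1) / EXAMPLE_BITS_PER_LONG;
	size_t count = 0;

	for (size_t w = 0; w < nwords; w++) {
		unsigned long word = bitmap[w];

		while (word) {
			size_t bit = w * EXAMPLE_BITS_PER_LONG +
				     (size_t)__builtin_ctzl(word);

			if (bit >= nbits || count == out_cap)
				return count;
			out[count++] = bit;
			word &= word - 1;	/* clear the lowest set bit */
		}
	}
	return count;
}
```

With only a few bits set, the inner loop runs a few times in total, whereas a plain loop over all vCPUs tests every index.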
KVM: x86: hyper-v: Avoid calling kvm_make_vcpus_request_mask() with vcpu_mask==NULL
In preparation to making kvm_make_vcpus_request_mask() use for_each_set_bit()
switch kvm_hv_flush_tlb() to calling kvm_make_all_cpus_request() for 'all cpus'
case.
Note: kvm_make_all_cpus_request() (unlike kvm_make_vcpus_request_mask())
currently dynamically allocates cpumask on each call and this is suboptimal.
Both kvm_make_all_cpus_request() and kvm_make_vcpus_request_mask() are
going to be switched to using pre-allocated per-cpu masks.
Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210903075141.403071-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Yang Li [Wed, 29 Sep 2021 07:28:46 +0000 (15:28 +0800)]
KVM: use vma_pages() helper
Use vma_pages function on vma object instead of explicit computation.
Fix the following coccicheck warning:
./virt/kvm/kvm_main.c:3526:29-35: WARNING: Consider using vma_pages
helper on vma
Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Message-Id: <1632900526-119643-1-git-send-email-yang.lee@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
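The explicit computation that the helper replaces is the VMA length in bytes shifted down by the page shift. A minimal userspace sketch, using a hypothetical stand-in struct and assuming 4 KiB pages:

```c
#include <assert.h>

#define EXAMPLE_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Hypothetical stand-in for the two vm_area_struct fields involved. */
struct example_vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

/* Number of pages spanned by the half-open range [vm_start, vm_end). */
static unsigned long example_vma_pages(const struct example_vma *vma)
{
	return (vma->vm_end - vma->vm_start) >> EXAMPLE_PAGE_SHIFT;
}
```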
Currently, 'vmx->nested.vmxon_ptr' is not reset upon VMXOFF
emulation. This is not a problem per se as we never access
it when !vmx->nested.vmxon. But this should be done to avoid
any issue in the future.
Also, initialize the vmxon_ptr when vcpu is created.
KVM: nVMX: Use INVALID_GPA for pointers used in nVMX.
Clean up nested.c and vmx.c by using INVALID_GPA instead of "-1ull",
to denote an invalid address in nested VMX. Affected addresses are
the ones of VMXON region, current VMCS, VMCS link pointer, virtual-
APIC page, ENCLS-exiting bitmap, and IO bitmap etc.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Message-Id: <20210929175154.11396-2-yu.c.zhang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
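A named all-ones sentinel reads more clearly than a bare "-1ull" at each use site. The sketch below shows the pattern in userspace; the type, macro, struct, and function names are hypothetical and only mirror the idea described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t example_gpa_t;

/* All-ones value that no valid guest physical address is expected to take. */
#define EXAMPLE_INVALID_GPA (~(example_gpa_t)0)

struct example_nested_state {
	example_gpa_t vmxon_ptr;
	example_gpa_t current_vmptr;
};

/* Reset every pointer to the sentinel, e.g. at creation or on VMXOFF. */
static void example_nested_reset(struct example_nested_state *n)
{
	n->vmxon_ptr = EXAMPLE_INVALID_GPA;
	n->current_vmptr = EXAMPLE_INVALID_GPA;
}

static bool example_has_current_vmcs(const struct example_nested_state *n)
{
	return n->current_vmptr != EXAMPLE_INVALID_GPA;
}
```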
KVM: selftests: Ensure all migrations are performed when test is affined
Rework the CPU selection in the migration worker to ensure the specified
number of migrations are performed when the test itself is affined to a
subset of CPUs. The existing logic skips iterations if the target CPU is
not in the original set of possible CPUs, which causes the test to fail
if too many iterations are skipped.
==== Test Assertion Failure ====
rseq_test.c:228: i > (NR_TASK_MIGRATIONS / 2)
pid=10127 tid=10127 errno=4 - Interrupted system call
1 0x00000000004018e5: main at rseq_test.c:227
2 0x00007fcc8fc66bf6: ?? ??:0
3 0x0000000000401959: _start at ??:?
Only performed 4 KVM_RUNs, task stalled too much?
Calculate the min/max possible CPUs as a cheap "best effort" to avoid
high runtimes when the test is affined to a small percentage of CPUs.
Alternatively, a list or xarray of the possible CPUs could be used, but
even in a horrendously inefficient setup, such optimizations are not
needed because the runtime is completely dominated by the cost of
migrating the task, and the absolute runtime is well under a minute in
even truly absurd setups, e.g. running on a subset of vCPUs in a VM that
is heavily overcommitted (16 vCPUs per pCPU).
Fixes: 162ea980739d ("KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs") Reported-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210929234112.1862848-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
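The min/max "best effort" selection can be sketched as a helper that always returns a CPU from the allowed set, so no iteration is skipped. This is an illustration only: it uses a 64-bit mask instead of cpu_set_t, assumes at most 64 CPUs, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the next allowed CPU after 'cpu', wrapping to min_cpu once
 * max_cpu is passed. Assumes min_cpu and max_cpu are the lowest and
 * highest set bits of 'possible' and that max_cpu < 64.
 */
static int example_next_cpu(uint64_t possible, int min_cpu, int max_cpu, int cpu)
{
	do {
		cpu++;
		if (cpu > max_cpu) {
			cpu = min_cpu;	/* in the set by construction */
			break;
		}
	} while (!(possible & (UINT64_C(1) << cpu)));

	return cpu;
}
```

The scan between min_cpu and max_cpu may step over CPUs that are not in the set, which is the cheap trade-off the message describes in place of a list or xarray.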
KVM: x86: Swap order of CPUID entry "index" vs. "significant flag" checks
Check whether a CPUID entry's index is significant before checking for a
matching index to hack-a-fix an undefined behavior bug due to consuming
uninitialized data. RESET/INIT emulation uses kvm_cpuid() to retrieve
CPUID.0x1, which does _not_ have a significant index, and fails to
initialize the dummy variable that doubles as EBX/ECX/EDX output _and_
ECX, a.k.a. index, input.
Practically speaking, it's _extremely_ unlikely any compiler will yield
code that causes problems, as the compiler would need to inline the
kvm_cpuid() call to detect the uninitialized data, and intentionally hose
the kernel, e.g. insert ud2, instead of simply ignoring the result of
the index comparison.
Although the sketchy "dummy" pattern was introduced in SVM by commit 13060218d09b ("KVM: x86: Make register state after reset conform to
specification"), it wasn't actually broken until commit b6dbd9db9dbf
("KVM: x86: Remove stateful CPUID handling") arbitrarily swapped the
order of operations such that "index" was checked before the significant
flag.
Avoid consuming uninitialized data by reverting to checking the flag
before the index purely so that the fix can be easily backported; the
offending RESET/INIT code has been refactored, moved, and consolidated
from vendor code to common x86 since the bug was introduced. A future
patch will directly address the bad RESET/INIT behavior.
The undefined behavior was detected by syzbot + KernelMemorySanitizer.
Local variable ----dummy@kvm_vcpu_reset created at:
kvm_vcpu_reset+0x1fb/0x1c20 arch/x86/kvm/x86.c:10812
kvm_apic_accept_events+0x58f/0x8c0 arch/x86/kvm/lapic.c:2923
Reported-by: syzbot+f3985126b746b3d59c9d@syzkaller.appspotmail.com Reported-by: Alexander Potapenko <glider@google.com> Fixes: 2083742e4fdc ("KVM: VMX: Set EDX at INIT with CPUID.0x1, Family-Model-Stepping") Fixes: b6dbd9db9dbf ("KVM: x86: Remove stateful CPUID handling") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210929222426.1855730-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
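The ordering fix amounts to consulting the caller's index only for entries whose index is significant. A minimal sketch with hypothetical struct, flag, and function names (not the kernel's definitions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_FLAG_SIGNIFICANT_INDEX (1u << 0)

struct example_cpuid_entry {
	uint32_t function;
	uint32_t index;
	uint32_t flags;
};

/*
 * Check the "significant index" flag first, so 'index' is compared
 * only when the entry actually depends on it.
 */
static bool example_entry_matches(const struct example_cpuid_entry *e,
				  uint32_t function, uint32_t index)
{
	if (e->function != function)
		return false;
	if (!(e->flags & EXAMPLE_FLAG_SIGNIFICANT_INDEX))
		return true;
	return e->index == index;
}
```

For a leaf such as 0x1, whose index is not significant, the result no longer depends on whatever value the caller happened to pass as the index.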
Zelin Deng [Wed, 29 Sep 2021 05:13:49 +0000 (13:13 +0800)]
ptp: Fix ptp_kvm_getcrosststamp issue for x86 ptp_kvm
hv_clock is preallocated to have only HVC_BOOT_ARRAY_SIZE (64) elements;
if the PTP_SYS_OFFSET_PRECISE ioctl is executed on vCPUs whose index is
64 or higher, retrieving the struct pvclock_vcpu_time_info pointer with
"src = &hv_clock[cpu].pvti" will result in an out-of-bounds access and
a wild pointer. Change it to "this_cpu_pvti()" which is guaranteed to
be valid.
Fixes: e43c1f615a16 ("Switch kvmclock data to a PER_CPU variable") Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com> Cc: <stable@vger.kernel.org>
Message-Id: <1632892429-101194-3-git-send-email-zelin.deng@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Zelin Deng [Wed, 29 Sep 2021 05:13:48 +0000 (13:13 +0800)]
x86/kvmclock: Move this_cpu_pvti into kvmclock.h
Other modules, such as ptp_kvm, might use the hv_clock_per_cpu variable,
so move it into kvmclock.h and export the symbol to make it visible to
other modules.
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com> Cc: <stable@vger.kernel.org>
Message-Id: <1632892429-101194-2-git-send-email-zelin.deng@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Oliver Upton [Mon, 27 Sep 2021 22:36:21 +0000 (22:36 +0000)]
selftests: KVM: Don't clobber XMM register when read
There is no need to clobber a register that is only being read from.
Oops. Drop the XMM register from the clobbers list.
Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20210927223621.50178-1-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: VMX: Fix a TSX_CTRL_CPUID_CLEAR field mask issue
When updating the host's mask for its MSR_IA32_TSX_CTRL user return entry,
clear the mask in the found uret MSR instead of vmx->guest_uret_msrs[i].
Modifying guest_uret_msrs directly is completely broken as 'i' does not
point at the MSR_IA32_TSX_CTRL entry. In fact, it's guaranteed to be an
out-of-bounds access, as 'i' is always set to kvm_nr_uret_msrs in a prior
loop. By sheer dumb luck, the fallout is limited to "only" failing to
preserve the host's TSX_CTRL_CPUID_CLEAR. The out-of-bounds access is
benign as it's guaranteed to clear a bit in a guest MSR value, which are
always zero at vCPU creation on both x86-64 and i386.
Cc: stable@vger.kernel.org Fixes: 55e923613be9 ("KVM: VMX: Use common x86's uret MSR list as the one true list") Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210926015545.281083-1-zhenzhong.duan@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
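The bug pattern is reusing a loop counter after the loop has run to completion, at which point it equals the element count. A sketch of the safer shape, which modifies the entry returned by a lookup; all names and values here are hypothetical, not the kernel's uret MSR code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct example_uret_msr {
	uint32_t index;
	uint64_t mask;
};

/* Return the entry for 'index', or NULL if it is not in the list. */
static struct example_uret_msr *
example_find_uret_msr(struct example_uret_msr *msrs, size_t count, uint32_t index)
{
	for (size_t i = 0; i < count; i++) {
		if (msrs[i].index == index)
			return &msrs[i];
	}
	return NULL;
}

/* Clear 'bit' in the mask of the found entry; never index with a stale i. */
static bool example_clear_mask_bit(struct example_uret_msr *msrs, size_t count,
				   uint32_t index, uint64_t bit)
{
	struct example_uret_msr *msr = example_find_uret_msr(msrs, count, index);

	if (!msr)
		return false;
	msr->mask &= ~bit;
	return true;
}
```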
Oliver Upton [Fri, 24 Sep 2021 00:51:47 +0000 (00:51 +0000)]
selftests: KVM: Explicitly use movq to read xmm registers
Compiling the KVM selftests with clang emits the following warning:
>> include/x86_64/processor.h:297:25: error: variable 'xmm0' is uninitialized when used here [-Werror,-Wuninitialized]
>> return (unsigned long)xmm0;
where xmm0 is accessed via an uninitialized register variable.
Indeed, this is a misuse of register variables, which really should only
be used for specifying register constraints on variables passed to
inline assembly. Rather than attempting to read xmm registers via
register variables, just explicitly perform the movq from the desired
xmm register.
Fixes: 150755a8b7d8 ("kvm: selftests: add API testing infrastructure") Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20210924005147.1122357-1-oupton@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Oliver Upton [Thu, 23 Sep 2021 22:00:33 +0000 (22:00 +0000)]
selftests: KVM: Call ucall_init when setting up in rseq_test
While x86 does not require any additional setup to use the ucall
infrastructure, arm64 needs to set up the MMIO address used to signal a
ucall to userspace. rseq_test does not initialize the MMIO address,
resulting in the test spinning indefinitely.
Fix the issue by calling ucall_init() during setup.
Fixes: 162ea980739d ("KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs") Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20210923220033.4172362-1-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Lai Jiangshan [Sat, 18 Sep 2021 00:56:29 +0000 (08:56 +0800)]
KVM: Remove tlbs_dirty
There is no user of tlbs_dirty.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Lai Jiangshan [Sat, 18 Sep 2021 00:56:28 +0000 (08:56 +0800)]
KVM: X86: Synchronize the shadow pagetable before link it
If gpte is changed from non-present to present, the guest doesn't need
to flush tlb per SDM. So the host must synchronize sp before
linking it. Otherwise the guest might use a wrong mapping.
For example: the guest first changes a level-1 pagetable, and then
links its parent to a new place where the original gpte is non-present.
Finally the guest can access the remapped area without flushing
the tlb. The guest's behavior should be allowed per SDM, but the host
kvm mmu makes it wrong.
Fixes: f77b71643299 ("KVM: MMU: out of sync shadow core") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Lai Jiangshan [Sat, 18 Sep 2021 00:56:27 +0000 (08:56 +0800)]
KVM: X86: Fix missed remote tlb flush in rmap_write_protect()
When kvm->tlbs_dirty > 0, some rmaps might have been deleted
without flushing tlb remotely after kvm_sync_page(). If @gfn
was writable before and its rmaps were deleted in kvm_sync_page(),
and if the tlb entry is still in a remote running VCPU, the @gfn
is not safely protected.
To fix the problem, kvm_sync_page() does the remote flush when
needed to avoid the problem.
Fixes: a5201c6bafef ("KVM: MMU: delay flush all tlbs on sync_page path") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This field corresponds to features that we don't expose yet to L2.
While currently there are no CVE worthy features in this field,
if AMD adds more features to this field, that could allow guest
escapes similar to CVE-2021-3653 and CVE-2021-3656.
KVM: x86: nSVM: test eax for 4K alignment for GP errata workaround
GP SVM errata workaround made the #GP handler always emulate
the SVM instructions.
However these instructions #GP in case the operand is not 4K aligned,
but the workaround code didn't check this and we ended up
emulating these instructions anyway.
This is only an emulation accuracy check bug as there is no harm for
KVM to read/write unaligned vmcb images.
Fixes: 99d9f6552bc0 ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210914154825.104886-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
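The missing check is a plain 4 KiB alignment test on the operand address. A minimal sketch with a hypothetical helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* An address is 4 KiB aligned when its low 12 bits are all zero. */
static bool example_is_4k_aligned(uint64_t addr)
{
	return (addr & UINT64_C(0xfff)) == 0;
}
```

An unaligned operand would then take the "reinject #GP" path instead of being emulated.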
KVM: x86: selftests: test simultaneous uses of V_IRQ from L1 and L0
Test that if:
* L1 disables virtual interrupt masking, and INTR intercept.
* L1 sets up a virtual interrupt to be injected to L2 and enters L2 with
interrupts disabled, thus the virtual interrupt is pending.
* Now an external interrupt arrives in L1 and since
L1 doesn't intercept it, it should be delivered to L2 when
it enables interrupts.
To do this, L0 (ab)uses V_IRQ to set up an
interrupt window, and returns to L2.
* L2 enables interrupts.
This should trigger the interrupt window,
injection of the external interrupt and delivery
of the virtual interrupt that can now be done.
* Test that now L2 gets those interrupts.
This is the test that demonstrates the issue that was
fixed in the previous patch.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210914154825.104886-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel PMU MSRs are in msrs_to_save_all[], so add AMD PMU MSRs to have a
consistent behavior between Intel and AMD when using KVM_GET_MSRS,
KVM_SET_MSRS or KVM_GET_MSR_INDEX_LIST.
We have to add legacy and new MSRs to handle guests running without
X86_FEATURE_PERFCTR_CORE.
Signed-off-by: Fares Mehanna <faresx@amazon.de>
Message-Id: <20210915133951.22389-1-faresx@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: nVMX: re-evaluate emulation_required on nested VM exit
If L1 had invalid state on VM entry (can happen on SMM transitions
when we enter from real mode, straight to nested guest),
then after we load 'host' state from VMCS12, the state has to become
valid again, but since we load the segment registers with
__vmx_set_segment we weren't always updating emulation_required.
Update emulation_required explicitly at end of load_vmcs12_host_state.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-8-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: nVMX: don't fail nested VM entry on invalid guest state if !from_vmentry
It is possible that when non root mode is entered via special entry
(!from_vmentry), that is from SMM or from loading the nested state,
the L2 state could be invalid in regard to non unrestricted guest mode,
but later it can become valid.
(for example when RSM emulation restores segment registers from SMRAM)
Thus delay the check to VM entry, where we will check this and fail.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-7-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: SVM: call KVM_REQ_GET_NESTED_STATE_PAGES on exit from SMM mode
Currently the KVM_REQ_GET_NESTED_STATE_PAGES on SVM only reloads PDPTRs,
and MSR bitmap, with the former not really needed for SMM as the SMM exit code
reloads them again from SMRAM's CR3, and the latter happens to work
since the MSR bitmap isn't modified while in SMM.
Still it is better to be consistent with VMX.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-5-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: reset pdptrs_from_userspace when exiting smm
When exiting SMM, pdpts are loaded again from the guest memory.
This fixes a theoretical bug, when exit from SMM triggers entry to the
nested guest which re-uses some of the migration
code which uses this flag as a workaround for a legacy userspace.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: nVMX: Filter out all unsupported controls when eVMCS was activated
Windows Server 2022 with Hyper-V role enabled failed to boot on KVM when
enlightened VMCS is advertised. Debugging revealed there are two exposed
secondary controls it is not happy with: SECONDARY_EXEC_ENABLE_VMFUNC and
SECONDARY_EXEC_SHADOW_VMCS. These controls are known to be unsupported,
as there are no corresponding fields in eVMCSv1 (see the comment above
EVMCS1_UNSUPPORTED_2NDEXEC definition).
Previously, commit f946e45b3bf1 ("x86/kvm/hyper-v: move VMX controls
sanitization out of nested_enable_evmcs()") introduced the required
filtering mechanism for VMX MSRs but for some reason put only known
to be problematic (and not full EVMCS1_UNSUPPORTED_* lists) controls
there.
Note, Windows Server 2022 seems to have gained some sanity check for VMX
MSRs: it doesn't even try to launch a guest when there's something it
doesn't like, so the nested_evmcs_check_controls() mechanism can't catch the
problem.
Let's be bold this time and instead of playing whack-a-mole just filter out
all unsupported controls from VMX MSRs.
Fixes: f946e45b3bf1 ("x86/kvm/hyper-v: move VMX controls sanitization out of nested_enable_evmcs()") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210907163530.110066-1-vkuznets@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
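Filtering "all unsupported controls" amounts to masking a whole unsupported set out of the advertised allowed-1 bits instead of clearing individually known offenders. A sketch with hypothetical macro names; the two bit positions are illustrative stand-ins for the two controls named above, and the real unsupported list is longer.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions, labelled as assumptions for this sketch. */
#define EXAMPLE_CTL_ENABLE_VMFUNC	(UINT32_C(1) << 13)
#define EXAMPLE_CTL_SHADOW_VMCS		(UINT32_C(1) << 14)

/* Every control without a corresponding field goes into one mask. */
#define EXAMPLE_UNSUPPORTED_2NDEXEC \
	(EXAMPLE_CTL_ENABLE_VMFUNC | EXAMPLE_CTL_SHADOW_VMCS)

/* Strip the whole unsupported set from the advertised allowed-1 bits. */
static uint32_t example_filter_2ndexec(uint32_t allowed1)
{
	return allowed1 & ~EXAMPLE_UNSUPPORTED_2NDEXEC;
}
```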
KVM: KVM: Use cpumask_available() to check for NULL cpumask when kicking vCPUs
Check for a NULL cpumask_var_t when kicking multiple vCPUs via
cpumask_available(), which performs a !NULL check if and only if cpumasks
are configured to be allocated off-stack. This is a meaningless
optimization, e.g. avoids a TEST+Jcc and TEST+CMOV on x86, but more
importantly helps document that the NULL check is necessary even though
all callers pass in a local variable.
No functional change intended.
Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210827092516.1027264-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: Clean up benign vcpu->cpu data races when kicking vCPUs
Fix a benign data race reported by syzbot+KCSAN[*] by ensuring vcpu->cpu
is read exactly once, and by ensuring the vCPU is booted from guest mode
if kvm_arch_vcpu_should_kick() returns true. Fix a similar race in
kvm_make_vcpus_request_mask() by ensuring the vCPU is interrupted if
kvm_request_needs_ipi() returns true.
Reading vcpu->cpu before vcpu->mode (via kvm_arch_vcpu_should_kick() or
kvm_request_needs_ipi()) means the target vCPU could get migrated (change
vcpu->cpu) and enter !OUTSIDE_GUEST_MODE between reading vcpu->cpu and
reading vcpu->mode. If that happens, the kick/IPI will be sent to the
old pCPU, not the new pCPU that is now running the vCPU or reading SPTEs.
Although failing to kick the vCPU is not exactly ideal, practically
speaking it cannot cause a functional issue unless there is also a bug in
the caller, and any such bug would exist regardless of kvm_vcpu_kick()'s
behavior.
The purpose of sending an IPI is purely to get a vCPU into the host (or
out of reading SPTEs) so that the vCPU can recognize a change in state,
e.g. a KVM_REQ_* request. If vCPU's handling of the state change is
required for correctness, KVM must ensure either the vCPU sees the change
before entering the guest, or that the sender sees the vCPU as running in
guest mode. All architectures handle this by (a) sending the request
before calling kvm_vcpu_kick() and (b) checking for requests _after_
setting vcpu->mode.
x86's READING_SHADOW_PAGE_TABLES has similar requirements; KVM needs to
ensure it kicks and waits for vCPUs that started reading SPTEs _before_
MMU changes were finalized, but any vCPU that starts reading after MMU
changes were finalized will see the new state and can continue on
uninterrupted.
For uses of kvm_vcpu_kick() that are not paired with a KVM_REQ_*, e.g.
x86's kvm_arch_sync_dirty_log(), the order of the kick must not be relied
upon for functional correctness, e.g. in the dirty log case, userspace
cannot assume it has a 100% complete log if vCPUs are still running.
All that said, eliminate the benign race since the cost of doing so is an
"extra" atomic cmpxchg() in the case where the target vCPU is loaded by
the current pCPU or is not loaded at all. I.e. the kick will be skipped
due to kvm_vcpu_exiting_guest_mode() seeing a compatible vcpu->mode as
opposed to the kick being skipped because of the cpu checks.
Keep the "cpu != me" checks even though they appear useless/impossible at
first glance. x86 processes guest IPI writes in a fast path that runs in
IN_GUEST_MODE, i.e. can call kvm_vcpu_kick() from IN_GUEST_MODE. And
calling kvm_vm_bugged()->kvm_make_vcpus_request_mask() from IN_GUEST or
READING_SHADOW_PAGE_TABLES is perfectly reasonable.
Note, a race with the cpu_online() check in kvm_vcpu_kick() likely
persists, e.g. the vCPU could exit guest mode and get offlined between
the cpu_online() check and the sending of smp_send_reschedule(). But,
the online check appears to exist only to avoid a WARN in x86's
native_smp_send_reschedule() that fires if the target CPU is not online.
The reschedule WARN exists because CPU offlining takes the CPU out of the
scheduling pool, i.e. the WARN is intended to detect the case where the
kernel attempts to schedule a task on an offline CPU. The actual sending
of the IPI is a non-issue as at worst it will simply be dropped on the
floor. In other words, KVM's usurping of the reschedule IPI could
theoretically trigger a WARN if the stars align, but there will be no
loss of functionality.
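The ordering described above (update the mode first, then read the CPU exactly once) can be sketched in userspace with C11 atomics. This is a simplified model under stated assumptions, not kvm_vcpu_kick(): the struct, enum, and function names are hypothetical, and the wake-up and cpu_online() handling is omitted.

```c
#include <assert.h>
#include <stdatomic.h>

enum {
	EXAMPLE_OUTSIDE_GUEST_MODE,
	EXAMPLE_IN_GUEST_MODE,
	EXAMPLE_EXITING_GUEST_MODE,
};

struct example_vcpu {
	atomic_int mode;
	atomic_int cpu;
};

static void example_vcpu_init(struct example_vcpu *v, int mode, int cpu)
{
	atomic_init(&v->mode, mode);
	atomic_init(&v->cpu, cpu);
}

/*
 * Return the pCPU that should receive an IPI, or -1 if no kick is
 * needed. The mode is moved IN_GUEST -> EXITING with a cmpxchg first;
 * only then is the cpu loaded, a single time, so the value that is
 * checked is the same value that is returned.
 */
static int example_kick_target(struct example_vcpu *v, int me)
{
	int old = EXAMPLE_IN_GUEST_MODE;
	int cpu;

	/* After the call, 'old' holds the previous mode in both outcomes. */
	atomic_compare_exchange_strong(&v->mode, &old, EXAMPLE_EXITING_GUEST_MODE);
	if (old == EXAMPLE_OUTSIDE_GUEST_MODE)
		return -1;

	cpu = atomic_load(&v->cpu);
	if (cpu == me || cpu < 0)
		return -1;
	return cpu;
}
```

When the target is loaded on the current CPU, the cmpxchg still runs before the cpu comparison skips the kick, which is the "extra" atomic the message mentions.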
The problem appears to be that 'vcpu_bitmap' is allocated as a single long
on stack and it should really be KVM_MAX_VCPUS long. We also seem to clear
the lower 16 bits of it with bitmap_zero() for no particular reason (my
guess would be that 'bitmap' and 'vcpu_bitmap' variables in
kvm_bitmap_or_dest_vcpus() caused the confusion: while the former is indeed
16 bits long, the latter should accommodate all possible vCPUs).
Fixes: 8b921683d239 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs") Fixes: 7eb9a980acfc ("KVM: x86: Zero the IOAPIC scan request dest vCPUs bitmap") Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210827092516.1027264-7-vkuznets@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
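The sizing problem can be illustrated with a bitmap whose storage is derived from the maximum vCPU count rather than being a single long. A userspace sketch; the macro and function names and the limit of 1024 are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define EXAMPLE_MAX_VCPUS	1024	/* assumed limit for illustration */
#define EXAMPLE_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define EXAMPLE_BITS_TO_LONGS(n) \
	(((n) + EXAMPLE_BITS_PER_LONG - 1) / EXAMPLE_BITS_PER_LONG)

/* One bit per possible vCPU, not one long in total. */
struct example_vcpu_bitmap {
	unsigned long words[EXAMPLE_BITS_TO_LONGS(EXAMPLE_MAX_VCPUS)];
};

static void example_bitmap_zero(struct example_vcpu_bitmap *b)
{
	memset(b->words, 0, sizeof(b->words));
}

static void example_set_bit(struct example_vcpu_bitmap *b, size_t nr)
{
	assert(nr < EXAMPLE_MAX_VCPUS);
	b->words[nr / EXAMPLE_BITS_PER_LONG] |= 1UL << (nr % EXAMPLE_BITS_PER_LONG);
}

static bool example_test_bit(const struct example_vcpu_bitmap *b, size_t nr)
{
	return (b->words[nr / EXAMPLE_BITS_PER_LONG] >>
		(nr % EXAMPLE_BITS_PER_LONG)) & 1UL;
}
```

With a single long on the stack, setting a bit such as 300 would write past the end of the object.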
David Matlack [Fri, 17 Sep 2021 17:36:57 +0000 (17:36 +0000)]
KVM: selftests: Create a separate dirty bitmap per slot
The calculation to get the per-slot dirty bitmap was incorrect leading
to a buffer overrun. Fix it by splitting out the dirty bitmap into a
separate bitmap per slot.
Fixes: 33e51d157dba ("KVM: selftests: Support multiple slots in dirty_log_perf_test") Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Andrew Jones <drjones@redhat.com>
Message-Id: <20210917173657.44011-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
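Giving each slot its own allocation removes the offset arithmetic that overran a shared buffer. The sketch below shows one way to do that in userspace with calloc(); the function names are hypothetical and this is not the selftest's actual code.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define EXAMPLE_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Free 'slots' bitmaps and the array that holds them. */
static void example_free_bitmaps(unsigned long **bitmaps, int slots)
{
	if (!bitmaps)
		return;
	for (int i = 0; i < slots; i++)
		free(bitmaps[i]);
	free(bitmaps);
}

/* Allocate one zeroed bitmap of pages_per_slot bits for each slot. */
static unsigned long **example_alloc_bitmaps(int slots, size_t pages_per_slot)
{
	size_t longs = (pages_per_slot + EXAMPLE_BITS_PER_LONG - 1) /
		       EXAMPLE_BITS_PER_LONG;
	unsigned long **bitmaps = calloc((size_t)slots, sizeof(*bitmaps));

	if (!bitmaps)
		return NULL;

	for (int i = 0; i < slots; i++) {
		bitmaps[i] = calloc(longs, sizeof(unsigned long));
		if (!bitmaps[i]) {
			example_free_bitmaps(bitmaps, slots);
			return NULL;
		}
	}
	return bitmaps;
}
```

Each slot is then addressed as bitmaps[slot], with no per-slot offset to compute.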
David Matlack [Fri, 17 Sep 2021 17:36:56 +0000 (17:36 +0000)]
KVM: selftests: Refactor help message for -s backing_src
All selftests that support the backing_src option were printing their
own description of the flag and then calling backing_src_help() to dump
the list of available backing sources. Consolidate the flag printing in
backing_src_help() to align indentation, reduce duplicated strings, and
improve consistency across tests.
Note: Passing "-s" to backing_src_help is unnecessary since every test
uses the same flag. However I decided to keep it for code readability
at the call sites.
While here this opportunistically fixes the incorrectly interleaved
printing -x help message and list of backing source types in
dirty_log_perf_test.
Fixes: 33e51d157dba ("KVM: selftests: Support multiple slots in dirty_log_perf_test") Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210917173657.44011-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Fri, 17 Sep 2021 17:36:55 +0000 (17:36 +0000)]
KVM: selftests: Change backing_src flag to -s in demand_paging_test
Every other KVM selftest uses -s for the backing_src, so switch
demand_paging_test to match.
Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210917173657.44011-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Peter Gonda [Tue, 21 Sep 2021 15:03:45 +0000 (08:03 -0700)]
KVM: SEV: Allow some commands for mirror VM
A mirrored SEV-ES VM will need to call KVM_SEV_LAUNCH_UPDATE_VMSA to
set up its vCPUs and have them measured, and their VMSAs encrypted. Without
this change, it is impossible to have mirror VMs as part of SEV-ES VMs.
Also allow the guest status check and debugging commands since they do
not change any guest state.
Signed-off-by: Peter Gonda <pgonda@google.com> Cc: Marc Orr <marcorr@google.com> Cc: Nathan Tempelman <natet@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Steve Rutherford <srutherford@google.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 90d0c700aaf0 ("KVM: x86: Support KVM VMs sharing SEV context", 2021-04-21)
Message-Id: <20210921150345.2221634-3-pgonda@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Peter Gonda [Tue, 21 Sep 2021 15:03:44 +0000 (08:03 -0700)]
KVM: SEV: Update svm_vm_copy_asid_from for SEV-ES
For mirroring SEV-ES the mirror VM will need more than just the ASID.
The FD and the handle are required to allow the mirror to call PSP
commands. The mirror VM will need to call KVM_SEV_LAUNCH_UPDATE_VMSA to
set up its vCPUs' VMSAs for SEV-ES.
Signed-off-by: Peter Gonda <pgonda@google.com> Cc: Marc Orr <marcorr@google.com> Cc: Nathan Tempelman <natet@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Steve Rutherford <srutherford@google.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 90d0c700aaf0 ("KVM: x86: Support KVM VMs sharing SEV context", 2021-04-21)
Message-Id: <20210921150345.2221634-2-pgonda@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nested bus lock VM exits are not supported yet. If L2 triggers bus lock
VM exit, it will be directed to L1 VMM, which would cause unexpected
behavior. Therefore, handle L2's bus lock VM exits in L0 directly.
Fixes: f79dabbd7827 ("KVM: VMX: Enable bus lock VM exit") Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20210914095041.29764-1-chenyi.qiang@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>