Like Xu [Tue, 10 May 2022 04:44:07 +0000 (12:44 +0800)]
KVM: x86/pmu: Don't overwrite the pmu->global_ctrl when refreshing
Assigning a value to pmu->global_ctrl just to set the value of
pmu->global_ctrl_mask is more readable but does not conform to the
specification. The value is reset to zero on Power up and Reset but
stays unchanged on INIT, like most other MSRs.
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220510044407.26445-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
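As a rough illustration of the commit above, the reserved-bit mask can be derived directly from the counter counts, so pmu->global_ctrl never has to be written as an intermediate. This is a standalone sketch, not the KVM code: the helper name and FIXED_CTR_BASE_BIT are illustrative, and it assumes that fixed-counter enable bits start at bit 32 and that both counts are well below 32.

```c
#include <stdint.h>

/* Assumption: fixed-counter enable bits start at bit 32 of GLOBAL_CTRL. */
#define FIXED_CTR_BASE_BIT 32

/*
 * Hypothetical helper: compute the reserved-bit mask for GLOBAL_CTRL
 * from the number of general-purpose and fixed counters, without
 * modifying the guest-visible GLOBAL_CTRL value itself.
 */
uint64_t global_ctrl_reserved_mask(unsigned int nr_gp, unsigned int nr_fixed)
{
    uint64_t valid = ((1ULL << nr_gp) - 1) |
                     (((1ULL << nr_fixed) - 1) << FIXED_CTR_BASE_BIT);

    /* Every bit that does not enable an existing counter is reserved. */
    return ~valid;
}
```

With 8 general-purpose and 3 fixed counters, the valid bits are 0x00000007000000ff and the returned mask is its complement.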
Like Xu [Wed, 18 May 2022 17:01:16 +0000 (01:01 +0800)]
KVM: x86/pmu: Move the vmx_icl_pebs_cpu[] definition out of the header file
Defining a static const array in a header file would introduce redundant
definitions to the point of confusing semantics, and such a use case would
only bring complaints from the compiler:
arch/x86/kvm/pmu.h:20:32: warning: ‘vmx_icl_pebs_cpu’ defined but not used [-Wunused-const-variable=]
20 | static const struct x86_cpu_id vmx_icl_pebs_cpu[] = {
| ^~~~~~~~~~~~~~~~
Fixes: a095df2c5f48 ("KVM: x86/pmu: Adjust precise_ip to emulate Ice Lake guest PDIR counter") Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220518170118.66263-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Guo Zhengkui [Wed, 11 May 2022 12:05:55 +0000 (20:05 +0800)]
selftests: kvm: replace ternary operator with min()
Fix the following coccicheck warnings:
tools/testing/selftests/kvm/lib/s390x/ucall.c:25:15-17: WARNING
opportunity for min()
tools/testing/selftests/kvm/lib/x86_64/ucall.c:27:15-17: WARNING
opportunity for min()
tools/testing/selftests/kvm/lib/riscv/ucall.c:56:15-17: WARNING
opportunity for min()
tools/testing/selftests/kvm/lib/aarch64/ucall.c:82:15-17: WARNING
opportunity for min()
tools/testing/selftests/kvm/lib/aarch64/ucall.c:55:20-21: WARNING
opportunity for min()
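The kind of rewrite coccicheck suggests here can be sketched as follows. This is an illustrative stand-in, not the selftest source: MIN() is a simplified macro (the kernel's min() additionally type-checks its arguments and evaluates them once), and MAX_ARGS and both function names are hypothetical.

```c
#include <stddef.h>

/* Simplified stand-in for min(); evaluates its arguments more than once. */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical upper bound on the number of arguments. */
#define MAX_ARGS 6

/* Before: open-coded clamp using the ternary operator. */
size_t clamp_nargs_ternary(size_t nargs)
{
    return nargs <= MAX_ARGS ? nargs : MAX_ARGS;
}

/* After: the same clamp expressed with a min()-style helper. */
size_t clamp_nargs_min(size_t nargs)
{
    return MIN(nargs, (size_t)MAX_ARGS);
}
```

Both functions return the same value for every input; the second form states the intent directly.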
The CPUID features PDCM, DS and DTES64 are required for the PEBS feature.
KVM exposes the PDCM, DS and DTES64 CPUID features to the guest when PEBS
is supported by KVM on Ice Lake server platforms.
Originally-by: Andi Kleen <ak@linux.intel.com> Co-developed-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Co-developed-by: Luwei Kang <luwei.kang@intel.com> Signed-off-by: Luwei Kang <luwei.kang@intel.com> Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220411101946.20262-18-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like Xu [Mon, 11 Apr 2022 10:19:45 +0000 (18:19 +0800)]
KVM: x86/cpuid: Refactor host/guest CPU model consistency check
For the same purpose, the legacy intel_pmu_lbr_is_compatible() can be
renamed for reuse by more callers, and the comment about the LBR
use case can be deleted along the way.
Signed-off-by: Like Xu <like.xu@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-Id: <20220411101946.20262-17-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like Xu [Mon, 11 Apr 2022 10:19:44 +0000 (18:19 +0800)]
KVM: x86/pmu: Add kvm_pmu_cap to optimize perf_get_x86_pmu_capability
The information obtained from the interface perf_get_x86_pmu_capability()
doesn't change, so an exported "struct x86_pmu_capability" is introduced
for all guests in the KVM, and it's initialized before hardware_setup().
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220411101946.20262-16-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like Xu [Mon, 11 Apr 2022 10:19:43 +0000 (18:19 +0800)]
KVM: x86/pmu: Disable guest PEBS temporarily in two rare situations
The guest PEBS will be disabled when some users try to perf KVM and
its user-space through the same PEBS facility OR when the host perf
doesn't schedule the guest PEBS counter in a one-to-one mapping manner
(neither of these is a typical scenario).
The PEBS records in the guest DS buffer are still accurate and the
above two restrictions will be checked before each vm-entry only if
guest PEBS is deemed to be enabled.
Suggested-by: Wei Wang <wei.w.wang@intel.com> Signed-off-by: Like Xu <like.xu@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-Id: <20220411101946.20262-15-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like Xu [Mon, 11 Apr 2022 10:19:42 +0000 (18:19 +0800)]
KVM: x86/pmu: Move pmc_speculative_in_use() to arch/x86/kvm/pmu.h
It allows this inline function to be reused by more callers in
more files, such as pmu_intel.c.
Signed-off-by: Like Xu <like.xu@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-Id: <20220411101946.20262-14-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like Xu [Mon, 11 Apr 2022 10:19:41 +0000 (18:19 +0800)]
KVM: x86: Set PEBS_UNAVAIL in IA32_MISC_ENABLE when PEBS is enabled
The bit 12 represents "Processor Event Based Sampling Unavailable (RO)" :
1 = PEBS is not supported.
0 = PEBS is supported.
A write to this PEBS_UNAVL available bit will bring #GP(0) when guest PEBS
is enabled. Some PEBS drivers in guest may care about this bit.
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20220411101946.20262-13-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like Xu [Mon, 11 Apr 2022 10:19:40 +0000 (18:19 +0800)]
KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS
If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the adaptive
PEBS is supported. The PEBS_DATA_CFG MSR and adaptive record enable
bits (IA32_PERFEVTSELx.Adaptive_Record and IA32_FIXED_CTR_CTRL.
FCx_Adaptive_Record) are also supported.
Adaptive PEBS provides software the capability to configure the PEBS
records to capture only the data of interest, keeping the record size
compact. An overflow of PMCx results in generation of an adaptive PEBS
record with state information based on the selections specified in
MSR_PEBS_DATA_CFG. By default, the record only contains the Basic group.
When guest adaptive PEBS is enabled, the IA32_PEBS_ENABLE MSR will
be added to the perf_guest_switch_msr() and switched during the VMX
transitions just like CORE_PERF_GLOBAL_CTRL MSR.
According to the Intel SDM, software is recommended to use PEBS Baseline
when the following is true: IA32_PERF_CAPABILITIES.PEBS_BASELINE[14]
&& IA32_PERF_CAPABILITIES.PEBS_FMT[11:8] ≥ 4.
Co-developed-by: Luwei Kang <luwei.kang@intel.com> Signed-off-by: Luwei Kang <luwei.kang@intel.com> Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220411101946.20262-12-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
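The SDM condition quoted in the commit above (PEBS_BASELINE[14] set and PEBS_FMT[11:8] at least 4) can be written as a small predicate. This is a minimal sketch; the macro and function names are illustrative and are not taken from the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field layout of IA32_PERF_CAPABILITIES as described in the commit text. */
#define PERF_CAP_PEBS_BASELINE   (1ULL << 14)
#define PERF_CAP_PEBS_FMT_SHIFT  8
#define PERF_CAP_PEBS_FMT_MASK   0xfULL

/* Hypothetical helper: true when software should use PEBS Baseline. */
bool should_use_pebs_baseline(uint64_t perf_capabilities)
{
    uint64_t fmt = (perf_capabilities >> PERF_CAP_PEBS_FMT_SHIFT) &
                   PERF_CAP_PEBS_FMT_MASK;

    return (perf_capabilities & PERF_CAP_PEBS_BASELINE) && fmt >= 4;
}
```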
Like Xu [Mon, 11 Apr 2022 10:19:39 +0000 (18:19 +0800)]
KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to support guest DS
When CPUID.01H:EDX.DS[21] is set, the IA32_DS_AREA MSR exists and points
to the linear address of the first byte of the DS buffer management area,
which is used to manage the PEBS records.
When guest PEBS is enabled, the MSR_IA32_DS_AREA MSR will be added to the
perf_guest_switch_msr() and switched during the VMX transitions just like
CORE_PERF_GLOBAL_CTRL MSR. The WRMSR to IA32_DS_AREA MSR brings a #GP(0)
if the source register contains a non-canonical address.
Originally-by: Andi Kleen <ak@linux.intel.com> Co-developed-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Like Xu <like.xu@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-Id: <20220411101946.20262-11-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
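The non-canonical address check mentioned in the IA32_DS_AREA commit above can be sketched like this. It is a standalone illustration rather than the KVM helper: the function name is hypothetical, and the virtual address width (for example 48 with 4-level paging) is passed in by the caller and assumed to be between 2 and 64.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: an address is canonical when bits 63 down to
 * (vaddr_bits - 1) are all zero or all one, i.e. the upper bits are a
 * sign extension of the highest implemented address bit.
 */
bool is_canonical_address(uint64_t addr, unsigned int vaddr_bits)
{
    uint64_t top = addr >> (vaddr_bits - 1);
    uint64_t all_ones = UINT64_MAX >> (vaddr_bits - 1);

    return top == 0 || top == all_ones;
}
```

A WRMSR emulation path would reject the write (inject #GP(0)) when this predicate is false for the source value.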
Like Xu [Mon, 11 Apr 2022 10:19:38 +0000 (18:19 +0800)]
KVM: x86/pmu: Adjust precise_ip to emulate Ice Lake guest PDIR counter
The PEBS-PDIR facility on Ice Lake server is supported on IA32_FIXED0 only.
If the guest configures counter 32 and PEBS is enabled, the PEBS-PDIR
facility is supposed to be used, in which case KVM adjusts attr.precise_ip
to 3 and requests host perf to assign the exactly requested counter or fail.
The CPU model check is also required since some platforms may place the
PEBS-PDIR facility in another counter index.
Signed-off-by: Like Xu <like.xu@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-Id: <20220411101946.20262-10-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like Xu [Mon, 11 Apr 2022 10:19:37 +0000 (18:19 +0800)]
KVM: x86/pmu: Reprogram PEBS event to emulate guest PEBS counter
When a guest counter is configured as a PEBS counter through
IA32_PEBS_ENABLE, a guest PEBS event will be reprogrammed by
configuring a non-zero precision level in the perf_event_attr.
The guest PEBS overflow PMI bit would be set in the guest
GLOBAL_STATUS MSR when PEBS facility generates a PEBS
overflow PMI based on guest IA32_DS_AREA MSR.
Even with the same counter index and the same event code and
mask, guest PEBS events will not be reused for non-PEBS events.
Originally-by: Andi Kleen <ak@linux.intel.com> Co-developed-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Like Xu <likexu@tencent.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-Id: <20220411101946.20262-9-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like Xu [Mon, 11 Apr 2022 10:19:36 +0000 (18:19 +0800)]
KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS
If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the
IA32_PEBS_ENABLE MSR exists and all architecturally enumerated fixed
and general-purpose counters have corresponding bits in IA32_PEBS_ENABLE
that enable generation of PEBS records. The general-purpose counter bits
start at bit IA32_PEBS_ENABLE[0], and the fixed counter bits start at
bit IA32_PEBS_ENABLE[32].
When guest PEBS is enabled, the IA32_PEBS_ENABLE MSR will be
added to the perf_guest_switch_msr() and atomically switched during
the VMX transitions just like CORE_PERF_GLOBAL_CTRL MSR.
Based on whether the platform supports x86_pmu.pebs_ept, it has also
refactored the way to add more msrs to arr[] in intel_guest_get_msrs()
for extensibility.
Originally-by: Andi Kleen <ak@linux.intel.com> Co-developed-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Co-developed-by: Luwei Kang <luwei.kang@intel.com> Signed-off-by: Luwei Kang <luwei.kang@intel.com> Signed-off-by: Like Xu <like.xu@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-Id: <20220411101946.20262-8-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
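The IA32_PEBS_ENABLE bit layout described in the commit above (general-purpose counters from bit 0, fixed counters from bit 32) can be illustrated with a small mapping function. The name and signature are hypothetical, and the index is assumed to be below 32.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fixed-counter bits start at IA32_PEBS_ENABLE[32], per the commit text. */
#define PEBS_ENABLE_FIXED_BASE 32

/*
 * Hypothetical helper: return the IA32_PEBS_ENABLE bit that controls
 * PEBS record generation for the given counter.
 */
uint64_t pebs_enable_bit(bool is_fixed, unsigned int idx)
{
    unsigned int bit = is_fixed ? PEBS_ENABLE_FIXED_BASE + idx : idx;

    return 1ULL << bit;
}
```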
x86/perf/core: Add pebs_capable to store valid PEBS_COUNTER_MASK value
The value of pebs_counter_mask will be accessed frequently
for repeated use in the intel_guest_get_msrs(). So it can be
optimized instead of endlessly mucking about with branches.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-Id: <20220411101946.20262-7-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like Xu [Mon, 11 Apr 2022 10:19:34 +0000 (18:19 +0800)]
KVM: x86/pmu: Introduce the ctrl_mask value for fixed counter
The mask value of the fixed counter control register should be dynamically
adjusted according to the number of fixed counters. This patch introduces a
variable that includes the reserved bits of fixed counter control
registers. This is a generic code refactoring.
Co-developed-by: Luwei Kang <luwei.kang@intel.com> Signed-off-by: Luwei Kang <luwei.kang@intel.com> Signed-off-by: Like Xu <like.xu@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-Id: <20220411101946.20262-6-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
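One way to derive such a reserved-bit mask from the number of fixed counters is sketched below. It assumes the usual IA32_FIXED_CTR_CTRL layout of one 4-bit field per counter, and it treats bits 0, 1 and 3 of each field as writable while keeping bit 2 (AnyThread) reserved. That choice and the function name are assumptions of this sketch, not statements about the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper: build the reserved-bit mask of the fixed counter
 * control register for nr_fixed counters (assumed to be at most 16).
 * Each counter owns a 4-bit field; 0xb selects bits 0, 1 and 3 of it.
 */
uint64_t fixed_ctr_ctrl_reserved_mask(unsigned int nr_fixed)
{
    uint64_t mask = ~0ULL;
    unsigned int i;

    for (i = 0; i < nr_fixed; i++)
        mask &= ~(0xbULL << (i * 4));

    return mask;
}
```

A write to the control register would then be rejected when it sets any bit in the returned mask.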
Like Xu [Mon, 11 Apr 2022 10:19:33 +0000 (18:19 +0800)]
KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled
On Intel platforms, the software can use the IA32_MISC_ENABLE[7] bit to
detect whether the processor supports performance monitoring facility.
Its value depends on whether the PMU is enabled for the guest, and a software
write to this bit is ignored. Ignoring the toggle in KVM is the way to go,
as that behavior matches bare metal.
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220411101946.20262-5-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
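The "ignore writes to bit 7" behavior described in the commit above can be sketched as a pure function that merges a guest write into the current register value. This is an illustration only; the macro and function names are hypothetical and the real KVM write path handles other bits as well.

```c
#include <stdint.h>

/* IA32_MISC_ENABLE[7]: performance monitoring available. */
#define MISC_ENABLE_EMON (1ULL << 7)

/*
 * Hypothetical helper: apply a guest write to IA32_MISC_ENABLE while
 * silently preserving the current EMON bit, so that toggling it has
 * no effect.
 */
uint64_t misc_enable_apply_write(uint64_t current, uint64_t data)
{
    return (data & ~MISC_ENABLE_EMON) | (current & MISC_ENABLE_EMON);
}
```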
Like Xu [Mon, 11 Apr 2022 10:19:32 +0000 (18:19 +0800)]
perf/x86/core: Pass "struct kvm_pmu *" to determine the guest values
Splitting the logic for determining the guest values is unnecessarily
confusing, and potentially fragile. Perf should have full knowledge and
control of what values are loaded for the guest.
If we change .guest_get_msrs() to take a struct kvm_pmu pointer, then it
can generate the full set of guest values by grabbing guest ds_area and
pebs_data_cfg. Alternatively, .guest_get_msrs() could take the desired
guest MSR values directly (ds_area and pebs_data_cfg), but kvm_pmu is
vendor agnostic, so we don't see any reason to not just pass the pointer.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <like.xu@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-Id: <20220411101946.20262-4-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like Xu [Mon, 11 Apr 2022 10:19:31 +0000 (18:19 +0800)]
perf/x86/intel: Handle guest PEBS overflow PMI for KVM guest
With PEBS virtualization, the guest PEBS records get delivered to the
guest DS, and the host pmi handler uses perf_guest_cbs->is_in_guest()
to distinguish whether the PMI comes from the guest code like Intel PT.
No matter how many guest PEBS counters are overflowed, only triggering
one fake event is enough. The fake event causes the KVM PMI callback to
be called, thereby injecting the PEBS overflow PMI into the guest.
KVM may inject the PMI with BUFFER_OVF set, even if the guest DS is
empty. That should really be harmless. Thus guest PEBS handler would
retrieve the correct information from its own PEBS records buffer.
Cc: linux-perf-users@vger.kernel.org Originally-by: Andi Kleen <ak@linux.intel.com> Co-developed-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220411101946.20262-3-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like Xu [Mon, 11 Apr 2022 10:19:30 +0000 (18:19 +0800)]
perf/x86/intel: Add EPT-Friendly PEBS for Ice Lake Server
Add support for EPT-Friendly PEBS, a new CPU feature that enlightens PEBS
to translate guest linear address through EPT, and facilitates handling
VM-Exits that occur when accessing PEBS records. More information can
be found in the December 2021 release of Intel's SDM, Volume 3,
18.9.5 "EPT-Friendly PEBS". This new hardware facility makes sure the
guest PEBS records will not be lost, which is available on Intel Ice Lake
Server platforms (and later).
KVM will check this field through perf_get_x86_pmu_capability() instead
of hard coding the CPU models in the KVM code. If it is supported, the
guest PEBS capability will be exposed to the guest. Guest PEBS can be
enabled when and only when "EPT-Friendly PEBS" is supported and
EPT is enabled.
Cc: linux-perf-users@vger.kernel.org Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220411101946.20262-2-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
With IPI virtualization enabled, the processor emulates writes to
APIC registers that would send IPIs. The processor sets the bit
corresponding to the vector in target vCPU's PIR and may send a
notification (IPI) specified by NDST and NV fields in target vCPU's
Posted-Interrupt Descriptor (PID). It is similar to what IOMMU
engine does when dealing with posted interrupt from devices.
A PID-pointer table is used by the processor to locate the PID of a
vCPU with the vCPU's APIC ID. The table size depends on maximum APIC
ID assigned for current VM session from userspace. Allocating memory
for PID-pointer table is deferred to vCPU creation, because irqchip
mode and VM-scope maximum APIC ID is settled at that point. KVM can
skip PID-pointer table allocation if !irqchip_in_kernel().
Like VT-d PI, if a vCPU goes to blocked state, VMM needs to switch its
notification vector to wakeup vector. This can ensure that when an IPI
for blocked vCPUs arrives, VMM can get control and wake up blocked
vCPUs. And if a VCPU is preempted, its posted interrupt notification
is suppressed.
Note that IPI virtualization can only virtualize physical-addressing,
flat mode, unicast IPIs. Sending other IPIs would still cause a
trap-like APIC-write VM-exit and need to be handled by VMM.
This capability can be enabled before vCPU creation and can only be
set once. If an assigned vcpu id is beyond the KVM_CAP_MAX_VCPU_ID
capability, vCPU creation will fail.
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Message-Id: <20220422134456.26655-1-guang.zeng@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Allow userspace to set maximum VCPU id for VM
Introduce new max_vcpu_ids in KVM for x86 architecture. Userspace
can assign maximum possible vcpu id for current VM session using
KVM_CAP_MAX_VCPU_ID of KVM_ENABLE_CAP ioctl().
This is done for x86 only because the sole use case is to guide
memory allocation for PID-pointer table, a structure needed to
enable VMX IPI.
By default, max_vcpu_ids is set to KVM_MAX_VCPU_IDS.
Suggested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Message-Id: <20220419154444.11888-1-guang.zeng@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: Move kvm_arch_vcpu_precreate() under kvm->lock
kvm_arch_vcpu_precreate() is meant to prepare arch-specific VM resources
prior to the actual creation of a vCPU. For example, the x86 platform
may need to do a per-VM allocation based on max_vcpu_ids at the first
vCPU creation. This allocation requires concurrency control, as multiple
vCPU creations could happen simultaneously. From the architectural point
of view, it's necessary to execute kvm_arch_vcpu_precreate() under the
protection of kvm->lock.
Currently only arm64, x86 and s390 have non-nop implementations at the
stage of vCPU pre-creation. Remove the lock acquisition in s390's
implementation and make sure all architectures can run
kvm_arch_vcpu_precreate() safely under kvm->lock without recursive
locking issues.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Message-Id: <20220419154409.11842-1-guang.zeng@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the cpu_has_secondary_exec_ctrls() condition check. Calling
vmx_refresh_apicv_exec_ctrl() presumes that secondary controls are
activated and that the VMCS fields related to APICv are valid as well.
If it is invoked in the wrong circumstances, in the worst case the VMX
operation will report a VMfailValid error without further harmful impact
and will just function as if all the secondary controls were 0.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Message-Id: <20220419153604.11786-1-guang.zeng@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode
Upcoming Intel CPUs will support virtual x2APIC MSR writes to the vICR,
i.e. will trap and generate an APIC-write VM-Exit instead of intercepting
the WRMSR. Add support for handling "nodecode" x2APIC writes, which
were previously impossible.
Note, x2APIC MSR writes are 64 bits wide.
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Message-Id: <20220419153516.11739-1-guang.zeng@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Robert Hoo [Tue, 19 Apr 2022 15:34:41 +0000 (23:34 +0800)]
KVM: VMX: Report tertiary_exec_control field in dump_vmcs()
Add tertiary_exec_control field report in dump_vmcs(). Meanwhile,
reorganize the dump output of VMCS category as follows.
Before change:
*** Control State ***
PinBased=0x000000ff CPUBased=0xb5a26dfa SecondaryExec=0x061037eb
EntryControls=0000d1ff ExitControls=002befff
After change:
*** Control State ***
CPUBased=0xb5a26dfa SecondaryExec=0x061037eb TertiaryExec=0x0000000000000010
PinBased=0x000000ff EntryControls=0000d1ff ExitControls=002befff
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Message-Id: <20220419153441.11687-1-guang.zeng@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Robert Hoo [Tue, 19 Apr 2022 15:34:00 +0000 (23:34 +0800)]
KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config
Check VMX features on tertiary execution control in VMCS config setup.
Sub-features in tertiary execution control to be enabled are adjusted
according to hardware capabilities although no sub-feature is enabled
in this patch.
EVMCSv1 doesn't support tertiary VM-execution control, so disable it
when EVMCSv1 is in use. And define the auxiliary functions for Tertiary
control field here, using the new BUILD_CONTROLS_SHADOW().
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Message-Id: <20220419153400.11642-1-guang.zeng@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Robert Hoo [Tue, 19 Apr 2022 15:33:18 +0000 (23:33 +0800)]
KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation
The Tertiary VM-Exec Control, different from previous control fields, is 64
bit. So extend BUILD_CONTROLS_SHADOW() by adding a 'bit' parameter, to
support both 32 bit and 64 bit fields' auxiliary functions building.
Suggested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Message-Id: <20220419153318.11595-1-guang.zeng@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
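The idea of parameterizing an accessor-generating macro by field width can be sketched in plain C as follows. This is a simplified illustration, not the kernel macro: the real accessors also write the VMCS, whereas this version only updates a hypothetical in-memory shadow structure, and all names other than BUILD_CONTROLS_SHADOW are invented for the example.

```c
#include <stdint.h>

/* Hypothetical shadow copy of two control fields of different widths. */
struct controls_shadow {
    uint32_t secondary_exec;
    uint64_t tertiary_exec;
};

/*
 * Generate get/setbit/clearbit helpers for one field. The 'bits'
 * parameter selects uint32_t or uint64_t via token pasting.
 */
#define BUILD_CONTROLS_SHADOW(lname, bits)                                  \
static inline uint##bits##_t lname##_get(const struct controls_shadow *s)   \
{                                                                           \
    return s->lname;                                                        \
}                                                                           \
static inline void lname##_setbit(struct controls_shadow *s,                \
                                  uint##bits##_t val)                       \
{                                                                           \
    s->lname |= val;                                                        \
}                                                                           \
static inline void lname##_clearbit(struct controls_shadow *s,              \
                                    uint##bits##_t val)                     \
{                                                                           \
    s->lname &= ~val;                                                       \
}

BUILD_CONTROLS_SHADOW(secondary_exec, 32)
BUILD_CONTROLS_SHADOW(tertiary_exec, 64)
```

With this shape, tertiary_exec_setbit() accepts a full 64-bit value, while secondary_exec_setbit() keeps the 32-bit signature used by the older fields.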
Robert Hoo [Tue, 19 Apr 2022 15:32:40 +0000 (23:32 +0800)]
x86/cpu: Add new VMX feature, Tertiary VM-Execution control
A new 64-bit control field "tertiary processor-based VM-execution
controls", is defined [1]. It's controlled by bit 17 of the primary
processor-based VM-execution controls.
Different from its brother VM-execution fields, this tertiary VM-
execution controls field is 64 bit. So it occupies 2 vmx_feature_leafs,
TERTIARY_CTLS_LOW and TERTIARY_CTLS_HIGH.
Its companion VMX capability reporting MSR, MSR_IA32_VMX_PROCBASED_CTLS3
(0x492), is also semantically different from its brothers: all of its 64
bits are allow-1 bits, rather than 32-bit allow-0 and 32-bit allow-1 [1][2].
Therefore, its init_vmx_capabilities() is a little different from others.
[1] ISE 6.2 "VMCS Changes"
https://www.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
[2] SDM Vol3. Appendix A.3
Reviewed-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Message-Id: <20220419153240.11549-1-guang.zeng@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
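The difference between the two capability MSR formats described in the commit above can be sketched as two adjustment helpers. This is a standalone illustration with hypothetical function names; it assumes the conventional layout for 32-bit controls, where the low half of the MSR lists bits that must be 1 and the high half lists bits that may be 1.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * 32-bit control: the low 32 bits of the capability MSR are the allowed-0
 * settings (a set bit means the control must be 1) and the high 32 bits
 * are the allowed-1 settings (a set bit means the control may be 1).
 * Returns false if a required ('min') control cannot be enabled.
 */
bool adjust_controls_32(uint32_t min, uint32_t opt, uint64_t cap_msr,
                        uint32_t *result)
{
    uint32_t must_be_one = (uint32_t)cap_msr;
    uint32_t may_be_one = (uint32_t)(cap_msr >> 32);
    uint32_t ctl = min | opt;

    ctl &= may_be_one;   /* drop controls the CPU does not support */
    ctl |= must_be_one;  /* force controls the CPU requires */

    if (min & ~ctl)
        return false;

    *result = ctl;
    return true;
}

/* 64-bit tertiary control: every MSR bit is an allowed-1 setting. */
uint64_t adjust_controls_64(uint64_t opt, uint64_t cap_msr)
{
    return opt & cap_msr;
}
```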
KVM: x86/mmu: Comment FNAME(sync_page) to document TLB flushing logic
Add a comment to FNAME(sync_page) to explain why the TLB flushing logic
conspicuously doesn't handle the scenario of guest protections being
reduced. Specifically, if synchronizing a SPTE drops execute protections,
KVM will not emit a TLB flush, whereas dropping writable or clearing A/D
bits does trigger a flush via mmu_spte_update(). Architecturally, until
the GPTE is implicitly or explicitly flushed from the guest's perspective,
KVM is not required to flush any old, stale translations.
Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20220513195000.99371-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86/mmu: Drop RWX=0 SPTEs during ept_sync_page()
All of sync_page()'s existing checks filter out only !PRESENT gPTE,
because without execute-only, all upper levels are guaranteed to be at
least READABLE. However, if EPT with execute-only support is in use by
L1, KVM can create an SPTE that is shadow-present but guest-inaccessible
(RWX=0) if the upper level combined permissions are R (or RW) and
the leaf EPTE is changed from R (or RW) to X. Because the EPTE is
considered present when viewed in isolation, and no reserved bits are set,
FNAME(prefetch_invalid_gpte) will consider the GPTE valid, and cause a
not-present SPTE to be created.
The SPTE is "correct": the guest translation is inaccessible because
the combined protections of all levels yield RWX=0, and KVM will just
redirect any vmexits to the guest. If EPT A/D bits are disabled, KVM
can mistake the SPTE for an access-tracked SPTE, but again such confusion
isn't fatal, as the "saved" protections are also RWX=0. However,
creating a useless SPTE in general means that KVM messed up something,
even if this particular goof didn't manifest as a functional bug.
So, drop SPTEs whose new protections will yield a RWX=0 SPTE, and
add a WARN in make_spte() to detect creation of SPTEs that will
result in RWX=0 protections.
Fixes: cf578f190f1c ("kvm: mmu: track read permission explicitly for shadow EPT page tables") Cc: David Matlack <dmatlack@google.com> Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220513195000.99371-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
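The RWX=0 case from the commit above follows from the effective permissions being the intersection of the permissions at every level. A small worked sketch, using illustrative bit definitions and a hypothetical function name:

```c
#include <stddef.h>

/* EPT permission bits: read, write, execute. */
#define EPT_R 0x1u
#define EPT_W 0x2u
#define EPT_X 0x4u

/*
 * Hypothetical helper: the access allowed by a translation is the AND
 * of the permission bits found at each paging level.
 */
unsigned int combined_access(const unsigned int *levels, size_t nr_levels)
{
    unsigned int access = EPT_R | EPT_W | EPT_X;
    size_t i;

    for (i = 0; i < nr_levels; i++)
        access &= levels[i];

    return access;
}
```

For example, upper levels granting only R|W combined with an execute-only leaf yield (R|W) & X = 0, which is the guest-inaccessible mapping the patch drops instead of installing.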
Add a KVM self-test that checks whether a nSVM L1 is able to successfully
inject a software interrupt, a soft exception and a NMI into its L2 guest.
In practice, this tests both the next_rip field consistency and
L1-injected event with intervening L0 VMEXIT during its delivery:
the first nested VMRUN (that's also trying to inject a software interrupt)
will immediately trigger a L0 NPF.
This L0 NPF will have zero in its CPU-returned next_rip field, which if
incorrectly reused by KVM will trigger a #PF when trying to return to
such address 0 from the interrupt handler.
For NMI injection this tests whether the L1 NMI state isn't getting
incorrectly mixed with the L2 NMI state if a L1 -> L2 NMI needs to be
re-injected.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
[sean: check exact L2 RIP on first soft interrupt] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <d5f3d56528558ad8e28a9f1e1e4187f5a1e6770a.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A NMI that L1 wants to inject into its L2 should be directly re-injected,
without causing L0 side effects like engaging NMI blocking for L1.
It's also worth noting that in this case it is L1 responsibility
to track the NMI window status for its L2 guest.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <f894d13501cd48157b3069a4b4c7369575ddb60e.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Differentiate Soft vs. Hard IRQs vs. reinjected in tracepoint
In the IRQ injection tracepoint, differentiate between Hard IRQs and Soft
"IRQs", i.e. interrupts that are reinjected after incomplete delivery of
a software interrupt from an INTn instruction. Tag reinjected interrupts
as such, even though the information is usually redundant since soft
interrupts are only ever reinjected by KVM. Though rare in practice, a
hard IRQ can be reinjected.
Signed-off-by: Sean Christopherson <seanjc@google.com>
[MSS: change "kvm_inj_virq" event "reinjected" field type to bool] Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <9664d49b3bd21e227caa501cff77b0569bebffe2.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Print the error code in the exception injection tracepoint if and only if
the exception has an error code. Define the entire error code sequence
as a set of formatted strings, print empty strings if there's no error
code, and abuse __print_symbolic() by passing it an empty array to coerce
it into printing the error code as a hex string.
Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <e8f0511733ed2a0410cbee8a0a7388eac2ee5bac.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
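The "print the error code only when one exists" pattern from the commit above can be illustrated outside the tracing framework with ordinary string formatting. This is a loose analogy, not the tracepoint definition: it uses snprintf() rather than __print_symbolic(), and the function name and output format are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format an exception name, followed by its error
 * code in hex only if the exception actually carries one. The optional
 * part degrades to an empty string otherwise.
 */
int format_exception(char *buf, size_t len, const char *name,
                     bool has_error_code, unsigned int error_code)
{
    char ec[16] = "";

    if (has_error_code)
        snprintf(ec, sizeof(ec), " (0x%x)", error_code);

    return snprintf(buf, len, "%s%s", name, ec);
}
```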
Trace exceptions that are re-injected, not just those that KVM is
injecting for the first time. Debugging re-injection bugs is painful
enough as is, not having visibility into what KVM is doing only makes
things worse.
Delay propagating pending=>injected in the non-reinjection path so that
the tracing can properly identify reinjected exceptions.
Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <25470690a38b4d2b32b6204875dd35676c65c9f2.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SVM: Re-inject INTn instead of retrying the insn on "failure"
Re-inject INTn software interrupts instead of retrying the instruction if
the CPU encountered an intercepted exception while vectoring the INTn,
e.g. if KVM intercepted a #PF when utilizing shadow paging. Retrying the
instruction is architecturally wrong e.g. will result in a spurious #DB
if there's a code breakpoint on the INT3/O, and lack of re-injection also
breaks nested virtualization, e.g. if L1 injects a software interrupt and
vectoring the injected interrupt encounters an exception that is
intercepted by L0 but not L1.
Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <1654ad502f860948e4f2d57b8bd881d67301f785.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction
Re-inject INT3/INTO instead of retrying the instruction if the CPU
encountered an intercepted exception while vectoring the software
exception, e.g. if vectoring INT3 encounters a #PF and KVM is using
shadow paging. Retrying the instruction is architecturally wrong, e.g.
will result in a spurious #DB if there's a code breakpoint on the INT3/O,
and lack of re-injection also breaks nested virtualization, e.g. if L1
injects a software exception and vectoring the injected exception
encounters an exception that is intercepted by L0 but not L1.
Due to, ahem, deficiencies in the SVM architecture, acquiring the next
RIP may require flowing through the emulator even if NRIPS is supported,
as the CPU clears next_rip if the VM-Exit is due to an exception other
than "exceptions caused by the INT3, INTO, and BOUND instructions". To
deal with this, "skip" the instruction to calculate next_rip (if it's
not already known), and then unwind the RIP write and any side effects
(RFLAGS updates).
Save the computed next_rip and use it to re-stuff next_rip if injection
doesn't complete. This allows KVM to do the right thing if next_rip was
known prior to injection, e.g. if L1 injects a soft event into L2, and
there is no backing INTn instruction, e.g. if L1 is injecting an
arbitrary event.
Note, it's impossible to guarantee architectural correctness given SVM's
architectural flaws. E.g. if the guest executes INTn (no KVM injection),
an exit occurs while vectoring the INTn, and the guest modifies the code
stream while the exit is being handled, KVM will compute the incorrect
next_rip due to "skipping" the wrong instruction. A future enhancement
to make this less awful would be for KVM to detect that the decoded
instruction is not the correct INTn and drop the to-be-injected soft
event (retrying is a lesser evil compared to shoving the wrong RIP on the
exception stack).
Reported-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <65cb88deab40bc1649d509194864312a89bbe02e.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SVM: Stuff next_rip on emulated INT3 injection if NRIPS is supported
If NRIPS is supported in hardware but disabled in KVM, set next_rip to
the next RIP when advancing RIP as part of emulating INT3 injection.
There is no flag to tell the CPU that KVM isn't using next_rip, and so
leaving next_rip as is will result in the CPU pushing garbage
onto the stack when vectoring the injected event.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Fixes: 397076fd8c2e ("KVM: SVM: Emulate nRIP feature when reinjecting INT3") Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <cd328309a3b88604daa2359ad56f36cb565ce2d4.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SVM: Unwind "speculative" RIP advancement if INTn injection "fails"
Unwind the RIP advancement done by svm_queue_exception() when injecting
an INT3 ultimately "fails" due to the CPU encountering a VM-Exit while
vectoring the injected event, even if the exception reported by the CPU
isn't the same event that was injected. If vectoring INT3 encounters an
exception, e.g. #NP, and vectoring the #NP encounters an intercepted
exception, e.g. #PF when KVM is using shadow paging, then the #NP will
be reported as the event that was in-progress.
Note, this is still imperfect, as it will get a false positive if the
INT3 is cleanly injected, no VM-Exit occurs before the IRET from the INT3
handler in the guest, the instruction following the INT3 generates an
exception (directly or indirectly), _and_ vectoring that exception
encounters an exception that is intercepted by KVM. The false positives
could theoretically be solved by further analyzing the vectoring event,
e.g. by comparing the error code against the expected error code were an
exception to occur when vectoring the original injected exception, but
SVM without NRIPS is a complete disaster, trying to make it 100% correct
is a waste of time.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Fixes: 397076fd8c2e ("KVM: SVM: Emulate nRIP feature when reinjecting INT3") Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <450133cf0a026cb9825a2ff55d02cb136a1cb111.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SVM: Don't BUG if userspace injects an interrupt with GIF=0
Don't BUG/WARN on interrupt injection due to GIF being cleared,
since it's trivial for userspace to force the situation via
KVM_SET_VCPU_EVENTS (even if having at least a WARN there would be correct
for KVM internally generated injections).
KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02
The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
This field value (instead of the saved guest RIP) is used by the CPU for
the return address pushed on stack when injecting a software interrupt or
INT3 or INTO exception.
Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
loading a nested state and NRIPS is exposed to L1. If NRIPS is supported
in hardware but not exposed to L1 (nrips=0 or hidden by userspace), stuff
vmcb02's next_rip from the new L2 RIP to emulate a !NRIPS CPU (which
saves RIP on the stack as-is).
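The two cases described above can be sketched as follows. This is a simplified model, not the KVM code: `struct demo_vmcb`, `demo_sync_next_rip()` and the two boolean parameters are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified VMCB holding only the fields of interest. */
struct demo_vmcb {
	uint64_t rip;       /* saved guest RIP */
	uint64_t next_rip;  /* return address used for soft event injection */
};

/*
 * Fill vmcb02's next_rip when entering L2.
 * nrips_hw:            NRIPS is supported by the hardware.
 * nrips_exposed_to_l1: NRIPS is visible to L1.
 */
static void demo_sync_next_rip(struct demo_vmcb *vmcb02,
			       const struct demo_vmcb *vmcb12,
			       bool nrips_hw, bool nrips_exposed_to_l1)
{
	if (nrips_exposed_to_l1)
		vmcb02->next_rip = vmcb12->next_rip;  /* L1 supplied the value */
	else if (nrips_hw)
		vmcb02->next_rip = vmcb12->rip;       /* emulate a !NRIPS CPU */
	/* Without hardware NRIPS the field is not consumed, so leave it alone. */
}
```

In the second branch the CPU still consumes next_rip, so it is stuffed with the new L2 RIP to match what a CPU without NRIPS would push.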
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <c2e0a3d78db3ae30530f11d4e9254b452a89f42b.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Mon, 6 Jun 2022 18:11:49 +0000 (21:11 +0300)]
KVM: SVM: fix tsc scaling cache logic
SVM uses a per-cpu variable to cache the current value of the
tsc scaling multiplier msr on each cpu.
Commit 3b05675ac7460
("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
broke this caching logic.
Refactor the code so that all TSC scaling multiplier writes go through
a single function which checks and updates the cache.
This fixes the following scenario:
1. A CPU runs a guest with some tsc scaling ratio.
2. New guest with different tsc scaling ratio starts on this CPU
and terminates almost immediately.
This ensures that the short running guest had set the tsc scaling ratio just
once when it was set via KVM_SET_TSC_KHZ. Due to the bug,
the per-cpu cache is not updated.
3. The original guest continues to run, it doesn't restore the msr
value back to its own value, because the cache matches,
and thus continues to run with a wrong tsc scaling ratio.
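A minimal sketch of the "single function which checks and updates the cache" idea, modelling one CPU in plain C. The variables and `demo_tsc_ratio_write()` are hypothetical stand-ins; the real code uses a per-cpu variable and an MSR write.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the state of a single CPU. */
static uint64_t demo_hw_tsc_ratio;      /* models the TSC ratio MSR */
static uint64_t demo_cached_tsc_ratio;  /* models the per-cpu cache */
static unsigned int demo_msr_writes;    /* counts simulated MSR writes */

/* Every multiplier write goes through here, so the cache cannot go stale. */
static void demo_tsc_ratio_write(uint64_t multiplier)
{
	if (multiplier == demo_cached_tsc_ratio)
		return;                          /* MSR already holds this value */

	demo_hw_tsc_ratio = multiplier;          /* stands in for the MSR write */
	demo_cached_tsc_ratio = multiplier;      /* keep the cache in sync */
	demo_msr_writes++;
}
```

With this shape, the scenario above resolves correctly: the short-lived guest's write updates the cache, so the original guest sees a mismatch and restores its own ratio.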
Fixes: 3b05675ac7460 ("KVM: X86: Add vendor callbacks for writing the TSC multiplier") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606181149.103072-1-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: selftests: Make hyperv_clock selftest more stable
hyperv_clock doesn't always give a stable test result, especially with
AMD CPUs. The test compares Hyper-V MSR clocksource (acquired either
with rdmsr() from within the guest or KVM_GET_MSRS from the host)
against rdtsc(). To increase the accuracy, increase the measured delay
(done with nop loop) by two orders of magnitude and take the mean rdtsc()
value before and after rdmsr()/KVM_GET_MSRS.
Ben Gardon [Wed, 25 May 2022 23:09:04 +0000 (23:09 +0000)]
KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging
Currently disabling dirty logging with the TDP MMU is extremely slow.
On a 96 vCPU / 96G VM backed with gigabyte pages, it takes ~200 seconds
to disable dirty logging with the TDP MMU, as opposed to ~4 seconds with
the shadow MMU.
When disabling dirty logging, zap non-leaf parent entries to allow
replacement with huge pages instead of recursing and zapping all of the
child, leaf entries. This reduces the number of TLB flushes required
and reduces the disable dirty log time with the TDP MMU to ~3 seconds.
Opportunistically add a WARN() to catch GFNs that are mapped at a
higher level than their max level.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220525230904.1584480-1-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jan Beulich [Tue, 7 Jun 2022 15:00:53 +0000 (17:00 +0200)]
x86: drop bogus "cc" clobber from __try_cmpxchg_user_asm()
As noted (and fixed) a couple of times in the past, "=@cc<cond>" outputs
and clobbering of "cc" don't work well together. The compiler appears to
mean to reject such, but doesn't - in its upstream form - quite manage
to yet for "cc". Furthermore two similar macros don't clobber "cc", and
clobbering "cc" is pointless in asm()-s for x86 anyway - the compiler
always assumes status flags to be clobbered there.
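For illustration, a user-space sketch of a compare-and-exchange that uses a "=@ccz" flag output and lists no "cc" clobber. It is not the kernel macro: `demo_try_cmpxchg()` is a hypothetical name, there is no exception-table handling, and builds without flag-output support fall back to a compiler builtin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Try to replace *ptr with desired if *ptr equals *expected.
 * On failure, *expected is updated with the value found in memory.
 */
static bool demo_try_cmpxchg(uint32_t *ptr, uint32_t *expected, uint32_t desired)
{
#if defined(__GCC_ASM_FLAG_OUTPUTS__) && (defined(__x86_64__) || defined(__i386__))
	bool success;
	uint32_t old = *expected;

	__asm__ volatile("lock cmpxchgl %[new], %[mem]"
			 : "=@ccz" (success),   /* ZF is returned as an output */
			   [mem] "+m" (*ptr),
			   "+a" (old)           /* EAX: expected in, observed out */
			 : [new] "r" (desired)
			 : "memory");           /* note: no "cc" clobber */
	if (!success)
		*expected = old;
	return success;
#else
	/* Fallback for other compilers/architectures (assumption of this sketch). */
	return __atomic_compare_exchange_n(ptr, expected, desired, false,
					   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
#endif
}
```

The flag output already tells the compiler how the condition codes are produced, which is why pairing it with a "cc" clobber is contradictory.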
Fixes: 3c75cc27a67d ("x86/uaccess: Implement macros for CMPXCHG on user addresses") Signed-off-by: Jan Beulich <jbeulich@suse.com>
Message-Id: <485c0c0b-a3a7-0b7c-5264-7d00c01de032@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Seth Forshee [Wed, 4 May 2022 18:08:40 +0000 (13:08 -0500)]
entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set
A livepatch transition may stall indefinitely when a kvm vCPU is heavily
loaded. To the host, the vCPU task is a user thread which is spending a
very long time in the ioctl(KVM_RUN) syscall. During livepatch
transition, set_notify_signal() will be called on such tasks to
interrupt the syscall so that the task can be transitioned. This
interrupts guest execution, but when xfer_to_guest_mode_work() sees that
TIF_NOTIFY_SIGNAL is set but not TIF_SIGPENDING it concludes that an
exit to user mode is unnecessary, and guest execution is resumed without
transitioning the task for the livepatch.
This handling of TIF_NOTIFY_SIGNAL is incorrect, as set_notify_signal()
is expected to break tasks out of interruptible kernel loops and cause
them to return to userspace. Change xfer_to_guest_mode_work() to handle
TIF_NOTIFY_SIGNAL the same as TIF_SIGPENDING, signaling to the vCPU run
loop that an exit to userspace is needed. Any pending task_work will be
run when get_signal() is called from exit_to_user_mode_loop(), so there
is no longer any need to run task work from xfer_to_guest_mode_work().
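The changed check can be sketched roughly as below. The flag values, the function name and the -EINTR return value are assumptions made for this example; the kernel's real _TIF_* bits are architecture specific.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical bit values chosen for the example only. */
#define DEMO_TIF_SIGPENDING     (1UL << 0)
#define DEMO_TIF_NOTIFY_SIGNAL  (1UL << 1)
#define DEMO_TIF_NEED_RESCHED   (1UL << 2)

/*
 * Returns -EINTR when the vCPU run loop must exit to user mode,
 * 0 when guest execution may resume.
 */
static int demo_xfer_to_guest_mode_work(unsigned long ti_work)
{
	/* TIF_NOTIFY_SIGNAL is now treated the same as TIF_SIGPENDING. */
	if (ti_work & (DEMO_TIF_SIGPENDING | DEMO_TIF_NOTIFY_SIGNAL))
		return -EINTR;

	/* Other work (e.g. rescheduling) would be handled here. */
	return 0;
}
```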
Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Petr Mladek <pmladek@suse.com> Signed-off-by: Seth Forshee <sforshee@digitalocean.com>
Message-Id: <20220504180840.2907296-1-sforshee@digitalocean.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A KVM device cleanup happens in either of two callbacks:
1) destroy() which is called when the VM is being destroyed;
2) release() which is called when a device fd is closed.
Most KVM devices use 1) but Book3s's interrupt controller KVM devices
(XICS, XIVE, XIVE-native) use 2) as they need to close and reopen during
the machine execution. The error handling in kvm_ioctl_create_device()
assumes destroy() is always defined which leads to NULL dereference as
discovered by Syzkaller.
This adds a check for destroy != NULL and adds a missing release().
This is not changing kvm_destroy_devices() as devices with defined
release() should have been removed from the KVM devices list by then.
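A minimal sketch of the error-path pattern described above, using hypothetical `demo_*` types rather than the real KVM device structures: each callback is only invoked if the device type defines it.

```c
#include <assert.h>
#include <stddef.h>

struct demo_device;

/* Hypothetical ops table: either callback may be left NULL. */
struct demo_device_ops {
	void (*destroy)(struct demo_device *dev);
	void (*release)(struct demo_device *dev);
};

struct demo_device {
	const struct demo_device_ops *ops;
	int cleaned_up;  /* incremented by the example callback */
};

/* Error-path cleanup: never assume destroy() (or release()) exists. */
static void demo_device_cleanup(struct demo_device *dev)
{
	if (dev->ops->release)
		dev->ops->release(dev);
	if (dev->ops->destroy)
		dev->ops->destroy(dev);
}

/* Example callback used to observe that cleanup ran. */
static void demo_mark_cleaned(struct demo_device *dev)
{
	dev->cleaned_up++;
}
```

A device type that only defines release(), like the Book3s interrupt controllers mentioned above, then goes through cleanup without a NULL dereference.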
Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Linus Torvalds [Mon, 6 Jun 2022 00:14:03 +0000 (17:14 -0700)]
Merge tag 'pull-work.fd-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull file descriptor fix from Al Viro:
"Fix for breakage in #work.fd this window"
* tag 'pull-work.fd-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix the breakage in close_fd_get_file() calling conventions change
Linus Torvalds [Sun, 5 Jun 2022 18:51:48 +0000 (11:51 -0700)]
bluetooth: don't use bitmaps for random flag accesses
The bluetooth code uses our bitmap infrastructure for the two bits (!)
of connection setup flags, and in the process causes odd problems when
it converts between a bitmap and just the regular values of said bits.
It's completely pointless to do things like bitmap_to_arr32() to convert
a bitmap into a u32. It shouldn't have been a bitmap in the first
place. The reason to use bitmaps is if you have arbitrary number of
bits you want to manage (not two!), or if you rely on the atomicity
guarantees of the bitmap setting and clearing.
The code could use an "atomic_t" and use "atomic_or/andnot()" to set and
clear the bit values, but considering that it then copies the bitmaps
around with "bitmap_to_arr32()" and friends, there clearly cannot be a
lot of atomicity requirements.
So just use a regular integer.
In the process, this avoids the warnings about erroneous use of
bitmap_from_u64() which were triggered on 32-bit architectures when
conversion from a u64 would access two words (and, surprise, surprise,
only one word is needed - and indeed overkill - for a 2-bit bitmap).
That was always problematic, but the compiler seems to notice it and
warn about the invalid pattern only after commit a1a19b185fa0 ("lib: add
bitmap_{from,to}_arr64") changed the exact implementation details of
'bitmap_from_u64()', as reported by Sudip Mukherjee and Stephen Rothwell.
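As a rough illustration of "just use a regular integer", two flag bits can live in a plain 32-bit field. The flag and struct names here are hypothetical and do not reproduce the bluetooth definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for the example. */
enum {
	DEMO_FLAG_A = 1u << 0,
	DEMO_FLAG_B = 1u << 1,
};

struct demo_conn_params {
	uint32_t flags;  /* plain integer instead of a 2-bit bitmap */
};

static void demo_set_flag(struct demo_conn_params *p, uint32_t flag)
{
	p->flags |= flag;
}

static void demo_clear_flag(struct demo_conn_params *p, uint32_t flag)
{
	p->flags &= ~flag;
}

static bool demo_test_flag(const struct demo_conn_params *p, uint32_t flag)
{
	return (p->flags & flag) != 0;
}
```

Copying the flags is then a plain assignment of a u32, with no bitmap conversion helpers involved.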
Al Viro [Sun, 5 Jun 2022 18:01:42 +0000 (14:01 -0400)]
fix the breakage in close_fd_get_file() calling conventions change
It used to grab an extra reference to struct file rather than
just transferring to caller the one it had removed from descriptor
table. New variant doesn't, and callers need to be adjusted.
Reported-and-tested-by: syzbot+47dd250f527cb7bebf24@syzkaller.appspotmail.com Fixes: 8d392c99b321 ("Unify the primitives for file descriptor closing") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Sun, 5 Jun 2022 18:00:43 +0000 (11:00 -0700)]
Merge tag 'x86-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SGX fix from Thomas Gleixner:
"A single fix for x86/SGX to prevent that memory which is allocated for
an SGX enclave is accounted to the wrong memory control group"
* tag 'x86-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sgx: Set active memcg prior to shmem allocation
Linus Torvalds [Sun, 5 Jun 2022 17:55:23 +0000 (10:55 -0700)]
Merge tag 'x86-microcode-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode updates from Thomas Gleixner:
- Disable late microcode loading by default. Unless the HW people get
their act together and provide a required minimum version in the
microcode header for making a halfway informed decision, it's just a
lottery and broken.
- Warn and taint the kernel when microcode is loaded late
- Remove the old unused microcode loader interface
- Remove a redundant perf callback from the microcode loader
* tag 'x86-microcode-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode: Remove unnecessary perf callback
x86/microcode: Taint and warn on late loading
x86/microcode: Default-disable late loading
x86/microcode: Rip out the OLD_INTERFACE
Linus Torvalds [Sun, 5 Jun 2022 17:53:41 +0000 (10:53 -0700)]
Merge tag 'x86-cleanups-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Thomas Gleixner:
"A set of small x86 cleanups:
- Remove unused headers in the IDT code
- Kconfig indentation and comment fixes
- Fix all 'the the' typos in one go instead of waiting for bots to
fix one at a time"
* tag 'x86-cleanups-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Fix all occurences of the "the the" typo
x86/idt: Remove unused headers
x86/Kconfig: Fix indentation of arch/x86/Kconfig.debug
x86/Kconfig: Fix indentation and add endif comments to arch/x86/Kconfig
Linus Torvalds [Sun, 5 Jun 2022 17:47:06 +0000 (10:47 -0700)]
Merge tag 'timers-core-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull clockevent/clocksource updates from Thomas Gleixner:
- Device tree bindings for MT8186
- Tell the kernel that the RISC-V SBI timer stops in deeper power
states
- Make device tree parsing in sp804 more robust
- Dead code removal and tiny fixes here and there
- Add the missing SPDX identifiers
* tag 'timers-core-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/oxnas-rps: Fix irq_of_parse_and_map() return value
clocksource/drivers/timer-ti-dm: Remove unnecessary NULL check
clocksource/drivers/timer-sun5i: Convert to SPDX identifier
clocksource/drivers/timer-sun4i: Convert to SPDX identifier
clocksource/drivers/pistachio: Convert to SPDX identifier
clocksource/drivers/orion: Convert to SPDX identifier
clocksource/drivers/lpc32xx: Convert to SPDX identifier
clocksource/drivers/digicolor: Convert to SPDX identifier
clocksource/drivers/armada-370-xp: Convert to SPDX identifier
clocksource/drivers/mips-gic-timer: Convert to SPDX identifier
clocksource/drivers/jcore: Convert to SPDX identifier
clocksource/drivers/bcm_kona: Convert to SPDX identifier
clocksource/drivers/sp804: Avoid error on multiple instances
clocksource/drivers/riscv: Events are stopped during CPU suspend
clocksource/drivers/ixp4xx: Drop boardfile probe path
dt-bindings: timer: Add compatible for Mediatek MT8186
Linus Torvalds [Sun, 5 Jun 2022 17:40:31 +0000 (10:40 -0700)]
Merge tag 'perf-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
- Make the ICL event constraints match reality
- Remove an unused local variable
* tag 'perf-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Remove unused local variable
perf/x86/intel: Fix event constraints for ICL
Linus Torvalds [Sun, 5 Jun 2022 16:45:27 +0000 (09:45 -0700)]
Merge tag 'objtool-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Thomas Gleixner:
- Handle __ubsan_handle_builtin_unreachable() correctly and treat it as
noreturn
- Allow architectures to select uaccess validation
- Use the non-instrumented bit test for test_cpu_has() to prevent
escape from non-instrumentable regions
- Use arch_ prefixed atomics for JUMP_LABEL=n builds to prevent escape
from non-instrumentable regions
- Mark a few tiny inlines as __always_inline to prevent GCC from
bringing them out of line and instrumenting them
- Mark the empty stub context_tracking_enabled() as always inline as
GCC brings them out of line and instruments the empty shell
- Annotate ex_handler_msr_mce() as dead end
* tag 'objtool-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/extable: Annotate ex_handler_msr_mce() as a dead end
context_tracking: Always inline empty stubs
x86: Always inline on_thread_stack() and current_top_of_stack()
jump_label,noinstr: Avoid instrumentation for JUMP_LABEL=n builds
x86/cpu: Elide KCSAN for cpu_has() and friends
objtool: Mark __ubsan_handle_builtin_unreachable() as noreturn
objtool: Add CONFIG_HAVE_UACCESS_VALIDATION
Linus Torvalds [Sun, 5 Jun 2022 16:12:28 +0000 (09:12 -0700)]
Merge tag 'hte/for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux
Pull hardware timestamping subsystem from Thierry Reding:
"This contains the new HTE (hardware timestamping engine) subsystem
that has been in the works for a couple of months now.
The infrastructure provided allows for drivers to register as hardware
timestamp providers, while consumers will be able to request events
that they are interested in (such as GPIOs and IRQs) to be timestamped
by the hardware providers.
Note that this currently supports only one provider, but there seems
to be enough interest in this functionality and we expect to see more
drivers added once this is merged"
[ Linus Walleij mentions the Intel PMC in the Elkhart and Tiger Lake
platforms as another future timestamp provider ]
* tag 'hte/for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux:
dt-bindings: timestamp: Correct id path
dt-bindings: Renamed hte directory to timestamp
hte: Uninitialized variable in hte_ts_get()
hte: Fix off by one in hte_push_ts_ns()
hte: Fix possible use-after-free in tegra_hte_test_remove()
hte: Remove unused including <linux/version.h>
MAINTAINERS: Add HTE Subsystem
hte: Add Tegra HTE test driver
tools: gpio: Add new hardware clock type
gpiolib: cdev: Add hardware timestamp clock type
gpio: tegra186: Add HTE support
gpiolib: Add HTE support
dt-bindings: Add HTE bindings
hte: Add Tegra194 HTE kernel provider
drivers: Add hardware timestamp engine (HTE) subsystem
Documentation: Add HTE subsystem guide
Linus Torvalds [Sun, 5 Jun 2022 16:06:03 +0000 (09:06 -0700)]
Merge tag 'kbuild-v5.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull more Kbuild updates from Masahiro Yamada:
- Fix build regressions for parisc, csky, nios2, openrisc
- Simplify module builds for CONFIG_LTO_CLANG and CONFIG_X86_KERNEL_IBT
- Remove arch/parisc/nm, which was presumably a workaround for old
tools
- Check the odd combination of EXPORT_SYMBOL and 'static' precisely
- Make external module builds robust against "too long argument error"
- Support j, k keys for moving the cursor in nconfig
* tag 'kbuild-v5.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (25 commits)
kbuild: Allow to select bash in a modified environment
scripts: kconfig: nconf: make nconfig accept jk keybindings
modpost: use fnmatch() to simplify match()
modpost: simplify mod->name allocation
kbuild: factor out the common objtool arguments
kbuild: move vmlinux.o link to scripts/Makefile.vmlinux_o
kbuild: clean .tmp_* pattern by make clean
kbuild: remove redundant cleanups in scripts/link-vmlinux.sh
kbuild: rebuild multi-object modules when objtool is updated
kbuild: add cmd_and_savecmd macro
kbuild: make *.mod rule robust against too long argument error
kbuild: make built-in.a rule robust against too long argument error
kbuild: check static EXPORT_SYMBOL* by script instead of modpost
parisc: remove arch/parisc/nm
kbuild: do not create *.prelink.o for Clang LTO or IBT
kbuild: replace $(linked-object) with CONFIG options
kbuild: do not try to parse *.cmd files for objects provided by compiler
kbuild: replace $(if A,A,B) with $(or A,B) in scripts/Makefile.modpost
modpost: squash if...else-if in find_elf_symbol2()
modpost: reuse ARRAY_SIZE() macro for section_mismatch()
...
Linus Torvalds [Sun, 5 Jun 2022 02:07:15 +0000 (19:07 -0700)]
Merge tag 'pull-18-rc1-work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pathname updates from Al Viro:
"Several cleanups in fs/namei.c"
* tag 'pull-18-rc1-work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
namei: cleanup double word in comment
get rid of dead code in legitimize_root()
fs/namei.c:reserve_stack(): tidy up the call of try_to_unlazy()
Linus Torvalds [Sun, 5 Jun 2022 02:00:05 +0000 (19:00 -0700)]
Merge tag 'pull-18-rc1-work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mount handling updates from Al Viro:
"Cleanups (and one fix) around struct mount handling.
The fix is usermode_driver.c one - once you've done kern_mount(), you
must kern_unmount(); simple mntput() will end up with a leak. Several
failure exits in there messed up that way... In practice you won't hit
those particular failure exits without fault injection, though"
* tag 'pull-18-rc1-work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
move mount-related externs from fs.h to mount.h
blob_to_mnt(): kern_unmount() is needed to undo kern_mount()
m->mnt_root->d_inode->i_sb is a weird way to spell m->mnt_sb...
linux/mount.h: trim includes
uninline may_mount() and don't opencode it in fspick(2)/fsopen(2)
Linus Torvalds [Sun, 5 Jun 2022 01:52:00 +0000 (18:52 -0700)]
Merge tag 'pull-18-rc1-work.fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull file descriptor updates from Al Viro.
- Descriptor handling cleanups
* tag 'pull-18-rc1-work.fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
Unify the primitives for file descriptor closing
fs: remove fget_many and fput_many interface
io_uring_enter(): don't leave f.flags uninitialized
Linus Torvalds [Sun, 5 Jun 2022 00:42:33 +0000 (17:42 -0700)]
Merge tag '5.19-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs client fixes from Steve French:
"Nine cifs/smb3 client fixes.
Includes DFS fixes, some cleanup of legacy SMB1 code, duplicated
message cleanup and a double free and deadlock fix"
* tag '5.19-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix uninitialized pointer in error case in dfs_cache_get_tgt_share
cifs: skip trailing separators of prefix paths
cifs: update internal module number
cifs: version operations for smb20 unneeded when legacy support disabled
cifs: do not build smb1ops if legacy support is disabled
cifs: fix potential deadlock in direct reclaim
cifs: when extending a file with falloc we should make files not-sparse
cifs: remove repeated debug message on cifs_put_smb_ses()
cifs: fix potential double free during failed mount
Masahiro Yamada [Mon, 30 May 2022 09:01:38 +0000 (18:01 +0900)]
modpost: simplify mod->name allocation
mod->name is set to the ELF filename with the suffix ".o" stripped.
The current code calls strdup() and free() to manipulate the string,
but a simpler approach is to pass new_module() with the name length
subtracted by 2.
Also, check if the passed filename ends with ".o" before stripping it.
The current code blindly chops the suffix:
tmp[strlen(tmp) - 2] = '\0'
It will cause a buffer under-run if strlen(tmp) < 2.
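A small sketch of the safer check, assuming a hypothetical helper `demo_strip_dot_o()` that reports the stem length instead of modifying the string (the real modpost code is not reproduced here):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * If name ends with ".o", store the length of name without the suffix in
 * *stem_len and return true. Otherwise return false and leave *stem_len
 * untouched. Names shorter than two characters are rejected before any
 * pointer arithmetic, which avoids the under-run.
 */
static bool demo_strip_dot_o(const char *name, size_t *stem_len)
{
	size_t len = strlen(name);

	if (len < 2 || strcmp(name + len - 2, ".o") != 0)
		return false;

	*stem_len = len - 2;
	return true;
}
```

The caller can then pass the name together with the stem length to whatever allocates the module entry, so no temporary strdup()/free() pair is needed.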
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Masahiro Yamada [Sat, 28 May 2022 15:47:04 +0000 (00:47 +0900)]
kbuild: factor out the common objtool arguments
scripts/Makefile.build and scripts/link-vmlinux.sh have similar setups
for the objtool arguments.
It was difficult to factor them out because all the vmlinux build rules
were written in a shell script. It is somewhat tedious to touch the two
files every time a new objtool option is supported.
To reduce the code duplication, move the objtool for vmlinux.o into
scripts/Makefile.vmlinux_o. Then, move the common macros to Makefile.lib
so they are shared between Makefile.build and Makefile.vmlinux_o.
Linus Torvalds [Sat, 4 Jun 2022 21:04:27 +0000 (14:04 -0700)]
Merge tag 'bitmap-for-5.19-rc1' of https://github.com/norov/linux
Pull bitmap updates from Yury Norov:
- bitmap: optimize bitmap_weight() usage, from me
- lib/bitmap.c make bitmap_print_bitmask_to_buf parseable, from Mauro
Carvalho Chehab
- include/linux/find: Fix documentation, from Anna-Maria Behnsen
- bitmap: fix conversion from/to fix-sized arrays, from me
- bitmap: Fix return values to be unsigned, from Kees Cook
It has been in linux-next for at least a week with no problems.
* tag 'bitmap-for-5.19-rc1' of https://github.com/norov/linux: (31 commits)
nodemask: Fix return values to be unsigned
bitmap: Fix return values to be unsigned
KVM: x86: hyper-v: replace bitmap_weight() with hweight64()
KVM: x86: hyper-v: fix type of valid_bank_mask
ia64: cleanup remove_siblinginfo()
drm/amd/pm: use bitmap_{from,to}_arr32 where appropriate
KVM: s390: replace bitmap_copy with bitmap_{from,to}_arr64 where appropriate
lib/bitmap: add test for bitmap_{from,to}_arr64
lib: add bitmap_{from,to}_arr64
lib/bitmap: extend comment for bitmap_(from,to)_arr32()
include/linux/find: Fix documentation
lib/bitmap.c make bitmap_print_bitmask_to_buf parseable
MAINTAINERS: add cpumask and nodemask files to BITMAP_API
arch/x86: replace nodes_weight with nodes_empty where appropriate
mm/vmstat: replace cpumask_weight with cpumask_empty where appropriate
clocksource: replace cpumask_weight with cpumask_empty in clocksource.c
genirq/affinity: replace cpumask_weight with cpumask_empty where appropriate
irq: mips: replace cpumask_weight with cpumask_empty where appropriate
drm/i915/pmu: replace cpumask_weight with cpumask_empty where appropriate
arch/x86: replace cpumask_weight with cpumask_empty where appropriate
...
Linus Torvalds [Sat, 4 Jun 2022 20:50:23 +0000 (13:50 -0700)]
Merge tag 'for-5.19/parisc-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull more parisc architecture updates from Helge Deller:
"A fix to prevent crash at bootup if CONFIG_SCHED_MC is enabled, and
add auto-detection of primary graphics card for framebuffer driver"
* tag 'for-5.19/parisc-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc/stifb: Keep track of hardware path of graphics card
parisc/stifb: Implement fb_is_primary_device()
parisc: fix a crash with multicore scheduler
Linus Torvalds [Sat, 4 Jun 2022 20:42:53 +0000 (13:42 -0700)]
Merge tag 'for-linus-5.19-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull more xen updates from Juergen Gross:
"Two cleanup patches for Xen related code and (more important) an
update of MAINTAINERS for Xen, as Boris Ostrovsky decided to step
down"
* tag 'for-linus-5.19-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: replace xen_remap() with memremap()
MAINTAINERS: Update Xen maintainership
xen: switch gnttab_end_foreign_access() to take a struct page pointer
Linus Torvalds [Sat, 4 Jun 2022 20:33:12 +0000 (13:33 -0700)]
Merge tag 'perf-tools-for-v5.19-2022-06-04' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull more perf tools updates from Arnaldo Carvalho de Melo:
- Synthesize task events for pre-existing threads when using 'perf lock
--threads', as we need to show task names.
- Fix unwinding with ld.lld (>= version 10.0) linked objects, where
.eh_frame_hdr and .text are in different PT_LOAD program headers,
which makes perf record --call-graph dwarf fail with such objects.
- Check if 'perf record' hangs in the ARM SPE (Statistical Profiling
Extensions) 'perf test' entry when recording a workload with forks.
- Trace physical address for Arm SPE events, needed for 'perf c2c' to
locate the memory node for samples.
- Fix sorting in percent_rmt_hitm_cmp() in 'perf c2c'.
- Further support for Intel hybrid systems in the evlist and 'perf
record' code.
- Update IBM s/390 vendor event JSON tables.
- Add metrics (JSON) for Intel Sapphirerapids.
- Update metrics for Intel Alderlake.
- Correct typo of sysfs 'event_source' directory in the documentation.
* tag 'perf-tools-for-v5.19-2022-06-04' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
perf vendor events intel: Update metrics for Alderlake
perf vendor events intel: Add metrics for Sapphirerapids
perf c2c: Fix sorting in percent_rmt_hitm_cmp()
perf mem: Trace physical address for Arm SPE events
perf list: Update event description for IBM zEC12/zBC12 to latest level
perf list: Update event description for IBM z196/z114 to latest level
perf list: Update event description for IBM z15 to latest level
perf list: Update event description for IBM z14 to latest level
perf list: Update event description for IBM z13 to latest level
perf list: Update event description for IBM z10 to latest level
perf list: Add IBM z16 event description for s390
perf record: Support sample-read topdown metric group for hybrid platforms
perf lock: Change to synthesize task events
perf unwind: Fix segbase for ld.lld linked objects
perf test arm-spe: Check if perf-record hangs when recording workload with forks
perf docs: Correct typo of event_sources
perf evlist: Extend arch_evsel__must_be_in_group to support hybrid systems
Helge Deller [Thu, 2 Jun 2022 11:50:44 +0000 (13:50 +0200)]
parisc/stifb: Implement fb_is_primary_device()
Implement fb_is_primary_device() function, so that fbcon detects if this
framebuffer belongs to the default graphics card which was used to start
the system.
Linus Torvalds [Sat, 4 Jun 2022 03:01:25 +0000 (20:01 -0700)]
Merge tag 'gpio-fixes-for-v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- use the correct register for regcache sync in gpio-pca953x
- remove unused and potentially harmful code from gpio-adp5588
- MAINTAINERS update
* tag 'gpio-fixes-for-v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: adp5588: Remove support for platform setup and teardown callbacks
gpio: pca953x: use the correct register address to do regcache sync
MAINTAINERS: Update Intel GPIO (PMIC and PCH) to Supported
MAINTAINERS: Update GPIO ACPI library to Supported
Linus Torvalds [Sat, 4 Jun 2022 02:57:25 +0000 (19:57 -0700)]
Merge tag 'regulator-fix-v5.19-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fix from Mark Brown:
"One fix that came in during the merge window, fixing an error in the
examples in the DT binding documentation for mt6315"
* tag 'regulator-fix-v5.19-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: mt6315-regulator: fix invalid allowed mode
* tag 'ntfs3_for_5.19' of https://github.com/Paragon-Software-Group/linux-ntfs3:
fs/ntfs3: provide block_invalidate_folio to fix memory leak
fs/ntfs3: Fix invalid free in log_replay
fs/ntfs3: Update valid size if -EIOCBQUEUED
fs/ntfs3: Check new size for limits
fs/ntfs3: Fix fiemap + fix shrink file size (to remove preallocated space)
fs/ntfs3: In function ntfs_set_acl_ex do not change inode->i_mode if called from function ntfs_init_acl
fs/ntfs3: Optimize locking in ntfs_save_wsl_perm
fs/ntfs3: Update i_ctime when xattr is added
fs/ntfs3: Restore ntfs_xattr_get_acl and ntfs_xattr_set_acl functions
fs/ntfs3: Keep preallocated only if option prealloc enabled
fs/ntfs3: Fix some memory leaks in an error handling path of 'log_replay()'
Linus Torvalds [Fri, 3 Jun 2022 23:13:25 +0000 (16:13 -0700)]
Merge tag 'ptrace_stop-cleanup-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ptrace_stop cleanups from Eric Biederman:
"While looking at the ptrace problems with PREEMPT_RT and the problems
Peter Zijlstra was encountering with ptrace in his freezer rewrite I
identified some cleanups to ptrace_stop that make sense on their own
and make resolving the other problems much simpler.
The biggest issue is the habit of the ptrace code to change
task->__state from the tracer to suppress TASK_WAKEKILL from waking up
the tracee. No other code in the kernel does that and it is
straightforward to update signal_wake_up and friends to make that unnecessary.
Peter's task freezer sets frozen tasks to a new state TASK_FROZEN and
then it stores them by calling "wake_up_state(t, TASK_FROZEN)" relying
on the fact that all stopped states except the special stop states can
tolerate spurious wake up and recover their state.
The state of stopped and traced tasks is changed to be stored in
task->jobctl as well as in task->__state. This makes it possible for
the freezer to recover tasks in these special states, as well as
serving as a general cleanup. With a little more work in that
direction I believe TASK_STOPPED can learn to tolerate spurious wake
ups and become an ordinary stop state.
The TASK_TRACED state has to remain a special state as the registers
for a process are only reliably available when the process is stopped
in the scheduler. Fundamentally ptrace needs access to the saved
register values of a task.
There are a bunch of semi-random ptrace related cleanups that were found
while looking at these issues.
One cleanup that deserves to be called out is from commit 312a79ecbf5d
("ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs"). This
makes a change that is technically user space visible, in the handling
of what happens to a tracee when a tracer dies unexpectedly. According
to our testing and our understanding of userspace nothing cares that
spurious SIGTRAPs can be generated in that case"
* tag 'ptrace_stop-cleanup-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state
ptrace: Always take siglock in ptrace_resume
ptrace: Don't change __state
ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs
ptrace: Document that wait_task_inactive can't fail
ptrace: Reimplement PTRACE_KILL by always sending SIGKILL
signal: Use lockdep_assert_held instead of assert_spin_locked
ptrace: Remove arch_ptrace_attach
ptrace/xtensa: Replace PT_SINGLESTEP with TIF_SINGLESTEP
ptrace/um: Replace PT_DTRACE with TIF_SINGLESTEP
signal: Replace __group_send_sig_info with send_signal_locked
signal: Rename send_signal send_signal_locked
Linus Torvalds [Fri, 3 Jun 2022 23:03:05 +0000 (16:03 -0700)]
Merge tag 'kthread-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull kthread updates from Eric Biederman:
"This updates init and user mode helper tasks to be ordinary user mode
tasks.
Commit 169a5179aab6 ("kthread: Ensure struct kthread is present for
all kthreads") caused init and the user mode helper threads that call
kernel_execve to have struct kthread allocated for them. This struct
kthread going away during execve in turn made a use after free of
struct kthread possible.
Here, commit 14549687236b ("kthread: Don't allocate kthread_struct for
init and umh") is enough to fix the use after free and is simple
enough to be backportable.
The rest of the changes pass struct kernel_clone_args to clean things
up and cause the code to make sense.
In making init and the user mode helpers tasks purely user mode tasks
I ran into two complications. The function task_tick_numa was
detecting tasks without an mm by testing for the presence of
PF_KTHREAD. The initramfs code in populate_initrd_image was using
flush_delayed_fput to ensure the closing of all its file descriptors
was complete, and flush_delayed_fput does not work in a userspace
thread.
I have looked and looked for more complications and in my code review
I have not found any, and neither has anyone else with the code
sitting in linux-next"
* tag 'kthread-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
sched: Update task_tick_numa to ignore tasks without an mm
fork: Stop allowing kthreads to call execve
fork: Explicitly set PF_KTHREAD
init: Deal with the init process being a user mode process
fork: Generalize PF_IO_WORKER handling
fork: Explicity test for idle tasks in copy_thread
fork: Pass struct kernel_clone_args into copy_thread
kthread: Don't allocate kthread_struct for init and umh
Linus Torvalds [Fri, 3 Jun 2022 22:54:57 +0000 (15:54 -0700)]
Merge tag 'per-namespace-ipc-sysctls-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ipc sysctl namespace updates from Eric Biederman:
"This updates the ipc sysctls so that they are fundamentally per ipc
namespace. Previously these sysctls depended upon a hack to simulate
being per ipc namespace by looking up the ipc namespace in read or
write. With this set of changes the ipc sysctls are registered per ipc
namespace and open looks up the ipc namespace.
Not only does this series of changes ensure the traditional binding at
open time happens, but it sets a foundation for being able to relax
the permission checks to allow a user namespace root to change the ipc
sysctls for an ipc namespace that the user namespace root requires. To
do this requires the ipc namespace to be known at open time"
* tag 'per-namespace-ipc-sysctls-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ipc: Remove extra braces
ipc: Check permissions for checkpoint_restart sysctls at open time
ipc: Remove extra1 field abuse to pass ipc namespace
ipc: Use the same namespace to modify and validate
ipc: Store ipc sysctls in the ipc namespace
ipc: Store mqueue sysctls in the ipc namespace
Linus Torvalds [Fri, 3 Jun 2022 22:46:03 +0000 (15:46 -0700)]
firmware_loader: enable XZ by default if compressed support is enabled
Commit 4e9974ca6046 ("firmware: Add the support for ZSTD-compressed
firmware files") added support for ZSTD compression, but in the process
also made the previously default XZ compression a config option.
That means that anybody who upgrades their kernel and does a
make oldconfig
to update their configuration, will end up without the XZ compression
that the configuration used to have.
Add the 'default y' to make sure this doesn't happen.
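As a rough sketch of what such a change looks like in Kconfig terms (the
symbol name FW_LOADER_COMPRESS_XZ, the prompt text and the select lines are
assumptions made for illustration and are not quoted from the patch; only
the 'default y' line is what this message describes):

```
# Hypothetical fragment; symbol name, prompt and selects are assumed.
config FW_LOADER_COMPRESS_XZ
	bool "Enable XZ-compressed firmware support"
	select FW_LOADER_PAGED_BUF
	select XZ_DEC
	# With a default, "make oldconfig" keeps XZ enabled for a config
	# that predates the option instead of silently leaving it unset.
	default y
```

An option that has no default is treated as unset when it first appears in
an existing configuration, which is why adding the default is enough to
preserve the old behaviour.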
The whole compression question should probably be improved upon, since
it is now possible to "enable" compression in the kernel config but not
enable any actual compression algorithm, which makes it all very
useless. It makes no sense to ask Kconfig questions that enable
situations that are nonsensical like that.
This at least fixes the immediate problem of a kernel update resulting
in a nonbootable machine because of a missed option.
Fixes: 4e9974ca6046 ("firmware: Add the support for ZSTD-compressed firmware files") Cc: Takashi Iwai <tiwai@suse.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 3 Jun 2022 21:42:24 +0000 (14:42 -0700)]
Merge tag 'for-linus-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull JFFS2, UBI and UBIFS updates from Richard Weinberger:
"JFFS2:
- Fixes for a memory leak
UBI:
- Fixes for fastmap (UAF, high CPU usage)
UBIFS:
- Minor cleanups"
* tag 'for-linus-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubi: ubi_create_volume: Fix use-after-free when volume creation failed
ubi: fastmap: Check wl_pool for free peb before wear leveling
ubi: fastmap: Fix high cpu usage of ubi_bgt by making sure wl_pool not empty
ubifs: Use NULL instead of using plain integer as pointer
ubifs: Simplify the return expression of run_gc()
jffs2: fix memory leak in jffs2_do_fill_super
jffs2: Use kzalloc instead of kmalloc/memset
Linus Torvalds [Fri, 3 Jun 2022 21:35:14 +0000 (14:35 -0700)]
Merge tag 'for-linus-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml
Pull UML updates from Richard Weinberger:
- Various cleanups and fixes: xterm, serial line, time travel
- Set ARCH_HAS_GCOV_PROFILE_ALL
* tag 'for-linus-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
um: Fix out-of-bounds read in LDT setup
um: chan_user: Fix winch_tramp() return value
um: virtio_uml: Fix broken device handling in time-travel
um: line: Use separate IRQs per line
um: Enable ARCH_HAS_GCOV_PROFILE_ALL
um: Use asm-generic/dma-mapping.h
um: daemon: Make default socket configurable
um: xterm: Make default terminal emulator configurable