Wait to mark the TSS as busy during LTR emulation until after all fault
checks for the LTR have passed. Specifically, don't mark the TSS busy if
the new TSS base is non-canonical.
Opportunistically drop the one-off !seg_desc.PRESENT check for TR as the
only reason for the early check was to avoid marking a !PRESENT TSS as
busy, i.e. the common !PRESENT is now done before setting the busy bit.
Inject a #UD if L1 attempts VMXON with a CR0 or CR4 that is disallowed
per the associated nested VMX MSRs' fixed0/1 settings. KVM cannot rely
on hardware to perform the checks, even for the few checks that have
higher priority than VM-Exit, as (a) KVM may have forced CR0/CR4 bits in
hardware while running the guest, (b) there may incompatible CR0/CR4 bits
that have lower priority than VM-Exit, e.g. CR0.NE, and (c) userspace may
have further restricted the allowed CR0/CR4 values by manipulating the
guest's nested VMX MSRs.
Note, despite a very strong desire to throw shade at Jim, commit 0025faf79e9a ("kvm: nVMX: Remove superfluous VMX instruction fault checks")
is not to blame for the buggy behavior (though the comment...). That
commit only removed the CR0.PE, EFLAGS.VM, and COMPATIBILITY mode checks
(though it did erroneously drop the CPL check, but that has already been
remedied). KVM may force CR0.PE=1, but will do so only when also
forcing EFLAGS.VM=1 to emulate Real Mode, i.e. hardware will still #UD.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216033 Fixes: 1eb91cc48533 ("KVM: nVMX: Implement VMXON and VMXOFF") Reported-by: Eric Li <ercli@ucdavis.edu> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220607213604.3346000-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Check that the guest (L2) and host (L1) CR4 values that would be loaded
by nested VM-Enter and VM-Exit respectively are valid with respect to
KVM's (L0 host) allowed CR4 bits. Failure to check KVM reserved bits
would allow L1 to load an illegal CR4 (or trigger hardware VM-Fail or
failed VM-Entry) by massaging guest CPUID to allow features that are not
supported by KVM. Amusingly, KVM itself is an accomplice in its doom, as
KVM adjusts L1's MSR_IA32_VMX_CR4_FIXED1 to allow L1 to enable bits for
L2 based on L1's CPUID model.
Note, although nested_{guest,host}_cr4_valid() are _currently_ used if
and only if the vCPU is post-VMXON (nested.vmxon == true), that may not
be true in the future, e.g. emulating VMXON has a bug where it doesn't
check the allowed/required CR0/CR4 bits.
Cc: stable@vger.kernel.org Fixes: 3ce5a61d0f22 ("KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation") Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220607213604.3346000-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Restrict the nVMX MSRs based on KVM's config, not based on the guest's
current config. Using the guest's config to audit the new config
prevents userspace from restoring the original config (KVM's config) if
at any point in the past the guest's config was restricted in any way.
Fixes: 651c35318378 ("KVM: nVMX: support restore of VMX capability MSRs") Cc: stable@vger.kernel.org Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220607213604.3346000-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Split the common x86 parts of kvm_is_valid_cr4(), i.e. the reserved bits
checks, into a separate helper, __kvm_is_valid_cr4(), and export only the
inner helper to vendor code in order to prevent nested VMX from calling
back into vmx_is_valid_cr4() via kvm_is_valid_cr4().
On SVM, this is a nop as SVM doesn't place any additional restrictions on
CR4.
On VMX, this is also currently a nop, but only because nested VMX is
missing checks on reserved CR4 bits for nested VM-Enter. That bug will
be fixed in a future patch, and could simply use kvm_is_valid_cr4() as-is,
but nVMX has _another_ bug where VMXON emulation doesn't enforce VMX's
restrictions on CR0/CR4. The cleanest and most intuitive way to fix the
VMXON bug is to use nested_host_cr{0,4}_valid(). If the CR4 variant
routes through kvm_is_valid_cr4(), using nested_host_cr4_valid() won't do
the right thing for the VMXON case as vmx_is_valid_cr4() enforces VMX's
restrictions if and only if the vCPU is post-VMXON.
Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220607213604.3346000-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the SIGP interpretation facility is present and a VCPU sends an
ecall to another VCPU in enabled wait, the sending VCPU receives a 56
intercept (partial execution), so KVM can wake up the receiving CPU.
Note that the SIGP interpretation facility will take care of the
interrupt delivery and KVM's only job is to wake the receiving VCPU.
For PV, the sending VCPU will receive a 108 intercept (pv notify) and
should continue like in the non-PV case, i.e. wake the receiving VCPU.
For PV and non-PV guests the interrupt delivery will occur through the
SIGP interpretation facility on SIE entry when SIE finds the X bit in
the status field set.
However, in handle_pv_notification(), there was no special handling for
SIGP, which leads to interrupt injection being requested by KVM for the
next SIE entry. This results in the interrupt being delivered twice:
once by the SIGP interpretation facility and once by KVM through the
IICTL.
Add the necessary special handling in handle_pv_notification(), similar
to handle_partial_execution(), which simply wakes the receiving VCPU and
leave interrupt delivery to the SIGP interpretation facility.
In contrast to external calls, emergency calls are not interpreted but
also cause a 108 intercept, which is why we still need to call
handle_instruction() for SIGP orders other than ecall.
Since kvm_s390_handle_sigp_pei() is now called for all SIGP orders which
cause a 108 intercept - even if they are actually handled by
handle_instruction() - move the tracepoint in kvm_s390_handle_sigp_pei()
to avoid possibly confusing trace messages.
Don't BUG/WARN on interrupt injection due to GIF being cleared,
since it's trivial for userspace to force the situation via
KVM_SET_VCPU_EVENTS (even if having at least a WARN there would be correct
for KVM internally generated injections).
If a nested run isn't pending, snapshot vmcs01.GUEST_IA32_DEBUGCTL
irrespective of whether or not VM_ENTRY_LOAD_DEBUG_CONTROLS is set in
vmcs12. When restoring nested state, e.g. after migration, without a
nested run pending, prepare_vmcs02() will propagate
nested.vmcs01_debugctl to vmcs02, i.e. will load garbage/zeros into
vmcs02.GUEST_IA32_DEBUGCTL.
If userspace restores nested state before MSRs, then loading garbage is a
non-issue as loading DEBUGCTL will also update vmcs02. But if usersepace
restores MSRs first, then KVM is responsible for propagating L2's value,
which is actually thrown into vmcs01, into vmcs02.
Restoring L2 MSRs into vmcs01, i.e. loading all MSRs before nested state
is all kinds of bizarre and ideally would not be supported. Sadly, some
VMMs do exactly that and rely on KVM to make things work.
Note, there's still a lurking SMM bug, as propagating vmcs01's DEBUGCTL
to vmcs02 across RSM may corrupt L2's DEBUGCTL. But KVM's entire VMX+SMM
emulation is flawed as SMI+RSM should not toouch _any_ VMCS when use the
"default treatment of SMIs", i.e. when not using an SMI Transfer Monitor.
If a nested run isn't pending, snapshot vmcs01.GUEST_BNDCFGS irrespective
of whether or not VM_ENTRY_LOAD_BNDCFGS is set in vmcs12. When restoring
nested state, e.g. after migration, without a nested run pending,
prepare_vmcs02() will propagate nested.vmcs01_guest_bndcfgs to vmcs02,
i.e. will load garbage/zeros into vmcs02.GUEST_BNDCFGS.
If userspace restores nested state before MSRs, then loading garbage is a
non-issue as loading BNDCFGS will also update vmcs02. But if usersepace
restores MSRs first, then KVM is responsible for propagating L2's value,
which is actually thrown into vmcs01, into vmcs02.
Restoring L2 MSRs into vmcs01, i.e. loading all MSRs before nested state
is all kinds of bizarre and ideally would not be supported. Sadly, some
VMMs do exactly that and rely on KVM to make things work.
Note, there's still a lurking SMM bug, as propagating vmcs01.GUEST_BNDFGS
to vmcs02 across RSM may corrupt L2's BNDCFGS. But KVM's entire VMX+SMM
emulation is flawed as SMI+RSM should not toouch _any_ VMCS when use the
"default treatment of SMIs", i.e. when not using an SMI Transfer Monitor.
Link: https://lore.kernel.org/all/Yobt1XwOfb5M6Dfa@google.com Fixes: b66b5e1011af ("KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGS") Cc: stable@vger.kernel.org Cc: Lei Wang <lei4.wang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220614215831.3762138-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Touch switch state is received through WACOM_PAD_FIELD. However, it
is reported by touch_input. Don't register pad_input if no other pad
events require the interface.
The generic routine, wacom_wac_pen_event, turns rotation value 90
degree anti-clockwise before posting the events. This non-zero
event trggers a non-zero ABS_Z event for non art pen tools. However,
HID_DG_TWIST is only supported by art pen.
Similar to the Surface Go (1), the (Elantech) touchscreen/digitizer in
the Surface Go 2 mistakenly reports the battery of the stylus. Instead
of over the touchscreen device, battery information is provided via
bluetooth and the touchscreen device reports an empty battery.
Apply the HID_BATTERY_QUIRK_IGNORE quirk to ignore this battery and
prevent the erroneous low battery warnings.
lockd doesn't currently vet the start and length in nlm4 requests like
it should, and can end up generating lock requests with arguments that
overflow when passed to the filesystem.
The NLM4 protocol uses unsigned 64-bit arguments for both start and
length, whereas struct file_lock tracks the start and end as loff_t
values. By the time we get around to calling nlm4svc_retrieve_args,
we've lost the information that would allow us to determine if there was
an overflow.
Start tracking the actual start and len for NLM4 requests in the
nlm_lock. In nlm4svc_retrieve_args, vet these values to ensure they
won't cause an overflow, and return NLM4_FBIG if they do.
Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=392 Reported-by: Jan Kasiak <j.kasiak@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@vger.kernel.org> # 5.14+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Neither wait_on_buffer nor buffer_uptodate contain any memory barrier.
Consequently, if someone calls sb_bread and then reads the buffer data,
the read of buffer data may be executed before wait_on_buffer(bh) on
architectures with weak memory ordering and it may return invalid data.
Fix this bug by adding a memory barrier to set_buffer_uptodate and an
acquire barrier to buffer_uptodate (in a similar way as
folio_test_uptodate and folio_mark_uptodate).
We won't really have enough skbs to need a 64-bit cookie,
and on 32-bit platforms storing the 64-bit cookie into the
void *rate_driver_data doesn't work anyway. Switch back to
using just a 32-bit cookie and uintptr_t for the type to
avoid compiler warnings about all this.
Fixes: 2e05bcefd0a8 ("wifi: mac80211_hwsim: fix race condition in pending packet") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Cc: Jeongik Cha <jeongik@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We had a report from the spring Bake-a-thon of data corruption in some
nfstest_interop tests. Looking at the traces showed the NFS server
allowing a v3 WRITE to proceed while a read delegation was still
outstanding.
Currently, we only set NFSD_FILE_BREAK_* flags if
NFSD_MAY_NOT_BREAK_LEASE was set when we call nfsd_file_alloc.
NFSD_MAY_NOT_BREAK_LEASE was intended to be set when finding files for
COMMIT ops, where we need a writeable filehandle but don't need to
break read leases.
It doesn't make any sense to consult that flag when allocating a file
since the file may be used on subsequent calls where we do want to break
the lease (and the usage of it here seems to be reverse from what it
should be anyway).
Also, after calling nfsd_open_break_lease, we don't want to clear the
BREAK_* bits. A lease could end up being set on it later (more than
once) and we need to be able to break those leases as well.
This means that the NFSD_FILE_BREAK_* flags now just mirror
NFSD_MAY_{READ,WRITE} flags, so there's no need for them at all. Just
drop those flags and unconditionally call nfsd_open_break_lease every
time.
Reported-by: Olga Kornieskaia <kolga@netapp.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2107360 Fixes: a8d3d8b807d4 (nfsd: add a new struct file caching facility to nfsd) Cc: <stable@vger.kernel.org> # 5.4.x : 3cb992cbbe0c NFSD: Clean up the show_nf_flags() macro Cc: <stable@vger.kernel.org> # 5.4.x Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The RPC/RDMA driver will return -EPROTO and -ENODEV as connection errors
under certain circumstances. Make sure that we handle them and report
them to the server. If not, we can end up cycling forever in a
LAYOUTGET/LAYOUTRETURN loop.
Fixes: e557d83849e6 ("NFSv4/pNFS: Use connections to a DS that are all of the same protocol family") Cc: stable@vger.kernel.org # 5.11.x Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 6d0b3ac771433772dbdef0a19b865d6f33dc1c71.
If a transport is down, then we want to fail over to other transports if
they are listed in the GETDEVICEINFO reply.
Fixes: 6d0b3ac77143 ("pNFS: nfs3_set_ds_client should set NFS_CS_NOPING") Cc: stable@vger.kernel.org # 5.11.x Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Users of GNU ld (BFD) from binutils 2.39+ will observe multiple
instances of a new warning when linking kernels in the form:
ld: warning: arch/x86/boot/pmjump.o: missing .note.GNU-stack section implies executable stack
ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
ld: warning: arch/x86/boot/compressed/vmlinux has a LOAD segment with RWX permissions
Generally, we would like to avoid the stack being executable. Because
there could be a need for the stack to be executable, assembler sources
have to opt-in to this security feature via explicit creation of the
.note.GNU-stack feature (which compilers create by default) or command
line flag --noexecstack. Or we can simply tell the linker the
production of such sections is irrelevant and to link the stack as
--noexecstack.
LLVM's LLD linker defaults to -z noexecstack, so this flag isn't
strictly necessary when linking with LLD, only BFD, but it doesn't hurt
to be explicit here for all linkers IMO. --no-warn-rwx-segments is
currently BFD specific and only available in the current latest release,
so it's wrapped in an ld-option check.
While the kernel makes extensive usage of ELF sections, it doesn't use
permissions from ELF segments.
Users of GNU ld (BFD) from binutils 2.39+ will observe multiple
instances of a new warning when linking kernels in the form:
ld: warning: vmlinux: missing .note.GNU-stack section implies executable stack
ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
ld: warning: vmlinux has a LOAD segment with RWX permissions
Generally, we would like to avoid the stack being executable. Because
there could be a need for the stack to be executable, assembler sources
have to opt-in to this security feature via explicit creation of the
.note.GNU-stack feature (which compilers create by default) or command
line flag --noexecstack. Or we can simply tell the linker the
production of such sections is irrelevant and to link the stack as
--noexecstack.
LLVM's LLD linker defaults to -z noexecstack, so this flag isn't
strictly necessary when linking with LLD, only BFD, but it doesn't hurt
to be explicit here for all linkers IMO. --no-warn-rwx-segments is
currently BFD specific and only available in the current latest release,
so it's wrapped in an ld-option check.
While the kernel makes extensive usage of ELF sections, it doesn't use
permissions from ELF segments.
RSB fill sequence does not have any protection for miss-prediction of
conditional branch at the end of the sequence. CPU can speculatively
execute code immediately after the sequence, while RSB filling hasn't
completed yet.
tl;dr: The Enhanced IBRS mitigation for Spectre v2 does not work as
documented for RET instructions after VM exits. Mitigate it with a new
one-entry RSB stuffing mechanism and a new LFENCE.
== Background ==
Indirect Branch Restricted Speculation (IBRS) was designed to help
mitigate Branch Target Injection and Speculative Store Bypass, i.e.
Spectre, attacks. IBRS prevents software run in less privileged modes
from affecting branch prediction in more privileged modes. IBRS requires
the MSR to be written on every privilege level change.
To overcome some of the performance issues of IBRS, Enhanced IBRS was
introduced. eIBRS is an "always on" IBRS, in other words, just turn
it on once instead of writing the MSR on every privilege level change.
When eIBRS is enabled, more privileged modes should be protected from
less privileged modes, including protecting VMMs from guests.
== Problem ==
Here's a simplification of how guests are run on Linux' KVM:
void run_kvm_guest(void)
{
// Prepare to run guest
VMRESUME();
// Clean up after guest runs
}
The execution flow for that would look something like this to the
processor:
1. Host-side: call run_kvm_guest()
2. Host-side: VMRESUME
3. Guest runs, does "CALL guest_function"
4. VM exit, host runs again
5. Host might make some "cleanup" function calls
6. Host-side: RET from run_kvm_guest()
Now, when back on the host, there are a couple of possible scenarios of
post-guest activity the host needs to do before executing host code:
* on pre-eIBRS hardware (legacy IBRS, or nothing at all), the RSB is not
touched and Linux has to do a 32-entry stuffing.
* on eIBRS hardware, VM exit with IBRS enabled, or restoring the host
IBRS=1 shortly after VM exit, has a documented side effect of flushing
the RSB except in this PBRSB situation where the software needs to stuff
the last RSB entry "by hand".
IOW, with eIBRS supported, host RET instructions should no longer be
influenced by guest behavior after the host retires a single CALL
instruction.
However, if the RET instructions are "unbalanced" with CALLs after a VM
exit as is the RET in #6, it might speculatively use the address for the
instruction after the CALL in #3 as an RSB prediction. This is a problem
since the (untrusted) guest controls this address.
Balanced CALL/RET instruction pairs such as in step #5 are not affected.
== Solution ==
The PBRSB issue affects a wide variety of Intel processors which
support eIBRS. But not all of them need mitigation. Today,
X86_FEATURE_RSB_VMEXIT triggers an RSB filling sequence that mitigates
PBRSB. Systems setting RSB_VMEXIT need no further mitigation - i.e.,
eIBRS systems which enable legacy IBRS explicitly.
However, such systems (X86_FEATURE_IBRS_ENHANCED) do not set RSB_VMEXIT
and most of them need a new mitigation.
Therefore, introduce a new feature flag X86_FEATURE_RSB_VMEXIT_LITE
which triggers a lighter-weight PBRSB mitigation versus RSB_VMEXIT.
The lighter-weight mitigation performs a CALL instruction which is
immediately followed by a speculative execution barrier (INT3). This
steers speculative execution to the barrier -- just like a retpoline
-- which ensures that speculation can never reach an unbalanced RET.
Then, ensure this CALL is retired before continuing execution with an
LFENCE.
In other words, the window of exposure is opened at VM exit where RET
behavior is troublesome. While the window is open, force RSB predictions
sampling for RET targets to a dead end at the INT3. Close the window
with the LFENCE.
There is a subset of eIBRS systems which are not vulnerable to PBRSB.
Add these systems to the cpu_vuln_whitelist[] as NO_EIBRS_PBRSB.
Future systems that aren't vulnerable will set ARCH_CAP_PBRSB_NO.
[ bp: Massage, incorporate review comments from Andy Cooper. ]
In do_adb_query() function of drivers/macintosh/adb.c, req->data is copied
form userland. The parameter "req->data[2]" is missing check, the array
size of adb_handler[] is 16, so adb_handler[req->data[2]].original_address and
adb_handler[req->data[2]].handler_id will lead to oob read.
Cc: stable <stable@kernel.org> Signed-off-by: Ning Qiang <sohu0106@126.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220713153734.2248-1-sohu0106@126.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The BCM4349B1, aka CYW/BCM89359, is a WiFi+BT chip and its Bluetooth
portion can be controlled over serial.
Two subversions are added for the chip, because ROM firmware reports
002.002.013 (at least for the chips I have here), while depending on
patchram firmware revision, either 002.002.013 or 002.002.014 is
reported.
We use btrfs_zoned_data_reloc_{lock,unlock} to allow only one process to
write out to the relocation inode. That critical section must include all
the IO submission for the inode. However, flush_write_bio() in
extent_writepages() is out of the critical section, causing an IO
submission outside of the lock. This leads to an out of the order IO
submission and fail the relocation process.
Fix it by extending the critical section.
Fixes: 5ec7c05c9358 ("btrfs: zoned: only allow one process to add pages to a relocation inode") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The IO errors occur when we allocate a regular extent in previous data
relocation block group.
On zoned btrfs, we use a dedicated block group to relocate a data
extent. Thus, we allocate relocating data extents (pre-alloc) only from
the dedicated block group and vice versa. Once the free space in the
dedicated block group gets tight, a relocating extent may not fit into
the block group. In that case, we need to switch the dedicated block
group to the next one. Then, the previous one is now freed up for
allocating a regular extent. The BG is already not enough to allocate
the relocating extent, but there is still room to allocate a smaller
extent. Now the problem happens. By allocating a regular extent while
nocow IOs for the relocation is still on-going, we will issue WRITE IOs
(for relocation) and ZONE APPEND IOs (for the regular writes) at the
same time. That mixed IOs confuses the write pointer and arises the
unaligned write errors.
This commit introduces a new bit 'zoned_data_reloc_ongoing' to the
btrfs_block_group. We set this bit before releasing the dedicated block
group, and no extent are allocated from a block group having this bit
set. This bit is similar to setting block_group->ro, but is different from
it by allowing nocow writes to start.
Once all the nocow IO for relocation is done (hooked from
btrfs_finish_ordered_io), we reset the bit to release the block group for
further allocation.
Fixes: baba414833e7 ("btrfs: zoned: add a dedicated data relocation block group") CC: stable@vger.kernel.org # 5.16+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ This issue was fixed upstream by accident in 4da5f77dacaa ("arm64:
head: cover entire kernel image in initial ID map") as part of a
large refactoring of the arm64 boot flow. This simple fix is therefore
preferred for -stable backporting ]
On a system that implements FEAT_EPAN, read/write access to the idmap
is denied because UXN is not set on the swapper PTEs. As a result,
idmap_kpti_install_ng_mappings panics the kernel when accessing
__idmap_kpti_flag. Fix it by setting UXN on these PTEs.
The selftests, when built with newer versions of clang, is found
to have over optimized guests' ucall() function, and eliminating
the stores for uc.cmd (perhaps due to no immediate readers). This
resulted in the userspace side always reading a value of '0', and
causing multiple test failures.
As a result, prevent the compiler from optimizing the stores in
ucall() with WRITE_ONCE().
Suggested-by: Ricardo Koller <ricarkol@google.com> Suggested-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Message-Id: <20220615185706.1099208-1-rananta@google.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Instead of printing an error message, kvm_stat script fails when we
restrict statistics to a guest by its name and there are multiple guests
with such name:
# kvm_stat -g my_vm
Traceback (most recent call last):
File "/usr/bin/kvm_stat", line 1819, in <module>
main()
File "/usr/bin/kvm_stat", line 1779, in main
options = get_options()
File "/usr/bin/kvm_stat", line 1718, in get_options
options = argparser.parse_args()
File "/usr/lib64/python3.10/argparse.py", line 1825, in parse_args
args, argv = self.parse_known_args(args, namespace)
File "/usr/lib64/python3.10/argparse.py", line 1858, in parse_known_args
namespace, args = self._parse_known_args(args, namespace)
File "/usr/lib64/python3.10/argparse.py", line 2067, in _parse_known_args
start_index = consume_optional(start_index)
File "/usr/lib64/python3.10/argparse.py", line 2007, in consume_optional
take_action(action, args, option_string)
File "/usr/lib64/python3.10/argparse.py", line 1935, in take_action
action(self, namespace, argument_values, option_string)
File "/usr/bin/kvm_stat", line 1649, in __call__
' to specify the desired pid'.format(" ".join(pids)))
TypeError: sequence item 0: expected str instance, int found
To avoid this, it's needed to convert pids int values to strings before
pass them to join().
hyperv_clock doesn't always give a stable test result, especially with
AMD CPUs. The test compares Hyper-V MSR clocksource (acquired either
with rdmsr() from within the guest or KVM_GET_MSRS from the host)
against rdtsc(). To increase the accuracy, increase the measured delay
(done with nop loop) by two orders of magnitude and take the mean rdtsc()
value before and after rdmsr()/KVM_GET_MSRS.
Similar to the Xen path, only change the vCPU's reported state if the vCPU
was actually preempted. The reason for KVM's behavior is that for example
optimistic spinning might not be a good idea if the guest is doing repeated
exits to userspace; however, it is confusing and unlikely to make a difference,
because well-tuned guests will hardly ever exit KVM_RUN in the first place.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
If a vCPU is outside guest mode and is scheduled out, it might be in the
process of making a memory access. A problem occurs if another vCPU uses
the PV TLB flush feature during the period when the vCPU is scheduled
out, and a virtual address has already been translated but has not yet
been accessed, because this is equivalent to using a stale TLB entry.
To avoid this, only report a vCPU as preempted if sure that the guest
is at an instruction boundary. A rescheduling request will be delivered
to the host physical CPU as an external interrupt, so for simplicity
consider any vmexit *not* instruction boundary except for external
interrupts.
It would in principle be okay to report the vCPU as preempted also
if it is sleeping in kvm_vcpu_block(): a TLB flush IPI will incur the
vmentry/vmexit overhead unnecessarily, and optimistic spinning is
also unlikely to succeed. However, leave it for later because right
now kvm_vcpu_check_block() is doing memory accesses. Even
though the TLB flush issue only applies to virtual memory address,
it's very much preferrable to be conservative.
The root cause of the bug sits in neon_poly1305_blocks. The logic
neon_poly1305_blocks() performed is that if it was called with both s[]
and r[] uninitialized, it will first try to initialize them with the
data from the first "block" that it believed to be 32 bytes in length.
First 16 bytes are used as the key and the next 16 bytes for s[]. This
would lead to the aforementioned read out-of-bound. However, after
calling poly1305_init_arch(), only 16 bytes were deducted from the input
and s[] is initialized yet again with the following 16 bytes. The second
initialization of s[] is certainly redundent which indicates that the
first initialization should be for r[] only.
This patch fixes the issue by calling poly1305_init_arm64() instead of
poly1305_init_arch(). This is also the implementation for the same
algorithm on arm platform.
The fix in commit 5debc80f12d4 ("ACPI/APEI: Limit printable size of BERT
table data") does not work as intended on systems where the BIOS has a
fixed size block of memory for the BERT table, relying on s/w to quit
when it finds a record with estatus->block_status == 0. On these systems
all errors are suppressed because the check:
if (region_len < ACPI_BERT_PRINT_MAX_LEN)
always fails.
New scheme skips individual CPER records that are too large, and also
limits the total number of records that will be printed to 5.
Fixes: 5debc80f12d4 ("ACPI/APEI: Limit printable size of BERT table data") Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Taking a recent change in the i8042 quirklist to this one: Clevo
board_names are somewhat unique, and if not: The generic Board_-/Sys_Vendor
string "Notebook" doesn't help much anyway. So identifying the devices just
by the board_name helps keeping the list significantly shorter and might
even hit more devices requiring the fix.
Signed-off-by: Werner Sembach <wse@tuxedocomputers.com> Fixes: 4da621e20bbc ("ACPI: video: Force backlight native for Clevo NL5xRU and NL5xNU") Cc: All applicable <stable@vger.kernel.org> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The TongFang PF5PU1G, PF4NU1F, PF5NU1G, and PF5LUXG/TUXEDO BA15 Gen10,
Pulse 14/15 Gen1, and Pulse 15 Gen2 have the same problem as the Clevo
NL5xRU and NL5xNU/TUXEDO Aura 15 Gen1 and Gen2:
They have a working native and video interface. However the default
detection mechanism first registers the video interface before
unregistering it again and switching to the native interface during boot.
This results in a dangling SBIOS request for backlight change for some
reason, causing the backlight to switch to ~2% once per boot on the first
power cord connect or disconnect event. Setting the native interface
explicitly circumvents this buggy behaviour by avoiding the unregistering
process.
Signed-off-by: Werner Sembach <wse@tuxedocomputers.com> Cc: All applicable <stable@vger.kernel.org> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit c3f4b3e8d49ee relocated and renamed the alloc_calls and
free_calls files from /sys/kernel/slab/NAME/*_calls over to
/sys/kernel/debug/slab/NAME/*_calls but didn't update the slabinfo tool
with the new location.
This change will now have slabinfo look at the new location (and filenames)
with a fallback to the prior files.
Commit 26330d5eb60a ("block: fix default IO priority handling")
introduced an inconsistency in get_current_ioprio() that tasks without
IO context return IOPRIO_DEFAULT priority while tasks with freshly
allocated IO context will return 0 (IOPRIO_CLASS_NONE/0) IO priority.
Tasks without IO context used to be rare before eaaf82e2eeab ("block:
move io_context creation into where it's needed") but after this commit
they became common because now only BFQ IO scheduler setups task's IO
context. Similar inconsistency is there for get_task_ioprio() so this
inconsistency is now exposed to userspace and userspace will see
different IO priority for tasks operating on devices with BFQ compared
to devices without BFQ. Furthemore the changes done by commit 26330d5eb60a change the behavior when no IO priority is set for BFQ IO
scheduler which is also documented in ioprio_set(2) manpage:
"If no I/O scheduler has been set for a thread, then by default the I/O
priority will follow the CPU nice value (setpriority(2)). In Linux
kernels before version 2.6.24, once an I/O priority had been set using
ioprio_set(), there was no way to reset the I/O scheduling behavior to
the default. Since Linux 2.6.24, specifying ioprio as 0 can be used to
reset to the default I/O scheduling behavior."
So make sure we default to IOPRIO_CLASS_NONE as used to be the case
before commit 26330d5eb60a. Also cleanup alloc_io_context() to
explicitely set this IO priority for the allocated IO context to avoid
future surprises. Note that we tweak ioprio_best() to maintain
ioprio_get(2) behavior and make this commit easily backportable.
CC: stable@vger.kernel.org Fixes: 26330d5eb60a ("block: fix default IO priority handling") Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220623074840.5960-1-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cgroup_skb/egress programs which sock_fields test installs process packets
flying in both directions, from the client to the server, and in reverse
direction.
Recently added dst_port check relies on the fact that destination
port (remote peer port) of the socket which sends the packet is known ahead
of time. This holds true only for the client socket, which connects to the
known server port.
Filter out any traffic that is not egressing from the client socket in the
BPF program that tests reading the dst_port.
Fixes: eea83c039b64 ("selftests/bpf: Extend verifier and bpf_sock tests for dst_port loads") Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220317113920.1068535-3-jakub@cloudflare.com Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add coverage to the verifier tests and tests for reading bpf_sock fields to
ensure that 32-bit, 16-bit, and 8-bit loads from dst_port field are allowed
only at intended offsets and produce expected values.
While 16-bit and 8-bit access to dst_port field is straight-forward, 32-bit
wide loads need be allowed and produce a zero-padded 16-bit value for
backward compatibility.
The mitigations for RETBleed are currently ineffective on x86_32 since
entry_32.S does not use the required macros. However, for an x86_32
target, the kconfig symbols for them are still enabled by default and
/sys/devices/system/cpu/vulnerabilities/retbleed will wrongly report
that mitigations are in place.
Make all of these symbols depend on X86_64, and only enable RETHUNK by
default on X86_64.
Fixes: 647c1deea862 ("x86/retbleed: Add fine grained Kconfig knobs") Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/YtwSR3NNsWp1ohfV@decadent.org.uk
[bwh: Backported to 5.10/5.15/5.18: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some cloud hypervisors do not provide IBPB on very recent CPU processors,
including AMD processors affected by Retbleed.
Using IBPB before firmware calls on such systems would cause a GPF at boot
like the one below. Do not enable such calls when IBPB support is not
present.
With commit adf56da3a439 ("locking/rwsem: Make handoff bit handling more
consistent"), the writer that sets the handoff bit can be interrupted
out without clearing the bit if the wait queue isn't empty. This disables
reader and writer optimistic lock spinning and stealing.
Now if a non-first writer in the queue is somehow woken up or a new
waiter enters the slowpath, it can't acquire the lock. This is not the
case before commit adf56da3a439 as the writer that set the handoff bit
will clear it when exiting out via the out_nolock path. This is less
efficient as the busy rwsem stays in an unlock state for a longer time.
In some cases, this new behavior may cause lockups as shown in [1] and
[2].
This patch allows a non-first writer to ignore the handoff bit if it
is not originally set or initiated by the first waiter. This patch is
shown to be effective in fixing the lockup problem reported in [1].
Handle 0x0020, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0013
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: PROC 1 DIMM 1 <===== device
Bank Locator: Not Specified <===== bank
This results in a buffer overflow because ghes_edac_register() calls
strlen() on an uninitialized label, which had non-zero values left over
from krealloc_array():
The label contains garbage because the commit in Fixes reallocs the
DIMMs array while scanning the system but doesn't clear the newly
allocated memory.
Change dimm_setup_label() to always initialize the label to fix the
issue. Set it to the empty string in case BIOS does not provide both
bank and device so that ghes_edac_register() can keep the default label
given by edac_mc_alloc_dimms().
[ bp: Rewrite commit message. ]
Fixes: 66721d153402a ("EDAC/ghes: Scan the system once on driver init") Co-developed-by: Robert Richter <rric@kernel.org> Signed-off-by: Robert Richter <rric@kernel.org> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Robert Elliott <elliott@hpe.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220719220124.760359-1-toshi.kani@hpe.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit be1c3685b6f1 ("mm/memblock: add memblock memory allocation apis")
added a check to determine whether arm_dma_zone_size is exceeding the
amount of kernel virtual address space available between the upper 4GB
virtual address limit and PAGE_OFFSET in order to provide a suitable
definition of MAX_DMA_ADDRESS that should fit within the 32-bit virtual
address space. The quantity used for comparison was off by a missing
trailing 0, leading to MAX_DMA_ADDRESS to be overflowing a 32-bit
quantity.
This was caught thanks to CONFIG_DEBUG_VIRTUAL on the bcm2711 platform
where we define a dma_zone_size of 1GB and we have a PAGE_OFFSET value
of 0xc000_0000 (CONFIG_VMSPLIT_3G) leading to MAX_DMA_ADDRESS being
0x1_0000_0000 which overflows the unsigned long type used throughout
__pa() and then __virt_addr_valid(). Because the virtual address passed
to __virt_addr_valid() would now be 0, the function would loudly warn
and flood the kernel log, thus making the platform unable to boot
properly.
There was a report that a task is waiting at the
throttle_direct_reclaim. The pgscan_direct_throttle in vmstat was
increasing.
This is a bug where zone_watermark_fast returns true even when the free
is very low. The commit b9f78fa27226 ("page_alloc: consider highatomic
reserve in watermark fast") changed the watermark fast to consider
highatomic reserve. But it did not handle a negative value case which
can be happened when reserved_highatomic pageblock is bigger than the
actual free.
If watermark is considered as ok for the negative value, allocating
contexts for order-0 will consume all free pages without direct reclaim,
and finally free page may become depleted except highatomic free.
Then allocating contexts may fall into throttle_direct_reclaim. This
symptom may easily happen in a system where wmark min is low and other
reclaimers like kswapd does not make free pages quickly.
Handle the negative case by using MIN.
Link: https://lkml.kernel.org/r/20220725095212.25388-1-jaewon31.kim@samsung.com Fixes: b9f78fa27226 ("page_alloc: consider highatomic reserve in watermark fast") Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Reported-by: GyeongHwan Hong <gh21.hong@samsung.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Yong-Taek Lee <ytk.lee@samsung.com> Cc: <stable@vger.kerenl.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If hmm_range_fault() is called with the HMM_PFN_REQ_FAULT flag and a
device private PTE is found, the hmm_range::dev_private_owner page is used
to determine if the device private page should not be faulted in.
However, if the device private page is not owned by the caller,
hmm_range_fault() returns an error instead of calling migrate_to_ram() to
fault in the page.
For example, if a page is migrated to GPU private memory and a RDMA fault
capable NIC tries to read the migrated page, without this patch it will
get an error. With this patch, the page will be migrated back to system
memory and the NIC will be able to read the data.
Link: https://lkml.kernel.org/r/20220727000837.4128709-2-rcampbell@nvidia.com Link: https://lkml.kernel.org/r/20220725183615.4118795-2-rcampbell@nvidia.com Fixes: 58d609aa32d0 ("mm/hmm: check the device private page owner in hmm_range_fault()") Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Reported-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Philip Yang <Philip.Yang@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This happens when calling sctp_sendmsg without connecting to server first.
In this case, a data chunk already queues up in send queue of client side
when processing the INIT_ACK from server in sctp_process_init() where it
calls sctp_stream_init() to alloc stream_in. If it fails to alloc stream_in
all stream_out will be freed in sctp_stream_init's err path. Then in the
asoc freeing it will crash when dequeuing this data chunk as stream_out
is missing.
As we can't free stream out before dequeuing all data from send queue, and
this patch is to fix it by moving the err path stream_out/in freeing in
sctp_stream_init() to sctp_stream_free() which is eventually called when
freeing the asoc in sctp_association_free(). This fix also makes the code
in sctp_process_init() more clear.
Note that in sctp_association_init() when it fails in sctp_stream_init(),
sctp_association_free() will not be called, and in that case it should
go to 'stream_free' err path to free stream instead of 'fail_init'.
Sending a PTP packet can imply to use the normal TX driver datapath but
invoked from the driver's ptp worker. The kernel generic TX code
disables softirqs and preemption before calling specific driver TX code,
but the ptp worker does not. Although current ptp driver functionality
does not require it, there are several reasons for doing so:
1) The invoked code is always executed with softirqs disabled for non
PTP packets.
2) Better if a ptp packet transmission is not interrupted by softirq
handling which could lead to high latencies.
3) netdev_xmit_more used by the TX code requires preemption to be
disabled.
Indeed a solution for dealing with kernel preemption state based on static
kernel configuration is not possible since the introduction of dynamic
preemption level configuration at boot time using the static calls
functionality.
When using 'perf mem' and 'perf c2c', an issue is observed that tool
reports the wrong offset for global data symbols. This is a common
issue on both x86 and Arm64 platforms.
Let's see an example, for a test program, below is the disassembly for
its .bss section which is dumped with objdump:
First we used 'perf mem record' to run the test program and then used
'perf --debug verbose=4 mem report' to observe what's the symbol info
for 'buf1' and 'buf2' structures.
The perf tool relies on libelf to parse symbols, in executable and
shared object files, 'st_value' holds a virtual address; 'sh_addr' is
the address at which section's first byte should reside in memory, and
'sh_offset' is the byte offset from the beginning of the file to the
first byte in the section. The perf tool uses below formula to convert
a symbol's memory address to a file address:
We can see the final adjusted address ranges for buf1 and buf2 are
[0x30a8-0x30e8) and [0x3068-0x30a8) respectively, apparently this is
incorrect, in the code, the structure for 'buf1' and 'buf2' specifies
compiler attribute with 64-byte alignment.
The problem happens for 'sh_offset', libelf returns it as 0x3028 which
is not 64-byte aligned, combining with disassembly, it's likely libelf
doesn't respect the alignment for .bss section, therefore, it doesn't
return the aligned value for 'sh_offset'.
Suggested by Fangrui Song, ELF file contains program header which
contains PT_LOAD segments, the fields p_vaddr and p_offset in PT_LOAD
segments contain the execution info. A better choice for converting
memory address to file address is using the formula:
file_address = st_value - p_vaddr + p_offset
This patch introduces elf_read_program_header() which returns the
program header based on the passed 'st_value', then it uses the formula
above to calculate the symbol file address; and the debugging log is
updated respectively.
We try using cancel_delayed_work_sync() to prevent the work from
enabling NAPI. This is insufficient since we don't disable the source
of the refill work scheduling. This means an NAPI poll callback after
cancel_delayed_work_sync() can schedule the refill work then can
re-enable the NAPI that leads to use-after-free [1].
Since the work can enable NAPI, we can't simply disable NAPI before
calling cancel_delayed_work_sync(). So fix this by introducing a
dedicated boolean to control whether or not the work could be
scheduled from NAPI.
[1]
==================================================================
BUG: KASAN: use-after-free in refill_work+0x43/0xd4
Read of size 2 at addr ffff88810562c92e by task kworker/2:1/42
Fixes: 5a39848b6859f ("virtio_net: set/cancel work on ndo_open/ndo_stop") Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
NIX_AF_TLXX_PIR/CIR register format has changed from OcteonTx2
to CN10K. CN10K supports larger burst size. Fix burst exponent
and burst mantissa configuration for CN10K.
Also fixed 'maxrate' from u32 to u64 since 'police.rate_bytes_ps'
passed by stack is also u64.
There are sleep in atomic context bugs in timer handlers of sctp
such as sctp_generate_t3_rtx_event(), sctp_generate_probe_event(),
sctp_generate_t1_init_event(), sctp_generate_timeout_event(),
sctp_generate_t3_rtx_event() and so on.
The root cause is sctp_sched_prio_init_sid() with GFP_KERNEL parameter
that may sleep could be called by different timer handlers which is in
interrupt context.
One of the call paths that could trigger bug is shown below:
This patch changes gfp_t parameter of init_sid in sctp_sched_set_sched()
from GFP_KERNEL to GFP_ATOMIC in order to prevent sleep in atomic
context bugs.
Fix the inability to bring an interface up on a setup with
only MSI interrupts enabled (no MSI-X).
Solution is to add a default number of QPs = 1. This is enough,
since without MSI-X support driver enables only a basic feature set.
Fixes: a2d761c94ee9 ("i40e: Fix the number of queues available to be mapped for use") Signed-off-by: Dawid Lukwinski <dawid.lukwinski@intel.com> Signed-off-by: Michal Maloszewski <michal.maloszewski@intel.com> Tested-by: Dave Switzer <david.switzer@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20220722175401.112572-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
While reading sysctl_fib_notify_on_flag_change, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 1d4352c9675c ("net: ipv4: Emit notification when fib hardware flags are changed") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
While reading sysctl_tcp_reflect_tos, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 6553988b8db0 ("tcp: reflect tos value received in SYN to the socket") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
mld_{query | report}_work() processes queued events.
If there are too many events in the queue, it re-queue a work.
And then, it returns without in6_dev_put().
But if queuing is failed, it should call in6_dev_put(), but it doesn't.
So, a reference count leak would occur.
Script to reproduce(by Hangbin Liu):
ip netns add ns1
ip netns add ns2
ip netns exec ns1 sysctl -w net.ipv6.conf.all.force_mld_version=1
ip netns exec ns2 sysctl -w net.ipv6.conf.all.force_mld_version=1
ip -n ns1 link add veth0 type veth peer name veth0 netns ns2
ip -n ns1 link set veth0 up
ip -n ns2 link set veth0 up
for i in `seq 50`; do
for j in `seq 100`; do
ip -n ns1 addr add 2021:${i}::${j}/64 dev veth0
ip -n ns2 addr add 2022:${i}::${j}/64 dev veth0
done
done
modprobe -r veth
ip -a netns del
splat looks like:
unregister_netdevice: waiting for veth0 to become free. Usage count = 2
leaked reference.
ipv6_add_dev+0x324/0xec0
addrconf_notify+0x481/0xd10
raw_notifier_call_chain+0xe3/0x120
call_netdevice_notifiers+0x106/0x160
register_netdevice+0x114c/0x16b0
veth_newlink+0x48b/0xa50 [veth]
rtnl_newlink+0x11a2/0x1a40
rtnetlink_rcv_msg+0x63f/0xc00
netlink_rcv_skb+0x1df/0x3e0
netlink_unicast+0x5de/0x850
netlink_sendmsg+0x6c9/0xa90
____sys_sendmsg+0x76a/0x780
__sys_sendmsg+0x27c/0x340
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Tested-by: Hangbin Liu <liuhangbin@gmail.com> Fixes: 76fa0d5bd59a ("mld: add new workqueues for process mld events") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
init_rx_sa() allocates relevant resource for rx_sa->stats and rx_sa->
key.tfm with alloc_percpu() and macsec_alloc_tfm(). When some error
occurs after init_rx_sa() is called in macsec_add_rxsa(), the function
released rx_sa with kfree() without releasing rx_sa->stats and rx_sa->
key.tfm, which will lead to a resource leak.
We should call macsec_rxsa_put() instead of kfree() to decrease the ref
count of rx_sa and release the relevant resource if the refcount is 0.
The same bug exists in macsec_add_txsa() for tx_sa as well. This patch
fixes the above two bugs.
Fixes: 37ce3272dda3 ("net: macsec: hardware offloading infrastructure") Signed-off-by: Jianglei Nie <niejianglei2021@163.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently, MACSEC_SA_ATTR_PN is handled inconsistently, sometimes as a
u32, sometimes forced into a u64 without checking the actual length of
the attribute. Instead, we can use nla_get_u64 everywhere, which will
read up to 64 bits into a u64, capped by the actual length of the
attribute coming from userspace.
This fixes several issues:
- the check in validate_add_rxsa doesn't work with 32-bit attributes
- the checks in validate_add_txsa and validate_upd_sa incorrectly
reject X << 32 (with X != 0)
Fixes: bb178617ead6 ("macsec: Netlink support of XPN cipher suites (IEEE 802.1AEbw)") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
IEEE 802.1AEbw-2013 (section 10.7.8) specifies that the maximum value
of the replay window is 2^30-1, to help with recovery of the upper
bits of the PN.
To avoid leaving the existing macsec device in an inconsistent state
if this test fails during changelink, reuse the cleanup mechanism
introduced for HW offload. This wasn't needed until now because
macsec_changelink_common could not fail during changelink, as
modifying the cipher suite was not allowed.
Finally, this must happen after handling IFLA_MACSEC_CIPHER_SUITE so
that secy->xpn is set.
Fixes: bb178617ead6 ("macsec: Netlink support of XPN cipher suites (IEEE 802.1AEbw)") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
The expected length is MACSEC_SALT_LEN, not MACSEC_SA_ATTR_SALT.
Fixes: bb178617ead6 ("macsec: Netlink support of XPN cipher suites (IEEE 802.1AEbw)") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Commit bb178617ead6 added a test on tb_sa[MACSEC_SA_ATTR_PN], but
nothing guarantees that it's not NULL at this point. The same code was
added to macsec_add_txsa, but there it's not a problem because
validate_add_txsa checks that the MACSEC_SA_ATTR_PN attribute is
present.
Note: it's not possible to reproduce with iproute, because iproute
doesn't allow creating an SA without specifying the PN.
Fixes: bb178617ead6 ("macsec: Netlink support of XPN cipher suites (IEEE 802.1AEbw)") Link: https://bugzilla.kernel.org/show_bug.cgi?id=208315 Reported-by: Frantisek Sumsal <fsumsal@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Since commit 18662ea47f2a ("sctp: implement memory accounting on tx path"),
SCTP has supported memory accounting on tx path where 'sctp_wmem' is used
by sk_wmem_schedule(). So we should fix the description for this option in
ip-sysctl.rst accordingly.
v1->v2:
- Improve the description as Marcelo suggested.
Fixes: 18662ea47f2a ("sctp: implement memory accounting on tx path") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
While reading sysctl_tcp_invalid_ratelimit, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 7b805d602346 ("tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
While reading sysctl_tcp_autocorking, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 4aed7f27920e ("tcp: auto corking") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
While reading sysctl_tcp_min_rtt_wlen, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 50bc59fc3444 ("tcp: track min RTT using windowed min-filter") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
While phylink_pcs_ops :: pcs_get_state does return void, xpcs_get_state()
does check for a non-zero return code from xpcs_get_state_c37_sgmii()
and prints that as a message to the kernel log.
However, a non-zero return code from xpcs_read() is translated into
"return false" (i.e. zero as int) and the I/O error is therefore not
printed. Fix that.
Fixes: f5cf66c4e398 ("net: pcs: add C37 SGMII AN support for intel mGbE controller") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20220720112057.3504398-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
tls_device_down takes a reference on all contexts it's going to move to
the degraded state (software fallback). If sk_destruct runs afterwards,
it can reduce the reference counter back to 1 and return early without
destroying the context. Then tls_device_down will release the reference
it took and call tls_device_free_ctx. However, the context will still
stay in tls_device_down_list forever. The list will contain an item,
memory for which is released, making a memory corruption possible.
Fix the above bug by properly removing the context from all lists before
any call to tls_device_free_ctx.
Fixes: bf5c2d483b13 ("tls: Fix context leak on tls_device_down") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change net device's MTU to smaller than IPV6_MIN_MTU or unregister
device while matching route. That may trigger null-ptr-deref bug
for ip6_ptr probability as following.
=========================================================
BUG: KASAN: null-ptr-deref in find_match.part.0+0x70/0x134
Read of size 4 at addr 0000000000000308 by task ping6/263
Reproducer as following:
Firstly, prepare conditions:
$ip netns add ns1
$ip netns add ns2
$ip link add veth1 type veth peer name veth2
$ip link set veth1 netns ns1
$ip link set veth2 netns ns2
$ip netns exec ns1 ip -6 addr add 2001:0db8:0:f101::1/64 dev veth1
$ip netns exec ns2 ip -6 addr add 2001:0db8:0:f101::2/64 dev veth2
$ip netns exec ns1 ifconfig veth1 up
$ip netns exec ns2 ifconfig veth2 up
$ip netns exec ns1 ip -6 route add 2000::/64 dev veth1 metric 1
$ip netns exec ns2 ip -6 route add 2001::/64 dev veth2 metric 1
Secondly, execute the following two commands in two ssh windows
respectively:
$ip netns exec ns1 sh
$while true; do ip -6 addr add 2001:0db8:0:f101::1/64 dev veth1; ip -6 route add 2000::/64 dev veth1 metric 1; ping6 2000::2; done
$ip netns exec ns1 sh
$while true; do ip link set veth1 mtu 1000; ip link set veth1 mtu 1500; sleep 5; done
It is because ip6_ptr has been assigned to NULL in addrconf_ifdown() firstly,
then ip6_ignore_linkdown() accesses ip6_ptr directly without NULL check.
When we close ping6 sockets, some resources are left unfreed because
pingv6_prot is missing sk->sk_prot->destroy(). As reported by
syzbot [0], just three syscalls leak 96 bytes and easily cause OOM.
struct ipv6_sr_hdr *hdr;
char data[24] = {0};
int fd;
To fix memory leaks, let's add a destroy function.
Note the socket() syscall checks if the GID is within the range of
net.ipv4.ping_group_range. The default value is [1, 0] so that no
GID meets the condition (1 <= GID <= 0). Thus, the local DoS does
not succeed until we change the default value. However, at least
Ubuntu/Fedora/RHEL loosen it.