Daeho Jeong [Thu, 21 Jan 2021 13:45:29 +0000 (22:45 +0900)]
f2fs: add ckpt_thread_ioprio sysfs node
Added "ckpt_thread_ioprio" sysfs node to give a way to change checkpoint
merge daemon's io priority. Its default value is "be,3", which means
"BE" I/O class and I/O priority "3". We can select the class between "rt"
and "be", and set the I/O priority within valid range of it.
"," delimiter is necessary in between I/O class and priority number.
Daeho Jeong [Tue, 19 Jan 2021 00:00:42 +0000 (09:00 +0900)]
f2fs: introduce checkpoint_merge mount option
We've added a new mount options, "checkpoint_merge" and "nocheckpoint_merge",
which creates a kernel daemon and makes it to merge concurrent checkpoint
requests as much as possible to eliminate redundant checkpoint issues. Plus,
we can eliminate the sluggish issue caused by slow checkpoint operation
when the checkpoint is done in a process context in a cgroup having
low i/o budget and cpu shares. To make this do better, we set the
default i/o priority of the kernel daemon to "3", to give one higher
priority than other kernel threads. The below verification result
explains this.
The basic idea has come from https://opensource.samsung.com.
[Verification]
Android Pixel Device(ARM64, 7GB RAM, 256GB UFS)
Create two I/O cgroups (fg w/ weight 100, bg w/ wight 20)
Set "strict_guarantees" to "1" in BFQ tunables
In "fg" cgroup,
- thread A => trigger 1000 checkpoint operations
"for i in `seq 1 1000`; do touch test_dir1/file; fsync test_dir1;
done"
- thread B => gererating async. I/O
"fio --rw=write --numjobs=1 --bs=128k --runtime=3600 --time_based=1
--filename=test_img --name=test"
In "bg" cgroup,
- thread C => trigger repeated checkpoint operations
"echo $$ > /dev/blkio/bg/tasks; while true; do touch test_dir2/file;
fsync test_dir2; done"
Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
[Jaegeuk Kim: fix the return value in f2fs_start_ckpt_thread, reported by Dan] Signed-off-by: Daeho Jeong <daehojeong@google.com> Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Yi Chen [Thu, 28 Jan 2021 09:02:56 +0000 (17:02 +0800)]
f2fs: fix to avoid inconsistent quota data
Occasionally, quota data may be corrupted detected by fsck:
Info: checkpoint state = 45 : crc compacted_summary unmount
[QUOTA WARNING] Usage inconsistent for ID 0:actual (1543036928, 762) != expected (1543032832, 762)
[ASSERT] (fsck_chk_quota_files:1986) --> Quota file is missing or invalid quota file content found.
[QUOTA WARNING] Usage inconsistent for ID 0:actual (1352478720, 344) != expected (1352474624, 344)
[ASSERT] (fsck_chk_quota_files:1986) --> Quota file is missing or invalid quota file content found.
[FSCK] Unreachable nat entries [Ok..] [0x0]
[FSCK] SIT valid block bitmap checking [Ok..]
[FSCK] Hard link checking for regular file [Ok..] [0x0]
[FSCK] valid_block_count matching with CP [Ok..] [0xdf299]
[FSCK] valid_node_count matcing with CP (de lookup) [Ok..] [0x2b01]
[FSCK] valid_node_count matcing with CP (nat lookup) [Ok..] [0x2b01]
[FSCK] valid_inode_count matched with CP [Ok..] [0x2665]
[FSCK] free segment_count matched with CP [Ok..] [0xcb04]
[FSCK] next block offset is free [Ok..]
[FSCK] fixing SIT types
[FSCK] other corrupted bugs [Fail]
The root cause is:
If we open file w/ readonly flag, disk quota info won't be initialized
for this file, however, following mmap() will force to convert inline
inode via f2fs_convert_inline_inode(), which may increase block usage
for this inode w/o updating quota data, it causes inconsistent disk quota
info.
The issue will happen in following stack:
open(file, O_RDONLY)
mmap(file)
- f2fs_convert_inline_inode
- f2fs_convert_inline_page
- f2fs_reserve_block
- f2fs_reserve_new_block
- f2fs_reserve_new_blocks
- f2fs_i_blocks_write
- dquot_claim_block
inode->i_blocks increase, but the dqb_curspace keep the size for the dquots
is NULL.
To fix this issue, let's call dquot_initialize() anyway in both
f2fs_truncate() and f2fs_convert_inline_inode() functions to avoid potential
inconsistent quota data issue.
Fixes: 0abd675e97e6 ("f2fs: support plain user/group quota") Signed-off-by: Daiyue Zhang <zhangdaiyue1@huawei.com> Signed-off-by: Dehe Gu <gudehe@huawei.com> Signed-off-by: Junchao Jiang <jiangjunchao1@huawei.com> Signed-off-by: Ge Qiu <qiuge@huawei.com> Signed-off-by: Yi Chen <chenyi77@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Wed, 27 Jan 2021 01:00:42 +0000 (17:00 -0800)]
f2fs: flush data when enabling checkpoint back
During checkpoint=disable period, f2fs bypasses all the synchronous IOs such as
sync and fsync. So, when enabling it back, we must flush all of them in order
to keep the data persistent. Otherwise, suddern power-cut right after enabling
checkpoint will cause data loss.
With the new ->readahead operation, locked pages are added to the page
cache, preventing two threads from racing with each other to read the
same chunk of file, so this is dead code.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Thu, 14 Jan 2021 01:41:27 +0000 (09:41 +0800)]
f2fs: introduce sb_status sysfs node
Introduce /sys/fs/f2fs/<devname>/stat/sb_status to show superblock
status in real time as a hexadecimal value.
value sb status macro description
0x1 SBI_IS_DIRTY, /* dirty flag for checkpoint */
0x2 SBI_IS_CLOSE, /* specify unmounting */
0x4 SBI_NEED_FSCK, /* need fsck.f2fs to fix */
0x8 SBI_POR_DOING, /* recovery is doing or not */
0x10 SBI_NEED_SB_WRITE, /* need to recover superblock */
0x20 SBI_NEED_CP, /* need to checkpoint */
0x40 SBI_IS_SHUTDOWN, /* shutdown by ioctl */
0x80 SBI_IS_RECOVERED, /* recovered orphan/data */
0x100 SBI_CP_DISABLED, /* CP was disabled last mount */
0x200 SBI_CP_DISABLED_QUICK, /* CP was disabled quickly */
0x400 SBI_QUOTA_NEED_FLUSH, /* need to flush quota info in CP */
0x800 SBI_QUOTA_SKIP_FLUSH, /* skip flushing quota in current CP */
0x1000 SBI_QUOTA_NEED_REPAIR, /* quota file may be corrupted */
0x2000 SBI_IS_RESIZEFS, /* resizefs is in process */
Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chengguang Xu [Wed, 13 Jan 2021 05:21:54 +0000 (13:21 +0800)]
f2fs: fix to use per-inode maxbytes
F2FS inode may have different max size, e.g. compressed file have
less blkaddr entries in all its direct-node blocks, result in being
with less max filesize. So change to use per-inode maxbytes.
Chao Yu [Mon, 11 Jan 2021 09:42:53 +0000 (17:42 +0800)]
f2fs: compress: fix potential deadlock
generic/269 reports a hangtask issue, the root cause is ABBA deadlock
described as below:
Thread A Thread B
- down_write(&sbi->gc_lock) -- A
- f2fs_write_data_pages
- lock all pages in cluster -- B
- f2fs_write_multi_pages
- f2fs_write_raw_pages
- f2fs_write_single_data_page
- f2fs_balance_fs
- down_write(&sbi->gc_lock) -- A
- f2fs_gc
- do_garbage_collect
- ra_data_block
- pagecache_get_page -- B
To fix this, it needs to avoid calling f2fs_balance_fs() if there is
still cluster pages been locked in context of cluster writeback, so
instead, let's call f2fs_balance_fs() in the end of
f2fs_write_raw_pages() when all cluster pages were unlocked.
Fixes: 4c8ff7095bef ("f2fs: support data compression") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Eric Biggers [Mon, 28 Dec 2020 23:25:29 +0000 (15:25 -0800)]
libfs: unexport generic_ci_d_compare() and generic_ci_d_hash()
Now that generic_set_encrypted_ci_d_ops() has been added and ext4 and
f2fs are using it, it's no longer necessary to export
generic_ci_d_compare() and generic_ci_d_hash() to filesystems.
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Daeho Jeong [Tue, 5 Jan 2021 23:49:28 +0000 (08:49 +0900)]
f2fs: fix null page reference in redirty_blocks
By Colin's static analysis, we found out there is a null page reference
under low memory situation in redirty_blocks. I've made the page finding
loop stop immediately and return an error not to cause further memory
pressure when we run into a failure to find a page under low memory
condition.
Signed-off-by: Daeho Jeong <daehojeong@google.com> Reported-by: Colin Ian King <colin.king@canonical.com> Fixes: 5fdb322ff2c2 ("f2fs: add F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE") Reviewed-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Eric Biggers [Tue, 5 Jan 2021 06:33:02 +0000 (22:33 -0800)]
f2fs: clean up post-read processing
Rework the post-read processing logic to be much easier to understand.
At least one bug is fixed by this: if an I/O error occurred when reading
from disk, decryption and verity would be performed on the uninitialized
data, causing misleading messages in the kernel log.
Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Wed, 30 Dec 2020 08:38:35 +0000 (16:38 +0800)]
f2fs: trival cleanup in move_data_block()
Trival cleanups:
- relocate set_summary() before its use
- relocate "allocate block address" to correct place
- remove unneeded f2fs_wait_on_page_writeback()
Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Mon, 14 Dec 2020 09:20:57 +0000 (17:20 +0800)]
f2fs: fix to tag FIEMAP_EXTENT_MERGED in f2fs_fiemap()
f2fs does not natively support extents in metadata, 'extent' in f2fs
is used as a virtual concept, so in f2fs_fiemap() interface, it needs
to tag FIEMAP_EXTENT_MERGED flag to indicated the extent status is a
result of merging.
Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Wed, 9 Dec 2020 08:43:27 +0000 (16:43 +0800)]
f2fs: introduce a new per-sb directory in sysfs
Add a new directory 'stat' in path of /sys/fs/f2fs/<devname>/, later
we can add new readonly stat sysfs file into this directory, it will
make <devname> directory less mess.
Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 22 Jan 2021 09:46:43 +0000 (17:46 +0800)]
f2fs: compress: support compress level
Expand 'compress_algorithm' mount option to accept parameter as format of
<algorithm>:<level>, by this way, it gives a way to allow user to do more
specified config on lz4 and zstd compression level, then f2fs compression
can provide higher compress ratio.
In order to set compress level for lz4 algorithm, it needs to set
CONFIG_LZ4HC_COMPRESS and CONFIG_F2FS_FS_LZ4HC config to enable lz4hc
compress algorithm.
If kernel doesn't support certain kinds of compress algorithm, deny to set
them as compress algorithm of f2fs via 'compress_algorithm=%s' mount option.
Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Sat, 26 Dec 2020 10:07:01 +0000 (18:07 +0800)]
f2fs: enforce the immutable flag on open files
This patch ports commit 02b016ca7f99 ("ext4: enforce the immutable
flag on open files") to f2fs.
According to the chattr man page, "a file with the 'i' attribute
cannot be modified..." Historically, this was only enforced when the
file was opened, per the rest of the description, "... and the file
can not be opened in write mode".
There is general agreement that we should standardize all file systems
to prevent modifications even for files that were opened at the time
the immutable flag is set. Eventually, a change to enforce this at
the VFS layer should be landing in mainline.
Cc: stable@kernel.org Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 25 Dec 2020 08:52:27 +0000 (16:52 +0800)]
f2fs: enhance to update i_mode and acl atomically in f2fs_setattr()
Previously, in f2fs_setattr(), we don't update S_ISUID|S_ISGID|S_ISVTX
bits with S_IRWXUGO bits and acl entries atomically, so in error path,
chmod() may partially success, this patch enhances to make chmod() flow
being atomical.
Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Weichao Guo [Mon, 14 Dec 2020 03:54:53 +0000 (11:54 +0800)]
f2fs: fix to set inode->i_mode correctly for posix_acl_update_mode
We should update the ~S_IRWXUGO part of inode->i_mode in __setattr_copy,
because posix_acl_update_mode updates mode based on inode->i_mode,
which finally overwrites the ~S_IRWXUGO part of i_acl_mode with old i_mode.
Testcase to reproduce this bug:
0. adduser abc
1. mkfs.f2fs /dev/sdd
2. mount -t f2fs /dev/sdd /mnt/f2fs
3. mkdir /mnt/f2fs/test
4. setfacl -m u:abc:r /mnt/f2fs/test
5. chmod +s /mnt/f2fs/test
Signed-off-by: Weichao Guo <guoweichao@oppo.com> Signed-off-by: Bin Shu <shubin@oppo.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Linus Torvalds [Wed, 27 Jan 2021 19:06:15 +0000 (11:06 -0800)]
Merge branch 'parisc-5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
"Two small fixes:
- Fix linking error with 64-bit kernel when modules are disabled,
reported by kernel test robot
- Remove leftover reference to power_tasklet, by Davidlohr Bueso"
* 'parisc-5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Enable -mlong-calls gcc option by default when !CONFIG_MODULES
parisc: Remove leftover reference to the power_tasklet
Remove the magical "repo-abbrev" comment added when this file was
introduced in e0ab1ec9fcd3 ([PATCH] add .mailmap for proper
git-shortlog output, 2007-02-14).
It's been an undocumented feature of git-shortlog(1), originally added
to git for Linus's use. Since then he's no longer using it[1], and
I've removed the feature in git.git's 4e168333a87 (shortlog: remove
unused(?) "repo-abbrev" feature, 2021-01-12). It's on the "master"
branch, but not yet in a release version.
Let's also remove it from linux.git, both as a heads-up to any
potential users of it in linux.git whose use would be broken sooner
than later by git itself, and because it'll eventually be entirely
redundant.
Helge Deller [Tue, 26 Jan 2021 19:16:21 +0000 (20:16 +0100)]
parisc: Enable -mlong-calls gcc option by default when !CONFIG_MODULES
When building a kernel without module support, the CONFIG_MLONGCALL option
needs to be enabled in order to reach symbols which are outside of a 22-bit
branch.
This patch changes the autodetection in the Kconfig script to always enable
CONFIG_MLONGCALL when modules are disabled and uses a far call to
preempt_schedule_irq() in intr_do_preempt() to reach the symbol in all cases.
Linus Torvalds [Tue, 26 Jan 2021 19:10:14 +0000 (11:10 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- x86 bugfixes
- Documentation fixes
- Avoid performance regression due to SEV-ES patches
- ARM:
- Don't allow tagged pointers to point to memslots
- Filter out ARMv8.1+ PMU events on v8.0 hardware
- Hide PMU registers from userspace when no PMU is configured
- More PMU cleanups
- Don't try to handle broken PSCI firmware
- More sys_reg() to reg_to_encoding() conversions
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: allow KVM_REQ_GET_NESTED_STATE_PAGES outside guest mode for VMX
KVM: x86: Revert "KVM: x86: Mark GPRs dirty when written"
KVM: SVM: Unconditionally sync GPRs to GHCB on VMRUN of SEV-ES guest
KVM: nVMX: Sync unsync'd vmcs02 state to vmcs12 on migration
kvm: tracing: Fix unmatched kvm_entry and kvm_exit events
KVM: Documentation: Update description of KVM_{GET,CLEAR}_DIRTY_LOG
KVM: x86: get smi pending status correctly
KVM: x86/pmu: Fix HW_REF_CPU_CYCLES event pseudo-encoding in intel_arch_events[]
KVM: x86/pmu: Fix UBSAN shift-out-of-bounds warning in intel_pmu_refresh()
KVM: x86: Add more protection against undefined behavior in rsvd_bits()
KVM: Documentation: Fix spec for KVM_CAP_ENABLE_CAP_VM
KVM: Forbid the use of tagged userspace addresses for memslots
KVM: arm64: Filter out v8.1+ events on v8.0 HW
KVM: arm64: Compute TPIDR_EL2 ignoring MTE tag
KVM: arm64: Use the reg_to_encoding() macro instead of sys_reg()
KVM: arm64: Allow PSCI SYSTEM_OFF/RESET to return
KVM: arm64: Simplify handling of absent PMU system registers
KVM: arm64: Hide PMU registers from userspace when not available
Linus Torvalds [Tue, 26 Jan 2021 18:59:01 +0000 (10:59 -0800)]
Merge tag 'regulator-fix-v5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fixes from Mark Brown:
"The main thing here is a change to make sure that we don't try to
double resolve the supply of a regulator if we have two probes going
on simultaneously, plus an incremental fix on top of that to resolve a
lockdep issue it introduced.
There's also a patch from Dmitry Osipenko adding stubs for some
functions to avoid build issues in consumers in some configurations"
* tag 'regulator-fix-v5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: Fix lockdep warning resolving supplies
regulator: consumer: Add missing stubs to regulator/consumer.h
regulator: core: avoid regulator_resolve_supply() race condition
Paolo Bonzini [Fri, 8 Jan 2021 16:43:08 +0000 (11:43 -0500)]
KVM: x86: allow KVM_REQ_GET_NESTED_STATE_PAGES outside guest mode for VMX
VMX also uses KVM_REQ_GET_NESTED_STATE_PAGES for the Hyper-V eVMCS,
which may need to be loaded outside guest mode. Therefore we cannot
WARN in that case.
However, that part of nested_get_vmcs12_pages is _not_ needed at
vmentry time. Split it out of KVM_REQ_GET_NESTED_STATE_PAGES handling,
so that both vmentry and migration (and in the latter case, independent
of is_guest_mode) do the parts that are needed.
KVM: x86: Revert "KVM: x86: Mark GPRs dirty when written"
Revert the dirty/available tracking of GPRs now that KVM copies the GPRs
to the GHCB on any post-VMGEXIT VMRUN, even if a GPR is not dirty. Per
commit de3cd117ed2f ("KVM: x86: Omit caching logic for always-available
GPRs"), tracking for GPRs noticeably impacts KVM's code footprint.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210122235049.3107620-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SVM: Unconditionally sync GPRs to GHCB on VMRUN of SEV-ES guest
Drop the per-GPR dirty checks when synchronizing GPRs to the GHCB, the
GRPs' dirty bits are set from time zero and never cleared, i.e. will
always be seen as dirty. The obvious alternative would be to clear
the dirty bits when appropriate, but removing the dirty checks is
desirable as it allows reverting GPR dirty+available tracking, which
adds overhead to all flavors of x86 VMs.
Note, unconditionally writing the GPRs in the GHCB is tacitly allowed
by the GHCB spec, which allows the hypervisor (or guest) to provide
unnecessary info; it's the guest's responsibility to consume only what
it needs (the hypervisor is untrusted after all).
The guest and hypervisor can supply additional state if desired but
must not rely on that additional state being provided.
Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT") Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210122235049.3107620-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Thu, 14 Jan 2021 20:54:47 +0000 (22:54 +0200)]
KVM: nVMX: Sync unsync'd vmcs02 state to vmcs12 on migration
Even when we are outside the nested guest, some vmcs02 fields
may not be in sync vs vmcs12. This is intentional, even across
nested VM-exit, because the sync can be delayed until the nested
hypervisor performs a VMCLEAR or a VMREAD/VMWRITE that affects those
rarely accessed fields.
However, during KVM_GET_NESTED_STATE, the vmcs12 has to be up to date to
be able to restore it. To fix that, call copy_vmcs02_to_vmcs12_rare()
before the vmcs12 contents are copied to userspace.
Fixes: 7952d769c29ca ("KVM: nVMX: Sync rarely accessed guest fields only when needed") Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210114205449.8715-2-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Lorenzo Brescia [Wed, 23 Dec 2020 14:45:07 +0000 (14:45 +0000)]
kvm: tracing: Fix unmatched kvm_entry and kvm_exit events
On VMX, if we exit and then re-enter immediately without leaving
the vmx_vcpu_run() function, the kvm_entry event is not logged.
That means we will see one (or more) kvm_exit, without its (their)
corresponding kvm_entry, as shown here:
It also seems possible for a kvm_entry event to be logged, but then
we leave vmx_vcpu_run() right away (if vmx->emulation_required is
true). In this case, we will have a spurious kvm_entry event in the
trace.
Fix these situations by moving trace_kvm_entry() inside vmx_vcpu_run()
(where trace_kvm_exit() already is).
A trace obtained with this patch applied looks like this:
call kvm_vcpu_ioctl_smi() and
kvm_make_request(KVM_REQ_SMI, vcpu);
Step2:
kvm_vcpu_ioctl(cpu, KVM_RUN, 0)
call process_smi() if
kvm_check_request(KVM_REQ_SMI, vcpu) is
true, mark vcpu->arch.smi_pending = true;
The vcpu->arch.smi_pending will be set true in step2, unfortunately if
vcpu paused between step1 and step2, the kvm_run->immediate_exit will be
set and vcpu has to exit to Qemu immediately during step2 before mark
vcpu->arch.smi_pending true.
During VM migration, Qemu will get the smi pending status from KVM using
KVM_GET_VCPU_EVENTS ioctl at the downtime, then the smi pending status
will be lost.
Signed-off-by: Jay Zhou <jianjay.zhou@huawei.com> Signed-off-by: Shengen Zhuang <zhuangshengen@huawei.com>
Message-Id: <20210118084720.1585-1-jianjay.zhou@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like Xu [Mon, 18 Jan 2021 02:58:00 +0000 (10:58 +0800)]
KVM: x86/pmu: Fix UBSAN shift-out-of-bounds warning in intel_pmu_refresh()
Since we know vPMU will not work properly when (1) the guest bit_width(s)
of the [gp|fixed] counters are greater than the host ones, or (2) guest
requested architectural events exceeds the range supported by the host, so
we can setup a smaller left shift value and refresh the guest cpuid entry,
thus fixing the following UBSAN shift-out-of-bounds warning:
shift exponent 197 is too large for 64-bit type 'long long unsigned int'
KVM: x86: Add more protection against undefined behavior in rsvd_bits()
Add compile-time asserts in rsvd_bits() to guard against KVM passing in
garbage hardcoded values, and cap the upper bound at '63' for dynamic
values to prevent generating a mask that would overflow a u64.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210113204515.3473079-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 25 Jan 2021 23:52:01 +0000 (18:52 -0500)]
Merge tag 'kvmarm-fixes-5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 5.11, take #2
- Don't allow tagged pointers to point to memslots
- Filter out ARMv8.1+ PMU events on v8.0 hardware
- Hide PMU registers from userspace when no PMU is configured
- More PMU cleanups
- Don't try to handle broken PSCI firmware
- More sys_reg() to reg_to_encoding() conversions
Johannes Berg [Mon, 25 Jan 2021 09:16:15 +0000 (10:16 +0100)]
fs/pipe: allow sendfile() to pipe again
After commit 36e2c7421f02 ("fs: don't allow splice read/write
without explicit ops") sendfile() could no longer send data
from a real file to a pipe, breaking for example certain cgit
setups (e.g. when running behind fcgiwrap), because in this
case cgit will try to do exactly this: sendfile() to a pipe.
Fix this by using iter_file_splice_write for the splice_write
method of pipes, as suggested by Christoph.
Cc: stable@vger.kernel.org Fixes: 36e2c7421f02 ("fs: don't allow splice read/write without explicit ops") Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sami Tolvanen [Mon, 25 Jan 2021 19:09:25 +0000 (11:09 -0800)]
Commit 9bb48c82aced ("tty: implement write_iter") converted the tty
layer to use write_iter. Fix the redirected_tty_write declaration
also in n_tty and change the comparisons to use write_iter instead of
write.
[ Also moved the declaration of redirected_tty_write() to the proper
location in a header file. The reason for the bug was the bogus extern
declaration in n_tty.c silently not matching the changed definition in
tty_io.c, and because it wasn't in a shared header file, there was no
cross-checking of the declaration.
Sami noticed because Clang's Control Flow Integrity checking ended up
incidentally noticing the inconsistent declaration. - Linus ]
Linus Torvalds [Mon, 25 Jan 2021 18:19:40 +0000 (10:19 -0800)]
Merge tag 'printk-for-5.11-urgent-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk fix from Petr Mladek:
"The fix of a potential buffer overflow in 5.11-rc5 introduced another
one. The trailing '\0' might be written up to the message "len" past
the buffer. Fortunately, it is not that easy to hit.
Most readers use 1kB buffers for a single message. Typical messages
fit into the temporary buffer with enough reserve.
Also readers do not rely on the '\0'. It is related to the previous
fix. Some readers required the space for the trailing '\0'. We decided
to write it there to avoid such regressions in the future.
The most realistic victims are dumpers using kmsg_dump_get_buffer().
They are filling the entire buffer with as many messages as possible.
They are typically used when handling panic()"
* tag 'printk-for-5.11-urgent-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk: fix string termination for record_print_text()
John Ogness [Sun, 24 Jan 2021 20:27:28 +0000 (21:33 +0106)]
printk: fix string termination for record_print_text()
Commit f0e386ee0c0b ("printk: fix buffer overflow potential for
print_text()") added string termination in record_print_text().
However it used the wrong base pointer for adding the terminator.
This led to a 0-byte being written somewhere beyond the buffer.
Use the correct base pointer when adding the terminator.
Fixes: f0e386ee0c0b ("printk: fix buffer overflow potential for print_text()") Reported-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210124202728.4718-1-john.ogness@linutronix.de
Linus Torvalds [Sun, 24 Jan 2021 20:30:14 +0000 (12:30 -0800)]
Merge tag 'io_uring-5.11-2021-01-24' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Still need a final cancelation fix that isn't quite done done,
expected in the next day or two. That said, this contains:
- Wakeup fix for IOPOLL requests
- SQPOLL split close op handling fix
- Ensure that any use of io_uring fd itself is marked as inflight
- Short non-regular file read fix (Pavel)
- Fix up bad false positive warning (Pavel)
- SQPOLL fixes (Pavel)
- In-flight removal fix (Pavel)"
* tag 'io_uring-5.11-2021-01-24' of git://git.kernel.dk/linux-block:
io_uring: account io_uring internal files as REQ_F_INFLIGHT
io_uring: fix sleeping under spin in __io_clean_op
io_uring: fix short read retries for non-reg files
io_uring: fix SQPOLL IORING_OP_CLOSE cancelation state
io_uring: fix skipping disabling sqo on exec
io_uring: fix uring_flush in exit_files() warning
io_uring: fix false positive sqo warning on flush
io_uring: iopoll requests should also wake task ->in_idle state
Linus Torvalds [Sun, 24 Jan 2021 20:24:35 +0000 (12:24 -0800)]
Merge tag 'block-5.11-2021-01-24' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- NVMe pull request from Christoph:
- fix a status code in nvmet (Chaitanya Kulkarni)
- avoid double completions in nvme-rdma/nvme-tcp (Chao Leng)
- fix the CMB support to cope with NVMe 1.4 controllers (Klaus Jensen)
- fix PRINFO handling in the passthrough ioctl (Revanth Rajashekar)
- fix a double DMA unmap in nvme-pci
* tag 'block-5.11-2021-01-24' of git://git.kernel.dk/linux-block:
lightnvm: fix memory leak when submit fails
nvme-pci: fix error unwind in nvme_map_data
nvme-pci: refactor nvme_unmap_data
md: Set prev_flush_start and flush_bio in an atomic way
nvmet: set right status on error in id-ns handler
nvme-pci: allow use of cmb on v1.4 controllers
nvme-tcp: avoid request double completion for concurrent nvme_tcp_timeout
nvme-rdma: avoid request double completion for concurrent nvme_rdma_timeout
nvme: check the PRINFO bit before deciding the host buffer length
Linus Torvalds [Sun, 24 Jan 2021 20:16:34 +0000 (12:16 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"18 patches.
Subsystems affected by this patch series: mm (pagealloc, memcg, kasan,
memory-failure, and highmem), ubsan, proc, and MAINTAINERS"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
MAINTAINERS: add a couple more files to the Clang/LLVM section
proc_sysctl: fix oops caused by incorrect command parameters
powerpc/mm/highmem: use __set_pte_at() for kmap_local()
mips/mm/highmem: use set_pte() for kmap_local()
mm/highmem: prepare for overriding set_pte_at()
sparc/mm/highmem: flush cache and TLB
mm: fix page reference leak in soft_offline_page()
ubsan: disable unsigned-overflow check for i386
kasan, mm: fix resetting page_alloc tags for HW_TAGS
kasan, mm: fix conflicts with init_on_alloc/free
kasan: fix HW_TAGS boot parameters
kasan: fix incorrect arguments passing in kasan_add_zero_shadow
kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow
mm: fix numa stats for thp migration
mm: memcg: fix memcg file_dirty numa stat
mm: memcg/slab: optimize objcg stock draining
mm: fix initialization of struct page for holes in memory layout
x86/setup: don't remove E820_TYPE_RAM for pfn 0
Linus Torvalds [Sun, 24 Jan 2021 19:26:46 +0000 (11:26 -0800)]
Merge tag 'char-misc-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are some small char/misc driver fixes for 5.11-rc5:
- habanalabs driver fixes
- phy driver fixes
- hwtracing driver fixes
- rtsx cardreader driver fix
All of these have been in linux-next with no reported issues"
* tag 'char-misc-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
misc: rtsx: init value of aspm_enabled
habanalabs: disable FW events on device removal
habanalabs: fix backward compatibility of idle check
habanalabs: zero pci counters packet before submit to FW
intel_th: pci: Add Alder Lake-P support
stm class: Fix module init return on allocation failure
habanalabs: prevent soft lockup during unmap
habanalabs: fix reset process in case of failures
habanalabs: fix dma_addr passed to dma_mmap_coherent
phy: mediatek: allow compile-testing the dsi phy
phy: cpcap-usb: Fix warning for missing regulator_disable
PHY: Ingenic: fix unconditional build of phy-ingenic-usb
Linus Torvalds [Sun, 24 Jan 2021 19:02:01 +0000 (11:02 -0800)]
Merge tag 'staging-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging/IIO driver fixes from Greg KH:
"Here are some IIO driver fixes for 5.11-rc5 to resolve some reported
problems.
Nothing major, just a few small fixes, all of these have been in
linux-next for a while and full details are in the shortlog"
* tag 'staging-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
iio: sx9310: Fix semtech,avg-pos-strength setting when > 16
iio: common: st_sensors: fix possible infinite loop in st_sensors_irq_thread
iio: ad5504: Fix setting power-down state
counter:ti-eqep: remove floor
drivers: iio: temperature: Add delay after the addressed reset command in mlx90632.c
iio: adc: ti_am335x_adc: remove omitted iio_kfifo_free()
dt-bindings: iio: accel: bma255: Fix bmc150/bmi055 compatible
iio: sx9310: Off by one in sx9310_read_thresh()
Linus Torvalds [Sun, 24 Jan 2021 18:56:45 +0000 (10:56 -0800)]
Merge tag 'tty-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here are three small tty/serial fixes for 5.11-rc5 to resolve reported
problems:
- two patches to fix up writing to ttys with splice
- mvebu-uart driver fix for reported problem
All of these have been in linux-next with no reported problems"
* tag 'tty-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
tty: fix up hung_up_tty_write() conversion
tty: implement write_iter
serial: mvebu-uart: fix tx lost characters at power off
Linus Torvalds [Sun, 24 Jan 2021 18:54:54 +0000 (10:54 -0800)]
Merge tag 'usb-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small USB driver fixes for 5.11-rc5. They resolve:
- xhci issues for some reported problems
- ehci driver issue for one specific device
- USB gadget fixes for some reported problems
- cdns3 driver fixes for issues reported
- MAINTAINERS file update
- thunderbolt minor fix
All of these have been in linux-next with no reported issues"
* tag 'usb-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: bdc: Make bdc pci driver depend on BROKEN
xhci: tegra: Delay for disabling LFPS detector
xhci: make sure TRB is fully written before giving it to the controller
usb: udc: core: Use lock when write to soft_connect
USB: gadget: dummy-hcd: Fix errors in port-reset handling
usb: gadget: aspeed: fix stop dma register setting.
USB: ehci: fix an interrupt calltrace error
ehci: fix EHCI host controller initialization sequence
MAINTAINERS: update Peter Chen's email address
thunderbolt: Drop duplicated 0x prefix from format string
MAINTAINERS: Update address for Cadence USB3 driver
usb: cdns3: imx: improve driver .remove API
usb: cdns3: imx: fix can't create core device the second time issue
usb: cdns3: imx: fix writing read-only memory issue
Xiaoming Ni [Sun, 24 Jan 2021 05:02:16 +0000 (21:02 -0800)]
proc_sysctl: fix oops caused by incorrect command parameters
The process_sysctl_arg() does not check whether val is empty before
invoking strlen(val). If the command line parameter () is incorrectly
configured and val is empty, oops is triggered.
For example:
"hung_task_panic=1" is incorrectly written as "hung_task_panic", oops is
triggered. The call stack is as follows:
Kernel command line: .... hung_task_panic
......
Call trace:
__pi_strlen+0x10/0x98
parse_args+0x278/0x344
do_sysctl_args+0x8c/0xfc
kernel_init+0x5c/0xf4
ret_from_fork+0x10/0x30
To fix it, check whether "val" is empty when "phram" is a sysctl field.
Error codes are returned in the failure branch, and error logs are
generated by parse_args().
Link: https://lkml.kernel.org/r/20210118133029.28580-1-nixiaoming@huawei.com Fixes: 3db978d480e2843 ("kernel/sysctl: support setting sysctl parameters from kernel command line") Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Heiner Kallweit <hkallweit1@gmail.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: <stable@vger.kernel.org> [5.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thomas Gleixner [Sun, 24 Jan 2021 05:02:11 +0000 (21:02 -0800)]
powerpc/mm/highmem: use __set_pte_at() for kmap_local()
The original PowerPC highmem mapping function used __set_pte_at() to
denote that the mapping is per CPU. This got lost with the conversion
to the generic implementation.
Override the default map function.
Link: https://lkml.kernel.org/r/20210112170411.281464308@linutronix.de Fixes: 47da42b27a56 ("powerpc/mm/highmem: Switch to generic kmap atomic") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Paul Cercueil <paul@crapouillou.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thomas Gleixner [Sun, 24 Jan 2021 05:02:02 +0000 (21:02 -0800)]
mm/highmem: prepare for overriding set_pte_at()
The generic kmap_local() map function uses set_pte_at(), but MIPS requires
set_pte() and PowerPC wants __set_pte_at().
Provide arch_kmap_local_set_pte() and default it to set_pte_at().
Link: https://lkml.kernel.org/r/20210112170411.056306194@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Cercueil <paul@crapouillou.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Williams [Sun, 24 Jan 2021 05:01:52 +0000 (21:01 -0800)]
mm: fix page reference leak in soft_offline_page()
The conversion to move pfn_to_online_page() internal to
soft_offline_page() missed that the get_user_pages() reference taken by
the madvise() path needs to be dropped when pfn_to_online_page() fails.
Note the direct sysfs-path to soft_offline_page() does not perform a
get_user_pages() lookup.
When soft_offline_page() is handed a pfn_valid() && !pfn_to_online_page()
pfn the kernel hangs at dax-device shutdown due to a leaked reference.
Link: https://lkml.kernel.org/r/161058501210.1840162.8108917599181157327.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Qian Cai <cai@lca.pw> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Konovalov [Sun, 24 Jan 2021 05:01:43 +0000 (21:01 -0800)]
kasan, mm: fix resetting page_alloc tags for HW_TAGS
A previous commit added resetting KASAN page tags to
kernel_init_free_pages() to avoid false-positives due to accesses to
metadata with the hardware tag-based mode.
That commit did reset page tags before the metadata access, but didn't
restore them after. As the result, KASAN fails to detect bad accesses
to page_alloc allocations on some configurations.
Fix this by recovering the tag after the metadata access.
Link: https://lkml.kernel.org/r/02b5bcd692e912c27d484030f666b350ad7e4ae4.1611074450.git.andreyknvl@google.com Fixes: aa1ef4d7b3f6 ("kasan, mm: reset tags when accessing metadata") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Konovalov [Sun, 24 Jan 2021 05:01:38 +0000 (21:01 -0800)]
kasan, mm: fix conflicts with init_on_alloc/free
A few places where SLUB accesses object's data or metadata were missed
in a previous patch. This leads to false positives with hardware
tag-based KASAN when bulk allocations are used with init_on_alloc/free.
Fix the false-positives by resetting pointer tags during these accesses.
(The kasan_reset_tag call is removed from slab_alloc_node, as it's added
into maybe_wipe_obj_freeptr.)
Andrey Konovalov [Sun, 24 Jan 2021 05:01:34 +0000 (21:01 -0800)]
kasan: fix HW_TAGS boot parameters
The initially proposed KASAN command line parameters are redundant.
This change drops the complex "kasan.mode=off/prod/full" parameter and
adds a simpler kill switch "kasan=off/on" instead. The new parameter
together with the already existing ones provides a cleaner way to
express the same set of features.
The full set of parameters with this change:
kasan=off/on - whether KASAN is enabled
kasan.fault=report/panic - whether to only print a report or also panic
kasan.stacktrace=off/on - whether to collect alloc/free stack traces
Lecopzer Chen [Sun, 24 Jan 2021 05:01:25 +0000 (21:01 -0800)]
kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow
During testing kasan_populate_early_shadow and kasan_remove_zero_shadow,
if the shadow start and end address in kasan_remove_zero_shadow() is not
aligned to PMD_SIZE, the remain unaligned PTE won't be removed.
0xffffffbf80000000 ~ 0xffffffbfbdf80000 will not be removed because in
kasan_remove_pud_table(), kasan_pmd_table(*pud) is true but the next
address is 0xffffffbfbdf80000 which is not aligned to PUD_SIZE.
In the correct condition, this should fallback to the next level
kasan_remove_pmd_table() but the condition flow always continue to skip
the unaligned part.
Fix by correcting the condition when next and addr are neither aligned.
Link: https://lkml.kernel.org/r/20210103135621.83129-1-lecopzer@gmail.com Fixes: 0207df4fa1a86 ("kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN") Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: YJ Chiang <yj.chiang@mediatek.com> Cc: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 24 Jan 2021 18:17:03 +0000 (10:17 -0800)]
Merge tag 'objtool_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Borislav Petkov:
- Adjust objtool to handle a recent binutils change to not generate
unused symbols anymore.
- Revert the fail-the-build-on-fatal-errors objtool strategy for now
due to the ever-increasing matrix of supported toolchains/plugins and
them causing too many such fatal errors currently.
- Do not add empty symbols to objdump's rbtree to accommodate clang
removing section symbols.
* tag 'objtool_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Don't fail on missing symbol table
objtool: Don't fail the kernel build on fatal errors
objtool: Don't add empty symbols to the rbtree
Linus Torvalds [Sun, 24 Jan 2021 18:09:20 +0000 (10:09 -0800)]
Merge tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Correct the marking of kthreads which are supposed to run on a
specific, single CPU vs such which are affine to only one CPU, mark
per-cpu workqueue threads as such and make sure that marking
"survives" CPU hotplug. Fix CPU hotplug issues with such kthreads.
- A fix to not push away tasks on CPUs coming online.
- Have workqueue CPU hotplug code use cpu_possible_mask when breaking
affinity on CPU offlining so that pending workers can finish on newly
arrived onlined CPUs too.
- Dump tasks which haven't vacated a CPU which is currently being
unplugged.
- Register a special scale invariance callback which gets called on
resume from RAM to read out APERF/MPERF after resume and thus make
the schedutil scaling governor more precise.
* tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Relax the set_cpus_allowed_ptr() semantics
sched: Fix CPU hotplug / tighten is_per_cpu_kthread()
sched: Prepare to use balance_push in ttwu()
workqueue: Restrict affinity change to rescuer
workqueue: Tag bound workers with KTHREAD_IS_PER_CPU
kthread: Extract KTHREAD_IS_PER_CPU
sched: Don't run cpu-online with balance_push() enabled
workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinity
sched/core: Print out straggler tasks in sched_cpu_dying()
x86: PM: Register syscore_ops for scale invariance
Linus Torvalds [Sun, 24 Jan 2021 17:46:05 +0000 (09:46 -0800)]
Merge tag 'x86_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Add a new Intel model number for Alder Lake
- Differentiate which aspects of the FPU state get saved/restored when
the FPU is used in-kernel and fix a boot crash on K7 due to early
MXCSR access before CR4.OSFXSR is even set.
- A couple of noinstr annotation fixes
- Correct die ID setting on AMD for users of topology information which
need the correct die ID
- A SEV-ES fix to handle string port IO to/from kernel memory properly
* tag 'x86_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Add another Alder Lake CPU to the Intel family
x86/mmx: Use KFPU_387 for MMX string operations
x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
x86/topology: Make __max_die_per_package available unconditionally
x86: __always_inline __{rd,wr}msr()
x86/mce: Remove explicit/superfluous tracing
locking/lockdep: Avoid noinstr warning for DEBUG_LOCKDEP
locking/lockdep: Cure noinstr fail
x86/sev: Fix nonistr violation
x86/entry: Fix noinstr fail
x86/cpu/amd: Set __max_die_per_package on AMD
x86/sev-es: Handle string port IO to kernel memory properly
Linus Torvalds [Sun, 24 Jan 2021 17:40:51 +0000 (09:40 -0800)]
Merge tag 'powerpc-5.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix a bad interaction between the scv handling and the fallback L1D
flush, which could lead to user register corruption. Only affects
people using scv (~no one) on machines with old firmware that are
missing the L1D flush.
- Two small selftest fixes.
Thanks to Eirik Fuller, Libor Pechacek, Nicholas Piggin, Sandipan Das,
and Tulio Magno Quites Machado Filho.
* tag 'powerpc-5.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: fix scv entry fallback flush vs interrupt
selftests/powerpc: Only test lwm/stmw on big endian
selftests/powerpc: Fix exit status of pkey tests
Linus Torvalds [Sun, 24 Jan 2021 17:35:28 +0000 (09:35 -0800)]
Merge tag 'for-linus-2021-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull misc fixes from Christian Brauner:
- Jann reported sparse complaints because of a missing __user
annotation in a helper we added way back when we added
pidfd_send_signal() to avoid compat syscall handling. Fix it.
- Yanfei replaces a reference in a comment to the _do_fork() helper I
removed a while ago with a reference to the new kernel_clone()
replacement
- Alexander Guril added a simple coding style fix
* tag 'for-linus-2021-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
kthread: remove comments about old _do_fork() helper
Kernel: fork.c: Fix coding style: Do not use {} around single-line statements
signal: Add missing __user annotation to copy_siginfo_from_user_any
Linus Torvalds [Sun, 24 Jan 2021 17:27:14 +0000 (09:27 -0800)]
Merge tag '5.11-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"An important signal handling patch for stable, and two small cleanup
patches"
* tag '5.11-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6:
cifs: do not fail __smb_send_rqst if non-fatal signals are pending
fs/cifs: Simplify bool comparison.
fs/cifs: Assign boolean values to a bool variable
Shakeel Butt [Sun, 24 Jan 2021 05:01:15 +0000 (21:01 -0800)]
mm: fix numa stats for thp migration
Currently the kernel is not correctly updating the numa stats for
NR_FILE_PAGES and NR_SHMEM on THP migration. Fix that.
For NR_FILE_DIRTY and NR_ZONE_WRITE_PENDING, although at the moment
there is no need to handle THP migration as kernel still does not have
write support for file THP but to be more future proof, this patch adds
the THP support for those stats as well.
Link: https://lkml.kernel.org/r/20210108155813.2914586-2-shakeelb@google.com Fixes: e71769ae52609 ("mm: enable thp migration for shmem thp") Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shakeel Butt [Sun, 24 Jan 2021 05:01:11 +0000 (21:01 -0800)]
mm: memcg: fix memcg file_dirty numa stat
The kernel updates the per-node NR_FILE_DIRTY stats on page migration
but not the memcg numa stats.
That was not an issue until recently the commit 5f9a4f4a7096 ("mm:
memcontrol: add the missing numa_stat interface for cgroup v2") exposed
numa stats for the memcg.
So fix the file_dirty per-memcg numa stat.
Link: https://lkml.kernel.org/r/20210108155813.2914586-1-shakeelb@google.com Fixes: 5f9a4f4a7096 ("mm: memcontrol: add the missing numa_stat interface for cgroup v2") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman Gushchin [Sun, 24 Jan 2021 05:01:07 +0000 (21:01 -0800)]
mm: memcg/slab: optimize objcg stock draining
Imran Khan reported a 16% regression in hackbench results caused by the
commit f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects
instead of pages"). The regression is noticeable in the case of a
consequent allocation of several relatively large slab objects, e.g.
skb's. As soon as the amount of stocked bytes exceeds PAGE_SIZE,
drain_obj_stock() and __memcg_kmem_uncharge() are called, and it leads
to a number of atomic operations in page_counter_uncharge().
The corresponding call graph is below (provided by Imran Khan):
Instead of directly uncharging the accounted kernel memory, it's
possible to refill the generic page-sized per-cpu stock instead. It's a
much faster operation, especially on a default hierarchy. As a bonus,
__memcg_kmem_uncharge_page() will also get faster, so the freeing of
page-sized kernel allocations (e.g. large kmallocs) will become faster.
A similar change has been done earlier for the socket memory by the
commit 475d0487a2ad ("mm: memcontrol: use per-cpu stocks for socket
memory uncharging").
Link: https://lkml.kernel.org/r/20210106042239.2860107-1-guro@fb.com Fixes: f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects instead of pages") Signed-off-by: Roman Gushchin <guro@fb.com> Reported-by: Imran Khan <imran.f.khan@oracle.com> Tested-by: Imran Khan <imran.f.khan@oracle.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Michal Koutn <mkoutny@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Sun, 24 Jan 2021 05:01:02 +0000 (21:01 -0800)]
mm: fix initialization of struct page for holes in memory layout
There could be struct pages that are not backed by actual physical
memory. This can happen when the actual memory bank is not a multiple
of SECTION_SIZE or when an architecture does not register memory holes
reserved by the firmware as memblock.memory.
Such pages are currently initialized using init_unavailable_mem()
function that iterates through PFNs in holes in memblock.memory and if
there is a struct page corresponding to a PFN, the fields if this page
are set to default values and the page is marked as Reserved.
init_unavailable_mem() does not take into account zone and node the page
belongs to and sets both zone and node links in struct page to zero.
On a system that has firmware reserved holes in a zone above ZONE_DMA,
for instance in a configuration below:
because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link
in struct page) in the same pageblock.
Update init_unavailable_mem() to use zone constraints defined by an
architecture to properly setup the zone link and use node ID of the
adjacent range in memblock.memory to set the node link.
Link: https://lkml.kernel.org/r/20210111194017.22696-3-rppt@kernel.org Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN") Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qian Cai <cai@lca.pw> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Sun, 24 Jan 2021 05:00:57 +0000 (21:00 -0800)]
x86/setup: don't remove E820_TYPE_RAM for pfn 0
Patch series "mm: fix initialization of struct page for holes in memory layout", v3.
Commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions
rather that check each PFN") exposed several issues with the memory map
initialization and these patches fix those issues.
Initially there were crashes during compaction that Qian Cai reported
back in April [1]. It seemed back then that the problem was fixed, but
a few weeks ago Andrea Arcangeli hit the same bug [2] and there was an
additional discussion at [3].
The first 4Kb of memory is a BIOS owned area and to avoid its allocation
for the kernel it was not listed in e820 tables as memory. As the result,
pfn 0 was never recognised by the generic memory management and it is not
a part of neither node 0 nor ZONE_DMA.
If set_pfnblock_flags_mask() would be ever called for the pageblock
corresponding to the first 2Mbytes of memory, having pfn 0 outside of
ZONE_DMA would trigger
Along with reserving the first 4Kb in e820 tables, several first pages are
reserved with memblock in several places during setup_arch(). These
reservations are enough to ensure the kernel does not touch the BIOS area
and it is not necessary to remove E820_TYPE_RAM for pfn 0.
Remove the update of e820 table that changes the type of pfn 0 and move
the comment describing why it was done to trim_low_memory_range() that
reserves the beginning of the memory.
Link: https://lkml.kernel.org/r/20210111194017.22696-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qian Cai <cai@lca.pw> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jens Axboe [Sat, 23 Jan 2021 22:49:31 +0000 (15:49 -0700)]
io_uring: account io_uring internal files as REQ_F_INFLIGHT
We need to actively cancel anything that introduces a potential circular
loop, where io_uring holds a reference to itself. If the file in question
is an io_uring file, then add the request to the inflight list.
Instead of cleaning files on overflow, return back overflow cancellation
into io_uring_cancel_files(). Previously it was racy to clean
REQ_F_OVERFLOW flag, but we got rid of it, and can do it through
repetitive attempts targeting all matching requests.
Reported-by: Abaci <abaci@linux.alibaba.com> Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sat, 23 Jan 2021 20:02:58 +0000 (12:02 -0800)]
Merge branch 'mtd/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
Pull mtd fixes from Miquel Raynal.
* 'mtd/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
mtd: rawnand: omap: Use BCH private fields in the specific OOB layout
mtd: spinand: Fix MTD_OPS_AUTO_OOB requests
mtd: rawnand: intel: check the mtd name only after setting the variable
mtd: rawnand: nandsim: Fix the logic when selecting Hamming soft ECC engine
mtd: rawnand: gpmi: fix dst bit offset when extracting raw payload
Linus Torvalds [Sat, 23 Jan 2021 19:43:02 +0000 (11:43 -0800)]
Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Another bunch of driver fixes"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: sprd: depend on COMMON_CLK to fix compile tests
Revert "i2c: imx: Remove unused .id_table support"
i2c: octeon: check correct size of maximum RECV_LEN packet
i2c: tegra: Create i2c_writesl_vi() to use with VI I2C for filling TX FIFO
i2c: bpmp-tegra: Ignore unknown I2C_M flags
i2c: tegra: Wait for config load atomically while in ISR
Linus Torvalds [Sat, 23 Jan 2021 19:35:02 +0000 (11:35 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Twelve minor fixes, all in drivers or doc.
Most of the fixes are pretty obvious (although we had two goes to get
the UFS sysfs doc right) and the biggest change is in the ufs driver
which they've extensively tested"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ibmvfc: Set default timeout to avoid crash during migration
scsi: target: tcmu: Fix use-after-free of se_cmd->priv
scsi: fnic: Fix memleak in vnic_dev_init_devcmd2
scsi: libfc: Avoid invoking response handler twice if ep is already completed
scsi: scsi_transport_srp: Don't block target in failfast state
scsi: docs: ABI: sysfs-driver-ufs: Rectify table formatting
scsi: ufs: Fix tm request when non-fatal error happens
scsi: ufs: Fix livelock of ufshcd_clear_ua_wluns()
scsi: ibmvfc: Fix missing cast of ibmvfc_event pointer to u64 handle
scsi: ufs: ufshcd-pltfrm depends on HAS_IOMEM
scsi: megaraid_sas: Fix MEGASAS_IOC_FIRMWARE regression
scsi: docs: ABI: sysfs-driver-ufs: Add DeepSleep power mode
Linus Torvalds [Sat, 23 Jan 2021 19:25:33 +0000 (11:25 -0800)]
Merge tag 'linux-kselftest-kunit-fixes-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kunit fixes from Shuah :
"Five fixes to the kunit tool and documentation from Daniel Latypov and
David Gow"
* tag 'linux-kselftest-kunit-fixes-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: tool: move kunitconfig parsing into __init__, make it optional
kunit: tool: fix minor typing issue with None status
kunit: tool: surface and address more typing issues
Documentation: kunit: include example of a parameterized test
kunit: tool: Fix spelling of "diagnostic" in kunit_parser