Linus Torvalds [Tue, 4 May 2021 17:52:09 +0000 (10:52 -0700)]
Merge tag 'dma-mapping-5.13' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- add a new dma_alloc_noncontiguous API (me, Ricardo Ribalda)
- fix a copyright notice (Hao Fang)
- add an unlikely annotation to dma_mapping_error (Heiner Kallweit)
- remove a pointless empty line (Wang Qing)
- add support for multi-pages map/unmap bencharking (Xiang Chen)
* tag 'dma-mapping-5.13' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: add unlikely hint to error path in dma_mapping_error
dma-mapping: benchmark: Add support for multi-pages map/unmap
dma-mapping: benchmark: use the correct HiSilicon copyright
dma-mapping: remove a pointless empty line in dma_alloc_coherent
media: uvcvideo: Use dma_alloc_noncontiguous API
dma-iommu: implement ->alloc_noncontiguous
dma-iommu: refactor iommu_dma_alloc_remap
dma-mapping: add a dma_alloc_noncontiguous API
dma-mapping: refactor dma_{alloc,free}_pages
dma-mapping: add a dma_mmap_pages helper
Linus Torvalds [Tue, 4 May 2021 17:48:05 +0000 (10:48 -0700)]
Merge tag 'm68knommu-for-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu
Pull m68knommu updates from Greg Ungerer:
- a fix for interrupt number range checking for the ColdFire SIMR
interrupt controller.
- changes for the binfmt_flat binary loader to allow RISC-V nommu
support it needs to be able to accept flat binaries that have no gap
between the text and data sections.
* tag 'm68knommu-for-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
m68k: coldfire: fix irq ranges
riscv: Disable data start offset in flat binaries
binfmt_flat: allow not offsetting data start
Linus Torvalds [Mon, 3 May 2021 20:47:17 +0000 (13:47 -0700)]
Merge tag 'for-5.13/parisc' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc architecture updates from Helge Deller:
- switch to generic syscall header scripts
- minor typo fix in setup.c
* tag 'for-5.13/parisc' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix typo in setup.c
parisc: syscalls: switch to generic syscallhdr.sh
parisc: syscalls: switch to generic syscalltbl.sh
Linus Torvalds [Mon, 3 May 2021 19:23:03 +0000 (12:23 -0700)]
Merge tag 'leds-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/pavel/linux-leds
Pull LED updates from Pavel Machek:
"Nothing too exciting here, just some fixes"
* tag 'leds-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/pavel/linux-leds:
leds: pca9532: Assign gpio base dynamically
leds: trigger: pattern: Switch to using the new API kobj_to_dev()
leds: LEDS_BLINK_LGM should depend on X86
leds: lgm: Fix spelling mistake "prepate" -> "prepare"
MAINTAINERS: Remove Dan Murphy's bouncing email
leds-lm3642: convert comma to semicolon
leds: rt4505: Add support for Richtek RT4505 flash LED controller
leds: rt4505: Add DT binding document for Richtek RT4505
leds: Kconfig: LEDS_CLASS is usually selected.
leds: lgm: Improve Kconfig help
leds: lgm: fix gpiolib dependency
Linus Torvalds [Mon, 3 May 2021 18:19:54 +0000 (11:19 -0700)]
Merge tag 'trace-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"New feature:
- A new "func-no-repeats" option in tracefs/options directory.
When set the function tracer will detect if the current function
being traced is the same as the previous one, and instead of
recording it, it will keep track of the number of times that the
function is repeated in a row. And when another function is
recorded, it will write a new event that shows the function that
repeated, the number of times it repeated and the time stamp of
when the last repeated function occurred.
Enhancements:
- In order to implement the above "func-no-repeats" option, the ring
buffer timestamp can now give the accurate timestamp of the event
as it is being recorded, instead of having to record an absolute
timestamp for all events. This helps the histogram code which no
longer needs to waste ring buffer space.
- New validation logic to make sure all trace events that access
dereferenced pointers do so in a safe way, and will warn otherwise.
Fixes:
- No longer limit the PIDs of tasks that are recorded for
"saved_cmdlines" to PID_MAX_DEFAULT (32768), as systemd now allows
for a much larger range. This caused the mapping of PIDs to the
task names to be dropped for all tasks with a PID greater than
32768.
- Change trace_clock_global() to never block. This caused a deadlock.
Clean ups:
- Typos, prototype fixes, and removing of duplicate or unused code.
- Better management of ftrace_page allocations"
* tag 'trace-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (32 commits)
tracing: Restructure trace_clock_global() to never block
tracing: Map all PIDs to command lines
ftrace: Reuse the output of the function tracer for func_repeats
tracing: Add "func_no_repeats" option for function tracing
tracing: Unify the logic for function tracing options
tracing: Add method for recording "func_repeats" events
tracing: Add "last_func_repeats" to struct trace_array
tracing: Define new ftrace event "func_repeats"
tracing: Define static void trace_print_time()
ftrace: Simplify the calculation of page number for ftrace_page->records some more
ftrace: Store the order of pages allocated in ftrace_page
tracing: Remove unused argument from "ring_buffer_time_stamp()
tracing: Remove duplicate struct declaration in trace_events.h
tracing: Update create_system_filter() kernel-doc comment
tracing: A minor cleanup for create_system_filter()
kernel: trace: Mundane typo fixes in the file trace_events_filter.c
tracing: Fix various typos in comments
scripts/recordmcount.pl: Make vim and emacs indent the same
scripts/recordmcount.pl: Make indent spacing consistent
tracing: Add a verifier to check string pointers for trace events
...
Linus Torvalds [Sun, 2 May 2021 21:13:46 +0000 (14:13 -0700)]
Merge tag 'for-linus-5.13-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
"orangefs: implement orangefs_readahead
mm/readahead.c/read_pages was quite a bit different back when I put my
open-coded readahead logic into orangefs_readpage. That logic seemed
to work as designed back then, it is a trainwreck now.
This implements orangefs_readahead using the new xarray and
readahead_expand features and removes all my open-coded readahead
logic.
This results in an extreme read performance improvement, these sample
numbers are from my test VM:
Here's an example of what's upstream in
5.11.8-200.fc33.x86_64:
30+0 records in
30+0 records out 125829120 bytes (126 MB, 120 MiB) copied, 5.77943 s, 21.8 MB/s
And here's this version of orangefs_readahead on top of 5.12.0-rc4:
30+0 records in
30+0 records out 125829120 bytes (126 MB, 120 MiB) copied, 0.325919 s, 386 MB/s
There are four xfstest regressions with this patch. David Howells and
Matthew Wilcox have been helping me work with this code"
* tag 'for-linus-5.13-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: leave files in the page cache for a few micro seconds at least
Orangef: implement orangefs_readahead.
Linus Torvalds [Sun, 2 May 2021 16:14:01 +0000 (09:14 -0700)]
Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
"Assorted stuff all over the place"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
useful constants: struct qstr for ".."
hostfs_open(): don't open-code file_dentry()
whack-a-mole: kill strlen_user() (again)
autofs: should_expire() argument is guaranteed to be positive
apparmor:match_mn() - constify devpath argument
buffer: a small optimization in grow_buffers
get rid of autofs_getpath()
constify dentry argument of dentry_path()/dentry_path_raw()
Linus Torvalds [Sun, 2 May 2021 16:05:54 +0000 (09:05 -0700)]
Merge branch 'work.ecryptfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull exryptfs updates from Al Viro:
"The interesting part here is (ecryptfs) lock_parent() fixes - its
treatment of ->d_parent had been very wrong.
The rest is trivial cleanups"
* 'work.ecryptfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ecryptfs: ecryptfs_dentry_info->crypt_stat is never used
ecryptfs: get rid of unused accessors
ecryptfs: saner API for lock_parent()
ecryptfs: get rid of pointless dget/dput in ->symlink() and ->link()
Linus Torvalds [Sun, 2 May 2021 01:50:44 +0000 (18:50 -0700)]
Merge tag 'landlock_v34' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull Landlock LSM from James Morris:
"Add Landlock, a new LSM from Mickaël Salaün.
Briefly, Landlock provides for unprivileged application sandboxing.
From Mickaël's cover letter:
"The goal of Landlock is to enable to restrict ambient rights (e.g.
global filesystem access) for a set of processes. Because Landlock
is a stackable LSM [1], it makes possible to create safe security
sandboxes as new security layers in addition to the existing
system-wide access-controls. This kind of sandbox is expected to
help mitigate the security impact of bugs or unexpected/malicious
behaviors in user-space applications. Landlock empowers any
process, including unprivileged ones, to securely restrict
themselves.
Landlock is inspired by seccomp-bpf but instead of filtering
syscalls and their raw arguments, a Landlock rule can restrict the
use of kernel objects like file hierarchies, according to the
kernel semantic. Landlock also takes inspiration from other OS
sandbox mechanisms: XNU Sandbox, FreeBSD Capsicum or OpenBSD
Pledge/Unveil.
In this current form, Landlock misses some access-control features.
This enables to minimize this patch series and ease review. This
series still addresses multiple use cases, especially with the
combined use of seccomp-bpf: applications with built-in sandboxing,
init systems, security sandbox tools and security-oriented APIs [2]"
This code has had extensive design discussion and review over several
years"
Link: https://lore.kernel.org/lkml/50db058a-7dde-441b-a7f9-f6837fe8b69f@schaufler-ca.com/ Link: https://lore.kernel.org/lkml/f646e1c7-33cf-333f-070c-0a40ad0468cd@digikod.net/
* tag 'landlock_v34' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
landlock: Enable user space to infer supported features
landlock: Add user and kernel documentation
samples/landlock: Add a sandbox manager example
selftests/landlock: Add user space tests
landlock: Add syscall implementations
arch: Wire up Landlock syscalls
fs,security: Add sb_delete hook
landlock: Support filesystem access-control
LSM: Infrastructure management of the superblock
landlock: Add ptrace restrictions
landlock: Set up the security framework and manage credentials
landlock: Add ruleset and domain management
landlock: Add object management
Linus Torvalds [Sat, 1 May 2021 22:32:18 +0000 (15:32 -0700)]
Merge tag 'integrity-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull IMA updates from Mimi Zohar:
"In addition to loading the kernel module signing key onto the builtin
keyring, load it onto the IMA keyring as well.
Also six trivial changes and bug fixes"
* tag 'integrity-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
ima: ensure IMA_APPRAISE_MODSIG has necessary dependencies
ima: Fix fall-through warnings for Clang
integrity: Add declarations to init_once void arguments.
ima: Fix function name error in comment.
ima: enable loading of build time generated key on .ima keyring
ima: enable signing of modules with build time generated key
keys: cleanup build time module signing keys
ima: Fix the error code for restoring the PCR value
ima: without an IMA policy loaded, return quickly
Linus Torvalds [Sat, 1 May 2021 19:22:38 +0000 (12:22 -0700)]
Merge tag 'perf-tools-for-v5.13-2021-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull perf tool updates from Arnaldo Carvalho de Melo:
"perf stat:
- Add support for hybrid PMUs to support systems such as Intel
Alderlake and its BIG/little core/atom cpus.
- Introduce 'bperf' to share hardware PMCs with BPF.
- New --iostat option to collect and present IO stats on Intel
hardware.
This functionality is based on recently introduced sysfs attributes
for Intel® Xeon® Scalable processor family (code name Skylake-SP)
in commit 6975bca243b1 ("perf/x86/intel/uncore: Expose an Uncore
unit to IIO PMON mapping")
It is intended to provide four I/O performance metrics in MB per
each PCIe root port:
- Inbound Read: I/O devices below root port read from the host memory
- Inbound Write: I/O devices below root port write to the host memory
- Outbound Read: CPU reads from I/O devices below root port
- Outbound Write: CPU writes to I/O devices below root port
- Align CSV output for summary.
- Clarify --null use cases: Assess raw overhead of 'perf stat' or
measure just wall clock time.
- Improve readability of shadow stats.
perf record:
- Change the COMM when starting tha workload so that --exclude-perf
doesn't seem to be not honoured.
- Improve 'Workload failed' message printing events + what was
exec'ed.
- Fix cross-arch support for TIME_CONV.
perf report:
- Add option to disable raw event ordering.
- Dump the contents of PERF_RECORD_TIME_CONV in 'perf report -D'.
- Improvements to --stat output, that shows information about
PERF_RECORD_ events.
- Preserve identifier id in OCaml demangler.
perf annotate:
- Show full source location with 'l' hotkey in the 'perf annotate'
TUI.
- Add line number like in TUI and source location at EOL to the 'perf
annotate' --stdio mode.
- Add --demangle and --demangle-kernel to 'perf annotate'.
- Allow configuring annotate.demangle{,_kernel} in 'perf config'.
- Fix sample events lost in stdio mode.
perf data:
- Allow converting a perf.data file to JSON.
libperf:
- Add support for user space counter access.
- Update topdown documentation to permit rdpmc calls.
perf test:
- Add 'perf test' for 'perf stat' CSV output.
- Add 'perf test' entries to test the hybrid PMU support.
- Cleanup 'perf test daemon' if its 'perf test' is interrupted.
- Handle metric reuse in pmu-events parsing 'perf test' entry.
- Add test for PE executable support.
- Add timeout for wait for daemon start in its 'perf test' entries.
Build:
- Enable libtraceevent dynamic linking.
- Improve feature detection output.
- Fix caching of feature checks caching.
- First round of updates for tools copies of kernel headers.
- Enable warnings when compiling BPF programs.
Vendor specific events:
- Intel:
- Add missing skylake & icelake model numbers.
- PowerPC:
- Initial JSON/events list for power10 platform.
- Remove unsupported power9 metrics.
- AMD:
- Add Zen3 events.
- Fix broken L2 Cache Hits from L2 HWPF metric.
- Use lowercases for all the eventcodes and umasks.
Hardware tracing:
- arm64:
- Update CoreSight ETM metadata format.
- Fix bitmap for CS-ETM option.
- Support PID tracing in config.
- Detect pid in VMID for kernel running at EL2.
Arch specific updates:
- MIPS:
- Support MIPS unwinding and dwarf-regs.
- Generate mips syscalls_n64.c syscall table.
- PowerPC:
- Add support for PERF_SAMPLE_WEIGH_STRUCT on PowerPC.
- Support pipeline stage cycles for powerpc.
libbeauty:
- Fix fsconfig generator"
* tag 'perf-tools-for-v5.13-2021-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (132 commits)
perf build: Defer printing detected features to the end of all feature checks
tools build: Allow deferring printing the results of feature detection
perf build: Regenerate the FEATURE_DUMP file after extra feature checks
perf session: Dump PERF_RECORD_TIME_CONV event
perf session: Add swap operation for event TIME_CONV
perf jit: Let convert_timestamp() to be backwards-compatible
perf tools: Change fields type in perf_record_time_conv
perf tools: Enable libtraceevent dynamic linking
perf Documentation: Document intel-hybrid support
perf tests: Skip 'perf stat metrics (shadow stat) test' for hybrid
perf tests: Support 'Convert perf time to TSC' test for hybrid
perf tests: Support 'Session topology' test for hybrid
perf tests: Support 'Parse and process metrics' test for hybrid
perf tests: Support 'Track with sched_switch' test for hybrid
perf tests: Skip 'Setup struct perf_event_attr' test for hybrid
perf tests: Add hybrid cases for 'Roundtrip evsel->name' test
perf tests: Add hybrid cases for 'Parse event definition strings' test
perf record: Uniquify hybrid event name
perf stat: Warn group events from different hybrid PMU
perf stat: Filter out unmatched aggregation for hybrid event
...
This indicates that the data version received back from the server did not
match the expected value (the DV should be incremented monotonically for
each individual modification op committed to a vnode).
What is happening is that a lookup call is doing a bulk status fetch
speculatively on a bunch of vnodes in a directory besides getting the
status of the vnode it's actually interested in. This is racing with a
StoreData operation (though it could also occur with, say, a MakeDir op).
On the client, a modification operation locks the vnode, but the bulk
status fetch only locks the parent directory, so no ordering is imposed
there (thereby avoiding an avenue to deadlock).
On the server, the StoreData op handler doesn't lock the vnode until it's
received all the request data, and downgrades the lock after committing the
data until it has finished sending change notifications to other clients -
which allows the status fetch to occur before it has finished.
This means that:
- a status fetch can access the target vnode either side of the exclusive
section of the modification
- the status fetch could start before the modification, yet finish after,
and vice-versa.
- the status fetch and the modification RPCs can complete in either order.
- the status fetch can return either the before or the after DV from the
modification.
- the status fetch might regress the locally cached DV.
Some of these are handled by the previous fix[1], but that's not sufficient
because it checks the DV it received against the DV it cached at the start
of the op, but the DV might've been updated in the meantime by a locally
generated modification op.
Fix this by the following means:
(1) Keep track of when we're performing a modification operation on a
vnode. This is done by marking vnode parameters with a 'modification'
note that causes the AFS_VNODE_MODIFYING flag to be set on the vnode
for the duration.
(2) Alter the speculation race detection to ignore speculative status
fetches if either the vnode is marked as being modified or the data
version number is not what we expected.
Note that whilst the "vnode modified" warning does get recovered from as it
causes the client to refetch the status at the next opportunity, it will
also invalidate the pagecache, so changes might get lost.
Linus Torvalds [Sat, 1 May 2021 18:34:03 +0000 (11:34 -0700)]
Merge tag 'for-5.13/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Improve scalability of DM's device hash by switching to rbtree
- Extend DM ioctl's DM_LIST_DEVICES_CMD handling to include UUID and
allow filtering based on name or UUID prefix.
- Various small fixes for typos, warnings, unused function, or
needlessly exported interfaces.
- Remove needless request_queue NULL pointer checks in DM thin and
cache targets.
- Remove unnecessary loop in DM core's __split_and_process_bio().
- Remove DM core's dm_vcalloc() and just use kvcalloc or kvmalloc_array
instead (depending whether zeroing is useful).
- Fix request-based DM's double free of blk_mq_tag_set in device remove
after table load fails.
- Improve DM persistent data performance on non-x86 by fixing packed
structs to have a stated alignment. Also remove needless extra work
from redundant calls to sm_disk_get_nr_free() and a paranoid BUG_ON()
that caused duplicate checksum calculation.
- Fix missing goto in DM integrity's bitmap_flush_interval error
handling.
- Add "reset_recalculate" feature flag to DM integrity.
- Improve DM integrity by leveraging discard support to avoid needless
re-writing of metadata and also use discard support to improve hash
recalculation.
- Fix race with DM raid target's reshape and MD raid4/5/6 resync that
resulted in inconsistant reshape state during table reloads.
- Update DM raid target to temove unnecessary discard limits for raid0
and raid10 now that MD has optimized discard handling for both raid
levels.
* tag 'for-5.13/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (26 commits)
dm raid: remove unnecessary discard limits for raid0 and raid10
dm rq: fix double free of blk_mq_tag_set in dev remove after table load fails
dm integrity: use discard support when recalculating
dm integrity: increase RECALC_SECTORS to improve recalculate speed
dm integrity: don't re-write metadata if discarding same blocks
dm raid: fix inconclusive reshape layout on fast raid4/5/6 table reload sequences
dm raid: fix fall-through warning in rs_check_takeover() for Clang
dm clone metadata: remove unused function
dm integrity: fix missing goto in bitmap_flush_interval error handling
dm: replace dm_vcalloc()
dm space map common: fix division bug in sm_ll_find_free_block()
dm persistent data: packed struct should have an aligned() attribute too
dm btree spine: remove paranoid node_check call in node_prep_for_write()
dm space map disk: remove redundant calls to sm_disk_get_nr_free()
dm integrity: add the "reset_recalculate" feature flag
dm persistent data: remove unused return from exit_shadow_spine()
dm cache: remove needless request_queue NULL pointer checks
dm thin: remove needless request_queue NULL pointer check
dm: unexport dm_{get,put}_table_device
dm ebs: fix a few typos
...
Linus Torvalds [Sat, 1 May 2021 17:14:08 +0000 (10:14 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"This is a large update by KVM standards, including AMD PSP (Platform
Security Processor, aka "AMD Secure Technology") and ARM CoreSight
(debug and trace) changes.
ARM:
- CoreSight: Add support for ETE and TRBE
- Stage-2 isolation for the host kernel when running in protected
mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
x86:
- AMD PSP driver changes
- Optimizations and cleanup of nested SVM code
- AMD: Support for virtual SPEC_CTRL
- Optimizations of the new MMU code: fast invalidation, zap under
read lock, enable/disably dirty page logging under read lock
- /dev/kvm API for AMD SEV live migration (guest API coming soon)
- support SEV virtual machines sharing the same encryption context
- support SGX in virtual machines
- add a few more statistics
- improved directed yield heuristics
- Lots and lots of cleanups
Generic:
- Rework of MMU notifier interface, simplifying and optimizing the
architecture-specific code
- a handful of "Get rid of oprofile leftovers" patches
- Some selftests improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (379 commits)
KVM: selftests: Speed up set_memory_region_test
selftests: kvm: Fix the check of return value
KVM: x86: Take advantage of kvm_arch_dy_has_pending_interrupt()
KVM: SVM: Skip SEV cache flush if no ASIDs have been used
KVM: SVM: Remove an unnecessary prototype declaration of sev_flush_asids()
KVM: SVM: Drop redundant svm_sev_enabled() helper
KVM: SVM: Move SEV VMCB tracking allocation to sev.c
KVM: SVM: Explicitly check max SEV ASID during sev_hardware_setup()
KVM: SVM: Unconditionally invoke sev_hardware_teardown()
KVM: SVM: Enable SEV/SEV-ES functionality by default (when supported)
KVM: SVM: Condition sev_enabled and sev_es_enabled on CONFIG_KVM_AMD_SEV=y
KVM: SVM: Append "_enabled" to module-scoped SEV/SEV-ES control variables
KVM: SEV: Mask CPUID[0x8000001F].eax according to supported features
KVM: SVM: Move SEV module params/variables to sev.c
KVM: SVM: Disable SEV/SEV-ES if NPT is disabled
KVM: SVM: Free sev_asid_bitmap during init if SEV setup fails
KVM: SVM: Zero out the VMCB array used to track SEV ASID association
x86/sev: Drop redundant and potentially misleading 'sev_enabled'
KVM: x86: Move reverse CPUID helpers to separate header file
KVM: x86: Rename GPR accessors to make mode-aware variants the defaults
...
Linus Torvalds [Sat, 1 May 2021 16:33:00 +0000 (09:33 -0700)]
Merge tag 'iommu-updates-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
- Big cleanup of almost unsused parts of the IOMMU API by Christoph
Hellwig. This mostly affects the Freescale PAMU driver.
- New IOMMU driver for Unisoc SOCs
- ARM SMMU Updates from Will:
- Drop vestigial PREFETCH_ADDR support (SMMUv3)
- Elide TLB sync logic for empty gather (SMMUv3)
- Fix "Service Failure Mode" handling (SMMUv3)
- New Qualcomm compatible string (SMMUv2)
- Removal of the AMD IOMMU performance counter writeable check on AMD.
It caused long boot delays on some machines and is only needed to
work around an errata on some older (possibly pre-production) chips.
If someone is still hit by this hardware issue anyway the performance
counters will just return 0.
- Support for targeted invalidations in the AMD IOMMU driver. Before
that the driver only invalidated a single 4k page or the whole IO/TLB
for an address space. This has been extended now and is mostly useful
for emulated AMD IOMMUs.
- Several fixes for the Shared Virtual Memory support in the Intel VT-d
driver
- Mediatek drivers can now be built as modules
- Re-introduction of the forcedac boot option which got lost when
converting the Intel VT-d driver to the common dma-iommu
implementation.
- Extension of the IOMMU device registration interface and support
iommu_ops to be const again when drivers are built as modules.
* tag 'iommu-updates-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (84 commits)
iommu: Streamline registration interface
iommu: Statically set module owner
iommu/mediatek-v1: Add error handle for mtk_iommu_probe
iommu/mediatek-v1: Avoid build fail when build as module
iommu/mediatek: Always enable the clk on resume
iommu/fsl-pamu: Fix uninitialized variable warning
iommu/vt-d: Force to flush iotlb before creating superpage
iommu/amd: Put newline after closing bracket in warning
iommu/vt-d: Fix an error handling path in 'intel_prepare_irq_remapping()'
iommu/vt-d: Fix build error of pasid_enable_wpe() with !X86
iommu/amd: Remove performance counter pre-initialization test
Revert "iommu/amd: Fix performance counter initialization"
iommu/amd: Remove duplicate check of devid
iommu/exynos: Remove unneeded local variable initialization
iommu/amd: Page-specific invalidations for more than one page
iommu/arm-smmu-v3: Remove the unused fields for PREFETCH_CONFIG command
iommu/vt-d: Avoid unnecessary cache flush in pasid entry teardown
iommu/vt-d: Invalidate PASID cache when root/context entry changed
iommu/vt-d: Remove WO permissions on second-level paging entries
iommu/vt-d: Report the right page fault address
...
Linus Torvalds [Sat, 1 May 2021 16:15:05 +0000 (09:15 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"This is significantly bug fixes and general cleanups. The noteworthy
new features are fairly small:
- XRC support for HNS and improves RQ operations
- Bug fixes and updates for hns, mlx5, bnxt_re, hfi1, i40iw, rxe, siw
and qib
- Quite a few general cleanups on spelling, error handling, static
checker detections, etc
- Increase the number of device ports supported beyond 255. High port
count software switches now exist
- Several bug fixes for rtrs
- mlx5 Device Memory support for host controlled atomics
- Report SRQ tables through to rdma-tool"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (145 commits)
IB/qib: Remove redundant assignment to ret
RDMA/nldev: Add copy-on-fork attribute to get sys command
RDMA/bnxt_re: Fix a double free in bnxt_qplib_alloc_res
RDMA/siw: Fix a use after free in siw_alloc_mr
IB/hfi1: Remove redundant variable rcd
RDMA/nldev: Add QP numbers to SRQ information
RDMA/nldev: Return SRQ information
RDMA/restrack: Add support to get resource tracking for SRQ
RDMA/nldev: Return context information
RDMA/core: Add CM to restrack after successful attachment to a device
RDMA/cma: Skip device which doesn't support CM
RDMA/rxe: Fix a bug in rxe_fill_ip_info()
RDMA/mlx5: Expose private query port
RDMA/mlx4: Remove an unused variable
RDMA/mlx5: Fix type assignment for ICM DM
IB/mlx5: Set right RoCE l3 type and roce version while deleting GID
RDMA/i40iw: Fix error unwinding when i40iw_hmc_sd_one fails
RDMA/cxgb4: add missing qpid increment
IB/ipoib: Remove unnecessary struct declaration
RDMA/bnxt_re: Get rid of custom module reference counting
...
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"New features for ext4 this cycle include support for encrypted
casefold, ensure that deleted file names are cleared in directory
blocks by zeroing directory entries when they are unlinked or moved as
part of a hash tree node split. We also improve the block allocator's
performance on a freshly mounted file system by prefetching block
bitmaps.
There are also the usual cleanups and bug fixes, including fixing a
page cache invalidation race when there is mixed buffered and direct
I/O and the block size is less than page size, and allow the dax flag
to be set and cleared on inline directories"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits)
ext4: wipe ext4_dir_entry2 upon file deletion
ext4: Fix occasional generic/418 failure
fs: fix reporting supported extra file attributes for statx()
ext4: allow the dax flag to be set and cleared on inline directories
ext4: fix debug format string warning
ext4: fix trailing whitespace
ext4: fix various seppling typos
ext4: fix error return code in ext4_fc_perform_commit()
ext4: annotate data race in jbd2_journal_dirty_metadata()
ext4: annotate data race in start_this_handle()
ext4: fix ext4_error_err save negative errno into superblock
ext4: fix error code in ext4_commit_super
ext4: always panic when errors=panic is specified
ext4: delete redundant uptodate check for buffer
ext4: do not set SB_ACTIVE in ext4_orphan_cleanup()
ext4: make prefetch_block_bitmaps default
ext4: add proc files to monitor new structures
ext4: improve cr 0 / cr 1 group scanning
ext4: add MB_NUM_ORDERS macro
ext4: add mballoc stats proc file
...
Merge tag 'ovl-update-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs update from Miklos Szeredi:
- Fix a regression introduced in 5.2 that resulted in valid overlayfs
mounts being rejected with ELOOP (Too many levels of symbolic links)
- Fix bugs found by various tools
- Miscellaneous improvements and cleanups
* tag 'ovl-update-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: add debug print to ovl_do_getxattr()
ovl: invalidate readdir cache on changes to dir with origin
ovl: allow upperdir inside lowerdir
ovl: show "userxattr" in the mount data
ovl: trivial typo fixes in the file inode.c
ovl: fix misspellings using codespell tool
ovl: do not copy attr several times
ovl: remove ovl_map_dev_ino() return value
ovl: fix error for ovl_fill_super()
ovl: fix missing revert_creds() on error path
ovl: fix leaked dentry
ovl: restrict lower null uuid for "xino=auto"
ovl: check that upperdir path is not on a read-only mount
ovl: plumb through flush method
Merge misc updates from Andrew Morton:
"A few misc subsystems and some of MM.
175 patches.
Subsystems affected by this patch series: ia64, kbuild, scripts, sh,
ocfs2, kfifo, vfs, kernel/watchdog, and mm (slab-generic, slub,
kmemleak, debug, pagecache, msync, gup, memremap, memcg, pagemap,
mremap, dma, sparsemem, vmalloc, documentation, kasan, initialization,
pagealloc, and memory-failure)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (175 commits)
mm/memory-failure: unnecessary amount of unmapping
mm/mmzone.h: fix existing kernel-doc comments and link them to core-api
mm: page_alloc: ignore init_on_free=1 for debug_pagealloc=1
net: page_pool: use alloc_pages_bulk in refill code path
net: page_pool: refactor dma_map into own function page_pool_dma_map
SUNRPC: refresh rq_pages using a bulk page allocator
SUNRPC: set rq_page_end differently
mm/page_alloc: inline __rmqueue_pcplist
mm/page_alloc: optimize code layout for __alloc_pages_bulk
mm/page_alloc: add an array-based interface to the bulk page allocator
mm/page_alloc: add a bulk page allocator
mm/page_alloc: rename alloced to allocated
mm/page_alloc: duplicate include linux/vmalloc.h
mm, page_alloc: avoid page_to_pfn() in move_freepages()
mm/Kconfig: remove default DISCONTIGMEM_MANUAL
mm: page_alloc: dump migrate-failed pages
mm/mempolicy: fix mpol_misplaced kernel-doc
mm/mempolicy: rewrite alloc_pages_vma documentation
mm/mempolicy: rewrite alloc_pages documentation
mm/mempolicy: rename alloc_pages_current to alloc_pages
...
Merge tag 'pinctrl-v5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control updates from Linus Walleij:
"There is a lot going on!
Core changes:
- A semantic change to handle pinmux and pinconf in explicit order
while up until now we depended on the semantic order in the device
tree. The device tree is a functional programming language and does
not imply any order, so the right thing is for the pin control core
to provide these semantics.
- Add a new pinmux-select debugfs file which makes it possible to go
in and select functions for a pin manually (iteratively, at the
prompt) for debugging purposes.
- Fixes to gpio regmap handling for a new pin control driver making
use of regmap-gpio.
- Use octal permissions on debugfs files.
New drivers:
- A massive rewrite of the former custom pin control driver for MIPS
Broadcom devices to instead use the pin control subsystem. New pin
control drivers for BCM6345, BCM6328, BCM6358, BCM6362, BCM6368,
BCM63268 and BCM6318 SoC variants are implemented.
- Support for PM8350, PM8350B, PM8350C, PMK8350, PMR735A and PMR735B
in the Qualcomm PMIC GPIO driver. Also the two GPIOs on PM8008 are
supported.
- Support for the Rockchip RK3568/RK3566 pin controller.
- Support for Ingenic JZ4730, JZ4750, JZ4755, JZ4775 and X2000.
- Support for Mediatek MTK8195.
- Add a new Xilinx ZynqMP pin control driver.
Driver improvements and non-urgent fixes:
- Modularization and improvements of the Rockchip drivers.
- Some new pins added to the description of new Renesas SoCs.
- Clarifications of the GPIO base calculation in the Intel driver.
- Fix the function names for the MPP54 and MPP55 pins in the Armada
CP110 pin controller.
- GPIO wakeup interrupt map for Qualcomm SC7280 and SM8350.
- Support for ACPI probing of the Qualcomm SC8180x.
- Fix interrupt clear status on rockchip
- Fix some missing pins on the Ingenic JZ4770, some semantic fixes
for the behaviour of the Ingenic pin controller. Add DMIC pins for
JZ4780, X1000, X1500 and X1830.
- A slew of janitorial like of_node_put() calls"
* tag 'pinctrl-v5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (99 commits)
pinctrl: Add Xilinx ZynqMP pinctrl driver support
firmware: xilinx: Add pinctrl support
pinctrl: rockchip: do coding style for mux route struct
pinctrl: Add PIN_CONFIG_MODE_PWM to enum pin_config_param
pinctrl: Introduce MODE group in enum pin_config_param
pinctrl: Keep enum pin_config_param ordered by name
dt-bindings: pinctrl: Add binding for ZynqMP pinctrl driver
pinctrl: core: Fix kernel doc string for pin_get_name()
pinctrl: mediatek: use spin lock in mtk_rmw
pinctrl: add drive for I2C related pins on MT8195
pinctrl: add pinctrl driver on mt8195
dt-bindings: pinctrl: mt8195: add pinctrl file and binding document
pinctrl: Ingenic: Add pinctrl driver for X2000.
pinctrl: Ingenic: Add pinctrl driver for JZ4775.
pinctrl: Ingenic: Add pinctrl driver for JZ4755.
pinctrl: Ingenic: Add pinctrl driver for JZ4750.
pinctrl: Ingenic: Add pinctrl driver for JZ4730.
dt-bindings: pinctrl: Add bindings for new Ingenic SoCs.
pinctrl: Ingenic: Reformat the code.
pinctrl: Ingenic: Add DMIC pins support for Ingenic SoCs.
...
Merge tag 'sound-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound updates from Takashi Iwai:
"No surprises in this development cycle, and most of work is about the
fixes and the improvements of the existing code, while a new LED
control layer and a few new drivers have been introduced.
Here are some highlights:
Core:
- A common mute-LED framework was introduced. It is used by HD-audio
for now, more adaption will follow later. The former "Mic Mute-LED
Mode" mixer control has been replaced with the corresponding sysfs
now.
- User-control management was changed to count consumed bytes instead
of capping by number of elements; this will allow more controls in
the normal usage pattern while avoiding the possible memory
exhaustion DoS
ASoC:
- Continued refactoring and cleanups in ASoC core and generic card
drivers
- Wide range of small cppcheck and warning fixes
- New drivers for Freescale i.MX DMA over rpmsg, Mediatek MT6358
accessory detection, and Realtek RT1019, RT1316, RT711 and RT715
USB-audio:
- Continued improvements and fixes of the implicit feedback mode,
including better support for Pioneer and Roland/BOSS devices
HD-audio:
- Default back to non-buffer preallocation on x86
- Cirrus codec improvements, more quirks for Realtek codecs
Others:
- New virtio sound driver
- FireWire Bebob updates"
* tag 'sound-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (587 commits)
ALSA: hda/conexant: Re-order CX5066 quirk table entries
ALSA: hda/realtek: Remove redundant entry for ALC861 Haier/Uniwill devices
ALSA: hda/realtek: Re-order ALC662 quirk table entries
ALSA: hda/realtek: Re-order remaining ALC269 quirk table entries
ALSA: hda/realtek: Re-order ALC269 Lenovo quirk table entries
ALSA: hda/realtek: Re-order ALC269 Sony quirk table entries
ALSA: hda/realtek: Re-order ALC269 ASUS quirk table entries
ALSA: hda/realtek: Re-order ALC269 Dell quirk table entries
ALSA: hda/realtek: Re-order ALC269 Acer quirk table entries
ALSA: hda/realtek: Re-order ALC269 HP quirk table entries
ALSA: hda/realtek: Re-order ALC882 Clevo quirk table entries
ALSA: hda/realtek: Re-order ALC882 Sony quirk table entries
ALSA: hda/realtek: Re-order ALC882 Acer quirk table entries
ALSA: usb-audio: Remove redundant assignment to len
ALSA: hda/realtek: Add quirk for Intel Clevo PCx0Dx
ALSA: virtio: fix kernel-doc
ALSA: hda/cirrus: Use CS8409 filter to fix abnormal sounds on Bullseye
ALSA: hda/cirrus: Set Initial DMIC volume for Bullseye to -26 dB
ALSA: sb: Fix two use after free in snd_sb_qsound_build
ALSA: emu8000: Fix a use after free in snd_emu8000_create_mixer
...
Merge tag 'drm-next-2021-04-30' of git://anongit.freedesktop.org/drm/drm
Pull more drm updates from Dave Airlie:
"Looks like I missed a tegra feature request for next, but should still
be fine since it's pretty self contained.
Apart from that got a set of i915 and amdgpu fixes as per usual along
with a few misc fixes.
tegra:
- Tegra186 hardware cursor support
- better capability reporting for different SoC
- better framebuffer modifier support
- host1x fixes
i915:
- Several fixes to GLK handling in recent display refactoring
- Rare watchdog timer race fix
- Cppcheck redundant condition fix
- Overlay error code propagation fix
- Documentation fix
- gvt: Remove one unused function warning
- gvt: Fix intel_gvt_init_device() return type
- gvt: Remove one duplicated register accessible check"
* tag 'drm-next-2021-04-30' of git://anongit.freedesktop.org/drm/drm: (111 commits)
efifb: Check efifb_pci_dev before using it
drm/i915: Fix docbook descriptions for i915_gem_shrinker
drm/i915: fix an error code in intel_overlay_do_put_image()
drm/i915/display/psr: Fix cppcheck warnings
drm/i915: Disable LTTPR detection on GLK once again
drm/i915: Restore lost glk ccs w/a
drm/i915: Restore lost glk FBC 16bpp w/a
drm/i915: Take request reference before arming the watchdog timer
drm/ttm: fix error handling if no BO can be swapped out v4
drm/i915/gvt: Remove duplicated register accessible check
drm/amdgpu/gmc9: remove dummy read workaround for newer chips
drm/amdgpu: Add mem sync flag for IB allocated by SA
drm/amdgpu: Fix SDMA RAS error reporting on Aldebaran
drm/amdgpu: Reset RAS error count and status regs
Revert "drm/amdgpu: workaround the TMR MC address issue (v2)"
drm/amd/display: 3.2.132
drm/amd/display: [FW Promotion] Release 0.0.62
drm/amd/display: add helper for enabling mst stream features
drm/amd/display: Report Proper Quantization Range in AVI Infoframe
drm/amd/display: Fix call to pass bpp in 16ths of a bit
...
Merge tag 'modules-for-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull module updates from Jessica Yu:
"Fix an age old bug involving jump_calls and static_labels when
CONFIG_MODULE_UNLOAD=n.
When CONFIG_MODULE_UNLOAD=n, it means you can't unload modules, so
normally the __exit sections of a module are not loaded at all.
However, dynamic code patching (jump_label, static_call, alternatives)
can have sites in __exit sections even if __exit is never executed.
Reported by Peter Zijlstra:
'Alternatives, jump_labels and static_call all can have relocations
into __exit code. Not loading it at all would be BAD.'
Therefore, load the __exit sections even when CONFIG_MODULE_UNLOAD=n,
and discard them after init"
* tag 'modules-for-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
module: treat exit sections the same as init sections when !CONFIG_MODULE_UNLOAD
Merge tag 'powerpc-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Enable KFENCE for 32-bit.
- Implement EBPF for 32-bit.
- Convert 32-bit to do interrupt entry/exit in C.
- Convert 64-bit BookE to do interrupt entry/exit in C.
- Changes to our signal handling code to use user_access_begin/end()
more extensively.
- Add support for time namespaces (CONFIG_TIME_NS)
- A series of fixes that allow us to reenable STRICT_KERNEL_RWX.
- Other smaller features, fixes & cleanups.
Thanks to Alexey Kardashevskiy, Andreas Schwab, Andrew Donnellan, Aneesh
Kumar K.V, Athira Rajeev, Bhaskar Chowdhury, Bixuan Cui, Cédric Le
Goater, Chen Huang, Chris Packham, Christophe Leroy, Christopher M.
Riedl, Colin Ian King, Dan Carpenter, Daniel Axtens, Daniel Henrique
Barboza, David Gibson, Davidlohr Bueso, Denis Efremov, dingsenjie,
Dmitry Safonov, Dominic DeMarco, Fabiano Rosas, Ganesh Goudar, Geert
Uytterhoeven, Geetika Moolchandani, Greg Kurz, Guenter Roeck, Haren
Myneni, He Ying, Jiapeng Chong, Jordan Niethe, Laurent Dufour, Lee
Jones, Leonardo Bras, Li Huafei, Madhavan Srinivasan, Mahesh Salgaonkar,
Masahiro Yamada, Nathan Chancellor, Nathan Lynch, Nicholas Piggin,
Oliver O'Halloran, Paul Menzel, Pu Lehui, Randy Dunlap, Ravi Bangoria,
Rosen Penev, Russell Currey, Santosh Sivaraj, Sebastian Andrzej Siewior,
Segher Boessenkool, Shivaprasad G Bhat, Srikar Dronamraju, Stephen
Rothwell, Thadeu Lima de Souza Cascardo, Thomas Gleixner, Tony Ambardar,
Tyrel Datwyler, Vaibhav Jain, Vincenzo Frascino, Xiongwei Song, Yang Li,
Yu Kuai, and Zhang Yunkai.
* tag 'powerpc-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (302 commits)
powerpc/signal32: Fix erroneous SIGSEGV on RT signal return
powerpc: Avoid clang uninitialized warning in __get_user_size_allowed
powerpc/papr_scm: Mark nvdimm as unarmed if needed during probe
powerpc/kvm: Fix build error when PPC_MEM_KEYS/PPC_PSERIES=n
powerpc/kasan: Fix shadow start address with modules
powerpc/kernel/iommu: Use largepool as a last resort when !largealloc
powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE() to save TCEs
powerpc/44x: fix spelling mistake in Kconfig "varients" -> "variants"
powerpc/iommu: Annotate nested lock for lockdep
powerpc/iommu: Do not immediately panic when failed IOMMU table allocation
powerpc/iommu: Allocate it_map by vmalloc
selftests/powerpc: remove unneeded semicolon
powerpc/64s: remove unneeded semicolon
powerpc/eeh: remove unneeded semicolon
powerpc/selftests: Add selftest to test concurrent perf/ptrace events
powerpc/selftests/perf-hwbreak: Add testcases for 2nd DAWR
powerpc/selftests/perf-hwbreak: Coalesce event creation code
powerpc/selftests/ptrace-hwbreak: Add testcases for 2nd DAWR
powerpc/configs: Add IBMVNIC to some 64-bit configs
selftests/powerpc: Add uaccess flush test
...
Jane Chu [Fri, 30 Apr 2021 06:02:19 +0000 (23:02 -0700)]
mm/memory-failure: unnecessary amount of unmapping
It appears that unmap_mapping_range() actually takes a 'size' as its third
argument rather than a location, the current calling fashion causes
unnecessary amount of unmapping to occur.
Link: https://lkml.kernel.org/r/20210420002821.2749748-1-jane.chu@oracle.com Fixes: 91e9827440332 ("mm, memory_failure: Teach memory_failure() about dev_pagemap pages") Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: page_alloc: ignore init_on_free=1 for debug_pagealloc=1
On !ARCH_SUPPORTS_DEBUG_PAGEALLOC (like ia64) debug_pagealloc=1 implies
page_poison=on:
if (page_poisoning_enabled() ||
(!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
debug_pagealloc_enabled()))
static_branch_enable(&_page_poisoning_enabled);
page_poison=on needs to override init_on_free=1.
Before the change it did not work as expected for the following case:
- have PAGE_POISONING=y
- have page_poison unset
- have !ARCH_SUPPORTS_DEBUG_PAGEALLOC arch (like ia64)
- have init_on_free=1
- have debug_pagealloc=1
That way we get both keys enabled:
- static_branch_enable(&init_on_free);
- static_branch_enable(&_page_poisoning_enabled);
which leads to poisoned pages returned for __GFP_ZERO pages.
After the change we execute only:
- static_branch_enable(&_page_poisoning_enabled);
and ignore init_on_free=1.
Link: https://lkml.kernel.org/r/20210329222555.3077928-1-slyfox@gentoo.org Link: https://lkml.org/lkml/2021/3/26/443 Fixes: c3d812419e37 ("mm, page_poison: use static key more efficiently") Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
net: page_pool: use alloc_pages_bulk in refill code path
There are cases where the page_pool need to refill with pages from the
page allocator. Some workloads cause the page_pool to release pages
instead of recycling these pages.
For these workload it can improve performance to bulk alloc pages from the
page-allocator to refill the alloc cache.
For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
redirecting xdp_frame packets into a veth, that does XDP_PASS to create an
SKB from the xdp_frame, which then cannot return the page to the
page_pool.
Performance results under GitHub xdp-project[1]:
[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org
Mel: The patch "net: page_pool: convert to use alloc_pages_bulk_array
variant" was squashed with this patch. From the test page, the array
variant was superior with one of the test results as follows.
Kernel XDP stats CPU pps Delta
Baseline XDP-RX CPU total 3,771,046 n/a
List XDP-RX CPU total 3,940,242 +4.49%
Array XDP-RX CPU total 4,249,224 +12.68%
Link: https://lkml.kernel.org/r/20210325114228.27719-10-mgorman@techsingularity.net Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chuck Lever [Fri, 30 Apr 2021 06:02:01 +0000 (23:02 -0700)]
SUNRPC: refresh rq_pages using a bulk page allocator
Reduce the rate at which nfsd threads hammer on the page allocator. This
improves throughput scalability by enabling the threads to run more
independently of each other.
[mgorman: Update interpretation of alloc_pages_bulk return value]
Link: https://lkml.kernel.org/r/20210325114228.27719-8-mgorman@techsingularity.net Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patches change SUNRPC to invoke the array-based bulk allocator
instead of alloc_page().
The micro-benchmark results are promising. I ran a mixture of 256KB
reads and writes over NFSv3. The server's kernel is built with KASAN
enabled, so the comparison is exaggerated but I believe it is still
valid.
I instrumented svc_recv() to measure the latency of each call to
svc_alloc_arg() and report it via a trace point. The following results
are averages across the trace events.
Single page: 25.007 us per call over 532,571 calls
Bulk list: 6.258 us per call over 517,034 calls
Bulk array: 4.590 us per call over 517,442 calls
This patch (of 2)
Refactor:
I'm about to use the loop variable @i for something else.
As far as the "i++" is concerned, that is a post-increment. The
value of @i is not used subsequently, so the increment operator
is unnecessary and can be removed.
Also note that nfsd_read_actor() was renamed nfsd_splice_actor()
by commit e2dcdb477d84 ("sendfile: convert nfsd to
splice_direct_to_actor()").
Link: https://lkml.kernel.org/r/20210325114228.27719-7-mgorman@techsingularity.net Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/page_alloc: optimize code layout for __alloc_pages_bulk
Looking at perf-report and ASM-code for __alloc_pages_bulk() it is clear
that the code activated is suboptimal. The compiler guesses wrong and
places unlikely code at the beginning. Due to the use of WARN_ON_ONCE()
macro the UD2 asm instruction is added to the code, which confuse the
I-cache prefetcher in the CPU.
[mgorman@techsingularity.net: minor changes and rebasing]
Link: https://lkml.kernel.org/r/20210325114228.27719-5-mgorman@techsingularity.net Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Acked-By: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/page_alloc: add an array-based interface to the bulk page allocator
The proposed callers for the bulk allocator store pages from the bulk
allocator in an array. This patch adds an array-based interface to the
API to avoid multiple list iterations. The page list interface is
preserved to avoid requiring all users of the bulk API to allocate and
manage enough storage to store the pages.
[akpm@linux-foundation.org: remove now unused local `allocated']
Link: https://lkml.kernel.org/r/20210325114228.27719-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds a new page allocator interface via alloc_pages_bulk, and
__alloc_pages_bulk_nodemask. A caller requests a number of pages to be
allocated and added to a list.
The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory, the
cpuset changes during the allocation or page debugging decides to fail
an allocation. It's up to the caller to request more pages in batch if
necessary.
Note that this implementation is not very efficient and could be
improved but it would require refactoring. The intent is to make it
available early to determine what semantics are required by different
callers. Once the full semantics are nailed down, it can be refactored.
[mgorman@techsingularity.net: fix alloc_pages_bulk() return type, per Matthew] Link: https://lkml.kernel.org/r/20210325123713.GQ3697@techsingularity.net
[mgorman@techsingularity.net: fix uninit var warning] Link: https://lkml.kernel.org/r/20210330114847.GX3697@techsingularity.net
[mgorman@techsingularity.net: fix comment, per Vlastimil] Link: https://lkml.kernel.org/r/20210412110255.GV3697@techsingularity.net Link: https://lkml.kernel.org/r/20210325114228.27719-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Tested-by: Colin Ian King <colin.king@canonical.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Introduce a bulk order-0 page allocator with two in-tree users", v6.
This series introduces a bulk order-0 page allocator with sunrpc and the
network page pool being the first users. The implementation is not
efficient as semantics needed to be ironed out first. If no other
semantic changes are needed, it can be made more efficient. Despite that,
this is a performance-related for users that require multiple pages for an
operation without multiple round-trips to the page allocator. Quoting the
last patch for the high-speed networking use-case
Kernel XDP stats CPU pps Delta
Baseline XDP-RX CPU total 3,771,046 n/a
List XDP-RX CPU total 3,940,242 +4.49%
Array XDP-RX CPU total 4,249,224 +12.68%
Via the SUNRPC traces of svc_alloc_arg()
Single page: 25.007 us per call over 532,571 calls
Bulk list: 6.258 us per call over 517,034 calls
Bulk array: 4.590 us per call over 517,442 calls
Both potential users in this series are corner cases (NFS and high-speed
networks) so it is unlikely that most users will see any benefit in the
short term. Other potential other users are batch allocations for page
cache readahead, fault around and SLUB allocations when high-order pages
are unavailable. It's unknown how much benefit would be seen by
converting multiple page allocation calls to a single batch or what
difference it may make to headline performance.
Light testing of my own running dbench over NFS passed. Chuck and Jesper
conducted their own tests and details are included in the changelogs.
Patch 1 renames a variable name that is particularly unpopular
Patch 2 adds a bulk page allocator
Patch 3 adds an array-based version of the bulk allocator
Patches 4-5 adds micro-optimisations to the implementation
Patches 6-7 SUNRPC user
Patches 8-9 Network page_pool user
This patch (of 9):
Review feedback of the bulk allocator twice found problems with "alloced"
being a counter for pages allocated. The naming was based on the API name
"alloc" and was based on the idea that verbal communication about malloc
tends to use the fake word "malloced" instead of the fake word mallocated.
To be consistent, this preparation patch renames alloced to allocated in
rmqueue_bulk so the bulk allocator and per-cpu allocator use similar names
when the bulk allocator is introduced.
Link: https://lkml.kernel.org/r/20210325114228.27719-1-mgorman@techsingularity.net Link: https://lkml.kernel.org/r/20210325114228.27719-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kefeng Wang [Fri, 30 Apr 2021 06:01:36 +0000 (23:01 -0700)]
mm, page_alloc: avoid page_to_pfn() in move_freepages()
The start_pfn and end_pfn are already available in move_freepages_block(),
there is no need to go back and forth between page and pfn in
move_freepages and move_freepages_block, and pfn_valid_within() should
validate pfn first before touching the page.
Link: https://lkml.kernel.org/r/20210323131215.934472-1-liushixin2@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit b906f164d93150ff ("ia64: make SPARSEMEM default and disable
DISCONTIGMEM") removed the last enabler of ARCH_DISCONTIGMEM_DEFAULT,
hence the memory model can no longer default to DISCONTIGMEM_MANUAL.
Link: https://lkml.kernel.org/r/20210312141208.3465520-1-geert@linux-m68k.org Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Fri, 30 Apr 2021 06:01:30 +0000 (23:01 -0700)]
mm: page_alloc: dump migrate-failed pages
Currently, debugging CMA allocation failures is quite limited. The most
common source of these failures seems to be page migration which doesn't
provide any useful information on the reason of the failure by itself.
alloc_contig_range can report those failures as it holds a list of
migrate-failed pages.
The information logged by dump_page() has already proven helpful for
debugging allocation issues, like identifying long-term pinnings on
ZONE_MOVABLE or MIGRATE_CMA.
Let's use the dynamic debugging infrastructure, such that we avoid
flooding the logs and creating a lot of noise on frequent
alloc_contig_range() calls. This information is helpful for debugging
only.
There are two ifdefery conditions to support common dyndbg options:
- CONFIG_DYNAMIC_DEBUG_CORE && DYNAMIC_DEBUG_MODULE
It aims for supporting the feature with only specific file with
adding ccflags.
- CONFIG_DYNAMIC_DEBUG
It aims for supporting the feature with system wide globally.
A simple example to enable the feature:
Admin could enable the dump like this(by default, disabled)
A concern is utility functions in dump_page use inconsistent
loglevels. In the future, we might want to make the loglevels
used inside dump_page() consistent and eventually rework the way
we log the information here. See [1].
Link: https://lkml.kernel.org/r/20210311194042.825152-1-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: John Dias <joaodias@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sphinx interprets the Return section as a list and complains about it.
Turn it into a sentence and move it to the end of the kernel-doc to fit
the kernel-doc style.
Link: https://lkml.kernel.org/r/20210225150642.2582252-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/page_alloc: combine __alloc_pages and __alloc_pages_nodemask
There are only two callers of __alloc_pages() so prune the thicket of
alloc_page variants by combining the two functions together. Current
callers of __alloc_pages() simply add an extra 'NULL' parameter and
current callers of __alloc_pages_nodemask() call __alloc_pages() instead.
Link: https://lkml.kernel.org/r/20210225150642.2582252-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Rationalise __alloc_pages wrappers", v3.
I was poking around the __alloc_pages variants trying to understand why
they each exist, and couldn't really find a good justification for keeping
__alloc_pages and __alloc_pages_nodemask as separate functions. That led
to getting rid of alloc_pages_current() and then I noticed the
documentation was bad, and then I noticed the mempolicy documentation
wasn't included.
Anyway, this is all cleanups & doc fixes.
This patch (of 7):
We have two masks involved -- the nodemask and the gfp mask, so alloc_mask
is an unclear name.
Link: https://lkml.kernel.org/r/20210225150642.2582252-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The naming convention used in include/linux/page-flags-layout.h:
*_SHIFT: the number of bits trying to allocate
*_WIDTH: the number of bits successfully allocated
So when it comes to LAST_CPUPID_WIDTH, we need to check whether all
previous *_WIDTH and LAST_CPUPID_SHIFT can fit into page flags. This
means we need to use NODES_WIDTH, not NODES_SHIFT.
Link: https://lkml.kernel.org/r/20210303071609.797782-1-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Fri, 30 Apr 2021 06:01:01 +0000 (23:01 -0700)]
mm: remove lru_add_drain_all in alloc_contig_range
__alloc_contig_migrate_range already has lru_add_drain_all call via
migrate_prep. It's necessary to move LRU taget pages into LRU list to be
able to isolated. However, lru_add_drain_all call after
__alloc_contig_migrate_range is pointless since it has changed source page
freeing from putback_lru_pages to put_page[1].
This patch removes it.
[1] 78fda28726b3, ("mm: use put_page() to free page instead of putback_lru_page()"
Link: https://lkml.kernel.org/r/20210303204512.2863087-1-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/page_alloc: drop pr_info_ratelimited() in alloc_contig_range()
The information that some PFNs are busy is:
a) not helpful for ordinary users: we don't even know *who* called
alloc_contig_range(). This is certainly not worth a pr_info.*().
b) not really helpful for debugging: we don't have any details *why*
these PFNs are busy, and that is what we usually care about.
c) not complete: there are other cases where we fail alloc_contig_range()
using different paths that are not getting recorded.
For example, we reach this path once we succeeded in isolating pageblocks,
but failed to migrate some pages - which can happen easily on ZONE_NORMAL
(i.e., has_unmovable_pages() is racy) but also on ZONE_MOVABLE i.e., we
would have to retry longer to migrate).
For example via virtio-mem when unplugging memory, we can create quite
some noise (especially with ZONE_NORMAL) that is not of interest to users
- it's expected that some allocations may fail as memory is busy.
Let's just drop that pr_info_ratelimit() and rather implement a dynamic
debugging mechanism in the future that can give us a better reason why
alloc_contig_range() failed on specific pages.
Link: https://lkml.kernel.org/r/20210301150945.77012-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the irq_work_queue() call stack into the KASAN auxiliary stack in
order to improve KASAN reports. this will let us know where the irq work
be queued.
Currently, KASAN-KUnit tests can check that a particular annotated part of
code causes a KASAN report. However, they do not check that no unwanted
reports happen between the annotated parts.
This patch implements these checks.
It is done by setting report_data.report_found to false in
kasan_test_init() and at the end of KUNIT_EXPECT_KASAN_FAIL() and then
checking that it remains false at the beginning of
KUNIT_EXPECT_KASAN_FAIL() and in kasan_test_exit().
kunit_add_named_resource() call is moved to kasan_test_init(), and the
value of fail_data.report_expected is kept as false in between
KUNIT_EXPECT_KASAN_FAIL() annotations for consistency.
Link: https://lkml.kernel.org/r/48079c52cc329fbc52f4386996598d58022fb872.1617207873.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Walter Wu [Fri, 30 Apr 2021 06:00:45 +0000 (23:00 -0700)]
kasan: record task_work_add() call stack
Why record task_work_add() call stack? Syzbot reports many use-after-free
issues for task_work, see [1]. After seeing the free stack and the
current auxiliary stack, we think they are useless, we don't know where
the work was registered. This work may be the free call stack, so we miss
the root cause and don't solve the use-after-free.
Add the task_work_add() call stack into the KASAN auxiliary stack in order
to improve KASAN reports. It helps programmers solve use-after-free
issues.
kasan, mm: integrate slab init_on_free with HW_TAGS
This change uses the previously added memory initialization feature of
HW_TAGS KASAN routines for slab memory when init_on_free is enabled.
With this change, memory initialization memset() is no longer called when
both HW_TAGS KASAN and init_on_free are enabled. Instead, memory is
initialized in KASAN runtime.
For SLUB, the memory initialization memset() is moved into
slab_free_hook() that currently directly follows the initialization loop.
A new argument is added to slab_free_hook() that indicates whether to
initialize the memory or not.
To avoid discrepancies with which memory gets initialized that can be
caused by future changes, both KASAN hook and initialization memset() are
put together and a warning comment is added.
Combining setting allocation tags with memory initialization improves
HW_TAGS KASAN performance when init_on_free is enabled.
Link: https://lkml.kernel.org/r/190fd15c1886654afdec0d19ebebd5ade665b601.1615296150.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kasan, mm: integrate slab init_on_alloc with HW_TAGS
This change uses the previously added memory initialization feature of
HW_TAGS KASAN routines for slab memory when init_on_alloc is enabled.
With this change, memory initialization memset() is no longer called when
both HW_TAGS KASAN and init_on_alloc are enabled. Instead, memory is
initialized in KASAN runtime.
The memory initialization memset() is moved into slab_post_alloc_hook()
that currently directly follows the initialization loop. A new argument
is added to slab_post_alloc_hook() that indicates whether to initialize
the memory or not.
To avoid discrepancies with which memory gets initialized that can be
caused by future changes, both KASAN hook and initialization memset() are
put together and a warning comment is added.
Combining setting allocation tags with memory initialization improves
HW_TAGS KASAN performance when init_on_alloc is enabled.
Link: https://lkml.kernel.org/r/c1292aeb5d519da221ec74a0684a949b027d7720.1615296150.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This change uses the previously added memory initialization feature of
HW_TAGS KASAN routines for page_alloc memory when init_on_alloc/free is
enabled.
With this change, kernel_init_free_pages() is no longer called when both
HW_TAGS KASAN and init_on_alloc/free are enabled. Instead, memory is
initialized in KASAN runtime.
To avoid discrepancies with which memory gets initialized that can be
caused by future changes, both KASAN and kernel_init_free_pages() hooks
are put together and a warning comment is added.
This patch changes the order in which memory initialization and page
poisoning hooks are called. This doesn't lead to any side-effects, as
whenever page poisoning is enabled, memory initialization gets disabled.
Combining setting allocation tags with memory initialization improves
HW_TAGS KASAN performance when init_on_alloc/free is enabled.
[andreyknvl@google.com: fix for "integrate page_alloc init with HW_TAGS"] Link: https://lkml.kernel.org/r/65b6028dea2e9a6e8e2cb779b5115c09457363fc.1617122211.git.andreyknvl@google.com Link: https://lkml.kernel.org/r/e77f0d5b1b20658ef0b8288625c74c2b3690e725.1615296150.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Tested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Sergei Trofimovich <slyfox@gentoo.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
arm64: kasan: allow to init memory when setting tags
Patch series "kasan: integrate with init_on_alloc/free", v3.
This patch series integrates HW_TAGS KASAN with init_on_alloc/free by
initializing memory via the same arm64 instruction that sets memory tags.
This is expected to improve HW_TAGS KASAN performance when
init_on_alloc/free is enabled. The exact perfomance numbers are unknown
as MTE-enabled hardware doesn't exist yet.
This patch (of 5):
This change adds an argument to mte_set_mem_tag_range() that allows to
enable memory initialization when settinh the allocation tags. The
implementation uses stzg instruction instead of stg when this argument
indicates to initialize memory.
Combining setting allocation tags with memory initialization will improve
HW_TAGS KASAN performance when init_on_alloc/free is enabled.
This change doesn't integrate memory initialization with KASAN, this is
done is subsequent patches in this series.
Link: https://lkml.kernel.org/r/cover.1615296150.git.andreyknvl@google.com Link: https://lkml.kernel.org/r/d04ae90cc36be3fe246ea8025e5085495681c3d7.1615296150.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm, kasan: don't poison boot memory with tag-based modes
During boot, all non-reserved memblock memory is exposed to page_alloc via
memblock_free_pages->__free_pages_core(). This results in
kasan_free_pages() being called, which poisons that memory.
Poisoning all that memory lengthens boot time. The most noticeable effect
is observed with the HW_TAGS mode. A boot-time impact may potentially
also affect systems with large amount of RAM.
This patch changes the tag-based modes to not poison the memory during the
memblock->page_alloc transition.
An exception is made for KASAN_GENERIC. Since it marks all new memory as
accessible, not poisoning the memory released from memblock will lead to
KASAN missing invalid boot-time accesses to that memory.
With KASAN_SW_TAGS, as it uses the invalid 0xFE tag as the default tag for
all memory, it won't miss bad boot-time accesses even if the poisoning of
memblock memory is removed.
With KASAN_HW_TAGS, the default memory tags values are unspecified.
Therefore, if memblock poisoning is removed, this KASAN mode will miss the
mentioned type of boot-time bugs with a 1/16 probability. This is taken
as an acceptable trafe-off.
Internally, the poisoning is removed as follows. __free_pages_core() is
used when exposing fresh memory during system boot and when onlining
memory during hotplug. This patch adds a new FPI_SKIP_KASAN_POISON flag
and passes it to __free_pages_ok() through free_pages_prepare() from
__free_pages_core(). If FPI_SKIP_KASAN_POISON is set, kasan_free_pages()
is not called.
All memory allocated normally when the boot is over keeps getting poisoned
as usual.
Link: https://lkml.kernel.org/r/a0570dc1e3a8f39a55aa343a1fc08cd5c2d4cad6.1613692950.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kasan: initialize shadow to TAG_INVALID for SW_TAGS
Currently, KASAN_SW_TAGS uses 0xFF as the default tag value for
unallocated memory. The underlying idea is that since that memory hasn't
been allocated yet, it's only supposed to be dereferenced through a
pointer with the native 0xFF tag.
While this is a good idea in terms on consistency, practically it doesn't
bring any benefit. Since the 0xFF pointer tag is a match-all tag, it
doesn't matter what tag the accessed memory has. No accesses through
0xFF-tagged pointers are considered buggy by KASAN.
This patch changes the default tag value for unallocated memory to 0xFE,
which is the tag KASAN uses for inaccessible memory. This doesn't affect
accesses through 0xFF-tagged pointer to this memory, but this allows KASAN
to detect wild and large out-of-bounds invalid memory accesses through
otherwise-tagged pointers.
This is a prepatory patch for the next one, which changes the tag-based
KASAN modes to not poison the boot memory.
Link: https://lkml.kernel.org/r/c8e93571c18b3528aac5eb33ade213bf133d10ad.1613692950.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Marco Elver <elver@google.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kasan: fix kasan_byte_accessible() to be consistent with actual checks
We can sometimes end up with kasan_byte_accessible() being called on
non-slab memory. For example ksize() and krealloc() may end up calling it
on KFENCE allocated memory. In this case the memory will be tagged with
KASAN_SHADOW_INIT, which a subsequent patch ("kasan: initialize shadow to
TAG_INVALID for SW_TAGS") will set to the same value as KASAN_TAG_INVALID,
causing kasan_byte_accessible() to fail when called on non-slab memory.
This highlighted the fact that the check in kasan_byte_accessible() was
inconsistent with checks as implemented for loads and stores
(kasan_check_range() in SW tags mode and hardware-implemented checks in HW
tags mode). kasan_check_range() does not have a check for
KASAN_TAG_INVALID, and instead has a comparison against
KASAN_SHADOW_START. In HW tags mode, we do not have either, but we do set
TCR_EL1.TCMA which corresponds with the comparison against
KASAN_TAG_KERNEL.
Therefore, update kasan_byte_accessible() for both SW and HW tags modes to
correspond with the respective checks on loads and stores.
Link: https://linux-review.googlesource.com/id/Ic6d40803c57dcc6331bd97fbb9a60b0d38a65a36 Link: https://lkml.kernel.org/r/20210405220647.1965262-1-pcc@google.com Signed-off-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zhiyuan Dai [Fri, 30 Apr 2021 05:59:43 +0000 (22:59 -0700)]
mm/kasan: switch from strlcpy to strscpy
strlcpy is marked as deprecated in Documentation/process/deprecated.rst,
and there is no functional difference when the caller expects truncation
(when not checking the return value). strscpy is relatively better as it
also avoids scanning the whole source string.
Link: https://lkml.kernel.org/r/1613970647-23272-1-git-send-email-daizhiyuan@phytium.com.cn Signed-off-by: Zhiyuan Dai <daizhiyuan@phytium.com.cn> Acked-by: Alexander Potapenko <glider@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The script './scripts/kernel-doc -none ./include/linux/pagewalk.h' reports:
include/linux/pagewalk.h:37: warning: cannot understand function prototype: 'struct mm_walk_ops '
include/linux/pagewalk.h:85: warning: cannot understand function prototype: 'struct mm_walk '
A kernel-doc description for a structure requires to prefix the struct
name with the keyword 'struct'. So, do that such that no further
kernel-doc warnings are reported for this file.
Link: https://lkml.kernel.org/r/20210322122542.15072-3-lukas.bulwahn@gmail.com Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ralf Ramsauer <ralf.ramsauer@oth-regensburg.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MAINTAINERS: assign pagewalk.h to MEMORY MANAGEMENT
Patch series "kernel-doc and MAINTAINERS clean-up".
Roughly 900 warnings of about 21.000 kernel-doc warnings in the kernel
tree warn with 'cannot understand function prototype:', i.e., the
kernel-doc parser cannot parse the function's signature. The majority,
about 600 cases of those, are just struct definitions following the
kernel-doc description. Further, spot-check investigations suggest that
the authors of the specific kernel-doc descriptions simply were not
aware that the general format for a kernel-doc description for a
structure requires to prefix the struct name with the keyword 'struct',
as in 'struct struct_name - Brief description.'. Details on kernel-doc
are at the Link below.
Without the struct keyword, kernel-doc does not check if the kernel-doc
description fits to the actual struct definition in the source code.
Fortunately, in roughly a quarter of these cases, the kernel-doc
description is actually complete wrt. its corresponding struct
definition. So, the trivial change adding the struct keyword will allow
us to keep the kernel-doc descriptions more consistent for future
changes, by checking for new kernel-doc warnings.
Also, some of the files in ./include/ are not assigned to a specific
MAINTAINERS section and hence have no dedicated maintainer. So, if
needed, the files in ./include/ are also assigned to the fitting
MAINTAINERS section, as I need to identify whom to send the clean-up
patch anyway.
Here is the change from this kernel-doc janitorial work in the
./include/ directory for MEMORY MANAGEMENT.
This patch (of 2):
Commit 2eb28cffa16f ("mm: split out a new pagewalk.h header from mm.h")
adds a new file in ./include/linux, but misses to update MAINTAINERS
accordingly. Hence,
Instead of keeping open-coded style, move the code related to preloading
into a separate function. Therefore introduce the preload_this_cpu_lock()
routine that prelaods a current CPU with one extra vmap_area object.
There is no functional change as a result of this patch.
Link: https://lkml.kernel.org/r/20210402202237.20334-4-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vm/test_vmalloc.sh: adapt for updated driver interface
A 'single_cpu_test' parameter is odd and it does not exist anymore.
Instead there was introduced a 'nr_threads' one. If it is not set it
behaves as the former parameter.
That is why update a "stress mode" according to this change specifying
number of workers which are equal to number of CPUs. Also update an
output of help message based on a new interface.
Link: https://lkml.kernel.org/r/20210402202237.20334-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lib/test_vmalloc.c: add a new 'nr_threads' parameter
By using this parameter we can specify how many workers are created to
perform vmalloc tests. By default it is one CPU. The maximum value is
set to 1024.
As a result of this change a 'single_cpu_test' one becomes obsolete,
therefore it is no longer needed.
[urezki@gmail.com: extend max value of nr_threads parameter] Link: https://lkml.kernel.org/r/20210406124536.19658-1-urezki@gmail.com Link: https://lkml.kernel.org/r/20210402202237.20334-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove two test cases related to kvfree_rcu() and SLAB. Those are
considered as redundant now, because similar test functionality has
recently been introduced in the "rcuscale" RCU test-suite.
Link: https://lkml.kernel.org/r/20210402202237.20334-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here Process 1 is in purge path and it does list_del_rcu on vmap_block and
later frees the vmap_area, since Process 2 was holding the rcu lock at
this time vmap_block will still be present in and Process 2 accesse it and
thereby it tries to access vmap_area of that vmap_block which was already
freed by Process 1 and this results in use after free.
Fix this by adding a check for vb->dirty before accessing vmap_area
structure since vb->dirty will be set to VMAP_BBMAP_BITS in purge path
checking for this will prevent the use after free.
There are several reasons why a vmalloc can fail, virtual space exhausted,
page array allocation failure, page allocation failure, and kernel page
table allocation failure.
Add distinct warning messages for the main causes of failure, with some
added information like page order or allocation size where applicable.
[urezki@gmail.com: print correct vmalloc allocation size] Link: https://lkml.kernel.org/r/20210329193214.GA28602@pc638.lan Link: https://lkml.kernel.org/r/20210322021806.892164-6-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Cédric Le Goater <clg@kaod.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>