Nick Desaulniers [Wed, 10 Aug 2022 22:24:40 +0000 (15:24 -0700)]
Makefile: link with -z noexecstack --no-warn-rwx-segments
Users of GNU ld (BFD) from binutils 2.39+ will observe multiple
instances of a new warning when linking kernels in the form:
ld: warning: vmlinux: missing .note.GNU-stack section implies executable stack
ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
ld: warning: vmlinux has a LOAD segment with RWX permissions
Generally, we would like to avoid the stack being executable. Because
there could be a need for the stack to be executable, assembler sources
have to opt-in to this security feature via explicit creation of the
.note.GNU-stack feature (which compilers create by default) or command
line flag --noexecstack. Or we can simply tell the linker the
production of such sections is irrelevant and to link the stack as
--noexecstack.
LLVM's LLD linker defaults to -z noexecstack, so this flag isn't
strictly necessary when linking with LLD, only BFD, but it doesn't hurt
to be explicit here for all linkers IMO. --no-warn-rwx-segments is
currently BFD specific and only available in the current latest release,
so it's wrapped in an ld-option check.
While the kernel makes extensive usage of ELF sections, it doesn't use
permissions from ELF segments.
It turns out that gcc-12.1 has some nasty problems with register
allocation on a 32-bit x86 build for the 64-bit values used in the
generic blake2b implementation, where the pattern of 64-bit rotates and
xor operations ends up making gcc generate horrible code.
As a result it ends up with a ridiculously large stack frame for all the
spills it generates, resulting in the following build problem:
crypto/blake2b_generic.c: In function ‘blake2b_compress_one_generic’:
crypto/blake2b_generic.c:109:1: error: the frame size of 2640 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]
on the same test-case, clang ends up generating a stack frame that is
just 296 bytes (and older gcc versions generate a slightly bigger one at
428 bytes - still nowhere near that almost 3kB monster stack frame of
gcc-12.1).
The issue is fixed both in mainline and the GCC 12 release branch [1],
but current release compilers end up failing the i386 allmodconfig build
due to this issue.
Disable the warning for now by simply raising the frame size for this
one file, just to keep this issue from having people turn off WERROR.
Linus Torvalds [Wed, 10 Aug 2022 21:04:32 +0000 (14:04 -0700)]
Merge tag 'nfs-for-5.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- pNFS/flexfiles: Fix infinite looping when the RDMA connection
errors out
Bugfixes:
- NFS: fix port value parsing
- SUNRPC: Reinitialise the backchannel request buffers before reuse
- SUNRPC: fix expiry of auth creds
- NFSv4: Fix races in the legacy idmapper upcall
- NFS: O_DIRECT fixes from Jeff Layton
- NFSv4.1: Fix OP_SEQUENCE error handling
- SUNRPC: Fix an RPC/RDMA performance regression
- NFS: Fix case insensitive renames
- NFSv4/pnfs: Fix a use-after-free bug in open
- NFSv4.1: RECLAIM_COMPLETE must handle EACCES
Features:
- NFSv4.1: session trunking enhancements
- NFSv4.2: READ_PLUS performance optimisations
- NFS: relax the rules for rsize/wsize mount options
- NFS: don't unhash dentry during unlink/rename
- SUNRPC: Fail faster on bad verifier
- NFS/SUNRPC: Various tracing improvements"
* tag 'nfs-for-5.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (46 commits)
NFS: Improve readpage/writepage tracing
NFS: Improve O_DIRECT tracing
NFS: Improve write error tracing
NFS: don't unhash dentry during unlink/rename
NFSv4/pnfs: Fix a use-after-free bug in open
NFS: nfs_async_write_reschedule_io must not recurse into the writeback code
SUNRPC: Don't reuse bvec on retransmission of the request
SUNRPC: Reinitialise the backchannel request buffers before reuse
NFSv4.1: RECLAIM_COMPLETE must handle EACCES
NFSv4.1 probe offline transports for trunking on session creation
SUNRPC create a function that probes only offline transports
SUNRPC export xprt_iter_rewind function
SUNRPC restructure rpc_clnt_setup_test_and_add_xprt
NFSv4.1 remove xprt from xprt_switch if session trunking test fails
SUNRPC create an rpc function that allows xprt removal from rpc_clnt
SUNRPC enable back offline transports in trunking discovery
SUNRPC create an iterator to list only OFFLINE xprts
NFSv4.1 offline trunkable transports on DESTROY_SESSION
SUNRPC add function to offline remove trunkable transports
SUNRPC expose functions for offline remote xprt functionality
...
Linus Torvalds [Wed, 10 Aug 2022 18:30:16 +0000 (11:30 -0700)]
Merge tag 'hwmon-fixes-for-v6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
"Fix two regressions in nct6775 and lm90 drivers"
* tag 'hwmon-fixes-for-v6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (nct6775) Fix platform driver suspend regression
hwmon: (lm90) Fix error return value from detect function
Linus Torvalds [Wed, 10 Aug 2022 18:18:00 +0000 (11:18 -0700)]
Merge tag 'mm-stable-2022-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull remaining MM updates from Andrew Morton:
"Three patch series - two that perform cleanups and one feature:
- hugetlb_vmemmap cleanups from Muchun Song
- hardware poisoning support for 1GB hugepages, from Naoya Horiguchi
- highmem documentation fixups from Fabio De Francesco"
* tag 'mm-stable-2022-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (23 commits)
Documentation/mm: add details about kmap_local_page() and preemption
highmem: delete a sentence from kmap_local_page() kdocs
Documentation/mm: rrefer kmap_local_page() and avoid kmap()
Documentation/mm: avoid invalid use of addresses from kmap_local_page()
Documentation/mm: don't kmap*() pages which can't come from HIGHMEM
highmem: specify that kmap_local_page() is callable from interrupts
highmem: remove unneeded spaces in kmap_local_page() kdocs
mm, hwpoison: enable memory error handling on 1GB hugepage
mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage
mm, hwpoison: make __page_handle_poison returns int
mm, hwpoison: set PG_hwpoison for busy hugetlb pages
mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage
mm, hwpoison, hugetlb: support saving mechanism of raw error pages
mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry
mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages()
mm: hugetlb_vmemmap: use PTRS_PER_PTE instead of PMD_SIZE / PAGE_SIZE
mm: hugetlb_vmemmap: move code comments to vmemmap_dedup.rst
mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability
mm: hugetlb_vmemmap: replace early_param() with core_param()
mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c
...
Linus Torvalds [Wed, 10 Aug 2022 18:07:26 +0000 (11:07 -0700)]
Merge tag 'cxl-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull cxl updates from Dan Williams:
"Compute Express Link (CXL) updates for 6.0:
- Introduce a 'struct cxl_region' object with support for
provisioning and assembling persistent memory regions.
- Introduce alloc_free_mem_region() to accompany the existing
request_free_mem_region() as a method to allocate physical memory
capacity out of an existing resource.
- Export insert_resource_expand_to_fit() for the CXL subsystem to
late-publish CXL platform windows in iomem_resource.
- Add a polled mode PCI DOE (Data Object Exchange) driver service and
use it in cxl_pci to retrieve the CDAT (Coherent Device Attribute
Table)"
* tag 'cxl-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (74 commits)
cxl/hdm: Fix skip allocations vs multiple pmem allocations
cxl/region: Disallow region granularity != window granularity
cxl/region: Fix x1 interleave to greater than x1 interleave routing
cxl/region: Move HPA setup to cxl_region_attach()
cxl/region: Fix decoder interleave programming
Documentation: cxl: remove dangling kernel-doc reference
cxl/region: describe targets and nr_targets members of cxl_region_params
cxl/regions: add padding for cxl_rr_ep_add nested lists
cxl/region: Fix IS_ERR() vs NULL check
cxl/region: Fix region reference target accounting
cxl/region: Fix region commit uninitialized variable warning
cxl/region: Fix port setup uninitialized variable warnings
cxl/region: Stop initializing interleave granularity
cxl/hdm: Fix DPA reservation vs cxl_endpoint_decoder lifetime
cxl/acpi: Minimize granularity for x1 interleaves
cxl/region: Delete 'region' attribute from root decoders
cxl/acpi: Autoload driver for 'cxl_acpi' test devices
cxl/region: decrement ->nr_targets on error in cxl_region_attach()
cxl/region: prevent underflow in ways_to_cxl()
cxl/region: uninitialized variable in alloc_hpa()
...
Linus Torvalds [Wed, 10 Aug 2022 17:53:22 +0000 (10:53 -0700)]
Merge tag 'apparmor-pr-2022-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor
Pull AppArmor updates from John Johansen:
"This is mostly cleanups and bug fixes with the one bigger change being
Mathew Wilcox's patch to use XArrays instead of the IDR from the
thread around the locking weirdness.
Features:
- Convert secid mapping to XArrays instead of IDR
- Add a kernel label to use on kernel objects
- Extend policydb permission set by making use of the xbits
- Make export of raw binary profile to userspace optional
- Enable tuning of policy paranoid load for embedded systems
- Don't create raw_sha1 symlink if sha1 hashing is disabled
- Allow labels to carry debug flags
Cleanups:
- Update MAINTAINERS file
- Use struct_size() helper in kmalloc()
- Move ptrace mediation to more logical task.{h,c}
- Resolve uninitialized symbol warnings
- Remove redundant ret variable
- Mark alloc_unconfined() as static
- Update help description of policy hash for introspection
- Remove some casts which are no-longer required
Bug Fixes:
- Fix aa_label_asxprint return check
- Fix reference count leak in aa_pivotroot()
- Fix memleak in aa_simple_write_to_buffer()
- Fix kernel doc comments
- Fix absroot causing audited secids to begin with =
- Fix quiet_denied for file rules
- Fix failed mount permission check error message
- Disable showing the mode as part of a secid to secctx
- Fix setting unconfined mode on a loaded profile
- Fix overlapping attachment computation
- Fix undefined reference to `zlib_deflate_workspacesize'"
* tag 'apparmor-pr-2022-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor: (34 commits)
apparmor: Update MAINTAINERS file with new email address
apparmor: correct config reference to intended one
apparmor: move ptrace mediation to more logical task.{h,c}
apparmor: extend policydb permission set by making use of the xbits
apparmor: allow label to carry debug flags
apparmor: fix overlapping attachment computation
apparmor: fix setting unconfined mode on a loaded profile
apparmor: Fix some kernel-doc comments
apparmor: Mark alloc_unconfined() as static
apparmor: disable showing the mode as part of a secid to secctx
apparmor: Convert secid mapping to XArrays instead of IDR
apparmor: add a kernel label to use on kernel objects
apparmor: test: Remove some casts which are no-longer required
apparmor: Fix memleak in aa_simple_write_to_buffer()
apparmor: fix reference count leak in aa_pivotroot()
apparmor: Fix some kernel-doc comments
apparmor: Fix undefined reference to `zlib_deflate_workspacesize'
apparmor: fix aa_label_asxprint return check
apparmor: Fix some kernel-doc comments
apparmor: Fix some kernel-doc comments
...
Linus Torvalds [Wed, 10 Aug 2022 17:40:41 +0000 (10:40 -0700)]
Merge tag 'kbuild-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Remove the support for -O3 (CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3)
- Fix error of rpm-pkg cross-builds
- Support riscv for checkstack tool
- Re-enable -Wformwat warnings for Clang
- Clean up modpost, Makefiles, and misc scripts
* tag 'kbuild-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (30 commits)
modpost: remove .symbol_white_list field entirely
modpost: remove unneeded .symbol_white_list initializers
modpost: add PATTERNS() helper macro
modpost: shorten warning messages in report_sec_mismatch()
Revert "Kbuild, lto, workaround: Don't warn for initcall_reference in modpost"
modpost: use more reliable way to get fromsec in section_rel(a)()
modpost: add array range check to sec_name()
modpost: refactor get_secindex()
kbuild: set EXIT trap before creating temporary directory
modpost: remove unused Elf_Sword macro
Makefile.extrawarn: re-enable -Wformat for clang
kbuild: add dtbs_prepare target
kconfig: Qt5: tell the user which packages are required
modpost: use sym_get_data() to get module device_table data
modpost: drop executable ELF support
checkstack: add riscv support for scripts/checkstack.pl
kconfig: shorten the temporary directory name for cc-option
scripts: headers_install.sh: Update config leak ignore entries
kbuild: error out if $(INSTALL_MOD_PATH) contains % or :
kbuild: error out if $(KBUILD_EXTMOD) contains % or :
...
Commit 07972dc0c896 ("hwmon: (nct6775) Split core and platform
driver") introduced a slight change in nct6775_suspend() in order to
avoid an otherwise-needless symbol export for nct6775_update_device(),
replacing a call to that function with a simple dev_get_drvdata()
instead.
As it turns out, there is no guarantee that nct6775_update_device()
is ever called prior to suspend. If this happens, the resume function
ends up writing bad data into the various chip registers, which results
in a crash shortly after resume.
To fix the problem, just add the symbol export and return to using
nct6775_update_device() as was employed previously.
Guenter Roeck [Mon, 8 Aug 2022 09:48:21 +0000 (02:48 -0700)]
hwmon: (lm90) Fix error return value from detect function
lm90_detect_nuvoton() is supposed to return NULL if it can not detect
a chip, or a pointer to the chip name if it does. Under some circumstances
it returns an error pointer instead. Some versions of gcc interpret an
ERR_PTR as region of size 0 and generate an error message.
In function ‘__fortify_strlen’,
inlined from ‘strlcpy’ at ./include/linux/fortify-string.h:159:10,
inlined from ‘lm90_detect’ at drivers/hwmon/lm90.c:2550:2:
./include/linux/fortify-string.h:50:33: error:
‘__builtin_strlen’ reading 1 or more bytes from a region of size 0
50 | #define __underlying_strlen __builtin_strlen
| ^
./include/linux/fortify-string.h:141:24: note:
in expansion of macro ‘__underlying_strlen’
141 | return __underlying_strlen(p);
| ^~~~~~~~~~~~~~~~~~~
Returning NULL instead of ERR_PTR() fixes the problem.
Neither wait_on_buffer nor buffer_uptodate contain any memory barrier.
Consequently, if someone calls sb_bread and then reads the buffer data,
the read of buffer data may be executed before wait_on_buffer(bh) on
architectures with weak memory ordering and it may return invalid data.
Fix this bug by adding a memory barrier to set_buffer_uptodate and an
acquire barrier to buffer_uptodate (in a similar way as
folio_test_uptodate and folio_mark_uptodate).
Linus Torvalds [Tue, 9 Aug 2022 21:56:49 +0000 (14:56 -0700)]
Merge tag 'nfsd-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"Work on 'courteous server', which was introduced in 5.19, continues
apace. This release introduces a more flexible limit on the number of
NFSv4 clients that NFSD allows, now that NFSv4 clients can remain in
courtesy state long after the lease expiration timeout. The client
limit is adjusted based on the physical memory size of the server.
The NFSD filecache is a cache of files held open by NFSv4 clients or
recently touched by NFSv2 or NFSv3 clients. This cache had some
significant scalability constraints that have been relieved in this
release. Thanks to all who contributed to this work.
A data corruption bug found during the most recent NFS bake-a-thon
that involves NFSv3 and NFSv4 clients writing the same file has been
addressed in this release.
This release includes several improvements in CPU scalability for
NFSv4 operations. In addition, Neil Brown provided patches that
simplify locking during file lookup, creation, rename, and removal
that enables subsequent work on making these operations more scalable.
We expect to see that work materialize in the next release.
There are also numerous single-patch fixes, clean-ups, and the usual
improvements in observability"
* tag 'nfsd-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (78 commits)
lockd: detect and reject lock arguments that overflow
NFSD: discard fh_locked flag and fh_lock/fh_unlock
NFSD: use (un)lock_inode instead of fh_(un)lock for file operations
NFSD: use explicit lock/unlock for directory ops
NFSD: reduce locking in nfsd_lookup()
NFSD: only call fh_unlock() once in nfsd_link()
NFSD: always drop directory lock in nfsd_unlink()
NFSD: change nfsd_create()/nfsd_symlink() to unlock directory before returning.
NFSD: add posix ACLs to struct nfsd_attrs
NFSD: add security label to struct nfsd_attrs
NFSD: set attributes when creating symlinks
NFSD: introduce struct nfsd_attrs
NFSD: verify the opened dentry after setting a delegation
NFSD: drop fh argument from alloc_init_deleg
NFSD: Move copy offload callback arguments into a separate structure
NFSD: Add nfsd4_send_cb_offload()
NFSD: Remove kmalloc from nfsd4_do_async_copy()
NFSD: Refactor nfsd4_do_copy()
NFSD: Refactor nfsd4_cleanup_inter_ssc() (2/2)
NFSD: Refactor nfsd4_cleanup_inter_ssc() (1/2)
...
Replace existing limited example with proper code for Qualcomm Resource
Power Manager (RPM) over SMD based on MSM8916. This also fixes the
example's indentation.
The child node of smd is an SMD edge representing remote subsystem.
Bring back missing reference from previously sent patch (disappeared
when applying).
Trond Myklebust [Tue, 9 Aug 2022 16:50:28 +0000 (12:50 -0400)]
NFS: Improve write error tracing
Don't leak request pointers, but use the "device:inode" labelling that
is used by all the other trace points. Furthermore, replace use of page
indexes with an offset, again in order to align behaviour with other
NFS trace points.
Linus Torvalds [Tue, 9 Aug 2022 17:11:56 +0000 (10:11 -0700)]
Merge tag 'fscache-fixes-20220809' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull fscache updates from David Howells:
- Fix a cookie access ref leak if a cookie is invalidated a second time
before the first invalidation is actually processed.
- Add a tracepoint to log cookie lookup failure
* tag 'fscache-fixes-20220809' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
fscache: add tracepoint when failing cookie
fscache: don't leak cookie access refs if invalidation is in progress or failed
Linus Torvalds [Tue, 9 Aug 2022 17:08:08 +0000 (10:08 -0700)]
Merge tag 'afs-fixes-20220802' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
"Fix AFS refcount handling.
The first patch converts afs to use refcount_t for its refcounts and
the second patch fixes afs_put_call() and afs_put_server() to save the
values they're going to log in the tracepoint before decrementing the
refcount"
* tag 'afs-fixes-20220802' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix access after dec in put functions
afs: Use refcount_t rather than atomic_t
Linus Torvalds [Tue, 9 Aug 2022 16:52:28 +0000 (09:52 -0700)]
Merge tag 'fs.setgid.v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull setgid updates from Christian Brauner:
"This contains the work to move setgid stripping out of individual
filesystems and into the VFS itself.
Creating files that have both the S_IXGRP and S_ISGID bit raised in
directories that themselves have the S_ISGID bit set requires
additional privileges to avoid security issues.
When a filesystem creates a new inode it needs to take care that the
caller is either in the group of the newly created inode or they have
CAP_FSETID in their current user namespace and are privileged over the
parent directory of the new inode. If any of these two conditions is
true then the S_ISGID bit can be raised for an S_IXGRP file and if not
it needs to be stripped.
However, there are several key issues with the current implementation:
- S_ISGID stripping logic is entangled with umask stripping.
For example, if the umask removes the S_IXGRP bit from the file
about to be created then the S_ISGID bit will be kept.
The inode_init_owner() helper is responsible for S_ISGID stripping
and is called before posix_acl_create(). So we can end up with two
different orderings:
1. FS without POSIX ACL support
First strip umask then strip S_ISGID in inode_init_owner().
In other words, if a filesystem doesn't support or enable POSIX
ACLs then umask stripping is done directly in the vfs before
calling into the filesystem:
2. FS with POSIX ACL support
First strip S_ISGID in inode_init_owner() then strip umask in
posix_acl_create().
In other words, if the filesystem does support POSIX ACLs then
unmask stripping may be done in the filesystem itself when
calling posix_acl_create().
Note that technically filesystems are free to impose their own
ordering between posix_acl_create() and inode_init_owner() meaning
that there's additional ordering issues that influence S_ISGID
inheritance.
(Note that the commit message of commit ec8ef3684e01 ("fs: move
S_ISGID stripping into the vfs_*() helpers") gets the ordering
between inode_init_owner() and posix_acl_create() the wrong way
around. I realized this too late.)
- Filesystems that don't rely on inode_init_owner() don't get S_ISGID
stripping logic.
While that may be intentional (e.g. network filesystems might just
defer setgid stripping to a server) it is often just a security
issue.
Note that mandating the use of inode_init_owner() was proposed as
an alternative solution but that wouldn't fix the ordering issues
and there are examples such as afs where the use of
inode_init_owner() isn't possible.
In any case, we should also try the cleaner and generalized
solution first before resorting to this approach.
- We still have S_ISGID inheritance bugs years after the initial
round of S_ISGID inheritance fixes:
a5e8fb252351 ("xfs: use setattr_copy to set vfs inode attributes") 1db3fee044c4 ("xfs: fix up non-directory creation in SGID directories") 3725be669fb9 ("ceph: fix up non-directory creation in SGID directories")
All of this led us to conclude that the current state is too messy.
While we won't be able to make it completely clean as
posix_acl_create() is still a filesystem specific call we can improve
the S_SIGD stripping situation quite a bit by hoisting it out of
inode_init_owner() and into the respective vfs creation operations.
The obvious advantage is that we don't need to rely on individual
filesystems getting S_ISGID stripping right and instead can
standardize the ordering between S_ISGID and umask stripping directly
in the VFS.
A few short implementation notes:
- The stripping logic needs to happen in vfs_*() helpers for the sake
of stacking filesystems such as overlayfs that rely on these
helpers taking care of S_ISGID stripping.
- Security hooks have never seen the mode as it is ultimately seen by
the filesystem because of the ordering issue we mentioned. Nothing
is changed for them. We simply continue to strip the umask before
passing the mode down to the security hooks.
- The following filesystems use inode_init_owner() and thus relied on
S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs,
hfsplus, hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs,
overlayfs, ramfs, reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs,
bpf, tmpfs.
We've audited all callchains as best as we could. More details can
be found in the commit message to ec8ef3684e01 ("fs: move S_ISGID
stripping into the vfs_*() helpers")"
* tag 'fs.setgid.v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
ceph: rely on vfs for setgid stripping
fs: move S_ISGID stripping into the vfs_*() helpers
fs: Add missing umask strip in vfs_tmpfile
fs: add mode_strip_sgid() helper
Linus Torvalds [Tue, 9 Aug 2022 16:48:30 +0000 (09:48 -0700)]
Merge tag 'memblock-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
Pull memblock updates from Mike Rapoport:
- An optimization in memblock_add_range() to reduce array traversals
- Improvements to the memblock test suite
* tag 'memblock-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
memblock test: Modify the obsolete description in README
memblock tests: fix compilation errors
memblock tests: change build options to run-time options
memblock tests: remove completed TODO items
memblock tests: set memblock_debug to enable memblock_dbg() messages
memblock tests: add verbose output to memblock tests
memblock tests: Makefile: add arguments to control verbosity
memblock: avoid some repeat when add new range
Linus Torvalds [Tue, 9 Aug 2022 16:39:25 +0000 (09:39 -0700)]
Merge tag 'm68knommu-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu
Pull m68knommu fixes from Greg Ungerer:
- spelling in comment
- compilation when flexcan driver enabled
- sparse warning
* tag 'm68knommu-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
m68k: Fix syntax errors in comments
m68k: coldfire: make symbol m523x_clk_lookup static
m68k: coldfire/device.c: protect FLEXCAN blocks
Linus Torvalds [Tue, 9 Aug 2022 16:29:07 +0000 (09:29 -0700)]
Merge tag 'x86_bugs_pbrsb' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 eIBRS fixes from Borislav Petkov:
"More from the CPU vulnerability nightmares front:
Intel eIBRS machines do not sufficiently mitigate against RET
mispredictions when doing a VM Exit therefore an additional RSB,
one-entry stuffing is needed"
* tag 'x86_bugs_pbrsb' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation: Add LFENCE to RSB fill sequence
x86/speculation: Add RSB VM Exit protections
Jeff Layton [Fri, 5 Aug 2022 10:42:45 +0000 (06:42 -0400)]
fscache: don't leak cookie access refs if invalidation is in progress or failed
It's possible for a request to invalidate a fscache_cookie will come in
while we're already processing an invalidation. If that happens we
currently take an extra access reference that will leak. Only call
__fscache_begin_cookie_access if the FSCACHE_COOKIE_DO_INVALIDATE bit
was previously clear.
Also, ensure that we attempt to clear the bit when the cookie is
"FAILED" and put the reference to avoid an access leak.
Fixes: 031884e80c18 ("fscache: Fix invalidation/lookup race") Suggested-by: David Howells <dhowells@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com>
Linus Torvalds [Tue, 9 Aug 2022 03:15:13 +0000 (20:15 -0700)]
Merge tag '5.20-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull ksmbd updates from Steve French:
- fixes for memory access bugs (out of bounds access, oops, leak)
- multichannel fixes
- session disconnect performance improvement, and session register
improvement
- cleanup
* tag '5.20-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix heap-based overflow in set_ntacl_dacl()
ksmbd: prevent out of bound read for SMB2_TREE_CONNNECT
ksmbd: prevent out of bound read for SMB2_WRITE
ksmbd: fix use-after-free bug in smb2_tree_disconect
ksmbd: fix memory leak in smb2_handle_negotiate
ksmbd: fix racy issue while destroying session on multichannel
ksmbd: use wait_event instead of schedule_timeout()
ksmbd: fix kernel oops from idr_remove()
ksmbd: add channel rwlock
ksmbd: replace sessions list in connection with xarray
MAINTAINERS: ksmbd: add entry for documentation
ksmbd: remove unused ksmbd_share_configs_cleanup function
Linus Torvalds [Tue, 9 Aug 2022 03:04:35 +0000 (20:04 -0700)]
Merge tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more iov_iter updates from Al Viro:
- more new_sync_{read,write}() speedups - ITER_UBUF introduction
- ITER_PIPE cleanups
- unification of iov_iter_get_pages/iov_iter_get_pages_alloc and
switching them to advancing semantics
- making ITER_PIPE take high-order pages without splitting them
- handling copy_page_from_iter() for high-order pages properly
* tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits)
fix copy_page_from_iter() for compound destinations
hugetlbfs: copy_page_to_iter() can deal with compound pages
copy_page_to_iter(): don't split high-order page in case of ITER_PIPE
expand those iov_iter_advance()...
pipe_get_pages(): switch to append_pipe()
get rid of non-advancing variants
ceph: switch the last caller of iov_iter_get_pages_alloc()
9p: convert to advancing variant of iov_iter_get_pages_alloc()
af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages()
iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()
block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()
iov_iter: saner helper for page array allocation
fold __pipe_get_pages() into pipe_get_pages()
ITER_XARRAY: don't open-code DIV_ROUND_UP()
unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts
unify xarray_get_pages() and xarray_get_pages_alloc()
unify pipe_get_pages() and pipe_get_pages_alloc()
iov_iter_get_pages(): sanity-check arguments
iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper
...
Al Viro [Fri, 17 Jun 2022 18:45:41 +0000 (14:45 -0400)]
iov_iter: saner helper for page array allocation
All call sites of get_pages_array() are essenitally identical now.
Replace with common helper...
Returns number of slots available in resulting array or 0 on OOM;
it's up to the caller to make sure it doesn't ask to zero-entry
array (i.e. neither maxpages nor size are allowed to be zero).
Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Fri, 17 Jun 2022 18:30:39 +0000 (14:30 -0400)]
fold __pipe_get_pages() into pipe_get_pages()
... and don't mangle maxsize there - turn the loop into counting
one instead. Easier to see that we won't run out of array that
way. Note that special treatment of the partial buffer in that
thing is an artifact of the non-advancing semantics of
iov_iter_get_pages() - if not for that, it would be append_pipe(),
same as the body of the loop that follows it. IOW, once we make
iov_iter_get_pages() advancing, the whole thing will turn into
calculate how many pages do we want
allocate an array (if needed)
call append_pipe() that many times.
Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Fri, 17 Jun 2022 17:35:35 +0000 (13:35 -0400)]
unify pipe_get_pages() and pipe_get_pages_alloc()
The differences between those two are
* pipe_get_pages() gets a non-NULL struct page ** value pointing to
preallocated array + array size.
* pipe_get_pages_alloc() gets an address of struct page ** variable that
contains NULL, allocates the array and (on success) stores its address in
that variable.
Not hard to combine - always pass struct page ***, have
the previous pipe_get_pages_alloc() caller pass ~0U as cap for
array size.
Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Wed, 15 Jun 2022 06:02:51 +0000 (02:02 -0400)]
ITER_PIPE: cache the type of last buffer
We often need to find whether the last buffer is anon or not, and
currently it's rather clumsy:
check if ->iov_offset is non-zero (i.e. that pipe is not empty)
if so, get the corresponding pipe_buffer and check its ->ops
if it's &default_pipe_buf_ops, we have an anon buffer.
Let's replace the use of ->iov_offset (which is nowhere near similar to
its role for other flavours) with signed field (->last_offset), with
the following rules:
empty, no buffers occupied: 0
anon, with bytes up to N-1 filled: N
zero-copy, with bytes up to N-1 filled: -N
That way abs(i->last_offset) is equal to what used to be in i->iov_offset
and empty vs. anon vs. zero-copy can be distinguished by the sign of
i->last_offset.
Checks for "should we extend the last buffer or should we start
a new one?" become easier to follow that way.
Note that most of the operations can only be done in a sane
state - i.e. when the pipe has nothing past the current position of
iterator. About the only thing that could be done outside of that
state is iov_iter_advance(), which transitions to the sane state by
truncating the pipe. There are only two cases where we leave the
sane state:
1) iov_iter_get_pages()/iov_iter_get_pages_alloc(). Will be
dealt with later, when we make get_pages advancing - the callers are
actually happier that way.
2) iov_iter copied, then something is put into the copy. Since
they share the underlying pipe, the original gets behind. When we
decide that we are done with the copy (original is not usable until then)
we advance the original. direct_io used to be done that way; nowadays
it operates on the original and we do iov_iter_revert() to discard
the excessive data. At the moment there's nothing in the kernel that
could do that to ITER_PIPE iterators, so this reason for insane state
is theoretical right now.
Al Viro [Sun, 12 Jun 2022 21:54:35 +0000 (17:54 -0400)]
ITER_PIPE: clean iov_iter_revert()
Fold pipe_truncate() into it, clean up. We can release buffers
in the same loop where we walk backwards to the iterator beginning
looking for the place where the new position will be.
Al Viro [Wed, 15 Jun 2022 20:03:25 +0000 (16:03 -0400)]
ITER_PIPE: clean pipe_advance() up
instead of setting ->iov_offset for new position and calling
pipe_truncate() to adjust ->len of the last buffer and discard
everything after it, adjust ->len at the same time we set ->iov_offset
and use pipe_discard_from() to deal with buffers past that.
Al Viro [Sat, 11 Jun 2022 06:52:03 +0000 (02:52 -0400)]
ITER_PIPE: fold push_pipe() into __pipe_get_pages()
Expand the only remaining call of push_pipe() (in
__pipe_get_pages()), combine it with the page-collecting loop there.
Note that the only reason it's not a loop doing append_pipe() is
that append_pipe() is advancing, while iov_iter_get_pages() is not.
As soon as it switches to saner semantics, this thing will switch
to using append_pipe().
Al Viro [Tue, 14 Jun 2022 17:53:53 +0000 (13:53 -0400)]
ITER_PIPE: allocate buffers as we go in copy-to-pipe primitives
New helper: append_pipe(). Extends the last buffer if possible,
allocates a new one otherwise. Returns page and offset in it
on success, NULL on failure. iov_iter is advanced past the
data we've got.
Use that instead of push_pipe() in copy-to-pipe primitives;
they get simpler that way. Handling of short copy (in "mc" one)
is done simply by iov_iter_revert() - iov_iter is in consistent
state after that one, so we can use that.
[Fix for braino caught by Liu Xinpeng <liuxp11@chinatelecom.cn> folded in]
[another braino fix, this time in copy_pipe_to_iter() and pipe_zero();
caught by testcase from Hugh Dickins]
Al Viro [Mon, 13 Jun 2022 18:30:15 +0000 (14:30 -0400)]
ITER_PIPE: helpers for adding pipe buffers
There are only two kinds of pipe_buffer in the area used by ITER_PIPE.
1) anonymous - copy_to_iter() et.al. end up creating those and copying
data there. They have zero ->offset, and their ->ops points to
default_pipe_page_ops.
2) zero-copy ones - those come from copy_page_to_iter(), and page
comes from caller. ->offset is also caller-supplied - it might be
non-zero. ->ops points to page_cache_pipe_buf_ops.
Move creation and insertion of those into helpers - push_anon(pipe, size)
and push_page(pipe, page, offset, size) resp., separating them from
the "could we avoid creating a new buffer by merging with the current
head?" logics.
Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Tue, 14 Jun 2022 14:24:37 +0000 (10:24 -0400)]
ITER_PIPE: helper for getting pipe buffer by index
pipe_buffer instances of a pipe are organized as a ring buffer,
with power-of-2 size. Indices are kept *not* reduced modulo ring
size, so the buffer refered to by index N is
pipe->bufs[N & (pipe->ring_size - 1)].
Ring size can change over the lifetime of a pipe, but not while
the pipe is locked. So for any iov_iter primitives it's a constant.
Original conversion of pipes to this layout went overboard trying
to microoptimize that - calculating pipe->ring_size - 1, storing
it in a local variable and using through the function. In some
cases it might be warranted, but most of the times it only
obfuscates what's going on in there.
Introduce a helper (pipe_buf(pipe, N)) that would encapsulate
that and use it in the obvious cases. More will follow...
Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 12 Jun 2022 20:07:49 +0000 (16:07 -0400)]
splice: stop abusing iov_iter_advance() to flush a pipe
Use pipe_discard_from() explicitly in generic_file_read_iter(); don't bother
with rather non-obvious use of iov_iter_advance() in there.
Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 22 May 2022 18:59:25 +0000 (14:59 -0400)]
new iov_iter flavour - ITER_UBUF
Equivalent of single-segment iovec. Initialized by iov_iter_ubuf(),
checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
ones.
We are going to expose the things like ->write_iter() et.al. to those
in subsequent commits.
New predicate (user_backed_iter()) that is true for ITER_IOVEC and
ITER_UBUF; places like direct-IO handling should use that for
checking that pages we modify after getting them from iov_iter_get_pages()
would need to be dirtied.
DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
will solve all problems - there's code that uses iter_is_iovec() to
decide how to poke around in iov_iter guts and for that the predicate
replacement obviously won't suffice.
Documentation/mm: add details about kmap_local_page() and preemption
What happens if a thread is preempted after mapping pages with
kmap_local_page() was questioned recently.[1]
Commit f4d25aafa8df ("mm/highmem: Provide kmap_local*") from Thomas
Gleixner explains clearly that on context switch, the maps of an outgoing
task are removed and the map of the incoming task are restored and that
kmap_local_page() can be invoked from both preemptible and atomic
contexts.[2]
Therefore, for the purpose to make it clearer that users can call
kmap_local_page() from contexts that allow preemption, rework a couple of
sentences and add further information in highmem.rst.
Link: https://lkml.kernel.org/r/20220728154844.10874-8-fmdefrancesco@gmail.com Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Suggested-by: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Peter Collingbourne <pcc@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
highmem: delete a sentence from kmap_local_page() kdocs
kmap_local_page() should always be preferred in place of kmap() and
kmap_atomic(). "Only use when really necessary." is not consistent with
the Documentation/mm/highmem.rst and these kdocs it embeds.
Therefore, delete the above-mentioned sentence from kdocs.
Link: https://lkml.kernel.org/r/20220728154844.10874-7-fmdefrancesco@gmail.com Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Suggested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Peter Collingbourne <pcc@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Documentation/mm: rrefer kmap_local_page() and avoid kmap()
The reasoning for converting kmap() to kmap_local_page() was questioned
recently.[1]
There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) kmap() also requires global TLB invalidation when
its pool wraps and it might block when the mapping space is fully utilized
until a slot becomes available.
Warn users to avoid the use of kmap() and instead use kmap_local_page(),
by designing their code to map pages in the same context the mapping will
be used.
Link: https://lkml.kernel.org/r/20220728154844.10874-6-fmdefrancesco@gmail.com Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Suggested-by: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Peter Collingbourne <pcc@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Documentation/mm: avoid invalid use of addresses from kmap_local_page()
Users of kmap_local_page() must be absolutely sure to not hand kernel
virtual address obtained calling kmap_local_page() on highmem pages to
other contexts because those pointers are thread local, therefore, they
are no longer valid across different contexts.
Extend the documentation of kmap_local_page() to warn users about the
above-mentioned potential invalid use of pointers returned by
kmap_local_page().
Link: https://lkml.kernel.org/r/20220728154844.10874-5-fmdefrancesco@gmail.com Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Suggested-by: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Peter Collingbourne <pcc@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Documentation/mm: don't kmap*() pages which can't come from HIGHMEM
There is no need to kmap*() pages which are guaranteed to come from
ZONE_NORMAL (or lower). Linux has currently several call sites of
kmap{,_atomic,_local_page}() on pages which are clearly known which can't
come from ZONE_HIGHMEM.
Therefore, add a paragraph to highmem.rst, to explain better that a plain
page_address() may be used for getting the address of pages which cannot
come from ZONE_HIGHMEM, although it is always safe to use
kmap_local_page() / kunmap_local() also on those pages.
Link: https://lkml.kernel.org/r/20220728154844.10874-4-fmdefrancesco@gmail.com Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Suggested-by: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Peter Collingbourne <pcc@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
highmem: specify that kmap_local_page() is callable from interrupts
In a recent thread about converting kmap() to kmap_local_page(), the
safety of calling kmap_local_page() was questioned.[1]
"any context" should probably be enough detail for users who want to know
whether or not kmap_local_page() can be called from interrupts. However,
Linux still has kmap_atomic() which might make users think they must use
the latter in interrupts.
Link: https://lkml.kernel.org/r/20220728154844.10874-3-fmdefrancesco@gmail.com Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Suggested-by: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Peter Collingbourne <pcc@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
highmem: remove unneeded spaces in kmap_local_page() kdocs
Patch series "highmem: Extend kmap_local_page() documentation", v2.
The Highmem interface is evolving and the current documentation does not
reflect the intended uses of each of the calls. Furthermore, after a
recent series of reworks, the differences of the calls can still be
confusing and may lead to the expanded use of calls which are deprecated.
This series is the second round of changes towards an enhanced
documentation of the Highmem's interface; at this stage the patches are
only focused to kmap_local_page().
In addition it also contains some minor clean ups.
This patch (of 7):
In the kdocs of kmap_local_page(), the description of @page starts after
several unnecessary spaces.
Therefore, remove those spaces.
Link: https://lkml.kernel.org/r/20220728154844.10874-1-fmdefrancesco@gmail.com Link: https://lkml.kernel.org/r/20220728154844.10874-2-fmdefrancesco@gmail.com Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Suggested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Peter Collingbourne <pcc@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm, hwpoison: enable memory error handling on 1GB hugepage
Now error handling code is prepared, so remove the blocking code and
enable memory error handling on 1GB hugepage.
Link: https://lkml.kernel.org/r/20220714042420.1847125-9-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: kernel test robot <lkp@intel.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage
Currently if memory_failure() (modified to remove blocking code with
subsequent patch) is called on a page in some 1GB hugepage, memory error
handling fails and the raw error page gets into leaked state. The impact
is small in production systems (just leaked single 4kB page), but this
limits the testability because unpoison doesn't work for it. We can no
longer create 1GB hugepage on the 1GB physical address range with such
leaked pages, that's not useful when testing on small systems.
When a hwpoison page in a 1GB hugepage is handled, it's caught by the
PageHWPoison check in free_pages_prepare() because the 1GB hugepage is
broken down into raw error pages before coming to this point:
if (unlikely(PageHWPoison(page)) && !order) {
...
return false;
}
Then, the page is not sent to buddy and the page refcount is left 0.
Originally this check is supposed to work when the error page is freed
from page_handle_poison() (that is called from soft-offline), but now we
are opening another path to call it, so the callers of
__page_handle_poison() need to handle the case by considering the return
value 0 as success. Then page refcount for hwpoison is properly
incremented so unpoison works.
Link: https://lkml.kernel.org/r/20220714042420.1847125-8-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: kernel test robot <lkp@intel.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm, hwpoison: make __page_handle_poison returns int
__page_handle_poison() returns bool that shows whether
take_page_off_buddy() has passed or not now. But we will want to
distinguish another case of "dissolve has passed but taking off failed" by
its return value. So change the type of the return value. No functional
change.
Link: https://lkml.kernel.org/r/20220714042420.1847125-7-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: kernel test robot <lkp@intel.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm, hwpoison: set PG_hwpoison for busy hugetlb pages
If memory_failure() fails to grab page refcount on a hugetlb page because
it's busy, it returns without setting PG_hwpoison on it. This not only
loses a chance of error containment, but breaks the rule that
action_result() should be called only when memory_failure() do any of
handling work (even if that's just setting PG_hwpoison). This
inconsistency could harm code maintainability.
So set PG_hwpoison and call hugetlb_set_page_hwpoison() for such a case.
Link: https://lkml.kernel.org/r/20220714042420.1847125-6-naoya.horiguchi@linux.dev Fixes: 3cc9a2d9ba3b ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: kernel test robot <lkp@intel.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage
Raw error info list needs to be removed when hwpoisoned hugetlb is
unpoisoned. And unpoison handler needs to know how many errors there are
in the target hugepage. So add them.
HPageVmemmapOptimized(hpage) and HPageRawHwpUnreliable(hpage)) sometimes
can't be unpoisoned, so skip them.
Link: https://lkml.kernel.org/r/20220714042420.1847125-5-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm, hwpoison, hugetlb: support saving mechanism of raw error pages
When handling memory error on a hugetlb page, the error handler tries to
dissolve and turn it into 4kB pages. If it's successfully dissolved,
PageHWPoison flag is moved to the raw error page, so that's all right.
However, dissolve sometimes fails, then the error page is left as
hwpoisoned hugepage. It's useful if we can retry to dissolve it to save
healthy pages, but that's not possible now because the information about
where the raw error pages is lost.
Use the private field of a few tail pages to keep that information. The
code path of shrinking hugepage pool uses this info to try delayed
dissolve. In order to remember multiple errors in a hugepage, a
singly-linked list originated from SUBPAGE_INDEX_HWPOISON-th tail page is
constructed. Only simple operations (adding an entry or clearing all) are
required and the list is assumed not to be very long, so this simple data
structure should be enough.
If we failed to save raw error info, the hwpoison hugepage has errors on
unknown subpage, then this new saving mechanism does not work any more, so
disable saving new raw error info and freeing hwpoison hugepages.
Link: https://lkml.kernel.org/r/20220714042420.1847125-4-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry
follow_pud_mask() does not support non-present pud entry now. As long as
I tested on x86_64 server, follow_pud_mask() still simply returns
no_page_table() for non-present_pud_entry() due to pud_bad(), so no severe
user-visible effect should happen. But generally we should call
follow_huge_pud() for non-present pud entry for 1GB hugetlb page.
Update pud_huge() and follow_huge_pud() to handle non-present pud entries.
The changes are similar to previous works for pud entries commit e7ffa292f607 ("mm/hugetlb: take page table lock in follow_huge_pmd()") and
commit 0b87551b6c60 ("mm/hugetlb: pmd_huge() returns true for non-present
hugepage").
Link: https://lkml.kernel.org/r/20220714042420.1847125-3-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: kernel test robot <lkp@intel.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages()
Patch series "mm, hwpoison: enable 1GB hugepage support", v7.
This patch (of 8):
I found a weird state of 1GB hugepage pool, caused by the following
procedure:
- run a process reserving all free 1GB hugepages,
- shrink free 1GB hugepage pool to zero (i.e. writing 0 to
/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages), then
- kill the reserving process.
, then all the hugepages are free *and* surplus at the same time.
This state is resolved by reserving and allocating the pages then freeing
them again, so this seems not to result in serious problem. But it's a
little surprising (shrinking pool suddenly fails).
This behavior is caused by hstate_is_gigantic() check in
return_unused_surplus_pages(). This was introduced so long ago in 2008 by
commit d24d4969b193 ("hugetlb: support larger than MAX_ORDER"), and at
that time the gigantic pages were not supposed to be allocated/freed at
run-time. Now kernel can support runtime allocation/free, so let's check
gigantic_page_runtime_supported() together.
There is a discussion about the name of hugetlb_vmemmap_alloc/free in
thread [1]. The suggestion suggested by David is rename "alloc/free" to
"optimize/restore" to make functionalities clearer to users, "optimize"
means the function will optimize vmemmap pages, while "restore" means
restoring its vmemmap pages discared before. This commit does this.
Another discussion is the confusion RESERVE_VMEMMAP_NR isn't used
explicitly for vmemmap_addr but implicitly for vmemmap_end in
hugetlb_vmemmap_alloc/free. David suggested we can compute what
hugetlb_vmemmap_init() does now at runtime. We do not need to worry for
the overhead of computing at runtime since the calculation is simple
enough and those functions are not in a hot path. This commit has the
following improvements:
1) The function suffixed name ("optimize/restore") is more expressive.
2) The logic becomes less weird in hugetlb_vmemmap_optimize/restore().
3) The hugetlb_vmemmap_init() does not need to be exported anymore.
4) A ->optimize_vmemmap_pages field in struct hstate is killed.
5) There is only one place where checks is_power_of_2(sizeof(struct
page)) instead of two places.
6) Add more comments for hugetlb_vmemmap_optimize/restore().
7) For external users, hugetlb_optimize_vmemmap_pages() is used for
detecting if the HugeTLB's vmemmap pages is optimizable originally.
In this commit, it is killed and we introduce a new helper
hugetlb_vmemmap_optimizable() to replace it. The name is more
expressive.
Link: https://lore.kernel.org/all/20220404074652.68024-2-songmuchun@bytedance.com/ Link: https://lkml.kernel.org/r/20220628092235.91270-7-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Oscar Salvador <osalvador@suse.de> Cc: Will Deacon <will@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There is no order requirement between the parameter of
"hugetlb_free_vmemmap" and "hugepages" since we have removed the check of
whether HVO is enabled from hugetlb_vmemmap_init(). Therefore we can
safely replace early_param() with core_param() to simplify the code.
Link: https://lkml.kernel.org/r/20220628092235.91270-6-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Oscar Salvador <osalvador@suse.de> Cc: Will Deacon <will@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Tue, 28 Jun 2022 09:22:31 +0000 (17:22 +0800)]
mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c
When I first introduced vmemmap manipulation functions related to HugeTLB,
I thought those functions may be reused by other modules (e.g. using
similar approach to optimize vmemmap pages, unfortunately, the DAX used
the same approach but does not use those functions). After two years, we
didn't see any other users. So move those functions to hugetlb_vmemmap.c.
Code movement without any functional change.
Link: https://lkml.kernel.org/r/20220628092235.91270-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Will Deacon <will@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Tue, 28 Jun 2022 09:22:30 +0000 (17:22 +0800)]
mm: hugetlb_vmemmap: introduce the name HVO
It it inconvenient to mention the feature of optimizing vmemmap pages
associated with HugeTLB pages when communicating with others since there
is no specific or abbreviated name for it when it is first introduced.
Let us give it a name HVO (HugeTLB Vmemmap Optimization) from now.
This commit also updates the document about "hugetlb_free_vmemmap" by the
way discussed in thread [1].
Link: https://lore.kernel.org/all/21aae898-d54d-cc4b-a11f-1bb7fddcfffa@redhat.com/ Link: https://lkml.kernel.org/r/20220628092235.91270-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Will Deacon <will@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We hold an another reference to hugetlb_optimize_vmemmap_key when making
vmemmap_optimize_mode on, because we use static_key to tell memory_hotplug
that memory_hotplug.memmap_on_memory should be overridden. However, this
rule has gone when we have introduced PageVmemmapSelfHosted. Therefore,
we could simplify vmemmap_optimize_mode handling by not holding an another
reference to hugetlb_optimize_vmemmap_key. This also means that we not
incur the extra page_fixed_fake_head checks if there are no vmemmap
optinmized hugetlb pages after this change.
Link: https://lkml.kernel.org/r/20220628092235.91270-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Will Deacon <will@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Simplify hugetlb vmemmap and improve its readability", v2.
This series aims to simplify hugetlb vmemmap and improve its readability.
This patch (of 8):
The name hugetlb_optimize_vmemmap_enabled() a bit confusing as it tests
two conditions (enabled and pages in use). Instead of coming up to an
appropriate name, we could just delete it. There is already a discussion
about deleting it in thread [1].
There is only one user of hugetlb_optimize_vmemmap_enabled() outside of
hugetlb_vmemmap, that is flush_dcache_page() in arch/arm64/mm/flush.c.
However, it does not need to call hugetlb_optimize_vmemmap_enabled() in
flush_dcache_page() since HugeTLB pages are always fully mapped and only
head page will be set PG_dcache_clean meaning only head page's flag may
need to be cleared (see commit b023ce37802f). So it is easy to remove
hugetlb_optimize_vmemmap_enabled().
Link: https://lore.kernel.org/all/c77c61c8-8a5a-87e8-db89-d04d8aaab4cc@oracle.com/ Link: https://lkml.kernel.org/r/20220628092235.91270-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Linus Torvalds [Mon, 8 Aug 2022 22:16:29 +0000 (15:16 -0700)]
Merge tag 'rproc-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux
Pull remoteproc updates from Bjorn Andersson:
"This introduces support for the remoteproc on Mediatek MT8188, and
enables caches for MT8186 SCP. It adds support for PRU cores found on
the TI K3 AM62x SoCs.
It moves the recovery work after a firmware crash to an unbound
workqueue, to allow recovery to happen in parallel.
A new DMA API is introduced to release dma_mem for a device.
It adds support a panic handler for the Qualcomm modem remoteproc,
with the goal of having caches flushed in memory dumps for post-mortem
debugging and it introduces a mechanism to wait for the modem firmware
on SM8450 to decrypt part of its memory for post-mortem debugging.
Qualcomm sysmon is restricted to only inform remote processors about
peers that are actually running, to avoid a race where Linux tries to
notify a recovering remote processor about its peers new state. A
mechanism for waiting for the sysmon connection to be established is
also introduced, to avoid out-of-sync updates for rapidly restarting
remote processors.
A number of Devicetree binding cleanups and conversions to YAML are
introduced, to facilitate Devicetree validation. Lastly it introduces
a number of smaller fixes and cleanups in the core and a few different
drivers"
* tag 'rproc-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux: (42 commits)
remoteproc: qcom_q6v5_pas: Do not fail if regulators are not found
drivers/remoteproc: fix repeated words in comments
remoteproc: Directly use ida_alloc()/free()
remoteproc: Use unbounded workqueue for recovery work
remoteproc: using pm_runtime_resume_and_get instead of pm_runtime_get_sync
remoteproc: qcom_q6v5_pas: Deal silently with optional px and cx regulators
remoteproc: sysmon: Send sysmon state only for running rprocs
remoteproc: sysmon: Wait for SSCTL service to come up
remoteproc: qcom: q6v5: Set q6 state to offline on receiving wdog irq
remoteproc: qcom: pas: Check if coredump is enabled
remoteproc: qcom: pas: Mark devices as wakeup capable
remoteproc: qcom: pas: Mark va as io memory
remoteproc: qcom: pas: Add decrypt shutdown support for modem
remoteproc: qcom: q6v5-mss: add powerdomains to MSM8996 config
remoteproc: qcom_q6v5: Introduce panic handler for MSS
remoteproc: qcom_q6v5_mss: Update MBA log info
remoteproc: qcom: correct kerneldoc
remoteproc: qcom_q6v5_mss: map/unmap metadata region before/after use
remoteproc: qcom: using pm_runtime_resume_and_get to simplify the code
remoteproc: mediatek: Support MT8188 SCP
...
Linus Torvalds [Mon, 8 Aug 2022 22:14:43 +0000 (15:14 -0700)]
Merge tag 'rpmsg-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux
Pull rpmsg updates from Bjorn Andersson:
"This contains fixes and cleanups in the rpmsg core, Qualcomm SMD and
GLINK drivers, a circular lock dependency in the Mediatek driver and
a possible race condition in the rpmsg_char driver is resolved"
* tag 'rpmsg-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux:
rpmsg: convert sysfs snprintf to sysfs_emit
rpmsg: qcom_smd: Fix refcount leak in qcom_smd_parse_edge
rpmsg: qcom: correct kerneldoc
rpmsg: qcom: glink: remove unused name
rpmsg: qcom: glink: replace strncpy() with strscpy_pad()
rpmsg: Strcpy is not safe, use strscpy_pad() instead
rpmsg: Fix possible refcount leak in rpmsg_register_device_override()
rpmsg: Fix parameter naming for announce_create/destroy ops
rpmsg: mtk_rpmsg: Fix circular locking dependency
rpmsg: char: Add mutex protection for rpmsg_eptdev_open()
Linus Torvalds [Mon, 8 Aug 2022 21:29:00 +0000 (14:29 -0700)]
Merge tag 'pm-5.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These are ARM cpufreq updates and operating performance points (OPP)
updates plus one cpuidle update adding a new trace point.
Specifics:
- Fix return error code in mtk_cpu_dvfs_info_init (Yang Yingliang).
- Minor cleanups and support for new boards for Qcom cpufreq drivers
(Bryan O'Donoghue, Konrad Dybcio, Pierre Gondois, and Yicong Yang).
- Fix sparse warnings for Tegra cpufreq driver (Viresh Kumar).
- Make dev_pm_opp_set_regulators() accept NULL terminated list
(Viresh Kumar).
- Add dev_pm_opp_set_config() and friends and migrate other users and
helpers to using them (Viresh Kumar).
- Add support for multiple clocks for a device (Viresh Kumar and
Krzysztof Kozlowski).
- Configure resources before adding OPP table for Venus (Stanimir
Varbanov).
- Keep reference count up for opp->np and opp_table->np while they
are still in use (Liang He).
- Minor OPP cleanups (Viresh Kumar and Yang Li).
- Add a trace event for cpuidle to track missed (too deep or too
shallow) wakeups (Kajetan Puchalski)"
* tag 'pm-5.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (55 commits)
cpuidle: Add cpu_idle_miss trace event
venus: pm_helpers: Fix warning in OPP during probe
OPP: Don't drop opp->np reference while it is still in use
OPP: Don't drop opp_table->np reference while it is still in use
cpufreq: tegra194: Staticize struct tegra_cpufreq_soc instances
dt-bindings: cpufreq: cpufreq-qcom-hw: Add SM6375 compatible
dt-bindings: opp: Add msm8939 to the compatible list
dt-bindings: opp: Add missing compat devices
dt-bindings: opp: opp-v2-kryo-cpu: Fix example binding checks
cpufreq: Change order of online() CB and policy->cpus modification
cpufreq: qcom-hw: Remove deprecated irq_set_affinity_hint() call
cpufreq: qcom-hw: Disable LMH irq when disabling policy
cpufreq: qcom-hw: Reset cancel_throttle when policy is re-enabled
cpufreq: qcom-cpufreq-hw: use HZ_PER_KHZ macro in units.h
cpufreq: mediatek: fix error return code in mtk_cpu_dvfs_info_init()
OPP: Remove dev{m}_pm_opp_of_add_table_noclk()
PM / devfreq: tegra30: Register config_clks helper
OPP: Allow config_clks helper for single clk case
OPP: Provide a simple implementation to configure multiple clocks
OPP: Assert clk_count == 1 for single clk helpers
...
Linus Torvalds [Mon, 8 Aug 2022 21:23:37 +0000 (14:23 -0700)]
Merge tag 'thermal-5.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more thermal control updates from Rafael Wysocki:
"These fix an error code path issue leading to a NULL pointer
dereference, drop Kconfig dependencies that are not needed any more
after recent changes, add CPU IDs for new chips to a driver and fix up
the tmon utility.
Specifics:
- Fix NULL pointer dereference in the thermal sysfs interface that
results from an error code path mishandling (Rafael Wysocki).
- Drop COMPILE_TEST dependency that's not needed any more from two
thermal Kconfig entries (Jean Delvare).
- Make the Intel TCC cooling driver support Alder Lake-N and Raptor
Lake-P (Sumeet Pawnikar).
- Fix possible path truncations in the tmon utility (Florian
Fainelli)"
* tag 'thermal-5.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
tools/thermal: Fix possible path truncations
thermal: Drop obsolete dependency on COMPILE_TEST
thermal: sysfs: Fix cooling_device_stats_setup() error code path
thermal: intel: Add TCC cooling support for Alder Lake-N and Raptor Lake-P
Linus Torvalds [Mon, 8 Aug 2022 21:17:46 +0000 (14:17 -0700)]
Merge tag 'sysctl-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull sysctl updates from Luis Chamberlain:
"There isn't much for 6.0 for sysctl stuff, most of the stuff went
through the networking subsystem (Kuniyuki Iwashima's trove of fixes
using READ_ONCE/WRITE_ONCE helpers) as most of the issues there have
been identified on networking side. So it is good we don't have much
updates as we would have ended up with tons of conflicts. I rebased my
delta just now to your tree so to avoid conflicts with that stuff.
This merge request is just minor fluff cleanups then. Perhaps for 6.1
kernel/sysctl.c will get more love than this release"
* tag 'sysctl-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
kernel/sysctl.c: Remove trailing white space
kernel/sysctl.c: Clean up indentation, replace spaces with tab.
sysctl: Merge adjacent CONFIG_TREE_RCU blocks
Linus Torvalds [Mon, 8 Aug 2022 21:12:19 +0000 (14:12 -0700)]
Merge tag 'modules-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull module updates from Luis Chamberlain:
"For the 6.0 merge window the modules code shifts to cleanup and minor
fixes effort. This becomes much easier to do and review now due to the
code split to its own directory from effort on the last kernel
release. I expect to see more of this with time and as we expand on
test coverage in the future. The cleanups and fixes come from usual
suspects such as Christophe Leroy and Aaron Tomlin but there are also
some other contributors.
One particular minor fix worth mentioning is from Helge Deller, where
he spotted a *forever* incorrect natural alignment on both ELF section
header tables:
* .altinstructions
* __bug_table sections
A lot of back and forth went on in trying to determine the ill effects
of this misalignment being present for years and it has been
determined there should be no real ill effects unless you have a buggy
exception handler. Helge actually hit one of these buggy exception
handlers on parisc which is how he ended up spotting this issue. When
implemented correctly these paths with incorrect misalignment would
just mean a performance penalty, but given that we are dealing with
alternatives on modules and with the __bug_table (where info regardign
BUG()/WARN() file/line information associated with it is stored) this
really shouldn't be a big deal.
The only other change with mentioning is the kmap() with
kmap_local_page() and my only concern with that was on what is done
after preemption, but the virtual addresses are restored after
preemption. This is only used on module decompression.
This all has sit on linux-next for a while except the kmap stuff which
has been there for 3 weeks"
* tag 'modules-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
module: Replace kmap() with kmap_local_page()
module: Show the last unloaded module's taint flag(s)
module: Use strscpy() for last_unloaded_module
module: Modify module_flags() to accept show_state argument
module: Move module's Kconfig items in kernel/module/
MAINTAINERS: Update file list for module maintainers
module: Use vzalloc() instead of vmalloc()/memset(0)
modules: Ensure natural alignment for .altinstructions and __bug_table sections
module: Increase readability of module_kallsyms_lookup_name()
module: Fix ERRORs reported by checkpatch.pl
module: Add support for default value for module async_probe
NeilBrown [Mon, 1 Aug 2022 00:33:34 +0000 (10:33 +1000)]
NFS: don't unhash dentry during unlink/rename
NFS unlink() (and rename over existing target) must determine if the
file is open, and must perform a "silly rename" instead of an unlink (or
before rename) if it is. Otherwise the client might hold a file open
which has been removed on the server.
Consequently if it determines that the file isn't open, it must block
any subsequent opens until the unlink/rename has been completed on the
server.
This is currently achieved by unhashing the dentry. This forces any
open attempt to the slow-path for lookup which will block on i_rwsem on
the directory until the unlink/rename completes. A future patch will
change the VFS to only get a shared lock on i_rwsem for unlink, so this
will no longer work.
Instead we introduce an explicit interlock. A special value is stored
in dentry->d_fsdata while the unlink/rename is running and
->d_revalidate blocks while that value is present. When ->d_revalidate
unblocks, the dentry will be invalid. This closes the race
without requiring exclusion on i_rwsem.
d_fsdata is already used in two different ways.
1/ an IS_ROOT directory dentry might have a "devname" stored in
d_fsdata. Such a dentry doesn't have a name and so cannot be the
target of unlink or rename. For safety we check if an old devname
is still stored, and remove it if it is.
2/ a dentry with DCACHE_NFSFS_RENAMED set will have a 'struct
nfs_unlinkdata' stored in d_fsdata. While this is set maydelete()
will fail, so an unlink or rename will never proceed on such
a dentry.
Neither of these can be in effect when a dentry is the target of unlink
or rename. So we can expect d_fsdata to be NULL, and store a special
value ((void*)1) which is given the name NFS_FSDATA_BLOCKED to indicate
that any lookup will be blocked.
The d_count() is incremented under d_lock() when a lookup finds the
dentry, so we check d_count() is low, and set NFS_FSDATA_BLOCKED under
the same lock to avoid any races.
Linus Torvalds [Mon, 8 Aug 2022 18:36:21 +0000 (11:36 -0700)]
Merge tag 'leds-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/pavel/linux-leds
Pull LED updates from Pavel Machek:
"A new driver for bcm63138, is31fl319x updates, fixups for multicolor.
The clevo-mail driver got disabled, it needs an API fix"
* tag 'leds-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/pavel/linux-leds: (23 commits)
leds: is31fl319x: use simple i2c probe function
leds: is31fl319x: Fix devm vs. non-devm ordering
leds: is31fl319x: Make use of dev_err_probe()
leds: is31fl319x: Make use of device properties
leds: is31fl319x: Cleanup formatting and dev_dbg calls
leds: is31fl319x: Add support for is31fl319{0,1,3} chips
leds: is31fl319x: Move chipset-specific values in chipdef struct
leds: is31fl319x: Use non-wildcard names for vars, structs and defines
leds: is31fl319x: Add missing si-en compatibles
dt-bindings: leds: pwm-multicolor: document max-brigthness
leds: turris-omnia: convert to use dev_groups
leds: leds-bcm63138: get rid of LED_OFF
leds: add help info about BCM63138 module name
dt-bindings: leds: leds-bcm63138: unify full stops in descriptions
dt-bindings: leds: lp50xx: fix LED children names
dt-bindings: leds: class-multicolor: reference class directly in multi-led node
leds: bcm63138: add support for BCM63138 controller
dt-bindings: leds: add Broadcom's BCM63138 controller
leds: clevo-mail: Mark as broken pending interface fix
leds: pwm-multicolor: Support active-low LEDs
...
Linus Torvalds [Mon, 8 Aug 2022 18:31:40 +0000 (11:31 -0700)]
Merge tag 'tty-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty / serial driver updates from Greg KH:
"Here is the big set of tty and serial driver changes for 6.0-rc1.
It was delayed from last week as I wanted to make sure the last commit
here got some good testing in linux-next and elsewhere as it seemed to
show up only late in testing for some reason.
Nothing major here, just lots of cleanups from Jiri and Ilpo to make
the tty core cleaner (Jiri) and the rs485 code simpler to use (Ilpo).
Also included in here is the obligatory n_gsm updates from Daniel
Starke and lots of tiny driver updates and minor fixes and tweaks for
other smaller serial drivers.
All of these have been in linux-next for a while with no reported
problems"
* tag 'tty-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (186 commits)
tty: serial: qcom-geni-serial: Fix %lu -> %u in print statements
tty: amiserial: Fix comment typo
tty: serial: document uart_get_console()
tty: serial: serial_core, reformat kernel-doc for functions
Documentation: serial: link uart_ops properly
Documentation: serial: move GPIO kernel-doc to the functions
Documentation: serial: dedup kernel-doc for uart functions
Documentation: serial: move uart_ops documentation to the struct
dt-bindings: serial: snps-dw-apb-uart: Document Rockchip RV1126
serial: mvebu-uart: uart2 error bits clearing
tty: serial: fsl_lpuart: correct the count of break characters
serial: stm32: make info structs static to avoid sparse warnings
serial: fsl_lpuart: zero out parity bit in CS7 mode
tty: serial: qcom-geni-serial: Fix get_clk_div_rate() which otherwise could return a sub-optimal clock rate.
serial: 8250_bcm2835aux: Add missing clk_disable_unprepare()
tty: vt: initialize unicode screen buffer
serial: remove VR41XX serial driver
serial: 8250: lpc18xx: Remove redundant sanity check for RS485 flags
serial: 8250_dwlib: remove redundant sanity check for RS485 flags
dt_bindings: rs485: Correct delay values
...
Linus Torvalds [Mon, 8 Aug 2022 18:18:31 +0000 (11:18 -0700)]
Merge tag 'f2fs-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this cycle, we mainly fixed some corner cases that manipulate a
per-file compression flag inappropriately. And, we found f2fs counted
valid blocks in a section incorrectly when zone capacity is set, and
thus, fixed it with additional sysfs entry to check it easily.
Lastly, this series includes several patches with respect to the new
atomic write support such as a couple of bug fixes and re-adding
atomic_write_abort support that we removed by mistake in the previous
release.
Enhancements:
- add sysfs entries to understand atomic write operations and zone
capacity
- introduce memory mode to get a hint for low-memory devices
- adjust the waiting time of foreground GC
- decompress clusters under softirq to avoid non-deterministic
latency
- do not skip updating inode when retrying to flush node page
- enforce single zone capacity
Bug fixes:
- set the compression/no-compression flags correctly
- revive F2FS_IOC_ABORT_VOLATILE_WRITE
- check inline_data during compressed inode conversion
- understand zone capacity when calculating valid block count
As usual, the series includes several minor clean-ups and sanity
checks"
* tag 'f2fs-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (29 commits)
f2fs: use onstack pages instead of pvec
f2fs: intorduce f2fs_all_cluster_page_ready
f2fs: clean up f2fs_abort_atomic_write()
f2fs: handle decompress only post processing in softirq
f2fs: do not allow to decompress files have FI_COMPRESS_RELEASED
f2fs: do not set compression bit if kernel doesn't support
f2fs: remove device type check for direct IO
f2fs: fix null-ptr-deref in f2fs_get_dnode_of_data
f2fs: revive F2FS_IOC_ABORT_VOLATILE_WRITE
f2fs: fix to do sanity check on segment type in build_sit_entries()
f2fs: obsolete unused MAX_DISCARD_BLOCKS
f2fs: fix to avoid use f2fs_bug_on() in f2fs_new_node_page()
f2fs: fix to remove F2FS_COMPR_FL and tag F2FS_NOCOMP_FL at the same time
f2fs: introduce sysfs atomic write statistics
f2fs: don't bother wait_ms by foreground gc
f2fs: invalidate meta pages only for post_read required inode
f2fs: allow compression of files without blocks
f2fs: fix to check inline_data during compressed inode conversion
f2fs: Delete f2fs_copy_page() and replace with memcpy_page()
f2fs: fix to invalidate META_MAPPING before DIO write
...
Linus Torvalds [Mon, 8 Aug 2022 18:10:02 +0000 (11:10 -0700)]
Merge tag 'fuse-update-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Fix an issue with reusing the bdi in case of block based filesystems
- Allow root (in init namespace) to access fuse filesystems in user
namespaces if expicitly enabled with a module param
- Misc fixes
* tag 'fuse-update-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: retire block-device-based superblock on force unmount
vfs: function to prevent re-use of block-device-based superblocks
virtio_fs: Modify format for virtio_fs_direct_access
virtiofs: delete unused parameter for virtio_fs_cleanup_vqs
fuse: Add module param for CAP_SYS_ADMIN access bypassing allow_other
fuse: Remove the control interface for virtio-fs
fuse: ioctl: translate ENOSYS
fuse: limit nsec
fuse: avoid unnecessary spinlock bump
fuse: fix deadlock between atomic O_TRUNC and page invalidation
fuse: write inode in fuse_release()
Linus Torvalds [Mon, 8 Aug 2022 18:03:11 +0000 (11:03 -0700)]
Merge tag 'ovl-update-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs update from Miklos Szeredi:
"Just a small update"
* tag 'ovl-update-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix spelling mistakes
ovl: drop WARN_ON() dentry is NULL in ovl_encode_fh()
ovl: improve ovl_get_acl() if POSIX ACL support is off
ovl: fix some kernel-doc comments
ovl: warn if trusted xattr creation fails
Linus Torvalds [Mon, 8 Aug 2022 17:57:09 +0000 (10:57 -0700)]
Merge tag 'exfat-for-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
Pull exfat updates from Namjae Jeon:
- fix the error code of rename syscall
- cleanup and suppress the superfluous error messages
- remove duplicate directory entry update
- add exfat git tree in MAINTAINERS
* tag 'exfat-for-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
MAINTAINERS: Add Namjae's exfat git tree
exfat: Drop superfluous new line for error messages
exfat: Downgrade ENAMETOOLONG error message to debug messages
exfat: Expand exfat_err() and co directly to pr_*() macro
exfat: Define NLS_NAME_* as bit flags explicitly
exfat: Return ENAMETOOLONG consistently for oversized paths
exfat: remove duplicate write inode for extending dir/file
exfat: remove duplicate write inode for truncating file
exfat: reuse __exfat_write_inode() to update directory entry
David Howells [Mon, 8 Aug 2022 08:52:35 +0000 (09:52 +0100)]
vfs: Check the truncate maximum size in inode_newsize_ok()
If something manages to set the maximum file size to MAX_OFFSET+1, this
can cause the xfs and ext4 filesystems at least to become corrupt.
Ordinarily, the kernel protects against userspace trying this by
checking the value early in the truncate() and ftruncate() system calls
calls - but there are at least two places that this check is bypassed:
(1) Cachefiles will round up the EOF of the backing file to DIO block
size so as to allow DIO on the final block - but this might push
the offset negative. It then calls notify_change(), but this
inadvertently bypasses the checking. This can be triggered if
someone puts an 8EiB-1 file on a server for someone else to try and
access by, say, nfs.
(2) ksmbd doesn't check the value it is given in set_end_of_file_info()
and then calls vfs_truncate() directly - which also bypasses the
check.
In both cases, it is potentially possible for a network filesystem to
cause a disk filesystem to be corrupted: cachefiles in the client's
cache filesystem; ksmbd in the server's filesystem.
nfsd is okay as it checks the value, but we can then remove this check
too.
Fix this by adding a check to inode_newsize_ok(), as called from
setattr_prepare(), thereby catching the issue as filesystems set up to
perform the truncate with minimal opportunity for bypassing the new
check.
Fixes: eed34cba8a1f ("cachefiles: Implement backing file wrangling") Fixes: 9be54045676d ("cifsd: add file operations") Signed-off-by: David Howells <dhowells@redhat.com> Reported-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Cc: stable@kernel.org Acked-by: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Steve French <sfrench@samba.org>
cc: Hyunchul Lee <hyc.lee@gmail.com>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>