Sargun Dhillon [Wed, 3 Jun 2020 01:10:43 +0000 (18:10 -0700)]
seccomp: Introduce addfd ioctl to seccomp user notifier
The current SECCOMP_RET_USER_NOTIF API allows for syscall supervision over
an fd. It is often used in settings where a supervising task emulates
syscalls on behalf of a supervised task in userspace, either to further
restrict the supervisee's syscall abilities or to circumvent kernel
enforced restrictions the supervisor deems safe to lift (e.g. actually
performing a mount(2) for an unprivileged container).
While SECCOMP_RET_USER_NOTIF allows for the interception of any syscall,
only a certain subset of syscalls could be correctly emulated. Over the
last few development cycles, the set of syscalls which can't be emulated
has been reduced due to the addition of pidfd_getfd(2). With this we are
now able to, for example, intercept syscalls that require the supervisor
to operate on file descriptors of the supervisee such as connect(2).
However, syscalls that cause new file descriptors to be installed can not
currently be correctly emulated since there is no way for the supervisor
to inject file descriptors into the supervisee. This patch adds a
new addfd ioctl to remove this restriction by allowing the supervisor to
install file descriptors into the intercepted task. By implementing this
feature via seccomp the supervisor effectively instructs the supervisee
to install a set of file descriptors into its own file descriptor table
during the intercepted syscall. This way it is possible to intercept
syscalls such as open() or accept(), and install (or replace, like
dup2(2)) the supervisor's resulting fd into the supervisee. One
replacement use-case would be to redirect the stdout and stderr of a
supervisee into log file descriptors opened by the supervisor.
The ioctl handling is based on the discussions[1] of how Extensible
Arguments should interact with ioctls. Instead of building size into
the addfd structure, make it a function of the ioctl command (which
is how sizes are normally passed to ioctls). To support forward and
backward compatibility, just mask out the direction and size, and match
everything. The size (and any future direction) checks are done along
with copy_struct_from_user() logic.
As a note, the seccomp_notif_addfd structure is laid out based on 8-byte
alignment without requiring packing as there have been packing issues
with uapi highlighted before[2][3]. Although we could overload the
newfd field and use -1 to indicate that it is not to be used, doing
so requires changing the size of the fd field, and introduces struct
packing complexity.
Kees Cook [Wed, 10 Jun 2020 15:46:58 +0000 (08:46 -0700)]
fs: Expand __receive_fd() to accept existing fd
Expand __receive_fd() with support for replace_fd() for the coming seccomp
"addfd" ioctl(). Add new wrapper receive_fd_replace() for the new behavior
and update existing wrappers to retain old behavior.
Thanks to Colin Ian King <colin.king@canonical.com> for pointing out an
uninitialized variable exposure in an earlier version of this patch.
Kees Cook [Thu, 11 Jun 2020 03:47:45 +0000 (20:47 -0700)]
fs: Add receive_fd() wrapper for __receive_fd()
For both pidfd and seccomp, the __user pointer is not used. Update
__receive_fd() to make writing to ufd optional via a NULL check. However,
for the receive_fd_user() wrapper, ufd is NULL checked so an -EFAULT
can be returned to avoid changing the SCM_RIGHTS interface behavior. Add
new wrapper receive_fd() for pidfd and seccomp that does not use the ufd
argument. For the new helper, the allocated fd needs to be returned on
success. Update the existing callers to handle it.
Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Sargun Dhillon <sargun@sargun.me> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Kees Cook <keescook@chromium.org>
Kees Cook [Wed, 10 Jun 2020 15:20:05 +0000 (08:20 -0700)]
fs: Move __scm_install_fd() to __receive_fd()
In preparation for users of the "install a received file" logic outside
of net/ (pidfd and seccomp), relocate and rename __scm_install_fd() from
net/core/scm.c to __receive_fd() in fs/file.c, and provide a wrapper
named receive_fd_user(), as future patches will change the interface
to __receive_fd().
Additionally add a comment to fd_install() as a counterpoint to how
__receive_fd() interacts with fput().
Kees Cook [Tue, 9 Jun 2020 23:11:29 +0000 (16:11 -0700)]
net/scm: Regularize compat handling of scm_detach_fds()
Duplicate the cleanups from commit d4f7d860e7fe ("net/scm: cleanup
scm_detach_fds") into the compat code.
Replace open-coded __receive_sock() with a call to the helper.
Move the check added in commit 59dd869cf13f ("net: cleanly handle kernel
vs user buffers for ->msg_control") to before the compat call, even
though it should be impossible for an in-kernel call to also be compat.
Correct the int "flags" argument to unsigned int to match fd_install()
and similar APIs.
Regularize any remaining differences, including a whitespace issue,
a checkpatch warning, and add the check from commit c29dd25a390a ("net,
scm: fix PaX detected msg_controllen overflow in scm_detach_fds") which
fixed an overflow unique to 64-bit. To avoid confusion when comparing
the compat handler to the native handler, just include the same check
in the compat handler.
Cc: Christoph Hellwig <hch@lst.de> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Kees Cook <keescook@chromium.org>
Kees Cook [Tue, 9 Jun 2020 23:21:38 +0000 (16:21 -0700)]
pidfd: Add missing sock updates for pidfd_getfd()
The sock counting (sock_update_netprioidx() and sock_update_classid())
was missing from pidfd's implementation of received fd installation. Add
a call to the new __receive_sock() helper.
Kees Cook [Tue, 9 Jun 2020 23:11:29 +0000 (16:11 -0700)]
net/compat: Add missing sock updates for SCM_RIGHTS
Add missed sock updates to compat path via a new helper, which will be
used more in coming patches. (The net/core/scm.c code is left as-is here
to assist with -stable backports for the compat path.)
Cc: Christoph Hellwig <hch@lst.de> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Jakub Kicinski <kuba@kernel.org> Cc: stable@vger.kernel.org Fixes: ec1d7af30ada ("net: netprio: fd passed in SCM_RIGHTS datagram not set correctly") Fixes: f63ae9fc6a3d ("net: net_cls: fd passed in SCM_RIGHTS datagram not set correctly") Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Kees Cook <keescook@chromium.org>
The FIXTURE*() macro kern-doc examples had the wrong names for the C code
examples associated with them. Fix those and clarify that FIXTURE_DATA()
usage should be avoided.
Kees Cook [Fri, 19 Jun 2020 19:20:15 +0000 (12:20 -0700)]
seccomp: Use -1 marker for end of mode 1 syscall list
The terminator for the mode 1 syscalls list was a 0, but that could be
a valid syscall number (e.g. x86_64 __NR_read). By luck, __NR_read was
listed first and the loop construct would not test it, so there was no
bug. However, this is fragile. Replace the terminator with -1 instead,
and make the variable name for mode 1 syscall lists more descriptive.
Cc: Andy Lutomirski <luto@amacapital.net> Cc: Will Drewry <wad@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org>
Kees Cook [Mon, 15 Jun 2020 22:42:46 +0000 (15:42 -0700)]
seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID
When SECCOMP_IOCTL_NOTIF_ID_VALID was first introduced it had the wrong
direction flag set. While this isn't a big deal as nothing currently
enforces these bits in the kernel, it should be defined correctly. Fix
the define and provide support for the old command until it is no longer
needed for backward compatibility.
Fixes: 388c4c8f99d2 ("seccomp: add a return code to trap to userspace") Signed-off-by: Kees Cook <keescook@chromium.org>
Kees Cook [Thu, 11 Jun 2020 18:08:48 +0000 (11:08 -0700)]
selftests/seccomp: Rename user_trap_syscall() to user_notif_syscall()
The user_trap_syscall() helper creates a filter with
SECCOMP_RET_USER_NOTIF. To avoid confusion with SECCOMP_RET_TRAP, rename
the helper to user_notif_syscall().
Cc: Andy Lutomirski <luto@amacapital.net> Cc: Will Drewry <wad@chromium.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Song Liu <songliubraving@fb.com> Cc: Yonghong Song <yhs@fb.com> Cc: Andrii Nakryiko <andriin@fb.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@chromium.org> Cc: linux-kselftest@vger.kernel.org Cc: netdev@vger.kernel.org Cc: bpf@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
Kees Cook [Thu, 11 Jun 2020 18:02:21 +0000 (11:02 -0700)]
selftests/seccomp: Make kcmp() less required
The seccomp tests are a bit noisy without CONFIG_CHECKPOINT_RESTORE (due
to missing the kcmp() syscall). The seccomp tests are more accurate with
kcmp(), but it's not strictly required. Refactor the tests to use
alternatives (comparing fd numbers), and provide a central test for
kcmp() so there is a single SKIP instead of many. Continue to produce
warnings for the other tests, though.
Additionally adds some more bad flag EINVAL tests to the addfd selftest.
Cc: Andy Lutomirski <luto@amacapital.net> Cc: Will Drewry <wad@chromium.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Song Liu <songliubraving@fb.com> Cc: Yonghong Song <yhs@fb.com> Cc: Andrii Nakryiko <andriin@fb.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@chromium.org> Cc: linux-kselftest@vger.kernel.org Cc: netdev@vger.kernel.org Cc: bpf@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
Kees Cook [Sat, 6 Jun 2020 16:37:17 +0000 (09:37 -0700)]
selftests/seccomp: Improve calibration loop
The seccomp benchmark calibration loop did not need to take so long.
Instead, use a simple 1 second timeout and multiply up to target. It
does not need to be accurate.
As seccomp_benchmark tries to calibrate how many samples will take more
than 5 seconds to execute, it may end up picking up a number of samples
that take 10 (but up to 12) seconds. As the calibration will take double
that time, it takes around 20 seconds. Then, it executes the whole thing
again, and then once more, with some added overhead. So, the thing might
take more than 40 seconds, which is too close to the 45s timeout.
That is very dependent on the system where it's executed, so may not be
observed always, but it has been observed on x86 VMs. Using a 90s timeout
seems safe enough.
We've been making heavy use of the seccomp notifier to intercept and
handle certain syscalls for containers. This patch allows a syscall
supervisor listening on a given notifier to be notified when a seccomp
filter has become unused.
A container is often managed by a singleton supervisor process the
so-called "monitor". This monitor process has an event loop which has
various event handlers registered. If the user specified a seccomp
profile that included a notifier for various syscalls then we also
register a seccomp notify even handler. For any container using a
separate pid namespace the lifecycle of the seccomp notifier is bound to
the init process of the pid namespace, i.e. when the init process exits
the filter must be unused.
If a new process attaches to a container we force it to assume a seccomp
profile. This can either be the same seccomp profile as the container
was started with or a modified one. If the attaching process makes use
of the seccomp notifier we will register a new seccomp notifier handler
in the monitor's event loop. However, when the attaching process exits
we can't simply delete the handler since other child processes could've
been created (daemons spawned etc.) that have inherited the seccomp
filter and so we need to keep the seccomp notifier fd alive in the event
loop. But this is problematic since we don't get a notification when the
seccomp filter has become unused and so we currently never remove the
seccomp notifier fd from the event loop and just keep accumulating fds
in the event loop. We've had this issue for a while but it has recently
become more pressing as more and larger users make use of this.
To fix this, we introduce a new "users" reference counter that tracks any
tasks and dependent filters making use of a filter. When a notifier is
registered waiting tasks will be notified that the filter is now empty
by receiving a (E)POLLHUP event.
The concept in this patch introduces is the same as for signal_struct,
i.e. reference counting for life-cycle management is decoupled from
reference counting taks using the object. There's probably some trickery
possible but the second counter is just the correct way of doing this
IMHO and has precedence.
seccomp: Lift wait_queue into struct seccomp_filter
Lift the wait_queue from struct notification into struct seccomp_filter.
This is cleaner overall and lets us avoid having to take the notifier
mutex in the future for EPOLLHUP notifications since we need to neither
read nor modify the notifier specific aspects of the seccomp filter. In
the exit path I'd very much like to avoid having to take the notifier mutex
for each filter in the task's filter hierarchy.
Cc: Tycho Andersen <tycho@tycho.ws> Cc: Kees Cook <keescook@chromium.org> Cc: Matt Denton <mpdenton@google.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Jann Horn <jannh@google.com> Cc: Chris Palmer <palmer@google.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Robert Sesek <rsesek@google.com> Cc: Jeffrey Vander Stoep <jeffv@google.com> Cc: Linux Containers <containers@lists.linux-foundation.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Kees Cook <keescook@chromium.org>
The seccomp filter used to be released in free_task() which is called
asynchronously via call_rcu() and assorted mechanisms. Since we need
to inform tasks waiting on the seccomp notifier when a filter goes empty
we will notify them as soon as a task has been marked fully dead in
release_task(). To not split seccomp cleanup into two parts, move
filter release out of free_task() and into release_task() after we've
unhashed struct task from struct pid, exited signals, and unlinked it
from the threadgroups' thread list. We'll put the empty filter
notification infrastructure into it in a follow up patch.
This also renames put_seccomp_filter() to seccomp_filter_release() which
is a more descriptive name of what we're doing here especially once
we've added the empty filter notification mechanism in there.
We're also NULL-ing the task's filter tree entrypoint which seems
cleaner than leaving a dangling pointer in there. Note that this shouldn't
need any memory barriers since we're calling this when the task is in
release_task() which means it's EXIT_DEAD. So it can't modify its seccomp
filters anymore. You can also see this from the point where we're calling
seccomp_filter_release(). It's after __exit_signal() and at this point,
tsk->sighand will already have been NULLed which is required for
thread-sync and filter installation alike.
Naming the lifetime counter of a seccomp filter "usage" suggests a
little too strongly that its about tasks that are using this filter
while it also tracks other references such as the user notifier or
ptrace. This also updates the documentation to note this fact.
We'll be introducing an actual usage counter in a follow-up patch.
Kees Cook [Wed, 13 May 2020 21:11:26 +0000 (14:11 -0700)]
seccomp: Report number of loaded filters in /proc/$pid/status
A common question asked when debugging seccomp filters is "how many
filters are attached to your process?" Provide a way to easily answer
this question through /proc/$pid/status with a "Seccomp_filters" line.
Running the seccomp tests as a regular user shouldn't just fail tests
that require CAP_SYS_ADMIN (for getting a PID namespace). Instead,
detect those cases and SKIP them. Additionally, gracefully SKIP missing
CONFIG_USER_NS (and add to "config" since we'd prefer to actually test
this case).
Linus Torvalds [Sun, 14 Jun 2020 18:39:31 +0000 (11:39 -0700)]
Merge tag 'LSM-add-setgid-hook-5.8-author-fix' of git://github.com/micah-morton/linux
Pull SafeSetID update from Micah Morton:
"Add additional LSM hooks for SafeSetID
SafeSetID is capable of making allow/deny decisions for set*uid calls
on a system, and we want to add similar functionality for set*gid
calls.
The work to do that is not yet complete, so probably won't make it in
for v5.8, but we are looking to get this simple patch in for v5.8
since we have it ready.
We are planning on the rest of the work for extending the SafeSetID
LSM being merged during the v5.9 merge window"
* tag 'LSM-add-setgid-hook-5.8-author-fix' of git://github.com/micah-morton/linux:
security: Add LSM hooks to set*gid syscalls
Thomas Cedeno [Tue, 9 Jun 2020 17:22:13 +0000 (10:22 -0700)]
security: Add LSM hooks to set*gid syscalls
The SafeSetID LSM uses the security_task_fix_setuid hook to filter
set*uid() syscalls according to its configured security policy. In
preparation for adding analagous support in the LSM for set*gid()
syscalls, we add the requisite hook here. Tested by putting print
statements in the security_task_fix_setgid hook and seeing them get hit
during kernel boot.
Signed-off-by: Thomas Cedeno <thomascedeno@google.com> Signed-off-by: Micah Morton <mortonm@chromium.org>
Linus Torvalds [Sun, 14 Jun 2020 16:47:25 +0000 (09:47 -0700)]
Merge tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This reverts the direct io port to iomap infrastructure of btrfs
merged in the first pull request. We found problems in invalidate page
that don't seem to be fixable as regressions or without changing iomap
code that would not affect other filesystems.
There are four reverts in total, but three of them are followup
cleanups needed to revert 6e1fb14e86fe cleanly. The result is the
buffer head based implementation of direct io.
Reverts are not great, but under current circumstances I don't see
better options"
* tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Revert "btrfs: switch to iomap_dio_rw() for dio"
Revert "fs: remove dio_end_io()"
Revert "btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK"
Revert "btrfs: split btrfs_direct_IO to read and write part"
2) RXRPC fails to send norigications, from David Howells.
3) MPTCP RM_ADDR parsing has an off by one pointer error, fix from
Geliang Tang.
4) Fix crash when using MSG_PEEK with sockmap, from Anny Hu.
5) The ucc_geth driver needs __netdev_watchdog_up exported, from
Valentin Longchamp.
6) Fix hashtable memory leak in dccp, from Wang Hai.
7) Fix how nexthops are marked as FDB nexthops, from David Ahern.
8) Fix mptcp races between shutdown and recvmsg, from Paolo Abeni.
9) Fix crashes in tipc_disc_rcv(), from Tuong Lien.
10) Fix link speed reporting in iavf driver, from Brett Creeley.
11) When a channel is used for XSK and then reused again later for XSK,
we forget to clear out the relevant data structures in mlx5 which
causes all kinds of problems. Fix from Maxim Mikityanskiy.
12) Fix memory leak in genetlink, from Cong Wang.
13) Disallow sockmap attachments to UDP sockets, it simply won't work.
From Lorenz Bauer.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
net: ethernet: ti: ale: fix allmulti for nu type ale
net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init
net: atm: Remove the error message according to the atomic context
bpf: Undo internal BPF_PROBE_MEM in BPF insns dump
libbpf: Support pre-initializing .bss global variables
tools/bpftool: Fix skeleton codegen
bpf: Fix memlock accounting for sock_hash
bpf: sockmap: Don't attach programs to UDP sockets
bpf: tcp: Recv() should return 0 when the peer socket is closed
ibmvnic: Flush existing work items before device removal
genetlink: clean up family attributes allocations
net: ipa: header pad field only valid for AP->modem endpoint
net: ipa: program upper nibbles of sequencer type
net: ipa: fix modem LAN RX endpoint id
net: ipa: program metadata mask differently
ionic: add pcie_print_link_status
rxrpc: Fix race between incoming ACK parser and retransmitter
net/mlx5: E-Switch, Fix some error pointer dereferences
net/mlx5: Don't fail driver on failure to create debugfs
net/mlx5e: CT: Fix ipv6 nat header rewrite actions
...
This patch reverts the main part of switching direct io implementation
to iomap infrastructure. There's a problem in invalidate page that
couldn't be solved as regression in this development cycle.
The problem occurs when buffered and direct io are mixed, and the ranges
overlap. Although this is not recommended, filesystems implement
measures or fallbacks to make it somehow work. In this case, fallback to
buffered IO would be an option for btrfs (this already happens when
direct io is done on compressed data), but the change would be needed in
the iomap code, bringing new semantics to other filesystems.
Another problem arises when again the buffered and direct ios are mixed,
invalidation fails, then -EIO is set on the mapping and fsync will fail,
though there's no real error.
There have been discussions how to fix that, but revert seems to be the
least intrusive option.
net: ethernet: ti: ale: fix allmulti for nu type ale
On AM65xx MCU CPSW2G NUSS and 66AK2E/L NUSS allmulti setting does not allow
unregistered mcast packets to pass.
This happens, because ALE VLAN entries on these SoCs do not contain port
masks for reg/unreg mcast packets, but instead store indexes of
ALE_VLAN_MASK_MUXx_REG registers which intended for store port masks for
reg/unreg mcast packets.
This path was missed by commit e2e79d2343f9 ("net: ethernet: ti: ale: fix
seeing unreg mcast packets with promisc and allmulti disabled").
Hence, fix it by taking into account ALE type in cpsw_ale_set_allmulti().
Fixes: e2e79d2343f9 ("net: ethernet: ti: ale: fix seeing unreg mcast packets with promisc and allmulti disabled") Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init
The ALE parameters structure is created on stack, so it has to be reset
before passing to cpsw_ale_create() to avoid garbage values.
Fixes: e7364a21077b ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver") Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sat, 13 Jun 2020 20:43:56 +0000 (13:43 -0700)]
Merge tag '5.8-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull more cifs updates from Steve French:
"12 cifs/smb3 fixes, 2 for stable.
- add support for idsfromsid on create and chgrp/chown allowing
ability to save owner information more naturally for some workloads
- improve query info (getattr) when SMB3.1.1 posix extensions are
negotiated by using new query info level"
* tag '5.8-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
smb3: Add debug message for new file creation with idsfromsid mount option
cifs: fix chown and chgrp when idsfromsid mount option enabled
smb3: allow uid and gid owners to be set on create with idsfromsid mount option
smb311: Add tracepoints for new compound posix query info
smb311: add support for using info level for posix extensions query
smb311: Add support for lookup with posix extensions query info
smb311: Add support for SMB311 query info (non-compounded)
SMB311: Add support for query info using posix extensions (level 100)
smb3: add indatalen that can be a non-zero value to calculation of credit charge in smb2 ioctl
smb3: fix typo in mount options displayed in /proc/mounts
cifs: Add get_security_type_str function to return sec type.
smb3: extend fscache mount volume coherency check
Linus Torvalds [Sat, 13 Jun 2020 20:32:40 +0000 (13:32 -0700)]
doc: don't use deprecated "---help---" markers in target docs
I'm not convinced the script makes useful automaed help lines anyway,
but since we're trying to deprecate the use of "---help---" in Kconfig
files, let's fix the doc example code too.
See commit 5b1363814ab1 ("treewide: replace '---help---' in Kconfig
files with 'help'")
Linus Torvalds [Sat, 13 Jun 2020 20:29:16 +0000 (13:29 -0700)]
Merge tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull more Kbuild updates from Masahiro Yamada:
- fix build rules in binderfs sample
- fix build errors when Kbuild recurses to the top Makefile
- covert '---help---' in Kconfig to 'help'
* tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
treewide: replace '---help---' in Kconfig files with 'help'
kbuild: fix broken builds because of GZIP,BZIP2,LZOP variables
samples: binderfs: really compile this sample and fix build issues
Linus Torvalds [Sat, 13 Jun 2020 20:17:49 +0000 (13:17 -0700)]
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull more SCSI updates from James Bottomley:
"This is the set of changes collected since just before the merge
window opened. It's mostly minor fixes in drivers.
The one non-driver set is the three optical disk (sr) changes where
two are error path fixes and one is a helper conversion.
The big driver change is the hpsa compat_alloc_userspace rework by Al
so he can kill the remaining user. This has been tested and acked by
the maintainer"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (21 commits)
scsi: acornscsi: Fix an error handling path in acornscsi_probe()
scsi: storvsc: Remove memset before memory freeing in storvsc_suspend()
scsi: cxlflash: Remove an unnecessary NULL check
scsi: ibmvscsi: Don't send host info in adapter info MAD after LPM
scsi: sr: Fix sr_probe() missing deallocate of device minor
scsi: sr: Fix sr_probe() missing mutex_destroy
scsi: st: Convert convert get_user_pages() --> pin_user_pages()
scsi: target: Rename target_setup_cmd_from_cdb() to target_cmd_parse_cdb()
scsi: target: Fix NULL pointer dereference
scsi: target: Initialize LUN in transport_init_se_cmd()
scsi: target: Factor out a new helper, target_cmd_init_cdb()
scsi: hpsa: hpsa_ioctl(): Tidy up a bit
scsi: hpsa: Get rid of compat_alloc_user_space()
scsi: hpsa: Don't bother with vmalloc for BIG_IOCTL_Command_struct
scsi: hpsa: Lift {BIG_,}IOCTL_Command_struct copy{in,out} into hpsa_ioctl()
scsi: ufs: Remove redundant urgent_bkop_lvl initialization
scsi: ufs: Don't update urgent bkops level when toggling auto bkops
scsi: qedf: Remove redundant initialization of variable rc
scsi: mpt3sas: Fix memset() in non-RDPQ mode
scsi: iscsi: Fix reference count leak in iscsi_boot_create_kobj
...
Linus Torvalds [Sat, 13 Jun 2020 20:12:38 +0000 (13:12 -0700)]
Merge branch 'i2c/for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c updates from Wolfram Sang:
"I2C has quite some patches for you this time. I hope it is the move to
per-driver-maintainers which is now showing results. We will see.
The big news is two new drivers (Nuvoton NPCM and Qualcomm CCI),
larger refactoring of the Designware, Tegra, and PXA drivers, the
Cadence driver supports being a slave now, and there is support to
instanciate SPD eeproms for well-known cases (which will be
user-visible because the i801 driver supports it), and some
devm_platform_ioremap_resource() conversions which blow up the
diffstat.
Note that I applied the Nuvoton driver quite late, so some minor fixup
patches arrived during the merge window. I chose to apply them right
away because they were trivial"
* 'i2c/for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (109 commits)
i2c: Drop stray comma in MODULE_AUTHOR statements
i2c: npcm7xx: npcm_i2caddr[] can be static
MAINTAINERS: npcm7xx: Add maintainer for Nuvoton NPCM BMC
i2c: npcm7xx: Fix a couple of error codes in probe
i2c: icy: Fix build with CONFIG_AMIGA_PCMCIA=n
i2c: npcm7xx: Remove unnecessary parentheses
i2c: npcm7xx: Add support for slave mode for Nuvoton
i2c: npcm7xx: Add Nuvoton NPCM I2C controller driver
dt-bindings: i2c: npcm7xx: add NPCM I2C controller
i2c: pxa: don't error out if there's no pinctrl
i2c: add 'single-master' property to generic bindings
i2c: designware: Add Baikal-T1 System I2C support
i2c: designware: Move reg-space remapping into a dedicated function
i2c: designware: Retrieve quirk flags as early as possible
i2c: designware: Convert driver to using regmap API
i2c: designware: Discard Cherry Trail model flag
i2c: designware: Add Baytrail sem config DW I2C platform dependency
i2c: designware: slave: Set DW I2C core module dependency
i2c: designware: Use `-y` to build multi-object modules
dt-bindings: i2c: dw: Add Baikal-T1 SoC I2C controller
...
Linus Torvalds [Sat, 13 Jun 2020 20:09:38 +0000 (13:09 -0700)]
Merge tag 'media/v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull more media updates from Mauro Carvalho Chehab:
- a set of atomisp patches. They remove several abstraction layers, and
fixes clang and gcc warnings (that were hidden via some macros that
were disabling 4 or 5 types of warnings there). There are also some
important fixes and sensor auto-detection on newer BIOSes via ACPI
_DCM tables.
- some fixes
* tag 'media/v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (95 commits)
media: rkvdec: Fix H264 scaling list order
media: v4l2-ctrls: Unset correct HEVC loop filter flag
media: videobuf2-dma-contig: fix bad kfree in vb2_dma_contig_clear_max_seg_size
media: v4l2-subdev.rst: correct information about v4l2 events
media: s5p-mfc: Properly handle dma_parms for the allocated devices
media: medium: cec: Make MEDIA_CEC_SUPPORT default to n if !MEDIA_SUPPORT
media: cedrus: Implement runtime PM
media: cedrus: Program output format during each run
media: atomisp: improve ACPI/DMI detection logs
media: Revert "media: atomisp: add Asus Transform T101HA ACPI vars"
media: Revert "media: atomisp: Add some ACPI detection info"
media: atomisp: improve sensor detection code to use _DSM table
media: atomisp: get rid of an iomem abstraction layer
media: atomisp: get rid of a string_support.h abstraction layer
media: atomisp: use strscpy() instead of less secure variants
media: atomisp: set DFS to MAX if sensor doesn't report fps
media: atomisp: use different dfs failed messages
media: atomisp: change the detection of ISP2401 at runtime
media: atomisp: use macros from intel-family.h
media: atomisp: don't set hpll_freq twice with different values
...
Linus Torvalds [Sat, 13 Jun 2020 20:04:36 +0000 (13:04 -0700)]
Merge tag 'libnvdimm-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"Small collection of cleanups to rework usage of ->queuedata and the
GUID api"
* tag 'libnvdimm-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
nvdimm/pmem: stop using ->queuedata
nvdimm/btt: stop using ->queuedata
nvdimm/blk: stop using ->queuedata
libnvdimm: Replace guid_copy() with import_guid() where it makes sense
Linus Torvalds [Sat, 13 Jun 2020 19:44:30 +0000 (12:44 -0700)]
Merge tag 'iomap-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap fix from Darrick Wong:
"A single iomap bug fix for a variable type mistake on 32-bit
architectures, fixing an integer overflow problem in the unshare
actor"
* tag 'iomap-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: Fix unsharing of an extent >2GB on a 32-bit machine
Linus Torvalds [Sat, 13 Jun 2020 17:55:29 +0000 (10:55 -0700)]
Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM fixes from Russell King:
- fix for "hex" Kconfig default to use 0x0 rather than 0 to allow these
to be removed from defconfigs
- fix from Ard Biesheuvel for EFI HYP mode booting
* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 8985/1: efi/decompressor: deal with HYP mode boot gracefully
ARM: 8984/1: Kconfig: set default ZBOOT_ROM_TEXT/BSS value to 0x0
Linus Torvalds [Sat, 13 Jun 2020 17:51:29 +0000 (10:51 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mattst88/alpha
Pull alpha updates from Matt Turner:
"A few changes for alpha. They're mostly small janitorial fixes but
there's also a build fix and most notably a patch from Mikulas that
fixes a hang on boot on the Avanti platform, which required quite a
bit of work and review"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mattst88/alpha:
alpha: Fix build around srm_sysrq_reboot_op
alpha: c_next should increase position index
alpha: Replace sg++ with sg = sg_next(sg)
alpha: fix memory barriers so that they conform to the specification
alpha: remove unneeded semicolon in sys_eiger.c
alpha: remove unneeded semicolon in osf_sys.c
alpha: Replace strncmp with str_has_prefix
alpha: fix rtc port ranges
alpha: Kconfig: pedantic formatting
Linus Torvalds [Sat, 13 Jun 2020 17:21:00 +0000 (10:21 -0700)]
Merge tag 'ras-core-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Thomas Gleixner:
"RAS updates from Borislav Petkov:
- Unmap a whole guest page if an MCE is encountered in it to avoid
follow-on MCEs leading to the guest crashing, by Tony Luck.
This change collided with the entry changes and the merge
resolution would have been rather unpleasant. To avoid that the
entry branch was merged in before applying this. The resulting code
did not change over the rebase.
- AMD MCE error thresholding machinery cleanup and hotplug
sanitization, by Thomas Gleixner.
- Change the MCE notifiers to denote whether they have handled the
error and not break the chain early by returning NOTIFY_STOP, thus
giving the opportunity for the later handlers in the chain to see
it. By Tony Luck.
- Add AMD family 0x17, models 0x60-6f support, by Alexander Monakov.
- Last but not least, the usual round of fixes and improvements"
* tag 'ras-core-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/mce/dev-mcelog: Fix -Wstringop-truncation warning about strncpy()
x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned
EDAC/amd64: Add AMD family 17h model 60h PCI IDs
hwmon: (k10temp) Add AMD family 17h model 60h PCI match
x86/amd_nb: Add AMD family 17h model 60h PCI IDs
x86/mcelog: Add compat_ioctl for 32-bit mcelog support
x86/mce: Drop bogus comment about mce.kflags
x86/mce: Fixup exception only for the correct MCEs
EDAC: Drop the EDAC report status checks
x86/mce: Add mce=print_all option
x86/mce: Change default MCE logger to check mce->kflags
x86/mce: Fix all mce notifiers to update the mce->kflags bitmask
x86/mce: Add a struct mce.kflags field
x86/mce: Convert the CEC to use the MCE notifier
x86/mce: Rename "first" function as "early"
x86/mce/amd, edac: Remove report_gart_errors
x86/mce/amd: Make threshold bank setting hotplug robust
x86/mce/amd: Cleanup threshold device remove path
x86/mce/amd: Straighten CPU hotplug path
x86/mce/amd: Sanitize thresholding device creation hotplug path
...
Linus Torvalds [Sat, 13 Jun 2020 17:05:47 +0000 (10:05 -0700)]
Merge tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 entry updates from Thomas Gleixner:
"The x86 entry, exception and interrupt code rework
This all started about 6 month ago with the attempt to move the Posix
CPU timer heavy lifting out of the timer interrupt code and just have
lockless quick checks in that code path. Trivial 5 patches.
This unearthed an inconsistency in the KVM handling of task work and
the review requested to move all of this into generic code so other
architectures can share.
Valid request and solved with another 25 patches but those unearthed
inconsistencies vs. RCU and instrumentation.
Digging into this made it obvious that there are quite some
inconsistencies vs. instrumentation in general. The int3 text poke
handling in particular was completely unprotected and with the batched
update of trace events even more likely to expose to endless int3
recursion.
In parallel the RCU implications of instrumenting fragile entry code
came up in several discussions.
The conclusion of the x86 maintainer team was to go all the way and
make the protection against any form of instrumentation of fragile and
dangerous code pathes enforcable and verifiable by tooling.
A first batch of preparatory work hit mainline with commit fc8ae6d1ef55 ("Pull x86 entry code updates from Thomas Gleixner")
That (almost) full solution introduced a new code section
'.noinstr.text' into which all code which needs to be protected from
instrumentation of all sorts goes into. Any call into instrumentable
code out of this section has to be annotated. objtool has support to
validate this.
Kprobes now excludes this section fully which also prevents BPF from
fiddling with it and all 'noinstr' annotated functions also keep
ftrace off. The section, kprobes and objtool changes are already
merged.
The major changes coming with this are:
- Preparatory cleanups
- Annotating of relevant functions to move them into the
noinstr.text section or enforcing inlining by marking them
__always_inline so the compiler cannot misplace or instrument
them.
- Splitting and simplifying the idtentry macro maze so that it is
now clearly separated into simple exception entries and the more
interesting ones which use interrupt stacks and have the paranoid
handling vs. CR3 and GS.
- Move quite some of the low level ASM functionality into C code:
- enter_from and exit to user space handling. The ASM code now
calls into C after doing the really necessary ASM handling and
the return path goes back out without bells and whistels in
ASM.
- exception entry/exit got the equivivalent treatment
- move all IRQ tracepoints from ASM to C so they can be placed as
appropriate which is especially important for the int3
recursion issue.
- Consolidate the declaration and definition of entry points between
32 and 64 bit. They share a common header and macros now.
- Remove the extra device interrupt entry maze and just use the
regular exception entry code.
- All ASM entry points except NMI are now generated from the shared
header file and the corresponding macros in the 32 and 64 bit
entry ASM.
- The C code entry points are consolidated as well with the help of
DEFINE_IDTENTRY*() macros. This allows to ensure at one central
point that all corresponding entry points share the same
semantics. The actual function body for most entry points is in an
instrumentable and sane state.
There are special macros for the more sensitive entry points, e.g.
INT3 and of course the nasty paranoid #NMI, #MCE, #DB and #DF.
They allow to put the whole entry instrumentation and RCU handling
into safe places instead of the previous pray that it is correct
approach.
- The INT3 text poke handling is now completely isolated and the
recursion issue banned. Aside of the entry rework this required
other isolation work, e.g. the ability to force inline bsearch.
- Prevent #DB on fragile entry code, entry relevant memory and
disable it on NMI, #MC entry, which allowed to get rid of the
nested #DB IST stack shifting hackery.
- A few other cleanups and enhancements which have been made
possible through this and already merged changes, e.g.
consolidating and further restricting the IDT code so the IDT
table becomes RO after init which removes yet another popular
attack vector
- About 680 lines of ASM maze are gone.
There are a few open issues:
- An escape out of the noinstr section in the MCE handler which needs
some more thought but under the aspect that MCE is a complete
trainwreck by design and the propability to survive it is low, this
was not high on the priority list.
- Paravirtualization
When PV is enabled then objtool complains about a bunch of indirect
calls out of the noinstr section. There are a few straight forward
ways to fix this, but the other issues vs. general correctness were
more pressing than parawitz.
- KVM
KVM is inconsistent as well. Patches have been posted, but they
have not yet been commented on or picked up by the KVM folks.
- IDLE
Pretty much the same problems can be found in the low level idle
code especially the parts where RCU stopped watching. This was
beyond the scope of the more obvious and exposable problems and is
on the todo list.
The lesson learned from this brain melting exercise to morph the
evolved code base into something which can be validated and understood
is that once again the violation of the most important engineering
principle "correctness first" has caused quite a few people to spend
valuable time on problems which could have been avoided in the first
place. The "features first" tinkering mindset really has to stop.
With that I want to say thanks to everyone involved in contributing to
this effort. Special thanks go to the following people (alphabetical
order): Alexandre Chartre, Andy Lutomirski, Borislav Petkov, Brian
Gerst, Frederic Weisbecker, Josh Poimboeuf, Juergen Gross, Lai
Jiangshan, Macro Elver, Paolo Bonzin,i Paul McKenney, Peter Zijlstra,
Vitaly Kuznetsov, and Will Deacon"
* tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (142 commits)
x86/entry: Force rcu_irq_enter() when in idle task
x86/entry: Make NMI use IDTENTRY_RAW
x86/entry: Treat BUG/WARN as NMI-like entries
x86/entry: Unbreak __irqentry_text_start/end magic
x86/entry: __always_inline CR2 for noinstr
lockdep: __always_inline more for noinstr
x86/entry: Re-order #DB handler to avoid *SAN instrumentation
x86/entry: __always_inline arch_atomic_* for noinstr
x86/entry: __always_inline irqflags for noinstr
x86/entry: __always_inline debugreg for noinstr
x86/idt: Consolidate idt functionality
x86/idt: Cleanup trap_init()
x86/idt: Use proper constants for table size
x86/idt: Add comments about early #PF handling
x86/idt: Mark init only functions __init
x86/entry: Rename trace_hardirqs_off_prepare()
x86/entry: Clarify irq_{enter,exit}_rcu()
x86/entry: Remove DBn stacks
x86/entry: Remove debug IDT frobbing
x86/entry: Optimize local_db_save() for virt
...
Masahiro Yamada [Sat, 13 Jun 2020 16:50:22 +0000 (01:50 +0900)]
treewide: replace '---help---' in Kconfig files with 'help'
Since commit d0ebe97b779f ("checkpatch: kconfig: prefer 'help' over
'---help---'"), the number of '---help---' has been gradually
decreasing, but there are still more than 2400 instances.
This commit finishes the conversion. While I touched the lines,
I also fixed the indentation.
There are a variety of indentation styles found.
a) 4 spaces + '---help---'
b) 7 spaces + '---help---'
c) 8 spaces + '---help---'
d) 1 space + 1 tab + '---help---'
e) 1 tab + '---help---' (correct indentation)
f) 1 tab + 1 space + '---help---'
g) 1 tab + 2 spaces + '---help---'
In order to convert all of them to 1 tab + 'help', I ran the
following commend:
$ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
Linus Torvalds [Sat, 13 Jun 2020 16:56:21 +0000 (09:56 -0700)]
Merge tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull notification queue from David Howells:
"This adds a general notification queue concept and adds an event
source for keys/keyrings, such as linking and unlinking keys and
changing their attributes.
Thanks to Debarshi Ray, we do have a pull request to use this to fix a
problem with gnome-online-accounts - as mentioned last time:
Without this, g-o-a has to constantly poll a keyring-based kerberos
cache to find out if kinit has changed anything.
[ There are other notification pending: mount/sb fsinfo notifications
for libmount that Karel Zak and Ian Kent have been working on, and
Christian Brauner would like to use them in lxc, but let's see how
this one works first ]
LSM hooks are included:
- A set of hooks are provided that allow an LSM to rule on whether or
not a watch may be set. Each of these hooks takes a different
"watched object" parameter, so they're not really shareable. The
LSM should use current's credentials. [Wanted by SELinux & Smack]
- A hook is provided to allow an LSM to rule on whether or not a
particular message may be posted to a particular queue. This is
given the credentials from the event generator (which may be the
system) and the watch setter. [Wanted by Smack]
I've provided SELinux and Smack with implementations of some of these
hooks.
WHY
===
Key/keyring notifications are desirable because if you have your
kerberos tickets in a file/directory, your Gnome desktop will monitor
that using something like fanotify and tell you if your credentials
cache changes.
However, we also have the ability to cache your kerberos tickets in
the session, user or persistent keyring so that it isn't left around
on disk across a reboot or logout. Keyrings, however, cannot currently
be monitored asynchronously, so the desktop has to poll for it - not
so good on a laptop. This facility will allow the desktop to avoid the
need to poll.
DESIGN DECISIONS
================
- The notification queue is built on top of a standard pipe. Messages
are effectively spliced in. The pipe is opened with a special flag:
pipe2(fds, O_NOTIFICATION_PIPE);
The special flag has the same value as O_EXCL (which doesn't seem
like it will ever be applicable in this context)[?]. It is given up
front to make it a lot easier to prohibit splice&co from accessing
the pipe.
[?] Should this be done some other way? I'd rather not use up a new
O_* flag if I can avoid it - should I add a pipe3() system call
instead?
Messages are then read out of the pipe using read().
- It should be possible to allow write() to insert data into the
notification pipes too, but this is currently disabled as the
kernel has to be able to insert messages into the pipe *without*
holding pipe->mutex and the code to make this work needs careful
auditing.
- sendfile(), splice() and vmsplice() are disabled on notification
pipes because of the pipe->mutex issue and also because they
sometimes want to revert what they just did - but one or more
notification messages might've been interleaved in the ring.
- The kernel inserts messages with the wait queue spinlock held. This
means that pipe_read() and pipe_write() have to take the spinlock
to update the queue pointers.
- Records in the buffer are binary, typed and have a length so that
they can be of varying size.
This allows multiple heterogeneous sources to share a common
buffer; there are 16 million types available, of which I've used
just a few, so there is scope for others to be used. Tags may be
specified when a watchpoint is created to help distinguish the
sources.
- Records are filterable as types have up to 256 subtypes that can be
individually filtered. Other filtration is also available.
- Notification pipes don't interfere with each other; each may be
bound to a different set of watches. Any particular notification
will be copied to all the queues that are currently watching for it
- and only those that are watching for it.
- When recording a notification, the kernel will not sleep, but will
rather mark a queue as having lost a message if there's
insufficient space. read() will fabricate a loss notification
message at an appropriate point later.
- The notification pipe is created and then watchpoints are attached
to it, using one of:
where in both cases, fd indicates the queue and the number after is
a tag between 0 and 255.
- Watches are removed if either the notification pipe is destroyed or
the watched object is destroyed. In the latter case, a message will
be generated indicating the enforced watch removal.
Things I want to avoid:
- Introducing features that make the core VFS dependent on the
network stack or networking namespaces (ie. usage of netlink).
- Dumping all this stuff into dmesg and having a daemon that sits
there parsing the output and distributing it as this then puts the
responsibility for security into userspace and makes handling
namespaces tricky. Further, dmesg might not exist or might be
inaccessible inside a container.
- Letting users see events they shouldn't be able to see.
TESTING AND MANPAGES
====================
- The keyutils tree has a pipe-watch branch that has keyctl commands
for making use of notifications. Proposed manual pages can also be
found on this branch, though a couple of them really need to go to
the main manpages repository instead.
If the kernel supports the watching of keys, then running "make
test" on that branch will cause the testing infrastructure to spawn
a monitoring process on the side that monitors a notifications pipe
for all the key/keyring changes induced by the tests and they'll
all be checked off to make sure they happened.
- A test program is provided (samples/watch_queue/watch_test) that
can be used to monitor for keyrings, mount and superblock events.
Information on the notifications is simply logged to stdout"
* tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
smack: Implement the watch_key and post_notification hooks
selinux: Implement the watch_key security hook
keys: Make the KEY_NEED_* perms an enum rather than a mask
pipe: Add notification lossage handling
pipe: Allow buffers to be marked read-whole-or-error for notifications
Add sample notification program
watch_queue: Add a key/keyring notification facility
security: Add hooks to rule on setting a watch
pipe: Add general notification queue support
pipe: Add O_NOTIFICATION_PIPE
security: Add a hook for the point of notification insertion
uapi: General notification queue definitions
Ard Biesheuvel [Fri, 12 Jun 2020 10:21:35 +0000 (11:21 +0100)]
ARM: 8985/1: efi/decompressor: deal with HYP mode boot gracefully
EFI on ARM only supports short descriptors, and given that it mandates
that the MMU and caches are on, it is implied that booting in HYP mode
is not supported.
However, implementations of EFI exist (i.e., U-Boot) that ignore this
requirement, which is not entirely unreasonable, given that it makes
HYP mode inaccessible to the operating system.
So let's make sure that we can deal with this condition gracefully.
We already tolerate booting the EFI stub with the caches off (even
though this violates the EFI spec as well), and so we should deal
with HYP mode boot with MMU and caches either on or off.
- When the MMU and caches are on, we can ignore the HYP stub altogether,
since we can carry on executing at HYP. We do need to ensure that we
disable the MMU at HYP before entering the kernel proper.
- When the MMU and caches are off, we have to drop to SVC mode so that
we can set up the page tables using short descriptors. In this case,
we need to install the HYP stub as usual, so that we can return to HYP
mode before handing over to the kernel proper.
Tested-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Chris Packham [Tue, 9 Jun 2020 02:28:14 +0000 (03:28 +0100)]
ARM: 8984/1: Kconfig: set default ZBOOT_ROM_TEXT/BSS value to 0x0
ZBOOT_ROM_TEXT and ZBOOT_ROM_BSS are defined as 'hex' but had a default
of "0". Kconfig will helpfully expand a text entry of 0 to 0x0 but
because this is not the same as the default value it was treated as
being explicitly set when running 'make savedefconfig' so most arm
defconfigs have CONFIG_ZBOOT_ROM_TEXT=0x0 and CONFIG_ZBOOT_ROM_BSS=0x0.
Change the default to 0x0 which will mean next time the defconfigs are
re-generated the spurious config entries will be removed.
Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Joerg Roedel [Thu, 11 Jun 2020 09:11:39 +0000 (11:11 +0200)]
alpha: Fix build around srm_sysrq_reboot_op
The patch introducing the struct was probably never compile tested,
because it sets a handler with a wrong function signature. Wrap the
handler into a functions with the correct signature to fix the build.
Fixes: f28f73ff4379 ("tty/sysrq: alpha: export and use __sysrq_get_key_op()") Cc: Emil Velikov <emil.l.velikov@gmail.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Matt Turner <mattst88@gmail.com>
Mikulas Patocka [Tue, 26 May 2020 14:47:49 +0000 (10:47 -0400)]
alpha: fix memory barriers so that they conform to the specification
The commits b01850ba70a7 and 21c9c24e950c broke boot on the Alpha Avanti
platform. The patches move memory barriers after a write before the write.
The result is that if there's iowrite followed by ioread, there is no
barrier between them.
The Alpha architecture allows reordering of the accesses to the I/O space,
and the missing barrier between write and read causes hang with serial
port and real time clock.
This patch makes barriers confiorm to the specification.
1. We add mb() before readX_relaxed and writeX_relaxed -
memory-barriers.txt claims that these functions must be ordered w.r.t.
each other. Alpha doesn't order them, so we need an explicit barrier.
2. We add mb() before reads from the I/O space - so that if there's a
write followed by a read, there should be a barrier between them.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: b01850ba70a7 ("alpha: io: reorder barriers to guarantee writeX() and iowriteX() ordering") Fixes: 21c9c24e950c ("alpha: io: reorder barriers to guarantee writeX() and iowriteX() ordering #2") Cc: stable@vger.kernel.org # v4.17+ Acked-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Reviewed-by: Maciej W. Rozycki <macro@linux-mips.org> Signed-off-by: Matt Turner <mattst88@gmail.com>
In commit 669951268b1b
("tracing: Use str_has_prefix() instead of using fixed sizes")
the newly introduced str_has_prefix() was used
to replace error-prone strncmp(str, const, len).
Here fix codes with the same pattern.
Signed-off-by: Chuhong Yuan <hslester96@gmail.com> Signed-off-by: Matt Turner <mattst88@gmail.com>
Andrii Nakryiko [Fri, 12 Jun 2020 19:45:04 +0000 (12:45 -0700)]
libbpf: Support pre-initializing .bss global variables
Remove invalid assumption in libbpf that .bss map doesn't have to be updated
in kernel. With addition of skeleton and memory-mapped initialization image,
.bss doesn't have to be all zeroes when BPF map is created, because user-code
might have initialized those variables from user-space.
Andrii Nakryiko [Fri, 12 Jun 2020 20:16:03 +0000 (13:16 -0700)]
tools/bpftool: Fix skeleton codegen
Remove unnecessary check at the end of codegen() routine which makes codegen()
to always fail and exit bpftool with error code. Positive value of variable
n is not an indicator of a failure.
Fixes: 7a105cad991e ("tools, bpftool: Exit on error in function codegen") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Tobias Klauser <tklauser@distanz.ch> Link: https://lore.kernel.org/bpf/20200612201603.680852-1-andriin@fb.com
Sabrina Dubroca [Wed, 10 Jun 2020 10:19:43 +0000 (12:19 +0200)]
bpf: tcp: Recv() should return 0 when the peer socket is closed
If the peer is closed, we will never get more data, so
tcp_bpf_wait_data will get stuck forever. In case we passed
MSG_DONTWAIT to recv(), we get EAGAIN but we should actually get
0.
>From man 2 recv:
RETURN VALUE
When a stream socket peer has performed an orderly shutdown, the
return value will be 0 (the traditional "end-of-file" return).
This patch makes tcp_bpf_wait_data always return 1 when the peer
socket has been shutdown. Either we have data available, and it would
have returned 1 anyway, or there isn't, in which case we'll call
tcp_recvmsg which does the right thing in this situation.
Steve French [Fri, 12 Jun 2020 19:49:47 +0000 (14:49 -0500)]
smb3: Add debug message for new file creation with idsfromsid mount option
Pavel noticed that a debug message (disabled by default) in creating the security
descriptor context could be useful for new file creation owner fields
(as we already have for the mode) when using mount parm idsfromsid.
[38120.392272] CIFS: FYI: owner S-1-5-88-1-0, group S-1-5-88-2-0
[38125.792637] CIFS: FYI: owner S-1-5-88-1-1000, group S-1-5-88-2-1000
Also cleans up a typo in a comment
Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Thomas Falcon [Fri, 12 Jun 2020 18:34:41 +0000 (13:34 -0500)]
ibmvnic: Flush existing work items before device removal
Ensure that all scheduled work items have completed before continuing
with device removal and after further event scheduling has been
halted. This patch fixes a bug where a scheduled driver reset event
is processed following device removal.
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Fri, 12 Jun 2020 07:16:55 +0000 (00:16 -0700)]
genetlink: clean up family attributes allocations
genl_family_rcv_msg_attrs_parse() and genl_family_rcv_msg_attrs_free()
take a boolean parameter to determine whether allocate/free the family
attrs. This is unnecessary as we can just check family->parallel_ops.
More importantly, callers would not need to worry about pairing these
parameters correctly after this patch.
And this fixes a memory leak, as after commit d975c14e3ea9
("genetlink: fix memory leaks in genl_family_rcv_msg_dumpit()")
we call genl_family_rcv_msg_attrs_parse() for both parallel and
non-parallel cases.
Fixes: d975c14e3ea9 ("genetlink: fix memory leaks in genl_family_rcv_msg_dumpit()") Reported-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Tested-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 12 Jun 2020 19:38:18 +0000 (12:38 -0700)]
Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull proc fix from Eric Biederman:
"Much to my surprise syzbot found a very old bug in proc that the
recent changes made easier to reproce. This bug is subtle enough it
looks like it fooled everyone who should know better"
* 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Use new_inode not new_inode_pseudo
Thomas Gleixner [Fri, 12 Jun 2020 13:55:00 +0000 (15:55 +0200)]
x86/entry: Force rcu_irq_enter() when in idle task
The idea of conditionally calling into rcu_irq_enter() only when RCU is
not watching turned out to be not completely thought through.
Paul noticed occasional premature end of grace periods in RCU torture
testing. Bisection led to the commit which made the invocation of
rcu_irq_enter() conditional on !rcu_is_watching().
It turned out that this conditional breaks RCU assumptions about the idle
task when the scheduler tick happens to be a nested interrupt. Nested
interrupts can happen when the first interrupt invokes softirq processing
on return which enables interrupts.
If that nested tick interrupt does not invoke rcu_irq_enter() then the
RCU's irq-nesting checks will believe that this interrupt came directly
from idle, which will cause RCU to report a quiescent state. Because this
interrupt instead came from a softirq handler which might have been
executing an RCU read-side critical section, this can cause the grace
period to end prematurely.
Change the condition from !rcu_is_watching() to is_idle_task(current) which
enforces that interrupts in the idle task unconditionally invoke
rcu_irq_enter() independent of the RCU state.
This is also correct vs. user mode entries in NOHZ full scenarios because
user mode entries bring RCU out of EQS and force the RCU irq nesting state
accounting to nested. As only the first interrupt can enter from user mode
a nested tick interrupt will enter from kernel mode and as the nesting
state accounting is forced to nesting it will not do anything stupid even
if rcu_irq_enter() has not been invoked.
Fixes: 5a0073233228 ("x86/entry: Provide idtentry_entry/exit_cond_rcu()") Reported-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: "Paul E. McKenney" <paulmck@kernel.org> Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org> Acked-by: Andy Lutomirski <luto@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/87wo4cxubv.fsf@nanos.tec.linutronix.de
Linus Torvalds [Fri, 12 Jun 2020 19:24:42 +0000 (12:24 -0700)]
Merge tag 'pwm/for-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm
Pull pwm updates from Thierry Reding:
"Nothing too exciting for this cycle. A couple of fixes across the
board, and Lee volunteered to help with patch review"
* tag 'pwm/for-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm:
pwm: Add missing "CONFIG_" prefix
MAINTAINERS: Add Lee Jones as reviewer for the PWM subsystem
pwm: imx27: Fix rounding behavior
pwm: rockchip: Simplify rockchip_pwm_get_state()
pwm: img: Call pm_runtime_put() in pm_runtime_get_sync() failed case
pwm: tegra: Support dynamic clock frequency configuration
pwm: jz4740: Add support for the JZ4725B
pwm: jz4740: Make PWM start with the active part
pwm: jz4740: Enhance precision in calculation of duty cycle
pwm: jz4740: Drop dependency on MACH_INGENIC
pwm: lpss: Fix get_state runtime-pm reference handling
pwm: sun4i: Support direct clock output on Allwinner A64
pwm: Add support for Azoteq IQS620A PWM generator
dt-bindings: pwm: rcar: add r8a77961 support
pwm: Add missing '\n' in log messages
Linus Torvalds [Fri, 12 Jun 2020 19:19:13 +0000 (12:19 -0700)]
Merge tag 'iommu-drivers-move-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu driver directory structure cleanup from Joerg Roedel:
"Move the Intel and AMD IOMMU drivers into their own subdirectory.
Both drivers consist of several files by now and giving them their own
directory unclutters the IOMMU top-level directory a bit"
* tag 'iommu-drivers-move-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/vt-d: Move Intel IOMMU driver into subdirectory
iommu/amd: Move AMD IOMMU driver into subdirectory
Linus Torvalds [Fri, 12 Jun 2020 19:13:36 +0000 (12:13 -0700)]
Merge tag 'printk-for-5.8-kdb-nmi' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk fix from Petr Mladek:
"One more printk change for 5.8: make sure that messages printed from
KDB context are redirected to KDB console handlers. It did not work
when KDB interrupted NMI or printk_safe contexts.
Arm people started hitting this problem more often recently. I forgot
to add the fix into the previous pull request by mistake"
* tag 'printk-for-5.8-kdb-nmi' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk/kdb: Redirect printk messages into kdb in any context
Recently syzbot reported that unmounting proc when there is an ongoing
inotify watch on the root directory of proc could result in a use
after free when the watch is removed after the unmount of proc
when the watcher exits.
Commit 55878bfe1385 ("proc: Remove the now unnecessary internal mount
of proc") made it easier to unmount proc and allowed syzbot to see the
problem, but looking at the code it has been around for a long time.
Looking at the code the fsnotify watch should have been removed by
fsnotify_sb_delete in generic_shutdown_super. Unfortunately the inode
was allocated with new_inode_pseudo instead of new_inode so the inode
was not on the sb->s_inodes list. Which prevented
fsnotify_unmount_inodes from finding the inode and removing the watch
as well as made it so the "VFS: Busy inodes after unmount" warning
could not find the inodes to warn about them.
Make all of the inodes in proc visible to generic_shutdown_super,
and fsnotify_sb_delete by using new_inode instead of new_inode_pseudo.
The only functional difference is that new_inode places the inodes
on the sb->s_inodes list.
I wrote a small test program and I can verify that without changes it
can trigger this issue, and by replacing new_inode_pseudo with
new_inode the issues goes away.
Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/000000000000d788c905a7dfa3f4@google.com Reported-by: syzbot+7d2debdcdb3cb93c1e5e@syzkaller.appspotmail.com Fixes: e1db54deb96e ("proc: Implement /proc/thread-self to point at the directory of the current thread") Fixes: ebc401282106 ("procfs: switch /proc/self away from proc_dir_entry") Fixes: 667a365bb0c4 ("vfs,proc: guarantee unique inodes in /proc") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Linus Torvalds [Fri, 12 Jun 2020 18:56:43 +0000 (11:56 -0700)]
Merge tag 'devicetree-fixes-for-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull Devicetree fixes from Rob Herring:
- Another round of whack-a-mole removing 'allOf', redundant cases of
'maxItems' and incorrect 'reg' sizes
- Fix support for yaml.h in non-standard paths
* tag 'devicetree-fixes-for-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
dt-bindings: Remove redundant 'maxItems'
dt-bindings: Fix more incorrect 'reg' property sizes in examples
dt-bindings: phy: qcom: Fix missing 'ranges' and example addresses
dt-bindings: Remove more cases of 'allOf' containing a '$ref'
scripts/dtc: use pkg-config to include <yaml.h> in non-standard path
Steve French [Fri, 12 Jun 2020 15:36:37 +0000 (10:36 -0500)]
cifs: fix chown and chgrp when idsfromsid mount option enabled
idsfromsid was ignored in chown and chgrp causing it to fail
when upcalls were not configured for lookup. idsfromsid allows
mapping users when setting user or group ownership using
"special SID" (reserved for this). Add support for chmod and chgrp
when idsfromsid mount option is enabled.
Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Steve French [Fri, 12 Jun 2020 14:25:21 +0000 (09:25 -0500)]
smb3: allow uid and gid owners to be set on create with idsfromsid mount option
Currently idsfromsid mount option allows querying owner information from the
special sids used to represent POSIX uids and gids but needed changes to
populate the security descriptor context with the owner information when
idsfromsid mount option was used.
Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Linus Torvalds [Fri, 12 Jun 2020 18:05:52 +0000 (11:05 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more KVM updates from Paolo Bonzini:
"The guest side of the asynchronous page fault work has been delayed to
5.9 in order to sync with Thomas's interrupt entry rework, but here's
the rest of the KVM updates for this merge window.
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (62 commits)
KVM: x86: do not pass poisoned hva to __kvm_set_memory_region
KVM: selftests: fix sync_with_host() in smm_test
KVM: async_pf: Inject 'page ready' event only if 'page not present' was previously injected
KVM: async_pf: Cleanup kvm_setup_async_pf()
kvm: i8254: remove redundant assignment to pointer s
KVM: x86: respect singlestep when emulating instruction
KVM: selftests: Don't probe KVM_CAP_HYPERV_ENLIGHTENED_VMCS when nested VMX is unsupported
KVM: selftests: do not substitute SVM/VMX check with KVM_CAP_NESTED_STATE check
KVM: nVMX: Consult only the "basic" exit reason when routing nested exit
KVM: arm64: Move hyp_symbol_addr() to kvm_asm.h
KVM: arm64: Synchronize sysreg state on injecting an AArch32 exception
KVM: arm64: Make vcpu_cp1x() work on Big Endian hosts
KVM: arm64: Remove host_cpu_context member from vcpu structure
KVM: arm64: Stop sparse from moaning at __hyp_this_cpu_ptr
KVM: arm64: Handle PtrAuth traps early
KVM: x86: Unexport x86_fpu_cache and make it static
KVM: selftests: Ignore KVM 5-level paging support for VM_MODE_PXXV48_4K
KVM: arm64: Save the host's PtrAuth keys in non-preemptible context
KVM: arm64: Stop save/restoring ACTLR_EL1
KVM: arm64: Add emulation for 32bit guests accessing ACTLR2
...
Linus Torvalds [Fri, 12 Jun 2020 18:00:45 +0000 (11:00 -0700)]
Merge tag 'for-linus-5.8b-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
- several smaller cleanups
- a fix for a Xen guest regression with CPU offlining
- a small fix in the xen pvcalls backend driver
- an update of MAINTAINERS
* tag 'for-linus-5.8b-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
MAINTAINERS: Update PARAVIRT_OPS_INTERFACE and VMWARE_HYPERVISOR_INTERFACE
xen/pci: Get rid of verbose_request and use dev_dbg() instead
xenbus: Use dev_printk() when possible
xen-pciback: Use dev_printk() when possible
xen: enable BALLOON_MEMORY_HOTPLUG by default
xen: expand BALLOON_MEMORY_HOTPLUG description
xen/pvcalls: Make pvcalls_back_global static
xen/cpuhotplug: Fix initial CPU offlining for PV(H) guests
xen-platform: Constify dev_pm_ops
xen/pvcalls-back: test for errors when calling backend_connect()
Steve French [Fri, 12 Jun 2020 03:43:01 +0000 (22:43 -0500)]
smb311: add support for using info level for posix extensions query
Adds calls to the newer info level for query info using SMB3.1.1 posix extensions.
The remaining two places that call the older query info (non-SMB3.1.1 POSIX)
require passing in the fid and can be updated in a later patch.
Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Thomas Gleixner [Fri, 12 Jun 2020 12:02:27 +0000 (14:02 +0200)]
x86/entry: Make NMI use IDTENTRY_RAW
For no reason other than beginning brainmelt, IDTENTRY_NMI was mapped to
IDTENTRY_IST.
This is not a problem on 64bit because the IST default entry point maps to
IDTENTRY_RAW which does not any entry handling. The surplus function
declaration for the noist C entry point is unused and as there is no ASM
code emitted for NMI this went unnoticed.
On 32bit IDTENTRY_IST maps to a regular IDTENTRY which does the normal
entry handling. That is clearly the wrong thing to do for NMI.
Map it to IDTENTRY_RAW to unbreak it. The IDTENTRY_NMI mapping needs to
stay to avoid emitting ASM code.